Arxiv Papers of Today

Generated: 2025-10-10 16:29:41 (UTC+8); arXiv release time: 2025-10-10 20:00 EDT (2025-10-11 08:00 UTC+8)

61 relevant papers today

Keyword: reinforcement learning

ConCuR: Conciseness Makes State-of-the-Art Kernel Generation

Authors: Lingcheng Kong, Jiateng Wei, Hanzhang Shen, Huan Wang
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2510.07356
Pdf link: https://arxiv.org/pdf/2510.07356
Abstract GPU kernel generation by LLMs has recently experienced rapid development, leveraging test-time scaling and reinforcement learning techniques. However, a key challenge for kernel generation is the scarcity of high-quality data, as most high-quality kernels are proprietary and not open-source. This challenge prevents us from leveraging supervised fine-tuning to align LLMs to the kernel generation task. To address this challenge, we develop a pipeline that generates and curates high-quality CUDA kernels with reasoning traces, motivated by a critical observation that concise yet informative reasoning traces result in robust generation of high-performance kernels. Using this pipeline, we construct our dataset ConCuR and introduce our model KernelCoder, which is the first model trained on a curated dataset consisting of PyTorch, reasoning, and CUDA kernel pairs, to our knowledge. In the KernelBench setup, our model achieves significant improvements over the existing top-performing model, QwQ-32B, and outperforms all open-source models fine-tuned for kernel generation, as well as frontier models such as DeepSeek-V3.1-Think and Claude-4-sonnet. Finally, we show that the average reasoning length can serve as a metric to assess the difficulty of kernel generation tasks. The observations, metrics, and our data collection and curation pipeline can help obtain better data in the kernel generation task in the future.

L2M-AID: Autonomous Cyber-Physical Defense by Fusing Semantic Reasoning of Large Language Models with Multi-Agent Reinforcement Learning (Preprint)

Authors: Tianxiang Xu, Zhichao Wen, Xinyu Zhao, Jun Wang, Yan Li, Chang Liu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.07363
Pdf link: https://arxiv.org/pdf/2510.07363
Abstract The increasing integration of Industrial IoT (IIoT) exposes critical cyber-physical systems to sophisticated, multi-stage attacks that elude traditional defenses lacking contextual awareness. This paper introduces L2M-AID, a novel framework for Autonomous Industrial Defense using LLM-empowered, Multi-agent reinforcement learning. L2M-AID orchestrates a team of collaborative agents, each driven by a Large Language Model (LLM), to achieve adaptive and resilient security. The core innovation lies in the deep fusion of two AI paradigms: we leverage an LLM as a semantic bridge to translate vast, unstructured telemetry into a rich, contextual state representation, enabling agents to reason about adversary intent rather than merely matching patterns. This semantically-aware state empowers a Multi-Agent Reinforcement Learning (MARL) algorithm, MAPPO, to learn complex cooperative strategies. The MARL reward function is uniquely engineered to balance security objectives (threat neutralization) with operational imperatives, explicitly penalizing actions that disrupt physical process stability. To validate our approach, we conduct extensive experiments on the benchmark SWaT dataset and a novel synthetic dataset generated based on the MITRE ATT&CK for ICS framework. Results demonstrate that L2M-AID significantly outperforms traditional IDS, deep learning anomaly detectors, and single-agent RL baselines across key metrics, achieving a 97.2% detection rate while reducing false positives by over 80% and improving response times by a factor of four. Crucially, it demonstrates superior performance in maintaining physical process stability, presenting a robust new paradigm for securing critical national infrastructure.
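
The reward balancing described in this abstract (threat neutralization versus penalties for disturbing the physical process) can be illustrated with a minimal sketch. The field names, weights, and linear form below are hypothetical choices for illustration; the abstract does not give the paper's actual reward function.

```python
from dataclasses import dataclass


@dataclass
class StepOutcome:
    """Hypothetical summary of one defense step (not from the paper)."""
    threats_neutralized: int   # attacks stopped during this step
    false_alarms: int          # benign events wrongly flagged
    process_deviation: float   # distance of the plant from its setpoint, scaled to [0, 1]


def defense_reward(o: StepOutcome, w_sec: float = 1.0,
                   w_fp: float = 0.5, w_stab: float = 2.0) -> float:
    """Security gain minus an explicit penalty for disturbing the physical process."""
    security = w_sec * o.threats_neutralized - w_fp * o.false_alarms
    stability_penalty = w_stab * o.process_deviation
    return security - stability_penalty
```

With these assumed weights, an action that neutralizes a threat but pushes the process halfway to its deviation limit earns no net reward, so a cooperative policy is nudged toward less disruptive responses.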

Parameter-Free Federated TD Learning with Markov Noise in Heterogeneous Environments

Authors: Ankur Naskar, Gugan Thoppe, Utsav Negi, Vijay Gupta
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.07436
Pdf link: https://arxiv.org/pdf/2510.07436
Abstract Federated learning (FL) can dramatically speed up reinforcement learning by distributing exploration and training across multiple agents. It can guarantee an optimal convergence rate that scales linearly in the number of agents, i.e., a rate of $\tilde{O}(1/(NT))$, where $T$ is the iteration index and $N$ is the number of agents. However, when the training samples arise from a Markov chain, existing results on TD learning achieving this rate require the algorithm to depend on unknown problem parameters. We close this gap by proposing a two-timescale Federated Temporal Difference (FTD) learning with Polyak-Ruppert averaging. Our method provably attains the optimal $\tilde{O}(1/(NT))$ rate in both average-reward and discounted settings, offering a parameter-free FTD approach for Markovian data. Although our results are novel even in the single-agent setting, they apply to the more realistic and challenging scenario of FL with heterogeneous environments.
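
To make the ingredients named in this abstract concrete, here is a simplified sketch of federated tabular TD(0) with Polyak-Ruppert averaging on a toy two-state Markov chain. It is a single-timescale illustration, not the paper's two-timescale algorithm; the chain, the per-agent rewards, and the step-size schedule `1 / t**0.6` are all assumptions made for this example.

```python
import random


def federated_td(num_agents=4, steps=5000, gamma=0.9, seed=0):
    """Federated TD(0) on a toy 2-state chain; returns (last iterate, averaged iterate)."""
    rng = random.Random(seed)
    # Heterogeneous environments: each agent sees a slightly different reward in state 0.
    rewards = [[1.0 + 0.1 * i, 0.0] for i in range(num_agents)]
    states = [0] * num_agents
    v = [0.0, 0.0]       # shared value estimate held by the server
    v_bar = [0.0, 0.0]   # Polyak-Ruppert running average of the iterates
    for t in range(1, steps + 1):
        alpha = 1.0 / t ** 0.6   # step size chosen without problem-specific constants
        delta_sum = [0.0, 0.0]
        for i in range(num_agents):
            s = states[i]
            # Markovian sampling: stay in the current state with probability 0.7.
            s_next = s if rng.random() < 0.7 else 1 - s
            delta_sum[s] += rewards[i][s] + gamma * v[s_next] - v[s]
            states[i] = s_next
        for s in range(2):
            # Server step: average the agents' TD errors, then update.
            v[s] += alpha * delta_sum[s] / num_agents
            # Running mean over time of the server iterates.
            v_bar[s] += (v[s] - v_bar[s]) / t
    return v, v_bar
```

Calling `federated_td()` returns both the last iterate and its running average; the averaged iterate is the quantity that Polyak-Ruppert analyses typically bound.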

Reasoning by Exploration: A Unified Approach to Retrieval and Generation over Graphs

Authors: Haoyu Han, Kai Guo, Harry Shomer, Yu Wang, Yucheng Chu, Hang Li, Li Ma, Jiliang Tang
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2510.07484
Pdf link: https://arxiv.org/pdf/2510.07484
Abstract Reasoning over structured graphs remains a fundamental challenge for Large Language Models (LLMs), particularly when scaling to large graphs. Existing approaches typically follow the retrieval-augmented generation (RAG) paradigm: first retrieving subgraphs relevant to the query and then generating answers conditioned on the retrieved subgraphs. However, such two-phase pipelines often struggle to faithfully incorporate graph structure, since the generation process is ultimately constrained by the quality and completeness of the retrieved subgraph. Although many advanced retrievers have been proposed recently to mitigate this issue, they are usually tailored to the training graphs and generalize poorly to unseen graphs, which limits their practical applicability. In this work, we propose Reasoning by Exploration (RoE), a novel approach that unifies retrieval and generation by framing reasoning over graphs as a process of graph exploration. At each step, the LLM selects candidate nodes and edges to explore, gradually constructing reasoning paths and generating answers along the way. To enable effective exploration, RoE is trained in two stages: supervised fine-tuning (SFT) on gold reasoning paths, followed by reinforcement learning (RL) to enhance exploration effectiveness and generalization. Experiments on benchmark datasets demonstrate that RoE achieves substantial overall improvements over baselines, while also generalizing effectively to unseen graphs.

Reinforcement Learning-based Task Offloading in the Internet of Wearable Things

Authors: Waleed Bin Qaim, Aleksandr Ometov, Claudia Campolo, Antonella Molinaro, Elena Simona Lohan, Jari Nurmi
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.07487
Pdf link: https://arxiv.org/pdf/2510.07487
Abstract Over the years, significant contributions have been made by the research and industrial sectors to improve wearable devices towards the Internet of Wearable Things (IoWT) paradigm. However, wearables are still facing several challenges. Many stem from the limited battery power and insufficient computation resources available on wearable devices. On the other hand, with the popularity of smart wearables, there is a consistent increase in the development of new computationally intensive and latency-critical applications. In such a context, task offloading allows wearables to leverage the resources available on nearby edge devices to enhance the overall user experience. This paper proposes a framework for Reinforcement Learning (RL)-based task offloading in the IoWT. We formulate the task offloading process considering the tradeoff between energy consumption and task accomplishment time. Moreover, we model the task offloading problem as a Markov Decision Process (MDP) and utilize the Q-learning technique to enable the wearable device to make optimal task offloading decisions without prior knowledge. We evaluate the performance of the proposed framework through extensive simulations for various applications and system configurations conducted in the ns-3 network simulator. We also show how varying the main system parameters of the Q-learning algorithm affects the overall performance in terms of average task accomplishment time, average energy consumption, and percentage of tasks offloaded.
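
The MDP and Q-learning formulation in this abstract can be sketched in a few lines of tabular Q-learning. The two-state task model, the cost table, and the trade-off weight `w` are hypothetical; the paper's ns-3 setup and parameters are not reproduced here.

```python
import random

# Hypothetical costs: COST[state][action] = (energy, delay), in arbitrary units.
# state: 0 = light task, 1 = heavy task; action: 0 = run locally, 1 = offload to the edge.
COST = {0: {0: (1.0, 1.0), 1: (4.0, 4.0)},
        1: {0: (8.0, 9.0), 1: (3.0, 4.0)}}


def train_offloading_policy(steps=20000, alpha=0.05, gamma=0.9,
                            eps=0.1, w=0.5, seed=0):
    """Learn an offloading decision per task type, with no prior model of the costs."""
    rng = random.Random(seed)
    q = {s: [0.0, 0.0] for s in COST}
    s = rng.choice([0, 1])
    for _ in range(steps):
        # Epsilon-greedy action selection.
        if rng.random() < eps:
            a = rng.randrange(2)
        else:
            a = max((0, 1), key=lambda x: q[s][x])
        energy, delay = COST[s][a]
        reward = -(w * energy + (1 - w) * delay)   # energy/time trade-off
        s_next = rng.choice([0, 1])                # next task type arrives at random
        q[s][a] += alpha * (reward + gamma * max(q[s_next]) - q[s][a])
        s = s_next
    # Greedy policy: best action for each task type.
    return {state: max((0, 1), key=lambda x: q[state][x]) for state in q}
```

Under this assumed cost table, the learned policy runs light tasks locally and offloads heavy ones, which is the kind of decision the abstract describes the wearable making.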

Expanding the Action Space of LLMs to Reason Beyond Language

Authors: Zhongqi Yue, Weishi Wang, Yundaichuan Zhan, Juncheng Li, Daniel Dahlmeier, Fredrik D. Johansson
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.07581
Pdf link: https://arxiv.org/pdf/2510.07581
Abstract Large Language Models (LLMs) are powerful reasoners in natural language, but their actions are typically confined to outputting vocabulary tokens. As a result, interactions with external environments -- such as symbolic operators or simulators -- must be expressed through text in predefined formats, parsed, and routed to external interfaces. This overloads the model's language with both reasoning and control duties, and requires a hand-crafted parser, external to the LLM. To address this, we decouple environment interactions from language by internalizing them in an Expanded Action space (ExpA), beyond the vocabulary. The model starts reasoning in the default language environment, but may trigger routing actions and switch to an external environment at any time. From there, the model can only invoke environment-specific actions, receive feedback from the environment, and potentially route back to language as a result. To promote effective exploration of the expanded action space and new environments, we introduce ExpA Reinforcement Learning (EARL) with counterfactual policy optimization. On tasks requiring multi-turn interactions and contingent planning, EARL outperforms strong baselines with vocabulary-constrained actions. It performs robustly across calculator-based multi-task learning and, in the partially observed sorting problem, achieves perfect Sort-4 accuracy while self-discovering an efficient algorithm competitive with classical designs.

AgentAsk: Multi-Agent Systems Need to Ask

Authors: Bohan Lin, Kuo Yang, Yingchuan Lai, Yudong Zhang, Chen Zhang, Guibin Zhang, Xinlei Yu, Miao Yu, Xu Wang, Yang Wang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.07593
Pdf link: https://arxiv.org/pdf/2510.07593
Abstract Multi-agent systems built on large language models (LLMs) promise enhanced problem-solving capabilities through collaborative division of labor. However, they frequently underperform single-agent baselines due to edge-level error cascades: minor inaccuracies at one message handoff propagate across the entire chain. We propose AgentAsk, a lightweight and plug-and-play clarification module that treats every inter-agent message as a potential failure point and inserts minimally necessary questions to arrest error propagation. AgentAsk follows a three-stage pipeline: (i) distilling edge-level judgments from curated failure traces into a compact policy, (ii) supervising the policy to determine when/what/whom/how to ask, and (iii) optimizing online with E-GRPO, a reinforcement learning objective that balances accuracy, latency, and cost. The module is architecture-agnostic and easy to integrate into existing orchestration. Across math, reasoning, and coding benchmarks, AgentAsk consistently improves accuracy and robustness over public multi-agent implementations while keeping overhead minimal, with latency and extra cost all less than 5%, approaching the performance of a strong evaluator. Beyond empirical improvements, we contribute a principled taxonomy of edge-level errors and a practical recipe for link-local intervention, offering a scalable pathway toward more reliable LLM-based multi-agent systems.

Value Flows

Authors: Perry Dong, Chongyi Zheng, Chelsea Finn, Dorsa Sadigh, Benjamin Eysenbach
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.07650
Pdf link: https://arxiv.org/pdf/2510.07650
Abstract While most reinforcement learning methods today flatten the distribution of future returns to a single scalar value, distributional RL methods exploit the return distribution to provide stronger learning signals and to enable applications in exploration and safe RL. While the predominant method for estimating the return distribution is by modeling it as a categorical distribution over discrete bins or estimating a finite number of quantiles, such approaches leave unanswered questions about the fine-grained structure of the return distribution and about how to distinguish states with high return uncertainty for decision-making. The key idea in this paper is to use modern, flexible flow-based models to estimate the full future return distributions and identify those states with high return variance. We do so by formulating a new flow-matching objective that generates probability density paths satisfying the distributional Bellman equation. Building upon the learned flow models, we estimate the return uncertainty of distinct states using a new flow derivative ODE. We additionally use this uncertainty information to prioritize learning a more accurate return estimation on certain transitions. We compare our method (Value Flows) with prior methods in the offline and online-to-online settings. Experiments on $37$ state-based and $25$ image-based benchmark tasks demonstrate that Value Flows achieves a $1.3\times$ improvement on average in success rates.

LiveThinking: Enabling Real-Time Efficient Reasoning for AI-Powered Livestreaming via Reinforcement Learning

Authors: Yuhan Sun, Zhiwei Huang, Wanqing Cui, Shaopan Xiong, Yazhi Guo, Meiguang Jin, Junfeng Ma
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.07685
Pdf link: https://arxiv.org/pdf/2510.07685
Abstract In AI-powered e-commerce livestreaming, digital avatars require real-time responses to drive engagement, a task for which high-latency Large Reasoning Models (LRMs) are ill-suited. We introduce LiveThinking, a practical two-stage optimization framework to bridge this gap. First, we address computational cost by distilling a 670B teacher LRM into a lightweight 30B Mixture-of-Experts (MoE) model (3B active) using Rejection Sampling Fine-Tuning (RFT). This reduces deployment overhead but preserves the teacher's verbose reasoning, causing latency. To solve this, our second stage employs reinforcement learning with Group Relative Policy Optimization (GRPO) to compress the model's reasoning path, guided by a multi-objective reward function balancing correctness, helpfulness, and brevity. LiveThinking achieves a 30-fold reduction in computational cost, enabling sub-second latency. In real-world application on Taobao Live, it improved response correctness by 3.3% and helpfulness by 21.8%. Tested by hundreds of thousands of viewers, our system led to a statistically significant increase in Gross Merchandise Volume (GMV), demonstrating its effectiveness in enhancing user experience and commercial performance in live, interactive settings.
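
The second stage described in this abstract combines a multi-objective reward with GRPO's group-relative scoring. A minimal sketch of both pieces follows; the reward weights, the token budget, the linear brevity term, and the use of the population standard deviation are assumptions, not values taken from the paper.

```python
from statistics import mean, pstdev


def combined_reward(correct: float, helpful: float, num_tokens: int,
                    max_tokens: int = 512, weights=(0.5, 0.3, 0.2)) -> float:
    """Weighted mix of correctness, helpfulness, and brevity (hypothetical weights)."""
    brevity = max(0.0, 1.0 - num_tokens / max_tokens)
    w_c, w_h, w_b = weights
    return w_c * correct + w_h * helpful + w_b * brevity


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled response relative to its own group: (r - mean) / std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For two equally correct and helpful responses to the same prompt, the shorter one receives the positive advantage, which is the pressure that compresses the reasoning path.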

Control Synthesis of Cyber-Physical Systems for Real-Time Specifications through Causation-Guided Reinforcement Learning

Authors: Xiaochen Tang, Zhenya Zhang, Miaomiao Zhang, Jie An
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.07715
Pdf link: https://arxiv.org/pdf/2510.07715
Abstract In real-time and safety-critical cyber-physical systems (CPSs), control synthesis must guarantee that generated policies meet stringent timing and correctness requirements under uncertain and dynamic conditions. Signal temporal logic (STL) has emerged as a powerful formalism of expressing real-time constraints, with its semantics enabling quantitative assessment of system behavior. Meanwhile, reinforcement learning (RL) has become an important method for solving control synthesis problems in unknown environments. Recent studies incorporate STL-based reward functions into RL to automatically synthesize control policies. However, the automatically inferred rewards obtained by these methods represent the global assessment of a whole or partial path but do not accumulate the rewards of local changes accurately, so the sparse global rewards may lead to non-convergence and unstable training performances. In this paper, we propose an online reward generation method guided by the online causation monitoring of STL. Our approach continuously monitors system behavior against an STL specification at each control step, computing the quantitative distance toward satisfaction or violation and thereby producing rewards that reflect instantaneous state dynamics. Additionally, we provide a smooth approximation of the causation semantics to overcome the discontinuity of the causation semantics and make it differentiable for using deep-RL methods. We have implemented a prototype tool and evaluated it in the Gym environment on a variety of continuously controlled benchmarks. Experimental results show that our proposed STL-guided RL method with online causation semantics outperforms existing relevant STL-guided RL methods, providing a more robust and efficient reward generation framework for deep-RL.
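
For readers unfamiliar with quantitative STL semantics, the sketch below shows the classical robustness of the formula "always (x < threshold)" and a smooth log-sum-exp approximation of the minimum. This is the standard robust semantics, used here only to illustrate why smoothing is needed for gradient-based deep RL; it is not the causation semantics the paper proposes, and the function names are hypothetical.

```python
import math


def robustness_always_below(signal, threshold):
    """Classical robustness of 'always (x < threshold)': min over time of (threshold - x_t).

    Positive means satisfied with that margin; negative means violated.
    """
    return min(threshold - x for x in signal)


def smooth_min(values, beta=10.0):
    """Differentiable lower bound on min(values); approaches min as beta grows."""
    m = min(values)  # shift by the minimum for numerical stability
    return m - math.log(sum(math.exp(-beta * (v - m)) for v in values)) / beta


def smooth_robustness_always_below(signal, threshold, beta=10.0):
    """Smooth counterpart of robustness_always_below."""
    return smooth_min([threshold - x for x in signal], beta)
```

The smooth value never exceeds the exact robustness and differs from it by at most `log(n) / beta` for a signal of length `n`, so `beta` trades off smoothness against fidelity.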

RePainter: Empowering E-commerce Object Removal via Spatial-matting Reinforcement Learning

Authors: Zipeng Guo, Lichen Ma, Xiaolong Fu, Gaojing Zhou, Lan Yang, Yuchen Zhou, Linkai Liu, Yu He, Ximan Liu, Shiping Dong, Jingling Fu, Zhen Chen, Yu Shi, Junshi Huang, Jason Li, Chao Gou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2510.07721
Pdf link: https://arxiv.org/pdf/2510.07721
Abstract In web data, product images are central to boosting user engagement and advertising efficacy on e-commerce platforms, yet intrusive elements such as watermarks and promotional text remain major obstacles to delivering clear and appealing product visuals. Although diffusion-based inpainting methods have advanced, they still face challenges in commercial settings due to unreliable object removal and limited domain-specific adaptation. To tackle these challenges, we propose RePainter, a reinforcement learning framework that integrates spatial-matting trajectory refinement with Group Relative Policy Optimization (GRPO). Our approach modulates attention mechanisms to emphasize background context, generating higher-reward samples and reducing unwanted object insertion. We also introduce a composite reward mechanism that balances global, local, and semantic constraints, effectively reducing visual artifacts and reward hacking. Additionally, we contribute EcomPaint-100K, a high-quality, large-scale e-commerce inpainting dataset, and a standardized benchmark EcomPaint-Bench for fair evaluation. Extensive experiments demonstrate that RePainter significantly outperforms state-of-the-art methods, especially in challenging scenes with intricate compositions. We will release our code and weights upon acceptance.

DEAS: DEtached value learning with Action Sequence for Scalable Offline RL

Authors: Changyeon Kim, Haeone Lee, Younggyo Seo, Kimin Lee, Yuke Zhu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.07730
Pdf link: https://arxiv.org/pdf/2510.07730
Abstract Offline reinforcement learning (RL) presents an attractive paradigm for training intelligent agents without expensive online interactions. However, current approaches still struggle with complex, long-horizon sequential decision making. In this work, we introduce DEtached value learning with Action Sequence (DEAS), a simple yet effective offline RL framework that leverages action sequences for value learning. These temporally extended actions provide richer information than single-step actions and can be interpreted through the options framework via semi-Markov decision process Q-learning, enabling reduction of the effective planning horizon by considering longer sequences at once. However, directly adopting such sequences in actor-critic algorithms introduces excessive value overestimation, which we address through detached value learning that steers value estimates toward in-distribution actions that achieve high return in the offline dataset. We demonstrate that DEAS consistently outperforms baselines on complex, long-horizon tasks from OGBench and can be applied to enhance the performance of large-scale Vision-Language-Action models that predict action sequences, significantly boosting performance in both RoboCasa Kitchen simulation tasks and real-world manipulation tasks.
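
The semi-Markov view of action sequences mentioned in this abstract can be illustrated with the bootstrapped target for one action chunk. The sketch shows only the generic n-step (SMDP-style) target; the detached value learning that the paper adds on top is not reproduced, and the function name is hypothetical.

```python
def action_sequence_target(rewards, next_value, gamma=0.99):
    """Target for a chunk of n actions: sum_i gamma^i * r_i + gamma^n * V(s_{t+n})."""
    target = 0.0
    for i, r in enumerate(rewards):
        target += (gamma ** i) * r
    # Bootstrap once, n steps ahead, instead of after every single action.
    return target + (gamma ** len(rewards)) * next_value
```

Because the bootstrap happens once per chunk, a task of horizon `H` needs roughly `H / n` backups instead of `H`, which is the sense in which longer sequences shorten the effective planning horizon.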

ToolExpander: Extending the Frontiers of Tool-Using Reinforcement Learning to Weak LLMs

Authors: Fu Chen, Peng Wang, Xiyin Li, Wen Li, Shichi Lei, Dongdong Xiang
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.07737
Pdf link: https://arxiv.org/pdf/2510.07737
Abstract Training Large Language Models (LLMs) with Group Relative Policy Optimization (GRPO) encounters a significant challenge: models often fail to produce accurate responses, particularly in small-scale architectures. This limitation not only diminishes performance improvements and undermines the potential of GRPO but also frequently leads to mid-training collapse, adversely affecting stability and final efficacy. To address these issues, we propose ToolExpander, a novel framework that advances tool-oriented reinforcement learning for resource-constrained LLMs through two key innovations: (1) Dynamic Multi-Round Hard Sampling, which dynamically substitutes challenging samples (those without correct outputs over 10 rollouts) with high-quality few-shot demonstrations during training, coupled with an exponential learning rate decay strategy to mitigate oscillations; (2) Self-Exemplifying Thinking, an enhanced GRPO framework that eliminates KL divergence and incorporates adjusted clipping coefficients, encouraging models to autonomously generate and analyze few-shot examples via a minimal additional reward (0.01). Experimental results demonstrate that ToolExpander significantly enhances tool-using capabilities in LLMs, especially in weaker small-scale models, improving both training stability and overall performance.
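
One possible reading of the hard-sample substitution and learning-rate decay described in this abstract is sketched below. Treating "substitution" as prepending a few-shot demonstration to the prompt is an interpretation, and the function names and data layout are hypothetical; the abstract does not specify these details.

```python
def substitute_hard_samples(prompts, rollout_correct, few_shot_demos):
    """Replace prompts that no rollout solved with demonstration-augmented versions.

    rollout_correct maps each prompt to a list of booleans, one per rollout
    (the abstract uses 10 rollouts); few_shot_demos maps a prompt to a demo text.
    """
    out = []
    for p in prompts:
        solved = any(rollout_correct[p])
        if not solved and p in few_shot_demos:
            out.append(few_shot_demos[p] + "\n\n" + p)
        else:
            out.append(p)
    return out


def exponential_lr(initial_lr, decay_rate, step):
    """Exponential decay schedule: lr_t = lr_0 * decay_rate ** t."""
    return initial_lr * decay_rate ** step
```

A prompt with at least one correct rollout is left unchanged, so the substitution only touches samples that would otherwise contribute no learning signal under GRPO.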

OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment

Authors: Tianci Liu, Ran Xu, Tony Yu, Ilgee Hong, Carl Yang, Tuo Zhao, Haoyu Wang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.07743
Pdf link: https://arxiv.org/pdf/2510.07743
Abstract Reward modeling lies at the core of reinforcement learning from human feedback (RLHF), yet most existing reward models rely on scalar or pairwise judgments that fail to capture the multifaceted nature of human preferences. Recent studies have explored rubrics-as-rewards (RaR) that uses structured natural language criteria that capture multiple dimensions of response quality. However, producing rubrics that are both reliable and scalable remains a key challenge. In this work, we introduce OpenRubrics, a diverse, large-scale collection of (prompt, rubric) pairs for training rubric-generation and rubric-based reward models. To elicit discriminative and comprehensive evaluation signals, we introduce Contrastive Rubric Generation (CRG), which derives both hard rules (explicit constraints) and principles (implicit qualities) by contrasting preferred and rejected responses. We further improve reliability by enforcing preference-label consistency via rejection sampling to remove noisy rubrics. Across multiple reward-modeling benchmarks, our rubric-based reward model, Rubric-RM, surpasses strong size-matched baselines by 6.8%. These gains transfer to policy models on instruction-following and biomedical benchmarks. Our results show that rubrics provide scalable alignment signals that narrow the gap between costly human evaluation and automated reward modeling, enabling a new principle-driven paradigm for LLM alignment.
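
The preference-label consistency filter mentioned in this abstract can be sketched as a simple rejection step: a candidate rubric is kept only if scoring with it ranks the human-preferred response above the rejected one. The keyword-matching judge below is a toy stand-in for an LLM judge, and all names are hypothetical.

```python
def keyword_judge(rubric: str, response: str) -> float:
    """Toy judge: 1.0 if the response mentions the rubric keyword, else 0.0."""
    return 1.0 if rubric.lower() in response.lower() else 0.0


def filter_consistent_rubrics(rubrics, chosen, rejected, judge=keyword_judge):
    """Keep rubrics whose scores agree with the preference label (chosen > rejected)."""
    return [r for r in rubrics if judge(r, chosen) > judge(r, rejected)]
```

Rubrics that score both responses equally, or that favor the rejected response, are discarded as noisy.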

Human-in-the-Loop Optimization with Model-Informed Priors

Authors: Yi-Chi Liao, João Belo, Hee-Seung Moon, Jürgen Steimle, Anna Maria Feit
Subjects: Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2510.07754
Pdf link: https://arxiv.org/pdf/2510.07754
Abstract Human-in-the-loop optimization identifies optimal interface designs by iteratively observing user performance. However, it often requires numerous iterations due to the lack of prior information. While recent approaches have accelerated this process by leveraging previous optimization data, collecting user data remains costly and often impractical. We present a conceptual framework, Human-in-the-Loop Optimization with Model-Informed Priors (HOMI), which augments human-in-the-loop optimization with a training phase where the optimizer learns adaptation strategies from diverse, synthetic user data generated with predictive models before deployment. To realize HOMI, we introduce Neural Acquisition Function+ (NAF+), a Bayesian optimization method featuring a neural acquisition function trained with reinforcement learning. NAF+ learns optimization strategies from large-scale synthetic data, improving efficiency in real-time optimization with users. We evaluate HOMI and NAF+ with mid-air keyboard optimization, a representative VR input task. Our work presents a new approach for more efficient interface adaptation by bridging in situ and in silico optimization processes.
中文摘要 人机交互优化通过迭代观察用户性能来识别最佳界面设计。然而，由于缺乏先验信息，它通常需要多次迭代。虽然最近的方法通过利用以前的优化数据加速了这一过程，但收集用户数据仍然成本高昂，而且往往不切实际。我们提出了一个概念框架，即具有模型知情先验的人机交互优化（HOMI），它通过训练阶段增强了人机交互优化，在训练阶段，优化器在部署前从预测模型生成的各种合成用户数据中学习适应策略。为了实现HOMI，我们引入了神经采集函数+（NAF+），这是一种贝叶斯优化方法，具有通过强化学习训练的神经采集函数。NAF+从大规模合成数据中学习优化策略，与用户一起提高实时优化效率。我们通过空中键盘优化评估 HOMI 和 NAF+，这是一种具有代表性的 VR 输入任务。我们的工作提出了一种通过桥接原位和计算机优化过程来更有效地适应界面的新方法。

From Noisy to Native: LLM-driven Graph Restoration for Test-Time Graph Domain Adaptation

从嘈杂到原生：LLM 驱动的图恢复用于测试时间图域自适应

Authors: Xiangwei Lv, JinLuan Yang, Wang Lin, Jingyuan Chen, Beishui Liao
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.07762
Pdf link: https://arxiv.org/pdf/2510.07762
Abstract Graph domain adaptation (GDA) has attracted great attention due to its effectiveness in addressing the domain shift between training and test data. A significant bottleneck in existing graph domain adaptation methods is their reliance on source-domain data, which is often unavailable due to privacy or security concerns. This limitation has driven the development of Test-Time Graph Domain Adaptation (TT-GDA), which aims to transfer knowledge without accessing the source examples. Inspired by the generative power of large language models (LLMs), we introduce a novel framework that reframes TT-GDA as a generative graph restoration problem, "restoring the target graph to its pristine, source-domain-like state". There are two key challenges: (1) We need to construct a reasonable graph restoration process and design an effective encoding scheme that an LLM can understand, bridging the modality gap. (2) We need to devise a mechanism to ensure the restored graph acquires the intrinsic features of the source domain, even without access to the source data. To ensure the effectiveness of graph restoration, we propose GRAIL, which restores the target graph into a state that is well-aligned with the source domain. Specifically, we first compress the node representations into compact latent features and then use a graph diffusion process to model the graph restoration process. Then a quantization module encodes the restored features into discrete tokens. Building on this, an LLM is fine-tuned as a generative restorer to transform a "noisy" target graph into a "native" one. To further improve restoration quality, we introduce a reinforcement learning process guided by specialized alignment and confidence rewards. Extensive experiments demonstrate the effectiveness of our approach across various datasets.
中文摘要 图域自适应（GDA）因其在解决训练数据和测试数据之间的域偏移方面的有效性而受到了极大的关注。现有图域适应方法的一个重大瓶颈是它们对源域数据的依赖，而由于隐私或安全问题，这些数据通常不可用。这一限制推动了测试时间图域自适应（TT-GDA）的发展，其目的是在不访问源示例的情况下转移知识。受到大型语言模型（LLM）生成能力的启发，我们引入了一个新颖的框架，将 TT-GDA 重新定义为生成图恢复问题，“将目标图恢复到其原始的、类似源域的状态”。存在两个关键挑战：（1）我们需要构建一个合理的图恢复过程，并设计一个LLM可以理解的有效编码方案，弥合模态差距。（2）我们需要设计一种机制来确保恢复的图获得源域的内在特征，即使没有访问源数据。为了确保图恢复的有效性，我们提出了 GRAIL，它将目标图恢复到与源域很好地对齐的状态。具体来说，我们首先将节点表示压缩为紧凑的潜在特征，然后使用图扩散过程对图恢复过程进行建模。然后，量化模块将恢复的特征编码为离散标记。在此基础上，LLM 作为生成恢复器进行微调，将“嘈杂”的目标图转换为“原生”目标图。为了进一步提高修复质量，我们引入了以专业对齐和信心奖励为指导的强化学习过程。广泛的实验证明了我们的方法在各种数据集中的有效性。

Curing Miracle Steps in LLM Mathematical Reasoning with Rubric Rewards

使用评分标准奖励治愈 LLM 数学推理中的奇迹步骤

Authors: Youliang Yuan, Qiuyang Mang, Jingbang Chen, Hong Wan, Xiaoyuan Liu, Junjielong Xu, Jen-tse Huang, Wenxuan Wang, Wenxiang Jiao, Pinjia He
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.07774
Pdf link: https://arxiv.org/pdf/2510.07774
Abstract Large language models for mathematical reasoning are typically trained with outcome-based rewards, which credit only the final answer. In our experiments, we observe that this paradigm is highly susceptible to reward hacking, leading to a substantial overestimation of a model's reasoning ability. This is evidenced by a high incidence of false positives - solutions that reach the correct final answer through an unsound reasoning process. Through a systematic analysis with human verification, we establish a taxonomy of these failure modes, identifying patterns like Miracle Steps - abrupt jumps to a correct output without a valid preceding derivation. Probing experiments suggest a strong association between these Miracle Steps and memorization, where the model appears to recall the answer directly rather than deriving it. To mitigate this systemic issue, we introduce the Rubric Reward Model (RRM), a process-oriented reward function that evaluates the entire reasoning trajectory against problem-specific rubrics. The generative RRM provides fine-grained, calibrated rewards (0-1) that explicitly penalize logical flaws and encourage rigorous deduction. When integrated into a reinforcement learning pipeline, RRM-based training consistently outperforms outcome-only supervision across four math benchmarks. Notably, it boosts Verified Pass@1024 on AIME2024 from 26.7% to 62.6% and reduces the incidence of Miracle Steps by 71%. Our work demonstrates that rewarding the solution process is crucial for building models that are not only more accurate but also more reliable.
中文摘要 用于数学推理的大型语言模型通常使用基于结果的奖励进行训练，这些奖励只归功于最终答案。在我们的实验中，我们观察到这种范式极易受到奖励黑客攻击的影响，导致模型的推理能力被大幅高估。误报的高发生率证明了这一点--通过不健全的推理过程得出正确最终答案的解决方案。通过人工验证的系统分析，我们建立了这些故障模式的分类法，识别了诸如奇迹步骤之类的模式 - 突然跳转到正确的输出，而没有有效的先验推导。探测实验表明，这些奇迹步骤与记忆之间存在密切关联，其中模型似乎直接回忆起答案而不是推导出答案。为了缓解这一系统性问题，我们引入了评分标准奖励模型（RRM），这是一种面向过程的奖励函数，可根据特定于问题的评分标准评估整个推理轨迹。生成式 RRM 提供细粒度、校准的奖励（0-1），明确惩罚逻辑缺陷并鼓励严格的推论。当集成到强化学习管道中时，基于 RRM 的训练在四个数学基准中始终优于仅结果监督。值得注意的是，它将AIME2024验证Pass@1024从 26.7% 提高到 62.6%，并将 Miracle Steps 的发生率降低了 71%。我们的工作表明，奖励解决方案过程对于构建不仅更准确而且更可靠的模型至关重要。
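
For readers unfamiliar with Pass@k figures such as the Pass@1024 number above, the sketch below shows the standard unbiased pass@k estimator. The abstract does not say how Verified Pass@1024 is computed, so this is only an assumption-laden illustration: `pass_at_k` is a name chosen here, and treating `c` as the count of samples whose reasoning was also verified as sound is one plausible reading, not a confirmed definition.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples drawn, c of them counted as correct.

    Equals 1 - C(n - c, k) / C(n, k), the probability that a random
    size-k subset of the n samples contains at least one correct sample.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Assumed usage: for a "verified" variant, c would count only solutions
# that reach the right answer AND pass a reasoning check.
print(pass_at_k(n=10, c=1, k=1))
```

Under that reading, a solution that reaches the right answer through a Miracle Step would lower `c` and hence the verified score, while leaving ordinary answer-only pass@k unchanged.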

GCPO: When Contrast Fails, Go Gold

GCPO：当对比失败时，选择黄金

Authors: Hao Wu, Wei Liu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.07790
Pdf link: https://arxiv.org/pdf/2510.07790
Abstract Reinforcement learning has been widely applied to enhance the reasoning capabilities of large language models. Extending the inference limits of smaller models has become a prominent research focus. However, algorithms such as Group Relative Policy Optimization (GRPO) suffer from a clear drawback: the upper bound of a model's rollout responses is entirely determined by the model itself, preventing the acquisition of knowledge from samples that are either all incorrect or all correct. In this paper, we introduce Group Contrastive Policy Optimization (GCPO), a method that incorporates external standard reference answers. When the model cannot solve a problem, the reference answer supplies the correct response, steering the model toward an unequivocally accurate update direction. This approach offers two main advantages: (1) it improves training efficiency by fully utilizing every sample; (2) it enables the model to emulate the problem solving strategy of the reference answer during training, thereby enhancing generalization in reasoning. GCPO achieves outstanding results across multiple benchmark datasets, yielding substantial improvements over the baseline model. Our code is available at: this https URL.
中文摘要 强化学习已被广泛应用于增强大语言模型的推理能力。扩展较小模型的推理极限已成为一个突出的研究重点。然而，群体相对策略优化（GRPO）等算法有一个明显的缺点：模型推出响应的上限完全由模型本身决定，从而阻止从全部不正确或全部正确的样本中获取知识。在本文中，我们介绍了组对比策略优化（GCPO），这是一种结合外部标准参考答案的方法。当模型无法解决问题时，参考答案会提供正确的答案，引导模型朝着明确准确的更新方向发展。这种方法有两个主要优点：（1）它通过充分利用每个样本来提高训练效率;（2）它使模型能够在训练过程中模拟参考答案的解决问题策略，从而增强推理的泛化性。GCPO 在多个基准数据集中取得了出色的结果，比基线模型有了显着改进。我们的代码可在以下网址获得：此 https URL。
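
The drawback the GCPO abstract attributes to GRPO can be seen in a few lines: group-normalized advantages are all zero when every rollout in a group receives the same reward. The sketch below uses the usual mean and standard-deviation normalization; `with_reference` is a hypothetical helper that only gestures at the idea of adding an external reference answer, since the abstract does not specify how GCPO incorporates it.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages: (r - mean) / (std + eps) within one prompt's rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def with_reference(rewards: List[float], reference_reward: float = 1.0) -> List[float]:
    """Assumed illustration: if no rollout succeeded, add the reference answer as one more sample."""
    if max(rewards) <= 0.0:
        return rewards + [reference_reward]
    return rewards


all_wrong = [0.0, 0.0, 0.0, 0.0]
print(group_advantages(all_wrong))                  # no learning signal
print(group_advantages(with_reference(all_wrong)))  # reference sample gets a positive advantage
```

With the reference sample added, the failed rollouts receive negative advantages and the reference receives a positive one, so the group is no longer wasted.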

Strategic Communication under Threat: Learning Information Trade-offs in Pursuit-Evasion Games

威胁下的战略沟通：在追捕-逃避游戏中学习信息权衡

Authors: Valerio La Gatta, Dolev Mutzari, Sarit Kraus, VS Subrahmanian
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.07813
Pdf link: https://arxiv.org/pdf/2510.07813
Abstract Adversarial environments require agents to navigate a key strategic trade-off: acquiring information enhances situational awareness, but may simultaneously expose them to threats. To investigate this tension, we formulate a Pursuit-Evasion-Exposure-Concealment Game (PEEC) in which a pursuer agent must decide when to communicate in order to obtain the evader's position. Each communication reveals the pursuer's location, increasing the risk of being targeted. Both agents learn their movement policies via reinforcement learning, while the pursuer additionally learns a communication policy that balances observability and risk. We propose SHADOW (Strategic-communication Hybrid Action Decision-making under partial Observation for Warfare), a multi-headed sequential reinforcement learning framework that integrates continuous navigation control, discrete communication actions, and opponent modeling for behavior prediction. Empirical evaluations show that SHADOW pursuers achieve higher success rates than six competitive baselines. Our ablation study confirms that temporal sequence modeling and opponent modeling are critical for effective decision-making. Finally, our sensitivity analysis reveals that the learned policies generalize well across varying communication risks and physical asymmetries between agents.
中文摘要 对抗性环境要求代理进行关键的战略权衡：获取信息可以增强态势感知，但同时可能会使他们面临威胁。为了研究这种紧张关系，我们制定了一个追捕-逃避-暴露-隐藏游戏（PEEC），其中追捕者必须决定何时进行通信以获得逃避者的位置。每次通信都会泄露追捕者的位置，增加成为目标的风险。两个代理都通过强化学习学习他们的移动策略，而追击者还学习平衡可观察性和风险的通信策略。我们提出了SHADOW（Strategic-communication Hybrid Action Decision-making under partial Observation for Warfare），这是一个多头顺序强化学习框架，集成了连续导航控制、离散通信动作和对手建模，用于行为预测。实证评估表明，影子追求者的成功率高于六个竞争基线。我们的消融研究证实，时间序列建模和对手建模对于有效决策至关重要。最后，我们的敏感性分析表明，学习到的策略在不同的通信风险和智能体之间的物理不对称性中具有很好的推广效果。

An LLM-Powered Cooperative Framework for Large-Scale Multi-Vehicle Navigation

LLM驱动的大规模多车导航协同框架

Authors: Yuping Zhou, Siqi Lai, Jindong Han, Hao Liu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.07825
Pdf link: https://arxiv.org/pdf/2510.07825
Abstract The rise of Internet of Vehicles (IoV) technologies is transforming traffic management from isolated control to a collective, multi-vehicle process. At the heart of this shift is multi-vehicle dynamic navigation, which requires simultaneously routing large fleets under evolving traffic conditions. Existing path search algorithms and reinforcement learning methods struggle to scale to city-wide networks, often failing to capture the nonlinear, stochastic, and coupled dynamics of urban traffic. To address these challenges, we propose CityNav, a hierarchical, LLM-powered framework for large-scale multi-vehicle navigation. CityNav integrates a global traffic allocation agent, which coordinates strategic traffic flow distribution across regions, with local navigation agents that generate locally adaptive routes aligned with global directives. To enable effective cooperation, we introduce a cooperative reasoning optimization mechanism, in which agents are jointly trained with a dual-reward structure: individual rewards promote per-vehicle efficiency, while shared rewards encourage network-wide coordination and congestion reduction. Extensive experiments on four real-world road networks of varying scales (up to 1.6 million roads and 430,000 intersections) and traffic datasets demonstrate that CityNav consistently outperforms nine classical path search and RL-based baselines in city-scale travel efficiency and congestion mitigation. Our results highlight the potential of LLMs to enable scalable, adaptive, and cooperative city-wide traffic navigation, providing a foundation for intelligent, large-scale vehicle routing in complex urban environments. Our project is available at this https URL.
中文摘要 车联网（IoV）技术的兴起正在将交通管理从孤立控制转变为集体多车辆流程。这种转变的核心是多车动态导航，这需要在不断变化的交通条件下同时对大型车队进行路由。现有的路径搜索算法和强化学习方法难以扩展到全市网络，往往无法捕捉城市交通的非线性、随机和耦合动态。为了应对这些挑战，我们提出了 CityNav，这是一个用于大规模多车导航的分层、LLM 驱动的框架。CityNav 集成了全局交通分配代理和本地导航代理：前者协调跨区域的战略交通流分配，后者生成符合全局指令的本地自适应路线。为了实现有效的合作，我们引入了一种合作推理优化机制，其中智能体以双重奖励结构进行联合训练：个人奖励促进每辆车的效率，而共享奖励鼓励全网协调和减少拥塞。对四个不同规模的真实世界道路网络（多达 160 万条道路和 430,000 个十字路口）和交通数据集的广泛实验表明，CityNav 在城市规模的出行效率和拥堵缓解方面始终优于九个经典路径搜索和基于 RL 的基线。我们的研究结果凸显了 LLM 在实现可扩展、自适应和协作的城市范围内交通导航方面的潜力，为复杂城市环境中的智能、大规模车辆路线提供了基础。我们的项目可在此 https URL 中找到。

Network Topology and Information Efficiency of Multi-Agent Systems: Study based on MARL

多智能体系统的网络拓扑与信息效率——基于MARL的研究

Authors: Xinren Zhang, Sixi Cheng, Zixin Zhong, Jiadong Yu
Subjects: Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2510.07888
Pdf link: https://arxiv.org/pdf/2510.07888
Abstract Multi-agent systems (MAS) solve complex problems through coordinated autonomous entities with individual decision-making capabilities. While Multi-Agent Reinforcement Learning (MARL) enables these agents to learn intelligent strategies, it faces challenges of non-stationarity and partial observability. Communication among agents offers a solution, but questions remain about its optimal structure and evaluation. This paper explores two underexamined aspects: communication topology and information efficiency. We demonstrate that directed and sequential topologies improve performance while reducing communication overhead across both homogeneous and heterogeneous tasks. Additionally, we introduce two metrics -- Information Entropy Efficiency Index (IEI) and Specialization Efficiency Index (SEI) -- to evaluate message compactness and role differentiation. Incorporating these metrics into training objectives improves success rates and convergence speed. Our findings highlight that designing adaptive communication topologies with information-efficient messaging is essential for effective coordination in complex MAS.
中文摘要 多智能体系统（MAS）通过具有个人决策能力的协调自主实体解决复杂问题。虽然多智能体强化学习（MARL）使这些智能体能够学习智能策略，但它面临着非平稳性和部分可观测性的挑战。代理之间的沟通提供了解决方案，但关于其最佳结构和评估的问题仍然存在。本文探讨了两个未被充分研究的方面：通信拓扑和信息效率。我们证明，定向和顺序拓扑可以提高性能，同时减少同构和异构任务之间的通信开销。此外，我们还引入了两个指标——信息熵效率指数（IEI）和专业化效率指数（SEI）——来评估消息的紧凑性和角色区分。将这些指标纳入培训目标可以提高成功率和收敛速度。我们的研究结果强调，设计具有信息高效消息传递的自适应通信拓扑对于复杂MAS中的有效协调至关重要。

MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding

MARC：内存增强的 RL 令牌压缩，用于高效理解视频

Authors: Peiran Wu, Zhuorui Yu, Yunze Liu, Chi-Hao Wu, Enmin Zhou, Junxiao Shen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2510.07915
Pdf link: https://arxiv.org/pdf/2510.07915
Abstract The rapid progress of large language models (LLMs) has laid the foundation for multimodal models. However, visual language models (VLMs) still face heavy computational costs when extended from images to videos due to high frame rates and long durations. Token compression is a promising solution, yet most existing training-free methods cause information loss and performance degradation. To overcome this, we propose Memory-Augmented Reinforcement Learning-based Token Compression (MARC), which integrates structured retrieval and RL-based distillation. MARC adopts a retrieve-then-compress strategy using a Visual Memory Retriever (VMR) to select key clips and a Compression Group Relative Policy Optimization (C-GRPO) framework to distil reasoning ability from a teacher to a student model. Experiments on six video benchmarks show that MARC achieves near-baseline accuracy using only one frame's tokens -- reducing visual tokens by 95%, GPU memory by 72%, and latency by 23.9%. This demonstrates its potential for efficient, real-time video understanding in resource-constrained settings such as video QA, surveillance, and autonomous driving.
中文摘要 大型语言模型（LLMs）的快速发展为多模态模型奠定了基础。然而，由于高帧速率和长持续时间，视觉语言模型（VLM）在从图像扩展到视频时仍然面临沉重的计算成本。令牌压缩是一种很有前途的解决方案，但大多数现有的免训练方法都会导致信息丢失和性能下降。为了克服这个问题，我们提出了基于记忆增强强化学习的标记压缩（MARC），它集成了结构化检索和基于 RL 的蒸馏。MARC 采用“先检索后压缩”策略，使用 Visual Memory Retriever（VMR）来选择关键剪辑，并使用压缩组相对策略优化（C-GRPO）框架将推理能力从教师模型蒸馏到学生模型。对六个视频基准测试的实验表明，MARC 仅使用一帧的标记即可实现接近基线的准确性，将视觉标记减少 95%，将 GPU 内存减少 72%，延迟减少 23.9%。这证明了它在视频 QA、监控和自动驾驶等资源受限环境中高效、实时视频理解的潜力。

A$^2$Search: Ambiguity-Aware Question Answering with Reinforcement Learning

A$^2$搜索：使用强化学习进行歧义感知问答

Authors: Fengji Zhang, Xinyao Niu, Chengyang Ying, Guancheng Lin, Zhongkai Hao, Zhou Fan, Chengen Huang, Jacky Keung, Bei Chen, Junyang Lin
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.07958
Pdf link: https://arxiv.org/pdf/2510.07958
Abstract Recent advances in Large Language Models (LLMs) and Reinforcement Learning (RL) have led to strong performance in open-domain question answering (QA). However, existing models still struggle with questions that admit multiple valid answers. Standard QA benchmarks, which typically assume a single gold answer, overlook this reality and thus produce inappropriate training signals. Existing attempts to handle ambiguity often rely on costly manual annotation, which is difficult to scale to multi-hop datasets such as HotpotQA and MuSiQue. In this paper, we present A$^2$Search, an annotation-free, end-to-end training framework to recognize and handle ambiguity. At its core is an automated pipeline that detects ambiguous questions and gathers alternative answers via trajectory sampling and evidence verification. The model is then optimized with RL using a carefully designed $\mathrm{AnsF1}$ reward, which naturally accommodates multiple answers. Experiments on eight open-domain QA benchmarks demonstrate that A$^2$Search achieves new state-of-the-art performance. With only a single rollout, A$^2$Search-7B yields an average $\mathrm{AnsF1}@1$ score of $48.4\%$ across four multi-hop benchmarks, outperforming all strong baselines, including the substantially larger ReSearch-32B ($46.2\%$). Extensive analyses further show that A$^2$Search resolves ambiguity and generalizes across benchmarks, highlighting that embracing ambiguity is essential for building more reliable QA systems. Our code, data, and model weights can be found at this https URL
中文摘要 大型语言模型（LLM）和强化学习（RL）的最新进展导致了开放域问答（QA）的强劲表现。然而，现有模型仍然在为接受多个有效答案的问题而苦苦挣扎。标准 QA 基准通常假设只有一个黄金答案，但忽略了这一现实，从而产生了不适当的训练信号。现有的处理歧义的尝试通常依赖于昂贵的手动注释，这很难扩展到 HotpotQA 和 MuSiQue 等多跳数据集。在本文中，我们提出了 A$^2$Search，这是一个无注释的端到端训练框架，用于识别和处理歧义。其核心是一个自动化管道，它检测模棱两可的问题，并通过轨迹抽样和证据验证收集替代答案。然后使用精心设计的 $\mathrm{AnsF1}$ 奖励使用 RL 对模型进行优化，该奖励自然可以容纳多个答案。对八个开放领域 QA 基准测试的实验表明，A$^2$Search 实现了新的最先进的性能。仅用一次采样（rollout），A$^2$Search-7B 在四个多跳基准测试中的平均 $\mathrm{AnsF1}@1$ 得分为 $48.4\%$，优于所有强基线，包括规模大得多的 ReSearch-32B（$46.2\%$）。广泛的分析进一步表明，A$^2$Search 解决了歧义并跨基准进行了泛化，强调拥抱歧义对于构建更可靠的 QA 系统至关重要。我们的代码、数据和模型权重可以在此 https URL 中找到
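
The abstract does not define the AnsF1 reward, so the following is only a guess at its general shape: a set-level F1 between the answers a model returns and the set of accepted answers. The function names and the normalization rule (lowercasing and whitespace collapsing) are assumptions made for this sketch.

```python
from typing import Iterable


def normalize(answer: str) -> str:
    """Assumed normalization: lowercase and collapse whitespace."""
    return " ".join(answer.lower().split())


def ans_f1(predicted: Iterable[str], references: Iterable[str]) -> float:
    """Set-level F1 between predicted answers and accepted reference answers."""
    pred = {normalize(a) for a in predicted}
    gold = {normalize(a) for a in references}
    if not pred or not gold:
        return 0.0
    hits = len(pred & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


# An ambiguous question with two accepted answers; the model returns one of them.
print(ans_f1(["Springfield, Illinois"], ["Springfield, Illinois", "Springfield, Missouri"]))
```

A reward of this shape gives partial credit for recovering some of the valid answers and penalizes padding the output with wrong ones, which is one way a metric could "accommodate multiple answers".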

Climate Surrogates for Scalable Multi-Agent Reinforcement Learning: A Case Study with CICERO-SCM

可扩展多智能体强化学习的气候替代物：CICERO-SCM 案例研究

Authors: Oskar Bohn Lassen, Serio Angelo Maria Agriesti, Filipe Rodrigues, Francisco Camara Pereira
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2510.07971
Pdf link: https://arxiv.org/pdf/2510.07971
Abstract Climate policy studies require models that capture the combined effects of multiple greenhouse gases on global temperature, but these models are computationally expensive and difficult to embed in reinforcement learning. We present a multi-agent reinforcement learning (MARL) framework that integrates a high-fidelity, highly efficient climate surrogate directly in the environment loop, enabling regional agents to learn climate policies under multi-gas dynamics. As a proof of concept, we introduce a recurrent neural network architecture pretrained on 20,000 multi-gas emission pathways to surrogate the climate model CICERO-SCM. The surrogate model attains near-simulator accuracy with global-mean temperature RMSE $\approx 0.0004 \mathrm{K}$ and approximately $1000\times$ faster one-step inference. When substituted for the original simulator in a climate-policy MARL setting, it accelerates end-to-end training by $>100\times$. We show that the surrogate and simulator converge to the same optimal policies and propose a methodology to assess this property in cases where using the simulator is intractable. Our work makes it possible to bypass the core computational bottleneck without sacrificing policy fidelity, enabling large-scale multi-agent experiments across alternative climate-policy regimes with multi-gas dynamics and high-fidelity climate response.
中文摘要 气候政策研究需要能够捕捉多种温室气体对全球温度的综合影响的模型，但这些模型的计算成本高昂，并且难以嵌入强化学习中。我们提出了一个多智能体强化学习（MARL）框架，该框架将高保真、高效的气候替代物直接集成到环境循环中，使区域智能体能够学习多气体动力学下的气候政策。作为概念验证，我们引入了一种在 20,000 条多气体排放路径上预训练的递归神经网络架构来替代气候模型 CICERO-SCM。代理模型达到了接近模拟器的精度，全球平均温度 RMSE $\approx 0.0004 \mathrm{K}$，单步推理速度提高约 $1000\times$。当在气候政策 MARL 设置中替代原始模拟器时，它可以将端到端训练加速 $>100\times$。我们表明，代理和模拟器收敛到相同的最优策略，并提出了一种在使用模拟器难以处理的情况下评估该属性的方法。我们的工作允许在不牺牲政策保真度的情况下绕过核心计算瓶颈，从而能够在具有多气体动力学和高保真气候响应的替代气候政策制度中进行大规模多代理实验。

TaoSR-SHE: Stepwise Hybrid Examination Reinforcement Learning Framework for E-commerce Search Relevance

TaoSR-SHE：电子商务搜索相关性的逐步混合考试强化学习框架

Authors: Pengkun Jiao, Yiming Jin, Jianhui Yang, Chenhe Dong, Zerui Huang, Shaowei Yao, Xiaojiang Zhou, Dan Ou, Haihong Tang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.07972
Pdf link: https://arxiv.org/pdf/2510.07972
Abstract Query-product relevance analysis is a foundational technology in e-commerce search engines and has become increasingly important in AI-driven e-commerce. The recent emergence of large language models (LLMs), particularly their chain-of-thought (CoT) reasoning capabilities, offers promising opportunities for developing relevance systems that are both more interpretable and more robust. However, existing training paradigms have notable limitations: SFT and DPO suffer from poor generalization on long-tail queries and from a lack of fine-grained, stepwise supervision to enforce rule-aligned reasoning. In contrast, reinforcement learning with verification rewards (RLVR) suffers from sparse feedback, which provides insufficient signal to correct erroneous intermediate steps, thereby undermining logical consistency and limiting performance in complex inference scenarios. To address these challenges, we introduce the Stepwise Hybrid Examination Reinforcement Learning framework for Taobao Search Relevance (TaoSR-SHE). At its core is Stepwise Reward Policy Optimization (SRPO), a reinforcement learning algorithm that leverages step-level rewards generated by a hybrid of a high-quality generative stepwise reward model and a human-annotated offline verifier, prioritizing learning from critical correct and incorrect reasoning steps. TaoSR-SHE further incorporates two key techniques: diversified data filtering to encourage exploration across varied reasoning paths and mitigate policy entropy collapse, and multi-stage curriculum learning to foster progressive capability growth. Extensive experiments on real-world search benchmarks show that TaoSR-SHE improves both reasoning quality and relevance-prediction accuracy in large-scale e-commerce settings, outperforming SFT, DPO, GRPO, and other baselines, while also enhancing interpretability and robustness.
中文摘要 查询产品相关性分析是电子商务搜索引擎的一项基础技术，在人工智能驱动的电子商务中变得越来越重要。最近出现的大型语言模型（LLM），特别是其思维链（CoT）推理能力，为开发更具可解释性和更稳健的相关性系统提供了有希望的机会。然而，现有的训练范式有明显的局限性：SFT 和 DPO 在长尾查询上的泛化能力很差，并且缺乏细粒度的逐步监督来强制执行规则对齐的推理。相比之下，具有验证奖励的强化学习（RLVR）存在稀疏反馈的问题，它没有提供足够的信号来纠正错误的中间步骤，从而破坏逻辑一致性并限制了复杂推理场景中的性能。为了应对这些挑战，我们引入了用于淘宝搜索相关性的逐步混合考试强化学习框架（TaoSR-SHE）。其核心是逐步奖励策略优化（SRPO），这是一种强化学习算法，它利用由高质量生成逐步奖励模型和人工注释的离线验证器混合生成的逐步级奖励，优先学习关键的正确和错误推理步骤。TaoSR-SHE 进一步融入了两种关键技术：多样化的数据过滤以鼓励跨不同推理路径的探索并减轻政策熵崩溃，以及多阶段课程学习以促进渐进式能力增长。在真实世界搜索基准上的大量实验表明，TaoSR-SHE在大规模电子商务环境中提高了推理质量和相关性预测准确性，优于SFT、DPO、GRPO和其他基线，同时还增强了可解释性和鲁棒性。

TaoSR-AGRL: Adaptive Guided Reinforcement Learning Framework for E-commerce Search Relevance

TaoSR-AGRL：电子商务搜索相关性的自适应引导强化学习框架

Authors: Jianhui Yang, Yiming Jin, Pengkun Jiao, Chenhe Dong, Zerui Huang, Shaowei Yao, Xiaojiang Zhou, Dan Ou, Haihong Tang
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.08048
Pdf link: https://arxiv.org/pdf/2510.08048
Abstract Query-product relevance prediction is fundamental to e-commerce search and has become even more critical in the era of AI-powered shopping, where semantic understanding and complex reasoning directly shape the user experience and business conversion. Large Language Models (LLMs) enable generative, reasoning-based approaches, typically aligned via supervised fine-tuning (SFT) or preference optimization methods like Direct Preference Optimization (DPO). However, the increasing complexity of business rules and user queries exposes the inability of existing methods to endow models with robust reasoning capacity for long-tail and challenging cases. Efforts to address this via reinforcement learning strategies like Group Relative Policy Optimization (GRPO) often suffer from sparse terminal rewards, offering insufficient guidance for multi-step reasoning and slowing convergence. To address these challenges, we propose TaoSR-AGRL, an Adaptive Guided Reinforcement Learning framework for LLM-based relevance prediction in Taobao Search Relevance. TaoSR-AGRL introduces two key innovations: (1) Rule-aware Reward Shaping, which decomposes the final relevance judgment into dense, structured rewards aligned with domain-specific relevance criteria; and (2) Adaptive Guided Replay, which identifies low-accuracy rollouts during training and injects targeted ground-truth guidance to steer the policy away from stagnant, rule-violating reasoning patterns toward compliant trajectories. TaoSR-AGRL was evaluated on large-scale real-world datasets and through online side-by-side human evaluations on Taobao Search. It consistently outperforms DPO and standard GRPO baselines in offline experiments, improving relevance accuracy, rule adherence, and training stability. The model trained with TaoSR-AGRL has been successfully deployed in the main search scenario on Taobao, serving hundreds of millions of users.
中文摘要 查询产品相关性预测是电子商务搜索的基础，在人工智能驱动的购物时代变得更加重要，在人工智能驱动的购物时代，语义理解和复杂的推理直接影响用户体验和业务转化。大型语言模型（LLM）支持基于推理的生成方法，通常通过监督微调（SFT）或直接偏好优化（DPO）等偏好优化方法进行调整。然而，业务规则和用户查询的复杂性不断增加，暴露了现有方法无法为模型赋予长尾和具有挑战性案例的稳健推理能力。通过群体相对策略优化（GRPO）等强化学习策略解决这个问题的努力通常会受到终端奖励稀疏的影响，为多步骤推理提供不足的指导并减慢收敛速度。为了应对这些挑战，我们提出了 TaoSR-AGRL，这是一种自适应引导强化学习框架，用于淘宝搜索相关性中基于 LLM 的相关性预测。TaoSR-AGRL 引入了两项关键创新：（1）规则感知奖励塑造，将最终的相关性判断分解为符合特定领域相关性标准的密集、结构化的奖励;（2）自适应引导回放，在训练期间识别低准确度的推出，并注入有针对性的地面实况指导，以引导政策从停滞不前、违反规则的推理模式转向合规轨迹。TaoSR-AGRL在大规模真实世界数据集和淘宝搜索上的在线并排人体评估中进行了评估。在离线实验中，它始终优于 DPO 和标准 GRPO 基线，提高了相关性准确性、规则遵守性和训练稳定性。使用 TaoSR-AGRL 训练的模型已成功部署在淘宝主搜索场景中，服务了数亿用户。

A Survey of Process Reward Models: From Outcome Signals to Process Supervisions for Large Language Models

流程奖励模型综述：从结果信号到大型语言模型的流程监督

Authors: Congming Zheng, Jiachen Zhu, Zhuoying Ou, Yuxiang Chen, Kangning Zhang, Rong Shan, Zeyu Zheng, Mengyue Yang, Jianghao Lin, Yong Yu, Weinan Zhang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.08049
Pdf link: https://arxiv.org/pdf/2510.08049
Abstract Although Large Language Models (LLMs) exhibit advanced reasoning ability, conventional alignment remains largely dominated by outcome reward models (ORMs) that judge only final answers. Process Reward Models (PRMs) address this gap by evaluating and guiding reasoning at the step or trajectory level. This survey provides a systematic overview of PRMs through the full loop: how to generate process data, build PRMs, and use PRMs for test-time scaling and reinforcement learning. We summarize applications across math, code, text, multimodal reasoning, robotics, and agents, and review emerging benchmarks. Our goal is to clarify design spaces, reveal open challenges, and guide future research toward fine-grained, robust reasoning alignment.
中文摘要 尽管大型语言模型（LLM）表现出先进的推理能力，但传统的对齐方式在很大程度上仍然由仅判断最终答案的结果奖励模型（ORM）主导。过程奖励模型（PRM）通过在阶梯或轨迹层面评估和指导推理来弥补这一差距。本调查通过整个循环系统地概述了 PRM：如何生成过程数据、构建 PRM 以及使用 PRM 进行测试时扩展和强化学习。我们总结了数学、代码、文本、多模态推理、机器人和代理的应用，并审查了新兴的基准。我们的目标是澄清设计空间，揭示开放的挑战，并指导未来的研究实现细粒度、稳健的推理一致性。

Real-Time Motion-Controllable Autoregressive Video Diffusion

实时运动可控自回归视频扩散

Authors: Kesen Zhao, Jiaxin Shi, Beier Zhu, Junbao Zhou, Xiaolong Shen, Yuan Zhou, Qianru Sun, Hanwang Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2510.08131
Pdf link: https://arxiv.org/pdf/2510.08131
Abstract Real-time motion-controllable video generation remains challenging due to the inherent latency of bidirectional diffusion models and the lack of effective autoregressive (AR) approaches. Existing AR video diffusion models are limited to simple control signals or text-to-video generation, and often suffer from quality degradation and motion artifacts in few-step generation. To address these challenges, we propose AR-Drag, the first RL-enhanced few-step AR video diffusion model for real-time image-to-video generation with diverse motion control. We first fine-tune a base I2V model to support basic motion control, then further improve it via reinforcement learning with a trajectory-based reward model. Our design preserves the Markov property through a Self-Rollout mechanism and accelerates training by selectively introducing stochasticity in denoising steps. Extensive experiments demonstrate that AR-Drag achieves high visual fidelity and precise motion alignment, significantly reducing latency compared with state-of-the-art motion-controllable VDMs, while using only 1.3B parameters. Additional visualizations can be found on our project page: this https URL.
中文摘要 由于双向扩散模型固有的延迟和缺乏有效的自回归（AR）方法，实时运动可控视频生成仍然具有挑战性。现有的AR视频扩散模型仅限于简单的控制信号或文本到视频生成，并且经常在少步生成中遭受质量下降和运动伪影的影响。为了应对这些挑战，我们提出了 AR-Drag，这是第一个 RL 增强的几步 AR 视频扩散模型，用于通过多样化的运动控制进行实时图像到视频生成。我们首先微调一个基本的 I2V 模型以支持基本的运动控制，然后通过基于轨迹的奖励模型的强化学习进一步改进它。我们的设计通过自推出机制保留了马尔可夫属性，并通过在去噪步骤中选择性地引入随机性来加速训练。大量实验表明，AR-Drag 实现了高视觉保真度和精确的运动对准，与最先进的运动可控 VDM 相比，显着降低了延迟，同时仅使用 1.3B 参数。可以在我们的项目页面上找到其他可视化：此 https URL。

ARM2: Adaptive Reasoning Model with Vision Understanding and Executable Code

ARM2：具有视觉理解和可执行代码的自适应推理模型

Authors: Jian Xie, Zhendong Chu, Aoxiao Zhong, Kai Zhang, Mingzhe Han, Xin Fang, Jialie Shen, Qingsong Wen
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.08163
Pdf link: https://arxiv.org/pdf/2510.08163
Abstract Large Reasoning Models (LRMs) often suffer from the "over-thinking" problem, generating unnecessarily long reasoning on simple tasks. Some strategies have been proposed to mitigate this issue, such as length penalties or routing mechanisms, but they are typically heuristic and task-specific, lacking a general framework for adaptive reasoning. In this paper, we present ARM2, a unified model that adaptively balances reasoning performance and efficiency across multiple formats through a reinforcement learning framework augmented with length-aware optimization. Beyond conventional natural language inference, ARM2 integrates vision understanding, extending its applicability to multimodal tasks. Moreover, ARM2 integrates executable code into reasoning, enabling substantial reductions in token cost while preserving task performance compared to long CoT. Experiments demonstrate that ARM2 achieves performance on par with traditional reasoning models trained with GRPO, while reducing token usage by over 70% on average. We further conduct extensive analyses to validate the effectiveness of ARM2 and the soundness of its design.
中文摘要 大型推理模型（LRM）经常存在“过度思考”问题，在简单的任务上产生不必要的冗长推理。已经提出了一些策略来缓解这个问题，例如长度惩罚或路由机制，但它们通常是启发式的和特定于任务的，缺乏自适应推理的通用框架。在本文中，我们提出了 ARM2，这是一个统一模型，它通过增强长度感知优化的强化学习框架，自适应地平衡多种格式的推理性能和效率。除了传统的自然语言推理之外，ARM2 还集成了视觉理解，将其适用性扩展到多模态。此外，ARM2 将可执行代码集成到推理中，与长 CoT 相比，可以大幅降低令牌成本，同时保持任务性能。实验表明，ARM2 的性能与使用 GRPO 训练的传统推理模型相当，同时平均减少了 70% 以上的代币使用量。我们进一步进行了广泛的分析，以验证ARM2的有效性及其设计的合理性。

R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?

R-Horizon：你们的大型推理模型在广度和深度上到底能走多远？

Authors: Yi Lu, Jianing Wang, Linsen Guo, Wei He, Hongyin Tang, Tao Gui, Xuanjing Huang, Xuezhi Cao, Wei Wang, Xunliang Cai
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.08189
Pdf link: https://arxiv.org/pdf/2510.08189
Abstract Recent trends in test-time scaling for reasoning models (e.g., OpenAI o1, DeepSeek-R1) have led to remarkable improvements through long Chain-of-Thought (CoT). However, existing benchmarks mainly focus on immediate, single-horizon tasks, failing to adequately evaluate models' ability to understand and respond to complex, long-horizon scenarios. To address this incomplete evaluation of Large Reasoning Models (LRMs), we propose R-HORIZON, a method designed to stimulate long-horizon reasoning behaviors in LRMs through query composition. Based on R-HORIZON, we construct a long-horizon reasoning benchmark, comprising complex multi-step reasoning tasks with interdependent problems that span long reasoning horizons. Through comprehensive evaluation of LRMs using the R-HORIZON benchmark, we find that even the most advanced LRMs suffer significant performance degradation. Our analysis reveals that LRMs exhibit limited effective reasoning length and struggle to allocate thinking budget across multiple problems appropriately. Recognizing these limitations, we use R-HORIZON to construct long-horizon reasoning data for reinforcement learning with verified rewards (RLVR). Compared to training with single-horizon data, RLVR with R-HORIZON not only substantially improves performance on the multi-horizon reasoning tasks, but also promotes accuracy on standard reasoning tasks, with an increase of 7.5 on AIME2024. These results position R-HORIZON as a scalable, controllable, and low-cost paradigm for enhancing and evaluating the long-horizon reasoning capabilities of LRMs.
中文摘要 推理模型（例如 OpenAI o1、DeepSeek-R1）测试时间缩放的最新趋势通过长思维链（CoT）带来了显着的改进。然而，现有的基准测试主要集中在即时的单视野任务上，未能充分评估模型理解和响应复杂、长期场景的能力。为了解决这种对大型推理模型（LRM）的不完整评估，我们提出了R-HORIZON，这是一种旨在通过查询组合来刺激LRM中长期推理行为的方法。基于 R-HORIZON，我们构建了一个长期推理基准，包括复杂的多步骤推理任务，这些任务具有跨越较长推理范围的相互依赖的问题。通过使用 R-HORIZON 基准对 LRM 进行综合评估，我们发现即使是最先进的 LRM 也会遭受显着的性能下降。我们的分析表明，LRM 表现出有限的有效推理长度，并且难以在多个问题中适当分配思维预算。认识到这些局限性，我们使用 R-HORIZON 构建用于具有验证奖励的强化学习（RLVR）的长视界推理数据。与单视界数据训练相比，R-HORIZON的RLVR不仅显著提高了多视界推理任务的性能，还提高了标准推理任务的准确率，AIME2024提高了7.5。这些结果使 R-HORIZON 成为一种可扩展、可控且低成本的范例，用于增强和评估 LRM 的长期推理能力。

Training-Free Group Relative Policy Optimization

免训练组相对策略优化

Authors: Yuzheng Cai, Siqi Cai, Yuchen Shi, Zihan Xu, Lichao Chen, Yulei Qin, Xiaoyu Tan, Gang Li, Zongyi Li, Haojia Lin, Yong Mao, Ke Li, Xing Sun
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.08191
Pdf link: https://arxiv.org/pdf/2510.08191
Abstract Recent advances in Large Language Model (LLM) agents have demonstrated their promising general capabilities. However, their performance in specialized real-world domains often degrades due to challenges in effectively integrating external tools and specific prompting strategies. While methods like agentic reinforcement learning have been proposed to address this, they typically rely on costly parameter updates, for example, through a process that uses Supervised Fine-Tuning (SFT) followed by a Reinforcement Learning (RL) phase with Group Relative Policy Optimization (GRPO) to alter the output distribution. However, we argue that LLMs can achieve a similar effect on the output distribution by learning experiential knowledge as a token prior, which is a far more lightweight approach that not only addresses practical data scarcity but also avoids the common issue of overfitting. To this end, we propose Training-Free Group Relative Policy Optimization (Training-Free GRPO), a cost-effective solution that enhances LLM agent performance without any parameter updates. Our method leverages a group-relative semantic advantage instead of a numerical one within each group of rollouts, iteratively distilling high-quality experiential knowledge during multi-epoch learning on minimal ground-truth data. Such knowledge serves as the learned token prior, which is seamlessly integrated during LLM API calls to guide model behavior. Experiments on mathematical reasoning and web searching tasks demonstrate that Training-Free GRPO, when applied to DeepSeek-V3.1-Terminus, significantly improves out-of-domain performance. With just a few dozen training samples, Training-Free GRPO outperforms fine-tuned small LLMs with marginal training data and cost.
中文摘要 大型语言模型（LLM）代理的最新进展已经证明了其有前途的通用能力。然而，由于有效集成外部工具和特定提示策略方面的挑战，它们在专业现实领域的性能往往会下降。虽然已经提出了代理强化学习等方法来解决这个问题，但它们通常依赖于昂贵的参数更新，例如，通过使用监督微调（SFT）的过程，然后是带有组相对策略优化（GRPO）的强化学习（RL）阶段来改变输出分布。然而，我们认为 LLM 可以通过学习经验知识作为代币先验来实现对输出分布的类似效果，这是一种更轻量级的方法，不仅解决了实际数据稀缺问题，还避免了过度拟合的常见问题。为此，我们提出了免训练组相对策略优化（Training-Free GRPO），这是一种经济高效的解决方案，无需任何参数更新即可增强LLM代理性能。我们的方法利用每组推出中的组相对语义优势而不是数字语义优势，在多纪元学习期间迭代提炼出最小的地面实况数据的高质量经验知识。这些知识作为学习到的令牌，在 LLM API 调用期间无缝集成以指导模型行为。数学推理和网络搜索任务的实验表明，当将无训练 GRPO 应用于 DeepSeek-V3.1-Terminus 时，可以显着提高域外性能。只需几十个训练样本，Training-Free GRPO 的性能就优于微调的小型 LLM，而训练数据和成本却很低。
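
For contrast with the semantic advantage described in this abstract, the numerical group-relative advantage used by standard GRPO can be sketched as follows. This is a minimal illustration, not the paper's code: the function name `group_relative_advantages`, the `eps` term, and the use of the population standard deviation are assumptions (implementations differ on the normalization).

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own group.

    rewards: scalar rewards for rollouts sampled from the same prompt.
    eps: small constant (an assumption here) that avoids division by zero
         when every rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four rollouts with binary correctness rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Rollouts that beat the group mean receive a positive advantage and the rest a negative one. According to the abstract, Training-Free GRPO replaces these numbers with distilled natural-language experience rather than using them for gradient updates.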

Expressive Value Learning for Scalable Offline Reinforcement Learning

用于可扩展离线强化学习的表达价值学习

Authors: Nicolas Espinosa-Dice, Kiante Brantley, Wen Sun
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.08218
Pdf link: https://arxiv.org/pdf/2510.08218
Abstract Reinforcement learning (RL) is a powerful paradigm for learning to make sequences of decisions. However, RL has yet to be fully leveraged in robotics, principally due to its lack of scalability. Offline RL offers a promising avenue by training agents on large, diverse datasets, avoiding the costly real-world interactions of online RL. Scaling offline RL to increasingly complex datasets requires expressive generative models such as diffusion and flow matching. However, existing methods typically depend on either backpropagation through time (BPTT), which is computationally prohibitive, or policy distillation, which introduces compounding errors and limits scalability to larger base policies. In this paper, we consider the question of how to develop a scalable offline RL approach without relying on distillation or backpropagation through time. We introduce Expressive Value Learning for Offline Reinforcement Learning (EVOR): a scalable offline RL approach that integrates both expressive policies and expressive value functions. EVOR learns an optimal, regularized Q-function via flow matching during training. At inference-time, EVOR performs inference-time policy extraction via rejection sampling against the expressive value function, enabling efficient optimization, regularization, and compute-scalable search without retraining. Empirically, we show that EVOR outperforms baselines on a diverse set of offline RL tasks, demonstrating the benefit of integrating expressive value learning into offline RL.
中文摘要 强化学习（RL）是学习做出决策序列的强大范式。然而，RL 尚未在机器人技术中得到充分利用，这主要是由于其缺乏可扩展性。离线 RL 通过在大型、多样化的数据集上训练代理，避免了在线 RL 昂贵的现实世界交互，提供了一条有前途的途径。将离线 RL 扩展到日益复杂的数据集需要富有表现力的生成模型，例如扩散和流匹配。然而，现有方法通常依赖于随时间反向传播（BPTT），这在计算上是令人望而却步的，或者策略蒸馏，这会引入复合错误并限制对更大基本策略的可扩展性。在本文中，我们考虑了如何开发一种可扩展的离线RL方法，而不依赖蒸馏或随时间反向传播的问题。我们介绍了用于离线强化学习（EVOR）的表达价值学习：一种可扩展的离线RL方法，集成了表达策略和表达价值函数。EVOR 在训练期间通过流量匹配学习最佳的正则化 Q 函数。在推理时，EVOR 通过针对表达值函数的拒绝采样执行推理时策略提取，无需重新训练即可实现高效的优化、正则化和计算可扩展的搜索。根据经验，我们表明 EVOR 在一组不同的离线 RL 任务上优于基线，证明了将表达价值学习整合到离线 RL 中的好处。
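
The inference-time policy extraction mentioned in this abstract can be read as sampling several candidate actions from a base policy and keeping the one the value function scores highest. The sketch below shows that best-of-N reading only; it is not EVOR itself, and `extract_action`, `sample_action`, `q_value`, and the toy policy and Q-function are hypothetical names introduced for illustration.

```python
import random


def extract_action(state, sample_action, q_value, num_candidates=16, rng=None):
    """Pick the highest-valued action among candidates drawn from a base policy.

    sample_action(state, rng) -> action   (hypothetical base-policy sampler)
    q_value(state, action)    -> float    (hypothetical learned value function)
    Raising num_candidates spends more inference compute on the search
    without retraining either component.
    """
    rng = rng or random.Random(0)
    candidates = [sample_action(state, rng) for _ in range(num_candidates)]
    return max(candidates, key=lambda action: q_value(state, action))


# Toy stand-ins: a uniform base policy and a Q-function peaked at action 0.5.
def toy_policy(state, rng):
    return rng.uniform(-1.0, 1.0)


def toy_q(state, action):
    return -(action - 0.5) ** 2


best = extract_action(0.0, toy_policy, toy_q, num_candidates=64, rng=random.Random(0))
print(best, toy_q(0.0, best))
```

Because the candidates come from the base policy, the selected action stays within that policy's support, which is one way to think about the regularization the abstract refers to.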

Reinforcement Learning from Probabilistic Forecasts for Safe Decision-Making via Conditional Value-at-Risk Planning

通过条件风险价值规划从概率预测中强化学习以实现安全决策

Authors: Michal Koren, Or Peretz, Tai Dinh, Philip S. Yu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.08226
Pdf link: https://arxiv.org/pdf/2510.08226
Abstract Sequential decisions in volatile, high-stakes settings require more than maximizing expected return; they require principled uncertainty management. This paper presents the Uncertainty-Aware Markov Decision Process (UAMDP), a unified framework that couples Bayesian forecasting, posterior-sampling reinforcement learning, and planning under a conditional value-at-risk (CVaR) constraint. In a closed loop, the agent updates its beliefs over latent dynamics, samples plausible futures via Thompson sampling, and optimizes policies subject to preset risk tolerances. We establish regret bounds that converge to the Bayes-optimal benchmark under standard regularity conditions. We evaluate UAMDP in two domains, high-frequency equity trading and retail inventory control, both marked by structural uncertainty and economic volatility. Relative to strong deep learning baselines, UAMDP improves long-horizon forecasting accuracy (RMSE decreases by up to 25% and sMAPE by 32%), and these gains translate into economic performance: the trading Sharpe ratio rises from 1.54 to 1.74 while maximum drawdown is roughly halved. These results show that integrating calibrated probabilistic modeling, exploration aligned with posterior uncertainty, and risk-aware control yields a robust, generalizable approach to safer and more profitable sequential decision-making.
中文摘要 在动荡、高风险的环境中进行连续决策需要的不仅仅是最大化预期回报;它们需要有原则的不确定性管理。本文提出了不确定性感知马尔可夫决策过程（UAMDP），这是一个统一的框架，它结合了贝叶斯预测、后验抽样强化学习和条件风险价值（CVaR）约束下的规划。在闭环中，代理更新其对潜在动态的信念，通过汤普森采样对可能的未来情景进行采样，并根据预设的风险承受能力优化策略。我们建立了在标准正则条件下收敛到贝叶斯最优基准的后悔边界。我们在两个领域评估 UAMDP——高频股票交易和零售库存控制——这两个领域都以结构不确定性和经济波动为特征。相对于强大的深度学习基线，UAMDP 提高了长期预测准确性（RMSE 最多下降 25%，sMAPE 下降 32%），这些收益转化为经济表现：交易夏普比率从 1.54 上升到 1.74，而最大回撤大约减半。这些结果表明，整合校准概率建模、与后验不确定性一致的探索以及风险感知控制可以产生一种稳健的、可推广的方法，以实现更安全、更有利可图的顺序决策。
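
The CVaR constraint in this abstract can be illustrated with a sample-based estimate: average the worst alpha fraction of sampled losses, then keep only actions whose tail risk stays under a preset limit. The sketch assumes losses where higher is worse; `cvar`, `choose_action`, and the toy loss samples are hypothetical, and the paper's planner is not reproduced here.

```python
import math


def cvar(losses, alpha=0.10):
    """Empirical CVaR: mean of the worst alpha fraction of losses (higher = worse)."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    ordered = sorted(losses, reverse=True)
    tail = max(1, math.ceil(alpha * len(ordered)))
    return sum(ordered[:tail]) / tail


def choose_action(loss_samples_by_action, alpha, cvar_limit):
    """Return the feasible action with the lowest mean loss, or None if none qualifies."""
    feasible = {
        action: samples
        for action, samples in loss_samples_by_action.items()
        if cvar(samples, alpha) <= cvar_limit
    }
    if not feasible:
        return None
    return min(feasible, key=lambda a: sum(feasible[a]) / len(feasible[a]))


# Hypothetical sampled losses: "risky" has the lower mean but a heavy tail.
samples = {
    "safe": [1, 1, 1, 1, 1, 1, 1, 1, 1, 2],
    "risky": [0, 0, 0, 0, 0, 0, 0, 0, 0, 10],
}
print(choose_action(samples, alpha=0.10, cvar_limit=5))
print(choose_action(samples, alpha=0.10, cvar_limit=20))
```

With a tight limit the tail-heavy action is excluded even though its mean loss is lower; loosening the limit lets expected performance decide.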

Enhancing Reasoning for Diffusion LLMs via Distribution Matching Policy Optimization

通过分布匹配策略优化增强扩散 LLM 的推理

Authors: Yuchen Zhu, Wei Guo, Jaemoo Choi, Petr Molodyk, Bo Yuan, Molei Tao, Yongxin Chen
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.08233
Pdf link: https://arxiv.org/pdf/2510.08233
Abstract Diffusion large language models (dLLMs) are promising alternatives to autoregressive large language models (AR-LLMs), as they potentially allow higher inference throughput. Reinforcement learning (RL) is a crucial component for dLLMs to achieve comparable performance with AR-LLMs on important tasks, such as reasoning. However, RL algorithms that are well-suited for dLLMs' unique characteristics have yet to be developed. This paper proposes Distribution Matching Policy Optimization (DMPO), a principled and theoretically grounded RL fine-tuning method specifically designed to enhance the reasoning capabilities of dLLMs by matching the dLLM policy distribution to the optimal, reward-tilted one through cross-entropy optimization. We identify a key challenge in the implementation with a small training batch size and propose several effective solutions through a novel weight baseline subtraction technique. DMPO exhibits superior performance on multiple reasoning benchmarks without supervised fine-tuning, with an accuracy improvement of up to 42.9% over previous SOTA baselines and 55.8% over the base model, underscoring the effectiveness of the distribution matching framework. Our code is available at this https URL.
中文摘要 扩散大型语言模型（dLLM）是自回归大型语言模型（AR-LLM）的有前途的替代方案，因为它们可能允许更高的推理吞吐量。强化学习（RL）是 dLLM 在推理等重要任务上实现与 AR-LLM 相当性能的关键组成部分。然而，非常适合 dLLM 独特特性的 RL 算法尚未开发出来。本文提出了分布匹配策略优化（DMPO），这是一种有原则性和理论依据的RL微调方法，专门设计用于通过交叉熵优化将dLLM策略分布与最优的奖励倾斜分布相匹配，从而增强dLLM的推理能力。我们确定了小训练批量实施中的关键挑战，并通过一种新颖的权重基线减法技术提出了几种有效的解决方案。DMPO 在多个推理基准上表现出卓越的性能，无需监督微调，准确率比之前的 SOTA 基线最多提高 42.9%，比基础模型最多提高 55.8%，凸显了分布匹配框架的有效性。我们的代码可在此 https URL 中找到。

The Alignment Waltz: Jointly Training Agents to Collaborate for Safety

对齐华尔兹：联合培训代理商合作确保安全

Authors: Jingyu Zhang, Haozhu Wang, Eric Michael Smith, Sid Wang, Amr Sharaf, Mahesh Pasupuleti, Benjamin Van Durme, Daniel Khashabi, Jason Weston, Hongyuan Zhan
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.08240
Pdf link: https://arxiv.org/pdf/2510.08240
Abstract Harnessing the power of LLMs requires a delicate dance between being helpful and harmless. This creates a fundamental tension between two competing challenges: vulnerability to adversarial attacks that elicit unsafe content, and a tendency for overrefusal on benign but sensitive prompts. Current approaches often navigate this dance with safeguard models that completely reject any content that contains unsafe portions. This approach cuts the music entirely: it may exacerbate overrefusals and fails to provide nuanced guidance for queries it refuses. To teach models a more coordinated choreography, we propose WaltzRL, a novel multi-agent reinforcement learning framework that formulates safety alignment as a collaborative, positive-sum game. WaltzRL jointly trains a conversation agent and a feedback agent, where the latter is incentivized to provide useful suggestions that improve the safety and helpfulness of the conversation agent's responses. At the core of WaltzRL is a Dynamic Improvement Reward (DIR) that evolves over time based on how well the conversation agent incorporates the feedback. At inference time, unsafe or overrefusing responses from the conversation agent are improved rather than discarded. The feedback agent is deployed together with the conversation agent and only engages adaptively when needed, preserving helpfulness and low latency on safe queries. Our experiments, conducted across five diverse datasets, demonstrate that WaltzRL significantly reduces both unsafe responses (e.g., from 39.0% to 4.6% on WildJailbreak) and overrefusals (from 45.3% to 9.9% on OR-Bench) compared to various baselines. By enabling the conversation and feedback agents to co-evolve and adaptively apply feedback, WaltzRL enhances LLM safety without degrading general capabilities, thereby advancing the Pareto front between helpfulness and harmlessness.
中文摘要 利用 LLM 的力量需要在有帮助和无害之间进行微妙的舞蹈。这在两个相互竞争的挑战之间造成了根本的紧张关系：容易受到引发不安全内容的对抗性攻击，以及对良性但敏感的提示进行过度拒绝的倾向。当前的方法通常使用保护模型来驾驭这种舞蹈，这些模型完全拒绝任何包含不安全部分的内容。这种方法完全削减了音乐——它可能会加剧过度拒绝，并且无法为它拒绝的查询提供细致入微的指导。为了教模型更协调的编排，我们提出了 WaltzRL，这是一种新颖的多智能体强化学习框架，它将安全对齐表述为协作的正和博弈。WaltzRL 联合训练对话代理和反馈代理，后者被激励提供有用的建议，以提高对话代理响应的安全性和有用性。WaltzRL 的核心是动态改进奖励（DIR），该奖励会根据对话代理整合反馈的程度随着时间的推移而演变。在推理时，来自对话代理的不安全或过度拒绝的响应会得到改进，而不是丢弃。反馈代理与对话代理一起部署，并且仅在需要时自适应参与，从而保持安全查询的实用性和低延迟。我们在五个不同的数据集中进行的实验表明，与各种基线相比，WaltzRL 显着减少了不安全反应（例如，在 WildJailbreak 上从 39.0% 到 4.6%）和过度拒绝（在 OR-Bench 上从 45.3% 到 9.9%）。通过使对话和反馈代理能够共同发展并自适应地应用反馈，WaltzRL 在不降低一般能力的情况下增强了 LLM 的安全性，从而推进了有用与无害之间的帕累托前沿。

Opponent Shaping in LLM Agents

LLM 代理中的对手塑造

Authors: Marta Emili Garcia Segura, Stephen Hailes, Mirco Musolesi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2510.08255
Pdf link: https://arxiv.org/pdf/2510.08255
Abstract Large Language Models (LLMs) are increasingly being deployed as autonomous agents in real-world environments. As these deployments scale, multi-agent interactions become inevitable, making it essential to understand strategic behavior in such systems. A central open question is whether LLM agents, like reinforcement learning agents, can shape the learning dynamics and influence the behavior of others through interaction alone. In this paper, we present the first investigation of opponent shaping (OS) with LLM-based agents. Existing OS algorithms cannot be directly applied to LLMs, as they require higher-order derivatives, face scalability constraints, or depend on architectural components that are absent in transformers. To address this gap, we introduce ShapeLLM, an adaptation of model-free OS methods tailored for transformer-based agents. Using ShapeLLM, we examine whether LLM agents can influence co-players' learning dynamics across diverse game-theoretic environments. We demonstrate that LLM agents can successfully guide opponents toward exploitable equilibria in competitive games (Iterated Prisoner's Dilemma, Matching Pennies, and Chicken) and promote coordination and improve collective welfare in cooperative games (Iterated Stag Hunt and a cooperative version of the Prisoner's Dilemma). Our findings show that LLM agents can both shape and be shaped through interaction, establishing opponent shaping as a key dimension of multi-agent LLM research.
中文摘要 大型语言模型（LLM）越来越多地被部署为现实环境中的自主代理。随着这些部署的扩展，多代理交互变得不可避免，因此了解此类系统中的战略行为至关重要。一个核心的悬而未决的问题是，LLM 代理是否可以像强化学习代理一样，仅通过交互来塑造学习动态并影响他人的行为。在本文中，我们首次研究了使用基于LLM的代理进行对手塑造（OS）。现有的 OS 算法不能直接应用于 LLM，因为它们需要高阶导数，面临可扩展性限制，或者依赖于 Transformer 中不存在的架构组件。为了解决这一差距，我们引入了 ShapeLLM，这是一种针对基于 Transformer 的代理量身定制的无模型 OS 方法的改编。使用 ShapeLLM，我们研究了 LLM 代理是否可以影响不同博弈论环境中的共同玩家的学习动态。我们证明，LLM智能体可以成功地引导对手在竞争性博弈中走向可利用的均衡（迭代囚徒困境、匹配便士和胆小鬼博弈），并在合作博弈中促进协调并提高集体福利（迭代猎鹿博弈和囚徒困境的合作版本）。我们的研究结果表明，LLM智能体既可以通过交互塑造他人，也可以通过交互被塑造，从而将对手塑造确立为多智能体LLM研究的一个关键维度。

Mix- and MoE-DPO: A Variational Inference Approach to Direct Preference Optimization

混合和MoE-DPO：一种直接偏好优化的变分推理方法

Authors: Jason Bohne, Pawel Polak, David Rosenberg, Brian Bloniarz, Gary Kazantsev
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.08256
Pdf link: https://arxiv.org/pdf/2510.08256
Abstract Direct Preference Optimization (DPO) has recently emerged as a simple and effective alternative to reinforcement learning from human feedback (RLHF) for aligning large language models (LLMs) with user preferences. However, existing DPO formulations rely on a single monolithic model, which limits their expressivity in multi-task settings and their adaptability to heterogeneous or diverse preference distributions. In this work, we propose Mix- and MoE-DPO, a framework that extends DPO with both soft mixture models and mixture-of-experts (MoE) architectures, using a stochastic variational inference approach. Our method introduces a latent-variable model over expert assignments and optimizes a variational evidence lower bound (ELBO), enabling stable and efficient learning of specialized expert policies from preference data. Mix- and MoE-DPO provides three key advantages over standard DPO: (i) generalization via universal function approximation through mixtures; (ii) reward and policy specialization through expert components tailored to distinct preference modes; and (iii) contextual alignment through input-dependent soft gating that enables user-specific mixture policies. Our framework supports both shared base architectures with expert-specific policy heads and fully independent expert models, allowing flexible trade-offs between parameter efficiency and specialization. We validate our approach on a variety of model sizes and multi-preference datasets, demonstrating that Mix- and MoE-DPO offers a powerful and scalable method for preference-based LLM alignment.
中文摘要 直接偏好优化（DPO）最近出现，成为人类反馈强化学习（RLHF）的一种简单有效的替代方案，用于使大型语言模型（LLM）与用户偏好保持一致。然而，现有的 DPO 公式依赖于单一的单体模型，这限制了它们在多任务设置中的表达能力以及它们对异质或多样化偏好分布的适应性。在这项工作中，我们提出了混合和 MoE-DPO，这是一个使用随机变分推理方法通过软混合模型和专家混合（MoE）架构扩展 DPO 的框架。我们的方法引入了专家分配的潜在变量模型，并优化了变分证据下界（ELBO），从而能够从偏好数据中稳定有效地学习专业的专家政策。与标准 DPO 相比，混合和 MoE-DPO 具有三个关键优势：（i）通过混合的通用函数近似进行泛化;（ii）通过针对不同优惠模式量身定制的专家组成部分进行奖励和政策专业化;（iii）通过依赖于输入的软门控实现上下文对齐，从而实现特定于用户的混合策略。我们的框架支持具有特定于专家的策略头的共享基础架构和完全独立的专家模型，允许在参数效率和专业化之间进行灵活的权衡。我们在各种模型大小和多偏好数据集上验证了我们的方法，证明混合和 MoE-DPO 为基于偏好的 LLM 对齐提供了一种强大且可扩展的方法。

Evaluation of a Robust Control System in Real-World Cable-Driven Parallel Robots

现实世界电缆驱动并联机器人中鲁棒控制系统的评估

Authors: Damir Nurtdinov, Aliaksei Korshuk, Alexei Kornaev, Alexander Maloletov
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.08270
Pdf link: https://arxiv.org/pdf/2510.08270
Abstract This study evaluates the performance of classical and modern control methods for real-world Cable-Driven Parallel Robots (CDPRs), focusing on underconstrained systems with limited time discretization. A comparative analysis is conducted between classical PID controllers and modern reinforcement learning algorithms, including Deep Deterministic Policy Gradient (DDPG), Proximal Policy Optimization (PPO), and Trust Region Policy Optimization (TRPO). The results demonstrate that TRPO outperforms other methods, achieving the lowest root mean square (RMS) errors across various trajectories and exhibiting robustness to larger time intervals between control updates. TRPO's ability to balance exploration and exploitation enables stable control in noisy, real-world environments, reducing reliance on high-frequency sensor feedback and computational demands. These findings highlight TRPO's potential as a robust solution for complex robotic control tasks, with implications for dynamic environments and future applications in sensor fusion or hybrid control strategies.
中文摘要 本研究评估了现实世界电缆驱动并联机器人（CDPR）的经典和现代控制方法的性能，重点关注具有有限时间离散化的欠约束系统。对经典PID控制器与现代强化学习算法进行了比较分析，包括深度确定性策略梯度（DDPG）、近端策略优化（PPO）和信任区域策略优化（TRPO）。结果表明，TRPO优于其他方法，在各种轨迹上实现了最低的均方根（RMS）误差，并且对控制更新之间的较大时间间隔表现出鲁棒性。TRPO 平衡勘探和开发的能力可以在嘈杂的现实环境中实现稳定控制，减少对高频传感器反馈和计算需求的依赖。这些发现凸显了 TRPO 作为复杂机器人控制任务的强大解决方案的潜力，对动态环境和未来在传感器融合或混合控制策略中的应用具有影响。

Beyond Turn Limits: Training Deep Search Agents with Dynamic Context Window

超越回合限制：使用动态上下文窗口训练深度搜索代理

Authors: Qiaoyu Tang, Hao Xiang, Le Yu, Bowen Yu, Yaojie Lu, Xianpei Han, Le Sun, WenJuan Zhang, Pengbo Wang, Shixuan Liu, Zhenru Zhang, Jianhong Tu, Hongyu Lin, Junyang Lin
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.08276
Pdf link: https://arxiv.org/pdf/2510.08276
Abstract While recent advances in reasoning models have demonstrated cognitive behaviors through reinforcement learning, existing approaches struggle to invoke deep reasoning capabilities in multi-turn agents with long-horizon interactions. We propose DeepMiner, a novel framework that elicits such abilities by introducing high-difficulty training tasks and dynamic context window. DeepMiner presents a reverse construction method to generate complex but verifiable question-answer pairs from authentic web sources, which ensures the challenge and reliability of training data while injecting cognitive capabilities into multi-turn reasoning scenarios. We further design an elegant yet effective dynamic context management strategy for both training and inference, utilizing sliding window mechanisms while eliminating the dependency on external summarization models, thereby efficiently empowering the model to handle continuously expanding long-horizon contexts. Through reinforcement learning on Qwen3-32B, we develop DeepMiner-32B, which achieves substantial performance improvements across multiple search agent benchmarks. DeepMiner attains 33.5% accuracy on BrowseComp-en, surpassing the previous best open-source agent by almost 20 percentage points, and demonstrates consistent improvements on BrowseComp-zh, XBench-DeepSearch, and GAIA. Notably, our dynamic context management enables sustained interactions of nearly 100 turns within standard 32k context length, effectively addressing the context limitations that constrain existing multi-turn interaction systems.
中文摘要 虽然推理模型的最新进展已经通过强化学习证明了认知行为，但现有方法难以在具有长视野交互的多轮次代理中调用深度推理能力。我们提出了 DeepMiner，这是一个新颖的框架，它通过引入高难度的训练任务和动态上下文窗口来引发这种能力。DeepMiner 提出了一种反向构建方法，可以从真实的 Web 源中生成复杂但可验证的问答对，这确保了训练数据的挑战性和可靠性，同时为多轮推理场景注入了认知能力。我们进一步设计了一种优雅而有效的训练和推理动态上下文管理策略，利用滑动窗口机制，同时消除对外部摘要模型的依赖，从而有效地赋能模型处理不断扩展的长期上下文。通过对 Qwen3-32B 的强化学习，我们开发了 DeepMiner-32B，它在多个搜索代理基准测试中实现了显着的性能提升。DeepMiner 在 BrowseComp-en 上达到了 33.5% 的准确率，比之前最好的开源代理高出近 20 个百分点，并且在 BrowseComp-zh、XBench-DeepSearch 和 GAIA 上表现出持续的改进。值得注意的是，我们的动态上下文管理能够在标准的 32k 上下文长度内实现近 100 轮的持续交互，有效解决限制现有多轮交互系统的上下文限制。

Beyond Pass@k: Breadth-Depth Metrics for Reasoning Boundaries

超越Pass@k：推理边界的广度深度指标

Authors: Marius Dragoi, Ioana Pintilie, Florin Gogianu, Florin Brad
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.08325
Pdf link: https://arxiv.org/pdf/2510.08325
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm to improve Large Language Models on reasoning tasks such as coding, math or logic. To assess the reasoning boundary (the fraction of problems a model can solve) researchers often report Pass@k at large sampling budgets. Recent results reveal a crossover phenomenon: while RLVR models outperform the base model at small k values, the base model usually outperforms them when sampling a very large number of completions. This has been interpreted as evidence that base models have a larger reasoning boundary. We argue that on tasks with discrete answer spaces, such as math with numeric outputs, Pass@k at large k reflects the increasingly higher chance of success in the limit of the number of trials rather than genuine reasoning, and can therefore be misleading. We propose Cover@tau, which measures the fraction of problems that a model can solve for which at least a tau proportion of completions are correct. Unlike Pass@k, Cover@tau captures reasoning under an explicit reliability threshold: models that rely on random guessing degrade rapidly as tau increases. We evaluate several RLVR models using Cover@tau-based metrics and illustrate how the relative rankings of popular algorithms change compared to Pass@1, offering a different perspective on reasoning boundaries.
中文摘要 具有可验证奖励的强化学习（RLVR）已成为一种强大的范式，可以改进大型语言模型在编码、数学或逻辑等推理任务上。为了评估推理边界（模型可以解决的问题的比例），研究人员经常在大量抽样预算下报告Pass@k。最近的结果揭示了一种交叉现象：虽然 RLVR 模型在小 k 值下优于基础模型，但在对大量完成进行采样时，基础模型通常优于它们。这被解释为基础模型具有更大推理边界的证据。我们认为，在具有离散答案空间的任务上，例如具有数字输出的数学，大体上 k 的Pass@k反映了在试验次数限制内成功的机会越来越高，而不是真正的推理，因此可能具有误导性。我们提出了Cover@tau，它衡量模型可以解决的问题的比例，其中至少有一个 tau 比例的完成是正确的。与Pass@k不同，Cover@tau 在显式可靠性阈值下捕获推理：依赖随机猜测的模型随着 tau 的增加而迅速退化。我们使用基于Cover@tau的指标评估了几个 RLVR 模型，并说明了流行算法的相对排名与Pass@1相比如何变化，从而为推理边界提供了不同的视角。
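
The contrast between Pass@k and Cover@tau drawn in this abstract can be made concrete with a short sketch. `pass_at_k` uses the widely used unbiased estimator (n samples, c of them correct), and `cover_at_tau` follows the abstract's wording: the fraction of problems for which at least a tau proportion of completions are correct. The function names and the per-problem counts are hypothetical, and the paper's exact estimator may differ.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate for one problem: n samples, c correct, budget k <= n."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def cover_at_tau(correct_counts, n, tau):
    """Fraction of problems whose share of correct completions is at least tau."""
    return sum(1 for c in correct_counts if c / n >= tau) / len(correct_counts)


# Hypothetical results: correct completions out of n = 100 samples per problem.
n = 100
counts = [100, 60, 5, 1, 0]

mean_pass_at_64 = sum(pass_at_k(n, c, 64) for c in counts) / len(counts)
print(round(mean_pass_at_64, 3))
print(cover_at_tau(counts, n, 0.5))
print(cover_at_tau(counts, n, 0.01))
```

In this toy data, a problem solved in only 1 of 100 samples still contributes 0.64 to Pass@64, whereas it counts toward Cover@tau only when tau is at most 0.01. That is the reliability threshold the abstract describes.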

DeepEN: Personalized Enteral Nutrition for Critically Ill Patients using Deep Reinforcement Learning

DeepEN：使用深度强化学习为危重患者提供个性化肠内营养

Authors: Daniel Jason Tan, Jiayang Chen, Dilruk Perera, Kay Choong See, Mengling Feng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.08350
Pdf link: https://arxiv.org/pdf/2510.08350
Abstract We introduce DeepEN, a deep reinforcement learning (RL) framework for personalized enteral nutrition (EN) in critically ill patients. Trained offline on over 11,000 ICU patients from the MIMIC-IV database, DeepEN generates 4-hourly recommendations for caloric, protein, and fluid intake tailored to each patient's evolving physiology. The model integrates a curated, clinically informed state space with a custom reward function that balances short-term physiological and nutrition-related goals with long-term survival outcomes. Using a dueling double deep Q-network with conservative Q-learning regularization, DeepEN learns clinically realistic policies that align with high-value clinician actions while discouraging unsafe deviations. Across various qualitative and quantitative metrics, DeepEN outperforms clinician-derived and guideline-based policies, achieving a 3.7 ± 0.17 percentage-point reduction in estimated mortality (18.8% vs 22.5%) and improvements in key nutritional biomarkers. These findings highlight the potential of safe, data-driven personalization of EN therapy to improve outcomes beyond traditional guideline- or heuristic-based approaches.
中文摘要 我们介绍了 DeepEN，这是一种深度强化学习（RL）框架，用于危重患者的个性化肠内营养（EN）。DeepEN 对 MIMIC-IV 数据库中的 11,000 多名 ICU 患者进行了离线培训，根据每位患者不断变化的生理机能，生成每 4 小时的热量、蛋白质和液体摄入量建议。该模型将精心策划的、临床知情的状态空间与自定义奖励函数集成在一起，以平衡短期生理和营养相关目标与长期生存结果。使用具有保守 Q 学习正则化的双深度 Q 网络，DeepEN 学习与高价值临床医生行动相一致的临床现实策略，同时阻止不安全的偏差。在各种定性和定量指标中，DeepEN 优于临床医生衍生和基于指南的政策，估计死亡率降低了 3.7 $\pm$ 0.17 个百分点（18.8% 对 22.5%），并改善了关键营养生物标志物。这些发现凸显了安全、数据驱动的个性化 EN 治疗的潜力，可以超越传统的基于指南或启发式方法来改善结果。

QAgent: A modular Search Agent with Interactive Query Understanding

QAgent：具有交互式查询理解功能的模块化搜索代理

Authors: Yi Jiang, Lei Shen, Lujie Niu, Sendong Zhao, Wenbo Su, Bo Zheng
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.08383
Pdf link: https://arxiv.org/pdf/2510.08383
Abstract Large language models (LLMs) excel at natural language tasks but are limited by their static parametric knowledge, especially in knowledge-intensive tasks. Retrieval-augmented generation (RAG) mitigates this by integrating external information. However, (1) traditional RAG struggles with complex query understanding, and (2) even search agents trained with reinforcement learning (RL), despite their promise, still face generalization and deployment challenges. To address these limitations, we propose QAgent, a unified agentic RAG framework that employs a search agent for adaptive retrieval. This agent optimizes its understanding of the query through interactive reasoning and retrieval. To facilitate real-world application, we focus on a modular search agent for query understanding that is plug-and-play in complex systems. Specifically, the agent follows a multi-step decision process trained with RL to maximize retrieval quality and support accurate downstream answers. We further analyze the strengths and weaknesses of end-to-end RL and propose a strategy that focuses on effective retrieval, thereby enhancing generalization in LLM applications. Experiments show QAgent excels at QA and serves as a plug-and-play module for real-world deployment.
中文摘要 大型语言模型（LLM）擅长自然语言任务，但受到其静态参数知识的限制，尤其是在知识密集型任务中。检索增强生成（RAG）通过整合外部信息来缓解这种情况。然而，（1）传统的RAG在复杂的查询理解方面遇到了困难，（2）即使是经过强化学习（RL）训练的搜索代理，尽管它们前景广阔，但仍面临泛化和部署的挑战。为了解决这些限制，我们提出了 QAgent，这是一个统一的代理 RAG 框架，它使用搜索代理进行自适应检索。该代理通过交互式推理和检索来优化其对查询的理解。为了促进实际应用，我们专注于模块化搜索代理，用于在复杂系统中即插即用的查询理解。具体来说，代理遵循经过 RL 训练的多步骤决策过程，以最大限度地提高检索质量并支持准确的下游答案。我们进一步分析了端到端 RL 的优缺点，并提出了一种专注于有效检索的策略，从而增强了 LLM 应用的泛化性。实验表明 QAgent 在 QA 方面表现出色，可作为实际部署的即插即用模块。

Reinforcing Diffusion Models by Direct Group Preference Optimization

通过直接群体偏好优化强化扩散模型

Authors: Yihong Luo, Tianyang Hu, Jing Tang
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2510.08425
Pdf link: https://arxiv.org/pdf/2510.08425
Abstract While reinforcement learning methods such as Group Relative Policy Optimization (GRPO) have significantly enhanced Large Language Models, adapting them to diffusion models remains challenging. In particular, GRPO demands a stochastic policy, yet the most cost-effective diffusion samplers are based on deterministic ODEs. Recent work addresses this issue by using inefficient SDE-based samplers to induce stochasticity, but this reliance on model-agnostic Gaussian noise leads to slow convergence. To resolve this conflict, we propose Direct Group Preference Optimization (DGPO), a new online RL algorithm that dispenses with the policy-gradient framework entirely. DGPO learns directly from group-level preferences, which utilize relative information of samples within groups. This design eliminates the need for inefficient stochastic policies, unlocking the use of efficient deterministic ODE samplers and faster training. Extensive results show that DGPO trains around 20 times faster than existing state-of-the-art methods and achieves superior performance on both in-domain and out-of-domain reward metrics. Code is available at this https URL.
中文摘要 虽然组相对策略优化（GRPO）等强化学习方法显着增强了大型语言模型，但使其适应扩散模型仍然具有挑战性。特别是，GRPO 需要随机策略，但最具成本效益的扩散采样器基于确定性 ODE。最近的工作通过使用低效的基于 SDE 的采样器来诱导随机性来解决这个问题，但这种对与模型无关的高斯噪声的依赖导致收敛缓慢。为了解决这种冲突，我们提出了直接组偏好优化（DGPO），这是一种新的在线RL算法，完全省去了策略梯度框架。DGPO 直接从组级偏好中学习，该偏好利用组内样本的相对信息。这种设计消除了对低效随机策略的需求，从而解锁了高效确定性 ODE 采样器的使用和更快的训练。大量结果表明，DGPO 的训练速度比现有最先进的方法快约 20 倍，并且在域内和域外奖励指标上都取得了卓越的性能。代码可在此 https URL 中找到。

ClauseLens: Clause-Grounded, CVaR-Constrained Reinforcement Learning for Trustworthy Reinsurance Pricing

ClauseLens：基于条款、CVaR 约束的强化学习，用于值得信赖的再保险定价

Authors: Stella C. Dong, James R. Finlay
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2510.08429
Pdf link: https://arxiv.org/pdf/2510.08429
Abstract Reinsurance treaty pricing must satisfy stringent regulatory standards, yet current quoting practices remain opaque and difficult to audit. We introduce ClauseLens, a clause-grounded reinforcement learning framework that produces transparent, regulation-compliant, and risk-aware treaty quotes. ClauseLens models the quoting task as a Risk-Aware Constrained Markov Decision Process (RA-CMDP). Statutory and policy clauses are retrieved from legal and underwriting corpora, embedded into the agent's observations, and used both to constrain feasible actions and to generate clause-grounded natural language justifications. Evaluated in a multi-agent treaty simulator calibrated to industry data, ClauseLens reduces solvency violations by 51%, improves tail-risk performance by 27.9% (CVaR_0.10), and achieves 88.2% accuracy in clause-grounded explanations with retrieval precision of 87.4% and recall of 91.1%. These findings demonstrate that embedding legal context into both decision and explanation pathways yields interpretable, auditable, and regulation-aligned quoting behavior consistent with Solvency II, NAIC RBC, and the EU AI Act.
中文摘要 再保险条约定价必须满足严格的监管标准，但当前的报价做法仍然不透明且难以审计。我们介绍了 ClauseLens，这是一个基于条款的强化学习框架，可以生成透明、符合法规且具有风险意识的条约报价。ClauseLens 将报价任务建模为风险感知约束马尔可夫决策过程（RA-CMDP）。从法律和承保语料库中检索法定和政策条款，嵌入到代理人的观察中，并用于限制可行的行动并生成基于条款的自然语言理由。在根据行业数据校准的多代理条约模拟器中进行评估，ClauseLens 将偿付能力违规行为减少了 51%，将尾部风险绩效提高了 27.9% （CVaR_0.10），并在基于条款的解释中实现了 88.2% 的准确率，检索精度为 87.4%，召回率为 91.1%。这些发现表明，将法律背景嵌入决策和解释途径中会产生可解释、可审计和符合监管的引用行为，符合偿付能力 II、NAIC RBC 和欧盟人工智能法案。
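
Digest note (illustrative sketch, not taken from the paper): the abstract reports tail-risk performance as CVaR_0.10. Under the usual reading, this is the mean outcome over the worst 10% of scenarios. The function below computes that empirical quantity for a list of returns; the function name, the sample data, and the choice to round the tail size up are assumptions made for this example, and the paper's exact estimator may differ.

```python
import math


def cvar_lower_tail(returns, alpha=0.10):
    """Empirical CVaR: mean of the worst ceil(alpha * n) returns."""
    if not returns:
        raise ValueError("returns must be non-empty")
    k = max(1, math.ceil(alpha * len(returns)))  # size of the tail
    worst = sorted(returns)[:k]                   # lowest returns first
    return sum(worst) / k


# Hypothetical per-scenario underwriting returns.
sample = [-50.0, -20.0, -5.0, 0.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0]
tail_10 = cvar_lower_tail(sample, alpha=0.10)  # worst 1 of 10 scenarios
tail_20 = cvar_lower_tail(sample, alpha=0.20)  # worst 2 of 10 scenarios
```

A CVaR constraint bounds this tail average rather than the overall mean, which is why it is sensitive to rare but severe losses.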

xRouter: Training Cost-Aware LLMs Orchestration System via Reinforcement Learning

xRouter：通过强化学习训练成本感知型 LLM 编排系统

Authors: Cheng Qian, Zuxin Liu, Shirley Kokane, Akshara Prabhakar, Jielin Qiu, Haolin Chen, Zhiwei Liu, Heng Ji, Weiran Yao, Shelby Heinecke, Silvio Savarese, Caiming Xiong, Huan Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.08439
Pdf link: https://arxiv.org/pdf/2510.08439
Abstract Modern LLM deployments confront a widening cost-performance spectrum: premium models deliver strong reasoning but are expensive, while lightweight models are economical yet brittle on complex tasks. Static escalation rules and keyword heuristics under-utilize this spectrum and fail to adapt across task types. We present xRouter, a tool-calling-based routing system in which a learned router can either answer directly or invoke one or more external models. The router is trained end-to-end with reinforcement learning using an explicit, cost-aware reward that encodes cost-performance trade-offs, eliminating the need for hand-engineered routing rules. Our implementation encompasses the full reinforcement learning framework, including reward and cost accounting, as well as the deployment and evaluation pipelines. Across diverse benchmarks, xRouter achieves strong cost-performance trade-offs (e.g., substantial cost reductions at comparable task completion rates), and provides empirical insights into what reliably helps learned routing and what does not, ranging from model trainability to the difficulty of eliciting sophisticated orchestration behaviors in small open models. We hope these findings and our open implementation will serve as a practical substrate for advancing learned, cost-aware LLM orchestration.
中文摘要 现代 LLM 部署面临着不断扩大的成本效益范围：高级模型提供强大的推理，但价格昂贵，而轻量级模型经济但对复杂任务来说很脆弱。静态升级规则和关键字启发式方法未充分利用这一范围，并且无法适应任务类型。我们介绍了 xRouter，这是一个基于工具调用的路由系统，其中学习的路由器可以直接应答或调用一个或多个外部模型。路由器通过强化学习进行端到端训练，使用明确的成本感知奖励对成本效益权衡进行编码，无需手工设计的路由规则。我们的实施包括完整的强化学习框架，包括奖励和成本核算，以及部署和评估管道。在不同的基准测试中，xRouter 实现了强大的成本效益权衡（例如，在可比的任务完成率下大幅降低成本），并提供了关于哪些可靠地有助于学习路由和哪些无效的经验见解，从模型可训练性到在小型开放模型中引出复杂编排行为的难度。我们希望这些发现和我们的开放实施能够成为推进学习的、具有成本意识的 LLM 编排的实用基础。
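
Digest note (illustrative sketch, not taken from the paper): the abstract describes an "explicit, cost-aware reward that encodes cost-performance trade-offs" but does not give its form. One simple possibility is task success minus a weighted cost term. Everything below (the linear form, `cost_weight`, the route names, and the dollar figures) is hypothetical and only shows how such a reward ranks routing choices.

```python
def cost_aware_reward(success, cost_usd, cost_weight=2.0):
    """Hypothetical reward: 1 for success, minus a linear penalty on spend."""
    if cost_usd < 0:
        raise ValueError("cost_usd must be non-negative")
    return float(success) - cost_weight * cost_usd


# Hypothetical outcomes for one query: (route, task succeeded, cost in USD).
outcomes = [
    ("answer_directly", False, 0.001),
    ("call_lightweight", True, 0.005),
    ("call_premium", True, 0.050),
]
rewards = {name: cost_aware_reward(ok, cost) for name, ok, cost in outcomes}
best_route = max(rewards, key=rewards.get)
```

With these numbers the lightweight model wins because it succeeds at a tenth of the premium cost; raising `cost_weight` shifts the policy further toward cheaper routes.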

Gaze on the Prize: Shaping Visual Attention with Return-Guided Contrastive Learning

凝视奖品：通过回报引导的对比学习塑造视觉注意力

Authors: Andrew Lee, Ian Chuang, Dechen Gao, Kai Fukazawa, Iman Soltani
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.08442
Pdf link: https://arxiv.org/pdf/2510.08442
Abstract Visual Reinforcement Learning (RL) agents must learn to act based on high-dimensional image data where only a small fraction of the pixels is task-relevant. This forces agents to waste exploration and computational resources on irrelevant features, leading to sample-inefficient and unstable learning. To address this, inspired by human visual foveation, we introduce Gaze on the Prize. This framework augments visual RL with a learnable foveal attention mechanism (Gaze), guided by a self-supervised signal derived from the agent's experience pursuing higher returns (the Prize). Our key insight is that return differences reveal what matters most: If two similar representations produce different outcomes, their distinguishing features are likely task-relevant, and the gaze should focus on them accordingly. This is realized through return-guided contrastive learning that trains the attention to distinguish between the features relevant to success and failure. We group similar visual representations into positives and negatives based on their return differences and use the resulting labels to construct contrastive triplets. These triplets provide the training signal that teaches the attention mechanism to produce distinguishable representations for states associated with different outcomes. Our method achieves up to 2.4x improvement in sample efficiency and can solve tasks that the baseline fails to learn, demonstrated across a suite of manipulation tasks from the ManiSkill3 benchmark, all without modifying the underlying algorithm or hyperparameters.
中文摘要 视觉强化学习（RL）代理必须学会根据高维图像数据采取行动，其中只有一小部分像素与任务相关。这迫使智能体将探索和计算资源浪费在不相关的特征上，导致样本效率低下和学习不稳定。为了解决这个问题，受人类视觉注视点的启发，我们推出了 Gaze on the Prize。该框架通过可学习的中心凹注意力机制（凝视）增强视觉 RL，由来自智能体追求更高回报的经验（奖品）的自监督信号引导。我们的主要见解是，回报差异揭示了最重要的事情：如果两种相似的表征产生不同的结果，它们的区别特征可能与任务相关，并且目光应该相应地集中在它们身上。这是通过回报引导的对比学习来实现的，该学习训练注意力区分与成功和失败相关的特征。我们根据回报差异将相似的视觉表示分为正样本和负样本，并使用生成的标签来构建对比三元组。这些三元组提供训练信号，教导注意力机制为与不同结果相关的状态产生可区分的表示。我们的方法将样本效率最多提高到 2.4 倍，并且可以解决基线无法学习的任务，这在 ManiSkill3 基准测试的一系列操作任务中得到了证明，所有这些都无需修改底层算法或超参数。
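
Digest note (illustrative sketch, not taken from the paper): the abstract describes grouping representations into positives and negatives by return difference and building contrastive triplets from them. The snippet below sketches that idea with a standard triplet margin loss on plain Python lists. The `return_gap` threshold, the Euclidean distance, the margin, and the sample data are assumptions for this example; the paper's similarity criterion and attention model are not reproduced.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: pull positives close, push negatives away."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def make_triplets(samples, return_gap=0.5):
    """samples: list of (embedding, episode_return) pairs.

    Assumption: a sample is a positive for the anchor if their returns differ
    by less than return_gap, and a negative otherwise.
    """
    triplets = []
    for i, (anchor, anchor_ret) in enumerate(samples):
        others = [(e, r) for j, (e, r) in enumerate(samples) if j != i]
        positives = [e for e, r in others if abs(r - anchor_ret) < return_gap]
        negatives = [e for e, r in others if abs(r - anchor_ret) >= return_gap]
        triplets += [(anchor, p, n) for p in positives for n in negatives]
    return triplets


# Hypothetical embeddings with the returns their episodes achieved.
samples = [([0.0, 0.0], 1.0), ([0.1, 0.0], 0.9), ([0.2, 0.1], 0.0)]
triplets = make_triplets(samples)
losses = [triplet_loss(a, p, n) for a, p, n in triplets]
```

Here the third embedding sits close to the first two but led to a much lower return, so it becomes the negative in both triplets and the loss pushes its representation away.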

DexMan: Learning Bimanual Dexterous Manipulation from Human and Generated Videos

DexMan：从人类和生成的视频中学习双手灵巧操作

Authors: Jhen Hsieh, Kuan-Hsun Tu, Kuo-Han Hung, Tsung-Wei Ke
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.08475
Pdf link: https://arxiv.org/pdf/2510.08475
Abstract We present DexMan, an automated framework that converts human visual demonstrations into bimanual dexterous manipulation skills for humanoid robots in simulation. Operating directly on third-person videos of humans manipulating rigid objects, DexMan eliminates the need for camera calibration, depth sensors, scanned 3D object assets, or ground-truth hand and object motion annotations. Unlike prior approaches that consider only simplified floating hands, it directly controls a humanoid robot and leverages novel contact-based rewards to improve policy learning from noisy hand-object poses estimated from in-the-wild videos. DexMan achieves state-of-the-art performance in object pose estimation on the TACO benchmark, with absolute gains of 0.08 and 0.12 in ADD-S and VSD. Meanwhile, its reinforcement learning policy surpasses previous methods by 19% in success rate on OakInk-v2. Furthermore, DexMan can generate skills from both real and synthetic videos without manual data collection or costly motion capture, enabling the creation of large-scale, diverse datasets for training generalist dexterous manipulation.
中文摘要 我们展示了 DexMan，这是一个自动化框架，可将人类视觉演示转换为模拟中人形机器人的双手灵巧操作技能。DexMan 直接对人类操纵刚性物体的第三人称视频进行操作，无需相机校准、深度传感器、扫描的 3D 对象资产或地面实况手部和物体运动注释。与之前仅考虑简化的浮动手的方法不同，它直接控制人形机器人，并利用基于接触的新颖奖励来改进从野外视频中估计的嘈杂手部物体姿势的策略学习。DexMan在TACO基准测试中实现了最先进的物体姿态估计性能，在ADD-S和VSD中分别获得了0.08和0.12的绝对增益。同时，其强化学习策略在 OakInk-v2 上的成功率比以前的方法高出 19%。此外，DexMan 可以从真实视频和合成视频中生成技能，无需手动数据收集和昂贵的动作捕捉，并能够创建大规模、多样化的数据集来训练通才的灵巧操作。

Rethinking Provenance Completeness with a Learning-Based Linux Scheduler

使用基于学习的 Linux 调度器重新思考出处完整性

Authors: Jinsong Mao, Benjamin E. Ujcich, Shiqing Ma
Subjects: Cryptography and Security (cs.CR); Operating Systems (cs.OS)
Arxiv link: https://arxiv.org/abs/2510.08479
Pdf link: https://arxiv.org/pdf/2510.08479
Abstract Provenance plays a critical role in maintaining traceability of a system's actions for root cause analysis of security threats and impacts. Provenance collection is often incorporated into the reference monitor of systems to ensure that an audit trail exists of all events, that events are completely captured, and that logging of such events cannot be bypassed. However, recent research has questioned whether existing state-of-the-art provenance collection systems fail to ensure the security guarantees of a true reference monitor due to the 'super producer threat' in which provenance generation can overload a system to force the system to drop security-relevant events and allow an attacker to hide their actions. One approach towards solving this threat is to enforce resource isolation, but that does not fully solve the problems resulting from hardware dependencies and performance limitations. In this paper, we show how an operating system's kernel scheduler can mitigate this threat, and we introduce Venus, a learned scheduler for Linux specifically designed for provenance. Unlike conventional schedulers that ignore provenance completeness requirements, Venus leverages reinforcement learning to learn provenance task behavior and to dynamically optimize resource allocation. We evaluate Venus's efficacy and show that Venus significantly improves both the completeness and efficiency of provenance collection systems compared to traditional scheduling, while maintaining reasonable overheads and even improving overall runtime in certain cases compared to the default Linux scheduler.
中文摘要 出处在维护系统操作的可追溯性以分析安全威胁和影响的根本原因方面发挥着关键作用。出处收集通常被合并到系统的参考监视器中，以确保存在所有事件的审计跟踪，完全捕获事件，并且无法绕过此类事件的日志记录。然而，最近的研究质疑，现有的最先进的出处收集系统是否无法确保真正的参考监视器的安全保证，因为“超级生产者威胁”，出处生成可能会使系统过载，迫使系统放弃与安全相关的事件，并允许攻击者隐藏他们的行为。解决此威胁的一种方法是强制实施资源隔离，但这并不能完全解决硬件依赖关系和性能限制导致的问题。在本文中，我们展示了操作系统的内核调度程序如何缓解这种威胁，并介绍了 Venus，这是一个专门为出处设计的 Linux 学习调度器。与忽略出处完整性要求的传统调度器不同，Venus 利用强化学习来学习出处任务行为并动态优化资源分配。我们评估了 Venus 的功效，并表明与传统调度相比，Venus 显著提高了出处收集系统的完整性和效率，同时与默认的 Linux 调度器相比，在某些情况下保持了合理的开销，甚至提高了整体运行时间。

Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools

Video-STAR：使用工具加强开放词汇动作识别

Authors: Zhenlong Yuan, Xiangyan Qu, Chengxuan Qian, Rui Chen, Jing Tang, Lei Sun, Xiangxiang Chu, Dapeng Zhang, Yiwei Wang, Yujun Cai, Shuo Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2510.08480
Pdf link: https://arxiv.org/pdf/2510.08480
Abstract Multimodal large language models (MLLMs) have demonstrated remarkable potential in bridging visual and textual reasoning, yet their reliance on text-centric priors often limits their ability to disentangle semantically similar actions in open-vocabulary scenarios. To address this, we propose Video-STAR, a framework that harmonizes contextual sub-motion decomposition with tool-augmented reinforcement learning for open-vocabulary action recognition (OVAR). Unlike prior methods that treat actions as monolithic entities, our approach innovatively decomposes actions into discriminative sub-motions for fine-grained matching while dynamically invoking domain-specific tools for cross-modal interleaving, thereby enabling category-specific reasoning capacity and reducing cross-modal hallucination. Moreover, by designing a hierarchical reward that balances tool-usage efficiency, sub-motion relevance, and structural coherence in reasoning, our method autonomously leverages external tools to prioritize sub-motion patterns without explicit supervision, transmitting from text-centric reasoning to visually grounded inference. Extensive evaluations on HMDB-51, UCF-101, SSv2, Kinetics-400, and Kinetics-600 datasets demonstrate our state-of-the-art performance, outperforming existing methods in distinguishing fine-grained actions and handling cross-modal hallucination, validating our excellent robustness and generalization.
中文摘要 多模态大型语言模型（MLLM）在连接视觉和文本推理方面表现出了巨大的潜力，但它们对以文本为中心的先验的依赖往往限制了它们在开放词汇场景中解开语义相似动作的能力。为了解决这个问题，我们提出了 Video-STAR，这是一个将上下文子运动分解与工具增强强化学习相协调的框架，用于开放词汇动作识别（OVAR）。与之前将动作视为单体实体的方法不同，我们的方法创新地将动作分解为判别子运动以进行细粒度匹配，同时动态调用特定领域的工具进行跨模态交错，从而实现特定类别的推理能力并减少跨模态幻觉。此外，通过设计平衡工具使用效率、子运动相关性和推理结构连贯性的分层奖励，我们的方法自主地利用外部工具来优先考虑子运动模式，而无需明确的监督，从以文本为中心的推理传递到视觉基础推理。对 HMDB-51、UCF-101、SSv2、Kinetics-400 和 Kinetics-600 数据集的广泛评估证明了我们最先进的性能，在区分细粒度动作和处理跨模态幻觉方面优于现有方法，验证了我们出色的鲁棒性和泛化性。

DYNAMIX: RL-based Adaptive Batch Size Optimization in Distributed Machine Learning Systems

DYNAMIX：分布式机器学习系统中基于 RL 的自适应批量大小优化

Authors: Yuanjun Dai, Keqiang He, An Wang
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2510.08522
Pdf link: https://arxiv.org/pdf/2510.08522
Abstract Existing batch size selection approaches in distributed machine learning rely on static allocation or simplistic heuristics that fail to adapt to heterogeneous, dynamic computing environments. We present DYNAMIX, a reinforcement learning framework that formulates batch size optimization as a sequential decision-making problem using Proximal Policy Optimization (PPO). Our approach employs a multi-dimensional state representation encompassing network-level metrics, system-level resource utilization, and training statistical efficiency indicators to enable informed decision-making across diverse computational resources. Our approach eliminates the need for explicit system modeling while integrating seamlessly with existing distributed training frameworks. Through evaluations across diverse workloads, hardware configurations, and network conditions, DYNAMIX achieves up to 6.3% improvement in the final model accuracy and 46% reduction in the total training time. Our scalability experiments demonstrate that DYNAMIX maintains the best performance as cluster size increases to 32 nodes, while policy transfer experiments show that learned policies generalize effectively across related model architectures.
中文摘要 分布式机器学习中现有的批量大小选择方法依赖于静态分配或简单的启发式方法，无法适应异构、动态计算环境。我们提出了 DYNAMIX，这是一个强化学习框架，它使用近端策略优化（PPO）将批量大小优化表述为顺序决策问题。我们的方法采用多维状态表示，包括网络级指标、系统级资源利用率和训练统计效率指标，以实现跨不同计算资源的明智决策。我们的方法消除了对显式系统建模的需求，同时与现有的分布式训练框架无缝集成。通过对不同工作负载、硬件配置和网络条件的评估，DYNAMIX 的最终模型精度提高了 6.3%，总训练时间减少了 46%。我们的可扩展性实验表明，当集群大小增加到 32 个节点时，DYNAMIX 保持最佳性能，而策略转移实验表明，学习到的策略在相关模型架构中有效地泛化。
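
Digest note (illustrative sketch, not taken from the paper): the abstract states that DYNAMIX is trained with Proximal Policy Optimization (PPO). For readers unfamiliar with PPO, the function below evaluates its standard clipped surrogate objective for a single action. The value `clip_eps=0.2` is a commonly used default chosen here as an assumption; DYNAMIX's state features, action space, and reward are not reproduced.

```python
def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """PPO clipped surrogate for one sample.

    ratio: new_policy_prob / old_policy_prob for the taken action.
    advantage: estimated advantage of that action.
    """
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Taking the minimum removes the incentive to move the ratio
    # far outside the [1 - eps, 1 + eps] band.
    return min(ratio * advantage, clipped_ratio * advantage)


# A good action whose probability already rose 50%: gain is capped at 1.2.
capped_gain = ppo_clipped_objective(ratio=1.5, advantage=1.0)
# A bad action whose probability already fell 50%: objective is held at -0.8.
capped_loss = ppo_clipped_objective(ratio=0.5, advantage=-1.0)
```

The clipping keeps each policy update small, which matters when every update changes a live training job's batch size.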

Which Heads Matter for Reasoning? RL-Guided KV Cache Compression

哪些注意力头对推理很重要？RL 引导的 KV 缓存压缩

Authors: Wenjie Du, Li Jiang, Keda Tao, Xue Liu, Huan Wang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.08525
Pdf link: https://arxiv.org/pdf/2510.08525
Abstract Reasoning large language models exhibit complex reasoning behaviors through extended chain-of-thought generation, creating unprecedented Key-Value (KV) cache overhead during the decoding phase. Existing KV cache compression methods underperform on reasoning models: token-dropping methods break reasoning integrity by discarding critical information, while head-reallocating methods mistakenly compress reasoning-critical heads since they are designed for retrieval tasks, resulting in significant performance degradation as compression rates increase. We hypothesize that KV heads exhibit functional heterogeneity in reasoning models: some heads are critical for chain-of-thought consistency while others are compressible. To validate and exploit this insight, we propose RLKV, a novel reasoning-critical head identification framework, which uses reinforcement learning to directly optimize the relationship between each head's cache usage and reasoning quality. As RLKV produces rewards from actual generated samples during training, it naturally identifies heads relevant to reasoning behaviors. We then allocate full KV cache to these heads while applying compressed constant KV cache to others for efficient inference. Our experiments reveal that only a small fraction of attention heads is essential for reasoning, enabling our KV compression approach to outperform baseline methods while achieving 20-50% cache reduction with near lossless performance compared to uncompressed results.
中文摘要 推理大型语言模型通过扩展的思维链生成表现出复杂的推理行为，在解码阶段产生了前所未有的键值（KV）缓存开销。现有的 KV 缓存压缩方法在推理模型上表现不佳：标记删除方法通过丢弃关键信息来破坏推理完整性，而头重新分配方法错误地压缩了推理关键头，因为它们是为检索任务而设计的，导致随着压缩率的增加而显着降低性能。我们假设 KV 头在推理模型中表现出功能异质性——一些头对于思维链的一致性至关重要，而另一些则是可压缩的。为了验证和利用这一见解，我们提出了 RLKV，这是一种新型的推理关键头识别框架，它使用强化学习直接优化每个头的缓存使用与推理质量之间的关系。由于 RLKV 在训练过程中从实际生成的样本中产生奖励，因此它自然会识别与推理行为相关的头部。然后，我们将完整的 KV 缓存分配给这些头，同时将压缩的常量 KV 缓存应用于其他头部以实现高效推理。我们的实验表明，只有一小部分注意力头对于推理至关重要，这使得我们的 KV 压缩方法能够优于基线方法，同时实现 20-50% 的缓存减少，与未压缩的结果相比，性能近乎无损。

Convergence Theorems for Entropy-Regularized and Distributional Reinforcement Learning

熵正则化和分布强化学习的收敛定理

Authors: Yash Jhaveri, Harley Wiltzer, Patrick Shafto, Marc G. Bellemare, David Meger
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.08526
Pdf link: https://arxiv.org/pdf/2510.08526
Abstract In the pursuit of finding an optimal policy, reinforcement learning (RL) methods generally ignore the properties of learned policies apart from their expected return. Thus, even when successful, it is difficult to characterize which policies will be learned and what they will do. In this work, we present a theoretical framework for policy optimization that guarantees convergence to a particular optimal policy, via vanishing entropy regularization and a temperature decoupling gambit. Our approach realizes an interpretable, diversity-preserving optimal policy as the regularization temperature vanishes and ensures the convergence of policy-derived objects (value functions and return distributions). In a particular instance of our method, for example, the realized policy samples all optimal actions uniformly. Leveraging our temperature decoupling gambit, we present an algorithm that estimates, to arbitrary accuracy, the return distribution associated with its interpretable, diversity-preserving optimal policy.
中文摘要 为了寻找最优策略，强化学习（RL）方法通常忽略学习到的策略除预期回报之外的属性。因此，即使成功了，也很难刻画哪些策略将被学到以及它们将做什么。在这项工作中，我们提出了一个策略优化的理论框架，该框架通过消失熵正则化和温度解耦策略来保证收敛到特定的最优策略。我们的方法在正则化温度消失时实现了可解释的、保留多样性的最优策略，并确保了策略派生对象（值函数和回报分布）的收敛。例如，在我们方法的特定实例中，实现的策略均匀地采样所有最优动作。利用我们的温度解耦策略，我们提出了一种算法，该算法可以以任意精度估计与其可解释的、保持多样性的最优策略相关的回报分布。
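
Digest note (illustrative sketch, not taken from the paper): the abstract mentions an instance in which the limiting policy "samples all optimal actions uniformly" as the regularization temperature vanishes. In the simplest one-step (bandit) setting this matches a textbook fact: the entropy-regularized optimal policy is a softmax of the action values divided by the temperature, and as the temperature goes to zero it tends to the uniform distribution over the maximizing actions. The action values below are hypothetical, and the paper's temperature decoupling gambit for the sequential case is not reproduced.

```python
import math


def boltzmann_policy(q_values, temperature):
    """Softmax of q / temperature, shifted by the max for numerical stability."""
    top = max(q_values)
    weights = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


# Two tied optimal actions and one suboptimal action (hypothetical values).
q = [1.0, 1.0, 0.5]
policies = {tau: boltzmann_policy(q, tau) for tau in (1.0, 0.1, 0.01)}
near_limit = policies[0.01]
```

As the temperature shrinks, the suboptimal action's probability vanishes while the two tied actions keep equal mass, so diversity among optimal actions is preserved rather than collapsed to a single arbitrary choice.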

CoMAS: Co-Evolving Multi-Agent Systems via Interaction Rewards

CoMAS：通过交互奖励共同发展多智能体系统

Authors: Xiangyuan Xue, Yifan Zhou, Guibin Zhang, Zaibin Zhang, Yijiang Li, Chen Zhang, Zhenfei Yin, Philip Torr, Wanli Ouyang, Lei Bai
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.08529
Pdf link: https://arxiv.org/pdf/2510.08529
Abstract Self-evolution is a central research topic in enabling large language model (LLM)-based agents to continually improve their capabilities after pretraining. Recent research has witnessed a transition from reinforcement learning (RL)-free to RL-based methods. Current RL-based methods either rely on dense external reward signals or extract intrinsic reward signals from LLMs themselves. However, these approaches diverge from the self-evolution mechanisms observed in human intelligence, where individuals learn and improve through mutual discussion and collaboration. In this work, we introduce Co-Evolving Multi-Agent Systems (CoMAS), a novel framework that enables agents to improve autonomously by learning from inter-agent interactions without external supervision. CoMAS generates intrinsic rewards from rich discussion dynamics, employs an LLM-as-a-judge mechanism to formulate these rewards, and optimizes each agent's policy through RL, thereby enabling decentralized and scalable co-evolution. Experimental results demonstrate that CoMAS consistently outperforms untrained agents and achieves state-of-the-art performance across most evaluation settings. Ablation studies confirm the necessity of interaction-based reward signals and reveal promising scalability as the number and diversity of agents increase. These findings establish CoMAS as a novel and effective paradigm for self-evolution in LLM-based agents.
中文摘要 自我进化是使基于大型语言模型（LLM）的智能体在预训练后不断提高其能力的核心研究课题。最近的研究见证了从无强化学习（RL）方法到基于 RL 的方法的转变。当前基于 RL 的方法要么依赖于密集的外部奖励信号，要么从 LLM 本身提取内在奖励信号。然而，这些方法与人类智能中观察到的自我进化机制不同，在人类智能中，个体通过相互讨论和协作来学习和改进。在这项工作中，我们介绍了共同进化的多智能体系统（CoMAS），这是一种新颖的框架，使智能体能够在没有外部监督的情况下通过从智能体间交互中学习来自主改进。CoMAS 从丰富的讨论动态中产生内在奖励，采用 LLM-as-a-judge 机制来制定这些奖励，并通过 RL 优化每个代理的策略，从而实现去中心化和可扩展的协同进化。实验结果表明，CoMAS 始终优于未经训练的代理，并在大多数评估设置中实现最先进的性能。消融研究证实了基于交互的奖励信号的必要性，并揭示了随着智能体数量和多样性的增加，可扩展性很有希望。这些发现将 CoMAS 确立为基于 LLM 的代理自我进化的新颖且有效的范式。

SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models

SpatialLadder：视觉语言模型中空间推理的渐进式训练

Authors: Hongxing Li, Dingming Li, Zixuan Wang, Yuchen Yan, Hang Wu, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.08531
Pdf link: https://arxiv.org/pdf/2510.08531
Abstract Spatial reasoning remains a fundamental challenge for Vision-Language Models (VLMs), with current approaches struggling to achieve robust performance despite recent advances. We identify that this limitation stems from a critical gap: existing methods attempt to learn spatial reasoning directly without establishing the hierarchical foundations of perception and understanding. To address this challenge, we present a comprehensive methodology for building spatial intelligence progressively. We introduce SpatialLadder-26k, a multimodal dataset containing 26,610 samples spanning object localization, single image, multi-view, and video spatial reasoning tasks, constructed through a standardized pipeline that ensures systematic coverage across modalities. Building on this dataset, we design a three-stage progressive training framework that (1) establishes spatial perception through object localization, (2) develops spatial understanding through multi-dimensional spatial tasks, and (3) strengthens complex reasoning via reinforcement learning with verifiable rewards. This approach yields SpatialLadder, a 3B-parameter model that achieves state-of-the-art performance on spatial reasoning benchmarks, with 23.4% average improvement over the base model, surpassing GPT-4o by 20.8% and Gemini-2.0-Flash by 10.1%. Notably, SpatialLadder maintains strong generalization with 7.2% improvement on out-of-domain benchmarks, demonstrating that progressive training from perception to reasoning is essential for robust spatial intelligence.
中文摘要 空间推理仍然是视觉语言模型（VLM）面临的一个基本挑战，尽管最近取得了进展，但当前的方法仍难以实现稳健的性能。我们发现这种限制源于一个关键的差距：现有方法试图直接学习空间推理，而没有建立感知和理解的层次基础。为了应对这一挑战，我们提出了一种逐步构建空间智能的综合方法。我们介绍了 SpatialLadder-26k，这是一个多模态数据集，包含 26,610 个样本，涵盖对象定位、单图像、多视图和视频空间推理任务，通过标准化管道构建，确保跨模态的系统覆盖。基于该数据集，我们设计了一个三阶段渐进式训练框架，其中（1）通过对象定位建立空间感知，（2）通过多维空间任务发展空间理解，以及（3）通过强化学习加强复杂推理，并提供可验证的奖励。这种方法产生了 SpatialLadder，这是一个 3B 参数模型，在空间推理基准测试中实现了最先进的性能，比基础模型平均提高了 23.4%，比 GPT-4o 高出 20.8%，比 Gemini-2.0-Flash 高出 10.1%。值得注意的是，SpatialLadder 保持了很强的泛化性，在域外基准测试上提高了 7.2%，这表明从感知到推理的渐进式训练对于强大的空间智能至关重要。

On the optimization dynamics of RLVR: Gradient gap and step size thresholds

论RLVR的优化动力学：梯度间隙和步长阈值

Authors: Joe Suk, Yaqi Duan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Optimization and Control (math.OC); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2510.08539
Pdf link: https://arxiv.org/pdf/2510.08539
Abstract Reinforcement Learning with Verifiable Rewards (RLVR), which uses simple binary feedback to post-train large language models, has shown significant empirical success. However, a principled understanding of why it works has been lacking. This paper builds a theoretical foundation for RLVR by analyzing its training process at both the full-response (trajectory) and token levels. Central to our analysis is a quantity called the Gradient Gap, which formalizes the direction of improvement from low-reward to high-reward regions of the response space. We prove that convergence critically depends on aligning the update direction with this Gradient Gap. Moreover, we derive a sharp step-size threshold based on the magnitude of the Gradient Gap: below it, learning converges, whereas above it, performance collapses. Our theory further predicts how the critical step size must scale with response length and the success rate, thereby explaining why practical heuristics such as length normalization improve stability and showing that, with a fixed learning rate, the success rate can stagnate strictly below 100%. We validate these predictions through controlled bandit simulations and LLM experiments, including training Qwen2.5-7B with GRPO.
中文摘要 具有可验证奖励的强化学习（RLVR）使用简单的二进制反馈对大型语言模型进行后期训练，在实证上取得了显著的成功。然而，对它为什么有效缺乏原则性的理解。本文通过分析RLVR在全响应（轨迹）和token两个层面的训练过程，为RLVR奠定了理论基础。我们分析的核心是一个称为梯度间隙的量，它正式确定了响应空间从低奖励区域到高奖励区域的改进方向。我们证明，收敛在很大程度上取决于将更新方向与这个梯度间隙对齐。此外，我们根据梯度间隙的大小推导出一个尖锐的步长阈值：低于它，学习收敛，而高于它，性能崩溃。我们的理论进一步预测了临界步长必须如何随响应长度和成功率而缩放，从而解释了为什么长度归一化等实用启发式方法可以提高稳定性，并表明在固定学习率下，成功率可以严格停滞在 100% 以下。我们通过受控强盗模拟和 LLM 实验验证了这些预测，包括使用 GRPO 训练 Qwen2.5-7B。

MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization

MM-HELIX：通过整体平台和自适应混合策略优化促进多模态长链反思推理

Authors: Xiangyu Zhao, Junming Lin, Tianhao Liang, Yifan Zhou, Wenhao Chai, Yuzhe Gu, Weiyun Wang, Kai Chen, Gen Luo, Wenwei Zhang, Junchi Yan, Hua Yang, Haodong Duan, Xue Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2510.08540
Pdf link: https://arxiv.org/pdf/2510.08540
Abstract While current Multimodal Large Language Models (MLLMs) have demonstrated proficiency in reasoning tasks such as mathematics and logic, their capacity for long-chain reflective reasoning, a prerequisite for solving complex real-world problems, remains largely underexplored. In this work, we first conduct an extensive empirical investigation to evaluate this capability. Leveraging a carefully designed data synthesis engine, we construct MM-HELIX, a multimodal benchmark consisting of 1,260 samples of 42 challenging synthetic tasks that require iterative thinking and backtracking. Empirical results on this benchmark reveal that existing MLLMs exhibit significant performance deficits in long-chain reflective reasoning. To address this limitation, we generate post-training data and further explore learning paradigms for exploiting such data. We first develop the Step-Elicited Response Generation pipeline to create MM-HELIX-100K, a large-scale dataset of 100k high-quality, reflective reasoning traces for the instruction-tuning stage. Given that standard Reinforcement Learning fails on complex tasks due to sparse reward signals and catastrophic forgetting after Supervised Fine-Tuning, we propose Adaptive Hybrid Policy Optimization (AHPO), a novel training strategy that dynamically unifies offline supervision and online optimization into a single stage. This strategy enables the model to learn from expert data when rewards are sparse and conduct independent exploration once proficient. When applied to the Qwen2.5-VL-7B baseline, our method achieves a +18.6% accuracy improvement on the MM-HELIX benchmark and demonstrates strong generalization with a +5.7% average performance gain on general mathematics and logic tasks. Our work demonstrates that reflective reasoning in MLLMs can be effectively learned and generalized, paving the way for developing more capable MLLMs.
中文摘要 虽然当前的多模态大型语言模型（MLLM）已经证明了在数学和逻辑等推理任务方面的熟练程度，但它们的长链反思推理能力（解决复杂现实世界问题的先决条件）在很大程度上仍未得到充分探索。在这项工作中，我们首先进行了广泛的实证调查来评估这种能力。利用精心设计的数据合成引擎，我们构建了 MM-HELIX，这是一个多模态基准测试，由 1,260 个样本组成，涉及 42 个具有挑战性的合成任务，需要迭代思维和回溯。该基准的实证结果表明，现有的 MLLM 在长链反射推理方面表现出显着的性能缺陷。为了解决这一限制，我们生成训练后数据，并进一步探索利用此类数据的学习范式。我们首先开发了步进引发响应生成管道来创建 MM-HELIX-100K，这是一个包含 100k 个高质量、反射性推理轨迹的大规模数据集，用于指令调整阶段。鉴于标准强化学习在复杂任务上由于稀疏的奖励信号和监督微调后的灾难性遗忘而失败，我们提出了自适应混合策略优化（AHPO），这是一种将离线监督和在线优化动态统一为一个阶段的新颖训练策略。这种策略使模型能够在奖励稀疏时从专家数据中学习，并在熟练掌握后进行独立探索。当应用于Qwen2.5-VL-7B基线时，我们的方法在MM-HELIX基准测试上实现了+18.6%的准确率提高，并在一般数学和逻辑任务上表现出很强的泛化性，平均性能提升为+5.7%。我们的工作表明，MLLM 中的反思推理可以得到有效的学习和推广，为开发更强大的 MLLM 铺平道路。

Entropy Regularizing Activation: Boosting Continuous Control, Large Language Models, and Image Classification with Activation as Entropy Constraints

熵正则化激活：以激活作为熵约束促进连续控制、大型语言模型和图像分类

Authors: Zilin Kang, Chonghua Liao, Tingqiang Xu, Huazhe Xu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.08549
Pdf link: https://arxiv.org/pdf/2510.08549
Abstract We propose ERA, a new paradigm that constrains the sampling entropy above given thresholds by applying specially designed activations to the outputs of models. Our approach demonstrates broad effectiveness across different domains: 1) for large language models (LLMs), boosting the AIME 2025 score for Qwen2.5-Math-7B by 37.4%; 2) for continuous control reinforcement learning agents, improving performance by more than 30% over strong baselines such as SAC on the challenging HumanoidBench; 3) for image classification, enhancing ImageNet top-1 accuracy by 0.69% for ResNet-50. These gains are achieved with a computational overhead of less than 7%. Our work validates output activation as a powerful tool for entropy control, opening a new direction for designing simpler and more robust algorithms.
中文摘要 我们提出了 ERA，这是一种新范式，它通过对模型的输出应用专门设计的激活来将采样熵限制在给定阈值以上。我们的方法在不同领域展示了广泛的有效性：1）对于大型语言模型（LLM），将Qwen2.5-Math-7B的AIME 2025分数提高了37.4%;2）对于持续控制强化学习代理，在具有挑战性的HumanoidBench上，比强基线（例如SAC）的性能提高了30%以上;3）图像分类，ResNet-50的ImageNet top-1准确率提高了0.69%。这些收益是在不到 7% 的计算开销下实现的。我们的工作验证了输出激活作为熵控制的强大工具，为设计更简单、更稳健的算法开辟了新方向。

Improving Reasoning for Diffusion Language Models via Group Diffusion Policy Optimization

通过组扩散策略优化改进扩散语言模型的推理

Authors: Kevin Rojas, Jiahe Lin, Kashif Rasul, Anderson Schneider, Yuriy Nevmyvaka, Molei Tao, Wei Deng
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2510.08554
Pdf link: https://arxiv.org/pdf/2510.08554
Abstract Diffusion language models (DLMs) enable parallel, order-agnostic generation with iterative refinement, offering a flexible alternative to autoregressive large language models (LLMs). However, adapting reinforcement learning (RL) fine-tuning to DLMs remains an open challenge because of the intractable likelihood. Pioneering work such as diffu-GRPO estimated token-level likelihoods via one-step unmasking. While computationally efficient, this approach is severely biased. A more principled foundation lies in sequence-level likelihoods, where the evidence lower bound (ELBO) serves as a surrogate. Yet, despite this clean mathematical connection, ELBO-based methods have seen limited adoption due to the prohibitive cost of likelihood evaluation. In this work, we revisit ELBO estimation and disentangle its sources of variance. This decomposition motivates reducing variance through fast, deterministic integral approximations along a few pivotal dimensions. Building on this insight, we introduce Group Diffusion Policy Optimization (GDPO), a new RL algorithm tailored for DLMs. GDPO leverages simple yet effective Semi-deterministic Monte Carlo schemes to mitigate the variance explosion of ELBO estimators under vanilla double Monte Carlo sampling, yielding a provably lower-variance estimator under tight evaluation budgets. Empirically, GDPO achieves consistent gains over pretrained checkpoints and outperforms diffu-GRPO, one of the state-of-the-art baselines, on the majority of math, reasoning, and coding benchmarks.
中文摘要 扩散语言模型（DLM）通过迭代细化实现并行、与顺序无关的生成，为自回归大型语言模型（LLM）提供了灵活的替代方案。然而，由于似然难以计算，使强化学习（RL）微调适应 DLM 仍然是一个悬而未决的挑战。diffu-GRPO 等开创性工作通过一步去掩码估计了 token 级的似然。虽然计算效率很高，但这种方法存在严重偏差。更有原则的基础在于序列级似然，其中证据下界（ELBO）充当替代物。然而，尽管有这种干净的数学联系，但由于似然评估的成本高昂，基于ELBO的方法的采用有限。在这项工作中，我们重新审视了 ELBO 估计并理清了其方差来源。这种分解促使通过沿几个关键维度的快速、确定性积分近似来减少方差。基于这一见解，我们引入了组扩散策略优化（Group Diffusion Policy Optimization，GDPO），这是一种为 DLM 量身定制的新 RL 算法。GDPO 利用简单而有效的半确定性蒙特卡洛方案来减轻普通双蒙特卡洛采样下 ELBO 估计器的方差爆炸，从而在紧张的评估预算下产生可证明的较低方差估计器。根据经验，GDPO 在大多数数学、推理和编码基准测试中比预训练检查点取得了一致的收益，并且优于最先进的基线之一 diffu-GRPO。
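
Digest note (illustrative sketch, not taken from the paper): the abstract contrasts "vanilla double Monte Carlo sampling" with semi-deterministic schemes that replace sampling along some dimensions with deterministic integral approximations. The toy below shows why that can lower variance for a quantity of the form "integral over t of an expectation over noise". The integrand, the noise scale, and the midpoint rule are all assumptions chosen for this example; it is not the ELBO of a diffusion language model and not the paper's estimator.

```python
import random
from statistics import pvariance


def integrand(t, noise):
    # Toy stand-in for a per-timestep loss term (assumption, not an ELBO).
    return t * t + noise


def double_mc(rng, n):
    """Sample both the time t and the inner noise at random."""
    return sum(integrand(rng.random(), rng.gauss(0.0, 0.1)) for _ in range(n)) / n


def semi_deterministic(rng, n):
    """Use fixed midpoint nodes for t; sample only the inner noise."""
    return sum(integrand((i + 0.5) / n, rng.gauss(0.0, 0.1)) for i in range(n)) / n


def estimator_variance(estimator, n=8, repeats=2000, seed=0):
    """Empirical variance of an estimator across repeated runs, same budget n."""
    rng = random.Random(seed)
    return pvariance([estimator(rng, n) for _ in range(repeats)])


var_double = estimator_variance(double_mc)
var_semi = estimator_variance(semi_deterministic)
```

With the same budget of eight function evaluations per estimate, fixing the time nodes removes the variance contributed by random draws of t, at the price of a small quadrature bias.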

Agent Learning via Early Experience

通过早期经验学习代理

Authors: Kai Zhang, Xiangchao Chen, Bo Liu, Tianci Xue, Zeyi Liao, Zhihan Liu, Xiyao Wang, Yuting Ning, Zhaorun Chen, Xiaohan Fu, Jian Xie, Yuxuan Sun, Boyu Gou, Qi Qi, Zihang Meng, Jianwei Yang, Ning Zhang, Xian Li, Ashish Shah, Dat Huynh, Hengduo Li, Zi Yang, Sara Cao, Lawrence Jang, Shuyan Zhou, Jiacheng Zhu, Huan Sun, Jason Weston, Yu Su, Yifan Wu
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.08558
Pdf link: https://arxiv.org/pdf/2510.08558
Abstract A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks. However, training agents from experience data with reinforcement learning remains difficult in many environments, which either lack verifiable rewards (e.g., websites) or require inefficient long-horizon rollouts (e.g., multi-turn tool use). As a result, most current agents rely on supervised fine-tuning on expert data, which is challenging to scale and generalizes poorly. This limitation stems from the nature of expert demonstrations: they capture only a narrow range of scenarios and expose the agent to limited environment diversity. We address this limitation with a middle-ground paradigm we call early experience: interaction data generated by the agent's own actions, where the resulting future states serve as supervision without reward signals. Within this paradigm we study two strategies of using such data: (1) Implicit world modeling, which uses collected states to ground the policy in environment dynamics; and (2) Self-reflection, where the agent learns from its suboptimal actions to improve reasoning and decision-making. We evaluate across eight diverse environments and multiple model families. Our approaches consistently improve effectiveness and out-of-domain generalization, highlighting the value of early experience. Moreover, in environments with verifiable rewards, our results provide promising signals that early experience offers a strong foundation for subsequent reinforcement learning, positioning it as a practical bridge between imitation learning and fully experience-driven agents.
中文摘要 语言代理的长期目标是通过自己的经验学习和改进，最终在复杂的现实任务中超越人类。然而，在许多环境中，使用强化学习从经验数据中训练代理仍然很困难，这些环境要么缺乏可验证的奖励（例如，网站），要么需要低效的长期部署（例如，多轮工具的使用）。因此，当前大多数代理都依赖于对专家数据的监督微调，这在扩展上具有挑战性，而且泛化能力很差。这种限制源于专家演示的性质：它们只捕获了狭窄的场景范围，并使智能体暴露在有限的环境多样性中。我们通过一种我们称之为早期体验的中间地带范式来解决这一限制：由代理自身行为生成的交互数据，其中由此产生的未来状态充当监督，没有奖励信号。在这种范式中，我们研究了使用此类数据的两种策略：（1）隐式世界建模，它使用收集的状态将政策建立在环境动态的基础上;（2）自我反思，智能体从次优行为中学习，以改进推理和决策。我们评估了八个不同的环境和多个模型系列。我们的方法不断提高有效性和域外泛化，突出了早期经验的价值。此外，在具有可验证奖励的环境中，我们的结果提供了有希望的信号，表明早期经验为后续的强化学习提供了坚实的基础，将其定位为模仿学习和完全经验驱动的智能体之间的实用桥梁。
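
A minimal sketch of how reward-free "early experience" data for implicit world modeling might be assembled, based only on the abstract's description (the agent's own actions, with the resulting future states as supervision). The names `Transition`, `collect_early_experience`, `to_world_model_examples`, the prompt template, and the toy counter environment are all hypothetical; the paper's actual pipeline is not described here and may differ.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Transition:
    state: str
    action: str
    next_state: str  # observed outcome; no reward is stored


def collect_early_experience(
    states: List[str],
    propose_actions: Callable[[str], List[str]],
    step: Callable[[str, str], str],
) -> List[Transition]:
    """Try the agent's own candidate actions in each state and record what happens."""
    data = []
    for state in states:
        for action in propose_actions(state):
            data.append(Transition(state, action, step(state, action)))
    return data


def to_world_model_examples(data: List[Transition]) -> List[Tuple[str, str]]:
    """Turn transitions into (prompt, target) pairs: predict the next state (assumed format)."""
    return [
        (f"State: {t.state}\nAction: {t.action}\nNext state:", t.next_state)
        for t in data
    ]


# Toy environment (hypothetical): the state is an integer counter stored as text.
def toy_step(state: str, action: str) -> str:
    return str(int(state) + (1 if action == "inc" else -1))


def toy_policy(state: str) -> List[str]:
    return ["inc", "dec"]


experience = collect_early_experience(["0", "5"], toy_policy, toy_step)
examples = to_world_model_examples(experience)
```

The resulting pairs could be mixed into ordinary supervised fine-tuning data; how the paper weights or schedules such data is not stated in the abstract.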

Keyword: diffusion policy

Improving Reasoning for Diffusion Language Models via Group Diffusion Policy Optimization

通过组扩散策略优化改进扩散语言模型的推理

Authors: Kevin Rojas, Jiahe Lin, Kashif Rasul, Anderson Schneider, Yuriy Nevmyvaka, Molei Tao, Wei Deng
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2510.08554
Pdf link: https://arxiv.org/pdf/2510.08554
Abstract Diffusion language models (DLMs) enable parallel, order-agnostic generation with iterative refinement, offering a flexible alternative to autoregressive large language models (LLMs). However, adapting reinforcement learning (RL) fine-tuning to DLMs remains an open challenge because of the intractable likelihood. Pioneering work such as diffu-GRPO estimated token-level likelihoods via one-step unmasking. While computationally efficient, this approach is severely biased. A more principled foundation lies in sequence-level likelihoods, where the evidence lower bound (ELBO) serves as a surrogate. Yet, despite this clean mathematical connection, ELBO-based methods have seen limited adoption due to the prohibitive cost of likelihood evaluation. In this work, we revisit ELBO estimation and disentangle its sources of variance. This decomposition motivates reducing variance through fast, deterministic integral approximations along a few pivotal dimensions. Building on this insight, we introduce Group Diffusion Policy Optimization (GDPO), a new RL algorithm tailored for DLMs. GDPO leverages simple yet effective Semi-deterministic Monte Carlo schemes to mitigate the variance explosion of ELBO estimators under vanilla double Monte Carlo sampling, yielding a provably lower-variance estimator under tight evaluation budgets. Empirically, GDPO achieves consistent gains over pretrained checkpoints and outperforms diffu-GRPO, one of the state-of-the-art baselines, on the majority of math, reasoning, and coding benchmarks.
中文摘要 扩散语言模型（DLM）通过迭代细化实现并行、与顺序无关的生成，为自回归大型语言模型（LLM）提供了灵活的替代方案。然而，由于可能性难以解决，使强化学习（RL）微调适应 DLM 仍然是一个悬而未决的挑战。diffu-GRPO 等开创性工作通过一步揭露估计了代币级的可能性。虽然计算效率很高，但这种方法存在严重偏差。更有原则的基础在于序列水平似然，其中证据下界（ELBO）充当替代物。然而，尽管有这种干净的数学联系，但由于似然评估的成本高昂，基于ELBO的方法的采用有限。在这项工作中，我们重新审视了 ELBO 估计并理清了其方差来源。这种分解促使通过沿几个关键维度的快速、确定性积分近似来减少方差。基于这一见解，我们引入了 Group Diffusion Policy Optimization（GDPO），这是一种为 DLM 量身定制的新 RL 算法。GDPO 利用简单而有效的半确定性蒙特卡洛方案来减轻普通双蒙特卡洛采样下 ELBO 估计器的方差爆炸，从而在紧张的评估预算下产生可证明的较低方差估计器。根据经验，GDPO 在大多数数学、推理和编码基准测试中比预训练检查点取得了一致的收益，并且优于最先进的基线之一 diffu-GRPO。
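
The contrast the abstract draws between double Monte Carlo sampling and a semi-deterministic scheme can be illustrated on a toy problem. The sketch below is not GDPO and does not evaluate an ELBO. It estimates a nested expectation over a "time" variable `t` and a noise variable `z`, once by sampling both, and once by replacing the `t` integral with Gauss-Legendre quadrature while still sampling `z`. The integrand `toy_integrand`, the choice of `t` as the deterministic dimension, and the budget of 4 evaluations are assumptions made for illustration.

```python
import numpy as np


def toy_integrand(t, z):
    # Hypothetical stand-in for a per-timestep loss term: smooth in t, noisy in z.
    return t ** 2 + 0.1 * z


def double_monte_carlo(rng, num_samples):
    """Sample both t ~ U(0, 1) and z ~ N(0, 1), then average."""
    t = rng.uniform(0.0, 1.0, size=num_samples)
    z = rng.standard_normal(num_samples)
    return float(toy_integrand(t, z).mean())


def semi_deterministic(rng, num_nodes):
    """Integrate over t with fixed quadrature nodes; keep Monte Carlo only for z."""
    nodes, weights = np.polynomial.legendre.leggauss(num_nodes)
    t = 0.5 * (nodes + 1.0)  # map nodes from [-1, 1] to [0, 1]
    w = 0.5 * weights        # rescaled weights sum to 1
    z = rng.standard_normal(num_nodes)
    return float(np.sum(w * toy_integrand(t, z)))


def empirical_variance(estimator, budget, trials=2000, seed=0):
    """Variance of an estimator across repeated runs with the same evaluation budget."""
    rng = np.random.default_rng(seed)
    return float(np.var([estimator(rng, budget) for _ in range(trials)]))


var_double = empirical_variance(double_monte_carlo, budget=4)
var_semi = empirical_variance(semi_deterministic, budget=4)
```

On this toy integrand, `var_semi` comes out smaller than `var_double` because the randomness in `t` is removed entirely; whether and how much this carries over to ELBO estimation for diffusion language models is what the paper itself analyzes.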

ResAD: Normalized Residual Trajectory Modeling for End-to-End Autonomous Driving

ResAD：端到端自动驾驶的归一化残差轨迹建模

Authors: Zhiyu Zheng, Shaoyu Chen, Haoran Yin, Xinbang Zhang, Jialv Zou, Xinggang Wang, Qian Zhang, Lefei Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.08562
Pdf link: https://arxiv.org/pdf/2510.08562
Abstract End-to-end autonomous driving (E2EAD) systems, which learn to predict future trajectories directly from sensor data, are fundamentally challenged by the inherent spatio-temporal imbalance of trajectory data. This imbalance creates a significant optimization burden, causing models to learn spurious correlations instead of causal inference, while also prioritizing uncertain, distant predictions, thereby compromising immediate safety. To address these issues, we propose ResAD, a novel Normalized Residual Trajectory Modeling framework. Instead of predicting the future trajectory directly, our approach reframes the learning task to predict the residual deviation from a deterministic inertial reference. The inertial reference serves as a counterfactual, forcing the model to move beyond simple pattern recognition and instead identify the underlying causal factors (e.g., traffic rules, obstacles) that necessitate deviations from a default, inertially-guided path. To deal with the optimization imbalance caused by uncertain, long-term horizons, ResAD further incorporates Point-wise Normalization of the predicted residual. It re-weights the optimization objective, preventing large-magnitude errors associated with distant, uncertain waypoints from dominating the learning signal. Extensive experiments validate the effectiveness of our framework. On the NAVSIM benchmark, ResAD achieves a state-of-the-art PDMS of 88.6 using a vanilla diffusion policy with only two denoising steps, demonstrating that our approach significantly simplifies the learning task and improves model performance. The code will be released to facilitate further research.
中文摘要 端到端自动驾驶（E2EAD）系统学习直接从传感器数据中预测未来轨迹，从根本上受到轨迹数据固有时空不平衡的挑战。这种不平衡造成了巨大的优化负担，导致模型学习虚假相关性而不是因果推理，同时还优先考虑不确定的、遥远的预测，从而危及即时安全。为了解决这些问题，我们提出了 ResAD，这是一种新颖的归一化残差轨迹建模框架。我们的方法不是直接预测未来的轨迹，而是重新构建学习任务，以预测与确定性惯性参考的残差偏差。惯性参考作为反事实，迫使模型超越简单的模式识别，而是识别需要偏离默认惯性引导路径的潜在因果因素（例如，交通规则、障碍物）。为了应对不确定的长期视野造成的优化不平衡，ResAD 进一步结合了预测残差的逐点归一化。它重新加权优化目标，防止与遥远、不确定的航路点相关的大幅度误差主导学习信号。广泛的实验验证了我们框架的有效性。在 NAVSIM 基准测试中，ResAD 使用普通扩散策略实现了 88.6 的最先进的 PDMS，只需两个去噪步骤，这表明我们的方法显着简化了学习任务并提高了模型性能。该代码将发布以促进进一步研究。
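
A minimal sketch of the two ideas named in the abstract: predicting a residual relative to a deterministic inertial reference, and normalizing that residual per waypoint. The abstract does not give the exact formulas, so the constant-velocity form of the reference, the per-waypoint standard deviation used as the scale, and all function names below are assumptions rather than the authors' implementation.

```python
import numpy as np


def inertial_reference(velocity, num_waypoints, dt):
    """Constant-velocity extrapolation from the ego origin (assumed form of the reference)."""
    steps = np.arange(1, num_waypoints + 1, dtype=float)[:, None]   # shape (T, 1)
    return steps * dt * np.asarray(velocity, dtype=float)[None, :]  # shape (T, 2)


def fit_pointwise_scale(residuals, eps=1e-6):
    """Per-waypoint, per-coordinate scale from a batch of residuals with shape (N, T, 2)."""
    return residuals.std(axis=0) + eps  # shape (T, 2)


def encode(trajectory, velocity, dt, scale):
    """Training target: normalized deviation of the trajectory from the inertial path."""
    reference = inertial_reference(velocity, trajectory.shape[0], dt)
    return (trajectory - reference) / scale


def decode(normalized_residual, velocity, dt, scale):
    """Invert encode(): rescale the predicted residual and add the reference back."""
    reference = inertial_reference(velocity, normalized_residual.shape[0], dt)
    return normalized_residual * scale + reference


# Usage with synthetic data: 3 waypoints, 0.5 s apart, ego moving at 2 m/s along x.
velocity, dt = [2.0, 0.0], 0.5
reference = inertial_reference(velocity, 3, dt)
rng = np.random.default_rng(0)
trajectories = reference + rng.normal(size=(16, 3, 2))
scale = fit_pointwise_scale(trajectories - reference)
target = encode(trajectories[0], velocity, dt, scale)
recovered = decode(target, velocity, dt, scale)
```

Because each waypoint is divided by its own scale, distant waypoints with large raw deviations contribute to a squared-error loss on a footing comparable to nearby ones, which is the re-weighting effect the abstract describes.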