Arxiv Papers of Today

Generated: 2025-10-13 16:31:14 (UTC+8); arXiv release time: 2025-10-13 20:00 EDT (2025-10-14 08:00 UTC+8)

44 relevant papers today

Keyword: reinforcement learning

LadderSym: A Multimodal Interleaved Transformer for Music Practice Error Detection

Authors: Benjamin Shiue-Hal Chou, Purvish Jajal, Nick John Eliopoulos, James C. Davis, George K. Thiruvathukal, Kristen Yeon-Ji Yun, Yung-Hsiang Lu
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2510.08580
Pdf link: https://arxiv.org/pdf/2510.08580
Abstract Music learners can greatly benefit from tools that accurately detect errors in their practice. Existing approaches typically compare audio recordings to music scores using heuristics or learnable models. This paper introduces LadderSym, a novel Transformer-based method for music error detection. LadderSym is guided by two key observations about the state-of-the-art approaches: (1) late fusion limits inter-stream alignment and cross-modality comparison capability; and (2) reliance on score audio introduces ambiguity in the frequency spectrum, degrading performance in music with concurrent notes. To address these limitations, LadderSym introduces (1) a two-stream encoder with inter-stream alignment modules to improve audio comparison capabilities and error detection F1 scores, and (2) a multimodal strategy that leverages both audio and symbolic scores by incorporating symbolic representations as decoder prompts, reducing ambiguity and improving F1 scores. We evaluate our method on the MAESTRO-E and CocoChorales-E datasets by measuring the F1 score for each note category. Compared to the previous state of the art, LadderSym more than doubles F1 for missed notes on MAESTRO-E (26.8% → 56.3%) and improves extra note detection by 14.4 points (72.0% → 86.4%). Similar gains are observed on CocoChorales-E. This work introduces general insights about comparison models that could inform sequence evaluation tasks for reinforcement learning, human skill assessment, and model evaluation.

GRPO-GCC: Enhancing Cooperation in Spatial Public Goods Games via Group Relative Policy Optimization with Global Cooperation Constraint

Authors: Zhaoqilin Yang, Chanchan Li, Tianqi Liu, Hongxin Zhao, Youliang Tian
Subjects: Multiagent Systems (cs.MA); Computer Science and Game Theory (cs.GT)
Arxiv link: https://arxiv.org/abs/2510.08607
Pdf link: https://arxiv.org/pdf/2510.08607
Abstract Inspired by the principle of self-regulating cooperation in collective institutions, we propose the Group Relative Policy Optimization with Global Cooperation Constraint (GRPO-GCC) framework. This work is the first to introduce GRPO into spatial public goods games, establishing a new deep reinforcement learning baseline for structured populations. GRPO-GCC integrates group relative policy optimization with a global cooperation constraint that strengthens incentives at intermediate cooperation levels while weakening them at extremes. This mechanism aligns local decision making with sustainable collective outcomes and prevents collapse into either universal defection or unconditional cooperation. The framework advances beyond existing approaches by combining group-normalized advantage estimation, a reference-anchored KL penalty, and a global incentive term that dynamically adjusts cooperative payoffs. As a result, it achieves accelerated cooperation onset, stabilized policy adaptation, and long-term sustainability. GRPO-GCC demonstrates how a simple yet global signal can reshape incentives toward resilient cooperation, and provides a new paradigm for multi-agent reinforcement learning in socio-technical systems.
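
The two ingredients named in the GRPO-GCC abstract, group-normalized advantage estimation and a global incentive that is strongest at intermediate cooperation levels, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the parabolic bonus `4 * rho * (1 - rho)`, and the `strength` parameter are hypothetical choices made here, and the KL penalty is omitted.

```python
import statistics

def group_normalized_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: center and scale rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def cooperation_bonus(coop_fraction, strength=1.0):
    """Hypothetical global incentive: peaks at 50% cooperation, vanishes at 0% and 100%."""
    return strength * 4.0 * coop_fraction * (1.0 - coop_fraction)

def shaped_rewards(payoffs, actions, coop_fraction, strength=1.0):
    """Add the global bonus to the payoff of cooperators only (action == 1)."""
    bonus = cooperation_bonus(coop_fraction, strength)
    return [p + bonus * a for p, a in zip(payoffs, actions)]
```

With this shape, the extra payoff for cooperating fades as the population approaches either universal defection or unconditional cooperation, which mirrors the qualitative behavior the abstract describes.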

PARSE: LLM Driven Schema Optimization for Reliable Entity Extraction

Authors: Anubhav Shrimal, Aryan Jain, Soumyajit Chowdhury, Promod Yenigalla
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.08623
Pdf link: https://arxiv.org/pdf/2510.08623
Abstract Structured information extraction from unstructured text is critical for emerging Software 3.0 systems where LLM agents autonomously interact with APIs and tools. Recent approaches apply large language models directly to extraction tasks using existing JSON schemas, often with constraint decoding or reinforcement learning approaches to ensure syntactic validity, but treat JSON schemas as static contracts designed for human developers, leading to suboptimal extraction performance, frequent hallucinations, and unreliable agent behavior when schemas contain ambiguous or incomplete specifications. We recognize that JSON schemas themselves are a form of natural language understanding contract that encodes rules, relationships, and expectations about data structure, contracts that LLMs should be able to both interpret and systematically improve. Consequently, we develop PARSE (Parameter Automated Refinement and Schema Extraction), a novel system with two synergistic components: ARCHITECT, which autonomously optimizes JSON schemas for LLM consumption while maintaining backward compatibility through RELAY (an integrated code generation system), and SCOPE, which implements reflection-based extraction with combined static and LLM-based guardrails. We evaluate PARSE qualitatively and quantitatively on three datasets, namely Schema-Guided Dialogue (SGD), Structured Web Data Extraction (SWDE), and internal retail conversation data, and find that it achieves up to 64.7% improvement in extraction accuracy on SWDE with combined framework improvements reaching 10% across models, while reducing extraction errors by 92% within the first retry and maintaining practical latency.

Don't Waste Mistakes: Leveraging Negative RL-Groups via Confidence Reweighting

Authors: Yunzhen Feng, Parag Jain, Anthony Hartshorn, Yaqi Duan, Julia Kempe
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.08696
Pdf link: https://arxiv.org/pdf/2510.08696
Abstract Reinforcement learning with verifiable rewards (RLVR) has become a standard recipe for improving large language models (LLMs) on reasoning tasks, with Group Relative Policy Optimization (GRPO) widely used in practice. Yet GRPO wastes substantial compute on negative groups: groups in which no sampled response is correct yield zero advantage and thus no gradient. We ask whether negative groups can be leveraged without extra supervision. Starting from a maximum-likelihood (MLE) objective in reward modeling, we show that the MLE gradient is equivalent to a policy gradient for a modified value function. This value function adds a confidence-weighted penalty on incorrect responses, imposing larger penalties on more confident mistakes. We refer to this as Likelihood Estimation with Negative Samples (LENS). LENS modifies GRPO to assign non-zero, confidence-dependent rewards to incorrect generations, making negative groups informative and converting previously wasted samples into useful gradient updates. On the MATH benchmark with Llama-3.1-8B and Qwen-2.5-3B, the proposed variant consistently outperforms the GRPO baseline, with significant gains on harder items. These results demonstrate a principled and practical way to "rescue" negative groups, improving efficiency and performance in RLVR.
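
The zero-gradient problem for negative groups, and one way a confidence-dependent reward removes it, can be sketched as follows. This is a minimal illustration and not the LENS objective derived in the paper: the function names, the use of `exp(average token log-probability)` as the confidence score, and the `weight` parameter are assumptions introduced here.

```python
import math
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: rewards standardized within one group of samples."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def confidence_penalized_rewards(correct, avg_logprobs, weight=1.0):
    """Correct responses get reward 1; incorrect ones get -weight * confidence,
    where confidence = exp(average token log-probability), a value in (0, 1]."""
    return [1.0 if ok else -weight * math.exp(lp)
            for ok, lp in zip(correct, avg_logprobs)]

# A negative group under binary rewards: every advantage is zero, so no gradient.
binary = grpo_advantages([0.0, 0.0, 0.0, 0.0])

# The same kind of group with confidence-dependent rewards (hypothetical log-probs):
# the most confident mistake receives the most negative advantage.
shaped = grpo_advantages(
    confidence_penalized_rewards([False, False, False], [-0.1, -1.0, -3.0])
)
```

The point of the sketch is only that any reward which varies across incorrect responses makes the group-standardized advantages non-zero.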

SAFER-AiD: Saccade-Assisted Foveal-peripheral vision Enhanced Reconstruction for Adversarial Defense

Authors: Jiayang Liu, Daniel Tso, Yiming Bu, Qinru Qiu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.08761
Pdf link: https://arxiv.org/pdf/2510.08761
Abstract Adversarial attacks significantly challenge the safe deployment of deep learning models, particularly in real-world applications. Traditional defenses often rely on computationally intensive optimization (e.g., adversarial training or data augmentation) to improve robustness, whereas the human visual system achieves inherent robustness to adversarial perturbations through evolved biological mechanisms. We hypothesize that attention-guided non-homogeneous sparse sampling and predictive coding play a key role in this robustness. To test this hypothesis, we propose a novel defense framework incorporating three key biological mechanisms: foveal-peripheral processing, saccadic eye movements, and cortical filling-in. Our approach employs reinforcement learning-guided saccades to selectively capture multiple foveal-peripheral glimpses, which are integrated into a reconstructed image before classification. This biologically inspired preprocessing effectively mitigates adversarial noise, preserves semantic integrity, and notably requires no retraining or fine-tuning of downstream classifiers, enabling seamless integration with existing systems. Experiments on the ImageNet dataset demonstrate that our method improves system robustness across diverse classifiers and attack types, while significantly reducing training overhead compared to both biologically and non-biologically inspired defense techniques.

Reinforcement Learning-Based Optimization of CT Acquisition and Reconstruction Parameters Through Virtual Imaging Trials

Authors: David Fenwick, Navid NaderiAlizadeh, Vahid Tarokh, Nicholas Felice, Darin Clark, Jayasai Rajagopal, Anuj Kapadia, Benjamin Wildman-Tobriner, Ehsan Samei, Ehsan Abadi
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.08763
Pdf link: https://arxiv.org/pdf/2510.08763
Abstract Protocol optimization is critical in Computed Tomography (CT) to achieve high diagnostic image quality while minimizing radiation dose. However, due to the complex interdependencies among CT acquisition and reconstruction parameters, traditional optimization methods rely on exhaustive testing of combinations of these parameters, which is often impractical. This study introduces a novel methodology that combines virtual imaging tools with reinforcement learning to optimize CT protocols more efficiently. Human models with liver lesions were imaged using a validated CT simulator and reconstructed with a novel CT reconstruction toolkit. The optimization parameter space included tube voltage, tube current, reconstruction kernel, slice thickness, and pixel size. The optimization process was performed using a Proximal Policy Optimization (PPO) agent, which was trained to maximize an image quality objective, specifically the detectability index (d') of liver lesions in the reconstructed images. Optimization performance was compared against an exhaustive search performed on a supercomputer. The proposed reinforcement learning approach achieved the global maximum d' across test cases while requiring 79.7% fewer steps than the exhaustive search, demonstrating both accuracy and computational efficiency. The proposed framework is flexible and can accommodate various image quality objectives. The findings highlight the potential of integrating virtual imaging tools with reinforcement learning for CT protocol management.

Zero-Shot Policy Transfer in Reinforcement Learning using Buckingham's Pi Theorem

Authors: Francisco Pascoa, Ian Lalonde, Alexandre Girard
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.08768
Pdf link: https://arxiv.org/pdf/2510.08768
Abstract Reinforcement learning (RL) policies often fail to generalize to new robots, tasks, or environments with different physical parameters, a challenge that limits their real-world applicability. This paper presents a simple, zero-shot transfer method based on Buckingham's Pi Theorem to address this limitation. The method adapts a pre-trained policy to new system contexts by scaling its inputs (observations) and outputs (actions) through a dimensionless space, requiring no retraining. The approach is evaluated against a naive transfer baseline across three environments of increasing complexity: a simulated pendulum, a physical pendulum for sim-to-real validation, and the high-dimensional HalfCheetah. Results demonstrate that the scaled transfer exhibits no loss of performance on dynamically similar contexts. Furthermore, on non-similar contexts, the scaled policy consistently outperforms the naive transfer, significantly expanding the volume of contexts where the original policy remains effective. These findings demonstrate that dimensional analysis provides a powerful and practical tool to enhance the robustness and generalization of RL policies.
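
The input/output scaling through a dimensionless space can be sketched for the pendulum case as follows. This is a minimal illustration under stated assumptions rather than the paper's code: it assumes a torque-controlled pendulum described by mass `m`, length `l`, and gravity `g`, uses `sqrt(l/g)` as the characteristic time and `m*g*l` as the characteristic torque, and the name `make_scaled_policy` is hypothetical.

```python
import math

def make_scaled_policy(policy, src, dst):
    """Wrap a torque policy trained on pendulum `src` so it can drive pendulum `dst`.

    Each context is a dict with mass m [kg], length l [m], gravity g [m/s^2].
    `policy(theta, omega)` returns a torque in the units of the source system.
    """
    def time_scale(c):      # characteristic time sqrt(l/g)
        return math.sqrt(c["l"] / c["g"])

    def torque_scale(c):    # characteristic torque m*g*l
        return c["m"] * c["g"] * c["l"]

    def scaled(theta, omega):
        omega_star = omega * time_scale(dst)      # dimensionless angular velocity
        omega_src = omega_star / time_scale(src)  # same value in source units
        tau_src = policy(theta, omega_src)        # query the unchanged policy
        tau_star = tau_src / torque_scale(src)    # dimensionless torque
        return tau_star * torque_scale(dst)       # torque in target units

    return scaled

# Usage with a hypothetical linear policy and two pendulum contexts.
policy = lambda theta, omega: -2.0 * theta - 0.5 * omega
src = {"m": 1.0, "l": 1.0, "g": 10.0}
dst = {"m": 2.0, "l": 4.0, "g": 10.0}
tau = make_scaled_policy(policy, src, dst)(0.1, 0.3)
```

The angle needs no scaling because it is already dimensionless; only the angular velocity and the torque pass through the conversion, and the policy itself is never retrained.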

Prioritizing Latency with Profit: A DRL-Based Admission Control for 5G Network Slices

Authors: Proggya Chakraborty, Aaquib Asrar, Jayasree Sengupta, Sipra Das Bit
Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Performance (cs.PF)
Arxiv link: https://arxiv.org/abs/2510.08769
Pdf link: https://arxiv.org/pdf/2510.08769
Abstract 5G networks enable diverse services such as eMBB, URLLC, and mMTC through network slicing, necessitating intelligent admission control and resource allocation to meet stringent QoS requirements while maximizing Network Service Provider (NSP) profits. However, existing Deep Reinforcement Learning (DRL) frameworks focus primarily on profit optimization without explicitly accounting for service delay, potentially leading to QoS violations for latency-sensitive slices. Moreover, commonly used epsilon-greedy exploration of DRL often results in unstable convergence and suboptimal policy learning. To address these gaps, we propose DePSAC -- a Delay and Profit-aware Slice Admission Control scheme. Our DRL-based approach incorporates a delay-aware reward function, where penalties due to service delay incentivize the prioritization of latency-critical slices such as URLLC. Additionally, we employ Boltzmann exploration to achieve smoother and faster convergence. We implement and evaluate DePSAC on a simulated 5G core network substrate with realistic Network Slice Request (NSLR) arrival patterns. Experimental results demonstrate that our method outperforms the DSARA baseline in terms of overall profit, reduced URLLC slice delays, improved acceptance rates, and improved resource consumption. These findings validate the effectiveness of the proposed DePSAC in achieving better QoS-profit trade-offs for practical 5G network slicing scenarios.
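
Boltzmann exploration and a delay-aware reward can be sketched as follows. This is a generic illustration, not the DePSAC implementation: the linear penalty on delay beyond a budget, and the names `delay_budget` and `penalty`, are assumptions made here for the example.

```python
import math
import random

def boltzmann_probs(q_values, temperature=1.0):
    """Softmax over Q-values; lower temperature concentrates mass on the best action."""
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

def boltzmann_action(q_values, temperature=1.0, rng=random):
    """Sample an action index in proportion to its Boltzmann probability."""
    probs = boltzmann_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]

def delay_aware_reward(profit, delay, delay_budget, penalty=1.0):
    """Hypothetical reward: profit minus a penalty for delay beyond the slice's budget."""
    return profit - penalty * max(0.0, delay - delay_budget)
```

Unlike epsilon-greedy, which explores uniformly at random, the Boltzmann rule explores in proportion to estimated value, so clearly poor actions are tried less often.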

Guiding Exploration in Reinforcement Learning Through LLM-Augmented Observations

Authors: Vaibhav Jain, Gerrit Grossmann
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.08779
Pdf link: https://arxiv.org/pdf/2510.08779
Abstract Reinforcement Learning (RL) agents often struggle in sparse-reward environments where traditional exploration strategies fail to discover effective action sequences. Large Language Models (LLMs) possess procedural knowledge and reasoning capabilities from text pretraining that could guide RL exploration, but existing approaches create rigid dependencies where RL policies must follow LLM suggestions or incorporate them directly into reward functions. We propose a framework that provides LLM-generated action recommendations through augmented observation spaces, allowing RL agents to learn when to follow or ignore this guidance. Our method leverages LLMs' world knowledge and reasoning abilities while maintaining flexibility through soft constraints. We evaluate our approach on three BabyAI environments of increasing complexity and show that the benefits of LLM guidance scale with task difficulty. In the most challenging environment, we achieve 71% relative improvement in final success rates over baseline. The approach provides substantial sample efficiency gains, with agents reaching performance thresholds up to 9 times faster, and requires no modifications to existing RL algorithms. Our results demonstrate an effective method for leveraging LLM planning capabilities to accelerate RL training in challenging environments.
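
Passing an LLM recommendation through the observation, rather than through the reward or a hard action override, can be sketched as follows. This is a minimal illustration under stated assumptions: the one-hot encoding of the hint and the function name `augment_observation` are hypothetical, and the abstract does not specify how the recommendation is encoded.

```python
def augment_observation(obs, suggested_action, num_actions):
    """Append a one-hot encoding of the LLM's suggested action to the observation.

    suggested_action=None (no hint available) yields an all-zero hint block,
    so the observation size stays fixed for the RL algorithm.
    """
    hint = [0.0] * num_actions
    if suggested_action is not None:
        hint[suggested_action] = 1.0
    return list(obs) + hint

# Usage: a 2-dimensional observation and a hint pointing at action 2 of 4.
augmented = augment_observation([0.5, -1.0], 2, 4)
```

Because the hint is just extra input features, an unmodified RL algorithm can learn to weight it heavily when it is reliable and to ignore it otherwise.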

Reinforcement Learning-Driven Edge Management for Reliable Multi-view 3D Reconstruction

Authors: Motahare Mounesan, Sourya Saha, Houchao Gan, Md. Nurul Absur, Saptarshi Debroy
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Graphics (cs.GR); Multimedia (cs.MM)
Arxiv link: https://arxiv.org/abs/2510.08839
Pdf link: https://arxiv.org/pdf/2510.08839
Abstract Real-time multi-view 3D reconstruction is a mission-critical application for key edge-native use cases, such as fire rescue, where timely and accurate 3D scene modeling enables situational awareness and informed decision-making. However, the dynamic and unpredictable nature of edge resource availability introduces disruptions, such as degraded image quality, unstable network links, and fluctuating server loads, which challenge the reliability of the reconstruction pipeline. In this work, we present a reinforcement learning (RL)-based edge resource management framework for reliable 3D reconstruction to ensure high quality reconstruction within a reasonable amount of time, despite the system operating under a resource-constrained and disruption-prone environment. In particular, the framework adopts two cooperative Q-learning agents, one for camera selection and one for server selection, both of which operate entirely online, learning policies through interactions with the edge environment. To support learning under realistic constraints and evaluate system performance, we implement a distributed testbed comprising lab-hosted end devices and FABRIC infrastructure-hosted edge servers to emulate smart city edge infrastructure under realistic disruption scenarios. Results show that the proposed framework improves application reliability by effectively balancing end-to-end latency and reconstruction quality in dynamic environments.
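
The online tabular Q-learning update that each of the two agents would run can be sketched as follows. This is the textbook algorithm, not the authors' framework: the class name `QAgent`, the epsilon-greedy action rule, and the hyperparameter defaults are assumptions made for illustration.

```python
import random
from collections import defaultdict

class QAgent:
    """Tabular Q-learning agent with epsilon-greedy action selection."""

    def __init__(self, num_actions, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
        self.q = defaultdict(lambda: [0.0] * num_actions)  # state -> action values
        self.num_actions = num_actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.rng = random.Random(seed)

    def act(self, state):
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.num_actions)  # explore
        values = self.q[state]
        return values.index(max(values))                  # exploit

    def update(self, state, action, reward, next_state):
        # One-step temporal-difference update toward reward + discounted best next value.
        target = reward + self.gamma * max(self.q[next_state])
        self.q[state][action] += self.alpha * (target - self.q[state][action])
```

In a cooperative two-agent setup such as camera selection plus server selection, one plausible arrangement is for both agents to call `update` with the same shared reward after each reconstruction round; how the paper actually couples the agents is not stated in the abstract.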

CDE: Concept-Driven Exploration for Reinforcement Learning

Authors: Le Mao, Andrew H. Liu, Renos Zabounidis, Zachary Kingston, Joseph Campbell
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.08851
Pdf link: https://arxiv.org/pdf/2510.08851
Abstract Intelligent exploration remains a critical challenge in reinforcement learning (RL), especially in visual control tasks. Unlike low-dimensional state-based RL, visual RL must extract task-relevant structure from raw pixels, making exploration inefficient. We propose Concept-Driven Exploration (CDE), which leverages a pre-trained vision-language model (VLM) to generate object-centric visual concepts from textual task descriptions as weak, potentially noisy supervisory signals. Rather than directly conditioning on these noisy signals, CDE trains a policy to reconstruct the concepts via an auxiliary objective, using reconstruction accuracy as an intrinsic reward to guide exploration toward task-relevant objects. Because the policy internalizes these concepts, VLM queries are only needed during training, reducing dependence on external models during deployment. Across five challenging simulated visual manipulation tasks, CDE achieves efficient, targeted exploration and remains robust to noisy VLM predictions. Finally, we demonstrate real-world transfer by deploying CDE on a Franka Research 3 arm, attaining an 80% success rate in a real-world manipulation task.

Model-Based Lookahead Reinforcement Learning for in-hand manipulation

Authors: Alexandre Lopes, Catarina Barata, Plinio Moreno
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.08884
Pdf link: https://arxiv.org/pdf/2510.08884
Abstract In-hand manipulation, like many other dexterous tasks, remains a difficult challenge in robotics because it combines complex dynamic systems with the need to control and manoeuvre various objects using the hand's actuators. This work applies a previously developed hybrid Reinforcement Learning (RL) framework to an in-hand manipulation task and verifies that it can improve task performance. The model combines concepts of both Model-Free and Model-Based Reinforcement Learning by guiding a trained policy with the help of a dynamic model and value function through trajectory evaluation, as done in Model Predictive Control. This work evaluates the performance of the model by comparing it with the policy being guided. To explore this fully, various tests are performed using both fully-actuated and under-actuated simulated robotic hands to manipulate different objects for a given task. Generalization is also tested by changing properties, such as density and size, of the objects on which both the policy and the dynamic model were trained, and additionally by guiding a policy trained on one object to perform the same task on a different one. The results show that, given a policy with high average reward and an accurate dynamic model, the hybrid framework improves the performance of in-hand manipulation tasks for most test cases, even when the object properties are changed. However, this improvement comes at the expense of increased computational cost, due to the complexity of trajectory evaluation.
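
Guiding a trained policy by evaluating imagined trajectories with a dynamic model and a value function can be sketched as follows. This is a simplified illustration under stated assumptions, not the framework from the paper: states and actions are scalars, candidates are generated by adding Gaussian noise to the policy's actions, and all names and defaults (`lookahead_action`, `horizon`, `num_candidates`, `noise`) are hypothetical.

```python
import random

def lookahead_action(state, policy, model, reward_fn, value_fn,
                     horizon=5, num_candidates=16, noise=0.1, gamma=0.99, seed=0):
    """Return the first action of the best-scoring imagined trajectory."""
    rng = random.Random(seed)
    best_score, best_action = float("-inf"), None
    for k in range(num_candidates):
        s, score, first = state, 0.0, None
        for t in range(horizon):
            # Candidate 0 follows the policy exactly; the others perturb it.
            a = policy(s) + (rng.gauss(0.0, noise) if k > 0 else 0.0)
            if first is None:
                first = a
            score += (gamma ** t) * reward_fn(s, a)
            s = model(s, a)                        # imagined next state
        score += (gamma ** horizon) * value_fn(s)  # bootstrap beyond the horizon
        if score > best_score:
            best_score, best_action = score, first
    return best_action

# Usage with toy scalar dynamics (all functions here are placeholders).
policy = lambda s: -0.5 * s
model = lambda s, a: s + a
reward_fn = lambda s, a: -(s * s) - 0.1 * a * a
value_fn = lambda s: -(s * s)
guided = lookahead_action(1.0, policy, model, reward_fn, value_fn)
```

The cost noted in the abstract is visible in the loop structure: each decision requires `num_candidates * horizon` model evaluations instead of a single policy call.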

Exploring Multi-Temperature Strategies for Token- and Rollout-Level Control in RLVR

Authors: Haomin Zhuang, Yujun Zhou, Taicheng Guo, Yue Huang, Fangxu Liu, Kai Song, Xiangliang Zhang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.08892
Pdf link: https://arxiv.org/pdf/2510.08892
Abstract Reinforcement Learning has demonstrated substantial improvements in the reasoning abilities of Large Language Models (LLMs), exhibiting significant applicability across various domains. Recent research has identified that tokens within LLMs play distinct roles during reasoning tasks, categorizing them into high-entropy reasoning tokens and low-entropy knowledge tokens. Prior approaches have typically focused on restricting updates to indirectly encourage exploration, yet they do not explicitly facilitate exploratory behavior during the token generation stage itself. In this work, we introduce a complementary approach that explicitly promotes exploration during sampling by applying distinct temperature settings for different token types. Specifically, our method employs higher temperatures for reasoning tokens to actively encourage exploration, while retaining lower temperatures for knowledge tokens to maintain factual correctness. Furthermore, we systematically investigate various multi-temperature scheduling strategies and their impacts within reinforcement learning contexts. Empirical evaluations on several reasoning benchmarks demonstrate that our approach significantly enhances the reasoning performance of LLMs. The code is available at this https URL.
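
Token-level temperature control can be sketched as follows: a higher sampling temperature where the next-token distribution has high entropy, and a lower one elsewhere. This is a minimal illustration under stated assumptions, not the authors' released code: classifying a position by comparing the entropy of the unscaled distribution with a fixed `threshold`, and the values of `t_high` and `t_low`, are choices made here.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, stabilized by subtracting the max logit."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def token_temperature(logits, threshold=1.0, t_high=1.2, t_low=0.7):
    """High-entropy (reasoning-like) positions get t_high, the rest get t_low."""
    return t_high if entropy(softmax(logits)) > threshold else t_low

def sample_token(logits, rng, **kwargs):
    """Sample one token index using the temperature chosen for this position."""
    probs = softmax(logits, token_temperature(logits, **kwargs))
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]
```

A rollout-level variant would instead fix one temperature per sampled response; the abstract mentions both levels but gives no schedule, so none is assumed here.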

HES-SQL: Hybrid Reasoning for Efficient Text-to-SQL with Structural Skeleton Guidance

Authors: Suming Qiu, Jing Li, Zhicheng Zhou, Junjie Huang, Linyuan Qiu, Zhijie Sun
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.08896
Pdf link: https://arxiv.org/pdf/2510.08896
Abstract We present HES-SQL, a novel hybrid training framework that advances Text-to-SQL generation through the integration of thinking-mode-fused supervised fine-tuning (SFT) with Group Relative Policy Optimization (GRPO). Our approach introduces three key innovations: (1) a skeleton-completeness scoring mechanism that enhances preference alignment between generated queries and optimal SQL structures; (2) a query-latency-aware reward system that incentivizes the generation of computationally efficient SQL queries; (3) a self-distillation process for thinking-mode completion that prevents degradation of the model's reasoning capabilities. This framework enables hybrid thinking models to switch between reasoning and non-reasoning modes while improving SQL query accuracy and execution efficiency. Experimental evaluation, conducted on MySQL 8.0 and SQLite 3.42 under controlled single-user conditions, demonstrates that HES-SQL achieves competitive performance with execution accuracies of 79.14% and 54.9% on the BIRD and KaggleDBQA benchmarks, respectively. Query latency is measured as the end-to-end execution time of generated queries on the DBMS, averaged over multiple runs to mitigate variance. Efficiency gains range from 11% to 20% relative to supervised baselines. Our results establish a new paradigm for Text-to-SQL systems that effectively balances semantic accuracy with computational efficiency through execution-informed reinforcement learning (RL). The proposed methodology has significant implications for developing robust natural language interfaces to databases and can be extended to broader structured generation tasks requiring both correctness and efficiency optimization.

Pinpointing crucial steps: Attribution-based Credit Assignment for Verifiable Reinforcement Learning

Authors: Junxi Yin, Haisen Luo, Zhenyu Li, Yihua Liu, Dan Liu, Zequn Li, Xiaohang Xu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.08899
Pdf link: https://arxiv.org/pdf/2510.08899
Abstract While Reinforcement Learning with Verifiable Rewards (RLVR) enhances complex reasoning in LLMs, current methods struggle to balance exploration and exploitation. This leads to critical issues like inaccurate credit assignment for intermediate steps and premature entropy collapse, limiting model performance. To address this, we introduce Attribution-based Contribution to Policy Optimization (ACPO), a phased framework that incorporates a difficulty-aware curriculum. ACPO improves exploration by using trajectory semantic segmentation and an attribution-based representation to dynamically regulate policy entropy, thus mitigating its collapse. Concurrently, it enhances exploitation with a factorized reward system that precisely quantifies the hierarchical contribution of each reasoning step, ensuring accurate credit assignment. Extensive experiments on challenging benchmarks, including AIME, MATH, and AMC, demonstrate that ACPO significantly outperforms existing state-of-the-art approaches.
中文摘要 虽然具有可验证奖励的强化学习（RLVR）增强了大语言模型（LLM）的复杂推理能力，但当前的方法难以平衡探索和利用。这会导致关键问题，例如中间步骤的信用分配不准确和熵过早崩溃，从而限制了模型性能。为了解决这个问题，我们引入了基于归因的策略优化贡献（ACPO），这是一个分阶段的框架，其中包含难度感知课程。ACPO 通过使用轨迹语义分割和基于归因的表示来动态调节策略熵，从而减轻其崩溃，改进了探索。同时，它还通过因子化奖励系统增强了利用，该系统精确量化每个推理步骤的分层贡献，确保准确的信用分配。在具有挑战性的基准（包括 AIME、MATH 和 AMC）上的广泛实验表明，ACPO 的性能明显优于现有的最先进的方法。

Unleashing Perception-Time Scaling to Multimodal Reasoning Models

为多模态推理模型释放感知时间缩放

Authors: Yifan Li, Zhenghao Chen, Ziheng Wu, Kun Zhou, Ruipu Luo, Can Zhang, Zhentao He, Yufei Zhan, Wayne Xin Zhao, Minghui Qiu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.08964
Pdf link: https://arxiv.org/pdf/2510.08964
Abstract Recent advances in inference-time scaling, particularly those leveraging reinforcement learning with verifiable rewards, have substantially enhanced the reasoning capabilities of Large Vision-Language Models (LVLMs). Inspired by this success, similar strategies have been applied to multimodal reasoning, yet their impact on visual perception remains unclear. To investigate this gap, we introduce DisTANCE, a perception-centric benchmark for visual estimation tasks. Evaluation results show that LVLMs exhibit limited estimation precision, and inference-time scaling offers only marginal gains. We attribute this to the fast perception paradigm of current LVLMs, where visual understanding is treated as a one-shot output without modeling the underlying perceptual process. To address this, we propose Perception-Time Scaling (PTS), a novel paradigm that encourages token-rich perception and decomposes complex perception problems into intermediate tractable sub-problems, thereby enabling perception to align with and benefit from inference-time scaling. Combined with reinforcement learning techniques, PTS significantly improves perception accuracy, raising high-precision performance on DisTANCE from 8.0% to 64.7%, and generalizes well to out-of-domain tasks. Surprisingly, even though PTS data are purely synthetic, combining them with math reasoning data yields consistent gains in both reasoning and real-world perception benchmarks. Further analysis reveals that PTS introduces more perception-related tokens and increases the model's attention to image tokens. Our code and data will be publicly released.
中文摘要 推理时间缩放的最新进展，特别是那些利用具有可验证奖励的强化学习的进展，大大增强了大型视觉语言模型（LVLM）的推理能力。受这一成功的启发，类似的策略已被应用于多模态推理，但它们对视觉感知的影响仍不清楚。为了调查这一差距，我们引入了 DisTANCE，这是一种以感知为中心的视觉估计任务基准。评估结果表明，LVLM 表现出有限的估计精度，推理时间缩放只能提供边际增益。我们将此归因于当前 LVLM 的快速感知范式，其中视觉理解被视为一次性输出，而无需对潜在的感知过程进行建模。为了解决这个问题，我们提出了感知时间缩放（PTS），这是一种新颖的范式，它鼓励标记丰富的感知，并将复杂的感知问题分解为中间可处理的子问题，从而使感知能够与推理时间缩放保持一致并从中受益。结合强化学习技术，PTS显著提高了感知准确率，将DisTANCE的高精度性能从8.0%提高到64.7%，并能很好地泛化到域外任务。令人惊讶的是，尽管 PTS 数据是纯合成的，但将它们与数学推理数据相结合，在推理和现实世界的感知基准中都会产生一致的收益。进一步分析发现，PTS 引入了更多与感知相关的 token，增加了模型对图像 token 的关注度。我们的代码和数据将公开发布。

Diagnosing and Mitigating System Bias in Self-Rewarding RL

诊断和减轻自我奖励 RL 中的系统偏差

Authors: Chuyi Tan, Peiwen Yuan, Xinglin Wang, Yiwei Li, Shaoxiong Feng, Yueqi Zhang, Jiayi Shi, Ji Zhang, Boyuan Pan, Yao Hu, Kan Li
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.08977
Pdf link: https://arxiv.org/pdf/2510.08977
Abstract Reinforcement learning with verifiable rewards (RLVR) scales the reasoning ability of large language models (LLMs) but remains bottlenecked by limited labeled samples for continued data scaling. Reinforcement learning with intrinsic rewards (RLIR), where the policy model assigns rewards to its own rollouts, enables sustainable scaling in unlabeled settings, yet its performance and stability lag behind RLVR. We trace this gap to a system bias: the model tends to overestimate its high-confidence rollouts, leading to biased and unstable reward estimation. This bias accumulates as training progresses, with deviations from the oracle drifting toward over-reward, causing unstable training. We characterize this bias using three metrics: $\rho_{\text{noise}}$, $\rho_{\text{selfbias}}$, and $\rho_{\text{symbias}}$. We find that $\rho_{\text{noise}}$ and $\rho_{\text{symbias}}$ impact convergence, while $\rho_{\text{selfbias}}$ amplifies both correct and incorrect updates, leading to instability. To mitigate this, we propose reinforcement learning with ensembled rewards (RLER), which aggregates diverse models and adapts reward interpolation and rollout selection. Extensive experiments show that RLER improves by +13.6% over RLIR and is only 3.6% below RLVR, achieving stable scaling on unlabeled samples, making it highly applicable.
中文摘要 具有可验证奖励的强化学习（RLVR）扩展了大型语言模型（LLM）的推理能力，但受限于有限的标注样本，难以持续进行数据扩展。具有内在奖励的强化学习（RLIR），其中策略模型为自己的 rollout 分配奖励，可以在无标注的环境中实现可持续扩展，但其性能和稳定性落后于 RLVR。我们将这种差距追溯到系统偏差：该模型往往会高估其高置信度的 rollout，从而导致有偏差和不稳定的奖励估计。这种偏差随着训练的进行而累积，相对于 oracle 的偏离会逐渐偏向过度奖励，从而导致训练不稳定。我们使用三个指标来表征这种偏差：$\rho_{\text{noise}}$、$\rho_{\text{selfbias}}$ 和 $\rho_{\text{symbias}}$。我们发现 $\rho_{\text{noise}}$ 和 $\rho_{\text{symbias}}$ 影响收敛，而 $\rho_{\text{selfbias}}$ 会放大正确和不正确的更新，从而导致不稳定。为了缓解这种情况，我们提出了集成奖励强化学习（RLER），它聚合了不同的模型并调整了奖励插值和 rollout 选择。大量实验表明，RLER 比 RLIR 提高了 13.6%，仅比 RLVR 低 3.6%，在无标注样本上实现了稳定的扩展，使其具有很高的适用性。

Rethinking Reasoning in Document Ranking: Why Chain-of-Thought Falls Short

重新思考文档排名中的推理：为什么思维链不足

Authors: Xuan Lu, Haohang Huang, Rui Meng, Yaohui Jin, Wenjun Zeng, Xiaoyu Shen
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2510.08985
Pdf link: https://arxiv.org/pdf/2510.08985
Abstract Document reranking is a key component in information retrieval (IR), aimed at refining initial retrieval results to improve ranking quality for downstream tasks. Recent studies--motivated by large reasoning models (LRMs)--have begun incorporating explicit chain-of-thought (CoT) reasoning into LLM-based rerankers. However, the effectiveness of such reasoning for ranking tasks remains underexplored. In this work, we present the first systematic study of reasoning in reranking across both pointwise and listwise settings, under both supervised fine-tuning and reinforcement learning. Using diverse benchmarks, including reasoning-intensive datasets (BRIGHT) and standard IR benchmarks (BEIR), we find that reasoning-augmented rerankers consistently underperform their direct counterparts that predict rankings without CoT, despite substantially higher inference costs. Our analysis reveals three core limitations: (i) in pointwise rerankers, reasoning breaks calibration and biases models toward the positive class, raising TPR but lowering TNR, which inflates false positives and degrades ranking in negative-dominant pools; (ii) in listwise rerankers, reasoning improves in-domain fit but increases variance and fails to generalize out-of-domain, even when reinforcement learning shortens rationales; and (iii) overall, directly fine-tuned rerankers remain more stable, effective, and robust. These findings challenge the assumption that explicit reasoning is universally beneficial for reranking. We conclude by highlighting future directions, including calibration-aware scoring for pointwise rerankers and the design of concise, targeted reasoning strategies to mitigate overfitting and overthinking in listwise rerankers.
中文摘要 文档重新排序是信息检索（IR）的关键组成部分，旨在完善初始检索结果，以提高下游任务的排名质量。在大型推理模型（LRM）的推动下，最近的研究已开始将显式思维链（CoT）推理纳入基于 LLM 的重新排名器中。然而，这种推理对任务排名的有效性仍未得到充分探索。在这项工作中，我们提出了第一个在监督微调和强化学习下跨逐点和逐列表设置重新排序的推理系统研究。使用各种基准，包括推理密集型数据集（BRIGHT）和标准 IR 基准（BEIR），我们发现推理增强的重新排名者的性能始终低于在没有 CoT 的情况下预测排名的直接对应者，尽管推理成本要高得多。我们的分析揭示了三个核心局限性：（i）在逐点重新排名器中，推理破坏了校准并使模型偏向于正类，提高了TPR但降低了TNR，这会夸大假阳性并降低负面主导池中的排名;（ii）在列表重新排序器中，推理提高了域内拟合度，但增加了方差，并且无法泛化域外，即使强化学习缩短了基本原理;（iii）总体而言，直接微调的重新排序器仍然更加稳定、有效和稳健。这些发现挑战了显式推理普遍有利于重新排名的假设。最后，我们强调了未来的方向，包括逐点重新排名器的校准感知评分，以及设计简洁、有针对性的推理策略，以减轻列表重新排名者的过度拟合和过度思考。
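
The first limitation above (higher TPR but lower TNR hurting ranking in negative-dominant pools) can be checked with simple arithmetic. The sketch below is illustrative only: the pool sizes and the TPR/TNR values are made-up numbers, not results from the paper, and `expected_precision` is a hypothetical helper.

```python
def expected_precision(n_pos: int, n_neg: int, tpr: float, tnr: float) -> float:
    """Expected precision of a pointwise classifier on a candidate pool."""
    true_pos = tpr * n_pos            # relevant documents scored as relevant
    false_pos = (1.0 - tnr) * n_neg   # irrelevant documents scored as relevant
    return true_pos / (true_pos + false_pos)


# Hypothetical pool: 5 relevant and 95 irrelevant candidates.
direct = expected_precision(5, 95, tpr=0.80, tnr=0.98)     # calibrated reranker
reasoning = expected_precision(5, 95, tpr=0.90, tnr=0.90)  # positively biased reranker
```

With these assumed rates, the biased reranker recovers slightly more relevant documents (4.5 versus 4 in expectation) but admits about 9.5 false positives instead of about 1.9, so its precision falls from roughly 0.68 to roughly 0.32. Because negatives outnumber positives 19 to 1 in this pool, a small drop in TNR outweighs a similar-sized gain in TPR.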

Tiny-R1V: Lightweight Multimodal Unified Reasoning Model via Model Merging

Tiny-R1V：通过模型合并实现轻量级多模态统一推理模型

Authors: Qixiang Yin, Huanjin Yao, Jianghao Chen, Jiaxing Huang, Zhicheng Zhao, Fei Su
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.08987
Pdf link: https://arxiv.org/pdf/2510.08987
Abstract Although Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across diverse tasks, they encounter numerous challenges in terms of reasoning efficiency, such as large model size, overthinking, and compromised accuracy in lightweight scenarios. However, research on the reasoning capabilities of lightweight MLLMs is quite lacking. To this end, we propose Tiny-R1V, a novel lightweight 3B model that achieves faster inference and higher accuracy via a two-stage optimization, while unifying multimodal reasoning across multiple tasks and using fewer tokens. In the first stage, Tiny-R1V introduces Length-Informed Relative Policy Optimization (LIPO), a novel reinforcement learning method, to train each reasoning model. LIPO dynamically adjusts the advantages of responses within each group, prioritizing concise yet high-quality responses to encourage the generation of shorter and more accurate responses. In the second stage, we propose Adaptive Model Merging (AMM), a training-free model merging method that merges multiple specialist models into a unified architecture. Specifically, AMM adaptively adjusts the weights of task vectors and robustly optimizes the merged vectors via a novel gradient projection regularization loss function, thus mitigating redundant conflicts between them. Extensive evaluations on ten widely-used reasoning benchmarks covering mathematics, structured data (charts, tables, documents), OCR, and general capabilities showcase the superior performance of Tiny-R1V, enabling lightweight models to excel in diverse multimodal reasoning tasks.
中文摘要 尽管多模态大型语言模型（MLLM）在各种任务中都表现出了卓越的能力，但在推理效率方面也面临着诸多挑战，例如模型规模大、过度思考以及轻量级场景下的准确性受损等。然而，对轻量级MLLM推理能力的研究相当缺乏。为此，我们提出了Tiny-R1V，这是一种新型的轻量级3B模型，它通过两阶段优化实现更快的推理和更高的准确性，同时统一多个任务的多模态推理，并使用更少的token。在第一阶段，Tiny-R1V 引入了一种新型强化学习方法长度知情相对策略优化（LIPO）来训练每个推理模型。LIPO 旨在动态调整组内响应的优势，即通过优先考虑简洁而高质量的响应来鼓励生成更短、更准确的响应。在第二阶段，我们提出了自适应模型合并（AMM），这是一种免训练的模型合并方法，将多个专业模型合并到一个统一的架构中。具体来说，AMM自适应地调整任务向量的权重，并通过一种新的梯度投影正则化损失函数对合并的向量进行鲁棒性优化，从而减轻了它们之间的冗余冲突。对十个广泛使用的推理基准测试（涵盖数学、结构化数据（图表、表格、文档）、OCR 和通用能力）的广泛评估展示了 Tiny-R1V 的卓越性能，使轻量级模型能够在各种多模态推理任务中表现出色。
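
The merging stage builds on task vectors, that is, the parameter difference between a fine-tuned specialist and the shared base model. The sketch below shows only plain weighted task-vector merging with fixed weights. It is an assumption-laden illustration and not AMM: the adaptive weighting and the gradient projection regularization loss from the abstract are not reproduced, and `merge_task_vectors` and the toy parameter dictionaries are hypothetical.

```python
def merge_task_vectors(base: dict[str, list[float]],
                       experts: list[dict[str, list[float]]],
                       weights: list[float]) -> dict[str, list[float]]:
    """Return base + sum_k weights[k] * (experts[k] - base), per parameter."""
    merged = {}
    for name, base_param in base.items():
        values = list(base_param)  # copy so the base model is left untouched
        for expert, w in zip(experts, weights):
            for i, (e, b) in enumerate(zip(expert[name], base_param)):
                values[i] += w * (e - b)  # add the scaled task vector entry
        merged[name] = values
    return merged


# Toy example: one parameter tensor of two entries and two specialists.
base = {"w": [1.0, 2.0]}
math_expert = {"w": [2.0, 2.0]}   # task vector [1.0, 0.0]
chart_expert = {"w": [1.0, 4.0]}  # task vector [0.0, 2.0]
merged = merge_task_vectors(base, [math_expert, chart_expert], [0.5, 0.5])
```

In this toy case the two task vectors touch different entries, so they do not interfere. Conflicts arise when specialists move the same parameters in different directions, which is the situation the abstract says AMM is designed to mitigate.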

DARO: Difficulty-Aware Reweighting Policy Optimization

DARO：难度感知重加权策略优化

Authors: Jingyu Zhou, Lu Ma, Hao Liang, Chengyu Shen, Bin Cui, Wentao Zhang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.09001
Pdf link: https://arxiv.org/pdf/2510.09001
Abstract Recent advances in large language models (LLMs) have shown that reasoning ability can be significantly enhanced through Reinforcement Learning with Verifiable Rewards (RLVR). Group Relative Policy Optimization (GRPO) has emerged as the de facto approach for RLVR, inspiring numerous variants. However, our mathematical analysis reveals that these methods are fundamentally weighted variations of GRPO. We provide a unified view, demonstrating that their reliance on static or overly simplistic weighting schemes tied to sample difficulty prevents adaptation to a model's evolving capabilities. This creates a significant loss scale issue, where training disproportionately focuses on certain difficulty levels at the expense of others, hindering overall performance. To address these limitations, we introduce \textbf{Difficulty-Aware Reweighting Policy Optimization (DARO)}, a method that dynamically adjusts the loss contribution of each difficulty group based on the model's learning state. Extensive experiments on Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, and Llama3.1-8B show that DARO outperforms four leading baselines across six math benchmarks, achieving significantly faster convergence and superior final performance.
中文摘要 大型语言模型（LLM）的最新进展表明，通过具有可验证奖励的强化学习（RLVR）可以显着增强推理能力。群体相对策略优化（GRPO）已成为 RLVR 事实上的方法，激发了许多变体。然而，我们的数学分析表明，这些方法从根本上来说是GRPO的加权变体。我们提供了一个统一的观点，证明它们对与样本难度相关的静态或过于简单的加权方案的依赖阻碍了对模型不断发展的能力的适应。这造成了一个严重的损失规模问题，即训练不成比例地关注某些难度级别，而牺牲其他难度级别，从而阻碍整体表现。为了解决这些限制，我们引入了 \textbf{Difficulty-Aware Reweighting Policy Optimization （DARO）}，这是一种根据模型的学习状态动态调整每个难度组的损失贡献的方法。对 Qwen2.5-Math-1.5B、Qwen2.5-Math-7B 和 Llama3.1-8B 的广泛实验表明，DARO 在六个数学基准测试中优于四个领先的基线，实现了显着更快的收敛和卓越的最终性能。
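
For context on the weighting discussion above, the group-relative advantage used by GRPO normalises each sampled response's reward by the mean and standard deviation of its group. The sketch below shows that baseline computation and a simple per-prompt pass rate as a difficulty proxy. It does not implement DARO's reweighting, which the abstract does not specify, and the use of the population standard deviation and of `eps` are implementation assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: (r_i - mean(r)) / (std(r) + eps) within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some codebases use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


def pass_rate(rewards: list[float]) -> float:
    """Fraction of correct rollouts for a prompt, a common proxy for its difficulty."""
    return mean(rewards)


hard_prompt = [1.0, 0.0, 0.0, 0.0]    # one of four rollouts is correct
solved_prompt = [1.0, 1.0, 1.0, 1.0]  # every rollout is correct

hard_adv = group_relative_advantages(hard_prompt)
solved_adv = group_relative_advantages(solved_prompt)
```

A prompt whose rollouts all receive the same reward yields zero advantages and therefore no gradient, so how much each prompt contributes to the loss depends on its pass rate. That dependence is why difficulty-based weighting is a natural axis for comparing GRPO variants.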

HERO: Hardware-Efficient RL-based Optimization Framework for NeRF Quantization

HERO：用于 NeRF 量化的硬件高效、基于 RL 的优化框架

Authors: Yipu Zhang, Chaofang Ma, Jinming Ge, Lin Jiang, Jiang Xu, Wei Zhang
Subjects: Hardware Architecture (cs.AR)
Arxiv link: https://arxiv.org/abs/2510.09010
Pdf link: https://arxiv.org/pdf/2510.09010
Abstract Neural Radiance Field (NeRF) has emerged as a promising 3D reconstruction method, delivering high-quality results for AR/VR applications. While quantization methods and hardware accelerators have been proposed to enhance NeRF's computational efficiency, existing approaches face crucial limitations. Current quantization methods operate without considering hardware architecture, resulting in sub-optimal solutions within the vast design space encompassing accuracy, latency, and model size. Additionally, existing NeRF accelerators heavily rely on human experts to explore this design space, making the optimization process time-consuming, inefficient, and unlikely to discover optimal solutions. To address these challenges, we introduce HERO, a reinforcement learning framework performing hardware-aware quantization for NeRF. Our framework integrates a NeRF accelerator simulator to generate real-time hardware feedback, enabling fully automated adaptation to hardware constraints. Experimental results demonstrate that HERO achieves 1.31-1.33 $\times$ better latency, 1.29-1.33 $\times$ improved cost efficiency, and a more compact model size compared to CAQ, a previous state-of-the-art NeRF quantization framework. These results validate our framework's capability to effectively navigate the complex design space between hardware and algorithm requirements, discovering superior quantization policies for NeRF implementation. Code is available at this https URL.
中文摘要 神经辐射场（NeRF）已成为一种很有前途的 3D 重建方法，可为 AR/VR 应用提供高质量的结果。虽然已经提出了量化方法和硬件加速器来提高 NeRF 的计算效率，但现有方法面临着严重的局限性。当前的量化方法在不考虑硬件架构的情况下运行，导致在包括精度、延迟和模型大小在内的广阔设计空间内产生次优解决方案。此外，现有的 NeRF 加速器严重依赖人类专家来探索这一设计空间，这使得优化过程耗时、效率低下，并且不太可能发现最佳解决方案。为了应对这些挑战，我们引入了 HERO，这是一个为 NeRF 执行硬件感知量化的强化学习框架。我们的框架集成了 NeRF 加速器模拟器来生成实时硬件反馈，从而能够完全自动适应硬件约束。实验结果表明，与之前最先进的NeRF量化框架CAQ相比，HERO实现了1.31-1.33 $\times$更好的延迟，提高了1.29-1.33 $\times$的成本效率，以及更紧凑的模型尺寸。这些结果验证了我们的框架在硬件和算法需求之间有效驾驭复杂设计空间的能力，发现了用于 NeRF 实现的卓越量化策略。代码可在此 https URL 中找到。

TripScore: Benchmarking and rewarding real-world travel planning with fine-grained evaluation

TripScore：通过细粒度评估对现实世界的旅行计划进行基准测试和奖励

Authors: Yincen Qu, Huan Xiao, Feng Li, Hui Zhou, Xiangying Dai
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.09011
Pdf link: https://arxiv.org/pdf/2510.09011
Abstract Travel planning is a valuable yet complex task that poses significant challenges even for advanced large language models (LLMs). While recent benchmarks have advanced in evaluating LLMs' planning capabilities, they often fall short in evaluating feasibility, reliability, and engagement of travel plans. We introduce a comprehensive benchmark for travel planning that unifies fine-grained criteria into a single reward, enabling direct comparison of plan quality and seamless integration with reinforcement learning (RL). Our evaluator achieves moderate agreement with travel-expert annotations (60.75\%) and outperforms multiple LLM-as-judge baselines. We further release a large-scale dataset of 4,870 queries including 219 real-world, free-form requests for generalization to authentic user intent. Using this benchmark, we conduct extensive experiments across diverse methods and LLMs, including test-time computation, neuro-symbolic approaches, supervised fine-tuning, and RL via GRPO. Across base models, RL generally improves itinerary feasibility over prompt-only and supervised baselines, yielding higher unified reward scores.
中文摘要 旅行计划是一项有价值但复杂的任务，即使对于高级大型语言模型（LLM）来说也带来了重大挑战。虽然最近的基准在评估 LLM 的规划能力方面取得了进步，但它们在评估旅行计划的可行性、可靠性和参与度方面往往存在不足。我们引入了旅行计划的综合基准，将细粒度标准统一为单一奖励，从而能够直接比较计划质量并与强化学习（RL）无缝集成。我们的评估器与旅行专家注释取得了中等一致性（60.75\%），并且优于多个 LLM-as-judge 基线。我们进一步发布了一个包含 4,870 个查询的大规模数据集，其中包括 219 个真实世界的自由格式请求，用于泛化到真实的用户意图。使用这个基准，我们跨不同的方法和 LLM 进行了广泛的实验，包括测试时计算、神经符号方法、监督微调和通过 GRPO 的 RL。在所有基础模型中，RL 通常比仅提示和监督基线提高了行程的可行性，从而产生更高的统一奖励分数。

Slim Scheduler: A Runtime-Aware RL and Scheduler System for Efficient CNN Inference

Slim Scheduler：用于高效 CNN 推理的运行时感知 RL 和调度器系统

Authors: Ian Harshbarger, Calvin Chidambaram
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.09018
Pdf link: https://arxiv.org/pdf/2510.09018
Abstract Most neural network scheduling research focuses on optimizing static, end-to-end models of fixed width, overlooking dynamic approaches that adapt to heterogeneous hardware and fluctuating runtime conditions. We present Slim Scheduler, a hybrid scheduling framework that integrates a Proximal Policy Optimization (PPO) reinforcement learning policy with algorithmic, greedy schedulers to coordinate distributed inference for slimmable models. Each server runs a local greedy scheduler that batches compatible requests and manages instance scaling based on VRAM and utilization constraints, while the PPO router learns global routing policies for device selection, width ratio, and batch configuration. This hierarchical design reduces search space complexity, mitigates overfitting to specific hardware, and balances efficiency and throughput. Compared to a purely randomized task distribution baseline, Slim Scheduler can achieve various accuracy and latency trade-offs. For example, it achieves a 96.45% reduction in mean latency and a 97.31% reduction in energy usage when accuracy drops to that of the slimmest model available (70.3%). It can also reduce average latency and energy consumption overall while increasing accuracy, at the cost of higher standard deviations in latency and energy, which affects overall task throughput.
中文摘要 大多数神经网络调度研究都集中在优化固定宽度的静态端到端模型上，而忽略了适应异构硬件和波动运行时条件的动态方法。我们提出了 Slim Scheduler，这是一个混合调度框架，它将近端策略优化（PPO）强化学习策略与算法贪婪调度器集成在一起，以协调可瘦化模型的分布式推理。每台服务器都运行一个本地贪婪调度程序，该调度程序根据 VRAM 和利用率约束对兼容请求进行批处理并管理实例扩展，而 PPO 路由器则学习设备选择、宽度比和批量配置的全局路由策略。这种分层设计降低了搜索空间的复杂性，减少了对特定硬件的过度拟合，并平衡了效率和吞吐量。与纯粹随机的任务分布基线相比，Slim Scheduler 可以实现各种准确性和延迟权衡，例如：平均延迟减少 96.45%，能耗降低 97.31%，将准确性降至可用的最薄模型（70.3%）。然后，它可以实现平均延迟和能耗的总体降低，同时提高准确性，但代价是所述延迟和能量的标准偏差更高，从而影响整体任务吞吐量。

The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections

攻击者后手：更强的自适应攻击绕过针对 LLM 越狱和提示注入的防御

Authors: Milad Nasr, Nicholas Carlini, Chawin Sitawarin, Sander V. Schulhoff, Jamie Hayes, Michael Ilie, Juliette Pluto, Shuang Song, Harsh Chaudhari, Ilia Shumailov, Abhradeep Thakurta, Kai Yuanqing Xiao, Andreas Terzis, Florian Tramèr
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2510.09023
Pdf link: https://arxiv.org/pdf/2510.09023
Abstract How should we evaluate the robustness of language model defenses? Current defenses against jailbreaks and prompt injections (which aim to prevent an attacker from eliciting harmful knowledge or remotely triggering malicious actions, respectively) are typically evaluated either against a static set of harmful attack strings, or against computationally weak optimization methods that were not designed with the defense in mind. We argue that this evaluation process is flawed. Instead, we should evaluate defenses against adaptive attackers who explicitly modify their attack strategy to counter a defense's design while spending considerable resources to optimize their objective. By systematically tuning and scaling general optimization techniques (gradient descent, reinforcement learning, random search, and human-guided exploration), we bypass 12 recent defenses (based on a diverse set of techniques) with attack success rate above 90% for most; importantly, the majority of defenses originally reported near-zero attack success rates. We believe that future defense work must consider stronger attacks, such as the ones we describe, in order to make reliable and convincing claims of robustness.
中文摘要 我们应该如何评估语言模型防御的鲁棒性？当前针对越狱和提示注入的防御（分别旨在防止攻击者获取有害知识或远程触发恶意操作）通常针对一组静态有害攻击字符串或计算较弱的优化方法进行评估，这些优化方法在设计时未考虑到防御。我们认为这种评估过程是有缺陷的。相反，我们应该评估针对自适应攻击者的防御，这些攻击者明确修改其攻击策略以对抗防御设计，同时花费大量资源来优化其目标。通过系统地调整和扩展通用优化技术（梯度下降、强化学习、随机搜索和人工引导探索），我们绕过了最近的 12 种防御（基于一组不同的技术），大多数攻击成功率超过 90%；重要的是，大多数防御最初报告的攻击成功率接近于零。我们认为，未来的防御工作必须考虑更强大的攻击，例如我们所描述的攻击，以便做出可靠和令人信服的稳健性声明。

iMoWM: Taming Interactive Multi-Modal World Model for Robotic Manipulation

iMoWM：驯服用于机器人操纵的交互式多模态世界模型

Authors: Chuanrui Zhang, Zhengxian Wu, Guanxing Lu, Yansong Tang, Ziwei Wang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.09036
Pdf link: https://arxiv.org/pdf/2510.09036
Abstract Learned world models hold significant potential for robotic manipulation, as they can serve as a simulator for real-world interactions. While extensive progress has been made in 2D video-based world models, these approaches often lack geometric and spatial reasoning, which is essential for capturing the physical structure of the 3D world. To address this limitation, we introduce iMoWM, a novel interactive world model designed to generate color images, depth maps, and robot arm masks in an autoregressive manner conditioned on actions. To overcome the high computational cost associated with three-dimensional information, we propose MMTokenizer, which unifies multi-modal inputs into a compact token representation. This design enables iMoWM to leverage large-scale pretrained VideoGPT models while maintaining high efficiency and incorporating richer physical information. With its multi-modal representation, iMoWM not only improves the visual quality of future predictions but also serves as an effective simulator for model-based reinforcement learning (MBRL) and facilitates real-world imitation learning. Extensive experiments demonstrate the superiority of iMoWM across these tasks, showcasing the advantages of multi-modal world modeling for robotic manipulation. Homepage: this https URL
中文摘要 学习世界模型在机器人操纵方面具有巨大的潜力，因为它们可以作为现实世界交互的模拟器。虽然基于 2D 视频的世界模型取得了长足的进步，但这些方法往往缺乏几何和空间推理，而这对于捕捉 3D 世界的物理结构至关重要。为了解决这一限制，我们引入了 iMoWM，这是一种新颖的交互式世界模型，旨在以动作为条件的自回归方式生成彩色图像、深度图和机械臂掩码。为了克服与三维信息相关的高计算成本，我们提出了 MMTokenizer，它将多模态输入统一到一个紧凑的标记表示中。这种设计使 iMoWM 能够利用大规模预训练的 VideoGPT 模型，同时保持高效率并整合更丰富的物理信息。凭借其多模态表示，iMoWM 不仅提高了未来预测的视觉质量，而且还可以作为基于模型的强化学习（MBRL）的有效模拟器，并促进现实世界的模仿学习。广泛的实验证明了 iMoWM 在这些任务中的优越性，展示了多模态世界建模在机器人操纵方面的优势。首页：此 https URL

Robust Driving Control for Autonomous Vehicles: An Intelligent General-sum Constrained Adversarial Reinforcement Learning Approach

自动驾驶汽车鲁棒驾驶控制：一种智能广和约束对抗强化学习方法

Authors: Junchao Fan, Xiaolin Chang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.09041
Pdf link: https://arxiv.org/pdf/2510.09041
Abstract Deep reinforcement learning (DRL) has demonstrated remarkable success in developing autonomous driving policies. However, its vulnerability to adversarial attacks remains a critical barrier to real-world deployment. Although existing robust methods have achieved success, they still suffer from three key issues: (i) these methods are trained against myopic adversarial attacks, limiting their abilities to respond to more strategic threats, (ii) they have trouble causing truly safety-critical events (e.g., collisions), but instead often result in minor consequences, and (iii) these methods can introduce learning instability and policy drift during training due to the lack of robust constraints. To address these issues, we propose Intelligent General-sum Constrained Adversarial Reinforcement Learning (IGCARL), a novel robust autonomous driving approach that consists of a strategic targeted adversary and a robust driving agent. The strategic targeted adversary is designed to leverage the temporal decision-making capabilities of DRL to execute strategically coordinated multi-step attacks. In addition, it explicitly focuses on inducing safety-critical events by adopting a general-sum objective. The robust driving agent learns by interacting with the adversary to develop a robust autonomous driving policy against adversarial attacks. To ensure stable learning in adversarial environments and to mitigate policy drift caused by attacks, the agent is optimized under a constrained formulation. Extensive experiments show that IGCARL improves the success rate by at least 27.9\% over state-of-the-art methods, demonstrating superior robustness to adversarial attacks and enhancing the safety and reliability of DRL-based autonomous driving.
中文摘要 深度强化学习（DRL）在自动驾驶政策的制定方面取得了显著的成功。然而，它对对抗性攻击的脆弱性仍然是实际部署的关键障碍。尽管现有的稳健方法已经取得了成功，但它们仍然存在三个关键问题：（i）这些方法针对短视的对抗性攻击进行训练，限制了它们应对更具战略性威胁的能力，（ii）它们难以引起真正的安全关键事件（例如碰撞），但往往会导致轻微的后果，以及（iii）由于缺乏稳健约束，这些方法可能会在训练过程中引入学习不稳定性和策略漂移。为了解决这些问题，我们提出了智能广和约束对抗强化学习（IGCARL），这是一种由战略目标对手和鲁棒驾驶代理组成的新型鲁棒自动驾驶方法。战略目标对手旨在利用 DRL 的时间决策能力来执行战略协调的多步骤攻击。此外，它明确关注通过采用总和目标来诱发安全关键事件。强大的驾驶代理通过与对手交互来学习，以制定针对对抗性攻击的强大自动驾驶策略。为了保证在对抗环境中的稳定学习并减轻攻击引起的策略漂移，在约束公式下对代理进行了优化。大量实验表明，与最先进的方法相比，IGCARL的成功率至少提高了27.9%，表现出对对抗攻击的优越鲁棒性，并增强了基于DRL的自动驾驶的安全性和可靠性。

Sensing, Detection and Localization for Low Altitude UAV: A RF-Based Framework via Multiple BSs Collaboration

低空无人机的传感、探测和定位：基于射频的多基站协同框架

Authors: Tianhao Liang, Mu Jia, Tingting Zhang, Junting Chen, Longyu Zhou, Tony Q. S. Quek, Pooi-Yuen Kam
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2510.09055
Pdf link: https://arxiv.org/pdf/2510.09055
Abstract The rapid growth of the low-altitude economy has resulted in a significant increase in the number of low, slow, and small (LSS) unmanned aerial vehicles (UAVs), raising critical challenges for secure airspace management and reliable trajectory planning. To address this, this paper proposes a cooperative radio-frequency (RF) detection and localization framework that leverages existing cellular base stations (BSs). The proposed approach features a robust scheme for LSS target identification, integrating a cell averaging-constant false alarm rate (CA-CFAR) detector with a micro-Doppler signature (MDS) based recognition method. Multi-station measurements are fused through a grid-based probabilistic algorithm combined with clustering techniques, effectively mitigating ghost targets and improving localization accuracy in multi-UAV scenarios. Furthermore, the Cramer-Rao lower bound (CRLB) is derived as a performance benchmark and reinforcement learning (RL)-based optimization is employed to balance localization accuracy against station resource usage. Simulations demonstrate that increasing from one to multiple BSs reduces the positioning error to near the CRLB, while practical experiments further verify the framework's effectiveness. Furthermore, our RL-based optimization can find solutions that maintain high accuracy while minimizing resource usage, highlighting its potential as a scalable solution for ensuring airspace safety in the emerging low-altitude economy.
中文摘要 低空经济的快速增长导致低空、慢速、小型（LSS）无人机（UAV）数量显着增加，为安全空域管理和可靠的轨迹规划提出了严峻挑战。为了解决这个问题，本文提出了一种利用现有蜂窝基站（BS）的协同射频（RF）检测和定位框架。所提出的方法具有稳健的LSS目标识别方案，将单元平均恒虚警率（CA-CFAR）检测器与基于微多普勒特征（MDS）的识别方法相结合。通过基于网格的概率算法结合聚类技术，融合多站测量，有效缓解虚假目标，提高多无人机场景下的定位精度。此外，推导了Cramer-Rao下界（CRLB）作为性能基准，并采用基于强化学习（RL）的优化来平衡定位精度与站点资源利用。仿真表明，从一个基站增加到多个基站将定位误差减小到CRLB附近，而实际实验进一步验证了该框架的有效性。此外，我们基于 RL 的优化可以找到在保持高精度同时最大限度地减少资源使用的解决方案，凸显了其作为确保新兴低空经济中空域安全的可扩展解决方案的潜力。
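
The CA-CFAR detector named in the abstract can be sketched in one dimension as follows. This is a textbook cell-averaging CFAR, not the paper's implementation: the window sizes, the false-alarm probability, and the threshold factor (which assumes exponentially distributed noise power, i.e. a square-law detector) are assumptions, and `ca_cfar` is a hypothetical helper name.

```python
def ca_cfar(power: list[float], num_train: int = 8, num_guard: int = 2,
            pfa: float = 1e-3) -> list[int]:
    """Return indices of cells whose power exceeds the CA-CFAR threshold.

    num_train is the total number of training cells (split evenly on both
    sides of the cell under test); num_guard is the guard cells per side.
    """
    if num_train % 2 != 0:
        raise ValueError("num_train must be even")
    half_train = num_train // 2
    # Threshold scale for a target false-alarm rate under exponential noise.
    alpha = num_train * (pfa ** (-1.0 / num_train) - 1.0)
    margin = half_train + num_guard
    detections = []
    for i in range(margin, len(power) - margin):
        # Training cells on each side, skipping the guard cells next to cell i.
        lead = power[i - margin: i - num_guard]
        lag = power[i + num_guard + 1: i + margin + 1]
        noise_level = (sum(lead) + sum(lag)) / num_train
        if power[i] > alpha * noise_level:
            detections.append(i)
    return detections


# Flat noise floor with a single strong return in cell 20.
profile = [1.0] * 40
profile[20] = 50.0
hits = ca_cfar(profile)
```

Because the threshold is scaled from the locally estimated noise level, the false-alarm rate stays roughly constant as the noise floor changes. The guard cells keep energy from the target itself out of the noise estimate.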

Leading the Follower: Learning Persuasive Agents in Social Deduction Games

领导追随者：在社交推理游戏中学习说服力代理

Authors: Zhang Zheng, Deheng Ye, Peilin Zhao, Hao Wang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.09087
Pdf link: https://arxiv.org/pdf/2510.09087
Abstract Large language model (LLM) agents have shown remarkable progress in social deduction games (SDGs). However, existing approaches primarily focus on information processing and strategy selection, overlooking the significance of persuasive communication in influencing other players' beliefs and responses. In SDGs, success depends not only on making correct deductions but also on convincing others to respond in alignment with one's intent. To address this limitation, we formalize turn-based dialogue in SDGs as a Stackelberg competition, where the current player acts as the leader who strategically influences the follower's response. Building on this theoretical foundation, we propose a reinforcement learning framework that trains agents to optimize utterances for persuasive impact. Through comprehensive experiments across three diverse SDGs, we demonstrate that our agents significantly outperform baselines. This work represents a significant step toward developing AI agents capable of strategic social influence, with implications extending to scenarios requiring persuasive communication.
中文摘要 大型语言模型（LLM）代理在社交推理游戏（SDG）方面取得了显著进展。然而，现有方法主要侧重于信息处理和策略选择，而忽视了说服性沟通在影响其他参与者的信念和反应方面的重要性。在社交推理游戏中，成功不仅取决于做出正确的推论，还取决于说服他人按照自己的意图做出反应。为了解决这一限制，我们将社交推理游戏中的回合制对话形式化为斯塔克尔伯格竞争，其中当前玩家充当领导者，战略性地影响追随者的反应。在此理论基础上，我们提出了一种强化学习框架，该框架可以训练智能体优化话语以产生说服力。通过跨三个不同社交推理游戏的综合实验，我们证明我们的代理明显优于基线。这项工作代表着朝着开发具有战略社会影响力的人工智能代理迈出了重要一步，其影响延伸到需要有说服力的沟通的场景。

Agentic-KGR: Co-evolutionary Knowledge Graph Construction through Multi-Agent Reinforcement Learning

Agentic-KGR：通过多智能体强化学习构建共同进化知识图谱

Authors: Jing Li, Zhijie Sun, Zhicheng Zhou, Suming Qiu, Junjie Huang, Haijia Sun, Linyuan Qiu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.09156
Pdf link: https://arxiv.org/pdf/2510.09156
Abstract Current knowledge-enhanced large language models (LLMs) rely on static, pre-constructed knowledge bases that suffer from coverage gaps and temporal obsolescence, limiting their effectiveness in dynamic information environments. We present Agentic-KGR, a novel framework enabling co-evolution between LLMs and knowledge graphs (KGs) through multi-round reinforcement learning (RL). Our approach introduces three key innovations: (1) a dynamic schema expansion mechanism that systematically extends graph ontologies beyond pre-defined boundaries during training; (2) a retrieval-augmented memory system enabling synergistic co-evolution between model parameters and knowledge structures through continuous optimization; (3) a learnable multi-scale prompt compression approach that preserves critical information while reducing computational complexity through adaptive sequence optimization. Experimental results demonstrate substantial improvements over supervised baselines and single-round RL approaches in knowledge extraction tasks. When integrated with GraphRAG, our method achieves superior performance in downstream QA tasks, with significant gains in both accuracy and knowledge coverage compared to existing methods.
中文摘要 当前的知识增强型大型语言模型（LLM）依赖于静态的、预先构建的知识库，这些知识库存在覆盖差距和时间过时，限制了它们在动态信息环境中的有效性。我们提出了 Agentic-KGR，这是一个新颖的框架，通过多轮强化学习（RL）实现 LLM 和知识图谱（KG）之间的共同进化。我们的方法引入了三个关键创新：（1）动态模式扩展机制，在训练过程中系统地将图本体扩展到预定义边界之外;（2）检索增强记忆系统，通过持续优化实现模型参数和知识结构之间的协同协同进化;（3）一种可学习的多尺度提示压缩方法，通过自适应序列优化保留关键信息，同时降低计算复杂度。实验结果表明，在知识提取任务中，与监督基线和单轮 RL 方法相比，存在显着改进。当与 GraphRAG 集成时，我们的方法在下游 QA 任务中取得了卓越的性能，与现有方法相比，在准确性和知识覆盖率方面都有显着提升。

Hierarchical Semantic RL: Tackling the Problem of Dynamic Action Space for RL-based Recommendations

分层语义 RL：解决基于 RL 的建议的动态行动空间问题

Authors: Minmao Wang, Xingchen Liu, Shijie Yi, Likang Wu, Hongke Zhao, Fei Pan, Qingpeng Cai, Peng Jiang
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2510.09167
Pdf link: https://arxiv.org/pdf/2510.09167
Abstract Recommender Systems (RS) are fundamental to modern online services. While most existing approaches optimize for short-term engagement, recent work has begun to explore reinforcement learning (RL) to model long-term user value. However, these efforts face significant challenges due to the vast, dynamic action spaces inherent in recommendation, which hinder stable policy learning. To resolve this bottleneck, we introduce Hierarchical Semantic RL (HSRL), which reframes RL-based recommendation over a fixed Semantic Action Space (SAS). HSRL encodes items as Semantic IDs (SIDs) for policy learning, and maps SIDs back to their original items via a fixed, invertible lookup during execution. To align decision-making with SID generation, the Hierarchical Policy Network (HPN) operates in a coarse-to-fine manner, employing hierarchical residual state modeling to refine each level's context from the previous level's residual, thereby stabilizing training and reducing representation-decision mismatch. In parallel, a Multi-level Critic (MLC) provides token-level value estimates, enabling fine-grained credit assignment. Across public benchmarks and a large-scale production dataset from a leading Chinese short-video advertising platform, HSRL consistently surpasses state-of-the-art baselines. In a seven-day online A/B test, it delivers an 18.421% CVR lift with only a 1.251% increase in cost, supporting HSRL as a scalable paradigm for RL-based recommendation. Our code is released at this https URL.
中文摘要 推荐系统（RS）是现代在线服务的基础。虽然大多数现有方法都针对短期参与进行了优化，但最近的工作已经开始探索强化学习（RL）来模拟长期用户价值。然而，由于建议固有的巨大、动态的行动空间，阻碍了稳定的政策学习，这些努力面临着重大挑战。为了解决这一瓶颈，我们引入了分层语义 RL （HSRL），它将基于 RL 的推荐重新构建到固定的语义作空间（SAS）上。HSRL 将项目编码为语义 ID （SID）以进行策略学习，并在执行期间通过固定的可逆查找将 SID 映射回其原始项目。为了使决策与SID生成保持一致，分层策略网络（HPN）以从粗到细的方式运行，采用分层残差状态模型从前一级的残差中细化每个级别的上下文，从而稳定训练并减少表示-决策不匹配。同时，多级评论家（MLC）提供代币级价值估计，从而实现细粒度的信用分配。通过公开基准和来自中国领先短视频广告平台的大规模制作数据集，HSRL 始终超越最先进的基线。在为期 7 天的 A/B 测试的在线部署中，它提供了 18.421% 的 CVR 提升，而成本仅增加了 1.251%，支持 HSRL 作为基于 RL 的推荐的可扩展范例。我们的代码在此 https URL 上发布。

Obstacle Avoidance using Dynamic Movement Primitives and Reinforcement Learning

使用动态运动基元和强化学习的避障

Authors: Dominik Urbaniak, Alejandro Agostini, Pol Ramon, Jan Rosell, Raúl Suárez, Michael Suppa
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.09254
Pdf link: https://arxiv.org/pdf/2510.09254
Abstract Learning-based motion planning can quickly generate near-optimal trajectories. However, it often requires either large training datasets or costly collection of human demonstrations. This work proposes an alternative approach that quickly generates smooth, near-optimal collision-free 3D Cartesian trajectories from a single artificial demonstration. The demonstration is encoded as a Dynamic Movement Primitive (DMP) and iteratively reshaped using policy-based reinforcement learning to create a diverse trajectory dataset for varying obstacle configurations. This dataset is used to train a neural network that takes as inputs the task parameters describing the obstacle dimensions and location, derived automatically from a point cloud, and outputs the DMP parameters that generate the trajectory. The approach is validated in simulation and real-robot experiments, outperforming an RRT-Connect baseline in terms of computation and execution time, as well as trajectory length, while supporting multi-modal trajectory generation for different obstacle geometries and end-effector dimensions. Videos and the implementation code are available at this https URL.
中文摘要 基于学习的运动规划可以快速生成近乎最佳的轨迹。然而，它通常需要大型训练数据集或昂贵的人类演示收集。这项工作提出了一种替代方法，可以从单个人工演示中快速生成平滑、接近最佳的无碰撞 3D 直角轨迹。该演示被编码为动态运动原语（DMP），并使用基于策略的强化学习进行迭代重塑，为不同的障碍物配置创建多样化的轨迹数据集。该数据集用于训练神经网络，该神经网络将描述障碍物尺寸和位置的任务参数作为输入，这些参数从点云中自动导出，并输出生成轨迹的DMP参数。该方法在仿真和真实机器人实验中得到了验证，在计算和执行时间以及轨迹长度方面优于 RRT-Connect 基线，同时支持针对不同障碍物几何形状和末端执行器尺寸的多模态轨迹生成。视频和实现代码可在此 https URL 中找到。
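To make "DMP parameters that generate the trajectory" concrete, here is a minimal one-dimensional sketch of the textbook DMP formulation (spring-damper transformation system plus a phase-gated forcing term). It is an illustration under assumptions, not the authors' implementation: the gains, the basis width, the weights, and the name `rollout_dmp` are all chosen for the example.

```python
import math


def rollout_dmp(y0, goal, weights, tau=1.0, duration=2.0, dt=0.001,
                alpha_z=25.0, alpha_x=4.0, width=50.0):
    """Integrate a one-dimensional DMP with explicit Euler steps."""
    beta_z = alpha_z / 4.0  # critical damping of the spring-damper system
    n = len(weights)
    # Basis-function centers spaced along the decaying phase variable x.
    centers = [math.exp(-alpha_x * i / max(n - 1, 1)) for i in range(n)]
    y, z, x = y0, 0.0, 1.0
    trajectory = [y]
    for _ in range(int(round(duration / dt))):
        psi = [math.exp(-width * (x - c) ** 2) for c in centers]
        # Forcing term: normalized weighted basis functions, gated by the phase.
        forcing = sum(p * w for p, w in zip(psi, weights)) / (sum(psi) + 1e-10)
        forcing *= x * (goal - y0)
        dz = (alpha_z * (beta_z * (goal - y) - z) + forcing) / tau
        dy = z / tau
        dx = -alpha_x * x / tau
        z, y, x = z + dz * dt, y + dy * dt, x + dx * dt
        trajectory.append(y)
    return trajectory


# Zero weights give a plain point-to-point motion; nonzero weights reshape it.
plain = rollout_dmp(0.0, 1.0, [0.0] * 5)
shaped = rollout_dmp(0.0, 1.0, [50.0] * 5)
print(round(plain[-1], 4), round(shaped[-1], 4))
```

Because the forcing term is multiplied by the decaying phase variable, changing the weights alters the path taken while both rollouts still settle at the goal, which is what makes the weights a convenient target for a learned model or an RL update.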

Detecting Data Contamination from Reinforcement Learning Post-training for Large Language Models

检测大型语言模型后训练强化学习中的数据污染

Authors: Yongding Tao, Tian Wang, Yihong Dong, Huanyu Liu, Kechi Zhang, Xiaolong Hu, Ge Li
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.09259
Pdf link: https://arxiv.org/pdf/2510.09259
Abstract Data contamination poses a significant threat to the reliable evaluation of Large Language Models (LLMs). This issue arises when benchmark samples inadvertently appear in training sets, compromising the validity of reported performance. While detection methods have been developed for the pre-training and Supervised Fine-Tuning stages, a critical research gap exists for the increasingly significant phase of Reinforcement Learning (RL) post-training. As RL post-training becomes pivotal for advancing LLM reasoning, the absence of specialized contamination detection methods in this paradigm presents a critical vulnerability. To address this, we conduct the first systematic study of data detection within the RL post-training scenario and propose Self-Critique. Our method is motivated by a key observation: after the RL phase, the output entropy distribution of LLMs tends to collapse into highly specific and sparse modes. Self-Critique probes for the underlying policy collapse, i.e., the model's convergence to a narrow reasoning path, which causes this entropy reduction. To facilitate this research, we also introduce RL-MIA, a benchmark constructed to simulate this specific contamination scenario. Extensive experiments show that Self-Critique significantly outperforms baseline methods across multiple models and contamination tasks, achieving an AUC improvement of up to 30%. Whereas existing methods are close to a random guess for RL-phase contamination, our method makes detection possible.
中文摘要 数据污染对大型语言模型（LLM）的可靠评估构成重大威胁。当基准样本可能无意中出现在训练集中，从而损害报告性能的有效性时，就会出现此问题。虽然已经为预训练和监督微调阶段开发了检测方法，但对于训练后日益重要的强化学习（RL）阶段，存在一个关键的研究差距。随着 RL 后训练成为推进 LLM 推理的关键，该范式中缺乏专门的污染检测方法带来了一个严重的漏洞。为了解决这个问题，我们对RL后训练场景中的数据检测进行了首次系统研究，并提出了自我批评。我们的方法受到一个关键观察的推动：在RL阶段之后，LLM的输出熵分布往往会崩溃为高度特异性和稀疏模式。自我批判探究潜在的政策崩溃，即模型收敛到狭窄的推理路径，这导致了这种熵减。为了促进这项研究，我们还引入了 RL-MIA，这是一个为模拟这种特定污染场景而构建的基准。大量实验表明，Self-Critique 在多个模型和污染任务中明显优于基线方法，实现了高达 30% 的 AUC 改进。虽然现有方法接近于对 RL 相污染的随机猜测，但我们的方法使检测成为可能。
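The "entropy collapse" observation in the abstract above can be made tangible with Shannon entropy over a next-token distribution. This sketch only illustrates the quantity being discussed; it is not the Self-Critique procedure, and both probability vectors are invented for the example.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical next-token distributions over a four-token vocabulary.
before_rl = [0.25, 0.25, 0.25, 0.25]  # spread out: many continuations plausible
after_rl = [0.97, 0.01, 0.01, 0.01]   # collapsed: one continuation dominates

print(shannon_entropy(before_rl), shannon_entropy(after_rl))
```

A uniform distribution over four tokens has entropy ln 4, while the collapsed one is far lower, which is the kind of gap an entropy-based signal would look for.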

CLARity: Reasoning Consistency Alone Can Teach Reinforced Experts

CLARity：仅推理一致性就可以教出强化专家

Authors: Jiuheng Lin, Cong Jiang, Zirui Wu, Jiarui Sun, Yansong Feng
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.09278
Pdf link: https://arxiv.org/pdf/2510.09278
Abstract Training expert LLMs in domains with scarce data is difficult, often relying on multiple-choice questions (MCQs). However, standard outcome-based reinforcement learning (RL) on MCQs is risky. While it may improve accuracy, we observe it often degrades reasoning quality such as logical consistency. Existing solutions to supervise reasoning, such as large-scale Process Reward Models (PRMs), are prohibitively expensive. To address this, we propose CLARity, a cost-effective RL framework that enhances reasoning quality using only a small, general-purpose LLM. CLARity integrates a consistency-aware reward mechanism with a 2-stage refine-then-monitor training pipeline to enhance reasoning consistency, and a dynamic data reformulation strategy to better exploit limited data. Experiments demonstrate that CLARity improves response consistency by 16.5% and accuracy by 7.5% over baselines. Human evaluations further confirm holistic improvements in coherence and professionalism. Thus, CLARity offers a generalizable solution that enables smaller models to effectively guide expert models by reasoning this http URL code is open sourced at: this https URL
中文摘要 在数据稀缺的领域培训专家法学硕士很困难，通常依赖于多项选择题（MCQ）。然而，MCQ 上的标准基于结果的强化学习（RL）是有风险的。虽然它可以提高准确性，但我们观察到它经常会降低推理质量，例如逻辑一致性。现有的监督推理解决方案，例如大规模过程奖励模型（PRM），成本高得令人望而却步。为了解决这个问题，我们提出了 CLARity，这是一个经济高效的 RL 框架，它仅使用小型通用 LLM 即可提高推理质量。CLARity 集成了一致性感知奖励机制和 2 阶段精炼然后监控训练管道，以增强推理一致性，并集成了动态数据重新制定策略，以更好地利用有限的数据。实验表明，与基线相比，CLARity 将响应一致性提高了 16.5%，准确性提高了 7.5%。人工评估进一步证实了连贯性和专业性的整体改进。因此，CLARity 提供了一个通用的解决方案，使较小的模型能够通过推断此 http URL 代码开源于以下位置来有效地指导专家模型：此 https URL

Spotlight on Token Perception for Multimodal Reinforcement Learning

聚焦多模态强化学习的标记感知

Authors: Siyuan Huang, Xiaoye Qu, Yafu Li, Yun Luo, Zefeng He, Daizong Liu, Yu Cheng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2510.09285
Pdf link: https://arxiv.org/pdf/2510.09285
Abstract While Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capabilities of Large Vision-Language Models (LVLMs), most existing methods in multimodal reasoning neglect the critical role of visual perception within the RLVR optimization process. In this paper, we undertake a pioneering exploration of multimodal RLVR through the novel perspective of token perception, which measures the visual dependency of each generated token. With a granular analysis of Chain-of-Thought (CoT) processes, we uncover two key insights: first, token perception in a rollout trajectory is sparsely distributed, where only a small fraction of tokens have high visual dependency for visually-grounded reasoning; second, different trajectories exhibit significant divergence in their overall visual dependency. Based on these observations, we propose Visually-Perceptive Policy Optimization (VPPO), a novel policy gradient algorithm that explicitly leverages token perception to refine the learning signal. Specifically, VPPO achieves this through a dual mechanism: it reweights a trajectory's advantage by its overall visual dependency, and focuses policy updates exclusively on perceptually pivotal tokens. On a comprehensive suite of eight perception and reasoning benchmarks, VPPO demonstrates substantial gains over leading open-source RL-tuned models, with its effectiveness consistently validated across 7B and 32B model scales. Our findings not only establish a new token-level perceptual perspective for analyzing multimodal RLVR but also present a novel and effective optimization strategy to significantly enhance the multimodal reasoning capabilities of LVLMs.
中文摘要 虽然具有可验证奖励的强化学习（Reinforcement Learning with Verifiable Rewards，RLVR）已经提高了大型视觉语言模型（Large Vision-Language Models，LVLM）的推理能力，但大多数现有的多模态推理方法都忽视了视觉感知在RLVR优化过程中的关键作用。在本文中，我们通过标记感知的新颖视角对多模态 RLVR 进行了开创性的探索，该视角衡量了每个生成的标记的视觉依赖性。通过对思维链（CoT）过程的精细分析，我们发现了两个关键见解：首先，推出轨迹中的标记感知分布稀疏，只有一小部分标记对视觉基础推理具有高度的视觉依赖性;其次，不同的轨迹在整体视觉依赖性上表现出显着的差异。基于这些观察结果，我们提出了视觉感知策略优化（VPPO），这是一种新颖的策略梯度算法，它明确地利用标记感知来细化学习信号。具体来说，VPPO 通过双重机制实现这一目标：它通过其整体视觉依赖性重新加权轨迹的优势，并将政策更新专门集中在感知上关键的代币上。在一套包含八个感知和推理基准的综合套件中，VPPO 与领先的开源 RL 调整模型相比表现出显着优势，其有效性在 7B 和 32B 模型规模上得到一致验证。研究结果不仅为分析多模态RLVR建立了新的token级感知视角，而且提出了一种新颖有效的优化策略，以显著增强LVLM的多模态推理能力。
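The dual mechanism described in the abstract above (reweight a trajectory's advantage by its overall visual dependency, then update only on the most perceptually pivotal tokens) can be sketched as follows. The abstract does not give the exact formulas, so everything here is an assumption: per-token dependency scores are taken as given, the trajectory weight is their mean, and the pivotal set is a top-k selection controlled by a hypothetical `keep_fraction`.

```python
def shape_advantages(advantage, token_dependency, keep_fraction=0.4):
    """Return per-token advantages after trajectory reweighting and masking.

    advantage: scalar advantage of the whole rollout.
    token_dependency: assumed per-token visual-dependency scores.
    keep_fraction: assumed share of tokens treated as perceptually pivotal.
    """
    n = len(token_dependency)
    k = max(1, round(keep_fraction * n))
    # Indices of the k tokens with the highest visual dependency.
    ranked = sorted(range(n), key=lambda i: token_dependency[i], reverse=True)
    pivotal = set(ranked[:k])
    # Assumed trajectory-level weight: mean dependency over the rollout.
    trajectory_weight = sum(token_dependency) / n
    return [advantage * trajectory_weight if i in pivotal else 0.0
            for i in range(n)]


shaped = shape_advantages(1.0, [0.1, 0.9, 0.2, 0.8, 0.0])
print(shaped)
```

Only the two highest-dependency tokens keep a nonzero learning signal here, and that signal is scaled by how visually grounded the rollout is as a whole.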

Rate optimal learning of equilibria from data

从数据中对均衡进行最优学习

Authors: Till Freihaut, Luca Viano, Emanuele Nevali, Volkan Cevher, Matthieu Geist, Giorgia Ramponi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.09325
Pdf link: https://arxiv.org/pdf/2510.09325
Abstract We close open theoretical gaps in Multi-Agent Imitation Learning (MAIL) by characterizing the limits of non-interactive MAIL and presenting the first interactive algorithm with near-optimal sample complexity. In the non-interactive setting, we prove a statistical lower bound that identifies the all-policy deviation concentrability coefficient as the fundamental complexity measure, and we show that Behavior Cloning (BC) is rate-optimal. For the interactive setting, we introduce a framework that combines reward-free reinforcement learning with interactive MAIL and instantiate it with an algorithm, MAIL-WARM. It improves the best previously known sample complexity from $\mathcal{O}(\varepsilon^{-8})$ to $\mathcal{O}(\varepsilon^{-2})$, matching the dependence on $\varepsilon$ implied by our lower bound. Finally, we provide numerical results that support our theory and illustrate, in environments such as grid worlds, where Behavior Cloning fails to learn.
中文摘要 我们通过表征非交互式 MAIL 的局限性并提出第一个具有接近最佳样本复杂度的交互式算法，弥补了多智能体模仿学习（MAIL）中开放的理论空白。在非交互式环境中，我们证明了一个统计下限，该下限将所有策略偏差集中性系数确定为基本复杂性度量，并且我们表明行为克隆（BC）是速率最优的。对于交互式设置，我们引入了一个框架，该框架将无奖励强化学习与交互式 MAIL 相结合，并使用算法 MAIL-WARM 对其进行实例化。它将以前已知的最佳样本复杂度从 $\mathcal{O}（\varepsilon^{-8}）$ 提高到 $\mathcal{O}（\varepsilon^{-2}），$，与我们的下限所暗示的对 $\varepsilon$ 的依赖性相匹配。最后，我们提供了支持我们的理论的数值结果，并说明在网格世界等环境中，行为克隆无法学习。

Safety Game: Balancing Safe and Informative Conversations with Blackbox Agentic AI using LP Solvers

安全游戏：使用 LP 求解器与黑盒代理 AI 平衡安全和信息丰富的对话

Authors: Tuan Nguyen, Long Tran-Thanh
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.09330
Pdf link: https://arxiv.org/pdf/2510.09330
Abstract Ensuring that large language models (LLMs) comply with safety requirements is a central challenge in AI deployment. Existing alignment approaches primarily operate during training, such as through fine-tuning or reinforcement learning from human feedback, but these methods are costly and inflexible, requiring retraining whenever new requirements arise. Recent efforts toward inference-time alignment mitigate some of these limitations but still assume access to model internals, which is impractical, and not suitable for third party stakeholders who do not have access to the models. In this work, we propose a model-independent, black-box framework for safety alignment that does not require retraining or access to the underlying LLM architecture. As a proof of concept, we address the problem of trading off between generating safe but uninformative answers and helpful yet potentially risky ones. We formulate this dilemma as a two-player zero-sum game whose minimax equilibrium captures the optimal balance between safety and helpfulness. LLM agents operationalize this framework by leveraging a linear programming solver at inference time to compute equilibrium strategies. Our results demonstrate the feasibility of black-box safety alignment, offering a scalable and accessible pathway for stakeholders, including smaller organizations and entities in resource-constrained settings, to enforce safety across rapidly evolving LLM ecosystems.
中文摘要 确保大型语言模型（LLM）符合安全要求是 AI 部署的核心挑战。现有的对齐方法主要在训练期间运行，例如通过微调或从人类反馈中强化学习，但这些方法成本高昂且缺乏灵活性，每当出现新需求时都需要重新训练。最近对推理时间对齐的努力减轻了其中一些限制，但仍然假设可以访问模型内部，这是不切实际的，并且不适合无法访问模型的第三方利益相关者。在这项工作中，我们提出了一个独立于模型的黑盒框架，用于安全对齐，不需要重新训练或访问底层 LLM 架构。作为概念验证，我们解决了在生成安全但信息不丰富的答案与有用但有潜在风险的答案之间进行权衡的问题。我们将这一困境表述为两人零和博弈，其极小极大均衡捕捉到了安全性和有用性之间的最佳平衡。LLM 代理通过在推理时利用线性规划求解器来计算均衡策略来作该框架。我们的结果证明了黑盒安全调整的可行性，为利益相关者（包括资源受限环境中的小型组织和实体）提供了一条可扩展且可访问的途径，以在快速发展的 LLM 生态系统中加强安全。
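The minimax equilibrium mentioned in the abstract above is the solution of a linear program. For a row player with only two actions, that LP has a single variable and can be solved exactly by checking its vertices, which is what this sketch does. It is an illustration under assumptions, not the paper's formulation: the payoff table and the reading of the rows as "safe" versus "helpful" answers are hypothetical, and a real system would hand the general case to an LP solver.

```python
from itertools import combinations


def solve_2xn_zero_sum(payoff):
    """Maximin mixed strategy for the row player of a 2 x N zero-sum game.

    payoff[r][c] is the row player's payoff. Returns (p, value), where p is
    the probability of playing row 0.
    """
    top, bottom = payoff

    def worst_case(p):
        # The column player answers with the column that hurts the row player most.
        return min(p * a + (1.0 - p) * b for a, b in zip(top, bottom))

    # The optimum of this one-variable LP lies at an endpoint or where two
    # column payoff lines cross, so only those points need to be checked.
    candidates = [0.0, 1.0]
    for i, j in combinations(range(len(top)), 2):
        denom = (top[i] - bottom[i]) - (top[j] - bottom[j])
        if denom != 0.0:
            p = (bottom[j] - bottom[i]) / denom
            if 0.0 <= p <= 1.0:
                candidates.append(p)
    best_p = max(candidates, key=worst_case)
    return best_p, worst_case(best_p)


# Hypothetical payoffs: row 0 = safe answer, row 1 = helpful answer.
p_safe, game_value = solve_2xn_zero_sum([[3.0, 1.0], [0.0, 2.0]])
print(p_safe, game_value)
```

In this toy table neither pure answer is best: mixing the two equally guarantees a higher worst-case payoff than always answering safely or always answering helpfully.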

Logit Arithmetic Elicits Long Reasoning Capabilities Without Training

Logit 算术无需训练即可引发长时间的推理能力

Authors: Yunxiang Zhang, Muhammad Khalifa, Lechen Zhang, Xin Liu, Ayoung Lee, Xinliang Frederick Zhang, Farima Fatahi Bayat, Lu Wang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.09354
Pdf link: https://arxiv.org/pdf/2510.09354
Abstract Large reasoning models exhibit long chain-of-thought reasoning with strategies such as backtracking and self-correction, though recent studies suggest that these abilities typically require additional training. We first investigate whether such behaviors can be elicited without any training. To this end, we propose a decoding-time approach, ThinkLogit, which utilizes logit arithmetic to tune a target large non-reasoning model for long reasoning using a substantially smaller reasoning model as the guider. We then show that we can further boost its performance by training the guider model with preference optimization over correct/incorrect reasoning pairs sampled from both the target and guider model, a setup we refer to as ThinkLogit-DPO. Our experiments demonstrate that ThinkLogit and ThinkLogit-DPO achieve relative improvements in average accuracy of 24.5% and 29.1%, respectively, over five reasoning benchmarks using Qwen2.5-32B guided by R1-Distill-Qwen-1.5B, a model 21x smaller. Moreover, we find that ThinkLogit remains effective when the guider and target come from different model families. It is also orthogonal to post-training methods for small models, as guiders improved through supervised distillation or reinforcement learning can be directly plugged in to yield stronger large models, offering a practical path to unlock long reasoning in large-scale models without costly post-training.
中文摘要 大型推理模型表现出长思维链推理，具有回溯和自我纠正等策略，尽管最近的研究表明这些能力通常需要额外的训练。我们首先调查是否可以在没有任何训练的情况下引发此类行为。为此，我们提出了一种解码时间方法，即 ThinkLogit，它利用 logit 算术来调整目标大型非推理模型，以使用一个小得多的推理模型作为指导，以进行长推理。然后，我们表明，我们可以通过训练引导模型来进一步提高其性能，并优先于从目标模型和引导模型中采样的正确/错误推理对，我们将这种设置称为 ThinkLogit-DPO。我们的实验表明，在使用R1-Distill-Qwen-1.5B（模型小21倍）的Qwen2.5-32B指导下，ThinkLogit和ThinkLogit-DPO的平均准确率分别提高了24.5%和29.1%。此外，我们发现当引导器和目标来自不同的模型族时，ThinkLogit仍然有效。它也与小型模型的后训练方法正交，因为可以通过监督蒸馏或强化学习改进的引导器直接插入以产生更强大的大型模型，从而提供了一条实用的途径，可以在大型模型中解锁长推理，而无需昂贵的后训练。
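Decoding-time logit arithmetic, as referred to in the abstract above, can be sketched on a toy vocabulary. The abstract does not state the exact combination rule, so this assumes a proxy-tuning-style form: the large target's logits are shifted by the difference between a small reasoning guider and a small non-reasoning counterpart, scaled by a hypothetical strength `alpha`. All logit values are invented.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def guided_logits(target, guider_reasoning, guider_base, alpha=1.0):
    """Assumed rule: target + alpha * (reasoning guider - base guider)."""
    return [t + alpha * (r - b)
            for t, r, b in zip(target, guider_reasoning, guider_base)]


# Hypothetical next-token logits over a three-token vocabulary.
target = [2.0, 1.0, 0.0]            # large non-reasoning model
guider_reasoning = [0.0, 2.0, 0.0]  # small reasoning model
guider_base = [0.0, 0.0, 0.0]       # small non-reasoning counterpart

combined = guided_logits(target, guider_reasoning, guider_base)
probs = softmax(combined)
print(combined, probs.index(max(probs)))
```

The target alone prefers token 0, but the guider's preference for token 1 shifts the combined distribution, with no parameter of the large model being updated.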

HINT: Helping Ineffective Rollouts Navigate Towards Effectiveness

提示：帮助无效的推出实现有效性

Authors: Xinyi Wang, Jinyi Han, Zishang Jiang, Tingyun Li, Jiaqing Liang, Sihang Jiang, Zhaoqian Dai, Shuguang Ma, Fei Yu, Yanghua Xiao
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.09388
Pdf link: https://arxiv.org/pdf/2510.09388
Abstract Reinforcement Learning (RL) has become a key driver for enhancing the long chain-of-thought (CoT) reasoning capabilities of Large Language Models (LLMs). However, prevalent methods like GRPO often fail when task difficulty exceeds the model's capacity, leading to reward sparsity and inefficient training. While prior work attempts to mitigate this using off-policy data, such as mixing RL with Supervised Fine-Tuning (SFT) or using hints, they often misguide policy updates. In this work, we identify a core issue underlying these failures, which we term low training affinity. This condition arises from a large distributional mismatch between external guidance and the model's policy. To diagnose this, we introduce Affinity, the first quantitative metric for monitoring exploration efficiency and training stability. To improve Affinity, we propose HINT: Helping Ineffective rollouts Navigate Towards effectiveness, an adaptive hinting framework. Instead of providing direct answers, HINT supplies heuristic hints that guide the model to discover solutions on its own, preserving its autonomous reasoning capabilities. Extensive experiments on mathematical reasoning tasks show that HINT consistently outperforms existing methods, achieving state-of-the-art results with models of various scales, while also demonstrating significantly more stable learning and greater data this http URL is available on GitHub.
中文摘要 强化学习（RL）已成为增强大型语言模型（LLM）长思维链（CoT）推理能力的关键驱动力。然而，当任务难度超过模型的能力时，像 GRPO 这样的流行方法往往会失败，导致奖励稀疏和训练效率低下。虽然之前的工作试图使用策略外的数据来缓解这种情况，例如将 RL 与监督微调（SFT）混合使用或使用提示，但它们经常误导策略更新在这项工作中，我们确定了这些失败背后的核心问题，我们称之为低训练亲和力。这种情况是由于外部指导与模型策略之间的巨大分布不匹配而产生的。为了诊断这一点，我们引入了 Affinity，这是第一个用于监控勘探效率和训练稳定性的定量指标。为了提高亲和力，我们提出了 HINT：帮助无效的推出 Navigate Towards effectiveness，一个自适应提示框架。HINT 不是提供直接答案，而是提供启发式提示，指导模型自行发现解决方案，从而保留其自主推理能力。对数学推理任务的广泛实验表明，HINT 始终优于现有方法，在各种规模的模型上取得了最先进的结果，同时还展示了明显更稳定的学习和更大的数据，该 http URL 可在 Github 上找到。
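The reward-sparsity failure mode mentioned in the abstract above follows directly from how group-relative advantages are computed in GRPO-style training: each rollout's reward is normalized against the other rollouts for the same prompt. A minimal sketch, assuming the population standard deviation and a small `eps` for numerical safety (implementations differ on both details):

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One rollout out of four solves the task: a usable learning signal exists.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

# The task is too hard and every rollout fails: all advantages are zero,
# so the policy gradient for this prompt vanishes.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed, all_failed)
```

When no rollout in the group succeeds, there is nothing to contrast against, which is why hints that help at least some rollouts reach a correct answer can restore the training signal.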

Scalable Multi-Agent Path Finding using Collision-Aware Dynamic Alert Mask and a Hybrid Execution Strategy

使用碰撞感知动态警报掩码和混合执行策略的可扩展多代理路径查找

Authors: Bharath Muppasani, Ritirupa Dey, Biplav Srivastava, Vignesh Narayanan
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.09469
Pdf link: https://arxiv.org/pdf/2510.09469
Abstract Multi-agent pathfinding (MAPF) remains a critical problem in robotics and autonomous systems, where agents must navigate shared spaces efficiently while avoiding conflicts. Traditional centralized algorithms that have global information, such as Conflict-Based Search (CBS), provide high-quality solutions but become computationally expensive in large-scale scenarios due to the combinatorial explosion of conflicts that need resolution. Conversely, distributed approaches that have local information, particularly learning-based methods, offer better scalability by operating with relaxed information availability, yet often at the cost of solution quality. To address these limitations, we propose a hybrid framework that combines decentralized path planning with a lightweight centralized coordinator. Our framework leverages reinforcement learning (RL) for decentralized planning, enabling agents to adapt their planning based on minimal, targeted alerts--such as static conflict-cell flags or brief conflict tracks--that the central coordinator shares dynamically for effective conflict resolution. We empirically study the effect of the information available to an agent on its planning performance. Our approach reduces inter-agent information sharing compared to fully centralized and distributed methods, while still consistently finding feasible, collision-free solutions--even in large-scale scenarios with higher agent counts.
中文摘要 多智能体寻路（MAPF）仍然是机器人和自主系统中的一个关键问题，其中智能体必须有效地导航共享空间，同时避免冲突。具有全局信息的传统集中式算法，例如基于冲突的搜索（CBS），提供了高质量的解决方案，但在大规模场景中，由于需要解决的冲突的组合爆炸，计算成本会变得高昂。相反，具有本地信息的分布式方法，特别是基于学习的方法，通过以宽松的信息可用性运行来提供更好的可扩展性，但通常以牺牲解决方案质量为代价。为了解决这些限制，我们提出了一个混合框架，将去中心化路径规划与轻量级集中协调器相结合。我们的框架利用强化学习（RL）进行分散规划，使代理能够根据最少的、有针对性的警报（例如静态冲突单元标志或简短的冲突轨迹）调整其规划，这些警报是来自中央协调员的动态共享信息，以有效解决冲突。我们实证研究了代理可用的信息对其计划绩效的影响。与完全集中和分布式的方法相比，我们的方法减少了代理间信息共享，同时仍然始终如一地找到可行的、无冲突的解决方案——即使在代理数量较多的大规模场景中也是如此。

Multimodal Policy Internalization for Conversational Agents

对话代理的多模态策略内部化

Authors: Zhenhailong Wang, Jiateng Liu, Amin Fazel, Ritesh Sarkhel, Xing Fan, Xiang Li, Chenlei Guo, Heng Ji, Ruhi Sarikaya
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.09474
Pdf link: https://arxiv.org/pdf/2510.09474
Abstract Modern conversational agents like ChatGPT and Alexa+ rely on predefined policies specifying metadata, response styles, and tool-usage rules. As these LLM-based systems expand to support diverse business and user queries, such policies, often implemented as in-context prompts, are becoming increasingly complex and lengthy, making faithful adherence difficult and imposing large fixed computational costs. With the rise of multimodal agents, policies that govern visual and multimodal behaviors are critical but remain understudied. Prior prompt-compression work mainly shortens task templates and demonstrations, while existing policy-alignment studies focus only on text-based safety rules. We introduce Multimodal Policy Internalization (MPI), a new task that internalizes reasoning-intensive multimodal policies into model parameters, enabling stronger policy-following without including the policy during inference. MPI poses unique data and algorithmic challenges. We build two datasets spanning synthetic and real-world decision-making and tool-using tasks and propose TriMPI, a three-stage training framework. TriMPI first injects policy knowledge via continual pretraining, then performs supervised finetuning, and finally applies PolicyRollout, a GRPO-style reinforcement learning extension that augments rollouts with policy-aware responses for grounded exploration. TriMPI achieves notable gains in end-to-end accuracy, generalization, and robustness to forgetting. As the first work on multimodal policy internalization, we provide datasets, training recipes, and comprehensive evaluations to foster future research. Project page: this https URL.
中文摘要 ChatGPT 和 Alexa+ 等现代对话代理依赖于指定元数据、响应样式和工具使用规则的预定义策略。随着这些基于 LLM 的系统扩展以支持不同的业务和用户查询，此类策略（通常作为上下文提示实施）变得越来越复杂和冗长，这使得忠实遵守变得困难并带来巨大的固定计算成本。随着多模态智能体的兴起，管理视觉和多模态行为的政策至关重要，但研究仍然不足。先前的提示压缩工作主要缩短任务模板和演示，而现有的策略调整研究仅关注基于文本的安全规则。我们引入了多模态策略内化（MPI），这是一项新任务，它将推理密集型多模态策略内化到模型参数中，从而在推理过程中不包括策略的情况下实现更强的策略遵循。MPI 带来了独特的数据和算法挑战。我们构建了两个跨越合成和现实世界决策以及工具使用任务的数据集，并提出了三阶段训练框架 TriMPI。TriMPI 首先通过持续的预训练注入策略知识，然后执行监督微调，最后应用 PolicyRollout，这是一种 GRPO 风格的强化学习扩展，通过策略感知响应来增强推出，以实现扎根探索。TriMPI 在端到端准确性、泛化性和遗忘鲁棒性方面取得了显着提升。作为多模态政策内化的第一项工作，我们提供了数据集、训练配方和综合评估，以促进未来的研究。项目页面：此 https URL。

Mitigating Overthinking through Reasoning Shaping

通过推理塑造减少过度思考

Authors: Feifan Song, Shaohang Wei, Bofei Gao, Yejie Wang, Wen Luo, Wei Li, Linli Yao, Weimin Xiong, Liang Chen, Tianyu Liu, Houfeng Wang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.09535
Pdf link: https://arxiv.org/pdf/2510.09535
Abstract Large reasoning models (LRMs) boosted by Reinforcement Learning from Verifier Reward (RLVR) have shown great power in problem solving, yet they often cause overthinking: excessive, meandering reasoning that inflates computational cost. Prior penalization designs in RLVR reduce token consumption but often harm model performance, a side effect of overly simple token-level supervision. In this paper, we argue that the granularity of supervision plays a crucial role in balancing efficiency and accuracy, and propose Group Relative Segment Penalization (GRSP), a step-level method to regularize reasoning. Since preliminary analyses show that reasoning segments are strongly correlated with token consumption and model performance, we design a length-aware weighting mechanism across segment clusters. Extensive experiments demonstrate that GRSP achieves superior token efficiency without heavily compromising accuracy, with advantages that are especially clear on harder problems. Moreover, GRSP stabilizes RL training and scales effectively across model sizes.
中文摘要 由验证者奖励强化学习（RLVR）推动的大型推理模型（LRM）在解决问题方面显示出强大的力量，但它们经常导致过度思考：过度、曲折的推理会增加计算成本。RLVR 中先前的惩罚设计设法减少了代币消耗，同时经常损害模型性能，这是由于代币级监管的过于简单而产生的。在本文中，我们认为监督的粒度在平衡效率和准确性方面起着至关重要的作用，并提出了群体相对段惩罚（GRSP），这是一种正则化推理的阶梯级方法。由于初步分析表明推理段与令牌消耗和模型性能密切相关，因此我们设计了一种跨段集群的长度感知加权机制。广泛的实验表明，GRSP 在不严重影响准确性的情况下实现了卓越的代币效率，尤其是在更难的问题上具有优势。此外，GRSP 可以稳定 RL 训练并跨模型大小有效扩展。

SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models

SPG：屏蔽扩散语言模型的夹层策略梯度

Authors: Chengyu Wang, Paria Rashidinejad, DiJia Su, Song Jiang, Sid Wang, Siyan Zhao, Cai Zhou, Shannon Zejiang Shen, Feiyu Chen, Tommi Jaakkola, Yuandong Tian, Bo Liu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.09541
Pdf link: https://arxiv.org/pdf/2510.09541
Abstract Diffusion large language models (dLLMs) are emerging as an efficient alternative to autoregressive models due to their ability to decode multiple tokens in parallel. However, aligning dLLMs with human preferences or task-specific rewards via reinforcement learning (RL) is challenging because their intractable log-likelihood precludes the direct application of standard policy gradient methods. While prior work uses surrogates like the evidence lower bound (ELBO), these one-sided approximations can introduce significant policy gradient bias. To address this, we propose the Sandwiched Policy Gradient (SPG) that leverages both an upper and a lower bound of the true log-likelihood. Experiments show that SPG significantly outperforms baselines based on ELBO or one-step estimation. Specifically, SPG improves the accuracy over state-of-the-art RL methods for dLLMs by 3.6% in GSM8K, 2.6% in MATH500, 18.4% in Countdown and 27.0% in Sudoku.
中文摘要 扩散大型语言模型（dLLM）正在成为自回归模型的有效替代方案，因为它们能够并行解码多个标记。然而，通过强化学习（RL）使 dLLM 与人类偏好或特定于任务的奖励保持一致具有挑战性，因为它们棘手的对数似然排除了标准策略梯度方法的直接应用。虽然先前的工作使用证据下界（ELBO）等替代指标，但这些单侧近似值可能会引入显着的政策梯度偏差。为了解决这个问题，我们提出了夹心策略梯度（SPG），它利用真实对数似然的上限和下限。实验表明，SPG 的性能明显优于基于 ELBO 或一步估计的基线。具体来说，SPG 在 GSM8K 中将 dLLM 的准确性提高了 3.6%，在 MATH500 中提高了 2.6%，在倒计时中提高了 18.4%，在数独中提高了 27.0%。

Guiding Energy-Efficient Locomotion through Impact Mitigation Rewards

通过影响缓解奖励引导节能运动

Authors: Chenghao Wang, Arjun Viswanathan, Eric Sihite, Alireza Ramezani
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.09543
Pdf link: https://arxiv.org/pdf/2510.09543
Abstract Animals achieve energy-efficient locomotion through their implicit passive dynamics, a marvel that has captivated roboticists for this http URL, methods incorporating Adversarial Motion Prior (AMP) and reinforcement learning (RL) show promising progress in replicating animals' naturalistic motion. However, such imitation learning approaches predominantly capture explicit kinematic patterns, so-called gaits, while overlooking the implicit passive dynamics. This work bridges this gap by incorporating a reward term guided by Impact Mitigation Factor (IMF), a physics-informed metric that quantifies a robot's ability to passively mitigate impacts. By integrating IMF with AMP, our approach enables RL policies to learn both explicit motion trajectories from animal reference motion and the implicit passive dynamics. We demonstrate energy efficiency improvements of up to 32%, as measured by the Cost of Transport (CoT), across both AMP and handcrafted reward structures.
中文摘要 动物通过其隐含的被动动力学实现节能运动，这一奇迹吸引了机器人学家的 http URL，结合了对抗性运动先验（AMP）和强化学习（RL）的方法显示出复制动物自然运动的有希望的进展。然而，这种模仿学习方法主要捕捉显式运动学模式，即所谓的步态，而忽略了隐式的被动动力学。这项工作通过纳入由影响缓解因子（IMF）指导的奖励项来弥合这一差距，IMF 是一种基于物理的指标，用于量化机器人被动减轻影响的能力。通过将 IMF 与 AMP 集成，我们的方法使 RL 策略能够从动物参考运动和隐式被动动态中学习显式运动轨迹。我们证明，根据运输成本（CoT）衡量，在 AMP 和手工制作的奖励结构中，能源效率提高了 32%。
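The Cost of Transport used as the efficiency measure above is commonly defined as the energy spent divided by weight times distance travelled, which makes it dimensionless. The sketch below uses that standard definition with made-up numbers; reading the reported "up to 32%" as a 32% lower CoT is an assumption of this example, not a statement from the abstract.

```python
def cost_of_transport(energy_joules, mass_kg, distance_m, gravity=9.81):
    """Dimensionless cost of transport: E / (m * g * d)."""
    return energy_joules / (mass_kg * gravity * distance_m)


# Hypothetical robot: 10 kg walking 10 m.
baseline = cost_of_transport(1000.0, 10.0, 10.0)  # assumed baseline energy use
improved = cost_of_transport(680.0, 10.0, 10.0)   # assumed 32% less energy
relative_reduction = (baseline - improved) / baseline

print(round(baseline, 3), round(improved, 3), round(relative_reduction, 2))
```

Because mass and distance are held fixed in this example, the relative drop in CoT equals the relative drop in energy consumed.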

Dyna-Mind: Learning to Simulate from Experience for Better AI Agents

Dyna-Mind：从经验中学习模拟以获得更好的人工智能代理

Authors: Xiao Yu, Baolin Peng, Michel Galley, Hao Cheng, Qianhui Wu, Janardhan Kulkarni, Suman Nath, Zhou Yu, Jianfeng Gao
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2510.09577
Pdf link: https://arxiv.org/pdf/2510.09577
Abstract Reasoning models have recently shown remarkable progress in domains such as math and coding. However, their expert-level abilities in math and coding contrast sharply with their performance in long-horizon, interactive tasks such as web navigation and computer/phone-use. Inspired by literature on human cognition, we argue that current AI agents need "vicarious trial and error" - the capacity to mentally simulate alternative futures before acting - in order to enhance their understanding and performance in complex interactive environments. We introduce Dyna-Mind, a two-stage training framework that explicitly teaches (V)LM agents to integrate such simulation into their reasoning. In stage 1, we introduce Reasoning with Simulations (ReSim), which trains the agent to generate structured reasoning traces from expanded search trees built from real experience gathered through environment interactions. ReSim thus grounds the agent's reasoning in faithful world dynamics and equips it with the ability to anticipate future states in its reasoning. In stage 2, we propose Dyna-GRPO, an online reinforcement learning method to further strengthen the agent's simulation and decision-making ability by using both outcome rewards and intermediate states as feedback from real rollouts. Experiments on two synthetic benchmarks (Sokoban and ALFWorld) and one realistic benchmark (AndroidWorld) demonstrate that (1) ReSim effectively infuses simulation ability into AI agents, and (2) Dyna-GRPO leverages outcome and interaction-level signals to learn better policies for long-horizon, planning-intensive tasks. Together, these results highlight the central role of simulation in enabling AI agents to reason, plan, and act more effectively in ever more challenging environments.
中文摘要 推理模型最近在数学和编码等领域取得了显着进展。然而，他们在数学和编码方面的专家级能力与他们在网络导航和计算机/电话使用等长期交互式任务中的表现形成鲜明对比。受人类认知文献的启发，我们认为当前的人工智能代理需要“替代试错”——在行动之前在心理上模拟替代未来的能力——以增强他们在复杂的交互环境中的理解和表现。我们引入了 Dyna-Mind，这是一个两阶段的训练框架，它明确地教（V）LM 代理将这种模拟集成到他们的推理中。在第 1 阶段，我们引入了模拟推理（ReSim），它训练代理从通过环境交互收集的真实经验构建的扩展搜索树中生成结构化推理跟踪。因此，ReSim 将代理的推理建立在忠实的世界动态基础上，并使其具备在推理中预测未来状态的能力。在第 2 阶段，我们提出了 Dyna-GRPO，这是一种在线强化学习方法，通过使用结果奖励和中间状态作为实际推出的反馈，进一步增强智能体的模拟和决策能力。在两个综合基准测试（Sokoban 和 ALFWorld）和一个现实基准测试（AndroidWorld）上的实验表明，（1） ReSim 有效地将模拟能力注入 AI 代理，以及（2） Dyna-GRPO 利用结果和交互级别的信号来学习更好的策略，以应对长期、规划密集型的任务。总之，这些结果凸显了模拟在使人工智能代理能够在更具挑战性的环境中更有效地推理、计划和行动方面的核心作用。

Keyword: diffusion policy

There is no result