Arxiv Papers of Today

Generated: 2025-10-27 16:30:30 (UTC+8); arXiv release: 2025-10-27 20:00 EDT (2025-10-28 08:00 UTC+8)

25 relevant papers today

Keyword: reinforcement learning

Incentivizing Consistent, Effective and Scalable Reasoning Capability in Audio LLMs via Reasoning Process Rewards

通过推理过程奖励激励音频 LLM 中一致、有效和可扩展的推理能力

Authors: Jiajun Fan, Roger Ren, Jingyuan Li, Rahul Pandey, Prashanth Gurunath Shivakumar, Ivan Bulyko, Ankur Gandhe, Ge Liu, Yile Gu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.20867
Pdf link: https://arxiv.org/pdf/2510.20867
Abstract The role of reasoning in Audio Large Language Models remains widely underexplored, as introducing a reasoning process often degrades rather than improves performance during inference, a phenomenon we term test-time inverse scaling, where longer reasoning chains yield progressively worse results. We demonstrate that this stems not from fundamental limitations of reasoning itself, but from inadequate training: models without proper guidance for the reasoning process produce hallucinatory, inconsistent reasoning that accumulates errors over longer chains. To address these challenges, we introduce CESAR (Consistent, Effective, and Scalable Audio Reasoners), shifting from outcome verification to rewarding the reasoning process. Our online reinforcement learning framework employs Group Relative Policy Optimization with a multi-faceted reward suite that incentivizes not only correctness and format but also consistency, structured analytical patterns, causal reasoning, domain-knowledge integration, and calibrated reasoning depth. CESAR resolves test-time inverse scaling, transforming reasoning from detriments into gains while revealing model-specific "reasoning sweet spots", where performance peaks during test-time scaling. We achieve state-of-the-art results on MMAU Test-mini, substantially outperforming Gemini 2.5 Pro and GPT-4o Audio, and near-human-level performance on MMSU reasoning tasks. Through AI-as-judge evaluations and qualitative comparisons, we provide both quantitative and qualitative validation of our improved reasoning quality. Importantly, enhanced reasoning creates synergistic effects, simultaneously improving multimodal reasoning and perception capabilities. Overall, CESAR establishes a principled method for developing robust and scalable reasoning in Audio LLMs.
中文摘要 推理在音频大型语言模型中的作用在很大程度上仍未得到充分探索，因为引入推理过程通常会降低而不是提高推断时的性能，我们将这种现象称为测试时逆扩展，即较长的推理链会产生越来越差的结果。我们证明，这不是源于推理本身的根本局限性，而是源于训练不足：没有对推理过程的适当指导的模型会产生幻觉、不一致的推理，从而在更长的链条上积累错误。为了应对这些挑战，我们引入了 CESAR（一致、有效和可扩展的音频推理器），从结果验证转向奖励推理过程。我们的在线强化学习框架采用群体相对策略优化和多方面的奖励套件，不仅激励正确性和格式，还激励一致性、结构化分析模式、因果推理、领域知识整合和校准推理深度。CESAR 解决了测试时逆扩展问题，将推理从不利转化为收益，同时揭示特定于模型的“推理最佳点”，即性能在测试时扩展期间达到峰值。我们在 MMAU Test-mini 上取得了最先进的结果，大大优于 Gemini 2.5 Pro 和 GPT-4o Audio，在 MMSU 推理任务上也达到了接近人类水平的性能。通过人工智能作为评判的评估和定性比较，我们为我们改进的推理质量提供定量和定性验证。重要的是，增强推理会产生协同效应，同时提高多模态推理和感知能力。总体而言，CESAR 建立了一种在音频 LLM 中开发稳健且可扩展推理的原则性方法。
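
The abstract above names Group Relative Policy Optimization with a multi-part reward. As a rough illustration only, the sketch below shows the commonly described group-relative advantage step: each sampled response's reward is normalized against the mean and standard deviation of its own group. The reward component names, their weights, and the helper `combined_reward` are assumptions made for this example and are not taken from the paper.

```python
from statistics import mean, pstdev


def combined_reward(components, weights):
    """Weighted sum of reward components (names and weights are hypothetical)."""
    return sum(weights[name] * value for name, value in components.items())


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical weights for three of the reward facets listed in the abstract.
weights = {"correctness": 1.0, "format": 0.2, "consistency": 0.5}

# One group of four sampled responses to the same prompt (made-up scores).
group = [
    {"correctness": 1.0, "format": 1.0, "consistency": 1.0},
    {"correctness": 1.0, "format": 1.0, "consistency": 0.0},
    {"correctness": 0.0, "format": 1.0, "consistency": 1.0},
    {"correctness": 0.0, "format": 0.0, "consistency": 0.0},
]

rewards = [combined_reward(c, weights) for c in group]
advantages = group_relative_advantages(rewards)
```

Because the advantages are centered within the group, a response is only rewarded for being better than its siblings, which is why a correct but inconsistent answer can still receive a lower advantage than a correct and consistent one.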

Code-enabled language models can outperform reasoning models on diverse tasks

支持代码的语言模型可以在各种任务上优于推理模型

Authors: Cedegao E. Zhang, Cédric Colas, Gabriel Poesia, Joshua B. Tenenbaum, Jacob Andreas
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.20909
Pdf link: https://arxiv.org/pdf/2510.20909
Abstract Reasoning models (RMs), language models (LMs) trained with reinforcement learning to produce long-form natural language reasoning, have been remarkably successful, but they still require large amounts of computation and data to train, and can be slow and expensive to run. In this paper, we show that standard instruct LMs can already be elicited to be strong reasoners at a level comparable to or even surpassing their corresponding RMs (e.g., DeepSeek V3 vs R1) without finetuning, across diverse domains from instruction following and creative generation to mathematical reasoning. This is achieved by CodeAdapt, our simple recipe that combines the CodeAct framework, where LMs interleave natural language reasoning with code execution in a multi-step fashion, with few-shot bootstrap in-context learning from as few as five training problems. Analyzing four matched pairs of LMs and RMs, we find that CodeAdapt enables three LMs to outperform the corresponding RMs on average over eight tasks (up to 22.9%) while being 10-81% more token efficient, and delivers superior performance on six tasks when averaged over the four models (up to 35.7%). Furthermore, the code-augmented reasoning traces display rich and varied problem-solving strategies. Our findings support that (1) CodeAdapt-style learning and reasoning may be robust and domain general and (2) code-enabled LMs are cognitively grounded and powerful systems, potentially providing a strong foundation for in-weight reinforcement learning.
中文摘要 推理模型（RM），即通过强化学习训练以产生长形式自然语言推理的语言模型（LM），已经取得了巨大的成功，但它们仍然需要大量的计算和数据来训练，并且运行起来可能速度缓慢且成本高昂。在本文中，我们表明，标准指令 LM 已经可以被引出为强推理器，其水平与其相应的 RM 相当甚至超过其相应的 RM（例如，DeepSeek V3 与 R1），无需微调，跨越从指令遵循和创造性生成到数学推理的不同领域。这是通过 CodeAdapt 实现的，这是我们的简单配方，它结合了 CodeAct 框架，其中 LM 以多步骤方式将自然语言推理与代码执行交错，并从最少五个训练问题中进行少量引导上下文学习。分析四对匹配的 LM 和 RM，我们发现 CodeAdapt 使三个 LM 在 8 个任务中平均优于相应的 RM（高达 22.9%），同时令牌效率提高 10-81%，并且在四个模型上平均时在 6 个任务上提供卓越的性能（高达 35.7%）。此外，代码增强推理轨迹显示出丰富多样的问题解决策略。我们的研究结果支持：（1）CodeAdapt风格的学习和推理可能是稳健的和通用的，（2）支持代码的LM是认知基础和强大的系统，可能为权重强化学习提供坚实的基础。

Safety Assessment in Reinforcement Learning via Model Predictive Control

通过模型预测控制进行强化学习的安全性评估

Authors: Jeff Pflueger, Michael Everett
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.20955
Pdf link: https://arxiv.org/pdf/2510.20955
Abstract Model-free reinforcement learning approaches are promising for control but typically lack formal safety guarantees. Existing methods to shield or otherwise provide these guarantees often rely on detailed knowledge of the safety specifications. Instead, this work's insight is that many difficult-to-specify safety issues are best characterized by invariance. Accordingly, we propose to leverage reversibility as a method for preventing these safety issues throughout the training process. Our method uses model-predictive path integral control to check the safety of an action proposed by a learned policy throughout training. A key advantage of this approach is that it only requires the ability to query the black-box dynamics, not explicit knowledge of the dynamics or safety constraints. Experimental results demonstrate that the proposed algorithm successfully aborts before all unsafe actions, while still achieving comparable training progress to a baseline PPO approach that is allowed to violate safety.
中文摘要 无模型强化学习方法在控制方面很有希望，但通常缺乏正式的安全保证。屏蔽或以其他方式提供这些保证的现有方法通常依赖于对安全规范的详细了解。相反，这项工作的见解是，许多难以指定的安全问题的最佳特征是不变性。因此，我们建议利用可逆性作为在整个训练过程中防止这些安全问题的方法。我们的方法使用模型预测路径积分控制来检查整个训练过程中学习策略提出的动作的安全性。这种方法的一个关键优点是它只需要能够查询黑盒动力学，而不需要明确了解动力学或安全约束。实验结果表明，所提出的算法在所有不安全动作之前成功中止，同时仍然取得了与允许违反安全性的基线 PPO 方法相当的训练进度。

Robust Point Cloud Reinforcement Learning via PCA-Based Canonicalization

基于PCA的规范化鲁棒点云强化学习

Authors: Michael Bezick, Vittorio Giammarino, Ahmed H. Qureshi
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.20974
Pdf link: https://arxiv.org/pdf/2510.20974
Abstract Reinforcement Learning (RL) from raw visual input has achieved impressive successes in recent years, yet it remains fragile to out-of-distribution variations such as changes in lighting, color, and viewpoint. Point Cloud Reinforcement Learning (PC-RL) offers a promising alternative by mitigating appearance-based brittleness, but its sensitivity to camera pose mismatches continues to undermine reliability in realistic settings. To address this challenge, we propose PCA Point Cloud (PPC), a canonicalization framework specifically tailored for downstream robotic control. PPC maps point clouds under arbitrary rigid-body transformations to a unique canonical pose, aligning observations to a consistent frame, thereby substantially decreasing viewpoint-induced inconsistencies. In our experiments, we show that PPC improves robustness to unseen camera poses across challenging robotic tasks, providing a principled alternative to domain randomization.
中文摘要 近年来，来自原始视觉输入的强化学习（RL）取得了令人瞩目的成功，但它仍然脆弱地受到分布外变化的影响，例如光线、颜色和视点的变化。点云强化学习（PC-RL）通过减轻基于外观的脆性提供了一种有前途的替代方案，但它对相机姿势不匹配的敏感性继续破坏现实环境中的可靠性。为了应对这一挑战，我们提出了 PCA 点云（PPC），这是一种专门为下游机器人控制量身定制的规范化框架。PPC 将任意刚体变换下的点云映射到独特的规范姿势，将观测值与一致的框架对齐，从而大大减少视点引起的不一致。在我们的实验中，我们表明 PPC 提高了具有挑战性的机器人任务中看不见的相机姿势的鲁棒性，为域随机化提供了一种原则性的替代方案。
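
To make the idea of PCA-based canonicalization concrete, here is a deliberately simplified 2D sketch (the paper works with 3D point clouds, and its actual procedure is not described in the abstract). It centers the points, rotates them so the principal axis lies along x, and resolves the remaining 180-degree ambiguity with a third-moment sign rule. The sign rule and the helper `rigid_transform` are assumptions made for this example.

```python
import math


def canonicalize_2d(points):
    """Map a 2D point set to a pose that is invariant to rotation and translation."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    centered = [(x - cx, y - cy) for x, y in points]

    # Entries of the 2x2 covariance matrix.
    sxx = sum(x * x for x, _ in centered) / n
    syy = sum(y * y for _, y in centered) / n
    sxy = sum(x * y for x, y in centered) / n

    # Angle of the major principal axis (closed form for a 2x2 symmetric matrix).
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)

    # Rotate by -theta so the principal axis aligns with the x-axis.
    rotated = [(c * x + s * y, -s * x + c * y) for x, y in centered]

    # Assumed tie-break: make the third moment along x non-negative.
    if sum(x ** 3 for x, _ in rotated) < 0:
        rotated = [(-x, -y) for x, y in rotated]
    return rotated


def rigid_transform(points, angle, tx, ty):
    """Apply a rotation followed by a translation (a simulated camera pose change)."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]


cloud = [(0.0, 0.0), (4.0, 0.0), (4.0, 1.0), (1.0, 0.5), (6.0, 2.0)]
canon_a = canonicalize_2d(cloud)
canon_b = canonicalize_2d(rigid_transform(cloud, 0.7, 3.0, -2.0))
```

In this toy setting both inputs land on the same canonical coordinates up to floating-point error, so a downstream policy would see the same observation regardless of the simulated viewpoint. Point sets with near-equal variances or a near-zero third moment would need a more careful disambiguation.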

A Reinforcement Learning Framework for Robust and Secure LLM Watermarking

用于稳健且安全的 LLM 水印的强化学习框架

Authors: Li An, Yujian Liu, Yepeng Liu, Yuheng Bu, Yang Zhang, Shiyu Chang
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2510.21053
Pdf link: https://arxiv.org/pdf/2510.21053
Abstract Watermarking has emerged as a promising solution for tracing and authenticating text generated by large language models (LLMs). A common approach to LLM watermarking is to construct a green/red token list and assign higher or lower generation probabilities to the corresponding tokens, respectively. However, most existing watermarking algorithms rely on heuristic green/red token list designs, as directly optimizing the list design with techniques such as reinforcement learning (RL) comes with several challenges. First, desirable watermarking involves multiple criteria, i.e., detectability, text quality, robustness against removal attacks, and security against spoofing attacks. Directly optimizing for these criteria introduces many partially conflicting reward terms, leading to an unstable convergence process. Second, the vast action space of green/red token list choices is susceptible to reward hacking. In this paper, we propose an end-to-end RL framework for robust and secure LLM watermarking. Our approach adopts an anchoring mechanism for reward terms to ensure stable training and introduces additional regularization terms to prevent reward hacking. Experiments on standard benchmarks with two backbone LLMs show that our method achieves a state-of-the-art trade-off across all criteria, with notable improvements in resistance to spoofing attacks without degrading other criteria. Our code is available at this https URL.
中文摘要 水印已成为跟踪和验证大型语言模型（LLM）生成的文本的一种有前途的解决方案。LLM 水印的一种常见方法是构建一个绿色/红色代币列表，并分别为相应的代币分配更高或更低的生成概率。然而，大多数现有的水印算法都依赖于启发式的绿/红标记列表设计，因为直接使用强化学习（RL）等技术优化列表设计会带来一些挑战。首先，理想的水印涉及多个标准，即可检测性、文本质量、抵御删除攻击的鲁棒性以及抵御欺骗攻击的安全性。直接针对这些标准进行优化会引入许多部分冲突的奖励项，从而导致收敛过程不稳定。其次，绿色/红色代币列表选择的广阔行动空间容易受到奖励黑客攻击。在本文中，我们提出了一个端到端的 RL 框架，用于稳健且安全的 LLM 水印。我们的方法对奖励条款采用锚定机制，以确保稳定的训练，并引入额外的正则化条款以防止奖励黑客攻击。使用两个主干 LLM 的标准基准测试实验表明，我们的方法在所有标准上实现了最先进的权衡，在不降低其他标准的情况下显着提高了对欺骗攻击的抵抗力。我们的代码可在此 https URL 中找到。
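
For readers unfamiliar with the green/red token list scheme that the abstract treats as the heuristic baseline, the sketch below illustrates one common form of it: the previous token seeds a pseudo-random split of the vocabulary, green tokens receive a logit bonus, and detection counts green hits with a z-score. This is not the RL framework proposed in the paper. The hashing choice and the parameters `gamma` and `delta` are assumptions made for this example.

```python
import hashlib
import math
import random


def green_list(prev_token, vocab_size, gamma=0.5):
    """Pseudo-randomly pick a gamma fraction of the vocabulary, seeded by prev_token."""
    digest = hashlib.sha256(str(prev_token).encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])


def bias_logits(logits, prev_token, delta=2.0, gamma=0.5):
    """Add delta to the logits of green tokens; red tokens are left unchanged."""
    green = green_list(prev_token, len(logits), gamma)
    return [l + delta if i in green else l for i, l in enumerate(logits)]


def detect_z(tokens, vocab_size, gamma=0.5):
    """z-score of the observed green-token count against the no-watermark expectation."""
    t = len(tokens) - 1
    hits = sum(
        tokens[i] in green_list(tokens[i - 1], vocab_size, gamma)
        for i in range(1, len(tokens))
    )
    return (hits - gamma * t) / math.sqrt(t * gamma * (1.0 - gamma))


# Toy sequence that always picks a green token, i.e. a maximally watermarked text.
vocab_size = 50
tokens = [0]
for _ in range(20):
    tokens.append(min(green_list(tokens[-1], vocab_size)))
z_score = detect_z(tokens, vocab_size)
```

A large positive `z_score` indicates far more green tokens than chance would produce. The criteria the abstract lists (robustness to removal, security against spoofing) concern how easily an attacker can lower or fake this statistic.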

On the Sample Complexity of Differentially Private Policy Optimization

关于差分私有策略优化的样本复杂度

Authors: Yi He, Xingyu Zhou
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.21060
Pdf link: https://arxiv.org/pdf/2510.21060
Abstract Policy optimization (PO) is a cornerstone of modern reinforcement learning (RL), with diverse applications spanning robotics, healthcare, and large language model training. The increasing deployment of PO in sensitive domains, however, raises significant privacy concerns. In this paper, we initiate a theoretical study of differentially private policy optimization, focusing explicitly on its sample complexity. We first formalize an appropriate definition of differential privacy (DP) tailored to PO, addressing the inherent challenges arising from on-policy learning dynamics and the subtlety involved in defining the unit of privacy. We then systematically analyze the sample complexity of widely-used PO algorithms, including policy gradient (PG), natural policy gradient (NPG) and more, under DP constraints and various settings, via a unified framework. Our theoretical results demonstrate that privacy costs can often manifest as lower-order terms in the sample complexity, while also highlighting subtle yet important observations in private PO settings. These offer valuable practical insights for privacy-preserving PO algorithms.
中文摘要 策略优化（PO）是现代强化学习（RL）的基石，其应用范围广泛，涵盖机器人、医疗保健和大型语言模型训练。然而，PO 在敏感领域的部署越来越多，引发了严重的隐私问题。在本文中，我们启动了差分私有策略优化的理论研究，明确关注其样本复杂度。我们首先形式化了针对 PO 量身定制的差分隐私（DP）的适当定义，解决了同策略（on-policy）学习动态产生的固有挑战以及定义隐私单位所涉及的微妙之处。然后，我们通过一个统一的框架，系统地分析了在DP约束和各种设置下，广泛使用的PO算法的样本复杂度，包括策略梯度（PG）、自然策略梯度（NPG）等。我们的理论结果表明，隐私成本通常可以表现为样本复杂度中的低阶项，同时也突出了隐私 PO 设置中微妙但重要的观察结果。这些为隐私保护 PO 算法提供了宝贵的实用见解。

Sensing and Storing Less: A MARL-based Solution for Energy Saving in Edge Internet of Things

传感和存储更少：基于 MARL 的边缘物联网节能解决方案

Authors: Zongyang Yuan, Lailong Luo, Qianzhen Zhang, Bangbang Ren, Deke Guo, Richard T.B. Ma
Subjects: Networking and Internet Architecture (cs.NI); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2510.21103
Pdf link: https://arxiv.org/pdf/2510.21103
Abstract As the number of Internet of Things (IoT) devices continuously grows and application scenarios diversify, the volume of sensor data is increasing explosively. However, substantial data demands considerable energy during computation and transmission. Redundant deployment or mobile assistance is essential to cover the target area reliably with fault-prone sensors. Consequently, the "butterfly effect" may appear during the IoT operation, since unreasonable data overlap could result in many duplicate data. To this end, we propose Senses, a novel online energy saving solution for edge IoT networks, with the insight of sensing and storing less at the network edge by adopting Multi-Agent Reinforcement Learning (MARL). Senses achieves data de-duplication by dynamically adjusting sensor coverage at the sensor level. For exceptional cases where sensor coverage cannot be altered, Senses conducts data partitioning and eliminates redundant data at the controller level. Furthermore, at the global level, considering the heterogeneity of IoT devices, Senses balances the operational duration among the devices to prolong the overall operational duration of edge IoT networks. We evaluate the performance of Senses through testbed experiments and simulations. The results show that Senses saves 11.37% of energy consumption on control devices and prolongs the overall operational duration of the IoT device network by 20%.
中文摘要 随着物联网设备数量的不断增长和应用场景的不断丰富，传感器数据量呈爆发式增长。然而，大量数据在计算和传输过程中需要大量能量。冗余部署或移动辅助对于使用易发生故障的传感器可靠地覆盖目标区域至关重要。因此，在物联网运行过程中可能会出现“蝴蝶效应”，因为不合理的数据重叠可能会导致许多重复数据。为此，我们提出了Senses，这是一种针对边缘物联网网络的新型在线节能解决方案，通过采用多智能体强化学习（MARL），在网络边缘感知和存储更少的数据。Senses 通过在传感器级别动态调整传感器覆盖范围来实现重复数据消除。对于无法更改传感器覆盖范围的特殊情况，Senses 会进行数据分区并消除控制器级别的冗余数据。此外，在全局层面，考虑到物联网设备的异构性，Senses平衡了设备之间的运行持续时间，以延长边缘物联网网络的整体运行持续时间。我们通过测试平台实验和模拟来评估 Senses 的性能。结果表明，Senses在控制设备上节省了11.37%的能耗，延长了物联网设备网络的整体运行时间20%。

Confounding Robust Deep Reinforcement Learning: A Causal Approach

混杂鲁棒深度强化学习：因果方法

Authors: Mingxuan Li, Junzhe Zhang, Elias Bareinboim
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.21110
Pdf link: https://arxiv.org/pdf/2510.21110
Abstract A key task in Artificial Intelligence is learning effective policies for controlling agents in unknown environments to optimize performance measures. Off-policy learning methods, like Q-learning, allow learners to make optimal decisions based on past experiences. This paper studies off-policy learning from biased data in complex and high-dimensional domains where unobserved confounding cannot be ruled out a priori. Building on the well-celebrated Deep Q-Network (DQN), we propose a novel deep reinforcement learning algorithm robust to confounding biases in observed data. Specifically, our algorithm attempts to find a safe policy for the worst-case environment compatible with the observations. We apply our method to twelve confounded Atari games, and find that it consistently dominates the standard DQN in all games where the observed input to the behavioral and target policies mismatch and unobserved confounders exist.
中文摘要 人工智能的一项关键任务是学习有效的策略，用于在未知环境中控制代理以优化性能衡量标准。离策略学习方法，如 Q-learning，允许学习者根据过去的经验做出最佳决策。本文研究了在复杂和高维领域中从有偏见数据中进行的离策略学习，在这些领域中，未观察到的混杂不能先验地排除。在著名的深度Q网络（DQN）的基础上，我们提出了一种新型的深度强化学习算法，该算法对观测数据中的混杂偏差具有鲁棒性。具体来说，我们的算法试图为最坏情况的环境找到一个与观测结果兼容的安全策略。我们将我们的方法应用于 12 款存在混杂的 Atari 游戏，发现在所有观察到的行为和目标策略输入不匹配且存在未观察到的混杂因素的游戏中，它始终主导标准 DQN。

NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation

NoisyGRPO：通过噪声注入和贝叶斯估计激励多模态 CoT 推理

Authors: Longtian Qiu, Shan Ning, Jiaxuan Sun, Xuming He
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2510.21122
Pdf link: https://arxiv.org/pdf/2510.21122
Abstract Reinforcement learning (RL) has shown promise in enhancing the general Chain-of-Thought (CoT) reasoning capabilities of multimodal large language models (MLLMs). However, when applied to improve general CoT reasoning, existing RL frameworks often struggle to generalize beyond the training distribution. To address this, we propose NoisyGRPO, a systematic multimodal RL framework that introduces controllable noise into visual inputs for enhanced exploration and explicitly models the advantage estimation process via a Bayesian framework. Specifically, NoisyGRPO improves RL training by: (1) Noise-Injected Exploration Policy: Perturbing visual inputs with Gaussian noise to encourage exploration across a wider range of visual scenarios; and (2) Bayesian Advantage Estimation: Formulating advantage estimation as a principled Bayesian inference problem, where the injected noise level serves as a prior and the observed trajectory reward as the likelihood. This Bayesian modeling fuses both sources of information to compute a robust posterior estimate of trajectory advantage, effectively guiding MLLMs to prefer visually grounded trajectories over noisy ones. Experiments on standard CoT quality, general capability, and hallucination benchmarks demonstrate that NoisyGRPO substantially improves generalization and robustness, especially in RL settings with small-scale MLLMs such as Qwen2.5-VL 3B. The project page is available at this https URL_pages/NoisyGRPO.
中文摘要 强化学习（RL）在增强多模态大型语言模型（MLLM）的通用思维链（CoT）推理能力方面显示出前景。然而，当应用于改进一般 CoT 推理时，现有的 RL 框架通常难以在训练分布之外进行推广。为了解决这个问题，我们提出了 NoisyGRPO，这是一个系统的多模态 RL 框架，它将可控噪声引入视觉输入以增强探索，并通过贝叶斯框架对优势估计过程进行显式建模。具体来说，NoisyGRPO 通过以下方式改进了 RL 训练：（1）噪声注入探索策略：用高斯噪声扰动视觉输入，以鼓励在更广泛的视觉场景中进行探索；（2）贝叶斯优势估计：将优势估计表述为有原则的贝叶斯推理问题，其中注入的噪声水平作为先验，观察到的轨迹奖励作为似然。这种贝叶斯模型融合了两种信息源，以计算轨迹优势的稳健后验估计，有效地指导 MLLM 更喜欢视觉接地轨迹而不是嘈杂的轨迹。对标准 CoT 质量、通用能力和幻觉基准的实验表明，NoisyGRPO 显着提高了泛化性和鲁棒性，特别是在具有小规模 MLLM（如 Qwen2.5-VL 3B）的 RL 环境中。项目页面位于 this https URL_pages/NoisyGRPO。
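
The abstract describes fusing a noise-level prior with an observed reward to obtain a posterior estimate, but does not give the formulas. As one possible reading, the sketch below uses a conjugate Gaussian update, where the posterior mean is a precision-weighted average of prior and observation. The mapping `noise_prior`, the variances, and the sample trajectories are all assumptions made for this example and should not be read as the paper's formulation.

```python
def gaussian_posterior(prior_mean, prior_var, obs, obs_var):
    """Conjugate Gaussian update: returns (posterior mean, posterior variance)."""
    precision = 1.0 / prior_var + 1.0 / obs_var
    post_mean = (prior_mean / prior_var + obs / obs_var) / precision
    return post_mean, 1.0 / precision


def noise_prior(noise_level):
    """Hypothetical prior: trajectories from cleaner inputs are expected to score higher."""
    return 1.0 - noise_level


# (injected noise level in [0, 1], observed trajectory reward) -- made-up values.
trajectories = [(0.1, 1.0), (0.8, 1.0), (0.1, 0.0)]

# Assumed equal prior and observation variances of 0.25.
scores = [
    gaussian_posterior(noise_prior(noise), 0.25, reward, 0.25)[0]
    for noise, reward in trajectories
]
```

With these assumed numbers, two trajectories with the same reward get different posterior scores when they were generated under different noise levels, which matches the abstract's stated goal of preferring visually grounded trajectories over noisy ones.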

Enhanced Evolutionary Multi-Objective Deep Reinforcement Learning for Reliable and Efficient Wireless Rechargeable Sensor Networks

增强演化多目标深度强化学习，实现可靠高效的无线可充电传感器网络

Authors: Bowei Tong, Hui Kang, Jiahui Li, Geng Sun, Jiacheng Wang, Yaoqi Yang, Bo Xu, Dusit Niyato
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.21127
Pdf link: https://arxiv.org/pdf/2510.21127
Abstract Despite rapid advancements in sensor networks, conventional battery-powered sensor networks suffer from limited operational lifespans and frequent maintenance requirements that severely constrain their deployment in remote and inaccessible environments. As such, wireless rechargeable sensor networks (WRSNs) with mobile charging capabilities offer a promising solution to extend network lifetime. However, WRSNs face critical challenges from the inherent trade-off between maximizing the node survival rates and maximizing charging energy efficiency under dynamic operational conditions. In this paper, we investigate a typical scenario where mobile chargers move and charge the sensor, thereby maintaining the network connectivity while minimizing the energy waste. Specifically, we formulate a multi-objective optimization problem that simultaneously maximizes the network node survival rate and mobile charger energy usage efficiency across multiple time slots, which presents NP-hard computational complexity with long-term temporal dependencies that make traditional optimization approaches ineffective. To address these challenges, we propose an enhanced evolutionary multi-objective deep reinforcement learning algorithm, which integrates a long short-term memory (LSTM)-based policy network for temporal pattern recognition, a multilayer perceptron-based prospective increment model for future state prediction, and a time-varying Pareto policy evaluation method for dynamic preference adaptation. Extensive simulation results demonstrate that the proposed algorithm significantly outperforms existing approaches in balancing node survival rate and energy efficiency while generating diverse Pareto-optimal solutions. Moreover, the LSTM-enhanced policy network converges 25% faster than conventional networks, with the time-varying evaluation method effectively adapting to dynamic conditions.
中文摘要 尽管传感器网络取得了快速发展，但传统的电池供电传感器网络的使用寿命有限且维护要求频繁，严重限制了其在偏远和难以访问的环境中的部署。因此，具有移动充电功能的无线可充电传感器网络（WRSN）为延长网络寿命提供了一种有前途的解决方案。然而，WRSN面临着在动态运行条件下最大化节点存活率和最大化充电能效之间的固有权衡的关键挑战。在本文中，我们研究了移动充电器移动传感器并为传感器充电的典型场景，从而在保持网络连接的同时最大限度地减少能源浪费。具体而言，我们提出了一个多目标优化问题，该问题在多个时隙内同时最大化网络节点存活率和移动充电器能效，该问题呈现出NP硬计算复杂性和长期时间依赖性，使传统优化方法无效。为了应对这些挑战，我们提出了一种增强的进化多目标深度强化学习算法，该算法集成了用于时间模式识别的长短期记忆（LSTM）策略网络、用于未来状态预测的基于多层感知器的前瞻性增量模型以及用于动态偏好适应的时变帕累托策略评估方法。广泛的仿真结果表明，所提算法在平衡节点存活率和能效方面明显优于现有方法，同时生成了多样化的帕累托最优解。此外，LSTM增强策略网络的收敛速度比传统网络快25%，时变评估方法有效地适应了动态条件。

Uncertainty-Aware Multi-Objective Reinforcement Learning-Guided Diffusion Models for 3D De Novo Molecular Design

用于 3D De Novo 分子设计的不确定性感知多目标强化学习引导扩散模型

Authors: Lianghong Chen, Dongkyu Eugene Kim, Mike Domaratzki, Pingzhao Hu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.21153
Pdf link: https://arxiv.org/pdf/2510.21153
Abstract Designing de novo 3D molecules with desirable properties remains a fundamental challenge in drug discovery and molecular engineering. While diffusion models have demonstrated remarkable capabilities in generating high-quality 3D molecular structures, they often struggle to effectively control complex multi-objective constraints critical for real-world applications. In this study, we propose an uncertainty-aware Reinforcement Learning (RL) framework to guide the optimization of 3D molecular diffusion models toward multiple property objectives while enhancing the overall quality of the generated molecules. Our method leverages surrogate models with predictive uncertainty estimation to dynamically shape reward functions, facilitating balance across multiple optimization objectives. We comprehensively evaluate our framework across three benchmark datasets and multiple diffusion model architectures, consistently outperforming baselines for molecular quality and property optimization. Additionally, Molecular Dynamics (MD) simulations and ADMET profiling of top generated candidates indicate promising drug-like behavior and binding stability, comparable to known Epidermal Growth Factor Receptor (EGFR) inhibitors. Our results demonstrate the strong potential of RL-guided generative diffusion models for advancing automated molecular design.
中文摘要 从头设计具有理想特性的 3D 分子仍然是药物发现和分子工程中的一个基本挑战。虽然扩散模型在生成高质量 3D 分子结构方面表现出卓越的能力，但它们通常难以有效控制对实际应用至关重要的复杂多目标约束。在这项研究中，我们提出了一种不确定性感知的强化学习（RL）框架，以指导3D分子扩散模型朝着多个属性目标进行优化，同时提高生成分子的整体质量。我们的方法利用具有预测不确定性估计的代理模型来动态塑造奖励函数，促进多个优化目标之间的平衡。我们跨三个基准数据集和多个扩散模型架构全面评估了我们的框架，在分子质量和性能优化方面始终优于基线。此外，分子动力学（MD）模拟和 ADMET 分析显示，其药物样行为和结合稳定性很有希望，可与已知的表皮生长因子受体（EGFR）抑制剂相媲美。我们的结果证明了 RL 引导的生成扩散模型在推进自动化分子设计方面的巨大潜力。

Reducing the Probability of Undesirable Outputs in Language Models Using Probabilistic Inference

使用概率推理降低语言模型中不良输出的概率

Authors: Stephen Zhao, Aidan Li, Rob Brekelmans, Roger Grosse
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2510.21184
Pdf link: https://arxiv.org/pdf/2510.21184
Abstract Reinforcement learning (RL) has become a predominant technique to align language models (LMs) with human preferences or promote outputs which are deemed to be desirable by a given reward function. Standard RL approaches optimize average reward, while methods explicitly focused on reducing the probability of undesired outputs typically come at a cost to average-case performance. To improve this tradeoff, we introduce RePULSe, a new training method that augments the standard RL loss with an additional loss that uses learned proposals to guide sampling low-reward outputs, and then reduces those outputs' probability. We run experiments demonstrating that RePULSe produces a better tradeoff of expected reward versus the probability of undesired outputs and is more adversarially robust, compared to standard RL alignment approaches and alternatives.
中文摘要 强化学习（RL）已成为一种主要技术，用于使语言模型（LM）与人类偏好保持一致或促进给定奖励函数认为理想的输出。标准 RL 方法优化了平均奖励，而明确专注于降低不良输出概率的方法通常会以牺牲平均情况的性能为代价。为了改善这种权衡，我们引入了 RePULSe，这是一种新的训练方法，它通过额外的损失来增加标准 RL 损失，该方法使用学习到的建议来指导对低奖励输出进行采样，然后降低这些输出的概率。我们进行的实验表明，与标准 RL 对齐方法和替代方案相比，RePULSe 在预期奖励与意外输出概率之间产生了更好的权衡，并且更具对抗性。

How Hard is it to Confuse a World Model?

混淆世界模型有多难？

Authors: Waris Radji (Scool, CRIStAL), Odalric-Ambrym Maillard (Scool, CRIStAL)
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.21232
Pdf link: https://arxiv.org/pdf/2510.21232
Abstract In reinforcement learning (RL) theory, the concept of most confusing instances is central to establishing regret lower bounds, that is, the minimal exploration needed to solve a problem. Given a reference model and its optimal policy, a most confusing instance is the statistically closest alternative model that makes a suboptimal policy optimal. While this concept is well-studied in multi-armed bandits and ergodic tabular Markov decision processes, constructing such instances remains an open question in the general case. In this paper, we formalize this problem for neural network world models as a constrained optimization: finding a modified model that is statistically close to the reference one, while producing divergent performance between optimal and suboptimal policies. We propose an adversarial training procedure to solve this problem and conduct an empirical study across world models of varying quality. Our results suggest that the degree of achievable confusion correlates with uncertainty in the approximate model, which may inform theoretically-grounded exploration strategies for deep model-based RL.
中文摘要 在强化学习（RL）理论中，最令人困惑的实例的概念是建立后悔下限的核心，即解决问题所需的最小探索。给定参考模型及其最优策略，最令人困惑的实例是统计上最接近的替代模型，该模型使次优策略成为最优。虽然这个概念在多臂强盗和遍历表马尔可夫决策过程中得到了很好的研究，但在一般情况下，构建这样的实例仍然是一个悬而未决的问题。在本文中，我们将神经网络世界模型的这个问题形式化为约束优化：找到一个在统计上接近参考模型的修改模型，同时在最优和次优策略之间产生不同的性能。我们提出了一种对抗性训练程序来解决这个问题，并对不同质量的世界模型进行了实证研究。我们的结果表明，可实现的混淆程度与近似模型的不确定性相关，这可能为基于深度模型的深度RL的理论探索策略提供信息。

PARL: Prompt-based Agents for Reinforcement Learning

PARL：基于提示的强化学习代理

Authors: Yarik Menchaca Resendiz, Roman Klinger
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.21306
Pdf link: https://arxiv.org/pdf/2510.21306
Abstract Large language models (LLMs) have demonstrated high performance on tasks expressed in natural language, particularly in zero- or few-shot settings. These are typically framed as supervised (e.g., classification) or unsupervised (e.g., clustering) problems. However, limited work evaluates LLMs as agents in reinforcement learning (RL) tasks (e.g., playing games), where learning occurs through interaction with an environment and a reward system. While prior work focused on representing tasks that rely on a language representation, we study structured, non-linguistic reasoning - such as interpreting positions in a grid world. We therefore introduce PARL (Prompt-based Agent for Reinforcement Learning), a method that uses LLMs as RL agents through prompting, without any fine-tuning. PARL encodes actions, states, and rewards in the prompt, enabling the model to learn through trial-and-error interaction. We evaluate PARL on three standard RL tasks that do not entirely rely on natural language. We show that it can match or outperform traditional RL agents in simple environments by leveraging pretrained knowledge. However, we identify performance limitations in tasks that require complex mathematical operations or decoding states and actions.
中文摘要 大型语言模型（LLM）在以自然语言表达的任务上表现出高性能，特别是在零样本或少样本设置中。这些问题通常被框定为有监督（例如分类）或无监督（例如聚类）问题。然而，将 LLM 作为强化学习（RL）任务（例如玩游戏）中的代理进行评估的工作十分有限，其中学习是通过与环境和奖励系统的交互进行的。虽然之前的工作侧重于表示依赖于语言表示的任务，但我们研究结构化的非语言推理——例如解释网格世界中的位置。因此，我们引入了 PARL（Prompt-based Agent for Reinforcement Learning），这是一种通过提示使用 LLM 作为 RL 代理的方法，无需任何微调。PARL 对提示中的动作、状态和奖励进行编码，使模型能够通过试错交互进行学习。我们在三个不完全依赖自然语言的标准 RL 任务上评估 PARL。我们表明，通过利用预训练的知识，它可以在简单环境中与传统 RL 代理相媲美或优于传统 RL 代理。然而，我们发现需要复杂数学运算或解码状态和动作的任务中存在性能限制。

FineRS: Fine-grained Reasoning and Segmentation of Small Objects with Reinforcement Learning

FineRS：使用强化学习对小物体进行细粒度推理和分割

Authors: Lu Zhang, Jiazuo Yu, Haomiao Xiong, Ping Hu, Yunzhi Zhuge, Huchuan Lu, You He
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2510.21311
Pdf link: https://arxiv.org/pdf/2510.21311
Abstract Multi-modal Large Language Models (MLLMs) have shown remarkable capabilities across a wide range of vision-language tasks. However, due to the restricted input resolutions, MLLMs face significant challenges in precisely understanding and localizing visual details in high-resolution images, particularly when dealing with extra-small objects embedded in cluttered contexts. To address this issue, we propose FineRS, a two-stage MLLM-based reinforcement learning framework for jointly reasoning and segmenting extremely small objects within high-resolution scenes. FineRS adopts a coarse-to-fine pipeline comprising Global Semantic Exploration (GSE) and Localized Perceptual Refinement (LPR). Specifically, GSE performs instruction-guided reasoning to generate a textual response and a coarse target region, while LPR refines this region to produce an accurate bounding box and segmentation mask. To couple the two stages, we introduce a locate-informed retrospective reward, where LPR's outputs are used to optimize GSE for more robust coarse region exploration. Additionally, we present FineRS-4k, a new dataset for evaluating MLLMs on attribute-level reasoning and pixel-level segmentation on subtle, small-scale targets in complex high-resolution scenes. Experimental results on FineRS-4k and public datasets demonstrate that our method consistently outperforms state-of-the-art MLLM-based approaches on both instruction-guided segmentation and visual reasoning tasks.
中文摘要 多模态大型语言模型（MLLM）在广泛的视觉语言任务中表现出卓越的能力。然而，由于输入分辨率有限，MLLM 在精确理解和定位高分辨率图像中的视觉细节方面面临着重大挑战，尤其是在处理嵌入杂乱环境中的超小对象时。为了解决这个问题，我们提出了 FineRS，这是一个基于 MLLM 的两阶段强化学习框架，用于在高分辨率场景中联合推理和分割极小的物体。FineRS 采用了由全局语义探索（GSE）和局部感知细化（LPR）组成的从粗到细的管道。具体来说，GSE 执行指令引导推理以生成文本响应和粗目标区域，而 LPR 则细化该区域以生成精确的边界框和分割掩码。为了将这两个阶段结合起来，我们引入了定位知情回顾性奖励，其中 LPR 的输出用于优化 GSE，以实现更稳健的粗区域探索。此外，我们还提出了 FineRS-4k，这是一个新的数据集，用于评估复杂高分辨率场景中对细微、小规模目标的属性级推理和像素级分割的 MLLM。在 FineRS-4k 和公共数据集上的实验结果表明，我们的方法在指令引导分割和视觉推理任务上始终优于最先进的基于 MLLM 的方法。

Multi-turn Training with Basic Human Feedback Helps Little on LLM Reasoning

具有基本人类反馈的多轮训练对 LLM 推理几乎没有帮助

Authors: Qiang Liu, Wuganjing Song, Zhenzhou Lin, Feifan Chen, Qiaolong Cai, Chen Li, Yongduo Sui
Subjects: Computation and Language (cs.CL); Information Theory (cs.IT); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.21339
Pdf link: https://arxiv.org/pdf/2510.21339
Abstract The reasoning capabilities of Large Language Models (LLMs) are typically developed through single-turn reinforcement learning, whereas real-world applications often involve multi-turn interactions with human feedback, leading to a potential mismatch between training and deployment conditions. In this work, we study whether multi-turn training with human feedback is necessary for reasoning tasks. We compare conventional single-turn training with three multi-turn strategies and reach conclusions contrary to previous research. We find that models trained in a single-turn setting generalize effectively to both single- and multi-turn evaluations, while models trained with multi-turn strategies exhibit a significant degradation in single-turn reasoning performance. These results suggest that for tasks with complete information, robust single-turn training remains more effective and reliable, as multi-turn training with basic feedback provides limited benefits and can even degrade reasoning capabilities.
中文摘要 大型语言模型（LLM）的推理能力通常是通过单轮强化学习开发的，而实际应用通常涉及与人类反馈的多轮交互，导致训练和部署条件之间可能存在不匹配。在这项工作中，我们研究了推理任务是否需要使用人类反馈进行多轮训练。我们将传统的单回合训练与三种多回合策略进行了比较，并得出了与之前研究相反的结论。我们发现，在单回合设置中训练的模型有效地推广到单回合和多回合评估，而使用多回合策略训练的模型在单回合推理性能方面表现出显着下降。这些结果表明，对于具有完整信息的任务，鲁棒的单轮训练仍然更有效和可靠，因为具有基本反馈的多轮训练提供的好处有限，甚至会降低推理能力。

Boosting Accuracy and Efficiency of Budget Forcing in LLMs via Reinforcement Learning for Mathematical Reasoning

通过强化学习提高 LLM 预算强制在数学推理中的准确性和效率

Authors: Ravindra Aribowo Tarunokusumo, Rafael Fernandes Cunha
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.21398
Pdf link: https://arxiv.org/pdf/2510.21398
Abstract Test-time scaling methods have seen a rapid increase in popularity for their computational efficiency and parameter-independent training to improve reasoning performance on Large Language Models. One such method is called budget forcing, a decoding intervention strategy which allocates extra compute budget for thinking and elicits the inherent self-correcting behavior of the model. However, this relies on supervised fine-tuning (SFT) on long-context reasoning traces, which causes performance degradation on smaller models due to verbose responses. For this reason, we offer a framework integrating reinforcement learning (RL) to improve token efficiency and boost the performance of a 1.5B model for mathematical reasoning. We demonstrate this using only 1.5K training samples and find that our SFT+RL model performs better on the GSM8K dataset with varying compute budgets. Our main findings show an overall higher accuracy while reducing token usage by over 40% compared to the SFT model, revealing how RL can recover the losses due to long-context training and improve overall performance in mathematical reasoning.
中文摘要 测试时间缩放方法因其计算效率和与参数无关的训练而迅速普及，以提高大型语言模型的推理性能。其中一种方法称为预算强制，这是一种解码干预策略，它为思维分配额外的计算预算并引发模型固有的自我纠正行为。然而，这依赖于对长上下文推理跟踪的监督微调（SFT），这会导致由于冗长响应而导致较小模型的性能下降。出于这个原因，我们提供了一个集成强化学习（RL）的框架，以提高令牌效率并提高 1.5B 数学推理模型的性能。我们仅使用 1.5K 训练样本来证明这一点，并发现我们的 SFT+RL 模型在计算预算不同的 GSM8K 数据集上表现更好。我们的主要发现显示，与 SFT 模型相比，RL 总体准确性更高，同时将其令牌使用量显着减少了 40% 以上，揭示了 RL 如何恢复因长上下文训练而造成的损失，并全面提高数学推理的性能。

Unified token representations for sequential decision models

顺序决策模型的统一令牌表示

Authors: Zhuojing Tian, Yushu Chen
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.21448
Pdf link: https://arxiv.org/pdf/2510.21448
Abstract Transformers have demonstrated strong potential in offline reinforcement learning (RL) by modeling trajectories as sequences of return-to-go, states, and actions. However, existing approaches such as the Decision Transformer (DT) and its variants suffer from redundant tokenization and quadratic attention complexity, limiting their scalability in real-time or resource-constrained settings. To address this, we propose a Unified Token Representation (UTR) that merges return-to-go, state, and action into a single token, substantially reducing sequence length and model complexity. Theoretical analysis shows that UTR leads to a tighter Rademacher complexity bound, suggesting improved generalization. We further develop two variants: UDT and UDC, built upon transformer and gated CNN backbones, respectively. Both achieve comparable or superior performance to state-of-the-art methods with markedly lower computation. These findings demonstrate that UTR generalizes well across architectures and may provide an efficient foundation for scalable control in future large decision models.
中文摘要 Transformer 通过将轨迹建模为剩余回报（return-to-go）、状态和动作序列，在离线强化学习（RL）中展示了强大的潜力。然而，Decision Transformer（DT）及其变体等现有方法存在冗余标记化和二次注意力复杂性，限制了它们在实时或资源受限设置中的可扩展性。为了解决这个问题，我们提出了一种统一标记表示（UTR），它将剩余回报、状态和动作合并到一个标记中，从而大大减少序列长度和模型复杂性。理论分析表明，UTR 导致更严格的拉德马赫复杂度界限，这表明泛化性得到改善。我们进一步开发了两种变体：UDT 和 UDC，分别建立在 Transformer 和门控 CNN 主干之上。两者都以明显较低的计算量实现了与最先进方法相当或更好的性能。这些发现表明，UTR在各个架构中具有很好的通用性，并可能为未来大型决策模型中的可扩展控制提供有效的基础。
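To make the sequence-length claim concrete, the following is a minimal sketch, not the authors' implementation. It assumes the standard Decision Transformer layout of three tokens per timestep and models a "unified" token as a plain concatenation of return-to-go, state, and action. The function names and dimensions are hypothetical, and the abstract does not say how UTR projects the merged vector or handles the not-yet-chosen action at prediction time.

```python
import random


def interleaved_tokens(rtgs, states, actions):
    """Decision-Transformer-style layout: three tokens per timestep."""
    seq = []
    for r, s, a in zip(rtgs, states, actions):
        seq.extend([("rtg", r), ("state", s), ("action", a)])
    return seq


def unified_tokens(rtgs, states, actions):
    """Hypothetical unified layout: one concatenated vector per timestep.

    A learned projection would normally follow; it is omitted here
    because the abstract does not describe it.
    """
    return [[r] + list(s) + list(a) for r, s, a in zip(rtgs, states, actions)]


# Toy trajectory (all sizes are illustrative assumptions).
T, state_dim, action_dim = 20, 4, 2
rng = random.Random(0)
rtgs = [float(T - t) for t in range(T)]
states = [[rng.random() for _ in range(state_dim)] for _ in range(T)]
actions = [[rng.random() for _ in range(action_dim)] for _ in range(T)]

dt_seq = interleaved_tokens(rtgs, states, actions)
utr_seq = unified_tokens(rtgs, states, actions)

# Full self-attention scores every token against every other token,
# so its cost grows with the square of the sequence length.
dt_pairs = len(dt_seq) ** 2
utr_pairs = len(utr_seq) ** 2
print(len(dt_seq), len(utr_seq), dt_pairs // utr_pairs)  # 60 20 9
```

Under these assumptions, shortening the sequence by a factor of three reduces the number of attention score pairs by a factor of nine, which is the source of the computational saving the abstract refers to.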

MRO: Enhancing Reasoning in Diffusion Language Models via Multi-Reward Optimization

MRO：通过多奖励优化增强扩散语言模型的推理

Authors: Chenglong Wang, Yang Gan, Hang Zhou, Chi Hu, Yongyu Mu, Kai Song, Murun Yang, Bei Li, Chunliang Zhang, Tongran Liu, Jingbo Zhu, Zhengtao Yu, Tong Xiao
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.21473
Pdf link: https://arxiv.org/pdf/2510.21473
Abstract Recent advances in diffusion language models (DLMs) have presented a promising alternative to traditional autoregressive large language models (LLMs). However, DLMs still lag behind LLMs in reasoning performance, especially as the number of denoising steps decreases. Our analysis reveals that this shortcoming arises primarily from the independent generation of masked tokens across denoising steps, which fails to capture the token correlation. In this paper, we define two types of token correlation: intra-sequence correlation and inter-sequence correlation, and demonstrate that enhancing these correlations improves reasoning performance. To this end, we propose a Multi-Reward Optimization (MRO) approach, which encourages DLMs to consider the token correlation during the denoising process. More specifically, our MRO approach leverages test-time scaling, reject sampling, and reinforcement learning to directly optimize the token correlation with multiple elaborate rewards. Additionally, we introduce group step and importance sampling strategies to mitigate reward variance and enhance sampling efficiency. Through extensive experiments, we demonstrate that MRO not only improves reasoning performance but also achieves significant sampling speedups while maintaining high performance on reasoning benchmarks.
中文摘要 扩散语言模型（DLM）的最新进展为传统自回归大型语言模型（LLM）提供了一种有前途的替代方案。然而，DLM 在推理性能方面仍然落后于 LLM，尤其是在去噪步骤数量减少的情况下。我们的分析表明，这一缺点主要是由于跨去噪步骤独立生成掩码标记，无法捕获标记相关性。在本文中，我们定义了两种类型的标记相关性：序列内相关性和序列间相关性，并证明增强这些相关性可以提高推理性能。为此，我们提出了一种多奖励优化（MRO）方法，鼓励DLM在去噪过程中考虑标记相关性。更具体地说，我们的 MRO 方法利用测试时缩放、拒绝采样和强化学习，通过多个精心设计的奖励直接优化标记相关性。此外，我们还引入了组步和重要性抽样策略，以减轻奖励方差并提高抽样效率。通过广泛的实验，我们证明MRO不仅提高了推理性能，还实现了显著的采样加速，同时在推理基准上保持了高性能。

A Unified Model for Multi-Task Drone Routing in Post-Disaster Road Assessment

灾后道路评估中多任务无人机路由统一模型

Authors: Huatian Gong, Jiuh-Biing Sheu, Zheng Wang, Xiaoguang Yang, Ran Yan
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.21525
Pdf link: https://arxiv.org/pdf/2510.21525
Abstract Post-disaster road assessment (PDRA) is essential for emergency response, enabling rapid evaluation of infrastructure conditions and efficient allocation of resources. Although drones provide a flexible and effective tool for PDRA, routing them in large-scale networks remains challenging. Traditional optimization methods scale poorly and demand domain expertise, while existing deep reinforcement learning (DRL) approaches adopt a single-task paradigm, requiring separate models for each problem variant and lacking adaptability to evolving operational needs. This study proposes a unified model (UM) for drone routing that simultaneously addresses eight PDRA variants. By training a single neural network across multiple problem configurations, UM captures shared structural knowledge while adapting to variant-specific constraints through a modern transformer encoder-decoder architecture. A lightweight adapter mechanism further enables efficient finetuning to unseen attributes without retraining, enhancing deployment flexibility in dynamic disaster scenarios. Extensive experiments demonstrate that the UM reduces training time and parameters by a factor of eight compared with training separate models, while consistently outperforming single-task DRL methods by 6-14% and traditional optimization approaches by 24-82% in terms of solution quality (total collected information value). The model achieves real-time solutions (1-10 seconds) across networks of up to 1,000 nodes, with robustness confirmed through sensitivity analyses. Moreover, finetuning experiments show that unseen attributes can be effectively incorporated with minimal cost while retaining high solution quality. The proposed UM advances neural combinatorial optimization for time-critical applications, offering a computationally efficient, high-quality, and adaptable solution for drone-based PDRA.
中文摘要 灾后道路评估（PDRA）对于应急响应至关重要，可以快速评估基础设施状况并有效分配资源。尽管无人机为 PDRA 提供了灵活有效的工具，但在大规模网络中路由它们仍然具有挑战性。传统的优化方法扩展性很差，需要领域专业知识，而现有的深度强化学习（DRL）方法采用单任务范式，需要为每个问题变体提供单独的模型，并且缺乏对不断变化的运营需求的适应性。本研究提出了一种同时解决八种PDRA变体的无人机路由统一模型（UM）。通过跨多个问题配置训练单个神经网络，UM 捕获共享结构知识，同时通过现代 Transformer 编码器-解码器架构适应特定于变体的约束。轻量级适配器机制进一步实现了对看不见的属性的高效微调，无需重新训练，增强了动态灾难场景下的部署灵活性。广泛的实验表明，与训练单独的模型相比，UM 将训练时间和参数减少了 8 倍，同时在解决方案质量（收集的总信息值）方面始终比单任务 DRL 方法高出 6-14%，比传统优化方法高出 24-82%。该模型在多达 1,000 个节点的网络上实现实时解决方案（1-10 秒），并通过灵敏度分析确认了鲁棒性。此外，微调实验表明，在保持高解决方案质量的同时，可以以最低的成本有效地整合看不见的属性。所提出的 UM 推进了时间关键型应用的神经组合优化，为基于无人机的 PDRA 提供了计算高效、高质量和适应性强的解决方案。

RETuning: Upgrading Inference-Time Scaling for Stock Movement Prediction with Large Language Models

RETuning：利用大型语言模型升级股票走势预测的推理时间缩放

Authors: Xueyuan Lin, Cehao Yang, Ye Ma, Ming Li, Rongjunchen Zhang, Yang Ni, Xiaojun Wu, Chengjin Xu, Jian Guo, Hui Xiong
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.21604
Pdf link: https://arxiv.org/pdf/2510.21604
Abstract Recently, large language models (LLMs) have demonstrated outstanding reasoning capabilities on mathematical and coding tasks. However, their application to financial tasks, especially the most fundamental task of stock movement prediction, remains underexplored. We study a three-class classification problem (up, hold, down) and, by analyzing existing reasoning responses, observe that: (1) LLMs follow analysts' opinions rather than exhibit a systematic, independent analytical logic (CoTs). (2) LLMs list summaries from different sources without weighing adversarial evidence, yet such counterevidence is crucial for reliable prediction. This shows that the model does not make good use of its reasoning ability to complete the task. To address this, we propose Reflective Evidence Tuning (RETuning), a cold-start method prior to reinforcement learning, to enhance prediction ability. While generating CoT, RETuning encourages dynamically constructing an analytical framework from diverse information sources, organizing and scoring evidence for price up or down based on that framework rather than on contextual viewpoints, and finally reflecting to derive the prediction. This approach maximally aligns the model with its learned analytical framework, ensuring independent logical reasoning and reducing undue influence from context. We also build a large-scale dataset spanning all of 2024 for 5,123 A-share stocks, with long contexts (32K tokens) and over 200K samples. In addition to price and news, it incorporates analysts' opinions, quantitative reports, fundamental data, macroeconomic indicators, and similar stocks. Experiments show that RETuning successfully unlocks the model's reasoning ability in the financial domain. Inference-time scaling still works even after 6 months or on out-of-distribution stocks, since the models gain valuable insights about stock movement prediction.
中文摘要 近年来，大型语言模型（LLMs）在数学和编码任务上表现出了出色的推理能力。然而，它们在金融任务中的应用，尤其是股票走势预测这一最基本的任务，仍然没有得到充分探索。我们研究了一个三类分类问题（上涨、持平、下跌），并通过分析现有的推理响应，观察到：（1） LLM 遵循分析师的意见，而不是表现出系统的、独立的分析逻辑（CoT）。（2）LLM 列出来自不同来源的摘要，而不权衡对抗性证据，但这种反证对于可靠预测至关重要。这表明模型没有很好地利用其推理能力来完成任务。为了解决这个问题，我们提出了反思证据调优（RETuning），这是一种强化学习之前的冷启动方法，以增强预测能力。在生成 CoT 的同时，RETuning 鼓励从不同的信息源动态构建一个分析框架，根据该框架而不是上下文观点来组织和评分价格上涨或下跌的证据，并最终反思以得出预测。这种方法最大限度地使模型与其学习的分析框架保持一致，确保独立的逻辑推理并减少上下文的不当影响。我们还为 5,123 只 A 股股票构建了一个跨越 2024 年全年的大规模数据集，具有长上下文（32K 标记）和超过 200K 样本。除了价格和新闻外，它还包含分析师的观点、量化报告、基本面数据、宏观经济指标和类似股票。实验表明，RETuning成功解锁了模型在金融领域的推理能力。即使在 6 个月后或对分布外的股票，推理时间缩放仍然有效，因为模型获得了有关股票走势预测的宝贵见解。

Enhancing Tactile-based Reinforcement Learning for Robotic Control

增强基于触觉的强化学习以进行机器人控制

Authors: Elle Miller, Trevor McInroe, David Abel, Oisin Mac Aodha, Sethu Vijayakumar
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.21609
Pdf link: https://arxiv.org/pdf/2510.21609
Abstract Achieving safe, reliable real-world robotic manipulation requires agents to evolve beyond vision and incorporate tactile sensing to overcome sensory deficits and reliance on idealised state information. Despite its potential, the efficacy of tactile sensing in reinforcement learning (RL) remains inconsistent. We address this by developing self-supervised learning (SSL) methodologies to more effectively harness tactile observations, focusing on a scalable setup of proprioception and sparse binary contacts. We empirically demonstrate that sparse binary tactile signals are critical for dexterity, particularly for interactions that proprioceptive control errors do not register, such as decoupled robot-object motions. Our agents achieve superhuman dexterity in complex contact tasks (ball bouncing and Baoding ball rotation). Furthermore, we find that decoupling the SSL memory from the on-policy memory can improve performance. We release the Robot Tactile Olympiad (RoTO) benchmark to standardise and promote future research in tactile-based manipulation. Project page: this https URL
中文摘要 实现安全、可靠的现实世界机器人操纵需要智能体超越视觉并结合触觉传感来克服感官缺陷和对理想状态信息的依赖。尽管具有潜力，但触觉传感在强化学习（RL）中的功效仍然不一致。我们通过开发自监督学习（SSL）方法来更有效地利用触觉观察来解决这个问题，重点关注本体感觉和稀疏二元接触的可扩展设置。我们凭经验证明，稀疏的二元触觉信号对于灵巧性至关重要，特别是对于本体感觉控制误差无法反映的交互，例如解耦的机器人-物体运动。我们的智能体在复杂的接触任务（球弹跳和保定球旋转）中实现了超人的灵巧性。此外，我们发现将 SSL 内存与同策略（on-policy）内存解耦可以提高性能。我们发布机器人触觉奥林匹克竞赛（RoTO）基准测试，以标准化和促进基于触觉的操作的未来研究。项目页面：此 https URL

DeepAgent: A General Reasoning Agent with Scalable Toolsets

DeepAgent：具有可扩展工具集的通用推理代理

Authors: Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Guanting Dong, Jiajie Jin, Yinuo Wang, Hao Wang, Yutao Zhu, Ji-Rong Wen, Yuan Lu, Zhicheng Dou
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.21618
Pdf link: https://arxiv.org/pdf/2510.21618
Abstract Large reasoning models have demonstrated strong problem-solving abilities, yet real-world tasks often require external tools and long-horizon interactions. Existing agent frameworks typically follow predefined workflows, which limit autonomous and global task completion. In this paper, we introduce DeepAgent, an end-to-end deep reasoning agent that performs autonomous thinking, tool discovery, and action execution within a single, coherent reasoning process. To address the challenges of long-horizon interactions, particularly the context length explosion from multiple tool calls and the accumulation of interaction history, we introduce an autonomous memory folding mechanism that compresses past interactions into structured episodic, working, and tool memories, reducing error accumulation while preserving critical information. To teach general-purpose tool use efficiently and stably, we develop an end-to-end reinforcement learning strategy, namely ToolPO, that leverages LLM-simulated APIs and applies tool-call advantage attribution to assign fine-grained credit to the tool invocation tokens. Extensive experiments on eight benchmarks, including general tool-use tasks (ToolBench, API-Bank, TMDB, Spotify, ToolHop) and downstream applications (ALFWorld, WebShop, GAIA, HLE), demonstrate that DeepAgent consistently outperforms baselines across both labeled-tool and open-set tool retrieval scenarios. This work takes a step toward more general and capable agents for real-world applications. The code and demo are available at this https URL.
中文摘要 大型推理模型已经展示了强大的解决问题的能力，但现实世界的任务往往需要外部工具和长期的交互。现有的代理框架通常遵循预定义的工作流，这限制了自主和全局任务的完成。在本文中，我们介绍了 DeepAgent，这是一种端到端的深度推理代理，可在单一的、连贯的推理过程中执行自主思维、工具发现和动作执行。为了应对长视野交互的挑战，特别是多个工具调用带来的上下文长度爆炸和交互历史的积累，我们引入了一种自主记忆折叠机制，该机制将过去的交互压缩为结构化的情景、工作和工具记忆，在保留关键信息的同时减少错误积累。为了高效、稳定地教授通用工具的使用，我们开发了一种端到端的强化学习策略，即 ToolPO，它利用 LLM 模拟的 API，并应用工具调用优势归因来为工具调用标记分配细粒度的功劳。对八个基准测试的广泛实验，包括一般工具使用任务（ToolBench、API-Bank、TMDB、Spotify、ToolHop）和下游应用程序（ALFWorld、WebShop、GAIA、HLE），表明 DeepAgent 在标记工具和开放集工具检索场景中始终优于基线。这项工作朝着为实际应用提供更通用、更强大的代理迈出了一步。代码和演示可在此 https URL 中找到。

DEEDEE: Fast and Scalable Out-of-Distribution Dynamics Detection

DEEDEE：快速、可扩展的分布外动态检测

Authors: Tala Aljaafari, Varun Kanade, Philip Torr, Christian Schroeder de Witt
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.21638
Pdf link: https://arxiv.org/pdf/2510.21638
Abstract Deploying reinforcement learning (RL) in safety-critical settings is constrained by brittleness under distribution shift. We study out-of-distribution (OOD) detection for RL time series and introduce DEEDEE, a two-statistic detector that revisits representation-heavy pipelines with a minimal alternative. DEEDEE uses only an episodewise mean and an RBF kernel similarity to a training summary, capturing complementary global and local deviations. Despite its simplicity, DEEDEE matches or surpasses contemporary detectors across standard RL OOD suites, delivering a 600-fold reduction in compute (FLOPs / wall-time) and an average 5% absolute accuracy gain over strong baselines. Conceptually, our results indicate that diverse anomaly types often imprint on RL trajectories through a small set of low-order statistics, suggesting a compact foundation for OOD detection in complex environments.
中文摘要 在安全关键型环境中部署强化学习（RL）受到分布偏移下的脆性的限制。我们研究了 RL 时间序列的分布外（OOD）检测，并引入了 DEEDEE，这是一种双统计量检测器，它以最少的替代方案重新审视了表示密集型管道。DEEDEE 仅使用逐集均值和 RBF 核相似性来进行训练摘要，捕获互补的全局和局部偏差。尽管DEEDEE很简单，但它在标准RL OOD套件中与现代探测器相当或超过，在强基线上，计算（FLOPs/墙时间）减少了600倍，平均绝对精度提高了5%。从概念上讲，我们的结果表明，不同的异常类型通常通过一小部分低阶统计数据印在RL轨迹上，这表明在复杂环境中为OOD检测奠定了紧凑的基础。
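As a rough illustration of a two-statistic detector of the kind the abstract describes, here is a minimal sketch. It is not the authors' implementation: the abstract does not say what the training summary is, how the kernel bandwidth is set, or how the two statistics are combined. The sketch assumes the summary is the mean of the training episode means, and every function name, the `gamma` value, and the additive score are hypothetical choices.

```python
import math


def episode_mean(episode):
    """Per-dimension mean over the timesteps of one episode."""
    n = len(episode)
    return [sum(step[d] for step in episode) / n for d in range(len(episode[0]))]


def rbf_similarity(x, y, gamma=1.0):
    """RBF kernel exp(-gamma * ||x - y||^2); equals 1 when x == y."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def fit_summary(train_episodes):
    """Assumed training summary: the mean of the training episode means."""
    return episode_mean([episode_mean(ep) for ep in train_episodes])


def ood_score(episode, summary, gamma=1.0):
    """Higher means more anomalous (hypothetical additive combination).

    Global term: distance of the episode mean from the summary.
    Local term: one minus the average per-step RBF similarity to the summary.
    """
    mean_shift = math.dist(episode_mean(episode), summary)
    mean_sim = sum(rbf_similarity(s, summary, gamma) for s in episode) / len(episode)
    return mean_shift + (1.0 - mean_sim)


# Toy data: training observations cluster near the origin.
train = [[[0.0, 0.0], [0.1, -0.1]], [[-0.1, 0.1], [0.0, 0.0]]]
summary = fit_summary(train)

in_dist = [[0.0, 0.1], [0.1, 0.0]]
shifted = [[2.0, 2.0], [2.1, 1.9]]
print(ood_score(in_dist, summary) < ood_score(shifted, summary))  # True
```

In practice a threshold on the score would be calibrated on held-out in-distribution episodes; that step is left out here because the abstract gives no detail on it.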

Mechanistic Interpretability for Neural TSP Solvers

神经 TSP 求解器的机理可解释性

Authors: Reuben Narad, Leonard Boussioux, Michael Wagner
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.21693
Pdf link: https://arxiv.org/pdf/2510.21693
Abstract Neural networks have advanced combinatorial optimization, with Transformer-based solvers achieving near-optimal solutions on the Traveling Salesman Problem (TSP) in milliseconds. However, these models operate as black boxes, providing no insight into the geometric patterns they learn or the heuristics they employ during tour construction. We address this opacity by applying sparse autoencoders (SAEs), a mechanistic interpretability technique, to a Transformer-based TSP solver, representing the first application of activation-based interpretability methods to operations research models. We train a pointer network with reinforcement learning on 100-node instances, then fit an SAE to the encoder's residual stream to discover an overcomplete dictionary of interpretable features. Our analysis reveals that the solver naturally develops features mirroring fundamental TSP concepts: boundary detectors that activate on convex-hull nodes, cluster-sensitive features responding to locally dense regions, and separator features encoding geometric partitions. These findings provide the first model-internal account of what neural TSP solvers compute before node selection, demonstrate that geometric structure emerges without explicit supervision, and suggest pathways toward transparent hybrid systems that combine neural efficiency with algorithmic interpretability. Interactive feature explorer: this https URL
中文摘要 神经网络推动了组合优化的发展，基于 Transformer 的求解器可以在几毫秒内实现旅行推销员问题（TSP）的近乎最优的解。然而，这些模型就像黑匣子一样运行，无法深入了解它们学习的几何模式或它们在路径构建过程中采用的启发式方法。我们通过将稀疏自动编码器（SAE）（一种机理可解释性技术）应用于基于 Transformer 的 TSP 求解器来解决这种不透明性，这是基于激活的可解释性方法首次应用于运筹学模型。我们在 100 个节点实例上使用强化学习训练指针网络，然后将 SAE 拟合到编码器的残差流中，以发现可解释特征的过完备字典。我们的分析表明，求解器自然会形成反映基本 TSP 概念的特征：在凸包节点上激活的边界检测器、响应局部密集区域的集群敏感特征以及编码几何分区的分隔器特征。这些发现为神经 TSP 求解器在节点选择之前计算的内容提供了第一个模型内部解释，证明几何结构在没有明确监督的情况下出现，并提出了将神经效率与算法可解释性相结合的透明混合系统的途径。交互式特征浏览器：此 https URL

Keyword: diffusion policy

There is no result