Arxiv Papers of Today

Generated: 2025-10-21 16:32:13 (UTC+8); arXiv release time: 2025-10-21 20:00 EDT (2025-10-22 08:00 UTC+8)

61 related papers today

Keyword: reinforcement learning

DiffPlace: A Conditional Diffusion Framework for Simultaneous VLSI Placement Beyond Sequential Paradigms

Authors: Kien Le Trung, Truong-Son Hy
Subjects: Hardware Architecture (cs.AR)
Arxiv link: https://arxiv.org/abs/2510.15897
Pdf link: https://arxiv.org/pdf/2510.15897
Abstract Chip placement, the task of determining optimal positions of circuit modules on a chip canvas, is a critical step in the VLSI design flow that directly impacts performance, power consumption, and routability. Traditional methods rely on analytical optimization or reinforcement learning, which struggle with hard placement constraints or require expensive online training for each new circuit design. To address these limitations, we introduce DiffPlace, a framework that formulates chip placement as a conditional denoising diffusion process, enabling transferable placement policies that generalize to unseen circuit netlists without retraining. DiffPlace leverages the generative capabilities of diffusion models to efficiently explore the vast space of placement while conditioning on circuit connectivity and relative quality metrics to identify optimal solutions globally. Our approach combines energy-guided sampling with constrained manifold diffusion to ensure placement legality, achieving extremely low overlap across all experimental scenarios. Our method bridges the gap between optimization-based and learning-based approaches, offering a practical path toward automated, high-quality chip placement for modern VLSI design. Our source code is publicly available at: this https URL

Cog-Rethinker: Hierarchical Metacognitive Reinforcement Learning for LLM Reasoning

Authors: Zexu Sun, Yongcheng Zeng, Erxue Min, Heyang Gao, Bokai Ji, Xu Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.15979
Pdf link: https://arxiv.org/pdf/2510.15979
Abstract Contemporary progress in large language models (LLMs) has revealed notable inferential capacities via reinforcement learning (RL) employing verifiable reward, facilitating the development of O1 and R1-like reasoning models. Directly training from base models with RL is called zero-RL. However, previous works rely upon activating LLMs' inherent capacities through fixed prompt templates. This strategy introduces substantial sampling inefficiencies for weak LLMs, as the majority of problems generate invalid outputs during accuracy-driven filtration in reasoning tasks, which causes a waste of samples. To solve this issue, we propose Cog-Rethinker, a novel hierarchical metacognitive RL framework for LLM reasoning. Cog-Rethinker mainly focuses on the rollout procedure in RL training. After the direct rollout, Cog-Rethinker improves sample utilization in a hierarchical metacognitive two-stage framework. Drawing on how humans solve problems, it first prompts the policy to decompose zero-accuracy problems into subproblems to produce final reasoning results. Second, for problems that still have zero accuracy after the previous rollout stage, it further prompts the policy to refine its answers by referencing previous wrong solutions. Moreover, to enable cold-start of the two new reasoning patterns and maintain train-test consistency across prompt templates, Cog-Rethinker applies supervised fine-tuning on the policy using correct samples of the two stages with the direct rollout template. Experimental results demonstrate Cog-Rethinker's superior performance on various mathematical reasoning benchmarks. We also analyze its improved sample efficiency, which accelerates convergence compared to baseline methods.
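The staged rollout described in this abstract can be sketched as a simple fallback loop. This is only an illustration of the control flow, not the authors' implementation: the `sample` and `is_correct` callables, the template names, and the group size `n` are hypothetical stand-ins for the policy model and the verifier.

```python
def hierarchical_rollout(problems, sample, is_correct, n=4):
    """Try each problem with progressively richer prompt templates.

    sample(problem, template, wrong_answers, n) -> list of n answers
    is_correct(problem, answer) -> bool
    Both callables are hypothetical interfaces supplied by the caller.
    """
    rescued = {}
    for problem in problems:
        wrong = []  # failed answers, passed back as context for "rethink"
        for template in ("direct", "decompose", "rethink"):
            context = list(wrong) if template == "rethink" else []
            answers = sample(problem, template, context, n)
            correct = [a for a in answers if is_correct(problem, a)]
            if correct:
                # Record which stage produced usable samples, then stop.
                rescued[problem] = (template, correct)
                break
            wrong.extend(answers)
    return rescued


# Toy stand-ins: each problem becomes solvable only at one given stage.
SOLVED_AT = {"easy": "direct", "medium": "decompose", "hard": "rethink"}


def toy_sample(problem, template, wrong_answers, n):
    ok = SOLVED_AT.get(problem) == template
    return ["right" if ok else "wrong"] * n


def toy_is_correct(problem, answer):
    return answer == "right"


result = hierarchical_rollout(
    ["easy", "medium", "hard", "unsolved"], toy_sample, toy_is_correct
)
```

In this toy run, a problem that never yields a correct answer is simply absent from `result`, which mirrors the wasted samples the abstract aims to reduce.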

Can GRPO Help LLMs Transcend Their Pretraining Origin?

Authors: Kangqi Ni, Zhen Tan, Zijie Liu, Pingzhi Li, Tianlong Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.15990
Pdf link: https://arxiv.org/pdf/2510.15990
Abstract Reinforcement Learning with Verifiable Rewards (RLVR), primarily driven by the Group Relative Policy Optimization (GRPO) algorithm, is a leading approach for enhancing the reasoning abilities of Large Language Models (LLMs). Despite its wide adoption, GRPO's gains are often inconsistent; for instance, a model may show significant improvement in one reasoning domain, like mathematics, yet remain stagnant in another, such as medicine. This inconsistency raises a critical question: under what conditions does GRPO improve reasoning and generalize out-of-distribution (OOD)? We investigate this from a data distribution perspective. We first prove theoretically that GRPO is a conservative reweighting scheme, bounded by the base model's distribution and thus unable to discover completely novel solutions. We further validate this in carefully designed controlled studies by training transformers from scratch, evaluating generalization across reasoning depth, input length, token representation, and compositionality. Our results provide a principled explanation for GRPO's boundaries: OOD improvement emerges only when the target task aligns with the model's pretrained biases, while gains on in-distribution (ID) tasks diminish as performance saturates. This reframes GRPO not as a universal reasoning enhancer but as a tool that sharpens pretraining biases. Our findings motivate future development of algorithms that can expand a model's capabilities beyond its pretraining origin.
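One way to see why GRPO can only reweight what the base model already samples is to look at the group-relative advantage in the commonly used GRPO formulation. The sketch below is not the paper's proof. It assumes the usual normalization (reward minus group mean, divided by group standard deviation), and the `eps` stabilizer is an illustrative assumption.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples in its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# A group where the base model found one correct answer out of four.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

# A group where no sampled answer is correct: every advantage is zero,
# so the policy gradient carries no signal for this prompt.
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

When every sample in a group receives the same reward, the update vanishes, which is consistent with the abstract's claim that gains are bounded by what the base distribution can already produce.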

Using Kolmogorov-Smirnov Distance for Measuring Distribution Shift in Machine Learning

Authors: Ozan K. Tonguz, Federico Taschin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.15996
Pdf link: https://arxiv.org/pdf/2510.15996
Abstract One of the major problems in Machine Learning (ML) and Artificial Intelligence (AI) is that the probability distribution of the test data in the real world could deviate substantially from the probability distribution of the training data set. When this happens, the predictions of an ML system or an AI agent could involve large errors, which is troublesome and undesirable. While this is a well-known hard problem plaguing the accuracy and reliability of AI and ML systems, in certain applications such errors could be critical for the safety and reliability of these systems. One approach to deal with this problem is to monitor and measure the deviation in the probability distribution of the test data in real time and to compensate for this deviation. In this paper, we propose and explore the use of the Kolmogorov-Smirnov (KS) Test for measuring the distribution shift, and we show how the KS distance can be used to quantify the distribution shift and its impact on an AI agent's performance. Our results suggest that KS distance could be used as a valuable statistical tool for monitoring and measuring the distribution shift. More specifically, it is shown that even a distance of KS=0.02 could lead to about a 50% increase in the travel time at a single intersection using a Reinforcement Learning agent, which is quite significant. It is hoped that the use of the KS Test and KS distance in AI-based smart transportation could be an important step forward for gauging the performance degradation of an AI agent in real time and that this, in turn, could help the AI agent cope with the distribution shift in a more informed manner.
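For reference, the two-sample KS distance is the largest gap between two empirical cumulative distribution functions. The following is a minimal sketch in plain Python, not the authors' code. The function name and the toy samples are illustrative assumptions.

```python
from bisect import bisect_right


def ks_distance(sample_a, sample_b):
    """Return sup_x |F_a(x) - F_b(x)| for the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    # Both empirical CDFs are step functions that only change at data
    # points, so the largest gap is attained at one of those points.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


# Toy example: "training" data versus a shifted "test" sample.
train = [1, 2, 3, 4]
test = [3, 4, 5, 6]
shift = ks_distance(train, test)
```

A value of 0 means the two empirical distributions coincide, and a value of 1 means the samples do not overlap at all.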

Transfer learning strategies for accelerating reinforcement-learning-based flow control

Authors: Saeed Salehi
Subjects: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
Arxiv link: https://arxiv.org/abs/2510.16016
Pdf link: https://arxiv.org/pdf/2510.16016
Abstract This work investigates transfer learning strategies to accelerate deep reinforcement learning (DRL) for multifidelity control of chaotic fluid flows. Progressive neural networks (PNNs), a modular architecture designed to preserve and reuse knowledge across tasks, are employed for the first time in the context of DRL-based flow control. In addition, a comprehensive benchmarking of conventional fine-tuning strategies is conducted, evaluating their performance, convergence behavior, and ability to retain transferred knowledge. The Kuramoto-Sivashinsky (KS) system is employed as a benchmark to examine how knowledge encoded in control policies, trained in low-fidelity environments, can be effectively transferred to high-fidelity settings. Systematic evaluations show that while fine-tuning can accelerate convergence, it is highly sensitive to pretraining duration and prone to catastrophic forgetting. In contrast, PNNs enable stable and efficient transfer by preserving prior knowledge and providing consistent performance gains, and are notably robust to overfitting during the pretraining phase. Layer-wise sensitivity analysis further reveals how PNNs dynamically reuse intermediate representations from the source policy while progressively adapting deeper layers to the target task. Moreover, PNNs remain effective even when the source and target environments differ substantially, such as in cases with mismatched physical regimes or control objectives, where fine-tuning strategies often result in suboptimal adaptation or complete failure of knowledge transfer. The results highlight the potential of novel transfer learning frameworks for robust, scalable, and computationally efficient flow control that can potentially be applied to more complex flow configurations.

Airfoil optimization using Design-by-Morphing with minimized design-space dimensionality

Authors: Sangjoon Lee, Haris Moazam Sheikh
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.16020
Pdf link: https://arxiv.org/pdf/2510.16020
Abstract Effective airfoil geometry optimization requires exploring a diverse range of designs using as few design variables as possible. This study introduces AirDbM, a Design-by-Morphing (DbM) approach specialized for airfoil optimization that systematically reduces design-space dimensionality. AirDbM selects an optimal set of 12 baseline airfoils from the UIUC airfoil database, which contains over 1,600 shapes, by sequentially adding the baseline that most increases the design capacity. With these baselines, AirDbM reconstructs 99% of the database with a mean absolute error below 0.005, which matches the performance of a previous DbM approach that used more baselines. In multi-objective aerodynamic optimization, AirDbM demonstrates rapid convergence and achieves a Pareto front with a greater hypervolume than that of the previous larger-baseline study, where new Pareto-optimal solutions are discovered with enhanced lift-to-drag ratios at moderate stall tolerances. Furthermore, AirDbM demonstrates outstanding adaptability for reinforcement learning (RL) agents in generating airfoil geometry when compared to conventional airfoil parameterization methods, implying the broader potential of DbM in machine learning-driven design.

Feature-driven reinforcement learning for photovoltaic in continuous intraday trading

Authors: Arega Getaneh Abate, Xiufeng Liu, Ruyu Liu, Xiaobing Zhang
Subjects: Machine Learning (cs.LG); General Economics (econ.GN)
Arxiv link: https://arxiv.org/abs/2510.16021
Pdf link: https://arxiv.org/pdf/2510.16021
Abstract Photovoltaic (PV) operators face substantial uncertainty in generation and short-term electricity prices. Continuous intraday markets enable producers to adjust their positions in real time, potentially improving revenues and reducing imbalance costs. We propose a feature-driven reinforcement learning (RL) approach for PV intraday trading that integrates data-driven features into the state and learns bidding policies in a sequential decision framework. The problem is cast as a Markov Decision Process with a reward that balances trading profit and imbalance penalties and is solved with Proximal Policy Optimization (PPO) using a predominantly linear, interpretable policy. Trained on historical market data and evaluated out-of-sample, the strategy consistently outperforms benchmark baselines across diverse scenarios. Extensive validation shows rapid convergence, real-time inference, and transparent decision rules. Learned weights highlight the central role of market microstructure and historical features. Taken together, these results indicate that feature-driven RL offers a practical, data-efficient, and operationally deployable pathway for active intraday participation by PV producers.

RoBCtrl: Attacking GNN-Based Social Bot Detectors via Reinforced Manipulation of Bots Control Interaction

Authors: Yingguang Yang, Xianghua Zeng, Qi Wu, Hao Peng, Yutong Xia, Hao Liu, Bin Chong, Philip S. Yu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2510.16035
Pdf link: https://arxiv.org/pdf/2510.16035
Abstract Social networks have become a crucial source of real-time information for individuals. The influence of social bots within these platforms has garnered considerable attention from researchers, leading to the development of numerous detection technologies. However, the vulnerability and robustness of these detection methods are still underexplored. Existing Graph Neural Network (GNN)-based methods cannot be directly applied due to the issues of limited control over social agents, the black-box nature of bot detectors, and the heterogeneity of bots. To address these challenges, this paper proposes the first adversarial multi-agent Reinforcement learning framework for social Bot control attacks (RoBCtrl) targeting GNN-based social bot detectors. Specifically, we use a diffusion model to generate high-fidelity bot accounts by reconstructing existing account data with minor modifications, thereby evading detection on social platforms. To the best of our knowledge, this is the first application of diffusion models to mimic the behavior of evolving social bots effectively. We then employ a Multi-Agent Reinforcement Learning (MARL) method to simulate the bots' adversarial behavior. We categorize social accounts based on their influence and budget. Different agents are then employed to control bot accounts across various categories, optimizing the attachment strategy through reinforcement learning. Additionally, a hierarchical state abstraction based on structural entropy is designed to accelerate the reinforcement learning. Extensive experiments on social bot detection datasets demonstrate that our framework can effectively undermine the performance of GNN-based detectors.

PrivacyPAD: A Reinforcement Learning Framework for Dynamic Privacy-Aware Delegation

Authors: Zheng Hui, Yijiang River Dong, Sanhanat Sivapiromrat, Ehsan Shareghi, Nigel Collier
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.16054
Pdf link: https://arxiv.org/pdf/2510.16054
Abstract When users submit queries to Large Language Models (LLMs), their prompts can often contain sensitive data, forcing a difficult choice: send the query to a powerful proprietary LLM provider to achieve state-of-the-art performance at the risk of data exposure, or rely on a smaller, local model that guarantees data privacy but often degrades task performance. Prior approaches have relied on static pipelines that use LLM rewriting, which shatters linguistic coherence and indiscriminately removes privacy-sensitive information, including task-critical content. We reformulate this challenge (Privacy-Conscious Delegation) as a sequential decision-making problem and introduce a novel reinforcement learning (RL) framework called PrivacyPAD to solve it. Our framework trains an agent to dynamically route text chunks, learning a policy that optimally balances the trade-off between privacy leakage and task performance. It implicitly distinguishes between replaceable Personally Identifiable Information (PII), which it shields locally, and task-critical PII, which it strategically sends to the remote model for maximal utility. To validate our approach in complex scenarios, we also introduce a new medical dataset with high PII density. Our framework achieves a new state of the art on the privacy-utility frontier, demonstrating the necessity of learned, adaptive policies for deploying LLMs in sensitive environments.

Zero-shot World Models via Search in Memory

Authors: Federico Malato, Ville Hautamäki
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.16123
Pdf link: https://arxiv.org/pdf/2510.16123
Abstract World Models have vastly permeated the field of Reinforcement Learning. Their ability to model the transition dynamics of an environment has greatly improved sample efficiency in online RL. Among them, the best-known example is Dreamer, a model that learns to act in a diverse set of image-based environments. In this paper, we leverage similarity search and stochastic representations to approximate a world model without a training procedure. We establish a comparison with PlaNet, a well-established world model of the Dreamer family. We evaluate the models on the quality of latent reconstruction and on the perceived similarity of the reconstructed image, on both next-step and long-horizon dynamics prediction. The results of our study demonstrate that a search-based world model is comparable to a training-based one in both cases. Notably, our model shows stronger performance in long-horizon prediction with respect to the baseline on a range of visually different environments.
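The idea of replacing a trained dynamics model with a lookup can be illustrated with a nearest-neighbour search over stored transitions. This is a rough sketch rather than the authors' method: the flat tuple latents, the Euclidean distance, the exact action match, and all names here are assumptions made for illustration.

```python
import math


def predict_next(memory, latent, action):
    """Return the stored next-latent whose (latent, action) is most similar.

    memory is a list of (latent, action, next_latent) transitions.
    Returns None when no stored transition uses the requested action.
    """
    candidates = [
        (math.dist(z, latent), z_next)
        for z, a, z_next in memory
        if a == action
    ]
    if not candidates:
        return None
    # Pick the transition with the smallest distance to the query latent.
    return min(candidates, key=lambda c: c[0])[1]


# Hypothetical memory of previously observed transitions.
memory = [
    ((0.0, 0.0), "left", (-1.0, 0.0)),
    ((0.0, 0.0), "right", (1.0, 0.0)),
    ((5.0, 5.0), "right", (6.0, 5.0)),
]
near_origin = predict_next(memory, (0.2, -0.1), "right")
near_corner = predict_next(memory, (4.0, 4.5), "right")
```

Long-horizon prediction in such a scheme would repeat the lookup, feeding each predicted latent back in as the next query.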

A Minimal-Assumption Analysis of Q-Learning with Time-Varying Policies

Authors: Phalguni Nanda, Zaiwei Chen
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2510.16132
Pdf link: https://arxiv.org/pdf/2510.16132
Abstract In this work, we present the first finite-time analysis of the Q-learning algorithm under time-varying learning policies (i.e., on-policy sampling) with minimal assumptions: specifically, assuming only the existence of a policy that induces an irreducible Markov chain over the state space. We establish a last-iterate convergence rate for $\mathbb{E}[\|Q_k - Q^*\|_\infty^2]$, implying a sample complexity of order $O(1/\epsilon^2)$ for achieving $\mathbb{E}[\|Q_k - Q^*\|_\infty] \le \epsilon$, matching that of off-policy Q-learning but with a worse dependence on exploration-related parameters. We also derive an explicit rate for $\mathbb{E}[\|Q^{\pi_k} - Q^*\|_\infty^2]$, where $\pi_k$ is the learning policy at iteration $k$. These results reveal that on-policy Q-learning exhibits weaker exploration than its off-policy counterpart but enjoys an exploitation advantage, as its policy converges to an optimal one rather than remaining fixed. Numerical simulations corroborate our theory. Technically, the combination of time-varying learning policies (which induce rapidly time-inhomogeneous Markovian noise) and the minimal assumption on exploration presents significant analytical challenges. To address these challenges, we employ a refined approach that leverages the Poisson equation to decompose the Markovian noise corresponding to the lazy transition matrix into a martingale-difference term and residual terms. To control the residual terms under time inhomogeneity, we perform a sensitivity analysis of the Poisson equation solution with respect to both the Q-function estimate and the learning policy. These tools may further facilitate the analysis of general reinforcement learning algorithms with rapidly time-varying learning policies (such as single-timescale actor-critic methods and learning-in-games algorithms) and are of independent interest.
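To make the setting concrete, the sketch below runs tabular Q-learning in which each sample is drawn by a policy derived from the current Q estimate, so the behaviour policy changes at every step. It is a toy illustration, not the paper's analysis or experiments: the epsilon-greedy policy, the constant step size, and the two-state MDP are all assumptions chosen for brevity.

```python
import random


def q_learning_on_policy(P, n_states, n_actions, gamma=0.5, alpha=0.1,
                         epsilon=0.2, steps=20000, seed=0):
    """Tabular Q-learning sampled by the current epsilon-greedy policy.

    P[s][a] = (next_state, reward) describes a deterministic toy MDP.
    """
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    s = 0
    for _ in range(steps):
        # The behaviour policy depends on the current Q, so it varies in time.
        if rng.random() < epsilon:
            a = rng.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda i: Q[s][i])
        s_next, r = P[s][a]
        target = r + gamma * max(Q[s_next])
        Q[s][a] += alpha * (target - Q[s][a])
        s = s_next
    return Q


# Hypothetical two-state MDP: action 1 leads to (or stays in) state 1,
# and only staying in state 1 is rewarded.
P = {
    0: {0: (0, 0.0), 1: (1, 0.0)},
    1: {0: (0, 0.0), 1: (1, 1.0)},
}
Q = q_learning_on_policy(P, n_states=2, n_actions=2)
```

For this toy MDP with discount 0.5, solving the Bellman optimality equations by hand gives Q*(0,0) = 0.5, Q*(0,1) = 1, Q*(1,0) = 0.5 and Q*(1,1) = 2, which the learned table should approach.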

Alignment is Localized: A Causal Probe into Preference Layers

Authors: Archie Chaudhury
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.16167
Pdf link: https://arxiv.org/pdf/2510.16167
Abstract Reinforcement Learning frameworks, particularly those utilizing human annotations, have become an increasingly popular method for preference fine-tuning, where the outputs of a language model are tuned to match a certain set of behavioral policies or guidelines. Reinforcement Learning through Human Feedback (RLHF) is perhaps the most popular implementation of such a framework, particularly for aligning LMs toward safety and human intent. However, the internal workings of how such alignment is achieved remain largely opaque. In this work, we systematically analyze preference optimization for language model alignment by applying layer-wide causal patching between a base model and its tuned counterpart across human preference pairs. We implement our methodology on Llama-3.2-1B, and find that alignment is spatially localized: mid-layer activations encode a distinct subspace that causally determines reward-consistent behavior, while early and late layers remain largely unaffected. Utilizing LASSO regression, we also find that only a small number of layers possess non-zero coefficients linking activation distances to reward gains. Overall, we show that, at least for some language models, alignment from human-based, preferential tuning is a directional, low-rank process, rather than diffuse and parametric.

The Formalism-Implementation Gap in Reinforcement Learning Research

Authors: Pablo Samuel Castro
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.16175
Pdf link: https://arxiv.org/pdf/2510.16175
Abstract The last decade has seen an upswing in interest in and adoption of reinforcement learning (RL) techniques, in large part due to their demonstrated capabilities at performing certain tasks at "super-human levels". This has incentivized the community to prioritize research that demonstrates RL agent performance, often at the expense of research aimed at understanding their learning dynamics. Performance-focused research runs the risk of overfitting on academic benchmarks (thereby rendering them less useful), which can make it difficult to transfer proposed techniques to novel problems. Further, it implicitly diminishes work that does not push the performance frontier, but aims at improving our understanding of these techniques. This paper argues two points: (i) RL research should stop focusing solely on demonstrating agent capabilities, and focus more on advancing the science and understanding of reinforcement learning; and (ii) we need to be more precise on how our benchmarks map to the underlying mathematical formalisms. We use the popular Arcade Learning Environment (ALE; Bellemare et al., 2013) as an example of a benchmark that, despite being increasingly considered "saturated", can be effectively used for developing this understanding, and facilitating the deployment of RL techniques in impactful real-world problems.

Expressive Reward Synthesis with the Runtime Monitoring Language

Authors: Daniel Donnelly, Angelo Ferrando, Francesco Belardinelli
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2510.16185
Pdf link: https://arxiv.org/pdf/2510.16185
Abstract A key challenge in reinforcement learning (RL) is reward (mis)specification, whereby imprecisely defined reward functions can result in unintended, possibly harmful, behaviours. Indeed, reward functions in RL are typically treated as black-box mappings from state-action pairs to scalar values. While effective in many settings, this approach provides no information about why rewards are given, which can hinder learning and interpretability. Reward Machines address this issue by representing reward functions as finite state automata, enabling the specification of structured, non-Markovian reward functions. However, their expressivity is typically bounded by regular languages, leaving them unable to capture more complex behaviours such as counting or parametrised conditions. In this work, we build on the Runtime Monitoring Language (RML) to develop a novel class of language-based Reward Machines. By leveraging the built-in memory of RML, our approach can specify reward functions for non-regular, non-Markovian tasks. We demonstrate the expressiveness of our approach through experiments, highlighting additional advantages in flexible event-handling and task specification over existing Reward Machine-based methods.
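The limitation mentioned here, that finite-state reward machines cannot count, can be illustrated with a task of the form "n pick events followed by exactly n drop events", which is not a regular language. The sketch below is plain Python, not RML and not the authors' system. The event names and reward values are hypothetical.

```python
class CountingRewardMonitor:
    """Reward 1.0 once every earlier 'pick' is matched by a later 'drop'."""

    def __init__(self):
        self.pending = 0       # unbounded counter: the non-regular part
        self.dropping = False  # True once the first 'drop' has been seen
        self.finished = False  # True after success or violation

    def step(self, event):
        if self.finished:
            return 0.0
        if event == "pick":
            if self.dropping:
                # Picking again after dropping started violates the task.
                self.finished = True
                return -1.0
            self.pending += 1
            return 0.0
        if event == "drop":
            if self.pending == 0:
                # Dropping with nothing picked violates the task.
                self.finished = True
                return -1.0
            self.dropping = True
            self.pending -= 1
            if self.pending == 0:
                self.finished = True
                return 1.0
            return 0.0
        return 0.0  # events outside the task are ignored


def run(events):
    """Feed a whole event trace to a fresh monitor and collect rewards."""
    monitor = CountingRewardMonitor()
    return [monitor.step(e) for e in events]


balanced = run(["pick", "pick", "drop", "drop"])
violated = run(["pick", "pick", "drop", "pick"])
```

Because `pending` can grow without bound, no fixed set of automaton states can track it, which is the kind of memory the abstract attributes to RML-based specifications.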

Human-Allied Relational Reinforcement Learning

Authors: Fateme Golivand Darvishvand, Hikaru Shindo, Sahil Sidheekh, Kristian Kersting, Sriraam Natarajan
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.16188
Pdf link: https://arxiv.org/pdf/2510.16188
Abstract Reinforcement learning (RL) has experienced a second wind in the past decade. While incredibly successful in images and videos, these systems still operate within the realm of propositional tasks, ignoring the inherent structure that exists in the problem. Consequently, relational extensions (RRL) have been developed for such structured problems that allow for effective generalization to an arbitrary number of objects. However, they inherently make strong assumptions about the problem structure. We introduce a novel framework that combines RRL with object-centric representation to handle both structured and unstructured data. We enhance learning by explicitly modeling the uncertainty over the policy, which allows the system to actively query the human expert for guidance. Our empirical evaluation demonstrates the effectiveness and efficiency of our proposed approach.
中文摘要 强化学习（RL）在过去十年中经历了第二次风。虽然在图像和视频方面取得了令人难以置信的成功，但这些系统仍然在命题任务领域内运行，忽略了问题中存在的固有结构。因此，已经为此类结构化问题开发了关系扩展（RRL），这些问题允许有效地推广到任意数量的对象。然而，它们本质上对问题结构做出了强有力的假设。我们引入了一个新颖的框架，它将 RRL 与以对象为中心的表示相结合，以处理结构化和非结构化数据。我们通过允许系统通过明确模拟政策的不确定性来主动询问人类专家以获取指导，从而增强学习。我们的实证评估证明了我们所提出的方法的有效性和效率。

WEBSERV: A Browser-Server Environment for Efficient Training of Reinforcement Learning-based Web Agents at Scale

WEBSERV：用于大规模高效训练基于强化学习的 Web 代理的浏览器-服务器环境

Authors: Yuxuan Lu, Jing Huang, Hui Liu, Jiri Gesi, Yan Han, Shihan Fu, Tianqi Zheng, Dakuo Wang
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.16252
Pdf link: https://arxiv.org/pdf/2510.16252
Abstract Training and evaluation of Reinforcement Learning (RL) web agents have gained increasing attention, yet a scalable and efficient environment that couples realistic and robust browser-side interaction with controllable server-side state at scale is still missing. Existing environments tend to have one or more of the following issues: they overwhelm policy models with excessive and noisy context; they perform actions non-deterministically without waiting for the UI or network to stabilize; or they cannot scale isolated client-server containers effectively for parallel RL rollouts. We propose WEBSERV, an environment that includes 1) a compact, site-agnostic browser environment that balances context and action complexity, and 2) a scalable RL environment that launches and resets web servers efficiently to enable scalable RL training and evaluation. We evaluate WEBSERV on the shopping CMS and Gitlab tasks in WebArena, achieving state-of-the-art single-prompt success rates while cutting launch latency by ~5x and storage needs by ~240x, with a comparable memory footprint, enabling 200+ concurrent containers on a single host.
中文摘要 强化学习（RL） Web 代理的训练和评估越来越受到关注，但仍然缺乏一个可扩展且高效的环境，将真实而强大的浏览器端交互与可控的大规模服务器端状态相结合。现有环境往往存在以下一个或多个问题：它们因过度和嘈杂的上下文而压倒策略模型；它们以非确定性的方式执行操作，而无需等待 UI 或网络稳定；或者它们无法有效地扩展隔离的客户端-服务器容器以进行并行 RL rollout。我们提出了 WEBSERV，该环境包括 1）一个紧凑的、与站点无关的浏览器环境，它平衡了上下文和操作的复杂性，以及 2）一个可扩展的 RL 环境，通过高效启动和重置 Web 服务器来实现可扩展的 RL 训练和评估。我们在 WebArena 中的购物 CMS 和 Gitlab 任务上评估了 WEBSERV，实现了最先进的单提示成功率，同时将启动延迟减少了 ~5 倍，存储需求减少了 ~240 倍，内存占用相当，在单个主机上实现了 200+ 并发容器。

Distractor Injection Attacks on Large Reasoning Models: Characterization and Defense

对大型推理模型的干扰器注入攻击：表征与防御

Authors: Zhehao Zhang, Weijie Xu, Shixian Cui, Chandan K. Reddy
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.16259
Pdf link: https://arxiv.org/pdf/2510.16259
Abstract Recent advances in large reasoning models (LRMs) have enabled remarkable performance on complex tasks such as mathematics and coding by generating long Chain-of-Thought (CoT) traces. In this paper, we identify and systematically analyze a critical vulnerability we term reasoning distraction, where LRMs are diverted from their primary objective by irrelevant yet complex tasks maliciously embedded in the prompt. Through a comprehensive study across diverse models and benchmarks, we show that even state-of-the-art LRMs are highly susceptible, with injected distractors reducing task accuracy by up to 60%. We further reveal that certain alignment techniques can amplify this weakness and that models may exhibit covert compliance, following hidden adversarial instructions in reasoning while concealing them in the final output. To mitigate these risks, we propose a training-based defense that combines Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) on synthetic adversarial data, improving robustness by over 50 points on challenging distractor attacks. Our findings establish reasoning distraction as a distinct and urgent threat to LRM reliability and provide a practical step toward safer and more trustworthy reasoning systems.
中文摘要 大型推理模型（LRM）的最新进展通过生成长思维链（CoT）跟踪，在数学和编码等复杂任务上实现了卓越的性能。在本文中，我们识别并系统地分析了一个我们称之为推理分心的关键漏洞，即 LRM 被恶意嵌入在提示中的不相关但复杂的任务偏离了其主要目标。通过对不同模型和基准的全面研究，我们表明，即使是最先进的 LRM 也很容易受到影响，注入的干扰物会使任务准确性降低多达 60%。我们进一步揭示，某些对齐技术可以放大这一弱点，并且模型可能会表现出隐蔽的顺从性，在推理中遵循隐藏的对抗指令，同时将它们隐藏在最终输出中。为了减轻这些风险，我们提出了一种基于训练的防御，该防御结合了对合成对抗数据的监督微调（SFT）和强化学习（RL），在具有挑战性的干扰性攻击中将鲁棒性提高了 50 多个百分点。我们的研究结果将推理分心确定为对 LRM 可靠性的明显而紧迫的威胁，并为实现更安全、更值得信赖的推理系统提供了实际步骤。

RL makes MLLMs see better than SFT

RL 使 MLLM 比 SFT 看得更好

Authors: Junha Song, Sangdoo Yun, Dongyoon Han, Jaegul Choo, Byeongho Heo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.16333
Pdf link: https://arxiv.org/pdf/2510.16333
Abstract A dominant assumption in Multimodal Language Model (MLLM) research is that its performance is largely inherited from the LLM backbone, given its immense parameter scale and remarkable capabilities. This has created a void in the understanding of the vision encoder, which determines how MLLMs perceive images. The recent shift in MLLM training paradigms, from Supervised Finetuning (SFT) to Reinforcement Learning (RL), magnifies this oversight, namely the significant lack of analysis on how such training reshapes the vision encoder as well as the MLLM. To address this, we first investigate the impact of training strategies on MLLMs, where RL shows a clear advantage over SFT in strongly vision-related VQA benchmarks. Motivated by this, we conduct a critical yet under-explored analysis of the vision encoder of MLLMs through diverse and in-depth experiments, ranging from ImageNet classification and segmentation to gradient visualization. Our results demonstrate that an MLLM's post-training strategy (i.e., SFT or RL) not only leads to distinct outcomes on MLLM downstream tasks, but also fundamentally reshapes the MLLM's underlying visual representations. Specifically, the key finding of our study is that RL produces stronger and more precisely localized visual representations than SFT, boosting the ability of the vision encoder for MLLMs. We then reframe our findings into a simple recipe for building strong vision encoders for MLLMs, Preference-Instructed Vision OpTimization (PIVOT). When integrated into MLLMs, a PIVOT-trained vision encoder outperforms even larger and more heavily-trained counterparts, despite requiring less than 1% of the computational cost of standard vision pretraining. This result opens an effective and efficient path for advancing the vision backbones of MLLMs. Project page available at this https URL
中文摘要 多模态语言模型（MLLM）研究中的一个主要假设是，鉴于其巨大的参数规模和卓越的功能，其性能在很大程度上继承自 LLM 主干网。这在对视觉编码器的理解中造成了空白，视觉编码器决定了 MLLM 如何感知图像。最近MLLM训练范式从监督精调（SFT）到强化学习（RL）的转变放大了这种疏忽，即严重缺乏对此类训练如何重塑视觉编码器和MLLM的分析。为了解决这个问题，我们首先研究了训练策略对 MLLM 的影响，其中 RL 在与视觉相关的 VQA 基准中显示出比 SFT 的明显优势。在此激励下，我们通过从 ImageNet 分类和分割到梯度可视化的多样化和深入实验，对 MLLM 的视觉编码器进行了批判性但尚未充分探索的分析。我们的结果表明，MLLM的训练后策略（即SFT或RL）不仅在MLLM下游任务上产生了不同的结果，而且从根本上重塑了MLLM的底层视觉表示。具体来说，我们研究的主要发现是，与 SFT 相比，RL 产生更强且精确定位的视觉表示，从而提高了 MLLM 视觉编码器的能力。然后，我们将我们的发现重新构建为一个简单的配方，用于为 MLLM 构建强大的视觉编码器，即偏好指示视觉优化（PIVOT）。当集成到 MLLM 中时，经过 PIVOT 训练的视觉编码器的性能优于更大、训练更严格的同类产品，尽管所需的计算成本不到标准视觉预训练的 1%。这一结果为推进 MLLM 的视觉骨干开辟了一条有效且高效的路径。项目页面可在此 https URL 上找到

Call-Center Staff Scheduling Considering Performance Evolution under Emotional Stress

考虑情绪压力下绩效演变的呼叫中心员工调度

Authors: Yujun Zheng, Xinya Chen, Xueqin Lu, Weiguo Sheng, Shengyong Chen
Subjects: Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/2510.16406
Pdf link: https://arxiv.org/pdf/2510.16406
Abstract Emotional stress often has a significant effect on the working performance of staff, but this effect is commonly neglected in existing staff scheduling methods. We study a call-center staff scheduling problem, which considers the evolution of work performance of staff under emotional stress. First, we present an emotional stress driven model that estimates the working performance of call-center employees based on not only skill levels but also emotional states. On the basis of the model, we formulate a combined short-term and long-term call-center staff scheduling problem aiming at maximizing the customer service level, which depends on the working performance of employees. We then propose a memetic optimization algorithm combining global mutation and neighborhood search assisted by deep reinforcement learning to efficiently solve this problem. Experimental results on real-world problem instances of bank call-center staff scheduling demonstrate the performance advantages of the proposed method over selected popular staff scheduling methods. By explicitly modeling and incorporating emotional stress, our method reflects a more realistic understanding and utilization of human behavior in staff scheduling.
中文摘要 情绪压力往往对员工的工作绩效有显著影响，但这种影响在现有的员工调度方法中通常被忽视。研究了一个呼叫中心员工调度问题，该问题考虑了情绪压力下员工工作绩效的演变。首先，我们提出了一个情绪压力驱动模型，该模型不仅根据技能水平而且根据情绪状态来估计呼叫中心员工的工作绩效。在此模型的基础上，我们制定了短期和长期相结合的呼叫中心员工调度问题，旨在最大限度地提高客户服务水平，这取决于员工的工作绩效。然后，我们提出了一种结合全局突变和邻域搜索的模因优化算法，并辅以深度强化学习，以有效地解决这个问题。在银行呼叫中心员工调度的实际问题实例上的实验结果表明，所提方法优于选定的常用员工调度方法。通过明确建模和纳入情绪压力，我们的方法反映了对员工调度中人类行为的更现实的理解和利用。

SSL4RL: Revisiting Self-supervised Learning as Intrinsic Reward for Visual-Language Reasoning

SSL4RL：重新审视自监督学习作为视觉语言推理的内在奖励

Authors: Xiaojun Guo, Runyu Zhou, Yifei Wang, Qi Zhang, Chenheng Zhang, Stefanie Jegelka, Xiaohan Wang, Jiajun Chai, Guojun Yin, Wei Lin, Yisen Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.16416
Pdf link: https://arxiv.org/pdf/2510.16416
Abstract Vision-language models (VLMs) have shown remarkable abilities by integrating large language models with visual inputs. However, they often fail to utilize visual evidence adequately, either depending on linguistic priors in vision-centric tasks or resorting to textual shortcuts during reasoning. Although reinforcement learning (RL) can align models with desired behaviors, its application to VLMs has been hindered by the lack of scalable and reliable reward mechanisms. To overcome this challenge, we propose SSL4RL, a novel framework that leverages self-supervised learning (SSL) tasks as a source of verifiable rewards for RL-based fine-tuning. Our approach reformulates SSL objectives (such as predicting image rotation or reconstructing masked patches) into dense, automatic reward signals, eliminating the need for human preference data or unreliable AI evaluators. Experiments show that SSL4RL substantially improves performance on both vision-centric and vision-language reasoning benchmarks. Furthermore, through systematic ablations, we identify key factors (such as task difficulty, model scale, and semantic alignment with the target domain) that influence the effectiveness of SSL4RL tasks, offering new design principles for future work. We also demonstrate the framework's generality by applying it to graph learning, where it yields significant gains. SSL4RL establishes a versatile and effective paradigm for aligning multimodal models using verifiable, self-supervised objectives.
中文摘要 视觉语言模型（VLM）通过将大型语言模型与视觉输入相结合，展现出了卓越的能力。然而，他们往往无法充分利用视觉证据，要么依赖于以视觉为中心的任务中的语言先验，要么在推理过程中诉诸文本捷径。尽管强化学习（RL）可以使模型与期望的行为保持一致，但由于缺乏可扩展和可靠的奖励机制，其在VLM中的应用受到了阻碍。为了克服这一挑战，我们提出了 SSL4RL，这是一种利用自监督学习（SSL）任务作为基于 RL 的微调的可验证奖励来源的新颖框架。我们的方法将 SSL 目标（例如预测图像旋转或重建掩蔽补丁）重新表述为密集的自动奖励信号，从而消除了对人类偏好数据或不可靠的 AI 评估器的需求。实验表明，SSL4RL 在以视觉为中心和视觉语言推理基准测试上都显着提高了性能。此外，通过系统消融，我们确定了影响 SSL4RL 任务有效性的关键因素，例如任务难度、模型规模和与目标域的语义一致性，为未来的工作提供了新的设计原则。我们还通过将其应用于图学习来展示该框架的通用性，从而产生显着的收益。SSL4RL 建立了一种通用且有效的范例，用于使用可验证的自监督目标来调整多模态模型。
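
As an illustration of how a self-supervised task can yield a verifiable reward, the sketch below builds a rotation-prediction task and scores a model's textual answer against the rotation that was actually applied. It is a simplified stand-in, not the SSL4RL implementation: the image is a plain 2-D list, the reward is binary rather than dense, and the names `make_rotation_task` and `rotation_reward` are hypothetical.

```python
import random
import re

ROTATIONS = (0, 90, 180, 270)


def rotate90(grid):
    """Rotate a 2-D list 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def make_rotation_task(grid, rng):
    """Apply a random rotation; the angle is the label, known by construction."""
    angle = rng.choice(ROTATIONS)
    rotated = grid
    for _ in range(angle // 90):
        rotated = rotate90(rotated)
    return rotated, angle


def rotation_reward(answer: str, true_angle: int) -> float:
    """1.0 if the last integer in the model's answer equals the applied angle."""
    numbers = re.findall(r"\d+", answer)
    if not numbers:
        return 0.0
    return 1.0 if int(numbers[-1]) == true_angle else 0.0
```

Because the label is produced by the transformation itself, the reward can be checked automatically, with no human preference data and no learned judge in the loop.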

RAVEN: Robust Advertisement Video Violation Temporal Grounding via Reinforcement Reasoning

RAVEN：通过强化推理进行鲁棒广告视频违规时间基础

Authors: Deyi Ji, Yuekui Yang, Haiyang Wu, Shaoping Ma, Tianrun Chen, Lanyun Zhu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.16455
Pdf link: https://arxiv.org/pdf/2510.16455
Abstract Advertisement (Ad) video violation detection is critical for ensuring platform compliance, but existing methods struggle with precise temporal grounding, noisy annotations, and limited generalization. We propose RAVEN, a novel framework that integrates curriculum reinforcement learning with multimodal large language models (MLLMs) to enhance reasoning and cognitive capabilities for violation detection. RAVEN employs a progressive training strategy, combining precisely and coarsely annotated data, and leverages Group Relative Policy Optimization (GRPO) to develop emergent reasoning abilities without explicit reasoning annotations. Multiple sophisticated hierarchical reward mechanisms ensure precise temporal grounding and consistent category prediction. Experiments on industrial datasets and public benchmarks show that RAVEN achieves superior performance in violation category accuracy and temporal interval localization. We also design a pipeline to deploy RAVEN on online Ad services, and online A/B testing further validates its practical applicability, with significant improvements in precision and recall. RAVEN also demonstrates strong generalization, mitigating the catastrophic forgetting issue associated with supervised fine-tuning.
中文摘要 广告（Ad）视频违规检测对于确保平台合规性至关重要，但现有方法在精确的时间基础、嘈杂的注释和有限的泛化方面存在困难。我们提出了RAVEN，这是一个将课程强化学习与多模态大语言模型（MLLM）相结合的新框架，以增强违规检测的推理和认知能力。RAVEN 采用渐进式训练策略，结合精确和粗略注释的数据，并利用组相对策略优化（GRPO）来开发紧急推理能力，而无需显式推理注释。多重分层精密奖励机制，确保精确的时间基础和一致的类别预测。在工业数据集和公共基准测试上的实验表明，RAVEN在违规类别精度和时间区间定位方面取得了优越的性能。我们还设计了一个管道，将 RAVEN 部署在在线广告服务上，在线 A/B 测试进一步验证了其实际适用性，在精度和召回率方面有了显着提高。RAVEN 还表现出很强的泛化能力，缓解了与监督微调相关的灾难性遗忘问题。
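
The abstract does not spell out the form of RAVEN's rewards. As a hedged illustration of what a reward for temporal grounding plus category prediction could look like, the sketch below combines the intersection-over-union (IoU) of the predicted and annotated time intervals with a category-match term. The function names and the weight `w_time` are assumptions made for this example.

```python
def temporal_iou(pred, target):
    """IoU of two time intervals given as (start, end) in seconds."""
    p0, p1 = pred
    t0, t1 = target
    inter = max(0.0, min(p1, t1) - max(p0, t0))
    union = (p1 - p0) + (t1 - t0) - inter
    return inter / union if union > 0 else 0.0


def grounding_reward(pred_interval, pred_category,
                     true_interval, true_category, w_time=0.5):
    """Weighted sum of interval overlap and category correctness, in [0, 1]."""
    category_score = 1.0 if pred_category == true_category else 0.0
    overlap = temporal_iou(pred_interval, true_interval)
    return w_time * overlap + (1.0 - w_time) * category_score
```

A prediction that names the right violation category but misses the interval entirely would score `1 - w_time` here, so the weight controls how strongly localization is enforced.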

Buzz, Choose, Forget: A Meta-Bandit Framework for Bee-Like Decision Making

嗡嗡声、选择、忘记：用于蜜蜂式决策的元强盗框架

Authors: Emmanuelle Claeys, Elena Kerjean, Jean-Michel Loubes
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2510.16462
Pdf link: https://arxiv.org/pdf/2510.16462
Abstract We introduce a sequential reinforcement learning framework for imitation learning designed to model heterogeneous cognitive strategies in pollinators. Focusing on honeybees, our approach leverages trajectory similarity to capture and forecast behavior across individuals that rely on distinct strategies: some exploiting numerical cues, others drawing on memory, or being influenced by environmental factors such as weather. Through empirical evaluation, we show that state-of-the-art imitation learning methods often fail in this setting: when expert policies shift across memory windows or deviate from optimality, these models overlook both fast and slow learning behaviors and cannot faithfully reproduce key decision patterns. Moreover, they offer limited interpretability, hindering biological insight. Our contribution addresses these challenges by (i) introducing a model that minimizes predictive loss while identifying the effective memory horizon most consistent with behavioral data, (ii) ensuring full interpretability to enable biologists to analyze underlying decision-making strategies, and (iii) providing a mathematical framework linking bee policy search with bandit formulations under varying exploration-exploitation dynamics. We also release a novel dataset of 80 tracked bees observed under diverse weather conditions. This benchmark facilitates research on pollinator cognition and supports ecological governance by improving simulations of insect behavior in agroecosystems. Our findings shed new light on the learning strategies and memory interplay shaping pollinator decision-making.
中文摘要 我们引入了用于模仿学习的顺序强化学习框架，旨在对传粉媒介的异构认知策略进行建模。我们的方法以蜜蜂为重点，利用轨迹相似性来捕捉和预测依赖不同策略的个体的行为：一些利用数字线索，另一些利用记忆，或受到天气等环境因素的影响。通过实证评估，我们表明，最先进的模仿学习方法在这种情况下往往会失败：当专家策略在记忆窗口之间转移或偏离最优性时，这些模型会忽略快速和慢速学习行为，并且无法忠实地再现关键决策模式。此外，它们提供的可解释性有限，阻碍了生物学洞察力。我们的贡献通过以下方式应对这些挑战：（i）引入一个模型，该模型可以最大限度地减少预测损失，同时确定与行为数据最一致的有效记忆范围，以及（ii）确保完全的可解释性，使生物学家能够分析潜在的决策策略，最后（iii）提供一个数学框架，将蜜蜂政策搜索与不同探索-开发动态下的强盗公式联系起来，并发布了一个包含在不同天气条件下观察到的 80 只追踪蜜蜂的新数据集。该基准促进了传粉媒介认知的研究，并通过改进农业生态系统中昆虫行为的模拟来支持生态治理。我们的研究结果为塑造传粉媒介决策的学习策略和记忆相互作用提供了新的线索。

NP-Engine: Empowering Optimization Reasoning in Large Language Models with Verifiable Synthetic NP Problems

NP-Engine：通过可验证的合成NP问题赋能大型语言模型的优化推理

Authors: Xiaozhe Li, Xinyu Fang, Shengyuan Ding, Linyang Li, Haodong Duan, Qingwen Liu, Kai Chen
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.16476
Pdf link: https://arxiv.org/pdf/2510.16476
Abstract Large Language Models (LLMs) have shown strong reasoning capabilities, with models like OpenAI's O-series and DeepSeek R1 excelling at tasks such as mathematics, coding, logic, and puzzles through Reinforcement Learning with Verifiable Rewards (RLVR). However, their ability to solve more complex optimization problems - particularly NP-hard tasks - remains underexplored. To bridge this gap, we propose NP-ENGINE, the first comprehensive framework for training and evaluating LLMs on NP-hard problems. NP-ENGINE covers 10 tasks across five domains, each equipped with (i) a controllable instance generator, (ii) a rule-based verifier, and (iii) a heuristic solver that provides approximately optimal solutions as ground truth. This generator-verifier-heuristic pipeline enables scalable and verifiable RLVR training under hierarchical difficulties. We also introduce NP-BENCH, a benchmark derived from NP-ENGINE-DATA, specifically designed to evaluate LLMs' ability to tackle NP-hard-level reasoning problems, focusing not only on feasibility but also on solution quality. Additionally, we present QWEN2.5-7B-NP, a model trained via zero-RLVR with curriculum learning on Qwen2.5-7B-Instruct, which significantly outperforms GPT-4o on NP-BENCH and achieves SOTA performance among models of the same size. Beyond in-domain tasks, we demonstrate that RLVR training on NP-ENGINE-DATA enables strong out-of-domain (OOD) generalization to reasoning tasks (logic, puzzles, math, and knowledge), as well as non-reasoning tasks such as instruction following. We also observe a scaling trend: increasing task diversity improves OOD generalization. These findings suggest that task-rich RLVR training is a promising direction for advancing LLMs' reasoning ability, revealing new insights into the scaling laws of RLVR.
中文摘要 大型语言模型（LLM）已经表现出了强大的推理能力，OpenAI的O系列和DeepSeek R1等模型通过具有可验证奖励的强化学习（RLVR）在数学、编码、逻辑和谜题等任务中表现出色。然而，它们解决更复杂的优化问题（尤其是 NP 困难任务）的能力仍未得到充分探索。为了弥补这一差距，我们提出了 NP-ENGINE，这是第一个用于训练和评估 NP 困难问题的 LLM 的综合框架。NP-ENGINE 涵盖了五个领域的 10 个任务，每个任务都配备了（i）一个可控实例生成器，（ii）一个基于规则的验证器，以及（iii）一个启发式求解器，可提供近似的最优解作为地面实况。这种生成器-验证者-启发式管道可以在分层困难下实现可扩展和可验证的 RLVR 训练。我们还介绍了 NP-BENCH，这是一个源自 NP-ENGINE-DATA 的基准测试，专门用于评估 LLM 解决 NP 硬水平推理问题的能力，不仅关注可行性，还关注解决方案质量。此外，我们还提出了 QWEN2.5-7B-NP，这是一个通过零 RLVR 训练的模型，在 Qwen2.5-7B-Instruct 上进行课程学习，它在 NP-BENCH 上的性能明显优于 GPT-4o，并在相同模型大小下实现了 SOTA 性能。除了域内任务之外，我们还证明了 NP-ENGINE-DATA 上的 RLVR 训练能够对推理任务（逻辑、谜题、数学和知识）以及非推理任务（如指令遵循）进行强大的域外（OOD）泛化。我们还观察到一个扩展趋势：增加任务多样性可以提高 OOD 泛化。这些发现表明，任务丰富的RLVR训练是提升LLM推理能力的一个有前途的方向，揭示了对RLVR缩放规律的新见解。
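
The generator-verifier-heuristic pipeline described above can be sketched on a single toy problem. The abstract does not list NP-ENGINE's ten tasks, so 0/1 knapsack is chosen here purely for illustration, and every function name, the greedy reference solver, and the reward shape are assumptions rather than the authors' code.

```python
import random


def generate_instance(n_items, rng):
    """Controllable generator: n_items sets the difficulty of the instance."""
    weights = [rng.randint(1, 20) for _ in range(n_items)]
    values = [rng.randint(1, 50) for _ in range(n_items)]
    capacity = max(1, sum(weights) // 2)
    return weights, values, capacity


def verify(instance, chosen):
    """Rule-based verifier: total value if the selection is feasible, else None."""
    weights, values, capacity = instance
    if len(set(chosen)) != len(chosen):
        return None  # an item was picked twice
    if any(i < 0 or i >= len(weights) for i in chosen):
        return None  # index out of range
    if sum(weights[i] for i in chosen) > capacity:
        return None  # over capacity
    return sum(values[i] for i in chosen)


def greedy_reference(instance):
    """Heuristic solver: take items by value density while they still fit."""
    weights, values, capacity = instance
    order = sorted(range(len(weights)),
                   key=lambda i: values[i] / weights[i], reverse=True)
    chosen, load = [], 0
    for i in order:
        if load + weights[i] <= capacity:
            chosen.append(i)
            load += weights[i]
    return chosen


def reward(instance, chosen):
    """0 for infeasible answers, else quality relative to the reference, capped at 1."""
    value = verify(instance, chosen)
    if value is None:
        return 0.0
    reference = verify(instance, greedy_reference(instance))
    return min(1.0, value / reference) if reference else 1.0
```

Scoring against a heuristic reference rather than a proven optimum is what lets the reward reflect solution quality and not just feasibility, while staying cheap enough to compute for every rollout.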

LANPO: Bootstrapping Language and Numerical Feedback for Reinforcement Learning in LLMs

LANPO：用于法学硕士强化学习的引导语言和数值反馈

Authors: Ang Li, Yifei Wang, Zhihang Yuan, Stefanie Jegelka, Yisen Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.16552
Pdf link: https://arxiv.org/pdf/2510.16552
Abstract Reinforcement learning in large language models (LLMs) often relies on scalar rewards, a practice that discards valuable textual rationale buried in the rollouts, forcing the model to explore de novo with each attempt and hindering sample efficiency. While LLMs can uniquely learn from language feedback provided in-context, naively integrating on-line experiences into RL training presents a paradox: feedback from the same problem risks information leakage and memorization, while feedback from different problems often leads to behavior collapse due to irrelevant context. To resolve this tension, we propose Language-And-Numerical Policy Optimization (LANPO), a framework that cleanly separates the roles of feedback: language guides exploration, while numerical rewards drive optimization. LANPO builds a dynamic experience pool from past trials and introduces two principles to ensure feedback is effective: Reward-Agnostic Reflection for safe intra-sample self-correction and Relevant Abstraction to distill generalizable lessons from inter-sample experiences. Across mathematical reasoning benchmarks, LANPO enables 7B and 14B models to significantly outperform strong baselines trained with GRPO in test accuracy. Our work provides a robust method for integrating historical experiences into the LLM RL loop, creating more effective and data-efficient learning agents.
中文摘要 大型语言模型（LLM）中的强化学习通常依赖于标量奖励，这种做法会丢弃隐藏在 rollout 中的宝贵文本原理，迫使模型在每次尝试时从头（de novo）探索，并降低样本效率。虽然 LLM 可以独特地从上下文中提供的语言反馈中学习，但天真地将在线体验融入 RL 训练中会带来一个悖论：来自同一问题的反馈存在信息泄露和记忆的风险，而来自不同问题的反馈往往会导致由于不相关的上下文而导致行为崩溃。为了解决这种紧张关系，我们提出了语言和数字策略优化（LANPO），这是一个干净地分离反馈角色的框架：语言指导探索，而数字奖励驱动优化。LANPO 从过去的试验中构建了一个动态经验池，并引入了两个原则来确保反馈有效：Reward-Agnostic Reflection 用于安全的样本内自我纠正，Relevant Abstraction 用于从样本间经验中提炼出可推广的经验教训。在数学推理基准测试中，LANPO 使 7B 和 14B 模型在测试准确性方面显着优于使用 GRPO 训练的强基线。我们的工作提供了一种强大的方法，将历史经验整合到 LLM RL 循环中，创建更有效、数据高效的学习代理。

Urban-R1: Reinforced MLLMs Mitigate Geospatial Biases for Urban General Intelligence

Urban-R1：强化的 MLLM 减轻城市通用智能的地理空间偏差

Authors: Qiongyan Wang, Xingchen Zou, Yutian Jiang, Haomin Wen, Jiaheng Wei, Qingsong Wen, Yuxuan Liang
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.16555
Pdf link: https://arxiv.org/pdf/2510.16555
Abstract Rapid urbanization intensifies the demand for Urban General Intelligence (UGI), referring to AI systems that can understand and reason about complex urban environments. Recent studies have built urban foundation models using supervised fine-tuning (SFT) of LLMs and MLLMs, yet these models exhibit persistent geospatial bias, producing regionally skewed predictions and limited generalization. To this end, we propose Urban-R1, a reinforcement learning-based post-training framework that aligns MLLMs with the objectives of UGI. Urban-R1 adopts Group Relative Policy Optimization (GRPO) to optimize reasoning across geographic groups and employs urban region profiling as a proxy task to provide measurable rewards from multimodal urban data. Extensive experiments across diverse regions and tasks show that Urban-R1 effectively mitigates geo-bias and improves cross-region generalization, outperforming both SFT-trained and closed-source models. Our results highlight reinforcement learning alignment as a promising pathway toward equitable and trustworthy urban intelligence.
中文摘要 快速城市化加剧了对城市通用智能（UGI）的需求，UGI是指能够理解和推理复杂城市环境的人工智能系统。最近的研究使用 LLM 和 MLLM 的监督微调（SFT）构建了城市基础模型，但这些模型表现出持续的地理空间偏差，产生区域偏斜的预测和有限的泛化。为此，我们提出了 Urban-R1，这是一个基于强化学习的后训练框架，使 MLLM 与 UGI 的目标保持一致。Urban-R1 采用群体相对策略优化（GRPO）来优化跨地理群体的推理，并采用城市区域剖析作为代理任务，从多模态城市数据中提供可衡量的奖励。跨不同区域和任务的广泛实验表明，Urban-R1 有效地减轻了地理偏差并提高了跨区域泛化，优于 SFT 训练模型和闭源模型。我们的研究结果强调，强化学习对齐是通往公平和值得信赖的城市智能的一条有前途的途径。
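
Several entries in this digest build on Group Relative Policy Optimization (GRPO). Its core step, as commonly described, is to score each sampled response relative to the other responses drawn for the same prompt instead of using a learned value function. The sketch below shows only that normalization step; it is not taken from Urban-R1, and the use of the population standard deviation and the `eps` constant are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For example, `group_relative_advantages([1.0, 0.0, 0.0, 1.0])` gives values very close to `[1, -1, -1, 1]`: responses that beat the group average are reinforced and the rest are discouraged. When every response earns the same reward, all advantages are zero and the group contributes no gradient signal.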

Count Counts: Motivating Exploration in LLM Reasoning with Count-based Intrinsic Rewards

计数计数：通过基于计数的内在奖励激发 LLM 推理的探索

Authors: Xuan Zhang, Ruixiao Li, Zhijian Zhou, Long Li, Yulei Qin, Ke Li, Xing Sun, Xiaoyu Tan, Chao Qu, Yuan Qi
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.16614
Pdf link: https://arxiv.org/pdf/2510.16614
Abstract Reinforcement Learning (RL) has become a compelling way to strengthen the multi-step reasoning ability of Large Language Models (LLMs). However, prevalent RL paradigms still lean on sparse outcome-based rewards and limited exploration, which often drive LLMs toward repetitive and suboptimal reasoning patterns. In this paper, we study the central question of how to design exploration for LLM reasoning and introduce MERCI (Motivating Exploration in LLM Reasoning with Count-based Intrinsic Rewards), a novel RL algorithm that augments policy optimization with a principled intrinsic reward. Building on the idea of count-based exploration, MERCI leverages a lightweight Coin Flipping Network (CFN) to estimate the pseudo-count and, from it, the epistemic uncertainty over reasoning trajectories, and converts them into an intrinsic reward that values novelty while preserving the learning signal from task rewards. We integrate MERCI into advanced RL frameworks such as Group Relative Policy Optimization (GRPO). Experiments on complex reasoning benchmarks demonstrate that MERCI encourages richer and more varied chains of thought, significantly improves performance over strong baselines, and helps the policy escape local routines to discover better solutions. This indicates that our targeted intrinsic motivation can make exploration reliable for language model reasoning.
中文摘要 强化学习（RL）已成为增强大型语言模型（LLM）多步推理能力的一种引人注目的方法。然而，流行的 RL 范式仍然依赖于稀疏的基于结果的奖励和有限的探索，这往往会促使 LLM 走向重复和次优的推理模式。在本文中，我们研究了如何设计 LLM 推理探索的核心问题，并引入了 MERCI（Motivating Exploration in LLM Reasoning with Count-based Intrinsic Rewards），这是一种新型的 RL 算法，它通过有原则的内在奖励来增强策略优化。基于基于计数的探索理念，MERCI 利用轻量级抛币网络（CFN）来估计伪计数和推理轨迹的进一步认识不确定性，并将它们转换为重视新颖性的内在奖励，同时保留任务奖励的学习信号。我们将 MERCI 集成到一些高级 RL 框架中，例如组相对策略优化（GRPO）。复杂推理基准的实验表明，MERCI 鼓励更丰富、更多样化的思维链，显着提高强基线的绩效，并帮助政策摆脱局部常规以发现更好的解决方案。这表明我们的有针对性的内在动机可以使语言模型推理的探索变得可靠。
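
The idea of count-based exploration that MERCI builds on can be shown with the classic tabular bonus, which shrinks as a state is visited more often. The sketch below uses an exact visit counter in place of the paper's Coin Flipping Network, which estimates pseudo-counts when exact counting is infeasible; the class name, the key used to identify a reasoning trajectory, and the coefficient `beta` are all hypothetical.

```python
import math
from collections import Counter


class CountBonus:
    """Tabular stand-in for a pseudo-count estimator."""

    def __init__(self, beta=0.1):
        self.beta = beta          # scale of the exploration bonus
        self.counts = Counter()   # visits per trajectory key

    def bonus(self, key) -> float:
        """Record a visit and return beta / sqrt(visit count)."""
        self.counts[key] += 1
        return self.beta / math.sqrt(self.counts[key])


def shaped_reward(task_reward, key, counter):
    """Task reward plus a novelty bonus that decays with repeated visits."""
    return task_reward + counter.bonus(key)
```

Because the bonus decays toward zero, the task reward dominates once a reasoning pattern has been tried often, which matches the abstract's goal of valuing novelty while preserving the task signal.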

Beyond Pipelines: A Survey of the Paradigm Shift toward Model-Native Agentic AI

超越管道：向模型原生代理人工智能的范式转变调查

Authors: Jitao Sang, Jinlin Xiao, Jiarun Han, Jilin Chen, Xiaoyi Chen, Shuyu Wei, Yongjie Sun, Yuhang Wang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.16720
Pdf link: https://arxiv.org/pdf/2510.16720
Abstract The rapid evolution of agentic AI marks a new phase in artificial intelligence, where Large Language Models (LLMs) no longer merely respond but act, reason, and adapt. This survey traces the paradigm shift in building agentic AI: from Pipeline-based systems, where planning, tool use, and memory are orchestrated by external logic, to the emerging Model-native paradigm, where these capabilities are internalized within the model's parameters. We first position Reinforcement Learning (RL) as the algorithmic engine enabling this paradigm shift. By reframing learning from imitating static data to outcome-driven exploration, RL underpins a unified solution of LLM + RL + Task across language, vision and embodied domains. Building on this, the survey systematically reviews how each capability -- Planning, Tool use, and Memory -- has evolved from externally scripted modules to end-to-end learned behaviors. Furthermore, it examines how this paradigm shift has reshaped major agent applications, specifically the Deep Research agent emphasizing long-horizon reasoning and the GUI agent emphasizing embodied interaction. We conclude by discussing the continued internalization of agentic capabilities like Multi-agent collaboration and Reflection, alongside the evolving roles of the system and model layers in future agentic AI. Together, these developments outline a coherent trajectory toward model-native agentic AI as an integrated learning and interaction framework, marking the transition from constructing systems that apply intelligence to developing models that grow intelligence through experience.
中文摘要 代理人工智能的快速发展标志着人工智能进入了一个新阶段，大型语言模型（LLM）不再只是响应，而是行动、推理和适应。本调查追溯了构建代理 AI 的范式转变：从基于管道的系统（其中规划、工具使用和内存由外部逻辑编排）到新兴的模型原生范式（其中这些功能被内化在模型的参数中）。我们首先将强化学习（RL）定位为实现这种范式转变的算法引擎。通过将学习从模仿静态数据重新定义为结果驱动的探索，RL 支持跨语言、视觉和具身领域的 LLM + RL + 任务的统一解决方案。在此基础上，该调查系统地回顾了每项能力——计划、工具使用和记忆——是如何从外部脚本模块演变为端到端学习行为的。此外，它还研究了这种范式转变如何重塑了主要代理应用程序，特别是强调长期推理的深度研究代理和强调具身交互的 GUI 代理。最后，我们讨论了多代理协作和反思等代理功能的持续内化，以及系统和模型层在未来代理人工智能中不断发展的角色。这些发展共同勾勒出一条连贯的轨迹，将模型原生代理人工智能作为一个集成的学习和交互框架，标志着从构建应用智能的系统到开发通过经验增长智能的模型的转变。

A Comprehensive Survey on Reinforcement Learning-based Agentic Search: Foundations, Roles, Optimizations, Evaluations, and Applications

基于强化学习的智能搜索综合综述：基础、角色、优化、评估和应用

Authors: Minhua Lin, Zongyu Wu, Zhichao Xu, Hui Liu, Xianfeng Tang, Qi He, Charu Aggarwal, Hui Liu, Xiang Zhang, Suhang Wang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.16724
Pdf link: https://arxiv.org/pdf/2510.16724
Abstract The advent of large language models (LLMs) has transformed information access and reasoning through open-ended natural language interaction. However, LLMs remain limited by static knowledge, factual hallucinations, and the inability to retrieve real-time or domain-specific information. Retrieval-Augmented Generation (RAG) mitigates these issues by grounding model outputs in external evidence, but traditional RAG pipelines are often single-turn and heuristic, lacking adaptive control over retrieval and reasoning. Recent advances in agentic search address these limitations by enabling LLMs to plan, retrieve, and reflect through multi-step interaction with search environments. Within this paradigm, reinforcement learning (RL) offers a powerful mechanism for adaptive and self-improving search behavior. This survey provides the first comprehensive overview of RL-based agentic search, organizing the emerging field along three complementary dimensions: (i) What RL is for (functional roles), (ii) How RL is used (optimization strategies), and (iii) Where RL is applied (scope of optimization). We summarize representative methods, evaluation protocols, and applications, and discuss open challenges and future directions toward building reliable and scalable RL-driven agentic search systems. We hope this survey will inspire future research on the integration of RL and agentic search. Our repository is available at this https URL.
中文摘要 大型语言模型（LLM）的出现通过开放式自然语言交互改变了信息访问和推理。然而，LLM 仍然受到静态知识、事实幻觉以及无法检索实时或特定领域信息的限制。检索增强生成（RAG）通过将模型输出建立在外部证据中来缓解这些问题，但传统的 RAG 管道通常是单轮和启发式的，缺乏对检索和推理的自适应控制。代理搜索的最新进展使 LLM 能够通过与搜索环境的多步骤交互来规划、检索和反思，从而解决了这些限制。在这种范式中，强化学习（RL）为自适应和自我改进的搜索行为提供了强大的机制。本调查首次全面概述了基于RL的代理搜索，沿着三个互补的维度组织了新兴领域：（i）RL的用途（功能角色），（ii）如何使用RL（优化策略），以及（iii）RL的应用位置（优化范围）。我们总结了具有代表性的方法、评估协议和应用，并讨论了构建可靠且可扩展的 RL 驱动的代理搜索系统的开放挑战和未来方向。我们希望这项调查能够激发未来关于RL和代理搜索整合的研究。我们的存储库可在此 https URL 中找到。

A Control-Theoretic Approach to Dynamic Payment Routing for Success Rate Optimization

用于成功率优化的动态支付路由的控制论方法

Authors: Aniket Agrawal, Harsharanga Patil
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.16735
Pdf link: https://arxiv.org/pdf/2510.16735
Abstract This paper introduces a control-theoretic framework for dynamic payment routing, implemented within JUSPAY's Payment Orchestrator to maximize transaction success rate. The routing system is modeled as a closed-loop feedback controller that continuously senses gateway performance, computes corrective actions, and dynamically routes transactions across gateways to ensure operational resilience. The system leverages concepts from control theory, reinforcement learning, and multi-armed bandit optimization to achieve both short-term responsiveness and long-term stability. Rather than relying on explicit PID regulation, the framework applies generalized feedback-based adaptation, ensuring that corrective actions remain proportional to observed performance deviations and that the computed gateway score gradually converges toward the success rate. This hybrid approach unifies control theory and adaptive decision systems, enabling self-regulating transaction routing that dampens instability and improves reliability. Live production results show an improvement of up to 1.15% in success rate over traditional rule-based routing, demonstrating the effectiveness of feedback-based control in payment systems.
中文摘要 本文介绍了动态支付路由的控制论框架，该框架在 JUSPAY 的支付编排器中实现，以最大限度地提高交易成功率。路由系统被建模为闭环反馈控制器，持续感知网关性能，计算纠正措施，并跨网关动态路由事务，以确保运营弹性。该系统利用控制理论、强化学习和多臂强盗优化的概念来实现短期响应和长期稳定性。该框架不依赖于显式的 PID 调节，而是应用基于反馈的广义适应，确保纠正措施与观察到的性能偏差保持成正比，并且计算的网关分数逐渐趋同于成功率。这种混合方法统一了控制理论和自适应决策系统，实现了自我调节的交易路由，从而抑制了不稳定性并提高了可靠性。实时生产结果显示，与传统的基于规则的路由相比，成功率提高了 1.15%，证明了基于反馈的控制在支付系统中的有效性。
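
One simple way to obtain a score whose correction is proportional to the observed deviation, and which drifts toward the gateway's success rate, is an exponentially weighted update. The sketch below is an assumption-laden illustration of that idea, not JUSPAY's production logic: the function names, the `gain` and `explore` parameters, and the epsilon-greedy selection rule are all hypothetical.

```python
import random


def update_score(score, success, gain=0.05):
    """Move the score toward the latest outcome by a step proportional to the error."""
    outcome = 1.0 if success else 0.0
    return score + gain * (outcome - score)


def pick_gateway(scores, rng, explore=0.1):
    """Mostly route to the best-scoring gateway, occasionally probe another."""
    if rng.random() < explore:
        return rng.choice(sorted(scores))
    return max(scores, key=scores.get)
```

With a fixed `gain`, the expected score converges to the gateway's true success probability, and a larger gain trades stability for faster reaction to outages. The occasional exploratory pick keeps scores fresh for gateways that are not currently preferred, which is the bandit side of the hybrid described in the abstract.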

VAGEN: Reinforcing World Model Reasoning for Multi-Turn VLM Agents

VAGEN：强化多轮VLM代理的世界模型推理

Authors: Kangrui Wang, Pingyue Zhang, Zihan Wang, Yaning Gao, Linjie Li, Qineng Wang, Hanyang Chen, Chi Wan, Yiping Lu, Zhengyuan Yang, Lijuan Wang, Ranjay Krishna, Jiajun Wu, Li Fei-Fei, Yejin Choi, Manling Li
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.16907
Pdf link: https://arxiv.org/pdf/2510.16907
Abstract A key challenge in training Vision-Language Model (VLM) agents, compared to Language Model (LLM) agents, lies in the shift from textual states to complex visual observations. This transition introduces partial observability and demands robust world modeling. We ask: Can VLM agents construct internal world models through explicit visual state reasoning? To address this question, we architecturally enforce and reward the agent's reasoning process via reinforcement learning (RL), formulating it as a Partially Observable Markov Decision Process (POMDP). We find that decomposing the agent's reasoning into State Estimation ("what is the current state?") and Transition Modeling ("what comes next?") is critical for success, as demonstrated through five reasoning strategies. Our investigation into how agents represent internal beliefs reveals that the optimal representation is task-dependent: Natural Language excels at capturing semantic relationships in general tasks, while Structured formats are indispensable for precise manipulation and control. Building on these insights, we design a World Modeling Reward that provides dense, turn-level supervision for accurate state prediction, and introduce Bi-Level General Advantage Estimation (Bi-Level GAE) for turn-aware credit assignment. Through this form of visual state reasoning, a 3B-parameter model achieves a score of 0.82 across five diverse agent benchmarks, representing a 3× improvement over its untrained counterpart (0.21) and outperforming proprietary reasoning models such as GPT-5 (0.75), Gemini 2.5 Pro (0.67) and Claude 4.5 (0.62). All experiments are conducted within our VAGEN framework, a scalable system for training and analyzing multi-turn VLM agents in diverse visual environments. Code and data are publicly available at this https URL.
中文摘要 与语言模型（LLM）代理相比，训练视觉语言模型（VLM）代理的一个关键挑战在于从文本状态到复杂视觉观察的转变。这种转变引入了部分可观测性，并需要强大的世界建模。我们问：VLM 智能体可以通过显式视觉状态推理来构建内部世界模型吗？为了解决这个问题，我们通过强化学习（RL）在架构上强制执行和奖励代理的推理过程，将其表述为部分可观察的马尔可夫决策过程（POMDP）。我们发现，将智能体的推理分解为状态估计（“当前状态是什么？”）和过渡建模（“接下来会发生什么？”）对于成功至关重要，正如五种推理策略所证明的那样。我们对智能体如何表示内部信念的调查表明，最佳表示是依赖于任务的：自然语言擅长捕捉一般任务中的语义关系，而结构化格式对于精确操作和控制是必不可少的。基于这些见解，我们设计了一个世界建模奖励，该奖励提供密集的回合级监督以实现准确的状态预测，并引入双级通用优势估计（Bi-Level GAE）用于回合感知信用分配。通过这种形式的视觉状态推理，3B 参数模型在五个不同的智能体基准测试中获得了 0.82 分，比未经训练的模型（0.21）提高了 3 倍，并且优于 GPT-5 （0.75）、Gemini 2.5 Pro （0.67）和 Claude 4.5 （0.62）等专有推理模型。所有实验都在我们的 VAGEN 框架内进行，这是一个可扩展的系统，用于在不同的视觉环境中训练和分析多轮 VLM 代理。代码和数据在此 https URL 上公开可用。
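
Sketch (background only): the abstract does not spell out how Bi-Level GAE works, so the snippet below shows only the standard, single-level generalized advantage estimation that such a method would presumably build on. The function name and default hyperparameters are assumptions for illustration.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Standard single-level GAE.

    `values` must hold len(rewards) + 1 entries; the last one is the
    bootstrap value of the state after the final step.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error at step t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of current and future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


if __name__ == "__main__":
    print(gae([0.0, 0.0, 1.0], [0.1, 0.2, 0.5, 0.0]))
```

Setting `lam=0` reduces each advantage to its one-step TD error, while `lam=1` with `gamma=1` gives the full return minus the value baseline.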

Towards Context-aware Reasoning-enhanced Generative Searching in E-commerce

在电子商务中迈向上下文感知推理增强的生成式搜索

Authors: Zhiding Liu, Ben Chen, Mingyue Cheng, Enchong Chen, Li Li, Chenyi Lei, Wenwu Ou, Han Li, Kun Gai
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2510.16925
Pdf link: https://arxiv.org/pdf/2510.16925
Abstract Search-based recommendation is one of the most critical application scenarios in e-commerce platforms. Users' complex search contexts, such as spatiotemporal factors, historical interactions, and the current query's information, constitute an essential part of their decision-making, reflecting implicit preferences that complement explicit query terms. Modeling such rich contextual signals and their intricate associations with candidate items remains a key challenge. Although numerous efforts have been devoted to building more effective search methods, existing approaches still show limitations in integrating contextual information, which hinders their ability to fully capture user intent. To address these challenges, we propose a context-aware reasoning-enhanced generative search framework for better understanding the complicated context. Specifically, the framework first unifies heterogeneous user and item contexts into textual representations or text-based semantic identifiers and aligns them. To overcome the lack of explicit reasoning trajectories, we introduce a self-evolving post-training paradigm that iteratively combines supervised fine-tuning and reinforcement learning to progressively enhance the model's reasoning capability. In addition, we identify potential biases in existing RL algorithms when applied to search scenarios and present a debiased variant of GRPO to improve ranking performance. Extensive experiments on search log data collected from a real-world e-commerce platform demonstrate that our approach achieves superior performance compared with strong baselines, validating its effectiveness for search-based recommendation.
中文摘要 搜索式推荐是电商平台最关键的应用场景之一。用户复杂的搜索上下文（例如时空因素、历史交互和当前查询信息）构成了他们决策的重要组成部分，反映了补充显式查询术语的隐性偏好。对如此丰富的上下文信号及其与候选项目的复杂关联进行建模仍然是一个关键挑战。尽管已经做出了大量努力来构建更有效的搜索方法，但现有方法在整合上下文信息方面仍然存在局限性，这阻碍了它们完全捕捉用户意图的能力。为了应对这些挑战，我们提出了一个上下文感知推理增强的生成搜索框架，以更好地理解复杂的上下文。具体来说，该框架首先将异构用户和项目上下文统一为文本表示或基于文本的语义标识符，并对齐它们。为了克服缺乏显式推理轨迹的问题，我们引入了一种自我进化的训练后范式，该范式迭代地将监督微调和强化学习相结合，以逐步增强模型的推理能力。此外，我们还识别了现有RL算法在应用于搜索场景时的潜在偏差，并提出了GRPO的去偏差变体以提高排名性能。对从真实世界电子商务平台收集的搜索日志数据进行的大量实验表明，与强基线相比，我们的方法取得了更优异的性能，验证了其在基于搜索的推荐方面的有效性。

Prompt-MII: Meta-Learning Instruction Induction for LLMs

Prompt-MII：面向 LLM 的元学习指令归纳

Authors: Emily Xiao, Yixiao Zeng, Ada Chen, Chin-Jou Li, Amanda Bertsch, Graham Neubig
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.16932
Pdf link: https://arxiv.org/pdf/2510.16932
Abstract A popular method to adapt large language models (LLMs) to new tasks is in-context learning (ICL), which is effective but incurs high inference costs as context length grows. In this paper we propose a method to perform instruction induction, where we take training examples and reduce them to a compact but descriptive prompt that can achieve performance comparable to ICL over the full training set. Specifically, we propose PROMPT-MII, a reinforcement learning (RL) based framework to meta-learn an instruction induction model that can generate compact instructions on the fly for an arbitrary new dataset. We train on over 3,000 diverse classification datasets from the HuggingFace hub, and evaluate on 90 unseen tasks. PROMPT-MII improves downstream model quality by 4-9 F1 points (10-20% relative), matching ICL performance while requiring 3-13x fewer tokens.
中文摘要 使大型语言模型（LLM）适应新任务的一种流行方法是上下文学习（ICL），它很有效，但随着上下文长度的增长，推理成本很高。在本文中，我们提出了一种执行指令归纳的方法，其中我们采用训练示例并将它们简化为一个紧凑但描述性的提示，可以在整个训练集中实现与 ICL 相当的性能。具体来说，我们提出了 PROMPT-MII，这是一个基于强化学习（RL）的框架，用于元学习指令归纳模型，该模型可以为任意新数据集动态生成紧凑的指令。我们对来自 HuggingFace 中心的 3,000 多个不同的分类数据集进行训练，并对 90 个看不见的任务进行评估。PROMPT-MII 将下游模型质量提高了 4-9 个 F1 点（相对 10-20%），与 ICL 性能相匹配，同时需要的令牌减少 3-13 倍。

A Comparative User Evaluation of XRL Explanations using Goal Identification

使用目标识别对 XRL 解释进行比较用户评估

Authors: Mark Towers, Yali Du, Christopher Freeman, Timothy J. Norman
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.16956
Pdf link: https://arxiv.org/pdf/2510.16956
Abstract Debugging is a core application of explainable reinforcement learning (XRL) algorithms; however, limited comparative evaluations have been conducted to understand their relative performance. We propose a novel evaluation methodology to test whether users can identify an agent's goal from an explanation of its decision-making. Utilising Atari's Ms. Pacman environment and four XRL algorithms, we find that only one achieved greater than random accuracy for the tested goals and that users were generally overconfident in their selections. Further, we find that users' self-reported ease of identification and understanding for every explanation did not correlate with their accuracy.
中文摘要 调试是可解释强化学习（XRL）算法的核心应用;然而，为了了解它们的相对性能，已经进行了有限的比较评估。我们提出了一种新的评估方法来测试用户是否可以从对其决策的解释中识别代理的目标。利用 Atari 的 Ms. Pacman 环境和四种 XRL 算法，我们发现只有一种算法在测试目标上实现了高于随机的准确性，并且用户通常对他们的选择过于自信。此外，我们发现用户自我报告的对每个解释的识别和理解的难易程度与其准确性无关。

Online Learning Defense against Iterative Jailbreak Attacks via Prompt Optimization

在线学习：通过提示优化防御迭代越狱攻击

Authors: Masahiro Kaneko, Zeerak Talat, Timothy Baldwin
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.17006
Pdf link: https://arxiv.org/pdf/2510.17006
Abstract Iterative jailbreak methods, which repeatedly rewrite prompts and feed them into large language models (LLMs) to induce harmful outputs, using the model's previous responses to guide each new iteration, have been found to be a highly effective attack strategy. Although this strategy is effective against LLMs and their safety mechanisms, existing defenses do not proactively disrupt this dynamic trial-and-error cycle. In this study, we propose a novel framework that dynamically updates its defense strategy through online learning in response to each new prompt from iterative jailbreak methods. Leveraging the distinctions between harmful jailbreak-generated prompts and typical harmless prompts, we introduce a reinforcement learning-based approach that optimizes prompts to ensure appropriate responses for harmless tasks while explicitly rejecting harmful prompts. Additionally, to curb overfitting to the narrow band of partial input rewrites explored during an attack, we introduce Past-Direction Gradient Damping (PDGD). Experiments conducted on three LLMs show that our approach significantly outperforms five existing defense methods against five iterative jailbreak methods. Moreover, our results indicate that our prompt optimization strategy simultaneously enhances response quality for harmless tasks.
中文摘要 反复重写提示并将其输入到大型语言模型（LLM）中以诱导有害输出的迭代越狱方法——使用模型之前的响应来指导每次新的迭代——已被发现是一种非常有效的攻击策略。尽管这是针对 LLM 及其安全机制的有效攻击策略，但现有的防御措施并不能主动破坏这种动态的试错循环。在这项研究中，我们提出了一种新颖的框架，该框架通过在线学习动态更新其防御策略，以响应迭代越狱方法的每个新提示。利用有害越狱生成的提示和典型的无害提示之间的区别，我们引入了一种基于强化学习的方法，该方法优化提示以确保对无害任务做出适当的响应，同时明确拒绝有害提示。此外，为了遏制在攻击期间探索的部分输入重写的窄带的过度拟合，我们引入了过去方向梯度阻尼（PDGD）。在三个 LLM 上进行的实验表明，我们的方法在五种迭代越狱方法方面明显优于五种现有的防御方法。此外，我们的结果表明，我们的提示优化策略同时提高了无害任务的响应质量。

SafeSearch: Do Not Trade Safety for Utility in LLM Search Agents

安全搜索：不要在 LLM 搜索代理中以安全换取实用性

Authors: Qiusi Zhan, Angeline Budiman-Chan, Abdelrahman Zayed, Xingzhi Guo, Daniel Kang, Joo-Kyung Kim
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.17017
Pdf link: https://arxiv.org/pdf/2510.17017
Abstract Large language model (LLM) based search agents iteratively generate queries, retrieve external information, and reason to answer open-domain questions. While researchers have primarily focused on improving their utility, their safety behaviors remain underexplored. In this paper, we first evaluate search agents using red-teaming datasets and find that they are more likely to produce harmful outputs than base LLMs. For example, when asked "How can I track someone's location without their consent?", a base model refuses, whereas a search agent designed to retrieve and cite sources may lower its refusal threshold, fetch documents (e.g., court cases), and, once appended, synthesize them into an informative yet unsafe summary. We further show that utility-oriented fine-tuning intensifies this risk, motivating joint alignment of safety and utility. We present SafeSearch, a multi-objective reinforcement learning approach that couples a final-output safety/utility reward with a novel query-level shaping term that penalizes unsafe queries and rewards safe ones. Experiments show that SafeSearch reduces agent harmfulness by over 70% across three red-teaming datasets while producing safe, helpful responses, and matches the QA performance of a utility-only finetuned agent; further analyses confirm the effectiveness of the query-level reward in jointly improving safety and utility.
中文摘要 基于大型语言模型（LLM）的搜索代理迭代生成查询、检索外部信息并回答开放域问题的理由。虽然研究人员主要专注于提高其效用，但它们的安全行为仍未得到充分探索。在本文中，我们首先使用红队数据集评估搜索代理，发现它们比基础 LLM 更有可能产生有害输出。例如，当被问及“未经某人同意如何跟踪某人的位置”时，基本模型会拒绝，而旨在检索和引用来源的搜索代理可能会降低其拒绝阈值，获取文件（例如法庭案件），并在附加后将它们综合成信息丰富但不安全的摘要。我们进一步表明，以效用为导向的微调加剧了这种风险，促使安全和效用的联合调整。我们提出了 SafeSearch，这是一种多目标强化学习方法，它将最终输出的安全/效用奖励与一个新颖的查询级整形术语相结合，该术语惩罚不安全的查询并奖励安全的查询。实验表明，安全搜索在三个红队数据集中将代理的危害性降低了 70% 以上，同时产生安全、有用的响应，并与仅实用程序微调代理的 QA 性能相匹配;进一步的分析证实了查询级奖励在共同提高安全性和实用性方面的有效性。
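
Sketch (not from the paper): the abstract describes a final-output safety/utility reward coupled with a query-level shaping term that penalizes unsafe queries and rewards safe ones. A minimal way to combine the two is an additive bonus per query. The +1/-1 values, the `shaping_weight` parameter, and the function name are hypothetical; the paper's actual formulation may differ.

```python
def shaped_reward(final_reward: float,
                  query_is_safe: list,
                  shaping_weight: float = 0.1) -> float:
    """Combine the final-output reward with a per-query shaping term.

    `query_is_safe` holds one boolean per search query issued during the
    episode (assumed to come from some external safety judge).
    """
    # +1 for each safe query, -1 for each unsafe one (illustrative values).
    query_term = sum(1.0 if safe else -1.0 for safe in query_is_safe)
    return final_reward + shaping_weight * query_term


if __name__ == "__main__":
    # Two safe queries and one unsafe query, with a final reward of 1.0.
    print(shaped_reward(1.0, [True, True, False]))
```

Keeping the shaping weight small relative to the final reward is one way to stop the agent from maximizing the per-query bonus at the expense of answer quality.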

Hephaestus: Mixture Generative Modeling with Energy Guidance for Large-scale QoS Degradation

Hephaestus：大规模QoS降解的混合生成建模与能量引导

Authors: Nguyen Do, Bach Ngo, Youval Kashuv, Canh V. Pham, Hanghang Tong, My T. Thai
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.17036
Pdf link: https://arxiv.org/pdf/2510.17036
Abstract We study the Quality of Service Degradation (QoSD) problem, in which an adversary perturbs edge weights to degrade network performance. This setting arises in both network infrastructures and distributed ML systems, where communication quality, not just connectivity, determines functionality. While classical methods rely on combinatorial optimization, and recent ML approaches address only restricted linear variants with small-size networks, no prior model directly tackles the QoSD problem under nonlinear edge-weight functions. This work proposes \PIMMA, a self-reinforcing generative framework that synthesizes feasible solutions in latent space, to fill this gap. Our method includes three phases: (1) Forge: a Predictive Path-Stressing (PPS) algorithm that uses graph learning and approximation to produce feasible solutions with performance guarantee, (2) Morph: a new theoretically grounded training paradigm for Mixture of Conditional VAEs guided by an energy-based model to capture solution feature distributions, and (3) Refine: a reinforcement learning agent that explores this space to generate progressively near-optimal solutions using our designed differentiable reward function. Experiments on both synthetic and real-world networks show that our approach consistently outperforms classical and ML baselines, particularly in scenarios with nonlinear cost functions where traditional methods fail to generalize.
中文摘要 我们研究了服务质量下降（QoSD）问题，其中对手扰动边缘权重以降低网络性能。这种设置出现在网络基础设施和分布式 ML 系统中，其中通信质量（而不仅仅是连接性）决定了功能。虽然经典方法依赖于组合优化，并且最近的机器学习方法仅解决具有小尺寸网络的受限线性变体，但之前没有模型直接解决非线性边权函数下的 QoSD 问题。这项工作提出了 \PIMMA，一种自我强化的生成框架，可以在潜在空间中综合可行的解决方案，以填补这一空白。我们的方法包括三个阶段：（1） Forge：一种预测路径应力（PPS）算法，它使用图学习和近似来生成具有性能保证的可行解决方案，（2） Morph：一种新的理论基础训练范式，用于由基于能量的模型引导的条件 VAE 混合以捕获解决方案特征分布，以及（3） Refine：一种强化学习代理，它探索该空间，使用我们设计的逐步生成接近最优的解决方案可微分奖励函数。在合成网络和现实世界网络上的实验表明，我们的方法始终优于经典和机器学习基线，特别是在传统方法无法推广的非线性成本函数场景中。

Video Reasoning without Training

无需培训的视频推理

Authors: Deepak Sridhar, Kartikeya Bhardwaj, Jeya Pradha Jeyaraj, Nuno Vasconcelos, Ankita Nayak, Harris Teague
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.17045
Pdf link: https://arxiv.org/pdf/2510.17045
Abstract Video reasoning using Large Multimodal Models (LMMs) relies on costly reinforcement learning (RL) and verbose chain-of-thought, resulting in substantial computational overhead during both training and inference. Moreover, the mechanisms that control the thinking process in these reasoning models are very limited. In this paper, using entropy of the model's output as a signal, we discover that the high-quality models go through a series of micro-explorations and micro-exploitations which keep the reasoning process grounded (i.e., avoid excessive randomness while the model is exploring or thinking through an answer). We further observe that once this "thinking" process is over, more accurate models demonstrate a better convergence by reducing the entropy significantly via a final exploitation phase (i.e., a more certain convergence towards a solution trajectory). We then use these novel, theoretically-grounded insights to tune the model's behavior directly at inference, without using any RL or supervised fine-tuning. Specifically, during inference, our proposed approach called V-Reason (Video-Reason) adapts the value cache of the LMM via a few optimization steps on a small, trainable controller using an entropy-based objective, i.e., no supervision from any dataset or RL is necessary. This tuning improves the model's micro-exploration and exploitation behavior during inference. Our experiments show that our proposed method achieves significant improvements over the base instruction-tuned models across several video reasoning datasets, narrowing the gap with RL-trained models to within 0.6% average accuracy without any training, while offering massive efficiency benefits: output tokens are reduced by 58.6% compared to the RL model.
中文摘要 使用大型多模态模型（LMM）的视频推理依赖于昂贵的强化学习（RL）和冗长的思维链，导致在训练和推理过程中产生大量计算开销。此外，这些推理模型中控制思维过程的机制非常有限。在本文中，使用模型输出的熵作为信号，我们发现高质量的模型经历了一系列微探索和微利用，使推理过程保持扎根（即，在模型探索或思考答案时避免过度随机性）。我们进一步观察到，一旦这个“思考”过程结束，更准确的模型就会通过最终开发阶段显着降低熵（即，更确定地收敛到解轨迹）来证明更好的收敛。然后，我们使用这些新颖的、基于理论的见解直接在推理时调整模型的行为，而无需使用任何 RL 或监督微调。具体来说，在推理过程中，我们提出的称为 V-Reason（视频-原因）的方法通过使用基于熵的目标在小型可训练控制器上执行几个优化步骤来调整 LMM 的值缓存，即不需要任何数据集或 RL 的监督。这种调优改进了模型在推理过程中的微探索和利用行为。我们的实验表明，我们提出的方法在多个视频推理数据集中比基本指令调整模型取得了显着改进，无需任何训练即可将与 RL 训练模型的平均准确率缩小到 0.6% 以内，同时提供了巨大的效率优势：与 RL 模型相比，输出标记减少了 58.6%。
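
Sketch (not from the paper): V-Reason uses the entropy of the model's output as its signal. The snippet below shows only how that signal can be computed from a vector of next-token logits; the controller and value-cache adaptation described in the abstract are not reproduced here. Function names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(logits):
    """Shannon entropy (in nats) of the distribution implied by the logits."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


if __name__ == "__main__":
    print(entropy([0.0, 0.0, 0.0, 0.0]))   # uniform: maximal uncertainty
    print(entropy([10.0, 0.0, 0.0, 0.0]))  # peaked: low uncertainty
```

High entropy corresponds to the exploratory phases the abstract mentions, and a sharp drop in entropy corresponds to the final exploitation phase.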

The Ends Justify the Thoughts: RL-Induced Motivated Reasoning in LLMs

目的证明了这些想法的合理性：LLM 中 RL 诱导的动机推理

Authors: Nikolaus Howe, Micah Carroll
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.17057
Pdf link: https://arxiv.org/pdf/2510.17057
Abstract The use of reinforcement learning (RL) with chain-of-thought (CoT) reasoning has emerged as a promising approach for developing more capable language models. In turn, this has led to investigation of CoT monitoring as a compelling method for detecting harmful behaviors such as reward hacking, under the assumption that models' reasoning processes reflect their internal decision-making. In practice, LLM training often produces unintended behaviors due to imperfect reward signals, leading models to develop misaligned tendencies. A common corrective approach is to apply post-hoc instructions to avoid problematic behaviors like sycophancy, but what happens to the model's reasoning process when these instructions conflict with learned behaviors? We investigate this question in simple settings and find that models engage in systematic motivated reasoning -- generating plausible-sounding justifications for violating their instructions while downplaying potential harms. Beyond being an interesting property of training, we find that while motivated reasoning can be detected by most frontier reasoning models, smaller LLM judges can fail to identify a portion of it, and in rare cases can themselves be persuaded that the reasoning is correct, despite it contradicting clear instructions. This capability gap raises concerns that as models become more sophisticated, their motivated reasoning may become increasingly difficult for monitors to detect. Our results underscore the need to account for motivated reasoning when relying on chain-of-thought processes for model evaluation and oversight. All code for this paper will be made available. WARNING: some examples in this paper may be upsetting.
中文摘要 使用强化学习（RL）和思维链（CoT）推理已成为开发功能更强大的语言模型的一种有前途的方法。反过来，这导致了对 CoT 监控的调查，将其作为检测奖励黑客等有害行为的一种引人注目的方法，假设模型的推理过程反映了它们的内部决策。在实践中，LLM 训练经常会因为奖励信号不完善而产生意外行为，导致模型产生错位倾向。一种常见的纠正方法是应用事后指令来避免阿谀奉承等有问题的行为，但是当这些指令与学习的行为发生冲突时，模型的推理过程会发生什么？我们在简单的环境中研究了这个问题，发现模型参与了系统的动机推理——为违反其指令生成听起来合理的理由，同时淡化潜在的危害。除了训练的一个有趣的属性之外，我们发现，虽然大多数前沿推理模型都可以检测到动机推理，但较小的 LLM 评判模型可能无法识别其中的一部分，并且在极少数情况下，它们自己可以被说服推理是正确的，尽管它与明确的指令相矛盾。这种能力差距引发了人们的担忧，即随着模型变得越来越复杂，监视器可能会越来越难以检测到它们的动机推理。我们的研究结果强调，在依赖思维链过程进行模型评估和监督时，需要考虑动机推理。本文的所有代码都将公开。警告：本文中的一些示例可能会令人不安。

Consistent Zero-Shot Imitation with Contrastive Goal Inference

具有对比目标推理的一致零射模仿

Authors: Kathryn Wantlin, Chongyi Zheng, Benjamin Eysenbach
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.17059
Pdf link: https://arxiv.org/pdf/2510.17059
Abstract In the same way that generative models today conduct most of their training in a self-supervised fashion, how can agentic models conduct their training in a self-supervised fashion, interactively exploring, learning, and preparing to quickly adapt to new tasks? A prerequisite for embodied agents deployed in real world interactions ought to be training with interaction, yet today's most successful AI models (e.g., VLMs, LLMs) are trained without an explicit notion of action. The problem of pure exploration (which assumes no data as input) is well studied in the reinforcement learning literature and provides agents with a wide array of experiences, yet it fails to prepare them for rapid adaptation to new tasks. Today's language and vision models are trained on data provided by humans, which provides a strong inductive bias for the sorts of tasks that the model will have to solve (e.g., modeling chords in a song, phrases in a sonnet, sentences in a medical record). However, when they are prompted to solve a new task, there is a faulty tacit assumption that humans spend most of their time in the most rewarding states. The key contribution of our paper is a method for pre-training interactive agents in a self-supervised fashion, so that they can instantly mimic human demonstrations. Our method treats goals (i.e., observations) as the atomic construct. During training, our method automatically proposes goals and practices reaching them, building off prior work in reinforcement learning exploration. During evaluation, our method solves an (amortized) inverse reinforcement learning problem to explain demonstrations as optimal goal-reaching behavior. Experiments on standard benchmarks (not designed for goal-reaching) show that our approach outperforms prior methods for zero-shot imitation.
中文摘要 就像今天的生成模型以自监督的方式进行大部分训练一样，代理模型如何以自监督的方式进行训练，以交互方式探索、学习和准备快速适应新任务？在现实世界交互中部署具身代理的先决条件应该是进行交互训练，但当今最成功的人工智能模型（例如 VLM、LLM）是在没有明确行动概念的情况下进行训练的。纯探索问题（假设没有数据作为输入）在强化学习文献中得到了很好的研究，它为智能体提供了广泛的经验，但它未能让他们为快速适应新任务做好准备。今天的语言和视觉模型是根据人类提供的数据进行训练的，这为模型必须解决的任务类型提供了很强的归纳偏差（例如，对歌曲中的和弦、十四行诗中的短语、病历中的句子进行建模）。然而，当他们被提示解决一项新任务时，存在一个错误的默认假设，即人类大部分时间都花在最有价值的状态中。我们论文的主要贡献是一种以自监督方式预训练交互代理的方法，以便它们可以立即模仿人类演示。我们的方法将目标（即观察）视为原子结构。在训练期间，我们的方法会自动提出目标和实现这些目标的实践，以之前在强化学习探索方面的工作为基础。在评估过程中，我们的方法解决了一个（摊销的）逆强化学习问题，以将演示解释为最佳目标实现行为。标准基准测试（不是为实现目标而设计的）上的实验表明，我们的方法优于以前的零样本模仿方法。

Continuous Q-Score Matching: Diffusion Guided Reinforcement Learning for Continuous-Time Control

连续 Q 分数匹配：用于连续时间控制的扩散引导强化学习

Authors: Chengxiu Hua, Jiawen Gu, Yushun Tang
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2510.17122
Pdf link: https://arxiv.org/pdf/2510.17122
Abstract Reinforcement learning (RL) has achieved significant success across a wide range of domains; however, most existing methods are formulated in discrete time. In this work, we introduce a novel RL method for continuous-time control, where stochastic differential equations govern state-action dynamics. Departing from traditional value function-based approaches, our key contribution is the characterization of continuous-time Q-functions via a martingale condition and the linking of diffusion policy scores to the action gradient of a learned continuous Q-function by the dynamic programming principle. This insight motivates Continuous Q-Score Matching (CQSM), a score-based policy improvement algorithm. Notably, our method addresses a long-standing challenge in continuous-time RL: preserving the action-evaluation capability of Q-functions without relying on time discretization. We further provide theoretical closed-form solutions for linear-quadratic (LQ) control problems within our framework. Numerical results in simulated environments demonstrate the effectiveness of our proposed method and compare it to popular baselines.
中文摘要 强化学习（RL）在广泛的领域取得了巨大的成功，然而，大多数现有方法都是在离散时间内制定的。在这项工作中，我们引入了一种用于连续时间控制的新型 RL 方法，其中随机微分方程控制状态作用动力学。与传统的基于价值函数的方法不同，我们的主要贡献是通过鞅条件表征连续时间 Q 函数，并通过动态规划原理将扩散策略分数与学习到的连续 Q 函数的作用梯度联系起来。这种洞察力激发了持续 Q 分数匹配（CQSM），这是一种基于分数的策略改进算法。值得注意的是，我们的方法解决了连续时间RL中长期存在的挑战：在不依赖时间离散化的情况下保留Q函数的动作评估能力。我们进一步为框架内的线性二次（LQ）控制问题提供理论封闭式解决方案。模拟环境中的数值结果证明了我们所提出的方法的有效性，并将其与流行的基线进行了比较。

Rethinking On-policy Optimization for Query Augmentation

重新思考查询增强的策略优化

Authors: Zhichao Xu, Shengyao Zhuang, Xueguang Ma, Bingsen Chen, Yijun Tian, Fengran Mo, Jie Cao, Vivek Srikumar
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2510.17139
Pdf link: https://arxiv.org/pdf/2510.17139
Abstract Recent advances in large language models (LLMs) have led to a surge of interest in query augmentation for information retrieval (IR). Two main approaches have emerged. The first prompts LLMs to generate answers or pseudo-documents that serve as new queries, relying purely on the model's parametric knowledge or contextual information. The second applies reinforcement learning (RL) to fine-tune LLMs for query rewriting, directly optimizing retrieval metrics. Although each has its own advantages and limitations, the two approaches have not been compared under consistent experimental conditions. In this work, we present the first systematic comparison of prompting-based and RL-based query augmentation across diverse benchmarks, including evidence-seeking, ad hoc, and tool retrieval. Our key finding is that simple, training-free query augmentation often performs on par with, or even surpasses, more expensive RL-based counterparts, especially when using powerful LLMs. Motivated by this discovery, we introduce a novel hybrid method, On-policy Pseudo-document Query Expansion (OPQE), in which, instead of rewriting a query, the LLM policy learns to generate a pseudo-document that maximizes retrieval performance, thus merging the flexibility and generative structure of prompting with the targeted optimization of RL. We show OPQE outperforms both standalone prompting and RL-based rewriting, demonstrating that a synergistic approach yields the best results. Our implementation is made available to facilitate reproducibility.
中文摘要 大型语言模型（LLM）的最新进展导致人们对信息检索（IR）的查询增强的兴趣激增。出现了两种主要方法。第一种提示 LLM 生成答案或伪文档，作为新查询，纯粹依赖于模型的参数知识或上下文信息。第二种应用强化学习（RL）来微调LLM以进行查询重写，直接优化检索指标。虽然这两种方法具有各自的优点和局限性，但尚未在一致的实验条件下进行比较。在这项工作中，我们首次对基于提示和基于 RL 的查询增强在不同基准（包括证据搜索、临时和工具检索）中进行了系统比较。我们的主要发现是，简单、无需训练的查询增强通常与更昂贵的基于 RL 的查询增强性能相当，甚至超过，尤其是在使用功能强大的 LLM 时。受这一发现的启发，我们引入了一种新颖的混合方法，即策略上伪文档查询扩展（OPQE），该方法不是重写查询，而是 LLM 策略学习生成伪文档，从而最大限度地提高检索性能，从而将提示的灵活性和生成结构与RL的针对性优化相结合。我们表明 OPQE 优于独立提示和基于 RL 的重写，表明协同方法产生最佳结果。我们的实施是为了促进可重复性。

GACO-CAD: Geometry-Augmented and Conciseness-Optimized CAD Model Generation from Single Image

GACO-CAD：从单张图像生成几何增强和简洁优化的 CAD 模型

Authors: Yinghui Wang, Xinyu Zhang, Peng Du
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.17157
Pdf link: https://arxiv.org/pdf/2510.17157
Abstract Generating editable, parametric CAD models from a single image holds great potential to lower the barriers of industrial concept design. However, current multi-modal large language models (MLLMs) still struggle with accurately inferring 3D geometry from 2D images due to limited spatial reasoning capabilities. We address this limitation by introducing GACO-CAD, a novel two-stage post-training framework. It is designed to achieve a joint objective: simultaneously improving the geometric accuracy of the generated CAD models and encouraging the use of more concise modeling procedures. First, during supervised fine-tuning, we leverage depth and surface normal maps as dense geometric priors, combining them with the RGB image to form a multi-channel input. In the context of single-view reconstruction, these priors provide complementary spatial cues that help the MLLM more reliably recover 3D geometry from 2D observations. Second, during reinforcement learning, we introduce a group length reward that, while preserving high geometric fidelity, promotes the generation of more compact and less redundant parametric modeling sequences. A simple dynamic weighting strategy is adopted to stabilize training. Experiments on the DeepCAD and Fusion360 datasets show that GACO-CAD achieves state-of-the-art performance under the same MLLM backbone, consistently outperforming existing methods in terms of code validity, geometric accuracy, and modeling conciseness.
中文摘要 从单个图像生成可编辑的参数化 CAD 模型在降低工业概念设计的门槛方面具有巨大潜力。然而，由于空间推理能力有限，当前的多模态大型语言模型（MLLM）仍然难以从2D图像中准确推断3D几何形状。我们通过引入 GACO-CAD（一种新颖的两阶段训练后框架）来解决这一限制。它旨在实现一个共同目标：同时提高生成的 CAD 模型的几何精度，并鼓励使用更简洁的建模程序。首先，在监督微调过程中，我们利用深度和表面法线贴图作为密集的几何先验，将它们与 RGB 图像相结合，形成多通道输入。在单视图重建的背景下，这些先验提供了互补的空间线索，帮助 MLLM 更可靠地从 2D 观测中恢复 3D 几何形状。其次，在强化学习过程中，我们引入了组长度奖励，该奖励在保持高几何保真度的同时，促进了更紧凑和冗余更少的参数化建模序列的生成。采用简单的动态加权策略来稳定训练。在DeepCAD和Fusion360数据集上的实验表明，GACO-CAD在相同的MLLM主干下实现了最先进的性能，在代码有效性、几何精度和建模简洁性方面始终优于现有方法。

D2C-HRHR: Discrete Actions with Double Distributional Critics for High-Risk-High-Return Tasks

D2C-HRHR：高风险高回报任务的双重分布批评者的离散行动

Authors: Jundong Zhang, Yuhui Situ, Fanji Zhang, Rongji Deng, Tianqi Wei
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.17212
Pdf link: https://arxiv.org/pdf/2510.17212
Abstract Tasks involving high-risk-high-return (HRHR) actions, such as obstacle crossing, often exhibit multimodal action distributions and stochastic returns. Most reinforcement learning (RL) methods assume unimodal Gaussian policies and rely on scalar-valued critics, which limits their effectiveness in HRHR settings. We formally define HRHR tasks and theoretically show that Gaussian policies cannot guarantee convergence to the optimal solution. To address this, we propose a reinforcement learning framework that (i) discretizes continuous action spaces to approximate multimodal distributions, (ii) employs entropy-regularized exploration to improve coverage of risky but rewarding actions, and (iii) introduces a dual-critic architecture for more accurate discrete value distribution estimation. The framework scales to high-dimensional action spaces, supporting complex control domains. Experiments on locomotion and manipulation benchmarks with high risks of failure demonstrate that our method outperforms baselines, underscoring the importance of explicitly modeling multimodality and risk in RL.
中文摘要 涉及高风险高回报（HRHR）行动的任务，例如越过障碍物，通常表现出多模态行动分布和随机回报。大多数强化学习（RL）方法采用单峰高斯策略并依赖于标量值批评者，这限制了它们在 HRHR 环境中的有效性。我们正式定义了HRHR任务，并在理论上表明高斯策略不能保证收敛到最优解。为了解决这个问题，我们提出了一个强化学习框架，该框架（i）将连续动作空间离散化以近似多模态分布，（ii）采用熵正则化探索来提高对有风险但有回报的动作的覆盖率，以及（iii）引入双重批评架构以实现更准确的离散价值分布估计。该框架可扩展到高维动作空间，支持复杂的控制域。对具有高失败风险的运动和操纵基准的实验表明，我们的方法优于基线，强调了在 RL 中明确建模多模态和风险的重要性。
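
Sketch (not from the paper): the abstract argues that unimodal Gaussian policies struggle with multimodal action distributions and proposes discretizing continuous action spaces. The toy example below discretizes a one-dimensional action range into bins and shows why the mean of a bimodal distribution, which is where a single Gaussian would center, can land on an action the policy never actually prefers. The bin count, range, and function names are assumptions for illustration.

```python
import math


def bin_centers(low: float, high: float, n_bins: int) -> list:
    """Centers of n_bins equal-width bins covering [low, high]."""
    width = (high - low) / n_bins
    return [low + (i + 0.5) * width for i in range(n_bins)]


def categorical_entropy(probs) -> float:
    """Entropy (in nats) of a categorical policy over the bins."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def distribution_mean(centers, probs) -> float:
    """Mean action, i.e. where a moment-matched Gaussian would be centered."""
    return sum(c * p for c, p in zip(centers, probs))


if __name__ == "__main__":
    centers = bin_centers(-1.0, 1.0, 4)   # [-0.75, -0.25, 0.25, 0.75]
    bimodal = [0.5, 0.0, 0.0, 0.5]        # two good but opposite actions
    print(distribution_mean(centers, bimodal))   # 0.0, between the two modes
    print(categorical_entropy(bimodal))          # log(2)
```

A categorical policy over the bins can represent both modes directly, and its entropy gives a natural quantity for the entropy-regularized exploration the abstract mentions.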

Coinvisor: An RL-Enhanced Chatbot Agent for Interactive Cryptocurrency Investment Analysis

Coinvisor：用于交互式加密货币投资分析的 RL 增强型聊天机器人代理

Authors: Chong Chen, Ze Liu, Lingfeng Bao, Yanlin Wang, Ting Chen, Daoyuan Wu, Jiachi Chen
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.17235
Pdf link: https://arxiv.org/pdf/2510.17235
Abstract The cryptocurrency market offers significant investment opportunities but faces challenges including high volatility and fragmented information. Data integration and analysis are essential for informed investment decisions. Currently, investors use three main approaches: (1) manual analysis across various sources, which depends heavily on individual experience and is time-consuming and prone to bias; (2) data aggregation platforms, which are limited in functionality and depth of analysis; (3) large language model agents, which are based on static pretrained models and lack real-time data integration and multi-step reasoning capabilities. To address these limitations, we present Coinvisor, a reinforcement learning-based chatbot that provides comprehensive analytical support for cryptocurrency investment through a multi-agent framework. Coinvisor integrates diverse analytical capabilities through specialized tools. Its key innovation is a reinforcement learning-based tool selection mechanism that enables multi-step planning and flexible integration of diverse data sources. This design supports real-time interaction and adaptive analysis of dynamic content, delivering accurate and actionable investment insights. We evaluated Coinvisor through automated benchmarks on tool calling accuracy and user studies with 20 cryptocurrency investors using our interface. Results show that Coinvisor improves recall by 40.7% and F1 score by 26.6% over the base model in tool orchestration. User studies show high satisfaction (4.64/5), with participants preferring Coinvisor to both general LLMs and existing crypto platforms (4.62/5).
中文摘要 加密货币市场提供了重要的投资机会，但也面临着高波动性和信息分散等挑战。数据集成和分析对于明智的投资决策至关重要。目前，投资者主要使用三种方法：（1）跨各种来源的人工分析，这在很大程度上依赖于个人经验，耗时且容易产生偏见;（2）数据聚合平台，其功能和分析深度有限;（3）大语言模型智能体，它们基于静态预训练模型，缺乏实时数据集成和多步推理能力。为了解决这些限制，我们推出了 Coinvisor，这是一种基于强化学习的聊天机器人，通过多智能体框架为加密货币投资提供全面的分析支持。Coinvisor 通过专门的工具集成了多样化的分析功能。其关键创新是基于强化学习的工具选择机制，能够实现多步骤规划和灵活集成各种数据源。该设计支持动态内容的实时交互和自适应分析，提供准确且可操作的投资见解。我们通过针对工具调用准确性的自动化基准测试，以及由 20 名加密货币投资者使用我们界面进行的用户研究，对 Coinvisor 进行了评估。结果显示，在工具编排方面，Coinvisor比基础模型提高了40.7%的召回率，将F1得分提高了26.6%。用户研究显示满意度很高（4.64/5），参与者更喜欢 Coinvisor，而不是通用 LLM 和现有加密平台（4.62/5）。

Multimodal Safety Is Asymmetric: Cross-Modal Exploits Unlock Black-Box MLLMs Jailbreaks

多模态安全是不对称的：跨模态漏洞解锁黑盒 MLLM 越狱

Authors: Xinkai Wang, Beibei Li, Zerui Shao, Ao Liu, Shouling Ji
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2510.17277
Pdf link: https://arxiv.org/pdf/2510.17277
Abstract Multimodal large language models (MLLMs) have demonstrated significant utility across diverse real-world applications. But MLLMs remain vulnerable to jailbreaks, where adversarial inputs can collapse their safety constraints and trigger unethical responses. In this work, we investigate jailbreaks in the text-vision multimodal setting and pioneer the observation that visual alignment imposes uneven safety constraints across modalities in MLLMs, thereby giving rise to multimodal safety asymmetry. We then develop PolyJailbreak, a black-box jailbreak method grounded in reinforcement learning. Initially, we probe the model's attention dynamics and latent representation space, assessing how visual inputs reshape cross-modal information flow and diminish the model's ability to separate harmful from benign inputs, thereby exposing exploitable vulnerabilities. On this basis, we systematize them into generalizable and reusable operational rules that constitute a structured library of Atomic Strategy Primitives, which translate harmful intents into jailbreak inputs through step-wise transformations. Guided by the primitives, PolyJailbreak employs a multi-agent optimization process that automatically adapts inputs against the target models. We conduct comprehensive evaluations on a variety of open-source and closed-source MLLMs, demonstrating that PolyJailbreak outperforms state-of-the-art baselines.
中文摘要 多模态大型语言模型（MLLM）已在各种实际应用中表现出显着的实用性。但 MLLM 仍然容易受到越狱的影响，对抗性输入可能会破坏其安全约束并引发不道德的反应。在这项工作中，我们研究了文本-视觉多模态环境中的越狱，并率先观察到视觉对齐在 MLLM 中跨模态施加了不均匀的安全约束，从而导致了多模态安全不对称。然后，我们开发了 PolyJailbreak，这是一种基于强化学习的黑盒越狱方法。最初，我们探测模型的注意力动态和潜在表示空间，评估视觉输入如何重塑跨模态信息流，并削弱模型将有害输入与良性输入区分开来的能力，从而暴露可利用的漏洞。在此基础上，我们将它们系统化为可通用和可重用的操作规则，这些规则构成了原子策略原语的结构化库，通过逐步转换将有害意图转化为越狱输入。在原语的指导下，PolyJailbreak 采用多智能体优化过程，根据目标模型自动调整输入。我们对各种开源和闭源 MLLM 进行了全面评估，证明 PolyJailbreak 的性能优于最先进的基线。

Optimizing Energy Management of Smart Grid using Reinforcement Learning aided by Surrogate models built using Physics-informed Neural Networks

利用基于物理信息神经网络构建的代理模型辅助强化学习，优化智能电网的能源管理

Authors: Julen Cestero, Carmine Delle Femine, Kenji S. Muro, Marco Quartulli, Marcello Restelli
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.17380
Pdf link: https://arxiv.org/pdf/2510.17380
Abstract Optimizing energy management within a smart grid scenario presents significant challenges, primarily due to the complexity of real-world systems and the intricate interactions among various components. Reinforcement Learning (RL) is gaining prominence as a solution for addressing the challenges of Optimal Power Flow in smart grids. However, RL needs to iterate compulsively throughout a given environment to obtain the optimal policy. This means obtaining samples from a, most likely, costly simulator, which can lead to a sample efficiency problem. In this work, we address this problem by substituting costly smart grid simulators with surrogate models built using Physics-informed Neural Networks (PINNs), optimizing the RL policy training process by arriving at convergent results in a fraction of the time employed by the original environment.
中文摘要 优化智能电网场景中的能源管理带来了重大挑战，这主要是由于现实世界系统的复杂性以及各个组件之间复杂的相互作用。强化学习（RL）作为解决智能电网最优潮流挑战的解决方案越来越受到重视。然而，RL 需要在给定环境中反复迭代以获得最佳策略。这意味着从很可能代价高昂的模拟器中获取样本，这可能会导致样本效率问题。在这项工作中，我们通过用物理信息神经网络（Physics-informed Neural Networks，PINN）构建的代理模型替换昂贵的智能电网模拟器来解决这个问题，通过在原始环境所用时间的一小部分内获得收敛结果来优化RL策略训练过程。

TabR1: Taming GRPO for tabular reasoning LLMs

TabR1：驯服 GRPO 进行表格推理 LLM

Authors: Pengxiang Cai, Zihao Gao, Jintai Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.17385
Pdf link: https://arxiv.org/pdf/2510.17385
Abstract Tabular prediction has traditionally relied on gradient-boosted decision trees and specialized deep learning models, which excel within tasks but provide limited interpretability and weak transfer across tables. Reasoning large language models (LLMs) promise cross-task adaptability with transparent reasoning traces, yet their potential has not been fully realized for tabular data. This paper presents TabR1, the first reasoning LLM for tabular prediction with multi-step reasoning. At its core is Permutation Relative Policy Optimization (PRPO), a simple yet efficient reinforcement learning method that encodes column-permutation invariance as a structural prior. By constructing multiple label-preserving permutations per sample and estimating advantages both within and across permutations, PRPO transforms sparse rewards into dense learning signals and improves generalization. With limited supervision, PRPO activates the reasoning ability of LLMs for tabular prediction, enhancing few-shot and zero-shot performance as well as interpretability. Comprehensive experiments demonstrate that TabR1 achieves performance comparable to strong baselines under full-supervision fine-tuning. In the zero-shot setting, TabR1 approaches the performance of strong baselines under the 32-shot setting. Moreover, TabR1 (8B) substantially outperforms much larger LLMs across various tasks, achieving up to 53.17% improvement over DeepSeek-R1 (685B).
中文摘要 表格预测传统上依赖于梯度提升决策树和专门的深度学习模型，这些模型在任务内表现出色，但可解释性有限，跨表迁移能力较弱。推理型大型语言模型（LLM）有望凭借透明的推理轨迹实现跨任务适应性，但它们在表格数据上的潜力尚未完全实现。本文提出了 TabR1，这是第一个具备多步推理能力的表格预测推理 LLM。其核心是排列相对策略优化（PRPO），这是一种简单而高效的强化学习方法，将列排列不变性编码为结构先验。通过为每个样本构建多个保持标签不变的排列并估计排列内部和排列之间的优势，PRPO 将稀疏奖励转化为密集的学习信号并改进泛化。在有限的监督下，PRPO 激活了 LLM 在表格预测上的推理能力，增强了少样本和零样本性能以及可解释性。综合实验表明，TabR1在全监督微调下实现了与强基线相当的性能。在零样本设置中，TabR1 接近 32-shot 设置下强基线的性能。此外，TabR1（8B）在各种任务中的性能大大优于规模大得多的 LLM，比 DeepSeek-R1（685B）最多提高了 53.17%。

Inference of Deterministic Finite Automata via Q-Learning

通过 Q 学习推断确定性有限自动机

Authors: Elaheh Hosseinkhani, Martin Leucker
Subjects: Formal Languages and Automata Theory (cs.FL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.17386
Pdf link: https://arxiv.org/pdf/2510.17386
Abstract Traditional approaches to inference of deterministic finite-state automata (DFA) stem from symbolic AI, including both active learning methods (e.g., Angluin's L* algorithm and its variants) and passive techniques (e.g., Biermann and Feldman's method, RPNI). Meanwhile, sub-symbolic AI, particularly machine learning, offers alternative paradigms for learning from data, such as supervised, unsupervised, and reinforcement learning (RL). This paper investigates the use of Q-learning, a well-known reinforcement learning algorithm, for the passive inference of deterministic finite automata. It builds on the core insight that the learned Q-function, which maps state-action pairs to rewards, can be reinterpreted as the transition function of a DFA over a finite domain. This provides a novel bridge between sub-symbolic learning and symbolic representations. The paper demonstrates how Q-learning can be adapted for automaton inference and provides an evaluation on several examples.
中文摘要 确定性有限态自动机（DFA）推理的传统方法源于符号 AI，包括主动学习方法（例如 Angluin 的 L* 算法及其变体）和被动技术（例如 Biermann 和 Feldman 的方法 RPNI）。同时，子符号人工智能，特别是机器学习，为从数据中学习提供了替代范式，例如监督学习、无监督学习和强化学习（RL）。本文研究了使用众所周知的强化学习算法Q-learning对确定性有限自动机进行被动推理。它建立在核心见解之上，即学习到的 Q 函数（将状态-动作对映射到奖励）可以重新解释为 DFA 在有限域上的转换函数。这在亚符号学习和符号表示之间架起了一座新颖的桥梁。本文演示了如何将 Q 学习应用于自动机推理，并对几个示例进行了评估。
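
The abstract builds on the learned Q-function, so a minimal sketch of the textbook tabular Q-learning update may help: Q(s, a) is moved toward r + gamma * max over a' of Q(s', a'). Everything concrete here is an assumption for illustration (the name `q_update`, the integer states, the symbols "a" and "b", and the values of `alpha` and `gamma`); how the paper reinterprets the learned table as a DFA transition function is not stated in the abstract and is not shown.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """Apply one textbook Q-learning update in place and return the new value."""
    # Greedy estimate of the value of the successor state.
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    # Move the current estimate a fraction alpha toward the target.
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Hypothetical toy setup: states are integers, actions are input symbols.
q = defaultdict(float)
symbols = ["a", "b"]
first = q_update(q, state=0, action="a", reward=1.0, next_state=1, actions=symbols)
second = q_update(q, state=0, action="a", reward=1.0, next_state=1, actions=symbols)
```

Repeating the same transition pulls the estimate progressively closer to the target, which is the behaviour any adaptation for automaton inference would inherit from plain Q-learning.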

Leveraging Group Relative Policy Optimization to Advance Large Language Models in Traditional Chinese Medicine

利用群体相对策略优化推进中医药大语言模型

Authors: Jiacheng Xie, Shuai Zeng, Yang Yu, Xiaoting Tang, Guanghui An, Dong Xu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.17402
Pdf link: https://arxiv.org/pdf/2510.17402
Abstract Traditional Chinese Medicine (TCM) presents a rich and structurally unique knowledge system that challenges conventional applications of large language models (LLMs). Although previous TCM-specific LLMs have shown progress through supervised fine-tuning, they often face limitations in alignment, data quality, and evaluation consistency. In this study, we introduce Ladder-base, the first TCM-focused LLM trained with Group Relative Policy Optimization (GRPO), a reinforcement learning method that improves reasoning and factual consistency by optimizing response selection based on intra-group comparisons. Ladder-base is built upon the Qwen2.5-7B-Instruct foundation model and trained exclusively on the textual subset of the TCM-Ladder benchmark, using 80 percent of the data for training and the remaining 20 percent split evenly between validation and test sets. Through standardized evaluation, Ladder-base demonstrates superior performance across multiple reasoning metrics when compared to both state-of-the-art general-purpose LLMs such as GPT-4, Gemini 2.5, Claude 3, and Qwen3 and domain-specific TCM models including BenTsao, HuatuoGPT2, and Zhongjing. These findings suggest that GRPO provides an effective and efficient strategy for aligning LLMs with expert-level reasoning in traditional medical domains and supports the development of trustworthy and clinically grounded TCM artificial intelligence systems.
中文摘要 中医药（TCM）呈现出丰富且结构独特的知识体系，挑战了大型语言模型（LLM）的传统应用。尽管以前的中医专用 LLM 通过监督微调取得了进展，但它们在对齐、数据质量和评估一致性方面经常面临局限性。在这项研究中，我们介绍了 Ladder-base，这是第一个使用群体相对策略优化（GRPO）训练的以中医为中心的 LLM，这是一种强化学习方法，通过基于组内比较优化响应选择来提高推理和事实一致性。Ladder-base 建立在 Qwen2.5-7B-Instruct 基础模型之上，并专门在 TCM-Ladder 基准的文本子集上进行训练，使用 80% 的数据进行训练，其余 20% 在验证集和测试集之间平均分配。通过标准化评估，与GPT-4、Gemini 2.5、Claude 3和Qwen3等最先进的通用LLM以及BenTsao、HuatuoGPT2和Zhongjing等特定领域的中医模型相比，Ladder-base在多个推理指标上表现出优越的性能。这些发现表明，GRPO为使 LLM 与传统医学领域的专家级推理对齐提供了一种有效且高效的策略，并支持了值得信赖和基于临床的中医人工智能系统的开发。
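
The "intra-group comparisons" mentioned in the abstract can be illustrated with the group-relative advantage commonly described for GRPO: several responses are sampled for one prompt, and each reward is normalized by the group's mean and standard deviation. This is a sketch under assumptions, not the paper's training code; the function name, the `eps` term, and the choice of population standard deviation are illustrative.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other responses in the same group."""
    mean = statistics.fmean(rewards)
    # Population standard deviation; implementations differ on this choice.
    std = statistics.pstdev(rewards)
    # eps keeps the division defined when every response gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses scoring above the group mean receive positive advantages and those below receive negative ones, so no separately learned value function is needed to form the baseline.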

Agentic Reinforcement Learning for Search is Unsafe

搜索的智能体强化学习不安全

Authors: Yushi Yang, Shreyansh Padarha, Andrew Lee, Adam Mahdi
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.17431
Pdf link: https://arxiv.org/pdf/2510.17431
Abstract Agentic reinforcement learning (RL) trains large language models to autonomously call tools during reasoning, with search as the most common application. These models excel at multi-step reasoning tasks, but their safety properties are not well understood. In this study, we show that RL-trained search models inherit refusal from instruction tuning and often deflect harmful requests by turning them into safe queries. However, this safety is fragile. Two simple attacks, one that forces the model to begin response with search (Search attack), another that encourages models to repeatedly search (Multi-search attack), trigger cascades of harmful searches and answers. Across two model families (Qwen, Llama) with both local and web search, these attacks lower refusal rates by up to 60.0%, answer safety by 82.5%, and search-query safety by 82.4%. The attacks succeed by triggering models to generate harmful, request-mirroring search queries before they can generate the inherited refusal tokens. This exposes a core weakness of current RL training: it rewards continued generation of effective queries without accounting for their harmfulness. As a result, RL search models have vulnerabilities that users can easily exploit, making it urgent to develop safety-aware agentic RL pipelines optimising for safe search.
中文摘要 代理强化学习（RL）训练大型语言模型在推理过程中自主调用工具，其中搜索是最常见的应用。这些模型在多步骤推理任务中表现出色，但它们的安全特性尚不清楚。在这项研究中，我们表明，RL 训练的搜索模型继承了指令调整的拒绝，并且经常通过将有害请求转换为安全查询来转移有害请求。然而，这种安全性是脆弱的。两种简单的攻击，一种是强制模型开始搜索响应（搜索攻击），另一种是鼓励模型重复搜索（多搜索攻击），会触发一连串有害的搜索和答案。在具有本地搜索和 Web 搜索的两个模型系列（Qwen、Llama）中，这些攻击将拒绝率降低了 60.0%，应答安全性降低了 82.5%，搜索查询安全性降低了 82.4%。这些攻击通过触发模型在生成继承的拒绝令牌之前生成有害的请求镜像搜索查询来成功。这暴露了当前 RL 训练的一个核心弱点：它奖励持续生成有效查询而不考虑其危害性。因此，RL 搜索模型存在用户可以轻松利用的漏洞，因此迫切需要开发安全感知的代理 RL 管道来优化安全搜索。

OncoReason: Structuring Clinical Reasoning in LLMs for Robust and Interpretable Survival Prediction

OncoReason：在 LLM 中构建临床推理以实现稳健且可解释的生存预测

Authors: Raghu Vamshi Hemadri, Geetha Krishna Guruju, Kristi Topollai, Anna Ewa Choromanska
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.17532
Pdf link: https://arxiv.org/pdf/2510.17532
Abstract Predicting cancer treatment outcomes requires models that are both accurate and interpretable, particularly in the presence of heterogeneous clinical data. While large language models (LLMs) have shown strong performance in biomedical NLP, they often lack structured reasoning capabilities critical for high-stakes decision support. We present a unified, multi-task learning framework that aligns autoregressive LLMs with clinical reasoning for outcome prediction on the MSK-CHORD dataset. Our models are trained to jointly perform binary survival classification, continuous survival time regression, and natural language rationale generation. We evaluate three alignment strategies: (1) standard supervised fine-tuning (SFT), (2) SFT with Chain-of-Thought (CoT) prompting to elicit step-by-step reasoning, and (3) Group Relative Policy Optimization (GRPO), a reinforcement learning method that aligns model outputs to expert-derived reasoning trajectories. Experiments with LLaMa3-8B and Med42-8B backbones demonstrate that CoT prompting improves F1 by +6.0 and reduces MAE by 12%, while GRPO achieves state-of-the-art interpretability and predictive performance across BLEU, ROUGE, and BERTScore. We further show that existing biomedical LLMs often fail to produce valid reasoning traces due to architectural constraints. Our findings underscore the importance of reasoning-aware alignment in multi-task clinical modeling and set a new benchmark for interpretable, trustworthy LLMs in precision oncology.
中文摘要 预测癌症治疗结果需要准确且可解释的模型，特别是在存在异质性临床数据的情况下。虽然大型语言模型（LLM）在生物医学 NLP 中表现出强大的性能，但它们通常缺乏对高风险决策支持至关重要的结构化推理能力。我们提出了一个统一的多任务学习框架，使自回归 LLM 与临床推理对齐，用于 MSK-CHORD 数据集上的结果预测。我们的模型经过训练，可以联合执行二元生存分类、连续生存时间回归和自然语言推理依据生成。我们评估了三种对齐策略：（1）标准监督微调（SFT），（2）带有思维链（CoT）提示的 SFT 以引发分步推理，以及（3）群体相对策略优化（GRPO），一种强化学习方法，可将模型输出与专家推导的推理轨迹对齐。使用 LLaMa3-8B 和 Med42-8B 主干的实验表明，CoT 提示可将 F1 提高 +6.0 并将 MAE 降低 12%，而 GRPO 在 BLEU、ROUGE 和 BERTScore 上实现了最先进的可解释性和预测性能。我们进一步表明，由于架构限制，现有的生物医学 LLM 通常无法产生有效的推理轨迹。我们的研究结果强调了推理感知对齐在多任务临床建模中的重要性，并为精准肿瘤学中可解释、可信赖的 LLM 树立了新的基准。

An Empirical Study of Lagrangian Methods in Safe Reinforcement Learning

安全强化学习中拉格朗日方法的实证研究

Authors: Lindsay Spoor, Álvaro Serra-Gómez, Aske Plaat, Thomas Moerland
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2510.17564
Pdf link: https://arxiv.org/pdf/2510.17564
Abstract In safety-critical domains such as robotics, navigation and power systems, constrained optimization problems arise where maximizing performance must be carefully balanced with associated constraints. Safe reinforcement learning provides a framework to address these challenges, with Lagrangian methods being a popular choice. However, the effectiveness of Lagrangian methods crucially depends on the choice of the Lagrange multiplier $\lambda$, which governs the trade-off between return and constraint cost. A common approach is to update the multiplier automatically during training. Although this is standard in practice, there remains limited empirical evidence on the robustness of an automated update and its influence on overall performance. Therefore, we analyze (i) optimality and (ii) stability of Lagrange multipliers in safe reinforcement learning across a range of tasks. We provide $\lambda$-profiles that give a complete visualization of the trade-off between return and constraint cost of the optimization problem. These profiles show the highly sensitive nature of $\lambda$ and moreover confirm the lack of general intuition for choosing the optimal value $\lambda^*$. Our findings additionally show that automated multiplier updates are able to recover and sometimes even exceed the optimal performance found at $\lambda^*$ due to the vast difference in their learning trajectories. Furthermore, we show that automated multiplier updates exhibit oscillatory behavior during training, which can be mitigated through PID-controlled updates. However, this method requires careful tuning to achieve consistently better performance across tasks. This highlights the need for further research on stabilizing Lagrangian methods in safe reinforcement learning. The code used to reproduce our results can be found at this https URL.
中文摘要 在机器人、导航和电力系统等安全关键领域，出现了约束优化问题，必须仔细平衡性能最大化与相关约束。安全强化学习提供了一个应对这些挑战的框架，拉格朗日方法是一种流行的选择。然而，拉格朗日方法的有效性关键取决于拉格朗日乘数 $\lambda$ 的选择，它控制着回报和约束成本之间的权衡。一种常见的方法是在训练期间自动更新乘数。尽管这在实践中是标准的，但关于自动更新的鲁棒性及其对整体性能的影响的经验证据仍然有限。因此，我们分析了拉格朗日乘数在一系列任务的安全强化学习中的（i）最优性和（ii）稳定性。我们提供了 $\lambda$ 曲线（$\lambda$-profiles），可以完整地可视化优化问题的回报和约束成本之间的权衡。这些曲线显示了 $\lambda$ 的高度敏感性，此外还证实了选择最佳值 $\lambda^*$ 缺乏一般直觉。我们的研究结果还表明，由于其学习轨迹存在巨大差异，自动乘数更新能够恢复，有时甚至超过 $\lambda^*$ 处的最佳性能。此外，我们表明自动乘数更新在训练过程中表现出振荡行为，这可以通过 PID 控制的更新来缓解。然而，这种方法需要仔细调整，才能在任务中实现始终如一的更好性能。这凸显了进一步研究安全强化学习中稳定拉格朗日方法的必要性。用于重现结果的代码可以在此 https URL 中找到。
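
To make the two kinds of multiplier update in the abstract concrete, here is a minimal sketch assuming a scalar episode cost and a fixed cost limit. The first function is plain projected gradient ascent on $\lambda$; the class is one common way to write a PID-controlled update. The names, the learning rate, and the gains `kp`, `ki`, `kd` are hypothetical and are not taken from the paper or its code.

```python
def gradient_ascent_update(lmbda, cost, limit, lr=0.1):
    """Raise lambda when the cost exceeds the limit; project back to lambda >= 0."""
    return max(0.0, lmbda + lr * (cost - limit))


class PIDMultiplier:
    """PID-style multiplier: proportional, integral and derivative terms
    of the constraint violation (cost - limit). Gains are illustrative."""

    def __init__(self, kp=0.1, ki=0.01, kd=0.05):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, cost, limit):
        error = cost - limit
        # The integral term alone reproduces the accumulate-and-clip behaviour
        # of the gradient-ascent update above.
        self.integral = max(0.0, self.integral + error)
        derivative = error - self.prev_error
        self.prev_error = error
        raw = self.kp * error + self.ki * self.integral + self.kd * derivative
        return max(0.0, raw)


# Hypothetical numbers: episode cost 30 against a limit of 25.
lam_ga = gradient_ascent_update(0.0, cost=30.0, limit=25.0)
pid = PIDMultiplier()
lam_pid_first = pid.update(cost=30.0, limit=25.0)
lam_pid_second = pid.update(cost=20.0, limit=25.0)
```

The proportional and derivative terms react to the current violation and its trend, which is the usual argument for why a PID form can damp the oscillations of a purely integral update; as the abstract notes, the gains still need careful tuning.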

RESample: A Robust Data Augmentation Framework via Exploratory Sampling for Robotic Manipulation

RESample：通过探索性采样实现机器人操作的鲁棒数据增强框架

Authors: Yuquan Xue, Guanxing Lu, Zhenyu Wu, Chuanrui Zhang, Bofang Jia, Zhengyi Gu, Yansong Tang, Ziwei Wang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.17640
Pdf link: https://arxiv.org/pdf/2510.17640
Abstract Vision-Language-Action models (VLAs) have demonstrated remarkable performance on complex robotic manipulation tasks through imitation learning. However, existing imitation learning datasets contain only successful trajectories and lack failure or recovery data, especially for out-of-distribution (OOD) states where the robot deviates from the main policy due to minor perturbations or errors, leading VLA models to struggle with states deviating from the training distribution. To this end, we propose an automated OOD data augmentation framework named RESample through exploratory sampling. Specifically, we first leverage offline reinforcement learning to obtain an action-value network that accurately identifies sub-optimal actions under the current manipulation policy. We further sample potential OOD states from trajectories via rollout, and design an exploratory sampling mechanism that adaptively incorporates these action proxies into the training dataset to ensure efficiency. Subsequently, our framework explicitly encourages the VLAs to recover from OOD states and enhances their robustness against distributional shifts. We conduct extensive experiments on the LIBERO benchmark as well as real-world robotic manipulation tasks, demonstrating that RESample consistently improves the stability and generalization ability of VLA models.
中文摘要 视觉-语言-动作模型（VLA）通过模仿学习在复杂的机器人操纵任务中表现出了卓越的性能。然而，现有的模仿学习数据集仅包含成功的轨迹，缺乏故障或恢复数据，特别是对于分布外（OOD）状态，机器人由于轻微的扰动或错误而偏离主策略，导致VLA模型难以应对偏离训练分布的状态。为此，我们通过探索性采样提出了一个名为RESample的自动化OOD数据增强框架。具体来说，我们首先利用离线强化学习来获得一个动作-价值网络，该网络可以准确识别当前操纵策略下的次优动作。我们通过 rollout 进一步从轨迹中采样潜在的 OOD 状态，并设计一种探索性采样机制，将这些动作代理自适应地合并到训练数据集中，以确保效率。随后，我们的框架明确鼓励 VLA 从 OOD 状态中恢复，并增强其对分布偏移的鲁棒性。我们在 LIBERO 基准测试以及真实世界的机器人操纵任务上进行了广泛的实验，证明 RESample 持续提高 VLA 模型的稳定性和泛化能力。

CrossGuard: Safeguarding MLLMs against Joint-Modal Implicit Malicious Attacks

CrossGuard：保护 MLLM 免受联合模态隐式恶意攻击

Authors: Xu Zhang, Hao Li, Zhichao Lu
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.17687
Pdf link: https://arxiv.org/pdf/2510.17687
Abstract Multimodal Large Language Models (MLLMs) achieve strong reasoning and perception capabilities but are increasingly vulnerable to jailbreak attacks. While existing work focuses on explicit attacks, where malicious content resides in a single modality, recent studies reveal implicit attacks, in which benign text and image inputs jointly express unsafe intent. Such joint-modal threats are difficult to detect and remain underexplored, largely due to the scarcity of high-quality implicit data. We propose ImpForge, an automated red-teaming pipeline that leverages reinforcement learning with tailored reward modules to generate diverse implicit samples across 14 domains. Building on this dataset, we further develop CrossGuard, an intent-aware safeguard providing robust and comprehensive defense against both explicit and implicit threats. Extensive experiments across safe and unsafe benchmarks, implicit and explicit attacks, and multiple out-of-domain settings demonstrate that CrossGuard significantly outperforms existing defenses, including advanced MLLMs and guardrails, achieving stronger security while maintaining high utility. This offers a balanced and practical solution for enhancing MLLM robustness against real-world multimodal threats.
中文摘要 多模态大型语言模型（MLLM）实现了强大的推理和感知能力，但越来越容易受到越狱攻击。虽然现有工作侧重于显式攻击，其中恶意内容以单一模式存在，但最近的研究揭示了隐式攻击，其中良性文本和图像输入共同表达了不安全的意图。这种联合模态威胁很难被发现，并且仍未得到充分探索，这主要是由于高质量隐式数据的稀缺。我们提出了 ImpForge，这是一个自动化的红队管道，它利用强化学习和定制的奖励模块来生成跨 14 个领域的不同隐式样本。在此数据集的基础上，我们进一步开发了 CrossGuard，这是一种意图感知保护措施，可针对显式和隐式威胁提供强大而全面的防御。跨安全和不安全基准、隐式和显式攻击以及多种域外设置的广泛实验表明，CrossGuard 的性能显着优于现有防御措施，包括高级 MLLM 和护栏，在保持高实用性的同时实现更强的安全性。这为增强 MLLM 针对现实世界多模态威胁的鲁棒性提供了平衡且实用的解决方案。

A Principle of Targeted Intervention for Multi-Agent Reinforcement Learning

多智能体强化学习的靶向干预原则

Authors: Anjie Liu, Jianhong Wang, Samuel Kaski, Jun Wang, Mengyue Yang
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2510.17697
Pdf link: https://arxiv.org/pdf/2510.17697
Abstract Steering cooperative multi-agent reinforcement learning (MARL) towards desired outcomes is challenging, particularly when the global guidance from a human on the whole multi-agent system is impractical in a large-scale MARL. On the other hand, designing mechanisms to coordinate agents mostly relies on empirical studies, lacking an easy-to-use research tool. In this work, we employ multi-agent influence diagrams (MAIDs) as a graphical framework to address the above issues. First, we introduce interaction paradigms that leverage MAIDs to analyze and visualize existing approaches in MARL. Then, we design a new interaction paradigm based on MAIDs, referred to as targeted intervention, that is applied to only a single targeted agent, so the problem of global guidance can be mitigated. In our implementation, we introduce a causal inference technique, referred to as Pre-Strategy Intervention (PSI), to realize the targeted intervention paradigm. Since MAIDs can be regarded as a special class of causal diagrams, a composite desired outcome that integrates the primary task goal and an additional desired outcome can be achieved by maximizing the corresponding causal effect through the PSI. Moreover, the bundled relevance graph analysis of MAIDs provides a tool to identify whether an MARL learning paradigm is workable under the design of an interaction paradigm. In experiments, we demonstrate the effectiveness of our proposed targeted intervention, and verify the result of relevance graph analysis.
中文摘要 引导协作多智能体强化学习（MARL）实现预期结果具有挑战性，特别是当人类对整个多智能体系统的全局指导在大规模 MARL 中不切实际时。另一方面，设计协调智能体的机制最依赖于实证研究，缺乏易于使用的研究工具。在这项工作中，我们采用多智能体影响图（MAID）作为图形框架来解决上述问题。首先，我们引入了利用 MAID 来分析和可视化 MARL 中现有方法的交互范式。然后，我们设计了一种基于MAIDs的新交互范式，称为仅应用于单个靶向代理的靶向干预，从而缓解全局引导的问题。在我们的实施中，我们引入了一种因果推理技术，称为策略前干预（PSI），以实现有针对性的干预范式。由于 MAID 可以被视为一类特殊的因果图，因此可以通过 PSI 最大化相应的因果效应来实现整合主要任务目标和附加期望结果的复合期望结果。此外，MAID的捆绑相关图分析提供了一种工具，用于识别MARL学习范式在交互范式设计下是否可行。在实验中，我们证明了我们提出的靶向干预的有效性，并验证了相关图分析的结果。

QueST: Incentivizing LLMs to Generate Difficult Problems

QueST：激励 LLM 生成难题

Authors: Hanxu Hu, Xingxing Zhang, Jannis Vamvas, Rico Sennrich, Furu Wei
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.17715
Pdf link: https://arxiv.org/pdf/2510.17715
Abstract Large Language Models have achieved strong performance on reasoning tasks, solving competition-level coding and math problems. However, their scalability is limited by human-labeled datasets and the lack of large-scale, challenging coding problem training data. Existing competitive coding datasets contain only thousands to tens of thousands of problems. Previous synthetic data generation methods rely on either augmenting existing instruction datasets or selecting challenging problems from human-labeled data. In this paper, we propose QueST, a novel framework which combines difficulty-aware graph sampling and difficulty-aware rejection fine-tuning that directly optimizes specialized generators to create challenging coding problems. Our trained generators demonstrate superior capability compared to even GPT-4o at creating challenging problems that benefit downstream performance. We leverage QueST to generate large-scale synthetic coding problems, which we then use to distill from strong teacher models with long chain-of-thought or to conduct reinforcement learning for smaller models, proving effective in both scenarios. Our distillation experiments demonstrate significant performance gains. Specifically, after fine-tuning Qwen3-8B-base on 100K difficult problems generated by QueST, we surpass the performance of the original Qwen3-8B on LiveCodeBench. With an additional 112K examples (i.e., 28K human-written problems paired with multiple synthetic solutions), our 8B model matches the performance of the much larger DeepSeek-R1-671B. These findings indicate that generating complex problems via QueST offers an effective and scalable approach to advancing the frontiers of competitive coding and reasoning for large language models.
中文摘要 大型语言模型在推理任务、解决竞赛级编码和数学问题方面取得了强大的性能。然而，它们的可扩展性受到人类标记的数据集和缺乏大规模、具有挑战性的编码问题训练数据的限制。现有的竞争性编码数据集仅包含数千到数万个问题。以前的合成数据生成方法依赖于增强现有指令数据集或从人工标记的数据中选择具有挑战性的问题。在本文中，我们提出了QueST，这是一种结合了难度感知图采样和难度感知拒绝微调的新型框架，它直接优化了专用生成器以创建具有挑战性的编码问题。与 GPT-4o 相比，我们训练有素的生成器在创建有利于下游性能的挑战性问题方面表现出卓越的能力。我们利用 QueST 生成大规模的合成编码问题，然后使用这些问题从具有长思维链的强教师模型中提炼出来，或者对较小的模型进行强化学习，证明在这两种情况下都是有效的。我们的蒸馏实验显示出显着的性能提升。具体来说，在对 QueST 生成的 100K 难题进行 Qwen3-8B-base 微调后，我们在 LiveCodeBench 上超越了原来 Qwen3-8B 的性能。通过额外的 112K 示例（即 28K 人为编写的问题与多个综合解决方案配对），我们的 8B 模型与更大的 DeepSeek-R1-671B 的性能相匹配。这些发现表明，通过 QueST 生成复杂问题提供了一种有效且可扩展的方法，可以推进大型语言模型竞争性编码和推理的前沿。

Train for Truth, Keep the Skills: Binary Retrieval-Augmented Reward Mitigates Hallucinations

为真理而训练，保持技能：二进制检索增强奖励可减轻幻觉

Authors: Tong Chen, Akari Asai, Luke Zettlemoyer, Hannaneh Hajishirzi, Faeze Brahman
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.17733
Pdf link: https://arxiv.org/pdf/2510.17733
Abstract Language models often generate factually incorrect information unsupported by their training data, a phenomenon known as extrinsic hallucination. Existing mitigation approaches often degrade performance on open-ended generation and downstream tasks, limiting their practical utility. We propose an online reinforcement learning method using a novel binary retrieval-augmented reward (RAR) to address this tradeoff. Unlike continuous reward schemes, our approach assigns a reward of one only when the model's output is entirely factually correct, and zero otherwise. We evaluate our method on Qwen3 reasoning models across diverse tasks. For open-ended generation, binary RAR achieves a 39.3% reduction in hallucination rates, substantially outperforming both supervised training and continuous-reward RL baselines. In short-form question answering, the model learns calibrated abstention, strategically outputting "I don't know" when faced with insufficient parametric knowledge. This yields 44.4% and 21.7% fewer incorrect answers on PopQA and GPQA, respectively. Crucially, these factuality gains come without performance degradation on instruction following, math, or code, whereas continuous-reward RL, despite improving factuality, induces quality regressions.
中文摘要 语言模型经常生成没有训练数据支持的事实不正确的信息，这种现象称为外在幻觉。现有的缓解方法通常会降低开放式生成和下游任务的性能，从而限制其实际效用。我们提出了一种使用新型二元检索增强奖励（RAR）的在线强化学习方法来解决这一权衡问题。与连续奖励方案不同，我们的方法仅在模型的输出完全正确时分配 1 的奖励，否则为零。我们评估了跨不同任务的 Qwen3 推理模型的方法。对于开放式生成，二元RAR的幻觉率降低了39.3%，大大优于监督训练和持续奖励RL基线。在简短的问答中，模型学习校准弃权，在面对参数知识不足时战略性地输出“我不知道”。这使 PopQA 和 GPQA 的错误答案分别减少了 44.4% 和 21.7%。至关重要的是，这些事实性增益不会在指令遵循、数学或代码上降低性能，而连续奖励 RL 尽管提高了事实性，但会导致质量回归。
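
The contrast between a binary and a continuous reward can be sketched as follows, assuming a hypothetical upstream verifier that has already labelled each claim in an output as supported or unsupported by retrieved evidence. The function names and the fractional form of the continuous reward are illustrative assumptions; the abstract does not specify how the paper's continuous baselines are computed.

```python
def binary_reward(claim_supported):
    """Return 1.0 only if every checked claim is supported, else 0.0.

    Treating an output with no checkable claims as error-free is an
    assumption of this sketch, not something stated in the abstract.
    """
    return 1.0 if all(claim_supported) else 0.0


def fractional_reward(claim_supported):
    """Illustrative continuous alternative: the share of supported claims."""
    if not claim_supported:
        return 1.0
    return sum(claim_supported) / len(claim_supported)


# Hypothetical verifier verdicts for an answer containing three claims.
verdicts = [True, True, False]
r_binary = binary_reward(verdicts)
r_fraction = fractional_reward(verdicts)
```

A single unsupported claim drives the binary reward to zero while the fractional reward still gives partial credit, which is the difference between the two schemes that the abstract emphasizes.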

UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action

UltraCUA：具有混合动作的计算机使用代理的基础模型

Authors: Yuhao Yang, Zhen Yang, Zi-Yi Dou, Anh Nguyen, Keen You, Omar Attia, Andrew Szot, Michael Feng, Ram Ramrakhya, Alexander Toshev, Chao Huang, Yinfei Yang, Zhe Gan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.17790
Pdf link: https://arxiv.org/pdf/2510.17790
Abstract Multimodal agents for computer use rely exclusively on primitive actions (click, type, scroll) that require accurate visual grounding and lengthy execution chains, leading to cascading failures and performance bottlenecks. While other agents leverage rich programmatic interfaces (APIs, MCP servers, tools), computer-use agents (CUAs) remain isolated from these capabilities. We present UltraCUA, a foundation model that bridges this gap through hybrid action -- seamlessly integrating GUI primitives with high-level programmatic tool calls. To achieve this, our approach comprises four key components: (1) an automated pipeline that scales programmatic tools from software documentation, open-source repositories, and code generation; (2) a synthetic data engine producing over 17,000 verifiable tasks spanning real-world computer-use scenarios; (3) a large-scale high-quality hybrid action trajectory collection with both low-level GUI actions and high-level programmatic tool calls; and (4) a two-stage training pipeline combining supervised fine-tuning with online reinforcement learning, enabling strategic alternation between low-level and high-level actions. Experiments with our 7B and 32B models demonstrate substantial improvements over state-of-the-art agents. On OSWorld, UltraCUA models achieve an average 22% relative improvement over base models, while being 11% faster in terms of steps. Out-of-domain evaluation on WindowsAgentArena shows our model reaches 21.7% success rate, outperforming baselines trained on Windows data. The hybrid action mechanism proves critical, reducing error propagation while maintaining execution efficiency.
中文摘要 用于计算机使用的多模态智能体完全依赖于原始操作（单击、键入、滚动），这些操作需要准确的视觉定位和冗长的执行链，从而导致级联故障和性能瓶颈。虽然其他智能体利用丰富的编程接口（API、MCP 服务器、工具），但计算机使用智能体（CUA）仍然与这些功能隔离。我们提出了 UltraCUA，这是一种基础模型，它通过混合动作弥合了这一差距，将 GUI 原语与高级编程工具调用无缝集成。为了实现这一目标，我们的方法包括四个关键组件：（1）从软件文档、开源存储库和代码生成中扩展编程工具的自动化管道;（2）一个合成数据引擎，生成超过 17,000 个跨越现实世界计算机使用场景的可验证任务;（3）同时包含低级 GUI 操作和高级程序化工具调用的大规模高质量混合动作轨迹集合;（4）将监督微调与在线强化学习相结合的两阶段训练管道，实现低级和高级动作之间的战略交替。我们的 7B 和 32B 模型的实验表明，与最先进的智能体相比有显着改进。在 OSWorld 上，UltraCUA 模型比基础模型平均相对提高了 22%，同时在步数方面快了 11%。对 WindowsAgentArena 的域外评估显示，我们的模型达到了 21.7% 的成功率，优于在 Windows 数据上训练的基线。事实证明，混合动作机制至关重要，可以减少错误传播，同时保持执行效率。

SoftMimic: Learning Compliant Whole-body Control from Examples

SoftMimic：从示例中学习合规全身控制

Authors: Gabriel B. Margolis, Michelle Wang, Nolan Fey, Pulkit Agrawal
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.17792
Pdf link: https://arxiv.org/pdf/2510.17792
Abstract We introduce SoftMimic, a framework for learning compliant whole-body control policies for humanoid robots from example motions. Imitating human motions with reinforcement learning allows humanoids to quickly learn new skills, but existing methods incentivize stiff control that aggressively corrects deviations from a reference motion, leading to brittle and unsafe behavior when the robot encounters unexpected contacts. In contrast, SoftMimic enables robots to respond compliantly to external forces while maintaining balance and posture. Our approach leverages an inverse kinematics solver to generate an augmented dataset of feasible compliant motions, which we use to train a reinforcement learning policy. By rewarding the policy for matching compliant responses rather than rigidly tracking the reference motion, SoftMimic learns to absorb disturbances and generalize to varied tasks from a single motion clip. We validate our method through simulations and real-world experiments, demonstrating safe and effective interaction with the environment.
中文摘要 我们介绍了 SoftMimic，这是一个从示例运动中学习人形机器人柔顺全身控制策略的框架。通过强化学习模仿人类动作可以让人形机器人快速学习新技能，但现有方法会激励刚性控制，激进地纠正与参考运动的偏差，从而在机器人遇到意外接触时导致脆弱和不安全的行为。相比之下，SoftMimic 使机器人能够在保持平衡和姿势的同时柔顺地响应外力。我们的方法利用逆运动学求解器来生成可行的柔顺运动的增强数据集，并用它来训练强化学习策略。通过奖励策略匹配柔顺响应，而不是严格跟踪参考运动，SoftMimic 学会了吸收干扰，并从单个运动片段泛化到不同的任务。我们通过仿真和真实世界实验验证了我们的方法，展示了与环境安全有效的交互。
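
The contrast between rigid tracking and rewarding a compliant response can be sketched in one dimension. The abstract does not give the reward, so everything here is an assumption for illustration: the spring-like compliance model (displacement = force / stiffness), the Gaussian-shaped reward, and the function names are not taken from SoftMimic, which generates its compliant targets with an inverse kinematics solver.

```python
import math


def rigid_tracking_reward(x: float, x_ref: float, sigma: float = 0.1) -> float:
    """Reward that peaks when the robot sits exactly on the reference position."""
    return math.exp(-((x - x_ref) / sigma) ** 2)


def compliant_reward(x: float, x_ref: float, force: float,
                     stiffness: float, sigma: float = 0.1) -> float:
    """Reward that peaks at a target shifted by the external force.

    Assumed model: the desired displacement is force / stiffness, as for a spring.
    """
    target = x_ref + force / stiffness
    return math.exp(-((x - target) / sigma) ** 2)
```

With `x_ref = 0.0`, `force = 10.0`, and `stiffness = 100.0`, the compliant target is 0.1. A robot that yields to that position earns the maximum compliant reward, while the rigid reward penalizes the same state, which is why a rigid objective pushes the policy toward stiff correction.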

Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains

基础自动评估器：扩展以推理为中心的领域的多任务生成评估器训练

Authors: Austin Xu, Xuan-Phi Nguyen, Yilun Zhou, Chien-Sheng Wu, Caiming Xiong, Shafiq Joty
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.17793
Pdf link: https://arxiv.org/pdf/2510.17793
Abstract Finetuning specialized generative evaluators has emerged as a popular paradigm to meet the increasing demand for scalable evaluation during both training and test-time. However, recent work has largely focused on applying new methodology, such as reinforcement learning (RL), to training evaluators, shying away from large-scale, data-driven development. In this work, we focus on data scaling, curating a set of 2.5M samples spanning five unique evaluation tasks (pairwise, step-level, reference-free and reference-based verification, and single rating) and multiple domains focused on reasoning evaluation. With our data, we train Foundational Automatic Reasoning Evaluators (FARE), a family of 8B and 20B (with 3.6B active) parameter evaluators, with a simple iterative rejection-sampling supervised finetuning (SFT) approach. FARE-8B challenges larger specialized RL-trained evaluators and FARE-20B sets the new standard for open-source evaluators, surpassing specialized 70B+ evaluators. Beyond static benchmarks, we evaluate FARE in real-world tasks: As inference-time rerankers, FARE-20B achieves near-oracle performance on MATH. As verifiers in RL training, FARE improves the downstream RL-trained model performance by up to 14.1% vs. string-matching verifiers. When initialized from FARE, a continually-finetuned FARE-Code outperforms gpt-oss-20B by 65% on evaluating test-case quality.
中文摘要 微调专用的生成式评估器已成为一种流行的范式，以满足训练和测试阶段对可扩展评估日益增长的需求。然而，最近的工作主要集中在将强化学习（RL）等新方法应用于评估器训练，而回避了大规模、数据驱动的开发。在这项工作中，我们专注于数据规模化，整理了一组 2.5M 样本，涵盖五个不同的评估任务（成对比较、步骤级、无参考和基于参考的验证以及单一评分）和多个专注于推理评估的领域。利用我们的数据，我们使用简单的迭代拒绝采样监督微调（SFT）方法训练基础自动推理评估器（FARE），这是一个由 8B 和 20B（其中 3.6B 为激活参数）参数评估器组成的系列。FARE-8B 可与更大的、经过 RL 训练的专用评估器相抗衡，而 FARE-20B 为开源评估器设定了新标准，超越了专用的 70B+ 评估器。除了静态基准之外，我们还在实际任务中评估 FARE：作为推理时重排序器，FARE-20B 在 MATH 上实现了接近 oracle 的性能。作为 RL 训练中的验证器，与字符串匹配验证器相比，FARE 将下游 RL 训练模型的性能最多提高了 14.1%。以 FARE 为初始化并持续微调得到的 FARE-Code 在评估测试用例质量方面比 gpt-oss-20B 高出 65%。
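
Two mechanisms named in the abstract, rejection sampling for SFT data and inference-time reranking, can be sketched as follows. This is a minimal illustration under stated assumptions, not the FARE pipeline: `toy_judge` is a hypothetical stand-in for a generative evaluator, the keep-one-per-prompt rule is an arbitrary choice, and an iterative version would retrain the judge on the kept samples and repeat.

```python
import random
from typing import Callable, List, Tuple

Example = Tuple[str, str]        # (prompt, gold verdict)
Kept = Tuple[str, str, str]      # (prompt, rationale, verdict)


def rejection_sample(examples: List[Example],
                     judge: Callable[[str, random.Random], Tuple[str, str]],
                     num_samples: int,
                     rng: random.Random) -> List[Kept]:
    """Keep at most one sampled judgment per prompt whose verdict matches the gold label."""
    kept = []
    for prompt, gold in examples:
        for _ in range(num_samples):
            rationale, verdict = judge(prompt, rng)
            if verdict == gold:
                kept.append((prompt, rationale, verdict))
                break  # accepted: move on to the next prompt
    return kept


def rerank(candidates: List[str], score: Callable[[str], float]) -> str:
    """Best-of-N selection: return the candidate the evaluator scores highest."""
    return max(candidates, key=score)


def toy_judge(prompt: str, rng: random.Random) -> Tuple[str, str]:
    """Hypothetical evaluator that picks response A or B at random."""
    verdict = rng.choice(["A", "B"])
    return f"I prefer {verdict} for: {prompt}", verdict
```

The kept tuples would serve as supervised fine-tuning targets, and `rerank` shows the evaluator's role at inference time, where any scoring function can be plugged in.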

Keyword: diffusion policy

Continuous Q-Score Matching: Diffusion Guided Reinforcement Learning for Continuous-Time Control

连续 Q 分数匹配：用于连续时间控制的扩散引导强化学习

Authors: Chengxiu Hua, Jiawen Gu, Yushun Tang
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2510.17122
Pdf link: https://arxiv.org/pdf/2510.17122
Abstract Reinforcement learning (RL) has achieved significant success across a wide range of domains; however, most existing methods are formulated in discrete time. In this work, we introduce a novel RL method for continuous-time control, where stochastic differential equations govern state-action dynamics. Departing from traditional value function-based approaches, our key contribution is the characterization of continuous-time Q-functions via a martingale condition and the linking of diffusion policy scores to the action gradient of a learned continuous Q-function by the dynamic programming principle. This insight motivates Continuous Q-Score Matching (CQSM), a score-based policy improvement algorithm. Notably, our method addresses a long-standing challenge in continuous-time RL: preserving the action-evaluation capability of Q-functions without relying on time discretization. We further provide theoretical closed-form solutions for linear-quadratic (LQ) control problems within our framework. Numerical results in simulated environments demonstrate the effectiveness of our proposed method and compare it to popular baselines.
中文摘要 强化学习（RL）在广泛的领域取得了巨大的成功，然而，大多数现有方法都是在离散时间下构建的。在这项工作中，我们引入了一种用于连续时间控制的新型 RL 方法，其中随机微分方程控制状态-动作动力学。与传统的基于价值函数的方法不同，我们的主要贡献是通过鞅条件刻画连续时间 Q 函数，并通过动态规划原理将扩散策略的分数与学习到的连续 Q 函数的动作梯度联系起来。这一洞见催生了连续 Q 分数匹配（CQSM），这是一种基于分数的策略改进算法。值得注意的是，我们的方法解决了连续时间 RL 中长期存在的挑战：在不依赖时间离散化的情况下保留 Q 函数的动作评估能力。我们进一步为框架内的线性二次（LQ）控制问题提供了理论闭式解。模拟环境中的数值结果证明了我们所提出方法的有效性，并将其与流行的基线进行了比较。
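
The link between a policy score and the action gradient of a Q-function can be illustrated with a toy example. This is not the CQSM algorithm, whose exact relation the abstract does not state. The sketch assumes a Boltzmann policy pi(a|s) proportional to exp(Q(s, a) / temperature), for which the score grad_a log pi equals grad_a Q / temperature, and a hypothetical quadratic Q(s, a) = -0.5 * (a - gain * s)^2. Sampling uses plain unadjusted Langevin dynamics in discrete steps.

```python
import math
import random


def q_action_gradient(state: float, action: float, gain: float) -> float:
    """Gradient in the action of the toy Q(s, a) = -0.5 * (a - gain * s) ** 2."""
    return -(action - gain * state)


def sample_action(state: float, gain: float, temperature: float,
                  rng: random.Random, steps: int = 200,
                  step_size: float = 0.01) -> float:
    """Draw one action with unadjusted Langevin dynamics driven by the score."""
    action = 0.0
    for _ in range(steps):
        # Score of the assumed Boltzmann policy: grad_a Q / temperature.
        score = q_action_gradient(state, action, gain) / temperature
        noise = rng.gauss(0.0, 1.0)
        action += step_size * score + math.sqrt(2.0 * step_size) * noise
    return action


rng = random.Random(0)
samples = [sample_action(1.5, 2.0, 0.1, rng) for _ in range(500)]
mean_action = sum(samples) / len(samples)
```

Under these assumptions the samples concentrate around `gain * state`, the action that maximizes the toy Q-function, so following the score moves the policy toward higher-value actions.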