Arxiv Papers of Today

Generated: 2025-10-22 16:33:10 (UTC+8); arXiv release time: 2025-10-22 20:00 EDT (2025-10-23 08:00 UTC+8)

46 relevant papers today

Keyword: reinforcement learning

Quantum-Driven State-Reduction for Reliable UAV Trajectory Optimization in Low-Altitude Networks

量子驱动状态还原，实现低空网络中可靠的无人机轨迹优化

Authors: Zeeshan Kaleem, Muhammad Afaq, Chau Yuen, Octavia A. Dobre, John M. Cioffi
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2510.17861
Pdf link: https://arxiv.org/pdf/2510.17861
Abstract This letter introduces a Graph-Condensed Quantum-Inspired Placement (GC-QAP) framework for reliability-driven trajectory optimization in Uncrewed Aerial Vehicle (UAV) assisted low-altitude wireless networks. The dense waypoint graph is condensed using probabilistic quantum-annealing to preserve interference-aware centroids while reducing the control state space and maintaining link quality. The resulting problem is formulated as a priority-aware Markov decision process and solved using epsilon-greedy off-policy Q-learning, considering UAV kinematic and flight corridor constraints. Unlike complex continuous-action reinforcement learning approaches, GC-QAP achieves stable convergence and low outage with substantially lower computational cost compared to baseline schemes.
中文摘要 这封信介绍了一个图浓缩量子启发放置（GC-QAP）框架，用于无人机（UAV）辅助低空无线网络中的可靠性驱动轨迹优化。使用概率量子退火对密集航路点图进行压缩，以保留干扰感知质心，同时减少控制状态空间并保持链路质量。将所得问题表述为优先级感知马尔可夫决策过程，并使用ε贪婪的非策略Q学习求解，同时考虑无人机运动学和飞行走廊约束。与复杂的连续动作强化学习方法不同，与基线方案相比，GC-QAP以大幅降低的计算成本实现了稳定的收敛和低中断。
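The abstract above names epsilon-greedy off-policy Q-learning as the solver. As a rough illustration of that generic technique only, here is a minimal tabular sketch on a hypothetical 1-D chain environment. The chain, the reward values, and the hyperparameters are assumptions made for this example; the sketch does not reproduce the GC-QAP graph condensation, the priority-aware MDP, or the UAV constraints.

```python
import random


def q_learning(n_states=6, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    """Tabular epsilon-greedy Q-learning on a toy chain (hypothetical environment)."""
    rng = random.Random(seed)
    moves = (-1, +1)  # action 0 = step left, action 1 = step right
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        for _ in range(100):  # cap the episode length
            # Behaviour policy: explore with probability epsilon, else act greedily.
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = max(range(2), key=lambda i: q[s][i])
            s_next = min(max(s + moves[a], 0), goal)
            done = s_next == goal
            r = 1.0 if done else -0.01  # assumed reward: small step cost, bonus at goal
            # Off-policy target: bootstrap from the greedy value of the next state.
            target = r if done else r + gamma * max(q[s_next])
            q[s][a] += alpha * (target - q[s][a])
            s = s_next
            if done:
                break
    return q


def greedy_policy(q):
    """Index of the highest-valued action in every state."""
    return [max(range(2), key=lambda i: row[i]) for row in q]


if __name__ == "__main__":
    print(greedy_policy(q_learning()))
```

The update is off-policy because the target uses the maximum over next-state actions, regardless of which action the epsilon-greedy behaviour policy takes next.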

DRL-Based Resource Allocation for Energy-Efficient IRS-Assisted UAV Spectrum Sharing Systems

基于DRL的节能IRS辅助无人机频谱共享系统的资源分配

Authors: Yiheng Wang
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Arxiv link: https://arxiv.org/abs/2510.17877
Pdf link: https://arxiv.org/pdf/2510.17877
Abstract Intelligent reflecting surface (IRS) assisted unmanned aerial vehicle (UAV) systems provide a new paradigm for reconfigurable and flexible wireless communications. To enable more energy efficient and spectrum efficient IRS assisted UAV wireless communications, this paper introduces a novel IRS-assisted UAV enabled spectrum sharing system with orthogonal frequency division multiplexing (OFDM). The goal is to maximize the energy efficiency (EE) of the secondary network by jointly optimizing the beamforming, subcarrier allocation, IRS phase shifts, and the UAV trajectory subject to practical transmit power and passive reflection constraints as well as UAV physical limitations. A physically grounded propulsion-energy model is adopted, with its tight upper bound used to form a tractable EE lower bound for the spectrum sharing system. To handle highly non convex, time coupled optimization problems with a mixed continuous and discrete policy space, we develop a deep reinforcement learning (DRL) approach based on the actor critic framework. Extended experiments show the significant EE improvement of the proposed DRL-based approach compared to several benchmark schemes, thus demonstrating the effectiveness and robustness of the proposed approach with mobility.
中文摘要 智能反射面（IRS）辅助无人机（UAV）系统为可重构和灵活的无线通信提供了新范例。为了实现更节能、更高效的IRS辅助无人机无线通信，本文介绍了一种新型的具有正交频分复用（OFDM）的IRS辅助无人机频谱共享系统。目标是通过共同优化波束成形、副载波分配、IRS 相移和无人机轨迹，从而最大限度地提高辅助网络的能效（EE），并受到实际发射功率和无源反射约束以及无人机物理限制。采用物理接地推进能量模型，其紧密的上限用于形成频谱共享系统的可处理的EE下限。为了处理具有连续和离散混合策略空间的高度非凸、时间耦合优化问题，我们开发了一种基于参与者批评框架的深度强化学习（DRL）方法。扩展实验表明，与几种基准方案相比，所提出的基于DRL的方法的EE得到了显着的改进，从而证明了所提出的方法在移动性的有效性和鲁棒性。

POPI: Personalizing LLMs via Optimized Natural Language Preference Inference

POPI：通过优化的自然语言偏好推理个性化 LLM

Authors: Yizhuo Chen, Xin Liu, Ruijie Wang, Zheng Li, Pei Chen, Changlong Yu, Priyanka Nigam, Meng Jiang, Bing Yin
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.17881
Pdf link: https://arxiv.org/pdf/2510.17881
Abstract Large language models (LLMs) achieve strong benchmark performance, yet user experiences remain inconsistent due to diverse preferences in style, tone, and reasoning mode. Nevertheless, existing alignment techniques such as reinforcement learning from human feedback (RLHF) or Direct Preference Optimization (DPO) largely optimize toward population-level averages and overlook individual variation. Naive personalization strategies like per-user fine-tuning are computationally prohibitive, and in-context approaches that prepend raw user signals often suffer from inefficiency and noise. To address these challenges, we propose POPI, a general framework that introduces a preference inference model to distill heterogeneous user signals into concise natural language summaries. These summaries act as transparent, compact, and transferable personalization representations that condition a shared generation model to produce personalized responses. POPI jointly optimizes both preference inference and personalized generation under a unified objective using reinforcement learning, ensuring summaries maximally encode useful preference information. Extensive experiments across four personalization benchmarks demonstrate that POPI consistently improves personalization accuracy while reducing context overhead by a large margin. Moreover, optimized summaries seamlessly transfer to frozen off-the-shelf LLMs, enabling plug-and-play personalization without weight updates.
中文摘要 大型语言模型（LLM）实现了强大的基准性能，但由于风格、语气和推理模式的不同偏好，用户体验仍然不一致。然而，现有的对齐技术，如人类反馈强化学习（RLHF）或直接偏好优化（DPO），在很大程度上针对人群水平的平均值进行了优化，而忽略了个体差异。像每个用户微调这样的朴素个性化策略在计算上是令人望而却步的，并且在原始用户信号之前添加上下文方法通常会受到效率低下和噪音的影响。为了应对这些挑战，我们提出了 POPI，这是一个通用框架，它引入了偏好推理模型，将异构用户信号提炼成简洁的自然语言摘要。这些摘要充当透明、紧凑且可转移的个性化表示形式，使共享生成模型能够产生个性化响应。POPI 使用强化学习在统一目标下共同优化偏好推理和个性化生成，确保摘要最大限度地编码有用的偏好信息。跨四个个性化基准的广泛实验表明，POPI 持续提高个性化准确性，同时大幅减少上下文开销。此外，优化的摘要可以无缝传输到冻结的现成法学硕士，无需更新权重即可实现即插即用的个性化。

TritonRL: Training LLMs to Think and Code Triton Without Cheating

TritonRL：训练法学硕士在不作弊的情况下思考和编码 Triton

Authors: Jiin Woo, Shaowei Zhu, Allen Nie, Zhen Jia, Yida Wang, Youngsuk Park
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.17891
Pdf link: https://arxiv.org/pdf/2510.17891
Abstract With the rapid evolution of large language models (LLMs), the demand for automated, high-performance system kernels has emerged as a key enabler for accelerating development and deployment. We introduce TritonRL, a domain-specialized LLM for Triton kernel generation, trained with a novel training framework that enables robust and automated kernel synthesis. Unlike general-purpose programming languages, Triton kernel generation faces unique challenges due to data scarcity and incomplete evaluation criteria, vulnerable to reward hacking. Our approach addresses these challenges end-to-end by distilling Triton-specific knowledge through supervised fine-tuning on curated datasets, and further improving code quality via reinforcement learning (RL) with robust, verifiable rewards and hierarchical reward assignment. Our RL framework robustly detects reward hacking and guides both reasoning traces and code tokens through fine-grained verification and hierarchical reward decomposition, enabling the model to generate high-quality Triton kernels that can truly replace existing modules. With robust and fine-grained evaluation, our experiments on KernelBench demonstrate that TritonRL achieves state-of-the-art correctness and speedup, surpassing all other Triton-specific models and underscoring the effectiveness of our RL-based training paradigm.
中文摘要 随着大型语言模型（LLM）的快速发展，对自动化、高性能系统内核的需求已成为加速开发和部署的关键推动因素。我们介绍了 TritonRL，这是一种用于 Triton 内核生成的领域专用 LLM，使用新颖的训练框架进行训练，可实现稳健和自动化的内核合成。与通用编程语言不同，Triton 内核生成由于数据稀缺和评估标准不完整而面临独特的挑战，容易受到奖励黑客攻击。我们的方法通过对精选数据集进行监督微调来提炼 Triton 特定知识，并通过具有稳健、可验证的奖励和分层奖励分配的强化学习（RL）进一步提高代码质量，从而端到端地应对这些挑战。我们的 RL 框架可以稳健地检测奖励黑客攻击，并通过细粒度验证和分层奖励分解来指导推理跟踪和代码令牌，使模型能够生成能够真正替代现有模块的高质量 Triton 内核。通过稳健和细粒度的评估，我们在 KernelBench 上的实验表明，TritonRL 实现了最先进的正确性和加速，超越了所有其他 Triton 特定模型，并强调了我们基于 RL 的训练范式的有效性。

Self-Evidencing Through Hierarchical Gradient Decomposition: A Dissipative System That Maintains Non-Equilibrium Steady-State by Minimizing Variational Free Energy

通过分层梯度分解进行自我证明：通过最小化变分自由能来维持非平衡稳态的耗散系统

Authors: Michael James McCulloch
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Arxiv link: https://arxiv.org/abs/2510.17916
Pdf link: https://arxiv.org/pdf/2510.17916
Abstract The Free Energy Principle (FEP) states that self-organizing systems must minimize variational free energy to persist, but the path from principle to implementable algorithm has remained unclear. We present a constructive proof that the FEP can be realized through exact local credit assignment. The system decomposes gradient computation hierarchically: spatial credit via feedback alignment, temporal credit via eligibility traces, and structural credit via a Trophic Field Map (TFM) that estimates expected gradient magnitude for each connection block. We prove these mechanisms are exact at their respective levels and validate the central claim empirically: the TFM achieves 0.9693 Pearson correlation with oracle gradients. This exactness produces emergent capabilities including 98.6% retention after task interference, autonomous recovery from 75% structural damage, self-organized criticality (spectral radius $\rho \approx 1.0$), and sample-efficient reinforcement learning on continuous control tasks without replay buffers. The architecture unifies Prigogine's dissipative structures, Friston's free energy minimization, and Hopfield's attractor dynamics, demonstrating that exact hierarchical inference over network topology can be implemented with local, biologically plausible rules.
中文摘要 自由能原理（FEP）指出，自组织系统必须最小化变分自由能才能持续存在，但从原理到可实现算法的路径仍不清楚。我们提出了一个建设性的证据，证明 FEP 可以通过精确的本地信用分配来实现。该系统按层次分解梯度计算：通过反馈对齐获得空间信用，通过资格轨迹获得时间信用，通过估计每个连接块的预期梯度大小的营养场图（TFM）获得结构信用。我们证明了这些机制在各自的水平上是精确的，并凭证验证了中心主张：TFM 与预言机梯度达到了 0.9693 皮尔逊相关性。这种精确性产生了紧急能力，包括任务干扰后 98.6% 的保留率、从 75% 的结构损坏中自主恢复、自组织临界性（光谱半径 p ~= 1.0$）以及在没有重放缓冲区的情况下对连续控制任务进行样本效率强化学习。该架构统一了 Prigogine 的耗散结构、Friston 的自由能最小化和 Hopfield 的吸引子动力学，证明可以通过局部的、生物学上合理的规则来实现对网络拓扑的精确分层推理。

CLAWS: Creativity detection for LLM-generated solutions using Attention Window of Sections

CLAWS：使用部分注意力窗口对 LLM 生成的解决方案进行创造力检测

Authors: Keuntae Kim, Eunhye Jeong, Sehyeon Lee, Seohee Yoon, Yong Suk Choi
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.17921
Pdf link: https://arxiv.org/pdf/2510.17921
Abstract Recent advances in enhancing the reasoning ability of large language models (LLMs) have been remarkably successful. LLMs trained with reinforcement learning (RL) for reasoning demonstrate strong performance in challenging tasks such as mathematics and coding, even with relatively small model sizes. However, despite these improvements in task accuracy, the assessment of creativity in LLM generations has been largely overlooked in reasoning tasks, in contrast to writing tasks. The lack of research on creativity assessment in reasoning primarily stems from two challenges: (1) the difficulty of defining the range of creativity, and (2) the necessity of human evaluation in the assessment process. To address these challenges, we propose CLAWS, a method that defines and classifies mathematical solutions into typical, creative, and hallucinated categories without human evaluation, by leveraging attention weights across prompt sections and output. CLAWS outperforms five existing white-box detection methods (Perplexity, Logit Entropy, Window Entropy, Hidden Score, and Attention Score) on five 7-8B math RL models (DeepSeek, Qwen, Mathstral, OpenMath2, and Oreal). We validate CLAWS on 4545 math problems collected from 181 math contests (AJHSME, AMC, AIME).
中文摘要 最近在增强大型语言模型（LLM）推理能力方面取得了巨大成功。使用强化学习（RL）进行推理训练的 LLM 在数学和编码等具有挑战性的任务中表现出强大的性能，即使模型大小相对较小。然而，尽管任务准确性有了这些提高，但与写作任务相比，LLM 世代对创造力的评估在推理任务中在很大程度上被忽视了。推理中创造力评估研究的缺失主要源于两个挑战：（1）创造力范围的定义困难，以及（2）评估过程中人工评估的必要性。为了应对这些挑战，我们提出了 CLAWS，这是一种通过利用提示部分和输出的注意力权重，将数学解决方案定义和分类为典型、创造性和幻觉类别的方法，无需人工评估。在五个 7-8B 数学 RL 模型（DeepSeek、Qwen、Mathstral、OpenMath2 和 Oreal）上，CLAWS 优于现有的五种白盒检测方法（困惑度、Logit 熵、窗口熵、隐藏分数和注意力分数）。我们验证了从 181 场数学竞赛（AJHSME、AMC、AIME）中收集的 4545 个数学问题的 CLAWS。

Rewarding the Journey, Not Just the Destination: A Composite Path and Answer Self-Scoring Reward Mechanism for Test-Time Reinforcement Learning

奖励旅程，而不仅仅是目的地：测试时间强化学习的复合路径和答案自评分奖励机制

Authors: Chenwei Tang, Jingyu Xing, Xinyu Liu, Wei Ju, Jiancheng Lv, Deng Xiong, Ziyue Qiao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.17923
Pdf link: https://arxiv.org/pdf/2510.17923
Abstract Reinforcement Learning (RL) has emerged as a powerful paradigm for advancing Large Language Models (LLMs), achieving remarkable performance in complex reasoning domains such as mathematics and code generation. However, current RL methods face a fundamental scalability bottleneck due to their heavy reliance on human-curated preference data or labeled datasets for reward modeling. To overcome this limitation, we explore RL on unlabeled data where models learn autonomously from continuous experience streams. The core challenge in this setting lies in reliable reward estimation without ground-truth supervision. Existing approaches like Test-Time RL address this through self-consistent consensus, but risk reinforcing incorrect pseudo-labels derived from majority voting. We introduce COMPASS (Composite Path and Answer Self-Scoring), a novel test-time reward mechanism that operates without external supervision. COMPASS integrates two complementary components: the Dual-Calibration Answer Reward (DCAR), which stabilizes training by establishing trustworthy pseudo-labels through confidence and credibility calibration, and the Decisive Path Reward (DPR), which directly optimizes the reasoning process quality beyond mere outcome supervision. By jointly reinforcing trustworthy consensus answers and highly decisive reasoning chains, COMPASS systematically enhances the model's analytical capabilities. Extensive experiments show that COMPASS achieves significant and consistent performance gains across diverse reasoning tasks and model architectures, advancing a more scalable direction for LLMs to learn from continuous experience.
中文摘要 强化学习（RL）已成为推进大型语言模型（LLM）的强大范式，在数学和代码生成等复杂推理领域取得了卓越的性能。然而，当前的 RL 方法由于严重依赖人类策划的偏好数据或标记数据集进行奖励建模，因此面临着根本的可扩展性瓶颈。为了克服这一限制，我们探索了未标记数据的 RL，其中模型从连续的体验流中自主学习。这种设置的核心挑战在于在没有地面实况监督的情况下进行可靠的奖励估计。像 Test-Time RL 这样的现有方法通过自洽的共识来解决这个问题，但可能会强化从多数投票中得出的不正确的伪标签。我们介绍了 COMPASS（复合路径和答案自评分），这是一种无需外部监督即可运行的新型考试时间奖励机制。COMPASS 集成了两个互补的组件：双校准答案奖励（DCAR），通过置信度和可信度校准建立可信的伪标签来稳定训练，以及决定性路径奖励（DPR），它直接优化了推理过程质量，而不仅仅是结果监督。通过共同强化可信的共识答案和高度决定性的推理链，COMPASS系统地增强了模型的分析能力。广泛的实验表明，COMPASS 在不同的推理任务和模型架构中实现了显着且一致的性能提升，为法学硕士从持续经验中学习提供了更具可扩展性的方向。
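The abstract contrasts COMPASS with the majority-voting pseudo-labels used by Test-Time RL. A minimal sketch of that baseline idea is given below, assuming each sampled response has already been reduced to a final answer string. The function name and the vote-share output are illustrative choices for this example, and the sketch does not implement DCAR or DPR.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Use the most frequent sampled answer as a pseudo-label and reward agreement.

    Returns (pseudo_label, vote_share, rewards). A low vote share signals a weak
    consensus, which is where a wrong pseudo-label is most likely to be reinforced.
    """
    counts = Counter(answers)
    pseudo_label, votes = counts.most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, votes / len(answers), rewards


if __name__ == "__main__":
    # Five hypothetical sampled answers to the same unlabeled question.
    print(majority_vote_rewards(["42", "42", "41", "42", "7"]))
```

If the majority answer is wrong, every response that agrees with it still receives full reward, which is the failure mode the abstract says the calibrated answer reward is meant to reduce.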

EvoSyn: Generalizable Evolutionary Data Synthesis for Verifiable Learning

EvoSyn：用于可验证学习的可推广进化数据合成

Authors: He Du, Bowen Li, Aijun Yang, Siyang He, Qipeng Guo, Dacheng Tao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/2510.17928
Pdf link: https://arxiv.org/pdf/2510.17928
Abstract Reliable verifiable data has become a key driver of capability gains in modern language models, enabling stable reinforcement learning with verifiable rewards and effective distillation that transfers competence across math, coding, and agentic tasks. Yet constructing generalizable synthetic verifiable data remains difficult due to hallucination-prone generation, and weak or trivial verification artifacts that fail to separate strong from weak solutions. Existing approaches often rely on task-specific heuristics or post-hoc filters that do not transfer across domains and lack a principled, universal evaluator of verifiability. In this work, we introduce an evolutionary, task-agnostic, strategy-guided, executably-checkable data synthesis framework that, from minimal seed supervision, jointly synthesizes problems, diverse candidate solutions, and verification artifacts, and iteratively discovers strategies via a consistency-based evaluator that enforces agreement between human-annotated and strategy-induced checks. This pipeline upgrades filtering into principled synthesis: it reliably assembles coherent, verifiable training instances and generalizes without domain-specific rules. Our experiments demonstrate the effectiveness of the proposed approach under both RLVR and model distillation training paradigms. The results show that training with our synthesized data yields significant improvements on both the LiveCodeBench and AgentBench-OS tasks, highlighting the robust generalization of our framework.
中文摘要 可靠的可验证数据已成为现代语言模型能力提升的关键驱动力，通过可验证的奖励和有效的提炼实现稳定的强化学习，从而将能力转移到数学、编码和代理任务中。然而，由于容易产生幻觉，以及无法区分强解和弱解的弱或微不足道的验证伪影，构建可推广的合成可验证数据仍然很困难。现有方法通常依赖于特定于任务的启发式方法或事后过滤器，这些方法不会跨领域转移，并且缺乏有原则的、通用的可验证性评估器。在这项工作中，我们引入了一个进化的、与任务无关的、策略指导的、可执行的可执行检查的数据合成框架，该框架从最少的种子监督开始，共同综合问题、多样化的候选解决方案和验证工件，并通过基于一致性的评估器迭代地发现策略，该评估器强制执行人工注释和策略诱导的检查之间的一致性。该管道将过滤升级为有原则的综合：它可靠地组装了连贯的、可验证的训练实例，并在没有特定于领域的规则的情况下进行泛化。我们的实验证明了所提出的方法在RLVR和模型蒸馏训练范式下的有效性。结果表明，使用我们的合成数据进行训练可以在 LiveCodeBench 和 AgentBench-OS 任务上产生显着改进，凸显了我们框架的稳健泛化。

UniRL-Zero: Reinforcement Learning on Unified Models with Joint Language Model and Diffusion Model Experts

UniRL-Zero：联合语言模型和扩散模型专家对统一模型进行强化学习

Authors: Fu-Yun Wang, Han Zhang, Michael Gharbi, Hongsheng Li, Taesung Park
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.17937
Pdf link: https://arxiv.org/pdf/2510.17937
Abstract We present UniRL-Zero, a unified reinforcement learning (RL) framework that boosts multimodal language model understanding and reasoning, diffusion model multimedia generation, and their beneficial interaction capabilities within a unified model. Our work defines six scenarios for unified model reinforcement learning, providing systematic baselines for reinforcement learning of unified understanding and generation models. Our code is available at this https URL.
中文摘要 我们提出了UniRL-Zero，这是一个统一的强化学习（RL）框架，它增强了多模态语言模型的理解和推理、扩散模型多媒体生成及其在统一模型中的有益交互能力。我们的工作定义了统一模型强化学习的六种场景，为统一理解和生成模型的强化学习提供了系统的基线。我们的代码可在此 https URL 中找到。

Humanoid Goalkeeper: Learning from Position Conditioned Task-Motion Constraints

人形守门员：从位置条件任务运动约束中学习

Authors: Junli Ren, Junfeng Long, Tao Huang, Huayi Wang, Zirui Wang, Feiyu Jia, Wentao Zhang, Jingbo Wang, Ping Luo, Jiangmiao Pang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.18002
Pdf link: https://arxiv.org/pdf/2510.18002
Abstract We present a reinforcement learning framework for autonomous goalkeeping with humanoid robots in real-world scenarios. While prior work has demonstrated similar capabilities on quadrupedal platforms, humanoid goalkeeping introduces two critical challenges: (1) generating natural, human-like whole-body motions, and (2) covering a wider guarding range with an equivalent response time. Unlike existing approaches that rely on separate teleoperation or fixed motion tracking for whole-body control, our method learns a single end-to-end RL policy, enabling fully autonomous, highly dynamic, and human-like robot-object interactions. To achieve this, we integrate multiple human motion priors conditioned on perceptual inputs into the RL training via an adversarial scheme. We demonstrate the effectiveness of our method through real-world experiments, where the humanoid robot successfully performs agile, autonomous, and naturalistic interceptions of fast-moving balls. In addition to goalkeeping, we demonstrate the generalization of our approach through tasks such as ball escaping and grabbing. Our work presents a practical and scalable solution for enabling highly dynamic interactions between robots and moving objects, advancing the field toward more adaptive and lifelike robotic behaviors.
中文摘要 我们提出了一个强化学习框架，用于在现实场景中使用人形机器人进行自主守门。虽然之前的工作已经在四足平台上展示了类似的能力，但人形守门员带来了两个关键挑战：（1）产生自然的、类似人类的全身运动，以及（2）以相同的响应时间覆盖更广泛的防守范围。与依赖单独的远程作或固定运动跟踪进行全身控制的现有方法不同，我们的方法学习单一的端到端 RL 策略，从而实现完全自主、高度动态和类人的机器人-物体交互。为了实现这一目标，我们通过对抗方案将以感知输入为条件的多个人体运动先验整合到 RL 训练中。我们通过真实世界的实验证明了我们方法的有效性，人形机器人成功地对快速移动的球进行了敏捷、自主和自然的拦截。除了守门员之外，我们还通过球逃脱和抢球等任务来展示我们方法的推广。我们的工作提出了一种实用且可扩展的解决方案，用于实现机器人和移动物体之间的高度动态交互，推动该领域朝着更具适应性和逼真的机器人行为发展。

OPTAGENT: Optimizing Multi-Agent LLM Interactions Through Verbal Reinforcement Learning for Enhanced Reasoning

OPTAGENT：通过语言强化学习优化多智能体 LLM 交互以增强推理

Authors: Zhenyu Bi, Meng Lu, Yang Li, Swastik Roy, Weijie Guan, Morteza Ziyadi, Xuan Wang
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2510.18032
Pdf link: https://arxiv.org/pdf/2510.18032
Abstract Large Language Models (LLMs) have shown remarkable reasoning capabilities in mathematical and scientific tasks. To enhance complex reasoning, multi-agent systems have been proposed to harness the collective intelligence of LLM agents. However, existing collaboration structures are either predefined or rely on majority voting or round-table debates, which can suppress correct but less dominant agent contributions. Recent approaches model multi-agent systems as graph networks but optimize purely for agent performance, neglecting the quality of interactions. We hypothesize that effective agent communication is crucial for multi-agent reasoning and that debating quality plays a significant role. To address this, we propose OPTAGENT, a multi-agent verbal reinforcement learning algorithm that dynamically constructs and refines multi-agent collaboration structures. Our method defines action spaces and a feedback mechanism that evaluates communication robustness and coherence throughout the debate. The final decision is achieved through a majority vote over all the agents. We assess OPTAGENT on various reasoning tasks, including mathematical reasoning, creative writing, scientific reasoning, and numerical sorting. Results demonstrate that our approach significantly outperforms single-agent prompting methods and state-of-the-art multi-agent frameworks on diverse tasks.
中文摘要 大型语言模型（LLM）在数学和科学任务中表现出了卓越的推理能力。为了增强复杂推理，已经提出了多智能体系统来利用 LLM 智能体的集体智慧。然而，现有的协作结构要么是预定义的，要么依赖于多数投票或圆桌辩论，这可能会抑制正确但不太占主导地位的代理人贡献。最近的方法将多智能体系统建模为图网络，但纯粹针对智能体性能进行优化，而忽略了交互的质量。我们假设有效的代理沟通对于多代理推理至关重要，辩论质量起着重要作用。为了解决这个问题，我们提出了 $\ours$，这是一种多智能体语言强化学习算法，可以动态构建和完善多智能体协作结构。我们的方法定义了行动空间和反馈机制，用于评估整个辩论过程中的沟通稳健性和连贯性。最终决定是通过对所有代理人的多数票来实现的。我们评估各种推理任务的 $\ours$，包括数学推理、创意写作、科学推理和数字排序。结果表明，我们的方法在不同任务上明显优于单智能体提示方法和最先进的多智能体框架。

Adaptive Divergence Regularized Policy Optimization for Fine-tuning Generative Models

用于微调生成模型的自适应散度正则化策略优化

Authors: Jiajun Fan, Tong Wei, Chaoran Cheng, Yuxin Chen, Ge Liu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.18053
Pdf link: https://arxiv.org/pdf/2510.18053
Abstract Balancing exploration and exploitation during reinforcement learning fine-tuning of generative models presents a critical challenge, as existing approaches rely on fixed divergence regularization that creates an inherent dilemma: strong regularization preserves model capabilities but limits reward optimization, while weak regularization enables greater alignment but risks instability or reward hacking. We introduce Adaptive Divergence Regularized Policy Optimization (ADRPO), which automatically adjusts regularization strength based on advantage estimates, reducing regularization for high-value samples while applying stronger regularization to poor samples, so that policies can navigate between exploration and aggressive exploitation according to data quality. Our implementation with Wasserstein-2 regularization for flow matching generative models achieves remarkable results on text-to-image generation, achieving better semantic alignment and diversity than offline methods like DPO and online methods with fixed regularization like ORW-CFM-W2. ADRPO enables a 2B parameter SD3 model to surpass much larger models with 4.8B and 12B parameters in attribute binding, semantic consistency, artistic style transfer, and compositional control while maintaining generation diversity. ADRPO generalizes to KL-regularized fine-tuning of both text-only LLMs and multi-modal reasoning models, enhancing existing online RL methods like GRPO. In LLM fine-tuning, ADRPO demonstrates an emergent ability to escape local optima through active exploration, while in multi-modal audio reasoning, it outperforms GRPO through superior step-by-step reasoning, enabling a 7B model to outperform substantially larger commercial models including Gemini 2.5 Pro and GPT-4o Audio, offering an effective plug-and-play solution to the exploration-exploitation challenge across diverse generative architectures and modalities.
中文摘要 在强化学习过程中平衡探索和利用生成模型的微调提出了一个关键挑战，因为现有方法依赖于固定发散正则化，这造成了固有的困境：强正则化保留了模型功能，但限制了奖励优化，而弱正则化可以实现更大的对齐，但存在不稳定或奖励黑客攻击的风险。我们引入了自适应发散正则化策略优化（ADRPO），它根据优势估计自动调整正则化强度——减少高价值样本的正则化，同时对较差的样本应用更强的正则化，使策略能够根据数据质量在探索和积极利用之间导航。我们使用 Wasserstein-2 正则化实现流匹配生成模型，在文本到图像生成方面取得了显着的成果，与 DPO 等离线方法和 ORW-CFM-W2 等固定正则化在线方法相比，实现了更好的语义对齐和多样性。ADRPO使2B参数SD3模型在属性绑定、语义一致性、艺术风格迁移和构图控制方面超越了具有4.8B和12B参数的更大模型，同时保持了生成多样性。ADRPO 推广到纯文本 LLM 和多模态推理模型的 KL 正则化微调，增强了 GRPO 等现有在线 RL 方法。在 LLM 微调中，ADRPO 展示了通过主动探索逃避局部最优度的新兴能力，而在多模态音频推理中，它通过卓越的分步推理优于 GRPO，使 7B 模型的性能优于包括 Gemini 2.5 Pro 和 GPT-4o Audio 在内的更大的商业模型，为跨不同生成架构和模态的探索开发挑战提供了有效的即插即用解决方案。

SPACeR: Self-Play Anchoring with Centralized Reference Models

SPACeR：使用集中式参考模型进行自我定位

Authors: Wei-Jer Chang, Akshay Rangesh, Kevin Joseph, Matthew Strong, Masayoshi Tomizuka, Yihan Hu, Wei Zhan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.18060
Pdf link: https://arxiv.org/pdf/2510.18060
Abstract Developing autonomous vehicles (AVs) requires not only safety and efficiency, but also realistic, human-like behaviors that are socially aware and predictable. Achieving this requires sim agent policies that are human-like, fast, and scalable in multi-agent settings. Recent progress in imitation learning with large diffusion-based or tokenized models has shown that behaviors can be captured directly from human driving data, producing realistic policies. However, these models are computationally expensive, slow during inference, and struggle to adapt in reactive, closed-loop scenarios. In contrast, self-play reinforcement learning (RL) scales efficiently and naturally captures multi-agent interactions, but it often relies on heuristics and reward shaping, and the resulting policies can diverge from human norms. We propose SPACeR, a framework that leverages a pretrained tokenized autoregressive motion model as a centralized reference policy to guide decentralized self-play. The reference model provides likelihood rewards and KL divergence, anchoring policies to the human driving distribution while preserving RL scalability. Evaluated on the Waymo Sim Agents Challenge, our method achieves competitive performance with imitation-learned policies while being up to 10x faster at inference and 50x smaller in parameter size than large generative models. In addition, we demonstrate in closed-loop ego planning evaluation tasks that our sim agents can effectively measure planner quality with fast and scalable traffic simulation, establishing a new paradigm for testing autonomous driving policies.
中文摘要 开发自动驾驶汽车（AV）不仅需要安全性和效率，还需要具有社会意识和可预测的现实、类人行为。实现这一目标需要在多代理设置中类似于人类、快速且可扩展的 sim 代理策略。使用基于大型扩散或标记化模型的模仿学习的最新进展表明，可以直接从人类驾驶数据中捕获行为，从而产生现实的策略。然而，这些模型计算成本高昂，推理速度慢，并且难以适应反应性闭环场景。相比之下，自我游戏强化学习（RL）可以高效扩展并自然地捕捉多智能体交互，但它通常依赖于启发式和奖励塑造，由此产生的策略可能偏离人类规范。我们提出了 SPACeR，这是一个利用预训练的代币化自回归运动模型作为中心化参考策略来指导去中心化自我游戏的框架。参考模型提供似然奖励和KL散度，将策略锚定到人类驾驶分布，同时保持RL可扩展性。在 Waymo Sim 代理挑战赛中进行了评估，我们的方法通过模仿学习的策略实现了具有竞争力的性能，同时与大型生成模型相比，推理速度提高了 10 倍，参数大小小了 50 倍。此外，我们在闭环自我规划评估任务中证明，我们的模拟代理可以通过快速、可扩展的交通模拟来有效衡量规划器质量，为测试自动驾驶策略建立了新的范式。

R2L: Reliable Reinforcement Learning: Guaranteed Return & Reliable Policies in Reinforcement Learning

R2L：可靠的强化学习：强化学习中的保证回报和可靠策略

Authors: Nadir Farhi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2510.18074
Pdf link: https://arxiv.org/pdf/2510.18074
Abstract In this work, we address the problem of determining reliable policies in reinforcement learning (RL), with a focus on optimization under uncertainty and the need for performance guarantees. While classical RL algorithms aim at maximizing the expected return, many real-world applications - such as routing, resource allocation, or sequential decision-making under risk - require strategies that ensure not only high average performance but also a guaranteed probability of success. To this end, we propose a novel formulation in which the objective is to maximize the probability that the cumulative return exceeds a prescribed threshold. We demonstrate that this reliable RL problem can be reformulated, via a state-augmented representation, into a standard RL problem, thereby allowing the use of existing RL and deep RL algorithms without the need for entirely new algorithmic frameworks. Theoretical results establish the equivalence of the two formulations and show that reliable strategies can be derived by appropriately adapting well-known methods such as Q-learning or Dueling Double DQN. To illustrate the practical relevance of the approach, we consider the problem of reliable routing, where the goal is not to minimize the expected travel time but rather to maximize the probability of reaching the destination within a given time budget. Numerical experiments confirm that the proposed formulation leads to policies that effectively balance efficiency and reliability, highlighting the potential of reliable RL for applications in stochastic and safety-critical environments.
中文摘要 在这项工作中，我们解决了在强化学习（RL）中确定可靠策略的问题，重点是不确定性下的优化和性能保证的需求。虽然经典的 RL 算法旨在最大化预期回报，但许多实际应用（例如路由、资源分配或风险下的顺序决策）需要的策略不仅要确保高平均性能，还要保证成功概率。为此，我们提出了一种新颖的公式，其目标是最大化累积回报超过规定阈值的概率。我们证明，这个可靠的RL问题可以通过状态增强表示重新表述为标准RL问题，从而允许使用现有的RL和深度RL算法，而无需全新的算法框架。理论结果确定了两种公式的等效性，并表明通过适当调整 Q 学习或决斗双 DQN 等众所周知的方法可以得出可靠的策略。为了说明该方法的实际相关性，我们考虑了可靠路线的问题，其目标不是最小化预期的旅行时间，而是最大限度地提高在给定时间预算内到达目的地的概率。数值实验证实，所提出的公式可以产生有效平衡效率和可靠性的政策，凸显了可靠 RL 在随机和安全关键环境中应用的潜力。
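The state-augmentation idea in this abstract can be illustrated on a tiny reliable-routing instance: once the remaining time budget is carried in the state, "maximize the probability of arriving on time" becomes an ordinary sequential decision problem. The sketch below is built on assumptions. The road network, travel-time distributions, and function names are hypothetical, and it solves the augmented problem by exact recursion rather than by the Q-learning or Dueling Double DQN variants mentioned in the abstract.

```python
from functools import lru_cache

# Hypothetical network: node -> {action: (next_node, [(travel_time, probability), ...])}
EDGES = {
    "A": {"highway": ("B", [(2, 0.8), (6, 0.2)]), "local": ("B", [(3, 1.0)])},
    "B": {"highway": ("D", [(2, 0.8), (6, 0.2)]), "local": ("D", [(3, 1.0)])},
}
GOAL = "D"


@lru_cache(maxsize=None)
def success_prob(node, budget):
    """Best achievable probability of reaching GOAL from the augmented state (node, budget)."""
    if budget < 0:
        return 0.0  # budget already exceeded
    if node == GOAL:
        return 1.0  # arrived within budget
    return max(action_value(node, budget, a) for a in EDGES[node])


def action_value(node, budget, action):
    """Expected success probability of taking `action` in the augmented state."""
    next_node, outcomes = EDGES[node][action]
    return sum(p * success_prob(next_node, budget - t) for t, p in outcomes)


def best_action(node, budget):
    """Action with the highest success probability in (node, budget)."""
    return max(EDGES[node], key=lambda a: action_value(node, budget, a))


if __name__ == "__main__":
    for budget in (2, 4, 7):
        print(budget, best_action("B", budget), success_prob("B", budget))
```

In this toy network the highway has the lower expected travel time (3.2 versus 3), yet the best action at node B flips from "highway" to "local" once the remaining budget reaches 3, which is why the policy has to depend on the augmented state rather than on the node alone.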

Provably Optimal Reinforcement Learning under Safety Filtering

安全滤波下可证明的最优强化学习

Authors: Donggeon David Oh, Duy P. Nguyen, Haimin Hu, Jaime F. Fisac
Subjects: Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2510.18082
Pdf link: https://arxiv.org/pdf/2510.18082
Abstract Recent advances in reinforcement learning (RL) enable its use on increasingly complex tasks, but the lack of formal safety guarantees still limits its application in safety-critical settings. A common practical approach is to augment the RL policy with a safety filter that overrides unsafe actions to prevent failures during both training and deployment. However, safety filtering is often perceived as sacrificing performance and hindering the learning process. We show that this perceived safety-performance tradeoff is not inherent and prove, for the first time, that enforcing safety with a sufficiently permissive safety filter does not degrade asymptotic performance. We formalize RL safety with a safety-critical Markov decision process (SC-MDP), which requires categorical, rather than high-probability, avoidance of catastrophic failure states. Additionally, we define an associated filtered MDP in which all actions result in safe effects, thanks to a safety filter that is considered to be a part of the environment. Our main theorem establishes that (i) learning in the filtered MDP is safe categorically, (ii) standard RL convergence carries over to the filtered MDP, and (iii) any policy that is optimal in the filtered MDP, when executed through the same filter, achieves the same asymptotic return as the best safe policy in the SC-MDP, yielding a complete separation between safety enforcement and performance optimization. We validate the theory on Safety Gymnasium with representative tasks and constraints, observing zero violations during training and final performance matching or exceeding unfiltered baselines. Together, these results shed light on a long-standing question in safety-filtered learning and provide a simple, principled recipe for safe RL: train and deploy RL policies with the most permissive safety filter that is available.
中文摘要 强化学习（RL）的最新进展使其能够用于日益复杂的任务，但缺乏形式化的安全保证仍然限制了其在安全关键环境中的应用。一种常见的实用方法是使用安全过滤器来增强 RL 策略，该过滤器会覆盖不安全的动作，以防止在训练和部署期间发生故障。然而，安全过滤通常被认为会牺牲性能并阻碍学习过程。我们表明，这种感知到的安全-性能权衡并不是固有的，并首次证明，使用足够宽松的安全过滤器强制执行安全不会降低渐近性能。我们通过安全关键型马尔可夫决策过程（SC-MDP）将 RL 安全性形式化，该过程要求绝对地而不是以高概率避免灾难性故障状态。此外，我们还定义了一个关联的过滤 MDP，其中所有动作都会产生安全的效果，这要归功于被视为环境一部分的安全过滤器。我们的主要定理确定（i）在过滤后的 MDP 中的学习是绝对安全的，（ii）标准 RL 收敛性延续到过滤后的 MDP，以及（iii）过滤后的 MDP 中的任何最优策略（当通过相同的过滤器执行时）实现与 SC-MDP 中最佳安全策略相同的渐近回报，从而在安全实施和性能优化之间产生完全分离。我们在 Safety Gymnasium 上用代表性任务和约束验证了该理论，在训练期间观察到零违规，最终性能匹配或超过未经过滤的基线。这些结果共同阐明了安全过滤学习中一个长期存在的问题，并为安全 RL 提供了一个简单、有原则的方案：使用可用的最宽松的安全过滤器训练和部署 RL 策略。
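Illustrative sketch (not from the paper): the idea of a "filtered MDP", where the safety filter is treated as part of the environment, can be pictured as a wrapper around the environment's step function. The toy corridor, the `is_safe` predicate, and the `fallback` action below are all hypothetical choices made for this example; the filter overrides an action only when that action would enter the failure state.

```python
from typing import Callable, Tuple

State = int
Action = int


def make_filtered_step(
    step: Callable[[State, Action], Tuple[State, float]],
    is_safe: Callable[[State, Action], bool],
    fallback: Callable[[State], Action],
) -> Callable[[State, Action], Tuple[State, float, Action]]:
    """Wrap an environment step so that unsafe actions are overridden.

    The learner only ever interacts with the returned function, so from its
    point of view the filter is simply part of the environment dynamics.
    """
    def filtered_step(state: State, action: Action) -> Tuple[State, float, Action]:
        executed = action if is_safe(state, action) else fallback(state)
        next_state, reward = step(state, executed)
        return next_state, reward, executed

    return filtered_step


# Toy 1-D corridor (assumption): states 0..4, state 4 is catastrophic.
FAILURE = 4


def step(state: State, action: Action) -> Tuple[State, float]:
    next_state = max(0, min(FAILURE, state + action))  # action is -1 or +1
    return next_state, float(next_state)  # reward grows toward the edge


def is_safe(state: State, action: Action) -> bool:
    # Permissive filter: reject only actions that would reach the failure state.
    return state + action < FAILURE


def fallback(state: State) -> Action:
    return -1  # stepping away from the edge is always safe in this toy


filtered_step = make_filtered_step(step, is_safe, fallback)

state = 0
visited = []
for _ in range(10):
    # A deliberately reckless policy that always pushes toward the edge.
    state, reward, executed = filtered_step(state, +1)
    visited.append(state)
```

In this toy the reckless policy still collects reward near the edge but never reaches the failure state, which is the qualitative behavior the abstract attributes to a permissive filter.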

RL-Driven Security-Aware Resource Allocation Framework for UAV-Assisted O-RAN

RL驱动的无人机辅助O-RAN安全感知资源分配框架

Authors: Zaineh Abughazzah, Emna Baccour, Loay Ismail, Amr Mohamed, Mounir Hamdi
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.18084
Pdf link: https://arxiv.org/pdf/2510.18084
Abstract The integration of Unmanned Aerial Vehicles (UAVs) into Open Radio Access Networks (O-RAN) enhances communication in disaster management and Search and Rescue (SAR) operations by ensuring connectivity when infrastructure fails. However, SAR scenarios demand stringent security and low-latency communication, as delays or breaches can compromise mission success. While UAVs serve as mobile relays, they introduce challenges in energy consumption and resource management, necessitating intelligent allocation strategies. Existing UAV-assisted O-RAN approaches often overlook the joint optimization of security, latency, and energy efficiency in dynamic environments. This paper proposes a novel Reinforcement Learning (RL)-based framework for dynamic resource allocation in UAV relays, explicitly addressing these trade-offs. Our approach formulates an optimization problem that integrates security-aware resource allocation, latency minimization, and energy efficiency, which is solved using RL. Unlike heuristic or static methods, our framework adapts in real-time to network dynamics, ensuring robust communication. Simulations demonstrate superior performance compared to heuristic baselines, achieving enhanced security and energy efficiency while maintaining ultra-low latency in SAR scenarios.
中文摘要 将无人机（UAV）集成到开放无线接入网络（O-RAN）中，通过确保基础设施出现故障时的连接，增强灾害管理和搜救（SAR）行动中的通信。然而，SAR 场景需要严格的安全性和低延迟通信，因为延迟或违规可能会影响任务的成功。虽然无人机充当移动中继，但它们在能源消耗和资源管理方面带来了挑战，需要智能分配策略。现有的无人机辅助O-RAN方法往往忽视了动态环境中安全性、延迟和能效的联合优化。本文提出了一种基于强化学习（RL）的新型无人机中继动态资源分配框架，明确解决了这些权衡问题。我们的方法提出了一个集成了安全感知资源分配、延迟最小化和能效的优化问题，该问题使用 RL 进行解决。与启发式或静态方法不同，我们的框架实时适应网络动态，确保稳健的通信。与启发式基线相比，模拟表现出卓越的性能，实现了增强的安全性和能效，同时在 SAR 场景中保持超低延迟。

LLMs Encode How Difficult Problems Are

LLM 编码问题的难度

Authors: William Lugoloobi, Chris Russell
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.18147
Pdf link: https://arxiv.org/pdf/2510.18147
Abstract Large language models exhibit a puzzling inconsistency: they solve complex problems yet frequently fail on seemingly simpler ones. We investigate whether LLMs internally encode problem difficulty in a way that aligns with human judgment, and whether this representation tracks generalization during reinforcement learning post-training. We train linear probes across layers and token positions on 60 models, evaluating on mathematical and coding subsets of Easy2HardBench. We find that human-labeled difficulty is strongly linearly decodable (AMC: $\rho \approx 0.88$) and exhibits clear model-size scaling, whereas LLM-derived difficulty is substantially weaker and scales poorly. Steering along the difficulty direction reveals that pushing models toward "easier" representations reduces hallucination and improves accuracy. During GRPO training on Qwen2.5-Math-1.5B, the human-difficulty probe strengthens and positively correlates with test accuracy across training steps, while the LLM-difficulty probe degrades and negatively correlates with performance. These results suggest that human annotations provide a stable difficulty signal that RL amplifies, while automated difficulty estimates derived from model performance become misaligned precisely as models improve. We release probe code and evaluation scripts to facilitate replication.
中文摘要 大型语言模型表现出令人费解的不一致：它们能解决复杂的问题，却经常在看似更简单的问题上失败。我们研究了 LLM 是否以符合人类判断的方式在内部编码问题难度，以及这种表示是否在强化学习后训练期间跟踪泛化。我们在 60 个模型上训练跨层和词元位置的线性探针，并在 Easy2HardBench 的数学和编码子集上进行评估。我们发现，人类标注的难度具有很强的线性可解码性（AMC：$\rho \approx 0.88$）并表现出明显的模型规模缩放，而 LLM 衍生的难度要弱得多，并且缩放性很差。沿着难度方向进行引导表明，将模型推向“更容易”的表示可以减少幻觉并提高准确性。在 Qwen2.5-Math-1.5B 上的 GRPO 训练中，人类难度探针在训练步骤中增强并与测试准确率呈正相关，而 LLM 难度探针则退化并与性能呈负相关。这些结果表明，人工标注提供了被 RL 放大的稳定难度信号，而根据模型性能得出的自动难度估计恰恰会随着模型的改进而变得错位。我们发布探针代码和评估脚本以便复现。
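Illustrative sketch (not the authors' released code): a linear probe of the kind the abstract describes can be pictured as a ridge regression from hidden states to a difficulty label, scored by rank correlation. Everything below is an assumption made for the example: the data are synthetic, the hidden size is arbitrary, and the score is taken to be Spearman's rho (the abstract reports only $\rho$).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumption): 400 "problems" with 32-dim hidden states
# whose difficulty is a noisy linear function of the hidden state.
n, d = 400, 32
direction = rng.normal(size=d)
hidden = rng.normal(size=(n, d))
difficulty = hidden @ direction + 0.5 * rng.normal(size=n)

train, test = slice(0, 300), slice(300, n)


def fit_ridge_probe(x: np.ndarray, y: np.ndarray, l2: float = 1.0) -> np.ndarray:
    """Closed-form ridge regression with an unpenalized bias term."""
    x_aug = np.hstack([x, np.ones((x.shape[0], 1))])
    reg = l2 * np.eye(x_aug.shape[1])
    reg[-1, -1] = 0.0  # do not regularize the bias
    return np.linalg.solve(x_aug.T @ x_aug + reg, x_aug.T @ y)


def predict(w: np.ndarray, x: np.ndarray) -> np.ndarray:
    return np.hstack([x, np.ones((x.shape[0], 1))]) @ w


def spearman_rho(a: np.ndarray, b: np.ndarray) -> float:
    """Pearson correlation of ranks (valid here because values have no ties)."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])


w = fit_ridge_probe(hidden[train], difficulty[train])
rho = spearman_rho(predict(w, hidden[test]), difficulty[test])
print(f"held-out rank correlation: {rho:.3f}")
```

The first `d` entries of `w` give a candidate "difficulty direction" in this toy setup; in a real study the probe would be fit per layer and token position on actual model activations.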

Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains

局部连贯性还是全局有效性？研究数学领域的 RLVR 轨迹

Authors: Soumya Rani Samineni, Durgesh Kalwar, Vardaan Gangal, Siddhant Bhambri, Subbarao Kambhampati
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.18176
Pdf link: https://arxiv.org/pdf/2510.18176
Abstract Reinforcement Learning with Verifiable Rewards (RLVR)-based post-training of Large Language Models (LLMs) has been shown to improve accuracy on reasoning tasks and continues to attract significant attention. Existing RLVR methods, however, typically treat all tokens uniformly without accounting for token-level advantages. These methods primarily evaluate performance based on final answer correctness or Pass@K accuracy, and yet make claims about RL post-training leading to improved reasoning traces. This motivates our investigation into the effect of RL post-training on intermediate tokens, which are not directly incentivized. To study this, we design an experimental setup using the GRPO algorithm with the Qwen-2.5-0.5B model on the GSM8K dataset. We introduce trace coherence, a First-Order Logic (FOL)-based measure that captures the consistency of reasoning steps by identifying errors in the traces. We distinguish between trace validity and trace coherence, noting that the former implies logical soundness while the latter measures local coherence via lack of errors. Our results show that RL post-training overall improves trace coherence, with the most significant gains on problems where the base model fails but the RL model succeeds. Surprisingly, RL enhances local coherence without necessarily producing valid or correct solutions. This highlights a crucial distinction: improved local coherence in reasoning steps does not guarantee final answer correctness. We argue that claims of improved reasoning via RL must be examined with care, as these may be based on improved trace coherence, which may not translate into fully valid mathematical proofs.
中文摘要 基于可验证奖励强化学习（RLVR）的大型语言模型（LLM）后训练已被证明可以提高推理任务的准确性，并持续引起广泛关注。然而，现有的 RLVR 方法通常统一对待所有词元，而不考虑词元级别的优势。这些方法主要根据最终答案的正确性或 Pass@K 准确性来评估性能，却声称 RL 后训练能够改进推理轨迹。这促使我们研究 RL 后训练对未被直接激励的中间词元的影响。为了研究这一点，我们在 GSM8K 数据集上使用 GRPO 算法和 Qwen-2.5-0.5B 模型设计了一个实验设置。我们引入了轨迹连贯性，这是一种基于一阶逻辑（FOL）的度量，通过识别轨迹中的错误来刻画推理步骤的一致性。我们区分了轨迹有效性和轨迹连贯性，指出前者意味着逻辑上的可靠性，而后者通过没有错误来衡量局部连贯性。我们的结果表明，RL 后训练总体上提高了轨迹连贯性，在基础模型失败但 RL 模型成功的问题上获得了最显著的收益。令人惊讶的是，RL 增强了局部连贯性，但不一定产生有效或正确的解。这突出了一个关键的区别：推理步骤中局部连贯性的提高并不能保证最终答案的正确性。我们认为，必须仔细审视通过 RL 改进推理的说法，因为这些说法可能基于改进的轨迹连贯性，而这未必能转化为完全有效的数学证明。

Nash Policy Gradient: A Policy Gradient Method with Iteratively Refined Regularization for Finding Nash Equilibria

纳什策略梯度：一种基于迭代细化正则化的策略梯度方法，用于寻找纳什均衡

Authors: Eason Yu, Tzu Hao Liu, Yunke Wang, Clément L. Canonne, Nguyen H. Tran, Chang Xu
Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
Arxiv link: https://arxiv.org/abs/2510.18183
Pdf link: https://arxiv.org/pdf/2510.18183
Abstract Finding Nash equilibria in imperfect-information games remains a central challenge in multi-agent reinforcement learning. While regularization-based methods have recently achieved last-iteration convergence to a regularized equilibrium, they require the regularization strength to shrink toward zero to approximate a Nash equilibrium, often leading to unstable learning in practice. Instead, we fix the regularization strength at a large value for robustness and achieve convergence by iteratively refining the reference policy. Our main theoretical result shows that this procedure guarantees strictly monotonic improvement and convergence to an exact Nash equilibrium in two-player zero-sum games, without requiring a uniqueness assumption. Building on this framework, we develop a practical algorithm, Nash Policy Gradient (NashPG), which preserves the generalizability of policy gradient methods while relying solely on the current and reference policies. Empirically, NashPG achieves comparable or lower exploitability than prior model-free methods on classic benchmark games and scales to large domains such as Battleship and No-Limit Texas Hold'em, where NashPG consistently attains higher Elo ratings.
中文摘要 在不完美信息博弈中找到纳什均衡仍然是多智能体强化学习的核心挑战。虽然基于正则化的方法最近实现了最后迭代收敛到正则化均衡，但它们需要正则化强度收缩到零以近似纳什均衡，这通常会导致实践中的学习不稳定。相反，我们将正则化强度固定在鲁棒性的较大值，并通过迭代细化参考策略来实现收敛。我们的主要理论结果表明，该过程保证了在双人零和博弈中严格单调改进和收敛到精确的纳什均衡，而不需要唯一性假设。在此框架的基础上，我们开发了一种实用的算法，即纳什策略梯度（NashPG），该算法在仅依赖当前和参考策略的同时保留了策略梯度方法的通用性。根据经验，NashPG 在经典基准测试游戏上实现了与之前的无模型方法相当或更低的可利用性，并扩展到战舰和无限注德州扑克等大型领域，NashPG 在这些领域始终获得更高的 Elo 评级。

NTKMTL: Mitigating Task Imbalance in Multi-Task Learning from Neural Tangent Kernel Perspective

NTKMTL：从神经切线核视角缓解多任务学习中的任务不平衡

Authors: Xiaohan Qin, Xiaoxing Wang, Ning Liao, Junchi Yan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.18258
Pdf link: https://arxiv.org/pdf/2510.18258
Abstract Multi-Task Learning (MTL) enables a single model to learn multiple tasks simultaneously, leveraging knowledge transfer among tasks for enhanced generalization, and has been widely applied across various domains. However, task imbalance remains a major challenge in MTL. Although balancing the convergence speeds of different tasks is an effective approach to address this issue, it is highly challenging to accurately characterize the training dynamics and convergence speeds of multiple tasks within the complex MTL system. To this end, we attempt to analyze the training dynamics in MTL by leveraging Neural Tangent Kernel (NTK) theory and propose a new MTL method, NTKMTL. Specifically, we introduce an extended NTK matrix for MTL and adopt spectral analysis to balance the convergence speeds of multiple tasks, thereby mitigating task imbalance. Based on the approximation via shared representation, we further propose NTKMTL-SR, achieving training efficiency while maintaining competitive performance. Extensive experiments demonstrate that our methods achieve state-of-the-art performance across a wide range of benchmarks, including both multi-task supervised learning and multi-task reinforcement learning. Source code is available at this https URL.
中文摘要 多任务学习（MTL）使单个模型能够同时学习多个任务，利用任务之间的知识迁移来增强泛化性，并已广泛应用于各个领域。然而，任务不平衡仍然是 MTL 面临的主要挑战。尽管平衡不同任务的收敛速度是解决这一问题的有效方法，但要准确表征复杂 MTL 系统中多个任务的训练动态和收敛速度极具挑战性。为此，我们尝试利用神经切线核（NTK）理论来分析 MTL 中的训练动态，并提出一种新的 MTL 方法 NTKMTL。具体来说，我们引入了用于 MTL 的扩展 NTK 矩阵，并采用谱分析来平衡多个任务的收敛速度，从而缓解任务不平衡。基于共享表示的近似，我们进一步提出了 NTKMTL-SR，在保持有竞争力的性能的同时实现训练效率。大量实验表明，我们的方法在广泛的基准测试中实现了最先进的性能，包括多任务监督学习和多任务强化学习。源代码可在此 https URL 中找到。

From Competition to Synergy: Unlocking Reinforcement Learning for Subject-Driven Image Generation

从竞争到协同：解锁主题驱动图像生成的强化学习

Authors: Ziwei Huang, Ying Shu, Hao Fang, Quanyu Long, Wenya Wang, Qiushi Guo, Tiezheng Ge, Leilei Gan
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Arxiv link: https://arxiv.org/abs/2510.18263
Pdf link: https://arxiv.org/pdf/2510.18263
Abstract Subject-driven image generation models face a fundamental trade-off between identity preservation (fidelity) and prompt adherence (editability). While online reinforcement learning (RL), specifically GRPO, offers a promising solution, we find that a naive application of GRPO leads to competitive degradation, as the simple linear aggregation of rewards with static weights causes conflicting gradient signals and a misalignment with the temporal dynamics of the diffusion process. To overcome these limitations, we propose Customized-GRPO, a novel framework featuring two key innovations: (i) Synergy-Aware Reward Shaping (SARS), a non-linear mechanism that explicitly penalizes conflicted reward signals and amplifies synergistic ones, providing a sharper and more decisive gradient; and (ii) Time-Aware Dynamic Weighting (TDW), which aligns the optimization pressure with the model's temporal dynamics by prioritizing prompt-following in the early stages and identity preservation in the later ones. Extensive experiments demonstrate that our method significantly outperforms naive GRPO baselines, successfully mitigating competitive degradation. Our model achieves a superior balance, generating images that both preserve key identity features and accurately adhere to complex textual prompts.
中文摘要 主题驱动的图像生成模型在身份保留（保真度）和提示遵循（可编辑性）之间面临着根本性的权衡。虽然在线强化学习（RL），特别是 GRPO，提供了一个有前途的解决方案，但我们发现 GRPO 的朴素应用会导致竞争性退化，因为具有静态权重的奖励的简单线性聚合会导致梯度信号冲突，并与扩散过程的时间动态不一致。为了克服这些限制，我们提出了 Customized-GRPO，这是一个具有两项关键创新的新颖框架：（i）协同感知奖励塑造（SARS），这是一种非线性机制，它明确惩罚冲突的奖励信号并放大协同信号，提供更清晰、更具决定性的梯度；（ii）时间感知动态加权（TDW），通过在早期优先考虑提示遵循，在后期优先考虑身份保留，将优化压力与模型的时间动态保持一致。大量实验表明，我们的方法明显优于朴素的 GRPO 基线，成功地减轻了竞争性退化。我们的模型实现了更好的平衡，生成的图像既保留了关键的身份特征，又准确地遵循复杂的文本提示。

Food4All: A Multi-Agent Framework for Real-time Free Food Discovery with Integrated Nutritional Metadata

Food4All：一个多代理框架，用于实时免费发现食物，具有集成的营养元数据

Authors: Zhengqing Yuan, Yiyang Li, Weixiang Sun, Zheyuan Zhang, Kaiwen Shi, Keerthiram Murugesan, Yanfang Ye
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2510.18289
Pdf link: https://arxiv.org/pdf/2510.18289
Abstract Food insecurity remains a persistent public health emergency in the United States, tightly interwoven with chronic disease, mental illness, and opioid misuse. Yet despite the existence of thousands of food banks and pantries, access remains fragmented: 1) current retrieval systems depend on static directories or generic search engines, which provide incomplete and geographically irrelevant results; 2) LLM-based chatbots offer only vague nutritional suggestions and fail to adapt to real-world constraints such as time, mobility, and transportation; and 3) existing food recommendation systems optimize for culinary diversity but overlook survival-critical needs of food-insecure populations, including immediate proximity, verified availability, and contextual barriers. These limitations risk leaving the most vulnerable individuals, those experiencing homelessness, addiction, or digital illiteracy, unable to access urgently needed resources. To address this, we introduce Food4All, the first multi-agent framework explicitly designed for real-time, context-aware free food retrieval. Food4All unifies three innovations: 1) heterogeneous data aggregation across official databases, community platforms, and social media to provide a continuously updated pool of food resources; 2) a lightweight reinforcement learning algorithm trained on curated cases to optimize for both geographic accessibility and nutritional correctness; and 3) an online feedback loop that dynamically adapts retrieval policies to evolving user needs. By bridging information acquisition, semantic analysis, and decision support, Food4All delivers nutritionally annotated guidance at the point of need. This framework establishes an urgent step toward scalable, equitable, and intelligent systems that directly support populations facing food insecurity and its compounding health risks.
中文摘要 粮食不安全仍然是美国持续存在的突发公共卫生事件，与慢性病、精神疾病和阿片类药物滥用紧密交织在一起。然而，尽管存在数以千计的食物银行和食物分发站，获取渠道仍然分散：1）当前的检索系统依赖于静态目录或通用搜索引擎，它们提供不完整且与地理位置无关的结果；2）基于 LLM 的聊天机器人只提供模糊的营养建议，无法适应现实世界的限制，如时间、行动能力和交通；3）现有的食物推荐系统针对烹饪多样性进行了优化，但忽略了粮食不安全人群的生存关键需求，包括就近可达、经过验证的可用性和情境障碍。这些限制可能会使最弱势的个人，即无家可归、成瘾或缺乏数字素养的人，无法获得急需的资源。为了解决这个问题，我们推出了 Food4All，这是第一个明确设计用于实时、上下文感知的免费食物检索的多智能体框架。Food4All 统一了三项创新：1）跨官方数据库、社区平台和社交媒体的异构数据聚合，提供不断更新的食物资源池；2）在精选案例上训练的轻量级强化学习算法，以同时优化地理可及性和营养正确性；3）在线反馈循环，可根据不断变化的用户需求动态调整检索策略。通过衔接信息采集、语义分析和决策支持，Food4All 在需要时提供带营养标注的指导。该框架为实现可扩展、公平和智能的系统迈出了紧迫的一步，这些系统直接支持面临粮食不安全及其叠加健康风险的人群。

Proactive Reasoning-with-Retrieval Framework for Medical Multimodal Large Language Models

面向医疗多模态大语言模型的主动推理检索框架

Authors: Lehan Wang, Yi Qin, Honglong Yang, Xiaomeng Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2510.18303
Pdf link: https://arxiv.org/pdf/2510.18303
Abstract Incentivizing the reasoning ability of Multimodal Large Language Models (MLLMs) is essential for medical applications to transparently analyze medical scans and provide reliable diagnosis. However, existing medical MLLMs rely solely on internal knowledge during reasoning, leading to hallucinated reasoning and factual inaccuracies when encountering cases beyond their training scope. Although recent Agentic Retrieval-Augmented Generation (RAG) methods elicit the medical model's proactive retrieval ability during reasoning, they are confined to unimodal LLMs, neglecting the crucial visual information during reasoning and retrieval. Consequently, we propose the first Multimodal Medical Reasoning-with-Retrieval framework, Med-RwR, which actively retrieves external knowledge by querying observed symptoms or domain-specific medical concepts during reasoning. Specifically, we design a two-stage reinforcement learning strategy with tailored rewards that stimulate the model to leverage both visual diagnostic findings and textual clinical information for effective retrieval. Building on this foundation, we further propose a Confidence-Driven Image Re-retrieval (CDIR) method for test-time scaling when low prediction confidence is detected. Evaluation on various public medical benchmarks demonstrates Med-RwR's significant improvements over baseline models, proving the effectiveness of enhancing reasoning capabilities with external knowledge integration. Furthermore, Med-RwR demonstrates remarkable generalizability to unfamiliar domains, evidenced by an 8.8% performance gain on our proposed EchoCardiography Benchmark (ECBench), despite the scarcity of echocardiography data in the training corpus. Our data, model, and codes will be made publicly available at this https URL.
中文摘要 激励多模态大型语言模型（MLLM）的推理能力对于医疗应用透明地分析医学扫描并提供可靠的诊断至关重要。然而，现有的医学 MLLM 在推理过程中仅依赖内部知识，导致在遇到超出其训练范围的病例时出现幻觉推理和事实不准确。尽管最近的智能体检索增强生成（RAG）方法在推理过程中激发了医学模型的主动检索能力，但它们仅限于单模态 LLM，忽略了推理和检索过程中的关键视觉信息。因此，我们提出了第一个多模态医学推理与检索框架 Med-RwR，它通过在推理过程中查询观察到的症状或特定领域的医学概念来主动检索外部知识。具体来说，我们设计了一个两阶段的强化学习策略，并提供量身定制的奖励，促使模型利用视觉诊断结果和文本临床信息进行有效检索。在此基础上，我们进一步提出了一种置信度驱动的图像重新检索（CDIR）方法，用于在检测到低预测置信度时进行测试时扩展。对各种公共医疗基准的评估表明，Med-RwR 比基线模型有了显著的改进，证明了通过外部知识整合增强推理能力的有效性。此外，尽管训练语料库中的超声心动图数据稀缺，Med-RwR 对不熟悉的领域仍表现出显著的泛化性，这在我们提出的超声心动图基准（ECBench）上 8.8% 的性能提升中得到了证明。我们的数据、模型和代码将在此 https URL 上公开。

Higher Embedding Dimension Creates a Stronger World Model for a Simple Sorting Task

更高的嵌入维度为简单的排序任务创建更强大的世界模型

Authors: Brady Bhalla, Honglu Fan, Nancy Chen, Tony Yue YU
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.18315
Pdf link: https://arxiv.org/pdf/2510.18315
Abstract We investigate how embedding dimension affects the emergence of an internal "world model" in a transformer trained with reinforcement learning to perform bubble-sort-style adjacent swaps. Models achieve high accuracy even with very small embedding dimensions, but larger dimensions yield more faithful, consistent, and robust internal representations. In particular, higher embedding dimensions strengthen the formation of structured internal representation and lead to better interpretability. After hundreds of experiments, we observe two consistent mechanisms: (1) the last row of the attention weight matrix monotonically encodes the global ordering of tokens; and (2) the selected transposition aligns with the largest adjacent difference of these encoded values. Our results provide quantitative evidence that transformers build structured internal world models and that model size improves representation quality in addition to end performance. We release our metrics and analyses, which can be used to probe similar algorithmic tasks.
中文摘要 我们研究了嵌入维度如何影响通过强化学习训练的 Transformer 中内部“世界模型”的出现，以执行冒泡排序式的相邻交换。即使嵌入维度非常小，模型也能实现高精度，但较大的维度会产生更忠实、一致和稳健的内部表示。特别是，更高的嵌入维度加强了结构化内部表示的形成，并导致更好的可解释性。经过数百次实验，我们观察到了两种一致的机制：（1）注意力权重矩阵的最后一行单调地编码了标记的全局排序;（2）所选转置与这些编码值的最大相邻差异对齐。我们的结果提供了定量证据，证明 Transformer 构建了结构化的内部世界模型，并且模型大小除了最终性能之外还提高了表示质量。我们发布了我们的指标和分析，可用于探测类似的算法任务。
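Illustrative sketch (not from the paper): the second mechanism in the abstract, where the chosen transposition lines up with the largest adjacent difference of the encoded values, can be pictured as a simple selection rule. The sign convention (left minus right, so a positive gap marks an out-of-order pair) and the identity encoding of tokens as scores are assumptions made for this example.

```python
from typing import List, Optional, Sequence, Tuple


def select_swap(scores: Sequence[float]) -> Optional[int]:
    """Return i so that positions (i, i + 1) are swapped, or None if sorted.

    Picks the adjacent pair with the largest positive gap
    scores[i] - scores[i + 1] (assumed sign convention).
    """
    if len(scores) < 2:
        return None
    gaps = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]
    best = max(range(len(gaps)), key=lambda i: gaps[i])
    return best if gaps[best] > 0 else None


def sort_by_adjacent_swaps(tokens: Sequence[int]) -> Tuple[List[int], int]:
    """Repeatedly apply the selected adjacent swap until no inversion remains."""
    items = list(tokens)
    swaps = 0
    while True:
        # Identity encoding (assumption): each token's score is its own value.
        scores = [float(t) for t in items]
        i = select_swap(scores)
        if i is None:
            return items, swaps
        items[i], items[i + 1] = items[i + 1], items[i]
        swaps += 1


result, num_swaps = sort_by_adjacent_swaps([3, 1, 2])
print(result, num_swaps)
```

Each selected swap removes exactly one inversion, so the loop terminates after as many swaps as the input has inversions.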

Why Policy Gradient Algorithms Work for Undiscounted Total-Reward MDPs

为什么策略梯度算法适用于未贴现的总奖励 MDP

Authors: Jongmin Lee, Ernest K. Ryu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.18340
Pdf link: https://arxiv.org/pdf/2510.18340
Abstract The classical policy gradient method is the theoretical and conceptual foundation of modern policy-based reinforcement learning (RL) algorithms. Most rigorous analyses of such methods, particularly those establishing convergence guarantees, assume a discount factor $\gamma < 1$. In contrast, a recent line of work on policy-based RL for large language models uses the undiscounted total-reward setting with $\gamma = 1$, rendering much of the existing theory inapplicable. In this paper, we provide analyses of the policy gradient method for undiscounted expected total-reward infinite-horizon MDPs based on two key insights: (i) the classification of the MDP states into recurrent and transient states is invariant over the set of policies that assign strictly positive probability to every action (as is typical in deep RL models employing a softmax output layer) and (ii) the classical state visitation measure (which may be ill-defined when $\gamma = 1$) can be replaced with a new object that we call the transient visitation measure.
中文摘要 经典策略梯度方法是现代基于策略的强化学习（RL）算法的理论和概念基础。对此类方法的最严格分析，特别是那些建立收敛保证的方法，假设折扣系数为 $\gamma < 1$。然而，相比之下，最近一项关于大型语言模型基于策略的 RL 的工作使用了 $\gamma = 1$ 的未贴现总奖励设置，这使得许多现有理论不适用。在本文中，我们基于两个关键见解对未贴现的预期总奖励无限视界 MDP 的策略梯度方法进行了分析：（i） MDP 状态对循环状态和瞬态的分类在为每个动作分配严格正概率的策略集中是不变的（这在采用 softmax 输出层的深度 RL 模型中是典型的）和（ii）经典状态访问度量（当 $\gamma = 1$ 时可能定义不明确）可以替换为我们称之为瞬态访问度量的新对象。
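Illustrative sketch (not from the paper): insight (i) rests on a standard fact about finite Markov chains, namely that whether a state is recurrent or transient depends only on which transitions have positive probability. Under any policy that gives every action positive probability, that support is the union over actions of each action's reachable next states, so the classification cannot change with the policy. The three-state MDP below is a hypothetical example.

```python
from typing import Dict, Hashable, Set, Tuple

State = Hashable
Support = Dict[State, Dict[str, Set[State]]]  # state -> action -> next states


def reachable(adj: Dict[State, Set[State]], start: State) -> Set[State]:
    """All states reachable from start (including start) in the support graph."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def classify(support: Support) -> Tuple[Set[State], Set[State]]:
    """Split states into (recurrent, transient) for any fully positive policy.

    In a finite chain, a state is recurrent exactly when every state it can
    reach can also reach it back.
    """
    adj = {s: set().union(*acts.values()) for s, acts in support.items()}
    reach = {s: reachable(adj, s) for s in adj}
    recurrent = {s for s in adj if all(s in reach[t] for t in reach[s])}
    return recurrent, set(adj) - recurrent


# Hypothetical MDP: state 2 is an absorbing terminal state.
support: Support = {
    0: {"a": {0, 1}, "b": {1}},
    1: {"a": {0}, "b": {2}},
    2: {"a": {2}, "b": {2}},
}
recurrent, transient = classify(support)
print(recurrent, transient)
```

No action probabilities appear in `classify`, which is the point: only the support of the transitions matters.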

PGTT: Phase-Guided Terrain Traversal for Perceptive Legged Locomotion

PGTT：用于感知腿运动的相位引导地形穿越

Authors: Alexandros Ntagkas, Chairi Kiourt, Konstantinos Chatzilygeroudis
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.18348
Pdf link: https://arxiv.org/pdf/2510.18348
Abstract State-of-the-art perceptive Reinforcement Learning controllers for legged robots either (i) impose oscillator or IK-based gait priors that constrain the action space, add bias to the policy optimization and reduce adaptability across robot morphologies, or (ii) operate "blind", struggling to anticipate hind-leg terrain and remaining brittle to noise. In this paper, we propose Phase-Guided Terrain Traversal (PGTT), a perception-aware deep-RL approach that overcomes these limitations by enforcing gait structure purely through reward shaping, thereby reducing inductive bias in policy learning compared to oscillator/IK-conditioned action priors. PGTT encodes per-leg phase as a cubic Hermite spline that adapts swing height to local heightmap statistics and adds a swing-phase contact penalty, while the policy acts directly in joint space, supporting morphology-agnostic deployment. Trained in MuJoCo (MJX) on procedurally generated stair-like terrains with curriculum and domain randomization, PGTT achieves the highest success under push disturbances (median +7.5% vs. the next best method) and on discrete obstacles (+9%), with comparable velocity tracking, and converges to an effective policy roughly 2x faster than strong end-to-end baselines. We validate PGTT on a Unitree Go2 using a real-time LiDAR elevation-to-heightmap pipeline, and we report preliminary results on ANYmal-C obtained with the same hyperparameters. These findings indicate that terrain-adaptive, phase-guided reward shaping is a simple and general mechanism for robust perceptive locomotion across platforms.
中文摘要 用于足式机器人的最先进的感知强化学习控制器要么（i）施加振荡器或基于 IK 的步态先验，这些先验会限制动作空间，增加策略优化的偏差并降低跨机器人形态的适应性，要么（ii）“盲目”运行，难以预测后腿地形，并且对噪声很脆弱。在本文中，我们提出了相位引导地形穿越（PGTT），这是一种感知感知的深度 RL 方法，它纯粹通过奖励塑造来强制执行步态结构，从而克服这些限制，与振荡器/IK 条件的动作先验相比减少了策略学习中的归纳偏差。PGTT 将每条腿的相位编码为三次 Hermite 样条，使摆动高度适应局部高度图统计数据并添加摆动相接触惩罚，而策略直接作用于关节空间，支持与形态无关的部署。PGTT 在 MuJoCo（MJX）中程序生成的楼梯状地形上结合课程学习和域随机化进行训练，在推力干扰（中位数比次优方法高 7.5%）和离散障碍物（+9%）下取得了最高的成功率，具有相当的速度跟踪，并且收敛到有效策略的速度比强端到端基线快大约 2 倍。我们使用实时 LiDAR 高程到高度图管道在 Unitree Go2 上验证了 PGTT，并报告了使用相同超参数获得的 ANYmal-C 的初步结果。这些发现表明，地形自适应、相位引导的奖励塑造是一种简单而通用的跨平台稳健感知运动机制。
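Illustrative sketch (not the authors' implementation): one way to picture a cubic Hermite swing profile whose height adapts to the local heightmap is to join two Hermite segments (lift-off to apex, apex to touchdown) with zero tangents at each knot. The two-segment layout, the zero tangents, the `base_clearance` value, and the rule that sets the apex from the local height range are all assumptions made for this example.

```python
from typing import Sequence


def hermite(p0: float, p1: float, m0: float, m1: float, t: float) -> float:
    """Cubic Hermite interpolation on t in [0, 1] with endpoint tangents m0, m1."""
    t2, t3 = t * t, t * t * t
    return (
        (2 * t3 - 3 * t2 + 1) * p0
        + (t3 - 2 * t2 + t) * m0
        + (-2 * t3 + 3 * t2) * p1
        + (t3 - t2) * m1
    )


def swing_height(phase: float, apex: float) -> float:
    """Target foot height over a swing phase in [0, 1], peaking at phase 0.5."""
    if phase < 0.5:
        return hermite(0.0, apex, 0.0, 0.0, phase / 0.5)
    return hermite(apex, 0.0, 0.0, 0.0, (phase - 0.5) / 0.5)


def apex_from_heightmap(
    local_heights: Sequence[float], base_clearance: float = 0.05
) -> float:
    """Hypothetical rule: raise the apex by the local terrain height range."""
    return base_clearance + (max(local_heights) - min(local_heights))


apex = apex_from_heightmap([0.0, 0.10, 0.15])
profile = [swing_height(k / 10, apex) for k in range(11)]
print([round(h, 3) for h in profile])
```

A reward-shaping term could then compare the measured foot height against `swing_height(phase, apex)`; how PGTT actually forms its reward is not specified in the abstract.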

Ranking-based Preference Optimization for Diffusion Models from Implicit User Feedback

基于隐式用户反馈的扩散模型基于排名的偏好优化

Authors: Yi-Lun Wu, Bo-Kai Ruan, Chiang Tseng, Hong-Han Shuai
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2510.18353
Pdf link: https://arxiv.org/pdf/2510.18353
Abstract Direct preference optimization (DPO) methods have shown strong potential in aligning text-to-image diffusion models with human preferences by training on paired comparisons. These methods improve training stability by avoiding the REINFORCE algorithm but still struggle with challenges such as accurately estimating image probabilities due to the non-linear nature of the sigmoid function and the limited diversity of offline datasets. In this paper, we introduce Diffusion Denoising Ranking Optimization (Diffusion-DRO), a new preference learning framework grounded in inverse reinforcement learning. Diffusion-DRO removes the dependency on a reward model by casting preference learning as a ranking problem, thereby simplifying the training objective into a denoising formulation and overcoming the non-linear estimation issues found in prior methods. Moreover, Diffusion-DRO uniquely integrates offline expert demonstrations with online policy-generated negative samples, enabling it to effectively capture human preferences while addressing the limitations of offline data. Comprehensive experiments show that Diffusion-DRO delivers improved generation quality across a range of challenging and unseen prompts, outperforming state-of-the-art baselines in both quantitative metrics and user studies. Our source code and pre-trained models are available at this https URL.
中文摘要 直接偏好优化（DPO）方法通过配对比较训练，在使文本到图像扩散模型与人类偏好保持一致方面显示出强大的潜力。这些方法通过避免REINFORCE算法提高了训练稳定性，但由于sigmoid函数的非线性特性和离线数据集的多样性有限，仍然难以应对准确估计图像概率等挑战。在本文中，我们介绍了扩散去噪排名优化（Diffusion-DRO），这是一种基于逆强化学习的新型偏好学习框架。Diffusion-DRO通过将偏好学习作为排名问题来消除对奖励模型的依赖，从而将训练目标简化为去噪公式，并克服了先前方法中发现的非线性估计问题。此外，Diffusion-DRO 独特地将线下专家演示与在线政策生成的负面样本相结合，使其能够有效捕捉人类偏好，同时解决线下数据的局限性。综合实验表明，Diffusion-DRO 在一系列具有挑战性和看不见的提示中提供了更高的生成质量，在定量指标和用户研究方面都优于最先进的基线。我们的源代码和预训练模型可在此 https URL 中找到。

MENTOR: A Reinforcement Learning Framework for Model Enhancement via Teacher-Optimized Rewards in Small Models

MENTOR：在小模型中通过教师优化奖励增强模型的强化学习框架

Authors: ChangSu Choi, Hoyun Song, Dongyeon Kim, WooHyeon Jung, Minkyung Cho, Sunjin Park, NohHyeob Bae, Seona Yu, KyungTae Lim
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.18383
Pdf link: https://arxiv.org/pdf/2510.18383
Abstract Distilling the tool-using capabilities of large language models (LLMs) into smaller, more efficient small language models (SLMs) is a key challenge for their practical application. The predominant approach, supervised fine-tuning (SFT), suffers from poor generalization as it trains models to imitate a static set of teacher trajectories rather than learn a robust methodology. While reinforcement learning (RL) offers an alternative, the standard RL using sparse rewards fails to effectively guide SLMs, causing them to struggle with inefficient exploration and adopt suboptimal strategies. To address these distinct challenges, we propose MENTOR, a framework that synergistically combines RL with teacher-guided distillation. Instead of simple imitation, MENTOR employs an RL-based process to learn a more generalizable policy through exploration. In addition, to solve the problem of reward sparsity, it uses a teacher's reference trajectory to construct a dense, composite teacher-guided reward that provides fine-grained guidance. Extensive experiments demonstrate that MENTOR significantly improves the cross-domain generalization and strategic competence of SLMs compared to both SFT and standard sparse-reward RL baselines.
中文摘要 将大型语言模型（LLM）的工具使用能力提炼成更小、更高效的小型语言模型（SLM）是其实际应用的关键挑战。主要方法监督微调（SFT）的泛化能力很差，因为它训练模型模仿一组静态的教师轨迹，而不是学习稳健的方法。虽然强化学习（RL）提供了另一种选择，但使用稀疏奖励的标准强化学习无法有效指导SLM，导致它们难以应对低效的探索并采用次优策略。为了应对这些独特的挑战，我们提出了 MENTOR，这是一个将 RL 与教师指导的蒸馏协同结合的框架。MENTOR 不是简单的模仿，而是采用基于 RL 的流程，通过探索来学习更通用的策略。此外，为了解决奖励稀疏的问题，它利用教师的参考轨迹构建了一个密集的、复合的教师引导奖励，提供细粒度的指导。广泛的实验表明，与 SFT 和标准稀疏奖励 RL 基线相比，MENTOR 显着提高了 SLM 的跨域泛化和战略能力。

On AI Verification in Open RAN

关于Open RAN中的AI验证

Authors: Rahul Soundrarajan, Claudio Fiandrino, Michele Polese, Salvatore D'Oro, Leonardo Bonati, Tommaso Melodia
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.18417
Pdf link: https://arxiv.org/pdf/2510.18417
Abstract Open RAN introduces a flexible, cloud-based architecture for the Radio Access Network (RAN), enabling Artificial Intelligence (AI)/Machine Learning (ML)-driven automation across heterogeneous, multi-vendor deployments. While EXplainable Artificial Intelligence (XAI) helps mitigate the opacity of AI models, explainability alone does not guarantee reliable network operations. In this article, we propose a lightweight verification approach based on interpretable models to validate the behavior of Deep Reinforcement Learning (DRL) agents for RAN slicing and scheduling in Open RAN. Specifically, we use Decision Tree (DT)-based verifiers to perform near-real-time consistency checks at runtime, which would be otherwise unfeasible with computationally expensive state-of-the-art verifiers. We analyze the landscape of XAI and AI verification, propose a scalable architectural integration, and demonstrate feasibility with a DT-based slice-verifier. We also outline future challenges to ensure trustworthy AI adoption in Open RAN.
中文摘要 Open RAN为无线接入网络（RAN）引入了灵活的基于云的架构，在异构、多供应商部署中实现人工智能（AI）/机器学习（ML）驱动的自动化。虽然可解释的人工智能（XAI）有助于减轻人工智能模型的不透明度，但仅凭可解释性并不能保证可靠的网络运行。在本文中，我们提出了一种基于可解释模型的轻量级验证方法，以验证深度强化学习（DRL）代理在开放式RAN中进行RAN切片和调度的行为。具体来说，我们使用基于决策树（DT）的验证器在运行时执行近乎实时的一致性检查，否则对于计算成本高昂的最先进的验证器来说，这是不可行的。我们分析了 XAI 和 AI 验证的前景，提出了可扩展的架构集成，并展示了基于 DT 的切片验证器的可行性。我们还概述了未来的挑战，以确保在 Open RAN 中采用值得信赖的 AI。

DeLoad: Demand-Driven Short-Video Preloading with Scalable Watch-Time Estimation

DeLoad：需求驱动的短视频预加载，具有可扩展的观看时间估计功能

Authors: Tong Liu, Zhiwei Fan, Guanyan Peng, Haodan Zhang, Yucheng Zhang, Zhen Wang, Pengjin Xie, Liang Liu
Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2510.18459
Pdf link: https://arxiv.org/pdf/2510.18459
Abstract Short video streaming has become a dominant paradigm in digital media, characterized by rapid swiping interactions and diverse media content. A key technical challenge is designing an effective preloading strategy that dynamically selects and prioritizes download tasks from an evolving playlist, balancing Quality of Experience (QoE) and bandwidth efficiency under practical commercial constraints. However, real-world analysis reveals critical limitations of existing approaches: (1) insufficient adaptation of download task sizes to dynamic conditions, and (2) watch-time prediction models that are difficult to deploy reliably at scale. In this paper, we propose DeLoad, a novel preloading framework that addresses these issues by introducing dynamic task sizing and a practical, multi-dimensional watch-time estimation method. Additionally, a Deep Reinforcement Learning (DRL)-enhanced agent is trained to optimize the download range decisions adaptively. Extensive evaluations conducted on an offline testing platform, leveraging massive real-world network data, demonstrate that DeLoad achieves significant improvements in QoE metrics (34.4% to 87.4% gain). Furthermore, after deployment on a large-scale commercial short-video platform, DeLoad has increased overall user watch time by 0.09% while simultaneously reducing rebuffering events and cutting bandwidth consumption by 3.76%.
中文摘要 短视频流已成为数字媒体的主导范式，其特点是快速刷动互动和多样化的媒体内容。一个关键的技术挑战是设计一种有效的预加载策略，从不断发展的播放列表中动态选择下载任务并确定其优先级，从而在实际商业限制下平衡体验质量（QoE）和带宽效率。然而，实际分析揭示了现有方法的关键局限性：（1）下载任务大小对动态条件的适应不足，以及（2）观看时间预测模型难以大规模可靠部署。在本文中，我们提出了DeLoad，这是一种新颖的预加载框架，它通过引入动态任务大小调整和实用的多维观看时间估计方法来解决这些问题。此外，还训练了深度强化学习（DRL）增强型代理，以自适应地优化下载范围决策。在离线测试平台上进行的广泛评估，利用海量真实世界的网络数据，表明 DeLoad 在 QoE 指标方面取得了显着改进（34.4% 至 87.4% 的增益）。此外，在大规模商业短视频平台部署后，DeLoad 将用户整体观看时长提高了 0.09%，同时减少了重新缓冲事件和 3.76% 的带宽消耗。

CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment

CodeRL+：通过执行语义对齐的强化改进代码生成

Authors: Xue Jiang, Yihong Dong, Mengyang Liu, Hongyi Deng, Tian Wang, Yongding Tao, Rongyu Cao, Binhua Li, Zhi Jin, Wenpin Jiao, Fei Huang, Yongbin Li, Ge Li
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.18471
Pdf link: https://arxiv.org/pdf/2510.18471
Abstract While Large Language Models (LLMs) excel at code generation by learning from vast code corpora, a fundamental semantic gap remains between their training on textual patterns and the goal of functional correctness, which is governed by formal execution semantics. Reinforcement Learning with Verifiable Rewards (RLVR) approaches attempt to bridge this gap using outcome rewards from executing test cases. However, solely relying on binary pass/fail signals is inefficient for establishing a well-aligned connection between the textual representation of code and its execution semantics, especially for subtle logical errors within the code. In this paper, we propose CodeRL+, a novel approach that integrates execution semantics alignment into the RLVR training pipeline for code generation. CodeRL+ enables the model to infer variable-level execution trajectory, providing a direct learning signal of execution semantics. CodeRL+ can construct execution semantics alignment directly using existing on-policy rollouts and integrates seamlessly with various RL algorithms. Extensive experiments demonstrate that CodeRL+ outperforms post-training baselines (including RLVR and Distillation), achieving a 4.6% average relative improvement in pass@1. CodeRL+ generalizes effectively to other coding tasks, yielding 15.5% and 4.4% higher accuracy on code-reasoning and test-output-generation benchmarks, respectively. CodeRL+ shows strong applicability across diverse RL algorithms and LLMs. Furthermore, probe analyses provide compelling evidence that CodeRL+ strengthens the alignment between code's textual representations and its underlying execution semantics.
中文摘要 虽然大型语言模型（LLM）通过从庞大的代码语料库中学习来擅长代码生成，但它们对文本模式的训练与功能正确性目标（由形式执行语义控制）之间仍然存在根本的语义差距。具有可验证奖励的强化学习（RLVR）方法试图使用执行测试用例的结果奖励来弥合这一差距。然而，仅依靠二进制通过/失败信号对于在代码的文本表示及其执行语义之间建立良好对齐的连接效率低下，特别是对于代码中的细微逻辑错误。在本文中，我们提出了CodeRL+，这是一种将执行语义对齐集成到RLVR训练管道中以进行代码生成的新方法。CodeRL+ 使模型能够推断出可变级别的执行轨迹，提供执行语义的直接学习信号。CodeRL+ 可以直接使用现有的策略推出来构建执行语义对齐，并与各种 RL 算法无缝集成。大量实验表明，CodeRL+ 的性能优于训练后基线（包括 RLVR 和蒸馏），pass@1平均相对提高 4.6%。CodeRL+ 有效地推广到其他编码任务，在代码推理和测试输出生成基准测试中分别提高了 15.5% 和 4.4% 的准确率。CodeRL+ 在各种 RL 算法和 LLM 中表现出很强的适用性。此外，探针分析提供了令人信服的证据，表明 CodeRL+ 加强了代码的文本表示与其底层执行语义之间的一致性。
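
The CodeRL+ abstract describes having the model infer variable-level execution trajectories. As a rough illustration of what such a trajectory can look like (this is not the authors' pipeline), the sketch below records a snapshot of local variables before each executed line, using Python's standard `sys.settrace`. The helper names `trace_variables` and `running_max` are hypothetical and chosen only for this example.

```python
import sys


def trace_variables(func, *args):
    """Run func(*args) and record local variables before each executed line."""
    trajectory = []
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # ignore frames that belong to other functions
        if event == "line":
            # Line offset is relative to the `def` line of the traced function.
            offset = frame.f_lineno - code.co_firstlineno
            trajectory.append((offset, dict(frame.f_locals)))
        elif event == "return":
            trajectory.append(("return", arg))
        return tracer

    previous = sys.gettrace()  # restore any tracer that was already installed
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)
    return result, trajectory


def running_max(xs):
    best = xs[0]
    for x in xs[1:]:
        if x > best:
            best = x
    return best


result, trajectory = trace_variables(running_max, [3, 7, 5])
for step in trajectory:
    print(step)
```

Each tuple pairs a line offset with the variable values seen just before that line runs. That is one plausible form for a variable-level learning signal, and it is richer than a single pass/fail bit.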

Safe But Not Sorry: Reducing Over-Conservatism in Safety Critics via Uncertainty-Aware Modulation

安全但不后悔：通过不确定性感知调制减少安全批评者的过度保守主义

Authors: Daniel Bethell, Simos Gerasimou, Radu Calinescu, Calum Imrie
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.18478
Pdf link: https://arxiv.org/pdf/2510.18478
Abstract Ensuring the safe exploration of reinforcement learning (RL) agents is critical for deployment in real-world systems. Yet existing approaches struggle to strike the right balance: methods that tightly enforce safety often cripple task performance, while those that prioritize reward leave safety constraints frequently violated, producing diffuse cost landscapes that flatten gradients and stall policy improvement. We introduce the Uncertain Safety Critic (USC), a novel approach that integrates uncertainty-aware modulation and refinement into critic training. By concentrating conservatism in uncertain and costly regions while preserving sharp gradients in safe areas, USC enables policies to achieve effective reward-safety trade-offs. Extensive experiments show that USC reduces safety violations by approximately 40% while maintaining competitive or higher rewards, and reduces the error between predicted and true cost gradients by approximately 83%, breaking the prevailing trade-off between safety and performance and paving the way for scalable safe RL.
中文摘要 确保强化学习（RL）代理的安全探索对于在实际系统中部署至关重要。然而，现有的方法难以取得适当的平衡：严格执行安全的方法往往会削弱任务绩效，而那些优先考虑奖励的方法则经常违反安全约束，从而产生分散的成本格局，使梯度变平并阻碍策略改进。我们介绍了不确定安全评论家（USC），这是一种将不确定性感知调制和细化整合到评论家训练中的新方法。通过将保守主义集中在不确定和成本高昂的区域，同时在安全区域保持急剧的梯度，USC 使策略能够实现有效的奖励与安全权衡。广泛的实验表明，USC 在保持竞争性或更高回报的同时，将安全违规行为减少了约 40%，并将预测成本梯度和真实成本梯度之间的误差降低了约 83%，打破了安全性和性能之间的普遍权衡，并为可扩展的安全 RL 铺平了道路。

Socialized Learning and Emergent Behaviors in Multi-Agent Systems based on Multimodal Large Language Models

基于多模态大语言模型的多智能体系统中的社会化学习与涌现行为

Authors: Sureyya Akin, Shruti T. Tiwari, Ram Bhattacharya, Sagar A. Raman, Kiran Mohanty, Sita Krishnan
Subjects: Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2510.18515
Pdf link: https://arxiv.org/pdf/2510.18515
Abstract This research introduces the Multimodal Socialized Learning Framework (M-S2L), designed to foster emergent social intelligence in AI agents by integrating Multimodal Large Language Models (M-LLMs) with social learning mechanisms. The framework equips agents with multimodal perception (vision and text) and structured action capabilities, enabling physical manipulation and grounded multimodal communication (e.g., text with visual pointers). M-S2L combines direct reinforcement learning with two novel social learning pathways: multimodal observational learning and communication-driven learning from feedback, augmented by an episodic memory system for long-term social context. We evaluate M-S2L in a Collaborative Assembly Environment (CAE), where agent teams must construct complex devices from ambiguous blueprints under informational asymmetry. Across tasks of increasing complexity, M-S2L agents consistently outperform Text-Only and No-Social-Learning baselines in Task Completion Rate and Time to Completion, particularly in dynamic problem-solving scenarios. Ablation studies confirm the necessity of both multimodality and socialized learning. Our analysis reveals the emergence of efficient communication protocols integrating visual pointers with concise text, alongside rapid role specialization leading to stable labor division. Qualitative case studies demonstrate agents' abilities for shared awareness, dynamic re-planning, and adaptive problem-solving, suggesting a nascent form of machine social cognition. These findings indicate that integrating multimodal perception with explicit social learning is critical for developing human-like collaborative intelligence in multi-agent systems.
中文摘要 本研究提出了多模态社交学习框架（M-S2L），旨在通过将多模态大型语言模型（M-LLM）与社交学习机制相结合来培养人工智能代理的新兴社交智能。该框架为智能体提供了多模态感知（视觉和文本）和结构化动作能力，从而实现物理操作和基于情境的（grounded）多模态交流（例如，带有视觉指针的文本）。M-S2L 将直接强化学习与两种新颖的社会学习途径相结合：多模态观察学习和反馈驱动的交流驱动学习，并通过用于长期社会背景的情景记忆系统进行增强。我们在协作装配环境（CAE）中评估 M-S2L，其中代理团队必须在信息不对称下从模棱两可的蓝图构建复杂的设备。在复杂性不断增加的任务中，M-S2L 代理在任务完成率和完成时间方面始终优于纯文本和无社交学习基线，特别是在动态问题解决场景中。消融研究证实了多模态和社会化学习的必要性。我们的分析揭示了将视觉指针与简洁文本相结合的高效通信协议的出现，以及快速的角色专业化，从而实现稳定的劳动分工。定性案例研究展示了智能体在共享意识、动态重新规划和适应性问题解决方面的能力，这表明了机器社会认知的新兴形式。这些发现表明，将多模态感知与显性社交学习相结合对于在多智能体系统中发展类人协作智能至关重要。

Efficient Model-Based Reinforcement Learning for Robot Control via Online Learning

基于模型的高效在线学习机器人控制强化学习

Authors: Fang Nan, Hao Ma, Qinghua Guan, Josie Hughes, Michael Muehlebach, Marco Hutter
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.18518
Pdf link: https://arxiv.org/pdf/2510.18518
Abstract We present an online model-based reinforcement learning algorithm suitable for controlling complex robotic systems directly in the real world. Unlike prevailing sim-to-real pipelines that rely on extensive offline simulation and model-free policy optimization, our method builds a dynamics model from real-time interaction data and performs policy updates guided by the learned dynamics model. This efficient model-based reinforcement learning scheme significantly reduces the number of samples to train control policies, enabling direct training on real-world rollout data. This significantly reduces the influence of bias in the simulated data, and facilitates the search for high-performance control policies. We adopt online learning analysis to derive sublinear regret bounds under standard stochastic online optimization assumptions, providing formal guarantees on performance improvement as more interaction data are collected. Experimental evaluations were performed on a hydraulic excavator arm and a soft robot arm, where the algorithm demonstrates strong sample efficiency compared to model-free reinforcement learning methods, reaching comparable performance within hours. Robust adaptation to shifting dynamics was also observed when the payload condition was randomized. Our approach paves the way toward efficient and reliable on-robot learning for a broad class of challenging control tasks.
中文摘要 我们提出了一种基于模型的在线强化学习算法，适用于直接在现实世界中控制复杂的机器人系统。与依赖大量离线仿真和无模型策略优化的主流模拟到真实管道不同，我们的方法从实时交互数据构建动力学模型，并在学习到的动力学模型的指导下执行策略更新。这种基于模型的高效强化学习方案显着减少了训练控制策略的样本数量，从而能够对真实世界的推出数据进行直接训练。这大大降低了模拟数据中偏差的影响，并有助于寻找高性能控制策略。我们采用在线学习分析，在标准随机在线优化假设下推导亚线性遗憾边界，为随着交互数据的收集而提高性能提供形式保证。在液压挖掘机臂和软体机器人臂上进行了实验评估，与无模型强化学习方法相比，该算法表现出较强的样本效率，在数小时内达到相当的性能。当有效载荷条件被随机化时，还观察到对移动动力学的稳健适应。我们的方法为高效、可靠的机器人学习铺平了道路，适用于各种具有挑战性的控制任务。

Deep Q-Learning Assisted Bandwidth Reservation for Multi-Operator Time-Sensitive Vehicular Networking

深度Q学习辅助带宽预留，用于多运营商时间敏感车联网

Authors: Abdullah Al-Khatib, Albert Gergus, Muneeb Ul Hassan, Abdelmajid Khelil, Klaus Mossner, Holger Timinger
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2510.18553
Pdf link: https://arxiv.org/pdf/2510.18553
Abstract Very few available individual bandwidth reservation schemes provide efficient and cost-effective bandwidth reservation that is required for safety-critical and time-sensitive vehicular networked applications. These schemes allow vehicles to make reservation requests for the required resources. Accordingly, a Mobile Network Operator (MNO) can allocate and guarantee bandwidth resources based on these requests. However, due to uncertainty in future reservation time and bandwidth costs, the design of an optimized reservation strategy is challenging. In this article, we propose a novel multi-objective bandwidth reservation update approach with an optimal strategy based on Double Deep Q-Network (DDQN). The key design objectives are to minimize the reservation cost with multiple MNOs and to ensure reliable resource provisioning in uncertain situations by solving scenarios such as underbooked and overbooked reservations along the driving path. The enhancements and advantages of our proposed strategy have been demonstrated through extensive experimental results when compared to other methods like greedy update or other deep reinforcement learning approaches. Our strategy demonstrates a 40% reduction in bandwidth costs across all investigated scenarios and simultaneously resolves uncertain situations in a cost-effective manner.
中文摘要 很少有可用的单独带宽预留方案能够提供安全关键型和时间敏感型车联网应用所需的高效且经济高效的带宽预留。这些方案允许车辆对所需资源提出预订请求。因此，移动网络运营商（MNO）可以根据这些请求分配和保证带宽资源。然而，由于未来预留时间和带宽成本的不确定性，优化预留策略的设计具有挑战性。本文提出了一种基于双深Q网络（DDQN）的最优策略的新型多目标带宽预留更新方法。关键设计目标是最大限度地降低多个移动网络运营商的预订成本，并通过解决行驶路径上的预订不足和超额预订等场景，确保在不确定的情况下提供可靠的资源配置。与贪婪更新或其他深度强化学习方法等其他方法相比，我们提出的策略的增强和优势已经通过广泛的实验结果得到证明。我们的策略表明，在所有调查的场景中，带宽成本降低了 40%，同时以经济高效的方式解决了不确定的情况。
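
The reservation strategy above is built on a Double Deep Q-Network (DDQN). The sketch below shows only the generic Double DQN target computation: the online network selects the next action and the target network evaluates it. The three-action layout, the cost-as-negative-reward convention, and all numbers are assumptions made for illustration, not values from the paper.

```python
def ddqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double DQN target: pick the action with the online net, score it with the target net."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


def dqn_target(reward, next_q_target, gamma=0.99, done=False):
    """Vanilla DQN target for comparison: the target net both picks and scores the action."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


# Hypothetical actions: 0 = keep reservation, 1 = update with MNO A, 2 = update with MNO B.
reward = -1.0                      # negative reservation cost (assumed convention)
next_q_online = [1.0, 2.0, 1.5]    # online-network estimates for the next state
next_q_target = [1.2, 0.5, 3.0]    # target-network estimates for the next state

print(ddqn_target(reward, next_q_online, next_q_target, gamma=0.9))
print(dqn_target(reward, next_q_target, gamma=0.9))
```

Decoupling action selection from evaluation is the standard Double DQN remedy for the overestimation that a plain `max` over noisy estimates introduces. In the toy numbers the vanilla target is pulled up by the single large entry in `next_q_target`.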

Sherlock Your Queries: Learning to Ask the Right Questions for Dialogue-Based Retrieval

夏洛克你的查询：学习提出正确的问题以进行基于对话的检索

Authors: Dong Yun, Marco Schouten, Dim Papadopoulos
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.18659
Pdf link: https://arxiv.org/pdf/2510.18659
Abstract User queries in information retrieval are often ambiguous, making it challenging for systems to identify a user's target from a single query. While recent dialogue-based interactive retrieval systems can clarify user intent, they are inefficient as they often lack an explicit strategy to ask the most informative questions. To address this limitation, we propose SherlockLLM, a dialogue-driven retrieval framework that learns an optimal questioning strategy via Reinforcement Learning (RL) and avoids the need for large-scale annotated dialogue data. In our framework, an agent is trained to generate a sequence of binary questions to efficiently narrow down the search space. To validate our approach, we introduce a benchmark with both structured and unstructured tasks. Experimental results show that SherlockLLM is a robust and efficient solution. On the structured tasks, its performance matches strong baselines and approaches the theoretical optimum defined by binary search. On the challenging unstructured task, our agent significantly outperforms these baselines, showcasing its ability to learn a highly effective information-seeking dialogue policy.
中文摘要 信息检索中的用户查询通常是模棱两可的，这使得系统很难从单个查询中识别用户的目标。虽然最近基于对话的交互式检索系统可以阐明用户意图，但它们效率低下，因为它们通常缺乏明确的策略来提出信息最丰富的问题。为了解决这一限制，我们提出了 SherlockLLM，这是一种对话驱动的检索框架，它通过强化学习（RL）学习最佳提问策略，并避免了对大规模注释对话数据的需求。在我们的框架中，代理被训练生成一系列二元问题，以有效地缩小搜索空间。为了验证我们的方法，我们引入了结构化和非结构化任务的基准测试。实验结果表明，SherlockLLM是一种稳健高效的解决方案。在结构化任务上，其性能与强基线相匹配，并接近二分搜索定义的理论最优。在具有挑战性的非结构化任务中，我们的代理显着优于这些基线，展示了其学习高效信息寻求对话策略的能力。
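
The abstract measures the structured tasks against the optimum defined by binary search. A minimal sketch of that reference point follows, assuming an idealized oracle that answers every yes/no question truthfully and questions that can split the candidate pool in half. The function names are hypothetical.

```python
def min_binary_questions(n_candidates):
    """Worst-case number of yes/no questions needed to isolate one of n candidates."""
    return (n_candidates - 1).bit_length()  # equals ceil(log2(n)) for n >= 1


def identify(target, candidates):
    """Isolate the target by halving the candidate pool with yes/no questions."""
    pool = sorted(candidates)
    asked = 0
    while len(pool) > 1:
        mid = len(pool) // 2
        lower, upper = pool[:mid], pool[mid:]
        # Oracle answer to the binary question "is the target in the lower half?"
        pool = lower if target in lower else upper
        asked += 1
    return pool[0], asked


items = list(range(1000))
found, asked = identify(737, items)
print(found, asked, min_binary_questions(len(items)))
```

A learned questioning policy cannot beat this bound in the worst case when answers are binary, which is why approaching it is a meaningful target on structured data.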

Reinforcement Learning with Imperfect Transition Predictions: A Bellman-Jensen Approach

具有不完美过渡预测的强化学习：Bellman-Jensen 方法

Authors: Chenbei Lu, Zaiwei Chen, Tongxin Li, Chenye Wu, Adam Wierman
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.18687
Pdf link: https://arxiv.org/pdf/2510.18687
Abstract Traditional reinforcement learning (RL) assumes the agents make decisions based on Markov decision processes (MDPs) with one-step transition models. In many real-world applications, such as energy management and stock investment, agents can access multi-step predictions of future states, which provide additional advantages for decision making. However, multi-step predictions are inherently high-dimensional: naively embedding these predictions into an MDP leads to an exponential blow-up in state space and the curse of dimensionality. Moreover, existing RL theory provides few tools to analyze prediction-augmented MDPs, as it typically works on one-step transition kernels and cannot accommodate multi-step predictions with errors or partial action-coverage. We address these challenges with three key innovations: First, we propose the Bayesian value function to characterize the optimal prediction-aware policy tractably. Second, we develop a novel Bellman-Jensen Gap analysis on the Bayesian value function, which enables characterizing the value of imperfect predictions. Third, we introduce BOLA (Bayesian Offline Learning with Online Adaptation), a two-stage model-based RL algorithm that separates offline Bayesian value learning from lightweight online adaptation to real-time predictions. We prove that BOLA remains sample-efficient even under imperfect predictions. We validate our theory and algorithm on synthetic MDPs and a real-world wind energy storage control problem.
中文摘要 传统的强化学习（RL）假设智能体基于具有一步转换模型的马尔可夫决策过程（MDP）做出决策。在许多实际应用中，例如能源管理和股票投资，代理可以访问对未来状态的多步预测，这为决策提供了额外的优势。然而，多步预测本质上是高维的：天真地将这些预测嵌入到 MDP 中会导致状态空间的指数级爆炸和维度的诅咒。此外，现有的RL理论几乎没有提供分析预测增强MDP的工具，因为它通常适用于一步过渡核，无法适应具有错误或部分动作覆盖的多步预测。我们通过三项关键创新来应对这些挑战：首先，我们提出了 \emph{贝叶斯值函数} 来概括地表征最佳预测感知策略。其次，我们开发了一种关于贝叶斯值函数的新 \emph{Bellman-Jensen Gap} 分析，该分析能够表征不完美预测的值。第三，我们引入了BOLA（Bayesian Offline Learning with Online Adaptation），这是一种基于两阶段模型的RL算法，它将离线贝叶斯值学习与轻量级在线自适应分开到实时预测。我们证明，即使在不完美的预测下，BOLA 仍然保持样本效率。我们验证了我们在合成 MDP 和现实世界的风能存储控制问题上的理论和算法。
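
One way to see why access to predictions can add value is Jensen's inequality for the (convex) max operator: the expected value of the best action chosen after seeing the scenario is at least the value of the best action chosen beforehand. The toy numbers below are invented, and this one-step illustration is only an intuition aid; it should not be read as the paper's definition of the Bellman-Jensen Gap.

```python
def value_without_prediction(q, probs):
    """max_a E[Q(s, a)]: commit to one action before the scenario is revealed."""
    n_actions = len(q[0])
    return max(
        sum(p * row[a] for p, row in zip(probs, q)) for a in range(n_actions)
    )


def value_with_prediction(q, probs):
    """E[max_a Q(s, a)]: choose the best action after a perfect scenario prediction."""
    return sum(p * max(row) for p, row in zip(probs, q))


# Rows: hypothetical scenarios (e.g. high wind, low wind); columns: two actions.
q = [[10.0, 4.0],
     [0.0, 6.0]]
probs = [0.5, 0.5]

blind = value_without_prediction(q, probs)
informed = value_with_prediction(q, probs)
print(blind, informed, informed - blind)
```

The difference printed last is non-negative for any `q` and `probs`. With imperfect predictions the achievable gain would lie somewhere between zero and this perfect-information gap.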

Preference-based Reinforcement Learning beyond Pairwise Comparisons: Benefits of Multiple Options

超越成对比较的基于偏好的强化学习：多种选择的好处

Authors: Joongkyu Lee, Seouh-won Yi, Min-hwan Oh
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2510.18713
Pdf link: https://arxiv.org/pdf/2510.18713
Abstract We study online preference-based reinforcement learning (PbRL) with the goal of improving sample efficiency. While a growing body of theoretical work has emerged, motivated by PbRL's recent empirical success, particularly in aligning large language models (LLMs), most existing studies focus only on pairwise comparisons. A few recent works (Zhu et al., 2023, Mukherjee et al., 2024, Thekumparampil et al., 2024) have explored using multiple comparisons and ranking feedback, but their performance guarantees fail to improve, and can even deteriorate, as the feedback length increases, despite the richer information available. To address this gap, we adopt the Plackett-Luce (PL) model for ranking feedback over action subsets and propose M-AUPO, an algorithm that selects multiple actions by maximizing the average uncertainty within the offered subset. We prove that M-AUPO achieves a suboptimality gap of $\tilde{\mathcal{O}}\left( \frac{d}{T} \sqrt{ \sum_{t=1}^T \frac{1}{|S_t|}} \right)$, where $T$ is the total number of rounds, $d$ is the feature dimension, and $|S_t|$ is the size of the subset at round $t$. This result shows that larger subsets directly lead to improved performance and, notably, the bound avoids the exponential dependence on the unknown parameter's norm, which was a fundamental limitation in most previous works. Moreover, we establish a near-matching lower bound of $\Omega \left( \frac{d}{K \sqrt{T}} \right)$, where $K$ is the maximum subset size. To the best of our knowledge, this is the first theoretical result in PbRL with ranking feedback that explicitly shows improved sample efficiency as a function of the subset size.
中文摘要 我们研究基于偏好的在线强化学习（PbRL），以提高样本效率。虽然受 PbRL 最近的实证成功（特别是在对齐大型语言模型（LLM）方面）推动，越来越多的理论工作已经出现，但大多数现有研究只关注成对比较。最近的一些工作（Zhu 等人，2023 年，Mukherjee 等人，2024 年，Thekumparampil 等人，2024 年）探索了使用多重比较和排名反馈，但它们的性能保证未能提高，甚至会随着反馈长度的增加而恶化，尽管有更丰富的信息可用。为了解决这一差距，我们采用 Plackett-Luce（PL）模型来刻画动作子集上的排名反馈，并提出了 M-AUPO，这是一种通过最大化所提供子集中的平均不确定性来选择多个动作的算法。我们证明 M-AUPO 实现了 $\tilde{\mathcal{O}}\left( \frac{d}{T} \sqrt{ \sum_{t=1}^T \frac{1}{|S_t|}} \right)$ 的次优性差距，其中 $T$ 是总轮数，$d$ 是特征维度，$|S_t|$ 是第 $t$ 轮子集的大小。这一结果表明，较大的子集直接导致性能的提高，值得注意的是，该界避免了对未知参数范数的指数依赖，这是大多数先前工作中的基本限制。此外，我们建立了 $\Omega \left( \frac{d}{K \sqrt{T}} \right)$ 的近似匹配下界，其中 $K$ 是最大子集大小。据我们所知，这是 PbRL 中第一个具有排名反馈的理论结果，该结果明确显示样本效率随子集大小的增大而提高。
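
To make the dependence on subset size concrete, the stated upper bound can be specialized to the case where every round offers the same number of options, $|S_t| = K$ for all $t$. This is a simplifying assumption made here for illustration; the abstract allows $|S_t|$ to vary by round, and the $\tilde{\mathcal{O}}$ notation hides logarithmic factors.

```latex
\frac{d}{T}\sqrt{\sum_{t=1}^{T}\frac{1}{|S_t|}}
  \;=\; \frac{d}{T}\sqrt{\frac{T}{K}}
  \;=\; \frac{d}{\sqrt{TK}},
\qquad\text{so for pairwise feedback } (K = 2):\quad \frac{d}{\sqrt{2T}} .
```

Under this assumption, moving from pairwise comparisons to subsets of size $K$ tightens the upper bound by a factor of $\sqrt{K/2}$. Comparing $d/\sqrt{TK}$ with the lower bound $d/(K\sqrt{T})$ leaves a remaining gap of order $\sqrt{K}$.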

Verifiable Accuracy and Abstention Rewards in Curriculum RL to Alleviate Lost-in-Conversation

课程 RL 中可验证的准确性和弃权奖励，以减少对话中的丢失

Authors: Ming Li
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.18731
Pdf link: https://arxiv.org/pdf/2510.18731
Abstract Large Language Models demonstrate strong capabilities in single-turn instruction following but suffer from Lost-in-Conversation (LiC), a degradation in performance as information is revealed progressively in multi-turn settings. Motivated by the current progress on Reinforcement Learning with Verifiable Rewards (RLVR), we propose Curriculum Reinforcement Learning with Verifiable Accuracy and Abstention Rewards (RLAAR), a framework that encourages models not only to generate correct answers, but also to judge the solvability of questions in the multi-turn conversation setting. Our approach employs a competence-gated curriculum that incrementally increases dialogue difficulty (in terms of instruction shards), stabilizing training while promoting reliability. Using multi-turn, on-policy rollouts and a mixed-reward system, RLAAR teaches models to balance problem-solving with informed abstention, reducing premature answering behaviors that cause LiC. Evaluated on LiC benchmarks, RLAAR significantly mitigates LiC performance decay (62.6% to 75.1%) and improves calibrated abstention rates (33.5% to 73.4%). Together, these results provide a practical recipe for building multi-turn reliable and trustworthy LLMs.
中文摘要 大型语言模型在单轮指令跟踪方面表现出强大的能力，但存在对话丢失（LiC）的问题，即随着信息在多轮设置中逐渐显示，性能会下降。在当前具有可验证奖励的强化学习（RLVR）进展的推动下，我们提出了具有可验证准确性和弃权奖励的课程强化学习（RLAAR），该框架鼓励模型不仅生成正确答案，而且在多轮对话环境中判断问题的可解决性。我们的方法采用能力门控课程，逐步增加对话难度（就指令分片而言），稳定训练，同时提高可靠性。RLAAR 使用多轮同策略（on-policy）采样和混合奖励系统，教模型平衡解决问题与知情弃权，减少导致 LiC 的过早回答行为。在 LiC 基准上进行评估，RLAAR 显著减轻了 LiC 性能下降（62.6% 至 75.1%），并提高了校准弃权率（33.5% 至 73.4%）。这些结果共同为构建在多轮对话中可靠且值得信赖的 LLM 提供了实用的方案。
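
RLAAR is described as combining verifiable accuracy and abstention rewards with a competence-gated curriculum over instruction shards. The sketch below is one possible reading of those two ideas. The reward values, the threshold, the window size, and the names `mixed_reward` and `CompetenceGate` are all assumptions for illustration; the abstract does not specify them.

```python
from collections import deque


def mixed_reward(answer, gold, solvable):
    """Hypothetical verifiable reward: 1 for a correct answer or a justified abstention."""
    abstained = answer is None
    if solvable:
        return 1.0 if (not abstained and answer == gold) else 0.0
    return 1.0 if abstained else 0.0  # answering an unsolvable turn is premature


class CompetenceGate:
    """Raise dialogue difficulty (number of shards) once recent rewards are high enough."""

    def __init__(self, max_shards, threshold=0.8, window=4):
        self.shards = 1
        self.max_shards = max_shards
        self.threshold = threshold
        self.recent = deque(maxlen=window)

    def update(self, reward):
        self.recent.append(reward)
        window_full = len(self.recent) == self.recent.maxlen
        mean = sum(self.recent) / len(self.recent)
        if window_full and mean >= self.threshold and self.shards < self.max_shards:
            self.shards += 1
            self.recent.clear()  # re-measure competence at the new difficulty
        return self.shards


gate = CompetenceGate(max_shards=3, threshold=0.75, window=4)
for r in [1.0, 1.0, 0.0, 1.0]:
    level = gate.update(r)
print(level)
```

Gating on a rolling average rather than a fixed schedule means difficulty only increases once the model is reliably balancing answering and abstaining at the current level.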

WebSeer: Training Deeper Search Agents through Reinforcement Learning with Self-Reflection

WebSeer：通过自我反思的强化学习训练更深层次的搜索代理

Authors: Guanzhong He, Zhen Yang, Jinxin Liu, Bin Xu, Lei Hou, Juanzi Li
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.18798
Pdf link: https://arxiv.org/pdf/2510.18798
Abstract Search agents have achieved significant advancements in enabling intelligent information retrieval and decision-making within interactive environments. Although reinforcement learning has been employed to train agentic models capable of more dynamic interactive retrieval, existing methods are limited by shallow tool-use depth and the accumulation of errors over multiple iterative interactions. In this paper, we present WebSeer, a more intelligent search agent trained via reinforcement learning enhanced with a self-reflection mechanism. Specifically, we construct a large dataset annotated with reflection patterns and design a two-stage training framework that unifies cold start and reinforcement learning within the self-reflection paradigm for real-world web-based environments, which enables the model to generate longer and more reflective tool-use trajectories. Our approach substantially extends tool-use chains and improves answer accuracy. Using a single 14B model, we achieve state-of-the-art results on HotpotQA and SimpleQA, with accuracies of 72.3% and 90.0%, respectively, and demonstrate strong generalization to out-of-distribution datasets. The code is available at this https URL
中文摘要 搜索代理在交互式环境中实现智能信息检索和决策方面取得了重大进步。尽管强化学习已被用于训练能够进行更动态交互式检索的代理模型，但现有方法受到工具使用深度浅和多次迭代交互中错误积累的限制。在本文中，我们提出了 WebSeer，这是一种通过强化学习训练的更智能的搜索代理，并通过自我反思机制增强。具体来说，我们构建了一个用反射模式注释的大型数据集，并设计了一个两阶段的训练框架，将冷启动和强化学习统一在现实世界基于 Web 的环境中的自我反思范式中，使模型能够生成更长、更具反思性的工具使用轨迹。我们的方法大大扩展了工具使用链并提高了答案的准确性。使用单个 14B 模型，我们在 HotpotQA 和 SimpleQA 上取得了最先进的结果，准确率分别为 72.3% 和 90.0%，并展示了对分布外数据集的很强的泛化性。该代码可在此 https URL 中找到

Online SFT for LLM Reasoning: Surprising Effectiveness of Self-Tuning without Rewards

用于 LLM 推理的在线 SFT：无奖励的自我调整的惊人效果

Authors: Mengqi Li, Lei Zhao, Anthony Man-Cho So, Ruoyu Sun, Xiao Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.18814
Pdf link: https://arxiv.org/pdf/2510.18814
Abstract We present a simple, self-help online supervised finetuning (OSFT) paradigm for LLM reasoning. In this paradigm, the model generates its own responses and is immediately finetuned on this self-generated data. OSFT is a highly efficient training strategy for LLM reasoning, as it is reward-free and uses just one rollout by default. Experiment results show that OSFT achieves downstream performance on challenging mathematical reasoning tasks comparable to strong reinforcement learning with verifiable rewards (RLVR) methods such as GRPO. Our ablation study further demonstrates the efficiency and robustness of OSFT. The major mechanism of OSFT lies in facilitating the model's own existing preference (latent knowledge) learned from pretraining, which leads to reasoning ability improvement. We believe that OSFT offers an efficient and promising alternative to more complex, reward-based training paradigms. Our code is available at this https URL.
中文摘要 我们提出了一种简单的自助在线监督微调（OSFT）范式，用于 LLM 推理。在这种范式中，模型生成自己的响应，并立即根据这些自生成的数据进行微调。OSFT 是一种高效的 LLM 推理训练策略，因为它是无奖励的，并且默认情况下仅使用一次推出。实验结果表明，OSFT在具有挑战性的数学推理任务上取得了与GRPO等可验证奖励（RLVR）方法的强强化学习相媲美的下游性能。我们的消融研究进一步证明了 OSFT 的效率和稳健性。OSFT的主要机制在于促进模型自身从预训练中学习到的现有偏好（潜在知识），从而提高推理能力。我们相信，OSFT 为更复杂的、基于奖励的训练范式提供了一种高效且有前途的替代方案。我们的代码可在此 https URL 中找到。

Search Self-play: Pushing the Frontier of Agent Capability without Supervision

搜索自玩：在没有监督的情况下推动代理能力的前沿

Authors: Hongliang Lu, Yuhang Wen, Pengyu Cheng, Ruijin Ding, Haotian Xu, Jiaqi Guo, Chutian Wang, Haonan Chen, Xiaoxi Jiang, Guanjun Jiang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.18821
Pdf link: https://arxiv.org/pdf/2510.18821
Abstract Reinforcement learning with verifiable rewards (RLVR) has become the mainstream technique for training LLM agents. However, RLVR highly depends on well-crafted task queries and corresponding ground-truth answers to provide accurate rewards, which requires massive human effort and hinders the RL scaling processes, especially under agentic scenarios. Although a few recent works explore task synthesis methods, the difficulty of generated agentic tasks can hardly be controlled to provide effective RL training advantages. To achieve agentic RLVR with higher scalability, we explore self-play training for deep search agents, in which the learning LLM utilizes multi-turn search engine calling and acts simultaneously as both a task proposer and a problem solver. The task proposer aims to generate deep search queries with well-defined ground-truth answers and increasing task difficulty. The problem solver tries to handle the generated search queries and output the correct answer predictions. To ensure that each generated search query has accurate ground truth, we collect all the searching results from the proposer's trajectory as external knowledge, then conduct retrieval-augmented generation (RAG) to test whether the proposed query can be correctly answered with all necessary search documents provided. In this search self-play (SSP) game, the proposer and the solver co-evolve their agent capabilities through both competition and cooperation. With substantial experimental results, we find that SSP can significantly improve search agents' performance uniformly on various benchmarks without any supervision under both from-scratch and continuous RL training setups. The code is at this https URL.
中文摘要 具有可验证奖励的强化学习（Reinforcement Learning with Verifiable Rewards，简称RLVR）已成为训练LLM智能体的主流技术。然而，RLVR 高度依赖精心设计的任务查询和相应的地面实况答案来提供准确的奖励，这需要大量的人力并阻碍 RL 扩展过程，尤其是在代理场景下。尽管最近的一些工作探索了任务合成方法，但生成的代理任务的难度难以控制，无法提供有效的RL训练优势。为了实现具有更高可扩展性的代理RLVR，我们探索了深度搜索代理的自玩训练，其中学习LLM利用多轮搜索引擎调用，同时充当任务提出者和问题解决者。任务提议者旨在生成具有明确定义的地面实况答案并增加任务难度的深度搜索查询。问题求解器尝试处理生成的搜索查询并输出正确答案预测。为了确保每个生成的搜索查询都具有准确的地面真实性，我们将提议者轨迹中的所有搜索结果作为外部知识收集，然后进行检索增强生成（RAG），以测试提供所有必要的搜索文档是否能够正确回答所提出的查询。在这个搜索自玩（SSP）游戏中，提议者和求解者通过竞争和合作共同发展他们的智能体能力。通过大量的实验结果，我们发现SSP在从头开始和持续的RL训练设置下，无需任何监督，都可以在各种基准上均匀地提高搜索代理的性能。代码位于此 https URL 上。

Actor-Free Continuous Control via Structurally Maximizable Q-Functions

通过结构上可最大化的 Q 函数实现无 Actor 连续控制

Authors: Yigit Korkmaz, Urvi Bhuwania, Ayush Jain, Erdem Bıyık
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.18828
Pdf link: https://arxiv.org/pdf/2510.18828
Abstract Value-based algorithms are a cornerstone of off-policy reinforcement learning due to their simplicity and training stability. However, their use has traditionally been restricted to discrete action spaces, as they rely on estimating Q-values for individual state-action pairs. In continuous action spaces, evaluating the Q-value over the entire action space becomes computationally infeasible. To address this, actor-critic methods are typically employed, where a critic is trained on off-policy data to estimate Q-values, and an actor is trained to maximize the critic's output. Despite their popularity, these methods often suffer from instability during training. In this work, we propose a purely value-based framework for continuous control that revisits structural maximization of Q-functions, introducing a set of key architectural and algorithmic choices to enable efficient and stable learning. We evaluate the proposed actor-free Q-learning approach on a range of standard simulation tasks, demonstrating performance and sample efficiency on par with state-of-the-art baselines, without the cost of learning a separate actor. Particularly, in environments with constrained action spaces, where the value functions are typically non-smooth, our method with structural maximization outperforms traditional actor-critic methods with gradient-based maximization. We have released our code at this https URL.
中文摘要 基于价值的算法因其简单性和训练稳定性而成为策略外强化学习的基石。然而，它们的使用传统上仅限于离散动作空间，因为它们依赖于估计单个状态-动作对的 Q 值。在连续动作空间中，计算整个动作空间的 Q 值在计算上变得不可行。为了解决这个问题，通常采用演员批评方法，其中批评者接受非政策数据的训练以估计 Q 值，并训练演员以最大限度地提高批评者的产出。尽管这些方法很受欢迎，但在训练过程中经常会出现不稳定的问题。在这项工作中，我们提出了一个纯粹基于价值的连续控制框架，重新审视了 Q 函数的结构最大化，引入了一组关键的架构和算法选择，以实现高效和稳定的学习。我们在一系列标准模拟任务上评估了所提出的无参与者 Q 学习方法，展示了与最先进的基线相当的性能和样本效率，而无需学习单独的参与者。特别是，在具有约束作用空间的环境中，值函数通常是非平滑的，我们的结构最大化方法优于传统的基于梯度的最大化的参与者批评方法。我们已在此 https URL 上发布了代码。

Towards Faithful and Controllable Personalization via Critique-Post-Edit Reinforcement Learning

通过批判性编辑后强化学习实现忠实且可控的个性化

Authors: Chenghao Zhu, Meiling Tao, Tiannan Wang, Dongyi Ding, Yuchen Eleanor Jiang, Wangchunshu Zhou
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.18849
Pdf link: https://arxiv.org/pdf/2510.18849
Abstract Faithfully personalizing large language models (LLMs) to align with individual user preferences is a critical but challenging task. While supervised fine-tuning (SFT) quickly reaches a performance plateau, standard reinforcement learning from human feedback (RLHF) also struggles with the nuances of personalization. Scalar-based reward models are prone to reward hacking which leads to verbose and superficially personalized responses. To address these limitations, we propose Critique-Post-Edit, a robust reinforcement learning framework that enables more faithful and controllable personalization. Our framework integrates two key components: (1) a Personalized Generative Reward Model (GRM) that provides multi-dimensional scores and textual critiques to resist reward hacking, and (2) a Critique-Post-Edit mechanism where the policy model revises its own outputs based on these critiques for more targeted and efficient learning. Under a rigorous length-controlled evaluation, our method substantially outperforms standard PPO on personalization benchmarks. Personalized Qwen2.5-7B achieves an average 11% win-rate improvement, and the personalized Qwen2.5-14B model surpasses the performance of GPT-4.1. These results demonstrate a practical path to faithful, efficient, and controllable personalization.
中文摘要 忠实地个性化大型语言模型（LLM）以符合个人用户偏好是一项关键但具有挑战性的任务。虽然监督微调（SFT）迅速达到性能平台，但来自人类反馈的标准强化学习（RLHF）也在与个性化的细微差别作斗争。基于标量的奖励模型容易受到奖励黑客攻击，从而导致冗长和表面个性化的响应。为了解决这些限制，我们提出了 Critique-Post-Edit，这是一个强大的强化学习框架，可以实现更忠实和可控的个性化。我们的框架集成了两个关键组件：（1）个性化生成奖励模型（GRM），提供多维分数和文本批评以抵制奖励黑客攻击，以及（2）批评-后编辑机制，政策模型根据这些批评修改自己的输出，以实现更有针对性和更有效的学习。在严格的长度控制评估下，我们的方法在个性化基准上大大优于标准 PPO。个性化Qwen2.5-7B实现了平均11%的胜率提升，个性化Qwen2.5-14B模型的性能超过了GPT-4.1。这些结果展示了一条通往忠实、高效和可控个性化的实用途径。

EffiReasonTrans: RL-Optimized Reasoning for Code Translation

EffiReasonTrans：RL 优化的代码转换推理

Authors: Yanlin Wang, Rongyi Ou, Yanli Wang, Mingwei Liu, Jiachi Chen, Ensheng Shi, Xilin Liu, Yuchi Ma, Zibin Zheng
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2510.18863
Pdf link: https://arxiv.org/pdf/2510.18863
Abstract Code translation is a crucial task in software development and maintenance. While recent advancements in large language models (LLMs) have improved automated code translation accuracy, these gains often come at the cost of increased inference latency, hindering real-world development workflows that involve human-in-the-loop inspection. To address this trade-off, we propose EffiReasonTrans, a training framework designed to improve translation accuracy while balancing inference latency. We first construct a high-quality reasoning-augmented dataset by prompting a stronger language model, DeepSeek-R1, to generate intermediate reasoning and target translations. Each (source code, reasoning, target code) triplet undergoes automated syntax and functionality checks to ensure reliability. Based on this dataset, we employ a two-stage training strategy: supervised fine-tuning on reasoning-augmented samples, followed by reinforcement learning to further enhance accuracy and balance inference latency. We evaluate EffiReasonTrans on six translation pairs. Experimental results show that it consistently improves translation accuracy (up to +49.2% CA and +27.8% CodeBLEU compared to the base model) while reducing the number of generated tokens (up to -19.3%) and lowering inference latency in most cases (up to -29.0%). Ablation studies further confirm the complementary benefits of the two-stage training framework. Additionally, EffiReasonTrans demonstrates improved translation accuracy when integrated into agent-based frameworks. Our code and data are available at this https URL.
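Chinese abstract (translated) Code translation is a key task in software development and maintenance. Although recent advances in large language models (LLMs) have improved the accuracy of automated code translation, these gains often come at the cost of higher inference latency, which hinders real-world development workflows that involve human-in-the-loop inspection. To address this trade-off, we propose EffiReasonTrans, a training framework designed to improve translation accuracy while balancing inference latency. We first build a high-quality reasoning-augmented dataset by prompting a stronger language model, DeepSeek-R1, to generate intermediate reasoning and target translations. Each (source code, reasoning, target code) triplet undergoes automated syntax and functionality checks to ensure reliability. Based on this dataset, we adopt a two-stage training strategy: supervised fine-tuning on reasoning-augmented samples, followed by reinforcement learning to further improve accuracy and balance inference latency. We evaluate EffiReasonTrans on six translation pairs. Experimental results show that it consistently improves translation accuracy (up to +49.2% CA and +27.8% CodeBLEU compared to the base model) while reducing the number of generated tokens (up to -19.3%) and lowering inference latency in most cases (up to -29.0%). Ablation studies further confirm the complementary benefits of the two-stage training framework. In addition, EffiReasonTrans shows improved translation accuracy when integrated into agent-based frameworks. Our code and data are available at this https URL.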
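
The abstract says each (source code, reasoning, target code) triplet passes automated syntax and functionality checks before it enters the training set. A minimal sketch of such a filter follows. It is not the authors' pipeline: the `Triplet` class, the function names, the choice of Python as the target language, and the test-case format are all assumptions made for illustration.

```python
import ast
from dataclasses import dataclass
from typing import Any, List, Sequence, Tuple

# Hypothetical test-case format: (positional arguments, expected return value).
TestCase = Tuple[Tuple[Any, ...], Any]


@dataclass
class Triplet:
    source_code: str
    reasoning: str
    target_code: str  # assumed to be Python in this sketch


def passes_syntax_check(code: str) -> bool:
    """Return True if the code parses as valid Python."""
    try:
        ast.parse(code)
        return True
    except (SyntaxError, ValueError):
        return False


def passes_functionality_check(
    code: str, entry_point: str, test_cases: Sequence[TestCase]
) -> bool:
    """Run the translated function on every test case and compare outputs."""
    namespace: dict = {}
    try:
        # NOTE: exec is used only to keep the sketch short; untrusted
        # model output should be executed in a sandbox instead.
        exec(code, namespace)
        func = namespace[entry_point]
        return all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        return False


def filter_triplets(
    triplets: Sequence[Triplet], entry_point: str, test_cases: Sequence[TestCase]
) -> List[Triplet]:
    """Keep only triplets whose target code passes both checks."""
    return [
        t
        for t in triplets
        if passes_syntax_check(t.target_code)
        and passes_functionality_check(t.target_code, entry_point, test_cases)
    ]


if __name__ == "__main__":
    candidates = [
        Triplet("int add(int a, int b) { return a + b; }", "Sum two ints.",
                "def add(a, b):\n    return a + b\n"),
        Triplet("int add(int a, int b) { return a + b; }", "Sum two ints.",
                "def add(a, b) return a + b"),  # fails the syntax check
    ]
    kept = filter_triplets(candidates, "add", [((1, 2), 3)])
    print(len(kept), "triplet(s) kept")
```

Running the syntax check first is a cheap way to discard malformed candidates before paying the cost of executing them.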

Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting

Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting

Authors: Howard Chen, Noam Razin, Karthik Narasimhan, Danqi Chen
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.18874
Pdf link: https://arxiv.org/pdf/2510.18874
Abstract Adapting language models (LMs) to new tasks via post-training carries the risk of degrading existing capabilities -- a phenomenon classically known as catastrophic forgetting. In this paper, toward identifying guidelines for mitigating this phenomenon, we systematically compare the forgetting patterns of two widely adopted post-training methods: supervised fine-tuning (SFT) and reinforcement learning (RL). Our experiments reveal a consistent trend across LM families (Llama, Qwen) and tasks (instruction following, general knowledge, and arithmetic reasoning): RL leads to less forgetting than SFT while achieving comparable or higher target task performance. To investigate the cause for this difference, we consider a simplified setting in which the LM is modeled as a mixture of two distributions, one corresponding to prior knowledge and the other to the target task. We identify that the mode-seeking nature of RL, which stems from its use of on-policy data, enables keeping prior knowledge intact when learning the target task. We then verify this insight by demonstrating that the use of on-policy data underlies the robustness of RL to forgetting in practical settings, as opposed to other algorithmic choices such as the KL regularization or advantage estimation. Lastly, as a practical implication, our results highlight the potential of mitigating forgetting using approximately on-policy data, which can be substantially more efficient to obtain than fully on-policy data.
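Chinese abstract (translated) Adapting language models (LMs) to new tasks through post-training carries the risk of degrading existing capabilities, a phenomenon commonly known as catastrophic forgetting. In this paper, to identify guidelines for mitigating this phenomenon, we systematically compare the forgetting patterns of two widely adopted post-training methods: supervised fine-tuning (SFT) and reinforcement learning (RL). Our experiments reveal a consistent trend across LM families (Llama, Qwen) and tasks (instruction following, general knowledge, and arithmetic reasoning): RL forgets less than SFT while achieving comparable or higher target-task performance. To investigate the cause of this difference, we consider a simplified setting in which the LM is modeled as a mixture of two distributions, one corresponding to prior knowledge and the other to the target task. We find that the mode-seeking nature of RL, which stems from its use of on-policy data, makes it possible to keep prior knowledge intact while learning the target task. We then verify this insight by showing that the use of on-policy data, rather than other algorithmic choices such as KL regularization or advantage estimation, underlies the robustness of RL to forgetting in practical settings. Finally, as a practical implication, our results highlight the potential of mitigating forgetting with approximately on-policy data, which can be substantially more efficient to obtain than fully on-policy data.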
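
The "mode-seeking" behavior the abstract attributes to on-policy training can be illustrated with a toy example. The sketch below does not reproduce the paper's analysis. It assumes the common framing in which training on off-policy samples (as in SFT) resembles minimizing the forward divergence KL(p || q), while training on the model's own samples resembles minimizing the reverse divergence KL(q || p). The bimodal target, the grid, and all function names are hypothetical choices made for illustration.

```python
import math
from typing import List, Tuple

# Discretized support on [-8, 8] with step 0.1.
GRID = [i * 0.1 for i in range(-80, 81)]


def normalize(weights: List[float]) -> List[float]:
    total = sum(weights)
    return [w / total for w in weights]


def gaussian(mean: float, std: float) -> List[float]:
    """Gaussian weights evaluated on GRID and normalized to sum to 1."""
    return normalize([math.exp(-0.5 * ((x - mean) / std) ** 2) for x in GRID])


def kl(a: List[float], b: List[float]) -> float:
    """Discrete KL(a || b); assumes b is positive wherever a is positive."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0.0)


# Bimodal target p: 40% of the mass near -3 and 60% near +3.
TARGET = [
    0.4 * left + 0.6 * right
    for left, right in zip(gaussian(-3.0, 0.5), gaussian(3.0, 0.5))
]


def best_fit(direction: str) -> Tuple[float, float]:
    """Grid-search the single Gaussian q that minimizes the chosen KL."""
    best_params, best_loss = (0.0, 1.0), float("inf")
    for mean in [i * 0.5 for i in range(-8, 9)]:      # -4.0 ... 4.0
        for std in [i * 0.5 for i in range(1, 9)]:    # 0.5 ... 4.0
            q = gaussian(mean, std)
            if direction == "forward":
                loss = kl(TARGET, q)   # KL(p || q): mode-covering
            else:
                loss = kl(q, TARGET)   # KL(q || p): mode-seeking
            if loss < best_loss:
                best_params, best_loss = (mean, std), loss
    return best_params


if __name__ == "__main__":
    print("forward KL fit (mean, std):", best_fit("forward"))
    print("reverse KL fit (mean, std):", best_fit("reverse"))
```

In this toy setting the forward-KL fit spreads a wide Gaussian across both modes, whereas the reverse-KL fit concentrates on a single mode and leaves the rest of the distribution untouched.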

Keyword: diffusion policy

No results.