Arxiv Papers of Today

Generated: 2025-10-20 16:30:25 (UTC+8); arXiv release time: 2025-10-20 20:00 EDT (2025-10-21 08:00 UTC+8)

45 relevant papers today

Keyword: reinforcement learning

ES-C51: Expected Sarsa Based C51 Distributional Reinforcement Learning Algorithm

Authors: Rijul Tandon, Peter Vamplew, Cameron Foale
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.15006
Pdf link: https://arxiv.org/pdf/2510.15006
Abstract In most value-based reinforcement learning (RL) algorithms, the agent estimates only the expected reward for each action and selects the action with the highest reward. In contrast, Distributional Reinforcement Learning (DRL) estimates the entire probability distribution of possible rewards, providing richer information about uncertainty and variability. C51 is a popular DRL algorithm for discrete action spaces. It uses a Q-learning approach, where the distribution is learned using a greedy Bellman update. However, this can cause problems if multiple actions at a state have similar expected reward but with different distributions, as the algorithm may not learn a stable distribution. This study presents a modified version of C51 (ES-C51) that replaces the greedy Q-learning update with an Expected Sarsa update, which uses a softmax calculation to combine information from all possible actions at a state rather than relying on a single best action. This reduces instability when actions have similar expected rewards and allows the agent to learn higher-performing policies. This approach is evaluated on classic control environments from Gym, and Atari-10 games. For a fair comparison, we modify the standard C51's exploration strategy from ε-greedy to softmax, which we refer to as QL-C51 (Q-Learning based C51). The results demonstrate that ES-C51 outperforms QL-C51 across many environments.
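
The contrast between the greedy target and the Expected Sarsa target described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the temperature parameter, and the toy atoms are assumptions, and the Bellman shift and projection onto the C51 support are omitted.

```python
import math

def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [v / temperature for v in values]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def expected_values(dists, atoms):
    """Expected return of each action under its categorical distribution."""
    return [sum(p * z for p, z in zip(dist, atoms)) for dist in dists]

def greedy_next_dist(dists, atoms):
    """Q-learning style target: keep only the best action's distribution."""
    q = expected_values(dists, atoms)
    best = max(range(len(q)), key=lambda a: q[a])
    return list(dists[best])

def expected_sarsa_next_dist(dists, atoms, temperature=1.0):
    """Expected Sarsa style target: mix all action distributions,
    weighted by a softmax policy over the expected values."""
    q = expected_values(dists, atoms)
    pi = softmax(q, temperature)
    return [sum(pi[a] * dists[a][i] for a in range(len(dists)))
            for i in range(len(atoms))]

# Two actions with the same expected return but different shapes.
atoms = [-1.0, 0.0, 1.0]
dists = [[0.5, 0.0, 0.5], [0.0, 1.0, 0.0]]
print(greedy_next_dist(dists, atoms))
print(expected_sarsa_next_dist(dists, atoms))
```

In the toy case both actions have expected return 0, so the greedy target depends entirely on tie-breaking, while the softmax-weighted mixture is the same regardless of which action is ranked first.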

Composition-Grounded Instruction Synthesis for Visual Reasoning

Authors: Xinyi Gu, Jiayuan Mao, Zhang-Wei Hong, Zhuoran Yu, Pengyuan Li, Dhiraj Joshi, Rogerio Feris, Zexue He
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.15040
Pdf link: https://arxiv.org/pdf/2510.15040
Abstract Pretrained multi-modal large language models (MLLMs) demonstrate strong performance on diverse multimodal tasks, but remain limited in reasoning capabilities for domains where annotations are difficult to collect. In this work, we focus on artificial image domains such as charts, rendered documents, and webpages, which are abundant in practice yet lack large-scale human annotated reasoning datasets. We introduce COGS (COmposition-Grounded instruction Synthesis), a data-efficient framework for equipping MLLMs with advanced reasoning abilities from a small set of seed questions. The key idea is to decompose each seed question into primitive perception and reasoning factors, which can then be systematically recomposed with new images to generate large collections of synthetic question-answer pairs. Each generated question is paired with subquestions and intermediate answers, enabling reinforcement learning with factor-level process rewards. Experiments on chart reasoning show that COGS substantially improves performance on unseen questions, with the largest gains on reasoning-heavy and compositional questions. Moreover, training with a factor-level mixture of different seed data yields better transfer across multiple datasets, suggesting that COGS induces generalizable capabilities rather than dataset-specific overfitting. We further demonstrate that the framework extends beyond charts to other domains such as webpages.

Internalizing World Models via Self-Play Finetuning for Agentic RL

Authors: Shiqi Chen, Tongyao Zhu, Zian Wang, Jinghan Zhang, Kangrui Wang, Siyang Gao, Teng Xiao, Yee Whye Teh, Junxian He, Manling Li
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.15047
Pdf link: https://arxiv.org/pdf/2510.15047
Abstract Large Language Models (LLMs) as agents often struggle in out-of-distribution (OOD) scenarios. Real-world environments are complex and dynamic, governed by task-specific rules and stochasticity, which makes it difficult for LLMs to ground their internal knowledge in those dynamics. Under such OOD conditions, vanilla RL training often fails to scale; we observe that Pass@k (the probability that at least one of k sampled trajectories succeeds) drops markedly across training steps, indicating brittle exploration and limited generalization. Inspired by model-based reinforcement learning, we hypothesize that equipping LLM agents with an internal world model can better align reasoning with environmental dynamics and improve decision-making. We show how to encode this world model by decomposing it into two components: state representation and transition modeling. Building on this, we introduce SPA, a simple reinforcement learning framework that cold-starts the policy via a Self-Play supervised finetuning (SFT) stage to learn the world model by interacting with the environment, then uses it to simulate future states prior to policy optimization. This simple initialization outperforms the online world-modeling baseline and greatly boosts the RL-based agent training performance. Experiments across diverse environments like Sokoban, FrozenLake, and Sudoku show that our approach significantly improves performance. For example, SPA boosts the Sokoban success rate from 25.6% to 59.8% and raises the FrozenLake score from 22.1% to 70.9% for the Qwen2.5-1.5B-Instruct model.
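
The Pass@k quantity defined in this abstract can be estimated from a batch of sampled trajectories. The sketch below uses the standard combinatorial estimator (n samples, c of them successful); whether the paper computes Pass@k this way is an assumption, and the function name is illustrative.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k trajectories,
    drawn without replacement from n samples with c successes, succeeds."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets that contain no success;
    # it is 0 when fewer than k failures exist.
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=4, c=1, k=2))
```

With one success among four samples, half of the six possible pairs contain the successful trajectory, so the estimate for k=2 is 0.5.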

Directional Reasoning Injection for Fine-Tuning MLLMs

Authors: Chao Huang, Zeliang Zhang, Jiang Liu, Ximeng Sun, Jialian Wu, Xiaodong Yu, Ze Wang, Chenliang Xu, Emad Barsoum, Zicheng Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2510.15050
Pdf link: https://arxiv.org/pdf/2510.15050
Abstract Multimodal large language models (MLLMs) are rapidly advancing, yet their reasoning ability often lags behind that of strong text-only counterparts. Existing methods to bridge this gap rely on supervised fine-tuning over large-scale multimodal reasoning data or reinforcement learning, both of which are resource-intensive. A promising alternative is model merging, which interpolates parameters between reasoning-enhanced LLMs and multimodal variants. However, our analysis shows that naive merging is not always a "free lunch": its effectiveness varies drastically across model families, with some (e.g., LLaVA, Idefics) benefiting while others (e.g., Qwen) suffer performance degradation. To address this, we propose Directional Reasoning Injection for Fine-Tuning (DRIFT) MLLMs, a lightweight method that transfers reasoning knowledge in the gradient space, without destabilizing multimodal alignment. DRIFT precomputes a reasoning prior as the parameter-space difference between reasoning and multimodal variants, then uses it to bias gradients during multimodal fine-tuning. This approach preserves the simplicity of standard supervised fine-tuning pipelines while enabling efficient reasoning transfer. Extensive experiments on multimodal reasoning benchmarks, including MathVista and MathVerse, demonstrate that DRIFT consistently improves reasoning performance over naive merging and supervised fine-tuning, while matching or surpassing training-heavy methods at a fraction of the cost.
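
The two steps named in this abstract, precomputing a parameter-space difference and using it to bias gradients, might look roughly like the toy sketch below. The abstract does not state how the bias enters the update, so the additive form, the `strength` coefficient, and all function names here are assumptions for illustration only, with flat lists of floats standing in for model parameters.

```python
def reasoning_prior(theta_reasoning, theta_multimodal):
    """Parameter-space difference between the reasoning and multimodal variants."""
    return [r - m for r, m in zip(theta_reasoning, theta_multimodal)]

def biased_sgd_step(theta, grad, prior, lr=0.1, strength=0.5):
    """One SGD step whose gradient is nudged along the precomputed prior.
    Assumed form: subtracting strength * prior from the gradient makes the
    descent step move the parameters toward the reasoning variant."""
    biased_grad = [g - strength * d for g, d in zip(grad, prior)]
    return [t - lr * bg for t, bg in zip(theta, biased_grad)]

# Toy example: zero task gradient, so the step is driven by the prior alone.
prior = reasoning_prior([1.0, 2.0], [0.0, 0.0])
print(biased_sgd_step([0.0, 0.0], [0.0, 0.0], prior))
```

The prior is computed once, so the per-step cost on top of ordinary fine-tuning is a single elementwise operation in this sketch.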

Learn to Change the World: Multi-level Reinforcement Learning with Model-Changing Actions

Authors: Ziqing Lu, Babak Hassibi, Lifeng Lai, Weiyu Xu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.15056
Pdf link: https://arxiv.org/pdf/2510.15056
Abstract Reinforcement learning usually assumes a given or sometimes even fixed environment in which an agent seeks an optimal policy to maximize its long-term discounted reward. In contrast, we consider agents that are not limited to passive adaptations: they instead have model-changing actions that actively modify the RL model of world dynamics itself. Reconfiguring the underlying transition processes can potentially increase the agents' rewards. Motivated by this setting, we introduce the multi-layer configurable time-varying Markov decision process (MCTVMDP). In an MCTVMDP, the lower-level MDP has a non-stationary transition function that is configurable through upper-level model-changing actions. The agent's objective consists of two parts: Optimize the configuration policies in the upper-level MDP and optimize the primitive action policies in the lower-level MDP to jointly improve its expected long-term reward.

DLER: Doing Length pEnalty Right - Incentivizing More Intelligence per Token via Reinforcement Learning

Authors: Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, Jan Kautz, Pavlo Molchanov
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.15110
Pdf link: https://arxiv.org/pdf/2510.15110
Abstract Reasoning language models such as OpenAI-o1, DeepSeek-R1, and Qwen achieve strong performance via extended chains of thought but often generate unnecessarily long outputs. Maximizing intelligence per token (accuracy relative to response length) remains an open problem. We revisit reinforcement learning (RL) with the simplest length penalty, truncation, and show that accuracy degradation arises not from the lack of sophisticated penalties but from inadequate RL optimization. We identify three key challenges: (i) large bias in advantage estimation, (ii) entropy collapse, and (iii) sparse reward signal. We address them with Doing Length pEnalty Right (DLER), a training recipe combining batch-wise reward normalization, higher clipping, dynamic sampling, and a simple truncation length penalty. DLER achieves state-of-the-art accuracy-efficiency trade-offs, cutting output length by over 70 percent while surpassing the accuracy of all previous baselines. It also improves test-time scaling: compared to DeepSeek-R1-7B, DLER-7B generates multiple concise responses in parallel with 28 percent higher accuracy and lower latency. We further introduce Difficulty-Aware DLER, which adaptively tightens truncation on easier questions for additional efficiency gains. We also propose an update-selective merging method that preserves baseline accuracy while retaining the concise reasoning ability of the DLER model, which is useful for scenarios where RL training data is scarce.
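
Two ingredients of the recipe in this abstract, the truncation length penalty and batch-wise reward normalization, can be sketched as below. The binary correctness reward, the population standard deviation, the `eps` constant, and the function names are assumptions; the abstract does not give the exact formulas.

```python
from statistics import mean, pstdev

def truncated_reward(is_correct, num_tokens, max_tokens):
    """Truncation as the length penalty: a response that exceeds the
    token budget earns no reward, whatever its answer would have been."""
    if num_tokens > max_tokens:
        return 0.0
    return 1.0 if is_correct else 0.0

def batch_normalized_advantages(rewards, eps=1e-6):
    """Normalize rewards with statistics of the whole batch (assumed form)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# (is_correct, num_tokens) for four sampled responses, budget of 100 tokens.
samples = [(True, 80), (True, 120), (False, 60), (True, 95)]
rewards = [truncated_reward(ok, n, max_tokens=100) for ok, n in samples]
print(rewards)
print(batch_normalized_advantages(rewards))
```

The second sample is correct but over budget, so it is scored like a wrong answer; that is the only length signal in this sketch.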

Procedural Game Level Design with Deep Reinforcement Learning

Authors: Miraç Buğra Özkan
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.15120
Pdf link: https://arxiv.org/pdf/2510.15120
Abstract Procedural content generation (PCG) has become an increasingly popular technique in game development, allowing developers to generate dynamic, replayable, and scalable environments with reduced manual effort. In this study, a novel method for procedural level design using Deep Reinforcement Learning (DRL) within a Unity-based 3D environment is proposed. The system comprises two agents: a hummingbird agent, acting as a solver, and a floating island agent, responsible for generating and placing collectible objects (flowers) on the terrain in a realistic and context-aware manner. The hummingbird is trained using the Proximal Policy Optimization (PPO) algorithm from the Unity ML-Agents toolkit. It learns to navigate through the terrain efficiently, locate flowers, and collect them while adapting to the ever-changing procedural layout of the island. The island agent is also trained using the Proximal Policy Optimization (PPO) algorithm. It learns to generate flower layouts based on observed obstacle positions, the hummingbird's initial state, and performance feedback from previous episodes. The interaction between these agents leads to emergent behavior and robust generalization across various environmental configurations. The results demonstrate that the approach not only produces effective and efficient agent behavior but also opens up new opportunities for autonomous game level design driven by machine learning. This work highlights the potential of DRL in enabling intelligent agents to both generate and solve content in virtual environments, pushing the boundaries of what AI can contribute to creative game development processes.

Navigating the consequences of mechanical ventilation in clinical intensive care settings through an evolutionary game-theoretic framework

Authors: David J. Albers, Tell D. Bennett, Jana de Wiljes, Bradford J. Smith, Peter D. Sottile, J.N. Stroh
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Quantitative Methods (q-bio.QM)
Arxiv link: https://arxiv.org/abs/2510.15127
Pdf link: https://arxiv.org/pdf/2510.15127
Abstract Identifying the effects of mechanical ventilation strategies and protocols in critical care requires analyzing data from heterogeneous patient-ventilator systems within the context of the clinical decision-making environment. This research develops a framework to help understand the consequences of mechanical ventilation (MV) and adjunct care decisions on patient outcome from observations of critical care patients receiving MV. Developing an understanding of and improving critical care respiratory management requires the analysis of existing secondary-use clinical data to generate hypotheses about advantageous variations and adaptations of current care. This work introduces a perspective of the joint patient-ventilator-care systems (so-called J6) to develop a scalable method for analyzing data and trajectories of these complex systems. To that end, breath behaviors are analyzed using evolutionary game theory (EGT), which generates the necessary quantitative precursors for deeper analysis through probabilistic and stochastic machinery such as reinforcement learning. This result is one step along the pathway toward MV optimization and personalization. The EGT-based process is analytically validated on synthetic data to reveal potential caveats before proceeding to real-world ICU data applications that expose complexities of the data-generating process J6. The discussion includes potential developments toward a state transition model for the simulating effects of MV decision using empirical and game-theoretic elements.

Policy Transfer Ensures Fast Learning for Continuous-Time LQR with Entropy Regularization

Authors: Xin Guo, Zijiu Lyu
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2510.15165
Pdf link: https://arxiv.org/pdf/2510.15165
Abstract Reinforcement Learning (RL) enables agents to learn optimal decision-making strategies through interaction with an environment, yet training from scratch on complex tasks can be highly inefficient. Transfer learning (TL), widely successful in large language models (LLMs), offers a promising direction for enhancing RL efficiency by leveraging pre-trained models. This paper investigates policy transfer, a TL approach that initializes learning in a target RL task using a policy from a related source task, in the context of continuous-time linear quadratic regulators (LQRs) with entropy regularization. We provide the first theoretical proof of policy transfer for continuous-time RL, proving that a policy optimal for one LQR serves as a near-optimal initialization for closely related LQRs, while preserving the original algorithm's convergence rate. Furthermore, we introduce a novel policy learning algorithm for continuous-time LQRs that achieves global linear and local super-linear convergence. Our results demonstrate both theoretical guarantees and algorithmic benefits of transfer learning in continuous-time RL, addressing a gap in existing literature and extending prior work from discrete to continuous time settings. As a byproduct of our analysis, we derive the stability of a class of continuous-time score-based diffusion models via their connection with LQRs.

RM-RL: Role-Model Reinforcement Learning for Precise Robot Manipulation

Authors: Xiangyu Chen, Chuhao Zhou, Yuxi Liu, Jianfei Yang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.15189
Pdf link: https://arxiv.org/pdf/2510.15189
Abstract Precise robot manipulation is critical for fine-grained applications such as chemical and biological experiments, where even small errors (e.g., reagent spillage) can invalidate an entire task. Existing approaches often rely on pre-collected expert demonstrations and train policies via imitation learning (IL) or offline reinforcement learning (RL). However, obtaining high-quality demonstrations for precision tasks is difficult and time-consuming, while offline RL commonly suffers from distribution shifts and low data efficiency. We introduce a Role-Model Reinforcement Learning (RM-RL) framework that unifies online and offline training in real-world environments. The key idea is a role-model strategy that automatically generates labels for online training data using approximately optimal actions, eliminating the need for human demonstrations. RM-RL reformulates policy learning as supervised training, reducing instability from distribution mismatch and improving efficiency. A hybrid training scheme further leverages online role-model data for offline reuse, enhancing data efficiency through repeated sampling. Extensive experiments show that RM-RL converges faster and more stably than existing RL methods, yielding significant gains in real-world manipulation: 53% improvement in translation accuracy and 20% in rotation accuracy. Finally, we demonstrate the successful execution of a challenging task, precisely placing a cell plate onto a shelf, highlighting the framework's effectiveness where prior methods fail.

Structure-R1: Dynamically Leveraging Structural Knowledge in LLM Reasoning through Reinforcement Learning

Authors: Junlin Wu, Xianrui Zhong, Jiashuo Sun, Bolian Li, Bowen Jin, Jiawei Han, Qingkai Zeng
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2510.15191
Pdf link: https://arxiv.org/pdf/2510.15191
Abstract Large language models (LLMs) have demonstrated remarkable advances in reasoning capabilities. However, their performance remains constrained by limited access to explicit and structured domain knowledge. Retrieval-Augmented Generation (RAG) addresses this by incorporating external information as context to augment reasoning. Nevertheless, traditional RAG systems typically operate over unstructured and fragmented text, resulting in low information density and suboptimal reasoning. To overcome these limitations, we propose Structure-R1, a novel framework that transforms retrieved content into structured representations optimized for reasoning. Leveraging reinforcement learning, Structure-R1 learns a content representation policy that dynamically generates and adapts structural formats based on the demands of multi-step reasoning. Unlike prior methods that rely on fixed schemas, our approach adopts a generative paradigm capable of producing task-specific structures tailored to individual queries. To ensure the quality and reliability of these representations, we introduce a self-reward structural verification mechanism that checks whether the generated structures are both correct and self-contained. Extensive experiments on seven knowledge-intensive benchmarks show that Structure-R1 consistently achieves competitive performance with a 7B-scale backbone model and matches the performance of much larger models. Additionally, our theoretical analysis demonstrates how structured representations enhance reasoning by improving information density and contextual clarity. Our code and data are available at: this https URL.

Soundness-Aware Level: A Microscopic Signature that Predicts LLM Reasoning Potential

Authors: Xuansheng Wu, Xiaoman Pan, Wenlin Yao, Jianshu Chen
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.15216
Pdf link: https://arxiv.org/pdf/2510.15216
Abstract Reinforcement learning with verifiable rewards (RLVR) can elicit strong reasoning in large language models (LLMs), while their performance after RLVR varies dramatically across different base models. This raises a fundamental question: what microscopic property of pre-trained models leads to this variation? To investigate, we formalize reasoning as chains of Horn clauses ("if-then" rules) built from features extracted from the LLM's latent space via cross-layer sparse autoencoders (SAEs). We estimate the transition probabilities between its features, and further categorize each rule by its semantic soundness level (e.g., strict, plausible, noisy) with an LLM. Our key discovery is that high-potential models are inherently soundness-aware: their internal probability distributions systematically shift across rules' soundness levels, becoming highly distinct for "strict" versus "noisy" rules. In contrast, weaker models are soundness-agnostic, collapsing to one distribution regardless of soundness levels. To quantify this, we introduce the Soundness-Aware Level (SAL), a microscopic metric using the Jensen-Shannon Divergence to measure the separation between these distributions. We show that SAL's predictions of post-RLVR reasoning performance follow a precise empirical law (R^2=0.87) across diverse model families (Qwen, Mistral, Llama, DeepSeek) and scales (0.5B-14B). This reveals that a model's reasoning potential is tied to its intrinsic, pre-trained ability to distinguish sound knowledge from unsound ones. These findings underscore the critical role of model pre-training in shaping reasoning and offer a practical metric grounded in the model's internal mechanisms for selecting/designing stronger base models.
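
The Jensen-Shannon Divergence that this abstract names as the basis of SAL can be computed for two discrete distributions as in the sketch below. How SAL aggregates the divergence across soundness levels is not stated in the abstract, so this shows only the divergence itself; the example distributions are made up.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: symmetric and bounded in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical transition-probability histograms for two soundness levels.
strict_rules = [0.7, 0.2, 0.1]
noisy_rules = [0.1, 0.2, 0.7]
print(js_divergence(strict_rules, noisy_rules))
print(js_divergence(strict_rules, strict_rules))
```

A model whose distributions collapse to one shape regardless of soundness level would score near 0 under this measure, while clearly separated distributions push the value toward 1.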

Dual-Weighted Reinforcement Learning for Generative Preference Modeling

Authors: Shengyu Feng, Yun He, Shuang Ma, Beibin Li, Yuanhao Xiong, Vincent Li, Karishma Mandyam, Julian Katz-Samuels, Shengjie Bi, Licheng Yu, Hejia Zhang, Karthik Abinav Sankararaman, Han Fang, Riham Mansour, Yiming Yang, Manaal Faruqui
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.15242
Pdf link: https://arxiv.org/pdf/2510.15242
Abstract Reinforcement learning (RL) has recently proven effective at scaling chain-of-thought (CoT) reasoning in large language models on tasks with verifiable answers. However, extending RL to more general non-verifiable tasks, typically in the format of human preference pairs, remains both challenging and underexplored. In this work, we propose Dual-Weighted Reinforcement Learning (DWRL), a new framework for preference modeling that integrates CoT reasoning with the Bradley-Terry (BT) model via a dual-weighted RL objective that preserves preference-modeling inductive bias. DWRL approximates the maximum-likelihood objective of the BT model with two complementary weights: an instance-wise misalignment weight, which emphasizes under-trained pairs misaligned with human preference, and a group-wise (self-normalized) conditional preference score, which promotes promising thoughts. In this paper, we apply DWRL to preference modeling by training generative preference models (GPMs) to first generate a thought and then predict the human preference score. Across multiple benchmarks and model scales (Llama3 and Qwen2.5), DWRL consistently outperforms both GPM baselines and scalar models, while producing coherent, interpretable thoughts. In summary, our results position DWRL as a general framework for reasoning-enhanced preference learning beyond verifiable tasks.
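
The Bradley-Terry (BT) model whose maximum-likelihood objective this abstract refers to can be written down in a few lines. The sketch covers only the plain BT likelihood on scalar scores; the two DWRL weights are not reproduced here, and the function names are illustrative.

```python
import math

def bt_preference_prob(score_a, score_b):
    """Bradley-Terry probability that a is preferred over b:
    sigmoid(score_a - score_b), computed in a numerically stable way."""
    d = score_a - score_b
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)

def bt_negative_log_likelihood(pairs):
    """Mean negative log-likelihood over (chosen_score, rejected_score) pairs."""
    return -sum(math.log(bt_preference_prob(c, r)) for c, r in pairs) / len(pairs)

# Scores a preference model might assign to chosen and rejected responses.
pairs = [(2.0, 0.5), (1.0, 1.0), (0.2, 1.4)]
print([round(bt_preference_prob(c, r), 3) for c, r in pairs])
print(bt_negative_log_likelihood(pairs))
```

The third pair is scored against the human preference, so it contributes the largest term to the loss.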

AutoGraph-R1: End-to-End Reinforcement Learning for Knowledge Graph Construction

Authors: Hong Ting Tsang, Jiaxin Bai, Haoyu Huang, Qiao Xiao, Tianshi Zheng, Baixuan Xu, Shujie Liu, Yangqiu Song
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.15339
Pdf link: https://arxiv.org/pdf/2510.15339
Abstract Building effective knowledge graphs (KGs) for Retrieval-Augmented Generation (RAG) is pivotal for advancing question answering (QA) systems. However, its effectiveness is hindered by a fundamental disconnect: the knowledge graph (KG) construction process is decoupled from its downstream application, yielding suboptimal graph structures. To bridge this gap, we introduce AutoGraph-R1, the first framework to directly optimize KG construction for task performance using Reinforcement Learning (RL). AutoGraph-R1 trains an LLM constructor by framing graph generation as a policy learning problem, where the reward is derived from the graph's functional utility in a RAG pipeline. We design two novel, task-aware reward functions, one for graphs as knowledge carriers and another as knowledge indices. Across multiple QA benchmarks, AutoGraph-R1 consistently enables graph RAG methods to achieve significant performance gains over using task-agnostic baseline graphs. Our work shows it is possible to close the loop between construction and application, shifting the paradigm from building intrinsically "good" graphs to building demonstrably "useful" ones.
中文摘要 为检索增强生成（RAG）构建有效的知识图谱（KG）对于推进问答（QA）系统至关重要。然而，其有效性受到一个根本脱节的阻碍：知识图谱（KG）构建过程与其下游应用解耦，产生次优图结构。为了弥补这一差距，我们引入了 AutoGraph-R1，这是第一个使用强化学习（RL）直接优化任务性能 KG 构造的框架。AutoGraph-R1 通过将图生成视为策略学习问题来训练 LLM 构造函数，其中奖励来自图在 RAG 管道中的功能效用。我们设计了两个新颖的、任务感知的奖励函数，一个用于作为知识载体的图形，另一个作为知识索引。在多个 QA 基准测试中，AutoGraph-R1 始终使图形 RAG 方法能够比使用与任务无关的基线图实现显着的性能提升。我们的工作表明，在构建和应用之间实现闭环是可能的，将范式从构建本质上“好”的图转变为构建明显“有用”的图。

Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing

Infinity Parser：用于扫描文档解析的布局感知强化学习

Authors: Baode Wang, Biao Wu, Weizhen Li, Meng Fang, Zuming Huang, Jun Huang, Haozhe Wang, Yanjie Liang, Ling Chen, Wei Chu, Yuan Qi
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.15349
Pdf link: https://arxiv.org/pdf/2510.15349
Abstract Document parsing from scanned images into structured formats remains a significant challenge due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables. Existing supervised fine-tuning methods often struggle to generalize across diverse document types, leading to poor performance, particularly on out-of-distribution data. This issue is further exacerbated by the limited availability of high-quality training data for layout-aware parsing tasks. To address these challenges, we introduce LayoutRL, a reinforcement learning framework that optimizes layout understanding through composite rewards integrating normalized edit distance, paragraph count accuracy, and reading order preservation. To support this training, we construct the Infinity-Doc-400K dataset, which we use to train Infinity-Parser, a vision-language model demonstrating robust generalization across various domains. Extensive evaluations on benchmarks including OmniDocBench, olmOCR-Bench, PubTabNet, and FinTabNet show that Infinity-Parser consistently achieves state-of-the-art performance across a broad range of document types, languages, and structural complexities, substantially outperforming both specialized document parsing systems and general-purpose vision-language models. We will release our code, dataset, and model to facilitate reproducible research in document parsing.
中文摘要 由于文本段落、图形、公式和表格等元素复杂交织，将文档从扫描图像解析为结构化格式仍然是一个重大挑战。现有的监督微调方法通常难以在不同的文档类型中进行推广，从而导致性能不佳，尤其是在分布外的数据上。布局感知解析任务的高质量训练数据可用性有限，进一步加剧了这个问题。为了应对这些挑战，我们引入了 LayoutRL，这是一种强化学习框架，它通过集成归一化编辑距离、段落计数准确性和阅读顺序保留的复合奖励来优化布局理解。为了支持这种训练，我们构建了 Infinity-Doc-400K 数据集，我们用它来训练 Infinity-Parser，这是一种视觉语言模型，展示了跨各个领域的稳健泛化。对包括 OmniDocBench、olmOCR-Bench、PubTabNet 和 FinTabNet 在内的基准测试的广泛评估表明，Infinity-Parser 在广泛的文档类型、语言和结构复杂性中始终如一地实现最先进的性能，大大优于专门的文档解析系统和通用视觉语言模型。我们将发布我们的代码、数据集和模型，以促进文档解析的可重复研究。
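The composite reward described in this abstract (normalized edit distance, paragraph count accuracy, reading order preservation) can be sketched as below. This is a minimal illustration only: the exact formula of each term, the exact-match paragraph alignment, and the equal weights are all assumptions, not LayoutRL's actual definitions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_reward(pred: str, ref: str) -> float:
    """1 minus the edit distance normalized by the longer string's length."""
    longest = max(len(pred), len(ref))
    return 1.0 if longest == 0 else 1.0 - edit_distance(pred, ref) / longest


def count_reward(pred_paras: list[str], ref_paras: list[str]) -> float:
    """Penalize the relative mismatch in paragraph count, floored at 0."""
    if not ref_paras:
        return 1.0 if not pred_paras else 0.0
    return max(0.0, 1.0 - abs(len(pred_paras) - len(ref_paras)) / len(ref_paras))


def order_reward(pred_paras: list[str], ref_paras: list[str]) -> float:
    """Fraction of matched paragraph pairs that keep their reference order."""
    # Assumption: paragraphs are aligned by exact text match (first occurrence).
    positions = [ref_paras.index(p) for p in pred_paras if p in ref_paras]
    pairs = [(i, j) for i in range(len(positions)) for j in range(i + 1, len(positions))]
    if not pairs:
        return 1.0
    kept = sum(1 for i, j in pairs if positions[i] < positions[j])
    return kept / len(pairs)


def composite_reward(pred_paras: list[str], ref_paras: list[str],
                     weights: tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Weighted sum of the three terms; equal weights are a hypothetical default."""
    w_edit, w_count, w_order = weights
    return (w_edit * edit_reward("\n".join(pred_paras), "\n".join(ref_paras))
            + w_count * count_reward(pred_paras, ref_paras)
            + w_order * order_reward(pred_paras, ref_paras))
```

A perfect parse scores 1.0 on every term, while a parse with the right text in reversed order keeps a full count score but loses the order term.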

Towards Flash Thinking via Decoupled Advantage Policy Optimization

通过解耦优势策略优化实现闪光思维

Authors: Zezhong Tan, Hang Gao, Xinhong Ma, Feng Zhang, Ziqiang Dong
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.15374
Pdf link: https://arxiv.org/pdf/2510.15374
Abstract Recent Large Reasoning Models (LRMs) have achieved remarkable performance in solving complex problems via supervised fine-tuning (SFT) and reinforcement learning (RL). Although existing RL algorithms significantly enhance model accuracy, they still suffer from excessively lengthy responses and overthinking issues, resulting in increased inference latency and computational consumption, especially for simple tasks that require minimal reasoning. To address this, we propose a novel RL framework, DEPO, to reduce inefficient reasoning for models. Our method mainly consists of three core components: (1) an innovative advantage decoupled algorithm to guide model reduction of inefficient tokens; (2) a difficulty-aware length penalty to lower the overall length of model responses; (3) an advantage clipping method to prevent bias in policy optimization. In our experiments, applied to DeepSeek-Distill-Qwen-7B and DeepSeek-Distill-Qwen-1.5B as base models, DEPO achieves a significant reduction in sequence length by 39% and reduces excessive reasoning paths in inefficient tokens, while outperforming the base model in overall accuracy.
中文摘要 最近的大型推理模型（LRM）在通过监督微调（SFT）和强化学习（RL）解决复杂问题方面取得了卓越的性能。尽管现有的 RL 算法显著提高了模型的准确性，但它们仍然存在响应过长和过度思考的问题，导致推理延迟和计算消耗增加，特别是对于只需极少推理的简单任务。为了解决这个问题，我们提出了一种新的 RL 框架 DEPO，以减少模型的低效推理。该方法主要由三个核心部分组成：（1）创新的优势解耦算法，引导模型减少低效 token；（2）难度感知长度惩罚，以降低模型响应的总长度；（3）优势裁剪方法，以防止策略优化中的偏差。在我们的实验中，以 DeepSeek-Distill-Qwen-7B 和 DeepSeek-Distill-Qwen-1.5B 作为基础模型，DEPO 将序列长度显著减少 39%，并减少了低效 token 中的过多推理路径，同时在整体准确性上优于基础模型。

Towards Automated Chicken Deboning via Learning-based Dynamically-Adaptive 6-DoF Multi-Material Cutting

通过基于学习的动态自适应 6-DoF 多材料切割实现鸡的自动化剔骨

Authors: Zhaodong Yang, Ai-Ping Hu, Harish Ravichandar
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.15376
Pdf link: https://arxiv.org/pdf/2510.15376
Abstract Automating chicken shoulder deboning requires precise 6-DoF cutting through a partially occluded, deformable, multi-material joint, since contact with the bones presents serious health and safety risks. Our work makes both systems-level and algorithmic contributions to train and deploy a reactive force-feedback cutting policy that dynamically adapts a nominal trajectory and enables full 6-DoF knife control to traverse the narrow joint gap while avoiding contact with the bones. First, we introduce an open-source custom-built simulator for multi-material cutting that models coupling, fracture, and cutting forces, and supports reinforcement learning, enabling efficient training and rapid prototyping. Second, we design a reusable physical testbed to emulate the chicken shoulder: two rigid "bone" spheres with controllable pose embedded in a softer block, enabling rigorous and repeatable evaluation while preserving essential multi-material characteristics of the target problem. Third, we train and deploy a residual RL policy, with discretized force observations and domain randomization, enabling robust zero-shot sim-to-real transfer and the first demonstration of a learned policy that debones a real chicken shoulder. Our experiments in our simulator, on our physical testbed, and on real chicken shoulders show that our learned policy reliably navigates the joint gap and reduces undesired bone/cartilage contact, resulting in up to a 4x improvement over existing open-loop cutting baselines in terms of success rate and bone avoidance. Our results also illustrate the necessity of force feedback for safe and effective multi-material cutting. The project website is at this https URL.
中文摘要 鸡肩自动化剔骨需要以精确的 6-DoF 切割穿过部分被遮挡、可变形的多材料关节，因为与骨头的接触会带来严重的健康和安全风险。我们的工作在系统级和算法方面做出了贡献，以训练和部署反应式力反馈切割策略，该策略动态调整标称轨迹，并实现完整的 6-DoF 刀具控制，以穿越狭窄的关节间隙，同时避免与骨骼接触。首先，我们引入了一个用于多材料切割的开源定制模拟器，该模拟器对耦合、断裂和切割力进行建模，并支持强化学习，从而实现高效的训练和快速原型设计。其次，我们设计了一个可重复使用的物理测试台来模拟鸡肩：两个姿态可控的刚性“骨”球体嵌入一个较软的块中，从而实现严格且可重复的评估，同时保留目标问题的基本多材料特征。第三，我们训练和部署残差 RL 策略，采用离散化力观测和域随机化，从而实现稳健的零样本模拟到真实迁移，并首次演示了对真实鸡肩进行剔骨的学习策略。我们在模拟器、物理测试台和真实鸡肩上的实验表明，我们学到的策略可以可靠地穿过关节间隙并减少不必要的骨/软骨接触，在成功率和避骨方面比现有的开环切割基线最多提高 4 倍。我们的研究结果还说明了力反馈对于安全有效的多材料切割的必要性。项目网站位于此 https URL。

Towards Robust Zero-Shot Reinforcement Learning

迈向稳健的零样本强化学习

Authors: Kexin Zheng, Lauriane Teyssier, Yinan Zheng, Yu Luo, Xiayuan Zhan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.15382
Pdf link: https://arxiv.org/pdf/2510.15382
Abstract The recent development of zero-shot reinforcement learning (RL) has opened a new avenue for learning pre-trained generalist policies that can adapt to arbitrary new tasks in a zero-shot manner. While the popular Forward-Backward representations (FB) and related methods have shown promise in zero-shot RL, we empirically found that their modeling lacks expressivity and that extrapolation errors caused by out-of-distribution (OOD) actions during offline learning sometimes lead to biased representations, ultimately resulting in suboptimal performance. To address these issues, we propose Behavior-REgularizEd Zero-shot RL with Expressivity enhancement (BREEZE), an upgraded FB-based framework that simultaneously enhances learning stability, policy extraction capability, and representation learning quality. BREEZE introduces behavioral regularization in zero-shot RL policy learning, transforming policy optimization into a stable in-sample learning paradigm. Additionally, BREEZE extracts the policy using a task-conditioned diffusion model, enabling the generation of high-quality and multimodal action distributions in zero-shot RL settings. Moreover, BREEZE employs expressive attention-based architectures for representation modeling to capture the complex relationships between environmental dynamics. Extensive experiments on ExORL and D4RL Kitchen demonstrate that BREEZE achieves the best or near-the-best performance while exhibiting superior robustness compared to prior offline zero-shot RL methods. The official implementation is available at: this https URL.
中文摘要 零样本强化学习（RL）的最新发展为学习预训练的通才策略开辟了一条新途径，这些策略可以以零样本的方式适应任意新任务。虽然流行的前向-后向表示（FB）和相关方法在零样本RL中显示出前景，但我们根据经验发现，它们的建模缺乏表达性，并且离线学习期间由分布外（OOD）动作引起的外推误差有时会导致有偏差的表示，最终导致性能不佳。针对这些问题，我们提出了Behavior-REgularizEd Zero-shot RL with Expressivity enhancement（BREEZE），这是一个基于FB的升级框架，同时增强了学习稳定性、策略提取能力和表示学习质量。BREEZE 在零样本 RL 策略学习中引入了行为正则化，将策略优化转化为稳定的样本内学习范式。此外，BREEZE 使用任务条件扩散模型提取策略，从而能够在零样本 RL 设置中生成高质量和多模态动作分布。此外，BREEZE 采用基于表达注意力的架构进行表示建模，以捕捉环境动态之间的复杂关系。对 ExORL 和 D4RL Kitchen 的广泛实验表明，与之前的离线零样本 RL 方法相比，BREEZE 实现了最佳或接近最佳的性能，同时表现出卓越的鲁棒性。官方实现可从以下网址获得：此 https URL。

Advancing Routing-Awareness in Analog ICs Floorplanning

提高模拟IC布局规划中的布线意识

Authors: Davide Basso, Luca Bortolussi, Mirjana Videnovic-Misic, Husni Habal
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.15387
Pdf link: https://arxiv.org/pdf/2510.15387
Abstract The adoption of machine learning-based techniques for analog integrated circuit layout, unlike its digital counterpart, has been limited by the stringent requirements imposed by electric and problem-specific constraints, along with the interdependence of floorplanning and routing steps. In this work, we address a prevalent concern among layout engineers regarding the need for readily available routing-aware floorplanning solutions. To this end, we develop an automatic floorplanning engine based on reinforcement learning and a relational graph convolutional neural network, specifically tailored to condition the floorplan generation towards more routable outcomes. A combination of increased grid resolution and precise pin information integration, along with a dynamic routing resource estimation technique, allows balancing routing and area efficiency, eventually meeting industrial standards. When analyzing the place and route effectiveness in a simulated environment, the proposed approach achieves a 13.8% reduction in dead space, a 40.6% reduction in wirelength and a 73.4% increase in routing success when compared to past learning-based state-of-the-art techniques.
中文摘要 与数字集成电路布局不同，基于机器学习的模拟集成电路布局技术的采用受到电气和特定问题约束的严格要求以及布局规划和布线步骤的相互依赖性的限制。在这项工作中，我们解决了布局工程师普遍关注的问题，即需要现成的布线感知布局规划解决方案。为此，我们开发了一种基于强化学习和关系图卷积神经网络的自动布局规划引擎，专门用于引导布局规划生成更易布线的结果。提高的网格分辨率和精确的引脚信息集成，以及动态布线资源估计技术，可以平衡布线和面积效率，最终满足行业标准。在模拟环境中分析布局布线（place and route）有效性时，与过去基于学习的最先进技术相比，所提出的方法使死区减少 13.8%，线长减少 40.6%，布线成功率提升 73.4%。

MARS: Reinforcing Multi-Agent Reasoning of LLMs through Self-Play in Strategic Games

MARS：通过战略游戏中的自我博弈强化大语言模型的多智能体推理

Authors: Huining Yuan, Zelai Xu, Zheyue Tan, Xiangmin Yi, Mo Guang, Kaiwen Long, Haojia Hui, Boxun Li, Xinlei Chen, Bo Zhao, Xiao-Ping Zhang, Chao Yu, Yu Wang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.15414
Pdf link: https://arxiv.org/pdf/2510.15414
Abstract Developing Large Language Models (LLMs) to cooperate and compete effectively within multi-agent systems is a critical step towards more advanced intelligence. While reinforcement learning (RL) has proven effective for enhancing reasoning in single-agent tasks, its extension to multi-turn, multi-agent scenarios remains underexplored due to the challenges of long-horizon credit assignment and agent-specific advantage estimation. To address these challenges, we introduce MARS, an end-to-end RL framework that incentivizes Multi-Agent Reasoning of LLMs through Self-play in both cooperative and competitive games. MARS features a turn-level advantage estimator that aligns learning signals with each interaction for credit assignment, and an agent-specific advantage normalization to stabilize multi-agent training. By learning with self-play across cooperative and competitive games, the MARS agent trained from Qwen3-4B develops strong strategic abilities that generalize to held-out games with up to 28.7% performance improvements. More importantly, the capability acquired through self-play generalizes beyond games, yielding consistent performance gains of multi-agent systems in reasoning benchmarks. When integrated into leading multi-agent systems, our MARS agent achieves significant performance gains of 10.0% on AIME and 12.5% on GPQA-Diamond. These results establish end-to-end RL training with self-play in strategic games as a powerful approach for developing generalizable multi-agent reasoning capabilities in LLMs. Our code and models are publicly available at this https URL.
中文摘要 开发能够在多智能体系统中有效合作和竞争的大型语言模型（LLM）是迈向更高级智能的关键一步。虽然强化学习（RL）已被证明可以有效增强单智能体任务中的推理能力，但由于长时程信用分配和智能体特定优势估计的挑战，其对多回合、多智能体场景的扩展仍然没有得到充分探索。为了应对这些挑战，我们引入了 MARS，这是一个端到端的 RL 框架，它通过在合作和竞争游戏中的自我博弈来激励 LLM 的多智能体推理。MARS 具有一个回合级优势估计器，可将学习信号与每次交互对齐以进行信用分配，以及特定于智能体的优势归一化以稳定多智能体训练。通过在合作和竞争游戏中进行自我博弈学习，从 Qwen3-4B 训练的 MARS 智能体发展了强大的战略能力，这些能力可以推广到未参与训练的（held-out）游戏中，性能提升最高达 28.7%。更重要的是，通过自我博弈获得的能力可以推广到游戏之外，在推理基准测试中为多智能体系统带来一致的性能提升。当集成到领先的多智能体系统中时，我们的 MARS 智能体在 AIME 上实现了 10.0% 的性能提升，在 GPQA-Diamond 上实现了 12.5% 的性能提升。这些结果确立了在战略游戏中进行自我博弈的端到端 RL 训练，作为在 LLM 中开发可推广的多智能体推理能力的强大方法。我们的代码和模型在此 https URL 上公开可用。
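One plausible reading of the "agent-specific advantage normalization" mentioned in this abstract is to standardize advantages separately within each agent's samples. The sketch below illustrates that reading only; the function name, the per-agent mean/std form, and the `eps` constant are assumptions rather than MARS's documented procedure.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize_per_agent(agent_ids: list[str], advantages: list[float],
                        eps: float = 1e-8) -> list[float]:
    """Standardize each advantage using the mean and std of its own agent's samples."""
    groups: dict[str, list[float]] = defaultdict(list)
    for agent, adv in zip(agent_ids, advantages):
        groups[agent].append(adv)
    # Per-agent (mean, population standard deviation).
    stats = {agent: (mean(vals), pstdev(vals)) for agent, vals in groups.items()}
    return [
        (adv - stats[agent][0]) / (stats[agent][1] + eps)
        for agent, adv in zip(agent_ids, advantages)
    ]
```

Normalizing per agent keeps one agent's larger reward scale from dominating the update of the other, which is the stabilizing effect the abstract attributes to this step.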

Safe, Efficient, and Robust Reinforcement Learning for Ranking and Diffusion Models

用于排名和扩散模型的安全、高效和鲁棒的强化学习

Authors: Shashank Gupta
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.15429
Pdf link: https://arxiv.org/pdf/2510.15429
Abstract This dissertation investigates how reinforcement learning (RL) methods can be designed to be safe, sample-efficient, and robust. Framed through the unifying perspective of contextual-bandit RL, the work addresses two major application domains - ranking and recommendation, and text-to-image diffusion models. The first part of the thesis develops theory and algorithms for safe deployment in ranking systems. An exposure-based generalisation bound is derived, leading to a counterfactual risk-minimisation objective whose solution is guaranteed not to underperform the logging policy, even with sparse feedback. This guarantee is extended to doubly robust estimators, enabling safety even under adversarial or misspecified user models and offering practitioners explicit control over permissible utility loss. The second part turns to single-action bandits, where various off-policy estimators are unified within a baseline-correction framework. A closed-form optimal baseline is proposed and shown to minimise both evaluation and policy-gradient variance, thereby improving off-policy learning reliability. The final part examines the trade-offs between efficiency and effectiveness in generative RL. A systematic study of PPO and REINFORCE motivates the Leave-One-Out PPO (LOOP) algorithm, which combines multiple diffusion trajectories with a REINFORCE-style baseline inside PPO's clipped objective. LOOP achieves PPO-level sample efficiency while producing generations that align more faithfully with textual attributes.
中文摘要 本论文研究了如何将强化学习（RL）方法设计得安全、样本高效且稳健。该工作以上下文赌博机（contextual bandit）RL 的统一视角为框架，涉及两个主要应用领域：排序与推荐，以及文本到图像的扩散模型。论文的第一部分开发了在排序系统中安全部署的理论和算法。推导出基于曝光的泛化界，从而得到一个反事实风险最小化目标，即使反馈稀疏，其解也保证不会差于日志记录策略。这种保证扩展到双重稳健估计器，即使在对抗性或错误设定的用户模型下也能实现安全，并为从业者提供对允许的效用损失的明确控制。第二部分转向单动作赌博机，其中各种离策略（off-policy）估计器统一在基线校正框架内。提出了一个闭式最优基线，并证明它可以同时最小化评估方差和策略梯度方差，从而提高离策略学习的可靠性。最后一部分研究了生成式 RL 中效率和有效性之间的权衡。对 PPO 和 REINFORCE 的系统研究催生了 Leave-One-Out PPO（LOOP）算法，该算法在 PPO 的裁剪目标内将多条扩散轨迹与 REINFORCE 式基线相结合。LOOP 实现了 PPO 级的样本效率，同时生成与文本属性更忠实对齐的结果。
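The LOOP idea in this abstract, a REINFORCE-style baseline over several samples placed inside PPO's clipped objective, can be sketched as follows. It assumes the baseline is the leave-one-out mean of the other samples' rewards (as the algorithm's name suggests) and a clip range of 0.2; both are assumptions, as are the function names.

```python
def leave_one_out_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each sample against the mean reward of the other samples."""
    k = len(rewards)
    if k < 2:
        raise ValueError("need at least two samples per prompt")
    total = sum(rewards)
    # Baseline for sample i is the mean over the k - 1 remaining samples.
    return [r - (total - r) / (k - 1) for r in rewards]


def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO clipped objective for a single sample (to be maximized)."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)
```

In use, each trajectory's leave-one-out advantage would be passed to `clipped_surrogate` together with its probability ratio between the new and old policy.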

Select Less, Reason More: Prioritizing Evidence Purity for Video Reasoning

少选择，多推理：优先保证视频推理的证据纯度

Authors: Xuchen Li, Xuzhao Li, Shiyu Hu, Kaiqi Huang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.15440
Pdf link: https://arxiv.org/pdf/2510.15440
Abstract Long-form video reasoning remains a major challenge for Video Large Language Models (Video LLMs), as static uniform frame sampling leads to information dilution and obscures critical evidence. Furthermore, existing pixel-space video reasoning agents, which are designed to actively interact with the video to acquire new visual information, remain suboptimal due to their lack of rigorous reward mechanisms to enforce evidence purity and their inability to perform temporal information supplementation beyond pre-sampled frames. To address this critical gap, we propose a novel evidence-prioritized adaptive framework built upon our core philosophy: "Select Less, Reason More." Our core contribution is the evidence-aware reinforcement learning (EARL) framework, which transforms the model into an active interrogator of evidence. EARL is precisely engineered to dynamically select the most relevant frames and, crucially, to perform localized re-sampling around the selected key frames to access fine-grained temporal detail. Extensive experiments on five demanding video reasoning benchmarks demonstrate that our EARL-trained model achieves new state-of-the-art among open-source Video LLMs, simultaneously learning an effective and high-purity visual evidence selection policy. Impressively, our 7B model achieves 59.8% on LongVideoBench, 69.0% on MVBench and 64.9% on VideoMME. These results highlight the importance of prioritizing evidence purity and the effectiveness of our framework.
中文摘要 长格式视频推理仍然是视频大型语言模型（Video LLM）面临的主要挑战，因为静态均匀帧采样会导致信息稀释并掩盖关键证据。此外，现有的像素空间视频推理代理旨在主动与视频交互以获取新的视觉信息，由于缺乏严格的奖励机制来强制执行证据纯度，并且无法在预采样帧之外进行时间信息补充，因此仍然不理想。为了解决这一关键差距，我们提出了一种新颖的证据优先适应框架，该框架建立在我们的核心理念之上：“少选择，多推理”。我们的核心贡献是证据感知强化学习（EARL）框架，该框架将模型转变为主动的证据询问器。EARL 经过精确设计，可以动态选择最相关的帧，最重要的是，围绕选定的关键帧执行局部重新采样，以访问细粒度的时间细节。在五个要求苛刻的视频推理基准上进行的广泛实验表明，我们的 EARL 训练模型在开源视频 LLM 中实现了新的先进水平，同时学习了有效且高纯度的视觉证据选择策略。令人印象深刻的是，我们的 7B 模型在 LongVideoBench 上达到了 59.8%，在 MVBench 上达到了 69.0%，在 VideoMME 上达到了 64.9%。这些结果凸显了优先考虑证据纯度和我们框架有效性的重要性。

VDRive: Leveraging Reinforced VLA and Diffusion Policy for End-to-end Autonomous Driving

VDRive：利用增强型VLA和扩散策略实现端到端自动驾驶

Authors: Ziang Guo, Zufeng Zhang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.15446
Pdf link: https://arxiv.org/pdf/2510.15446
Abstract In autonomous driving, dynamic environments and corner cases pose significant challenges to the robustness of the ego vehicle's state understanding and decision making. We introduce VDRive, a novel pipeline for end-to-end autonomous driving that explicitly models state-action mapping to address these challenges, enabling interpretable and robust decision making. By leveraging the advanced state understanding of the Vision Language Action Model (VLA) together with a generative diffusion policy-based action head, VDRive guides driving contextually and geometrically. Contextually, VLA predicts future observations through token generation pre-training, where the observations are represented as discrete codes by a Conditional Vector Quantized Variational Autoencoder (CVQ-VAE). Geometrically, we perform reinforcement learning fine-tuning of the VLA to predict future trajectories and actions based on current driving conditions. VLA supplies the current state tokens and predicted state tokens for the action policy head to generate hierarchical actions and trajectories. During policy training, a learned critic evaluates the actions generated by the policy and provides gradient-based feedback, forming an actor-critic framework that enables a reinforcement-based policy learning pipeline. Experiments show that VDRive achieves state-of-the-art performance in the Bench2Drive closed-loop benchmark and nuScenes open-loop planning.
中文摘要 在自动驾驶中，动态环境和极端情况对自车状态理解和决策的鲁棒性提出了重大挑战。我们推出了 VDRive，这是一种用于端到端自动驾驶的新型管道，它明确地对状态-动作映射进行建模以应对这些挑战，从而实现可解释且稳健的决策。通过将视觉语言动作模型（VLA）的状态理解能力与基于生成式扩散策略的动作头相结合，我们的 VDRive 在上下文和几何上指导驾驶。在上下文方面，VLA 通过 token 生成预训练来预测未来的观测值，其中观测值由条件向量量化变分自动编码器（CVQ-VAE）表示为离散代码。在几何方面，我们对 VLA 进行强化学习微调，以根据当前的驾驶条件预测未来的轨迹和动作。VLA 为动作策略头提供当前状态 token 和预测状态 token，以生成分层动作和轨迹。在策略训练期间，一个学习得到的评论家（critic）评估策略产生的动作并提供基于梯度的反馈，形成一个演员-评论家（actor-critic）框架，从而实现基于强化的策略学习管道。实验表明，我们的 VDRive 在 Bench2Drive 闭环基准测试和 nuScenes 开环规划中实现了最先进的性能。

Expediting Reinforcement Learning by Incorporating Knowledge About Temporal Causality in the Environment

通过整合有关环境中时间因果关系的知识来加速强化学习

Authors: Jan Corazza, Hadi Partovi Aria, Daniel Neider, Zhe Xu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.15456
Pdf link: https://arxiv.org/pdf/2510.15456
Abstract Reinforcement learning (RL) algorithms struggle with learning optimal policies for tasks where reward feedback is sparse and depends on a complex sequence of events in the environment. Probabilistic reward machines (PRMs) are finite-state formalisms that can capture temporal dependencies in the reward signal, along with nondeterministic task outcomes. While special RL algorithms can exploit this finite-state structure to expedite learning, PRMs remain difficult to modify and design by hand. This hinders the already difficult tasks of utilizing high-level causal knowledge about the environment, and transferring the reward formalism into a new domain with a different causal structure. This paper proposes a novel method to incorporate causal information in the form of Temporal Logic-based Causal Diagrams into the reward formalism, thereby expediting policy learning and aiding the transfer of task specifications to new environments. Furthermore, we provide a theoretical result about convergence to optimal policy for our method, and demonstrate its strengths empirically.
中文摘要 强化学习（RL）算法难以为奖励反馈稀疏且依赖于环境中复杂事件序列的任务学习最佳策略。概率奖励机（PRM）是有限状态形式，可以捕获奖励信号中的时间依赖性以及非确定性任务结果。虽然特殊的 RL 算法可以利用这种有限状态结构来加快学习速度，但 PRM 仍然难以手动修改和设计。这阻碍了利用关于环境的高级因果知识以及将奖励形式主义转移到具有不同因果结构的新领域这一本已困难的任务。该文提出了一种新的方法，将基于时间逻辑的因果图形式的因果信息纳入奖励形式，从而加快策略学习，并帮助任务规范向新环境的转移。此外，我们还提供了关于我们方法向最优策略收敛的理论结果，并实证了其优势。

OffSim: Offline Simulator for Model-based Offline Inverse Reinforcement Learning

OffSim：基于模型的离线逆强化学习的离线模拟器

Authors: Woo-Jin Ahn, Sang-Ryul Baek, Yong-Jun Lee, Hyun-Duck Choi, Myo-Taeg Lim
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.15495
Pdf link: https://arxiv.org/pdf/2510.15495
Abstract Reinforcement learning algorithms typically utilize an interactive simulator (i.e., environment) with a predefined reward function for policy training. Developing such simulators and manually defining reward functions, however, is often time-consuming and labor-intensive. To address this, we propose an Offline Simulator (OffSim), a novel model-based offline inverse reinforcement learning (IRL) framework, to emulate environmental dynamics and reward structure directly from expert-generated state-action trajectories. OffSim jointly optimizes a high-entropy transition model and an IRL-based reward function to enhance exploration and improve the generalizability of the learned reward. Leveraging these learned components, OffSim can subsequently train a policy offline without further interaction with the real environment. Additionally, we introduce OffSim$^+$, an extension that incorporates a marginal reward for multi-dataset settings to enhance exploration. Extensive MuJoCo experiments demonstrate that OffSim achieves substantial performance gains over existing offline IRL methods, confirming its efficacy and robustness.
中文摘要 强化学习算法通常利用具有预定义奖励函数的交互式模拟器（即环境）进行策略训练。然而，开发此类模拟器和手动定义奖励函数通常既耗时又费力。为了解决这个问题，我们提出了一种离线模拟器（OffSim），这是一种基于模型的新型离线逆强化学习（IRL）框架，可以直接从专家生成的状态-动作轨迹中模拟环境动态和奖励结构。OffSim 联合优化了高熵转移模型和基于 IRL 的奖励函数，以增强探索，提高所学奖励的泛化性。利用这些学习到的组件，OffSim 随后可以离线训练策略，而无需与真实环境进一步交互。此外，我们还引入了 OffSim$^+$，这是一个扩展，它包含了多数据集设置的边际奖励，以增强探索。广泛的 MuJoCo 实验表明，与现有的离线 IRL 方法相比，OffSim 实现了显著的性能提升，证实了其有效性和稳健性。

HarmRLVR: Weaponizing Verifiable Rewards for Harmful LLM Alignment

HarmRLVR：将可验证奖励武器化以实现有害的 LLM 对齐

Authors: Yuexiao Liu, Lijun Li, Xingjun Wang, Jing Shao
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2510.15499
Pdf link: https://arxiv.org/pdf/2510.15499
Abstract Recent advancements in Reinforcement Learning with Verifiable Rewards (RLVR) have gained significant attention due to their objective and verifiable reward signals, demonstrating strong performance in reasoning and code generation tasks. However, the potential safety risks associated with RLVR remain underexplored. This paper presents HarmRLVR, the first systematic investigation into the alignment reversibility risk of RLVR. We show that safety alignment can be rapidly reversed using GRPO with merely 64 harmful prompts without responses, causing models to readily comply with harmful instructions. Across five models from Llama, Qwen, and DeepSeek, we empirically demonstrate that RLVR-based attacks elevate the average harmfulness score to 4.94 with an attack success rate of 96.01%, significantly outperforming harmful fine-tuning while preserving general capabilities. Our findings reveal that RLVR can be efficiently exploited for harmful alignment, posing serious threats to open-source model safety. Please see our code at this https URL.
中文摘要 具有可验证奖励的强化学习（RLVR）的最新进展因其客观且可验证的奖励信号而受到广泛关注，在推理和代码生成任务中表现出强大的性能。然而，与 RLVR 相关的潜在安全风险仍未得到充分探索。本文介绍了 HarmRLVR，这是对 RLVR 对齐可逆性风险的首次系统研究。我们表明，使用 GRPO，仅需 64 条不带回复的有害提示即可快速逆转安全对齐，从而使模型很容易遵从有害指令。在 Llama、Qwen 和 DeepSeek 的五个模型中，我们实证证明，基于 RLVR 的攻击将平均危害性分数提高到 4.94，攻击成功率为 96.01%，在保留通用能力的同时显著优于有害微调。我们的研究结果表明，RLVR 可以被有效地用于有害的对齐，对开源模型安全构成严重威胁。请参阅此 https URL 中的代码。

The Road Less Traveled: Enhancing Exploration in LLMs via Sequential Sampling

人迹罕至的道路：通过顺序采样增强大语言模型的探索

Authors: Shijia Kang, Muhan Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.15502
Pdf link: https://arxiv.org/pdf/2510.15502
Abstract Reinforcement learning (RL) has been pivotal in enhancing the reasoning capabilities of large language models (LLMs), but it often suffers from limited exploration and entropy collapse, where models exploit a narrow set of solutions, leading to a loss of sampling diversity and subsequently preventing RL from further improving performance. This issue is exacerbated in parallel sampling methods, where multiple outputs are drawn from the same distribution, potentially causing the model to converge to similar solutions. We propose SESA, a novel SEquential SAmpling framework that mitigates this challenge by generating diverse solution sketches sequentially before expanding them into full reasoning paths. This approach ensures broader exploration by conditioning each new output on previous ones, promoting diversity throughout the process and preventing policy collapse. Our experiments on a synthetic task show that sequential sampling consistently outperforms traditional RL methods in terms of path diversity and recovery from collapse. Further evaluations on real-world tasks demonstrate that SESA improves both the exploration of valid strategies and the overall performance of LLMs. On three agent benchmarks, SESA lifts success rates by +0.25, +0.42, and +0.07 absolute over the base model (up to an additional 211% relative improvement over baseline RL), underscoring its exploration advantage. This work introduces a structured approach to exploration, paving the way for more effective and diverse reasoning in RL-trained LLMs. Our code is released at this https URL.
中文摘要 强化学习（RL）在增强大型语言模型（LLM）的推理能力方面发挥了关键作用，但它经常受到有限探索和熵坍缩的影响，即模型利用一组狭窄的解决方案，导致采样多样性的丧失，进而阻止 RL 进一步提高性能。这个问题在并行采样方法中会加剧，其中从同一分布中提取多个输出，可能导致模型收敛到类似的解决方案。我们提出了 SESA，这是一种新颖的顺序采样（SEquential SAmpling）框架，它通过按顺序生成不同的解决方案草图，然后将它们扩展为完整的推理路径来缓解这一挑战。这种方法通过让每个新输出以先前的输出为条件来确保更广泛的探索，促进整个过程的多样性并防止策略崩溃。我们在合成任务上的实验表明，顺序采样在路径多样性和从坍缩中恢复方面始终优于传统的 RL 方法。对实际任务的进一步评估表明，SESA 改进了对有效策略的探索和 LLM 的整体性能。在三个智能体基准测试中，SESA 的成功率比基础模型分别绝对提高了 +0.25、+0.42 和 +0.07（相对于基线 RL 最多额外带来 211% 的相对提升），凸显了其探索优势。这项工作引入了一种结构化的探索方法，为 RL 训练的 LLM 中更有效和多样化的推理铺平了道路。我们的代码在此 https URL 上发布。

Taming the Judge: Deconflicting AI Feedback for Stable Reinforcement Learning

驯服评判者：消解人工智能反馈中的冲突以实现稳定的强化学习

Authors: Boyin Liu, Zhuo Zhang, Sen Huang, Lipeng Xie, Qingxu Fu, Haoran Chen, LI YU, Tianyi Hu, Zhaoyang Liu, Bolin Ding, Dongbin Zhao
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.15514
Pdf link: https://arxiv.org/pdf/2510.15514
Abstract However, this method often faces judgment inconsistencies that can destabilize reinforcement learning. While prior research has focused on the accuracy of judgments, the critical issue of logical coherence, especially issues such as preference cycles, has not been fully addressed. To fill this gap, we introduce a comprehensive framework designed to systematically detect and resolve these inconsistencies during the reinforcement learning training process. Our framework includes two main contributions: first, the Conflict Detection Rate (CDR), a new metric that quantifies judgment conflicts, and second, Deconflicted Graph Rewards (DGR), a framework that purifies signals by removing cycles before policy optimization. DGR constructs preference graphs from the initial judgments, transforms them into conflict-free Directed Acyclic Graphs (DAGs), and generates a logically coherent reward signal that is compatible with any policy optimizer. Experimental results show that our framework significantly enhances training stability and model performance compared to strong baselines, establishing logical consistency as a crucial and now manageable dimension of AI feedback.
中文摘要 然而，这种方法经常面临判断不一致的问题，从而破坏强化学习的稳定性。虽然先前的研究集中在判断的准确性上，但逻辑连贯性这一关键问题，尤其是偏好环等问题，尚未得到充分解决。为了填补这一空白，我们引入了一个全面的框架，旨在系统地检测和解决强化学习训练过程中的这些不一致之处。我们的框架包括两个主要贡献：首先是冲突检测率（CDR），一个量化判断冲突的新指标；其次是去冲突图奖励（DGR），一个通过在策略优化之前消除环来净化信号的框架。DGR 从最初的判断中构建偏好图，将其转换为无冲突的有向无环图（DAG），并生成与任何策略优化器兼容的逻辑连贯的奖励信号。实验结果表明，与强基线相比，我们的框架显著增强了训练稳定性和模型性能，将逻辑一致性确立为人工智能反馈的一个关键且现在可管理的维度。
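To make the preference-cycle problem concrete, the sketch below builds a preference graph from pairwise judgments, measures how often three judgments form a cycle, and removes cycles with a simple heuristic. The cyclic-triple rate is an illustrative stand-in, not the paper's CDR definition, and the net-wins ordering used to drop edges is a swapped-in heuristic, not the DGR algorithm; all names are hypothetical.

```python
from itertools import combinations


def pair_winner(x: int, y: int, beats: set[tuple[int, int]]):
    """Return the winner of pair (x, y), or None if unjudged or contradictory."""
    fwd, bwd = (x, y) in beats, (y, x) in beats
    if fwd == bwd:
        return None
    return x if fwd else y


def cyclic_triple_rate(n: int, beats: set[tuple[int, int]]) -> float:
    """Share of fully judged triples whose three judgments form a cycle."""
    judged = cyclic = 0
    for a, b, c in combinations(range(n), 3):
        winners = [pair_winner(a, b, beats), pair_winner(b, c, beats),
                   pair_winner(a, c, beats)]
        if None in winners:
            continue
        judged += 1
        # In a 3-cycle every item wins exactly one of the three comparisons.
        if set(winners) == {a, b, c}:
            cyclic += 1
    return cyclic / judged if judged else 0.0


def deconflict(n: int, beats: set[tuple[int, int]]):
    """Drop edges that contradict a net-wins ordering; return (dag, rewards)."""
    net = [0] * n
    for winner, loser in beats:
        net[winner] += 1
        net[loser] -= 1
    # Ties are broken by index, which is an arbitrary choice.
    order = sorted(range(n), key=lambda i: (-net[i], i))
    rank = {node: pos for pos, node in enumerate(order)}
    # Every kept edge points forward in one total order, so no cycle can remain.
    dag = {(w, l) for w, l in beats if rank[w] < rank[l]}
    rewards = [0] * n
    for winner, loser in dag:
        rewards[winner] += 1
        rewards[loser] -= 1
    return dag, rewards
```

For the cycle 0 beats 1, 1 beats 2, 2 beats 0, the rate is 1.0 and the heuristic drops the edge from 2 to 0, leaving rewards that are consistent with a single ranking.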

JudgeSQL: Reasoning over SQL Candidates with Weighted Consensus Tournament

JudgeSQL：通过加权共识锦标赛对候选 SQL 进行推理

Authors: Jiayuan Bai, Xuan-guang Pan, Chongyang Tao, Shuai Ma
Subjects: Artificial Intelligence (cs.AI); Databases (cs.DB)
Arxiv link: https://arxiv.org/abs/2510.15560
Pdf link: https://arxiv.org/pdf/2510.15560
Abstract Text-to-SQL is a pivotal task that bridges natural language understanding and structured data access, yet it remains fundamentally challenging due to semantic ambiguity and complex compositional reasoning. While large language models (LLMs) have greatly advanced SQL generation through prompting, supervised finetuning and reinforced tuning, the shift toward test-time scaling exposes a new bottleneck: selecting the correct query from a diverse candidate pool. Existing selection approaches, such as self-consistency or best-of-N decoding, provide only shallow signals, making them prone to inconsistent scoring, fragile reasoning chains, and a failure to capture fine-grained semantic distinctions between closely related SQL candidates. To this end, we introduce JudgeSQL, a principled framework that redefines SQL candidate selection through structured reasoning and a weighted consensus tournament mechanism. JudgeSQL develops a reasoning-based SQL judge model that distills reasoning traces with reinforcement learning guided by verifiable rewards, enabling accurate and interpretable judgments. Building on this, a weighted consensus tournament integrates explicit reasoning preferences with implicit generator confidence, yielding selections that are both more reliable and more efficient. Extensive experiments on the BIRD benchmark demonstrate that JudgeSQL exhibits superior SQL judgment capabilities and good cross-scale generalization and robustness to generator capacity.
中文摘要 文本转 SQL 是连接自然语言理解和结构化数据访问的一项关键任务，但由于语义模糊性和复杂的组合推理，它仍然具有根本性的挑战性。虽然大型语言模型（LLM）通过提示、监督微调和强化调优极大地推进了 SQL 生成，但向测试时扩展的转变暴露了一个新的瓶颈：从多样化的候选池中选择正确的查询。现有的选择方法，例如自洽或$N最佳解码，仅提供浅层信号，使其容易出现评分不一致、推理链脆弱以及无法捕获密切相关的 SQL 候选者之间的细粒度语义差异。为此，我们引入了 JudgeSQL，这是一个原则性框架，它通过结构化推理和加权共识锦标赛机制重新定义了 SQL 候选者选择。JudgeSQL 开发了一种基于推理的 SQL 判断模型，该模型通过可验证奖励指导的强化学习提炼推理轨迹，从而实现准确且可解释的判断。在此基础上，加权共识锦标赛将显式推理偏好与隐式生成器置信度相结合，产生更可靠、更高效的选择。在 BIRD 基准测试上的大量实验表明，JudgeSQL 表现出卓越的 SQL 判断能力以及良好的跨尺度泛化和生成器容量的鲁棒性。
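
As a rough illustration of how a tournament might combine pairwise judge preferences with generator confidence, the Python sketch below runs a round-robin over candidates and weights each win by the winner's confidence. The weighting rule, the `judge` callable, and the confidence values are all assumptions made for this example; the abstract does not specify JudgeSQL's actual scoring.

```python
from itertools import combinations

def weighted_consensus_tournament(candidates, confidence, judge):
    """Pick one candidate from a round-robin tournament.

    candidates: list of SQL strings.
    confidence: dict mapping each candidate to a generator weight
                (for example, its sampling frequency); hypothetical input.
    judge:      callable (a, b) -> the preferred candidate; a stand-in
                for a learned judge model.
    """
    score = {c: 0.0 for c in candidates}
    for a, b in combinations(candidates, 2):
        winner = judge(a, b)
        # Each pairwise win counts in proportion to generator confidence.
        score[winner] += confidence[winner]
    return max(candidates, key=lambda c: score[c])

candidates = ["SELECT id FROM t", "SELECT id, name FROM t", "SELECT * FROM t"]
confidence = {candidates[0]: 0.6, candidates[1]: 0.3, candidates[2]: 0.1}
# Toy judge that always prefers the shorter query.
toy_judge = lambda a, b: min(a, b, key=len)
print(weighted_consensus_tournament(candidates, confidence, toy_judge))
```

In this toy run the shortest query wins both of its matches but carries little generator confidence, so the candidate with one win and high confidence is selected instead. That is the kind of trade-off a weighted scheme makes possible.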

HEADER: Hierarchical Robot Exploration via Attention-Based Deep Reinforcement Learning with Expert-Guided Reward

标题：通过基于注意力的深度强化学习和专家指导奖励进行分层机器人探索

Authors: Yuhong Cao, Yizhuo Wang, Jingsong Liang, Shuhao Liao, Yifeng Zhang, Peizhuo Li, Guillaume Sartoretti
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.15679
Pdf link: https://arxiv.org/pdf/2510.15679
Abstract This work pushes the boundaries of learning-based methods in autonomous robot exploration in terms of environmental scale and exploration efficiency. We present HEADER, an attention-based reinforcement learning approach with hierarchical graphs for efficient exploration in large-scale environments. HEADER follows existing conventional methods to construct hierarchical representations for the robot belief/map, but further designs a novel community-based algorithm to construct and update a global graph, which remains fully incremental, shape-adaptive, and operates with linear complexity. Building upon attention-based networks, our planner finely reasons about the nearby belief within the local range while coarsely leveraging distant information at the global scale, enabling next-best-viewpoint decisions that consider multi-scale spatial dependencies. Beyond novel map representation, we introduce a parameter-free privileged reward that significantly improves model performance and produces near-optimal exploration behaviors, by avoiding training objective bias caused by handcrafted reward shaping. In challenging, large-scale simulated exploration scenarios, HEADER demonstrates better scalability than most existing learning and non-learning methods, while achieving a significant improvement in exploration efficiency (up to 20%) over state-of-the-art baselines. We also deploy HEADER on hardware and validate it in complex, large-scale real-life scenarios, including a 300 m × 230 m campus environment.
中文摘要 这项工作在环境规模和探索效率方面突破了自主机器人探索中基于学习的方法的界限。我们提出了 HEADER，这是一种基于注意力的强化学习方法，具有分层图，用于在大规模环境中进行高效探索。HEADER遵循现有的常规方法为机器人信念/地图构建分层表示，但进一步设计了一种基于社区的新算法来构建和更新全局图，该图保持完全增量、形状自适应并以线性复杂度运行。基于注意力的网络，我们的规划器对局部范围内的附近信念进行精细推理，同时粗略地利用全球尺度的遥远信息，从而能够做出考虑多尺度空间依赖性的次佳视点决策。除了新颖的地图表示之外，我们还引入了一种无参数的特权奖励，通过避免手工制作的奖励塑造引起的训练目标偏差，显着提高了模型性能并产生近乎最佳的探索行为。在模拟具有挑战性的大规模探索场景中，HEADER表现出比大多数现有学习和非学习方法更好的可扩展性，同时与最先进的基线相比，探索效率的显著提高（高达20%）。我们还在硬件上部署 HEADER，并在复杂、大规模的现实场景中进行验证，包括 300m*230m 的园区环境。

ProofOptimizer: Training Language Models to Simplify Proofs without Human Demonstrations

ProofOptimizer：训练语言模型以简化证明，无需人工演示

Authors: Alex Gu, Bartosz Piotrowski, Fabian Gloeckle, Kaiyu Yang, Aram H. Markosyan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Arxiv link: https://arxiv.org/abs/2510.15700
Pdf link: https://arxiv.org/pdf/2510.15700
Abstract Neural theorem proving has advanced rapidly in the past year, reaching IMO gold-medalist capabilities and producing formal proofs that span thousands of lines. Although such proofs are mechanically verified by formal systems like Lean, their excessive length renders them difficult for humans to comprehend and limits their usefulness for mathematical insight. Proof simplification is therefore a critical bottleneck. Yet, training data for this task is scarce, and existing methods -- mainly agentic scaffolding with off-the-shelf LLMs -- struggle with the extremely long proofs generated by RL-trained provers. We introduce ProofOptimizer, the first language model trained to simplify Lean proofs without requiring additional human supervision. ProofOptimizer is trained via expert iteration and reinforcement learning, using Lean to verify simplifications and provide training signal. At inference time, it operates within an iterative proof-shortening workflow, progressively reducing proof length. Experiments show that ProofOptimizer substantially compresses proofs generated by state-of-the-art RL-trained provers on standard benchmarks, reducing proof length by 87% on miniF2F, 57% on PutnamBench, and 49% on Seed-Prover's IMO 2025 proofs. Beyond conciseness, the simplified proofs check faster in Lean and further improve downstream prover performance when reused as training data for supervised finetuning.
中文摘要 神经定理证明在过去一年中取得了迅速的发展，达到了 IMO 金牌得主的能力，并产生了跨越数千行的形式证明。尽管此类证明可以通过精益等形式系统进行机械验证，但它们的长度过长使人类难以理解它们，并限制了它们对数学洞察力的有用性。因此，证明简化是一个关键瓶颈。然而，这项任务的训练数据很少，现有方法——主要是使用现成的 LLM 的代理脚手架——难以应对 RL 训练的证明者生成的极长证明。我们介绍了 ProofOptimizer，这是第一个经过训练的语言模型，可以简化精益证明，而无需额外的人工监督。ProofOptimizer 通过专家迭代和强化学习进行训练，使用精益来验证简化并提供训练信号。在推理时，它在迭代证明缩短工作流程中运行，逐步减少证明长度。实验表明，ProofOptimizer 大幅压缩了最先进的 RL 训练证明者在标准基准测试上生成的证明，在 miniF2F 上减少了 87%，在 PutnamBench 上减少了 57%，在 Seed-Prover 的 IMO 2025 证明上减少了 49%。除了简洁之外，简化的证明在精益中检查速度更快，并在重复用作监督微调的训练数据时进一步提高下游证明器性能。
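
The iterative proof-shortening workflow described above can be pictured as a propose-and-verify loop. The Python sketch below is a toy under stated assumptions: in the paper the proposer is a trained language model and the verifier is Lean, whereas here `toy_propose` and `toy_verify` are hypothetical stand-ins operating on a list of step strings.

```python
def shorten_iteratively(proof, propose, verify, max_rounds=8):
    """Repeatedly accept shorter candidates that still pass verification."""
    best = proof
    for _ in range(max_rounds):
        candidate = propose(best)
        # Accept only candidates that verify and are strictly shorter.
        if verify(candidate) and len(candidate) < len(best):
            best = candidate
        else:
            break
    return best

def toy_propose(steps):
    # Stand-in for the model: drop the first step marked as redundant.
    for i, step in enumerate(steps):
        if step.startswith("noop"):
            return steps[:i] + steps[i + 1:]
    return steps

def toy_verify(steps):
    # Stand-in for the proof checker: the essential steps must be present.
    return "intro h" in steps and steps[-1] == "exact h"

proof = ["intro h", "noop 1", "noop 2", "exact h"]
print(shorten_iteratively(proof, toy_propose, toy_verify))
```

Because every accepted candidate has passed the verifier, the loop can only return a proof that still checks, however aggressive the proposer is.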

Direct Preference Optimization with Unobserved Preference Heterogeneity: The Necessity of Ternary Preferences

具有未观测偏好异质性的直接偏好优化：三元偏好的必要性

Authors: Keertana Chidambaram, Karthik Vinary Seetharaman, Vasilis Syrgkanis
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.15716
Pdf link: https://arxiv.org/pdf/2510.15716
Abstract Reinforcement Learning from Human Feedback (RLHF) has become central to aligning large language models with human values, typically by first learning a reward model from preference data which is then used to update the model with reinforcement learning. Recent alternatives such as Direct Preference Optimization (DPO) simplify this pipeline by directly optimizing on preferences. However, both approaches often assume uniform annotator preferences and rely on binary comparisons, overlooking two key limitations: the diversity of human evaluators and the limitations of pairwise feedback. In this work, we address both these issues. First, we connect preference learning in RLHF with the econometrics literature and show that binary comparisons are insufficient for identifying latent user preferences from finite user data and infinite users, while (even incomplete) rankings over three or more responses ensure identifiability. Second, we introduce methods to incorporate heterogeneous preferences into alignment algorithms. We develop an Expectation-Maximization adaptation of DPO that discovers latent annotator types and trains a mixture of LLMs accordingly. Then we propose an aggregation algorithm using a min-max regret fairness criterion to produce a single generative policy with equitable performance guarantees. Together, these contributions establish a theoretical and algorithmic framework for fairness and personalization for diverse users in generative model alignment.
中文摘要 来自人类反馈的强化学习（RLHF）已成为使大型语言模型与人类价值观保持一致的核心，通常首先通过从偏好数据中学习奖励模型，然后使用该模型通过强化学习更新模型。最近的替代方案，如直接偏好优化（DPO），通过直接优化偏好来简化此管道。然而，这两种方法通常都假设注释者偏好统一并依赖二元比较，忽略了两个关键局限性：人类评估者的多样性和成对反馈的局限性。在这项工作中，我们解决了这两个问题。首先，我们将RLHF中的偏好学习与计量经济学文献联系起来，表明二元比较不足以从有限的用户数据和无限用户中识别潜在的用户偏好，而（甚至不完整的）三个或更多响应的排名可以确保可识别性。其次，我们介绍了将异构偏好纳入对齐算法的方法。我们开发了 DPO 的期望最大化适应，可以发现潜在的注释器类型并相应地训练 LLM 的混合。然后，我们提出了一种使用最小-最大后悔公平性标准的聚合算法，以生成具有公平性能保证的单一生成策略。这些贡献共同为生成模型对齐中的不同用户建立了公平和个性化的理论和算法框架。
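
To make the min-max regret criterion concrete, the Python sketch below selects, from a finite set of candidate policies, the one whose worst-case regret across annotator groups is smallest. This is a simplified illustration: the utility table is made up, `min_max_regret_choice` is a hypothetical helper, and the paper's aggregation algorithm produces a generative policy and is not restricted to a finite menu like this one.

```python
def min_max_regret_choice(utility):
    """utility[g][p]: utility of candidate policy p for annotator group g.

    Returns the index of the policy minimizing the maximum regret, where a
    group's regret is the gap to that group's own best policy.
    """
    n_groups = len(utility)
    n_policies = len(utility[0])
    best_per_group = [max(row) for row in utility]

    def worst_regret(p):
        return max(best_per_group[g] - utility[g][p] for g in range(n_groups))

    return min(range(n_policies), key=worst_regret)

# Two groups with opposed tastes; policy 2 is a compromise.
utility = [
    [1.0, 0.0, 0.7],  # group 0
    [0.0, 1.0, 0.6],  # group 1
]
print(min_max_regret_choice(utility))
```

Policies 0 and 1 each leave one group with regret 1.0, while the compromise policy keeps every group's regret at or below 0.4, so it is the one selected.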

Cost-Aware Retrieval-Augmentation Reasoning Models with Adaptive Retrieval Depth

具有自适应检索深度的成本感知检索-增强推理模型

Authors: Helia Hashemi, Victor Rühle, Saravan Rajmohan
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2510.15719
Pdf link: https://arxiv.org/pdf/2510.15719
Abstract Reasoning models have gained significant attention due to their strong performance, particularly when enhanced with retrieval augmentation. However, these models often incur high computational costs, as both retrieval and reasoning tokens contribute substantially to the overall resource usage. In this work, we make the following contributions: (1) we propose a retrieval-augmented reasoning model that dynamically adjusts the length of the retrieved document list based on the query and retrieval results; (2) we develop a cost-aware advantage function for training of efficient retrieval-augmented reasoning models through reinforcement learning; and (3) we explore both memory- and latency-bound implementations of the proposed cost-aware framework for both proximal and group relative policy optimization algorithms. We evaluate our approach on seven public question answering datasets and demonstrate significant efficiency gains, without compromising effectiveness. In fact, we observed that the model latency decreases by ~16-20% across datasets, while its effectiveness increases by ~5% on average, in terms of exact match.
中文摘要 推理模型因其强大的性能而受到广泛关注，特别是在通过检索增强进行增强时。然而，这些模型通常会产生高昂的计算成本，因为检索和推理令牌都对整体资源使用做出了重大贡献。在这项工作中，我们做出了以下贡献：（1）我们提出了一种检索增强推理模型，该模型根据查询和检索结果动态调整检索到的文档列表的长度;（2）我们开发了一种成本感知优势函数，用于通过强化学习训练高效的检索增强推理模型;（3）我们探索了所提出的成本感知框架的内存和延迟绑定实现，用于近端和组相对策略优化算法。我们评估了我们对七个公共问答数据集的方法，并在不影响有效性的情况下展示了显着的效率提升。事实上，我们观察到，就完全匹配而言，模型延迟在数据集中减少了 ~16-20%，而其有效性平均增加了 ~5%。
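
One plausible shape for a cost-aware advantage in a group-relative setting is sketched below in Python: each rollout's reward is reduced by a penalty proportional to its cost (for example, retrieved documents or generated tokens), and the shaped values are then normalized within the group. The linear penalty and the coefficient `lam` are assumptions for illustration; the abstract does not give the paper's exact advantage function.

```python
from statistics import mean, pstdev

def cost_aware_advantages(rewards, costs, lam=0.1, eps=1e-8):
    """Group-normalized advantages with a linear cost penalty (assumed form)."""
    shaped = [r - lam * c for r, c in zip(rewards, costs)]
    mu = mean(shaped)
    sigma = pstdev(shaped)
    # eps guards against division by zero when all shaped values are equal.
    return [(s - mu) / (sigma + eps) for s in shaped]

# Four rollouts for one query: two correct, two incorrect, differing in cost.
rewards = [1.0, 1.0, 0.0, 0.0]
costs = [10.0, 2.0, 10.0, 2.0]
print(cost_aware_advantages(rewards, costs))
```

Under this shaping, a correct but cheap rollout receives the largest advantage, and among rollouts with equal reward the cheaper one is always favored.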

ProSh: Probabilistic Shielding for Model-free Reinforcement Learning

ProSh：用于无模型强化学习的概率屏蔽

Authors: Edwin Hamel-De le Court, Gaspard Ohlmann, Francesco Belardinelli
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.15720
Pdf link: https://arxiv.org/pdf/2510.15720
Abstract Safety is a major concern in reinforcement learning (RL): we aim at developing RL systems that not only perform optimally, but are also safe to deploy by providing formal guarantees about their safety. To this end, we introduce Probabilistic Shielding via Risk Augmentation (ProSh), a model-free algorithm for safe reinforcement learning under cost constraints. ProSh augments the Constrained MDP state space with a risk budget and enforces safety by applying a shield to the agent's policy distribution using a learned cost critic. The shield ensures that all sampled actions remain safe in expectation. We also show that optimality is preserved when the environment is deterministic. Since ProSh is model-free, safety during training depends on the knowledge we have acquired about the environment. We provide a tight upper-bound on the cost in expectation, depending only on the backup-critic accuracy, that is always satisfied during training. Under mild, practically achievable assumptions, ProSh guarantees safety even at training time, as shown in the experiments.
中文摘要 安全性是强化学习（RL）的一个主要问题：我们的目标是开发不仅性能最佳，而且通过提供安全正式保证来安全部署的 RL 系统。为此，我们引入了通过风险增强的概率屏蔽（ProSh），这是一种在成本约束下进行安全强化学习的无模型算法。ProSh 通过风险预算增强了受约束的 MDP 状态空间，并通过使用有学识的成本批评者对代理的保单分配应用屏蔽来加强安全性。该防护罩可确保所有采样作在预期中保持安全。我们还表明，当环境是确定性的时，最优性会得到保留。由于 ProSh 是无模型的，因此培训期间的安全取决于我们所获得的环境知识。我们为预期成本提供了严格的上限，仅取决于备份批评者的准确性，这在训练期间始终得到满足。在温和的、实际可实现的假设下，ProSh 即使在训练时也能保证安全，如实验所示。
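
A shield that keeps sampled actions safe in expectation can be sketched as a reweighting of the policy distribution. In the Python example below, if the expected cost under the policy exceeds the budget, probability mass is shifted toward the lowest-cost action just far enough to meet the budget. This single-step mixing rule is an assumption chosen for clarity; it omits the risk-budget state augmentation and is not claimed to be ProSh's actual shield. `cost` stands in for the learned cost critic's per-action estimates.

```python
def shield(policy, cost, budget):
    """Return a distribution whose expected cost is at most `budget`.

    policy: list of action probabilities.
    cost:   list of estimated expected costs per action (critic stand-in).
    budget: remaining risk budget for this state.
    """
    expected = sum(p * c for p, c in zip(policy, cost))
    if expected <= budget:
        return list(policy)  # already safe in expectation
    safest = min(range(len(cost)), key=lambda a: cost[a])
    if cost[safest] >= budget:
        # No mixture can do better than committing to the safest action.
        return [1.0 if a == safest else 0.0 for a in range(len(policy))]
    # Mix with the safest action so the expected cost equals the budget.
    lam = (expected - budget) / (expected - cost[safest])
    return [(1 - lam) * p + (lam if a == safest else 0.0)
            for a, p in enumerate(policy)]

shielded = shield([0.5, 0.5], [0.0, 1.0], budget=0.25)
print(shielded)
```

The mixing weight is the smallest shift that satisfies the budget, so the shielded distribution stays as close to the original policy as this rule allows.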

RLAF: Reinforcement Learning from Automaton Feedback

Authors: Mahyar Alinejad, Alvaro Velasquez, Yue Wang, George Atia
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.15728
Pdf link: https://arxiv.org/pdf/2510.15728
Abstract Reinforcement Learning (RL) in environments with complex, history-dependent reward structures poses significant challenges for traditional methods. In this work, we introduce a novel approach that leverages automaton-based feedback to guide the learning process, replacing explicit reward functions with preferences derived from a deterministic finite automaton (DFA). Unlike conventional approaches that use automata for direct reward specification, our method employs the structure of the DFA to generate preferences over trajectories that are used to learn a reward function, eliminating the need for manual reward engineering. Our framework introduces a static approach that uses the learned reward function directly for policy optimization and a dynamic approach that involves continuous refining of the reward function and policy through iterative updates until convergence. Our experiments in both discrete and continuous environments demonstrate that our approach enables the RL agent to learn effective policies for tasks with temporal dependencies, outperforming traditional reward engineering and automaton-based baselines such as reward machines and LTL-guided methods. Our results highlight the advantages of automaton-based preferences in handling non-Markovian rewards, offering a scalable, efficient, and human-independent alternative to traditional reward modeling. We also provide a convergence guarantee showing that under standard assumptions our automaton-guided preference-based framework learns a policy that is near-optimal with respect to the true non-Markovian objective.
中文摘要 在具有复杂、依赖于历史的奖励结构的环境中进行强化学习（RL）对传统方法提出了重大挑战。在这项工作中，我们引入了一种新颖的方法，该方法利用基于自动机的反馈来指导学习过程，用源自确定性有限自动机（DFA）的偏好取代显式奖励函数。与使用自动机进行直接奖励规范的传统方法不同，我们的方法采用 DFA 的结构来生成对用于学习奖励函数的轨迹的偏好，从而消除了手动奖励工程的需要。我们的框架引入了一种静态方法，直接使用学习到的奖励函数进行策略优化，以及一种动态方法，通过迭代更新不断细化奖励函数和策略，直到收敛。我们在离散和连续环境中的实验表明，我们的方法使RL代理能够为具有时间依赖关系的任务学习有效的策略，优于传统的奖励工程和基于自动机的基线，如奖励机和LTL引导方法。我们的研究结果强调了基于自动机的偏好在处理非马尔可夫奖励方面的优势，为传统奖励建模提供了一种可扩展、高效且独立于人类的替代方案。我们还提供了一个收敛保证，表明在标准假设下，我们的自动机引导的基于偏好的框架学习到的策略相对于真正的非马尔可夫目标来说是近乎最优的。
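
To show how a DFA can yield preferences over trajectories without a hand-written reward, the Python sketch below runs two label sequences through a small automaton and prefers the one that ends closer to acceptance. The task ("get the key, then open the door"), the `progress` scores, and the comparison rule are assumptions made for illustration; the abstract does not state how RLAF ranks trajectories.

```python
def run_dfa(transitions, start, labels):
    """Return the DFA state reached after reading a sequence of labels."""
    state = start
    for label in labels:
        # Undefined (state, label) pairs are treated as self-loops.
        state = transitions.get((state, label), state)
    return state

def prefer(traj_a, traj_b, transitions, start, progress):
    """Return 1 if traj_a is preferred, -1 if traj_b is, 0 on a tie."""
    pa = progress[run_dfa(transitions, start, traj_a)]
    pb = progress[run_dfa(transitions, start, traj_b)]
    return (pa > pb) - (pa < pb)

# q0 --key--> q1 --door--> q2 (accepting); order matters.
transitions = {("q0", "key"): "q1", ("q1", "door"): "q2"}
progress = {"q0": 0, "q1": 1, "q2": 2}

print(prefer(["key", "door"], ["door", "key"], transitions, "q0", progress))
```

The two trajectories contain the same events, yet only the first completes the task, which is exactly the history-dependent distinction a Markovian reward on single events would miss. Preference pairs of this kind could then feed a standard preference-based reward learner.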

Self-evolving expertise in complex non-verifiable subject domains: dialogue as implicit meta-RL

复杂不可验证学科领域的自我发展专业知识：作为隐式元修复的对话

Authors: Richard M. Bailey
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.15772
Pdf link: https://arxiv.org/pdf/2510.15772
Abstract So-called 'wicked problems', those involving complex multi-dimensional settings, non-verifiable outcomes, heterogeneous impacts and a lack of single objectively correct answers, have plagued humans throughout history. Modern examples include decisions over justice frameworks, solving environmental pollution, planning for pandemic resilience and food security. The use of state-of-the-art artificial intelligence systems (notably Large Language Model-based agents) collaborating with humans on solving such problems is being actively explored. While the abilities of LLMs can be improved by, for example, fine-tuning, hand-crafted system prompts and scaffolding with external tools, LLMs lack endogenous mechanisms to develop expertise through experience in such settings. This work addresses this gap with Dialectica, a framework where agents engage in structured dialogue on defined topics, augmented by memory, self-reflection, and policy-constrained context editing. Formally, discussion is viewed as an implicit meta-reinforcement learning process. The 'dialogue-trained' agents are evaluated post-hoc using judged pairwise comparisons of elicited responses. Across two model architectures (locally run Qwen3:30b and OpenAI's o4-mini) results show that enabling reflection-based context editing during discussion produces agents which dominate their baseline counterparts on Elo scores, normalized Bradley-Terry-Davidson ability, and AlphaRank mass. The predicted signatures of learning are observed qualitatively in statement and reflection logs, where reflections identify weaknesses and reliably shape subsequent statements. Agreement between quantitative and qualitative evidence supports dialogue-driven context evolution as a practical path to targeted expertise amplification in open non-verifiable domains.
中文摘要 所谓的“邪恶问题”，即涉及复杂的多维环境、不可验证的结果、异质性影响和缺乏单一客观正确答案的问题，在历史上一直困扰着人类。现代例子包括司法框架的决策、解决环境污染、应对大流行病的复原力和粮食安全规划。正在积极探索使用最先进的人工智能系统（特别是基于大型语言模型的代理）与人类合作解决此类问题。虽然法学硕士的能力可以通过微调、手工制作的系统提示和使用外部工具搭建脚手架等方式来提高，但法学硕士缺乏通过此类环境中的经验来发展专业知识的内生机制。这项工作通过辩证法解决了这一差距，辩证法是一个框架，代理在该框架中就定义的主题进行结构化对话，并通过记忆、自我反思和策略约束的上下文编辑来增强。从形式上讲，讨论被视为一种隐式元强化学习过程。使用引发的反应的判断成对比较，对“对话训练”的代理进行事后评估。在两个模型架构（本地运行的 Qwen3：30b 和 OpenAI 的 o4-mini）中，结果表明，在讨论期间启用基于反射的上下文编辑会产生在 Elo 分数、归一化 Bradley-Terry-Davidson 能力和 AlphaRank 质量方面主导基线对应物的代理。在陈述和反思日志中定性地观察到预测的学习特征，其中反思识别弱点并可靠地塑造后续陈述。定量证据和定性证据之间的一致性支持对话驱动的情境演变，将其作为在开放的不可验证领域中有针对性地扩大专业知识的实用途径。
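
Since the evaluation above relies on ratings derived from judged pairwise comparisons, a short Python sketch of the standard Elo update may help readers unfamiliar with it. The agent names and match list are hypothetical, the K-factor and base rating are conventional defaults, not values from the paper, and the Bradley-Terry-Davidson and AlphaRank analyses are not reproduced here.

```python
def elo_ratings(agents, matches, k=32.0, base=1000.0):
    """Sequential Elo ratings from a list of (winner, loser) outcomes."""
    ratings = {a: base for a in agents}
    for winner, loser in matches:
        # Expected score of the winner under the logistic Elo model.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        delta = k * (1.0 - expected)
        ratings[winner] += delta
        ratings[loser] -= delta
    return ratings

# Hypothetical outcomes: three wins for one agent, then one loss.
matches = [("reflective", "baseline")] * 3 + [("baseline", "reflective")]
print(elo_ratings(["reflective", "baseline"], matches))
```

Each update transfers rating points from loser to winner, so the total is conserved, and an upset (a win against a higher-rated opponent) moves more points than an expected result.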

DexCanvas: Bridging Human Demonstrations and Robot Learning for Dexterous Manipulation

DexCanvas：连接人类演示和机器人学习以实现灵巧作

Authors: Xinyue Xu, Jieqiang Sun, Jing (Daisy) Dai, Siyuan Chen, Lanjie Ma, Ke Sun, Bin Zhao, Jianbo Yuan, Yiwen Lu
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.15786
Pdf link: https://arxiv.org/pdf/2510.15786
Abstract We present DexCanvas, a large-scale hybrid real-synthetic human manipulation dataset containing 7,000 hours of dexterous hand-object interactions seeded from 70 hours of real human demonstrations, organized across 21 fundamental manipulation types based on the Cutkosky taxonomy. Each entry combines synchronized multi-view RGB-D, high-precision mocap with MANO hand parameters, and per-frame contact points with physically consistent force profiles. Our real-to-sim pipeline uses reinforcement learning to train policies that control an actuated MANO hand in physics simulation, reproducing human demonstrations while discovering the underlying contact forces that generate the observed object motion. DexCanvas is the first manipulation dataset to combine large-scale real demonstrations, systematic skill coverage based on established taxonomies, and physics-validated contact annotations. The dataset can facilitate research in robotic manipulation learning, contact-rich control, and skill transfer across different hand morphologies.
中文摘要 我们展示了 DexCanvas，这是一个大规模的混合真实合成人体作数据集，包含 7,000 小时灵巧的手与物体交互，这些交互来自 70 小时的真实人类演示，根据 Cutkosky 分类法组织了 21 种基本作类型。每个条目都结合了同步的多视图 RGB-D、带有 MANO 手部参数的高精度动作捕捉以及具有物理一致力曲线的每帧接触点。我们的真实到模拟管道使用强化学习来训练在物理模拟中控制驱动的 MANO 手的策略，再现人类演示，同时发现产生观察到的物体运动的潜在接触力。DexCanvas 是第一个结合了大规模真实演示、基于既定分类法的系统技能覆盖以及经过物理验证的接触注释的作数据集。该数据集可以促进机器人作学习、接触丰富控制以及跨不同手部形态的技能转移的研究。

Cavity Duplexer Tuning with 1d Resnet-like Neural Networks

使用类一维 Resnet 神经网络进行腔体双工器调谐

Authors: Anton Raskovalov
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2510.15796
Pdf link: https://arxiv.org/pdf/2510.15796
Abstract This paper presents a machine learning method for tuning a cavity duplexer with a large number of adjustment screws. After testing, we rejected the conventional reinforcement learning approach and reformulated the task in a supervised learning setup. The suggested neural network architecture includes a 1D ResNet-like backbone and processes additional information about the S-parameters, such as curve shape and peak positions and amplitudes. Combined with an external control algorithm, this neural network can bring the duplexer to an almost tuned state within 4-5 rotations per screw.
中文摘要 本文提出了一种用于大量调节螺钉腔体双工器调谐的机器学习方法。经过测试，我们拒绝了传统的强化学习方法，并在监督学习设置中重新制定了我们的任务。建议的神经网络架构包括类似 1d ResNet 的主干网和处理有关 S 参数的一些附加信息，例如曲线的形状和峰值位置和幅度。这种带有外部控制算法的神经网络能够在每个螺钉 4-5 转内达到几乎双工器的调谐状态。

FIDDLE: Reinforcement Learning for Quantum Fidelity Enhancement

FIDDLE：用于增强量子保真度的强化学习

Authors: Hoang M. Ngo, Tamer Kahveci, My T. Thai
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.15833
Pdf link: https://arxiv.org/pdf/2510.15833
Abstract Quantum computing has the potential to revolutionize fields like quantum optimization and quantum machine learning. However, current quantum devices are hindered by noise, reducing their reliability. A key challenge in gate-based quantum computing is improving the reliability of quantum circuits, measured by process fidelity, during the transpilation process, particularly in the routing stage. In this paper, we address the Fidelity Maximization in Routing Stage (FMRS) problem by introducing FIDDLE, a novel learning framework comprising two modules: a Gaussian Process-based surrogate model to estimate process fidelity with limited training samples and a reinforcement learning module to optimize routing. Our approach is the first to directly maximize process fidelity, outperforming traditional methods that rely on indirect metrics such as circuit depth or gate count. We rigorously evaluate FIDDLE by comparing it with state-of-the-art fidelity estimation techniques and routing optimization methods. The results demonstrate that our proposed surrogate model is able to provide a better estimation on the process fidelity compared to existing learning techniques, and our end-to-end framework significantly improves the process fidelity of quantum circuits across various noise models.
中文摘要 量子计算有可能彻底改变量子优化和量子机器学习等领域。然而，当前的量子器件受到噪声的阻碍，降低了其可靠性。基于门的量子计算的一个关键挑战是在转编过程中提高量子电路的可靠性，通过过程保真度来衡量，特别是在路由阶段。在本文中，我们通过引入FIDDLE来解决路由阶段保真度最大化（FMRS）问题，FIDDLE 是一种由两个模块组成的新型学习框架：一个基于高斯过程的代理模型，用于在有限的训练样本下估计过程保真度，以及一个用于优化路由的强化学习模块。我们的方法是第一个直接最大限度地提高过程保真度的方法，优于依赖间接指标（如电路深度或栅极数）的传统方法。我们通过将 FIDDLE 与最先进的保真度估计技术和路由优化方法进行比较来严格评估 FIDDLE。结果表明，与现有的学习技术相比，我们提出的代理模型能够更好地估计过程保真度，并且我们的端到端框架显着提高了量子电路在各种噪声模型上的过程保真度。

Learning Correlated Reward Models: Statistical Barriers and Opportunities

学习相关奖励模型：统计障碍和机会

Authors: Yeshwanth Cherapanamjeri, Constantinos Daskalakis, Gabriele Farina, Sobhan Mohammadpour
Subjects: Machine Learning (cs.LG); Econometrics (econ.EM); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2510.15839
Pdf link: https://arxiv.org/pdf/2510.15839
Abstract Random Utility Models (RUMs) are a classical framework for modeling user preferences and play a key role in reward modeling for Reinforcement Learning from Human Feedback (RLHF). However, a crucial shortcoming of many of these techniques is the Independence of Irrelevant Alternatives (IIA) assumption, which collapses all human preferences to a universal underlying utility function, yielding a coarse approximation of the range of human preferences. On the other hand, statistical and computational guarantees for models avoiding this assumption are scarce. In this paper, we investigate the statistical and computational challenges of learning a correlated probit model, a fundamental RUM that avoids the IIA assumption. First, we establish that the classical data collection paradigm of pairwise preference data is fundamentally insufficient to learn correlational information, explaining the lack of statistical and computational guarantees in this setting. Next, we demonstrate that best-of-three preference data provably overcomes these shortcomings, and devise a statistically and computationally efficient estimator with near-optimal performance. These results highlight the benefits of higher-order preference data in learning correlated utilities, allowing for more fine-grained modeling of human preferences. Finally, we validate these theoretical guarantees on several real-world datasets, demonstrating improved personalization of human preferences.
中文摘要 随机实用新型（RUM）是用于对用户偏好进行建模的经典框架，在人类反馈强化学习（RLHF）的奖励建模中发挥着关键作用。然而，其中许多技术的一个关键缺点是不相关替代方案的独立性（IIA）假设，它将 \emph{all} 人类偏好崩溃为普遍的潜在效用函数，从而产生人类偏好范围的粗略近似值。另一方面，避免这一假设的模型的统计和计算保证很少。在本文中，我们研究了学习\emph{相关}概率模型的统计和计算挑战，\emph{相关}概率模型是一种避免IIA假设的基本RUM。首先，我们确定成对偏好数据的经典数据收集范式对于学习相关信息来说是\emph{根本上不足}的，这解释了在这种情况下缺乏统计和计算保证的原因。接下来，我们证明\emph{best-of-three}偏好数据可以证明克服了这些缺点，并设计了一个具有接近最佳性能的统计和计算效率估计器。这些结果凸显了高阶偏好数据在学习相关效用方面的好处，允许对人类偏好进行更细粒度的建模。最后，我们在几个真实世界的数据集上验证了这些理论保证，证明了人类偏好个性化的改进。

BLIP3o-NEXT: Next Frontier of Native Image Generation

BLIP3o-NEXT：原生图像生成的下一个前沿

Authors: Jiuhai Chen, Le Xue, Zhiyang Xu, Xichen Pan, Shusheng Yang, Can Qin, An Yan, Honglu Zhou, Zeyuan Chen, Lifu Huang, Tianyi Zhou, Junnan Li, Silvio Savarese, Caiming Xiong, Ran Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2510.15857
Pdf link: https://arxiv.org/pdf/2510.15857
Abstract We present BLIP3o-NEXT, a fully open-source foundation model in the BLIP3 series that advances the next frontier of native image generation. BLIP3o-NEXT unifies text-to-image generation and image editing within a single architecture, demonstrating strong image generation and image editing capabilities. In developing the state-of-the-art native image generation model, we identify four key insights: (1) Most architectural choices yield comparable performance; an architecture can be deemed effective provided it scales efficiently and supports fast inference; (2) The successful application of reinforcement learning can further push the frontier of native image generation; (3) Image editing still remains a challenging task, yet instruction following and the consistency between generated and reference images can be significantly enhanced through post-training and data engine; (4) Data quality and scale continue to be decisive factors that determine the upper bound of model performance. Building upon these insights, BLIP3o-NEXT leverages an Autoregressive + Diffusion architecture in which an autoregressive model first generates discrete image tokens conditioned on multimodal inputs, whose hidden states are then used as conditioning signals for a diffusion model to generate high-fidelity images. This architecture integrates the reasoning strength and instruction following of autoregressive models with the fine-detail rendering ability of diffusion models, achieving a new level of coherence and realism. Extensive evaluations of various text-to-image and image-editing benchmarks show that BLIP3o-NEXT achieves superior performance over existing models.
中文摘要 我们推出 BLIP3o-NEXT，这是 BLIP3 系列中的一个完全开源的基础模型，它推动了原生图像生成的下一个前沿。BLIP3o-NEXT将文本到图像生成和图像编辑统一在一个架构中，展示了强大的图像生成和图像编辑能力。在开发最先进的原生图像生成模型时，我们确定了四个关键见解：（1）大多数架构选择都能产生可比的性能;如果架构能够有效扩展并支持快速推理，则可以认为它是有效的;（2）强化学习的成功应用可以进一步推动原生图像生成的前沿;（3）图像编辑仍然是一项具有挑战性的任务，但通过后训练和数据引擎可以显著增强指令遵循以及生成图像和参考图像之间的一致性;（4）数据质量和规模仍然是决定模型性能上限的决定性因素。基于这些见解，BLIP3o-NEXT 利用了自回归 + 扩散架构，其中自回归模型首先生成以多模态输入为条件的离散图像标记，然后将其隐藏状态用作扩散模型的条件信号，以生成高保真图像。该架构将自回归模型的推理强度和指令遵循与扩散模型的精细渲染能力相结合，实现了新的连贯性和真实性。对各种文本到图像和图像编辑基准的广泛评估表明，BLIP3o-NEXT 实现了优于现有模型的性能。

InfiMed-ORBIT: Aligning LLMs on Open-Ended Complex Tasks via Rubric-Based Incremental Training

InfiMed-ORBIT：通过基于评分标准的增量训练在开放式复杂任务上调整法学硕士

Authors: Pengkai Wang, Qi Zuo, Pengwei Liu, Zhijie Sang, Congkai Xie, Hongxia Yang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.15859
Pdf link: https://arxiv.org/pdf/2510.15859
Abstract Large Language Models (LLMs) have shown substantial advances through reinforcement learning (RL), particularly in domains where rewards can be programmatically verified, such as mathematics and code. In these areas, models benefit from a well-defined operational base guided by explicit rule-based objectives. However, this progress reveals a significant limitation: in open-ended domains where rewards are ambiguous, subjective, or context-dependent, such as creative writing, scientific reasoning, and notably medical consultation, robust reward functions are lacking, making these areas challenging for current RL strategies. To bridge this gap, we introduce ORBIT, an open-ended rubric-based incremental training framework specifically designed for high-stakes medical dialogue. ORBIT integrates synthetic dialogue generation with the dynamic creation of rubrics, employing these rubrics to direct an incremental RL process. In particular, this approach does not depend on external medical knowledge or manual rules, instead utilizing rubric-guided feedback to shape learning. When implemented on the Qwen3-4B-Instruct model, our method can greatly enhance its performance on the HealthBench-Hard benchmark from 7.0 to 27.2 using only 2k samples, thus achieving state-of-the-art results for models of this scale. Our analysis confirms that rubric-driven RL fosters consistent performance gains across diverse consultation scenarios, going beyond simple numerical improvements. These findings underscore rubric-based feedback as a scalable strategy for advancing LLMs in intricate, open-ended tasks.
中文摘要 大型语言模型（LLM）通过强化学习（RL）取得了重大进步，特别是在可以通过编程方式验证奖励的领域，例如数学和代码。在这些领域，模型受益于由明确的基于规则的目标指导的明确定义的运营基础。然而，这一进展揭示了一个重大局限性：在奖励模糊、主观或依赖于上下文的开放式领域，例如创意写作、科学推理，尤其是医疗咨询，缺乏强大的奖励功能，这使得这些领域对当前的 RL 策略具有挑战性。为了弥合这一差距，我们推出了 ORBIT，这是一个基于评分标准的开放式增量培训框架，专为高风险的医疗对话而设计。ORBIT 将合成对话生成与评分标准的动态创建相结合，利用这些评分标准来指导增量 RL 过程。特别是，这种方法不依赖于外部医学知识或手动规则，而是利用评分标准指导的反馈来塑造学习。当在Qwen3-4B-Instruct模型上实现时，我们的方法只需使用2k样本即可将其在HealthBench-Hard基准测试上的性能从7.0提高到27.2，从而为这种规模的模型获得最先进的结果。我们的分析证实，评分标准驱动的 RL fos-ter 在不同的咨询场景中具有一致的性能提升，而不仅仅是简单的数字改进。这些发现强调了基于评分标准的反馈是一种可扩展的策略，可以在复杂的开放式任务中推进法学硕士。

PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold

PokeeResearch：通过人工智能反馈和稳健推理支架的强化学习进行有效的深度研究

Authors: Yi Wan, Jiuqi Wang, Liam Li, Jinsong Liu, Ruihao Zhu, Zheqing Zhu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.15862
Pdf link: https://arxiv.org/pdf/2510.15862
Abstract Tool-augmented large language models (LLMs) are emerging as deep research agents, systems that decompose complex queries, retrieve external evidence, and synthesize grounded responses. Yet current agents remain limited by shallow retrieval, weak alignment metrics, and brittle tool-use behavior. We introduce PokeeResearch-7B, a 7B-parameter deep research agent built under a unified reinforcement learning framework for robustness, alignment, and scalability. PokeeResearch-7B is trained by an annotation-free Reinforcement Learning from AI Feedback (RLAIF) framework to optimize policies using LLM-based reward signals that capture factual accuracy, citation faithfulness, and instruction adherence. A chain-of-thought-driven multi-call reasoning scaffold further enhances robustness through self-verification and adaptive recovery from tool failures. Across 10 popular deep research benchmarks, PokeeResearch-7B achieves state-of-the-art performance among 7B-scale deep research agents. This highlights that careful reinforcement learning and reasoning design can produce efficient, resilient, and research-grade AI agents. The model and inference code are open-sourced under the MIT license at this https URL.
中文摘要 工具增强型大型语言模型（LLM）正在成为深度研究代理，即分解复杂查询、检索外部证据和合成扎根响应的系统。然而，当前的代理仍然受到浅层检索、弱对齐指标和脆弱的工具使用行为的限制。我们介绍了 PokeeResearch-7B，这是一个 7B 参数的深度研究代理，在统一的强化学习框架下构建，具有鲁棒性、一致性和可扩展性。PokeeResearch-7B 由无注释的 AI 反馈强化学习（RLAIF）框架进行训练，使用基于 LLM 的奖励信号来优化策略，这些信号捕获事实准确性、引文忠实度和指令依从性。思维链驱动的多调用推理支架通过自我验证和工具故障的自适应恢复进一步增强了鲁棒性。在 10 个流行的深度研究基准中，PokeeResearch-7B 在 7B 规模的深度研究代理中取得了最先进的性能。这凸显了仔细的强化学习和推理设计可以产生高效、有弹性和研究级的人工智能代理。模型和推理代码在此 https URL 下根据 MIT 许可开源。

Keyword: diffusion policy

VDRive: Leveraging Reinforced VLA and Diffusion Policy for End-to-end Autonomous Driving

VDRive：利用增强型VLA和扩散策略实现端到端自动驾驶

Authors: Ziang Guo, Zufeng Zhang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.15446
Pdf link: https://arxiv.org/pdf/2510.15446
Abstract In autonomous driving, dynamic environments and corner cases pose significant challenges to the robustness of the ego vehicle's state understanding and decision making. We introduce VDRive, a novel pipeline for end-to-end autonomous driving that explicitly models state-action mapping to address these challenges, enabling interpretable and robust decision making. By combining the state understanding of a Vision Language Action Model (VLA) with a generative diffusion policy-based action head, VDRive guides driving both contextually and geometrically. Contextually, the VLA predicts future observations through token generation pre-training, where the observations are represented as discrete codes by a Conditional Vector Quantized Variational Autoencoder (CVQ-VAE). Geometrically, we perform reinforcement learning fine-tuning of the VLA to predict future trajectories and actions based on current driving conditions. The VLA supplies the current state tokens and predicted state tokens for the action policy head to generate hierarchical actions and trajectories. During policy training, a learned critic evaluates the actions generated by the policy and provides gradient-based feedback, forming an actor-critic framework that enables a reinforcement-based policy learning pipeline. Experiments show that VDRive achieves state-of-the-art performance in the Bench2Drive closed-loop benchmark and nuScenes open-loop planning.
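Abstract (translated from Chinese) In autonomous driving, dynamic environments and corner cases pose major challenges to the robustness of the ego vehicle's state understanding and decision making. We present VDRive, a new pipeline for end-to-end autonomous driving that explicitly models the state-action mapping to meet these challenges, enabling interpretable and robust decisions. Using the state understanding of a Vision Language Action Model (VLA) together with an action head based on a generative diffusion policy, VDRive guides driving both contextually and geometrically. Contextually, the VLA predicts future observations through token-generation pre-training, with observations represented as discrete codes by a Conditional Vector Quantized Variational Autoencoder (CVQ-VAE). Geometrically, we fine-tune the VLA with reinforcement learning to predict future trajectories and actions from current driving conditions. The VLA provides current state tokens and predicted state tokens to the action policy head, which generates hierarchical actions and trajectories. During policy training, a learned critic evaluates the actions produced by the policy and provides gradient-based feedback, forming an actor-critic framework that enables a reinforcement-based policy learning pipeline. Experiments show that VDRive achieves state-of-the-art performance on the Bench2Drive closed-loop benchmark and in nuScenes open-loop planning.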
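The gradient-based feedback from a critic to a policy can be illustrated with a toy example. This is a sketch under strong assumptions, not VDRive's training code: the critic here is a fixed hand-written function standing in for a learned one, the policy is a single scalar weight rather than a diffusion policy head, and all names (critic_value, critic_action_gradient, train_policy) are hypothetical.

```python
def critic_value(state, action):
    """Stand-in critic Q(s, a): highest when action == 2 * state.

    A real critic would be a learned network; this closed form is an
    assumption made so the example stays self-contained.
    """
    return -(action - 2.0 * state) ** 2


def critic_action_gradient(state, action):
    """dQ/da for the stand-in critic above."""
    return -2.0 * (action - 2.0 * state)


def train_policy(states, weight=0.0, lr=0.05, epochs=200):
    """Gradient ascent on Q(s, pi(s)) for a linear policy pi(s) = weight * s."""
    for _ in range(epochs):
        for s in states:
            action = weight * s
            # Chain rule: dQ/dweight = dQ/da * da/dweight, with da/dweight = s.
            weight += lr * critic_action_gradient(s, action) * s
    return weight


learned_weight = train_policy([0.5, 1.0, -1.0, 1.5])
print(round(learned_weight, 4))
```

The policy never sees a reward directly. It improves only by following the critic's gradient with respect to the action, which is the sense in which the critic gives gradient-based feedback. Here the weight converges toward 2, the value the stand-in critic prefers.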

VO-DP: Semantic-Geometric Adaptive Diffusion Policy for Vision-Only Robotic Manipulation

Title (translated from Chinese) VO-DP: A Semantic-Geometric Adaptive Diffusion Policy for Vision-Only Robotic Manipulation

Authors: Zehao Ni, Yonghao He, Lingfeng Qian, Jilei Mao, Fa Fu, Wei Sui, Hu Su, Junran Peng, Zhipeng Wang, Bin He
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.15530
Pdf link: https://arxiv.org/pdf/2510.15530
Abstract In the context of imitation learning, visuomotor-based diffusion policy learning is one of the main directions in robotic manipulation. Most of these approaches rely on point clouds as observation inputs and construct scene representations through point-cloud feature learning, which enables them to achieve remarkable accuracy. However, the existing literature lacks an in-depth exploration of vision-only solutions that have significant potential. In this paper, we propose a Vision-Only and single-view Diffusion Policy learning method (VO-DP) that leverages pretrained visual foundation models to achieve effective fusion of semantic and geometric features. We utilize intermediate features from VGGT, incorporating semantic features from DINOv2 and geometric features from Alternating Attention blocks. Features are fused via cross-attention and spatially compressed with a CNN to form the input to the policy head. Extensive experiments demonstrate that VO-DP not only outperforms the vision-only baseline DP significantly but also exhibits distinct performance trends against the point cloud-based method DP3: in simulation tasks, VO-DP achieves an average success rate of 64.6%, on par with DP3 (64.0%) and far higher than DP (34.8%), while in real-world tasks, it reaches 87.9%, outperforming both DP3 (67.5%) and DP (11.2%) by a notable margin. Further robustness evaluations confirm that VO-DP remains highly stable under varying conditions including color, size, background, and lighting. Lastly, we open-source a training library for robotic manipulation. Built on Accelerate, this library supports multi-machine and multi-GPU parallel training, as well as mixed precision training. It is compatible with visuomotor policies such as DP, DP3 and VO-DP, and also supports the RoboTwin simulator.
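Abstract (translated from Chinese) In imitation learning, visuomotor diffusion policy learning is one of the main directions in robotic manipulation. Most of these methods rely on point clouds as observation inputs and build scene representations through point-cloud feature learning, which gives them very high accuracy. However, the existing literature lacks an in-depth exploration of vision-only solutions, which have great potential. In this paper, we propose a vision-only, single-view diffusion policy learning method (VO-DP) that uses pretrained visual foundation models to fuse semantic and geometric features effectively. We use intermediate features from VGGT, combining semantic features from DINOv2 with geometric features from Alternating Attention blocks. The features are fused via cross-attention and spatially compressed with a CNN to form the input to the policy head. Extensive experiments show that VO-DP not only significantly outperforms the vision-only baseline DP but, compared with the point-cloud-based method DP3, reaches an average success rate of 64.6% in simulation tasks, comparable to DP3's 64.0% and far above DP's 34.8%, while in real-world tasks it reaches 87.9%, clearly ahead of DP3's 67.5% and DP's 11.2%. Further robustness evaluations confirm that VO-DP remains highly stable under varying conditions such as color, size, background, and lighting. Finally, we open-source a training library for robotic manipulation. Built on Accelerate, the library supports multi-machine and multi-GPU parallel training as well as mixed-precision training. It is compatible with visuomotor policies such as DP, DP3, and VO-DP, and also supports the RoboTwin simulator.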
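The fusion step (cross-attention followed by spatial compression) can be sketched in a few lines. This is a simplified illustration, not the VO-DP architecture. It assumes that semantic tokens act as queries and geometric tokens as keys and values, which the abstract does not state. It omits learned projections, treats tokens as a 1-D sequence, and replaces the CNN with plain average pooling. All function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention without learned projections."""
    scale = math.sqrt(len(keys[0]))
    fused = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        fused.append([sum(w * v[d] for w, v in zip(weights, values))
                      for d in range(len(values[0]))])
    return fused


def average_pool(tokens, window):
    """Stand-in for CNN compression: average non-overlapping token groups."""
    pooled = []
    for start in range(0, len(tokens) - window + 1, window):
        group = tokens[start:start + window]
        pooled.append([sum(column) / window for column in zip(*group)])
    return pooled


# Toy features: four semantic tokens attend over two geometric tokens.
semantic = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
geometric = [[1.0, 0.0], [0.0, 1.0]]

fused = cross_attention(semantic, geometric, geometric)
policy_input = average_pool(fused, window=2)
print(len(fused), len(policy_input))
```

Each fused token is a weighted mix of the geometric tokens, with weights set by how well the semantic query matches each key. Pooling then halves the number of tokens passed to the policy head.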