生成时间: 2025-10-14 16:32:01 (UTC+8); Arxiv 发布时间: 2025-10-14 20:00 EDT (2025-10-15 08:00 UTC+8)
今天共有 94 篇相关文章
Keyword: reinforcement learning
Table Question Answering in the Era of Large Language Models: A Comprehensive Survey of Tasks, Methods, and Evaluation
大语言模型时代的表格问答:任务、方法与评估综述
- Authors: Wei Zhou, Bolei Ma, Annemarie Friedrich, Mohsen Mesgar
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2510.09671
- Pdf link: https://arxiv.org/pdf/2510.09671
- Abstract
Table Question Answering (TQA) aims to answer natural language questions about tabular data, often accompanied by additional contexts such as text passages. The task spans diverse settings, varying in table representation, question/answer complexity, modality involved, and domain. While recent advances in large language models (LLMs) have led to substantial progress in TQA, the field still lacks a systematic organization and understanding of task formulations, core challenges, and methodological trends, particularly in light of emerging research directions such as reinforcement learning. This survey addresses this gap by providing a comprehensive and structured overview of TQA research with a focus on LLM-based methods. We provide a comprehensive categorization of existing benchmarks and task setups. We group current modeling strategies according to the challenges they target, and analyze their strengths and limitations. Furthermore, we highlight underexplored but timely topics that have not been systematically covered in prior research. By unifying disparate research threads and identifying open problems, our survey offers a consolidated foundation for the TQA community, enabling a deeper understanding of the state of the art and guiding future developments in this rapidly evolving area.
- 中文摘要
表格问答 (TQA) 旨在回答有关表格数据的自然语言问题,通常伴随着文本段落等附加上下文。该任务跨越不同的设置,在表格表示、问题/答案复杂性、涉及的模式和领域方面各不相同。虽然大型语言模型(LLM)的最新进展使TQA取得了实质性进展,但该领域仍然缺乏对任务表述、核心挑战和方法趋势的系统组织和理解,特别是考虑到强化学习等新兴研究方向。本调查通过提供全面且结构化的 TQA 研究概述来弥补这一差距,重点关注基于 LLM 的方法。我们提供现有基准和任务设置的全面分类。我们根据当前的建模策略所针对的挑战对其进行分组,并分析其优势和局限性。此外,我们强调了先前研究中未系统涵盖的未充分探索但及时的主题。通过统一不同的研究线索并识别悬而未决的问题,我们的调查为 TQA 社区提供了巩固的基础,使人们能够更深入地了解最新技术并指导这个快速发展领域的未来发展。
A Multi-Component Reward Function with Policy Gradient for Automated Feature Selection with Dynamic Regularization and Bias Mitigation
一种具有策略梯度的多分量奖励函数,用于具有动态正则化和偏差缓解的自动特征选择
- Authors: Sudip Khadka, L.S. Paudel
- Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Machine Learning (stat.ML)
- Arxiv link: https://arxiv.org/abs/2510.09705
- Pdf link: https://arxiv.org/pdf/2510.09705
- Abstract
Static feature exclusion strategies often fail to prevent bias when hidden dependencies influence the model predictions. To address this issue, we explore a reinforcement learning (RL) framework that integrates bias mitigation and automated feature selection within a single learning process. Unlike traditional heuristic-driven filter or wrapper approaches, our RL agent adaptively selects features using a reward signal that explicitly integrates predictive performance with fairness considerations. This dynamic formulation allows the model to balance generalization, accuracy, and equity throughout the training process, rather than rely exclusively on pre-processing adjustments or post hoc correction mechanisms. In this paper, we describe the construction of a multi-component reward function, the specification of the agent's action space over feature subsets, and the integration of this system with ensemble learning. We aim to provide a flexible and generalizable way to select features in environments where predictors are correlated and biases can inadvertently re-emerge.
- 中文摘要
当隐藏的依赖关系影响模型预测时,静态特征排除策略通常无法防止偏差。为了解决这个问题,我们探索了一种强化学习 (RL) 框架,该框架将偏差缓解和自动特征选择集成到单个学习过程中。与传统的启发式驱动过滤器或包装器方法不同,我们的 RL 代理使用奖励信号自适应地选择特征,该信号明确地将预测性能与公平性考虑相结合。这种动态公式允许模型在整个训练过程中平衡泛化、准确性和公平性,而不是完全依赖预处理调整或事后校正机制。在本文中,我们描述了多分量奖励函数的构建、特征子集上的智能体动作空间的规范,以及该系统与集成学习的集成。我们的目标是提供一种灵活且可推广的方法,在预测变量相关且偏差可能无意中重新出现的环境中选择特征。
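The abstract above describes a reward that combines predictive performance with fairness but gives no formula. The following is a minimal sketch of what such a multi-component reward could look like. The demographic-parity gap as the fairness term, the sparsity penalty, and all weights and function names are assumptions made for illustration, not the authors' definitions.

```python
from typing import Sequence


def demographic_parity_gap(preds: Sequence[int], groups: Sequence[int]) -> float:
    """Absolute difference in positive-prediction rate between group 0 and group 1."""
    rates = []
    for g in (0, 1):
        members = [p for p, grp in zip(preds, groups) if grp == g]
        rates.append(sum(members) / len(members) if members else 0.0)
    return abs(rates[0] - rates[1])


def selection_reward(
    accuracy: float,
    parity_gap: float,
    n_selected: int,
    n_total: int,
    w_acc: float = 1.0,    # hypothetical weight on predictive performance
    w_fair: float = 1.0,   # hypothetical weight on the fairness penalty
    w_sparse: float = 0.1, # hypothetical weight on the subset-size penalty
) -> float:
    """Reward for one candidate feature subset: accuracy minus fairness and size penalties."""
    sparsity_penalty = n_selected / n_total
    return w_acc * accuracy - w_fair * parity_gap - w_sparse * sparsity_penalty


if __name__ == "__main__":
    gap = demographic_parity_gap([1, 1, 0, 0], [0, 0, 1, 1])
    print(gap, selection_reward(accuracy=0.9, parity_gap=0.2, n_selected=5, n_total=10))
```

An RL agent choosing feature subsets would receive this scalar after training and evaluating a model on the selected features, so a subset that reintroduces bias through correlated predictors is penalized even when its accuracy is high.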
ARROW: An Adaptive Rollout and Routing Method for Global Weather Forecasting
ARROW:一种用于全球天气预报的自适应滚动预测与路由方法
- Authors: Jindong Tian, Yifei Ding, Ronghui Xu, Hao Miao, Chenjuan Guo, Bin Yang
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.09734
- Pdf link: https://arxiv.org/pdf/2510.09734
- Abstract
Weather forecasting is a fundamental task in spatiotemporal data analysis, with broad applications across a wide range of domains. Existing data-driven forecasting methods typically model atmospheric dynamics over a fixed short time interval (e.g., 6 hours) and rely on naive autoregression-based rollout for long-term forecasting (e.g., 138 hours). However, this paradigm suffers from two key limitations: (1) it often inadequately models the spatial and multi-scale temporal dependencies inherent in global weather systems, and (2) the rollout strategy struggles to balance error accumulation with the capture of fine-grained atmospheric variations. In this study, we propose ARROW, an Adaptive-Rollout Multi-scale temporal Routing method for Global Weather Forecasting. To contend with the first limitation, we construct a multi-interval forecasting model that forecasts weather across different time intervals. Within the model, the Shared-Private Mixture-of-Experts captures both shared patterns and specific characteristics of atmospheric dynamics across different time scales, while Ring Positional Encoding accurately encodes the circular latitude structure of the Earth when representing spatial information. For the second limitation, we develop an adaptive rollout scheduler based on reinforcement learning, which selects the most suitable time interval to forecast according to the current weather state. Experimental results demonstrate that ARROW achieves state-of-the-art performance in global weather forecasting, establishing a promising paradigm in this field.
- 中文摘要
天气预报是时空数据分析的一项基础任务,在众多领域都有广泛的应用。现有的数据驱动预测方法通常在固定的短时间间隔(例如 6 小时)内对大气动力学进行建模,并依靠朴素的自回归滚动预测进行长期预测(例如 138 小时)。然而,这种范式有两个关键局限性:(1)它经常不能充分模拟全球天气系统固有的空间和多尺度时间依赖关系,以及(2)滚动预测策略难以平衡误差累积与捕获细粒度大气变化。在这项研究中,我们提出了一种用于全球天气预报的自适应滚动多尺度时间路由方法ARROW。为了应对第一个限制,我们构建了一个多区间预报模型,用于预测不同时间间隔的天气。在模型中,共享-私有混合专家捕获不同时间尺度上大气动力学的共享模式和特定特征,而环形位置编码在表示空间信息时准确编码地球的环形纬度结构。针对第二个限制,我们开发了一种基于强化学习的自适应滚动调度器,它根据当前天气状态选择最合适的时间间隔进行预测。实验结果表明,ARROW在全球天气预报中取得了最先进的性能,在该领域建立了有前途的范式。
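To make the contrast between a fixed-interval rollout and an adaptive one concrete, here is a small sketch. The learned RL scheduler from the abstract is replaced by a hand-written stand-in (`largest_fitting`), and `toy_forecast` is a placeholder for a multi-interval forecasting model. Both are assumptions for illustration only.

```python
from typing import Callable, List, Tuple


def adaptive_rollout(
    state: float,
    horizon_hours: int,
    forecast: Callable[[float, int], float],
    choose_interval: Callable[[float, int], int],
) -> Tuple[float, List[int]]:
    """Roll a forecast forward to `horizon_hours`, letting a scheduler pick each step size."""
    elapsed = 0
    used: List[int] = []
    while elapsed < horizon_hours:
        remaining = horizon_hours - elapsed
        dt = choose_interval(state, remaining)
        if dt <= 0 or dt > remaining:
            raise ValueError(f"scheduler returned an invalid interval: {dt}")
        state = forecast(state, dt)  # one model call per chosen interval
        elapsed += dt
        used.append(dt)
    return state, used


def toy_forecast(state: float, dt: int) -> float:
    """Placeholder dynamics; a real model would map a weather state dt hours ahead."""
    return state + 0.1 * dt


def fixed_6h(state: float, remaining: int) -> int:
    """Naive autoregressive rollout: always step 6 hours."""
    return 6


def largest_fitting(state: float, remaining: int) -> int:
    """Stand-in scheduler: take the largest supported interval that still fits."""
    for dt in (24, 12, 6):
        if dt <= remaining:
            return dt
    raise ValueError("horizon is not reachable with the supported intervals")


if __name__ == "__main__":
    _, naive_steps = adaptive_rollout(0.0, 138, toy_forecast, fixed_6h)
    _, adaptive_steps = adaptive_rollout(0.0, 138, toy_forecast, largest_fitting)
    print(len(naive_steps), adaptive_steps)
```

Fewer model calls mean fewer opportunities for error to accumulate, while shorter steps capture finer variations. A learned scheduler would trade these off based on the current state rather than always preferring the largest interval.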
WARC-Bench: Web Archive Based Benchmark for GUI Subtask Executions
WARC-Bench:基于 Web 存档的 GUI 子任务执行基准测试
- Authors: Sanjari Srivastava, Gang Li, Cheng Chang, Rishu Garg, Manpreet Kaur, Charlene Y. Lee, Yuezhang Li, Yining Mao, Ignacio Cases, Yanan Xie, Peng Qi
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.09872
- Pdf link: https://arxiv.org/pdf/2510.09872
- Abstract
Training web agents to navigate complex, real-world websites requires them to master $\textit{subtasks}$ - short-horizon interactions on multiple UI components (e.g., choosing the correct date in a date picker, or scrolling in a container to extract information). We introduce WARC-Bench (Web Archive Benchmark), a novel web navigation benchmark featuring 438 tasks designed to evaluate multimodal AI agents on subtasks. WARC-Bench enables sandboxed interactions with dynamic and realistic webpages using Web ARChive files. We show that WARC-Bench is challenging for leading computer-use models, with the highest observed success rate being 64.8%. To improve open source models on subtask, we explore two common training techniques: supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR). Experiments show that SFT models obtain a 48.8% success rate on the benchmark. Training with RLVR over SFT checkpoints, even in data-scarce settings, improves the score to 52.8% on WARC-Bench, outperforming many frontier models. Our analysis concludes that mastering these subtasks is essential for robust web planning and navigation, and is a capability not extensively evaluated by existing benchmarks.
- 中文摘要
训练 Web 代理浏览复杂的真实网站需要他们掌握 $\textit{subtasks}$ - 多个 UI 组件上的短期交互(例如,在日期选择器中选择正确的日期,或在容器中滚动以提取信息)。我们介绍了 WARC-Bench(Web Archive Benchmark),这是一种新颖的 Web 导航基准测试,具有 438 个任务,旨在评估子任务上的多模态 AI 代理。WARC-Bench 使用 Web ARChive 文件支持与动态和逼真的网页进行沙盒交互。我们表明,WARC-Bench 对于领先的计算机使用模型具有挑战性,观察到的最高成功率为 64.8%。为了改进子任务上的开源模型,我们探索了两种常见的训练技术:监督微调(SFT)和具有可验证奖励的强化学习(RLVR)。实验表明,SFT 模型在基准测试上获得了 48.8% 的成功率。即使在数据稀缺的环境中,使用 RLVR 进行 SFT 检查点训练,也能将 WARC-Bench 上的分数提高到 52.8%,优于许多前沿模型。我们的分析得出的结论是,掌握这些子任务对于稳健的网络规划和导航至关重要,并且是现有基准测试尚未广泛评估的能力。
Abductive Preference Learning
溯因偏好学习
- Authors: Yijin Ni, Peng Qi
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2510.09887
- Pdf link: https://arxiv.org/pdf/2510.09887
- Abstract
Frontier large language models such as GPT-5 and Claude Sonnet remain prone to overconfidence even after alignment through Reinforcement Learning with Human Feedback (RLHF) and Direct Preference Optimization (DPO). For instance, they tend to offer the same conservative answer "No" to both questions "Can I eat the [food / potato chips] that has been left out overnight?" despite the latter requiring no refrigeration for safe consumption. We find that this failure is potentially attributed to a limitation of existing preference learning: it emphasizes selecting the correct response for a given prompt, while neglecting counterfactual prompts that should alter the response. To address this limitation, we propose abductive preference learning, a fine-tuning paradigm that reverses the conventional conditioning by learning preferences over prompts given a response. To validate this idea, we construct an abductive dataset derived from the HaluEval QA benchmark with 1,001 entries, implementing abductive DPO and its variant DPOP. Experiments reveal complementary strengths: standard methods improve response selection, abductive methods improve prompt discrimination, while a multitask objective unifies both. On the abductive dataset, multitask DPOP boosts accuracy from $90.0\%$ to $99.5\%$ in response selection and $54.7\%$ to $85.0\%$ in prompt discrimination, with qualitative evidence highlighting improved sensitivity to prompt differences. Finally, evaluation on AlpacaEval shows multitask DPOP improves win rate (from $5.26\%$ to $6.17\%$), confirming that abductive preference learning preserves the benefits of conventional preference optimization while addressing the overlooked challenge of counterfactual prompts.
- 中文摘要
GPT-5 和 Claude Sonnet 等前沿大型语言模型即使在通过人类反馈强化学习 (RLHF) 和直接偏好优化 (DPO) 进行对齐后,仍然容易过度自信。例如,对于“我可以吃放了一夜的 [食物/薯片] 吗?”这两个问题,它们倾向于给出相同的保守答案“不可以”,尽管后者无需冷藏即可安全食用。我们发现,这种失败可能归因于现有偏好学习的局限性:它强调为给定的提示选择正确的响应,而忽略了本应改变响应的反事实提示。为了解决这一局限,我们提出了溯因偏好学习,这是一种微调范式,它通过在给定响应的条件下学习对提示的偏好来反转传统的条件方向。为了验证这个想法,我们基于 HaluEval QA 基准构建了一个包含 1,001 个条目的溯因数据集,并实现了溯因 DPO 及其变体 DPOP。实验揭示了互补的优势:标准方法改进了响应选择,溯因方法改进了提示辨别,而多任务目标则统一了两者。在溯因数据集上,多任务 DPOP 将响应选择的准确率从 $90.0\%$ 提高到 $99.5\%$,将提示辨别的准确率从 $54.7\%$ 提高到 $85.0\%$,定性证据也表明模型对提示差异的敏感性有所提高。最后,在 AlpacaEval 上的评估表明,多任务 DPOP 提高了胜率(从 $5.26\%$ 到 $6.17\%$),证实了溯因偏好学习在保留传统偏好优化优点的同时,解决了被忽视的反事实提示挑战。
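The reversal described in the abstract can be illustrated with the standard DPO loss applied to two kinds of pairs. In the sketch below, `standard_dpo` compares two responses under one prompt, while `abductive_dpo` compares two prompts under one response. The abductive form is one reading of "learning preferences over prompts given a response" and is an assumption, as are the function names and the toy log-probabilities; the paper's exact objective (and the DPOP variant) may differ.

```python
import math
from typing import Callable

LogProb = Callable[[str, str], float]  # (prompt, response) -> log pi(response | prompt)


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_loss(pol_pref: float, ref_pref: float, pol_rej: float, ref_rej: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss on one pair: -log sigmoid(beta * difference of log-ratio margins)."""
    margin = beta * ((pol_pref - ref_pref) - (pol_rej - ref_rej))
    return softplus(-margin)


def standard_dpo(logp: LogProb, ref: LogProb, prompt: str,
                 chosen: str, rejected: str, beta: float = 0.1) -> float:
    """Fixed prompt, preference over two responses."""
    return dpo_loss(logp(prompt, chosen), ref(prompt, chosen),
                    logp(prompt, rejected), ref(prompt, rejected), beta)


def abductive_dpo(logp: LogProb, ref: LogProb, response: str,
                  fitting_prompt: str, counterfactual_prompt: str,
                  beta: float = 0.1) -> float:
    """Fixed response, preference over two prompts (assumed form of the reversal)."""
    return dpo_loss(logp(fitting_prompt, response), ref(fitting_prompt, response),
                    logp(counterfactual_prompt, response), ref(counterfactual_prompt, response),
                    beta)


if __name__ == "__main__":
    # Toy log-probabilities: the policy already prefers "yes" for chips over generic food.
    table = {("chips overnight?", "yes"): -0.5, ("food overnight?", "yes"): -2.0}
    policy = lambda x, y: table[(x, y)]
    reference = lambda x, y: -1.0
    print(abductive_dpo(policy, reference, "yes", "chips overnight?", "food overnight?"))
```

A multitask objective in the spirit of the abstract would sum the two losses, possibly with a weighting coefficient, so that the model is trained both to pick the right response and to notice when a prompt change should alter it.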
Structured Cooperative Multi-Agent Reinforcement Learning: a Bayesian Network Perspective
结构化协同多智能体强化学习:贝叶斯网络视角
- Authors: Shahbaz P Qadri Syed, He Bai
- Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC); Machine Learning (stat.ML)
- Arxiv link: https://arxiv.org/abs/2510.09937
- Pdf link: https://arxiv.org/pdf/2510.09937
- Abstract
The empirical success of multi-agent reinforcement learning (MARL) has motivated the search for more efficient and scalable algorithms for large scale multi-agent systems. However, existing state-of-the-art algorithms do not fully exploit inter-agent coupling information to develop MARL algorithms. In this paper, we propose a systematic approach to leverage structures in the inter-agent couplings for efficient model-free reinforcement learning. We model the cooperative MARL problem via a Bayesian network and characterize the subset of agents, termed as the value dependency set, whose information is required by each agent to estimate its local action value function exactly. Moreover, we propose a partially decentralized training decentralized execution (P-DTDE) paradigm based on the value dependency set. We theoretically establish that the total variance of our P-DTDE policy gradient estimator is less than the centralized training decentralized execution (CTDE) policy gradient estimator. We derive a multi-agent policy gradient theorem based on the P-DTDE scheme and develop a scalable actor-critic algorithm. We demonstrate the efficiency and scalability of the proposed algorithm on multi-warehouse resource allocation and multi-zone temperature control examples. For dense value dependency sets, we propose an approximation scheme based on truncation of the Bayesian network and empirically show that it achieves a faster convergence than the exact value dependence set for applications with a large number of agents.
- 中文摘要
多智能体强化学习 (MARL) 的实证成功促使人们为大规模多智能体系统寻找更高效、更可扩展的算法。然而,现有的最先进的算法并未充分利用智能体间耦合信息来开发MARL算法。在本文中,我们提出了一种系统方法来利用智能体间耦合中的结构进行高效的无模型强化学习。我们通过贝叶斯网络对协作 MARL 问题进行建模,并表征智能体的子集,称为值依赖集,每个智能体都需要其信息来精确估计其局部动作值函数。此外,我们提出了一种基于价值依赖集的部分去中心化训练去中心化执行(P-DTDE)范式。我们从理论上确定,我们的 P-DTDE 策略梯度估计器的总方差小于集中训练分散执行 (CTDE) 策略梯度估计器。我们推导了基于 P-DTDE 方案的多智能体策略梯度定理,并开发了一种可扩展的 Actor-Critic 算法。本文通过多仓库资源分配和多区域温控示例展示了所提算法的效率和可扩展性。对于密集值依赖集,我们提出了一种基于贝叶斯网络截断的近似方案,并凭经验表明,对于具有大量代理的应用程序,它比精确值依赖集实现了更快的收敛。
Reinforcement Fine-Tuning of Flow-Matching Policies for Vision-Language-Action Models
视觉-语言-行动模型流匹配策略的强化微调
- Authors: Mingyang Lyu, Yinqian Sun, Erliang Lin, Huangrui Li, Ruolin Chen, Feifei Zhao, Yi Zeng
- Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2510.09976
- Pdf link: https://arxiv.org/pdf/2510.09976
- Abstract
Vision-Language-Action (VLA) models such as OpenVLA, Octo, and $\pi_0$ have shown strong generalization by leveraging large-scale demonstrations, yet their performance is still fundamentally constrained by the quality and coverage of supervised data. Reinforcement learning (RL) provides a promising path for improving and fine-tuning VLAs through online interaction. However, conventional policy gradient methods are computationally infeasible in the context of flow-matching based models due to the intractability of the importance sampling process, which requires explicit computation of policy ratios. To overcome this limitation, we propose Flow Policy Optimization (FPO) algorithm, which reformulates importance sampling by leveraging per-sample changes in the conditional flow-matching objective. Furthermore, FPO achieves stable and scalable online reinforcement fine-tuning of the $\pi_0$ model by integrating structure-aware credit assignment to enhance gradient efficiency, clipped surrogate objectives to stabilize optimization, multi-step latent exploration to encourage diverse policy updates, and a Q-ensemble mechanism to provide robust value estimation. We evaluate FPO on the LIBERO benchmark and the ALOHA simulation task against supervised, preference-aligned, diffusion-based, autoregressive online RL, and $\pi_0$-FAST baselines, observing consistent improvements over the imitation prior and strong alternatives with stable learning under sparse rewards. In addition, ablation studies and analyses of the latent space dynamics further highlight the contributions of individual components within FPO, validating the effectiveness of the proposed computational modules and the stable convergence of the conditional flow-matching objective during online RL.
- 中文摘要
OpenVLA、Octo 和 $\pi_0$ 等视觉-语言-行动 (VLA) 模型通过利用大规模演示表现出了很强的泛化性,但其性能仍然从根本上受到监督数据质量和覆盖范围的制约。强化学习(RL)为通过在线交互改进和微调VLA提供了一条有前途的途径。然而,由于重要性采样过程需要显式计算策略比率而难以处理,传统的策略梯度方法在基于流匹配的模型中在计算上不可行。为了克服这一限制,我们提出了流策略优化(FPO)算法,该算法利用条件流匹配目标中每个样本的变化来重新表述重要性采样。此外,FPO通过整合结构感知信用分配以增强梯度效率、裁剪代理目标以稳定优化、多步潜在探索以鼓励多样化策略更新以及Q-ensemble机制以提供鲁棒的价值估计,实现了对$\pi_0$模型稳定且可扩展的在线强化微调。我们在 LIBERO 基准和 ALOHA 模拟任务上,将 FPO 与监督、偏好对齐、基于扩散、自回归在线 RL 以及 $\pi_0$-FAST 基线进行比较,观察到它相对于模仿先验和强替代方案的持续改进,并且在稀疏奖励下学习稳定。此外,消融研究和对潜在空间动力学的分析进一步凸显了FPO中各个组件的贡献,验证了所提出的计算模块的有效性以及在线RL期间条件流匹配目标的稳定收敛。
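The abstract does not state how the importance ratio is rebuilt from the flow-matching objective. One plausible reading, shown here purely as an assumption, is to use the exponentiated drop in per-sample conditional flow-matching loss as a stand-in for the policy ratio and feed it into a PPO-style clipped surrogate. The function names and the clipping range are hypothetical.

```python
import math


def fpo_ratio(cfm_loss_old: float, cfm_loss_new: float) -> float:
    """Ratio proxy: exp(old loss - new loss) on the same sample (assumed form).

    A lower flow-matching loss under the new parameters yields a ratio above 1,
    mirroring a higher likelihood for that action.
    """
    return math.exp(cfm_loss_old - cfm_loss_new)


def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO-style clipped objective for one sample (to be maximized)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    r = fpo_ratio(cfm_loss_old=0.80, cfm_loss_new=0.50)
    print(r, clipped_surrogate(r, advantage=1.0))
```

The appeal of such a proxy is that it needs only loss evaluations on sampled noise and time steps, avoiding the explicit likelihood computation that the abstract calls intractable.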
ATRos: Learning Energy-Efficient Agile Locomotion for Wheeled-legged Robots
ATRos:为轮腿机器人学习节能敏捷运动
- Authors: Jingyuan Sun, Hongyu Ji, Zihan Qu, Chaoran Wang, Mingyu Zhang
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2510.09980
- Pdf link: https://arxiv.org/pdf/2510.09980
- Abstract
Hybrid locomotion of wheeled-legged robots has recently attracted increasing attention because it combines the agility of legged locomotion with the efficiency of wheeled motion. But along with expanded performance, the whole-body control of wheeled-legged robots remains challenging for hybrid locomotion. In this paper, we present ATRos, a reinforcement learning (RL)-based hybrid locomotion framework to achieve hybrid walking-driving motions on the wheeled-legged robot. Without giving predefined gait patterns, our planner aims to intelligently coordinate simultaneous wheel and leg movements, thereby achieving improved terrain adaptability and energy efficiency. Based on RL techniques, our approach constructs a prediction policy network that estimates external environmental states from proprioceptive sensory information, and the outputs are then fed into an actor-critic network to produce optimal joint commands. The feasibility of the proposed framework is validated through both simulations and real-world experiments across diverse terrains, including flat ground, stairs, and grassy surfaces. The hybrid locomotion framework shows robust performance over various unseen terrains, highlighting its generalization capability.
- 中文摘要
轮腿机器人的混合运动因其结合了腿运动的敏捷性和轮式运动的效率的优势,近年来越来越受到关注。但随着性能的扩展,轮腿机器人的全身控制对于混合运动来说仍然具有挑战性。在本文中,我们提出了ATRos,这是一种基于强化学习(RL)的混合运动框架,用于在轮腿机器人上实现混合行走-驾驶运动。我们的规划器旨在智能地协调车轮和腿部同时运动,从而提高地形适应性和能源效率。基于RL技术,我们的方法构建了一个预测策略网络,该网络可以从本体感觉信息中估计外部环境状态,然后将输出输入到参与者批评网络中,以产生最优的联合命令。所提出框架的可行性通过不同地形(包括平坦地面、楼梯和草地表面)的模拟和真实实验得到验证。混合运动框架在各种看不见的地形上表现出强大的性能,凸显了其泛化能力。
RIPRAG: Hack a Black-box Retrieval-Augmented Generation Question-Answering System with Reinforcement Learning
RIPRAG:利用强化学习攻破黑盒检索增强生成问答系统
- Authors: Meng Xi, Sihan Lv, Yechen Jin, Guanjie Cheng, Naibo Wang, Ying Li, Jianwei Yin
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.10008
- Pdf link: https://arxiv.org/pdf/2510.10008
- Abstract
Retrieval-Augmented Generation (RAG) systems based on Large Language Models (LLMs) have become a core technology for tasks such as question-answering (QA) and content generation. However, by injecting poisoned documents into the database of RAG systems, attackers can manipulate LLMs to generate text that aligns with their intended preferences. Existing research has primarily focused on white-box attacks against simplified RAG architectures. In this paper, we investigate a more complex and realistic scenario: the attacker lacks knowledge of the RAG system's internal composition and implementation details, and the RAG system comprises components beyond a mere retriever. Specifically, we propose the RIPRAG attack framework, an end-to-end attack pipeline that treats the target RAG system as a black box, where the only information accessible to the attacker is whether the poisoning succeeds. Our method leverages Reinforcement Learning (RL) to optimize the generation model for poisoned documents, ensuring that the generated poisoned document aligns with the target RAG system's preferences. Experimental results demonstrate that this method can effectively execute poisoning attacks against most complex RAG systems, achieving an attack success rate (ASR) improvement of up to 0.72 compared to baseline methods. This highlights prevalent deficiencies in current defensive methods and provides critical insights for LLM security research.
- 中文摘要
基于大型语言模型(LLM)的检索增强生成(RAG)系统已成为问答(QA)和内容生成等任务的核心技术。然而,通过将投毒文档注入 RAG 系统的数据库,攻击者可以操纵 LLM 生成符合其预期偏好的文本。现有研究主要集中在针对简化 RAG 架构的白盒攻击上。在本文中,我们研究了一个更复杂、更现实的场景:攻击者缺乏对 RAG 系统的内部组成和实现细节的了解,而 RAG 系统包含的组件不仅仅是一个检索器。具体来说,我们提出了 RIPRAG 攻击框架,这是一个端到端的攻击管道,它将目标 RAG 系统视为一个黑盒,攻击者唯一可以访问的信息是投毒是否成功。我们的方法利用强化学习(RL)来优化投毒文档的生成模型,确保生成的投毒文档符合目标 RAG 系统的偏好。实验结果表明,该方法能够有效地对大多数复杂的 RAG 系统执行投毒攻击,与基线方法相比,攻击成功率(ASR)最多提高 0.72。这凸显了当前防御方法的普遍缺陷,并为 LLM 安全研究提供了重要的见解。
Beyond the limitation of a single query: Train your LLM for query expansion with Reinforcement Learning
超越单个查询的限制:使用强化学习训练 LLM 以进行查询扩展
- Authors: Shu Zhao, Tan Yu, Anbang Xu
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
- Arxiv link: https://arxiv.org/abs/2510.10009
- Pdf link: https://arxiv.org/pdf/2510.10009
- Abstract
Reasoning-augmented search agents, such as Search-R1, are trained to reason, search, and generate the final answer iteratively. Nevertheless, due to their limited capabilities in reasoning and search, their performance on multi-hop QA benchmarks remains far from satisfactory. To handle complex or compound queries, we train an LLM-based search agent with the native capability of query expansion through reinforcement learning. In each turn, our search agent proposes several query variants, which are searched simultaneously to cover more relevant information. Meanwhile, given limited post-training data and computing resources, it is very challenging for a search agent to master multiple tasks, including query generation, retrieved information understanding, and answer generation. Therefore, we propose incorporating a pre-trained squeezer model that helps the search agent understand the retrieved documents, allowing the search agent to focus on query generation for high retrieval recall. With the assistance of the squeezer model, we discover that even a small-scale 3B LLM can demonstrate a strong capability of query expansion and achieve state-of-the-art accuracy on the multi-hop QA benchmarks. To be specific, our experiments across seven question-answering benchmarks demonstrate that our method, named ExpandSearch, achieves an average improvement of 4.4% compared to state-of-the-art baselines, with strong gains on multi-hop reasoning tasks requiring diverse evidence aggregation.
- 中文摘要
推理增强搜索代理(例如 Search-R1)经过训练,可以迭代地推理、搜索和生成最终答案。然而,由于它们在推理和搜索方面的能力有限,它们在多跳 QA 基准测试上的性能仍然远未令人满意。为了处理复杂或复合的查询,我们训练了一个基于 LLM 的搜索代理,使其具有通过强化学习进行查询扩展的原生能力。在每个回合中,我们的搜索代理都会提出多个查询变体,同时搜索这些变体以涵盖更多相关信息。同时,由于训练后数据和计算资源有限,搜索代理要掌握查询生成、检索到的信息理解和答案生成等多项任务非常具有挑战性。因此,我们建议加入一个预训练的挤压器模型,帮助搜索代理理解检索到的文档,使搜索代理能够专注于查询生成,以实现高检索召回率。在挤压器模型的帮助下,我们发现即使是小规模的 3B LLM 也可以展示强大的查询扩展能力,并在多跳 QA 基准测试中实现最先进的准确性。具体来说,我们在七个问答基准测试中的实验表明,与最先进的基线相比,我们名为 ExpandSearch 的方法平均提高了 4.4%,在需要多样化证据聚合的多跳推理任务上取得了强劲的进步。
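The core retrieval idea in the abstract, issuing several query variants per turn and pooling what they return, can be sketched without any model at all. The toy corpus, the word-overlap retriever, and the function names below are all hypothetical stand-ins; in the described system the variants come from an RL-trained LLM and a separate squeezer model condenses the retrieved documents.

```python
from typing import Dict, List

# Hypothetical in-memory corpus used only for this illustration.
CORPUS: Dict[str, str] = {
    "d1": "Marie Curie was born in Warsaw",
    "d2": "Warsaw is the capital of Poland",
    "d3": "Paris is the capital of France",
}


def search(query: str, k: int = 1) -> List[str]:
    """Toy retriever: rank documents by word overlap with the query, keep the top k hits."""
    q = set(query.lower().split())
    overlap = {doc_id: len(q & set(text.lower().split())) for doc_id, text in CORPUS.items()}
    ranked = sorted(CORPUS, key=lambda doc_id: (-overlap[doc_id], doc_id))
    return [doc_id for doc_id in ranked[:k] if overlap[doc_id] > 0]


def expanded_search(variants: List[str], k: int = 1) -> List[str]:
    """Run every query variant and merge the results, dropping duplicates in order."""
    merged: List[str] = []
    for variant in variants:
        for doc_id in search(variant, k):
            if doc_id not in merged:
                merged.append(doc_id)
    return merged


if __name__ == "__main__":
    print(search("where was marie curie born"))
    print(expanded_search(["marie curie born", "capital of poland warsaw"]))
```

The single query reaches only the first hop, while the two variants together surface both documents needed for a two-hop question. The variants are looped over sequentially here for simplicity; a real system would likely issue them concurrently.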
Efficient Onboard Vision-Language Inference in UAV-Enabled Low-Altitude Economy Networks via LLM-Enhanced Optimization
通过LLM增强优化在无人机支持的低空经济网络中实现高效的机载视觉语言推理
- Authors: Yang Li, Ruichen Zhang, Yinqiu Liu, Guangyuan Liu, Dusit Niyato, Abbas Jamalipour, Xianbin Wang, Dong In Kim
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
- Arxiv link: https://arxiv.org/abs/2510.10028
- Pdf link: https://arxiv.org/pdf/2510.10028
- Abstract
The rapid advancement of Low-Altitude Economy Networks (LAENets) has enabled a variety of applications, including aerial surveillance, environmental sensing, and semantic data collection. To support these scenarios, unmanned aerial vehicles (UAVs) equipped with onboard vision-language models (VLMs) offer a promising solution for real-time multimodal inference. However, ensuring both inference accuracy and communication efficiency remains a significant challenge due to limited onboard resources and dynamic network conditions. In this paper, we first propose a UAV-enabled LAENet system model that jointly captures UAV mobility, user-UAV communication, and the onboard visual question answering (VQA) pipeline. Based on this model, we formulate a mixed-integer non-convex optimization problem to minimize task latency and power consumption under user-specific accuracy constraints. To solve the problem, we design a hierarchical optimization framework composed of two parts: (i) an Alternating Resolution and Power Optimization (ARPO) algorithm for resource allocation under accuracy constraints, and (ii) a Large Language Model-augmented Reinforcement Learning Approach (LLaRA) for adaptive UAV trajectory optimization. The large language model (LLM) serves as an expert in refining reward design of reinforcement learning in an offline fashion, introducing no additional latency in real-time decision-making. Numerical results demonstrate the efficacy of our proposed framework in improving inference performance and communication efficiency under dynamic LAENet conditions.
- 中文摘要
低空经济网络 (LAENets) 的快速发展使得各种应用成为可能,包括空中监视、环境感知和语义数据收集。为了支持这些场景,配备机载视觉语言模型 (VLM) 的无人机 (UAV) 为实时多模态推理提供了有前途的解决方案。然而,由于板载资源有限和网络条件动态,确保推理精度和通信效率仍然是一项重大挑战。在本文中,我们首先提出了一种支持无人机的 LAENet 系统模型,该模型联合捕获了无人机的移动性、用户与无人机的通信和机载视觉问答 (VQA) 管道。基于该模型,我们提出了一个混合整数非凸优化问题,以在用户特定的精度约束下最大限度地减少任务延迟和功耗。为了解决这个问题,我们设计了一个分层优化框架,由两部分组成:(i)用于精度约束下资源分配的交替分辨率和功率优化(ARPO)算法,以及(ii)用于自适应无人机轨迹优化的大型语言模型增强强化学习方法(LLaRA)。大型语言模型 (LLM) 是以离线方式完善强化学习奖励设计的专家,在实时决策中不会引入额外的延迟。数值结果证明了我们提出的框架在动态 LAENet 条件下提高推理性能和通信效率方面的功效。
Experience-Efficient Model-Free Deep Reinforcement Learning Using Pre-Training
使用预训练实现经验高效的无模型深度强化学习
- Authors: Ruoxing Yang
- Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
- Arxiv link: https://arxiv.org/abs/2510.10029
- Pdf link: https://arxiv.org/pdf/2510.10029
- Abstract
We introduce PPOPT - Proximal Policy Optimization using Pretraining, a novel, model-free deep-reinforcement-learning algorithm that leverages pretraining to achieve high training efficiency and stability on very small training samples in physics-based environments. Reinforcement learning agents typically rely on large samples of environment interactions to learn a policy. However, frequent interactions with a (computer-simulated) environment may incur high computational costs, especially when the environment is complex. Our main innovation is a new policy neural network architecture that consists of a pretrained neural network middle section sandwiched between two fully-connected networks. Pretraining part of the network on a different environment with similar physics will help the agent learn the target environment with high efficiency because it will leverage a general understanding of the transferrable physics characteristics from the pretraining environment. We demonstrate that PPOPT outperforms baseline classic PPO on small training samples both in terms of rewards gained and general training stability. While PPOPT underperforms against classic model-based methods such as DYNA DDPG, the model-free nature of PPOPT allows it to train in significantly less time than its model-based counterparts. Finally, we present our implementation of PPOPT as open-source software, available at this http URL.
- 中文摘要
我们介绍了 PPOPT - 使用预训练的近端策略优化,这是一种新颖的、无模型的深度强化学习算法,它利用预训练在基于物理的环境中对非常小的训练样本实现高训练效率和稳定性。强化学习代理通常依赖于环境交互的大量样本来学习策略。然而,与(计算机模拟)环境的频繁交互可能会产生高计算成本,尤其是在环境复杂的情况下。我们的主要创新是一种新的策略神经网络架构,该架构由夹在两个全连接网络之间的预训练神经网络中间部分组成。在具有相似物理场的不同环境中预训练网络的一部分将有助于代理高效地学习目标环境,因为它将利用对预训练环境中可转移物理特性的一般理解。我们证明,在获得的奖励和一般训练稳定性方面,PPOPT 在小训练样本上的表现优于基线经典 PPO。虽然 PPOPT 的表现不如基于模型的经典方法(如 DYNA DDPG),但 PPOPT 的无模型特性使其训练时间比基于模型的同类方法短得多。最后,我们将 PPOPT 作为开源软件的实现呈现,可在此 http URL 上获得。
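The "sandwich" architecture in the abstract can be pictured as three stages: a fresh input network, a pretrained middle section, and a fresh output network. The sketch below shows only that wiring in plain Python. The layer sizes, the ReLU activations, the random "pretrained" weights, and the choice to mark the middle section as non-trainable are assumptions for illustration; the abstract does not say whether the middle is frozen or fine-tuned.

```python
import random
from typing import List


class Dense:
    """Fully connected layer with optional ReLU, written with plain lists."""

    def __init__(self, n_in: int, n_out: int, rng: random.Random,
                 relu: bool = True, trainable: bool = True) -> None:
        self.w = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        self.b = [0.0] * n_out
        self.relu = relu
        self.trainable = trainable

    def __call__(self, x: List[float]) -> List[float]:
        out = [sum(wi * xi for wi, xi in zip(row, x)) + b for row, b in zip(self.w, self.b)]
        return [max(0.0, v) for v in out] if self.relu else out


class SandwichPolicy:
    """New input net -> pretrained middle section -> new output net."""

    def __init__(self, obs_dim: int, act_dim: int, middle: List[Dense],
                 rng: random.Random) -> None:
        mid_in = len(middle[0].w[0])   # input width expected by the middle section
        mid_out = len(middle[-1].w)    # output width produced by the middle section
        self.input_net = Dense(obs_dim, mid_in, rng)
        self.middle = middle           # reused as-is from the pretraining environment
        self.output_net = Dense(mid_out, act_dim, rng, relu=False)

    def forward(self, obs: List[float]) -> List[float]:
        h = self.input_net(obs)
        for layer in self.middle:
            h = layer(h)
        return self.output_net(h)

    def trainable_layers(self) -> List[Dense]:
        layers = [self.input_net, *self.middle, self.output_net]
        return [layer for layer in layers if layer.trainable]


if __name__ == "__main__":
    rng = random.Random(0)
    pretrained = [Dense(8, 8, rng, trainable=False)]  # stands in for weights from another task
    policy = SandwichPolicy(obs_dim=3, act_dim=2, middle=pretrained, rng=rng)
    print(policy.forward([0.1, 0.2, 0.3]), len(policy.trainable_layers()))
```

The outer networks adapt the target environment's observation and action dimensions to the fixed width of the middle section, which is what lets weights learned in a different environment with similar physics be reused.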
Think Twice to See More: Iterative Visual Reasoning in Medical VLMs
多想一步,看见更多:医学 VLM 中的迭代视觉推理
- Authors: Kaitao Chen, Shaohao Rui, Yankai Jiang, Jiamin Wu, Qihao Zheng, Chunfeng Song, Xiaosong Wang, Mu Zhou, Mianxin Liu
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.10052
- Pdf link: https://arxiv.org/pdf/2510.10052
- Abstract
Medical vision-language models (VLMs) excel at image-text understanding but typically rely on single-pass reasoning that neglects localized visual cues. In clinical practice, however, human experts iteratively scan, focus, and refine the regions of interest before reaching a final diagnosis. To narrow this machine-human perception gap, we introduce ViTAR, a novel VLM framework that emulates the iterative reasoning process of human experts through a cognitive chain of "think-act-rethink-answer". ViTAR treats medical images as interactive objects, enabling models to engage in multi-step visual reasoning. To support this approach, we curate a high-quality instruction dataset comprising 1K interactive examples that encode expert-like diagnostic behaviors. In addition, a training set of 16K visual question answering examples has been curated for fine-grained visual diagnosis. We introduce a two-stage training strategy that begins with supervised fine-tuning to guide cognitive trajectories, followed by reinforcement learning to optimize decision-making. Extensive evaluations demonstrate that ViTAR outperforms strong state-of-the-art models. Visual attention analysis reveals that from the "think" to "rethink" rounds, ViTAR increasingly anchors visual grounding to clinically critical regions and maintains high attention allocation to visual tokens during reasoning, providing mechanistic insight into its improved performance. These findings demonstrate that embedding expert-style iterative thinking chains into VLMs enhances both performance and trustworthiness of medical AI.
- 中文摘要
医学视觉语言模型 (VLM) 擅长图像文本理解,但通常依赖于单次推理,而忽略了局部视觉线索。然而,在临床实践中,人类专家在做出最终诊断之前会反复扫描、聚焦和细化感兴趣的区域。为了缩小这种机器与人类的感知差距,我们引入了 ViTAR,这是一种新颖的 VLM 框架,它通过“思考-行动-重新思考-回答”的认知链来模拟人类专家的迭代推理过程。ViTAR 将医学图像视为交互对象,使模型能够进行多步骤视觉推理。为了支持这种方法,我们策划了一个高质量的指令数据集,其中包含对类似专家的诊断行为进行编码的 1K 交互式示例。此外,还针对细粒度视觉诊断策划了 16K 视觉问答训练数据。我们引入了一种两阶段的训练策略,首先是监督微调来指导认知轨迹,然后是强化学习以优化决策。广泛的评估表明,ViTAR 的性能优于强大的最先进模型。视觉注意力分析表明,从“思考”到“重新思考”轮次,ViTAR 越来越多地将视觉基础锚定在临床关键区域,并在推理过程中保持对视觉标记的高注意力分配,从而为其改进的性能提供了机械洞察。这些发现表明,将专家式迭代思维链嵌入到 VLM 中可以增强医疗人工智能的性能和可信度。
One4Many-StablePacker: An Efficient Deep Reinforcement Learning Framework for the 3D Bin Packing Problem
One4Many-StablePacker:一种针对三维装箱问题的高效深度强化学习框架
- Authors: Lei Gao, Shihong Huang, Shengjie Wang, Hong Ma, Feng Zhang, Hengda Bao, Qichang Chen, Weihua Zhou
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2510.10057
- Pdf link: https://arxiv.org/pdf/2510.10057
- Abstract
The three-dimensional bin packing problem (3D-BPP) is widely applied in logistics and warehousing. Existing learning-based approaches often neglect practical stability-related constraints and exhibit limitations in generalizing across diverse bin dimensions. To address these limitations, we propose a novel deep reinforcement learning framework, One4Many-StablePacker (O4M-SP). The primary advantage of O4M-SP is its ability to handle various bin dimensions in a single training process while incorporating support and weight constraints common in practice. Our training method introduces two innovative mechanisms. First, it employs a weighted reward function that integrates loading rate and a new height difference metric for packing layouts, promoting improved bin utilization through flatter packing configurations. Second, it combines clipped policy gradient optimization with a tailored policy drifting method to mitigate policy entropy collapse, encouraging exploration at critical decision nodes during packing to avoid suboptimal solutions. Extensive experiments demonstrate that O4M-SP generalizes successfully across diverse bin dimensions and significantly outperforms baseline methods. Furthermore, O4M-SP exhibits strong practical applicability by effectively addressing packing scenarios with stability constraints.
- 中文摘要
三维装箱问题(3D-BPP)在物流仓储中应用广泛。现有的基于学习的方法往往忽略了实际中与稳定性相关的约束,并且在跨不同箱体尺寸进行泛化方面表现出局限性。为了解决这些局限性,我们提出了一种新颖的深度强化学习框架,即One4Many-StablePacker(O4M-SP)。O4M-SP 的主要优点是它能够在单个训练过程中处理各种箱体尺寸,同时结合实践中常见的支撑和重量约束。我们的训练方法引入了两种创新机制。首先,它采用加权奖励函数,将装载率和一种新的装箱布局高度差指标结合起来,通过更平整的装箱布局提高箱体利用率。其次,它将裁剪策略梯度优化与定制的策略漂移方法相结合,以减轻策略熵崩溃,鼓励在装箱过程中对关键决策节点进行探索,以避免次优解。大量实验表明,O4M-SP 能够成功泛化到不同的箱体尺寸,并且明显优于基线方法。此外,O4M-SP通过有效解决具有稳定性约束的装箱场景,表现出很强的实际适用性。
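As a rough illustration of a weighted reward that combines loading rate with a height-difference term, the sketch below scores a packing layout from its heightmap. The specific height-difference definition (range of column heights divided by bin height), the weights, and the function names are assumptions; the abstract only says that such a metric exists and that flatter layouts are encouraged.

```python
from typing import List


def loading_rate(packed_volume: float, bin_dims: tuple) -> float:
    """Fraction of the bin volume occupied by packed items."""
    length, width, height = bin_dims
    return packed_volume / (length * width * height)


def height_difference(heightmap: List[List[float]], bin_height: float) -> float:
    """Assumed flatness metric: (max column height - min column height) / bin height."""
    cells = [h for row in heightmap for h in row]
    return (max(cells) - min(cells)) / bin_height


def packing_reward(packed_volume: float, heightmap: List[List[float]], bin_dims: tuple,
                   w_load: float = 1.0, w_flat: float = 0.5) -> float:
    """Weighted reward: reward utilization, penalize uneven top surfaces."""
    return (w_load * loading_rate(packed_volume, bin_dims)
            - w_flat * height_difference(heightmap, bin_dims[2]))


if __name__ == "__main__":
    bin_dims = (2, 2, 4)
    flat = [[2, 2], [2, 2]]     # same volume, level surface
    uneven = [[4, 0], [2, 2]]   # same volume, one tall stack and one empty column
    print(packing_reward(8, flat, bin_dims), packing_reward(8, uneven, bin_dims))
```

Both layouts hold the same volume, but the uneven one scores lower, which captures the intuition that a level surface leaves more usable, better-supported space for later items.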
Unilaw-R1: A Large Language Model for Legal Reasoning with Reinforcement Learning and Iterative Inference
Unilaw-R1:一种基于强化学习和迭代推理的法律推理大语言模型
- Authors: Hua Cai, Shuang Zhao, Liang Zhang, Xuli Shen, Qing Xu, Weilin Shen, Zihao Wen, Tianke Ban
- Subjects:
Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2510.10072
- Pdf link: https://arxiv.org/pdf/2510.10072
- Abstract
Reasoning-focused large language models (LLMs) are rapidly evolving across various domains, yet their capabilities in handling complex legal problems remain underexplored. In this paper, we introduce Unilaw-R1, a large language model tailored for legal reasoning. With a lightweight 7-billion parameter scale, Unilaw-R1 significantly reduces deployment cost while effectively tackling three core challenges in the legal domain: insufficient legal knowledge, unreliable reasoning logic, and weak business generalization. To address these issues, we first construct Unilaw-R1-Data, a high-quality dataset containing 17K distilled and screened chain-of-thought (CoT) samples. Based on this, we adopt a two-stage training strategy combining Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), which significantly boosts the performance on complex legal reasoning tasks and supports interpretable decision-making in legal AI applications. To assess legal reasoning ability, we also introduce Unilaw-R1-Eval, a dedicated benchmark designed to evaluate models across single- and multi-choice legal tasks. Unilaw-R1 demonstrates strong results on authoritative benchmarks, outperforming all models of similar scale and achieving performance on par with the much larger DeepSeek-R1-Distill-Qwen-32B (54.9%). Following domain-specific training, it also shows significant gains on LawBench and LexEval, exceeding Qwen-2.5-7B-Instruct (46.6%) by an average margin of 6.6%.
- 中文摘要
以推理为重点的大型语言模型 (LLM) 正在各个领域快速发展,但其处理复杂法律问题的能力仍未得到充分探索。在本文中,我们介绍了 Unilaw-R1,这是一种专为法律推理量身定制的大型语言模型。Unilaw-R1 具有 70 亿参数的轻量级规模,在有效解决法律知识不足、推理逻辑不可靠、业务泛化性弱的三大核心挑战的同时,显著降低了部署成本。为了解决这些问题,我们首先构建了 Unilaw-R1-Data,这是一个包含 17K 蒸馏和筛选的思维链 (CoT) 样本的高质量数据集。基于此,我们采用了监督微调(SFT)和强化学习(RL)相结合的两阶段训练策略,显著提升了复杂法律推理任务的性能,并支持法律人工智能应用中的可解释决策。为了评估法律推理能力,我们还引入了 Unilaw-R1-Eval,这是一个专门的基准,旨在评估单选和多选法律任务的模型。Unilaw-R1 在权威基准测试中表现出强大的结果,优于所有类似规模的模型,并实现了与更大的 DeepSeek-R1-Distill-Qwen-32B (54.9%) 相当的性能。经过特定领域的训练,它在 LawBench 和 LexEval 上也显示出显着的进步,平均超过 Qwen-2.5-7B-Instruct (46.6%) 6.6%。
Answer-Consistent Chain-of-thought Reinforcement Learning For Multi-modal Large Language Models
多模态大语言模型的答案一致思维链强化学习
- Authors: Minbin Huang, Runhui Huang, Chuanyang Zheng, Jingyao Li, Guoxuan Chen, Han Shi, Hong Cheng
- Subjects:
Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2510.10104
- Pdf link: https://arxiv.org/pdf/2510.10104
- Abstract
Recent advances in large language models (LLMs) have demonstrated that reinforcement learning with verifiable rewards (RLVR) can significantly enhance reasoning abilities by directly optimizing correctness, rather than relying solely on supervised imitation. This paradigm has been extended to multimodal LLMs for complex video and image understanding tasks. However, while outcome-driven RL improves answer accuracy, it can inadvertently decouple the reasoning chain from the final answer, leading to situations where the reasoning trace and the final answer are inconsistent. In our experiments on multiple-choice visual question-answering tasks, the standard GRPO method yields only 79.7% consistency on MMVU between the reasoning steps and the chosen answers, indicating frequent mismatches between answers and reasoning. To this end, we propose Answer-Consistent Reinforcement Learning (ACRE) that modifies the GRPO algorithm with an auxiliary consistency check. After the model generates a chain of thought and an initial answer for a given question, we shuffle the answer options and prompt the model again with the same reasoning trace to predict a second answer. We design a consistency-verification reward that grants a high reward only if both the original and the post-shuffle answers agree and are correct; otherwise, a lower reward is assigned accordingly. This mechanism penalizes reasoning-answer misalignment and discourages the model from relying on spurious patterns, such as option ordering biases. We evaluate ACRE on challenging Video Reasoning benchmarks and multimodal math reasoning benchmarks, achieving average improvements of 2.2% and 1.5% on Video Reasoning and Math Reasoning tasks, respectively, over the GRPO baseline.
- 中文摘要
大型语言模型(LLM)的最新进展表明,具有可验证奖励的强化学习(RLVR)可以通过直接优化正确性来显着增强推理能力,而不是仅仅依赖监督模仿。这种范式已扩展到多模态 LLM,用于执行复杂的视频和图像理解任务。然而,虽然结果驱动的 RL 提高了答案的准确性,但它可能会无意中将推理链与最终答案解耦,从而导致模型在推理跟踪和最终答案之间产生不一致的情况。在我们对多项选择视觉问答任务的实验中,标准 GRPO 方法在推理步骤和所选答案之间的 MMVU 一致性仅为 79.7%,这表明答案和推理之间经常不匹配。为此,我们提出了答案一致性强化学习(ACRE),它通过辅助一致性检查来修改GRPO算法。在模型为给定问题生成思维链和初始答案后,我们打乱答案选项,并使用相同的推理轨迹再次提示模型以预测第二个答案。我们设计了一个一致性验证奖励,只有当原始答案和打乱选项后的答案一致并且都正确时,才会授予高奖励;否则,将相应地分配较低的奖励。这种机制惩罚了推理-答案错位,并阻止模型依赖虚假模式,例如选项排序偏差。我们在具有挑战性的视频推理基准和多模态数学推理基准上评估了 ACRE,视频推理和数学推理任务比 GRPO 基线平均提高了 2.2% 和 1.5%。
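The consistency-verification reward in the abstract can be illustrated with a short sketch. The abstract only states that a high reward is granted when both the original and the post-shuffle answers agree and are correct, and that a lower reward is assigned otherwise; the three reward tiers below (`high`, `partial`, `low`) and the comparison by option content rather than option letter are assumptions made for this example.

```python
import random


def shuffle_options(options, rng):
    """Return a shuffled copy of the answer options, leaving the input intact."""
    shuffled = list(options)
    rng.shuffle(shuffled)
    return shuffled


def consistency_reward(first_choice, second_choice, gold,
                       high=1.0, partial=0.5, low=0.0):
    """Score two answers produced from the same reasoning trace.

    Choices are compared by option content, so a reordering of the options
    does not change what counts as correct. The tier values are hypothetical.
    """
    first_ok = first_choice == gold
    second_ok = second_choice == gold
    if first_ok and second_ok:
        return high      # consistent and correct
    if first_ok or second_ok:
        return partial   # correct only under one option ordering
    return low           # wrong in both passes


options = ["cat", "dog", "bird", "fish"]
reordered = shuffle_options(options, random.Random(0))
print(reordered, consistency_reward("cat", "cat", "cat"))
```

A model that picks the right option only because of its position would tend to fall into the `partial` tier after shuffling, which is the kind of ordering bias the abstract says the reward discourages.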
Rethinking Entropy Interventions in RLVR: An Entropy Change Perspective
重新思考 RLVR 中的熵干预:熵变化视角
- Authors: Zhezheng Hao, Hong Wang, Haoyang Liu, Jian Luo, Jiarui Yu, Hande Dong, Qiang Lin, Can Wang, Jiawei Chen
- Subjects:
Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.10150
- Pdf link: https://arxiv.org/pdf/2510.10150
- Abstract
While Reinforcement Learning with Verifiable Rewards (RLVR) can enhance LLM reasoning, its training process poses a critical risk: entropy collapse. This phenomenon is a rapid loss of policy diversity, stemming from the exploration-exploitation imbalance and leading to a lack of generalization. Recent entropy-intervention methods aim to prevent entropy collapse, yet their underlying mechanisms remain unclear. In this paper, we conduct a quantitative analysis to reveal token-level entropy changes and how existing entropy intervention methods help avoid entropy collapse. Our findings point out a fundamental limitation of existing methods: they attempt to control entropy dynamics indirectly. By only affecting related factors, such as the advantage signal and generation probability, their effectiveness is inherently limited and could potentially fail. To address this limitation, we introduce an entropy-change-aware reweighting scheme, namely Stabilizing Token-level Entropy-changE via Reweighting (STEER), that adaptively stabilizes entropy dynamics through fine-grained token-level adjustments. Our approach mitigates over-exploitation while fostering robust exploration. Extensive experiments demonstrate that STEER significantly mitigates entropy collapse, stabilizes entropy dynamics, and achieves stronger downstream performance across various mathematical reasoning benchmarks. Our code is available at this https URL.
- 中文摘要
虽然具有可验证奖励的强化学习 (RLVR) 可以增强 LLM 推理,但其训练过程带来了一个严重的风险:熵崩溃。这种现象是策略多样性的迅速丧失,源于探索-利用失衡,导致缺乏泛化能力。最近的熵干预方法旨在防止熵崩溃,但其潜在机制仍不清楚。在本文中,我们进行了定量分析,以揭示 token 级的熵变化,以及现有的熵干预方法如何帮助避免熵崩溃。我们的研究结果指出了现有方法的一个根本局限性:它们试图间接控制熵动力学。由于仅影响相关因素,例如优势信号和生成概率,它们的有效性本质上是有限的,并且可能会失败。为了解决这一限制,我们引入了一种熵变化感知的重新加权方案,即通过重加权稳定 token 级熵变(STEER),它通过细粒度的 token 级调整自适应地稳定熵动力学。我们的方法减少了过度利用,同时促进了稳健的探索。广泛的实验表明,STEER 显着减轻了熵崩溃,稳定了熵动力学,并在各种数学推理基准测试中实现了更强的下游性能。我们的代码可在此 https URL 中找到。
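To make "token-level entropy change" concrete, the sketch below computes the entropy of the next-token distribution before and after a policy update and derives a per-token weight from the size of the change. The abstract does not give STEER's reweighting formula, so the exponential weight and the coefficient `lam` are purely illustrative assumptions, not the method from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(logits):
    """Shannon entropy (in nats) of the distribution induced by the logits."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def entropy_change_weights(old_logits, new_logits, lam=1.0):
    """One weight per token position: large entropy swings get smaller weights.

    `old_logits` and `new_logits` are lists of per-position logit vectors.
    The form exp(-lam * |delta H|) is a hypothetical choice for illustration.
    """
    weights = []
    for old, new in zip(old_logits, new_logits):
        delta = entropy(new) - entropy(old)
        weights.append(math.exp(-lam * abs(delta)))
    return weights


old = [[0.0, 0.0, 0.0], [1.0, 0.0, -1.0]]
new = [[5.0, 0.0, 0.0], [1.0, 0.0, -1.0]]  # first position sharpens sharply
print(entropy_change_weights(old, new))
```

The first position, where the distribution collapses toward one token, receives a weight below 1, while the unchanged second position keeps a weight of exactly 1.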
Dejavu: Post-Deployment Learning for Embodied Agents via Experience Feedback
Dejavu:通过经验反馈对具身代理进行部署后学习
- Authors: Shaokai Wu, Yanbiao Ji, Qiuchang Li, Zhiyi Zhang, Qichen He, Wenyuan Xie, Guodong Zhang, Bayram Bayramli, Yue Ding, Hongtao Lu
- Subjects:
Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2510.10181
- Pdf link: https://arxiv.org/pdf/2510.10181
- Abstract
Embodied agents face a fundamental limitation: once deployed in real-world environments to perform specific tasks, they are unable to acquire new useful knowledge to enhance task performance. In this paper, we propose a general post-deployment learning framework called Dejavu, which employs an Experience Feedback Network (EFN) and augments the frozen Vision-Language-Action (VLA) policy with retrieved execution memories. EFN automatically identifies contextually successful prior action experiences and conditions action prediction on this retrieved guidance. We adopt reinforcement learning with semantic similarity rewards on EFN to ensure that the predicted actions align with past successful behaviors under current observations. During deployment, EFN continually enriches its memory with new trajectories, enabling the agent to exhibit "learning from experience" despite fixed weights. Experiments across diverse embodied tasks show that EFN significantly improves adaptability, robustness, and success rates over frozen baselines. These results highlight a promising path toward embodied agents that continually refine their behavior after deployment.
- 中文摘要
具身智能体面临一个根本的局限性:一旦部署在现实环境中执行特定任务,他们就无法获得新的有用知识来增强任务绩效。在本文中,我们提出了一个名为 Dejavu 的通用部署后学习框架,它采用经验反馈网络 (EFN),并使用检索到的执行记忆来增强冻结的视觉-语言-动作 (VLA) 策略。EFN 会自动识别上下文中成功的先前动作经验,并以检索到的指导为条件进行动作预测。我们在 EFN 上采用带有语义相似性奖励的强化学习,以确保预测的动作与当前观察下过去的成功行为保持一致。在部署过程中,EFN 不断通过新的轨迹丰富其记忆,使智能体能够在权重固定的情况下表现出“从经验中学习”。不同具身任务的实验表明,与冻结基线相比,EFN 显着提高了适应性、稳健性和成功率。这些结果凸显了一条通往具身智能体的有希望的道路,这些智能体在部署后不断完善其行为。
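The retrieval step implied by the abstract (finding successful prior experiences that match the current context) might look like the sketch below. The memory layout, the `success` flag, the embedding vectors, and the use of cosine similarity are all assumptions for illustration; the abstract does not specify how EFN stores or scores its memories.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(memory, query, k=1):
    """Return the k successful memories most similar to the query embedding."""
    successes = [m for m in memory if m["success"]]
    ranked = sorted(successes,
                    key=lambda m: cosine(m["embedding"], query),
                    reverse=True)
    return ranked[:k]


# Hypothetical memory entries: an observation embedding, the action taken,
# and whether the episode succeeded.
memory = [
    {"embedding": [1.0, 0.0], "action": "grasp", "success": True},
    {"embedding": [0.9, 0.1], "action": "push", "success": False},
    {"embedding": [0.0, 1.0], "action": "lift", "success": True},
]
best = retrieve(memory, [0.8, 0.2])
print(best[0]["action"])

# New trajectories are appended at deployment time; no weights change.
memory.append({"embedding": [0.7, 0.3], "action": "slide", "success": True})
```

Appending to `memory` while the policy weights stay fixed mirrors the abstract's point that behavior can improve after deployment without retraining.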
Don't Just Fine-tune the Agent, Tune the Environment
不要只是微调代理,而是调整环境
- Authors: Siyuan Lu, Zechuan Wang, Hongxuan Zhang, Qintong Wu, Leilei Gan, Chenyi Zhuang, Jinjie Gu, Tao Lin
- Subjects:
Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.10197
- Pdf link: https://arxiv.org/pdf/2510.10197
- Abstract
Large Language Model (LLM) agents show great promise for complex, multi-turn tool-use tasks, but their development is often hampered by the extreme scarcity of high-quality training data. Supervised fine-tuning (SFT) on synthetic data leads to overfitting, whereas standard reinforcement learning (RL) struggles with a critical cold-start problem and training instability. To address these challenges, we introduce Environment Tuning, a novel training paradigm that enables agents to learn complex behaviors directly from problem instances without relying on pre-collected expert trajectories. Environment Tuning orchestrates this learning process through a structured curriculum, actionable environment augmentation that provides corrective feedback, and fine-grained progress rewards to ensure stable and efficient exploration. Using only 400 problem instances from Berkeley Function-Calling Leaderboard (BFCL) benchmark, our method not only achieves competitive in-distribution performance against strong baselines but also demonstrates superior out-of-distribution generalization, overcoming the performance collapse common to SFT-based approaches. Our work presents a paradigm shift from supervised fine-tuning on static trajectories to dynamic, environment-based exploration, paving the way for training more robust and data-efficient agents.
- 中文摘要
大型语言模型 (LLM) 代理在复杂的多轮工具使用任务中显示出巨大的前景,但它们的发展往往受到高质量训练数据极度稀缺的阻碍。合成数据的监督微调 (SFT) 会导致过度拟合,而标准强化学习 (RL) 则难以解决关键的冷启动问题和训练不稳定性。为了应对这些挑战,我们引入了 Environment Tuning,这是一种新颖的训练范式,使代理能够直接从问题实例中学习复杂的行为,而无需依赖预先收集的专家轨迹。Environment Tuning 通过结构化课程、提供纠正反馈的可操作环境增强以及细粒度进度奖励来编排这一学习过程,以确保稳定高效的探索。我们的方法仅使用伯克利函数调用排行榜(BFCL)基准测试中的400个问题实例,不仅在强基线下实现了具有竞争力的分布内性能,而且还展示了卓越的分布外泛化,克服了基于SFT的方法常见的性能崩溃。我们的工作呈现了从静态轨迹的监督微调到动态的、基于环境的探索的范式转变,为训练更强大和数据高效的代理铺平了道路。
RLFR: Extending Reinforcement Learning for LLMs with Flow Environment
RLFR:使用流环境扩展 LLM 的强化学习
- Authors: Jinghao Zhang, Naishan Zheng, Ruilin Li, Dongzhou Cheng, Zheming Liang, Feng Zhao, Jiaqi Wang
- Subjects:
Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2510.10201
- Pdf link: https://arxiv.org/pdf/2510.10201
- Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a promising framework for improving reasoning abilities in Large Language Models (LLMs). However, policies optimized with binary verification are prone to overlooking potentially valuable exploration in the reasoning trajectory. In view of the heavy annotation cost of golden Process Reward Models (PRMs), recent works attempt to use auxiliary signals for reward shaping of process tokens, involving entropy and likelihood collected from logit space. In this work, we offer a novel perspective on shaping RLVR with flow rewards derived from latent space, and propose RLFR, where the flow fields of model latents are constructed from off-policy high-quality data and on-policy rejection sampling data, and the velocity deviations of policy latents within it are quantified to serve as a reward signal. RLFR first demonstrates that a well-established flow field can be a sound environment for reward signal collection, highlighting that the expressive latent space is much underexplored. Moreover, RLFR is able to compress any off-policy expert data as reference for constituting reward signals, and we show that the efficient context dependence compressed within the hidden states is utilized, rather than individual token-level denotation for context comprehending. Experiments on both language and multimodal reasoning benchmarks demonstrate the reliability of flow rewards and suggest a promising paradigm for reward shaping with auxiliary signals.
- 中文摘要
具有可验证奖励的强化学习 (RLVR) 最近已成为提高大型语言模型 (LLM) 推理能力的有前途的框架。然而,通过二元验证优化的策略容易忽视推理轨迹中潜在的有价值的探索。针对黄金过程奖励模型(PRMs)标注成本高的问题,近年来的研究尝试使用辅助信号对过程标记进行奖励塑造,涉及从logit空间收集的熵和似然。在这项工作中,我们提供了一种新颖的视角,即利用源自潜在空间的流量奖励来塑造RLVR,并提出了RLFR,其中模型潜在的流场是由非策略高质量数据和策略拒绝采样数据构建的,并量化其中策略潜在的速度偏差作为奖励信号。RLFR 首先证明,一个完善的流场可以成为奖励信号收集的声音环境,突出了表达性潜在空间的探索不足。此外,RLFR能够压缩任何策略外的专家数据作为构成奖励信号的参考,并且我们表明,在隐藏状态中压缩的有效上下文依赖性被利用,而不是用于上下文理解的单个token级外延。语言和多模态推理基准的实验证明了流奖励的可靠性,并提出了一种有希望的辅助信号奖励塑造范式。
Performance Index Shaping for Closed-loop Optimal Control
用于闭环优化控制的性能指标塑造
- Authors: Ayush Rai, Shaoshuai Mou, Brian D. O. Anderson
- Subjects:
Systems and Control (eess.SY); Optimization and Control (math.OC)
- Arxiv link: https://arxiv.org/abs/2510.10202
- Pdf link: https://arxiv.org/pdf/2510.10202
- Abstract
The design of the performance index, also referred to as cost or reward shaping, is central to both optimal control and reinforcement learning, as it directly determines the behaviors, trade-offs, and objectives that the resulting control laws seek to achieve. A commonly used approach for this inference task in recent years is differentiable trajectory optimization, which allows gradients to be computed with respect to cost parameters by differentiating through an optimal control solver. However, this method often requires repeated solving of the underlying optimal control problem at every iteration, making the method computationally expensive. In this work, assuming known dynamics, we propose a novel framework that analytically links the performance index to the resulting closed-loop optimal control law, thereby transforming a typically bi-level inverse problem into a tractable single-level formulation. Our approach is motivated by the question: given a closed-loop control law that solves an infinite-horizon optimal control problem, how does this law change when the performance index is modified with additional terms? This formulation yields closed-form characterizations for broad classes of systems and performance indices, which not only facilitate interpretation and stability analysis, but also provide insight into the robust stability and input-to-state stable behavior of the resulting nonlinear closed-loop system. Moreover, this analytical perspective enables the generalization of our approach to diverse design objectives, yielding a unifying framework for performance index shaping. Given specific design objectives, we propose a systematic methodology to guide the shaping of the performance index and thereby design the resulting optimal control law.
- 中文摘要
性能指数的设计,也称为成本或奖励塑造,是最优控制和强化学习的核心,因为它直接决定了由此产生的控制定律寻求实现的行为、权衡和目标。近年来,这种推理任务的一种常用方法是可微轨迹优化,它允许通过最优控制求解器进行微分来计算相对于成本参数的梯度。然而,这种方法通常需要在每次迭代时重复求解潜在的最优控制问题,这使得该方法的计算成本很高。在这项工作中,假设已知的动力学,我们提出了一种新颖的框架,该框架将性能指标与由此产生的闭环最优控制定律分析联系起来,从而将典型的双能级逆问题转换为可处理的单能级公式。我们的方法受到以下问题的激励:给定一个解决无限视界最优控制问题的闭环控制定律,当性能指标被修改为附加项时,该定律如何变化?该公式为各种系统和性能指标提供了封闭形式的表征,这不仅有助于解释和稳定性分析,而且还提供了对所得非线性闭环系统的稳健稳定性和输入到状态稳定行为的深入了解。此外,这种分析视角使我们的方法能够推广到不同的设计目标,从而为性能指标的形成一个统一的框架。给定具体的设计目标,提出一种系统的方法来指导性能指标的形成,从而设计出最优控制规律。
Adaptive Dual Reasoner: Large Reasoning Models Can Think Efficiently by Hybrid Reasoning
自适应双推理器:大型推理模型可以通过混合推理进行高效思考
- Authors: Yujian Zhang, Keyu Chen, Zhifeng Shen, Ruizhi Qiao, Xing Sun
- Subjects:
Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.10207
- Pdf link: https://arxiv.org/pdf/2510.10207
- Abstract
Although Long Reasoning Models (LRMs) have achieved superior performance on various reasoning scenarios, they often suffer from increased computational costs and inference latency caused by overthinking. To address these limitations, we propose Adaptive Dual Reasoner (ADR), which supports two reasoning modes: fast thinking and slow thinking. ADR dynamically alternates between these modes based on the contextual complexity during reasoning. ADR is trained in two stages: (1) A cold-start stage using supervised fine-tuning (SFT) to equip the model with the ability to integrate both fast and slow reasoning modes, in which we construct a hybrid reasoning dataset through a dedicated pipeline to provide large-scale supervision. (2) A reinforcement learning stage for optimizing reasoning effort, where we introduce Entropy-guided Hybrid Policy Optimization (EHPO), an RL training framework employing an entropy-guided dynamic rollout strategy for branching at high-entropy units and a difficulty-aware penalty to balance fast and slow reasoning. Across challenging mathematical reasoning benchmarks, ADR achieves an effective balance between reasoning performance and efficiency among state-of-the-art approaches. Specifically, ADR yields a performance gain of up to 6.1%, while reducing the reasoning output length by 49.5% to 59.3%.
- 中文摘要
尽管长推理模型(LRM)在各种推理场景中取得了卓越的性能,但它们经常受到过度思考导致计算成本增加和推理延迟的影响。为了解决这些限制,我们提出了自适应双推理器,它支持两种推理模式:快速思维和慢思维。ADR 在推理过程中根据上下文复杂性在这些模式之间动态交替。ADR分两个阶段进行训练:(1)使用监督微调(SFT)的冷启动阶段,使模型具备集成快慢推理模式的能力,其中我们通过专用管道构建混合推理数据集,以提供大规模监督。(2)优化推理工作的强化学习阶段,我们引入了熵引导的混合策略优化EHPO,这是一个RL训练框架,采用熵引导的动态推出策略,用于在高熵单位进行分支,以及难度感知惩罚来平衡快速和慢速推理。在具有挑战性的数学推理基准中,ADR 在最先进的方法中实现了推理性能和效率之间的有效平衡。具体而言,ADR可产生高达6.1%的性能提升,同时将推理输出长度减少49.5%至59.3%。
SGM: A Statistical Godel Machine for Risk-Controlled Recursive Self-Modification
SGM:用于风险控制递归自修改的统计 Godel 机器
- Authors: Xuening Wu, Shenqin Yin, Yanlan Kang, Xinhang Zhang, Qianya Xu, Zeping Chen, Wenqiang Zhang
- Subjects:
Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.10232
- Pdf link: https://arxiv.org/pdf/2510.10232
- Abstract
Recursive self-modification is increasingly central in AutoML, neural architecture search, and adaptive optimization, yet no existing framework ensures that such changes are made safely. Godel machines offer a principled safeguard by requiring formal proofs of improvement before rewriting code; however, such proofs are unattainable in stochastic, high-dimensional settings. We introduce the Statistical Godel Machine (SGM), the first statistical safety layer for recursive edits. SGM replaces proof-based requirements with statistical confidence tests (e-values, Hoeffding bounds), admitting a modification only when superiority is certified at a chosen confidence level, while allocating a global error budget to bound cumulative risk across this http URL also propose Confirm-Triggered Harmonic Spending (CTHS), which indexes spending by confirmation events rather than rounds, concentrating the error budget on promising edits while preserving familywise this http URL across supervised learning, reinforcement learning, and black-box optimization validate this role: SGM certifies genuine gains on CIFAR-100, rejects spurious improvement on ImageNet-100, and demonstrates robustness on RL and optimization this http URL, these results position SGM as foundational infrastructure for continual, risk-aware self-modification in learning this http URL is available at: this https URL.
- 中文摘要
递归自我修改在 AutoML、神经架构搜索和自适应优化中越来越重要,但没有现有框架可以确保安全地进行此类更改。Godel 机器提供了一种原则性的保障措施,要求在重写代码之前提供正式的改进证明;然而,在随机的高维设置中,这样的证明是无法实现的。我们介绍了统计哥德尔机 (SGM),这是第一个用于递归编辑的统计安全层。SGM 用统计置信度检验(e-值、Hoeffding 边界)取代了基于证明的要求,仅当在选定的置信水平上证明优越性时才允许修改,同时分配全局错误预算来限制此 http URL 的累积风险还提出了确认触发谐波支出 (CTHS),它通过确认事件而不是轮次对支出进行索引,将错误预算集中在有希望的编辑上,同时保留此 http URL监督学习、强化学习和黑盒优化验证了这一作用:SGM 在 CIFAR-100 上证明了真正的收益,拒绝了 ImageNet-100 上的虚假改进,并展示了 RL 的鲁棒性和优化这个 http URL,这些结果将 SGM 定位为在学习这个 http URL 时持续的、有风险意识的自我修改的基础基础设施,可在以下网址获得:这个 https URL。
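The acceptance rule described in the abstract (admit an edit only when superiority is certified at a chosen confidence level, under a global error budget) can be sketched with a one-sided Hoeffding bound. Everything specific here is an assumption for illustration: score differences are taken to lie in `[-1, 1]`, and the per-round budget uses the standard `alpha * 6 / (pi^2 * k^2)` spending sequence, which sums to `alpha` over all rounds. This is not the paper's CTHS schedule, whose details the abstract does not give.

```python
import math


def hoeffding_lower_bound(diffs, delta, lo=-1.0, hi=1.0):
    """Lower confidence bound on the mean of bounded samples.

    With probability at least 1 - delta, the true mean exceeds this value
    (one-sided Hoeffding inequality for samples in [lo, hi]).
    """
    n = len(diffs)
    mean = sum(diffs) / n
    return mean - (hi - lo) * math.sqrt(math.log(1.0 / delta) / (2.0 * n))


def round_budget(alpha, k):
    """Error budget for round k (1-indexed); the budgets sum to alpha."""
    return alpha * 6.0 / (math.pi ** 2 * k ** 2)


def accept_edit(diffs, alpha, k):
    """Accept only if the candidate is certified better than the incumbent.

    `diffs` holds paired score differences (candidate minus incumbent).
    """
    return hoeffding_lower_bound(diffs, round_budget(alpha, k)) > 0.0


clear_gain = [0.5] * 200
tiny_gain = [0.01] * 200
print(accept_edit(clear_gain, alpha=0.05, k=1),
      accept_edit(tiny_gain, alpha=0.05, k=1))
```

Because the per-round budgets sum to at most `alpha`, a union bound keeps the chance of ever accepting a non-improving edit below `alpha` across all rounds, which is the sense in which cumulative risk is bounded in this sketch.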
Reasoning-Enhanced Large Language Models for Molecular Property Prediction
推理增强型大语言模型用于分子性质预测
- Authors: Jiaxi Zhuang, Yaorui Shi, Jue Hou, Yunong He, Mingwei Ye, Mingjun Xu, Yuming Su, Linfeng Zhang, Linfeng Zhang, Guolin Ke, Hengxing Cai
- Subjects:
Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.10248
- Pdf link: https://arxiv.org/pdf/2510.10248
- Abstract
Molecular property prediction is crucial for drug discovery and materials science, yet existing approaches suffer from limited interpretability, poor cross-task generalization, and lack of chemical reasoning capabilities. Traditional machine learning models struggle with task transferability, while specialized molecular language models provide little insight into their decision-making processes. To address these limitations, we propose MPPReasoner, a multimodal large language model that incorporates chemical reasoning for molecular property prediction. Our approach, built upon Qwen2.5-VL-7B-Instruct, integrates molecular images with SMILES strings to enable comprehensive molecular understanding. We develop a two-stage training strategy: supervised fine-tuning (SFT) using 16,000 high-quality reasoning trajectories generated through expert knowledge and multiple teacher models, followed by Reinforcement Learning from Principle-Guided Rewards (RLPGR). RLPGR employs verifiable, rule-based rewards that systematically evaluate chemical principle application, molecular structure analysis, and logical consistency through computational verification. Extensive experiments across 8 datasets demonstrate significant performance improvements, with MPPReasoner outperforming the best baselines by 7.91% and 4.53% on in-distribution and out-of-distribution tasks respectively. MPPReasoner exhibits exceptional cross-task generalization and generates chemically sound reasoning paths that provide valuable insights into molecular property analysis, substantially enhancing both interpretability and practical utility for chemists. Code is available at this https URL.
- 中文摘要
分子性质预测对于药物发现和材料科学至关重要,但现有方法存在可解释性有限、跨任务泛化能力差以及缺乏化学推理能力的问题。传统的机器学习模型在任务可转移性方面遇到困难,而专门的分子语言模型几乎无法深入了解其决策过程。为了解决这些限制,我们提出了 MPPReasoner,这是一种多模态大型语言模型,它结合了用于分子性质预测的化学推理。我们的方法基于 Qwen2.5-VL-7B-Instruct 构建,将分子图像与 SMILES 字符串集成在一起,以实现全面的分子理解。我们制定了两阶段的训练策略:使用通过专业知识和多个教师模型生成的 16,000 条高质量推理轨迹进行监督微调 (SFT),然后是基于原则引导奖励的强化学习 (RLPGR)。RLPGR 采用可验证的、基于规则的奖励,通过计算验证系统地评估化学原理应用、分子结构分析和逻辑一致性。对 8 个数据集的广泛实验表明性能显着提高,MPPReasoner 在分布内和分布外任务上分别比最佳基线高出 7.91% 和 4.53%。MPPReasoner 表现出出色的跨任务泛化能力,并生成化学上合理的推理路径,为分子性质分析提供有价值的见解,大大提高了化学家的可解释性和实用性。代码可在此 https URL 中找到。
Sample-Efficient Online Learning in LM Agents via Hindsight Trajectory Rewriting
通过事后诸葛亮轨迹重写在 LM 代理中进行样本高效的在线学习
- Authors: Michael Y. Hu, Benjamin Van Durme, Jacob Andreas, Harsh Jhamtani
- Subjects:
Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2510.10304
- Pdf link: https://arxiv.org/pdf/2510.10304
- Abstract
Language model (LM) agents deployed in novel environments often exhibit poor sample efficiency when learning from sequential interactions. This significantly hinders the usefulness of such agents in environments where interaction is costly (for example, when they interact with humans or reset physical systems). While a number of existing LM agent architectures incorporate various mechanisms for experience storage and reflection, they make limited use of LMs' abilities to directly generate or reason about full counterfactual trajectories. We introduce ECHO (Experience Consolidation via Hindsight Optimization), a prompting framework that adapts hindsight experience replay from reinforcement learning for language model agents. ECHO generates optimized trajectories for alternative goals that could have been achieved during failed attempts, effectively creating synthetic positive examples from unsuccessful interactions. Our approach consists of two components: a hindsight rule that uses the language model itself to identify relevant subgoals and generate optimized trajectories, and an update rule that maintains compressed trajectory representations in memory. We evaluate ECHO on stateful versions of XMiniGrid, a text-based navigation and planning benchmark, and PeopleJoinQA, a collaborative information-gathering enterprise simulation. Across both domains, ECHO outperforms vanilla language agent baselines by up to 80%; in XMiniGrid, it also outperforms a number of sophisticated agent architectures including Reflexion and AWM, demonstrating faster adaptation to novel environments through more effective utilization of past experiences.
- 中文摘要
部署在新环境中的语言模型 (LM) 代理在从顺序交互中学习时通常表现出较差的样本效率。这极大地阻碍了此类代理在交互成本高昂的环境中的实用性(例如,当它们与人类交互或重置物理系统时)。虽然许多现有的 LM 代理架构包含各种用于经验存储和反思的机制,但它们有限地利用 LM 直接生成或推理完整的反事实轨迹的能力。我们引入了 ECHO(通过事后诸葛亮优化进行体验整合),这是一个提示框架,它适用于语言模型代理的强化学习的事后经验回放。ECHO 为在失败尝试期间可能实现的替代目标生成优化轨迹,有效地从不成功的交互中创建合成的积极示例。我们的方法由两个部分组成:一个使用语言模型本身来识别相关子目标并生成优化轨迹的事后诸葛亮规则,以及一个在内存中维护压缩轨迹表示的更新规则。我们在 XMiniGrid(一种基于文本的导航和规划基准)和 PeopleJoinQA(一种协作信息收集企业模拟)的有状态版本上评估了 ECHO。在这两个领域,ECHO 的性能比普通语言代理基线高出 80%;在 XMiniGrid 中,它还优于包括 Reflexion 和 AWM 在内的许多复杂的代理架构,通过更有效地利用过去的经验,展示了对新环境的更快适应。
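The two components named in the abstract, a hindsight rule and a memory update rule, can be illustrated with classic hindsight relabeling on a toy grid. Note the substitution: ECHO uses the language model itself to propose subgoals and rewrite trajectories, whereas this sketch swaps in the plain rule "every visited state is an alternative goal". The trajectory format and the keep-the-shortest memory policy are assumptions made for the example.

```python
def hindsight_relabel(trajectory):
    """Turn one (possibly failed) trajectory into positive examples.

    `trajectory` is a list of (state_reached, action) pairs. Each visited
    state is treated as a goal that was achieved, paired with the shortest
    prefix of the trajectory that first reaches it.
    """
    examples = {}
    for step, (state, _action) in enumerate(trajectory):
        if state not in examples:
            examples[state] = trajectory[: step + 1]
    return examples


def update_memory(memory, examples):
    """Keep, for each goal, only the shortest trajectory seen so far."""
    for goal, traj in examples.items():
        if goal not in memory or len(traj) < len(memory[goal]):
            memory[goal] = traj
    return memory


# A failed attempt to reach (3, 3) still shows how to reach (0, 1) and (1, 1).
failed = [((0, 1), "up"), ((1, 1), "right"), ((0, 1), "left")]
memory = update_memory({}, hindsight_relabel(failed))

# A later attempt reaches (1, 1) in a single step and replaces the older entry.
better = [((1, 1), "diagonal")]
memory = update_memory(memory, hindsight_relabel(better))
print(memory[(1, 1)])
```

The stored prefixes act as synthetic successes extracted from an unsuccessful interaction, which is the sample-efficiency idea the abstract borrows from hindsight experience replay.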
Towards Safe Maneuvering of Double-Ackermann-Steering Robots with a Soft Actor-Critic Framework
基于软行为者-批评框架的双阿克曼转向机器人的安全操纵
- Authors: Kohio Deflesselle, Mélodie Daniel, Aly Magassouba, Miguel Aranda, Olivier Ly
- Subjects:
Robotics (cs.RO); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.10332
- Pdf link: https://arxiv.org/pdf/2510.10332
- Abstract
We present a deep reinforcement learning framework based on Soft Actor-Critic (SAC) for safe and precise maneuvering of double-Ackermann-steering mobile robots (DASMRs). Unlike holonomic or simpler non-holonomic robots such as differential-drive robots, DASMRs face strong kinematic constraints that make classical planners brittle in cluttered environments. Our framework leverages the Hindsight Experience Replay (HER) and the CrossQ overlay to encourage maneuvering efficiency while avoiding obstacles. Simulation results with a heavy four-wheel-steering rover show that the learned policy can robustly reach up to 97% of target positions while avoiding obstacles. Our framework does not rely on handcrafted trajectories or expert demonstrations.
- 中文摘要
我们提出了一个基于软演员评论家(SAC)的深度强化学习框架,用于安全、精确地操纵双阿克曼转向移动机器人(DASMR)。与完整或更简单的非完整机器人(如差速驱动机器人)不同,DASMR 面临着强大的运动学约束,这使得经典规划器在杂乱的环境中变得脆弱。我们的框架利用事后经验回放 (HER) 和 CrossQ 叠加来鼓励机动效率,同时避开障碍物。重型四轮转向漫游车的仿真结果表明,学习到的策略可以在避开障碍物的同时稳健地到达高达97%的目标位置。我们的框架不依赖于手工制作的轨迹或专家演示。
RECON: Reasoning with Condensation for Efficient Retrieval-Augmented Generation
RECON:用于高效检索增强生成的冷凝推理
- Authors: Zhichao Xu, Minheng Wang, Yawei Wang, Wenqian Ye, Yuntao Du, Yunpu Ma, Yijun Tian
- Subjects:
Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2510.10448
- Pdf link: https://arxiv.org/pdf/2510.10448
- Abstract
Retrieval-augmented generation (RAG) systems trained using reinforcement learning (RL) with reasoning are hampered by inefficient context management, where long, noisy retrieved documents increase costs and degrade performance. We introduce RECON (REasoning with CONdensation), a framework that integrates an explicit summarization module to compress evidence within the reasoning loop. Our summarizer is trained via a two-stage process: relevance pretraining on QA datasets, followed by multi-aspect distillation from proprietary LLMs to ensure factuality and clarity. Integrated into the Search-R1 pipeline, RECON reduces total context length by 35%, leading to improved training speed and inference latency, while simultaneously improving RAG performance on downstream QA benchmarks. Notably, it boosts the average EM score of the 3B model by 14.5% and the 7B model by 3.0%, showing particular strength in multi-hop QA. RECON demonstrates that learned context compression is essential for building practical, scalable, and performant RAG systems. Our code implementation is made available at this https URL.
- 中文摘要
使用具有推理的强化学习 (RL) 训练的检索增强生成 (RAG) 系统受到低效上下文管理的阻碍,其中冗长、嘈杂的检索文档会增加成本并降低性能。我们引入了 RECON(REasoning with CONdensation),这是一个集成了显式摘要模块以压缩推理循环内证据的框架。我们的摘要器通过两个阶段的过程进行训练:对 QA 数据集进行相关性预训练,然后从专有的 LLM 进行多方面提炼,以确保事实性和清晰度。RECON 集成到 Search-R1 管道中,将总上下文长度减少了 35%,从而提高了训练速度和推理延迟,同时提高了下游 QA 基准测试中的 RAG 性能。值得注意的是,它将 3B 模型的平均 EM 分数提高了 14.5%,将 7B 模型的平均 EM 分数提高了 3.0%,在多跳 QA 方面表现出特别的优势。RECON 表明,学习的上下文压缩对于构建实用、可扩展和高性能的 RAG 系统至关重要。我们的代码实现可在此 https URL 中找到。
Data-driven simulator of multi-animal behavior with unknown dynamics via offline and online reinforcement learning
通过离线和在线强化学习对动态未知的多动物行为进行数据驱动模拟器
- Authors: Keisuke Fujii, Kazushi Tsutsui, Yu Teshima, Makoto Itoh, Naoya Takeishi, Nozomi Nishiumi, Ryoya Tanaka, Shunsuke Shigaki, Yoshinobu Kawahara
- Subjects:
Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.10451
- Pdf link: https://arxiv.org/pdf/2510.10451
- Abstract
Simulators of animal movements play a valuable role in studying behavior. Advances in imitation learning for robotics have expanded possibilities for reproducing human and animal movements. A key challenge for realistic multi-animal simulation in biology is bridging the gap between unknown real-world transition models and their simulated counterparts. Because locomotion dynamics are seldom known, relying solely on mathematical models is insufficient; constructing a simulator that both reproduces real trajectories and supports reward-driven optimization remains an open problem. We introduce a data-driven simulator for multi-animal behavior based on deep reinforcement learning and counterfactual simulation. We address the ill-posed nature of the problem caused by high degrees of freedom in locomotion by estimating movement variables of an incomplete transition model as actions within an RL framework. We also employ a distance-based pseudo-reward to align and compare states between cyber and physical spaces. Validated on artificial agents, flies, newts, and silkmoth, our approach achieves higher reproducibility of species-specific behaviors and improved reward acquisition compared with standard imitation and RL methods. Moreover, it enables counterfactual behavior prediction in novel experimental settings and supports multi-individual modeling for flexible what-if trajectory generation, suggesting its potential to simulate and elucidate complex multi-animal behaviors.
- 中文摘要
动物运动模拟器在研究行为方面发挥着重要作用。机器人模仿学习的进步扩大了再现人类和动物运动的可能性。生物学中逼真的多动物模拟的一个关键挑战是弥合未知的现实世界过渡模型与其模拟模型之间的差距。由于运动动力学鲜为人知,仅仅依靠数学模型是不够的;构建一个既能再现真实轨迹又支持奖励驱动优化的模拟器仍然是一个悬而未决的问题。我们引入了一种基于深度强化学习和反事实模拟的多动物行为数据驱动模拟器。我们通过将不完整过渡模型的运动变量估计为RL框架内的动作,解决了运动中高自由度引起的问题的不良性质。我们还采用基于距离的伪奖励来对齐和比较网络空间和物理空间之间的状态。在人工药剂、苍蝇、蝾螈和蚕蛾身上进行了验证,与标准模仿和强联算法相比,我们的方法实现了更高的物种特异性行为的可重复性,并改善了奖励获取。此外,它还能够在新的实验环境中进行反事实行为预测,并支持多个体建模以生成灵活的假设轨迹,这表明其具有模拟和阐明复杂多动物行为的潜力。
Towards Dynamic Quadrupedal Gaits: A Symmetry-Guided RL Hierarchy Enables Free Gait Transitions at Varying Speeds
迈向动态四足步态:对称引导的 RL 层次结构可在不同速度下实现自由步态转换
- Authors: Jiayu Ding, Xulin Chen, Garrett E. Katz, Zhenyu Gan
- Subjects:
Robotics (cs.RO); Systems and Control (eess.SY)
- Arxiv link: https://arxiv.org/abs/2510.10455
- Pdf link: https://arxiv.org/pdf/2510.10455
- Abstract
Quadrupedal robots exhibit a wide range of viable gaits, but generating specific footfall sequences often requires laborious expert tuning of numerous variables, such as touch-down and lift-off events and holonomic constraints for each leg. This paper presents a unified reinforcement learning framework for generating versatile quadrupedal gaits by leveraging the intrinsic symmetries and velocity-period relationship of dynamic legged systems. We propose a symmetry-guided reward function design that incorporates temporal, morphological, and time-reversal symmetries. By focusing on preserved symmetries and natural dynamics, our approach eliminates the need for predefined trajectories, enabling smooth transitions between diverse locomotion patterns such as trotting, bounding, half-bounding, and galloping. Implemented on the Unitree Go2 robot, our method demonstrates robust performance across a range of speeds in both simulations and hardware tests, significantly improving gait adaptability without extensive reward tuning or explicit foot placement control. This work provides insights into dynamic locomotion strategies and underscores the crucial role of symmetries in robotic gait design.
- 中文摘要
四足机器人表现出广泛的可行步态,但生成特定的脚步序列通常需要专家对众多变量进行费力的调整,例如着陆和起飞事件以及每条腿的全息约束。本文提出了一个统一的强化学习框架,利用动态腿系统的内在对称性和速度周期关系来生成多功能的四足步态。我们提出了一种对称引导的奖励函数设计,该设计结合了时间、形态和时间反转对称性。通过专注于保持的对称性和自然动力学,我们的方法消除了对预定义轨迹的需求,从而实现了小跑、边界、半边界和疾驰等不同运动模式之间的平滑过渡。我们的方法在 Unitree Go2 机器人上实施,在模拟和硬件测试中展示了在一系列速度范围内的稳健性能,无需大量的奖励调整或明确的脚部位置控制即可显着提高步态适应性。这项工作提供了对动态运动策略的见解,并强调了对称性在机器人步态设计中的关键作用。
MARS-Sep: Multimodal-Aligned Reinforced Sound Separation
MARS-Sep:多模态对齐的增强声音分离
- Authors: Zihan Zhang, Xize Cheng, Zhennan Jiang, Dongjie Fu, Jingyuan Chen, Zhou Zhao, Tao Jin
- Subjects:
Sound (cs.SD); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.10509
- Pdf link: https://arxiv.org/pdf/2510.10509
- Abstract
Universal sound separation faces a fundamental misalignment: models optimized for low-level signal metrics often produce semantically contaminated outputs, failing to suppress perceptually salient interference from acoustically similar sources. To bridge this gap, we introduce MARS-Sep, a reinforcement learning framework that reformulates separation as decision making. Instead of simply regressing ground-truth masks, MARS-Sep learns a factorized Beta mask policy that is optimized by a clipped trust-region surrogate with entropy regularization and group-relative advantage normalization. Concretely, we sample masks from a frozen old policy, reconstruct waveforms, and update the current policy using clipped importance ratios, yielding substantially more stable and sample-efficient learning. Multimodal rewards, derived from an audio-text-vision encoder, directly incentivize semantic consistency with query prompts. We further propose a progressive alignment scheme to fine-tune this encoder, boosting its cross-modal discriminability and improving reward faithfulness. Extensive experiments on multiple benchmarks demonstrate consistent gains in Text-, Audio-, and Image-Queried separation, with notable improvements in signal metrics and semantic quality. Our code is available at this https URL. Sound separation samples are available at this https URL.
- 中文摘要
通用声音分离面临着一个根本的错位:针对低电平信号指标优化的模型通常会产生语义污染的输出,无法抑制来自声学相似源的感知显着干扰。为了弥合这一差距,我们引入了 MARS-Sep,这是一个强化学习框架,它将分离重新表述为决策。MARS-Sep 不是简单地回归地面实况掩码,而是学习因式分解的 Beta 掩码策略,该策略通过具有熵正则化和组相对优势归一化的裁剪信任区域代理进行优化。具体来说,我们从冻结的旧策略中对掩码进行采样,重建波形,并使用裁剪重要性比更新当前策略,从而产生更稳定和更高效的学习。多模态奖励源自音频-文本-视觉编码器,直接激励查询提示的语义一致性。我们进一步提出了一种渐进式对齐方案来微调该编码器,提高其跨模态可辨别性并提高奖励忠实度。在多个基准测试上的广泛实验表明,文本、音频和图像查询分离具有一致的收益,信号指标和语义质量显着提高。我们的代码可在此 https URL 中找到。声音分离样本可在此 https URL 上获得。
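The update rule described in the MARS-Sep abstract (clipped importance ratios with group-relative advantage normalization) can be sketched as follows. This is a generic PPO/GRPO-style objective, not the authors' implementation: the Beta mask policy, waveform reconstruction, and entropy regularization are omitted, and the function names and the `clip_eps` default are assumptions.

```python
import math
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group to zero mean, unit scale."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximized).

    logp_old comes from the frozen policy that sampled the masks,
    logp_new from the policy currently being updated.
    """
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # importance ratio pi_new / pi_old
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        total += min(ratio * adv, clipped * adv)  # pessimistic bound
    return total / len(advantages)
```

Taking the minimum of the clipped and unclipped terms removes the incentive to move the policy far from the sampling policy in a single step, which is the usual source of the stability the abstract refers to.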
A Hybrid Machine Learning Approach for Synthetic Data Generation with Post Hoc Calibration for Clinical Tabular Datasets
一种混合机器学习方法,用于合成数据生成,对临床表格数据集进行事后校准
- Authors: Md Ibrahim Shikder Mahin, Md Shamsul Arefin, Md Tanvir Hasan
- Subjects:
Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2510.10513
- Pdf link: https://arxiv.org/pdf/2510.10513
- Abstract
Healthcare research and development face significant obstacles due to data scarcity and stringent privacy regulations, such as HIPAA and the GDPR, restricting access to essential real-world medical data. These limitations impede innovation, delay robust AI model creation, and hinder advancements in patient-centered care. Synthetic data generation offers a transformative solution by producing artificial datasets that emulate real data statistics while safeguarding patient privacy. We introduce a novel hybrid framework for high-fidelity healthcare data synthesis integrating five augmentation methods: noise injection, interpolation, Gaussian Mixture Model (GMM) sampling, Conditional Variational Autoencoder (CVAE) sampling, and SMOTE, combined via a reinforcement learning-based dynamic weight selection mechanism. Its key innovations include advanced calibration techniques -- moment matching, full histogram matching, soft and adaptive soft histogram matching, and iterative refinement -- that align marginal distributions and preserve joint feature dependencies. Evaluated on the Breast Cancer Wisconsin (UCI Repository) and Khulna Medical College cardiology datasets, our calibrated hybrid achieves Wasserstein distances as low as 0.001 and Kolmogorov-Smirnov statistics around 0.01, demonstrating near-zero marginal discrepancy. Pairwise trend scores surpass 90%, and Nearest Neighbor Adversarial Accuracy approaches 50%, confirming robust privacy protection. Downstream classifiers trained on synthetic data achieve up to 94% accuracy and F1 scores above 93%, comparable to models trained on real data. This scalable, privacy-preserving approach matches state-of-the-art methods, sets new benchmarks for joint-distribution fidelity in healthcare, and supports sensitive AI applications.
- 中文摘要
由于数据稀缺和严格的隐私法规(例如 HIPAA 和 GDPR)限制了对现实世界基本医疗数据的访问,医疗保健研发面临重大障碍。这些限制阻碍了创新,延迟了强大的人工智能模型创建,并阻碍了以患者为中心的护理的进步。合成数据生成通过生成模拟真实数据统计数据同时保护患者隐私的人工数据集提供了变革性的解决方案。我们引入了一种用于高保真医疗保健数据合成的新型混合框架,该框架集成了五种增强方法:噪声注入、插值、高斯混合模型 (GMM) 采样、条件变分自动编码器 (CVAE) 采样和 SMOTE,并通过基于强化学习的动态权重选择机制相结合。其关键创新包括先进的校准技术——矩匹配、全直方图匹配、软和自适应软直方图匹配以及迭代细化——以对齐边际分布并保留联合特征依赖性。在威斯康星州乳腺癌(UCI 存储库)和库尔纳医学院心脏病学数据集上进行评估,我们校准的混合体实现了低至 0.001 的 Wasserstein 距离和 0.01 左右的 Kolmogorov-Smirnov 统计数据,显示出接近零的边际差异。成对趋势得分超过 90%,最近邻对抗准确率接近 50%,证实了强大的隐私保护。在合成数据上训练的下游分类器可实现高达 94% 的准确率和 93% 以上的 F1 分数,与在真实数据上训练的模型相当。这种可扩展的隐私保护方法与最先进的方法相匹配,为医疗保健领域的联合分配保真度设定了新的基准,并支持敏感的人工智能应用。
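The two marginal-fidelity metrics quoted in the abstract, the Wasserstein distance and the Kolmogorov-Smirnov statistic, can be computed per feature as sketched below. This is a plain-Python illustration rather than the authors' evaluation code; the function names are hypothetical, and the Wasserstein helper assumes both samples have the same size.

```python
from bisect import bisect_right

def wasserstein_1d(a, b):
    """W1 distance between two equal-size 1-D samples.

    For equal sizes this is the mean absolute gap between sorted values.
    """
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    sa, sb = sorted(a), sorted(b)
    gap = 0.0
    for v in sa + sb:  # the ECDFs only change at observed values
        fa = bisect_right(sa, v) / len(sa)
        fb = bisect_right(sb, v) / len(sb)
        gap = max(gap, abs(fa - fb))
    return gap
```

Both values are zero when a synthetic column has exactly the same empirical distribution as the real one, so values near 0.001 and 0.01 indicate closely matched marginals. Neither metric says anything about joint dependencies, which is why the abstract reports pairwise trend scores separately.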
Population-Coded Spiking Neural Networks for High-Dimensional Robotic Control
用于高维机器人控制的群体编码尖峰神经网络
- Authors: Kanishkha Jaisankar, Xiaoyang Jiang, Feifan Liao, Jeethu Sreenivas Amuthan
- Subjects:
Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2510.10516
- Pdf link: https://arxiv.org/pdf/2510.10516
- Abstract
Energy-efficient and high-performance motor control remains a critical challenge in robotics, particularly for high-dimensional continuous control tasks with limited onboard resources. While Deep Reinforcement Learning (DRL) has achieved remarkable results, its computational demands and energy consumption limit deployment in resource-constrained environments. This paper introduces a novel framework combining population-coded Spiking Neural Networks (SNNs) with DRL to address these challenges. Our approach leverages the event-driven, asynchronous computation of SNNs alongside the robust policy optimization capabilities of DRL, achieving a balance between energy efficiency and control performance. Central to this framework is the Population-coded Spiking Actor Network (PopSAN), which encodes high-dimensional observations into neuronal population activities and enables optimal policy learning through gradient-based updates. We evaluate our method on the Isaac Gym platform using the PixMC benchmark with complex robotic manipulation tasks. Experimental results on the Franka robotic arm demonstrate that our approach achieves energy savings of up to 96.10% compared to traditional Artificial Neural Networks (ANNs) while maintaining comparable control performance. The trained SNN policies exhibit robust finger position tracking with minimal deviation from commanded trajectories and stable target height maintenance during pick-and-place operations. These results position population-coded SNNs as a promising solution for energy-efficient, high-performance robotic control in resource-constrained applications, paving the way for scalable deployment in real-world robotics systems.
- 中文摘要
节能和高性能运动控制仍然是机器人技术面临的关键挑战,特别是对于机载资源有限的高维连续控制任务。虽然深度强化学习(DRL)取得了显著的成果,但其计算需求和能耗限制了在资源受限环境中的部署。本文介绍了一种将群体编码的尖峰神经网络(SNN)与DRL相结合的新框架来应对这些挑战。我们的方法利用 SNN 的事件驱动异步计算以及 DRL 强大的策略优化功能,实现了能源效率和控制性能之间的平衡。该框架的核心是群体编码的尖峰参与者网络 (PopSAN),它将高维观测值编码到神经元群体活动中,并通过基于梯度的更新实现最佳策略学习。我们在 Isaac Gym 平台上使用 PixMC 基准测试和复杂的机器人操作任务来评估我们的方法。在Franka机械臂上的实验结果表明,与传统的人工神经网络(ANN)相比,我们的方法可节省高达96.10%的能源,同时保持相当的控制性能。经过训练的 SNN 策略表现出强大的手指位置跟踪,在拾取和放置操作期间与命令轨迹的偏差最小,并稳定地保持目标高度。这些结果将群体编码的 SNN 定位为资源受限应用中节能、高性能机器人控制的有前途的解决方案,为现实世界机器人系统中的可扩展部署铺平了道路。
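Population coding, as named in the abstract, represents each observation dimension by the joint activity of several neurons rather than a single value. The sketch below uses evenly spaced Gaussian receptive fields, a common choice; the number of neurons, the value range, the width `sigma`, and the function names are assumptions rather than details from the paper.

```python
import math

def population_encode(x, low=-1.0, high=1.0, n_neurons=5, sigma=0.5):
    """Activation of each neuron in a population for one scalar observation.

    Each neuron has a Gaussian receptive field centered at an evenly spaced
    point in [low, high]; neurons whose center is close to x respond most.
    """
    step = (high - low) / (n_neurons - 1)
    centers = [low + i * step for i in range(n_neurons)]
    return [math.exp(-0.5 * ((x - c) / sigma) ** 2) for c in centers]

def encode_observation(obs, **kwargs):
    """Concatenate the population codes of every observation dimension."""
    return [a for x in obs for a in population_encode(x, **kwargs)]
```

In a spiking network these analog activations would typically drive spike generation, for example as firing probabilities per time step; that stage is not shown here.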
Reinforced Domain Selection for Continuous Domain Adaptation
用于连续域适应的强化域选择
- Authors: Hanbing Liu, Huaze Tang, Yanru Wu, Yang Li, Xiao-Ping Zhang
- Subjects:
Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2510.10530
- Pdf link: https://arxiv.org/pdf/2510.10530
- Abstract
Continuous Domain Adaptation (CDA) effectively bridges significant domain shifts by progressively adapting from the source domain through intermediate domains to the target domain. However, selecting intermediate domains without explicit metadata remains a substantial challenge that has not been extensively explored in existing studies. To tackle this issue, we propose a novel framework that combines reinforcement learning with feature disentanglement to conduct domain path selection in an unsupervised CDA setting. Our approach introduces an innovative unsupervised reward mechanism that leverages the distances between latent domain embeddings to facilitate the identification of optimal transfer paths. Furthermore, by disentangling features, our method facilitates the calculation of unsupervised rewards using domain-specific features and promotes domain adaptation by aligning domain-invariant features. This integrated strategy is designed to simultaneously optimize transfer paths and target task performance, enhancing the effectiveness of domain adaptation processes. Extensive empirical evaluations on datasets such as Rotated MNIST and ADNI demonstrate substantial improvements in prediction accuracy and domain selection efficiency, establishing our method's superiority over traditional CDA approaches.
- 中文摘要
连续域适应 (CDA) 通过从源域通过中间域逐步适应目标域,有效地桥接了显着的域偏移。然而,选择没有显式元数据的中间域仍然是一个重大挑战,现有研究尚未广泛探索。为了解决这个问题,我们提出了一种新的框架,将强化学习与特征解缠相结合,在无监督的 CDA 设置中进行域路径选择。我们的方法引入了一种创新的无监督奖励机制,该机制利用潜在域嵌入之间的距离来促进最佳转移路径的识别。此外,通过解开特征,我们的方法有助于使用特定领域的特征计算无监督奖励,并通过对齐领域不变特征来促进领域适应。这种综合策略旨在同时优化转移路径和目标任务绩效,提高领域适应过程的有效性。对Rotated MNIST和ADNI等数据集的广泛实证评估表明,预测准确性和域选择效率有了显著提高,确立了我们的方法相对于传统CDA方法的优越性。
Rethinking RL Evaluation: Can Benchmarks Truly Reveal Failures of RL Methods?
重新思考RL评估:基准测试能否真正揭示RL方法的失败?
- Authors: Zihan Chen, Yiming Zhang, Hengguang Zhou, Zenghui Ding, Yining Sun, Cho-Jui Hsieh
- Subjects:
Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.10541
- Pdf link: https://arxiv.org/pdf/2510.10541
- Abstract
Current benchmarks are inadequate for evaluating progress in reinforcement learning (RL) for large language models (LLMs). Despite recent benchmark gains reported for RL, we find that training on these benchmarks' training sets achieves nearly the same performance as training directly on the test sets, suggesting that the benchmarks cannot reliably separate further progress. To study this phenomenon, we introduce a diagnostic suite and the Oracle Performance Gap (OPG) metric, which quantifies the performance difference between training on the train split and training on the test split of a benchmark. We further analyze this phenomenon with stress tests and find that, despite strong benchmark scores, existing RL methods struggle to generalize across distribution shifts, varying levels of difficulty, and counterfactual scenarios: shortcomings that current benchmarks fail to reveal. We conclude that current benchmarks are insufficient for evaluating generalization and propose three core principles for designing more faithful benchmarks: sufficient difficulty, balanced evaluation, and distributional robustness.
- 中文摘要
目前的基准不足以评估大型语言模型 (LLM) 的强化学习 (RL) 进展。尽管最近报告了 RL 的基准测试收益,但我们发现,在这些基准的训练集上训练所达到的性能与直接在测试集上训练几乎相同,这表明这些基准无法可靠地区分进一步的进展。为了研究这一现象,我们引入了一个诊断套件和 Oracle 性能差距 (OPG) 指标,用于量化在基准的训练拆分上训练与在测试拆分上训练之间的性能差异。我们进一步通过压力测试分析了这一现象,发现尽管基准得分很高,但现有的 RL 方法难以在分布偏移、不同难度级别和反事实场景下泛化:这些都是当前基准未能揭示的缺点。我们得出结论,当前基准不足以评估泛化能力,并提出了设计更可信基准的三个核心原则:足够的难度、平衡的评估和分布稳健性。
PAC-Bayesian Reinforcement Learning Trains Generalizable Policies
PAC-贝叶斯强化学习训练可推广的策略
- Authors: Abdelkrim Zitouni, Mehdi Hennequin, Juba Agoun, Ryan Horache, Nadia Kabachi, Omar Rivasplata
- Subjects:
Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
- Arxiv link: https://arxiv.org/abs/2510.10544
- Pdf link: https://arxiv.org/pdf/2510.10544
- Abstract
We derive a novel PAC-Bayesian generalization bound for reinforcement learning that explicitly accounts for Markov dependencies in the data, through the chain's mixing time. This contributes to overcoming challenges in obtaining generalization guarantees for reinforcement learning, where the sequential nature of data breaks the independence assumptions underlying classical bounds. Our bound provides non-vacuous certificates for modern off-policy algorithms like Soft Actor-Critic. We demonstrate the bound's practical utility through PB-SAC, a novel algorithm that optimizes the bound during training to guide exploration. Experiments across continuous control tasks show that our approach provides meaningful confidence certificates while maintaining competitive performance.
- 中文摘要
我们推导出了一种新的 PAC-Bayesian 泛化,用于强化学习,它通过链的混合时间明确解释了数据中的马尔可夫依赖关系。这有助于克服获得强化学习泛化保证的挑战,其中数据的顺序性质打破了经典边界背后的独立性假设。我们的边界为现代非策略算法(如 Soft Actor-Critic)提供非空证书。我们通过 PB-SAC 展示了边界的实际实用性,PB-SAC 是一种新颖的算法,可在训练期间优化边界以指导探索。跨连续控制任务的实验表明,我们的方法在保持竞争性能的同时提供了有意义的置信度证书。
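For context on what a "classical bound" looks like, the sketch below evaluates the standard PAC-Bayes bound for i.i.d. data (McAllester's form with Maurer's constant, valid for losses in [0, 1] and n of at least 8). It is not the bound derived in the paper, which replaces the independence assumption with a dependence on the chain's mixing time and whose exact form is not given in the abstract. The function name is hypothetical.

```python
import math

def pac_bayes_bound(empirical_risk, kl, n, delta=0.05):
    """Classical i.i.d. PAC-Bayes upper bound on the expected risk.

    With probability at least 1 - delta over n i.i.d. samples:
        risk <= empirical_risk + sqrt((KL + ln(2*sqrt(n)/delta)) / (2n))
    where KL is the divergence between posterior and prior.
    """
    if not 0.0 < delta < 1.0 or n <= 0 or kl < 0.0:
        raise ValueError("need 0 < delta < 1, n > 0 and kl >= 0")
    slack = math.sqrt((kl + math.log(2.0 * math.sqrt(n) / delta)) / (2.0 * n))
    return empirical_risk + slack
```

The slack term shrinks roughly as one over the square root of n, but only if the n samples are independent. Transitions collected along a single Markov chain are correlated, which is the gap the abstract says the new bound addresses.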
Reinforcement Learning-based Dynamic Adaptation for Sampling-Based Motion Planning in Agile Autonomous Driving
基于强化学习的敏捷自动驾驶中基于采样的运动规划动态自适应
- Authors: Alexander Langmann, Yevhenii Tokarev, Mattia Piccinini, Korbinian Moller, Johannes Betz
- Subjects:
Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2510.10567
- Pdf link: https://arxiv.org/pdf/2510.10567
- Abstract
Sampling-based trajectory planners are widely used for agile autonomous driving due to their ability to generate fast, smooth, and kinodynamically feasible trajectories. However, their behavior is often governed by a cost function with manually tuned, static weights, which forces a tactical compromise that is suboptimal across the wide range of scenarios encountered in a race. To address this shortcoming, we propose using a Reinforcement Learning (RL) agent as a high-level behavioral selector that dynamically switches the cost function parameters of an analytical, low-level trajectory planner during runtime. We show the effectiveness of our approach in simulation in an autonomous racing environment where our RL-based planner achieved 0% collision rate while reducing overtaking time by up to 60% compared to state-of-the-art static planners. Our new agent now dynamically switches between aggressive and conservative behaviors, enabling interactive maneuvers unattainable with static configurations. These results demonstrate that integrating reinforcement learning as a high-level selector resolves the inherent trade-off between safety and competitiveness in autonomous racing planners. The proposed methodology offers a pathway toward adaptive yet interpretable motion planning for broader autonomous driving applications.
- 中文摘要
基于采样的轨迹规划器由于能够生成快速、平滑和运动动力学上可行的轨迹,因此被广泛用于敏捷自动驾驶。然而,它们的行为通常受具有手动调整的静态权重的成本函数的控制,这迫使在比赛中遇到的各种场景中采取次优的战术妥协。为了解决这一缺点,我们建议使用强化学习(RL)代理作为高级行为选择器,在运行时动态切换解析式低级轨迹规划器的成本函数参数。我们展示了我们在自动驾驶赛车环境中的模拟方法的有效性,与最先进的静态规划器相比,我们基于 RL 的规划器实现了 0% 的碰撞率,同时将超车时间减少了多达 60%。我们的新代理现在可以在激进行为和保守行为之间动态切换,从而实现静态配置无法实现的交互式机动。这些结果表明,将强化学习集成为高级选择器可以解决自动驾驶赛车规划器在安全性和竞争力之间的固有权衡。所提出的方法为更广泛的自动驾驶应用提供了一条自适应且可解释的运动规划的途径。
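The division of labor described in the abstract, a high-level selector that only switches the weights of a low-level sampling planner's cost function, can be sketched as follows. The cost terms, weight values, and mode names are hypothetical, since the abstract does not list them. In the paper the mode would be chosen by the trained RL agent; here it is passed in directly.

```python
# Hypothetical weight presets; the actual cost terms are not given in the abstract.
PRESETS = {
    "aggressive":   {"time_loss": 1.0, "collision_risk": 0.5, "jerk": 0.1},
    "conservative": {"time_loss": 0.3, "collision_risk": 5.0, "jerk": 0.5},
}

def trajectory_cost(features, weights):
    """Weighted sum of per-trajectory cost features."""
    return sum(weights[name] * features[name] for name in weights)

def plan(candidates, mode):
    """Low-level planner: return the sampled trajectory with the lowest cost
    under the weight preset selected by the high-level agent."""
    weights = PRESETS[mode]
    return min(candidates, key=lambda c: trajectory_cost(c["features"], weights))

candidates = [
    {"name": "overtake", "features": {"time_loss": 0.0, "collision_risk": 0.4, "jerk": 0.5}},
    {"name": "follow",   "features": {"time_loss": 1.0, "collision_risk": 0.05, "jerk": 0.1}},
]
```

With these illustrative numbers the aggressive preset selects the overtaking trajectory and the conservative preset selects the following one, from the same candidate set. Because the planner itself stays analytical, the chosen trajectory remains explainable by its cost terms.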
AQORA: A Learned Adaptive Query Optimizer for Spark SQL
AQORA:用于 Spark SQL 的学习自适应查询优化器
- Authors: Jiahao He, Yutao Cui, Cuiping Li, Jikang Jiang, Yuheng Hou, Hong Chen
- Subjects:
Databases (cs.DB)
- Arxiv link: https://arxiv.org/abs/2510.10580
- Pdf link: https://arxiv.org/pdf/2510.10580
- Abstract
Recent studies have identified two main approaches to improve query optimization: learned query optimization (LQO), which generates or selects better query plans before execution based on models trained in advance, and adaptive query processing (AQP), which adapts the query plan during execution based on statistical feedback collected at runtime. Although both approaches have shown promise, they also face critical limitations. LQO must commit to a fixed plan without access to actual cardinalities and typically relies on a single end-to-end feedback signal, making learning inefficient. On the other hand, AQP depends heavily on rule-based heuristics and lacks the ability to learn from experience. In this paper, we present AQORA, an adaptive query optimizer with a reinforcement learning architecture that combines the strengths of both LQO and AQP. AQORA addresses the above challenges through four core strategies: (1) realistic feature encoding, (2) query stage-level feedback and intervention, (3) automatic strategy adaptation, and (4) low-cost integration. Experiments show that AQORA reduces end-to-end execution time by up to 90% compared to other learned methods and by up to 70% compared to Spark SQL's default configuration with adaptive query execution.
- 中文摘要
最近的研究确定了两种改进查询优化的主要方法:学习查询优化(LQO),它根据预先训练的模型在执行前生成或选择更好的查询计划,以及自适应查询处理(AQP),它根据运行时收集的统计反馈在执行过程中调整查询计划。尽管这两种方法都显示出前景,但它们也面临着严重的局限性。LQO 必须承诺一个固定的计划,而无法访问实际的基数,并且通常依赖于单个端到端反馈信号,这使得学习效率低下。另一方面,AQP 严重依赖基于规则的启发式方法,缺乏从经验中学习的能力。在本文中,我们介绍了 AQORA,这是一种自适应查询优化器,具有结合了 LQO 和 AQP 优势的强化学习架构。AQORA通过四个核心策略解决上述挑战:(1)现实特征编码,(2)查询阶段级反馈和干预,(3)自动策略适配,(4)低成本集成。实验表明,与其他学习方法相比,AQORA 将端到端执行时间缩短了 90%,与 Spark SQL 的默认配置相比,使用自适应查询执行减少了 70%。
ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models
ViSurf:大型视觉和语言模型的视觉监督和强化微调
- Authors: Yuqi Liu, Liangyu Chen, Jiazhen Liu, Mingkang Zhu, Zhisheng Zhong, Bei Yu, Jiaya Jia
- Subjects:
Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2510.10606
- Pdf link: https://arxiv.org/pdf/2510.10606
- Abstract
Typical post-training paradigms for Large Vision-and-Language Models (LVLMs) include Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR). SFT leverages external guidance to inject new knowledge, whereas RLVR utilizes internal reinforcement to enhance reasoning capabilities and overall performance. However, our analysis reveals that SFT often leads to sub-optimal performance, while RLVR struggles with tasks that exceed the model's internal knowledge base. To address these limitations, we propose ViSurf (Visual Supervised-and-Reinforcement Fine-Tuning), a unified post-training paradigm that integrates the strengths of both SFT and RLVR within a single stage. We analyze the derivation of the SFT and RLVR objectives to establish the ViSurf objective, providing a unified perspective on these two paradigms. The core of ViSurf involves injecting ground-truth labels into the RLVR rollouts, thereby providing simultaneous external supervision and internal reinforcement. Furthermore, we introduce three novel reward control strategies to stabilize and optimize the training process. Extensive experiments across several diverse benchmarks demonstrate the effectiveness of ViSurf, which outperforms individual SFT, individual RLVR, and two-stage SFT → RLVR. In-depth analysis corroborates these findings, validating the derivation and design principles of ViSurf.
- 中文摘要
大型视觉和语言模型 (LVLM) 的典型训练后范式包括监督微调 (SFT) 和具有可验证奖励的强化学习 (RLVR)。SFT 利用外部指导来注入新知识,而 RLVR 利用内部强化来增强推理能力和整体性能。然而,我们的分析表明,SFT 通常会导致性能不佳,而 RLVR 则难以完成超出模型内部知识库的任务。为了解决这些限制,我们提出了 ViSurf (Visual Supervised-and-Reinforcement Fine-Tuning),这是一种统一的训练后范式,将 SFT 和 RLVR 的优势整合到一个阶段中。我们分析了 SFT 和 RLVR 目标的推导,以建立 ViSurf 目标,为这两种范式提供了统一的视角。ViSurf 的核心是将真实标签注入 RLVR 的采样轨迹 (rollouts) 中,从而同时提供外部监督和内部强化。此外,我们还引入了三种新颖的奖励控制策略来稳定和优化训练过程。跨多个不同基准的广泛实验证明了 ViSurf 的有效性,优于单独的 SFT、单独的 RLVR 和两阶段 SFT → RLVR。深入的分析证实了这些发现,验证了 ViSurf 的推导和设计原则。
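The core idea stated in the abstract, injecting the ground-truth label into the group of RLVR rollouts, can be sketched as below. The exact-match reward, the GRPO-style in-group normalization, and the function names are assumptions for illustration; the paper's three reward control strategies are not modeled.

```python
import statistics

def verifiable_reward(answer, ground_truth):
    """Binary verifiable reward: 1 for an exact match, 0 otherwise."""
    return 1.0 if answer.strip() == ground_truth.strip() else 0.0

def advantages_with_ground_truth(rollouts, ground_truth, eps=1e-8):
    """Append the label to the rollout group, then normalize rewards in-group.

    Returns the extended group and one advantage per member.
    """
    group = list(rollouts) + [ground_truth]
    rewards = [verifiable_reward(a, ground_truth) for a in group]
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return group, [(r - mean) / (std + eps) for r in rewards]
```

When every sampled rollout is wrong, a plain group-normalized scheme yields all-zero advantages and therefore no learning signal. With the label in the group, the label receives a positive advantage and the wrong rollouts a negative one, which is one way to read the abstract's "simultaneous external supervision and internal reinforcement".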
OmniQuality-R: Advancing Reward Models Through All-Encompassing Quality Assessment
OmniQuality-R:通过全方位的质量评估推进奖励模型
- Authors: Yiting Lu, Fengbin Guan, Yixin Gao, Yan Zhong, Xinge Peng, Jiakang Yuan, Yihao Liu, Bo Zhang, Xin Li, Zhibo Chen, Weisi Lin
- Subjects:
Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2510.10609
- Pdf link: https://arxiv.org/pdf/2510.10609
- Abstract
Current visual evaluation approaches are typically constrained to a single task. To address this, we propose OmniQuality-R, a unified, structured reward modeling framework that transforms multi-task, multi-dimensional quality reasoning into continuous and interpretable reward signals for policy optimization. Its design is inspired by subjective experiments, in which participants receive task-specific instructions outlining distinct assessment principles before evaluation. To enable this, we construct a reasoning-enhanced reward modeling dataset by sampling informative plan-reason trajectories via rejection sampling, forming a reliable chain-of-thought (CoT) dataset for supervised fine-tuning (SFT). Building on this, we apply Group Relative Policy Optimization (GRPO) for post-training, using a Gaussian-based reward to support continuous score prediction. To further stabilize the training and improve downstream generalization, we incorporate standard deviation (STD) filtering and entropy gating mechanisms during reinforcement learning. These techniques suppress unstable updates and reduce variance in policy optimization. We evaluate OmniQuality-R on three key IQA tasks: aesthetic quality assessment, technical quality evaluation, and text-image alignment.
- 中文摘要
当前的视觉评估方法通常仅限于单个任务。为了解决这个问题,我们提出了 OmniQuality-R,这是一个统一的奖励建模框架,它将多任务质量推理转化为连续且可解释的奖励信号,用于策略优化。受到主观实验的启发,在评估之前,参与者会获得特定于任务的指令,概述不同的评估原则,我们提出了 OmniQuality-R,这是一种结构化的奖励建模框架,可将多维推理转化为连续且可解释的奖励信号。为了实现这一点,我们通过拒绝抽样对信息丰富的计划-原因轨迹进行采样,构建了一个推理增强的奖励建模数据集,形成了一个可靠的思维链(CoT)数据集,用于监督微调(SFT)。在此基础上,我们将组相对策略优化 (GRPO) 应用于后训练,使用基于高斯的奖励来支持连续分数预测。为了进一步稳定训练并提高下游泛化,我们在强化学习中融入了标准差(STD)滤波和熵门控机制。这些技术抑制了不稳定的更新并减少了策略优化的差异。我们根据三个关键的 IQA 任务评估 OmniQuality-R:美学质量评估、技术质量评估和文本图像对齐。
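A Gaussian-based reward for continuous score prediction, together with a standard-deviation filter, might look like the sketch below. The abstract does not specify the kernel width or the filtering rule, so `sigma`, `min_std`, the direction of the filter (dropping groups whose rewards barely differ), and the function names are all assumptions.

```python
import math
import statistics

def gaussian_reward(predicted, target, sigma=0.5):
    """Reward in (0, 1] that decays smoothly with the squared score error."""
    return math.exp(-((predicted - target) ** 2) / (2.0 * sigma ** 2))

def keep_group(rewards, min_std=0.05):
    """STD filter (assumed form): keep a rollout group only if its rewards
    vary enough to give a usable group-relative advantage."""
    return statistics.pstdev(rewards) >= min_std
```

Unlike a 0/1 exact-match reward, the Gaussian form gives partial credit to predictions that are close to the target score, which suits regression-like quality assessment. Groups with near-identical rewards would have near-zero normalized advantages dominated by noise, which is one plausible reason for filtering them.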
Assessing Policy Updates: Toward Trust-Preserving Intelligent User Interfaces
评估策略更新:实现可信任的智能用户界面
- Authors: Matan Solomon, Ofra Amir, Omer Ben-Porat
- Subjects:
Human-Computer Interaction (cs.HC)
- Arxiv link: https://arxiv.org/abs/2510.10616
- Pdf link: https://arxiv.org/pdf/2510.10616
- Abstract
Reinforcement learning agents are often updated with human feedback, yet such updates can be unreliable: reward misspecification, preference conflicts, or limited data may leave policies unchanged or even worse. Because policies are difficult to interpret directly, users face the challenge of deciding whether an update has truly helped. We propose that assessing model updates -- not just a single model -- is a critical design challenge for intelligent user interfaces. In a controlled study, participants provided feedback to an agent in a gridworld and then compared its original and updated policies. We evaluated four strategies for communicating updates: no demonstration, same-context, random-context, and salient-contrast demonstrations designed to highlight informative differences. Salient-contrast demonstrations significantly improved participants' ability to detect when updates helped or harmed performance, mitigating participants' bias towards assuming that feedback is always beneficial, and supported better trust calibration across contexts.
- 中文摘要
强化学习代理通常会根据人工反馈进行更新,但此类更新可能不可靠:奖励错误规范、偏好冲突或有限的数据可能会使策略保持不变,甚至更糟。由于策略难以直接解释,因此用户面临着确定更新是否真正有帮助的挑战。我们提出,评估模型更新——而不仅仅是单个模型——是智能用户界面的关键设计挑战。在一项对照研究中,参与者向网格世界中的代理提供反馈,然后比较其原始和更新的策略。我们评估了四种传达更新的策略:无演示、相同上下文、随机上下文和显着对比演示,旨在突出信息差异。显着对比演示显着提高了参与者检测更新何时有助于或损害性能的能力,减轻了参与者对假设反馈始终有益的偏见,并支持跨上下文进行更好的信任校准。
Collaborative Text-to-Image Generation via Multi-Agent Reinforcement Learning and Semantic Fusion
通过多智能体强化学习和语义融合进行协作文本到图像生成
- Authors: Jiabao Shi, Minfeng Qi, Lefeng Zhang, Di Wang, Yingjie Zhao, Ziying Li, Yalong Xing, Ningran Li
- Subjects:
Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.10633
- Pdf link: https://arxiv.org/pdf/2510.10633
- Abstract
Multimodal text-to-image generation remains constrained by the difficulty of maintaining semantic alignment and professional-level detail across diverse visual domains. We propose a multi-agent reinforcement learning framework that coordinates domain-specialized agents (e.g., focused on architecture, portraiture, and landscape imagery) within two coupled subsystems: a text enhancement module and an image generation module, each augmented with multimodal integration components. Agents are trained using Proximal Policy Optimization (PPO) under a composite reward function that balances semantic similarity, linguistic visual quality, and content diversity. Cross-modal alignment is enforced through contrastive learning, bidirectional attention, and iterative feedback between text and image. Across six experimental settings, our system significantly enriches generated content (word count increased by 1614%) while reducing ROUGE-1 scores by 69.7%. Among fusion methods, Transformer-based strategies achieve the highest composite score (0.521), despite occasional stability issues. Multimodal ensembles yield moderate consistency (ranging from 0.444 to 0.481), reflecting the persistent challenges of cross-modal semantic grounding. These findings underscore the promise of collaborative, specialization-driven architectures for advancing reliable multimodal generative systems.
- 中文摘要
多模态文本到图像生成仍然受到跨不同视觉领域保持语义对齐和专业级细节的困难的限制。我们提出了一个多智能体强化学习框架,该框架在两个耦合子系统中协调领域专业智能体(例如,专注于建筑、肖像和风景图像):文本增强模块和图像生成模块,每个子系统都通过多模态集成组件进行增强。代理在平衡语义相似性、语言视觉质量和内容多样性的复合奖励函数下使用近端策略优化 (PPO) 进行训练。跨模态对齐是通过对比学习、双向注意力以及文本和图像之间的迭代反馈来强制执行的。在六个实验设置中,我们的系统显着丰富了生成的内容(字数增加了 1614%),同时将 ROUGE-1 分数降低了 69.7%。在融合方法中,基于 Transformer 的策略取得了最高的综合得分 (0.521),尽管偶尔会出现稳定性问题。多模态集合产生中等一致性(范围从 0.444 到 0.481),反映了跨模态语义基础的持续挑战。这些发现强调了协作、专业化驱动的架构在推进可靠的多模态生成系统方面的前景。
Hierarchical Optimization via LLM-Guided Objective Evolution for Mobility-on-Demand Systems
通过 LLM 引导的按需移动系统的目标演化进行分层优化
- Authors: Yi Zhang, Yushen Long, Yun Ni, Liping Huang, Xiaohong Wang, Jun Liu
- Subjects:
Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.10644
- Pdf link: https://arxiv.org/pdf/2510.10644
- Abstract
Online ride-hailing platforms aim to deliver efficient mobility-on-demand services, often facing challenges in balancing dynamic and spatially heterogeneous supply and demand. Existing methods typically fall into two categories: reinforcement learning (RL) approaches, which suffer from data inefficiency, oversimplified modeling of real-world dynamics, and difficulty enforcing operational constraints; or decomposed online optimization methods, which rely on manually designed high-level objectives that lack awareness of low-level routing dynamics. To address this issue, we propose a novel hybrid framework that integrates a large language model (LLM) with mathematical optimization in a dynamic hierarchical system: (1) it is training-free, removing the need for large-scale interaction data as in RL, and (2) it leverages the LLM to bridge cognitive limitations caused by problem decomposition by adaptively generating high-level objectives. Within this framework, the LLM serves as a meta-optimizer, producing semantic heuristics that guide a low-level optimizer responsible for constraint enforcement and real-time decision execution. These heuristics are refined through a closed-loop evolutionary process, driven by harmony search, which iteratively adapts the LLM prompts based on feasibility and performance feedback from the optimization layer. Extensive experiments based on scenarios derived from both the New York and Chicago taxi datasets demonstrate the effectiveness of our approach, achieving an average improvement of 16% compared to state-of-the-art baselines.
- 中文摘要
网约车平台旨在提供高效的按需出行服务,但往往面临平衡动态和空间异质供需的挑战。现有方法通常分为两类:强化学习 (RL) 方法,存在数据效率低下、现实世界动态建模过于简化以及难以实施运营约束的问题;或分解的在线优化方法,这些方法依赖于手动设计的高级目标,缺乏对低级路由动态的认识。为了解决这个问题,我们提出了一种新型的混合框架,将大型语言模型(LLM)与数学优化集成在动态分层系统中:(1)它是免训练的,消除了像RL那样对大规模交互数据的需求,以及(2)它利用LLM通过自适应生成高级目标来弥合问题分解造成的认知限制。在此框架内,LLM 充当元优化器,产生语义启发式方法,指导负责约束执行和实时决策执行的低级优化器。这些启发式方法通过闭环进化过程进行细化,由和声搜索驱动,根据优化层的可行性和性能反馈迭代调整 LLM 提示。基于纽约和芝加哥出租车数据集的场景的广泛实验证明了我们方法的有效性,与最先进的基线相比,平均提高了 16%。
Unlocking Exploration in RLVR: Uncertainty-aware Advantage Shaping for Deeper Reasoning
解锁 RLVR 中的探索:不确定性感知优势塑造以进行更深入的推理
- Authors: Can Xie, Ruotong Pan, Xiangyu Wu, Yunfei Zhang, Jiayi Fu, Tingting Gao, Guorui Zhou
- Subjects:
Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.10649
- Pdf link: https://arxiv.org/pdf/2510.10649
- Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has shown significant promise for enhancing the reasoning capabilities of large language models (LLMs). However, prevailing algorithms like GRPO broadcast a uniform advantage signal across all tokens in a sequence. This coarse-grained approach overlooks the pivotal role of uncertain, high-stakes decisions during reasoning, leading to inefficient exploration and the well-documented problem of entropy collapse. To address this, we introduce UnCertainty-aware Advantage Shaping (UCAS), a model-free method that refines credit assignment by leveraging the model's internal uncertainty signals. UCAS operates in two stages: it first modulates the response-level advantage using the model's overall self-confidence, and then applies a token-level penalty based on raw logit certainty. This dual mechanism encourages exploration of high-uncertainty paths that yield correct answers while penalizing overconfident yet erroneous reasoning, effectively balancing the exploration-exploitation trade-off. Extensive experiments on five mathematical reasoning benchmarks show that UCAS significantly outperforms strong RLVR baselines across multiple model scales, including 1.5B and 7B. Our analysis confirms that UCAS not only achieves higher rewards but also promotes greater reasoning diversity and successfully mitigates entropy collapse.
- 中文摘要
具有可验证奖励的强化学习 (RLVR) 在增强大型语言模型 (LLM) 的推理能力方面显示出巨大的前景。然而,像 GRPO 这样的流行算法在序列中的所有代币上广播统一的优势信号。这种粗粒度的方法忽视了推理过程中不确定、高风险决策的关键作用,导致低效的探索和有据可查的熵崩溃问题。为了解决这个问题,我们引入了不确定性感知优势塑造 (UCAS),这是一种无模型的方法,通过利用模型的内部不确定性信号来细化信用分配。UCAS 分两个阶段运行:它首先使用模型的整体自信心来调节响应水平优势,然后根据原始 logit 确定性应用标记级惩罚。这种双重机制鼓励探索高不确定性路径,从而产生正确答案,同时惩罚过度自信但错误的推理,有效地平衡了探索与利用的权衡。在五个数学推理基准上的广泛实验表明,UCAS 在包括 1.5B 和 7B 在内的多个模型尺度上明显优于强 RLVR 基线。我们的分析证实,UCAS 不仅获得了更高的回报,而且促进了更大的推理多样性,并成功地减轻了熵崩溃。
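One simple way to read a per-token uncertainty signal off a model's raw logits is the entropy of the softmax distribution, sketched below. This is a stand-in chosen for illustration: the abstract says UCAS uses the model's self-confidence and "raw logit certainty" but does not define either, so the entropy measure and the function name are assumptions.

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over the vocabulary.

    High entropy means the model was uncertain at this token position.
    """
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

A uniform distribution over four tokens gives the maximum entropy of ln 4, while a sharply peaked distribution gives a value near zero. A signal of this kind varies from token to token, unlike the single sequence-level advantage that the abstract says GRPO broadcasts to every token.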
RePro: Training Language Models to Faithfully Recycle the Web for Pretraining
RePro:训练语言模型以忠实地回收 Web 进行预训练
- Authors: Zichun Yu, Chenyan Xiong
- Subjects:
Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2510.10681
- Pdf link: https://arxiv.org/pdf/2510.10681
- Abstract
High-quality pretraining data is the fossil fuel of large language models (LLMs), yet its reserves are running low for frontier models. In this paper, we introduce RePro, a novel web recycling method that trains a relatively small LM with reinforcement learning to generate effective and faithful rephrasings of pretraining data. Specifically, we design one quality reward and three faithfulness rewards, optimizing the LM rephraser to convert organic data into high-quality rephrasings while maintaining its core semantics and structure. In our experiment, we train a 4B rephraser to recycle 72B tokens sampled from DCLM-RefinedWeb. Pretraining results on 400M and 1.4B models demonstrate that RePro delivers 4.7%-14.0% relative accuracy gains over the organic-only baseline on 22 downstream tasks. RePro also outperforms ReWire, the state-of-the-art web recycling method that prompts a 70B rephraser, as well as the organic baseline with a 4x larger data pool. Experiments with different amounts of recycled data highlight that RePro improves organic data efficiency by 2-3x. Individual and distributional analyses validate that RePro preserves more critical information and more faithfully reflects the characteristics of organic data than prompting-based methods. Together, these results show that RePro provides an efficient and controllable path to effectively harness the fossil fuel of LLM pretraining. We open-source our code, rephraser, and recycled data at this https URL.
- 中文摘要
高质量的预训练数据是大型语言模型(LLM)的化石燃料,但其对前沿模型的储备却不足。在本文中,我们介绍了 RePro,这是一种新型的网络回收方法,它通过强化学习训练相对较小的 LM,以生成有效且忠实的预训练数据改写。具体来说,我们设计了一个质量奖励和三个忠实度奖励,优化了 LM 改写器,将有机数据转换为高质量的改写,同时保持其核心语义和结构。在我们的实验中,我们训练一个 4B 改写器来回收从 DCLM-RefinedWeb 采样的 72B 标记。400M 和 1.4B 模型的预训练结果表明,RePro 在 22 个下游任务上比纯有机基线提供了 4.7%-14.0% 的相对准确度提升。RePro 的性能也优于 ReWire,这是一种最先进的网络回收方法,可提示 70B 改写器,以及具有 4 倍数据池的有机基线。不同数量的回收数据的实验强调,RePro 将有机数据效率提高了 2-3 倍。与基于提示的方法相比,单个分析和分布分析验证了 RePro 保留了更多关键信息,并忠实地反映了有机数据的特征。这些结果共同表明,RePro提供了一条高效且可控的路径来有效利用LLM预训练的化石燃料。我们在此 https URL 上开源我们的代码、改写器和回收数据。
Digital Twin-enabled Multi-generation Control Co-Design with Deep Reinforcement Learning
支持数字孪生的多代控制协同设计与深度强化学习
- Authors: Ying-Kuan Tsai, Vispi Karkaria, Yi-Ping Chen, Wei Chen
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2510.10694
- Pdf link: https://arxiv.org/pdf/2510.10694
- Abstract
Control Co-Design (CCD) integrates physical and control system design to improve the performance of dynamic and autonomous systems. Despite advances in uncertainty-aware CCD methods, real-world uncertainties remain highly unpredictable. Multi-generation design addresses this challenge by considering the full lifecycle of a product: data collected from each generation informs the design of subsequent generations, enabling progressive improvements in robustness and efficiency. Digital Twin (DT) technology further strengthens this paradigm by creating virtual representations that evolve over the lifecycle through real-time sensing, model updating, and adaptive re-optimization. This paper presents a DT-enabled CCD framework that integrates Deep Reinforcement Learning (DRL) to jointly optimize physical design and controller. DRL accelerates real-time decision-making by allowing controllers to continuously learn from data and adapt to uncertain environments. Extending this approach, the framework employs a multi-generation paradigm, where each cycle of deployment, operation, and redesign uses collected data to refine DT models, improve uncertainty quantification through quantile regression, and inform next-generation designs of both physical components and controllers. The framework is demonstrated on an active suspension system, where DT-enabled learning from road conditions and driving behaviors yields smoother and more stable control trajectories. Results show that the method significantly enhances dynamic performance, robustness, and efficiency. Contributions of this work include: (1) extending CCD into a lifecycle-oriented multi-generation framework, (2) leveraging DTs for continuous model updating and informed design, and (3) employing DRL to accelerate adaptive real-time decision-making.
- 中文摘要
控制协同设计(CCD)集成了物理和控制系统设计,以提高动态和自主系统的性能。尽管不确定性感知CCD方法取得了进展,但现实世界的不确定性仍然高度不可预测。多代设计通过考虑产品的整个生命周期来应对这一挑战:从每一代产品收集的数据为后续几代的设计提供信息,从而逐步提高稳健性和效率。数字孪生 (DT) 技术通过实时传感、模型更新和自适应重新优化创建在整个生命周期中演变的虚拟表示,进一步加强了这一范式。本文提出了一个支持DT的CCD框架,该框架集成了深度强化学习(DRL),以联合优化物理设计和控制器。DRL 允许控制器不断从数据中学习并适应不确定的环境,从而加速实时决策。该框架扩展了这种方法,采用了多代范式,其中部署、运营和重新设计的每个周期都使用收集到的数据来完善 DT 模型,通过分位数回归改进不确定性量化,并为物理组件和控制器的下一代设计提供信息。该框架在主动悬架系统上进行了演示,其中支持 DT 的从路况和驾驶行为中学习可产生更平稳、更稳定的控制轨迹。结果表明,该方法显著提高了动态性能、鲁棒性和效率。这项工作的贡献包括:(1)将CCD扩展到面向生命周期的多代框架中,(2)利用DT进行持续的模型更新和知情设计,以及(3)采用DRL加速自适应实时决策。
Understanding Sampler Stochasticity in Training Diffusion Models for RLHF
了解 RLHF 训练扩散模型中的采样器随机性
- Authors: Jiayuan Sheng, Hanyang Zhao, Haoxian Chen, David D. Yao, Wenpin Tang
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
- Arxiv link: https://arxiv.org/abs/2510.10767
- Pdf link: https://arxiv.org/pdf/2510.10767
- Abstract
Reinforcement Learning from Human Feedback (RLHF) is increasingly used to fine-tune diffusion models, but a key challenge arises from the mismatch between stochastic samplers used during training and deterministic samplers used during inference. In practice, models are fine-tuned using stochastic SDE samplers to encourage exploration, while inference typically relies on deterministic ODE samplers for efficiency and stability. This discrepancy induces a reward gap, raising concerns about whether high-quality outputs can be expected during inference. In this paper, we theoretically characterize this reward gap and provide non-vacuous bounds for general diffusion models, along with sharper convergence rates for Variance Exploding (VE) and Variance Preserving (VP) Gaussian models. Methodologically, we adopt the generalized denoising diffusion implicit models (gDDIM) framework to support arbitrarily high levels of stochasticity, preserving data marginals throughout. Empirically, our findings through large-scale experiments on text-to-image models using denoising diffusion policy optimization (DDPO) and mixed group relative policy optimization (MixGRPO) validate that reward gaps consistently narrow over training, and ODE sampling quality improves when models are updated using higher-stochasticity SDE training.
- 中文摘要
人类反馈强化学习 (RLHF) 越来越多地用于微调扩散模型,但一个关键挑战是训练期间使用的随机采样器和推理期间使用的确定性采样器之间的不匹配。在实践中,使用随机 SDE 采样器对模型进行微调以鼓励探索,而推理通常依赖于确定性 ODE 采样器来提高效率和稳定性。这种差异导致了奖励差距,引发了人们对推理过程中是否可以预期高质量输出的担忧。在本文中,我们从理论上表征了这种奖励差距,并为一般扩散模型提供了非空边界,并为方差爆炸 (VE) 和方差保持 (VP) 高斯模型提供了更尖锐的收敛率。在方法论上,我们采用广义去噪扩散隐式模型(gDDIM)框架来支持任意高水平的随机性,并始终保留数据边缘。根据经验,我们通过使用去噪扩散策略优化(DDPO)和混合组相对策略优化(MixGRPO)对文本到图像模型进行大规模实验的结果验证了奖励差距在训练过程中持续缩小,并且当使用更高随机性的SDE训练更新模型时,ODE采样质量会提高。
LLM-Empowered Agentic MAC Protocols: A Dynamic Stackelberg Game Approach
LLM 赋能的代理 MAC 协议:动态 Stackelberg 博弈方法
- Authors: Renxuan Tan, Rongpeng Li, Fei Wang, Chenghui Peng, Shaoyun Wu, Zhifeng Zhao, Honggang Zhang
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.10895
- Pdf link: https://arxiv.org/pdf/2510.10895
- Abstract
Medium Access Control (MAC) protocols, essential for wireless networks, are typically manually configured. While deep reinforcement learning (DRL)-based protocols enhance task-specified network performance, they suffer from poor generalizability and resilience, demanding costly retraining to adapt to dynamic environments. To overcome this limitation, we introduce a game-theoretic LLM-empowered multi-agent DRL (MARL) framework, in which the uplink transmission between a base station and a varying number of user equipments is modeled as a dynamic multi-follower Stackelberg game (MFSG), capturing the network's natural hierarchical structure. Within this game, LLM-driven agents, coordinated through proximal policy optimization (PPO), synthesize adaptive, semantic MAC protocols in response to network dynamics. Protocol action grammar (PAG) is employed to ensure the reliability and efficiency of this process. Under this system, we further analyze the existence and convergence behavior in terms of a Stackelberg equilibrium by studying the learning dynamics of LLM-empowered unified policies in response to changing followers. Simulations corroborate that our framework achieves a 77.6% greater throughput and a 65.2% fairness improvement over conventional baselines. Besides, our framework generalizes excellently to a fluctuating number of users without requiring retraining or architectural changes.
- 中文摘要
介质访问控制 (MAC) 协议对于无线网络至关重要,通常是手动配置的。虽然基于深度强化学习 (DRL) 的协议增强了任务指定的网络性能,但它们的泛化性和弹性较差,需要昂贵的重新训练来适应动态环境。为了克服这一限制,我们引入了博弈论 LLM 赋能的多代理 DRL (MARL) 框架,其中基站和不同数量的用户设备之间的上行链路传输被建模为动态多跟随者 Stackelberg 博弈 (MFSG),捕获网络的自然层次结构。在该博弈中,LLM 驱动的代理通过近端策略优化 (PPO) 进行协调,根据网络动态合成自适应语义 MAC 协议。采用协议动作语法 (PAG) 来确保该过程的可靠性和效率。在这个系统下,我们通过研究LLM赋能的统一策略响应不断变化的跟随者的学习动态,进一步分析了Stackelberg均衡的存在性和收敛行为。模拟证实,与传统基线相比,我们的框架实现了 77.6% 的吞吐量提高和 65.2% 的公平性改进。此外,我们的框架可以很好地推广到波动的用户数量,而无需重新训练或更改架构。
PoU: Proof-of-Use to Counter Tool-Call Hacking in DeepResearch Agents
PoU:用于对抗 DeepResearch 代理中工具调用黑客攻击的使用证明
- Authors: SHengjie Ma, Chenlong Deng, Jiaxin Mao, Jiadeng Huang, Teng Wang, Junjie Wu, Changwang Zhang, Jun wang
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.10931
- Pdf link: https://arxiv.org/pdf/2510.10931
- Abstract
Retrieval-augmented generation (RAG) agents, such as recent DeepResearch-style systems, extend large language models (LLMs) with autonomous information-seeking capabilities through external tools. While reinforcement learning (RL) has enabled impressive multi-step reasoning, we identify a previously overlooked failure mode, Tool-Call Hacking, where agents inflate reward signals by issuing superficially correct tool calls without genuinely leveraging the retrieved evidence. This results in (i) mode collapse into repetitive reliance on a single source and (ii) spurious grounding, where answers are only weakly supported by cited content. To address this, we propose Proof-of-Use (PoU), an evidence-grounded RL framework that enforces verifiable causal links between retrieved evidence, reasoning traces, and final answers. PoU operationalizes this through a unified step-wise contract combining syntactic citation validation, perturbation-based sensitivity rewards, and answer-evidence alignment objectives, ensuring that tool usage remains both interpretable and functionally grounded. Across seven QA benchmarks spanning in-domain, out-of-domain, and out-of-tool-distribution settings, PoU consistently outperforms strong DeepResearch baselines in factual accuracy, evidence faithfulness, and tool-routing balance. These findings highlight the necessity of grounding RL-trained agents not merely in task outcomes but in the causal use of retrieved information, offering a principled path toward trustworthy retrieval-augmented reasoning.
- 中文摘要
检索增强生成 (RAG) 代理,例如最近的 DeepResearch 风格系统,通过外部工具扩展具有自主信息搜索能力的大型语言模型 (LLM)。虽然强化学习 (RL) 实现了令人印象深刻的多步骤推理,但我们发现了一种以前被忽视的失败模式,即工具调用黑客攻击,其中代理通过发出表面正确的工具调用来夸大奖励信号,而没有真正利用检索到的证据。这导致 (i) 模式崩溃为对单一来源的重复依赖和 (ii) 虚假基础,其中引用内容的答案支持很弱。为了解决这个问题,我们提出了使用证明 (PoU),这是一种基于证据的 RL 框架,可在检索到的证据、推理跟踪和最终答案之间强制执行可验证的因果关系。PoU 通过结合句法引用验证、基于扰动的敏感性奖励和答案证据对齐目标的统一逐步契约来实施这一点,确保工具的使用保持可解释性和功能基础。在涵盖域内、域外和工具外分布设置的七个 QA 基准中,PoU 在事实准确性、证据忠实度和工具路由平衡方面始终优于 DeepResearch 的强大基线。这些发现强调了不仅要将 RL 训练的代理扎根于任务结果,还要建立在检索信息的因果使用中,从而为可信的检索增强推理提供一条原则性途径。
Neutral Agent-based Adversarial Policy Learning against Deep Reinforcement Learning in Multi-party Open Systems
多方开放系统中针对深度强化学习的基于中立代理的对抗性策略学习
- Authors: Qizhou Peng, Yang Zheng, Yu Wen, Yanna Wu, Yingying Du
- Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
- Arxiv link: https://arxiv.org/abs/2510.10937
- Pdf link: https://arxiv.org/pdf/2510.10937
- Abstract
Reinforcement learning (RL) has been an important machine learning paradigm for solving long-horizon sequential decision-making problems under uncertainty. By integrating deep neural networks (DNNs) into the RL framework, deep reinforcement learning (DRL) has emerged and achieved significant success in various domains. However, the integration of DNNs also makes it vulnerable to adversarial attacks. Existing adversarial attack techniques mainly focus on either directly manipulating the environment with which a victim agent interacts or deploying an adversarial agent that interacts with the victim agent to induce abnormal behaviors. While these techniques achieve promising results, their adoption in multi-party open systems remains limited for two major reasons: the impractical assumption of full control over the environment and the dependence on interactions with victim agents. To enable adversarial attacks in multi-party open systems, in this paper we redesign an adversarial policy learning approach that can mislead well-trained victim agents without requiring direct interactions with these agents or full control over their environments. In particular, we propose a neutral agent-based approach across various task scenarios in multi-party open systems. Although the neutral agents are seemingly detached from the victim agents, they indirectly influence them through the shared environment. We evaluate our proposed method on the SMAC platform based on StarCraft II and the autonomous driving simulation platform Highway-env. The experimental results demonstrate that our method can launch general and effective adversarial attacks in multi-party open systems.
- 中文摘要
强化学习(Reinforcement Learning,RL)一直是解决不确定性下长期顺序决策问题的重要机器学习范式。通过将深度神经网络(DNN)集成到RL框架中,深度强化学习(DRL)应运而生,并在各个领域取得了重大成功。然而,DNN 的集成也使其容易受到对抗性攻击。现有的对抗性攻击技术主要侧重于直接操纵受害代理与之交互的环境,或部署与受害代理交互的对抗代理以诱发异常行为。虽然这些技术取得了可喜的结果,但由于两个主要原因,它们在多方开放系统中的采用仍然有限:完全控制环境的假设不切实际,并且依赖于与受害代理的交互。为了在多方开放系统中实现对抗性攻击,在本文中,我们重新设计了一种对抗性策略学习方法,该方法可以误导训练有素的受害代理,而无需与这些代理直接交互或完全控制其环境。特别是,我们提出了一种基于中立代理的方法,适用于多方开放系统中的各种任务场景。虽然中立代理看似与受害代理无关,但它们会通过共享环境间接影响受害代理。我们在基于星际争霸II的SMAC平台和自动驾驶仿真平台Highway-env上评估了所提方法。实验结果表明,该方法能够在多方开放系统中发起通用有效的对抗性攻击。
Rediscovering Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning
重新发现熵正则化:自适应系数释放其在 LLM 强化学习中的潜力
- Authors: Xiaoyun Zhang, Xiaojian Yuan, Di Huang, Wang You, Chen Hu, Jingqing Ruan, Kejiang Chen, Xing Hu
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
- Arxiv link: https://arxiv.org/abs/2510.10959
- Pdf link: https://arxiv.org/pdf/2510.10959
- Abstract
Reasoning ability has become a defining capability of Large Language Models (LLMs), with Reinforcement Learning with Verifiable Rewards (RLVR) emerging as a key paradigm to enhance it. However, RLVR training often suffers from policy entropy collapse, where the policy becomes overly deterministic, hindering exploration and limiting reasoning performance. While entropy regularization is a common remedy, its effectiveness is highly sensitive to the fixed coefficient, making it unstable across tasks and models. In this work, we revisit entropy regularization in RLVR and argue that its potential has been largely underestimated. Our analysis shows that (i) tasks of varying difficulty demand distinct exploration intensities, and (ii) balanced exploration may require the policy entropy to be maintained within a moderate range below its initial level. Therefore, we propose Adaptive Entropy Regularization (AER)--a framework that dynamically balances exploration and exploitation via three components: difficulty-aware coefficient allocation, initial-anchored target entropy, and dynamic global coefficient adjustment. Experiments on multiple mathematical reasoning benchmarks show that AER consistently outperforms baselines, improving both reasoning accuracy and exploration capability.
- 中文摘要
推理能力已成为大型语言模型 (LLM) 的标志性能力,具有可验证奖励的强化学习 (RLVR) 成为增强推理能力的关键范式。然而,RLVR 训练经常受到策略熵崩溃的影响,即策略变得过于确定性,阻碍探索并限制推理性能。虽然熵正则化是一种常见的补救措施,但其有效性对固定系数高度敏感,使其在任务和模型之间不稳定。在这项工作中,我们重新审视了 RLVR 中的熵正则化,并认为其潜力在很大程度上被低估了。我们的分析表明,(i)不同难度的任务需要不同的探索强度,(ii)平衡的探索可能需要将策略熵维持在低于其初始水平的适度范围内。因此,我们提出了自适应熵正则化(AER),这是一个通过难度感知系数分配、初始锚定目标熵和动态全局系数调整三个组件动态平衡探索和利用的框架。在多个数学推理基准上的实验表明,AER 始终优于基线,提高了推理准确性和探索能力。
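The three AER components named in the abstract can be illustrated with a small sketch. This is not the authors' formulation: the target ratio, the fixed step size, and the direction of the difficulty scaling (harder prompts receive a larger coefficient) are all assumptions made here for illustration, and the function names are hypothetical.

```python
def target_entropy(initial_entropy, ratio=0.5):
    """Anchor the target to a fraction of the entropy measured at the start
    of training (the ratio is an assumed hyperparameter)."""
    return ratio * initial_entropy


def update_global_coef(coef, entropy, target, step=0.01, max_coef=1.0):
    """Nudge the global entropy coefficient toward the target entropy.

    Entropy below target -> raise the coefficient (more exploration);
    entropy above target -> lower it. The fixed step is an assumption.
    """
    if entropy < target:
        coef += step
    elif entropy > target:
        coef -= step
    return min(max_coef, max(0.0, coef))


def prompt_coef(global_coef, pass_rate):
    """Hypothetical difficulty-aware allocation: prompts with a lower
    pass rate (harder prompts) get a larger share of the coefficient."""
    return global_coef * (1.0 - pass_rate)


# Example: entropy has collapsed below the anchored target, so the
# coefficient is increased and then split by prompt difficulty.
target = target_entropy(initial_entropy=2.0, ratio=0.5)
coef = update_global_coef(coef=0.10, entropy=0.2, target=target)
hard, easy = prompt_coef(coef, pass_rate=0.1), prompt_coef(coef, pass_rate=0.9)
```

In a training loop the per-prompt coefficient would multiply the entropy bonus in the policy loss; how the paper actually maps difficulty to coefficients is not stated in the abstract.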
Game-Theoretic Risk-Shaped Reinforcement Learning for Safe Autonomous Driving
面向安全自动驾驶的博弈论风险塑形强化学习
- Authors: Dong Hu, Fenqing Hu, Lidong Yang, Chao Huang
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2510.10960
- Pdf link: https://arxiv.org/pdf/2510.10960
- Abstract
Ensuring safety in autonomous driving (AD) remains a significant challenge, especially in highly dynamic and complex traffic environments where diverse agents interact and unexpected hazards frequently emerge. Traditional reinforcement learning (RL) methods often struggle to balance safety, efficiency, and adaptability, as they primarily focus on reward maximization without explicitly modeling risk or safety constraints. To address these limitations, this study proposes a novel game-theoretic risk-shaped RL (GTR2L) framework for safe AD. GTR2L incorporates a multi-level game-theoretic world model that jointly predicts the interactive behaviors of surrounding vehicles and their associated risks, along with an adaptive rollout horizon that adjusts dynamically based on predictive uncertainty. Furthermore, an uncertainty-aware barrier mechanism enables flexible modulation of safety boundaries. A dedicated risk modeling approach is also proposed, explicitly capturing both epistemic and aleatoric uncertainty to guide constrained policy optimization and enhance decision-making in complex environments. Extensive evaluations across diverse and safety-critical traffic scenarios show that GTR2L significantly outperforms state-of-the-art baselines, including human drivers, in terms of success rate, collision and violation reduction, and driving efficiency. The code is available at this https URL.
- 中文摘要
确保自动驾驶 (AD) 的安全仍然是一项重大挑战,尤其是在高度动态和复杂的交通环境中,各种主体相互作用且意外危险经常出现。传统的强化学习 (RL) 方法通常难以平衡安全性、效率和适应性,因为它们主要关注奖励最大化,而没有明确对风险或安全约束进行建模。为了解决这些局限性,本研究提出了一种用于安全 AD 的新型博弈论风险形状 RL (GTR2L) 框架。GTR2L 结合了一个多层次的博弈论世界模型,该模型共同预测周围车辆的交互行为及其相关风险,以及根据预测不确定性动态调整的自适应推出视野。此外,不确定性感知屏障机制可以灵活调节安全边界。还提出了一种专门的风险建模方法,明确地捕获认识和偶然的不确定性,以指导受约束的政策优化并增强复杂环境中的决策。对各种安全关键交通场景的广泛评估表明,GTR2L 在成功率、减少碰撞和违规以及驾驶效率方面明显优于包括人类驾驶员在内的最先进的基线。该代码可在此 https URL 中找到。
APLOT: Robust Reward Modeling via Adaptive Preference Learning with Optimal Transport
APLOT:通过具有最佳传输的自适应偏好学习进行鲁棒奖励建模
- Authors: Zhuo Li, Yuege Feng, Dandan Guo, Jinpeng Hu, Anningzhe Gao, Xiang Wan
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.10963
- Pdf link: https://arxiv.org/pdf/2510.10963
- Abstract
The reward model (RM) plays a crucial role in aligning Large Language Models (LLMs) with human preferences through Reinforcement Learning, where the Bradley-Terry (BT) objective has been recognized as simple yet powerful, specifically for pairwise preference learning. However, BT-based RMs often struggle to effectively distinguish between similar preference responses, leading to insufficient separation between preferred and non-preferred outputs. Consequently, they may easily overfit easy samples and cannot generalize well to Out-Of-Distribution (OOD) samples, resulting in suboptimal performance. To address these challenges, this paper introduces an effective enhancement to BT-based RMs through an adaptive margin mechanism. Specifically, we use margins to dynamically shift the RM's focus toward more challenging samples, based on both semantic similarity and model-predicted reward differences; we approach this from a distributional perspective that is solvable with Optimal Transport (OT). By incorporating these factors into a principled OT cost matrix design, our adaptive margin enables the RM to better capture distributional differences between chosen and rejected responses, yielding significant improvements in performance, convergence speed, and generalization capabilities. Experimental results across multiple benchmarks demonstrate that our method outperforms several existing RM techniques, showcasing enhanced performance in both In-Distribution (ID) and OOD settings. Moreover, RLHF experiments support our practical effectiveness in better aligning LLMs with human preferences. Our code is available at this https URL.
- 中文摘要
奖励模型 (RM) 在通过强化学习使大型语言模型 (LLM) 与人类偏好保持一致方面发挥着至关重要的作用,其中 Bradley-Terry (BT) 目标被认为是简单而强大的,特别是对于成对偏好学习。然而,基于 BT 的 RM 通常难以有效区分相似的偏好响应,导致首选和非首选输出之间的分离不足。因此,它们可能很容易对简单的样本进行过拟合,并且无法很好地推广到分布外 (OOD) 样本,从而导致性能不佳。为了应对这些挑战,本文通过自适应裕度机制对基于BT的RM进行了有效的增强。具体来说,我们设计根据语义相似性和模型预测的奖励差异,通过边距动态调整 RM 对更具挑战性的样本的关注,这是从可通过最佳传输 (OT) 解决的分布角度来处理的。通过将这些因素纳入有原则的 OT 成本矩阵设计中,我们的自适应裕度使 RM 能够更好地捕获选择响应和拒绝响应之间的分布差异,从而显着提高性能、收敛速度和泛化能力。跨多个基准测试的实验结果表明,我们的方法优于几种现有的 RM 技术,在分布内 (ID) 和 OOD 设置中都表现出增强的性能。此外,RLHF 实验支持我们在更好地使 LLM 与人类偏好保持一致方面的实际有效性。我们的代码可在此 https URL 中找到
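To make the role of a margin in the Bradley-Terry objective concrete, here is a minimal sketch of the pairwise loss with a per-pair margin. The margin is simply passed in as a number; how APLOT derives it from an Optimal Transport plan is not reproduced here, and the function name is hypothetical.

```python
import math


def bt_margin_loss(r_chosen, r_rejected, margin=0.0):
    """Bradley-Terry pairwise loss with an additive margin.

    loss = -log(sigmoid(r_chosen - r_rejected - margin)),
    computed as softplus(-x) in a numerically stable form.
    """
    x = r_chosen - r_rejected - margin
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# With no margin, a pair separated by 0.5 already has a below-log(2) loss;
# a margin of 0.5 asks for more separation and raises the loss to log(2).
plain = bt_margin_loss(1.0, 0.5)
with_margin = bt_margin_loss(1.0, 0.5, margin=0.5)
```

A larger margin on a pair keeps its gradient alive until the reward gap exceeds that margin, which is one way a reward model can be pushed to separate similar responses more strongly.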
RV-HATE: Reinforced Multi-Module Voting for Implicit Hate Speech Detection
RV-HATE:用于隐性仇恨言论检测的强化多模块投票
- Authors: Yejin Lee, Hyeseon Ahn, Yo-Sub Han
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.10971
- Pdf link: https://arxiv.org/pdf/2510.10971
- Abstract
Hate speech remains prevalent in human society and continues to evolve in its forms and expressions. Modern advancements in internet and online anonymity accelerate its rapid spread and complicate its detection. However, hate speech datasets exhibit diverse characteristics primarily because they are constructed from different sources and platforms, each reflecting different linguistic styles and social contexts. Despite this diversity, prior studies on hate speech detection often rely on fixed methodologies without adapting to data-specific features. We introduce RV-HATE, a detection framework designed to account for the dataset-specific characteristics of each hate speech dataset. RV-HATE consists of multiple specialized modules, where each module focuses on distinct linguistic or contextual features of hate speech. The framework employs reinforcement learning to optimize weights that determine the contribution of each module for a given dataset. A voting mechanism then aggregates the module outputs to produce the final decision. RV-HATE offers two primary advantages: (1) it improves detection accuracy by tailoring the detection process to dataset-specific attributes, and (2) it also provides interpretable insights into the distinctive features of each dataset. Consequently, our approach effectively addresses implicit hate speech and achieves superior performance compared to conventional static methods. Our code is available at this https URL.
- 中文摘要
仇恨言论在人类社会中仍然普遍存在,并且其形式和表达方式仍在不断演变。互联网和在线匿名的现代进步加速了其快速传播,并使其检测变得复杂。然而,仇恨言论数据集表现出不同的特征,主要是因为它们是由不同的来源和平台构建的,每个来源和平台都反映了不同的语言风格和社会背景。尽管存在这种多样性,但先前关于仇恨言论检测的研究通常依赖于固定的方法,而不适应特定于数据的特征。我们介绍了 RV-HATE,这是一个检测框架,旨在考虑每个仇恨言论数据集的数据集特定特征。RV-HATE 由多个专业模块组成,其中每个模块都侧重于仇恨言论的不同语言或上下文特征。该框架采用强化学习来优化权重,从而确定每个模块对给定数据集的贡献。然后,投票机制聚合模块输出以产生最终决策。RV-HATE 具有两个主要优势:(1) 它通过根据数据集特定属性定制检测过程来提高检测准确性,以及 (2) 它还提供了对每个数据集独特特征的可解释见解。因此,与传统的静态方法相比,我们的方法有效地解决了隐性仇恨言论并取得了卓越的性能。我们的代码可在此 https URL 中找到。
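The aggregation step described in the abstract can be pictured as weighted soft voting over module outputs. The sketch below assumes each module emits a probability that the text is hateful and that the weights are already given; in RV-HATE the weights are optimized with reinforcement learning, which is not shown, and the exact voting rule used by the authors may differ.

```python
def weighted_vote(module_probs, weights, threshold=0.5):
    """Combine per-module hate probabilities with per-module weights.

    Returns the weighted average score and the binary decision.
    The threshold and the soft-voting rule are illustrative assumptions.
    """
    if len(module_probs) != len(weights):
        raise ValueError("one weight per module is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    score = sum(p * w for p, w in zip(module_probs, weights)) / total
    return score, score >= threshold


# Three hypothetical modules; the first is trusted most for this dataset.
score, is_hate = weighted_vote([0.9, 0.2, 0.4], [0.6, 0.2, 0.2])
```

Because the weights are explicit per dataset, inspecting them gives a rough view of which module matters most for that dataset, which matches the interpretability claim in the abstract.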
Enhancing Large Language Model Reasoning via Selective Critical Token Fine-Tuning
通过选择性关键标记微调增强大型语言模型推理
- Authors: Zhiwen Ruan, Yixia Li, He Zhu, Yun Chen, Peng Li, Yang Liu, Guanhua Chen
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2510.10974
- Pdf link: https://arxiv.org/pdf/2510.10974
- Abstract
Large language models (LLMs) primarily rely on supervised fine-tuning (SFT) as a key method to adapt pre-trained models to domain-specific tasks such as mathematical reasoning. However, standard SFT uniformly penalizes all tokens, neglecting that only a small subset of critical tokens determines reasoning correctness. This uniform supervision often causes reduced output diversity and limited generalization. We propose Critical Token Fine-tuning (CFT), a simple yet effective approach that updates only tokens identified as functionally indispensable via counterfactual perturbations. By focusing gradient signals on these decisive reasoning steps while preserving the diversity of non-critical tokens, CFT can enhance both generation and diversity. Extensive experiments on five models across three families (Qwen, OLMo, LLaMA) and eleven mathematical reasoning benchmarks show that CFT, despite fine-tuning on less than 12% of tokens, consistently outperforms standard SFT. Moreover, CFT enables test-time scaling through improved sampling diversity and provides a stronger initialization for reinforcement learning, sustaining performance gains in later training stages while maintaining higher entropy for better exploration. These results highlight CFT as a practical and general framework for efficient and robust LLM fine-tuning.
- 中文摘要
大型语言模型(LLM)主要依靠监督微调(SFT)作为使预训练模型适应特定领域任务(例如数学推理)的关键方法。然而,标准 SFT 统一惩罚所有标记,忽略了只有一小部分关键标记决定推理正确性。这种统一的监督通常会导致输出多样性减少和泛化有限。我们提出了关键标记微调 (CFT),这是一种简单而有效的方法,它仅更新通过反事实扰动确定为功能上不可或缺的标记。通过将梯度信号集中在这些决定性的推理步骤上,同时保留非关键标记的多样性,CFT 可以增强生成和多样性。对三个系列(Qwen、OLMo、LLaMA)的五个模型和 11 个数学推理基准的广泛实验表明,尽管对不到 12% 的标记进行了微调,但 CFT 的性能始终优于标准 SFT。此外,CFT 通过改进采样多样性实现测试时间缩放,并为强化学习提供更强的初始化,在后期训练阶段保持性能提升,同时保持更高的熵以实现更好的探索。这些结果凸显了 CFT 作为高效、稳健的 LLM 微调的实用和通用框架。
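The difference between standard SFT and token-selective fine-tuning comes down to which positions contribute to the loss. The sketch below assumes per-token log-probabilities and a boolean mask of critical tokens are already available; how CFT identifies critical tokens through counterfactual perturbations is not implemented, and the function name is hypothetical.

```python
def masked_nll(token_logprobs, critical_mask):
    """Average negative log-likelihood over tokens flagged as critical.

    Tokens with a False mask entry contribute nothing to the loss, so they
    receive no gradient from it. Returns 0.0 when no token is selected.
    """
    selected = [-lp for lp, keep in zip(token_logprobs, critical_mask) if keep]
    if not selected:
        return 0.0
    return sum(selected) / len(selected)


logprobs = [-1.0, -2.0, -3.0]
uniform_loss = masked_nll(logprobs, [True, True, True])    # standard SFT
critical_loss = masked_nll(logprobs, [True, False, True])  # selective
```

In a real training setup the same mask would be applied to the per-token cross-entropy tensor before reduction.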
Video-STR: Reinforcing MLLMs in Video Spatio-Temporal Reasoning with Relation Graph
Video-STR:通过关系图强化视频时空推理中的MLLM
- Authors: Wentao Wang, Heqing Zou, Tianze Luo, Rui Huang, Yutian Zhao, Zhuochen Wang, Hansheng Zhang, Chengwei Qin, Yan Wang, Lin Zhao, Huaijian Zhang
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.10976
- Pdf link: https://arxiv.org/pdf/2510.10976
- Abstract
Multimodal Large Language Models (MLLMs) have recently demonstrated strong semantic understanding capabilities, but they still struggle to perform precise spatio-temporal understanding. Existing spatio-temporal methods primarily focus on the video itself, while overlooking the physical information within the video, such as multi-object layouts and motion. Such limitations restrict the use of MLLMs in downstream applications that demand high precision, including embodied intelligence and VR. To address this issue, we present Video-STR, a novel graph-based reinforcement method for precise Video Spatio-Temporal Reasoning. Building upon the capacity of Reinforcement Learning with Verifiable Reward (RLVR) to improve model abilities, we introduce a reasoning mechanism that uses a graph-based Group Relative Policy Optimization (GRPO) method to guide the model in inferring the underlying spatio-temporal topology of scenarios during the thinking process. To resolve the lack of spatio-temporal training data, we construct the STV-205k dataset with 205k question-answering pairs, covering dynamic multi-object scenes in both indoor and outdoor environments, to support the model training. Experiments show that Video-STR achieves state-of-the-art results on various benchmarks, outperforming the base model by 13% on STI-Bench and demonstrating the effectiveness of our approach and dataset. Code, model, and data will be released.
- 中文摘要
多模态大型语言模型(MLLM)的最新进展已经证明了强大的语义理解能力,但难以进行精确的时空理解。现有的时空方法主要关注视频本身,而忽略了视频中的物理信息,如多对象布局和运动。这些限制限制了MLLM在需要高精度的下游应用中的使用,包括具身智能和VR。为了解决这个问题,我们提出了 Video-STR,这是一种基于图的新型强化方法,用于精确的视频时空推理。基于可验证奖励强化学习(RLVR)提高模型能力的能力,引入基于图的群体相对策略优化(GRPO)方法的推理机制,指导模型在思考过程中推断场景的底层时空拓扑。为了解决时空训练数据不足的问题,我们构建了具有205k问答对的STV-205k数据集,覆盖室内和室外环境中的动态多目标场景,以支持模型训练。实验表明,Video-STR 在各种基准测试中取得了最先进的结果,在 STI-Bench 上比基础模型高出 13%,并证明了我们的方法和数据集的有效性。将发布代码、模型和数据。
GeoVLMath: Enhancing Geometry Reasoning in Vision-Language Models via Cross-Modal Reward for Auxiliary Line Creation
GeoVLMath:通过辅助线创建的跨模态奖励增强视觉语言模型中的几何推理
- Authors: Shasha Guo, Liang Pang, Xi Wang, Yanling Wang, Huawei Shen, Jing Zhang
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.11020
- Pdf link: https://arxiv.org/pdf/2510.11020
- Abstract
Auxiliary lines are essential for solving complex geometric problems but remain challenging for large vision-language models (LVLMs). Rather than editing diagrams to draw auxiliary lines, which current image editing models struggle to render with geometric precision, we generate textual descriptions of auxiliary-line constructions to better align with the representational strengths of LVLMs. To bridge the gap between textual descriptions and spatial structure, we propose a reinforcement learning framework that enhances diagram-text alignment. At the core of our approach is a cross-modal reward that evaluates how well the generated auxiliary-line description for an original diagram matches a ground-truth auxiliary-line diagram. Built on this reward, we present GeoVLMath, an open-source LVLM tailored to auxiliary-line reasoning in solid geometry. This fine-grained signal drives a GRPO-based RL stage, yielding precise diagram-text alignment. To support training, we develop a scalable data creation pipeline and construct AuxSolidMath, a dataset of 3,018 real-exam geometry problems with paired diagrams and aligned textual fields. At the 3B and 7B scales, GeoVLMath achieves competitive and often superior performance compared with strong open-source and proprietary LVLMs on auxiliary-line reasoning benchmarks.
- 中文摘要
辅助线对于解决复杂的几何问题至关重要,但对于大型视觉语言模型 (LVLM) 来说仍然具有挑战性。我们不是编辑图表来绘制辅助线,而当前的图像编辑模型很难以几何精度渲染辅助线,而是生成辅助线结构的文本描述,以更好地与 LVLM 的表示优势保持一致。为了弥合文本描述和空间结构之间的差距,我们提出了一种增强图表-文本对齐的强化学习框架。我们方法的核心是跨模态奖励,用于评估为原始图表生成的辅助线描述与地面实况辅助线图的匹配程度。基于这一奖励,我们推出了 GeoVLMath,这是一个开源 LVLM,专为立体几何中的辅助线推理而定制。这种细粒度信号驱动基于 GRPO 的 RL 级,产生精确的图表-文本对齐。为了支持训练,我们开发了一个可扩展的数据创建管道,并构建了 AuxSolidMath,这是一个包含 3,018 个真实考试几何问题的数据集,其中包含配对图和对齐的文本字段。在 3B 和 7B 规模上,GeoVLMath 在辅助线推理基准测试上与强大的开源和专有 LVLM 相比,实现了具有竞争力且通常更优越的性能。
Unveiling Uncertainty-Aware Autonomous Cooperative Learning Based Planning Strategy
揭示不确定性感知的自主协同学习规划策略
- Authors: Shiyao Zhang, Liwei Deng, Shuyu Zhang, Weijie Yuan, Hong Zhang
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2510.11041
- Pdf link: https://arxiv.org/pdf/2510.11041
- Abstract
In future intelligent transportation systems, autonomous cooperative planning (ACP) is a promising technique for increasing the effectiveness and security of multi-vehicle interactions. However, existing ACP strategies cannot fully address multiple uncertainties, e.g., perception, planning, and communication uncertainties. To address these, a novel deep reinforcement learning-based autonomous cooperative planning (DRLACP) framework is proposed to tackle various uncertainties in cooperative motion planning schemes. Specifically, the soft actor-critic (SAC) algorithm, implemented with gated recurrent units (GRUs), is adopted to learn the deterministic optimal time-varying actions from imperfect state information caused by planning, communication, and perception uncertainties. In addition, the real-time actions of autonomous vehicles (AVs) are demonstrated via the Car Learning to Act (CARLA) simulation platform. Evaluation results show that the proposed DRLACP learns and performs cooperative planning effectively and outperforms other baseline methods under different scenarios with imperfect AV state information.
- 中文摘要
在未来的智能交通系统中,自主协同规划(ACP)成为一种有前途的技术,以提高多车交互的有效性和安全性。然而,现有 ACP 战略无法完全解决多种不确定性,例如感知、规划和沟通的不确定性。针对这些问题,提出了一种基于深度强化学习的自主合作规划(DRLACP)框架,以解决协同运动规划方案的各种不确定性。具体而言,采用实现门循环单元(GRU)的软行为者批评者(SAC)来学习计划、通信和感知不确定性产生的具有不完美状态信息的确定性最优时变动作。此外,自动驾驶汽车 (AV) 的实时动作也通过汽车学习行动 (CARLA) 模拟平台进行演示。评价结果表明,所提出的DRLACP能够有效地学习和执行协同规划,在AV状态信息不完善的不同场景下优于其他基线方法。
Stronger Together: On-Policy Reinforcement Learning for Collaborative LLMs
携手更强:面向协作式 LLM 的同策略(on-policy)强化学习
- Authors: Yujie Zhao, Lanxiang Hu, Yang Wang, Minmin Hou, Hao Zhang, Ke Ding, Jishen Zhao
- Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
- Arxiv link: https://arxiv.org/abs/2510.11062
- Pdf link: https://arxiv.org/pdf/2510.11062
- Abstract
Multi-agent systems (MAS) and reinforcement learning (RL) are widely used to enhance the agentic capabilities of large language models (LLMs). MAS improves task performance through role-based orchestration, while RL uses environmental rewards to learn stronger policies, such as GRPO-style optimization. However, applying on-policy RL to MAS remains underexplored and presents unique challenges. Algorithmically, standard GRPO grouping assumptions break down because prompts vary by role and by turn. System-wise, the training stack must support MAS-workflow rollouts and on-policy updates for both single-policy and multi-policy models. We propose AT-GRPO, which includes (i) an agent- and turn-wise grouped RL algorithm tailored to MAS and (ii) a training system that supports both single- and multi-policy regimes. Across game, planning, coding, and math tasks, AT-GRPO delivers substantial gains. On long-horizon planning, it increases accuracy from a 14.0 to 47.0 percent single-agent RL baseline to 96.0 to 99.5 percent. It also improves reasoning performance, with average gains of 3.87 to 7.62 percent on coding tasks and 9.0 to 17.93 percent on math. Code and environments are available at: this https URL.
- 中文摘要
多智能体系统(MAS)和强化学习(RL)被广泛用于增强大型语言模型(LLM)的智能体能力。MAS 通过基于角色的编排来提高任务性能,而 RL 则使用环境奖励来学习更强的策略,例如 GRPO 式优化。然而,将同策略(on-policy)RL 应用于 MAS 仍然没有得到充分探索,并带来了独特的挑战。从算法上讲,标准 GRPO 分组假设不再成立,因为提示因角色和轮次而异。在系统方面,训练堆栈必须支持单策略和多策略模型的 MAS 工作流 rollout 和同策略更新。我们提出了 AT-GRPO,其中包括 (i) 针对 MAS 量身定制的按智能体和轮次分组的 RL 算法,以及 (ii) 支持单策略和多策略机制的训练系统。在游戏、规划、编码和数学任务中,AT-GRPO 带来了巨大的收益。在长期规划中,它将准确率从单智能体 RL 基线的 14.0% 至 47.0% 提高到 96.0% 至 99.5%。它还提高了推理性能,编码任务的平均提升为 3.87% 至 7.62%,数学任务的平均提升为 9.0% 至 17.93%。代码和环境可在以下位置获得:此 https URL。
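The grouping problem mentioned in the abstract (prompts differ by role and by turn, so a single GRPO group no longer compares like with like) can be sketched by normalizing rewards within each (agent, turn) bucket. This is an illustrative reading, not the AT-GRPO algorithm itself: the dictionary keys, the use of population standard deviation, and the choice to group on agent and turn alone are assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def grouped_advantages(samples, eps=1e-8):
    """Group-relative advantages computed per (agent, turn) bucket.

    Each sample is a dict with hypothetical keys "agent", "turn", "reward".
    Within a bucket, advantage = (reward - mean) / (std + eps).
    """
    groups = defaultdict(list)
    for i, s in enumerate(samples):
        groups[(s["agent"], s["turn"])].append(i)

    advantages = [0.0] * len(samples)
    for indices in groups.values():
        rewards = [samples[i]["reward"] for i in indices]
        mu, sigma = mean(rewards), pstdev(rewards)
        for i in indices:
            advantages[i] = (samples[i]["reward"] - mu) / (sigma + eps)
    return advantages


rollouts = [
    {"agent": "coder", "turn": 0, "reward": 1.0},
    {"agent": "coder", "turn": 0, "reward": 0.0},
    {"agent": "tester", "turn": 0, "reward": 2.0},
    {"agent": "tester", "turn": 0, "reward": 2.0},
]
adv = grouped_advantages(rollouts)
```

Normalizing over all four rollouts at once would penalize both coder samples relative to the tester samples, even though the two roles answer different prompts; bucketing avoids that comparison.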
A Primer on SO(3) Action Representations in Deep Reinforcement Learning
深度强化学习中的SO(3)动作表示入门
- Authors: Martin Schuck, Sherif Samy, Angela P. Schoellig
- Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.11103
- Pdf link: https://arxiv.org/pdf/2510.11103
- Abstract
Many robotic control tasks require policies to act on orientations, yet the geometry of SO(3) makes this nontrivial. Because SO(3) admits no global, smooth, minimal parameterization, common representations such as Euler angles, quaternions, rotation matrices, and Lie algebra coordinates introduce distinct constraints and failure modes. While these trade-offs are well studied for supervised learning, their implications for actions in reinforcement learning remain unclear. We systematically evaluate SO(3) action representations across three standard continuous control algorithms, PPO, SAC, and TD3, under dense and sparse rewards. We compare how representations shape exploration, interact with entropy regularization, and affect training stability through empirical studies and analyze the implications of different projections for obtaining valid rotations from Euclidean network outputs. Across a suite of robotics benchmarks, we quantify the practical impact of these choices and distill simple, implementation-ready guidelines for selecting and using rotation actions. Our results highlight that representation-induced geometry strongly influences exploration and optimization and show that representing actions as tangent vectors in the local frame yields the most reliable results across algorithms.
- 中文摘要
许多机器人控制任务需要策略作用于方向,但 SO(3) 的几何形状使这变得不平凡。由于 SO(3) 不允许全局、平滑、最小参数化,因此欧拉角、四元数、旋转矩阵和李代数坐标等常见表示引入了不同的约束和失效模式。虽然这些权衡在监督学习中得到了很好的研究,但它们对强化学习行动的影响仍不清楚。我们在密集和稀疏奖励下系统地评估了三种标准连续控制算法(PPO、SAC 和 TD3)的 SO(3) 动作表示。我们通过实证研究比较了表示如何塑造探索、与熵正则化相互作用以及影响训练稳定性,并分析了不同投影对从欧几里得网络输出中获得有效旋转的影响。在一套机器人基准测试中,我们量化了这些选择的实际影响,并提炼出用于选择和使用旋转动作的简单、可实施的指南。我们的结果强调,表示诱导的几何形状强烈影响探索和优化,并表明将动作表示为局部系中的切向量可以产生跨算法的最可靠的结果。
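As a rough illustration of the "tangent vector in the local frame" representation mentioned in the abstract, the sketch below maps a 3-vector through the SO(3) exponential map (Rodrigues' formula) and right-multiplies it onto the current rotation. The function names are hypothetical, and how the paper scales or clips the network output is not stated in the abstract.

```python
import math


def hat(v):
    """Map a 3-vector to its skew-symmetric matrix."""
    x, y, z = v
    return [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]


def matmul(a, b):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def exp_so3(v):
    """Rodrigues' formula: exp(hat(v)) = I + a*K + b*K^2."""
    theta = math.sqrt(sum(c * c for c in v))
    K = hat(v)
    K2 = matmul(K, K)
    if theta < 1e-8:
        a, b = 1.0, 0.5  # small-angle limits of sin(t)/t and (1-cos(t))/t^2
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / theta ** 2
    return [[(1.0 if i == j else 0.0) + a * K[i][j] + b * K2[i][j]
             for j in range(3)] for i in range(3)]


def apply_local_action(R, v):
    """Right-multiply so the increment is expressed in the body frame."""
    return matmul(R, exp_so3(v))


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# A quarter turn about the local z-axis.
R_next = apply_local_action(IDENTITY, (0.0, 0.0, math.pi / 2))
```

Because the exponential map always returns a valid rotation, no projection step is needed for this representation, unlike raw quaternion or matrix outputs.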
Graph Neural Network-Based Multicast Routing for On-Demand Streaming Services in 6G Networks
基于图神经网络的组播路由,用于6G网络中点播流服务
- Authors: Xiucheng Wang, Zien Wang, Nan Cheng, Wenchao Xu, Wei Quan, Xuemin Shen
- Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2510.11109
- Pdf link: https://arxiv.org/pdf/2510.11109
- Abstract
The increase of bandwidth-intensive applications in sixth-generation (6G) wireless networks, such as real-time volumetric streaming and multi-sensory extended reality, demands intelligent multicast routing solutions capable of delivering differentiated quality-of-service (QoS) at scale. Traditional shortest-path and multicast routing algorithms are either computationally prohibitive or structurally rigid, and they often fail to support heterogeneous user demands, leading to suboptimal resource utilization. Neural network-based approaches, while offering improved inference speed, typically lack topological generalization and scalability. To address these limitations, this paper presents a graph neural network (GNN)-based multicast routing framework that jointly minimizes total transmission cost and supports user-specific video quality requirements. The routing problem is formulated as a constrained minimum-flow optimization task, and a reinforcement learning algorithm is developed to sequentially construct efficient multicast trees by reusing paths and adapting to network dynamics. A graph attention network (GAT) is employed as the encoder to extract context-aware node embeddings, while a long short-term memory (LSTM) module models the sequential dependencies in routing decisions. Extensive simulations demonstrate that the proposed method closely approximates optimal dynamic programming-based solutions while significantly reducing computational complexity. The results also confirm strong generalization to large-scale and dynamic network topologies, highlighting the method's potential for real-time deployment in 6G multimedia delivery scenarios. Code is available at this https URL.
- 中文摘要
第六代 (6G) 无线网络中带宽密集型应用(例如实时体积流和多感官扩展现实)的增加,需要能够大规模提供差异化服务质量 (QoS) 的智能组播路由解决方案。传统的最短路径和组播路由算法要么在计算上令人望而却步,要么在结构上僵化,它们通常无法支持异构用户需求,导致资源利用率不理想。基于神经网络的方法虽然提供了更高的推理速度,但通常缺乏拓扑泛化和可扩展性。为了解决这些限制,本文提出了一种基于图神经网络(GNN)的组播路由框架,该框架共同降低了总传输成本并支持用户特定的视频质量要求。将路由问题表述为约束最小流量优化任务,并开发强化学习算法,通过复用路径和适应网络动力学,依次构建高效的组播树。图注意力网络(GAT)被用作编码器来提取上下文感知节点嵌入,而长短期记忆(LSTM)模块则对路由决策中的顺序依赖关系进行建模。广泛的仿真表明,所提出的方法非常接近基于动态规划的最优解,同时显着降低了计算复杂度。结果还证实了对大规模和动态网络拓扑的强烈推广性,凸显了该方法在6G多媒体交付场景中实时部署的潜力。代码可在此 https URL 中找到。
Refining Hybrid Genetic Search for CVRP via Reinforcement Learning-Finetuned LLM
通过强化学习微调LLM完善CVRP的混合遗传搜索
- Authors: Rongjie Zhu, Cong Zhang, Zhiguang Cao
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2510.11121
- Pdf link: https://arxiv.org/pdf/2510.11121
- Abstract
While large language models (LLMs) are increasingly used as automated heuristic designers for vehicle routing problems (VRPs), current state-of-the-art methods predominantly rely on prompting massive, general-purpose models like GPT-4. This work challenges that paradigm by demonstrating that a smaller, specialized LLM, when meticulously fine-tuned, can generate components that surpass expert-crafted heuristics within advanced solvers. We propose RFTHGS, a novel Reinforcement learning (RL) framework for Fine-Tuning a small LLM to generate high-performance crossover operators for the Hybrid Genetic Search (HGS) solver, applied to the Capacitated VRP (CVRP). Our method employs a multi-tiered, curriculum-based reward function that progressively guides the LLM to master generating first compilable, then executable, and finally, superior-performing operators that exceed human expert designs. This is coupled with an operator caching mechanism that discourages plagiarism and promotes diversity during training. Comprehensive experiments show that our fine-tuned LLM produces crossover operators which significantly outperform the expert-designed ones in HGS. The performance advantage remains consistent, generalizing from small-scale instances to large-scale problems with up to 1000 nodes. Furthermore, RFTHGS exceeds the performance of leading neuro-combinatorial baselines, prompt-based methods, and commercial LLMs such as GPT-4o and GPT-4o-mini.
- 中文摘要
虽然大型语言模型 (LLM) 越来越多地用作车辆路径问题 (VRP) 的自动启发式设计器,但当前最先进的方法主要依赖于对 GPT-4 等大规模通用模型进行提示。这项工作挑战了这一范式,证明一个较小的、专门的 LLM 在经过精心微调后,可以生成超越高级求解器中专家设计的启发式方法的组件。我们提出了 RFTHGS,这是一种新型的强化学习 (RL) 框架,用于微调小型 LLM 以生成用于混合遗传搜索 (HGS) 求解器的高性能交叉算子,并应用于带容量约束的 VRP (CVRP)。我们的方法采用多层次的、基于课程的奖励函数,逐步引导 LLM 先生成可编译的算子,再生成可执行的算子,最终生成性能超越人类专家设计的算子。这与算子缓存机制相结合,可在训练期间抑制抄袭并促进多样性。综合实验表明,我们微调的 LLM 产生的交叉算子在 HGS 中明显优于专家设计的算子。性能优势保持一致,从小规模实例推广到多达 1000 个节点的大规模问题。此外,RFTHGS 的性能超过了领先的神经组合基线、基于提示的方法以及 GPT-4o 和 GPT-4o-mini 等商业 LLM。
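The tiered reward described in this abstract (compilable first, then executable, then better than the expert operator) could look roughly like the sketch below. The numeric reward levels and the `gap_vs_expert` argument are illustrative assumptions; the abstract gives no concrete values.

```python
def tiered_reward(compiles, runs, gap_vs_expert):
    """Return a curriculum-style reward for a generated crossover operator.

    `gap_vs_expert` is the expert operator's solution cost minus the
    generated operator's cost, so positive means the generated one is better.
    All reward levels below are illustrative placeholders.
    """
    if not compiles:
        return -1.0  # tier 0: code does not compile
    if not runs:
        return -0.5  # tier 1: compiles but fails at run time
    if gap_vs_expert <= 0:
        return 0.0   # tier 2: runs, but does not beat the expert design
    return 1.0 + gap_vs_expert  # tier 3: beats the expert; scale with the gap


rewards = [
    tiered_reward(False, False, 0.0),
    tiered_reward(True, False, 0.0),
    tiered_reward(True, True, -0.2),
    tiered_reward(True, True, 0.3),
]
```

The ordering matters more than the exact values: each tier must pay strictly more than the one before it so that the policy is never rewarded for regressing.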
Emergence of hybrid computational dynamics through reinforcement learning
通过强化学习出现混合计算动力学
- Authors: Roman A. Kononov, Nikita A. Pospelov, Konstantin V. Anokhin, Vladimir V. Nekorkin, Oleg V. Maslennikov
- Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Adaptation and Self-Organizing Systems (nlin.AO); Neurons and Cognition (q-bio.NC)
- Arxiv link: https://arxiv.org/abs/2510.11162
- Pdf link: https://arxiv.org/pdf/2510.11162
- Abstract
Understanding how learning algorithms shape the computational strategies that emerge in neural networks remains a fundamental challenge in machine intelligence. While network architectures receive extensive attention, the role of the learning paradigm itself in determining emergent dynamics remains largely unexplored. Here we demonstrate that reinforcement learning (RL) and supervised learning (SL) drive recurrent neural networks (RNNs) toward fundamentally different computational solutions when trained on identical decision-making tasks. Through systematic dynamical systems analysis, we reveal that RL spontaneously discovers hybrid attractor architectures, combining stable fixed-point attractors for decision maintenance with quasi-periodic attractors for flexible evidence integration. This contrasts sharply with SL, which converges almost exclusively to simpler fixed-point-only solutions. We further show that RL sculpts functionally balanced neural populations through a powerful form of implicit regularization -- a structural signature that enhances robustness and is conspicuously absent in the more heterogeneous solutions found by SL-trained networks. The prevalence of these complex dynamics in RL is controllably modulated by weight initialization and correlates strongly with performance gains, particularly as task complexity increases. Our results establish the learning algorithm as a primary determinant of emergent computation, revealing how reward-based optimization autonomously discovers sophisticated dynamical mechanisms that are less accessible to direct gradient-based optimization. These findings provide both mechanistic insights into neural computation and actionable principles for designing adaptive AI systems.
- 中文摘要
了解学习算法如何塑造神经网络中出现的计算策略仍然是机器智能的一个基本挑战。虽然网络架构受到广泛关注,但学习范式本身在确定涌现动态方面的作用在很大程度上仍未得到探索。在这里,我们证明,强化学习 (RL) 和监督学习 (SL) 在相同的决策任务上训练时,会驱动循环神经网络 (RNN) 走向根本不同的计算解决方案。通过系统的动力系统分析,我们揭示了 RL 会自发地发现混合吸引子架构,将用于维持决策的稳定不动点吸引子与用于灵活整合证据的准周期吸引子相结合。这与 SL 形成鲜明对比,SL 几乎完全收敛到更简单的仅含不动点的解。我们进一步表明,RL 通过一种强大的隐式正则化形式塑造了功能平衡的神经群体——这是一种增强鲁棒性的结构特征,并且在 SL 训练网络发现的更异构的解决方案中明显不存在。RL 中这些复杂动态的出现率可通过权重初始化进行可控调节,并与性能提升密切相关,特别是随着任务复杂性的增加。我们的结果将学习算法确立为涌现计算的主要决定因素,揭示了基于奖励的优化如何自主发现复杂的动态机制,而这些机制难以被基于梯度的直接优化所利用。这些发现既提供了对神经计算的机制见解,也为设计自适应人工智能系统提供了可操作的原则。
Can Tool-Integrated Reinforcement Learning Generalize Across Diverse Domains?
工具集成强化学习能否在不同领域推广?
- Authors: Zhengyu Chen, Jinluan Yang, Teng Xiao, Ruochen Zhou, Luan Zhang, Xiangyu Xi, Xiaowei Shi, Wei Wang, Jinggang Wang
- Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2510.11184
- Pdf link: https://arxiv.org/pdf/2510.11184
- Abstract
Recent advances in large language models (LLMs) have demonstrated remarkable capabilities in reasoning and tool utilization. However, the generalization of tool-augmented reinforcement learning (RL) across diverse domains remains underexplored. In this work, we investigate the cross-domain generalization of an LLM agent equipped with a code interpreter tool, which is exclusively trained on mathematical problem-solving tasks. Despite the restricted training domain, we evaluate the agent's performance across several distinct reasoning domains. The results reveal that RL-based tool usage learned from mathematical tasks can be effectively transferred to complex tasks in other domains, enabling great task performance and high token efficiency. To facilitate this cross-domain transfer, we propose a Tool Generalization Reinforcement Learning (TGRL) framework designed to promote domain-agnostic learning and skill migration, encompassing: (i) a standardized tool interface that abstracts domain-specific nuances through consistent formatting and explicit termination, fostering transferable invocation patterns; (ii) a dual-component reward system that decomposes rewards to incentivize generalizable behaviors like tool efficiency and reasoning abstraction, ensuring alignment and robustness across domain shifts; and (iii) an XML-based prompt template that separates thinking, tool calls, and responses to encourage modular, domain-invariant planning and coherent multi-turn interactions. Extensive experiments across diverse benchmarks validate our approach, achieving state-of-the-art performance and highlighting the cross-domain potential of Tool RL for LLM reasoning.
- 中文摘要
大型语言模型(LLM)的最新进展在推理和工具利用方面表现出卓越的能力。然而,工具增强强化学习(RL)在不同领域的推广仍然没有得到充分探索。在这项工作中,我们研究了配备代码解释器工具的LLM代理的跨域泛化,该工具专门针对数学问题解决任务进行训练。尽管训练领域受到限制,但我们评估了代理在几个不同推理领域的表现。结果表明,从数学任务中学习到的基于RL的工具使用可以有效地转移到其他领域的复杂任务中,从而实现出色的任务性能和较高的token效率。为了促进这种跨领域转移,我们提出了一个工具泛化强化学习(TGRL)框架,旨在促进与领域无关的学习和技能迁移,包括:(i)一个标准化的工具接口,通过一致的格式和显式终止来抽象特定领域的细微差别,促进可转移的调用模式;(ii) 一个双组件奖励系统,分解奖励以激励工具效率和推理抽象等可推广行为,确保跨领域转移的一致性和稳健性;(iii) 基于 XML 的提示模板,将思维、工具调用和响应分开,以鼓励模块化、领域不变的规划和连贯的多轮交互。跨不同基准的广泛实验验证了我们的方法,实现了最先进的性能,并突出了工具 RL 在 LLM 推理方面的跨领域潜力。
Aligning Deep Implicit Preferences by Learning to Reason Defensively
通过学习防御性推理来调整深层内隐偏好
- Authors: Peiming Li, Zhiyuan Hu, Yang Tang, Shiyu Li, Xi Chen
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.11194
- Pdf link: https://arxiv.org/pdf/2510.11194
- Abstract
Personalized alignment is crucial for enabling Large Language Models (LLMs) to engage effectively in user-centric interactions. However, current methods face a dual challenge: they fail to infer users' deep implicit preferences (including unstated goals, semantic context and risk tolerances), and they lack the defensive reasoning required to navigate real-world ambiguity. This cognitive gap leads to responses that are superficial, brittle and short-sighted. To address this, we propose Critique-Driven Reasoning Alignment (CDRA), which reframes alignment from a scalar reward-matching task into a structured reasoning process. First, to bridge the preference inference gap, we introduce the DeepPref benchmark. This dataset, comprising 3000 preference-query pairs across 20 topics, is curated by simulating a multi-faceted cognitive council that produces critique-annotated reasoning chains to deconstruct query semantics and reveal latent risks. Second, to instill defensive reasoning, we introduce the Personalized Generative Process Reward Model (Pers-GenPRM), which frames reward modeling as a personalized reasoning task. It generates a critique chain to evaluate a response's alignment with user preferences before outputting a final score based on this rationale. Ultimately, this interpretable, structured reward signal guides policy model through Critique-Driven Policy Alignment, a process-level online reinforcement learning algorithm integrating both numerical and natural language feedback. Experiments demonstrate that CDRA excels at discovering and aligning with users' true preferences while executing robust reasoning. Our code and dataset are available at this https URL.
- 中文摘要
个性化对齐对于使大型语言模型 (LLM) 能够有效地参与以用户为中心的交互至关重要。然而,当前的方法面临着双重挑战:它们无法推断出用户深层次的隐性偏好(包括未说明的目标、语义背景和风险承受能力),并且缺乏驾驭现实世界模糊性所需的防御性推理。这种认知差距导致回答肤浅、脆弱和短视。为了解决这个问题,我们提出了批判驱动推理对齐(CDRA),它将对齐从标量奖励匹配任务重构为结构化推理过程。首先,为了弥合偏好推理差距,我们引入了 DeepPref 基准测试。该数据集由 20 个主题的 3000 个偏好-查询对组成,是通过模拟多方面的认知委员会来构建的,该委员会生成带批判注释的推理链,以解构查询语义并揭示潜在风险。其次,为了灌输防御性推理,我们引入了个性化生成过程奖励模型(Pers-GenPRM),该模型将奖励建模视为个性化推理任务。它生成一个批判链来评估响应与用户偏好的一致性,然后根据此推理依据输出最终分数。最终,这种可解释的结构化奖励信号通过批判驱动策略对齐(Critique-Driven Policy Alignment)来指导策略模型,这是一种集成了数值和自然语言反馈的过程级在线强化学习算法。实验表明,CDRA 擅长发现并符合用户的真实偏好,同时执行稳健的推理。我们的代码和数据集可在此 https URL 中找到。
Vision-LLMs for Spatiotemporal Traffic Forecasting
用于时空流量预测的视觉 LLM
- Authors: Ning Yang, Hengyu Zhong, Haijun Zhang, Randall Berry
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2510.11282
- Pdf link: https://arxiv.org/pdf/2510.11282
- Abstract
Accurate spatiotemporal traffic forecasting is a critical prerequisite for proactive resource management in dense urban mobile networks. While Large Language Models (LLMs) have shown promise in time series analysis, they inherently struggle to model the complex spatial dependencies of grid-based traffic data. Effectively extending LLMs to this domain is challenging, as representing the vast amount of information from dense geographical grids can be inefficient and overwhelm the model's context. To address these challenges, we propose ST-Vision-LLM, a novel framework that reframes spatiotemporal forecasting as a vision-language fusion problem. Our approach leverages a Vision-LLM visual encoder to process historical global traffic matrices as image sequences, providing the model with a comprehensive global view to inform cell-level predictions. To overcome the inefficiency of LLMs in handling numerical data, we introduce an efficient encoding scheme that represents floating-point values as single tokens via a specialized vocabulary, coupled with a two-stage numerical alignment fine-tuning process. The model is first trained with Supervised Fine-Tuning (SFT) and then further optimized for predictive accuracy using Group Relative Policy Optimization (GRPO), a memory-efficient reinforcement learning method. Evaluations on real-world mobile traffic datasets demonstrate that ST-Vision-LLM outperforms existing methods by 15.6% in long-term prediction accuracy and exceeds the second-best baseline by over 30.04% in cross-domain few-shot scenarios. Our extensive experiments validate the model's strong generalization capabilities across various data-scarce environments.
- 中文摘要
准确的时空流量预测是密集城市移动网络中主动资源管理的关键先决条件。虽然大型语言模型 (LLM) 在时间序列分析中显示出前景,但它们本质上难以对基于网格的流量数据的复杂空间依赖关系进行建模。有效地将 LLM 扩展到该领域具有挑战性,因为表示来自密集地理网格的大量信息可能效率低下,并且会超出模型的上下文容量。为了应对这些挑战,我们提出了 ST-Vision-LLM,这是一个将时空预测重新定义为视觉-语言融合问题的新颖框架。我们的方法利用 Vision-LLM 视觉编码器将历史全局流量矩阵处理为图像序列,为模型提供全面的全局视图,为网格单元级预测提供信息。为了克服 LLM 在处理数值数据方面的低效率问题,我们引入了一种高效的编码方案,通过专门的词汇表将浮点值表示为单个 token,并结合两阶段的数值对齐微调过程。该模型首先使用监督微调 (SFT) 进行训练,然后使用组相对策略优化 (GRPO)(一种内存高效的强化学习方法)进一步优化预测准确性。对真实世界移动流量数据集的评估表明,ST-Vision-LLM 在长期预测精度方面比现有方法高出 15.6%,在跨域少样本场景中比第二好的基线高出 30.04% 以上。我们广泛的实验验证了该模型在各种数据稀缺环境中的强大泛化能力。
Gym-TORAX: Open-source software for integrating RL with plasma control simulators
Gym-TORAX:用于将 RL 与等离子体控制模拟器集成的开源软件
- Authors: Antoine Mouchamps, Arthur Malherbe, Adrien Bolland, Damien Ernst
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2510.11283
- Pdf link: https://arxiv.org/pdf/2510.11283
- Abstract
This paper presents Gym-TORAX, a Python package enabling the implementation of Reinforcement Learning (RL) environments for simulating plasma dynamics and control in tokamaks. Users define succinctly a set of control actions and observations, and a control objective from which Gym-TORAX creates a Gymnasium environment that wraps TORAX for simulating the plasma dynamics. The objective is formulated through rewards depending on the simulated state of the plasma and control action to optimize specific characteristics of the plasma, such as performance and stability. The resulting environment instance is then compatible with a wide range of RL algorithms and libraries and will facilitate RL research in plasma control. In its current version, one environment is readily available, based on a ramp-up scenario of the International Thermonuclear Experimental Reactor (ITER).
- 中文摘要
本文介绍了 Gym-TORAX,这是一个 Python 包,能够实现强化学习 (RL) 环境,以模拟托卡马克中的等离子体动力学和控制。用户简洁地定义了一组控制动作和观察,以及一个控制目标,Gym-TORAX 从中创建一个 Gymnasium 环境,将 TORAX 包裹起来以模拟等离子体动力学。根据等离子体的模拟状态和控制动作,通过奖励来制定目标,以优化等离子体的特定特性,例如性能和稳定性。然后,生成的环境实例与各种 RL 算法和库兼容,并将促进等离子体控制中的 RL 研究。在当前版本中,基于国际热核聚变实验堆 (ITER) 的爬坡场景,一种环境随时可用。
FOSSIL: Harnessing Feedback on Suboptimal Samples for Data-Efficient Generalisation with Imitation Learning for Embodied Vision-and-Language Tasks
FOSSIL:利用对次优样本的反馈,通过模仿学习实现具身视觉和语言任务的数据高效泛化
- Authors: Sabrina McCallum, Amit Parekh, Alessandro Suglia
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.11307
- Pdf link: https://arxiv.org/pdf/2510.11307
- Abstract
Current approaches to embodied AI tend to learn policies from expert demonstrations. However, without a mechanism to evaluate the quality of demonstrated actions, they are limited to learning from optimal behaviour, or they risk replicating errors and inefficiencies. While reinforcement learning offers one alternative, the associated exploration typically results in sacrificing data efficiency. This work explores how agents trained with imitation learning can learn robust representations from both optimal and suboptimal demonstrations when given access to constructive language feedback as a means to contextualise different modes of behaviour. We directly provide language feedback embeddings as part of the input sequence into a Transformer-based policy, and optionally complement the traditional next action prediction objective with auxiliary self-supervised learning objectives for feedback prediction. We test our approach on a range of embodied Vision-and-Language tasks in our custom BabyAI-XGen environment and show significant improvements in agents' compositional generalisation abilities and robustness, suggesting that our data-efficient method allows models to successfully convert suboptimal behaviour into learning opportunities. Overall, our results suggest that language feedback is a competitive and intuitive alternative to intermediate scalar rewards for language-specified embodied tasks.
- 中文摘要
目前具身人工智能的方法倾向于从专家演示中学习策略。然而,如果没有评估所演示动作质量的机制,它们只能从最优行为中学习,否则就有可能复制错误和低效率。虽然强化学习提供了一种替代方案,但相关的探索通常会牺牲数据效率。这项工作探讨了接受模仿学习训练的智能体在获得建设性语言反馈作为将不同行为模式置于情境中的一种手段时,如何从最优和次优演示中学习稳健的表征。我们直接将语言反馈嵌入作为输入序列的一部分提供给基于 Transformer 的策略,并可选择用辅助自监督学习目标来补充传统的下一动作预测目标,以进行反馈预测。我们在自定义的 BabyAI-XGen 环境中对一系列具身视觉和语言任务测试了我们的方法,并显示出智能体的组合泛化能力和鲁棒性显著提高,这表明我们的数据高效方法允许模型成功地将次优行为转化为学习机会。总体而言,我们的结果表明,对于语言指定的具身任务,语言反馈是中间标量奖励的一种有竞争力且直观的替代方案。
Part II: ROLL Flash -- Accelerating RLVR and Agentic Training with Asynchrony
第二部分:ROLL Flash -- 通过异步加速 RLVR 和智能体训练
- Authors: Han Lu, Zichen Liu, Shaopan Xiong, Yancheng He, Wei Gao, Yanan Wu, Weixun Wang, Jiashun Liu, Yang Li, Haizhou Zhao, Ju Huang, Siran Yang, Xiaoyang Li, Yijia Luo, Zihe Liu, Ling Pan, Junchi Yan, Wei Wang, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.11345
- Pdf link: https://arxiv.org/pdf/2510.11345
- Abstract
Synchronous Reinforcement Learning (RL) post-training has emerged as a crucial step for enhancing Large Language Models (LLMs) with diverse capabilities. However, many systems designed to accelerate RL post-training still suffer from low resource utilization and limited scalability. We present ROLL Flash, a system that extends ROLL with native support for asynchronous RL post-training. ROLL Flash is built upon two core design principles: fine-grained parallelism and rollout-train decoupling. Guided by these principles, ROLL Flash provides flexible programming interfaces that enable a fully asynchronous training architecture and support efficient rollout mechanisms, including queue scheduling and environment-level asynchronous execution. Through comprehensive theoretical analysis and extensive experiments, we demonstrate that ROLL Flash significantly improves resource utilization and scalability over synchronous RL post-training. ROLL Flash achieves up to 2.24x speedup on RLVR tasks and 2.72x on agentic tasks, using the same GPU budget as synchronous baselines. Furthermore, we implement several popular off-policy algorithms and verify that asynchronous training can achieve performance on par with synchronous training.
- 中文摘要
同步强化学习 (RL) 后训练已成为为大型语言模型 (LLM) 赋予多样化能力的关键步骤。然而,许多旨在加速 RL 后训练的系统仍然存在资源利用率低和可扩展性有限的问题。我们介绍了 ROLL Flash,这是一个扩展 ROLL 的系统,原生支持异步 RL 后训练。ROLL Flash 基于两个核心设计原则:细粒度并行性和 rollout 与训练解耦。在这些原则的指导下,ROLL Flash 提供了灵活的编程接口,可实现完全异步的训练架构并支持高效的 rollout 机制,包括队列调度和环境级异步执行。通过全面的理论分析和广泛的实验,我们证明了与同步 RL 后训练相比,ROLL Flash 显著提高了资源利用率和可扩展性。在使用与同步基线相同的 GPU 预算的情况下,ROLL Flash 在 RLVR 任务上实现了高达 2.24 倍的加速,在智能体任务上实现了 2.72 倍的加速。此外,我们实现了几种流行的离策略(off-policy)算法,并验证异步训练可以实现与同步训练相当的性能。
Reasoning as Representation: Rethinking Visual Reinforcement Learning in Image Quality Assessment
推理作为表征:重新思考图像质量评估中的视觉强化学习
- Authors: Shijie Zhao, Xuanyu Zhang, Weiqi Li, Junlin Li, Li Zhang, Tianfan Xue, Jian Zhang
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2510.11369
- Pdf link: https://arxiv.org/pdf/2510.11369
- Abstract
Reasoning-based image quality assessment (IQA) models trained through reinforcement learning (RL) exhibit exceptional generalization, yet the underlying mechanisms and critical factors driving this capability remain underexplored in current research. Moreover, despite their superior performance, these models incur inference energy usage and latency orders of magnitude higher than their earlier counterparts, restricting their deployment in specific scenarios. Through extensive experiments, this paper verifies and elaborates that through RL training, MLLMs leverage their reasoning capability to convert redundant visual representations into compact, cross-domain aligned text representations. This conversion is precisely the source of the generalization exhibited by these reasoning-based IQA models. Building on this fundamental insight, we propose a novel algorithm, RALI, which employs contrastive learning to directly align images with these generalizable text representations learned by RL. This approach eliminates the reliance on reasoning processes and even obviates the need to load an LLM. For the quality scoring task, this framework achieves generalization performance comparable to reasoning-based models while requiring less than 5% of their model parameters and inference time.
- 中文摘要
通过强化学习(RL)训练的基于推理的图像质量评估(IQA)模型表现出出色的泛化性,但驱动这种能力的潜在机制和关键因素在当前的研究中仍未得到充分探索。此外,尽管这些模型具有卓越的性能,但这些模型产生的推理能耗和延迟比早期模型高出几个数量级,限制了它们在特定场景中的部署。通过大量的实验,本文验证并阐述了通过RL训练,MLLM利用其推理能力将冗余的视觉表示转换为紧凑的、跨域对齐的文本表示。这种转换正是这些基于推理的 IQA 模型所表现出的泛化的来源。基于这一基本见解,我们提出了一种新颖的算法 RALI,它采用对比学习将图像与 RL 学习的这些可推广文本表示直接对齐。这种方法消除了对推理过程的依赖,甚至无需加载 LLM。对于质量评分任务,该框架实现了与基于推理的模型相当的泛化性能,同时需要不到 5% 的模型参数和推理时间。
Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers
通过对齐训练和推理路由器来稳定 MoE 强化学习
- Authors: Wenhan Ma, Hailin Zhang, Liang Zhao, Yifan Song, Yudong Wang, Zhifang Sui, Fuli Luo
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2510.11370
- Pdf link: https://arxiv.org/pdf/2510.11370
- Abstract
Reinforcement learning (RL) has emerged as a crucial approach for enhancing the capabilities of large language models. However, in Mixture-of-Experts (MoE) models, the routing mechanism often introduces instability, even leading to catastrophic RL training collapse. We analyze the training-inference consistency of MoE models and identify a notable discrepancy in routing behaviors between the two phases. Moreover, even under identical conditions, the routing framework can yield divergent expert selections across repeated forward passes. To address this foundational inconsistency, we propose Rollout Routing Replay (R3), a method that records routing distributions from the inference engine and replays them during training. R3 significantly reduces training-inference policy KL divergence and mitigates extreme discrepancies without compromising training speed. Extensive experiments on various settings confirm that R3 succeeds in stabilizing RL training, preventing collapse and outperforming methods such as GSPO and TIS. We believe this work can offer a new solution for stabilizing RL in MoE models.
- 中文摘要
强化学习 (RL) 已成为增强大型语言模型能力的重要方法。然而,在专家混合(MoE)模型中,路由机制往往会引入不稳定性,甚至导致灾难性的RL训练崩溃。我们分析了 MoE 模型的训练-推理一致性,并确定了两个阶段之间路由行为的显着差异。此外,即使在相同的条件下,路由框架也可以在重复的前向传递中产生不同的专家选择。为了解决这种基本的不一致问题,我们提出了推出路由重放 (R3),这是一种记录来自推理引擎的路由分布并在训练期间重放它们的方法。R3 显着减少了训练-推理策略 KL 分歧,并在不影响训练速度的情况下减轻了极端差异。对各种设置的广泛实验证实,R3 成功地稳定了 RL 训练,防止崩溃并优于 GSPO 和 TIS 等方法。我们相信这项工作可以为稳定 MoE 模型中的 RL 提供新的解决方案。
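A toy sketch of the routing mismatch and the replay idea described in this abstract: the expert set chosen at rollout time is recorded and reused in the training forward pass, while gate weights are still computed from the training-side router logits. This is one plausible reading of "records routing distributions ... and replays them"; the function names and numbers are hypothetical, and the actual R3 mechanism may differ.

```python
import math


def top_k_indices(logits, k):
    """Return the indices of the k largest router logits."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def gate_weights(logits, selected):
    """Softmax over the selected experts only (numerically stabilized)."""
    m = max(logits[i] for i in selected)
    exps = {i: math.exp(logits[i] - m) for i in selected}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


# Tiny numeric drift between engines flips which expert ranks second.
inference_logits = [2.00, 1.00, 0.99, -1.0]
training_logits = [2.00, 0.99, 1.00, -1.0]

recorded = top_k_indices(inference_logits, k=2)  # saved during rollout
fresh = top_k_indices(training_logits, k=2)      # what training would pick alone

# Replay: reuse the recorded expert set, weight it with training logits.
replayed = gate_weights(training_logits, recorded)
```

Here `recorded` and `fresh` disagree on the second expert, which is the kind of train-inference discrepancy the abstract describes; replaying `recorded` keeps both phases on the same experts.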
KnowRL: Teaching Language Models to Know What They Know
KnowRL:教语言模型知道他们知道什么
- Authors: Sahil Kale, Devendra Singh Dhami
- Subjects: Subjects:
Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.11407
- Pdf link: https://arxiv.org/pdf/2510.11407
- Abstract
Truly reliable AI requires more than simply scaling up knowledge; it demands the ability to know what it knows and when it does not. Yet recent research shows that even the best LLMs misjudge their own competence in more than one in five cases, making any response born of such internal uncertainty impossible to fully trust. Inspired by self-improvement reinforcement learning techniques that require minimal data, we present a simple but powerful framework KnowRL that strengthens a model's internal understanding of its own feasibility boundaries, enabling safer and more responsible behaviour. Our framework combines two components: (i) introspection, where the model generates and classifies tasks it judges feasible or infeasible, and (ii) consensus-based rewarding, where stability of self-knowledge assessment is reinforced through internal agreement. By using internally generated data, this design strengthens consistency in self-knowledge and entirely avoids costly external supervision. In experiments on LLaMA-3.1-8B and Qwen-2.5-7B, KnowRL steadily improved self-knowledge, validated by both intrinsic self-consistency and extrinsic benchmarking. With nothing more than a small seed set and no external supervision, our method drove gains as high as 28% in accuracy and 12% in F1, outperforming baselines in just a few iterations. Our framework essentially unlocks the untapped capacity of LLMs to self-improve their knowledge awareness, opening the door to reliable, more accountable AI and safer deployment in critical applications. Owing to its simplicity and independence from external effort, we encourage applying this reliability-enhancing process to all future models.
- 中文摘要
真正可靠的人工智能需要的不仅仅是扩大知识规模;它需要知道自己知道什么以及何时不知道的能力。然而,最近的研究表明,即使是最优秀的 LLM 也会在超过五分之一的情况下误判自己的能力,这使得这种内部不确定性产生的任何回答都无法完全信任。受到只需极少数据的自我改进强化学习技术的启发,我们提出了一个简单但功能强大的框架 KnowRL,它加强了模型对其自身可行性边界的内部理解,从而实现更安全、更负责任的行为。我们的框架结合了两个组成部分:(i) 内省,其中模型生成并分类它判断为可行或不可行的任务,以及 (ii) 基于共识的奖励,其中通过内部一致性来强化自我认知评估的稳定性。通过使用内部生成的数据,这种设计加强了自我认知的一致性,并完全避免了代价高昂的外部监督。在 LLaMA-3.1-8B 和 Qwen-2.5-7B 的实验中,KnowRL 稳步提高了自我认知,并通过内在自洽性和外在基准测试进行了验证。只需一个小的种子集,没有外部监督,我们的方法就使准确率提升最高达 28%,F1 提升最高达 12%,仅几次迭代就超过了基线。我们的框架从本质上释放了 LLM 自我提高知识意识的未开发能力,为可靠、更负责任的人工智能和在关键应用中更安全的部署打开了大门。由于其简单性和不依赖外部投入,我们鼓励将这种可靠性增强过程应用于所有未来的模型。
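The consensus-based rewarding component could be sketched as a majority-agreement score over repeated self-assessments of the same task. The label strings and the use of the agreement fraction as the reward are assumptions made for illustration; the abstract does not give the exact reward definition.

```python
from collections import Counter


def consensus_reward(labels):
    """Return (majority_label, agreement_fraction) for repeated judgments.

    `labels` holds the model's own feasibility judgments for one task,
    sampled several times (hypothetical values: 'feasible'/'infeasible').
    """
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(labels)


majority, reward = consensus_reward(
    ["feasible", "feasible", "feasible", "infeasible"]
)
```

A higher agreement fraction means the model's self-assessment is more stable across samples, which is the property the abstract says is being reinforced.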
Autonomous vehicles need social awareness to find optima in multi-agent reinforcement learning routing games
自动驾驶汽车需要社会意识才能在多智能体强化学习路由游戏中找到最优
- Authors: Anastasia Psarou, Łukasz Gorczyca, Dominik Gaweł, Rafał Kucharski
- Subjects: Multiagent Systems (cs.MA)
- Arxiv link: https://arxiv.org/abs/2510.11410
- Pdf link: https://arxiv.org/pdf/2510.11410
- Abstract
Previous work has shown that when multiple selfish Autonomous Vehicles (AVs) are introduced to future cities and start learning optimal routing strategies using Multi-Agent Reinforcement Learning (MARL), they may destabilize traffic systems, as they would require a significant amount of time to converge to the optimal solution, equivalent to years of real-world commuting. We demonstrate that moving beyond the selfish component in the reward significantly relieves this issue. If each AV, apart from minimizing its own travel time, aims to reduce its impact on the system, this will be beneficial not only for the system-wide performance but also for each individual player in this routing game. By introducing an intrinsic reward signal based on the marginal cost matrix, we significantly reduce training time and achieve convergence more reliably. Marginal cost quantifies the impact of each individual action (route-choice) on the system (total travel time). Including it as one of the components of the reward can reduce the degree of non-stationarity by aligning agents' objectives. Notably, the proposed counterfactual formulation preserves the system's equilibria and avoids oscillations. Our experiments show that training MARL algorithms with our novel reward formulation enables the agents to converge to the optimal solution, whereas the baseline algorithms fail to do so. We show these effects in both a toy network and the real-world network of Saint-Arnoult. Our results optimistically indicate that social awareness (i.e., including marginal costs in routing decisions) improves both the system-wide and individual performance of future urban systems with AVs.
- 中文摘要
先前的研究表明,当将多个自私的自动驾驶汽车(AV)引入未来城市并开始使用多智能体强化学习(MARL)学习最优路径策略时,它们可能会破坏交通系统的稳定性,因为它们需要大量时间才能收敛到最优解,相当于现实世界中数年的通勤。我们证明,超越奖励中的自私成分可以显著缓解这个问题。如果每个 AV 除了最小化自己的行驶时间外,还旨在减少其对系统的影响,这不仅有利于系统范围的性能,而且有利于此路径选择博弈中的每个参与者。通过引入基于边际成本矩阵的内在奖励信号,我们显著减少了训练时间并更可靠地实现收敛。边际成本量化了每个单独的动作(路线选择)对系统(总行驶时间)的影响。将其作为奖励的组成部分之一,可以通过对齐智能体的目标来降低非平稳程度。值得注意的是,所提出的反事实表述保持了系统的均衡并避免了振荡。我们的实验表明,使用我们新颖的奖励公式训练 MARL 算法使智能体能够收敛到最优解,而基线算法则无法做到这一点。我们在玩具网络和圣阿诺(Saint-Arnoult)的真实网络中展示了这些效果。我们的研究结果乐观地表明,社会意识(即在路径决策中纳入边际成本)提高了未来含自动驾驶汽车的城市系统的全系统性能和个体性能。
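The marginal-cost idea can be made concrete with a small counterfactual calculation: the total system travel time with one vehicle on a route minus the total without that vehicle. The sketch assumes the standard BPR congestion function with coefficients 0.15 and 4, and the route parameters are made-up toy values; the paper's marginal cost matrix and reward weighting are not specified in the abstract.

```python
def link_time(flow, free_flow_time, capacity):
    """BPR travel time for one route (assumed congestion model)."""
    return free_flow_time * (1.0 + 0.15 * (flow / capacity) ** 4)


def total_time(flows, routes):
    """Total system travel time: flow times per-vehicle time, summed over routes."""
    return sum(f * link_time(f, r["t0"], r["cap"]) for f, r in zip(flows, routes))


def marginal_cost(flows, routes, choice):
    """System total with the agent on `choice` minus total without the agent."""
    without_agent = list(flows)
    without_agent[choice] -= 1
    return total_time(flows, routes) - total_time(without_agent, routes)


# Hypothetical two-route network; `flows` already includes the agent.
routes = [{"t0": 10.0, "cap": 10.0}, {"t0": 15.0, "cap": 20.0}]
flows = [10, 5]

own_time = link_time(flows[0], routes[0]["t0"], routes[0]["cap"])
mc = marginal_cost(flows, routes, choice=0)
externality = mc - own_time  # delay the agent imposes on everyone else
```

The marginal cost exceeds the agent's own travel time whenever the route is congested, and the difference is the delay imposed on other travellers, which a purely selfish reward ignores.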
From to : Multidimensional Supervision of Reasoning Process for LLM Optimization
From to : LLM优化推理过程的多维度监督
- Authors: Beining Wang, Weihang Su, Hongtao Tian, Tao Yang, Yujia Zhou, Ting Yao, Qingyao Ai, Yiqun Liu
- Subjects:
Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.11457
- Pdf link: https://arxiv.org/pdf/2510.11457
- Abstract
Improving the multi-step reasoning ability of Large Language Models (LLMs) is a critical yet challenging task. The dominant paradigm, outcome-supervised reinforcement learning (RLVR), rewards only correct final answers, often propagating flawed reasoning and suffering from sparse reward signals. While process-level reward models (PRMs) provide denser, step-by-step feedback, they lack generalizability and interpretability, requiring task-specific segmentation of the reasoning process. To this end, we propose the Dimension-level Reward Model (DRM), a new supervision framework that bridges the gap between these two approaches. DRM evaluates the quality of a reasoning process along three fundamental, complementary, and interpretable dimensions: Confidence for uncertainty calibration, Relevance for semantic alignment, and Coherence for logical consistency. Together, these dimensions capture aspects beyond final answer correctness and enable interpretable assessment without requiring ground truth answers. Experimental results show that DRM provides effective supervision signals, guides the optimization of LLMs and enhances their reasoning ability. In particular, DRM-supervised training achieves consistent gains on both in-distribution and out-of-distribution open-domain tasks, including mathematics, question answering, code execution, and puzzles. Our findings demonstrate that multidimensional supervision of the reasoning process can improve the generalized reasoning ability of LLMs beyond the training distribution.
- 中文摘要
提高大型语言模型(LLM)的多步推理能力是一项关键但具有挑战性的任务。占主导地位的范式,即结果监督强化学习(RLVR),只奖励正确的最终答案,通常传播有缺陷的推理并遭受稀疏奖励信号的影响。虽然流程级奖励模型 (PRM) 提供更密集的分步反馈,但它们缺乏普遍性和可解释性,需要对推理过程进行特定任务的细分。为此,我们提出了维度级奖励模型(DRM),这是一种新的监管框架,弥合了这两种方法之间的差距。DRM 从三个基本、互补和可解释的维度评估推理过程的质量:不确定性校准的置信度、语义对齐的相关性和逻辑一致性的连贯性。这些维度共同捕获了最终答案正确性之外的方面,并实现可解释的评估,而无需基本事实答案。实验结果表明,DRM提供了有效的监督信号,指导了LLM的优化,增强了LLM的推理能力。特别是,DRM 监督训练在分布内和分布外开放领域任务(包括数学、问答、代码执行和谜题)上都取得了一致的收益。我们的研究结果表明,对推理过程的多维监督可以提高LLM在训练分布之外的广义推理能力。
Unifying Deductive and Abductive Reasoning in Knowledge Graphs with Masked Diffusion Model
使用掩蔽扩散模型统一知识图谱中的演绎和归纳推理
- Authors: Yisen Gao, Jiaxin Bai, Yi Huang, Xingcheng Fu, Qingyun Sun, Yangqiu Song
- Subjects:
Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.11462
- Pdf link: https://arxiv.org/pdf/2510.11462
- Abstract
Deductive and abductive reasoning are two critical paradigms for analyzing knowledge graphs, enabling applications from financial query answering to scientific discovery. Deductive reasoning on knowledge graphs usually involves retrieving entities that satisfy a complex logical query, while abductive reasoning generates plausible logical hypotheses from observations. Despite their clear synergistic potential, where deduction can validate hypotheses and abduction can uncover deeper logical patterns, existing methods address them in isolation. To bridge this gap, we propose DARK, a unified framework for Deductive and Abductive Reasoning in Knowledge graphs. As a masked diffusion model capable of capturing the bidirectional relationship between queries and conclusions, DARK has two key innovations. First, to better leverage deduction for hypothesis refinement during abductive reasoning, we introduce a self-reflective denoising process that iteratively generates and validates candidate hypotheses against the observed conclusion. Second, to discover richer logical associations, we propose a logic-exploration reinforcement learning approach that simultaneously masks queries and conclusions, enabling the model to explore novel reasoning compositions. Extensive experiments on multiple benchmark knowledge graphs show that DARK achieves state-of-the-art performance on both deductive and abductive reasoning tasks, demonstrating the significant benefits of our unified approach.
- 中文摘要
演绎推理和归纳推理是分析知识图谱的两种关键范式,可实现从金融查询回答到科学发现的应用。知识图谱上的演绎推理通常涉及检索满足复杂逻辑查询的实体,而归纳推理则从观察中生成合理的逻辑假设。尽管它们具有明显的协同潜力,演绎可以验证假设,而推导可以揭示更深层次的逻辑模式,但现有方法单独解决它们。为了弥合这一差距,我们提出了 DARK,这是一个用于知识图谱中演绎和归纳推理的统一框架。作为一种能够捕获查询和结论之间双向关系的掩码扩散模型,DARK 有两项关键创新。首先,为了在归纳推理过程中更好地利用演绎来细化假设,我们引入了一种自我反思去噪过程,该过程根据观察到的结论迭代生成和验证候选假设。其次,为了发现更丰富的逻辑关联,我们提出了一种逻辑探索强化学习方法,该方法同时掩盖查询和结论,使模型能够探索新的推理组合。在多个基准知识图谱上的大量实验表明,DARK在演绎和归纳推理任务上都取得了最先进的性能,证明了我们统一方法的显着优势。
Coordinated Strategies in Realistic Air Combat by Hierarchical Multi-Agent Reinforcement Learning
基于分层多智能体强化学习的现实空战中的协调策略
- Authors: Ardian Selmonaj, Giacomo Del Rio, Adrian Schneider, Alessandro Antonucci
- Subjects:
Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
- Arxiv link: https://arxiv.org/abs/2510.11474
- Pdf link: https://arxiv.org/pdf/2510.11474
- Abstract
Achieving mission objectives in a realistic simulation of aerial combat is highly challenging due to imperfect situational awareness and nonlinear flight dynamics. In this work, we introduce a novel 3D multi-agent air combat environment and a Hierarchical Multi-Agent Reinforcement Learning framework to tackle these challenges. Our approach combines heterogeneous agent dynamics, curriculum learning, league-play, and a newly adapted training algorithm. To this end, the decision-making process is organized into two abstraction levels: low-level policies learn precise control maneuvers, while high-level policies issue tactical commands based on mission objectives. Empirical results show that our hierarchical approach improves both learning efficiency and combat performance in complex dogfight scenarios.
- 中文摘要
由于态势感知和非线性飞行动力学不完善,在空战的真实模拟中实现任务目标极具挑战性。在这项工作中,我们引入了一种新颖的 3D 多智能体空战环境和分层多智能体强化学习框架来应对这些挑战。我们的方法结合了异构智能体动态、课程学习、联赛和新调整的训练算法。为此,决策过程分为两个抽象层次:低级政策学习精确的控制机动,而高级政策则根据任务目标发布战术命令。实证结果表明,在复杂的混战场景中,分层方法提高了学习效率和战斗表现。
Constraint-Aware Reinforcement Learning via Adaptive Action Scaling
通过自适应动作缩放的约束感知强化学习
- Authors: Murad Dawood, Usama Ahmed Siddiquie, Shahram Khorshidi, Maren Bennewitz
- Subjects:
Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
- Arxiv link: https://arxiv.org/abs/2510.11491
- Pdf link: https://arxiv.org/pdf/2510.11491
- Abstract
Safe reinforcement learning (RL) seeks to mitigate unsafe behaviors that arise from exploration during training by reducing constraint violations while maintaining task performance. Existing approaches typically rely on a single policy to jointly optimize reward and safety, which can cause instability due to conflicting objectives, or they use external safety filters that override actions and require prior system knowledge. In this paper, we propose a modular cost-aware regulator that scales the agent's actions based on predicted constraint violations, preserving exploration through smooth action modulation rather than overriding the policy. The regulator is trained to minimize constraint violations while avoiding degenerate suppression of actions. Our approach integrates seamlessly with off-policy RL methods such as SAC and TD3, and achieves state-of-the-art return-to-cost ratios on Safety Gym locomotion tasks with sparse costs, reducing constraint violations by up to 126 times while increasing returns by over an order of magnitude compared to prior methods.
- 中文摘要
安全强化学习 (RL) 旨在通过减少约束违规,同时保持任务性能,从而减轻训练期间探索过程中产生的不安全行为。现有方法通常依赖于单一策略来共同优化奖励和安全,这可能会因目标冲突而导致不稳定,或者它们使用覆盖动作并需要事先系统知识的外部安全过滤器。在本文中,我们提出了一种模块化的成本感知调节器,它根据预测的约束违规来缩放智能体的动作,通过平滑的动作调制而不是覆盖策略来保留探索。调节器经过训练,可以最大限度地减少约束违规,同时避免对动作的退化性抑制。我们的方法与 SAC 和 TD3 等离策略(off-policy)RL 方法无缝集成,并在成本稀疏的 Safety Gym 运动任务上实现了最先进的回报成本比,与以前的方法相比,将约束违规次数减少了多达 126 倍,同时将回报提高了一个数量级以上。
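A rough sketch of the action-scaling idea described in the abstract above: the policy's action is multiplied by a factor that shrinks as the predicted constraint cost grows, instead of being overridden. In the paper the regulator is trained; here a fixed exponential mapping stands in for it, and `gain`, `min_scale`, and both function names are hypothetical.

```python
import math


def action_scale(predicted_cost, gain=2.0, min_scale=0.1):
    """Map a predicted constraint cost to a multiplier in [min_scale, 1]."""
    cost = max(predicted_cost, 0.0)
    return min_scale + (1.0 - min_scale) * math.exp(-gain * cost)


def regulate(action, predicted_cost, gain=2.0, min_scale=0.1):
    """Scale every action component by the same factor, preserving direction."""
    scale = action_scale(predicted_cost, gain, min_scale)
    return [scale * a for a in action]


safe = regulate([1.0, -2.0], predicted_cost=0.0)   # no predicted violation: action unchanged
risky = regulate([1.0, -2.0], predicted_cost=5.0)  # high predicted cost: action damped
```

The floor `min_scale` is one simple way to avoid suppressing actions entirely, so the agent can still explore near constraint boundaries.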
How Reinforcement Learning After Next-Token Prediction Facilitates Learning
下一个标记预测后的强化学习如何促进学习
- Authors: Nikolaos Tsilivis, Eran Malach, Karen Ullrich, Julia Kempe
- Subjects:
Machine Learning (cs.LG); Machine Learning (stat.ML)
- Arxiv link: https://arxiv.org/abs/2510.11495
- Pdf link: https://arxiv.org/pdf/2510.11495
- Abstract
Recent advances in reasoning domains with neural networks have primarily been enabled by a training recipe that optimizes Large Language Models, previously trained to predict the next token in a sequence, with reinforcement learning algorithms. We introduce a framework to study the success of this paradigm, and we theoretically expose the optimization mechanisms by which reinforcement learning improves over next-token prediction in this setting. We study learning from mixture distributions of short and long "chain-of-thought" sequences encoding a single task. In particular, when the task consists of predicting the parity of $d$ bits and long sequences are rare, we show how reinforcement learning after next-token prediction enables autoregressive transformers to generalize, whereas mere next-token prediction requires extreme statistical or computational resources to do so. We further explain how reinforcement learning leverages increased test-time computation, manifested in longer responses, to facilitate this learning process. In a simplified setting, we theoretically prove that autoregressive linear models following this training recipe can efficiently learn to predict the parity of $d$ bits as long as the proportion of long demonstrations in the data mix is not exponentially small in the input dimension $d$. Finally, we demonstrate these same phenomena in other settings, including the post-training of Llama-series models on mixture variations of common mathematical reasoning benchmarks.
- 中文摘要
神经网络推理领域的最新进展主要是通过优化大型语言模型的训练配方实现的,该模型之前经过训练以使用强化学习算法预测序列中的下一个标记。我们引入了一个框架来研究这种范式的成功,并在理论上揭示了强化学习在这种情况下比下一个标记预测改进的优化机制。我们研究从编码单个任务的短和长“思维链”序列的混合分布中学习。特别是,当任务包括预测 $d$ 位的奇偶校验并且长序列很少见时,我们展示了下一个标记预测后的强化学习如何使自回归转换器能够泛化,而单纯的下一个标记预测需要极端的统计或计算资源来做到这一点。我们进一步解释了强化学习如何利用增加的测试时间计算(表现为更长的响应)来促进这一学习过程。在简化的环境中,我们从理论上证明,只要数据组合中长演示的比例在输入维度 $d$ 中不是指数小,遵循此训练配方的自回归线性模型就可以有效地学习预测 $d$ 位的奇偶校验。最后,我们在其他环境中展示了这些相同的现象,包括对常见数学推理基准的混合变化进行 Llama 系列模型的后训练。
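To make the data setup in the abstract above concrete, here is a small sketch of a mixture of short and long sequences for the parity task. The exact sequence format is an assumption: the short form appends only the answer, while the long form appends the running parity after each bit as a stand-in for a chain of thought. The names `short_example`, `long_example`, `sample_mixture`, and `p_long` are hypothetical.

```python
import random


def short_example(bits):
    """Input bits followed directly by their parity."""
    return list(bits) + [sum(bits) % 2]


def long_example(bits):
    """Input bits, then the running parity after each bit; the last entry is the answer."""
    running, steps = 0, []
    for b in bits:
        running ^= b
        steps.append(running)
    return list(bits) + steps


def sample_mixture(d, p_long, rng):
    """Draw one sequence of d random bits; it is long with probability `p_long`."""
    bits = [rng.randint(0, 1) for _ in range(d)]
    return long_example(bits) if rng.random() < p_long else short_example(bits)
```

For `bits = [1, 0, 1, 1]`, the short sequence ends in the single answer `1`, while the long sequence spells out the intermediate parities `1, 1, 0, 1`.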
ReLook: Vision-Grounded RL with a Multimodal LLM Critic for Agentic Web Coding
ReLook:基于视觉的 RL 与代理 Web 编码的多模态 LLM 批评者
- Authors: Yuhang Li, Chenchen Zhang, Ruilin Lv, Ao Liu, Ken Deng, Yuanxing Zhang, Jiaheng Liu, Wiggin Zhou, Bo Zhou
- Subjects:
Machine Learning (cs.LG); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2510.11498
- Pdf link: https://arxiv.org/pdf/2510.11498
- Abstract
While Large Language Models (LLMs) excel at algorithmic code generation, they struggle with front-end development, where correctness is judged on rendered pixels and interaction. We present ReLook, an agentic, vision-grounded reinforcement learning framework that empowers an agent to close a robust generate--diagnose--refine loop by invoking a multimodal LLM (MLLM) as a tool. During training, the agent uses the MLLM-in-the-loop both as a visual critic--scoring code with screenshots--and as a source of actionable, vision-grounded feedback; a strict zero-reward rule for invalid renders anchors renderability and prevents reward hacking. To prevent behavioral collapse, we introduce Forced Optimization, a strict acceptance rule that admits only improving revisions, yielding monotonically better trajectories. At inference, we decouple the critic and run a lightweight, critic-free self-edit cycle, keeping latency comparable to base decoding while retaining most of the gains. Across three widely used benchmarks, ReLook consistently outperforms strong baselines in vision-grounded front-end code generation, highlighting the benefits of agentic perception, visual rewards, and training-inference decoupling.
- 中文摘要
虽然大型语言模型 (LLM) 擅长算法代码生成,但它们在前端开发方面遇到了困难,前端开发的正确性是根据渲染的像素和交互来判断的。我们提出了 ReLook,这是一个智能体式的、基于视觉的强化学习框架,它使智能体能够通过调用多模态 LLM (MLLM) 作为工具来闭合一个稳健的生成-诊断-细化循环。在训练期间,智能体将环内的 MLLM 既用作视觉批评者(使用屏幕截图对代码进行评分),又用作可操作的、基于视觉的反馈的来源;对无效渲染的严格零奖励规则锚定了可渲染性并防止了奖励黑客攻击。为了防止行为崩溃,我们引入了强制优化,这是一种严格的接受规则,只接受有改进的修订,从而产生单调更好的轨迹。在推理时,我们将批评者解耦并运行一个轻量级的、无批评者的自编辑周期,保持延迟与基础解码相当,同时保留大部分增益。在三个广泛使用的基准测试中,ReLook 在基于视觉的前端代码生成方面始终优于强大的基线,凸显了智能体感知、视觉奖励和训练推理解耦的优势。
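The two rules named in the abstract above (zero reward for invalid renders, and accepting only improving revisions) can be sketched as follows. This is an illustration under stated assumptions: candidates are plain `(label, renders_ok, critic_score)` tuples with made-up scores, whereas a real system would obtain the score from an MLLM judging a screenshot. The function names are hypothetical.

```python
def reward(renders_ok, critic_score):
    """Zero reward for any page that fails to render, else the critic's score."""
    return critic_score if renders_ok else 0.0


def accept_improving(initial, revisions):
    """Keep a revision only if its reward beats the best accepted so far.

    Each candidate is a (label, renders_ok, critic_score) tuple.
    """
    best = initial
    accepted = [initial]
    for candidate in revisions:
        if reward(candidate[1], candidate[2]) > reward(best[1], best[2]):
            best = candidate
            accepted.append(candidate)
    return accepted


trajectory = accept_improving(
    ("v0", True, 0.40),
    [("v1", True, 0.35), ("v2", False, 0.90), ("v3", True, 0.55), ("v4", True, 0.70)],
)
```

Here `v1` is rejected because it scores lower, and `v2` is rejected because it does not render despite its high raw score, so the accepted trajectory has strictly increasing reward.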
Offline Reinforcement Learning with Generative Trajectory Policies
使用生成轨迹策略的离线强化学习
- Authors: Xinsong Feng, Leshu Tang, Chenan Wang, Haipeng Chen
- Subjects:
Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.11499
- Pdf link: https://arxiv.org/pdf/2510.11499
- Abstract
Generative models have emerged as a powerful class of policies for offline reinforcement learning (RL) due to their ability to capture complex, multi-modal behaviors. However, existing methods face a stark trade-off: slow, iterative models like diffusion policies are computationally expensive, while fast, single-step models like consistency policies often suffer from degraded performance. In this paper, we demonstrate that it is possible to bridge this gap. The key to moving beyond the limitations of individual methods, we argue, lies in a unifying perspective that views modern generative models, including diffusion, flow matching, and consistency models, as specific instances of learning a continuous-time generative trajectory governed by an Ordinary Differential Equation (ODE). This principled foundation provides a clearer design space for generative policies in RL and allows us to propose Generative Trajectory Policies (GTPs), a new and more general policy paradigm that learns the entire solution map of the underlying ODE. To make this paradigm practical for offline RL, we further introduce two key theoretically principled adaptations. Empirical results demonstrate that GTP achieves state-of-the-art performance on D4RL benchmarks - it significantly outperforms prior generative policies, achieving perfect scores on several notoriously hard AntMaze tasks.
- 中文摘要
生成模型已成为离线强化学习 (RL) 的一类强大的策略,因为它们能够捕获复杂的多模态行为。然而,现有方法面临着一个明显的权衡:缓慢的迭代模型(如扩散策略)的计算成本很高,而快速的单步模型(如一致性策略)通常会受到性能下降的影响。在本文中,我们证明了弥合这一差距是可能的。我们认为,超越个别方法限制的关键在于一个统一的视角,将现代生成模型(包括扩散、流匹配和一致性模型)视为学习由常微分方程 (ODE) 控制的连续时间生成轨迹的特定实例。这种原则性的基础为 RL 中的生成策略提供了更清晰的设计空间,并允许我们提出生成轨迹策略 (GTP),这是一种新的、更通用的策略范式,可以学习底层 ODE 的整个解决方案映射。为了使这种范式在离线 RL 中实用,我们进一步引入了两个关键的理论原则适应。实证结果表明,GTP 在 D4RL 基准测试中取得了最先进的性能——它的性能明显优于之前的生成策略,在几个臭名昭著的困难 AntMaze 任务中取得了满分。
Context-Aware Model-Based Reinforcement Learning for Autonomous Racing
基于情境感知模型的自动驾驶赛车强化学习
- Authors: Emran Yasser Moustafa, Ivana Dusparic
- Subjects:
Machine Learning (cs.LG); Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2510.11501
- Pdf link: https://arxiv.org/pdf/2510.11501
- Abstract
Autonomous vehicles have shown promising potential to be a groundbreaking technology for improving the safety of road users. For these vehicles, as well as many other safety-critical robotic technologies, to be deployed in real-world applications, we require algorithms that can generalize well to unseen scenarios and data. Model-based reinforcement learning algorithms (MBRL) have demonstrated state-of-the-art performance and data efficiency across a diverse set of domains. However, these algorithms have also shown susceptibility to changes in the environment and its transition dynamics. In this work, we explore the performance and generalization capabilities of MBRL algorithms for autonomous driving, specifically in the simulated autonomous racing environment, Roboracer (formerly F1Tenth). We frame the head-to-head racing task as a learning problem using contextual Markov decision processes and parameterize the driving behavior of the adversaries using the context of the episode, thereby also parameterizing the transition and reward dynamics. We benchmark the behavior of MBRL algorithms in this environment and propose a novel context-aware extension of the existing literature, cMask. We demonstrate that context-aware MBRL algorithms generalize better to out-of-distribution adversary behaviors relative to context-free approaches. We also demonstrate that cMask displays strong generalization capabilities, as well as further performance improvement relative to other context-aware MBRL approaches when racing against adversaries with in-distribution behaviors.
- 中文摘要
自动驾驶汽车已显示出成为提高道路使用者安全的突破性技术的巨大潜力。对于要部署在实际应用中的这些车辆以及许多其他安全关键型机器人技术,我们需要能够很好地推广到看不见的场景和数据的算法。基于模型的强化学习算法 (MBRL) 在不同领域展示了最先进的性能和数据效率。然而,这些算法也显示出对环境变化及其过渡动态的敏感性。在这项工作中,我们探索了自动驾驶 MBRL 算法的性能和泛化能力,特别是在模拟自动驾驶赛车环境 Roboracer(以前称为 F1Tenth)中。我们使用上下文马尔可夫决策过程将头对头竞速任务构建为一个学习问题,并使用情节的上下文参数化对手的驾驶行为,从而也参数化过渡和奖励动态。我们对MBRL算法在这种环境下的行为进行了基准测试,并提出了现有文献的一种新颖的上下文感知扩展,即cMask。我们证明,相对于无上下文方法,上下文感知 MBRL 算法可以更好地推广到分布外的对手行为。我们还证明,在与具有分布式行为的对手竞争时,cMask 表现出强大的泛化能力,以及相对于其他上下文感知 MBRL 方法的进一步性能改进。
A Physics-Informed Reinforcement Learning Approach for Degradation-Aware Long-Term Charging Optimization in Batteries
一种基于物理的强化学习方法,用于电池退化感知的长期充电优化
- Authors: Shanthan Kumar Padisala, Bharatkumar Hegde, Ibrahim Haskara, Satadru Dey
- Subjects:
Systems and Control (eess.SY)
- Arxiv link: https://arxiv.org/abs/2510.11515
- Pdf link: https://arxiv.org/pdf/2510.11515
- Abstract
Batteries degrade with usage and continuous cycling. This aging is typically reflected in the resistance growth and capacity fade of battery cells. Over the years, various charging methods have been presented in the literature that propose current profiles to enable optimal, fast, and/or health-conscious charging. However, very few works have attempted to make the ubiquitous Constant Current Constant Voltage (CCCV) charging protocol adaptive to the changing battery health as it cycles. This work aims to address this gap and proposes a framework that optimizes the constant-current part of the CCCV protocol, adapting it to long-term battery degradation. Specifically, a physics-informed Reinforcement Learning (RL) approach is used that not only estimates a key battery degradation mechanism, namely Loss of Active Material (LAM), but also adjusts the current magnitude of CCCV in response to this particular degradation. The proposed framework has been implemented by combining PyBamm, an open-source battery modeling tool, and Stable-baselines, where the RL agent was trained using a Proximal Policy Optimization (PPO) network. Simulation results show the potential of the proposed framework for enhancing the widely used CCCV protocol by embedding physics information in the RL algorithm. The proposed agent is also compared with two other charging protocols: one generated by a non-physics-based RL agent, and a constant CCCV protocol applied to all cycles.
- 中文摘要
电池会随着使用和连续循环而退化。这种老化通常通过电阻增长和电池容量衰减来反映。多年来,文献中提出了各种充电方法,这些方法提出了当前的配置文件,以实现最佳、快速和/或注重健康的充电。然而,很少有工作试图使无处不在的恒流恒压 (CCCV) 充电协议适应电池循环过程中不断变化的健康状况。这项工作旨在解决这一差距,并提出了一个框架来优化CCCV协议的恒流部分,以适应电池的长期退化。具体来说,已经使用了一种基于物理的强化学习 (RL) 方法,该方法不仅估计了关键的电池退化机制,即活性材料损失 (LAM),而且还调整了由于这种特定退化而导致的 CCCV 的电流大小。所提出的框架是通过结合开源电池建模工具 PyBamm 和稳定基线来实现的,其中 RL 代理是使用近端策略优化 (PPO) 网络进行训练的。仿真结果表明,所提框架通过在RL算法中嵌入物理信息来增强广泛使用的CCCV协议的潜力。还讨论了该拟议代理与其他 2 种由非物理 RL 代理生成的充电协议和所有循环的恒定 CCCV 的比较研究。
A Flexible Multi-Agent Deep Reinforcement Learning Framework for Dynamic Routing and Scheduling of Latency-Critical Services
一种灵活的多智能体深度强化学习框架,用于延迟关键型服务的动态路由和调度
- Authors: Vincenzo Norman Vitale, Antonia Maria Tulino, Andreas F. Molisch, Jaime Llorca
- Subjects:
Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.11535
- Pdf link: https://arxiv.org/pdf/2510.11535
- Abstract
Timely delivery of delay-sensitive information over dynamic, heterogeneous networks is increasingly essential for a range of interactive applications, such as industrial automation, self-driving vehicles, and augmented reality. However, most existing network control solutions target only average delay performance, falling short of providing strict End-to-End (E2E) peak latency guarantees. This paper addresses the challenge of reliably delivering packets within application-imposed deadlines by leveraging recent advancements in Multi-Agent Deep Reinforcement Learning (MA-DRL). After introducing the Delay-Constrained Maximum-Throughput (DCMT) dynamic network control problem, and highlighting the limitations of current solutions, we present a novel MA-DRL network control framework that leverages a centralized routing and distributed scheduling architecture. The proposed framework leverages critical networking domain knowledge for the design of effective MA-DRL strategies based on the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) technique, where centralized routing and distributed scheduling agents dynamically assign paths and schedule packet transmissions according to packet lifetimes, thereby maximizing on-time packet delivery. The generality of the proposed framework allows integrating both data-driven Deep Reinforcement Learning (DRL) agents and traditional rule-based policies in order to strike the right balance between performance and learning complexity. Our results confirm the superiority of the proposed framework with respect to traditional stochastic optimization-based approaches and provide key insights into the role and interplay between data-driven DRL agents and new rule-based policies for both efficient and high-performance control of latency-critical services.
- 中文摘要
通过动态异构网络及时交付延迟敏感信息对于一系列交互式应用(例如工业自动化、自动驾驶汽车和增强现实)变得越来越重要。然而,大多数现有的网络控制解决方案仅针对平均延迟性能,无法提供严格的端到端 (E2E) 峰值延迟保证。本文利用多代理深度强化学习 (MA-DRL) 的最新进展,解决了在应用程序规定的期限内可靠地交付数据包的挑战。在介绍了时延约束最大吞吐量(DCMT)动态网络控制问题并强调了当前解决方案的局限性之后,我们提出了一种利用集中式路由和分布式调度架构的新型MA-DRL网络控制框架。该框架利用关键的网络领域知识,基于多智能体深度确定性策略梯度(MADDPG)技术设计有效的MA-DRL策略,其中集中式路由和分布式调度智能体根据数据包生命周期动态分配路径并调度数据包传输,从而最大限度地提高数据包的准时交付。所提出的框架的通用性允许集成数据驱动的深度强化学习 (DRL) 代理和传统的基于规则的策略,以便在性能和学习复杂性之间取得适当的平衡。我们的结果证实了所提出的框架相对于传统的基于随机优化的方法的优越性,并为数据驱动的 DRL 代理与基于规则的新策略之间的作用和相互作用提供了关键见解,以实现对延迟关键服务的高效和高性能控制。
NaviGait: Navigating Dynamically Feasible Gait Libraries using Deep Reinforcement Learning
NaviGait:使用深度强化学习导航动态可行的步态库
- Authors: Neil C. Janwani, Varun Madabushi, Maegan Tucker
- Subjects:
Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2510.11542
- Pdf link: https://arxiv.org/pdf/2510.11542
- Abstract
Reinforcement learning (RL) has emerged as a powerful method to learn robust control policies for bipedal locomotion. Yet, it can be difficult to tune desired robot behaviors due to unintuitive and complex reward design. In comparison, offline trajectory optimization methods, like Hybrid Zero Dynamics, offer more tuneable, interpretable, and mathematically grounded motion plans for high-dimensional legged systems. However, these methods often remain brittle to real-world disturbances like external perturbations. In this work, we present NaviGait, a hierarchical framework that combines the structure of trajectory optimization with the adaptability of RL for robust and intuitive locomotion control. NaviGait leverages a library of offline-optimized gaits and smoothly interpolates between them to produce continuous reference motions in response to high-level commands. The policy provides both joint-level and velocity command residual corrections to modulate and stabilize the reference trajectories in the gait library. One notable advantage of NaviGait is that it dramatically simplifies reward design by encoding rich motion priors from trajectory optimization, reducing the need for finely tuned shaping terms and enabling more stable and interpretable learning. Our experimental results demonstrate that NaviGait enables faster training compared to conventional and imitation-based RL, and produces motions that remain closest to the original reference. Overall, by decoupling high-level motion generation from low-level correction, NaviGait offers a more scalable and generalizable approach for achieving dynamic and robust locomotion.
- 中文摘要
强化学习 (RL) 已成为学习双足运动鲁棒控制策略的强大方法。然而,由于奖励设计不直观且复杂,可能很难调整所需的机器人行为。相比之下,离线轨迹优化方法(如混合零动力学)为高维腿系统提供了更可调、更可解释和数学基础的运动计划。然而,这些方法通常对外部扰动等现实世界的干扰仍然脆弱。在这项工作中,我们提出了 NaviGait,这是一个分层框架,它将轨迹优化的结构与 RL 的适应性相结合,以实现稳健和直观的运动控制。NaviGait 利用离线优化步态库,并在它们之间平滑插值,以产生连续的参考运动以响应高级命令。该策略提供关节级和速度命令残差校正,以调节和稳定步态库中的参考轨迹。NaviGait 的一个显着优势是,它通过对轨迹优化的丰富运动先验进行编码,极大地简化了奖励设计,减少了对微调整形项的需求,并实现了更稳定和可解释的学习。我们的实验结果表明,与传统和基于模仿的 RL 相比,NaviGait 可以实现更快的训练,并产生最接近原始参考的运动。总体而言,通过将高级运动生成与低级校正解耦,NaviGait 提供了一种更具可扩展性和通用性的方法来实现动态和稳健的运动。
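A simplified sketch of the two ingredients the abstract above describes: interpolating between precomputed gaits according to a high-level command, and adding a learned residual on top. It makes several assumptions: each library gait is reduced to a single vector of joint targets indexed by forward velocity, the blend is linear, and only a joint-level residual is shown (the velocity-command residual is omitted). All names and numbers are hypothetical.

```python
import bisect


def interpolate_gait(library, command_velocity):
    """Linearly blend the two library gaits that bracket `command_velocity`.

    `library` is a list of (velocity, joint_targets) pairs sorted by velocity.
    Commands outside the library range are clamped to the nearest gait.
    """
    velocities = [v for v, _ in library]
    if command_velocity <= velocities[0]:
        return list(library[0][1])
    if command_velocity >= velocities[-1]:
        return list(library[-1][1])
    hi = bisect.bisect_right(velocities, command_velocity)
    (v_lo, q_lo), (v_hi, q_hi) = library[hi - 1], library[hi]
    alpha = (command_velocity - v_lo) / (v_hi - v_lo)
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(q_lo, q_hi)]


def apply_residual(reference, residual):
    """Add the policy's joint-level correction to the reference targets."""
    return [q + dq for q, dq in zip(reference, residual)]


library = [(0.0, [0.0, 0.2]), (0.5, [0.1, 0.4]), (1.0, [0.3, 0.8])]
reference = interpolate_gait(library, 0.75)
target = apply_residual(reference, [0.01, -0.02])
```

Because the reference already encodes a feasible motion, the residual only has to make small corrections, which is one way to read the abstract's claim about simpler reward design.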
MATH-Beyond: A Benchmark for RL to Expand Beyond the Base Model
MATH-Beyond:RL 超越基本模型的基准
- Authors: Prasanna Mayilvahanan, Ricardo Dominguez-Olmedo, Thaddäus Wiedemer, Wieland Brendel
- Subjects:
Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.11653
- Pdf link: https://arxiv.org/pdf/2510.11653
- Abstract
With the advent of DeepSeek-R1, a new wave of reinforcement learning (RL) methods has emerged that seem to unlock stronger mathematical reasoning. However, a closer look at the open-source ecosystem reveals a critical limitation: with sufficiently many draws (e.g., $\texttt{pass@1024}$), many existing base models already solve nearly all questions on widely used math benchmarks such as MATH-500 and AIME 2024. This suggests that the RL fine-tuning methods prevalent in the LLM reasoning literature largely sharpen existing solution modes rather than discovering entirely new ones. Such sharpening stands in contrast to the broader promise of RL: to foster exploration and to acquire new skills. To move beyond this plateau, we introduce MATH-Beyond (MATH-B), a benchmark deliberately constructed to defeat common open-source models of up to 8B parameters even under large sampling budgets. Improving performance on our benchmark via RL requires methods that learn to reason in ways that go beyond base model capabilities in repeated sampling. Since the problems are drawn from subsets of DAPO-Math-17K and DeepScaleR datasets, they remain topically equivalent to standard high-school math. Validating our premise, RL fine-tuned models such as Nemotron-Research-Reasoning-Qwen-1.5B and DeepScaleR-1.5B-Preview perform poorly on MATH-B at $\texttt{pass@1024}$, showing how existing approaches fall short on tackling harder instances. We hope MATH-B will catalyze exploration-driven RL approaches that elicit deeper reasoning capabilities. We release MATH-B at this https URL.
- 中文摘要
随着 DeepSeek-R1 的出现,新一波强化学习 (RL) 方法已经出现,似乎解锁了更强的数学推理能力。然而,仔细观察开源生态系统就会发现一个关键的局限性:在采样次数足够多的情况下(例如,$\texttt{pass@1024}$),许多现有的基础模型已经能够解决 MATH-500 和 AIME 2024 等广泛使用的数学基准测试的几乎所有问题。这表明,LLM 推理文献中流行的 RL 微调方法在很大程度上只是锐化了现有的解题模式,而不是发现全新的解题模式。这种锐化与 RL 的更广泛承诺形成鲜明对比:促进探索和获得新技能。为了超越这一平台期,我们引入了 MATH-Beyond (MATH-B),这是一个特意构建的基准测试,即使在大量采样预算下也能击败高达 8B 参数的常见开源模型。通过 RL 提高在该基准上的性能,需要能够学会以超越基础模型重复采样能力的方式进行推理的方法。由于这些问题是从 DAPO-Math-17K 和 DeepScaleR 数据集的子集中提取的,因此它们在主题上仍然等同于标准高中数学。作为对我们前提的验证,Nemotron-Research-Reasoning-Qwen-1.5B 和 DeepScaleR-1.5B-Preview 等 RL 微调模型在 $\texttt{pass@1024}$ 下的 MATH-B 上表现不佳,这表明现有方法在处理更困难的实例方面存在不足。我们希望 MATH-B 能够促进探索驱动的 RL 方法,从而引发更深层次的推理能力。我们在此 https URL 上发布 MATH-B。
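For readers unfamiliar with the pass@k metric used in the abstract above, the following sketch shows the widely used unbiased estimator: given n sampled solutions of which c are correct, it is the probability that at least one of k solutions drawn without replacement is correct. The abstract does not say which estimator the authors use, so treating this as their procedure is an assumption.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k draws (without replacement) from
    n samples, c of which are correct, is correct."""
    if k > n:
        raise ValueError("k cannot exceed the number of samples n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With `n = k = 1024`, a problem counts as solved at pass@1024 as soon as a single sample out of 1024 is correct, which is why a benchmark that stays unsolved at that budget is a strong signal.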
SR-Scientist: Scientific Equation Discovery With Agentic AI
SR-Scientist:使用代理人工智能发现科学方程
- Authors: Shijie Xia, Yuhan Sun, Pengfei Liu
- Subjects:
Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.11661
- Pdf link: https://arxiv.org/pdf/2510.11661
- Abstract
Recently, Large Language Models (LLMs) have been applied to scientific equation discovery, leveraging their embedded scientific knowledge for hypothesis generation. However, current methods typically confine LLMs to the role of an equation proposer within search algorithms like genetic programming. In this paper, we present SR-Scientist, a framework that elevates the LLM from a simple equation proposer to an autonomous AI scientist that writes code to analyze data, implements the equation as code, submits it for evaluation, and optimizes the equation based on experimental feedback. Specifically, we wrap the code interpreter into a set of tools for data analysis and equation evaluation. The agent is instructed to optimize the equation by utilizing these tools over a long horizon with minimal human-defined pipelines. Empirical results show that SR-Scientist outperforms baseline methods by an absolute margin of 6% to 35% on datasets covering four science disciplines. Additionally, we demonstrate our method's robustness to noise, the generalization of the discovered equations to out-of-domain data, and their symbolic accuracy. Furthermore, we develop an end-to-end reinforcement learning framework to enhance the agent's capabilities.
- 中文摘要
最近,大型语言模型 (LLM) 已被应用于科学方程发现,利用其嵌入的科学知识来生成假设。然而,当前的方法通常将 LLM 限制在遗传编程等搜索算法中方程提议者的角色。在本文中,我们提出了 SR-Scientist,这是一个框架,它将 LLM 从一个简单的方程提议者提升为一个自主的 AI 科学家,它编写代码来分析数据,将方程实现为代码,提交以供评估,并根据实验反馈优化方程。具体来说,我们将代码解释器包装成一组用于数据分析和方程评估的工具。智能体被指示在较长的时间跨度内利用这些工具来优化方程,并尽量减少人工定义的流程。实证结果表明,SR-Scientist 在涵盖四个科学学科的数据集上比基线方法高出 6% 至 35%(绝对幅度)。此外,我们还展示了我们的方法对噪声的鲁棒性、发现的方程对域外数据的泛化能力及其符号精度。此外,我们还开发了一个端到端的强化学习框架,以增强智能体的能力。
Ego-Vision World Model for Humanoid Contact Planning
人形接触规划的自我视觉世界模型
- Authors: Hang Liu, Yuman Gao, Sangli Teng, Yufeng Chi, Yakun Sophia Shao, Zhongyu Li, Maani Ghaffari, Koushil Sreenath
- Subjects:
Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
- Arxiv link: https://arxiv.org/abs/2510.11682
- Pdf link: https://arxiv.org/pdf/2510.11682
- Abstract
Enabling humanoid robots to exploit physical contact, rather than simply avoid collisions, is crucial for autonomy in unstructured environments. Traditional optimization-based planners struggle with contact complexity, while on-policy reinforcement learning (RL) is sample-inefficient and has limited multi-task ability. We propose a framework combining a learned world model with sampling-based Model Predictive Control (MPC), trained on a demonstration-free offline dataset to predict future outcomes in a compressed latent space. To address sparse contact rewards and sensor noise, the MPC uses a learned surrogate value function for dense, robust planning. Our single, scalable model supports contact-aware tasks, including wall support after perturbation, blocking incoming objects, and traversing height-limited arches, with improved data efficiency and multi-task capability over on-policy RL. Deployed on a physical humanoid, our system achieves robust, real-time contact planning from proprioception and ego-centric depth images. Website: this https URL
- 中文摘要
使人形机器人能够利用身体接触,而不是简单地避免碰撞,对于非结构化环境中的自主性至关重要。传统的基于优化的规划器在接触复杂性方面苦苦挣扎,而策略强化学习 (RL) 的样本效率低下,多任务能力有限。我们提出了一个框架,将学习世界模型与基于采样的模型预测控制(MPC)相结合,在无演示的离线数据集上进行训练,以预测压缩潜在空间中的未来结果。为了解决稀疏接触奖励和传感器噪声,MPC 使用学习的代理值函数进行密集、稳健的规划。我们的单一、可扩展的模型支持接触感知任务,包括扰动后的墙壁支撑、阻挡传入物体和穿越高度限制的拱门,与策略上的 RL 相比,具有更高的数据效率和多任务能力。我们的系统部署在物理人形机器人上,可根据本体感觉和以自我为中心的深度图像实现稳健、实时的接触规划。网站:此 https URL
Boundary-Guided Policy Optimization for Memory-efficient RL of Diffusion Large Language Models
扩散大型语言模型内存高效RL的边界引导策略优化
- Authors: Nianyi Lin, Jiajie Zhang, Lei Hou, Juanzi Li
- Subjects:
Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2510.11683
- Pdf link: https://arxiv.org/pdf/2510.11683
- Abstract
A key challenge in applying reinforcement learning (RL) to diffusion large language models (dLLMs) lies in the intractability of their likelihood functions, which are essential for the RL objective, necessitating corresponding approximation in each training step. While existing methods approximate the log-likelihoods by their evidence lower bounds (ELBOs) via customized Monte Carlo (MC) sampling, the forward computational graphs of all MC samples need to be retained for the gradient computation of non-linear terms in the RL objective, resulting in significant memory overhead. This constraint restricts feasible sample sizes, leading to imprecise likelihood approximations and ultimately distorting the RL objective. To overcome this limitation, we propose Boundary-Guided Policy Optimization (BGPO), a memory-efficient RL algorithm that maximizes a specially constructed lower bound of the ELBO-based objective. This lower bound is carefully designed to satisfy two key properties: (1) Linearity: it is formulated in a linear sum where each term depends only on a single MC sample, thereby enabling gradient accumulation across samples and ensuring constant memory usage; (2) Equivalence: both the value and gradient of this lower bound are equal to those of the ELBO-based objective in on-policy training, making it also an effective approximation for the original RL objective. These properties allow BGPO to adopt a large MC sample size, resulting in more accurate likelihood approximations and improved RL objective estimation, which in turn leads to enhanced performance. Experiments show that BGPO significantly outperforms previous RL algorithms for dLLMs in math problem solving, code generation, and planning tasks.
- 中文摘要
将强化学习(RL)应用于扩散大型语言模型(dLLM)的一个关键挑战在于其似然函数难以处理,而似然函数对于RL目标至关重要,因此需要在每个训练步骤中进行相应的近似。虽然现有方法借助定制的蒙特卡洛(MC)采样,用证据下界(ELBO)来近似对数似然,但为了计算RL目标中非线性项的梯度,需要保留所有MC样本的前向计算图,从而导致大量的内存开销。这种约束限制了可行的样本量,导致似然近似不精确,并最终扭曲RL目标。为了克服这一限制,我们提出了边界引导策略优化(Boundary-Guided Policy Optimization,BGPO),这是一种内存高效的 RL 算法,可最大化基于 ELBO 的目标的一个特殊构造的下界。该下界经过精心设计,以满足两个关键属性:(1)线性:它被表述为线性和,其中每一项仅依赖于单个MC样本,从而可以在样本之间进行梯度累积并确保恒定的内存使用;(2)等价性:在同策略(on-policy)训练中,该下界的值和梯度都等于基于ELBO的目标的值和梯度,使其同样成为原始RL目标的有效近似。这些特性使BGPO能够采用较大的MC样本量,从而实现更准确的似然近似和更好的RL目标估计,进而提高性能。实验表明,BGPO在数学问题求解、代码生成和规划任务上明显优于以前面向dLLM的RL算法。
Representation-Based Exploration for Language Models: From Test-Time to Post-Training
基于表示的语言模型探索:从测试时间到训练后
- Authors: Jens Tuyls, Dylan J. Foster, Akshay Krishnamurthy, Jordan T. Ash
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.11686
- Pdf link: https://arxiv.org/pdf/2510.11686
- Abstract
Reinforcement learning (RL) promises to expand the capabilities of language models, but it is unclear if current RL techniques promote the discovery of novel behaviors, or simply sharpen those already present in the base model. In this paper, we investigate the value of deliberate exploration -- explicitly incentivizing the model to discover novel and diverse behaviors -- and aim to understand how the knowledge in pre-trained models can guide this search. Our main finding is that exploration with a simple, principled, representation-based bonus derived from the pre-trained language model's hidden states significantly improves diversity and pass@k rates -- both for post-training, and in a novel inference-time scaling setting we introduce. For inference-time, exploration with representation-based diversity improves efficiency, consistently improving pass@k rates across a variety of models and reasoning tasks. For example, for Qwen-2.5-14b-Instruct we obtain over 50% improvement in verifier efficiency on almost all tasks. For post-training, we show that integrating this exploration strategy into an RL pipeline improves reasoning performance over that of the initial model and over standard RL post-training. For example, on AIME 2024, our post-trained Qwen-2.5-7b-Instruct's pass@80 matches the pass@256 of GRPO on the same model, demonstrating a 3x improvement in test-time sample efficiency. Overall, our findings suggest that deliberate exploration -- with the right notion of diversity -- is a practical path toward discovery of new behaviors beyond sharpening.
- 中文摘要
强化学习 (RL) 有望扩展语言模型的功能,但尚不清楚当前的 RL 技术是否促进了新行为的发现,或者只是增强了基础模型中已经存在的行为。在本文中,我们研究了刻意探索的价值——明确激励模型发现新颖和多样化的行为——旨在了解预训练模型中的知识如何指导这种搜索。我们的主要发现是,使用从预训练语言模型的隐藏状态中得出的简单、有原则的、基于表示的奖励进行探索,可以显着提高多样性和pass@k率——无论是在训练后还是在我们引入的新型推理时间缩放设置中。对于推理时间,基于表示的多样性探索可以提高效率,持续提高各种模型和推理任务的pass@k率。例如,对于 Qwen-2.5-14b-Instruct,我们在几乎所有任务上的验证器效率都提高了 50% 以上。对于后训练,我们表明将这种探索策略集成到RL管道中可以提高初始模型和标准RL后训练的推理性能。例如,在 AIME 2024 上,我们后训练的 Qwen-2.5-7b-Instruct 的pass@80与同一模型上 GRPO 的pass@256相匹配,表明测试时间样本效率提高了 3 倍。总的来说,我们的研究结果表明,有意识的探索——具有正确的多样性概念——是发现新行为的实用途径,而不仅仅是锐化。
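The abstract above reports its results as pass@k rates, but this digest does not say how those rates are estimated. As a rough illustration only, the sketch below uses the widely used unbiased combinatorial estimator: draw n samples per problem, count the c correct ones, and estimate the chance that at least one of k samples drawn without replacement is correct. The function name `pass_at_k` and the example numbers are assumptions made for illustration, not details taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    k samples drawn without replacement from the n are all incorrect.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k draws contain a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts: 256 samples per problem, 3 of them correct.
    for k in (1, 80, 256):
        print(k, pass_at_k(n=256, c=3, k=k))
```

On the sample-efficiency figure quoted in the abstract: matching pass@256 with only 80 samples corresponds to 256 / 80 = 3.2, which is consistent with the roughly 3x improvement stated there.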
Phys2Real: Fusing VLM Priors with Interactive Online Adaptation for Uncertainty-Aware Sim-to-Real Manipulation
Phys2Real:融合 VLM 先验与交互式在线自适应,实现不确定性感知的仿真到现实操作
- Authors: Maggie Wang, Stephen Tian, Aiden Swann, Ola Shorinwa, Jiajun Wu, Mac Schwager
- Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.11689
- Pdf link: https://arxiv.org/pdf/2510.11689
- Abstract
Learning robotic manipulation policies directly in the real world can be expensive and time-consuming. While reinforcement learning (RL) policies trained in simulation present a scalable alternative, effective sim-to-real transfer remains challenging, particularly for tasks that require precise dynamics. To address this, we propose Phys2Real, a real-to-sim-to-real RL pipeline that combines vision-language model (VLM)-inferred physical parameter estimates with interactive adaptation through uncertainty-aware fusion. Our approach consists of three core components: (1) high-fidelity geometric reconstruction with 3D Gaussian splatting, (2) VLM-inferred prior distributions over physical parameters, and (3) online physical parameter estimation from interaction data. Phys2Real conditions policies on interpretable physical parameters, refining VLM predictions with online estimates via ensemble-based uncertainty quantification. On planar pushing tasks of a T-block with varying center of mass (CoM) and a hammer with an off-center mass distribution, Phys2Real achieves substantial improvements over a domain randomization baseline: 100% vs 79% success rate for the bottom-weighted T-block, 57% vs 23% in the challenging top-weighted T-block, and 15% faster average task completion for hammer pushing. Ablation studies indicate that the combination of VLM and interaction information is essential for success. Project website: this https URL .
- 中文摘要
直接在现实世界中学习机器人操作策略可能既昂贵又耗时。虽然在仿真中训练的强化学习 (RL) 策略提供了一种可扩展的替代方案,但有效的仿真到现实(sim-to-real)迁移仍然具有挑战性,特别是对于需要精确动力学的任务。为了解决这个问题,我们提出了 Phys2Real,这是一种现实到仿真再到现实(real-to-sim-to-real)的 RL 管道,它通过不确定性感知融合,将视觉语言模型 (VLM) 推断的物理参数估计与交互式自适应相结合。我们的方法由三个核心部分组成:(1)使用 3D 高斯泼溅(3D Gaussian splatting)进行高保真几何重建,(2)VLM 推断的物理参数先验分布,以及(3)从交互数据中进行在线物理参数估计。Phys2Real 使策略以可解释的物理参数为条件,并通过基于集成的不确定性量化,用在线估计来修正 VLM 的预测。在质心 (CoM) 可变的 T 形块和质量分布偏离中心的锤子的平面推动任务中,Phys2Real 比域随机化基线取得了显著改进:底部加重的 T 形块成功率为 100% vs 79%,更具挑战性的顶部加重 T 形块成功率为 57% vs 23%,锤子推动的平均任务完成速度提高了 15%。消融研究表明,VLM 与交互信息的结合对于成功至关重要。项目网站:这个 https URL 。
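The abstract describes fusing a VLM-inferred prior over a physical parameter with an online, ensemble-based estimate, but this digest does not give the fusion rule. The sketch below is therefore only one plausible reading, not the paper's method: it treats both sources as Gaussians and combines them by inverse-variance (precision) weighting, so that whichever source is more certain dominates. The function names, the use of the ensemble's sample variance as its uncertainty, and all numbers are assumptions for illustration.

```python
from statistics import mean, variance


def ensemble_estimate(predictions: list[float]) -> tuple[float, float]:
    """Summarise ensemble member predictions as (mean, sample variance)."""
    if len(predictions) < 2:
        raise ValueError("need at least two ensemble members")
    return mean(predictions), variance(predictions)


def fuse_gaussian(prior_mean: float, prior_var: float,
                  obs_mean: float, obs_var: float) -> tuple[float, float]:
    """Precision-weighted fusion of two independent Gaussian estimates."""
    if prior_var <= 0 or obs_var <= 0:
        raise ValueError("variances must be positive")
    precision = 1.0 / prior_var + 1.0 / obs_var
    fused_var = 1.0 / precision
    fused_mean = fused_var * (prior_mean / prior_var + obs_mean / obs_var)
    return fused_mean, fused_var


if __name__ == "__main__":
    # Hypothetical centre-of-mass offset (metres): a broad prior and a
    # tighter estimate from three ensemble members after interaction.
    online_mean, online_var = ensemble_estimate([0.10, 0.12, 0.14])
    print(fuse_gaussian(0.05, 0.01, online_mean, online_var))
```

With this rule the fused variance is never larger than either input variance, so a confident online estimate can override a vague prior while a noisy one leaves the prior largely intact.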
QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs
QeRL:超越效率 -- 面向大语言模型(LLM)的量化增强强化学习
- Authors: Wei Huang, Yi Ge, Shuai Yang, Yicheng Xiao, Huizi Mao, Yujun Lin, Hanrong Ye, Sifei Liu, Ka Chun Cheung, Hongxu Yin, Yao Lu, Xiaojuan Qi, Song Han, Yukang Chen
- Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2510.11696
- Pdf link: https://arxiv.org/pdf/2510.11696
- Abstract
We propose QeRL, a Quantization-enhanced Reinforcement Learning framework for large language models (LLMs). While RL is essential for LLMs' reasoning capabilities, it is resource-intensive, requiring substantial GPU memory and long rollout durations. QeRL addresses these issues by combining NVFP4 quantization with Low-Rank Adaptation (LoRA), accelerating rollout phase of RL while reducing memory overhead. Beyond efficiency, our findings show that quantization noise increases policy entropy, enhancing exploration, and enabling the discovery of better strategies during RL. To further optimize exploration, QeRL introduces an Adaptive Quantization Noise (AQN) mechanism, which dynamically adjusts noise during training. Experiments demonstrate that QeRL delivers over 1.5 times speedup in the rollout phase. Moreover, this is the first framework to enable RL training of a 32B LLM on a single H100 80GB GPU, while delivering overall speedups for RL training. It also achieves faster reward growth and higher final accuracy than 16-bit LoRA and QLoRA, while matching the performance of full-parameter fine-tuning on mathematical benchmarks such as GSM8K (90.8%) and MATH 500 (77.4%) in the 7B model. These results establish QeRL as an efficient and effective framework for RL training in LLMs.
- 中文摘要
我们提出了 QeRL,这是一种用于大型语言模型 (LLM) 的量化增强强化学习框架。虽然 RL 对于 LLM 的推理能力至关重要,但它是资源密集型的,需要大量的 GPU 内存和较长的 rollout(采样生成)时间。QeRL 通过将 NVFP4 量化与低秩自适应 (LoRA) 相结合来解决这些问题,加速 RL 的 rollout 阶段,同时减少内存开销。除了效率之外,我们的研究结果表明,量化噪声增加了策略熵,增强了探索,并使模型能够在 RL 期间发现更好的策略。为了进一步优化探索,QeRL 引入了自适应量化噪声 (AQN) 机制,该机制在训练过程中动态调整噪声。实验表明,QeRL 在 rollout 阶段可实现超过 1.5 倍的加速。此外,这是第一个能够在单个 H100 80GB GPU 上对 32B LLM 进行 RL 训练、同时为 RL 训练带来整体加速的框架。与 16 位 LoRA 和 QLoRA 相比,它还实现了更快的奖励增长和更高的最终精度,同时在 7B 模型上与 GSM8K (90.8%) 和 MATH 500 (77.4%) 等数学基准上的全参数微调性能相当。这些结果表明 QeRL 是一个高效且有效的 LLM RL 训练框架。
Demystifying Reinforcement Learning in Agentic Reasoning
揭秘智能体推理中的强化学习
- Authors: Zhaochen Yu, Ling Yang, Jiaru Zou, Shuicheng Yan, Mengdi Wang
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2510.11701
- Pdf link: https://arxiv.org/pdf/2510.11701
- Abstract
Recently, the emergence of agentic RL has showcased that RL could also effectively improve the agentic reasoning ability of LLMs, yet the key design principles and optimal practices remain unclear. In this work, we conduct a comprehensive and systematic investigation to demystify reinforcement learning in agentic reasoning from three key perspectives: data, algorithm, and reasoning mode. We highlight our key insights: (i) Replacing stitched synthetic trajectories with real end-to-end tool-use trajectories yields a far stronger SFT initialization; high-diversity, model-aware datasets sustain exploration and markedly improve RL performance. (ii) Exploration-friendly techniques are crucial for agentic RL, such as clip higher, overlong reward shaping, and maintaining adequate policy entropy could improve the training efficiency. (iii) A deliberative strategy with fewer tool calls outperforms frequent tool calls or verbose self-reasoning, improving tool efficiency and final accuracy. Together, these simple practices consistently enhance agentic reasoning and training efficiency, achieving strong results on challenging benchmarks with smaller models, and establishing a practical baseline for future agentic RL research. Beyond these empirical insights, we further contribute a high-quality, real end-to-end agentic SFT dataset along with a high-quality RL dataset, and demonstrate the effectiveness of our insights in boosting the agentic reasoning ability of LLMs across four challenging benchmarks, including AIME2024/AIME2025, GPQA-Diamond, and LiveCodeBench-v6. With our recipes, 4B-sized models could also achieve superior agentic reasoning performance compared to 32B-sized models. Code and models: this https URL
- 中文摘要
近来,智能体强化学习(agentic RL)的出现表明,RL 同样可以有效提升 LLM 的智能体推理能力,但其关键设计原则和最佳实践仍不清楚。在这项工作中,我们进行了全面而系统的研究,从数据、算法和推理模式三个关键角度揭开智能体推理中强化学习的神秘面纱。我们重点介绍了我们的关键见解:(i) 用真实的端到端工具使用轨迹替换拼接的合成轨迹会产生强得多的 SFT 初始化;高多样性、模型感知的数据集可以维持探索并显著提高 RL 性能。(ii) 有利于探索的技术对智能体 RL 至关重要,例如 clip higher、超长(overlong)奖励塑形以及保持足够的策略熵,都可以提高训练效率。(iii) 工具调用较少的审慎策略优于频繁的工具调用或冗长的自我推理,从而提高工具效率和最终准确性。这些简单的实践共同持续提升了智能体推理能力和训练效率,使较小的模型在具有挑战性的基准上取得了强劲的成果,并为未来的智能体 RL 研究建立了实用的基线。除了这些实证见解之外,我们还进一步贡献了高质量、真实的端到端智能体 SFT 数据集以及高质量的 RL 数据集,并在四个具有挑战性的基准(包括 AIME2024/AIME2025、GPQA-Diamond 和 LiveCodeBench-v6)上展示了这些见解在提升 LLM 智能体推理能力方面的有效性。通过我们的方案,4B 规模的模型也可以取得优于 32B 规模模型的智能体推理性能。代码和模型:此 https URL
Reinforced sequential Monte Carlo for amortised sampling
用于摊销采样的强化序贯蒙特卡洛
- Authors: Sanghyeok Choi, Sarthak Mittal, Víctor Elvira, Jinkyoo Park, Nikolay Malkin
- Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
- Arxiv link: https://arxiv.org/abs/2510.11711
- Pdf link: https://arxiv.org/pdf/2510.11711
- Abstract
This paper proposes a synergy of amortised and particle-based methods for sampling from distributions defined by unnormalised density functions. We state a connection between sequential Monte Carlo (SMC) and neural sequential samplers trained by maximum-entropy reinforcement learning (MaxEnt RL), wherein learnt sampling policies and value functions define proposal kernels and twist functions. Exploiting this connection, we introduce an off-policy RL training procedure for the sampler that uses samples from SMC -- using the learnt sampler as a proposal -- as a behaviour policy that better explores the target distribution. We describe techniques for stable joint training of proposals and twist functions and an adaptive weight tempering scheme to reduce training signal variance. Furthermore, building upon past attempts to use experience replay to guide the training of neural samplers, we derive a way to combine historical samples with annealed importance sampling weights within a replay buffer. On synthetic multi-modal targets (in both continuous and discrete spaces) and the Boltzmann distribution of alanine dipeptide conformations, we demonstrate improvements in approximating the true distribution as well as training stability compared to both amortised and Monte Carlo methods.
- 中文摘要
本文提出将摊销方法与基于粒子的方法相结合,用于从由未归一化密度函数定义的分布中采样。我们阐明了序贯蒙特卡洛 (SMC) 与通过最大熵强化学习 (MaxEnt RL) 训练的神经序贯采样器之间的联系,其中学习到的采样策略和价值函数定义了提议核(proposal kernel)和扭转函数(twist function)。利用这种联系,我们为采样器引入了一种离策略(off-policy)RL 训练过程,该过程将 SMC 产生的样本(以学习到的采样器作为提议分布)用作行为策略,从而更好地探索目标分布。我们描述了提议分布与扭转函数稳定联合训练的技术,以及用于降低训练信号方差的自适应权重回火方案。此外,基于过去使用经验回放来指导神经采样器训练的尝试,我们推导出了一种在回放缓冲区内将历史样本与退火重要性采样权重相结合的方法。在合成多模态目标分布(包括连续和离散空间)以及丙氨酸二肽构象的玻尔兹曼分布上,我们证明了与摊销方法和蒙特卡洛方法相比,本方法在逼近真实分布和训练稳定性方面均有改进。
Keyword: diffusion policy
Enhancing Diffusion Policy with Classifier-Free Guidance for Temporal Robotic Tasks
通过无分类器指导增强扩散策略,用于时态机器人任务
- Authors: Yuang Lu, Song Wang, Xiao Han, Xuri Zhang, Yucong Wu, Zhicheng He
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2510.09786
- Pdf link: https://arxiv.org/pdf/2510.09786
- Abstract
Temporal sequential tasks challenge humanoid robots, as existing Diffusion Policy (DP) and Action Chunking with Transformers (ACT) methods often lack temporal context, resulting in local optima traps and excessive repetitive actions. To address these issues, this paper introduces a Classifier-Free Guidance-Based Diffusion Policy (CFG-DP), a novel framework to enhance DP by integrating Classifier-Free Guidance (CFG) with conditional and unconditional models. Specifically, CFG leverages timestep inputs to track task progression and ensure precise cycle termination. It dynamically adjusts action predictions based on task phase, using a guidance factor tuned to balance temporal coherence and action accuracy. Real-world experiments on a humanoid robot demonstrate high success rates and minimal repetitive actions. Furthermore, we assessed the model's ability to terminate actions and examined how different components and parameter adjustments affect its performance. This framework significantly enhances deterministic control and execution reliability for sequential robotic tasks.
- 中文摘要
时间顺序任务对人形机器人提出了挑战,因为现有的扩散策略(DP)和Action Chunking with Transformers(ACT)方法通常缺乏时间上下文,导致局部最优陷阱和过度重复动作。为了解决这些问题,本文引入了基于无分类器指导的扩散策略(CFG-DP),这是一种通过将无分类器指导(CFG)与条件和无条件模型相结合来增强DP的新框架。具体来说,CFG 利用时间步长输入来跟踪任务进度并确保精确的周期终止。它根据任务阶段动态调整动作预测,使用调整的指导因子来平衡时间连贯性和动作准确性。人形机器人的真实实验表明,成功率高,重复动作最少。此外,我们评估了模型终止动作的能力,并检查了不同的组件和参数调整如何影响其性能。该框架显着增强了顺序机器人任务的确定性控制和执行可靠性。
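The abstract says CFG-DP combines a conditional and an unconditional model through a tuned guidance factor, without giving the formula in this digest. As a hedged sketch, the snippet below shows the standard classifier-free guidance combination, in which the guided prediction extrapolates from the unconditional output toward the conditional one. Whether CFG-DP uses exactly this form is an assumption; the function name `cfg_combine` and the toy vectors are hypothetical.

```python
def cfg_combine(uncond: list[float], cond: list[float],
                guidance_scale: float) -> list[float]:
    """Standard classifier-free guidance: u + w * (c - u), element-wise.

    guidance_scale = 0 returns the unconditional prediction, 1 returns the
    conditional one, and values above 1 push further toward the condition.
    """
    if len(uncond) != len(cond):
        raise ValueError("predictions must have the same length")
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


if __name__ == "__main__":
    # Toy two-dimensional noise predictions from the two models.
    eps_uncond = [0.0, 1.0]
    eps_cond = [1.0, 3.0]
    for w in (0.0, 1.0, 2.0):
        print(w, cfg_combine(eps_uncond, eps_cond, w))
```

In a diffusion policy this combination would be applied to the denoiser output at every denoising step, with the condition here being the timestep or task-phase input the abstract mentions.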
Understanding Sampler Stochasticity in Training Diffusion Models for RLHF
了解 RLHF 训练扩散模型中的采样器随机性
- Authors: Jiayuan Sheng, Hanyang Zhao, Haoxian Chen, David D. Yao, Wenpin Tang
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
- Arxiv link: https://arxiv.org/abs/2510.10767
- Pdf link: https://arxiv.org/pdf/2510.10767
- Abstract
Reinforcement Learning from Human Feedback (RLHF) is increasingly used to fine-tune diffusion models, but a key challenge arises from the mismatch between stochastic samplers used during training and deterministic samplers used during inference. In practice, models are fine-tuned using stochastic SDE samplers to encourage exploration, while inference typically relies on deterministic ODE samplers for efficiency and stability. This discrepancy induces a reward gap, raising concerns about whether high-quality outputs can be expected during inference. In this paper, we theoretically characterize this reward gap and provide non-vacuous bounds for general diffusion models, along with sharper convergence rates for Variance Exploding (VE) and Variance Preserving (VP) Gaussian models. Methodologically, we adopt the generalized denoising diffusion implicit models (gDDIM) framework to support arbitrarily high levels of stochasticity, preserving data marginals throughout. Empirically, our findings through large-scale experiments on text-to-image models using denoising diffusion policy optimization (DDPO) and mixed group relative policy optimization (MixGRPO) validate that reward gaps consistently narrow over training, and ODE sampling quality improves when models are updated using higher-stochasticity SDE training.
- 中文摘要
人类反馈强化学习 (RLHF) 越来越多地用于微调扩散模型,但一个关键挑战是训练期间使用的随机采样器和推理期间使用的确定性采样器之间的不匹配。在实践中,使用随机 SDE 采样器对模型进行微调以鼓励探索,而推理通常依赖于确定性 ODE 采样器来提高效率和稳定性。这种差异导致了奖励差距,引发了人们对推理过程中是否可以预期高质量输出的担忧。在本文中,我们从理论上表征了这种奖励差距,并为一般扩散模型提供了非空边界,并为方差爆炸 (VE) 和方差保持 (VP) 高斯模型提供了更尖锐的收敛率。在方法论上,我们采用广义去噪扩散隐式模型(gDDIM)框架来支持任意高水平的随机性,并始终保留数据边缘。根据经验,我们通过使用去噪扩散策略优化(DDPO)和混合组相对策略优化(MixGRPO)对文本到图像模型进行大规模实验的结果验证了奖励差距在训练过程中持续缩小,并且当使用更高随机性的SDE训练更新模型时,ODE采样质量会提高。