Generated: 2025-10-16 16:30:26 (UTC+8); arXiv release time: 2025-10-16 20:00 EDT (2025-10-17 08:00 UTC+8)
35 related papers today
Keyword: reinforcement learning
Energy-Guided Diffusion Sampling for Long-Term User Behavior Prediction in Reinforcement Learning-based Recommendation
- Authors: Xiaocong Chen, Siyu Wang, Lina Yao
- Subjects: Information Retrieval (cs.IR)
- Arxiv link: https://arxiv.org/abs/2510.12815
- Pdf link: https://arxiv.org/pdf/2510.12815
- Abstract
Reinforcement learning-based recommender systems (RL4RS) have gained attention for their ability to adapt to dynamic user preferences. However, these systems face challenges, particularly in offline settings, where data inefficiency and reliance on pre-collected trajectories limit their broader applicability. While offline reinforcement learning methods leverage extensive datasets to address these issues, they often struggle with noisy data and fail to capture long-term user preferences, resulting in suboptimal recommendation policies. To overcome these limitations, we propose Diffusion-enhanced Actor-Critic for Offline RL4RS (DAC4Rec), a novel framework that integrates diffusion processes with reinforcement learning to model complex user preferences more effectively. DAC4Rec leverages the denoising capabilities of diffusion models to enhance the robustness of offline RL algorithms and incorporates a Q-value-guided policy optimization strategy to better handle suboptimal trajectories. Additionally, we introduce an energy-based sampling strategy to reduce randomness during recommendation generation, ensuring more targeted and reliable outcomes. We validate the effectiveness of DAC4Rec through extensive experiments on six real-world offline datasets and in an online simulation environment, demonstrating its ability to optimize long-term user preferences. Furthermore, we show that the proposed diffusion policy can be seamlessly integrated into other commonly used RL algorithms in RL4RS, highlighting its versatility and wide applicability.
Maximum In-Support Return Modeling for Dynamic Recommendation with Language Model Prior
- Authors: Xiaocong Chen, Siyu Wang, Lina Yao
- Subjects: Information Retrieval (cs.IR)
- Arxiv link: https://arxiv.org/abs/2510.12816
- Pdf link: https://arxiv.org/pdf/2510.12816
- Abstract
Reinforcement Learning-based recommender systems (RLRS) offer an effective way to handle sequential recommendation tasks but often face difficulties in real-world settings, where user feedback data can be sub-optimal or sparse. In this paper, we introduce MDT4Rec, an offline RLRS framework that builds on the Decision Transformer (DT) to address two major challenges: learning from sub-optimal histories and representing complex user-item interactions. First, MDT4Rec shifts the trajectory stitching procedure from the training phase to action inference, allowing the system to shorten its historical context when necessary and thereby ignore negative or unsuccessful past experiences. Second, MDT4Rec initializes DT with a pre-trained large language model (LLM) for knowledge transfer, replaces linear embedding layers with Multi-Layer Perceptrons (MLPs) for more flexible representations, and employs Low-Rank Adaptation (LoRA) to efficiently fine-tune only a small subset of parameters. We evaluate MDT4Rec on five public datasets and in an online simulation environment, demonstrating that it outperforms existing methods.
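The LoRA step mentioned in the MDT4Rec abstract can be illustrated with a minimal sketch. It assumes the standard LoRA parameterization (a frozen weight W plus a scaled low-rank product B A, with B initialized to zero); the function names, dimensions, and hyperparameters are hypothetical and are not taken from the paper.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * (B @ A), where r is the adapter rank."""
    r = len(a)  # A has shape (r, d_in); B has shape (d_out, r)
    delta = matmul(b, a)
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

random.seed(0)
d_out, d_in, rank = 8, 8, 2  # toy sizes, chosen only for illustration
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]     # frozen
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]   # trainable
B = [[0.0] * rank for _ in range(d_out)]                                  # trainable, zero init

W_eff = lora_effective_weight(W, A, B, alpha=4.0)
trainable_params = rank * (d_in + d_out)  # only A and B receive gradients
frozen_params = d_in * d_out
```

Because B starts at zero, the adapted layer initially reproduces the pre-trained weights exactly, and only the `rank * (d_in + d_out)` adapter parameters are updated during fine-tuning.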
Pruning Cannot Hurt Robustness: Certified Trade-offs in Reinforcement Learning
- Authors: James Pedley, Benjamin Etheridge, Stephen J. Roberts, Francesco Quinzan
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2510.12939
- Pdf link: https://arxiv.org/pdf/2510.12939
- Abstract
Reinforcement learning (RL) policies deployed in real-world environments must remain reliable under adversarial perturbations. At the same time, modern deep RL agents are heavily over-parameterized, raising costs and fragility concerns. While pruning has been shown to improve robustness in supervised learning, its role in adversarial RL remains poorly understood. We develop the first theoretical framework for certified robustness under pruning in state-adversarial Markov decision processes (SA-MDPs). For Gaussian and categorical policies with Lipschitz networks, we prove that element-wise pruning can only tighten certified robustness bounds; pruning never makes the policy less robust. Building on this, we derive a novel three-term regret decomposition that disentangles clean-task performance, pruning-induced performance loss, and robustness gains, exposing a fundamental performance-robustness frontier. Empirically, we evaluate magnitude and micro-pruning schedules on continuous-control benchmarks with strong policy-aware adversaries. Across tasks, pruning consistently uncovers reproducible "sweet spots" at moderate sparsity levels, where robustness improves substantially without harming, and sometimes even enhancing, clean performance. These results position pruning not merely as a compression tool but as a structural intervention for robust RL.
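The intuition that element-wise pruning cannot loosen a Lipschitz-style bound can be sketched as follows. This is an illustration under stated assumptions, not the paper's certificate: it uses the product of induced infinity-norms (maximum absolute row sums) as a stand-in upper bound, which is valid for 1-Lipschitz activations such as ReLU, and all function names and weights are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero the smallest-magnitude entries; ties at the threshold are also zeroed."""
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(sparsity * len(flat))
    if k == 0:
        return [row[:] for row in weights]
    threshold = flat[k - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]

def inf_norm(weights):
    """Induced infinity-norm of a matrix: the maximum absolute row sum."""
    return max(sum(abs(w) for w in row) for row in weights)

def lipschitz_upper_bound(layers):
    """Product of per-layer norms, an upper bound on the network's Lipschitz constant."""
    bound = 1.0
    for w in layers:
        bound *= inf_norm(w)
    return bound

# Toy two-layer network (made-up weights).
layers = [
    [[0.5, -1.2, 0.1], [0.05, 0.8, -0.3]],
    [[1.0, -0.2], [0.4, 0.02]],
]
pruned = [magnitude_prune(w, sparsity=0.5) for w in layers]

dense_bound = lipschitz_upper_bound(layers)
pruned_bound = lipschitz_upper_bound(pruned)
```

Zeroing an entry can only shrink an absolute row sum, so `pruned_bound` never exceeds `dense_bound` under this norm. The same monotonicity does not hold for every matrix norm (the spectral norm is a counterexample), which is why the choice of norm is flagged as an assumption here.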
Epistemic-aware Vision-Language Foundation Model for Fetal Ultrasound Interpretation
- Authors: Xiao He, Huangxuan Zhao, Guojia Wan, Wei Zhou, Yanxing Liu, Juhua Liu, Yongchao Xu, Yong Luo, Dacheng Tao, Bo Du
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multimedia (cs.MM)
- Arxiv link: https://arxiv.org/abs/2510.12953
- Pdf link: https://arxiv.org/pdf/2510.12953
- Abstract
Recent medical vision-language models have shown promise on tasks such as VQA, report generation, and anomaly detection. However, most are adapted to structured adult imaging and underperform in fetal ultrasound, which poses challenges of multi-view image reasoning, numerous diseases, and image diversity. To bridge this gap, we introduce FetalMind, a medical AI system tailored to fetal ultrasound for both report generation and diagnosis. Guided by clinical workflow, we propose Salient Epistemic Disentanglement (SED), which injects an expert-curated bipartite graph into the model to decouple view-disease associations and to steer preference selection along clinically faithful steps via reinforcement learning. This design mitigates variability across diseases and heterogeneity across views, reducing learning bottlenecks while aligning the model's inference with obstetric practice. To train FetalMind at scale, we curate the FetalSigma-1M dataset, the first large-scale fetal ultrasound report corpus, comprising 20K reports from twelve medical centers, addressing the scarcity of domain data. Extensive experiments show that FetalMind outperforms open- and closed-source baselines across all gestational stages, achieving +14% average gains and +61.2% higher accuracy on critical conditions while remaining efficient, stable, and scalable. Project Page: this https URL.
DeepPlanner: Scaling Planning Capability for Deep Research Agents via Advantage Shaping
- Authors: Wei Fan, Wenlin Yao, Zheng Li, Feng Yao, Xin Liu, Liang Qiu, Qingyu Yin, Yangqiu Song, Bing Yin
- Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2510.12979
- Pdf link: https://arxiv.org/pdf/2510.12979
- Abstract
Large language models (LLMs) augmented with multi-step reasoning and action generation abilities have shown promise in leveraging external tools to tackle complex tasks that require long-horizon planning. However, existing approaches either rely on implicit planning in the reasoning stage or introduce explicit planners without systematically addressing how to optimize the planning stage. As evidence, we observe that under vanilla reinforcement learning (RL), planning tokens exhibit significantly higher entropy than other action tokens, revealing uncertain decision points that remain under-optimized. To address this, we propose DeepPlanner, an end-to-end RL framework that effectively enhances the planning capabilities of deep research agents. Our approach shapes token-level advantage with an entropy-based term to allocate larger updates to high entropy tokens, and selectively upweights sample-level advantages for planning-intensive rollouts. Extensive experiments across seven deep research benchmarks demonstrate that DeepPlanner improves planning quality and achieves state-of-the-art results under a substantially lower training budget.
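The entropy-based advantage shaping described in the DeepPlanner abstract might look roughly like the sketch below. The abstract does not give the formula, so the multiplicative form `(1 + alpha * entropy)`, the coefficient `alpha`, and the per-rollout `rollout_weight` are all assumptions made for illustration; the function names are hypothetical.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def shape_advantages(advantages, token_dists, alpha=0.5, rollout_weight=1.0):
    """Scale each token's advantage by (1 + alpha * entropy), then by a rollout weight.

    Assumed form: the scaling preserves the sign of the advantage and enlarges
    the update for high-entropy tokens. rollout_weight > 1 would upweight a
    planning-intensive rollout as a whole.
    """
    shaped = []
    for adv, dist in zip(advantages, token_dists):
        factor = 1.0 + alpha * token_entropy(dist)
        shaped.append(rollout_weight * factor * adv)
    return shaped

uncertain = [0.25, 0.25, 0.25, 0.25]   # high-entropy (e.g. planning) token
confident = [0.97, 0.01, 0.01, 0.01]   # low-entropy token
shaped = shape_advantages([1.0, 1.0, -1.0], [uncertain, confident, uncertain])
```

With equal raw advantages, the high-entropy token receives the larger shaped advantage, and a negative advantage stays negative but grows in magnitude.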
Escaping Local Optima in the Waddington Landscape: A Multi-Stage TRPO-PPO Approach for Single-Cell Perturbation Analysis
- Authors: Francis Boabang, Samuel Asante Gyamerah
- Subjects: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
- Arxiv link: https://arxiv.org/abs/2510.13018
- Pdf link: https://arxiv.org/pdf/2510.13018
- Abstract
Modeling cellular responses to genetic and chemical perturbations remains a central challenge in single-cell biology. Existing data-driven frameworks have advanced perturbation prediction through variational autoencoders, chemically conditioned autoencoders, and large-scale transformer pretraining. However, these models are prone to local optima in the nonconvex Waddington landscape of cell fate decisions, where poor initialization can trap trajectories in spurious lineages or implausible differentiation outcomes. While executable gene regulatory networks complement these approaches, automated design frameworks incorporate biological priors through multi-agent optimization. Yet, an approach that is completely data-driven with well-designed initialization to escape local optima and converge to a proper lineage remains elusive. In this work, we introduce a multistage reinforcement learning algorithm tailored for single-cell perturbation modeling. We first compute an explicit natural gradient update using Fisher-vector products and a conjugate gradient solver, scaled by a KL trust-region constraint to provide a safe, curvature-aware first step for the policy. Starting with these preconditioned parameters, we then apply a second phase of proximal policy optimization (PPO) with clipped surrogates, exploiting minibatch efficiency to refine the policy. We demonstrate that this initialization substantially improves generalization on single-cell RNA sequencing (scRNA-seq) and single-cell ATAC sequencing (scATAC-seq) perturbation analysis.
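The two stages named in this abstract follow well-known building blocks, which can be sketched as below: a conjugate gradient solve that needs only Fisher-vector products, a step rescaled to a KL trust region, and the PPO clipped surrogate. The 2x2 Fisher matrix, the gradient, and all function names are made up for illustration; this is not the authors' implementation.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only Fisher-vector products fvp(v) = F v."""
    x = [0.0] * len(g)
    r = list(g)   # residual g - F x (x starts at zero)
    p = list(g)   # search direction
    rs = dot(r, r)
    for _ in range(iters):
        if rs < tol:
            break
        fp = fvp(p)
        step = rs / dot(p, fp)
        x = [xi + step * pi for xi, pi in zip(x, p)]
        r = [ri - step * fi for ri, fi in zip(r, fp)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

def natural_gradient_step(fvp, g, max_kl=0.01):
    """Return sqrt(2 * max_kl / (x^T F x)) * x, where x approximates F^{-1} g."""
    x = conjugate_gradient(fvp, g)
    scale = (2.0 * max_kl / dot(x, fvp(x))) ** 0.5
    return [scale * xi for xi in x]

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate for one sample."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# Toy symmetric positive-definite "Fisher" matrix and policy gradient.
F = [[2.0, 0.5], [0.5, 1.0]]
fvp = lambda v: [dot(row, v) for row in F]
g = [1.0, -1.0]

first_step = natural_gradient_step(fvp, g, max_kl=0.01)
quadratic_kl = 0.5 * dot(first_step, fvp(first_step))
```

By construction the second-order KL estimate of the first step, `quadratic_kl`, equals `max_kl`; the PPO phase would then start from the parameters obtained after applying `first_step`.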
Repairing Reward Functions with Human Feedback to Mitigate Reward Hacking
- Authors: Stephane Hatgis-Kessell, Logan Mondal Bhamidipaty, Emma Brunskill
- Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2510.13036
- Pdf link: https://arxiv.org/pdf/2510.13036
- Abstract
Human-designed reward functions for reinforcement learning (RL) agents are frequently misaligned with the humans' true, unobservable objectives, and thus act only as proxies. Optimizing for a misspecified proxy reward function often induces reward hacking, resulting in a policy misaligned with the human's true objectives. An alternative is to perform RL from human feedback, which involves learning a reward function from scratch by collecting human preferences over pairs of trajectories. However, building such datasets is costly. To address the limitations of both approaches, we propose Preference-Based Reward Repair (PBRR): an automated iterative framework that repairs a human-specified proxy reward function by learning an additive, transition-dependent correction term from preferences. A manually specified reward function can yield policies that are highly suboptimal under the ground-truth objective, yet corrections on only a few transitions may suffice to recover optimal performance. To identify and correct for those transitions, PBRR uses a targeted exploration strategy and a new preference-learning objective. We prove in tabular domains PBRR has a cumulative regret that matches, up to constants, that of prior preference-based RL methods. In addition, on a suite of reward-hacking benchmarks, PBRR consistently outperforms baselines that learn a reward function from scratch from preferences or modify the proxy reward function using other approaches, requiring substantially fewer preferences to learn high performing policies.
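The idea of repairing a proxy reward with an additive, transition-dependent correction learned from preferences can be sketched in a tabular toy. The sketch assumes a standard Bradley-Terry preference model and plain gradient ascent; PBRR's own preference-learning objective and exploration strategy are not specified in the abstract, and every name and value below is hypothetical.

```python
import math
from collections import defaultdict

def repaired_return(trajectory, proxy_reward, correction):
    """Sum of proxy reward plus learned correction over (s, a, s') transitions."""
    return sum(proxy_reward(s, a, s2) + correction[(s, a, s2)]
               for s, a, s2 in trajectory)

def preference_prob(traj_a, traj_b, proxy_reward, correction):
    """Bradley-Terry probability that traj_a is preferred over traj_b."""
    diff = (repaired_return(traj_a, proxy_reward, correction)
            - repaired_return(traj_b, proxy_reward, correction))
    return 1.0 / (1.0 + math.exp(-diff))

def update_correction(correction, preferred, rejected, proxy_reward, lr=0.5):
    """One gradient-ascent step on log P(preferred is chosen over rejected)."""
    p = preference_prob(preferred, rejected, proxy_reward, correction)
    for t in preferred:
        correction[t] += lr * (1.0 - p)
    for t in rejected:
        correction[t] -= lr * (1.0 - p)

# Toy proxy reward that wrongly pays for a "shortcut" action.
proxy = lambda s, a, s2: 1.0 if a == "shortcut" else 0.0
hack = [("s0", "shortcut", "s1")]
good = [("s0", "work", "s1")]

correction = defaultdict(float)  # zero correction everywhere by default
before = preference_prob(good, hack, proxy, correction)
for _ in range(20):
    update_correction(correction, good, hack, proxy)  # human prefers `good`
after = preference_prob(good, hack, proxy, correction)
```

Only the two transitions that appear in the compared trajectories receive a nonzero correction, which mirrors the abstract's point that fixing a few transitions may suffice.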
Achieving Logarithmic Regret in KL-Regularized Zero-Sum Markov Games
- Authors: Anupam Nayak, Tong Yang, Osman Yagan, Gauri Joshi, Yuejie Chi
- Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Optimization and Control (math.OC); Machine Learning (stat.ML)
- Arxiv link: https://arxiv.org/abs/2510.13060
- Pdf link: https://arxiv.org/pdf/2510.13060
- Abstract
Reverse Kullback-Leibler (KL) divergence-based regularization with respect to a fixed reference policy is widely used in modern reinforcement learning to preserve the desired traits of the reference policy and sometimes to promote exploration (using a uniform reference policy, known as entropy regularization). Beyond serving as a mere anchor, the reference policy can also be interpreted as encoding prior knowledge about good actions in the environment. In the context of alignment, recent game-theoretic approaches have leveraged KL regularization with pretrained language models as reference policies, achieving notable empirical success in self-play methods. Despite these advances, the theoretical benefits of KL regularization in game-theoretic settings remain poorly understood. In this work, we develop and analyze algorithms that provably achieve improved sample efficiency under KL regularization. We study both two-player zero-sum Matrix games and Markov games: for Matrix games, we propose OMG, an algorithm based on best response sampling with optimistic bonuses, and extend this idea to Markov games through the algorithm SOMG, which also uses best response sampling and a novel concept of superoptimistic bonuses. Both algorithms achieve a logarithmic regret in $T$ that scales inversely with the KL regularization strength $\beta$, in addition to the standard $\widetilde{\mathcal{O}}(\sqrt{T})$ regret independent of $\beta$, which is attained in both regularized and unregularized settings.
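For readers unfamiliar with KL regularization toward a reference policy, the sketch below shows the standard single-player closed form: the maximizer of expected payoff minus $\beta$ times the reverse KL to the reference is proportional to the reference policy times exp(payoff / $\beta$). This is background only, not the OMG or SOMG algorithm; the payoff vector and function names are hypothetical.

```python
import math

def kl_regularized_policy(q_values, ref_policy, beta):
    """Maximizer of E_pi[q] - beta * KL(pi || ref): pi(a) ∝ ref(a) * exp(q(a) / beta).

    Assumes every entry of ref_policy is strictly positive.
    """
    logits = [math.log(p) + q / beta for q, p in zip(q_values, ref_policy)]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return [w / z for w in weights]

def regularized_objective(policy, q_values, ref_policy, beta):
    """Expected payoff minus beta times the reverse KL to the reference policy."""
    expected_q = sum(p * q for p, q in zip(policy, q_values))
    kl = sum(p * math.log(p / r) for p, r in zip(policy, ref_policy) if p > 0.0)
    return expected_q - beta * kl

q = [1.0, 0.0, -1.0]     # payoffs against a fixed opponent strategy (made up)
ref = [0.2, 0.5, 0.3]    # reference policy
beta = 0.5

pi_star = kl_regularized_policy(q, ref, beta)
best_value = regularized_objective(pi_star, q, ref, beta)
```

A larger `beta` keeps the solution closer to `ref`, while a smaller `beta` moves it toward the greedy action.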
DriveCritic: Towards Context-Aware, Human-Aligned Evaluation for Autonomous Driving with Vision-Language Models
- Authors: Jingyu Song, Zhenxin Li, Shiyi Lan, Xinglong Sun, Nadine Chang, Maying Shen, Joshua Chen, Katherine A. Skinner, Jose M. Alvarez
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2510.13108
- Pdf link: https://arxiv.org/pdf/2510.13108
- Abstract
Benchmarking autonomous driving planners to align with human judgment remains a critical challenge, as state-of-the-art metrics like the Extended Predictive Driver Model Score (EPDMS) lack context awareness in nuanced scenarios. To address this, we introduce DriveCritic, a novel framework featuring two key contributions: the DriveCritic dataset, a curated collection of challenging scenarios where context is critical for correct judgment and annotated with pairwise human preferences, and the DriveCritic model, a Vision-Language Model (VLM) based evaluator. Fine-tuned using a two-stage supervised and reinforcement learning pipeline, the DriveCritic model learns to adjudicate between trajectory pairs by integrating visual and symbolic context. Experiments show DriveCritic significantly outperforms existing metrics and baselines in matching human preferences and demonstrates strong context awareness. Overall, our work provides a more reliable, human-aligned foundation for evaluating autonomous driving systems.
EvoTest: Evolutionary Test-Time Learning for Self-Improving Agentic Systems
- Authors: Yufei He, Juncheng Liu, Yue Liu, Yibo Li, Tri Cao, Zhiyuan Hu, Xinxing Xu, Bryan Hooi
- Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2510.13220
- Pdf link: https://arxiv.org/pdf/2510.13220
- Abstract
A fundamental limitation of current AI agents is their inability to learn complex skills on the fly at test time, often behaving like "clever but clueless interns" in novel environments. This severely limits their practical utility. To systematically measure and drive progress on this challenge, we first introduce the Jericho Test-Time Learning (J-TTL) benchmark. J-TTL is a new evaluation setup where an agent must play the same game for several consecutive episodes, attempting to improve its performance from one episode to the next. On J-TTL, we find that existing adaptation methods like reflection, memory, or reinforcement learning struggle. To address the challenges posed by our benchmark, we present EvoTest, an evolutionary test-time learning framework that improves an agent without any fine-tuning or gradients by evolving the entire agentic system after every episode. EvoTest has two roles: the Actor Agent, which plays the game, and the Evolver Agent, which analyzes the episode transcript to propose a revised configuration for the next run. This configuration rewrites the prompt, updates memory by logging effective state-action choices, tunes hyperparameters, and learns the tool-use routines. On our J-TTL benchmark, EvoTest consistently increases performance, outperforming not only reflection and memory-only baselines but also more complex online fine-tuning methods. Notably, our method is the only one capable of winning two games (Detective and Library), while all baselines fail to win any.
Altruistic Ride Sharing: A Community-Driven Approach to Short-Distance Mobility
- Authors: Divyanshu Singh, Ashman Mehra, Snehanshu Saha, Santonu Sarkar
- Subjects: Multiagent Systems (cs.MA); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2510.13227
- Pdf link: https://arxiv.org/pdf/2510.13227
- Abstract
Urban mobility faces persistent challenges of congestion and fuel consumption, specifically when people choose a private, point-to-point commute option. Profit-driven ride-sharing platforms prioritize revenue over fairness and sustainability. This paper introduces Altruistic Ride-Sharing (ARS), a decentralized, peer-to-peer mobility framework where participants alternate between driver and rider roles based on altruism points rather than monetary incentives. The system integrates multi-agent reinforcement learning (MADDPG) for dynamic ride-matching, game-theoretic equilibrium guarantees for fairness, and a population model to sustain long-term balance. Using real-world New York City taxi data, we demonstrate that ARS reduces travel distance and emissions, increases vehicle utilization, and promotes equitable participation compared to both no-sharing and optimization-based baselines. These results establish ARS as a scalable, community-driven alternative to conventional ride-sharing, aligning individual behavior with collective urban sustainability goals.
Beyond Static LLM Policies: Imitation-Enhanced Reinforcement Learning for Recommendation
- Authors: Yi Zhang, Lili Xie, Ruihong Qiu, Jiajun Liu, Sen Wang
- Subjects: Information Retrieval (cs.IR)
- Arxiv link: https://arxiv.org/abs/2510.13229
- Pdf link: https://arxiv.org/pdf/2510.13229
- Abstract
Recommender systems (RecSys) have become critical tools for enhancing user engagement by delivering personalized content across diverse digital platforms. Recent advancements in large language models (LLMs) demonstrate significant potential for improving RecSys, primarily due to their exceptional generalization capabilities and sophisticated contextual understanding, which facilitate the generation of flexible and interpretable recommendations. However, the direct deployment of LLMs as primary recommendation policies presents notable challenges, including persistent latency issues stemming from frequent API calls and inherent model limitations such as hallucinations and biases. To address these issues, this paper proposes a novel offline reinforcement learning (RL) framework that leverages imitation learning from LLM-generated trajectories. Specifically, inverse reinforcement learning is employed to extract robust reward models from LLM demonstrations. This approach negates the need for LLM fine-tuning, thereby substantially reducing computational overhead. Simultaneously, the RL policy is guided by the cumulative rewards derived from these demonstrations, effectively transferring the semantic insights captured by the LLM. Comprehensive experiments conducted on two benchmark datasets validate the effectiveness of the proposed method, demonstrating superior performance when compared against state-of-the-art RL-based and in-context learning baselines. The code can be found at this https URL.
SAJA: A State-Action Joint Attack Framework on Multi-Agent Deep Reinforcement Learning
- Authors: Weiqi Guo, Guanjun Liu, Ziyuan Zhou
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.13262
- Pdf link: https://arxiv.org/pdf/2510.13262
- Abstract
Multi-Agent Deep Reinforcement Learning (MADRL) has shown potential for cooperative and competitive tasks such as autonomous driving and strategic gaming. However, models trained by MADRL are vulnerable to adversarial perturbations on states and actions. Therefore, it is essential to investigate the robustness of MADRL models from an attack perspective. Existing studies focus on either state-only attacks or action-only attacks, but do not consider how to effectively combine them. Simply combining state and action perturbations, such as randomly perturbing states and actions, does not exploit their potential synergistic effects. In this paper, we propose the State-Action Joint Attack (SAJA) framework, which has a good synergistic effect. SAJA consists of two important phases: (1) in the state attack phase, a multi-step gradient ascent method utilizes both the actor network and the critic network to compute an adversarial state, and (2) in the action attack phase, based on the perturbed state, a second gradient ascent uses the critic network to craft the final adversarial action. Additionally, a heuristic regularizer measuring the distance between the perturbed actions and the original clean ones is added into the loss function to enhance the effectiveness of the critic's guidance. We evaluate SAJA in the Multi-Agent Particle Environment (MPE), demonstrating that (1) it outperforms and is more stealthy than state-only or action-only attacks, and (2) existing state or action defense methods cannot defend against its attacks.
Beyond Correctness: Rewarding Faithful Reasoning in Retrieval-Augmented Generation
- Authors: Zhichao Xu, Zongyu Wu, Yun Zhou, Aosong Feng, Kang Zhou, Sangmin Woo, Kiran Ramnath, Yijun Tian, Xuan Qi, Weikang Qiu, Lin Lee Cheong, Haibo Ding
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2510.13272
- Pdf link: https://arxiv.org/pdf/2510.13272
- Abstract
Inspired by the success of reinforcement learning (RL) in Large Language Model (LLM) training for domains like math and code, recent works have begun exploring how to train LLMs to use search engines more effectively as tools for retrieval-augmented generation. Although these methods achieve performance improvement across QA benchmarks, many prioritize final answer correctness while overlooking the quality of intermediate reasoning steps, which may lead to chain-of-thought unfaithfulness. In this paper, we first introduce a comprehensive evaluation framework for evaluating RL-based search agents, covering three distinct faithfulness metrics: information-think faithfulness, think-answer faithfulness, and think-search faithfulness. Our evaluations reveal that a prototypical RL-based search agent, Search-R1, has significant room for improvement in this regard. To foster faithful reasoning, we introduce VERITAS (Verifying Entailed Reasoning through Intermediate Traceability in Agentic Search), a novel framework that integrates fine-grained faithfulness rewards into the reinforcement learning process. Our experiments show that models trained with VERITAS not only significantly improve reasoning faithfulness, but also achieve comparable task performance across seven QA benchmarks.
ChatR1: Reinforcement Learning for Conversational Reasoning and Retrieval Augmented Question Answering
- Authors: Simon Lupart, Mohammad Aliannejadi, Evangelos Kanoulas
- Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
- Arxiv link: https://arxiv.org/abs/2510.13312
- Pdf link: https://arxiv.org/pdf/2510.13312
- Abstract
We present ChatR1, a reasoning framework based on reinforcement learning (RL) for conversational question answering (CQA). Reasoning plays an important role in CQA, where user intent evolves across dialogue turns, and utterances are often underspecified, requiring contextual interpretation, query reformulation, and dynamic coordination between retrieval and generation. Unlike static "rewrite, retrieve, and generate" pipelines, ChatR1 interleaves search and reasoning across turns, enabling exploratory and adaptive behaviors learned through RL. To address the challenge of sparse and delayed rewards in RL, we propose an intent-aware reward that provides turn-level feedback by aligning retrieval and reasoning with evolving user goals. Our proposed ChatR1 demonstrates strong performance on both 3B and 7B model backbones, outperforming competitive models on five CQA datasets, measured by different metrics (F1, BERTScore, and LLM-as-judge). We include a diverse set of CQA datasets to cover topic shifts, evolving intents, mixed-initiative dialogues, and multi-document grounding, testing ChatR1's performance from various aspects. Ablation studies confirm the effectiveness of the intent-aware reward. Our analyses further reveal diverse reasoning trajectories and effective use of the search tool. ChatR1 also generalizes robustly across domains, demonstrating that RL-based reasoning enables more flexible and context-sensitive behavior than static CQA pipelines.
- 中文摘要
我们提出了 ChatR1,这是一个基于强化学习 (RL) 的对话式问答 (CQA) 推理框架。推理在 CQA 中起着重要作用:用户意图会随对话轮次演变,话语往往表述不充分,需要上下文解释、查询重写以及检索与生成之间的动态协调。与静态的“重写、检索、生成”流水线不同,ChatR1 跨轮次交错进行搜索和推理,从而实现通过 RL 学到的探索性和自适应行为。为了解决 RL 中奖励稀疏且延迟的挑战,我们提出了一种意图感知奖励,通过使检索和推理与不断变化的用户目标保持一致来提供轮次级反馈。我们提出的 ChatR1 在 3B 和 7B 模型主干上都表现出强大的性能,在五个 CQA 数据集上优于有竞争力的模型,评测采用多种指标(F1、BERTScore 和 LLM-as-judge)。我们纳入了一组多样化的 CQA 数据集,涵盖主题转移、不断演变的意图、混合主动对话和多文档依据,从多个方面测试 ChatR1 的性能。消融研究证实了意图感知奖励的有效性。我们的分析进一步揭示了多样化的推理轨迹以及对搜索工具的有效使用。ChatR1 还能跨领域稳健泛化,证明基于 RL 的推理比静态 CQA 流水线能够实现更灵活、更贴合上下文的行为。
AOAD-MAT: Transformer-based multi-agent deep reinforcement learning model considering agents' order of action decisions
AOAD-MAT:考虑智能体行动决策顺序的基于Transformer的多智能体深度强化学习模型
- Authors: Shota Takayama, Katsuhide Fujita
- Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2510.13343
- Pdf link: https://arxiv.org/pdf/2510.13343
- Abstract
Multi-agent reinforcement learning focuses on training the behaviors of multiple learning agents that coexist in a shared environment. Recently, MARL models, such as the Multi-Agent Transformer (MAT) and ACtion dEpendent deep Q-learning (ACE), have significantly improved performance by leveraging sequential decision-making processes. Although these models can enhance performance, they do not explicitly consider the importance of the order in which agents make decisions. In this paper, we propose an Agent Order of Action Decisions-MAT (AOAD-MAT), a novel MAT model that considers the order in which agents make decisions. The proposed model explicitly incorporates the sequence of action decisions into the learning process, allowing the model to learn and predict the optimal order of agent actions. The AOAD-MAT model leverages a Transformer-based actor-critic architecture that dynamically adjusts the sequence of agent actions. To achieve this, we introduce a novel MARL architecture that cooperates with a subtask focused on predicting the next agent to act, integrated into a Proximal Policy Optimization based loss function to synergistically maximize the advantage of the sequential decision-making. The proposed method was validated through extensive experiments on the StarCraft Multi-Agent Challenge and Multi-Agent MuJoCo benchmarks. The experimental results show that the proposed AOAD-MAT model outperforms existing MAT and other baseline models, demonstrating the effectiveness of adjusting the AOAD order in MARL.
- 中文摘要
多智能体强化学习侧重于训练共享环境中共存的多个学习智能体的行为。最近,多智能体 Transformer (MAT) 和 ACtion dEpendent 深度 Q 学习 (ACE) 等 MARL 模型通过利用顺序决策过程显著提高了性能。尽管这些模型可以提高性能,但它们没有明确考虑智能体决策顺序的重要性。在本文中,我们提出了智能体行动决策顺序-MAT(AOAD-MAT),这是一种考虑智能体决策顺序的新型 MAT 模型。所提出的模型将行动决策的顺序明确地纳入学习过程,使模型能够学习和预测智能体动作的最佳顺序。AOAD-MAT 模型利用基于 Transformer 的 actor-critic 架构,动态调整智能体动作的顺序。为了实现这一目标,我们引入了一种新颖的 MARL 架构,该架构与一个专注于预测下一个行动智能体的子任务协同工作,并将其集成到基于近端策略优化的损失函数中,以协同地最大化顺序决策的优势。所提出的方法在星际争霸多智能体挑战和多智能体 MuJoCo 基准上通过大量实验得到了验证。实验结果表明,所提出的 AOAD-MAT 模型优于现有的 MAT 和其他基线模型,证明了在 MARL 中调整智能体行动决策顺序的有效性。
Adversarial Fine-tuning in Offline-to-Online Reinforcement Learning for Robust Robot Control
用于鲁棒机器人控制的离线到在线强化学习中的对抗性微调
- Authors: Shingo Ayabe, Hiroshi Kera, Kazuhiko Kawamoto
- Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.13358
- Pdf link: https://arxiv.org/pdf/2510.13358
- Abstract
Offline reinforcement learning enables sample-efficient policy acquisition without risky online interaction, yet policies trained on static datasets remain brittle under action-space perturbations such as actuator faults. This study introduces an offline-to-online framework that trains policies on clean data and then performs adversarial fine-tuning, where perturbations are injected into executed actions to induce compensatory behavior and improve resilience. A performance-aware curriculum further adjusts the perturbation probability during training via an exponential-moving-average signal, balancing robustness and stability throughout the learning process. Experiments on continuous-control locomotion tasks demonstrate that the proposed method consistently improves robustness over offline-only baselines and converges faster than training from scratch. Matching the fine-tuning and evaluation conditions yields the strongest robustness to action-space perturbations, while the adaptive curriculum strategy mitigates the degradation of nominal performance observed with the linear curriculum strategy. Overall, the results show that adversarial fine-tuning enables adaptive and robust control under uncertain environments, bridging the gap between offline efficiency and online adaptability.
- 中文摘要
离线强化学习可以在没有风险的在线交互的情况下实现样本高效的策略获取,但在静态数据集上训练的策略在执行器故障等动作空间扰动下仍然脆弱。本研究引入了一个离线到在线的框架,该框架在干净的数据上训练策略,然后进行对抗性微调,其中将扰动注入到执行的动作中,以诱导补偿行为并提高弹性。性能感知课程通过指数移动平均信号进一步调整训练期间的扰动概率,在整个学习过程中平衡鲁棒性和稳定性。连续控制运动任务的实验表明,所提出的方法比仅离线基线的鲁棒性持续提高,并且比从头开始训练收敛得更快。匹配微调和评估条件对动作空间扰动产生最强的鲁棒性,而自适应课程策略减轻了线性课程策略观察到的标称性能的下降。总体而言,结果表明,对抗性微调能够在不确定的环境中实现自适应和鲁棒的控制,弥合了离线效率和在线适应性之间的差距。
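The performance-aware curriculum described in the entry above can be sketched roughly as follows. The abstract only says that the perturbation probability is adjusted through an exponential-moving-average signal; the update rule, the threshold `target_return`, the constants `alpha` and `step`, and both function names are hypothetical choices made for illustration, not the authors' method.

```python
import random

def update_perturb_prob(prob, ema_return, episode_return, target_return,
                        alpha=0.1, step=0.05):
    """Return (new_prob, new_ema). All names and constants are hypothetical."""
    # Exponential moving average of the episode return.
    new_ema = (1.0 - alpha) * ema_return + alpha * episode_return
    # Raise the perturbation probability while smoothed performance holds up,
    # lower it when the smoothed return drops below the target.
    if new_ema >= target_return:
        prob = min(1.0, prob + step)
    else:
        prob = max(0.0, prob - step)
    return prob, new_ema

def perturb_action(action, prob, rng, scale=1.0):
    """With probability `prob`, replace the executed action with uniform noise."""
    if rng.random() < prob:
        return [rng.uniform(-scale, scale) for _ in action]
    return list(action)
```

In a fine-tuning loop, `perturb_action` would wrap the policy output before it is sent to the environment, and `update_perturb_prob` would be called once per episode.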
A New Perspective on Transformers in Online Reinforcement Learning for Continuous Control
在线强化学习中用于连续控制的 Transformer 新视角
- Authors: Nikita Kachaev, Daniil Zelezetsky, Egor Cherepanov, Alexey K. Kovelev, Aleksandr I. Panov
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2510.13367
- Pdf link: https://arxiv.org/pdf/2510.13367
- Abstract
Despite their effectiveness and popularity in offline or model-based reinforcement learning (RL), transformers remain underexplored in online model-free RL due to their sensitivity to training setups and model design decisions such as how to structure the policy and value networks, share components, or handle temporal information. In this paper, we show that transformers can be strong baselines for continuous control in online model-free RL. We investigate key design questions: how to condition inputs, share components between actor and critic, and slice sequential data for training. Our experiments reveal stable architectural and training strategies enabling competitive performance across fully and partially observable tasks, and in both vector- and image-based settings. These findings offer practical guidance for applying transformers in online RL.
- 中文摘要
尽管 Transformer 在离线或基于模型的强化学习 (RL) 中有效且受欢迎,但由于它们对训练设置和模型设计决策(例如如何构建策略和价值网络、共享组件或处理时序信息)敏感,因此在在线无模型 RL 中仍未得到充分探索。在本文中,我们表明 Transformer 可以成为在线无模型 RL 中连续控制的强基线。我们研究了关键的设计问题:如何对输入进行条件化、如何在 actor 和 critic 之间共享组件,以及如何对序列数据进行切片以用于训练。我们的实验揭示了稳定的架构和训练策略,能够在完全可观测和部分可观测的任务中,以及在基于向量和基于图像的设置中实现有竞争力的性能。这些发现为在在线 RL 中应用 Transformer 提供了实用指导。
Reinforcement Learning Meets Masked Generative Models: Mask-GRPO for Text-to-Image Generation
强化学习遇见掩码生成模型:用于文本到图像生成的 Mask-GRPO
- Authors: Yifu Luo, Xinhao Hu, Keyu Fan, Haoyuan Sun, Zeyu Chen, Bo Xia, Tiantian Zhang, Yongzhe Chang, Xueqian Wang
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2510.13418
- Pdf link: https://arxiv.org/pdf/2510.13418
- Abstract
Reinforcement learning (RL) has garnered increasing attention in text-to-image (T2I) generation. However, most existing RL approaches are tailored to either diffusion models or autoregressive models, overlooking an important alternative: masked generative models. In this work, we propose Mask-GRPO, the first method to incorporate Group Relative Policy Optimization (GRPO)-based RL into this overlooked paradigm. Our core insight is to redefine the transition probability, which is different from current approaches, and formulate the unmasking process as a multi-step decision-making problem. To further enhance our method, we explore several useful strategies, including removing the KL constraint, applying the reduction strategy, and filtering out low-quality samples. Using Mask-GRPO, we improve a base model, Show-o, with substantial improvements on standard T2I benchmarks and preference alignment, outperforming existing state-of-the-art approaches. The code is available on this https URL
- 中文摘要
强化学习 (RL) 在文本到图像 (T2I) 生成中越来越受到关注。然而,大多数现有的 RL 方法都是针对扩散模型或自回归模型量身定制的,而忽略了一个重要的替代方案:掩码生成模型。在这项工作中,我们提出了 Mask-GRPO,这是第一个将基于组相对策略优化 (GRPO) 的 RL 引入这一被忽视范式的方法。我们的核心见解是以不同于现有方法的方式重新定义转移概率,并将去掩码过程表述为一个多步决策问题。为了进一步增强我们的方法,我们探索了几种有用的策略,包括移除 KL 约束、应用约简策略以及过滤掉低质量样本。使用 Mask-GRPO,我们改进了基础模型 Show-o,在标准 T2I 基准和偏好对齐上取得了大幅提升,优于现有的最先进方法。代码可在此 https URL 获取
Bridge the Gap: Enhancing Quadruped Locomotion with Vertical Ground Perturbations
弥合差距:通过垂直地面扰动增强四足机器人运动
- Authors: Maximilian Stasica, Arne Bick, Nico Bohlinger, Omid Mohseni, Max Johannes Alois Fritzsche, Clemens Hübler, Jan Peters, André Seyfarth
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2510.13488
- Pdf link: https://arxiv.org/pdf/2510.13488
- Abstract
Legged robots, particularly quadrupeds, excel at navigating rough terrains, yet their performance under vertical ground perturbations, such as those from oscillating surfaces, remains underexplored. This study introduces a novel approach to enhance quadruped locomotion robustness by training the Unitree Go2 robot on an oscillating bridge - a 13.24-meter steel-and-concrete structure with a 2.0 Hz eigenfrequency designed to perturb locomotion. Using Reinforcement Learning (RL) with the Proximal Policy Optimization (PPO) algorithm in a MuJoCo simulation, we trained 15 distinct locomotion policies, combining five gaits (trot, pace, bound, free, default) with three training conditions: rigid bridge and two oscillating bridge setups with differing height regulation strategies (relative to bridge surface or ground). Domain randomization ensured zero-shot transfer to the real-world bridge. Our results demonstrate that policies trained on the oscillating bridge exhibit superior stability and adaptability compared to those trained on rigid surfaces. Our framework enables robust gait patterns even without prior bridge exposure. These findings highlight the potential of simulation-based RL to improve quadruped locomotion during dynamic ground perturbations, offering insights for designing robots capable of traversing vibrating environments.
- 中文摘要
腿式机器人,尤其是四足机器人,擅长在崎岖地形中行进,但它们在垂直地面扰动(例如来自振荡表面的扰动)下的性能仍未得到充分探索。本研究引入了一种新方法,通过在振荡桥上训练 Unitree Go2 机器人来增强四足运动的鲁棒性;该桥是一座 13.24 米长的钢-混凝土结构,固有频率为 2.0 Hz,专门用于扰动运动。我们在 MuJoCo 仿真中使用强化学习 (RL) 和近端策略优化 (PPO) 算法训练了 15 种不同的运动策略,将五种步态(trot、pace、bound、free、default)与三种训练条件相结合:刚性桥,以及两种采用不同高度调节策略(相对于桥面或地面)的振荡桥设置。域随机化确保了向真实桥梁的零样本迁移。我们的结果表明,与在刚性表面上训练的策略相比,在振荡桥上训练的策略表现出更优的稳定性和适应性。即使此前没有接触过桥梁,我们的框架也能实现稳健的步态模式。这些发现凸显了基于仿真的 RL 在动态地面扰动下改善四足运动的潜力,为设计能够穿越振动环境的机器人提供了见解。
Offline and Online KL-Regularized RLHF under Differential Privacy
差分隐私下的离线和在线KL正则化RLHF
- Authors: Yulian Wu, Rushil Thareja, Praneeth Vepakomma, Francesco Orabona
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.13512
- Pdf link: https://arxiv.org/pdf/2510.13512
- Abstract
In this paper, we study the offline and online settings of reinforcement learning from human feedback (RLHF) with KL-regularization -- a widely used objective function in large language model alignment -- under the $\epsilon$ local differential privacy ($\epsilon$-LDP) model on the label of the human preference. In the offline setting, we design an algorithm based on the principle of pessimism and derive a new suboptimality gap of $\tilde{O}(1/[(e^\epsilon-1)^2 n])$ on the KL-regularized objective under single-policy concentrability. We also prove its optimality by providing a matching lower bound where $n$ is the sample size. In the online setting, we are the first one to theoretically investigate the problem of KL-regularized RLHF with LDP. We design an optimism-based algorithm and derive a logarithmic regret bound of $O(d_{\mathcal{F}}\log (N_{\mathcal{F}}\cdot T) /(e^\epsilon-1)^2 )$, where $T$ is the total time step, $N_{\mathcal{F}}$ is cardinality of the reward function space $\mathcal{F}$ and $d_{\mathcal{F}}$ is a variant of eluder dimension for RLHF. As a by-product of our analysis, our results also imply the first analysis for online KL-regularized RLHF without privacy. We implement our algorithm in the offline setting to verify our theoretical results and release our open source code at: this https URL.
- 中文摘要
在本文中,我们研究了在针对人类偏好标签的 $\epsilon$ 局部差分隐私($\epsilon$-LDP)模型下,带 KL 正则化(大型语言模型对齐中广泛使用的目标函数)的基于人类反馈的强化学习 (RLHF) 的离线和在线设置。在离线设置下,我们设计了一种基于悲观原则的算法,并在单策略集中性条件下推导出 KL 正则化目标上新的次优性间隙 $\tilde{O}(1/[(e^\epsilon-1)^2 n])$。我们还通过给出匹配的下界证明了其最优性,其中 $n$ 是样本量。在在线设置下,我们首次从理论上研究了带 LDP 的 KL 正则化 RLHF 问题。我们设计了一种基于乐观原则的算法,并推导出对数遗憾界 $O(d_{\mathcal{F}}\log (N_{\mathcal{F}}\cdot T) /(e^\epsilon-1)^2 )$,其中 $T$ 是总时间步数,$N_{\mathcal{F}}$ 是奖励函数空间 $\mathcal{F}$ 的基数,$d_{\mathcal{F}}$ 是面向 RLHF 的 eluder 维度的一种变体。作为我们分析的副产品,我们的结果还给出了对无隐私约束的在线 KL 正则化 RLHF 的首个分析。我们在离线设置中实现了我们的算法以验证理论结果,并将开源代码发布于:此 https URL。
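To make the $\epsilon$-LDP label model in the entry above concrete, here is a minimal sketch using binary randomized response, the textbook mechanism for privatizing a 0/1 label. The abstract does not say which mechanism the authors use, so treating it as randomized response is an assumption, and all function names are hypothetical. The debiasing step divides by $(e^\epsilon-1)/(e^\epsilon+1)$, which is at least consistent with the $(e^\epsilon-1)^2$ factor that appears in the stated bounds.

```python
import math
import random

def keep_probability(epsilon):
    """Probability of reporting the true binary label under randomized response."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))

def privatize_label(label, epsilon, rng):
    """Flip a 0/1 preference label with probability 1 / (1 + e^epsilon)."""
    if rng.random() < keep_probability(epsilon):
        return label
    return 1 - label

def debias_mean(noisy_mean, epsilon):
    """Unbiased estimate of the true label mean from the noisy label mean."""
    p = keep_probability(epsilon)
    # E[noisy] = (1 - p) + (2p - 1) * true_mean,
    # and 2p - 1 = (e^eps - 1) / (e^eps + 1).
    return (noisy_mean - (1.0 - p)) / (2.0 * p - 1.0)
```

The ratio between the keep and flip probabilities is exactly $e^\epsilon$, which is what the $\epsilon$-LDP definition requires for a binary output.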
Tandem Training for Language Models
语言模型的串联训练
- Authors: Robert West, Ashton Anderson, Ece Kamar, Eric Horvitz
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.13551
- Pdf link: https://arxiv.org/pdf/2510.13551
- Abstract
As language models continue to rapidly improve, we can expect their actions and reasoning to become difficult or impossible for weaker agents and humans to follow, undermining interpretability and oversight. With an eye on long-term futures, we pursue methods that encourage models to produce solutions that remain intelligible to weaker collaborators. We formalize intelligibility as handoff robustness: a strong model's solution is intelligible to a weaker model if randomly handing off control to the weaker model along the solution path does not cause failure. Building on this criterion, we introduce tandem training for language models, a reinforcement learning (RL) paradigm in which rollout tokens are intermittently and randomly sampled from a frozen weak model rather than the strong model being trained. Because rollouts succeed only when the strong model's actions and reasoning process can be continued by the weak model (that is, when the two can co-construct a successful solution), optimizing standard RL objectives with tandem training implicitly incentivizes both correctness and intelligibility. In the GSM8K math reasoning task, tandem training reliably teaches models to abandon jargon and adapt their language to weaker partners while keeping task accuracy high. Our results demonstrate a promising route to building AI systems that remain auditable by weaker agents, with implications for human-AI collaboration and multi-agent communication.
- 中文摘要
随着语言模型持续快速进步,我们可以预期,较弱的智能体和人类将难以甚至无法跟上它们的行为和推理,从而削弱可解释性和监督。着眼于长远未来,我们探索鼓励模型产出对较弱合作者仍然可理解的解决方案的方法。我们将可理解性形式化为交接鲁棒性:如果沿解题路径随机将控制权交给较弱的模型不会导致失败,则强模型的解对较弱的模型而言是可理解的。基于这一标准,我们引入了语言模型的串联训练,这是一种强化学习 (RL) 范式,其中 rollout 词元间歇性地从冻结的弱模型中随机采样,而不是从正在训练的强模型中采样。由于只有当弱模型能够接续强模型的动作和推理过程时(即两者能够共同构建出成功的解时),rollout 才会成功,因此通过串联训练优化标准 RL 目标会隐式地同时激励正确性和可理解性。在 GSM8K 数学推理任务中,串联训练能够可靠地教会模型放弃行话,并使其语言适应较弱的伙伴,同时保持较高的任务准确率。我们的结果展示了一条有前景的途径,可用于构建仍可由较弱智能体审计的人工智能系统,并对人机协作和多智能体通信具有启示意义。
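The handoff idea in the tandem-training entry above can be illustrated with a toy rollout loop. This is only a sketch under simplifying assumptions: control is handed to the weak model by an independent coin flip at every token (the abstract says "intermittently and randomly" without giving the schedule), and `strong_next` and `weak_next` are hypothetical callables standing in for the two language models.

```python
import random

def tandem_rollout(strong_next, weak_next, prompt, max_tokens,
                   handoff_prob, rng, stop_token="<eos>"):
    """Generate tokens, handing control to the weak model at random steps.

    strong_next / weak_next are hypothetical callables: list[str] -> str.
    Returns the generated tokens and which model produced each one.
    """
    tokens, sources = list(prompt), []
    for _ in range(max_tokens):
        use_weak = rng.random() < handoff_prob
        token = weak_next(tokens) if use_weak else strong_next(tokens)
        tokens.append(token)
        sources.append("weak" if use_weak else "strong")
        if token == stop_token:
            break
    return tokens[len(prompt):], sources
```

In an RL setup along these lines, the reward would be computed on the mixed rollout, and only the tokens whose source is "strong" would contribute to the policy update, since the weak model is frozen.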
Attention Illuminates LLM Reasoning: The Preplan-and-Anchor Rhythm Enables Fine-Grained Policy Optimization
注意力照亮LLM推理:预规划锚定节奏赋能细粒度策略优化
- Authors: Yang Li, Zhichen Dong, Yuhan Sun, Weixun Wang, Shaopan Xiong, Yijia Luo, Jiashun Liu, Han Lu, Jiamang Wang, Wenbo Su, Bo Zheng, Junchi Yan
- Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2510.13554
- Pdf link: https://arxiv.org/pdf/2510.13554
- Abstract
The reasoning pattern of Large language models (LLMs) remains opaque, and Reinforcement learning (RL) typically applies uniform credit across an entire generation, blurring the distinction between pivotal and routine steps. This work positions attention as a privileged substrate that renders the internal logic of LLMs legible, not merely as a byproduct of computation, but as a mechanistic blueprint of reasoning itself. We first distinguish attention heads between locally and globally focused information processing and reveal that locally focused heads produce a sawtooth pattern near the diagonal indicating phrasal chunks, while globally focused heads expose tokens that exert broad downstream influence over future tokens. We formalize these with two metrics: 1) Windowed Average Attention Distance, which measures the extent of backward attention within a clipped window; 2) Future Attention Influence, which quantifies a token's global importance as the average attention it receives from subsequent tokens. Taken together, these signals reveal a recurring preplan-and-anchor mechanism, where the model first performs a long-range contextual reference to generate an introductory token, which is immediately followed by or coincides with a semantic anchor token that organizes subsequent reasoning. Leveraging these insights, we introduce three novel RL strategies that dynamically perform targeted credit assignment to critical nodes (preplan tokens, anchor tokens, and their temporal coupling) and show consistent performance gains across various reasoning tasks. By aligning optimization with the model's intrinsic reasoning rhythm, we aim to transform opaque optimization into an actionable structure-aware process, hoping to offer a potential step toward more transparent and effective optimization of LLM reasoning.
- 中文摘要
大型语言模型 (LLM) 的推理模式仍然不透明,而强化学习 (RL) 通常对整个生成序列施加统一的信用,模糊了关键步骤与常规步骤之间的区别。这项工作将注意力定位为一种特权基底,它使 LLM 的内部逻辑变得清晰可读:注意力不仅是计算的副产品,更是推理本身的机制蓝图。我们首先将注意力头区分为局部聚焦和全局聚焦两类信息处理,并揭示局部聚焦的头在对角线附近产生指示短语块的锯齿状模式,而全局聚焦的头则暴露出对后续词元具有广泛下游影响的词元。我们用两个指标将其形式化:1) 窗口平均注意力距离,衡量裁剪窗口内向后注意的范围;2) 未来注意力影响,将一个词元的全局重要性量化为它从后续词元获得的平均注意力。综合来看,这些信号揭示了一种反复出现的预规划与锚定机制:模型首先进行长程上下文引用以生成一个引导性词元,其后紧跟(或与之重合)一个组织后续推理的语义锚点词元。利用这些见解,我们引入了三种新颖的 RL 策略,对关键节点(预规划词元、锚点词元及其时间耦合)动态地执行有针对性的信用分配,并在各种推理任务中展现出一致的性能提升。通过使优化与模型的内在推理节奏保持一致,我们的目标是将不透明的优化转变为可操作的结构感知过程,希望为更透明、更有效的 LLM 推理优化迈出潜在的一步。
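The two metrics named in the entry above can be sketched for a single causal attention matrix. The definitions below are one plausible reading of the abstract, not the paper's exact formulas: `attn[t][s]` is assumed to be the attention that query position `t` pays to position `s <= t`, the distance is clipped at `window`, and any averaging over heads or layers is left out.

```python
def windowed_avg_attention_distance(attn, window):
    """For each query position t, attention-weighted backward distance,
    with each distance clipped at `window`."""
    return [
        sum(attn[t][s] * min(t - s, window) for s in range(t + 1))
        for t in range(len(attn))
    ]

def future_attention_influence(attn):
    """For each position s, mean attention received from later positions t > s."""
    n = len(attn)
    scores = []
    for s in range(n):
        later = [attn[t][s] for t in range(s + 1, n)]
        scores.append(sum(later) / len(later) if later else 0.0)
    return scores
```

Under this reading, a position with a large windowed distance looks far back in the context (a candidate "preplan" token), while a position with high future influence is one that many later tokens attend to (a candidate "anchor" token).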
What is the objective of reasoning with reinforcement learning?
强化学习推理的目的是什么?
- Authors: Damek Davis, Benjamin Recht
- Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
- Arxiv link: https://arxiv.org/abs/2510.13651
- Pdf link: https://arxiv.org/pdf/2510.13651
- Abstract
We show that several popular algorithms for reinforcement learning in large language models with binary rewards can be viewed as stochastic gradient ascent on a monotone transform of the probability of a correct answer given a prompt. In particular, the transformation associated with rejection sampling algorithms is the logarithm and that associated with the GRPO algorithm is the arcsine of the square root.
- 中文摘要
我们表明,在具有二元奖励的大型语言模型中,几种流行的强化学习算法可以被视为给定提示时正确答案概率的单调变换上的随机梯度上升。特别是,与拒绝采样算法相关的变换是对数,与 GRPO 算法相关的变换是平方根的反正弦。
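One way to see where the two transforms in the entry above could come from, stated here as an idealized population-level sketch rather than the paper's derivation: with a binary reward and success probability $p$, the plain policy gradient is $\nabla p$. Training only on accepted samples rescales it to $\nabla p / p = \nabla \log p$, and dividing the centered reward by its standard deviation $\sqrt{p(1-p)}$, as in a GRPO-style normalization, rescales it to $\nabla p / \sqrt{p(1-p)} = \nabla\, 2\arcsin\sqrt{p}$. The snippet below checks the two derivative identities numerically; the function names are hypothetical.

```python
import math

def grpo_transform(p):
    """Monotone transform whose derivative is 1 / sqrt(p * (1 - p))."""
    return 2.0 * math.asin(math.sqrt(p))

def numerical_derivative(f, p, h=1e-6):
    """Central finite-difference approximation of f'(p)."""
    return (f(p + h) - f(p - h)) / (2.0 * h)

def check(p):
    """Compare finite-difference derivatives against the closed forms."""
    log_err = abs(numerical_derivative(math.log, p) - 1.0 / p)
    grpo_err = abs(numerical_derivative(grpo_transform, p)
                   - 1.0 / math.sqrt(p * (1.0 - p)))
    return log_err, grpo_err
```

Both weights blow up as $p \to 0$, so hard prompts are emphasized; the arcsine form also blows up as $p \to 1$, while the logarithm does not.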
Information-Theoretic Reward Modeling for Stable RLHF: Detecting and Mitigating Reward Hacking
稳定 RLHF 的信息论奖励建模:检测和缓解奖励黑客攻击
- Authors: Yuchun Miao, Liang Ding, Sen Zhang, Rong Bao, Lefei Zhang, Dacheng Tao
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2510.13694
- Pdf link: https://arxiv.org/pdf/2510.13694
- Abstract
Despite the success of Reinforcement Learning from Human Feedback (RLHF) in aligning language models with human values, reward hacking (or reward over-optimization) remains a major challenge. We identify two key obstacles to its mitigation: (1) reward misgeneralization in reward modeling, where reward models overfit to spurious, preference-irrelevant features; and (2) the lack of suitable regularization during RL optimization, as existing token-level constraints often over-restrict the policy space. To address these issues, we propose InfoRM, an information-theoretic reward modeling framework based on the Information Bottleneck (IB) principle, which filters out preference-irrelevant information to alleviate reward misgeneralization. We further observe that reward-hacked responses manifest as pronounced outliers in InfoRM's IB latent space, measured by Mahalanobis distance from the SFT-induced distribution. Motivated by this, we introduce IBL, a distribution-level regularization that penalizes such deviations, effectively expanding the optimization landscape while maintaining alignment. We prove that IBL is theoretically equivalent to the pessimistic RL objective within the IB latent space. Finally, we present Mahalanobis Outlier Probability (MOP), a statistical metric for quantifying reward hacking severity, enabling principled hyperparameter tuning and online mitigation such as early stopping. Extensive experiments across diverse LLMs and datasets confirm the generality of our findings, the effectiveness of InfoRM and IBL, and the reliability of MOP as a diagnostic tool, collectively advancing the state of RLHF.
- 中文摘要
尽管基于人类反馈的强化学习 (RLHF) 在使语言模型与人类价值观对齐方面取得了成功,但奖励黑客(或称奖励过度优化)仍然是一个主要挑战。我们确定了缓解该问题的两个关键障碍:(1)奖励建模中的奖励误泛化,即奖励模型过度拟合虚假的、与偏好无关的特征;(2)RL 优化过程中缺乏合适的正则化,因为现有的 token 级约束往往过度限制了策略空间。为了解决这些问题,我们提出了 InfoRM,这是一个基于信息瓶颈 (IB) 原理的信息论奖励建模框架,它过滤掉与偏好无关的信息,以减轻奖励误泛化。我们进一步观察到,被奖励黑客利用的响应在 InfoRM 的 IB 潜在空间中表现为明显的异常值,这通过其与 SFT 诱导分布之间的 Mahalanobis 距离来衡量。受此启发,我们引入了 IBL,这是一种分布级正则化,用于惩罚此类偏离,在保持对齐的同时有效地扩展了优化空间。我们证明 IBL 在理论上等价于 IB 潜在空间内的悲观 RL 目标。最后,我们提出了 Mahalanobis 异常值概率 (MOP),这是一种用于量化奖励黑客严重程度的统计指标,可支持有原则的超参数调整和提前停止等在线缓解措施。在多种 LLM 和数据集上的大量实验证实了我们发现的普适性、InfoRM 和 IBL 的有效性,以及 MOP 作为诊断工具的可靠性,共同推动了 RLHF 的发展。
Simplicial Embeddings Improve Sample Efficiency in Actor-Critic Agents
单纯形嵌入提高了 Actor-Critic 智能体的样本效率
- Authors: Johan Obando-Ceron, Walter Mayor, Samuel Lavoie, Scott Fujimoto, Aaron Courville, Pablo Samuel Castro
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2510.13704
- Pdf link: https://arxiv.org/pdf/2510.13704
- Abstract
Recent works have proposed accelerating the wall-clock training time of actor-critic methods via the use of large-scale environment parallelization; unfortunately, these can sometimes still require large number of environment interactions to achieve a desired level of performance. Noting that well-structured representations can improve the generalization and sample efficiency of deep reinforcement learning (RL) agents, we propose the use of simplicial embeddings: lightweight representation layers that constrain embeddings to simplicial structures. This geometric inductive bias results in sparse and discrete features that stabilize critic bootstrapping and strengthen policy gradients. When applied to FastTD3, FastSAC, and PPO, simplicial embeddings consistently improve sample efficiency and final performance across a variety of continuous- and discrete-control environments, without any loss in runtime speed.
- 中文摘要
最近的工作提出通过大规模环境并行化来缩短 actor-critic 方法的挂钟训练时间;不幸的是,这些方法有时仍然需要大量的环境交互才能达到所需的性能水平。注意到结构良好的表示可以提高深度强化学习 (RL) 智能体的泛化能力和样本效率,我们提出使用单纯形嵌入:一种将嵌入约束到单纯形结构上的轻量级表示层。这种几何归纳偏置产生稀疏且离散的特征,从而稳定 critic 的自举并增强策略梯度。当应用于 FastTD3、FastSAC 和 PPO 时,单纯形嵌入在各种连续和离散控制环境中持续提高样本效率和最终性能,且不损失任何运行速度。
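A simplicial embedding layer of the kind described in the entry above is commonly realized as a group-wise softmax: the feature vector is split into equal groups and each group is projected onto a probability simplex. The sketch below assumes that form; the abstract itself only says embeddings are constrained to simplicial structures, and the group count and temperature here are hypothetical hyperparameters.

```python
import math

def softmax(values, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    m = max(values)
    exps = [math.exp((v - m) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def simplicial_embedding(features, num_groups, temperature=1.0):
    """Split `features` into equal groups and apply a softmax within each group."""
    if len(features) % num_groups != 0:
        raise ValueError("feature length must be divisible by num_groups")
    size = len(features) // num_groups
    out = []
    for g in range(num_groups):
        out.extend(softmax(features[g * size:(g + 1) * size], temperature))
    return out
```

Each group of the output is non-negative and sums to one, and lowering the temperature pushes each group toward a one-hot vector, which is one way to obtain the sparse, near-discrete features the abstract mentions.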
From Refusal to Recovery: A Control-Theoretic Approach to Generative AI Guardrails
从拒绝到恢复:生成式人工智能护栏的控制论方法
- Authors: Ravi Pandya, Madison Bland, Duy P. Nguyen, Changliu Liu, Jaime Fernández Fisac, Andrea Bajcsy
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.13727
- Pdf link: https://arxiv.org/pdf/2510.13727
- Abstract
Generative AI systems are increasingly assisting and acting on behalf of end users in practical settings, from digital shopping assistants to next-generation autonomous cars. In this context, safety is no longer about blocking harmful content, but about preempting downstream hazards like financial or physical harm. Yet, most AI guardrails continue to rely on output classification based on labeled datasets and human-specified criteria, making them brittle to new hazardous situations. Even when unsafe conditions are flagged, this detection offers no path to recovery: typically, the AI system simply refuses to act, which is not always a safe choice. In this work, we argue that agentic AI safety is fundamentally a sequential decision problem: harmful outcomes arise from the AI system's continually evolving interactions and their downstream consequences on the world. We formalize this through the lens of safety-critical control theory, but within the AI model's latent representation of the world. This enables us to build predictive guardrails that (i) monitor an AI system's outputs (actions) in real time and (ii) proactively correct risky outputs to safe ones, all in a model-agnostic manner so the same guardrail can be wrapped around any AI model. We also offer a practical training recipe for computing such guardrails at scale via safety-critical reinforcement learning. Our experiments in simulated driving and e-commerce settings demonstrate that control-theoretic guardrails can reliably steer LLM agents clear of catastrophic outcomes (from collisions to bankruptcy) while preserving task performance, offering a principled dynamic alternative to today's flag-and-block guardrails.
- 中文摘要
生成式人工智能系统正越来越多地在实际场景中协助最终用户并代表其行事,从数字购物助手到下一代自动驾驶汽车。在这种情况下,安全不再是关于阻止有害内容,而是预先防范经济或人身伤害等下游危害。然而,大多数人工智能护栏仍然依赖基于标注数据集和人工指定标准的输出分类,这使得它们在面对新的危险情况时十分脆弱。即使不安全的情况被标记出来,这种检测也无法提供恢复途径:通常,人工智能系统只是拒绝采取行动,而这并不总是一个安全的选择。在这项工作中,我们认为智能体式人工智能的安全从根本上说是一个序贯决策问题:有害结果源于人工智能系统不断演变的交互及其对世界的下游影响。我们通过安全关键控制理论的视角将其形式化,但是在人工智能模型对世界的潜在表示之内进行。这使我们能够构建预测性护栏,(i) 实时监控 AI 系统的输出(动作),以及 (ii) 主动将有风险的输出纠正为安全输出,所有这些都以与模型无关的方式进行,因此同一护栏可以套用在任何 AI 模型上。我们还提供了一种通过安全关键强化学习大规模计算此类护栏的实用训练方法。我们在模拟驾驶和电子商务场景中的实验表明,控制论护栏可以可靠地引导 LLM 智能体避开灾难性后果(从碰撞到破产),同时保持任务性能,为当今“标记并拦截”式的护栏提供了一种有原则的动态替代方案。
Asymptotically optimal reinforcement learning in Block Markov Decision Processes
块马尔可夫决策过程中的渐近最优强化学习
- Authors: Thomas van Vuren, Fiona Sloothaak, Maarten G. Wolf, Jaron Sanders
- Subjects: Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
- Arxiv link: https://arxiv.org/abs/2510.13748
- Pdf link: https://arxiv.org/pdf/2510.13748
- Abstract
The curse of dimensionality renders Reinforcement Learning (RL) impractical in many real-world settings with exponentially large state and action spaces. Yet, many environments exhibit exploitable structure that can accelerate learning. To formalize this idea, we study RL in Block Markov Decision Processes (BMDPs). BMDPs model problems with large observation spaces, but where transition dynamics are fully determined by latent states. Recent advances in clustering methods have enabled the efficient recovery of this latent structure. However, a regret analysis that exploits these techniques to determine their impact on learning performance remained open. We are now addressing this gap by providing a regret analysis that explicitly leverages clustering, demonstrating that accurate latent state estimation can indeed effectively speed up learning. Concretely, this paper analyzes a two-phase RL algorithm for BMDPs that first learns the latent structure through random exploration and then switches to an optimism-guided strategy adapted to the uncovered structure. This algorithm achieves a regret that is $O(\sqrt{T}+n)$ on a large class of BMDPs susceptible to clustering. Here, $T$ denotes the number of time steps, $n$ is the cardinality of the observation space, and the Landau notation $O(\cdot)$ holds up to constants and polylogarithmic factors. This improves the best prior bound, $O(\sqrt{T}+n^2)$, especially when $n$ is large. Moreover, we prove that no algorithm can achieve lower regret uniformly on this same class of BMDPs. This establishes that, on this class, the algorithm achieves asymptotic optimality.
- 中文摘要
维度灾难使得强化学习 (RL) 在许多具有指数级大的状态和动作空间的现实环境中变得不切实际。然而,许多环境表现出可以加速学习的可利用结构。为了形式化这一想法,我们在块马尔可夫决策过程 (BMDP) 中研究 RL。BMDP 对具有大观测空间、但转移动态完全由潜在状态决定的问题进行建模。聚类方法的最新进展使得高效恢复这种潜在结构成为可能。然而,利用这些技术来确定其对学习性能影响的遗憾分析此前仍是空白。我们现在通过提供明确利用聚类的遗憾分析来填补这一空白,证明准确的潜在状态估计确实可以有效地加速学习。具体来说,本文分析了一种用于 BMDP 的两阶段 RL 算法,该算法首先通过随机探索学习潜在结构,然后切换到与所揭示结构相适应的乐观引导策略。该算法在一大类适合聚类的 BMDP 上实现了 $O(\sqrt{T}+n)$ 的遗憾。这里,$T$ 表示时间步数,$n$ 表示观测空间的基数,Landau 记号 $O(\cdot)$ 在忽略常数和多对数因子的意义下成立。这改进了此前最好的界 $O(\sqrt{T}+n^2)$,尤其是当 $n$ 很大时。此外,我们证明了在同一类 BMDP 上没有任何算法可以一致地实现更低的遗憾。这表明,在此类问题上,该算法达到了渐近最优。
The Art of Scaling Reinforcement Learning Compute for LLMs
LLM 扩展强化学习计算的艺术
- Authors: Devvrit Khatri, Lovish Madaan, Rishabh Tiwari, Rachit Bansal, Sai Surya Duvvuri, Manzil Zaheer, Inderjit S. Dhillon, David Brandfonbrener, Rishabh Agarwal
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.13786
- Pdf link: https://arxiv.org/pdf/2510.13786
- Abstract
Reinforcement learning (RL) has become central to training large language models (LLMs), yet the field lacks predictive scaling methodologies comparable to those established for pre-training. Despite rapidly rising compute budgets, there is no principled understanding of how to evaluate algorithmic improvements for scaling RL compute. We present the first large-scale systematic study, amounting to more than 400,000 GPU-hours, that defines a principled framework for analyzing and predicting RL scaling in LLMs. We fit sigmoidal compute-performance curves for RL training and ablate a wide range of common design choices to analyze their effects on asymptotic performance and compute efficiency. We observe: (1) Not all recipes yield similar asymptotic performance, (2) Details such as loss aggregation, normalization, curriculum, and off-policy algorithm primarily modulate compute efficiency without materially shifting the asymptote, and (3) Stable, scalable recipes follow predictable scaling trajectories, enabling extrapolation from smaller-scale runs. Combining these insights, we propose a best-practice recipe, ScaleRL, and demonstrate its effectiveness by successfully scaling and predicting validation performance on a single RL run scaled up to 100,000 GPU-hours. Our work provides both a scientific framework for analyzing scaling in RL and a practical recipe that brings RL training closer to the predictability long achieved in pre-training.
- 中文摘要
强化学习 (RL) 已成为训练大型语言模型 (LLM) 的核心,但该领域缺乏与预训练建立的预测扩展方法相媲美的预测扩展方法。尽管计算预算迅速增加,但对于如何评估扩展 RL 计算的算法改进,还没有原则性的理解。我们提出了第一项大规模系统研究,总计超过 400,000 个 GPU 小时,它定义了一个用于分析和预测 LLM 中 RL 扩展的原则框架。我们拟合了用于 RL 训练的 S 形计算性能曲线,并消融了各种常见的设计选择,以分析它们对渐近性能和计算效率的影响。我们观察到:(1) 并非所有配方都能产生相似的渐近性能,(2) 损失聚合、归一化、课程和策略外算法等细节主要调节计算效率,而不会对渐近线产生重大影响,以及 (3) 稳定、可扩展的配方遵循可预测的扩展轨迹,从而能够从小规模运行中进行外推。结合这些见解,我们提出了一个最佳实践配方 ScaleRL,并通过成功扩展和预测单个 RL 运行的验证性能(扩展到 100,000 GPU 小时)来证明其有效性。我们的工作既提供了一个用于分析 RL 扩展的科学框架,也提供了一个实用的配方,使 RL 训练更接近预训练中长期实现的可预测性。
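The sigmoidal compute-performance curves mentioned in the entry above can be illustrated with a simple saturating function of compute. The parameterization below is an assumption chosen for illustration, since the abstract does not give the functional form; `asymptote`, `start`, `midpoint`, and `exponent` are hypothetical parameter names.

```python
def sigmoid_scaling(compute, asymptote, start, midpoint, exponent):
    """Saturating performance curve in compute (illustrative parameterization).

    Returns the value halfway between `start` and `asymptote` when
    compute == midpoint, and approaches `asymptote` as compute grows.
    """
    return start + (asymptote - start) / (1.0 + (midpoint / compute) ** exponent)

def compute_to_reach(target, asymptote, start, midpoint, exponent):
    """Invert the curve: compute needed to reach `target` performance."""
    if not (start < target < asymptote):
        raise ValueError("target must lie strictly between start and asymptote")
    x = (asymptote - start) / (target - start) - 1.0
    return midpoint / x ** (1.0 / exponent)
```

In this picture, a design choice that changes only `midpoint` or `exponent` affects compute efficiency, while one that changes `asymptote` affects the performance ceiling, which mirrors the distinction drawn in observations (1) and (2) of the abstract.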
Provably Invincible Adversarial Attacks on Reinforcement Learning Systems: A Rate-Distortion Information-Theoretic Approach
对强化学习系统的可证明无敌对抗性攻击:一种速率失真信息论方法
- Authors: Ziqing Lu, Lifeng Lai, Weiyu Xu
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.13792
- Pdf link: https://arxiv.org/pdf/2510.13792
- Abstract
Reinforcement learning (RL) for the Markov Decision Process (MDP) has emerged in many security-related applications, such as autonomous driving, financial decisions, and drone/robot algorithms. In order to improve the robustness/defense of RL systems against adversaries, studying various adversarial attacks on RL systems is very important. Most previous work considered deterministic adversarial attack strategies in MDP, which the recipient (victim) agent can defeat by reversing the deterministic attacks. In this paper, we propose a provably "invincible" or "uncounterable" type of adversarial attack on RL. The attackers apply a rate-distortion information-theoretic approach to randomly change agents' observations of the transition kernel (or other properties) so that the agent gains zero or very limited information about the ground-truth kernel (or other properties) during the training. We derive an information-theoretic lower bound on the recipient agent's reward regret and show the impact of rate-distortion attacks on state-of-the-art model-based and model-free algorithms. We also extend this notion of an information-theoretic approach to other types of adversarial attack, such as state observation attacks.
- 中文摘要
面向马尔可夫决策过程 (MDP) 的强化学习 (RL) 已出现在许多与安全相关的应用中,例如自动驾驶、金融决策和无人机/机器人算法。为了提高 RL 系统对抗对手的鲁棒性/防御能力,研究针对 RL 系统的各种对抗性攻击非常重要。此前的大多数工作考虑的是 MDP 中的确定性对抗攻击策略,接收方(受害者)智能体可以通过逆转确定性攻击来击败这些策略。在本文中,我们提出了一种可证明“无敌”或“不可反制”的针对 RL 的对抗性攻击。攻击者应用率失真信息论方法,随机改变智能体对转移核(或其他属性)的观测,使智能体在训练期间获得关于真实转移核(或其他属性)的信息为零或非常有限。我们推导了接收方智能体奖励遗憾的信息论下界,并展示了率失真攻击对最先进的基于模型和无模型算法的影响。我们还将这种信息论方法的思路扩展到其他类型的对抗性攻击,例如状态观测攻击。
MimicKit: A Reinforcement Learning Framework for Motion Imitation and Control
MimicKit:用于运动模仿和控制的强化学习框架
- Authors: Xue Bin Peng
- Subjects: Graphics (cs.GR); Machine Learning (cs.LG); Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2510.13794
- Pdf link: https://arxiv.org/pdf/2510.13794
- Abstract
MimicKit is an open-source framework for training motion controllers using motion imitation and reinforcement learning. The codebase provides implementations of commonly-used motion-imitation techniques and RL algorithms. This framework is intended to support research and applications in computer graphics and robotics by providing a unified training framework, along with standardized environment, agent, and data structures. The codebase is designed to be modular and easily configurable, enabling convenient modification and extension to new characters and tasks. The open-source codebase is available at: this https URL.
- 中文摘要
MimicKit 是一个开源框架,用于通过运动模仿和强化学习来训练运动控制器。该代码库提供了常用的运动模仿技术和 RL 算法的实现。该框架旨在通过提供统一的训练框架以及标准化的环境、智能体和数据结构,来支持计算机图形学和机器人学的研究与应用。代码库采用模块化设计且易于配置,可以方便地修改并扩展到新的角色和任务。开源代码库位于:此 https URL。
Breadcrumbs Reasoning: Memory-Efficient Reasoning with Compression Beacons
面包屑推理:使用压缩信标进行内存高效推理
- Authors: Giovanni Monea, Yair Feldman, Shankar Padmanabhan, Kianté Brantley, Yoav Artzi
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2510.13797
- Pdf link: https://arxiv.org/pdf/2510.13797
- Abstract
The scalability of large language models for long-context reasoning is severely constrained by the linear growth of their Transformer key-value cache, which incurs significant memory and computational costs. We posit that as a model generates reasoning tokens, the informational value of past generated tokens diminishes, creating an opportunity for compression. In this work, we propose to periodically compress the generation KV cache with a learned, special-purpose token and evict compressed entries. We train the model to perform this compression via a modified joint distillation and reinforcement learning (RL) framework. Our training method minimizes overhead over the conventional RL process, as it leverages RL outputs for distillation. Empirically, our method achieves a superior memory-accuracy Pareto frontier compared to both the model without cache compression and training-free compression techniques.
- 中文摘要
用于长上下文推理的大型语言模型的可扩展性受到其 Transformer 键值 (KV) 缓存线性增长的严重限制,这会带来大量的内存和计算成本。我们认为,随着模型不断生成推理 token,过去已生成 token 的信息价值会逐渐降低,从而为压缩创造了机会。在这项工作中,我们提出用一个学习得到的专用 token 定期压缩生成阶段的 KV 缓存,并驱逐已被压缩的条目。我们通过一个改进的联合蒸馏与强化学习 (RL) 框架来训练模型执行这种压缩。由于该训练方法利用 RL 的输出进行蒸馏,它相对于常规 RL 流程的额外开销被降到最低。实验表明,与不使用缓存压缩的模型以及免训练的压缩技术相比,我们的方法取得了更优的内存-准确率帕累托前沿。
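The periodic compress-and-evict pattern described in the abstract above can be sketched with a toy cache. This is a bookkeeping illustration only: the paper uses a learned special-purpose token to produce the compressed entry, whereas the sketch substitutes a simple mean over the window, and the names `BeaconCache`, `window`, and `mean_pool` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

Vector = List[float]

def mean_pool(vectors: List[Vector]) -> Vector:
    """Stand-in for learned compression: average the entries elementwise."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

@dataclass
class BeaconCache:
    """Toy cache: every `window` new entries collapse into one beacon entry."""
    window: int
    beacons: List[Vector] = field(default_factory=list)
    pending: List[Vector] = field(default_factory=list)

    def append(self, entry: Vector) -> None:
        self.pending.append(entry)
        if len(self.pending) == self.window:
            # Compress the full window into a single entry, then evict it.
            self.beacons.append(mean_pool(self.pending))
            self.pending.clear()

    def __len__(self) -> int:
        return len(self.beacons) + len(self.pending)

cache = BeaconCache(window=4)
for step in range(10):
    cache.append([float(step), 1.0])  # hypothetical per-token cache entry
```

After 10 generated entries the toy cache holds 2 beacons plus 2 uncompressed entries instead of 10, which is the source of the memory saving; what the accuracy cost is depends entirely on the quality of the learned compression, which this sketch does not model.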
PhysMaster: Mastering Physical Representation for Video Generation via Reinforcement Learning
PhysMaster:通过强化学习掌握视频生成的物理表示
- Authors: Sihui Ji, Xi Chen, Xin Tao, Pengfei Wan, Hengshuang Zhao
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2510.13809
- Pdf link: https://arxiv.org/pdf/2510.13809
- Abstract
Video generation models nowadays are capable of generating visually realistic videos, but often fail to adhere to physical laws, limiting their ability to generate physically plausible videos and serve as "world models". To address this issue, we propose PhysMaster, which captures physical knowledge as a representation for guiding video generation models to enhance their physics-awareness. Specifically, PhysMaster is based on the image-to-video task where the model is expected to predict physically plausible dynamics from the input image. Since the input image provides physical priors like relative positions and potential interactions of objects in the scenario, we devise PhysEncoder to encode physical information from it as an extra condition to inject physical knowledge into the video generation process. The lack of proper supervision on the model's physical performance beyond mere appearance motivates PhysEncoder to apply reinforcement learning with human feedback to physical representation learning, which leverages feedback from generation models to optimize physical representations with Direct Preference Optimization (DPO) in an end-to-end manner. PhysMaster provides a feasible solution for improving physics-awareness of PhysEncoder and thus of video generation, proving its ability on a simple proxy task and generalizability to wide-ranging physical scenarios. This implies that our PhysMaster, which unifies solutions for various physical processes via representation learning in the reinforcement learning paradigm, can act as a generic and plug-in solution for physics-aware video generation and broader applications.
- 中文摘要
如今的视频生成模型能够生成视觉上逼真的视频,但往往无法遵守物理定律,从而限制了它们生成物理上合理的视频并充当“世界模型”的能力。为了解决这个问题,我们提出了 PhysMaster,它将物理知识捕获为一种表示,用于引导视频生成模型,以增强其物理感知能力。具体来说,PhysMaster 基于图像到视频任务,其中模型需要从输入图像中预测物理上合理的动态。由于输入图像提供了物理先验,例如场景中物体的相对位置和潜在交互,因此我们设计了 PhysEncoder,从输入图像中编码物理信息,作为向视频生成过程注入物理知识的额外条件。由于除外观之外缺乏对模型物理表现的适当监督,PhysEncoder 将基于人类反馈的强化学习应用于物理表征学习,利用生成模型的反馈,以端到端的方式通过直接偏好优化 (DPO) 来优化物理表征。PhysMaster 为提高 PhysEncoder 乃至视频生成的物理感知能力提供了一个可行的解决方案,并证明了它在简单代理任务上的能力以及对广泛物理场景的泛化能力。这意味着我们的 PhysMaster 通过强化学习范式中的表征学习统一了各种物理过程的解决方案,可以作为物理感知视频生成及更广泛应用的通用即插即用解决方案。
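The abstract above names Direct Preference Optimization (DPO) as the optimization objective. For reference, the standard per-pair DPO loss can be written as a small function. How PhysMaster builds preference pairs or computes likelihoods for video generation is not stated in the abstract, so the scalar log-probabilities and the `beta` default below are purely illustrative assumptions.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    loss = -log sigmoid(beta * [(logp_chosen - ref_logp_chosen)
                                - (logp_rejected - ref_logp_rejected)])
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) = log(1 + exp(-m)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: no preference margin yet.
loss_at_reference = dpo_loss(-1.0, -2.0, -1.0, -2.0)

# Policy has shifted probability toward the preferred sample.
loss_after_shift = dpo_loss(-0.5, -2.5, -1.0, -2.0)
```

When the policy matches the reference the loss equals log 2, and it decreases as the policy raises the preferred sample's likelihood relative to the rejected one, measured against the reference.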
Keyword: diffusion policy
Energy-Guided Diffusion Sampling for Long-Term User Behavior Prediction in Reinforcement Learning-based Recommendation
基于强化学习的推荐中用于长期用户行为预测的能量引导扩散采样
- Authors: Xiaocong Chen, Siyu Wang, Lina Yao
- Subjects: Information Retrieval (cs.IR)
- Arxiv link: https://arxiv.org/abs/2510.12815
- Pdf link: https://arxiv.org/pdf/2510.12815
- Abstract
Reinforcement learning-based recommender systems (RL4RS) have gained attention for their ability to adapt to dynamic user preferences. However, these systems face challenges, particularly in offline settings, where data inefficiency and reliance on pre-collected trajectories limit their broader applicability. While offline reinforcement learning methods leverage extensive datasets to address these issues, they often struggle with noisy data and fail to capture long-term user preferences, resulting in suboptimal recommendation policies. To overcome these limitations, we propose Diffusion-enhanced Actor-Critic for Offline RL4RS (DAC4Rec), a novel framework that integrates diffusion processes with reinforcement learning to model complex user preferences more effectively. DAC4Rec leverages the denoising capabilities of diffusion models to enhance the robustness of offline RL algorithms and incorporates a Q-value-guided policy optimization strategy to better handle suboptimal trajectories. Additionally, we introduce an energy-based sampling strategy to reduce randomness during recommendation generation, ensuring more targeted and reliable outcomes. We validate the effectiveness of DAC4Rec through extensive experiments on six real-world offline datasets and in an online simulation environment, demonstrating its ability to optimize long-term user preferences. Furthermore, we show that the proposed diffusion policy can be seamlessly integrated into other commonly used RL algorithms in RL4RS, highlighting its versatility and wide applicability.
- 中文摘要
基于强化学习的推荐系统(RL4RS)因其适应动态用户偏好的能力而受到关注。然而,这些系统面临着挑战,特别是在离线环境中,数据效率低下和对预先收集轨迹的依赖限制了其更广泛的适用性。虽然离线强化学习方法利用广泛的数据集来解决这些问题,但它们经常难以处理嘈杂的数据,并且无法捕捉长期的用户偏好,从而导致推荐策略不理想。为了克服这些限制,我们提出了用于离线RL4RS的扩散增强Actor-Critic(DAC4Rec),这是一个将扩散过程与强化学习相结合的新颖框架,以更有效地对复杂的用户偏好进行建模。DAC4Rec 利用扩散模型的去噪能力来增强离线 RL 算法的鲁棒性,并结合 Q 值引导的策略优化策略来更好地处理次优轨迹。此外,我们还引入了基于能量的采样策略,以减少推荐生成过程中的随机性,确保更有针对性和更可靠的结果。我们通过在六个真实世界的离线数据集和在线模拟环境中进行大量实验来验证 DAC4Rec 的有效性,证明了其优化长期用户偏好的能力。此外,我们表明所提出的扩散策略可以无缝集成到RL4RS中其他常用的RL算法中,突出了其通用性和广泛适用性。
Tactile-Conditioned Diffusion Policy for Force-Aware Robotic Manipulation
用于力感知机器人操作的触觉条件扩散策略
- Authors: Erik Helmut, Niklas Funk, Tim Schneider, Cristiana de Farias, Jan Peters
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2510.13324
- Pdf link: https://arxiv.org/pdf/2510.13324
- Abstract
Contact-rich manipulation depends on applying the correct grasp forces throughout the manipulation task, especially when handling fragile or deformable objects. Most existing imitation learning approaches often treat visuotactile feedback only as an additional observation, leaving applied forces as an uncontrolled consequence of gripper commands. In this work, we present Force-Aware Robotic Manipulation (FARM), an imitation learning framework that integrates high-dimensional tactile data to infer tactile-conditioned force signals, which in turn define a matching force-based action space. We collect human demonstrations using a modified version of the handheld Universal Manipulation Interface (UMI) gripper that integrates a GelSight Mini visual tactile sensor. For deploying the learned policies, we developed an actuated variant of the UMI gripper with geometry matching our handheld version. During policy rollouts, the proposed FARM diffusion policy jointly predicts robot pose, grip width, and grip force. FARM outperforms several baselines across three tasks with distinct force requirements -- high-force, low-force, and dynamic force adaptation -- demonstrating the advantages of its two key components: leveraging force-grounded, high-dimensional tactile observations and a force-based control space. The codebase and design files are open-sourced and available at this https URL .
- 中文摘要
接触丰富的操作取决于在整个操作任务中施加正确的抓取力,尤其是在处理易碎或可变形的物体时。大多数现有的模仿学习方法通常仅将视觉触觉反馈视为额外的观测,而将施加的力视为夹持器指令的不受控结果。在这项工作中,我们提出了力感知机器人操作 (FARM),这是一个模仿学习框架,它集成高维触觉数据来推断触觉条件力信号,进而定义与之匹配的基于力的动作空间。我们使用集成了 GelSight Mini 视觉触觉传感器的手持式通用操作接口 (UMI) 夹持器的改进版本来收集人类演示。为了部署学习到的策略,我们开发了 UMI 夹持器的驱动版本,其几何形状与我们的手持版本相匹配。在策略执行 (rollout) 期间,所提出的 FARM 扩散策略联合预测机器人位姿、夹持宽度和夹持力。FARM 在具有不同力要求的三项任务(高力、低力和动态力适应)中优于多个基线,展示了其两个关键组件的优势:利用以力为依据的高维触觉观测,以及基于力的控制空间。代码库和设计文件已开源,可在此 https URL 上获得。