Arxiv Papers of Today

Generated: 2025-10-23 16:30:31 (UTC+8); Arxiv release time: 2025-10-23 20:00 EDT (2025-10-24 08:00 UTC+8)

30 related papers today

Keyword: reinforcement learning

CosmoCore Affective Dream-Replay Reinforcement Learning for Code Generation

Authors: Santhosh Kumar Ravindran
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2510.18895
Pdf link: https://arxiv.org/pdf/2510.18895
Abstract We introduce CosmoCore, a neuroscience-inspired reinforcement learning (RL) architecture that integrates affective signals to enhance code generation in large language models (LLMs). Motivated by human and animal learning where embarrassment from mistakes drives rapid correction, as observed in training a puppy to avoid repeating errors after a single scolding, CosmoCore tags code generation trajectories with valence and surprise using a lightweight multi-layer perceptron (MLP). High-negative valence (cringe) episodes, such as buggy code outputs, are prioritized in a Dream Queue for five-fold replay during off-policy updates, while low-surprise successes are pruned to prevent overconfidence and buffer bloat. Evaluated on code generation benchmarks like HumanEval and BigCodeBench, alongside simulations with a custom data pipeline environment, CosmoCore reduces hallucinated code (e.g., syntax errors or logical bugs) by 48% and accelerates self-correction by 45%. Local experiments using Hugging Face models in a PySpark environment validate these gains, with code snippets provided for replication. Ablations confirm valence tagging boosts curiosity in exploration, and pruning mitigates inefficiency. This framework extends RL from human feedback (RLHF) for more emotionally aware code assistants, with applications in IDEs and data pipelines. Code and the custom mini-world simulation are released.
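As a rough illustration of the replay scheme the abstract describes, the sketch below keeps a toy "Dream Queue" in which high-negative-valence episodes are replayed five times and low-surprise successes are pruned. The class name, the `Episode` fields, and both thresholds are hypothetical; the abstract does not specify them, and in the paper the valence and surprise tags come from a lightweight MLP rather than being supplied by hand.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    code: str
    valence: float   # negative = "cringe" (e.g. buggy output); assumed range [-1, 1]
    surprise: float  # assumed range [0, 1]
    success: bool


class DreamQueue:
    """Toy replay buffer: prune low-surprise successes, over-replay cringe episodes."""

    def __init__(self, cringe_threshold=-0.5, surprise_floor=0.2, replay_factor=5):
        # The two thresholds are illustrative assumptions; the five-fold
        # replay factor is the figure quoted in the abstract.
        self.cringe_threshold = cringe_threshold
        self.surprise_floor = surprise_floor
        self.replay_factor = replay_factor
        self.episodes: List[Episode] = []

    def add(self, ep: Episode) -> bool:
        """Store the episode unless it is a low-surprise success; return True if kept."""
        if ep.success and ep.surprise < self.surprise_floor:
            return False
        self.episodes.append(ep)
        return True

    def replay_batch(self) -> List[Episode]:
        """Episodes for one off-policy update, with cringe episodes repeated."""
        batch = []
        for ep in self.episodes:
            repeats = self.replay_factor if ep.valence <= self.cringe_threshold else 1
            batch.extend([ep] * repeats)
        return batch
```

Under these assumptions, a buffer holding one buggy episode and one surprising success yields a batch of six items: the buggy episode five times and the success once.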

ADPO: Anchored Direct Preference Optimization

Authors: Wang Zixian
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2510.18913
Pdf link: https://arxiv.org/pdf/2510.18913
Abstract Anchored Direct Preference Optimization (ADPO) is a unified framework that generalizes Direct Preference Optimization (DPO) with soft preferences, reference-policy anchoring, and groupwise extensions. While standard DPO assumes hard binary labels and pairwise comparisons, ADPO introduces: (i) soft preference probabilities that encode uncertainty and mitigate gradient drift; (ii) arbitrary reference-policy anchors that stabilize training via groupwise shift invariance and implicit KL regularization; and (iii) listwise preference modeling through Plackett-Luce distributions. We prove that DPO, Bradley-Terry objectives, and Top-1-vs-Rest formulations emerge as special cases. ADPO yields three practical variants: pairwise anchored Soft-DPO, listwise anchored Soft-DPO with raw rewards, and KDE-based listwise smoothing for heavy-tailed noise. In contextual bandits, anchoring improves WinMass by 38-63% over standard DPO, while KDE smoothing achieves 0.68 vs 0.32 under heavy-tailed contamination (112% relative gain). In sequential reinforcement learning (CartPole, LunarLander), anchoring improves noisy-preference performance by 15-29%, confirming transfer from single-step to multi-step settings. Experiments with 10-256 parameter models provide clear guidance: use pairwise anchored Soft-DPO for clean or moderate noise, and KDE-based listwise ADPO for extreme contamination.
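To make the "soft preference" and "reference-policy anchoring" ideas concrete, here is a minimal sketch of a pairwise loss with a soft label. It is the generic cross-entropy between a soft preference probability `q` and the Bradley-Terry probability implied by reference-anchored log-ratios, and is not claimed to be the paper's exact ADPO objective. The function names and the default `beta` are assumptions.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def soft_dpo_loss(logp_a, logp_b, ref_logp_a, ref_logp_b, q=1.0, beta=0.1):
    """Soft-label pairwise preference loss.

    q is the probability that response a is preferred over response b.
    The margin compares policy log-probabilities against a reference
    (anchor) policy, as in DPO.
    """
    margin = beta * ((logp_a - ref_logp_a) - (logp_b - ref_logp_b))
    return -(q * log_sigmoid(margin) + (1.0 - q) * log_sigmoid(-margin))
```

With `q = 1` the expression reduces to the standard hard-label DPO loss, `-log sigmoid(margin)`. Adding the same constant to both policy log-probabilities leaves the loss unchanged, which is a simple pairwise instance of the shift invariance the abstract mentions.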

Noise-corrected GRPO: From Noisy Rewards to Unbiased Gradients

Authors: Omar El mansouri, Mohamed El Amine Seddik, Salem Lahlou
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.18924
Pdf link: https://arxiv.org/pdf/2510.18924
Abstract Reinforcement learning from human feedback (RLHF) or verifiable rewards (RLVR), the standard paradigm for aligning LLMs or building recent SOTA reasoning models, is highly sensitive to noise from inconsistent or erroneous rewards. Yet, the interaction between such noise and widely used group-based policy optimization methods remains underexplored. We introduce a noise-robust Group Relative Policy Optimization (GRPO) and Done Right GRPO framework that explicitly models reward corruption as Bernoulli noise. Our method applies noise correction after estimating reward flip probabilities to debias the learning signal, yielding provably unbiased gradient estimates. Theoretical analysis shows that group-based methods inherently mitigate individual-level noise, and our correction strategy amplifies this robustness. Empirically, we observe consistent improvements across math and code tasks when applying our noise correction to standard reward model usage, with particular gains of up to 6.7 percentage points in accuracy on math tasks and 1.5 on code tasks under realistic reward model conditions. This work bridges label-noise correction from supervised learning with modern RLHF, offering both theoretical insights and a practical algorithm for noisy real-world deployment.
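The idea of debiasing a binary reward under Bernoulli flip noise can be sketched with the standard correction from the label-noise literature. This is an illustration under stated assumptions, not the paper's exact estimator: `rho0` and `rho1` are hypothetical names for the flip probabilities, which are assumed to be already estimated, and the advantage here is only mean-centered within a group.

```python
def debias_reward(noisy_reward, rho0, rho1):
    """Unbiased estimate of a binary reward observed through flip noise.

    rho0 = P(observed 1 | true 0), rho1 = P(observed 0 | true 1).
    Requires rho0 + rho1 < 1 so the observation is still informative.
    """
    denom = 1.0 - rho0 - rho1
    if denom <= 0:
        raise ValueError("flip probabilities must satisfy rho0 + rho1 < 1")
    return (noisy_reward - rho0) / denom


def centered_advantages(noisy_rewards, rho0, rho1):
    """Correct each reward in a group of rollouts, then subtract the group mean."""
    corrected = [debias_reward(r, rho0, rho1) for r in noisy_rewards]
    mu = sum(corrected) / len(corrected)
    return [c - mu for c in corrected]
```

Taking expectations over the noise shows why this works: a true reward of 1 is observed as 1 with probability `1 - rho1`, so the corrected value averages to exactly 1, and a true reward of 0 averages to exactly 0. Because the correction is affine, it would cancel under within-group standard-deviation normalization, which is why this sketch only mean-centers.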

BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping

Authors: Zhiheng Xi, Xin Guo, Yang Nan, Enyu Zhou, Junrui Shen, Wenxiang Chen, Jiaqi Liu, Jixuan Huang, Zhihao Zhang, Honglin Guo, Xun Deng, Zhikai Lei, Miao Zheng, Guoteng Wang, Shuo Zhang, Peng Sun, Rui Zheng, Hang Yan, Tao Gui, Qi Zhang, Xuanjing Huang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.18927
Pdf link: https://arxiv.org/pdf/2510.18927
Abstract Reinforcement learning (RL) has recently become the core paradigm for aligning and strengthening large language models (LLMs). Yet, applying RL in off-policy settings--where stale data from past policies are used for training--improves sample efficiency, but remains challenging: policy entropy declines sharply, optimization often becomes unstable and may even collapse. Through theoretical and empirical analysis, we identify two key insights: (i) an imbalance in optimization, where negative-advantage samples dominate the policy gradient, suppressing useful behaviors and risking gradient explosions; and (ii) the derived Entropy-Clip Rule, which reveals that the fixed clipping mechanism in PPO-like objectives systematically blocks entropy-increasing updates, thereby driving the policy toward over-exploitation at the expense of exploration. Building on these insights, we propose BAlanced Policy Optimization with Adaptive Clipping (BAPO), a simple yet effective method that dynamically adjusts clipping bounds to adaptively re-balance positive and negative contributions, preserve entropy, and stabilize RL optimization. Across diverse off-policy scenarios--including sample replay and partial rollout--BAPO achieves fast, stable, and data-efficient training. On AIME 2024 and AIME 2025 benchmarks, our 7B BAPO model surpasses open-source counterparts such as SkyWork-OR1-7B, while our 32B BAPO model not only achieves state-of-the-art results among models of the same scale but also outperforms leading proprietary systems like o3-mini and Gemini-2.5-Flash-Thinking.

REPAIR Approach for Social-based City Reconstruction Planning in case of natural disasters

Authors: Ghulam Mudassir, Antinisca Di Marco, Giordano d'Aloisio
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.19048
Pdf link: https://arxiv.org/pdf/2510.19048
Abstract Natural disasters always have several effects on human lives. It is challenging for governments to tackle these incidents and to rebuild the economic, social and physical infrastructures and facilities with the available resources (mainly budget and time). Governments always define plans and policies according to the law and political strategies that should maximise social benefits. The severity of damage and the vast resources needed to bring life back to normality make such reconstruction a challenge. This article extends our previously published work with a comprehensive comparative analysis that integrates additional deep learning models and a random agent used as a baseline. Our prior research introduced a decision support system that uses Deep Reinforcement Learning to plan post-disaster city reconstruction, maximizing the social benefit of the reconstruction process, considering available resources, meeting the needs of the broad community stakeholders (like citizens' social benefits and politicians' priorities) and taking into account the city's structural constraints (like dependencies among roads and buildings). The proposed approach, named post disaster REbuilding plAn ProvIdeR (REPAIR), is generic. It can determine a set of alternative plans for local administrators, who select the ideal one to implement, and it can be applied to areas of any extent. We show the application of REPAIR in a real use case, i.e., the reconstruction of L'Aquila, damaged in 2009 by a major earthquake.

Rectifying Shortcut Behaviors in Preference-based Reward Learning

Authors: Wenqian Ye, Guangtao Zheng, Aidong Zhang
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.19050
Pdf link: https://arxiv.org/pdf/2510.19050
Abstract In reinforcement learning from human feedback, preference-based reward models play a central role in aligning large language models to human-aligned behavior. However, recent studies show that these models are prone to reward hacking and often fail to generalize well due to over-optimization. They achieve high reward scores by exploiting shortcuts, that is, exploiting spurious features (e.g., response verbosity, agreeable tone, or sycophancy) that correlate with human preference labels in the training data rather than genuinely reflecting the intended objectives. In this paper, instead of probing these issues one at a time, we take a broader view of the reward hacking problem as shortcut behaviors and introduce a principled yet flexible approach to mitigate shortcut behaviors in preference-based reward learning. Inspired by the invariant theory in the kernel perspective, we propose Preference-based Reward Invariance for Shortcut Mitigation (PRISM), which learns group-invariant kernels with feature maps in a closed-form learning objective. Experimental results in several benchmarks show that our method consistently improves the accuracy of the reward model on diverse out-of-distribution tasks and reduces the dependency on shortcuts in downstream policy models, establishing a robust framework for preference-based alignment.

POLAR: Policy-based Layerwise Reinforcement Learning Method for Stealthy Backdoor Attacks in Federated Learning

Authors: Kuai Yu, Xiaoyu Wu, Peishen Yan, Qingqian Yang, Linshan Jiang, Hao Wang, Yang Hua, Tao Song, Haibing Guan
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.19056
Pdf link: https://arxiv.org/pdf/2510.19056
Abstract Federated Learning (FL) enables decentralized model training across multiple clients without exposing local data, but its distributed nature makes it vulnerable to backdoor attacks. Whereas early FL backdoor attacks modified entire models, recent studies have explored the concept of backdoor-critical (BC) layers, poisoning only the chosen influential layers to maintain stealthiness while achieving high effectiveness. However, existing BC-layer approaches rely on rule-based selection without consideration of the interrelations between layers, making them ineffective and prone to detection by advanced defenses. In this paper, we propose POLAR (POlicy-based LAyerwise Reinforcement learning), the first pipeline to creatively adopt RL to solve the BC layer selection problem in layer-wise backdoor attacks. Unlike other commonly used RL paradigms, POLAR is lightweight with Bernoulli sampling. POLAR dynamically learns an attack strategy, optimizing layer selection using policy gradient updates based on backdoor success rate (BSR) improvements. To ensure stealthiness, we introduce a regularization constraint that limits the number of modified layers by penalizing large attack footprints. Extensive experiments demonstrate that POLAR outperforms the latest attack methods by up to 40% against six state-of-the-art (SOTA) defenses.
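The optimization pattern named in the abstract, Bernoulli sampling of a layer mask with policy-gradient updates and a penalty on the number of selected layers, can be sketched generically as a REINFORCE loop. Everything below is a hypothetical toy: `reward_fn` is a stand-in score supplied by the caller, and the learning rate, penalty weight, and running baseline are assumptions rather than details from the paper.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_layer_policy(reward_fn, num_layers, steps=3000, lr=0.1, penalty=0.1, seed=0):
    """REINFORCE over independent Bernoulli selection probabilities, one per layer."""
    rng = random.Random(seed)
    logits = [0.0] * num_layers
    baseline = 0.0  # running average of returns, used to reduce variance
    for _ in range(steps):
        probs = [sigmoid(z) for z in logits]
        mask = [1 if rng.random() < p else 0 for p in probs]
        # Return = task score minus a footprint penalty on the number of selected layers.
        ret = reward_fn(mask) - penalty * sum(mask)
        advantage = ret - baseline
        baseline += 0.05 * (ret - baseline)
        # For a Bernoulli with p = sigmoid(z): d/dz log P(m) = m - p.
        for i in range(num_layers):
            logits[i] += lr * advantage * (mask[i] - probs[i])
    return [sigmoid(z) for z in logits]
```

With a toy score that rewards selecting only two particular layers, the learned probabilities for those two layers rise toward 1 while the penalty pushes the others down, which is the qualitative behavior a sparse layer-selection policy is meant to show.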

A Communication-Efficient Decentralized Actor-Critic Algorithm

Authors: Xiaoxing Ren, Nicola Bastianello, Thomas Parisini, Andreas A. Malikopoulos
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2510.19199
Pdf link: https://arxiv.org/pdf/2510.19199
Abstract In this paper, we study the problem of reinforcement learning in multi-agent systems where communication among agents is limited. We develop a decentralized actor-critic learning framework in which each agent performs several local updates of its policy and value function, where the latter is approximated by a multi-layer neural network, before exchanging information with its neighbors. This local training strategy substantially reduces the communication burden while maintaining coordination across the network. We establish finite-time convergence analysis for the algorithm under Markov-sampling. Specifically, to attain the $\varepsilon$-accurate stationary point, the sample complexity is of order $\mathcal{O}(\varepsilon^{-3})$ and the communication complexity is of order $\mathcal{O}(\varepsilon^{-1}\tau^{-1})$, where $\tau$ denotes the number of local training steps. We also show how the final error bound depends on the neural network's approximation quality. Numerical experiments in a cooperative control setting illustrate and validate the theoretical findings.
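The communication pattern in the abstract, several local updates followed by one exchange with neighbors, can be illustrated on a toy problem. The sketch below is not an actor-critic algorithm: each agent simply minimizes a scalar quadratic, and the function name, the mixing matrix `weights`, and the step size are all assumptions chosen for illustration.

```python
def local_then_mix(targets, weights, tau, rounds, lr=0.1):
    """Toy decentralized loop with `tau` local steps per communication round.

    Agent i runs gradient descent on 0.5 * (x - targets[i])**2 for `tau`
    steps, then replaces its value with a weighted average of its
    neighbours' values, using row i of the mixing matrix `weights`.
    """
    n = len(targets)
    x = [0.0] * n
    for _ in range(rounds):
        for _ in range(tau):
            x = [xi - lr * (xi - ti) for xi, ti in zip(x, targets)]
        x = [sum(weights[i][j] * x[j] for j in range(n)) for i in range(n)]
    return x
```

For example, four agents on a ring with a doubly stochastic mixing matrix and `tau = 5` perform 1000 gradient steps each while communicating only 200 times. In this toy setting the network average converges to the mean of the targets, although individual agents keep a small disagreement because of their local steps.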

RLBoost: Harvesting Preemptible Resources for Cost-Efficient Reinforcement Learning on LLMs

Authors: Yongji Wu, Xueshen Liu, Haizhong Zheng, Juncheng Gu, Beidi Chen, Z. Morley Mao, Arvind Krishnamurthy, Ion Stoica
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.19225
Pdf link: https://arxiv.org/pdf/2510.19225
Abstract Reinforcement learning (RL) has become essential for unlocking advanced reasoning capabilities in large language models (LLMs). RL workflows involve interleaving rollout and training stages with fundamentally different resource requirements. Rollout typically dominates overall execution time, yet scales efficiently through multiple independent instances. In contrast, training requires tightly-coupled GPUs with full-mesh communication. Existing RL frameworks fall into two categories: co-located and disaggregated architectures. Co-located ones fail to address this resource tension by forcing both stages to share the same GPUs. Disaggregated architectures, without modifications of well-established RL algorithms, suffer from resource under-utilization. Meanwhile, preemptible GPU resources, i.e., spot instances on public clouds and spare capacity in production clusters, present significant cost-saving opportunities for accelerating RL workflows, if efficiently harvested for rollout. In this paper, we present RLBoost, a systematic solution for cost-efficient RL training that harvests preemptible GPU resources. Our key insight is that rollout's stateless and embarrassingly parallel nature aligns perfectly with preemptible and often fragmented resources. To efficiently utilize these resources despite frequent and unpredictable availability changes, RLBoost adopts a hybrid architecture with three key techniques: (1) adaptive rollout offload to dynamically adjust workloads on the reserved (on-demand) cluster, (2) pull-based weight transfer that quickly provisions newly available instances, and (3) token-level response collection and migration for efficient preemption handling and continuous load balancing. Extensive experiments show RLBoost increases training throughput by 1.51x-1.97x while improving cost efficiency by 28%-49% compared to using only on-demand GPU resources.

SPOT: Scalable Policy Optimization with Trees for Markov Decision Processes

Authors: Xuyuan Xiong, Pedro Chumpitaz-Flores, Kaixun Hua, Cheng Hua
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.19241
Pdf link: https://arxiv.org/pdf/2510.19241
Abstract Interpretable reinforcement learning policies are essential for high-stakes decision-making, yet optimizing decision tree policies in Markov Decision Processes (MDPs) remains challenging. We propose SPOT, a novel method for computing decision tree policies, which formulates the optimization problem as a mixed-integer linear program (MILP). To enhance efficiency, we employ a reduced-space branch-and-bound approach that decouples the MDP dynamics from tree-structure constraints, enabling efficient parallel search. This significantly improves runtime and scalability compared to previous methods. Our approach ensures that each iteration yields the optimal decision tree. Experimental results on standard benchmarks demonstrate that SPOT achieves substantial speedup and scales to larger MDPs with a significantly higher number of states. The resulting decision tree policies are interpretable and compact, maintaining transparency without compromising performance. These results demonstrate that our approach simultaneously achieves interpretability and scalability, delivering high-quality policies an order of magnitude faster than existing approaches.

Interpret Policies in Deep Reinforcement Learning using SILVER with RL-Guided Labeling: A Model-level Approach to High-dimensional and Multi-action Environments

Authors: Yiyu Qian, Su Nguyen, Chao Chen, Qinyue Zhou, Liyuan Zhao
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.19244
Pdf link: https://arxiv.org/pdf/2510.19244
Abstract Deep reinforcement learning (RL) achieves remarkable performance but lacks interpretability, limiting trust in policy behavior. The existing SILVER framework (Li, Siddique, and Cao 2025) explains RL policy via Shapley-based regression but remains restricted to low-dimensional, binary-action domains. We propose SILVER with RL-guided labeling, an enhanced variant that extends SILVER to multi-action and high-dimensional environments by incorporating the RL policy's own action outputs into the boundary points identification. Our method first extracts compact feature representations from image observations, performs SHAP-based feature attribution, and then employs RL-guided labeling to generate behaviorally consistent boundary datasets. Surrogate models, such as decision trees and regression-based functions, are subsequently trained to interpret RL policy's decision structure. We evaluate the proposed framework on two Atari environments using three deep RL algorithms and conduct human-subject study to assess the clarity and trustworthiness of the derived interpretable policy. Results show that our approach maintains competitive task performance while substantially improving transparency and human understanding of agent behavior. This work advances explainable RL by transforming SILVER into a scalable and behavior-aware framework for interpreting deep RL agents in high-dimensional, multi-action settings.

Hierarchical DLO Routing with Reinforcement Learning and In-Context Vision-language Models

Authors: Mingen Li, Houjian Yu, Yixuan Huang, Youngjin Hong, Changhyun Choi
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.19268
Pdf link: https://arxiv.org/pdf/2510.19268
Abstract Long-horizon routing tasks of deformable linear objects (DLOs), such as cables and ropes, are common in industrial assembly lines and everyday life. These tasks are particularly challenging because they require robots to manipulate DLOs with long-horizon planning and reliable skill execution. Successfully completing such tasks demands adapting to their nonlinear dynamics, decomposing abstract routing goals, and generating multi-step plans composed of multiple skills, all of which require accurate high-level reasoning during execution. In this paper, we propose a fully autonomous hierarchical framework for solving challenging DLO routing tasks. Given an implicit or explicit routing goal expressed in language, our framework leverages vision-language models (VLMs) for in-context high-level reasoning to synthesize feasible plans, which are then executed by low-level skills trained via reinforcement learning. To improve robustness in long horizons, we further introduce a failure recovery mechanism that reorients the DLO into insertion-feasible states. Our approach generalizes to diverse scenes involving object attributes, spatial descriptions, as well as implicit language commands. It outperforms the next best baseline method by nearly 50% and achieves an overall success rate of 92.5% across long-horizon routing scenarios.

Managing Charging Induced Grid Stress and Battery Degradation in Electric Taxi Fleets

Authors: Michael Yuhas, Rajesh K. Ahir, Laksamana Vixell Tanjaya Hartono, Muhammad Dzaki Dwi Putranto, Arvind Easwaran, Suhono Harso Supangkat
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2510.19293
Pdf link: https://arxiv.org/pdf/2510.19293
Abstract Operating fleets of electric vehicles (EVs) introduces several challenges, some of which are borne by the fleet operator, and some of which are borne by the power grid. To maximize short-term profit a fleet operator could always charge EVs at the maximum rate to ensure vehicles are ready to service ride demand. However, due to the stochastic nature of electricity demand, charging EVs at their maximum rate may potentially increase the grid stress and lead to overall instability. Furthermore, high-rate charging of EVs can accelerate battery degradation, thereby reducing the service lifespan of the fleet. This study aims to reconcile the conflicting incentives of fleet longevity, short-term profitability, and grid stability by simulating a taxi fleet throughout its lifespan in relation to its charging policies and service conditions. We develop an EV fleet simulator to evaluate the battery degradation due to unpredictable charging and ride demand. Consequently, the impact on the power grid through the charging infrastructure is assessed due to these activities. This simulation utilizes publicly accessible real-world travel data from the NYC taxi dataset. We compare a baseline 80-20 fleet charging policy with a reinforcement learning-based policy designed to prolong the fleet's service life and alleviate grid stress. We monitor grid stress, battery degradation, and profitability over five years and find that our learned policy outperforms the baseline. This simulator enables fleet operators to assess the impact of different charging policies on these indicators to make informed decisions in the future.

QiMeng-SALV: Signal-Aware Learning for Verilog Code Generation

Authors: Yang Zhang, Rui Zhang, Jiaming Guo, Lei Huang, Di Huang, Yunpu Zhao, Shuyao Cheng, Pengwei Jin, Chongxiao Li, Zidong Du, Xing Hu, Qi Guo, Yunji Chen
Subjects: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Programming Languages (cs.PL)
Arxiv link: https://arxiv.org/abs/2510.19296
Pdf link: https://arxiv.org/pdf/2510.19296
Abstract The remarkable progress of Large Language Models (LLMs) presents promising opportunities for Verilog code generation, which is important for automated circuit design. The lack of meaningful functional rewards hinders preference optimization based on Reinforcement Learning (RL) for producing functionally correct Verilog code. In this paper, we propose Signal-Aware Learning for Verilog code generation (QiMeng-SALV), which leverages code segments of functionally correct output signals to optimize RL training. Because Verilog code specifies the structural interconnection of hardware gates and wires so that different output signals are independent, the key insight of QiMeng-SALV is to extract verified signal-aware implementations from partially incorrect modules, so as to enhance the extraction of meaningful functional rewards. Roughly, we verify the functional correctness of signals in a generated module by comparing them with those of the reference module in the training data. Then an abstract syntax tree (AST) is employed to identify signal-aware code segments that can provide meaningful functional rewards from erroneous modules. Finally, we introduce signal-aware DPO, which is optimized on the correct signal-level code segments, thereby preventing noise and interference from incorrect signals. The proposed QiMeng-SALV underscores the paradigm shift from conventional module-level to fine-grained signal-level optimization in Verilog code generation, addressing the issue of insufficient functional rewards. Experiments demonstrate that our method achieves state-of-the-art performance on VerilogEval and RTLLM, with a 7B parameter model matching the performance of the DeepSeek v3 671B model and significantly outperforming the leading open-source model CodeV trained on the same dataset. Our code is available at this https URL.

Unified Reinforcement and Imitation Learning for Vision-Language Models

视觉语言模型的统一强化和模仿学习

Authors: Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro, Yu-Chiang Frank Wang, Yueh-Hua Wu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2510.19307
Pdf link: https://arxiv.org/pdf/2510.19307
Abstract Vision-Language Models (VLMs) have achieved remarkable progress, yet their large scale often renders them impractical for resource-constrained environments. This paper introduces Unified Reinforcement and Imitation Learning (RIL), a novel and efficient training algorithm designed to create powerful, lightweight VLMs. RIL distinctively combines the strengths of reinforcement learning with adversarial imitation learning. This enables smaller student VLMs not only to mimic the sophisticated text generation of large teacher models but also to systematically improve their generative capabilities through reinforcement signals. Key to our imitation framework is an LLM-based discriminator that adeptly distinguishes between student and teacher outputs, complemented by guidance from multiple large teacher VLMs to ensure diverse learning. This unified learning strategy, leveraging both reinforcement and imitation, empowers student models to achieve significant performance gains, making them competitive with leading closed-source VLMs. Extensive experiments on diverse vision-language benchmarks demonstrate that RIL significantly narrows the performance gap with state-of-the-art open- and closed-source VLMs and, in several instances, surpasses them.
中文摘要 视觉语言模型（VLM）已经取得了显着的进步，但其规模之大往往使其在资源受限的环境中变得不切实际。本文介绍了统一强化和模仿学习（RIL），这是一种新颖高效的训练算法，旨在创建强大、轻量级的VLM。RIL独特地结合了强化学习和对抗性模仿学习的优势。这使得较小的学生VLM不仅能够模仿大型教师模型的复杂文本生成，还可以通过强化信号系统地提高其生成能力。我们模仿框架的关键是一个基于 LLM 的鉴别器，它熟练地区分学生和教师的输出，并辅以多个大型教师 VLM 的指导，以确保多样化的学习。这种统一的学习策略利用强化和模仿，使学生模型能够实现显着的性能提升，使其与领先的闭源 VLM 竞争。对各种视觉语言基准测试的广泛实验表明，RIL 显着缩小了与最先进的开源和闭源 VLM 的性能差距，并且在一些情况下超过了它们。

Continual Knowledge Adaptation for Reinforcement Learning

强化学习的持续知识适应

Authors: Jinwu Hu, Zihao Lian, Zhiquan Wen, Chenghao Li, Guohao Chen, Xutao Wen, Bin Xiao, Mingkui Tan
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.19314
Pdf link: https://arxiv.org/pdf/2510.19314
Abstract Reinforcement Learning enables agents to learn optimal behaviors through interactions with environments. However, real-world environments are typically non-stationary, requiring agents to continuously adapt to new tasks and changing conditions. Although Continual Reinforcement Learning facilitates learning across multiple tasks, existing methods often suffer from catastrophic forgetting and inefficient knowledge utilization. To address these challenges, we propose Continual Knowledge Adaptation for Reinforcement Learning (CKA-RL), which enables the accumulation and effective utilization of historical knowledge. Specifically, we introduce a Continual Knowledge Adaptation strategy, which involves maintaining a task-specific knowledge vector pool and dynamically using historical knowledge to adapt the agent to new tasks. This process mitigates catastrophic forgetting and enables efficient knowledge transfer across tasks by preserving and adapting critical model parameters. Additionally, we propose an Adaptive Knowledge Merging mechanism that combines similar knowledge vectors to address scalability challenges, reducing memory requirements while ensuring the retention of essential knowledge. Experiments on three benchmarks demonstrate that the proposed CKA-RL outperforms state-of-the-art methods, achieving an improvement of 4.20% in overall performance and 8.02% in forward transfer. The source code is available at this https URL.
中文摘要 强化学习使智能体能够通过与环境的交互来学习最佳行为。然而，现实世界的环境通常是非静止的，需要代理不断适应新任务和不断变化的条件。尽管持续强化学习促进了跨多个任务的学习，但现有方法经常遭受灾难性的遗忘和知识利用效率低下的问题。为了应对这些挑战，我们提出了强化学习的持续知识适应（CKA-RL），它能够积累和有效利用历史知识。具体来说，我们引入了持续知识适应策略，该策略涉及维护特定于任务的知识向量池，并动态使用历史知识使智能体适应新任务。这个过程减少了灾难性的遗忘，并通过保留和调整关键模型参数来实现跨任务的高效知识转移。此外，我们提出了一种自适应知识合并机制，该机制结合了相似的知识向量来应对可扩展性挑战，减少内存需求，同时确保保留基本知识。在三个基准测试上的实验表明，所提出的 CKA-RL 优于最先进的方法，整体性能提高了 4.20%，前向传输提高了 8.02%。源代码可在此 https URL 中找到。
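
The idea of merging similar knowledge vectors to bound memory can be sketched as below. This is an assumption-laden illustration rather than the paper's Adaptive Knowledge Merging mechanism: the cosine-similarity criterion, the threshold value, and the unweighted averaging are all choices made for the example.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_similar(pool: List[Vector], threshold: float = 0.95) -> List[Vector]:
    """Greedily fold each vector into the first kept vector it resembles."""
    merged: List[Vector] = []
    for vec in pool:
        for i, kept in enumerate(merged):
            if cosine(vec, kept) >= threshold:
                # Assumed merge rule: simple element-wise average.
                merged[i] = [(x + y) / 2 for x, y in zip(kept, vec)]
                break
        else:
            merged.append(list(vec))
    return merged


pool = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
compact = merge_similar(pool)
print(len(pool), "->", len(compact))
```

The first two vectors point in nearly the same direction and collapse into one, while the orthogonal third vector is kept separately.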

Balancing Rewards in Text Summarization: Multi-Objective Reinforcement Learning via HyperVolume Optimization

平衡文本摘要中的奖励：通过超容量优化进行多目标强化学习

Authors: Junjie Song, Yiwen Liu, Dapeng Li, Yin Sun, Shukun Fu, Siqi Chen, Yuji Cao
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.19325
Pdf link: https://arxiv.org/pdf/2510.19325
Abstract Text summarization is a crucial task that requires the simultaneous optimization of multiple objectives, including consistency, coherence, relevance, and fluency, which presents considerable challenges. Although large language models (LLMs) have demonstrated remarkable performance, enhanced by reinforcement learning (RL), few studies have focused on optimizing the multi-objective problem of summarization through RL based on LLMs. In this paper, we introduce hypervolume optimization (HVO), a novel optimization strategy that dynamically adjusts the scores between groups during the reward process in RL by using the hypervolume method. This method guides the model's optimization to progressively approximate the Pareto front, thereby generating balanced summaries across multiple objectives. Experimental results on several representative summarization datasets demonstrate that our method outperforms group relative policy optimization (GRPO) in overall scores and shows more balanced performance across different dimensions. Moreover, a 7B foundation model enhanced by HVO performs comparably to GPT-4 in the summarization task, while maintaining a shorter generation length. Our code is publicly available at this https URL.
中文摘要 文本摘要是一项至关重要的任务，需要同时优化多个目标，包括一致性、连贯性、相关性和流畅性，这带来了相当大的挑战。尽管大型语言模型（LLM）在强化学习（RL）的增强下表现出了卓越的性能，但很少有研究专注于通过基于LLM的强化学习优化摘要的多目标问题。在本文中，我们引入了超容量优化（HVO），这是一种新型的优化策略，它利用超容量方法在RL的奖励过程中动态调整组间的分数。该方法指导模型的优化以逐步近似帕累托前沿，从而生成跨多个目标的平衡摘要。在几个具有代表性的汇总数据集上的实验结果表明，我们的方法在总分上优于群体相对策略优化（GRPO），并且在不同维度上表现出更平衡的性能。此外，由 HVO 增强的 7B 基础模型在摘要任务中的性能与 GPT-4 相当，同时保持较短的生成长度。我们的代码在此 https URL 上公开可用
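
To see why a hypervolume-style score favors balanced summaries, consider the simplest case: the volume of the box spanned by a single score vector and a reference point. The sketch below is only an illustration of that intuition, with a hypothetical function name and a reference point of 0; it is not the HVO reward described in the paper.

```python
import math
from typing import Sequence


def box_hypervolume(scores: Sequence[float], reference: float = 0.0) -> float:
    """Volume of the axis-aligned box between `reference` and `scores`."""
    return math.prod(max(s - reference, 0.0) for s in scores)


# Two hypothetical summaries scored on consistency, coherence, relevance, fluency.
balanced = [0.5, 0.5, 0.5, 0.5]
lopsided = [0.9, 0.9, 0.1, 0.1]

# Both have the same sum (2.0), so a plain sum cannot tell them apart.
print(sum(balanced), sum(lopsided))
print(box_hypervolume(balanced) > box_hypervolume(lopsided))
```

Because the volume is a product, one weak dimension shrinks the whole score, which a summed reward would not capture.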

Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning

每一个注意力都很重要：用于长上下文推理的高效混合架构

Authors: Ling Team, Bin Han, Caizhi Tang, Chen Liang, Donghao Zhang, Fan Yuan, Feng Zhu, Jie Gao, Jingyu Hu, Longfei Li, Meng Li, Mingyang Zhang, Peijie Jiang, Peng Jiao, Qian Zhao, Qingyuan Yang, Wenbo Shen, Xinxing Yang, Yalin Zhang, Yankun Ren, Yao Zhao, Yibo Cao, Yixuan Sun, Yue Zhang, Yuchen Fang, Zibin Lin, Zixuan Cheng, Jun Zhou
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.19338
Pdf link: https://arxiv.org/pdf/2510.19338
Abstract In this technical report, we present the Ring-linear model series, specifically including Ring-mini-linear-2.0 and Ring-flash-linear-2.0. Ring-mini-linear-2.0 comprises 16B parameters and 957M activations, while Ring-flash-linear-2.0 contains 104B parameters and 6.1B activations. Both models adopt a hybrid architecture that effectively integrates linear attention and softmax attention, significantly reducing I/O and computational overhead in long-context inference scenarios. Compared to a 32 billion parameter dense model, this series reduces inference cost to 1/10, and compared to the original Ring series, the cost is also reduced by over 50%. Furthermore, through systematic exploration of the ratio between different attention mechanisms in the hybrid architecture, we have identified the currently optimal model structure. Additionally, by leveraging our self-developed high-performance FP8 operator library, linghe, overall training efficiency has been improved by 50%. Benefiting from the high alignment between the training and inference engine operators, the models can undergo long-term, stable, and highly efficient optimization during the reinforcement learning phase, consistently maintaining SOTA performance across multiple challenging complex reasoning benchmarks.
中文摘要 在本技术报告中，我们介绍了环形线性模型系列，具体包括环形-迷你线性-2.0 和环形闪存线性-2.0。Ring-mini-linear-2.0 包含 16B 参数和 957M 激活，而 Ring-flash-linear-2.0 包含 104B 参数和 6.1B 激活。两种模型都采用混合架构，有效集成了线性注意力和softmax注意力，显著降低了长上下文推理场景下的I/O和计算开销。与 320 亿参数密集模型相比，该系列将推理成本降低到 1/10，与原来的 Ring 系列相比，成本也降低了 50% 以上。此外，通过对混合架构中不同注意力机制之间的比例进行系统探索，我们确定了当前最优的模型结构。此外，通过利用我们自主研发的高性能FP8算子库——linghe，整体训练效率提高了50%。得益于训练引擎算子和推理引擎算子之间的高度一致性，模型可以在强化学习阶段进行长期、稳定和高效的优化，在多个具有挑战性的复杂推理基准中始终保持SOTA性能。

A Markov Decision Process for Variable Selection in Branch & Bound

分支和边界中变量选择的马尔可夫决策过程

Authors: Paul Strang, Zacharie Alès, Côme Bissuel, Olivier Juan, Safia Kedad-Sidhoum, Emmanuel Rachelson
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.19348
Pdf link: https://arxiv.org/pdf/2510.19348
Abstract Mixed-Integer Linear Programming (MILP) is a powerful framework used to address a wide range of NP-hard combinatorial optimization problems, often solved by Branch and Bound (B&B). A key factor influencing the performance of B&B solvers is the variable selection heuristic governing branching decisions. Recent contributions have sought to adapt reinforcement learning (RL) algorithms to the B&B setting to learn optimal branching policies, through Markov Decision Processes (MDP) inspired formulations, and ad hoc convergence theorems and algorithms. In this work, we introduce BBMDP, a principled vanilla MDP formulation for variable selection in B&B, which makes it possible to leverage a broad range of RL algorithms for the purpose of learning optimal B&B heuristics. Computational experiments validate our model empirically, as our branching agent outperforms prior state-of-the-art RL agents on four standard MILP benchmarks.
中文摘要 混合整数线性规划（MILP）是一个强大的框架，用于解决各种 NP 硬组合优化问题，通常通过分支和边界（B&B）解决。影响 B&B 求解器性能的一个关键因素是控制分支决策的变量选择启发式方法。最近的贡献试图通过马尔可夫决策过程（MDP）启发的公式以及临时收敛定理和算法，使强化学习（RL）算法适应B&B环境，以学习最佳分支策略。在这项工作中，我们引入了 BBMDP，这是一种用于 B&B 变量选择的原则性普通 MDP 公式，允许利用广泛的 RL 算法来学习最佳 B&B 启发式方法。计算实验在经验上验证了我们的模型，因为我们的分支代理在四个标准 MILP 基准测试中优于之前最先进的 RL 代理。

Autobidding Arena: unified evaluation of the classical and RL-based autobidding algorithms

Autobidding Arena：对经典和基于RL的自动竞价算法进行统一评估

Authors: Andrey Pudovikov, Alexandra Khirianova, Ekaterina Solodneva, Aleksandr Katrutsa, Egor Samosvat, Yuriy Dorn
Subjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.19357
Pdf link: https://arxiv.org/pdf/2510.19357
Abstract Advertisement auctions play a crucial role in revenue generation for e-commerce companies. To make the bidding procedure scalable to thousands of auctions, the automatic bidding (autobidding) algorithms are actively developed in the industry. Therefore, the fair and reproducible evaluation of autobidding algorithms is an important problem. We present a standardized and transparent evaluation protocol for comparing classical and reinforcement learning (RL) autobidding algorithms. We consider the most efficient autobidding algorithms from different classes, e.g., ones based on the controllers, RL, optimal formulas, etc., and benchmark them in the bidding environment. We utilize the most recent open-source environment developed in the industry, which accurately emulates the bidding process. Our work demonstrates the most promising use cases for the considered autobidding algorithms, highlights their surprising drawbacks, and evaluates them according to multiple metrics. We select the evaluation metrics that illustrate the performance of the autobidding algorithms, the corresponding costs, and track the budget pacing. Such a choice of metrics makes our results applicable to the broad range of platforms where autobidding is effective. The presented comparison results help practitioners to evaluate the candidate autobidding algorithms from different perspectives and select ones that are efficient according to their companies' targets.
中文摘要 广告拍卖在电子商务公司的创收方面发挥着至关重要的作用。为了使竞价程序可扩展到数千次拍卖，业界积极开发自动竞价（autobidding）算法。因此，自动竞价算法的公平性和可重复性评估是一个重要问题。我们提出了一种标准化且透明的评估协议，用于比较经典和强化学习（RL）自动竞价算法。我们考虑了来自不同类别的最有效的自动竞价算法，例如基于控制器、RL、最优公式等的算法，并在竞价环境中对它们进行基准测试。我们利用业内开发的最新开源环境，准确模拟投标过程。我们的工作展示了所考虑的自动竞价算法最有前途的用例，强调了它们令人惊讶的缺点，并根据多个指标对其进行了评估。我们选择说明自动竞价算法性能、相应成本的评估指标，并跟踪预算节奏。这种指标的选择使我们的结果适用于自动竞价有效的各种平台。所提出的比较结果有助于从业者从不同角度评估候选自动竞价算法，并根据其公司的目标选择有效的算法。

LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts

LoongRL：长上下文高级推理强化学习

Authors: Siyuan Wang, Gaokai Zhang, Li Lyna Zhang, Ning Shang, Fan Yang, Dongyao Chen, Mao Yang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.19363
Pdf link: https://arxiv.org/pdf/2510.19363
Abstract Reasoning over long contexts is essential for large language models. While reinforcement learning (RL) enhances short-context reasoning by inducing "Aha" moments in chain-of-thought, the advanced thinking patterns required for long-context reasoning remain largely unexplored, and high-difficulty RL data are scarce. In this paper, we introduce LoongRL, a data-driven RL method for advanced long-context reasoning. Central to LoongRL is KeyChain, a synthesis approach that transforms short multi-hop QA into high-difficulty long-context tasks by inserting UUID chains that hide the true question among large collections of distracting documents. Solving these tasks requires the model to trace the correct chain step-by-step, identify the true question, retrieve relevant facts and reason over them to answer correctly. RL training on KeyChain data induces an emergent plan-retrieve-reason-recheck reasoning pattern that generalizes far beyond training length. Models trained at 16K effectively solve 128K tasks without prohibitive full-length RL rollout costs. On Qwen2.5-7B and 14B, LoongRL substantially improves long-context multi-hop QA accuracy, with absolute gains of +23.5% and +21.1%, respectively. The resulting LoongRL-14B reaches a score of 74.2, rivaling much larger frontier models such as o3-mini (74.5) and DeepSeek-R1 (74.9). It also improves long-context retrieval, passes all 128K needle-in-a-haystack stress tests, and preserves short-context reasoning capabilities.
中文摘要 对长上下文进行推理对于大型语言模型至关重要。虽然强化学习（RL）通过在思维链中诱导“顿悟”时刻来增强短上下文推理，但长上下文推理所需的高级思维模式在很大程度上仍未被探索，高难度的强化学习数据也很少。在本文中，我们介绍了LoongRL，这是一种用于高级长上下文推理的数据驱动RL方法。LoongRL 的核心是 KeyChain，这是一种综合方法，通过插入 UUID 链将真正的问题隐藏在大量分散注意力的文档中，将短多跳 QA 转换为高难度的长上下文任务。解决这些任务需要模型逐步追踪正确的链条，识别真正的问题，检索相关事实并推理以正确回答。对 KeyChain 数据进行 RL 训练会诱导一种紧急的计划-检索-原因-重新检查推理模式，这种模式的泛化远远超出了训练长度。在 16K 下训练的模型可以有效地解决 128K 任务，而无需高昂的全长 RL 推出成本。在Qwen2.5-7B和14B上，LoongRL大幅提高了+23.5%和+21.1%的长上下文多跳QA精度。最终的 LoongRL-14B 得分达到 74.2，可与 o3-mini （74.5）和 DeepSeek-R1 （74.9）等更大的前沿型号相媲美。它还改进了长上下文检索，通过了所有 128K 大海捞针压力测试，并保留了短上下文推理能力。
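
The UUID-chain idea behind KeyChain can be sketched roughly as follows. This is a toy reconstruction from the abstract only: the function names, the dictionary representation of the long context, and the use of a single distractor chain are all assumptions, and the real synthesis scatters the links through large document collections.

```python
import random
import uuid
from typing import Dict, Tuple


def build_chain(value: str, hops: int, rng: random.Random) -> Tuple[str, Dict[str, str]]:
    """Create `hops` UUID keys; each key points to the next, the last to `value`."""
    if hops < 1:
        raise ValueError("hops must be at least 1")
    keys = [str(uuid.UUID(int=rng.getrandbits(128))) for _ in range(hops)]
    links = {current: nxt for current, nxt in zip(keys, keys[1:])}
    links[keys[-1]] = value
    return keys[0], links


def follow_chain(start: str, links: Dict[str, str]) -> str:
    """Trace the chain from `start` until the value is no longer a key."""
    value = links[start]
    while value in links:
        value = links[value]
    return value


rng = random.Random(0)
question = "Which city hosted the conference?"
start, true_links = build_chain(question, hops=4, rng=rng)
_, distractor_links = build_chain("What year was the bridge built?", hops=4, rng=rng)

# Mix true and distractor links, as they would be scattered through a long context.
items = list({**true_links, **distractor_links}.items())
rng.shuffle(items)
context = dict(items)

print(follow_chain(start, context))
```

Only the starting key is given to the solver, so the true question is recoverable solely by tracing the correct chain hop by hop.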

ColorAgent: Building A Robust, Personalized, and Interactive OS Agent

ColorAgent：构建强大、个性化和交互式的操作系统代理

Authors: Ning Li, Qiqiang Lin, Zheng Wu, Xiaoyun Mo, Weiming Zhang, Yin Zhao, Xiangmou Qu, Jiamu Zhou, Jun Wang, Congmin Zheng, Yuanyi Song, Hongjiang Chen, Heyuan Huang, Jihong Wang, Jiaxin Yin, Jingwei Yu, Junwei Liao, Qiuying Peng, Xingyu Lou, Jun Wang, Weiwen Liu, Zhuosheng Zhang, Weinan Zhang
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.19386
Pdf link: https://arxiv.org/pdf/2510.19386
Abstract With the advancements in hardware, software, and large language model technologies, the interaction between humans and operating systems has evolved from the command-line interface to the rapidly emerging AI agent interactions. Building an operating system (OS) agent capable of executing user instructions and faithfully following user desires is becoming a reality. In this technical report, we present ColorAgent, an OS agent designed to engage in long-horizon, robust interactions with the environment while also enabling personalized and proactive user interaction. To enable long-horizon interactions with the environment, we enhance the model's capabilities through step-wise reinforcement learning and self-evolving training, while also developing a tailored multi-agent framework that ensures generality, consistency, and robustness. In terms of user interaction, we explore personalized user intent recognition and proactive engagement, positioning the OS agent not merely as an automation tool but as a warm, collaborative partner. We evaluate ColorAgent on the AndroidWorld and AndroidLab benchmarks, achieving success rates of 77.2% and 50.7%, respectively, establishing a new state of the art. Nonetheless, we note that current benchmarks are insufficient for a comprehensive evaluation of OS agents and propose further exploring directions in future work, particularly in the areas of evaluation paradigms, agent collaboration, and security. Our code is available at this https URL.
中文摘要 随着硬件、软件和大语言模型技术的进步，人类与作系统之间的交互已经从命令行界面发展到快速兴起的人工智能代理交互。构建能够执行用户指令并忠实地遵循用户需求的作系统（OS）代理正在成为现实。在本技术报告中，我们介绍了 ColorAgent，这是一种作系统代理，旨在与环境进行长期、强大的交互，同时实现个性化和主动的用户交互。为了实现与环境的长期交互，我们通过逐步强化学习和自我进化训练来增强模型的能力，同时还开发了一个量身定制的多智能体框架，以确保通用性、一致性和稳健性。在用户交互方面，我们探索个性化的用户意图识别和主动参与，将作系统代理定位为不仅仅是一个自动化工具，而是一个温暖的协作伙伴。我们在 AndroidWorld 和 AndroidLab 基准测试上评估了 ColorAgent，分别取得了 77.2% 和 50.7% 的成功率，建立了新的技术水平。尽管如此，我们注意到目前的基准不足以对作系统代理进行全面评估，并提出了未来工作中进一步探索的方向，特别是在评估范式、代理协作和安全领域。我们的代码可在此 https URL 中找到。

Reasoning Like Experts: Leveraging Multimodal Large Language Models for Drawing-based Psychoanalysis

像专家一样推理：利用多模态大语言模型进行基于绘图的精神分析

Authors: Xueqi Ma, Yanbei Jiang, Sarah Erfani, James Bailey, Weifeng Liu, Krista A. Ehinger, Jey Han Lau
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Arxiv link: https://arxiv.org/abs/2510.19451
Pdf link: https://arxiv.org/pdf/2510.19451
Abstract Multimodal Large Language Models (MLLMs) have demonstrated exceptional performance across various objective multimodal perception tasks, yet their application to subjective, emotionally nuanced domains, such as psychological analysis, remains largely unexplored. In this paper, we introduce PICK, a multi-step framework designed for Psychoanalytical Image Comprehension through hierarchical analysis and Knowledge injection with MLLMs, specifically focusing on the House-Tree-Person (HTP) Test, a widely used psychological assessment in clinical practice. First, we decompose drawings containing multiple instances into semantically meaningful sub-drawings, constructing a hierarchical representation that captures spatial structure and content across three levels: single-object level, multi-object level, and whole level. Next, we analyze these sub-drawings at each level with a targeted focus, extracting psychological or emotional insights from their visual cues. We also introduce an HTP knowledge base and design a feature extraction module, trained with reinforcement learning, to generate a psychological profile for single-object level analysis. This profile captures both holistic stylistic features and dynamic object-specific features (such as those of the house, tree, or person), correlating them with psychological states. Finally, we integrate this multi-faceted information to produce a well-informed assessment that aligns with expert-level reasoning. Our approach bridges the gap between MLLMs and specialized expert domains, offering a structured and interpretable framework for understanding human mental states through visual expression. Experimental results demonstrate that the proposed PICK significantly enhances the capability of MLLMs in psychological analysis. It is further validated as a general framework through extensions to emotion understanding tasks.
中文摘要 多模态大型语言模型（MLLM）在各种客观多模态感知任务中表现出卓越的性能，但它们在主观、情感细微差别领域（例如心理分析）的应用在很大程度上仍未得到探索。在本文中，我们介绍了PICK，这是一个多步骤框架，旨在通过分层分析和MLLM的知识注入来进行精神分析图像理解，特别关注House-Tree-Person（HTP）测试，这是一种在临床实践中广泛使用的心理评估。首先，我们将包含多个实例的绘图分解为语义上有意义的子绘图，构建一个层次表示，捕获三个级别的空间结构和内容：单对象级别、多对象级别和整个级别。接下来，我们有针对性地分析每个级别的这些子图，从视觉线索中提取心理或情感见解。我们还引入了HTP知识库，并设计了一个经过强化学习训练的特征提取模块，以生成用于单对象级分析的心理概况。该配置文件捕获整体风格特征和动态对象特定特征（例如房屋、树木或人的特征），并将它们与心理状态相关联。最后，我们整合这些多方面的信息，以产生与专家级推理相一致的明智评估。我们的方法弥合了 MLLM 和专业专家领域之间的差距，为通过视觉表达理解人类心理状态提供了一个结构化且可解释的框架。实验结果表明，所提出的PICK显著增强了MLLMs在心理分析中的能力。通过对情感理解任务的扩展，它作为一个通用框架得到了进一步验证。

Using Non-Expert Data to Robustify Imitation Learning via Offline Reinforcement Learning

使用非专家数据通过离线强化学习实现模仿学习的鲁棒化

Authors: Kevin Huang, Rosario Scalise, Cleah Winston, Ayush Agrawal, Yunchu Zhang, Rohan Baijal, Markus Grotz, Byron Boots, Benjamin Burchfiel, Hongkai Dai, Masha Itkina, Paarth Shah, Abhishek Gupta
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.19495
Pdf link: https://arxiv.org/pdf/2510.19495
Abstract Imitation learning has proven effective for training robots to perform complex tasks from expert human demonstrations. However, it remains limited by its reliance on high-quality, task-specific data, restricting adaptability to the diverse range of real-world object configurations and scenarios. In contrast, non-expert data -- such as play data, suboptimal demonstrations, partial task completions, or rollouts from suboptimal policies -- can offer broader coverage and lower collection costs. However, conventional imitation learning approaches fail to utilize this data effectively. To address these challenges, we posit that with the right design decisions, offline reinforcement learning can be used as a tool to harness non-expert data to enhance the performance of imitation learning policies. We show that while standard offline RL approaches can be ineffective at actually leveraging non-expert data under the sparse data coverage settings typically encountered in the real world, simple algorithmic modifications can allow for the utilization of this data, without significant additional assumptions. Our approach shows that broadening the support of the policy distribution can allow imitation algorithms augmented by offline RL to solve tasks robustly, showing considerably enhanced recovery and generalization behavior. In manipulation tasks, these innovations significantly increase the range of initial conditions where learned policies are successful when non-expert data is incorporated. Moreover, we show that these methods are able to leverage all collected data, including partial or suboptimal demonstrations, to bolster task-directed policy performance. This underscores the importance of algorithmic techniques for using non-expert data for robust policy learning in robotics.
中文摘要 模仿学习已被证明是有效的，可以有效地训练机器人通过专家的人类演示来执行复杂的任务。然而，它仍然受到对高质量、特定于任务的数据的依赖的限制，限制了对各种现实世界对象配置和场景的适应性。相比之下，非专家数据（例如播放数据、次优演示、部分任务完成或次优策略的推出）可以提供更广泛的覆盖范围和更低的收集成本。然而，传统的模仿学习方法无法有效利用这些数据。为了应对这些挑战，我们假设通过正确的设计决策，离线强化学习可以用作利用非专家数据来提高模仿学习策略绩效的工具。我们表明，虽然标准离线RL方法在现实世界中通常遇到的稀疏数据覆盖设置下实际利用非专家数据方面可能无效，但简单的算法修改可以允许利用这些数据，而无需大量的额外假设。我们的方法表明，拓宽策略分布的支持可以允许离线RL增强的模仿算法鲁棒地解决任务，表现出显着增强的恢复和泛化行为。在作任务中，这些创新显着增加了初始条件的范围，即当合并非专家数据时，学习到的策略是成功的。此外，我们表明这些方法能够利用所有收集的数据，包括部分或次优的演示，来增强任务导向的政策绩效。这强调了算法技术对于使用非专家数据进行机器人技术稳健政策学习的重要性。

The Confusing Instance Principle for Online Linear Quadratic Control

在线线性二次控制的混淆实例原理

Authors: Waris Radji (Scool, CRIStAL), Odalric-Ambrym Maillard (Scool, CRIStAL)
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.19531
Pdf link: https://arxiv.org/pdf/2510.19531
Abstract We revisit the problem of controlling linear systems with quadratic cost under unknown dynamics with model-based reinforcement learning. Traditional methods like Optimism in the Face of Uncertainty and Thompson Sampling, rooted in multi-armed bandits (MABs), face practical limitations. In contrast, we propose an alternative based on the Confusing Instance (CI) principle, which underpins regret lower bounds in MABs and discrete Markov Decision Processes (MDPs) and is central to the Minimum Empirical Divergence (MED) family of algorithms, known for their asymptotic optimality in various settings. By leveraging the structure of LQR policies along with sensitivity and stability analysis, we develop MED-LQ. This novel control strategy extends the principles of CI and MED beyond small-scale settings. Our benchmarks on a comprehensive control suite demonstrate that MED-LQ achieves competitive performance in various scenarios while highlighting its potential for broader applications in large-scale MDPs.
中文摘要 我们通过基于模型的强化学习重新审视了在未知动力学下控制具有二次成本的线性系统的问题。传统方法，如面对不确定性的乐观主义和汤普森抽样，植根于多臂强盗（MAB），面临着实际局限性。相比之下，我们提出了一种基于混淆实例（CI）原理的替代方案，该原理支持MAB和离散马尔可夫决策过程（MDP）中的后悔下限，并且是最小经验分歧（MED）算法系列的核心，该算法以其在各种环境中的渐近最优性而闻名。通过利用 LQR 策略的结构以及灵敏度和稳定性分析，我们开发了 MED-LQ。这种新颖的控制策略将 CI 和 MED 的原理扩展到小规模环境之外。我们对综合控制套件的基准测试表明，MED-LQ 在各种场景中都取得了具有竞争力的性能，同时凸显了其在大规模 MDP 中更广泛应用的潜力。
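
For readers unfamiliar with the LQR setting, the sketch below solves the known-dynamics scalar case by iterating the discrete-time Riccati equation. It is background only and does not implement MED-LQ, which targets unknown dynamics. The scalar model x_{t+1} = a x_t + b u_t with stage cost q x^2 + r u^2 and the function name `scalar_lqr` are assumptions chosen for the example.

```python
import math
from typing import Tuple


def scalar_lqr(a: float, b: float, q: float, r: float, iterations: int = 200) -> Tuple[float, float]:
    """Return (p, k): the Riccati fixed point and the gain for u = -k * x.

    Assumes r > 0 so that every denominator is positive.
    """
    p = q
    for _ in range(iterations):
        # Scalar discrete-time Riccati recursion.
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    k = a * b * p / (r + b * b * p)
    return p, k


p, k = scalar_lqr(a=1.0, b=1.0, q=1.0, r=1.0)
print(round(p, 6), round(k, 6))
```

With a = b = q = r = 1 the fixed point satisfies p^2 - p - 1 = 0, so p is the golden ratio, and the closed-loop factor a - b k has magnitude below 1, meaning the controlled system is stable.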

MedReason-R1: Learning to Reason for CT Diagnosis with Reinforcement Learning and Local Zoom

MedReason-R1：通过强化学习和局部变焦学习推理 CT 诊断

Authors: Yifan Li, Fenghe Tang, Yingtai Li, Shaohua Kevin Zhou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2510.19626
Pdf link: https://arxiv.org/pdf/2510.19626
Abstract General-purpose large Vision-Language Models (VLMs) demonstrate strong capabilities in generating detailed descriptions for natural images. However, their performance in the medical domain remains suboptimal, even for relatively straightforward tasks, primarily due to the lack of large-scale, high-quality, specialized medical imaging datasets and the neglect of the diagnostic process that progresses from coarse to fine-grained. To address the first issue, we construct the CT-RATE-VQA dataset, which has 84K QA pairs. For the second issue, we propose MedReason-R1, a medical VLM with an explicit reasoning process for disease diagnosis. MedReason-R1 incorporates a novel strategy that embeds zoom-in disease region-of-interest areas into the image, highlighting the crucial role of both global localization and disease-specific details in enhancing the model's diagnostic performance. Furthermore, we introduce the GRPO reinforcement learning framework to MedReason-R1, which enables effective reasoning without relying on costly manual annotations. Compared to recent general-purpose and medical VLMs, MedReason-R1 achieves state-of-the-art performance in CT disease diagnosis while retaining generalization. The code, checkpoints, and dataset are available at: this https URL
中文摘要 通用大型视觉语言模型（VLM）在生成自然图像的详细描述方面表现出强大的能力。然而，即使对于相对简单的任务，它们在医疗领域的表现仍然不理想，这主要是由于缺乏大规模、高质量、专业的医学成像数据集，以及忽视了从粗粒度到细粒度的诊断过程。为了解决第一个问题，我们构建了 CT-RATE-VQA 数据集，该数据集具有 84K 个 QA 对。对于第二个问题，我们提出了 MedReason-R1，这是一种具有显式推理过程的医学 VLM，用于疾病诊断。MedReason-R1 采用了一种新颖的策略，将放大的疾病感兴趣区域嵌入到图像中，突出了全局定位和疾病特异性细节在增强模型诊断性能方面的关键作用。此外，我们在 MedReason-R1 中引入了 GRPO 强化学习框架，该框架无需依赖昂贵的手动注释即可实现有效的推理。与最近的通用和医疗VLM相比，MedReason-R1在CT疾病诊断方面实现了最先进的性能，同时保持了泛化性。代码、检查点和数据集可在以下位置获得：此 https URL

Memo: Training Memory-Efficient Embodied Agents with Reinforcement Learning

备忘录：使用强化学习训练记忆高效的具身智能体

Authors: Gunshi Gupta, Karmesh Yadav, Zsolt Kira, Yarin Gal, Rahaf Aljundi
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.19732
Pdf link: https://arxiv.org/pdf/2510.19732
Abstract To enable embodied agents to operate effectively over extended timeframes, it is crucial to develop models that form and access memories to stay contextualized in their environment. In the current paradigm of training transformer-based policies for embodied sequential decision-making tasks, visual inputs often overwhelm the context limits of transformers, while humans can maintain and utilize a lifetime of experience compressed as memories. Significant compression is possible in principle, as much of the input is irrelevant and can be abstracted. However, existing approaches predominantly focus on either recurrent models with fixed-size memory or transformers with full-context reliance. In this work, we propose Memo, a transformer-based architecture and training recipe for reinforcement learning (RL) on memory-intensive, long-horizon tasks. Memo incorporates the creation and retrieval of memory by interleaving periodic summarization tokens with the inputs of a model during training. We demonstrate Memo's effectiveness on a gridworld meta-RL benchmark and a multi-object navigation task in photo-realistic indoor settings. Memo outperforms naive long-context transformer baselines while being more compute and storage efficient. Additionally, Memo generalizes better to longer contexts at inference time and remains robust in streaming settings, where historical context must be truncated to fit inference constraints.
中文摘要 为了使具身智能体能够在较长时间内有效运作，开发形成和访问记忆以在其环境中保持情境化的模型至关重要。在当前训练基于 transformer 的策略以进行具身顺序决策任务的范式中，视觉输入往往压倒了 transformer 的上下文限制，而人类可以维护和利用压缩为记忆的终生经验。原则上，显着压缩是可能的，因为大部分输入是无关紧要的并且可以抽象。然而，现有方法主要关注具有固定大小内存的循环模型或具有全上下文依赖的转换器。在这项工作中，我们提出了 Memo，这是一种基于 Transformer 的架构和训练配方，用于内存密集型、长期任务的强化学习（RL）。Memo 通过在训练期间将周期性摘要标记与模型的输入交错，结合了内存的创建和检索。我们在逼真的室内环境中展示了 Memo 在网格世界元 RL 基准测试和多对象导航任务上的有效性。Memo 的性能优于朴素的长上下文转换器基线，同时具有更高的计算和存储效率。此外，Memo 在推理时可以更好地泛化到较长的上下文，并且在流式处理设置中保持稳健，在流式处理设置中，必须截断历史上下文以适应推理约束。
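
The interleaving of periodic summarization tokens with the model inputs can be pictured with a small sketch. The segment length, the number of summary tokens per segment, and the `<SUM>` placeholder are hypothetical choices for illustration; the abstract does not specify them, and the real method operates on learned token embeddings rather than strings.

```python
from typing import List


def interleave_summaries(
    tokens: List[str],
    segment_len: int,
    num_summary: int = 1,
    summary_token: str = "<SUM>",
) -> List[str]:
    """Append `num_summary` summary placeholders after every input segment."""
    if segment_len < 1:
        raise ValueError("segment_len must be at least 1")
    out: List[str] = []
    for start in range(0, len(tokens), segment_len):
        out.extend(tokens[start:start + segment_len])
        out.extend([summary_token] * num_summary)
    return out


observations = ["o1", "o2", "o3", "o4", "o5"]
print(interleave_summaries(observations, segment_len=2))
```

In a memory-efficient setup, later segments would attend to the summary positions instead of the raw earlier inputs, which is what allows the history to be compressed.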

SEA: Semantic Map Prediction for Active Exploration of Uncertain Areas

SEA：主动探索不确定区域的语义图预测

Authors: Hongyu Ding, Xinyue Liang, Yudong Fang, You Wu, Jieqi Shi, Jing Huo, Wenbin Li, Jing Wu, Yu-Kun Lai, Yang Gao
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.19766
Pdf link: https://arxiv.org/pdf/2510.19766
Abstract In this paper, we propose SEA, a novel approach for active robot exploration through semantic map prediction and a reinforcement learning-based hierarchical exploration policy. Unlike existing learning-based methods that rely on one-step waypoint prediction, our approach enhances the agent's long-term environmental understanding to facilitate more efficient exploration. We propose an iterative prediction-exploration framework that explicitly predicts the missing areas of the map based on current observations. The difference between the actual accumulated map and the predicted global map is then used to guide exploration. Additionally, we design a novel reward mechanism that leverages reinforcement learning to update the long-term exploration strategies, enabling us to construct an accurate semantic map within limited steps. Experimental results demonstrate that our method significantly outperforms state-of-the-art exploration strategies, achieving superior coverage areas of the global map within the same time constraints.
中文摘要 在本文中，我们提出了一种通过语义图预测和基于强化学习的分层探索策略进行主动机器人探索的新方法。与现有依赖一步路径点预测的基于学习的方法不同，我们的方法增强了智能体的长期环境理解，以促进更高效的探索。我们提出了一个迭代预测-探索框架，该框架根据当前观测结果显式预测地图的缺失区域。然后使用实际累积地图与预测全球地图之间的差异来指导探索。此外，我们还设计了一种新颖的奖励机制，利用强化学习来更新长期探索策略，使我们能够在有限的步骤内构建准确的语义图。实验结果表明，该方法明显优于最先进的勘探策略，在相同的时间限制内实现了全球地图的卓越覆盖率。
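
The "difference between the accumulated map and the predicted global map" can be illustrated with a toy grid comparison. The grid encoding (integer class labels, with `UNKNOWN = -1` for unobserved cells) and the simple cell-count measure are assumptions for this sketch, not the paper's reward definition.

```python
from typing import List

UNKNOWN = -1  # Assumed marker for cells the robot has not observed yet.
Grid = List[List[int]]


def map_discrepancy(accumulated: Grid, predicted: Grid) -> int:
    """Count cells where the predicted label differs from the observed map."""
    return sum(
        1
        for row_acc, row_pred in zip(accumulated, predicted)
        for acc, pred in zip(row_acc, row_pred)
        if acc != pred
    )


# One observed cell (label 1); the predictor fills in the other three.
accumulated = [[1, UNKNOWN], [UNKNOWN, UNKNOWN]]
predicted = [[1, 2], [0, 2]]

print(map_discrepancy(accumulated, predicted))
```

A large count indicates that much of the predicted map is still unconfirmed, so regions contributing most to it are natural candidates for the next exploration target.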

Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning

Scaf-GRPO：用于增强 LLM 推理的脚手架组相对策略优化

Authors: Xichen Zhang, Sitong Wu, Yinghao Zhu, Haoru Tan, Shaozuo Yu, Ziyi He, Jiaya Jia
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.19807
Pdf link: https://arxiv.org/pdf/2510.19807
Abstract Reinforcement learning from verifiable rewards has emerged as a powerful technique for enhancing the complex reasoning abilities of Large Language Models (LLMs). However, these methods are fundamentally constrained by the "learning cliff" phenomenon: when faced with problems far beyond their current capabilities, models consistently fail, yielding a persistent zero-reward signal. In policy optimization algorithms like GRPO, this collapses the advantage calculation to zero, rendering these difficult problems invisible to the learning gradient and stalling progress. To overcome this, we introduce Scaf-GRPO (Scaffolded Group Relative Policy Optimization), a progressive training framework that strategically provides minimal guidance only when a model's independent learning has plateaued. The framework first diagnoses learning stagnation and then intervenes by injecting tiered in-prompt hints, ranging from abstract concepts to concrete steps, enabling the model to construct a valid solution by itself. Extensive experiments on challenging mathematics benchmarks demonstrate Scaf-GRPO's effectiveness, boosting the pass@1 score of the Qwen2.5-Math-7B model on the AIME24 benchmark by a relative 44.3% over a vanilla GRPO baseline. This result demonstrates that our framework provides a robust and effective methodology for unlocking a model's ability to solve problems previously beyond its reach, a critical step towards extending the frontier of autonomous reasoning in LLMs.
中文摘要 基于可验证奖励的强化学习已成为增强大型语言模型（LLM）复杂推理能力的强大技术。然而，这些方法从根本上受到“学习悬崖”现象的制约：当面临远远超出其当前能力的问题时，模型总是失败，产生持续的零奖励信号。在 GRPO 等策略优化算法中，这会使优势计算坍缩为零，导致这些难题对学习梯度不可见，训练进展停滞。为了克服这个问题，我们引入了 Scaf-GRPO（脚手架组相对策略优化），这是一种渐进式训练框架，仅在模型的独立学习陷入停滞时才有策略地提供最少的指导。该框架首先诊断学习停滞，然后通过在提示词中注入分层提示进行干预，提示范围从抽象概念到具体步骤，使模型能够自行构建有效的解决方案。在具有挑战性的数学基准上的大量实验证明了 Scaf-GRPO 的有效性：与普通 GRPO 基线相比，Qwen2.5-Math-7B 模型在 AIME24 基准上的 pass@1 分数相对提高了 44.3%。这一结果表明，我们的框架提供了一种稳健而有效的方法，用于释放模型解决以前无法解决的问题的能力，这是拓展 LLM 自主推理前沿的关键一步。
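
The zero-advantage collapse described in the abstract above can be illustrated with a small sketch. It assumes the commonly used group-relative normalization (each reward minus the group mean, divided by the group standard deviation); the abstract does not spell out the exact formula, and the function name `group_relative_advantages` and the `eps` stabilizer are illustrative assumptions, not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group's mean and std.

    Assumed form: (r - mean) / (std + eps). `eps` is a hypothetical
    stabilizer that avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One of four sampled solutions is correct: advantages are non-zero,
# so the policy gradient carries a learning signal.
mixed = group_relative_advantages([0.0, 0.0, 1.0, 0.0])

# Every sampled solution fails (the "learning cliff" case): each reward
# equals the group mean, so every advantage is exactly zero.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print("mixed group:   ", mixed)
print("all-fail group:", all_fail)
```

Under this assumed normalization, a prompt whose whole group scores zero contributes nothing to the update, which is the situation the in-prompt hints are meant to break.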

olmOCR 2: Unit Test Rewards for Document OCR

olmOCR 2：文档 OCR 的单元测试奖励

Authors: Jake Poznanski, Luca Soldaini, Kyle Lo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.19817
Pdf link: https://arxiv.org/pdf/2510.19817
Abstract We present olmOCR 2, the latest in our family of powerful OCR systems for converting digitized print documents, like PDFs, into clean, naturally ordered plain text. olmOCR 2 is powered by olmOCR-2-7B-1025, a specialized, 7B vision language model (VLM) trained using reinforcement learning with verifiable rewards (RLVR), where our rewards are a diverse set of binary unit tests. To scale unit test creation, we develop a pipeline for generating synthetic documents with diverse and challenging layouts, known ground-truth HTML source code, and extracted test cases. We show that RL training on these test cases results in state-of-the-art performance on olmOCR-Bench, our English-language OCR benchmark, with the largest improvements in math formula conversion, table parsing, and multi-column layouts compared to previous versions. We release our model, data and code under permissive open licenses.
中文摘要 我们推出 olmOCR 2，这是我们功能强大的 OCR 系统系列中的最新产品，用于将数字化印刷文档（如 PDF）转换为干净、自然有序的纯文本。olmOCR 2 由 olmOCR-2-7B-1025 提供支持，这是一种专门的 7B 视觉语言模型（VLM），使用具有可验证奖励的强化学习（RLVR）进行训练，其中奖励是一组多样化的二值单元测试。为了扩展单元测试的创建，我们开发了一条流水线，用于生成合成文档，这些文档具有多样化且具有挑战性的布局、已知的真值 HTML 源代码以及提取出的测试用例。我们表明，在这些测试用例上进行 RL 训练可以在我们的英语 OCR 基准测试 olmOCR-Bench 上实现最先进的性能，与以前的版本相比，在数学公式转换、表格解析和多栏布局方面改进最大。我们在宽松的开放许可下发布我们的模型、数据和代码。
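
As a rough illustration of rewards built from binary unit tests, the sketch below scores an OCR output by the fraction of pass/fail checks it satisfies. This is only an assumption-laden toy: the abstract does not say how test results are aggregated into a reward, and the helpers `text_present`, `appears_before`, and `unit_test_reward` are hypothetical names, not part of the olmOCR code base.

```python
from typing import Callable, List

# A unit test here is any function mapping OCR output text to pass/fail.
UnitTest = Callable[[str], bool]


def text_present(snippet: str) -> UnitTest:
    """Hypothetical test: the expected snippet appears in the output."""
    return lambda output: snippet in output


def appears_before(first: str, second: str) -> UnitTest:
    """Hypothetical reading-order test: `first` occurs before `second`."""
    def check(output: str) -> bool:
        i, j = output.find(first), output.find(second)
        return i != -1 and j != -1 and i < j
    return check


def unit_test_reward(output: str, tests: List[UnitTest]) -> float:
    """Assumed aggregation: fraction of binary tests that pass."""
    if not tests:
        return 0.0
    return sum(1 for test in tests if test(output)) / len(tests)


tests = [
    text_present("Introduction"),
    text_present("E = mc^2"),
    appears_before("Introduction", "Conclusion"),
]

good = "Introduction\nE = mc^2\nConclusion"
bad = "Conclusion\nIntroduction"

print(unit_test_reward(good, tests))
print(unit_test_reward(bad, tests))
```

Because each check is deterministic and cheap, a reward of this shape is verifiable without a learned reward model, which is the property RLVR relies on.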

Keyword: diffusion policy

No results found.