Arxiv Papers of Today

Generated: 2026-05-22 18:54:54 (UTC+8); arXiv release time: 2026-05-22 20:00 EDT (2026-05-23 08:00 UTC+8)

48 relevant papers today

Keyword: reinforcement learning

Teaching Language Models to Forecast Research Success Through Comparative Idea Evaluation

Authors: Srujan P Mule, Aniketh Garikaparthi, Manasi Patwardhan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.21491
Pdf link: https://arxiv.org/pdf/2605.21491
Abstract As language models accelerate scientific research by automating hypothesis generation and implementation, a new bottleneck emerges: evaluating and filtering hundreds of AI-generated ideas without exhaustive experimentation. We ask whether LMs can learn to forecast the empirical success of research ideas before any experiments are run. We study comparative empirical forecasting: given a benchmark-specific research goal and two candidate ideas, predict which will achieve better benchmark performance. We construct a dataset of 11,488 idea pairs grounded in objective outcomes from PapersWithCode. While off-the-shelf 8B-parameter models struggle (30% acc.), SFT dramatically boosts performance to 77.1%, outperforming GPT-5 (61.1%). By framing evaluation as a reasoning task via Reinforcement Learning with Verifiable Rewards (RLVR), we train models to discover latent reasoning paths, achieving 71.35% acc. with interpretable justifications. Through additional ablations and out-of-distribution tests, we show robustness to surface-level heuristics and transfer to both a cross-domain time-split test set and an independently constructed test set. Our results demonstrate that compute-efficient small language models can serve as effective, objective verifiers, offering a scalable path for autonomous scientific discovery.
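As a rough illustration of the comparative setup described in the abstract, a verifiable reward for a single idea pair could look like the sketch below. The function names, the `"A"`/`"B"` label format, the higher-is-better convention, and the example scores are assumptions made for illustration; none of them come from the paper.

```python
def pairwise_reward(predicted: str, score_a: float, score_b: float) -> float:
    """Return 1.0 if the predicted winner is the idea with the better benchmark score.

    Assumes higher scores are better and that ties have been filtered out.
    """
    if score_a == score_b:
        raise ValueError("ties carry no preference label")
    truth = "A" if score_a > score_b else "B"
    return 1.0 if predicted.strip().upper() == truth else 0.0


def accuracy(predictions, pairs):
    """Mean reward over a list of (score_a, score_b) pairs."""
    rewards = [pairwise_reward(p, a, b) for p, (a, b) in zip(predictions, pairs)]
    return sum(rewards) / len(rewards)


# Hypothetical benchmark scores for three idea pairs.
pairs = [(71.2, 68.9), (40.0, 45.5), (88.1, 80.3)]
preds = ["A", "A", "A"]
acc = accuracy(preds, pairs)  # two of the three predictions are correct
```

Because the reward depends only on recorded benchmark outcomes, it can be checked automatically, which is what makes the task usable for RLVR-style training.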

Value-Gradient Hypothesis of RL for LLMs

Authors: Arip Asadulaev, Daniil Ognev, Karim Salta, Martin Takac
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.21654
Pdf link: https://arxiv.org/pdf/2605.21654
Abstract Reinforcement learning substantially improves pretrained language models, but it remains understudied why critic-free methods such as PPO and GRPO work as well as they do, and when they should provide the largest gains. We develop a value-gradient perspective of critic-free RL for LLM post-training. First, under a differentiable rollout and additive-noise parameterization, we show that the actor update is value-gradient-like in expectation: the backward pass propagates costates whose conditional expectation equals the value gradient. Second, for discrete transformer policies, we show that autodifferentiation through attention produces empirical costates that approximate this value signal, with an error controlled by the sampling gap and policy entropy. These results motivate a decomposition of RL impact into value gradient signal and reachable reward headroom, yielding a criterion for when RL should be most effective along a pretraining trajectory.

Closed-Loop Sim-to-Real Reinforcement Learning for Deformable Microfiber Shape Control

Authors: Alessandro Amici, Houari Bettahar, Veeti Jaakkola, Quan Zhou
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2605.21688
Pdf link: https://arxiv.org/pdf/2605.21688
Abstract Autonomous contact-based micromanipulation is challenging because surface and interfacial interactions at the microscale are difficult to model accurately, limiting the use of conventional model-based control and sim-to-real learning. We present a closed-loop sim-to-real reinforcement learning (RL) approach for microfiber shape control on a surface. The central idea is to train geometric shape regulation in a simplified frictionless simulator and rely on real-time visual feedback during deployment to iteratively correct the observed effects of unmodeled surface interactions. An RL policy trained entirely in simulation is transferred directly to a physical dual-gripper micromanipulation system operating at 40 Hz, without retraining or domain adaptation. Using silk microfibers as a testbed, the policy achieves a mean point-wise shape error of 270 $\pm$ 80 $\mu$m across twenty-four diverse initial configurations. Across nine specimens covering all combinations of three fiber diameters (50, 80, and 120 $\mu$m) and three manipulated lengths (10 mm, 15 mm, and 20 mm), the same policy achieves sub-millimeter final shape error without any retraining or retuning. These results show that a policy learned in a simplified simulator can achieve repeatable real-world microfiber shape regulation under surface contact, provided that the task-relevant effects of the sim-to-real mismatch remain observable and correctable within the closed feedback loop.

On the Sample Complexity of Discounted Reinforcement Learning with Optimized Certainty Equivalents

Authors: Oliver Mortensen, Mohammad Sadegh Talebi
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2605.21763
Pdf link: https://arxiv.org/pdf/2605.21763
Abstract We study risk-sensitive reinforcement learning in finite discounted MDPs, where a generative model of the MDP is assumed to be available. We consider a family of risk measures called the optimized certainty equivalent (OCE), which includes important risk measures such as entropic risk, CVaR, and mean-variance. Our focus is on the sample complexities of learning the optimal state-action value function (value learning) and an optimal policy (policy learning) under recursive OCE. We provide an exact characterization of utility functions $u$ for which the corresponding OCE defines an objective that is PAC-learnable. We analyze a simple model-based approach and derive PAC sample complexity bounds. We establish that whenever $u$ does not have full domain, i.e., $\text{dom}(u)\neq \mathbb{R}$, the corresponding problem is not PAC-learnable. Finally, we establish corresponding lower bounds for both value and policy learning, demonstrating tightness in the size $SA$ of the state-action space, and for a more restricted class of utilities, we derive lower bounds that make the dependence on the effective horizon $\frac{1}{1-\gamma}$ explicit. Specifically, for $\text{CVaR}_\tau$ we show that the correct dependence on $\tau$ is $\frac{1}{\tau^2}$, thus improving by a factor of $\frac{1}{\tau}$ over the state of the art, although our bound has a suboptimal dependence on $\frac{1}{1-\gamma}$.
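For orientation, the OCE of a reward variable $X$ under a concave utility $u$ is commonly written as $\text{OCE}_u(X) = \sup_{\lambda \in \mathbb{R}} \{\lambda + \mathbb{E}[u(X - \lambda)]\}$, and the choice $u(t) = \min(t, 0)/\tau$ recovers $\text{CVaR}_\tau$. Whether the paper uses exactly this sign and reward convention is an assumption here. The sketch below checks the identity numerically on a small empirical sample; the helper names and the sample values are illustrative only.

```python
def oce_cvar(samples, tau):
    """Empirical OCE with utility u(t) = min(t, 0) / tau, maximised over lambda."""
    n = len(samples)

    def objective(lam):
        return lam + sum(min(x - lam, 0.0) for x in samples) / (tau * n)

    # The objective is concave and piecewise linear with kinks at the sample
    # values, so it is enough to search over the samples themselves.
    return max(objective(lam) for lam in samples)


def lower_tail_mean(samples, tau):
    """Mean of the worst tau-fraction of rewards (assumes tau * n is an integer)."""
    k = round(tau * len(samples))
    return sum(sorted(samples)[:k]) / k


rewards = [float(v) for v in range(1, 11)]  # hypothetical rewards 1.0, ..., 10.0
cvar_from_oce = oce_cvar(rewards, tau=0.2)
cvar_direct = lower_tail_mean(rewards, tau=0.2)
```

On this sample both computations agree, since the worst 20% of outcomes are the two smallest rewards.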

Memory-R2: Fair Credit Assignment for Long-Horizon Memory-Augmented LLM Agents

Authors: Sikuan Yan, Ahmed Bahloul, Ercong Nie, Susanna Schwarzmann, Riccardo Trivisonno, Volker Tresp, Yunpu Ma
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2605.21768
Pdf link: https://arxiv.org/pdf/2605.21768
Abstract Memory-augmented LLM agents enable interactions that extend beyond finite context windows by storing, updating, and reusing information across sessions. However, training such agents with reinforcement learning in multi-session environments is challenging because memory turns the agent's past actions into part of its future environment. Once different rollouts write, update, or delete different memories, they no longer share the same intermediate memory state, making trajectory-level comparisons fundamentally unfair. This violates a key assumption behind group-relative methods such as GRPO, where rollouts are compared as if they were sampled from the same effective environment. Consequently, trajectory-level rewards provide noisy or biased credit signals for long-horizon memory operations. To address this challenge, we introduce Memory-R2, a training framework for long-horizon memory-augmented LLM agents. Its core algorithm, LoGo-GRPO, combines local and global group-relative optimization. The global objective preserves end-to-end learning from long-horizon trajectory-level rewards, while local rerollouts compare different memory-operation outcomes from the same intermediate memory state, yielding fairer group comparisons and more precise supervision for memory construction. Beyond credit assignment, Memory-R2 jointly optimizes memory formation and memory evolution with a shared-parameter co-learning design, where a fact extractor and a memory manager are instantiated from the same LLM backbone through role-specific prompts. To stabilize multi-step RL over long memory horizons, we adopt a progressive curriculum that increases the training horizon from 8 to 16 to 32 sessions. Together, these components provide an effective training paradigm for memory-augmented LLM agents in long-horizon multi-session settings.
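The group-relative idea behind GRPO-style methods can be sketched as follows: rewards within a group are centred and scaled by the group's own statistics. Under the abstract's description, a global group would hold full trajectories, while a local group would hold rerollouts that all start from the same intermediate memory state. The reward values, the use of a population standard deviation, and the function name are assumptions for illustration; how LoGo-GRPO actually combines the two signals is not specified in the abstract and is not modelled here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Centre and scale rewards by the statistics of their own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical trajectory-level rewards for four full multi-session rollouts.
global_rewards = [0.2, 0.8, 0.5, 0.5]
global_adv = group_relative_advantages(global_rewards)

# Hypothetical rewards for three rerollouts of one memory operation,
# all branching from the same intermediate memory state.
local_rewards = [1.0, 0.0, 1.0]
local_adv = group_relative_advantages(local_rewards)
```

The local group is the fairer comparison in the abstract's sense because every member faces the same memory state, so differences in reward reflect the memory operation rather than diverging histories.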

Implicit Safety Alignment from Crowd Preferences

Authors: Qian Lin, Daniel S. Brown
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.21822
Pdf link: https://arxiv.org/pdf/2605.21822
Abstract Reinforcement Learning from Human Feedback (RLHF) can reveal implicit objectives such as safety considerations that go beyond task completion. In this work, we focus on the common safety criteria embedded in crowd preference datasets, where different users may express distinct preferences or objectives, yet follow similar safety principles. Our aim is to discover shared safety criteria from crowd preferences and then transfer them to downstream RL tasks to regularize agent behavior and enforce safety. We first show that direct reward combination (optimizing a preference-learned reward model together with downstream task rewards) has inherent limitations. Motivated by this, we propose Safe Crowd Preference-based RL, a hierarchical framework that extracts safety-aligned skills from crowd preferences and composes them via a high-level policy to safely solve downstream tasks. Experiments across safe RL environments and a preliminary LLM-style task with diverse user goals and shared safety constraints demonstrate that our approach substantially lowers safety costs without access to explicit safety rewards, while achieving task performance comparable to oracle methods trained with ground-truth safety signals.

OPPO: Bayesian Value Recursion for Token-Level Credit Assignment in LLM Reasoning

Authors: Yu Li, Rui Miao, Tian Lan, Zhengling Qi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.21851
Pdf link: https://arxiv.org/pdf/2605.21851
Abstract Reinforcement learning with verifiable rewards has become the standard recipe for improving LLM reasoning, but the dominant algorithm GRPO assigns a single trajectory-level advantage to every token, diluting the signal at pivotal reasoning steps and injecting noise at uninformative ones. Critic-free alternatives derived from on-policy distillation supply per-token signals through oracle-conditioned likelihood ratios, yet apply each signal in isolation from the trajectory-level evidence accumulated up to that position. We propose Oracle-Prompted Policy Optimization (OPPO), which rests on a single observation: the oracle signal used by prior distillation-style methods for local discrimination is also the natural Bayesian update of the model's belief about eventual success. Accumulating the signal along a trajectory yields, in closed form and at the cost of one extra forward pass, a running estimate of the success probability at every position, together with a token-level advantage that requires no learned value network and no additional rollouts. A first-order analysis factorizes the advantage into the per-token discrimination signal used by distillation methods modulated by a state weight that concentrates credit on genuinely pivotal tokens, with a directional variance-reduction guarantee. The framework admits two estimators differing only in which model scores the evidence: a \textit{self-oracle} that reuses the student and recovers the on-policy distillation reward as a strict special case, and a \textit{teacher-oracle} that delegates scoring to a stronger frozen model. On two base LLMs across seven mathematics, science, and code reasoning benchmarks, OPPO improves over GRPO, DAPO, and SDPO by up to $+6.0$ points on AMC'23 and $+5.2$ points on AIME'24, with gains that widen monotonically with response length.
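One way to read the Bayesian update mentioned in the abstract is through Bayes' rule applied token by token: $P(S \mid y_{1:t}) = P(S) \prod_{s \le t} \frac{P(y_s \mid S, y_{<s})}{P(y_s \mid y_{<s})}$, where $S$ is eventual success. If the oracle-conditioned model is taken as a stand-in for conditioning on $S$, the running success estimate is the prior times the accumulated likelihood ratios. The sketch below follows that reading; treating the token-level advantage as the change in the running estimate, clipping the estimate at 1, and all numeric values are assumptions for illustration rather than the paper's definitions.

```python
import math


def running_success_prob(prior, logp_oracle, logp_plain):
    """Accumulate per-token log-likelihood ratios into a running success estimate."""
    probs = []
    log_p = math.log(prior)
    for lo, lp in zip(logp_oracle, logp_plain):
        log_p += lo - lp  # log of the oracle-conditioned likelihood ratio
        probs.append(min(1.0, math.exp(log_p)))  # clip: approximate ratios can overshoot
    return probs


def token_advantages(prior, probs):
    """Assumed token-level credit: change in the running estimate at each position."""
    previous = [prior] + probs[:-1]
    return [p - q for p, q in zip(probs, previous)]


prior = 0.25  # hypothetical prior success probability for the prompt
# Hypothetical per-token log-probabilities with and without the oracle in context.
logp_oracle = [math.log(0.5), math.log(0.4), math.log(0.9)]
logp_plain = [math.log(0.5), math.log(0.2), math.log(0.9)]

probs = running_success_prob(prior, logp_oracle, logp_plain)
adv = token_advantages(prior, probs)
```

In this toy trajectory only the second token changes the estimate, so it receives all the credit, while the two uninformative tokens receive none.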

CCLab: Adversarial Testing of Learning- and Non-Learning-Based Congestion Controllers

Authors: Zhi Chen, Shehab Sarar Ahmed, Chenkai Wang, Brighten Godfrey, Gang Wang
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.21915
Pdf link: https://arxiv.org/pdf/2605.21915
Abstract Congestion controllers (CCs) are critical to network performance, and yet their robustness under adverse conditions remains insufficiently understood. While recent learning-based CCs have demonstrated strong performance in controlled environments, it is unclear how they compare to traditional CCs when controllers' input signals are corrupted or when environmental conditions become systematically challenging. In this paper, we introduce CCLab, an adversarial testing framework for systematically evaluating the robustness of both learning-based and non-learning-based CCs. CCLab includes a reinforcement learning (RL)-based adversarial agent that operates in a closed loop with the congestion control policy, generating bounded perturbations either on input signals (feature-level) or on external network conditions (environment-level), while preserving realism through explicit constraints. Using this framework, we compare learning-based CCs with non-learning-based CCs under both feature-level and environment-level adversarial conditions. While both types of CCs suffer from performance degradation under adversarial testing, we find that learning-based CCs, in general, are more robust than traditional human-designed algorithms. Finally, we show that our adversarial traces can be used to train more robust CCs that outperform existing learning-based CCs under both challenging and normal conditions.

EvoVid: Temporal-Centric Self-Evolution for Video Large Language Models

Authors: Shiqi Huang, Ziyue Wang, Zhongrong Zuo, Han Qiu, Qi She, Bihan Wen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.21931
Pdf link: https://arxiv.org/pdf/2605.21931
Abstract Recent Video Large Language Models (Video-LLMs) have demonstrated strong capabilities in video reasoning through reinforcement learning (RL). However, existing RL pipelines rely heavily on human-annotated tasks and solutions, making them costly to scale and fundamentally constrained by human expertise. Self-evolving frameworks have recently emerged as a promising alternative through autonomous Questioner-Solver self-play. Unfortunately, these approaches are primarily designed for static modalities such as text and images, fundamentally failing to capture the temporal dynamics that are central to video reasoning. In this work, we propose $\textbf{EvoVid}$, a temporal-centric self-evolving framework that enables Video-LLMs to improve directly from raw, unannotated videos. Specifically, we introduce two complementary temporal-centric rewards: a temporal-aware Questioner reward that encourages temporally dependent question generation through temporal perturbation sensitivity, and a temporal-grounded Solver reward that provides automatic temporal supervision via inherent video segment localization. Extensive experiments across four base models and six benchmarks demonstrate consistent improvements over both base models and existing self-evolving baselines, achieving competitive performance with supervised methods. These results highlight temporal-centric self-evolution as an effective and scalable paradigm for video understanding and reasoning.

Auction-Consensus Algorithm with Learned Bidding Scheme for Multi-Robot Systems

Authors: Jose Rodriguez, Constantine Tarawneh, Sven Koenig, Wenjie Dong, Qi Lu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.21932
Pdf link: https://arxiv.org/pdf/2605.21932
Abstract Multi-Robot Task Allocation (MRTA) is a central challenge in decentralized multi-agent systems, where teams of robots must cooperatively assign and execute tasks under limited communication while optimizing global performance objectives. Auction-consensus algorithms, such as the Consensus-Based Bundle Algorithm (CBBA), provide scalable decentralized coordination with provable convergence, but rely on hand-crafted greedy scoring functions that often lead to suboptimal task allocations. This paper proposes a learning-enhanced auction-consensus framework in which CBBA's deterministic bidding mechanism is replaced by a neural bidding policy trained using reinforcement learning. Under a centralized training and decentralized execution paradigm, agents learn to compute task bids from partial local observations while retaining the standard auction and consensus phases for decentralized coordination. The learned bidding policy is trained using Proximal Policy Optimization with rewards shaped by proximity to globally optimal solutions obtained via mixed-integer linear programming. Multiple neural architectures are evaluated, including a Neural Additive Model, the Long Short-Term Memory (LSTM) model, and the Set Transformer Model. Experimental results across varying swarm sizes demonstrate that learned bidding policies can improve solution quality over classical CBBA while preserving decentralized execution. The proposed approach highlights the effectiveness of integrating reinforcement learning with classical distributed coordination algorithms, offering a scalable pathway toward higher-quality decentralized multi-robot task allocation.

AI-Enabled Serious Games: Integrating Intelligence and Adaptivity in Training Systems

Authors: Priyamvada Tripathi, Bill Kapralos
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2605.21962
Pdf link: https://arxiv.org/pdf/2605.21962
Abstract Serious games are widely used for learning and training across domains such as healthcare, defense, and education. Persistent challenges remain, however, including static scenario design, authoring bottlenecks, limited learner modeling, and difficulty implementing meaningful real-time instructional adaptation. Recent advances in artificial intelligence (AI) introduce novel capabilities such as dynamic scenario variation, contextual feedback, adaptive pacing, and learner-state modeling that may help address some of these limitations. At the same time, integrating AI into serious games raises important questions related to validity, transparency, system control, and learner trust. This chapter examines how contemporary AI approaches may support real-time instructional adaptation in serious games. It distinguishes between instructional intelligence, defined as a system's capacity to infer learner knowledge and reason about pedagogically appropriate responses, and adaptivity, defined as the ability to modify instructional actions during interaction. A historical synthesis of adaptive learning systems is presented, tracing developments from early computer-assisted instruction through intelligent tutoring systems (ITS), dynamic difficulty adjustment (DDA), authoring platforms, learning analytics, and recent AI-enabled architectures. Building on this perspective, the chapter discusses how large language models (LLMs), reinforcement learning (RL), and agent-based architectures may contribute to more integrated forms of intelligence and adaptivity in serious games. It also highlights practical and research challenges associated with AI-enabled systems, including explainability, validation, computational cost, and the limited empirical evidence regarding long-term learning outcomes in AI-enabled serious games.

Reinforced Preference Optimization for Reasoning-Augmented Recommendations

Authors: Jingtong Gao, Zeyu Song, Chi Lu, Xiaopeng Li, Derong Xu, Maolin Wang, Peng Jiang, Kun Gai, Qingpeng Cai, Xiangyu Zhao
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2605.21967
Pdf link: https://arxiv.org/pdf/2605.21967
Abstract Recommender systems are critical for delivering personalized content across digital platforms, and recent advances in Large Language Models (LLMs) offer new opportunities to enhance them with richer world knowledge and explicit reasoning capabilities. With the help of reasoning knowledge, recommendations can better infer users' underlying intents, adapt to evolving preferences, and leverage semantic relationships for improved accuracy and interpretability. However, existing reasoning-based recommendation methods often fail to fully align the LLM's reasoning process with recommendation-specific objectives due to structural disruption during integration and difficulties in translating free-form generation into accurate item predictions. In this paper, we introduce RPORec, a reinforced preference optimization framework that unifies an LLM backbone's reasoning ability with a dedicated recommendation head (Rechead) for precise item retrieval. RPORec comprises two stages: (1) Reasoning-Augmented Recommendation Modeling, where high-quality Chain-of-Thought (CoT) reasoning is generated and used as auxiliary knowledge to guide the Rechead in learning recommendation-specific representations; and (2) Advanced Reasoning Refinement and Alignment, in which the trained Rechead produces verifiable rewards to fine-tune the LLM backbone via reinforcement learning, enhancing reasoning quality, structural consistency, and task relevance. Extensive experiments on public benchmarks and large-scale online deployments show that RPORec consistently outperforms state-of-the-art LLM-based recommendation methods, demonstrating the effectiveness of reasoning-augmented recommendation modeling in real-world systems.

Reasoning through Verifiable Forecast Actions: Consistency-Grounded RL for Financial LLMs

Authors: Jialin Chen, Aosong Feng, Harshit Verma, Siyi Gu, Haiwen Wang, Ali Maatouk, Yixuan He, Yifeng Gao, Leandros Tassiulas, Rex Ying
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.21975
Pdf link: https://arxiv.org/pdf/2605.21975
Abstract Financial markets are characterized by extreme non-stationarity, low signal-to-noise ratios, and strong dependence on external information such as news, company fundamentals, and macroeconomic signals. Yet, existing approaches either abstract time-series into text or decouple forecasting from language-based reasoning, leading to a fundamental mismatch between qualitative reasoning and quantitative outcomes. To address this, we introduce StockR1, a time-series-enhanced LLM that unifies stock forecasting and financial reasoning through a verifiable forecast action. Based on a tool-call design, the model first emits a forecast action, which is a structured and interpretable representation of its qualitative market outlook. It then invokes a time-series decoder conditioned on this action to generate distributional future trajectories, leading to more informed question answering and financial reasoning. We optimize the full pipeline with reinforcement learning, where rewards jointly reflect answer validity, forecast accuracy, and consistency between generated actions and observed time-series dynamics. In addition, rewards are reweighted by a sample-level uncertainty scalar, encouraging the model to accommodate varying uncertainty in market dynamics. We evaluate StockR1 on financial question answering and stock forecasting over a large-scale 10-year benchmark. Our method consistently outperforms time-series baselines and general-purpose LLMs, improving reasoning accuracy by 17.7% (4B) and 25.9% (8B). These findings demonstrate that structuring the forecast actions establishes a powerful synergy between language reasoning and temporal prediction, enabling LLMs to reason through verifiable, interpretable, and numerically grounded decisions.

Faithful-MR1: Faithful Multimodal Reasoning via Anchoring and Reinforcing Visual Attention

Authors: Changyuan Tian, Zhicong Lu, Huaxing Liu, Xiang Wang, Shuai Li, Yu Chen, Wenqian Lv, Zichuan Lin, Juncheng Diao, Deheng Ye
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.22072
Pdf link: https://arxiv.org/pdf/2605.22072
Abstract Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for advancing complex reasoning in large language models, and recent work extends RLVR to multimodal large language models (MLLMs). This transfer, however, surfaces a faithfulness challenge: faithful perception of task-relevant visual evidence and faithful use of that evidence during reasoning, leading to unsatisfactory gains on multimodal benchmarks. Specifically, existing perception supervision often operates on textual descriptions rather than natively on image regions, and faithful use is largely overlooked, exposing the perception-reasoning disconnect where correctly perceived evidence is dropped or contradicted during reasoning. To close these gaps, we propose Faithful-MR1, a training framework that anchors and reinforces visual attention to address both halves of faithful multimodal reasoning. The Anchoring stage turns perception into an explicit pre-reasoning subtask, supervising a dedicated token's attention directly against image regions rather than through textual descriptions. The Reinforcing stage exposes faithful use through counterfactual image intervention, rewarding answer-correct trajectories that concentrate visual attention where vision causally matters. Extensive experiments demonstrate that Faithful-MR1 outperforms recent multimodal reasoning baselines on both Qwen2.5-VL-Instruct 3B and 7B backbones while using substantially less training data.
中文摘要 带有可验证奖励的强化学习（RLVR）已成为推动大型语言模型复杂推理的有前景范式，近期工作将RLVR扩展到多模态大型语言模型（MLLM）。然而，这种迁移暴露出忠实性挑战：既要忠实地感知与任务相关的视觉证据，又要在推理过程中忠实地使用该证据；这一挑战导致多模态基准上的提升不尽如人意。具体来说，现有的感知监督往往作用于文本描述而非直接作用于图像区域，而忠实使用在很大程度上被忽视，从而暴露出感知与推理的脱节，即正确感知到的证据在推理过程中被丢弃或被否定。为弥合这些差距，我们提出了Faithful-MR1，一种通过锚定并强化视觉注意力来同时解决忠实多模态推理两个方面的训练框架。锚定阶段将感知转化为显式的推理前子任务，直接以图像区域监督一个专用token的注意力，而非通过文本描述。强化阶段通过反事实图像干预来揭示忠实使用，奖励那些答案正确、且将视觉注意力集中在视觉信息具有因果作用之处的轨迹。大量实验表明，Faithful-MR1在Qwen2.5-VL-Instruct 3B和7B骨干上均优于近期的多模态推理基线，同时使用的训练数据显著更少。

From Reasoning Chains to Verifiable Subproblems: Curriculum Reinforcement Learning Enables Credit Assignment for LLM Reasoning

从推理链到可验证子问题：课程强化学习实现LLM推理的信用分配

Authors: Xitai Jiang, Zihan Tang, Wenze Lin, Yang Yue, Shenzhi Wang, Gao Huang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.22074
Pdf link: https://arxiv.org/pdf/2605.22074
Abstract Reinforcement learning from verifiable rewards (RLVR) has shown strong promise for LLM reasoning, but outcome-based RLVR remains inefficient on hard problems because correct final-answer rollouts are rare and sample-level credit assignment cannot use partial progress in failed attempts. We introduce SCRL (Subproblem Curriculum Reinforcement Learning), a curriculum RL framework that derives verifiable subproblems from reference reasoning chains and fixes the final subproblem as the original problem. This turns partial progress on hard problems into verifiable learning signals. Algorithmically, SCRL uses subproblem-level normalization, which normalizes rewards independently at each subproblem position and assigns the resulting advantages to the corresponding answer spans, enabling finer-grained credit assignment without external rubrics or reward models. Our analysis shows that subproblem curricula lift hard problems out of gradient dead zones, with larger relative gains as the original problem becomes harder. Across seven mathematical reasoning benchmarks, SCRL outperforms strong curriculum-learning baselines, improving average accuracy over GRPO by +4.1 points on Qwen3-4B-Base and +1.9 points on Qwen3-14B-Base. On AIME24, AIME25, and IMO-Bench, SCRL further improves pass@1 by +3.7 points and pass@64 by +4.6 points on Qwen3-4B-Base, indicating better exploration on hard reasoning problems.
中文摘要 可验证奖励强化学习（RLVR）在LLM推理方面展现出强劲前景，但基于结果的RLVR在困难问题上仍然低效，因为得到正确最终答案的rollout很少，且样本级信用分配无法利用失败尝试中的部分进展。我们引入了SCRL（子问题课程强化学习），这是一个课程RL框架，从参考推理链中推导出可验证的子问题，并将最后一个子问题固定为原始问题。这使难题上的部分进展转化为可验证的学习信号。在算法上，SCRL采用子问题级归一化，在每个子问题位置独立归一化奖励，并将所得优势分配给相应的答案片段，从而实现更细粒度的信用分配，无需外部评分标准或奖励模型。我们的分析显示，子问题课程将难题从梯度死区中解救出来，且原问题越难，相对收益越大。在七个数学推理基准上，SCRL优于强课程学习基线，相比GRPO，平均准确率在Qwen3-4B-Base上提升4.1分，在Qwen3-14B-Base上提升1.9分。在AIME24、AIME25和IMO-Bench上，SCRL在Qwen3-4B-Base上进一步将pass@1提升3.7分、pass@64提升4.6分，表明其在困难推理问题上具有更好的探索能力。
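
The subproblem-level normalization described in this abstract can be illustrated with a small sketch. It assumes a GRPO-style mean/standard-deviation normalization applied column by column; the abstract does not give the exact formula, and the function names, the reward matrix, and the token spans below are all hypothetical.

```python
import numpy as np

def subproblem_level_advantages(rewards, eps=1e-6):
    """Normalize verifier rewards independently at each subproblem position.

    rewards: (num_rollouts, num_subproblems) array of 0/1 outcomes.
    The mean/std form is an assumption borrowed from GRPO-style normalization.
    """
    rewards = np.asarray(rewards, dtype=float)
    mean = rewards.mean(axis=0, keepdims=True)
    std = rewards.std(axis=0, keepdims=True)
    return (rewards - mean) / (std + eps)

def spread_to_tokens(advantages_row, answer_spans, num_tokens):
    """Assign each subproblem's advantage to the tokens of its answer span."""
    token_adv = np.zeros(num_tokens)
    for adv, (start, end) in zip(advantages_row, answer_spans):
        token_adv[start:end] = adv
    return token_adv

# Four rollouts, three subproblems; the last column is the original problem,
# which no rollout solves, so it carries no signal on its own.
rewards = [[1, 1, 0],
           [1, 0, 0],
           [1, 1, 0],
           [0, 0, 0]]
adv = subproblem_level_advantages(rewards)

# Hypothetical answer spans (token index ranges) for the first rollout.
token_adv = spread_to_tokens(adv[0], [(0, 3), (3, 6), (6, 9)], num_tokens=9)
```

In this toy batch the final-answer column is all zeros, so outcome-only normalization would yield no gradient, while the two earlier subproblem columns still produce nonzero advantages.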

OPERA: An Agent for Image Restoration with End-to-End Joint Planning-Execution Optimization

OPERA：具备端到端联合规划-执行优化的图像修复代理

Authors: Feng Zhu, Shuyang Xie, Yihan Zeng, Ming Liu, Wangmeng Zuo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.22104
Pdf link: https://arxiv.org/pdf/2605.22104
Abstract Real-world image restoration is challenging due to complex and interacting mixed degradations. Recent agent-based approaches address this problem by composing multiple task-specific restoration tools. However, empirical analysis reveals that their performance is fundamentally limited by implicitly constrained planning spaces and the lack of coordination among independently pretrained tools. To address these issues, we propose OPERA (Optimized Planning-Execution Restoration Agent), a framework that jointly optimizes restoration planning and tool execution in an end-to-end manner. On the planning side, OPERA uses reinforcement learning to directly optimize tool composition over a combinatorial plan space, with the final restoration quality as the reward. On the execution side, OPERA introduces agent-guided co-training of restoration tools, enabling them to learn cooperative behaviors under sequential composition. Extensive experiments on multi-degradation benchmarks and real-world datasets demonstrate that OPERA consistently outperforms both all-in-one restoration models and existing agent-based methods across diverse and complex degradation scenarios.
中文摘要 由于存在复杂且相互作用的混合退化，真实世界的图像修复颇具挑战性。近期基于智能体的方法通过组合多种针对特定任务的修复工具来解决这一问题。然而，实证分析显示，它们的性能从根本上受限于隐式受限的规划空间，以及各自独立预训练的工具之间缺乏协调。为解决这些问题，我们提出了OPERA（Optimized Planning-Execution Restoration Agent），一个以端到端方式联合优化修复规划与工具执行的框架。在规划侧，OPERA利用强化学习在组合式的规划空间上直接优化工具组合，并以最终的修复质量作为奖励。在执行侧，OPERA引入了由智能体引导的修复工具协同训练，使它们能够学习在顺序组合下的协作行为。在多重退化基准和真实世界数据集上的大量实验表明，OPERA在多样且复杂的退化场景下，始终优于一体化修复模型和现有基于智能体的方法。

Beyond Pixels: Learning Invariant Rewards for Real-World Robotics From a Few Demonstrations

超越像素：从少量演示中学习面向真实世界机器人的不变奖励

Authors: Tengye Xu, Yangting Sun, Ziju Shen, Guanqi Chen, Zhen Fu, Chen yizhou, Hua Chen, Jia Pan
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.22123
Pdf link: https://arxiv.org/pdf/2605.22123
Abstract Designing reward functions that generalize beyond controlled laboratory settings remains a fundamental challenge in reinforcement learning for robotics. In open-world manipulation problems, a single task can appear in numerous variants through different object instances, positions, and camera viewpoints. Recent vision-based reward models tend to memorize specific pixel distributions and fail to generalize beyond their training conditions. To address this, we propose a framework that learns invariant symbolic reward functions from as few as five demonstrations. The insight is to shift from visual feature-fitting to the discovery of behavioral invariants: task-level properties that remain constant across diverse visual instantiations. The framework has two coupled components: a structural reward formulation that encodes task-level strategies and physical constraints while preserving optimal policy invariance, and a hybrid symbolic-numerical procedure that distills these invariants from demonstrations without online interaction. Experiments on eight Meta-World tasks and three Franka manipulation tasks demonstrate that our method achieves stronger process alignment and policy rollout ranking abilities compared to baselines, accelerating downstream policy learning. Three real-world out-of-distribution experiments further show that the same learned reward generalizes zero-shot to position, viewpoint, and object variations, enabling a single reward representation to be reused across diverse task variants in practice.
中文摘要 设计能够泛化到受控实验室环境之外的奖励函数，仍然是机器人强化学习中的根本挑战。在开放世界操作问题中，单个任务可以因不同的物体实例、位置和摄像机视角而以多种变体出现。近期基于视觉的奖励模型往往记忆特定的像素分布，无法泛化到训练条件之外。为此，我们提出了一个框架，仅需五次演示即可学习不变的符号奖励函数。其核心洞见是从视觉特征拟合转向发现行为不变量：即在不同视觉实例中保持恒定的任务级属性。该框架包含两个耦合的组成部分：一种结构化奖励表述，编码任务级策略和物理约束，同时保持最优策略不变性；以及一种符号-数值混合过程，无需在线交互即可从演示中提炼出这些不变量。在八个Meta-World任务和三个Franka操作任务上的实验表明，与基线相比，我们的方法实现了更强的过程对齐和策略rollout排序能力，并加速了下游策略学习。三项真实世界的分布外实验进一步表明，同一个学习到的奖励可以零样本泛化到位置、视角和物体的变化，使单一奖励表示在实践中能够在多种任务变体间复用。

Efficient Agentic Reasoning Through Self-Regulated Simulative Planning

通过自我调节模拟规划实现高效的代理推理

Authors: Mingkai Deng, Jinyu Hou, Lara Sá Neves, Varad Pimpalkhute, Taylor W. Killian, Zhengzhong Liu, Eric P. Xing
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.22138
Pdf link: https://arxiv.org/pdf/2605.22138
Abstract How should an agent decide when and how to plan? A dominant approach builds agents as reactive policies with adaptive computation (e.g., chain-of-thought), trained end-to-end expecting planning to emerge implicitly. Without control over the presence, structure, or horizon of planning, these systems dramatically increase reasoning length, yielding inefficient token use without reliable accuracy gains. We argue efficient agentic reasoning benefits from decomposing decision-making into three systems: simulative reasoning (System II) grounding deliberation in future-state prediction via a world model; self-regulation (System III) deciding when and how deeply to plan via a learned configurator; and reactive execution (System I) handling fine-grained action. Simulative reasoning provides unified planning across diverse tasks without per-domain engineering, while self-regulation ensures the planner is invoked only when needed. To test this, we develop SR$^2$AM (Self-Regulated Simulative Reasoning Agentic LLM), realizing both as distinct stages within an LLM's chain-of-thought, with the LLM as world model. We explore two instantiations: recording decisions from a prompted multi-module system (v0.1) and reconstructing structured plans from traces of pretrained reasoning LLMs (v1.0), trained via supervised then reinforcement learning (RL). Across math, science, tabular analysis, and web information seeking, v0.1-8B and v1.0-30B achieve Pass@1 competitive with 120-355B and 685B-1T parameter systems respectively, while v1.0-30B uses 25.8-95.3% fewer reasoning tokens than comparable agentic LLMs. RL increases average planning horizon by 22.8% while planning frequency grows only 2.0%, showing it learns to plan further ahead rather than more often. More broadly, learned self-regulation instantiates a principle we expect to extend beyond planning to how agents govern their own learning and adaptation.
中文摘要 智能体应如何决定何时以及如何规划？一种主流方法将智能体构建为具有自适应计算（例如思维链）的反应式策略，通过端到端训练，期望规划能力隐式涌现。由于无法控制规划是否出现、其结构或视野，这些系统大幅增加了推理长度，导致token使用效率低下，却没有带来可靠的准确率提升。我们认为，高效的智能体推理受益于将决策分解为三个系统：模拟推理（System II），通过世界模型将思考建立在未来状态预测之上；自我调节（System III），通过学习得到的配置器决定何时规划以及规划多深；以及反应式执行（System I），处理细粒度动作。模拟推理在不同任务中提供统一的规划，无需逐领域的工程，而自我调节则确保规划器仅在必要时被调用。为验证这一点，我们开发了SR$^2$AM（自我调节模拟推理智能体LLM），将两者实现为LLM思维链中的不同阶段，并以LLM作为世界模型。我们探讨两种实例化：记录一个基于提示的多模块系统的决策（v0.1），以及从预训练推理LLM的轨迹中重建结构化计划（v1.0），并先后经监督学习和强化学习（RL）训练。在数学、科学、表格分析和网络信息检索任务上，v0.1-8B和v1.0-30B的Pass@1分别可与120-355B和685B-1T参数的系统相竞争，而v1.0-30B使用的推理token比同类智能体LLM少25.8%至95.3%。RL使平均规划视野增加22.8%，而规划频率仅增长2.0%，表明它学会的是规划得更远，而非更频繁地规划。更广泛地说，习得的自我调节体现了一个原则，我们预期它能超越规划，延伸到智能体如何管理自身的学习和适应。

Short-Term-to-Long-Term Memory Transfer for Knowledge Graphs under Partial Observability

部分可观测性知识图谱的短期到长期记忆转移

Authors: Taewoon Kim, Vincent François-Lavet, Michael Cochez
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.22142
Pdf link: https://arxiv.org/pdf/2605.22142
Abstract Reinforcement learning under partial observability requires deciding what information to retain, yet most memory-based approaches do not explicitly model short-term-to-long-term transfer of symbolic observations. We study this transfer process in a temporal knowledge-graph memory setting and cast it as a neuro-symbolic value-based decision problem: for each observed triple, the agent chooses whether to keep or drop it before long-term insertion. To handle variable-sized short-term buffers, we use a per-item Q-learning design with shared parameters and a practical temporal-difference update over matched items across consecutive steps. On the RoomKG benchmark at long-term memory capacity 128, learned transfer decisions outperform symbolic and neural baselines, including symbolic baselines with temporal annotations and history-based LSTM/Transformer baselines. Across transfer-policy ablations, a lightweight local short-term-only variant performs best, and step-level behavior shows that the policy keeps navigation- and query-relevant facts while discarding lower-value candidate facts, supporting explicit and interpretable memory decisions under memory constraints.
中文摘要 在部分可观测性下进行强化学习需要决定保留哪些信息，但大多数基于记忆的方法并未显式建模符号观测从短期到长期的转移。我们在时序知识图谱记忆环境中研究这一转移过程，并将其表述为一个神经符号的基于价值的决策问题：对于观察到的每一个三元组，智能体在将其插入长期记忆之前选择保留还是丢弃。为了处理大小可变的短期缓冲区，我们采用了参数共享的逐条目Q学习设计，并对连续步骤之间相匹配的条目进行实用的时序差分更新。在长期记忆容量为128的RoomKG基准上，学习到的转移决策优于符号和神经基线，包括带时间标注的符号基线和基于历史的LSTM/Transformer基线。在转移策略消融中，一种轻量级的、仅使用局部短期信息的变体表现最佳；步级行为表明，该策略保留了与导航和查询相关的事实，同时丢弃价值较低的候选事实，支持在记忆受限条件下做出显式且可解释的记忆决策。
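
The per-item keep/drop design with shared parameters can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a linear Q-function, a shared scalar reward, and items matched across steps by row index. The class name `PerItemQ`, the feature encoding, and the matching rule are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
KEEP, DROP = 1, 0

class PerItemQ:
    """Linear Q-function whose weights are shared by every buffer item,
    so the short-term buffer may hold any number of triples."""

    def __init__(self, feature_dim, lr=0.1, gamma=0.9):
        self.w = np.zeros((2, feature_dim))  # one weight row per action
        self.lr, self.gamma = lr, gamma

    def q_values(self, feats):
        # feats: (num_items, feature_dim) -> (num_items, 2)
        return np.asarray(feats, dtype=float) @ self.w.T

    def act(self, feats, epsilon=0.1):
        # Independent epsilon-greedy keep/drop decision for each item.
        greedy = self.q_values(feats).argmax(axis=1)
        explore = rng.random(len(feats)) < epsilon
        random_actions = rng.integers(0, 2, size=len(feats))
        return np.where(explore, random_actions, greedy)

    def td_update(self, feats, actions, reward, next_feats):
        # One TD(0) step per matched item; matching by row index is an
        # assumption made only for this sketch.
        for i in range(min(len(feats), len(next_feats))):
            target = reward + self.gamma * self.q_values(next_feats[i:i + 1]).max()
            q = self.w[actions[i]] @ feats[i]
            self.w[actions[i]] += self.lr * (target - q) * feats[i]
```

Because the weights are shared across items, the same network scores a buffer of two triples or twenty, which is what lets the design cope with variable-sized short-term buffers.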

One-Way Policy Optimization for Self-Evolving LLMs

自我演进大型语言模型的单向策略优化

Authors: Shuo Yang, Jinda Lu, Kexin Huang, Chiyu Ma, Shaohang Wei, Yuyang Liu, Guoyin Wang, Jingren Zhou, Li Yuan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.22156
Pdf link: https://arxiv.org/pdf/2605.22156
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has become a promising paradigm for scaling reasoning capabilities of Large Language Models (LLMs). However, the sparsity of binary verifier rewards often leads to low efficiency and optimization instability. To stabilize training, existing methods typically impose token-level constraints relative to a reference policy. We identify that such constraints penalize deviations indiscriminately; this can flip verifier-determined direction when the policy attempts to outperform the reference, thereby suppressing gains. To resolve this, we propose One-Way Policy Optimization (OWPO), a method based on the principle of decoupling optimization direction from update magnitude. In OWPO, the verifier dictates the update direction, while the reference policy serves only to adjust the magnitude. Specifically, OWPO applies asymmetric reweighting: it performs Accelerated Alignment for inferior deviations (where the policy lags behind the reference) and Gain Locking for superior deviations (where the policy surpasses the reference). Furthermore, by incorporating iterative reference updates, OWPO creates a "Ratchet Effect" that continuously consolidates gains. Experimental results demonstrate that OWPO outperforms strong baselines, including DAPO, OPD, and MOPD, breaking the bottleneck of fixed priors to enable continuous self-evolution without reliance on external reference models.
中文摘要 带可验证奖励的强化学习（RLVR）已成为扩展大型语言模型（LLM）推理能力的有前景范式。然而，二元验证器奖励的稀疏性常常导致效率低下和优化不稳定。为了稳定训练，现有方法通常相对于参考策略施加token级约束。我们发现此类约束会不加区分地惩罚偏离；当策略试图超越参考策略时，这可能会反转由验证器决定的方向，从而抑制收益。为解决此问题，我们提出了单向策略优化（OWPO），该方法基于将优化方向与更新幅度解耦的原则。在OWPO中，验证器决定更新方向，而参考策略仅用于调整幅度。具体来说，OWPO采用非对称重加权：对劣势偏离（策略落后于参考策略）执行加速对齐（Accelerated Alignment），对优势偏离（策略超过参考策略）执行增益锁定（Gain Locking）。此外，通过引入迭代的参考策略更新，OWPO产生了“棘轮效应”，持续巩固收益。实验结果表明，OWPO优于包括DAPO、OPD和MOPD在内的强基线，打破了固定先验的瓶颈，实现了无需依赖外部参考模型的持续自我演化。

SWE-Mutation: Can LLMs Generate Reliable Test Suites in Software Engineering?

SWE-Mutation：大型语言模型能否在软件工程中生成可靠的测试套件？

Authors: Yuxuan Sun, Yuze Zhao, Yufeng Wang, Yao Du, Zhiyuan Ma, Jinbo Wang, Mengdi Zhang, Kai Zhang, Zhenya Huang
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.22175
Pdf link: https://arxiv.org/pdf/2605.22175
Abstract Evaluating software engineering capabilities has become a core component of modern large language models (LLMs); however, the key bottleneck hindering further scaling lies not in the scarcity of high-quality solutions, but in the lack of high-quality test suites. Test suites are indispensable both for synthesizing program repair trajectories and for providing precise feedback signals in reinforcement learning. Unfortunately, due to the high cost and difficulty of annotation, high-quality test suites have long been hard to obtain, while those automatically generated by LLMs tend to be superficial and lack sufficient discriminative power. As a first step toward constructing high-quality test suites, we introduce SWE-Mutation, a benchmark for evaluating LLM-generated test suites. The benchmark characterizes test suites by introducing systematically mutated solutions that attempt to "fool" the test suites and pass validation. We further propose an agentic, language-agnostic framework for automatically generating complex mutants. Our benchmark consists of 2,636 mutated variants derived from 800 original instances and includes a multilingual subset spanning nine programming languages. Experiments on seven LLMs reveal that even DeepSeek-V3.1 achieves only 10.20% verification and 36.15% detection rates, highlighting the inadequacy of current LLMs. Additionally, our agentic mutation strategy enhances realism, reducing average detection rates from 71.04% to 39.81% compared to conventional methods. These findings expose persistent deficiencies in the ability of current LLMs to generate reliable and discriminative test suites.
中文摘要 评估软件工程能力已成为现代大型语言模型（LLM）的核心组成部分；然而，阻碍进一步扩展的关键瓶颈不在于高质量解决方案的稀缺，而在于缺乏高质量的测试套件。无论是合成程序修复轨迹，还是在强化学习中提供精确的反馈信号，测试套件都不可或缺。遗憾的是，由于标注成本高、难度大，高质量测试套件长期难以获得，而由LLM自动生成的测试套件往往流于表面，缺乏足够的判别能力。作为构建高质量测试套件的第一步，我们引入了SWE-Mutation，一个用于评估LLM生成的测试套件的基准。该基准通过引入经过系统性变异、试图“欺骗”测试套件并通过验证的解决方案来刻画测试套件的质量。我们还提出了一种智能体式、与语言无关的框架，用于自动生成复杂的变异体。我们的基准包含源自800个原始实例的2,636个变异体，并包含一个涵盖九种编程语言的多语言子集。对七个LLM的实验显示，即使是DeepSeek-V3.1，验证率也仅为10.20%，检测率仅为36.15%，凸显了当前LLM的不足。此外，与传统方法相比，我们的智能体式变异策略提升了变异体的真实性，使平均检测率从71.04%降至39.81%。这些发现暴露了当前LLM在生成可靠且具判别力的测试套件方面的持续缺陷。
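
A toy example may help make the idea of a detection rate concrete. The sketch below is not taken from the benchmark: the `clamp` function, the three hand-written mutants, and both test suites are hypothetical. It takes the detection rate to be the fraction of mutants that a suite rejects, which is an assumption about how the metric is defined.

```python
def clamp(x, lo, hi):
    """Reference solution: restrict x to the interval [lo, hi]."""
    return max(lo, min(x, hi))

# Hand-written mutants, each containing one plausible bug.
def mutant_no_upper(x, lo, hi):
    return max(lo, x)                  # forgets the upper bound

def mutant_swapped(x, lo, hi):
    return min(lo, max(x, hi))         # min and max are swapped

def mutant_off_by_one(x, lo, hi):
    return max(lo, min(x, hi - 1))     # upper bound is off by one

mutants = [mutant_no_upper, mutant_swapped, mutant_off_by_one]

def weak_suite(f):
    """Superficial suite: only checks a value inside the interval."""
    return f(5, 0, 10) == 5

def strong_suite(f):
    """More discriminative suite: also probes both boundaries."""
    return f(5, 0, 10) == 5 and f(-3, 0, 10) == 0 and f(42, 0, 10) == 10

def detection_rate(suite, mutants):
    """Fraction of mutants the suite rejects."""
    return sum(1 for m in mutants if not suite(m)) / len(mutants)

weak_rate = detection_rate(weak_suite, mutants)
strong_rate = detection_rate(strong_suite, mutants)
```

Both suites accept the reference solution, yet the weak suite lets two of the three mutants through. That gap is the kind of discriminative power the benchmark is described as measuring.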

Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles

Maestro：用强化学习编排层级化的模型-技能集成

Authors: Jinyang Wu, Guocheng Zhai, Ruihan Jin, Yuhao Shen, Zhengxi Lu, Fan Zhang, Haoran Luo, Zheng Lian, Zhengqi Wen, Jianhua Tao
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.22177
Pdf link: https://arxiv.org/pdf/2605.22177
Abstract The proliferation of large language models (LLMs) and modular skills has endowed autonomous agents with increasingly powerful capabilities. Existing frameworks typically rely on monolithic LLMs and fixed logic to interface with these skills. This gives rise to a critical bottleneck: different LLMs offer distinct advantages across diverse domains, yet current frameworks fail to exploit the complementary strengths of models and skills, thereby limiting their performance on downstream tasks. In this paper, we present Maestro (Multimodal Agent for Expert-Skill Targeted Reinforced Orchestration), a Reinforcement Learning (RL)-driven orchestration framework that reframes heterogeneous multimodal tasks as a sequential decision-making process over a hierarchical model-skill registry. Rather than consolidating all knowledge into a single model, Maestro trains a lightweight policy to dynamically compose ensembles of frozen expert models and a two-tier skill library, deciding at each step whether to invoke an external expert, which model-skill pair to select, and when to terminate. The policy is optimized via outcome-based RL, requiring no step-level supervision. We evaluate Maestro across ten representative multimodal benchmarks spanning mathematical reasoning, chart understanding, high-resolution perception, and domain-specific analysis. With only a 4B orchestrator, Maestro achieves an average accuracy of 70.1%, surpassing both GPT-5 (69.3%) and Gemini-2.5-Pro (68.7%). Crucially, the learned coordination policy generalizes to unseen models and skills without retraining: augmenting the registry with out-of-domain experts yields a 59.5% average on four challenging benchmarks, outperforming all closed-source baselines. Maestro further maintains high computational efficiency with low latency. The source code is available at this https URL.
中文摘要 大型语言模型（LLM）和模块化技能的普及赋予了自主智能体越来越强大的能力。现有框架通常依赖单体LLM和固定逻辑来对接这些技能。这就造成了一个关键瓶颈：不同的LLM在不同领域各有优势，但现有框架未能利用模型和技能的互补优势，从而限制了它们在下游任务中的表现。本文提出了Maestro（Multimodal Agent for Expert-Skill Targeted Reinforced Orchestration），一个由强化学习（RL）驱动的编排框架，将异构多模态任务重新表述为在层级化模型-技能注册表上的序列决策过程。Maestro没有将所有知识整合到单一模型中，而是训练一个轻量级策略，动态组合由冻结的专家模型和两层技能库构成的集成，在每一步决定是否调用外部专家、选择哪个模型-技能对以及何时终止。该策略通过基于结果的RL进行优化，无需步骤级监督。我们在十个具有代表性的多模态基准上评估Maestro，涵盖数学推理、图表理解、高分辨率感知和领域特定分析。仅用一个4B的编排器，Maestro的平均准确率达到70.1%，超过GPT-5（69.3%）和Gemini-2.5-Pro（68.7%）。关键的是，学到的协调策略无需重新训练即可泛化到未见过的模型和技能：向注册表中加入域外专家后，在四个具有挑战性的基准上平均达到59.5%，超过所有闭源基线。Maestro还保持了较高的计算效率和较低的延迟。源代码可在该 https URL 访问。

Reinforced Graph of Thoughts: RL-Driven Adaptive Prompting for LLMs

强化思维图谱：强化学习驱动的大型语言模型自适应提示

Authors: Manuel Noah Riesen, Peter Alfred von Niederhäusern
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.22195
Pdf link: https://arxiv.org/pdf/2605.22195
Abstract Graph of Thoughts (GoT), a generalized form of recent prompting paradigms for large language models (LLMs), has been shown to be useful for elaborate problem solving. By executing a graph of operations, thoughts of the LLM are structured as an arbitrary graph, forming the actual graph of thoughts. Originally, the graph of operations is defined manually, which requires in-depth knowledge about the solution of the problem to solve. Such a static graph of operations is rigid and therefore lacks adaptability. We propose Reinforced Graph of Thoughts (RGoT), an automated approach to the GoT prompting paradigm that leverages reinforcement learning (RL) to adaptively generate a graph of operations from a human-defined set. Results indicate that, under certain constraints, it is possible to construct graphs of operations adaptively to the task's complexity in an automated way.
中文摘要 思维图（Graph of Thoughts，GoT）是近期大型语言模型（LLM）提示范式的一种泛化形式，已被证明对复杂问题求解很有用。通过执行一个操作图，LLM的思维被组织成任意的图结构，形成实际的思维图。最初，操作图是手工定义的，这需要对待解问题的解法有深入的了解。这种静态的操作图较为僵化，因此缺乏适应性。我们提出了强化思维图（Reinforced Graph of Thoughts，RGoT），这是一种面向GoT提示范式的自动化方法，利用强化学习（RL）从人工定义的操作集合中自适应地生成操作图。结果表明，在某些约束条件下，可以自动地根据任务复杂度自适应地构建操作图。

Kernel-Based Safe Exploration in Deep Reinforcement Learning

深度强化学习中基于核的安全探索

Authors: Rupak Majumdar, Nikhil Singh, Sadegh Soudjani
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.22207
Pdf link: https://arxiv.org/pdf/2605.22207
Abstract Safety has been a major concern when deploying deep reinforcement learning algorithms in the real world. A promising direction that ensures that the learned policy does not visit unsafe regions is to learn a barrier function along with the policy. A barrier is a function from states to reals that assigns low values to the initial states, high values to the unsafe states, and decreases in expectation on each transition; such a function can be used to bound the probability of reaching unsafe states. Previous attempts learned a barrier function directly from exploration data, but this required either large amounts of data or restrictions on the system dynamics. In this paper, we show how kernel embeddings can be used to learn barrier functions during deep reinforcement learning for stochastic systems with unknown dynamics. Our algorithm, kernel-based safe exploration (KBSE), learns an optimal policy and a barrier simultaneously during exploration. The barriers are computed iteratively, represented as conditional mean embeddings, and provide better probabilistic safety guarantees with more exploration. The exploration algorithm uses the learned barrier functions to identify safety violations. In the case of violation, it intervenes to modify the unsafe action to a safe action, thereby ensuring that the exploration is restricted to actions that bound the probability of reaching unsafe states. We evaluate KBSE on several complex continuous control benchmarks. Experimental results establish our new algorithm to be suitable for synthesizing control policies that are probabilistically safe without degradation in reward accumulation.
中文摘要 在现实世界中部署深度强化学习算法时，安全性一直是主要关注点。要确保学到的策略不会访问不安全区域，一个有前景的方向是在学习策略的同时学习一个屏障函数（barrier function）。屏障函数是从状态到实数的函数，它为初始状态赋予低值，为不安全状态赋予高值，并且在每次转移时期望值下降；这样的函数可以用来给出到达不安全状态的概率上界。以往的尝试直接从探索数据中学习屏障函数，但这要么需要大量数据，要么需要对系统动力学施加限制。本文展示了如何在动力学未知的随机系统的深度强化学习过程中，利用核嵌入来学习屏障函数。我们的算法，即基于核的安全探索（KBSE），在探索过程中同时学习最优策略和屏障函数。屏障函数以迭代方式计算，用条件均值嵌入表示，并随着探索的增加提供更好的概率安全保证。探索算法利用学到的屏障函数来识别安全违规。发生违规时，它会介入，将不安全动作修改为安全动作，从而确保探索被限制在能够约束到达不安全状态概率的动作之内。我们在多个复杂的连续控制基准上评估KBSE。实验结果表明，我们的新算法适用于合成具有概率安全性、且奖励累积不下降的控制策略。
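
The claim that a barrier bounds the probability of reaching unsafe states follows from the standard supermartingale argument (Ville's inequality). The sketch below assumes a nonnegative barrier $B$, a threshold $\eta$ on initial states, and a normalization of $B \ge 1$ on unsafe states. These symbols and conditions are illustrative assumptions, and the exact conditions used in the paper may differ (for example, by allowing an additive slack per step).

```latex
\begin{aligned}
& B(x) \ge 0 \ \text{for all } x, \qquad
  B(x_0) \le \eta \ \text{for } x_0 \in X_{\mathrm{init}}, \qquad
  B(x) \ge 1 \ \text{for } x \in X_{\mathrm{unsafe}}, \\
& \mathbb{E}\left[ B(x_{t+1}) \mid x_t \right] \le B(x_t)
  \quad \text{for all } t \ge 0 \\
& \Longrightarrow \quad
  \Pr\left[ \exists\, t \ge 0 : x_t \in X_{\mathrm{unsafe}} \right]
  \le \Pr\left[ \sup_{t \ge 0} B(x_t) \ge 1 \right]
  \le B(x_0) \le \eta .
\end{aligned}
```

Under these assumptions, a smaller barrier value on the initial states gives a tighter bound on the probability of ever entering the unsafe set.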

CLORE: Content-Level Optimization for Reasoning Efficiency

CLORE：内容层级优化以提升推理效率

Authors: Yuyang Wu, Qiyao Xue, Guanxing Lu, Weichen Liu, Zihan Wang, Manling Li, Olexandr Isayev
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.22211
Pdf link: https://arxiv.org/pdf/2605.22211
Abstract Reinforcement learning post-training has improved the reasoning ability of large language models, but often produces unnecessarily long, repetitive, or semantically opaque reasoning traces. Existing efficient reasoning methods mainly regulate response length through explicit budgets or length-aware rewards, leaving intermediate reasoning content weakly supervised. We propose CLORE, a content-level optimization framework that improves reasoning efficiency by editing correct on-policy rollouts. CLORE uses an external augmentation model to delete repetitive segments, illegible or task-irrelevant content, and superfluous reasoning after the solution is established, while preserving the final answer. The resulting augmented-original pairs are optimized with an auxiliary reference-free DPO objective alongside standard policy-gradient training. By restricting augmentation to correct trajectories and performing local deletion, CLORE keeps edited rollouts close to the policy distribution and mitigates off-policy mismatch. Experiments on DeepSeek-R1-Distill-Qwen-7B and Qwen2.5-Math-7B across five mathematical reasoning benchmarks show that CLORE improves the accuracy-efficiency trade-off and remains compatible with GRPO, DAPO, Training Efficient, and ThinkPrune. Content-level analyses further show that CLORE reduces repetitive reasoning, illegible content, and post-answer exploration, supporting content-level supervision as a complementary direction to length-level control.
中文摘要 强化学习后训练提升了大型语言模型的推理能力，但常常产生不必要地冗长、重复或语义晦涩的推理轨迹。现有的高效推理方法主要通过显式预算或长度感知奖励来调节响应长度，使中间推理内容只受到较弱的监督。我们提出了CLORE，一种内容级优化框架，通过编辑正确的同策略（on-policy）rollout来提升推理效率。CLORE使用外部增强模型删除重复片段、难以辨认或与任务无关的内容，以及在解答确立之后多余的推理，同时保留最终答案。由此得到的“增强-原始”样本对通过一个辅助的无参考DPO目标与标准策略梯度训练一起优化。通过将增强限制在正确轨迹上并只执行局部删除，CLORE使编辑后的rollout保持接近策略分布，并减轻离策略（off-policy）不匹配。在五个数学推理基准上对DeepSeek-R1-Distill-Qwen-7B和Qwen2.5-Math-7B的实验显示，CLORE改善了准确率与效率之间的权衡，并且与GRPO、DAPO、Training Efficient和ThinkPrune保持兼容。内容级分析进一步表明，CLORE减少了重复推理、难以辨认的内容和得出答案后的探索，支持将内容级监督作为长度级控制的补充方向。
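
A reference-free DPO objective on one augmented-original pair can be sketched as below. It assumes the edited (augmented) rollout is the preferred response and that sequence log-probabilities are plain sums over tokens. The abstract does not state the preference direction, the value of `beta`, or any length normalization, so all three are assumptions of this sketch.

```python
import math

def sequence_logprob(token_logprobs):
    """Sum of per-token log-probabilities under the current policy."""
    return sum(token_logprobs)

def reference_free_dpo_loss(logp_chosen, logp_rejected, beta=0.1):
    """-log(sigmoid(beta * (logp_chosen - logp_rejected))).

    No reference-policy terms appear, hence "reference-free".
    Written as a numerically stable softplus of the negative margin.
    """
    margin = beta * (logp_chosen - logp_rejected)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical per-token log-probs: the edited rollout is shorter
# because a repeated segment was deleted.
augmented = sequence_logprob([-0.2, -0.1, -0.3])
original = sequence_logprob([-0.2, -0.1, -0.4, -0.4, -0.3])
loss = reference_free_dpo_loss(augmented, original)
```

One design caveat: with summed log-probabilities a shorter sequence tends to score higher regardless of content, so a practical variant might normalize by length. Whether CLORE does so is not stated in the abstract.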

Survive or Collapse: The Asymmetric Roles of Data Gating and Reward Grounding in Self-Play RL

生存还是崩溃：数据门控与奖励锚定在自博弈强化学习中的非对称作用

Authors: Sophia Xiao Pu, Zhaotian Weng, Chengzhi Liu, Jayanth Srinivasa, Gaowen Liu, William Yang Wang, Xin Eric Wang
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.22217
Pdf link: https://arxiv.org/pdf/2605.22217
Abstract Self-play reinforcement learning trains language models on their own generated tasks, co-evolving a proposer and solver without human labels. Recent systems report strong reasoning gains, but collapse and instability are widely observed and poorly understood. The dominant response treats this as a reward-design problem. We argue instead that self-play stability is governed by two distinct levers: a data-level gate that decides which proposer-generated tasks enter the training pool, and the reward signal that updates the policy on tasks already admitted. Through controlled experiments on a Python output-prediction task and a deterministic-DSL twin task that strips pretraining priors, output ambiguity, and executor noise, we find the two levers are asymmetric. A strict gate is sufficient for stability under every reward variant we test, including a self-consistency reward with no access to ground truth; while no reward variant is sufficient once the gate is removed. This asymmetry exposes a counter-intuitive coupling we call the Grounded Proposer Paradox: a proposer with ground-truth access accelerates collapse faster than an ungrounded one when paired with a self-consistency solver, by concentrating training on clean tasks that form the fastest path to a spurious self-consistent attractor. Replacing the binary gate with a continuous strictness parameter $\varepsilon$ further reveals a two-stage phase transition: training-side metrics decouple at low $\varepsilon$, while validation accuracy holds until $\varepsilon$ is much higher. Data-level gating, not reward calibration, is the binding constraint on self-play stability.
中文摘要 自博弈强化学习让语言模型在自己生成的任务上训练，在没有人类标签的情况下共同进化一个出题者（proposer）和一个求解者（solver）。近期系统报告了显著的推理提升，但崩溃和不稳定现象被广泛观察到，却缺乏深入理解。主流的应对方式将其视为奖励设计问题。我们则认为，自博弈的稳定性由两个不同的杠杆支配：一个是数据级门控，决定哪些由出题者生成的任务进入训练池；另一个是奖励信号，在已被接纳的任务上更新策略。通过在Python输出预测任务以及一个去除了预训练先验、输出歧义和执行器噪声的确定性DSL孪生任务上的受控实验，我们发现这两个杠杆是不对称的。严格的门控在我们测试的每一种奖励变体下都足以保证稳定，包括一种无法访问真实标签的自洽奖励；而一旦移除门控，任何奖励变体都不足以保证稳定。这种不对称性揭示了一种反直觉的耦合，我们称之为“有依据出题者悖论”（Grounded Proposer Paradox）：当与自洽求解者配对时，能够访问真实标签的出题者会比无法访问的出题者更快地导致崩溃，因为它把训练集中在干净的任务上，而这些任务构成了通往虚假自洽吸引子的最快路径。用连续的严格度参数$\varepsilon$替代二元门控，进一步揭示了一个两阶段相变：训练侧指标在较低的$\varepsilon$处即发生解耦，而验证准确率则一直保持到$\varepsilon$高得多时。决定自博弈稳定性的约束是数据级门控，而非奖励校准。

Emergence of agriculture in an artificial society of reinforcement learning agents

农业在强化学习代理的人工社会中的出现

Authors: Gautier Hamon, Martí Sánchez-Fibla, Clément Moulin-Frier, Ricard Solé
Subjects: Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2605.22256
Pdf link: https://arxiv.org/pdf/2605.22256
Abstract The origin of agriculture represents a major evolutionary transition and a paradigmatic example of how complex collective behaviors emerge from simple interactions. Here we introduce an artificial society of reinforcement learning agents embedded in a dynamic ecological environment to identify general principles underlying this transition. Within this system, agricultural practices emerge spontaneously - without explicit instruction - through the coupled dynamics of learning and environmental modification. We show that this transition is governed by four key ingredients: individual planning through the valuation of delayed rewards, social vulnerability to cheaters, stabilization via social learning, and an emergent lock-in effect that renders agriculture effectively irreversible once established. In particular, we demonstrate that social learning acts as a "firewall" that suppresses cheater invasion and enables the propagation of successful strategies, leading to sustained population growth and nonlinear amplification of domesticated resources. Together, these results reveal universal mechanisms linking individual decision-making, social interactions, and ecological feedbacks. More broadly, they highlight the potential of artificial societies as experimental platforms to study the emergence of cultural innovations and major evolutionary transitions.
中文摘要 农业的起源代表了一个重要的进化转变，也是复杂集体行为如何从简单互动中产生的一个典范例证。在这里，我们引入一个嵌入动态生态环境中的强化学习代理人工社会，以识别这一转变背后的一般原则。在这一体系中，农业实践通过学习与环境改变的耦合动态自发产生——无需明确指导。我们表明，这一转变受四个关键要素支配：通过评估延迟奖励实现个人规划、对作弊者的社会脆弱性、通过社会学习实现稳定，以及一种涌现的锁定效应，使农业一旦建立就几乎不可逆转。特别是，我们证明了社会学习作为“防火墙”，抑制作弊者入侵，促进成功策略的传播，从而实现持续的人口增长和驯化资源的非线性放大。这些结果共同揭示了连接个体决策、社会互动和生态反馈的普遍机制。更广泛地说，它们强调了人工社会作为研究文化创新和重大进化转变的实验平台的潜力。

Long-term Fairness with Selective Labels

选择性标签的长期公平性

Authors: Giovani Valdrighi, Isabel Valera, Marcos Medeiros Raimundo
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.22291
Pdf link: https://arxiv.org/pdf/2605.22291
Abstract Long-term fairness algorithms aim to satisfy fairness beyond static and short-term notions by accounting for the dynamics between decision-making policies and population behavior. Most previous approaches evaluate performance and fairness measures from observable features and a label, which is assumed to be fully observed. However, in scenarios such as hiring or lending, the labels (e.g., ability to repay the loan) are selective labels as they are only revealed based on positive decisions (e.g., when a loan is granted). In this paper, we study long-term fairness in the selective labels setting and analytically show that naive solutions do not guarantee fairness. To address this gap, we then introduce a novel framework that leverages both the observed data and a label predictor model to estimate the true fairness measure value by decomposing it into the observed fairness and bias from label predictions. This allows us to derive sufficient conditions to satisfy true fairness from observable quantities by using the confidence in the predictor model. Finally, we rely on our theoretical results to propose a novel reinforcement learning algorithm for effective long-term fair decision-making with selective labels. In semisynthetic environments, the proposed algorithm reached comparable fairness and performance to an agent with oracle access to the true labels.
中文摘要 长期公平性算法旨在通过考虑决策策略与群体行为之间的动态关系，满足超越静态和短期概念的公平性。以往大多数方法基于可观察特征和一个被假定为完全可观测的标签来评估性能和公平性指标。然而，在招聘或贷款等场景中，标签（如偿还贷款的能力）是选择性标签，因为它们只有在做出肯定决策（如贷款获批）时才会被揭示。本文研究了选择性标签设置下的长期公平性，并从理论上证明朴素的解决方案并不能保证公平。为弥补这一不足，我们引入了一个新框架，同时利用观测数据和标签预测模型来估计真实的公平性度量值，方法是将其分解为观测到的公平性和来自标签预测的偏差。这使我们能够利用预测模型的置信度，从可观测量中推导出满足真实公平性的充分条件。最后，我们基于理论结果提出了一种新的强化学习算法，用于在选择性标签下进行有效的长期公平决策。在半合成环境中，所提算法达到了与可通过预言机（oracle）访问真实标签的智能体相当的公平性和性能。

ACCoRD: Actor-Critic Conflict Resolution with Deep learning for O-RAN xApps

ACCoRD：面向O-RAN xApps的基于深度学习的Actor-Critic冲突解决

Authors: Cezary Adamczyk, Adrian Kliks
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.22306
Pdf link: https://arxiv.org/pdf/2605.22306
Abstract Conflict Mitigation (ConMit) is a crucial part of intelligent network control in Open Radio Access Networks (O-RAN). In this paper, we propose a method named ACCoRD to resolve detected control conflicts in the Near-Real-Time RAN Intelligent Controller using a Conflict Resolution (CR) Agent with an Artificial Neural Network (ANN) trained with the PPO-Clip reinforcement learning algorithm. The implemented ANN analyzes data about the network and conflicting control decisions to infer optimal CR actions. The CR Agent gathers feedback from the network after each resolved conflict to assess its efficiency and adjust the ANN's weights during batch training. The evaluation of the proposed approach is based on simulation data. A new methodology for evaluating CR solutions is proposed. Results show that the proposed ANN-based method improves on the efficiency of rule-based approaches by significantly reducing negative network events caused by conflicting control decisions in medium- and high-traffic scenarios.
中文摘要 冲突缓解（ConMit）是开放无线接入网络（O-RAN）智能网络控制的关键组成部分。本文提出一种名为ACCoRD的方法，用于利用冲突解决（CR）代理和人工神经网络（ANN）训练的强化学习算法PPO-Clip，解决近实时RAN智能控制器中检测到的控制冲突。实现的人工神经网络分析网络数据及冲突的控制决策，以推断最优的计算过程。CR代理在每次冲突解决后收集网络反馈，评估效率并在批次训练中调整ANN权重。对拟议方法的评估基于仿真数据。提出了一种评估CR解决方案的新方法论。结果显示，基于人工神经网络的方法通过显著减少中高流量场景中因控制决策冲突引起的负面网络事件，提升了基于规则的方法的效率。

Integrating Chain-of-Thought into Generative Retrieval: A Preliminary Study

将思维链整合进生成检索：初步研究

Authors: Wenhao Zhang, Ruihao Yu, Yi Bai, Zhumin Chen, Pengjie Ren
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2605.22358
Pdf link: https://arxiv.org/pdf/2605.22358
Abstract While generative retrieval (GR) demonstrates competitive performance on standard retrieval benchmarks, existing approaches directly map queries to document identifiers (docids) without intermediate deliberation, limiting their effectiveness for complex queries that require multi-step reasoning. As a preliminary study on integrating chain-of-thought (CoT) into generative retrieval, we introduce ThinkGR, a unified framework that interleaves CoT with docid generation, enabling iterative thinking and retrieval within a single generative process. To bridge the gap between free-form thought generation and structured retrieval targets, we design (1) a hybrid decoding strategy that dynamically switches between unconstrained thought generation and constrained docid decoding, and (2) a two-phase training approach that first aligns thought-retrieval patterns through supervised fine-tuning, then optimizes thought quality via retrieval-grounded reinforcement learning. Experiments on four multi-hop retrieval benchmarks demonstrate that ThinkGR achieves state-of-the-art performance with an average improvement of +6.86%. Our work opens new avenues for enhancing generative retrieval with explicit deliberation capabilities, with promising implications for retrieval tasks requiring complex reasoning.
中文摘要 虽然生成检索（GR）在标准检索基准测试中表现出竞争力，但现有方法直接将查询映射到文档标识符（docid），无需中间考虑，限制了其在需要多步推理的复杂查询中的有效性。作为将思维链（CoT）整合进生成检索的初步研究，我们介绍了ThinkGR，这是一个统一框架，将CoT与docid生成交织，使迭代思维和检索能够在单一生成过程中实现。为了弥合自由形式思维生成与结构化检索目标之间的差距，我们设计了（1）一种混合解码策略，动态切换于无约束思维生成和受限docid译码之间;（2）一种两阶段训练方法，先通过监督微调对齐思维检索模式，然后通过基于提取的强化学习优化思维质量。四个多跳反演基准测试的实验表明，ThinkGR实现了最先进的性能，平均提升率为+6.86%。我们的工作为增强生成式检索开辟了新途径，并赋予了明确的审议能力，并为需要复杂推理的检索任务带来了积极的启示。
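
The hybrid decoding idea in this abstract (free-form thoughts, then docid decoding restricted to valid identifiers) can be illustrated with a prefix trie. This is a minimal sketch under stated assumptions, not ThinkGR's implementation: the `<docid>` / `</docid>` tags, the `END` sentinel, and all function names are hypothetical, and docids are assumed to be short token sequences.

```python
END = "<end>"  # hypothetical sentinel marking a complete docid in the trie


def build_trie(docids):
    """Build a nested-dict prefix trie from an iterable of docid token tuples."""
    root = {}
    for docid in docids:
        node = root
        for tok in docid:
            node = node.setdefault(tok, {})
        node[END] = {}
    return root


def allowed_next(trie, prefix):
    """Return the set of tokens that may follow `prefix` inside a valid docid."""
    node = trie
    for tok in prefix:
        node = node.get(tok)
        if node is None:
            return set()
    return set(node.keys())


def next_token_mask(generated, trie, open_tag="<docid>", close_tag="</docid>"):
    """Return None while decoding is unconstrained, else the allowed token set.

    Decoding is constrained only between an open tag and its close tag
    (both tags are assumptions made for this sketch).
    """
    if open_tag not in generated:
        return None
    last_open = len(generated) - 1 - generated[::-1].index(open_tag)
    if close_tag in generated[last_open:]:
        return None  # the most recent docid span is already closed
    prefix = generated[last_open + 1:]
    return {close_tag if t == END else t for t in allowed_next(trie, prefix)}
```

In a real decoder the returned set would be used to mask logits before sampling; a `None` result leaves the distribution untouched so the model can keep reasoning freely.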

Target-Aligned Bellman Backup for Cross-domain Offline Reinforcement Learning

面向跨域离线强化学习的目标对齐Bellman备份

Authors: Wei Liu, Ting Long
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.22376
Pdf link: https://arxiv.org/pdf/2605.22376
Abstract Cross-domain offline reinforcement learning (CDRL) aims to improve policy learning in a target domain by leveraging data collected from a source domain. Existing works typically assess the transferability of source-domain data by measuring its similarity to target-domain transitions, and implicitly perform transition-level selection. Transitions that are considered similar are assigned higher weights or rewards, while dissimilar ones are down-weighted. However, transition-level similarity does not necessarily imply consistency in long-term returns. Even visually or dynamically similar transitions may lead to significantly different outcomes in the target domain, which can mislead policy learning and degrade performance. To address this issue, we revisit the fundamental objective of policy learning. Since policy optimization ultimately relies on Bellman targets to evaluate the quality of decisions, we propose to assess the transferability of source-domain transitions based on their alignment with target-domain Bellman targets, rather than superficial transition similarity. Based on this insight, we propose a method termed Target-Aligned Bellman Backup (TABB), which selectively leverages source-domain data by measuring their contribution to accurate Bellman target estimation in the target domain. We evaluate TABB across a broad range of cross-domain offline RL settings with highly limited target-domain data. Experimental results show that TABB consistently achieves strong performance.
中文摘要 跨域离线强化学习（CDRL）旨在通过利用源域收集的数据，提升目标域的策略学习。现有研究通常通过测量源域数据与目标域转换的相似度来评估其可转移性，并隐式执行转移级选择。被认为相似的过渡会被赋予更高的权重或奖励，而不同的过渡则会被下重。然而，过渡层面的相似性并不一定意味着长期回报的一致性。即使在视觉或动态上相似的转换，也可能在目标领域产生显著不同的结果，这可能误导策略学习并降低绩效。为解决这一问题，我们重新审视政策学习的根本目标。由于策略优化最终依赖贝尔曼目标来评估决策质量，我们建议基于源域-域转换与目标域贝尔曼目标的对齐度来评估其可转移性，而非表面的过渡相似性。基于这一见解，我们提出了一种称为目标对齐贝尔曼备份（TABB）的方法，该方法通过测量源域数据对目标域中贝尔曼目标估计的准确贡献，选择性地利用其。我们在极限的目标域数据下，评估了TABB在广泛跨域离线强化学习环境中的应用。实验结果显示，TABB始终保持着强劲的性能。

Unified Data Selection for LLM Reasoning

LLM推理的统一数据选择

Authors: Xiaoyuan Li, Yubo Ma, Chengpeng Li, Fengbin Zhu, Yiyao Yu, Keqin Bao, Wenjie Wang, Fuli Feng, Dayiheng Liu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.22389
Pdf link: https://arxiv.org/pdf/2605.22389
Abstract Effectively training Large Language Models (LLMs) for complex, long-CoT reasoning is often bottlenecked by the need for massive high-quality reasoning data. Existing methods are either computationally expensive or fail to reliably distinguish high- from low-quality reasoning samples. To address this, we propose High-Entropy Sum (HES), a training-free metric that quantifies reasoning quality by summing only the entropy of the top (e.g., 0.5%) highest-entropy tokens in each reasoning sample. We validate HES across three mainstream training paradigms: Supervised Fine-tuning (SFT), Rejection Fine-tuning (RFT), and Reinforcement Learning (RL), with extensive results demonstrating its consistent effectiveness and significantly reduced computational overhead. In SFT, training on the top 20% HES-ranked data matches full-dataset performance, while using the lowest-HES data degrades it. In RFT, our HES-based training approach significantly outperforms baseline methods. In RL, HES-selected successful trajectories enable the model to learn strong reasoning patterns, significantly surpassing other compared methods. Our findings establish HES as a robust, training-free metric that enables a unified, effective, and efficient method for developing advanced reasoning in LLMs.
中文摘要 有效训练大型语言模型（LLMs）进行复杂的长思维链（long-CoT）推理，常常受限于对大量高质量推理数据的需求。现有方法要么计算量大，要么无法可靠区分高质量和低质量推理样本。为此，我们提出了高熵和（HES），这是一种无需训练的指标，通过仅对每个推理样本中熵最高的一小部分token（如前0.5%）的熵求和来量化推理质量。我们在三种主流训练范式中验证了HES：监督式微调（SFT）、拒绝微调（RFT）和强化学习（RL），大量结果表明其效果稳定，并显著降低了计算开销。在SFT中，使用HES排名前20%的数据训练即可匹配全数据集的表现，而使用HES最低的数据则会降低性能。在RFT中，我们基于HES的训练方法显著优于基线方法。在强化学习中，HES选定的成功轨迹使模型能够学习强有力的推理模式，显著优于其他对比方法。我们的发现确立了HES作为一个稳健、无需训练的指标，能够为LLMs中发展高级推理提供统一、有效且高效的方法。
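
The HES metric described above sums the entropy of only the highest-entropy tokens in a sample. The sketch below is one plausible reading of that sentence, not the authors' code: it assumes per-token entropies are already available, and the rounding rule (ceiling, with at least one token kept) is an assumption of this sketch.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_entropy_sum(token_entropies, top_fraction=0.005):
    """Sum the entropies of the top `top_fraction` highest-entropy tokens.

    `top_fraction=0.005` mirrors the 0.5% example in the abstract; keeping
    at least one token via ceil() is an assumption made here.
    """
    if not token_entropies:
        return 0.0
    k = max(1, math.ceil(top_fraction * len(token_entropies)))
    return sum(sorted(token_entropies, reverse=True)[:k])
```

Samples could then be ranked by `high_entropy_sum` and the top-ranked share kept for training, which is the selection pattern the abstract reports for SFT.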

DeferMem: Query-Time Evidence Distillation via Reinforcement Learning for Long-Term Memory QA

DeferMem：通过强化学习进行查询时证据蒸馏，用于长期记忆问答

Authors: Jianing Yin, Tan Tang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.22411
Pdf link: https://arxiv.org/pdf/2605.22411
Abstract Large language model (LLM) agents still struggle with long-term memory question answering, where answer-supporting evidence is often scattered across long conversational histories and buried in substantial irrelevant content. Existing memory systems typically process memory before future queries are known, then retrieve the resulting units based on similarity rather than their utility for answering the query. This workflow leaves downstream answerers to denoise retrieved candidates and reconstruct query-specific evidence. We present DeferMem, a long-term memory framework that decouples this problem into high-recall candidate retrieval and query-conditioned evidence distillation. DeferMem uses a lightweight segment-link structure to organize raw history and retrieve broad candidates at query time. It then applies a memory distiller trained with DistillPO, our reinforcement learning algorithm for distilling the high-recall but highly noisy candidates into a set of faithful, self-contained, and query-conditioned evidence. DistillPO formulates post-retrieval evidence distillation as a structured action comprising message selection and evidence rewriting. It optimizes this action with a decomposed-and-gated reward pipeline and structure-aligned advantage assignment, gating reward components from validity to quality checks while exposing task-level correctness feedback early and assigning each reward to its responsible output span. On LoCoMo and LongMemEval-S, DeferMem surpasses strong baselines in QA accuracy and memory-system efficiency, achieving the highest QA accuracy with the fastest runtime and zero commercial-API token cost for memory operations.
中文摘要 大型语言模型（LLM）代理仍在长期记忆问题回答方面遇到困难，因为支持答案的证据常常分散在漫长的对话历史中，并埋藏在大量无关内容中。现有内存系统通常在未来查询已知之前处理内存，然后根据相似度而非其回答查询的效用来检索所得单元。该工作流程让下游回答者负责去噪检索的候选人并重建查询特定证据。我们提出了DeferMem，一种长期记忆框架，将该问题解耦为高回忆候选检索和查询条件证据提炼。DeferMem 采用轻量级段-链接结构来组织原始历史并在查询时检索广泛的候选数据。然后应用由DistillPO训练的记忆蒸馏器，DistillPO是我们的强化学习算法，将高召回但噪声较大的候选样本提炼成一组忠实、自包含且经查询条件的证据。DistillPO将检索后证据提炼制定为一种结构化的行动，包括信息选择和证据重写。它通过分解和门槛奖励流水线和结构对齐的优势分配来优化该动作，将奖励组成部分从有效性到质量检查之间进行门槛，同时提前暴露任务层级的正确性反馈，并将每个奖励分配到其负责任的输出时段。在LoCoMo和LongMemEval-S上，DeferMem在QA准确性和内存系统效率方面超越了强劲基线，实现了最高的QA准确性，运行时间最快且无商业API令牌成本。

From Recognition to Reasoning: Benchmarking and Enhancing MLLMs on Real-World Receipt Document Understanding

从识别到推理：多模态大语言模型（MLLMs）在真实世界收据文档理解上的基准测试与增强

Authors: Yandi Wang, Libin Zhan, Ziwei Huang, Tiancheng Luo, Yuxuan Jiang, Wang Dong, Leilei Gan, Jun Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.22413
Pdf link: https://arxiv.org/pdf/2605.22413
Abstract Extracting structured information from visual documents (Visual Information Extraction, VIE) is a cornerstone of business automation. While recent Multimodal Large Language Models (MLLMs) have shown promising capabilities, existing benchmarks suffer from critical limitations in scale and realism, lack semantic granularity, and fail to cover diverse document types. To bridge this gap, we introduce ReceiptBench, a large-scale, human-annotated benchmark consisting of 10k diverse receipts, organizing information extraction into four hierarchical sub-tasks: (1) Basic Perception for raw text spotting, (2) Format Normalization for strictly following standardization instructions, (3) Semantic Reasoning for inferring implicit attributes from context, and (4) Structure Parsing for handling nested line items. Furthermore, we propose a two-stage training framework incorporating Metric-Aware Group Relative Policy Optimization (GRPO), which translates rigorous evaluation constraints into reinforcement learning signals to enhance structural consistency. Extensive experiments demonstrate that our method yields state-of-the-art performance, surpassing leading proprietary models on complex reasoning tasks. We release our datasets and code at this https URL.
中文摘要 从可视化文档中提取结构化信息（Visual Information Extraction，VIE）是业务自动化的基石。尽管近期的多模态大型语言模型（MLLM）展现出有前景的能力，但现有基准在规模和真实性方面存在严重限制，缺乏语义细度，且未能涵盖多样化的文档类型。为弥合这一空白，我们引入了ReceiptBench，一个大规模、人工注释的基准测试，包含1万条多样化收据，将信息提取组织为四个层级子任务：（1）用于原始文本发现的基本感知，（2）格式规范化（严格遵循标准化指令），（3）语义推理（从上下文推断隐式属性），以及（4）结构解析（用于处理嵌套行项）。此外，我们提出了一个包含度量感知群相对策略优化（GRPO）的两阶段训练框架，将严格的评估约束转化为强化学习信号，以增强结构一致性。大量实验表明，我们的方法在复杂推理任务上表现最先进，超越了领先的专有模型。我们将数据集和代码发布到这个 https URL。

Don't Forget the Critic: Value-Based Data Rehearsal for Multi-Cyclic Continual Reinforcement Learning

别忘了批评者：基于价值的数据演练用于多周期持续强化学习

Authors: Benjamin Poole, Andrew Quinn, Li Yang, Minwoo Lee
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.22454
Pdf link: https://arxiv.org/pdf/2605.22454
Abstract Data rehearsal has emerged as a leading approach for mitigating catastrophic forgetting in Continual Reinforcement Learning (CRL). However, existing work remains confined to policy gradient frameworks, regularizing only actors due to the performance degradation incurred by critic regularization. This actor-centric approach overlooks the potential of data rehearsal for value function approximation. Moreover, existing evaluations in CRL rarely consider multi-cyclic environments where task sequences repeat, a critical real-world scenario that exacerbates forgetting and plasticity. We investigate data rehearsal for Deep Q-Networks using Q-value regularization in multi-cyclic settings and propose Qreg+NWLU which introduces two simple modifications: (1) continuous data rehearsal that dynamically collects and updates stored Q-values throughout training, and (2) "No-Wait" regularization that applies immediately rather than after the first task. Together, these modifications yield improvements in learning efficiency, forgetting mitigation, and knowledge transfer over Qreg and conventional CRL methods within value function approximation settings.
中文摘要 数据演练已成为持续强化学习（CRL）中减轻灾难性遗忘的领先方法。然而，现有工作仍局限于政策梯度框架内，仅对行为者进行正则化，因为批评正则化会带来绩效下降。这种以演员为中心的方法忽视了数据演练在价值函数近似方面的潜力。此外，现有CRL评估很少考虑任务序列重复的多周期环境，这一关键现实场景加剧了遗忘和可塑性。我们研究了在多周期环境中使用Q值正则化的深度Q网络数据演练，并提出了Qreg+NWLU，引入了两个简单修改：（1）连续数据演练，在训练过程中动态收集并更新存储的Q值;（2）立即而非第一个任务后即刻应用的“无等待”正则化。这些改进共同带来了学习效率、遗忘缓解和知识转移的提升，尤其是在价值函数近似设置下，相较于Qreg和传统CRL方法。
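
To make the Q-value regularization idea concrete, here is a small sketch of a DQN-style loss with a rehearsal penalty that keeps current Q-values close to stored ones. It is an illustration under assumptions, not the Qreg+NWLU implementation: the squared-error penalty, the `weight` parameter, and every function name are hypothetical. The `no_wait` flag mimics applying the regularizer from the first task onward, and `refresh_stored_q` mimics continuously updating stored Q-values.

```python
def td_loss(q_sa, reward, next_q_max, gamma=0.99, done=False):
    """Squared one-step TD error for a single transition."""
    target = reward + (0.0 if done else gamma * next_q_max)
    return (q_sa - target) ** 2


def q_regularizer(current_q, stored_q):
    """Mean squared gap between current and stored Q-values on rehearsal data."""
    return sum((c - s) ** 2 for c, s in zip(current_q, stored_q)) / len(current_q)


def total_loss(td, reg, weight, task_index, no_wait=True):
    """Add the rehearsal penalty; without `no_wait` it starts after task 0."""
    use_reg = no_wait or task_index > 0
    return td + (weight * reg if use_reg else 0.0)


def refresh_stored_q(stored_q, index, new_q):
    """Overwrite one stored Q-value so the rehearsal targets stay current."""
    stored_q[index] = new_q
```

The design choice worth noting is that the penalty acts on the value function itself rather than on a policy head, which is the critic-side use of rehearsal the abstract argues for.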

F-TIS: Harnessing Diverse Models in Collaborative GRPO

F-TIS：协作GRPO中多元模型的利用

Authors: Nikolay Blagoev, Oğuzhan Ersoy, Wendelin Boehmer, Lydia Yiyu Chen
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.22537
Pdf link: https://arxiv.org/pdf/2605.22537
Abstract Reinforcement learning methods such as GRPO have seen great popularity in LLM post-training. In GRPO, models produce completions to a set of prompts, which are rewarded, and the policy is updated towards the relatively high-reward completions. Due to the auto-regressive nature of models, the generation phase of this style of training can be extremely time-consuming. As a solution, prior work has sought to distribute the inference step across many nodes working in parallel. These works primarily assume homogeneous models in training in order to keep samples as close to on-policy as possible. This assumption may be impractical in decentralized systems, where parties with varying compute resources and preferences may wish to collaborate on the same task. Thus, decentralized training requires an approach that can handle heterogeneous models, that is, different models collaborating on the same tasks. However, this leads to highly off-policy samples during training, and prior work has found that off-policy samples can hurt GRPO convergence. To enable heterogeneity, we propose Filtered Truncated Importance Sampling (F-TIS), a GRPO-style training paradigm that can use off-policy samples to improve a local model's learning. Our framework allows various models to collaborate in the same RL training run while being communication-efficient. We extensively evaluate F-TIS in various heterogeneous setups and show that it exhibits identical final model convergence to purely on-sample training. Furthermore, in some setups we observe better generalization on out-of-distribution tasks than with on-policy training, increasing the model's performance by up to 12%.
中文摘要 像GRPO这样的强化学习方法在LLM后期学习中非常受欢迎。在GRPO中，模型生成一组提示的完成任务，这些提示会获得奖励，策略也会更新，以适应相对较高的奖励完成率。由于模型具有自回归性质，这种训练方式的生成阶段可能非常耗时。作为解决方案，先前的工作试图将推理步骤分散到多个节点，并行工作。这些工作主要假设训练中的模型是同质的，以保持样本尽可能接近策略上的模式。在去中心化系统中，这种假设可能不切实际，因为不同计算和偏好的各方可能希望在同一任务上协作。因此，去中心化训练需要一种能够处理异构模型的方法——即不同模型协作完成同一任务。然而，这会导致培训中呈现高度偏离策略的样本，先前研究表明，偏离策略的样本会损害GRPO收敛。为实现异质性，我们提出了过滤截断重要性抽样（F-TIS）——一种类似GRPO的训练范式，可以利用非策略样本提升局部模型学习效果。我们的框架允许多个模型在同一强化学习训练中协作，同时实现通信效率。我们广泛评估F-TIS在各种异质设置下，并证明其最终模型收敛性与纯样本训练完全一致。此外，我们观察到某些设置中，模型在分配外任务上的泛化比策略训练更佳，使模型性能提升多达12%。
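
Truncated importance sampling, which the method's name refers to, reweights a sample generated by another model by the probability ratio between the local and the generating policy, capped at a constant. The sketch below shows that generic technique plus a simple ratio filter. The filter rule, the `cap` and `min_ratio` values, and the function names are assumptions for illustration and are not taken from the paper.

```python
import math


def truncated_is_weight(logp_local, logp_behavior, cap=2.0):
    """Importance ratio pi_local / pi_behavior, truncated from above at `cap`."""
    return min(math.exp(logp_local - logp_behavior), cap)


def filter_and_weight(samples, cap=2.0, min_ratio=0.1):
    """Drop samples the local policy finds very unlikely, reweight the rest.

    Each sample is (logp_local, logp_behavior, advantage). Returns the
    weighted advantages of the samples that survive the filter.
    """
    kept = []
    for logp_local, logp_behavior, advantage in samples:
        ratio = math.exp(logp_local - logp_behavior)
        if ratio < min_ratio:
            continue  # too far off-policy to be useful under this sketch's rule
        kept.append(min(ratio, cap) * advantage)
    return kept
```

Capping the ratio bounds the variance contributed by any single off-policy completion, at the cost of some bias.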

LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance

LANG：多语言推理强化学习与语言自适应提示指导

Authors: Yuchun Fan, Bei Li, Peiguang Li, Yilin Wang, Yongyu Mu, Jian Yang, Xin Chen, Rongxiang Weng, Jingang Wang, Xunliang Cai, Jingbo Zhu, Tong Xiao
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.22567
Pdf link: https://arxiv.org/pdf/2605.22567
Abstract Reinforcement learning has proven effective for enhancing multi-step reasoning in large language models (LLMs), yet its benefits have not fully translated to multilingual contexts. Existing methods struggle with a fundamental trade-off: prioritizing input-language consistency severely hampers reasoning quality, while prioritizing reasoning often leads to unintended language drift toward English. We address this challenge with LANG, a novel framework that leverages language-conditioned hints to guide exploration in non-English reasoning tasks. Our method incorporates two key mechanisms to prevent dependency on these hints: a progressive decay schedule that gradually withdraws scaffolding, and a language-adaptive switch that tailors learning horizons to specific language difficulties. Empirical results on challenging multilingual mathematical benchmarks reveal that LANG substantially enhances reasoning performance without compromising language consistency. Moreover, we show that our framework generalizes beyond mathematics, fostering more consistent language alignment across model layers.
中文摘要 强化学习已被证明能有效增强大型语言模型（LLMs）中的多步推理能力，但其益处尚未完全转化到多语言语境中。现有方法面临一个根本性的权衡：优先考虑输入语言一致性严重影响推理质量，而优先考虑推理往往导致语言无意中向英语倾斜。我们通过LANG来应对这一挑战，这是一种利用语言条件提示引导非英语推理任务探索的新框架。我们的方法包含两个关键机制以防止对这些提示的依赖：逐步收回支架的渐进衰减计划，以及针对特定语言困难调整学习视野的语言适应切换。对多语言数学基准的实证结果表明，LANG 在不影响语言一致性的前提下，显著提升了推理能力。此外，我们展示了我们的框架超越数学范畴，促进了模型层间更一致的语言对齐

Two is better than one: A Collapse-free Multi-Reward RLIF Training Framework

两个比一个更好：一个无崩溃的多奖励RLIF训练框架

Authors: Shourov Joarder, Diganta Sikdar, Ahsan Habib Akash, Binod Bhattarai, Prashnna Gyawali
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.22620
Pdf link: https://arxiv.org/pdf/2605.22620
Abstract Reinforcement learning with verifiable rewards (RLVR) has substantially improved the reasoning ability of LLMs, but often depends on external supervision from human annotations or gold-standard solutions. Reinforcement learning from internal feedback (RLIF) has recently emerged as a scalable unsupervised alternative, using signals extracted from the model itself. However, existing RLIF methods typically rely on a single internal reward, which can lead to reward hacking, entropy collapse, and degraded reasoning structure. We propose a multi-reward RLIF framework that decomposes the training signal into two complementary components: an answer-level reward based on cluster voting and a completion-level reward based on token-wise self-certainty. To combine these signals robustly, we apply GDPO-based normalization to reduce reward-scale imbalance. We further introduce KL-Cov regularization, which targets low-entropy token distributions responsible for disproportionate entropy reduction, preserving exploration and preventing late-stage collapse. Across mathematical reasoning and code-generation benchmarks, our method improves stability and robustness over prior unsupervised RL approaches, while achieving performance close to supervised RLVR methods. These results show that complementary internal rewards, combined with targeted regularization, can support stable long-horizon reasoning without relying on external ground-truth supervision. Code will be released soon.
中文摘要 带有可验证奖励的强化学习（RLVR）显著提升了大型语言模型的推理能力，但通常依赖于人类注释或金标准解决方案的外部监督。来自内部反馈的强化学习（RLIF）最近作为一种可扩展的无监督替代方案出现，利用从模型本身提取的信号。然而，现有的RLIF方法通常依赖单一的内部奖励，这可能导致奖励黑客、熵崩溃和推理结构退化。我们提出了一个多奖励RLIF框架，将训练信号分解为两个互补组成部分：基于集群投票的答案级奖励和基于按代币的自确定性进行的完成级奖励。为了强有力地结合这些信号，我们应用基于GDPO的规范化以减少奖励尺度的失衡。我们进一步介绍了KL-Cov正则化，针对低熵代币分布，这些分布负责不成比例的熵降低，保留探索并防止晚期崩溃。在数学推理和代码生成基准测试中，我们的方法相比以往无监督强化学习方法提升了稳定性和鲁棒性，同时性能接近监督式RLVR方法。这些结果表明，互补的内部奖励结合有针对性的正则化，可以支持稳定的长期推理，而无需依赖外部的实地监督。代码将很快发布。
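
The two-reward idea above can be sketched as follows: an answer-level reward from voting over a group of sampled answers, a completion-level self-certainty score, and a per-reward normalization applied before the two are combined. This is a simplified illustration, not the paper's pipeline: exact-match answers stand in for clusters, ties all count as winners, the self-certainty scores are assumed to be given, and plain z-score normalization within the group stands in for the GDPO-based normalization.

```python
import math
from collections import Counter


def cluster_vote_rewards(answers):
    """Reward 1.0 for completions whose answer lies in a largest cluster."""
    counts = Counter(answers)
    top = max(counts.values())
    winners = {a for a, c in counts.items() if c == top}
    return [1.0 if a in winners else 0.0 for a in answers]


def group_normalize(values):
    """Z-score values within one sampled group; all zeros if they are constant."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0.0:
        return [0.0] * len(values)
    return [(v - mean) / std for v in values]


def combined_advantages(answers, self_certainty):
    """Normalize each reward separately, then add them per completion."""
    vote = group_normalize(cluster_vote_rewards(answers))
    cert = group_normalize(self_certainty)
    return [v + c for v, c in zip(vote, cert)]
```

Normalizing each signal on its own scale before summing keeps one reward from dominating simply because its raw values are larger.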

A note on convergence of Wasserstein policy optimization

关于瓦瑟斯坦策略优化收敛性的说明

Authors: David Šiška, Yufei Zhang
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2605.22622
Pdf link: https://arxiv.org/pdf/2605.22622
Abstract Wasserstein Policy Optimization (WPO) is a recently proposed reinforcement learning algorithm that leverages Wasserstein gradient flows to optimize stochastic policies in continuous action spaces. Despite its empirical success, the theoretical convergence properties of WPO in environments with continuous state and action spaces have yet to be fully established. In this note, we argue that WPO within the framework of entropy-regularised Markov Decision Processes converges linearly. This is done by leveraging recent advances in mean-field analysis for convergence of gradient flows using log-Sobolev inequalities. Assuming the existence of a sufficiently regular solution to the gradient flow equation, we demonstrate monotonic energy dissipation along the flow and establish a local log-Sobolev inequality. Ultimately, these properties allow us to argue that the value function should converge linearly to the global optimum.
中文摘要 Wasserstein策略优化（WPO）是一种最近提出的强化学习算法，利用Wasserstein梯度流优化连续动作空间中的随机策略。尽管WPO在实证上取得了成功，但其在连续状态和动作空间环境中的理论收敛性质尚未完全确立。本文论证，在熵正则化马尔可夫决策过程框架下，WPO是线性收敛的。这一结论借助了平均场分析中利用对数-索博列夫（log-Sobolev）不等式研究梯度流收敛的最新进展。假设梯度流方程存在足够正则的解，我们证明了沿梯度流的能量单调耗散，并建立了局部对数-索博列夫不等式。最终，这些性质使我们能够论证价值函数应线性收敛到全局最优。
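
The step from "energy dissipation plus a log-Sobolev-type inequality" to linear convergence follows a standard pattern, sketched below in generic form. The symbols are assumptions introduced for illustration and are not the note's notation: $E(t)$ is the suboptimality gap of the (regularised) value along the flow, $D(t)$ is the dissipation rate, and $\lambda > 0$ is the constant in the inequality.

```latex
\begin{align*}
E(t) &:= V^{\star} - V(\pi_t) \;\ge\; 0, \\
\frac{\mathrm{d}}{\mathrm{d}t} E(t) &= -D(t), \qquad D(t) \ge 0
  && \text{(monotone energy dissipation)}, \\
D(t) &\ge 2\lambda\, E(t)
  && \text{(log-Sobolev-type inequality)}, \\
\frac{\mathrm{d}}{\mathrm{d}t} E(t) &\le -2\lambda\, E(t)
  \;\Longrightarrow\;
  E(t) \le e^{-2\lambda t}\, E(0)
  && \text{(Gr\"onwall's inequality)}.
\end{align*}
```

In continuous time, this exponential decay of the gap is what "linear convergence" refers to. If the inequality holds only locally, the bound applies once the flow has entered the region where it is valid.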

Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning

Spreadsheet-RL：通过强化学习推动大型语言模型代理完成现实的电子表格任务

Authors: Banghao Chi, Yining Xie, Mingyuan Wu, Jingcheng Yang, Jize Jiang, Zhaoheng Li, Shengyi Qian, Minjia Zhang, Klara Nahrstedt, Rui Hou, Xiangjun Fan, Hanchao Yu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.22642
Pdf link: https://arxiv.org/pdf/2605.22642
Abstract Spreadsheet systems (e.g., Microsoft Excel, Google Sheets) play a central role in modern data-centric workflows. As AI agents grow increasingly capable of automating complex tasks, such as controlling computers and generating presentations, building an AI-driven spreadsheet agent has emerged as a promising research direction. Most existing spreadsheet agents rely on specialized prompting over general-purpose LLMs; while this design shows potential on simple spreadsheet operations, it struggles to manage the complex, multi-step workflows typical of real-world applications. We introduce Spreadsheet-RL, a reinforcement learning (RL) fine-tuning framework designed to train specialized spreadsheet agents within a realistic Microsoft Excel environment. Spreadsheet-RL features an automated pipeline for scalable collection of paired start-goal spreadsheets from online forums, as well as domain-specific evaluation tasks in areas such as finance and supply chain management, which we compile into the new Domain-Spreadsheet benchmark dataset. It also includes a Spreadsheet Gym environment designed for multi-turn RL: Spreadsheet Gym exposes extensive Excel functionality through a Python sandbox, along with a refined harness that incorporates a comprehensive tool set and carefully designed tool-routing rules for spreadsheet tasks. Through comprehensive experiments, we show that Spreadsheet-RL substantially enhances AI agents' performance on both general and domain-specific spreadsheet tasks: it improves Qwen3-4B-Thinking-2507's Pass@1 on SpreadsheetBench from 12.0% to 23.4%, and raises Pass@1 from 8.4% to 17.2% on our curated Domain-Spreadsheet dataset. These results highlight Spreadsheet-RL's strong potential for generalization and real-world adoption in spreadsheet automation, and broadly, its promise for advancing LLM-based interactions with data interfaces in everyday work.
中文摘要 电子表格系统（如 Microsoft Excel、Google Sheets）在现代以数据为中心的工作流程中起着核心作用。随着人工智能代理越来越具备自动化复杂任务的能力，如控制计算机和生成演示文稿，构建基于人工智能的电子表格代理已成为一个有前景的研究方向。大多数现有的电子表格代理依赖专业提示，而非通用大型语言模型;虽然这种设计在简单的电子表格操作上有潜力，但它在管理现实应用中复杂、多步骤的工作流程时遇到了困难。我们介绍Spreadsheet-RL，一个强化学习（RL）微调框架，旨在在真实的Microsoft Excel环境中训练专用的电子表格代理。Spreadsheet-RL 具备自动化流程，支持从在线论坛中可扩展收集成对起始-目标电子表格，以及金融和供应链管理等领域的特定领域评估任务，我们将这些任务汇总到新的 Domain-Spreadsheet 基准数据集中。它还包含一个为多回合强化学习设计的Spreadsheet Gym环境：Spreadsheet Gym通过Python沙盒展示了丰富的Excel功能，并配备了完善的工具集和精心设计的工具路由规则，用于电子表格任务。通过全面的实验，我们证明Spreadsheet-RL显著提升了AI代理在通用和特定领域电子表格任务中的表现：它将Qwen3-4B-Thinking-2507在SpreadsheetBench上的Pass@1提升从12.0%提升到23.4%，并在我们精心策划的领域电子表格数据集上提升Pass@1从8.4%提升到17.2%。这些结果凸显了Spreadsheet-RL在电子表格自动化中泛化和实际应用的强大潜力，以及其在日常工作中推动基于LLM与数据接互的潜力。

SegCompass: Exploring Interpretable Alignment with Sparse Autoencoders for Enhanced Reasoning Segmentation

SegCompass：探索基于稀疏自编码器的可解释对齐以增强推理分割

Authors: Zhenyu Lu, Liupeng Li, Jinpeng Wang, Haoqian Kang, Yan Feng, Ke Chen, Yaowei Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2605.22658
Pdf link: https://arxiv.org/pdf/2605.22658
Abstract While large language models provide strong compositional reasoning, existing reasoning segmentation pipelines fail to transparently connect this reasoning to visual perception. Current methods, such as latent query alignment, are end-to-end yet opaque "black boxes". Conversely, textual localization readout is merely readable, not truly interpretable, often functioning as an unconstrained post-hoc step. To bridge this interpretability gap, we propose SegCompass, an end-to-end model that leverages a Sparse Autoencoder (SAE) to forge an explicit, interpretable, and differentiable alignment pathway. Given an image-instruction pair, SegCompass first generates a chain-of-thought (CoT) trace. The core of our method is an SAE that maps both the CoT and visual tokens into a shared, high-dimensional sparse concept space. A query codebook selects salient concepts from this space, which are then spatially grounded by a slot mapper into a multi-slot heatmap that guides the final mask decoder. The entire model is trained jointly, unifying reinforcement learning for the reasoning path with standard segmentation supervision. This SAE-driven interface provides a "white-box" connection that is significantly more traceable than latent queries and more coherent than textual readouts. Extensive experiments on five challenging benchmarks demonstrate that SegCompass matches or surpasses state-of-the-art performance. Crucially, our visual and quantitative analyses show a strong correlation between the quality of the learned sparse concepts and final mask accuracy, confirming that SegCompass achieves superior results through its enhanced and inspectable alignment. Code is available at this https URL.
中文摘要 虽然大型语言模型提供了强大的组合推理，但现有的推理分割流程未能透明地将这种推理与视觉感知连接起来。当前的方法，如潜在查询对齐，是端到端但不透明的“黑箱”。相反，文本本地化读出仅可阅读，并非真正可解释，通常作为无限制的事后步骤。为弥合这一可解释性差距，我们提出了SegCompass，这是一种端到端模型，利用稀疏自编码器（SAE）构建显式、可解释且可微分的对齐路径。给定一对图像-指令，SegCompass 首先生成一条思维链（CoT）跟踪。我们方法的核心是一个SAE，将CoT和视觉标记映射到一个共享的高维稀疏概念空间中。查询码本从该空间中选择重要概念，然后通过槽映射器空间接地到多槽热图中，指导最终的掩码解码器。整个模型由共同训练，统一了推理路径的强化学习与标准的分段监督。这种基于SAE的界面提供了一种“白箱”连接，比潜在查询更易追踪，也比文本读数更连贯。在五个具有挑战性的基准测试上的广泛实验表明，SegCompass能够与甚至超越最先进的性能。关键是，我们的视觉和定量分析显示，学习到的稀疏概念质量与最终掩膜准确性之间存在强烈相关性，证实SegCompass通过增强和可检测的对齐实现了更优的结果。代码可在此 https URL 访问。

Clipping Bottleneck: Stabilizing RLVR via Stochastic Recovery of Near-Boundary Signals

裁剪瓶颈：通过随机恢复近边界信号稳定RLVR

Authors: Shuo Yang, Jinda Lu, Chiyu Ma, Kexin Huang, Haoming Meng, Qihui Zhang, Yuyang Liu, Bolin Ding, Guoyin Wang, Li Yuan, Jingren Zhou
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.22703
Pdf link: https://arxiv.org/pdf/2605.22703
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a central paradigm for scaling LLM reasoning, yet its optimization often suffers from training instability and suboptimal convergence. Through a systematic dissection of clipping-based GRPO-style objectives, we identify the rigid clipping decision induced by hard clipping as a key practical bottleneck in the studied RLVR setups. Specifically, our analysis suggests that informative signals can lie in the near-boundary region just beyond the clipping threshold, and are therefore discarded by the standard hard-clipping rule. Notably, once this bottleneck is precisely identified, even simple stochastic perturbations at the boundary can recover meaningful performance gains. Building on this finding, we propose Near-boundary Stochastic Rescue (NSR), a minimal, plug-and-play modification that stochastically retains these slightly out-of-bound tokens to recover lost signals. While NSR, via stochastic sampling, can be interpreted as inducing an implicit gradient decay in expectation, our ablations reveal that its stochastic, boundary-local rescue mechanism is consistently more effective than deterministic gradient decay. Validated by extensive experiments across model sizes from 7B to 30B and both dense and MoE architectures, as a plug-and-play solution, NSR substantially improves training stability and delivers consistent gains over strong baselines such as DAPO and GSPO.
中文摘要 带可验证奖励的强化学习（RLVR）已成为扩展大型语言模型推理的核心范式，但其优化常常存在训练不稳定和次优收敛的问题。通过系统剖析基于裁剪的GRPO式目标函数，我们发现硬裁剪带来的刚性裁剪决策是所研究的RLVR设置中的关键实际瓶颈。具体来说，我们的分析表明，有信息量的信号可能位于刚好超出裁剪阈值的近边界区域，因此会被标准硬裁剪规则丢弃。值得注意的是，一旦精确识别了这一瓶颈，即使是边界处的简单随机扰动也能恢复有意义的性能提升。基于这一发现，我们提出了近边界随机救援（NSR），这是一种极简的即插即用修改，随机保留这些略微越界的token，以恢复丢失的信号。虽然NSR通过随机采样可以被解释为在期望意义上引入了隐式梯度衰减，但我们的消融实验显示，其随机的、边界局部的救援机制始终比确定性梯度衰减更有效。经过覆盖7B至30B模型规模以及密集和MoE架构的大量实验验证，NSR作为即插即用方案，显著提升了训练稳定性，并相较于DAPO和GSPO等强基线取得了一致的提升。
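
The clipping behaviour discussed in this abstract can be illustrated per token. Under the standard PPO-style clipped surrogate, a token contributes no gradient once its probability ratio leaves the trust region in the direction favoured by its advantage. The sketch below adds a stochastic rescue for tokens only slightly past the boundary. It is a hedged illustration rather than the NSR algorithm itself: `margin`, `keep_prob`, and the function names are hypothetical parameters chosen for this example.

```python
import random


def hard_clip_keeps_gradient(ratio, advantage, eps=0.2):
    """True if the clipped surrogate min(r*A, clip(r)*A) passes gradient."""
    if advantage > 0:
        return ratio <= 1.0 + eps
    if advantage < 0:
        return ratio >= 1.0 - eps
    return False  # zero advantage contributes no gradient either way


def near_boundary_rescue(ratio, advantage, eps=0.2, margin=0.05,
                         keep_prob=0.5, rng=random):
    """Keep in-bound tokens; randomly keep tokens within `margin` of the bound."""
    if hard_clip_keeps_gradient(ratio, advantage, eps):
        return True
    if advantage > 0:
        overshoot = ratio - (1.0 + eps)
    elif advantage < 0:
        overshoot = (1.0 - eps) - ratio
    else:
        return False
    return overshoot <= margin and rng.random() < keep_prob
```

Tokens far outside the boundary are still dropped, so the trust-region effect of clipping is preserved while near-boundary signal is retained part of the time.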

Abstraction for Offline Goal-Conditioned Reinforcement Learning

离线目标条件强化学习的抽象

Authors: Clarisse Wibault, Alexander Goldie, Antonio Villares, Maike Osborne, Jakob Foerster
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.22711
Pdf link: https://arxiv.org/pdf/2605.22711
Abstract Markov Decision Processes (MDPs) often exhibit significant redundancy due to symmetries and shared structure across state-goal pairs in real-world Goal-Conditioned Reinforcement Learning (GCRL). While hierarchical policies have been motivated for horizon reduction via temporal abstraction in offline GCRL, we demonstrate that hierarchy also enables absolute abstraction. By introducing relativised options as well as distinct representations for different levels of the hierarchy, we demonstrate how an agent can reuse experience across similar contexts of the state-space. Based on this framework, we introduce two simple algorithms for learning relativised options and abstracting from the absolute frame of reference. Our experiments show that such inductive biases significantly improve performance in offline GCRL.
中文摘要 马尔可夫决策过程（MDPs）由于对称性和在现实世界目标条件强化学习（GCRL）中的状态-目标对共享结构，常表现出显著的冗余性。虽然层级策略通过离线GCRL的时间抽象来减少视界，但我们证明了层级结构也支持绝对抽象。通过引入相对化选项以及对不同层级的不同表示，我们展示了智能体如何在类似的状态空间上下文中重复利用经验。基于该框架，我们引入了两种简单的算法，用于学习相对化选项并从绝对参考系抽象。我们的实验表明，这种归纳偏差显著提升了离线GCRL的性能。

N3P: Accelerated Automated Parking via a Learning-Based Naturalistic Three-Stage Scheme

Authors: Yifan Xue, Toktam Mohammadnejad, Faizan M Tariq, Sangjae Bae, David Isele, Yosuke Sakamoto, Nadia Figueroa, Jovin D'sa
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2605.22722
Pdf link: https://arxiv.org/pdf/2605.22722
Abstract Autonomous parking requires efficient path planning that ensures kinematic feasibility and collision avoidance in constrained environments. Hybrid A* is widely used but computationally expensive, while reinforcement learning (RL) methods lack reliability and often struggle with long-horizon geometric constraints, leading to suboptimal trajectories. We present N3P, a fast learning-based three-stage framework for automated parking. By introducing an intermediate preparatory pose and using a learning module to predict it, N3P decomposes the maneuver into simpler subproblems, thereby reducing computational complexity and accelerating path generation. We validate the framework by integrating it with Hybrid A* algorithms. Experiments in perpendicular and parallel parking scenarios show that N3P-enhanced Hybrid A* speeds up planning by more than 80%. It also outperforms RL baselines in success rate and trajectory quality, producing shorter trajectories with fewer gear changes, while achieving comparable or lower planning time in most cases.

Post-Training is About States, Not Tokens: A State Distribution View of SFT, RL, and On-Policy Distillation

Authors: Dong Nie
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.22731
Pdf link: https://arxiv.org/pdf/2605.22731
Abstract Large language model post-training methods such as supervised fine-tuning (SFT), reinforcement learning (RL), and distillation are often analyzed through their loss functions: maximum likelihood, policy gradients, forward KL, reverse KL, or related objective-level variants. We study a complementary factor: the state distribution on which supervision is applied. For an autoregressive policy, a state is a prompt plus a generated prefix. SFT trains on fixed dataset states, while RL and on-policy distillation (OPD) train on states induced by the current learner. We formalize post-training as state-distribution shaping and run a controlled small-scale study using Qwen3-0.6B-Base on GSM8K, with TruthfulQA and MMLU as retention evaluations. Our results show three phenomena. First, a mild SFT run improves GSM8K with little forgetting, while a stress SFT run causes substantial retention loss. Second, OPD from a degraded SFT teacher surpasses that teacher on GSM8K, TruthfulQA, and MMLU, despite using the teacher as its only supervision source. Third, a lightweight on-policy RL run improves GSM8K while preserving retention. These results support a state-centric view of post-training: the source and locality of training states can be as important as the form of the supervision signal.
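The distinction between dataset states and learner-induced states can be made concrete with a toy sketch. It follows the abstract's definition of a state as a prompt plus a generated prefix. Everything else is an assumption for illustration: tokens are plain strings, `policy` is any callable that returns the next token, and the function names are hypothetical. No training is performed; the code only collects the states on which supervision would be applied.

```python
def prefix_states(prompt, completion):
    """All (prompt + prefix) states visited while producing `completion`."""
    return [tuple(prompt) + tuple(completion[:t]) for t in range(len(completion))]

def sft_states(dataset):
    """SFT-style states: prefixes of fixed reference completions."""
    states = []
    for prompt, reference in dataset:
        states.extend(prefix_states(prompt, reference))
    return states

def on_policy_states(prompts, policy, max_len, rng=None):
    """RL/OPD-style states: prefixes the current learner generates itself.

    `policy(state, rng)` is a hypothetical callable returning the next token.
    """
    states = []
    for prompt in prompts:
        completion = []
        for _ in range(max_len):
            state = tuple(prompt) + tuple(completion)
            states.append(state)
            completion.append(policy(state, rng))
    return states
```

With the same prompts, the two collectors return different state sets whenever the learner's own continuations differ from the reference completions, and that gap is the quantity a state-centric analysis focuses on.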

Superhuman Safe and Agile Racing through Multi-Agent Reinforcement Learning

Authors: Ismail Geles, Leonard Bauersfeld, Markus Wulfmeier, Davide Scaramuzza
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2605.22748
Pdf link: https://arxiv.org/pdf/2605.22748
Abstract Autonomous systems have achieved superhuman performance in isolation or simulation, yet they remain brittle in shared, dynamic real-world spaces. This failure stems from the dominant single-agent paradigm for physical applications, where other actors are ignored or treated as environmental noise, preventing effective coordination. Here we show that multi-agent reinforcement learning provides the essential safety scaffolding required for real-world interaction. Using high-speed quadrotor racing as a high-stakes testbed, we train agents to navigate complex aerodynamic interactions and strategic maneuvering with a variable number of racers. Through league-based self-play, agents evolve sophisticated anticipatory behaviors, including proactive collision avoidance, overtaking, and handling of multi-agent physical interactions such as aerodynamic downwash. Our agents outperform a champion-level human pilot in multi-player races at speeds exceeding 22 m/s, while simultaneously reducing collision rates by 50% compared to state-of-the-art single-agent baselines. Crucially, training with diverse artificial agents enables zero-shot generalization to safer human interaction. These results suggest that the path to robust robotic co-existence lies not in isolated safety constraints, but in the rigorous demands of multi-agent interaction. Multimedia materials are linked from the arXiv abstract page.

DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback

Authors: Yunpeng Dong, Jingkai He, Yuze Hou, Dong Du, Zhonghu Xu, Si Yu, Yubin Xia, Haibo Chen
Subjects: Operating Systems (cs.OS); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.22781
Pdf link: https://arxiv.org/pdf/2605.22781
Abstract LLM-powered AI agents require high-frequency state exploration (e.g., test-time tree search and reinforcement learning), relying on rapid checkpoint and rollback (C/R) of the complete sandbox state, including files and process state (e.g., memory and contexts). Existing mechanisms duplicate the entire state, causing hundreds of milliseconds to seconds of latency per C/R, which severely bottlenecks deep search and large-scale fan-outs. This paper observes that subsequent checkpoints in AI agents are highly similar. The key insight is that, instead of full duplication, a sandbox should duplicate only the changes between consecutive checkpoints. However, realizing this idea is non-trivial, mainly due to missing OS support. This paper proposes a new OS-level abstraction, DeltaState, to enable change-based transactional C/R for AI agents with two co-designed OS mechanisms. First, DeltaFS enables change-based filesystem C/R by organizing the file states into layers, dynamically freezing the writable layer and inserting a new one during checkpoint, which reduces file updates to copy-on-write and makes rollback a simple layer switch. Second, DeltaCR enables change-based process state C/R using incremental dumps, and accelerates rollback by bypassing traditional pipelines to directly fork() from a frozen template process. We then present DeltaBox, a novel agent sandbox achieving millisecond-level C/R through the two new mechanisms. Evaluations on SWE-bench and RL micro-benchmarks show DeltaBox completes checkpoint and rollback with millisecond-level latency (14 ms and 5 ms, respectively), empowering agents to explore substantially more nodes under fixed time budgets.
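The layered checkpoint scheme described for DeltaFS can be mimicked with a toy in-memory store. This is a sketch of the general layering technique only, not DeltaFS itself: Python dictionaries stand in for filesystem layers, the class and method names are hypothetical, and process state (the DeltaCR half of the design) is not modelled.

```python
_DELETED = object()  # tombstone marking a path removed in an upper layer

class LayeredStore:
    """Toy change-based store: frozen layers plus one writable top layer."""

    def __init__(self):
        self._layers = [{}]  # the last layer is the only writable one

    def write(self, path, data):
        # Writes never touch frozen layers, which gives copy-on-write behaviour.
        self._layers[-1][path] = data

    def delete(self, path):
        self._layers[-1][path] = _DELETED

    def read(self, path):
        # Search from the newest layer down to the oldest.
        for layer in reversed(self._layers):
            if path in layer:
                value = layer[path]
                if value is _DELETED:
                    raise FileNotFoundError(path)
                return value
        raise FileNotFoundError(path)

    def checkpoint(self):
        """Freeze the current writable layer and push a fresh one."""
        self._layers.append({})
        return len(self._layers) - 1  # number of frozen layers at this point

    def rollback(self, checkpoint_id):
        """Drop every layer created after the checkpoint, then reopen for writes."""
        if not 1 <= checkpoint_id < len(self._layers):
            raise ValueError("unknown or already discarded checkpoint")
        del self._layers[checkpoint_id:]
        self._layers.append({})
```

A checkpoint here costs one list append regardless of how much data is stored, and a rollback only discards the layers written since the checkpoint. That cost profile is the property the abstract attributes to change-based C/R.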

Remember to be Curious: Episodic Context and Persistent Worlds for 3D Exploration

Authors: Lily Goli, Justin Kerr, Daniele Reda, Alec Jacobson, Andrea Tagliasacchi, Angjoo Kanazawa
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.22814
Pdf link: https://arxiv.org/pdf/2605.22814
Abstract Exploration is a prerequisite for learning useful behaviors in sparse-reward, long-horizon tasks, particularly within 3D environments. Curiosity-driven reinforcement learning addresses this via intrinsic rewards derived from the mismatch between the agent's predictive model of the world and reality. However, translating this intrinsic motivation to complex, photorealistic environments remains difficult, as agents can become trapped in local loops and receive fresh rewards for revisiting forgotten states. In this work, we demonstrate that this failure stems from a lack of spatial persistence and episodic context. We show that effective curiosity requires a model of the world that is persistent and continuously updated, paired with an agent that maintains an episodic trajectory history to navigate toward novel regions. We achieve this using an online 3D reconstruction as a persistent model of the world, while the agent policy is parameterized as a sequence model over RGB observations to maintain episodic context. This design enables effective exploration during training while allowing the agent to navigate using solely RGB frames at deployment. Trained purely via curiosity on HM3D, our agent outperforms RL-based active mapping baselines and generalizes zero-shot to Gibson and AI-generated worlds. Our end-to-end policy enables efficient adaptation to downstream tasks, such as apple picking and image-goal navigation, outperforming from-scratch baselines. Video results are linked from the arXiv abstract page.
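The failure mode named in the abstract, fresh reward for revisiting forgotten states, can be reproduced in a few lines. The sketch below is a deliberately simplified stand-in, not the paper's method: sets of grid cells replace the online 3D reconstruction, the novelty reward is just the count of newly observed cells, and both class names are hypothetical.

```python
from collections import deque

class PersistentMap:
    """World model that never forgets: revisited cells earn no reward."""

    def __init__(self):
        self._seen = set()

    def reward(self, observed_cells):
        new = set(observed_cells) - self._seen
        self._seen |= new
        return len(new)

class ForgetfulMap:
    """World model with bounded memory: old cells are evicted and look novel again."""

    def __init__(self, capacity):
        self._recent = deque(maxlen=capacity)

    def reward(self, observed_cells):
        new = [c for c in sorted(set(observed_cells)) if c not in self._recent]
        self._recent.extend(new)  # the oldest entries fall out once capacity is hit
        return len(new)

# An agent walking a loop: A -> B -> C -> back to A.
loop = [[(0, 0)], [(1, 0)], [(2, 0)], [(0, 0)]]
persistent, forgetful = PersistentMap(), ForgetfulMap(capacity=2)
persistent_rewards = [persistent.reward(obs) for obs in loop]
forgetful_rewards = [forgetful.reward(obs) for obs in loop]
```

In this toy loop the forgetful model pays the agent again on its return to the first cell, while the persistent one does not, which is the incentive difference the abstract attributes to spatial persistence.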

Keyword: diffusion policy

No results.