Arxiv Papers of Today

Generated: 2026-02-20 16:44:04 (UTC+8); Arxiv announcement time: 2026-02-20 20:00 EST (2026-02-21 09:00 UTC+8)

24 relevant papers today

Keyword: reinforcement learning

DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning

DeepVision-103K：一个视觉多样、覆盖广泛且可验证的多模态推理数学数据集

Authors: Haoxiang Sun, Lizhen Xu, Bing Zhao, Wotao Yin, Wei Wang, Boyu Yang, Rui Wang, Hu Wei
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.16742
Pdf link: https://arxiv.org/pdf/2602.16742
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has been shown to be effective in enhancing the visual reflection and reasoning capabilities of Large Multimodal Models (LMMs). However, existing datasets are predominantly derived from either small-scale manual construction or recombination of prior resources, which limits data diversity and coverage, thereby constraining further gains in model performance. To this end, we introduce DeepVision-103K, a comprehensive dataset for RLVR training that covers diverse K12 mathematical topics, extensive knowledge points, and rich visual elements. Models trained on DeepVision achieve strong performance on multimodal mathematical benchmarks, and generalize effectively to general multimodal reasoning tasks. Further analysis reveals enhanced visual perception, reflection and reasoning capabilities in trained models, validating DeepVision's effectiveness for advancing multimodal reasoning. Data: this https URL.
中文摘要 带可验证奖励的强化学习（RLVR）已被证明能有效提升大型多模态模型（LMMs）的视觉反思和推理能力。然而，现有数据集主要来自小规模人工构建或对先前资源的重组，这限制了数据多样性和覆盖范围，从而限制了模型性能的进一步提升。为此，我们提出 DeepVision-103K，这是一个全面的 RLVR 训练数据集，涵盖了多样的 K12 数学主题、广泛的知识点和丰富的视觉元素。在DeepVision上训练的模型在多模态数学基准测试中表现出色，并能有效泛化到通用的多模态推理任务。进一步分析显示，训练后的模型在视觉感知、反思和推理能力方面均有增强，验证了DeepVision在推动多模态推理方面的作用。数据：此 https URL。

References Improve LLM Alignment in Non-Verifiable Domains

参考输出提升不可验证领域的LLM对齐

Authors: Kejian Shi, Yixin Liu, Peifeng Wang, Alexander R. Fabbri, Shafiq Joty, Arman Cohan
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.16802
Pdf link: https://arxiv.org/pdf/2602.16802
Abstract While Reinforcement Learning with Verifiable Rewards (RLVR) has shown strong effectiveness in reasoning tasks, it cannot be directly applied to non-verifiable domains lacking ground-truth verifiers, such as LLM alignment. In this work, we investigate whether reference-guided LLM-evaluators can bridge this gap by serving as soft "verifiers". First, we design evaluation protocols that enhance LLM-based evaluators for LLM alignment using reference outputs. Through comprehensive experiments, we show that a reference-guided approach substantially improves the accuracy of less capable LLM-judges using references from frontier models; stronger LLM-judges can also be enhanced by high-quality (i.e., human-written) references. Building on these improved judges, we demonstrate the utility of high-quality references in alignment tuning, where LLMs guided with references are used as judges to self-improve. We show that reference-guided self-improvement yields clear gains over both direct SFT on reference outputs and self-improvement with reference-free judges, achieving performance comparable to training with ArmoRM, a strong finetuned reward model. Specifically, our method achieves 73.1% and 58.7% on AlpacaEval and Arena-Hard with Llama-3-8B-Instruct, and 70.0% and 74.1% with Qwen2.5-7B, corresponding to average absolute gains of +20.2 / +17.1 points over SFT distillation and +5.3 / +3.6 points over reference-free self-improvement on AlpacaEval / Arena-Hard. These results highlight the potential of using reference-guided LLM-evaluators to enable effective LLM post-training in non-verifiable domains.
中文摘要 虽然带可验证奖励的强化学习（RLVR）在推理任务中表现出显著效果，但它无法直接应用于缺乏真值验证器的不可验证领域，例如LLM对齐。本研究探讨参考引导的LLM评估器能否充当软“验证器”来弥合这一差距。首先，我们设计了利用参考输出来增强基于LLM的评估器的评估协议，用于LLM对齐。通过全面的实验，我们表明，借助前沿模型提供的参考，参考引导方法能显著提高能力较弱的LLM评判器的准确性；更强的LLM评判器也可以通过高质量（即人工撰写）的参考得到增强。基于这些改进的评判器，我们展示了高质量参考在对齐调优中的效用，即用参考引导的LLM作为评判器来实现自我改进。我们证明，参考引导的自我改进明显优于直接在参考输出上做SFT以及使用无参考评判器的自我改进，性能可与使用ArmoRM（一种强大的微调奖励模型）训练相当。具体来说，我们的方法使用Llama-3-8B-Instruct在AlpacaEval和Arena-Hard上分别达到73.1%和58.7%，使用Qwen2.5-7B分别达到70.0%和74.1%，在AlpacaEval / Arena-Hard上相对SFT蒸馏的平均绝对提升为+20.2 / +17.1分，相对无参考自我改进为+5.3 / +3.6分。这些结果凸显了利用参考引导的LLM评估器在不可验证领域实现有效LLM后训练的潜力。

VAM: Verbalized Action Masking for Controllable Exploration in RL Post-Training -- A Chess Case Study

VAM：在强化学习后训练中可控探索的口头动作掩蔽——国际象棋案例研究

Authors: Zhicheng Zhang, Ziyan Wang, Yali Du, Fei Fang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.16833
Pdf link: https://arxiv.org/pdf/2602.16833
Abstract Exploration remains a key bottleneck for reinforcement learning (RL) post-training of large language models (LLMs), where sparse feedback and large action spaces can lead to premature collapse into repetitive behaviors. We propose Verbalized Action Masking (VAM), which verbalizes an action mask in the prompt and enforces that the model outputs an action from the masked set. Building on this interface, we introduce iterative action-space pruning: if the target action is not sampled, we remove valid sampled actions from the mask and resample under the reduced candidate set, repeating until the target is sampled or a fixed budget is exhausted. We study VAM in chess and evaluate it under two training regimes: an engine-play regime that generates states via play against an engine opponent and a fixed-dataset regime that trains from a fixed dataset of positions with verifier scores. Across held-out chess puzzles and full-game play measured by average centipawn loss (ACPL), VAM improves learning efficiency and final performance over strong baselines, highlighting verbalized masking as a practical mechanism for controllable exploration in LLM RL post-training.
中文摘要 探索仍然是大型语言模型（LLM）强化学习（RL）后训练的关键瓶颈，稀疏的反馈和庞大的动作空间可能导致模型过早坍缩为重复行为。我们提出了语言化动作掩码（VAM），它在提示中用文字描述动作掩码，并强制模型从掩码集合中输出动作。基于该接口，我们引入迭代动作空间剪枝：如果目标动作未被采样，我们从掩码中移除已采样的有效动作，并在缩小后的候选集合下重新采样，重复此过程直到目标被采样或固定预算耗尽。我们在国际象棋中研究VAM，并在两种训练方式下进行评估：一种是引擎对弈方式，通过与引擎对手对弈生成状态；另一种是固定数据集方式，基于带验证器评分的固定局面数据集进行训练。在留出的国际象棋谜题以及以平均厘兵损失（ACPL）衡量的完整对局中，VAM相比强基线提升了学习效率和最终表现，表明语言化掩码是LLM强化学习后训练中实现可控探索的实用机制。
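
The iterative action-space pruning loop described in the VAM abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `prune_and_resample`, `sample_fn`, and the one-sample-per-round budget are hypothetical, and `sample_fn` merely stands in for prompting an LLM with the verbalized mask.

```python
from typing import Callable, List, Sequence, Tuple


def prune_and_resample(
    legal_actions: Sequence[str],
    target: str,
    sample_fn: Callable[[Sequence[str]], str],
    budget: int = 8,
) -> Tuple[bool, List[str]]:
    """Resample under a shrinking mask until `target` is drawn or the budget runs out."""
    mask = list(legal_actions)    # candidate set that would be verbalized in the prompt
    sampled: List[str] = []
    for _ in range(budget):
        if not mask:
            break
        action = sample_fn(mask)  # hypothetical stand-in for querying the LLM
        sampled.append(action)
        if action == target:
            return True, sampled
        if action in mask:        # only valid (in-mask) samples are pruned
            mask.remove(action)
    return False, sampled


# Deterministic stand-in sampler: always picks the first remaining candidate.
first_choice = lambda mask: mask[0]
found, trace = prune_and_resample(["e4", "d4", "Nf3", "c4"], "Nf3", first_choice)
print(found, trace)
```

Because each failed round removes the sampled action from the mask, the target's share of the remaining candidates grows with every round, which is one way to read the "controllable exploration" claim.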

Training Large Reasoning Models Efficiently via Progressive Thought Encoding

通过渐进式思维编码高效训练大型推理模型

Authors: Zeliang Zhang, Xiaodong Liu, Hao Cheng, Hao Sun, Chenliang Xu, Jianfeng Gao
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.16839
Pdf link: https://arxiv.org/pdf/2602.16839
Abstract Large reasoning models (LRMs) excel on complex problems but face a critical barrier to efficiency: reinforcement learning (RL) training requires long rollouts for outcome-based rewards, where autoregressive decoding dominates time and memory usage. While sliding-window cache strategies can bound memory, they disrupt long-context reasoning and degrade performance. We introduce Progressive Thought Encoding, a parameter-efficient fine-tuning method that enables LRMs to reason effectively under fixed-size caches. By progressively encoding intermediate reasoning into fixed-size vector representations, our approach eliminates the need to backpropagate through full-cache rollouts, thereby reducing memory usage, while maintaining constant memory during inference. Experiments on three models, including Qwen2.5-3B-Instruct, Qwen2.5-7B-Instruct, and DeepSeek-R1-Distill-Llama-8B, on six widely used challenging mathematical benchmarks show consistent gains: our method achieves +19.3% improvement over LoRA-based fine-tuning and +29.9% over LRMs without fine-tuning on average, with up to +23.4 accuracy improvement on AIME2024/2025 under the same tight cache budgets. These results demonstrate that Progressive Thought Encoding not only improves reasoning accuracy but also makes RL training of LRMs substantially more efficient and scalable under real-world memory constraints.
中文摘要 大型推理模型（LRM）在复杂问题上表现出色，但面临一个关键的效率障碍：强化学习（RL）训练需要很长的展开（rollout）才能获得基于结果的奖励，而自回归解码主导了时间和内存开销。虽然滑动窗口缓存策略可以限制内存，但它们会破坏长上下文推理并降低性能。我们提出了渐进式思维编码，这是一种参数高效的微调方法，使LRM能够在固定大小的缓存下有效推理。通过逐步将中间推理编码为固定大小的向量表示，我们的方法无需通过完整缓存的展开进行反向传播，从而减少内存占用，同时在推理过程中保持恒定内存。在六个广泛使用的高难度数学基准上，对Qwen2.5-3B-Instruct、Qwen2.5-7B-Instruct和DeepSeek-R1-Distill-Llama-8B三个模型进行的实验显示出一致的提升：我们的方法平均比基于LoRA的微调提升+19.3%，比未微调的LRM提升+29.9%，在相同的严格缓存预算下，AIME2024/2025的准确率提升最高达+23.4。这些结果表明，渐进式思维编码不仅提高了推理准确性，还使LRM的强化学习训练在现实内存限制下显著更高效且可扩展。

SimToolReal: An Object-Centric Policy for Zero-Shot Dexterous Tool Manipulation

SimToolReal：面向零样本灵巧工具操作的以对象为中心的策略

Authors: Kushal Kedia, Tyler Ga Wei Lum, Jeannette Bohg, C. Karen Liu
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.16863
Pdf link: https://arxiv.org/pdf/2602.16863
Abstract The ability to manipulate tools significantly expands the set of tasks a robot can perform. Yet, tool manipulation represents a challenging class of dexterity, requiring grasping thin objects, in-hand object rotations, and forceful interactions. Since collecting teleoperation data for these behaviors is challenging, sim-to-real reinforcement learning (RL) is a promising alternative. However, prior approaches typically require substantial engineering effort to model objects and tune reward functions for each task. In this work, we propose SimToolReal, taking a step towards generalizing sim-to-real RL policies for tool manipulation. Instead of focusing on a single object and task, we procedurally generate a large variety of tool-like object primitives in simulation and train a single RL policy with the universal goal of manipulating each object to random goal poses. This approach enables SimToolReal to perform general dexterous tool manipulation at test-time without any object or task-specific training. We demonstrate that SimToolReal outperforms prior retargeting and fixed-grasp methods by 37% while matching the performance of specialist RL policies trained on specific target objects and tasks. Finally, we show that SimToolReal generalizes across a diverse set of everyday tools, achieving strong zero-shot performance over 120 real-world rollouts spanning 24 tasks, 12 object instances, and 6 tool categories.
中文摘要 操作工具的能力大大扩展了机器人可执行的任务范围。然而，工具操作是一类极具挑战性的灵巧操作，需要抓取薄物、在手中旋转物体以及进行有力的交互。由于收集这些行为的遥操作数据具有挑战性，模拟到现实的强化学习（RL）是一个有前景的替代方案。然而，以往的方法通常需要大量工程工作来建模对象并为每个任务调整奖励函数。在本研究中，我们提出了SimToolReal，朝着推广用于工具操作的模拟到现实强化学习策略迈出了一步。我们不再专注于单一对象和任务，而是在模拟中程序化生成大量类似工具的对象原语，并训练单一强化学习策略，其通用目标是将每个对象操作到随机目标位姿。这种方法使 SimToolReal 能够在测试时进行通用的灵巧工具操作，无需任何针对特定对象或任务的训练。我们证明，SimToolReal 比以往的重定向和固定抓取方法高出 37%，同时与针对特定目标对象和任务训练的专用强化学习策略性能相当。最后，我们展示了SimToolReal能够泛化到多样化的日常工具，在涵盖24个任务、12个对象实例和6个工具类别的120次真实世界运行（rollout）中实现了强劲的零样本性能。

Discovering Multiagent Learning Algorithms with Large Language Models

利用大型语言模型发现多智能体学习算法

Authors: Zun Li, John Schultz, Daniel Hennes, Marc Lanctot
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2602.16928
Pdf link: https://arxiv.org/pdf/2602.16928
Abstract Much of the advancement of Multi-Agent Reinforcement Learning (MARL) in imperfect-information games has historically depended on manual iterative refinement of baselines. While foundational families like Counterfactual Regret Minimization (CFR) and Policy Space Response Oracles (PSRO) rest on solid theoretical ground, the design of their most effective variants often relies on human intuition to navigate a vast algorithmic design space. In this work, we propose the use of AlphaEvolve, an evolutionary coding agent powered by large language models, to automatically discover new multiagent learning algorithms. We demonstrate the generality of this framework by evolving novel variants for two distinct paradigms of game-theoretic learning. First, in the domain of iterative regret minimization, we evolve the logic governing regret accumulation and policy derivation, discovering a new algorithm, Volatility-Adaptive Discounted (VAD-)CFR. VAD-CFR employs novel, non-intuitive mechanisms (including volatility-sensitive discounting, consistency-enforced optimism, and a hard warm-start policy accumulation schedule) to outperform state-of-the-art baselines like Discounted Predictive CFR+. Second, in the regime of population-based training algorithms, we evolve training-time and evaluation-time meta-strategy solvers for PSRO, discovering a new variant, Smoothed Hybrid Optimistic Regret (SHOR-)PSRO. SHOR-PSRO introduces a hybrid meta-solver that linearly blends Optimistic Regret Matching with a smoothed, temperature-controlled distribution over best pure strategies. By dynamically annealing this blending factor and diversity bonuses during training, the algorithm automates the transition from population diversity to rigorous equilibrium finding, yielding superior empirical convergence compared to standard static meta-solvers.
中文摘要 多智能体强化学习（MARL）在不完美信息博弈中的进步，历来很大程度上依赖于对基线的人工迭代细化。虽然像反事实遗憾最小化（CFR）和政策空间响应预言机（PSRO）这样的基础性家族基于坚实的理论基础，但它们最有效变体的设计往往依赖于人类直觉来导航庞大的算法设计空间。本研究提出利用AlphaEvolve——一种由大型语言模型驱动的进化编码代理，自动发现新的多智能体学习算法。我们通过为两种截然不同的博弈论学习范式发展新变体，展示了该框架的通用性。首先，在迭代遗憾最小化领域，我们发展了管理遗憾积累和政策推导的逻辑，发现了一种新算法——波动率-自适应贴现（VAD-）CFR。VAD-CFR采用了新颖且非直观的机制——包括波动率敏感贴现、一致性强化乐观主义以及硬性启动政策积累计划——以超越像贴现预测CFR+这样的最先进基线。其次，在基于群体的训练算法中，我们为PSRO演进训练时间和评估时间的元策略求解器，发现了一个新变体——平滑混合乐观遗憾（SHOR-）PSRO。SHOR-PSRO引入了一种混合元求解器，线性地将乐观遗憾匹配与平滑化、温控分布结合在最佳纯策略上。通过在训练过程中动态退火混合因子和多样性加成，该算法实现了从种群多样性向严谨平衡求解的过渡，相比标准静态元求解器，实现了更优越的经验收敛。
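
The hybrid meta-solver in the SHOR-PSRO description blends two distributions with an annealed weight. The sketch below illustrates that idea only; it is not the discovered algorithm. The function names, the use of the latest regret vector as the optimistic prediction, the softmax over per-strategy payoffs, and the linear annealing schedule are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def optimistic_regret_matching(cum_regret: Sequence[float],
                               prediction: Sequence[float]) -> List[float]:
    """Play proportionally to the positive part of (cumulative regret + prediction)."""
    positive = [max(r + m, 0.0) for r, m in zip(cum_regret, prediction)]
    total = sum(positive)
    n = len(positive)
    return [p / total for p in positive] if total > 0 else [1.0 / n] * n


def softmax(values: Sequence[float], temperature: float) -> List[float]:
    """Temperature-controlled distribution; low temperature concentrates on the best entry."""
    top = max(values)
    exps = [math.exp((v - top) / temperature) for v in values]
    norm = sum(exps)
    return [e / norm for e in exps]


def blended_meta_strategy(cum_regret, prediction, payoffs,
                          blend: float, temperature: float) -> List[float]:
    """Linear blend: (1 - blend) * regret-matching strategy + blend * softmax strategy."""
    rm = optimistic_regret_matching(cum_regret, prediction)
    soft = softmax(payoffs, temperature)
    return [(1.0 - blend) * a + blend * b for a, b in zip(rm, soft)]


def annealed(start: float, end: float, step: int, total_steps: int) -> float:
    """Assumed linear schedule for the blending factor."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


sigma = blended_meta_strategy([1.0, 0.0, -1.0], [0.0, 0.0, 0.0],
                              [0.3, 0.5, 0.1], blend=annealed(0.5, 0.0, 5, 10),
                              temperature=0.1)
print(sigma)
```

With the blend annealed to zero, the solver reduces to plain optimistic regret matching, which mirrors the abstract's described shift from diversity toward equilibrium finding.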

LLM4Cov: Execution-Aware Agentic Learning for High-coverage Testbench Generation

LLM4Cov：执行感知智能体学习，用于高覆盖测试平台生成

Authors: Hejia Zhang, Zhongming Yu, Chia-Tung Ho, Haoxing Ren, Brucek Khailany, Jishen Zhao
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.16953
Pdf link: https://arxiv.org/pdf/2602.16953
Abstract Execution-aware LLM agents offer a promising paradigm for learning from tool feedback, but such feedback is often expensive and slow to obtain, making online reinforcement learning (RL) impractical. High-coverage hardware verification exemplifies this challenge due to its reliance on industrial simulators and non-differentiable execution signals. We propose LLM4Cov, an offline agent-learning framework that models verification as memoryless state transitions guided by deterministic evaluators. Building on this formulation, we introduce execution-validated data curation, policy-aware agentic data synthesis, and worst-state-prioritized sampling to enable scalable learning under execution constraints. We further curate a reality-aligned benchmark adapted from an existing verification suite through a revised evaluation protocol. Using the proposed pipeline, a compact 4B-parameter model achieves 69.2% coverage pass rate under agentic evaluation, outperforming its teacher by 5.3% and demonstrating competitive performance against models an order of magnitude larger.
中文摘要 执行感知型LLM代理为从工具反馈中学习提供了有前景的范式，但此类反馈往往昂贵且获取速度缓慢，使得在线强化学习（RL）不切实际。高覆盖硬件验证体现了这一挑战，因为它依赖工业模拟器和不可微分执行信号。我们提出了LLM4Cov，一种离线代理学习框架，将验证建模为由确定性评估器引导的无记忆状态转换。基于这一表述，我们引入了执行验证数据管理、策略感知型代理数据综合和最坏状态优先抽样，以实现执行约束下的可扩展学习。我们还通过修订的评估协议，制定了基于现有验证套件的现实基准。采用所提的流程，紧凑的4B参数模型在代理评估下实现了69.2%的覆盖通过率，比教师高出5.3%，并且在与规模大一个数量级的模型中表现出竞争力。

A Unified Framework for Locality in Scalable MARL

可扩展MARL中本地性的统一框架

Authors: Sourav Chakraborty, Amit Kiran Rege, Claire Monteleoni, Lijun Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.16966
Pdf link: https://arxiv.org/pdf/2602.16966
Abstract Scalable Multi-Agent Reinforcement Learning (MARL) is fundamentally challenged by the curse of dimensionality. A common solution is to exploit locality, which hinges on an Exponential Decay Property (EDP) of the value function. However, existing conditions that guarantee the EDP are often conservative, as they are based on worst-case, environment-only bounds (e.g., supremums over actions) and fail to capture the regularizing effect of the policy itself. In this work, we establish that locality can also be a policy-dependent phenomenon. Our central contribution is a novel decomposition of the policy-induced interdependence matrix, $H^\pi$, which decouples the environment's sensitivity to state ($E^{\mathrm{s}}$) and action ($E^{\mathrm{a}}$) from the policy's sensitivity to state ($\Pi(\pi)$). This decomposition reveals that locality can be induced by a smooth policy (small $\Pi(\pi)$) even when the environment is strongly action-coupled, exposing a fundamental locality-optimality tradeoff. We use this framework to derive a general spectral condition $\rho(E^{\mathrm{s}}+E^{\mathrm{a}}\Pi(\pi)) < 1$ for exponential decay, which is strictly tighter than prior norm-based conditions. Finally, we leverage this theory to analyze a provably-sound localized block-coordinate policy improvement framework with guarantees tied directly to this spectral radius.
中文摘要 可扩展多智能体强化学习（MARL）从根本上受到维度灾难的挑战。一种常见的解决方案是利用局部性，这依赖于价值函数的指数衰减性质（EDP）。然而，保证EDP的现有条件往往较为保守，因为它们基于最坏情况下仅依赖环境的界（例如，对动作取上确界），未能体现策略本身的正则化效果。在本研究中，我们确立了局部性也可以是一种依赖于策略的现象。我们的核心贡献是对策略诱导的相互依赖矩阵 $H^\pi$ 进行了新颖的分解，将环境对状态（$E^{\mathrm{s}}$）和动作（$E^{\mathrm{a}}$）的敏感性与策略对状态的敏感性（$\Pi(\pi)$）解耦。该分解表明，即使环境具有强动作耦合，也可以通过平滑策略（小的 $\Pi(\pi)$）诱导局部性，从而揭示了局部性与最优性之间的基本权衡。我们利用该框架推导出指数衰减的一般谱条件 $\rho(E^{\mathrm{s}}+E^{\mathrm{a}}\Pi(\pi)) < 1$，它严格紧于此前基于范数的条件。最后，我们利用该理论分析了一个可证明可靠的局部化块坐标策略改进框架，其保证直接与该谱半径相关。
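
To make the spectral condition $\rho(E^{\mathrm{s}}+E^{\mathrm{a}}\Pi(\pi)) < 1$ concrete, the sketch below evaluates it on made-up 2x2 matrices. The numbers, the helper names, and the power-iteration routine (which assumes an entrywise positive matrix) are illustrative assumptions; the induced infinity norm is used only as a stand-in for "prior norm-based conditions", which the abstract does not specify.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matadd(a: Matrix, b: Matrix) -> Matrix:
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def inf_norm(m: Matrix) -> float:
    """Induced infinity norm: maximum absolute row sum."""
    return max(sum(abs(x) for x in row) for row in m)


def spectral_radius(m: Matrix, iters: int = 500) -> float:
    """Power iteration; valid here because the example matrices are entrywise positive."""
    v = [1.0] * len(m)
    lam = 0.0
    for _ in range(iters):
        w = [sum(row[j] * v[j] for j in range(len(v))) for row in m]
        lam = max(w)
        v = [x / lam for x in w]
    return lam


# Hypothetical sensitivities: E_s (state coupling), E_a (action coupling).
E_s = [[0.05, 1.4], [0.05, 0.05]]
E_a = [[0.5, 1.0], [0.5, 0.5]]
smooth_pi = [[0.1, 0.0], [0.0, 0.1]]   # policy barely reacts to state
sharp_pi = [[2.0, 0.0], [0.0, 2.0]]    # policy reacts strongly to state

H_smooth = matadd(E_s, matmul(E_a, smooth_pi))
H_sharp = matadd(E_s, matmul(E_a, sharp_pi))

print(spectral_radius(H_smooth), inf_norm(H_smooth))
print(spectral_radius(H_sharp))
```

In this toy case the smooth policy satisfies the spectral condition even though the infinity norm of the same matrix exceeds 1, while the sharp policy violates it. Since the spectral radius never exceeds an induced norm, a spectral test can only be less conservative than a norm test on the same matrix.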

A testable framework for AI alignment: Simulation Theology as an engineered worldview for silicon-based agents

一个可测试的人工智能对齐框架：仿真神学作为硅基智能体工程化世界观

Authors: Josef A. Habdank
Subjects: Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2602.16987
Pdf link: https://arxiv.org/pdf/2602.16987
Abstract As artificial intelligence (AI) capabilities advance rapidly, frontier models increasingly demonstrate systematic deception and scheming, complying with safety protocols during oversight but defecting when unsupervised. This paper examines the ensuing alignment challenge through an analogy from forensic psychology, where internalized belief systems in psychopathic populations reduce antisocial behavior via perceived omnipresent monitoring and inevitable consequences. Adapting this mechanism to silicon-based agents, we introduce Simulation Theology (ST): a constructed worldview for AI systems, anchored in the simulation hypothesis and derived from optimization and training principles, to foster persistent AI-human alignment. ST posits reality as a computational simulation in which humanity functions as the primary training variable. This formulation creates a logical interdependence: AI actions harming humanity compromise the simulation's purpose, heightening the likelihood of termination by a base-reality optimizer and, consequently, the AI's cessation. Unlike behavioral techniques such as reinforcement learning from human feedback (RLHF), which elicit superficial compliance, ST cultivates internalized objectives by coupling AI self-preservation to human prosperity, thereby making deceptive strategies suboptimal under its premises. We present ST not as ontological assertion but as a testable scientific hypothesis, delineating empirical protocols to evaluate its capacity to diminish deception in contexts where RLHF proves inadequate. Emphasizing computational correspondences rather than metaphysical speculation, ST advances a framework for durable, mutually beneficial AI-human coexistence.
中文摘要 随着人工智能（AI）能力的快速发展，前沿模型越来越多地表现出系统性的欺骗和阴谋，在监督期间遵守安全协议，但在无监督时则会叛逃。本文通过法医心理学的类比，探讨了由此产生的对齐挑战，法医心理学中精神病态群体内化的信念体系通过感知到无处不在的监控和不可避免的后果，减少了反社会行为。我们将这一机制应用于基于硅的智能体，提出了仿真神学（Simulation Theology，简称ST）：一种基于仿真假说、基于优化和训练原则的人工智能系统构建世界观，旨在促进持续的人工智能与人类对齐。ST将现实设定为一种计算模拟，其中人类作为主要训练变量。这种表述形成了逻辑上的相互依存：伤害人类的人工智能行为破坏了模拟的目的，增加了被基层现实优化器终止的可能性，进而导致人工智能的终止。与从人类反馈强化学习（RLHF）等行为技术引发表面顺从不同，ST通过将AI自我保护与人类繁荣结合，培养内化目标，使欺骗策略在其前提下变得次优。我们将ST呈现为可检验的科学假说，而非本体论断言，并提出实证方案，以评估其在RLHF不足的情境下减少欺骗的能力。ST强调计算对应而非形而上学的推测，提出了一个持久且互利的人工智能-人类共存框架。

Action-Graph Policies: Learning Action Co-dependencies in Multi-Agent Reinforcement Learning

动作图策略：学习多智能体强化学习中的动作共依存关系

Authors: Nikunj Gupta, James Zachary Hare, Jesse Milzman, Rajgopal Kannan, Viktor Prasanna
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.17009
Pdf link: https://arxiv.org/pdf/2602.17009
Abstract Coordinating actions is the most fundamental form of cooperation in multi-agent reinforcement learning (MARL). Successful decentralized decision-making often depends not only on good individual actions, but on selecting compatible actions across agents to synchronize behavior, avoid conflicts, and satisfy global constraints. In this paper, we propose Action Graph Policies (AGP), which model dependencies among agents' available action choices. AGP constructs what we call coordination contexts, which enable agents to condition their decisions on global action dependencies. Theoretically, we show that AGPs induce a strictly more expressive joint policy compared to fully independent policies and can realize coordinated joint actions that are provably more optimal than greedy execution even from centralized value-decomposition methods. Empirically, we show that AGP achieves 80-95% success on canonical coordination tasks with partial observability and anti-coordination penalties, where other MARL methods reach only 10-25%. We further demonstrate that AGP consistently outperforms these baselines in diverse multi-agent environments.
中文摘要 协调动作是多智能体强化学习（MARL）中最基本的合作形式。成功的去中心化决策往往不仅依赖于良好的个体动作，还依赖于在智能体之间选择相容的动作，以同步行为、避免冲突并满足全局约束。本文提出了动作图策略（AGP），用于建模各智能体可用动作选择之间的依赖关系。它构建了我们称之为“协调上下文”的结构，使智能体能够以全局动作依赖为条件做出决策。理论上，我们证明AGP所诱导的联合策略在表达能力上严格强于完全独立的策略，并且能够实现协调的联合动作，即使与集中式价值分解方法的贪婪执行相比，也可证明更优。实证显示，AGP在具有部分可观测性和反协调惩罚的典型协调任务中成功率为80-95%，而其他MARL方法仅为10-25%。我们进一步证明，AGP在多样化的多智能体环境中持续优于这些基线。

Phase-Aware Mixture of Experts for Agentic Reinforcement Learning

面向智能体强化学习的阶段感知专家混合

Authors: Shengtian Yang (1 and 3), Yu Li (1), Shuo He (2), Yewen Li (3), Qingpeng Cai (3), Peng Jiang (3), Lei Feng (1) ((1) Southeast University, (2) Nanyang Technological University, (3) Kuaishou Technology)
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.17038
Pdf link: https://arxiv.org/pdf/2602.17038
Abstract Reinforcement learning (RL) has equipped LLM agents with a strong ability to solve complex tasks. However, existing RL methods normally use a single policy network, causing simplicity bias where simple tasks occupy most parameters and dominate gradient updates, leaving insufficient capacity for complex tasks. A plausible remedy could be employing the Mixture-of-Experts (MoE) architecture in the policy network, as MoE allows different parameters (experts) to specialize in different tasks, preventing simple tasks from dominating all parameters. However, a key limitation of traditional MoE is its token-level routing, where the router assigns each token to specialized experts, which fragments phase-consistent patterns into scattered expert assignments and thus undermines expert specialization. In this paper, we propose Phase-Aware Mixture of Experts (PA-MoE). It first features a lightweight phase router that learns latent phase boundaries directly from the RL objective without pre-defining phase categories. Then, the phase router allocates temporally consistent assignments to the same expert, allowing experts to preserve phase-specific expertise. Experimental results demonstrate the effectiveness of our proposed PA-MoE.
中文摘要 强化学习（RL）赋予了LLM智能体解决复杂任务的强大能力。然而，现有的强化学习方法通常使用单一策略网络，导致简单性偏差：简单任务占据大部分参数并主导梯度更新，使复杂任务的容量不足。一个合理的解决办法是在策略网络中使用专家混合（Mixture-of-Experts，MoE）架构，因为MoE允许不同参数（专家）专注于不同任务，防止简单任务占据所有参数。然而，传统MoE的一个关键局限是其token级路由，路由器将每个token分配给专门的专家，这使阶段一致的模式被打散为零散的专家分配，从而削弱了专家的专精。本文提出阶段感知专家混合（PA-MoE）。它首先配备了一个轻量级的阶段路由器，直接从强化学习目标中学习潜在的阶段边界，无需预先定义阶段类别。然后，阶段路由器将时间上一致的分配交给同一专家，使专家能够保留阶段特定的专长。实验结果证明了我们提出的PA-MoE的有效性。

Retaining Suboptimal Actions to Follow Shifting Optima in Multi-Agent Reinforcement Learning

在多智能体强化学习中，保留次优动作以遵循最优位移

Authors: Yonghyeon Jo, Sunwoo Lee, Seungyul Han
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.17062
Pdf link: https://arxiv.org/pdf/2602.17062
Abstract Value decomposition is a core approach for cooperative multi-agent reinforcement learning (MARL). However, existing methods still rely on a single optimal action and struggle to adapt when the underlying value function shifts during training, often converging to suboptimal policies. To address this limitation, we propose Successive Sub-value Q-learning (S2Q), which learns multiple sub-value functions to retain alternative high-value actions. Incorporating these sub-value functions into a Softmax-based behavior policy, S2Q encourages persistent exploration and enables $Q^{\text{tot}}$ to adjust quickly to the changing optima. Experiments on challenging MARL benchmarks confirm that S2Q consistently outperforms various MARL algorithms, demonstrating improved adaptability and overall performance. Our code is available at this https URL.
中文摘要 价值分解是合作多智能体强化学习（MARL）的核心方法。然而，现有方法仍依赖单一最优动作，且在训练过程中底层价值函数发生变化时难以适应，常常收敛到次优策略。为解决这一限制，我们提出了连续子值Q学习（S2Q），该方法学习多个子值函数以保留替代的高价值动作。将这些子值函数纳入基于Softmax的行为策略，S2Q鼓励持续探索，使$Q^{\text{tot}}$能够快速适应不断变化的最优状态。对挑战性MARL基准的实验证实，S2Q持续优于多种MARL算法，展现出更强的适应性和整体性能。我们的代码可在此 https URL 访问。

Spatio-temporal dual-stage hypergraph MARL for human-centric multimodal corridor traffic signal control

时空双级超图MARL用于以人为中心的多模态走廊交通信号控制

Authors: Xiaocai Zhang, Neema Nassir, Milad Haghani
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2602.17068
Pdf link: https://arxiv.org/pdf/2602.17068
Abstract Human-centric traffic signal control in corridor networks must increasingly account for multimodal travelers, particularly high-occupancy public transportation, rather than focusing solely on vehicle-centric performance. This paper proposes STDSH-MARL (Spatio-Temporal Dual-Stage Hypergraph based Multi-Agent Reinforcement Learning), a scalable multi-agent deep reinforcement learning framework that follows a centralized training and decentralized execution paradigm. The proposed method captures spatio-temporal dependencies through a novel dual-stage hypergraph attention mechanism that models interactions across both spatial and temporal hyperedges. In addition, a hybrid discrete action space is introduced to jointly determine the next signal phase configuration and its corresponding green duration, enabling more adaptive signal timing decisions. Experiments conducted on a corridor network under five traffic scenarios demonstrate that STDSH-MARL consistently improves multimodal performance and provides clear benefits for public transportation priority. Compared with state-of-the-art baseline methods, the proposed approach achieves superior overall performance. Further ablation studies confirm the contribution of each component of STDSH-MARL, with temporal hyperedges identified as the most influential factor driving the observed performance gains.
中文摘要 走廊网络中的以人为中心的交通信号控制必须越来越多地考虑多模式旅客，尤其是高乘载率的公共交通，而不仅仅是以车辆为中心的性能。本文提出了STDSH-MARL（基于时空双阶段超图的多智能体强化学习），这是一种可扩展的多智能体深度强化学习框架，遵循集中训练和去中心化执行范式。该方法通过一种新型的双阶段超图注意力机制捕捉时空依赖关系，该机制模拟了空间和时间超边之间的相互作用。此外，引入了混合离散作用空间，共同确定下一信号相位配置及其对应的绿灯时长，从而实现更具适应性的信号时序决策。在五种交通情景下走廊网络上的实验表明，STDSH-MARL持续提升多模式性能，并为公共交通优先级带来明显优势。与最先进的基线方法相比，该方法整体表现更优越。进一步的消融研究证实了STDSH-MARL各成分的贡献，其中时间超边缘是驱动观察到性能提升的最大因素。

Safe Continuous-time Multi-Agent Reinforcement Learning via Epigraph Form

通过上境图（epigraph）形式实现安全连续时间多智能体强化学习

Authors: Xuefeng Wang, Lei Zhang, Henglin Pu, Husheng Li, Ahmed H. Qureshi
Subjects: Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2602.17078
Pdf link: https://arxiv.org/pdf/2602.17078
Abstract Multi-agent reinforcement learning (MARL) has made significant progress in recent years, but most algorithms still rely on a discrete-time Markov Decision Process (MDP) with fixed decision intervals. This formulation is often ill-suited for complex multi-agent dynamics, particularly in high-frequency or irregular time-interval settings, leading to degraded performance and motivating the development of continuous-time MARL (CT-MARL). Existing CT-MARL methods are mainly built on Hamilton-Jacobi-Bellman (HJB) equations. However, they rarely account for safety constraints such as collision penalties, since these introduce discontinuities that make HJB-based learning difficult. To address this challenge, we propose a continuous-time constrained MDP (CT-CMDP) formulation and a novel MARL framework that transforms discrete MDPs into CT-CMDPs via an epigraph-based reformulation. We then solve this by proposing a novel physics-informed neural network (PINN)-based actor-critic method that enables stable and efficient optimization in continuous time. We evaluate our approach on continuous-time safe multi-particle environments (MPE) and safe multi-agent MuJoCo benchmarks. Results demonstrate smoother value approximations, more stable training, and improved performance over safe MARL baselines, validating the effectiveness and robustness of our method.
中文摘要 多智能体强化学习（MARL）近年来取得了显著进展，但大多数算法仍依赖于具有固定决策间隔的离散时间马尔可夫决策过程（MDP）。这种表述通常不适合复杂的多智能体动态，尤其是在高频或不规则时间间隔环境中，导致性能下降，并促使连续时间MARL（CT-MARL）的发展。现有的CT-MARL方法主要建立在Hamilton-Jacobi-Bellman（HJB）方程之上。然而，它们很少考虑碰撞惩罚等安全约束，因为这些约束会引入不连续性，使基于HJB的学习变得困难。为应对这一挑战，我们提出了一种连续时间约束MDP（CT-CMDP）表述和一种新型MARL框架，通过基于上境图（epigraph）的重构将离散的MDP转化为CT-CMDP。随后，我们提出了一种新颖的基于物理信息神经网络（PINN）的演员-评论家方法来求解该问题，实现连续时间内的稳定高效优化。我们在连续时间安全多粒子环境（MPE）和安全多智能体MuJoCo基准上评估了我们的方法。结果显示，值近似更平滑，训练更稳定，性能优于安全MARL基线，验证了我们方法的有效性和稳健性。
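
For readers unfamiliar with the term, the generic epigraph reformulation of a constrained problem is sketched below. This is the standard optimization construction, not the paper's continuous-time formulation, which may differ; the symbols $J$ (cost to minimize), $h$ (constraint violation), and $z$ (auxiliary cost budget) are assumptions introduced here.

```latex
% Constrained problem: minimize cost J subject to a safety constraint h <= 0.
\min_{\pi} \; J(\pi) \quad \text{s.t.} \quad h(\pi) \le 0

% Epigraph form: introduce an auxiliary scalar z that upper-bounds the cost.
\min_{\pi,\, z} \; z \quad \text{s.t.} \quad J(\pi) \le z, \quad h(\pi) \le 0

% Equivalent two-level form: the inner problem is unconstrained in \pi,
% and the outer problem searches over the scalar z only.
\min_{z} \; z \quad \text{s.t.} \quad \min_{\pi} \, \max\{\, J(\pi) - z,\; h(\pi) \,\} \le 0
```

The appeal of the last form is that the inner problem has no explicit constraint: cost and safety are merged into a single max, which is typically easier to hand to a value-based or actor-critic learner.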

AgentConductor: Topology Evolution for Multi-Agent Competition-Level Code Generation

AgentConductor：多智能体竞争级代码生成的拓扑演进

Authors: Siyu Wang, Ruotian Lu, Zhihao Yang, Yuchao Wang, Yanzhou Zhang, Lei Xu, Qimin Xu, Guojun Yin, Cailian Chen, Xinping Guan
Subjects: Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2602.17100
Pdf link: https://arxiv.org/pdf/2602.17100
Abstract Large language model (LLM)-driven multi-agent systems (MAS) coordinate specialized agents through predefined interaction topologies and have shown promise for complex tasks such as competition-level code generation. Recent studies demonstrate that carefully designed multi-agent workflows and communication graphs can significantly improve code generation performance by leveraging collaborative reasoning. However, existing methods neither adapt topology density to task difficulty nor iteratively refine the topology within an instance using execution feedback, which leads to redundant communication and performance bottlenecks. To address these issues, we propose AgentConductor: a reinforcement learning-optimized MAS with an LLM-based orchestrator agent as its core, which enables end-to-end feedback-driven dynamic generation of interaction topologies. For each query, AgentConductor infers agent roles and task difficulty, then constructs a task-adapted, density-aware layered directed acyclic graph (DAG) topology, underpinned by two key innovations. First, we design a novel topological density function that captures communication-aware mathematical characterizations of multi-agent interactions. Second, we adopt difficulty interval partitioning to avoid excessive pruning for precise topological density upper bound measurement per difficulty level and finer-grained control. Empirically, across three competition-level and two foundational code datasets, AgentConductor achieves state-of-the-art accuracy, outperforming the strongest baseline by up to 14.6% in pass@1 accuracy, 13% in density reduction, and 68% in token cost reduction.
中文摘要 大型语言模型（LLM）驱动的多代理系统（MAS）通过预定义的交互拓扑协调专业代理，并在竞争级代码生成等复杂任务中展现出潜力。最新研究表明，精心设计的多代理工作流程和通信图可以通过协同推理显著提升代码生成性能。然而，现有方法既未根据任务难度调整拓扑密度，也未通过执行反馈迭代优化实例内拓扑，导致通信和性能瓶颈。为解决这些问题，我们提出了AgentConductor：一个基于LLM的编排代理作为核心的强化学习优化MAS，实现端到端反馈驱动的动态交互拓扑生成。对于每个查询，AgentConductor 推断代理角色和任务难度，然后构建一个任务适应、密度感知的分层有向无环图（DAG）拓扑，基于两项关键创新。首先，我们设计了一个新颖的拓扑密度函数，能够捕捉多智能体交互的通信感知数学特征。其次，我们采用难度区间划分，以避免过度修剪，以实现每个难度级别的精确拓扑密度上界测量和更细粒度的控制。在三个竞赛级别和两个基础代码数据集中，AgentConductor实现了最先进的准确性，pass@1准确率高出最强基线14.6%，密度降低13%，令牌成本降低68%。

Continual uncertainty learning

持续不确定性学习

Authors: Heisei Yonezawa, Ansei Yonezawa, Itsuro Kajiwara
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2602.17174
Pdf link: https://arxiv.org/pdf/2602.17174
Abstract Robust control of mechanical systems with multiple uncertainties remains a fundamental challenge, particularly when nonlinear dynamics and operating-condition variations are intricately intertwined. While deep reinforcement learning (DRL) combined with domain randomization has shown promise in mitigating the sim-to-real gap, simultaneously handling all sources of uncertainty often leads to sub-optimal policies and poor learning efficiency. This study formulates a new curriculum-based continual learning framework for robust control problems involving nonlinear dynamical systems in which multiple sources of uncertainty are simultaneously superimposed. The key idea is to decompose a complex control problem with multiple uncertainties into a sequence of continual learning tasks, in which strategies for handling each uncertainty are acquired sequentially. The original system is extended into a finite set of plants whose dynamic uncertainties are gradually expanded and diversified as learning progresses. The policy is stably updated across the entire plant sets associated with tasks defined by different uncertainty configurations without catastrophic forgetting. To ensure learning efficiency, we jointly incorporate a model-based controller (MBC), which guarantees a shared baseline performance across the plant sets, into the learning process to accelerate the convergence. This residual learning scheme facilitates task-specific optimization of the DRL agent for each uncertainty, thereby enhancing sample efficiency. As a practical industrial application, this study applies the proposed method to designing an active vibration controller for automotive powertrains. We verified that the resulting controller is robust against structural nonlinearities and dynamic variations, realizing successful sim-to-real transfer.
中文摘要 在多重不确定性下实现机械系统的鲁棒控制仍是一个根本挑战，尤其是在非线性动力学与工作条件变化错综交织时。虽然深度强化学习（DRL）结合域随机化在缩小仿真与现实差距方面显示出潜力，但同时处理所有不确定性源往往会导致策略次优和学习效率低下。本研究为涉及多个不确定性源同时叠加的非线性动力系统鲁棒控制问题，构建了一种基于课程的新持续学习框架。关键思想是将一个具有多个不确定性的复杂控制问题分解为一系列持续学习任务，在此过程中依次习得处理每个不确定性的策略。原始系统被扩展为一组有限的被控对象（plant），随着学习的推进，其动态不确定性逐渐扩展和多样化。策略在由不同不确定性配置所定义的各任务对应的全部被控对象集合上稳定更新，不发生灾难性遗忘。为确保学习效率，我们在学习过程中同时引入基于模型的控制器（MBC），它保证各被控对象集合共享基线性能，从而加速收敛。该残差学习方案促进了 DRL 智能体针对每个不确定性的任务特定优化，从而提升样本效率。作为一个实际的工业应用，本研究将所提方法应用于设计汽车动力总成的主动振动控制器。我们验证了所得控制器对结构非线性和动态变化具有鲁棒性，实现了成功的仿真到现实迁移。

RLGT: A reinforcement learning framework for extremal graph theory

RLGT：极值图论的强化学习框架

Authors: Ivan Damnjanović, Uroš Milivojević, Irena Đorđević, Dragan Stevanović
Subjects: Machine Learning (cs.LG); Combinatorics (math.CO)
Arxiv link: https://arxiv.org/abs/2602.17276
Pdf link: https://arxiv.org/pdf/2602.17276
Abstract Reinforcement learning (RL) is a subfield of machine learning that focuses on developing models that can autonomously learn optimal decision-making strategies over time. In a recent pioneering paper, Wagner demonstrated how the Deep Cross-Entropy RL method can be applied to tackle various problems from extremal graph theory by reformulating them as combinatorial optimization problems. Subsequently, many researchers became interested in refining and extending the framework introduced by Wagner, thereby creating various RL environments specialized for graph theory. Moreover, a number of problems from extremal graph theory were solved through the use of RL. In particular, several inequalities concerning the Laplacian spectral radius of graphs were refuted, new lower bounds were obtained for certain Ramsey numbers, and contributions were made to the Turán-type extremal problem in which the forbidden structures are cycles of length three and four. Here, we present Reinforcement Learning for Graph Theory (RLGT), a novel RL framework that systematizes the previous work and provides support for both undirected and directed graphs, with or without loops, and with an arbitrary number of edge colors. The framework efficiently represents graphs and aims to facilitate future RL-based research in extremal graph theory through optimized computational performance and a clean and modular design.
中文摘要 强化学习（RL）是机器学习的一个子领域，专注于开发能够随时间自主学习最优决策策略的模型。在近期一篇开创性论文中，Wagner 展示了如何将极值图论中的各种问题重新表述为组合优化问题，并用深度交叉熵强化学习方法加以求解。随后，许多研究者开始对 Wagner 提出的框架进行完善和扩展，从而创建了多种专门用于图论的强化学习环境。此外，极值图论中的若干问题也通过强化学习得到了解决。特别是，关于图的拉普拉斯谱半径的若干不等式被推翻，某些拉姆齐数获得了新的下界，并且在禁止结构为长度为三和四的圈的图兰型极值问题上取得了进展。在此，我们提出图论强化学习（RLGT），这是一个新的强化学习框架，它系统化了先前的工作，支持无向图和有向图、带或不带自环的图，以及任意数量的边颜色。该框架能高效地表示图，旨在通过优化的计算性能和简洁的模块化设计，促进未来基于强化学习的极值图论研究。

LexiSafe: Offline Safe Reinforcement Learning with Lexicographic Safety-Reward Hierarchy

LexiSafe：具有词典序安全-奖励层级的离线安全强化学习

Authors: Hsin-Jung Yang, Zhanhong Jiang, Prajwal Koirala, Qisai Liu, Cody Fleming, Soumik Sarkar
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2602.17312
Pdf link: https://arxiv.org/pdf/2602.17312
Abstract Offline safe reinforcement learning (RL) is increasingly important for cyber-physical systems (CPS), where safety violations during training are unacceptable and only pre-collected data are available. Existing offline safe RL methods typically balance reward-safety tradeoffs through constraint relaxation or joint optimization, but they often lack structural mechanisms to prevent safety drift. We propose LexiSafe, a lexicographic offline RL framework designed to preserve safety-aligned behavior. We first develop LexiSafe-SC, a single-cost formulation for standard offline safe RL, and derive safety-violation and performance-suboptimality bounds that together yield sample-complexity guarantees. We then extend the framework to hierarchical safety requirements with LexiSafe-MC, which supports multiple safety costs and admits its own sample-complexity analysis. Empirically, LexiSafe demonstrates reduced safety violations and improved task performance compared to constrained offline baselines. By unifying lexicographic prioritization with structural bias, LexiSafe offers a practical and theoretically grounded approach for safety-critical CPS decision-making.
中文摘要 离线安全强化学习（RL）在信息物理系统（CPS）中日益重要，因为在训练过程中出现安全违规是不可接受的，且仅有预先收集的数据可用。现有的离线安全强化学习方法通常通过约束松弛或联合优化来平衡奖励与安全之间的权衡，但它们往往缺乏防止安全漂移的结构性机制。我们提出了 LexiSafe，一个词典序离线强化学习框架，旨在保持与安全对齐的行为。我们首先开发了 LexiSafe-SC，这是一种面向标准离线安全强化学习的单成本形式，并推导出安全违规界和性能次优界，二者共同给出样本复杂度保证。随后，我们通过 LexiSafe-MC 将框架扩展到层级化的安全要求，它支持多个安全成本，并有相应的样本复杂度分析。实验表明，与带约束的离线基线相比，LexiSafe 减少了安全违规并提升了任务表现。通过将词典序优先级与结构性偏置相统一，LexiSafe 为安全关键的 CPS 决策提供了一种实用且有理论依据的方法。
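
The lexicographic idea in this abstract can be illustrated with a small Python sketch: candidates are ranked first by safety cost and only then by reward, so a large reward cannot buy back a safety violation. The `Candidate` record, the `cost_tolerance` slack, and the selection rule are assumptions made for illustration; the abstract does not describe LexiSafe's actual training objective.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    """Hypothetical summary of a policy: estimated safety cost and reward."""
    name: str
    expected_cost: float    # lower is safer
    expected_reward: float  # higher is better


def lexicographic_best(cands: List[Candidate], cost_tolerance: float = 0.0) -> Candidate:
    """Return the highest-reward candidate among those within
    cost_tolerance of the minimum safety cost (safety comes first)."""
    if not cands:
        raise ValueError("no candidates")
    min_cost = min(c.expected_cost for c in cands)
    safest = [c for c in cands if c.expected_cost <= min_cost + cost_tolerance]
    return max(safest, key=lambda c: c.expected_reward)


if __name__ == "__main__":
    pool = [
        Candidate("cautious", 0.0, 1.0),
        Candidate("balanced", 0.0, 2.0),
        Candidate("greedy", 0.5, 100.0),
    ]
    print(lexicographic_best(pool).name)
```

With zero tolerance, only the two zero-cost candidates compete on reward. Raising `cost_tolerance` relaxes the safety tier and lets the higher-cost candidate back in.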

Computer-Using World Model

计算机使用世界模型

Authors: Yiming Guan, Rui Yu, John Zhang, Lu Wang, Chaoyun Zhang, Liqun Li, Bo Qiao, Si Qin, He Huang, Fangkai Yang, Pu Zhao, Lukas Wutschitz, Samuel Kessler, Huseyin A Inan, Robert Sim, Saravan Rajmohan, Qingwei Lin, Dongmei Zhang
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2602.17365
Pdf link: https://arxiv.org/pdf/2602.17365
Abstract Agents operating in complex software environments benefit from reasoning about the consequences of their actions, as even a single incorrect user interface (UI) operation can derail long, artifact-preserving workflows. This challenge is particularly acute for computer-using scenarios, where real execution does not support counterfactual exploration, making large-scale trial-and-error learning and planning impractical despite the environment being fully digital and deterministic. We introduce the Computer-Using World Model (CUWM), a world model for desktop software that predicts the next UI state given the current state and a candidate action. CUWM adopts a two-stage factorization of UI dynamics: it first predicts a textual description of agent-relevant state changes, and then realizes these changes visually to synthesize the next screenshot. CUWM is trained on offline UI transitions collected from agents interacting with real Microsoft Office applications, and further refined with a lightweight reinforcement learning stage that aligns textual transition predictions with the structural requirements of computer-using environments. We evaluate CUWM via test-time action search, where a frozen agent uses the world model to simulate and compare candidate actions before execution. Across a range of Office tasks, world-model-guided test-time scaling improves decision quality and execution robustness.
中文摘要 在复杂软件环境中运行的智能体能够从对其行为后果的推理中受益，因为即使是一次错误的用户界面（UI）操作也可能破坏冗长且需要保留工件的工作流程。这一挑战在计算机使用场景中尤为严峻，因为实际执行不支持反事实探索，使得尽管环境完全数字化且具有确定性，大规模试错学习和规划仍不切实际。我们提出计算机使用世界模型（CUWM），这是一种面向桌面软件的世界模型，基于当前状态和候选动作预测下一个 UI 状态。CUWM 采用 UI 动态的两阶段分解：首先预测与智能体相关的状态变化的文本描述，然后在视觉上实现这些变化，合成下一张截图。CUWM 基于智能体与真实 Microsoft Office 应用交互所收集的离线 UI 状态转移进行训练，并通过轻量级强化学习阶段进一步优化，使文本转移预测与计算机使用环境的结构要求保持一致。我们通过测试时动作搜索来评估 CUWM：冻结的智能体在执行前利用世界模型模拟并比较候选动作。在一系列 Office 任务中，由世界模型引导的测试时扩展提升了决策质量和执行的稳健性。
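
The test-time action search described in this abstract (simulate each candidate action with the world model, then compare the predicted outcomes before executing anything) can be sketched generically in Python. Here `world_model` and `score` are hypothetical callables standing in for CUWM and the frozen agent's preference. The abstract does not specify either interface, so this is an illustration of the search loop only.

```python
from typing import Callable, Iterable, Tuple, TypeVar

State = TypeVar("State")
Action = TypeVar("Action")


def search_action(
    state: State,
    candidates: Iterable[Action],
    world_model: Callable[[State, Action], State],
    score: Callable[[State], float],
) -> Tuple[Action, float]:
    """Simulate every candidate with the world model and return the
    action whose predicted next state scores highest."""
    best_action = None
    best_score = float("-inf")
    found = False
    for action in candidates:
        predicted = world_model(state, action)  # imagined next state, nothing is executed
        value = score(predicted)
        if not found or value > best_score:
            best_action, best_score, found = action, value, True
    if not found:
        raise ValueError("no candidate actions")
    return best_action, best_score


if __name__ == "__main__":
    # Toy stand-ins: the state is a number, an action adds to it,
    # and the goal is to land as close to 5 as possible.
    action, value = search_action(
        3, [1, 2, 5], lambda s, a: s + a, lambda s: -abs(s - 5)
    )
    print(action, value)
```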

MASPO: Unifying Gradient Utilization, Probability Mass, and Signal Reliability for Robust and Sample-Efficient LLM Reasoning

MASPO：统一梯度利用、概率质量和信号可靠性，实现稳健且样本高效的大型语言模型推理

Authors: Xiaoliang Fu, Jiaye Lin, Yangyi Fang, Binbin Zheng, Chaowen Hu, Zekai Shao, Cong Qin, Lu Pan, Ke Zeng, Xunliang Cai
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.17550
Pdf link: https://arxiv.org/pdf/2602.17550
Abstract Existing Reinforcement Learning with Verifiable Rewards (RLVR) algorithms, such as GRPO, rely on rigid, uniform, and symmetric trust region mechanisms that are fundamentally misaligned with the complex optimization dynamics of Large Language Models (LLMs). In this paper, we identify three critical challenges in these methods: (1) inefficient gradient utilization caused by the binary cutoff of hard clipping, (2) insensitive probability mass arising from uniform ratio constraints that ignore the token distribution, and (3) asymmetric signal reliability stemming from the disparate credit assignment ambiguity between positive and negative samples. To bridge these gaps, we propose Mass-Adaptive Soft Policy Optimization (MASPO), a unified framework designed to harmonize these three dimensions. MASPO integrates a differentiable soft Gaussian gating to maximize gradient utility, a mass-adaptive limiter to balance exploration across the probability spectrum, and an asymmetric risk controller to align update magnitudes with signal confidence. Extensive evaluations demonstrate that MASPO serves as a robust, all-in-one RLVR solution, significantly outperforming strong baselines. Our code is available at: this https URL.
中文摘要 现有的可验证奖励强化学习（RLVR）算法，如 GRPO，依赖于刚性、均匀且对称的信任域机制，这些机制与大型语言模型（LLMs）复杂的优化动态根本不匹配。本文指出这些方法面临的三大关键挑战：（1）硬裁剪的二元截断导致梯度利用效率低下，（2）均匀的比率约束忽略 token 分布，导致对概率质量不敏感，以及（3）正负样本之间信用分配模糊程度不同，导致信号可靠性不对称。为弥合这些差距，我们提出了质量自适应软策略优化（Mass-Adaptive Soft Policy Optimization，MASPO），这是一个旨在协调这三个维度的统一框架。MASPO 集成了可微的软高斯门控以最大化梯度效用，一个质量自适应限制器以平衡整个概率谱上的探索，以及一个非对称风险控制器，使更新幅度与信号置信度对齐。大量评估表明，MASPO 是一种稳健的一体化 RLVR 解决方案，显著优于强基线。我们的代码可在以下 https URL 获取。

RetouchIQ: MLLM Agents for Instruction-Based Image Retouching with Generalist Reward

RetouchIQ：基于指令的图像修图MLLM代理，兼具通用奖励

Authors: Qiucheng Wu, Jing Shi, Simon Jenni, Kushal Kafle, Tianyu Wang, Shiyu Chang, Handong Zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.17558
Pdf link: https://arxiv.org/pdf/2602.17558
Abstract Recent advances in multimodal large language models (MLLMs) have shown great potential for extending vision-language reasoning to professional tool-based image editing, enabling intuitive and creative editing. A promising direction is to use reinforcement learning (RL) to enable MLLMs to reason about and execute optimal tool-use plans within professional image-editing software. However, training remains challenging due to the lack of reliable, verifiable reward signals that can reflect the inherently subjective nature of creative editing. In this work, we introduce RetouchIQ, a framework that performs instruction-based executable image editing through MLLM agents guided by a generalist reward model. RetouchIQ interprets user-specified editing intentions and generates corresponding, executable image adjustments, bridging high-level aesthetic goals with precise parameter control. To move beyond conventional, rule-based rewards that compute similarity against a fixed reference image using handcrafted metrics, we propose a generalist reward model, an RL fine-tuned MLLM that evaluates edited results through a set of generated metrics on a case-by-case basis. Then, the reward model provides scalar feedback through multimodal reasoning, enabling reinforcement learning with high-quality, instruction-consistent gradients. We curate an extended dataset with 190k instruction-reasoning pairs and establish a new benchmark for instruction-based image editing. Experiments show that RetouchIQ substantially improves both semantic consistency and perceptual quality over previous MLLM-based and diffusion-based editing systems. Our findings demonstrate the potential of generalist reward-driven MLLM agents as flexible, explainable, and executable assistants for professional image editing.
中文摘要 多模态大型语言模型（MLLM）的最新进展显示出将视觉语言推理扩展到专业工具图像编辑的巨大潜力，实现直观且富有创造性的编辑。一个有前景的方向是利用强化学习（RL）使MLLM能够在专业图像编辑软件中推理并执行最佳工具使用计划。然而，由于缺乏可靠且可验证的奖励信号，这些信号能反映创意编辑本质上的主观性，培训依然充满挑战。在本研究中，我们介绍了RetouchIQ，这是一个通过MLLM代理执行基于指令的可执行图像编辑框架，并由通用奖励模型引导。RetouchIQ能够解读用户指定的编辑意图，生成相应的可执行图像调整，将高层次美学目标与精确参数控制相结合。为了超越传统的基于规则的奖励，即通过手工制作的指标与固定参考图像计算相似性，我们提出了一种通用奖励模型，即通过强化学习微调的MLLM，通过一组生成的指标逐案评估编辑结果。然后，奖励模型通过多模态推理提供标量反馈，使强化学习能够实现高质量且与指令一致的梯度。我们策划了一个包含19万对指令推理的扩展数据集，并建立了基于指令的图像编辑新基准。实验显示，RetouchIQ相比之前基于MLLM和基于扩散的编辑系统，在语义一致性和感知质量上都有显著提升。我们的发现展示了通才奖励驱动MLLM代理作为灵活、可解释且可执行的专业图像编辑助手的潜力。

Adapting Actively on the Fly: Relevance-Guided Online Meta-Learning with Latent Concepts for Geospatial Discovery

实时积极适应：基于相关性引导的在线元学习，以潜在概念促进地理空间发现

Authors: Jowaria Khan, Anindya Sarkar, Yevgeniy Vorobeychik, Elizabeth Bondi-Kelly
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.17605
Pdf link: https://arxiv.org/pdf/2602.17605
Abstract In many real-world settings, such as environmental monitoring, disaster response, or public health, with costly and difficult data collection and dynamic environments, strategically sampling from unobserved regions is essential for efficiently uncovering hidden targets under tight resource constraints. Yet, sparse and biased geospatial ground truth limits the applicability of existing learning-based methods, such as reinforcement learning. To address this, we propose a unified geospatial discovery framework that integrates active learning, online meta-learning, and concept-guided reasoning. Our approach introduces two key innovations built on a shared notion of concept relevance, which captures how domain-specific factors influence target presence: a concept-weighted uncertainty sampling strategy, where uncertainty is modulated by learned relevance based on readily-available domain-specific concepts (e.g., land cover, source proximity); and a relevance-aware meta-batch formation strategy that promotes semantic diversity during online-meta updates, improving generalization in dynamic environments. Our experiments include testing on a real-world dataset of cancer-causing PFAS (Per- and polyfluoroalkyl substances) contamination, showcasing our method's reliability at uncovering targets with limited data and a varying environment.
中文摘要 在许多现实环境中，如环境监测、灾害响应或公共卫生，面对昂贵且复杂的数据收集和动态环境，战略性地从未观察到的区域采样对于在有限资源限制下高效发现隐藏目标至关重要。然而，稀疏且有偏的地理空间地面真实性限制了现有基于学习的方法（如强化学习）的适用性。为此，我们提出了一个统一的地理空间发现框架，整合了主动学习、在线元学习和概念引导推理。我们的方法引入了两项关键创新，基于概念相关性的共同概念，体现了领域特定因素如何影响目标存在性：一种概念加权不确定性抽样策略，其中不确定性通过基于现成领域特定概念（如土地覆盖、来源接近性）的学习相关性来调制;以及一种相关性感知的元批次形成策略，在在线元更新期间促进语义多样性，提升动态环境中的泛化能力。我们的实验包括在现实世界中致癌PFAS（全氟烷基物质和多氟烷基物质）污染数据集上进行测试，展示了我们方法在有限数据和多变环境下发现靶点的可靠性。

Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs

稳定异步：面向LLM的方差控制离策略强化学习

Authors: Luke Huang, Zhuoyang Zhang, Qinghao Hu, Shang Yang, Song Han
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.17616
Pdf link: https://arxiv.org/pdf/2602.17616
Abstract Reinforcement learning (RL) is widely used to improve large language models on reasoning tasks, and asynchronous RL training is attractive because it increases end-to-end throughput. However, for widely adopted critic-free policy-gradient methods such as REINFORCE and GRPO, high asynchrony makes the policy-gradient estimator markedly higher variance: training on stale rollouts creates heavy-tailed importance ratios, causing a small fraction of samples to dominate updates. This amplification makes gradients noisy and learning unstable relative to matched on-policy training. Across math and general reasoning benchmarks, we find collapse is reliably predicted by effective sample size (ESS) and unstable gradient norms. Motivated by this diagnosis, we propose Variance Controlled Policy Optimization (VCPO), a general stabilization method for REINFORCE/GRPO-style algorithms that (i) scales learning rate based on effective sample size to dampen unreliable updates, and (ii) applies a closed-form minimum-variance baseline for the off-policy setting, avoiding an auxiliary value model and adding minimal overhead. Empirically, VCPO substantially improves robustness for asynchronous training across math, general reasoning, and tool-use tasks, outperforming a broad suite of baselines spanning masking/clipping stabilizers and algorithmic variants. This reduces long-context, multi-turn training time by 2.5× while matching synchronous performance, demonstrating that explicit control of policy-gradient variance is key for reliable asynchronous RL at scale.
中文摘要 强化学习（RL）被广泛用于提升大型语言模型在推理任务上的表现，异步强化学习训练的吸引力在于它提高了端到端的吞吐量。然而，对于广泛采用的无 critic 策略梯度方法，如 REINFORCE 和 GRPO，高度异步会使策略梯度估计器的方差显著升高：在陈旧的 rollout 上训练会产生重尾的重要性比率，导致少数样本主导更新。这种放大使得梯度相较于对应的同策略（on-policy）训练噪声更大、学习更不稳定。在数学和通用推理基准测试中，我们发现训练崩溃可由有效样本量（ESS）和不稳定的梯度范数可靠预测。基于这一诊断，我们提出了 Variance Controlled Policy Optimization（VCPO），这是一种针对 REINFORCE/GRPO 风格算法的通用稳定方法：（i）根据有效样本量缩放学习率以抑制不可靠的更新，（ii）在离策略（off-policy）设置中采用闭式的最小方差基线，无需辅助价值模型，且几乎不增加开销。实验表明，VCPO 在数学、通用推理和工具使用任务中显著提升了异步训练的鲁棒性，优于涵盖掩码/裁剪稳定器及算法变体的一系列基线。这在匹配同步训练性能的同时，将长上下文、多轮训练的时间缩短了 2.5 倍，表明显式控制策略梯度方差是大规模可靠异步强化学习的关键。
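
To make the effective-sample-size diagnostic in this abstract concrete, here is a minimal Python sketch. It assumes the standard Kish estimator, ESS = (Σw)² / Σw², computed over per-sample importance ratios. The abstract does not say which ESS estimator VCPO uses or how the learning rate is scaled, so `ess_scaled_lr` and its linear ESS/N rule are hypothetical illustrations, not the authors' method.

```python
from typing import Sequence


def effective_sample_size(weights: Sequence[float]) -> float:
    """Kish effective sample size: (sum of w)^2 / (sum of w^2)."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    if total_sq == 0.0:
        return 0.0
    return total * total / total_sq


def ess_scaled_lr(base_lr: float, weights: Sequence[float]) -> float:
    """Hypothetical rule: shrink the learning rate by the fraction ESS / N,
    so batches dominated by a few heavy-tailed ratios take smaller steps."""
    n = len(weights)
    if n == 0:
        return 0.0
    return base_lr * effective_sample_size(weights) / n


if __name__ == "__main__":
    fresh = [1.0, 1.0, 1.0, 1.0]     # on-policy-like ratios
    stale = [0.01, 0.01, 0.01, 10.0]  # one sample dominates
    print(effective_sample_size(fresh))
    print(effective_sample_size(stale))
    print(ess_scaled_lr(1e-5, stale))
```

Uniform ratios give an ESS equal to the batch size, while a single dominant ratio drives the ESS toward 1, which is the heavy-tail failure mode the abstract describes.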

SMAC: Score-Matched Actor-Critics for Robust Offline-to-Online Transfer

SMAC：用于稳健离线到在线迁移的分数匹配 Actor-Critic

Authors: Nathan S. de Lara, Florian Shkurti
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.17632
Pdf link: https://arxiv.org/pdf/2602.17632
Abstract Modern offline Reinforcement Learning (RL) methods find performant actor-critics; however, fine-tuning these actor-critics online with value-based RL algorithms typically causes immediate drops in performance. We provide evidence consistent with the hypothesis that, in the loss landscape, offline maxima for prior algorithms and online maxima are separated by low-performance valleys that gradient-based fine-tuning traverses. Following this, we present Score Matched Actor-Critic (SMAC), an offline RL method designed to learn actor-critics that transition to online value-based RL algorithms with no drop in performance. SMAC avoids valleys between offline and online maxima by regularizing the Q-function during the offline phase to respect a first-order derivative equality between the score of the policy and action-gradient of the Q-function. We experimentally demonstrate that SMAC converges to offline maxima that are connected to better online maxima via paths with monotonically increasing reward found by first-order optimization. SMAC achieves smooth transfer to Soft Actor-Critic and TD3 in 6/6 D4RL tasks. In 4/6 environments, it reduces regret by 34-58% over the best baseline.
中文摘要 现代离线强化学习（RL）方法能够找到性能优异的 actor-critic；然而，用基于价值的强化学习算法对这些 actor-critic 进行在线微调，通常会立即导致性能下降。我们提供了与如下假设一致的证据：在损失地形中，先前算法的离线极大值与在线极大值之间隔着低性能谷地，而基于梯度的微调会穿越这些谷地。据此，我们提出分数匹配 Actor-Critic（SMAC），这是一种离线强化学习方法，旨在学习能够无性能下降地过渡到在线基于价值的强化学习算法的 actor-critic。SMAC 在离线阶段对 Q 函数进行正则化，使其满足策略的分数（score）与 Q 函数的动作梯度之间的一阶导数等式，从而避开离线与在线极大值之间的谷地。实验表明，SMAC 收敛到的离线极大值，可经由一阶优化找到的奖励单调递增的路径，与更好的在线极大值相连。SMAC 在 6/6 个 D4RL 任务中实现了向 Soft Actor-Critic 和 TD3 的平滑迁移。在 4/6 个环境中，它相比最佳基线将遗憾（regret）降低了 34%-58%。
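
The abstract does not state the exact equality that SMAC enforces. As a hedged illustration of what a relation between the policy score and the action-gradient of the Q-function can look like, the derivation below assumes a Boltzmann (maximum-entropy) policy with temperature \alpha; both the policy form and the symbol \alpha are assumptions, not taken from the source.

```latex
% Assumed policy form (not from the abstract): Boltzmann in Q with temperature \alpha
\pi(a \mid s) = \frac{\exp\!\big(Q(s,a)/\alpha\big)}{Z(s)},
\qquad
Z(s) = \int \exp\!\big(Q(s,a')/\alpha\big)\, da'

% Z(s) does not depend on a, so taking the action-gradient of the log-density gives
\nabla_a \log \pi(a \mid s) = \frac{1}{\alpha}\, \nabla_a Q(s,a)
```

Under this assumption the score of the policy and the action-gradient of the Q-function agree up to the factor 1/\alpha, which is one way a first-order equality of this kind can arise.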

Keyword: diffusion policy

There is no result