Arxiv Papers of Today

Generated: 2026-03-20 16:44:25 (UTC+8); Arxiv release time: 2026-03-20 20:00 EDT (2026-03-21 08:00 UTC+8)

Total relevant papers today: 46

Keyword: reinforcement learning

Lightweight Adaptation for LLM-based Technical Service Agent: Latent Logic Augmentation and Robust Noise Reduction

Authors: Yi Yu, Junzhuo Ma, Chenghuang Shen, Xingyan Liu, Jing Gu, Hangyi Sun, Guangquan Hu, Jianfeng Liu, Weiting Liu, Mingyue Pu, Yu Wang, Zhengdong Xiao, Rui Xie, Longjiu Luo, Qianrong Wang, Gurong Cui, Honglin Qiao, Wenlian Lu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Applications (stat.AP)
Arxiv link: https://arxiv.org/abs/2603.18074
Pdf link: https://arxiv.org/pdf/2603.18074
Abstract: Adapting Large Language Models in complex technical service domains is constrained by the absence of explicit cognitive chains in human demonstrations and the inherent ambiguity arising from the diversity of valid responses. These limitations severely hinder agents from internalizing latent decision dynamics and generalizing effectively. Moreover, practical adaptation is often impeded by the prohibitive resource and time costs associated with standard training paradigms. To overcome these challenges and guarantee computational efficiency, we propose a lightweight adaptation framework comprising three key contributions. (1) Latent Logic Augmentation: We introduce Planning-Aware Trajectory Modeling and Decision Reasoning Augmentation to bridge the gap between surface-level supervision and latent decision logic. These approaches strengthen the stability of Supervised Fine-Tuning alignment. (2) Robust Noise Reduction: We construct a Multiple Ground Truths dataset through a dual-filtering method to reduce the noise by validating diverse responses, thereby capturing the semantic diversity. (3) Lightweight Adaptation: We design a Hybrid Reward mechanism that fuses an LLM-based judge with a lightweight relevance-based Reranker to distill high-fidelity reward signals while reducing the computational cost compared to standard LLM-as-a-Judge reinforcement learning. Empirical evaluations on real-world Cloud service tasks, conducted across semantically diverse settings, demonstrate that our framework achieves stability and performance gains through Latent Logic Augmentation and Robust Noise Reduction. Concurrently, our Hybrid Reward mechanism achieves alignment comparable to standard LLM-as-a-judge methods with reduced training time, underscoring the practical value for deploying technical service agents.

SLEA-RL: Step-Level Experience Augmented Reinforcement Learning for Multi-Turn Agentic Training

Authors: Prince Zizhuang Wang, Shuli Jiang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.18079
Pdf link: https://arxiv.org/pdf/2603.18079
Abstract: Large Language Model (LLM) agents have shown strong results on multi-turn tool-use tasks, yet they operate in isolation during training, failing to leverage experiences accumulated across episodes. Existing experience-augmented methods address this by organizing trajectories into retrievable libraries, but they retrieve experiences only once based on the initial task description and hold them constant throughout the episode. In multi-turn settings where observations change at every step, this static retrieval becomes increasingly mismatched as episodes progress. We propose SLEA-RL (Step-Level Experience-Augmented Reinforcement Learning), a framework that retrieves relevant experiences at each decision step conditioned on the current observation. SLEA-RL operates through three components: (i) step-level observation clustering that groups structurally equivalent environmental states for efficient cluster-indexed retrieval; (ii) a self-evolving experience library that distills successful strategies and failure patterns through score-based admission and rate-limited extraction; and (iii) policy optimization with step-level credit assignment for fine-grained advantage estimation across multi-turn episodes. The experience library evolves alongside the policy through semantic analysis rather than gradient updates. Experiments on long-horizon multi-turn agent benchmarks demonstrate that SLEA-RL achieves superior performance compared to various reinforcement learning baselines.
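The contrast between one-shot and step-level retrieval can be illustrated with a toy library. This is only a sketch under stated assumptions: the class `ExperienceLibrary`, the helper `cluster_of`, the admission threshold, and the rule of clustering by tool name are all hypothetical stand-ins, not the paper's components.

```python
class ExperienceLibrary:
    """Toy cluster-indexed store with score-based admission (hypothetical)."""

    def __init__(self, min_score=0.5):
        self.min_score = min_score
        self.by_cluster = {}

    def admit(self, cluster_id, note, score):
        # Score-based admission: low-scoring experiences are rejected.
        if score < self.min_score:
            return False
        self.by_cluster.setdefault(cluster_id, []).append((score, note))
        return True

    def retrieve(self, cluster_id, k=2):
        # Return the k highest-scoring notes stored for this cluster.
        entries = self.by_cluster.get(cluster_id, [])
        ranked = sorted(entries, key=lambda item: item[0], reverse=True)
        return [note for _, note in ranked[:k]]


def cluster_of(observation):
    # Hypothetical stand-in for observation clustering: group by tool name.
    return observation.split(":", 1)[0]


library = ExperienceLibrary()
library.admit("search", "quote exact error strings", 0.9)
library.admit("search", "low-value tip", 0.2)
library.admit("shell", "check exit codes", 0.8)

# Retrieval is repeated at every step, keyed on the current observation.
episode = ["search: find docs", "shell: run tests"]
retrieved = [library.retrieve(cluster_of(obs)) for obs in episode]
```

Each step of the episode receives experiences matched to its own observation, rather than a single set chosen from the initial task description.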

Uncovering Latent Phase Structures and Branching Logic in Locomotion Policies: A Case Study on HalfCheetah

Authors: Daisuke Yasui, Toshitaka Matsuki, Hiroshi Sato
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.18084
Pdf link: https://arxiv.org/pdf/2603.18084
Abstract: In locomotion control tasks, Deep Reinforcement Learning (DRL) has demonstrated high performance; however, the decision-making process of the learned policy remains a black box, making it difficult for humans to understand. On the other hand, in periodic motions such as walking, it is well known that implicit motion phases exist, such as the stance phase and the swing phase. Focusing on this point, this study hypothesizes that a policy trained for locomotion control may also represent a phase structure that is interpretable by humans. To examine this hypothesis in a controlled setting, we consider a locomotion task that is amenable to observing whether a policy autonomously acquires temporally structured phases through interaction with the environment. To verify this hypothesis, in the MuJoCo locomotion benchmark HalfCheetah-v5, the state transition sequences acquired by a policy trained for walking control through interaction with the environment were aggregated into semantic phases based on state similarity and consistency of subsequent transitions. As a result, we demonstrated that the state sequences generated by the trained policy exhibit periodic phase transition structures as well as phase branching. Furthermore, by approximating the states and actions corresponding to each semantic phase using Explainable Boosting Machines (EBMs), we analyzed phase-dependent decision making, namely, which state features the policy function attends to and how it controls action outputs in each phase. These results suggest that neural network-based policies, which are often regarded as black boxes, can autonomously acquire interpretable phase structures and logical branching mechanisms.

Enhancing Reinforcement Learning Fine-Tuning with an Online Refiner

Authors: Hao Ma, Zhiqiang Pu, Yang Liu, Xiaolin Ai
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.18088
Pdf link: https://arxiv.org/pdf/2603.18088
Abstract: Constraints are essential for stabilizing reinforcement learning fine-tuning (RFT) and preventing degenerate outputs, yet they inherently conflict with the optimization objective because stronger constraints limit the ability of a fine-tuned model to discover better solutions. We propose dynamic constraints that resolve this tension by adapting to the evolving capabilities of the fine-tuned model based on the insight that constraints should only intervene when degenerate outputs occur. We implement this by using a reference model as an online refiner that takes the response from the fine-tuned model and generates a minimally corrected version which preserves correct content verbatim while fixing errors. A supervised fine-tuning loss then trains the fine-tuned model to produce the refined output. This mechanism yields a constraint that automatically strengthens or relaxes based on output quality. Experiments on dialogue and code generation show that dynamic constraints outperform both KL regularization and unconstrained baselines, achieving substantially higher task rewards while maintaining training stability.

BoundAD: Boundary-Aware Negative Generation for Time Series Anomaly Detection

Authors: Xiancheng Wang, Lin Wang, Zhibo Zhang, Rui Wang, Minghang Zhao
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2603.18111
Pdf link: https://arxiv.org/pdf/2603.18111
Abstract: Contrastive learning methods for time series anomaly detection (TSAD) heavily depend on the quality of negative sample construction. However, existing strategies based on random perturbations or pseudo-anomaly injection often struggle to simultaneously preserve temporal semantic consistency and provide effective decision-boundary supervision. Most existing methods rely on prior anomaly injection, while overlooking the potential of generating hard negatives near the data manifold boundary directly from normal samples themselves. To address this issue, we propose a reconstruction-driven boundary negative generation framework that automatically constructs hard negatives through the reconstruction process of normal samples. Specifically, the method first employs a reconstruction network to capture normal temporal patterns, and then introduces a reinforcement learning strategy to adaptively adjust the optimization update magnitude according to the current reconstruction state. In this way, boundary-shifted samples close to the normal data manifold can be induced along the reconstruction trajectory and further used for subsequent contrastive representation learning. Unlike existing methods that depend on explicit anomaly injection, the proposed framework does not require predefined anomaly patterns, but instead mines more challenging boundary negatives from the model's own learning dynamics. Experimental results show that the proposed method effectively improves anomaly representation learning and achieves competitive detection performance on the current dataset.

Insight-V++: Towards Advanced Long-Chain Visual Reasoning with Multimodal Large Language Models

Authors: Yuhao Dong, Zuyan Liu, Shulin Tian, Yongming Rao, Ziwei Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.18118
Pdf link: https://arxiv.org/pdf/2603.18118
Abstract: Large Language Models (LLMs) have achieved remarkable reliability and advanced capabilities through extended test-time reasoning. However, extending these capabilities to Multi-modal Large Language Models (MLLMs) remains a significant challenge due to a critical scarcity of high-quality, long-chain reasoning data and optimized training pipelines. To bridge this gap, we present a unified multi-agent visual reasoning framework that systematically evolves from our foundational image-centric model, Insight-V, into a generalized spatial-temporal architecture, Insight-V++. We first propose a scalable data generation pipeline equipped with multi-granularity assessment that autonomously synthesizes structured, complex reasoning trajectories across image and video domains without human intervention. Recognizing that directly supervising MLLMs with such intricate data yields sub-optimal results, we design a dual-agent architecture comprising a reasoning agent to execute extensive analytical chains, and a summary agent to critically evaluate and distill final outcomes. While our initial framework utilized Direct Preference Optimization (DPO), its off-policy nature fundamentally constrained reinforcement learning potential. To overcome these limitations, particularly for long-horizon video understanding, Insight-V++ introduces two novel algorithms, ST-GRPO and J-GRPO, which enhance spatial-temporal reasoning and improve evaluative robustness. Crucially, by leveraging reliable feedback from the summary agent, we guide an iterative reasoning path generation process, retraining the entire multi-agent system in a continuous, self-improving loop. Extensive experiments on base models like LLaVA-NeXT and Qwen2.5-VL demonstrate significant performance gains across challenging image and video reasoning benchmarks while preserving strong capabilities on traditional perception-focused tasks.
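The names ST-GRPO and J-GRPO suggest variants of Group Relative Policy Optimization (GRPO), which scores each sampled response against the other responses drawn for the same prompt. A minimal sketch of that group-relative advantage follows, assuming the usual mean and standard-deviation normalization. The spatial-temporal and judge-specific parts of the two new algorithms are not described in the abstract and are not modeled here; the function name and `eps` value are illustrative.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation of the group
    # eps keeps the division finite when every reward in the group is equal.
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses to one prompt, two rewarded and two not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses above the group mean receive positive advantages and those below receive negative ones, so no separate value network is needed to form the baseline.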

R2-Dreamer: Redundancy-Reduced World Models without Decoders or Augmentation

Authors: Naoki Morihira (1 and 2), Amal Nahar (1), Kartik Bharadwaj (1), Yasuhiro Kato (2), Akinobu Hayashi (1 and 2), Tatsuya Harada (2 and 3) ((1) Honda R and D Co. Ltd., (2) The University of Tokyo, (3) RIKEN AIP)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.18202
Pdf link: https://arxiv.org/pdf/2603.18202
Abstract: A central challenge in image-based Model-Based Reinforcement Learning (MBRL) is to learn representations that distill essential information from irrelevant visual details. While promising, reconstruction-based methods often waste capacity on large task-irrelevant regions. Decoder-free methods instead learn robust representations by leveraging Data Augmentation (DA), but reliance on such external regularizers limits versatility. We propose R2-Dreamer, a decoder-free MBRL framework with a self-supervised objective that serves as an internal regularizer, preventing representation collapse without resorting to DA. The core of our method is a redundancy-reduction objective inspired by Barlow Twins, which can be easily integrated into existing frameworks. On DeepMind Control Suite and Meta-World, R2-Dreamer is competitive with strong baselines such as DreamerV3 and TD-MPC2 while training 1.59x faster than DreamerV3, and yields substantial gains on DMC-Subtle with tiny task-relevant objects. These results suggest that an effective internal regularizer can enable versatile, high-performance decoder-free MBRL. Code is available at this https URL.
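For readers unfamiliar with Barlow Twins, the redundancy-reduction idea can be sketched in a few lines: standardize each feature over the batch, form the cross-correlation matrix between two sets of embeddings, then push its diagonal toward 1 and its off-diagonal entries toward 0. The sketch below shows the generic Barlow Twins loss only. Which two embeddings R2-Dreamer compares is not stated in the abstract, and the function names and the weight 0.005 are illustrative assumptions. It also assumes no feature is constant across the batch.

```python
import statistics


def standardize(column):
    # Zero mean and unit variance over the batch (assumes a non-constant feature).
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(v - mean) / std for v in column]


def barlow_twins_loss(z_a, z_b, off_diag_weight=0.005):
    """Generic Barlow Twins loss for two batches of embeddings (lists of rows)."""
    n, d = len(z_a), len(z_a[0])
    cols_a = [standardize([row[i] for row in z_a]) for i in range(d)]
    cols_b = [standardize([row[j] for row in z_b]) for j in range(d)]
    loss = 0.0
    for i in range(d):
        for j in range(d):
            # Cross-correlation between feature i of z_a and feature j of z_b.
            c_ij = sum(a * b for a, b in zip(cols_a[i], cols_b[j])) / n
            if i == j:
                loss += (1.0 - c_ij) ** 2            # invariance term
            else:
                loss += off_diag_weight * c_ij ** 2  # redundancy-reduction term
    return loss


z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
z_swapped = [[row[1], row[0]] for row in z]
loss_same = barlow_twins_loss(z, z)
loss_swapped = barlow_twins_loss(z, z_swapped)
```

Identical, decorrelated embeddings give a loss of zero, while embeddings whose features are swapped are penalized on both terms.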

How Psychological Learning Paradigms Shaped and Constrained Artificial Intelligence

Authors: Alex Anvi Eponon, Ildar Batyrshin, Christian E. Maldonado-Sifuentes, Grigori Sidorov
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2603.18203
Pdf link: https://arxiv.org/pdf/2603.18203
Abstract: The dominant paradigms of artificial intelligence were shaped by learning theories from psychology: behaviorism inspired reinforcement learning, cognitivism gave rise to deep learning and memory-augmented architectures, and constructivism influenced curriculum learning and compositional approaches. This paper argues that each AI paradigm inherited not only the strengths but the structural limitations of the psychological theory that inspired it. Reinforcement learning cannot account for the internal structure of knowledge, deep learning compresses representations into opaque parameter spaces resistant to principled update, and current integrative approaches lack a formal account of how new understanding is constructed from existing components. The paper further examines a cross-cultural divergence in the interpretation of rote learning, arguing that the Eastern conception of memorization as a structured, multi-phase precursor to understanding offers an underexploited bridge between psychological theory and AI methodology. Drawing on the systematicity debate and Aizawa's critique of both classicism and connectionism, this paper introduces ReSynth, a trimodular framework that separates reasoning (Intellect), purpose (Identity), and knowledge (Memory) as architecturally independent components. The paper traces the genealogy from psychological paradigm to AI method, diagnoses the inherited limitations at each stage, and argues that adaptability, the central challenge of artificial general intelligence, requires a representational architecture in which systematic behavior is a necessary consequence rather than an accidental property.

MolRGen: A Training and Evaluation Setting for De Novo Molecular Generation with Reasoning Models

Authors: Philippe Formont, Maxime Darrin, Ismail Ben Ayed, Pablo Piantanida
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.18256
Pdf link: https://arxiv.org/pdf/2603.18256
Abstract: Recent advances in reasoning-based large language models (LLMs) have demonstrated substantial improvements in complex problem-solving tasks. Motivated by these advances, several works have explored the application of reasoning LLMs to drug discovery and molecular design. However, most existing approaches either focus on evaluation or rely on training setups that require ground-truth labels, such as molecule pairs with known property modifications. Such supervision is unavailable in de novo molecular generation, where the objective is to generate novel molecules that optimize a desirability score without prior knowledge of high-scoring candidates. To bridge this gap, we introduce MolRGen, a large-scale benchmark and dataset for training and evaluating reasoning-based LLMs on de novo molecular generation. Our contributions are threefold. First, we propose a setting to evaluate and train models for de novo molecular generation and property prediction. Second, we introduce a novel diversity-aware top-$k$ score that captures both the quality and diversity of generated molecules. Third, we show our setting can be used to train LLMs for molecular generation, training a 24B LLM with reinforcement learning, and we provide a detailed analysis of its performance and limitations.

Approximate Subgraph Matching with Neural Graph Representations and Reinforcement Learning

Authors: Kaiyang Li, Shihao Ji, Zhipeng Cai, Wei Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.18314
Pdf link: https://arxiv.org/pdf/2603.18314
Abstract: Approximate subgraph matching (ASM) is a task that determines the approximate presence of a given query graph in a large target graph. Being an NP-hard problem, ASM is critical in graph analysis with a myriad of applications ranging from database systems and network science to biochemistry and privacy. Existing techniques often employ heuristic search strategies, which cannot fully utilize the graph information, leading to sub-optimal solutions. This paper proposes a Reinforcement Learning based Approximate Subgraph Matching (RL-ASM) algorithm that exploits graph transformers to effectively extract graph representations and RL-based policies for ASM. Our model is built upon the branch-and-bound algorithm that selects one pair of nodes from the two input graphs at a time for potential matches. Instead of using heuristics, we exploit a Graph Transformer architecture to extract feature representations that encode the full graph information. To enhance the training of the RL policy, we use supervised signals to guide our agent in an imitation learning stage. Subsequently, the policy is fine-tuned with the Proximal Policy Optimization (PPO) that optimizes the accumulative long-term rewards over episodes. Extensive experiments on both synthetic and real-world datasets demonstrate that our RL-ASM outperforms existing methods in terms of effectiveness and efficiency. Our source code is available at this https URL.
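The PPO fine-tuning stage mentioned in this abstract rests on a clipped surrogate objective, which limits how far one update can move the policy away from the one that collected the data. A minimal per-sample sketch of the standard objective follows. The function name and the clip range of 0.2 are illustrative, and nothing here is specific to RL-ASM's node-pair action space.

```python
def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate for one sample.

    ratio is pi_new(a|s) / pi_old(a|s); advantage is the estimated advantage.
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum makes the objective a pessimistic bound on the update.
    return min(ratio * advantage, clipped_ratio * advantage)


gain_capped = ppo_clipped_objective(1.5, 1.0)    # good action, ratio too high
loss_kept = ppo_clipped_objective(0.5, -1.0)     # bad action, ratio already low
unchanged = ppo_clipped_objective(1.0, 2.0)      # ratio inside the clip range
```

With a positive advantage the gain stops growing once the ratio exceeds 1 + clip_eps, and with a negative advantage it stops growing once the ratio falls below 1 - clip_eps, so in both cases there is no incentive to keep moving the policy in that direction.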

DriveVLM-RL: Neuroscience-Inspired Reinforcement Learning with Vision-Language Models for Safe and Deployable Autonomous Driving

Authors: Zilin Huang, Zihao Sheng, Zhengyang Wan, Yansong Qu, Junwei You, Sicong Jiang, Sikai Chen
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.18315
Pdf link: https://arxiv.org/pdf/2603.18315
Abstract: Ensuring safe decision-making in autonomous vehicles remains a fundamental challenge despite rapid advances in end-to-end learning approaches. Traditional reinforcement learning (RL) methods rely on manually engineered rewards or sparse collision signals, which fail to capture the rich contextual understanding required for safe driving and make unsafe exploration unavoidable in real-world settings. Recent vision-language models (VLMs) offer promising semantic understanding capabilities; however, their high inference latency and susceptibility to hallucination hinder direct application to real-time vehicle control. To address these limitations, this paper proposes DriveVLM-RL, a neuroscience-inspired framework that integrates VLMs into RL through a dual-pathway architecture for safe and deployable autonomous driving. The framework decomposes semantic reward learning into a Static Pathway for continuous spatial safety assessment using CLIP-based contrasting language goals, and a Dynamic Pathway for attention-gated multi-frame semantic risk reasoning using a lightweight detector and a large VLM. A hierarchical reward synthesis mechanism fuses semantic signals with vehicle states, while an asynchronous training pipeline decouples expensive VLM inference from environment interaction. All VLM components are used only during offline training and are removed at deployment, ensuring real-time feasibility. Experiments in the CARLA simulator show significant improvements in collision avoidance, task success, and generalization across diverse traffic scenarios, including strong robustness under settings without explicit collision penalties. These results demonstrate that DriveVLM-RL provides a practical paradigm for integrating foundation models into autonomous driving without compromising real-time feasibility. Demo video and code are available at: this https URL

Learning to Reason with Curriculum I: Provable Benefits of Autocurriculum

Authors: Nived Rajaraman, Audrey Huang, Miro Dudik, Robert Schapire, Dylan J. Foster, Akshay Krishnamurthy
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2603.18325
Pdf link: https://arxiv.org/pdf/2603.18325
Abstract: Chain-of-thought reasoning, where language models expend additional computation by producing thinking tokens prior to final responses, has driven significant advances in model capabilities. However, training these reasoning models is extremely costly in terms of both data and compute, as it involves collecting long traces of reasoning behavior from humans or synthetic generators and further post-training the model via reinforcement learning. Are these costs fundamental, or can they be reduced through better algorithmic design? We show that autocurriculum, where the model uses its own performance to decide which problems to focus training on, provably improves upon standard training recipes for both supervised fine-tuning (SFT) and reinforcement learning (RL). For SFT, we show that autocurriculum requires exponentially fewer reasoning demonstrations than non-adaptive fine-tuning, by focusing teacher supervision on prompts where the current model struggles. For RL fine-tuning, autocurriculum decouples the computational cost from the quality of the reference model, reducing the latter to a burn-in cost that is nearly independent of the target accuracy. These improvements arise purely from adaptive data selection, drawing on classical techniques from boosting and learning from counterexamples, and requiring no assumption on the distribution or difficulty of prompts.
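The core selection step, sending teacher supervision only to prompts where the current model struggles, can be illustrated with a simple filter. This is a deliberately naive sketch: the fixed success-rate threshold and the function name are hypothetical, and the paper's actual procedure, which draws on boosting and learning from counterexamples, is not reproduced here.

```python
def select_hard_prompts(pass_rates, threshold=0.5):
    """Return prompts whose current success rate is below the threshold.

    pass_rates maps each prompt to the model's measured success rate on it.
    """
    return [prompt for prompt, rate in pass_rates.items() if rate < threshold]


# Measured success rates of the current model on three prompts.
current_rates = {"prompt_a": 0.9, "prompt_b": 0.2, "prompt_c": 0.4}
needs_demonstration = select_hard_prompts(current_rates)
```

Only the selected prompts would be sent to the teacher for reasoning demonstrations, which is where the saving over non-adaptive fine-tuning comes from.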

Escaping Offline Pessimism: Vector-Field Reward Shaping for Safe Frontier Exploration

Authors: Amirhossein Roknilamouki, Arnob Ghosh, Eylem Ekici, Ness B. Shroff
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.18326
Pdf link: https://arxiv.org/pdf/2603.18326
Abstract: While offline reinforcement learning provides reliable policies for real-world deployment, its inherent pessimism severely restricts an agent's ability to explore and collect novel data online. Drawing inspiration from safe reinforcement learning, exploring near the boundary of regions well covered by the offline dataset and reliably modeled by the simulator allows an agent to take manageable risks, venturing into informative but moderate-uncertainty states while remaining close enough to familiar regions for safe recovery. However, naively rewarding this boundary-seeking behavior can lead to a degenerate parking behavior, where the agent simply stops once it reaches the frontier. To solve this, we propose a novel vector-field reward shaping paradigm designed to induce continuous, safe boundary exploration for non-adaptive deployed policies. Operating on an uncertainty oracle trained from offline data, our reward combines two complementary components: a gradient-alignment term that attracts the agent toward a target uncertainty level, and a rotational-flow term that promotes motion along the local tangent plane of the uncertainty manifold. Through theoretical analysis, we show that this reward structure naturally induces sustained exploratory behavior along the boundary while preventing degenerate solutions. Empirically, by integrating our proposed reward shaping with Soft Actor-Critic on a 2D continuous navigation task, we validate that agents successfully traverse uncertainty boundaries while balancing safe, informative data collection with primary task completion.
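The two reward components can be pictured in a 2D toy setting. The sketch below is one plausible instantiation, not the paper's formula: the radial `uncertainty` oracle, the way the alignment term is scaled by the gap to the target level, and `rot_weight` are all assumptions made for illustration.

```python
import math


def uncertainty(x, y):
    # Hypothetical oracle: uncertainty grows with distance from data at the origin.
    return math.hypot(x, y)


def vector_field_reward(pos, vel, target=1.0, rot_weight=0.5):
    """Toy reward: gradient alignment toward the target level plus tangential flow."""
    x, y = pos
    r = math.hypot(x, y)
    if r == 0.0:
        return 0.0  # the gradient of this toy oracle is undefined at the origin
    grad = (x / r, y / r)          # unit gradient of the uncertainty
    tangent = (-grad[1], grad[0])  # gradient rotated by 90 degrees
    # Alignment: reward motion up the gradient below the target, down it above.
    align = (target - uncertainty(x, y)) * (vel[0] * grad[0] + vel[1] * grad[1])
    # Rotation: reward motion along the level set of the uncertainty.
    rotate = rot_weight * (vel[0] * tangent[0] + vel[1] * tangent[1])
    return align + rotate


parked = vector_field_reward((1.0, 0.0), (0.0, 0.0))     # stopped on the frontier
circling = vector_field_reward((1.0, 0.0), (0.0, 1.0))   # moving along the frontier
overshoot = vector_field_reward((2.0, 0.0), (1.0, 0.0))  # pushing further outward
```

In this toy version an agent parked on the frontier earns nothing, while one that keeps moving along it is rewarded, which is the qualitative behavior the abstract describes.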

PowerFlow: Unlocking the Dual Nature of LLMs via Principled Distribution Matching

Authors: Ruishuo Chen, Yu Chen, Zhuoran Li, Longbo Huang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.18363
Pdf link: https://arxiv.org/pdf/2603.18363
Abstract Unsupervised Reinforcement Learning from Internal Feedback (RLIF) has emerged as a promising paradigm for eliciting the latent capabilities of Large Language Models (LLMs) without external supervision. However, current methods rely on heuristic intrinsic rewards, which often lack a well-defined theoretical optimization target and are prone to degenerative biases. In this work, we introduce PowerFlow, a principled framework that reformulates unsupervised fine-tuning as a distribution matching problem. By casting GFlowNet as an amortized variational sampler for unnormalized densities, we propose a length-aware Trajectory-Balance objective that explicitly neutralizes the structural length biases inherent in autoregressive generation. By targeting $\alpha$-power distributions, PowerFlow enables the directional elicitation of the dual nature of LLMs: sharpening the distribution ($\alpha > 1$) to intensify logical reasoning, or flattening it ($\alpha < 1$) to unlock expressive creativity. Extensive experiments demonstrate that PowerFlow consistently outperforms existing RLIF methods, matching or even exceeding supervised GRPO. Furthermore, by mitigating over-sharpening in aligned models, our approach achieves simultaneous gains in diversity and quality, shifting the Pareto frontier in creative tasks.
中文摘要 来自内部反馈的无监督强化学习（RLIF）已成为一种有前景的范式，用于在无需外部监督的情况下，激发大型语言模型（LLMs）潜在能力。然而，当前方法依赖启发式内在奖励，这些奖励通常缺乏明确的理论优化目标，且容易出现退化性偏差。在本研究中，我们介绍了PowerFlow，这是一个原则性框架，将无监督微调重新表述为分布匹配问题。通过将GFlowNet定位为未归一化密度的摊销变分采样器，我们提出了一个长度感知轨迹平衡的目标，明确中和自回归生成中固有的结构长度偏差。通过针对 $\alpha$ 的功率分布，PowerFlow 实现了大型语言模型双重特性的方向性诱导：通过锐化分布（$\alpha > 1$）以强化逻辑推理，或将其平坦化（$\alpha < 1$）以激发表达创造力。大量实验表明，PowerFlow始终优于现有RLIF方法，甚至超过监督GRPO。此外，通过减少对齐模型中的过度锐化，我们的方法同时实现多样性和质量的提升，推动了创造性任务中的帕累托边界。
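
The effect of targeting an $\alpha$-power distribution, as described in the PowerFlow abstract, can be illustrated on a toy categorical distribution. This is a minimal sketch only: it does not implement the GFlowNet sampler or the length-aware Trajectory-Balance objective, and the names `power_distribution` and `entropy` are hypothetical helpers introduced here for illustration.

```python
import math


def power_distribution(probs, alpha):
    """Return q with q_i proportional to p_i ** alpha (toy illustration)."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    powered = [p ** alpha for p in probs]
    total = sum(powered)
    return [w / total for w in powered]


def entropy(probs):
    """Shannon entropy in nats; lower means a sharper distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical next-token distribution, chosen only for the example.
base = [0.5, 0.3, 0.2]
sharp = power_distribution(base, 2.0)  # alpha > 1 concentrates mass on the mode
flat = power_distribution(base, 0.5)   # alpha < 1 spreads mass more evenly

print("entropy(sharp) < entropy(base):", entropy(sharp) < entropy(base))
print("entropy(flat) > entropy(base):", entropy(flat) > entropy(base))
```

With $\alpha = 1$ the base distribution is returned unchanged, which is why the two regimes in the abstract are stated relative to 1.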

Mathematical Foundations of Deep Learning

深度学习的数学基础

Authors: Xiaojing Ye
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2603.18387
Pdf link: https://arxiv.org/pdf/2603.18387
Abstract This draft book offers a comprehensive and rigorous treatment of the mathematical principles underlying modern deep learning. The book spans core theoretical topics, from the approximation capabilities of deep neural networks, the theory and algorithms of optimal control and reinforcement learning integrated with deep learning techniques, to contemporary generative models that drive today's advances in artificial intelligence.
中文摘要 本草稿书对现代深度学习背后的数学原理进行了全面且严谨的论述。本书涵盖核心理论主题，从深度神经网络的近似能力、与深度学习技术集成的最优控制与强化学习理论与算法，到推动当今人工智能进步的现代生成模型。

RE-SAC: Disentangling aleatoric and epistemic risks in bus fleet control: A stable and robust ensemble DRL approach

RE-SAC：解耦公交车队控制中的偶然风险与认知风险：一种稳定且稳健的集成深度强化学习方法

Authors: Yifan Zhang, Liang Zheng
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.18396
Pdf link: https://arxiv.org/pdf/2603.18396
Abstract Bus holding control is challenging due to stochastic traffic and passenger demand. While deep reinforcement learning (DRL) shows promise, standard actor-critic algorithms suffer from Q-value instability in volatile environments. A key source of this instability is the conflation of two distinct uncertainties: aleatoric uncertainty (irreducible noise) and epistemic uncertainty (data insufficiency). Treating these as a single risk leads to value underestimation in noisy states, causing catastrophic policy collapse. We propose a robust ensemble soft actor-critic (RE-SAC) framework to explicitly disentangle these uncertainties. RE-SAC applies Integral Probability Metric (IPM)-based weight regularization to the critic network to hedge against aleatoric risk, providing a smooth analytical lower bound for the robust Bellman operator without expensive inner-loop perturbations. To address epistemic risk, a diversified Q-ensemble penalizes overconfident value estimates in sparsely covered regions. This dual mechanism prevents the ensemble variance from misidentifying noise as a data gap, a failure mode identified in our ablation study. Experiments in a realistic bidirectional bus corridor simulation demonstrate that RE-SAC achieves the highest cumulative reward (approx. -0.4e6) compared to vanilla SAC (-0.55e6). Mahalanobis rareness analysis confirms that RE-SAC reduces Oracle Q-value estimation error by up to 62% in rare out-of-distribution states (MAE of 1647 vs. 4343), demonstrating superior robustness under high traffic variability.
中文摘要 由于交通和乘客需求的随机性，公交驻站控制具有挑战性。虽然深度强化学习（DRL）展现出潜力，但标准的actor-critic算法在波动环境中存在Q值不稳定性。这种不稳定性的一个关键来源是两种截然不同的不确定性被混为一谈：偶然不确定性（不可约的噪声）和认知不确定性（数据不足）。将二者视为单一风险会导致噪声状态下的价值低估，进而造成灾难性的策略崩溃。我们提出了一个稳健的集成软actor-critic（RE-SAC）框架，以显式解耦这些不确定性。RE-SAC将基于积分概率度量（IPM）的权重正则化应用于critic网络，以对冲偶然风险，为稳健的贝尔曼算子提供平滑的解析下界，且无需昂贵的内循环扰动。为了应对认知风险，多样化的Q集成会惩罚覆盖稀疏区域中过度自信的价值估计。这种双重机制防止集成方差将噪声误判为数据缺口，这是我们在消融研究中识别出的一种失效模式。在真实的双向公交走廊仿真中，RE-SAC获得了最高的累计奖励（约-0.4e6），而普通SAC为-0.55e6。Mahalanobis稀有性分析证实，RE-SAC在罕见的分布外状态下，能将Oracle Q值估计误差降低高达62%（MAE为1647对4343），在高交通变异性下表现出优越的鲁棒性。
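
As a quick arithmetic check of the figures quoted in the RE-SAC abstract (using only the two MAE values stated there, with no additional data), the relative error reduction works out as follows:

```latex
\[
\frac{4343 - 1647}{4343} = \frac{2696}{4343} \approx 0.621 \quad (\text{about } 62\%)
\]
```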

Efficient and Versatile Quadrupedal Skating: Optimal Co-design via Reinforcement Learning and Bayesian Optimization

高效且多功能的四足滑行：通过强化学习和贝叶斯优化实现的最优协同设计

Authors: Hanwen Wang, Zhenlong Fang, Josiah Hanna, Xiaobin Xiong
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.18408
Pdf link: https://arxiv.org/pdf/2603.18408
Abstract In this paper, we present a hardware-control co-design approach that enables efficient and versatile roller skating on quadrupedal robots equipped with passive wheels. Passive-wheel skating reduces leg inertia and improves energy efficiency, particularly at high speeds. However, the absence of direct wheel actuation tightly couples mechanical design and control. To unlock the full potential of this modality, we formulate a bilevel optimization framework: an upper-level Bayesian Optimization searches the mechanical design space, while a lower-level Reinforcement Learning trains a motor control policy for each candidate design. The resulting design-policy pairs not only outperform human-engineered baselines, but also exhibit versatile behaviors such as hockey stop (rapid braking by turning sideways to maximize friction) and self-aligning motion (automatic reorientation to improve energy efficiency in the direction of travel), offering the first system-level study of dynamic skating motion on quadrupedal robots.
中文摘要 本文提出了一种硬件控制协同设计方法，使配备被动轮子的四足机器人能够高效且多功能地滑旱冰。被动轮滑可以减少腿部惯性，提高能源效率，尤其是在高速时。然而，缺乏直接车轮驱动，紧密结合了机械设计和控制。为了充分发挥这一模态的潜力，我们制定了一个双层次优化框架：上层贝叶斯优化搜索机械设计空间，而低层强化学习则为每个候选设计训练运动控制策略。由此产生的设计-政策对不仅优于人工设计的基线，还展现出多功能行为，如冰球停车（通过侧转快速制动以最大化摩擦）和自对准运动（自动重新定位以提升行进方向的能效），首次在系统层面研究四足机器人的动态滑行运动。

Adaptive Decoding via Test-Time Policy Learning for Self-Improving Generation

通过测试时间策略学习实现自适应解码，实现自我改进生成

Authors: Asmita Bhardwaj, Yuya Jeremy Ong, Eelaaf Zahid, Basel Shbita
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.18428
Pdf link: https://arxiv.org/pdf/2603.18428
Abstract Decoding strategies largely determine the quality of Large Language Model (LLM) outputs, yet widely used heuristics such as greedy or fixed temperature/top-p decoding are static and often task-agnostic, leading to suboptimal or inconsistent generation quality across domains that demand stylistic or structural flexibility. We introduce a reinforcement learning-based decoder sampler that treats decoding as sequential decision-making and learns a lightweight policy to adjust sampling parameters at test-time while keeping LLM weights frozen. We evaluate on summarization datasets, including BookSum, arXiv, and WikiHow, using Granite-3.3-2B and Qwen-2.5-0.5B. Our policy sampler consistently outperforms greedy and static baselines, achieving relative gains of up to +88% (BookSum, Granite) and +79% (WikiHow, Qwen). Reward ablations show that overlap-only objectives underperform compared to composite rewards, while structured shaping terms (length, coverage, repetition, completeness) enable stable and sustained improvements. These findings highlight reinforcement learning as a practical mechanism for test-time adaptation in decoding, enabling domain-aware and user-controllable generation without retraining large models.
中文摘要 解码策略在很大程度上决定了大型语言模型（LLM）输出的质量，但广泛使用的启发式方法如贪婪或固定温度/顶点解码是静态的，且常常与任务无关，导致在需要风格或结构灵活性的领域中生成质量不理想或不一致。我们引入了基于强化学习的译码器采样器，将解码视为顺序决策，并学习轻量级策略，在测试时调整采样参数，同时保持LLM权重冻结。我们使用Granite-3.3-2B和Qwen-2.5-0.5B评估了包括BookSum、arXiv和WikiHow在内的摘要数据集。我们的政策采样器持续优于贪婪和静态基线，相对提升高达+88%（BookSum、Granite）和+79%（WikiHow、Qwen）。奖励消融显示，仅重叠目标相较于复合奖励表现不佳，而结构化的塑形项（长度、覆盖范围、重复、完整性）则实现稳定且持续的改进。这些发现凸显了强化学习作为测试时间适应解码的实用机制，实现了无需重新训练大型模型的领域感知和用户可控生成。

Discounted Beta--Bernoulli Reward Estimation for Sample-Efficient Reinforcement Learning with Verifiable Rewards

样本高效强化学习的折现贝塔-伯努利奖励估计，且有可验证奖励

Authors: Haechan Kim, Soohyun Ryu, Gyouk Chu, Doohyuk Jang, Eunho Yang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.18444
Pdf link: https://arxiv.org/pdf/2603.18444
Abstract Reinforcement learning with verifiable rewards (RLVR) has emerged as an effective post-training paradigm for improving the reasoning capabilities of large language models. However, existing group-based RLVR methods often suffer from severe sample inefficiency. This inefficiency stems from reliance on point estimation of rewards from a small number of rollouts, leading to high estimation variance, variance collapse, and ineffective utilization of generated responses. In this work, we reformulate RLVR from a statistical estimation perspective by modeling rewards as samples drawn from a policy-induced distribution and casting advantage computation as the problem of estimating the reward distribution from finite data. Building on this view, we propose Discounted Beta--Bernoulli (DBB) reward estimation, which leverages historical reward statistics for the non-stationary distribution. Although biased, the resulting estimator exhibits reduced and stable variance, theoretically avoids estimated variance collapse, and achieves lower mean squared error than standard point estimation. Extensive experiments across six in-distribution and three out-of-distribution reasoning benchmarks demonstrate that GRPO with DBB consistently outperforms naive GRPO, achieving average Acc@8 improvements of 3.22/2.42 points in-distribution and 12.49/6.92 points out-of-distribution on the 1.7B and 8B models, respectively, without additional computational cost or memory usage.
中文摘要 带可验证奖励的强化学习（RLVR）已成为提升大型语言模型推理能力的有效后训练范式。然而，现有基于组的RLVR方法常常存在严重的样本效率低下。这种低效源于依赖少量rollout对奖励进行点估计，导致估计方差高、方差坍缩以及对生成响应的利用不充分。在本研究中，我们从统计估计的角度重新表述了RLVR，将奖励建模为从策略诱导分布中抽取的样本，并将优势计算视为从有限数据估计奖励分布的问题。基于这一观点，我们提出了折扣贝塔-伯努利（DBB）奖励估计，利用历史奖励统计量来估计非平稳分布。尽管有偏差，所得估计量表现出更低且稳定的方差，理论上避免了估计方差坍缩，并且实现了比标准点估计更低的均方误差。在六个分布内和三个分布外推理基准测试中进行的大量实验表明，带有DBB的GRPO始终优于朴素GRPO，在1.7B和8B模型上，Acc@8在分布内分别平均提升3.22/2.42点，在分布外分别平均提升12.49/6.92点，且无额外计算成本或内存开销。
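
One way to picture a discounted Beta-Bernoulli estimator of a per-prompt success rate is sketched below. The abstract does not give the update rule, so everything here is an assumption made for illustration: the class name `DiscountedBetaBernoulli`, the Beta(1, 1) prior, the discount value, and the choice to decay the pseudo-counts before adding each new group of binary rewards.

```python
from dataclasses import dataclass


@dataclass
class DiscountedBetaBernoulli:
    """Toy discounted Beta-Bernoulli estimator (assumed form, not the paper's)."""
    discount: float = 0.9  # hypothetical forgetting factor for old statistics
    alpha: float = 1.0     # pseudo-count of successes (Beta(1, 1) prior assumed)
    beta: float = 1.0      # pseudo-count of failures

    def update(self, rewards):
        """Decay historical pseudo-counts, then add the new binary rewards."""
        successes = sum(rewards)
        failures = len(rewards) - successes
        self.alpha = self.discount * self.alpha + successes
        self.beta = self.discount * self.beta + failures

    @property
    def mean(self):
        """Posterior-mean estimate of the success probability."""
        return self.alpha / (self.alpha + self.beta)

    @property
    def bernoulli_variance(self):
        """Variance of a Bernoulli with the estimated mean; stays above zero."""
        m = self.mean
        return m * (1.0 - m)


est = DiscountedBetaBernoulli()
est.update([1, 1, 1, 1])  # a group where every rollout is correct
# A plain group average would give mean 1.0 and variance 0.0 here.
print(est.mean, est.bernoulli_variance)
```

Because both pseudo-counts stay strictly positive, the estimated mean never reaches 0 or 1, which is the intuition behind avoiding a collapsed variance estimate when all rollouts in a group agree.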

AcceRL: A Distributed Asynchronous Reinforcement Learning and World Model Framework for Vision-Language-Action Models

AcceRL：一个分布式异步强化学习与世界模型框架，用于视觉-语言-行动模型

Authors: Chengxuan Lu, Shukuan Wang, Yanjie Li, Wei Liu, Shiji Jin, Fuyuan Qian, Peiming Li, Baigui Sun, Yang Liu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.18464
Pdf link: https://arxiv.org/pdf/2603.18464
Abstract Reinforcement learning (RL) for large-scale Vision-Language-Action (VLA) models faces significant challenges in computational efficiency and data acquisition. We propose AcceRL, a fully asynchronous and decoupled RL framework designed to eliminate synchronization barriers by physically isolating training, inference, and rollouts. Crucially, AcceRL is the first to integrate a plug-and-play, trainable world model into a distributed asynchronous RL pipeline to generate virtual experiences. Experiments on the LIBERO benchmark demonstrate that AcceRL achieves state-of-the-art (SOTA) performance. Systematically, it exhibits super-linear scaling in throughput and highly efficient hardware utilization. Algorithmically, the world-model-augmented variant delivers unprecedented sample efficiency and robust training stability in complex control tasks.
中文摘要 大规模视觉-语言-行动（VLA）模型的强化学习（RL）在计算效率和数据采集方面面临重大挑战。我们提出了AcceRL，这是一个完全异步且解耦的强化学习框架，旨在通过物理隔离训练、推理和推广来消除同步障碍。关键是，AcceRL首次将即插即用、可训练的世界模型集成到分布式异步强化学习流水线中，以生成虚拟体验。基于LIBERO基准的实验表明，AcceRL实现了最先进的（SOTA）性能。系统性地，它表现出吞吐量的超线性扩展和高效的硬件利用。在算法上，世界模型增强变体在复杂控制任务中提供了前所未有的样本效率和稳健的训练稳定性。

Scaling Sim-to-Real Reinforcement Learning for Robot VLAs with Generative 3D Worlds

利用生成三维世界的模拟到实物强化学习扩展机器人VLA

Authors: Andrew Choi, Xinjie Wang, Zhizhong Su, Wei Xu
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.18532
Pdf link: https://arxiv.org/pdf/2603.18532
Abstract The strong performance of large vision-language models (VLMs) trained with reinforcement learning (RL) has motivated similar approaches for fine-tuning vision-language-action (VLA) models in robotics. Many recent works fine-tune VLAs directly in the real world to avoid addressing the sim-to-real gap. While real-world RL circumvents sim-to-real issues, it inherently limits the generality of the resulting VLA, as scaling scene and object diversity in the physical world is prohibitively difficult. This leads to the paradoxical outcome of transforming a broadly pretrained model into an overfitted, scene-specific policy. Training in simulation can instead provide access to diverse scenes, but designing those scenes is also costly. In this work, we show that VLAs can be RL fine-tuned without sacrificing generality and with reduced labor by leveraging 3D world generative models. Using these models together with a language-driven scene designer, we generate hundreds of diverse interactive scenes containing unique objects and backgrounds, enabling scalable and highly parallel policy learning. Starting from a pretrained imitation baseline, our approach increases simulation success from 9.7% to 79.8% while achieving a 1.25$\times$ speedup in task completion time. We further demonstrate successful sim-to-real transfer enabled by the quality of the generated digital twins together with domain randomization, improving real-world success from 21.7% to 75% and achieving a 1.13$\times$ speedup. Finally, we further highlight the benefits of leveraging the effectively unlimited data from 3D world generative models through an ablation study showing that increasing scene diversity directly improves zero-shot generalization.
中文摘要 通过强化学习（RL）训练的大型视觉语言模型（VLM）表现出色，这促使机器人领域采用类似方法对视觉-语言-动作（VLA）模型进行微调。许多近期研究直接在现实世界中微调VLA，以避免处理模拟与现实之间的差距。虽然现实世界的强化学习绕过了模拟到现实的问题，但它本质上限制了所得VLA的通用性，因为在物理世界中扩展场景和物体的多样性极其困难。这导致了一个悖论性的结果：将一个广泛预训练的模型转变为一个过拟合的、特定于场景的策略。在仿真中训练可以提供多样化的场景，但设计这些场景的成本也很高。本研究表明，通过利用三维世界生成模型，VLA可以在不牺牲通用性且减少人工投入的情况下进行强化学习微调。结合这些模型和语言驱动的场景设计器，我们生成数百个包含独特物体和背景的多样化交互场景，实现可扩展且高度并行的策略学习。从预训练的模仿学习基线出发，我们的方法将仿真成功率从9.7%提升到79.8%，同时在任务完成时间上实现1.25$\times$的加速。我们进一步展示了成功的模拟到现实迁移，这得益于生成的数字孪生的质量和域随机化，将现实世界成功率从21.7%提升至75%，并实现1.13$\times$的加速。最后，我们通过消融研究进一步强调利用三维世界生成模型近乎无限数据的益处，该研究显示增加场景多样性可直接改善零样本泛化。

Balancing the Reasoning Load: Difficulty-Differentiated Policy Optimization with Length Redistribution for Efficient and Robust Reinforcement Learning

推理负载的平衡：难度差异化策略优化与长度重分布，实现高效且稳健的强化学习

Authors: Yinan Xia, Haotian Zhang, Huiming Wang
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.18533
Pdf link: https://arxiv.org/pdf/2603.18533
Abstract Large Reasoning Models (LRMs) have shown exceptional reasoning capabilities, but they also suffer from the issue of overthinking, often generating excessively long and redundant answers. For problems that exceed the model's capabilities, LRMs tend to exhibit the overconfidence phenomenon, generating overly short but incorrect answers, which may contribute to suboptimal performance. To address these issues, we propose Difficulty-Differentiated Policy Optimization (DDPO), an efficient reinforcement learning algorithm that optimizes simple and complex tasks separately based on the overconfidence phenomenon. Specifically, it reduces the output length for simple tasks without compromising accuracy, while for complex tasks, it expands the exploration space to improve performance. We further derive the theoretical conditions for maximizing expected accuracy, which require the length distribution to closely approximate the optimal length and be as concentrated as possible. Based on these conditions, we propose using the difficulty-level average as a well-founded reference for length optimization. Extensive experiments on both in-domain and out-of-domain benchmarks validate the superiority and effectiveness of DDPO. Compared to GRPO, DDPO reduces the average answer length by 12% while improving accuracy by 1.85% across multiple benchmarks, achieving a better trade-off between accuracy and length. The code is available at this https URL.
中文摘要 大型推理模型（LRM）展现出卓越的推理能力，但它们也存在过度思考的问题，常常生成过长且冗余的答案。对于超出模型能力的问题，LRMs往往表现出过度自信现象，生成过短但错误的答案，可能导致性能不佳。为解决这些问题，我们提出了难度差异化策略优化（DDPO），这是一种高效的强化学习算法，基于过度自信现象分别优化简单任务和复杂任务。具体来说，它在不牺牲准确性的情况下缩短简单任务的输出长度，而对于复杂任务，则扩展了探索空间以提升性能。我们进一步推导出最大化期望精度的理论条件，要求长度分布尽可能接近最优长度并尽可能集中。基于这些条件，我们建议以难度平均值作为长度优化的有充分基础参考。在域内和域外基准测试中进行了大量实验，验证了DDPO的优越性和有效性。与GRPO相比，DDPO在多个基准测试中平均答案长度减少了12%，准确率提升了1.85%，实现了准确性和长度之间的更好权衡。代码可在该 https URL 访问。

iSatCR: Graph-Empowered Joint Onboard Computing and Routing for LEO Data Delivery

iSatCR：基于图形的联合机载计算与路由用于LEO数据传输

Authors: Jiangtao Luo, Bingbing Xu, Shaohua Xia, Yongyi Ran
Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.18539
Pdf link: https://arxiv.org/pdf/2603.18539
Abstract Sending massive Earth observation data produced by low Earth orbit (LEO) satellites back to the ground for processing consumes a large amount of on-orbit bandwidth and exacerbates the space-to-ground link bottleneck. Most prior work has concentrated on optimizing the routing of raw data within the constellation, yet cannot cope with the surge in data volume. Recently, advances in onboard computing have made it possible to process data in situ, thus significantly reducing the data volume to be transmitted. In this paper, we present iSatCR, a distributed graph-based approach that jointly optimizes onboard computing and routing to boost transmission efficiency. Within iSatCR, we design a novel graph embedding utilizing shifted feature aggregation and distributed message passing to capture satellite states, and then propose a distributed graph-based deep reinforcement learning algorithm that derives joint computing-routing strategies under constrained on-board storage to handle the complexity and dynamics of LEO networks. Extensive experiments show iSatCR outperforms baselines, particularly under high load.
中文摘要 将低地轨道卫星产生的大量地球观测数据传回地面处理，消耗大量轨道带宽，加剧了空间与地面链路的瓶颈。此前大多数工作集中在优化星座内原始数据的路由，但无法应对数据量的激增。近年来，车载计算技术的进步使得原位处理数据成为可能，从而显著减少了传输的数据量。本文介绍了iSatCR，一种基于分布式图的方法，联合优化车载计算和路由，以提升传输效率。在iSatCR中，我们设计了一种利用移位特征聚合和分布式消息传递捕捉卫星状态的新型图嵌入，随后提出了一种分布式基于图的深度强化学习算法，在受限的机载存储下推导联合计算-路由策略，以应对LEO网络的复杂性和动态。大量实验表明，iSatCR在高负载下表现优于基线。

Learning to Self-Evolve

学习自我进化

Authors: Xiaoyin Chen, Canwen Xu, Yite Wang, Boyi Liu, Zhewei Yao, Yuxiong He
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.18620
Pdf link: https://arxiv.org/pdf/2603.18620
Abstract We introduce Learning to Self-Evolve (LSE), a reinforcement learning framework that trains large language models (LLMs) to improve their own contexts at test time. We situate LSE in the setting of test-time self-evolution, where a model iteratively refines its context from feedback on seen problems to perform better on new ones. Existing approaches rely entirely on the inherent reasoning ability of the model and never explicitly train it for this task. LSE reduces the multi-step evolution problem to a single-step RL objective, where each context edit is rewarded by the improvement in downstream performance. We pair this objective with a tree-guided evolution loop. On Text-to-SQL generation (BIRD) and general question answering (MMLU-Redux), a 4B-parameter model trained with LSE outperforms self-evolving policies powered by GPT-5 and Claude Sonnet 4.5, as well as prompt optimization methods including GEPA and TextGrad, and transfers to guide other models without additional training. Our results highlight the effectiveness of treating self-evolution as a learnable skill.
中文摘要 我们介绍了“学习自我进化”（LSE），这是一个强化学习框架，训练大型语言模型（LLMs）在测试时改善自身的上下文。我们将LSE置于测试时自我进化的设定中，模型根据对已见问题的反馈迭代地细化其上下文，以在新问题上表现更好。现有方法完全依赖模型的固有推理能力，从未明确训练模型完成此任务。LSE将多步进化问题简化为单步强化学习目标，每次上下文编辑都以下游性能的提升作为奖励。我们将此目标与树引导的进化循环结合。在文本转SQL生成（BIRD）和通用问答（MMLU-Redux）上，使用LSE训练的4B参数模型优于由GPT-5和Claude Sonnet 4.5驱动的自进化策略，以及包括GEPA和TextGrad在内的提示优化方法，并能在无需额外训练的情况下迁移用于引导其他模型。我们的结果强调了将自我进化视为可学习技能的有效性。

Balanced Thinking: Improving Chain of Thought Training in Vision Language Models

平衡思维：提升视觉语言模型中的思维链训练

Authors: Shaked Perek, Ben Wiesel, Avihu Dekel, Nimrod Shabtay, Eli Schwartz
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.18656
Pdf link: https://arxiv.org/pdf/2603.18656
Abstract Multimodal reasoning in vision-language models (VLMs) typically relies on a two-stage process: supervised fine-tuning (SFT) and reinforcement learning (RL). In standard SFT, all tokens contribute equally to the loss, even though reasoning data are inherently token-imbalanced. Long traces overshadow short but task-critical segments, leading to verbose reasoning and inaccurate answers. We propose SCALe (Scheduled Curriculum Adaptive Loss), which explicitly separates supervision over reasoning and answer segments using dynamic, length-independent weighting. Unlike vanilla SFT, which overweights the reasoning segment, SCALe-SFT gradually shifts the focus from the reasoning segment to the answer segment throughout training via a cosine scheduling policy, encouraging concise and well-grounded reasoning. We evaluate SCALe across diverse benchmarks and architectures. Results show that SCALe consistently improves accuracy over vanilla SFT and matches the performance of the full two-phase SFT + GRPO pipeline while requiring only about one-seventh of the training time, making it a lightweight yet effective alternative. When combined with GRPO, SCALe achieves the best overall performance, highlighting its value both as a standalone method and as a strong foundation for reinforcement refinement.
中文摘要 视觉语言模型（VLMs）中的多模态推理通常依赖于两个阶段的过程：监督微调（SFT）和强化学习（RL）。在标准SFT中，所有token对损失的贡献均等，尽管推理数据本质上存在token不平衡。冗长的推理轨迹掩盖了简短但对任务至关重要的片段，导致推理冗长、答案不准确。我们提出了SCALe（Scheduled Curriculum Adaptive Loss），通过动态且与长度无关的加权，显式分离对推理片段和答案片段的监督。与过度加权推理片段的标准SFT不同，SCALe-SFT通过余弦调度策略在训练过程中逐步将重点从推理片段转移到答案片段，鼓励简洁且有据可依的推理。我们在多种基准和架构上评估SCALe。结果显示，SCALe相比标准SFT持续提升准确率，性能可媲美完整的两阶段SFT+GRPO流水线，同时仅需约七分之一的训练时间，使其成为轻量但有效的替代方案。与GRPO结合时，SCALe实现了最佳的整体性能，凸显了其既可作为独立方法、又可作为强化学习精调坚实基础的价值。
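
A cosine-scheduled, length-independent weighting between reasoning and answer segments, in the spirit of what the SCALe abstract describes, might look like the sketch below. The exact schedule and loss are not given in the abstract, so the endpoint weights `w_start` and `w_end`, the function names, and the use of per-segment mean losses are all assumptions made for illustration.

```python
import math


def reasoning_weight(step, total_steps, w_start=0.9, w_end=0.1):
    """Cosine-anneal the reasoning-segment weight from w_start to w_end.

    w_start and w_end are hypothetical values chosen for this example.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return w_end + (w_start - w_end) * cosine


def scheduled_loss(reasoning_token_losses, answer_token_losses, step, total_steps):
    """Combine per-segment mean losses so segment length does not set the weight."""
    w = reasoning_weight(step, total_steps)
    reasoning_mean = sum(reasoning_token_losses) / len(reasoning_token_losses)
    answer_mean = sum(answer_token_losses) / len(answer_token_losses)
    return w * reasoning_mean + (1.0 - w) * answer_mean


# A long reasoning trace (100 tokens) and a short answer (2 tokens).
reasoning = [2.0] * 100
answer = [1.0] * 2
for step in (0, 50, 100):
    print(step, round(scheduled_loss(reasoning, answer, step, 100), 3))
```

Averaging within each segment before weighting is what keeps a 100-token trace from dominating a 2-token answer, which is the imbalance the abstract attributes to token-level SFT.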

Thinking with Constructions: A Benchmark and Policy Optimization for Visual-Text Interleaved Geometric Reasoning

构造思维：视觉文本交错几何推理的基准与策略优化

Authors: Haokun Zhao, Wanshi Xu, Haidong Yuan, Songjun Cao, Long Ma, Yanghua Xiao
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.18662
Pdf link: https://arxiv.org/pdf/2603.18662
Abstract Geometric reasoning inherently requires "thinking with constructions" -- the dynamic manipulation of visual aids to bridge the gap between problem conditions and solutions. However, existing Multimodal Large Language Models (MLLMs) are largely confined to passive inference with static diagrams, lacking the strategic knowledge of when and how to construct effective visual aids. To address this, we present a framework for Visual-Text Interleaved Chain-of-Thought. We first introduce GeoAux-Bench, the first benchmark comprising 4,334 geometry problems that aligns textual construction steps with ground-truth visual updates. Our pilot study reveals two critical insights: (1) interleaved visual-textual aids outperform single-modality counterparts, which cannot losslessly capture geometric synergy; and (2) valid constructions act as entropy reducers, strongly correlating with reduced reasoning perplexity. Building on these findings, we propose Action Applicability Policy Optimization (A2PO), a reinforcement learning paradigm for mastering strategic construction. A2PO employs Adaptive Reward Shaping to regulate the timing and quality of visual aids via counterfactual sampling to distinguish necessary from redundant constructions. Experiments demonstrate our approach enables MLLMs to leverage selective auxiliary constructions, yielding a 3.51% gain over strong baselines. Code and data are available on GitHub.
中文摘要 几何推理本质上需要“用构造思维”——动态操控视觉辅助工具，以弥合问题条件与解决方案之间的鸿沟。然而，现有的多模态大型语言模型（MLLM）大多局限于静态图示的被动推断，缺乏何时以及如何构建有效视觉辅助工具的战略性知识。为此，我们提出了一个视觉-文本交错思维链的框架。我们首先介绍 GeoAux-Bench，这是首个包含 4,334 个几何问题的基准测试，将文本构建步骤与真实的视觉更新对齐。我们的试点研究揭示了两个关键见解：（1）交错视觉-文本辅助工具优于单模态辅助工具，后者无法无损捕捉几何协同效应;以及（2）有效的构造作为熵约简器，与推理困惑度减少密切相关。基于这些发现，我们提出了行动适用性策略优化（A2PO），这是一种强化学习范式，用于掌握战略构建。A2PO采用自适应奖励塑造技术，通过反事实抽样调节视觉辅助工具的时机和质量，以区分必要的与冗余的构造。实验表明，我们的方法使多层次级语言模型能够利用选择性辅助结构，较强基线提升3.51%。代码和数据可在GitHub上获取。

HISR: Hindsight Information Modulated Segmental Process Rewards For Multi-turn Agentic Reinforcement Learning

HISR：多回合代理强化学习的事后信息调制分段过程奖励

Authors: Zhicong Lu, Zichuan Lin, Wei Jia, Changyuan Tian, Deheng Ye, Peiguang Li, Li Jin, Nayu Liu, Guangluan Xu, Wei Feng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.18683
Pdf link: https://arxiv.org/pdf/2603.18683
Abstract While large language models excel in diverse domains, their performance on complex long-horizon agentic decision-making tasks remains limited. Most existing methods concentrate on designing effective reward models (RMs) to advance performance via multi-turn reinforcement learning. However, they suffer from delayed propagation in sparse outcome rewards and unreliable credit assignment with potentially overly fine-grained and unfocused turn-level process rewards. In this paper, we propose HISR, which exploits Hindsight Information to modulate Segmental process Rewards, closely aligning rewards with sub-goals and emphasizing significant segments to enhance the reliability of credit assignment. Specifically, a segment-level process RM is presented to assign rewards for each sub-goal in the task, avoiding excessively granular allocation to turns. To emphasize significant segments in the trajectory, a hindsight model is devised to reflect the preference for performing a certain action after knowing the trajectory outcome. With this characteristic, we design the ratios of sequence likelihoods between the hindsight and policy models to measure action importance. The ratios are subsequently employed to aggregate segment importance scores, which in turn modulate segmental process rewards, enhancing credit assignment reliability. Extensive experimental results on three public benchmarks demonstrate the validity of our method.
中文摘要 虽然大型语言模型在多个领域表现出色，但在复杂的长时程智能体决策任务中表现仍然有限。大多数现有方法集中于设计有效的奖励模型（RMs），通过多回合强化学习提升表现。然而，它们存在稀疏结果奖励的延迟传播问题，以及因回合级过程奖励可能过于细粒度且缺乏聚焦而导致的信用分配不可靠问题。本文提出HISR，利用事后信息（Hindsight Information）调制分段过程奖励（Segmental process Rewards），使奖励与子目标紧密对齐，并突出重要分段，以提升信用分配的可靠性。具体来说，我们提出一个分段级过程RM，为任务中的每个子目标分配奖励，避免对各回合进行过于细粒度的分配。为了强调轨迹中的重要分段，我们设计了一个事后模型，反映在已知轨迹结果后执行某一动作的偏好。基于这一特性，我们设计了事后模型与策略模型之间的序列似然比，以衡量动作的重要性。这些比值随后被用来聚合分段重要性分数，进而调制分段过程奖励，提升信用分配的可靠性。在三个公开基准上的大量实验结果证明了我们方法的有效性。
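
The chain described in the HISR abstract (likelihood ratios between a hindsight model and the policy, aggregated into segment scores that modulate segment rewards) can be sketched as follows. The abstract does not specify the aggregation or normalization, so the mean-over-actions aggregation, the mean-one normalization, and all function names are assumptions for illustration only.

```python
import math


def action_importance(hindsight_logps, policy_logps):
    """Likelihood ratio p_hindsight / p_policy for each action."""
    return [math.exp(h - p) for h, p in zip(hindsight_logps, policy_logps)]


def modulate_segment_rewards(segment_rewards, segments, hindsight_logps, policy_logps):
    """Scale each segment reward by its normalized importance score.

    segments lists the action indices belonging to each segment.
    Aggregation (mean) and normalization (weights average to 1) are assumed.
    """
    ratios = action_importance(hindsight_logps, policy_logps)
    scores = [sum(ratios[i] for i in idx) / len(idx) for idx in segments]
    total = sum(scores)
    weights = [s * len(scores) / total for s in scores]
    return [r * w for r, w in zip(segment_rewards, weights)]


# Hypothetical log-probabilities for three actions split into two segments.
hindsight = [math.log(0.5), math.log(0.5), math.log(0.2)]
policy = [math.log(0.25), math.log(0.25), math.log(0.2)]
rewards = modulate_segment_rewards([1.0, 1.0], [[0, 1], [2]], hindsight, policy)
print([round(r, 3) for r in rewards])
```

In this toy case the first segment contains the actions the hindsight model prefers more strongly than the policy does, so its reward is scaled up relative to the second segment.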

CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks

因果RM：基于观察用户反馈的因果理论RLHF奖励建模

Authors: Hao Wang, Licheng Pan, Zhichao Chen, Chunyuan Zheng, Zhixuan Chu, Xiaoxi Li, Yuan Lu, Xinggao Liu, Haoxuan Li, Zhouchen Lin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2603.18736
Pdf link: https://arxiv.org/pdf/2603.18736
Abstract Despite the success of reinforcement learning from human feedback (RLHF) in aligning language models, current reward modeling heavily relies on experimental feedback data collected from human annotators under controlled and costly conditions. In this work, we introduce observational reward modeling -- learning reward models with observational user feedback (e.g., clicks, copies, and upvotes) -- as a scalable and cost-effective alternative. We identify two fundamental challenges in this setting: (1) observational feedback is noisy due to annotation errors, causing it to deviate from true user preference; (2) observational feedback is biased by user preference, where users preferentially provide feedback on responses they feel strongly about, which creates a distribution shift between training and inference data. To address these challenges, we propose CausalRM, a causal-theoretic reward modeling framework that aims to learn unbiased reward models from observational feedback. To tackle challenge (1), CausalRM introduces a noise-aware surrogate loss term that is provably equivalent to the primal loss under noise-free conditions by explicitly modeling the annotation error generation process. To tackle challenge (2), CausalRM uses propensity scores -- the probability of a user providing feedback for a given response -- to reweight training samples, yielding a loss function that eliminates user preference bias. Extensive experiments across diverse LLM backbones and benchmark datasets validate that CausalRM effectively learns accurate reward signals from noisy and biased observational feedback and delivers substantial performance improvements on downstream RLHF tasks -- including a 49.2% gain on WildGuardMix and a 32.7% improvement on HarmBench. Code is available on our project website.
中文摘要 尽管人类反馈强化学习（RLHF）在比对语言模型方面取得了成功，当前的奖励建模仍高度依赖于在受控且成本高昂条件下从人类注释者收集的实验反馈数据。在本研究中，我们引入了观察性奖励建模——通过观察性用户反馈（如点击、复制和点赞）学习奖励模型，作为一种可扩展且具成本效益的替代方案。我们识别出两个基本挑战：（1）由于注释错误导致观察反馈噪声较大，偏离了用户的真实偏好;（2）观察反馈受用户偏好偏向，用户优先对自己强烈认同的回答提供反馈，这导致训练数据与推断数据之间的分布发生偏移。为应对这些挑战，我们提出了因果RM，一种因果理论奖励建模框架，旨在从观察反馈中学习无偏奖励模型。为应对挑战（1），CausalRM引入了一个基于噪声感知的替代损耗项，该项通过显式建模注释错误生成过程，在无噪声条件下可证明与原始损失等价。为了应对挑战（2），CausalRM使用倾向评分——即用户对某一反应提供反馈的概率——来重新加权训练样本，从而得到一个消除用户偏好偏差的损失函数。在多种LLM骨干和基准数据集上的大量实验验证了CausalRM能够有效从噪声和偏见的观察反馈中学习准确的奖励信号，并在下游RLHF任务中实现显著性能提升——包括WildGuardMix提升49.2%，HarmBench提升32.7%。代码可在我们的项目网站上获取。
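
Propensity-based reweighting of the kind mentioned in the CausalRM abstract is commonly realized as inverse propensity weighting. The sketch below assumes that form; the abstract does not state the exact loss, so the self-normalized estimator, the clipping floor `min_propensity`, and the function names are illustrative assumptions rather than the paper's method.

```python
import math


def bce(pred, label):
    """Binary cross-entropy for one prediction, clipped for numerical safety."""
    eps = 1e-12
    pred = min(max(pred, eps), 1.0 - eps)
    return -(label * math.log(pred) + (1 - label) * math.log(1.0 - pred))


def ipw_loss(preds, labels, propensities, min_propensity=0.05):
    """Self-normalized inverse-propensity-weighted loss (assumed form).

    Samples that users were unlikely to give feedback on get larger weights,
    so the observed data better reflect the full response distribution.
    """
    weights = [1.0 / max(p, min_propensity) for p in propensities]
    total_weight = sum(weights)
    weighted = sum(w * bce(q, y) for w, q, y in zip(weights, preds, labels))
    return weighted / total_weight


# Hypothetical reward-model outputs, labels, and feedback propensities.
preds = [0.9, 0.6]
labels = [1, 1]
print(round(ipw_loss(preds, labels, [0.8, 0.2]), 4))
```

The clipping floor is a common practical guard against very small propensities producing unstable weights; whether CausalRM uses one is not stated in the abstract.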

Memento-Skills: Let Agents Design Agents

Memento-Skills：让智能体设计智能体

Authors: Huichi Zhou, Siyuan Guo, Anjie Liu, Zhongwei Yu, Ziqin Gong, Bowen Zhao, Zhixun Chen, Menglong Zhang, Yihang Chen, Jinsong Li, Runyu Yang, Qiangbin Liu, Xinlei Yu, Jianmin Zhou, Na Wang, Chunyang Sun, Jun Wang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.18743
Pdf link: https://arxiv.org/pdf/2603.18743
Abstract We introduce Memento-Skills, a generalist, continually-learnable LLM agent system that functions as an agent-designing agent: it autonomously constructs, adapts, and improves task-specific agents through experience. The system is built on a memory-based reinforcement learning framework with stateful prompts, where reusable skills (stored as structured markdown files) serve as persistent, evolving memory. These skills encode both behaviour and context, enabling the agent to carry forward knowledge across interactions. Starting from simple elementary skills (like Web search and terminal operations), the agent continually improves via the Read-Write Reflective Learning mechanism introduced in Memento 2 \cite{wang2025memento2}. In the read phase, a behaviour-trainable skill router selects the most relevant skill conditioned on the current stateful prompt; in the write phase, the agent updates and expands its skill library based on new experience. This closed-loop design enables continual learning without updating LLM parameters, as all adaptation is realised through the evolution of externalised skills and prompts. Unlike prior approaches that rely on human-designed agents, Memento-Skills enables a generalist agent to design agents end-to-end for new tasks. Through iterative skill generation and refinement, the system progressively improves its own capabilities. Experiments on the General AI Assistants benchmark and Humanity's Last Exam demonstrate sustained gains, achieving 26.2% and 116.2% relative improvements in overall accuracy, respectively. Code is available at this https URL.
中文摘要 我们介绍了Memento-Skills，一个通用的、可持续学习的LLM智能体系统，它充当“设计智能体的智能体”：通过经验自主构建、适配并改进面向特定任务的智能体。该系统建立在基于记忆的强化学习框架之上，采用有状态提示（stateful prompts），其中可复用技能（以结构化Markdown文件存储）作为持久且不断演进的记忆。这些技能既编码行为，也编码上下文，使智能体能够在多次交互之间传递知识。从简单的基础技能（如网页搜索和终端操作）开始，智能体通过Memento 2 \cite{wang2025memento2} 中引入的Read-Write Reflective Learning机制不断改进。在read阶段，可进行行为训练的技能路由器根据当前有状态提示选择最相关的技能；在write阶段，智能体根据新经验更新并扩展其技能库。这种闭环设计实现了无需更新LLM参数的持续学习，因为所有适应都是通过外化技能和提示的演进实现的。与以往依赖人工设计智能体的方法不同，Memento-Skills使通用智能体能够为新任务端到端地设计智能体。通过迭代的技能生成和完善，系统逐步提升自身能力。在General AI Assistants基准和Humanity's Last Exam上的实验显示出持续的提升，整体准确率分别取得了26.2%和116.2%的相对提升。代码可在此 https URL 访问。

Automatic Configuration of LLM Post-Training Pipelines

LLM后训练流水线的自动配置

Authors: Channe Chwa, Xinle Wu, Yao Lu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.18773
Pdf link: https://arxiv.org/pdf/2603.18773
Abstract LLM post-training pipelines that combine supervised fine-tuning and reinforcement learning are difficult to configure under realistic compute budgets: the configuration space is high-dimensional and heterogeneous, stages are strongly coupled, and each end-to-end evaluation is expensive. We propose AutoPipe, a budget-aware two-stage framework for configuration selection in LLM post-training. Offline, AutoPipe learns a dataset-conditioned learning-to-rank surrogate from historical runs, capturing within-dataset preferences and providing transferable guidance toward promising regions of the configuration space. Online, for a new dataset, AutoPipe uses the offline guidance to steer Bayesian optimization and models dataset-specific deviations with a Gaussian-process residual surrogate. To reduce evaluation cost, each trial is early-stopped and scored by a learned predictor that maps early training signals to a low-cost proxy for final post-training performance. Experiments on biomedical reasoning tasks show that AutoPipe consistently outperforms offline-only baselines and achieves performance comparable to the strongest online HPO baselines while using less than 10% of their computational cost.
中文摘要 结合监督微调和强化学习的LLM后培训流程在现实的计算预算下难以配置：配置空间高维且异构，各阶段强耦合，且每次端到端评估成本高昂。我们提出了AutoPipe，一个预算意识的两阶段框架，用于LLM后培训中的配置选择。离线时，AutoPipe 通过历史运行学习数据集条件学习排名替代，捕捉数据集内偏好，并为配置空间中有前景的区域提供可转移的指导。在线上，对于新数据集，AutoPipe利用离线指导引导贝叶斯优化，并用高斯过程残差代理对数据集特定的偏差进行建模。为降低评估成本，每个试验都会被提前停止，并由一个学习得来的预测变量评分，该预测变量将早期训练信号映射到低成本的最终训练后表现代理指标。生物医学推理任务的实验表明，AutoPipe 持续优于仅离线基线，并以不到 10% 计算成本的最高计算成本，与最强的在线 HPO 基线性能相当。

Mi:dm K 2.5 Pro

Mi：dm K 2.5 Pro

Authors: KT Tech innovation Group
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.18788
Pdf link: https://arxiv.org/pdf/2603.18788
Abstract The evolving LLM landscape requires capabilities beyond simple text generation, prioritizing multi-step reasoning, long-context understanding, and agentic workflows. This shift challenges existing models in enterprise environments, especially in Korean-language and domain-specific scenarios where scaling is insufficient. We introduce Mi:dm K 2.5 Pro, a 32B parameter flagship LLM designed to address enterprise-grade complexity through reasoning-focused optimization. Our methodology builds a robust data foundation via a quality-centric curation pipeline utilizing abstract syntax tree (AST) analysis for code, gap-filling synthesis for mathematics, and an LLM-based quality evaluator. Pre-training scales the model via layer-predictor-based Depth Upscaling (DuS) and a progressive strategy supporting a 128K token context window. Post-training introduces a specialized multi-stage pipeline, including Reasoning SFT, model merging, and asynchronous reinforcement learning (RL), to develop complex problem-solving skills. "Fusion Training" then rebalances these capabilities with conversational fluency, consistent response styling, and reliable tool-use. The evaluations show that Mi:dm K 2.5 Pro achieves competitive performance against leading global and domestic models. In addition, it sets state-of-the-art results on Korean-specific benchmarks, showcasing deep linguistic and cultural understanding. Finally, Responsible AI evaluations validate safety against attacks, ensuring a secure profile for deployment with a balance of harmlessness and responsiveness.
中文摘要 不断发展的LLM领域需要超越简单文本生成的能力，更强调多步推理、长上下文理解和代理工作流程。这一转变挑战了企业环境中现有的模型，尤其是在韩语和领域特定场景中，扩展性不足。我们介绍Mi：dm K 2.5 Pro，一款32B参数的旗舰大型语言模型，旨在通过推理优化解决企业级复杂性。我们的方法论通过以质量为中心的策展流程构建坚实的数据基础，代码采用抽象语法树（AST）分析，数学采用空白合成，并基于LLM的质量评估器。预训练通过基于层预测器的深度放大（DuS）和支持128K令牌上下文窗口的渐进策略来扩展模型。后期培训引入了一套专门的多阶段流程，包括推理SFT、模型合并和异步强化学习（RL），以培养复杂的问题解决能力。“融合训练”随后通过会话流利度、一致的响应风格和可靠的工具使用来重新平衡这些能力。评估显示，Mi：dm K 2.5 Pro 在与全球及国内领先车型竞争中表现出色。此外，它在韩国特定基准上树立了最先进的成果，展现了深厚的语言和文化理解。最后，负责任的AI评估验证了对攻击的安全性，确保部署时的安全性，兼顾无害与响应性。

ProRL Agent: Rollout-as-a-Service for RL Training of Multi-Turn LLM Agents

ProRL 代理：多回合大型语言模型代理强化学习训练的即服务推广

Authors: Hao Zhang, Mingjie Liu, Shaokun Zhang, Songyang Han, Jian Hu, Zhenghui Jin, Yuchi Zhang, Shizhe Diao, Ximing Lu, Binfeng Xu, Zhiding Yu, Jan Kautz, Yi Dong
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.18815
Pdf link: https://arxiv.org/pdf/2603.18815
Abstract Multi-turn LLM agents are increasingly important for solving complex, interactive tasks, and reinforcement learning (RL) is a key ingredient for improving their long-horizon behavior. However, RL training requires generating large numbers of sandboxed rollout trajectories, and existing infrastructures often couple rollout orchestration with the training loop, making systems hard to migrate and maintain. Under the rollout-as-a-service philosophy, we present ProRL Agent, a scalable infrastructure that serves the full agentic rollout lifecycle through an API service. ProRL Agent also provides standardized and extensible sandbox environments that support diverse agentic tasks in rootless HPC settings. We validate ProRL Agent through RL training on software engineering, math, STEM, and coding tasks. ProRL Agent is open-sourced and integrated as part of NVIDIA NeMo Gym.
中文摘要 多回合大型语言模型代理在解决复杂交互任务中日益重要，强化学习（RL）是改善其长期行为的关键因素。然而，强化学习训练需要生成大量沙箱式的部署轨迹，现有基础设施常常将部署编排与训练循环结合，使系统迁移和维护变得困难。在部署即服务理念下，我们介绍了ProRL Agent，这是一个可扩展的基础设施，通过API服务服务完整的代理部署生命周期。ProRL Agent 还提供标准化且可扩展的沙盒环境，支持无根高性能计算（HPC）环境中多样化的代理任务。我们通过强化学习培训验证ProRL代理，涵盖软件工程、数学、STEM和编程任务。ProRL Agent 是开源的，集成于 NVIDIA NeMo Gym 中。

Learn for Variation: Variationally Guided AAV Trajectory Learning in Differentiable Environments

为变异学习：在可微环境中的变分引导AAV轨迹学习

Authors: Xiucheng Wang, Zhenye Chen, Nan Cheng
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.18853
Pdf link: https://arxiv.org/pdf/2603.18853
Abstract Autonomous aerial vehicles (AAVs) empower sixth-generation (6G) Internet-of-Things (IoT) networks through mobility-driven data collection. However, conventional reward-driven reinforcement learning for AAV trajectory planning suffers from severe credit assignment issues and training instability, because sparse scalar rewards fail to capture the long-term and nonlinear effects of sequential movements. To address these challenges, this paper proposes Learn for Variation (L4V), a gradient-informed trajectory learning framework that replaces high-variance scalar reward signals with dense and analytically grounded policy gradients. In particular, the coupled evolution of AAV kinematics, distance-dependent channel gains, and per-user data-collection progress is first unrolled into an end-to-end differentiable computational graph. Backpropagation through time then serves as a discrete adjoint solver, which propagates exact sensitivities from the cumulative mission objective to every control action and policy parameter. These structured gradients are used to train a deterministic neural policy with temporal smoothness regularization and gradient clipping. Extensive simulations demonstrate that L4V consistently outperforms representative baselines, including a genetic algorithm, DQN, A2C, and DDPG, in mission completion time, average transmission rate, and training cost.
中文摘要 自动驾驶飞行器（AAV）通过以出行为驱动的数据收集，赋能第六代（6G）物联网（IoT）网络。然而，传统的奖励驱动强化学习用于AAV轨迹规划存在严重的学分分配问题和训练不稳定性，因为稀疏标量奖励无法捕捉连续运动的长期非线性影响。为应对这些挑战，本文提出了“变异学习”（Learn for Variation，L4V），这是一种梯度知情轨迹学习框架，用密集且分析基础的策略梯度替代高方差标量奖励信号。特别是，AAV运动学、距离依赖信道增益和每用户数据收集进展的耦合演化首先展开为端到端的可微计算图。反向传播随时间可作为离散伴随求解器，将累积任务目标的精确敏感度传播到每个控制动作和策略参数。这些结构化梯度用于训练具有时间平滑性正则化和梯度裁剪的确定性神经策略。大量模拟表明，L4V在任务完成时间、平均传输率和训练成本方面，始终优于代表性基线，包括遗传算法、DQN、A2C和DDPG

RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models

RewardFlow：在大型语言模型下的agentic RL状态图上的拓扑感知奖励传播

Authors: Xiao Feng, Bo Han, Zhanke Zhou, Jiaqi Fan, Jiangchao Yao, Ka Ho Li, Dahai Yu, Michael Kwok-Po Ng
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.18859
Pdf link: https://arxiv.org/pdf/2603.18859
Abstract Reinforcement learning (RL) holds significant promise for enhancing the agentic reasoning capabilities of large language models (LLMs) with external environments. However, the inherent sparsity of terminal rewards hinders fine-grained, state-level optimization. Although process reward modeling offers a promising alternative, training dedicated reward models often entails substantial computational costs and scaling difficulties. To address these challenges, we introduce RewardFlow, a lightweight method for estimating state-level rewards tailored to agentic reasoning tasks. RewardFlow leverages the intrinsic topological structure of states within reasoning trajectories by constructing state graphs. This enables an analysis of state-wise contributions to success, followed by topology-aware graph propagation to quantify contributions and yield objective, state-level rewards. When integrated as dense rewards for RL optimization, RewardFlow substantially outperforms prior RL baselines across four agentic reasoning benchmarks, demonstrating superior performance, robustness, and training efficiency. The implementation of RewardFlow is publicly available at this https URL.
中文摘要 强化学习（RL）在增强大型语言模型（LLMs）在外部环境中的代理推理能力方面具有重大潜力。然而，终端奖励的固有稀疏性阻碍了细粒度的状态级优化。尽管过程奖励建模提供了有前景的替代方案，但训练专用奖励模型往往伴随着较大的计算成本和扩展难度。为应对这些挑战，我们引入了RewardFlow，一种轻量级方法，用于估算针对代理推理任务的状态级奖励。RewardFlow 利用推理轨迹中状态的内在拓扑结构，构建状态图。这使得能够分析各状态对成功的贡献，随后通过拓扑感知图传播来量化贡献并获得客观的状态级奖励。当作为密集奖励用于强化学习优化时，RewardFlow在四个代理推理基准测试中显著优于以往的强化学习基线，展现出卓越的性能、鲁棒性和训练效率。RewardFlow的实现可在此 https URL 公开获取。
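
The abstract above describes building a state graph from reasoning trajectories and propagating success signals over it to obtain state-level rewards. The sketch below is one simple way to illustrate that idea, not the authors' propagation rule: it assumes states are hashable labels, merges all trajectories into one graph, and scores each state by `gamma` raised to its fewest hops to a successful terminal state. The function name `state_rewards` and the `gamma` parameter are hypothetical.

```python
from collections import defaultdict, deque


def state_rewards(trajectories, successes, gamma=0.9):
    """Score each state by gamma ** (fewest hops to a successful terminal state).

    ASSUMPTION: shortest-path discounting stands in for the paper's
    topology-aware propagation, which the abstract does not specify.
    """
    predecessors = defaultdict(set)
    goals = set()
    for states, succeeded in zip(trajectories, successes):
        # Record every observed transition as a reversed edge.
        for src, dst in zip(states, states[1:]):
            predecessors[dst].add(src)
        if succeeded and states:
            goals.add(states[-1])

    # Multi-source breadth-first search backwards from successful terminals.
    hops = {goal: 0 for goal in goals}
    queue = deque(goals)
    while queue:
        node = queue.popleft()
        for prev in predecessors[node]:
            if prev not in hops:
                hops[prev] = hops[node] + 1
                queue.append(prev)

    # States that never reach a successful terminal get zero reward.
    all_states = {s for states in trajectories for s in states}
    return {s: gamma ** hops[s] if s in hops else 0.0 for s in all_states}
```

Because the graph pools transitions across rollouts, a state visited only in a failed trajectory can still receive credit if another trajectory shows a path from it to success.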

Bridging Network Fragmentation: A Semantic-Augmented DRL Framework for UAV-aided VANETs

网络碎片化桥接：无人机辅助VANET的语义增强DRL框架

Authors: Gaoxiang Cao, Wenke Yuan, Huasen He, Yunpeng Hou, Xiaofeng Jiang, Shuangwu Chen, Jian Yang
Subjects: Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2603.18871
Pdf link: https://arxiv.org/pdf/2603.18871
Abstract Vehicular Ad-hoc Networks (VANETs) are the digital cornerstone of autonomous driving, yet they suffer from severe network fragmentation in urban environments due to physical obstructions. Unmanned Aerial Vehicles (UAVs), with their high mobility, have emerged as a vital solution to bridge these connectivity gaps. However, traditional Deep Reinforcement Learning (DRL)-based UAV deployment strategies lack semantic understanding of road topology, often resulting in blind exploration and sample inefficiency. By contrast, Large Language Models (LLMs) possess powerful reasoning capabilities capable of identifying topological importance, though applying them to control tasks remains challenging. To address this, we propose the Semantic-Augmented DRL (SA-DRL) framework. Firstly, we propose a fragmentation quantification method based on Road Topology Graphs (RTG) and Dual Connected Graphs (DCG). Subsequently, we design a four-stage pipeline to transform a general-purpose LLM into a domain-specific topology expert. Finally, we propose the Semantic-Augmented PPO (SA-PPO) algorithm, which employs a Logit Fusion mechanism to inject the LLM's semantic reasoning directly into the policy as a prior, effectively guiding the agent toward critical intersections. Extensive high-fidelity simulations demonstrate that SA-PPO achieves state-of-the-art performance with remarkable efficiency, reaching baseline performance levels using only 26.6% of the training episodes. Ultimately, SA-PPO improves two key connectivity metrics by 13.2% and 23.5% over competing methods, while reducing energy consumption to just 28.2% of the baseline.
中文摘要 车辆自组网络（VANETs）是自动驾驶的数字基石，但在城市环境中由于物理障碍物，网络碎片化严重。无人机（UAV）凭借其高机动性，已成为弥合这些连接空白的重要解决方案。然而，传统的基于深度强化学习（DRL）的无人机部署策略缺乏对道路拓扑的语义理解，常导致盲目探索和样本效率低下。相比之下，大型语言模型（LLMs）具备强大的推理能力，能够识别拓扑重要性，尽管将其应用于控制任务仍具挑战性。为此，我们提出了语义增强DRL（SA-DRL）框架。首先，我们提出了一种基于道路拓扑图（RTG）和双连通图（DCG）的碎片量化方法。随后，我们设计了一个四阶段的流水线，将通用大型语言模型转变为领域特定的拓扑专家。最后，我们提出了语义增强PPO（SA-PPO）算法，该算法采用Logit Fusion机制，将LLM的语义推理直接注入策略，作为先验，有效引导智能体走向关键交叉点。大量高精度仿真表明，SA-PPO以极高效率实现了最先进的性能，仅用26.6%的训练集就能达到基线性能水平。最终，SA-PPO在两项关键连接指标上相较竞争方法分别提升了13.2%和23.5%，同时将能耗降至基线的28.2%。
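
The Logit Fusion mechanism mentioned above injects an LLM-derived prior into the policy. The abstract does not give the fusion formula, so the sketch below assumes the simplest additive form: the prior logits are scaled by a weight `beta` and added to the policy logits before the softmax. The names `fuse_logits` and `beta` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_logits(policy_logits, prior_logits, beta=1.0):
    """Add a weighted prior to the policy logits.

    ASSUMPTION: additive fusion; beta = 0 recovers the plain policy.
    """
    if len(policy_logits) != len(prior_logits):
        raise ValueError("policy and prior must score the same actions")
    return [p + beta * q for p, q in zip(policy_logits, prior_logits)]


# Three candidate intersections; the (made-up) prior favours the last one.
policy_logits = [0.0, 0.0, 0.0]
prior_logits = [-1.0, 0.0, 2.0]
action_probs = softmax(fuse_logits(policy_logits, prior_logits, beta=1.0))
```

Under this form the prior acts as a multiplicative reweighting of the policy's action probabilities, so an untrained (uniform) policy starts out following the prior and can override it as its own logits grow.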

MultihopSpatial: Multi-hop Compositional Spatial Reasoning Benchmark for Vision-Language Model

多跳空间：视觉语言模型的多跳合成空间推理基准

Authors: Youngwan Lee, Soojin Jang, Yoorhim Cho, Seunghwan Lee, Yong-Ju Lee, Sung Ju Hwang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.18892
Pdf link: https://arxiv.org/pdf/2603.18892
Abstract Spatial reasoning is foundational for Vision-Language Models (VLMs), particularly when deployed as Vision-Language-Action (VLA) agents in physical environments. However, existing benchmarks predominantly focus on elementary, single-hop relations, neglecting the multi-hop compositional reasoning and precise visual grounding essential for real-world scenarios. To address this, we introduce MultihopSpatial, offering three key contributions: (1) A comprehensive benchmark designed for multi-hop and compositional spatial reasoning, featuring 1- to 3-hop complex queries across diverse spatial perspectives. (2) Acc@50IoU, a complementary metric that simultaneously evaluates reasoning and visual grounding by requiring both answer selection and precise bounding box prediction - capabilities vital for robust VLA deployment. (3) MultihopSpatial-Train, a dedicated large-scale training corpus to foster spatial intelligence. Extensive evaluation of 37 state-of-the-art VLMs yields eight key insights, revealing that compositional spatial reasoning remains a formidable challenge. Finally, we demonstrate that reinforcement learning post-training on our corpus enhances both intrinsic VLM spatial reasoning and downstream embodied manipulation performance.
中文摘要 空间推理是视觉语言模型（VLM）的基础，尤其是在作为视觉语言行动（VLA）代理部署在物理环境中时。然而，现有基准主要关注基本的单跳关系，忽视了多跳合成推理和精确的视觉基础，这些对现实场景至关重要。为此，我们引入了MultihopSpatial，提出了三项关键贡献：（1）一个为多跳和组合空间推理设计的综合基准，涵盖1至3跳复杂查询，涵盖多样的空间视角。（2）Acc@50IoU，一种互补指标，同时评估推理和视觉基础，要求答案选择和精确边界框预测——这些能力对于稳健的VLA部署至关重要。（3）MultihopSpatial-Train，一个专门用于促进空间智能的大型训练语料库。对37个最先进的VLM进行广泛评估，得出八项关键见解，显示构图空间推理仍是艰巨挑战。最后，我们证明了语料库训练后强化学习不仅提升了VLM的内在空间推理能力，也提升了后续的具身操作表现。
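
The Acc@50IoU metric described above requires both a correct answer and a precise bounding box. A minimal sketch of such a metric follows, under stated assumptions: boxes are `(x1, y1, x2, y2)` tuples, each sample is a dict with hypothetical `"answer"` and `"box"` keys, and a sample counts only if the answer matches and the box IoU is at least 0.5. The benchmark's exact protocol may differ.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def acc_at_50_iou(predictions, references, threshold=0.5):
    """Fraction of samples with a correct answer AND box IoU >= threshold.

    ASSUMPTION: the "answer"/"box" dict layout is illustrative only.
    """
    hits = 0
    for pred, ref in zip(predictions, references):
        answer_ok = pred["answer"] == ref["answer"]
        box_ok = iou(pred["box"], ref["box"]) >= threshold
        hits += answer_ok and box_ok
    return hits / len(references) if references else 0.0
```

Scoring the conjunction rather than the two parts separately means a model cannot earn credit by guessing the right option while grounding it on the wrong object.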

Context Bootstrapped Reinforcement Learning

上下文引导强化学习

Authors: Saaket Agashe, Jayanth Srinivasa, Gaowen Liu, Ramana Kompella, Xin Eric Wang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.18953
Pdf link: https://arxiv.org/pdf/2603.18953
Abstract Reinforcement Learning from Verifiable Rewards (RLVR) suffers from exploration inefficiency, where models struggle to generate successful rollouts, resulting in minimal learning signal. This challenge is particularly severe for tasks that require the acquisition of novel reasoning patterns or domain-specific knowledge. To address this, we propose Context Bootstrapped Reinforcement Learning (CBRL), which augments RLVR training by stochastically prepending few-shot demonstrations to training prompts. The injection probability follows a curriculum that starts high to bootstrap early exploration, then anneals to zero so the model must ultimately succeed without assistance. This forces the policy to internalize reasoning patterns from the demonstrations rather than relying on them at test time. We validate CBRL across two model families and five Reasoning Gym tasks. Our results demonstrate that CBRL consistently improves success rate, provides better exploration efficiency, and is algorithm-agnostic. We further demonstrate CBRL's practical applicability on Q, a domain-specific programming language that diverges significantly from mainstream language conventions.
中文摘要 可验证奖励强化学习（RLVR）存在探索效率低落的问题，模型难以产生成功的推广，导致学习信号极少。对于需要获得新颖推理模式或领域特定知识的任务，这一挑战尤为严峻。为此，我们提出了情境自助强化学习（CBRL），它通过随机地在训练提示前插入少量样本演示来增强RLVR训练。注入概率遵循一套课程体系，从高启动早期探索开始，然后退火至零，因此模型最终必须在无辅助的情况下成功。这迫使政策内化演示中的推理模式，而不是在测试时依赖这些模式。我们在两个模型家族和五个推理馆任务中验证了CBRL。我们的结果表明，CBRL持续提升成功率，提供更好的探索效率，并且具有算法无关性。我们进一步展示了CBRL在Q上的实际应用性，Q是一种领域特定编程语言，与主流语言约定有显著差异。
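
The curriculum described above, where few-shot demonstrations are prepended with a probability that starts high and anneals to zero, can be sketched as follows. The abstract does not state the schedule's shape, so a linear anneal is assumed here; `p_start`, `anneal_fraction`, and `k` are hypothetical parameters.

```python
import random


def injection_probability(step, total_steps, p_start=0.9, anneal_fraction=0.5):
    """Linearly anneal from p_start to 0 over the first part of training.

    ASSUMPTION: a linear schedule; the paper may use a different curve.
    """
    anneal_steps = max(1, int(total_steps * anneal_fraction))
    remaining = max(0.0, 1.0 - step / anneal_steps)
    return p_start * remaining


def build_prompt(task_prompt, demonstrations, step, total_steps, rng, k=2):
    """With the scheduled probability, prepend k few-shot demonstrations."""
    if demonstrations and rng.random() < injection_probability(step, total_steps):
        shots = rng.sample(demonstrations, min(k, len(demonstrations)))
        return "\n\n".join(shots + [task_prompt])
    return task_prompt


rng = random.Random(0)
demos = ["Q: 1+1? A: 2", "Q: 2+2? A: 4", "Q: 3+3? A: 6"]
late_prompt = build_prompt("Q: 4+4? A:", demos, step=900, total_steps=1000, rng=rng)
```

Once the schedule reaches zero, every training prompt is the bare task prompt, which matches the abstract's requirement that the model ultimately succeed without assistance.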

Maximum-Entropy Exploration with Future State-Action Visitation Measures

最大熵探索及未来状态动作访问测量

Authors: Adrien Bolland, Gaspard Lambrechts, Damien Ernst
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2603.18965
Pdf link: https://arxiv.org/pdf/2603.18965
Abstract Maximum entropy reinforcement learning motivates agents to explore states and actions to maximize the entropy of some distribution, typically by providing additional intrinsic rewards proportional to that entropy function. In this paper, we study intrinsic rewards proportional to the entropy of the discounted distribution of state-action features visited during future time steps. This approach is motivated by two results. First, we show that the expected sum of these intrinsic rewards is a lower bound on the entropy of the discounted distribution of state-action features visited in trajectories starting from the initial states, which we relate to an alternative maximum entropy objective. Second, we show that the distribution used in the intrinsic reward definition is the fixed point of a contraction operator and can therefore be estimated off-policy. Experiments highlight that the new objective leads to improved visitation of features within individual trajectories, in exchange for slightly reduced visitation of features in expectation over different trajectories, as suggested by the lower bound. It also leads to improved convergence speed for learning exploration-only agents. Control performance remains similar across most methods on the considered benchmarks.
中文摘要 最大熵强化学习激励智能体探索状态和动作以最大化某一分布的熵，通常通过提供与该熵函数成比例的额外内在奖励。本文研究了与未来时间步中访问状态-行动特征折现分布熵成正比的内在奖励。这一方法基于两个结果。首先，我们证明这些内在奖励的期望和是从初始状态出发轨迹中访问的状态-行动特征折现分布熵的下界，我们将该轨迹与另一种最大熵目标相关联。其次，我们证明内在奖励定义中使用的分布是收缩算子的不动点，因此可以非策略估计。实验显示，这一新目标导致单个轨迹中特征的访问率提高，而不同轨迹的预期访视略有减少，正如下限所示。这也提高了仅探索智能体学习的收敛速度。在所考虑的基准测试中，大多数方法的控制性能保持相似。
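
The second result above, that the discounted future-visitation distribution is the fixed point of a contraction, can be illustrated in a small tabular setting. The sketch below simplifies the paper's setup: it uses states rather than state-action features, assumes a fixed policy already folded into the transition matrix `P`, and iterates M <- (1 - gamma) P + gamma P M until it converges. The entropy of row `M[s]` then plays the role of the intrinsic reward at `s`. All names are hypothetical.

```python
import math


def future_visitation(P, gamma=0.9, iters=500):
    """Iterate M <- (1 - gamma) * P + gamma * P @ M towards its fixed point.

    M[s][t] is the discounted probability of visiting t at a future step
    when starting from s. ASSUMPTION: tabular states, fixed policy.
    """
    n = len(P)
    M = [[1.0 / n] * n for _ in range(n)]
    for _ in range(iters):
        M = [
            [
                (1.0 - gamma) * P[s][t]
                + gamma * sum(P[s][u] * M[u][t] for u in range(n))
                for t in range(n)
            ]
            for s in range(n)
        ]
    return M


def entropy(dist):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


# Two states that alternate deterministically.
P = [[0.0, 1.0], [1.0, 0.0]]
M = future_visitation(P, gamma=0.5)
intrinsic_reward = [entropy(row) for row in M]
```

Each update shrinks the error by a factor of `gamma`, which is why the iteration converges from any starting guess and why the quantity can be learned from off-policy transitions with a bootstrapped target.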

CRAFT: Aligning Diffusion Models with Fine-Tuning Is Easier Than You Think

CRAFT：通过微调对齐扩散模型比你想象的要容易

Authors: Zening Sun, Zhengpeng Xie, Lichen Bai, Shitong Shao, Shuo Yang, Zeke Xie
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.18991
Pdf link: https://arxiv.org/pdf/2603.18991
Abstract Aligning diffusion models has achieved remarkable breakthroughs in generating high-quality, human preference-aligned images. Existing techniques, such as supervised fine-tuning (SFT) and DPO-style preference optimization, have become principled tools for fine-tuning diffusion models. However, SFT relies on high-quality images that are costly to obtain, while DPO-style methods depend on large-scale preference datasets, which are often inconsistent in quality. Beyond data dependency, these methods are further constrained by computational inefficiency. To address these two challenges, we propose Composite Reward Assisted Fine-Tuning (CRAFT), a lightweight yet powerful fine-tuning paradigm that requires significantly reduced training data while maintaining computational efficiency. It first leverages a Composite Reward Filtering (CRF) technique to construct a high-quality and consistent training dataset and then performs an enhanced variant of SFT. We also theoretically prove that CRAFT actually optimizes the lower bound of group-based reinforcement learning, establishing a principled connection between SFT with selected data and reinforcement learning. Our extensive empirical results demonstrate that CRAFT with only 100 samples can easily outperform recent SOTA preference optimization methods with thousands of preference-paired samples. Moreover, CRAFT can even achieve 11-220× faster convergence than the baseline preference optimization methods, highlighting its extremely high efficiency.
中文摘要 对齐扩散模型在生成高质量、符合人类偏好的图像方面取得了显著突破。现有技术，如监督微调（SFT）和DPO风格偏好优化，已成为微调扩散模型的原则性工具。然而，SFT依赖于高质量图像，获取成本高昂，而DPO式方法依赖于大规模偏好数据集，而这些偏好数据集质量常常不稳定。除了数据依赖性外，这些方法还受到计算效率的限制。为应对这两个挑战，我们提出了复合奖励辅助微调（CRAFT），这是一种轻量但强大的微调范式，在保持计算效率的同时，需要显著减少训练数据。它首先利用复合奖励过滤（CRF）技术构建高质量且一致的训练数据集，然后执行SFT的增强变体。我们还理论上证明了CRAFT实际上优化了基于群体的强化学习的下界，建立了SFT与选定数据与强化学习之间的原则性联系。我们丰富的实证结果表明，仅用100个样本的CRAFT就能轻松优于近期SOTA偏好优化方法，尤其是数千个偏好配对样本。此外，CRAFT的收敛速度甚至比基线偏好优化方法快11-220美元/倍，彰显其极高的效率。
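
Composite Reward Filtering, as described above, builds a small high-quality training set by scoring candidates with several rewards. The abstract does not say how the rewards are combined, so the sketch below assumes min-max normalisation per reward followed by a plain average, then keeps the top `keep` samples. The function names and the averaging rule are hypothetical.

```python
def min_max(values):
    """Rescale a list of scores to [0, 1]; constant lists map to 0.5."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def composite_filter(samples, reward_fns, keep=100):
    """Keep the `keep` samples with the highest mean normalised reward.

    ASSUMPTION: equal-weight averaging of min-max-normalised rewards.
    """
    if not reward_fns:
        raise ValueError("at least one reward function is required")
    if not samples:
        return []
    # One normalised score column per reward function.
    per_reward = [min_max([fn(s) for s in samples]) for fn in reward_fns]
    composite = [sum(col) / len(col) for col in zip(*per_reward)]
    ranked = sorted(range(len(samples)), key=lambda i: composite[i], reverse=True)
    return [samples[i] for i in ranked[:keep]]
```

Normalising before averaging keeps a reward with a large numeric range from dominating the selection.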

MoRI: Learning Motivation-Grounded Reasoning for Scientific Ideation in Large Language Models

MoRI：在大型语言模型中学习基于动机的科学构想推理

Authors: Chenyang Gu, Jiahao Cheng, Meicong Zhang, Pujun Zheng, Jinquan Zheng, Guoxiu He
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.19044
Pdf link: https://arxiv.org/pdf/2603.19044
Abstract Scientific ideation aims to propose novel solutions within a given scientific context. Existing LLM-based agentic approaches emulate human research workflows, yet inadequately model scientific reasoning, resulting in surface-level conceptual recombinations that lack technical depth and scientific grounding. To address this issue, we propose MoRI (Motivation-grounded Reasoning for Scientific Ideation), a framework that enables LLMs to explicitly learn the reasoning process from research motivations to methodologies. The base LLM is initialized via supervised fine-tuning to generate a research motivation from a given context, and is subsequently trained under a composite reinforcement learning reward that approximates scientific rigor: (1) entropy-aware information gain encourages the model to uncover and elaborate high-complexity technical details grounded in ground-truth methodologies, and (2) contrastive semantic gain constrains the reasoning trajectory to remain conceptually aligned with scientifically valid solutions. Empirical results show that MoRI significantly outperforms strong commercial LLMs and complex agentic baselines across multiple dimensions, including novelty, technical rigor, and feasibility. The code will be made available on GitHub (this https URL).
中文摘要 科学构思旨在于特定科学背景下提出新颖的解决方案。现有基于LLM的代理方法模拟人类研究工作流程，但未能充分建模科学推理，导致表面概念重组缺乏技术深度和科学基础。为解决这个问题，我们提出了 \textbf{MoRI}（\textbf{Mo}tivation-based \textbf{R}easoning for Scientific \textbf{I}deation），这是一个框架，使大型语言模型能够明确学习从研究动机到方法论的推理过程。基础LLM通过监督微调初始化，从给定语境生成研究动机，随后在接近科学严谨性的复合强化学习奖励下训练：（1）熵感知信息获得鼓励模型发现并完善基于真实方法的高复杂技术细节，（2）对比语义增益限制推理轨迹，使其在科学上保持概念一致有效的解决方案。实证结果表明，MoRI在多个维度上，包括新颖性、技术严谨性和可行性，显著优于强大的商业大型语言模型和复杂的代理基线。代码将在 \href{this https URL}{GitHub} 上公开。

Articulated-Body Dynamics Network: Dynamics-Grounded Prior for Robot Learning

关节体动力学网络：机器人学习的动力学基础先验

Authors: Sangwoo Shin, Kunzhao Ren, Xiaobin Xiong, Josiah Hanna
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.19078
Pdf link: https://arxiv.org/pdf/2603.19078
Abstract Recent work in reinforcement learning has shown that incorporating structural priors for articulated robots, such as link connectivity, into policy networks improves learning efficiency. However, dynamics properties, despite their fundamental role in determining how forces and motion propagate through the body, remain largely underexplored as an inductive bias for policy learning. To address this gap, we present the Articulated-Body Dynamics Network (ABD-Net), a novel graph neural network architecture grounded in the computational structure of forward dynamics. Specifically, we adapt the inertia propagation mechanism from the Articulated Body Algorithm, systematically aggregating inertial quantities from child to parent links in a tree-structured manner, while replacing physical quantities with learnable parameters. Embedding ABD-Net into the policy actor enables dynamics-informed representations that capture how actions propagate through the body, leading to efficient and robust policy learning. Through experiments with simulated humanoid, quadruped, and hopper robots, our approach demonstrates increased sample efficiency and generalization to dynamics shifts compared to transformer-based and GNN baselines. We further validate the learned policy on real Unitree G1 and Go2 robots, state-of-the-art humanoid and quadruped platforms, generating dynamic, versatile and robust locomotion behaviors through sim-to-real transfer with real-time inference.
中文摘要 强化学习的最新研究表明，将关节机器人的结构先验（如链路连接）纳入策略网络，可以提升学习效率。然而，尽管动力学属性在决定力和运动如何在体内传播中扮演着根本性作用，但作为政策学习的归纳偏向，仍然大多缺乏充分探讨。为弥补这一空白，我们提出了关节体动力学网络（ABD-Net），这是一种基于前向动力学计算结构的新颖图神经网络架构。具体来说，我们采用了关节体算法的惯性传播机制，系统地以树状结构方式聚合子链到父链路的惯性量，同时用可学习参数替代物理量。将ABD-NET嵌入策略行为者中，可以实现动态知情的表示，捕捉动作如何在身体中传播，从而实现高效且稳健的策略学习。通过模拟人形机器人、四足机器人和漏斗机器人的实验，我们的方法展示了相较于基于变压器和GNN基线的样本效率和动力学变化的推广性。我们还进一步验证了在真实的Unitree G1和Go2机器人、先进的类人生物和四足平台上所学到的政策，这些机器人通过模拟到现实的转移和实时推断，生成动态、多功能且稳健的运动行为。
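
The child-to-parent aggregation described above follows the leaf-to-root pass of the Articulated Body Algorithm. The sketch below shows only that traversal pattern, with scalars standing in for inertial quantities and a per-link `child_weight` standing in for the learnable parameters. It is not the ABD-Net layer itself; it assumes link 0 is the root and that links are numbered so each parent index is smaller than its child's.

```python
def aggregate_to_root(parent, own_feature, child_weight):
    """Leaf-to-root pass: each link adds weighted aggregates of its children.

    parent[i] is the index of link i's parent (-1 for the root, link 0).
    ASSUMPTION: parent[i] < i, so a single reverse sweep visits every
    child before its parent; scalar features are a simplification.
    """
    aggregated = list(own_feature)
    for link in range(len(parent) - 1, 0, -1):
        aggregated[parent[link]] += child_weight[link] * aggregated[link]
    return aggregated


# A root with two branches: 0 -> 1 -> 2 and 0 -> 3.
parent = [-1, 0, 1, 0]
features = aggregate_to_root(parent, [1.0, 1.0, 1.0, 1.0], [0.0, 0.5, 0.5, 2.0])
```

The reverse sweep costs one update per link, so the pass is linear in the number of links, which is also what makes the original algorithm efficient for tree-structured robots.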

Adaptive Regime-Aware Stock Price Prediction Using Autoencoder-Gated Dual Node Transformers with Reinforcement Learning Control

利用自编码门控对节点变换器和强化学习控制的自适应状态感知股价预测

Authors: Mohammad Al Ridhawi, Mahtab Haj Ali, Hussein Al Osman
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistical Finance (q-fin.ST)
Arxiv link: https://arxiv.org/abs/2603.19136
Pdf link: https://arxiv.org/pdf/2603.19136
Abstract Stock markets exhibit regime-dependent behavior where prediction models optimized for stable conditions often fail during volatile periods. Existing approaches typically treat all market states uniformly or require manual regime labeling, which is expensive and quickly becomes stale as market dynamics evolve. This paper introduces an adaptive prediction framework that identifies deviations from normal market conditions and routes data through specialized prediction pathways. The architecture consists of three components: (1) an autoencoder trained on normal market conditions that identifies anomalous regimes through reconstruction error, (2) dual node transformer networks specialized for stable and event-driven market conditions respectively, and (3) a Soft Actor-Critic reinforcement learning controller that adaptively tunes the regime detection threshold and pathway blending weights based on prediction performance feedback. The reinforcement learning component enables the system to learn adaptive regime boundaries, defining anomalies as market states where standard prediction approaches fail. Experiments on 20 S&P 500 stocks spanning 1982 to 2025 demonstrate that the proposed framework achieves 0.68% MAPE for one-day predictions without the reinforcement controller and 0.59% MAPE with the full adaptive system, compared to 0.80% for the baseline integrated node transformer. Directional accuracy reaches 72% with the complete framework. The system maintains robust performance during high-volatility periods, with MAPE below 0.85% when baseline models exceed 1.5%. Ablation studies confirm that each component contributes meaningfully: autoencoder routing accounts for 36% relative MAPE degradation upon removal, followed by the SAC controller at 15% and the dual-path architecture at 7%.
中文摘要 股市表现出体制依赖性行为，优化为稳定条件的预测模型在波动期常常失效。现有方法通常对所有市场状态一视同仁，或要求手动制度标签，但这成本高昂且随着市场动态演变迅速变得陈旧。本文引入了一种自适应预测框架，能够自适应识别偏离正常市场状况的因素，并将数据通过专门的预测路径路由。该架构由三个组成部分组成：（1）一个基于正常市场条件训练的自编码器，通过重建误差识别异常状态;（2）双节点变换器网络，分别针对稳定和事件驱动的市场条件;（3）软演员-批判强化学习控制器，根据预测性能反馈自适应调整状态检测阈值和路径混合权重。强化学习组件使系统能够学习自适应的体制边界，将异常定义为标准预测方法失效的市场状态。对1982年至2025年间20只标普500股票的实验表明，所提出的框架在无强化控制器的情况下一天预测的MAPE为0.68%，在全自适应系统下为0.59%，而基线集成节点变压器为0.80%。使用完整框架后，方向准确率可达72%。该系统在高波动期保持强劲表现，当基线模型超过1.5%时，MAPE低于0.85%。消融研究证实每个组件都有显著贡献：自编码器路由在移除后解释了36%的相对MAPE退化，其次是SAC控制器15%，双路径架构为7%。
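
Two pieces of the abstract above lend themselves to a short worked example: the MAPE metric used to report results, and the idea of blending two prediction pathways according to an autoencoder's reconstruction error. The MAPE function is the standard definition. The gate is an assumption: a logistic function of the distance between the error and a threshold, with a hypothetical `sharpness` parameter. In the paper the threshold and blending weights are tuned by the SAC controller, which is not modelled here.

```python
import math


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    terms = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(terms) / len(terms)


def blend_weight(reconstruction_error, threshold, sharpness=10.0):
    """Soft gate in (0, 1): near 1 when the error is well above the threshold.

    ASSUMPTION: a logistic gate; the paper's blending rule is not given.
    """
    return 1.0 / (1.0 + math.exp(-sharpness * (reconstruction_error - threshold)))


def blended_forecast(stable_pred, event_pred, reconstruction_error, threshold):
    """Mix the two pathway outputs according to the anomaly gate."""
    w = blend_weight(reconstruction_error, threshold)
    return (1.0 - w) * stable_pred + w * event_pred
```

For scale, the reported drop from 0.80% to 0.59% MAPE is a relative reduction of (0.80 - 0.59) / 0.80 = 26.25%.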

VEPO: Variable Entropy Policy Optimization for Low-Resource Language Foundation Models

VEPO：低资源语言基础模型的变量熵策略优化

Authors: Chonghan Liu, Yimin Du, Qi An, Xin He, Cunqi Zhai, Fei Tan, Weijia Lin, Xiaochun Gong, Yongchao Deng, Shousheng Jia, Xiangzheng Zhang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.19152
Pdf link: https://arxiv.org/pdf/2603.19152
Abstract Large language models frequently exhibit suboptimal performance on low-resource languages, primarily due to inefficient subword segmentation and systemic training data imbalances. In this paper, we propose Variable Entropy Policy Optimization (VEPO), which leverages Reinforcement Learning with Verifiable Rewards to incorporate deterministic structural constraints into the policy alignment process. This framework ensures prescribed sequence length, robust format consistency, and rigorous linguistic well-formedness, all enforced during training. Central to our approach is a variable entropy mechanism that enables the model to dynamically calibrate the equilibrium between literal fidelity and semantic naturalness by modulating the exploration-exploitation manifold. By integrating entropy-tempered advantage estimation with asymmetric clipping, VEPO sustains robust exploration while mitigating policy collapse. Empirical evaluations across 90 FLORES-200 directions, scored with COMET-22 and chrF, demonstrate that VEPO yields substantial improvements in both tokenization efficiency and translation quality, bridging the performance gap for underrepresented languages.
中文摘要 大型语言模型在资源有限的语言上常表现不佳，主要原因是子词分割效率低下和系统性训练数据失衡。本文提出可变熵策略优化（VEPO），利用可验证奖励的强化学习，将确定性结构约束纳入策略对齐过程。该框架确保规定的序列长度、严格的格式一致性以及严格的语言规范性，所有这些都在培训过程中得到严格执行。我们方法的核心是一种可变熵机制，使模型能够动态校准字面忠实度与语义自然性的平衡，通过调制探索利用流形。通过将熵调和优势估计与非对称裁剪相结合，VEPO在缓解政策崩溃的同时，保持了强健的探索。跨90个FLORES-200、COMET-22、chrF方向的实证评估表明，VEPO在标记化效率和翻译质量方面取得了显著提升，弥合了代表性不足语言的性能差距。
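
Asymmetric clipping, mentioned above, is commonly realised as a PPO-style clipped surrogate whose lower and upper bounds on the probability ratio differ. The sketch below shows that general form for a single token or action. It does not include VEPO's entropy-tempered advantage, whose definition the abstract does not give, and the values of `eps_low` and `eps_high` are placeholders rather than the paper's settings.

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.3):
    """PPO-style objective term with separate lower and upper clip ranges.

    ratio is pi_new(a|s) / pi_old(a|s). ASSUMPTION: eps_low and eps_high
    are illustrative values only.
    """
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Taking the minimum keeps the pessimistic (lower) estimate, as in PPO.
    return min(ratio * advantage, clipped * advantage)
```

A wider upper bound lets low-probability actions with positive advantage gain probability mass faster, which is one way such a scheme can sustain exploration.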

Box Maze: A Process-Control Architecture for Reliable LLM Reasoning

Box Maze：一种用于可靠大型语言模型推理的过程控制架构

Authors: Zou Qiang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.19182
Pdf link: https://arxiv.org/pdf/2603.19182
Abstract Large language models (LLMs) demonstrate strong generative capabilities but remain vulnerable to hallucination and unreliable reasoning under adversarial prompting. Existing safety approaches -- such as reinforcement learning from human feedback (RLHF) and output filtering -- primarily operate at the behavioral level and may lack explicit architectural mechanisms for enforcing reasoning process integrity. This paper proposes the Box Maze framework, a conceptual process-control architecture that decomposes LLM reasoning into three explicit layers: memory grounding, structured inference, and boundary enforcement. We introduce preliminary simulation-based evaluation involving progressive boundary erosion scenarios across multiple heterogeneous LLM systems (DeepSeek-V3, Doubao, Qwen). Results from n=50 adversarial scenarios suggest that explicit cognitive control layers may improve consistency in boundary maintenance, with architectural constraints reducing boundary failure rates from approximately 40% (baseline RLHF) to below 1% under adversarial conditions. While current validation is simulation-based, these preliminary results indicate that process-level control may offer a promising direction for improving reliability in large language model reasoning.
中文摘要 大型语言模型（LLMs）展现出强大的生成能力，但在对抗性提示下仍易产生幻觉和不可靠推理。现有的安全方法——如人类反馈强化学习（RLHF）和输出过滤——主要在行为层面工作，可能缺乏显式的架构机制来强制推理过程完整性。本文提出了Box Maze框架，这是一种概念性的过程控制架构，将LLM推理分解为三个明确层次：内存基础、结构化推理和边界执行。我们引入了基于模拟的初步评估，涉及多个异构大型语言模型系统（DeepSeek-V3、斗宝、Qwen）的渐进边界侵蚀场景。n=50个对抗情景的结果表明，显式认知控制层可能提升边界维护的一致性，架构约束将边界失效率从约40%（基线RLHF）降至对抗条件下低于1%。虽然目前的验证基于仿真，但这些初步结果表明，过程级控制可能为提高大型语言模型推理的可靠性提供有前景的方向。

Markov Potential Game and Multi-Agent Reinforcement Learning for Autonomous Driving

自动驾驶的马尔可夫潜在博弈与多智能体强化学习

Authors: Huiwen Yan, Mushuang Liu
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2603.19188
Pdf link: https://arxiv.org/pdf/2603.19188
Abstract Autonomous driving (AD) requires safe and reliable decision-making among interacting agents, e.g., vehicles, bicycles, and pedestrians. Multi-agent reinforcement learning (MARL) modeled by Markov games (MGs) provides a suitable framework to characterize such agents' interactions during decision-making. Nash equilibria (NEs) are often the desired solution in an MG. However, it is typically challenging to compute an NE in general-sum games, unless the game is a Markov potential game (MPG), which ensures the NE attainability under a few learning algorithms such as gradient play. It has nonetheless been an open question how to construct an MPG and whether these construction rules are suitable for AD applications. In this paper, we provide sufficient conditions under which an MG is an MPG and show that these conditions can accommodate general driving objectives for autonomous vehicles (AVs) using highway forced merge scenarios as illustrative examples. A parameter-sharing neural network (NN) structure is designed to enable decentralized policy execution. The trained driving policy from MPGs is evaluated in both simulated and naturalistic traffic datasets. We report comparative studies with single-agent RL and with human drivers whose behaviors are recorded in the traffic datasets.
中文摘要 自动驾驶（AD）需要在相互作用的智能体（例如车辆、自行车和行人）之间做出安全可靠的决策。以马尔可夫博弈（MG）建模的多智能体强化学习（MARL）为刻画此类智能体在决策过程中的交互提供了合适的框架。纳什均衡（NE）通常是MG中期望得到的解。然而，在一般和博弈中计算NE通常很困难，除非该博弈是马尔可夫势博弈（MPG）；MPG能保证在梯度博弈（gradient play）等少数学习算法下可以达到NE。不过，如何构造MPG以及这些构造规则是否适用于AD应用，一直是一个开放问题。本文给出了MG成为MPG的充分条件，并以高速公路强制汇入场景为示例，说明这些条件能够容纳自动驾驶车辆（AV）的一般驾驶目标。我们设计了一种参数共享的神经网络（NN）结构，以实现去中心化的策略执行。由MPG训练得到的驾驶策略在模拟交通数据集和自然驾驶交通数据集上进行了评估。文中分别报告了与单智能体强化学习的对比研究，以及与交通数据集中所记录的人类驾驶员行为的对比研究。

OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards

OS-Themis：面向通用GUI奖励的可扩展评判框架

Authors: Zehao Li, Zhenyu Wu, Yibo Zhao, Bowen Yang, Jingjing Xie, Zhaoyang Liu, Zhoumianze Liu, Kaiming Jin, Jianze Liang, Zonglin Li, Feng Wu, Bowen Zhou, Zun Wang, Zichen Ding
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.19191
Pdf link: https://arxiv.org/pdf/2603.19191
Abstract Reinforcement Learning (RL) has the potential to improve the robustness of GUI agents in stochastic environments, yet training is highly sensitive to the quality of the reward function. Existing reward approaches struggle to achieve both scalability and performance. To address this, we propose OS-Themis, a scalable and accurate multi-agent critic framework. Unlike a single judge, OS-Themis decomposes trajectories into verifiable milestones to isolate critical evidence for decision making and employs a review mechanism to strictly audit the evidence chain before making the final verdict. To facilitate evaluation, we further introduce OmniGUIRewardBench (OGRBench), a holistic cross-platform benchmark for GUI outcome rewards, where all evaluated models achieve their best performance under OS-Themis. Extensive experiments on AndroidWorld show that OS-Themis yields a 10.3% improvement when used to support online RL training, and a 6.9% gain when used for trajectory validation and filtering in the self-training loop, highlighting its potential to drive agent evolution.
中文摘要 强化学习（RL）有潜力提升GUI智能体在随机环境中的鲁棒性，但训练对奖励函数的质量高度敏感。现有的奖励方法难以兼顾可扩展性与性能。为此，我们提出了OS-Themis，一个可扩展且准确的多智能体评判框架。与单一评判者不同，OS-Themis将轨迹分解为可验证的里程碑，以分离出用于决策的关键证据，并采用复核机制，在做出最终裁决之前严格审计证据链。为便于评估，我们进一步引入了OmniGUIRewardBench（OGRBench），这是一个面向GUI结果奖励的全面跨平台基准，所有被评估模型在OS-Themis下均取得最佳性能。在AndroidWorld上的大量实验表明，OS-Themis用于支持在线RL训练时带来10.3%的提升，用于自训练循环中的轨迹验证与过滤时带来6.9%的增益，凸显了其推动智能体演进的潜力。

Keyword: diffusion policy

No results found.