Arxiv Papers of Today

Generated: 2026-03-12 16:48:34 (UTC+8); Arxiv release time: 2026-03-12 20:00 EDT (2026-03-13 08:00 UTC+8)

39 related papers today

Keyword: reinforcement learning

Evolving Demonstration Optimization for Chain-of-Thought Feature Transformation

Authors: Xinyuan Wang, Kunpeng Liu, Arun Vignesh Malarkkan, Yanjie Fu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.09987
Pdf link: https://arxiv.org/pdf/2603.09987
Abstract Feature Transformation (FT) is a core data-centric AI task that improves feature space quality to advance downstream predictive performance. However, discovering effective transformations remains challenging due to the large space of feature-operator combinations. Existing solutions rely on discrete search or latent generation, but they are frequently limited by sample inefficiency, invalid candidates, and redundant generations with limited coverage. Large Language Models (LLMs) offer strong priors for producing valid transformations, but current LLM-based FT methods typically rely on static demonstrations, resulting in limited diversity, redundant outputs, and weak alignment with downstream objectives. We propose a framework that optimizes context data for LLM-driven FT by evolving trajectory-level experiences in a closed loop. Starting from high-performing feature transformation sequences explored by reinforcement learning, we construct and continuously update an experience library of downstream task-verified transformation trajectories, and use a diversity-aware selector to form contexts that, along with a chain-of-thought, guide transformed feature generation toward higher performance. Experiments on diverse tabular benchmarks show that our method outperforms classical and LLM-based baselines and is more stable than one-shot generation. The framework generalizes across API-based and open-source LLMs and remains robust across downstream evaluators.

Personalized Group Relative Policy Optimization for Heterogenous Preference Alignment

Authors: Jialu Wang, Heinrich Peters, Asad A. Butt, Navid Hashemi, Alireza Hashemi, Pouya M. Ghari, Joseph Hoover, James Rae, Morteza Dehghani
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.10009
Pdf link: https://arxiv.org/pdf/2603.10009
Abstract Despite their sophisticated general-purpose capabilities, Large Language Models (LLMs) often fail to align with diverse individual preferences because standard post-training methods, like Reinforcement Learning with Human Feedback (RLHF), optimize for a single, global objective. While Group Relative Policy Optimization (GRPO) is a widely adopted on-policy reinforcement learning framework, its group-based normalization implicitly assumes that all samples are exchangeable, inheriting this limitation in personalized settings. This assumption conflates distinct user reward distributions and systematically biases learning toward dominant preferences while suppressing minority signals. To address this, we introduce Personalized GRPO (P-GRPO), a novel alignment framework that decouples advantage estimation from immediate batch statistics. By normalizing advantages against preference-group-specific reward histories rather than the concurrent generation group, P-GRPO preserves the contrastive signal necessary for learning distinct preferences. We evaluate P-GRPO across diverse tasks and find that it consistently achieves faster convergence and higher rewards than standard GRPO, thereby enhancing its ability to recover and align with heterogeneous preference signals. Our results demonstrate that accounting for reward heterogeneity at the optimization level is essential for building models that faithfully align with diverse human preferences without sacrificing general capabilities.
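
The contrast between batch-level and history-level normalization described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the dictionary-based reward history, and the toy numbers are all assumptions made for illustration, and the abstract does not say how the histories are stored or updated.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-8):
    """GRPO-style normalization: use the current group's own mean and std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def history_advantages(rewards, group_ids, history, eps=1e-8):
    """Normalize each reward against the reward history of its preference group."""
    out = []
    for r, g in zip(rewards, group_ids):
        past = history[g]  # hypothetical per-group reward buffer
        mu, sigma = mean(past), pstdev(past)
        out.append((r - mu) / (sigma + eps))
    return out

# Hypothetical data: group "a" usually scores high, group "b" usually scores low.
history = {"a": [0.8, 0.9, 1.0], "b": [0.1, 0.2, 0.3]}
rewards = [0.9, 0.3]
group_ids = ["a", "b"]

batch_adv = grpo_advantages(rewards)
hist_adv = history_advantages(rewards, group_ids, history)
```

In this toy batch, the response for group "b" is better than that group's own history, yet batch normalization gives it a negative advantage because it is compared with group "a". Normalizing against the group's history gives it a positive advantage instead.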

Cluster-Aware Attention-Based Deep Reinforcement Learning for Pickup and Delivery Problems

Authors: Wentao Wang, Lifeng Han, Guangyu Zou
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.10053
Pdf link: https://arxiv.org/pdf/2603.10053
Abstract The Pickup and Delivery Problem (PDP) is a fundamental and challenging variant of the Vehicle Routing Problem, characterized by tightly coupled pickup-delivery pairs, precedence constraints, and spatial layouts that often exhibit clustering. Existing deep reinforcement learning (DRL) approaches either model all nodes on a flat graph, relying on implicit learning to enforce constraints, or achieve strong performance through inference-time collaborative search at the cost of substantial latency. In this paper, we propose CAADRL (Cluster-Aware Attention-based Deep Reinforcement Learning), a DRL framework that explicitly exploits the multi-scale structure of PDP instances via cluster-aware encoding and hierarchical decoding. The encoder builds on a Transformer and combines global self-attention with intra-cluster attention over depot, pickup, and delivery nodes, producing embeddings that are both globally informative and locally role-aware. Based on these embeddings, we introduce a Dynamic Dual-Decoder with a learnable gate that balances intra-cluster routing and inter-cluster transitions at each step. The policy is trained end-to-end with a POMO-style policy gradient scheme using multiple symmetric rollouts per instance. Experiments on synthetic clustered and uniform PDP benchmarks show that CAADRL matches or improves upon strong state-of-the-art baselines on clustered instances and remains highly competitive on uniform instances, particularly as problem size increases. Crucially, our method achieves these results with substantially lower inference time than neural collaborative-search baselines, suggesting that explicitly modeling cluster structure provides an effective and efficient inductive bias for neural PDP solvers.

Improving Search Agent with One Line of Code

Authors: Jian Li, Dongsheng Chen, Zhenhua Xu, Yizhang Jin, Jiafu Wu, Chengjie Wang, Xiaotong Yuan, Yabiao Wang
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.10069
Pdf link: https://arxiv.org/pdf/2603.10069
Abstract Tool-based Agentic Reinforcement Learning (TARL) has emerged as a promising paradigm for training search agents to interact autonomously with external tools in a multi-turn information-seeking process. However, we identify a critical training instability that leads to catastrophic model collapse: Importance Sampling Distribution Drift (ISDD). In Group Relative Policy Optimization (GRPO), a widely adopted TARL algorithm, ISDD manifests as a precipitous decline in the importance sampling ratios, which nullifies gradient updates and triggers irreversible training failure. To address this, we propose Search Agent Policy Optimization (SAPO), which stabilizes training via a conditional token-level KL constraint. Unlike hard clipping, which ignores distributional divergence, SAPO selectively penalizes the KL divergence between the current and old policies. Crucially, this penalty is applied only to positive tokens with low probabilities where the policy has shifted excessively, thereby preventing distribution drift while preserving gradient flow. Remarkably, SAPO requires only a one-line code modification to standard GRPO, ensuring immediate deployability. Extensive experiments across seven QA benchmarks demonstrate that SAPO achieves a +10.6% absolute improvement (+31.5% relative) over Search-R1, yielding consistent gains across varying model scales (1.5B, 14B) and families (Qwen, LLaMA).
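
To make the idea of a conditional token-level KL penalty concrete, here is a rough sketch. It is not the paper's one-line change: the three conditions, the threshold values (`prob_threshold`, `ratio_threshold`), the coefficient `beta`, and the single-sample KL estimate are all assumptions chosen only to mirror the wording "positive tokens with low probabilities where the policy has shifted excessively".

```python
import math

def conditional_kl_penalty(logp_new, logp_old, advantages,
                           prob_threshold=0.2, ratio_threshold=0.8, beta=0.1):
    """Per-token penalty that is nonzero only where all three conditions hold.

    Conditions (all thresholds are illustrative assumptions):
      1. the token has positive advantage,
      2. its probability under the current policy is low,
      3. the importance ratio p_new / p_old has dropped below ratio_threshold.
    """
    penalties = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)
        drifted = adv > 0 and math.exp(ln) < prob_threshold and ratio < ratio_threshold
        # Single-sample estimate of KL(old || new) at the sampled token.
        kl_est = lo - ln
        penalties.append(beta * kl_est if drifted else 0.0)
    return penalties

# Toy tokens: only the second one is positive, low-probability, and drifted.
logp_old = [math.log(0.5), math.log(0.3), math.log(0.3)]
logp_new = [math.log(0.6), math.log(0.1), math.log(0.1)]
advantages = [1.0, 1.0, -1.0]
penalties = conditional_kl_penalty(logp_new, logp_old, advantages)
```

In a training loop, a term like this would be added to the policy loss. Tokens that fail any condition contribute no penalty, so they keep their ordinary policy-gradient signal.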

Amnesia: Adversarial Semantic Layer Specific Activation Steering in Large Language Models

Authors: Ali Raza, Gurang Gupta, Nikolay Matyunin, Jibesh Patra
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.10080
Pdf link: https://arxiv.org/pdf/2603.10080
Abstract Warning: This article includes red-teaming experiments, which contain examples of compromised LLM responses that may be offensive or upsetting. Large Language Models (LLMs) have the potential to create harmful content, such as generating sophisticated phishing emails and assisting in writing code of harmful computer viruses. Thus, it is crucial to ensure their safe and responsible response generation. To reduce the risk of generating harmful or irresponsible content, researchers have developed techniques such as reinforcement learning with human feedback to align LLM's outputs with human values and preferences. However, it is still undetermined whether such measures are sufficient to prevent LLMs from generating interesting responses. In this study, we propose Amnesia, a lightweight activation-space adversarial attack that manipulates internal transformer states to bypass existing safety mechanisms in open-weight LLMs. Through experimental analysis on state-of-the-art, open-weight LLMs, we demonstrate that our attack effectively circumvents existing safeguards, enabling the generation of harmful content without the need for any fine-tuning or additional training. Our experiments on benchmark datasets show that the proposed attack can induce various antisocial behaviors in LLMs. These findings highlight the urgent need for more robust security measures in open-weight LLMs and underscore the importance of continued research to prevent their potential misuse.

Code-Space Response Oracles: Generating Interpretable Multi-Agent Policies with Large Language Models

Authors: Daniel Hennes, Zun Li, John Schultz, Marc Lanctot
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.10098
Pdf link: https://arxiv.org/pdf/2603.10098
Abstract Recent advances in multi-agent reinforcement learning, particularly Policy-Space Response Oracles (PSRO), have enabled the computation of approximate game-theoretic equilibria in increasingly complex domains. However, these methods rely on deep reinforcement learning oracles that produce 'black-box' neural network policies, making them difficult to interpret, trust or debug. We introduce Code-Space Response Oracles (CSRO), a novel framework that addresses this challenge by replacing RL oracles with Large Language Models (LLMs). CSRO reframes the best response computation as a code generation task, prompting an LLM to generate policies directly as human-readable code. This approach not only yields inherently interpretable policies but also leverages the LLM's pretrained knowledge to discover complex, human-like strategies. We explore multiple ways to construct and enhance an LLM-based oracle: zero-shot prompting, iterative refinement and AlphaEvolve, a distributed LLM-based evolutionary system. We demonstrate that CSRO achieves performance competitive with baselines while producing a diverse set of explainable policies. Our work presents a new perspective on multi-agent learning, shifting the focus from optimizing opaque policy parameters to synthesizing interpretable algorithmic behavior.

CLIPO: Contrastive Learning in Policy Optimization Generalizes RLVR

Authors: Sijia Cui, Pengyu Cheng, Jiajun Song, Yongbo Gai, Guojun Zhang, Zhechao Yu, Jianhe Lin, Xiaoxi Jiang, Guanjun Jiang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.10101
Pdf link: https://arxiv.org/pdf/2603.10101
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capacity of Large Language Models (LLMs). However, RLVR solely relies on final answers as outcome rewards, neglecting the correctness of intermediate reasoning steps. Training on these process-wrong but outcome-correct rollouts can lead to hallucination and answer-copying, severely undermining the model's generalization and robustness. To address this, we incorporate a Contrastive Learning mechanism into the Policy Optimization (CLIPO) to generalize the RLVR process. By optimizing a contrastive loss over successful rollouts, CLIPO steers the LLM to capture the invariant structure shared across correct reasoning paths. This provides a more robust cross-trajectory regularization than the original single-path supervision in RLVR, effectively mitigating step-level reasoning inconsistencies and suppressing hallucinatory artifacts. In experiments, CLIPO consistently improves multiple RLVR baselines across diverse reasoning benchmarks, demonstrating uniform improvements in generalization and robustness for policy optimization of LLMs. Our code and training recipes are available at this https URL.
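
The abstract does not specify which contrastive loss CLIPO optimizes. As one plausible reading, the sketch below uses the common InfoNCE loss over toy rollout embeddings, treating two correct rollouts as a positive pair and incorrect rollouts as negatives. The embeddings, the temperature, and the pairing scheme are all assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: pull the anchor toward the positive, away from the negatives."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

# Hypothetical 2-D embeddings of reasoning rollouts.
correct_a = [1.0, 0.1]
correct_b = [0.9, 0.2]
wrong = [[-0.2, 1.0], [0.0, -1.0]]

loss_aligned = info_nce(correct_a, correct_b, wrong)
loss_misaligned = info_nce(correct_a, wrong[0], [correct_b, wrong[1]])
```

The loss is small when correct rollouts already embed close together and large when a correct rollout is paired with a dissimilar one. That is the pressure toward shared structure across correct reasoning paths that the abstract describes.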

ReMix: Reinforcement routing for mixtures of LoRAs in LLM finetuning

Authors: Ruizhong Qiu, Hanqing Zeng, Yinglong Xia, Yiwen Meng, Ren Chen, Jiarui Feng, Dongqi Fu, Qifan Wang, Jiayi Liu, Jun Xiao, Xiangjun Fan, Benyu Zhang, Hong Li, Zhining Liu, Hyunsik Yoo, Zhichen Zeng, Tianxin Wei, Hanghang Tong
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.10160
Pdf link: https://arxiv.org/pdf/2603.10160
Abstract Low-rank adapters (LoRAs) are a parameter-efficient finetuning technique that injects trainable low-rank matrices into pretrained models to adapt them to new tasks. Mixture-of-LoRAs models expand neural networks efficiently by routing each layer input to a small subset of specialized LoRAs of the layer. Existing Mixture-of-LoRAs routers assign a learned routing weight to each LoRA to enable end-to-end training of the router. Despite their empirical promise, we observe that the routing weights are typically extremely imbalanced across LoRAs in practice, with only one or two LoRAs often dominating the routing weights. This essentially limits the number of effective LoRAs and thus severely hinders the expressive power of existing Mixture-of-LoRAs models. In this work, we attribute this weakness to the nature of learnable routing weights and rethink the fundamental design of the router. To address this critical issue, we propose a new router design that we call Reinforcement Routing for Mixture-of-LoRAs (ReMix). Our key idea is to use non-learnable routing weights to ensure that all active LoRAs are equally effective, with no LoRA dominating the routing weights. However, our routers cannot be trained directly via gradient descent because the routing weights are non-learnable. Hence, we further propose an unbiased gradient estimator for the router by employing the REINFORCE leave-one-out (RLOO) technique, where we regard the supervision loss as the reward and the router as the policy in reinforcement learning. Our gradient estimator also makes it possible to scale up training compute to boost the predictive performance of ReMix. Extensive experiments demonstrate that our proposed ReMix significantly outperforms state-of-the-art parameter-efficient finetuning methods under a comparable number of activated parameters.
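
The leave-one-out baseline behind RLOO can be sketched in a few lines. This shows only the generic advantage computation, not the ReMix router. Using the negative supervised loss as the reward is a sign convention assumed here, and the loss values are made up.

```python
def rloo_advantages(rewards):
    """Leave-one-out baseline: compare each sample with the mean of the others."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

# Hypothetical supervised losses for 4 sampled routing decisions on one input.
losses = [0.9, 0.5, 0.7, 0.3]
rewards = [-loss for loss in losses]  # assumed convention: reward = -loss
adv = rloo_advantages(rewards)
```

Each advantage would then weight the gradient of the log-probability of the corresponding routing decision. Because each baseline excludes the sample it is applied to, the estimator stays unbiased, and drawing more samples per input is one way to spend extra training compute.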

From Prior to Pro: Efficient Skill Mastery via Distribution Contractive RL Finetuning

Authors: Zhanyi Sun, Shuran Song
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.10263
Pdf link: https://arxiv.org/pdf/2603.10263
Abstract We introduce Distribution Contractive Reinforcement Learning (DICE-RL), a framework that uses reinforcement learning (RL) as a "distribution contraction" operator to refine pretrained generative robot policies. DICE-RL turns a pretrained behavior prior into a high-performing "pro" policy by amplifying high-success behaviors from online feedback. We pretrain a diffusion- or flow-based policy for broad behavioral coverage, then finetune it with a stable, sample-efficient residual off-policy RL framework that combines selective behavior regularization with value-guided action selection. Extensive experiments and analyses show that DICE-RL reliably improves performance with strong stability and sample efficiency. It enables mastery of complex long-horizon manipulation skills directly from high-dimensional pixel inputs, both in simulation and on a real robot. Project website: this https URL.
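
One simple reading of "value-guided action selection" is to draw several candidate actions from the pretrained prior and keep the one a learned critic scores highest. The sketch below assumes that reading. The uniform sampler stands in for a diffusion or flow policy, and the quadratic `q_value` is a made-up critic.

```python
import random

def select_action(candidates, q_value, state):
    """Return the candidate action with the highest estimated value."""
    return max(candidates, key=lambda action: q_value(state, action))

# Stand-in for sampling from a pretrained generative policy (assumption).
rng = random.Random(0)
candidates = [rng.uniform(-1.0, 1.0) for _ in range(8)]

# Hypothetical critic: prefers actions close to the scalar state.
def q_value(state, action):
    return -(action - state) ** 2

state = 0.5
best = select_action(candidates, q_value, state)
```

Selecting among prior samples keeps the chosen action inside the behavior the prior already covers, while the critic concentrates probability on the higher-value part of it.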

From Imitation to Intuition: Intrinsic Reasoning for Open-Instance Video Classification

Authors: Ke Zhang, Xiangchen Zhao, Yunjie Tian, Jiayu Zheng, Vishal M. Patel, Di Fu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.10300
Pdf link: https://arxiv.org/pdf/2603.10300
Abstract Conventional video classification models, acting as effective imitators, excel in scenarios with homogeneous data distributions. However, real-world applications often present an open-instance challenge, where intra-class variations are vast and complex, beyond existing benchmarks. While traditional video encoder models struggle to fit these diverse distributions, vision-language models (VLMs) offer superior generalization but have not fully leveraged their reasoning capabilities (intuition) for such tasks. In this paper, we bridge this gap with an intrinsic reasoning framework that evolves open-instance video classification from imitation to intuition. Our approach, namely DeepIntuit, begins with a cold-start supervised alignment to initialize reasoning capability, followed by refinement using Group Relative Policy Optimization (GRPO) to enhance reasoning coherence through reinforcement learning. Crucially, to translate this reasoning into accurate classification, DeepIntuit then introduces an intuitive calibration stage. In this stage, a classifier is trained on the intrinsic reasoning traces generated by the refined VLM, ensuring stable knowledge transfer without distribution mismatch. Extensive experiments demonstrate that for open-instance video classification, DeepIntuit benefits significantly from transcending simple feature imitation and evolving toward intrinsic reasoning. Our project is available at this https URL.

SteadyTray: Learning Object Balancing Tasks in Humanoid Tray Transport via Residual Reinforcement Learning

Authors: Anlun Huang, Zhenyu Wu, Soofiyan Atar, Yuheng Zhi, Michael Yip
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.10306
Pdf link: https://arxiv.org/pdf/2603.10306
Abstract Stabilizing unsecured payloads against the inherent oscillations of dynamic bipedal locomotion remains a critical engineering bottleneck for humanoids in unstructured environments. To solve this, we introduce ReST-RL, a hierarchical reinforcement learning architecture that explicitly decouples locomotion from payload stabilization, evaluated via the SteadyTray benchmark. Rather than relying on monolithic end-to-end learning, our framework integrates a robust base locomotion policy with a dynamic residual module engineered to actively cancel gait-induced perturbations at the end-effector. This architectural separation ensures steady tray transport without degrading the underlying bipedal stability. In simulation, the residual design significantly outperforms end-to-end baselines in gait smoothness and orientation accuracy, achieving a 96.9% success rate in variable velocity tracking and 74.5% robustness against external force disturbances. Successfully deployed on the Unitree G1 humanoid hardware, this modular approach demonstrates highly reliable zero-shot sim-to-real generalization across various objects and external force disturbances.
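
Residual reinforcement learning generally means that a learned correction is added to the output of a fixed base controller. The sketch below shows only that composition step under simple assumptions: the scaling factor, the symmetric action limit, and the example vectors are illustrative and not taken from ReST-RL.

```python
def combine_actions(base_action, residual_action, residual_scale=0.1, limit=1.0):
    """Add a scaled residual correction to the base action and clip to limits."""
    combined = []
    for base, residual in zip(base_action, residual_action):
        value = base + residual_scale * residual
        combined.append(max(-limit, min(limit, value)))
    return combined

# Hypothetical joint targets from a locomotion policy and a residual module.
base = [0.2, -0.5, 0.98]
residual = [1.0, -1.0, 1.0]
action = combine_actions(base, residual)
```

Keeping the residual small relative to the base action is a common way to let the correction cancel disturbances without overriding the locomotion policy.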

ScanDP: Generalizable 3D Scanning with Diffusion Policy

Authors: Itsuki Hirako, Ryo Hakoda, Yubin Liu, Matthew Hwang, Yoshihiro Sato, Takeshi Oishi
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.10390
Pdf link: https://arxiv.org/pdf/2603.10390
Abstract Learning-based 3D scanning plays a crucial role in enabling efficient and accurate scanning of target objects. However, recent reinforcement learning-based methods often require large-scale training data and still struggle to generalize to unseen objects. In this work, we propose a data-efficient 3D scanning framework that uses Diffusion Policy to imitate human-like scanning strategies. To enhance robustness and generalization, we adopt Occupancy Grid Mapping instead of direct point cloud processing, offering improved noise resilience and handling of diverse object geometries. We also introduce a hybrid approach combining a sphere-based space representation with a path optimization procedure that ensures path safety and scanning efficiency. This approach addresses limitations in conventional imitation learning, such as redundant or unpredictable behavior. We evaluate our method on diverse unseen objects in both shape and scale. Our method achieves higher coverage and shorter paths than baselines, while remaining robust to sensor noise. We further confirm practical feasibility and stable operation in real-world execution.
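
Occupancy grid mapping usually stores a log-odds value per cell and adds a fixed increment for each observation, which is one reason it tolerates noisy readings. The sketch below shows that textbook update for a single cell. The sensor-model probabilities (`p_hit`, `p_miss`) and the uniform prior are assumptions, and nothing here is specific to ScanDP.

```python
import math

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))

def update_cell(log_odds, hit, p_hit=0.7, p_miss=0.4):
    """Bayesian log-odds update for one cell given one observation (prior 0.5)."""
    return log_odds + (logit(p_hit) if hit else logit(p_miss))

def probability(log_odds):
    """Convert log-odds back to an occupancy probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

cell = 0.0  # log-odds 0 corresponds to p = 0.5 (unknown)
for observation in [True, True, False, True]:
    cell = update_cell(cell, observation)
p_occupied = probability(cell)
```

A single contradictory reading lowers the estimate without erasing the evidence from the other three, which is the noise resilience the abstract refers to.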

Graph-GRPO: Training Graph Flow Models with Reinforcement Learning

Authors: Baoheng Zhu, Deyu Bo, Delvin Ce Zhang, Xiao Wang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.10395
Pdf link: https://arxiv.org/pdf/2603.10395
Abstract Graph generation is a fundamental task with broad applications, such as drug discovery. Recently, discrete flow matching-based graph generation, a.k.a. graph flow model (GFM), has emerged due to its superior performance and flexible sampling. However, effectively aligning GFMs with complex human preferences or task-specific objectives remains a significant challenge. In this paper, we propose Graph-GRPO, an online reinforcement learning (RL) framework for training GFMs under verifiable rewards. Our method makes two key contributions: (1) We derive an analytical expression for the transition probability of GFMs, replacing the Monte Carlo sampling and enabling fully differentiable rollouts for RL training; (2) We propose a refinement strategy that randomly perturbs specific nodes and edges in a graph, and regenerates them, allowing for localized exploration and self-improvement of generation quality. Extensive experiments on both synthetic and real datasets demonstrate the effectiveness of Graph-GRPO. With only 50 denoising steps, our method achieves 95.0% and 97.5% Valid-Unique-Novelty scores on the planar and tree datasets, respectively. Moreover, Graph-GRPO achieves state-of-the-art performance on the molecular optimization tasks, outperforming graph-based and fragment-based RL methods as well as classic genetic algorithms.

COHORT: Hybrid RL for Collaborative Large DNN Inference on Multi-Robot Systems Under Real-Time Constraints

Authors: Mohammad Saeid Anwar, Anuradha Ravi, Indrajeet Ghosh, Gaurav Shinde, Carl Busart, Nirmalya Roy
Subjects: Robotics (cs.RO); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2603.10436
Pdf link: https://arxiv.org/pdf/2603.10436
Abstract Large deep neural networks (DNNs), especially transformer-based and multimodal architectures, are computationally demanding and challenging to deploy on resource-constrained edge platforms like field robots. These challenges intensify in mission-critical scenarios (e.g., disaster response), where robots must collaborate under tight constraints on bandwidth, latency, and battery life, often without infrastructure or server support. To address these limitations, we present COHORT, a collaborative DNN inference and task-execution framework for multi-robot systems built on the Robot Operating System (ROS). COHORT employs a hybrid offline-online reinforcement learning (RL) strategy to dynamically schedule and distribute DNN module execution across robots. Our key contributions are threefold: (a) Offline RL policy learning combined with Advantage-Weighted Regression (AWR), trained on auction-based task allocation data from heterogeneous DNN workloads across distributed robots, (b) Online policy adaptation via Multi-Agent PPO (MAPPO), initialized from the offline policy and fine-tuned in real time, and (c) comprehensive evaluation of COHORT on vision-language model (VLM) inference tasks such as CLIP and SAM, analyzing scalability with increasing robot/workload and robustness under . We benchmark COHORT against genetic algorithms and multiple RL baselines. Experimental results demonstrate that COHORT reduces battery consumption by 15.4% and increases GPU utilization by 51.67%, while satisfying frame-rate and deadline constraints 2.55 times of the time.
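
Advantage-Weighted Regression turns offline policy learning into weighted supervised learning: logged actions with higher estimated advantage receive exponentially larger weight. The sketch below shows only that weighting step. The temperature `beta`, the weight cap, and the toy numbers are assumptions and are unrelated to COHORT's actual task-allocation data.

```python
import math

def awr_weights(advantages, beta=1.0, max_weight=20.0):
    """Exponentiated-advantage weights, capped to keep the loss well scaled."""
    return [min(math.exp(a / beta), max_weight) for a in advantages]

def weighted_nll(log_probs, weights):
    """Weighted negative log-likelihood of the logged actions."""
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(log_probs)

# Hypothetical advantages of four logged scheduling decisions.
advantages = [1.0, 0.0, -1.0, 5.0]
log_probs = [math.log(0.5)] * 4  # policy log-probabilities of those decisions
weights = awr_weights(advantages)
loss = weighted_nll(log_probs, weights)
```

A policy trained this way stays close to the logged behavior while favoring its better decisions, which makes it a reasonable starting point for later online fine-tuning.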

Muscle Synergy Priors Enhance Biomechanical Fidelity in Predictive Musculoskeletal Locomotion Simulation

Authors: Ilseung Park (1), Eunsik Choi (2), Jangwhan Ahn (3), Jooeun Ahn (2) ((1) Carnegie Mellon University, (2) Seoul National University, (3) UNC-Chapel Hill and NC State University)
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.10474
Pdf link: https://arxiv.org/pdf/2603.10474
Abstract Human locomotion emerges from high-dimensional neuromuscular control, making predictive musculoskeletal simulation challenging. We present a physiology-informed reinforcement-learning framework that constrains control using muscle synergies. We extracted a low-dimensional synergy basis from inverse musculoskeletal analyses of a small set of overground walking trials and used it as the action space for a muscle-driven three-dimensional model trained across variable speeds, slopes and uneven terrain. The resulting controller generated stable gait from 0.7 to 1.8 m/s and on ±6° grades and reproduced condition-dependent modulation of joint angles, joint moments and ground reaction forces. Compared with an unconstrained controller, synergy-constrained control reduced non-physiological knee kinematics and kept knee moment profiles within the experimental envelope. Across conditions, simulated vertical ground reaction forces correlated strongly with human measurements, and muscle-activation timing largely fell within inter-subject variability. These results show that embedding neurophysiological structure into reinforcement learning can improve biomechanical fidelity and generalization in predictive human locomotion simulation with limited experimental data.
中文摘要 人体运动源自高维神经肌肉控制，使得预测性肌肉骨骼模拟颇具挑战性。我们提出了一个基于生理学的强化学习框架，利用肌肉协同来约束控制。我们从一小组地面步行试验的逆向肌肉骨骼分析中提取了低维协同基，并将其作为肌肉驱动三维模型的动作空间，该模型在不同速度、坡度和不平整地形上进行训练。所得控制器在0.7-1.8 m/s的速度范围内以及$\pm$ 6$^{\circ}$的坡度上产生稳定步态，并重现了关节角度、关节力矩和地面反作用力随条件变化的调制。与无约束控制器相比，协同约束控制减少了非生理性的膝关节运动学，并将膝关节力矩曲线保持在实验包络范围内。在各种条件下，模拟的垂直地面反作用力与人体测量值高度相关，肌肉激活时序大体落在受试者间变异范围内。这些结果表明，将神经生理结构嵌入强化学习，可以在实验数据有限的情况下提高预测性人体运动模拟的生物力学保真度和泛化能力。
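
A minimal Python sketch of what "using a synergy basis as the action space" could look like: the policy outputs a few synergy coefficients, which are expanded into per-muscle activations. The basis values, the function name, and the clipping to [0, 1] are illustrative assumptions, not details taken from the paper.

```python
def synergy_to_activations(synergy_basis, coeffs):
    """Expand low-dimensional synergy coefficients into per-muscle activations.

    synergy_basis: list of k synergy vectors, each holding one weight per muscle
                   (hypothetical values; a real basis would be extracted from data).
    coeffs:        list of k coefficients chosen by the policy.
    """
    n_muscles = len(synergy_basis[0])
    activations = []
    for j in range(n_muscles):
        # Weighted sum of the j-th muscle's weight across all synergies.
        a = sum(c * synergy[j] for c, synergy in zip(coeffs, synergy_basis))
        # Assumed convention: muscle activations are clipped to [0, 1].
        activations.append(min(1.0, max(0.0, a)))
    return activations


# Two synergies driving three muscles (toy numbers).
basis = [[1.0, 0.5, 0.0], [0.0, 0.5, 1.0]]
acts = synergy_to_activations(basis, [0.4, 0.8])
```

The policy here controls two numbers instead of three; with dozens of muscles and a handful of synergies the reduction is far larger, which is the constraint the abstract describes.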

IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs

IH-Challenge：一个用于提升前沿大型语言模型指令层级能力的训练数据集

Authors: Chuan Guo, Juan Felipe Ceron Uribe, Sicheng Zhu, Christopher A. Choquette-Choo, Steph Lin, Nikhil Kandpal, Milad Nasr, Rai (Michael Pokorny), Sam Toyer, Miles Wang, Yaodong Yu, Alex Beutel, Kai Xiao
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.10521
Pdf link: https://arxiv.org/pdf/2603.10521
Abstract Instruction hierarchy (IH) defines how LLMs prioritize system, developer, user, and tool instructions under conflict, providing a concrete, trust-ordered policy for resolving instruction conflicts. IH is key to defending against jailbreaks, system prompt extractions, and agentic prompt injections. However, robust IH behavior is difficult to train: IH failures can be confounded with instruction-following failures, conflicts can be nuanced, and models can learn shortcuts such as overrefusing. We introduce IH-Challenge, a reinforcement learning training dataset, to address these difficulties. Fine-tuning GPT-5-Mini on IH-Challenge with online adversarial example generation improves IH robustness by +10.0% on average across 16 in-distribution, out-of-distribution, and human red-teaming benchmarks (84.1% to 94.1%), reduces unsafe behavior from 6.6% to 0.7% while improving helpfulness on general safety evaluations, and saturates an internal static agentic prompt injection evaluation, with minimal capability regression. We release the IH-Challenge dataset (this https URL) to support future research on robust instruction hierarchy.
中文摘要 指令层级（IH）定义了LLM在发生冲突时如何对系统、开发者、用户和工具指令进行优先级排序，为解决指令冲突提供了一个具体的、按信任程度排序的策略。IH是防御越狱、系统提示提取和智能体提示注入的关键。然而，稳健的IH行为难以训练：IH失败可能与指令遵循失败相混淆，冲突可能很微妙，模型也可能学到过度拒绝等捷径。我们引入了IH-Challenge，一个强化学习训练数据集，以解决这些困难。在IH-Challenge上结合在线对抗样本生成对GPT-5-Mini进行微调，使IH稳健性在16个分布内、分布外和人工红队基准上平均提升+10.0%（84.1%至94.1%），将不安全行为从6.6%降至0.7%，同时提升了在一般安全评估中的有用性，并使一项内部静态智能体提示注入评估达到饱和，且能力退化极小。我们发布IH-Challenge数据集（此https URL），以支持未来关于稳健指令层级的研究。

UAV-MARL: Multi-Agent Reinforcement Learning for Time-Critical and Dynamic Medical Supply Delivery

UAV-MARL：多智能体强化学习，用于时间关键且动态的医疗物资配送

Authors: Islam Guven, Mehmet Parlak
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.10528
Pdf link: https://arxiv.org/pdf/2603.10528
Abstract Unmanned aerial vehicles (UAVs) are increasingly used to support time-critical medical supply delivery, providing rapid and flexible logistics during emergencies and resource shortages. However, effective deployment of UAV fleets requires coordination mechanisms capable of prioritizing medical requests, allocating limited aerial resources, and adapting delivery schedules under uncertain operational conditions. This paper presents a multi-agent reinforcement learning (MARL) framework for coordinating UAV fleets in stochastic medical delivery scenarios where requests vary in urgency, location, and delivery deadlines. The problem is formulated as a partially observable Markov decision process (POMDP) in which UAV agents maintain awareness of medical delivery demands while having limited visibility of other agents due to communication and localization constraints. The proposed framework employs Proximal Policy Optimization (PPO) as the primary learning algorithm and evaluates several variants, including asynchronous extensions, classical actor--critic methods, and architectural modifications to analyze scalability and performance trade-offs. The model is evaluated using real-world geographic data from selected clinics and hospitals extracted from the OpenStreetMap dataset. The framework provides a decision-support layer that prioritizes medical tasks, reallocates UAV resources in real time, and assists healthcare personnel in managing urgent logistics. Experimental results show that classical PPO achieves superior coordination performance compared to asynchronous and sequential learning strategies, highlighting the potential of reinforcement learning for adaptive and scalable UAV-assisted healthcare logistics.
中文摘要 无人机（UAV）越来越多地用于支持时间关键的医疗物资配送，在紧急情况和资源短缺时提供快速且灵活的物流支持。然而，有效部署无人机机队需要能够对医疗请求进行优先级排序、分配有限空中资源并在不确定的运行条件下调整配送计划的协调机制。本文提出了一种多智能体强化学习（MARL）框架，用于在请求的紧急程度、地点和交付截止时间各不相同的随机医疗配送场景中协调无人机机队。该问题被表述为一个部分可观测马尔可夫决策过程（POMDP），其中无人机智能体保持对医疗配送需求的感知，但由于通信和定位限制，对其他智能体的可见性有限。该框架采用近端策略优化（PPO）作为主要学习算法，并评估了多种变体，包括异步扩展、经典的actor-critic方法以及架构修改，以分析可扩展性和性能权衡。该模型使用从OpenStreetMap数据集中提取的部分诊所和医院的真实地理数据进行评估。该框架提供了一个决策支持层，可对医疗任务进行优先级排序，实时重新分配无人机资源，并协助医护人员管理紧急物流。实验结果表明，经典PPO在协调性能上优于异步和顺序学习策略，凸显了强化学习在自适应且可扩展的无人机辅助医疗物流中的潜力。
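
For readers unfamiliar with PPO, the abstract's primary algorithm, the following Python sketch shows the standard clipped surrogate objective. It is the textbook form only; the function names and the default `clip_eps=0.2` are assumptions for illustration, and nothing here reflects the paper's multi-agent setup or hyperparameters.

```python
def ppo_clipped_term(ratio, advantage, clip_eps=0.2):
    """Per-sample PPO clipped surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


def ppo_clipped_objective(ratios, advantages, clip_eps=0.2):
    """Batch average of the clipped surrogate (the quantity PPO maximizes)."""
    terms = [ppo_clipped_term(r, a, clip_eps) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)


# A large probability ratio with positive advantage is capped at (1 + eps) * A.
capped = ppo_clipped_term(1.5, 1.0)
# A small ratio with negative advantage keeps the more pessimistic clipped value.
pessimistic = ppo_clipped_term(0.5, -1.0)
```

Taking the minimum of the clipped and unclipped terms removes the incentive to move the new policy far from the old one within a single update.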

Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning

无权衡地应对长度膨胀：面向强化学习的组相对奖励重缩放

Authors: Zichao Li, Jie Lou, Fangchen Dong, Zhiyuan Fan, Mengjie Ren, Hongyu Lin, Xianpei Han, Debing Zhang, Le Sun, Yaojie Lu, Xing Yu
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.10535
Pdf link: https://arxiv.org/pdf/2603.10535
Abstract Reinforcement learning significantly enhances LLM capabilities but suffers from a critical issue: length inflation, where models adopt verbosity or inefficient reasoning to maximize rewards. Prior approaches struggle to address this challenge in a general and lossless manner, primarily because additive penalties introduce a compensatory effect that creates optimization shortcuts, while heuristic gating strategies lack generality beyond binary feedback. To bridge this gap, we present Group Relative Reward Rescaling (GR$^3$), which reframes length control as a multiplicative rescaling paradigm, effectively establishing a generalized, continuous, and reward-dependent gating mechanism. To further ensure lossless optimization, we incorporate group-relative regularization and advantage-aware calibration, which dynamically adapt length budgets to instance difficulty and preserve the advantage signal of high-quality trajectories. Empirically, across both RLHF and RLVR settings, GR$^3$ maintains training dynamics and downstream performance comparable to standard GRPO while significantly mitigating length inflation, outperforming state-of-the-art length-regularized baselines.
中文摘要 强化学习显著增强了LLM的能力，但存在一个关键问题：长度膨胀，即模型为了最大化奖励而采用冗长或低效的推理。以往方法难以以通用且无损的方式解决这一挑战，主要因为加法惩罚引入了补偿效应，从而产生优化捷径，而启发式门控策略在二元反馈之外缺乏通用性。为弥合这一差距，我们提出了组相对奖励重缩放（GR$^3$），它将长度控制重新表述为乘法重缩放范式，有效地建立了一种广义、连续且依赖奖励的门控机制。为进一步确保无损优化，我们引入组相对正则化和优势感知校准，根据实例难度动态调整长度预算，并保留高质量轨迹的优势信号。实验表明，在RLHF和RLVR两种设置下，GR$^3$都保持了与标准GRPO相当的训练动态和下游性能，同时显著缓解了长度膨胀，优于最先进的长度正则化基线。
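
The contrast between additive penalties and multiplicative rescaling can be made concrete with a toy Python sketch. The abstract does not give the GR$^3$ formula, so the scale factor `1 / (1 + lam * overshoot)`, the use of the group mean length as the budget, and all function names below are assumptions chosen only to illustrate the idea.

```python
from statistics import mean


def overshoot(length, budget):
    """Relative amount by which a response exceeds the length budget."""
    return max(0.0, length - budget) / budget


def additive_penalty(reward, length, budget, lam=0.5):
    """Additive shaping: a long correct answer can score below a short wrong one."""
    return reward - lam * overshoot(length, budget)


def multiplicative_rescale(reward, length, budget, lam=0.5):
    """Multiplicative shaping: the reward is scaled by a factor in (0, 1]."""
    return reward / (1.0 + lam * overshoot(length, budget))


def rescale_group(rewards, lengths, lam=0.5):
    """Assumed group-relative variant: the budget is the group's mean length."""
    budget = mean(lengths)
    return [multiplicative_rescale(r, n, budget, lam) for r, n in zip(rewards, lengths)]


# Wrong-but-short (reward 0, 100 tokens) vs. correct-but-long (reward 1, 800 tokens).
add_wrong, add_right = additive_penalty(0.0, 100, 200), additive_penalty(1.0, 800, 200)
mul_wrong, mul_right = multiplicative_rescale(0.0, 100, 200), multiplicative_rescale(1.0, 800, 200)
group = rescale_group([1.0, 1.0, 0.0], [100.0, 500.0, 100.0])
```

In this toy setting the additive penalty ranks the wrong short answer above the correct long one, whereas multiplicative scaling can shrink a positive reward but never push it below the reward of a zero-scored answer.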

Learning to Score: Tuning Cluster Schedulers through Reinforcement Learning

学习评分：通过强化学习调优集群调度器

Authors: Martin Asenov, Qiwen Deng, Gingfung Yeung, Adam Barker
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.10545
Pdf link: https://arxiv.org/pdf/2603.10545
Abstract Efficiently allocating incoming jobs to nodes in large-scale clusters can lead to substantial improvements in both cluster utilization and job performance. In order to allocate incoming jobs, cluster schedulers usually rely on a set of scoring functions to rank feasible nodes. Results from individual scoring functions are usually weighted equally, which could lead to sub-optimal deployments as the one-size-fits-all solution does not take into account the characteristics of each workload. Tuning the weights of scoring functions, however, requires expert knowledge and is computationally expensive. This paper proposes a reinforcement learning approach for learning the weights in scheduler scoring algorithms with the overall objective of improving the end-to-end performance of jobs for a given cluster. Our approach is based on percentage improvement reward, frame-stacking, and limiting domain information. We propose a percentage improvement reward to address the objective of multi-step parameter tuning. The inclusion of frame-stacking allows for carrying information across an optimization experiment. Limiting domain information prevents overfitting and improves performance in unseen clusters and workloads. The policy is trained on different combinations of workloads and cluster setups. We demonstrate the proposed approach improves performance on average by 33% compared to fixed weights and 12% compared to the best-performing baseline in a lab-based serverless scenario.
中文摘要 高效地将新到达的作业分配给大规模集群中的节点，可以显著提升集群利用率和作业性能。为了分配新到达的作业，集群调度器通常依赖一组评分函数来对可行节点进行排名。各个评分函数的结果通常权重相等，这可能导致部署不理想，因为一刀切的方案未能考虑每个工作负载的特性。然而，调优评分函数的权重需要专业知识，且计算开销大。本文提出了一种强化学习方法，用于学习调度器评分算法中的权重，总体目标是提升给定集群上作业的端到端性能。我们的方法基于百分比改进奖励、帧堆叠和限制领域信息。我们提出百分比改进奖励，以应对多步参数调优的目标。引入帧堆叠使得信息能够在一次优化实验中持续传递。限制领域信息可防止过拟合，并提升在未见过的集群和工作负载上的性能。该策略在不同的工作负载和集群配置组合上进行训练。我们证明，在基于实验室的无服务器场景中，所提方法相比固定权重平均提升33%的性能，相比表现最佳的基线提升12%。
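
To illustrate why the weights matter, here is a small Python sketch of weighted node scoring plus one plausible reading of a "percentage improvement" reward. The node names, score values, latency figures, and both function names are hypothetical; the paper's exact reward definition is not given in the abstract.

```python
def rank_nodes(scores, weights):
    """Rank nodes by the weighted sum of their scoring-function outputs.

    scores:  {node_name: [score_from_fn_1, score_from_fn_2, ...]}
    weights: one weight per scoring function (the quantity an RL agent would tune).
    """
    totals = {
        node: sum(w * s for w, s in zip(weights, node_scores))
        for node, node_scores in scores.items()
    }
    return sorted(totals, key=totals.get, reverse=True)


def percentage_improvement_reward(baseline_metric, new_metric):
    """Assumed reward: percent reduction of a lower-is-better metric such as latency."""
    return 100.0 * (baseline_metric - new_metric) / baseline_metric


scores = {"node-a": [0.9, 0.2], "node-b": [0.4, 0.8]}
equal_ranking = rank_nodes(scores, [0.5, 0.5])   # equal weights
tuned_ranking = rank_nodes(scores, [0.8, 0.2])   # first scoring function emphasized
reward = percentage_improvement_reward(200.0, 150.0)
```

With equal weights `node-b` is preferred, while emphasizing the first scoring function flips the choice to `node-a`, which is the kind of workload-dependent decision that fixed weights cannot capture.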

Adaptive RAN Slicing Control via Reward-Free Self-Finetuning Agents

通过无奖励自微调智能体实现自适应RAN切片控制

Authors: Yuanhao Li, Haozhe Wang, Geyong Min, Nektarios Georgalas, Wang Miao
Subjects: Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2603.10564
Pdf link: https://arxiv.org/pdf/2603.10564
Abstract The integration of Generative AI models into AI-native network systems offers a transformative path toward achieving autonomous and adaptive control. However, the application of such models to continuous control tasks is impeded by intrinsic architectural limitations, including finite context windows, the lack of explicit reward signals, and the degradation of the long context. This paper posits that the key to unlocking robust continuous control is enabling agents to internalize experience by distilling it into their parameters, rather than relying on prompt-based memory. To this end, we propose a novel self-finetuning framework that enables agentic systems to learn continuously through direct interaction with the environment, bypassing the need for handcrafted rewards. Our framework implements a bi-perspective reflection mechanism that generates autonomous linguistic feedback to construct preference datasets from interaction history. A subsequent preference-based fine-tuning process distills long-horizon experiences into the model's parameters. We evaluate our approach on a dynamic Radio Access Network (RAN) slicing task, a challenging multi-objective control problem that requires the resolution of acute trade-offs between spectrum efficiency, service quality, and reconfiguration stability under volatile network conditions. Experimental results show that our framework outperforms standard Reinforcement Learning (RL) baselines and existing Large Language Model (LLM)-based agents in sample efficiency, stability, and multi-metric optimization. These findings demonstrate the potential of self-improving generative agents for continuous control tasks, paving the way for future AI-native network infrastructure.
中文摘要 将生成式AI模型整合进AI原生网络系统，为实现自主和自适应控制提供了一条变革性路径。然而，将此类模型应用于连续控制任务受到其内在架构限制的阻碍，包括有限的上下文窗口、缺乏显式奖励信号以及长上下文下的性能退化。本文认为，实现稳健连续控制的关键在于让智能体将经验蒸馏到自身参数中加以内化，而非依赖基于提示的记忆。为此，我们提出了一种新的自微调框架，使智能体系统能够通过与环境的直接交互持续学习，无需手工设计奖励。我们的框架实现了一种双视角反思机制，能够生成自主的语言反馈，从交互历史中构建偏好数据集。随后基于偏好的微调过程将长时程经验蒸馏进模型参数中。我们在动态无线接入网（RAN）切片任务上评估了我们的方法，这是一个具有挑战性的多目标控制问题，需要在多变的网络条件下解决频谱效率、服务质量和重配置稳定性之间的尖锐权衡。实验结果显示，我们的框架在样本效率、稳定性和多指标优化方面优于标准强化学习（RL）基线和现有基于大型语言模型（LLM）的智能体。这些发现展示了自我改进的生成式智能体在连续控制任务中的潜力，为未来的AI原生网络基础设施铺平道路。

Safety-critical Control Under Partial Observability: Reach-Avoid POMDP meets Belief Space Control

部分可观测性下的安全关键控制：到达-规避POMDP遇上信念空间控制

Authors: Matti Vahs, Joris Verhagen, Jana Tumova
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.10572
Pdf link: https://arxiv.org/pdf/2603.10572
Abstract Partially Observable Markov Decision Processes (POMDPs) provide a principled framework for robot decision-making under uncertainty. Solving reach-avoid POMDPs, however, requires coordinating three distinct behaviors: goal reaching, safety, and active information gathering to reduce uncertainty. Existing online POMDP solvers attempt to address all three within a single belief tree search, but this unified approach struggles with the conflicting time scales inherent to these objectives. We propose a layered, certificate-based control architecture that operates directly in belief space, decoupling goal reaching, information gathering, and safety into modular components. We introduce Belief Control Lyapunov Functions (BCLFs) that formalize information gathering as a Lyapunov convergence problem in belief space, and show how they can be learned via reinforcement learning. For safety, we develop Belief Control Barrier Functions (BCBFs) that leverage conformal prediction to provide probabilistic safety guarantees over finite horizons. The resulting control synthesis reduces to lightweight quadratic programs solvable in real time, even for non-Gaussian belief representations with dimension $>10^4$. Experiments in simulation and on a space-robotics platform demonstrate real-time performance and improved safety and task success compared to state-of-the-art constrained POMDP solvers.
中文摘要 部分可观测马尔可夫决策过程（POMDP）为机器人在不确定性下的决策提供了原则性框架。然而，求解到达-规避POMDP需要协调三种不同的行为：目标到达、安全，以及为减少不确定性而进行的主动信息收集。现有的在线POMDP求解器试图在单一信念树搜索中同时处理这三者，但这种统一方法难以应对这些目标固有的相互冲突的时间尺度。我们提出了一种分层的、基于证书的控制架构，直接在信念空间中运行，将目标到达、信息收集和安全解耦为模块化组件。我们引入信念控制李雅普诺夫函数（BCLF），将信息收集形式化为信念空间中的李雅普诺夫收敛问题，并展示了如何通过强化学习来学习它们。在安全方面，我们开发了信念控制障碍函数（BCBF），利用共形预测在有限时域内提供概率安全保证。由此得到的控制综合归结为可实时求解的轻量级二次规划，即使对于维数$>10^4$的非高斯信念表示也是如此。仿真和空间机器人平台上的实验表明，与最先进的带约束POMDP求解器相比，该方法具有实时性能，并提高了安全性和任务成功率。
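
The abstract credits conformal prediction for the probabilistic guarantee but does not say how the scores are defined. As background only, a generic split-conformal quantile in Python might look like the sketch below; the calibration scores (for example, errors of a predicted barrier value) and the function name are hypothetical.

```python
import math


def conformal_quantile(calibration_scores, alpha):
    """Split-conformal threshold with finite-sample correction.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest calibration score, so that
    a fresh exchangeable score falls at or below it with probability >= 1 - alpha.
    Returns infinity when the calibration set is too small for the requested level.
    """
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(calibration_scores)[k - 1]


# 19 toy calibration scores: 1.0, 2.0, ..., 19.0
calibration = [float(i) for i in range(1, 20)]
threshold = conformal_quantile(calibration, alpha=0.1)
```

A controller could then tighten a safety margin by such a threshold, though how the paper combines this with BCBFs is not specified in the abstract.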

Does LLM Alignment Really Need Diversity? An Empirical Study of Adapting RLVR Methods for Moral Reasoning

LLM对齐真的需要多样性吗？关于将RLVR方法应用于道德推理的实证研究

Authors: Zhaowei Zhang, Xiaohan Liu, Xuekai Zhu, Junchao Huang, Ceyao Zhang, Zhiyuan Feng, Yaodong Yang, Xiaoyuan Yi, Xing Xie
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.10588
Pdf link: https://arxiv.org/pdf/2603.10588
Abstract Reinforcement learning with verifiable rewards (RLVR) has achieved remarkable success in logical reasoning tasks, yet whether large language model (LLM) alignment requires fundamentally different approaches remains unclear. Given the apparent tolerance for multiple valid responses in moral reasoning, a natural hypothesis is that alignment tasks inherently require diversity-seeking distribution-matching algorithms rather than reward-maximizing policy-based methods. We conduct the first comprehensive empirical study comparing both paradigms on MoReBench. To enable stable RLVR training, we build a rubric-grounded reward pipeline by training a Qwen3-1.7B judge model. Contrary to our hypothesis, we find that distribution-matching approaches do not demonstrate significant advantages over reward-maximizing methods as expected on alignment tasks. Through semantic visualization mapping high-reward responses to semantic space, we demonstrate that moral reasoning exhibits more concentrated high-reward distributions than mathematical reasoning, where diverse solution strategies yield similarly high rewards. This counter-intuitive finding explains why mode-seeking optimization proves equally or more effective for alignment tasks. Our results suggest that alignment tasks do not inherently require diversity-preserving algorithms, and standard reward-maximizing RLVR methods can effectively transfer to moral reasoning without explicit diversity mechanisms.
中文摘要 带有可验证奖励的强化学习（RLVR）在逻辑推理任务中取得了显著成功，但大型语言模型（LLM）对齐是否需要根本不同的方法仍不明确。鉴于道德推理似乎可以容纳多种有效回答，一个自然的假设是，对齐任务本质上需要追求多样性的分布匹配算法，而非最大化奖励的基于策略的方法。我们在MoReBench上进行了首个比较两种范式的综合实证研究。为了实现稳定的RLVR训练，我们通过训练一个Qwen3-1.7B评判模型，构建了基于评分细则的奖励流程。与我们的假设相反，我们发现在对齐任务上，分布匹配方法并未如预期那样表现出相对于奖励最大化方法的显著优势。通过将高奖励回答映射到语义空间的语义可视化，我们证明道德推理的高奖励分布比数学推理更为集中，而在数学推理中，多样的解题策略能获得同样高的奖励。这一反直觉的发现解释了为何模式寻求（mode-seeking）优化在对齐任务中同样有效甚至更有效。我们的结果表明，对齐任务本质上并不需要保持多样性的算法，标准的奖励最大化RLVR方法无需显式的多样性机制即可有效迁移到道德推理。

AdaClearGrasp: Learning Adaptive Clearing for Zero-Shot Robust Dexterous Grasping in Densely Cluttered Environments

AdaClearGrasp：在密集杂乱环境中学习自适应清理以实现零样本鲁棒灵巧抓取

Authors: Zixuan Chen, Wenquan Zhang, Jing Fang, Ruiming Zeng, Zhixuan Xu, Yiwen Hou, Xinke Wang, Jieqi Shi, Jing Huo, Yang Gao
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.10616
Pdf link: https://arxiv.org/pdf/2603.10616
Abstract In densely cluttered environments, physical interference, visual occlusions, and unstable contacts often cause direct dexterous grasping to fail, while aggressive singulation strategies may compromise safety. Enabling robots to adaptively decide whether to clear surrounding objects or directly grasp the target is therefore crucial for robust manipulation. We propose AdaClearGrasp, a closed-loop decision-execution framework for adaptive clearing and zero-shot dexterous grasping in densely cluttered environments. The framework formulates manipulation as a controllable high-level decision process that determines whether to directly grasp the target or first clear surrounding objects. A pretrained vision-language model (VLM) interprets visual observations and language task descriptions to reason about grasp interference and generate a high-level planning skeleton, which invokes structured atomic skills through a unified action interface. For dexterous grasping, we train a reinforcement learning policy with a relative hand-object distance representation, enabling zero-shot generalization across diverse object geometries and physical properties. During execution, visual feedback monitors outcomes and triggers replanning upon failures, forming a closed-loop correction mechanism. To evaluate language-conditioned dexterous grasping in clutter, we introduce Clutter-Bench, the first simulation benchmark with graded clutter complexity. It includes seven target objects across three clutter levels, yielding 210 task scenarios. We further perform sim-to-real experiments on three objects under three clutter levels (18 scenarios). Results demonstrate that AdaClearGrasp significantly improves grasp success rates in densely cluttered environments. For more videos and code, please visit our project website: this https URL.
中文摘要 在密集杂乱的环境中，物理干扰、视觉遮挡和不稳定接触常导致直接灵巧抓取失败，而激进的分离（singulation）策略可能危及安全性。因此，使机器人能够自适应地决定是清理周围物体还是直接抓取目标，对于鲁棒的操作至关重要。我们提出了AdaClearGrasp，一个闭环决策-执行框架，用于在密集杂乱环境中实现自适应清理和零样本灵巧抓取。该框架将操作表述为一种可控的高层决策过程，决定是直接抓取目标还是先清理周围物体。预训练视觉语言模型（VLM）解读视觉观测和语言任务描述，推理抓取干扰并生成高层规划骨架，再通过统一的动作接口调用结构化的原子技能。对于灵巧抓取，我们训练一个采用相对手-物体距离表示的强化学习策略，实现在不同物体几何形状和物理属性上的零样本泛化。执行过程中，视觉反馈监控结果，并在失败时触发重新规划，形成闭环纠错机制。为了评估杂乱环境中语言条件下的灵巧抓取，我们引入了Clutter-Bench，这是首个具有分级杂乱复杂度的仿真基准。它包含七个目标物体，涵盖三个杂乱等级，共产生210个任务场景。我们还在三个杂乱等级下对三个物体（18个场景）进行了仿真到现实的实验。结果表明，AdaClearGrasp在密集杂乱环境中显著提升了抓取成功率。更多视频和代码请访问我们的项目网站：此 https URL。

Reinforcement Learning with Conditional Expectation Reward

带条件期望奖励的强化学习

Authors: Changyi Xiao, Caijun Xu, Yixin Cao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.10624
Pdf link: https://arxiv.org/pdf/2603.10624
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective in enhancing the reasoning capabilities of large language models, particularly in domains such as mathematics where reliable rule-based verifiers can be constructed. However, the reliance on handcrafted, domain-specific verification rules substantially limits the applicability of RLVR to general reasoning domains with free-form answers, where valid answers often exhibit significant variability, making it difficult to establish complete and accurate rules. To address this limitation, we propose Conditional Expectation Reward (CER), which leverages the large language model itself as an implicit verifier, and is therefore applicable to general domains and eliminates the need for external verifiers or auxiliary models. CER is defined as the expected likelihood of generating the reference answer conditioned on the generated answer. In contrast to rule-based verifiers that yield binary feedback, CER provides a soft, graded reward signal that reflects varying degrees of correctness, making it better suited to tasks where answers vary in correctness. Experimental results demonstrate that CER is effective across a wide range of reasoning tasks, spanning both mathematical and general domains, indicating that CER serves as a flexible and general verification mechanism. The code is available at this https URL.
中文摘要 带可验证奖励的强化学习（RLVR）已被证明能有效增强大型语言模型的推理能力，尤其是在数学等可以构建可靠的基于规则的验证器的领域。然而，依赖手工设计的领域特定验证规则，大大限制了RLVR在答案形式自由的一般推理领域中的适用性，这些领域中的有效答案往往差异很大，难以建立完整且准确的规则。为解决这一限制，我们提出了条件期望奖励（CER），它利用大型语言模型自身作为隐式验证器，因此适用于一般领域，并且无需外部验证器或辅助模型。CER定义为在给定生成答案的条件下生成参考答案的期望似然。与产生二元反馈的基于规则的验证器不同，CER提供一种软性的分级奖励信号，反映不同程度的正确性，更适合答案正确程度不一的任务。实验结果表明，CER在涵盖数学和一般领域的广泛推理任务中都有效，说明CER可作为一种灵活且通用的验证机制。代码可在此 https URL 获取。

MAVEN: A Meta-Reinforcement Learning Framework for Varying-Dynamics Expertise in Agile Quadrotor Maneuvers

MAVEN：一个面向敏捷四旋翼机动中变动力学专业知识的元强化学习框架

Authors: Jin Zhou, Dongcheng Cao, Xian Wang, Shuo Li
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.10714
Pdf link: https://arxiv.org/pdf/2603.10714
Abstract Reinforcement learning (RL) has emerged as a powerful paradigm for achieving online agile navigation with quadrotors. Despite this success, policies trained via standard RL typically fail to generalize across significant dynamic variations, exhibiting a critical lack of adaptability. This work introduces MAVEN, a meta-RL framework that enables a single policy to achieve robust end-to-end navigation across a wide range of quadrotor dynamics. Our approach features a novel predictive context encoder, which learns to infer a latent representation of the system dynamics from interaction history. We demonstrate our method in agile waypoint traversal tasks under two challenging scenarios: large variations in quadrotor mass and severe single-rotor thrust loss. We leverage a GPU-vectorized simulator to distribute tasks across thousands of parallel environments, overcoming the long training times of meta-RL to converge in less than an hour. Through extensive experiments in both simulation and the real world, we validate that MAVEN achieves superior adaptation and agility. The policy successfully executes zero-shot sim-to-real transfer, demonstrating robust online adaptation by performing high-speed maneuvers despite mass variations of up to 66.7% and single-rotor thrust losses as severe as 70%.
中文摘要 强化学习（RL）已成为实现四旋翼在线敏捷导航的强大范式。尽管取得了这些成功，通过标准强化学习训练的策略通常无法在显著的动力学变化下泛化，表现出严重的适应性不足。这项工作提出了MAVEN，一种元强化学习框架，使单一策略能够在广泛的四旋翼动力学范围内实现稳健的端到端导航。我们的方法采用了一种新颖的预测式上下文编码器，它学习从交互历史中推断系统动力学的潜在表征。我们在敏捷航点穿越任务中、在两个具有挑战性的场景下演示了我们的方法：四旋翼质量的大幅变化和严重的单旋翼推力损失。我们利用GPU向量化模拟器将任务分配到数千个并行环境中，克服了元强化学习训练时间长的问题，使训练在不到一小时内收敛。通过在仿真和现实世界中的大量实验，我们验证了MAVEN实现了更优的适应性和敏捷性。该策略成功实现了零样本仿真到现实迁移，在质量变化高达66.7%、单旋翼推力损失高达70%的情况下仍能执行高速机动，展示了稳健的在线适应能力。

ASTER: Attitude-aware Suspended-payload Quadrotor Traversal via Efficient Reinforcement Learning

ASTER：通过高效强化学习实现姿态感知的悬挂载荷四旋翼穿越

Authors: Dongcheng Cao, Jin Zhou, Shuo Li
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.10715
Pdf link: https://arxiv.org/pdf/2603.10715
Abstract Agile maneuvering of the quadrotor cable-suspended system is significantly hindered by its non-smooth hybrid dynamics. While model-free Reinforcement Learning (RL) circumvents explicit differentiation of complex models, achieving attitude-constrained or inverted flight remains an open challenge due to the extreme reward sparsity under strict orientation requirements. This paper presents ASTER, a robust RL framework that achieves, to our knowledge, the first successful autonomous inverted flight for the cable-suspended system. We propose hybrid-dynamics-informed state seeding (HDSS), an initialization strategy that back-propagates target configurations through physics-consistent kinematic inversions across both taut and slack cable phases. HDSS enables the policy to discover aggressive maneuvers that are unreachable via standard exploration. Extensive simulations and real-world experiments demonstrate remarkable agility, precise attitude alignment, and robust zero-shot sim-to-real transfer across complex trajectories.
中文摘要 四旋翼缆绳悬挂系统的敏捷机动受到其非光滑混合动力学的显著制约。虽然无模型强化学习（RL）绕过了对复杂模型的显式微分，但在严格的姿态要求下奖励极度稀疏，实现姿态约束飞行或倒飞仍是一个悬而未决的挑战。本文提出了ASTER，一个稳健的强化学习框架，据我们所知，它首次成功实现了缆绳悬挂系统的自主倒飞。我们提出了混合动力学引导的状态播种（HDSS），这是一种初始化策略，通过物理一致的运动学反演，在缆绳张紧和松弛两个阶段中反向传播目标构型。HDSS使策略能够发现标准探索无法到达的激进机动。大量仿真和真实世界实验展示了出色的敏捷性、精确的姿态对齐，以及在复杂轨迹上稳健的零样本仿真到现实迁移。

mAceReason-Math: A Dataset of High-Quality Multilingual Math Problems Ready For RLVR

mAceReason-Math：一个高质量多语言数学题数据集，准备用于RLVR

Authors: Konstantin Dobler, Simon Lehnerer, Federico Scozzafava, Jonathan Janke, Mohamed Ali
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.10767
Pdf link: https://arxiv.org/pdf/2603.10767
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has been successfully applied to significantly boost the capabilities of pretrained large language models, especially in the math and logic problem domains. However, current research and available training datasets remain English-centric. While multilingual training data and benchmarks have been created in the past, they were not created with RLVR and current model capability in mind, and their level of difficulty is often too low to provide appropriate training signals for current models. To address this gap, we provide mAceReason-Math, a dataset of high-quality translations of challenging math problems sourced from a corpus specifically curated for RLVR (AceReason-Math). We further take specific care to clean and improve our translations, resulting in a coverage of 14 languages with more than 10,000 samples per language. We release the dataset to facilitate multilingual RLVR research and benchmarking in the research community.
中文摘要 带可验证奖励的强化学习（RLVR）已被成功用于显著提升预训练大型语言模型的能力，尤其是在数学和逻辑问题领域。然而，当前的研究和可用的训练数据集仍以英语为中心。虽然过去曾创建过多语言训练数据和基准，但它们在创建时并未考虑RLVR和当前模型的能力，且难度通常过低，无法为当前模型提供合适的训练信号。为弥补这一空白，我们提供了mAceReason-Math，这是一个由高难度数学题的高质量翻译构成的数据集，题目来源于专门为RLVR整理的语料库（AceReason-Math）。我们还特别注重清理和改进翻译，最终覆盖14种语言，每种语言超过10,000个样本。我们发布该数据集，以促进研究社区的多语言RLVR研究和基准评测。

Multilingual Reasoning Gym: Multilingual Scaling of Procedural Reasoning Environments

Multilingual Reasoning Gym：程序化推理环境的多语言扩展

Authors: Konstantin Dobler, Simon Lehnerer, Federico Scozzafava, Jonathan Janke, Mohamed Ali
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.10793
Pdf link: https://arxiv.org/pdf/2603.10793
Abstract We present the Multilingual Reasoning Gym, an extension of Reasoning Gym (Stojanovski et al., 2025), that procedurally generates verifiable reasoning problems across 14 languages. We translate templates for 94 tasks with native-speaker validation in 10 languages and targeted code or template adaptations to ensure linguistic naturalness. The Multilingual Reasoning Gym preserves the core benefits of the procedural generation approach used in the original Reasoning Gym, such as virtually unlimited problem instance generation and adjustable difficulty, and remains directly usable for Reinforcement Learning from Verifiable Rewards and evaluation settings. Problems in the Multilingual Reasoning Gym are parallel across languages, enabling crosslingually parallel data generation at massive scale due to the procedural nature of the environments. We release our implementation to support research into multilingual reasoning models.
中文摘要 我们提出了Multilingual Reasoning Gym，它是Reasoning Gym（Stojanovski 等，2025）的扩展，能够以14种语言程序化地生成可验证的推理问题。我们翻译了94项任务的模板，其中10种语言经过母语者验证，并有针对性地调整代码或模板，以确保语言的自然性。Multilingual Reasoning Gym保留了原始Reasoning Gym所用程序化生成方法的核心优势，如几乎无限的问题实例生成和可调节的难度，并且仍可直接用于可验证奖励强化学习和评估场景。Multilingual Reasoning Gym中的问题在各语言之间是平行的，得益于环境的程序化特性，可以大规模生成跨语言平行数据。我们发布了我们的实现，以支持对多语言推理模型的研究。

Towards Cold-Start Drafting and Continual Refining: A Value-Driven Memory Approach with Application to NPU Kernel Synthesis

迈向冷启动起草与持续精炼：一种价值驱动的记忆方法及其在NPU内核合成中的应用

Authors: Yujie Zheng, Zhuo Li, Shengtao Zhang, Hanjing Wang, Junjie Sheng, Jiaqian Wang, Junchi Yan, Weinan Zhang, Ying Wen, Bo Tang, Muning Wen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.10846
Pdf link: https://arxiv.org/pdf/2603.10846
Abstract Deploying Large Language Models to data-scarce programming domains poses significant challenges, particularly for kernel synthesis on emerging Domain-Specific Architectures where a "Data Wall" limits available training data. While models excel on data-rich platforms like CUDA, they suffer catastrophic performance drops on data-scarce ecosystems such as NPU programming. To overcome this cold-start barrier without expensive fine-tuning, we introduce EvoKernel, a self-evolving agentic framework that automates the lifecycle of kernel synthesis from initial drafting to continual refining. EvoKernel addresses this by formulating the synthesis process as a memory-based reinforcement learning task. Through a novel value-driven retrieval mechanism, it learns stage-specific Q-values that prioritize experiences based on their contribution to the current objective, whether bootstrapping a feasible draft or iteratively refining latency. Furthermore, by enabling cross-task memory sharing, the agent generalizes insights from simple to complex operators. By building an NPU variant of KernelBench and evaluating on it, EvoKernel improves frontier models' correctness from 11.0% to 83.0% and achieves a median speedup of 3.60x over initial drafts through iterative refinement. This demonstrates that value-guided experience accumulation allows general-purpose models to master the kernel synthesis task on niche hardware ecosystems. Our official page is available at this https URL.
中文摘要 将大型语言模型部署到数据稀缺的编程领域面临重大挑战，尤其是在新兴的领域专用架构上进行内核合成时，“数据墙”限制了可用的训练数据。虽然模型在CUDA等数据丰富的平台上表现出色，但在NPU编程等数据稀缺的生态系统中却遭遇灾难性的性能下降。为了在不进行昂贵微调的情况下克服这一冷启动障碍，我们引入了EvoKernel，一个自我演进的智能体框架，它将内核合成从初始起草到持续精炼的生命周期自动化。EvoKernel通过将合成过程表述为基于记忆的强化学习任务来解决这个问题。通过一种新的价值驱动检索机制，它学习特定于阶段的Q值，根据经验对当前目标的贡献对其进行优先级排序，无论当前目标是引导生成可行的初稿还是迭代优化延迟。此外，通过支持跨任务记忆共享，智能体能够将从简单算子中获得的洞见推广到复杂算子。我们构建了KernelBench的NPU变体并在其上进行评估，EvoKernel将前沿模型的正确率从11.0%提升至83.0%，并通过迭代精炼相对初稿实现了3.60倍的中位数加速。这表明，价值引导的经验积累使通用模型能够掌握小众硬件生态系统上的内核合成任务。我们的官方页面可通过此 https URL 访问。

$V_{0.5}$: Generalist Value Model as a Prior for Sparse RL Rollouts

$V_{0.5}$：以通用价值模型作为稀疏RL rollout的先验

Authors: Yi-Kai Zhang, Yueqing Sun, Hongyan Hao, Qi Gu, Xunliang Cai, De-Chuan Zhan, Han-Jia Ye
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.10848
Pdf link: https://arxiv.org/pdf/2603.10848
Abstract In Reinforcement Learning with Verifiable Rewards (RLVR), constructing a robust advantage baseline is critical for policy gradients, effectively guiding the policy model to reinforce desired behaviors. Recent research has introduced Generalist Value Models (such as $V_0$), which achieve pre-trained value estimation by explicitly encoding model capabilities in-context, eliminating the need to synchronously update the value model alongside the policy model. In this paper, we propose $V_{0.5}$, which adaptively fuses the baseline predicted by such a value model (acting as a prior) with the empirical mean derived from sparse rollouts. This constructs a robust baseline that balances computational efficiency with extremely low variance. Specifically, we introduce real-time statistical testing and dynamic budget allocation, which balance the high variance caused by sparse sampling against the systematic bias (or hallucinations) inherent in the value model's prior. By constructing a hypothesis test to evaluate the prior's reliability in real time, the system dynamically allocates additional rollout budget on demand. This mechanism minimizes the baseline estimator's Mean Squared Error (MSE), guaranteeing stable policy gradients even under extreme sparsity with a group size of 4. Extensive evaluations across six mathematical reasoning benchmarks demonstrate that $V_{0.5}$ significantly outperforms GRPO and DAPO, achieving faster convergence and a performance improvement of more than roughly 10%.
中文摘要 在可验证奖励强化学习（RLVR）中，构建稳健的优势基线对于政策梯度至关重要，有效引导策略模型强化期望行为。最新研究引入了通用价值模型（如$V_0$），通过在上下文中显式编码模型能力，实现预训练值估计，消除了与策略模型同步更新价值模型的需求。本文提出$V_{0.5}$，将该价值模型预测的基线（作为先验）与稀疏展开所得的经验均值融合自适应。这构建了一个稳健的基线，在计算效率与极低方差之间取得平衡。具体来说，我们引入了实时统计测试和动态预算分配。这在稀疏抽样导致的高方差与价值模型先验中固有的系统偏见（或幻觉）之间取得了平衡。通过构建假设检验以实时评估先验的可靠性，系统动态地按需分配额外的推广预算。该机制最小化基线估计量的均方误差（MSE），即使在极稀疏且群规模为4的情况下，也能保证策略梯度的稳定。对六个数学推理基准的广泛评估表明，$V_{0.5}$显著优于GRPO和DAPO，实现了更快的收敛速度和约10%的性能提升。
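
The fusion of a value-model prior with a sparse empirical mean, gated by a reliability test, can be illustrated with a small sketch. This is not the estimator from the paper: the pseudo-count weighting (`prior_strength`), the binomial z-test, and the threshold `z_threshold` are all assumptions chosen only to make the idea concrete for binary verifiable rewards.

```python
import math

def fused_baseline(prior, rewards, prior_strength=4.0):
    """Blend the value-model prior with the empirical mean of the rollouts.

    The prior is treated as if it were worth `prior_strength` pseudo-rollouts
    (an assumed weighting, not the paper's rule).
    """
    n = len(rewards)
    mean = sum(rewards) / n
    return (prior_strength * prior + n * mean) / (prior_strength + n)

def prior_is_suspect(prior, rewards, z_threshold=1.96):
    """Two-sided z-test: are the observed binary rewards unlikely under the prior?"""
    n = len(rewards)
    mean = sum(rewards) / n
    p = min(max(prior, 1e-6), 1.0 - 1e-6)  # keep the variance strictly positive
    std_err = math.sqrt(p * (1.0 - p) / n)
    return abs(mean - p) / std_err > z_threshold
```

In a training loop, a prompt whose prior fails the test would be the one that receives extra rollout budget, while prompts whose prior agrees with the first few rollouts keep the small group size.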

RL-Augmented MPC for Non-Gaited Legged and Hybrid Locomotion

用于非固定步态腿式与混合运动的RL增强MPC

Authors: Andrea Patrizi, Carlo Rizzardo, Arturo Laurenzi, Francesco Ruscelli, Luca Rossini, Nikos G. Tsagarakis
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.10878
Pdf link: https://arxiv.org/pdf/2603.10878
Abstract We propose a contact-explicit hierarchical architecture coupling Reinforcement Learning (RL) and Model Predictive Control (MPC), where a high-level RL agent provides gait and navigation commands to a low-level locomotion MPC. This offloads the combinatorial burden of contact timing from the MPC by learning acyclic gaits through trial and error in simulation. We show that only a minimal set of rewards and limited tuning are required to obtain effective policies. We validate the architecture in simulation across robotic platforms spanning 50 kg to 120 kg and different MPC implementations, observing the emergence of acyclic gaits and timing adaptations in flat-terrain legged and hybrid locomotion, and further demonstrating extensibility to non-flat terrains. Across all platforms, we achieve zero-shot sim-to-sim transfer without domain randomization, and we further demonstrate zero-shot sim-to-real transfer without domain randomization on Centauro, our 120 kg wheeled-legged humanoid robot. We make our software framework and evaluation results publicly available at this https URL.
中文摘要 我们提出了一种结合强化学习（RL）和模型预测控制（MPC）的接触显式分层架构，其中高层强化学习智能体向低层运动MPC提供步态和导航指令。通过在仿真中反复试错来学习非周期步态，这减轻了MPC在接触时序上的组合负担。我们证明，只需极少的奖励项和有限的调参即可获得有效的策略。我们在仿真中跨越50公斤至120公斤的机器人平台及不同的MPC实现验证了该架构，观察到平地腿式和混合运动中非周期步态和时序适应的涌现，并进一步证明了其对非平坦地形的可扩展性。在所有平台上，我们实现了无需域随机化的零样本仿真到仿真迁移，并在我们120公斤重的轮腿式人形机器人Centauro上演示了无需域随机化的零样本仿真到现实迁移。我们将软件框架和评估结果公开于此 https URL。

Dynamics-Predictive Sampling for Active RL Finetuning of Large Reasoning Models

面向大型推理模型主动强化学习微调的动力学预测采样

Authors: Yixiu Mao, Yun Qu, Qi Wang, Heming Zou, Xiangyang Ji
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.10887
Pdf link: https://arxiv.org/pdf/2603.10887
Abstract Reinforcement learning (RL) finetuning has become a key technique for enhancing the reasoning abilities of large language models (LLMs). However, its effectiveness critically depends on the selection of training data. Recent advances underscore the importance of online prompt selection methods, which typically concentrate training on partially solved or moderately challenging examples under the current policy, thereby yielding more effective model updates. While these methods significantly accelerate RL finetuning in terms of training steps, they also incur substantial computational overhead by requiring extensive LLM rollouts over large candidate batches to identify informative samples, an expense that can outweigh the finetuning process itself. To address this challenge, this work proposes Dynamics-Predictive Sampling (DPS), which predicts and selects informative prompts online by inferring their learning dynamics prior to costly rollouts. Specifically, we introduce a new perspective by modeling each prompt's solving progress during RL finetuning as a dynamical system, where the extent of solving is represented as the state and the transition is characterized by a hidden Markov model. Using historical rollout reward signals, we perform online Bayesian inference to estimate evolving state distributions, and the inference outcome provides a predictive prior for efficient prompt selection without rollout-intensive filtering. Empirical results across diverse reasoning tasks, including mathematics, planning, and visual geometry, demonstrate that DPS substantially reduces redundant rollouts, accelerates the training process, and achieves superior reasoning performance.
中文摘要 强化学习（RL）微调已成为提升大型语言模型（LLMs）推理能力的关键技术。然而，其有效性在很大程度上取决于训练数据的选择。近期进展凸显了在线提示选择方法的重要性，这类方法通常将训练集中在当前策略下部分解决或中等难度的样本上，从而实现更有效的模型更新。虽然这些方法在训练步数上显著加快了强化学习微调，但也带来了大量计算开销，因为需要在大量候选批次上进行大规模的LLM rollout以识别有信息量的样本，这一开销甚至可能超过微调过程本身。为应对这一挑战，本研究提出了动力学预测采样（DPS），在进行昂贵的rollout之前，通过推断提示的学习动态来在线预测并选择有信息量的提示。具体来说，我们引入了一个新的视角，将每个提示在强化学习微调过程中的求解进展建模为一个动力系统，其中求解程度表示为状态，状态转移由隐马尔可夫模型刻画。利用历史rollout奖励信号，我们进行在线贝叶斯推断以估计不断演化的状态分布，推断结果为高效的提示选择提供了预测先验，无需依赖大量rollout的过滤。涵盖数学、规划和视觉几何等多种推理任务的实证结果表明，DPS显著减少了冗余的rollout，加快了训练进程，并实现了更优越的推理表现。
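
The hidden-Markov view of a prompt's solving progress can be sketched as a standard forward filter. Everything specific here is an assumption for illustration: the three states (unsolved, partially solved, solved), the `TRANSITION` matrix, the per-state success rates in `SUCCESS_PROB`, and the binomial emission model are hypothetical and are not taken from the paper.

```python
from math import comb

# Hypothetical per-state probability that a single rollout succeeds.
SUCCESS_PROB = [0.05, 0.5, 0.95]  # unsolved, partially solved, solved

# Hypothetical transition matrix: TRANSITION[i][j] = P(next state j | state i).
TRANSITION = [
    [0.8, 0.2, 0.0],
    [0.0, 0.8, 0.2],
    [0.0, 0.0, 1.0],
]

def predict(belief):
    """Propagate the state belief one training step forward (no rollouts needed)."""
    n = len(belief)
    return [sum(belief[i] * TRANSITION[i][j] for i in range(n)) for j in range(n)]

def update(belief, successes, rollouts):
    """Bayesian update from observing `successes` out of `rollouts` attempts."""
    likelihood = [
        comb(rollouts, successes) * p**successes * (1.0 - p) ** (rollouts - successes)
        for p in SUCCESS_PROB
    ]
    posterior = [b * l for b, l in zip(belief, likelihood)]
    total = sum(posterior)
    return [x / total for x in posterior]

def informativeness(belief):
    """Score a prompt by the predicted probability of being partially solved."""
    return belief[1]
```

A selector built on this sketch would call `predict` for every candidate prompt, rank prompts by `informativeness`, roll out only the top-ranked ones, and then call `update` with the observed rewards.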

Ergodicity in reinforcement learning

强化学习中的遍历性

Authors: Dominik Baumann, Erfaun Noorani, Arsenii Mustafin, Xinyi Sheng, Bert Verbruggen, Arne Vanhoyweghen, Vincent Ginis, Thomas B. Schön
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.10895
Pdf link: https://arxiv.org/pdf/2603.10895
Abstract In reinforcement learning, we typically aim to optimize the expected value of the sum of rewards an agent collects over a trajectory. However, if the process generating these rewards is non-ergodic, the expected value, i.e., the average over infinitely many trajectories with a given policy, is uninformative for the average over a single, but infinitely long trajectory. Thus, if we care about how the individual agent performs during deployment, the expected value is not a good optimization objective. In this paper, we discuss the impact of non-ergodic reward processes on reinforcement learning agents through an instructive example, relate the notion of ergodic reward processes to more widely used notions of ergodic Markov chains, and present existing solutions that optimize long-term performance of individual trajectories under non-ergodic reward dynamics.
中文摘要 在强化学习中，我们通常旨在优化智能体在轨迹上收集的奖励总和的期望值。然而，如果生成这些奖励的过程是非遍历的，那么期望值，即在给定策略下无限多条轨迹的平均值，对于单条但无限长轨迹的平均值来说，就没有参考价值。因此，如果我们关心的是单个智能体在部署中的表现，期望值并不是一个好的优化目标。本文通过一个有启发性的例子讨论了非遍历奖励过程对强化学习智能体的影响，将遍历奖励过程的概念与更广泛使用的遍历马尔可夫链概念联系起来，并介绍了在非遍历奖励动态下优化单条轨迹长期表现的现有解决方案。
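
The gap between the ensemble average and the time average can be made concrete with a standard multiplicative coin-toss gamble. This example is an assumption used for illustration and is not necessarily the one discussed in the paper; the multipliers `UP` and `DOWN` are hypothetical.

```python
import math

# Hypothetical gamble: wealth is multiplied by UP on heads and by DOWN on
# tails, each with probability 0.5.
UP, DOWN = 1.5, 0.6

def ensemble_growth_factor():
    """Per-step growth of the expected wealth (average over many agents)."""
    return 0.5 * UP + 0.5 * DOWN

def time_average_growth_factor():
    """Per-step growth along one very long trajectory (geometric mean)."""
    return math.exp(0.5 * math.log(UP) + 0.5 * math.log(DOWN))

def wealth_after(heads, tails, start=1.0):
    """Wealth of a single agent after a given number of heads and tails."""
    return start * UP**heads * DOWN**tails
```

The expected value grows by a factor of 1.05 per step, yet the geometric mean is below 1, so a single agent that sees heads and tails in equal proportion loses wealth over time. An objective based on the expectation would rate this gamble as favorable even though almost every individual trajectory decays.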

Safe RLHF Beyond Expectation: Stochastic Dominance for Universal Spectral Risk Control

超越期望的安全RLHF：用于通用谱风险控制的随机占优

Authors: Yaswanth Chittepu, Ativ Joshi, Rajarshi Bhattacharjee, Scott Niekum
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.10938
Pdf link: https://arxiv.org/pdf/2603.10938
Abstract Safe Reinforcement Learning from Human Feedback (RLHF) typically enforces safety through expected cost constraints, but the expectation captures only a single statistic of the cost distribution and fails to account for distributional uncertainty, particularly under heavy tails or rare catastrophic events. This limitation is problematic when robustness and risk sensitivity are critical. Stochastic dominance offers a principled alternative by comparing entire cost distributions rather than just their averages, enabling direct control over tail risks and potential out-of-distribution failures that expectation-based constraints may overlook. In this work, we propose Risk-sensitive Alignment via Dominance (RAD), a novel alignment framework that replaces scalar expected cost constraints with First-Order Stochastic Dominance (FSD) constraints. We operationalize this constraint by comparing the target policy's cost distribution to that of a reference policy within an Optimal Transport (OT) framework, using entropic regularization and Sinkhorn iterations to obtain a differentiable and computationally efficient objective for stable end-to-end optimization. Furthermore, we introduce quantile-weighted FSD constraints and show that weighted FSD universally controls a broad class of Spectral Risk Measures (SRMs), so that improvements under weighted dominance imply guaranteed improvements in the corresponding spectral risk. This provides a principled mechanism for tuning a model's risk profile via the quantile weighting function. Empirical results demonstrate that RAD improves harmlessness over baselines while remaining competitive in helpfulness, and exhibits greater robustness on out-of-distribution harmlessness evaluations.
中文摘要 基于人类反馈的安全强化学习（RLHF）通常通过期望成本约束来保证安全，但期望仅捕捉成本分布的一个统计量，未能考虑分布不确定性，尤其是在重尾或罕见灾难性事件的情况下。当鲁棒性和风险敏感性至关重要时，这一局限性会带来问题。随机占优提供了一种有原则的替代方案，它比较整个成本分布而非仅仅比较其均值，从而能够直接控制基于期望的约束可能忽视的尾部风险和潜在的分布外失败。在本研究中，我们提出了基于占优的风险敏感对齐（RAD），这是一种新颖的对齐框架，用一阶随机占优（FSD）约束替代标量期望成本约束。我们在最优传输（OT）框架下比较目标策略与参考策略的成本分布，以此实现该约束，并利用熵正则化和Sinkhorn迭代获得一个可微且计算高效的目标，以实现稳定的端到端优化。此外，我们引入了分位数加权FSD约束，并证明加权FSD能够普遍控制一大类谱风险度量（SRMs），因此加权占优下的改进意味着相应谱风险的改进有保证。这为通过分位数加权函数调整模型的风险特征提供了有原则的机制。实证结果表明，RAD相比基线提升了无害性，同时在有用性上保持竞争力，并在分布外无害性评估中表现出更强的鲁棒性。
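
The link between first-order dominance on costs and spectral risk can be checked numerically on empirical samples. The sketch below is a simplification under stated assumptions: it compares sorted samples of equal size directly instead of using the entropic optimal-transport and Sinkhorn formulation described in the abstract, and the helper names (`fsd_violation`, `spectral_risk`, `tail_weights`) are hypothetical.

```python
def fsd_violation(target_costs, reference_costs):
    """Mean amount by which target quantiles exceed reference quantiles.

    Zero means the target cost distribution is dominated (no worse at every
    quantile) by the reference, for equal-size empirical samples.
    """
    if len(target_costs) != len(reference_costs):
        raise ValueError("this sketch assumes equal sample sizes")
    pairs = zip(sorted(target_costs), sorted(reference_costs))
    return sum(max(0.0, t - r) for t, r in pairs) / len(target_costs)

def spectral_risk(costs, weights):
    """Weighted average of sorted costs; weights should be non-negative,
    non-decreasing, and sum to one."""
    return sum(w * c for w, c in zip(weights, sorted(costs)))

def tail_weights(n, tail_count):
    """CVaR-style spectrum: uniform weight on the `tail_count` largest costs."""
    return [0.0] * (n - tail_count) + [1.0 / tail_count] * tail_count
```

When `fsd_violation` is zero, every sorted target cost is at most the matching reference cost, so any non-negative weighting of the sorted costs yields a spectral risk that is no larger for the target. Changing the weight vector changes which part of the distribution the constraint emphasizes.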

Contact Coverage-Guided Exploration for General-Purpose Dexterous Manipulation

用于通用灵巧操作的接触覆盖引导探索

Authors: Zixuan Liu, Ruoyi Qiao, Chenrui Tie, Xuanwei Liu, Yunfan Lou, Chongkai Gao, Zhixuan Xu, Lin Shao
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.10971
Pdf link: https://arxiv.org/pdf/2603.10971
Abstract Deep reinforcement learning (DRL) has achieved remarkable success in domains with well-defined reward structures, such as Atari games and locomotion. In contrast, dexterous manipulation lacks general-purpose reward formulations and typically depends on task-specific, handcrafted priors to guide hand-object interactions. We propose Contact Coverage-Guided Exploration (CCGE), a general exploration method designed for general-purpose dexterous manipulation tasks. CCGE represents contact state as the intersection between object surface points and predefined hand keypoints, encouraging dexterous hands to discover diverse and novel contact patterns, namely which fingers contact which object regions. It maintains a contact counter conditioned on discretized object states obtained via learned hash codes, capturing how frequently each finger interacts with different object regions. This counter is leveraged in two complementary ways: (1) to assign a count-based contact coverage reward that promotes exploration of novel contact patterns, and (2) to define an energy-based reaching reward that guides the agent toward under-explored contact regions. We evaluate CCGE on a diverse set of dexterous manipulation tasks, including cluttered object singulation, constrained object retrieval, in-hand reorientation, and bimanual manipulation. Experimental results show that CCGE substantially improves training efficiency and success rates over existing exploration methods, and that the contact patterns learned with CCGE transfer robustly to real-world robotic systems. The project page is at this https URL.
中文摘要 深度强化学习（DRL）在奖励结构明确的领域取得了显著成功，如Atari游戏和运动控制。相比之下，灵巧操作缺乏通用的奖励设计，通常依赖任务特定的手工先验来指导手与物体的交互。我们提出了接触覆盖引导探索（CCGE），这是一种为通用灵巧操作任务设计的通用探索方法。CCGE将接触状态表示为物体表面点与预定义手部关键点的交集，鼓励灵巧手发现多样且新颖的接触模式，即哪些手指接触哪些物体区域。它维护一个接触计数器，该计数器以通过学习到的哈希码获得的离散化物体状态为条件，记录每个手指与不同物体区域交互的频率。该计数器以两种互补的方式被利用：（1）分配基于计数的接触覆盖奖励，促进对新接触模式的探索；（2）构建基于能量的到达奖励，引导智能体前往探索不足的接触区域。我们在多样化的灵巧操作任务上评估CCGE，包括杂乱场景中的物体分离、受限物体取回、手内重定向和双手操作。实验结果表明，CCGE相比现有探索方法显著提升了训练效率和成功率，且通过CCGE学习到的接触模式能够稳健地迁移到现实世界的机器人系统中。项目页面见此 https URL。
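
A count-based contact coverage bonus of the kind described above can be sketched with a dictionary keyed by discretized object state, finger, and object region. The inverse-square-root bonus and the `ContactCounter` class are assumptions borrowed from classic count-based exploration, not the paper's exact reward, and the integer `state_code` stands in for the learned hash code.

```python
import math
from collections import defaultdict

class ContactCounter:
    """Counts (state code, finger, region) contacts and pays a novelty bonus."""

    def __init__(self):
        self.counts = defaultdict(int)

    def reward(self, state_code, contacts):
        """Return the summed bonus for the contacts observed at this step.

        `contacts` is an iterable of (finger, region) pairs; each pair earns
        1 / sqrt(visit count), so rarely seen contacts are worth more.
        """
        total = 0.0
        for finger, region in contacts:
            key = (state_code, finger, region)
            self.counts[key] += 1
            total += 1.0 / math.sqrt(self.counts[key])
        return total
```

Repeating the same contact in the same discretized state earns a shrinking bonus, while touching the same region with a different finger is treated as a new pattern and earns the full bonus again.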

Learning Adaptive Force Control for Contact-Rich Sample Scraping with Heterogeneous Materials

学习针对非均质材料的接触丰富样品刮除的自适应力控制

Authors: Cenk Cetin, Shreyas Pouli, Gabriella Pizzuto
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.10979
Pdf link: https://arxiv.org/pdf/2603.10979
Abstract The increasing demand for accelerated scientific discovery, driven by global challenges, highlights the need for advanced AI-driven robotics. Deploying robotic chemists in human-centric labs is key for the next horizon of autonomous discovery, as complex tasks still demand the dexterity of human scientists. Robotic manipulation in this context is uniquely challenged by handling diverse chemicals (granular, powdery, or viscous liquids), under varying lab conditions. For example, humans use spatulas for scraping materials from vial walls. Automating this process is challenging because it goes beyond simple robotic insertion tasks and traditional lab automation, requiring the execution of fine-granular movements within a constrained environment (the sample vial). Our work proposes an adaptive control framework to address this, relying on a low-level Cartesian impedance controller for stable and compliant physical interaction and a high-level reinforcement learning agent that learns to dynamically adjust interaction forces at the end-effector. The agent is guided by perception feedback, which provides the material's location. We first created a task-representative simulation environment with a Franka Research 3 robot, a scraping tool, and a sample vial containing heterogeneous materials. To facilitate the learning of an adaptive policy and model diverse characteristics, the sample is modelled as a collection of spheres, where each sphere is assigned a unique dislodgement force threshold, which is procedurally generated using Perlin noise. We train an agent to autonomously learn and adapt the optimal contact wrench for a sample scraping task in simulation and then successfully transfer this policy to a real robotic setup. Our method was evaluated across five different material setups, outperforming a fixed-wrench baseline by an average of 10.9%.
中文摘要 全球性挑战推动了对加速科学发现的需求不断增长，凸显了对先进人工智能驱动机器人技术的需要。在以人为中心的实验室中部署机器人化学家是迈向自主发现新阶段的关键，因为复杂任务仍需要人类科学家的灵巧性。在此背景下，机器人操作面临独特挑战，需要在不同实验室条件下处理各种化学品（颗粒状、粉末状或粘稠液体）。例如，人类使用刮刀从小瓶壁上刮取材料。自动化这一过程具有挑战性，因为它超越了简单的机器人插入任务和传统实验室自动化，需要在受限环境（样品瓶）内执行精细的动作。我们的工作提出了一个自适应控制框架来解决这个问题，它依靠低层笛卡尔阻抗控制器实现稳定且柔顺的物理交互，并依靠高层强化学习智能体学习动态调整末端执行器处的交互力。智能体由感知反馈引导，该反馈提供材料的位置。我们首先创建了一个具有任务代表性的仿真环境，其中包含Franka Research 3机器人、刮取工具和一个装有非均质材料的样品瓶。为了便于学习自适应策略并对多样化特性建模，样品被建模为一组球体，每个球体被赋予独特的脱落力阈值，该阈值通过Perlin噪声程序化生成。我们在仿真中训练智能体自主学习并调整样品刮取任务的最优接触力旋量，然后成功将该策略迁移到真实机器人系统中。我们的方法在五种不同的材料设置上进行了评估，平均比固定力旋量基线高出10.9%。

Keyword: diffusion policy

Update-Free On-Policy Steering via Verifiers

通过验证器实现无需更新的同策略引导

Authors: Maria Attarian, Ian Vyse, Claas Voelcker, Jasper Gerigk, Evgenii Opryshko, Anas Almasri, Sumeet Singh, Yilun Du, Igor Gilitschenski
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.10282
Pdf link: https://arxiv.org/pdf/2603.10282
Abstract In recent years, Behavior Cloning (BC) has become one of the most prevalent methods for enabling robots to mimic human demonstrations. However, despite their successes, BC policies are often brittle and struggle with precise manipulation. To overcome these issues, we propose UF-OPS, an Update-Free On-Policy Steering method that enables the robot to predict the success likelihood of its actions and adapt its strategy at execution time. We accomplish this by training verifier functions using policy rollout data obtained during an initial evaluation of the policy. These verifiers are subsequently used to steer the base policy toward actions with a higher likelihood of success. Our method improves the performance of black-box diffusion policies without changing the base parameters, making it lightweight and flexible. We present results from both simulation and real-world data and achieve an average 49% improvement in success rate over the base policy across 5 real tasks.
中文摘要 近年来，行为克隆（BC）已成为使机器人能够模仿人类演示的最流行方法之一。然而，尽管取得了成功，BC策略往往很脆弱，难以完成精确操作。为克服这些问题，我们提出了UF-OPS，一种无需更新的同策略引导方法，使机器人能够预测其动作的成功概率，并在执行时调整策略。我们利用对策略进行初步评估时获得的策略rollout数据来训练验证器函数，以此实现这一点。这些验证器随后被用来引导基础策略采取成功可能性更高的动作。我们的方法在不改变基础参数的情况下提升了黑盒扩散策略的性能，使其轻量且灵活。我们给出了仿真和真实世界数据上的结果，在5个真实任务中，成功率平均比基础策略提升49%。
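
One simple way to steer a frozen policy with a verifier is best-of-N selection: draw several candidate actions and execute the one the verifier scores highest. The abstract does not state that UF-OPS works this way, so the sketch below is only an assumed illustration; `sample_action` and `verifier` are hypothetical callables standing in for the black-box base policy and the trained verifier.

```python
def steer(sample_action, verifier, observation, num_candidates=8):
    """Pick the candidate action with the highest predicted success score.

    sample_action(observation) -> action drawn from the frozen base policy.
    verifier(observation, action) -> estimated likelihood of success.
    The base policy's parameters are never touched.
    """
    candidates = [sample_action(observation) for _ in range(num_candidates)]
    return max(candidates, key=lambda action: verifier(observation, action))
```

Because the procedure only calls the policy's sampling interface, it applies to any stochastic policy; its cost grows linearly with `num_candidates`.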

ScanDP: Generalizable 3D Scanning with Diffusion Policy

ScanDP：基于扩散策略的可泛化3D扫描

Authors: Itsuki Hirako, Ryo Hakoda, Yubin Liu, Matthew Hwang, Yoshihiro Sato, Takeshi Oishi
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.10390
Pdf link: https://arxiv.org/pdf/2603.10390
Abstract Learning-based 3D scanning plays a crucial role in enabling efficient and accurate scanning of target objects. However, recent reinforcement learning-based methods often require large-scale training data and still struggle to generalize to unseen objects. In this work, we propose a data-efficient 3D scanning framework that uses Diffusion Policy to imitate human-like scanning strategies. To enhance robustness and generalization, we adopt Occupancy Grid Mapping instead of direct point cloud processing, offering improved noise resilience and handling of diverse object geometries. We also introduce a hybrid approach combining a sphere-based space representation with a path optimization procedure that ensures path safety and scanning efficiency. This approach addresses limitations in conventional imitation learning, such as redundant or unpredictable behavior. We evaluate our method on unseen objects that are diverse in both shape and scale. Our method achieves higher coverage and shorter paths than baselines, while remaining robust to sensor noise. We further confirm practical feasibility and stable operation in real-world execution.
中文摘要 基于学习的3D扫描在实现目标物体的高效和准确扫描中发挥着关键作用。然而，近期基于强化学习的方法往往需要大规模训练数据，且仍难以泛化到未见过的物体。本文提出一个数据高效的3D扫描框架，利用扩散策略模仿类人的扫描策略。为增强鲁棒性和泛化性，我们采用占用栅格建图代替直接的点云处理，从而提升抗噪能力和对多样物体几何形状的处理能力。我们还引入了一种混合方法，将基于球体的空间表示与路径优化过程相结合，以确保路径安全和扫描效率。该方法解决了传统模仿学习的局限性，如冗余或不可预测的行为。我们在形状和尺度各异的未见过物体上评估了我们的方法。与基线相比，我们的方法实现了更高的覆盖率和更短的路径，同时对传感器噪声保持鲁棒。我们进一步确认了其在真实世界执行中的实际可行性和稳定运行。

PPGuide: Steering Diffusion Policies with Performance Predictive Guidance

PPGuide：利用性能预测引导操控扩散策略

Authors: Zixing Wang, Devesh K. Jha, Ahmed H. Qureshi, Diego Romeres
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.10980
Pdf link: https://arxiv.org/pdf/2603.10980
Abstract Diffusion policies have been shown to be very efficient at learning complex, multi-modal behaviors for robotic manipulation. However, errors in generated action sequences can compound over time, potentially leading to failure. Some approaches mitigate this by augmenting datasets with expert demonstrations or by learning predictive world models, which can be computationally expensive. We introduce Performance Predictive Guidance (PPGuide), a lightweight, classifier-based framework that steers a pre-trained diffusion policy away from failure modes at inference time. PPGuide makes use of a novel self-supervised process: it uses attention-based multiple instance learning to automatically estimate which observation-action chunks from the policy's rollouts are relevant to success or failure. We then train a performance predictor on this self-labeled data. During inference, this predictor provides a real-time gradient to guide the policy toward more robust actions. We validate PPGuide across a diverse set of tasks from the Robomimic and MimicGen benchmarks, demonstrating consistent improvements in performance.
中文摘要 扩散策略已被证明在学习复杂、多模态的机器人操作行为方面非常高效。然而，生成的动作序列中的误差会随时间累积，可能导致失败。一些方法通过用专家演示扩充数据集或学习预测性世界模型来缓解这一问题，但这可能计算开销很大。我们引入了性能预测引导（PPGuide），这是一个轻量级、基于分类器的框架，能够在推理时引导预训练的扩散策略远离失败模式。PPGuide采用了一种新颖的自监督流程：它利用基于注意力的多实例学习，自动估计策略rollout中哪些观测-动作块与成功或失败相关。然后，我们在这些自标注的数据上训练性能预测器。在推理过程中，该预测器提供实时梯度，引导策略朝向更稳健的动作。我们在Robomimic和MimicGen基准的多样化任务上验证了所提出的PPGuide，展示了性能的一致提升。
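
The idea of a predictor supplying a gradient that nudges actions toward success can be illustrated with a tiny logistic model. This is a sketch under strong assumptions: the real predictor is a learned network applied during diffusion sampling, whereas here the predictor is a hypothetical linear-logistic function with hand-set `weights` and `bias`, and the diffusion sampler itself is omitted.

```python
import math

def success_probability(action, weights, bias):
    """Hypothetical logistic predictor of success given an action vector."""
    logit = sum(w * a for w, a in zip(weights, action)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

def guidance_gradient(action, weights, bias):
    """Gradient of log p(success | action) with respect to the action.

    For a logistic model this is (1 - p) * weights.
    """
    p = success_probability(action, weights, bias)
    return [(1.0 - p) * w for w in weights]

def guided_step(action, weights, bias, scale=1.0):
    """Shift the action along the gradient of the predicted log-success."""
    grad = guidance_gradient(action, weights, bias)
    return [a + scale * g for a, g in zip(action, grad)]
```

In a classifier-guided sampler, a step like `guided_step` would be applied to the intermediate action at each denoising iteration, with `scale` trading off fidelity to the base policy against the predictor's preference.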