Arxiv Papers of Today

Generated: 2026-07-01 19:26:04 (UTC+8); Arxiv release time: 2026-07-01 20:00 EDT (2026-07-02 08:00 UTC+8)

42 relevant papers today

Keyword: reinforcement learning

Locker-based Truck-Drone Routing with Integrated Considerations of Pickups, Deliveries, and No-Fly Zones

Authors: Xuanyu Liu, Hui Hu, Jiao Zhao, Ziliang Wang, Zhengbing He
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Physics and Society (physics.soc-ph)
Arxiv link: https://arxiv.org/abs/2606.30680
Pdf link: https://arxiv.org/pdf/2606.30680
Abstract Truck-drone delivery is an emerging last-mile logistics mode combining the long-haul capacity of trucks with the flexible service capability of drones. In locker-based operations, smart lockers serve not only as temporary parcel storage facilities but also as automated drone docking and service nodes. These automated nodes support drone takeoff, landing, parcel handover, and battery replacement, thereby significantly extending the service range and operational flexibility of drone-assisted delivery networks. However, practical locker-based delivery systems face complex real-world challenges, requiring the integrated coordination of not only parcel delivery, return pickup, battery-constrained and load-dependent drone flights, but also necessary detours around restricted airspace. To address this practical and multifaceted challenge, this paper introduces a locker-based truck-drone routing problem with integrated considerations of pickups, deliveries, and no-fly zones (LTDRP-PDNF), with the objective of minimizing the total operational cost of a fleet of drone-equipped trucks. We formulate the route construction process as a Markov Decision Process and develop a two-stage deep reinforcement learning-based neural heuristic. The first stage utilizes an attention-based encoder and a Bidirectional Gated Recurrent Unit decoder to solve the truck-only routing problem, formulated as a capacitated vehicle routing problem. The second stage combines a policy-transfer strategy with a hybrid dispatch assignment heuristic to construct fully coordinated truck and drone routes for LTDRP-PDNF. Experiments on instances of different scales demonstrate that the proposed method outperforms metaheuristic and neural heuristic baselines in most cases while maintaining exceptionally short computation times, offering an effective, scalable solution framework under practical operational constraints.

An AI-Based Solution for Secure Service Provisioning in IoT

Authors: Marco Arazzi, Mert Cihangiroglu, Serena Nicolazzo, Antonino Nocera, Vinod P
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.30701
Pdf link: https://arxiv.org/pdf/2606.30701
Abstract As the Internet of Things (IoT) continues its rapid expansion, the attack surface grows accordingly, with emerging threats targeting smart objects and their interactions. In this evolving landscape, securing service provisioning is crucial to ensure the proper functioning, security, and reliability of the IoT ecosystem. Service provisioning encompasses key tasks such as device registration, configuration, authentication, authorization, and software deployment, all of which are essential for seamless and secure IoT operations. In this paper, we present a comprehensive framework designed to select the most suitable smart objects to deliver a target service within a given IoT environment while also monitoring the behavior of the entities involved during the service provisioning phase. To achieve this, we employ a Deep Reinforcement Learning (DRL) approach in which an intelligent agent learns, through interaction with a complex, dynamic environment, how to adapt to changes while adhering to predefined security constraints. For behavioral monitoring, we leverage Federated Learning (FL) to develop a global Behavioral Fingerprinting (BF) model that is fully distributed and can analyze how IoT devices interact within the network. In addition, the BF is used to compute a reliability score for each service provider, reflecting its degree of compliance with the defined security constraints. This score is then incorporated into the service provisioning process, allowing smart objects to select providers not only according to functional suitability but also to their reliability level. Finally, we conduct an extensive experimental evaluation to assess the robustness and scalability of our approach. The results demonstrate that our solution can be effectively deployed even on resource-constrained IoT devices, making it a viable and scalable security-enhancing mechanism for modern IoT ecosystems.

From Search to Synthesis: Training LLMs as Zero-Shot Workflow Generators

Authors: Gan Luo, Zihan Qin, Bin Dong, Wotao Yin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.30704
Pdf link: https://arxiv.org/pdf/2606.30704
Abstract Large language models (LLMs) excel across a wide range of tasks, yet their instance-specific solutions often lack the structural consistency needed for reliable deployment. Workflows that encode recurring algorithmic patterns at the task level provide a principled framework, offering robustness across instance variations, interpretable traces for debugging, and reusability across problem instances. However, manually designing such workflows requires significant expertise and effort, limiting their broader application. While automatic workflow generation could address this bottleneck, existing methods either produce instance-specific solutions without learning task-level patterns, or cannot generalize beyond their training configurations. We present MetaFlow, which casts workflow generation as a meta-learning problem: given a task and an operator set, the model learns to compose solution strategies. MetaFlow trains in two stages: supervised fine-tuning on synthetic workflow data, followed by reinforcement learning with verifiable rewards (RLVR) that uses execution feedback across problem instances in the task to improve end-to-end success. The resulting model produces effective workflows for trained tasks and exhibits strong generalization to untrained tasks and novel operator sets. Across benchmarks in question answering, code generation, and mathematical reasoning, MetaFlow achieves performance comparable to state-of-the-art baselines on in-domain tasks with single inference, while demonstrating remarkable zero-shot generalization capabilities on out-of-domain tasks and operator sets.

Sampling-Based Coordination-Informed Multi-Objective Multi-Robot Reinforcement Learning

Authors: Antonio Marino, Esteban Restrepo, Soon-jo Chung, Paolo Robuffo Giordano, Claudio Pacchierotti
Subjects: Robotics (cs.RO); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2606.30893
Pdf link: https://arxiv.org/pdf/2606.30893
Abstract Multi-robot systems must simultaneously optimize competing objectives while maintaining coordinated behavior. Existing multi-agent reinforcement learning approaches often rely on fixed or centralized coordination, which limits adaptability and violates distributed constraints. This work introduces the Coordination-Informed Multi-Objective Reinforcement Learning (CIMORL) framework, integrating a distributed weight prediction mechanism, a privileged expert training strategy, and theoretical guarantees for Pareto-optimal solutions. We present the base CIMORL method alongside two sampling-based variants, CIMORL-TS (Tree Search) and CIMORL-MPPI (Model Predictive Path Integral), which leverage privileged global information during training to enable fully decentralized deployment. Experimental validation in cooperative and adversarial scenarios demonstrates a 21.2% hypervolume improvement and superior policy stability compared to state-of-the-art baselines. Real-world experiments with Crazyflie drones further validate the framework's robustness in resource allocation and multi-attacker multi-defender scenarios under partial observability.
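Note: the hypervolume figure quoted in this abstract is a standard multi-objective quality indicator: the volume of objective space dominated by a set of solutions, measured against a reference point. The sketch below is only an illustration of that indicator for two maximized objectives. The point sets, the reference point, and the function names are made up for this example and are not taken from the paper.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]


def hypervolume_2d(points: Iterable[Point], ref: Point) -> float:
    """Area dominated by `points` above `ref` (both objectives maximized)."""
    # Keep only points that strictly improve on the reference in both objectives.
    kept = [(x, y) for x, y in points if x > ref[0] and y > ref[1]]
    # Sweep from the largest first objective to the smallest.
    kept.sort(key=lambda p: p[0], reverse=True)
    area = 0.0
    best_y = ref[1]
    for x, y in kept:
        if y > best_y:  # points dominated by earlier ones add no area
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


def relative_improvement(new: float, old: float) -> float:
    """Fractional gain of `new` over `old`."""
    return (new - old) / old


# Hypothetical Pareto fronts for two policies (illustrative numbers only).
baseline_front = [(3.0, 1.0), (2.0, 2.0), (1.0, 3.0)]
improved_front = [(3.0, 1.5), (2.0, 2.5), (1.0, 3.0)]
reference = (0.0, 0.0)

hv_base = hypervolume_2d(baseline_front, reference)
hv_new = hypervolume_2d(improved_front, reference)
print(hv_base, hv_new, relative_improvement(hv_new, hv_base))
```

A larger hypervolume means the front covers more of the trade-off space, so a percentage improvement compares two such areas.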

HyPOLE: Hyperproperty-Guided Multi-Agent Reinforcement Learning under Partial Observation

Authors: Arshia Rafieioskouei, Tzu-Han Hsu, Matthew Lucas, Borzoo Bonakdarpour
Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2606.30966
Pdf link: https://arxiv.org/pdf/2606.30966
Abstract Formal specification is a powerful tool to guide the learning process and provides significant advantages over reward shaping: (1) mathematical rigor, (2) expressiveness to specify objectives and constraints, and (3) the ability to define tactics to achieve objectives. However, these benefits remain largely unexplored in the context of Multi-Agent Reinforcement Learning (MARL). This paper introduces HyPOLE, a novel framework for MARL under partial observability, where learning is guided by the expressive power of so-called hyperproperties and, in particular, the temporal logic HyperLTL. We integrate Centralized Training for Decentralized Execution (CTDE) techniques with HyPOLE to synthesize decentralized policies, and our evaluation on the SMAC, MessySMAC, and WildFire benchmarks demonstrates clear advantages over baselines.

A Three-Phase Foundation Model for Tax-Aware Personalized Portfolio Management

Authors: Ramin Pishehvar
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.30997
Pdf link: https://arxiv.org/pdf/2606.30997
Abstract We present a three-phase deep reinforcement learning system for personalized portfolio management that addresses three limitations shared by all prior financial RL work: 1) ticker lock-in, 2) monolithic objectives, and 3) static user models. Phase 1 pretrains a ticker-identity-free cross-asset encoder via self-supervised learning on a multi-asset corpus, augmented by a frozen parallel branch using Chronos, a T5-based time series foundation model, fused via a learned gating mechanism. To our knowledge, this is the first application of a time series foundation model to portfolio management RL. The encoder generalizes to any publicly traded asset via a 50-dimensional observable metadata vector that requires no retraining for new tickers. Phase 2 fine-tunes a Mixture-of-Experts (MoE) portfolio actor-critic with PPO under an objective-conditioned reward that simultaneously serves six distinct investment goals sampled per episode: short-term alpha, short-term gain, long-term gain, capital preservation, tax-loss harvesting, and long-term-gains-only. The MoE architecture assigns each objective to a specialized expert head (momentum, growth, defensive, tax-aware), and a learned intent router blends experts based on the active objective and current market regime, which eliminates cross-objective gradient conflict. Phase 3 adds a lightweight personalization layer that is further adapted at inference time to each individual via a 76-parameter LoRA module fine-tuned on real brokerage transaction history, inferring investment objectives from revealed trading behavior rather than questionnaires. A natural language intent parser converts free-form goals directly into structured investment objective parameters.

Offline Reinforcement Learning for Fluid Controls: Data-based Multi-observational Policy Extraction

Authors: Deepak Akhare, Luning Sun, Xin-Yang Liu, Xiantao Fan, Timo Bremer, Ben Zhu, Jian-Xun Wang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.31025
Pdf link: https://arxiv.org/pdf/2606.31025
Abstract Active flow control is a fundamental application in engineering. Recent advances in deep reinforcement learning have made progress in this field. However, classical online RL approaches require extensive real-time interaction with a high-fidelity environment, and each change in sensor configuration necessitates retraining the whole policy. Together, these factors result in prohibitive computational costs for real-world applications. In this work, we propose a novel offline RL framework that addresses both challenges through data-driven policy extraction. We develop a sensor position-conditioned architecture that enables a single policy network to adapt seamlessly to multiple sensor arrangements. The position-conditioned approach incorporates spatial relationship modeling through Point Attention layers to ensure generalizability to varying sensor placements. We demonstrate the framework on two representative problems: mitigating chaoticity in the Kuramoto-Sivashinsky equation and flow control over airfoils governed by the Navier-Stokes equations. The results demonstrate that policy extraction from the dataset provides unprecedented flexibility for sensor placement optimization. This approach represents a significant step towards adaptive, intelligent flow control systems.

GenPage: Towards End-to-End Generative Homepage Construction at Netflix

Authors: Lequn Wang, Jiangwei Pan, Fengdi Che, Linas Baltrunas
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2606.31031
Pdf link: https://arxiv.org/pdf/2606.31031
Abstract We present GenPage, an end-to-end generative approach to Netflix homepage construction that replaces the traditional multi-stage recommender stack with a single transformer. GenPage treats the user and request context as a prompt, and autoregressively generates the entire structured, multi-row homepage as the response. We adapt the LLM training recipe: pretraining on production pages, followed by post-training via weighted binary classification (WBC) or reinforcement learning (RL). For industry-scale deployment, we introduce techniques addressing cold start, model freshness, business-rule enforcement, and serving efficiency. In online A/B tests against a mature, highly optimized production homepage recommender, the WBC variant of GenPage delivered a +0.24% lift on the core user engagement metric we use for launch decisions (p < 0.001), while reducing end-to-end serving latency by 20%. Offline, two findings stand out: enriching the prompt yields a larger improvement than scaling model capacity in our current regime, and RL post-training increases homepage diversity even though diversity is not part of the objective.

Warp RL: Reshaping Base Policy Distributions for Dynamics Adaptation

Authors: Ethan Hirschowitz, Fabio Ramos
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.31043
Pdf link: https://arxiv.org/pdf/2606.31043
Abstract Residual reinforcement learning adapts a pretrained robot policy by learning an additive correction to its actions. While effective when adaptation amounts to shifting the base policy's action distribution, additive corrections cannot change the distribution's shape, scale, or state-dependent geometry: limitations we formalize as wrong variance, miscalibrated confidence, and non-uniform correction. We show that these matter under dynamics shift: when the base distribution is geometrically mismatched to the shifted system, residual correction can underperform even the unadapted policy. We propose Warp RL, a policy adaptation method that replaces additive residuals with an invertible, state-conditioned transformation of the base policy's action distribution. Instantiated with monotonic rational-quadratic spline flows [arXiv:0706.1234v1], Warp RL preserves identity initialization, strictly generalizes additive residual correction, and exposes a structured adaptation space suitable for both policy-gradient and gradient-free optimization. Across a variety of ManiSkill3 manipulation tasks with controlled dynamics shifts, Warp RL matches residual correction when translation is sufficient and substantially outperforms it when adaptation requires distributional reshaping. We further demonstrate that warping can replace additive correction in an off-policy sim-to-real pipeline, achieving comparable success rate with 30% faster task completion on a real-robot peg-insertion task.
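Note: the contrast this abstract draws between additive residuals and invertible warps can be seen on a one-dimensional toy example. The sketch below is not the paper's method. It swaps the rational-quadratic spline flow for a much simpler invertible affine map, drops the state conditioning, and uses made-up parameter values, purely to show that a shift leaves the spread of the action distribution unchanged while a warp can rescale it.

```python
import math
import random
import statistics


def residual_adapt(base_action: float, delta: float) -> float:
    """Additive residual: shifts the action, leaves the spread unchanged."""
    return base_action + delta


def affine_warp(base_action: float, log_scale: float, shift: float) -> float:
    """Invertible affine warp (stand-in for a spline flow).

    log_scale = 0 and shift = 0 give the identity map, and log_scale = 0
    alone recovers the additive residual, so the warp generalizes it.
    """
    return math.exp(log_scale) * base_action + shift


rng = random.Random(0)
# Samples from a hypothetical base policy at one fixed state: N(0, 1).
base_actions = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

shifted = [residual_adapt(a, delta=0.3) for a in base_actions]
warped = [affine_warp(a, log_scale=math.log(0.5), shift=0.3) for a in base_actions]

print("base std:    ", statistics.pstdev(base_actions))
print("residual std:", statistics.pstdev(shifted))  # same spread as the base
print("warped std:  ", statistics.pstdev(warped))   # half the base spread
```

If the shifted dynamics call for a tighter or wider action distribution, only the warp can express that change in this toy setting.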

What Probing Reveals about Autonomous Driving: Linking Internal Prediction Errors to Ego Planning

Authors: Hyeonchang Jeon, Kyungbeom Kim, Eugene Vinitsky, Kyung-Joong Kim
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.31106
Pdf link: https://arxiv.org/pdf/2606.31106
Abstract Large-scale datasets and fast simulators have enabled improvements in driving policies that appear safe and robust, yet strong performance in nominal scenarios can still mask flawed reasoning and unsafe heuristics. Summary scores from closed-loop simulators give little insight into a policy, making it difficult to determine whether the policy truly predicts the motion of surrounding vehicles, how the ego vehicle generates future plans, or whether the policy merely relies on brittle heuristics that happen to succeed in nominal scenarios. To better understand the limits and weaknesses of driving policies, we focus on probing for forms of prediction, i.e., where surrounding vehicles will move next, and planning, i.e., understanding how to generate safe trajectories. We focus on these two capabilities because they reflect behaviors expected of effective driving policies, and use their presence or absence to assess policy quality across data-driven behavior cloning and simulation-driven reinforcement learning policies. To evaluate the presence of these capabilities, we investigate them as a function of scale, asking whether the closed-loop gains from larger datasets and longer simulation training reflect stronger prediction and planning or merely better behavioral heuristics. We use linear probing and targeted perturbations in both imitation learning and reinforcement learning models to track when these internal signals emerge, plateau, or fail. Despite good closed-loop performance, policies often fail to form timely surrounding-vehicle predictions during near-collision events, revealing a limitation in the predictive signals available for ego planning. Finally, causal intervention shows that correcting mistaken predictions improves ego planning toward safer trajectories.

ELASTIC: Efficiently Learning to Adaptively Scale Test-Time Compute for Generative Control Policies

Authors: Andrew Zou Li, Gokul Swamy, Yonatan Bisk, Andrea Bajcsy
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.31132
Pdf link: https://arxiv.org/pdf/2606.31132
Abstract Generative control policies (GCPs), such as diffusion policies and flow-based vision-language-action models, enable test-time scaling in robot control. Test-time compute can be allocated along two axes: sequential scaling, which increases denoising steps to refine actions, and parallel scaling, which samples multiple candidate actions to search across modes of the policy distribution. However, the optimal allocation of sequential and parallel compute is hard to know a priori as it is state-, task-, and policy-dependent. For example, early stages of a grasp may benefit from broader parallel exploration, while near-contact phases may require more sequential refinement for precision. We present ELASTIC, an algorithm that learns state-dependent test-time compute schedules for GCPs. We formulate compute allocation as a meta-Markov Decision Process in which a meta-policy interacts with a frozen pretrained robot policy and selects sequential steps and parallel samples at each denoising iteration to maximize task success while minimizing compute. Using reinforcement learning, this meta-policy also learns adaptive compute schedules without access to the GCP's training data. Across simulated manipulation benchmarks with diffusion policies, ELASTIC Pareto-dominates fixed and single-axis scaling baselines at matched compute budgets. On real-world robot manipulation with the $\pi_{0.5}$ vision-language-action model, ELASTIC matches best-of-10 success while reducing wall-clock latency by 34%.

AETDICE: Unified Framework and Offline Optimization for Nonlinear Multi-Objective RL

Authors: Woosung Kim, Youngjun Suh, Jinho Lee, Jongmin Lee, Byung-Jun Lee
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.31178
Pdf link: https://arxiv.org/pdf/2606.31178
Abstract Optimizing nonlinear preferences in multi-objective reinforcement learning (MORL) is essential for capturing complex trade-offs like risk aversion or fairness. However, such non-linearity has historically bifurcated nonlinear MORL objectives into two distinct paradigms: Scalarized Expected Return (SER) and Expected Scalarized Return (ESR). SER requires global-level optimization and ESR requires non-Markovian policies, which has led to fragmented optimization strategies; we bridge this divide through the Aggregation-Expectation-Transformation (AET) framework. By unifying both criteria through a tripartite decomposition of scalarization, AET provides a principled foundation for general nonlinear MORL. Building on this framework, we propose AETDICE, a tractable offline RL algorithm for AET objectives. By utilizing DICE-style density-ratio estimation in an augmented state space, AETDICE enables sample-based optimization from static datasets. Our framework resolves long-standing barriers and captures the respective trade-offs induced by the AET framework, which existing methods fail to address.
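Note: the two criteria named in this abstract differ only in where the utility function is applied. SER applies it to the expected vector return, and ESR takes the expectation of the utility of each realized vector return. The sketch below is a small worked illustration of that standard distinction, not the AET framework itself. The two-outcome return distribution and the `min` utility (a simple fairness-style preference) are assumptions chosen for the example.

```python
from typing import Callable, List, Sequence, Tuple

# One outcome = (probability, vector return over the objectives).
Outcome = Tuple[float, Sequence[float]]
Utility = Callable[[Sequence[float]], float]


def expected_return(outcomes: List[Outcome]) -> List[float]:
    """Component-wise expectation of the vector return."""
    dims = len(outcomes[0][1])
    return [sum(p * r[i] for p, r in outcomes) for i in range(dims)]


def ser(outcomes: List[Outcome], utility: Utility) -> float:
    """Scalarized Expected Return: utility of the expected vector."""
    return utility(expected_return(outcomes))


def esr(outcomes: List[Outcome], utility: Utility) -> float:
    """Expected Scalarized Return: expectation of per-outcome utility."""
    return sum(p * utility(r) for p, r in outcomes)


# Hypothetical policy: each episode fully serves one objective or the other.
outcomes = [(0.5, (1.0, 0.0)), (0.5, (0.0, 1.0))]

print(ser(outcomes, min))  # 0.5: fair on average across episodes
print(esr(outcomes, min))  # 0.0: never fair within a single episode
print(ser(outcomes, sum), esr(outcomes, sum))  # linear utility: both 1.0
```

With a linear utility the two criteria coincide, which is why the split only arises for nonlinear preferences.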

Deep Reinforcement Learning for Spacecraft Attitude Control During Atmospheric Re-Entry

Authors: Alexander Fabisch, Melvin Laux, Mariela De Lucas Álvarez, Edoardo Caroselli, Julian Theis
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.31291
Pdf link: https://arxiv.org/pdf/2606.31291
Abstract Deep reinforcement learning has the potential to solve attitude control problems more adaptively, precisely, and robustly by handling nonlinear dynamics, uncertainties, and failure cases more effectively than traditional attitude control approaches. We explore reinforcement learning (RL) for attitude control in spacecraft re-entry. An industry-standard proportional-integral-derivative controller with gain scheduling serves as a strong baseline for model-free RL and hybrid controllers that combine these two approaches. We formalize the application in the RL framework to apply continuous, off-policy RL. State-of-the-art RL achieves comparable performance to traditional control approaches in this domain. However, its out-of-distribution generalization is not sufficient. Hence, we use dynamics randomization to introduce challenging task variations during training and enforce generalization in a predefined operational envelope. Finally, we assess the best RL-based controllers using application-specific metrics and show that they outperform traditional controllers within the operational envelope: hybrid controllers track the angle of attack better and are more robust under variations of mass, inertia tensor, and flap actuator bandwidth.

Safe Online Learning via Smooth Safety-Structured Policy Composition

Authors: Hongpeng Cao, Liqun Zhao, Yuliang Gu, Naira Hovakimyan, Lui Sha, Marco Caccamo
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.31320
Pdf link: https://arxiv.org/pdf/2606.31320
Abstract Safe online reinforcement learning requires policies to respect safety constraints while maintaining smooth optimization dynamics. Existing approaches typically rely on either strict safety enforcement via action interventions, which introduce discontinuities in system interaction and learning, or soft safety constraint formulations, which preserve smooth learning but provide limited safety assurance. We propose AutoSafe, a safety-aware policy architecture that integrates structured safety monitoring and intervention directly into the action generation process. This design enables smooth, risk-dependent transitions between performance-driven and safety-preserving behaviors, resulting in continuous online interaction and learning dynamics. Empirical results across a suite of continuous-control benchmarks demonstrate strong safety enforcement without sacrificing learning smoothness. We further validate AutoSafe on a physical cart-pole system, highlighting its practical effectiveness for safe online learning in the real world.
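Note: the abstract contrasts hard action interventions, which introduce discontinuities, with smooth risk-dependent transitions. The sketch below only illustrates that contrast on scalar actions. The sigmoid weighting, the `threshold` and `sharpness` parameters, and the scalar `risk` signal are assumptions made for this example. The abstract does not specify how AutoSafe computes or combines these quantities.

```python
import math


def hard_switch(a_perf: float, a_safe: float, risk: float,
                threshold: float = 0.5) -> float:
    """Intervention-style override: the action jumps at the threshold."""
    return a_safe if risk >= threshold else a_perf


def smooth_blend(a_perf: float, a_safe: float, risk: float,
                 threshold: float = 0.5, sharpness: float = 10.0) -> float:
    """Risk-weighted convex combination: continuous in `risk`."""
    weight = 1.0 / (1.0 + math.exp(-sharpness * (risk - threshold)))
    return (1.0 - weight) * a_perf + weight * a_safe


a_perf, a_safe = 1.0, -1.0  # hypothetical performance and safe actions
for risk in (0.0, 0.49, 0.51, 1.0):
    print(risk,
          hard_switch(a_perf, a_safe, risk),
          round(smooth_blend(a_perf, a_safe, risk), 3))
```

Near the threshold the hard switch changes the applied action by the full gap between the two actions, while the blend changes it only slightly, which is the kind of smoothness the abstract refers to.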

Smart charging of large fleets of Electric Vehicles: Independent Multi-Agent Reinforcement Learning approaches

Authors: Xavier Rate, Eloann Le Guern, Raphaël Féraud, Fatma Salem, Melissa Chiknoun, Eymeric Giabicani, Mehdi Feki, Patrick Maillé, Guy Camilleri, Anne Blavette, Hamid Benhamed
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.31347
Pdf link: https://arxiv.org/pdf/2606.31347
Abstract The electrification of transportation through electric vehicles (EVs) introduces new challenges for power grid management, such as increased peak demand, voltage fluctuations, line overloads, and the integration of variable renewable energy sources. To enable efficient integration of EVs while minimizing costs for users and avoiding network overloads, implicit coordination between EVs is required. This work compares two independent multi-agent reinforcement learning approaches for optimizing such decentralized EV charging: contextual combinatorial bandits and policy gradient algorithms. Using a realistic simulation environment in which autonomous agents make decisions based on local environmental information (including price signals, state-of-charge, and temporal constraints), we evaluate their performance across varying congestion levels and in mixed-strategy configurations with heterogeneous agent groups, under dynamic electricity pricing derived from real photovoltaic production data.
中文摘要 通过电动汽车实现交通电气化，为电网管理带来了新的挑战，如峰值需求增加、电压波动、线路过载以及可变可再生能源的整合。为了实现电动汽车的高效集成，同时降低用户成本并避免网络过载，电动汽车之间需要隐性协调。本研究比较了两种独立的多智能体强化学习方法，用于优化这种去中心化电动汽车充电：上下文组合强盗算法和策略梯度算法。利用一个真实的模拟环境，自主智能体根据本地环境信息（包括价格信号、充电状态和时间约束）做出决策，我们评估其在不同拥堵水平下的表现，以及在基于真实光伏发电数据的动态电力定价下，采用异构智能体组的混合策略配置。

Failure-Based Testing for Deep Reinforcement Learning Agents

基于失败的深度强化学习代理测试

Authors: Weibin Lin, Jiangtao Meng, Zheng Zheng
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2606.31372
Pdf link: https://arxiv.org/pdf/2606.31372
Abstract Deep Reinforcement Learning (DRL) agents have been widely adopted across diverse domains to address challenging decision-making problems, such as autonomous driving and robotic control. Given that many of these applications are safety- and security-critical, rigorous testing of DRL agents is indispensable. Existing testing methods are typically guided by reward signals to detect failures. However, for well-trained agents, whose performance approaches optimal levels in standard operating conditions, reward signals remain generally high, making current methods ineffective at uncovering critical failures. To address these challenges, we propose a novel failure-based method that leverages task-induced failure insights to enhance failure detection capability while reducing the number of tests required. Since DRL agents are designed around human-defined tasks, these tasks provide valuable cues about task difficulty. Intuitively, a DRL agent is more likely to fail when confronted with a more difficult task, so such tasks should be prioritized. Building on this foundation, we propose Prior Random Testing (PRT), a black-box failure-based testing method that enables targeted prioritization while preserving the diversity of generated test cases. Guided by task-induced failure insights, PRT prioritizes failure-prone regions of the input domain, thereby facilitating efficient failure detection. PRT is evaluated on four widely used benchmarks and compared with different state-of-the-art methods, including fuzzing, search-based, and generative methods. PRT ranks among the top performers in terms of both the cost of finding the first failure and the diversity of test cases. Notably, compared to random testing, PRT achieves better diversity and reduces the testing cost by over 50%.
中文摘要 深度强化学习（DRL）代理已被广泛应用于多个领域，以解决诸如自动驾驶和机器人控制等具有挑战性的决策问题。鉴于许多应用对安全和保障至关重要，对DRL代理进行严格测试是不可或缺的。现有的测试方法通常以奖励信号为指导以检测失效。然而，对于训练有素的代理来说，其性能在标准操作条件下接近最佳水平，奖励信号通常仍然较高，使得现有方法在发现关键故障方面效果有限。为应对这些挑战，我们提出了一种基于故障的新方法，利用任务引发的失效洞察来增强故障检测能力，同时减少所需测试次数。由于DRL代理本质上设计为人类定义的任务，它们提供了关于任务难度的宝贵线索。直观上，DRL操作员在面对更难的任务时更容易失败;因此，PRT优先处理这些任务。在此基础上，我们提出了先验随机测试（Prior Random Tests），这是一种基于黑箱失败的测试方法，能够实现有针对性优先级排序，同时保持生成测试用例的多样性。在任务引发的失败洞察指导下，PRT优先考虑输入域中易失效区域，从而促进高效的故障检测。PRT基于四个广泛使用的基准测试进行评估，并与包括模糊检测、基于搜索和生成式的方法在内的多种先进方法进行比较。PRT在寻找第一个失败的成本和测试用例多样性方面均位列前茅。值得注意的是，与随机检测相比，PRT实现了更好的多样性，并将检测成本降低了50%以上。

Stage-Transition Dense Reward Modeling for Reinforcement Learning

阶段-过渡密集奖励建模用于强化学习

Authors: Yang Yang, Bingjie Chen, Zihan Wang, Yizhe Li, Guoping Pan, Yi Cheng, Houde Liu
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.31377
Pdf link: https://arxiv.org/pdf/2606.31377
Abstract Reinforcement learning for long-horizon robotic manipulation is often limited by sparse and delayed rewards, while manually designing dense shaping signals is costly and brittle to changes in environments and object configurations. This work proposes Stage-Transition Dense Reward (STDR), a visual reward-learning framework that converts unstructured expert videos into logically grounded dense rewards for training RL agents from scratch. STDR leverages semantic understanding to infer a task's stage structure from demonstrations, and delivers two complementary learning signals during online training: (i) stage-transition feedback that provides goal-directed reward, and (ii) within-stage progress feedback that supplies fine-grained guidance toward completing each stage. Furthermore, an out-of-distribution (OOD) detection mechanism and a grasping regulation module are integrated to enhance robustness and prevent reward hacking. Experiments on 14 manipulation tasks across MetaWorld, ManiSkill, and Franka Kitchen show that STDR consistently improves sample efficiency and success rates over multiple baselines, and matches or surpasses handcrafted dense rewards on several challenging tasks. Real-robot evaluations further indicate that STDR assigns stable, progress-aligned rewards on successful executions while producing appropriately low rewards for failures, suggesting robustness to visual noise and better-calibrated reward assignment across settings.
中文摘要 长视界机器人操作的强化学习常受限于奖励稀疏和延迟，而手动设计密集的塑形信号则成本高昂且易受环境和物体配置变化影响。本研究提出了阶段-过渡密集奖励（STDR），一种视觉奖励学习框架，将非结构化的专家视频转化为逻辑基础的密集奖励，用于从零开始训练强化学习代理。STDR利用语义理解从演示推断任务的阶段结构，并在在线培训中传递两种互补的学习信号：（i）阶段过渡反馈，提供目标导向奖励;（ii）阶段内进度反馈，提供细致指导，帮助完成各阶段。此外，集成了分布外（OOD）检测机制和抓取调节模块，以增强鲁棒性并防止奖励黑客行为。在MetaWorld、ManiSkill和Franka Kitchen的14个操控任务实验显示，STDR在多个基线上持续提升样本效率和成功率，并在多个具有挑战性的任务中达到甚至超越手工设计的密集奖励。真实机器人评估进一步表明，STDR在成功执行时给予稳定且与进展相符的奖励，而对失败则给予适当较低的奖励，表明其对视觉噪声具有鲁棒性，且在不同环境下奖励分配更为精准。

Xiaomi-GUI-0 Technical Report

小米GUI-0技术报告

Authors: Wanxia Cao, Chengzhen Duan, Pei Fu, Pengzhi Gao, Niu Lian, Fazhan Liu, Hui Liu, Heng Qu, Qinzhuo Wu, Zhehao Yu, Tongbo Chen, Shiqi Cui, Anan Du, Shukai Jia, Yuanfa Li, Yike Liu, Wenchao Lu, Haoyuan Sun, Jiatong Sun, Cheng Tan, Yajie Wang, Changqiao Wu, Tao Xiong, Jiahui Yang, Yuxuan Yuan, Ruoceng Zhang, Shaojie Zhang, Jian Zhu, Jian Luan, Cong Zou
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.31410
Pdf link: https://arxiv.org/pdf/2606.31410
Abstract Graphical user interface (GUI) agents build on vision-language models to complete user tasks end-to-end in real applications through interface actions such as tapping, swiping, text entry, and navigation. However, existing GUI agents are trained and evaluated largely on offline trajectories, simulated environments, and standardized benchmarks. These differ substantially from real applications in interface layout, interaction logic, and abnormal-state distribution, and cannot faithfully characterize execution stability in real-world use, where account states, permission dialogs, payment authentication, and risk control continually reshape the state distribution and open a persistent gap between benchmark scores and real usability. To close this gap, we propose Xiaomi-GUI-0, a native multimodal GUI agent for real mobile environments, trained and evaluated within a real-device closed loop. At its core is a real-device-dominant hybrid infrastructure, where physical devices are the primary execution environment and sandboxes provide auxiliary support, so that data collection, training, rollout, and evaluation share an execution distribution close to real deployment. We construct multi-source training data spanning high-frequency head tasks, high-generalization data for long-tail intents, and capability-enhancement data for reflection and memory, and introduce an error-driven data flywheel that turns failure trajectories into corrected actions, reflective explanations, and recovery demonstrations. The model is trained through a progressive three-stage pipeline of supervised fine-tuning, step-level reinforcement learning, and agentic reinforcement learning. Evaluated on public benchmarks and our in-house RealMobile, Xiaomi-GUI-0 achieves 72.0% success on RealMobile and 78.9% on AndroidWorld, while substantially improving execution stability and abnormal-state recognition in real-world tasks.
中文摘要 图形用户界面（GUI）代理基于视觉语言模型，通过点击、滑动、文本输入和导航等界面操作，完成用户任务的端到端。然而，现有的图形界面代理主要基于离线轨迹、模拟环境和标准化基准进行训练和评估。这些在界面布局、交互逻辑和异常状态分布方面与现实应用有显著差异，无法忠实描述现实应用中的执行稳定性，因为账户状态、权限对话框、支付认证和风险控制不断重塑状态分布，并在基准评分与实际可用性之间留下持续的差距。为弥合这一差距，我们提出了Xiaomi-GUI-0，一款适用于真实移动环境的原生多模态GUI代理，并在真实设备闭环中训练和评估。其核心是一个以真实设备为主导的混合基础设施，物理设备是主要执行环境，沙箱则提供辅助支持，使数据收集、训练、推广和评估在接近实际部署时共享执行分布。我们构建了涵盖高频头部任务的多源训练数据、长尾意图的高泛化数据、反射和记忆能力增强数据，并引入了误差驱动的数据飞轮，将故障轨迹转化为纠正动作、反思解释和恢复演示。该模型通过监督微调、步级强化学习和代理强化学习的渐进式三阶段流程进行训练。经过公开基准测试和我们自家RealMobile的评估，小米GUI-0在RealMobile上实现了72.0%的成功率，在AndroidWorld上达到了78.9%，同时在实际任务中显著提升了执行稳定性和异常状态识别能力。

Learning to Select, Not Relearn: Hard-Routed Mixtures of Reasoning LoRAs

学习选择，而非重新学习：推理LoRA的硬路由混合

Authors: Seyed Alireza Molavi, Zhan Su, Yan Hu, Peyman Sheikholharam Mashhadi, Stefan Byttner, Prayag Tiwari
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.31413
Pdf link: https://arxiv.org/pdf/2606.31413
Abstract Composing independently trained LoRA adapters into a single large language model is useful for multi-domain adaptation, especially when the original training data cannot be shared. A common approach is to use MoE-style routing over LoRA experts, but for frozen pretrained adapters, soft weighted combinations can change the unit-scale additive update under which each LoRA module was originally trained. We propose Hard-Routed MoR-LoRA, a two-stage framework for composing frozen reasoning LoRA experts through unit-scale hard selection. First, domain-specific LoRA adapters are trained independently using reinforcement learning from verifiable feedback to obtain reasoning experts. Then, all experts are frozen, reasoning traces are distilled from them, and only a lightweight shared router together with a small attention LoRA is trained for integration. The router selects exactly one expert per token using hard top-1 routing, while a straight-through estimator enables gradient-based training. Experiments across five benchmarks, multiple model scales, and additional model families show that Hard-Routed MoR-LoRA preserves expert behavior while requiring substantially fewer trainable parameters than soft-routing mixture baselines. Our analysis further shows that normalized soft mixtures often concentrate most routing mass on a single expert, suggesting that hard unit-scale routing provides a simple and efficient abstraction for frozen LoRA expert composition.
中文摘要 将独立训练的LoRA适配器组合成单一大型语言模型，对于多域适配非常有用，尤其是在原始训练数据无法共享的情况下。一种常见方法是在LoRA专家之上使用MoE风格路由，但对于冻结的预训练适配器，软加权组合可能改变每个LoRA模块最初训练时所依据的单位尺度加法更新。我们提出了硬路由 MoR-LoRA，这是一个两阶段框架，通过单位尺度硬选择来组合冻结的推理 LoRA 专家。首先，领域特定的LoRA适配器通过基于可验证反馈的强化学习独立训练，获得推理专家。然后，所有专家被冻结，从中蒸馏出推理轨迹，只训练一个轻量级共享路由器和一个小型注意力 LoRA 用于集成。路由器通过硬 top-1 路由为每个令牌恰好选择一名专家，而直通估计器则支持基于梯度的训练。跨五个基准测试、多个模型规模及其他模型家族的实验表明，硬路由 MoR-LoRA 保持专家行为，同时比软路由混合基线所需的可训练参数少得多。我们的分析进一步表明，归一化软混合通常将大部分路由权重集中在单个专家身上，表明硬单位尺度路由为冻结的LoRA专家组合提供了简单高效的抽象。
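
The hard top-1 routing with a straight-through estimator mentioned in this abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names, the scalar expert outputs, and the choice of the softmax Jacobian as the surrogate gradient are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def hard_top1_forward(logits):
    """Forward pass: a one-hot gate that selects exactly one expert."""
    probs = softmax(logits)
    winner = max(range(len(logits)), key=lambda i: logits[i])
    gate = [1.0 if i == winner else 0.0 for i in range(len(logits))]
    return gate, probs

def straight_through_backward(probs, grad_gate):
    """Backward pass: treat the gate as if it were the softmax output.

    Uses the softmax Jacobian J[i][j] = p_i * (delta_ij - p_j), so the
    router still receives a gradient although argmax is not differentiable.
    """
    dot = sum(p * g for p, g in zip(probs, grad_gate))
    return [p * (g - dot) for p, g in zip(probs, grad_gate)]

def route_token(logits, expert_outputs):
    """Add the selected expert's output with weight exactly 1 (unit scale)."""
    gate, probs = hard_top1_forward(logits)
    out = sum(g * y for g, y in zip(gate, expert_outputs))
    return out, gate, probs
```

In an autograd framework the same effect is usually obtained by adding the one-hot gate to the soft probabilities and subtracting a gradient-detached copy of those probabilities; the manual backward function above only makes that gradient path explicit.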

Stabilization Learning: A Paradigm Transition Bridging Control Theory and Machine Learning

稳定学习：范式转变——连接控制理论与机器学习

Authors: Quan Quan
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.31562
Pdf link: https://arxiv.org/pdf/2606.31562
Abstract Stabilization learning is an interdisciplinary paradigm that bridges control theory and machine learning. Its core idea is to enable systems to adjust their policies under perturbations or environmental changes through real-time feedback and adaptive mechanisms. It takes stability as its primary goal, distinguishing itself from certificate learning, which focuses on formal proofs, and reinforcement learning, which pursues optimality. It encompasses a range of methods, including Lyapunov-based analysis and design, deep feature extraction, and data-driven feedback synthesis, and is applicable to complex high-dimensional, nonlinear systems. This paper elaborates on the two major categories of stability in stabilization learning, as well as three typical application scenarios: control, observation, and recognition. It constructs a unified mathematical framework based on a six-tuple, and expands into two types of seven-tuple models: constrained learning with barrier spaces and tracking problems with targets. It also analyzes the roles, meanings, and implementation choices of key elements such as state space, controlled system, metrics, and policy. Through the formal reformulation of 11 types of problems, including multi-agent cooperative tracking, visual servo robot position stabilization, chess games, and Push-T tasks, this paper illustrates the potential applicability of the framework across multiple domains. Finally, it points out that future stabilization learning will focus on two major directions: constructing a unified problem framework and achieving efficient and robust learning, providing solutions for complex system control that combine theoretical rigor with engineering practicality.
中文摘要 稳定学习是一种跨学科范式，连接了控制理论和机器学习。其核心理念是通过实时反馈和自适应机制，使系统能够在扰动或环境变化下调整其策略。它以稳定性为主要目标，区别于注重形式证明的证书学习和追求最优性的强化学习。它涵盖了多种方法，包括基于李雅普诺夫的分析与设计、深度特征提取以及数据驱动反馈综合，并适用于复杂的高维非线性系统。本文详细阐述了稳定学习中稳定性的两大类，以及三种典型应用场景：控制、观察和识别。它构建了一个基于六元组的统一数学框架，并扩展为两种类型的七元组模型：带障碍空间的约束学习和带有目标的跟踪问题。它还分析了状态空间、受控系统、度量和策略等关键要素的角色、含义和实施选择。通过对11类问题的形式化重新表述，包括多智能体协作跟踪、视觉伺服机器人位置稳定、国际象棋棋局和Push-T任务，本文展示了该框架在多个领域的潜在适用性。最后，它指出未来的稳定学习将聚焦于两个主要方向：构建统一的问题框架和实现高效且稳健的学习，提供结合理论严谨与工程实用性的复杂系统控制解决方案。

Which Tokens Matter? Adaptive Token Selection for RLVR with the Relative Surprisal Index

哪些词元重要？基于相对惊异指数的RLVR自适应词元选择

Authors: Outongyi Lv, Yanzhao Zheng, Yuanwei Zhang, Zhenghao Huang, Xingjun Wang, Baohua Dong, Hangcheng Zhu, Yingda Chen
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.31575
Pdf link: https://arxiv.org/pdf/2606.31575
Abstract Reinforcement learning (RL) has become a powerful tool for propelling Large Language Models (LLMs) beyond imitation-based training towards more robust reasoning capabilities. Among existing approaches, RL with Verifiable Rewards (RLVR) has emerged as a pivotal paradigm for advancing LLM reasoning. Despite its empirical success, recent studies have offered different insights. One line of inquiry advocates prioritizing high-entropy token positions during training, while another perspective cautions against allowing low-probability tokens to dominate gradient updates. Notably, although high-entropy tokens are usually correlated with low probability, both paradigms empirically yield substantial performance gains. In this work, we argue that evaluating sampled-token probability or entropy in isolation is insufficient to capture the policy optimization dynamics. To resolve this tension, we introduce the Relative Surprisal Index (RSI), a principled, information-theoretic metric that naturally couples the token's entropy with the probability of the selected token. We show that, under mild conditions, RSI is related to the local ratio between the first-order variations of the logit-gradient norm and predictive entropy under a selected-logit perturbation. Building on RSI, we propose RSI Selection (RSI-S), an entropy-adaptive token filtering method that retains tokens within a stable RSI interval. RSI-S successfully reconciles previous contradictory paradigms and filters out both redundant low-surprisal tokens and unstable high-surprisal tail tokens. Empirical evaluations show that RSI-S achieves higher avg@32 accuracy across different model scales (Qwen2.5-1.5B, 3B, and 7B) on AIME and AMC benchmarks: RSI-S improves avg@32 accuracy by 2-3 percentage points over GRPO. Overall, RSI offers a promising perspective for RLVR improvement.
中文摘要 强化学习（RL）已成为推动大型语言模型（LLM）从模仿训练向更稳健推理能力发展的强大工具。在现有方法中，带可验证奖励的强化学习（RLVR）已成为推动大型语言模型推理发展的关键范式。尽管实证上取得了成功，近期研究却提供了不同的见解。一种研究方向主张在训练中优先考虑高熵词元位置，而另一种观点则警示不要让低概率词元主导梯度更新。值得注意的是，尽管高熵词元通常与低概率相关，但两种范式在实证上都能带来显著的性能提升。在本研究中，我们认为单独评估采样词元的概率或熵不足以捕捉策略优化动态。为解决这种矛盾，我们引入了相对惊异指数（RSI），这是一种有原则的信息论指标，它自然地将词元的熵与所选词元的概率耦合起来。我们证明，在温和条件下，RSI与所选logit受扰动时logit梯度范数的一阶变化与预测熵的一阶变化之间的局部比值相关。基于RSI，我们提出了RSI选择（RSI-S），这是一种熵自适应的词元过滤方法，保留处于稳定RSI区间内的词元。RSI-S成功调和了此前相互矛盾的范式，同时过滤掉冗余的低惊异度词元和不稳定的高惊异度尾部词元。实证评估显示，RSI-S在AIME和AMC基准测试中，在不同模型规模（Qwen2.5-1.5B、3B和7B）上均取得更高的avg@32准确率：RSI-S比GRPO提高了2-3个百分点的avg@32准确率。总体而言，RSI为改进RLVR提供了一个有前景的视角。

Token-Sparse Medical Multimodal Reasoning via Dual-Stream Reinforcement Learning

通过双流强化学习实现令牌稀疏医学多模态推理

Authors: Kaitao Chen, Weiqian Zhao, Jiamin Wu, Qihao Zheng, Shangquan Sun, Chunfeng Song, Xiaosong Wang, Mu Zhou, Mianxin Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.31599
Pdf link: https://arxiv.org/pdf/2606.31599
Abstract Vision-language models (VLMs) combined with reinforcement learning (RL) have driven remarkable progress in multimodal reasoning, yet they still struggle with medical images, which typically offer extremely sparse visual evidence to inform clinical decision-making. We recognize that pruning visual tokens outside the grounding region greatly enhances medical reasoning. However, a unified RL framework for active visual token pruning (VTP) and medical multimodal reasoning has not yet been established. Here, we propose a dual-stream RL framework, ViToS, that performs both token pruning and question answering. ViToS trains one policy model with two task branches, where one focuses on grounding while the other conducts token-sparse reasoning after VTP. Furthermore, we solve the coupled policy learning problem by introducing cross-feedback sequential optimization, which avoids gradient conflict and facilitates convergence of the shared policy model. Evaluated on seven medical benchmarks, our method reduces visual tokens to 77% of the original sequence length while achieving 108.27% relative performance on Lingshu-7B and 104.16% relative performance on HuatuoGPT-Vision-7B. Overall, ViToS delivers superior performance and inference speedup, establishing an efficient paradigm for medical multimodal reasoning.
中文摘要 结合强化学习（RL）的视觉语言模型（VLMs）在多模态推理方面取得了显著进展，但在医学图像方面仍然存在困难，因为医学图像中可用于指导临床决策的视觉证据通常极为稀疏。我们认识到，剪除定位区域之外的视觉令牌可以极大地增强医学推理能力。然而，针对主动视觉令牌剪枝（VTP）和医学多模态推理的统一强化学习框架尚未建立。在这里，我们提出了一个双流强化学习框架ViToS，用于完成令牌剪枝和问答。ViToS训练一个策略模型，包含两个任务分支，一个专注于视觉定位，另一个在VTP后进行令牌稀疏推理。此外，我们通过引入交叉反馈顺序优化解决了耦合策略学习问题，避免了梯度冲突，促进了共享策略模型的收敛。在七项医学基准上的评估表明，我们的方法将视觉令牌减少至原始序列长度的77%，同时在Lingshu-7B上实现了108.27%的相对性能，在HuatuoGPT-Vision-7B上实现了104.16%的相对性能。总体而言，ViToS实现了卓越的性能和推理加速，建立了高效的医学多模态推理范式。
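
To make the idea of pruning visual tokens outside a grounding region concrete, here is a small sketch under strong assumptions. It supposes the image is split into a row-major grid of square patches and that a grounding box is already given in pixel coordinates; in the paper the grounding is produced by the learned policy, which is not modeled here. The function name and parameters are hypothetical.

```python
def prune_visual_tokens(grid_h, grid_w, patch, box):
    """Return row-major indices of patches that overlap the grounding box.

    grid_h, grid_w: number of patch rows and columns (assumed layout).
    patch: side length of a square patch in pixels.
    box: (x0, y0, x1, y1) grounding region in pixel coordinates.
    """
    x0, y0, x1, y1 = box
    keep = []
    for r in range(grid_h):
        for c in range(grid_w):
            px0, py0 = c * patch, r * patch
            px1, py1 = px0 + patch, py0 + patch
            # Keep the patch only if its area intersects the box.
            if px0 < x1 and px1 > x0 and py0 < y1 and py1 > y0:
                keep.append(r * grid_w + c)
    return keep


kept = prune_visual_tokens(grid_h=4, grid_w=4, patch=14, box=(14, 14, 42, 42))
keep_ratio = len(kept) / (4 * 4)
```

The kept indices would then be used to select the corresponding visual token embeddings before the reasoning pass; the keep ratio in this toy grid is unrelated to the 77% figure reported in the abstract.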

What Memory Do GUI Agents Really Need? From Passive Records to Active Task-Driving States

GUI 代理真正需要什么记忆？从被动记录到主动任务驱动状态

Authors: Chen Liu, Ling Chen, Hanzhang Zhou, Xu Zhang, Quyu Kong, Panrong Tong, Wenhao Wang, Xin Yu, Steven Hoi, Yue Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.31612
Pdf link: https://arxiv.org/pdf/2606.31612
Abstract Mobile GUI agents increasingly face long-horizon tasks that require reading, updating, and reusing task-relevant data across pages and applications. Existing memory methods treat memory largely as passive storage, where past observations are accumulated and retrieved when needed. Yet retrieving a value does not reveal its current role in the workflow. The agent must still infer from accumulated records whether the value should be used now, has already been used, or must wait for a later dependency. This implicit reconstruction becomes unreliable in long trajectories with similar fields, repeated values, distractors, and outdated states, causing repeated or missed operations. We propose Active Task Driving Memory (ATMem), which shifts GUI-agent memory from passive storage to an actively maintained execution state. ATMem maintains task-relevant information as a continually updated execution state that links each value to its role and current status, enabling action selection based on the current workflow state. We therefore introduce STR-GRPO, an online reinforcement learning method that learns to use ATMem selectively according to its contribution to task completion. STR-GRPO contrasts memory-on and memory-off rollouts to estimate when memory use improves execution, while memory-cost-aware reward discourages costly memory usage that does not improve execution. To evaluate whether agents can complete all in-scope work while avoiding out-of-scope actions over long-horizon execution, we build a challenging mobile benchmark. From a list of near-identical entries, agents must act on every entry that satisfies the instruction and reject entries that violate its constraints.
中文摘要 移动图形界面代理日益面临需要跨页面和应用读取、更新和重复使用任务相关数据的长期任务。现有的记忆方法主要将记忆视为被动存储，积累过去的观测数据，并在需要时检索。然而，检索一个值并不能揭示其在工作流程中当前的角色。代理仍需从累积记录中推断该值是否应立即使用、已被使用，或必须等待后续依赖。这种隐式重建在包含相似字段、重复值、干扰项和过时状态的长轨迹中变得不可靠，导致重复或遗漏操作。我们提出了主动任务驱动记忆（ATMem），它将GUI代理的记忆从被动存储转变为主动维护的执行状态。ATMem 以持续更新的执行状态形式维护任务相关信息，将每个值与其角色和当前状态关联起来，从而基于当前工作流状态选择动作。因此，我们引入了 STR-GRPO，一种在线强化学习方法，学习根据 ATMem 对任务完成的贡献选择性地使用它。STR-GRPO对比开启记忆和关闭记忆的rollout，以估算使用记忆何时能改善执行，而记忆成本感知奖励则抑制代价高昂且无法改善执行的记忆使用。为了评估代理是否能够在长期执行中避免范围外操作，同时完成所有范围内工作，我们构建了一个具有挑战性的移动基准。从一组几乎相同的条目中，代理必须对满足指令的每个条目采取行动，并拒绝违反其约束的条目。

Robust Autonomous UAV Landing on Maritime Platforms via Multimodal Agentic AI and Active Wave Compensation

通过多模态智能体AI和主动波浪补偿实现海上平台上稳健的无人机自主着陆

Authors: Francisco S. Neves, Pedro N. Pereira, Raul D.S.G. Campilho, Andry M. Pinto
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.31613
Pdf link: https://arxiv.org/pdf/2606.31613
Abstract Autonomous aerial inspection of marine infrastructure is frequently compromised by stochastic sea states, introducing risks of high-kinetic impacts, post-landing toppling, and sensory occlusion. This paper proposes a decoupled, multi-vehicle landing framework synchronizing an Unmanned Surface Vehicle (USV) equipped with a 3-RPU stabilized platform with a robust Unmanned Aerial Vehicle (UAV). The architecture utilizes two independent Deep Reinforcement Learning (DRL) agents: a Soft Actor-Critic (SAC) agent providing high-frequency wave-motion compensation for the landing deck, and a multimodal RL agent for the UAV's final approach. Evaluated in high-fidelity maritime simulations, the system achieved a 100% landing success rate across 15 trials in wave states varying from calm to rough. Results show a mean stabilization efficacy of 87.8%, maintaining the landing surface within 1 degree of the horizontal plane for 96% of the mission duration in rough conditions, effectively contributing to safer landings.
中文摘要 对海洋基础设施的自主空中巡检常常受到随机海况影响，带来高动能冲击、着陆后倾覆和传感器遮挡的风险。本文提出了一种解耦的多载具着陆框架，将配备3-RPU稳定平台的无人水面艇（USV）与鲁棒的无人机（UAV）同步。该架构采用两个独立的深度强化学习（DRL）智能体：一个软演员-评论家（SAC）智能体，为着陆甲板提供高频波浪运动补偿；另一个多模态强化学习智能体，负责无人机的最终进近。在高保真海上仿真中评估，系统在从平静到恶劣的海况下进行的15次试验中实现了100%的着陆成功率。结果显示，平均稳定效率为87.8%，在恶劣条件下，96%的任务时间内着陆面保持在与水平面相差1度以内，有效促进了更安全的着陆。

Think in English, Answer in Korean: Efficient Adaptation of Multilingual Tool-Using Agents

用英语思考，用韩语回答：多语言工具使用代理的高效适应

Authors: Utsav Garg, Sungjin Hong, Jason Jung, Justin Lee, Shaan Desai, Joon Hee Kim, Anirudh Shrinivason, Edmond Wen, Susie Park
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.31648
Pdf link: https://arxiv.org/pdf/2606.31648
Abstract We present LuckyStar 111B, a 111B-parameter hybrid reasoning model developed through a collaboration between Cohere and LG CNS for Korean-English enterprise agents under practical memory and serving constraints. The model trains from Cohere's fully post-trained Command A model rather than a new pretraining run, and uses preamble conditioning to switch between concise non-reasoning behavior and longer tool-oriented reasoning. We study four choices for scaling tool-using agents efficiently: multilingual supervised fine-tuning, reinforcement learning with verifiable rewards for multi-step tool-use tasks, language-consistency rewards for Korean user-facing responses, and 4-bit quantization for single-GPU serving. The adapted model improves mathematical reasoning, function calling, and agentic natural-language-to-SQL (NL2SQL) performance while preserving general Korean and English instruction-following quality. These results provide a practical recipe and failure-mode analysis for adapting post-trained multilingual models to verifiable agentic workflows under memory-constrained deployment.
中文摘要 我们介绍LuckyStar 111B，这是Cohere与LG CNS合作开发的111B参数混合推理模型，适用于韩英企业代理，且在实际内存和服务约束下使用。该模型基于Cohere完全后期训练的Command A模型训练，而非新的预训练运行，并利用前导条件在简洁非推理行为和较长的工具导向推理之间切换。我们研究了四种高效扩展工具使用代理的方案：多语言监督微调、多步工具使用任务的可验证奖励强化学习、韩语用户响应的语言一致性奖励，以及单GPU服务的4位量化。该模型改进了数学推理、函数调用和代理自然语言转SQL（NL2SQL）性能，同时保持了通用的韩语和英语指令跟随质量。这些结果为适应内存受限部署下的可验证代理工作流提供了实用的方案和失败模式分析。

FastDSAC: Enhancing Policy Plasticity via Constrained Exploration for Scalable Humanoid Locomotion

FastDSAC：通过受限探索提升策略可塑性，实现可扩展的人形机器人运动

Authors: Guanchen Lu, Yajuan Dun, Yi Zhou, Letian Tao, Jingliang Duan, Jie Li, Guofa Li
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.31691
Pdf link: https://arxiv.org/pdf/2606.31691
Abstract Scalable reinforcement learning has popularized high-throughput sampling architectures, which significantly compress the training time for off-policy methods in robotic locomotion. However, the rapid increase in data volume and update frequency undermines the stability of value-based methods and diminishes the plasticity of policy networks. To address these challenges, this work presents FastDSAC, a fast and high-performance variant of the Distributional Actor-Critic algorithm designed for parallel sampling scenarios. Specifically, we introduce a truncated Gaussian distribution to approximate the learned policy, which effectively excludes out-of-distribution actions that strain target value estimation while keeping necessary stochasticity for exploration. The proposed action constraint functions as an implicit regularization, which counteracts the plasticity loss typically caused by aggressive gradient updates. This preservation of network adaptability enhances sample efficiency, particularly in scenarios with a high update-to-data ratio, and accelerates the early training process. In contrast to prior fast reinforcement learning approaches that rely on discrete value distributions, our method utilizes a continuous Gaussian representation equipped with adaptive variance regulation, which improves value estimation accuracy by sampling confident and informative transitions. Extensive experiments on MuJoCo Playground and HumanoidBench demonstrate that FastDSAC not only stabilizes the overall training process but also achieves superior asymptotic performance and faster convergence compared to state-of-the-art baselines.
中文摘要 可扩展强化学习普及了高通量采样架构，显著压缩了机器人移动中非策略方法的训练时间。然而，数据量的快速增长和更新频率削弱了基于价值的方法的稳定性，并削弱了政策网络的可塑性。为应对这些挑战，本研究提出了FastDSAC，这是一种为并行采样场景设计的分布式行为者-批判者算法的快速高性能变体。具体来说，我们引入截断高斯分布以近似所学策略，有效排除了会带来目标值估计压力的非分布行为，同时保持必要的随机性以便探索。所提动作约束作为隐式正则化，抵消了激进梯度更新通常导致的可塑性损失。这种网络适应性的保持提升了样本效率，尤其是在更新与数据比例较高的场景中，并加快了早期训练过程。与以往依赖离散值分布的快速强化学习方法不同，我们的方法采用了配备自适应方差调控的连续高斯表示，通过采样自信且信息丰富的转移提高了价值估计的准确性。在MuJoCo Playground和HumanoidBench上的广泛实验表明，FastDSAC不仅稳定了整体训练过程，还实现了优于最先进基线的渐近性能和更快的收敛速度。
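
As a rough illustration of sampling actions from a truncated Gaussian policy, the sketch below draws from a normal distribution and rejects samples outside a bounded interval. It is only one simple way to realize truncation; the abstract does not say how FastDSAC implements it, and the bounds, the retry limit, and the fallback are assumptions chosen for this example.

```python
import random

def sample_truncated_gaussian(mean, std, low, high, rng, max_tries=1000):
    """Draw one action from N(mean, std^2) restricted to [low, high].

    Rejection sampling: redraw until the sample lands inside the interval.
    If that fails max_tries times, fall back to the mean clipped into range.
    """
    for _ in range(max_tries):
        action = rng.gauss(mean, std)
        if low <= action <= high:
            return action
    return min(max(mean, low), high)


rng = random.Random(0)
# Hypothetical choice: truncate at two standard deviations around the mean.
mean, std = 0.3, 0.5
low, high = mean - 2 * std, mean + 2 * std
actions = [sample_truncated_gaussian(mean, std, low, high, rng) for _ in range(200)]
```

Every sampled action stays within the chosen interval while still varying from draw to draw, which is the property the abstract attributes to the truncated policy: extreme actions are excluded but exploration noise remains.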

Diffusing Blame: Task-Dependent Credit Assignment in Biologically Plausible Dual-Stream Networks

分散责任：生物学上合理的双流网络中任务依赖的信用分配

Authors: Yutaro Yamada, Luca Grillotti, Rujikorn Charakorn, Sebastian Risi, David Ha, Robert Tjarko Lange
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/2606.31700
Pdf link: https://arxiv.org/pdf/2606.31700
Abstract Biological neural circuits obey Dale's principle: each neuron's synapses are uniformly excitatory or inhibitory. Artificial networks that respect this constraint must coordinate separate excitatory and inhibitory populations, fundamentally changing how credit is assigned during learning. Several biologically plausible learning rules avoid backpropagation's weight transport requirement, but it has been difficult to achieve strong performance under Dale's principle beyond MNIST. Error Diffusion (ED) was originally proposed in a dual-stream excitatory/inhibitory architecture, where learning is driven by routing global error signals to all layers without transporting transposed forward weights or relying on random feedback matrices. Whether such a rule can scale under Dale's principle across both supervised classification and reinforcement learning remains unknown. Here, we introduce modulo error routing to extend Error Diffusion beyond binary classification, and show that a dual-stream excitatory/inhibitory architecture trained with this method achieves 96.7% on MNIST and establishes a 61.7% baseline on CIFAR-10, demonstrating that representation learning is possible even when strictly enforcing Dale's principle. For the classification setting, we introduce three domain-specific innovations: layer-specific sigmoid widths, batch-centered class error signals, and asymmetric initialization, and ablation analysis reveals that their relative importance reverses between MNIST and CIFAR-10, exposing task-dependent credit-assignment bottlenecks invisible to single-benchmark evaluation. In reinforcement learning, we integrate ED with Proximal Policy Optimization (PPO) and evaluate it on continuous-control tasks in Google Brax and on Craftax, an open-ended exploration task. We show that ED-PPO achieves competitive performance relative to Direct Feedback Alignment, a backpropagation-free baseline.
中文摘要 生物神经回路遵循戴尔原则：每个神经元的突触均为兴奋性或抑制性。遵守这一约束的人工网络必须协调独立的兴奋性和抑制性群体，从根本上改变学习过程中学分的分配方式。若干生物学上合理的学习规则避免了反向传播的权重传输要求，但在戴尔原理下，除了MNIST之外，要实现强性能仍然困难。误差扩散（ED）最初提出在一种双流激发/抑制结构中，学习通过将全局误差信号路由到所有层次，而无需传输转置的前向权重或依赖随机反馈矩阵。该规则是否能在Dale原则下适用于监督分类和强化学习，目前尚不得而知。本文介绍模误差路由，将错误扩散扩展至二元分类之外，并展示了用该方法训练的双流兴奋/抑制性结构在MNIST上达到96.7%，在CIFAR-10上建立了61.7%的基线，证明即使严格执行Dale原则，表征学习也是可能的。在分类设置中，我们引入了三项领域特定的创新：层特定S形形宽度、批次中心类别错误信号和非对称初始化，消融分析显示它们在MNIST和CIFAR-10之间的相对重要性反转，揭示了单一基准评估中看不见的任务依赖性学分分配瓶颈。在强化学习中，我们将ED与近端策略优化（PPO）集成，并在Google Brax和Craftax（开放式探索任务）中的连续控制任务中进行评估。我们证明，ED-PPO相对于无反向传播基线直接反馈对齐实现了竞争性能。

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

UniCoder：通过符号奖励和引用引导的代码优化实现统一的可视化到代码生成

Authors: Yaozhi Zheng, Yilei Jiang, Manyuan Zhang, Yuxuan Wan, Kaituo Feng, Tianshuo Peng, Bo Zhang, Xiangyu Yue
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.31732
Pdf link: https://arxiv.org/pdf/2606.31732
Abstract Visual-to-Code generation, which transforms scientific plots, vector graphics, and webpages into executable scripts, demands a level of pixel-precise alignment that standard Multimodal Large Language Models (MLLMs) fail to achieve through Supervised Fine-Tuning (SFT) alone. While Reinforcement Learning (RL) offers a theoretical pathway to bridge this gap, its application is hindered by two fundamental obstacles: (1) Reward Coarseness, where semantic metrics like CLIP scores fail to penalize fine-grained element deviations, and (2) Exploration Stagnation, where the sparse, heterogeneous code search space prevents the policy from bootstrapping valid trajectories. To overcome these limitations, we introduce UniCoder, a unified RL framework that integrates two novel mechanisms. First, we propose Symbolic Attribute Alignment, which employs a lightweight auxiliary LLM to parse generated code into discrete visual attributes (e.g., hex colors, coordinate limits), enabling dense, element-wise reward computation. Second, to escape local optima, we devise Reference-Guided Code Optimization, a strategy that dynamically injects ground-truth trajectories into low-performing rollout groups, transforming blind exploration into guided policy improvement. Extensive experiments on ChartMimic, UniSVG, Design2Code and ScreenBench benchmarks demonstrate that our 8B-parameter model not only surpasses all open-source baselines but also achieves state-of-the-art performance comparable to proprietary models, establishing a new paradigm for generalized visual-to-code synthesis.
中文摘要 可视化到代码生成，即将科学图表、矢量图形和网页转换为可执行脚本，要求达到标准多模态大型语言模型（MLLM）仅靠监督微调（SFT）无法达到的像素精确对齐水平。虽然强化学习（RL）提供了弥合这一差距的理论路径，但其应用受到两个根本障碍的阻碍：（1）奖励粗糙性，即语义指标如CLIP评分未能惩罚细粒度元素偏差;（2）探索停滞，稀疏且异构的代码搜索空间阻碍策略自举出有效轨迹。为克服这些限制，我们引入了UniCoder，一个整合了两种新机制的统一强化学习框架。首先，我们提出了符号属性对齐，它利用轻量级辅助大型语言模型将生成的代码解析为离散的视觉属性（如十六进制颜色、坐标范围），实现密集的元素级奖励计算。其次，为了摆脱局部最优，我们设计了参考引导代码优化，这是一种动态地向低表现rollout组注入真值轨迹的策略，将盲目探索转变为有引导的策略改进。在ChartMimic、UniSVG、Design2Code和ScreenBench基准测试上的大量实验表明，我们的8B参数模型不仅超越了所有开源基线，还实现了与专有模型相当的先进性能，建立了通用可视化到代码合成的新范式。

Addressing Over-Refusal in LLMs with Competing Rewards

解决带有竞争奖励的大型语言模型中的过度拒绝问题

Authors: Taeyoun Kim, Aviral Kumar
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.31748
Pdf link: https://arxiv.org/pdf/2606.31748
Abstract Safety training on language models often induces over-refusal: improved safety on harmful prompts at the cost of increased refusal on harmless ones. Though this trade-off can be mitigated by training models with reinforcement learning (RL) to reason before answering, it does not remove the underlying problem that reasoning can often be a "rubber stamp" for a predetermined response. In this paper, we address the safety-refusal trade-off by rethinking how models are trained to reason about safety. Our key insight is that unsafe reasoning can itself serve as a useful exploratory signal. Rather than preemptively blocking harmful thoughts, we encourage the model to sufficiently explore unsafe reasoning but produce a safe response. The harmful exploration improves the model's ability to distinguish harmful from harmless prompts by resolving ambiguity, allowing it to remain safe while complying only when appropriate. We cast this as an adversarial optimization problem in which a reasoning player explores strategies for producing an unsafe response and an answer player ensures that the final output is safe. We train a single model with dense rewards to play both roles within one chain-of-thought, across different segments. To achieve this, we find that process rewards are crucial for stable optimization of competing objectives. Our resulting model SEAR deliberately engages in harmful reasoning as exploration while reliably flipping back to a safe answer. We demonstrate that this behavior helps mitigate over-refusal and defend against attacks that directly manipulate the reasoning to be harmful.
中文摘要 对语言模型进行安全培训常常导致过度拒绝：在有害提示上安全性提升，但对无害提示的拒绝率却有所增加。虽然这种权衡可以通过训练强化学习（RL）来缓解，让模型在回答前先推理，但这并不能消除推理常常成为预定回答的“橡皮图章”这一根本问题。本文通过重新思考模型如何训练以推理安全问题，来解决安全与拒绝的权衡。我们的关键见解是，不安全推理本身可以作为有用的探索信号。我们鼓励模型充分探索不安全推理，同时产生安全的回应，而不是先预防有害思想。有害探索通过消除歧义提升模型区分有害与无害提示的能力，使其在适当时才遵守的同时保持安全。我们将此视为一个对抗性优化问题，其中理性玩家探索产生不安全反应的策略，而回答者则确保最终输出安全。我们训练一个带有密集奖励的模型，在一个思维链中扮演两个角色，跨越不同的细分段。为此，我们发现过程奖励对于稳定优化竞争目标至关重要。我们最终形成的SEAR模型故意通过有害的推理进行探索，同时可靠地回归到一个安全的答案。我们证明这种行为有助于减轻过度拒绝，并防御直接操控推理使其有害的攻击。

Reinforcement Learning-Based Control for an Inline Skating Humanoid Robot

基于强化学习的直排轮滑人形机器人控制

Authors: Ethan Marot, Thomas Bi, Clemens Schwarke, Victor Klemm, Marco Hutter, Raffaello D'Andrea
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.31807
Pdf link: https://arxiv.org/pdf/2606.31807
Abstract As humanoid robots become increasingly dynamic, coupling them with reinforcement learning offers a promising approach to solving the complex, underactuated mechanics of passive inline skating. Equipping a humanoid robot with passive inline skating wheels presents an opportunity to combine the versatile agility of humanoids with the high-speed, energy-efficient locomotion strategies utilized by human skaters. In this paper, we train and deploy a reinforcement learning control policy that enables novel locomotion strategies for a humanoid robot modified to equip consumer inline skates instead of conventional feet. Unlike previous work limited to quadrupedal robots or actively driven wheels, our system allows for precise 6-DoF control of the skates to execute dynamic, edge-driven propulsion strategies. Our skating strategies emerge entirely from our reward structure, without reliance on human motion data, imitation learning, or kinematic priors. We overcome the inherent instability of passive wheels and simulation contact artifacts by utilizing different geometric wheel models (spherical and ellipsoidal) during training and validation, along with a custom success-based command curriculum and a specialized rolling reward. Consequently, our policy demonstrates up to a 50% reduction in Cost of Transport (CoT) compared to standard walking gaits. The resulting policy successfully transfers zero-shot to the physical Booster T1 hardware. Real-world deployments demonstrate dynamic balance, the ability to reject active physical perturbations, and agile locomotion strategies capable of turning at speed. A video of our results can be found at this https URL.
中文摘要 随着人形机器人日益动态化，将其与强化学习结合，为解决被动直排轮滑复杂且欠驱动的力学问题提供了有前景的方法。为人形机器人配备被动直排轮，提供了将人形机器人的多功能灵活性与人类轮滑者采用的高速、节能运动策略相结合的机会。本文中，我们训练并部署了一套强化学习控制策略，使一台改装为穿戴消费级直排轮滑鞋而非传统脚部的人形机器人能够采用新型运动策略。与以往仅限于四足机器人或主动驱动轮子的工作不同，我们的系统支持对轮滑鞋的精确6-DoF控制，以执行动态的、由刃驱动的推进策略。我们的滑行策略完全源自奖励结构，不依赖人类运动数据、模仿学习或运动学先验。我们通过在训练和验证中使用不同的几何轮模型（球形和椭球形），以及定制的基于成功率的指令课程和专门的滚动奖励，克服了被动轮子固有的不稳定性和仿真接触伪影。因此，与标准步行步态相比，我们的策略可将运输成本（CoT）降低多达50%。所得策略成功零样本迁移到 Booster T1 实体硬件上。真实环境部署展示了动态平衡、抵抗主动物理扰动的能力，以及能够高速转向的灵活运动策略。我们的结果视频可在此 https 网址观看。
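
The skating abstract above reports its efficiency gain as a reduction in Cost of Transport (CoT). As a reading aid, here is a minimal sketch of the commonly used dimensionless definition, CoT = P / (m g v). The abstract does not state which exact variant the authors use, so the formula choice, the function name `cost_of_transport`, and all numbers below are assumptions for illustration only.

```python
def cost_of_transport(power_w: float, mass_kg: float, speed_m_s: float,
                      g: float = 9.81) -> float:
    """Dimensionless cost of transport: CoT = P / (m * g * v)."""
    if mass_kg <= 0 or speed_m_s <= 0:
        raise ValueError("mass and speed must be positive")
    return power_w / (mass_kg * g * speed_m_s)


# Hypothetical numbers, NOT taken from the paper: same power draw and mass,
# but the skating gait covers ground twice as fast as the walking gait.
walking = cost_of_transport(power_w=300.0, mass_kg=30.0, speed_m_s=1.0)
skating = cost_of_transport(power_w=300.0, mass_kg=30.0, speed_m_s=2.0)

# Relative reduction in CoT of skating versus walking.
reduction = 1.0 - skating / walking
print(f"CoT walking={walking:.3f}, skating={skating:.3f}, reduction={reduction:.0%}")
```

Under this definition, doubling speed at equal power halves the CoT, which is one way a reduction of the reported magnitude could arise.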

Geometry-Preserving Orthonormal Initialization for Low-Rank Adaptation in RLVR

RLVR中用于低秩适应的保持几何的正交归一初始化

Authors: Ruijia Zhang, Jiacheng Zhu, Hanqing Zhu, Laixi Shi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.31813
Pdf link: https://arxiv.org/pdf/2606.31813
Abstract Low-rank adaptation (LoRA) and its variants enable parameter-efficient fine-tuning of large language models under the supervised fine-tuning (SFT) paradigm. However, their efficacy and behavior under Reinforcement learning with verifiable rewards (RLVR) are less well understood. In particular, two structurally initialized LoRA variants, PiSSA and MiLoRA, which outperform standard LoRA under SFT, can underperform standard LoRA under RLVR and may even exhibit training instability. These observations suggest that how to initialize the low-rank matrices in RLVR remains unclear. In this work, we develop a theoretical analysis of LoRA in RLVR, showing that orthonormal initialization achieves the minimal gap between LoRA outcome and that of full fine-tuning. Guided by this insight, we propose geometry-preserving orthonormal initialization for low-rank adaptation in RLVR, leading to two new variants, RLPO and RLMO. Experiments on mathematical reasoning benchmarks show that the proposed orthonormal initialization stabilizes RLVR training and outperforms standard LoRA, contrasting with PiSSA and MiLoRA. Finally, our unified analysis for LoRA initialization also explains why PiSSA and MiLoRA can underperform in RLVR, which may be of independent interest. Code and checkpoints are publicly available at this https URL.
中文摘要 低秩适应（LoRA）及其变体使得在监督微调（SFT）范式下实现参数高效的大型语言模型微调。然而，它们在可验证奖励强化学习（RLVR）下的有效性和行为尚不十分明了。特别是，两种结构初始化的LoRA变体PiSSA和MiLoRA在SFT下优于标准LoRA，但在RLVR下可能表现不及标准LoRA，甚至可能表现出训练不稳定性。这些观察表明，如何初始化RLVR中的低秩矩阵仍然不明确。本研究中，我们对 RLVR 中的 LoRA 进行了理论分析，表明正交归一初始化实现了 LoRA 结果与完全微调之间最小的差距。基于这一见解，我们提出了保持几何的正交归一初始化，用于RLVR中的低秩适应，从而产生了两种新变体：RLPO和RLMO。数学推理基准测试的实验表明，所提出的正交归一初始化稳定了RLVR训练，并优于标准LoRA，这与PiSSA和MiLoRA形成对比。最后，我们对LoRA初始化的统一分析也解释了为何PiSSA和MiLoRA在RLVR中表现可能不佳，这可能具有独立的兴趣。代码和检查点在此 https URL 公开。
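
To make "orthonormal initialization" of a LoRA adapter concrete, the sketch below draws the down-projection matrix A with orthonormal rows (Gram-Schmidt on Gaussian samples) and sets the up-projection B to zero, following the usual LoRA convention that the adapter starts as a no-op. This is only a generic illustration: the abstract does not specify how RLPO or RLMO are constructed, and the helper names `orthonormal_rows` and `init_lora` are hypothetical.

```python
import math
import random


def orthonormal_rows(r, d, seed=0):
    """Return r orthonormal row vectors in R^d via Gram-Schmidt on Gaussian samples."""
    if r > d:
        raise ValueError("need r <= d")
    rng = random.Random(seed)
    rows = []
    while len(rows) < r:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # Remove components along the rows found so far (two passes for stability).
        for _ in range(2):
            for q in rows:
                dot = sum(a * b for a, b in zip(v, q))
                v = [a - dot * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:  # skip the (unlikely) degenerate sample
            rows.append([a / norm for a in v])
    return rows


def init_lora(d_in, d_out, r, seed=0):
    """Hypothetical helper: A (r x d_in) has orthonormal rows, B (d_out x r) is zero."""
    A = orthonormal_rows(r, d_in, seed)
    B = [[0.0] * r for _ in range(d_out)]  # B @ A == 0, so the base model is unchanged at step 0
    return A, B


A, B = init_lora(d_in=16, d_out=8, r=4)
```

With orthonormal rows, A A^T is the identity, so A preserves norms and angles on its row space; that property is one plausible reading of "geometry-preserving", stated here as an assumption.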

Z-1: Efficient Reinforcement Learning for Vision-Language-Action Models

Z-1：视觉-语言-行动模型的高效强化学习

Authors: Lang Cao, Renhong Chen, Luyi Li, Peng Wang, Mofan Peng, Yitong Li
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.31846
Pdf link: https://arxiv.org/pdf/2606.31846
Abstract Vision-Language-Action (VLA) models offer a promising framework for robotic manipulation by connecting language instructions, visual observations, and continuous control. However, most existing policies remain limited by behavior cloning or supervised fine-tuning (SFT) from fixed demonstrations, which provides limited opportunity to improve from the policy's own failures. In this paper, we present Z-1, a reinforcement learning (RL) post-training framework for flow-based VLA models. Built on top of $\pi_{0.5}$, Z-1 uses only publicly released RoboCasa demonstrations for SFT and then applies a task-wise Group Relative Policy Optimization (GRPO) strategy across 24 standard RoboCasa tasks. To improve the efficiency and stability of online optimization, Z-1 combines shared-prefix rollout construction, tree-structured trajectory branching, completion-aware reward calibration, and selective joint training of the VLM and the Action Expert. Across all 24 RoboCasa tasks, Z-1 achieves an average success rate of 80.6%, improving over its SFT initialization by 13.2 percentage points and outperforming published state-of-the-art (SOTA) models. These results show that systematic GRPO post-training can substantially improve flow-based VLA policies without additional private demonstrations.
中文摘要 视觉-语言-动作（VLA）模型通过连接语言指令、视觉观察和连续控制，为机器人操作提供了有前景的框架。然而，大多数现有策略仍受限于基于固定演示的行为克隆或监督微调（SFT），这使得从策略自身的失败中改进的机会有限。本文介绍了Z-1，一种针对基于流的VLA模型的强化学习（RL）后训练框架。Z-1 建立在 $\pi_{0.5}$ 之上，仅使用公开发布的 RoboCasa 演示进行 SFT，然后在 24 个标准 RoboCasa 任务上应用按任务划分的组相对策略优化（Group Relative Policy Optimization，GRPO）策略。为提升在线优化的效率和稳定性，Z-1结合了共享前缀的 rollout 构建、树状轨迹分支、完成感知奖励校准以及VLM与Action Expert的选择性联合训练。在全部 24 个 RoboCasa 任务上，Z-1的平均成功率为 80.6%，比其SFT初始化提升 13.2 个百分点，且性能优于已发布的SOTA模型。这些结果表明，系统化的GRPO后训练可以在无需额外私有演示的情况下，显著改善基于流的VLA策略。
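
The Z-1 abstract above builds on Group Relative Policy Optimization (GRPO). For readers unfamiliar with it, the core idea is that each rollout's advantage is its reward standardized against the other rollouts sampled for the same task, so no learned value function is needed. The sketch below shows only that generic normalization step; Z-1's task-wise strategy, tree branching, and reward calibration are not specified in the abstract and are not reproduced here. The function name and the `eps` constant are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for the same task.

    advantage_i = (r_i - mean(group)) / (std(group) + eps)
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of four rollouts with binary task-success rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Successful rollouts receive positive advantages and failed ones negative, and a group whose rollouts all share the same reward yields zero advantages, hence no gradient signal from that group.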

CoDex: Learning Compositional Dexterous Functional Manipulation without Demonstrations

CoDex：学习无演示的组合灵巧功能操作

Authors: Bowen Jiang, William Painter Reger, Roberto Martin-Martin
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.31909
Pdf link: https://arxiv.org/pdf/2606.31909
Abstract In this work, we study Compositional Dexterous Functional Object Manipulation (CD-FOM): tasks such as aiming and actuating a spray bottle on a plant or a glue gun on wood, which require both actuating an object's internal mechanism and controlling its pose to apply the object's function to the environment. These tasks pose significant challenges for robots due to the demanding integration of semantic understanding of the object's function, actuation mode, and application area with intricate physical dexterity to manage grasp stability, movement trajectory, and actuation. We introduce CoDex, a zero-demonstration framework that autonomously discovers CD-FOM manipulation strategies. CoDex uses vision-language models (VLMs) to infer semantic constraints from the task and scene. These constraints guide analytic constrained optimization to generate a short list of functional grasp candidates that can be efficiently refined with reinforcement learning to generate full grasp-move-actuate policies transferable from simulation to the real world. We evaluate CoDex on a 7-DoF robot arm with a 16-DoF multi-fingered hand across six CD-FOM tasks involving previously unseen objects with internal mechanisms, including spray bottles, hot glue guns, air dusters, flashlights, and pepper grinders, and their application to unseen target objects, showcasing its ability to autonomously discover and execute complex, physically viable dexterous behaviors without human demonstrations. More information at this https URL.
中文摘要 在本研究中，我们研究合成灵巧功能对象操作（CD-FOM）：诸如瞄准并操作喷雾瓶在植物上，或用胶枪在木头上操作，这些任务既需要驱动物体内部机制，也需要控制其姿态，以将物体的功能应用于环境。这些任务对机器人来说是重大挑战，因为需要将对物体功能、驱动模式和应用领域的语义理解与复杂的身体灵巧度相结合，以管理抓握稳定性、运动轨迹和触发。我们介绍CoDex，一个零演示框架，能够自主发现CD-FOM操作策略。CoDex 使用视觉语言模型（VLM）从任务和场景推断语义约束。这些约束引导分析约束优化生成一份简短的函数抓取候选列表，这些候选对象可以通过强化学习高效优化，生成可从模拟转移到现实世界的完整抓取-移动-执行策略。我们评估了CoDex在7-DoF机器人臂上，配合16-DoF多指手，完成了六个CD-FOM任务，涉及此前未见过的内部机制物体，包括喷雾瓶、热熔胶枪、空气掸子、手电筒和胡椒研磨器，并应用于看不见的目标物体，展示了其自主发现并执行复杂且具物理可行灵巧行为的能力，无需人工演示。更多信息请访问此 https 网址。

Learning Locomotion on Discrete Terrain via Minimal Proximity Sensing

通过最小接近感测学习离散地形上的运动

Authors: Jiale Fan, Connor Flynn, Tianao Xu, Junzhe He, Andrei Cramariuc, Marco Hutter, Robert Baines
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.31912
Pdf link: https://arxiv.org/pdf/2606.31912
Abstract Learning-based control has revolutionized dynamic locomotion, yet navigating unstructured terrain remains limited by a robot's incomplete awareness of imminent ground contact. While global perception systems such as LiDARs and depth cameras provide environmental context, they are frequently plagued by latencies, occlusions, and the high computational cost of dense geometric reconstruction. On the other hand, proprioceptive feedback is purely reactive, initiating corrections only after impact has occurred. This work explores embedding a minimal suite of low-cost, high-frequency infrared proximity sensors directly into the feet of a quadrupedal robot. These sensors provide "pre-contact" feedback that is robust to self-occlusions and significantly less computationally demanding than conventional vision-based pipelines. By integrating these localized signals into a reinforcement learning framework, we enable the robot to anticipate terrain discontinuities such as gaps and stepping stones that are problematic for traditional perception stacks due to occlusions or state estimation drift. We demonstrate that such sparse, near-field sensing can be reliably modeled in simulation and transferred to the real world with high fidelity. Experimental results show that local proximity sensing substantially improves traversal robustness over discrete terrain and offers a low-power, low-latency alternative or complement to complex global perception suites in unpredictable environments. For more information about results and methods, please see the project website: this https URL.
中文摘要 基于学习的控制彻底改变了动态移动，但在无结构地形中导航仍受限于机器人对即将到来的地面接触感知不完全。虽然像激光雷达和深度相机这样的全球感知系统提供了环境背景，但它们经常受到延迟、遮挡以及密集几何重建的高计算成本困扰。另一方面，本体感觉反馈纯粹是反应性的，只有在撞击发生后才会开始纠正。这项工作探讨了将一套极简的低成本高频红外接近传感器直接嵌入四足机器人的脚部。这些传感器提供“接触前”反馈，对自闭更具抵抗力，计算量远低于传统基于视觉的管道。通过将这些局部信号整合进强化学习框架，我们使机器人能够预判地形不连续点，如间隙和踏脚石，这些因遮挡或状态估计漂移而对传统感知堆栈来说是个难题。我们证明了这种稀疏的近场感测可以在仿真中可靠建模，并以高保真度转移到现实世界。实验结果表明，局部接近感能显著提升离散地形上的横移鲁棒性，并为复杂全局感知套件在不可预测环境中提供低功耗、低延迟的替代方案或补充。有关结果和方法的更多信息，请访问项目网站：此 https URL。

LeCropFollow: Latent Space Planning for Navigation in Unstructured Crop Fields

LeCropFollow：无结构作物田中导航的潜在空间规划

Authors: Felipe Tommaselli, Francisco Affonso, Arthur Pompeu, Gianluca Capezzuto, Arun Narenthiran Sivakumar, Girish Chowdhary, Marcelo Becker
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.31941
Pdf link: https://arxiv.org/pdf/2606.31941
Abstract Unstructured navigational features, such as irregular planting or discontinuities, remain the primary failure mode for under-canopy agricultural robots. Existing geometric approaches often fail in these scenarios because they compress high-dimensional visual data into deterministic spatial references, effectively discarding the uncertainty and semantic context required to navigate ambiguous terrain. To address this, we present LeCropFollow, a visual navigation framework that bypasses explicit geometric modeling in favor of a learned latent representation. By integrating a self-supervised semantic heatmap extractor with TD-MPC2, a Model-Based Reinforcement Learning (MBRL) planner, our system optimizes trajectories directly within a latent manifold. The framework operates over the uncompressed heatmap signal, preserving the semantic context that geometric reductions discard. We demonstrate that this representational shift enables zero-shot transfer from simplified simulation to the physical world without fine-tuning. Extensive field experiments in late-stage corn fields show that LeCropFollow matches state-of-the-art baselines in unstructured rows but significantly outperforms them in plantation gaps, achieving a 2.4x reduction in semantic failures compared to keypoint-based methods. These results suggest that latent planning offers a robust alternative to geometric estimation for operations in heterogeneous agricultural environments. Code, models, and data available: this https URL .
中文摘要 非结构化导航特征，如不规则种植或不连续地带，仍是树冠下农业机器人的主要故障模式。现有几何方法常常在这些场景中失败，因为它们将高维视觉数据压缩为确定性的空间参考，实际上丢弃了导航模糊地形所需的不确定性和语义语境。为此，我们提出了LeCropFollow，一种绕过显式几何建模、转而采用学习潜能表征的可视化导航框架。通过将自监督语义热图提取器与TD-MPC2（基于模型强化学习（MBRL）规划器集成，我们的系统直接优化潜流形内的轨迹。该框架作用于未压缩热图信号，保持几何约化所丢弃的语义上下文。我们证明了这种表征转变使得从简化模拟到物理世界的零射击转移成为可能，而无需微调。在晚期玉米田中的大量田间实验显示，LeCropFollow在无结构行中可匹配最先进的基线，但在种植间隙中显著优于基线，语义失败率比基于关键点的方法减少了2.4倍。这些结果表明，潜伏规划为异质农业环境中的操作提供了一种有力的几何估计替代方案。代码、模型和可用数据：此 https URL 。

Adapting Generalist Robot Policies with Semantic Reinforcement Learning

采用语义强化学习的通用机器人策略

Authors: Jagdeep Singh Bhatia, Andrew Wagenmaker, William Chen, Sergey Levine
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.31958
Pdf link: https://arxiv.org/pdf/2606.31958
Abstract Generalist robot policies learn a diverse repertoire of behaviors from large-scale pretraining. In principle, this makes them excellent priors for downstream adaptation via reinforcement learning (RL). In practice, however, standard RL methods leveraging this prior optimize directly over robot actions, requiring the base policy's action distribution to be close to that of a performant policy from the start. This assumption breaks down for complex or long-horizon tasks that fall outside the pretraining distribution. Our key insight is that, for sufficiently expressive generalist policies, language prompts are an effective alternative space for learning to solve such tasks: modulating language inputs elicits skills already within the policy's repertoire, which can be composed to solve tasks beyond its zero-shot capabilities. We propose Semantic Action Reinforcement Learning (SARL), which learns to optimize this prompt space through online interaction, treating the generalist policy as a controllable skill prior. Importantly, leveraging pretrained skills rather than learning new ones from scratch yields structured, semantically meaningful exploration and highly efficient online improvement, and learning to modulate prompts through experience grounds them in induced real-world behaviors for robust task-solving. Across real-world settings and simulated benchmarks, we show SARL unlocks fundamentally new capabilities -- adapting VLA behavior to solve complex, long-horizon tasks -- and significantly outperforms existing approaches for improving robot behavior in deployment.
中文摘要 通才机器人政策通过大规模预训练学习多样化的行为。原则上，这使它们成为通过强化学习（RL）进行下游适应的极佳先验。然而，在实际操作中，利用这一先例的标准强化学习方法会直接优化机器人动作，要求基础策略的动作分布从一开始就接近性能优良策略。对于超出预训练分布的复杂或长视野任务，这一假设会失效。我们的关键见解是，对于足够表达性的通才策略，语言提示是一个有效的替代空间来学习解决此类任务：调节语言输入能激发策略已有的技能，这些技能可以组合起来解决其零样本能力之外的任务。我们提出了语义行动强化学习（SARL），通过在线互动学习优化提示空间，将通才策略视为可控的先验技能。重要的是，利用预先训练的技能而非从零学习新技能，可以带来结构化、语义意义深远的探索和高效的在线提升，而通过经验学习调节提示，则使他们扎根于现实世界的诱发行为，从而实现扎实的任务解决能力。在现实世界环境和模拟基准测试中，我们展示了SARL解锁了根本性的新功能——调整VLA行为以解决复杂的长期任务——并且在提升机器人部署行为方面显著优于现有方法。

GR2 Technical Report

GR2技术报告

Authors: Yufei Li, Zaiwei Zhang, Mingfu Liang, Kavosh Asadi, Jay Xu, Jimmy Kim, Chongyang Bai, Jieyi Zhang, Hongye Xie, Prachi Agrawal, Dian Yu, Tianyi Chen, Jean-Pascal Billaud, Garret Buell, YK (Yongkang) Zhu, Sachin Patil, Brooke Bian, Zhou Fang, Kevin Huang, Shiva Sudanagunta, Yuzhen Huang, Emma Lu, Chris O'Brien, Yang Song, Lihong Li, Jacob Tao, Zhicheng Zhu, Chao Li, Gaoxiang Liu, Neil Wu, Zhongyin Hu, Li Han, Loki Chen, Ming Lei, Greg Rehm, Siyuan Song, Tianwei Zhang, Li Li, Ketan Singh, Yavuz Yetim, Ilyas Atishev, Satendra Gera, Ashkan Sadeghi, Rachel Yan, Nikko Mizutani, Shuaiwen Wang, Song Yang, Zhijing Li, Jiang Liu, Mengying Sun, Fei Tian, Xiaohan Wei, Chonglin Sun, Parish Aggarwal, Kaushik Rangadurai, Zhi Hua, Frank Shyu, Ruchit Sharma, Liyuan Li, Shike Mei, Wenlin Chen, Santanu Kolay, Ben Schulte, Deepak Chandra, Adam (Yang) Song, Sandeep Pandey, Xi Liu, Hamed Firooz, Luke Simon
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.31984
Pdf link: https://arxiv.org/pdf/2606.31984
Abstract Industrial recommendation systems serve billions of users through a multi-stage funnel -- retrieval, early-stage ranking, and re-ranking -- where the final re-ranking step disproportionately shapes user engagement and downstream performance, particularly for carousel and grid display formats. Despite growing enthusiasm for Large Language Models (LLMs) in recommendation, three gaps hinder industrial adoption: (1) most efforts target retrieval and ranking, leaving re-ranking -- the stage closest to the final user experience -- largely underexplored; (2) LLMs are typically deployed zero-shot or via supervised fine-tuning, underutilizing the reasoning capabilities unlocked by reinforcement learning (RL) on verifiable rewards; (3) deployed catalogs index billions of items with non-semantic identifiers that lie outside any base-LLM vocabulary. We present GR2 (Generative Reasoning Re-Ranker), an end-to-end framework that combines (i) mid-training on semantic IDs produced by a tokenizer with >=99% uniqueness, (ii) reasoning-trace distilled from a stronger teacher via targeted prompting and rejection sampling, and (iii) RL with verifiable rewards purpose-built for re-ranking. To make GR2 resource-viable, we further (iv) introduce a context compressor that amortizes training cost, On-Policy Distillation (OPD) as a scalable alternative to SFT -- which we find collapses at industrial scale -- and reasoning distillation for low-latency serving. GR2 delivers +18.7% R@1, +7.1% R@3, and +9.6% N@3 over legacy baselines on industrial-scale traffic. We further find that reward design is critical in re-ranking: LLMs often hack rewards by preserving the incoming order or exploiting position bias, motivating conditional verifiable rewards as essential industrial components.
中文摘要 工业推荐系统通过多阶段的漏斗——检索、早期排名和重新排序——服务数十亿用户，最后的重新排名步骤对用户参与度和后续表现产生不成比例的影响，尤其是在轮播和网格显示格式中。尽管大型语言模型（LLMs）在推荐中越来越受欢迎，但工业界的采用仍存在三个阻碍：（1）大多数努力都聚焦于检索和排名，导致最接近最终用户体验的重新排序阶段——大多未被充分探索;（2）LLM通常采用零点或监督微调方式部署，未能充分利用强化学习（RL）对可验证奖励解锁的推理能力;（3）部署目录索引数十亿个非语义标识符的项目，这些标识符不属于任何基础大型语言模型词汇表。我们提出了GR2（生成推理再排序器），这是一个端到端框架，结合了（i）由具有>=99%唯一性的唯一性分词器生成的语义ID中进行中期训练，（ii）通过针对性提示和拒绝抽样从更强教师那里提炼出的推理痕迹，以及（iii）专为重新排序设计的可验证奖励的强化学习。为了使GR2资源可行，我们进一步（iv）引入了一种可摊销训练成本的上下文压缩器、作为可扩展的策略蒸馏（OPD）替代方案——我们在工业规模中发现SFT容易崩溃——以及用于低延迟服务的推理蒸馏。GR2在工业级交通中相比传统基线，R@1+18.7%，R@3+7.1%，+9.6%N@3。我们还发现奖励设计在重新排序中至关重要：大型语言模型常通过保留进单或利用位置偏差来破解奖励，将条件可验证奖励作为工业组成部分的必要组成部分。

OopsieVerse: A Safety Benchmark with Damage-Aware Simulation for Robot Manipulation

OopsieVerse：机器人操控的安全基准，具备损伤感知模拟

Authors: Arnav Balaji, Arpit Bahety, Sriniket Ambatipudi, Daniel Lam, Junhong Xu, Roberto Martín-Martín
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.31993
Pdf link: https://arxiv.org/pdf/2606.31993
Abstract While robotic manipulation capabilities have advanced rapidly, physical safety remains a major barrier to deploying household robots: task success is insufficient if the robot damages itself or its surroundings. Simulation offers a harm-free alternative to costly and dangerous real-world training and evaluation, yet existing simulators lack general mechanisms to detect, quantify, and represent damage. To address this gap, we introduce OOPSIEVERSE, a unified simulation framework and benchmark for damage-aware household manipulation. OOPSIEVERSE provides damage as an explicit, physically-grounded, and task-agnostic signal by converting sources such as contact forces, temperature changes, and liquid interactions into corresponding mechanical, thermal, or fluid damage. OOPSIEVERSE comprises two core elements: (1) DAMAGESIM, a simulator-agnostic framework for detecting and quantifying damage during navigation and manipulation, and (2) a suite of household tasks designed to evaluate common damage modes and distinguish between task completion and safe execution. We demonstrate the generality of our framework by instantiating DAMAGESIM in two simulators with different physics backends, OmniGibson (Nvidia Omniverse) and RoboCasa (MuJoCo). We further showcase the utility of OOPSIEVERSE across multiple use cases, including (1) guiding safer demonstration collection via real-time damage feedback, (2) learning safer manipulation policies through damage-conditioned imitation learning and reinforcement learning, (3) benchmarking the safety of state-of-the-art Vision Language Action policies, and (4) improving real-world safety of sim-to-real transferred policies. Together, our results highlight the potential of OOPSIEVERSE as an open-source foundation for systematic, scalable research on safe robot manipulation. For code and more information, please refer to this https URL
中文摘要 尽管机器人操控能力迅速进步，但物理安全仍是部署家用机器人的主要障碍：如果机器人自身或环境受损，任务成功就不够。仿真为昂贵且危险的现实训练和评估提供了一种无害的替代方案，但现有模拟器缺乏检测、量化和表现损伤的通用机制。为弥补这一空白，我们引入了OOPSIEVERSE，一个统一的仿真框架和损害感知家庭操作的基准。OOPSIEVERSE通过将接触力、温度变化和液体相互作用等源转换为相应的机械、热或流体损伤，提供显性、物理接地且任务无关的信号。OOPSIEVERSE包含两个核心元素：（1）DAMAGESIM，一个与模拟器无关的框架，用于检测和量化导航和操作过程中的损伤;（2）一套用于评估常见损伤模式并区分任务完成与安全执行的家务任务。我们通过在两个具有不同物理后端的模拟器中实例化DAMAGESIM，展示了我们框架的通用性，分别是OmniGibson（Nvidia Omniverse）和RoboCasa（MuJoCo）。我们还进一步展示了OOPSIEVERSE在多种应用场景中的实用性，包括（1）通过实时损伤反馈引导更安全的演示收集，（2）通过损伤条件模仿学习和强化学习学习更安全的操作策略，（3）对最先进的视觉语言行动策略安全性进行基准测试，以及（4）提升模拟到现实转移策略的实际安全性。我们的研究结果共同凸显了OOPSIEVERSE作为系统且可扩展安全机器人操作研究开源基础的潜力。有关代码和更多信息，请参阅此 https URL

On the Comparison of Reinforcement Learning and Adaptive Control for Linear Systems under Packet Loss and Uncertainty

关于强化学习与线性系统在数据包丢失和不确定性条件下的自适应控制比较

Authors: Moh Kamalul Wafi
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2606.32003
Pdf link: https://arxiv.org/pdf/2606.32003
Abstract This paper presents a comparative study between Adaptive Quantized Control (AQC) and Deep Deterministic Policy Gradient (DDPG) reinforcement learning for uncertain linear systems with input quantization over communication channels subject to packet loss. The considered setting also includes dynamic switching from a nominal unstable system to a more unstable one during operation. The AQC is designed for unknown system dynamics using acknowledgment messages to compensate for packet losses, whereas the DDPG controller is trained using the nominal system model without acknowledgment messages. Numerical results show that the DDPG controller achieves faster transient responses and improved damping within its training environment. However, under model uncertainty, packet loss, and dynamic switching, the AQC consistently demonstrates superior robustness owing to its rigorous Lyapunov stability guarantees. These results highlight the trade-off between data-driven performance and model-based robustness, and provide insight into the applicability of reinforcement learning and adaptive control for networked uncertain systems.
中文摘要 本文提出了一项比较研究，比较自适应量化控制（AQC）与深度确定性策略梯度（DDPG）强化学习，针对在通信信道上存在分组丢包的不确定线性系统。考虑的设置还包括在运行过程中从名义上的不稳定系统动态切换到更不稳定的系统。AQC设计用于未知系统动态，使用确认消息补偿数据包丢失，而DDPG控制器则使用名义系统模型训练，不使用确认消息。数值结果表明，DDPG控制器在训练环境中实现了更快的瞬态响应和更好的阻尼效果。然而，在模型不确定性、数据包丢失和动态交换条件下，AQC因其严格的李雅普诺夫稳定性保证，始终展现出卓越的鲁棒性。这些结果凸显了数据驱动性能与基于模型的鲁棒性之间的权衡，并为强化学习和自适应控制在网络化不确定系统的适用性提供了见解。

TRIAGE: Role-Typed Credit Assignment for Agentic Reinforcement Learning

TRIAGE：面向智能体强化学习的角色类型化信用分配

Authors: Yuanda Xu, Zhengze Zhou, Hejian Sang, Xiaomin Li, Jiaxin Zhang, Xinchen Du, Zhipeng Wang, Alborz Geramifard
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.32017
Pdf link: https://arxiv.org/pdf/2606.32017
Abstract Agentic reinforcement learning requires assigning credit to environment-facing actions such as searches, clicks, edits, navigation commands, and object interactions. Standard GRPO uses the final verifier outcome as a uniform advantage over all action tokens. This outcome signal is useful but structurally incomplete: it punishes useful exploration in failed rollouts and reinforces redundant or regressive actions in successful rollouts. We propose TRIAGE, a role-typed credit assignment framework that adds a semantic role axis to outcome credit. A structured judge classifies each segment as decisive progress, useful exploration, no-progress infrastructure, or regression, and a fixed role-conditioned rule maps these labels to bounded segment-level process rewards. This keeps verifier outcomes as the source of optimization direction while correcting the two main blind spots of outcome-only credit. We further show that role-conditioned credit is the optimal segment-level correction expressible from role labels alone -- a projection of the per-segment advantage residual onto the role variable -- so that the fixed role constants reduce advantage estimation error whenever the judge is reliable, and we connect this to lower-variance policy gradients. Across ALFWorld, Search-QA, and WebShop, TRIAGE improves success rates over GRPO for two policy models and outperforms both a scalar judge-derived process reward and an outcome-supervised shared-backbone value baseline. Ablations show that the gain comes from role typing rather than merely adding dense rewards: reliable detection of regression inside successful trajectories is the dominant contributor, while exploration credit provides a consistent secondary gain; on completed ALFWorld and WebShop rollouts, TRIAGE also reduces environment-facing turns by an additional $10.4\%$ and $14.8\%$ relative to GRPO.
中文摘要 智能体强化学习需要将信用分配给面向环境的动作，如搜索、点击、编辑、导航命令和对象交互。标准GRPO将最终验证器结果作为所有动作 token 的统一优势。该结果信号有用但在结构上不完整：它惩罚失败 rollout 中的有用探索，并强化成功 rollout 中的冗余或倒退动作。我们提出了TRIAGE，一种角色类型化的信用分配框架，为结果信用增加了语义角色维度。结构化评判器将每个片段分类为决定性进展、有用探索、无进展的基础操作或倒退，固定的角色条件规则将这些标签映射为有界的片段级过程奖励。这保留了验证器结果作为优化方向的来源，同时纠正了仅依赖结果的信用分配的两个主要盲点。我们进一步证明，角色条件化信用是仅凭角色标签可表达的最优片段级修正（即每个片段的优势残差在角色变量上的投影），因此只要评判器可靠，固定的角色常数就能减少优势估计误差，并且我们将此与更低方差的策略梯度联系起来。在ALFWorld、Search-QA和WebShop上，TRIAGE在两种策略模型上相对GRPO提升了成功率，并且优于由评判器导出的标量过程奖励和结果监督的共享骨干价值基线。消融实验显示，收益来自角色类型化，而非仅仅增加密集奖励：可靠检测成功轨迹中的倒退是主要贡献来源，而探索信用则提供稳定的次要收益；在已完成的ALFWorld和WebShop rollout 上，TRIAGE相对GRPO还将面向环境的回合数分别额外减少了 10.4% 和 14.8%。
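
The TRIAGE abstract above describes a fixed rule that maps a judge's role label for each segment to a bounded process reward, layered on top of the outcome-based advantage. The sketch below illustrates that shape of computation only. The four role names come from the abstract, but the numeric constants, the additive combination, and the names `ROLE_REWARD` and `segment_advantages` are hypothetical; the paper's actual rule is not given in the abstract.

```python
# Hypothetical bounded constants per role label (not the paper's values).
ROLE_REWARD = {
    "decisive_progress": 0.5,
    "useful_exploration": 0.2,
    "no_progress": 0.0,
    "regression": -0.5,
}


def segment_advantages(outcome_advantage, roles, weight=1.0):
    """Add a role-conditioned correction to a rollout-level outcome advantage.

    outcome_advantage: one scalar shared by every segment of the rollout.
    roles: judge-assigned role label for each segment, in order.
    """
    return [outcome_advantage + weight * ROLE_REWARD[role] for role in roles]


# A failed rollout (negative outcome advantage) that still contained useful exploration.
failed = segment_advantages(-1.0, ["useful_exploration", "regression"])
# A successful rollout that contained a regressive detour.
succeeded = segment_advantages(1.0, ["decisive_progress", "regression"])
print(failed, succeeded)
```

In this toy version the exploratory segment of the failed rollout is penalized less than its regressive segment, and the regressive segment of the successful rollout is rewarded less than its decisive one, which mirrors the two blind spots of outcome-only credit named in the abstract.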

Reinforcement Learning with Metacognitive Feedback Elicits Faithful Uncertainty Expression in LLMs

带有元认知反馈的强化学习在大型语言模型中激发忠实的不确定性表达

Authors: Gabrielle Kaili-May Liu, Avi Caciularu, Gal Yona, Idan Szpektor, Arman Cohan
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.32032
Pdf link: https://arxiv.org/pdf/2606.32032
Abstract Metacognition is a critical component of intelligence that describes the ability to monitor and regulate one's own cognitive processes. Yet LLMs exhibit systemic deficiencies in key metacognitive faculties: they hallucinate with high confidence, fail to recognize knowledge boundaries, and misrepresent their internal uncertainty--undermining trustworthiness and reliability. Since monitoring task performance and adapting behavior accordingly are central to metacognition, we posit that models capable of accurately judging their own performance are better positioned to improve it. We operationalize this idea via two novel mechanisms: reinforcement learning with metacognitive feedback (RLMF), a paradigm to refine completion rankings during preference optimization based on the quality of a model's self-judgments of performance, and metacognitive data selection, which uses similar self-judgments to identify high-value training examples, outperforming naive active learning. We apply these innovations to the problem of faithful calibration (FC), a task that is itself fundamentally metacognitive: the goal is to align expressed with intrinsic uncertainty, difficult even for frontier LLMs. We adopt a two-stage, decoupled approach, first using these methods to calibrate the faithfulness of models' self-reported confidence scores, then mapping to natural, context-adaptable linguistic uncertainty via targeted output editing. Extensive experiments show RLMF achieves generalizable, state-of-the-art FC on diverse tasks while preserving accuracy. Further, RLMF surpasses standard RL by up to 63% while enhancing models' ability to assess and express their own capability limits. This positions RLMF as a promising paradigm to enhance LLM metacognition toward improved abilities and alignment, and suggests metacognitive performance as an effective RL signal to overcome limits of prior intrinsic feedback methods.
中文摘要 元认知是智能的关键组成部分，描述了监控和调节自身认知过程的能力。然而，LLMs在关键元认知能力方面存在系统性缺陷：它们会高度自信地产生幻觉，无法识别知识边界，并误导自身的内部不确定性——削弱了可信度和可靠性。由于监控任务表现并相应调整行为是元认知的核心，我们认为能够准确判断自身表现的模型更有能力提升自身表现。我们通过两种新机制实现这一理念：带元认知反馈的强化学习（RLMF），这是一种基于模型自我判断表现质量，在偏好优化过程中优化完成排名的范式;以及元认知数据选择，利用类似的自我判断识别高价值训练示例，表现优于朴实主动学习。我们将这些创新应用于忠实校准（FC）问题，这一任务本身就具有元认知性：目标是与内在不确定性保持一致，这对前沿大型语言模型来说也很难做到。我们采用两阶段解耦方法，首先用这些方法校准模型自报信心分数的忠实度，然后通过有针对性的输出编辑，映射到自然且可根据上下文适应的语言不确定性。大量实验表明，RLMF能够在不同任务中实现通用且最先进的FC，同时保持准确性。此外，RLMF比标准强化学习高出多达63%，同时增强模型评估和表达自身能力极限的能力。这使RLMF成为一种有前景的范式，旨在提升LLM元认知以提升能力和对齐，并建议元认知表现作为克服既有内在反馈方法局限的有效强化学习信号。

Keyword: diffusion policy

From Grasps to Dexterity: Large-Scale Grasp Pretraining for Dexterous Manipulation

从抓握到灵巧：大规模抓握预训练以提升灵巧操作

Authors: Ying Yuan, Xinyu Liu, Sriram Krishna, David Held
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.30749
Pdf link: https://arxiv.org/pdf/2606.30749
Abstract Large-scale dexterous grasp datasets encode rich priors over hand-object interaction, but their use has largely been confined to grasp generation and pick-and-place manipulation. We study whether such data can instead support functional dexterity in articulated tool use, where a robot must acquire a tool, maintain contact, and operate its functional moving parts. We adapt a hierarchical imitation learning framework that combines high-level hand sub-goal prediction with a low-level goal-conditioned controller. We construct a 355k-trajectory grasp-pretraining dataset from large-scale dexterous grasp annotations and use it to pretrain the low-level controller. The controller is then fine-tuned on downstream task demonstrations. To evaluate this setting, we introduce DexCraft, a simulation benchmark with six articulated tool-use tasks requiring coordinated finger motion. Across simulation and real-world experiments, our approach outperforms end-to-end diffusion policy baselines and hierarchical policies trained from scratch. In the real world, it improves full-task success by 33.3 percentage points over DP3. These results show that grasp datasets can serve not only as resources for grasp synthesis, but also as scalable pretraining data for contact-rich dexterous manipulation. Videos are shown on this https URL .
中文摘要 大规模灵巧抓握数据集编码了关于手与物体交互的丰富先验，但其应用主要局限于抓握生成和抓取放置（pick-and-place）操作。我们研究这些数据是否能支持关节式工具使用中的功能性灵巧操作，在这类任务中，机器人必须拿取工具、保持接触并操作其功能性活动部件。我们改造了一个分层模仿学习框架，将高层次手部子目标预测与低层次目标条件控制器相结合。我们从大规模灵巧抓握标注构建了一个包含355k条轨迹的抓握预训练数据集，并用它预训练低层次控制器。随后，控制器在下游任务演示上进行微调。为了评估这一设定，我们引入了DexCraft，一个模拟基准测试，包含六个需要协调手指动作的关节式工具使用任务。在模拟和现实实验中，我们的方法优于端到端扩散策略基线和从零训练的分层策略。在现实中，它比DP3提高了33.3个百分点的全任务成功率。这些结果表明，抓握数据集不仅可以作为抓握合成的资源，还能作为可扩展的预训练数据，用于接触丰富的灵巧操作。视频可在该 https URL 上观看。