Arxiv Papers of Today

Generated: 2026-06-09 19:21:11 (UTC+8); arXiv release time: 2026-06-09 20:00 EDT (2026-06-10 08:00 UTC+8)

81 relevant papers today

Keyword: reinforcement learning

TinyJudge: Unverifiable Constraint Alignment via Lightweight Specialist Ensembles

Authors: Yirong Zeng, Yufei Liu, Xiao Ding, Yutai Hou, Yuxian Wang, Wu Ning, Haonan Song, Dandan Tu, Qixun Zhang, Yuxiang He, Bibo Cai, Ting Liu
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.07520
Pdf link: https://arxiv.org/pdf/2606.07520
Abstract Instruction Following (IF) is a core capability of LLMs, requiring strict adherence to diverse constraints, ranging from verifiable ones (e.g., output length) to unverifiable ones (e.g., tone). Reinforcement learning with verifiable rewards has emerged as a paradigm for IF tasks, leveraging LLM-as-a-judge to assess unverifiable constraints. However, we empirically find that this approach remains a significant bottleneck, suffering from severe reward hacking and higher computational overhead. In this work, we first analyze the generalization capabilities of unverifiable constraints and discover that specific constraints exhibit distinct, high-generalization patterns. Motivated by this, we propose TinyJudge, a framework that employs an ensemble of specialized tiny language models ($\sim0.6B$) to provide rewards for soft constraints. By distilling expertise from frontier models into these tiny models, it achieves high-precision, lightweight evaluation. Extensive evaluations across five benchmarks demonstrate that TinyJudge outperforms the baselines by $\sim10\%$ in average performance and $12\%$ in reward precision. Crucially, it also achieves a $3\times$ speedup in total training time. Our work provides a scalable and robust path for aligning LLMs with unverifiable human instructions.

Offline Reinforcement Learning for Plasma Control in Nuclear Fusion: Codebase and Benchmark

Authors: Yang Fu, Haomin Bao, Rohit Sonker, Xiaoyan Hu, Aravind Venugopal, Jeff Schneider, Jiayu Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.07550
Pdf link: https://arxiv.org/pdf/2606.07550
Abstract Offline reinforcement learning (RL) offers a promising route for developing plasma controllers from historical tokamak data, since online trial-and-error on real devices is costly and risky. However, progress in this direction remains difficult to measure due to the lack of a standardized offline RL benchmark for realistic multi-actuator, long-horizon plasma control problems in nuclear fusion. We introduce RL4F, an Offline Reinforcement Learning Benchmark for Plasma Control in Nuclear Fusion, providing closed-loop evaluation environments and baseline comparisons across four full-profile tracking tasks: rotation, density, temperature, and pressure. The dynamics function underlying the evaluation environment is built from historical discharge data from DIII-D, a real-world Tokamak. We evaluate a broad set of imitation learning and offline RL baselines under a unified protocol. We find that offline model-based RL methods obtain the best average performance on most objectives, although no single method dominates all tasks, highlighting the importance of dynamics modeling in complex, long-horizon plasma control tasks. To foster further research, we open-source the codebase, datasets, and evaluation framework, providing a benchmark not only for the fusion community but also for algorithm development in offline RL.

Outage Detection in Self-Healing Smart Grids Using Reinforcement Learning with Spectral Graph Neural Networks

Authors: Lihui Liu, Mucun Sun, Caisheng Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.07583
Pdf link: https://arxiv.org/pdf/2606.07583
Abstract Self-healing smart grids can quickly adjust their network configuration during outages to minimize power disruptions. During an outage, several actions can be taken, such as network reconfiguration through switching operations and emergency load shedding. However, traditional machine learning methods for outage mitigation are not well suited for smart grids due to their slow response time and high computational cost. To address these challenges, recent studies have explored reinforcement learning to automatically perform network reconfiguration. In these approaches, the control policy is typically modeled using a graph neural network (GNN). However, conventional GNNs operate in the spatial domain and may fail to capture important relationships in the frequency domain. Frequency-domain information is particularly useful for modeling global structural patterns and system-wide interactions in power networks. In this paper, we propose a spectral graph reinforcement learning framework for outage management in distribution networks to enhance system resilience. Our model learns the optimal power restoration policy using a spectral graph neural network. We evaluate the proposed method on three modified IEEE test systems: the 13-bus, 34-bus, and 123-bus networks. Experimental results show that our approach achieves near-optimal performance in real time and generalizes well across a wide range of outage scenarios.

UNIQ: Conformal Calibration for Adaptive Conservatism in Offline Reinforcement Learning

Authors: Aditya Upadhyay
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.07592
Pdf link: https://arxiv.org/pdf/2606.07592
Abstract Offline reinforcement learning requires careful conservatism to mitigate distribution shift, yet most existing methods apply a fixed penalty uniformly across all states regardless of local data coverage. We present UNIQ (Uncertainty-Informed Quantile), an offline RL method that introduces state-adaptive conservatism through conformally calibrated uncertainty estimation. Built on the Implicit Q-Learning (IQL) backbone, UNIQ trains a multi-expectile value ensemble, computes distribution-free uncertainty estimates using split conformal prediction, and maps the resulting signal to a state-dependent expectile that relaxes conservatism in well-covered regions while strengthening it in uncertain regions near the data frontier. On D4RL MuJoCo benchmarks, UNIQ consistently improves over IQL, with the largest gains observed on Walker2d and replay-heavy tasks. At the same time, UNIQ operates at near-IQL memory cost (approximately 250 MB peak VRAM), providing roughly a 10x reduction compared to EDAC. Rather than pursuing overall state-of-the-art performance, we position UNIQ as a practical mechanism contribution that improves the performance-efficiency trade-off in offline reinforcement learning.
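
The pipeline described in the abstract (split conformal calibration, then a state-dependent expectile) can be illustrated roughly as follows. This is a minimal sketch, not the paper's implementation: the function names, the linear uncertainty-to-expectile map, and the bounds `tau_min` and `tau_max` are all assumptions made for illustration. It assumes, as in IQL, that a larger expectile is less conservative.

```python
import math


def conformal_quantile(scores, alpha):
    """Split conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points to certify this coverage level.
        return math.inf
    return sorted(scores)[k - 1]


def adaptive_expectile(uncertainty, scale, tau_min=0.5, tau_max=0.9):
    """Hypothetical linear map: low uncertainty -> tau_max, high uncertainty -> tau_min."""
    ratio = min(max(uncertainty / scale, 0.0), 1.0)
    return tau_max - (tau_max - tau_min) * ratio


def expectile_loss(diff, tau):
    """Asymmetric squared loss |tau - 1(diff < 0)| * diff**2, as used in IQL."""
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff * diff


# Illustrative use: held-out residuals calibrate the scale of "large" uncertainty.
calibration_residuals = [0.1, 0.4, 0.2, 0.3, 0.5, 0.9, 0.7, 0.6, 0.8]
scale = conformal_quantile(calibration_residuals, alpha=0.5)
tau_for_state = adaptive_expectile(uncertainty=0.25, scale=scale)
```

A state whose uncertainty reaches the calibrated scale falls back to `tau_min`, while a well-covered state keeps an expectile near `tau_max`.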

Sample-Efficient Post-Training for LEGO Spatial-Physics Reasoning

Authors: Yuhuan Yuan, Zhouliang Yu, Minghao Liu, Weiyang Liu, Ge Lin Kan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.07602
Pdf link: https://arxiv.org/pdf/2606.07602
Abstract LLM-based LEGO assembly generation requires both semantic grounding and physical feasibility. We identify a data-induced failure mode, PhysHack, in which the assemblies satisfy physical-validity constraints while producing structures that are geometrically misaligned, semantically inconsistent, or poorly calibrated. To address this challenge, we propose a model-based data selection approach that uses only a small fraction of the training data while improving physically grounded LEGO assembly generation. Building on the selected trajectories, we introduce PVPO, a sample-efficient reinforcement learning method that couples physical feasibility with voxel-space geometric rewards. Our results show that physical validity alone is an insufficient proxy for reliable physical reasoning: models can learn to generate valid structures without preserving semantic or geometric fidelity. Experiments across model backbones and test-time scaling settings demonstrate that PVPO improves structural and semantic alignment, physical validity, structural stability, and calibration, while reducing reliance on extensive post-hoc rejection sampling. In particular, results on calibration show that PVPO mitigates PhysHack by making test-time selection more predictive of semantic and structural quality.

SAW: Stage-Aware Dynamic Weighting for Multi-Objective Reinforcement Learning in Large Language Models

Authors: Yuchen He, Baolong Bi, Shenghua Liu, Huaming Liao, Yuyao Ge, Bolin Wan, Siqian Tong, Juan Chen, Jiafeng Guo, Xueqi Cheng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.07705
Pdf link: https://arxiv.org/pdf/2606.07705
Abstract Although multi-objective reinforcement learning (MORL) is central to aligning large language models with complex human preferences, the prevailing practice of static weighted summation overlooks a more fundamental phenomenon: reward learning is markedly asynchronous across objectives. Well-learned dimensions quickly produce homogeneous, low-variance signals whose residual noise contaminates the aggregated reward (in GRPO) or occupies a fixed share of the advantage budget (in GDPO), interfering with the scarce yet high-value signals carried by under-learned dimensions. To address this asynchrony, we propose Stage-Aware Dynamic Weighting (SAW), a lightweight, algorithm-agnostic dynamic weighting mechanism. SAW utilizes the coefficient of variation (CV) as a scale-invariant proxy for real-time informativeness, reweighting each dimension's reward or advantage contribution by its relative informativeness within the batch. Unlike gradient-based methods that require multiple forward and backward passes, SAW relies solely on batch-level statistics, introducing nearly negligible computational overhead. Experiments on tool-calling and text summarization tasks demonstrate that SAW consistently improves both training efficiency and final performance under both GRPO and GDPO frameworks, confirming it as a general-purpose plug-in for multi-reward LLM alignment. Our code is available at this https URL
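
To make the coefficient-of-variation idea concrete, here is a minimal sketch of batch-level reweighting. It is not the paper's code: the function names, the `eps` stabilizer, and the choice to normalize the CVs so the weights sum to one are assumptions for illustration only.

```python
import statistics


def coefficient_of_variation(values, eps=1e-8):
    """Population standard deviation divided by |mean|; eps guards against a zero mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return std / (abs(mean) + eps)


def cv_weights(rewards_by_dim):
    """Weight each reward dimension by its share of the total CV in the batch."""
    cvs = {name: coefficient_of_variation(vals) for name, vals in rewards_by_dim.items()}
    total = sum(cvs.values())
    if total == 0.0:
        # Every dimension is constant in this batch: fall back to uniform weights.
        return {name: 1.0 / len(cvs) for name in cvs}
    return {name: value / total for name, value in cvs.items()}


# Hypothetical batch: "format" is already saturated, "accuracy" is still informative.
batch = {
    "format": [1.0, 1.0, 1.0, 1.0],
    "accuracy": [0.0, 1.0, 0.0, 1.0],
}
weights = cv_weights(batch)
```

In this toy batch the saturated `format` dimension has zero variance and receives zero weight, so the whole weight goes to `accuracy`. Only batch statistics are needed, with no extra forward or backward passes.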

Belief-Space Quantum-Inspired Reinforcement Learning for Partially Observable Autonomous Cyber Defense in the Internet of Vehicles

Authors: Anwar Shah, Rohan Farooq, Sajid Anwer, Tallha Akram, Usman Ghous, Sajid Ullah Khan
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2606.07796
Pdf link: https://arxiv.org/pdf/2606.07796
Abstract The Internet of Vehicles (IoV) faces a dynamic, adversarial security environment where attackers adapt to defenses. Existing intrusion detection systems rely on static classifiers that fail to capture sequential decision-making, attacker adaptation, and uncertainty. We formulate IoV security as a sequential attacker-defender interaction and model defense as a reinforcement learning problem under partial observability. We propose Quantum Belief-Integrated Reinforcement Defense (Q-BIRD), using quantum-inspired belief representation to encode defender uncertainty about hidden attacker intent via amplitude-based states, enabling non-Bayesian belief evolution. Integrated into a Proximal Policy Optimization (PPO) defender, Q-BIRD selects cost-aware mitigation actions. In simulated environments with adaptive, probing attackers, Q-BIRD reduced cumulative mean damage, damage variance, and attack success rate (ASR) by 60.4%, 90.2%, and 50.0%, respectively, while increasing survival probability by 46.4%. Compared to classical Bayesian PPO, damage variance reduction and ASR improved by 10.2 times and 50%. Ablation and explainability analyses confirm that amplitude-based belief is the primary decision signal during strategy transitions when classical belief collapses, providing superior IoV security without additional hardware.

Quantum-Inspired Reinforcement Learning for Low-Latency Intrusion Detection in V2X and Internet-of-Vehicles Networks

Authors: Sajid Anwer, Rohan Farooq, Anwar Shah, Tallha Akram
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2606.07804
Pdf link: https://arxiv.org/pdf/2606.07804
Abstract Smart cities increasingly depend on dense edge, IoT, and vehicular networks to deliver critical urban services, including traffic control, connected mobility, infrastructure monitoring, and energy management. In this ecosystem, the Internet of Vehicles (IoV) is central to intelligent transportation, enabling continuous communication among vehicles, roadside infrastructure, and cloud-edge platforms. This connectivity, however, also enlarges the attack surface and exposes smart city and vehicular systems to evolving cyber threats that can compromise safety, privacy, data integrity, and service continuity. Conventional static defenses are often inadequate because they cannot autonomously adapt to changing attack behaviors or multi-stage intrusion patterns. This paper proposes QIRL, a Quantum-Inspired Reinforcement Learning framework built on a lightweight Deep Q-Network architecture for next-generation autonomous cyber defense. QIRL combines amplitude-phase quantum state encoding, rotation-gate-based exploration, and quantum interference reward augmentation within a cost-sensitive Markov Decision Process formulation. It further addresses class imbalance through training-only SMOTE balancing and asymmetric cost-sensitive reward shaping, while sequential MDP modeling captures temporal dependencies in multi-stage attack campaigns. The framework is evaluated on CICIDS2017 and UNSW-NB15. QIRL achieves accuracies of 97.89% and 91.04%, F1-scores of 95.22% and 91.66%, AUC-ROC values of 0.9945 and 0.9713, and True Skill Statistics of 0.9443 and 0.8244, respectively. It also attains ultra-low inference latencies of 32.5 and 45.7 microseconds per sample, corresponding to 67.77 times and 51.77 times speedups over ensemble baselines. These results show that QIRL offers a lightweight, latency-aware, and adaptive defense for smart city and IoV infrastructures.
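
For readers unfamiliar with the True Skill Statistic reported above, it is sensitivity plus specificity minus one. The sketch below computes it for a binary attack/benign split. The function name and the example counts are illustrative assumptions, and the abstract does not say how the paper aggregates the metric across attack classes.

```python
def true_skill_statistic(tp, fn, tn, fp):
    """TSS = sensitivity + specificity - 1, ranging from -1 to 1 (0 means no skill)."""
    sensitivity = tp / (tp + fn)  # fraction of attacks that are detected
    specificity = tn / (tn + fp)  # fraction of benign samples that are passed
    return sensitivity + specificity - 1.0


# Hypothetical confusion-matrix counts, not taken from the paper.
example_tss = true_skill_statistic(tp=90, fn=10, tn=80, fp=20)
```

Unlike accuracy, the statistic does not change with the attack-to-benign ratio, which is why it is often reported on imbalanced intrusion datasets.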

X-OP: Cross-Morphology Whole-Body Teleoperation via MPC Retargeting

Authors: Jen-Wei Wang, Sarthak Kaingade, Andrea Tagliabue, Nicholas Morozovsky
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.07934
Pdf link: https://arxiv.org/pdf/2606.07934
Abstract Whole-body teleoperation is essential for scalable robot data collection in loco-manipulation tasks, yet existing approaches relying on exoskeleton suits or multi-camera setups impose prohibitive cost, complexity, and environmental constraints. Recent methods using a single extended reality (XR) device with end-to-end reinforcement learning policies partially address these limitations but require robot-specific retraining, suffer from out-of-distribution failures, and rely on motion retargeting that neglects dynamic feasibility. We propose a hierarchical whole-body teleoperation framework driven by a single XR device that generalizes across diverse robot morphologies without retraining robot-specific policies. A Model Predictive Control (MPC)-based motion retargeter jointly optimizes alignment with the operator's intent and the robot's dynamic feasibility, generating optimal commands for existing low-level controllers. To ensure robust online execution, we introduce a state synchronization method that resets the simulator state at each MPC step to handle noisy real-world measurements and contact sensitivity, and integrate SLAM-based global pose feedback to mitigate long-term drift. Simulation results show higher success rates on whole-body control tasks for both a humanoid (over 30% lower completion time and 20% lower power consumption) and a mobile manipulator (zero collisions) compared to baselines. Real-world experiments further validate the effectiveness and flexibility of our method, demonstrating the successful deployment of the proposed retargeter on both platforms for whole-body control tasks and the ease of allowing users to adjust teleoperation behavior based on their preferences. This plug-and-play framework offers a scalable, morphology-agnostic solution for whole-body robot teleoperation, enabling real-time behavioral customization and broad applicability across platforms.

Customer-Agent: Overcoming Context Limitations in Ultra-Long Shopping Trajectories via Tool-Augmented Agents and RLVR

Authors: Hongye Liu, Rongmei Lin, Anurag Kashyap, Hejie Cui, Ricardo Henao, Besnik Fetahu, Bing Yin
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.07995
Pdf link: https://arxiv.org/pdf/2606.07995
Abstract Understanding customer shopping trajectories is essential for enabling personalized shopping experiences. However, shopping records (i.e., customer's search, clicks, purchases, etc.) often span long time horizons over multiple years, resulting in extremely long trajectories that pose significant challenges for existing large language models (LLMs). Despite the importance of this problem, existing benchmarks are limited to short customer trajectories, while real-world trajectories from large e-commerce platforms are rarely accessible due to data privacy constraints. To address this gap, we introduce ShopTrajQA, a long-context evaluation benchmark constructed from real-world product information and simulated shopping trajectories. The dataset includes variants of up to 32k and 64k tokens, enabling systematic evaluation of model robustness under varying context lengths. Through comprehensive benchmarking of frontier LLMs, we identify critical performance gaps in reasoning over long shopping trajectory data. To address these challenges, we propose a Customer Agent Framework for ultra-long context management. Leveraging a Reinforcement Learning with Verifiable Rewards (RLVR) agentic training paradigm, our approach stores trajectories as external local files and trains the agent to autonomously retrieve and parse them through code-interpreter interactions (e.g., SQL queries), effectively bypassing the fixed in-context window constraints of LLMs. Experimental results demonstrate that our framework achieves strong performance for ShopTrajQA and shows generalization to other complex reasoning tasks.

Rewrite to Translate, Translate to Reward: Reinforcement Learning for Source Rewriting in Machine Translation

Authors: Boxuan Lyu, Haiyue Song, Zhi Qu, Hidetaka Kamigaito, Kotaro Funakoshi, Manabu Okumura
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.08011
Pdf link: https://arxiv.org/pdf/2606.08011
Abstract Although directly prompting off-the-shelf Large Language Models (LLMs) to generate meaning-preserving source rewrites can effectively enhance Machine Translation (MT) quality, doing so requires manually tuning prompts for different MT models. In this work, we propose RLSR (Reinforcement Learning for Source Rewriting), a novel RL-based framework for training a source rewriting model without tuning prompts for each MT model. RLSR optimizes the rewriting model by directly using the improvement in downstream translation quality yielded by each rewritten source as the reward. Extensive experiments across six MT models and 16 language pairs demonstrate that our 4B rewriting models trained via RLSR significantly outperform the no-rewriting baseline and existing same-scale prompt-based rewriting baselines, while achieving competitive performance against prompt-based baselines based on the 235B LLM.
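
The reward described in the abstract, the gain in downstream translation quality from a rewritten source, can be sketched as a simple difference. Everything named here is a placeholder: `translate` and `score` stand in for an MT model and a quality metric, and scoring both translations against the original source is an assumption rather than something the abstract specifies.

```python
def rewrite_reward(source, rewrite, translate, score):
    """Reward = quality of the translated rewrite minus quality of the translated source."""
    baseline = score(source, translate(source))
    improved = score(source, translate(rewrite))
    return improved - baseline


# Toy stand-ins so the sketch runs; a real setup would call an MT system
# and a learned quality-estimation metric instead.
def toy_translate(text):
    return text.upper()


def toy_score(source, hypothesis):
    return float(len(hypothesis.split()))


reward = rewrite_reward("a b", "a b c", toy_translate, toy_score)
```

A positive reward means the rewrite helped this particular MT model, so the signal is tied to the model itself and no per-model prompt tuning is needed.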

Q-VGM: Q-Guided Value-Gradient Matching for Flow-Matching VLA Policies

Authors: Ziqian Wang, Jiayu Sun, Xingjian Mao, Minqian Wang, Yao Mu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.08015
Pdf link: https://arxiv.org/pdf/2606.08015
Abstract We propose Q-Guided Value-Gradient Matching (Q-VGM), an off-policy reinforcement learning (RL) method that tackles a long-standing challenge in fine-tuning flow-matching vision-language-action (VLA) policies: efficiently improving an expressive flow-matching action expert with respect to a learned Q-function. Effective improvement must exploit the first-order (gradient) information of the critic, but this is difficult for flow policies, because directly back-propagating the value through their multi-step denoising process is numerically unstable at VLA scale, while the tractable action likelihoods required by policy-gradient methods are unavailable under iterative denoising. Existing value-based methods either backpropagate through the full denoising chain, use the critic only at test time without updating the policy, or distill critic-improved actions as terminal labels without supervising the velocity field. Q-VGM sidesteps these issues by leveraging VGG-Flow, a value-gradient view of flow alignment in generative modeling that transforms value gradient into a denoising-time value-gradient field rather than an unstable end-to-end objective. This requires no action likelihoods and no backpropagation through the denoising chain, and operates on a fixed replay buffer. The critic is an action-sensitive Cal-QL ensemble over compact RLT features with per-layer action injection. Q-VGM enables a practical few-shot initialization then learn-from-experience paradigm: starting from a few-shot-SFT pi0.5 VLA, the method leverages self-generated rollout data to substantially improve task performance without additional expert supervision. On LIBERO, Q-VGM raises the average success rate from 75.0% to 92.5%; on RoboTwin 2.0, from 76.4% to 87.2%; and on two real-robot tabletop tasks, from 40.0% to 67.5%, outperforming all same-backbone, same-critic baselines across all three settings.

DyCo-RL: Dynamic Cross-Modal Coordination for Visual Reasoning

Authors: Hangui Lin, Yan Shu, Zhengyang Liang, Chi Liu, Xiangrui Liu, Minghao Qin, Teng Long, Zheng Liu, Nicu Sebe
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.08035
Pdf link: https://arxiv.org/pdf/2606.08035
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a leading paradigm for enhancing visual reasoning in Multimodal Large Language Models (MLLMs). However, existing RLVR methods optimize primarily for the reasoning outcome, fundamentally overlooking the fine-grained cross-modal coordination required during the generation process. Through token-level analyses and controlled interventions, we reveal that during Chain-of-Thought (CoT) reasoning, MLLMs frequently fail to dynamically alternate between extracting visual evidence and synthesizing textual context, a coordination breakdown that is causally linked to reasoning failures. Motivated by these findings, we propose DyCo-RL, which integrates dynamic cross-modal coordination into RLVR optimization. Specifically, DyCo-RL uses the Fisher-Rao geodesic distance to measure within-modality attention shifts, assigning tokens to either visually-oriented or text-oriented functional roles. It then evaluates the alignment between a token's actual attention allocation and its assigned role, leveraging this score for alignment-guided advantage reweighting during policy optimization. Extensive experiments demonstrate that the algorithm-agnostic DyCo-RL, when applied to Qwen2.5-VL-3B/7B, consistently improves four representative RLVR algorithms across seven benchmarks spanning visual-centric and mathematical reasoning.
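
For two discrete distributions p and q, the Fisher-Rao geodesic distance is 2 * arccos(sum_i sqrt(p_i * q_i)). The sketch below evaluates it on two normalized attention vectors. How DyCo-RL selects and normalizes the attention it compares is not stated in the abstract, so the inputs and the function name here are illustrative assumptions.

```python
import math


def fisher_rao_distance(p, q):
    """Geodesic distance between two discrete distributions on the probability simplex."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Bhattacharyya coefficient, clamped to guard against rounding just outside [0, 1].
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    bc = min(max(bc, 0.0), 1.0)
    return 2.0 * math.acos(bc)


# Hypothetical attention over four image patches at two consecutive tokens.
attention_prev = [0.7, 0.1, 0.1, 0.1]
attention_next = [0.1, 0.1, 0.1, 0.7]
shift = fisher_rao_distance(attention_prev, attention_next)
```

The distance is 0 for identical distributions and reaches its maximum of pi when the two distributions have disjoint support, which gives a bounded measure of how far attention moved.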

MuJoCo-Drones-Gym: A GPU-Accelerated Multi-Drone Simulator for Control and Reinforcement Learning

Authors: Manan Tayal
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.08039
Pdf link: https://arxiv.org/pdf/2606.08039
Abstract Robotic simulators are a cornerstone of modern research in aerial robotics, serving both as a vehicle for the development of new control algorithms and as the data source for training reinforcement learning (RL) policies. Yet, existing quadcopter learning environments often face a trade-off between physical fidelity, multi-agent support, and the throughput required by modern deep RL pipelines. In this paper, we present MuJoCo-Drones-Gym, an open-source Gymnasium-compatible multi-drone environment built on top of the MuJoCo physics engine. MuJoCo-Drones-Gym supports an arbitrary number of Bitcraze Crazyflie 2.x nano-quadcopters and exposes a modular API for selecting (i) the physics model (rigid-body MuJoCo, explicit Python dynamics, or any subset of ground effect, blade drag, and inter-drone downwash), (ii) the action interface (per-motor RPMs, collective normalized thrust, velocity setpoints, or PID waypoint commands), and (iii) the observation space (kinematic state vectors, RGB / depth / segmentation cameras, or neighbourhood adjacency information). A PettingZoo ParallelEnv wrapper enables drop-in multi-agent reinforcement learning, while a suite of seven task environments (hover, velocity tracking, multi-drone hover, waypoint navigation, formation flight, gate racing, and a generic multi-agent template) demonstrates the breadth of the interface. We describe the environment design, the underlying physics and quadcopter dynamics, and illustrate its use through control and learning examples that mirror those of the closely related gym-pybullet-drones project, while taking advantage of MuJoCo's improved contact handling, rendering, and parallelizability.

Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?

Robust-U1：MLLM能否自我恢复损坏的视觉内容以实现稳健理解？

Authors: Jiaqi Tang, Jianmin Chen, Youyang Zhai, Wei Wei, Runtao Liu, Mengjie Zhao, Xiangyu Wu, Qingfa Xiao, Qifeng Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.08063
Pdf link: https://arxiv.org/pdf/2606.08063
Abstract Multimodal Large Language Models (MLLMs) have demonstrated remarkable success in visual understanding, yet their performance degrades significantly under real-world visual corruptions. While existing robustness enhancement approaches exist, they are limited: black-box feature alignment lacks interpretability, and white-box text-based reasoning cannot restore lost pixel-level details. This work investigates a fundamental research question: Can MLLMs recover corrupted visual content by themselves? To address this, we propose Robust-U1, a novel framework that equips MLLMs with explicit visual self-recovery capability for robust understanding. The approach comprises three core stages: supervised fine-tuning for initial reconstruction, reinforcement learning with dual rewards (pixel-level SSIM and semantic-level CLIP similarity) for aligning high visual quality, and multimodal reasoning that jointly considers both the corrupted input and the recovered image. Extensive experiments demonstrate that Robust-U1 achieves state-of-the-art robustness on the real-world corruption benchmark and maintains superior performance under adversarial corruptions on general VQA benchmarks. Analysis confirms that high-quality visual recovery directly enhances reasoning performance, establishing self-recovery as a critical mechanism for robust visual understanding. The source code is available at this https URL.
中文摘要 多模态大型语言模型（MLLM）在视觉理解方面取得了显著成功，但其性能在真实世界的视觉损坏下显著下降。现有的鲁棒性增强方法都有局限：黑盒特征对齐缺乏可解释性，白盒文本推理无法恢复丢失的像素级细节。本研究探讨了一个基础性研究问题：MLLM能否自行恢复损坏的视觉内容？为此，我们提出了Robust-U1，一种新颖框架，赋予MLLM显式的视觉自我恢复能力，以实现鲁棒理解。该方法包括三个核心阶段：用于初始重建的监督微调、带有双重奖励（像素级SSIM和语义级CLIP相似度）以对齐高视觉质量的强化学习，以及同时考虑损坏输入和恢复图像的多模态推理。大量实验表明，Robust-U1在真实世界损坏基准上达到了最先进的鲁棒性，并在通用VQA基准的对抗性损坏下保持了优异表现。分析证实，高质量的视觉恢复直接提升了推理性能，确立了自我恢复作为鲁棒视觉理解的关键机制。源代码可在此 https URL 获取。

Cooperative Long Rope Skipping via Multi-Agent Reinforcement Learning

通过多智能体强化学习实现协作式长绳跳绳

Authors: Zihao Wang, Shijie Peng, Kerui Wu, Yu Huang, Ruiqi Xue, Dong Liu, Tian Xu, Lei Yuan, Yang Yu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.08064
Pdf link: https://arxiv.org/pdf/2606.08064
Abstract Humans exhibit remarkable motor agility, enabling a wide range of dynamic skills such as running and jumping, which highlights the great potential of humanoid robots for athletic locomotion. Among athletic sports, long rope skipping requires two rope turners to cooperatively swing the rope while adapting to a player under different jumping rhythms, making it a meaningful yet challenging task for humanoid robots. Although existing methods for humanoid sports have achieved success in single-agent and interaction-free settings, such as running, dancing, and parkour, task scenarios that require precise coordination among multiple participants remain largely unexplored. To this end, we propose Marope, a multi-agent reinforcement learning (MARL) framework for cooperative long rope skipping with multiple humanoid robots. Specifically, Marope adopts a hierarchical reinforcement learning framework for policy training. At the lower level, it learns decentralized rope manipulation policies through MARL, while at the upper level, a centralized scheduling policy is trained to coordinate the execution of the lower-level policies. To improve generalization across different player behavioral styles, Marope further incorporates diverse jumping policies into cooperative game training. We evaluate our approach on Unitree G1 humanoid robots in both simulation and real-world settings. Experimental results demonstrate that Marope outperforms various baselines, achieving more efficient and stable rope manipulation as well as more robust and adaptable cooperation with varied players.
中文摘要 人类展现出卓越的运动灵活性，能够实现奔跑和跳跃等多种动态技能，这凸显了类人机器人在运动运动方面的巨大潜力。在田径运动中，长绳跳绳需要两名绳索转动者协同摆动绳索，同时适应不同跳跃节奏的玩家，这使得这对类人机器人来说既有意义又具有挑战性。尽管现有的人形运动方法在单代理且无交互的环境中取得成功，如跑步、舞蹈和跑酷，但需要多参与者精确协调的任务场景仍大多未被探索。为此，我们提出了Marope，一个多智能体强化学习（MARL）框架，用于多类人机器人协作跳绳。具体来说，Marope采用了分层强化学习框架进行政策培训。在低层，它通过MARL学习去中心化的绳索操作策略，而在高层，则训练集中调度策略以协调低层策略的执行。为了提升不同玩家行为风格间的泛化，Marope 进一步将多样化的跳跃策略纳入合作游戏训练。我们评估了在模拟和现实环境中对Unitree G1人形机器人的方法。实验结果表明，Marope 优于各种基线，实现了更高效、更稳定的绳索操作，以及与多样玩家的更稳健和适应性强的合作。

ConSteer-RL: Steering Reasoning Capabilities in Large Language Models via Confidence-Aware Reinforcement Learning

ConSteer-RL：通过信心感知强化学习在大型语言模型中引导推理能力

Authors: Qing Miao, Yiming Zhao, Jing Yang, Chenxi Liu, Yuehai Chen, Yuewen Liu, Shaoyi Du, Badong Chen
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.08088
Pdf link: https://arxiv.org/pdf/2606.08088
Abstract Reinforcement Learning from Verifiable Rewards (RLVR) has recently become a key paradigm for improving the reasoning abilities of Large Language Models (LLMs), yet it remains limited by sparse binary rewards and its ignorance of model-internal uncertainty. In this paper, we propose ConSteer-RL, a simple yet effective framework that integrates token-level confidence signals derived from model log-probabilities into RLVR training. Specifically, building upon the Group Relative Policy Optimization (GRPO) framework, we construct a confidence-aware reward by aggregating per-token probabilities into a scalar confidence score and incorporating it into an awareness-based reward shaping mechanism that penalizes overconfident errors while reinforcing correct and confident reasoning. Experimental results demonstrate that ConSteer-RL consistently outperforms strong GRPO baselines, achieving average improvements of 2.3%-4.0% across different model scales.
中文摘要 可验证奖励强化学习（RLVR）近年来已成为提升大型语言模型（LLMs）推理能力的关键范式，但它仍受限于稀疏的二元奖励，并且忽略了模型内部的不确定性。本文提出了ConSteer-RL，这是一个简单但有效的框架，将由模型对数概率得出的词元级（token-level）置信度信号整合进RLVR训练中。具体来说，基于群体相对策略优化（GRPO）框架，我们将每个词元的概率聚合为标量置信度分数，并将其纳入基于感知的奖励塑造机制，在惩罚过度自信的错误的同时强化正确且自信的推理，从而构建了置信度感知奖励。实验结果显示，ConSteer-RL在不同模型规模下持续优于强GRPO基线，平均提升幅度为2.3%-4.0%。
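
The confidence-aware reward described in the abstract above can be illustrated with a minimal sketch. The abstract does not give the aggregation or shaping formulas, so everything here is an assumption chosen for illustration: the geometric-mean aggregation in `confidence_score`, the `alpha` coefficient, and the exact form of `shaped_reward` are hypothetical, not the paper's definitions. Only the group-wise standardization follows the usual GRPO recipe.

```python
import math
from statistics import mean, pstdev

def confidence_score(token_logprobs):
    """Aggregate per-token log-probabilities into one scalar in (0, 1].

    Assumption: geometric mean of token probabilities, i.e. exp(mean log p).
    """
    return math.exp(mean(token_logprobs))

def shaped_reward(is_correct, confidence, alpha=0.5):
    """Hypothetical shaping rule: confident correct answers earn a bonus,
    and wrong answers are penalized in proportion to their confidence."""
    if is_correct:
        return 1.0 + alpha * confidence
    return -alpha * confidence

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to the same prompt: (token log-probs, verifier verdict).
group = [
    ([-0.05, -0.10, -0.02], True),   # confident and correct
    ([-1.20, -0.90, -1.50], True),   # hesitant but correct
    ([-0.03, -0.04, -0.01], False),  # confident and wrong
    ([-1.80, -2.10, -1.40], False),  # hesitant and wrong
]
rewards = [shaped_reward(ok, confidence_score(lp)) for lp, ok in group]
advantages = group_relative_advantages(rewards)
```

Under this toy rule the confident wrong answer receives the lowest reward in the group, which is the behaviour the abstract describes as penalizing overconfident errors.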

Continual Quadruped Robots Coordination via Semantic Skill Discovery

通过语义技能发现实现四足机器人持续协调

Authors: Daoqing Wang, Yuchen Xiao, Weixuan Huang, Zhilong Zhang, Shenghua Wan, Meng Li, Lei Yuan, Yang Yu
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2606.08102
Pdf link: https://arxiv.org/pdf/2606.08102
Abstract Multi-quadruped coordination has attracted increasing attention due to its enhanced payload capacity, broader contact coverage, and improved adaptability to challenging tasks. Existing methods for multi-quadruped manipulation typically focus on predefined or closed task families, often relying on multi-agent reinforcement learning (MARL) to train task-specific coordination policies. However, such methods struggle in open-ended continual learning settings, where tasks arrive sequentially and robots are expected to acquire new coordination skills while reusing previously learned ones without catastrophic forgetting. To address this challenge, we propose Conquer, a semantic skill-library framework that formulates continual multi-quadruped coordination as a retrieve-adapt-update process. First, to accommodate varying team sizes across tasks, we design a team-structured Self-Allies-Goal (SAG) backbone that supports variable-cardinality robot teams by explicitly modeling each robot's own state, teammate context, and task goal. For each incoming task, Conquer constructs a task-level semantic descriptor from pre-execution information and retrieves a relevant skill from the library for adaptation. After successful execution, Conquer updates the skill library by extracting trajectory-level semantic descriptors and organizing them according to semantic distance, thereby enabling continual skill accumulation and cross-task knowledge transfer. Simulation experiments show that Conquer achieves a final average success rate of 95.6%, demonstrating strong forward transfer and negligible catastrophic forgetting. Real-world rollouts on Unitree Go2 teams further validate the deployment feasibility of Conquer for practical multi-quadruped coordination. Simulation and real-robot demonstration videos are available at: this https URL.
中文摘要 多四足协同因其更强的有效载荷能力、更广泛的接触覆盖以及对复杂任务的适应性增强而受到越来越多的关注。现有的多足四足操作方法通常聚焦于预定义或封闭的任务族，常依赖多智能体强化学习（MARL）来训练任务特定的协调策略。然而，在开放式持续学习环境中，这些方法存在困难，因为任务是顺序进行的，机器人需要在重复使用之前学过的协调技能的同时，获得新的协调技能，避免灾难性的遗忘。为应对这一挑战，我们提出了Conquer，这是一个语义技能库框架，将持续的多四足协调构建为检索-适应-更新过程。首先，为了适应不同任务团队规模，我们设计了一个团队结构化的自我盟友目标（SAG）骨干网，通过显式建模每个机器人的状态、队友上下文和任务目标，支持可变基数机器人团队。对于每个新任务，Conquer 从预执行信息构建任务级语义描述符，并从库中检索相关技能进行适配。成功执行后，Conquer 通过提取轨迹级语义描述符并按语义距离组织，更新技能库，从而实现技能的持续积累和跨任务知识转移。模拟实验显示，Conquer的最终平均成功率为95.6%，展现出强的前向转移和几乎可忽略的灾难性遗忘。Unitree Go2团队的实际部署进一步验证了征服者在多四足协作的实际部署可行性。模拟和真实机器人演示视频可在以下 https URL 观看。

Reinforcement learning in linear embedding space unlocks generalizable control across soft robot configurations

线性嵌入空间中的强化学习解锁了软机器人配置中的通用控制

Authors: Xinglong Zhang, Cong Li, Hangjie Mo, Yue Jiang, Xin Xu, Wei Jiang, Zhenshan Bing, Yihe Yang, Xiaojian Li, Yueneng Yang, Huimin Lu, Ling-li Zeng, Alois Knoll, Dewen Hu, Li Wen, Wei Pan
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.08104
Pdf link: https://arxiv.org/pdf/2606.08104
Abstract Soft-bodied organisms such as octopuses and elephant trunks exhibit remarkable morphological adaptability, dynamically reconfiguring body shape and stiffness, and flexibly adjusting their control strategies to enable versatile behaviors. Inspired by these biological systems, various soft robots have emerged in recent decades, featuring diverse materials, stiffnesses, and morphologies tailored to specific tasks. Despite substantial advances in the materials and structural designs of soft robots, developing a generalizable control framework capable of rapid adaptation across diverse configurations remains a long-standing challenge. Existing controllers are limited to fixed configurations, demanding laborious configuration-specific remodelling and policy redesign for new configurations. Here, we introduce a generalizable control system that enables rapid adaptation across diverse soft robot configurations via reinforcement learning in a shared linear Koopman embedding space. By encoding robot dynamics into this embedding space, our method decouples control policies from specific morphologies, allowing real-time, model-free policy adaptation across diverse configurations without retraining from scratch. We validate our system across 33 distinct robot configurations. Our system achieves a 75 times reduction in transfer samples across configurations, while sustaining robust performance under high-speed motion, heavy payloads, and multiactuator faults, and achieving real-world skills previously unattainable in soft robotics. This work establishes a unified and adaptable control paradigm for diverse soft robot configurations, bridging mechanical reconfigurability with control flexibility, and may offer broader insights for generalizable control in complex physical systems.
中文摘要 软体生物如章鱼和象鼻表现出惊人的形态适应性，能够动态重塑身体形状和刚性，灵活调整控制策略，实现多样化行为。受这些生物系统的启发，近几十年出现了各种软机器人，采用多样的材料、刚性和形态，专为特定任务量身定制。尽管软体机器人的材料和结构设计取得了重大进步，但开发能够快速适应多种配置的通用控制框架仍是一个长期挑战。现有控制器只能采用固定配置，需要针对特定配置进行繁琐的重构和策略重设计以适应新配置。本文介绍了一种可推广的控制系统，通过在共享的线性Koopman嵌入空间中进行强化学习，实现在不同软机器人配置间的快速适应。通过将机器人动力学编码到该嵌入空间，我们的方法将控制策略与特定形态解耦，实现在不同配置中实时、无模型的策略适配，无需从零重新训练。我们在33种不同机器人配置中验证系统。我们的系统在不同配置下实现了75倍的转移样品减少，同时在高速运动、重载荷和多执行器故障下保持强劲性能，并实现了软机器人此前难以实现的实际技能。该工作建立了一个统一且可适应的控制范式，适用于多样化的软机器人配置，连接机械可重构性与控制灵活性，并可能为复杂物理系统中的通用控制提供更广泛的见解。

Learning Predictive Control with Deep Koopman Operators for Autonomous Vehicle Motion Planning

基于深度Koopman算子的学习型预测控制用于自动驾驶车辆运动规划

Authors: Xinglong Zhang, Yongqian Xiao, Haotian Cao, Xing Zhou, Xin Yin, Xin Xu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.08136
Pdf link: https://arxiv.org/pdf/2606.08136
Abstract Model Predictive Control (MPC) is widely used for autonomous-vehicle (AV) motion planning, but its real-time applicability is often limited by the need for accurate models and online solution of nonlinear, nonconvex optimization problems in dynamic road environments. Actor-critic reinforcement learning offers a promising alternative for online policy generation, yet its policy-learning process often lacks explicit control-theoretic structure. This article proposes a learning predictive control (LPC) framework with deep Koopman operators for efficient real-time motion planning under nonconvex constraints. To address nonlinear and uncertain vehicle dynamics, a deep-Koopman-based predictor is used to lift the system into an interpretable linear observable space in a data-driven manner. Unlike traditional MPC, which computes open-loop control sequences, the proposed LPC framework yields a closed-loop state-feedback policy within each prediction interval through receding-horizon actor-critic learning. To ensure safety under nonconvex environmental constraints, LPC constructs convex local surrogate representations of obstacles and defines corresponding potential-field functions. These functions and their gradients are directly embedded into the actor-critic structure, enabling efficient, safety-aware policy learning. Extensive simulations and real-world experiments on the HongQi-EHS3 platform demonstrate favorable performance in diverse obstacle-avoidance scenarios in terms of safety, computational efficiency, and driving comfort, compared with benchmark methods such as CBF-MPC and LMPCC.
中文摘要 模型预测控制（MPC）广泛应用于自动驾驶汽车（AV）运动规划，但其实时应用性常受限于对精确模型的需求以及动态道路环境中非线性、非凸优化问题的在线求解需求。演员-批评者强化学习为在线策略生成提供了一种有前景的替代方案，但其策略学习过程往往缺乏显式的控制理论结构。本文提出了一个学习型预测控制（LPC）框架，采用深度库普曼算子，在非凸约束下实现高效的实时运动规划。为了解决非线性和不确定的车辆动力学问题，采用基于深度库普曼的预测器，以数据驱动的方式将系统提升到可解释的线性可观测空间。与传统的MPC计算开环控制序列不同，所提LPC框架通过后退视界的actor-critic学习，在每个预测区间内产生闭环状态反馈策略。为确保在非凸环境约束下的安全性，LPC构建障碍物的凸局部代理表示并定义相应的势场函数。这些函数及其梯度直接嵌入到行为者-批评者结构中，实现高效且安全意识的政策学习。在红旗EHS3平台上进行的大量模拟和实际实验显示，在多种障碍避让场景下，安全性、计算效率和驾驶舒适度方面，优于CBF-MPC和LMPCC等基准方法。
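
For readers unfamiliar with Koopman lifting, the generic form of the predictor mentioned in the abstract above can be written as follows. The symbols $\phi_\theta$, $A$, $B$, $C$ and $\pi$ are assumed notation for illustration and are not taken from the paper.

```latex
% Generic Koopman-lifted prediction model (notation assumed for illustration)
\begin{aligned}
z_k &= \phi_\theta(x_k)            && \text{learned lifting of the vehicle state } x_k, \\
z_{k+1} &\approx A z_k + B u_k     && \text{linear dynamics in the lifted (observable) space}, \\
\hat{x}_k &= C z_k                 && \text{linear read-out back to the original state}, \\
u_k &= \pi(z_k)                    && \text{closed-loop state-feedback policy learned by the actor}.
\end{aligned}
```

The last line reflects the contrast drawn in the abstract: a feedback policy is learned within each prediction interval instead of an open-loop control sequence.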

Mind Your Steps: A General Learning Framework for Accurate Humanoid Foothold Tracking

注意你的步伐：一个用于精确人形机器人落脚点跟踪的通用学习框架

Authors: Alessandro Montenegro, Shihao Li, Puze Liu, Alberto Maria Metelli, Jan Peters
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.08253
Pdf link: https://arxiv.org/pdf/2606.08253
Abstract Enabling humanoid robots to operate in complex, dynamic environments remains a critical challenge, fundamentally limited by the ability to navigate robustly, safely, and accurately. While reinforcement learning with velocity-commanded policies has achieved remarkable robustness in humanoid locomotion, this approach lacks explicit control of the foothold placement, leading to unsafe behavior, such as stepping onto human feet, or imprecise navigation, hindering the following manipulation task. Conversely, explicit foothold-tracking policies offer a promising alternative by directly being commanded with target foot poses. However, existing approaches are often limited by unrealistic state assumptions, compromising real-world deployment, or they are part of staged pipelines, making them tied to specific downstream tasks. In this work, we introduce a novel, lightweight framework for training general-purpose 3D foothold-tracking policies. By dynamically providing footstep support through a goal sampler, this method enables the learned policy to be agnostic to specific terrains. Our new target representation effectively mitigates challenges arising in the real world, such as noisy and inaccurate pose estimation and foot contact estimation. Designed for direct real-world transfer, our policy acts as a standalone low-level controller that can be seamlessly paired with various high-level foothold generators. We demonstrate the effectiveness of our framework through extensive experiments in simulation and in the real world. By coupling our policy with different upstream planners, we achieve natural and accurate locomotion in challenging settings, paving the way for loco-manipulation tasks in complex environments.
中文摘要 使类人机器人能够在复杂、动态环境中运行仍是关键挑战，根本上受限于其稳健、安全和准确导航的能力。虽然基于速度指令的强化学习在人形移动中取得了显著的稳健性，但这种方法缺乏对脚点位置的明确控制，导致不安全行为，如踩到人脚或导航不精准，阻碍后续操作任务。相反，明确的脚步追踪政策通过直接以目标脚步姿势进行指令，提供了一种有前景的替代方案。然而，现有方法常常受限于不切实际的状态假设，影响实际部署，或者属于分阶段的流程，因此与特定的下游任务相关联。在本研究中，我们引入了一个新颖且轻量级的框架，用于训练通用的3D足迹追踪策略。通过通过目标采样器动态提供脚步支持，该方法使得所学策略对特定地形具有中立性。我们的新目标表示有效缓解了现实世界中出现的挑战，如嘈杂且不准确的姿势估计以及脚部接触估计。我们的政策设计用于直接的现实世界传输，作为独立的低级控制器，可无缝与各种高级立足生成器配合使用。我们通过大量模拟和现实实验展示了我们框架的有效性。通过将我们的政策与不同的上游规划师结合，我们在具有挑战性的环境中实现自然且准确的移动，为复杂环境中的机车操作任务铺平了道路。

Neuro-Symbolic Injection of LTLf Constraints in Autoregressive Reinforcement Learning Policies

自回归强化学习策略中LTLf约束的神经符号注入

Authors: Ashkan Ansarifard (1), Matteo Mancanelli (1), Elena Umili (1), Fabio Patrizi (1) ((1) Sapienza University of Rome)
Subjects: Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL)
Arxiv link: https://arxiv.org/abs/2606.08312
Pdf link: https://arxiv.org/pdf/2606.08312
Abstract In this work we study offline reinforcement learning (RL) under temporally extended task constraints expressed in Linear Temporal Logic over finite traces (LTLf). Recently, transformer-based approaches such as Trajectory Transformers and Decision Transformers have been adopted to address RL as a sequence modeling problem. However, these methods optimize purely for reward and do not account for high-level temporal requirements. Here, we introduce a neurosymbolic framework that injects LTLf background knowledge into such transformer-based RL policies. Our approach compiles LTLf formulas into deterministic finite automata (DFAs) and integrates them into the learning process through a differentiable representation and a logic-based loss function. In particular, we derive differentiable satisfaction signals from DFA progression and use them as a regularization term during training. The resulting method is architecture-agnostic across different models. We evaluate the proposed framework on navigation environments with specification suites covering combinations of safety and reachability temporal properties. Experimental results show that incorporating background knowledge not only improves constraint satisfaction, but also maintains competitive return compared to vanilla baselines.
中文摘要 本研究研究在有限迹线性时间逻辑（LTLf）中以线性时间逻辑（LTLf）表达的任务约束下，离线强化学习（RL）。近年来，基于变换器的方法如轨迹变换器和决策变换器被采用，以解决强化学习作为序列建模问题。然而，这些方法纯粹是为了奖励而优化，未考虑高层次的时间需求。在这里，我们引入了一个神经符号框架，将LTLf背景知识注入此类基于变换器的强化学习策略中。我们的方法将LTLf公式编译为确定性有限自动机（DFA），并通过可微表示和基于逻辑的损失函数将其集成到学习过程中。特别是，我们从DFA进展中推导出可微的满意度信号，并在培训中将其作为正则化项使用。由此产生的方法在不同模型间是架构无关的。我们评估了导航环境的框架，采用涵盖安全与可达性时间属性组合的规范套件。实验结果表明，融入背景知识不仅能提升约束满足度，还能保持与普通基线的竞争回报。
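
One common way to obtain a differentiable satisfaction signal from DFA progression, as mentioned in the abstract above, is to propagate a probability distribution over automaton states instead of a single state. The sketch below is an assumption-laden illustration, not the paper's construction: the three-state DFA (for "eventually reach the goal and never touch a hazard"), the symbol names, and the `logic_loss` form are all hypothetical. It uses plain Python floats; in a real training loop the same sums and products would be written with an autodiff framework.

```python
import math

# Hypothetical DFA for "eventually goal AND always not hazard".
STATES = (0, 1, 2)   # 0: goal pending, 1: goal reached, 2: hazard hit (sink)
ACCEPTING = {1}
DELTA = {
    (0, "empty"): 0, (0, "goal"): 1, (0, "hazard"): 2,
    (1, "empty"): 1, (1, "goal"): 1, (1, "hazard"): 2,
    (2, "empty"): 2, (2, "goal"): 2, (2, "hazard"): 2,
}

def soft_progress(belief, symbol_probs):
    """Advance a distribution over DFA states by one step, given the
    probability that each symbol holds at this step."""
    nxt = {s: 0.0 for s in STATES}
    for s, mass in belief.items():
        for sym, p in symbol_probs.items():
            nxt[DELTA[(s, sym)]] += mass * p
    return nxt

def satisfaction(trace_probs):
    """Probability mass that ends in an accepting state after the trace."""
    belief = {0: 1.0, 1: 0.0, 2: 0.0}
    for symbol_probs in trace_probs:
        belief = soft_progress(belief, symbol_probs)
    return sum(belief[s] for s in ACCEPTING)

def logic_loss(trace_probs, eps=1e-8):
    """Regularization term: negative log of the soft satisfaction."""
    return -math.log(satisfaction(trace_probs) + eps)

safe_trace = [
    {"empty": 0.9, "goal": 0.05, "hazard": 0.05},
    {"empty": 0.1, "goal": 0.85, "hazard": 0.05},
]
risky_trace = [
    {"empty": 0.2, "goal": 0.1, "hazard": 0.7},
    {"empty": 0.1, "goal": 0.85, "hazard": 0.05},
]
```

A trace that probably hits the hazard early keeps most of its mass in the sink state, so its satisfaction is lower and its loss higher than the safe trace's.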

CATPO: Critique-Augmented Tree Policy Optimization

CATPO：批判性增强树策略优化

Authors: Ayush Singh, Umang Goyal, Ankur Dahiya
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.08346
Pdf link: https://arxiv.org/pdf/2606.08346
Abstract Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for improving the reasoning capabilities of large language models (LLMs). Recent tree-based methods such as TreeRPO extend flat trajectory sampling with tree-structured rollouts to obtain dense, step-level reward signals without a separate process reward model. However, not all trees are equally informative: trees where all leaves succeed, all leaves fail, or the policy already predicts the reward distribution contribute little to gradient updates, wasting compute. We introduce CATPO (Critique-Augmented Tree Policy Optimization), which diagnoses and addresses this waste at the tree level. CATPO first scores each tree via a tree informativeness score, F(T), combining leaf-outcome diversity with policy-reward decorrelation at zero extra compute. For dead-wrong trees where all branches fail, CATPO applies critique-guided healing: it locates the shallowest failure point, generates a natural-language critique, and grafts refined continuations to recover training signal. Finally, an informativeness-weighted loss scales each tree's gradient contribution by its normalized score, concentrating parameter updates on the most informative trees while preserving overall gradient magnitude. Experiments on Qwen2.5-Math-1.5B trained with the MATH dataset show that CATPO achieves 37.5% macro accuracy across four benchmarks (AIME24, MATH-500, OlympiadBench, and MinervaMath), improving over TreeRPO by 1.9% and GRPO by 4.8%.
中文摘要 带有可验证奖励的强化学习（RLVR）已成为提升大型语言模型（LLM）推理能力的主流范式。近期基于树的方法如TreeRPO通过树状结构的滚动扩展了平坦轨迹采样，从而获得密集的阶级奖励信号，而无需单独的过程奖励模型。然而，并非所有树都同样具有信息量：所有叶子都成功，所有叶子失败，或者策略已经预测奖励分布对梯度更新贡献较小，浪费计算量。我们引入了CATPO（批判增强树策略优化），它在树级上诊断并解决了这些浪费。CATPO首先通过树信息性评分F（T）对每棵树进行评分，结合叶片-结果多样性与策略-奖励去相关性，计算量为零。对于所有分支都失效的错误树，CATPO采用批判引导修复：定位最浅的失败点，生成自然语言批评，并嫁接精细的延续以恢复训练信号。最后，信息性加权损失通过归一化分数对每棵树的梯度贡献进行标度，将参数更新集中在最具信息量的树上，同时保持整体梯度大小。在用MATH数据集训练的Qwen2.5-Math-1.5B实验显示，CATPO在四个基准测试（AIME24、MATH-500、OlympiadBench和MinervaMath）中实现了37.5%的宏观准确率，比TreeRPO提升1.9%，比GRPO提升4.8%。
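
The tree informativeness score and the informativeness-weighted loss from the abstract above can be sketched as follows. The abstract names the two ingredients (leaf-outcome diversity and policy-reward decorrelation) but not how they are computed or combined, so the formulas below are assumptions for illustration: diversity as rescaled Bernoulli variance, decorrelation as one minus the absolute Pearson correlation, a simple product for the score, and mean-one normalization for the weights. All function names are hypothetical.

```python
from statistics import mean, pstdev

def outcome_diversity(leaf_rewards):
    """Variance of binary leaf outcomes rescaled to [0, 1]; it is 0 when
    all leaves succeed or all leaves fail."""
    p = mean(leaf_rewards)
    return 4.0 * p * (1.0 - p)

def decorrelation(leaf_logprobs, leaf_rewards):
    """1 - |Pearson correlation| between policy log-probs and rewards.
    Close to 0 when the policy already ranks leaves as the reward does."""
    sx, sy = pstdev(leaf_logprobs), pstdev(leaf_rewards)
    if sy == 0.0:
        return 0.0   # no reward variation: nothing to learn from
    if sx == 0.0:
        return 1.0   # policy does not distinguish the leaves at all
    mx, my = mean(leaf_logprobs), mean(leaf_rewards)
    cov = mean([(x - mx) * (y - my) for x, y in zip(leaf_logprobs, leaf_rewards)])
    return max(0.0, 1.0 - abs(cov / (sx * sy)))

def informativeness(leaf_logprobs, leaf_rewards):
    """Hypothetical combination: product of the two components."""
    return outcome_diversity(leaf_rewards) * decorrelation(leaf_logprobs, leaf_rewards)

def normalized_weights(scores):
    """Scale scores to mean 1 so the overall gradient magnitude is preserved."""
    avg = mean(scores)
    if avg == 0.0:
        return [1.0] * len(scores)
    return [s / avg for s in scores]

# Each tree: (leaf log-probs under the policy, binary leaf rewards).
trees = [
    ([-1.0, -2.0, -1.5, -2.5], [1, 1, 1, 1]),  # all leaves succeed
    ([-1.0, -2.0, -1.5, -2.5], [0, 0, 0, 0]),  # all leaves fail
    ([-1.0, -3.0, -1.2, -2.8], [1, 0, 1, 0]),  # mixed, policy already agrees
    ([-1.0, -3.0, -1.2, -2.8], [0, 1, 1, 0]),  # mixed, policy uninformed
]
scores = [informativeness(lp, r) for lp, r in trees]
weights = normalized_weights(scores)
```

Both inputs are already available after the rollout, which is consistent with the abstract's claim that scoring adds no extra compute. In this sketch the all-fail tree simply gets zero weight; the paper's critique-guided healing step for such trees is not modeled.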

Self-Evolving Scientific Agent Discovers Generalizable Physically-Reasoned Fluid Control

自进化科学智能体发现可泛化、具备物理推理的流体控制

Authors: Boai Sun, Wenjin Guo, Zongmin Yu, Liu Yang
Subjects: Artificial Intelligence (cs.AI); Fluid Dynamics (physics.flu-dyn)
Arxiv link: https://arxiv.org/abs/2606.08405
Pdf link: https://arxiv.org/pdf/2606.08405
Abstract While data-intensive deep reinforcement learning can optimize complex control policies, scientific discovery in physical systems fundamentally requires an interpretable chain of reasoning that connects physical evidence to structured control architectures. Here, we present a self-evolving scientific-agent workflow, driven by large language models and iterative code generation, that automates controller construction while preserving strict interpretability and rigorous physical reasoning. Instead of adjusting weights, the agent deploys candidate strategies into physical simulations, actively diagnoses dynamic behaviors from multimodal evidence, and translates these observations into progressive source-code refinements. We demonstrate this framework on a highly non-linear fluid-structure interaction problem: an underactuated, two-joint dogfish swimmer tasked with spatial target reaching using only joint angular accelerations. Starting from a propulsive seed policy that exhibits a one-sided steering bias, the agent autonomously discovers and refines a unified controller that robustly captures all canonical targets. Remarkably, without any retraining or target-specific branching, the synthesized control policy generalizes to unseen static targets and dynamically curved pursuit trajectories. The auditable evolve log reveals an emergent control architecture built upon traveling-wave propulsion, body-frame target guidance, yaw-rate feedback, signed mean-tail curvature, and adaptive cadence relief. Our results show that an autonomous scientific agent can successfully transform accumulated physical evidence into robust, mathematically readable control policy, while maintaining a fully traceable process of scientific discovery.
中文摘要 虽然数据密集型深度强化学习可以优化复杂的控制策略，但物理系统中的科学发现本质上需要一条可解释的推理链，将物理证据与结构化控制架构连接起来。本文介绍了一种自我演进的科学代理工作流程，由大型语言模型和迭代代码生成驱动，能够自动化控制器构建，同时保持严格的可解释性和严谨的物理推理。代理不再调整权重，而是将候选策略部署到物理模拟中，主动从多模态证据中诊断动态行为，并将这些观察转化为渐进式的源代码改进。我们在一个高度非线性的流体-结构相互作用问题上演示了该框架：一个欠驱动的双关节狗鱼游泳者，仅通过关节角加速度来实现空间目标的达标。从表现出单侧转向偏置的推进种子策略开始，智能体自主发现并完善一个统一控制器，能够稳健地捕获所有典型目标。值得注意的是，在没有任何重新训练或目标特定分支的情况下，综合控制策略能够推广到看不见的静态目标和动态弯曲的追击轨迹。可审计的演化日志揭示了基于行波推进、机体-机架目标引导、偏航率反馈、带符号尾曲率和自适应踏频缓解的涌现控制架构。我们的结果表明，自主科学代理能够成功将积累的物理证据转化为稳健且数学上可读的控制策略，同时保持科学发现的完全可追溯过程。

Sparrow: Sparse Rollout for Stable and Efficient Long-context RL of Large Language Models

Sparrow：面向大型语言模型稳定高效长上下文强化学习的稀疏rollout

Authors: Yang Zhou, Ranajoy Sadhukhan, Zhaofeng Sun, Zhuoming Chen, Souvik Kundu, Saket Dingliwal, Sai Muralidhar Jayanthi, Aram Galstyan, Haizhong Zheng, Beidi Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.08446
Pdf link: https://arxiv.org/pdf/2606.08446
Abstract Despite being powerful, reinforcement learning with verifiable rewards (RLVR) induces extremely long COT, making it computationally expensive. Since RLVR per-step cost is dominated by long-context rollout generation, sparse attention offers a promising way to accelerate dense rollout. However, sparse rollouts require a delicate stability-efficiency tradeoff: overly aggressive sparsity causes collapse, while overly lenient sparsity gives insufficient speedup. In this work, we study this tradeoff through sparse-to-dense actor-policy mismatch. We first observe that sparse rollout collapse is not driven by uniform degradation across tokens: most sparse tokens align perfectly with dense even under aggressive sparsity. Motivated by this, we hypothesize that sparse rollout training remains stable if the lower tail of per-token actor-policy mismatch stays above a critical threshold throughout the trajectory. We introduce a dynamic sparsity schedule that keeps this tail statistic constant during generation and validate our hypothesis. Across Qwen3 thinking-family models, keeping the tail mismatch statistic near a consistent threshold generally enables stable training. We then use a cost model to find the sparsity schedule for maximum speedup under this mismatch threshold, achieving 2.2x, 2.4x, and 2.0x rollout speedups when training Qwen3-1.7B, Qwen3-4B, and Qwen3-8B. Empirically, we show the thresholds generalize to a larger model (Qwen3-14B) and another RL domain (coding). Finally, our analysis naturally motivates DistillSparse: lightweight LoRA-based distillation on sparse rollout lets more aggressive sparsity reach the same sparse-to-dense mismatch threshold, yielding higher speedup.
中文摘要 尽管强大，带可验证奖励的强化学习（RLVR）会诱导极长的COT时间，导致计算成本高昂。由于每步RLVR成本主要由长上下文的展开生成所主导，稀疏关注为加快密集展开提供了有前景的方法。然而，稀疏的滑行需要在稳定性与效率之间做出微妙权衡：过于激进的稀疏度会导致崩溃，而过于宽松的稀疏度则无法提供足够的加速。本研究通过稀疏到密集的行为者-策略不匹配来研究这种权衡。我们首先观察到稀疏的rollout崩溃并非由各token均匀退化驱动：大多数稀疏token即使在激进稀疏情况下也与dense完美对齐。基于此，我们假设如果每个代币参与者-策略不匹配的下尾部在整个轨迹中保持在临界阈值以上，则稀疏的推广训练保持稳定。我们引入了一个动态稀疏性计划，使尾部统计量在生成过程中保持恒定，并验证我们的假设。在Qwen3思维家族模型中，将尾部不匹配统计量保持在一致阈值附近通常有助于稳定训练。然后我们使用成本模型，在该错配阈值下找到最大加速的稀疏度调度，在训练Qwen3-1.7B、Qwen3-4B和Qwen3-8B时，实现了2.2倍、2.4倍和2.0倍的推广加速。通过经验，我们展示了阈值可以推广到更大的模型（Qwen3-14B）和另一个强化学习领域（编码）。最后，我们的分析自然而然地推动了 DistillSparse：基于轻量级 LoRA 的蒸馏在稀疏推广上，使更激进的稀疏度达到相同的稀疏到稠密不匹配阈值，从而带来更高的加速。

GIFT: LLM-Guided State-Reward Interface for Financial Reinforcement Learning

GIFT：面向金融强化学习的LLM引导状态-奖励接口

Authors: Yanyan Wu, Boyi Zhang, Yanlin Liu, Xinyu Fang, Jining Luan, Meiqi Zhang, Jiacheng Liu, Hao Zeng, Dexu Yu, Chang Liu, Hanwen Du, Yongxin Ni, Youhua Li
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.08450
Pdf link: https://arxiv.org/pdf/2606.08450
Abstract Financial portfolio trading is naturally formulated as a reinforcement learning problem, where an agent sequentially rebalances assets under changing market conditions to balance return, risk, and transaction costs. Yet in non-stationary markets, raw OHLCV states and short-horizon return rewards often provide an under-specified learning interface, motivating large language models as a way to inject financial knowledge into state and reward design while constraining open-ended generation. To this end, we propose GIFT, an LLM-guided framework for state-reward interface design in PPO-based financial reinforcement learning. Rather than using the LLM to make trading decisions, GIFT uses Factor-guided State Enhancement to generate state features from financial-factor primitives, Risk-rule-guided Reward Shaping to generate auxiliary rewards from portfolio-risk rules, and Diagnostic-guided Refinement to revise candidate interfaces using PPO rollout diagnostics. After refinement, GIFT fixes the selected state-reward interface before evaluation, with no further LLM queries or interface updates at test time. Comprehensive rolling-window experiments across diverse market regimes and portfolio scenarios demonstrate that GIFT improves learning-signal quality and out-of-sample risk-adjusted portfolio performance over baselines. Code and data are available at: this https URL .
中文摘要 金融投资组合交易自然被表述为一种强化学习问题，代理在变化的市场条件下依次再平衡资产，以平衡回报、风险和交易成本。然而，在非平稳市场中，原始OHLCV状态和短期回报奖励常常提供一个未明确的学习接口，促使大型语言模型将金融知识注入状态和奖励设计，同时限制开放式生成。为此，我们提出了GIFT，这是一个基于PPO的金融强化学习中，基于LLM的状态-奖励接口设计框架。GIFT不使用LLM进行交易决策，而是使用因子引导状态增强（FACTOR-GUIDED State Enhancement）从金融因素原语生成状态特征，利用风险规则引导的回报塑形（Risk Guided Reward Shaping）从投资组合风险规则生成辅助奖励，并使用诊断引导细化（Diagnostic-guided Refinement）通过PPO推广诊断修正候选接口。优化后，GIFT在评估前修复所选状态-奖励接口，测试时不再进行LLM查询或界面更新。跨不同市场体系和投资组合场景的全面滚动窗口实验表明，GIFT在学习信号质量和样本外风险调整后投资组合表现相较基线水平有所提升。代码和数据可在以下 https URL 获取。

Adaptive Loss Balancing for Noise-Robust GRPO in Generative Recommendation

生成式推荐中面向噪声鲁棒GRPO的自适应损失平衡

Authors: Kewei Xu, Junbo Qi, Yanyan Zou, Pengfei Zhang, Xingzhi Yao, Shengjie Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2606.08480
Pdf link: https://arxiv.org/pdf/2606.08480
Abstract Reinforcement learning (RL) presents a promising avenue for enhancing generative recommendation beyond supervised imitation, leveraging reward signals to guide policy improvement. However, its efficacy is critically contingent on the trustworthiness of the reward model for the samples it evaluates. In practice, production rankers, the widely adopted reward models, are trained on exposure-biased logs, leading to sample-dependent inaccuracies that violate this assumption. Our stratified analysis uncovers a consistent pattern: reward guidance is most beneficial when the policy exhibits uncertainty and the ranker can effectively discriminate the ground-truth item from rollout negatives. On other samples, the reward signal is either negligible or detrimental, highlighting the risk of uniform RL application. To address such an issue, we introduce AdaGRPO, a novel framework that treats reward-guided optimization as selective admission rather than uniform pressure. Training is anchored in supervised negative log-likelihood, while the GRPO objective is gated by a binary, per-sample clip determined by two rollout diagnostics: policy-side difficulty and reward discriminability. Instances failing either diagnostic default to pure supervision, ensuring stability and mitigating the amplification of noisy gradients. We validate AdaGRPO on a large-scale e-commerce dataset. At the best intermediate checkpoint, it elevates HR@10 from 11.01% to 12.18% while constraining hallucination below 0.22%, and maintains robustness at the final checkpoint (HR@10 11.63%, hallucination 0.27%), outperforming fixed NLL--GRPO mixtures across the retrieval--validity frontier. In production A/B tests, AdaGRPO achieves statistically significant gains in click-through rate and dwell time, confirming its practical utility.
中文摘要 强化学习（RL）为提升生成式推荐提供了一种有前景的途径，超越了监督模仿，利用奖励信号指导政策改进。然而，其有效性关键在于奖励模型对所评估样本的可信度。实际上，广泛采用的生产排名器（Production Rankers）是基于暴露偏向的日志进行训练的，导致依赖样本的不准确度，违反了这一假设。我们的分层分析揭示了一个一致的模式：当政策存在不确定性，且排名者能够有效区分实地信息与推广负面因素时，奖励指导最为有益。在其他样本中，奖励信号要么可以忽略不计，要么有害，凸显了强化学习应用均匀化的风险。为解决这一问题，我们引入了AdaGRPO这一新框架，将奖励引导优化视为选择性录取而非均匀压力。训练基于监督的负对数似然，而GRPO目标则由两个推广诊断确定的二进制、每个样本剪辑来限制：策略端难度和奖励可辨别性。未通过诊断的实例默认为纯监督，确保稳定性并减轻噪声梯度的放大。我们在大规模电子商务数据集上验证了AdaGRPO。在最佳中间检查点，它将HR@10从11.01%提升到12.18%，同时将幻觉控制在0.22%以下，并在最终检查点保持稳健度（HR@10 11.63%，幻觉0.27%），在检索有效性边界上优于固定的NLL-GRPO组合。在生产A/B测试中，AdaGRPO在点击率和停留时间上取得了统计学上的显著提升，证实了其实用性。
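
The per-sample gating described in the abstract above (an NLL anchor plus a GRPO term admitted only when two rollout diagnostics pass) can be sketched as below. The abstract does not define the diagnostics numerically, so the concrete choices here are assumptions: `policy_hit_rate` as the difficulty measure, the score margin over the strongest rollout negative as the discriminability measure, and the thresholds `max_hit_rate` and `min_margin` are all hypothetical.

```python
def admit_to_rl(policy_hit_rate, gt_score, negative_scores,
                max_hit_rate=0.5, min_margin=0.1):
    """Binary per-sample gate built from two rollout diagnostics.

    policy_hit_rate: fraction of rollouts that reproduced the ground-truth
        item (a low value means the policy is still uncertain here).
    gt_score, negative_scores: ranker scores for the ground-truth item and
        for the rollout negatives (non-empty); a large gap means the reward
        model can tell them apart.
    """
    difficult = policy_hit_rate <= max_hit_rate
    discriminable = (gt_score - max(negative_scores)) >= min_margin
    return difficult and discriminable

def sample_loss(nll, grpo_loss, gate, rl_weight=1.0):
    """The NLL anchor always applies; the GRPO term is added only if gated in."""
    return nll + (rl_weight * grpo_loss if gate else 0.0)

# Uncertain policy, ranker separates ground truth from negatives: admitted.
gate_a = admit_to_rl(0.25, 0.9, [0.4, 0.6])
# Policy already solves the sample: falls back to pure supervision.
gate_b = admit_to_rl(0.90, 0.9, [0.4, 0.6])
# Ranker prefers a negative over the ground truth: falls back as well.
gate_c = admit_to_rl(0.25, 0.5, [0.55, 0.3])
```

Samples that fail either check contribute only the supervised term, which matches the abstract's description of defaulting to pure supervision.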

Back on Track: Aligning Rewards and States for Reasoning in Diffusion Large Language Models

回到正题：在扩散大型语言模型中，如何对齐奖励与状态以推理

Authors: Yawen Shao, Jie Xiao, Kai Zhu, Yu Liu, Hongchen Luo, Xueyang Fu, Yang Cao, Wei Zhai, Zheng-Jun Zha
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.08501
Pdf link: https://arxiv.org/pdf/2606.08501
Abstract Reinforcement learning (RL) holds immense promise for enhancing the reasoning capabilities of diffusion large language models (dLLMs). However, progress is fundamentally constrained by a dual misalignment between authentic generation trajectory and the gradient update process: (i) Process-reward misalignment. Sparse, terminal rewards are indiscriminately assigned to all intermediate steps of the generation process, failing to provide discriminative credit assignment. (ii) State-trajectory misalignment. Policy updates are often diverted toward artificial, out-of-trajectory states, squandering gradients on less informative samples. To address these limitations, we introduce Process Aligned Policy Optimization (PAPO), a novel framework that holistically aligns the RL update with the dLLM's generative trajectory via Step-Aware Process Rewards (SPR) that transform sparse terminal rewards into dense, step-wise credit, and Entropy-Guided Historical Re-enactment (EHR) that replays authentic trajectories at high-uncertainty steps. Extensive experiments on four benchmarks demonstrate that PAPO significantly outperforms baselines, achieving gains of up to 4.5% on GSM8K, 4.8% on MATH500, 42.2% on Countdown and 16.1% on Sudoku.
中文摘要 强化学习（RL）在提升扩散大型语言模型（dLLMs）的推理能力方面具有巨大潜力。然而，进展从根本上受制于真实生成轨迹与梯度更新过程之间的双重错位：（i）过程-奖励错位。稀疏的终端奖励被无差别地分配给生成过程的所有中间步骤，无法提供有区分度的信用分配。（ii）状态-轨迹错位。策略更新常常被引向人为构造的、偏离轨迹的状态，把梯度浪费在信息量较少的样本上。为解决这些局限，我们提出了过程对齐策略优化（PAPO），这一新框架通过两个组件使强化学习更新与dLLM的生成轨迹整体对齐：步骤感知过程奖励（SPR）将稀疏的终端奖励转化为密集的逐步信用，熵引导历史重演（EHR）在高不确定性步骤重放真实轨迹。在四个基准上的大量实验表明，PAPO显著优于基线，在GSM8K、MATH500、Countdown和Sudoku上的提升分别最高达4.5%、4.8%、42.2%和16.1%。

Towards End to End Motion Planning and Execution for Autonomous Underwater Vehicles Using Reinforcement Learning

迈向基于强化学习的自主水下航行器端到端运动规划与执行

Authors: Elisei Shafer, Oren Gal
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2606.08513
Pdf link: https://arxiv.org/pdf/2606.08513
Abstract Autonomous Underwater Vehicles (AUVs) traditionally rely on complex, heavily engineered pipelines for perception, path planning, and motion control. This paper explores the feasibility of an end-to-end Deep Reinforcement Learning (DRL) approach that maps raw sensor data directly to thruster commands, reducing manual engineering. We propose a hierarchical reinforcement learning (HRL) architecture splitting the problem into two Markov Decision Processes. A High-Level (HL) policy operating at 2Hz processes raw $84 \times 84$ pixel monocular camera frames, stacked $100 \times 100$ pixel forward-looking imaging sonar, and proprioceptive data to generate spatial subgoals. Simultaneously, a Low-Level (LL) policy operating at 10Hz converts these subgoals into thruster commands. The HL policy is trained using Reinforcement Learning from Prior Demonstrations (RLPD) within a modified Sample-Efficient Robotic Reinforcement Learning (SERL) framework, while the LL policy utilizes Soft Actor-Critic (SAC) combined with Hindsight Experience Replay (HER). Evaluated in the high-fidelity HoloOcean simulator, our method demonstrates successful obstacle avoidance, achieving trajectory lengths closely approximating (within 4% to 6% of) an $\text{RRT}^*$ planning baseline. Furthermore, the learned policy exhibits strong robustness to simulated sensor noise and decreased visibility. While the system navigates familiar geometries effectively, experiments reveal generalization limitations when encountering unvisited areas with novel obstacle shapes. Ultimately, this work demonstrates the promise of sample-efficient, end-to-end DRL for underwater navigation using minimal computational hardware.
中文摘要 自主水下航行器（AUV）传统上依赖复杂且高度工程化的流水线进行感知、路径规划和运动控制。本文探讨了端到端深度强化学习（DRL）方法的可行性，该方法将原始传感器数据直接映射为推进器指令，从而减少人工工程。我们提出了一种分层强化学习（HRL）架构，将问题拆分为两个马尔可夫决策过程。高层（HL）策略以2Hz运行，处理原始的$84 \times 84$像素单目相机帧、堆叠的$100 \times 100$像素前视成像声呐以及本体感知数据，以生成空间子目标。同时，以10Hz运行的低层（LL）策略将这些子目标转换为推进器指令。HL策略在经过修改的样本高效机器人强化学习（SERL）框架内，使用基于先验演示的强化学习（RLPD）进行训练，而LL策略则采用软演员-评论家（SAC）并结合事后经验回放（HER）。在高保真HoloOcean模拟器中的评估表明，我们的方法能够成功避障，轨迹长度与$\text{RRT}^*$规划基线非常接近（相差4%到6%以内）。此外，所学策略对模拟传感器噪声和能见度降低表现出很强的鲁棒性。虽然系统能在熟悉的几何环境中有效导航，但实验显示，当遇到含有新颖障碍物形状的未访问区域时，其泛化能力存在局限。最终，这项工作展示了在极少计算硬件条件下，将样本高效的端到端DRL用于水下导航的前景。

DriveReward: A Comprehensive Dataset and Generative Vision-Language Reward Model for Autonomous Driving

DriveReward：一个全面的数据集和生成式视觉语言奖励模型，适用于自动驾驶

Authors: Qimao Chen, Fang Li, Yuechen Luo, Zehan Zhang, Haiyang Sun, Fangzhen Li, Bing Wang, Guang Chen, Yang Ji, Jiong Deng, Hongwei Xie, Hangjun Ye, Long Chen, Yi Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.08525
Pdf link: https://arxiv.org/pdf/2606.08525
Abstract Reward models play a pivotal role in reinforcement learning (RL) and multi-modal trajectory selection for autonomous driving. However, acquiring such rewards typically relies on hand-crafted rule-based objectives or perception ground truth, which hinders generalization for data-scaling. While Vision-Language Models (VLMs) have demonstrated feasibility as reward models in other domains, their effectiveness in driving tasks remains underexplored. In this work, we bridge this gap by introducing (1) DriveReward, a reasoning trajectory evaluation dataset rigorously labeled via temporally-grounded visual guidance and augmented with counterfactual driving behaviors, alongside (2) a specialized Vision-Language Reward Model. To address the scarcity of failure cases in conventional datasets, we propose a counterfactual data annotation scheme to construct cases encompassing diverse driving styles and erroneous behaviors. Evaluations on our proposed benchmark reveal that even leading open-source and proprietary VLMs fail to excel across all tasks, highlighting significant room for improvement in existing models. Building on these findings, we subsequently tailor a specialized 1B reward model that outperforms larger VLMs on task-specific reward alignment. Finally, we validate our reward model's effectiveness by integrating it into RL finetuning and multi-modal trajectory scoring across multiple baselines, achieving performance comparable to rule-based reward calculations in both open-loop and closed-loop evaluation.
中文摘要 奖励模型在强化学习（RL）和多模态轨迹选择中发挥着关键作用，用于自动驾驶。然而，获得此类奖励通常依赖于手工制定的基于规则的目标或感知的真实信息，这阻碍了数据扩展的泛化。虽然视觉语言模型（VLMs）已证明可行作为奖励模型，但其在推动任务中的有效性仍未被充分探讨。在本研究中，我们通过（1）引入DriveReward——一个通过时间基础视觉指导严格标记并辅以反事实驾驶行为的推理轨迹评估数据集，弥合了这一空白;（2）结合了专门的视觉语言奖励模型。为解决传统数据集中故障案例的稀缺性，我们提出了一种反事实数据注释方案，用于构建涵盖多样驾驶风格和错误行为的案例。对我们提出的基准测试的评估显示，即使是领先的开源和专有VLM也未能在所有任务中表现出色，这凸显了现有模型仍有显著改进空间。基于这些发现，我们随后定制了一个专门的1B奖励模型，使其在任务特定奖励对齐方面优于大型VLM。最后，我们通过将奖励模型整合进强化学习微调和多模态轨迹评分，验证了其有效性，在开环和闭环评估中均可与基于规则的奖励计算相媲美。

Autonomous Aerial Manipulation via Contextual Contrastive Meta Reinforcement Learning

通过上下文对比元强化学习实现自主空中操作

Authors: Lixuan Jin, Bingxuan Lan, Xinyi Bao, Xiangyuan Xie, Chunjie Zhang, Zheng Chen, Tianshuo Liu, Ruijie Tian, Jinyu Ru, Gang Wang, Lei Yuan, Yang Yu
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.08533
Pdf link: https://arxiv.org/pdf/2606.08533
Abstract Unmanned aerial vehicles (UAVs) are increasingly being deployed in logistics, service robotics, and other real-world applications, creating a growing demand for autonomous payload acquisition and delivery. Existing approaches typically assume pre-attached payloads or rely on specialized grippers, leaving versatile end-to-end aerial delivery largely unresolved, where different payloads induce highly variable flight dynamics, requiring a single policy to adapt online without manual calibration or explicit system identification. To this end, we study \textbf{A}utonomous \textbf{A}erial Manipulation via \textbf{Co}ntextual \textbf{Co}ntrastive Meta Reinforcement Learning (\textbf{\textit{Aco2}}), a fully autonomous aerial delivery setting in which a quadrotor equipped with a lightweight hook continuously picks up, transports, and delivers diverse handle-equipped objects between randomized locations, all without human intervention. First, we design a contextual observation encoder that infers a compact latent context from recent interaction history, enabling the policy to adapt online to payload-dependent dynamics. To further improve the quality of this context, we introduce a contrastive objective that structures the context embedding around task-relevant variations, improving generalization across diverse payloads without requiring explicit system identification. Trained entirely in simulation with extensive domain randomization, \textit{Aco2} can be directly deployed on a physical quadrotor without real-world fine-tuning.
中文摘要 无人机（UAV）正日益被部署于物流、服务机器人及其他实际应用中，推动了对自主有效载荷获取与投递的需求增长。现有方法通常假设有效载荷已预先挂载，或依赖专用夹持器，使得通用的端到端空中投递在很大程度上仍未解决：不同的有效载荷会引起高度可变的飞行动力学，要求单一策略在无需手动校准或显式系统辨识的情况下在线自适应。为此，我们研究了基于上下文对比元强化学习的自主空中操作（\textbf{A}utonomous \textbf{A}erial Manipulation via \textbf{Co}ntextual \textbf{Co}ntrastive Meta Reinforcement Learning，\textbf{\textit{Aco2}}），这是一种完全自主的空中投递场景：配备轻型挂钩的四旋翼飞行器在随机位置之间持续拾取、运输并投递各种带把手的物体，全程无需人工干预。首先，我们设计了一个上下文观测编码器，它从近期交互历史中推断出紧凑的潜在上下文，使策略能够在线适应依赖有效载荷的动力学。为进一步提升该上下文的质量，我们引入了一个对比目标，使上下文嵌入围绕与任务相关的变化来组织，从而在无需显式系统辨识的情况下提升对多种有效载荷的泛化能力。\textit{Aco2}完全在仿真中结合大量域随机化进行训练，可直接部署到真实四旋翼飞行器上，无需在真实环境中微调。

PAEC: Position-Aware Entropy Calibration for LLM Reasoning in RLVR

PAEC：RLVR中用于LLM推理的位置感知熵校准

Authors: Shumeng Yang, Yisu Liu, Jiayi Zheng, Zhaohui Yang, Linjing Li
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.08543
Pdf link: https://arxiv.org/pdf/2606.08543
Abstract Reinforcement learning with verifiable rewards (RLVR) improves large language model reasoning but often suffers from rapid policy-entropy collapse, where the policy prematurely concentrates on narrow high-probability reasoning paths. While global entropy regularization can encourage exploration, uniformly increasing entropy across all token positions is inefficient for long reasoning trajectories, where many tokens are not decision-relevant. We propose Position-Aware Entropy Calibration (PAEC), a token-level entropy-management framework that constructs a soft mask from local top-p entropy and top-two candidate competition, and applies an anchor-based lower-bound penalty to prevent selected-position entropy collapse. Experiments on five mathematical reasoning benchmarks show that PAEC improves macro-average majority-vote performance over strong RLVR baselines, with clear gains on AIME-style tasks. Our results suggest that entropy management in reasoning RL should be formulated as selective exploration allocation over decision-sensitive positions rather than uniform randomness injection.
中文摘要 带有可验证奖励的强化学习（RLVR）提升了大型语言模型的推理能力，但常常出现策略熵的快速崩塌，即策略过早集中于狭窄的高概率推理路径。虽然全局熵正则化可以鼓励探索，但对长推理轨迹而言，在所有词元（token）位置上统一提高熵的效率较低，因为其中许多词元与决策无关。我们提出了位置感知熵校准（PAEC），这是一种词元级熵管理框架，它根据局部top-p熵和前两名候选之间的竞争构建软掩码，并施加基于锚点的下界惩罚，以防止被选中位置的熵崩塌。在五个数学推理基准上的实验表明，PAEC相对于强RLVR基线提升了宏平均多数投票性能，在AIME类任务上提升明显。我们的结果表明，推理强化学习中的熵管理应被表述为在决策敏感位置上有选择地分配探索，而非均匀地注入随机性。
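
The PAEC abstract above names three ingredients: a local top-p entropy, a top-two candidate competition signal, and an anchor-based lower-bound penalty applied through a soft mask. The following Python sketch shows one way these pieces could fit together. The abstract does not state how they are combined, so the product-form mask, the normalization by the log of the vocabulary size, the hinge-style penalty, and all function names are assumptions for illustration only.

```python
import math


def top_p_entropy(probs, top_p=0.9):
    """Entropy of the renormalized nucleus (smallest top set with mass >= top_p)."""
    nucleus, mass = [], 0.0
    for p in sorted(probs, reverse=True):
        nucleus.append(p)
        mass += p
        if mass >= top_p:
            break
    return -sum((p / mass) * math.log(p / mass) for p in nucleus if p > 0.0)


def top_two_competition(probs):
    """Ratio of the runner-up probability to the top probability, in [0, 1]."""
    ranked = sorted(probs, reverse=True)
    if len(ranked) < 2 or ranked[0] == 0.0:
        return 0.0
    return ranked[1] / ranked[0]


def soft_mask(probs, top_p=0.9):
    """Hypothetical soft mask: high only when entropy and competition are both high."""
    if len(probs) < 2:
        return 0.0
    normalized_entropy = top_p_entropy(probs, top_p) / math.log(len(probs))
    return normalized_entropy * top_two_competition(probs)


def lower_bound_penalty(positions, anchor_entropy, top_p=0.9):
    """Mask-weighted average shortfall of entropy below the anchor value."""
    weighted, total = 0.0, 0.0
    for probs in positions:
        weight = soft_mask(probs, top_p)
        shortfall = max(0.0, anchor_entropy - top_p_entropy(probs, top_p))
        weighted += weight * shortfall
        total += weight
    return weighted / total if total > 0.0 else 0.0


if __name__ == "__main__":
    positions = [[0.5, 0.5], [1.0, 0.0]]  # one contested position, one settled
    print(lower_bound_penalty(positions, anchor_entropy=1.0))
```

In this sketch a fully settled position gets mask weight zero, so only contested positions contribute to the penalty, which is the selective behavior the abstract contrasts with uniform entropy regularization.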

Ishigaki-IDS: An Open-Weight Verifier-Aware Model for Information Delivery Specification Drafting in Building Information Modeling

Ishigaki-IDS：一种开放权重、验证器感知的模型，用于建筑信息建模中的信息交付规范起草

Authors: Ryo Kanazawa, Koyo Hidaka, Teppei Miyamoto, Takayuki Kato, Tomoki Ando, Chenguang Wang, Dayuan Jiang, Naofumi Fujita, Shuhei Saitoh, Atomu Kondo, Koki Arakawa, Daiho Nishioka
Subjects: Computation and Language (cs.CL); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2606.08545
Pdf link: https://arxiv.org/pdf/2606.08545
Abstract Building Information Modeling (BIM) projects require information requirements to be described as machine-checkable Information Delivery Specification (IDS) files in order to verify whether building models contain the required attributes. However, IDS authoring remains a practical bottleneck: practitioners must handle domain vocabulary, strict XML schema constraints, and external validator conformance while also checking whether the requirement itself is correctly expressed. We present Ishigaki-IDS, an open-weight LLM specialized for verifier-aware IDS draft generation. The model combines continued pretraining on BIM/IDS corpora, supervised fine-tuning on information-requirement-to-IDS pairs, and reinforcement learning with verifiable rewards from an external validator. The goal is not to replace expert review, but to move IDS authoring from low-level XML and schema repair toward validator-loadable drafts that practitioners can inspect and correct. On the 166-case expert-created Ishigaki-IDS-Bench, Ishigaki-IDS-8B achieves an IDSAuditPass score of 0.651, a validator-pass metric for generated IDS files, substantially outperforming Claude Opus 4.5, the strongest single-shot LLM baseline we evaluated, at 0.331. It also obtains an Audit-Gated FacetF1 of 0.282, which measures requirement-facet alignment among validator-passing drafts. The same recipe scales: 14B and 32B variants reach IDSAuditPass 0.753 / 0.693 and Audit-Gated FacetF1 0.392 / 0.369. In a workflow check with six BIM practitioners, Ishigaki-assisted authoring reduced aggregate work time by 54.7% under the same validation and alignment endpoint. These results suggest that verifier-aware IDS generation can reduce the practical burden of converting BIM information requirements into reviewable IDS drafts.
中文摘要 建筑信息建模（BIM）项目要求信息需求描述为机器可检查的信息传递规范（IDS）文件，以验证建筑模型是否包含所需属性。然而，IDS的创作仍是一个实际瓶颈：从业者必须处理领域词汇、严格的XML模式约束和外部验证器一致性，同时还要检查需求本身是否被正确表达。我们介绍Ishigaki-IDS，一款专门用于验证器感知IDS草稿生成的开权重大型语言模型。该模型结合了对BIM/IDS语料库的持续预训练、对信息需求到IDS对的监督微调，以及来自外部验证者的可验证奖励强化学习。目标不是取代专家评审，而是将IDS的创作从低级的XML和模式修复，转变为可由验证员加载的草稿，供从业者检查和修正。在166个案例的专家创建的石垣IDS-Bench中，石垣IDS-8B的IDSAuditPass得分为0.651，这是生成IDS文件的验证者-通过指标，显著优于我们评估过的最强单次LLM基线Claude Opus 4.5，后者为0.331。它还获得了0.282的审计门控FacetF1，用于衡量验证者通过草稿之间的需求面对齐。相同的配方尺度：14B和32B变体达到IDSAuditPass 0.753 / 0.693，审计门控FacetF1达到0.392 / 0.369。在对六位BIM从业者的工作流程检查中，石垣辅助创作在同一验证和对齐端点下，总计工作时间减少了54.7%。这些结果表明，验证者感知的IDS生成可以减轻将BIM信息需求转换为可审阅IDS草案的实际负担。

Real-IKEA: Physical Fidelity is the Prerequisite for Robust Manipulation

Real-IKEA：物理保真是强健操作的前提

Authors: Kunqi Xu, Zhenhao Huang, Siyuan Luo, Ziqiu Zeng, Fan Shi
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.08564
Pdf link: https://arxiv.org/pdf/2606.08564
Abstract Robotic manipulation robustness often founders on the physics gap between simplified simulations and the resistance-laden real world. In this work, we emphasize that physical realism in articulated interaction is an important ingredient for robust policy learning. We present Real-IKEA, a dataset and simulation framework designed with physical accuracy as a first-class goal. Real-IKEA provides 1,079 articulated asset configurations, derived from 83 authentic IKEA handles and knobs processed through a meticulous six-step physical workflow. For contact-geometry accuracy, we introduce a bidirectional surface-deviation metric to quantify collision meshes. For dynamics realism, we establish resistance-calibrated configurations that vary damping and friction. Crucially, we demonstrate through a Reinforcement Learning (RL) policy that high-fidelity assets enable the discovery of robust "hooking" and "levering" strategies that prioritize mechanical advantage over fragile friction-pulling. Together, these results position Real-IKEA as a critical benchmark for developing manipulation policies capable of human-level robustness in articulated object tasks.
中文摘要 机器人操作的鲁棒性常常因简化仿真与充满阻力的现实世界之间的物理差距而受挫。在本研究中，我们强调铰接式交互中的物理真实性是鲁棒策略学习的重要因素。我们提出Real-IKEA，一个以物理精度为首要目标的数据集和仿真框架。Real-IKEA提供1,079种铰接资产配置，源自83个真实的宜家把手和旋钮，并经过细致的六步物理处理流程。在接触几何精度方面，我们引入了双向表面偏差度量来量化碰撞网格。在动力学真实性方面，我们建立了经阻力校准的配置，可改变阻尼和摩擦。关键的是，我们通过一个强化学习（RL）策略证明，高保真资产能够促使策略发现鲁棒的“钩拉”和“撬动”策略，这些策略优先利用机械优势，而非脆弱的摩擦拉拽。综合来看，这些结果使Real-IKEA成为开发在铰接物体任务中具备人类水平鲁棒性的操作策略的关键基准。

Distilling LLM Reasoning into an Interpretable Policy Tree for Human-AI Collaboration

将LLM推理提炼成可解释的策略树，用于人机协作

Authors: Beiwen Zhang, Yongheng Liang, Guowei Zou, Haitao Wang, Hejun Wu
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2606.08596
Pdf link: https://arxiv.org/pdf/2606.08596
Abstract Constructing efficient and reliable policies to assist humans is indispensable for human-AI collaboration. Existing methods mainly follow two lines of work. Most prior work relies on multi-agent reinforcement learning (MARL) to learn black-box policies, which limits interpretability and raises safety concerns. Recent methods query large language models (LLMs) at each decision step, causing slow responses and high inference costs. We propose Collaboration Policy Tree (Co-pi-tree), a closed-loop method that learns an executable policy tree consisting of a partner-behavior prediction tree and an agent-action selection tree. Co-pi-tree constructs a policy by distilling LLM reasoning into policy tree code. It then evaluates the policy through partner interaction, obtains feedback, and uses natural language to summarize the interaction feedback to improve problematic branches. Experiments in Overcooked-AI show that Co-pi-tree improves average reward by 35.4% over the baseline average, while reducing the number of LLM queries by 77.7% and test-time latency by 97.1%. Project page: this https URL
中文摘要 构建高效且可靠的政策以协助人类，是人机协作的必不可少。现有方法主要遵循两条工作路线。大多数先前工作依赖多智能体强化学习（MARL）来学习黑箱策略，这限制了可解释性并引发了安全隐患。最新方法在每个决策步骤查询大型语言模型（LLM），导致响应缓慢且推理成本高。我们提出了协作策略树（Co-pi-tree），这是一种闭环方法，学习由合作伙伴行为预测树和代理-动作选择树组成的可执行策略树。co-pi-tree 通过将大型语言模型推理提炼为策略树代码来构建策略。然后通过合作伙伴互动评估策略，收集反馈，并利用自然语言总结互动反馈，以改进有问题的分支。Overcooked-AI 实验显示，Co-pi 树平均奖励比基线平均值提升了 35.4%，同时将 LLM 查询数量减少了 77.7%，测试时间延迟降低了 97.1%。项目页面：此 https URL

Reinforcement Learning for Flow-Matching Policies with Density Transport

基于密度传输的流匹配策略强化学习

Authors: Boshu Lei, Kostas Daniilidis, Antonio Loquercio
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.08602
Pdf link: https://arxiv.org/pdf/2606.08602
Abstract We present an online reinforcement learning (RL) algorithm for fine-tuning flow-matching policies in continuous-control problems. Our key insight is to view RL-based policy improvement as a transport of action densities towards regions of high reward, which naturally aligns with the transport formulation of flow matching models. Prior methods either approximate the current or optimal policy distribution or resort to distillation, which introduces biased gradients or sacrifices multimodal modeling capacity. In contrast, our approach for RL with Density Transport, which we name \emph{RLDT}, constructs a transport field from a maximum-entropy RL objective using Stein Variational Gradient Descent (SVGD). Then, it finetunes a pretrained flow matching policy to align with this field. Training with this alignment objective is nontrivial because flow-matching policies generate actions via a multi-step process, making direct gradient-based optimization challenging. To overcome this challenge and stabilize training, we approximate policy actions from intermediate denoising steps via expected-target estimation. This allows the transport-field update to propagate into the network parameters without unstable backpropagation through time. Experimental results demonstrate that RLDT outperforms competitive baselines in reward quality and convergence speed. This performance holds across diverse continuous-control tasks, encompassing both dense and sparse rewards, as well as state- and vision-based long-horizon robot manipulation. The project webpage is \href{this https URL}{this https URL}.
中文摘要 我们提出了一种在线强化学习（RL）算法，用于在连续控制问题中微调流匹配（flow-matching）策略。我们的关键见解是将基于强化学习的策略改进视为把动作密度向高奖励区域的传输，这与流匹配模型的传输表述天然契合。以往的方法要么近似当前或最优策略分布，要么诉诸蒸馏，这会引入有偏梯度或牺牲多模态建模能力。相比之下，我们提出的带密度传输的强化学习方法（称为\emph{RLDT}）利用Stein变分梯度下降（SVGD），从最大熵强化学习目标构建一个传输场，然后微调预训练的流匹配策略，使其与该传输场对齐。以这一对齐目标进行训练并不简单，因为流匹配策略通过多步过程生成动作，使得直接的基于梯度的优化变得困难。为克服这一挑战并稳定训练，我们通过期望目标估计，从中间去噪步骤近似得到策略动作。这使得传输场的更新能够传播到网络参数中，而无需进行不稳定的随时间反向传播。实验结果表明，RLDT在奖励质量和收敛速度方面优于有竞争力的基线。这一性能在多种连续控制任务上均成立，涵盖密集和稀疏奖励，以及基于状态和基于视觉的长时域机器人操作。项目网页是\href{this https URL}{this https URL}。

HARBOR: A Harness Framework for Agentic Robot Reinforcement Learning

HARBOR：面向智能体化机器人强化学习的Harness框架

Authors: Zechu Li, Yufeng Jin, Xiaoyang Liu, Puze Liu, Vignesh Prasad, Carlo D'Eramo, Georgia Chalvatzaki
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.08610
Pdf link: https://arxiv.org/pdf/2606.08610
Abstract Reinforcement learning (RL) has become a powerful paradigm for robot learning, particularly in sim-to-real settings, but its broader adoption remains limited by the engineering pipeline surrounding the algorithms. Building tasks, shaping rewards, and tuning hyperparameters require substantial expert effort, making RL workflows costly and difficult to scale. We introduce HARBOR, an agentic framework that frames robot RL automation as a harness-engineering problem: given a simulator codebase and a task specification, it automates the workflow from environment setup to policy training in simulation. HARBOR decomposes such high-level objectives into bounded stages executed by specialized agents through standardized commands, persistent artifacts, executable gates, and reusable knowledge, and scales iteration via decentralized parallel trials and experience learning across runs. We evaluate HARBOR across 6 benchmarks and 16 tasks in total, spanning manipulation, locomotion, and bimanual dexterous control. We demonstrate that HARBOR automates the simulation RL workflow end-to-end, designs rewards, tunes algorithms to match or improve over default configurations, and reduces engineering effort at practical token and wall-clock cost; the resulting policies can also be transferred to real robots.
中文摘要 强化学习（RL）已成为机器人学习的强大范式，尤其是在模拟到现实的环境中，但其更广泛的应用仍受限于围绕算法的工程流程。构建任务、塑造奖励和调整超参数都需要大量专家努力，这使得强化学习工作流程成本高昂且难以扩展。我们介绍了HARBOR，一个代理框架，将机器人强化学习自动化框架为一个线束工程问题：给定一个模拟器代码库和任务规范，它自动化了从环境搭建到仿真策略训练的工作流程。HARBOR将此类高级目标分解为有界阶段，由专业代理通过标准化命令、持久工件、可执行门和可复用知识执行，并通过去中心化并行试验和跨运行的经验学习扩展迭代。我们通过6个基准和16个任务评估HARBOR，涵盖操作、移动和双手灵巧控制。我们证明了HARBOR能够端到端自动化模拟强化学习工作流程，设计奖励，调整算法以匹配或改进默认配置，并以实用的代币和墙时钟成本降低工程工作量;由此产生的策略也可以转移到真实的机器人上。

SPA: A SQL-Plan-Aware Reinforcement Learning Framework for Query Rewriting with LLMs

SPA：一个用于用LLM进行查询重写的SQL计划感知强化学习框架

Authors: Xinyi Huang, Zhengjie Miao
Subjects: Databases (cs.DB)
Arxiv link: https://arxiv.org/abs/2606.08620
Pdf link: https://arxiv.org/pdf/2606.08620
Abstract SQL query rewriting is a well-established technique for improving database performance without schema or index changes, yet finding effective rewrites for modern analytical workloads remains difficult: rule-based methods are limited to predefined transformations, while LLM-based approaches often produce rewrites that are semantically valid but compile to equivalent physical plans or degrade runtime performance. We present SPA, a SQL-Plan-Aware reinforcement learning framework that trains LLMs to rewrite queries using physical execution feedback. SPA formulates rewriting as a policy optimization problem and extends GRPO with rewards spanning semantic equivalence, textual rewrite distance, physical-plan divergence, and runtime speedup. To handle reward sparsity across query difficulty, SPA introduces Probability-Gated Adaptive Reward Shaping, a query-level curriculum that unlocks higher-level rewards only once a rollout group achieves sufficient mastery of lower-level objectives, and further improves sample efficiency through on-policy self-improvement by recycling slowdown rewrites from the current policy as targeted training signals. On both IID and OOD workloads, SPA outperforms rule-based and strong LLM baselines in end-to-end runtime, substantially reduces harmful slowdown rewrites, and yields strong tail-latency gains.
中文摘要 SQL查询重写是一种成熟的技术，可以在不改变模式或索引的情况下提升数据库性能，但为现代分析型工作负载找到有效的重写仍然困难：基于规则的方法局限于预定义的变换，而基于LLM的方法常常产生语义上有效、却会编译成等价物理计划或降低运行时性能的重写。我们提出SPA，一个SQL计划感知（SQL-Plan-Aware）的强化学习框架，它利用物理执行反馈训练LLM重写查询。SPA将重写表述为策略优化问题，并用涵盖语义等价性、文本重写距离、物理计划差异和运行时加速的奖励扩展GRPO。为应对不同查询难度下的奖励稀疏性，SPA引入了概率门控自适应奖励塑形（Probability-Gated Adaptive Reward Shaping），这是一种查询级课程，只有当一个rollout组充分掌握较低层级的目标后才解锁更高层级的奖励；SPA还通过on-policy自我改进进一步提升样本效率，即把当前策略产生的变慢重写回收为有针对性的训练信号。在IID和OOD工作负载上，SPA在端到端运行时间上均优于基于规则的方法和强LLM基线，大幅减少了有害的变慢重写，并带来显著的尾延迟改善。
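
The query-level curriculum described in the SPA abstract above, where higher-level rewards unlock only after a rollout group masters the lower-level ones, can be sketched in a few lines of Python. The four level names follow the reward components listed in the abstract, but their ordering as a strict ladder, the boolean pass/fail encoding, the unit reward per level, and the 0.5 pass-rate threshold are assumptions for illustration, not the paper's definition.

```python
# Hypothetical reward ladder, ordered from lowest to highest level.
LEVELS = ["equivalent", "text_changed", "plan_changed", "faster"]


def unlocked_levels(group, threshold=0.5):
    """Return the levels that are active for one query's rollout group.

    group: non-empty list of dicts mapping each level name to a bool.
    A level unlocks only if the level below it has pass rate >= threshold.
    """
    unlocked = [LEVELS[0]]
    for lower, higher in zip(LEVELS, LEVELS[1:]):
        pass_rate = sum(1 for rollout in group if rollout[lower]) / len(group)
        if pass_rate < threshold:
            break
        unlocked.append(higher)
    return unlocked


def shaped_rewards(group, threshold=0.5):
    """Score each rollout on the unlocked levels only."""
    active = unlocked_levels(group, threshold)
    rewards = []
    for rollout in group:
        total = 0.0
        for level in active:
            if not rollout[level]:
                break  # credit stops at the first failed level
            total += 1.0
        rewards.append(total)
    return rewards


if __name__ == "__main__":
    group = [
        {"equivalent": True, "text_changed": True, "plan_changed": False, "faster": False},
        {"equivalent": True, "text_changed": False, "plan_changed": False, "faster": False},
        {"equivalent": False, "text_changed": False, "plan_changed": False, "faster": False},
        {"equivalent": True, "text_changed": True, "plan_changed": True, "faster": True},
    ]
    print(unlocked_levels(group), shaped_rewards(group))
```

In the example group the speedup level stays locked because only one of four rollouts changes the physical plan, so the fastest rollout is not yet rewarded for runtime. That is the kind of sparsity control the abstract attributes to the curriculum.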

From Holistic Evaluation to Structured Criteria: Rubrics Across the Evolving LLM Landscape

从整体评估到结构化标准：在不断演变的LLM领域中的评分标准

Authors: Hao Chen, Ziyu Han, Yukun Yan, Qingfu Zhu, Maosong Sun, Wanxiang Che
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.08625
Pdf link: https://arxiv.org/pdf/2606.08625
Abstract As Large Language Models (LLMs) advance toward open-ended autonomous agents, the mechanisms used to evaluate and guide their behavior must evolve accordingly. This work introduces the rubric as a unifying framework capturing this evolution, characterizing rubrics as a dynamic response to successive LLM paradigm shifts that recurs across otherwise independent efforts in evaluation, reinforcement learning, and safety alignment. We define rubrics as explicit criteria sets that transform complex quality judgments into structured and actionable standards, and demonstrate that their recurrence across these research threads is not coincidental. We systematically organize existing rubric designs, examine their construction and optimization, and analyze their role across evaluation and training. Rubrics manifest at three progressively deeper levels: at the evaluative level, they decompose holistic judgments into verifiable dimensions; at the training level, they serve as dense feedback signals providing process-level guidance where scalar rewards fall short; at the intrinsic level, they emerge dynamically from model behaviors, driving self-improvement. We further assess rubric reliability across generation quality, execution fidelity, theoretical constraints, and security threats, before surveying rubric-based benchmarks across diverse domains. By rendering assessment transparent and decomposable, rubrics translate human value expectations into machine-learnable signals, serving as the enduring bridge between human intentions and machine behavior.
中文摘要 随着大型语言模型（LLMs）向开放式自主智能体发展，评估和指导其行为的机制也必须相应演进。本研究引入了评分标准作为一个统一框架，捕捉这一演变，将评分标准描述为对连续大型语言模型范式转变的动态响应，这种反应在评估、强化学习和安全对齐等本应独立的努力中反复出现。我们将评分标准定义为明确的标准集，将复杂的质量判断转化为结构化且可操作的标准，并证明这些研究线索中的反复出现并非巧合。我们系统地组织现有的评分标准设计，审视其构建与优化，并分析其在评估和培训中的作用。评分标准在三个逐步更深层次的层面体现：在评估层面，它们将整体判断分解为可验证的维度;在训练层面，它们作为密集的反馈信号，在标量奖励不足时提供过程层面的指导;在内在层面，它们动态地从模型行为中浮现，推动自我提升。我们进一步评估了评分标准在生成质量、执行忠实度、理论约束和安全威胁等方面的可靠性，然后对基于评分标准的基准进行了跨多个领域的调查。通过使评估透明且可分解，评分标准将人类价值期望转化为机器学习信号，成为人类意图与机器行为之间的持久桥梁。

Towards Long-Horizon Vessel Trajectory and Destination Forecasting with Reasoning Large Language Models

迈向基于推理大型语言模型的长时域船舶轨迹与目的地预测

Authors: Hongwei Wang, Miao Zhou, Fengde Wang, Yuting Wang, Jiewen Yu, Jun-Yan He, Bohao Qu, Wanbing Zhang, Xiuju Fu, Qing Guo, Zipei Fan, Yingying Xing, Yi Yuan
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.08633
Pdf link: https://arxiv.org/pdf/2606.08633
Abstract Long-horizon maritime trajectory prediction is important for shipping management, logistics planning, and maritime risk analysis, yet month-level forecasting remains insufficiently studied. Existing deep learning methods mainly focus on short- and mid-term coordinate extrapolation and often struggle to preserve route feasibility and destination correctness over extended horizons. This paper investigates joint long-horizon vessel trajectory and destination forecasting with reasoning-capable large language models, and develops a Maritime LLM post-training framework based on Reinforcement Learning with Verifiable Reward (RLVR). An AIS-based benchmark is constructed with 60-day historical trajectories and 30-day forecasting horizons, where trajectories are converted into semantic textual representations for RL prompt construction. RLVR aligns LLMs with maritime forecasting objectives by enforcing physical validity, providing early-weighted trajectory supervision, and evaluating destination correctness through hierarchical matching and curriculum learning. Experimental results show that RLVR-trained LLMs substantially improve over zero-shot LLMs and representative deep learning baselines, especially on destination-related metrics. Among the evaluated RLVR-trained variants, 4B LLMs achieve the best overall performance, suggesting that reward-compatible optimization and task-specific capacity matching are more important than simply using larger 8B or 14B LLMs. The results also show that LSTM remains a strong deep learning baseline under limited fine-tuning data, while Transformer-style spatio-temporal models typically require larger datasets and richer structured inputs. Overall, this work advances semantic, verifier-aligned maritime forecasting for operational decision support.
中文摘要 长期海事轨迹预测对于航运管理、物流规划和海事风险分析至关重要，但月级预测的研究仍然不足。现有的深度学习方法主要侧重于短期和中期的坐标外推，且常常难以在较长的视野中保持路线可行性和目的地正确性。本文研究了联合长视野船舶轨迹与目的地预测，并基于可验证奖励强化学习（RLVR）开发了海事大型语言模型的后期训练框架。基于AIS构建了一个基准，包含60天历史轨迹和30天预测视野，轨迹被转换为用于强化学习提示构建的语义文本表示。RLVR通过强制物理有效性、提供早期加权轨迹监督，以及通过层级匹配和课程学习评估目的地正确性，使LLM与海洋预报目标保持一致。实验结果显示，RLVR训练的LLM相比零样本LLM和代表性深度学习基线，尤其是在目标相关指标上，表现显著优越。在评估的RLVR训练变体中，4B大型语言模型整体表现最佳，表明奖励兼容的优化和任务特定容量匹配比单纯使用更大的8B或14B大型语言模型更为重要。结果还表明，在有限的微调数据下，LSTM依然是一个强有力的深度学习基线，而Transformer风格的时空模型通常需要更大的数据集和更丰富的结构化输入。总体而言，这项工作推动了语义化、验证者对齐的海上预报，以支持运营决策。

PRPO: Perception-Reinforced Policy Optimization via Token-Level Dynamic Advantage Reshaping

PRPO：通过词元级动态优势重塑实现感知强化的策略优化

Authors: Qiming Li, Tianlun Li, Xiaolong Cheng, Hangyu Li, Ruiyan Gong, Kangning Niu, Kaitao Jiang, Mu Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.08708
Pdf link: https://arxiv.org/pdf/2606.08708
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has become an effective paradigm for improving the reasoning capability of Large Vision-Language Models (LVLMs). However, existing RLVR methods primarily rely on trajectory-level outcome rewards, which assign identical learning signals across all generated tokens. This coarse-grained credit assignment is fundamentally mismatched to multimodal reasoning, where only a sparse subset of tokens is causally grounded in visual evidence. Consequently, these pivotal perceptual tokens receive weak supervision and are often overwhelmed by language priors or reasoning-template tokens. To address this limitation, we propose Perception-Reinforced Policy Optimization (PRPO), a token-level reinforcement learning framework that explicitly identifies and reinforces pivotal perceptual tokens within long-horizon multimodal reasoning trajectories. PRPO introduces Robust Visual Dependency (RVD), a principled metric that identifies tokens whose predictions are both visually grounded and perturbation-stable, filtering out brittle or noisy visual tokens. Based on RVD, we further propose Perceptual Advantage Reshaping (PAR), a token-level credit assignment technique that amplifies perceptually informative tokens while preserving stable gradients for non-perceptual tokens. Extensive experiments on seven multimodal reasoning benchmarks demonstrate that PRPO consistently outperforms strong LVLM baselines across both 3B and 7B model scales, achieving average gains of 23.3% and 21.1%, respectively. PRPO achieves state-of-the-art performance with improved training efficiency and stronger cross-task generalization. Our findings highlight the importance of fine-grained credit assignment for scalable multimodal reinforcement learning.
中文摘要 带可验证奖励的强化学习（RLVR）已成为提升大型视觉语言模型（LVLM）推理能力的有效范式。然而，现有的RLVR方法主要依赖轨迹级的结果奖励，为所有生成的词元（token）分配相同的学习信号。这种粗粒度的信用分配与多模态推理根本不匹配，因为在多模态推理中，只有稀疏的一小部分词元在因果上以视觉证据为依据。因此，这些关键的感知词元受到的监督较弱，且常被语言先验或推理模板词元淹没。为解决这一局限，我们提出了感知强化策略优化（PRPO），这是一种词元级强化学习框架，能够在长程多模态推理轨迹中显式识别并强化关键感知词元。PRPO引入了鲁棒视觉依赖性（RVD），这是一种有原则的度量，用于识别预测既有视觉依据又对扰动稳定的词元，从而过滤掉脆弱或含噪的视觉词元。基于RVD，我们进一步提出了感知优势重塑（Perceptual Advantage Reshaping，PAR），这是一种词元级信用分配技术，能够放大感知信息丰富的词元，同时为非感知词元保留稳定的梯度。在七个多模态推理基准上的大量实验表明，PRPO在3B和7B两种模型规模上均持续优于强LVLM基线，平均提升分别为23.3%和21.1%。PRPO以更高的训练效率和更强的跨任务泛化能力实现了最先进的性能。我们的发现凸显了细粒度信用分配对可扩展多模态强化学习的重要性。
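
To make the token-level credit assignment in the PRPO abstract above more concrete, here is a minimal Python sketch of the two steps it names: scoring tokens for visual dependency that is also stable under perturbation, then reshaping a trajectory-level advantage per token. The abstract does not define RVD or PAR numerically, so the log-probability gap as the grounding signal, the absolute-difference stability test, the tolerance, the scaling rule, and all function names are assumptions for illustration.

```python
def robust_visual_dependency(lp_image, lp_blank, lp_perturbed, stability_tol=0.5):
    """Per-token score: positive only if the token is grounded and stable.

    lp_image:     token log-probs given the real image
    lp_blank:     token log-probs with the image removed (hypothetical probe)
    lp_perturbed: token log-probs given a lightly perturbed image (hypothetical probe)
    """
    scores = []
    for with_img, without_img, perturbed in zip(lp_image, lp_blank, lp_perturbed):
        grounded = max(0.0, with_img - without_img)  # the image helps predict the token
        stable = abs(with_img - perturbed) <= stability_tol
        scores.append(grounded if stable else 0.0)
    return scores


def reshape_advantages(advantage, rvd_scores, alpha=1.0):
    """Scale up the shared advantage on high-scoring tokens; leave others unchanged."""
    peak = max(rvd_scores, default=0.0)
    if peak == 0.0:
        return [advantage for _ in rvd_scores]
    return [advantage * (1.0 + alpha * score / peak) for score in rvd_scores]


if __name__ == "__main__":
    lp_image = [-0.1, -0.5, -1.0]
    lp_blank = [-0.1, -2.5, -3.0]
    lp_perturbed = [-0.1, -0.6, -2.5]
    scores = robust_visual_dependency(lp_image, lp_blank, lp_perturbed)
    print(scores, reshape_advantages(1.0, scores))
```

Tokens with a zero score keep the original advantage, which reflects the abstract's point that non-perceptual tokens should retain stable gradients rather than be suppressed.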

Structure-Conditioned Actor-Critic Branches for Quality-Diversity Reinforcement Learning

面向质量多样性强化学习的结构条件化演员-评论家分支

Authors: Lianrong Zuo, Peilan Xu, Yong Liu, Wenjian Luo
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.08735
Pdf link: https://arxiv.org/pdf/2606.08735
Abstract Quality-diversity reinforcement learning (QD-RL) aims to construct policy repertoires that contain both high-performing and behaviorally diverse policies. Existing QD-RL methods mainly diversify policy instances after rollout evaluation or use learned value information to improve policy quality and behavior targeting, while the learning branches that generate candidate policies remain less explored. This paper proposes SV-QD-RL, a structure-value coupled framework that represents each candidate as a structure-conditioned actor-critic branch. Each branch contains an actor, a structural mask, a branch-specific critic, a replay state, and evaluation attributes including behavior, return, sparsity, and value profile. The structural mask defines the actor subspace in which the branch learns, while the branch-specific critic and replay state shape its value-learning trajectory. A branch-aware QD archive then evaluates and retains branches according to behavioral quality, structural footprint, and value-profile information. Experiments on MuJoCo continuous-control tasks show that SV-QD-RL constructs policy repertoires with strong archive quality and behaviorally useful diversity. Ablation and diagnostic analyses further indicate that structural conditioning, critic differentiation, and memory-consistent refinement make complementary contributions to behavioral specialization. Schedule-aware repertoire evaluation shows that the learned archive provides selectable policy alternatives under changing behavior-level requirements. These results suggest that coupling actor structure with branch-specific value learning is an effective mechanism for generating diverse QD-RL policy repertoires.
中文摘要 质量多样性强化学习（QD-RL）旨在构建既包含高性能策略又具有行为多样性的策略库。现有的QD-RL方法主要在rollout评估之后使策略实例多样化，或利用学到的价值信息来提升策略质量和行为定向，而生成候选策略的学习分支则较少被探索。本文提出了SV-QD-RL，一种结构-价值耦合框架，将每个候选表示为一个结构条件化的演员-评论家（actor-critic）分支。每个分支包含一个演员、一个结构掩码、一个分支专属的评论家、一个回放状态，以及包括行为、回报、稀疏度和价值画像在内的评估属性。结构掩码定义了该分支进行学习的演员子空间，而分支专属的评论家和回放状态则塑造其价值学习轨迹。随后，分支感知的QD存档根据行为质量、结构足迹和价值画像信息对分支进行评估和保留。在MuJoCo连续控制任务上的实验表明，SV-QD-RL构建的策略库具有较强的存档质量和在行为上有用的多样性。消融和诊断分析进一步表明，结构条件化、评论家差异化和记忆一致的细化对行为专门化具有互补的贡献。调度感知的策略库评估表明，学到的存档能够在行为层面需求变化时提供可供选择的策略备选。这些结果表明，将演员结构与分支专属的价值学习相耦合，是生成多样化QD-RL策略库的有效机制。

Guided Discovery of New Behaviors using Diffusion Policies

利用扩散策略引导发现新行为

Authors: Dian Yu, Sebastian Sanokowski, Majid Khadiv
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.08743
Pdf link: https://arxiv.org/pdf/2606.08743
Abstract Diffusion models have become a powerful tool for generative modeling in robotics, with diffusion policies excelling at modeling multimodal action-trajectory distributions. However, when demonstrations are limited, standard sampling often reproduces dominant behaviors while neglecting valid but rare modes, limiting the discovery of novel solutions. Existing approaches, such as guidance methods or combining reinforcement learning with diffusion, either push samples into infeasible regions or struggle to escape local minima, failing to systematically uncover diverse behaviors. To address these challenges, we propose a framework that combines Feynman-Kac correctors with a novel guiding potential that systematically guides diffusion policy samples towards promising yet underrepresented samples. These trajectories are refined using sampling-based trajectory optimization and reincorporated into the training set to retrain the diffusion policy. Our method effectively mines and repairs novel trajectories, enabling the systematic discovery of diverse and executable behaviors. We demonstrate the effectiveness of our framework across a range of manipulation environments, consistently discovering new behaviors.
中文摘要 扩散模型已成为机器人领域生成式建模的强大工具，其中扩散策略擅长建模多模态的动作-轨迹分布。然而，当演示数据有限时，标准采样往往只重现主导行为，而忽略有效但罕见的模式，从而限制了新解的发现。现有方法（如引导方法，或将强化学习与扩散相结合）要么将样本推入不可行区域，要么难以跳出局部极小值，无法系统地发现多样化的行为。为应对这些挑战，我们提出了一个框架，将Feynman-Kac校正器与一种新颖的引导势函数相结合，系统地将扩散策略的样本引向有前景但代表性不足的样本。这些轨迹经基于采样的轨迹优化加以细化，并重新加入训练集以重新训练扩散策略。我们的方法能够有效挖掘并修复新颖的轨迹，从而系统地发现多样且可执行的行为。我们在一系列操作（manipulation）环境中验证了该框架的有效性，它能够持续发现新的行为。

Co-Evolving Skill Generation and Policy Optimization

技能生成与策略优化的协同进化

Authors: Zhiwei Zhang, Yudi Lin, Nikki Lijing Kuang, Linlin Wu, Xiaomin Li, Songtao Liu, Fenglong Ma
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.08755
Pdf link: https://arxiv.org/pdf/2606.08755
Abstract Skill-augmented reinforcement learning improves language agents by storing reusable procedural knowledge acquired from past experience. Existing methods typically use strong language models to analyze trajectories, generate skills, and update a retrievable skill bank during online training. However, they rarely assess whether a newly generated skill is useful before it is stored and reused. We find that this assumption is unreliable: even skills generated by proprietary frontier LLMs exhibit highly mixed utility, with many providing little benefit or even degrading performance. Once such skills enter the bank, their effects are difficult to identify, because subsequent rollout feedback is delayed and usually reflects the combined effect of multiple retrieved skills rather than the marginal contribution of any individual skill. We propose an online reinforcement learning framework for pre-storage skill validation. The framework estimates whether a candidate skill contributes useful information beyond the skills already retrieved for the current task. It uses the standard rollout budget to form two matched groups under the same task and retrieval context: base rollouts conditioned on the currently retrieved skills, and skill-augmented rollouts conditioned on the same skills plus one candidate skill induced from the base trajectories. The reward gap between these two groups estimates the candidate skill's context-dependent marginal utility, enabling the framework to promote useful skills while filtering ineffective or harmful ones without additional rollout overhead. The framework further uses this marginal-utility signal to train the policy itself as a skill generator, reducing reliance on repeated calls to proprietary models. The learned skill-generation likelihood serves as a context-dependent score for retrieval-time reranking and outdated-skill pruning as the policy evolves.
中文摘要 技能增强的强化学习通过存储从过往经验中获得的可复用过程性知识来改进语言智能体。现有方法通常使用强大的语言模型来分析轨迹、生成技能，并在在线训练过程中更新可检索的技能库。然而，这些方法很少在新生成的技能被存储和复用之前评估其是否有用。我们发现这种假设并不可靠：即使是由专有前沿LLM生成的技能，其效用也参差不齐，许多技能几乎没有益处，甚至会降低性能。一旦这类技能进入技能库，其影响就难以识别，因为后续的rollout反馈存在延迟，且通常反映的是多个被检索技能的综合效应，而非任何单个技能的边际贡献。我们提出了一种用于存储前技能验证的在线强化学习框架。该框架估计候选技能是否在当前任务已检索的技能之外提供了有用信息。它利用标准的rollout预算，在相同任务和检索上下文下构造两个匹配的组：以当前已检索技能为条件的基础rollout，以及以相同技能外加一个由基础轨迹归纳出的候选技能为条件的技能增强rollout。两组之间的奖励差距估计了候选技能依赖于上下文的边际效用，使该框架能够在不增加额外rollout开销的情况下提升有用技能，同时过滤无效或有害的技能。该框架还利用这一边际效用信号将策略本身训练为技能生成器，从而减少对反复调用专有模型的依赖。随着策略的演化，学到的技能生成似然可作为依赖于上下文的评分，用于检索时的重排序和过时技能的剪枝。
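
The matched-group comparison described in this abstract can be illustrated with a minimal Python sketch. It assumes scalar rollout rewards and a simple mean-difference estimator; the function names and the `threshold` parameter are hypothetical and not taken from the paper.

```python
from statistics import mean

def marginal_utility(base_rewards, augmented_rewards):
    """Reward gap between skill-augmented and base rollouts (same task, same retrieved skills)."""
    return mean(augmented_rewards) - mean(base_rewards)

def should_store(base_rewards, augmented_rewards, threshold=0.0):
    """Hypothetical gate: keep the candidate skill only if its estimated gap exceeds the threshold."""
    return marginal_utility(base_rewards, augmented_rewards) > threshold
```

Because both groups share the task and retrieval context, the gap isolates the candidate skill's contribution in that context rather than the combined effect of all retrieved skills.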

Reformulate LLM Reinforcement Learning for Efficient Training under Black-box Discrepancy

面向黑箱差异下高效训练的LLM强化学习重新表述

Authors: Jiashun Liu, Runze Liu, Xu Wan, Jing Liang, Hongyao Tang, Ling Pan
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.08779
Pdf link: https://arxiv.org/pdf/2606.08779
Abstract Reinforcement Learning (RL) has emerged as a pivotal post-training paradigm, yet it frequently suffers from unpredictable suboptimal performance or even training collapse. Recent findings attribute these failures to a hidden train-inference discrepancy (or mismatch) stemming from the disparate underlying engines and architectures. We find that the training policy can actively self-correct such a discrepancy when provided with an appropriate learning signal. We further empirically identify a discrepancy tolerance region: within this region, aggressively narrowing the discrepancy can suppress policy exploration and reduce learning efficiency, whereas outside this region, reducing excessive discrepancy improves optimization consistency and raises the achievable local performance ceiling. Based on these findings, we formulate this problem as a Discrepancy-Constrained Markov Decision Process (DCMDP), where reward maximization is coupled with a constraint that aligns training-inference behavior, achieving stable dual-objective optimization. To adaptively balance performance improvement and discrepancy control, we introduce a Lagrangian relaxation mechanism that dynamically adjusts the relative weight of the two objectives according to the current degree of discrepancy violation. This enables stable dual-objective optimization: the policy is allowed to explore freely within the tolerance region, while being guided back when the discrepancy exceeds the safe boundary. Empirically, DCMDP significantly improves the performance of an 8B dense model (Qwen-3-8b) and a 30B Mixture-of-Experts model (Qwen-3-30bA3b), and enables a heterogeneous training paradigm, where LLMs can be optimized in a high-fidelity training setup while being explicitly aligned for low-cost, resource-constrained inference deployment.
中文摘要 强化学习（RL）已成为关键的后训练范式，但它经常出现不可预测的次优性能，甚至训练崩溃。近期研究将这些失败归因于隐藏的训练-推理差异（或不匹配），其根源在于底层引擎和架构的不同。我们发现，在获得适当的学习信号时，训练策略能够主动自我纠正这种差异。随后，我们通过实验进一步识别出一个差异容忍区域：在该区域内，激进地缩小差异会抑制策略探索并降低学习效率；而在该区域外，减少过大的差异可以提升优化一致性，并提高可达到的局部性能上限。基于这些发现，我们将该问题表述为差异约束马尔可夫决策过程（DCMDP），其中奖励最大化与一个对齐训练-推理行为的约束相耦合，实现稳定的双目标优化。为了自适应地平衡性能提升与差异控制，我们引入了一种拉格朗日松弛机制，根据当前差异违反约束的程度动态调整两个目标的相对权重。这实现了稳定的双目标优化：策略可以在容忍区域内自由探索，而在差异超出安全边界时被引导回来。实验上，DCMDP显著提升了8B稠密模型（Qwen-3-8b）和30B混合专家模型（Qwen-3-30bA3b）的性能，并支持一种异构训练范式，使LLM可以在高保真的训练环境中优化，同时显式对齐到低成本、资源受限的推理部署。
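
The Lagrangian balancing described in this abstract can be sketched with a textbook dual-ascent update. This is a generic constrained-RL illustration under assumptions, not the paper's exact rule: `tolerance` stands in for the boundary of the tolerance region, `lr` is a hypothetical multiplier step size, and `discrepancy` is any scalar train-inference mismatch measure.

```python
def lagrangian_step(reward, discrepancy, lam, tolerance=0.05, lr=0.5):
    """One step of a generic Lagrangian relaxation for a discrepancy constraint.

    Returns the penalized objective for the policy update and the updated multiplier.
    """
    violation = discrepancy - tolerance          # > 0 only outside the tolerance region
    objective = reward - lam * violation         # reward traded off against the constraint
    new_lam = max(0.0, lam + lr * violation)     # multiplier grows on violation, decays to 0 otherwise
    return objective, new_lam
```

With the multiplier at zero inside the tolerance region, the policy optimizes reward alone; the penalty only becomes active once the discrepancy crosses the boundary.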

Momentum for Reasoning: Dense Intrinsic Signals in Policy Optimization

推理的动量：策略优化中的稠密内在信号

Authors: Hao Chen, Zhanming Shen, Liyao Li, Yanyu Chen, Xuhang Zhu, Xiaomeng Hu, Qi Zhang, Ru Peng, Xiaoyu Shen, Haobo Wang, Junbo Zhao
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.08815
Pdf link: https://arxiv.org/pdf/2606.08815
Abstract Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for eliciting long-chain reasoning in large language models. However, existing methods based on Group Relative Policy Optimization (GRPO) rely on a binary outcome reward, which induces two structural failure modes: Zero-Advantage Collapse, in which all rollouts in a group share the same outcome and the gradient vanishes, and Hallucinated Certainty, in which the model becomes increasingly confident on incorrect rollouts late in training. We address both modes by densifying the reward with intrinsic signals computed entirely from the policy's own conditional probabilities, and propose ISPO (Intrinsic Signal Policy Optimization), which combines a sequence-level signal measuring how informative the thinking trajectory is for the final answer with a token-level directional reward whose hallucinated-certainty hinge penalizes confidently wrong predictions at critical decision tokens. Across three base models and five mathematical reasoning benchmarks, ISPO consistently outperforms competitive baselines, with the largest gains on the hardest benchmarks where zero-advantage collapse is most frequent, and training-dynamics diagnostics confirm that both failure modes are reduced.
中文摘要 带有可验证奖励的强化学习（RLVR）已成为在大型语言模型中激发长链推理的强大范式。然而，基于组相对策略优化（GRPO）的现有方法依赖于二元结果奖励，这会导致两种结构性失效模式：零优势崩溃，即同一组内所有rollout的结果相同，梯度因而消失；以及幻觉式确信，即模型在训练后期对错误的rollout越来越自信。我们利用完全由策略自身的条件概率计算得到的内在信号使奖励稠密化，从而同时应对这两种模式，并提出了ISPO（内在信号策略优化）。它将一个序列级信号（衡量思考轨迹对最终答案的信息量）与一个词元级方向性奖励相结合，后者的幻觉确信铰链项会惩罚关键决策词元上自信但错误的预测。在三个基础模型和五个数学推理基准上，ISPO持续优于有竞争力的基线，并且在零优势崩溃最频繁的最难基准上提升最大；训练动态诊断证实两种失效模式均有所减轻。
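
The Zero-Advantage Collapse named in this abstract follows directly from group-relative normalization. The sketch below uses the commonly cited GRPO form (reward minus group mean, divided by group standard deviation); the `eps` constant is an assumption added for numerical safety, and ISPO's intrinsic signals are not modeled here.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one rollout group, as in GRPO-style methods."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# All rollouts correct (or all wrong): every advantage is zero, so the gradient vanishes.
collapsed = group_relative_advantages([1, 1, 1, 1])
# Mixed outcomes: the single correct rollout gets a positive advantage.
mixed = group_relative_advantages([1, 0, 0, 0])
```

A binary reward makes identical-outcome groups common on very easy and very hard prompts, which is where a denser signal would matter most.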

Knowledge Graphs and Reasoning LLMs for Finding Simple Yet Effective Transcriptomic Perturbation Predictors

利用知识图谱与推理型LLM寻找简单而有效的转录组扰动预测器

Authors: Jake Fawkes, Liam Hodgson, Jason Hartford
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.08816
Pdf link: https://arxiv.org/pdf/2606.08816
Abstract Predicting the effect of an unseen gene knockout perturbation on transcriptomic gene expression remains a highly challenging problem for virtual cell models. Recent progress has been made by leveraging biological knowledge graphs to provide a notion of similar perturbation, allowing for improved extrapolation beyond the set of training perturbations. In this work, we demonstrate that the simplest model to leverage these assumptions - a K-nearest neighbour from the knowledge graph - achieves highly competitive performance on this task, and that this can be improved further using LLMs optimised via reinforcement learning (RL) for predictive performance. Specifically, we find that the K-nearest neighbour approach beats almost all methods on out-of-distribution perturbation prediction, and when a reasoning LLM is trained via RL to make changes to the neighbourhood, it obtains equivalent performance to current state of the art methods on the cell lines from Replogle et al. (2022). We also demonstrate that the RL training improves the LLM's performance on the downstream task of differential expression prediction, despite not being trained on this directly. Overall, these findings demonstrate the efficacy of knowledge graphs as model priors, and show early signs that RL can refine LLMs into generalizable tools for predicting complex biological responses.
中文摘要 预测未见过的基因敲除扰动对转录组基因表达的影响，仍然是虚拟细胞模型面临的极具挑战性的问题。近期的进展来自利用生物知识图谱来定义扰动之间的相似性，从而改进对训练扰动集之外的外推。在本研究中，我们证明，利用这些假设的最简单模型（基于知识图谱的K近邻）在该任务上就能取得极具竞争力的性能，并且借助经强化学习（RL）针对预测性能优化的LLM可以进一步提升。具体而言，我们发现K近邻方法在分布外扰动预测上优于几乎所有方法；当通过RL训练推理型LLM对邻域进行修改时，它在Replogle等人（2022）的细胞系上取得了与当前最先进方法相当的性能。我们还证明，尽管没有直接针对差异表达预测这一下游任务进行训练，RL训练仍提升了LLM在该任务上的表现。总体而言，这些发现证明了知识图谱作为模型先验的有效性，并初步表明RL能够将LLM精炼为预测复杂生物响应的可泛化工具。
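
A K-nearest-neighbour predictor of the kind this abstract describes can be sketched in a few lines. The sketch assumes each gene has a knowledge-graph embedding vector and that similarity is cosine similarity; both are illustrative choices, and the abstract does not specify how neighbours are actually defined.

```python
import numpy as np

def knn_predict(query_emb, train_embs, train_effects, k=2):
    """Predict an unseen perturbation's expression effect as the mean effect of its k nearest
    training perturbations, with nearness measured by cosine similarity of graph embeddings."""
    query_emb = np.asarray(query_emb, dtype=float)
    train_embs = np.asarray(train_embs, dtype=float)
    train_effects = np.asarray(train_effects, dtype=float)
    q = query_emb / np.linalg.norm(query_emb)
    t = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    nearest = np.argsort(-(t @ q))[:k]           # indices of the k most similar perturbations
    return train_effects[nearest].mean(axis=0)
```

In this framing, an RL-trained LLM that edits the neighbourhood would change which indices end up in `nearest` before the averaging step.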

sGPO: Trading Inference FLOPs for Training Efficiency in RLVR

sGPO：在RLVR中以推理FLOPs换取训练效率

Authors: Shivchander Sudalairaj, Kai Xu, Akash Srivastava, Giorgio Giannone
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2606.08854
Pdf link: https://arxiv.org/pdf/2606.08854
Abstract Standard Reinforcement Learning with Verifiable Rewards (RLVR) training allocates a fixed rollout budget to every query, without regard for what each query's difficulty means for the current policy. This leads to two symmetric failure modes: easy queries produce near-zero advantage because the policy already solves them, while unsolvable queries produce no signal because the policy never solves them. Both regimes waste training FLOPs without contributing to a learning gradient. We introduce sorted Group Policy Optimization (sGPO), a compute-efficient strategy that trades a small budget of inference FLOPs for a large reduction in wasted training FLOPs. The key insight is that cheap inference compute can serve as a single offline proxy for query difficulty. By generating a small batch of parallel samples per query under the initial policy, we obtain a model-aware empirical success rate. This motivates setting the training rollout group size to the inverse of this success rate, a practical rule that maximizes sample efficiency by extracting the most advantage per generated rollout. This single profiling pass simultaneously drives data filtering (removing trivial queries and sub-sampling unsolvable ones), adaptive group size allocation, and curriculum construction (scheduling queries from easy to hard). sGPO matches or exceeds baseline performance while reducing total training compute by a factor of three, with the upfront inference profiling cost included.
中文摘要 标准的带可验证奖励的强化学习（RLVR）训练为每个查询分配固定的rollout预算，而不考虑每个查询的难度对当前策略意味着什么。这导致两种对称的失效模式：简单查询产生接近零的优势，因为策略已经能够解决它们；而无法解决的查询不产生任何信号，因为策略从未解出它们。这两种情形都会浪费训练FLOPs，却对学习梯度没有贡献。我们提出了排序组策略优化（sGPO），这是一种计算高效的策略，以少量推理FLOPs预算换取被浪费的训练FLOPs的大幅减少。其核心洞见是，廉价的推理计算可以作为查询难度的单一离线代理。通过在初始策略下为每个查询生成一小批并行样本，我们得到与模型相关的经验成功率。这启发我们将训练时的rollout组大小设为该成功率的倒数，这一实用规则通过从每个生成的rollout中提取最多的优势来最大化样本效率。这一次性的剖析同时驱动数据过滤（移除平凡查询并对无法解决的查询进行下采样）、自适应组大小分配和课程构建（按从易到难的顺序调度查询）。在计入前期推理剖析成本的情况下，sGPO在将总训练计算量降至原来的三分之一的同时，达到或超过了基线性能。
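
The profiling pass in this abstract drives three decisions: filtering, group sizing, and curriculum order. The sketch below shows one way they could fit together. The inverse-success-rate rule comes from the abstract; the clamping bounds, the sub-sampling rate for unsolved queries, and the function name are hypothetical choices.

```python
import math

def plan_rollouts(success_rates, min_group=2, max_group=16, keep_unsolved_every=4):
    """Map profiled success rates {query_id: p} to an ordered list of (query_id, group_size)."""
    plan, unsolved_seen = [], 0
    for query_id, p in success_rates.items():
        if p >= 1.0:
            continue                                   # trivial query: filtered out
        if p <= 0.0:
            unsolved_seen += 1
            if unsolved_seen % keep_unsolved_every != 0:
                continue                               # sub-sample unsolved queries
            group = max_group                          # 1/p is undefined, so use the cap
        else:
            group = min(max_group, max(min_group, math.ceil(1.0 / p)))
        plan.append((query_id, p, group))
    plan.sort(key=lambda item: -item[1])               # curriculum: easy (high p) first
    return [(q, g) for q, _, g in plan]
```

For example, a query solved in 1 of 8 profiling samples gets a group of 8 rollouts, so roughly one success per group is expected and the group-relative advantage is unlikely to vanish.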

Multilingual Sentiment Aware Text Summarization A Reinforcement Learning Approach for Consistency Maintenance

多语言情感感知文本摘要：一种用于一致性维护的强化学习方法

Authors: Mikhail Krasitskii, Alexander Gelbukh, Olga Kolesnikova, Grigori Sidorov
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.08940
Pdf link: https://arxiv.org/pdf/2606.08940
Abstract Reinforcement Learning from Human Feedback (RLHF) has significantly improved the quality and fluency of large language models in text summarization. However, its impact on affective properties remains insufficiently understood. In this work, we study sentiment drift, a systematic shift toward neutral sentiment in RLHF-based summarization outputs compared to source texts. We conduct extensive experiments across multiple datasets, model architectures, and eight languages to analyze how alignment objectives influence sentiment preservation. Our results show that sentiment drift is a consistent phenomenon that becomes stronger with increased KL regularization strength, indicating a trade-off between alignment stability and affective fidelity. To explain this behavior, we introduce a Policy Attribution framework that decomposes the RLHF objective and quantifies the contribution of its components. Our analysis reveals that KL regularization is the primary driver of sentiment suppression across all settings. Based on these findings, we propose a sentiment-aware modification of the KL regularization term, which selectively reduces constraints on sentiment-bearing tokens. Empirical results demonstrate that this approach mitigates sentiment drift while maintaining summarization quality. Overall, our findings highlight a fundamental limitation of current alignment methods: while they improve factual consistency and safety, they may unintentionally suppress emotional expressiveness. This motivates the development of alignment strategies that explicitly account for affective preservation.
中文摘要 基于人类反馈的强化学习（RLHF）显著提升了大型语言模型在文本摘要中的质量和流畅度。然而，其对情感属性的影响仍未得到充分理解。在本研究中，我们研究情感漂移，即基于RLHF的摘要输出相较于源文本系统性地偏向中性情感的现象。我们在多个数据集、模型架构和八种语言上进行了大量实验，分析对齐目标如何影响情感保持。结果表明，情感漂移是一种一致存在的现象，并随KL正则化强度的增大而增强，这表明对齐稳定性与情感保真度之间存在权衡。为解释这一行为，我们引入了策略归因框架，对RLHF目标进行分解并量化其各组成部分的贡献。分析显示，在所有设置中，KL正则化都是情感抑制的主要驱动因素。基于这些发现，我们提出了对KL正则化项的情感感知修改，有选择地放松对承载情感的词元的约束。实证结果表明，该方法在保持摘要质量的同时缓解了情感漂移。总体而言，我们的发现凸显了当前对齐方法的一个根本局限：它们虽然提高了事实一致性和安全性，却可能无意中抑制情感表达。这促使我们开发显式考虑情感保持的对齐策略。
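
One plausible reading of the sentiment-aware KL modification in this abstract is a per-token weight on the usual log-ratio penalty. The sketch below assumes a sampled-token KL estimate and a fixed relaxation factor `relax` for sentiment-bearing tokens; the weighting scheme and all parameter names are assumptions, not the authors' formulation.

```python
import numpy as np

def sentiment_aware_kl_penalty(logp_policy, logp_ref, is_sentiment_token, beta=0.1, relax=0.25):
    """KL-style penalty with a reduced weight on tokens flagged as sentiment-bearing."""
    log_ratio = np.asarray(logp_policy, dtype=float) - np.asarray(logp_ref, dtype=float)
    weights = np.where(np.asarray(is_sentiment_token, dtype=bool), relax, 1.0)
    return beta * float(np.sum(weights * log_ratio))
```

Setting `relax=1.0` recovers the uniform penalty, which makes the modification easy to ablate.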

AlloSpatial: Agentic Harness Framework for Spatial Reasoning in Foundation Models

AlloSpatial：面向基础模型空间推理的智能体式Harness框架

Authors: Shouwei Ruan, Bin Wang, Zhenyu Wu, Qihui Zhu, Yuxiang Zhang, Jingzhi Li, Yubin Wang, Xingxing Wei
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.08952
Pdf link: https://arxiv.org/pdf/2606.08952
Abstract Multimodal Foundation Models (MFMs) have made substantial progress, yet remain fragile in spatial reasoning over the physical world. A key bottleneck lies in their inability to transform local egocentric observations into a global allocentric spatial representation. To address this, we propose AlloSpatial, an agentic framework for allocentric spatial cognition in foundation models. AlloSpatial introduces World2Mind, a plug-and-play cognitive mapping sandbox that converts egocentric observations into structured allocentric priors, including Allocentric-Spatial Trees and route maps that support querying object topology, geometric relations, passability, and trajectories. To utilize these priors reliably under noisy reconstruction and ambiguous visual evidence, AlloSpatial introduces a Spatial Reasoning Harness for tool-use judgment, modality-decoupled cue collection, and geometry-semantic arbitration. We further internalize this process in Qwen3-VL through cold-start reinforcement learning with a harness-gated trajectory-level reward. Experiments on VSI-Bench and MindCube show that AlloSpatial improves proprietary models by 5%-18% in a training-free setting, while ASTs alone support strong spatial reasoning even when visual inputs are removed. The trained AlloSpatial agents further outperform larger general-purpose models and competitive spatial baselines, suggesting that structured allocentric representations, active tool use, and verifiable reasoning offer a promising route toward spatially capable foundation models.
中文摘要 多模态基础模型（MFM）已取得长足进展，但在对物理世界的空间推理上仍然脆弱。一个关键瓶颈在于它们无法将局部的自我中心（egocentric）观察转化为全局的异我中心（allocentric）空间表征。为此，我们提出了AlloSpatial，一个面向基础模型异我中心空间认知的智能体框架。AlloSpatial引入了World2Mind，这是一个即插即用的认知建图沙盒，可将自我中心观察转换为结构化的异我中心先验，包括异我中心空间树（Allocentric-Spatial Tree，AST）以及支持查询物体拓扑、几何关系、可通行性和轨迹的路线图。为了在重建含噪、视觉证据模糊的情况下可靠地利用这些先验，AlloSpatial引入了空间推理Harness，用于工具使用判断、模态解耦的线索收集以及几何-语义仲裁。我们进一步通过冷启动强化学习和由Harness门控的轨迹级奖励，将这一过程内化到Qwen3-VL中。在VSI-Bench和MindCube上的实验表明，AlloSpatial在免训练设置下可将专有模型提升5%-18%，而仅凭AST，即使移除视觉输入也能支持较强的空间推理。训练后的AlloSpatial智能体进一步超越了更大的通用模型和有竞争力的空间基线，这表明结构化的异我中心表征、主动的工具使用和可验证的推理为构建具备空间能力的基础模型提供了一条有前景的途径。

Diverse Thinking Schemata Elicit Better Reasoning in Large Language Models

多样化思维图式在大型语言模型中激发更优的推理能力

Authors: Xinyue Liang, Yizhe Yang, Yu Bai, Bin Xu, Jiawei Li, Yang Gao
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.08974
Pdf link: https://arxiv.org/pdf/2606.08974
Abstract Large reasoning models (LRMs) have attracted increasing attention for their ability to solve complex mathematical problems by generating extended reasoning chains. In this work, we focus on two critical yet underexplored aspects of the reasoning process: reasoning transitions capturing the distinct transitions between reasoning steps and answer candidates reflecting the variety of solution paths produced by the model. We collectively define these two aspects as thinking schemata. We observe a correlation between the diversity of thinking schemata and model performance, which motivates us to enhance diversity as a means to further improve reasoning potential. To this end, we propose Diverse Schemata Policy Optimization (DiScO), a framework that first endows the model with schemata awareness, then encourages diversity through reinforcement learning, and further promotes diverse reasoning at inference time. Experiments on multiple mathematical reasoning benchmarks demonstrate that DiScO consistently outperforms standard group relative policy optimization. Beyond accuracy, human-annotated analyses show that DiScO substantially improves the model's ability to recover from erroneous initial attempts. Overall, our work suggests the important role that diversity of the thinking schemata plays and points to scaling along the diversity dimension as a promising research direction.
中文摘要 大型推理模型（LRM）因其能够通过生成较长的推理链来解决复杂数学问题而受到越来越多的关注。在本研究中，我们关注推理过程中两个关键但尚未被充分探索的方面：推理转换，刻画推理步骤之间不同类型的转换；以及候选答案，反映模型所产生的解题路径的多样性。我们将这两个方面统称为思维图式。我们观察到思维图式的多样性与模型性能之间存在相关性，这促使我们通过增强多样性来进一步提升推理潜力。为此，我们提出了多样化图式策略优化（DiScO），该框架首先赋予模型图式意识，然后通过强化学习鼓励多样性，并在推理阶段进一步促进多样化的推理。在多个数学推理基准上的实验表明，DiScO持续优于标准的组相对策略优化。除准确率之外，人工标注分析表明，DiScO显著提升了模型从错误的初始尝试中恢复的能力。总体而言，我们的工作表明思维图式的多样性具有重要作用，并指出沿多样性维度进行扩展是一个有前景的研究方向。

Personalization Meets Safety: Mechanisms, Risks, and Mitigations in Personalized LLMs

个性化与安全性的结合：个性化大型语言模型中的机制、风险与缓解措施

Authors: Yanyan Luo, Xue Han, Ruiqiao Bai, Xin Huang, Yitong Wang, Qian Hu, Qing Wang, Chunxu Zhao, Jie Liu, Cong Geng, Lehao Xing, Pengwei Hu, Junlan Feng
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.09038
Pdf link: https://arxiv.org/pdf/2606.09038
Abstract Large Language Models (LLMs) have enabled increasingly personalized interactions by adapting to users' preferences, contexts, and long-term histories. However, the mechanisms that enable personalization also expand the safety landscape in ways not systematically addressed by existing literature. Existing reviews typically focus either on personalization or safety, leaving their intersection largely unexplored. We present the first comprehensive, safety-aware review of personalized LLMs. We organize personalization along three dimensions (user representation, personalization paradigm, and evaluation) and introduce a unified taxonomy of safety risks. At the representation level, we analyze risks arising from diverse user representations. Across mainstream personalization paradigms, we delineate vulnerabilities inherent to prompting, retrieval augmentation, parameter fine-tuning, reinforcement learning, Mixture-of-Experts (MoE), pruning, agent frameworks, and multimodal personalization, and synthesize mitigation strategies across the model lifecycle. Beyond these fine-grained risks, we characterize paradigm-agnostic safety risks arising from personalized adaptation. We further summarize personalized datasets and evaluation methodologies. Through a case study of OpenClaw, we analyze deployment trends in personalized agent ecosystems. Our analysis reveals three structural inadequacies in existing research: safety is evaluated as user-invariant rather than relational, personalization techniques are analyzed in isolation rather than in composition, and evaluation frameworks cannot capture emergent long-term risks. By jointly examining personalized representations, personalization paradigms, safety risks, defenses, and evaluation methods, we provide a unified framework for developing safe personalized LLMs and highlight key directions for future research.
中文摘要 大型语言模型（LLM）通过适应用户的偏好、上下文和长期历史，实现了日益个性化的交互。然而，实现个性化的机制也扩大了安全风险面，而现有文献尚未对此进行系统性探讨。现有综述通常只关注个性化或只关注安全，两者的交叉领域在很大程度上仍未被探索。我们给出了首个全面的、关注安全的个性化LLM综述。我们从用户表征、个性化范式和评估三个维度组织个性化研究，并提出统一的安全风险分类体系。在表征层面，我们分析了多样化用户表征带来的风险。针对主流个性化范式，我们梳理了提示、检索增强、参数微调、强化学习、混合专家（MoE）、剪枝、智能体框架和多模态个性化各自固有的脆弱性，并综合了贯穿模型生命周期的缓解策略。除这些细粒度风险之外，我们还刻画了由个性化适配引起的、与范式无关的安全风险。我们进一步总结了个性化数据集和评估方法。通过对OpenClaw的案例研究，我们分析了个性化智能体生态系统中的部署趋势。我们的分析揭示了现有研究中的三个结构性不足：安全性被当作与用户无关的属性而非关系性属性来评估，个性化技术被孤立地而非组合地分析，评估框架无法捕捉涌现的长期风险。通过综合考察个性化表征、个性化范式、安全风险、防御和评估方法，我们为开发安全的个性化LLM提供了统一框架，并指出了未来研究的关键方向。

Stage-1 Controls the Entropy Regime, Not the Outcome

第一阶段控制熵状态，而非结果

Authors: Jianxiong Shen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.09059
Pdf link: https://arxiv.org/pdf/2606.09059
Abstract Two-stage post-training -- a Stage-1 warm-start (supervised fine-tuning, SFT, or on-policy distillation, OPD) followed by Stage-2 reinforcement learning (RL) -- is increasingly used for vision-language models (VLMs). We ask what Stage-1 actually controls in a small-data study using Qwen2.5-VL-7B with a same-modality 72B VLM teacher for OPD. First, the three warm-starts reach a narrow $53$--$54\%$ band on Geometry3K internal validation, consistent with the narrow range reported by recent specialized methods; this setup provides little evidence that Stage-1 changes the in-domain endpoint. Second, a matched-recipe, early-stopped SFT improves out-of-domain MathVista by $+2.1$ points, reversing the $-9.5$-point drop of an over-trained variant. The clearest difference is the \emph{entropy regime}: OPD enters RL with substantially higher policy entropy than either SFT initialization, and the separation remains visible through the available trajectories. At the in-domain initialization, OPD also has higher answer diversity and pass@16 ($+2.0$ to $+5.2$ points over SFT), although problem-level bootstrap intervals show that the smaller contrast is uncertain. The advantage is absent after RL (endpoint pass@16 values within $1.1$ points) and on MathVista (six models within $1.2$ points). Our contribution is therefore a bounded empirical characterization: Stage-1 is strongly associated with the entropy regime in this setup, but the downstream payoff is small, localized, and not evidence that OPD is a better RL warm-start.
中文摘要 两阶段后训练，即先进行第一阶段热启动（监督微调SFT，或在策略蒸馏OPD），再进行第二阶段强化学习（RL），正越来越多地用于视觉语言模型（VLM）。我们在一项小数据研究中探讨第一阶段实际控制的是什么，该研究使用Qwen2.5-VL-7B，并以同模态的72B VLM作为OPD的教师。首先，三种热启动在Geometry3K内部验证集上落入$53$--$54\%$的狭窄区间，与近期专门方法报告的狭窄范围一致；这一设置几乎没有提供第一阶段会改变域内终点的证据。其次，配方匹配、早停的SFT使域外MathVista提升$+2.1$分，扭转了过度训练变体$-9.5$分的下降。最清晰的差异在于\emph{熵状态}：OPD进入RL时的策略熵明显高于两种SFT初始化，且这一差距在现有轨迹中始终可见。在域内初始化时，OPD的答案多样性和pass@16也更高（比SFT高$+2.0$到$+5.2$分），不过问题级的自助法（bootstrap）区间表明其中较小的差异并不确定。这一优势在RL之后消失（终点pass@16的差距在$1.1$分以内），在MathVista上也不存在（六个模型的差距在$1.2$分以内）。因此，我们的贡献是一个有界的经验性刻画：在该设置下，第一阶段与熵状态密切相关，但下游收益小且局部，并不能证明OPD是更好的RL热启动方式。

A Unifying Lens on Reward Uncertainty in RLHF

RLHF中奖励不确定性的统一视角

Authors: Ely Hahami, Yoel Zimmermann, Ray Zhou, Jack Benarroch Jedlicki
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.09073
Pdf link: https://arxiv.org/pdf/2606.09073
Abstract Reinforcement learning from human feedback (RLHF) is bottlenecked by \emph{reward hacking}, where the policy exploits errors in a proxy reward model (RM) and produces high RM scores without genuine quality gains. A natural mitigation is \emph{pessimism}: penalizing rewards in regions where the RM is uncertain. However, standard scalar RMs provide no principled notion of uncertainty. We argue that the right object is a \emph{distributional} reward model $p(r\mid x,y)$. Under either a Bayesian inference or a KL-distributionally robust optimization (KL-DRO) lens, the KL-regularized RLHF objective admits a closed-form effective reward $\tilde r(x,y) = \pm\beta\log\mathbb{E}_p[e^{\pm r/\beta}]$. The pessimistic branch unifies the prior heuristics for RM ensemble aggregation: mean aggregation, worst-case optimization (WCO), and uncertainty-weighted optimization (UWO) all emerge as limits or truncations of this single expression. This also clarifies the implicit assumptions of each existing rule.
中文摘要 基于人类反馈的强化学习（RLHF）受制于\emph{奖励黑客}（reward hacking），即策略利用代理奖励模型（RM）中的误差，在没有真正质量提升的情况下获得很高的RM分数。一种自然的缓解手段是\emph{悲观主义}：在RM不确定的区域对奖励进行惩罚。然而，标准的标量RM并不提供有原则的不确定性概念。我们认为合适的对象是\emph{分布型}奖励模型$p(r\mid x,y)$。无论是从贝叶斯推断还是从KL分布鲁棒优化（KL-DRO）的视角出发，KL正则化的RLHF目标都存在闭式的有效奖励$\tilde r(x,y) = \pm\beta\log\mathbb{E}_p[e^{\pm r/\beta}]$。其悲观分支统一了以往用于RM集成聚合的启发式方法：均值聚合、最坏情况优化（WCO）和不确定性加权优化（UWO）都是这一表达式的极限或截断。这也澄清了每条现有规则的隐含假设。
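
The pessimistic branch of the closed form in this abstract, $\tilde r = -\beta\log\mathbb{E}_p[e^{-r/\beta}]$, can be checked numerically. The sketch below approximates $p(r\mid x,y)$ by a finite ensemble of reward samples, which is an assumption for illustration. Its limits are the mean as $\beta\to\infty$ and the minimum as $\beta\to 0$, and a second-order expansion gives mean minus variance over $2\beta$. Matching these to mean aggregation, WCO, and UWO respectively is an inference from the abstract's wording, not a confirmed detail.

```python
import numpy as np

def pessimistic_reward(samples, beta):
    """-beta * log mean(exp(-r / beta)) over ensemble samples, computed stably."""
    z = -np.asarray(samples, dtype=float) / beta
    m = z.max()                                        # shift for numerical stability
    return -beta * (m + np.log(np.mean(np.exp(z - m))))

ensemble = np.array([1.0, 2.0, 4.0])
near_mean = pessimistic_reward(ensemble, beta=1e4)     # large beta: close to the ensemble mean
near_min = pessimistic_reward(ensemble, beta=1e-3)     # small beta: close to the worst case
variance_penalized = ensemble.mean() - ensemble.var() / (2 * 10.0)  # second-order truncation at beta=10
```

By Jensen's inequality the pessimistic value never exceeds the ensemble mean, so disagreement among ensemble members always lowers the effective reward.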

Stabilizing On-Policy Distillation for MLLM Reasoning with Global Normalization

通过全局归一化稳定MLLM推理的在策略蒸馏

Authors: Dongze Hao, Zhiwei Jin, Chen Chen, Haonan Lu
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.09091
Pdf link: https://arxiv.org/pdf/2606.09091
Abstract On-policy distillation (OPD) has recently emerged as an important post-training paradigm. By using a stronger teacher model to provide dense, fine-grained supervision for sampled trajectories, OPD offers a clear advantage over reinforcement learning with verifiable rewards (RLVR), which typically depends on sparse binary or outcome-based environmental feedback. However, naive token-level distillation can suffer from gradient instability, due to magnitude misalignment in outlier states. To address this issue, we propose Globally Normalized Distillation Policy Optimization (GNDPO), a practical method that stabilizes optimization by transforming raw KL scores into batch-level relative advantages. This normalization effectively mitigates gradient explosions while retaining the benefits of token-level guidance. Experimental results show that GNDPO substantially improves training robustness and downstream performance across multimodal reasoning tasks. The code is released at this https URL.
中文摘要 在策略蒸馏（OPD）最近成为一种重要的后训练范式。OPD利用更强的教师模型为采样轨迹提供稠密、细粒度的监督，相比通常依赖稀疏的二元或基于结果的环境反馈的带可验证奖励的强化学习（RLVR）具有明显优势。然而，朴素的词元级蒸馏可能因离群状态下的幅值失配而出现梯度不稳定。为解决这一问题，我们提出了全局归一化蒸馏策略优化（GNDPO），这是一种实用的方法，通过将原始KL分数转换为批次级的相对优势来稳定优化。这种归一化能有效缓解梯度爆炸，同时保留词元级指导的优势。实验结果表明，GNDPO在多模态推理任务上显著提升了训练鲁棒性和下游性能。代码发布于此 https URL。
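
The conversion of raw KL scores into batch-level relative advantages might look like the z-score sketch below. It assumes a padded `[batch, tokens]` array of per-token student-teacher KL values with a validity mask, and it assigns positive advantage to tokens with below-average KL; the sign convention and the exact statistic are assumptions, since the abstract does not give GNDPO's formula.

```python
import numpy as np

def globally_normalized_advantages(token_kl, mask, eps=1e-6):
    """Standardize per-token KL scores over all valid tokens in the batch."""
    kl = np.asarray(token_kl, dtype=float)
    valid = np.asarray(mask, dtype=bool)
    mu, sigma = kl[valid].mean(), kl[valid].std()
    adv = -(kl - mu) / (sigma + eps)                   # lower KL than the batch average -> positive
    return np.where(valid, adv, 0.0)                   # padding positions contribute nothing
```

Standardizing bounds the magnitude of any single token's signal by the batch statistics, so one outlier state with a very large KL cannot dominate the gradient.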

Addressing Market Regime Changes and Heavy-Tailed Returns in Portfolio Optimization via Bayesian VAR and Elliptical Black-Litterman

通过贝叶斯VAR和椭圆Black-Litterman应对投资组合优化中的市场状态转换与重尾收益

Authors: Daniil Mikriukov (1 and 2), Ruoyu Sun (2), Angelos Stefanidis (2), Jionglong Su (2), Zhengyong Jiang (2) ((1) University of Liverpool, (2) Xi'an Jiaotong-Liverpool University)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Portfolio Management (q-fin.PM)
Arxiv link: https://arxiv.org/abs/2606.09104
Pdf link: https://arxiv.org/pdf/2606.09104
Abstract Deep reinforcement learning (DRL) frameworks for portfolio optimization have shown promise for their ability to learn allocation rules dynamically from market data. However, these models fail to account for fat-tailed returns, which characterize actual market behavior with more frequent extreme events. Furthermore, historical data is treated homogeneously, without accounting for temporal importance, leading models to fail during regime changes. We propose a new BAVAR-BLED algorithm that combines methods derived from Bayesian-Averaging Vector Autoregressive (BAVAR) and the Black-Litterman model using Elliptical Distributions (BLED) within a TD3 architecture. BAVAR captures a set of vector autoregressive representations that consider multi-scale temporal features, enabling adaptive allocation decisions based on regime-aware estimates of return expectations and dispersion matrices. These estimates serve as prior inputs to BLED, a model that uses Student's t-distributions, allowing for more realistic fat tail return estimates. The BAVAR-BLED algorithm uses transformer networks for view construction and CNNs for risk-aversion estimates, which modify dynamic allocation decisions based on market conditions. An evaluation of 29 Dow Jones Industrial Average constituents over a decade-long market period shows that BAVAR-BLED significantly outperforms state-of-the-art methods, achieving Sharpe and Sortino ratios of 1.72 and 2.70, respectively, and total returns of 57.26%.
中文摘要 用于投资组合优化的深度强化学习（DRL）框架能够从市场数据中动态学习配置规则，已展现出良好前景。然而，这些模型未能考虑肥尾收益，而肥尾收益刻画了极端事件更为频繁的真实市场行为。此外，历史数据被同质化处理，未考虑时间重要性，导致模型在市场状态转换时失效。我们提出了一种新的 BAVAR-BLED 算法，在 TD3 架构中结合了源自贝叶斯平均向量自回归（BAVAR）和采用椭圆分布的 Black-Litterman 模型（BLED）的方法。BAVAR 捕捉一组考虑多尺度时间特征的向量自回归表示，从而能够基于对收益期望和离散度矩阵的状态感知估计做出自适应配置决策。这些估计值作为 BLED 的先验输入；BLED 采用 Student t 分布，使肥尾收益估计更符合实际。BAVAR-BLED 算法使用 Transformer 网络构建观点，并使用 CNN 估计风险厌恶系数，二者根据市场状况调整动态配置决策。对 29 只道琼斯工业平均指数成分股在长达十年的市场期间的评估显示，BAVAR-BLED 显著优于最先进方法，Sharpe 比率和 Sortino 比率分别达到 1.72 和 2.70，总回报率为 57.26%。

Counterfactual Transport Flows for Offline Conservative Trajectory Refinement

离线保守轨迹精细化的反事实传输流

Authors: Lena Krieger, Xuan Zhao, Zhuo Cao, Qin Wang, Hanno Scharr, Ira Assent
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.09115
Pdf link: https://arxiv.org/pdf/2606.09115
Abstract Offline reinforcement learning (RL) offers a path to policy improvement from logged data alone, using historical returns or other measurable outcomes as world feedback. A key difficulty is improving observed behavior without extrapolating beyond what the offline data supports. We propose \emph{counterfactual transport flows}, a source-conditioned trajectory refinement framework for offline decision-making guided by world feedback. Given a low-feedback candidate trajectory, we construct local preference pairs from offline data by retrieving nearby trajectories in latent trajectory space with higher task-specific feedback, and use them as weak supervision for conservative refinement. The framework learns instance-specific refinement directions: at inference time, a refinement strength parameter controls how far the candidate trajectory is transported, enabling a trade-off between preserving the original behavior and applying stronger improvement. Experiments on D4RL benchmarks, including AntMaze and MuJoCo tasks, show that our method improves behavior from historical returns as world feedback, while providing interpretable trajectory-level refinement paths.
中文摘要 离线强化学习（RL）提供了仅凭已记录数据改进策略的途径，它利用历史回报或其他可度量的结果作为世界反馈。一个关键难点是如何在不外推到离线数据支持范围之外的前提下改进已观察到的行为。我们提出了 \emph{反事实传输流}，一种以源轨迹为条件的轨迹细化框架，用于由世界反馈引导的离线决策。给定一条低反馈的候选轨迹，我们在潜在轨迹空间中检索任务特定反馈更高的邻近轨迹，从离线数据中构建局部偏好对，并将其作为保守细化的弱监督。该框架学习针对具体实例的细化方向：在推理时，细化强度参数控制候选轨迹被传输的距离，从而在保留原始行为与施加更强改进之间进行权衡。在 D4RL 基准（包括 AntMaze 和 MuJoCo 任务）上的实验表明，我们的方法能够以历史回报作为世界反馈来改进行为，同时提供可解释的轨迹级细化路径。

AutoPilot: Learning to Steer High Speed Robust BFT

AutoPilot：学习调控高速鲁棒的BFT

Authors: Liangrong Chen, Yue Zhang, Eric Zhou, Mohammad Javad Amiri, Ryan Marcus, Chenyuan Wu
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2606.09120
Pdf link: https://arxiv.org/pdf/2606.09120
Abstract Recent Byzantine Fault Tolerant (BFT) protocols achieve strong performance by combining the low-latency advantages of leader-based BFT protocols with the high-throughput benefits of DAG-based data dissemination. Despite exposing a wide spectrum of internal tunable parameters, these protocols typically rely on static and heuristic configurations, which leads to performance degradation under dynamic workloads, heterogeneous network conditions, and evolving adversarial behaviors. In this paper, we present AutoPilot, a reinforcement learning-based framework that continuously monitors runtime conditions and dynamically adjusts protocol parameters online to optimize consensus performance. To ensure robustness, AutoPilot coordinates learning in a decentralized manner, providing resilience against adversarial data pollution. We implement AutoPilot on top of Autobahn, a state-of-the-art, high-speed, robust BFT protocol, and evaluate it across diverse dynamic environments. Experimental results demonstrate that AutoPilot quickly converges to the optimal configuration under changing environments, reduces end-to-end latency by 49.8% compared to the default protocol configuration, and outperforms random configuration exploration by 73.3%.
中文摘要 最新的拜占庭容错（BFT）协议通过结合基于领导者的BFT协议的低延迟优势与基于DAG的数据传播的高吞吐量优势，实现了强大的性能。尽管这些协议暴露了广泛的内部可调参数，但通常依赖静态和启发式配置，导致在动态工作负载、异构网络条件和不断演变的对抗行为下性能下降。本文介绍了AutoPilot，一种基于强化学习的框架，能够持续监控运行时状态，并在线动态调整协议参数以优化共识性能。为确保鲁棒性，AutoPilot以去中心化方式协调学习，增强对对抗性数据污染的韧性。我们在Autobahn之上实施AutoPilot，这是一种最先进的高速、稳健的BFT协议，并在多样化的动态环境中进行评估。实验结果表明，AutoPilot在变化的环境下能迅速收敛到最优配置，端到端延迟比默认协议配置降低49.8%，并且比随机配置探索高出73.3%。

A Regret Minimization Framework on Preference Learning in Large Language Models

大型语言模型中偏好学习的遗憾最小化框架

Authors: Suhwan Kim, Taehyun Cho, Geon-Hyeong Kim, Yu Jin Kim, Youngsoo Jang, Moontae Lee, Jungwoo Lee
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.09124
Pdf link: https://arxiv.org/pdf/2606.09124
Abstract Reinforcement learning with verifiable rewards (RLVR) has enabled progress on reasoning-intensive tasks by relying on task-specific verifiers that provide automated correctness signals. However, many realistic language tasks are difficult to equip with reliable verifiers, motivating a growing reliance on reinforcement learning from human feedback (RLHF). In this setting, we argue that a closer examination of how human feedback should be interpreted is essential. We introduce Regret-based Preference Optimization $(\textbf{RePO})$, which reframes RLHF through $\textit{regret minimization}$ rather than reward maximization. Human preferences are often shaped by $\textit{prospective}$ anticipation of outcomes and $\textit{counterfactual}$ comparisons to alternative behaviors, rather than by immediate, outcome-independent utility. $\textbf{RePO}$ captures this structure by modeling preferences as behavior-conditioned assessments of relative suboptimality. Experiments on mathematical reasoning benchmarks and human preference datasets demonstrate consistent performance gains, indicating that $\textbf{RePO}$ is an effective and human-aligned approach for training large language models.
中文摘要 带可验证奖励的强化学习（RLVR）依靠提供自动化正确性信号的任务特定验证器，推动了推理密集型任务的进展。然而，许多现实中的语言任务难以配备可靠的验证器，这促使人们越来越依赖基于人类反馈的强化学习（RLHF）。在此背景下，我们认为有必要更仔细地审视应当如何解读人类反馈。我们引入了基于遗憾的偏好优化 $(\textbf{RePO})$，它通过 $\textit{遗憾最小化}$ 而非奖励最大化来重新表述 RLHF。人类偏好往往由对结果的 $\textit{前瞻性}$ 预期以及与替代行为的 $\textit{反事实}$ 比较所塑造，而非由即时的、与结果无关的效用所决定。$\textbf{RePO}$ 通过将偏好建模为以行为为条件的相对次优性评估来刻画这一结构。在数学推理基准和人类偏好数据集上的实验显示出一致的性能提升，表明 $\textbf{RePO}$ 是一种有效且与人类对齐的大型语言模型训练方法。

Claw-R1: A Step-Level Data Middleware System for Agentic Reinforcement Learning

Claw-R1：面向智能体强化学习的步骤级数据中间件系统

Authors: Daoyu Wang, Mingyue Cheng, Qingchuan Li, Shuo Yu, Jie Ouyang, Qi Liu
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.09138
Pdf link: https://arxiv.org/pdf/2606.09138
Abstract Agentic reinforcement learning (RL) has become an important post-training paradigm for turning LLMs from static chatbots into interactive agents, giving rise to representative applications such as OpenClaw. Existing work mainly focuses on policy optimization algorithms and training frameworks, but pays less attention to the full data lifecycle of agent-environment interactions, from data production to training consumption. To bridge this gap, we present Claw-R1, an interactive step-level data middleware system for agentic RL. Claw-R1 connects heterogeneous agent runtimes with RL training backends through two core components: a Gateway Server and a Data Pool. The Gateway Server captures multi-turn interaction steps through a unified LLM API entry point, while the Data Pool organizes them into step-level records consisting of prompt IDs, response IDs, rewards and other metadata. In our demo, users can interactively inspect live trajectories, examine the state, action, and reward of each step, curate data by quality and readiness, and configure training-ready batches for different downstream RL algorithms. Overall, Claw-R1 treats agent interaction traces as managed data assets rather than temporary runtime logs. Through this demonstration, we hope to encourage the community to recognize the importance of data management in agentic RL. Our code is available at this https URL and the demonstration video can be found at this https URL.
中文摘要 智能体强化学习（RL）已成为将大语言模型从静态聊天机器人转变为交互式智能体的重要后训练范式，并催生了 OpenClaw 等代表性应用。现有工作主要聚焦于策略优化算法和训练框架，而较少关注智能体与环境交互从数据生产到训练消费的完整数据生命周期。为弥合这一差距，我们提出了 Claw-R1，一个面向智能体强化学习的交互式步骤级数据中间件系统。Claw-R1 通过两个核心组件（网关服务器和数据池）将异构的智能体运行时与强化学习训练后端连接起来。网关服务器通过统一的 LLM API 入口捕获多轮交互步骤，数据池则将其组织为步骤级记录，其中包含提示 ID、响应 ID、奖励及其他元数据。在我们的演示中，用户可以交互式地查看实时轨迹，检查每一步的状态、动作和奖励，按质量和就绪程度筛选整理数据，并为不同的下游强化学习算法配置可直接用于训练的批次。总体而言，Claw-R1 将智能体交互轨迹视为受管理的数据资产，而非临时的运行时日志。通过本次演示，我们希望促使社区认识到数据管理在智能体强化学习中的重要性。我们的代码可在此 https URL 获取，演示视频可在此 https URL 观看。

Self-Paced Curriculum Reinforcement Learning for Autonomous Superbike Racing in Simulation

面向仿真环境中自主超级摩托车竞速的自步课程强化学习

Authors: Luca Ghisi, Jacopo Essenziale, Carlo D'Eramo, Matteo Luperto
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.09236
Pdf link: https://arxiv.org/pdf/2606.09236
Abstract Autonomous Racing has seen remarkable progress through deep Reinforcement Learning (RL), primarily for four-wheeled vehicles. However, motorbikes introduce substantially greater complexity due to the need to manage balance and lean angle, in addition to more reactive steering and throttle control, and a smaller weight. In this work, we present a framework for training an autonomous agent to race a superbike in VRider SBK, a physics-accurate Unity-based motorbike simulator. Our approach integrates Soft Actor-Critic (SAC) with Self-Paced curriculum Deep reinforcement Learning (SPDL), which dynamically generates progressively more challenging tasks based on the agent's performance, without requiring manual curriculum design. The agent's state space comprises proprioceptive features extended with lean-angle history, along with global track features via course points. The reward signal is shaped to encourage progress along the track while penalizing instability-inducing behaviors specific to two-wheeled dynamics. Preliminary experimental results demonstrate that SPDL outperforms SAC alone in training efficiency, lap time, and driving stability across multiple tracks and motorbike models, establishing a first baseline for RL-based autonomous motorbike racing.
中文摘要 自主赛车借助深度强化学习（RL）取得了显著进展，但主要针对四轮车辆。摩托车则带来了高得多的复杂性：除了更灵敏的转向和油门控制以及更轻的车重之外，还需要管理平衡和倾角。在本研究中，我们提出了一个框架，用于在 VRider SBK（一款基于 Unity 的物理精确摩托车模拟器）中训练自主智能体驾驶超级摩托车竞速。我们的方法将 Soft Actor-Critic（SAC）与自步课程深度强化学习（SPDL）相结合，后者根据智能体的表现动态生成难度逐步提高的任务，无需手动设计课程。智能体的状态空间由附加了倾角历史的本体感知特征以及通过赛道路径点表示的全局赛道特征组成。奖励信号经过设计，鼓励沿赛道前进，同时惩罚两轮动力学特有的、会引发失稳的行为。初步实验结果表明，在多条赛道和多种摩托车车型上，SPDL 在训练效率、圈速和驾驶稳定性方面均优于单独使用 SAC，为基于强化学习的自主摩托车竞速建立了首个基线。

Temporal-Aware Reasoning Optimization for Video Temporal Grounding

面向视频时序定位的时间感知推理优化

Authors: Minghang Zheng, Zihao Yin, Yi Yang, Yuxin Peng, Yang Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.09248
Pdf link: https://arxiv.org/pdf/2606.09248
Abstract Multi-modal Large Language Models (MLLMs) have achieved remarkable progress in video temporal grounding with reinforcement learning for generating reasoning paths. However, existing models often produce superficial reasoning, which offers limited guidance for precise temporal localization. This limitation stems from (1) inefficient random exploration and (2) reward functions that focus solely on the answer correctness while ignoring reasoning quality. To address these issues, we propose TaRO (Temporal-Aware Reasoning Optimization), a framework that explicitly enhances the model's ability of thinking with time. First, we introduce a Constructive Reasoning Exploration that leverages pre-generated dense captions to construct reasoning paths grounded in explicit visual cues and timestamps, enabling efficient exploration of high-quality time-aware reasoning. Second, to evaluate reasoning quality, we design a Temporal-Sensitivity Reward. High-quality reasoning should be anchored to specific events and timestamps. If the event boundary under thinking is disrupted, such reasoning should become invalid, leading to a drop in the logit of the reasoning path. We utilize this drop as a critique of reasoning quality. Finally, TaRO follows a progressive curriculum, which starts by utilizing this reward to select better constructed reasoning paths, and evolves to a free exploration phase where the model autonomously generates effective reasoning. Experiments demonstrate that TaRO achieves state-of-the-art performance on VTG benchmarks. Code is available at this https URL.
中文摘要 多模态大语言模型（MLLM）借助强化学习生成推理路径，在视频时序定位上取得了显著进展。然而，现有模型往往产生肤浅的推理，对精确的时间定位帮助有限。这一局限源于（1）低效的随机探索，以及（2）奖励函数只关注答案正确性而忽视推理质量。为解决这些问题，我们提出了 TaRO（时间感知推理优化），这一框架显式增强模型结合时间进行思考的能力。首先，我们引入构造式推理探索，利用预先生成的密集描述构建以明确视觉线索和时间戳为依据的推理路径，从而高效探索高质量的时间感知推理。其次，为了评估推理质量，我们设计了时间敏感性奖励。高质量的推理应当锚定于具体的事件和时间戳；如果所思考事件的边界被破坏，这种推理就应当失效，导致该推理路径的 logit 下降。我们将这一下降量用作对推理质量的评判。最后，TaRO 遵循渐进式课程：先利用该奖励筛选更好的构造式推理路径，再过渡到自由探索阶段，由模型自主生成有效的推理。实验表明，TaRO 在 VTG 基准上达到了最先进的性能。代码可在此 https URL 获取。

Physics-Guided Sequence-Based Generative Framework for Acoustic Metamaterial Inverse Design

物理引导、基于序列的声学超材料逆向设计生成框架

Authors: Yijie Li, Jiahao Xu, Ching-Chih Tsao, Lili Qiu, Jingxian Wang
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.09266
Pdf link: https://arxiv.org/pdf/2606.09266
Abstract Acoustic metamaterial (AMM) inverse design is particularly challenging for broadband target responses due to acoustic dispersion: a structure that matches the desired response at one frequency may deviate at others, and modifying geometry to improve one sub-band often perturbs neighboring sub-bands. Yet existing broadband inverse-design approaches are either constrained by predefined templates, or rely on image representations that fail to preserve the geometric precision and structural connectivity required by acoustic structures. We present MetaSeq, a physics-guided, sequence-based generative framework for acoustic metamaterial inverse design. At its core, MetaSeq introduces a language that represents each AMM as a structured sequence, rather than as a pixel grid or fixed template. This representation preserves precise geometry, explicitly encodes connectivity, and casts inverse design as a sequence-to-sequence task from target response to structure sequence. MetaSeq further constructs a balanced, high-fidelity dataset with efficient calibration and complexity-based sampling. To address the one-to-many nature of inverse design, MetaSeq combines supervised pretraining with reinforcement learning fine-tuning guided by a physics-based solver and validity checker. Extensive evaluations against COMSOL and five baselines show that MetaSeq reduces response error by 45% over the best baseline.
中文摘要 由于声学色散，面向宽带目标响应的声学超材料（AMM）逆向设计尤其具有挑战性：在某一频率上与期望响应相符的结构在其他频率上可能出现偏差，而为改善某一子频带而修改几何形状往往会扰动相邻子频带。然而，现有的宽带逆向设计方法要么受限于预定义模板，要么依赖图像表示，而图像表示无法保持声学结构所需的几何精度和结构连通性。我们提出了 MetaSeq，一种物理引导、基于序列的声学超材料逆向设计生成框架。MetaSeq 的核心是引入一种语言，将每个 AMM 表示为结构化序列，而非像素网格或固定模板。这种表示保留了精确的几何信息，显式编码连通性，并将逆向设计转化为从目标响应到结构序列的序列到序列任务。MetaSeq 还通过高效校准和基于复杂度的采样构建了一个均衡、高保真的数据集。为应对逆向设计的一对多特性，MetaSeq 将监督预训练与由基于物理的求解器和有效性检查器引导的强化学习微调相结合。与 COMSOL 及五个基线的广泛对比评估表明，MetaSeq 相比最佳基线将响应误差降低了 45%。

One Model, Multiple Goals: Adaptive Multi-Objective Learning for E-commerce Dialogue Systems

一个模型，多目标：电子商务对话系统的自适应多目标学习

Authors: Mingzhe Li, Jing Xiang, Enguo Zhou, Lang Gao, Tai Li, Qishen Zhang, Xiangliang Zhang, Xiuying Chen
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.09293
Pdf link: https://arxiv.org/pdf/2606.09293
Abstract Dialogue systems in e-commerce scenarios often need to satisfy multiple objectives: accurately reasoning over user profiles (e.g., eligibility, credit limit) to ensure correct decision-making and user state interpretation, while also generating natural and faithful responses. These goals are complementary but not identical. In this work, we propose MORE, an adaptive Multi-Objective REinforcement learning framework that jointly optimizes reasoning accuracy and linguistic naturalness. Our preliminary experiments show that directly mixing rewards with diverging optimization dynamics can cause oscillations and unstable learning. Thus, instead of optimizing a single mixed reward, we treat reasoning functions as constraints that guide policy optimization. At inference time, the system directly generates responses without explicit reasoning steps, while still benefiting from reasoning-enhanced scaffold and avoiding additional inference overhead. To better balance linguistic objectives during response generation, we introduce an adaptive multi-reward mechanism that aggregates signals such as fluency and naturalness and dynamically reweighs them via gradient feedback. We evaluate MORE on two real-world dialogue systems at ByteDance and the MultiWOZ 2.2 benchmark, where it consistently outperforms strong baselines. In 14-day online experiments on ByteDance production traffic, MORE improves overall and reached conversion by 16.53% and 30.09%, while increasing user satisfaction and reducing handoff rates. Notably, in a human-machine comparison, MORE recovers about 60% of the incremental conversion lift achieved by human agents.
中文摘要 电子商务场景中的对话系统通常需要满足多个目标：既要对用户档案（如资格、信用额度）进行准确推理，以确保决策和用户状态解读正确，又要生成自然且忠实的回复。这些目标相辅相成，但并不等同。在本研究中，我们提出了 MORE，一种自适应的多目标强化学习框架，联合优化推理准确性和语言自然度。我们的初步实验表明，直接混合优化动态相互背离的奖励会导致振荡和学习不稳定。因此，我们不去优化单一的混合奖励，而是将推理函数视为引导策略优化的约束。在推理阶段，系统无需显式推理步骤即可直接生成回复，既能受益于推理增强的支架，又避免了额外的推理开销。为了在回复生成中更好地平衡各语言目标，我们引入了一种自适应多奖励机制，它聚合流畅度、自然度等信号，并通过梯度反馈动态调整其权重。我们在字节跳动的两个真实对话系统和 MultiWOZ 2.2 基准上评估了 MORE，它始终优于强基线。在基于字节跳动生产流量的 14 天在线实验中，MORE 将整体转化率和触达转化率分别提升了 16.53% 和 30.09%，同时提高了用户满意度并降低了转人工率。值得注意的是，在人机对比中，MORE 实现了人工客服所带来的增量转化提升的约 60%。

SG-OPD: Sign-Gated On-Policy Distillation via Sign-Consistency Gating and Phased Teacher Sampling

SG-OPD：基于符号一致性门控和分阶段教师采样的符号门控在线策略蒸馏

Authors: Haoran Xu, Hongyu Wang, Yifei Gao, Jiaze Li, Xiaofeng Zhang, Xiaosong Yuan
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.09304
Pdf link: https://arxiv.org/pdf/2606.09304
Abstract On-policy distillation (OPD) trains a student on its own trajectories with dense per-token supervision from a stronger teacher, and often outperforms off-policy distillation and standard reinforcement learning. However, we find that its effectiveness implicitly relies on two assumptions that frequently break in practice: trajectory-level alignment between the student and the teacher, and uniform token-level reliability of the teacher's preferences. We therefore propose Sign-Gated On-Policy Distillation (SG-OPD), which uses a binary verifier as a trust signal for the teacher at two complementary granularities: phased teacher sampling mixes in verifier-endorsed teacher rollouts at cold-start, and a sign-consistency gate extrapolates the distillation update on tokens where the teacher agrees with the verifier-correct direction and interpolates it where it disagrees. Experiments on competition-level mathematical reasoning benchmarks show that SG-OPD consistently outperforms standard OPD, with average gains of 1.98 and 7.50 at the per-sample and per-question levels, respectively.
中文摘要 在线策略蒸馏（OPD）让学生模型在自己生成的轨迹上训练，并由更强的教师模型提供密集的逐 token 监督，其效果通常优于离线策略蒸馏和标准强化学习。然而，我们发现其有效性隐含地依赖于两个在实践中经常不成立的假设：学生与教师之间的轨迹级对齐，以及教师偏好在 token 级上具有一致的可靠性。因此，我们提出了符号门控在线策略蒸馏（SG-OPD），它将二值验证器用作衡量教师可信度的信号，并在两个互补的粒度上发挥作用：分阶段教师采样在冷启动阶段混入经验证器认可的教师 rollout；符号一致性门控则在教师与验证器判定的正确方向一致的 token 上对蒸馏更新进行外推，在二者不一致的 token 上进行内插。在竞赛级数学推理基准上的实验表明，SG-OPD 始终优于标准 OPD，在逐样本和逐问题两个层面上的平均提升分别为 1.98 和 7.50。

TORL-VLA: Tactile Guided Online Reinforcement Learning for Contact-Rich Manipulation

TORL-VLA：触觉引导在线强化学习，用于接触丰富操作

Authors: Huaihang Zheng, Yi Yang, Kai Ma, Shenglin Xu, Tian Xie, Guozheng Li, Xiangyu Wang, Yiren Ma, Si Liu, Yinian Mao, Baoxu Liu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.09337
Pdf link: https://arxiv.org/pdf/2606.09337
Abstract Vision-Language-Action (VLA) models have become a powerful framework for robotic manipulation, and recent studies have introduced tactile or force feedback into VLAs to address contact-rich tasks. However, these models are typically deployed as offline policies. When contact conditions shift from the training distribution, the policy cannot perform online adaptation, leading to problems such as inappropriate contact forces and inefficient retries. Therefore, we propose TORL-VLA, a tactile-guided online reinforcement learning framework that couples tactile feedback with policy refinement for contact-rich manipulation. Our method introduces a tactile-derived wrench-aware VLA to predict reference actions and future wrench sequences, while a lightweight online RL module is used to refine the reference actions. To stabilize learning from mixed exploratory policy-generated and human-intervention data, we introduce an intervention-censored critic that prevents post-intervention success from being wrongly credited to policy-generated actions preceding intervention. Real-robot experiments on long-horizon contact-rich tasks, including latch manipulation, coffee-cup placement, and egg handling, show that TORL-VLA improves success rates at both subtask and full-task levels, as well as time-bounded execution efficiency over strong baselines.
中文摘要 视觉-语言-动作（VLA）模型已成为机器人操作的强大框架，近期研究将触觉或力反馈引入 VLA，以应对接触丰富的任务。然而，这些模型通常作为离线策略部署。当接触条件偏离训练分布时，策略无法进行在线适应，从而导致接触力不当、重试效率低下等问题。因此，我们提出了 TORL-VLA，一种触觉引导的在线强化学习框架，将触觉反馈与策略细化相结合，用于接触丰富的操作。我们的方法引入了一个由触觉信号推导、感知力旋量（wrench）的 VLA 来预测参考动作和未来的力旋量序列，同时使用一个轻量级在线强化学习模块来细化参考动作。为了稳定地从探索性策略生成的数据与人工干预数据的混合中学习，我们引入了干预截尾的 critic，防止干预之后取得的成功被错误地归功于干预之前由策略生成的动作。在包括门闩操作、咖啡杯放置和鸡蛋抓取在内的长时程接触丰富任务上的真实机器人实验表明，与强基线相比，TORL-VLA 在子任务和完整任务层面均提高了成功率，并提升了限时执行效率。

PBSD: Privileged Bayesian Self-Distillation for Long-Horizon Credit Assignment

PBSD：面向长时程信用分配的特权贝叶斯自蒸馏

Authors: Yang Tian, Rui Wang, Xumeng Wen, Junjie Li, Shizhao Sun, Lei Song, Jiang Bian, Bo Zhao
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.09348
Pdf link: https://arxiv.org/pdf/2606.09348
Abstract Long-horizon agentic tasks pose a fundamental credit assignment challenge for outcome-based reinforcement learning: trajectory-level rewards verify final correctness but provide limited guidance on which intermediate reasoning steps or tool interactions contribute to the outcome. The difficulty is especially pronounced in multi-turn search agents, where successful trajectories may contain misleading actions and failed trajectories may contain valuable evidence-gathering steps. We propose PBSD (Privileged Bayesian Self-Distillation), a Bayes-calibrated self-distillation method for fine-grained credit assignment under sparse final rewards. PBSD measures trajectory quality through the posterior-to-prior probability ratio of the verified answer and applies Bayes' rule to convert this hard-to-estimate answer-side ratio into a tractable likelihood ratio between a standard student model and a privileged answer-conditioned teacher model. Autoregressive decomposition of this Bayesian evidence score yields turn-level signals that identify whether each intermediate turn supports or undermines the verified outcome. Consequently, PBSD provides a principled and elegant reweighting scheme that transforms sparse outcome supervision into Bayes-calibrated turn-level credit signals, while remaining fully compatible with standard policy optimization. Experiments demonstrate that PBSD consistently enhances performance across both in-domain and out-of-domain settings, and effectively transfers knowledge from short-context training to long-context inference, suggesting that its fine-grained credit assignment mechanism facilitates more effective policy learning and yields improved generalization.
中文摘要 长时程智能体任务给基于结果的强化学习带来了根本性的信用分配挑战：轨迹级奖励能够验证最终结果是否正确，但对于哪些中间推理步骤或工具交互促成了该结果，只能提供有限的指导。这一困难在多轮搜索智能体中尤为突出：成功的轨迹可能包含误导性动作，而失败的轨迹也可能包含有价值的证据收集步骤。我们提出了 PBSD（特权贝叶斯自蒸馏），一种经贝叶斯校准的自蒸馏方法，用于在稀疏的最终奖励下进行细粒度信用分配。PBSD 通过已验证答案的后验概率与先验概率之比来衡量轨迹质量，并应用贝叶斯法则，将这一难以估计的答案侧比值转换为标准学生模型与以答案为条件的特权教师模型之间易于计算的似然比。对该贝叶斯证据分数进行自回归分解，可得到回合级信号，用以识别每个中间回合是支持还是削弱已验证的结果。因此，PBSD 提供了一种有原则且简洁的重加权方案，将稀疏的结果监督转化为经贝叶斯校准的回合级信用信号，同时与标准策略优化完全兼容。实验表明，PBSD 在域内和域外设置下均能持续提升性能，并能有效地将知识从短上下文训练迁移到长上下文推理，这说明其细粒度信用分配机制有助于更有效的策略学习并带来更好的泛化。
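
The Bayes step in the abstract rests on the identity $p(a\mid\tau)/p(a) = p(\tau\mid a)/p(\tau)$ for a trajectory $\tau$ and verified answer $a$. Taking logs and factorizing both likelihoods turn by turn gives a sum of per-turn log-ratios. The sketch below illustrates only that decomposition. It assumes the per-turn log-probabilities under the answer-conditioned teacher and the plain student are already available, and the function name and numbers are hypothetical; how PBSD handles environment observations and turns these scores into weights is not stated in the abstract.

```python
def turn_level_evidence(student_logps, teacher_logps):
    """Split log[p(tau | a) / p(tau)] into per-turn contributions.

    student_logps[t]: log-probability of turn t under the student (no answer).
    teacher_logps[t]: log-probability of turn t under the teacher that
                      additionally conditions on the verified answer.
    Returns (per_turn_scores, total_score); the total equals
    log[p(a | tau) / p(a)] by Bayes' rule.
    """
    if len(student_logps) != len(teacher_logps):
        raise ValueError("both models must score the same turns")
    per_turn = [t - s for s, t in zip(student_logps, teacher_logps)]
    return per_turn, sum(per_turn)

# Hypothetical three-turn trajectory.
student = [-2.0, -1.5, -3.0]
teacher = [-1.0, -2.5, -3.0]
per_turn, total = turn_level_evidence(student, teacher)
print(per_turn, total)
```

A positive per-turn score means the turn becomes more likely once the answer is known, so it supports the verified outcome; a negative score means it works against it.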

Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short

Reasoning Arena：可验证奖励不足时的推理轨迹锦标赛

Authors: Han Zhou, Adam X. Yang, Laurence Aitchison, Anna Korhonen, Albert Q. Jiang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.09380
Pdf link: https://arxiv.org/pdf/2606.09380
Abstract Reinforcement learning with verifiable rewards (RLVR) has become a leading paradigm for improving the reasoning ability of large language models through outcome-based supervision. However, verifiable rewards frequently become uninformative at the group level: when all sampled traces of a given prompt receive identical rewards, group-relative advantage estimation provides no gradient signal, even though the traces may differ substantially in reasoning quality. We propose Reasoning Arena, an adaptive training framework that routes such non-diverse reward groups to a judge system instead of discarding them. Beyond examining the final answer, Reasoning Arena constructs trace tournaments, where reasoning traces are compared head-to-head to expose finer-grained preferences within the group, converting reasoning quality into rich relative reward signals. To make reward estimation efficient, rather than exhaustively comparing every pair, each new trace is evaluated against a small, dynamically updated pool of previously generated traces as anchors to efficiently establish a relative ranking. We then fit a Bradley-Terry model on the incomplete comparison graph, enabling scalable RL integration without quadratic pairwise comparisons. Empirical results demonstrate that Reasoning Arena consistently outperforms the RLVR baseline by 7.6% on average in competition mathematics and coding benchmarks. By converting otherwise wasted zero-advantage samples into useful gradient updates, our method accelerates training by 27% to 41%, saving nearly 50% of generation compute, and substantially improves overall reasoning performance.
中文摘要 带可验证奖励的强化学习（RLVR）已成为通过基于结果的监督提升大语言模型推理能力的主流范式。然而，可验证奖励在组的层面上经常失去信息量：当同一提示的所有采样轨迹获得相同奖励时，组相对优势估计不提供任何梯度信号，尽管这些轨迹在推理质量上可能差异很大。我们提出了 Reasoning Arena，一种自适应训练框架，它将这类奖励无差异的组交给评判系统处理，而不是直接丢弃。除了检查最终答案之外，Reasoning Arena 还构建轨迹锦标赛，让推理轨迹两两正面比较，以揭示组内更细粒度的偏好，从而将推理质量转化为丰富的相对奖励信号。为了使奖励估计更高效，我们不对每一对轨迹做穷举比较，而是将每条新轨迹与一个小型、动态更新的已生成轨迹池（作为锚点）进行比较，以高效地建立相对排名。随后，我们在不完整的比较图上拟合 Bradley-Terry 模型，无需二次复杂度的成对比较即可实现可扩展的强化学习集成。实证结果表明，在竞赛数学和编程基准上，Reasoning Arena 平均比 RLVR 基线高出 7.6%。通过将原本被浪费的零优势样本转化为有用的梯度更新，我们的方法将训练加速 27% 至 41%，节省近 50% 的生成计算量，并显著提升整体推理性能。
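
Fitting a Bradley-Terry model on an incomplete comparison graph can be sketched with the classical minorization-maximization (MM) update, which only iterates over pairs that were actually compared. This is a generic illustration, not the authors' implementation: the function name, the iteration count, and the toy comparison list are assumptions, and the judge that produces the win/loss outcomes is left out.

```python
from collections import defaultdict

def fit_bradley_terry(comparisons, n_items, iters=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Uses the standard MM update p_i <- W_i / sum_j n_ij / (p_i + p_j),
    where W_i is the win count of item i and n_ij the number of games
    between i and j. Pairs that were never compared contribute nothing.
    Strengths are renormalized to sum to 1 after every sweep.
    """
    wins = [0.0] * n_items
    games = defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1.0
        games[(min(winner, loser), max(winner, loser))] += 1
    strengths = [1.0] * n_items
    for _ in range(iters):
        denom = [0.0] * n_items
        for (i, j), n in games.items():
            share = n / (strengths[i] + strengths[j])
            denom[i] += share
            denom[j] += share
        updated = [
            wins[i] / denom[i] if denom[i] > 0 else strengths[i]
            for i in range(n_items)
        ]
        total = sum(updated)
        strengths = [s / total for s in updated]
    return strengths

# Traces 0 and 2 are never compared directly; trace 1 acts as the anchor.
outcomes = [(0, 1), (0, 1), (1, 0), (1, 2), (1, 2), (2, 1)]
strengths = fit_bradley_terry(outcomes, n_items=3)
print(strengths)
```

Although traces 0 and 2 never meet, the shared anchor places them on a common scale. The fitted strengths can then serve as relative rewards within a group whose verifiable rewards are all identical.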

CapRL++: Unified Reinforcement Learning with Verifiable Rewards for Dense Image and Video Captioning

CapRL++：面向密集图像与视频描述的统一可验证奖励强化学习

Authors: Penghui Yang, Long Xing, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Yibin Wang, Yujie Zhou, Jiazi Bu, Jianze Liang, Qidong Huang, Jiaqi Wang, Feng Wu, Dahua Lin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.09393
Pdf link: https://arxiv.org/pdf/2606.09393
Abstract Image and video captioning are fundamental tasks that bridge the visual and linguistic domains, playing a critical role in pre-training Large Vision-Language Models (LVLMs). Current state-of-the-art captioning models are typically trained with Supervised Fine-Tuning (SFT), a paradigm that relies on expensive, non-scalable annotations and often causes models to memorize specific ground-truth answers, limiting their generality and ability to generate diverse, creative descriptions. To overcome these limitations, we propose applying Reinforcement Learning with Verifiable Rewards (RLVR) to the open-ended task of multimodal captioning. We introduce Captioning Reinforcement Learning++ (CapRL++), a novel reference-free training framework that redefines caption quality through its utility: a high-quality caption should enable a non-visual language model to accurately answer questions about the corresponding visual content. CapRL++ employs a decoupled two-stage pipeline where an LVLM generates a caption, and the objective reward is derived from the accuracy of a separate, vision-free LLM answering Multiple-Choice Questions based solely on that caption. Evaluations on more than 20 image and video benchmarks show that CapRL++ improves dense caption quality and strengthens caption-based pretraining across tasks such as spatial and temporal understanding. Pretraining on scalable image and video caption datasets annotated by CapRL++ yields substantial downstream gains. Furthermore, within the Prism Framework for caption quality evaluation, compact models trained with CapRL++ achieve dense captioning performance comparable to substantially larger models such as Qwen2.5-VL-72B and Qwen3-VL-235B-A22B. These results validate that CapRL++ effectively trains models to produce generalizable, high-fidelity descriptions, establishing a robust foundation beyond the limitations of traditional SFT.
中文摘要 图像和视频字幕是连接视觉和语言领域的基础任务，在大型视觉语言模型（LVLM）的预训练中发挥着关键作用。当前最先进的字幕模型通常采用监督微调（SFT）训练，这种范式依赖昂贵且不可扩展的注释，且常导致模型记忆特定的真实答案，限制了其普遍性和生成多样、富有创意描述的能力。为克服这些限制，我们建议将可验证奖励强化学习（RLVR）应用于多模态字幕的开放式任务。我们介绍了字幕强化学习++（CapRL++），这是一个新颖的无引用训练框架，通过其实用性重新定义了字幕质量：高质量的字幕应使非视觉语言模型能够准确回答对应视觉内容的问题。CapRL++采用解耦的两阶段流水线，LVLM生成一个字幕，客观奖励来自于一个独立且无视觉的大型语言模型仅基于该字幕回答选择题的准确性。对20多个图像和视频基准测试的评估显示，CapRL++提升了密集的字幕质量，并加强了基于字幕的预训练，涵盖空间和时间理解等任务。在由CapRL++注释的可扩展图像和视频说明数据集上预训练，带来了显著的后续收益。此外，在Prism框架用于字幕质量评估中，使用CapRL++训练的紧凑模型能实现与Qwen2.5-VL-72B和Qwen3-VL-235B-A22B等更大模型相当的密集字幕性能。这些结果验证了CapRL++有效训练模型以产生可推广的高保真度描述，建立了超越传统SFT局限的坚实基础。
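
The utility-based reward described above (caption quality measured by how well a vision-free model answers multiple-choice questions from the caption alone) reduces to an accuracy computation. The sketch below shows only that final step under explicit assumptions: `caption_reward`, `keyword_answerer`, and the question format are hypothetical, and the keyword matcher merely stands in for the text-only LLM so the example runs without any model.

```python
from typing import Callable, Sequence, Tuple

# (question text, answer options, index of the correct option)
MCQ = Tuple[str, Sequence[str], int]

def caption_reward(
    caption: str,
    questions: Sequence[MCQ],
    answer_fn: Callable[[str, str, Sequence[str]], int],
) -> float:
    """Fraction of questions answered correctly using only the caption."""
    if not questions:
        return 0.0
    correct = sum(
        1 for question, options, gold in questions
        if answer_fn(caption, question, options) == gold
    )
    return correct / len(questions)

def keyword_answerer(caption: str, question: str, options: Sequence[str]) -> int:
    """Toy stand-in for a vision-free LLM: pick the first option mentioned."""
    text = caption.lower()
    for index, option in enumerate(options):
        if option.lower() in text:
            return index
    return 0  # fall back to the first option when nothing matches

caption = "A red car parked next to a tall tree."
questions = [
    ("What color is the car?", ["blue", "red"], 1),
    ("What is next to the car?", ["tree", "bench"], 0),
    ("How many people are visible?", ["two", "none"], 1),
]
reward = caption_reward(caption, questions, keyword_answerer)
print(reward)
```

The caption here supports two of the three questions, so a more complete caption (one that also states that no people are visible) would earn a higher reward under this scheme.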

PriFT: Prior-Support Guided Supervised Fine-Tuning

PriFT：先验支持引导的监督微调

Authors: Ke Wang, Shuangqi Li, Mathieu Salzmann, Pascal Frossard
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.09396
Pdf link: https://arxiv.org/pdf/2606.09396
Abstract Supervised fine-tuning (SFT) is an efficient approach for downstream task adaptation and often serves as the initialization stage for reinforcement learning (RL), but it can show weaker generalization than RL. A key limitation is its off-policy objective: SFT fits fixed demonstrations token by token, including targets poorly aligned with the model's pretrained distribution, which can lead to overfitting. A recent line of work addresses this issue by assigning larger training weights to tokens better aligned with the current model's predictive distribution, with the intuition that fitting these tokens is less distortive to the model's pretrained knowledge and representations. However, computing the token weights from the model that is currently fine-tuned entangles token weights with the optimization trajectory, inducing self-reinforcing dynamics as the distribution rapidly departs from the pretrained model. To address this, we propose PriFT (Prior-support guided Fine-Tuning), which derives token weights from a frozen pretrained reference to obtain a stable reweighting signal unaffected by fine-tuning. This signal estimates prior support: the extent to which each target token is supported by the pretrained distribution. Across multiple existing token-reweighting rules, switching the reweighting signal from the online model to the pretrained model consistently improves performance. We introduce two instantiations: PriFT-prob uses pretrained token probability, while PriFT-mass selects tokens by cumulative probability mass under the pretrained distribution. Extensive experiments on mathematical reasoning, code generation, and medical question answering show that PriFT achieves state-of-the-art results among SFT baselines and provides a better initialization for subsequent RL training.
中文摘要 监督微调（SFT）是一种高效的任务下游适应方法，常作为强化学习（RL）的初始化阶段，但其泛化力可能不如强化学习。一个关键局限是其非策略目标：SFT按令牌拟合固定演示令牌，包括与模型预训练分布不匹配的目标，可能导致过拟合。最近一项研究通过为更符合当前模型预测分布的标记分配更大的训练权重来解决这个问题，直觉上，拟合这些标记对模型预训练知识和表示的影响更小。然而，从当前微调模型计算词汇权重时，会使词权与优化轨迹纠缠，从而引发自强化的动态，因为分布会迅速偏离预训练模型。为此，我们提出了PriFT（先验支持引导微调），它从冻结的预训练引用中推导出代币权重，获得不受微调影响的稳定重权重信号。该信号估计了先量支持：每个目标代币在预训练分布中得到的支持程度。在多个现有的代币重权重规则中，将在线模型的重权重信号替换为预训练模型，能够持续提升性能。我们引入两种实例化：PriFT-prob使用预训练的token概率，而PriFT-mass则根据预训练分布下的累积概率质量选择token。数学推理、代码生成和医学问答等广泛实验表明，PriFT在SFT基线中取得了最先进的成果，并为后续强化学习提供了更好的初始化。
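
Editor's sketch (not from the paper): the abstract describes weighting each target token by how well a frozen pretrained reference supports it. The snippet below is a minimal illustration under stated assumptions. The function names, the `top_p` threshold, the nucleus-style reading of "cumulative probability mass", and the normalization by the weight sum are all hypothetical choices; the paper's exact rules may differ.

```python
import math

def prob_weights(ref_probs_of_targets):
    # Assumed PriFT-prob reading: weight = probability the frozen
    # pretrained reference assigns to each target token.
    return list(ref_probs_of_targets)

def mass_mask(ref_dists, targets, top_p=0.9):
    # Assumed PriFT-mass reading: keep a target token only if it lies in
    # the smallest set of tokens whose cumulative reference probability
    # reaches top_p (a nucleus-style rule; top_p is a hypothetical knob).
    mask = []
    for dist, target in zip(ref_dists, targets):
        order = sorted(range(len(dist)), key=lambda i: dist[i], reverse=True)
        cumulative, keep = 0.0, False
        for token in order:
            if token == target:
                keep = True
                break
            cumulative += dist[token]
            if cumulative >= top_p:
                break
        mask.append(1.0 if keep else 0.0)
    return mask

def weighted_nll(model_probs_of_targets, weights):
    # Token-weighted negative log-likelihood of the model being fine-tuned.
    # The weights come from the frozen reference, so they stay fixed
    # while the fine-tuned model changes.
    total = sum(weights)
    if total == 0:
        return 0.0
    loss = -sum(w * math.log(p) for w, p in zip(weights, model_probs_of_targets))
    return loss / total

# Toy example: three positions over a 3-token vocabulary.
ref_dists = [[0.7, 0.2, 0.1]] * 3
targets = [0, 1, 2]
mask = mass_mask(ref_dists, targets, top_p=0.8)
loss = weighted_nll([0.5, 0.4, 0.3], mask)
```

In this toy case the third target sits outside the reference nucleus, so it contributes nothing to the loss.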

AliyunConsoleAgent: Training Web Agents in Real-World Cloud Environments via Distillation and Reinforcement Learning

AliyunConsoleAgent：通过蒸馏和强化学习在真实云环境中训练网络代理

Authors: Bojie Rong, Zheyu Shen, Qiaoping Wang, Pengfei Kang, Yang Xu, Yawen Wei, Hanyu Wu, Zhi Zhao, Leihao Pei, Linquan Jiang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.09447
Pdf link: https://arxiv.org/pdf/2606.09447
Abstract We present AliyunConsoleAgent, a web agent framework for automated documentation verification in real-world cloud consoles. Major cloud platforms encompass hundreds of products with rapid feature iteration, causing console UIs to frequently diverge from their corresponding documentation. Verifying that documented procedures accurately reflect the current console and can be executed end-to-end demands an estimated 4 million recurring inspections annually, yet manual coverage remains below 1%. While agent systems built on frontier proprietary models achieve high success rates, their prohibitive cost and data privacy constraints preclude large-scale deployment. We propose a two-stage training paradigm: supervised fine-tuning (SFT) on distilled frontier-model trajectories, followed by reinforcement learning using Group Relative Policy Optimization (GRPO) and a dual-channel outcome reward model in real cloud environments. To support large-scale RL training, we construct a high-determinism rollout system featuring Terraform-based resource pre-provisioning and LLM-driven on-demand provisioning, which effectively isolates environment noise from the training signal. We further introduce a rule-based reward evaluation protocol grounded in backend audit logs, providing objective, reward-hacking-resistant outcome judgment. Our model evolves from mechanical instruction following to autonomous decision-making with cloud console and product-specific understanding. Experiments on a challenging 278-task benchmark where the best frontier model achieves only 65.34% demonstrate that AliyunConsoleAgent-32B achieves a 63.52% mean success rate -- a 20.24 percentage-point improvement over the base model, narrowing the gap to the best frontier proprietary model to 1.82 pp (bootstrap 95% CI [-1.27, 7.39]) -- at 92% lower inference cost.
中文摘要 我们介绍AliyunConsoleAgent，一个用于真实云控制台自动化文档验证的网络代理框架。主要云平台涵盖数百种产品，功能迭代迅速，导致控制台界面经常与相应文档不符。验证文档中程序是否准确反映当前控制台并可端到端执行，每年估计需进行400万次定期检查，但人工覆盖率仍低于1%。虽然基于前沿专有模型的代理系统成功率很高，但其高昂的成本和数据隐私限制阻碍了大规模部署。我们提出了一个两阶段训练范式：对精炼前沿模型轨迹进行监督微调（SFT），随后是基于群相对策略优化（GRPO）和在真实云环境中的双通道结果奖励模型进行强化学习。为支持大规模强化学习训练，我们构建了一个高确定性推广系统，采用基于Terraform的资源预配置和基于大型语言模型的按需配置，有效隔离环境噪声与训练信号。我们还进一步引入了基于后端审计日志的基于规则的奖励评估协议，提供客观且抗奖励黑客攻击的结果判断。我们的模式从机械指令跟随发展到基于云控制台和产品特定理解的自主决策。在一个具有挑战性的278任务基准测试中，最佳前沿模型仅达到65.34%的成功率，显示AliyunConsoleAgent-32B的平均成功率为63.52%，比基础模型提升20.24个百分点，将与最佳前沿专有模型的差距缩小到1.82 pp（自助95%置信区间[-1.27,7.39]），推理成本降低了92%。
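
Editor's sketch (not from the paper): GRPO, the RL algorithm named in the abstract, scores each rollout relative to the other rollouts sampled for the same task instead of using a learned value baseline. The snippet below shows that group-relative normalization in its commonly described form. The reward values are made up, `eps` is a hypothetical stabilizer, and the paper's dual-channel reward model and audit-log rules are not modeled here.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    # Normalize each rollout's reward by the mean and (population)
    # standard deviation of its own group of rollouts.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts of one task with binary outcome rewards (illustrative values).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every rollout gets the same reward carries no learning signal.
flat = group_relative_advantages([0.0, 0.0, 0.0])
```

Successful rollouts receive positive advantages and failed ones negative, which is why environment noise that flips outcomes within a group feeds directly into the gradient.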

Emergence of Context Characteristics Sensitivity in Large Language Models

大型语言模型中上下文特征敏感性的出现

Authors: Nadya Yuki Wangsajaya, Haeun Yu, Isabelle Augenstein
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.09525
Pdf link: https://arxiv.org/pdf/2606.09525
Abstract During instruction fine-tuning (IFT), large language models (LLMs) learn to follow instructions by using the provided context to answer a query. While prior work has studied how context characteristics correlate with context usage by the LLM, this analysis has been limited to inference time, leaving open how these relationships are acquired in the first place. Here, we measure how models' sensitivity to such characteristics shifts across successive IFT stages: supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement learning with verifiable rewards (RLVR). Experiments across four models and three datasets show that SFT makes models more likely to use contexts that are easy to understand, such as those with greater length, higher context-query similarity, and higher fluency. Post-SFT dynamics may either reinforce or resolve these preferences depending on the training dataset. Our findings reveal that context usage is actively reshaped at each IFT stage, and that designing a balanced IFT dataset is important for ensuring robust context utilization in instruction-tuned models.
中文摘要 在指令微调（IFT）过程中，大型语言模型（LLMs）通过利用提供的上下文来回答问题，学习遵循指令。虽然此前已有研究研究上下文特征如何与LLM的上下文使用相关，但该分析仅限于推断时间，导致这些关系最初是如何获得的。在这里，我们测量模型对这些特征的敏感度在后续IFT阶段的变化：监督式微调（SFT）、直接偏好优化（DPO）和带可验证奖励的强化学习（RLVR）。在四个模型和三个数据集上的实验表明，SFT使模型更可能使用易于理解的上下文，如包含高长度、上下文-查询相似性和流畅性。SFT后的动态可能强化或解决这些偏好，具体取决于训练数据集。我们的发现表明，上下文使用在每个IFT阶段都会被积极重塑，设计一个平衡的IFT数据集对于确保指令调优模型的稳健上下文利用至关重要。

Safe-RULE: Safe Reinforcement UnLEarning

安全规则：安全强化释放

Authors: Shixiong Jiang, Taozheng Zhu, Fanxin Kong
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.09559
Pdf link: https://arxiv.org/pdf/2606.09559
Abstract Offline safe reinforcement learning (Safe RL) enables policy learning without online interactions, making it suitable for safety-critical systems such as robotic systems. However, its reliance on static datasets exposes offline Safe RL to data poisoning attacks, where adversaries inject malicious samples that compromise safety and induce unsafe policy behavior. In this work, we propose a new learning paradigm, named safe reinforcement unlearning (Safe-RULE), which serves as a defense framework to remove the influence of poisoned data without retraining from scratch or requiring access to the original training environment. We further extend reinforcement unlearning to offline Safe RL by explicitly accounting for both task performance and safety constraints during the unlearning process. Experiments across benchmark Safe RL tasks demonstrate that our approach effectively enhances safety performance against data poisoning attacks.
中文摘要 离线安全强化学习（Safe RL）使策略学习无需在线互动，非常适合机器人系统等安全关键系统。然而，其对静态数据集的依赖使离线安全强化学习暴露于数据中毒攻击，攻击者注入恶意样本，破坏安全并引发不安全的策略行为。在本研究中，我们提出了一种新的学习范式，称为安全强化去学习（Safe-RULE），作为防御框架，在无需从零再培训或访问原始训练环境的情况下，消除中毒数据的影响。我们进一步将强化式复学扩展到离线安全强化学习，明确考虑了任务执行和安全约束。在Safe RL基准任务中的实验表明，我们的方法有效提升了对数据中毒攻击的安全性能。

Shape Formation for the Cooperative Transportation of Arbitrary Objects Using Multi-Agent Reinforcement Learning

利用多智能体强化学习协同运输任意对象的形状形成

Authors: Mohamed Sayed, Wolfram Burgard, Tanja Katharina Kaiser
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.09610
Pdf link: https://arxiv.org/pdf/2606.09610
Abstract Cooperative object transportation is essential in numerous domains, ranging from industrial to domestic services. A popular transportation strategy is to carry objects on top of multi-robot systems. The corresponding task is typically solved by decomposing it into three interconnected subproblems: formation control, cooperative navigation, and collision avoidance. A particular challenge posed by real-world objects is their potentially arbitrary shape and non-uniform mass distribution, necessitating robot formations that securely support the object. In this work, we address the challenge of pattern formation control for transporting such real-world objects by proposing a novel multi-agent reinforcement learning approach. Our approach enables a multi-robot system to autonomously position itself underneath an object to support its weight while avoiding obstacles during the formation process. Our evaluations with diverse environments and varying numbers of robots show that our approach leads to policies that reliably produce balanced formations and generalize to cluttered scenes and objects with complex geometry and non-uniform mass distribution.
中文摘要 合作式物品运输在多个领域都至关重要，包括工业到家庭服务。一种流行的运输策略是将物品搬运到多机器人系统之上。相应任务通常通过分解为三个相互关联的子问题来解决：编队控制、协同导航和碰撞避免。现实世界中的物体面临的一个特别挑战是它们可能任意的形状和不均匀的质量分布，因此需要机器人编队来安全地支撑该物体。本研究通过提出一种新颖的多智能体强化学习方法，解决了模式形成控制在运输此类现实世界对象中的挑战。我们的方法使多机器人系统能够自主定位在物体下方，以支撑其重量，同时在形成过程中避开障碍物。我们对多样环境和不同数量机器人的评估表明，我们的方法能够可靠地产生平衡的地层，并推广到复杂几何形状和质量分布不均匀的杂乱场景和物体。

Learning to Attack and Defend: Adaptive Red Teaming of Language Models via GRPO

学习攻防：通过GRPO实现的自适应红队语言模型

Authors: Blake Bullwinkel, Eugenia Kim, Amanda Minnich, Mark Russinovich
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.09701
Pdf link: https://arxiv.org/pdf/2606.09701
Abstract AI red teaming must continually adapt to evolving attackers and defenders. Reinforcement learning offers a promising approach to discovering novel attacks, and co-training methods can produce more robust defenders in tandem. Recent works have demonstrated the efficacy of attacker-defender co-training by applying PPO and DPO, but report that GRPO is unstable in this setting. We introduce AdvGRPO, a co-training framework that makes GRPO viable for joint attacker-defender optimization using dense multi-channel rewards and decoupled advantage normalization. Training progresses through a curriculum from single-turn to closed-loop multi-turn attacks before bootstrapping co-training, where attacker and defender models are updated in alternation. We show that our method can produce highly effective and transferable attacks and that co-trained defenders outperform baselines on safety benchmarks.
中文摘要 AI红队必须不断适应不断变化的攻防方。强化学习为发现新攻击提供了有前景的方法，协同训练方法可以协同培养更强的防御者。近期研究通过应用PPO和DPO证明了攻防协同训练的有效性，但报告称GRPO在此环境中不稳定。我们介绍了AdvGRPO，一种协同训练框架，使GRPO在密集多通道奖励和解耦优势规范化下，适用于联合攻防优化。训练课程从单回合到闭环多回合攻击，然后进行自助式共训，进攻方和防守方模型交替更新。我们证明了我们的方法能够产生高效且可转移的攻击，且共训练的防御者在安全基准上优于基线。

The Neutral Mask: How RLHF Provides Shallow Alignment while Leaving Partisan Structure Intact in a Large Language Model

中性掩膜：RLHF如何在大型语言模型中保持党派结构的同时，提供浅层对齐

Authors: Wendy K. Tam
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.09735
Pdf link: https://arxiv.org/pdf/2606.09735
Abstract The ambition behind alignment training is to make large language models safe and useful. The primary mechanism, reinforcement learning from human feedback (RLHF), shapes the behavior of deployed language models by aligning them with "human values." Yet the process is opaque. What values are being encoded; whose values are they; and how does RLHF encode them? A growing body of evidence suggests that RLHF produces only functional compliance rather than deep alignment. We offer a mechanistic case study of this phenomenon for partisan political orientation with a comparison of the internal representations of Llama 3.1 8B before and after RLHF. We show that RLHF does not remove the structured partisan direction in the base model. Instead, it compresses the variance of the partisan signal to generate consistently balanced and non-partisan output. Sparse autoencoder decomposition reveals that policy-encoding features, which activate sporadically in the base model, are completely inactive in the Instruct model. Feature-level steering experiments confirm the causal disconnect. RLHF thus encodes a norm of political neutrality, not by erasing the model's knowledge of partisanship, but by severing the causal pathway from partisan geometry to output generation. Importantly, this neutrality is functional, not structural, so the underlying geometry that enables partisan steering remains intact. Mechanisms that bypass RLHF's guardrails, such as inferring and amplifying a user's partisan identity, reactivate partisan generation. If RLHF operates by disconnecting rather than removing value-laden structure, then the same pattern may hold for other value domains, and the aligned model's behavior may be more fragile than its outputs suggest.
中文摘要 对齐训练的目标是让大型语言模型安全且实用。主要机制是来自人类反馈的强化学习（RLHF），通过使部署的语言模型与“人类价值观”对齐来塑造其行为。然而，这一过程是不透明的。编码的值是哪些;他们的价值观是谁的;RLHF是如何编码它们的？越来越多的证据表明，RLHF仅产生功能性依从性，而非深度对齐。我们提供了一个关于该现象的机制性案例研究，比较了 Llama 3.1 8B 在 RLHF 前后内部表征的表现。我们证明RLHF并未去除基础模型中的结构化党派方向。相反，它压缩党派信号的方差，以生成一致平衡且非党派的输出。稀疏自编码器分解表明，在基础模型中偶尔激活的策略编码特征在Instruct模型中完全不活跃。功能级转向实验证实了因果断层。因此，RLHF编码了一个政治中立的规范，这并非通过抹去模型中党派对立的知识，而是切断了党派几何与产出生成之间的因果路径。重要的是，这种中立性是功能性的，而非结构性的，因此支持党派引导的基础几何结构得以保持完整。绕过RLHF防护措施的机制，比如推断和放大用户党派身份，重新激活党派生成。如果RLHF通过断开而非移除带值的结构来运作，那么其他值域可能也存在同样的模式，且对齐模型的行为可能比其输出显示的更脆弱。

Rethinking the Divergence Regularization in LLM RL

重新思考LLM强化语言中的发散正则化

Authors: Jiarui Yao, Xiangxin Zhou, Penghui Qi, Wee Sun Lee, Liefeng Bo, Tianyu Pang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.09821
Pdf link: https://arxiv.org/pdf/2606.09821
Abstract Reinforcement learning (RL) has become a key component of post-training large language models (LLMs). In practice, LLM RL is often off-policy because of training-inference mismatch and policy staleness, making trust-region control essential for stable optimization. Mainstream methods such as PPO and GRPO approximate this control with a ratio-clipping mechanism, but the importance ratio can be a poor proxy for distributional shift in long-tailed vocabularies. Recent work such as DPPO addresses this mismatch by replacing ratio-based clipping with a divergence-based mask, yielding a trust region defined by the sampled token's absolute probability shift. However, DPPO still relies on a hard mask: once a token crosses the trust-region boundary in a harmful direction, its gradient is discarded rather than corrected. To address this, we propose Divergence Regularized Policy Optimization (DRPO), which replaces the hard mask with a smooth advantage-weighted quadratic regularizer on policy shift. DRPO preserves the same trust-region geometry as DPPO while inducing bounded, continuous gradient weights that attenuate diverging updates and provide corrective signals beyond the boundary. Experiments across model scales, architectures, and precision settings show that DRPO improves the stability and efficiency of LLM RL training.
中文摘要 强化学习（RL）已成为大型语言模型（LLM）训练后的重要组成部分。实际上，LLM RL常因训练-推理不匹配和策略陈旧而偏离策略，因此信任区域控制对稳定优化至关重要。主流方法如PPO和GRPO通过比率裁剪机制近似该控制，但重要性比往往无法很好地代理长尾词汇的分布变化。近期工作如DPPO通过用基于发散度的掩码替代基于比率的裁剪，从而产生由抽样令牌的绝对概率偏移定义的信任区域来解决这一不匹配问题。然而，DPPO仍然依赖硬掩膜：一旦令牌以有害方向越过信任区域边界，其梯度会被丢弃而非纠正。为此，我们提出了发散正则化策略优化（DRPO），它用平滑优势加权的二次正则子取代了硬掩膜。DRPO保持与DPPO相同的信任区域几何，同时诱导有界连续梯度权重，衰减发散更新并提供边界外的纠正信号。跨模型尺度、架构和精度设置的实验表明，DRPO提升了LLM RL训练的稳定性和效率。
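
Editor's sketch (not from the paper): the abstract argues that the importance ratio can be a poor proxy for distributional shift in long-tailed vocabularies, and that a trust region on the sampled token's absolute probability shift behaves differently. The arithmetic below makes that contrast concrete. The thresholds `eps` and `delta` and the two example tokens are hypothetical, and the snippet does not implement DPPO's mask or DRPO's quadratic regularizer.

```python
def ratio_in_clip(p_old, p_new, eps=0.2):
    # PPO/GRPO-style check: is the importance ratio inside [1 - eps, 1 + eps]?
    ratio = p_new / p_old
    return 1.0 - eps <= ratio <= 1.0 + eps

def shift_in_region(p_old, p_new, delta=0.05):
    # Divergence-style check on the sampled token's absolute probability shift.
    return abs(p_new - p_old) <= delta

# A rare token whose probability triples: huge ratio, negligible mass moved.
rare = (1e-5, 3e-5)
# A common token with a modest ratio: almost a tenth of the mass moved.
common = (0.50, 0.59)

rare_checks = (ratio_in_clip(*rare), shift_in_region(*rare))
common_checks = (ratio_in_clip(*common), shift_in_region(*common))
```

Under these assumed thresholds the two criteria disagree on both tokens: ratio clipping rejects the rare token and accepts the common one, while the absolute-shift criterion does the opposite.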

An Agency-Transferring Model-Free Policy Enhancement Technique

一种无需机构转移的无模式保单增强技术

Authors: Anton Bolychev, Georgiy Malaniya, Sinan Ibrahim, Pavel Osinenko
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2606.09825
Pdf link: https://arxiv.org/pdf/2606.09825
Abstract Training reinforcement learning (RL) policies from scratch is costly: it requires careful reward and environment design, extensive tuning, and substantial computation. Yet many control problems already have a functional but suboptimal policy available as a baseline. This paper proposes a method for embedding such a baseline into the RL training process, simultaneously improving training efficiency relative to from-scratch methods and producing a learning policy that outperforms the baseline. At each step, the method arbitrates between the baseline policy and a trainable learning policy, initially relying strongly on the baseline policy and then progressively transferring agency to the learning policy. By the end of training, the learning policy is a standalone neural network that operates without baseline policy support. The paper formalizes what it means for the baseline policy to be functional: under this policy, the agent reaches a goal set and remains there with high probability. The proposed arbitration mechanism is designed to exploit this property during training, yielding high goal-reaching rates right from the beginning of training. A theoretical analysis provides a formal interpretation of this behavior under stated assumptions and extends it to the final baseline-free regime, where explicit lower bounds are derived for the goal-reaching probability of the standalone learning policy. Empirical results on continuous-control benchmarks show that the proposed method achieves returns that match or exceed those of competitive approaches, while maintaining the highest goal-reaching rates throughout training among the compared methods -- including in the final stage, where the learning policy operates without any baseline support.
中文摘要 从零开始建立训练强化学习（RL）策略成本高昂：需要精心设计奖励和环境，进行大量调优，以及大量计算。然而，许多控制问题已经有一个功能正常但不那么优的基准策略。本文提出了一种将此类基线嵌入强化学习训练过程的方法，同时提升相对于从零开始方法的训练效率，并制定出优于基线的学习策略。在每一步，方法在基线政策和可训练学习策略之间进行仲裁，最初强烈依赖基线策略，然后逐步将能动性转移至学习策略。培训结束时，学习策略已成为一个独立的神经网络，无需基线策略支持即可运行。论文形式化了基线策略的功能性含义：在该策略下，代理达到目标设定并保持该目标的概率很高。拟议的仲裁机制旨在利用这一特性，从训练一开始就实现高目标达成率。理论分析在陈述假设下的形式化解释了这种行为，并将其推广到最终无基线的状态，在那里明确推导出独立学习策略达标概率的下限。连续控制基准的实证结果表明，所提方法的回报与竞争方法相当甚至超过，同时在整个训练过程中保持最高的目标达成率——包括在学习策略缺乏基线支持的最终阶段。
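
Editor's sketch (not from the paper): the abstract describes a per-step arbitration between a baseline policy and a trainable policy, with agency shifting toward the learner over training. The abstract does not state the arbitration rule, so the snippet below assumes a simple randomized switch with a linearly decaying baseline probability. `baseline_prob`, `arbitrate`, and both toy policies are hypothetical names introduced only for illustration.

```python
import random

def baseline_prob(step, total_steps):
    # Assumed schedule: rely fully on the baseline at step 0 and
    # not at all once training reaches total_steps.
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return 1.0 - fraction

def arbitrate(obs, baseline_policy, learning_policy, step, total_steps, rng):
    # Pick which policy acts at this step; returns the action and its source.
    if rng.random() < baseline_prob(step, total_steps):
        return baseline_policy(obs), "baseline"
    return learning_policy(obs), "learner"

# Toy policies on a scalar observation (placeholders, not real controllers).
def baseline(obs):
    return -0.5 * obs

def learner(obs):
    return -1.0 * obs

rng = random.Random(0)
early = [arbitrate(1.0, baseline, learner, 0, 100, rng)[1] for _ in range(20)]
late = [arbitrate(1.0, baseline, learner, 100, 100, rng)[1] for _ in range(20)]
```

With this schedule every early action comes from the baseline and every final action from the learner, which mirrors the abstract's claim that the learning policy ends up operating without baseline support.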

Keyword: diffusion policy

Latent Diffusion Policy: Shaping Latent Spaces for Diffusion-Based Robotic Manipulation

潜在扩散政策：塑造基于扩散的机器人操作的潜在空间

Authors: Zhexuan Zhou, Yichen Lai, Jinhao Zhang, Huizhe Li, Youmin Gong, Jie Mei
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.08657
Pdf link: https://arxiv.org/pdf/2606.08657
Abstract Diffusion-based visuomotor policies operating directly in raw action spaces conflate scene comprehension with trajectory generation within a single denoising process. The resulting velocity field must simultaneously encode scene information and generate precise trajectories, increasing learning complexity and limiting performance on tasks demanding precise temporal coordination across multiple arms. To simplify this joint learning problem, we introduce Latent Diffusion Policy (LDP), a two-stage framework performing flow matching in a deliberately shaped latent space. By absorbing scene understanding into an observation-conditioned CVAE encoder, LDP concentrates the conditional distribution of each observation. Consequently, the flow model avoids implicitly resolving scene-dependent structures; instead, it generates within a pre-concentrated distribution featuring a smoother velocity field, simplifying learning from limited demonstrations. Furthermore, to capture temporal dependencies among latent tokens, LDP trains with per-token diffusion forcing and employs staircase inference sampling to resolve the resulting distributional mismatch. We also propose reconstruction FID (rFID) as a lightweight proxy predicting downstream task success solely from latent space statistics. On coordination-intensive tasks from RoboTwin 2.0, LDP outperforms DP3 by a substantial margin and transfers effectively to real-world bimanual deployments.
中文摘要 基于扩散的视觉运动策略直接在原始动作空间中运行，将场景理解与轨迹生成混为一谈，在单一去噪过程中实现。由此产生的速度场必须同时编码场景信息并生成精确轨迹，增加学习复杂度，限制了在需要跨多臂精确时间协调任务中的性能。为了简化这一联合学习问题，我们引入了潜在扩散策略（LDP），这是一个在有意塑造的潜在空间中进行流匹配的两阶段框架。通过将场景理解吸收到观测条件的CVAE编码器中，LDP集中了每个观测值的条件分布。因此，流模型避免隐式解析场景相关的结构;相反，它在预集中分布中生成，具有更平滑的速度场，简化了有限演示的学习。此外，为了捕捉潜在符号之间的时间依赖性，LDP采用每个符号扩散强制训练，并采用阶梯推断采样来解决由此产生的分布不匹配。我们还提出重建FID（rFID）作为轻量级代理，仅凭潜空间统计预测下游任务成功。在RoboTwin 2.0的协调密集任务中，LDP的表现远超DP3，并能有效实现实际双手部署。
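
Editor's sketch (not from the paper): the abstract says LDP performs flow matching in a latent space and learns a velocity field. The snippet below shows the widely used linear-path form of flow matching, where the regression target is the constant velocity from a noise sample to a data sample, plus Euler integration at inference. The choice of a linear path, the function names, and the toy vectors are assumptions. The CVAE encoder, per-token diffusion forcing, and staircase sampling are not modeled.

```python
def flow_matching_pair(x0, x1, t):
    # Linear interpolation between noise x0 and data x1 at time t in [0, 1],
    # together with the velocity target a flow model would regress onto.
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity

def euler_sample(velocity_fn, x0, steps=10):
    # Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps.
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

noise = [0.0, 0.0]
latent = [1.0, -2.0]
midpoint, target_velocity = flow_matching_pair(noise, latent, 0.5)

# Stand-in for a perfectly trained model on this single pair.
def oracle_velocity(x, t):
    return target_velocity

generated = euler_sample(oracle_velocity, noise, steps=10)
```

With the oracle velocity, integration recovers the target latent up to floating-point error. A smoother learned velocity field, as the abstract argues, makes this integration easier to approximate from limited demonstrations.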

Guided Discovery of New Behaviors using Diffusion Policies

利用扩散策略引导发现新行为

Authors: Dian Yu, Sebastian Sanokowski, Majid Khadiv
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.08743
Pdf link: https://arxiv.org/pdf/2606.08743
Abstract Diffusion models have become a powerful tool for generative modeling in robotics, with diffusion policies excelling at modeling multimodal action-trajectory distributions. However, when demonstrations are limited, standard sampling often reproduces dominant behaviors while neglecting valid but rare modes, limiting the discovery of novel solutions. Existing approaches, such as guidance methods or combining reinforcement learning with diffusion, either push samples into infeasible regions or struggle to escape local minima, failing to systematically uncover diverse behaviors. To address these challenges, we propose a framework that combines Feynman-Kac correctors with a novel guiding potential that systematically guides diffusion policy samples towards promising yet underrepresented samples. These trajectories are refined using sampling-based trajectory optimization and reincorporated into the training set to retrain the diffusion policy. Our method effectively mines and repairs novel trajectories, enabling the systematic discovery of diverse and executable behaviors. We demonstrate the effectiveness of our framework across a range of manipulation environments, consistently discovering new behaviors.
中文摘要 扩散模型已成为机器人生成建模的强大工具，扩散策略在建模多模态动作-轨迹分布方面表现出色。然而，当演示有限时，标准抽样往往重现主导行为，忽视有效但罕见的模式，限制了新解的发现。现有方法，如引导方法或将强化学习与扩散结合，要么将样本推入不可行的区域，要么难以突破局部极小值，未能系统地发现多样化的行为。为应对这些挑战，我们提出了一个结合费曼-Kac校正器与一种新颖的指导潜能的框架，系统地引导扩散政策样本朝向有前景但代表性不足的样本。这些轨迹通过基于采样的轨迹优化进行优化，并重新纳入训练集以重新训练扩散策略。我们的方法有效挖掘和修复新路径，使系统性发现多样化且可执行的行为成为可能。我们展示了该框架在多种操控环境中的有效性，持续发现新的行为。

Unifying Object-Centric World Models and Diffusion Policy: A Hierarchical Framework for Multi-Stage Robotic Tasks

统一以对象为中心的世界模型与扩散政策：多阶段机器人任务的层级框架

Authors: Raktim Gautam Goswami, Prashanth Krishnamurthy, Yann LeCun, Farshad Khorrami
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.08775
Pdf link: https://arxiv.org/pdf/2606.08775
Abstract Visual world models have shown great potential in learning complex system dynamics. Recent advancements leverage these models as transition functions within Model Predictive Control (MPC) frameworks to solve various control tasks. When applied to robotics, however, they are limited to single-stage tasks such as reaching or grasping, and struggle with multi-stage ones that demand complex sequential planning. In this work, we introduce WorldDP, a world model framework designed for multi-stage robotic manipulation. Our hierarchical approach utilizes a high-level world model as a transition function to optimize for feasible subgoals during runtime, which are subsequently reached by a low-level Diffusion Policy. To further aid in learning dynamics and planning, we incorporate object-centric representations that decouple environmental entities and enable us to plan sequentially with respect to each. Evaluated across several robotics benchmarks, WorldDP consistently outperforms existing baselines, validating that coupling the world model's physically grounded planning with diffusion policy's efficient execution yields superior multi-stage performance.
中文摘要 可视化世界模型在学习复杂系统动力学方面展现出巨大潜力。近期的进展利用这些模型作为模型预测控制（MPC）框架中的过渡函数，解决各种控制任务。然而，应用于机器人时，它们仅限于单阶段任务，如伸手或抓握，而在需要复杂顺序规划的多阶段任务中则显得困难。在本研究中，我们介绍了WorldDP，一个为多阶段机器人操作设计的世界模型框架。我们的分层方法利用高层世界模型作为过渡函数，在运行时优化可行子目标，这些子目标随后通过低层扩散策略实现。为了进一步帮助学习动态和规划，我们采用了以对象为中心的表征，将环境实体解耦，使我们能够针对每个实体进行顺序规划。在多个机器人基准测试中，WorldDP持续优于现有基线，验证了将世界模型物理基础规划与扩散策略高效执行相结合，能带来更优越的多阶段性能。