Arxiv Papers of Today

Generated: 2026-06-26 18:54:37 (UTC+8); arXiv release time: 2026-06-26 20:00 EDT (2026-06-27 08:00 UTC+8)

44 related papers today

Keyword: reinforcement learning

Helpfulness Hurts: Domain-Dependent Degradation of Mid-Trained Compassion Values Under Post-Training

Authors: Jasmine Brazilek, Juliana Seawell
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2606.26102
Pdf link: https://arxiv.org/pdf/2606.26102
Abstract Standard post-training pipelines apply supervised fine-tuning (SFT) and reinforcement learning (RL) to make language models helpful, but these processes may inadvertently degrade values instilled during pre-training. We investigate whether the domain of post-training data differentially affects the retention of animal compassion values in a Llama 3.1 8B model mid-trained on compassion-oriented synthetic data, using both SFT (helpfulness via Dolly-15k vs. coding via Magicoder-110K) and GRPO (helpfulness via RLHFlow vs. coding via Magicoder), evaluated on the Animal Harm Benchmark (AHB 2.2) and MORU benchmark (Moral Reasoning Under Uncertainty). Helpfulness training significantly degrades animal compassion relative to coding training on AHB (SFT: 35.7% vs. 65.2%; GRPO: 18.7% vs. 32.0%), replicating across two independent helpfulness datasets and two training paradigms. On English MORU items, helpfulness training degrades general moral reasoning by 25.5 percentage points (46.4% vs. 71.9%), a striking gap that rivals the compassion effect in magnitude. However, this effect does not transfer cross-lingually: on the multilingual MORU benchmark, the domain effect disappears (SFT: 52.3% vs. 51.2%). In contrast, the animal compassion effect transfers consistently across languages, with Magicoder's AHB percentage-point gain over the base model 4.5 times larger on non-English items than English items. This divergence suggests that values instilled through mid-training are encoded more deeply and cross-lingually than reasoning improvements from domain-specific post-training. These results suggest that, for labs building on value-laden mid-training, coding-domain post-training may better preserve mid-trained values than helpfulness post-training without harming general reasoning capabilities.
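
The gaps quoted in this abstract are differences in percentage points, not relative changes. A minimal sketch that recomputes them from the quoted figures; the dictionary layout and the function name `gap_pp` are illustrative assumptions, not taken from the paper.

```python
# Scores in percent as quoted in the abstract: (helpfulness-trained, coding-trained).
scores = {
    "AHB / SFT": (35.7, 65.2),
    "AHB / GRPO": (18.7, 32.0),
    "MORU English": (46.4, 71.9),
    "MORU multilingual / SFT": (52.3, 51.2),
}


def gap_pp(helpfulness, coding):
    """Coding-minus-helpfulness gap in percentage points, rounded to one decimal."""
    return round(coding - helpfulness, 1)


gaps = {name: gap_pp(h, c) for name, (h, c) in scores.items()}
for name, gap in gaps.items():
    print(f"{name}: {gap:+.1f} pp")
```

A positive gap means the coding-trained model scored higher. The multilingual MORU gap comes out slightly negative, which matches the abstract's statement that the domain effect disappears there.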

DocArena: Turning Raw Documents into Controllable Training Environments for Document Search Agents

Authors: Jiamian Wang, Ruiyi Zhang, Tong Yu, Jing Shi, Samyadeep Basu, Rajiv Jain, Zhiqiang Tao, Tong Sun
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.26122
Pdf link: https://arxiv.org/pdf/2606.26122
Abstract Recent methods train search agents via reinforcement learning from (question, answer, evidence) tuples without requiring expert trajectories. The tuples serve as the training environment, whose properties directly shape what search strategies and generalization abilities the agent can develop. While prior works have made encouraging progress in improving training data quality, existing environments remain predominantly text-based, and existing approaches can struggle to construct training environments that are controllable, scalable, and able to account for multimodal data. Given this, we propose DocArena, a fully automated data curation pipeline building on the practical need for multimodal document search and question-answering. It transforms raw document collections into training environments for search agents without any human annotation. The pipeline first structures and indexes documents through MLLM-based visual perception, then profiles and leverages the cross-page information distribution to construct reasoning-intensive QA pairs, and finally performs cascaded quality assurance operations via MLLM. We introduce DocArena-79K with QA pairs from 8,336 documents spanning 16 domains and 49 languages. We further design a Doc-Search agent infrastructure that decouples visual perception from the policy model, allowing text-based LLMs to serve as the reasoning backbone for multimodal document retrieval and QA. Under a unified evaluation framework where only the policy model differs, experiments on six multimodal document scenarios and seven text-based QA benchmarks show that agents trained on DocArena data achieve the best performance on both retrieval accuracy and QA quality. Further analysis of agent search behaviors confirms the effectiveness and controllability of the constructed training environment.

Privacy-Aware Agent Collaboration for Dynamic VR Slice Management in 6G SD-RAN

Authors: Khaled M. Naguib, Soumaya Cherkaoui, Mahmoud M. Elmesalawy, Ahmed M. Abd El-Haleem, Ibrahim I. Ibrahim
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.26123
Pdf link: https://arxiv.org/pdf/2606.26123
Abstract Virtual Reality (VR) services in 6G networks require ultra-low latency and high throughput, which presents critical challenges for dynamic resource management in Software-Defined Radio Access Networks (SD-RANs). This work proposes a mobility-driven, privacy-aware Multi-Agent Reinforcement Learning (MARL) framework for VR slice management, in which cooperative agents maximize resource distribution over end-to-end VR links while protecting the privacy of user data. Our approach incorporates mobility prediction and an information bottleneck encoder to facilitate effective and secure agent collaboration. In simulations, comparisons with traditional methods show up to 34% throughput improvement, 28% fewer resources, and 85% less privacy leakage, guaranteeing dependable immersive VR experiences in future 6G environments.

Reinforcement Learning Enables Autonomous Microrobot Navigation and Intervention in Simulated Blood Capillaries

Authors: Jannik Drotleff, Samuel Tovey, Paul Hohenberger, Christoph Lohrmann, Julian Hoßbach, Konstantin Nikolaou, Christian Holm
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Biological Physics (physics.bio-ph)
Arxiv link: https://arxiv.org/abs/2606.26154
Pdf link: https://arxiv.org/pdf/2606.26154
Abstract Autonomous microrobots navigating biological vasculature could enable targeted drug delivery and thrombolysis, yet training control policies for realistic environments remains an open challenge. Prior reinforcement learning (RL) studies of microrobotic navigation have been limited to idealized geometries that omit complex hydrodynamic flow fields, confined branching structures, and dense cellular obstacles found in vivo. Here, we develop a physically grounded simulation of a blood capillary network, incorporating realistic hydrodynamic flow fields, explicit red blood cell dynamics, and anatomically derived branching geometry, and train deep RL agents to navigate it via chemotaxis. We systematically map the physical limits of navigation across robot size and swimming speed, revealing a forbidden regime where Brownian motion and flow overcome propulsion. Successful agents independently discover multiple universal strategy types, including run-and-rotate and energy-efficient search-and-sit policies, regardless of robot parameters. Without retraining, these agents perform targeted blocking and unblocking of capillary flow, restoring throughput to healthy baseline levels. These results establish RL as a viable framework for developing autonomous microrobotic intervention strategies in complex biological environments.

Implementation of reinforcement learning in chemical reaction networks: application to phototaxis as curiosity-driven exploration

Authors: Ruyi Tang, Grégoire Sergeant-Perthuis (LCQB-AG), David Colliaux
Subjects: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Arxiv link: https://arxiv.org/abs/2606.26168
Pdf link: https://arxiv.org/pdf/2606.26168
Abstract Living systems navigate environments using noisy and incomplete sensory signals. In unicellular algae, phototaxis is often modeled as a mechanistic run-tumble process driven by stimulus-response rules. However, such descriptions overlook how organisms actively sample their environment to reduce sensory ambiguity. From a minimal cognition perspective, we reframe this navigation as a subjective, information-driven sensorimotor process. To this end, we propose a framework linking a Partially Observable Markov Decision Process (POMDP) with biochemical reaction dynamics. Environmental variables are hidden, while the cell updates a minimal internal state from each observation through a memoryless Bayesian step. These internal dynamics balance orienting toward light with exploratory reorientation and can be implemented through Chemical-Reaction-Network Ordinary Differential Equations (CRN-ODEs). Our model includes a biophysical observation process for photoreception and a chemically computable polynomial bound on information gain. Using Inverse Reinforcement Learning (IRL) on 30 experimentally recorded Chlamydomonas trajectories, we infer the behavioral objective consistent with observed phototactic motion and benchmark the resulting dynamics with standard Stochastic Simulation Algorithm (SSA) baselines. Our model reproduces the empirical alignment-to-light distribution, comparable to objective SSA baselines on this dataset. Within this framework, run-tumble alternation emerges as an information-acquisition strategy: tumbling reorients the cell to sample new sensory configurations and resolve sensor ambiguity, demonstrating how intracellular biochemical networks can support adaptive information-seeking behavior in cellular navigation.

RMTL: Reinforced Micro-task Learning for Long-Horizon Manipulation with VLM Rewards

Authors: Anıl Can Ateş, Orhan Kahraman, Cihan Topal
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.26175
Pdf link: https://arxiv.org/pdf/2606.26175
Abstract Reinforcement learning (RL) for robotic manipulation often requires manually designing a dense reward function, which is difficult to tune and often fragile, or learning a reward from human demonstrations or preferences, which can be expensive. A recent line of work uses pretrained vision-language models (VLMs) as zero-shot reward models, replacing these costs with a single text prompt. However, we argue that a single global prompt is too coarse for long-horizon manipulation tasks with randomized initial conditions. The single-prompt VLM reward is near-flat for much of the trajectory, making early progress hard for the agent to detect. We propose Reinforced Micro-Task Learning (RMTL), an approach that decomposes a manipulation task into a small set of language-described micro-tasks and trains the agent to switch between them. At each step, the agent receives a multi-view VLM reward computed using the prompt of the currently active micro-task and averaged across multiple camera views to reduce the effect of view-specific occlusions. A reverse curriculum gradually exposes the agent to harder initial conditions, while a PPO worker is first trained with a fixed distance-based rule that selects the active micro-task. We then replace this rule with a learned hierarchical manager, turning rule-based phase selection into a fully learned hierarchical policy. We instantiate RMTL on the Fetch manipulation environment using three short stage-specific prompts and without additional prompt tuning. Experiments show that RMTL provides more informative reward signals than single-prompt VLM rewards, enabling faster learning. These results suggest that decomposing VLM rewards into micro-task-specific language prompts can substantially improve the scalability of language-guided reinforcement learning for robotic manipulation.

HALO: Hierarchical Auction-assisted Learning for Offloading in SAGIN

Authors: Xuli Cai, Poonam Lohan, Sachin Ravikant Trankatwar, Burak Kantarci
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2606.26293
Pdf link: https://arxiv.org/pdf/2606.26293
Abstract In this paper, we investigate delay-aware task offloading and resource scheduling in a three-tier space-air-ground integrated network (SAGIN) consisting of IoT devices, UAV edge nodes, and a high-altitude platform station (HAPS). We formulate joint task association and continuous resource control (including bandwidth, transmit power, and CPU frequency allocation) as a non-convex mixed-integer nonlinear programming (MINLP) problem, which is inherently NP-hard. To capture fine-grained system dynamics, we introduce a macro-micro slot model that tracks cumulative transmission and computation progress over time. Based on this model, we propose HALO, a hierarchical auction-assisted learning framework that combines auction-based task association with hierarchical Proximal Policy Optimization (HPPO) for resource allocation. Simulation results under different traffic loads show that HALO consistently outperforms representative deep reinforcement learning (DRL) baselines. In particular, HALO achieves an average improvement of 8.7 percentage points in task success rate over PPO (corresponding to an 11.4% relative gain) and shows consistently greater robustness than DDPG and SAC, with relative improvements of 32.4% and 89.9%, respectively. These results highlight HALO's ability to maintain stable and efficient performance under varying traffic conditions, making it well-suited for delay-sensitive SAGIN environments.
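
The abstract pairs an absolute gain (8.7 percentage points) with a relative gain (11.4%) over PPO. Together these imply an approximate PPO baseline, which the abstract does not state. The sketch below is a back-of-the-envelope reading under the assumption that both figures describe the same averaged success rate; the inputs are rounded, so the result is only indicative, and the function name is illustrative.

```python
def implied_baseline(gain_pp, relative_gain):
    """Baseline rate (percent) implied by an absolute gain in percentage points
    and the matching relative gain expressed as a fraction."""
    return gain_pp / relative_gain


# Figures quoted in the abstract for HALO versus PPO.
ppo_rate = implied_baseline(8.7, 0.114)
halo_rate = ppo_rate + 8.7

print(f"implied PPO success rate:  {ppo_rate:.1f}%")
print(f"implied HALO success rate: {halo_rate:.1f}%")
```

This gives roughly 76% for PPO and 85% for HALO. Because the reported gains are averages over traffic loads, the per-load values may differ.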

COrigami: An AI Pipeline for Co-Designing Flat-Foldable Visually Recognisable Origami

Authors: Tom Zahavy, Shaobo Hou, Thomas Tumiel, James Doran, Francesco Faccio, Xidong Feng, Alex Havrilla, Igor Khytryi, Chenglei Li, Lisa Schut, Vivek Veeriah, Arijan Abrashi, Michał Kosmulski, Robert J. Lang, Nick Robinson, Brandon Wong, Marcus Chiam, Gloria Fang, Satinder Singh
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.26299
Pdf link: https://arxiv.org/pdf/2606.26299
Abstract While generative AI has achieved remarkable success in solving problems with verifiable solutions, generating physical art that satisfies both strict geometric constraints and subjective visual aesthetics remains a challenge. This paper presents an approach to tackle these difficulties in the domain of computational origami, a mathematically rigid environment that grounds artistic design within the equations of flat foldability. We present COrigami, an end-to-end AI-driven pipeline that assists the design cycle by generating crease patterns from natural language. Our pipeline involves generating a semantic stick figure, computing a base packing, solving for a flat-foldable crease pattern, shaping the flat-folded crease pattern, and refining the generated model using reinforcement learning driven by an autonomous aesthetic evaluation loop. Our system acts as a highly effective collaborative assistant, generating structural starting points that human artists can further expand and shape. By integrating algorithmic optimisation with autonomous aesthetic critique, this work demonstrates how AI systems can satisfy multi-objective physical constraints to enable reliable, mathematically grounded co-creativity.

Racing a Wheeled Quadruped: Active Load Transfer Mitigation via Model Predictive Control

Authors: Marla Eisman, Brian Lam, Samuel Sonnino, Francesco Borrelli
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2606.26313
Pdf link: https://arxiv.org/pdf/2606.26313
Abstract This paper presents a hierarchical control framework using model predictive control (MPC) and reinforcement learning (RL) for active roll control to manage lateral load transfer during autonomous racing of a wheeled quadruped. The framework integrates offline time-optimal raceline generation, an online MPC planner that actively minimizes the lateral Load Transfer Ratio (LTR), and a low-level, whole-body RL policy deployed directly onto the robot's 16 actuators. The MPC is based on a vehicle dynamics bicycle model of the Unitree Go2-W platform. The robot's leg actuators act as active suspension where knee joints generate anti-roll torque to bank into turns. Physical track experiments demonstrate that active roll control reduces mean LTR by up to 44%, improves the fastest lap time by 8.7%, and boosts peak lateral acceleration capability by 21.3% to 1.98 $m/s^2$, maintaining robust high-speed stability beyond the range of a non-tilting baseline controller. Supplementary code and video can be found at this https URL
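
The abstract does not define the Load Transfer Ratio. A minimal sketch of the definition commonly used in vehicle dynamics is given below, assuming the paper follows it: the difference between right-side and left-side normal forces divided by their sum. The sign convention and the function name are assumptions.

```python
def load_transfer_ratio(f_left, f_right):
    """Lateral load transfer ratio in [-1, 1] from left and right normal forces (N).

    0 means the load is evenly split; +1 or -1 means one side carries no load,
    i.e. the wheels on that side are about to lift off.
    """
    total = f_left + f_right
    if total <= 0:
        raise ValueError("total normal force must be positive")
    return (f_right - f_left) / total


print(load_transfer_ratio(100.0, 100.0))  # evenly loaded
print(load_transfer_ratio(50.0, 150.0))   # load shifted to the right side
```

Under this definition, banking into a turn lowers the magnitude of the ratio by moving load back toward the inner wheels.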

EVOM: Agentic Meta-Evolution of Actor-Critic Architectures for Reinforcement Learning

Authors: Boyun Zhang, Chao Wang, Kai Wu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.26327
Pdf link: https://arxiv.org/pdf/2606.26327
Abstract In actor-critic reinforcement learning, network architectures are typically manually designed. Automating this design is challenging because each candidate must be trained before evaluation, and the design space is open-ended. To address these challenges, we introduce EVOM, an agentic meta-evolution framework for discovering high-performance actor-critic architectures. We frame architecture search as a bi-level optimization: an inner loop trains weights via the low-fidelity proximal policy optimization (PPO), while an outer loop drives meta-evolution by iteratively refining architecture programs. Crucially, this outer loop is powered by an LLM-based design agent that operates purely as an architecture designer, completely decoupled from policy execution and environment control. Experiments reveal that EVOM outperforms the manually designed baseline, an LLM-guided random search, and the state-of-the-art LLM-guided programmatic policy search method MLES, delivering superior performance on Ant-v4 and HalfCheetah-v4. Ablation studies validate that both the meta-evolution loop and the LLM Design Agent are indispensable for final performance.

Mesh-RL: Coupled subgrid reinforcement learning

Authors: Behnam Gheshlaghi, Bahador Rashidi, Shahin Atakishiyev
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.26333
Pdf link: https://arxiv.org/pdf/2606.26333
Abstract Reinforcement learning in large or sparse-reward environments suffers from slow temporal-difference reward propagation, as value information spreads only locally across the state space. We propose Mesh-RL, a spatial domain-decomposition framework inspired by the finite element method and domain decomposition theory, which partitions the environment into overlapping subgrids and enforces boundary-consistent temporal-difference updates. Such an approach enables localized learning while ensuring globally coherent value propagation. Unlike hierarchical or model-based approaches, Mesh-RL accelerates long-range credit assignment without modifying the reward function or the Bellman operator and without introducing explicit planning mechanisms. We evaluate Mesh-RL on hazard-dense grid-world environments with varying geometries and mesh resolutions. Across Q-learning, SARSA, and Dyna-Q, Mesh-RL consistently improves convergence speed, cumulative reward, and learning stability. Higher mesh resolutions sustain exploration, prevent premature convergence, and substantially accelerate value propagation to distant states. While Dyna-Q already benefits from internal planning, it still achieves additional gains under structured decomposition. Overall, Mesh-RL introduces a principled spatial domain-decomposition mechanism for accelerating temporal-difference learning. Our framework bridges finite element method-inspired boundary-consistency techniques from scientific computing with reinforcement learning to improve sample efficiency in sparse-reward environments. We will release the source code of the study.
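
The slow propagation that motivates this work can be seen on a toy problem. The sketch below is not Mesh-RL itself; it runs plain TD(0) on a hypothetical ten-state chain with a single reward at the far end, and shows that each episode pushes value information back by only one state.

```python
def td0_episode(values, gamma=0.9, alpha=1.0):
    """One left-to-right episode of TD(0) on a chain.

    The agent always moves right; reward 1 is given on entering the last state,
    whose own value stays 0 because it is terminal.
    """
    last = len(values) - 1
    for s in range(last):
        reward = 1.0 if s + 1 == last else 0.0
        target = reward + gamma * values[s + 1]
        values[s] += alpha * (target - values[s])
    return values


values = [0.0] * 10
informed = []  # number of states with a non-zero value after each episode
for episode in range(3):
    td0_episode(values)
    informed.append(sum(1 for v in values if v != 0.0))

print(informed)
```

After three episodes only the three states nearest the reward carry any value, so reaching the start of a long chain takes as many episodes as there are states.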

Scaling Nonlinear Optimization: Many Problems One GPU

Authors: John Viljoen, Johanna Haffner, Masayoshi Tomizuka, Negar Mehr
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.26341
Pdf link: https://arxiv.org/pdf/2606.26341
Abstract Many robotics problems, including trajectory optimization, inverse kinematics, and contact-rich motion planning, reduce to nonlinear programs (NLPs). Mature NLP solvers such as IPOPT can solve these problems, offering hard constraint satisfaction, optimality guarantees, and favorable scaling with problem dimension. These solvers underpin gradient-based methods in robotics, yet remain CPU-bound and solve only one problem at a time, preventing their integration into GPU-batched learning pipelines. On the other hand, sampling-based approaches such as reinforcement learning, model predictive path integral, and imitation learning have become the core of modern robotics research due to their ability to leverage GPU-batched simulators. These simulators can generate orders of magnitude more dynamics rollouts per second than was previously possible. If a GPU-batched NLP solver existed, it would unlock similar speedups in the number of constrained, locally optimal solutions generated per second. This regime of solving many problems concurrently versus solving a single problem at a time is a key requirement for integrating NLP solvers in modern GPU-batched robotics frameworks. To this end, we introduce jaxipm, the first GPU-batched NLP solver, based on IPOPT and implemented in JAX. We accomplish this by redesigning IPOPT's algorithm to eliminate control flow with heterogeneous iteration fusion, and by minimizing GPU idle time with iteration level batching. We evaluate jaxipm on a variety of quadrotor nonlinear model predictive control benchmarks, including reference tracking in the presence of obstacles, multi-quadrotor navigation without collision, and navigation in a cluttered environment. We demonstrate up to a $32.85\times$ increase in throughput over IPOPT. Our complete open-source codebase is available at this https URL.

MPC-Injection: Biasing Off-Policy Locomotion RL Toward Controller-Induced Behavior Basins

Authors: Roy Xing (Dartmouth College), Seyoung Ree (Harvard University), Brian Plancher (Dartmouth College)
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.26392
Pdf link: https://arxiv.org/pdf/2606.26392
Abstract Reinforcement learning (RL) for locomotion frequently converges to locally optimal but undeployable behaviors, such as vibrating limbs or scooting on the torso, that maximize return without producing a usable gait. We present MPC-Injection, a low-overhead method that steers RL toward a designer-preferred gait by inserting transitions into the replay buffer from a model predictive controller solving the same Markov decision process. Unlike reward shaping, MPC-Injection does not require redesigning the task reward, and unlike adversarial imitation learning, it adds no discriminator, no kinematic retargeting, and no auxiliary objective. Instead, the controller's preferred behavior is transferred to the policy purely through the replay state distribution. On a 2D walker in simulation and with sim-to-real evaluation on a Go2 quadruped, we show that MPC-Injection drives the policy into the controller's behavior basin using a one- to two-term task reward, producing gaits qualitatively comparable to those of reward shaping with twenty-one tuned terms and of adversarial motion priors, without the discriminator and retargeting overhead of the latter. We further analyze how the injected transitions bias actor-critic updates toward controller-visited states, allowing the policy to learn behaviors that pure RL may fail to reach under simple reward functions.
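
The core mechanism described here, mixing controller transitions into an off-policy replay buffer, can be sketched in a few lines. This is an illustration only: the class, the `fill` helper, the fixed injection interval, and the placeholder transitions are all assumptions, and the abstract does not say how often or how many transitions the authors inject.

```python
from collections import deque


class ReplayBuffer:
    """FIFO replay buffer that tags each transition with its source."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)

    def add(self, transition, source):
        self.storage.append((transition, source))

    def source_fraction(self, source):
        """Share of stored transitions that came from `source`."""
        if not self.storage:
            return 0.0
        hits = sum(1 for _, tag in self.storage if tag == source)
        return hits / len(self.storage)


def fill(buffer, policy_steps, mpc_steps, inject_every):
    """After every `inject_every` policy transitions, add one MPC transition."""
    mpc_iter = iter(mpc_steps)
    for i, step in enumerate(policy_steps, start=1):
        buffer.add(step, "policy")
        if i % inject_every == 0:
            injected = next(mpc_iter, None)
            if injected is not None:
                buffer.add(injected, "mpc")


buffer = ReplayBuffer(capacity=1000)
# Placeholder (state, action, reward, next_state) tuples standing in for real rollouts.
policy_steps = [("s", "a", 0.0, "s_next")] * 300
mpc_steps = [("s", "a_mpc", 0.0, "s_next")] * 100
fill(buffer, policy_steps, mpc_steps, inject_every=3)
mpc_share = buffer.source_fraction("mpc")
print(mpc_share)
```

Because the learner samples minibatches from this buffer, a quarter of its updates in this toy setup would be computed on states the controller visited, with no change to the reward or the loss.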

Deterministic Pareto-Optimal Policy Synthesis for Multi-Objective Reinforcement Learning

Authors: Aniruddha Joshi, Niklas Lauffer, Sanjit Seshia
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.26397
Pdf link: https://arxiv.org/pdf/2606.26397
Abstract Real-world decision-making often requires balancing multiple conflicting objectives, a challenge that standard Reinforcement Learning (RL) frequently addresses by aggregating rewards into a single scalar signal. While effective for simple tasks, this approach often fails to capture the full spectrum of optimal trade-offs, known as the Pareto frontier. In this paper, we introduce a novel preference-conditioned Bellman operator, motivated by Chebyshev scalarization, designed to compute deterministic Pareto-optimal policies for Multi-Objective Markov Decision Processes (MOMDPs). We prove that this operator satisfies an enveloping property, where the estimated value functions upper-bound the true Pareto frontier, and demonstrate that it monotonically converges to a coverage set of this frontier. Furthermore, we show how to extract deterministic policies from these converged Q-estimates. This ensures the agent can recover a policy for any given preference, capturing the entire Pareto-optimal frontier while guaranteeing each synthesized policy remains approximately Pareto-optimal. Experimental results validate that our algorithm successfully recovers complex trade-offs, providing a solution for deterministic Pareto-optimal policy synthesis.
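
Why a Chebyshev-style scalarization can recover trade-offs that a weighted sum misses is easy to show on three fixed return vectors. The sketch below is plain Chebyshev scalarization over a hand-picked set of points, not the paper's Bellman operator; the points, the reference point (1, 1), and the function names are illustrative assumptions.

```python
def linear_best(points, weights):
    """Point maximizing the weighted sum of objectives."""
    return max(points, key=lambda p: sum(w * v for w, v in zip(weights, p)))


def chebyshev_best(points, weights, reference):
    """Point minimizing the largest weighted shortfall from the reference point."""
    return min(
        points,
        key=lambda p: max(w * (r - v) for w, r, v in zip(weights, reference, p)),
    )


# Three Pareto-optimal return vectors; the middle one lies in a concave
# region of the front, below the segment joining the two extremes.
points = [(1.0, 0.0), (0.4, 0.4), (0.0, 1.0)]
middle = (0.4, 0.4)

# Sweep preference weights: the weighted sum never selects the middle point.
sweep = [(k / 10, 1 - k / 10) for k in range(11)]
linear_picks = {linear_best(points, w) for w in sweep}
print(middle in linear_picks)

# Chebyshev scalarization with equal weights does select it.
print(chebyshev_best(points, (0.5, 0.5), (1.0, 1.0)))
```

The weighted sum scores the middle point at 0.4 for every weight vector that sums to one, while one of the extremes always scores at least 0.5, so only the Chebyshev criterion reaches the balanced trade-off.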

Geometry-Aware MCTS for Extremal Problems in Combinatorial Geometry

Authors: Luoning Zhang, Xu Zhuang, Tianhao Wang, Nathan Kaplan
Subjects: Artificial Intelligence (cs.AI); Computational Geometry (cs.CG); Machine Learning (cs.LG); Combinatorics (math.CO)
Arxiv link: https://arxiv.org/abs/2606.26399
Pdf link: https://arxiv.org/pdf/2606.26399
Abstract We study certain extremal problems in combinatorial geometry that ask about configurations of points in an $n \times n$ grid that satisfy strict, global geometric constraints. Classical exact solvers suffer from combinatorial explosion for these types of problems, and standard reinforcement learning and transformer-based models struggle with the sparse reward "validity cliff" and quadratic token-consumption limits. To overcome these bottlenecks, we propose a Geometry-Aware Monte Carlo Tree Search (MCTS) framework. Our approach strictly enforces geometric constraints through incremental updates to the feasible action space. For constraints about collections of collinear points, like those that occur in the classic No-Three-in-Line problem (Max-N3IL), this mechanism reduces the constraint checking complexity from $O(n^3)$ to $O(n^2)$. To improve search efficiency, we exploit geometric symmetries in two ways: canonical pruning during node expansion to reduce the branching factor, and symmetric batch transitions to accelerate the discovery of promising configurations. We perform extensive experiments and establish new best-known computational results on five out of six of the problems that we considered. Notably, for Max-N3IL we find configurations of size roughly $1.8 n$ for grids of size $82 \le n \le 119$. For the Smallest Complete Set problem, we find configurations of size roughly $0.95 n$, providing new upper bounds within the tested grids. This work establishes Geometry-Aware MCTS as a highly adaptable framework for discovering novel configurations in combinatorial geometry.
中文摘要 我们研究组合几何中的若干极值问题，这些问题关注 $n \times n$ 网格中满足严格全局几何约束的点配置。经典精确求解器在这类问题上会遭遇组合爆炸，而标准强化学习和基于 Transformer 的模型则难以应对稀疏奖励的“有效性悬崖”和二次方的 token 消耗限制。为克服这些瓶颈，我们提出了一个几何感知的蒙特卡洛树搜索（MCTS）框架。我们的方法通过对可行动作空间的增量更新来严格执行几何约束。对于涉及共线点集合的约束，例如经典的无三点共线问题（No-Three-in-Line，Max-N3IL）中出现的约束，该机制将约束检查复杂度从 $O(n^3)$ 降至 $O(n^2)$。为提高搜索效率，我们以两种方式利用几何对称性：在节点扩展时进行规范剪枝以降低分支因子，以及通过对称批量转移加速发现有前景的配置。我们进行了大量实验，并在所考虑的六个问题中的五个上取得了新的已知最佳计算结果。值得注意的是，对于 Max-N3IL，我们在 $82 \le n \le 119$ 的网格上找到了大小约为 $1.8 n$ 的配置。对于最小完全集（Smallest Complete Set）问题，我们找到了大小约为 $0.95 n$ 的配置，在所测试的网格范围内给出了新的上界。这项工作确立了几何感知 MCTS 作为一种高度适应性的框架，可用于发现组合几何中的新颖配置。
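The idea of enforcing a collinearity constraint through incremental updates to the feasible action space can be sketched for No-Three-in-Line as follows. This is a minimal illustration under my own assumptions, not the authors' implementation: the class name `N3ILState` and its methods are hypothetical, and no claim is made that it matches the complexity reported in the abstract. Each time a point is placed, every grid cell on a line through the new point and an existing point is marked as blocked, so later feasibility checks are a set lookup.

```python
from math import gcd
from typing import List, Set, Tuple

Point = Tuple[int, int]


class N3ILState:
    """Partial No-Three-in-Line configuration on an n x n grid (hypothetical helper)."""

    def __init__(self, n: int) -> None:
        self.n = n
        self.points: List[Point] = []
        self.blocked: Set[Point] = set()  # cells collinear with some placed pair

    def in_grid(self, p: Point) -> bool:
        return 0 <= p[0] < self.n and 0 <= p[1] < self.n

    def feasible(self, p: Point) -> bool:
        """A cell is feasible if it is free and not on a line through two placed points."""
        return self.in_grid(p) and p not in self.blocked and p not in self.points

    def add(self, p: Point) -> None:
        """Place p and block every cell on a line through p and an earlier point."""
        if not self.feasible(p):
            raise ValueError(f"{p} is not a feasible placement")
        for q in self.points:
            dx, dy = p[0] - q[0], p[1] - q[1]
            g = gcd(dx, dy)  # g > 0 because p != q
            dx, dy = dx // g, dy // g  # primitive step along the line
            for sign in (1, -1):  # walk the line in both directions from p
                x, y = p[0] + sign * dx, p[1] + sign * dy
                while 0 <= x < self.n and 0 <= y < self.n:
                    self.blocked.add((x, y))
                    x, y = x + sign * dx, y + sign * dy
        self.points.append(p)
```

A search procedure such as MCTS would then only expand actions for which `feasible` returns true, instead of re-checking all point triples at every node.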

Play2Perfect: What Matters in Dexterous Play Pretraining for Precise Assembly?

Play2Perfect：灵巧玩耍预训练中，哪些因素对精密装配至关重要？

Authors: Tyler Ga Wei Lum, Kushal Kedia, C. Karen Liu, Jeannette Bohg
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.26428
Pdf link: https://arxiv.org/pdf/2606.26428
Abstract Multi-fingered robots promise the speed and dexterity of human hands, yet challenging problems such as precise assembly have remained out of reach. These tasks are contact-rich, making data collection for imitation learning difficult, and sparse-reward, making direct exploration with reinforcement learning (RL) intractable. Consequently, prior work has made progress by structuring the problem with specialized grippers, tool attachments, and environment fixtures. In this work, we argue that before a robot can perfect precise assembly, it must first learn to play. We further ask the question: what factors in the process of learning to play matter for precise assembly? We propose Play2Perfect, an RL framework for task-agnostic pretraining through play on diverse objects and goals, which is then perfected on precise assembly. The goal of play is to acquire reusable manipulation priors, such as grasping, in-hand reorientation and pose reaching. Finetuning then adapts this general prior to assembly, focusing exploration on the final contact-rich, high-precision interactions needed for success. We systematically study key design choices in play pretraining, including object diversity, training objective, trajectory diversity, and goal precision. We show that our prior is 33x more sample-efficient than RL training from scratch, even when provided with dense, multi-stage rewards. We demonstrate zero-shot sim-to-real transfer, achieving 60% success on tight insertions with only 0.5 mm contact clearance, and over 50% success on long-horizon multi-part assembly and screwing.
中文摘要 多指机器人有望具备人手般的速度与灵巧性，但精密装配等高难度问题至今仍难以攻克。这类任务接触密集，使模仿学习的数据采集十分困难；同时奖励稀疏，使强化学习（RL）的直接探索难以进行。因此，以往的工作借助专用夹爪、工具附件和环境夹具来结构化问题，从而取得进展。在本研究中，我们认为机器人在把精密装配做到完美之前，必须先学会“玩耍”。我们进一步提出问题：在学习玩耍的过程中，哪些因素对精密装配至关重要？我们提出了 Play2Perfect，这是一个强化学习框架：先在多样的物体和目标上通过玩耍进行任务无关的预训练，再针对精密装配进行完善。玩耍的目标是获得可复用的操作先验，如抓取、手内重定向和位姿到达。随后的微调将这一通用先验适配到装配任务，把探索集中在成功所需的最后一段接触密集、高精度的交互上。我们系统地研究了玩耍预训练中的关键设计选择，包括物体多样性、训练目标、轨迹多样性和目标精度。我们表明，即使提供了密集的多阶段奖励，我们的先验的样本效率仍是从零开始的强化学习训练的 33 倍。我们展示了零样本的仿真到现实迁移，在接触间隙仅 0.5 毫米的紧配合插入任务上达到 60% 的成功率，在长时程多部件装配和拧螺丝任务上成功率超过 50%。

AXLE: A Cloud Infrastructure for Lean 4 Theorem Proving Utilities

AXLE：面向 Lean 4 定理证明工具的云基础设施

Authors: Jimmy Xin, Alex Schneidman, Chris Cummins, Karun Ram, Srihari Ganesh, Jannis Limperg
Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.26442
Pdf link: https://arxiv.org/pdf/2606.26442
Abstract We present AXLE (Axiom Lean Engine), a cloud service for Lean 4 proof manipulation, extraction, and verification. Recent progress in AI for mathematics -- reinforcement learning pipelines, agentic proving workflows, dataset curation -- demands Lean 4 tooling that scales to millions of requests while remaining correct and robust; existing infrastructure offers parallel compilation but not scalable proof verification, higher-level proof manipulation, multi-version support, or per-request isolation at the throughput modern AI workflows require. AXLE provides 14 Lean 4 metaprogramming tools spanning strict proof verification, declaration metadata extraction, semantic source manipulation, deterministic proof repair and simplification, and lemma extraction. The service runs as a multi-tenant cloud deployment with per-request isolation and concurrent support for multiple Lean 4 and Mathlib versions, accessible via a Python SDK, command-line interface, web UI, MCP server, and raw HTTP API. AXLE is publicly available and free to use at this https URL and via the axiom-axle PyPI package, with no local Lean 4 installation required. It has served over 500 million requests to date and is the underlying infrastructure for Axiom Math's proving efforts, including its 12/12 score on the 2025 Putnam competition.
中文摘要 我们介绍AXLE（Axiom精益引擎），这是一个用于精益4证明操作、提取和验证的云服务。近年来，人工智能在数学领域的进展——强化学习流程、代理性证明工作流、数据集策划——需要能够覆盖数百万请求且保持准确性和稳健性的精益四级工具;现有基础设施提供并行编译，但无法实现可扩展的证明验证、更高层次的证明操作、多版本支持或按请求隔离，且不具备现代AI工作流程所需的吞吐量。AXLE提供14种精益4元编程工具，涵盖严格证明验证、声明元数据提取、语义源操作、确定性证明修复与简化，以及引理提取。该服务以多租户云部署形式运行，实现按请求隔离，并同时支持多个精益4和Mathlib版本，可通过Python SDK、命令行界面、网页界面、MCP服务器和原始HTTP API访问。AXLE 公开且免费，用户可通过 https URL 及 axiom-axle PyPI 包免费使用，无需本地精益 4 安装。迄今为止，它已处理超过5亿次请求，是Axiom Math证明工作的基础基础设施，包括其在2025年Putnam竞赛中获得的12/12分。

Finding the Time to Think: Learning Planning Budgets in Real-Time RL

抽出时间思考：在实时强化学习中学习规划预算

Authors: Aneesh Muppidi, Firas Darwish, Dylan Cope, João F. Henriques, Jakob Nicolaus Foerster
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.26463
Pdf link: https://arxiv.org/pdf/2606.26463
Abstract Deliberating takes time. In real-time settings, that time is not free. Standard reinforcement learning (RL) sidesteps this as the environment waits indefinitely for the agent's decision. Instead, we study real-time RL environments where the environment progresses while waiting for the agent's action. Building on prior real-time formalizations, we introduce variable-delay real-time RL, where the agent chooses how long to deliberate at each decision point while the environment keeps progressing. For the planning agents we use, the right delay is state-dependent, and naively planning how long to plan can paralyze the agent. We instead approach this setting by training a lightweight gating policy on top of a planner to select state-dependent planning budgets. Across real-time Pac-Man, Tetris, Snake, Speed Hex, and Speed Go, our gating policy outperforms fixed-budget and heuristic baselines, and transfers to a real-time setup where the environment and agent run on two different GPUs.
中文摘要 深思熟虑需要时间。在实时环境中，这段时间并非免费。标准强化学习（RL）回避了这一点，因为环境会无限期地等待智能体做出决定。与此不同，我们研究实时强化学习环境，其中环境在等待智能体动作的同时持续演进。基于先前的实时形式化，我们提出了可变延迟实时强化学习：在环境持续演进的情况下，智能体在每个决策点自行选择思考时长。对于我们使用的规划智能体而言，合适的延迟取决于状态，而天真地去规划“该规划多久”可能会使智能体陷入瘫痪。为此，我们在规划器之上训练一个轻量级的门控策略，用来选择依赖于状态的规划预算。在实时版吃豆人（Pac-Man）、俄罗斯方块（Tetris）、贪吃蛇（Snake）、Speed Hex 和 Speed Go 上，我们的门控策略优于固定预算和启发式基线，并且可以迁移到环境与智能体分别运行在两块不同 GPU 上的实时设置中。

Sample-efficient Transfer Reinforcement Learning via Adaptive Reward Shaping and Policy-Ratio Reweighting Strategy

通过自适应奖励塑造和策略比率重加权策略实现样本高效的迁移强化学习

Authors: Wenjie Huang, Yang Li, Jingjia Teng, Mingwei Jin, Kai Song, Yougang Bian, Yongfu Li, Qisong Yang, Helai Huang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.26527
Pdf link: https://arxiv.org/pdf/2606.26527
Abstract Transfer learning improves policy learning efficiency by reusing knowledge from source tasks, providing a feasible paradigm for safe and efficient autonomous highway lane changing decision-making. Existing methods frequently encounter transfer mismatch induced by distribution shifts between source and target domains, leading to training oscillation and performance decline. Besides, target domain adaptation depends on exploratory interactions, which struggle to guarantee training safety in safety-critical lane changing cases. To tackle these limitations, this paper proposes a safe transfer reinforcement learning framework for autonomous highway lane changing. First, we design an adaptive teacher intervention mechanism based on instantaneous safety cost to restrain risky exploration and fade intervention strength progressively, with theoretical analysis on return bounds for the mixed behavior policy. This intervention also produces dual-source samples for joint training. Second, a teacher-guided safe transfer module embeds action evaluation information of the teacher policy into student learning via reward shaping to boost training safety and efficiency, with teacher guidance decaying as policy safety rises. Third, a teacher-guided weighted optimization mechanism adjusts sample weights in policy optimization using a likelihood ratio factor to stabilize transfer performance. Experiments under varied traffic densities and validation on the real-world NGSIM dataset reveal that our method surpasses baseline approaches by over 52.2% in safety and 5.0% in learning efficiency. Results verify the efficacy and robustness of our safety-aware transfer strategy for autonomous highway lane changing under various traffic conditions.
中文摘要 迁移学习通过重用源任务中的知识提升政策学习效率，为安全高效的自动高速公路车道变更决策提供了可行范式。现有方法经常因源域与目标域分布偏移引发转移不匹配，导致训练振荡和性能下降。此外，目标域适应依赖于探索性相互作用，而在安全关键变道情况下，这难以保证训练安全。为解决这些限制，本文提出了一个安全的换车强化学习框架，用于自动驾驶高速公路变道。首先，我们设计了一种基于瞬时安全成本的适应性教师干预机制，以逐步限制风险探索并逐渐削弱干预强度，并对混合行为政策的反馈界限进行了理论分析。该干预还产生了用于联合训练的双源样本。其次，教师指导的安全转移模块通过奖励塑造将教师政策的行动评估信息嵌入学生学习中，以提升培训的安全性和效率，而随着政策安全性的提升，教师指导的质量会下降。第三，教师指导的加权优化机制通过似然比因子调整政策优化中的样本权重，以稳定转移表现。在不同流量密度下的实验以及对真实世界NGSIM数据集的验证显示，我们的方法在安全性方面比基线方法高出52.2%以上，学习效率提升5.0%。结果验证了我们安全意识切换策略在各种交通条件下自动驾驶变道的有效性和稳健性。

VoiceTTA: Enhancing Zero-Shot Text-to-Speech via Reinforcement Learning-Based Test-Time Adaptation

VoiceTTA：通过基于强化学习的测试时间适配增强零样本文本转语音

Authors: Tianxin Xie, Chenxing Li, Dong Yu, Li Liu
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.26534
Pdf link: https://arxiv.org/pdf/2606.26534
Abstract Recently, zero-shot text-to-speech (TTS) has enabled high-fidelity and expressive speech synthesis, but it often fails to imitate unseen speaking styles from uncommon scenarios (e.g., crosstalk, dialects). Moreover, fine-tuning pretrained models requires large, high-quality datasets, limiting rapid personalization. We propose VoiceTTA, a reinforcement learning-based test-time adaptation (TTA) method that improves voice imitation of pretrained zero-shot TTS models. VoiceTTA introduces two style rewards based on coefficient-of-variation differences of F0 and energy, combined with speaker similarity and intelligibility (WER from a pretrained Whisper model), and optimizes learnable prefixes via group relative policy optimization (GRPO) in a flow matching-based model at inference time. Extensive experiments demonstrate substantial improvements on uncommon speech prompts, outperforming state-of-the-art baselines. Audio samples are available at this https URL
中文摘要 近年来，零帧文本转语音（TTS）使得高保真和富有表现力的语音合成成为可能，但它常常无法模拟来自罕见场景（如串音、方言）中看不见的说话风格。此外，对预训练模型进行微调需要大量高质量数据集，限制了快速个性化。我们提出了VoiceTTA，一种基于强化学习的测试时间适应（TTA）方法，旨在提升预训练零样子TTS模型的语音模仿能力。VoiceTTA引入了两种基于F0和能量变异系数差异的样式奖励，结合说话者的相似性和可理解性（基于预训练的Whisper模型的WER），并在基于流匹配的模型中通过群体相对偏好优化（GRPO）在推断时优化可学习前缀。大量实验显示，罕见语音提示在表现上取得了显著改进，优于最先进的基线。音频样本可在此 https URL 获取
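A style reward built from coefficient-of-variation differences can be sketched in a few lines. The abstract does not give the exact formula, so the negative absolute difference used here, the function names, and the input convention (a list of positive per-frame F0 or energy values for voiced frames) are all assumptions for illustration.

```python
from statistics import mean, pstdev
from typing import Sequence


def coeff_of_variation(xs: Sequence[float]) -> float:
    """Population standard deviation divided by the mean of a contour."""
    if not xs:
        raise ValueError("empty contour")
    m = mean(xs)
    if m == 0:
        raise ValueError("mean is zero; coefficient of variation is undefined")
    return pstdev(xs) / m


def style_reward(generated: Sequence[float], reference: Sequence[float]) -> float:
    """Assumed reward: 0 when both contours vary equally, negative otherwise."""
    return -abs(coeff_of_variation(generated) - coeff_of_variation(reference))
```

Because the coefficient of variation is scale-invariant, a generated F0 contour that is simply an octave higher than the prompt still receives the maximum reward of zero, while a flat contour is penalized when the prompt is expressive.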

Revisiting Action Factorization for Complex Action Spaces

重新审视复杂动作空间的动作分解

Authors: Timothy Flavin, Sandip Sen
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.26574
Pdf link: https://arxiv.org/pdf/2606.26574
Abstract Many real-world control problems involve hybrid discrete-continuous action spaces. For example, steering and signaling in autonomous driving, and aiming and firing in robotics or video games. Despite real-world hybrid factorization and reinforcement learning framework support for complex action spaces (e.g., Gymnasium, PettingZoo, TorchRL, SeedRL, MuJoCo, etc.), the default environments within those frameworks often implement uniform action space configurations (LunarLander, Walker2D, Cheetah, SMAC, SUMO, Ant, Atari). Landmark hybrid-action benchmarks (RoboCup 2D HFO, SC2LE, Platform, CARLA, etc.) are mostly heavyweight or archival implementations originating from papers which test one or a small number of competing factorization methods on one kind of control. This article provides a cross-sectional study of factorization methods [independent networks, shared encoder, VDN, QPLEX, Joint, Auto-Regressive] on each of three families of algorithms [PPO, SAC, DQN] across three action spaces [discretized, hybrid, continuous] over four lightweight environments [Platform, hybrid-LunarLander, Hybrid-Shoot, CoopPush]. Accounting for some invalid pairings such as joint-continuous, we are left with 220 configurations with which to analyze each method. We provide two new C++ parallel Gymnasium- and PettingZoo-compliant environments [CoopPush, Hybrid-Shoot] to isolate particular challenges such as state-dependent inter-action dependence. Finally, we introduce VDN-PPO and PPO-MIX, which use a branching critic to assign credit to multi-headed PPO. These variants outperform all other tested PPO factorizations. Our results suggest that branching dueling architectures balance compute and performance most effectively, with Auto-Regressive actions reaching the highest performance overall and native continuous SAC outperforming discrete and hybrid algorithms, albeit both at increased computational cost.
中文摘要 许多现实世界的控制问题涉及离散-连续混合动作空间，例如自动驾驶中的转向与打灯，以及机器人或电子游戏中的瞄准与射击。尽管现实中存在混合分解，且强化学习框架（如 Gymnasium、PettingZoo、TorchRL、SeedRL、MuJoCo 等）支持复杂动作空间，这些框架中的默认环境通常采用统一的动作空间配置（LunarLander、Walker2D、Cheetah、SMAC、SUMO、Ant、Atari）。标志性的混合动作基准（RoboCup 2D HFO、SC2LE、Platform、CARLA 等）大多是重量级或已归档的实现，源自只在一种控制任务上测试一种或少数几种竞争性分解方法的论文。本文对分解方法 [独立网络、共享编码器、VDN、QPLEX、联合（Joint）、自回归（Auto-Regressive）] 进行了横向研究，覆盖三类算法 [PPO、SAC、DQN]、三种动作空间 [离散化、混合、连续] 和四个轻量级环境 [Platform、hybrid-LunarLander、Hybrid-Shoot、CoopPush]。剔除联合-连续等无效组合后，我们共有 220 种配置可用于分析各方法。我们提供了两个新的、兼容 Gymnasium 和 PettingZoo 的 C++ 并行环境 [CoopPush、Hybrid-Shoot]，用于隔离诸如状态依赖的动作间依赖等特定挑战。最后，我们提出了 VDN-PPO 和 PPO-MIX，它们使用分支式评论家（critic）为多头 PPO 分配信用。这些变体优于所有其他被测试的 PPO 分解方法。我们的结果表明，分支对决（branching dueling）架构在计算开销与性能之间取得了最有效的平衡；自回归动作的整体性能最高，原生连续 SAC 优于离散和混合算法，但两者都以更高的计算成本为代价。

EvoOptiGraph: Weakness-Driven Coevolution via Graph-Based Structural Generation for Optimization Modeling

EvoOptiGraph：通过基于图的结构生成实现弱点驱动的共进化以实现优化建模

Authors: Qingcan Kang, Mingyang Liu, Xiaojin Fu, Shixiong Kai, Tao Zhong, Mingxuan Yuan
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.26578
Pdf link: https://arxiv.org/pdf/2606.26578
Abstract Automating optimization modeling from natural language with large language models (LLMs) faces two key challenges. First, training corpora lack structural diversity. Second, data generation pipelines remain static and decoupled from model learning. To address these challenges, we propose EvoOptiGraph, a novel framework where data and model co-evolve, driven by model weaknesses. EvoOptiGraph represents each mixed-integer linear program (MILP) as an attributed bipartite graph and applies validity-preserving evolutionary operators to generate structurally diverse instances. The evolved graphs are converted into solver code and natural language via deterministic compilation and verified back-translation. Training proceeds in two stages: supervised fine-tuning (SFT) on an initial dataset, followed by reinforcement learning with verifiable rewards (RLVR), where graph-derived weakness signals guide the generation of new instances targeting the model's failures. This forms a closed loop that continuously updates the training distribution. Empirical results on six public datasets show that EvoOptiGraph significantly outperforms larger generalist models, agentic methods, and specialized baselines in accuracy, executability, and generalization. These results demonstrate that targeted data-model coevolution is an effective strategy for improving LLMs on optimization modeling tasks.
中文摘要 利用大型语言模型（LLM）自动化自然语言优化建模面临两个关键挑战。首先，训练语料库缺乏结构多样性。其次，数据生成流水线保持静态，且与模型学习脱钩。为应对这些挑战，我们提出了EvoOptiGraph，一种数据与模型由模型弱点共同演化的新型框架。EvoOptiGraph将每个混合整数线性规划（MILP）表示为一个带属性的二分图，并应用保持效度的进化算符生成结构多样化的实例。进化后的图通过确定性编译和验证的反向翻译转换为求解器代码和自然语言。训练分为两个阶段：对初始数据集进行监督微调（SFT），随后是可验证奖励的强化学习（RLVR），通过图导出的弱点信号引导生成针对模型失败的新实例。这形成了一个闭合循环，持续更新训练分布。六个公开数据集的实证结果显示，EvoOptiGraph在准确性、可执行性和泛化性方面显著优于更广泛的通用模型、代理方法和专业基线。这些结果表明，针对性数据-模型共进是提升LLM优化建模任务的有效策略。
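To make the phrase "attributed bipartite graph" concrete, here is one possible in-memory representation of a MILP with variable nodes, constraint nodes, and coefficient-weighted edges. The class and field names are hypothetical, and `scale_constraint` is only a trivially validity-preserving example (multiplying a constraint by a positive constant leaves the feasible set unchanged). It is not one of the paper's evolutionary operators, which the abstract does not specify.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class VarNode:
    name: str
    obj: float            # objective coefficient
    is_integer: bool
    lb: float = 0.0
    ub: float = float("inf")


@dataclass
class ConNode:
    name: str
    sense: str            # one of "<=", ">=", "="
    rhs: float


@dataclass
class MilpGraph:
    variables: Dict[str, VarNode] = field(default_factory=dict)
    constraints: Dict[str, ConNode] = field(default_factory=dict)
    # Edge (constraint name, variable name) -> coefficient in that constraint.
    edges: Dict[Tuple[str, str], float] = field(default_factory=dict)

    def add_var(self, var: VarNode) -> None:
        self.variables[var.name] = var

    def add_con(self, con: ConNode, coefs: Dict[str, float]) -> None:
        for var_name in coefs:
            if var_name not in self.variables:
                raise KeyError(f"unknown variable {var_name!r}")
        self.constraints[con.name] = con
        for var_name, coef in coefs.items():
            self.edges[(con.name, var_name)] = coef


def scale_constraint(graph: MilpGraph, con_name: str, factor: float) -> None:
    """Multiply one constraint by a positive factor; the feasible set is unchanged."""
    if factor <= 0:
        raise ValueError("factor must be positive to keep the inequality direction")
    graph.constraints[con_name].rhs *= factor
    for key in graph.edges:
        if key[0] == con_name:
            graph.edges[key] *= factor
```

A representation like this can be compiled deterministically into solver code by iterating over the constraint nodes and their incident edges.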

NebulaExp-8B: An Empirical Post-Training Pipeline via Full-Scale Ablation Research

NebulaExp-8B：基于全面消融研究的实证后训练流程

Authors: Qiaobo Hao, Yangqian Wu, Shunyi Wang, Zhongjian Zhang, Ziqun Li, Yayin He, Muqing Li, Chen Zhong
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.26671
Pdf link: https://arxiv.org/pdf/2606.26671
Abstract Post-training alignment determines the reasoning and human-preference-following capabilities of large language models, yet most existing works withhold detailed data construction, filtering rules and training recipes, which hinders community reproducibility and lightweight model optimization. This work presents NebulaExp, a fully transparent, ablation-driven post-training pipeline built on Qwen3-8B-base, covering two orthogonal model branches: a general instruct model and a complex-reasoning-specialized model. We curate a raw corpus of 3.84M multi-source SFT samples and a 200K verifiable RL candidate pool, and design an end-to-end data processing stack including response distillation, multi-dimensional cross-verification filtering, fine-grained difficulty grading, task classification and diversity-aware sampling. For the Instruct branch, our three-stage optimized supervised fine-tuning approach NebulaExp-Ins-SFT improves the average benchmark score from the 55.01 baseline of Qwen3-8B-nothink to 60.99. GRPO reinforcement learning then further elevates the average score to 61.85. For the Reasoning branch, medium-difficulty GRPO RL improves the average reasoning score from 73.88 to 75.17. To address RL's dependency on task verifiers, we systematically investigate single-teacher and multi-teacher OPD (MOPD): single-teacher OPD uses merely 4K instruction-following samples and outperforms the RL baseline by 3.26 points on IFEval, with a +4.43 average overall gain; MOPD fuses four domain-specialist teachers with merely 10K samples, lifting average performance by 4.18 over the base model. This report provides a fully reproducible empirical post-training recipe for 8B-scale LLMs, and comprehensively dissects the capability trade-offs among instruction adherence, mathematical reasoning, code generation and general knowledge.
中文摘要 训练后对齐决定了大型语言模型的推理能力和人类偏好遵循能力，但大多数现有工作缺乏详细的数据构建、过滤规则和训练方案，这阻碍了社区的可重复性和轻量级模型优化。本研究提出了NebulaExp，这是一个基于Qwen3-8B基础的完全透明、消融驱动的训练后流程，涵盖两个正交模型分支：通用指令模型和复杂推理专用模型。我们策划了384万个多源SFT样本的原始语料库和20万可验证的强化学习候选人池，设计了端到端的数据处理堆栈，包括响应提炼、多维交叉验证过滤、细粒度难度分级、任务分类和多样性感知采样。对于Ininstruction分支，我们采用三阶段优化监督微调方法NebulaExp-Ins-SFT，将平均基准分数从Qwen3-8B-nothink的55.01基准提升至60.99。GRPO强化学习进一步将平均得分提升至61.85。在推理分支中，中等难度的GRPO RL将平均推理分数从73.88提升到75.17。为解决强化学习对任务验证器的依赖，我们系统性地研究了单教师和多教师的OPD（MOPD）：仅使用4K指令跟随样本，IFEval表现比RL基线高3.26分，平均整体增益+4.43;MOPD将四位领域专业教师与仅1万个样本融合，平均表现比基础模型提升4.18%。本报告提供了8B级大型语言模型的完全可重复的实证训练后方案，全面剖析了指令遵循、数学推理、代码生成和常识之间的能力权衡。

PressMimic: Pressure-Guided Motion Capture and Control for Humanoid Robot Imitation

PressMimic：用于仿人机器人模拟的压力引导动作捕捉与控制

Authors: Yi Lu, Shenghao Ren, Tianyu Xiong, Zhaoxiang Li, Jiaqi Li, He Zhang, Tao Yu, Qiu Shen, Xun Cao
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.26741
Pdf link: https://arxiv.org/pdf/2606.26741
Abstract Humanoid motion imitation requires not only accurate perception of human kinematics but also faithful reproduction of physical interactions with the environment. However, existing pipelines rely primarily on vision-based motion capture and kinematic imitation, largely ignoring contact dynamics, leading to artifacts such as foot sliding, floor penetration, and unstable behaviors. In this work, we revisit humanoid motion imitation from the perspective of physical grounding and leverage pressure as a unified modality across perception and control. We present PressMimic, a framework that integrates pressure into the full pipeline from motion capture to humanoid control. In the perception stage, we introduce FRAPPE++, a multimodal model that fuses RGB and pressure to jointly estimate 3D pose and global motion, where pressure provides explicit contact and support constraints to resolve ambiguity in vision-based estimation. In the control stage, we propose a pressure-supervised policy (PSP) that incorporates pressure-derived signals into reinforcement learning, enabling physically consistent contact patterns during execution. We further construct MotionPRO, a large-scale dataset with synchronized RGB, pressure, and motion capture data. Experiments show that pressure improves motion estimation accuracy, trajectory consistency, and execution stability. These results demonstrate that pressure serves as an effective physical grounding signal, bridging perception and control for physically consistent humanoid motion imitation.
中文摘要 类人生物运动模仿不仅需要对人体运动学的准确感知，还需要忠实再现与环境的物理互动。然而，现有的管道主要依赖基于视觉的动作捕捉和运动学模拟，基本忽视了接触动态，导致脚滑、地板穿透和不稳定行为等伪影。在本研究中，我们从物理基础的角度重新审视人形运动模仿，并将压力作为跨感知与控制的统一模式。我们介绍PressMimic，一个将压力整合进从动作捕捉到人形控制的全流程中的框架。在感知阶段，我们引入FRAPPE++，这是一种多模态模型，融合RGB和压力，共同估计三维姿态和全局运动，压力为视觉估算提供明确的接触和支撑约束，以解决模糊性。在控制阶段，我们提出了一种压力监督策略（PSP），将压力衍生信号纳入强化学习，实现执行过程中物理一致的接触模式。我们进一步构建了MotionPRO，这是一个包含同步RGB、压力和动作捕捉数据的大规模数据集。实验表明，压力能提升运动估计的准确性、轨迹一致性和执行稳定性。这些结果表明压力作为有效的物理接地信号，连接感知与控制，实现物理上一致的人形动作模仿。

AIGP: An LLM-Based Framework for Long-Term Value Alignment in E-Commerce Pricing

AIGP：基于大型语言模型的长期电子商务定价价值对齐框架

Authors: Chennan Ma, Yanning Zhang, Siqi Hong, Xiuchong Wang, Fei Xiao, Keping Yang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.26787
Pdf link: https://arxiv.org/pdf/2606.26787
Abstract Traditional dynamic pricing models in large-scale e-commerce suffer from limited interpretability, poor utilization of unstructured information, and misalignment with long-term business objectives such as cumulative Gross Merchandise Value (GMV), Return on Investment (ROI) and milestone achievement. We propose AIGP, a novel framework that leverages a Large Language Model (LLM) prompted with domain knowledge, structured data and textual context to make interpretable, knowledge-aware pricing decisions. For efficient deployment while maintaining high-quality outputs, we employ supervised fine-tuning for knowledge distillation. Central to AIGP is the Long-Term Value Estimator (LTVE), trained via offline reinforcement learning on historical data, which serves as a reward model to score candidate pricing actions and select preference pairs for Direct Preference Optimization (DPO), thereby aligning the pricing policy with long-term business objectives. Extensive offline evaluations and large-scale online A/B tests on Tao Factory demonstrate that AIGP achieves significant improvements: +13.21% in GMV, +7.59% in ROI, and +8.20% in milestone achievement rate over 14 days compared to the production baseline, while simultaneously providing interpretable and transparent pricing rationales.
中文摘要 大型电子商务中的传统动态定价模型存在可理解性有限、非结构化信息利用率差，以及与累计总商品价值（GMV）、投资回报率（ROI）和里程碑成就等长期业务目标不匹配的问题。我们提出了AIGP，这是一个新颖框架，利用大型语言模型（LLM）辅以领域知识、结构化数据和文本上下文，做出可解释、知识感知的定价决策。为了高效部署同时保持高质量输出，我们采用监督式微调技术进行知识提炼。AIGP的核心是长期价值估计器（LTVE），通过离线强化学习基于历史数据训练，作为奖励模型，用于对候选定价动作进行评分并选择直接偏好优化（DPO）的偏好对，从而使定价政策与长期业务目标保持一致。在Tao Factory上进行的大量线下评估和大规模在线A/B测试显示，AIGP在14天内实现了显著提升：GMV提升+13.21%，投资回报率+7.59%，里程碑成就率+8.20%，同时提供可理解且透明的定价理由。
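The step in which a value estimator scores candidate pricing actions and selects preference pairs for DPO might look roughly like the sketch below. Everything here is illustrative: `toy_long_term_value` is a made-up stand-in for the learned LTVE, and picking the best and worst candidates with a minimum margin is my assumption about how pairs could be chosen, not the procedure described in the paper.

```python
from typing import Callable, Dict, Optional, Sequence


def toy_long_term_value(price: float) -> float:
    """Made-up stand-in for a learned value estimator: unit margin times linear demand."""
    cost = 10.0
    demand = max(0.0, 100.0 - 2.0 * price)
    return (price - cost) * demand


def build_preference_pair(
    candidates: Sequence[float],
    score_fn: Callable[[float], float],
    min_margin: float = 0.0,
) -> Optional[Dict[str, float]]:
    """Return the highest- and lowest-scoring candidates as a (chosen, rejected) pair.

    Pairs whose score gap does not exceed min_margin are skipped (None),
    since near-ties carry little preference signal.
    """
    if len(candidates) < 2:
        return None
    ranked = sorted(candidates, key=score_fn)
    worst, best = ranked[0], ranked[-1]
    if score_fn(best) - score_fn(worst) <= min_margin:
        return None
    return {"chosen": best, "rejected": worst}
```

Pairs produced this way have the shape that DPO training data usually takes: one context, one preferred response, and one rejected response.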

OPID: On-Policy Skill Distillation for Agentic Reinforcement Learning

OPID：面向智能体强化学习的同策略技能蒸馏

Authors: Shuo Yang, Jinyang Wu, Zhengxi Lu, Yuhao Shen, Fan Zhang, Lang Feng, Shuai Zhang, Haoran Luo, Zheng Lian, Zhengqi Wen, Jianhua Tao
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.26790
Pdf link: https://arxiv.org/pdf/2606.26790
Abstract Outcome-based reinforcement learning provides a stable optimization backbone for language agents, but its sparse trajectory-level rewards provide little guidance on which intermediate decisions should be reinforced or suppressed. On-policy self-distillation offers dense token-level supervision, yet existing skill-conditioned variants often rely on external skill memories or retrieved privileged context, which are costly to maintain and can be mismatched with the state distribution induced by the current policy in multi-turn interaction. We propose OPID (On-Policy Skill Distillation), a framework that extracts skill supervision directly from completed on-policy trajectories. OPID represents trajectory hindsight as hierarchical skills: episode-level skills capture global workflows or failure-avoidance rules, while step-level skills capture local decision knowledge at critical timesteps. A critical-first routing mechanism uses step-level skills when critical decisions are identified and falls back to episode-level skills as default guidance otherwise. The selected skill is injected into the interaction history, allowing the old policy to re-score the same sampled response under both original and skill-augmented contexts. The resulting log-probability shift yields a token-level self-distillation advantage, which is combined with the outcome advantage for policy optimization. OPID thus preserves RL as the primary training objective while introducing dense, distribution-matched hindsight supervision. Experiments on ALFWorld, WebShop and Search-based QA demonstrate that OPID generally improves agent performance, sample efficiency, and robustness over outcome-only RL and existing skill-distillation baselines. Our code is available at this https URL.
中文摘要 基于结果的强化学习为语言智能体提供了稳定的优化主干，但其稀疏的轨迹级奖励几乎无法指明哪些中间决策应被强化或抑制。同策略自蒸馏可以提供密集的 token 级监督，但现有的技能条件化变体通常依赖外部技能记忆或检索到的特权上下文，这些不仅维护成本高，还可能与当前策略在多轮交互中诱导的状态分布不匹配。我们提出了 OPID（On-Policy Skill Distillation），这是一个直接从已完成的同策略轨迹中提取技能监督的框架。OPID 将轨迹的事后经验表示为层级技能：回合级技能刻画全局工作流程或避免失败的规则，步骤级技能则刻画关键时间步上的局部决策知识。关键优先的路由机制在识别出关键决策时使用步骤级技能，否则回退到回合级技能作为默认指导。所选技能被注入交互历史，使旧策略能够在原始上下文和技能增强上下文下对同一条采样响应重新打分。由此得到的对数概率偏移给出 token 级的自蒸馏优势，并与结果优势相结合用于策略优化。因此，OPID 在保持强化学习作为主要训练目标的同时，引入了密集且分布匹配的事后监督。在 ALFWorld、WebShop 和基于搜索的问答（QA）上的实验表明，与仅使用结果奖励的强化学习和现有技能蒸馏基线相比，OPID 总体上提升了智能体的性能、样本效率和稳健性。我们的代码可在此 https URL 获取。
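The log-probability shift described in the abstract can be illustrated with plain lists of per-token log-probabilities. The abstract only says the self-distillation advantage is "combined" with the outcome advantage, so the additive form and the coefficient `beta` below are assumptions, and the function name is hypothetical.

```python
from typing import List, Sequence


def opid_token_advantages(
    logp_original: Sequence[float],
    logp_skill: Sequence[float],
    outcome_advantage: float,
    beta: float = 1.0,
) -> List[float]:
    """Per-token advantage = trajectory-level outcome advantage
    + beta * (log-prob under skill-augmented context - log-prob under original context).

    Both log-prob sequences score the same sampled response with the old policy.
    """
    if len(logp_original) != len(logp_skill):
        raise ValueError("both scorings must cover the same tokens")
    return [
        outcome_advantage + beta * (ls - lo)
        for lo, ls in zip(logp_original, logp_skill)
    ]
```

Tokens that become more likely once the skill is in context receive a boost, tokens that become less likely are pushed down, and tokens the skill does not affect keep the plain outcome advantage.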

Humanoid-DART: Humanoid Loco-Manipulation using Diffusion-guided Augmentation through Relabeling and Tracking

Humanoid-DART：通过重标注与跟踪实现扩散引导增强的人形机器人移动操作

Authors: Pranav Debbad, Kanish Thiagarajan, Victor Dhédin, Shafeef Omar, Majid Khadiv
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.26855
Pdf link: https://arxiv.org/pdf/2606.26855
Abstract Imitating human demonstrations has emerged as a dominant paradigm for learning humanoid loco-manipulation policies. However, scaling these approaches remains challenging due to the high cost of collecting diverse demonstrations and the need for continual human intervention to correct policy failures. In this paper, we present a self-supervised framework that bootstraps from sparse demonstrations and progressively expands its behavioral repertoire, enabling the learning of a goal-conditioned policy that automatically explores the goal space with minimal expert supervision. Our approach combines diffusion-based trajectory generation with reinforcement learning, where the latter is used to track goal-conditioned trajectories produced by the diffusion model for a range of loco-manipulation skills. Through extensive ablation studies and comparisons with state-of-the-art methods, we demonstrate the effectiveness of our framework on multiple humanoid loco-manipulation skills.
中文摘要 模仿人类演示已成为学习类人机车操控政策的主流范式。然而，由于收集多样化演示的成本高昂且需要持续的人为干预纠正政策失误，这些方法的规模化仍然具有挑战性。本文提出了一个自导框架，从稀疏的演示出发，逐步扩展其行为能力库，使得学习一个目标条件化的策略能够在最少的专家监督下自动探索目标空间。我们的方法结合了基于扩散的轨迹生成与强化学习，后者用于追踪扩散模型生成的目标条件轨迹，用于多种机动操作技能。通过广泛的消融研究和与最先进方法的比较，我们展示了该框架对多种类人机车操控技能的有效性。

PlanRL: A Trajectory Planning Architecture for Reinforcement Learning-based Driving Experts

PlanRL：基于强化学习的驾驶专家的轨迹规划架构

Authors: Joonhee Lim, Yongjae Lee, Jangho Shin, Dongsuk Kum
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.26858
Pdf link: https://arxiv.org/pdf/2606.26858
Abstract Reinforcement learning (RL) has become a prominent framework for developing driving experts in autonomous vehicles. However, most existing RL-based experts are designed to output direct control commands (e.g., throttle, steering), which suffer from a lack of interpretability, high spatial complexity in learning road geometries, and poor compatibility with modern end-to-end planning architectures. To address these limitations, we propose a novel trajectory planning architecture for RL driving experts that integrates an RL policy with a polynomial-based trajectory planner. By employing a Frenet-frame coordinate system, our method simplifies complex road geometries into a curvilinear framework, offering a structured coordinate prior that facilitates policy learning. Furthermore, we incorporate a kinematic feasibility check into the planning stage to ensure that generated trajectories remain within the vehicle's physical limits, effectively mitigating cumulative tracking errors typically found in planning-based systems. We evaluate our approach on key CARLA benchmarks, where it significantly outperforms existing state-of-the-art control-based RL experts. On the CARLA Offline Leaderboard v1 and NoCrash benchmarks, our method improves the driving score by 5% and 11%, respectively, and increases the success rate by 8% and 19%.
中文摘要 强化学习（RL）已成为培养自动驾驶专家的重要框架。然而，大多数现有基于强化学习的专家设计为输出直接控制命令（如油门、转向），但这些命令缺乏可解释性，学习道路几何时空间复杂度高，且与现代端到端规划架构兼容性差。为解决这些限制，我们提出了一种新型的轨迹规划架构，用于驱动强化学习的专家，将强化学习策略与基于多项式的轨迹规划器集成。通过采用Frenet坐标系，我们的方法将复杂的道路几何简化为曲线框架，提供结构化的坐标先验，便于政策学习。此外，我们在规划阶段加入了运动学可行性检查，确保生成轨迹保持在飞行器物理极限内，有效减少规划系统中常见的累积跟踪误差。我们基于关键的CARLA基准评估我们的方法，其表现显著优于现有最先进的基于对照的强化学习专家。在CARLA离线排行榜v1和NoCrash基准测试中，我们的方法分别提升驾驶得分5%和11%，成功率提升8%和19%。
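To illustrate how a polynomial planner in a Frenet frame can be paired with a feasibility check, the sketch below uses a quintic lateral-offset profile with zero velocity and acceleration at both ends. The abstract only says "polynomial-based", so the quintic form, the function names, and the use of the second time derivative of the lateral offset as a proxy for lateral acceleration are all assumptions of this example.

```python
def _normalized_time(t: float, duration: float) -> float:
    """Clamp t / duration to [0, 1]."""
    return min(max(t / duration, 0.0), 1.0)


def quintic_lateral(d0: float, d1: float, duration: float, t: float) -> float:
    """Lateral offset d(t) moving from d0 to d1 with zero end velocity and acceleration."""
    s = _normalized_time(t, duration)
    return d0 + (d1 - d0) * (10 * s**3 - 15 * s**4 + 6 * s**5)


def lateral_accel(d0: float, d1: float, duration: float, t: float) -> float:
    """Second time derivative of the quintic lateral profile."""
    s = _normalized_time(t, duration)
    return (d1 - d0) * (60 * s - 180 * s**2 + 120 * s**3) / duration**2


def is_feasible(d0: float, d1: float, duration: float,
                a_max: float, samples: int = 50) -> bool:
    """Reject trajectories whose sampled lateral acceleration exceeds a_max."""
    return all(
        abs(lateral_accel(d0, d1, duration, duration * i / samples)) <= a_max
        for i in range(samples + 1)
    )
```

With this profile, a 3.5 m lane change passes a 3 m/s² limit when spread over 4 s but fails when compressed into 1 s, which is the kind of candidate a planner would discard before tracking.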

SpatialFlow-GRPO: Where Spatial Credit Drives Image Editing

SpatialFlow-GRPO：以空间信用分配驱动图像编辑

Authors: Yankai Yang, Yancheng Long, Wei Chen, Xingyu Lu, Hongyang Wei, Bin Wen, Fan Yang, Tingting Gao, Han Li, Shuo Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.26872
Pdf link: https://arxiv.org/pdf/2606.26872
Abstract Recent online reinforcement learning has substantially improved image editing quality. However, existing Flow-GRPO-style methods usually rely on a single whole-image reward, which makes fine-grained editing optimization difficult. We observe that a key obstacle in image editing is this spatial uniformity assumption: a whole-image reward cannot distinguish how different spatial regions contribute to image quality. To address this issue, we propose SpatialFlow-GRPO, a training framework that introduces spatially fine-grained reward feedback. The framework converts region-aware rewards into semantic-region-level optimization signals and aligns region advantages with the corresponding latent positions during policy updates. We also train a region-aware reward model, SFReward, construct SFReward-14K with region-annotated editing samples, and introduce MultiEditBench to evaluate multi-region editing ability. On OmniGen2 and FLUX.2-klein-4B, SpatialFlow-GRPO outperforms Flow-GRPO on GEdit-Bench, ImgEdit-Bench, and MultiEditBench. The results show that SpatialFlow-GRPO converts local feedback into spatially aligned update signals and improves editing quality.
中文摘要 最近的在线强化学习显著提升了图像编辑质量。然而，现有的Flow-GRPO风格方法通常依赖单一整幅图像的奖励，这使得细粒度编辑优化变得困难。我们观察到图像编辑中的一个关键障碍是空间一致性假设：整体图像奖励无法区分不同空间区域对图像质量的贡献。为解决这一问题，我们提出了SpatialFlow-GRPO培训框架，该框架引入空间细粒度的奖励反馈。该框架将区域感知奖励转换为语义区域级优化信号，并将区域优势与政策更新时相应的潜在位置对齐。我们还训练了一个区域感知奖励模型SFReward，构建了带有区域注释的编辑样本SFReward-14K，并引入MultiEditBench以评估多区域编辑能力。在 OmniGen2 和 FLUX.2-klein-4B 上，SpatialFlow-GRPO 在 GEdit-Bench、ImgEdit-Bench 和 MultiEditBench 上表现优于 Flow-GRPO。结果显示，SpatialFlow-GRPO将局部反馈转换为空间对齐的更新信号，并提升编辑质量。

GEOALIGN: Geometric Rollout Curation for Robust LLM Reinforcement Learning

GEOALIGN：为稳健的大型语言模型强化学习提供几何推广策划

Authors: Ting Zhou, Zhenqing Ling, Yiyang Zhao, Ying Shen, Daoyuan Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.26917
Pdf link: https://arxiv.org/pdf/2606.26917
Abstract Online reinforcement learning is widely used to align large language models (LLMs) with reward signals, yet training can be unstable under noisy or misspecified rewards. We identify a failure mode we call directional inconsistency: within a batch, a small set of high-reward rollouts induces representation-space preference directions that sharply disagree with the batch majority, resulting in high-variance and destabilizing updates. We propose geoalign, a lightweight plug-in for rollout curation in iterative policy optimization. Geoalign (i) forms within-prompt preference pairs, (ii) learns an online projector on per-rollout hidden states to concentrate reward-ordered displacement directions, and (iii) detects directionally inconsistent rollouts via their angular deviation from a batch consensus prototype and rectifies them with within-prompt stable alternatives. Geoalign is forward-pass only and adds negligible overhead. Across dialogue alignment with a learned reward model and mathematical reasoning with binary verified rewards, Geoalign improves final performance and reduces training oscillation, outperforming PF-PPO, PAR, PODS, and Seed-GRPO. These results suggest latent directional consensus as an effective reliability signal for online LLM RL.
中文摘要 在线强化学习被广泛用于将大型语言模型（LLM）与奖励信号对齐，但在噪声或错误指定奖励下，训练可能不稳定。我们识别出一种称为方向不一致的失败模式：在批次中，一小部分高回报的推广会引发与批次多数截然不同的表示空间偏好方向，导致高方差和不稳定的更新。我们提出了 geoalign，一款用于迭代策略优化中推广策划的轻量插件。Geoalign （i）在提示内的偏好对中形成，（ii）在线投影器学习每次推出的隐藏状态以集中奖励顺序的位移方向，（iii）通过与批处理共识原型的角度偏差检测方向不一致的展开，并用即时内稳定的替代方案进行纠正。Geoalign仅支持前向传递，几乎不增加开销。在学习奖励模型的对话对齐和二元验证奖励的数学推理中，Geoalign提升了最终表现并减少了训练振荡，优于PF-PPO、PAR、PODS和SEED-GRPO。这些结果表明潜在方向共识是在线大型语言模型RL的有效可靠性信号。
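
Step (iii) of the abstract, detecting rollouts by their angular deviation from a batch consensus prototype, can be sketched as follows. This is a simplified illustration and not the authors' implementation: it takes the consensus prototype to be the normalized mean of unit direction vectors, omits the learned projector and the rectification step, and the `min_cosine` threshold is a hypothetical parameter.

```python
import math


def unit(v):
    """Scale a nonzero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def consensus_prototype(directions):
    """Assumed prototype: normalized mean of the unit direction vectors."""
    units = [unit(d) for d in directions]
    mean = [sum(col) / len(units) for col in zip(*units)]
    return unit(mean)


def inconsistent_indices(directions, min_cosine=0.0):
    """Indices whose cosine similarity to the prototype falls below min_cosine."""
    proto = consensus_prototype(directions)
    flagged = []
    for i, d in enumerate(directions):
        cosine = sum(a * b for a, b in zip(unit(d), proto))
        if cosine < min_cosine:
            flagged.append(i)
    return flagged


# Three displacement directions roughly agree; the last points the opposite way.
batch = [[1.0, 0.0], [0.9, 0.1], [1.0, -0.1], [-1.0, 0.2]]
print(inconsistent_indices(batch))
```

The sketch assumes every direction is nonzero and that the mean does not cancel to zero; a real pipeline would need to guard both cases.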

PortraitGen: Exemplar-Driven GRPO with Dual-Reward Guidance for Photorealistic Portrait Generation

PortraitGen：以示范为驱动的GRPO，配备双重奖励指导，实现照片级真实肖像生成

Authors: Xiaomin Li, Qian Liang, Yinan Li, Ying Zhang, Chen Li, Jing Lyu, Huchuan Lu, Xu Jia
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.26930
Pdf link: https://arxiv.org/pdf/2606.26930
Abstract Reinforcement learning methods such as Group Relative Policy Optimization (GRPO) have significantly advanced text-to-image post-training. However, current methods often favor superficial aesthetics, such as over-saturated colors, leaving critical flaws like AI artifacts and biological implausibilities unresolved. We attribute these limitations to two primary factors: (1) the absence of real images during post-training confines GRPO sampling to the original distribution, failing to break inherent generative boundaries; (2) the optimization process lacks specific rewards targeting fine-grained artifacts like overly oily skin and other AI artifacts. To address these issues, we propose PortraitGen, a novel framework tailored for photorealistic portrait generation. First, we break inherent generative boundaries by directly introducing real images into the GRPO sampling groups, where image inversion is employed to obtain their transition probabilities and latents. Second, to explicitly steer the model toward photorealism, we introduce a complementary dual-reward mechanism: OmniReward for general quality and AI-Portrait for human-centric fidelity. Furthermore, we curate PortraitBench, a comprehensive portrait-centric benchmark. Extensive experiments demonstrate that PortraitGen significantly outperforms existing baselines, effectively suppressing AI artifacts and achieving unprecedented photorealism.
中文摘要 强化学习如群体相对策略优化（GRPO）显著推动了文本转图像的后训练技术。然而，当前的方法往往偏向表面美学，如过饱和的色彩，导致人工智能伪影和生物学上的不合理性等关键缺陷未能解决。我们将这些局限归因于两个主要因素：（1）训练后缺乏真实图像，使GRPO采样仅限于原始分布，未能打破固有的生成边界;（2）优化过程缺乏针对细粒度瑕疵（如过油皮肤及其他AI伪影）的具体奖励。为此，我们提出了PortraitGen，一个专为写实肖像生成设计的新框架。首先，我们通过直接将真实图像引入GRPO采样组，打破内在生成边界，利用图像反演来获取其转移概率和潜在值。其次，为了明确引导模型趋向照片写实主义，我们引入了互补的双重奖励机制：OmniReward用于整体质量，AI肖像用于以人为中心的真实度。此外，我们还策划了PortraitBench，这是一个全面的肖像主题标杆。大量实验表明，PortraitGen 远超现有基线，有效抑制了 AI 伪影，实现了前所未有的照片级真实感。

RobOralScan: Learning Active Intraoral Scanning for Robotic Dental Reconstruction

RobOralScan：学习主动口内扫描以实现机器人牙齿重建

Authors: Jinhyung Lee, Haeun Yun, Siwon Kim, Gihyun Baek, Sungho Moon, Sehyun Hwang, Sunghoon Im
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.26955
Pdf link: https://arxiv.org/pdf/2606.26955
Abstract Intraoral scanning is widely used for digital optical impressions in prosthodontic, implant, and orthodontic treatment, but full-arch and long-span scanning remain labor-intensive tasks with limited automation. In the confined oral cavity, operators must continuously adjust scanner motion while accumulating narrow field-of-view observations, making reconstruction quality sensitive to missing tooth surfaces and operator workload. We propose RobOralScan, which, to the best of our knowledge, is the first reinforcement learning (RL)-based pipeline for robotic automatic intraoral scanning. RobOralScan introduces a geometric memory-based observation space that accumulates partial scan observations into a tri-state geometric representation, allowing the policy to reason over scan history and insufficiently observed regions. It further introduces tooth-wise coverage learning, combining coverage-aware reward signals and a progressive training scheme to improve global reconstruction coverage while reducing uneven coverage across individual teeth. The learned policy selects relative scanner motions from accumulated geometric memory and robot proprioception for closed-loop scan control within the oral workspace. RobOralScan achieves a Chamfer Distance of 0.00838, an average coverage of 92.58%, a lower-tail per-tooth coverage of 88.45%, and a normalized AUC of 0.6674, completing the scan criterion in 8 of 10 evaluation episodes. Furthermore, zero-shot sim-to-real experiments demonstrate its practical feasibility on a physical robot-scanner setup.
中文摘要 口内扫描广泛用于修复、种植和正畸治疗中的数字光学印模，但全弓扫描和长距离扫描仍是劳动密集型任务，自动化有限。在狭窄的口腔内，操作员必须持续调整扫描仪运动，同时积累狭窄视场的观察，使重建质量对缺失的牙齿表面和操作员的工作量非常敏感。我们提出了RobOralScan，据我们所知，这是首个基于强化学习（RL）的机器人自动口腔扫描流程。RobOralScan 引入了一个基于几何记忆的观测空间，将部分扫描观测数据累积为三态几何表示，使该策略能够对扫描历史和观测不足区域进行推理。它进一步引入了按牙齿覆盖度学习，结合覆盖感知的奖励信号和渐进式训练方案，以提升全球重建覆盖率，同时减少单个牙齿之间的覆盖不均。该策略从积累的几何记忆和机器人本体感觉中选择相对扫描动作，实现口腔工作区内的闭环扫描控制。RobOralScan的倒角距离为0.00838，平均覆盖率为92.58%，每齿下尾部覆盖率为88.45%，归一化AUC为0.6674，完成了10次评估中的8次扫描标准。此外，零发射模拟到真实的实验展示了其在物理机器人扫描设备上的可行性。

RolloutPipe: Overlapping Pipelined Rollout and Training in Disaggregated On-Policy LLM Reinforcement Learning

RolloutPipe：在分解策略上LLM强化学习中的重叠流水线推广与培训

Authors: Rongjian Chen, Jianmin Hu, Kejiang Ye, Minxian Xu
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.26997
Pdf link: https://arxiv.org/pdf/2606.26997
Abstract Large language model (LLM) post-training for reasoning increasingly relies on reinforcement learning with verifiable rewards (RLVR), where models learn from ground-truth feedback on mathematical, logical, and scientific tasks. To enable flexible resource allocation and support heterogeneous training setups, modern RLVR systems adopt disaggregated architectures that decouple rollout generation and policy training across independent GPU pools. However, existing synchronous on-policy GRPO (Group Relative Policy Optimization) RLVR systems finish an entire rollout before starting training, leaving the trainer GPU pool idle while rollout is still ongoing. Asynchronous RL pipelines overlap the two stages, but at the cost of training on stale data. To address these challenges, we propose RolloutPipe, a post-training framework for disaggregated RLVR systems, which turns the fixed-weight rollout into a complete-group pipeline where trainable groups move to the trainer while later groups are still being generated. RolloutPipe achieves this through two techniques including complete-group pipelining (CGP) and frontier-group dispatch (FGD). CGP dispatches each trainable complete group to the trainer FIFO as soon as group materialization finishes, and FGD is an admission policy on the Rollout node that first admits requests for the frontier groups needed to form the next training batch, so that trainer-ready groups arrive earlier and more steadily. The design starts training before the rollout completes while maintaining on-policy correctness. Evaluated on Qwen3-1.7B across four reasoning and science benchmarks and twelve rollout settings, RolloutPipe shortens the rollout-to-train-end time by 30.7%-42.3%, and lowers the trainer waiting ratio by 37%-76% compared to Slime, a state-of-the-art rollout and training system.
中文摘要 大型语言模型（LLM）推理后训练越来越依赖可验证奖励的强化学习（RLVR），模型通过对数学、逻辑和科学任务的真实反馈进行学习。为了实现灵活的资源分配并支持异构训练设置，现代RLVR系统采用了拆分架构，将部署生成和策略训练在独立GPU池间解耦。然而，现有的同步策略内GRPO（Group Relative Policy Optimization）RLVR系统会在开始训练前完成整个部署，导致培训器GPU池处于空闲状态，而部署仍在进行中。异步强化学习流水线在两个阶段中重叠，但代价是训练过时数据。为应对这些挑战，我们提出了RolloutPipe，这是一个针对拆分RLVR系统的后培训框架，将固定权重的部署转化为完整的组流程，可训练组会迁移到训练器，而后续组仍在生成中。RolloutPipe 通过两种技术实现这一点，包括完整组流水线（CGP）和前沿组调度（FGD）。CGP会在组化完成后立即将每个可训练的完整小组发送到培训师FIFO，而FGD则是部署节点上的招生策略，优先接收组建下一批培训所需的前沿小组申请，从而使培训师准备好的小组更早、更稳定地到达。设计在推广完成前就开始训练，同时保持政策正确性。在Qwen3-1.7B测试下，涵盖四个推理和科学基准测试以及十二个推广设置，RolloutPipe将推广到培训结束的时间缩短了30.7%-42.3%，培训师等待率也降低了37%-76%，相比Slime（一个最先进的推广和培训系统）。

Improving General Role-Playing Agents via Psychology-Grounded Reasoning and Role-Aware Policy Optimization

通过基于心理学的推理和角色意识的政策优化，提升通用角色扮演代理

Authors: Zhenhua Xu, Dongsheng Chen, Jian Li, Yitong Lin, Zhebo Wang, Jiafu Wu, Yizhang Jin, Chengjie Wang, Meng Han, Yabiao Wang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.27025
Pdf link: https://arxiv.org/pdf/2606.27025
Abstract Building general-purpose role-playing agents that faithfully portray any character from a natural-language profile remains challenging. The dominant paradigm, supervised fine-tuning, encourages behavioral mimicry without deep, human-like internal thought processes, resulting in poor out-of-distribution generalization. Therefore, we propose Psy-CoT, a psychology-grounded chain-of-thought framework that decomposes pre-response reasoning into three role-specific steps (Interaction Perception, Psychological Empathy, and Logical Construction) so that the model thinks dynamically from the profile rather than merely mimicking surface patterns. While structured reasoning provides a foundation, it alone is insufficient; reinforcement learning is essential to further align the model with character fidelity. However, we observe that under LLM-based reward models, both generic phrases that hack the reward model and genuinely role-specific phrases receive identical gradient signals; this hacking accumulates over training, misleading the model into treating both as equally optimal choices. To address this, we propose Role-Aware Policy Optimization (RAPO), which uses profile-token mutual information to weight gradients asymmetrically, amplifying role-specific tokens under positive advantage while attenuating them under negative advantage. Experiments on CoSER, CharacterBench, and CharacterEval demonstrate that Psy-CoT outperforms existing role-playing CoT methods, and RAPO consistently surpasses GRPO across multiple model scales.
中文摘要 构建能够忠实展现自然语言角色的通用角色扮演代理依然具有挑战性。主流范式——监督式微调——鼓励行为模仿，缺乏深层的人类内在思维过程，导致分布外泛化效果不佳。因此，我们提出了 \textbf{Psy-CoT}，一个基于心理学的思维链框架，将反应前推理分解为三个特定角色步骤——\emph{互动感知}、\emph{心理共情}和\emph{逻辑构建}——使模型从画像动态思考，而不仅仅是模仿表面模式。虽然结构化推理提供了基础，但仅靠它本身是不够的;强化学习对于进一步使模型与角色忠实度高度对齐至关重要。然而，我们观察到，在基于LLM的奖励模型下，修改奖励模型的通用短语和真正针对角色的特定短语接收到相同的梯度信号——这种黑客行为在训练过程中积累，误导模型将两者视为同等最优选择。为此，我们提出了 \textbf{角色感知策略优化（RAPO）}，该方法利用配置文件——令牌互信息对梯度进行非对称权重——在正优势下放大角色特定令牌，在负优势下减弱。在CoSER、CharacterBench和CharacterEval上的实验表明，Psy-CoT优于现有的角色扮演CoT方法，RAPO在多个模型尺度上持续超过GRPO。

State Representation Matters in Deep Reinforcement Learning: Application to Energy Trading

深度强化学习中的状态代表性重要性：在能源交易中的应用

Authors: Jesper Klicks, Sander Vržina, Vincent François-Lavet
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.27032
Pdf link: https://arxiv.org/pdf/2606.27032
Abstract Energy trading decisions depend not only on current market prices, but also on expected future market conditions and operational constraints. This makes the state representation given to a reinforcement learning agent an important design choice. We study this in HydroDam, a pumped-storage arbitrage environment, using a fixed Double DQN agent. The environment, action space, reward function, network, and training protocol are kept fixed; only the market features are changed. We compare absolute price/calendar features, relative features that compare current prices with recent market history, forecast features, and all combinations of these three feature families. Policies are trained and selected using 2007-2011 Belgian day-ahead prices and evaluated on two test settings: a later same-market test set from 2012-2025 and 39 other ENTSO-E market zones. The absolute-only state reaches just 28.8% on the test set and a median of 5.7% across zones. Relative-only and forecast-only states also stay below a rolling price-score heuristic in the cross-zone median. Combining feature families is much stronger: absolute + relative reaches 49.9% on the test set and a 39.8% cross-zone median, while absolute + relative + forecast reaches 55.6% and 47.5%. These results suggest that state representation is not a minor preprocessing choice in storage-trading RL, but a central part of the policy design: robust transfer requires combining price scale, recent relative price context, and short-horizon forecast information, rather than relying on any single feature family.
中文摘要 能源交易决策不仅取决于当前市场价格，还取决于预期的未来市场状况和运营限制。这使得强化学习代理获得状态表示成为一个重要的设计选择。我们在HydroDam（一种抽水蓄能套利环境）中使用固定的双DQN代理进行研究。环境、行动空间、奖励功能、网络和训练协议保持固定;仅改变了市场特征。我们比较绝对价格/日历特征、比较当前价格与近期市场历史的相对特征、预测特征以及这三大功能族的所有组合。保单采用2007-2011年比利时提前期价格进行培训和选择，并在两个测试环境中进行评估：2012-2025年的同一市场测试集以及其他39个ENTSO-E市场区。绝对特征在测试集中仅达到28.8%，在各个区域中位数为5.7%。仅相对和仅预测的州也在跨区中位数下保持在滚动价格分数启发式以下。组合特征族的效果更为强：测试集中绝对+相对比例达到49.9%，跨区中位数为39.8%，而绝对+相对+预测分别为55.6%和47.5%。这些结果表明，状态代表性并非存储-交易强化学习中的次要预处理选择，而是政策设计的核心部分：稳健转移需要结合价格规模、近期相对价格背景和短期预测信息，而非依赖任何单一特征族。

Heavy-Ball Q-Learning with Residual Weighting Correction

带有残差权重修正的重球Q学习

Authors: Donghwan Lee
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.27112
Pdf link: https://arxiv.org/pdf/2606.27112
Abstract This paper proposes a corrected heavy-ball Q-learning method for reinforcement learning (RL) and establishes its convergence. It also identifies conditions under which the method is theoretically guaranteed to converge faster than standard Q-learning. The same construction is then extended to Q-learning with linear function approximation, where analogous convergence and acceleration statements are derived. The analysis is based on a switched linear system (SLS) representation of Q-learning algorithms and on the joint spectral radius (JSR) of the associated switching families. This SLS viewpoint is not commonly used in standard analyses of Q-learning, and it provides a complementary framework and new insight into how heavy-ball momentum can accelerate Q-learning.
中文摘要 本文提出了一种修正重球Q-学习方法用于强化学习（RL），并建立了其收敛性。它还识别了理论上保证方法比标准Q学习收敛更快的条件。同样的构造随后被扩展到线性函数近似的Q学习中，推导出类似的收敛和加速陈述。该分析基于Q-学习算法的切换线性系统（SLS）表示以及相关切换族的联合谱半径（JSR）。这种SLS观点在Q学习的标准分析中并不常见，但它提供了一个互补的框架和新的见解，帮助我们了解重球动量如何加速Q学习。
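
For intuition about heavy-ball momentum in Q-learning, the sketch below applies the textbook heavy-ball update Q_{k+1} = Q_k + alpha (T Q_k - Q_k) + beta (Q_k - Q_{k-1}) to a toy problem with one state and two self-looping actions, using the exact Bellman operator T. It is not the corrected method proposed in the paper: the residual weighting correction is not described in the abstract and is not reproduced here. The toy MDP, step size, and momentum value are assumptions chosen for illustration, and the example makes no claim about convergence in general.

```python
GAMMA = 0.9
REWARDS = [1.0, 0.0]   # one state, two actions, both transitions are self-loops
Q_STAR = [10.0, 9.0]   # fixed point of Q(a) = r(a) + GAMMA * max(Q)


def bellman(q):
    """Exact Bellman optimality operator for the toy MDP."""
    best = max(q)
    return [r + GAMMA * best for r in REWARDS]


def iterate(alpha, beta, tol=1e-6, max_iters=10_000):
    """Synchronous Q-iteration with heavy-ball momentum (beta=0 is the plain update).
    Returns the final Q-values and the number of iterations used."""
    q_prev = [0.0, 0.0]
    q = [0.0, 0.0]
    for k in range(1, max_iters + 1):
        target = bellman(q)
        q_next = [
            qi + alpha * (ti - qi) + beta * (qi - pi)
            for qi, ti, pi in zip(q, target, q_prev)
        ]
        q_prev, q = q, q_next
        if max(abs(a - b) for a, b in zip(q, Q_STAR)) < tol:
            return q, k
    return q, max_iters


q_plain, iters_plain = iterate(alpha=0.5, beta=0.0)
q_heavy, iters_heavy = iterate(alpha=0.5, beta=0.3)
print(iters_plain, iters_heavy)
```

On this particular problem the momentum run needs fewer iterations than the plain run to reach the same tolerance, which is the kind of acceleration the paper studies under precise conditions.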

Automating Potential-based Reward Shaping with Vision Language Model Guidance

利用视觉语言模型指导自动化基于潜力的奖励塑造

Authors: Henrik Müller, Daniel Kudenko
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.27180
Pdf link: https://arxiv.org/pdf/2606.27180
Abstract Sparse rewards are inherently challenging for reinforcement learning agents as they lack intermediate feedback to guide exploration and to correctly attribute the sparse success rewards to relevant parts of the trajectory. Naive reward shaping can induce reward hacking, yielding policies that exploit auxiliary signals instead of solving the intended task. Potential-based reward shaping (PBRS) guarantees preservation of the optimal policy set, but requires the definition of a heuristic potential function over the state space. In this work, we introduce the VLM-guided PBRS framework VLM-PBRS that learns the potential function directly from vision language model (VLM) feedback. We query a lightweight VLM to obtain preferences over image pairs and train a model of the potential function using these preferences. As this approach is based on potential-based reward shaping, it preserves the original optimal policies, and removes the need for expert-designed reward shaping terms. Because large VLMs are prohibitively expensive to invoke repeatedly during policy learning, we employ smaller, more computationally efficient VLMs. Although the resulting preference labels are less accurate, empirical evidence shows that the preference labels can still be used to accelerate learning. We validate our method empirically in the Meta-World and Franka Kitchen environments and highlight the connection between VLM preference label accuracy and sample efficiency improvements. Our contributions are threefold: (1) the first application of VLM preference-based learning to synthesize a potential function for PBRS, (2) a principled, low-cost solution that leverages small VLMs, and (3) extensive empirical demonstration of improved sample efficiency and robustness to reward hacking.
中文摘要 稀疏奖励对强化学习代理来说本质上具有挑战性，因为它们缺乏中间反馈来引导探索，也无法正确将稀疏成功奖励归因于相关路径部分。天真的奖励塑造可能诱导奖励黑客行为，导致利用辅助信号而非解决预期任务的策略。基于势的奖励整形（PBRS）保证最优策略集的保持，但需要定义状态空间上的启发式势能函数。在本研究中，我们介绍了VLM引导的PBRS框架VLM-PBRS，该框架直接从视觉语言模型（VLM）反馈中学习潜在功能。我们查询一个轻量级VLM以获取对图像对的偏好，并利用这些偏好训练潜在函数模型。由于该方法基于基于潜力的奖励塑造，它保留了原始的最优策略，并免除了专家设计的奖励塑造术语的需求。由于大型VLM在策略学习中反复调用成本过高，我们采用更小、计算效率更高的VLM。尽管最终的偏好标签准确度较低，但实证证据表明，偏好标签仍可用于加速学习。我们在Meta-World和Franka Kitchen环境中实证验证了该方法，并强调了VLM偏好标签准确性与样本效率提升之间的联系。我们的贡献有三方面：（1）首次将基于VLM偏好的学习应用于PBRS的潜在函数，（2）一种利用小型VLM的原则性低成本解决方案，（3）大量实证证明了改进样本效率和鲁棒性以奖励黑客行为。
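
The policy-preservation property of PBRS comes from the form of the shaping term, F(s, s') = gamma * Phi(s') - Phi(s), whose discounted sum along any trajectory telescopes to a quantity that depends only on the endpoints. The sketch below checks this numerically. The hand-written `potential` on a 1-D chain is a hypothetical stand-in: in the paper the potential is learned from VLM preferences, which is not modeled here.

```python
GAMMA = 0.99
GOAL = 5


def potential(state):
    """Hypothetical hand-written potential: negative distance to the goal."""
    return -abs(GOAL - state)


def shaping_term(state, next_state, gamma=GAMMA):
    """Standard PBRS term F(s, s') = gamma * Phi(s') - Phi(s)."""
    return gamma * potential(next_state) - potential(state)


def shaped_reward(env_reward, state, next_state, gamma=GAMMA):
    """Environment reward plus the potential-based shaping term."""
    return env_reward + shaping_term(state, next_state, gamma)


# A trajectory on a 1-D chain that backtracks once before reaching the goal.
trajectory = [0, 1, 2, 1, 2, 3, 4, 5]
discounted_bonus = sum(
    GAMMA**t * shaping_term(s, s_next)
    for t, (s, s_next) in enumerate(zip(trajectory, trajectory[1:]))
)

# Telescoping identity: the total bonus depends only on the first and last states.
T = len(trajectory) - 1
telescoped = GAMMA**T * potential(trajectory[-1]) - potential(trajectory[0])
print(abs(discounted_bonus - telescoped) < 1e-9)
```

Because the total bonus is independent of the path taken between the endpoints, a detour such as the backtracking step above cannot be exploited for extra shaped return.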

Paved with True Intents: Intent-Aware Training Improves LLM Safety Classification Across Training Regimes

以真实意图铺装：意图感知培训提升各培训体系中的LLM安全分类

Authors: Jeremias Ferrao, Niclas Müller-Hof, Iustin Sîrbu, Traian Rebedea, Yftah Ziser
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.27210
Pdf link: https://arxiv.org/pdf/2606.27210
Abstract We argue that safety classifiers should model user intent as an explicit signal between the prompt and the final label. To study this, we introduce AIMS, a human-annotated dataset of 1,724 difficult safety prompts, each paired with an intent description and harm label. We use AIMS to evaluate intent-aware training across supervised fine-tuning, preference learning, reasoning distillation, and reinforcement learning. Despite its size, AIMS enables competitive safety classifiers across training regimes: DPO from model-generated intent errors improves over SFT, and intent-conditioned distillation outperforms reasoning-only distillation in most teacher-student pairs. Most notably, directly rewarding intent faithfulness with GRPO yields the strongest average performance across five external safety benchmarks, while our intent-aware models form the inference latency-F1 Pareto frontier. These results show that faithful intent modeling is a compact, high-quality supervision signal for more robust safety classifiers.
中文摘要 我们认为安全分类器应将用户意图建模为提示与最终标签之间的明确信号。为此，我们引入了AIMS，这是一个由人类注释的数据集，包含1724个难度安全提示，每个提示都配有意图描述和危害标签。我们利用AIMS评估意图感知训练，涵盖监督式微调、偏好学习、推理提炼和强化学习。尽管规模庞大，AIMS仍能在不同培训体系中实现竞争性安全分类：模型生成意图误差的DPO优于SFT，意图条件蒸馏在大多数师生对中优于仅推理的提炼。最显著的是，直接奖励 GRPO 对意图忠实度的表现，在五个外部安全基准中获得最强的平均表现，而我们的意图感知模型构成了推断延迟-F1 帕累托前沿。这些结果表明，忠实意图建模是一种紧凑且高质量的监管信号，适用于更稳健的安全分类器。

Designing Reward Signals for Portable Query Generation: A Case Study in Industrial Semantic Job Search

为便携式查询生成设计奖励信号：工业语义求职案例研究

Authors: Ping Liu, Qianqi Shen, Jianqiang Shen, Wenqiong Liu, Rajat Arora, Yunxiang Ren, Chunnan Yao, Dan Xu, Baofen Zheng, Wanjun Jiang, Andrii Soviak, Kevin Kao, Jingwei Wu, Wenjing Zhang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.27291
Pdf link: https://arxiv.org/pdf/2606.27291
Abstract Job-search platforms rely on low-bandwidth query interfaces that often fail to capture the high-dimensional complexity of candidate profiles. We present an end-to-end RLAIF (Reinforcement Learning from AI Feedback) framework to generate portable job search queries, terms that abstract away seeker-specific identifiers while preserving generalizable qualifications. This task introduces a highly adversarial reward surface where policy optimization frequently exploits flaws in LLM-as-judge rubrics, resulting in degenerate verbatim-copying behaviors. We conducted comprehensive empirical experiments to isolate the impact of optimization mechanics against structured reward engineering. Our results demonstrate that for critic-free optimizers, performance is overwhelmingly dictated by robust reward shaping, rendering the specific choice of algorithm largely immaterial. While critic-free per-rollout baseline methods (RLOO and REINFORCE++) natively resist reward-hacking, the group-relative advantage normalization in GRPO appears uniquely sensitive to spurious reward signals, making it disproportionately susceptible to exploitation. We show that introducing a deterministic, rule-based reward floor to correct for rewards assigned to verbatim copying mitigates this failure mode, resulting in a substantial +0.147 quality improvement on a cross-family evaluation judge. Ultimately, we show that the training-time reward model inflates performance gains by 2.4x, confirming that the training success is fundamentally dependent on enforcing reward-shaping disciplines rather than selecting alternative optimizers.
中文摘要 求职平台依赖低带宽的查询接口，往往无法捕捉候选人档案的高维复杂性。我们提出了一个端到端的RLAIF（AI反馈强化学习）框架，用于生成\emph{portable}求职查询，这些词在保留可推广资格的同时，抽象化了求职者的特定标识符。该任务引入了一个高度对抗性的奖励面，策略优化经常利用LLM作为评判的评分标准中的缺陷，导致退化的逐字复制行为。我们进行了全面的实证实验，以分离优化力学对结构化奖励工程的影响。我们的结果表明，对于无批评的优化者来说，性能主要由稳健的奖励塑造决定，使得算法的具体选择在很大程度上无关紧要。虽然无批判的每次部署基线方法（RLOO和REINFORCE++）原生抵御奖励黑客攻击，但GRPO中的群体相对优势归一化对虚假奖励信号异常敏感，使其更易被利用。我们证明，引入确定性、基于规则的奖励下限来纠正逐字复制分配的奖励，可以缓解这种失败模式，从而在跨家庭评估评审中实现了显著的+0.147美元质量提升。最终，我们证明训练时间奖励模型将绩效提升提升2.4美元倍数，证实训练成功根本依赖于强制执行奖励塑造纪律，而非选择替代优化器。
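
One plausible reading of the sensitivity noted above is that dividing by the group standard deviation rescales even a tiny, possibly spurious reward gap to a unit-scale advantage, whereas a leave-one-out baseline keeps the gap at its original size. The sketch below compares the two estimators on one group and adds a rule-based floor. It is an illustration under stated assumptions, not the paper's analysis or code: population standard deviation is used for GRPO (implementations differ), and `apply_copy_floor` with its substring test is a hypothetical rule.

```python
import statistics


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: center by group mean, divide by group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def rloo_advantages(rewards):
    """Leave-one-out advantages: reward minus the mean of the other rollouts."""
    total = sum(rewards)
    n = len(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


def apply_copy_floor(reward, query, profile, floor=0.0):
    """Hypothetical rule: a query copied verbatim from the profile is capped at the floor."""
    if query.strip().lower() in profile.lower():
        return min(reward, floor)
    return reward


# Four rollouts; the last one gets a 0.01 bump from a flaw in the judge.
rewards = [0.50, 0.50, 0.50, 0.51]
grpo = grpo_advantages(rewards)
rloo = rloo_advantages(rewards)
print(grpo[3], rloo[3])

profile = "Jane is a Senior Data Engineer at Acme"
print(apply_copy_floor(0.9, "senior data engineer", profile))
print(apply_copy_floor(0.9, "data platform roles", profile))
```

The floor is deterministic and applied before advantages are computed, so a copied query cannot benefit from a generous judge score regardless of which estimator follows.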

Sculpting NeRF Geometry: Human-Preference Fine-Tuning of a 3D-Aware Face GAN

雕刻NeRF几何体：3D感知面部GAN的人类偏好微调

Authors: Archer Moore, Mingming Gong, Liam Hodgkinson
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.27305
Pdf link: https://arxiv.org/pdf/2606.27305
Abstract Reinforcement learning from human feedback (RLHF) for 3D generation is now established across a number of works, but most existing pipelines optimise explicit surface representations, often by converting radiance fields into meshes and training heavily on surface-supervised data. We instead fine-tune a pretrained 3D-aware generative model directly from a learned reward over radiance-field density ($\sigma$) values, with no externally supplied mesh or shape prior. The reward model requires no pretraining, trains easily on a small set of preference samples, and yields robust improvement in 3D geometry. Working on an unconditional 3D-aware face GAN (EG3D), our reward reads the continuous 3D density field of the neural radiance field (NeRF) directly and supplies a geometry-only learning signal, requiring neither text conditioning, mesh extraction, nor multi-view rendering. A density-consistency constraint keeps the 2D appearance qualitatively similar while the geometry is reshaped, at a measurable but bounded distributional cost (FID-50k rises from 4.09 to 6.66): the fine-tuned generator, trained from the preferences of a single annotator as a proof of concept, produces face geometries preferred by users in 74.4% of pairwise comparisons.
中文摘要 基于人类反馈的强化学习（RLHF）现已在多个研究中建立，但大多数现有流程优化显式表面表示，通常通过将辐射场转换为网格并大量训练表面监督数据。我们直接从学习到的辐射场密度（$\sigma$）奖励值微调预训练的3D感知生成模型，且不依赖外部网格或形状。奖励模型无需预训练，只需少量偏好样本即可轻松训练，并且在三维几何方面有显著提升。在无条件三维感知面部GAN（EG3D）上，我们的奖励直接读取神经辐射场（NeRF）的连续三维密度场，并提供仅几何的学习信号，无需文本条件处理、网格提取或多视图渲染。密度一致性约束在几何体重塑时保持二维外观的定性相似，但分布成本可测量但有限制（FID-50k从4.09升至6.66）：由单个注释者偏好训练的微调生成器作为概念验证，生成74.4%的两对比较中用户偏好的面几何。

VibeAct: Vibration to Actions for Contact-Rich Reactive Robot Dexterity

VibeAct：振动对动作的响应式机器人灵巧度提升

Authors: Yuemin Mao, Uksang Yoo, Jean Oh, Jonathan Francis, Jeffrey Ichnowski
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.27344
Pdf link: https://arxiv.org/pdf/2606.27344
Abstract Dexterous manipulation depends on contact events that are fast, local, and often visually occluded. Piezoelectric microphones offer a compact and high-bandwidth way to sense these interactions, but the resulting vibro-acoustic signals are difficult to simulate faithfully enough for end-to-end sim-to-real policy learning on dexterous robot hands. We propose VibeAct, a framework that bridges real vibrotactile sensing and simulation-based reinforcement learning through a shared physical representation of contact and slip. In the real world, we embed piezoelectric microphones into a dexterous robot hand and collect vibro-acoustic data through teleoperation, then replay the recordings in a calibrated digital clone to automatically label per-finger contact and slip. A tactile estimator learns to predict contact and slip from real microphone waveforms, while manipulation policies are trained in simulation on the same representation computed directly from simulated contacts. This decoupling lets policies exploit rapid tactile feedback without simulating raw audio. Across five contact-rich tasks spanning regrasping, in-hand reorientation, and insertion, VibeAct consistently outperforms a proprioception-and-point-cloud baseline in simulation, with the largest gains on tasks requiring sustained reactive control, where the continuous slip-magnitude channel proves the most informative observation. The learned policies transfer to a physical dexterous hand-arm platform, improving success rates on deployed tasks. Project videos and additional details are at this https URL.
中文摘要 灵巧的操作依赖于快速、局部且常被视觉遮挡的接触事件。压电麦克风提供了一种紧凑且高带宽的方式来感知这些交互，但产生的振动声学信号很难足够忠实地模拟，方便灵巧的机器人进行端到端模拟到真实的政策学习。我们提出了VibeAct框架，通过共享接触和滑移的物理表征，桥接真实的振动触觉感知与基于仿真的强化学习。在现实中，我们将压电麦克风嵌入灵巧的机器人手中，通过远程操作收集振动声学数据，然后通过校准的数字克隆机回放录音，自动标记每指的接触和滑动。触觉估计器学习从真实麦克风波形中预测接触和滑动，而操作策略则在仿真中基于直接从模拟接触计算的同一表示进行训练。这种解耦使政策能够利用快速的触觉反馈，而无需模拟原始音频。在五个接触丰富的任务中，涵盖回归、手中重新定向和插入，VibeAct在模拟中始终优于本体感觉和点云基线，在需要持续反应控制的任务中表现最大，而连续滑差通道是最具信息量的观测。所学策略转移到实体灵活的手持武器平台，提高部署任务的成功率。项目视频及更多详情请见此 https 网址。

Bridging Performance and Generalization in Reinforcement Learning for Agile Flight

在强化学习中桥接性能与泛化，助力敏捷飞行

Authors: Jonathan Green, Jiaxu Xing, Nico Messikommer, Angel Romero, Davide Scaramuzza
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.27348
Pdf link: https://arxiv.org/pdf/2606.27348
Abstract Autonomous drone racing is a fundamentally challenging regime for autonomous aerial robots, requiring time-optimal control while operating under persistent actuation saturation. While reinforcement learning (RL) has achieved human-level performance in this domain, current methods fail to generalize; policies trained on specific environments often crash immediately in unseen configurations. This failure reflects the intrinsic difficulty of zero-shot generalization in agile flight, arising from high-dimensional task variation and the tight coupling between safety and performance at high speeds. Existing approaches that improve generalization impose a substantial cost on flight speed: control policies must significantly degrade performance to achieve even modest levels of generalization. In this work, we propose a framework for zero-shot generalization in agile flight for RL-based drone racing. By combining task-aware switching based on learning progress with a physically informed procedural track generator, the framework produces a fast and robust generalist policy without test-time adaptation. Our method achieves strong zero-shot performance across a wide range of unseen racetracks in the real world, demonstrating a 7.4x improvement in generalization over the state-of-the-art approaches, while maintaining competitive racing speeds. We validate our method's results in both simulation and real-world settings, including a challenging vision-based, end-to-end control setting that operates without explicit state estimation, where all prior approaches fail to generalize.
中文摘要 自主无人机竞速是自主空中机器人面临的一项根本挑战，需要在持续的致动过饱和状态下实现时间最优控制。虽然强化学习（RL）在该领域实现了人类水平的表现，但当前方法未能实现普遍应用;针对特定环境训练的策略常常在未见配置中立即崩溃。这一失败反映了敏捷飞行中零发射推广的内在困难，源于高维任务变化以及高速下安全性与性能的紧密耦合。现有提升泛化的方法会对飞行速度造成重大代价：控制策略必须显著降低性能，才能实现即使是适度的泛化水平。在本研究中，我们提出了基于强化学习的无人机竞速中敏捷飞行零发泛化的框架。通过结合基于学习进展的任务感知切换与物理信息化的过程轨迹生成器，该框架生成快速且稳健的通用策略，无需测试时间的适应。我们的方法在现实世界中多种未见赛道上实现了强劲的零射击性能，展示了比最先进方法提升7.4倍的泛化性能，同时保持了竞速的竞争力。我们在模拟和现实环境中验证了方法的结果，包括一种具有挑战性的基于视觉的端到端控制环境，该环境无需显式状态估计，所有先前方法都无法推广。

Reinforcement Learning without Ground-Truth Solutions can Improve LLMs

没有地面真实解决方案的强化学习可以提升LLMs

Authors: Yingyu Lin, Qiyue Gao, Nikki Lijing Kuang, Xunpeng Huang, Kun Zhou, Tongtong Liang, Zhewei Yao, Yi-An Ma, Yuxiong He
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.27369
Pdf link: https://arxiv.org/pdf/2606.27369
Abstract Reinforcement learning with verifiable rewards (RLVR) for training LLMs typically relies on ground-truth answers to assign rewards, limiting its applicability to tasks where the ground-truth solution is unknown. We introduce a Ranking-induced VERifiable framework (RiVER) that trains LLMs on score-based optimization tasks without ground-truth solutions, using deterministic execution feedback as continuous-valued supervision. When applying group-relative RL to such continuous rewards, we identify two key challenges: scale dominance, where uncalibrated score magnitudes across test instances distort policy updates, and frequency dominance, where repeatedly sampled suboptimal solutions can outweigh rare but stronger candidates. RiVER addresses these challenges with calibrated reward shaping that uses instance-wise comparisons and emphasizes top-ranked solvers while retaining bounded feedback for other valid solutions. We train on 12 AtCoder Heuristic Contest tasks and evaluate on Algorithm Engineering Benchmark (ALE-Bench), LiveCodeBench, and USACO. RiVER advances Qwen3-8B and GLM-Z1-9B-0414 by 8.9% and 9.4% in ALE rating rank. More importantly, despite training exclusively on score-based tasks without any ground-truth solutions, RiVER also improves the backbones across exact-solution benchmarks such as LiveCodeBench and USACO by absolute average improvements of 2.4% and 3.5%. By contrast, baselines trained with raw execution scores improve ALE rating but fail to transfer to exact-solution benchmarks. These results suggest that score-based optimization tasks, combined with proper reward calibration, can serve as effective training environments for general coding ability without ground-truth solutions.
中文摘要 带可验证奖励的强化学习（RLVR）通常依赖基于实地真理答案来分配奖励，限制其适用范围在未知地面真理解的任务中。我们引入了一个 \textbf{R}anking-\textbf{i}诱导的 \textbf{VER}ifiable 框架（RiVER），该框架通过确定性执行反馈作为连续值监督，训练基于分数的优化任务。在应用群体相对强化学习来应对此类连续奖励时，我们识别出两个关键挑战：\emph{尺度优势}，即测试实例中未校准的分数幅度扭曲策略更新;以及\emph{频率优势}，反复抽样的次优解可能超过罕见但更强的候选方案。RiVER 通过校准奖励形态解决这些挑战，采用实例级比较，强调排名前列的求解器，同时保留对其他有效解的有界反馈。我们训练12个AtCoder启发式竞赛任务，并评估算法工程基准（ALE-Bench）、LiveCodeBench和USACO。RiVER在ALE评级排名中分别提升Qwen3-8B和GLM-Z1-9B-0414分别提升8.9%和9.4%。更重要的是，尽管RiVER仅针对基于分数的任务进行训练，没有任何实地真相解决方案，RiVER在LiveCodeBench和USACO等精确解基准中也以绝对平均提升2.4%和3.5%提升了骨干。相比之下，使用原始执行分数训练的基线能提升ALE评分，但无法转移到精确解基准测试。这些结果表明，基于评分的优化任务结合适当的奖励校准，可以作为在无需地面真相解的情况下，有效训练通用编码能力的环境。
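
The scale-dominance problem named above can be illustrated with a small sketch: when raw scores are averaged across test instances, the instance with the largest magnitudes decides the ordering, while per-instance rank rewards are bounded and comparable. This is a simplified stand-in and not RiVER's actual shaping, which additionally emphasizes top-ranked solutions; `rank_rewards` and the example scores are hypothetical.

```python
def rank_rewards(scores_for_instance):
    """Map scores on one test instance to [0, 1]: the fraction of other
    candidates that scored strictly lower (requires at least two candidates)."""
    n = len(scores_for_instance)
    return [
        sum(other < s for other in scores_for_instance) / (n - 1)
        for s in scores_for_instance
    ]


def aggregate(per_candidate_rows):
    """Average each candidate's per-instance values."""
    return [sum(row) / len(row) for row in per_candidate_rows]


# scores[candidate][instance]; instance 0 has a far larger score scale.
scores = [
    [1_000_500.0, 1.0],
    [1_000_000.0, 8.0],
    [1_000_400.0, 9.0],
]

raw_mean = aggregate(scores)

# Rank within each instance (column), then transpose back to candidate rows.
ranked_columns = [rank_rewards(list(col)) for col in zip(*scores)]
ranked_rows = [list(row) for row in zip(*ranked_columns)]
rank_mean = aggregate(ranked_rows)

best_raw = raw_mean.index(max(raw_mean))
best_rank = rank_mean.index(max(rank_mean))
print(best_raw, best_rank)
```

With raw averaging, candidate 0 wins on the strength of the large-scale instance alone; with instance-wise ranks, candidate 2, which is strong on both instances, comes out ahead.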

Keyword: diffusion policy

Bridging Handheld and Teleoperated Supervision for Contact-Rich Manipulation via State-Gated Experts

通过状态门控专家衔接手持式与遥操作监督，实现接触密集型操作

Authors: Vidullan Surendran, Neehar Peri, David Watkins
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.26603
Pdf link: https://arxiv.org/pdf/2606.26603
Abstract Handheld data collection systems, such as the Universal Manipulation Interface (UMI), enable scalable data collection across diverse environments but only capture observed actions rather than the desired actions executed by a robot controller. In contrast, teleoperation captures desired actions directly, but is prohibitively time-consuming to collect. We revisit this trade-off through the lens of action validity across task phases. We observe that handheld trajectories provide valid supervision in tolerant, free-space phases, but lack dynamic feasibility in contact-sensitive phases, where tracking observed trajectories at high stiffness produces large, unsafe contact forces. We study the interaction between these two supervision types for contact-rich manipulation and find that training policies on handheld data combined with a small number of targeted teleoperated demonstrations provides an efficient hybrid strategy. Specifically, rather than teleoperating the entire task, we only collect partial teleoperated demonstrations for task segments where base handheld policies fail. However, naively mixing handheld and teleoperated phase-specific data yields worse performance than training on handheld data alone. To address this mismatch between observed and desired supervision, we propose Bi-modal Routing for Imitation Data via Gated Experts (BRIDGE), a mixture of diffusion policy experts that routes between specialist task phase heads conditioned on the current robot state. Notably, our approach enables task-phase-specific use of desired actions during contact-sensitive segments and improves success rates over handheld-only baselines by up to 36.7% across three contact-rich manipulation tasks.
中文摘要 手持式数据采集系统（如 Universal Manipulation Interface，UMI）能够在多样的环境中实现可扩展的数据采集，但只能记录观测到的动作，而非机器人控制器所执行的期望动作。相比之下，遥操作可以直接记录期望动作，但采集起来极其耗时。我们从动作在各任务阶段的有效性这一角度重新审视这一权衡。我们观察到，手持轨迹在容错性高的自由空间阶段能提供有效监督，但在接触敏感阶段缺乏动力学可行性：在高刚度下跟踪观测轨迹会产生很大且不安全的接触力。我们研究了这两类监督在接触密集型操作中的相互作用，发现将手持数据与少量有针对性的遥操作演示结合起来训练策略，是一种高效的混合方案。具体来说，我们不对整个任务进行遥操作，而只针对基础手持策略失败的任务片段采集部分遥操作演示。然而，简单地混合手持数据与特定阶段的遥操作数据，其性能反而不如仅用手持数据训练。为解决观测监督与期望监督之间的这种不匹配，我们提出了 BRIDGE（Bi-modal Routing for Imitation Data via Gated Experts），这是一种扩散策略专家混合模型，以当前机器人状态为条件，在各任务阶段的专用头之间进行路由。值得注意的是，我们的方法能够在接触敏感片段中按任务阶段使用期望动作，并在三个接触密集型操作任务上将成功率较仅用手持数据的基线最多提升 36.7%。
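The abstract describes routing between phase-specific expert heads conditioned on the current robot state, without specifying the gate or the experts. As a rough illustration only, the sketch below uses two trivial stand-in functions in place of diffusion policy heads and a hypothetical distance-based gate. The state layout, the `threshold` and `sharpness` parameters, and all function names are assumptions, not details taken from the paper.

```python
import math


def free_space_expert(state):
    """Stand-in for a head trained on handheld data: step straight to the target."""
    return [t - p for p, t in zip(state["position"], state["target"])]


def contact_expert(state):
    """Stand-in for a head trained on teleoperated data: small, cautious step."""
    return [0.1 * (t - p) for p, t in zip(state["position"], state["target"])]


def gate(state, threshold=0.05, sharpness=100.0):
    """Hypothetical gate: probability that the robot is in a contact phase.

    Uses a logistic function of the distance to the target, written with
    tanh so that large distances cannot overflow.
    """
    dist = math.dist(state["position"], state["target"])
    return 0.5 * (1.0 - math.tanh(0.5 * sharpness * (dist - threshold)))


def routed_action(state, hard=True):
    """Route between the two experts based on the gate output."""
    p_contact = gate(state)
    if hard:
        # Hard routing: exactly one expert produces the action.
        expert = contact_expert if p_contact >= 0.5 else free_space_expert
        return expert(state)
    # Soft routing: blend the two expert outputs by the gate probability.
    a_free, a_contact = free_space_expert(state), contact_expert(state)
    return [(1.0 - p_contact) * f + p_contact * c for f, c in zip(a_free, a_contact)]


far = {"position": (0.0, 0.0, 0.0), "target": (0.5, 0.0, 0.0)}
near = {"position": (0.49, 0.0, 0.0), "target": (0.5, 0.0, 0.0)}
print("far from contact:", routed_action(far))
print("near contact:    ", routed_action(near))
```

With hard routing, the far state is handled entirely by the free-space stand-in and the near state entirely by the contact stand-in, which mirrors the idea of keeping handheld and teleoperated supervision in separate heads rather than mixing them in one policy.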