Arxiv Papers of Today

Generated: 2026-06-23 19:22:31 (UTC+8); arXiv release time: 2026-06-23 20:00 EDT (2026-06-24 08:00 UTC+8)

93 related papers today

Keyword: reinforcement learning

RL-based Joint Coverage and Beam Optimization of High Altitude Platform Systems

基于强化学习的高空平台系统联合覆盖与波束优化

Authors: Guilhem Loussouarn, Nancy Nayak, Kin K. Leung, Patrick J. Baker
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2606.20578
Pdf link: https://arxiv.org/pdf/2606.20578
Abstract High Altitude Platform Systems (HAPS) are a promising component of 6G network architectures, offering a unique "freedom of movement" that distinguishes them from static terrestrial networks (TN) and orbit-constrained satellite communications. This inherent mobility of HAPS provides a powerful mechanism to address non-stationarity, spatio-temporal user distributions, and traffic dynamics, such as periodic population migrations. This work addresses three key optimization problems in HAPS networks: (a) HAPS positioning for optimal coverage, (b) beam allocation, and (c) joint optimization of coverage and beam allocation. To tackle these complex challenges, a Reinforcement Learning (RL) framework is proposed, capable of operating in scenarios with multiple HAPS. The results demonstrate that the RL-based approach effectively learns to control HAPS positioning and resource allocation, dynamically adapting to variations in user distributions and traffic patterns. In particular, by employing a multi-policy Proximal Policy Optimization (PPO) approach, the proposed framework jointly learns HAPS positioning and beam allocation under spatio-temporal traffic demand variations and outperforms heuristic baselines. Simulation results demonstrate that our joint optimization approach significantly improves sum-rate and user satisfaction, showing that the dynamic mobility of HAPS can be successfully exploited to create highly responsive and efficient next-generation networks.
中文摘要 高空平台系统（HAPS）是6G网络架构中一个有前景的组成部分，提供了独特的“移动自由度”，使其区别于静态地面网络（TN）和受轨道限制的卫星通信。这种固有的HAPS移动性为处理非平稳性、时空用户分布和流量动态（如周期性人口迁移）提供了强大的机制。本研究解决了HAPS网络中的三个关键优化问题：（a）HAPS定位以实现最佳覆盖，（b）波束分配，以及（c）覆盖和波束分配的联合优化。为应对这些复杂挑战，提出了一个强化学习（RL）框架，能够在多个HAPS场景中运行。结果表明，基于强化学习的方法能够有效学习控制HAPS定位和资源分配，动态适应用户分布和流量模式的变化。特别是，通过采用多策略近端策略优化（PPO）方法，所提框架能够在时空流量需求变化下联合学习HAPS定位和波束分配，并优于启发式基线。模拟结果表明，我们的联合优化方法显著提升了总和速率和用户满意度，表明HAPS的动态移动性可以成功利用，打造高度响应且高效的下一代网络。

Darwin Mobile Agent: A Roadmap for Self-Evolution

达尔文移动代理：自我进化路线图

Authors: Daniel Beechey, Derek Yuen, Jianheng Liu, Dezhao Luo, Tiantian He, Weilin Luo, Jun Wang, Kun Shao
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.20622
Pdf link: https://arxiv.org/pdf/2606.20622
Abstract The goal of artificial intelligence is to create agents capable of general, adaptive behaviour in open-ended environments. Guided by the "Bitter Lesson", we argue that the most effective path toward this goal is to systematically remove human priors and allow intelligence to naturally emerge through interaction with a "Big World" that is orders of magnitude more complex than the agent itself. We propose the mobile Graphical User Interface (GUI) as a practical proxy for such a world and introduce Darwin Mobile Agent, an open-source infrastructure designed as a foundation for autonomous reinforcement learning in this domain. This framework addresses the data-collection bottleneck in real-world mobile interactions by using an asynchronous agent-environment loop across parallel cloud-phone instances. We further propose a conceptual roadmap to systematically remove human priors from three fundamental pillars of a self-evolving agent: task curricula, outcome verification, and memory management. We validate that the Darwin infrastructure provides the stability and scalability required for the first stage of this roadmap: policy optimisation in the GUI domain. This work establishes the practical and theoretical foundation necessary to move toward truly autonomous, self-evolving GUI agents.
中文摘要 人工智能的目标是在开放式环境中创造能够进行一般且适应性行为的智能体。在“苦涩教训”的指导下，我们认为实现这一目标最有效的路径是系统性地消除人类先验，让智能通过与一个比智能体本身复杂数个数量级的“大世界”互动自然产生。我们提出了移动图形用户界面（GUI）作为这一世界的实用代理，并介绍了Darwin Mobile Agent，这是一个开源基础设施，旨在为该领域的自主强化学习奠定基础。该框架通过在云-电话实例间使用异步代理-环境循环，解决了现实移动交互中的数据收集瓶颈。我们还提出了一个概念路线图，系统地将人类先验从自我进化代理的三个基本支柱中剔除：任务课程、结果验证和记忆管理。我们验证达尔文基础设施为该路线图第一阶段——GUI领域的策略优化——提供了所需的稳定性和可扩展性。这项工作奠定了迈向真正自主、自我演化的图形界面代理所需的实践和理论基础。

An LLM-Explainable DRL Framework for Passenger-Directed Autonomous Driving

一个可解释的 LLM DRL 乘客主导自动驾驶框架

Authors: Ouided Braoui, Meriem Bouali, Nadir Farhi
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2606.20640
Pdf link: https://arxiv.org/pdf/2606.20640
Abstract Autonomous vehicles offer the potential for safer and more efficient mobility, yet public trust remains limited due to the lack of transparency in their decision-making. This work addresses this issue by combining deep reinforcement learning (DRL) for adaptive driving control with large language model (LLM)-based explainability modules designed to communicate agent behavior to passengers. DRL agents were trained in simulation using a Dueling Double Deep Q-Network to follow distinct driving requests: "fast", "comfort", and "stop". They demonstrated stable learning, safe compliance with traffic rules, and reliable switching between modes within a single trip. In parallel, LLM modules were introduced to interpret passenger requests, determine when explanations were needed, and generate concise, safety-oriented justifications. Results show that this framework, serving as a proof of concept for integrating RL decision-making and LLMs, balances safety, adaptability, and explainability, and is most effective when requests are delayed or overridden due to safety constraints.
中文摘要 自动驾驶汽车带来了更安全、更高效的出行潜力，但由于决策缺乏透明度，公众信任度仍然有限。本研究通过将用于自适应驾驶控制的深度强化学习（DRL）与基于大型语言模型（LLM）的可解释模块相结合来解决这一问题，这些模块旨在向乘客传达智能体的行为。DRL智能体在仿真环境中使用Dueling双重深度Q网络（Dueling Double Deep Q-Network）进行训练，以遵循不同的驾驶请求：“fast”、“comfort”和“stop”。它们表现出稳定的学习能力、对交通规则的安全遵守，以及在单次行程内可靠的模式切换能力。与此同时，引入了LLM模块，用于理解乘客请求，判断何时需要解释，并生成简明、以安全为导向的理由。结果表明，该框架作为强化学习决策与大型语言模型整合的概念验证，平衡了安全性、适应性和可解释性，且在请求因安全约束而被延迟或否决时效果最佳。
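
The Dueling Double Deep Q-Network named in the abstract combines two standard ideas: a dueling aggregation of a state value with per-action advantages, and a Double DQN target in which the online network selects the next action while the target network evaluates it. The sketch below illustrates both in plain Python. The function names `dueling_q_values` and `double_dqn_target` are hypothetical, the inputs are plain lists standing in for network outputs, and nothing here is taken from the authors' implementation.

```python
def dueling_q_values(state_value, advantages):
    """Standard dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + a - mean_adv for a in advantages]


def double_dqn_target(reward, gamma, q_online_next, q_target_next, done):
    """Double DQN target for one transition.

    q_online_next / q_target_next are the next-state Q-values from the
    online and target networks (assumed to be given as lists).
    """
    if done:
        return reward
    # The online network chooses the greedy action ...
    best_action = max(range(len(q_online_next)), key=lambda i: q_online_next[i])
    # ... and the target network supplies the value of that action.
    return reward + gamma * q_target_next[best_action]
```

Subtracting the mean advantage keeps the value and advantage streams identifiable; decoupling action selection from evaluation is the usual remedy for Q-value overestimation.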

MAGNIFIED: RL Fine-tuning of Multimodal Large Language Models for Motion Planning

MAGNIFIED：用于运动规划的多模态大型语言模型强化学习微调

Authors: Letian Chen, Yiren Lu, Justin Fu, Yichen Xie, Runsheng Xu, Jyh-Jing Hwang, Ben Sapp, Drago Anguelov
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.20641
Pdf link: https://arxiv.org/pdf/2606.20641
Abstract Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in semantic understanding and common sense reasoning, making them promising candidates for solving planning problems in autonomous driving. However, the next-token text prediction objectives traditionally used in pre-training and supervised fine-tuning (SFT) of MLLMs may fall short of fulfilling the planning objectives for autonomous vehicles. The next-token prediction objective merely encourages per-token imitation in text, often irrespective of multi-step consequences and the alignment with crucial planning considerations such as giving space to other road actors. To overcome these limitations, we propose a reinforcement learning fine-tuning (RLFT) approach, MAGNIFIED, that aligns the MLLM-based driving agent with planning objectives by learning from token-level rewards. By mapping a sequence of predicted tokens to corresponding vehicle trajectories and learning from planning rewards, MAGNIFIED optimizes for the true planning objectives rather than focusing solely on token prediction accuracy, enabling the model to refine its understanding of the planning task beyond simple imitation. We validate our approach on the Waymo Open Motion Dataset with a novel setup incorporating rasterized bird's-eye views and tokenized trajectories as inputs and planning-oriented outputs. An initial SFT phase establishes a strong baseline in outputting plan trajectories as sequences of X-Y coordinates in text, while subsequent RL fine-tuning substantially enhances planning performance relative to the SFT baseline (demonstrating over a 10.5% reduction in overlap rate and a 38.9% reduction in off-road rate), underscoring the potential of RLFT on MLLMs to achieve vehicle planning that is better aligned with compliant, comfortable, and efficient driving.
中文摘要 多模态大型语言模型（MLLM）在语义理解和常识推理方面展现出卓越的能力，使其成为解决自动驾驶规划问题的有力候选。然而，传统上用于MLLM预训练和监督微调（SFT）的下一词元文本预测目标，可能无法满足自动驾驶车辆的规划目标。下一词元预测目标仅鼓励文本中逐词元的模仿，往往不考虑多步骤后果及与关键规划考虑（如为其他道路参与者留出空间）的一致性。为克服这些局限，我们提出了一种强化学习微调（RLFT）方法，即MAGNIFIED，通过从词元级奖励中学习，使基于MLLM的驾驶智能体与规划目标保持一致。通过将预测的词元序列映射到对应的车辆轨迹，并从规划奖励中学习，MAGNIFIED优化的是真正的规划目标，而不仅仅是词元预测的准确性，使模型能够超越简单模仿，进一步完善对规划任务的理解。我们在Waymo开放运动数据集上验证了我们的方法，采用了一种以栅格化鸟瞰视图和词元化轨迹作为输入、并面向规划输出的新设置。初始的SFT阶段建立了以文本形式的X-Y坐标序列输出规划轨迹的坚实基线，随后的RL微调显著提升了相对于SFT基线的规划性能（重叠率降低超过10.5%，驶出道路率降低38.9%），凸显了在MLLM上进行RLFT以实现更符合合规、舒适且高效驾驶的车辆规划的潜力。

Platooning Connected, Autonomous, and Human-Driven Vehicles: A Deep Reinforcement Learning-based Approach

分队编排联网、自主和人驾驶车辆：基于深度强化学习的方法

Authors: Zhen Qina, Dong-Fan Xie, Heng Ma, Xiaomei Zhao, Zhengbing He
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Robotics (cs.RO); Physics and Society (physics.soc-ph)
Arxiv link: https://arxiv.org/abs/2606.20648
Pdf link: https://arxiv.org/pdf/2606.20648
Abstract Conventionally, existing vehicle platooning approaches are designed for connected vehicles, typically including connected autonomous vehicles and connected human-driven vehicles. Non-connected vehicles, such as non-connected autonomous or human-driven vehicles, are not incorporated. As a result, these platooning approaches may not properly reflect real-world mixed traffic conditions at the current stage. To address this limitation, this study proposes a hybrid platooning pattern that conditionally permits non-connected vehicles to join platoons, thereby enhancing platooning diversity and flexibility. However, it was found that the unregulated integration of non-connected vehicles can trigger rapid platoon expansion, significantly amplifying the risk of disturbance propagation in traffic flow. This, in turn, exacerbates the inherent conflict between traffic throughput and stability. To mitigate these challenges, this paper further develops a hybrid platooning control strategy based on deep reinforcement learning (DRL). This strategy integrates vehicle dynamics, platoon topology, and traffic flow states through a multi-level state representation network, enabling a dynamic trade-off between traffic capacity and stability. Numerical simulations demonstrate that the proposed strategy effectively suppresses velocity disturbance propagation by dynamically optimizing platoon structures, thereby significantly enhancing the stability and safety of mixed traffic while reducing fuel consumption and emissions.
中文摘要 传统上，现有的车辆编队方法主要针对联网车辆设计，通常包括联网自动驾驶车辆和联网人工驾驶车辆。非联网车辆，如非联网的自动驾驶或人工驾驶车辆，未被纳入。因此，这些编队方法可能无法准确反映当前阶段真实的混合交通状况。为解决这一限制，本研究提出了一种混合编队模式，有条件地允许非联网车辆加入车队，从而增强编队的多样性和灵活性。然而，研究发现，非联网车辆的无序加入可能引发车队快速扩张，显著放大交通流中扰动传播的风险。这反过来加剧了交通通行能力与稳定性之间的固有冲突。为缓解这些挑战，本文进一步提出了基于深度强化学习（DRL）的混合编队控制策略。该策略通过多级状态表示网络整合车辆动力学、车队拓扑和交通流状态，实现交通容量与稳定性之间的动态权衡。数值模拟表明，该策略通过动态优化车队结构，有效抑制速度扰动传播，显著提升混合交通的稳定性和安全性，同时降低燃油消耗和排放。

SafeDojo: Safe Reinforcement Learning for VLA via Interactive World Model

SafeDojo：通过交互式世界模型实现VLA的安全强化学习

Authors: Kai Tang, Peidong Jia, Zhong Chu, Jixian Wu, Rui Ma, Jiajun Cao, Fangyuan Zhao, Sixiang Chen, Yichen Guo, Xiaowei Chi, Chun-Kai Fan, Kevin Zhang, Jinchang Xu, Fubing Yang, Weishi Mi, Xiaozhu Ju, Jian Tang, Shanghang Zhang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.20698
Pdf link: https://arxiv.org/pdf/2606.20698
Abstract Safe control is a prerequisite for real-world embodied intelligence, for which safe reinforcement learning has emerged as a promising paradigm. However, existing safe reinforcement learning methods either require costly real-world exploration or depend on hand-crafted safety functions. Neither scales to vision-language-action models deployed in open-world physical environments. We propose SafeDojo, the first model-based safe reinforcement learning framework for vision-language-action policies designed to learn safe actions through world model-based imagination. Specifically, SafeDojo performs online reinforcement learning on top of an interactive video world model. The world model generates action-conditioned future predictions, from which a tailored ResNet success classifier estimates per-step task progress from imagined frames and a lightweight safety head predicts per-step safety costs from latent context together with the proposed action chunk, enabling simultaneous assessment of task execution and trajectory safety. The decoupled task-reward and safety-cost signals are balanced through a Lagrangian-based constrained GRPO objective, enabling coordinated improvement of task success and safety under explicit constraints. On SafeLIBERO, SafeDojo achieves the best aggregate task success, safe success, and execution efficiency among inference-time safety, model-free RL, and model-based RL baselines, with the best average safe-success rate on both levels and an 8.25 percentage-point improvement over the strongest baseline on Level I. Real-world Franka deployment further shows the best average task and safe-success rates across five tasks. Our results position world model-based safe reinforcement learning as a scalable and generalizable path toward safe embodied intelligence.
中文摘要 安全控制是现实世界具身智能的前提，安全强化学习已成为一种有前景的范式。然而，现有的安全强化学习方法要么需要昂贵的现实世界探索，要么依赖手工设计的安全函数。两者都无法扩展到在开放世界物理环境中部署的视觉-语言-动作模型。我们提出了SafeDojo，这是首个面向视觉-语言-动作策略的基于模型的安全强化学习框架，旨在通过基于世界模型的想象来学习安全动作。具体来说，SafeDojo 在交互式视频世界模型之上进行在线强化学习。该世界模型生成以动作为条件的未来预测，基于此，定制的ResNet成功分类器根据想象帧估算每步任务进展，轻量级安全头则根据潜在上下文及拟议动作块预测每步安全成本，从而实现任务执行和轨迹安全性的同步评估。解耦的任务奖励和安全成本信号通过基于拉格朗日的约束GRPO目标进行平衡，从而在明确约束下协调提升任务成功率和安全性。在SafeLIBERO上，SafeDojo在推理时安全方法、无模型强化学习和基于模型的强化学习基线中，实现了最佳的总体任务成功率、安全成功率和执行效率，在两个层级上均取得最佳的平均安全成功率，且在Level I上比最强基线提升了8.25个百分点。真实世界的Franka部署还显示了五项任务中最佳的平均任务成功率和安全成功率。我们的研究结果将基于世界模型的安全强化学习定位为一条可扩展且可推广的通往安全具身智能的路径。
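
The abstract describes balancing a task reward and a safety cost through a Lagrangian-based constrained GRPO objective but does not spell out the formula. The sketch below shows one common way such a combination is built, under stated assumptions: rewards and costs are normalized within a group of sampled rollouts (the group-relative idea behind GRPO), combined with a non-negative multiplier `lam`, and the multiplier is raised when the average cost exceeds a limit. All names (`group_normalize`, `lagrangian_advantages`, `update_multiplier`, `cost_limit`) are hypothetical and this is not SafeDojo's actual objective.

```python
from statistics import mean, pstdev


def group_normalize(values, eps=1e-8):
    """Center and scale values within one group of rollouts."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def lagrangian_advantages(rewards, costs, lam):
    """Combine group-relative reward and cost advantages: A_r - lam * A_c."""
    adv_r = group_normalize(rewards)
    adv_c = group_normalize(costs)
    return [r - lam * c for r, c in zip(adv_r, adv_c)]


def update_multiplier(lam, costs, cost_limit, lr=0.1):
    """Dual ascent step: grow lam when mean cost exceeds the limit, clip at 0."""
    return max(0.0, lam + lr * (mean(costs) - cost_limit))
```

With `lam = 0` the combined advantage reduces to the plain reward advantage; as constraint violations persist, `lam` grows and costly rollouts are penalized more heavily.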

BARD-MARL: Byzantine-Agent Detection for Learned Communication in Multi-Agent Reinforcement Learning

BARD-MARL：多智能体强化学习中学习交流的拜占庭代理检测

Authors: Almond Kiruthu Murimi
Subjects: Multiagent Systems (cs.MA); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.20701
Pdf link: https://arxiv.org/pdf/2606.20701
Abstract Learned communication improves coordination in cooperative multi-agent reinforcement learning, but it also creates a trust problem: a trained policy may route information through agents that have become faulty or adversarial. This paper studies Byzantine-agent detection for learned-communication MARL in adaptive traffic signal control. We propose BARD-MARL, a post-hoc diagnostic layer on top of BayesG, which is used as an attributed communication substrate rather than as a contribution of this paper. BARD-MARL combines two agent-level evidence streams: policy-graph features extracted from state-action trajectories and Bayesian trust statistics computed from BayesG latent mask probabilities. Across fixed-action, observation-flip, random-noise, and coordinated attacks in SUMO traffic grids, the results show that these signals are complementary rather than universally dominant. On a 25-agent grid, BARD-MARL reaches 0.843 AUC-ROC under a 10% observation-flip attack, while policy-graph-only detection reaches 0.917 AUC-ROC under a 10% coordinated attack. On a 100-agent grid, the unified BARD-MARL variant reaches 0.982 AUC-ROC for both 10% fixed-action and 10% coordinated attacks. The study shows that learned communication policies expose useful diagnostic evidence, but credible resilience claims require attack-specific ablations and explicit separation between coordination, detection, and mitigation.
中文摘要 学习式通信改善了合作式多智能体强化学习中的协调性，但也带来了信任问题：经过训练的策略可能会通过已出现故障或具有对抗性的智能体传递信息。本文研究了自适应交通信号控制中面向学习式通信MARL的拜占庭智能体检测。我们提出BARD-MARL，这是一种构建在BayesG之上的事后诊断层；BayesG在此作为已注明出处的通信基础被使用，而非本文的贡献。BARD-MARL结合了两条智能体级证据流：从状态-动作轨迹提取的策略图特征，以及由BayesG潜在掩码概率计算的贝叶斯信任统计。在SUMO交通网格中针对固定动作、观察翻转、随机噪声和协同攻击的实验结果显示，这些信号是互补的，而非某一种普遍占优。在25个智能体的网格中，BARD-MARL在10%观察翻转攻击下达到0.843 AUC-ROC，而仅使用策略图的检测在10%协同攻击下达到0.917 AUC-ROC。在100个智能体的网格中，统一的BARD-MARL变体在10%固定动作攻击和10%协同攻击下均达到0.982 AUC-ROC。研究显示，学习到的通信策略能够暴露有用的诊断证据，但可信的韧性主张需要针对具体攻击的消融实验，并明确区分协调、检测和缓解。
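
The detection results above are reported as AUC-ROC. As a reminder of what that number measures, the sketch below computes it from its rank interpretation: the probability that a randomly chosen faulty agent receives a higher suspicion score than a randomly chosen healthy one, with ties counted as one half. The function name `auc_roc` and the toy scores are illustrative assumptions, not data or code from the paper.

```python
def auc_roc(scores, labels):
    """AUC-ROC via pairwise comparison of positive and negative scores.

    scores: suspicion score per agent (higher means more suspicious).
    labels: 1 for a Byzantine agent, 0 for a healthy agent.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

A value of 0.5 corresponds to a detector that ranks agents no better than chance, and 1.0 to one that ranks every faulty agent above every healthy one.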

MotionPyramid: Hierarchical Motion Representation and Residual Interfaces

MotionPyramid：分层运动表示与残余接口

Authors: Gao Zhu, Zaishuo Xia, Yubei Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.20705
Pdf link: https://arxiv.org/pdf/2606.20705
Abstract We ask whether the representational hierarchy seen in perception, from local primitives such as edges to higher level structures such as parts and objects, can be established for motion. In humanoid control, low level actions specify immediate motor commands, while meaningful behavior is organized over longer temporal scales, including contacts, gait fragments, balance recovery, reaching, and whole body skills. We introduce MotionPyramid, a hierarchical action representation that learns such structure from motion data. Starting from a motion tracking teacher, it trains a recursive stack of latent decoders: low level latents decode to immediate full body motor commands, while higher level latents unfold through lower levels into temporally extended motion programs. After pretraining, the hierarchy is frozen and reused by downstream reinforcement learning policies as a family of action interfaces at different control resolutions. Experiments show the learned levels form a motion hierarchy: coarser interfaces improve early learning and motion regularity by constraining exploration to structured segments, while finer interfaces preserve feedback control and final task precision. Representation probes show the hierarchy supports traversal, interpolation, transition, and qualitative composition, exposing editable control handles across temporal scales. Finally, we introduce Residual Interfaces, letting a downstream policy maintain coarse, segment level, and frame level residual commands through the frozen hierarchy. Analogous to residual or skip connections in deep networks, this allows coarse motion programs and fine residual corrections to coexist within one controller. MotionPyramid shows that motion, like perception, can be organized into a reusable multi level representation, providing structured abstraction without sacrificing controllability.
中文摘要 我们探讨感知中从局部原始元素如边缘到更高层次结构如部分和对象的表征层级，是否可以为运动建立。在类人控制中，低层次动作指定即时的运动指令，而有意义的行为则组织在更长的时间尺度上，包括接触、步态片段、平衡恢复、伸手和全身技能。我们介绍了MotionPyramid，一种从运动数据中学习结构的分层动作表示。从一名动作追踪教师出发，它训练一个递归的潜在解码器堆栈：低层级的潜在解码器解码为全身运动的即时指令，而高层级的潜能通过低层次展开为时间扩展的运动程序。预训练后，层级结构被冻结，并被下游强化学习策略重用，作为一系列动作接口在不同控制分辨率下使用。实验显示，所学层级形成运动层级：较粗糙的界面通过限制探索到结构化的片段来改善早期学习和运动规律性，而更细的界面则保留反馈控制和最终任务的精度。表示探针显示层级支持遍历、插值、过渡和定性组合，暴露了跨时间尺度的可编辑控制手柄。最后，我们引入了残差接口，允许下游策略通过冻结的层级维持粗命令、段级命令和帧级残余命令。类似于深度网络中的残差或跳跃连接，这使得粗动作程序和细微残余修正能够在同一控制器内共存。MotionPyramid 展示了运动和感知一样，可以组织成可重复使用的多层次表示，提供结构化抽象，同时不牺牲可控性。

Empowering Economic Simulation Through Situation-Aware LLM-Driven Generative System

通过情境感知的 LLM 驱动生成系统赋能经济模拟

Authors: Zhimei Chen, Mu Chen
Subjects: Multiagent Systems (cs.MA); Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2606.20720
Pdf link: https://arxiv.org/pdf/2606.20720
Abstract Traditional economic modeling typically follows a TOP-DOWN paradigm, neglecting individual diversity and the complexity of social interactions. To better capture the complexity of societal structure, Agent-Based Modeling (ABM) employs a BOTTOM-UP solution by incorporating micro-level dynamics to generate macroeconomic phenomena. Reinforcement Learning further improves its decision-making ability through tailored reward signals. However, existing ABM systems struggle to generalize beyond predefined scenarios. Recognizing the potential of LLM-driven role-playing in perception and human-like decision-making, we propose SAMAS, which models individual agents with rich macroeconomic understanding embedded in LLMs and economic trajectories experienced in the passing simulation steps. By jointly modeling both macro-level structural patterns and micro-level dynamic behaviors, SAMAS achieves superior performance in volatility realism and turning point prediction.
中文摘要 传统的经济建模通常遵循自上而下的范式，忽视个体多样性和社会互动的复杂性。为了更好地捕捉社会结构的复杂性，基于主体的建模（ABM）采用自下而上的解决方案，通过融入微观动态生成宏观经济现象。强化学习通过定制化的奖励信号进一步提升其决策能力。然而，现有的ABM系统难以泛化到预设场景之外。鉴于LLM驱动的角色扮演在感知和类人决策中的潜力，我们提出了SAMAS，它利用LLM中蕴含的丰富宏观经济理解以及在已过去的模拟步骤中经历的经济轨迹来建模个体智能体。通过联合建模宏观层面的结构模式和微观层面的动态行为，SAMAS在波动真实性和转折点预测方面取得了卓越的表现。

Provably Sub-Linear Two-Timescale NeuroEvolution with Online Plasticity

可证明的亚线性两时间尺度神经进化与在线可塑性

Authors: Shishen Lin, Yixin Chen
Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.20817
Pdf link: https://arxiv.org/pdf/2606.20817
Abstract NeuroEvolution of Augmenting Topologies (NEAT) is a widely used neuroevolution algorithm for learning neural network architectures and weights for control tasks. However, standard offline optimisation searches for connection strengths directly, which can scale poorly in high-dimensional weight spaces and more difficult continuous control problems. Hybrid methods that combine neuroevolution with online learning can address this challenge, but their theoretical properties remain underexplored. This paper gives the first regret analysis for a general NeuroEvolutionary Online Learning (NEOL) framework, which decouples learning into two timescales: an outer loop for architecture search and an inner loop for online weight adaptation via reward-modulated plasticity. Under mild conditions, we prove that NEOL achieves sublinear regret. Empirically, under fixed interaction budgets on four standard control benchmarks, a NEAT-based NEOL implementation achieves higher final fitness and lower variance than pure NEAT, and is competitive with strong reinforcement learning (RL) baselines on several tasks. The results are supported by Wilcoxon rank-sum tests and ablation studies. Overall, the findings show that online plasticity can improve the sample efficiency and robustness of two-timescale neuroevolution. Code is available at this https URL.
中文摘要 增强拓扑的神经进化（NEAT）是一种广泛使用的神经进化算法，用于学习控制任务中的神经网络架构和权重。然而，标准离线优化直接搜索连接强度，这在高维权重空间和更困难的连续控制问题中扩展性较差。结合神经进化与在线学习的混合方法可以解决这一挑战，但其理论特性尚未被充分探讨。本文首次对通用的神经进化在线学习（NEOL）框架进行了遗憾分析，该框架将学习拆分为两个时间尺度：外环用于架构搜索，内环用于通过奖励调节可塑性进行在线权重适应。在温和条件下，我们证明NEOL实现亚线性遗憾。实证方面，在四个标准控制基准测试的固定交互预算下，基于NEAT的NEOL实现比纯NEAT取得了更高的最终适应度和更低的方差，并且在多个任务上与强大的强化学习（RL）基线相当。这些结果得到了Wilcoxon秩和检验和消融研究的支持。总体来看，研究结果表明在线可塑性可以提升两时间尺度神经进化的样本效率和稳健性。代码可在此 https URL 获取。
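
The inner loop described above adapts weights online through reward-modulated plasticity. The abstract does not state the exact rule, so the sketch below uses a generic reward-modulated Hebbian update as a stand-in: each weight changes in proportion to the product of the reward, the presynaptic activity, and the postsynaptic activity. The function name `reward_modulated_hebbian_step`, the learning rate `lr`, and the list-of-lists weight layout are all assumptions made for illustration.

```python
def reward_modulated_hebbian_step(weights, pre, post, reward, lr=0.01):
    """Return updated weights after one reward-modulated Hebbian step.

    weights[j][i] connects presynaptic unit i to postsynaptic unit j.
    Update rule (assumed): w_ji <- w_ji + lr * reward * post_j * pre_i.
    """
    return [
        [w + lr * reward * post_j * pre_i for w, pre_i in zip(row, pre)]
        for row, post_j in zip(weights, post)
    ]
```

In a two-timescale setup of the kind the abstract outlines, an outer evolutionary loop would choose the architecture while a rule like this one adjusts the weights during each episode; a zero reward leaves the weights unchanged.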

Evolutionary Discovery of Developmental Reward Schedules in Deep Reinforcement Learning

深度强化学习中发展性奖励时间表的进化发现

Authors: Alan Nadelsticher Ruvalcaba
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/2606.20858
Pdf link: https://arxiv.org/pdf/2606.20858
Abstract The temporal structure of reward composition in reinforcement learning (RL) is typically hand-designed and held fixed throughout training, leaving the progression of motivational priorities largely unexplored. In this work, we propose an evolutionary framework for discovering developmental reward schedules, in which three distinct biologically inspired motivational components -- agency, novelty, and reactivity -- are combined through time-varying weights that dynamically shift over the course of training. Evaluated on two sparse-reward MiniGrid tasks: DoorKey-6x6 and KeyCorridorS3R1, our framework compares the generalizability of four evolutionary algorithms: CMA-ES, xNES, DE, and L-SHADE against an extrinsically motivated baseline (our main comparison point), and three additional hand-designed methods. On DoorKey-6x6, all evolved methods outperform the non-evolved baselines, with L-SHADE achieving the best performance -- an approximate relative mean improvement of 11.4% over the extrinsic only baseline. On KeyCorridorS3R1, CMA-ES achieves the best overall performance, with the remaining evolved methods showing weaker and less reliable generalization capability compared to the extrinsic only baseline. Interestingly, the discovered schedules diverge from our defined developmental ordering, with novelty consistently emerging as the dominant early signal during training, across both tasks. Collectively, our results position evolutionary optimization as a promising approach for developmental reward schedule discovery in deep reinforcement learning, and suggest that what evolution finds to be optimal in computational settings may differ from what it finds to be optimal in biology. The code for this project can be found at: this https URL.
中文摘要 强化学习（RL）中奖励构成的时间结构通常是手工设计并在整个训练过程中固定的，因此动机优先事项的进展大多未被深入探讨。在本研究中，我们提出了一个进化框架，用于发现发展奖励时间表，其中三种不同的生物启发动机组成部分——主动性、新颖性和反应性——通过时间变化的权重结合起来，这些权重在训练过程中动态变化。该框架基于两个稀疏奖励MiniGrid任务：DoorKey-6x6和KeyCorridorS3R1进行评估，比较了四个进化算法：CMA-ES、xNES、DE和L-SHADE在外在动机基线（我们主要比较点）及另外三种手工设计方法下的推广性。在DoorKey-6x6上，所有进化方法均优于未进化基线，其中L-SHADE表现最佳——相较仅外在基线的相对平均提升约11.4%。在KeyCorridorS3R1上，CMA-ES实现了最佳的整体性能，其余演进方法相比仅外在基线的泛化能力较弱且可靠性较低。有趣的是，发现的计划与我们定义的发展顺序不同，新颖性始终是训练期间的主导早期信号，贯穿两项任务。总体来看，我们的结果将进化优化定位为深度强化学习中发展奖励时间表发现的一种有前景的方法，并表明进化在计算环境中认为最优的做法可能与生物学中认为最优的有所不同。该项目的代码可在以下 https URL 找到。
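
The core mechanism in this abstract is a reward built from three components (agency, novelty, reactivity) whose weights shift over training. The parameterization that the evolutionary algorithms search over is not given in the abstract, so the sketch below assumes the simplest possible one: a linear interpolation between a start and an end weight vector. The names `schedule_weights` and `composite_reward` and the example numbers are hypothetical.

```python
def schedule_weights(progress, start, end):
    """Linearly interpolate weight vectors; progress is clamped to [0, 1]."""
    p = min(1.0, max(0.0, progress))
    return [(1.0 - p) * s + p * e for s, e in zip(start, end)]


def composite_reward(components, weights):
    """Weighted sum of the motivational components at one time step."""
    return sum(w * c for w, c in zip(weights, components))


# Example: emphasis moves from the first component to the third.
weights_mid = schedule_weights(0.5, [1.0, 0.0, 0.0], [0.0, 0.0, 1.0])
reward_mid = composite_reward([2.0, 4.0, 6.0], weights_mid)
```

Under this assumed parameterization an evolutionary optimizer would search over the `start` and `end` vectors, scoring each candidate by the task return of an agent trained with the resulting schedule.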

When Do Intrinsic Rewards Work for Code Reasoning? A Comprehensive Study

内在奖励在代码推理中何时有效？一项综合研究

Authors: Xiaolong Jin, Xuandong Zhao, Wenbo Guo, Xiangyu Zhang, Dawn Song
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.20881
Pdf link: https://arxiv.org/pdf/2606.20881
Abstract Reinforcement learning with verifiable rewards (RLVR) has driven substantial progress in large language model reasoning, but relies on ground-truth supervision that is costly or infeasible, especially in coding tasks. Recent work addresses this by deriving rewards from a model's own signals, such as majority voting or confidence-based scores, achieving notable success on mathematical reasoning benchmarks. However, code generation poses distinct challenges: programs are structurally complex, semantically equivalent solutions may differ syntactically, and verification typically requires execution. Whether these intrinsic reward methods transfer effectively to code remains unexplored. In this work, we present a systematic empirical study of intrinsic reward methods for code generation. We conduct extensive experiments on LiveCodeBench, systematically evaluating representative certainty-based Reinforcement Learning from Internal Feedback (RLIF) approaches under different training scenarios and hyperparameter settings. Our experiments reveal that certainty-based methods yield early gains but inevitably collapse: models progressively shorten outputs and lose reasoning capability, with collapse speed sensitive to sample size and temperature. When used to initialize RLVR training, RLIF pre-training offers no significant improvement over training from scratch. We also provide actionable recommendations for using intrinsic rewards for training code reasoning models. Our study shows both the promise and limitations of intrinsic reward methods for code, informing future work on code models and agents.
中文摘要 带有可验证奖励的强化学习（RLVR）推动了大型语言模型推理的显著进展，但它依赖于真实标签（ground-truth）监督，而这种监督成本高昂或不可行，在编码任务中尤其如此。近期研究通过从模型自身信号中获得奖励，如多数投票或基于置信度的分数，在数学推理基准测试中取得了显著成功。然而，代码生成面临独特的挑战：程序结构复杂，语义等效的解在语法上可能不同，且验证通常需要执行。这些内在奖励方法是否能有效迁移到代码上，目前尚未得到探讨。本研究对代码生成的内在奖励方法进行了系统的实证研究。我们在LiveCodeBench上进行了大量实验，在不同的训练场景和超参数设置下，系统性地评估了代表性的基于确定性的内部反馈强化学习（RLIF）方法。我们的实验显示，基于确定性的方法虽然能带来早期收益，但不可避免地会崩溃：模型逐渐缩短输出并失去推理能力，崩溃速度对样本量和温度敏感。用于初始化RLVR训练时，RLIF预训练相比从零开始训练并无显著提升。我们还提供了可操作的建议，用于使用内在奖励来训练代码推理模型。我们的研究展示了代码内在奖励方法的前景与局限性，为未来代码模型和智能体的研究提供了参考。
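
Majority voting is one of the intrinsic reward signals the abstract mentions. A minimal sketch of the idea, assuming each sampled response has already been reduced to a comparable final answer (for code, this might be the output of running the program on a shared input): responses that agree with the most common answer receive reward 1, the rest 0. The function name `majority_vote_rewards` is hypothetical and this is not the study's training code.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Reward each sampled answer by agreement with the most common one.

    Ties are broken in favor of the answer that appears first, which is
    how Counter.most_common orders equal counts.
    """
    if not answers:
        return []
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]
```

No ground truth enters this computation, which is why a confidently wrong consensus is rewarded just as much as a correct one.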

Learning-Based List Sequential Belief Propagation Decoding of Quantum LDPC Codes

基于学习的列表顺序信念传播解码量子LDPC码

Authors: Mohsen Moradi, Taejoon Kim, Remi A. Chou
Subjects: Information Theory (cs.IT); Quantum Physics (quant-ph)
Arxiv link: https://arxiv.org/abs/2606.20926
Pdf link: https://arxiv.org/pdf/2606.20926
Abstract Quantum low-density parity-check (QLDPC) codes are strong candidates for fault-tolerant quantum computation, but efficient decoding remains a major challenge due to short cycles, degeneracy, and the poor convergence of standard belief-propagation (BP) decoders. We propose a reinforcement learning-based list sequential (RL-LS) BP decoder for QLDPC codes by extending the reinforcement-learning-based sequential variable-node scheduling (RL-S) framework with list-based search. At each step, the learned policy selects the next variable node to update; the decoder then retains the ordinary RL-S trajectory while also exploring a competing branch obtained by softly biasing the post-update LLR pair toward the second-most likely Pauli symbol, recomputing the incident local BP messages, and setting the visited variable node to that second-best symbol. Candidate trajectories are ranked and pruned using our proposed cumulative path metric. The resulting decoder extends the learned decoder by combining the improved convergence of learned sequential scheduling with list exploration. Numerical results on representative QLDPC benchmark codes over the depolarizing channel show that our proposed method improves the decoding performance of the underlying decoder and compares favorably with existing BP-based decoding methods.
中文摘要 量子低密度奇偶校验（QLDPC）码是容错量子计算的有力候选，但由于短环、简并性以及标准信念传播（BP）解码器收敛性差，高效的解码仍是一大挑战。我们提出了一种用于QLDPC码的基于强化学习的列表顺序（RL-LS）BP解码器，它通过基于列表的搜索扩展了基于强化学习的顺序变量节点调度（RL-S）框架。在每一步，学习到的策略选择下一个需要更新的变量节点；解码器随后保留普通的RL-S轨迹，同时探索一个竞争分支，该分支通过将更新后的LLR对软偏向第二可能的泡利符号、重新计算相关的本地BP消息并将被访问的变量节点设置为该次优符号而得到。候选轨迹通过我们提出的累积路径度量进行排序和修剪。最终的解码器将学习顺序调度带来的收敛性改进与列表探索相结合，从而扩展了已学习的解码器。在去极化信道上对代表性QLDPC基准码的数值结果显示，我们提出的方法提升了底层解码器的解码性能，并且与现有基于BP的解码方法相比具有优势。

Heterogeneous Policy Networks for Composite Robot Team Communication and Coordination

复合机器人团队通信与协调的异构政策网络

Authors: Esmaeil Seraj, Rohan Paleja, Luis Pimentel, Kin Man Lee, Zheyuan Wang, Daniel Martin, Matthew Sklar, John Zhang, Zahi Kakish, Matthew Gombolay
Subjects: Robotics (cs.RO); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2606.20962
Pdf link: https://arxiv.org/pdf/2606.20962
Abstract High-performing human-human teams learn intelligent and efficient communication and coordination strategies to maximize their joint utility. These teams implicitly understand the different roles of heterogeneous team members and adapt their communication protocols accordingly. Multi-Agent Reinforcement Learning (MARL) has attempted to develop computational methods for synthesizing such joint coordination-communication strategies, but emulating heterogeneous communication patterns across agents with different state, action, and observation spaces has remained a challenge. Without properly modeling agent heterogeneity, as in prior MARL work that leverages homogeneous graph networks, communication becomes less helpful and can even deteriorate the team's performance. In the past, we proposed Heterogeneous Policy Networks (HetNet) to learn efficient and diverse communication models for coordinating cooperative heterogeneous teams. In this extended work, we extend HetNet to support scaling heterogeneous robot teams. Building on heterogeneous graph-attention networks, we show that HetNet not only facilitates learning heterogeneous collaborative policies but also enables end-to-end training for learning highly efficient binarized messaging. Our empirical evaluation shows that HetNet sets a new state of the art in learning coordination and communication strategies for heterogeneous multi-agent teams by achieving a 5.84% to 707.65% performance improvement over the next-best baseline across multiple domains while simultaneously achieving a 200x reduction in the required communication bandwidth.
中文摘要 高绩效的人人团队学习智能高效的沟通与协调策略，以最大化其联合效用。这些团队隐含地理解异构团队成员的不同角色，并据此调整通信协议。多智能体强化学习（MARL）尝试开发用于综合此类联合协调-通信策略的计算方法，但模拟不同状态、动作和观察空间的智能体间异构通信模式仍是一大挑战。如果不像之前利用同质图网络的MARL工作那样，正确建模代理异构性，沟通就会变得不那么有用，甚至可能降低团队绩效。过去，我们提出了异构策略网络（HetNet）以学习高效且多样化的沟通模型，以协调合作的异构团队。在本扩展工作中，我们扩展了异构策略网络（HetNet）以支持异构机器人团队的扩展。基于异构图关注网络，我们展示了HetNet不仅促进了异构协作策略的学习，还支持了端到端的高效二元化消息学习训练。我们的实证评估显示，HetNet在异质多智能体团队的学习协调与通信策略方面树立了新潮流，在多个领域相比次优基线提升了5.84%至707.65%，同时实现了所需通信带宽减少200倍的水平。

Formalizing Task-Space Complexity for Zero-Shot Generalization

零射推广任务空间复杂性的形式化

Authors: Jung-Hoon Cho, Heling Zhang, Siqi Du, Roy Dong, Cathy Wu
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2606.20967
Pdf link: https://arxiv.org/pdf/2606.20967
Abstract Policies must operate across diverse conditions, yet a single policy is often conservative while fully adaptive schemes can be complex. We study zero-shot generalization in contextual dynamical systems and introduce a performance-centric, directional task dissimilarity--the signed divergence--that upper bounds the generalization gap from a source context to a target context. The signed divergence induces $\varepsilon$-tolerance sets that certify when a source policy class generalizes, and it yields a concrete notion of task-space complexity: the minimum number of source contexts needed so that every target context incurs at most $\varepsilon$ generalization gap. Under a mild local smoothness assumption on performance, the induced tolerance sets admit certified inner/outer balls and instance-dependent volume bounds on task-space complexity. In the finite-oracle setting, source selection reduces to set cover; a greedy strategy inherits the standard $H(n)$ approximation guarantee. Using a Mass-Spring-Damper system with linear-quadratic regulator (LQR) controllers and a nonlinear CartPole system with deep reinforcement learning controllers, we show that greedy selection achieves the same $\varepsilon$-coverage with fewer policies than uniform or random baselines. Our approach delivers a performance-based task similarity measure and practical certificates for building generalizable control with simple policies.
中文摘要 政策必须在多样条件下运作，但单一政策往往较为保守，而完全适应性的方案则可能复杂。我们研究上下文动力系统中的零样本推广，并引入了以性能为中心的方向性任务异干性——带符号散度——该差异限制了从源上下文到目标上下文的推广差距上界。带符号的发散诱导出了$\varepsilon$容忍集，这些集合在源策略类推广时会证明，并提供了具体的任务空间复杂性概念：每个目标上下文最多都存在$\varepsilon$泛化差距所需的最小数量。在性能的轻度局部平滑假设下，诱导容差集允许经认证的内球/外球和任务空间复杂度的实例依赖体积界限。在有限预言机设定中，源选择归结为集合覆盖;贪婪策略继承了标准的$H（n）$近似保证。利用带有线性二次调节器（LQR）控制器的质量弹簧阻尼器系统和深度强化学习控制器的非线性CartPole系统，我们证明贪婪选择能以更少策略实现相同的$\varepsilon覆盖率，而比均匀或随机基线更少。我们的方法提供了基于性能的任务相似度衡量和实用证书，用于用简单策略构建可通用控制。
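
The reduction of source selection to set cover can be illustrated with a minimal sketch. It assumes the coverage sets are already known: `covers[s]` is taken to hold the target contexts whose generalization gap from source `s` is at most ε. The contexts, the sets, and the function name are made up for illustration, and computing the sets from the signed divergence is not shown.

```python
def greedy_source_selection(covers, targets):
    """Greedy set cover: repeatedly pick the source that covers the most
    still-uncovered target contexts."""
    uncovered = set(targets)
    chosen = []
    while uncovered:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(covers), key=lambda s: len(covers[s] & uncovered))
        gained = covers[best] & uncovered
        if not gained:
            raise ValueError("some targets cannot be covered by any source")
        chosen.append(best)
        uncovered -= gained
    return chosen


# Hypothetical coverage sets: source context -> targets within tolerance.
covers = {
    "a": {1, 2, 3},
    "b": {3, 4},
    "c": {4, 5, 6},
    "d": {1, 6},
}
selected = greedy_source_selection(covers, targets=range(1, 7))
print(selected)
```

Here two sources suffice to cover all six targets. The greedy rule is the standard one whose cover size is within a factor of $H(n)$ of the optimum.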

CogniRoute: Learning to Route Social Evidence in Omni-Modal Models

CogniRoute：学习在全模态模型中路由社会证据

Authors: Yifan Shen, Pei Tian, Xinzhuo Li, Bowen Fang, Shujun Xia, Bingxuan Li, Ana Jojic, Wenming Ye, Xu Cao, James Matthew Rehg, Ismini Lourentzou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.20970
Pdf link: https://arxiv.org/pdf/2606.20970
Abstract Omni-modal models can ingest video, audio, and text, but unified access to multiple modalities does not guarantee that a model uses the right evidence. This gap is especially pronounced in social video question answering, where the answer may hinge on a gesture, vocal tone, temporal cue, or mismatch between what is said and what is visually expressed. We introduce CogniRoute, a schema-guided Mixture-of-Experts framework for social omni reasoning. CogniRoute uses a training-only cognitive schema that factorizes each example by cross-modal relation, reasoning demand, and temporal scope, and aligns global routing signatures with this structure during supervised fine-tuning. We further introduce route-aware reinforcement learning, which jointly optimizes token generation and expert allocation using rewards for answer correctness, modality-consistent reasoning, and cognitive temporal grounding. To support training and evaluation, we construct OmniSocialBench, a diagnostic social video QA resource with 118K structured training examples, grounded reasoning traces, schema labels, temporal evidence spans, and a manually verified evaluation split. CogniRoute achieves 59.38% average accuracy on OmniSocialBench, improving over the strongest proprietary baseline by 15.33 percentage points and the strongest open-source omni baseline by 26.77 points, with the largest gains on questions requiring audio-visual coordination, conflict resolution, and temporally grounded social inference.
中文摘要 全模态模型可以接收视频、音频和文本，但统一访问多种模态并不能保证模型使用正确的证据。这种差距在社交视频问答中尤为明显，答案可能取决于手势、语调、时间线索，或说话内容与视觉表达的不匹配。我们介绍了CogniRoute，一个基于模式引导的专家混合框架，用于社会全向推理。CogniRoute 采用仅训练的认知模式，通过跨模态关系、推理需求和时间范围对每个示例进行分解，并在监督微调过程中将全局路由签名与该结构对齐。我们进一步引入了路径感知强化学习，结合通过奖励来优化令牌生成和专家分配，奖励答案正确性、模态一致性推理和认知时间基础。为支持培训和评估，我们构建了OmniSocialBench，一个诊断性社交视频质量保证资源，包含11.8万条结构化培训示例、基于推理的痕迹、模式标签、时间证据范围以及人工验证的评估分割。CogniRoute在OmniSocialBench上的平均准确率为59.38%，较最强专有基线提升15.33个百分点，较最强的开源全向基线提升26.77个百分点，在需要视听协调、冲突解决和时间基础社会推断的问题上提升最大。

Sim2O: Efficient Offline-to-Online MARL via Joint Action Composition

Sim2O：通过联合行动组合实现高效的离线到在线MARL

Authors: Bingchang Song, Yiqin Yang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.21085
Pdf link: https://arxiv.org/pdf/2606.21085
Abstract Offline-to-online adaptation serves as a pivotal paradigm for mitigating the prohibitive cost of online exploration by bootstrapping reinforcement learning from offline datasets. While this paradigm has been extensively studied in single-agent settings, its extension to Multi-Agent Reinforcement Learning (MARL) remains largely unexplored, despite its critical relevance to complex coordinated decision-making. To bridge this gap, we introduce Sim2O, an elegant and minimalist framework for offline-to-online MARL. Rather than treating adaptation as a monolithic joint decision, Sim2O conceptualizes it as a compositional process. Specifically, candidate joint actions are synthesized by dynamically blending offline and online action proposals across agents. By leveraging a centralized value function to evaluate these hybrid combinations, Sim2O identifies high-value coordination strategies without requiring auxiliary training objectives or structural overhead. Empirical evaluations across diverse benchmarks demonstrate that Sim2O significantly outperforms existing baselines, underscoring that a minimalist design is not only viable but highly effective for multi-agent offline-to-online adaptation.
中文摘要 离线到在线的适应是通过从离线数据集自助强化学习，降低在线探索高昂成本的关键范式。尽管该范式在单智能体环境中已被广泛研究，但其在多智能体强化学习（MARL）中的扩展仍然鲜有深入探讨，尽管其对复杂协调决策具有关键意义。为了弥合这一差距，我们引入了Sim2O，一个优雅且极简的离线到在线MARL框架。Sim2O不再将适应视为一个整体的联合决策，而是将其概念化为一种组合过程。具体来说，候选联合行动通过动态混合线下和在线行动提案在各代理间进行综合。通过利用集中价值函数评估这些混合组合，Sim2O能够识别高价值的协调策略，而无需辅助训练目标或结构性开销。跨多个基准测试的实证评估表明，Sim2O显著优于现有基线，强调极简设计不仅可行，而且极为有效，适合多智能体离线到在线的适应。
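
The idea of composing joint actions from per-agent offline and online proposals and ranking them with a centralized value function can be sketched as follows. This is only an illustration under stated assumptions: it enumerates all per-agent choices exhaustively (exponential in the number of agents), and `toy_critic`, the state, and the proposals are hypothetical stand-ins rather than anything taken from Sim2O.

```python
from itertools import product


def compose_joint_action(offline_actions, online_actions, critic, state):
    """Try every per-agent mix of offline/online proposals and keep the
    joint action that the centralized critic scores highest."""
    best_value, best_joint = float("-inf"), None
    for mask in product((0, 1), repeat=len(offline_actions)):
        # mask[i] == 1 takes agent i's online proposal, 0 its offline one.
        joint = tuple(
            on if m else off
            for m, off, on in zip(mask, offline_actions, online_actions)
        )
        value = critic(state, joint)
        if value > best_value:
            best_value, best_joint = value, joint
    return best_joint, best_value


def toy_critic(state, joint_action):
    # Hypothetical centralized value: best when each agent's action
    # matches the corresponding state entry.
    return -sum((a - s) ** 2 for a, s in zip(joint_action, state))


state = (1.0, 0.0, 2.0)
offline = (1.0, 1.0, 0.0)   # proposals from the offline policies (made up)
online = (0.0, 0.0, 2.0)    # proposals from the online policies (made up)
joint, value = compose_joint_action(offline, online, toy_critic, state)
print(joint, value)
```

In this toy case the best joint action takes the offline proposal for the first agent and the online proposals for the other two, a combination neither policy set would produce on its own.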

Horizon Adaptive Offline Policy Learning via Value Stitching

地平线自适应离线策略学习，通过价值拼接

Authors: Kexin Zheng, Xianyuan Zhan, Xintao Yan
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.21136
Pdf link: https://arxiv.org/pdf/2606.21136
Abstract Learning accurate value functions plays a decisive role for reinforcement learning (RL) agents to solve long-horizon, complex tasks. Conventional temporal-difference (TD) learning objectives suffer from value-estimation bias that accumulates over the horizon, while extended-horizon modeling methods, such as n-step TD backups and Q-chunking, adopt a rigid, fixed-horizon value-modeling recipe that is often not flexible enough to capture complex value structures in long-horizon, multi-stage tasks. In this paper, we show that enabling value updates with dynamic horizon composition can yield a strong offline policy learning scheme. Our method, Horizon Adaptive Offline Policy Learning via VAlue STitching (VAST), replaces fixed-horizon backups with recursive, horizon-adaptive value composition. Its key ingredient is to couple value optimization with a future state- and horizon-length-conditioned auxiliary value function that is learned through direct data supervision, and a stitching policy that optimally selects the reward-maximizing horizon length and future sub-goal to achieve horizon-adaptive value stitching. This design enables direct estimation and compositional "stitching" of variable-length returns grounded in actionable sub-goal states, providing an accurate and greedily exploitable value-supervision signal for offline policy optimization. Across 50 tasks on OGBench, VAST outperforms fixed-step, extended-horizon methods, and generative-value offline RL baselines, achieving strong performance particularly in high-complexity, long-horizon decision-making tasks.
中文摘要 学习准确的价值函数对于强化学习（RL）代理解决长期复杂任务起着决定性作用。传统的时间差分（TD）学习目标存在随着视距积累的价值估计偏差，而延展视野建模方法，如n步TD备份和Q分块，采用僵化的固定视野价值建模方案，通常不够灵活，难以捕捉长视野多阶段任务中的复杂价值结构。本文展示了通过动态视野组合实现价值更新，可以带来强有力的离线政策学习方案。我们的方法——通过VAlue拼接实现的地平线自适应离线政策学习（VAST），用递归的地平线自适应值组合取代固定地平线备份。其关键要素是将价值优化与通过直接数据监督学习的未来状态和视界长度条件辅助值函数结合起来，以及一种最优选择奖励最大化视野长度和未来子目标以实现视界自适应价值拼接的拼接策略。该设计能够直接估计并组合“拼接”基于可操作子目标状态的可变长度收益，提供准确且可被贪婪利用的价值监管信号，用于离线策略优化。在OGBench的50个任务中，VAST优于固定步长、延伸视野方法和生成价值离线强化学习基线，尤其在高复杂度、长视野的决策任务中表现突出。

Pose-Agnostic Robotic Functional Grasping via Observation-Action Canonicalization

通过观察-动作规范化实现无相态机器人功能抓取

Authors: Le Qiu, Cole Harrison, Jiankai Sun, Yao Liu, Suning Huang, Qianzhong Chen, Yang You, Marco Pavone
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.21148
Pdf link: https://arxiv.org/pdf/2606.21148
Abstract Functional robotic grasping requires a policy that generalizes across diverse object geometries and poses while maintaining task-specific contact precision. We study this challenge through mug-handle grasping, where thin handles, instance variation, and upright or inverted placements make both perception and control sensitive to object configuration. Grasp pose detection methods operate open-loop and are sensitive to estimation errors on thin handle structures. Learned visuomotor policies must implicitly learn to handle the coupled variation in visual appearance and action direction induced by different object placements, limiting generalization. We propose AnyMug, a canonicalized visuomotor reinforcement learning framework for functional grasping that trains a single closed-loop policy entirely in simulation and deploys it zero-shot on a real robot. AnyMug introduces observation-action canonicalization, which transforms both the depth observation and the predicted end-effector action into a shared object-centric frame. The policy therefore sees a consistent mug-centered view and emits actions in a canonical direction regardless of mug placement, allowing the same grasping behavior to be reused across configurations. A handle-aware reward further encourages precise approach, gripper alignment, and opposing-finger placement, while a pose curriculum and domain randomization improve training stability and sim-to-real transfer. In simulation, AnyMug achieves over 93% success rate on both unseen upright and inverted mugs and transfers zero-shot to a real Franka Panda, reaching 80% success rate on 5 held-out physical mugs across both pose categories.
中文摘要 功能性机器人抓取需要一种策略，能够在不同物体几何和姿态中泛化，同时保持任务特定的接触精度。我们通过抓握杯柄来研究这一挑战，细把手、实例变化以及直立或倒置的位置使得感知和控制对物体配置敏感。抓握位势检测方法采用开环运行，对薄手柄结构上的估计误差非常敏感。习得的视觉运动策略必须隐含地学会处理不同物体放置引起的视觉外观和动作方向耦合变化，限制泛化。我们提出了AnyMug，一种标准化的视觉运动强化学习框架，用于功能抓取，完全在模拟中训练单一闭环策略，并在真实机器人上零次部署。AnyMug引入了观察-动作规范化，将深度观察和预测的终点执行器作用转换为共享的以对象为中心的框架。因此，该策略保持一致的以杯子为中心的视角，并无论杯子放置位置如何，都会朝规范方向发送动作，允许在不同配置中重复使用相同的抓取行为。手柄感知奖励进一步鼓励精准的操作、握把对齐和对指位置，而姿式课程和领域随机化则提升训练稳定性和模拟到现实的转移。在模拟中，AnyMug在看不见的直立和倒立杯上均有超过93%的成功率，并将零射击转赠给真实的Franka Panda，在两个姿势类别中，5个手持的实体杯子成功率均达到80%。
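
Observation-action canonicalization can be illustrated with a planar toy example. The sketch below assumes a 2D world, a known object position and yaw, and a hand-written stand-in policy. All of these are simplifications introduced here: the paper works with depth observations and end-effector actions, and none of the function names come from it.

```python
import math


def to_object_frame(point_xy, object_xy, object_yaw):
    """Express a world-frame point in the object-centric (canonical) frame."""
    dx, dy = point_xy[0] - object_xy[0], point_xy[1] - object_xy[1]
    c, s = math.cos(-object_yaw), math.sin(-object_yaw)
    return (c * dx - s * dy, s * dx + c * dy)


def action_to_world_frame(canonical_action_xy, object_yaw):
    """Rotate a canonical-frame displacement back into the world frame."""
    c, s = math.cos(object_yaw), math.sin(object_yaw)
    ax, ay = canonical_action_xy
    return (c * ax - s * ay, s * ax + c * ay)


def canonical_policy(gripper_in_object_frame):
    # Hypothetical stand-in policy: move straight to a handle assumed to
    # sit at (0.1, 0.0) in the object frame.
    handle = (0.1, 0.0)
    return (handle[0] - gripper_in_object_frame[0],
            handle[1] - gripper_in_object_frame[1])


def step(gripper_xy, object_xy, object_yaw):
    obs = to_object_frame(gripper_xy, object_xy, object_yaw)
    move = action_to_world_frame(canonical_policy(obs), object_yaw)
    return obs, (gripper_xy[0] + move[0], gripper_xy[1] + move[1])


# Same relative configuration, two different mug placements.
obs_a, new_a = step((0.3, 0.0), (0.0, 0.0), 0.0)
obs_b, new_b = step((1.0, 2.3), (1.0, 2.0), math.pi / 2)
print(obs_a, obs_b, new_a, new_b)
```

Both placements give the policy the same canonical observation, while the world-frame motions differ by the object's rotation, so one behavior is reused across configurations.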

Inverting the Bellman Equation: From $Q$-Values to World Models

反转贝尔曼方程：从$Q$值到世界模型

Authors: Alistair Letcher, Mattie Fellows, Alexander D. Goldie, Jonathan Richens, Jakob N. Foerster, Oliver Richardson
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.21173
Pdf link: https://arxiv.org/pdf/2606.21173
Abstract Model-based and model-free reinforcement learning are traditionally viewed as separate paradigms: instead of learning a model of the transition kernel $P$, model-free agents typically estimate value functions tied to a specific policy and reward. In this paper, we challenge this dichotomy by proving that value-based agents trained on a sufficiently rich set of reward functions, e.g. using goal-conditioned RL, implicitly encode a unique and accurate world model. To extract this model in practice, we introduce $P$-learning, an inverse analogue to $Q$-learning that samples from an agent's $Q$-values, policies and rewards to decode its internal model of the environment. We then provide sufficient conditions on the type and number of goals for which agents encode the true kernel $P$, covering both stochastic and deterministic MDPs over finite or continuous state spaces. Even when our assumptions are violated, we empirically demonstrate that agents trained on a handful of reward functions encode accurate dynamics in Reacher, MountainCar and stochastic variants of FourRooms. Surprisingly, we find that policies trained exclusively on a Reacher agent's implicit world model are quasi-optimal on out-of-distribution, velocity-based goals despite position-only training, suggesting that agents contain hidden generalisation capabilities and providing a new lens into the connection between model-based, model-free, and goal-conditioned RL.
中文摘要 基于模型和无模型的强化学习传统上被视为不同的范式：无模型智能体通常估算与特定策略和奖励相关的价值函数，而不是学习过渡核$P$的模型。本文通过证明基于价值的智能体在足够丰富的奖励函数集上训练（例如使用目标条件强化学习）隐式编码了唯一且准确的世界模型，从而挑战了这一二分法。为了在实际中提取该模型，我们引入了 \textit{$P$-learning}，这是一种与$Q$-learning相反的对比，它从代理的$Q$值、策略和奖励中抽样，解码其内部环境模型。随后，我们为智能体编码真实核$P$的目标类型和数量提供了充分条件，涵盖有限或连续状态空间上的随机和确定性MDP。即使我们的假设被打破，我们通过实证证明，训练于少数奖励函数的智能体在$\texttt{Reacher}$、$\texttt{MountainCar}$及随机变体$\texttt{FourRooms}$中编码了准确的动态。令人惊讶的是，我们发现，仅基于\texttt{Reacher}智能体隐式世界模型训练的策略，尽管仅训练位置，但在非分布、基于速度的目标上仍是准最优的——这表明智能体具备隐藏的泛化能力，并为理解基于模型、无模型和目标条件的强化学习之间的联系提供了新的视角。
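
Why $Q$-values for several rewards can pin down the transition kernel follows from the Bellman equation $Q_k(s,a) = r_k(s,a) + \gamma \sum_{s'} P(s'|s,a) V_k(s')$: for a fixed $(s,a)$ this is linear in $P(\cdot|s,a)$. The sketch below is an exact tabular illustration on a made-up two-state MDP with one fixed uniform policy and full access to the tables. It is not the sample-based $P$-learning procedure, and every number and name in it is an assumption.

```python
GAMMA = 0.9
STATES, ACTIONS = (0, 1), (0, 1)

# Ground-truth kernel, P[s][a] = (P(0|s,a), P(1|s,a)); used only to
# generate the Q-values and to check the recovered result.
P = {0: {0: (0.8, 0.2), 1: (0.3, 0.7)},
     1: {0: (0.5, 0.5), 1: (0.1, 0.9)}}

# Two reward functions: reward k pays 1 for being in state k.
REWARDS = [lambda s, a, k=k: 1.0 if s == k else 0.0 for k in STATES]


def evaluate_uniform_policy(reward, sweeps=2000):
    """Iterative policy evaluation of Q and V under the uniform policy."""
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(sweeps):
        v = {s: sum(q[s, a] for a in ACTIONS) / len(ACTIONS) for s in STATES}
        q = {(s, a): reward(s, a)
             + GAMMA * sum(p * v[t] for t, p in zip(STATES, P[s][a]))
             for s in STATES for a in ACTIONS}
    v = {s: sum(q[s, a] for a in ACTIONS) / len(ACTIONS) for s in STATES}
    return q, v


def recover_kernel(qs, vs):
    """Solve the 2x2 linear system per (s, a) with Cramer's rule."""
    (a00, a01), (a10, a11) = [(v[0], v[1]) for v in vs]
    det = a00 * a11 - a01 * a10
    p_hat = {}
    for s in STATES:
        for a in ACTIONS:
            b0, b1 = [(q[s, a] - r(s, a)) / GAMMA
                      for q, r in zip(qs, REWARDS)]
            p_hat[s, a] = ((b0 * a11 - a01 * b1) / det,
                           (a00 * b1 - a10 * b0) / det)
    return p_hat


results = [evaluate_uniform_policy(r) for r in REWARDS]
p_hat = recover_kernel([q for q, _ in results], [v for _, v in results])
print(p_hat)
```

With two states, two reward functions whose value vectors are linearly independent are enough to recover each row of the kernel.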

Sakana Fugu Technical Report

佐贺奈富古技术报告

Authors: Yujin Tang, Edoardo Cetin, Jinglue Xu, Qi Sun, Stefan Nielsen, Vincent Richard, Haruto Goda, Iaroslav Tymchenko, Nhan Nguyen, Hyunin Lee, Mari Ashiga, Shashank Kotyan, So Kuroki, Tarin Clanuwat
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.21228
Pdf link: https://arxiv.org/pdf/2606.21228
Abstract The capabilities of frontier Large Language Models (LLMs) continue to advance, with different providers increasingly specializing in distinct domains. This raises a natural next objective: how to combine the individual specializations of various LLMs into a collectively intelligent system. To this end, we report the development of Sakana Fugu, a family of orchestrator models that harness and amplify the capabilities of an LLM agent team. Fugu models are themselves language models trained to understand user queries and dynamically devise agentic scaffolds to solve them. Through these adaptive scaffolds, Fugu accesses performance beyond any individual LLM agent, achieving state-of-the-art results compared to other publicly accessible models across a range of challenging tasks, including SWE-Bench Pro, Terminal Bench, LiveCodeBench, GPQA-Diamond, Humanity's Last Exam, and CharXiv Reasoning. We release two models: Fugu, which balances performance with latency for everyday use, and Fugu-Ultra, which prioritizes answer quality on the hardest problems. We describe our training paradigm, which encompasses large-scale fine-tuning, evolutionary algorithms, and reinforcement learning approaches, along with the infrastructure and core design principles that turn these methods into a production system. We hope this report encourages further research into multi-agent systems and dynamic, query-adaptive agentic scaffolds as a path toward the next frontier of AI capabilities, accessed through collective intelligence.
中文摘要 前沿大型语言模型（LLMs）的能力不断进步，不同服务提供商越来越专注于不同领域。这自然引出了一个下一个目标：如何将各种大型语言模型的各个专业化结合成一个集体智能系统。为此，我们报告了Sakana Fugu的开发，这是一系列编排模型，能够利用并放大LLM代理团队的能力。Fugu模型本身是训练用来理解用户查询并动态设计智能支架以解决这些问题的语言模型。通过这些自适应支架，Fugu 能够超越任何单个 LLM 代理的性能，在一系列具有挑战性的任务中，如 SWE-Bench Pro、Terminal Bench、LiveCodeBench、GPQA-Diamond、Humanity's Last Exam 和 CharXiv 推理，实现了与其他公开模型相比最先进的结果。我们发布了两款型号：Fugu，在日常使用中平衡性能与延迟;以及Fugu-Ultra，优先考虑最难题目的答案质量。我们描述了我们的训练范式，涵盖大规模微调、进化算法和强化学习方法，以及将这些方法转化为生产系统的基础设施和核心设计原则。我们希望本报告能鼓励对多智能体系统和动态、查询自适应智能支架的进一步研究，作为迈向通过集体智能进入人工智能能力下一个前沿的道路。

ARCO: Adaptive Rubric with Co-Evolution for Multi-Step LLM-Based Agents

ARCO：多步LLM代理的自适应评分标准与共进化

Authors: Zihang Tian, Jingsen Zhang, Rui Li, Xiaohe Bo, Yuanzi Li, Xu Chen
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.21262
Pdf link: https://arxiv.org/pdf/2606.21262
Abstract Reinforcement learning for multi-step LLM agents often relies on scalar rewards that indicate success but cannot explain why a trajectory is good or bad. Rubric-based rewards improve interpretability through natural-language criteria, but existing methods score at the trajectory level and freeze the scorer behind a closed-source judge, leaving step-level credit assignment unresolved and the judge itself static. We propose ARCO (Adaptive Rubric CO-evolution), a rubric framework in which a same-scale model $\mu$ shares a backbone with two heads: a generation head that produces per-step criteria, and a score head that predicts rubric-conditioned step-level rewards. A trajectory decomposition constraint ties the sum of step rewards to the terminal outcome, enabling credit assignment without step-level labels, while $\mu$ and the policy $\pi$ are jointly updated on on-policy data so that the rubric content and the scoring function co-evolve at the parameter level. Across HotpotQA, 2WikiMultiHopQA, and MuSiQue with two open-source backbones, ARCO improves the best EM in every setting over strong outcome-, rubric-, and process-reward baselines, and analyses show that its rubrics are step-specific, robust to design choices, and useful for diagnosing agent behavior. Code and data are available at this https URL.
中文摘要 多步LLM智能体的强化学习通常依赖标量奖励，这些奖励表明成功，但无法解释轨迹为何好坏。基于评分标准的奖励通过自然语言标准提升了可理解性，但现有方法仅在轨迹层面评分，评分者被封闭源判定，导致步骤级学分分配未解决，评委本身保持静止。我们提出了ARCO（自适应评分标准共演化），这是一个评分标准框架，其中同尺度模型$\mu$共享一个主干，有两个头：一个生成每步标准的生成头，另一个预测评分标准条件的步骤级奖励。轨迹分解约束将步数奖励的总和与最终结果绑定，使得无需步骤级标签即可分配学分，而$\mu$和策略$\pi$则在政策上同步更新，使评分标准内容和评分函数在参数层面协同演化。在HotpotQA、2WikiMultiHopQA和MuSiQue，并结合两个开源骨干，ARCO在每种环境中都优于强的结果、评分标准和过程奖励基线，分析显示其评分标准具有步骤特异性，对设计选择具有鲁棒性，且有助于诊断代理行为。代码和数据可在该 https URL 获取。

Reward-free Pretraining for Reinforcement Learning via Occupancy Coverage Maximization

通过占用覆盖最大化实现的无奖励强化学习预训练

Authors: Marco Pratticò, Pietro Novelli, Massimiliano Pontil, Carlo Ciliberto
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.21271
Pdf link: https://arxiv.org/pdf/2606.21271
Abstract Sparse rewards pose a central challenge in reinforcement learning, since agents receive no informative signal until they reach their goal. Intrinsic-reward methods address this issue by optimizing non-stationary objectives such as novelty, prediction error, or skill diversity, thereby injecting a supervision signal into the problem. While effective, these methods often require that the extrinsic (sparse) reward can be evaluated -- either online or during offline relabeling of the stored transitions. This limitation is particularly vexing for multi-task, meta-, and continual reinforcement learning, where agents' interactions with the environment are usually reward-free. In this work, we present a method to pre-train transferable exploration policies that rapidly adapt to sparse rewards at downstream task time. Our objective maximizes state-space covering for the occupancy measure, and can be framed in terms of entropy maximization. Its algorithmic implementation, ROVER, leverages recent advances on the operatorial formulation of RL to estimate occupancy with a learned resolvent world model, bypassing common hurdles associated with density and entropy estimation. ROVER further introduces a virtual "sink" state for unexplored regions, balancing coverage of known states with expansion into unseen ones and preventing cyclic expansion-collapse behavior during learning. In tabular and pixel-based sparse navigation tasks, ROVER produces more uniform aggregate coverage and stronger initializations for downstream tasks than standard reward-free baselines.
中文摘要 稀疏奖励是强化学习中的核心挑战，因为代理在达到目标之前不会收到任何信息信号。内在奖励方法通过优化非平稳目标（如新颖性、预测误差或技能多样性）来解决这个问题，从而向问题注入监督信号。虽然有效，但这些方法通常需要评估外在（稀疏）奖励——无论是在线还是离线重新标记存储的转移时。这一限制对多任务、元和持续强化学习尤为棘手，因为在这些环境中，智能体与环境的互动通常没有奖励。本研究提出一种预训练可转移探索策略的方法，这些策略能快速适应下游任务时间的稀疏奖励。我们的目标是最大化占用度量的状态空间覆盖，并可用熵最大化来表述。其算法实现ROVER利用了RL操作式表述的最新进展，利用学习中的解决式世界模型估算占用率，绕过了密度和熵估计的常见障碍。ROVER进一步引入了未探索区域的虚拟“汇”状态，平衡已知状态的覆盖与向未知状态的扩展，并防止学习过程中的循环膨胀-坍缩行为。在表格和基于像素的稀疏导航任务中，ROVER比标准无奖励基线更能实现更均匀的总覆盖率和更强的下游任务初始化。

NASDAQ: Normalized Observation Space Dynamics-Augmented Q-Learning

纳斯达克：规范化观测空间动力学增强Q-学习

Authors: Xinwei Liu (1), Junyuan Liang (1), Zicong Hong (2), Jianting Zhang (3), Wuhui Chen (1) ((1) Sun Yat-sen University, China, (2) EPFL, Switzerland, (3) Purdue University, USA)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.21297
Pdf link: https://arxiv.org/pdf/2606.21297
Abstract Augmenting model-free reinforcement learning (RL) with representations learned through observation dynamics prediction (observation-predictive RL) can improve sample efficiency and performance, with minor modifications and limited additional computation. However, this approach still struggles in challenging tasks with low-dimensional observations. In this paper, we identify a key factor behind this problem: unbalanced reconstruction losses across observation dimensions, where dimensions with larger value ranges dominate the loss. This encourages the agent to neglect dimensions with relatively small ranges, leading to degraded performance. To address this issue, we propose a novel normalization method tailored to online RL, which normalizes low-dimensional observations and balances the resulting losses and gradients. Beyond balancing reconstruction losses, observation normalization enables dynamics prediction to be performed in a normalized observation space, thereby providing a unified treatment of low- and high-dimensional inputs (e.g., physical states and images). Building on this idea, we further introduce Normalized Observation Space Dynamics-Augmented Q-learning (NASDAQ), a framework for observation-predictive RL applicable across diverse domains. NASDAQ learns state-action representations by coupling value learning with two auxiliary tasks: short-term value prediction and next normalized observation prediction. Extensive experiments demonstrate that NASDAQ achieves competitive or superior performance compared with state-of-the-art model-based and self-predictive RL methods, while requiring significantly less training wall-time.
中文摘要 通过观察动力学预测（observation-predictive RL）来增强无模型强化学习（RL），可以通过少量修改和有限的额外计算，提升样本效率和性能。然而，这种方法在低维观测的挑战性任务中仍然存在困难。本文指出，我们识别出一个关键因素：观测维度上的重建损失不平衡，其中值区间较大的维度主导损失。这促使代理忽视范围较小的维度，导致性能下降。为解决这一问题，我们提出了一种针对在线强化学习的新归一化方法，能够归一化低维观测并平衡由此产生的损失和梯度。除了平衡重建损失外，观测归一化使动态预测能够在归一化的观测空间中进行，从而统一处理低维和高维输入（如物理状态和图像）。基于这一理念，我们进一步引入了规范化观测空间动力学增强Q-learning（NASDAQ），这是一个适用于多个领域的观测预测强化学习框架。纳斯达克通过将价值学习与两个辅助任务——短期值预测和下一次归一化观测预测——结合来学习状态-动作表示。大量实验表明，纳斯达克在与最先进的基于模型和自我预测的强化学习方法相比，在实现竞争或更优的性能上，同时所需的壁垒训练时间显著更少。
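
The imbalance described above, where dimensions with large value ranges dominate a reconstruction loss, can be seen in a small numeric sketch. It uses a generic running mean/variance normalizer (Welford's update), which is an assumption made here for illustration and not the normalization method proposed in the paper. The observations are made up.

```python
class RunningNormalizer:
    """Per-dimension running mean/variance (Welford's algorithm)."""

    def __init__(self, dim, eps=1e-8):
        self.count = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim   # running sum of squared deviations
        self.eps = eps

    def update(self, obs):
        self.count += 1
        for i, x in enumerate(obs):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (x - self.mean[i])

    def normalize(self, obs):
        out = []
        for i, x in enumerate(obs):
            var = self.m2[i] / max(self.count, 1)
            out.append((x - self.mean[i]) / (var + self.eps) ** 0.5)
        return out


def per_dim_squared_error(pred, target):
    return [(p - t) ** 2 for p, t in zip(pred, target)]


# Dimension 0 spans about +/-1000, dimension 1 about +/-0.01.
observations = [(1000.0, 0.01), (-1000.0, -0.01),
                (500.0, 0.005), (-500.0, -0.005)]
normalizer = RunningNormalizer(dim=2)
for obs in observations:
    normalizer.update(obs)

target = observations[0]
pred = (0.9 * target[0], 0.9 * target[1])  # same 10% relative error per dim
raw_loss = per_dim_squared_error(pred, target)
norm_loss = per_dim_squared_error(normalizer.normalize(pred),
                                  normalizer.normalize(target))
print(raw_loss, norm_loss)
```

In raw space the first dimension's squared error is many orders of magnitude larger than the second's even though the relative error is identical. After normalization the two contributions are nearly equal.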

A Test-time Actor-Critic Approach to News Images Generation

一种测试时的演员-评论家新闻图像生成方法

Authors: Damianos Galanopoulos, Vasileios Mezaris
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.21304
Pdf link: https://arxiv.org/pdf/2606.21304
Abstract This paper introduces the CERTH-ITI solution for the MediaEval NewsImages 2026 challenge, which focuses on generating images related to news headlines. Inspired by the Actor-Critic paradigm in reinforcement learning, we present a test-time, model-agnostic Actor-Critic Image Generation approach (ACIG). ACIG generates prompts for image creation, produces the images, evaluates the generated results, and, if needed, refines the image generation prompts accordingly in a feedback loop. ACIG achieved the best results in the NewsImages 2026 challenge, according to the challenge's leaderboard.
中文摘要 本文介绍了CERTH-ITI解决方案，用于MediaEval NewsImages 2026挑战，重点是生成与新闻头条相关的图像。受强化学习中演员-批评者范式的启发，我们提出了一种测试时、模型无关的演员-批评者图像生成方法（ACIG）。ACIG生成图像生成提示，生成图像，评估生成结果，并在需要时通过反馈循环相应地优化图像生成提示。根据2026年NewsImages挑战赛的排行榜，ACIG取得了最佳成绩。

Objective-Behavior Alignment: Diagnostics for MORL Policy Selection

目标-行为对齐：MORL策略选择的诊断

Authors: Antonio Mone, Zuzanna Osika, Florian Felten, Pradeep K. Murukannaiah, Mark Fuge, Frans A. Oliehoek, Luciano Cavalcante Siebert
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.21321
Pdf link: https://arxiv.org/pdf/2606.21321
Abstract Real-world decision-making often requires optimizing multiple competing objectives simultaneously. In reinforcement learning (RL), this is typically addressed by combining reward signals into a single scalar objective via a scalarization function, which can be fragile: small changes in the weights can induce drastically different policies. Multi-objective reinforcement learning (MORL) instead produces sets of policies that explicitly represent trade-offs between objectives. However, these policies are typically presented to the decision maker only through their value vectors, which can obscure substantial behavioral variation: policies that induce distinct trajectories may appear indistinguishable when evaluated solely by expected returns. We propose an exploratory diagnostic workflow that automatically highlights behavioral variation along the Pareto front that objective values alone do not reveal, providing both quantitative and visual tools to support policy inspection. We validate our approach on simple grid examples and scale it to continuous control benchmarks, demonstrating that it remains effective as problem complexity increases.
中文摘要 现实世界的决策通常需要同时优化多个竞争目标。在强化学习（RL）中，通常通过将奖励信号合并为单一标量目标，通过标量化函数来解决，该函数可能较为脆弱：权重的微小变化可能引发截然不同的策略。多目标强化学习（MORL）则生成一套明确表示目标权衡的策略集。然而，这些政策通常仅通过价值向量呈现给决策者，这可能掩盖了显著的行为差异：仅以预期回报评估时，诱导不同轨迹的政策可能看起来难以区分。我们提出了一种探索性诊断工作流，能够自动突出帕累托前沿的行为变异，这些差异仅靠客观值无法揭示，同时提供定量和可视化工具支持政策检查。我们在简单的网格示例上验证方法，并将其扩展到连续控制基准，证明即使问题复杂度增加，该方法依然有效。
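
The fragility of linear scalarization mentioned above can be shown with a two-policy toy example. The policy names and value vectors are invented for illustration. The point is only that a weight shift of 0.02 flips the selected policy between two behaviorally opposite extremes.

```python
def best_policy(value_vectors, weights):
    """Pick the policy with the highest weighted sum of objective values."""
    def score(name):
        return sum(w * v for w, v in zip(weights, value_vectors[name]))
    # sorted() makes tie-breaking deterministic.
    return max(sorted(value_vectors), key=score)


# Hypothetical value vectors: (speed return, safety return).
value_vectors = {
    "fast": (10.0, 0.0),
    "safe": (0.0, 10.0),
}
choice_a = best_policy(value_vectors, (0.51, 0.49))
choice_b = best_policy(value_vectors, (0.49, 0.51))
print(choice_a, choice_b)
```

A multi-objective method would instead return both policies as a set and leave the trade-off to the decision maker. That is where inspecting behavior as well as value vectors becomes relevant.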

A Reward-Petri-Net Interpretation of Temporal Behavior Trees

奖励-彼得网对时间行为树的解释

Authors: Till Schmeil, Günther Waxenegger-Wilfing, Sebastian Schirmer
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.21350
Pdf link: https://arxiv.org/pdf/2606.21350
Abstract This paper introduces an interpretation of Temporal Behavior Trees (TBTs) as Reward-Petri-Nets (RPNs) for reinforcement learning (RL). Designing reward functions for complex, long-horizon robotic tasks is notoriously difficult, especially when tasks have hierarchical structure and temporal constraints. TBTs extend conventional behavior trees (BTs) used in robotic applications by incorporating temporal properties into their leaf nodes. This allows TBTs to represent not only the behavioral task structure defined by BT operators such as Sequence, Fallback, and Parallel, but also the task's temporal constraints. In this work, the constraints are specified in the leaf nodes using Linear Temporal Logic. To inform RL rewards using TBTs, we provide a translation from a TBT into a Petri Net (PN) and show how rewards can be automatically assigned based on the TBT's structure, resulting in an RPN. In a series of increasingly challenging environments, we demonstrate how TBT-based rewards enable learning where vanilla RL fails, improve sample efficiency, and offer flexible, intuitive control over the learning progress. We showcase the learning impact by using different reward distribution schemes and TBT structures.
中文摘要 本文介绍了将时间行为树（TBTs）解释为强化学习（RL）中的奖励-Petri-Net（RPN）。为复杂且长视野的机器人任务设计奖励函数以极为困难著称，尤其是当任务具有层级结构和时间限制时。TBT通过将时间属性融入叶节点，扩展了机器人应用中使用的传统行为树（BT）。这使得TBT不仅可以表示由BT算子如序列、退回和并行定义的行为任务结构，还能表示任务的时间约束。在本工作中，约束通过线性时间逻辑在叶节点中指定。为了利用TBT来指导强化学习奖励，我们提供了从TBT转换为Petri Net（PN）的方法，并展示了如何根据TBT结构自动分配奖励，从而生成RPN。在一系列日益具有挑战性的环境中，我们展示了基于TBT的奖励如何帮助实现原版强化学习失败的学习，提高样本效率，并提供灵活直观的学习进度控制。我们通过不同的奖励分配方案和TBT结构展示学习效果。

Long-Distance Real-World Navigation of the Legged-Wheeled Robot Go2-W Using Deep Reinforcement Learning

利用深度强化学习实现有腿轮机器人Go2-W的远程真实世界导航

Authors: Takaaki Matsuzawa, Kiyoshi Irie, Tomoaki Yoshida, Taro Suzuki, Yoshitaka Hara, Masahiro Tomono
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.21387
Pdf link: https://arxiv.org/pdf/2606.21387
Abstract Legged-wheeled robots have long been studied for their potential to combine the efficient flat-ground mobility of wheels with the rough-terrain capability of legs. However, examples of their application to long-range autonomous navigation in real environments remain limited. This paper reports our effort to build a deep reinforcement learning (DRL) based locomotion controller and an autonomous navigation system for the commercially available legged-wheeled robot Go2-W, and to apply them to long-range autonomous navigation in a real environment. For locomotion control, we extended a proprioception-only policy, which we had previously developed for quadruped robots, to the 16-DoF legged-wheeled robot. We also found that wheeled locomotion concentrates the load on the hip joints and causes heat concentration that hinders sustained travel, and obtained a policy that suppresses it by distributing the load. We evaluated the system at the Tsukuba Challenge 2025, demonstrating that it can autonomously traverse an approximately 2.8 km route including sidewalks, a park, and stairs without stopping due to overheating.
中文摘要 带腿轮机器人长期以来一直被研究，旨在将轮子高效的平地机动性与腿部的崎岖地形能力结合起来。然而，其在现实环境中远程自主导航中的应用实例仍然有限。本文报道了我们为商业化腿轮机器人Go2-W构建基于深度强化学习（DRL）的行走控制器和自主导航系统的工作，并将其应用于真实环境中的远程自主导航。在运动控制方面，我们将此前为四足机器人开发的仅本体感觉政策扩展到了16景深的腿轮机器人。我们还发现轮式运动会将负荷集中在髋关节上，导致热量集中，阻碍持续行进，并制定了一项通过分散负荷来抑制这一负荷的政策。我们在筑波挑战赛2025上评估了该系统，展示了它能够自主穿越约2.8公里的路线，包括人行道、公园和楼梯，而不会因过热而停车。

Federated Temporal Attention Intelligence for Cyber-Resilient IoMT: Lightweight Digital Twins and PPO-Driven Honeypot Deception

联邦时间注意力智能用于网络韧性物联网：轻量级数字孪生与PPO驱动的蜜罐欺骗

Authors: Syed Zeeshan Haider, Anwar Shah, Muneeb Arif, Hamza Iftikhar, Waqas Ali
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.21422
Pdf link: https://arxiv.org/pdf/2606.21422
Abstract The rapid proliferation of Internet of Medical Things (IoMT) devices introduces critical cybersecurity vulnerabilities in healthcare environments where resource-constrained medical devices operate under strict latency requirements and stringent data-privacy regulations. To address these challenges, this paper presents the Lightweight Digital Twin and Federated Reinforcement Learning (LDT-FRL) framework, a privacy-preserving defense architecture integrating four complementary mechanisms: a Temporal Attention Encoder (TAE) built on a GRU backbone with learned temporal self-attention for flow-level threat classification; lightweight LSTM-based Digital Twins trained on normal-class traffic to generate per-device anomaly scores that gate the TAE classifier through a learned sigmoid coupling; a Federated Proximal Policy Optimization (PPO) agent selecting among ALLOW, ISOLATE, and HONEYPOT_REDIRECT actions based on a seven-dimensional state; and an intelligent honeypot layer that converts redirected suspicious traffic into actionable threat intelligence. A federated aggregation strategy employing EMA-smoothed per-client validation losses as inverse-weighted FedAvg coefficients stabilizes global model updates under non-IID client distributions. Evaluated on CICDDoS 2019 and TON-IoT benchmarks, LDT-FRL achieves 99.66% and 99.95% test accuracy respectively, with macro-F1 scores of 0.9913 and 0.9995, converging 81% faster than the DTFL-CD baseline while attaining perfect F1=1.000 on the severely imbalanced MITM class. Explainability analysis via SHAP, LIME, Grad-CAM, and counterfactual methods confirms that the TAE focuses on semantically meaningful flow features, providing interpretable evidence for each defense decision.
中文摘要 医疗物联网（IoMT）设备的快速普及，在资源受限的医疗设备在严格的延迟要求和严格的数据隐私法规下运行的医疗环境中，带来了关键的网络安全漏洞。为应对这些挑战，本文提出了轻量级数字孪生与联邦强化学习（LDT-FRL）框架，这是一种保护隐私的防御架构，集成了四种互补机制：基于GRU骨干的时序注意力编码器（TAE），并具备学习的时序自注意以实现流级威胁分类;基于LSTM的轻量级数字孪生训练于普通类流量，生成每设备异常分数，通过学习的S形结晶耦合将TAE分类器门控;一个基于七维状态的联合近端策略优化（PPO）代理，在ALLOW、ISOLATE和HONEYPOT_REDIRECT动作中进行选择;以及一个智能蜜罐层，将重定向的可疑流量转化为可操作的威胁情报。采用EMA平滑的每个客户端验证损失作为反加权FedAvg系数的联邦聚合策略，在非IID客户端分布下稳定了全局模型更新。根据CICDDoS 2019和TON-IoT基准测试，LDT-FRL分别实现了99.66%和99.95%的测试准确率，宏F1得分分别为0.9913和0.9995，收敛速度比DTFL-CD基线快81%，在严重失衡的中间人类别中获得了满分F1=1.000。通过SHAP、LIME、Grad-CAM和反事实方法进行可解释性分析，证实TAE关注语义有意义的流程特征，为每项辩护决策提供可解释的证据。
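
The aggregation step described in the abstract (EMA-smoothed per-client validation losses used as inverse weights in FedAvg) can be sketched roughly as follows. This is a minimal illustration under assumptions, not the paper's implementation: the smoothing factor `beta`, the flat parameter vectors, and all function names are hypothetical.

```python
import numpy as np

def ema_update(prev_ema, new_loss, beta=0.9):
    """Exponential moving average of one client's validation loss.
    beta is an assumed smoothing factor; the abstract does not give one."""
    if prev_ema is None:
        return float(new_loss)
    return beta * prev_ema + (1.0 - beta) * float(new_loss)

def inverse_loss_weights(ema_losses, eps=1e-8):
    """Normalize 1/loss so that clients with lower smoothed loss weigh more."""
    inv = 1.0 / (np.asarray(ema_losses, dtype=float) + eps)
    return inv / inv.sum()

def aggregate(client_params, weights):
    """Weighted average of flat parameter vectors (FedAvg-style)."""
    stacked = np.stack(client_params)            # shape: (clients, params)
    return np.tensordot(weights, stacked, axes=1)

# Two hypothetical rounds of validation losses for three clients.
ema = [None, None, None]
for round_losses in ([0.9, 0.5, 0.3], [0.8, 0.4, 0.2]):
    ema = [ema_update(e, l) for e, l in zip(ema, round_losses)]

weights = inverse_loss_weights(ema)
params = [np.full(4, 1.0), np.full(4, 2.0), np.full(4, 3.0)]
global_params = aggregate(params, weights)
print(weights, global_params)
```

Smoothing the losses before inverting them keeps a single noisy validation round from swinging a client's weight, which is one plausible reading of why the abstract credits this scheme with stabilizing updates under non-IID data.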

Precision Recall Controllable Radiology Report Generation via Hybrid Natural Language and Clinical Reward Learning

通过混合自然语言与临床奖励学习生成精确回忆可控放射科报告

Authors: Ling Chen, Ruinan Jin, Jun Luo, Hanliang Chen, Quirin Strotzer, Rongkai Yan, Yuan Xue, Luciano Prevedello, Dufan Wu
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.21447
Pdf link: https://arxiv.org/pdf/2606.21447
Abstract Automated radiology report generation (RRG) has gained increasing attention because it can reduce the heavy workload of clinical report writing. However, most existing methods mainly optimize for natural language generation (NLG) metrics that focus on language fluency, while providing little control over clinically important factors such as precision and recall. As a consequence, generated reports may be fluent but not well aligned with different clinical needs. To address this challenge, we propose a reinforcement learning framework for precision-recall-controllable RRG, where a control parameter explicitly adjusts the trade-off between clinical precision and recall during inference. This design allows the model to flexibly generate reports according to different clinical requirements. To ensure clinical correctness, we introduce a clinical reward into the training objective, which helps improve clinical efficacy (CE) beyond standard language-based optimization. In addition, we apply a group-relative training strategy that normalizes rewards within each training group, reducing reward variance and improving training stability. Extensive experiments on the MIMIC-CXR dataset show that our method consistently outperforms state-of-the-art approaches in both NLG and CE evaluation metrics, while providing reliable control over the CE precision-recall trade-off.
中文摘要 自动化放射报告生成（RRG）因其能减轻临床报告写作的繁重工作量而受到越来越多的关注。然而，大多数现有方法主要优化自然语言生成（NLG）指标，重点关注语言流利度，而对临床重要因素如精确率和召回率几乎没有控制。因此，生成的报告可能流畅，但与不同临床需求不完全契合。为应对这一挑战，我们提出了一种强化学习框架，用于精确率-召回率可控的RRG，其中控制参数明确调整推理时临床精确率与召回率之间的权衡。这种设计使模型能够根据不同的临床需求灵活生成报告。为确保临床正确性，我们在训练目标中引入了临床奖励，有助于提升临床效能（CE），超越标准的基于语言的优化。此外，我们应用了群体相对训练策略，使每个训练组内的奖励归一化，降低奖励方差并提升训练稳定性。对MIMIC-CXR数据集的广泛实验表明，我们的方法在NLG和CE评估指标上始终优于最先进方法，同时对CE精确率-召回率权衡提供了可靠的控制。
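
To make the two ideas in this abstract concrete, here is a small sketch of a precision-recall-weighted reward followed by within-group reward normalization. The abstract does not say how the control parameter enters the reward; using it as the `beta` of an F-beta score is an assumption made here purely for illustration, and the sample (precision, recall) pairs are invented.

```python
import numpy as np

def f_beta(precision, recall, beta):
    """F-beta score: beta > 1 favors recall, beta < 1 favors precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: subtract mean, divide by std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def best_sample(samples, beta):
    """Index of the sampled report with the highest normalized reward."""
    rewards = [f_beta(p, r, beta) for p, r in samples]
    return int(np.argmax(group_relative_advantages(rewards)))

# Hypothetical (precision, recall) of four reports sampled for one study.
samples = [(0.9, 0.4), (0.6, 0.7), (0.5, 0.9), (0.8, 0.6)]
for beta in (0.5, 2.0):
    print(beta, best_sample(samples, beta))
```

With these made-up numbers, a small `beta` prefers the high-precision report and a large `beta` prefers the high-recall one, which is the kind of controllable trade-off the abstract describes.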

Balancing Performance and Diversity in GRPO Autoregressive Text-to-Image Post-Training

在GRPO自回归文本转图像后培训中平衡性能与多样性

Authors: Yuanhao Chiang, Hongbo Duan, Chunru Yang, Jiahua Pei, Yi Liu, Xueqian Wang
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.21498
Pdf link: https://arxiv.org/pdf/2606.21498
Abstract Autoregressive text-to-image (T2I) generation has recently advanced rapidly, yet aligning generated images with human preferences remains challenging. GRPO-style online reinforcement learning provides an effective framework; however, existing methods typically treat reference-policy divergence as fixed, despite its direct impact on policy optimization. We study this overlooked factor within a unified f-divergence framework, encompassing forward KL, reverse KL, and JS divergence, for GRPO-style autoregressive T2I alignment. Our systematic theoretical analysis reveals that different divergences reshape token-level updates in distinct ways. In particular, under the sampled-token shaping form used, JS regularization achieves a favorable trade-off by mitigating uniform bias relative to the reference policy while still discouraging large deviations. Extensive experiments on LlamaGen and Janus-7B show that JS divergence achieves the strongest or highly competitive optimization performance on most evaluation metrics while maintaining favorable generation diversity. The code is available at this https URL.
中文摘要 自回归文本到图像（T2I）生成技术近年来发展迅速，但将生成图像与人类偏好对齐仍具挑战性。GRPO风格的在线强化学习提供了一个有效的框架;然而，现有方法通常将参考-政策偏差视为固定，尽管它直接影响策略优化。我们在一个统一的f-发散框架下研究这一被忽视的因素，该框架涵盖正向KL、反KL和JS发散，用于GRPO式自回归T2I比对。我们的系统理论分析显示，不同的分歧以不同方式重塑代币级更新。特别是在采样令牌整形形式下，JS正则化通过减少相对于参考策略的均匀偏置，同时仍能防止大幅偏差，从而实现了有利的权衡。在LlamaGen和Janus-7B上的大量实验表明，JS发散在大多数评估指标上实现了最强或高度竞争的优化性能，同时保持了有利的生成多样性。代码可在该 https URL 访问。
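
For readers unfamiliar with the three regularizers named above, the sketch below computes forward KL, reverse KL, and JS divergence between a reference distribution and a policy distribution over a toy token vocabulary. The naming convention (forward KL as KL(ref || policy), reverse KL as KL(policy || ref)) is an assumption, since conventions differ between papers, and the token-level shaping form used by the authors is not reproduced here.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions given as probability vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def forward_kl(ref, policy):
    """Assumed convention: expectation taken under the reference policy."""
    return kl(ref, policy)

def reverse_kl(ref, policy):
    """Assumed convention: expectation taken under the current policy."""
    return kl(policy, ref)

def js(ref, policy):
    """Jensen-Shannon divergence: symmetric and bounded by ln 2."""
    ref = np.asarray(ref, dtype=float)
    policy = np.asarray(policy, dtype=float)
    mix = 0.5 * (ref + policy)
    return 0.5 * kl(ref, mix) + 0.5 * kl(policy, mix)

ref = [0.7, 0.2, 0.1]      # toy reference next-token distribution
policy = [0.4, 0.4, 0.2]   # toy policy next-token distribution
print(forward_kl(ref, policy), reverse_kl(ref, policy), js(ref, policy))
```

Because JS compares each distribution against their mixture, it stays finite even when one distribution assigns near-zero mass to a token, whereas either KL direction can grow without bound.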

Backpropagating Through Simulation: Analytic Policy Gradients for Sample and Learning Efficient Differentiable Continuous Control

通过模拟进行反向传播：样本与学习高效可微连续控制的分析策略梯度

Authors: Yueci Deng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.21525
Pdf link: https://arxiv.org/pdf/2606.21525
Abstract Model-free reinforcement learning algorithms such as Proximal Policy Optimization (PPO) treat the environment as a black box, estimating policy gradients from sampled rewards; this process demands millions of interactions and relies on high-variance advantage estimates. When environment dynamics are differentiable, the return is an end-to-end differentiable function of the policy parameters, enabling exact gradient computation via backpropagation through simulation. We term this approach Analytic Policy Gradients (APG) and evaluate it against PPO on four continuous control tasks of increasing dynamical complexity: a one-dimensional point-mass target-reaching task, a 2D point-mass navigation task with obstacle avoidance, a 2D rigid-body T-block pushing task, and a 7-DOF Franka FR3 end-effector reaching task. Both algorithms share identical model architectures, observation normalization, and optimizer settings. To decouple sample efficiency from compute efficiency, we design a multi-axis evaluation protocol that records performance against environment steps and gradient steps. We report a segmented backpropagation scheme with MC and critic-based bootstrap modes that mitigates gradient degradation on long-horizon tasks, and present ablations over segment length and bootstrap strategy.
中文摘要 无模型强化学习算法如近端策略优化（PPO）将环境视为黑箱，通过采样奖励估计策略梯度;该过程需要数百万次交互，并依赖高方差优势估计。当环境动力学可微时，返回是策略参数的端到端可微函数，使得通过仿真进行反向传播实现精确梯度计算。我们将此方法称为分析策略梯度（APG），并在四个动态复杂度递增的连续控制任务中对其进行与PPO的比较评估：一维点质量目标达标任务、二维点质量导航任务（障碍物避让）、二维刚体T块推进任务，以及7自由度Franka FR3末端执行器任务。两者在模型架构、观测归一化和优化器设置上共享相同。为了将样本效率与计算效率解耦，我们设计了多轴评估协议，记录环境步和梯度步的表现。我们报告了一种基于MC和critic的引导模式的分段反向传播方案，该方案在长视野任务中减轻了梯度退化，并展示了段长和引导策略的消融。
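
The core idea, that the return is a differentiable function of the policy parameters when the dynamics are differentiable, can be illustrated on a one-dimensional point mass. The sketch below is not the paper's setup: the dynamics, the single-gain policy, and the loss are assumptions, and the derivative is propagated forward through each step by hand (forward-mode sensitivities) rather than by backpropagation. Both routes yield the same exact gradient for this toy problem.

```python
def rollout_loss_and_grad(k, x0=0.0, target=1.0, dt=0.1, steps=50):
    """Simulate x_{t+1} = x_t + dt * u_t with policy u_t = k * (target - x_t).

    Returns the summed squared tracking error and its exact derivative
    with respect to the gain k, obtained by differentiating every step.
    """
    x, dx_dk = x0, 0.0
    loss, dloss_dk = 0.0, 0.0
    for _ in range(steps):
        u = k * (target - x)
        du_dk = (target - x) - k * dx_dk   # product rule on k * (target - x)
        x = x + dt * u
        dx_dk = dx_dk + dt * du_dk
        err = x - target
        loss += err * err
        dloss_dk += 2.0 * err * dx_dk
    return loss, dloss_dk

def finite_difference(k, h=1e-6):
    """Central-difference check of the analytic gradient."""
    loss_plus, _ = rollout_loss_and_grad(k + h)
    loss_minus, _ = rollout_loss_and_grad(k - h)
    return (loss_plus - loss_minus) / (2 * h)

k = 0.5
initial_loss, initial_grad = rollout_loss_and_grad(k)
for _ in range(200):                       # plain gradient descent on the gain
    _, grad = rollout_loss_and_grad(k)
    k -= 0.05 * grad
final_loss, _ = rollout_loss_and_grad(k)
print(initial_loss, final_loss, k)
```

No reward sampling or advantage estimation is involved: each update uses one deterministic rollout, which is the contrast with PPO that the abstract draws.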

FAST: A Framework for Aligned Sampling and Training in Parallel Reinforcement Learning for Autonomous Driving

FAST：自动驾驶并行强化学习中的对齐抽样与培训框架

Authors: Bonan Wang, Letian Tao, Bin Shuai, Jiaxin Gao, Wenxin Zhao, Wei Xiong, Kehua Sheng, Bo Zhang, Yang Guan, Shengbo Eben Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.21587
Pdf link: https://arxiv.org/pdf/2606.21587
Abstract Deep reinforcement learning is pivotal for closed-loop autonomous driving yet remains constrained by severe bottlenecks in sampling efficiency. Standard parallel sampling mitigates this but suffers from the straggler effect, where the premature termination of a single environment necessitates a synchronized batch re-initialization, leading to suboptimal sample utilization and prohibitive re-initialization latency. To address this, we propose FAST, a synchronous parallel framework tailored for closed-loop simulation. Specifically, FAST employs Dynamic Parallel Sampling Alignment (DPSA) to maintain vectorization synchronization by extending terminated episodes via virtual continuation, thereby decoupling the sampling loop from individual terminations. By dynamically triggering global truncation based on the termination rate of parallel clips, FAST effectively eliminates the bottleneck of premature resets without sacrificing data diversity. Furthermore, to strictly preserve theoretical consistency, we incorporate a Scaled Mask-Padding Optimization (SMPO) that leverages validity masking and adaptive loss normalization to nullify the bias from auxiliary padding data. Empirical evaluations demonstrate that FAST achieves at least a 1.78 times wall-clock speedup over the single-clip baseline while preserving statistical unbiasedness.
中文摘要 深度强化学习对于闭环自动驾驶至关重要，但仍受限于采样效率的严重瓶颈。标准并行采样可以缓解这一问题，但存在落后效应，即单一环境过早终止需要同步批次初始化，导致样本利用率不佳且重新初始化延迟过高。为此，我们提出了FAST，一个专为闭环仿真设计的同步并行框架。具体来说，FAST采用动态并行采样对齐（DPSA）来通过虚拟延续延长终止片段来维持矢量化同步，从而将采样环路与单个终端解耦。通过基于并行剪辑终止率动态触发全局截断，FAST 有效消除了提前重置的瓶颈，同时不牺牲数据多样性。此外，为了严格保持理论一致性，我们采用了缩放掩码填充优化（SMPO），利用效度掩蔽和自适应损耗归一化，消除辅助填充数据的偏见。实证评估表明，FAST在保持统计无偏的同时，至少比单片段基线提升1.78倍。
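
The validity masking and loss normalization mentioned for SMPO can be pictured with a masked mean: steps produced by virtual continuation after an episode has ended are padded data and should not influence the loss. The sketch below shows only this generic masking pattern; the array layout and function name are hypothetical, and the paper's actual adaptive normalization may differ.

```python
import numpy as np

def masked_mean_loss(per_step_loss, valid_mask):
    """Average the loss over valid steps only; padded steps contribute zero.

    Dividing by the number of valid steps (not the full batch size) keeps
    the loss scale independent of how much padding a batch contains.
    """
    loss = np.asarray(per_step_loss, dtype=float)
    mask = np.asarray(valid_mask, dtype=float)
    n_valid = mask.sum()
    if n_valid == 0:
        return 0.0
    return float((loss * mask).sum() / n_valid)

# Two parallel environments, three steps each. The second environment
# terminated after its first step, so its last two entries are padding.
per_step_loss = [[1.0, 2.0, 3.0],
                 [4.0, 100.0, 100.0]]
valid_mask = [[1, 1, 1],
              [1, 0, 0]]

print(masked_mean_loss(per_step_loss, valid_mask))   # uses 4 valid steps
print(float(np.mean(per_step_loss)))                 # naive mean, biased by padding
```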

The Two-Hump Problem: Bridging the Difficulty Gap in Mathematical Reinforcement Learning

双峰问题：弥合数学强化学习中的难度差距

Authors: Lucas Fagan, Michele Tarquini, Ali Shehper, Maksymilian Manko, Angus Gruen, Coco Huang, Giorgi Butbaia, Davide Passaro, Sergei Gukov
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Group Theory (math.GR); Geometric Topology (math.GT)
Arxiv link: https://arxiv.org/abs/2606.21611
Pdf link: https://arxiv.org/pdf/2606.21611
Abstract Mathematical search problems present a unique challenge for Reinforcement Learning (RL) due to vast search spaces and sparse rewards. In previous works, the Andrews-Curtis (AC) conjecture was established as an illustrative example of such problems. In this work, we identify a critical structural barrier in the AC landscape: a "Two-Hump" distribution, where problem instances are either trivially solvable or effectively impossible, with a scarcity of intermediate "hard-but-solvable" instances required for effective learning. We tackle this challenge through two primary avenues: novel data generation techniques to populate the difficulty gap, and significant algorithmic enhancements including the introduction of supermoves and Transformer-based architectures. We demonstrate substantial performance improvements over previous baselines, and release new comprehensive benchmark datasets including AC-19 (125,192 AC-trivial presentations of varying difficulty with length at most 19) and AC-1M (1,136,154 hard AC-trivial presentations of length at most 30), the first large-scale, publicly available datasets of this kind.
中文摘要 数学搜索问题对强化学习（RL）来说是一个独特的挑战，因为它的搜索空间巨大且奖励稀疏。在之前的研究中，Andrews-Curtis（AC）猜想被确立为此类问题的说明性例子。在本研究中，我们识别出AC领域中一个关键的结构性障碍：“双峰分布”，其中问题实例要么简单易解，要么实际上不可能，且缺乏中间“难解但可解”实例以实现有效学习。我们通过两大途径应对这一挑战：新颖的数据生成技术以填补难度差距，以及引入超级移动和基于Transformer架构的重大算法改进。我们展示了较以往基线显著的性能提升，并发布了新的综合基准数据集，包括AC-19（125,192个难度不一、长度至多为19的AC平凡展示）和AC-1M（1,136,154个长度至多为30的困难AC平凡展示），这是首批大规模、公开可用的此类数据集。

Motion-Aware Reinforcement Learning For Object Localization

物体定位的动作感知强化学习

Authors: Prithvi Raj Singh, Satyendra Singh
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.21764
Pdf link: https://arxiv.org/pdf/2606.21764
Abstract We present MARLNet (Motion-Aware Reinforcement Learning Network), a PPO-based bounding-box refinement agent that incorporates a constant-velocity motion prior into the observation state and an action smoothness penalty into the reward function. The agent operates on 268-dimensional observations encoding the current proposal, a kinematic prediction, the previous action, and a 256-dimensional EfficientNet-B0 crop feature, and learns a five-dimensional policy controlling coordinate adjustments and a binary termination trigger. Evaluated on Pascal VOC 2012 and VisDrone 2019, MARLNet trains stably across all regularization strengths tested and achieves consistent gains in detection success rate at $\text{IoU} \geq 0.5$: up to $+0.011$ on VOC ($\lambda_\text{phys}{=}0.10$), where the motion prior prevents the overshooting that causes plain PPO to regress on this metric, and $+0.007$ on VisDrone ($\lambda_\text{phys}{=}0.70$), where unconstrained PPO achieves a larger gain ($+0.025$) owing to the weaker base detector. Through reward design ablations and training dynamics analysis, we identify a reward interference in which combining a constant-velocity deviation penalty with an absolute IoU term causes trigger collapse, and show that replacing it with the action smoothness penalty resolves this failure. We further characterize a representational ceiling facing crop-feature refinement agents that share a backbone with their base detector, confirmed through a global-plus-local observation ablation. Project page: this https URL
中文摘要 我们介绍了MARLNet（运动感知强化学习网络），这是一个基于PPO的包围盒细化代理，在观察状态中加入了恒定速度运动，在奖励函数中引入了动作平滑惩罚。该代理基于编码当前提案的268维观测值、运动学预测、前一动作以及256维EfficientNet-B0裁剪特征，并学习控制坐标调整和二元终止触发器的五维策略。在Pascal VOC 2012和VisDrone 2019的评估中，MARLNet在所有正则化强度下都能稳定训练，检测成功率在$\text{IoU} \geq 0.5$：VOC最高可达$+0.011$（$\lambda_\text{phys}{=}0.10$），其中运动先验防止了导致普通PPO回归的超跃，VisDrone为$+0.007$（$\lambda_\text{phys}{=}0.70$），其中无约束的PPO由于基准检测器较弱，获得更大的增益（$+0.025$）。通过奖励设计消融和训练动态分析，我们识别出一种奖励干涉，其中将恒定速度偏差惩罚与绝对IoU项结合会导致触发坍缩，并证明用动作平滑惩罚替代该惩罚可解决此故障。我们进一步描述了与其基探测器共享主干的裁切特征精炼剂所面临的表征天花板，这一点通过全局加局部观察消融得到确认。项目页面：此 https URL
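
Three ingredients named in this abstract are easy to state in code: the IoU criterion used for the success rate, a constant-velocity extrapolation of a box, and a penalty on the change between consecutive actions. The sketch below shows generic versions of each; the box format `(x1, y1, x2, y2)`, the squared-difference form of the penalty, and all function names are assumptions rather than details taken from the paper.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def constant_velocity_prediction(prev_box, curr_box):
    """Extrapolate the next box assuming the last displacement repeats."""
    return tuple(c + (c - p) for p, c in zip(prev_box, curr_box))

def smoothness_penalty(prev_action, action, lam):
    """lam times the squared change between consecutive actions (assumed form)."""
    return lam * sum((a - p) ** 2 for p, a in zip(prev_action, action))

prev_box, curr_box = (0, 0, 2, 2), (1, 1, 3, 3)
predicted = constant_velocity_prediction(prev_box, curr_box)
print(predicted, iou(prev_box, curr_box))
print(smoothness_penalty((0.0, 0.0), (1.0, 2.0), lam=0.1))
```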

CalVerT: Augmenting Agents with Calibrated Verifier Telemetry Improves Action and Learning in Knowledge-Intensive Tasks

CalVerT：通过校准验证者遥测增强智能体在知识密集型任务中的行动和学习能力

Authors: Ashwin Vinod, Ying Ding, Elias Stengel-Eskin
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.21777
Pdf link: https://arxiv.org/pdf/2606.21777
Abstract LLM agents in knowledge-intensive question answering take retrieval and reasoning actions with incomplete knowledge about whether their current answer is uncertain, unsupported, or already complete. This produces two failure modes: committing to confident but unsupported answers, which hurts accuracy, and over-retrieving when the evidence in hand already suffices, resulting in wasted compute. To give agents a more complete picture of the state space they are operating in, we introduce calibrated verifier telemetry (CalVerT), which augments the agent's state with additional telemetry: a calibrated self-confidence score and a grounding verifier score. We show that CalVerT can improve agents in both training-free and training-based settings. On four QA benchmarks, we find that CalVerT raises F1 by triggering retrieval in cases where agents over-rely on parametric knowledge, while cutting redundant retrieval in cases where agents have sufficient context to answer. We show that CalVerT can augment existing QA frameworks without training. Moreover, CalVerT also improves trained systems: by simply augmenting an agent's state with telemetry, we observe improvements after reinforcement learning, as compared to an agent with identical training but no CalVerT telemetry.
中文摘要 知识密集型问答中的LLM代理在进行检索和推理时，对当前答案是否不确定、缺乏支持或已经完整了解并不完整。这导致两种失败模式：一种是承诺自信但缺乏支持的答案，这会降低准确性;另一种是过度检索，而手头的证据已经足够，导致计算浪费。为了让代理更完整地了解他们所处的状态空间，我们引入了校准验证者遥测（CalVerT），它通过额外的遥测来增强代理状态：校准自信心评分和接地验证评分。我们证明CalVerT可以在无培训和基于培训的环境中提升代理。在四个QA基准测试中，我们发现CalVerT通过在代理过度依赖参数化知识时触发检索来提升F1，而在代理拥有足够上下文回答的情况下减少冗余检索。我们证明CalVerT可以在无需培训的情况下增强现有的质量保证框架。此外，CalVerT还改进了训练系统：仅仅通过用遥测增强代理状态，我们就能观察到强化学习后的改进，相比于训练相同但没有CalVerT遥测的代理。

KineticSim: A Lightweight, High-Performance Execution Engine for Real-Time Market Simulators

KineticSim：一款用于实时市场模拟器的轻量高性能执行引擎

Authors: Shakya Jayakody, Prarthinie Jayakody
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF); Trading and Market Microstructure (q-fin.TR)
Arxiv link: https://arxiv.org/abs/2606.21784
Pdf link: https://arxiv.org/pdf/2606.21784
Abstract Simulating financial markets at scale with multi-agent (Agent-Based) models is critical for market design, regulatory stress-testing, and reinforcement learning, but traditional CPU simulators are bottlenecked by sequential processing while vectorized GPU frameworks suffer from kernel-launch overhead and redundant global-memory round-trips. We formalize, analyze, and evaluate a reusable parallel design pattern: persistent, state-carrying clearing for iterative multi-agent reductions. By caching mutable simulation state in thread-block shared memory across step boundaries, aggregating agent actions via shared-memory atomics, and resolving the clearing function cooperatively, the pattern reduces the per-step critical-path depth from Theta(L+A) for sequential clearing (L price-grid ticks, A agents) to Theta(log L + ceil(A/L)) and makes global-memory traffic independent of the step count. We implement this in KineticSim, a lightweight GPU execution engine that simulates massive ensembles of limit-order books in parallel, reaching a peak throughput of over 54.7 billion agent-events per second. On a fixed workload it delivers speedups of 3406x over CPU (NumPy), 27.8x over PyTorch GPU, 42.8x over JAX GPU, and 8.4x over a naive custom CUDA baseline, while using roughly an order of magnitude less GPU memory than PyTorch. Across 53 configurations the two custom CUDA engines produce bitwise-identical order books, and aggregate statistics match the CPU reference to within 0.1%. The pattern generalizes to other iterative multi-agent workloads requiring state-persistent, block-localized reductions.
中文摘要 用多智能体（基于代理）模型大规模模拟金融市场对于市场设计、监管压力测试和强化学习至关重要，但传统CPU模拟器因顺序处理而受限，而矢量化GPU框架则存在内核启动开销和冗余的全局内存往返。我们形式化、分析并评估了一种可复用的并行设计模式：用于迭代多智能体减少的持久、携带状态的清除。通过在跨步边界的线程块共享内存中缓存可变仿真状态，通过共享内存原子集聚集成代理动作，并协同解析清除函数，该模式将每步关键路径深度从顺序清算（L 价格网格刻度，A 代理）的 Theta（L+A）减少到 Theta（log L + ceil（A/L）），并使全局内存流量不受步数影响。我们在KineticSim中实现这一功能，这是一款轻量级GPU执行引擎，能够并行模拟大量极限顺序书的集合，峰值吞吐量超过每秒547亿次。在固定工作负载下，它比CPU（NumPy）提升3406倍，PyTorch GPU加速27.8倍，JAX GPU加速42.8倍，基于简单自定义CUDA基线8.4倍，同时GPU内存消耗约比PyTorch少一个数量级。在53种配置中，两个定制CUDA引擎生成的订单簿位数完全相同，汇总统计数据与CPU参考匹配率仅在0.1%。该模式推广到其他需要状态持久、块局部化的迭代多智能体工作负载。

Discretizing Reward Models

离散化奖励模型

Authors: Vijay Viswanathan, Shiqi Wang, Devamanyu Hazarika, Chirag Nagpal, Tongshuang Wu, Graham Neubig, Yuning Mao
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.21795
Pdf link: https://arxiv.org/pdf/2606.21795
Abstract Despite their widespread use, the role of reward models in shaping reinforcement learning is poorly understood. Reward models offer a tempting promise: they automatically estimate response quality in the absence of verifiers or human judges. Unlike "verifiable rewards" which typically produce binary scores, reward models typically produce continuous scores, allowing them to be sensitive to fine-grained differences in responses. However, we show this apparent strength is a serious weakness: many popular reward models are oversensitive, assigning different scores to equally good responses. Theoretically, we show that seemingly perfect reward models can be highly oversensitive; empirically, this oversensitivity can lead to bad policies. In place of existing notions of "reward model accuracy," we propose evaluating reward models using distinct measures of "discriminative ability" and "specificity" (the complement of oversensitivity). As a solution, we describe a training-free algorithm that uses Monte Carlo dropout on any neural reward model to produce discrete reward clusters. Theoretically, we prove there exist discretizations that reduce oversensitivity at minimal expense of discriminative ability; empirically we show, in both controlled and natural RL settings, that discretizing rewards leads to less reward hacking and better policies than training on the original rewards.
中文摘要 尽管奖励模型被广泛使用，但它们在塑造强化学习中的作用仍被理解不足。奖励模型提供了一个诱人的承诺：它们在没有验证者或人工评审的情况下自动估算响应质量。与通常产生二元分数的“可验证奖励”不同，奖励模型通常产生连续评分，使其对反应中的细微差异非常敏感。然而，我们发现这种表面优势其实是严重的弱点：许多流行的奖励模型过于敏感，会给同样好的回答分配不同的分数。理论上，我们表明看似完美的奖励模型可能高度敏感;从经验来看，这种过度敏感可能导致糟糕的政策。我们提出用“辨别能力”和“特异性”（即过度敏感的补补）这两个不同的衡量标准来评估奖励模型，取代现有的“奖励模型准确性”概念。作为解决方案，我们描述了一种无训练算法，利用蒙特卡洛脱落法对任意神经奖励模型生成离散奖励簇。理论上，我们证明存在离散化方法，可以在最小的辨别能力代价下减少过度敏感性;通过实证，我们在受控和自然强化环境中表明，离散化奖励比训练原始奖励更少，导致更少的奖励黑客行为和更好的策略。
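
One way to picture reward discretization is to treat repeated stochastic forward passes (such as Monte Carlo dropout) as a noise estimate, and to merge responses whose mean scores differ by less than that noise. The sketch below implements this simple gap-based grouping. It is an illustrative stand-in, not the algorithm from the paper; the threshold factor `k` and the hand-written score samples (which replace real dropout passes) are assumptions.

```python
import numpy as np

def discretize(samples, k=1.0):
    """Group responses into discrete reward levels.

    samples[i, j] is the score of response i on stochastic pass j.
    Responses are sorted by mean score; a new level starts whenever the
    gap to the previous mean exceeds k times the average per-response std.
    """
    s = np.asarray(samples, dtype=float)
    means = s.mean(axis=1)
    noise = s.std(axis=1).mean()
    order = np.argsort(means)
    labels = np.zeros(len(means), dtype=int)
    level = 0
    for prev, cur in zip(order[:-1], order[1:]):
        if means[cur] - means[prev] > k * noise:
            level += 1
        labels[cur] = level
    return labels

# Hand-written stand-ins for three dropout passes per response.
offsets = np.array([-0.05, 0.0, 0.05])
base_scores = np.array([0.0, 0.02, 1.0, 1.03])
samples = base_scores[:, None] + offsets[None, :]

print(discretize(samples))
```

In this toy case the two low-scoring responses end up with the same discrete reward and the two high-scoring ones share another, so differences smaller than the model's own noise no longer influence the policy update.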

Mat-Pref: Verifiable-Reward Training Improves Compositional Reasoning in Inorganic Materials

Mat-Pref：可验证奖励训练提升无机材料的组成推理能力

Authors: Sarrah R. Mikhail Leung, Taehan Kim, Jeongbin Park
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.21830
Pdf link: https://arxiv.org/pdf/2606.21830
Abstract Reinforcement learning from verifiable rewards (RLVR) has driven rapid progress in mathematical and code reasoning, but when extended to science, existing benchmarks do not decompose what generalizes: do gains reflect structural transfer, property transfer, or memorization? We introduce Mat-Pref, a benchmark of 10,837 ionic-substitution questions across 11 inorganic structure families, grounded in density functional theory calculations from the Materials Project, with three evaluation splits that isolate in-distribution performance, generalization to entirely held-out structure families, and cross-property transfer: applying band-gap reasoning to hosts seen during training only through formation-energy supervision. Four zero-shot frontier models (70-671B parameters) remain in the 33-54% range on every split, confirming that scale alone does not resolve the compositional chemical reasoning this task demands. A two-stage pipeline of supervised fine-tuning followed by Group Relative Policy Optimization (GRPO) lifts Qwen3-8B to 65.2% in-distribution and 71.6% on held-out families, exceeding zero-shot Qwen3-235B by over 20 percentage points on both structural-generalization splits. Self-consistency sampling shows that the SFT policy can already produce correct answers but cannot reliably surface them as the modal response; GRPO reshapes the distribution so that correct answers become modal rather than merely reachable, and this sharper commitment is visible mechanistically: logit lens analysis reveals a ${\sim}$20pp advantage in answer crystallization at the critical decision layer. We formalize this observation as a distractor-permutation consistency metric under which GRPO narrows the gap between lenient scoring (at least one permutation correct) and strict scoring (all permutations correct) from 24.0 to 14.3 percentage points.
中文摘要 可验证奖励强化学习（RLVR）推动了数学和代码推理的快速进展，但当扩展到科学领域时，现有基准无法分解什么是泛化：收益是结构性转移、属性转移还是记忆？我们引入了Mat-Pref，这是一个涵盖11个无机结构家族的10,837个离子替代问题的基准测试，基于材料项目的密度泛函理论计算，包含三种评估分段，分别分离分布内表现、推广至完全保留的结构家族，以及交叉性质转移：对仅通过形成能量监督训练期间观察到的宿主应用带隙推理。四个零发子前沿模型（70-671B参数）在每次分裂中保持在33-54%范围内，证实仅靠规模无法解决该任务所需的化学成分推理。经过两阶段监督微调和集团相对政策优化（GRPO）流程，Qwen3-8B的分布率提升至65.2%，在未被保留的家庭中达到71.6%，在结构推广拆分上均比零样本Qwen3-235B高出超过20个百分点。自洽抽样表明，SFT策略已经能够产生正确答案，但无法可靠地将其呈现为模态响应;GRPO重塑了分布，使正确答案变得模态化，而不仅仅是可达的，这种更锐利的承诺在机制上可见：logit透镜分析显示，关键决策层的答案结晶率提升了${\sim}$20pp的优势。我们将这一观察形式化为干扰器-置换一致性指标，在此下GRPO将宽松评分（至少一个排列正确）与严格评分（所有排列正确）之间的差距从24.0%缩小到14.3个百分点。
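
The lenient and strict scores in the consistency metric follow directly from their definitions in the abstract: a question counts under lenient scoring if at least one distractor permutation is answered correctly, and under strict scoring only if all are. A minimal sketch, with invented per-question results and a hypothetical function name:

```python
def permutation_scores(results):
    """results holds one list of booleans per question, one entry per
    distractor permutation (True means that permutation was answered correctly).

    Returns (lenient, strict, gap) as fractions of questions.
    """
    n = len(results)
    lenient = sum(any(r) for r in results) / n
    strict = sum(all(r) for r in results) / n
    return lenient, strict, lenient - strict

# Four hypothetical questions, three permutations each.
results = [
    [True, True, True],
    [True, False, True],
    [False, False, False],
    [True, True, False],
]
lenient, strict, gap = permutation_scores(results)
print(lenient, strict, gap)
```

A small gap means the model's answer does not depend on the order in which the options are presented, which is the behavior the abstract attributes to the GRPO-trained policy.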

Modularized Reinforcement Learning on LLMs: From MDP Creation to Exploration and Learning

大型语言模型上的模块化强化学习：从MDP创建到探索与学习

Authors: Zhao Yang, Yuxuan Jiang, Ting-Chih Chen, Lincen Yang, Annie Wong, Chao Gao, Jacob E. Kooi, Zhong Li, Jiayang Shi, Kevin Qiu, Qi Huang, Xinrui Zu, Shiping Yang, Hengyuan Zhang, Ngai Wong, Filip Ilievski, Shujian Yu, Aske Plaat, Zhaochun Ren, Mark Hoogendoorn, Vincent François-Lavet
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.21943
Pdf link: https://arxiv.org/pdf/2606.21943
Abstract Reinforcement learning (RL) has become central to LLM post-training, yet the methods that dominate current pipelines, PPO and GRPO, represent only a narrow slice of what RL offers. Understanding why these methods prevail, and what alternatives exist, requires a principled examination of the design decisions that underlie any RL algorithm. This survey organizes that examination around three stages of algorithm construction. We begin with MDP creation: how the reward function, state space, action space, termination condition, and discount factor are, or could be, defined for LLM training. We then turn to exploration, covering temperature sampling, entropy regularization, intrinsic motivation, tree search, and curriculum learning. Finally, we address learning along four classical RL dimensions: model-free versus model-based, value-based versus policy-based versus actor-critic, on-policy versus off-policy, and credit assignment, including both Monte Carlo methods, which rely on full return estimates, and bootstrapping methods, which update estimates using other learned predictions. Mapping the LLM literature onto this taxonomy reveals a strikingly non-uniform distribution of research effort. Critic-free policy gradients and Monte Carlo credit assignment are densely populated, while value-based methods, off-policy actor-critic training, and bootstrapping-based credit assignment remain largely unexplored despite well-established counterparts in classical RL. These gaps represent concrete opportunities for transferring proven RL techniques to LLM training. By making these gaps explicit alongside the methods that have proven effective, this survey offers researchers in both RL and LLMs a shared framework for understanding current practice and identifying promising directions for future work.
中文摘要 强化学习（RL）已成为训练后LLM的核心，但目前主流的PPO和GRPO方法，仅代表了RL所提供的一小部分。理解这些方法为何盛行，以及存在哪些替代方案，需要对任何强化学习算法背后的设计决策进行有原则的审视。本调查围绕算法构建的三个阶段进行了组织考察。我们从MDP的创建开始：奖励函数、状态空间、动作空间、终止条件和折扣因子如何定义，或者如何定义用于LLM训练。接着我们转向探索，涵盖温度采样、熵正则化、内在动机、树状搜索和课程学习。最后，我们从四个经典强化学习维度进行学习：无模型与基于模型、基于价值与基于策略与行为者批评者、政策启动与非策略，以及信用分配，包括依赖全回报估计的蒙特卡洛方法和利用其他学习预测更新估计的自助方法。将LLM文献映射到该分类法中，发现研究努力分布异常不均。无批评的政策梯度和蒙特卡洛信用赋值高度存在，而基于价值的方法、非政策行为者-批评者培训和基于自助的信用赋值尽管在经典强化学习中有成熟对应方法，但仍然鲜有深入探讨。这些空白为将经过验证的强化学习技术转化为LLM培训提供了切实机会。通过将这些空白与已被证明有效的方法并列，本调查为强化学习和大型语言模型的研究人员提供了一个共享的框架，帮助理解当前实践并识别未来工作的有前景方向。
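
The distinction the survey draws between Monte Carlo and bootstrapping credit assignment comes down to what each update target is built from. The sketch below contrasts the two on a single short episode; the reward sequence and the value estimates are made up for illustration.

```python
def monte_carlo_returns(rewards, gamma):
    """Full discounted return from each step, using only observed rewards."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]

def td_targets(rewards, values, gamma):
    """One-step bootstrapped targets: r_t + gamma * V(s_{t+1}).

    values has len(rewards) + 1 entries; the last one is the value of
    the state after the final step (0.0 if the episode terminated).
    """
    return [r + gamma * values[t + 1] for t, r in enumerate(rewards)]

rewards = [0.0, 0.0, 1.0]            # sparse reward at the end of the episode
values = [0.2, 0.5, 0.8, 0.0]        # hypothetical learned value estimates
print(monte_carlo_returns(rewards, gamma=0.5))
print(td_targets(rewards, values, gamma=0.5))
```

Monte Carlo targets need the whole episode and carry its full variance, while bootstrapped targets can be formed after each step but inherit any error in the learned value estimates.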

IRumAI: Reinforcement Learning for Indian Rummy

IRumAI：印度拉米牌的强化学习

Authors: Vignesh Mohan
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.21975
Pdf link: https://arxiv.org/pdf/2606.21975
Abstract Despite its massive player base and complex hidden-information dynamics, Indian Rummy has received no reinforcement learning attention. Existing agents rely on combinatorial search, which is tactically strong but slow at inference. We present IRumAI, the first RL agent for the domain. IRumAI integrates Proximal Policy Optimization (PPO), meld-aware observation encoding, deadwood-driven reward shaping, and a dual-branch convolutional architecture. IRumAI is RL-trained solely against weak heuristics, after a one-time behaviour-cloning warm-start on stronger demonstration data. It generalises to defeat the entire baseline hierarchy, including a 53.9% win rate against the strongest search-based opponent unseen during RL training. Bypassing explicit search, IRumAI requires just 0.33 ms per action, which is over 7,000x faster than the state-of-the-art heuristic. Ablations validate our architectural choices, and linear probing reveals that the network implicitly models the opponent's hidden hand from public interactions.
中文摘要 尽管拥有庞大的玩家基础和复杂的隐藏信息机制，印度拉米牌却没有获得任何强化学习的关注。现有代理依赖组合搜索，这种方法在战术上强大但推断速度较慢。我们介绍IRumAI，该领域的首个强化学习代理。IRumAI 集成了近端策略优化（PPO）、融合感知的观察编码、无效驱动的奖励塑造以及双分支卷积架构。IRumAI仅针对弱启发式进行强化学习训练，前提是对更强的演示数据进行了一次性行为克隆的热启动。它推广到击败整个基线层级，包括对抗强搜索对手的53.9%胜率，这是在现实学习训练中未曾见过的。IRumAI绕过显式搜索，每次操作只需0.33毫秒，比最先进的启发式算法快了7000倍以上。消融验证了我们的架构选择，线性探测揭示了网络隐含地模拟了对手在公开互动中的隐藏手牌。

RARM: Confidence-Gated Progress Reward Modeling for RL in Manipulation

RARM：基于操作中的强化学习信心门槛进步奖励建模

Authors: Pengzhi Yang, Xinyu Wang, Pengyu Jing, Kehan Wen, Yiduo Qu, Zhenhao Huang, Minghao Fu, Xin Liu, Yaheng Shen, Fan Shi
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.22027
Pdf link: https://arxiv.org/pdf/2606.22027
Abstract Reinforcement learning for robot manipulation is often bottlenecked by reward design, especially in long-horizon tasks: sparse success rewards provide weak supervision, while hand-crafted dense rewards are tedious to design and generalize poorly across tasks. Progress-based reward models offer a promising alternative by estimating how far an observation has advanced toward task completion, but existing approaches often require task-specific demonstrations or progress labels, and can assign high rewards to visually plausible but physically incorrect states. We introduce the Reference-Anchored Reward Model (RARM), a lightweight visual comparator that converts a single successful demonstration into a dense, progress-aware reward. RARM is trained once on general-purpose videos with a contrastive temporal objective, requiring no robot-specific data, task-specific reward labels, or per-task reward engineering. At deployment, RARM matches rollout clips to reference clips and rewards only confident forward progress, suppressing uncertain matches that may otherwise produce false-positive rewards. Across 9 simulated manipulation tasks from LIBERO and MetaWorld and 4 real-world tasks, RARM achieves the best overall success rates in subsequent RL training, with particularly large gains on long-horizon tasks such as cloth folding, where unreliable progress estimates are especially harmful.
中文摘要 机器人操作的强化学习常常被奖励设计所限制，尤其是在长视野任务中：稀疏的成功奖励提供了薄弱的监督，而手工制作的密集奖励则繁琐，设计和推广不当。基于进展的奖励模型通过估计观察在任务完成方面进展了多远，提供了一个有前景的替代方案，但现有方法通常需要任务特定的演示或进展标签，并且可能为视觉上合理但身体上不正确的状态赋予高奖励。我们介绍参考锚定奖励模型（RARM），这是一种轻量级的视觉比较器，将单次成功的演示转化为密集且有进展意识的奖励。RARM仅在具有对比性时间目标的通用视频上进行一次训练，无需机器人专用数据、任务特定奖励标签或逐任务奖励工程。部署时，RARM会将推出片段与参考片段匹配，并只奖励有信心的前进进度，抑制不确定的匹配，避免误报奖励。在LIBERO和MetaWorld的9个模拟操作任务以及4个真实世界任务中，RARM在后续强化学习训练中取得了最佳整体成功率，尤其是在长期任务如折叠布料上取得显著进展，因为不可靠的进展估计尤其有害。
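
The gating rule described above (reward only confident forward progress, suppress uncertain matches) can be sketched as follows. How RARM computes progress and confidence from video clips is not shown; the per-step estimates, the threshold value, and the function name are hypothetical.

```python
def gated_rewards(progress, confidence, threshold=0.8):
    """Turn per-step progress estimates into dense rewards.

    A step is rewarded only if its match confidence reaches the threshold
    and its progress exceeds the best confidently matched progress so far.
    """
    best = 0.0
    rewards = []
    for p, c in zip(progress, confidence):
        if c >= threshold and p > best:
            rewards.append(p - best)
            best = p
        else:
            rewards.append(0.0)
    return rewards

# Hypothetical rollout: the third step claims a large jump with low confidence.
progress = [0.1, 0.3, 0.9, 0.5]
confidence = [0.9, 0.95, 0.4, 0.9]
rewards = gated_rewards(progress, confidence)
print(rewards, sum(rewards))
```

In this example the low-confidence jump to 0.9 earns nothing, so a visually plausible but unreliable match cannot inflate the return.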

Deep RL-Tuned Model-Free Adaptive Control for Lower-Limb Exoskeletons During Sit-to-Stand Transitions

深度强化学习调谐的无模型自适应控制：用于坐立转换期间的下肢外骨骼

Authors: Ranjeet Kumbhar, Appaso M. Gadade, Rajmeet Singh, Ashish Singla, Ravinder Kumar
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.22040
Pdf link: https://arxiv.org/pdf/2606.22040
Abstract Sit-to-stand (STS) transitions impose significant joint-loading demands on elderly individuals, making them a primary target for lower-limb exoskeleton assistance. However, accurate trajectory tracking during STS is challenging due to complex, time-varying human-exoskeleton interaction dynamics and inter-subject variability that render model-based control approaches difficult to apply in practice. This paper presents an intelligent model-free adaptive backstepping control strategy for a bilateral lower-limb exoskeleton during STS motion. The proposed controller design uses an ultra-local second-order model to avoid explicit system identification, while a Gaussian radial basis function (RBF) neural network estimates the unknown lumped dynamics online. To further improve phase-aware tracking performance, a Twin Delayed Deep Deterministic Policy Gradient (TD3) reinforcement learning agent is integrated as a supervisory gain scheduler that adaptively adjusts controller gains across the distinct phases of STS motion. The proposed controller is evaluated through co-simulation in MATLAB/Simulink and Simscape Multibody using OpenSim-derived reference trajectories and benchmarked against state-of-the-art controllers. Results demonstrate that the proposed controller achieves the lowest average RMSE of 0.078 degrees across all joints, representing improvements of 60.2%, 54.4%, 48.7%, and 42.6% over proportional-integral-derivative (PID), model-free adaptive control (MFAC), linear quadratic regulator (LQR), and sliding-mode control (SMC), respectively. TD3 integration further reduces tracking error by 35%, 33%, and 79% at the hip, knee, and ankle joints compared to the standalone RBF-MFAC baseline. These results demonstrate the effectiveness and robustness of the proposed controller design for assistive exoskeleton control during STS transitions.
中文摘要 坐立转换（STS）对老年人带来了显著的关节负荷负担，使其成为下肢外骨骼辅助的主要目标。然而，由于复杂且随时间变化的人体外骨骼相互作用动态以及受试者间的变异性，使得基于模型的控制方法难以在实际应用中应用，STS中的准确轨迹跟踪具有挑战性。本文提出了一种智能无模型的自适应后退控制策略，适用于STS运动期间双侧下肢外骨骼。拟议的控制器设计采用超局部二阶模型以避免显式系统识别，而高斯径向基函数（RBF）神经网络则在线估计未知的集中动力学。为进一步提升相位感知跟踪性能，集成了双延迟深度确定性策略梯度（TD3）强化学习代理作为监督增益调度器，自适应调整STS运动不同阶段的控制器增益。该控制器通过MATLAB/Simulink和Simscape Multibody的协同仿真，使用OpenSim衍生的参考轨迹进行评估，并以最先进的控制器进行基准测试。结果显示，所提控制器在所有关节中均有效误差（RMSE）达到最低，达到0.078度，分别在比例积分导数（PID）、无模型自适应控制（MFAC）、线性二次调节器（LQR）和滑动模式控制（SMC）方面提升了60.2%、54.4%、48.7%和42.6%。与单独的RBF-MFAC基线相比，TD3整合进一步降低了髋关节、膝盖和踝关节的追踪误差分别降低了35%、33%和79%。这些结果展示了拟议控制器设计在STS过渡期间辅助外骨骼控制的有效性和稳健性。

When Does a Video-Language Model Stop Watching? Reward Strength Controls the Formation and Reversal of Visual Shortcuts in Multimodal RLVR

视频语言模型什么时候停止观看？奖励强度控制多模态RLVR中视觉捷径的形成和逆转

Authors: Zekun Xu
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.22043
Pdf link: https://arxiv.org/pdf/2606.22043
Abstract Reinforcement learning with verifiable rewards (RLVR) is increasingly applied to large vision-language models (LVLMs), yet outcome-only optimization can drive a model to stop attending to the video and instead exploit linguistic priors -- a failure we call a visual shortcut. While the existence of such perception bypass is by now documented, how it forms, whether it can be undone, and when intervention still helps remain open. We treat the strength of a grounding penalty, lambda, as a control knob and characterize the formation-reversal dynamics of visual shortcuts along the training time axis. On a held-out, out-of-distribution diagnostic set, we find: (i) a sharp onset -- shortcut reliance emerges abruptly over a narrow window of optimization steps and is robust across random seeds; (ii) a monotone dose-response -- increasing lambda progressively suppresses the shortcut, and at an intermediate dose the trajectory first forms and then reverses the shortcut, exposing a hysteresis-like asymmetry between acquiring and removing it; and (iii) a critical intervention window -- applying the penalty before onset arrests shortcut formation, whereas the same penalty applied after consolidation is markedly less effective. Together these results recast visual-shortcut collapse not as a binary defect but as a controllable, time-dependent, and asymmetric process, with direct implications for when and how strongly to regularize multimodal RLVR.
中文摘要 带可验证奖励的强化学习（RLVR）越来越多地应用于大型视觉语言模型（LVLM），但仅针对结果的优化可能促使模型不再关注视频，转而利用语言先验，我们将这种失效称为视觉捷径。虽然这种感知绕过的存在已有文献记录，但它如何形成、能否被消除，以及何时干预仍然有效，仍是未解决的问题。我们将接地惩罚的强度lambda视为控制旋钮，刻画视觉捷径沿训练时间轴的形成与逆转动态。在一个留出的分布外诊断集上，我们发现：（i）急剧起始：捷径依赖在一个狭窄的优化步数窗口内突然出现，并且在不同随机种子间保持稳健；（ii）单调的剂量反应：增大lambda会逐步抑制捷径，而在中等剂量下，训练轨迹先形成捷径、随后又将其逆转，暴露出获得捷径与消除捷径之间类似滞后的不对称性；以及（iii）关键干预窗口：在起始之前施加惩罚可以阻止捷径形成，而在捷径巩固之后施加同样的惩罚，效果明显较差。这些结果共同表明，视觉捷径坍缩不是一种二元缺陷，而是一个可控、随时间变化且不对称的过程，这对何时以及以多大强度对多模态RLVR进行正则化具有直接启示。

Reinforcement Learning-Based Traffic Signal Control for IoT-Enabled Intersections

面向物联网路口的基于强化学习的交通信号控制

Authors: Yousef AlSaqabi
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.22108
Pdf link: https://arxiv.org/pdf/2606.22108
Abstract Urban traffic congestion remains a persistent challenge in car-dependent cities, imposing significant economic and societal costs. Traffic signal systems are increasingly deployed as networked cyber-physical components within smart-city infrastructures, where distributed sensing and edge intelligence enable adaptive traffic management. This paper investigates reinforcement learning (RL) as an edge-intelligent approach for adaptive traffic signal operation at a signalized urban intersection in Kuwait. A Proximal Policy Optimization (PPO)-based controller is developed to dynamically allocate green-phase durations using locally observed traffic states, without relying on future demand information or centralized coordination. The controller is evaluated in a realistic simulation environment informed by real-world hourly traffic volume data from Kuwait, and is compared against both conventional fixed-time control and a vehicle-actuated controller representing the current state of practice, using average vehicle delay, queue length, and emissions as performance metrics. Under nominal conditions, the proposed controller reduces average vehicle delay by 46% relative to fixed-time control and 34% relative to actuated control, while also lowering per-vehicle CO2 emissions by approximately 23%. These performance gains persist under demand perturbations of +/-15%, generalize from weekday to weekend traffic patterns, and are corroborated by a reward function ablation; low variance across five random seeds confirms their statistical reliability. These findings demonstrate the practicality of learning-based edge traffic signal control as a building block for IoT-enabled smart-city transportation systems, and as a deployable precursor toward fully connected, Internet of Vehicles (IoV)-based urban mobility.
中文摘要 城市交通拥堵在依赖汽车的城市中仍是一个长期存在的挑战，带来了巨大的经济和社会成本。交通信号系统越来越多地作为联网的信息物理组件部署在智慧城市基础设施中，分布式感知和边缘智能使自适应交通管理成为可能。本文研究将强化学习（RL）作为一种边缘智能方法，用于科威特一处信号控制城市路口的自适应交通信号运行。我们开发了基于近端策略优化（PPO）的控制器，利用本地观测到的交通状态动态分配绿灯相位时长，无需依赖未来需求信息或集中协调。该控制器在基于科威特真实逐小时交通量数据构建的逼真仿真环境中进行评估，并与传统固定配时控制以及代表当前实践水平的车辆感应控制器进行比较，以平均车辆延误、排队长度和排放量作为性能指标。在标称条件下，所提控制器的平均车辆延误相较固定配时控制降低46%，相较感应控制降低34%，同时每辆车的二氧化碳排放量降低约23%。这些性能提升在+/-15%的需求扰动下依然存在，可从工作日推广到周末的交通模式，并得到奖励函数消融实验的佐证；五个随机种子间的低方差证实了其统计可靠性。这些发现表明，基于学习的边缘交通信号控制可以作为物联网智慧城市交通系统的基础模块，并可作为迈向全面互联、基于车联网（IoV）的城市出行的可部署先导方案。

Zero-shot Transfer of Reinforcement Learning Control Policies for the Swing-Up and Stabilization of a Cart-Pole System

用于车杆系统起摆与稳定的强化学习控制策略零样本迁移

Authors: Nikki Xu, Hien Tran
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2606.22145
Pdf link: https://arxiv.org/pdf/2606.22145
Abstract Reinforcement learning (RL) is a powerful and convenient tool to modernize controller design. In this work, we study the zero-shot transfer of RL-based control policies from simulation to hardware for cart-pole swing-up and stabilization. The two policies are trained independently, and the handoff is implemented in Simulink via switching logic. We apply a first-order action smoothing filter to prevent hardware damage from high-frequency oscillatory actuation. Pairing this bandwidth-aware filtering with sensitivity-guided domain randomization (DR) and a simple linear curriculum learning (CL) schedule, we obtain a swing-up policy that in all of our experiments injects sufficient energy for handoff into the stabilizer's region of attraction. The stabilization policy rejects disturbances within the tested range, and the swing-up policy can re-engage after larger perturbations and restore the pendulum to the inverted position.
中文摘要 强化学习（RL）是一种强大且便捷的工具，可用于实现控制器设计的现代化。本文研究基于强化学习的控制策略从仿真到硬件的零样本迁移，用于车杆系统的起摆与稳定。两个策略分别独立训练，二者之间的交接通过切换逻辑在Simulink中实现。我们采用一阶动作平滑滤波器，以防止高频振荡驱动对硬件造成损害。将这种考虑带宽的滤波与灵敏度引导的域随机化（DR）以及简单的线性课程学习（CL）计划相结合，我们得到了一个起摆策略，它在我们的所有实验中都能注入足够的能量，使系统交接进入稳定器的吸引域。稳定策略能够抑制测试范围内的扰动；在更大的扰动之后，起摆策略可以重新接管，并将摆杆恢复到倒立位置。
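
A minimal Python sketch of the two simplest ingredients named in this abstract: a first-order action smoothing filter and a linear curriculum schedule. The function names, the smoothing coefficient `alpha`, and the parameter ranges are illustrative assumptions; the paper's actual filter constant, curriculum variable, and Simulink implementation are not given in the abstract.

```python
def smooth_actions(raw_actions, alpha, initial=0.0):
    """First-order low-pass filter: y[t] = alpha * u[t] + (1 - alpha) * y[t-1].

    alpha is a hypothetical smoothing coefficient in (0, 1]; smaller values
    attenuate high-frequency changes in the commanded action more strongly.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    filtered = []
    previous = initial
    for action in raw_actions:
        previous = alpha * action + (1.0 - alpha) * previous
        filtered.append(previous)
    return filtered


def linear_curriculum(step, total_steps, start, end):
    """Linearly interpolate a difficulty parameter from start to end."""
    fraction = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return start + fraction * (end - start)


if __name__ == "__main__":
    # A bang-bang command sequence is damped instead of jumping between +1 and -1.
    print(smooth_actions([1.0, -1.0, 1.0, -1.0], alpha=0.5))
    # Halfway through training the curriculum parameter sits halfway between its ends.
    print(linear_curriculum(50, 100, 0.0, 1.0))
```

With `alpha = 1.0` the filter passes actions through unchanged, so the same code path can be used to compare filtered and unfiltered policies.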

Meta-Reinforcement Learning via Evolution for Multi-Objective Combinatorial Supply Chain Optimisation

通过进化实现多目标组合供应链优化的元强化学习

Authors: Rifny Rachman, Bahrul Ilmi Nasution, Josh Tingey, Richard Allmendinger, Pradyumn Shukla, Wei Pan
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.22146
Pdf link: https://arxiv.org/pdf/2606.22146
Abstract Meta-reinforcement learning is a promising approach to multi-objective optimisation because it enables rapid policy adaptation across changing environments and preference settings. However, conventional few-shot methods usually fine-tune from a single shared meta-policy, which can reduce solution diversity and limit exploration of the Pareto front, especially in high-dimensional combinatorial problems such as supply chain optimisation. We propose a population-based meta-reinforcement learning framework that combines decomposition with evolutionary search in scalarisation weight space. The framework maintains a population of weight vectors, each associated with a distinct meta-policy trained through gradient-based meta-learning, and iteratively refines this population through elitist selection, crossover, and mutation guided by hypervolume and entropy contributions. We evaluate the method in a multi-objective supply chain setting with conflicting economic, environmental, and social goals, and further test its generality on standard reinforcement learning problems. The results show that the proposed approach yields more diverse, better distributed Pareto front approximations, improves cross-task adaptation, increases hypervolume by up to 32% over Meta-multi-objective reinforcement learning in the complex case, and attains the lowest average Hausdorff distance among all compared methods.
中文摘要 元强化学习是一种有前景的多目标优化方法，因为它能够在不断变化的环境和偏好设置下快速调整策略。然而，传统的少样本方法通常从单一的共享元策略进行微调，这可能降低解的多样性，限制对帕累托前沿的探索，尤其是在供应链优化等高维组合问题中。我们提出了一种基于种群的元强化学习框架，将分解与标量化权重空间中的进化搜索相结合。该框架维护一个权重向量种群，每个权重向量对应一个通过基于梯度的元学习训练得到的不同元策略，并通过精英选择、交叉以及由超体积和熵贡献引导的变异，迭代地优化该种群。我们在一个经济、环境和社会目标相互冲突的多目标供应链场景中评估该方法，并进一步在标准强化学习问题上检验其通用性。结果表明，所提方法得到了更多样化、分布更均匀的帕累托前沿近似，改善了跨任务适应性，在复杂情形下超体积比元多目标强化学习最多提升32%，并且在所有对比方法中取得了最低的平均豪斯多夫距离。

L20-Edu-135M: An Auditable Single-GPU Study of Data-Efficient Small Language Modeling

L20-Edu-135M：一项可审计的单GPU数据高效小语言建模研究

Authors: Yin Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.22189
Pdf link: https://arxiv.org/pdf/2606.22189
Abstract Small language models are cheap to serve and feasible on local hardware, but strong public 135M-class systems are commonly trained with hundreds of billions to trillions of tokens on large clusters. We study a sharply resource-constrained regime: a complete 134.5M-parameter language-model pipeline executed on one NVIDIA L20 GPU. The released checkpoint, L20-Edu-135M, receives approximately 13B pretraining tokens: 10B FineWeb-Edu tokens followed by a 3B-token educational, mathematics, code, and reasoning mixture. We document the architecture, data gates, cross-source MinHash/LSH near-deduplication, segment deduplication, benchmark-overlap removal, throughput optimization, supervised fine-tuning (SFT) with weight interpolation, and reinforcement learning from verifiable rewards (RLVR) on GSM8K. In a self-run zero-shot six-task harness, L20-Edu-135M obtains a mean score of 0.4150. It trails SmolLM-135M (0.4767) and SmolLM2-135M (0.4917), but its mean is 87.1% of SmolLM-135M's while its nominal token count is 2.17% as large. This ratio is descriptive, not evidence of statistical equivalence or a controlled scaling law. The model exceeds several older 100M-160M public baselines under the same harness. Direct GRPO-style RLVR decreases GSM8K exact-match accuracy from 1.82% to 1.59% (192-token completions) and 1.21% (320-token completions). These single-run results identify a concrete failure mode rather than establishing a general lower bound on RLVR. The contribution is an auditable resource-constrained case study, not a state-of-the-art claim.
中文摘要 小型语言模型部署成本低，且可在本地硬件上运行，但性能较强的公开135M级系统通常在大型集群上使用数千亿到数万亿个词元进行训练。我们研究一种资源极度受限的情形：在一块NVIDIA L20 GPU上执行完整的1.345亿参数语言模型流水线。发布的检查点L20-Edu-135M使用了约13B个预训练词元：先是10B个FineWeb-Edu词元，随后是3B词元的教育、数学、代码和推理混合数据。我们记录了模型架构、数据门控、跨来源的MinHash/LSH近似去重、片段去重、基准重叠去除、吞吐量优化、带权重插值的监督微调（SFT），以及在GSM8K上进行的基于可验证奖励的强化学习（RLVR）。在自行运行的零样本六任务评测框架中，L20-Edu-135M的平均得分为0.4150。它落后于SmolLM-135M（0.4767）和SmolLM2-135M（0.4917），但其平均得分为SmolLM-135M的87.1%，而名义词元数仅为后者的2.17%。该比率仅为描述性的，并不是统计等价性或受控缩放定律的证据。在同一评测框架下，该模型超过了若干较早的100M至160M公开基线。直接采用GRPO风格的RLVR使GSM8K的精确匹配准确率从1.82%降至1.59%（192词元补全）和1.21%（320词元补全）。这些单次运行的结果指出了一种具体的失效模式，而不是为RLVR建立一般性的下界。本文的贡献是一项可审计的资源受限案例研究，而不是对最先进性能的声明。
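
The headline ratios in this abstract can be re-derived from the numbers it reports. The short Python check below uses only figures quoted above; the variable names are illustrative, and the "implied reference token count" is a back-calculation from the 2.17% figure rather than a number stated in the abstract.

```python
# Mean scores reported in the abstract (self-run six-task harness).
l20_mean = 0.4150
smollm_mean = 0.4767

# Score ratio: L20-Edu-135M relative to SmolLM-135M.
mean_ratio = l20_mean / smollm_mean
print(f"score ratio: {mean_ratio:.1%}")  # prints "score ratio: 87.1%"

# Token budget: about 13B tokens, stated to be 2.17% of the reference count.
l20_tokens = 13e9
token_fraction = 0.0217
implied_reference_tokens = l20_tokens / token_fraction  # back-calculated, not quoted
print(f"implied reference tokens: {implied_reference_tokens / 1e9:.0f}B")

# GSM8K exact-match change after GRPO-style RLVR, in percentage points.
base_acc = 1.82
drops = {192: base_acc - 1.59, 320: base_acc - 1.21}
for completion_len, drop in drops.items():
    print(f"{completion_len}-token completions: -{drop:.2f} points")
```

As the abstract itself stresses, these ratios are descriptive only; the check confirms the arithmetic, not any scaling claim.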

FlowDPG: Deterministic Policy Gradient on Flow Matching Policies for Real-World Manipulation

FlowDPG：现实世界操作中流匹配策略的确定性策略梯度

Authors: Kexin Shi, Junyao Shi, Poorvi Hebbar, Zhuolun Zhao, Tarun Amarnath, Yifan Su, Shikhar Bahl, Deepak Pathak
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.22303
Pdf link: https://arxiv.org/pdf/2606.22303
Abstract Real-world reinforcement learning for robotic manipulation remains challenging, and this difficulty is amplified for flow matching policies: applying policy gradient methods to these policies is fundamentally limited by the need to backpropagate through time (BPTT) along the multi-step ODE that maps noise to actions, which is computationally prohibitive and numerically fragile. We propose FlowDPG, a DDPG-style method specifically designed for flow matching policies that distills the critic gradient into the velocity field at training time, bypassing BPTT entirely. Intuitively, FlowDPG combines two complementary vectors: the demonstration-driven velocity that keeps the action feasible, and the critic-driven correction that steers it toward higher value. Our contributions are threefold: (1) a BPTT-free distillation framework that enables stable DDPG-style policy improvement on flow matching policies, (2) a formal connection between the FlowDPG update direction and vanilla Deterministic Policy Gradient via three explicit approximations, and (3) real-world validation on a long-horizon, multi-stage, dual-arm AirPods assembly task, where FlowDPG attains a 92% end-to-end success rate, substantially outperforming recent RL methods spanning value-conditioning, auxiliary-module adaptation, and adjoint-based critic-gradient approaches. Videos and more results are provided on the project page this https URL.
中文摘要 面向机器人操作的真实世界强化学习仍然具有挑战性，而对于流匹配策略，这一难度进一步加大：将策略梯度方法应用于这类策略，从根本上受限于需要沿着把噪声映射为动作的多步常微分方程进行时间反向传播（BPTT），这在计算上代价过高，在数值上也很脆弱。我们提出了FlowDPG，这是一种专为流匹配策略设计的DDPG风格方法，在训练时将评论家梯度蒸馏到速度场中，完全绕过BPTT。直观上，FlowDPG结合了两个互补的向量：由示范驱动的速度使动作保持可行，由评论家驱动的修正则引导动作朝更高价值的方向移动。我们的贡献有三方面：（1）一个无需BPTT的蒸馏框架，使流匹配策略能够进行稳定的DDPG风格策略改进；（2）通过三个显式近似，在FlowDPG更新方向与原始确定性策略梯度之间建立形式化联系；（3）在一个长时程、多阶段的双臂AirPods组装任务上进行真实世界验证，FlowDPG取得了92%的端到端成功率，大幅优于近期涵盖价值条件化、辅助模块适配和基于伴随的评论家梯度等方法的强化学习方法。项目页面提供了视频和更多结果：this https URL。

Learning at the Right Pace: Adaptive Data Scheduling Improves LLM Reinforcement Learning

以正确的节奏学习：自适应数据调度提升LLM强化学习

Authors: Zicheng Xu, Ruixuan Zhang, Yu-Neng Chuang, Xiuyi Lou, Hoang Anh Duy Le, Oren Gal, Alexander S. Szalay, Zhaozhuo Xu, Guanchu Wang, Vladimir Braverman
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.22305
Pdf link: https://arxiv.org/pdf/2606.22305
Abstract Large Language Models (LLMs) achieve remarkable reasoning capabilities through reinforcement learning (RL) post-training. However, existing RL post-training commonly relies on uniform data sampling, which ignores the semantic structure of the training data and the changing capability of the training policy. To address these limitations, we propose Adaptive Data Scheduling (ADS), a dual-level data scheduling framework for pacing RL post-training that replaces uniform sampling with an adaptive distribution over semantic clusters and policy-boundary sample selection. At the cluster level, ADS organizes samples according to semantic patterns and maintains an adaptive inter-cluster distribution to solidify current training progress. At the sample level, ADS performs intra-cluster scheduling to continuously sample policy-boundary samples, which provides informative relative advantages. Experimental results across three LLMs and seven reasoning benchmarks demonstrate that ADS improves average accuracy by 5.2% over Group Relative Policy Optimization (GRPO). Notably, ADS consistently improves RL methods with different objective designs, highlighting its potential as a general data scheduling strategy for LLM RL post-training. The source code is available at: this https URL.
中文摘要 大型语言模型（LLMs）通过训练后强化学习（RL）实现了卓越的推理能力。然而，现有的强化学习后训练通常依赖于统一的数据采样，忽视了训练数据的语义结构和训练策略能力的变化。为解决这些局限性，我们提出了自适应数据调度（ADS），这是一种用于训练后强化学习节奏的双层数据调度框架，用语义聚类和策略边界样本选择的自适应分布取代了统一抽样。在集群层面，ADS根据语义模式组织样本，并保持自适应的簇间分布，以巩固当前的训练进展。在样本层面，ADS执行簇内调度，连续采样策略边界样本，这提供了信息性的相对优势。针对三个大型语言模型和七个推理基准的实验结果表明，ADS相比群体相对策略优化（GRPO）平均准确率提升了5.2%。值得注意的是，ADS持续改进具有不同目标设计的强化学习方法，凸显其作为LLM RL训练后通用数据调度策略的潜力。源代码可在以下 https URL 获取。
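
To make the phrase "informative relative advantages" concrete, here is a small Python sketch of GRPO-style group-relative advantages together with a pass-rate filter. Both functions are illustrative assumptions: the normalisation shown (population standard deviation plus a small epsilon) is one common variant, and the `low`/`high` thresholds are hypothetical, not the ADS selection rule, which the abstract does not specify.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise each rollout's reward by the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_boundary_sample(rewards, low=0.2, high=0.8):
    """Hypothetical filter: keep prompts the policy solves only some of the time."""
    pass_rate = mean(rewards)
    return low <= pass_rate <= high


if __name__ == "__main__":
    solved_always = [1, 1, 1, 1]   # every rollout correct
    solved_sometimes = [1, 0, 1, 0]
    # Identical rewards give zero advantage everywhere, so no learning signal.
    print(group_relative_advantages(solved_always))
    # Mixed outcomes give positive and negative advantages.
    print(group_relative_advantages(solved_sometimes))
    print(is_boundary_sample(solved_always), is_boundary_sample(solved_sometimes))
```

The sketch illustrates why uniformly sampled prompts that are always solved or never solved contribute nothing under a group-relative objective, which is one plausible reading of the motivation for sampling near the policy boundary.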

Curriculum Reinforcement Learning Can Incentivize Reasoning Capacity in LLMs Beyond the Base Model

课程强化学习可以激励LLM中超越基础模型的推理能力

Authors: Pengxiang Cai, Tianchen Fang, Xiaohan Li, Qingyuan Zeng, Guocong Li, Jintai Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.22317
Pdf link: https://arxiv.org/pdf/2606.22317
Abstract Reinforcement learning with verifiable rewards (RLVR) is widely viewed as a promising path toward continuously improving large language models. Recent works, however, suggest that mainstream RLVR often reallocates sampling probabilities among trajectories already present in the base model: it can improve sampling efficiency, reflected by higher pass@1 scores, but yields limited gains, and can even decrease pass@k scores when k is large, and therefore may fail to expand the base model's reasoning capacity boundary. In this paper, we present a boundary-aware Curriculum RL approach to move beyond the base model's reasoning capacity boundary. Our approach first uses pass@k sampling to locate the current reasoning capacity boundary, then applies targeted teacher guidance to examples near or beyond that boundary, and finally uses RL to consolidate the newly introduced reasoning patterns. Across Qwen, Llama, and DeepSeek base models, boundary-aware Curriculum RL improves both pass@1 scores and pass@256 scores, with pass@1 reflecting one-attempt performance and pass@256 serving as an empirical proxy for the reasoning capacity boundary. In our experiments, average pass@256 improves by 9.8 percentage points over the base models and by 10.3 percentage points over Vanilla RLVR. These results suggest that boundary-aware Curriculum RL can provide a scalable route for LLMs to continuously improve beyond the base model's empirical reasoning capacity boundary.
中文摘要 带有可验证奖励的强化学习（RLVR）被广泛认为是持续改进大型语言模型的有前景道路。然而，最新研究表明，主流RLVR常常在基础模型中已有的轨迹之间重新分配抽样概率：它可以提高抽样效率，体现在更高的pass@1分上，但收益有限，甚至在k较大时pass@k分数可能下降，因此可能无法扩展基础模型的推理能力边界。本文提出了一种边界感知型课程强化学习方法，以突破基础模型的推理能力边界。我们的方法首先使用pass@k抽样来定位当前的推理能力边界，然后对该边界附近或更外的例子应用针对性的教师指导，最后用强化学习来整合新引入的推理模式。在Qwen、Llama和DeepSeek基础模型中，边界感知课程RL提升了pass@1分数和pass@256分数，pass@1反映一次尝试表现，pass@256作为推理能力边界的实证代理。在我们的实验中，平均 pass@256 比基础模型提升了 9.8 个百分点，比 Vanilla RLVR 提高了 10.3 个百分点。这些结果表明，边界感知课程RL可以为LLM提供一条可扩展的路径，使其能够不断超越基础模型的经验推理能力边界。
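
For readers unfamiliar with the metric, pass@k is often computed with the unbiased combinatorial estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. The Python sketch below implements that standard estimator; whether this paper uses exactly this estimator or a different sampling protocol is an assumption, since the abstract does not say.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples is correct.

    n: samples drawn for the problem, c: how many were correct, k: budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # One correct answer among 256 samples: rare at k=1, certain at k=256.
    print(pass_at_k(256, 1, 1))
    print(pass_at_k(256, 1, 256))
```

The example shows how a problem can sit inside the pass@256 boundary while contributing almost nothing to pass@1, which is the gap the abstract's "reasoning capacity boundary" refers to.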

Select-to-Act: Hierarchical Reinforcement Learning via Adaptive Language Guidance

选择行动：通过自适应语言指导实现的层级强化学习

Authors: Hanping Zhang, Adam Koziak, Yuhong Guo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.22350
Pdf link: https://arxiv.org/pdf/2606.22350
Abstract Reinforcement Learning (RL) has been widely applied to sequential decision-making, yet it often suffers from poor sample efficiency due to costly interactions with the environment. A limited line of recent work has started exploring improving RL efficiency by leveraging external knowledge expressed in natural-language instructions. However, the few existing approaches typically treat the entire instruction as a single conditioning input, failing to account for the stage-dependent nature of language guidance, especially in complex environments. In this paper, we propose Hierarchical Reinforcement Learning with Language Instructions (HRLLI), a hierarchical RL framework that explicitly models natural-language instructions as dynamically selectable semantic guidance during decision-making. HRLLI decomposes instructions into a set of piecewise guidance elements, where each instruction piece may become relevant at different stages of interaction with the environment. A novel hierarchical RL policy structure is then formulated in a Select-to-Act paradigm: a high-level semantic policy acts as a guidance selector that selects the most relevant instruction piece to the current state to guide the low-level agent's decision, while a low-level policy executes environment actions conditioned on the selected guidance. The two-level policies are learned simultaneously to maximize augmented expected returns from interactions with the environment. This design enables the agent to adaptively ground language instructions into stage-specific decisions during interaction. Experiments on the instruction-intensive RTFM benchmark show that HRLLI consistently outperforms strong instruction-conditioned RL baselines, demonstrating that explicitly modeling adaptive instruction selection significantly improves the effectiveness of RL.
中文摘要 强化学习（RL）已被广泛应用于序贯决策，但由于与环境交互的代价高昂，其样本效率往往较低。近期有少量工作开始探索利用自然语言指令中表达的外部知识来提升强化学习的效率。然而，现有的少数方法通常将整条指令视为单一的条件输入，未能考虑语言指导随阶段变化的特性，在复杂环境中尤其如此。本文提出了带语言指令的层级强化学习（HRLLI），这是一种层级强化学习框架，将自然语言指令显式建模为决策过程中可动态选择的语义指导。HRLLI将指令分解为一组分段的指导元素，每个指令片段可能在与环境交互的不同阶段变得相关。随后，我们以Select-to-Act（先选择后行动）范式构建了一种新的层级强化学习策略结构：高层语义策略充当指导选择器，选择与当前状态最相关的指令片段来指导低层智能体的决策，而低层策略则在所选指导的条件下执行环境动作。两层策略同时学习，以最大化与环境交互所得的增广期望回报。这种设计使智能体能够在交互过程中自适应地将语言指令落实为特定阶段的决策。在指令密集的RTFM基准上的实验表明，HRLLI始终优于强大的指令条件强化学习基线，说明显式建模自适应指令选择能够显著提升强化学习的有效性。

Curvature-Adaptive Consistency Flow Matching: Autonomous Trajectory Optimization via Reinforcement Learning

曲率自适应一致性流匹配：通过强化学习实现自主轨迹优化

Authors: Songtao Tian, Guhan Chen, Bohan Li, Jingyi Ma, Zixiong Yu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.22394
Pdf link: https://arxiv.org/pdf/2606.22394
Abstract Consistency distillation has significantly accelerated the inference of diffusion models. In this work, we reveal an intriguing asymmetry: while Logit-Normal sampling priors are highly efficacious for standard iterative generation, consistency distillation exhibits a distinctly different difficulty profile (e.g., U-shaped). We identify that the primary optimization bottlenecks reside at the boundary stages (initialization or final refinement) rather than the intermediate steps. To address the limitations of static sampling in accommodating evolving learning requirements, we propose Curvature-Adaptive Consistency Flow Matching (CACFM). By formulating distillation as a dynamic decision process, CACFM employs a lightweight Reinforcement Learning agent to actively probe Probability Flow ODE trajectories, automatically constructing an efficiency-oriented curriculum that prioritizes critical regions without manual scheduling. Integrated with a novel Flow Distribution Matching Distillation (DMD) objective, our approach achieves new state-of-the-art results on large-scale models such as FLUX and SDXL. It effectively mitigates structural deformities and preserves high-frequency details in extreme few-step regimes, achieving unprecedented visual fidelity.
中文摘要 一致性蒸馏显著加速了扩散模型的推断。在本研究中，我们揭示了一个有趣的不对称性：虽然Logit-Normal采样先验对标准迭代生成非常有效，但一致性蒸馏表现出明显不同的难度轮廓（例如U形）。我们发现主要的优化瓶颈出现在边界阶段（初始化或最终精炼），而非中间步骤。为了解决静态采样在适应不断变化的学习需求中的局限性，我们提出了曲率自适应一致性流匹配（CACFM）。通过将蒸馏定义为动态决策过程，CACFM采用轻量级强化学习代理主动探测概率流常微分方程轨迹，自动构建以效率为导向的课程，优先处理关键区域，无需人工调度。结合一种新型流量分布匹配蒸馏（DMD）目标，我们的方法在FLUX和SDXL等大型模型上实现了新的先进成果。它有效减轻结构变形，并在极少数步骤的环境中保留高频细节，实现前所未有的视觉真实度。

SVGym (SciVerseGym): An Environment for Reinforcement Learning and Bayesian Optimization in Crystal Discovery

SVGym（SciVerseGym）：晶体发现中的强化学习与贝叶斯优化环境

Authors: Bin Cao
Subjects: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci)
Arxiv link: https://arxiv.org/abs/2606.22425
Pdf link: https://arxiv.org/pdf/2606.22425
Abstract Machine-learned interatomic potentials now enable efficient atomistic evaluation for interactive materials discovery, yet closed-loop crystal search methods remain fragmented across bespoke pipelines for editing, relaxation, scoring, constraints, and bookkeeping. We introduce SciVerseGym, a Gymnasium-compatible environment for sequential crystal discovery that frames crystal design as a Markov decision process. Agents observe an atomistic structure, apply chemically meaningful edits, and receive feedback from a configurable evaluator. SciVerseGym supports local and global actions, including elemental substitution, lattice perturbation, atomic displacement, vacancy creation, and atom insertion, along with configurable chemical spaces, structure pools, atomistic and graph-based observations, custom rewards, optional relaxation, and stability or phonon-related diagnostics. Each step applies an edit, evaluates the candidate using a machine-learned interatomic potential or any ASE-compatible calculator, and returns the standard (obs, reward, terminated, truncated, info) tuple. By decoupling agent logic from materials infrastructure, SciVerseGym provides an open, reproducible, and extensible testbed for reinforcement learning, Bayesian optimization, evolutionary search, and language-agent workflows in closed-loop crystal discovery. Code is available at: this https URL.
中文摘要 机器学习的原子间势能现在使得交互材料发现的原子学评估成为可能，但闭环晶体搜索方法仍分散在定制的流程中，用于编辑、放松、评分、约束和账务管理。我们介绍了SciVerseGym，一个兼容Gymnasium的顺序晶体发现环境，将晶体设计框定为马尔可夫决策过程。代理观察原子结构，进行化学意义上的编辑，并从可配置的评估器那里获得反馈。SciVerseGym支持局部和全局操作，包括元素替换、晶格扰动、原子位移、空位生成和原子插入，同时支持可配置的化学空间、结构池、原子和基于图的观察、自定义奖励、可选的放松以及稳定性或声子相关诊断。每一步应用编辑，使用机器学习的原子间势能或任何兼容ASE的计算器评估候选，并返回标准元组（obs、reward、terminated、truncated、info）元组。通过将智能体逻辑与材料基础设施解耦，SciVerseGym 提供了一个开放、可重复且可扩展的测试平台，用于强化学习、贝叶斯优化、进化搜索以及闭环晶体发现中的语言代理工作流。代码可在以下 https URL 获取。
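
The edit-evaluate-return loop described in this abstract can be pictured with a toy environment that follows the Gymnasium calling convention. The Python sketch below is a dependency-free stand-in: `ToyCrystalEnv`, `toy_evaluator`, the action format, and the reward definition are all hypothetical and are not the SciVerseGym API, which the abstract does not detail.

```python
class ToyCrystalEnv:
    """Toy stand-in with Gymnasium-style reset/step signatures (not SciVerseGym)."""

    def __init__(self, initial_species, evaluator, max_steps=10):
        self.initial_species = list(initial_species)
        self.evaluator = evaluator  # callable: list of species -> float score
        self.max_steps = max_steps

    def reset(self):
        self.species = list(self.initial_species)
        self.steps = 0
        self.score = self.evaluator(self.species)
        return tuple(self.species), {"score": self.score}

    def step(self, action):
        site, element = action            # elemental substitution at one site
        self.species[site] = element
        new_score = self.evaluator(self.species)
        reward = new_score - self.score   # reward is the change in score
        self.score = new_score
        self.steps += 1
        terminated = False                # this toy task has no terminal state
        truncated = self.steps >= self.max_steps
        return tuple(self.species), reward, terminated, truncated, {"score": new_score}


def toy_evaluator(species):
    """Hypothetical score counting 'O' sites; a real evaluator would call a calculator."""
    return float(sum(1 for s in species if s == "O"))


if __name__ == "__main__":
    env = ToyCrystalEnv(["Ti", "S", "S"], toy_evaluator, max_steps=2)
    obs, info = env.reset()
    obs, reward, terminated, truncated, info = env.step((1, "O"))
    print(obs, reward, terminated, truncated, info)
```

Keeping the evaluator as an injected callable mirrors the decoupling the abstract describes: the agent sees only observations and rewards, while the scoring backend can be swapped without touching agent code.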

Escaping the Variance Trap: Jacobian-Free Dynamics for Root-Finding Bilevel Optimization

逃离方差陷阱：求根双层优化的无雅可比动力学

Authors: Zhiyu Li, Xi Xuan, Davide Carbone
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.22433
Pdf link: https://arxiv.org/pdf/2606.22433
Abstract Many central machine learning tasks, from entropy tuning in reinforcement learning to equilibrating generative adversarial networks, are fundamentally stochastic root-finding problems rather than loss minimization. Yet, they are frequently forced into a minimization framework via squared residuals, introducing a critical flaw we identify as the Variance Trap. Standard bilevel minimization algorithms require estimating hypergradients involving implicit Jacobians; in stochastic settings, these terms act as noise amplifiers, destabilizing convergence. We formalize Root-Finding Bilevel Optimization (RF-BO) as a distinct problem class that bypasses this pathology. We propose a Jacobian-free solution using Two-Time-Scale Stochastic Approximation (TTSA) that updates directly along the root error, structurally avoiding variance amplification. We provide the first non-asymptotic convergence guarantees for TTSA in this setting under Markovian noise. Extensive experiments demonstrate the decisive advantage of this paradigm: compared to squared-residual and implicit-gradient baselines, our framework achieves a 2.6% top-1 accuracy gain in SimCLR, 17× faster convergence in non-linear ODE control where baselines fail, significantly improved entropy stability in reinforcement learning, and an 11.1% quality improvement in generative modeling.
中文摘要 许多核心机器学习任务，从强化学习中的熵调节到生成对抗网络的均衡化，本质上是随机求根问题，而不是损失最小化问题。然而，它们经常通过平方残差被强行纳入最小化框架，由此引入了一个关键缺陷，我们称之为方差陷阱。标准的双层最小化算法需要估计涉及隐式雅可比矩阵的超梯度；在随机环境下，这些项会充当噪声放大器，破坏收敛的稳定性。我们将求根双层优化（RF-BO）形式化为一个能够绕开这一病态的独立问题类别。我们提出了一种基于双时间尺度随机逼近（TTSA）的无雅可比解法，它直接沿根误差进行更新，从结构上避免方差放大。我们首次给出了TTSA在该设定下、马尔可夫噪声条件下的非渐近收敛保证。大量实验证明了这一范式的决定性优势：与平方残差和隐式梯度基线相比，我们的框架在SimCLR中实现了2.6%的top-1准确率提升，在基线方法失效的非线性常微分方程控制任务中收敛速度快17倍，在强化学习中显著提高了熵的稳定性，并在生成建模中取得了11.1%的质量提升。
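
A toy Python sketch of the general idea of two-time-scale stochastic approximation for root finding: a fast iterate tracks the inner solution while a slow iterate moves directly along the noisy root error, with no Jacobian or hypergradient. The problem (inner solution y*(x) = x, outer root condition y*(x) = target), the step-size exponents, and the noise level are all illustrative assumptions and are not taken from the paper.

```python
import random


def two_time_scale_root(target=2.0, steps=5000, noise_std=0.5, seed=0):
    """Toy TTSA: fast iterate y tracks y*(x) = x, slow iterate x seeks y*(x) = target."""
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    for t in range(1, steps + 1):
        beta = 1.0 / t ** 0.6   # fast step size (decays slowly)
        alpha = 1.0 / t ** 0.9  # slow step size (decays faster)
        # Fast time scale: noisy update of y towards the inner solution y*(x) = x.
        y += beta * (x - y + rng.gauss(0.0, noise_std))
        # Slow time scale: move x along the noisy root error; no Jacobian is formed.
        x -= alpha * (y - target + rng.gauss(0.0, noise_std))
    return x, y


if __name__ == "__main__":
    x, y = two_time_scale_root()
    print(f"x = {x:.3f}, y = {y:.3f}")  # both should end close to the target 2.0
```

In this sketch the update for `x` uses the residual itself rather than the gradient of its square, which is the structural difference the abstract contrasts with squared-residual minimisation.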

Distribution-Aware Robust Bilevel Optimization: Quantile-Guided Huber Updates in Two-Timescale Stochastic Approximation

分布感知的鲁棒双层优化：二时间尺度随机近似中的分位数引导胡伯更新

Authors: Zhiyu Li, Xi Xuan, Davide Carbone
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.22436
Pdf link: https://arxiv.org/pdf/2606.22436
Abstract Bilevel optimization (BLO) is fundamental to hierarchical decision-making but suffers from critical instability under heavy-tailed stochastic noise. Existing variance-reduction techniques typically rely on myopic magnitude checks, which fail to distinguish informative geometric signals from impulsive outliers. To resolve this, we propose RQ-TTSA (Robust Quantile-guided TTSA), a distribution-aware framework that leverages historical gradient buffers to estimate rolling quantiles for adaptive Huber-style clipping, effectively preserving local optimization geometry while strictly bounding effective variance. Theoretically, we provide a convergence analysis for quantile-guided TTSA under nonconvex-strongly convex assumptions with infinite-variance noise ($p \in (1,2]$), deriving a rate of $\mathcal{O}(T^{-\frac{p-1}{3p-2}})$ that recovers optimal dependence on the heavy-tailed parameter. Empirically, across six diverse tasks, spanning heterogeneous vision benchmarks, dynamic games under momentum poisoning, and offline reinforcement learning, RQ-TTSA consistently outperforms state-of-the-art baselines by eliminating divergence spikes and ensuring stable convergence. Our method demonstrates significant robustness to hyperparameter variations and incurs negligible computational overhead ($\approx 2.7\%$ increase), validating distribution-aware gradient control as a practical and necessary component for reliable bilevel learning.
中文摘要 双层优化（BLO）是层级决策的基础，但在重尾随机噪声下存在严重的不稳定性。现有的方差缩减技术通常依赖短视的幅值检验，无法区分有信息量的几何信号与脉冲式异常值。为解决这一问题，我们提出了RQ-TTSA（鲁棒分位数引导的TTSA），这是一个分布感知框架，利用历史梯度缓冲区估计滚动分位数，用于自适应的Huber式裁剪，在有效保持局部优化几何的同时严格限制有效方差。在理论上，我们在非凸-强凸假设和无限方差噪声（$p \in (1,2]$）下给出了分位数引导TTSA的收敛分析，推导出$\mathcal{O}(T^{-\frac{p-1}{3p-2}})$的收敛速率，恢复了对重尾参数的最优依赖。在实验上，在涵盖异构视觉基准、动量投毒下的动态博弈和离线强化学习的六个多样化任务中，RQ-TTSA通过消除发散尖峰并确保稳定收敛，始终优于最先进的基线。我们的方法对超参数变化表现出显著的鲁棒性，并且计算开销可以忽略不计（增加$\approx 2.7\%$），验证了分布感知的梯度控制是可靠双层学习中实用且必要的组成部分。
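
A minimal Python sketch of what "rolling quantiles for adaptive Huber-style clipping" could look like: a buffer of past gradient norms supplies a quantile threshold, and gradients above it are rescaled to that norm while smaller ones pass through unchanged. The class name, buffer size, quantile level, and the choice to store raw (unclipped) norms are illustrative assumptions, not the RQ-TTSA algorithm as specified in the paper.

```python
import math
from collections import deque


class QuantileClipper:
    """Clip gradient norms to a rolling quantile of previously seen norms."""

    def __init__(self, quantile=0.9, buffer_size=100):
        self.quantile = quantile
        self.norms = deque(maxlen=buffer_size)  # history of raw gradient norms

    def threshold(self):
        if not self.norms:
            return math.inf  # no history yet, so nothing is clipped
        ordered = sorted(self.norms)
        index = min(int(self.quantile * len(ordered)), len(ordered) - 1)
        return ordered[index]

    def clip(self, grad):
        norm = math.sqrt(sum(g * g for g in grad))
        tau = self.threshold()
        self.norms.append(norm)
        if norm <= tau:
            return list(grad)            # inside the threshold: identity
        scale = tau / norm               # outside: rescale to norm tau
        return [g * scale for g in grad]


if __name__ == "__main__":
    clipper = QuantileClipper(quantile=0.5, buffer_size=10)
    for n in [1.0, 2.0, 3.0, 4.0]:
        clipper.clip([n, 0.0])
    print(clipper.clip([100.0, 0.0]))  # an impulsive outlier is scaled down
    print(clipper.clip([0.5, 0.0]))    # a small gradient passes unchanged
```

Because the threshold adapts to the recent norm distribution, direction is always preserved and only the magnitude of outliers is bounded, which is the behaviour the abstract attributes to distribution-aware clipping.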

A Differentiable Atari VCS: A Complex, Fully Known Ground Truth for Explainable AI

可微分的雅达利VCS：面向可解释人工智能的复杂且完全已知的基准真值

Authors: Andreas Maier, Siming Bayer, Patrick Krauss
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.22447
Pdf link: https://arxiv.org/pdf/2606.22447
Abstract Explanation requires ground truth: to verify an account of a system we must know its inner functioning -- just what is missing where explainable AI (XAI) is most needed. Systems we can study fall into two camps. Simple, procedural ones -- decision trees, rule lists, sparse linear models -- have a known but trivial mechanism, so explaining them tests nothing; genuinely complex ones -- deep networks, real-world tasks -- need XAI but have no ground-truth inner functioning, so an explanation can be plausible, confident, and wrong with no way to tell. We remove this dichotomy with a study object both genuinely complex and fully specified -- inspectable by construction -- and, so gradient methods apply, fully differentiable. We reimplement the Atari 2600 Video Computer System (VCS) -- a real computer architecture, and the cradle of deep reinforcement learning -- as two independent end-to-end differentiable emulators in Julia (jutari) and JAX (jaxtari), each validated bit-for-bit against xitari. Both reproduce xitari on all 64 supported Arcade Learning Environment (ALE) games: 64/64 byte-identical RAM and 64/64 pixel-identical screens. Treating the cartridge ROM as a weight tensor, RAM as a soft tape, and control flow as gates, we prove the differentiable (soft) execution equals the original (hard) one bit-for-bit in the forward pass at any finite temperature, while exposing surrogate gradients where the bit logic has none. The JAX port also opens a GPU path: batched differentiable rollouts reach millions of environment-steps/s on one commodity GPU. The system was built in roughly 137 active hours over 29 calendar days, much of it written autonomously by coding agents. This paper builds and validates the foundation, showing -- theoretically and in a qualitative gradient study -- that gradient-based XAI on it is feasible. Both ports' full code is available under the MIT license at this https URL.
中文摘要 解释需要基准真值：要验证对一个系统的解释，我们必须了解其内部运作，而这恰恰是最需要可解释人工智能（XAI）的场合所缺失的。我们能够研究的系统分为两类。简单的、过程式的系统（决策树、规则列表、稀疏线性模型）具有已知但平凡的机制，因此解释它们检验不了任何东西；真正复杂的系统（深度网络、现实世界任务）需要XAI，却没有内部运作的基准真值，因此一个解释可能合理、自信却是错误的，而且无从判断。我们用一个既真正复杂又完全指定（按构造即可检查）的研究对象消除了这种二分法；为了能够应用梯度方法，它还是完全可微的。我们将Atari 2600视频计算机系统（VCS）重新实现为两个独立的端到端可微模拟器，分别用Julia（jutari）和JAX（jaxtari）编写，并各自与xitari进行了逐比特验证；VCS是一种真实的计算机架构，也是深度强化学习的摇篮。两者在全部64款受支持的街机学习环境（ALE）游戏上都复现了xitari：64/64的RAM逐字节一致，64/64的屏幕逐像素一致。将卡带ROM视为权重张量，将RAM视为软磁带，将控制流视为门控，我们证明了在任意有限温度下，可微（软）执行在前向传播中与原始（硬）执行逐比特相等，同时在比特逻辑没有梯度的地方给出替代梯度。JAX移植版还开辟了一条GPU路径：批量可微推演在一块消费级GPU上可达到每秒数百万环境步。该系统在29个日历日内用约137个活跃工时构建完成，其中大部分由编码智能体自主编写。本文构建并验证了这一基础，从理论上并通过定性的梯度研究表明，在其上进行基于梯度的XAI是可行的。两个移植版的完整代码均以MIT许可发布，地址为 this https URL。

Scalable Multi-Task Data Generation via Reinforcement Learning for Language-Conditioned Bimanual Dexterous Manipulation

通过强化学习实现语言条件化双手灵巧操作的可扩展多任务数据生成

Authors: Zechu Li, Yufeng Jin, Puze Liu, Jan Peters, Georgia Chalvatzaki
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.22471
Pdf link: https://arxiv.org/pdf/2606.22471
Abstract A key bottleneck in training generalist policies for bimanual dexterous manipulation is the lack of large-scale, high-quality datasets. Synthetic data generation in simulation provides a scalable alternative to human video demonstrations by overcoming challenges such as morphology mismatch, missing physical interactions, and the generation of robot actions. However, existing approaches based on human teleoperation offer limited task diversity, as object-centric trajectory matching often neglects the feasibility of robot execution. Reinforcement learning (RL) enables broader scalability but is often constrained by handcrafted, task-specific rewards. In this work, we propose a systematic RL-based data generation pipeline that integrates generalizable reward design, effective domain randomization, and language-conditioned task annotations. This pipeline synthesizes diverse, high-quality datasets for dexterous bimanual manipulation and enables training of language-conditioned multi-task policies. Our experiments show that the generated data significantly improves generalization across three representative manipulation tasks.
中文摘要 训练双手灵巧操作的通才策略时，一个关键瓶颈是缺乏大规模、高质量的数据集。仿真中的合成数据生成通过克服形态不匹配、缺失的物理互动以及机器人动作生成等挑战，提供了一种可扩展的替代方法，以替代人类视频演示。然而，基于人类远程操作的现有方法任务多样性有限，因为以对象为中心的轨迹匹配往往忽视了机器人执行的可行性。强化学习（RL）实现了更广泛的可扩展性，但通常受限于手工制作的任务特定奖励。本研究提出一套系统化的基于强化学习的数据生成流水线，整合了可推广奖励设计、有效领域随机化和语言条件任务注释。该流程综合了多样化且高质量的数据集，实现灵活的双手操作，并支持语言条件多任务策略的训练。我们的实验表明，生成的数据显著提升了三种代表性操作任务的泛化能力。

WebCQ: Cooperative Multi-Agent Deep Reinforcement Learning for Scalable Web GUI Testing

WebCQ：可扩展Web图形界面测试的协作式多智能体深度强化学习

Authors: Yujia Fan, Sinan Wang, Zebang Fei, Yao Qin, Huaxuan Li, Yepang Liu
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2606.22502
Pdf link: https://arxiv.org/pdf/2606.22502
Abstract Multi-agent reinforcement learning (MARL)-based techniques have shown promise for GUI testing. However, as the complexity of modern GUI software increases, existing MARL-based approaches (e.g., MARG and Fastbot) struggle to scale due to the inherent limitations of their underlying tabular reinforcement learning algorithms. This limits their applicability to large-scale commercial GUI software, especially web applications with vast state spaces and many interactive elements. To fill this gap, we propose WebCQ, a novel MARL-based approach for scalable web GUI testing. WebCQ incorporates QTRAN for multi-agent coordination and a lightweight synchronization mechanism, allowing it to work under asynchronous web testing scenarios. It extracts semantic and exploration features for each UI event to form an action vector. This vector is concatenated with the current state vector and fed into the policy network, enabling DQN-based decision making within a dynamic action space. We evaluated WebCQ on eight large-scale commercial websites. Under the same time budget and agent count, WebCQ explored 33.3% more states and executed 42.2% more unique actions than MARG, while triggering more failures on six of the eight websites under test. It also demonstrated strong scalability, maintaining higher action throughput during 20-hour experiments, and achieving greater performance improvements as the number of agents increased. These results show that WebCQ overcomes key limitations of existing MARL-based approaches, providing a scalable and effective solution for enhancing modern web GUI testing.
中文摘要 基于多智能体强化学习（MARL）的技术在GUI测试方面展现出潜力。然而，随着现代图形界面软件复杂度的增加，现有基于MARL的方法（如MARG和Fastbot）由于其底层表强化算法的固有局限性，难以扩展。这限制了它们在大规模商业图形界面软件中的适用范围，尤其是拥有庞大状态空间和大量交互元素的网络应用。为弥补这一空白，我们提出了WebCQ，一种基于MARL的新颖可扩展网络图形界面测试方法。WebCQ 集成了 QTRAN 用于多智能体协调和轻量级同步机制，使其能够在异步网页测试场景下工作。它为每个 UI 事件提取语义和探索特征，形成一个动作向量。该向量与当前状态向量连接，输入策略网络，使基于DQN的决策能够在动态动作空间内实现。我们在八个大型商业网站上评估了WebCQ。在相同时间预算和代理数量下，WebCQ探索的状态比MARG多33.3%，执行独立操作多42.2%，同时在八个测试网站中有六个触发更多失败。它还展现出强大的可扩展性，在20小时实验中保持更高的动作吞吐量，并且随着代理数量的增加，性能提升也更大。这些结果表明，WebCQ克服了现有基于MARL方法的关键局限，为提升现代Web GUI测试提供了可扩展且高效的解决方案。
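The WebCQ abstract describes scoring each UI event by concatenating its action vector with the current state vector, which lets a DQN-style agent handle a varying number of available actions. The sketch below illustrates that idea only; the linear scorer, the feature values, and the function names are assumptions made for illustration and are not taken from the paper.

```python
import random

def q_value(weights, state_vec, action_vec):
    """Score one (state, action) pair; a linear stand-in for the policy network."""
    features = state_vec + action_vec  # concatenate state and action features
    return sum(w * x for w, x in zip(weights, features))

def select_action(weights, state_vec, action_vecs, epsilon=0.1, rng=random):
    """Epsilon-greedy choice over however many UI events are currently available."""
    if rng.random() < epsilon:
        return rng.randrange(len(action_vecs))
    scores = [q_value(weights, state_vec, a) for a in action_vecs]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical features: 2 state dimensions, 2 dimensions per UI event.
state = [1.0, 0.0]
actions = [[0.2, 0.9], [0.8, 0.1], [0.5, 0.5]]
weights = [0.0, 0.0, 1.0, -1.0]  # length = len(state) + len(action vector)
best = select_action(weights, state, actions, epsilon=0.0)
```

Because each action is scored separately, the same scorer works whether a page exposes three events or three hundred.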

Imagine to Ensure Safety in Hierarchical Reinforcement Learning

想象一下，在层级强化学习中确保安全

Authors: Gregory Gorbov, Artem Latyshev, Aleksandr I. Panov
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.22509
Pdf link: https://arxiv.org/pdf/2606.22509
Abstract This work investigates the safe exploration problem in reinforcement learning, where an agent must maximize cumulative performance while simultaneously satisfying safety constraints. This challenge becomes even more pronounced in long-horizon tasks, where existing safe methods face fundamental limitations due to compounding estimation errors and restricted exploration capabilities. To address this problem, we propose a method that combines a learnable world model with two complementary policies, a high-level policy and a low-level policy, to promote safety at both hierarchical levels. The high-level policy generates intermediate subgoals that bias exploration toward safe regions, while the low-level policy uses imagined rollouts in the learned world model to reduce unsafe behaviors when reaching these subgoals. The proposed method was evaluated on challenging long-horizon navigation and manipulation tasks with high-dimensional action spaces, where it significantly outperforms existing Safe RL baselines in success rate and shows strong empirical constraint satisfaction, consistently meeting the prescribed safety budget across seeds, while prior approaches fail to effectively solve these complex long-horizon scenarios.
中文摘要 本研究探讨强化学习中的安全探索问题，即智能体必须在满足安全约束的同时最大化累计性能。这一挑战在长视野任务中尤为突出，现有安全方法因累计估计误差和有限的勘探能力而面临根本性限制。为解决这一问题，我们提出了一种方法，将可学习的世界模型与两种互补政策——高层政策和低层政策——结合起来，以促进两个层级的安全。高层策略生成中间子目标，使探索偏向安全区域，而低层策略则通过学习世界模型中的想象性展开，减少实现这些子目标时的不安全行为。该方法在具有高维动作空间的长视野导航和操作任务中进行了评估，在成功率和强烈的实证约束满足度上显著优于现有安全强化学习基线，始终符合各种子规定的安全预算，而以往方法未能有效解决这些复杂的长视野场景。
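The low-level mechanism in the abstract above, using imagined rollouts in a learned world model to avoid unsafe behavior on the way to a subgoal, can be illustrated with a toy filter over candidate plans. This is a sketch under stated assumptions: the 1-D dynamics, the cost function, and the selection rule are all hypothetical stand-ins, not the authors' method.

```python
def imagine_cost(world_model, cost_fn, state, actions):
    """Roll an action sequence through the model and sum the imagined safety cost."""
    total = 0.0
    for a in actions:
        state = world_model(state, a)
        total += cost_fn(state)
    return total, state

def pick_safe_plan(world_model, cost_fn, state, subgoal, candidates, budget):
    """Among plans whose imagined cost fits the budget, return the one ending nearest the subgoal."""
    best, best_dist = None, float("inf")
    for plan in candidates:
        cost, final = imagine_cost(world_model, cost_fn, state, plan)
        if cost <= budget and abs(final - subgoal) < best_dist:
            best, best_dist = plan, abs(final - subgoal)
    return best  # None when no candidate is imagined to be safe

# Toy 1-D example: the state is a position, with a hazard zone strictly between 2 and 3.
model = lambda s, a: s + a
cost = lambda s: 1.0 if 2.0 < s < 3.0 else 0.0
plans = [[1.0, 1.5, 1.5], [2.0, 2.0], [0.5, 0.5]]
chosen = pick_safe_plan(model, cost, 0.0, 4.0, plans, budget=0.0)
```

Here the first plan passes through the hazard zone in imagination and is rejected, so the second plan, which reaches the subgoal at zero imagined cost, is selected.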

PolicyTrim: Boosting Intrinsic Policy Efficiency of Vision-Language-Action Models

PolicyTrim：提升愿景-语言-行动模型的内在政策效率

Authors: Xianghui Wang, Feng Chen, Wenbo Zhang, Hua Yan, Zixuan Wang, Changsheng Li, Yinjie Lei
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.22540
Pdf link: https://arxiv.org/pdf/2606.22540
Abstract Vision-Language-Action (VLA) models provide a unified paradigm for robotic manipulation, yet their real-world deployment is often bottlenecked by execution efficiency. While existing efforts predominantly focus on compute-centric efficiency to reduce per-step inference latency, the intrinsic policy efficiency of these models remains largely unexplored. Policy efficiency is fundamentally affected by two factors, namely the effective executable length of predicted action chunks and the total physical steps required to complete a task. These two factors jointly determine the total number of forward inference calls during execution. We observe that current VLA policies struggle with planning unreliability and action redundancy, suffering from severe prediction degradation at the tail of action chunks and tending to generate unnecessarily redundant physical steps. To address this, we propose PolicyTrim, a reinforcement learning-based post-training framework that extends the reliable action chunk length and reduces redundant physical steps. For reliable chunk extension, we employ a dynamic exploration strategy that explicitly rewards the successful completion of longer executable lengths, progressively pushing the trustworthy prediction horizon to its empirical limit. For step efficiency, we design a redundancy-aware reward that directly favors successful task completions with fewer steps while penalizing unreproducible shortcuts, effectively eliminating redundant physical actions. Extensive experiments across three benchmarks and three VLA models demonstrate that PolicyTrim improves action chunk utilization by 3× and reduces physical execution steps by 51.4%. Ultimately, our framework delivers up to a 5.83× end-to-end deployment speedup without compromising task success rates.
中文摘要 视觉-语言-行动（VLA）模型为机器人操作提供了统一范式，但其实际部署常常被执行效率所限制。虽然现有努力主要聚焦于以计算为中心的效率以降低每步推理延迟，但这些模型内在的策略效率仍大多未被充分探索。策略效率从根本上受两个因素影响，即预测动作块的有效可执行长度和完成任务所需的总物理步骤。这两个因素共同决定了执行过程中的前向推理调用总数。我们观察到，当前VLA策略存在计划不可靠性和动作冗余的问题，在动作块尾部存在严重的预测退化，且容易产生不必要的冗余物理步骤。为此，我们提出了PolicyTrim，一个基于强化学习的后训练框架，它扩展了可靠的动作块长度并减少冗余的物理步骤。为了实现可靠的块扩展，我们采用动态探索策略，明确奖励成功完成更长可执行长度，逐步将可信预测视野推至经验极限。为了提高步骤效率，我们设计了一种冗余感知奖励，直接支持以更少步骤成功完成任务，同时惩罚不可重复的捷径，有效消除冗余的物理操作。在三个基准测试和三个VLA模型上的广泛实验表明，PolicyTrim能将动作块利用率提升3倍，物理执行步骤减少51.4%。最终，我们的框架实现了最高5.83倍的端到端部署加速，同时不影响任务成功率。
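The PolicyTrim abstract states that the executable chunk length and the total number of physical steps jointly determine how many forward inference calls an episode needs. A minimal arithmetic sketch of that relationship follows, assuming one inference call per executed chunk; the step counts and chunk lengths are made-up illustrative values, not results from the paper.

```python
def forward_calls(total_steps, executable_chunk_len):
    """Number of model calls if each call yields `executable_chunk_len` usable steps."""
    return -(-total_steps // executable_chunk_len)  # integer ceiling division

# Hypothetical episode: 200 physical steps, only 4 steps per chunk are trusted.
baseline_calls = forward_calls(total_steps=200, executable_chunk_len=4)
# Hypothetical post-training episode: fewer steps, longer trusted chunks.
trimmed_calls = forward_calls(total_steps=100, executable_chunk_len=12)
```

Shortening the episode and lengthening the trusted chunk both reduce the call count, and the two effects multiply.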

What are Key Factors for Updates in RL for LLM Reasoning?

强化学习中LLM推理更新的关键因素有哪些？

Authors: Peidong Wang, Demi Wang, Xufang Luo, Jiahang Xu, Xiaocui Yang, Shi Feng, Yuqing Yang, Dongsheng Li
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.22570
Pdf link: https://arxiv.org/pdf/2606.22570
Abstract Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a promising framework for enhancing the reasoning ability of large language models. However, much of the existing work is guided by heuristic intuition, leading to divergent algorithmic choices, even contradictory ones that nevertheless report empirical gains. To better understand this phenomenon, we conduct a theoretical analysis of RLVR updates. Our study reveals that differences in off-policy degree, determined by the number of gradient steps per rollout, substantially affect the distribution of importance sampling ratios and their clipping behavior, thereby altering which tokens dominate the update. Building on this insight, we characterize gradient expectation as the central quantity governing update dynamics and analyze the roles of token probability, advantage, and importance sampling ratio. Motivated by these findings, we propose Adaptive Clip Policy Optimization (ACPO), which adjusts clipping boundaries across token groups according to the empirical variance of their importance sampling ratios. Experiments on 3B and 7B models across diverse reasoning benchmarks, spanning mathematical problem solving, tabular QA, and logic puzzles, demonstrate that ACPO outperforms strong baselines such as DAPO and CISPO. These results demonstrate that principled, analysis-driven approaches yield more robust and effective RLVR methods. Code is available in: this https URL
中文摘要 可验证奖励强化学习（RLVR）已成为提升大型语言模型推理能力的有前景框架。然而，现有许多工作依赖启发式直觉，导致算法选择分歧，甚至有些自相矛盾但报告了实证收益。为了更好地理解这一现象，我们对RLVR更新进行了理论分析。我们的研究显示，非策略程度的差异，由每次推送的梯度数决定，会显著影响重要性采样比的分布及其裁剪行为，从而改变哪些代币主导更新。基于这一见解，我们将梯度期望描述为更新动态的核心量，并分析了代币概率、优势和重要性抽样比的作用。基于这些发现，我们提出了自适应剪辑策略优化（ACPO），该方法根据代币组的重要性抽样比的经验方差调整剪裁边界。在涵盖数学问题解决、表格质检和逻辑谜题等多种推理基准测试的3B和7B模型上的实验表明，ACPO优于DAPO和CISPO等强基线。这些结果表明，有原则、以分析为驱动的方法能够带来更稳健、更有效的RLVR方法。代码可用格式：这个 https URL
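To make the clipping discussion in the abstract above concrete, the sketch below pairs the standard PPO-style clipped surrogate with a per-group clip range that depends on the empirical variance of the group's importance sampling ratios. The abstract does not give ACPO's actual rule, so `adaptive_eps`, its functional form, and even the direction of the adjustment (wider range for higher variance) are assumptions chosen only to illustrate the mechanism.

```python
from statistics import pvariance

def clipped_surrogate(ratio, advantage, eps_low, eps_high):
    """PPO-style clipped objective for a single token."""
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped_ratio * advantage)

def adaptive_eps(group_ratios, base_eps=0.2, sensitivity=1.0):
    """Hypothetical rule: scale the clip range by the group's ratio variance."""
    return base_eps * (1.0 + sensitivity * pvariance(group_ratios))

def group_objective(ratios, advantages, base_eps=0.2):
    """Mean clipped surrogate over one token group, using that group's own clip range."""
    eps = adaptive_eps(ratios, base_eps)
    terms = [clipped_surrogate(r, a, eps, eps) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)
```

With a positive advantage, a ratio above `1 + eps_high` contributes a constant to the objective, so that token stops driving the update; changing the clip range per group therefore changes which tokens dominate.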

Stationary Robust Mean-Field Games under Model Mismatches

模型错配下的稳健平均场平稳博弈

Authors: Yue Wang
Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
Arxiv link: https://arxiv.org/abs/2606.22579
Pdf link: https://arxiv.org/pdf/2606.22579
Abstract Deploying multi-agent reinforcement learning (MARL) in the real world is often limited by model mismatches between the training simulators and the true environment, which could be further amplified through strategic interactions and result in severe performance degradation upon deployment. Distributional robustness offers a principled response by optimizing policies against worst-case transition models drawn from an uncertainty set, but standard robust MARL frameworks become increasingly intractable as the number of agents grows. This paper develops an infinite-horizon, stationary mean-field game framework that incorporates distributional model uncertainty directly into the population-coupled dynamics. We establish a robust dynamic programming principle with a contractive Bellman operator and prove the existence of a stationary robust mean-field equilibrium via a fixed-point argument. We further develop the first concrete algorithm with convergence guarantees. We then connect the mean-field solution to a finite-population robust game whose ambiguity sets depend on the empirical distribution, showing that the mean-field equilibrium policy induces approximate equilibrium behavior as the population size increases. Under a contractive robust-dynamics regime, we further obtain explicit non-asymptotic error bounds. Numerical experiments further illustrate the qualitative and quantitative impact of robustness under multiple uncertainty models, validating our theoretical findings.
中文摘要 在现实世界中部署多智能体强化学习（MARL）常常受限于训练模拟器与真实环境之间的模型不匹配，这种不匹配可能通过战略交互进一步放大，导致部署时性能严重下降。分布鲁棒性通过优化针对来自不确定性集的最坏情况过渡模型的策略，提供了一种原则性的回应，但随着主体数量的增加，标准的稳健MARL框架变得越来越难以处理。本文开发了一个无限视界、平稳均值场博弈框架，将分布模型不确定性直接纳入群体耦合动态中。我们建立了一个具有收缩贝尔曼算子的稳健动态规划原理，并通过不动点论证证明了稳健平均场平衡的存在。我们进一步开发了第一个具有收敛保证的具体算法。然后我们将平均场解与一个有限总体鲁棒博弈联系起来，其歧义集依赖于经验分布，表明平均场均衡策略在种群规模增加时会诱导近似均衡行为。在收缩稳健动力学范畴下，我们进一步获得显式的非渐近误差界限。数值实验进一步展示了鲁棒性在多不确定性模型下的定性和定量影响，验证了我们的理论发现。
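The robust dynamic programming principle mentioned in the abstract above can be illustrated with a tabular robust Bellman operator: maximize over actions, minimize over a set of candidate transition models, and iterate to a fixed point. This is a simplified sketch that assumes a finite uncertainty set and a fixed mean field (so rewards and transitions are plain tables); the paper's population-coupled setting and its algorithm are not reproduced here.

```python
def robust_bellman(V, rewards, models, gamma):
    """Apply the robust operator once: max over actions, min over candidate models."""
    n_states, n_actions = len(V), len(rewards[0])
    new_V = []
    for s in range(n_states):
        best = float("-inf")
        for a in range(n_actions):
            worst = min(
                sum(P[s][a][s2] * V[s2] for s2 in range(n_states))
                for P in models
            )
            best = max(best, rewards[s][a] + gamma * worst)
        new_V.append(best)
    return new_V

def robust_value_iteration(rewards, models, gamma=0.9, tol=1e-10):
    """Iterate the operator until successive value functions differ by less than tol."""
    V = [0.0] * len(rewards)
    while True:
        new_V = robust_bellman(V, rewards, models, gamma)
        if max(abs(x - y) for x, y in zip(V, new_V)) < tol:
            return new_V
        V = new_V

# Hypothetical 2-state, 2-action problem; P[s][a][s2] is a transition probability.
rewards = [[1.0, 0.0], [0.0, 2.0]]
nominal = [[[0.9, 0.1], [0.1, 0.9]], [[0.9, 0.1], [0.1, 0.9]]]
perturbed = [[[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]]]
V_nominal = robust_value_iteration(rewards, [nominal])
V_robust = robust_value_iteration(rewards, [nominal, perturbed])
```

The operator is a contraction with modulus `gamma` in the sup norm, which is why the loop terminates, and enlarging the uncertainty set can only lower the resulting values.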

On the Position Bias of On-Policy Distillation

关于政策提炼的立场偏见

Authors: Yan Xie, Sijie Zhu, Tiansheng Wen, Bo Chen, Yifei Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.22600
Pdf link: https://arxiv.org/pdf/2606.22600
Abstract On-Policy Distillation (OPD) improves the learning efficiency of standard reinforcement learning through dense, token-level supervision from teachers. In the standard KL objective of OPD, token-level losses are uniformly averaged, implying equal weights for all tokens. However, we discover that not all tokens are created equal: as student rollouts grow longer, they deviate further from the teacher's distribution, leading to degraded supervision quality at later positions. As a result, OPD using only the first 30% of tokens can perform comparably to using all tokens, whereas OPD using only the last 30% of tokens barely learns anything. In this work, we provide a principled understanding of this issue through the lens of constrained optimization. Based on these insights, we derive Importance-Weighted On-Policy Distillation (IW-OPD), in which the weight assigned to each token depends on the accumulated discrepancy between the student's and teacher's distributions, naturally upweighting earlier tokens and downweighting later ones with larger deviations. We show that IW-OPD converges significantly faster than OPD, with better learning efficiency, and achieves better final performance than standard OPD in both same-size and cross-scale settings, improving performance up to 6.9 points on AIME-2025.
中文摘要 策略提炼（OPD）通过教师密集的代币级监督，提高标准强化学习的学习效率。在标准的 KL 目标 OPD 中，代币层级损失被统一平均，意味着所有代币权重相等。然而，我们发现并非所有代币都一样：随着学生推广时间延长，代币与教师的分配偏差会进一步加深，导致后续岗位的督导质量下降。因此，OPD只用前30%的代币，表现和用所有代币相当，而OPD只用最后30%代币几乎学不到什么。在本研究中，我们通过受限优化的视角，对这一问题进行了原则性理解。基于这些见解，我们推导出了重要性加权策略提炼（IW-OPD），其中每个代币的权重取决于学生和教师分布之间的累积差异，自然地会对早期代币加权，后期代币则因偏差更大而降低权重。我们表明，IW-OPD收敛速度显著快于OPD，学习效率更高，且在相同规模和跨尺度环境中均优于标准OPD的最终性能，在AIME-2025中提升了高达6.9分。
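The abstract above says that IW-OPD weights each token by the accumulated discrepancy between student and teacher, so earlier tokens count more. The exact weighting is not given in the abstract; the sketch below assumes an exponential decay in the cumulative per-token KL seen before each position, with a hypothetical strength parameter `beta`, purely to show the shape of such a scheme.

```python
import math

def position_weights(token_kls, beta=1.0):
    """Hypothetical weights that decay with the discrepancy accumulated before each token."""
    weights, accumulated = [], 0.0
    for kl in token_kls:
        weights.append(math.exp(-beta * accumulated))
        accumulated += kl
    total = sum(weights)
    return [w / total for w in weights]

def weighted_loss(token_kls, beta=1.0):
    """Importance-weighted average of the per-token KL losses."""
    weights = position_weights(token_kls, beta)
    return sum(w * kl for w, kl in zip(weights, token_kls))

# Hypothetical per-token KL values that grow along the rollout.
kls = [0.1, 0.2, 0.4, 0.8]
w = position_weights(kls)
```

Setting `beta=0` gives uniform weights, which recovers the uniformly averaged objective the abstract attributes to standard OPD.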

Scalable Maximum Entropy Reinforcement Learning for Diffusion Policies via Adjoint Matching

通过伴随匹配实现扩散策略的可扩展最大熵强化学习

Authors: Serge Thilges, Onur Celik, Denis Blessing, Emiliyan Gospodinov, Gerhard Neumann
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.22630
Pdf link: https://arxiv.org/pdf/2606.22630
Abstract Diffusion policies have recently emerged as a powerful paradigm for representing complex action distributions in reinforcement learning (RL). However, their application to online RL remains limited by the challenge of scalable training in the absence of ground-truth data, where standard optimization techniques such as score matching are not directly applicable. In this work, we introduce a highly efficient algorithm for optimizing diffusion policies by leveraging recent advances in stochastic optimal control. Our approach is based on adjoint matching, which enables simulation-free training and circumvents the need for explicit likelihood estimation or costly backpropagation through the diffusion process. Furthermore, we propose several extensions that improve the robustness and stability of the method in practical settings. Empirical results demonstrate that our approach achieves competitive performance while significantly reducing computational overhead, making diffusion policies more viable for online RL scenarios.
中文摘要 扩散策略最近成为强化学习（RL）中复杂动作分布的有力范式。然而，在缺乏真实数据的情况下，其在在线强化学习中的应用仍受限于可扩展训练的挑战，标准优化技术如分数匹配无法直接应用。本研究通过利用随机最优控制的最新进展，提出了一种高效的扩散策略优化算法。我们的方法基于伴随匹配，实现无仿真训练，并绕过了显式似然估计或通过扩散过程进行昂贵的反向传播的需求。此外，我们提出了若干扩展方案，以提升该方法在实际环境中的稳健性和稳定性。实证结果表明，我们的方法在显著降低计算开销的同时实现了竞争性能，使扩散策略在在线强化学习场景中更具可行性。

A Markov Chain Approach to Preference Alignment

偏好比对的马尔可夫链方法

Authors: Takuya Koriyama, Tengyuan Liang
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2606.22652
Pdf link: https://arxiv.org/pdf/2606.22652
Abstract We propose Markov Chain from Human Feedback (MCHF), an elementary approach for aligning generative models from pairwise human preferences. Unlike Reinforcement Learning from Human Feedback (RLHF), which reduces comparisons to a scalar reward, and Nash Learning from Human Feedback (NLHF), which preserves pairwise utilities through a KL-regularized minimax optimization, MCHF uses pairwise preferences directly to define a transition mechanism over model outputs. Given a pairwise utility $U(x,y)$, which quantifies human preference for $y$ over $x$, and a reference probability distribution $\mu_{\mathsf{ref}}$, we define a Markov kernel $\mathsf{P}(x, dy)\propto \exp(U(x,y))\mu_{\mathsf{ref}}(dy)$, and take the Markov chain starting from $\mu_{\mathsf{ref}}$ as an iterative alignment procedure. We show that MCHF converges geometrically fast to the stationary distribution, with a convergence rate governed by the seminorm $\|U\|_\oplus=\inf_{g,f\in L^\infty(\mu_{\mathsf{ref}})}\|U-g\oplus f\|_\infty$, which quantifies the non-transitive structure of the pairwise utility. We further show that a mirror-descent algorithm for NLHF satisfies an analogous structure-adaptive convergence guarantee. Finally, through a perturbation analysis, we prove that when $\|U\|_\oplus$ is small, MCHF and NLHF agree up to first order around an RLHF solution, which yields a unified view of reward-based, game-theoretic, and Markovian approaches to alignment. In particular, for two natural algorithms that converge to the MCHF/NLHF equilibria, we show that the first step of MCHF and NLHF recovers the RLHF solution based on the column-sum reward $\hat{f}(y)=\int \mu_{\mathsf{ref}}(dx) U(x, y)$, and starting from the second iteration, both algorithms incorporate the same linear functional of the residual $U-(-\hat f)\oplus \hat f$, which captures the non-transitive structure of the pairwise utility $U$.
中文摘要 我们提出了人类反馈马尔可夫链（MCHF），这是一种基于成对人类偏好对齐生成模型的基本方法。与将比较简化为标量奖励的人类反馈强化学习（RLHF），以及通过KL正则化极小极大优化保留成对效用的人类反馈纳什学习（NLHF）不同，MCHF直接利用成对偏好来定义模型输出上的转移机制。给定成对效用 $U(x,y)$（量化人类对 $y$ 相较于 $x$ 的偏好）以及参考概率分布 $\mu_{\mathsf{ref}}$，我们定义马尔可夫核 $\mathsf{P}(x, dy)\propto \exp(U(x,y))\mu_{\mathsf{ref}}(dy)$，并将从 $\mu_{\mathsf{ref}}$ 出发的马尔可夫链作为迭代对齐过程。我们证明 MCHF 以几何速率收敛到平稳分布，收敛速率由半范数 $\|U\|_\oplus=\inf_{g,f\in L^\infty(\mu_{\mathsf{ref}})}\|U-g\oplus f\|_\infty$ 控制，该半范数量化了成对效用的非传递结构。我们进一步证明，NLHF的镜像下降算法满足类似的结构自适应收敛保证。最后，通过扰动分析，我们证明当 $\|U\|_\oplus$ 很小时，MCHF 和 NLHF 在 RLHF 解附近一阶一致，从而为基于奖励、博弈论和马尔可夫的对齐方法提供了统一视角。特别地，对于收敛到 MCHF/NLHF 均衡的两个自然算法，我们证明 MCHF 和 NLHF 的第一步恢复了基于列和奖励 $\hat{f}(y)=\int \mu_{\mathsf{ref}}(dx) U(x, y)$ 的 RLHF 解，并且从第二次迭代开始，两个算法都包含残差 $U-(-\hat f)\oplus \hat f$ 的同一线性泛函，该残差刻画了成对效用 $U$ 的非传递结构。
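The Markov kernel defined in the abstract above can be written out directly for a finite set of outputs. The sketch below builds the row-stochastic matrix with entries proportional to exp(U(x,y)) times the reference probability of y, then runs the chain starting from the reference distribution. The three-output example, the reward vector `f`, and all function names are illustrative assumptions. The example uses a purely transitive utility U(x,y) = f(y) - f(x); in that case the normalization over y cancels the x-dependent factor, so every row of the kernel is the same distribution and the chain is stationary after one step.

```python
import math

def mchf_kernel(U, mu_ref):
    """Row-stochastic matrix with P[x][y] proportional to exp(U[x][y]) * mu_ref[y]."""
    P = []
    for row in U:
        unnorm = [math.exp(u) * m for u, m in zip(row, mu_ref)]
        z = sum(unnorm)
        P.append([v / z for v in unnorm])
    return P

def step(dist, P):
    """Push a distribution over outputs through one transition of the chain."""
    n = len(P)
    return [sum(dist[x] * P[x][y] for x in range(n)) for y in range(n)]

def run_chain(U, mu_ref, n_steps):
    """Start from mu_ref and apply the kernel n_steps times."""
    P = mchf_kernel(U, mu_ref)
    dist = list(mu_ref)
    for _ in range(n_steps):
        dist = step(dist, P)
    return dist

# Hypothetical example with a transitive utility built from a scalar reward f.
mu_ref = [0.5, 0.3, 0.2]
f = [0.0, 1.0, 2.0]
U_transitive = [[f[y] - f[x] for y in range(3)] for x in range(3)]
one_step = run_chain(U_transitive, mu_ref, 1)
many_steps = run_chain(U_transitive, mu_ref, 50)
```

For this transitive case the one-step result is the reference distribution tilted by exp(f), the familiar form of a reward-based solution; a utility with a non-transitive component would make the rows differ and the chain take more steps to settle.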

GeoRouteNet: Geometry-Enhanced Non-Autoregressive Neural Solver for the Traveling Salesman Problem

GeoRouteNet：几何增强的非自回归神经求解器，解决旅行推销员问题

Authors: Xiang Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.22776
Pdf link: https://arxiv.org/pdf/2606.22776
Abstract The traveling salesman problem (TSP) is a canonical NP-hard combinatorial optimization benchmark that tests the representational capacity and generalization of neural solvers. While non-autoregressive (NAR) approaches offer parallel inference, they often lack sufficient geometric inductive bias and stable training signals, leading to degraded performance under cross-scale and cross-distribution shifts. We propose GeoRouteNet, a geometry-enhanced NAR neural solver for Euclidean TSP. On the model side, GeoRouteNet incorporates centered node features, learnable radial distance basis functions, distance-aware graph attention with explicit edge messaging, LayerNorm-SwiGLU feed-forward blocks, and cross-layer attentive residual mixing. On the training side, we design multi-candidate self-comparison reinforcement learning (MCS-RL), which samples multiple candidate tours per instance, constructs adaptive baselines from greedy and peer candidates, and adds winner-candidate guidance with annealed entropy regularization. On 10,000 random TSP50 instances, GeoRouteNet achieves a 0.32% optimality gap under Beam-1000 decoding. On TSP100, the gap is 1.26%. On 27 stratified TSPLIB EUC_2D instances, the overall gap drops from 17.12% (NAR4TSP reproduction) to 3.60%, while batch inference throughput substantially exceeds that of Concorde and LKH3. Ablation studies confirm that geometric structure enhancement and multi-candidate training are complementary: structure improvements dominate cross-distribution gains, while MCS-RL further stabilizes solution quality when paired with a strong geometric encoder.
中文摘要 旅行推销员问题（TSP）是一个典型的NP-困难组合优化基准测试，用于测试神经求解器的表征能力和泛化性。虽然非自回归（NAR）方法提供并行推断，但通常缺乏足够的几何归纳偏置和稳定的训练信号，导致在跨尺度和跨分布转移下性能下降。我们提出了GeoRouteNet，一款几何增强的NAR神经求解器，用于欧几里得TSP。在模型端，GeoRouteNet 集成了居中节点特性、可学习的径向距离基函数、带显式边缘消息的距离感知图关注、LayerNorm-SwiGLU 前馈块以及跨层关注残差混合。在训练方面，我们设计了多候选自比较强化学习（MCS-RL），该方法每个实例采样多个候选巡回，从贪婪和同伴候选中构建自适应基线，并通过退火熵正则化添加胜者-候选指导。在10,000个随机TSP50实例中，GeoRouteNet在Beam-1000解码下实现了0.32%的最优性差距。TSP100的差距为1.26%。在27个分层的 TSPLIB EUC_2D实例中，整体差距从17.12%（NAR4TSP重现）降至3.60%，而批次推理吞吐量则远超协和式和LKH3。消融研究证实几何结构增强与多候选训练是互补的：结构改进主导跨分布增益，而MCS-RL结合强几何编码器进一步稳定解答质量。
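Two quantities in the abstract above lend themselves to a short worked example: the optimality gap used to report results, and a baseline built from greedy and peer candidate tours. The gap formula below is the usual relative excess over a reference tour length. The baseline in `self_comparison_advantages` is a guess at one plausible form (an equal mix of the greedy length and the peer mean) and is not the paper's MCS-RL definition.

```python
import math

def tour_length(coords, tour):
    """Length of the closed Euclidean tour visiting coords in the given order."""
    n = len(tour)
    return sum(math.dist(coords[tour[i]], coords[tour[(i + 1) % n]]) for i in range(n))

def optimality_gap(length, reference_length):
    """Relative excess over the reference tour length, in percent."""
    return 100.0 * (length - reference_length) / reference_length

def self_comparison_advantages(candidate_lengths, greedy_length):
    """Hypothetical baseline: average of the greedy length and the peers' mean length."""
    peer_mean = sum(candidate_lengths) / len(candidate_lengths)
    baseline = 0.5 * (greedy_length + peer_mean)
    return [baseline - length for length in candidate_lengths]

# Unit square: the perimeter tour is optimal, the crossing tour is not.
coords = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
optimal = tour_length(coords, [0, 1, 2, 3])
crossing = tour_length(coords, [0, 2, 1, 3])
gap = optimality_gap(crossing, optimal)
advantages = self_comparison_advantages([optimal, crossing], greedy_length=optimal)
```

Candidates shorter than the baseline receive a positive advantage and longer ones a negative advantage, which is the signal a policy-gradient update would use.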

Policy-as-Data: Learning Generalizable HOI Diffusion Models from Simulated Physics

政策即数据：从模拟物理学中学习可推广的HOI扩散模型

Authors: Shujia Li, Jianshu Hu, Haiyu Zhang, Yunpeng Jiang, Haoyuan Jin, Xinyuan Chen, Yaohui Wang, Yutong Ban
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.22806
Pdf link: https://arxiv.org/pdf/2606.22806
Abstract Synthesizing realistic Human-Object Interactions (HOI) is critical for creating embodied avatars and functional virtual environments. However, current data-driven approaches primarily rely on motion capture datasets, which are expensive to scale and limited in functional diversity. Models trained with these datasets fail to generalize to unseen objects and to maintain physical consistency over long horizons. In this paper, we propose a novel framework that leverages a physics simulator to overcome the data-scarcity bottleneck in HOI generation. Specifically, we propose a scalable pipeline, called Policy-as-Data, which leverages policies trained with reinforcement learning in a physics simulator for task-oriented data generation and trains a generative model on the augmented dataset for generalizable HOI generation. To seamlessly utilize the synthetic data, we introduce a coarse-to-fine retargeting process that bridges the representation gap between the simplified model used in the physics simulator and the standard parametric body models required for generative training. Validated through comprehensive experiments, our method demonstrates enhanced generalization to unseen objects and the capability of long-horizon generation, while exhibiting greater dynamic diversity and physical plausibility.
中文摘要 综合逼真的人与物交互（HOI）对于创建具身化虚拟形象和功能性虚拟环境至关重要。然而，当前的数据驱动方法主要依赖动作捕捉数据集，而该数据集规模庞大且功能多样性有限。使用这些数据集训练的模型无法推广到看不见的物体，也无法在长时间内保持物理一致性。本文提出了一种新颖框架，利用物理模拟器克服HOI生成中的数据稀缺瓶颈。具体来说，我们提出了一个可扩展的流水线，称为\ours，利用在物理模拟器中强化学习训练的策略进行任务导向数据生成，并在增强数据集上训练生成模型以实现可推广的HOI。为了无缝利用合成数据，我们引入了从粗到细的重定向过程，弥合了物理模拟器中简化模型与生成训练所需的标准参数体模型之间的表示鸿沟。通过全面实验验证，我们的方法展示了对看不见天体的更强推广能力和长视界生成能力，同时展现出更大的动态多样性和物理可信度。

Active Inference as the Test-Time Scaling Law for Physical AI Agents

主动推断作为物理人工智能代理的测试时间尺度定律

Authors: Omar Hashash, Christo Kurisummoottil Thomas, Walid Saad, Merouane Debbah, Karl Friston, Adeel Razi
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.22813
Pdf link: https://arxiv.org/pdf/2606.22813
Abstract In this paper, a novel test-time scaling law for physical artificial intelligence (AI) agents is introduced. This scaling law enables physical AI agents to reason with their world models to generalize in unforeseen scenarios at test time. The derived scaling law is grounded in the first principle of active inference, which equips agents with the general objective to survive in the real world, under which their specific task objectives are subsumed. Active inference achieves this by providing the reasoning to resolve prediction errors that arise when the agent encounters unforeseen situations outside its training distribution, enabling generalization in non-stationary environments. The proposed scaling law captures this by dynamically updating the agent's policy with this reasoning at test time. This policy update is modeled as a soft Bayesian inference process in which beliefs about the policy are updated using the reasoning that reduces expected prediction errors under allowable policies as a likelihood. The resulting posterior policy admits a biological interpretation, recovering the scaling mechanism that engages the brain's basal ganglia and prefrontal cortex at test time. To solve this analytically intractable problem, a variational inference solution minimizing free energy bounds is developed. This solution extends to enable learning beyond training by reinforcing new instances, resolved at test time, in both the policy and world model. Unlike existing scaling laws constrained by model size and training data, the derived solution scales with the continuous real-world experience of a physical AI agent. Simulation results on an autonomous driving task demonstrate that the proposed solution outperforms model-free Q-learning and model-based Bayesian reinforcement learning, achieving robust generalization to unforeseen scenarios while improving inference efficiency by over 36%.
中文摘要 本文介绍了一种针对物理人工智能（AI）代理的新型测试时间尺度定律。这一缩放定律使物理人工智能智能体能够在测试时与其世界模型进行推理，在不可预见的场景中进行推广。推导的尺度律建立在主动推理的第一原则之上，该原则赋予代理在现实世界中生存的一般目标，其具体任务目标被纳入其中。主动推理通过提供推理来解决当智能体遇到训练分布外的不可预见情况时产生的预测误差，从而实现在非平稳环境中的泛化。拟议的缩放律通过在测试时动态更新代理策略，并基于此推理来捕捉这一点。该策略更新被建模为软贝叶斯推断过程，利用将允许策略下预期预测误差作为似然的推理来更新对策略的信念。由此产生的后置策略接受了生物学解释，恢复了测试时激活大脑基底节和前额叶皮层的尺度机制。为解决这一解析上难以解决的问题，开发了一种最小化自由能界限的变分推断解。该解决方案通过强化在策略模型和世界模型中测试时解决的新实例，实现超越训练的学习。与受模型规模和训练数据限制的现有缩放定律不同，得出的解会随着物理AI代理的持续真实体验而扩展。自动驾驶任务的模拟结果表明，所提出的方案优于无模型Q-学习和基于模型的贝叶斯强化学习，实现了对不可预见场景的稳健泛化，同时推理效率提升了36%以上。
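The abstract above models the test-time policy update as a soft Bayesian inference step in which expected prediction error acts as a likelihood over allowable policies. One common way to write such an update is a prior multiplied by an exponentiated negative error and then renormalized. The sketch below uses that form as an assumption; the `temperature` parameter and the example numbers are hypothetical, and the paper's variational solution is not reproduced.

```python
import math

def posterior_over_policies(prior, expected_errors, temperature=1.0):
    """Reweight prior policy beliefs by exp(-expected prediction error / temperature)."""
    unnorm = [p * math.exp(-e / temperature) for p, e in zip(prior, expected_errors)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Hypothetical case: three allowable policies with a uniform prior.
prior = [1.0 / 3.0] * 3
expected_errors = [1.0, 0.5, 2.0]
posterior = posterior_over_policies(prior, expected_errors)
```

Policies with lower expected prediction error gain probability mass, and when all errors are equal the posterior reduces to the prior.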

HiL-ResRL: A Model-Agnostic Finetuning Adapter via Human-in-the-loop Residual Reinforcement Learning

HiL-ResRL：通过人机循环残余强化学习的模型无关性微调适配器

Authors: Jingyi Liu, Zhaohong Mai, ShunSen He, Hang Ren, Chao Wang, Shunbo Zhou, XiaoDong Wu, Heng Zhang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.22860
Pdf link: https://arxiv.org/pdf/2606.22860
Abstract Recent advancements in generative imitation learning have significantly propelled the field of robotic manipulation. However, the majority of existing models rely heavily on Behavior Cloning (BC), a paradigm that suffers from compounding errors and distributional shift. Consequently, the efficacy of these models in practical industrial deployments remains limited. To address these challenges, we introduce a novel, plug-and-play fine-tuning pipeline designed to facilitate the robust deployment of Vision-Language-Action (VLA) models in real-world environments. In contrast to contemporary reinforcement learning (RL) fine-tuning strategies, which are often constrained by specific model architectures, our proposed framework is model-agnostic and adaptable to a diverse range of VLA models. We conceptualize VLA-generated actions as a unified interface, upon which we train a residual policy. This policy is designed to rectify suboptimal actions and address the distributional shift inherent in imitation learning. Additionally, we incorporate human-in-the-loop guidance to ensure safe exploration and maximize training efficiency. We conduct experiments directly in real-world robotic settings. The results demonstrate that within only 1.5 hours of real-world online RL training, the average success rate exceeds 95% on real robots. Our work presents a practical solution for deploying behavior cloning models in industrial scenarios.
中文摘要 生成式模仿学习的最新进展极大推动了机器人操作领域的发展。然而，大多数现有模型高度依赖行为克隆（BC），这一范式存在叠加错误和分布转移的问题。因此，这些模型在实际工业部署中的效果仍然有限。为应对这些挑战，我们引入了一种新型即插即用的微调流水线，旨在促进视觉-语言-行动（VLA）模型在现实环境中的稳健部署。与常受特定模型架构限制的现代强化学习（RL）微调策略不同，我们提出的框架是模型无关的，并可适应多种VLA模型。我们将VLA生成的动作概念化为统一接口，并在此基础上训练残余策略。该策略旨在纠正次优行为，并解决模仿学习中固有的分布性转变。此外，我们还采用人机指导，确保安全探索并最大化训练效率。我们直接在现实世界的机器人环境中进行实验。结果显示，在现实世界在线强化学习训练仅1.5小时内，真实机器人的平均成功率就超过了95%。我们的工作为在工业场景中部署行为克隆模型提供了实用解决方案。
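The residual-policy idea in the abstract above treats the action produced by the base VLA model as an interface and learns only a correction on top of it, with a human able to step in. A minimal sketch of how one control step might combine these three sources follows. The scaling factor, the clipping of the residual to [-1, 1], and the rule that a human action overrides everything are assumptions for illustration, not details from the paper.

```python
def compose_action(base_action, residual, residual_scale=0.1, human_action=None):
    """Return the command for one step: human override, else base action plus a bounded residual."""
    if human_action is not None:
        return list(human_action)  # an intervention takes priority over both policies
    bounded = [max(-1.0, min(1.0, r)) for r in residual]
    return [b + residual_scale * r for b, r in zip(base_action, bounded)]

# Hypothetical 2-D action from the frozen base model and a learned correction.
corrected = compose_action([0.5, -0.2], [1.0, -1.0])
overridden = compose_action([0.5, -0.2], [1.0, -1.0], human_action=[0.0, 0.0])
```

Because the residual only sees and adjusts the base model's output, the same adapter can in principle sit on top of different base models, which is the model-agnostic property the abstract emphasizes.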

SingGuard: A Policy-Adaptive Multimodal LLM Guardrail with Dynamic Reasoning

SingGuard：具有动态推理的策略自适应多模态LLM护栏

Authors: SingGuard Team
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.22873
Pdf link: https://arxiv.org/pdf/2606.22873
Abstract Vision-language models (VLMs) are increasingly deployed in consumer, medical, financial, and enterprise applications. This broad deployment expands the safety surface: risks can arise from multimodal question answering, assistant responses, and cross-modal composition, while moderation policies may vary across products, regions, and deployment stages. Most existing guardrails either rely on fixed taxonomies or target only a narrow set of interaction settings, which limits their adaptability when safety rules change at deployment time. We present SingGuard, a policy-adaptive multimodal guardrail model family for safety assessment in multimodal conversations. SingGuard treats the active policy as a runtime input: given natural-language rules, it checks the target content against the active policy rule by rule and predicts both the safety label and the triggered rule. To balance efficiency and interpretability, SingGuard supports fast, hybrid, and slow inference regimes along a fast-to-slow reasoning spectrum, ranging from direct safety judgments to policy-grounded deliberation. We further optimize this behavior with fast-slow decoupled reinforcement learning. We also introduce SingGuard-Bench, a multimodal guardrail benchmark with 56,340 examples spanning 80+ fine-grained risk types across multimodal QA, adversarial attack, and dynamic-rule evaluation settings, including cross-modal joint-risk cases where each modality is harmless in isolation but their composition implies unsafe intent. Across six benchmark families (35 datasets), SingGuard achieves state-of-the-art average F1 in every family. Dynamic-rule evaluation further shows improved policy-following accuracy from 0.6465 to 0.7415 under runtime policy shifts. Our code is available at this https URL.
中文摘要 视觉语言模型（VLM）正日益被应用于消费者、医疗、金融和企业应用领域。这种广泛的部署扩大了安全层面：多模态问答、助理响应和跨模态组合可能带来风险，而审核政策可能因产品、地区和部署阶段而异。大多数现有护栏要么依赖固定分类法，要么只针对一集狭窄的交互设置，这限制了它们在部署时安全规则变化时的适应性。我们介绍了 \textbf{SingGuard}，一套用于多模态对话安全评估的策略自适应多模态护栏模型家族。SingGuard将主动策略视为运行时输入：给定自然语言规则，逐条将目标内容与活跃策略规则对照，并预测安全标签和触发规则。为了平衡效率和可解释性，SingGuard支持快速、混合和慢速推理体系，涵盖从直接安全判断到基于政策的审议。我们进一步通过快慢解耦强化学习优化这一行为。我们还介绍了 \textbf{SingGuard-Bench}，这是一个多模态护栏基准测试，包含56{，}340个示例，涵盖80+细粒度风险类型，涵盖多模态质量保证、对抗攻击和动态规则评估环境，包括跨模态联合风险案例，其中每种模态单独无害但其组合暗示不安全意图。在六个基准家族（35个数据集）中，SingGuard在每个家族中都实现了最先进的平均F1。动态规则评估进一步显示，在运行时策略调整下，策略遵循准确率从 0.6465 提升至 0.7415。我们的代码可在此 https URL 访问。

Hierarchical Reinforcement Learning for Sparse-Reward Search in Commutative Algebra

交换代数中稀疏奖励搜索的层级强化学习

Authors: Giorgi Butbaia, Paul Orland, Coco Huang, Davide Passaro, Lucas Fagan, Michele Tarquini, Hailong Dao, David Eisenbud, Ali Shehper, Sergei Gukov
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Commutative Algebra (math.AC); Combinatorics (math.CO)
Arxiv link: https://arxiv.org/abs/2606.22922
Pdf link: https://arxiv.org/pdf/2606.22922
Abstract Applying machine learning techniques to solving long-standing mathematical conjectures can be particularly challenging due to their extreme reward sparsity. As an illustrative example, we consider Kalai's algebraic Hirsch conjecture and recast the construction of its counterexamples as a sparse-reward reinforcement learning problem on graphs. We propose a constrained options-based HRL framework with an equivariant graph neural network policy, which allows us to learn useful temporal abstractions for this task. We evaluate our approach over a wide range of degrees and demonstrate that it consistently outperforms classical RL algorithms as well as greedy search. By exploiting the hierarchical structure of the problem, we effectively provide a first-of-its-kind application of HRL to a problem in commutative algebra.
中文摘要 将机器学习技术应用于解决长期存在的数学猜想尤其具有挑战性，因为其奖励极度稀疏。作为一个说明性例子，我们考虑了Kalai的代数Hirsch猜想，并将其反例的构造重新表述为图上的稀疏奖励强化学习问题。我们提出了一个基于选项（options）的受约束HRL框架，采用等变图神经网络策略，使我们能够学习对该任务有用的时间抽象。我们在广泛的次数（degree）范围内评估了我们的方法，并证明其始终优于经典强化学习算法和贪婪搜索。通过利用问题的层级结构，我们实际上给出了HRL在交换代数问题上的首个应用。

EchoFlow: A Workload-Aware Parameter Tuning Method for Blockchain Systems

EchoFlow：一种区块链系统工作负载感知参数调优方法

Authors: Ben Lian, Linpeng Jia, Xing Chen, Xiaofeng Chen, Yi Sun
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2606.22934
Pdf link: https://arxiv.org/pdf/2606.22934
Abstract Blockchain systems expose a large number of tunable parameters that significantly influence system performance. However, in practice, a single parameter configuration is often applied across different workloads, leaving substantial unexploited performance potential. To address this, we propose EchoFlow, a blockchain parameter tuning framework that adaptively adjusts parameter configurations based on workload characteristics, enabling continuous performance optimization. EchoFlow employs a distributed reinforcement learning approach in which multiple actors perform parallel sampling to mitigate the substantial time required for sample generation in blockchain environments. To further accelerate convergence, we introduce a genetic algorithm during the initial phase of training to generate high-quality samples. Extensive experimental evaluations demonstrate that EchoFlow consistently outperforms existing methods across diverse workload scenarios while also reducing training time, highlighting its effectiveness and practical value.
中文摘要 区块链系统会暴露大量可调参数，这些参数对系统性能有显著影响。然而，在实际操作中，单一参数配置通常应用于不同的工作负载，导致性能潜力大而未被充分利用。为此，我们提出了EchoFlow，一种区块链参数调优框架，能够根据工作负载特性自适应调整参数配置，实现持续的性能优化。EchoFlow采用分布式强化学习方法，多参与者进行并行采样，以减少区块链环境中样本生成所需的大量时间。为了进一步加速收敛，我们在训练初期引入遗传算法以生成高质量样本。大量实验评估表明，EchoFlow在多种工作负载场景下持续优于现有方法，同时缩短训练时间，彰显其有效性和实用价值。

Provable Benefits of RLVR over SFT for Reasoning Models: Learning to Backtrack Efficiently

RLVR相较于SFT在推理模型上的可证明优势：学习高效回溯

Authors: Stanley Wei, Juno Kim
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.22938
Pdf link: https://arxiv.org/pdf/2606.22938
Abstract Recent advances in large language models (LLMs) have demonstrated that reinforcement fine-tuning of pretrained base models can lead to significant gains in reasoning performance at inference time. In this work, we theoretically analyze why reinforcement fine-tuning induces better reasoning ability than purely supervised fine-tuning (SFT) methods. We model chain-of-thought (CoT) reasoning as a pathfinding problem on graphs and compare the popular method of reinforcement learning with verifiable rewards (RLVR) against traditional SFT. We prove that SFT, when trained on golden shortest paths without negative examples, fails to learn how to efficiently backtrack. In contrast, an RLVR-trained model can learn how to efficiently backtrack from dead ends using only outcome reward. This leads to an exponential separation in inference-time compute between the two methods, and demonstrates that RLVR leads the model to learn the location of difficult decisions in a reasoning chain, ultimately allowing for better allocation of inference-time compute. Finally, we show that the reasoning traces of an RLVR model can be distilled to train a base model to backtrack efficiently as well.
中文摘要 大型语言模型（LLMs）的最新进展表明，对预训练基础模型进行强化微调，可以在推理时显著提升推理性能。在本研究中，我们理论上分析了为何强化微调比纯监督微调（SFT）方法能诱导更好的推理能力。我们将思维链（CoT）推理建模为图上的路径寻找问题，并将流行的可验证奖励强化学习（RLVR）方法与传统SFT进行比较。我们证明，当SFT在没有负面例子的黄金最短路径上训练时，无法学会高效回溯。相比之下，RLVR训练的模型可以仅用结果奖励学习如何高效地从死胡同回溯。这导致两种方法在推理时间计算中出现指数级分离，并证明RLVR引导模型学习推理链中难判的位置，从而更好地分配推理时间计算。最后，我们证明了RLVR模型的推理迹可以提炼出来，用以训练基础模型高效回溯。

Group-Graph Policy Optimization for Long-Horizon Agentic Reinforcement Learning

面向长程智能体强化学习的群图策略优化

Authors: Yunan Wang, Minghui Song, Zihan Zhang, Shaohan Huang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.22995
Pdf link: https://arxiv.org/pdf/2606.22995
Abstract Group-based Reinforcement Learning (RL) has significantly enhanced Large Language Models (LLMs) in agentic scenarios. To achieve finer-grained policy updates, recent agentic RL frameworks have shifted from trajectory-level to step-level training. However, long-horizon agentic RL suffers from severe reward sparsity and delay, as feedback is often deferred for dozens of interaction steps. While existing step-level frameworks refine training granularity, their credit assignment remains coarse-grained and still treats agent exploration as isolated, linear trajectories. This oversimplified perspective ignores the inherent graph structure of state transitions, leading to high-variance state-value estimation and myopic, localized credit assignment. To overcome these critical bottlenecks, we propose Group-Graph Policy Optimization (G2PO), a novel group-based RL algorithm tailored for multi-turn agentic tasks. G2PO explicitly transforms linear interaction trajectories into a global state-transition graph. By aggregating identical observations across different trajectories, we introduce group-aggregation state-value estimation that reduces sampling variance and trajectory-dependent bias. Furthermore, we redefine agent actions as transitions between state nodes and propose an edge-centric advantage estimation strategy. By globally standardizing Temporal Difference (TD) errors across the entire graph, G2PO explicitly identifies and prioritizes critical transitions that drive absolute task progress. Extensive experiments on representative long-horizon benchmarks (WebShop, ALFWorld, and AppWorld) demonstrate that G2PO substantially outperforms state-of-the-art prompt-based and RL baselines, achieving remarkable success rate improvements of up to 22.2% over GRPO.
中文摘要 基于群组的强化学习（RL）显著增强了大型语言模型（LLMs）在智能体场景中的表现。为了实现更细粒度的策略更新，近期的智能体强化学习框架已从轨迹级训练转向步级训练。然而，长程智能体强化学习存在严重的奖励稀疏和延迟问题，反馈常被推迟数十个交互步骤。虽然现有的步级框架细化了训练粒度，但其信用分配仍是粗粒度的，且仍将智能体探索视为孤立的线性轨迹。这种过于简化的观点忽视了状态转移固有的图结构，导致高方差的状态值估计和短视、局部化的信用分配。为克服这些关键瓶颈，我们提出了群图策略优化（G2PO），这是一种针对多轮智能体任务的新型基于群组的强化学习算法。G2PO显式地将线性交互轨迹转换为全局状态转移图。通过聚合不同轨迹中的相同观测，我们引入了群组聚合状态值估计，减少了采样方差和轨迹依赖偏差。此外，我们将智能体动作重新定义为状态节点之间的转移，并提出了以边为中心的优势估计策略。通过在整个图上全局标准化时序差分（TD）误差，G2PO显式识别并优先考虑推动绝对任务进展的关键转移。在代表性的长程基准（WebShop、ALFWorld和AppWorld）上的大量实验表明，G2PO大幅优于最先进的基于提示和强化学习的基线，成功率比GRPO最多提升22.2%。
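
The graph-based credit assignment described in this abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the exact estimators, so the sketch assumes that state values are the mean discounted return-to-go pooled over every visit to a state, that unseen (terminal) states have value 0, and that each edge's advantage is its mean TD error standardized over all edges. The names `discounted_returns`, `edge_advantages`, and the `gamma` parameter are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def discounted_returns(rewards, gamma):
    """Discounted return-to-go for every step of one trajectory."""
    out, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]


def edge_advantages(trajectories, gamma=0.9):
    """Illustrative edge-level advantages (assumed estimators, see lead-in).

    trajectories: list of trajectories, each a list of
    (state, reward, next_state) steps with hashable states.
    """
    # 1) Pool return-to-go samples by state across ALL trajectories,
    #    so identical observations share one value estimate.
    samples = defaultdict(list)
    for traj in trajectories:
        rtg = discounted_returns([r for _, r, _ in traj], gamma)
        for (state, _, _), g in zip(traj, rtg):
            samples[state].append(g)
    value = {s: mean(gs) for s, gs in samples.items()}

    # 2) TD error per transition, grouped by edge (state -> next_state).
    #    States never seen as a source are treated as terminal (value 0).
    td = defaultdict(list)
    for traj in trajectories:
        for state, reward, nxt in traj:
            td[(state, nxt)].append(
                reward + gamma * value.get(nxt, 0.0) - value[state]
            )
    edge_td = {edge: mean(errs) for edge, errs in td.items()}

    # 3) Standardize TD errors over the whole graph, not per trajectory.
    errors = list(edge_td.values())
    mu = mean(errors)
    sigma = pstdev(errors) or 1.0  # guard against zero spread
    return {edge: (d - mu) / sigma for edge, d in edge_td.items()}
```

With two trajectories that share a start state, for example `A -> B -> G` ending in reward 1 and `A -> C -> F` ending in reward 0, the shared state `A` gets a single pooled value, so the edge `A -> B` receives a positive advantage and `A -> C` a negative one, even though neither step carries an immediate reward.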

EvoRubrics: Dynamic Rubrics as Rewards via Adversarial Co-Evolution for LLM Reinforcement Learning

EvoRubrics：通过对抗协同进化将动态评分标准作为奖励，用于大型语言模型强化学习

Authors: Hongxin Ding, Baixiang Huang, Yue Fang, Weibin Liao, Zheng Li, Jinyang Zhang, Zhijing Wu, Junfeng Zhao, Yasha Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.23038
Pdf link: https://arxiv.org/pdf/2606.23038
Abstract Rubric-based rewards offer interpretable and fine-grained optimization signals for reinforcement learning in open-ended tasks where verifiable answers are unavailable. However, pre-constructed rubrics remain static throughout training, creating a fundamental mismatch with the evolving policy: fixed criteria gradually lose discriminative power as the model improves, leading to reward saturation and potential hacking. Recent dynamic rubric methods partially address this but rely on external frontier models or ground-truth answers, and update rubrics only at coarse granularity. We propose EvoRubrics, a co-evolutionary RL framework where a Policy LLM and a Rubric Generator jointly improve through adversarial interaction within each training step. As the policy improves under the rubric generator's guidance, the rubric generator adapts its criteria to remain discriminative and informative, enabling evaluation to track the policy in real time and naturally inducing an automatic curriculum. Experiments show that EvoRubrics consistently outperforms static and dynamic rubric baselines across benchmarks. The learned Rubric Generator further generalizes as a transferable reward model. Notably, even a fully self-supervised variant without any external supervision achieves meaningful gains, suggesting that co-evolution between generation and evaluation alone can provide sufficiently rich learning signals. Our code is publicly available at this https URL.
中文摘要 在没有可验证答案的开放式任务中，基于评分标准的奖励为强化学习提供了可解释且细粒度的优化信号。然而，预先构建的评分标准在整个训练过程中保持静态，这与不断演变的策略形成根本性不匹配：随着模型改进，固定标准逐渐失去判别力，导致奖励饱和和潜在的奖励作弊（reward hacking）。近期的动态评分标准方法部分解决了这个问题，但依赖外部前沿模型或真实答案，并且仅在粗粒度上更新评分标准。我们提出了EvoRubrics，一种协同进化的强化学习框架，其中策略LLM和评分标准生成器在每个训练步骤内通过对抗互动共同改进。随着策略在评分标准生成器的指导下不断改进，评分标准生成器会调整其标准以保持判别力和信息量，使评估能够实时跟踪策略，并自然地形成自动课程。实验显示，EvoRubrics在各基准上始终优于静态和动态评分标准基线。学到的评分标准生成器还可进一步泛化为可迁移的奖励模型。值得注意的是，即使是完全自监督、没有任何外部监督的变体，也能获得有意义的提升，表明仅靠生成与评估之间的协同进化就能提供足够丰富的学习信号。我们的代码在此 https URL 公开。

ReNIO: Reweighting Negative Trajectory Importance for LLM On-Policy Distillation

ReNIO：面向LLM在线策略蒸馏的负轨迹重要性重加权

Authors: Chen Lin, Kedi Chen, Wei Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.23104
Pdf link: https://arxiv.org/pdf/2606.23104
Abstract On-policy distillation (OPD) improves LLM reasoning by training a student model on its own generated outputs, but standard OPD treats all student-generated outputs (SGOs) equally regardless of their informativeness. We observe a consistent asymmetry in controlled filtering experiments: in both OPD and on-policy self-distillation (OPSD), training only on incorrect SGOs outperforms training only on correct ones. Our further analysis suggests that models trained on correct-only SGOs tend to generate shorter reasoning traces and show weaker reflection behavior, while incorrect SGOs better preserve exploratory reasoning near the model's capability boundary. To exploit this signal without requiring full answer-containing rollouts, we introduce ReNIO, which Reweights Negative trajectory Importance for LLM On-policy distillation. By using the student-to-teacher probability ratio, ReNIO identifies pivotal tokens leading to wrong reasoning traces and aggregates their information into a normalized sample weight, inherently assigning larger weights to likely negative trajectories without observing the correctness of the final answer. Since ReNIO only uses prefix-conditioned token probabilities, it preserves OPD's prefix training advantage over full-rollout reinforcement learning. Across both mathematical reasoning and code generation tasks, ReNIO improves both OPD and OPSD, with representative relative gains of up to 8.90% for Qwen3-1.7B and 10.00% for R1-Distill-Qwen-7B on mathematical reasoning benchmarks. Code repo: this https URL.
中文摘要 在线策略蒸馏（OPD）通过让学生模型在其自身生成的输出上训练来提升LLM推理能力，但标准OPD对所有学生生成输出（SGO）一视同仁，不考虑其信息量。我们在受控过滤实验中观察到一致的不对称性：无论是在OPD还是在线策略自蒸馏（OPSD）中，仅在错误SGO上训练的效果都优于仅在正确SGO上训练。我们的进一步分析表明，仅在正确SGO上训练的模型往往生成较短的推理轨迹，表现出较弱的反思行为，而错误的SGO能更好地保留模型能力边界附近的探索性推理。为了在不需要包含答案的完整rollout的情况下利用这一信号，我们提出了ReNIO，它为LLM在线策略蒸馏重新加权负轨迹的重要性。通过使用学生对教师的概率比，ReNIO识别导致错误推理轨迹的关键token，并将其信息聚合为归一化的样本权重，从而在不观察最终答案正确性的情况下，为可能的负轨迹赋予更大权重。由于ReNIO仅使用以前缀为条件的token概率，它保留了OPD相对于完整rollout强化学习的前缀训练优势。在数学推理和代码生成任务中，ReNIO都提升了OPD和OPSD，在数学推理基准上的代表性相对提升最高为Qwen3-1.7B的8.90%和R1-Distill-Qwen-7B的10.00%。代码仓库：此 https URL。

Asymmetric physics enables efficient learning in quadrupedal robot swarms

非对称物理使四足机器人群体中的学习更加高效

Authors: Yuang Zhang, Yunlong Song, Zhihao He, Zelin Ni, Kangyu Wang, Tianchi Liu, Yu Hu, Feng Yu, Danping Zou, Weiyao Lin
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.23153
Pdf link: https://arxiv.org/pdf/2606.23153
Abstract Animal collectives navigate cluttered environments through local coordination, yet robot swarms still struggle to reproduce this capability in the physical world. End-to-end learning offers a route to such coordination, but scaling it to embodied swarms remains difficult: standard sampling-based reinforcement learning becomes inefficient when visual perception, dense robot-robot interaction, and contact-rich locomotion must be learned together. Here we show that asymmetric physics enables efficient end-to-end learning of vision-based, decentralized control in large swarms of quadrupedal robots. During training, quadrupeds interact in shared environments, where a high-fidelity, non-differentiable simulator generates realistic motion and contact dynamics, and differentiable surrogate models provide gradients for navigation and locomotion policies. This separation enables up to 512 quadrupeds to learn coordinated navigation policies in obstacle-rich environments. At deployment, each robot acts from a single forward-facing depth camera, without explicit communication, centralized planning, or global maps. The policies generalize across forests, bridges, enclosures, narrow passages, and mazes, and zero-shot transfer to six physical quadrupeds across five real-world scenarios. The resulting swarms exhibit predictive avoidance, right-side yielding, pausing before bottlenecks, and wall following, showing that asymmetric physics enables efficient training of scalable decentralized control policies for quadrupedal robot swarms.
中文摘要 动物群体通过局部协调在杂乱环境中导航，但机器人集群在物理世界中仍难以复现这种能力。端到端学习提供了实现这种协调的途径，但将其扩展到具身集群仍然困难：当视觉感知、密集的机器人间交互以及接触丰富的运动必须一起学习时，标准的基于采样的强化学习会变得低效。本文表明，非对称物理能够使大规模四足机器人集群高效地端到端学习基于视觉的去中心化控制。在训练过程中，四足机器人在共享环境中交互，其中高保真、不可微的模拟器生成真实的运动和接触动力学，可微的替代模型则为导航和运动策略提供梯度。这种分离使多达512台四足机器人能够在障碍物密集的环境中学习协调的导航策略。部署时，每台机器人仅依靠单个前向深度相机行动，无需显式通信、集中规划或全局地图。这些策略可泛化到森林、桥梁、围栏、狭窄通道和迷宫，并零样本迁移到五个真实场景中的六台实体四足机器人上。由此产生的集群表现出预测性避让、右侧让行、在瓶颈前暂停和沿墙行进等行为，表明非对称物理能够高效训练面向四足机器人集群的可扩展去中心化控制策略。

CFPO: Counterfactual Policy Optimization for Multimodal Reasoning

CFPO：面向多模态推理的反事实策略优化

Authors: Zhangyuan Yu, Wanran Sun, Guangjing Yang, Xiaohu Wu, Qicheng Lao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.23206
Pdf link: https://arxiv.org/pdf/2606.23206
Abstract Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in multimodal reasoning. However, prevailing reinforcement learning (RL) paradigms lack explicit counterfactual enhancement and causal learning mechanisms. This fundamental deficiency results in severe grounding failures, manifesting as a tendency to ignore visual evidence in favor of language priors or exhibiting hallucination drift during long chain-of-thought reasoning. To address this root cause, we propose CounterFactual Policy Optimization (CFPO), a novel framework that enforces causal consistency between visual perception and textual reasoning. CFPO introduces a cross-modal counterfactual enhancement mechanism, which regularizes the policy by maximizing the discrepancy between the model's predictions and those from a counterfactual state where critical visual cues are suppressed. This approach seamlessly integrates with standard algorithms like GRPO and DAPO without requiring external reward models or additional supervision. Extensive experiments demonstrate that CFPO significantly improves reasoning fidelity, achieving consistent gains of 3.17%-6.25% over standard RL baselines and 1.32%-2.13% over the state-of-the-art perception-aware method (PAPO). Code is available at this https URL.
中文摘要 大型视觉语言模型（LVLM）在多模态推理方面展现了卓越的能力。然而，主流的强化学习（RL）范式缺乏显式的反事实增强和因果学习机制。这种根本缺陷导致严重的视觉定位（grounding）失败，表现为倾向于忽视视觉证据而偏好语言先验，或在长思维链推理中出现幻觉漂移。为解决这一根本原因，我们提出了反事实策略优化（CFPO），这是一种新颖的框架，用于强制视觉感知与文本推理之间的因果一致性。CFPO引入了一种跨模态反事实增强机制，通过最大化模型预测与反事实状态（关键视觉线索被抑制）下的预测之间的差异来正则化策略。该方法可与GRPO和DAPO等标准算法无缝集成，无需外部奖励模型或额外监督。大量实验表明，CFPO显著提升了推理保真度，较标准强化学习基线稳定提升3.17%-6.25%，较最先进的perception-aware方法（PAPO）提升1.32%-2.13%。代码可在此 https URL 访问。
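
The counterfactual regularizer described in this abstract can be illustrated with a toy calculation. The abstract does not say which discrepancy measure CFPO uses or how it is weighted, so the sketch below assumes a KL divergence between the next-token distribution computed with the full image and the one computed with critical visual cues suppressed, scaled by a hypothetical coefficient `beta`. The function names are illustrative only and are not taken from the paper or its code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def counterfactual_bonus(logits_full, logits_masked, beta=0.1):
    """Assumed regularizer: larger when suppressing visual cues changes
    the prediction more, i.e. when the model actually relies on the image."""
    p_full = softmax(logits_full)      # prediction with the real image
    p_masked = softmax(logits_masked)  # prediction with cues suppressed
    return beta * kl_divergence(p_full, p_masked)
```

Under these assumptions the bonus is zero when masking the image leaves the prediction unchanged (the model is ignoring visual evidence) and grows as the two predictions diverge, so adding it to a standard RL objective would reward visually grounded answers.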

Dynamic multi-agent deep reinforcement learning-based pricing and incentivization approach in multimodal transportation networks

多模式交通网络中基于动态多智能体深度强化学习的定价与激励方法

Authors: Khadidja Kadem, Mostafa Ameli, Carlos Lima Azevedo, Mahdi Zargayouna, Latifa Oukhellou
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2606.23257
Pdf link: https://arxiv.org/pdf/2606.23257
Abstract In multimodal transportation systems, shared mobility services (SMSs) are promoted for their potential to enhance flexibility and reduce congestion. However, SMS demand is often concentrated in high-density areas, which can limit the effectiveness and accessibility for various commuter groups. This uneven integration challenges transportation system efficiency, especially in terms of emissions and spatial equity. Addressing these issues requires coordination among multiple stakeholders whose objectives frequently conflict. Whereas authorities aim to ensure sustainable and equitable mobility, SMS providers focus on revenue maximization, and travelers seek to minimize personal travel costs. This paper proposes a multi-agent deep reinforcement learning framework that captures these interactions through dynamic pricing and incentivization strategies for SMSs and public transport. The framework integrates two reinforcement learning (RL) agents: (i) a public authority that allocates spatio-temporal public transport incentives to improve equity, emissions, and efficiency, and (ii) an SMS provider that dynamically adjusts fares to optimize revenue. The agents interact with the transportation system and adapt strategies in response to evolving demand, congestion, and network conditions. Numerical experiments conducted over a three-hour morning peak period show that dynamic incentivization effectively reduces congestion peaks, lowers commuters' costs by around 20% and emissions by approximately 10%, while nearly doubling public transport profit and supporting a more equitable distribution of benefits. When combined with dynamic SMS pricing, the two RL agents demonstrate the ability to balance conflicting objectives between private providers and public authorities. The proposed approach provides a decision-support tool for sustainable and equitable multimodal mobility planning.
中文摘要 在多模式交通系统中，共享出行服务（SMS）因其提升灵活性和减少拥堵的潜力而被推广。然而，SMS需求通常集中在高密度区域，这可能限制其对不同通勤群体的有效性和可及性。这种不均衡的整合给交通系统效率带来挑战，尤其是在排放和空间公平方面。解决这些问题需要多个利益相关者之间的协调，而他们的目标往往相互冲突：管理部门致力于确保可持续和公平的出行，SMS提供商注重收入最大化，出行者则力求降低个人出行成本。本文提出了一个多智能体深度强化学习框架，通过针对SMS和公共交通的动态定价和激励策略来刻画这些交互。该框架集成了两个强化学习（RL）智能体：（i）公共管理部门，负责分配时空公共交通激励，以改善公平性、排放和效率；（ii）SMS提供商，动态调整票价以优化收入。智能体与交通系统交互，并根据不断变化的需求、拥堵和网络状况调整策略。在三小时早高峰时段上进行的数值实验显示，动态激励能有效削减拥堵高峰，使通勤者成本降低约20%，排放降低约10%，同时使公共交通利润接近翻倍，并支持更公平的收益分配。与动态SMS定价相结合时，这两个RL智能体展现出在私营提供商与公共管理部门之间平衡冲突目标的能力。所提方法为可持续且公平的多模式出行规划提供了决策支持工具。

BoxCtrl: 3D-Aware Visual Prompting for Geometric Image Editing

BoxCtrl：三维感知的几何图像编辑视觉提示

Authors: Feifei Wang, Shiyuan Yang, Xiaoyu Li, Jing Liao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.23270
Pdf link: https://arxiv.org/pdf/2606.23270
Abstract As instruction-based editing models and multimodal large language models advance, diverse image editing tasks have become feasible. However, achieving precise and consistent geometric image editing, such as translating, scaling, and rotating in 3D space, remains a major challenge. In this work, we introduce BoxCtrl, a 3D-aware visual prompting framework. Unlike text-only or coarse 2D-guided approaches, our method introduces informative RGB 3D bounding boxes projected onto 2D images as visual prompts. The three orthogonal faces of each box are painted with distinct RGB colors, simultaneously encoding position, size, and orientation to provide a compact, intuitive in-context visual example. The key to BoxCtrl's success lies in these well-designed bounding boxes, which decouple geometric control from appearance control. This enables the model to learn consistent correspondences between faces of the same color in the latent space, leading to a precise understanding of geometric intentions and accurate editing results. We introduce a two-stage training paradigm: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL). To address paired data scarcity, we construct a large-scale synthetic dataset for SFT, equipping the model with fundamental editing capabilities. To bridge the synthetic-to-real domain gap, we incorporate an online RL stage leveraging unpaired real-world data. Guided by a reward function evaluating geometric accuracy and visual fidelity, our SFT-RL strategy significantly enhances geometric precision while maintaining photorealistic quality. Extensive experiments demonstrate that BoxCtrl achieves state-of-the-art performance across translation, rotation, scaling, and composite editing tasks.
中文摘要 随着基于指令的编辑模型和多模态大型语言模型的发展，多样化的图像编辑任务变得可行。然而，实现精确且一致的几何图像编辑，如在三维空间中进行平移、缩放和旋转，仍是一大挑战。在本研究中，我们介绍了BoxCtrl，一个3D感知的视觉提示框架。与纯文本或粗略的二维引导方法不同，我们的方法引入了将信息丰富的RGB 3D边界框投影到二维图像上，作为视觉提示。每个盒子的三个正交面都涂有不同的RGB颜色，同时编码位置、大小和方向，提供紧凑直观的上下文视觉示例。BoxCtrl成功的关键在于这些设计良好的边界框，它们将几何控制与外观控制分离。这使得模型能够学习潜空间中同色面之间的一致对应关系，从而精确理解几何意图并获得准确的编辑结果。我们引入了两阶段的训练范式：监督式微调（SFT），随后是强化学习（RL）。为解决配对数据稀缺性，我们构建了大规模的SFT合成数据集，赋予模型基础编辑能力。为了弥合合成到现实领域的差距，我们引入了一个在线强化学习阶段，利用未配对的真实世界数据。在奖励函数评估几何准确性和视觉真实度的指导下，我们的SFT-RL策略显著提升了几何精度，同时保持了照片级真实性。大量实验表明，BoxCtrl 在平移、旋转、缩放和复合编辑任务中实现了最先进的性能。

Causal Reward World Models: Zero-shot Reward Design for Automated Skill Generation

因果奖励世界模型：面向自动技能生成的零样本奖励设计

Authors: Yang Yang, Yuchuang Tong, Zhengtao Zhang, Xu Ding, Ning Yang, Yifan Zhang, Haipeng Li, Kehu Yang, Miao Xin
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.23280
Pdf link: https://arxiv.org/pdf/2606.23280
Abstract Automated Reward Design (ARD) aims to replace manual reward engineering in reinforcement learning with language-driven reward function synthesis. However, existing approaches based on large language models (LLMs) remain inherently correlation-driven, relying on iterative environmental feedback to refine reward hypotheses for each specific task. This paradigm not only results in inefficient reasoning but also makes LLMs susceptible to semantically plausible yet causally spurious reward components, leading to ineffective optimization. To address these limitations, we propose the Causal Reward World Model (CRWM), which explicitly models the causal topological relationships between candidate reward components and task-targeted physical variables through offline pre-training on multi-task interaction data. Based on a coarse-to-fine pre-training strategy, we introduce a joint optimization module that integrates Explicit Mechanism Decoupling with Confidence-Aware Soft Fusion to refine coarse structural priors using micro-level trajectories, thereby constructing a robust and interpretable causal skeleton. During inference, LLMs leverage CRWM as a task-irrelevant causal prior to constrain the reward generation, enabling zero-shot reward function design. Our work opens up a new white-box paradigm for the ARD problem. Extensive experiments on complex continuous control benchmarks demonstrate that CRWM generates executable reward functions without feedback-driven reward refinement, significantly reducing the design latency for acquiring new robotic skills while matching or surpassing state-of-the-art performance, and further exhibits strong generalization capabilities across unseen tasks and diverse robotic embodiments.
中文摘要 自动化奖励设计（ARD）旨在用语言驱动的奖励函数合成取代强化学习中的人工奖励工程。然而，基于大型语言模型（LLMs）的现有方法本质上仍是相关性驱动的，依赖迭代的环境反馈来为每个具体任务完善奖励假设。这种范式不仅导致推理效率低下，还使大型语言模型容易受到语义上合理但因果上虚假的奖励成分影响，导致优化无效。为解决这些局限性，我们提出了因果奖励世界模型（CRWM），该模型通过在多任务交互数据上进行离线预训练，显式建模候选奖励成分与任务目标物理变量之间的因果拓扑关系。基于由粗到细的预训练策略，我们引入了一个联合优化模块，将显式机制解耦与置信度感知软融合相结合，利用微观层面的轨迹精炼粗粒度结构先验，从而构建稳健且可解释的因果骨架。在推理过程中，LLM将CRWM用作与任务无关的因果先验来约束奖励生成，从而实现零样本奖励函数设计。我们的工作为ARD问题开辟了一个新的白盒范式。在复杂连续控制基准上的大量实验表明，CRWM无需反馈驱动的奖励细化即可生成可执行的奖励函数，显著降低了获取新机器人技能的设计延迟，同时达到或超越最先进的性能，并在未见任务和多样的机器人形态上展现出强大的泛化能力。

SQLConductor: Search-to-Policy Learning for Step-wise Text-to-SQL Orchestration

SQLConductor：面向分步Text-to-SQL编排的搜索到策略学习

Authors: Yizhang Zhu, Zhangyang Peng, Boyan Li, Yuyu Luo
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.23537
Pdf link: https://arxiv.org/pdf/2606.23537
Abstract Text-to-SQL enables users to access relational databases via natural language, but real-world settings remain challenging due to coordinated reasoning over complex database environments. Existing systems often use multi-stage pipelines or reasoning models specialized for individual stages. However, fixed pipelines rely on predefined stage orders, limiting their adaptivity to query demands and intermediate evidence. Recent orchestration-based methods provide flexibility by composing specialized modules for each query, but typical plan-then-execute approaches still commit to a complete workflow before execution and cannot adapt to intermediate artifacts and feedback. In this paper, we propose SQLConductor, a step-wise orchestration learning framework for Text-to-SQL. SQLConductor formulates Text-to-SQL subtasks as specialized actions for workflow composition and trains a policy model to select the next action based on intermediate artifacts and feedback. To learn this policy, SQLConductor introduces Search-to-Policy Learning, which uses Monte Carlo Tree Search to explore candidate workflows and stability estimation to identify robust supervision. The policy model is trained with Stability-weighted Supervised Fine-tuning to prioritize high-quality orchestration patterns and further enhanced through Curriculum Reinforcement Learning. This transforms offline workflow search into a deployable policy for step-wise orchestration at inference time. Experiments on BIRD-Dev and out-of-distribution datasets show that SQLConductor achieves superior execution accuracy and strong generalization, reaching 73.2% EX on BIRD-Dev with a compact orchestration policy coordinating frozen larger action models, outperforming prior methods that directly train comparable or larger Text-to-SQL backbones. Further analyses show that the learned policy adapts orchestration to diverse query demands.
中文摘要 Text-to-SQL使用户能够通过自然语言访问关系型数据库，但由于需要在复杂数据库环境中进行协调推理，真实场景依然具有挑战性。现有系统通常使用多阶段流水线或专门针对各个阶段的推理模型。然而，固定流水线依赖预定义的阶段顺序，限制了其对查询需求和中间证据的适应性。近期基于编排的方法通过为每个查询组合专用模块提供了灵活性，但典型的先规划后执行方法仍会在执行前确定完整的工作流，无法适应中间产物和反馈。本文提出了SQLConductor，一种用于Text-to-SQL的分步编排学习框架。SQLConductor 将Text-to-SQL子任务表述为用于工作流组合的专用动作，并训练策略模型，基于中间产物和反馈选择下一个动作。为了学习该策略，SQLConductor 引入了搜索到策略学习（Search-to-Policy Learning），它利用蒙特卡洛树搜索探索候选工作流，并利用稳定性估计识别稳健的监督信号。该策略模型通过稳定性加权监督微调进行训练，以优先学习高质量的编排模式，并通过课程强化学习进一步增强。这将离线工作流搜索转化为可在推理时进行分步编排的可部署策略。在BIRD-Dev和分布外数据集上的实验表明，SQLConductor实现了更优的执行准确率和强泛化能力，以一个紧凑的编排策略协调冻结的更大动作模型，在BIRD-Dev上达到73.2%的EX，优于以往直接训练同等或更大Text-to-SQL骨干模型的方法。进一步分析表明，学到的策略会根据多样化的查询需求调整编排。

VeriEvol: Scaling Multimodal Mathematical Reasoning via Verifiable Evol-Instruct

VeriEvol：通过可验证的Evol-Instruct扩展多模态数学推理

Authors: Haoling Li, Kai Zheng, Jie Wu, Can Xu, Qingfeng Sun, Han Hu, Yujiu Yang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.23543
Pdf link: https://arxiv.org/pdf/2606.23543
Abstract Scaling reinforcement learning for visual mathematical reasoning requires more than generating harder questions: as data volume grows, the reward labels themselves must remain reliable. Yet existing data pipelines scale supervision while trusting the labeller, and policy-side methods assume the underlying answers are already correct. We instead treat scaling as a verifiable data-construction problem and decouple two axes before any policy update: prompt difficulty, expanded by route-specific evolution operators, and answer reliability, enforced by offline hypothesis-test falsification. We instantiate this as VeriEvol, an iterative framework with two extensible components: a type-aware evolution module that rewrites low-difficulty image-question seeds into harder, image-grounded prompts; and HTV-Agent, a verifier that accepts an answer only after multi-source counter-evidence has failed to refute it. The resulting verified data scales in volume, extends by adding evolution routes or verifier channels, and plugs directly into existing GRPO-style RL recipes. On a five-benchmark visual-math suite, scaling evolved SFT data from 10K to 250K samples raises the mean accuracy from 35.42 to 54.73; then, with backbone, SFT initialization, and GRPO recipe held fixed, VeriEvol adds a cumulative +3.88 over an un-evolved RL baseline, of which +1.82 comes from evolved prompts and +2.06 from the HTV-Agent verifier. We release the prompts, data, models, code, and the full verifier trace of every sample, so that downstream work can scale and audit the pipeline rather than only inspect its outputs.
中文摘要 为视觉数学推理扩展强化学习需要的不仅仅是生成更难的问题：随着数据量的增长，奖励标签本身必须保持可靠。然而，现有的数据流水线在扩展监督的同时默认信任标注者，而策略侧方法则假设底层答案已经正确。我们转而将扩展视为一个可验证的数据构建问题，并在任何策略更新之前解耦两个维度：提示难度（由特定路径的演化算子扩展）和答案可靠性（由离线假设检验式证伪来保证）。我们将此实例化为VeriEvol，一个包含两个可扩展组件的迭代框架：一个类型感知的演化模块，将低难度的图像-问题种子重写为更难且基于图像的提示；以及HTV-Agent，一个只有在多源反证未能驳倒答案后才接受该答案的验证器。由此得到的已验证数据可在数量上扩展，可通过添加演化路径或验证通道进行扩充，并可直接接入现有的GRPO式强化学习方案。在包含五个基准的视觉数学套件上，将演化后的SFT数据从10K样本扩展到250K样本，平均准确率从35.42提升到54.73；随后，在骨干模型、SFT初始化和GRPO方案保持固定的情况下，VeriEvol相对未演化的强化学习基线累计提升+3.88，其中+1.82来自演化后的提示，+2.06来自HTV-Agent验证器。我们发布了提示词、数据、模型、代码以及每个样本的完整验证器轨迹，使下游工作能够扩展和审计整个流程，而不仅仅是检查其输出。

Decentralized Autonomous Traffic Management through Corridor Networks

通过走廊网络实现的去中心化自治交通管理

Authors: Jasmine Jerry Aloor, Aadarsh Govada, Hamsa Balakrishnan
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2606.23585
Pdf link: https://arxiv.org/pdf/2606.23585
Abstract As autonomous aircraft are introduced at scale and traffic density increases, centralized management becomes insufficient to coordinate the large numbers of crewed and uncrewed aircraft. Dedicated Advanced Air Mobility (AAM) corridors have therefore been proposed for organizing high-density autonomous traffic flows. The desire to scalably provide autonomous aircraft flexibility in trajectory planning motivates the development of decentralized approaches to traffic management in AAM corridors. In this work, we extend a multi-agent reinforcement learning (MARL) approach to address the challenge of decentralized traffic flow management in air corridor networks. We test policies trained in a single-corridor setting on increasingly complex multi-corridor networks with combinations of merges and splits in a zero-shot manner. Experimental results demonstrate that learned behaviors transfer well to scenarios with varying traffic density, network geometry, and heterogeneous vehicle performance, without needing centralized coordination or model retraining. We evaluate system-level performance in terms of conformance to corridor boundaries, completion rates, average speeds, distance traveled, and maintenance of inter-aircraft separation. We find that although our policies require only locally coordinated entry, traversal, and exit behaviors, they collectively produce desirable traffic flows through the corridor network.
中文摘要 随着自主飞行器大规模引入和交通密度增加，集中式管理已不足以协调大量有人驾驶和无人驾驶飞行器。因此，有人提出设立专用的先进空中交通（AAM）走廊，用于组织高密度自主交通流。以可扩展的方式为自主飞行器提供轨迹规划灵活性的需求，推动了AAM走廊中去中心化交通管理方法的发展。在本研究中，我们扩展了一种多智能体强化学习（MARL）方法，以应对空中走廊网络中去中心化交通流管理的挑战。我们以零样本方式，在包含汇合与分流组合、复杂度逐步增加的多走廊网络上测试在单走廊环境中训练的策略。实验结果表明，学到的行为能够很好地迁移到交通密度、网络几何形状和异构飞行器性能各不相同的场景中，无需集中协调或重新训练模型。我们从走廊边界符合度、完成率、平均速度、飞行距离以及飞行器间间隔保持等方面评估系统级性能。我们发现，尽管我们的策略只需要局部协调的进入、穿行和退出行为，它们整体上仍能在走廊网络中产生理想的交通流。

SPIRAL: Learning to Search and Aggregate

SPIRAL：学习搜索与聚合

Authors: Jubayer Ibn Hamid, Ifdita Hasan Orney, Michael Y. Li, Omar Shaikh, Yoonho Lee, Dorsa Sadigh, Chelsea Finn, Noah Goodman
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.23595
Pdf link: https://arxiv.org/pdf/2606.23595
Abstract Language model reasoning can be substantially improved at test time via scaffolds that scale inference compute across different primitives -- sequential reasoning within a trace, independently sampled parallel traces, and aggregation of multiple reasoning traces into a final response. During post-training, however, language models are optimized only for sequential reasoning within a single trace. We introduce Sequential-Parallel-Aggregative Reinforcement Learning (SPIRAL), a framework in which a language model is trained to use all three primitives, as part of a unified inference compute pipeline. Concretely, the language model first samples a set of independent traces in parallel, each produced through sequential chain-of-thought reasoning, and then generates a final aggregation trace conditioned on those traces; all components are optimized end-to-end against the reward of the final aggregated response. To train this system, SPIRAL uses set reinforcement learning to teach models to produce a set of traces that are collectively useful for an aggregator and standard reinforcement learning to teach models to aggregate the set into improved final responses. Our experiments on reasoning tasks show that SPIRAL effectively scales with inference compute, outperforming GRPO by up to 11$\times$ scaling efficiency and 15% higher performance when all three compute primitives are scaled.
中文摘要 语言模型的推理能力可以在测试时通过脚手架（scaffold）得到显著提升，这些脚手架沿不同原语扩展推理计算量：单条轨迹内的顺序推理、独立采样的并行轨迹，以及将多条推理轨迹聚合为最终响应。然而，在后训练阶段，语言模型仅针对单条轨迹内的顺序推理进行优化。我们提出了顺序-并行-聚合强化学习（SPIRAL），在该框架中，语言模型被训练为使用全部三种原语，作为统一推理计算流水线的一部分。具体来说，语言模型首先并行采样一组独立轨迹，每条轨迹通过顺序思维链推理产生，然后以这些轨迹为条件生成最终的聚合轨迹；所有组件都针对最终聚合响应的奖励进行端到端优化。为了训练该系统，SPIRAL使用集合强化学习来教模型生成一组对聚合器整体有用的轨迹，并使用标准强化学习来教模型将该集合聚合为更好的最终响应。我们在推理任务上的实验表明，SPIRAL能够随推理计算量有效扩展，在三种计算原语同时扩展时，在扩展效率上最高领先GRPO 11$\times$，性能高出15%。

dVLA-RL: Reinforcement Learning over Denoising Trajectories for Discrete Diffusion Vision-Language-Action Models

dVLA-RL：离散扩散视觉-语言-动作模型的去噪轨迹强化学习

Authors: Yuhao Wu, Yitian Liu, Weijie Shen, Mishuo Han, Wenjie Xu, Haotian Liang, Zhongshan Liu, Yinan Mao, Lei Xu, Xinping Guan, Ru Ying, Ran Zheng, Wei Sui, Xiaokang Yang, Wenbo Ding, Yao Mu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.23623
Pdf link: https://arxiv.org/pdf/2606.23623
Abstract Vision-Language-Action (VLA) models have established a powerful paradigm for generalist robotic manipulation by grounding control into the semantic reasoning of VLMs. Prevailing architectures typically model actions continuously via diffusion or flow processes, or discretely through either autoregressive generation or parallel decoding. Recently, Discrete Diffusion VLAs (dVLAs) have emerged as a distinct alternative, unifying vision, language, and action into a single discrete token space via masked generative modeling. Although dVLAs combine iterative refinement with unified representations, their training has thus far been restricted to Supervised Fine-Tuning (SFT), leaving the potential of Reinforcement Learning (RL) for further policy refinement largely unexplored. A fundamental challenge in RL for dVLAs is that the marginal probability of the final action generated by dVLAs remains intractable. To solve this problem, we propose dVLA-RL, shifting the learning objective from the marginal action probability to the joint probability of the sampled generation path. Specifically, by modeling the denoising process as a Markov Decision Process (MDP), we mathematically formulate this path probability as a product of step-wise transitions. This trajectory-level objective provides a unified formulation that natively accommodates variable denoising steps. Leveraging this intrinsic flexibility, we introduce a unified step scheduling approach for complex multi-task learning, tailoring denoising steps to specific task complexities to maximize both success rates and computational efficiency. Extensive evaluations demonstrate that our approach achieves a success rate of 99.7% on LIBERO. Furthermore, it establishes strong VLA-based results on RoboTwin 2.0 by delivering a 30.6% improvement over the SFT baseline, remaining competitive with strong World-Action Model baselines.
中文摘要 视觉-语言-动作（VLA）模型通过将控制建立在VLM的语义推理之上，为通用机器人操作建立了强大的范式。主流架构通常通过扩散或流过程对动作进行连续建模，或通过自回归生成或并行解码进行离散建模。最近，离散扩散VLA（dVLA）作为一种独特的替代方案出现，通过掩码生成建模将视觉、语言和动作统一到单一的离散token空间中。尽管dVLA将迭代细化与统一表示结合起来，但其训练迄今仅限于监督微调（SFT），强化学习（RL）在进一步策略改进方面的潜力基本尚未得到探索。在dVLA上应用强化学习的一个根本挑战是，dVLA生成的最终动作的边际概率难以计算。为解决这一问题，我们提出dVLA-RL，将学习目标从动作的边际概率转移到所采样生成路径的联合概率。具体来说，通过将去噪过程建模为马尔可夫决策过程（MDP），我们在数学上将该路径概率表述为逐步转移概率的乘积。这一轨迹级目标提供了统一的表述，原生支持可变的去噪步数。利用这一内在灵活性，我们为复杂的多任务学习引入了统一的步数调度方法，根据具体任务的复杂度调整去噪步数，以同时最大化成功率和计算效率。大量评估表明，我们的方法在LIBERO上实现了99.7%的成功率。此外，它在RoboTwin 2.0上取得了强劲的基于VLA的结果，相比SFT基线提升30.6%，与强大的世界-动作模型（World-Action Model）基线相比仍具竞争力。
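
Illustrative note on the dVLA-RL abstract: the step-wise factorization it describes can be written out as follows. The notation is an assumption made here for illustration and is not taken from the paper: $x_K$ is the fully masked start, $x_0$ the final action tokens, $o$ the observation and instruction, and $K$ the number of denoising steps, which may vary per sample.

```latex
p_\theta(x_{0:K-1} \mid x_K, o) = \prod_{k=1}^{K} p_\theta(x_{k-1} \mid x_k, o)
\quad\Longrightarrow\quad
\log p_\theta(x_{0:K-1} \mid x_K, o) = \sum_{k=1}^{K} \log p_\theta(x_{k-1} \mid x_k, o)
```

Each term on the right is a tractable per-step transition probability, whereas the marginal $p_\theta(x_0 \mid o)$ would require summing over all intermediate paths. How the paper turns this log-probability into a concrete RL update is not stated in the abstract.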

Learning Process Rewards via Success Visitation Matching for Efficient RL

通过成功访问匹配来学习过程奖励，实现高效的强化学习

Authors: Raymond Tsao, Andrew Wagenmaker, Sergey Levine
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2606.23640
Pdf link: https://arxiv.org/pdf/2606.23640
Abstract In many modern applications of reinforcement learning (RL), the natural reward for a task of interest is inherently sparse: a reward of 0 is given everywhere except when the task is completed, when a reward of +1 is given. Training a policy to maximize such a sparse reward requires solving a challenging credit assignment problem, leading to slow or ineffective RL improvement. We propose a simple approach to transform a sparse outcome reward into a dense process reward. Our approach relies on training a discriminator to distinguish between previous successful and unsuccessful episodes, and using this discriminator to incentivize the RL-learned policy to match the state-action visitations of successful episodes, while avoiding those of unsuccessful episodes. By incentivizing the policy to match the visitations over all states, not just those that correspond to task success, this reward provides dense feedback on whether progress is being made towards task completion, and, we show, provably achieves this without changing the optimal policy. Focusing on finetuning of robotic control policies, we demonstrate that our approach leads to significantly faster RL finetuning performance on both simulated and real-world manipulation tasks, as compared to simply maximizing the sparse outcome reward.
中文摘要 在强化学习（RL）的许多现代应用中，目标任务的自然奖励本质上是稀疏的：除任务完成时给予+1的奖励外，其他情况下奖励均为0。训练策略以最大化这种稀疏奖励需要解决困难的信用分配问题，导致强化学习的改进缓慢或无效。我们提出一种简单的方法，将稀疏的结果奖励转化为密集的过程奖励。我们的方法训练一个判别器来区分以往成功和失败的回合，并利用该判别器激励强化学习所学的策略去匹配成功回合的状态-动作访问分布，同时避开失败回合的访问分布。由于激励策略匹配的是所有状态上的访问分布，而不仅仅是对应任务成功的状态，这种奖励能够就任务是否正朝完成方向取得进展提供密集反馈，并且我们证明，它可以在不改变最优策略的前提下做到这一点。我们聚焦于机器人控制策略的微调，证明与单纯最大化稀疏结果奖励相比，我们的方法在仿真和真实世界的操作任务上都能显著加快强化学习微调。
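
Illustrative sketch for the abstract above: a discriminator trained on past successful versus unsuccessful visitations can yield a dense per-step score. Everything concrete here is an assumption for illustration and not the paper's method: the scalar "progress" feature, the logistic-regression discriminator, and the log-odds reward form log D/(1-D), which is borrowed from adversarial imitation learning.

```python
import math

def train_discriminator(success, failure, steps=500, lr=0.5):
    """Fit D(x) = sigmoid(w*x + b) by gradient descent; label 1 = from a successful episode."""
    w, b = 0.0, 0.0
    data = [(x, 1.0) for x in success] + [(x, 0.0) for x in failure]
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x  # gradient of the logistic loss w.r.t. w
            grad_b += (p - y)      # gradient of the logistic loss w.r.t. b
        w -= lr * grad_w / len(data)
        b -= lr * grad_b / len(data)
    return w, b

def process_reward(x, w, b):
    """Dense reward as the discriminator log-odds: log D(x) - log(1 - D(x)) = w*x + b."""
    return w * x + b

# Hypothetical 1-D features of visited states (e.g. a normalized progress measure).
success_visits = [0.7, 0.8, 0.9, 1.0]
failure_visits = [0.0, 0.1, 0.2, 0.3]
w, b = train_discriminator(success_visits, failure_visits)
```

Under these assumptions, states that resemble those visited in successful episodes receive a higher score than states typical of failures, even when the sparse task reward is still 0 at both.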

AIR: Adaptive Interleaved Reasoning with Code in MLLMs

AIR：多模态大语言模型中结合代码的自适应交错推理

Authors: Cong Han, Xiaohan Lan, Haibo Qiu, Yujie Zhong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.23678
Pdf link: https://arxiv.org/pdf/2606.23678
Abstract Following the paradigm shift initiated by OpenAI o3, interleaved reasoning with code to enhance multimodal large language models (MLLMs) has become a pivotal research frontier. The existing literature focuses primarily on tool-use within vision-perception tasks. However, such approaches typically rely on predefined heuristics for visual manipulation and are inherently incapable of addressing numerical computation problems due to their exclusive focus on visual operations. This paper empowers MLLMs with adaptive interleaved reasoning capabilities through extended reinforcement learning training on code-augmented complex numerical computation tasks. To this end, we propose a comprehensive three-component solution consisting of: a two-stage cold-start data construction pipeline, data filtering strategies for RL dataset curation, and an adaptive tool-invocation strategy leveraging a group-constrained reward function for interleaved reasoning trajectories. Extensive experiments demonstrate that after Reinforcement Learning training with the group-constrained reward function, performance improves by an average of 6.1 percentage points (pp) on evaluation benchmarks. Specifically, the accuracy for interleaved reasoning samples increases by 9.9 pp, and the overall success rate of tool-use exceeds 95%. Our data and code are available at: this https URL.
中文摘要 继OpenAI o3引发的范式转变之后，利用与代码交错的推理来增强多模态大语言模型（MLLM）已成为关键的研究前沿。现有文献主要关注视觉感知任务中的工具使用。然而，这类方法通常依赖预定义的启发式规则进行视觉操作，且由于只关注视觉操作，本质上无法解决数值计算问题。本文通过在代码增强的复杂数值计算任务上进行扩展的强化学习训练，赋予MLLM自适应交错推理能力。为此，我们提出了一个由三部分组成的完整方案：两阶段冷启动数据构建流程、用于强化学习数据集整理的数据过滤策略，以及利用组约束奖励函数处理交错推理轨迹的自适应工具调用策略。大量实验表明，在使用组约束奖励函数进行强化学习训练后，评估基准上的性能平均提升6.1个百分点（pp）。具体来说，交错推理样本的准确率提高了9.9 pp，工具使用的总体成功率超过95%。我们的数据和代码可通过此 https URL 获取。

CoorDex: Coordinating Body and Hand Priors for Continuous Dexterous Humanoid Loco-Manipulation

CoorDex：协调身体与手部先验，实现连续灵巧的人形机器人移动操作

Authors: Sikai Li, Shuning Li, Zhenyu Wei, Yunchao Yao, Chenran Li, Mingyu Ding
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.23680
Pdf link: https://arxiv.org/pdf/2606.23680
Abstract Humanoid loco-manipulation is often simplified into a stop-and-go process: walking to an object, stopping to manipulate it, and then resuming locomotion. It also commonly relies on low degree-of-freedom (DoF) end effectors that behave like an open-close grasp primitive. We introduce CoorDex, a learning pipeline that converts high-dimensional body and dexterous hand control into coordinated latent residual control, enabling high-DoF dexterous loco-manipulation on the move. Starting from simulated whole-body and hand demonstrations, CoorDex trains privileged motion tracking teachers for the humanoid body and dexterous hand, distills them into proprioception-conditioned latent priors, and uses the frozen priors as the action space for downstream residual reinforcement learning. A coordinated latent residual policy composes these priors through shared task context and separate body-hand residual heads, preserving natural whole-body motion while improving finger-level contact reliability. CoorDex enables a Unitree G1 humanoid with a 20-DoF WUJI hand to execute dexterous manipulation while in motion, including non-stop bottle grasping and carrying, fridge door opening on the move, and cube pick-and-turn. Ablations on the walk-grasp-carry task show that joint-space PPO, joint-space hand control, and monolithic latent prediction all fail under the same reward budget, while the latent-prior interface and coordinated residual structure make high-dimensional contact-rich loco-manipulation trainable. Project Page: this https URL
中文摘要 人形机器人的移动操作（loco-manipulation）常被简化为走走停停的过程：走到物体前，停下来操作，然后继续移动。它通常还依赖低自由度（DoF）的末端执行器，其行为类似于开合式抓取原语。我们提出CoorDex，这是一个学习流程，将高维的身体控制和灵巧手控制转化为协调的潜在残差控制，从而实现移动中的高自由度灵巧移动操作。从仿真的全身和手部示范出发，CoorDex分别为人形机器人身体和灵巧手训练带特权信息的运动跟踪教师，将其蒸馏为以本体感觉为条件的潜在先验，并将冻结的先验用作下游残差强化学习的动作空间。协调的潜在残差策略通过共享的任务上下文以及相互独立的身体和手部残差头来组合这些先验，在保持自然全身运动的同时提高手指级接触的可靠性。CoorDex使配备20自由度WUJI灵巧手的Unitree G1人形机器人能够在移动中执行灵巧操作，包括不停步地抓取并携带瓶子、在移动中打开冰箱门，以及拾取并翻转方块。在行走-抓取-携带任务上的消融实验表明，关节空间PPO、关节空间手部控制和单体式潜在预测在相同奖励预算下均告失败，而潜在先验接口和协调残差结构使高维、接触丰富的移动操作变得可训练。项目页面：此 https URL

Keyword: diffusion policy

BayesFP: Posterior Estimation for Flow-Based Policies via Feynman-Kac Sampling

BayesFP：通过Feynman-Kac采样对基于流的策略进行后验估计

Authors: Sreevardhan Sirigiri, Weiming Zhi, Fabio Ramos
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.21014
Pdf link: https://arxiv.org/pdf/2606.21014
Abstract Robots must generate trajectories that remain faithful to learned expert behavior while satisfying safety constraints and task-specific objectives specified only at inference time. We formulate constrained trajectory generation for pretrained diffusion and flow-matching policies as Bayesian posterior sampling, with the learned demonstration distribution as a prior and an inference-time, cost-derived likelihood tilting it toward feasible, optimal trajectories. To sample from this posterior without any retraining of the base policy, we leverage the Feynman-Kac corrector framework, originally formulated for diffusion models, and extend it to deterministic flow-matching policies. The result is a unified, inference-time, retraining-free sampler for diffusion and flow policies. We validate the approach on pretrained Diffusion Policy, GR00T-N1.6, and $\pi_{0.5}$ checkpoints across simulated and real-world manipulation tasks, including planning around non-convex obstacles introduced at inference time, and show improvements over the base $\pi_{0.5}$ on zero-shot tasks.
中文摘要 机器人生成的轨迹必须既忠实于学到的专家行为，又满足仅在推理时才指定的安全约束和任务特定目标。我们将预训练扩散策略和流匹配策略的约束轨迹生成表述为贝叶斯后验采样：以学到的示范分布为先验，并由推理时给定的、基于代价的似然将其向可行且最优的轨迹倾斜。为了在不重新训练基础策略的情况下从该后验中采样，我们利用最初为扩散模型提出的Feynman-Kac校正器框架，并将其扩展到确定性的流匹配策略。由此得到一个统一的、在推理时使用且无需重新训练的采样器，适用于扩散策略和流策略。我们在预训练的Diffusion Policy、GR00T-N1.6和$\pi_{0.5}$检查点上，通过仿真和真实世界的操作任务验证了该方法，其中包括绕开推理时引入的非凸障碍物进行规划，并在零样本任务上展示了相对于基础$\pi_{0.5}$的改进。
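
Illustrative note on the BayesFP abstract: the "prior tilted by a cost-derived likelihood" can be written as a Bayes-rule posterior over trajectories $\tau$. The exponential likelihood form, the cost function $c(\tau)$, and the temperature $\lambda$ are assumptions made here for illustration; the abstract does not specify them.

```latex
p(\tau \mid \text{constraints}) \;\propto\;
\underbrace{p_\theta(\tau)}_{\text{prior: pretrained policy}}
\;\cdot\;
\underbrace{\exp\!\big(-\lambda\, c(\tau)\big)}_{\text{likelihood: inference-time cost}}
```

Under this reading, trajectories that are likely under the demonstration prior and have low cost (for example, far from a newly introduced obstacle) receive the most posterior mass, and the pretrained weights $\theta$ are left untouched.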

Factor-Aware Mixture-of-Experts with Pretrained Encoder for Combinatorial Generalization

基于预训练编码器的因子感知混合专家模型，用于组合泛化

Authors: Feihong Zhang, Guojian Zhan, Zeyu He, Yinuo Wang, Likun Wang, Tianze Zhu, Yao Lyu, Tao Zhang, Tinghao Yi, Wei You, Shengbo Eben Li
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.21100
Pdf link: https://arxiv.org/pdf/2606.21100
Abstract The integration of pretrained encoders with diffusion policies has become a dominant paradigm for visual robotic manipulation. However, it still struggles to generalize across complex environments with varying factors such as lighting and surface textures. To address this, we propose FAME, a framework that integrates a factor-aware mixture-of-experts (MoE) with a pretrained encoder to enhance generalization to environmental variations. FAME follows a three-stage training process: (1) policy warmup, where a diffusion policy is trained on standard-environment data with a frozen encoder; (2) factor-specific adapter training, where lightweight adapters inserted between the frozen encoder and the temporarily frozen policy are trained on customized datasets, each targeting a distinct environmental variation; and (3) joint fine-tuning, where a central router and the warmed policy are trained on mixed data to handle multiple factors jointly. FAME is "factor-aware" because the central router softly weights frozen factor-specific adapters as a dense MoE, enabling combinatorial generalization across multiple factors. Evaluations on the Meta-World benchmark show that FAME outperforms diffusion policy baselines by 34%. We further validate FAME in a real-world pick-and-place task using a compact model trained on newly collected data, where FAME achieves a 35% improvement in generalization under real-world variations.
中文摘要 将预训练编码器与扩散策略相结合已成为视觉机器人操作的主导范式。然而，在光照和表面纹理等因素不断变化的复杂环境中，这一范式仍然难以泛化。为此，我们提出了FAME框架，将因子感知的混合专家模型（MoE）与预训练编码器相结合，以增强对环境变化的泛化能力。FAME遵循三阶段训练过程：（1）策略预热，在编码器冻结的情况下，用标准环境数据训练扩散策略；（2）因子专用适配器训练，将轻量级适配器插入冻结的编码器和暂时冻结的策略之间，并在定制数据集上训练，每个数据集针对一种不同的环境变化；（3）联合微调，在混合数据上训练中央路由器和预热后的策略，以联合处理多个因素。FAME之所以称为“因子感知”，是因为中央路由器以稠密MoE的方式对冻结的因子专用适配器进行软加权，从而实现跨多个因素的组合泛化。在Meta-World基准上的评估显示，FAME比扩散策略基线高出34%。我们还在真实世界的抓取-放置任务中进一步验证了FAME，使用在新采集数据上训练的紧凑模型，FAME在真实世界变化下的泛化性能提升了35%。
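
Illustrative sketch for the FAME abstract: "softly weighting frozen adapters as a dense MoE" can be read as a softmax-weighted sum over the outputs of every adapter, with no top-k selection. The adapter functions, feature values, and router logits below are hypothetical placeholders and not FAME's actual components.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dense_moe(features, adapters, router_logits):
    """Run every adapter (dense, no top-k) and mix the outputs with router weights."""
    weights = softmax(router_logits)
    outputs = [adapter(features) for adapter in adapters]
    mixed = [
        sum(w * out[i] for w, out in zip(weights, outputs))
        for i in range(len(features))
    ]
    return mixed, weights

# Hypothetical frozen adapters, one per environmental factor.
def lighting_adapter(f):
    return [2.0 * v for v in f]

def texture_adapter(f):
    return [v + 1.0 for v in f]

# Equal router logits give each adapter a weight of 0.5.
mixed, weights = dense_moe([1.0, 2.0], [lighting_adapter, texture_adapter], [0.0, 0.0])
```

Because every adapter contributes in proportion to its weight, a scene that varies in both lighting and texture can draw on both adapters at once, which is one plausible reading of the combinatorial generalization the abstract claims.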

Temporal Logic Guidance for Action-Only Diffusion Policies with World Models

结合世界模型的纯动作扩散策略时序逻辑引导

Authors: Moritz Zoellner, Anastasios Manganaris, Rohan Paleja
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.22729
Pdf link: https://arxiv.org/pdf/2606.22729
Abstract Diffusion policies enable multimodal robot behavior but offer limited ability to choose among behavior modes at inference time, even though such control is desirable in human-robot settings. Prior solutions to this lack of control have utilized Signal Temporal Logic (STL) to express human intentions and provide corresponding guidance for diffusion policy inference. However, these approaches can only guide diffusion policies that jointly generate future actions and states, increasing both complexity and runtime. We propose a novel guidance method for action-only diffusion policies that uses a separate learned world model to enable differentiable evaluation of STL robustness, with its gradient then injected into the diffusion process. This steers behavior toward constraint satisfaction without retraining, improving constraint adherence while preserving task performance. On the Can Transport task from Robomimic, our method maintains 100% task success while reducing constraint violations from over 80% for baseline methods to 4%. We also discuss extensions toward improved robustness and more complex constraints.
中文摘要 扩散策略能够实现多模态的机器人行为，但在推理时从各行为模式中进行选择的能力有限，尽管这种控制在人机协作场景中很有价值。此前针对这一控制缺失的解决方案利用信号时序逻辑（STL）表达人类意图，并为扩散策略的推理提供相应引导。然而，这些方法只能引导同时生成未来动作和状态的扩散策略，这增加了复杂度和运行时间。我们为纯动作扩散策略提出了一种新的引导方法，利用单独学习的世界模型实现对STL鲁棒性的可微评估，再将其梯度注入扩散过程。这可以在无需重新训练的情况下将行为引向约束满足，在保持任务性能的同时提高约束遵守程度。在Robomimic的Can Transport任务上，我们的方法保持了100%的任务成功率，同时将约束违反率从基线方法的80%以上降至4%。我们还讨论了面向更强鲁棒性和更复杂约束的扩展。
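
Illustrative sketch for the STL guidance abstract: under the standard quantitative semantics of Signal Temporal Logic, "always" takes the minimum margin over time and "eventually" takes the maximum, and a positive robustness value means the formula is satisfied. The log-sum-exp soft minimum is one common smooth surrogate that makes the value differentiable; whether the paper uses this particular smoothing is an assumption. The signal values are hypothetical.

```python
import math

def always_at_least(signal, threshold):
    """Robustness of G(x_t >= threshold): the worst-case margin over time."""
    return min(x - threshold for x in signal)

def eventually_at_least(signal, threshold):
    """Robustness of F(x_t >= threshold): the best-case margin over time."""
    return max(x - threshold for x in signal)

def soft_min(values, beta=10.0):
    """Smooth lower bound on min(values); approaches the true min as beta grows."""
    m = min(values)
    # Shift by the minimum so the exponentials stay in [0, 1] and cannot overflow.
    return m - math.log(sum(math.exp(-beta * (v - m)) for v in values)) / beta

# Hypothetical predicted distances (in meters) to a region the robot should avoid.
distances = [1.0, 2.0, 3.0]
rho_always = always_at_least(distances, 0.5)          # margin at the closest time step
rho_eventually = eventually_at_least(distances, 2.5)  # margin at the farthest time step
rho_smooth = soft_min([d - 0.5 for d in distances])   # differentiable surrogate for rho_always
```

In a guidance setting such as the one described, the signal would come from a world model's predicted states for a candidate action sequence, and the gradient of the smooth robustness with respect to the actions would be added to the denoising update. That wiring is not shown here.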