Arxiv Papers of Today

Generated: 2026-03-25 16:55:47 (UTC+8); Arxiv release time: 2026-03-25 20:00 EDT (2026-03-26 08:00 UTC+8)

33 related papers today

Keyword: reinforcement learning

TIPS: Turn-Level Information-Potential Reward Shaping for Search-Augmented LLMs

Authors: Yutao Xie, Nathaniel Thomas, Nicklas Hansen, Yang Fu, Li Erran Li, Xiaolong Wang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.22293
Pdf link: https://arxiv.org/pdf/2603.22293
Abstract Search-augmented large language models (LLMs) trained with reinforcement learning (RL) have achieved strong results on open-domain question answering (QA), but training still remains a significant challenge. The optimization is often unstable due to sparse rewards and difficult credit assignments across reasoning and tool calls. To address this, we introduce Turn-Level Information Potential Reward Shaping (TIPS), a simple framework that assigns dense, turn-level rewards to each reasoning + tool-call segment based on the increased likelihood of the correct answer under a teacher model. By leveraging the potential-based reward shaping, TIPS offers fine-grained and policy-invariant guidance that overcomes the limitations of outcome-only optimization. Evaluated on seven QA benchmarks, TIPS consistently outperforms GRPO/PPO baselines and substantially improves training stability. For instance, with a Qwen-2.5 7B Instruct model, TIPS improves the average Exact Match score by 11.8% and F1 by 13.6% relative to PPO. Our results demonstrate that turn-level information-potential reward shaping provides an effective and general solution to sparse-reward credit assignment for multi-turn LLM reasoning.
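
The potential-based shaping described in the TIPS abstract can be illustrated with a minimal sketch. It assumes the potential of a state is the teacher model's log-likelihood of the correct answer after each turn; the function name, the example numbers, and the choice to add the outcome reward on the final turn are hypothetical and not taken from the paper.

```python
from typing import List


def turn_level_shaped_rewards(
    potentials: List[float],
    outcome_reward: float,
    gamma: float = 1.0,
) -> List[float]:
    """Return one shaped reward per turn.

    potentials[t] is an assumed potential Phi(s_t): the teacher model's
    log-likelihood of the correct answer given the context after t turns
    (index 0 is the context before any turn).
    """
    if len(potentials) < 2:
        raise ValueError("need the initial potential plus at least one turn")
    rewards = [
        gamma * potentials[t + 1] - potentials[t]  # F = gamma * Phi(s') - Phi(s)
        for t in range(len(potentials) - 1)
    ]
    rewards[-1] += outcome_reward  # the sparse outcome reward arrives on the last turn
    return rewards


# Illustrative numbers only: the second turn retrieves a distracting passage,
# so the teacher's log-likelihood of the answer drops before recovering.
teacher_logprobs = [-3.0, -2.0, -2.5, -0.5]
rewards = turn_level_shaped_rewards(teacher_logprobs, outcome_reward=1.0)
print(rewards)
```

With `gamma = 1.0` the shaping terms telescope to the final potential minus the initial one, so the summed reward differs from the outcome reward only by the endpoint potentials. This is the standard argument for why potential-based shaping leaves the optimal policy unchanged.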

The Efficiency Attenuation Phenomenon: A Computational Challenge to the Language of Thought Hypothesis

Authors: Di Zhang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.22312
Pdf link: https://arxiv.org/pdf/2603.22312
Abstract This paper computationally investigates whether thought requires a language-like format, as posited by the Language of Thought (LoT) hypothesis. We introduce the "AI Private Language" thought experiment: if two artificial agents develop an efficient, inscrutable communication protocol via multi-agent reinforcement learning (MARL), and their performance declines when forced to use a human-comprehensible language, this Efficiency Attenuation Phenomenon (EAP) challenges the LoT. We formalize this in a cooperative navigation task under partial observability. Results show that agents with an emergent protocol achieve 50.5% higher efficiency than those using a pre-defined, human-like symbolic protocol, confirming the EAP. This suggests optimal collaborative cognition in these systems is not mediated by symbolic structures but is naturally coupled with sub-symbolic computations. The work bridges philosophy, cognitive science, and AI, arguing for pluralism in cognitive architectures and highlighting implications for AI ethics.

WIST: Web-Grounded Iterative Self-Play Tree for Domain-Targeted Reasoning Improvement

Authors: Fangyuan Li, Pengfei Li, Shijie Wang, Junqi Gao, Jianxing Liu, Biqing Qi, Yuqiang Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.22352
Pdf link: https://arxiv.org/pdf/2603.22352
Abstract Recent progress in reinforcement learning with verifiable rewards (RLVR) offers a practical path to self-improvement of language models, but existing methods face a key trade-off: endogenous self-play can drift over iterations, while corpus-grounded approaches rely on curated data environments. We present WIST, a Web-grounded Iterative Self-play Tree framework for domain-targeted reasoning improvement that learns directly from the open web without requiring any pre-arranged domain corpus. WIST incrementally expands a domain tree for exploration, and retrieves and cleans path-consistent web corpus to construct a controllable training environment. It then performs Challenger-Solver self-play with verifiable rewards, and feeds learnability signals back to update node posteriors and guide subsequent exploration through an adaptive curriculum. Across four backbones, WIST consistently improves over the base models and typically outperforms both purely endogenous self-evolution and corpus-grounded self-play baselines, with the Overall gains reaching +9.8 (Qwen3-4B-Base) and +9.7 (OctoThinker-8B). WIST is also domain-steerable, improving Qwen3-8B-Base by +14.79 in medicine and Qwen3-4B-Base by +5.28 on PhyBench. Ablations further confirm the importance of WIST's key components for stable open-web learning. Our code is available at this https URL.

Model Predictive Control with Differentiable World Models for Offline Reinforcement Learning

Authors: Rohan Deb, Stephen J. Wright, Arindam Banerjee
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.22430
Pdf link: https://arxiv.org/pdf/2603.22430
Abstract Offline Reinforcement Learning (RL) aims to learn optimal policies from fixed offline datasets, without further interactions with the environment. Such methods train an offline policy (or value function), and apply it at inference time without further refinement. We introduce an inference-time adaptation framework inspired by model predictive control (MPC) that utilizes a pretrained policy along with a learned world model of state transitions and rewards. While existing world model and diffusion-planning methods use learned dynamics to generate imagined trajectories during training, or to sample candidate plans at inference time, they do not use inference-time information to optimize the policy parameters on the fly. In contrast, our design is a Differentiable World Model (DWM) pipeline that enables end-to-end gradient computation through imagined rollouts for policy optimization at inference time based on MPC. We evaluate our algorithm on D4RL continuous-control benchmarks (MuJoCo locomotion tasks and AntMaze), and show that exploiting inference-time information to optimize the policy parameters yields consistent gains over strong offline RL baselines.
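
The idea of differentiating through imagined rollouts to adapt a policy at inference time can be sketched on a toy problem. Everything below is an assumption made for illustration: the one-dimensional linear `world_model`, the single policy parameter `gain`, the horizon, and the learning rate are hypothetical, and forward-mode automatic differentiation with dual numbers stands in for the backpropagation a neural pipeline would use. It is not the paper's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """Forward-mode autodiff scalar: a value and its derivative w.r.t. one parameter."""
    val: float
    grad: float = 0.0

    def __add__(self, other):
        other = lift(other)
        return Dual(self.val + other.val, self.grad + other.grad)

    __radd__ = __add__

    def __sub__(self, other):
        other = lift(other)
        return Dual(self.val - other.val, self.grad - other.grad)

    def __mul__(self, other):
        other = lift(other)
        return Dual(self.val * other.val, self.val * other.grad + self.grad * other.val)

    __rmul__ = __mul__

    def __neg__(self):
        return Dual(-self.val, -self.grad)


def lift(x):
    """Wrap a plain number as a constant Dual (zero derivative)."""
    return x if isinstance(x, Dual) else Dual(float(x))


def world_model(state, action):
    """Hypothetical learned model: linear dynamics and a quadratic reward."""
    next_state = 0.9 * state + 0.5 * action
    reward = -(next_state * next_state) - 0.1 * (action * action)
    return next_state, reward


def imagined_return(gain, state, horizon=5):
    """Roll the policy a = -gain * s through the world model and sum the rewards."""
    s, total = lift(state), Dual(0.0)
    for _ in range(horizon):
        a = -(lift(gain) * s)
        s, r = world_model(s, a)
        total = total + r
    return total


def adapt_gain(gain, state, steps=50, lr=0.02):
    """Gradient ascent on the imagined return w.r.t. the policy parameter."""
    for _ in range(steps):
        gain += lr * imagined_return(Dual(gain, 1.0), state).grad
    return gain


pretrained_gain, current_state = 0.0, 1.0
adapted_gain = adapt_gain(pretrained_gain, current_state)
first_action = -adapted_gain * current_state  # MPC executes only the first action
print(imagined_return(pretrained_gain, current_state).val,
      imagined_return(adapted_gain, current_state).val)
```

In a receding-horizon loop, the adaptation would be repeated from each newly observed state, with only the first action of each plan executed.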

CaP-X: A Framework for Benchmarking and Improving Coding Agents for Robot Manipulation

Authors: Max Fu, Justin Yu, Karim El-Refai, Ethan Kou, Haoru Xue, Huang Huang, Wenli Xiao, Guanzhi Wang, Fei-Fei Li, Guanya Shi, Jiajun Wu, Shankar Sastry, Yuke Zhu, Ken Goldberg, Linxi "Jim" Fan
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.22435
Pdf link: https://arxiv.org/pdf/2603.22435
Abstract "Code-as-Policy" considers how executable code can complement data-intensive Vision-Language-Action (VLA) methods, yet the effectiveness of such programs as autonomous controllers for embodied manipulation remains underexplored. We present CaP-X, an open-access framework for systematically studying Code-as-Policy agents in robot manipulation. At its core is CaP-Gym, an interactive environment in which agents control robots by synthesizing and executing programs that compose perception and control primitives. Building on this foundation, CaP-Bench evaluates frontier language and vision-language models across varying levels of abstraction, interaction, and perceptual grounding. Across 12 models, CaP-Bench reveals a consistent trend: performance improves with human-crafted abstractions but degrades as these priors are removed, exposing a dependence on designer scaffolding. At the same time, we observe that this gap can be mitigated by scaling agentic test-time computation: multi-turn interaction, structured execution feedback, visual differencing, automatic skill synthesis, and ensembled reasoning substantially improve robustness even when agents operate over low-level primitives. These findings allow us to derive CaP-Agent0, a training-free framework that recovers human-level reliability on several manipulation tasks in simulation and on real embodiments. We further introduce CaP-RL, showing reinforcement learning with verifiable rewards improves success rates and transfers from sim2real with minimal gap. Together, CaP-X provides a principled, open-access platform for advancing embodied coding agents.

Sparse but Critical: A Token-Level Analysis of Distributional Shifts in RLVR Fine-Tuning of LLMs

Authors: Haoming Meng, Kexin Huang, Shaohang Wei, Chiyu Ma, Shuo Yang, Xue Wang, Guoyin Wang, Bolin Ding, Jingren Zhou
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.22446
Pdf link: https://arxiv.org/pdf/2603.22446
Abstract Reinforcement learning with verifiable rewards (RLVR) has significantly improved reasoning in large language models (LLMs), yet the token-level mechanisms underlying these improvements remain unclear. We present a systematic empirical study of RLVR's distributional effects organized around three main analyses: (1) token-level characterization of distributional shifts between base and RL models, (2) the impact of token-level distributional shifts on sequence-level reasoning performance through cross-sampling interventions, and (3) fine-grained mechanics of these shifts at the token level. We find that RL fine-tuning induces highly sparse and targeted changes, with only a small fraction of token distributions exhibiting meaningful divergence between the base and RL policies. We further characterize the structure and evolution of these shifts through analyses of token entropy, positional concentration, and reallocation of probability mass. To assess the functional importance of these sparse changes, we conduct cross-sampling experiments that selectively swap token choices between the base and RL models with varying intervention budgets. We show that inserting only a small fraction of RL-sampled tokens into base generations progressively recovers RL performance gains, while injecting a similarly small number of base token choices into otherwise RL-generated sequences collapses performance to base levels, isolating a small set of token-level decisions directly responsible for RLVR's performance gains. Finally, we explore divergence-weighted variants of the advantage signal as a diagnostic intervention, finding that they can yield improvements over baselines. Together, our results shed light on the distributional changes induced by RLVR and provide a fine-grained, token-level lens for understanding RLVR fine-tuning as a targeted refinement process.
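
The token-level comparison of base and RL distributions described in this abstract can be sketched as follows. The abstract does not say which divergence or threshold the authors use, so the KL direction, the `threshold` value, and the toy two-token vocabulary are assumptions made purely for illustration.

```python
import math
from typing import List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)


def shifted_positions(
    base_dists: Sequence[Sequence[float]],
    rl_dists: Sequence[Sequence[float]],
    threshold: float = 0.1,  # assumed cut-off for a "meaningful" shift
) -> List[int]:
    """Return the positions where the RL policy diverges from the base policy."""
    return [
        t
        for t, (base, rl) in enumerate(zip(base_dists, rl_dists))
        if kl_divergence(rl, base) > threshold
    ]


# Toy next-token distributions at three positions; only the middle one changes.
base = [[0.5, 0.5], [0.9, 0.1], [0.25, 0.75]]
rl = [[0.5, 0.5], [0.1, 0.9], [0.25, 0.75]]
changed = shifted_positions(base, rl)
print(changed, len(changed) / len(base))
```

The ratio printed at the end is the fraction of positions with a meaningful shift, which is the kind of sparsity statistic the abstract refers to.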

Q-Tacit: Image Quality Assessment via Latent Visual Reasoning

Authors: Yuxuan Jiang, Yixuan Li, Hanwei Zhu, Siyue Teng, Fan Zhang, David Bull
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.22641
Pdf link: https://arxiv.org/pdf/2603.22641
Abstract Vision-Language Model (VLM)-based image quality assessment (IQA) has been significantly advanced by incorporating Chain-of-Thought (CoT) reasoning. Recent work has refined image quality reasoning by applying reinforcement learning (RL) and leveraging active visual tools. However, such strategies are typically language-centric, with visual information being treated as static preconditions. Quality-related visual cues often cannot be abstracted into text in extenso due to the gap between discrete textual tokens and quality perception space, which in turn restricts the reasoning effectiveness for visually intensive IQA tasks. In this paper, we revisit this by asking the question, "Is natural language the ideal space for quality reasoning?" and, as a consequence, we propose Q-Tacit, a new paradigm that elicits VLMs to reason beyond natural language in the latent quality space. Our approach follows a synergistic two-stage process: (i) injecting structural visual quality priors into the latent space, and (ii) calibrating latent reasoning trajectories to improve quality assessment ability. Extensive experiments demonstrate that Q-Tacit can effectively perform quality reasoning with significantly fewer tokens than previous reasoning-based methods, while achieving strong overall performance. This paper validates the proposition that language is not the only compact representation suitable for visual quality, opening possibilities for further exploration of effective latent reasoning paradigms for IQA. Source code will be released to support future research.

Improving Safety Alignment via Balanced Direct Preference Optimization

Authors: Shiji Zhao, Mengyang Wang, Shukun Xiong, Fangzhou Chen, Qihui Zhu, Shouwei Ruan, Yisong Xiao, Ranjie Duan, Xun Chen, XingXing Wei
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.22829
Pdf link: https://arxiv.org/pdf/2603.22829
Abstract With the rapid development and widespread application of Large Language Models (LLMs), their potential safety risks have attracted widespread attention. Reinforcement Learning from Human Feedback (RLHF) has been adopted to enhance the safety performance of LLMs. As a simple and effective alternative to RLHF, Direct Preference Optimization (DPO) is widely used for safety alignment. However, safety alignment still suffers from severe overfitting, which limits its actual performance. This paper revisits the overfitting phenomenon from the perspective of the model's comprehension of the training data. We find that the Imbalanced Preference Comprehension phenomenon exists between responses in preference pairs, which compromises the model's safety performance. To address this, we propose Balanced Direct Preference Optimization (B-DPO), which adaptively modulates optimization strength between preferred and dispreferred responses based on mutual information. A series of experimental results show that B-DPO can enhance the safety capability while maintaining the competitive general capabilities of LLMs on various mainstream benchmarks compared to state-of-the-art methods. Warning: This paper contains examples of harmful texts, and reader discretion is recommended.
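
For context, the standard DPO objective that B-DPO builds on can be written as a per-pair loss. The sketch below shows plain DPO only; the abstract does not specify how B-DPO derives its mutual-information weights, so no balancing term is included. The function name and the example log-probabilities are hypothetical.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one preference pair, from sequence log-probabilities."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Illustrative numbers only: the policy already prefers the safe response
# more strongly than the reference model does, so the loss falls below log(2).
print(dpo_loss(-1.0, -5.0, -2.0, -2.0, beta=0.5))
```

The loss treats the preferred and dispreferred responses symmetrically through a single margin, which is the point where a method that modulates their optimization strength separately would differ.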

CoMaTrack: Competitive Multi-Agent Game-Theoretic Tracking with Vision-Language-Action Models

Authors: Youzhi Liu, Li Gao, Liu Liu, Mingyang Lv, Yang Cai
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.22846
Pdf link: https://arxiv.org/pdf/2603.22846
Abstract Embodied Visual Tracking (EVT), a core dynamic task in embodied intelligence, requires an agent to precisely follow a language-specified target. Yet most existing methods rely on single-agent imitation learning, suffering from costly expert data and limited generalization due to static training environments. Inspired by competition-driven capability evolution, we propose CoMaTrack, a competitive game-theoretic multi-agent reinforcement learning framework that trains agents in a dynamic adversarial setting with competitive subtasks, yielding stronger adaptive planning and interference-resilient strategies. We further introduce CoMaTrack-Bench, the first benchmark for competitive EVT, featuring game scenarios between a tracker and adaptive opponents across diverse environments and instructions, enabling standardized robustness evaluation under active adversarial interactions. Experiments show that CoMaTrack achieves state-of-the-art results on both standard benchmarks and CoMaTrack-Bench. Notably, a 3B VLM trained with our framework surpasses previous single-agent imitation learning methods based on 7B models on the challenging EVT-Bench, achieving 92.1% in STT, 74.2% in DT, and 57.5% in AT. The benchmark code will be available at this https URL

Rethinking Token-Level Policy Optimization for Multimodal Chain-of-Thought

Authors: Yunheng Li, Hangyi Kuang, Hengrui Zhang, Jiangxia Cao, Zhaojie Liu, Qibin Hou, Ming-Ming Cheng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.22847
Pdf link: https://arxiv.org/pdf/2603.22847
Abstract Multimodal Chain-of-Thought (CoT) reasoning requires large vision-language models to construct reasoning trajectories that interleave perceptual grounding with multi-step inference. However, existing Reinforcement Learning with Verifiable Rewards (RLVR) methods typically optimize reasoning at a coarse granularity, treating CoT uniformly without distinguishing their varying degrees of visual grounding. In this work, we conduct a token-level analysis of multimodal reasoning trajectories and show that successful reasoning is characterized by structured token dynamics reflecting both perceptual grounding and exploratory inference. Building upon this analysis, we propose Perception-Exploration Policy Optimization (PEPO), which derives a perception prior from hidden state similarity and integrates it with token entropy through a smooth gating mechanism to produce token-level advantages. PEPO integrates seamlessly with existing RLVR frameworks such as GRPO and DAPO, requiring neither additional supervision nor auxiliary branches. Extensive experiments across diverse multimodal benchmarks demonstrate consistent and robust improvements over strong RL baselines, spanning geometry reasoning, visual grounding, visual puzzle solving, and few-shot classification, while maintaining stable training dynamics. Code: this https URL

Grounding Sim-to-Real Generalization in Dexterous Manipulation: An Empirical Study with Vision-Language-Action Models

Authors: Ruixing Jin, Zicheng Zhu, Ruixiang Ouyang, Sheng Xu, Bo Yue, Zhizheng Wu, Guiliang Liu
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.22876
Pdf link: https://arxiv.org/pdf/2603.22876
Abstract Learning a generalist control policy for dexterous manipulation typically relies on large-scale datasets. Given the high cost of real-world data collection, a practical alternative is to generate synthetic data through simulation. However, the resulting synthetic data often exhibits a significant gap from real-world distributions. While many prior studies have proposed algorithms to bridge the Sim-to-Real discrepancy, there remains a lack of principled research that grounds these methods in real-world manipulation tasks, particularly their performance on generalist policies such as Vision-Language-Action (VLA) models. In this study, we empirically examine the primary determinants of Sim-to-Real generalization across four dimensions: multi-level domain randomization, photorealistic rendering, physics-realistic modeling, and reinforcement learning updates. To support this study, we design a comprehensive evaluation protocol to quantify the real-world performance of manipulation tasks. The protocol accounts for key variations in background, lighting, distractors, object types, and spatial features. Through experiments involving over 10k real-world trials, we derive critical insights into Sim-to-Real transfer. To inform and advance future studies, we release both the robotic platforms and the evaluation protocol for public access to facilitate independent verification, thereby establishing a realistic and standardized benchmark for dexterous manipulation policies.

VLGOR: Visual-Language Knowledge Guided Offline Reinforcement Learning for Generalizable Agents

Authors: Pengsen Liu, Maosen Zeng, Nan Tang, Kaiyuan Li, Jing-Cheng Pang, Yunan Liu, Yang Yu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.22892
Pdf link: https://arxiv.org/pdf/2603.22892
Abstract Combining Large Language Models (LLMs) with Reinforcement Learning (RL) enables agents to interpret language instructions more effectively for task execution. However, LLMs typically lack direct perception of the physical environment, which limits their understanding of environmental dynamics and their ability to generalize to unseen tasks. To address this limitation, we propose Visual-Language Knowledge-Guided Offline Reinforcement Learning (VLGOR), a framework that integrates visual and language knowledge to generate imaginary rollouts, thereby enriching the interaction data. The core premise of VLGOR is to fine-tune a vision-language model to predict future states and actions conditioned on an initial visual observation and high-level instructions, ensuring that the generated rollouts remain temporally coherent and spatially plausible. Furthermore, we employ counterfactual prompts to produce more diverse rollouts for offline RL training, enabling the agent to acquire knowledge that facilitates following language instructions while grounding in environments based on visual cues. Experiments on robotic manipulation benchmarks demonstrate that VLGOR significantly improves performance on unseen tasks requiring novel optimal policies, achieving a success rate over 24% higher than the baseline methods.

EVA: Efficient Reinforcement Learning for End-to-End Video Agent

Authors: Yaolun Zhang, Ruohui Wang, Jiahao Wang, Yepeng Tang, Xuanyu Zheng, Haonan Duan, Hao Lu, Hanming Deng, Lewei Lu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.22918
Pdf link: https://arxiv.org/pdf/2603.22918
Abstract Video understanding with multimodal large language models (MLLMs) remains challenging due to the long token sequences of videos, which contain extensive temporal dependencies and redundant frames. Existing approaches typically treat MLLMs as passive recognizers, processing entire videos or uniformly sampled frames without adaptive reasoning. Recent agent-based methods introduce external tools, yet still depend on manually designed workflows and perception-first strategies, resulting in inefficiency on long videos. We present EVA, an Efficient Reinforcement Learning framework for End-to-End Video Agent, which enables planning-before-perception through iterative summary-plan-action-reflection reasoning. EVA autonomously decides what to watch, when to watch, and how to watch, achieving query-driven and efficient video understanding. To train such agents, we design a simple yet effective three-stage learning pipeline - comprising supervised fine-tuning (SFT), Kahneman-Tversky Optimization (KTO), and Generalized Reward Policy Optimization (GRPO) - that bridges supervised imitation and reinforcement learning. We further construct high-quality datasets for each stage, supporting stable and reproducible training. We evaluate EVA on six video understanding benchmarks, demonstrating its comprehensive capabilities. Compared with existing baselines, EVA achieves a substantial improvement of 6-12% over general MLLM baselines and a further 1-3% gain over prior adaptive agent methods. Our code and model are available at this https URL.

Quality Over Clicks: Intrinsic Quality-Driven Iterative Reinforcement Learning for Cold-Start E-Commerce Query Suggestion

Authors: Qi Sun, Kejun Xiao, Huaipeng Zhao, Tao Luo, Xiaoyi Zeng
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.22922
Pdf link: https://arxiv.org/pdf/2603.22922
Abstract Existing dialogue systems rely on Query Suggestion (QS) to enhance user engagement. Recent efforts typically employ large language models with a Click-Through Rate (CTR) model, yet fail in cold-start scenarios due to their heavy reliance on abundant online click data for effective CTR model training. To bridge this gap, we propose Cold-EQS, an iterative reinforcement learning framework for Cold-Start E-commerce Query Suggestion (EQS). Specifically, we leverage answerability, factuality, and information gain as rewards to continuously optimize the quality of suggested queries. To continuously optimize our QS model, we estimate uncertainty for grouped candidate suggested queries to select hard and ambiguous samples from online user queries lacking click signals. In addition, we provide an EQS-Benchmark comprising 16,949 online user queries for offline training and evaluation. Extensive offline and online experiments consistently demonstrate a strong positive correlation between online and offline effectiveness. Both offline and online experimental results demonstrate the superiority of our Cold-EQS, achieving a significant +6.81% improvement in online chatUV.

From Morality Installation in LLMs to LLMs in Morality-as-a-System

Authors: Gunter Bombaerts
Subjects: Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2603.22944
Pdf link: https://arxiv.org/pdf/2603.22944
Abstract Work on morality in large language models (LLMs) has progressed via constitutional AI, reinforcement learning from human feedback (RLHF) and systematic benchmarking, yet it still lacks tools to connect internal moral representations to regulatory obligations, to design cultural plurality across the full development stack, and to monitor how moral properties drift over the lifecycle of a deployed system. These difficulties reflect a shared root. Morality is installed in a model at training time. I propose instead a morality-as-a-system framework, grounded in Niklas Luhmann's social systems theory, that treats LLM morality as a dynamic, emergent property of a sociotechnical system. Moral behaviour in a deployed LLM is not fixed at training. It is continuously reproduced through interactions among seven structurally coupled components spanning the neural substrate, training data, alignment procedures, system prompts, moderation, runtime dynamics, and user interface. This is a conceptual framework paper, not an empirical study. It philosophically reframes three known challenges, the interpretability-governance gap, the cross-component plurality problem, and the absence of lifecycle monitoring, as structural coupling failures that the installation paradigm cannot diagnose. For technical researchers, it explores three illustrative hypotheses about cross-component representational inconsistency, representation-level drift as an early safety signal, and the governance advantage of lifecycle monitoring. For philosophers and governance specialists, it offers a vocabulary for specifying substrate-level monitoring obligations within existing governance frameworks. The morality-as-a-system framework does not displace elements such as constitutional AI or RLHF; it embeds them within a larger temporal and structural account and specifies the additional infrastructure those methods require.
中文摘要 大型语言模型（LLMs）中的道德研究通过宪法人工智能、人类反馈强化学习（RLHF）和系统基准测试取得了进展，但仍缺乏将内部道德表征与监管义务连接起来的工具，无法在整个开发栈中设计文化多样性，并监控道德属性在部署系统生命周期中的漂移。这些困难反映了共同的根源。道德是在训练时被植入模型中的。我提出一种基于尼克拉斯·卢曼社会系统理论的道德作为系统框架，将大型语言模型道德视为社会技术系统中动态、涌现的属性。部署的LLM中的道德行为并非在培训时固定。它通过七个结构耦合组件之间的交互不断重现，这些组件涵盖神经基底、训练数据、比对过程、系统提示、调节、运行时动态和用户界面。这是一篇概念性框架论文，不是实证研究。它从哲学上重新定义了三个已知挑战：可解释性-治理差距、跨组件多元性问题以及生命周期监控的缺失，认为安装范式无法诊断的结构耦合失败。对于技术研究者，它探讨了三个关于跨组件表示不一致、表示层漂移作为早期安全信号以及生命周期监控治理优势的三个说明性假设。对于哲学家和治理专家来说，它提供了在现有治理框架内明确基层监督义务的词汇表。道德作为系统框架并不取代宪法人工智能或RLHF等元素，而是将它们嵌入到更大的时间和结构性账户中，并明确这些方法所需的额外基础设施。

MedCausalX: Adaptive Causal Reasoning with Self-Reflection for Trustworthy Medical Vision-Language Models

MedCausalX：自适应因果推理与自我反思，打造可信的医学视觉语言模型

Authors: Jianxin Lin, Chunzheng Zhu, Peter J. Kneuertz, Yunfei Bai, Yuan Xue
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.23085
Pdf link: https://arxiv.org/pdf/2603.23085
Abstract Vision-Language Models (VLMs) have enabled interpretable medical diagnosis by integrating visual perception with linguistic reasoning. Yet, existing medical chain-of-thought (CoT) models lack explicit mechanisms to represent and enforce causal reasoning, leaving them vulnerable to spurious correlations and limiting their clinical reliability. We pinpoint three core challenges in medical CoT reasoning: how to adaptively trigger causal correction, construct high-quality causal-spurious contrastive samples, and maintain causal consistency across reasoning trajectories. To address these challenges, we propose MedCausalX, an end-to-end framework that explicitly models causal reasoning chains in medical VLMs. We first introduce the CRMed dataset providing fine-grained anatomical annotations, structured causal reasoning chains, and counterfactual variants that guide the learning of causal relationships beyond superficial correlations. Building upon CRMed, MedCausalX employs a two-stage adaptive reflection architecture equipped with $\langle$causal$\rangle$ and $\langle$verify$\rangle$ tokens, enabling the model to autonomously determine when and how to perform causal analysis and verification. Finally, a trajectory-level causal correction objective optimized through error-attributed reinforcement learning refines the reasoning chain, allowing the model to distinguish genuine causal dependencies from shortcut associations. Extensive experiments on multiple benchmarks show that MedCausalX consistently outperforms state-of-the-art methods, improving diagnostic consistency by +5.4 points, reducing hallucination by over 10 points, and attaining top spatial grounding IoU, thereby setting a new standard for causally grounded medical reasoning.
中文摘要 视觉语言模型（VLMs）通过将视觉感知与语言推理相结合，实现了可解读的医学诊断。然而，现有的医学思维链（CoT）模型缺乏明确的机制来表示和执行因果推理，使其容易受到虚假相关性的影响，并限制了其临床可靠性。我们指出医学CoT推理中的三大核心挑战：如何自适应触发因果纠正，构建高质量的因果-虚假对比样本，以及在推理轨迹间保持因果一致性。为应对这些挑战，我们提出了MedCausalX，一个端到端框架，明确建模医学VLM中的因果推理链。我们首先介绍了CRM数据集，提供细致的解剖注释、结构化的因果推理链和反事实变体，指导人们超越表面相关性地学习因果关系。基于CRMed，MedCausalX采用两阶段自适应反思架构，配备$\langle$causal$\rangle$和$\langle$验证$\rangle$标记，使模型能够自主决定何时以及如何进行因果分析和验证。最后，通过错误归因强化学习优化的轨迹级因果纠正目标，优化了推理链，使模型能够区分真实的因果依赖与捷径关联。多项基准测试的广泛实验表明，MedCausalX持续优于最先进方法，诊断一致性提升+5.4分，幻觉减少超过10分，并达到顶尖的空间基础IoU，从而为因果医学推理树立新标准。

Policy-based Tuning of Autoregressive Image Models with Instance- and Distribution-Level Rewards

基于策略的自回归图像模型调优，并获得实例级和分布级奖励

Authors: Orhun Buğra Baran, Melih Kandemir, Ramazan Gokberk Cinbis
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.23086
Pdf link: https://arxiv.org/pdf/2603.23086
Abstract Autoregressive (AR) models are highly effective for image generation, yet their standard maximum-likelihood estimation training lacks direct optimization for sample quality and diversity. While reinforcement learning (RL) has been used to align diffusion models, these methods typically suffer from output diversity collapse. Similarly, concurrent RL methods for AR models rely strictly on instance-level rewards, often trading off distributional coverage for quality. To address these limitations, we propose a lightweight RL framework that casts token-based AR synthesis as a Markov Decision Process, optimized via Group Relative Policy Optimization (GRPO). Our core contribution is the introduction of a novel distribution-level Leave-One-Out FID (LOO-FID) reward; by leveraging an exponential moving average of feature moments, it explicitly encourages sample diversity and prevents mode collapse during policy updates. We integrate this with composite instance-level rewards (CLIP and HPSv2) for strict semantic and perceptual fidelity, and stabilize the multi-objective learning with an adaptive entropy regularization term. Extensive experiments on LlamaGen and VQGAN architectures demonstrate clear improvements across standard quality and diversity metrics within only a few hundred tuning iterations. The results also show that the model can be updated to produce competitive samples even without Classifier-Free Guidance, thereby avoiding its 2x inference cost.
中文摘要 自回归（AR）模型在图像生成方面非常有效，但其标准的最大似然估计训练缺乏对样本质量和多样性的直接优化。虽然强化学习（RL）已被用于对应扩散模型，但这些方法通常会受到输出多样性崩溃的影响。同样，AR模型的并发强化学习方法严格依赖实例级奖励，常常以分配覆盖换取质量。为解决这些限制，我们提出了一个轻量级强化学习框架，将基于代币的AR综合视为马尔可夫决策过程，并通过群相对策略优化（GRPO）进行优化。我们的核心贡献是引入一种新型分发级别的“留一退出FID（LOO-FID）奖励”;通过利用特征时刻的指数移动平均，它明确鼓励样本多样性，并防止策略更新期间模式崩溃。我们将此与复合实例级奖励（CLIP和HPSv2）整合，以实现严格的语义和感知忠实度，并通过自适应熵正则化项稳定多目标学习。对LlamaGen和VQGAN架构的广泛实验显示，在仅几百次调优迭代内，标准质量和多样性指标均有明显提升。结果还表明，即使没有无分类器指导，模型也能更新以产生竞争性样本，并绕过其2倍的推断成本。
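
The abstract above optimizes the generator with GRPO, which, as commonly described, scores each sample relative to the other samples drawn for the same prompt. The following is a minimal sketch of that group-relative normalization only, assuming one scalar reward per sample. The function name and the example rewards are hypothetical, and the paper's composite reward (LOO-FID plus CLIP and HPSv2) and entropy term are not reproduced here.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples from the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four images generated from one prompt.
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

Samples that beat the group mean receive a positive advantage and the rest a negative one, so no separate value network is needed to form a baseline.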

SpecXMaster Technical Report

SpecXMaster 技术报告

Authors: Yutang Ge, Yaning Cui, Hanzheng Li, Jun-Jie Wang, Fanjie Xu, Jinhan Dong, Yongqi Jin, Dongxu Cui, Peng Jin, Guojiang Zhao, Hengxing Cai, Rong Zhu, Linfeng Zhang, Xiaohong Ji, Zhifeng Gao
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.23101
Pdf link: https://arxiv.org/pdf/2603.23101
Abstract Intelligent spectroscopy serves as a pivotal element in AI-driven closed-loop scientific discovery, functioning as the critical bridge between matter structure and artificial intelligence. However, conventional expert-dependent spectral interpretation encounters substantial hurdles, including susceptibility to human bias and error, dependence on limited specialized expertise, and variability across interpreters. To address these challenges, we propose SpecXMaster, an intelligent framework leveraging Agentic Reinforcement Learning (RL) for NMR molecular spectral interpretation. SpecXMaster enables automated extraction of multiplicity information from both 1H and 13C spectra directly from raw FID (free induction decay) data. This end-to-end pipeline enables fully automated interpretation of NMR spectra into chemical structures. It demonstrates superior performance across multiple public NMR interpretation benchmarks and has been refined through iterative evaluations by professional chemical spectroscopists. We believe that SpecXMaster, as a novel methodological paradigm for spectral interpretation, will have a profound impact on the organic chemistry community.
中文摘要 智能光谱学是人工智能驱动的闭环科学发现中的关键元素，是物质结构与人工智能之间的关键桥梁。然而，传统的专家依赖谱解读面临诸多障碍，包括易受人为偏见和误差影响、依赖有限的专业技能以及不同解读者的差异。为应对这些挑战，我们提出了SpecXMaster，一个利用智能强化学习（RL）进行NMR分子谱解读的智能框架。SpecXMaster 能够直接从原始 FID（自由感应衰减）数据中自动提取 1H 和 13C 频谱的多重性信息。该端到端流程实现了NMR光谱的全自动解读，转化为化学结构。该技术在多个公开核磁解析基准中表现出优异性能，并通过专业化学光谱学家的迭代评估不断完善。我们相信，SpecXMaster 作为一种新的光谱解释方法范式，将对有机化学界产生深远影响。

Fault-Tolerant Design and Multi-Objective Model Checking for Real-Time Deep Reinforcement Learning Systems

容错设计与实时深度强化学习系统的多目标模型检查

Authors: Guoxin Su, Thomas Robinson, Hoa Khanh Dam, Li Liu, David S. Rosenblum
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2603.23113
Pdf link: https://arxiv.org/pdf/2603.23113
Abstract Deep reinforcement learning (DRL) has emerged as a powerful paradigm for solving complex decision-making problems. However, DRL-based systems still face significant dependability challenges, particularly in real-time environments, due to the simulation-to-reality gap, out-of-distribution observations, and the critical impact of latency. Latency-induced faults, in particular, can lead to unsafe or unstable behaviour, yet existing fault-tolerance approaches to DRL systems lack formal methods to rigorously analyse and optimise performance and safety simultaneously in real-time settings. To address this, we propose a formal framework for designing and analysing real-time switching mechanisms between DRL agents and alternative controllers. Our approach leverages Timed Automata (TAs) for explicit switch logic design, which is then syntactically converted to a Markov Decision Process (MDP) for formal analysis. We develop a novel convex query technique for multi-objective model checking, enabling the optimisation of soft performance objectives while ensuring hard safety constraints for MDPs. Furthermore, we present MOPMC, a GPU-accelerated software tool implementing this technique, demonstrating superior scalability in both model size and number of objectives.
中文摘要 深度强化学习（DRL）已成为解决复杂决策问题的强大范式。然而，基于DRL的系统在实时环境中仍面临显著的可靠性挑战，原因包括仿真与现实的差距、分布外的观测以及延迟带来的关键影响。尤其是延迟引起的故障可能导致不安全或不稳定的行为，但现有的DRL容错方法缺乏正式方法，无法在实时环境中同时严格分析和优化性能与安全。为此，我们提出了一个正式框架，用于设计和分析DRL代理与替代控制器之间实时切换机制。我们的方法利用定时自动机（TA）进行显式开关逻辑设计，然后将句法转换为马尔可夫决策过程（MDP）进行形式分析。我们开发了一种新型凸查询技术用于多目标模型检查，既能优化软性能目标，又确保MDP的硬安全约束。此外，我们还展示了MOPMC，一款GPU加速软件工具，实现了该技术，展示了模型规模和目标数值上的优越可扩展性。

Path Planning and Reinforcement Learning-Driven Control of On-Orbit Free-Flying Multi-Arm Robots

轨道自由飞行多臂机器人的路径规划与基于学习的强化控制

Authors: Álvaro Belmonte-Baeza, José Luis Ramón, Leonard Felicetti, Miguel Cazorla, Jorge Pomares
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2603.23182
Pdf link: https://arxiv.org/pdf/2603.23182
Abstract This paper presents a hybrid approach that integrates trajectory optimization (TO) and reinforcement learning (RL) for motion planning and control of free-flying multi-arm robots in on-orbit servicing scenarios. The proposed system integrates TO for generating feasible, efficient paths while accounting for dynamic and kinematic constraints, and RL for adaptive trajectory tracking under uncertainties. The multi-arm robot design, equipped with thrusters for precise body control, enables redundancy and stability in complex space operations. TO optimizes arm motions and thruster forces, reducing reliance on the arms for stabilization and enhancing maneuverability. RL further refines this by leveraging model-free control to adapt to dynamic interactions and disturbances. The experimental results validated through comprehensive simulations demonstrate the effectiveness and robustness of the proposed hybrid approach. Two case studies are explored: surface motion with initial contact and a free-floating scenario requiring surface approximation. In both cases, the hybrid method outperforms traditional strategies. In particular, the thrusters notably enhance motion smoothness, safety, and operational efficiency. The RL policy effectively tracks TO-generated trajectories, handling high-dimensional action spaces and dynamic mismatches. This integration of TO and RL combines the strengths of precise, task-specific planning with robust adaptability, ensuring high performance in the uncertain and dynamic conditions characteristic of space environments. By addressing challenges such as motion coupling, environmental disturbances, and dynamic control requirements, this framework establishes a strong foundation for advancing the autonomy and effectiveness of space robotic systems.
中文摘要 本文提出了一种混合方法，将轨迹优化（TO）和强化学习（RL）整合为自由飞行多臂机器人在轨道维护场景下的运动规划与控制。该系统整合了TO生成可行且高效的路径，同时考虑动态和运动学约束，并结合RL实现不确定性下的自适应轨迹跟踪。多臂机器人设计配备精确体控推进器，实现复杂航天操作中的冗余与稳定性。TO 优化了机械臂运动和推进力，减少对机械臂稳定的依赖，增强机动性。强化学习通过利用无模型控制来适应动态交互和干扰，进一步完善了这一点。通过全面模拟验证的实验结果证明了该混合方法的有效性和鲁棒性。探讨了两个案例：初始接触的表面运动和需要表面近似的自由漂浮场景。在这两种情况下，混合方法都优于传统策略。特别是，推进器显著提升了运动的平稳性、安全性和操作效率。强化学习策略有效跟踪TO生成的轨迹，处理高维动作空间和动态不匹配。TO与RL的结合结合了精准、针对特定任务的规划与强韧的适应性，确保在太空环境中不确定且动态的条件下保持高性能。通过解决运动耦合、环境干扰和动态控制需求等挑战，该框架为提升空间机器人系统的自主性和效能奠定了坚实基础。

ImplicitRM: Unbiased Reward Modeling from Implicit Preference Data for LLM alignment

ImplicitRM：基于隐性偏好数据的无偏奖励建模，用于LLM对齐

Authors: Hao Wang, Haocheng Yang, Licheng Pan, Lei Shen, Xiaoxi Li, Yinuo Wang, Zhichao Chen, Yuan Lu, Haoxuan Li, Zhouchen Lin
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Applications (stat.AP)
Arxiv link: https://arxiv.org/abs/2603.23184
Pdf link: https://arxiv.org/pdf/2603.23184
Abstract Reward modeling represents a long-standing challenge in reinforcement learning from human feedback (RLHF) for aligning language models. Current reward modeling is heavily contingent upon experimental feedback data with high collection costs. In this work, we study implicit reward modeling, that is, learning reward models from implicit human feedback (e.g., clicks and copies), as a cost-effective alternative. We identify two fundamental challenges in implicit reward modeling: (1) Implicit preference data lacks definitive negative samples, which makes standard positive-negative classification methods inapplicable; (2) Implicit preference data suffers from user preference bias, where different responses have different propensities to elicit user feedback actions, which exacerbates the difficulty of distinguishing definitive negative samples. To address these challenges, we propose ImplicitRM, which aims to learn unbiased reward models from implicit preference data. ImplicitRM stratifies training samples into four latent groups via a stratification model. Building on this, it derives a learning objective through likelihood maximization, which we prove is theoretically unbiased, effectively resolving both challenges. Experiments demonstrate that ImplicitRM learns accurate reward models across implicit preference datasets. Code is available on our project website.
中文摘要 奖励建模代表了人类反馈强化学习（RLHF）中一个长期存在的挑战，用于对齐语言模型。当前的奖励建模高度依赖实验反馈数据，且收集成本较高。本研究中，我们研究了\textit{隐性奖励建模}——通过隐性人类反馈（如点击和复制）学习奖励模型——作为一种成本效益的替代方案。我们识别了隐性奖励建模中的两个根本挑战：（1）隐性偏好数据缺乏明确的负样本，这使得标准的正负分类方法不适用;（2）隐性偏好数据存在用户偏好偏差，不同响应引发用户反馈行为的倾向不同，这加剧了区分明确负面样本的困难。为应对这些挑战，我们提出了ImplicitRM，旨在从隐性偏好数据中学习无偏的奖励模型。隐式 RM 通过分层模型将训练样本分层为四个潜在群体。基于此，它通过似然最大化得出学习目标，我们证明了该目标理论上无偏，有效解决了这两个挑战。实验表明，ImplicitRM能够在隐性偏好数据集中学习准确的奖励模型。代码可在我们的项目网站上获取。

GEM: Guided Expectation-Maximization for Behavior-Normalized Candidate Action Selection in Offline RL

GEM：离线强化学习中行为归一化候选行动选择的引导期望最大化

Authors: Haoyu Wang, Jingcheng Wang, Shunyu Wu, Xinwei Xiao
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.23232
Pdf link: https://arxiv.org/pdf/2603.23232
Abstract Offline reinforcement learning (RL) can fit strong value functions from fixed datasets, yet reliable deployment still hinges on the action selection interface used to query them. When the dataset induces a branched or multimodal action landscape, unimodal policy extraction can blur competing hypotheses and yield "in-between" actions that are weakly supported by data, making decisions brittle even with a strong critic. We introduce GEM (Guided Expectation-Maximization), an analytical framework that makes action selection both multimodal and explicitly controllable. GEM trains a Gaussian Mixture Model (GMM) actor via critic-guided, advantage-weighted EM-style updates that preserve distinct components while shifting probability mass toward high-value regions, and learns a tractable GMM behavior model to quantify support. During inference, GEM performs candidate-based selection: it generates a parallel candidate set and reranks actions using a conservative ensemble lower-confidence bound together with behavior-normalized support, where the behavior log-likelihood is standardized within each state's candidate set to yield stable, comparable control across states and candidate budgets. Empirically, GEM is competitive across D4RL benchmarks, and offers a simple inference-time budget knob (candidate count) that trades compute for decision quality without retraining.
中文摘要 离线强化学习（RL）可以从固定数据集中拟合强值函数，但可靠的部署仍依赖于用于查询这些数据集的动作选择界面。当数据集诱导出分支或多模态的行动景观时，单模态策略提取可能模糊竞争假设，产生数据支持薄弱的“中间”行动，使决策即使面对强烈批评也变得脆弱。我们介绍了GEM（引导期望最大化），这是一种分析框架，使动作选择既多模态又可显式控制。GEM通过批评者引导的优势加权EM式更新训练高斯混合模型（GMM）演员，保持不同组分，同时将概率质量向高值区域移动，并学习一个可操作的GMM行为模型以量化支持。在推断过程中，GEM执行基于候选人的选择：生成一个并行候选人集合，并使用保守系综低置信度约束和行为归一化支持对行动进行排序，其中行为日志似然在每个州的候选人集中标准化，以实现各州和候选人预算间稳定且可比的控制。从经验角度看，GEM在D4RL基准测试中具有竞争力，并提供了一个简单的推理时间预算旋钮（候选人数量），可以用计算换取决策质量而无需重新训练。
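
The inference step described above (rerank candidates by a conservative ensemble lower-confidence bound plus behavior support standardized within the state's candidate set) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the additive combination of the two terms and the weights `kappa` and `lam` are hypothetical, as is the function name.

```python
from statistics import mean, pstdev

def rerank_candidates(q_ensemble, behavior_logp, kappa=1.0, lam=0.5, eps=1e-8):
    """Return the index of the preferred candidate action for one state.

    q_ensemble[i]    : critic estimates for candidate i, one per ensemble member
    behavior_logp[i] : behavior-model log-likelihood of candidate i
    kappa, lam       : hypothetical weights for pessimism and data support
    """
    # Conservative value: ensemble mean minus kappa times the ensemble spread.
    lcb = [mean(qs) - kappa * pstdev(qs) for qs in q_ensemble]
    # Standardize log-likelihoods within this state's candidate set only.
    mu, sigma = mean(behavior_logp), pstdev(behavior_logp)
    support = [(lp - mu) / (sigma + eps) for lp in behavior_logp]
    scores = [v + lam * s for v, s in zip(lcb, support)]
    return max(range(len(scores)), key=scores.__getitem__)

# Candidate 1 has the highest single estimate but wide disagreement and low support.
best = rerank_candidates(
    q_ensemble=[[1.0, 1.1], [1.5, 0.3], [0.9, 1.0]],
    behavior_logp=[-1.0, -5.0, -1.2],
)
```

Because the support term is a z-score within each candidate set, its scale does not depend on the state or on how many candidates are drawn, which is the comparability property the abstract points to.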

Neural ODE and SDE Models for Adaptation and Planning in Model-Based Reinforcement Learning

基于模型的强化学习中的神经常微分方程和SDE模型适应与规划

Authors: Chao Han, Stefanos Ioannou, Luca Manneschi, T.J. Hayward, Michael Mangan, Aditya Gilra, Eleni Vasilaki
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.23245
Pdf link: https://arxiv.org/pdf/2603.23245
Abstract We investigate neural ordinary and stochastic differential equations (neural ODEs and SDEs) to model stochastic dynamics in fully and partially observed environments within a model-based reinforcement learning (RL) framework. Through a sequence of simulations, we show that neural SDEs more effectively capture the inherent stochasticity of transition dynamics, enabling high-performing policies with improved sample efficiency in challenging scenarios. We leverage neural ODEs and SDEs for efficient policy adaptation to changes in environment dynamics via inverse models, requiring only limited interactions with the new environment. To address partial observability, we introduce a latent SDE model that combines an ODE with a GAN-trained stochastic component in latent space. Policies derived from this model provide a strong baseline, outperforming or matching general model-based and model-free approaches across stochastic continuous-control benchmarks. This work demonstrates the applicability of action-conditional latent SDEs for RL planning in environments with stochastic transitions. Our code is available at: this https URL
中文摘要 我们研究神经常微分方程和随机微分方程（神经常微分方程和SDE），以建模基于模型强化学习（RL）框架下的完全和部分观测环境中的随机动力学。通过一系列模拟，我们表明神经SDE更有效地捕捉了转移动态的固有随机性，从而在具有挑战性场景中实现高效策略并提升样本效率。我们利用神经常微分方程和SDE通过逆模型高效调整环境动态变化，只需有限的交互即可与新环境互动。为解决部分可观测性问题，我们引入了一个潜在SDE模型，该模型将常微分方程与潜空间中GAN训练的随机分量结合起来。基于该模型的策略提供了强有力的基线，在随机连续控制基准测试中表现优于或匹敌一般基于模型和无模型的方法。本研究展示了动作条件潜在SDE在随机转移环境中强化学习规划中的适用性。我们的代码可在以下地址获取：此 https URL
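
To make the ODE/SDE distinction above concrete, here is a minimal Euler-Maruyama rollout of an action-conditioned scalar SDE, dx = f(x, a) dt + g(x, a) dW. It is a sketch under assumptions: the `drift` and `diffusion` callables are hypothetical stand-ins for learned networks, and the paper's latent-space model and GAN-trained noise component are not reproduced.

```python
import math
import random

def euler_maruyama(x0, actions, drift, diffusion, dt=0.01, seed=0):
    """Simulate dx = drift(x, a) dt + diffusion(x, a) dW for a scalar state."""
    rng = random.Random(seed)
    x, path = x0, [x0]
    for a in actions:
        dw = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment, variance dt
        x = x + drift(x, a) * dt + diffusion(x, a) * dw
        path.append(x)
    return path

# Hypothetical dynamics: linear drift toward the action, constant noise scale.
drift = lambda x, a: -x + a
diffusion = lambda x, a: 0.1
path = euler_maruyama(1.0, [0.0] * 100, drift, diffusion)
```

Setting the diffusion term to zero recovers a plain Euler step of the ODE, which is one way to see the ODE model as the noise-free special case.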

A Learning Method with Gap-Aware Generation for Heterogeneous DAG Scheduling

一种具有差距感知生成的异构DAG调度学习方法

Authors: Ruisong Zhou, Haijun Zou, Li Zhou, Chumin Sun, Zaiwen Wen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2603.23249
Pdf link: https://arxiv.org/pdf/2603.23249
Abstract Efficient scheduling of directed acyclic graphs (DAGs) in heterogeneous environments is challenging due to resource capacities and dependencies. In practice, the need for adaptability across environments with varying resource pools and task types, alongside rapid schedule generation, complicates these challenges. We propose WeCAN, an end-to-end reinforcement learning framework for heterogeneous DAG scheduling that addresses task-pool compatibility coefficients and generation-induced optimality gaps. It adopts a two-stage single-pass design: a single forward pass produces task-pool scores and global parameters, followed by a generation map that constructs schedules without repeated network calls. Its weighted cross-attention encoder models task-pool interactions gated by compatibility coefficients, and is size-agnostic to environment fluctuations. Moreover, widely used list-scheduling maps can incur generation-induced optimality gaps from restricted reachability. We introduce an order-space analysis that characterizes the reachable set of generation maps via feasible schedule orders, explains the mechanism behind generation-induced gaps, and yields sufficient conditions for gap elimination. Guided by these conditions, we design a skip-extended realization with an analytically parameterized decreasing skip rule, which enlarges the reachable order set while preserving single-pass efficiency. Experiments on computation graphs and real-world TPC-H DAGs demonstrate improved makespan over strong baselines, with inference time comparable to classical heuristics and faster than multi-round neural schedulers.
中文摘要 由于资源容量和依赖性，在异构环境中高效调度有向无环图（DAGs）具有挑战性。实际上，在不同资源池和任务类型的环境中适应性，以及快速的进度生成，使这些挑战更加复杂。我们提出了WeCAN，这是一种端到端强化学习框架，用于异构DAG调度，解决任务池兼容性系数和生成引起的最优性差距。它采用两阶段单遍设计：单阶段前向传递产生任务池得分和全局参数，随后生成映射，构建不重复网络调用的计划。其加权交叉注意力编码器模型采用任务池交互，受兼容性系数限制，且对环境波动大小无关。此外，广泛使用的列表调度映射可能因可达性受限而产生生成引起的最优性缺口。我们引入了一种阶空间分析，通过可行的调度序表征可达的世代映射集合，解释了代际诱导间隙的机制，并给出了消除间隙的充分条件。基于这些条件，我们设计了一个带有解析参数化递减跳跃规则的跳跃扩展实现，该规则扩大可达顺序集，同时保持单次传递效率。计算图和现实TPC-H DAG的实验显示，在强基线下累积时间更优，推断时间与经典启发式相当，且比多轮神经调度器更快。
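
For readers unfamiliar with the "list-scheduling maps" mentioned above, the following sketch shows a classical greedy variant: given one priority score per task, repeatedly take the highest-scored ready task and place it on the compatible pool that finishes it earliest. It is not WeCAN's skip-extended rule. The data layout, the one-task-at-a-time pool model, and all names are assumptions made for illustration.

```python
def list_schedule(durations, deps, scores):
    """Greedy list scheduling on heterogeneous pools.

    durations[t][p] : run time of task t on pool p (pool absent = incompatible)
    deps[t]         : tasks that must finish before t starts
    scores[t]       : priority; higher goes first among ready tasks
    Returns ({task: (pool, start, finish)}, makespan).
    """
    pool_free = {}  # time at which each pool becomes idle
    placed = {}
    remaining = set(durations)
    while remaining:
        ready = [t for t in remaining
                 if all(d in placed for d in deps.get(t, ()))]
        if not ready:
            raise ValueError("dependency cycle or unknown predecessor")
        task = max(ready, key=lambda t: (scores[t], t))
        # A task cannot start before all of its predecessors have finished.
        earliest = max((placed[d][2] for d in deps.get(task, ())), default=0.0)
        best = None
        for pool, run in durations[task].items():
            start = max(earliest, pool_free.get(pool, 0.0))
            if best is None or start + run < best[2]:
                best = (pool, start, start + run)
        if best is None:
            raise ValueError(f"no compatible pool for task {task!r}")
        placed[task] = best
        pool_free[best[0]] = best[2]
        remaining.remove(task)
    return placed, max(finish for _, _, finish in placed.values())

# Hypothetical three-task DAG: C depends on A and B.
durations = {"A": {"cpu": 2, "gpu": 1}, "B": {"cpu": 3}, "C": {"gpu": 2}}
deps = {"C": {"A", "B"}}
scores = {"A": 0.9, "B": 0.5, "C": 0.1}
schedule, makespan = list_schedule(durations, deps, scores)
```

Since the map only ever picks from the currently ready set in score order, some feasible task orders can never be produced, which is the kind of restricted reachability the abstract attributes to such maps.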

Learning Multi-Agent Local Collision-Avoidance for Collaborative Carrying tasks with Coupled Quadrupedal Robots

学习与耦合四足机器人协作携带任务的多智能体局部碰撞避免

Authors: Francesca Bray, Simone Tolomei, Andrei Cramariuc, Cesar Cadena, Marco Hutter
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.23278
Pdf link: https://arxiv.org/pdf/2603.23278
Abstract Robotic collaborative carrying could greatly benefit human activities like warehouse and construction site management. However, coordinating the simultaneous motion of multiple robots represents a significant challenge. Existing works primarily focus on obstacle-free environments, making them unsuitable for most real-world applications. Works that account for obstacles either overfit to a specific terrain configuration or rely on pre-recorded maps combined with path planners to compute collision-free trajectories. This work focuses on two quadrupedal robots mechanically connected to a carried object. We propose a Reinforcement Learning (RL)-based policy that enables tracking a commanded velocity direction while avoiding collisions with nearby obstacles using only onboard sensing, eliminating the need for precomputed trajectories and complete map knowledge. Our work presents a hierarchical architecture, where a perceptive high-level object-centric policy commands two pretrained locomotion policies. Additionally, we employ a game-inspired curriculum to increase the complexity of obstacles in the terrain progressively. We validate our approach on two quadrupedal robots connected to a bar via spherical joints, benchmarking it against optimization-based and decentralized RL baselines. Our hardware experiments demonstrate the ability of our system to locomote in unknown environments without the need for a map or a path planner. The video of our work is available in the multimedia material.
中文摘要 机器人协作搬运将极大地促进仓库和建筑工地管理等人类活动。然而，协调多个机器人的同时运动仍是一项重大挑战。现有研究主要聚焦于无障碍环境，因此不适合大多数现实应用。考虑障碍物的作品，要么是对特定地形配置进行过拟合，要么依赖预录地图结合路径规划器来计算无碰撞轨迹。这项工作聚焦于两台四足机器人通过机械连接到携带物体。我们提出一种基于强化学习（RL）的策略，能够仅依靠车载感测追踪指令速度方向，同时避免与附近障碍物碰撞，消除对预先计算轨迹和完整地图知识的需求。我们的工作呈现了一个层级架构，其中一个洞察力强的高级对象中心策略指令两个预训练的移动策略。此外，我们采用游戏启发的课程，逐步提升地形障碍的复杂度。我们在两台通过球形关节连接到杆上的四足机器人上验证了我们的方法，并将其与基于优化和去中心化的强化学习基线进行基准对比。我们的硬件实验展示了系统能够在未知环境中运行，无需地图或路径规划器。我们的作品视频可在多媒体资料中观看。

Off-Policy Value-Based Reinforcement Learning for Large Language Models

大型语言模型的非策略价值强化学习

Authors: Peng-Yuan Wang, Ziniu Li, Tian Xu, Bohan Yang, Tian-Shuo Liu, ChenYang Wang, Xiong-Hui Chen, Yi-Chen Li, Tianyun Yang, Congliang Chen, Yang Yu
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.23355
Pdf link: https://arxiv.org/pdf/2603.23355
Abstract Improving data utilization efficiency is critical for scaling reinforcement learning (RL) for long-horizon tasks where generating trajectories is expensive. However, the dominant RL methods for LLMs are largely on-policy: they update on each batch of data only once, discard it, and then collect fresh samples, resulting in poor sample efficiency. In this work, we explore an alternative value-based RL framework for LLMs that naturally enables off-policy learning. We propose ReVal, a Bellman-update-based method that combines stepwise signals capturing internal consistency with trajectory-level signals derived from outcome verification. ReVal naturally supports replay-buffer-based training, allowing efficient reuse of past trajectories. Experiments on standard mathematical reasoning benchmarks show that ReVal not only converges faster but also outperforms GRPO in final performance. On DeepSeek-R1-Distill-1.5B, ReVal improves training efficiency and achieves improvements over GRPO of 2.7% on AIME24 and 4.5% on the out-of-domain benchmark GPQA. These results suggest that value-based RL is a practical alternative to policy-based methods for LLM training.
中文摘要 提高数据利用效率对于在生成轨迹成本高昂的长期任务中扩展强化学习（RL）至关重要。然而，主流的强化学习方法大多是策略性：每批数据只更新一次，丢弃后再收集新样本，导致样本效率较低。本研究探讨了一种基于价值的强化学习框架，用于大型语言模型，自然实现非策略学习。我们提出了ReVal方法，这是一种基于Bellman更新的方法，结合了逐步信号捕捉内部一致性与基于结果验证的轨迹级信号。ReVal自然支持基于重放缓冲的训练，允许高效重用过去轨迹。标准数学推理基准测试的实验表明，ReVal不仅收敛更快，而且在最终性能上优于GRPO。在DeepSeek-R1-Distill-1.5B测试下，ReVal提升了训练效率，在AIME24中提升了2.7%，在域外基准GPQA中相较GRPO提升了4.5%。这些结果表明，基于价值的强化学习是基于策略方法的实用替代方案。

A Joint Reinforcement Learning Scheduling and Compression Framework for Teleoperated Driving

远程驾驶的联合强化学习调度与压缩框架

Authors: Giacomo Avanzi, Marco Giordani, Michele Zorzi
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2603.23387
Pdf link: https://arxiv.org/pdf/2603.23387
Abstract Teleoperated driving (TD) is envisioned as a key application of future sixth generation (6G) networks. In this paradigm, connected vehicles transmit sensor-perception data to a remote (software) driver, which returns driving control commands to enhance traffic efficiency and road safety. This scenario requires reliable, low-latency communication between the vehicle and the remote driver. To this end, a promising solution is Predictive Quality of Service (PQoS), which provides mechanisms to estimate possible Quality of Service (QoS) degradation and trigger timely corrective network actions accordingly. In particular, Reinforcement Learning (RL) agents can be trained to identify the optimal PQoS configuration. In this paper, we develop and implement two integrated RL agents that jointly determine (i) the optimal compression configuration for TD sensor data to balance the trade-off between transmission efficiency and data quality, and (ii) the optimal scheduling configuration to minimize the end-to-end latency by allocating radio resources according to different priority levels. We show via full-stack ns-3 simulations that our integrated agents outperform any standalone model that optimizes only compression or only scheduling, especially in constrained or congested networks. While these agents can be deployed using either centralized or decentralized learning, we further propose a new meta-learning agent that dynamically selects the most appropriate strategy between the two based on current network conditions and application requirements.
中文摘要 远程驾驶（TD）被设想为未来第六代（6G）网络的关键应用。在这种模式下，联网车辆将传感器感知数据传输给远程（软件）驾驶员，后者返回驾驶控制指令，以提升交通效率和道路安全。这种情景要求保持车辆与远程驾驶员之间可靠且低延迟的通信。为此，一个有前景的解决方案是预测服务质量（PQoS），它提供了估计可能的服务质量（QoS）下降的机制，并相应触发及时的网络纠正措施。特别是，强化学习（RL）代理可以被训练以识别最优的PQoS配置。本文开发并实现了两个集成的强化学习代理，共同确定（i）TD传感器数据的最佳压缩配置，以平衡传输效率与数据质量之间的权衡，以及（ii）通过根据不同优先级分配无线资源，最小化端到端延迟的最佳调度配置。通过全栈ns-3仿真，我们证明我们的集成代理能够提供优于任何仅优化压缩或调度的独立模型，尤其是在受限或拥堵的网络中。虽然这些代理可以通过集中式或去中心化学习部署，但我们进一步提出了一种新的元学习代理，能够根据当前网络状况和应用需求动态选择两者之间最合适的策略。

SortedRL: Accelerating RL Training for LLMs through Online Length-Aware Scheduling

SortedRL：通过在线时长感知调度加速LLM的强化学习训练

Authors: Yiqi Zhang, Huiqiang Jiang, Xufang Luo, Zhihe Yang, Chengruidong Zhang, Yifei Shen, Dongsheng Li, Yuqing Yang, Lili Qiu, Yang You
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.23414
Pdf link: https://arxiv.org/pdf/2603.23414
Abstract Scaling reinforcement learning (RL) has shown strong promise for enhancing the reasoning abilities of large language models (LLMs), particularly in tasks requiring long chain-of-thought generation. However, RL training efficiency is often bottlenecked by the rollout phase, which can account for up to 70% of total training time when generating long trajectories (e.g., 16k tokens), due to slow autoregressive generation and synchronization overhead between rollout and policy updates. We propose SortedRL, an online length-aware scheduling strategy designed to address this bottleneck by improving rollout efficiency and maintaining training stability. SortedRL reorders rollout samples based on output lengths, prioritizing groups of short samples for early updates. This enables large rollout batches, flexible update batches, and near on-policy micro-curriculum construction simultaneously. To further accelerate the pipeline, SortedRL controls the degree of off-policy training through a cache-based mechanism and is supported by a dedicated RL infrastructure that manages rollout and update via a stateful controller and rollout buffer. Experiments using LLaMA-3.1-8B and Qwen-2.5-32B on diverse tasks, including logical puzzles and math challenges such as AIME 24, Math 500, and Minerva, show that SortedRL reduces RL training bubble ratios by over 50%, while attaining 3.9% to 18.4% superior performance over the baseline given the same amount of data.
中文摘要 扩展强化学习（RL）在提升大型语言模型（LLMs）推理能力方面展现出强烈潜力，尤其是在需要长思考链生成的任务中。然而，由于自回归生成缓慢且在推出与策略更新之间同步开销较大，强化学习训练效率常常被推出阶段（如1.6万代币）限制，推送阶段可能占总训练时间的70%。我们提出了SortedRL，一种在线时长感知调度策略，旨在通过提升推广效率和保持训练稳定性来解决这一瓶颈。SortedRL 根据输出长度重新排序滚动样本，优先选择形成组的短样本以进行早期更新。这使得大规模的推广批次、灵活的更新批次以及几乎符合政策的微课程建设成为可能。为了进一步加速流水线，SortedRL 采用了通过缓存机制控制非策略训练程度的机制，并由专用的 RL 基础设施支持，通过有状态控制器和推进缓冲区管理部署和更新。使用LLaMA-3.1-8B和Qwen-2.5-32B在多种任务中进行实验，包括逻辑谜题和AIME 24、Math 500和Minerval等数学挑战，显示SortedRL在相同数据量下，能将强化学习训练气泡比例降低超过50%，同时在相同数据量下，比基线性能提升3.9%至18.4%。
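
The core reordering idea above (shorter outputs are grouped and used for updates first) can be illustrated with a small offline sketch. It assumes all rollouts are already finished and simply sorts them, whereas SortedRL is described as an online scheduler with a stateful controller and a cache-based off-policy control, none of which is modeled here. The function name and data layout are hypothetical.

```python
def length_sorted_batches(rollouts, batch_size):
    """Group finished rollouts into update batches, shortest outputs first.

    rollouts: list of (sample_id, output_tokens) pairs.
    """
    ordered = sorted(rollouts, key=lambda r: len(r[1]))
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]

# Hypothetical rollouts with output lengths 5, 2, 9, and 3 tokens.
rollouts = [("a", [0] * 5), ("b", [0] * 2), ("c", [0] * 9), ("d", [0] * 3)]
batches = length_sorted_batches(rollouts, batch_size=2)
batch_ids = [[sample_id for sample_id, _ in batch] for batch in batches]
```

Batching samples of similar length means a batch of short outputs does not wait on one long generation, which is where the idle "bubble" time in the rollout phase comes from.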

End-to-End Efficient RL for Linear Bellman Complete MDPs with Deterministic Transitions

线性贝尔曼完全MDP的端到端高效强化学习，具有确定性转移

Authors: Zakaria Mhammedi, Alexander Rakhlin, Nneka Okolo
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.23461
Pdf link: https://arxiv.org/pdf/2603.23461
Abstract We study reinforcement learning (RL) with linear function approximation in Markov Decision Processes (MDPs) satisfying linear Bellman completeness, a fundamental setting where the Bellman backup of any linear value function remains linear. While this setting is statistically tractable, prior computationally efficient algorithms are either limited to small action spaces or require strong oracle assumptions over the feature space. We provide a computationally efficient algorithm for linear Bellman complete MDPs with deterministic transitions, stochastic initial states, and stochastic rewards. For finite action spaces, our algorithm is end-to-end efficient; for large or infinite action spaces, we require only a standard argmax oracle over actions. Our algorithm learns an $\varepsilon$-optimal policy with sample and computational complexity polynomial in the horizon, feature dimension, and $1/\varepsilon$.
中文摘要 我们研究在马尔可夫决策过程（MDP）中满足\emph{线性贝尔曼完备}——这是一个基本设定，使得任何线性值函数的贝尔曼备份保持线性。虽然统计上可处理，但以往计算效率高的算法要么局限于小动作空间，要么需要对特征空间进行强预言机假设。我们为具有\emph{确定性转移}、随机初始状态和随机奖励的线性Bellman完备MDP提供了一种计算效率高的算法。对于有限作用空间，我们的算法是端到端高效;对于大型或无限的动作空间，我们只需对动作使用标准的 argmax oracle。我们的算法学习一个 $\varepsilon$ 最优策略，视界、特征维度和 $1/\varepsilon$ 均为样本和计算复杂度多项式。
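
The completeness condition named above can be written out as follows. This is the standard textbook form, specialized to deterministic transitions, and not necessarily the paper's exact statement. The notation is assumed for illustration: $\phi(s,a) \in \mathbb{R}^d$ is the feature map, $T(s,a)$ the deterministic next state, $r(s,a)$ the (possibly random) reward, and the horizon index is suppressed.

```latex
\[
\forall\, \theta \in \mathbb{R}^d \;\; \exists\, \theta' \in \mathbb{R}^d : \qquad
\phi(s,a)^\top \theta'
\;=\;
\mathbb{E}\big[ r(s,a) \big]
\;+\;
\max_{a' \in \mathcal{A}} \phi\big(T(s,a),\, a'\big)^\top \theta
\qquad \text{for all } (s,a).
\]
```

In words: applying one Bellman backup to the linear value function with weights $\theta$ yields a function that is again linear in the same features, with weights $\theta'$.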

WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG

WildWorld：一个面向生成式ARPG的动态世界建模数据集，支持动作和显式状态

Authors: Zhen Li, Zian Meng, Shuwei Shi, Wenshuo Peng, Yuwei Wu, Bo Zheng, Chuanhao Li, Kaipeng Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.23497
Pdf link: https://arxiv.org/pdf/2603.23497
Abstract Dynamical systems theory and reinforcement learning view world evolution as latent-state dynamics driven by actions, with visual observations providing partial information about the state. Recent video world models attempt to learn these action-conditioned dynamics from data. However, existing datasets rarely meet this requirement: they typically lack diverse and semantically meaningful action spaces, and actions are directly tied to visual observations rather than mediated by underlying states. As a result, actions are often entangled with pixel-level changes, making it difficult for models to learn structured world dynamics and maintain consistent evolution over long horizons. In this paper, we propose WildWorld, a large-scale action-conditioned world modeling dataset with explicit state annotations, automatically collected from a photorealistic AAA action role-playing game (Monster Hunter: Wilds). WildWorld contains over 108 million frames and features more than 450 actions, including movement, attacks, and skill casting, together with synchronized per-frame annotations of character skeletons, world states, camera poses, and depth maps. We further derive WildBench to evaluate models through Action Following and State Alignment. Extensive experiments reveal persistent challenges in modeling semantically rich actions and maintaining long-horizon state consistency, highlighting the need for state-aware video generation. The project page is this https URL.
中文摘要 动力系统理论和强化学习将世界的演化视为由动作驱动的潜在状态动力学，视觉观测只提供关于状态的部分信息。近期的视频世界模型试图从数据中学习这种动作条件动力学。然而，现有数据集很少满足这一要求：它们通常缺乏多样且语义丰富的动作空间，且动作直接与视觉观测绑定，而非以底层状态为中介。因此，动作常常与像素级变化纠缠在一起，使模型难以学习结构化的世界动力学，也难以在长时间范围内保持一致的演化。本文提出WildWorld，这是一个带有显式状态标注的大规模动作条件世界建模数据集，自动采集自一款照片级逼真的AAA动作角色扮演游戏（Monster Hunter: Wilds，《怪物猎人：荒野》）。WildWorld包含超过1.08亿帧和450多种动作，涵盖移动、攻击和技能施放，并附有角色骨骼、世界状态、相机位姿和深度图的逐帧同步标注。我们进一步构建了WildBench，通过动作跟随（Action Following）和状态对齐（State Alignment）来评估模型。大量实验表明，在建模语义丰富的动作和保持长时程状态一致性方面仍存在持续的挑战，凸显了状态感知视频生成的必要性。项目页面是这个 https URL。

UniGRPO: Unified Policy Optimization for Reasoning-Driven Visual Generation

UniGRPO：推理驱动视觉生成的统一策略优化

Authors: Jie Liu, Zilyu Ye, Linxiao Yuan, Shenhan Zhu, Yu Gao, Jie Wu, Kunchang Li, Xionghui Wang, Xiaonan Nie, Weilin Huang, Wanli Ouyang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.23500
Pdf link: https://arxiv.org/pdf/2603.23500
Abstract Unified models capable of interleaved generation have emerged as a promising paradigm, with the community increasingly converging on autoregressive modeling for text and flow matching for image generation. To advance this direction, we propose a unified reinforcement learning framework tailored for interleaved generation. We validate our approach on its fundamental unit: a single round of reasoning-driven image generation, where the model first expands the user prompt through reasoning, followed by image synthesis. Formulating this multimodal generation process as a Markov Decision Process with sparse terminal rewards, we introduce UniGRPO to jointly optimize text and image generation policies using GRPO. Adopting a minimalist methodology to avoid over-design, we leverage established training recipes for both modalities by seamlessly integrating standard GRPO for reasoning and FlowGRPO for visual synthesis. To ensure scalability to multi-round interleaved generation, we introduce two critical modifications to the original FlowGRPO: (1) eliminating classifier-free guidance to maintain linear, unbranched rollouts, which is essential for scaling to complex scenarios involving multi-turn interactions and multi-condition generation (e.g., editing); and (2) replacing the standard latent KL penalty with an MSE penalty directly on the velocity fields, providing a more robust and direct regularization signal to mitigate reward hacking effectively. Our experiments demonstrate that this unified training recipe significantly enhances image generation quality through reasoning, providing a robust and scalable baseline for the future post-training of fully interleaved models.
中文摘要 能够进行交错生成的统一模型已成为一种有前景的范式，社区正日益趋向于用自回归建模处理文本、用流匹配处理图像生成。为推进这一方向，我们提出了一个为交错生成量身定制的统一强化学习框架。我们在其基本单元上验证了该方法：单轮推理驱动的图像生成，即模型先通过推理扩展用户提示，再进行图像合成。我们将这一多模态生成过程表述为具有稀疏终端奖励的马尔可夫决策过程，并引入UniGRPO，使用GRPO联合优化文本和图像生成策略。我们采用极简方法以避免过度设计，沿用两种模态各自成熟的训练方案，将用于推理的标准GRPO与用于视觉合成的FlowGRPO无缝整合。为确保可扩展到多轮交错生成，我们对原始FlowGRPO做了两项关键修改：（1）取消无分类器引导（classifier-free guidance），以保持线性、无分支的rollout，这对于扩展到涉及多轮交互和多条件生成（如编辑）的复杂场景至关重要；（2）用直接作用于速度场的MSE惩罚替代标准的潜空间KL惩罚，提供更稳健、更直接的正则化信号，从而有效缓解奖励作弊（reward hacking）。我们的实验表明，这一统一的训练方案通过推理显著提升了图像生成质量，为未来完全交错模型的后训练提供了稳健且可扩展的基线。
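
Digest note: the UniGRPO abstract names two ingredients, group-relative advantages (GRPO) and an MSE penalty on velocity fields. The sketch below shows only the generic form of each in plain Python. The function names, the use of population standard deviation, and the flat-list representation of a velocity field are assumptions made for illustration; the paper's actual loss, weighting, and tensor shapes are not given in the abstract.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize terminal rewards within one group of rollouts for the same prompt.

    Assumption: population std is used; implementations may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def velocity_mse_penalty(v_policy, v_reference):
    """Mean squared difference between two velocity predictions (flat lists).

    Hypothetical stand-in for a regularizer that keeps the trained model's
    velocity field close to a reference model's velocity field.
    """
    if len(v_policy) != len(v_reference):
        raise ValueError("velocity vectors must have the same length")
    return sum((p - q) ** 2 for p, q in zip(v_policy, v_reference)) / len(v_policy)


if __name__ == "__main__":
    # Four rollouts for one prompt with sparse terminal rewards.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    print(velocity_mse_penalty([1.0, 2.0], [1.0, 0.0]))
```

In a training loop of this kind, each rollout's advantage would typically scale its policy-gradient term, and the penalty would be added with some coefficient; both details are left out here because the abstract does not specify them.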

Keyword: diffusion policy

DiSCo: Diffusion Sequence Copilots for Shared Autonomy

DiSCo：面向共享自主的扩散序列副驾驶

Authors: Andy Wang, Xu Yan, Brandon McMahan, Michael Zhou, Yuyang Yuan, Johannes Y. Lee, Ali Shreif, Matthew Li, Zhenghao Peng, Bolei Zhou, Yuchen Cui, Jonathan C. Kao
Subjects: Human-Computer Interaction (cs.HC); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.22787
Pdf link: https://arxiv.org/pdf/2603.22787
Abstract Shared autonomy combines human user and AI copilot actions to control complex systems such as robotic arms. When a task is challenging, requires high dimensional control, or is subject to corruption, shared autonomy can significantly increase task performance by using a trained copilot to effectively correct user actions in a manner consistent with the user's goals. To significantly improve the performance of shared autonomy, we introduce Diffusion Sequence Copilots (DiSCo): a method of shared autonomy with diffusion policy that plans action sequences consistent with past user actions. DiSCo seeds and inpaints the diffusion process with user-provided actions, using hyperparameters to balance conformity to expert actions, alignment with user intent, and perceived responsiveness. We demonstrate that DiSCo substantially improves task performance in simulated driving and robotic arm tasks. Project website: this https URL
中文摘要 共享自主将人类用户的动作与AI副驾驶的动作结合起来，以控制机械臂等复杂系统。当任务具有挑战性、需要高维控制或易受损坏（corruption）影响时，共享自主可以利用训练好的副驾驶，以符合用户目标的方式有效纠正用户动作，从而显著提升任务表现。为显著提升共享自主的性能，我们提出了扩散序列副驾驶（DiSCo）：一种基于扩散策略的共享自主方法，能够规划与用户过去动作一致的动作序列。DiSCo用用户提供的动作对扩散过程进行初始化（seeding）和修补（inpainting），并通过超参数在与专家动作的一致性、与用户意图的对齐以及感知到的响应性之间取得平衡。我们证明了DiSCo在模拟驾驶和机械臂任务中显著提升了任务表现。项目网站：此 https URL

Efficient Hybrid SE(3)-Equivariant Visuomotor Flow Policy via Spherical Harmonics for Robot Manipulation

基于球谐函数的高效混合SE(3)等变视觉运动流策略，用于机器人操作

Authors: Qinglun Zhang, Shen Cheng, Tian Dan, Haoqiang Fan, Guanghui Liu, Shuaicheng Liu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.23227
Pdf link: https://arxiv.org/pdf/2603.23227
Abstract While existing equivariant methods enhance data efficiency, they suffer from high computational intensity, reliance on single-modality inputs, and instability when combined with fast-sampling methods. In this work, we propose E3Flow, a novel framework that addresses the critical limitations of equivariant diffusion policies. E3Flow overcomes these challenges, successfully unifying efficient rectified flow with stable, multi-modal equivariant learning for the first time. Our framework is built upon spherical harmonic representations to ensure rigorous SO(3) equivariance. We introduce a novel invariant Feature Enhancement Module (FEM) that dynamically fuses hybrid visual modalities (point clouds and images), injecting rich visual cues into the spherical harmonic features. We evaluate E3Flow on 8 manipulation tasks from MimicGen and further conduct 4 real-world experiments to validate its effectiveness in physical environments. Simulation results show that E3Flow achieves a 3.12% improvement in average success rate over the state-of-the-art Spherical Diffusion Policy (SDP) while simultaneously delivering a 7x inference speedup. E3Flow thus demonstrates a new and highly effective trade-off between performance, efficiency, and data efficiency for robotic policy learning. Code: this https URL.
中文摘要 现有的等变方法虽然提高了数据效率，但存在计算开销大、依赖单模态输入，以及与快速采样方法结合时不稳定等问题。在本研究中，我们提出了E3Flow，一种新颖的框架，用于解决等变扩散策略的关键局限。E3Flow克服了这些挑战，首次成功将高效的整流流（rectified flow）与稳定的多模态等变学习统一起来。我们的框架建立在球谐函数表示之上，以确保严格的SO(3)等变性。我们引入了一种新颖的不变特征增强模块（FEM），动态融合混合视觉模态（点云和图像），为球谐特征注入丰富的视觉线索。我们在MimicGen的8项操作任务上评估了E3Flow，并进一步开展了4项真实世界实验，以验证其在物理环境中的有效性。仿真结果显示，E3Flow的平均成功率比最先进的球面扩散策略（Spherical Diffusion Policy，SDP）高3.12%，同时推理速度提升7倍。因此，E3Flow在机器人策略学习的性能、效率和数据效率之间展示了一种新的、高效的权衡。代码：这个 https URL。
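
Digest note: the E3Flow abstract attributes its inference speedup to rectified flow, which is usually sampled by integrating a learned velocity field with a small number of Euler steps. The sketch below illustrates that sampling loop on a toy problem. The function names and the hand-written straight-line velocity field are assumptions for illustration; E3Flow's network, its spherical-harmonic features, and its equivariance are not modeled here.

```python
def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-step Euler."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # velocity is evaluated at the start of each step, so t < 1
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_straight_velocity(target):
    """Toy velocity field whose trajectories are straight lines ending at `target`.

    Hypothetical stand-in for a learned rectified-flow model.
    """
    def velocity(x, t):
        return [(gi - xi) / (1.0 - t) for gi, xi in zip(target, x)]
    return velocity


if __name__ == "__main__":
    v = make_straight_velocity([1.0, -2.0])
    print(euler_sample(v, [0.0, 0.0], num_steps=1))
    print(euler_sample(v, [0.0, 0.0], num_steps=8))
```

Because the toy trajectories are perfectly straight, one Euler step and eight Euler steps land on the same endpoint. This is the intuition behind few-step sampling with straightened flows, although a learned field is only approximately straight.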