Arxiv Papers of Today

Generated: 2026-03-09 16:50:48 (UTC+8); arXiv release time: 2026-03-09 20:00 EDT (2026-03-10 08:00 UTC+8)

34 relevant papers today

Keyword: reinforcement learning

Autocorrelation effects in a stochastic-process model for decision making via time series

Authors: Tomoki Yamagami, Mikio Hasegawa, Takatomo Mihana, Ryoichi Horisaki, Atsushi Uchida
Subjects: Machine Learning (cs.LG); Emerging Technologies (cs.ET); Probability (math.PR); Optics (physics.optics)
Arxiv link: https://arxiv.org/abs/2603.05559
Pdf link: https://arxiv.org/pdf/2603.05559
Abstract Decision makers exploiting photonic chaotic dynamics obtained by semiconductor lasers provide an ultrafast approach to solving multi-armed bandit problems by using a temporal optical signal as the driving source for sequential decisions. In such systems, the sampling interval of the chaotic waveform shapes the temporal correlation of the resulting time series, and experiments have reported that decision accuracy depends strongly on this autocorrelation property. However, it remains unclear whether the benefit of autocorrelation can be explained by a minimal mathematical model. Here, we analyze a stochastic-process model of the time-series-based decision making using the tug-of-war principle for solving the two-armed bandit problem, where the threshold and a two-valued Markov signal evolve jointly. Numerical results reveal an environment-dependent structure: negative (positive) autocorrelation is optimal in reward-rich (reward-poor) environments. These findings show that negative autocorrelation of the time series is advantageous when the sum of the winning probabilities is more than $1$, whereas positive autocorrelation is useful when the sum of the winning probabilities is less than $1$. Moreover, the performance is independent of autocorrelation if the sum of the winning probabilities equals $1$, which is mathematically clarified. This study paves the way for improving the decision-making scheme for reinforcement learning applications in wireless communications and robotics.
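
The abstract above refers to a two-valued Markov signal whose autocorrelation can be negative or positive. As a minimal sketch of that ingredient only (not the paper's full tug-of-war model, whose threshold dynamics are not given here), the following Python generates a symmetric two-valued chain. The names `markov_signal`, `lag1_autocorr`, and the parameter `lam` are assumptions made for illustration. For a symmetric chain that keeps its previous value with probability (1 + lam) / 2, the lag-1 autocorrelation equals lam.

```python
import random


def markov_signal(n, lam, seed=0):
    """Generate n samples of a symmetric two-valued (+1/-1) Markov chain.

    Assumed parameterization (not taken from the paper): the chain keeps
    its previous value with probability (1 + lam) / 2, so its lag-1
    autocorrelation is lam, with -1 <= lam <= 1.
    """
    rng = random.Random(seed)
    stay = (1.0 + lam) / 2.0
    s = rng.choice([-1, 1])  # start from the uniform stationary distribution
    out = []
    for _ in range(n):
        out.append(s)
        if rng.random() >= stay:  # flip with probability 1 - stay
            s = -s
    return out


def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a sequence."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    cov = sum((x[i] - mean) * (x[i + 1] - mean) for i in range(n - 1)) / (n - 1)
    return cov / var
```

Calling `lag1_autocorr(markov_signal(200000, -0.5))` should return a value close to -0.5, and a positive `lam` gives a positively correlated signal, which is the knob the abstract relates to reward-rich versus reward-poor environments.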

PRISM: Personalized Refinement of Imitation Skills for Manipulation via Human Instructions

Authors: Arnau Boix-Granell, Alberto San-Miguel-Tello, Magí Dalmau-Moreno, Néstor García
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.05574
Pdf link: https://arxiv.org/pdf/2603.05574
Abstract This paper presents PRISM, an instruction-conditioned refinement method for imitation policies in robotic manipulation. The approach bridges Imitation Learning (IL) and Reinforcement Learning (RL) frameworks into a seamless pipeline, such that an imitation policy for a broad generic task, generated from a set of user-guided demonstrations, can be refined through reinforcement to generate new, unseen fine-grained behaviours. The refinement process follows the Eureka paradigm, where reward functions for RL are iteratively generated from an initial natural-language task description. The presented approach builds on this mechanism to adapt a refined IL policy for a generic task to new goal configurations and newly introduced constraints by also adding human feedback corrections on intermediate rollouts, enabling policy reusability and therefore data efficiency. Results for a pick-and-place task in a simulated scenario show that the proposed method outperforms policies without human feedback, improving robustness at deployment and reducing computational burden.

A Novel Hybrid Heuristic-Reinforcement Learning Optimization Approach for a Class of Railcar Shunting Problems

Authors: Ruonan Zhao, Joseph Geunes
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2603.05579
Pdf link: https://arxiv.org/pdf/2603.05579
Abstract Railcar shunting is a core planning task in freight railyards, where yard planners need to disassemble and reassemble groups of railcars to form outbound trains. Classification tracks with access from one side only can be considered as stack structures, where railcars are added and removed from only one end, leading to a last-in-first-out (LIFO) retrieval order. In contrast, two-sided tracks function like queue structures, allowing railcars to be added from one end and removed from the opposite end, following a first-in-first-out (FIFO) order. We consider a problem requiring assembly of multiple outbound trains using two locomotives in a railyard with two-sided classification track access. To address this combinatorially challenging problem class, we decompose the problem into two subproblems, each with one-sided classification track access and a locomotive on each side. We present a novel Hybrid Heuristic-Reinforcement Learning (HHRL) framework that integrates railway-specific heuristic solution approaches with a reinforcement learning method, specifically Q-learning. The proposed framework leverages methods to decrease the state-action space and guide exploration during reinforcement learning. The results of a series of numerical experiments demonstrate the efficiency and quality of the HHRL algorithm in both one-sided access, single-locomotive problems and two-sided access, two-locomotive problems.
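
The stack-versus-queue analogy in the abstract above can be illustrated with a short Python sketch. The function names and the car labels are hypothetical and only show the retrieval order each track type imposes; they are not part of the HHRL framework.

```python
from collections import deque


def retrieve_one_sided(railcars):
    """One-sided track: cars enter and leave at the same end (LIFO)."""
    track = []
    for car in railcars:
        track.append(car)  # push onto the open end
    return [track.pop() for _ in range(len(track))]  # pop from the same end


def retrieve_two_sided(railcars):
    """Two-sided track: cars enter at one end and leave at the other (FIFO)."""
    track = deque()
    for car in railcars:
        track.append(car)  # enter at the back
    return [track.popleft() for _ in range(len(track))]  # leave at the front


arrivals = ["A", "B", "C"]  # hypothetical arrival order
print(retrieve_one_sided(arrivals))  # prints ['C', 'B', 'A']
print(retrieve_two_sided(arrivals))  # prints ['A', 'B', 'C']
```

With one-sided access, the car that must leave first may sit behind later arrivals, which is one source of the combinatorial difficulty the abstract mentions.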

Thinking with Spatial Code for Physical-World Video Reasoning

Authors: Jieneng Chen, Wenxin Ma, Ruisheng Yuan, Yunzhi Zhang, Jiajun Wu, Alan Yuille
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.05591
Pdf link: https://arxiv.org/pdf/2603.05591
Abstract We introduce Thinking with Spatial Code, a framework that transforms RGB video into explicit, temporally coherent 3D representations for physical-world visual question answering. We highlight the empirical finding that our proposed spatial encoder can parse videos into structured spatial code with explicit 3D oriented bounding boxes and semantic labels, enabling large language models (LLMs) to reason directly over explicit spatial variables. Specifically, we propose a spatial encoder that encodes image and geometric features by unifying 6D object parsing and tracking backbones with geometric prediction, and we further fine-tune LLMs with reinforcement learning using a spatial rubric reward that encourages perspective-aware, geometrically grounded inference. As a result, our model outperforms proprietary vision-language models on VSI-Bench, setting a new state of the art. Code is available at this https URL.

When Rubrics Fail: Error Enumeration as Reward in Reference-Free RL Post-Training for Virtual Try-On

Authors: Wisdom Ikezogwo, Mehmet Saygin Seyfioglu, Ranjay Krishna, Karim Bouyarmane
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.05659
Pdf link: https://arxiv.org/pdf/2603.05659
Abstract Reinforcement learning with verifiable rewards (RLVR) and Rubrics as Rewards (RaR) have driven strong gains in domains with clear correctness signals and even in subjective domains by synthesizing evaluation criteria from ideal reference answers. But many real-world tasks admit multiple valid outputs and lack the single ideal answer that rubric generation depends on. We identify this reference-free setting as a gap in current post-training methods and propose Implicit Error Counting (IEC) to fill it. Instead of checking what a response gets right against a rubric, IEC enumerates what it gets wrong, applying severity-weighted scores across task-relevant axes and converting them into calibrated per-aspect rewards. We show that naïve explicit enumeration is too noisy for stable optimization, and that two design choices, implicit score emission and group calibration, are necessary to make error counting a reliable reward. As a case study, we validate IEC on virtual try-on (VTO), a domain that is simultaneously too constrained for holistic scoring and too permissive for rubric-based evaluation: subtle garment errors are unacceptable, yet many output variations are correct. We introduce Cascaded Error Counting (CEC) as an evaluation metric, which tracks human preferences well (60% top-1 vs. 30% others), and curate Mismatch-DressCode (MDressBench), a benchmark with maximal attribute mismatch to stress-test reward designs. On MDressBench, IEC outperforms RaR across all metrics (CEC: 5.31 vs. 5.60 on flat references; 5.20 vs. 5.53 on non-flat). On VITON-HD and DressCode, IEC matches or surpasses six baselines on 6 of 8 perceptual metrics. These results suggest that when ideal answers are unavailable, counting errors provides a stronger signal than constructing rubrics.

MIRACL: A Diverse Meta-Reinforcement Learning for Multi-Objective Multi-Echelon Combinatorial Supply Chain Optimisation

Authors: Rifny Rachman, Josh Tingey, Richard Allmendinger, Wei Pan, Pradyumn Shukla, Bahrul Ilmi Nasution
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.05760
Pdf link: https://arxiv.org/pdf/2603.05760
Abstract Multi-objective reinforcement learning (MORL) is effective for multi-echelon combinatorial supply chain optimisation, where tasks involve high dimensionality, uncertainty, and competing objectives. However, its deployment in dynamic environments is hindered by the need for task-specific retraining and substantial computational cost. We introduce MIRACL (Meta multI-objective Reinforcement leArning with Composite Learning), a hierarchical Meta-MORL framework that allows for a few-shot generalisation across diverse tasks. MIRACL decomposes each task into structured subproblems for efficient policy adaptation and meta-learns a global policy across tasks using a Pareto-based adaptation strategy to encourage diversity in meta-training and fine-tuning. To our knowledge, this is the first integration of Meta-MORL with such mechanisms in combinatorial optimisation. Although validated in the supply chain domain, MIRACL is theoretically domain-agnostic and applicable to broader dynamic multi-objective decision-making problems. Empirical evaluations show that MIRACL outperforms conventional MORL baselines in simple to moderate tasks, achieving up to 10% higher hypervolume and 5% better expected utility. These results underscore the potential of MIRACL for robust, efficient adaptation in multi-objective problems.

Task-Level Decisions to Gait Level Control: A Hierarchical Policy Approach for Quadruped Navigation

Authors: Sijia Li, Haoyu Wang, Shenghai Yuan, Yizhuo Yang, Thien-Minh Nguyen
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.05783
Pdf link: https://arxiv.org/pdf/2603.05783
Abstract Real-world quadruped navigation is constrained by a scale mismatch between high-level navigation decisions and low-level gait execution, as well as by instabilities under out-of-distribution environmental changes. Such variations challenge sim-to-real transfer and can trigger falls when policies lack explicit interfaces for adaptation. In this paper, we present a hierarchical policy architecture for quadrupedal navigation, termed Task-level Decision to Gait Control (TDGC). A low-level policy, trained with reinforcement learning in simulation, delivers gait-conditioned locomotion and maps task requirements to a compact set of controllable behavior parameters, enabling robust mode generation and smooth switching. A high-level policy makes task-centric decisions from sparse semantic or geometric terrain cues and translates them into low-level targets, forming a traceable decision pipeline without dense maps or high-resolution terrain reconstruction. Different from end-to-end approaches, our architecture provides explicit interfaces for deployment-time tuning, fault diagnosis, and policy refinement. We introduce a structured curriculum with performance-driven progression that expands environmental difficulty and disturbance ranges. Experiments show higher task success rates on mixed terrains and out-of-distribution tests.

OpenHEART: Opening Heterogeneous Articulated Objects with a Legged Manipulator

Authors: Seonghyeon Lim, Hyeonwoo Lee, Seunghyun Lee, I Made Aswin Nahrendra, Hyun Myung
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.05830
Pdf link: https://arxiv.org/pdf/2603.05830
Abstract Legged manipulators offer high mobility and versatile manipulation. However, robust interaction with heterogeneous articulated objects, such as doors, drawers, and cabinets, remains challenging because of the diverse articulation types of the objects and the complex dynamics of the legged robot. Existing reinforcement learning (RL)-based approaches often rely on high-dimensional sensory inputs, leading to sample inefficiency. In this paper, we propose a robust and sample-efficient framework for opening heterogeneous articulated objects with a legged manipulator. In particular, we propose Sampling-based Abstracted Feature Extraction (SAFE), which encodes handle and panel geometry into a compact low-dimensional representation, improving cross-domain generalization. Additionally, Articulation Information Estimator (ArtIEst) is introduced to adaptively mix proprioception with exteroception to estimate opening direction and range of motion for each object. The proposed framework was deployed to manipulate various heterogeneous articulated objects in simulation and real-world robot systems. Videos can be found on the project website: this https URL

Expert Knowledge-driven Reinforcement Learning for Autonomous Racing via Trajectory Guidance and Dynamics Constraints

Authors: Bo Leng, Weiqi Zhang, Zhuoren Li, Lu Xiong, Guizhe Jin, Ran Yu, Chen Lv
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.05842
Pdf link: https://arxiv.org/pdf/2603.05842
Abstract Reinforcement learning has demonstrated significant potential in the field of autonomous driving. However, it suffers from defects such as training instability and unsafe action outputs when faced with autonomous racing environments characterized by high dynamics and strong nonlinearities. To this end, this paper proposes a trajectory guidance and dynamics constraints Reinforcement Learning (TraD-RL) method for autonomous racing. The key features of this method are as follows: 1) leveraging the prior expert racing line to construct an augmented state representation and facilitate reward shaping, thereby integrating domain knowledge to stabilize early-stage policy learning; 2) embedding explicit vehicle dynamic priors into a safe operating envelope formulated via control barrier functions to enable safety-constrained learning; and 3) adopting a multi-stage curriculum learning strategy that shifts from expert-guided learning to autonomous exploration, allowing the learned policy to surpass expert-level performance. The proposed method is evaluated in a high-fidelity simulation environment modeled after the Tempelhof Airport Street Circuit. Experimental results demonstrate that TraD-RL effectively improves both lap speed and driving stability of the autonomous racing vehicle, achieving a synergistic optimization of racing performance and safety.

ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning

Authors: Juyong Jiang, Jiasi Shen, Sunghun Kim, Kang Min Yoo, Jeonghoon Kim, Sungju Kim
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2603.05863
Pdf link: https://arxiv.org/pdf/2603.05863
Abstract While Large Language Models (LLMs) have revolutionized code generation, standard "System 1" approaches, generating solutions in a single forward pass, often hit a performance ceiling when faced with complex algorithmic tasks. Existing iterative refinement strategies attempt to bridge this gap at inference time, yet they predominantly rely on external oracles, execution feedback, or computationally expensive prompt-response cycles. In this work, we propose ReflexiCoder, a novel reinforcement learning (RL) framework that internalizes the structured reasoning trajectory, encompassing initial generation, bug- and optimization-aware reflection, and self-correction, directly into the model's weights. Unlike prior methods, ReflexiCoder shifts the paradigm from externally dependent refinement to intrinsic, fully autonomous self-reflection and self-correction capabilities at inference time. We utilize an RL-zero training paradigm with granular reward functions to optimize the entire reflection-correction trajectory, teaching the model how to debug without reliance on ground-truth feedback or execution engines at inference time. Extensive experiments across seven benchmarks demonstrate that our ReflexiCoder-8B establishes a new state of the art (SOTA) among leading open-source models in the 1.5B-14B range, achieving 94.51% (87.20%) on HumanEval (Plus), 81.80% (78.57%) on MBPP (Plus), 35.00% on BigCodeBench, 52.21% on LiveCodeBench, and 37.34% on CodeForces in a single-attempt setting, rivaling or surpassing proprietary models like GPT-5.1. Notably, our framework is significantly more token-efficient than base models, reducing inference-time compute overhead by approximately 40% through disciplined, high-speed reasoning and reflection patterns. Source code is available at this https URL.

PatchCue: Enhancing Vision-Language Model Reasoning with Patch-Based Visual Cues

Authors: Yukun Qi, Pei Fu, Hang Li, Yuhan Liu, Chao Jiang, Bin Qin, Zhenbo Luo, Jian Luan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.05869
Pdf link: https://arxiv.org/pdf/2603.05869
Abstract Vision-Language Models (VLMs) have achieved remarkable progress on a wide range of challenging multimodal understanding and reasoning tasks. However, existing reasoning paradigms, such as the classical Chain-of-Thought (CoT), rely solely on textual information and often underutilize important visual cues. While prior work has incorporated pixel-level visual cues, these representations require precise spatial localization, introducing additional learning complexity. To address this, we propose PatchCue, a novel patch-based visual cue paradigm designed to significantly enhance the visual reasoning capabilities of VLMs. By partitioning images into patches and representing cues at the patch level, PatchCue aligns better with human perceptual habits and leverages the patch-tokenized input of modern VLMs. We train VLMs using a two-stage approach: cold-start supervised fine-tuning to output patch-level cues, followed by reinforcement learning with a process-supervised cue reward that guides intermediate visual reasoning steps. Extensive experiments on multiple VLMs and diverse benchmarks, including general visual question answering, complex reasoning, and document understanding, demonstrate that PatchCue consistently improves overall model performance. Our results show that patch-level cues outperform both pixel-level bounding boxes and point-based cues, providing a more effective and cognitively aligned visual reasoning paradigm.

Confidence Before Answering: A Paradigm Shift for Efficient LLM Uncertainty Estimation

Authors: Changcheng Li, Jiancan Wu, Hengheng Zhang, Zhengsu Chen, Guo An, Junxiang Qiu, Xiang Wang, Qi Tian
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.05881
Pdf link: https://arxiv.org/pdf/2603.05881
Abstract Reliable deployment of large language models (LLMs) requires accurate uncertainty estimation. Existing methods are predominantly answer-first, producing confidence only after generating an answer, which measures the correctness of a specific response and limits practical usability. We study a confidence-first paradigm, where the model outputs its confidence before answering, interpreting this score as the model's probability of answering the question correctly under its current policy. We propose CoCA (Co-optimized Confidence and Answers), a GRPO reinforcement learning framework that jointly optimizes confidence calibration and answer accuracy via segmented credit assignment. By assigning separate rewards and group-relative advantages to confidence and answer segments, CoCA enables stable joint optimization and avoids reward hacking. Experiments across math, code, and factual QA benchmarks show improved calibration and uncertainty discrimination while preserving answer quality, thereby enabling a broader range of downstream applications.
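
The idea of giving the confidence and answer segments separate rewards and separate group-relative advantages can be sketched as follows. This is a rough illustration, not CoCA's actual formulation: the within-group standardization is the usual GRPO-style advantage, while both reward definitions (squared gap to group accuracy for confidence, plain correctness for answers) and all function names are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def segmented_advantages(confidences, correct):
    """Return (confidence advantages, answer advantages) for one group.

    confidences: stated confidence in [0, 1] for each sampled response.
    correct: 1 if that response's answer was right, else 0.
    Both reward definitions below are illustrative assumptions.
    """
    group_accuracy = mean(correct)
    # Hypothetical confidence reward: negative squared gap to group accuracy.
    conf_rewards = [-(c - group_accuracy) ** 2 for c in confidences]
    # Hypothetical answer reward: plain correctness.
    ans_rewards = [float(y) for y in correct]
    return (group_relative_advantages(conf_rewards),
            group_relative_advantages(ans_rewards))
```

In this sketch a response can receive a positive answer advantage and a negative confidence advantage at the same time, so credit for the two segments stays separate instead of being mixed into a single scalar.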

Reference-guided Policy Optimization for Molecular Optimization via LLM Reasoning

Authors: Xuan Li, Zhanke Zhou, Zongze Li, Jiangchao Yao, Yu Rong, Lu Zhang, Bo Han
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.05900
Pdf link: https://arxiv.org/pdf/2603.05900
Abstract Large language models (LLMs) benefit substantially from supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) in reasoning tasks. However, these recipes perform poorly in instruction-based molecular optimization, where each data point typically provides only a single optimized reference molecule and no step-by-step optimization trajectory. We reveal that answer-only SFT on the reference molecules collapses reasoning, and RLVR provides sparse feedback under similarity constraints due to the model's lack of effective exploration, which slows learning and limits optimization. To encourage the exploration of new molecules while balancing the exploitation of the reference molecules, we introduce Reference-guided Policy Optimization (RePO), an optimization approach that learns from reference molecules without requiring trajectory data. At each update, RePO samples candidate molecules with their intermediate reasoning trajectories from the model and trains the model using verifiable rewards that measure property satisfaction under similarity constraints in an RL manner. Meanwhile, it applies reference guidance by keeping the policy's intermediate reasoning trajectory as context and training only the answer in a supervised manner. Together, the RL term promotes exploration, while the guidance term mitigates reward sparsity and stabilizes training by grounding outputs to references when many valid molecular edits exist. Across molecular optimization benchmarks, RePO consistently outperforms SFT and RLVR baselines (e.g., GRPO), achieving improvements on the optimization metric (Success Rate $\times$ Similarity), improving balance across competing objectives, and generalizing better to unseen instruction styles. Our code is publicly available at this https URL.

Swooper: Learning High-Speed Aerial Grasping With a Simple Gripper

Authors: Ziken Huang, Xinze Niu, Bowen Chai, Renbiao Jin, Danping Zou
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.05935
Pdf link: https://arxiv.org/pdf/2603.05935
Abstract High-speed aerial grasping presents significant challenges due to the high demands on precise, responsive flight control and coordinated gripper manipulation. In this work, we propose Swooper, a deep reinforcement learning (DRL) based approach that achieves both precise flight control and active gripper control using a single lightweight neural network policy. Training such a policy directly via DRL is nontrivial due to the complexity of coordinating flight and grasping. To address this, we adopt a two-stage learning strategy: we first pre-train a flight control policy, and then fine-tune it to acquire grasping skills. With the carefully designed reward functions and training framework, the entire training process completes in under 60 minutes on a standard desktop with an Nvidia RTX 3060 GPU. To validate the trained policy in the real world, we develop a lightweight quadrotor grasping platform equipped with a simple off-the-shelf gripper, and deploy the policy in a zero-shot manner on the onboard Raspberry Pi 4B computer, where each inference takes only about 1.0 ms. In 25 real-world trials, our policy achieves an 84% grasp success rate and grasping speeds of up to 1.5 m/s without any fine-tuning. This matches the robustness and agility of state-of-the-art classical systems with sophisticated grippers, highlighting the capability of DRL for learning a robust control policy that seamlessly integrates high-speed flight and grasping. The supplementary video is available for more results. Video: this https URL.

How to Model Your Crazyflie Brushless

Authors: Alexander Gräfe, Christoph Scherer, Wolfgang Hönig, Sebastian Trimpe
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.05944
Pdf link: https://arxiv.org/pdf/2603.05944
Abstract The Crazyflie quadcopter is widely recognized as a leading platform for nano-quadcopter research. In early 2025, the Crazyflie Brushless was introduced, featuring brushless motors that provide around 50% more thrust compared to the brushed motors of its predecessor, the Crazyflie 2.1. This advancement has opened new opportunities for research in agile nano-quadcopter control. To support researchers utilizing this new platform, this work presents a dynamics model of the Crazyflie Brushless and identifies its key parameters. Through simulations and hardware analyses, we assess the accuracy of our model. We furthermore demonstrate its suitability for reinforcement learning applications by training an end-to-end neural network position controller and learning a backflip controller capable of executing two complete rotations with a vertical movement of just 1.8 meters. This showcases the model's ability to facilitate the learning of controllers and acrobatic maneuvers that successfully transfer from simulation to hardware. Utilizing this application, we investigate the impact of domain randomization on control performance, offering valuable insights into bridging the sim-to-real gap with the presented model. We have open-sourced the entire project, enabling users of the Crazyflie Brushless to swiftly implement and test their own controllers on an accurate simulation platform.
中文摘要 Crazyflie四旋翼被广泛认为是纳米四旋翼研究的领先平台。2025年初，Crazyflie Brushless 问世，其无刷电机的推力比前代 Crazyflie 2.1 的有刷电机高出约50%。这一进展为敏捷纳米四旋翼控制的研究开辟了新机遇。为支持使用这一新平台的研究人员，本研究提出了 Crazyflie Brushless 的动力学模型，并辨识了其关键参数。通过仿真和硬件分析，我们评估了模型的准确性。我们还通过训练端到端神经网络位置控制器，并学习一个仅用1.8米垂直位移即可完成两圈完整旋转的后空翻控制器，展示了该模型在强化学习应用中的适用性。这展示了模型促进控制器和特技动作学习的能力，且这些控制器和动作能成功从仿真迁移到硬件。利用该应用，我们探讨了领域随机化对控制性能的影响，为借助所提模型弥合仿真与现实差距提供了宝贵见解。我们已将整个项目开源，使Crazyflie Brushless用户能够快速在精确的仿真平台上实现和测试自己的控制器。

LucidNFT: LR-Anchored Multi-Reward Preference Optimization for Generative Real-World Super-Resolution

LucidNFT：生成现实世界超分辨率的LR锚定多奖励偏好优化

Authors: Song Fei, Tian Ye, Sixiang Chen, Zhaohu Xing, Jianyu Lai, Lei Zhu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.05947
Pdf link: https://arxiv.org/pdf/2603.05947
Abstract Generative real-world image super-resolution (Real-ISR) can synthesize visually convincing details from severely degraded low-resolution (LR) inputs, yet its stochastic sampling makes a critical failure mode hard to avoid: outputs may look sharp but be unfaithful to the LR evidence (semantic and structural hallucination), while such LR-anchored faithfulness is difficult to assess without HR ground truth. Preference-based reinforcement learning (RL) is a natural fit because each LR input yields a rollout group of candidates to compare. However, effective alignment in Real-ISR is hindered by (i) the lack of a degradation-robust LR-referenced faithfulness signal, and (ii) a rollout-group optimization bottleneck where naive multi-reward scalarization followed by normalization compresses objective-wise contrasts, causing advantage collapse and weakening the reward-weighted updates in DiffusionNFT-style forward fine-tuning. Moreover, (iii) limited coverage of real degradations restricts rollout diversity and preference signal quality. We propose LucidNFT, a multi-reward RL framework for flow-matching Real-ISR. LucidNFT introduces LucidConsistency, a degradation-robust semantic evaluator that makes LR-anchored faithfulness measurable and optimizable; a decoupled advantage normalization strategy that preserves objective-wise contrasts within each LR-conditioned rollout group before fusion, preventing advantage collapse; and LucidLR, a large-scale collection of real-world degraded images to support robust RL fine-tuning. Experiments show that LucidNFT consistently improves strong flow-based Real-ISR baselines, achieving better perceptual-faithfulness trade-offs with stable optimization dynamics across diverse real-world scenarios.
中文摘要 生成式真实世界图像超分辨率（Real-ISR）能够从严重退化的低分辨率（LR）输入中合成视觉上令人信服的细节，但其随机采样使得一种关键失效模式难以避免：输出可能看起来锐利，却与LR证据不符（语义和结构幻觉），而这种以LR为锚的忠实度在没有HR真值的情况下难以评估。基于偏好的强化学习（RL）是自然的选择，因为每个LR输入都会生成一组可供比较的rollout候选。然而，Real-ISR中的有效对齐受到以下因素阻碍：（i）缺乏对退化鲁棒的LR参考忠实度信号；（ii）rollout组优化瓶颈，即朴素的多奖励标量化后再归一化会压缩各目标间的对比，导致优势崩溃，削弱DiffusionNFT式前向微调中奖励加权更新的效果。此外，（iii）对真实退化的有限覆盖限制了rollout多样性和偏好信号质量。我们提出了LucidNFT，一个面向流匹配Real-ISR的多奖励强化学习框架。LucidNFT 引入了 LucidConsistency，一个对退化鲁棒的语义评估器，使以LR为锚的忠实度可测量、可优化；一种解耦优势归一化策略，在融合前保留每个LR条件rollout组内各目标间的对比，防止优势崩溃；以及LucidLR，一个大规模的真实世界退化图像集合，用于支持稳健的强化学习微调。实验显示，LucidNFT持续提升基于流的强Real-ISR基线，在多种真实场景下实现更佳的感知-忠实度权衡，同时保持稳定的优化动态。
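
A minimal sketch of the contrast the LucidNFT abstract draws between "scalarize, then normalize" and decoupled advantage normalization. The z-score normalizer, the weighted-sum fusion, the function names, and the toy rewards are all assumptions made for illustration; the abstract does not give the paper's exact formulas.

```python
from statistics import mean, pstdev


def zscore(values, eps=1e-8):
    """Standardize a list of numbers (population std, eps avoids division by zero)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / (s + eps) for v in values]


def scalarize_then_normalize(rewards, weights):
    """Naive baseline: weighted sum per candidate first, one z-score afterwards."""
    fused = [sum(w * r for w, r in zip(weights, cand)) for cand in rewards]
    return zscore(fused)


def decoupled_normalize(rewards, weights):
    """Assumed decoupled variant: z-score each objective within the rollout
    group on its own, then fuse the per-objective advantages with the weights."""
    n_obj = len(weights)
    per_obj = [zscore([cand[k] for cand in rewards]) for k in range(n_obj)]
    return [
        sum(weights[k] * per_obj[k][i] for k in range(n_obj))
        for i in range(len(rewards))
    ]


# Hypothetical rollout group for one LR input: (faithfulness, perceptual score).
# The two objectives live on very different scales.
group = [(0.9, 50.0), (0.1, 52.0), (0.9, 52.0), (0.1, 50.0)]
weights = (1.0, 1.0)

naive = scalarize_then_normalize(group, weights)
decoupled = decoupled_normalize(group, weights)
```

In this toy group the naive variant ranks the unfaithful candidate `(0.1, 52.0)` above the faithful `(0.9, 50.0)` because the large-scale objective dominates the sum, while the decoupled variant scores the two equally and keeps the faithfulness contrast visible.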

TADPO: Reinforcement Learning Goes Off-road

TADPO：强化学习越野

Authors: Zhouchonghao Wu, Raymond Song, Vedant Mundheda, Luis E. Navarro-Serment, Christof Schoenborn, Jeff Schneider
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.05995
Pdf link: https://arxiv.org/pdf/2603.05995
Abstract Off-road autonomous driving poses significant challenges such as navigating unmapped, variable terrain with uncertain and diverse dynamics. Addressing these challenges requires effective long-horizon planning and adaptable control. Reinforcement Learning (RL) offers a promising solution by learning control policies directly from interaction. However, because off-road driving is a long-horizon task with low-signal rewards, standard RL methods are challenging to apply in this setting. We introduce TADPO, a novel policy gradient formulation that extends Proximal Policy Optimization (PPO), leveraging off-policy trajectories for teacher guidance and on-policy trajectories for student exploration. Building on this, we develop a vision-based, end-to-end RL system for high-speed off-road driving, capable of navigating extreme slopes and obstacle-rich terrain. We demonstrate our performance in simulation and, importantly, zero-shot sim-to-real transfer on a full-scale off-road vehicle. To our knowledge, this work represents the first deployment of RL-based policies on a full-scale off-road platform.
中文摘要 越野自动驾驶面临重大挑战，例如在未绘制地图、变化多端且动力学特性不确定而多样的地形中导航。应对这些挑战需要有效的长时域规划和自适应控制。强化学习（RL）通过直接从交互中学习控制策略，提供了一个有前景的解决方案。然而，由于越野驾驶是一项长时域且奖励信号微弱的任务，标准强化学习方法难以在此场景中应用。我们提出了TADPO，一种新颖的策略梯度形式，它扩展了近端策略优化（PPO），利用离策略（off-policy）轨迹提供教师指导，利用同策略（on-policy）轨迹进行学生探索。基于此，我们开发了一套基于视觉的端到端强化学习系统，用于高速越野驾驶，能够穿越极端坡道和障碍物密集的地形。我们在仿真中展示了其性能，更重要的是，在全尺寸越野车上实现了零样本的仿真到现实迁移。据我们所知，这项工作是首次在全尺寸越野平台上部署基于强化学习的策略。

ViewFusion: Structured Spatial Thinking Chains for Multi-View Reasoning

ViewFusion：多视角推理的结构化空间思维链

Authors: Xingjian Tao, Yiwei Wang, Yujun Cai, Yifan Song, Jing Tang
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.06024
Pdf link: https://arxiv.org/pdf/2603.06024
Abstract Multi-view spatial reasoning remains difficult for current vision-language models. Even when multiple viewpoints are available, models often underutilize cross-view relations and instead rely on single-image shortcuts, leading to fragile performance on viewpoint transformation and occlusion-sensitive cases. We present ViewFusion, a two-stage framework that explicitly separates cross-view spatial pre-alignment from question answering. In the first stage, the model performs deliberate spatial pre-thinking to infer viewpoint relations and spatial transformations across views, forming an intermediate workspace that goes beyond a simple re-description. In the second stage, the model conducts question-driven reasoning conditioned on this workspace to produce the final prediction. We train ViewFusion with synthetic reasoning supervision followed by reinforcement learning using GRPO, which improves answer correctness while stabilizing the intended two-stage generation behavior. On MMSI-Bench, ViewFusion improves accuracy by 5.3% over Qwen3-VL-4B-Instruct, with the largest gains on examples that require genuine cross-view alignment.
中文摘要 多视角空间推理对于当前的视觉-语言模型来说仍然困难。即使有多个视角可用，模型通常也未能充分利用跨视角关系，而是依赖单图像捷径，导致在视角变换和遮挡敏感情形下性能脆弱。我们提出ViewFusion，这是一个两阶段框架，明确将跨视角空间预对齐与问答分离。在第一阶段，模型进行有意识的空间预思考，以推断视角关系和跨视图的空间变换，形成一个超越简单重述的中间工作空间。第二阶段，模型基于该工作空间进行问题驱动推理，生成最终预测。我们先用合成推理监督训练ViewFusion，随后使用GRPO进行强化学习，这不仅提升了答案的正确性，还稳定了预期的两阶段生成行为。在MMSI-Bench上，ViewFusion的准确率比Qwen3-VL-4B-Instruct高5.3%，在需要真正跨视图对齐的示例上提升最大。

Learning to Generate via Understanding: Understanding-Driven Intrinsic Rewarding for Unified Multimodal Models

通过理解学习生成：统一多模态模型的理解驱动内在奖励

Authors: Jiadong Pan, Liang Li, Yuxin Peng, Yu-Ming Tang, Shuohuan Wang, Yu Sun, Hua Wu, Qingming Huang, Haifeng Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.06043
Pdf link: https://arxiv.org/pdf/2603.06043
Abstract Recently, unified multimodal models (UMMs) have made remarkable progress in integrating visual understanding and generation, demonstrating strong potential for complex text-to-image (T2I) tasks. Despite their theoretical promise, a persistent capability gap exists: UMMs typically exhibit superior visual understanding but comparatively weaker generative capabilities. This discrepancy arises largely from the intrinsic decoupling between the understanding and generation processes. While a UMM can accurately interpret fine-grained visual details, it often struggles to produce semantically coherent images from complex textual prompts. To address this challenge, we explore UMMs' internal understanding capability to enhance generation quality. We propose a token-level intrinsic text-image alignment reward mechanism, GvU, enabling the UMM to act simultaneously as teacher and student: it evaluates its own outputs using the understanding branch to guide the generations accordingly. Building upon this, we design a self-supervised reinforcement learning framework, allowing UMMs to iteratively improve their generation quality through understanding-based intrinsic reward signals--without reliance on external supervision. Experimental results show that our method substantially boosts UMMs' generation, which in turn strengthens their fine-grained visual understanding, narrowing the capability gap between UMMs' visual understanding and generation.
中文摘要 近年来，统一多模态模型（UMM）在整合视觉理解与生成方面取得了显著进展，在复杂的文本到图像（T2I）任务上展现出强大潜力。尽管理论上前景看好，但仍存在持续的能力差距：UMM通常表现出更优越的视觉理解，但生成能力相对较弱。这种差异主要源于理解过程与生成过程之间的内在解耦。虽然 UMM 能够准确解读细粒度的视觉细节，但它常常难以根据复杂的文本提示生成语义连贯的图像。为应对这一挑战，我们探索利用UMM内部的理解能力来提升生成质量。我们提出了一种token级的内在文本-图像对齐奖励机制GvU，使UMM能够同时扮演教师和学生的角色：它通过理解分支评估自身输出，从而相应地引导生成。基于此，我们设计了一个自监督强化学习框架，使UMM能够通过基于理解的内在奖励信号迭代提升生成质量，而无需依赖外部监督。实验结果显示，我们的方法显著提升了UMM的生成能力，进而强化了其细粒度的视觉理解，缩小了UMM视觉理解与生成之间的能力差距。

Devil is in Narrow Policy: Unleashing Exploration in Driving VLA Models

魔鬼藏在狭窄策略中：释放驾驶VLA模型的探索能力

Authors: Canyu Chen, Yuguang Yang, Zhewen Tan, Yizhi Wang, Ruiyi Zhan, Haiyan Liu, Xuanyao Mao, Jason Bao, Xinyue Tang, Linlin Yang, Bingchuan Sun, Yan Wang, Baochang Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.06049
Pdf link: https://arxiv.org/pdf/2603.06049
Abstract We identify a fundamental Narrow Policy limitation undermining the performance of autonomous VLA models, where driving Imitation Learning (IL) tends to collapse exploration and limit the potential of subsequent Reinforcement Learning (RL) stages, which often saturate prematurely due to insufficient feedback diversity. Thereby, we propose Curious-VLA, a framework that alleviates the exploit-explore dilemma through a two-stage design. During IL, we introduce a Feasible Trajectory Expansion (FTE) strategy to generate multiple physically valid trajectories and a step-wise normalized trajectory representation to adapt this diverse data. In the RL stage, we present Adaptive Diversity-Aware Sampling (ADAS) that prioritizes high-diversity samples and introduce Spanning Driving Reward (SDR) with a focal style weighting to amplify reward's value span for improving sensitivity to driving quality. On the Navsim benchmark, Curious-VLA achieves SoTA results (PDMS 90.3, EPDMS 85.4) and a Best-of-N PDMS of 94.8, demonstrating its effectiveness in unlocking the exploratory potential of VLA models. Code: this https URL.
中文摘要 我们发现了一个根本性的“狭窄策略”（Narrow Policy）局限，它削弱了自主VLA模型的性能：驾驶模仿学习（IL）往往会使探索坍缩，并限制后续强化学习（RL）阶段的潜力，而后者往往因反馈多样性不足而过早饱和。因此，我们提出了Curious-VLA，一种通过两阶段设计缓解利用与探索困境的框架。在IL阶段，我们引入了可行轨迹扩展（FTE）策略，用于生成多条物理上有效的轨迹，并采用逐步归一化的轨迹表示以适应这些多样化的数据。在强化学习阶段，我们提出了自适应多样性感知采样（ADAS），优先选择高多样性样本，并引入带有focal式加权的跨度驾驶奖励（SDR），以放大奖励的取值范围，从而提升对驾驶质量的敏感度。在Navsim基准测试中，Curious-VLA取得了SoTA成绩（PDMS 90.3，EPDMS 85.4）以及94.8的Best-of-N PDMS，展示了其释放VLA模型探索潜力的有效性。代码：这个 https URL。

ChatShopBuddy: Towards Reliable Conversational Shopping Agents via Reinforcement Learning

ChatShopBuddy：通过强化学习迈向可靠的对话式购物智能体

Authors: Yiruo Cheng, Kelong Mao, Tianhao Li, Jiejun Tan, Ji-Rong Wen, Zhicheng Dou
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2603.06065
Pdf link: https://arxiv.org/pdf/2603.06065
Abstract Conversational shopping agents represent a critical consumer-facing application of Large Language Model (LLM)-powered agents, yet how to effectively apply post-training Reinforcement Learning (RL) to optimize such agents remains underexplored. This work investigates RL-based optimization for shopping agents in real-world scenarios, where agents must simultaneously satisfy multiple interdependent objectives spanning objective metrics (product correctness), subjective qualities (persuasiveness), outcome rewards (final response quality), and process rewards (tool efficiency). We present a complete methodology to address this challenge. Specifically, we first construct SmartShopBench, a benchmark that captures diverse shopping intents with a hierarchical evaluation that decomposes complex quality requirements into measurable levels. Building on this evaluation framework, we design Hierarchical Reward Modeling (HRM) to structure mixed reward types through conditional gating that reflects their logical dependencies. To enable efficient training, we further propose Dynamic Contrastive Policy Optimization (DCPO), which balances response quality with operational efficiency through dynamic trajectory selection based on reward and reasoning length. Extensive experiments demonstrate that our RL-trained agent, namely ChatShopBuddy, consistently outperforms larger models relying on generic reasoning, achieving superior stability rather than merely higher peaks. Our work provides valuable guidance for applying RL to real-world conversational agents.
中文摘要 对话式购物智能体是大型语言模型（LLM）驱动的智能体面向消费者的关键应用，但如何有效应用后训练强化学习（RL）来优化此类智能体仍缺乏探索。本研究探讨了现实场景中购物智能体的基于强化学习的优化，在这些场景中，智能体必须同时满足多个相互依赖的目标，涵盖客观指标（产品正确性）、主观品质（说服力）、结果奖励（最终响应质量）和过程奖励（工具效率）。我们提出了完整的方法论来应对这一挑战。具体来说，我们首先构建了SmartShopBench，这是一个基准测试，它捕捉多样的购物意图，并通过层级评估将复杂的质量要求分解为可衡量的层次。基于该评估框架，我们设计了层级奖励建模（HRM），通过反映其逻辑依赖关系的条件门控来组织混合奖励类型。为实现高效训练，我们进一步提出了动态对比策略优化（DCPO），通过基于奖励和推理长度的动态轨迹选择，平衡响应质量与运行效率。大量实验表明，我们经强化学习训练的智能体ChatShopBuddy始终优于依赖通用推理的更大模型，实现了更优的稳定性，而不仅仅是更高的峰值。我们的工作为将强化学习应用于现实世界的对话智能体提供了宝贵指导。

Partial Policy Gradients for RL in LLMs

大型语言模型中强化学习的部分策略梯度

Authors: Puneet Mathur, Branislav Kveton, Subhojyoti Mukherjee, Viet Dac Lai
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.06138
Pdf link: https://arxiv.org/pdf/2603.06138
Abstract Reinforcement learning is a framework for learning to act sequentially in an unknown environment. We propose a natural approach for modeling policy structure in policy gradients. The key idea is to optimize for a subset of future rewards: smaller subsets represent simpler policies, which can be learned more reliably because their empirical gradient estimates are more accurate. Our approach allows for modeling and comparison of different policy classes, including full planning, greedy, K-step lookahead, and segment policies. We evaluate the policies empirically on multiple persona-alignment conversational problems. Different policies excel in different problems, reflecting their different characteristics and highlighting the importance of our studied policy class.
中文摘要 强化学习是一种学习在未知环境中顺序行动的框架。我们提出了一种自然的方法来对策略梯度中的策略结构进行建模。关键思想是针对未来奖励的一个子集进行优化：较小的子集代表更简单的策略，由于其经验梯度估计更准确，因此可以更可靠地学习。我们的方法允许对不同策略类别进行建模和比较，包括完全规划、贪婪、K步前瞻和分段策略。我们在多个人设对齐对话问题上对这些策略进行了实证评估。不同策略在不同问题上表现出色，反映出其不同特性，凸显了我们所研究的策略类别的重要性。
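
The "subset of future rewards" idea can be illustrated with a small sketch, under the assumption that each policy class corresponds to a horizon over which rewards are summed before they weight the log-probability gradient. The function name, the undiscounted sum, and the mapping of "greedy" to one step and "full planning" to the whole episode are illustrative assumptions, not the paper's definitions.

```python
def partial_returns(rewards, k=None):
    """Return-to-go at each step using only the next k rewards.

    k=None uses all remaining rewards (assumed to match "full planning"),
    k=1 uses only the immediate reward (assumed to match "greedy"),
    other k values give a K-step lookahead. Undiscounted by assumption.
    """
    if k is not None and k < 1:
        raise ValueError("k must be a positive integer or None")
    horizon = len(rewards) if k is None else k
    return [sum(rewards[t:t + horizon]) for t in range(len(rewards))]


# Hypothetical per-turn rewards of one conversation.
episode = [1, 0, 2, 3]
full = partial_returns(episode)         # [6, 5, 5, 3]
greedy = partial_returns(episode, k=1)  # [1, 0, 2, 3]
two_step = partial_returns(episode, k=2)  # [1, 2, 5, 3]
```

In a REINFORCE-style estimator each of these lists would multiply the per-step gradient of the log-probability; shorter horizons sum fewer noisy terms, which is one way to read the abstract's claim about more accurate gradient estimates.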

Dual-Agent Multiple-Model Reinforcement Learning for Event-Triggered Human-Robot Co-Adaptation in Decoupled Task Spaces

双智能体多模型强化学习，用于解耦任务空间中的事件触发人机共适应

Authors: Yaqi Li, Zhengqi Han, Huifang Liu, Steven W.Su
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.06163
Pdf link: https://arxiv.org/pdf/2603.06163
Abstract This paper presents a shared-control rehabilitation policy for a custom 6-degree-of-freedom (6-DoF) upper-limb robot that decomposes complex reaching tasks into decoupled spatial axes. The patient governs the primary reaching direction using binary commands, while the robot autonomously manages orthogonal corrective motions. Because traditional fixed-frequency control often induces trajectory oscillations due to variable inverse-kinematics execution times, an event-driven progression strategy is proposed. This architecture triggers subsequent control actions only when the end-effector enters an admission sphere centred on the immediate target waypoint, and was validated in a semi-virtual setup linking a physical pressure sensor to a MuJoCo simulation. To optimise human--robot co-adaptation safely and efficiently, this study introduces Dual Agent Multiple Model Reinforcement Learning (DAMMRL). This framework discretises decision characteristics: the human agent selects the admission sphere radius to reflect their inherent speed--accuracy trade-off, while the robot agent dynamically adjusts its 3D Cartesian step magnitudes to complement the user's cognitive state. Trained in simulation and deployed across mixed environments, this event-triggered DAMMRL approach effectively suppresses waypoint chatter, balances spatial precision with temporal efficiency, and significantly improves success rates in object acquisition tasks.
中文摘要 本文提出了一种针对定制6自由度（6-DoF）上肢机器人的共享控制康复策略，该机器人将复杂的伸取任务分解到解耦的空间轴上。患者通过二值指令控制主要伸取方向，而机器人则自主管理正交方向的纠正动作。由于传统固定频率控制常因逆运动学执行时间可变而引发轨迹振荡，因此提出了一种事件驱动的推进策略。该架构仅在末端执行器进入以当前目标航点为中心的准入球时才触发后续控制动作，并在将物理压力传感器与MuJoCo仿真相连的半虚拟装置中得到验证。为了安全高效地优化人机协同适应，本研究引入了双智能体多模型强化学习（DAMMRL）。该框架对决策特性进行了离散化：人类智能体选择准入球半径以反映其固有的速度-准确性权衡，而机器人智能体则动态调整其三维笛卡尔步长以配合用户的认知状态。经过仿真训练并部署于混合环境中，这种事件触发的DAMMRL方法有效抑制了航点抖动，平衡了空间精度与时间效率，并显著提高了物体获取任务的成功率。
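
The event-driven progression rule in the abstract (advance only once the end-effector is inside an admission sphere around the current waypoint) can be sketched as follows. The function name, the tuple representation of positions, and the example numbers are hypothetical; the paper's controller is not reproduced here.

```python
import math


def advance_waypoint(position, waypoints, index, radius):
    """Return the waypoint index to track next.

    The index advances only when the end-effector position lies inside the
    admission sphere (center: current waypoint, radius: `radius`); otherwise
    the current target is kept, however much time has elapsed.
    """
    if index >= len(waypoints):
        return index  # path already finished
    if math.dist(position, waypoints[index]) <= radius:
        return index + 1
    return index


# Hypothetical path in metres and a 10 cm admission radius.
path = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reached = advance_waypoint((0.05, 0.0, 0.0), path, index=0, radius=0.1)
waiting = advance_waypoint((0.50, 0.0, 0.0), path, index=1, radius=0.1)
```

A larger radius triggers the next action sooner at the cost of spatial precision, which is the speed-accuracy trade-off the abstract assigns to the human agent.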

Optimizing 3D Diffusion Models for Medical Imaging via Multi-Scale Reward Learning

通过多尺度奖励学习优化医学影像的3D扩散模型

Authors: Yueying Tian, Xudong Han, Meng Zhou, Rodrigo Aviles-Espinosa, Rupert Young, Philip Birch
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.06173
Pdf link: https://arxiv.org/pdf/2603.06173
Abstract Diffusion models have emerged as powerful tools for 3D medical image generation, yet bridging the gap between standard training objectives and clinical relevance remains a challenge. This paper presents a method to enhance 3D diffusion models using Reinforcement Learning (RL) with multi-scale feedback. We first pretrain a 3D diffusion model on MRI volumes to establish a robust generative prior. Subsequently, we fine-tune the model using Proximal Policy Optimization (PPO), guided by a novel reward system that integrates both 2D slice-wise assessments and 3D volumetric analysis. This combination allows the model to simultaneously optimize for local texture details and global structural coherence. We validate our framework on the BraTS 2019 and OASIS-1 datasets. Our results indicate that incorporating RL feedback effectively steers the generation process toward higher quality distributions. Quantitative analysis reveals significant improvements in Fréchet Inception Distance (FID) and, crucially, the synthetic data demonstrates enhanced utility in downstream tumor and disease classification tasks compared to non-optimized baselines.
中文摘要 扩散模型已成为三维医学图像生成的强大工具，但弥合标准训练目标与临床相关性之间的鸿沟仍是一大挑战。本文提出了一种利用带多尺度反馈的强化学习（RL）增强三维扩散模型的方法。我们首先在MRI体数据上预训练三维扩散模型，以建立稳健的生成先验。随后，我们利用近端策略优化（PPO）对模型进行微调，并以一个新颖的奖励系统作为引导，该系统结合了二维逐切片评估和三维体积分析。这种组合使模型能够同时优化局部纹理细节和全局结构一致性。我们在BraTS 2019和OASIS-1数据集上验证了我们的框架。我们的结果表明，纳入强化学习反馈能有效引导生成过程朝着更高质量的分布发展。定量分析显示，弗雷歇起始距离（FID）显著改善，且关键的是，与未优化的基线相比，合成数据在下游肿瘤和疾病分类任务中展现出更高的实用性。

MAPO: Mixed Advantage Policy Optimization for Long-Horizon Multi-Turn Dialogue

MAPO：面向长时域多轮对话的混合优势策略优化

Authors: Naifan Zhang, Ruihan Sun, Jinwei Su, Hengjie Yang, Zhengyuan Pan, Zhaohan Chen, Xiaofan Zhang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.06194
Pdf link: https://arxiv.org/pdf/2603.06194
Abstract Subjective multi-turn dialogue tasks, such as emotional support, require conversational policies that adapt to evolving user states and optimize long-horizon interaction quality. However, reinforcement learning (RL) for such settings remains challenging due to the absence of reliable process supervision. Outcome-only training collapses credit assignment across turns into a single trajectory-level reward, while naïve turn-level group sampling incurs prohibitive rollout costs in interactive environments. We propose a critic-free and efficient RL algorithm named MAPO that leverages dense process feedback from a judge model and propagates long-horizon effects through Monte Carlo returns. To stabilize optimization, we introduce a mixed advantage estimator that combines turn-level normalization with batch-level normalization, enabling fine-grained yet scalable credit assignment. Across multiple subjective dialogue benchmarks, including EMPA, EmoBench, and EQ-Bench, and model scales ranging from 7B to 32B, our method consistently improves both training stability and final performance over outcome-only GRPO and single-level normalization baselines. On EMPA, we improve rates by up to 9 points and increase dialogue scores by as much as +43.2 over the 7B base model. Despite training only on EMPA-style environments, our approach generalizes well, yielding consistent improvements on unseen emotional-intelligence benchmarks, including up to +4 points on EmoBench and +3.5 on EQ-Bench. Together, these results demonstrate that dense process supervision combined with mixed-level normalization enables effective and scalable RL for subjective, open-ended multi-turn dialogue.
中文摘要 主观多轮对话任务（如情感支持）需要能够适应不断变化的用户状态并优化长时域交互质量的对话策略。然而，由于缺乏可靠的过程监督，此类场景下的强化学习（RL）仍然具有挑战性。仅基于结果的训练将跨轮次的信用分配压缩为单一的轨迹级奖励，而朴素的轮次级分组采样则在交互环境中产生高昂的rollout成本。我们提出了一种无评论家（critic-free）且高效的强化学习算法，名为MAPO，它利用来自评判模型的密集过程反馈，并通过蒙特卡洛回报传播长时域效应。为稳定优化，我们引入了混合优势估计器，结合了轮次级归一化和批次级归一化，实现细粒度且可扩展的信用分配。在包括EMPA、EmoBench和EQ-Bench在内的多种主观对话基准测试，以及从7B到32B的模型规模上，我们的方法在训练稳定性和最终表现上均持续优于仅基于结果的GRPO和单级归一化基线。在EMPA上，我们提升了最多9个点，并使对话评分比7B基础模型提升了+43.2。尽管仅在EMPA风格环境中进行训练，我们的方法具有良好的泛化能力，在未见过的情商基准上持续取得提升，包括EmoBench上最高+4分，EQ-Bench上+3.5分。这些结果共同表明，密集过程监督与混合层级归一化相结合，能够为主观、开放式多轮对话实现有效且可扩展的强化学习。
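
A rough sketch of how Monte Carlo returns and a two-level normalization might be combined into one advantage, loosely following the MAPO abstract. Everything specific here is an assumption: grouping returns by turn index for the turn-level statistics, the z-score form, the mixing weight `alpha`, and the function names. The abstract does not state the actual estimator.

```python
from statistics import mean, pstdev


def mc_returns(rewards, gamma=1.0):
    """Monte Carlo return at each turn: reward now plus discounted future rewards."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]


def mixed_advantages(dialogues, alpha=0.5, gamma=1.0, eps=1e-8):
    """Blend turn-level and batch-level normalized returns (assumed form).

    dialogues: list of per-turn judge rewards, one list per sampled dialogue.
    alpha: hypothetical mixing weight (1.0 = turn-level only, 0.0 = batch-level only).
    """
    returns = [mc_returns(d, gamma) for d in dialogues]
    flat = [g for traj in returns for g in traj]
    b_mean, b_std = mean(flat), pstdev(flat)

    # Assumed turn-level statistics: returns that share the same turn index.
    max_len = max(len(traj) for traj in returns)
    turn_stats = []
    for t in range(max_len):
        column = [traj[t] for traj in returns if len(traj) > t]
        turn_stats.append((mean(column), pstdev(column)))

    advantages = []
    for traj in returns:
        row = []
        for t, g in enumerate(traj):
            t_mean, t_std = turn_stats[t]
            turn_adv = (g - t_mean) / (t_std + eps)
            batch_adv = (g - b_mean) / (b_std + eps)
            row.append(alpha * turn_adv + (1.0 - alpha) * batch_adv)
        advantages.append(row)
    return advantages


# Two hypothetical two-turn dialogues with judge rewards per turn.
adv = mixed_advantages([[1.0, 0.0], [0.0, 1.0]], alpha=1.0)
```

With `alpha=1.0` the first turn gets zero advantage in both dialogues because their returns tie at that turn, while the second turn separates them; lowering `alpha` mixes in the batch-wide scale.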

Synthetic Monitoring Environments for Reinforcement Learning

强化学习的合成监控环境

Authors: Leonard Pleiss, Carolin Schmidt, Maximilian Schiffer
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2603.06252
Pdf link: https://arxiv.org/pdf/2603.06252
Abstract Reinforcement Learning (RL) lacks benchmarks that enable precise, white-box diagnostics of agent behavior. Current environments often entangle complexity factors and lack ground-truth optimality metrics, making it difficult to isolate why algorithms fail. We introduce Synthetic Monitoring Environments (SMEs), an infinite suite of continuous control tasks. SMEs provide fully configurable task characteristics and known optimal policies. As such, SMEs allow for the exact calculation of instantaneous regret. Their rigorous geometric state space bounds allow for systematic within-distribution (WD) and out-of-distribution (OOD) evaluation. We demonstrate the framework's benefit through multidimensional ablations of PPO, TD3, and SAC, revealing how specific environmental properties - such as action or state space size, reward sparsity and complexity of the optimal policy - impact WD and OOD performance. We thereby show that SMEs offer a standardized, transparent testbed for transitioning RL evaluation from empirical benchmarking toward rigorous scientific analysis.
中文摘要 强化学习（RL）缺乏能够对智能体行为进行精确、白盒式诊断的基准测试。当前环境常常把各种复杂度因素纠缠在一起，并且缺乏基于真值的最优性指标，因而难以确定算法失败的原因。我们提出了合成监控环境（SMEs），这是一套无限的连续控制任务。SMEs提供完全可配置的任务特性和已知的最优策略。因此，SMEs允许精确计算瞬时遗憾。其严格的几何状态空间边界允许系统地进行分布内（WD）和分布外（OOD）评估。我们通过对PPO、TD3和SAC的多维消融分析，展示了该框架的优势，揭示了具体环境属性（如动作或状态空间大小、奖励稀疏性以及最优策略的复杂性）如何影响WD和OOD性能。由此，我们表明SMEs为强化学习评估从经验性基准测试向严谨科学分析的过渡提供了一个标准化、透明的测试平台。
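
To make "exact calculation of instantaneous regret" concrete, here is a minimal sketch under a simplifying assumption: the optimal action maximizes the immediate reward, so regret is the reward gap between the optimal and the chosen action in the same state. The toy reward function and optimal policy are hypothetical and are not taken from the SME suite.

```python
def instantaneous_regret(reward_fn, optimal_policy, state, action):
    """Reward of the known optimal action minus reward of the agent's action."""
    return reward_fn(state, optimal_policy(state)) - reward_fn(state, action)


# Hypothetical 1-D task: the best action cancels the state exactly.
def toy_optimal_policy(state):
    return -state


def toy_reward(state, action):
    return -(state + action) ** 2


regret_bad = instantaneous_regret(toy_reward, toy_optimal_policy, 1.0, 0.0)
regret_best = instantaneous_regret(toy_reward, toy_optimal_policy, 1.0, -1.0)
```

Because the optimal policy is known in closed form, the regret needs no estimate of an unknown optimum, which is the diagnostic property the abstract emphasizes.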

Artificial Intelligence for Climate Adaptation: Reinforcement Learning for Climate Change-Resilient Transport

气候适应的人工智能：为气候变化韧性交通提供强化学习

Authors: Miguel Costa, Arthur Vandervoort, Carolin Schmidt, João Miranda, Morten W. Petersen, Martin Drews, Karyn Morrisey, Francisco C. Pereira
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.06278
Pdf link: https://arxiv.org/pdf/2603.06278
Abstract Climate change is expected to intensify rainfall and, consequently, pluvial flooding, leading to increased disruptions in urban transportation systems over the coming decades. Designing effective adaptation strategies is challenging due to the long-term, sequential nature of infrastructure investments, deep climate uncertainty, and the complex interactions between flooding, infrastructure, and mobility impacts. In this work, we propose a novel decision-support framework using reinforcement learning (RL) for long-term flood adaptation planning. Formulated as an integrated assessment model (IAM), the framework combines rainfall projection and flood modeling, transport simulation, and quantification of direct and indirect impacts on infrastructure and mobility. Our RL-based approach learns adaptive strategies that balance investment and maintenance costs against avoided impacts. We evaluate the framework through a case study of Copenhagen's inner city over the 2024-2100 period, testing multiple adaptation options, and different belief and realized climate scenarios. Results show that the framework outperforms traditional optimization approaches by discovering coordinated spatial and temporal adaptation pathways and learning trade-offs between impact reduction and adaptation investment, yielding more resilient strategies. Overall, our results showcase the potential of reinforcement learning as a flexible decision-support tool for adaptive infrastructure planning under climate uncertainty.
中文摘要 气候变化预计将加剧降雨，进而加剧暴雨内涝，导致未来几十年城市交通系统的中断增多。由于基础设施投资具有长期、序贯的特性，加之深度的气候不确定性以及洪水、基础设施与出行影响之间复杂的相互作用，设计有效的适应策略具有挑战性。本研究提出一种利用强化学习（RL）进行长期洪水适应规划的新型决策支持框架。该框架被构建为综合评估模型（IAM），结合了降雨预测与洪水建模、交通模拟以及对基础设施和出行的直接与间接影响的量化。我们基于强化学习的方法学习自适应策略，在投资和维护成本与所避免的影响之间取得平衡。我们通过对哥本哈根内城2024-2100年期间的案例研究评估了该框架，测试了多种适应方案以及不同的信念气候情景和实际气候情景。结果显示，该框架通过发现协调的空间和时间适应路径，并学习影响减少与适应投资之间的权衡，优于传统优化方法，从而产生更具韧性的策略。总体而言，我们的结果展示了强化学习作为气候不确定性下适应性基础设施规划的灵活决策支持工具的潜力。

From Entropy to Calibrated Uncertainty: Training Language Models to Reason About Uncertainty

从熵到校准不确定性：训练语言模型以推理不确定性

Authors: Azza Jenane, Nassim Walha, Lukas Kuhn, Florian Buettner
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.06317
Pdf link: https://arxiv.org/pdf/2603.06317
Abstract Large Language Models (LLMs) that can express interpretable and calibrated uncertainty are crucial in high-stakes domains. While methods to compute uncertainty post-hoc exist, they are often sampling-based and therefore computationally expensive or lack calibration. We propose a three-stage pipeline to post-train LLMs to efficiently infer calibrated uncertainty estimates for their responses. First, we compute fine-grained entropy-based uncertainty scores on the training data, capturing the distributional variability of model outputs in embedding space. Second, these scores are calibrated via Platt scaling, producing reliable and human-interpretable uncertainty signals. Finally, the target LLM is post-trained via reinforcement learning to align its policy with these calibrated signals through a verifiable reward function. Unlike post-hoc uncertainty estimation methods, our approach provides interpretable and computationally efficient uncertainty estimates at test time. Experiments show that models trained with our pipeline achieve better calibration than baselines and generalize to unseen tasks without further processing, suggesting that they learn a robust uncertainty reasoning behavior.
中文摘要 能够表达可解释且经过校准的不确定性的大型语言模型（LLM）在高风险领域至关重要。虽然存在事后计算不确定性的方法，但它们通常基于采样，因此计算成本高，或者缺乏校准。我们提出了一个三阶段流程，对LLM进行后训练，使其能够高效推断其响应的校准不确定性估计。首先，我们在训练数据上计算细粒度的基于熵的不确定性评分，捕捉模型输出在嵌入空间中的分布变异性。其次，这些分数通过Platt缩放进行校准，产生可靠且人类可解读的不确定性信号。最后，通过强化学习对目标LLM进行后训练，借助可验证的奖励函数使其策略与这些校准信号保持一致。与事后不确定性估计方法不同，我们的方法在测试时提供可解释且计算高效的不确定性估计。实验显示，使用我们的流程训练的模型比基线实现了更好的校准，并在无需进一步处理的情况下泛化到未见任务，表明它们学会了稳健的不确定性推理行为。
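
The second stage of the pipeline uses Platt scaling, which fits a sigmoid `p = sigmoid(a * score + b)` to map raw scores to probabilities. Below is a minimal sketch fitted by plain gradient descent on the log loss. The toy scores, the binary labels (1 meaning the answer was wrong), the learning rate, and the step count are all assumptions; Platt's original method also smooths the targets, which is omitted here.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_platt(scores, labels, lr=0.1, steps=2000):
    """Fit a and b of sigmoid(a * s + b) by gradient descent on the mean log loss."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = sigmoid(a * s + b)
            grad_a += (p - y) * s  # d(log loss)/da for one sample
            grad_b += (p - y)      # d(log loss)/db for one sample
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


# Hypothetical entropy-based scores and whether each answer turned out wrong.
entropy_scores = [0.1, 0.4, 0.6, 0.9, 1.5, 2.0]
was_wrong = [0, 0, 1, 0, 1, 1]

a, b = fit_platt(entropy_scores, was_wrong)


def calibrated_uncertainty(score):
    """Map a raw entropy score to a probability-like uncertainty in (0, 1)."""
    return sigmoid(a * score + b)
```

The fitted map is monotone in the raw score, so it keeps the ranking of examples and only rescales the values into probabilities that can be compared against observed error rates.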

OralGPT-Plus: Learning to Use Visual Tools via Reinforcement Learning for Panoramic X-ray Analysis

OralGPT-Plus：通过强化学习学习使用视觉工具进行全景X射线分析

Authors: Yuxuan Fan, Jing Hao, Hong Chen, Jiahao Bao, Yihua Shao, Yuci Liang, Kuo Feng Hung, Hao Tang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.06366
Pdf link: https://arxiv.org/pdf/2603.06366
Abstract Panoramic dental radiographs require fine-grained spatial reasoning, bilateral symmetry understanding, and multi-step diagnostic verification, yet existing vision-language models operate under a static single-pass paradigm that limits their clinical reliability. In this paper, we introduce OralGPT-Plus, an agentic vision-language model designed to perform iterative and symmetry-aware diagnostic reasoning for panoramic dental radiograph analysis. To support this paradigm, we construct DentalProbe, a five-thousand-image dataset with expert-curated diagnostic trajectories that provide structured supervision for localized inspection and contralateral comparison. We further develop a Reinspection-driven reinforcement learning framework that encourages clinically meaningful re-examination and stabilizes long-horizon reasoning with rubric-based reward and conditioned diagnostic-driven reward. In parallel, we present MMOral-X, the first benchmark for holistic panoramic diagnosis, containing 300 open-ended questions and region-level annotations across multiple difficulty levels. OralGPT-Plus demonstrates consistent and reliable improvements over strong baselines on MMOral-X and established panoramic benchmarks, indicating the effectiveness of interactive and symmetry-informed reasoning. Our work highlights the value of agentic modeling for dental imaging and provides a foundation for future research in clinically aligned panoramic radiograph analysis.
中文摘要 全景牙科X光片需要细粒度的空间推理、双侧对称性理解和多步诊断验证，但现有视觉语言模型采用静态的单次推理范式，限制了其临床可靠性。本文提出了OralGPT-Plus，一种智能体式视觉语言模型，旨在为全景牙科X光片分析执行迭代且具备对称性感知的诊断推理。为支持这一范式，我们构建了DentalProbe，这是一个包含五千张图像的数据集，配有专家整理的诊断轨迹，为局部检查和对侧比较提供结构化监督。我们进一步开发了以再检查为驱动的强化学习框架，鼓励具有临床意义的复查，并通过基于评分细则的奖励和条件化的诊断驱动奖励来稳定长时域推理。与此同时，我们推出了MMOral-X，这是首个面向整体全景诊断的基准，包含300个开放式问题和跨多个难度级别的区域级标注。OralGPT-Plus在MMOral-X和既有全景基准上相较于强基线取得了持续且可靠的提升，表明了交互式、基于对称性的推理的有效性。我们的工作凸显了智能体式建模在牙科影像中的价值，并为未来临床对齐的全景X光片分析研究奠定了基础。

Efficient, Property-Aligned Fan-Out Retrieval via RL-Compiled Diffusion

通过强化学习编译扩散实现高效、属性对齐的扇出检索

Authors: Pengcheng Jiang, Judith Yue Li, Moonkyung Ryu, R. Lily Hu, Kun Su, Zhong Yi Wan, Liam Hebert, Hao Peng, Jiawei Han, Dima Kuzmin, Craig Boutilier
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.06397
Pdf link: https://arxiv.org/pdf/2603.06397
Abstract Many modern retrieval problems are set-valued: given a broad intent, the system must return a collection of results that optimizes higher-order properties (e.g., diversity, coverage, complementarity, coherence) while remaining grounded with respect to a fixed database. Set-valued objectives are typically non-decomposable and are not captured by existing supervised (query, content) datasets which only prioritize top-1 retrieval. Consequently, fan-out retrieval is often employed to generate diverse subqueries to retrieve item sets. While reinforcement learning (RL) can optimize set-level objectives via interaction, deploying an RL-tuned LLM for fan-out retrieval is prohibitively expensive at inference time. Conversely, diffusion-based generative retrieval enables efficient single-pass fan-out in embedding space, but requires objective-aligned training targets. To address these issues, we propose R4T (Retrieve-for-Train), which uses RL once as an objective transducer in a three-step process: (i) train a fan-out LLM with composite set-level rewards, (ii) synthesize objective-consistent training pairs, and (iii) train a lightweight diffusion retriever to model the conditional distribution of set-valued outputs. Across large-scale fashion and music benchmarks consisting of curated item sets, we show that R4T improves retrieval quality relative to strong baselines while reducing query-time fan-out latency by an order of magnitude.
中文摘要 许多现代检索问题是集合值的：给定一个宽泛的意图，系统必须返回一组结果，既要优化高阶属性（如多样性、覆盖率、互补性、连贯性），又要立足于固定的数据库。集合值目标通常不可分解，而且现有的监督式（查询，内容）数据集只关注top-1检索，无法刻画这类目标。因此，人们常采用扇出检索来生成多样化的子查询以检索项集。虽然强化学习（RL）可以通过交互优化集合级目标，但部署经强化学习调优的LLM进行扇出检索，在推理时成本过高。相反，基于扩散的生成式检索能在嵌入空间中实现高效的单遍扇出，但需要与目标对齐的训练样本。为解决这些问题，我们提出了R4T（Retrieve-for-Train），它在一个三步流程中仅使用一次RL作为目标转换器：（i）用复合的集合级奖励训练一个扇出LLM，（ii）合成与目标一致的训练对，（iii）训练一个轻量级扩散检索器，对集合值输出的条件分布进行建模。在由精选项集组成的大规模时尚和音乐基准上，我们表明R4T相对于强基线提升了检索质量，同时将查询时的扇出延迟降低了一个数量级。

A Reference Architecture of Reinforcement Learning Frameworks

强化学习框架的参考架构

Authors: Xiaoran Liu, Istvan David
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.06413
Pdf link: https://arxiv.org/pdf/2603.06413
Abstract The surge in reinforcement learning (RL) applications gave rise to diverse supporting technology, such as RL frameworks. However, the architectural patterns of these frameworks are inconsistent across implementations and there exists no reference architecture (RA) to form a common basis of comparison, evaluation, and integration. To address this gap, we propose an RA of RL frameworks. Through a grounded theory approach, we analyze 18 state-of-the-practice RL frameworks and, by that, we identify recurring architectural components and their relationships, and codify them in an RA. To demonstrate our RA, we reconstruct characteristic RL patterns. Finally, we identify architectural trends, e.g., commonly used components, and outline paths to improving RL frameworks.
中文摘要 强化学习（RL）应用的激增催生了多种支撑技术，例如强化学习框架。然而，这些框架的架构模式在不同实现之间并不一致，目前也没有参考架构（RA）可作为比较、评估和集成的共同基础。为弥补这一空白，我们提出了一个强化学习框架的参考架构。我们采用扎根理论方法，分析了18个代表当前实践水平的强化学习框架，据此识别出反复出现的架构组件及其关系，并将它们整理成一个RA。为了展示我们的RA，我们用它重构了典型的强化学习模式。最后，我们总结了架构趋势（例如常用组件），并指出改进强化学习框架的路径。

EgoReasoner: Learning Egocentric 4D Reasoning via Task-Adaptive Structured Thinking

EgoReasoner：通过任务自适应结构化思维学习第一人称视角的四维推理

Authors: Fangrui Zhu, Yunfeng Xi, Jianmo Ni, Mu Cai, Boqing Gong, Long Zhao, Chen Qu, Ian Miao, Yi Li, Cheng Zhong, Huaizu Jiang, Shwetak Patel
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.06561
Pdf link: https://arxiv.org/pdf/2603.06561
Abstract Egocentric video understanding is inherently complex due to the dynamic 4D nature of the environment, where camera motion and object displacements necessitate a continuous re-evaluation of spatial relations. In this work, we target a suite of under-explored egocentric 4D reasoning tasks, including fixture interaction counting, viewpoint-relative fixture location, object movement itinerary tracking, and stationary object localization, that require fundamentally different cognitive operations: spatial anchoring, temporal tracking, and duration reasoning. We observe that these structural differences make task-agnostic approaches insufficient: generic Chain-of-Thought methods lack task-appropriate reasoning primitives, and uniform reinforcement learning actively destabilizes performance on spatial tasks. To address this, we propose EgoReasoner, a two-stage framework that aligns both the reasoning scaffold and the reward signal to each task's cognitive structure. In the first stage, Task-Adaptive Thinking Templates guide the synthesis of structured CoT traces that teach the model to reason adaptively across task types via supervised fine-tuning. In the second stage, task-aware reward functions verify entity grounding, temporal alignment, and task-adaptive logical consistency, selectively strengthening each reasoning pathway via reinforcement fine-tuning with GRPO. Our 3B-parameter model, trained on only 16K samples, achieves 37.5% average accuracy on the challenging HD-EPIC benchmark, surpassing Qwen2.5-VL-7B (25.7%) by over 10 points.
中文摘要 第一人称视角的视频理解本质上很复杂，因为环境具有动态的四维特性，摄像机运动和物体位移要求不断重新评估空间关系。本研究针对一组尚未被充分探索的第一人称四维推理任务，包括固定设施交互计数、相对视点的固定设施定位、物体移动路径追踪和静止物体定位，这些任务需要根本不同的认知操作：空间锚定、时间追踪和持续时间推理。我们观察到，这些结构差异使得与任务无关的方法不够用：通用的思维链（CoT）方法缺乏适合任务的推理原语，而统一的强化学习会明显破坏空间任务上的性能稳定性。为此，我们提出了EgoReasoner，这是一个两阶段框架，使推理支架和奖励信号都与每个任务的认知结构对齐。第一阶段，任务自适应思维模板指导结构化CoT轨迹的合成，这些轨迹通过监督微调教会模型针对不同任务类型进行自适应推理。第二阶段，任务感知的奖励函数验证实体定位（grounding）、时间对齐和任务自适应的逻辑一致性，并通过基于GRPO的强化微调有选择地强化每条推理路径。我们的3B参数模型仅用1.6万个样本训练，在具有挑战性的HD-EPIC基准上取得了37.5%的平均准确率，比Qwen2.5-VL-7B（25.7%）高出10个百分点以上。
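
Editor's sketch (not from the paper): the second stage relies on GRPO, whose core step is to score each sampled response relative to the other responses drawn for the same prompt, rather than against a learned value function. The snippet below shows only that group-relative normalization. The `eps` term and the use of the population standard deviation are assumptions here, since implementations differ on both, and the example rewards are made up.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    are compared only against their siblings. eps (an assumed stabilizer)
    avoids division by zero when every response gets the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one question.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When all responses in a group receive the same reward, every advantage is zero and the group contributes no gradient signal. This is one reason the choice of reward function matters for such methods.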

Boosting deep Reinforcement Learning using pretraining with Logical Options

利用逻辑选项的预训练提升深度强化学习

Authors: Zihan Ye, Phil Chau, Raban Emunds, Jannis Blüml, Cedric Derstroff, Quentin Delfosse, Oleg Arenz, Kristian Kersting
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.06565
Pdf link: https://arxiv.org/pdf/2603.06565
Abstract Deep reinforcement learning agents are often misaligned, as they over-exploit early reward signals. Recently, several symbolic approaches have addressed these challenges by encoding sparse objectives along with aligned plans. However, purely symbolic architectures are complex to scale and difficult to apply to continuous settings. Hence, we propose a hybrid approach, inspired by humans' ability to acquire new skills. We use a two-stage framework that injects symbolic structure into neural-based reinforcement learning agents without sacrificing the expressivity of deep policies. Our method, called Hybrid Hierarchical RL (H^2RL), introduces a logical option-based pretraining strategy to steer the learning policy away from short-term reward loops and toward goal-directed behavior while allowing the final policy to be refined via standard environment interaction. Empirically, we show that this approach consistently improves long-horizon decision-making and yields agents that outperform strong neural, symbolic, and neuro-symbolic baselines.
中文摘要 深度强化学习智能体常常未能对齐，因为它们会过度利用早期的奖励信号。近年来，一些符号方法通过编码稀疏目标以及与之对齐的计划来应对这些挑战。然而，纯符号架构难以扩展，也难以应用于连续环境。因此，我们提出一种混合方法，其灵感来源于人类习得新技能的能力。我们采用两阶段框架，在不牺牲深度策略表达能力的前提下，向基于神经网络的强化学习智能体注入符号结构。我们的方法称为混合层级强化学习（Hybrid Hierarchical RL，H^2RL），它引入了一种基于逻辑选项的预训练策略，引导所学策略远离短期奖励循环、转向目标导向的行为，同时允许通过标准的环境交互来进一步优化最终策略。实验表明，这种方法能够稳定地提升长时程决策能力，所得智能体优于强大的神经、符号和神经符号基线。

Keyword: diffusion policy

CDF-Glove: A Cable-Driven Force Feedback Glove for Dexterous Teleoperation

CDF-Glove：一种用于灵巧远程操作的线缆驱动力反馈手套

Authors: Huayue Liang, Ruochong Li, Yaodong Yang, Long Zeng, Yuanpei Chen, Xueqian Wang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.05804
Pdf link: https://arxiv.org/pdf/2603.05804
Abstract High-quality teleoperated demonstrations are a primary bottleneck for imitation learning (IL) in dexterous manipulation. However, haptic feedback provides operators with real-time contact information, enabling real-time finger posture adjustments, and thereby improving demonstration quality. Existing dexterous teleoperation platforms typically omit haptic feedback and remain bulky and expensive. We introduce CDF-Glove, a lightweight and low cost cable-driven force-feedback glove. The real-time state is available for 20 finger degrees of freedom (DoF), of which 16 are directly sensed and 4 are passively coupled (inferred from kinematic constraints). We develop a kinematic model and control stack for the glove, and validate them across multiple robotic hands with diverse kinematics and DoF. The CDF-Glove achieves distal joint repeatability of 0.4 degrees, and delivers about 200 ms force feedback latency, yielding a 4x improvement in task success rate relative to no-feedback teleoperation. We collect two bimanual teleoperation datasets, on which we train and evaluate Diffusion Policy baselines. Compared to kinesthetic teaching, the policies trained in our teleoperated demonstrations increase the average success rate by 55% and reduce the mean completion time by approximately 15.2 seconds (a 47.2% relative reduction). In particular, the CDF-Glove costs approximately US$230. The code and designs are released as open source at this https URL.
中文摘要 高质量的远程操作演示是灵巧操作中模仿学习（IL）的主要瓶颈。然而，触觉反馈能为操作员提供实时接触信息，使其能够实时调整手指姿态，从而提升演示质量。现有的灵巧远程操作平台通常不提供触觉反馈，而且体积庞大、价格昂贵。我们提出CDF-Glove，一款轻便、低成本的线缆驱动力反馈手套。该手套可提供20个手指自由度（DoF）的实时状态，其中16个是直接感测的，4个是被动耦合的（通过运动学约束推断）。我们为手套开发了运动学模型和控制栈，并在多只运动学结构和自由度各不相同的机器人手上进行了验证。CDF-Glove实现了0.4度的远端关节重复精度，力反馈延迟约为200毫秒，任务成功率是无反馈远程操作的4倍。我们采集了两个双手远程操作数据集，并在其上训练和评估Diffusion Policy基线。与动觉示教相比，用我们的远程操作演示训练的策略使平均成功率提高了55%，平均完成时间减少约15.2秒（相对减少47.2%）。值得一提的是，CDF-Glove的成本约为230美元。代码和设计已在此 https URL 上开源发布。
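
Editor's note (a back-of-the-envelope check, not figures reported in the abstract): an absolute reduction of about 15.2 seconds combined with a 47.2% relative reduction implies the two mean completion times, assuming the percentage is taken relative to the kinesthetic-teaching mean. The function name is hypothetical.

```python
def implied_times(absolute_reduction_s, relative_reduction):
    """Recover baseline and reduced times from a matching absolute/relative pair.

    Assumes relative_reduction = absolute_reduction_s / baseline.
    """
    baseline = absolute_reduction_s / relative_reduction
    return baseline, baseline - absolute_reduction_s


baseline_s, teleop_s = implied_times(15.2, 0.472)
print(f"{baseline_s:.1f} s -> {teleop_s:.1f} s")  # prints "32.2 s -> 17.0 s"
```

Under that assumption, the mean completion time drops from roughly 32 seconds to roughly 17 seconds. Both values are approximate, since the abstract itself rounds the 15.2-second figure.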