生成时间: 2026-02-25 16:53:58 (UTC+8); Arxiv 发布时间: 2026-02-25 20:00 EST (2026-02-26 09:00 UTC+8)
今天共有 31 篇相关文章
Keyword: reinforcement learning
Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning
混合策略RLVR中的可控探索用于多模态推理
- Authors: Zhuoxu Huang, Mengxi Jia, Hao Sun, Xuelong Li, Jungong Han
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2602.20197
- Pdf link: https://arxiv.org/pdf/2602.20197
- Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a primary learning paradigm for enhancing the reasoning capabilities of multi-modal large language models (MLLMs). However, during RL training, the enormous state space of MLLMs and sparse rewards often lead to entropy collapse, policy degradation, or over-exploitation of suboptimal behaviors. This necessitates an exploration strategy that maintains productive stochasticity while avoiding the inefficient exploration caused by uncontrolled random sampling. In this paper, we propose CalibRL, a hybrid-policy RLVR framework that supports controllable exploration with expert guidance, enabled by two key mechanisms. First, a distribution-aware advantage weighting scales updates by group rareness to calibrate the distribution, thereby preserving exploration. Meanwhile, the asymmetric activation function (LeakyReLU) leverages the expert knowledge as a calibration baseline to moderate overconfident updates while preserving their corrective direction. CalibRL increases policy entropy in a guided manner and clarifies the target distribution by estimating the on-policy distribution through online sampling. Updates are driven by these informative behaviors, avoiding convergence to erroneous patterns. Importantly, these designs help alleviate the distributional mismatch between the model's policy and expert trajectories, thereby achieving a more stable balance between exploration and exploitation. Extensive experiments across eight benchmarks, including both in-domain and out-of-domain settings, demonstrate consistent improvements, validating the effectiveness of our controllable hybrid-policy RLVR training. Code is available at this https URL.
- 中文摘要
带可验证奖励的强化学习(RLVR)已成为提升多模态大型语言模型(MLLM)推理能力的主要学习范式。然而,在强化学习训练中,MLLM的巨大状态空间和稀疏的奖励常常导致熵坍缩、策略退化或过度利用次优行为。这需要一种探索策略,既保持有益的随机性,又避免无控制随机采样带来的低效探索。本文提出CalibRL,一种混合策略RLVR框架,在专家指导下支持可控探索,由两个关键机制实现。首先,分布感知优势加权按组稀有度缩放更新,以校准分布,从而保持探索性。与此同时,非对称激活函数(LeakyReLU)利用专家知识作为校准基线,在保持修正方向的同时调节过度自信的更新。CalibRL以引导方式增加策略熵,并通过在线采样估计同策略(on-policy)分布来明确目标分布。更新由这些信息量丰富的行为驱动,避免收敛到错误模式。重要的是,这些设计有助于缓解模型策略与专家轨迹之间的分布不匹配,从而实现探索与利用之间更稳定的平衡。涵盖八个基准测试的广泛实验,包括域内和域外设置,显示出一致的改进,验证了我们可控混合策略RLVR训练的有效性。代码可在此 https URL 访问。
Sample-Efficient Learning with Online Expert Correction for Autonomous Catheter Steering in Endovascular Bifurcation Navigation
带在线专家校正的样本高效学习,用于血管内分岔导航中的自主导管引导
- Authors: Hao Wang, Tianliang Yao, Bo Lu, Zhiqiang Pei, Liu Dong, Lei Ma, Peng Qi
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2602.20216
- Pdf link: https://arxiv.org/pdf/2602.20216
- Abstract
Robot-assisted endovascular intervention offers a safe and effective solution for remote catheter manipulation, reducing radiation exposure while enabling precise navigation. Reinforcement learning (RL) has recently emerged as a promising approach for autonomous catheter steering; however, conventional methods suffer from sparse reward design and reliance on static vascular models, limiting their sample efficiency and generalization to intraoperative variations. To overcome these challenges, this paper introduces a sample-efficient RL framework with online expert correction for autonomous catheter steering in endovascular bifurcation navigation. The proposed framework integrates three key components: (1) A segmentation-based pose estimation module for accurate real-time state feedback, (2) A fuzzy controller for bifurcation-aware orientation adjustment, and (3) A structured reward generator incorporating expert priors to guide policy learning. By leveraging online expert correction, the framework reduces exploration inefficiency and enhances policy robustness in complex vascular structures. Experimental validation on a robotic platform using a transparent vascular phantom demonstrates that the proposed approach achieves convergence in 123 training episodes -- a 25.9% reduction compared to the baseline Soft Actor-Critic (SAC) algorithm -- while reducing average positional error to 83.8% of the baseline. These results indicate that combining sample-efficient RL with online expert correction enables reliable and accurate catheter steering, particularly in anatomically challenging bifurcation scenarios critical for endovascular navigation.
- 中文摘要
机器人辅助血管内介入为远程导管操作提供了安全有效的解决方案,既减少辐射暴露,又实现精准导航。强化学习(RL)最近被认为是一种有前景的自主导管引导方法;然而,传统方法受限于稀疏的奖励设计和对静态血管模型的依赖,限制了其样本效率及对术中变化的泛化能力。为克服这些挑战,本文提出了一个带在线专家校正的样本高效强化学习框架,用于血管内分岔导航中的自主导管引导。该框架集成了三个关键组件:(1)基于分割的姿态估计模块,用于准确的实时状态反馈;(2)用于分岔感知方向调整的模糊控制器;(3)包含专家先验的结构化奖励生成器,用于指导策略学习。通过利用在线专家校正,该框架减少了探索的低效性,并增强了复杂血管结构中的策略鲁棒性。在机器人平台上使用透明血管体模进行的实验验证表明,所提方法在123个训练回合内实现收敛,相比基线Soft Actor-Critic(SAC)算法减少了25.9%,同时将平均位置误差降低至基线的83.8%。这些结果表明,将样本高效的强化学习与在线专家校正相结合,能够实现可靠且准确的导管引导,尤其是在对血管内导航至关重要的解剖学上具有挑战性的分岔场景中。
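The two headline numbers in this abstract are easy to misread, so here is a small worked calculation. It only rearranges the figures quoted above; the implied baseline episode count is derived from them and is not stated in the abstract, and the variable names are illustrative.

```python
# Figures quoted in the abstract.
episodes_proposed = 123      # training episodes until convergence
episode_reduction = 0.259    # 25.9% fewer episodes than the SAC baseline
error_ratio = 0.838          # positional error as a fraction of the baseline

# Implied baseline episode count (derived here, not reported in the abstract).
implied_baseline_episodes = episodes_proposed / (1.0 - episode_reduction)

# "Reduced to 83.8% of the baseline" means a 16.2% reduction, not 83.8%.
error_reduction = 1.0 - error_ratio

print(round(implied_baseline_episodes))   # about 166 episodes
print(round(error_reduction * 100, 1))    # 16.2 (percent)
```

Under these figures the SAC baseline would have needed roughly 166 episodes, and the positional error improves by about 16%.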
What Matters for Simulation to Online Reinforcement Learning on Real Robots
真实机器人上从仿真到在线强化学习的关键因素
- Authors: Yarden As, Dhruva Tirumala, René Zurbrügg, Chenhao Li, Stelian Coros, Andreas Krause, Markus Wulfmeier
- Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2602.20220
- Pdf link: https://arxiv.org/pdf/2602.20220
- Abstract
We investigate what specific design choices enable successful online reinforcement learning (RL) on physical robots. Across 100 real-world training runs on three distinct robotic platforms, we systematically ablate algorithmic, systems, and experimental decisions that are typically left implicit in prior work. We find that some widely used defaults can be harmful, while a set of robust, readily adopted design choices within standard RL practice yield stable learning across tasks and hardware. These results provide the first large-sample empirical study of such design choices, enabling practitioners to deploy online RL with lower engineering effort.
- 中文摘要
我们研究了哪些具体的设计选择能使实体机器人上的在线强化学习(RL)取得成功。通过在三种不同机器人平台上的100次真实训练运行,我们对以往工作中通常隐含的算法、系统和实验决策进行了系统性消融。我们发现,一些广泛使用的默认设置可能有害,而标准强化学习实践中一组稳健且易于采用的设计选择则能在不同任务和硬件上实现稳定学习。这些结果提供了首个对此类设计选择的大样本实证研究,使从业者能够以更低的工程投入部署在线强化学习。
Aesthetic Camera Viewpoint Suggestion with 3D Aesthetic Field
带有3D美学场的美学摄像机视角建议
- Authors: Sheyang Tang, Armin Shafiee Sarvestani, Jialu Xu, Xiaoyu Xu, Zhou Wang
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2602.20363
- Pdf link: https://arxiv.org/pdf/2602.20363
- Abstract
The aesthetic quality of a scene depends strongly on camera viewpoint. Existing approaches for aesthetic viewpoint suggestion are either single-view adjustments, predicting limited camera adjustments from a single image without understanding scene geometry, or 3D exploration approaches, which rely on dense captures or prebuilt 3D environments coupled with costly reinforcement learning (RL) searches. In this work, we introduce the notion of 3D aesthetic field that enables geometry-grounded aesthetic reasoning in 3D with sparse captures, allowing efficient viewpoint suggestions in contrast to costly RL searches. We opt to learn this 3D aesthetic field using a feedforward 3D Gaussian Splatting network that distills high-level aesthetic knowledge from a pretrained 2D aesthetic model into 3D space, enabling aesthetic prediction for novel viewpoints from only sparse input views. Building on this field, we propose a two-stage search pipeline that combines coarse viewpoint sampling with gradient-based refinement, efficiently identifying aesthetically appealing viewpoints without dense captures or RL exploration. Extensive experiments show that our method consistently suggests viewpoints with superior framing and composition compared to existing approaches, establishing a new direction toward 3D-aware aesthetic modeling.
- 中文摘要
场景的美学质量很大程度上取决于摄像机视角。现有的美学视角建议方法要么是单视角调整,即在不理解场景几何的情况下从单张图像预测有限的摄像机调整;要么是依赖密集捕捉或预建3D环境,并结合昂贵的强化学习(RL)搜索的3D探索方法。在本研究中,我们引入了3D美学场的概念,能够利用稀疏捕捉在三维中实现基于几何的美学推理,从而实现高效的视角建议,而无需昂贵的强化学习搜索。我们选择通过前馈3D Gaussian Splatting网络学习这一3D美学场,该网络将预训练的2D美学模型中的高层美学知识蒸馏到3D空间中,从而仅凭稀疏输入视角即可对新视角进行美学预测。基于该美学场,我们提出了一个两阶段搜索流水线,结合粗视点采样与基于梯度的细化,高效识别美观视角,无需密集捕捉或强化学习探索。大量实验表明,与现有方法相比,我们的方法始终能推荐取景和构图更优的视角,开辟了迈向3D感知美学建模的新方向。
Generalizing from References using a Multi-Task Reference and Goal-Driven RL Framework
利用多任务参考和目标驱动强化学习框架从参考进行泛化
- Authors: Jiashun Wang, M. Eva Mungai, He Li, Jean Pierre Sleiman, Jessica Hodgins, Farbod Farshidian
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2602.20375
- Pdf link: https://arxiv.org/pdf/2602.20375
- Abstract
Learning agile humanoid behaviors from human motion offers a powerful route to natural, coordinated control, but existing approaches face a persistent trade-off: reference-tracking policies are often brittle outside the demonstration dataset, while purely task-driven Reinforcement Learning (RL) can achieve adaptability at the cost of motion quality. We introduce a unified multi-task RL framework that bridges this gap by treating reference motion as a prior for behavioral shaping rather than a deployment-time constraint. A single goal-conditioned policy is trained jointly on two tasks that share the same observation and action spaces, but differ in their initialization schemes, command spaces, and reward structures: (i) a reference-guided imitation task in which reference trajectories define dense imitation rewards but are not provided as policy inputs, and (ii) a goal-conditioned generalization task in which goals are sampled independently of any reference and where rewards reflect only task success. By co-optimizing these objectives within a shared formulation, the policy acquires structured, human-like motor skills from dense reference supervision while learning to adapt these skills to novel goals and initial conditions. This is achieved without adversarial objectives, explicit trajectory tracking, phase variables, or reference-dependent inference. We evaluate the method on a challenging box-based parkour playground that demands diverse athletic behaviors (e.g., jumping and climbing), and show that the learned controller transfers beyond the reference distribution while preserving motion naturalness. Finally, we demonstrate long-horizon behavior generation by composing multiple learned skills, illustrating the flexibility of the learned policies in complex scenarios.
- 中文摘要
从人类运动中学习敏捷的人形机器人行为为实现自然、协调的控制提供了一条有力途径,但现有方法面临一个持续的权衡:参考跟踪策略在演示数据集外往往脆弱,而纯任务驱动的强化学习(RL)则以牺牲动作质量为代价实现适应性。我们引入了一个统一的多任务强化学习框架,通过将参考运动视为行为塑造的先验,而非部署时的约束,弥合了这一差距。单一的目标条件策略在两个任务上联合训练,这两个任务共享相同的观测和动作空间,但初始化方案、指令空间和奖励结构不同:(i)参考引导的模仿任务,其中参考轨迹定义了密集的模仿奖励,但不作为策略输入;(ii)目标条件泛化任务,其目标的采样独立于任何参考,奖励仅反映任务成功。通过在共享的表述中共同优化这些目标,策略从密集的参考监督中获得结构化、类人的运动技能,同时学习将这些技能适应新目标和初始条件。这无需对抗目标、显式轨迹跟踪、相位变量或依赖参考的推理即可实现。我们在一个具有挑战性的基于箱子的跑酷场地上评估了该方法,该场地要求多样化的运动行为(如跳跃和攀爬),并证明所学的控制器能够迁移到参考分布之外,同时保持运动的自然性。最后,我们通过组合多种已学技能,演示了长时程行为生成,展示了所学策略在复杂情境下的灵活性。
Diffusion Modulation via Environment Mechanism Modeling for Planning
通过环境机制建模实现的扩散调制用于规划
- Authors: Hanping Zhang, Yuhong Guo
- Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2602.20422
- Pdf link: https://arxiv.org/pdf/2602.20422
- Abstract
Diffusion models have shown promising capabilities in trajectory generation for planning in offline reinforcement learning (RL). However, conventional diffusion-based planning methods often fail to account for the fact that generating trajectories in RL requires unique consistency between transitions to ensure coherence in real environments. This oversight can result in considerable discrepancies between the generated trajectories and the underlying mechanisms of a real environment. To address this problem, we propose a novel diffusion-based planning method, termed as Diffusion Modulation via Environment Mechanism Modeling (DMEMM). DMEMM modulates diffusion model training by incorporating key RL environment mechanisms, particularly transition dynamics and reward functions. Experimental results demonstrate that DMEMM achieves state-of-the-art performance for planning with offline reinforcement learning.
- 中文摘要
扩散模型在离线强化学习(RL)规划的轨迹生成方面展现出有前景的能力。然而,传统的基于扩散的规划方法往往未能考虑到:在强化学习中生成轨迹需要状态转移之间保持特有的一致性,以确保在真实环境中的连贯性。这种疏忽可能导致生成的轨迹与真实环境的底层机制之间存在显著差异。为解决这一问题,我们提出了一种基于扩散的新型规划方法,称为环境机制建模扩散调制(DMEMM)。DMEMM 通过整合关键的强化学习环境机制,特别是转移动力学和奖励函数,来调制扩散模型训练。实验结果表明,DMEMM在离线强化学习规划方面达到了最先进的性能。
KairosVL: Orchestrating Time Series and Semantics for Unified Reasoning
KairosVL:统一推理的时间序列与语义编排
- Authors: Haotian Si, Changhua Pei, Xiao He, Zeyan Li, Zhe Xie, Zexin Wang, Jiyao Hu, Zhaoyang Yu, Tieying Zhang, Dan Pei, Jianhui Li, Gaogang Xie
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2602.20494
- Pdf link: https://arxiv.org/pdf/2602.20494
- Abstract
Driven by the increasingly complex and decision-oriented demands of time series analysis, we introduce the Semantic-Conditional Time Series Reasoning task, which extends conventional time series analysis beyond purely numerical modeling to incorporate contextual and semantic understanding. To further enhance the model's reasoning capabilities on complex time series problems, we propose a two-round reinforcement learning framework: the first round strengthens the model's perception of fundamental temporal primitives, while the second focuses on semantic-conditioned reasoning. The resulting model, KairosVL, achieves competitive performance across both synthetic and real-world tasks. Extensive experiments and ablation studies demonstrate that our framework not only boosts performance but also preserves intrinsic reasoning ability and significantly improves generalization to unseen scenarios. To summarize, our work highlights the potential of combining semantic reasoning with temporal modeling and provides a practical framework for real-world time series intelligence, which is in urgent demand.
- 中文摘要
受时间序列分析日益复杂且面向决策的需求驱动,我们引入了语义条件时间序列推理任务,该任务将传统时间序列分析从纯数值建模扩展到纳入上下文和语义理解。为了进一步提升模型在复杂时间序列问题上的推理能力,我们提出了一个两轮强化学习框架:第一轮加强模型对基本时间原语的感知,第二轮则侧重于语义条件推理。最终的模型KairosVL在合成和现实任务中均具竞争力。大量实验和消融研究表明,我们的框架不仅提升了性能,还保留了内在推理能力,并显著提升了对未见场景的泛化能力。总之,我们的工作强调了将语义推理与时间建模结合的潜力,并为现实世界时间序列智能提供了切实可行的框架,这一需求极为迫切。
A Generalized Apprenticeship Learning Framework for Capturing Evolving Student Pedagogical Strategies
一个捕捉不断演变的学生教学策略的通用学徒学习框架
- Authors: Md Mirajul Islam, Xi Yang, Adittya Soukarjya Saha, Rajesh Debnath, Min Chi
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2602.20527
- Pdf link: https://arxiv.org/pdf/2602.20527
- Abstract
Reinforcement Learning (RL) and Deep Reinforcement Learning (DRL) have advanced rapidly in recent years and have been successfully applied to e-learning environments like intelligent tutoring systems (ITSs). Despite great success, the broader application of DRL to educational technologies has been limited due to major challenges such as sample inefficiency and difficulty designing the reward function. In contrast, Apprenticeship Learning (AL) uses a few expert demonstrations to infer the expert's underlying reward functions and derive decision-making policies that generalize and replicate optimal behavior. In this work, we leverage a generalized AL framework, THEMES, to induce effective pedagogical policies by capturing the complexities of the expert student learning process, where multiple reward functions may dynamically evolve over time. We evaluate THEMES against six state-of-the-art baselines, demonstrating its superior performance and highlighting its potential as a powerful alternative for inducing effective pedagogical policies. It achieves an AUC of 0.899 and a Jaccard of 0.653 using only 18 trajectories from a previous semester to predict student pedagogical decisions in a later semester.
- 中文摘要
强化学习(RL)和深度强化学习(DRL)近年来发展迅速,并已成功应用于智能辅导系统(ITSs)等电子学习环境中。尽管取得了巨大成功,但由于样本效率低和奖励函数设计困难等重大挑战,DRL在教育技术中的更广泛应用仍然有限。相比之下,学徒学习(AL)通过少量专家演示推断专家的潜在奖励函数,并推导出能够泛化并复现最优行为的决策策略。在本研究中,我们利用一个广义的AL框架THEMES,通过捕捉专家学生学习过程的复杂性(其中多个奖励函数可能随时间动态演变)来诱导有效的教学策略。我们将THEMES与六个最先进的基线进行比较评估,展示了其优异表现,并突出了其作为诱导有效教学策略的强大替代方案的潜力。结果显示,仅使用前一学期的18条轨迹来预测后一学期学生的教学决策,THEMES即可达到0.899的AUC和0.653的Jaccard。
Actor-Curator: Co-adaptive Curriculum Learning via Policy-Improvement Bandits for RL Post-Training
Actor-Curator:通过策略改进赌博机实现面向强化学习后训练的协同自适应课程学习
- Authors: Zhengyao Gu, Jonathan Light, Raul Astudillo, Ziyu Ye, Langzhou He, Henry Peng Zou, Wei Cheng, Santiago Paternain, Philip S. Yu, Yisong Yue
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2602.20532
- Pdf link: https://arxiv.org/pdf/2602.20532
- Abstract
Post-training large foundation models with reinforcement learning typically relies on massive and heterogeneous datasets, making effective curriculum learning both critical and challenging. In this work, we propose ACTOR-CURATOR, a scalable and fully automated curriculum learning framework for reinforcement learning post-training of large language models (LLMs). ACTOR-CURATOR learns a neural curator that dynamically selects training problems from large problem banks by directly optimizing for expected policy performance improvement. We formulate problem selection as a non-stationary stochastic bandit problem, derive a principled loss function based on online stochastic mirror descent, and establish regret guarantees under partial feedback. Empirically, ACTOR-CURATOR consistently outperforms uniform sampling and strong curriculum baselines across a wide range of challenging reasoning benchmarks, demonstrating improved training stability and efficiency. Notably, it achieves relative gains of 28.6% on AIME2024 and 30.5% on ARC-1D over the strongest baseline and up to 80% speedup. These results suggest that ACTOR-CURATOR is a powerful and practical approach for scalable LLM post-training.
- 中文摘要
使用强化学习对大型基础模型进行后训练通常依赖于庞大且异构的数据集,使得有效的课程学习既关键又充满挑战。本研究提出ACTOR-CURATOR,一种可扩展且全自动化的课程学习框架,用于大型语言模型(LLMs)的强化学习后训练。ACTOR-CURATOR 学习一个神经策展器(curator),通过直接优化预期的策略性能提升,动态地从大型问题库中选择训练问题。我们将问题选择表述为非平稳随机赌博机问题,基于在线随机镜像下降推导出有原则的损失函数,并在部分反馈下建立遗憾保证。实证上,ACTOR-CURATOR 在众多具有挑战性的推理基准测试中,始终优于均匀采样和强有力的课程基线,显示出训练稳定性和效率的提升。值得注意的是,相较于最强基线,它在AIME2024上实现了28.6%的相对提升,在ARC-1D上实现了30.5%的相对提升,且最高可达80%的加速。这些结果表明,ACTOR-CURATOR 是一种强大且实用的可扩展大型语言模型后训练方法。
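To make "problem selection as a bandit with partial feedback" concrete, the following is a minimal sketch of the textbook exponential-weights update, which is what online mirror descent with an entropy regularizer reduces to on the probability simplex. It is not ACTOR-CURATOR's neural curator or its loss function. The per-group improvement probabilities in `true_gain`, the learning rate, the exploration rate, and all function names are hypothetical, and the simulated environment is stationary for simplicity even though the abstract treats the problem as non-stationary.

```python
import math
import random


def sampling_probs(weights, explore):
    """Mix the normalized weights with uniform exploration."""
    total = sum(weights)
    k = len(weights)
    return [(1.0 - explore) * w / total + explore / k for w in weights]


def exp_weights_update(weights, arm, gain, probs, lr):
    """Exponential-weights step using only the pulled arm's feedback."""
    estimate = gain / probs[arm]        # importance-weighted, unbiased gain estimate
    updated = list(weights)
    updated[arm] *= math.exp(lr * estimate)
    top = max(updated)                  # rescale so weights never overflow
    return [w / top for w in updated]


def run_curriculum(true_gain, steps=2000, lr=0.05, explore=0.1, seed=0):
    """Simulate problem-group selection. true_gain[i] is a hypothetical
    probability that training on group i yields a policy improvement."""
    rng = random.Random(seed)
    weights = [1.0] * len(true_gain)
    for _ in range(steps):
        probs = sampling_probs(weights, explore)
        arm = rng.choices(range(len(weights)), weights=probs)[0]
        gain = 1.0 if rng.random() < true_gain[arm] else 0.0
        weights = exp_weights_update(weights, arm, gain, probs, lr)
    return sampling_probs(weights, explore)


final_probs = run_curriculum([0.1, 0.2, 0.9])
print(final_probs)  # sampling mass should concentrate on the last group
```

Only the selected group's outcome is observed each step, which is the partial-feedback setting; dividing by the sampling probability keeps the gain estimate unbiased for groups that are rarely chosen.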
From Logs to Language: Learning Optimal Verbalization for LLM-Based Recommendation in Production
从日志到语言:面向生产环境中基于LLM的推荐学习最优语言化
- Authors: Yucheng Shi, Ying Li, Yu Wang, Yesu Feng, Arjun Rao, Rein Houthooft, Shradha Sehgal, Jin Wang, Hao Zhen, Ninghao Liu, Linas Baltrunas
- Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
- Arxiv link: https://arxiv.org/abs/2602.20558
- Pdf link: https://arxiv.org/pdf/2602.20558
- Abstract
Large language models (LLMs) are promising backbones for generative recommender systems, yet a key challenge remains underexplored: verbalization, i.e., converting structured user interaction logs into effective natural language inputs. Existing methods rely on rigid templates that simply concatenate fields, yielding suboptimal representations for recommendation. We propose a data-centric framework that learns verbalization for LLM-based recommendation. Using reinforcement learning, a verbalization agent transforms raw interaction histories into optimized textual contexts, with recommendation accuracy as the training signal. This agent learns to filter noise, incorporate relevant metadata, and reorganize information to improve downstream predictions. Experiments on a large-scale industrial streaming dataset show that learned verbalization delivers up to 93% relative improvement in discovery item recommendation accuracy over template-based baselines. Further analysis reveals emergent strategies such as user interest summarization, noise removal, and syntax normalization, offering insights into effective context construction for LLM-based recommender systems.
- 中文摘要
大型语言模型(LLMs)是生成式推荐系统中有前景的骨干模型,但一个关键挑战仍未被充分探索:语言化(verbalization),即将结构化的用户交互日志转换为有效的自然语言输入。现有方法依赖于仅仅将字段拼接的僵化模板,导致用于推荐的表示不够理想。我们提出了一个以数据为中心的框架,用于为基于LLM的推荐学习语言化方式。通过强化学习,语言化智能体将原始交互历史转化为优化的文本上下文,并以推荐准确性作为训练信号。该智能体学习过滤噪声、整合相关元数据并重组信息以改进下游预测。在大规模工业流媒体数据集上的实验显示,学习得到的语言化相比基于模板的基线,在发现类物品推荐准确性上相对提升高达93%。进一步分析揭示了用户兴趣总结、噪声消除和语法规范化等涌现策略,为基于LLM的推荐系统的有效上下文构建提供了见解。
OptiLeak: Efficient Prompt Reconstruction via Reinforcement Learning in Multi-tenant LLM Services
OptiLeak:通过强化学习在多租户大型语言模型服务中实现高效提示重建
- Authors: Longxiang Wang, Xiang Zheng, Xuhao Zhang, Yao Zhang, Ye Wu, Cong Wang
- Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2602.20595
- Pdf link: https://arxiv.org/pdf/2602.20595
- Abstract
Multi-tenant LLM serving frameworks widely adopt shared Key-Value caches to enhance efficiency. However, this creates side-channel vulnerabilities enabling prompt leakage attacks. Prior studies identified these attack surfaces yet focused on expanding attack vectors rather than optimizing attack performance, reporting impractically high attack costs that underestimate the true privacy risk. We propose OptiLeak, a reinforcement learning-enhanced framework that maximizes prompt reconstruction efficiency through two-stage fine-tuning. Our key insight is that domain-specific "hard tokens" (terms difficult to predict yet carrying sensitive information) can be automatically identified via likelihood ranking and used to construct preference pairs for Direct Preference Optimization, eliminating manual annotation. This enables effective preference alignment while avoiding the overfitting issues of extended supervised fine-tuning. Evaluated on three benchmarks spanning medical and financial domains, OptiLeak achieves up to 12.48× reduction in average requests per token compared to baseline approaches, with consistent improvements across model scales from 3B to 14B parameters. Our findings demonstrate that cache-based prompt leakage poses a more severe threat than previously reported, underscoring the need for robust cache isolation in production deployments.
- 中文摘要
多租户LLM服务框架广泛采用共享键值缓存以提高效率。然而,这也会产生侧信道漏洞,使得提示泄露攻击成为可能。此前的研究识别了这些攻击面,但关注的是扩展攻击向量而非优化攻击性能,所报告的攻击成本高得不切实际,从而低估了真实的隐私风险。我们提出了OptiLeak,一种强化学习增强框架,通过两阶段微调最大化提示重建效率。我们的关键见解是,领域特定的“困难词元”(hard tokens),即难以预测但携带敏感信息的词,可以通过似然排序自动识别,并用于构建直接偏好优化(Direct Preference Optimization)的偏好对,从而无需人工标注。这不仅能有效对齐偏好,还能避免长时间监督微调带来的过拟合问题。在涵盖医疗和金融领域的三个基准上的评估表明,与基线方法相比,OptiLeak 将每个词元的平均请求数最多减少 12.48 倍,且在从 3B 到 14B 参数的模型规模上均有一致的改进。我们的发现表明,基于缓存的提示泄露所构成的威胁比此前报道的更为严重,凸显了生产部署中采用稳健缓存隔离的必要性。
From Pairs to Sequences: Track-Aware Policy Gradients for Keypoint Detection
从图像对到序列:用于关键点检测的轨迹感知策略梯度
- Authors: Yepeng Liu, Hao Li, Liwen Yang, Fangzhen Li, Xudi Ge, Yuliang Gu, kuang Gao, Bing Wang, Guang Chen, Hangjun Ye, Yongchao Xu
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2602.20630
- Pdf link: https://arxiv.org/pdf/2602.20630
- Abstract
Keypoint-based matching is a fundamental component of modern 3D vision systems, such as Structure-from-Motion (SfM) and SLAM. Most existing learning-based methods are trained on image pairs, a paradigm that fails to explicitly optimize for the long-term trackability of keypoints across sequences under challenging viewpoint and illumination changes. In this paper, we reframe keypoint detection as a sequential decision-making problem. We introduce TraqPoint, a novel, end-to-end Reinforcement Learning (RL) framework designed to optimize the Track-quality (Traq) of keypoints directly on image sequences. Our core innovation is a track-aware reward mechanism that jointly encourages the consistency and distinctiveness of keypoints across multiple views, guided by a policy gradient method. Extensive evaluations on sparse matching benchmarks, including relative pose estimation and 3D reconstruction, demonstrate that TraqPoint significantly outperforms some state-of-the-art (SOTA) keypoint detection and description methods.
- 中文摘要
基于关键点的匹配是现代三维视觉系统的基本组成部分,如运动恢复结构(SfM)和SLAM。大多数现有基于学习的方法都在图像对上训练,这种范式未能显式优化关键点在具有挑战性的视角和光照变化下跨序列的长期可跟踪性。本文将关键点检测重新定义为一个序列决策问题。我们提出TraqPoint,一个新颖的端到端强化学习(RL)框架,旨在直接在图像序列上优化关键点的跟踪质量(Track-quality,Traq)。我们的核心创新是一种轨迹感知奖励机制,在策略梯度方法的引导下共同促进关键点在多个视角中的一致性和独特性。在稀疏匹配基准上的广泛评估,包括相对姿态估计和三维重建,表明TraqPoint显著优于一些最先进的(SOTA)关键点检测和描述方法。
TrajGPT-R: Generating Urban Mobility Trajectory with Reinforcement Learning-Enhanced Generative Pre-trained Transformer
TrajGPT-R:利用强化学习增强生成预训练变换器生成城市出行轨迹
- Authors: Jiawei Wang, Chuang Yang, Jiawei Yong, Xiaohang Xu, Hongjun Wang, Noboru Koshizuka, Shintaro Fukushima, Ryosuke Shibasaki, Renhe Jiang
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2602.20643
- Pdf link: https://arxiv.org/pdf/2602.20643
- Abstract
Mobility trajectories are essential for understanding urban dynamics and enhancing urban planning, yet access to such data is frequently hindered by privacy concerns. This research introduces a transformative framework for generating large-scale urban mobility trajectories, employing a novel application of a transformer-based model pre-trained and fine-tuned through a two-phase process. Initially, trajectory generation is conceptualized as an offline reinforcement learning (RL) problem, with a significant reduction in vocabulary space achieved during tokenization. The integration of Inverse Reinforcement Learning (IRL) allows for the capture of trajectory-wise reward signals, leveraging historical data to infer individual mobility preferences. Subsequently, the pre-trained model is fine-tuned using the constructed reward model, effectively addressing the challenges inherent in traditional RL-based autoregressive methods, such as long-term credit assignment and handling of sparse reward environments. Comprehensive evaluations on multiple datasets illustrate that our framework markedly surpasses existing models in terms of reliability and diversity. Our findings not only advance the field of urban mobility modeling but also provide a robust methodology for simulating urban data, with significant implications for traffic management and urban development planning. The implementation is publicly available at this https URL.
- 中文摘要
出行轨迹对于理解城市动态和改进城市规划至关重要,但此类数据的获取常因隐私问题而受阻。本研究提出了一种变革性框架,用于生成大规模城市出行轨迹,采用基于Transformer的模型,通过两阶段流程进行预训练和微调。首先,轨迹生成被概念化为离线强化学习(RL)问题,并在分词过程中显著缩减了词表空间。逆向强化学习(IRL)的引入使得能够捕捉轨迹级的奖励信号,利用历史数据推断个体的出行偏好。随后,使用所构建的奖励模型对预训练模型进行微调,有效解决传统基于强化学习的自回归方法所固有的挑战,如长期信用分配和稀疏奖励环境的处理。在多个数据集上的全面评估表明,我们的框架在可靠性和多样性方面明显优于现有模型。我们的发现不仅推动了城市出行建模领域的发展,也提供了一种稳健的城市数据模拟方法,对交通管理和城市发展规划具有重要意义。实现可在此 https URL 公开获取。
CAMEL: Confidence-Gated Reflection for Reward Modeling
CAMEL:用于奖励建模的置信度门控反思
- Authors: Zirui Zhu, Hailun Xu, Yang Luo, Yong Liu, Kanchan Sarkar, Kun Xu, Yang You
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2602.20670
- Pdf link: https://arxiv.org/pdf/2602.20670
- Abstract
Reward models play a fundamental role in aligning large language models with human preferences. Existing methods predominantly follow two paradigms: scalar discriminative preference models, which are efficient but lack interpretability, and generative judging models, which offer richer reasoning at the cost of higher computational overhead. We observe that the log-probability margin between verdict tokens strongly correlates with prediction correctness, providing a reliable proxy for instance difficulty without additional inference cost. Building on this insight, we propose CAMEL, a confidence-gated reflection framework that performs a lightweight single-token preference decision first and selectively invokes reflection only for low-confidence instances. To induce effective self-correction, we train the model via reinforcement learning with counterfactual prefix augmentation, which exposes the model to diverse initial verdicts and encourages genuine revision. Empirically, CAMEL achieves state-of-the-art performance on three widely used reward-model benchmarks with 82.9% average accuracy, surpassing the best prior model by 3.2% and outperforming 70B-parameter models using only 14B parameters, while establishing a strictly better accuracy-efficiency Pareto frontier.
- 中文摘要
奖励模型在使大型语言模型与人类偏好对齐方面起着根本作用。现有方法主要遵循两种范式:标量判别式偏好模型,效率高但缺乏可解释性;生成式评判模型,提供更丰富的推理,但代价是计算开销较高。我们观察到,裁决词元(verdict tokens)之间的对数概率差值与预测正确性高度相关,可作为实例难度的可靠代理,且无需额外推理成本。基于这一见解,我们提出了CAMEL,这是一种置信度门控反思框架,先执行轻量级的单词元偏好决策,仅对低置信度实例选择性地调用反思。为了诱导有效的自我纠正,我们通过带有反事实前缀增强的强化学习训练模型,使模型接触到多样的初始裁决,并鼓励真正的修正。实证结果表明,CAMEL在三个广泛使用的奖励模型基准测试中实现了最先进的性能,平均准确率达到82.9%,比之前的最佳模型高出3.2%,并且仅用14B参数就优于70B参数模型,同时建立了严格更优的准确率-效率帕累托前沿。
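The gating idea in this abstract (decide quickly from one token, reflect only when the margin is small) can be sketched in a few lines. This is an illustrative reading of the abstract, not CAMEL's implementation: the threshold value, the `reflect` callable, and the function names are assumptions, and the log-probabilities would come from whatever judge model is in use.

```python
def verdict_margin(logp_a, logp_b):
    """Absolute log-probability gap between the two verdict tokens."""
    return abs(logp_a - logp_b)


def gated_decision(logp_a, logp_b, threshold, reflect):
    """Return (verdict, used_reflection).

    logp_a, logp_b: log-probabilities of the verdict tokens "A" and "B".
    threshold: hypothetical confidence cut-off on the margin.
    reflect: hypothetical callable that takes the fast verdict and returns
             a possibly revised one (standing in for the slow path).
    """
    fast = "A" if logp_a >= logp_b else "B"
    if verdict_margin(logp_a, logp_b) >= threshold:
        return fast, False          # confident: keep the single-token decision
    return reflect(fast), True      # uncertain: pay for reflection


# Toy usage with a stand-in reflection step that flips the verdict.
flip = lambda v: "B" if v == "A" else "A"
print(gated_decision(-0.1, -2.5, threshold=1.0, reflect=flip))  # ('A', False)
print(gated_decision(-0.6, -0.8, threshold=1.0, reflect=flip))  # ('B', True)
```

Because the margin is read off the same forward pass that produces the fast verdict, the gate itself adds no inference cost; only low-margin instances trigger the expensive path.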
IG-RFT: An Interaction-Guided RL Framework for VLA Models in Long-Horizon Robotic Manipulation
IG-RFT:一种用于长时程机器人操作中VLA模型的交互引导强化学习框架
- Authors: Zhian Su, Weijie Kong, Haonan Dong, Huixu Dong
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2602.20715
- Pdf link: https://arxiv.org/pdf/2602.20715
- Abstract
Vision-Language-Action (VLA) models have demonstrated significant potential for generalist robotic policies; however, they struggle to generalize to long-horizon complex tasks in novel real-world domains due to distribution shifts and the scarcity of high-quality demonstrations. Although reinforcement learning (RL) offers a promising avenue for policy improvement, applying it to real-world VLA fine-tuning faces challenges regarding exploration efficiency, training stability, and sample cost. To address these issues, we propose IG-RFT, a novel Interaction-Guided Reinforced Fine-Tuning system designed for flow-based VLA models. Firstly, to facilitate effective policy optimization, we introduce Interaction-Guided Advantage Weighted Regression (IG-AWR), an RL algorithm that dynamically modulates exploration intensity based on the robot's interaction status. Furthermore, to address the limitations of sparse or task-specific rewards, we design a novel hybrid dense reward function that integrates the trajectory-level reward and the subtask-level reward. Finally, we construct a three-stage RL system comprising SFT, Offline RL, and Human-in-the-Loop RL for fine-tuning VLA models. Extensive real-world experiments on four challenging long-horizon tasks demonstrate that IG-RFT achieves an average success rate of 85.0%, significantly outperforming SFT (18.8%) and standard Offline RL baselines (40.0%). Ablation studies confirm the critical contributions of IG-AWR and hybrid reward shaping. In summary, our work establishes and validates a novel reinforced fine-tuning system for VLA models in real-world robotic manipulation.
- 中文摘要
视觉-语言-动作(VLA)模型已展现出作为通用机器人策略的巨大潜力;然而,由于分布偏移和高质量演示稀缺,它们难以泛化到新的现实世界领域中的长时程复杂任务。尽管强化学习(RL)为策略改进提供了有前景的途径,但将其应用于现实世界的VLA微调面临探索效率、训练稳定性和样本成本等挑战。为解决这些问题,我们提出了IG-RFT,一种为基于流(flow-based)的VLA模型设计的新型交互引导强化微调系统。首先,为了促进有效的策略优化,我们引入了交互引导优势加权回归(IG-AWR),这是一种根据机器人交互状态动态调节探索强度的强化学习算法。此外,为了解决稀疏或任务特定奖励的局限性,我们设计了一种新型混合密集奖励函数,整合了轨迹级奖励和子任务级奖励。最后,我们构建了一个三阶段强化学习系统,包括SFT、离线强化学习和人在回路强化学习,用于微调VLA模型。在四个具有挑战性的长时程任务上的大量真实世界实验表明,IG-RFT的平均成功率为85.0%,显著优于SFT(18.8%)和标准离线强化学习基线(40.0%)。消融研究证实了IG-AWR和混合奖励塑造的关键贡献。总之,我们的工作建立并验证了一种用于现实世界机器人操作中VLA模型的新型强化微调系统。
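For readers unfamiliar with the base algorithm, here is a minimal sketch of the per-sample weights used in standard advantage-weighted regression (AWR), on which IG-AWR builds. The abstract does not say how the interaction status modulates exploration, so that part is deliberately not modeled; the temperature and clipping values are illustrative assumptions.

```python
import math


def awr_weights(advantages, temperature=1.0, max_weight=20.0):
    """Standard AWR weights: exp(advantage / temperature), clipped for stability.

    Each weight scales the log-likelihood (behavior-cloning) loss of its sample,
    so actions with higher advantage are imitated more strongly.
    """
    return [min(math.exp(a / temperature), max_weight) for a in advantages]


advantages = [-1.0, 0.0, 0.5, 5.0]
print(awr_weights(advantages))
# A larger temperature flattens the weights toward plain imitation.
print(awr_weights(advantages, temperature=5.0))
```

A zero advantage gives weight 1, negative advantages are down-weighted rather than discarded, and the clip keeps a single high-advantage sample from dominating the update.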
Buffer Matters: Unleashing the Power of Off-Policy Reinforcement Learning in Large Language Model Reasoning
缓冲区的重要性:释放大型语言模型推理中离策略强化学习的力量
- Authors: Xu Wan, Yansheng Wang, Wenqi Huang, Mingyang Sun
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2602.20722
- Pdf link: https://arxiv.org/pdf/2602.20722
- Abstract
Traditional on-policy Reinforcement Learning with Verifiable Rewards (RLVR) frameworks suffer from experience waste and reward homogeneity, which directly hinder learning efficiency on difficult samples during large language model post-training. In this paper, we introduce Batch Adaptation Policy Optimization (BAPO), an off-policy RLVR framework to improve the data efficiency of large language model post-training. It dynamically selects training batches by re-evaluating historically difficult samples and reusing high-quality ones, while holding a lower bound guarantee for policy improvement. Extensive experiments further demonstrate that BAPO achieves an average 12.5% improvement over GRPO across mathematics, planning, and visual reasoning tasks. Crucially, BAPO successfully resolves 40.7% of problems that base models consistently fail to solve.
- 中文摘要
传统的基于策略的可验证奖励强化学习(RLVR)框架存在经验浪费和奖励同质性的问题,这直接阻碍了大型语言模型训练后对困难样本的学习效率。本文介绍了批量适配策略优化(BAPO),这是一种非策略RLVR框架,用于提升大型语言模型训练后的数据效率。它通过重新评估历史上较难的样本并重用高质量样本,动态选择训练批次,同时对策略改进设有下界保证。大量实验进一步表明,BAPO在数学、规划和视觉推理任务中平均比GRPO提升12.5%。关键是,BAPO成功解决了40.7%的基础模型始终无法解决的问题。
Deep Reinforcement Learning Based Block Coordinate Descent for Downlink Weighted Sum-rate Maximization on AI-Native Wireless Networks
基于深度强化学习的块状坐标下降,用于AI原生无线网络下行加权和速率最大化
- Authors: Siya Chen, Chee Wei Tan, H. Vincent Poor
- Subjects: Networking and Internet Architecture (cs.NI); Information Theory (cs.IT)
- Arxiv link: https://arxiv.org/abs/2602.20724
- Pdf link: https://arxiv.org/pdf/2602.20724
- Abstract
This paper introduces a deep reinforcement learning-based block coordinate descent (DRL-based BCD) algorithm to address the nonconvex weighted sum-rate maximization (WSRM) problem with a total power constraint. Firstly, we present an efficient block coordinate descent (BCD) method to solve the problem. We then integrate deep reinforcement learning (DRL) techniques into the BCD method and propose the DRL-based BCD algorithm. This approach combines the data-driven learning capability of machine learning techniques with the navigational and decision-making characteristics of the optimization-theoretic-based BCD method. This combination significantly improves the algorithm's performance by reducing its sensitivity to initial points and mitigating the risk of entrapment in local optima. The primary advantages of the proposed DRL-based BCD algorithm lie in its ability to adhere to the constraints of the WSRM problem and significantly enhance accuracy, potentially achieving the exact optimal solution. Moreover, unlike many pure machine-learning approaches, the DRL-based BCD algorithm capitalizes on the underlying theoretical analysis of the WSRM problem's structure. This enables it to be easily trained and computationally efficient while maintaining a level of interpretability. Through numerical experiments, the DRL-based BCD algorithm demonstrates substantial advantages in effectiveness, efficiency, robustness, and interpretability for maximizing sum rates, which also provides valuable potential for designing resource-constrained AI-native wireless optimization strategies in next-generation wireless networks.
- 中文摘要
本文引入了基于深度强化学习的块坐标下降(基于DRL的BCD)算法,以解决具有全功率约束的非凸加权和率最大化(WSRM)问题。首先,我们提出了一种高效的块坐标下降(BCD)方法来解决该问题。随后,我们将深度强化学习(DRL)技术整合进BCD方法,并提出了基于DRL的BCD算法。该方法结合了机器学习技术的数据驱动学习能力与基于优化理论的BCD方法的导航和决策特性。这种组合显著提升了算法的性能,降低了对初始点的敏感性,并降低了被困在局部最优解中的风险。基于DRL的BCD算法的主要优势在于能够遵守WSRM问题的约束,并显著提升精度,从而有可能实现精确的最优解。此外,与许多纯机器学习方法不同,基于DRL的BCD算法充分利用了对WSRM问题结构的理论分析。这使得它能够轻松训练和计算高效,同时保持一定的可解释性。通过数值实验,基于DRL的BCD算法在效率、鲁棒性和可解释性方面展现出显著优势,以最大化求和率,同时也为下一代无线网络中设计资源受限的AI原生无线优化策略提供了宝贵潜力。
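For readers unfamiliar with the objective, here is a small sketch of the textbook weighted sum-rate under a total power budget. The interference-channel form of the SINR, the variable names, and the toy numbers are assumptions; the abstract does not give the exact system model, and this is not the DRL-based BCD algorithm itself.

```python
import math

def weighted_sum_rate(powers, gains, noise, weights):
    """Sum of w_l * log(1 + SINR_l), in nats.

    Assumed model: SINR_l = G[l][l] * p_l / (sum over j != l of G[l][j] * p_j + noise_l).
    """
    total = 0.0
    n = len(powers)
    for l in range(n):
        interference = sum(gains[l][j] * powers[j] for j in range(n) if j != l)
        sinr = gains[l][l] * powers[l] / (interference + noise[l])
        total += weights[l] * math.log(1.0 + sinr)
    return total

powers = [1.0, 1.0]
total_power_budget = 2.0  # hypothetical total power constraint
assert sum(powers) <= total_power_budget

rate_isolated = weighted_sum_rate(powers, [[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0], [1.0, 1.0])
rate_coupled = weighted_sum_rate(powers, [[1.0, 0.5], [0.5, 1.0]], [1.0, 1.0], [1.0, 1.0])
```

Because each user's power appears in the other users' denominators, the objective is nonconvex in the power vector, which is why block-wise updates and good initial points matter.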
Balancing Multiple Objectives in Urban Traffic Control with Reinforcement Learning from AI Feedback
在城市交通控制中平衡多重目标与基于AI反馈的强化学习
- Authors: Chenyang Zhao, Vinny Cahill, Ivana Dusparic
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2602.20728
- Pdf link: https://arxiv.org/pdf/2602.20728
- Abstract
Reward design has been one of the central challenges for real world reinforcement learning (RL) deployment, especially in settings with multiple objectives. Preference-based RL offers an appealing alternative by learning from human preferences over pairs of behavioural outcomes. More recently, RL from AI feedback (RLAIF) has demonstrated that large language models (LLMs) can generate preference labels at scale, mitigating the reliance on human annotators. However, existing RLAIF work typically focuses only on single-objective tasks, leaving the open question of how RLAIF handles systems that involve multiple objectives. In such systems trade-offs among conflicting objectives are difficult to specify, and policies risk collapsing into optimizing for a dominant goal. In this paper, we explore the extension of the RLAIF paradigm to multi-objective self-adaptive systems. We show that multi-objective RLAIF can produce policies that yield balanced trade-offs reflecting different user priorities without laborious reward engineering. We argue that integrating RLAIF into multi-objective RL offers a scalable path toward user-aligned policy learning in domains with inherently conflicting objectives.
- 中文摘要
奖励设计一直是现实世界强化学习(RL)部署的核心挑战之一,尤其是在多目标环境中。基于偏好的强化学习通过学习人类偏好而非成对的行为结果,提供了一种有吸引力的替代方案。最近,基于人工智能反馈的强化学习(RLAIF)证明了大型语言模型(LLMs)能够大规模生成偏好标签,减少对人工标注者的依赖。然而,现有的RLAIF工作通常只关注单一目标任务,导致RLAIF如何处理涉及多个目标的系统成为悬而未决的问题。在此类系统中,冲突目标之间的权衡难以明确界定,政策有崩溃为追求主导目标而崩溃的风险。本文探讨了将RLAIF范式扩展到多目标自适应系统。我们展示了多目标RLAIF可以生成反映不同用户优先级的平衡权衡策略,而无需繁琐的奖励工程。我们认为,将RLAIF整合进多目标强化学习,为实现本质目标冲突领域用户对齐的策略学习提供了可扩展的路径。
Fuz-RL: A Fuzzy-Guided Robust Framework for Safe Reinforcement Learning under Uncertainty
Fuz-RL:一个模糊引导的稳健框架,用于在不确定性下安全强化学习
- Authors: Xu Wan, Chao Yang, Cheng Yang, Jie Song, Mingyang Sun
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2602.20729
- Pdf link: https://arxiv.org/pdf/2602.20729
- Abstract
Safe Reinforcement Learning (RL) is crucial for achieving high performance while ensuring safety in real-world applications. However, the complex interplay of multiple uncertainty sources in real environments poses significant challenges for interpretable risk assessment and robust decision-making. To address these challenges, we propose Fuz-RL, a fuzzy measure-guided robust framework for safe RL. Specifically, our framework develops a novel fuzzy Bellman operator for estimating robust value functions using Choquet integrals. Theoretically, we prove that solving the Fuz-RL problem (in Constrained Markov Decision Process (CMDP) form) is equivalent to solving distributionally robust safe RL problems (in robust CMDP form), effectively avoiding min-max optimization. Empirical analyses on safe-control-gym and safety-gymnasium scenarios demonstrate that Fuz-RL effectively integrates with existing safe RL baselines in a model-free manner, significantly improving both safety and control performance under various types of uncertainties in observation, action, and dynamics.
- 中文摘要
安全强化学习(RL)对于实现高性能同时确保现实应用中的安全性至关重要。然而,现实环境中多重不确定性源的复杂相互作用,对可解释的风险评估和稳健决策提出了重大挑战。为应对这些挑战,我们提出了Fuz-RL,一种模糊测度引导的稳健框架,用于安全强化学习。具体来说,我们的框架开发了一个新颖的模糊贝尔曼算子,用于利用Choquet积分估计稳健的值函数。理论上,我们证明求解Fuz-RL问题(以受约束马尔可夫决策过程(CMDP)形式)等价于求解分布稳健的安全强化学习问题(以稳健CMDP形式),从而有效避免极小极大优化。在safe-control-gym和safety-gymnasium场景上的实证分析表明,Fuz-RL能够以无模型的方式与现有安全强化学习基线有效结合,显著提升了在观测、动作和动力学等各类不确定性下的安全性和控制性能。
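Since the abstract leans on Choquet integrals, a short sketch of the textbook discrete Choquet integral may help. This is not the Fuz-RL fuzzy Bellman operator; the two capacities below (an additive one and a squared, pessimistic one) are illustrative assumptions.

```python
def choquet_integral(values, capacity):
    """Discrete Choquet integral of nonnegative values with respect to a capacity.

    values: dict mapping each outcome to a nonnegative number.
    capacity: function from a frozenset of outcomes to [0, 1], monotone,
              with capacity(empty set) = 0 and capacity(all outcomes) = 1.
    """
    order = sorted(values, key=values.get)  # outcomes by ascending value
    total, previous = 0.0, 0.0
    for i, outcome in enumerate(order):
        upper_set = frozenset(order[i:])  # outcomes whose value is at least this one
        total += (values[outcome] - previous) * capacity(upper_set)
        previous = values[outcome]
    return total

probs = {"a": 0.5, "b": 0.5}
values = {"a": 1.0, "b": 3.0}

def additive(subset):
    # An additive capacity is a probability measure, so the integral is the mean.
    return sum(probs[x] for x in subset)

def pessimistic(subset):
    # Squaring shrinks the weight of upper sets, giving a risk-averse estimate.
    return sum(probs[x] for x in subset) ** 2

expected_value = choquet_integral(values, additive)
robust_value = choquet_integral(values, pessimistic)
```

With an additive capacity the integral reduces to the ordinary expectation; a non-additive capacity lets the same formula express a more cautious value estimate without an explicit min-max over distributions.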
PyVision-RL: Forging Open Agentic Vision Models via RL
PyVision-RL:通过强化学习打造开放代理视觉模型
- Authors: Shitian Zhao, Shaoheng Lin, Ming Li, Haoquan Zhang, Wenshuo Peng, Kaipeng Zhang, Chen Wei
- Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2602.20739
- Pdf link: https://arxiv.org/pdf/2602.20739
- Abstract
Reinforcement learning for agentic multimodal models often suffers from interaction collapse, where models learn to reduce tool usage and multi-turn reasoning, limiting the benefits of agentic behavior. We introduce PyVision-RL, a reinforcement learning framework for open-weight multimodal models that stabilizes training and sustains interaction. Our approach combines an oversampling-filtering-ranking rollout strategy with an accumulative tool reward to prevent collapse and encourage multi-turn tool use. Using a unified training pipeline, we develop PyVision-Image and PyVision-Video for image and video understanding. For video reasoning, PyVision-Video employs on-demand context construction, selectively sampling task-relevant frames during reasoning to significantly reduce visual token usage. Experiments show strong performance and improved efficiency, demonstrating that sustained interaction and on-demand visual processing are critical for scalable multimodal agents.
- 中文摘要
智能体式多模态模型的强化学习常常存在交互坍缩问题,即模型学会减少工具使用和多轮推理,限制了智能体行为的益处。我们介绍了PyVision-RL,一个用于开放权重多模态模型的强化学习框架,能够稳定训练并维持交互。我们的方法结合了过采样-过滤-排序的rollout策略与累积工具奖励,以防止坍缩并鼓励多轮工具使用。通过统一的训练流程,我们开发了PyVision-Image和PyVision-Video,用于图像和视频理解。在视频推理方面,PyVision-Video采用按需上下文构建,在推理过程中选择性采样任务相关的帧,显著减少视觉token的使用。实验显示了强劲的性能和更高的效率,表明持续的交互和按需视觉处理对于可扩展的多模态智能体至关重要。
Overton Pluralistic Reinforcement Learning for Large Language Models
Overton 多元强化学习用于大型语言模型
- Authors: Yu Fu, Seongho Son, Ilija Bogunovic
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2602.20759
- Pdf link: https://arxiv.org/pdf/2602.20759
- Abstract
Existing alignment paradigms remain limited in capturing the pluralistic nature of human values. Overton Pluralism addresses this gap by generating responses with diverse perspectives from a single query. This paper introduces OP-GRPO (Overton Pluralistic Group Relative Policy Optimization), a reinforcement learning framework for implicit Overton Pluralism that enables a single large language model to produce pluralistic responses without explicit prompting or modular orchestration. Our workflow consists of two main steps. First, similarity estimator training fine-tunes a Sentence Transformer for Overton Pluralism tasks to provide more accurate coverage evaluation of generated responses. Second, OP-GRPO training incorporates this similarity estimator into a dual-reward system designed to ensure both broad coverage of genuine human perspectives and the uniqueness of each perspective, thereby promoting diversity. Empirical results demonstrate a "small models, big perspective coverage" effect. The trained Qwen2.5-3B-Instruct model surpasses a 20B GPT-OSS baseline with a 37.4 percent relative accuracy gain on a Natural Language Inference benchmark, and also outperforms a modular architecture baseline with a 19.1 percent relative improvement. Additional evaluations using GPT-4.1 as a large language model judge further confirm the robustness of the approach.
- 中文摘要
现有的对齐范式在捕捉人类价值观的多元性方面仍然有限。Overton 多元主义通过从单一查询生成多元视角的回答来弥补这一空白。本文介绍了OP-GRPO(Overton多元群体相对策略优化),这是一种针对隐式Overton多元主义的强化学习框架,使单个大型语言模型能够在无需显式提示或模块化编排的情况下产生多元响应。我们的工作流程主要包括两个步骤。首先,相似度估计器训练针对Overton多元主义任务微调一个Sentence Transformer,以便对生成的回答提供更准确的覆盖度评估。其次,OP-GRPO训练将该相似度估计器纳入双重奖励体系,旨在确保广泛覆盖真实的人类视角并保证每种视角的独特性,从而促进多样性。实证结果显示存在“小模型,大视角覆盖”效应。训练好的Qwen2.5-3B-Instruct模型在自然语言推断基准测试中以37.4%的相对准确率提升超过20B GPT-OSS基线,并且以19.1%的相对提升优于模块化架构基线。使用GPT-4.1作为大型语言模型评判的额外评估进一步确认了该方法的稳健性。
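As background for the GRPO-style training mentioned above, the sketch below shows the usual group-relative advantage: each sampled response's reward is standardized against the other responses to the same prompt. The example rewards stand in for a combined coverage-plus-uniqueness score; how OP-GRPO actually combines its two rewards is not stated in the abstract, so those numbers are purely hypothetical.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of responses to the same prompt.

    Uses the population standard deviation; implementations differ on this detail.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(reward - mean) / (std + eps) for reward in rewards]

# Hypothetical combined rewards for four sampled responses.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])

# When every response earns the same reward, the group carries no learning signal.
flat_advantages = group_relative_advantages([1.0, 1.0, 1.0])
```

The second call illustrates why reward diversity within a group matters: identical rewards yield all-zero advantages and therefore no gradient.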
Probing Dec-POMDP Reasoning in Cooperative MARL
探讨合作式MARL中对DEC-POMDP推理的探讨
- Authors: Kale-ab Tessera, Leonard Hinckeldey, Riccardo Zamboni, David Abel, Amos Storkey
- Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
- Arxiv link: https://arxiv.org/abs/2602.20804
- Pdf link: https://arxiv.org/pdf/2602.20804
- Abstract
Cooperative multi-agent reinforcement learning (MARL) is typically framed as a decentralised partially observable Markov decision process (Dec-POMDP), a setting whose hardness stems from two key challenges: partial observability and decentralised coordination. Genuinely solving such tasks requires Dec-POMDP reasoning, where agents use history to infer hidden states and coordinate based on local information. Yet it remains unclear whether popular benchmarks actually demand this reasoning or permit success via simpler strategies. We introduce a diagnostic suite combining statistically grounded performance comparisons and information-theoretic probes to audit the behavioural complexity of baseline policies (IPPO and MAPPO) across 37 scenarios spanning MPE, SMAX, Overcooked, Hanabi, and MaBrax. Our diagnostics reveal that success on these benchmarks rarely requires genuine Dec-POMDP reasoning. Reactive policies match the performance of memory-based agents in over half the scenarios, and emergent coordination frequently relies on brittle, synchronous action coupling rather than robust temporal influence. These findings suggest that some widely used benchmarks may not adequately test core Dec-POMDP assumptions under current training paradigms, potentially leading to over-optimistic assessments of progress. We release our diagnostic tooling to support more rigorous environment design and evaluation in cooperative MARL.
- 中文摘要
合作多智能体强化学习(MARL)通常被框架为去中心化的部分可观测马尔可夫决策过程(Dec-POMDP),其难度源于两个关键挑战:部分可观测性和去中心化协调。真正解决此类任务需要Dec-POMDP推理,代理利用历史推断隐藏状态并基于局部信息进行协调。然而,目前尚不清楚流行基准是否真的要求这种推理,还是允许通过更简单的策略取得成功。我们引入了一套结合统计基础性能比较和信息理论探针的诊断套件,以审计涵盖MPE、SMAX、Overcooked、Hanabi和MaBrax等37种场景下的基线策略(IPPO和MAPPO)行为复杂性。我们的诊断显示,在这些基准测试上的成功很少需要真正的Dec-POMDP推理。在超过一半的场景中,反应策略的性能与基于内存的代理相匹配,而涌现协调通常依赖于脆弱的同步动作耦合,而非稳健的时间影响。这些发现表明,一些广泛使用的基准可能无法充分检验当前训练范式下的核心Dec-POMDP假设,可能导致进展过于乐观的评估。我们发布诊断工具,以支持协作式MARL中更严谨的环境设计和评估。
Regret-Guided Search Control for Efficient Learning in AlphaZero
AlphaZero中高效学习的遗憾引导搜索控制
- Authors: Yun-Jui Tsai, Wei-Yu Chen, Yan-Ru Ju, Yu-Hung Chang, Ti-Rong Wu
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2602.20809
- Pdf link: https://arxiv.org/pdf/2602.20809
- Abstract
Reinforcement learning (RL) agents achieve remarkable performance but remain far less learning-efficient than humans. While RL agents require extensive self-play games to extract useful signals, humans often need only a few games, improving rapidly by repeatedly revisiting states where mistakes occurred. This idea, known as search control, aims to restart from valuable states rather than always from the initial state. In AlphaZero, prior work Go-Exploit applies this idea by sampling past states from self-play or search trees, but it treats all states equally, regardless of their learning potential. We propose Regret-Guided Search Control (RGSC), which extends AlphaZero with a regret network that learns to identify high-regret states, where the agent's evaluation diverges most from the actual outcome. These states are collected from both self-play trajectories and MCTS nodes, stored in a prioritized regret buffer, and reused as new starting positions. Across 9x9 Go, 10x10 Othello, and 11x11 Hex, RGSC outperforms AlphaZero and Go-Exploit by an average of 77 and 89 Elo, respectively. When training on a well-trained 9x9 Go model, RGSC further improves the win rate against KataGo from 69.3% to 78.2%, while both baselines show no improvement. These results demonstrate that RGSC provides an effective mechanism for search control, improving both efficiency and robustness of AlphaZero training. Our code is available at this https URL.
- 中文摘要
强化学习(RL)代理取得了显著的性能,但学习效率远低于人类。强化学习代理需要大量自玩游戏来提取有用信号,而人类通常只需几局游戏,通过反复回访错误发生的状态快速提升。这一理念称为搜索控制,旨在从有价值的状态重新开始,而非总是从初始状态开始。在AlphaZero中,之前的Go-Exploit通过从自玩或搜索树中采样过去状态来应用这一理念,但它对所有状态一视同仁,无论其学习潜力如何。我们提出了遗憾引导搜索控制(RGSC),它通过学习识别高悔状态的遗憾网络扩展了AlphaZero,这些状态中代理的评估与实际结果差异最大。这些状态从自游轨迹和MCTS节点收集,存储在优先级遗憾缓冲区,并作为新的起始局面重新使用。在9x9的Go、10x10的Othello和11x11的Hex中,RGSC平均领先AlphaZero和Go-Exploit 77和89 Elo。在训练良好的9x9围棋模型上,RGSC进一步将对KataGo的胜率从69.3%提升至78.2%,而两个基线均无提升。这些结果表明,RGSC为搜索控制提供了有效的机制,提高了AlphaZero训练的效率和稳健性。我们的代码可在此 https URL 访问。
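The prioritized regret buffer described above can be pictured with a small sketch. Everything here is an assumption made for illustration: the class name, the fixed capacity with lowest-regret eviction, and sampling in proportion to regret. In the paper the regret comes from a learned regret network, whereas this sketch simply takes a nonnegative number per state.

```python
import heapq
import random

class RegretBuffer:
    """Keep the highest-regret states and sample restart states by regret."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []    # min-heap of (regret, insertion_index, state)
        self._counter = 0  # tie-breaker so states themselves are never compared

    def add(self, state, regret):
        entry = (regret, self._counter, state)
        self._counter += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        else:
            # Push the new entry, then evict whichever entry has the lowest regret.
            heapq.heappushpop(self._heap, entry)

    def sample(self, rng):
        # Small offset keeps the weights valid even if every regret is zero.
        weights = [regret + 1e-8 for regret, _, _ in self._heap]
        return rng.choices(self._heap, weights=weights, k=1)[0][2]

buffer = RegretBuffer(capacity=2)
for state, regret in [("s1", 0.1), ("s2", 0.9), ("s3", 0.5)]:
    buffer.add(state, regret)
restart_state = buffer.sample(random.Random(0))
```

Restarting self-play from `restart_state` rather than the initial position is the search-control idea; the weighting decides how aggressively high-regret states are revisited.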
LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding
LongVideo-R1:低成本长视频理解的智能导航
- Authors: Jihao Qiu, Lingxi Xie, Xinyue Huo, Qi Tian, Qixiang Ye
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2602.20913
- Pdf link: https://arxiv.org/pdf/2602.20913
- Abstract
This paper addresses the critical and underexplored challenge of long video understanding with low computational budgets. We propose LongVideo-R1, an active, reasoning-equipped multimodal large language model (MLLM) agent designed for efficient video context navigation, avoiding the redundancy of exhaustive search. At the core of LongVideo-R1 lies a reasoning module that leverages high-level visual cues to infer the most informative video clip for subsequent processing. During inference, the agent initiates traversal from top-level visual summaries and iteratively refines its focus, immediately halting the exploration process upon acquiring sufficient knowledge to answer the query. To facilitate training, we first extract hierarchical video captions from CGBench, a video corpus with grounding annotations, and guide GPT-5 to generate 33K high-quality chain-of-thought-with-tool trajectories. The LongVideo-R1 agent is fine-tuned upon the Qwen-3-8B model through a two-stage paradigm: supervised fine-tuning (SFT) followed by reinforcement learning (RL), where RL employs a specifically designed reward function to maximize selective and efficient clip navigation. Experiments on multiple long video benchmarks validate the effectiveness of LongVideo-R1, which achieves a superior trade-off between QA accuracy and efficiency. All curated data and source code are provided in the supplementary material and will be made publicly available. Code and data are available at: this https URL
- 中文摘要
本文探讨了低计算预算下长视频理解这一关键且未被充分探讨的挑战。我们提出了LongVideo-R1,一种具备推理功能的主动多模态大型语言模型(MLLM)代理,旨在高效地进行视频上下文导航,避免了穷尽搜索的冗余。LongVideo-R1的核心是一个推理模块,利用高层次的视觉线索推断出最有信息量的视频片段以便后续处理。在推理过程中,代理从顶层视觉摘要开始遍历,并迭代地细化焦点,在获得足够知识回答问题后立即停止探索过程。为了促进训练,我们首先从带有定位标注的视频语料库CGBench中提取分层视频字幕,并引导GPT-5生成33K条高质量的带工具思维链轨迹。LongVideo-R1代理通过两阶段范式在Qwen-3-8B模型上进行微调:先进行监督微调(SFT),再进行强化学习(RL),其中RL采用专门设计的奖励函数以最大化选择性和高效的片段导航。在多个长视频基准上的实验验证了LongVideo-R1的有效性,它在问答(QA)准确率和效率之间取得了更优的权衡。所有整理的数据和源代码均包含在补充材料中,并将公开发布。代码和数据可在以下 https URL 获取
Task-oriented grasping for dexterous robots using postural synergies and reinforcement learning
利用姿势协同和强化学习,针对灵巧机器人的任务导向抓取
- Authors: Dimitrios Dimou, José Santos-Victor, Plinio Moreno
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2602.20915
- Pdf link: https://arxiv.org/pdf/2602.20915
- Abstract
In this paper, we address the problem of task-oriented grasping for humanoid robots, emphasizing the need to align with human social norms and task-specific objectives. Existing methods employ a variety of open-loop and closed-loop approaches but lack an end-to-end solution that can grasp several objects while taking into account the downstream task's constraints. Our proposed approach employs reinforcement learning to enhance task-oriented grasping, prioritizing the post-grasp intention of the agent. We extract human grasp preferences from the ContactPose dataset and train a hand synergy model based on the Variational Autoencoder (VAE) to imitate the participants' grasping actions. Based on this data, we train an agent able to grasp multiple objects while taking into account distinct, task-specific post-grasp intentions. By combining data-driven insights from human grasping behavior with the learning by exploration provided by reinforcement learning, we can develop humanoid robots capable of context-aware manipulation actions, facilitating collaboration in human-centered environments.
- 中文摘要
本文探讨了类人机器人任务导向抓取的问题,强调与人类社会规范和任务特定目标保持一致的必要性。现有方法采用多种开环和闭环方法,但缺乏能够在考虑下游任务约束的同时抓取多个物体的端到端解决方案。我们提出的方法采用强化学习来增强任务导向的抓取,优先考虑智能体的抓取后意图。我们从ContactPose数据集中提取人类抓握偏好,并基于变分自编码器(VAE)训练手部协同模型,模仿参与者的抓取动作。基于这些数据,我们训练一个能够抓取多个物体的智能体,同时考虑任务特定的不同抓取后意图。通过结合人类抓取行为的数据驱动洞见与强化学习带来的探索式学习,我们可以开发具备情境感知操作能力的人形机器人,促进以人为中心的环境中的协作。
The Art of Efficient Reasoning: Data, Reward, and Optimization
高效推理的艺术:数据、奖励与优化
- Authors: Taiqiang Wu, Zenan Zu, Bo Zhou, Ngai Wong
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2602.20945
- Pdf link: https://arxiv.org/pdf/2602.20945
- Abstract
Large Language Models (LLMs) consistently benefit from scaled Chain-of-Thought (CoT) reasoning, but also suffer from heavy computational overhead. To address this issue, efficient reasoning aims to incentivize short yet accurate thinking trajectories, typically through reward shaping with Reinforcement Learning (RL). In this paper, we systematically investigate the mechanics of efficient reasoning for LLMs. For comprehensive evaluation, we advocate for more fine-grained metrics, including length distribution conditioned on correctness and performance across a wide spectrum of token budgets ranging from 2k to 32k. First, we reveal that the training process follows a two-stage paradigm: length adaptation and reasoning refinement. After that, we conduct extensive experiments (about 0.2 million GPU hours) in a unified protocol, deconstructing training prompts and rollouts, reward shaping, and optimization strategies. In particular, a key finding is to train on relatively easier prompts, ensuring the density of positive reward signals and thus avoiding the length collapse. Meanwhile, the learned length bias can be generalized across domains. We distill all findings into valuable insights and practical guidelines, and further validate them across the Qwen3 series, ranging from 0.6B to 30B, demonstrating the robustness and generalization.
- 中文摘要
大型语言模型(LLM)始终受益于规模化思维链(CoT)推理,但也存在较大的计算开销。为解决这一问题,高效推理旨在激励简短但准确的思维轨迹,通常通过强化学习(RL)进行奖励塑造。本文系统地研究了大型语言模型高效推理的机制。为了全面评估,我们主张采用更细致的指标,包括以正确性为条件的长度分布,以及在2k到32k的广泛token预算下的性能。首先,我们揭示训练过程遵循两个阶段的范式:长度适应和推理精炼。之后,我们在统一协议中进行了大量实验(约20万GPU小时),拆解训练提示与rollout、奖励塑造和优化策略。特别是,一个关键发现是用相对简单的提示进行训练,确保正向奖励信号的密度,从而避免长度崩溃。与此同时,学到的长度偏置可以泛化到不同领域。我们将所有发现提炼为有价值的见解和实用指南,并在参数规模从0.6B到30B的Qwen3系列上进一步验证,展示了其稳健性和泛化性。
Localized Dynamics-Aware Domain Adaption for Off-Dynamics Offline Reinforcement Learning
用于离线动态强化学习的局部动态感知域适配
- Authors: Zhangjie Xia, Yu Yang, Pan Xu
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2602.21072
- Pdf link: https://arxiv.org/pdf/2602.21072
- Abstract
Off-dynamics offline reinforcement learning (RL) aims to learn a policy for a target domain using limited target data and abundant source data collected under different transition dynamics. Existing methods typically address dynamics mismatch either globally over the state space or via pointwise data filtering; these approaches can miss localized cross-domain similarities or incur high computational cost. We propose Localized Dynamics-Aware Domain Adaptation (LoDADA), which exploits localized dynamics mismatch to better reuse source data. LoDADA clusters transitions from source and target datasets and estimates cluster-level dynamics discrepancy via domain discrimination. Source transitions from clusters with small discrepancy are retained, while those from clusters with large discrepancy are filtered out. This yields a fine-grained and scalable data selection strategy that avoids overly coarse global assumptions and expensive per-sample filtering. We provide theoretical insights and extensive experiments across environments with diverse global and local dynamics shifts. Results show that LoDADA consistently outperforms state-of-the-art off-dynamics offline RL methods by better leveraging localized distribution mismatch.
- 中文摘要
离动力学离线强化学习(RL)旨在利用有限的目标数据和在不同转移动态下收集的丰富源数据,学习目标域的策略。现有方法通常通过全局状态空间或点状数据过滤处理动态不匹配;这些方法可能遗漏局部的跨域相似性,或导致高计算成本。我们提出了局部动力学感知域适应(LoDADA),利用局部动力学不匹配更好地重用源数据。LoDADA 对源和目标数据集的迁移进行聚类,并通过域判别估计群组级动态差异。来自差异较小的簇的源转移被保留,而来自差异较大的簇的转移则被过滤掉。这带来了一种细粒度且可扩展的数据选择策略,避免了过于粗糙的全局假设和昂贵的每样本过滤。我们提供理论见解和广泛的实验,涵盖全球和局部动态变化多样的环境。结果显示,LoDADA通过更好地利用局部分布错配,持续优于最先进的非动态离线强化学习方法。
Cooperative-Competitive Team Play of Real-World Craft Robots
现实世界工艺机器人的合作竞技团队游戏
- Authors: Rui Zhao, Xihui Li, Yizheng Zhang, Yuzhen Liu, Zhong Zhang, Yufeng Zhang, Cheng Zhou, Zhengyou Zhang, Lei Han
- Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2602.21119
- Pdf link: https://arxiv.org/pdf/2602.21119
- Abstract
Multi-agent deep Reinforcement Learning (RL) has made significant progress in developing intelligent game-playing agents in recent years. However, the efficient training of collective robots using multi-agent RL and the transfer of learned policies to real-world applications remain open research questions. In this work, we first develop a comprehensive robotic system, including simulation, distributed learning framework, and physical robot components. We then propose and evaluate reinforcement learning techniques designed for efficient training of cooperative and competitive policies on this platform. To address the challenges of multi-agent sim-to-real transfer, we introduce Out of Distribution State Initialization (OODSI) to mitigate the impact of the sim-to-real gap. In the experiments, OODSI improves the Sim2Real performance by 20%. We demonstrate the effectiveness of our approach through experiments with a multi-robot car competitive game and a cooperative task in real-world settings.
- 中文摘要
近年来,多智能体深度强化学习(RL)在智能游戏代理开发方面取得了显著进展。然而,利用多智能体强化学习高效训练集体机器人以及将所学策略转化为现实应用,仍是未解的研究问题。在这项工作中,我们首先开发了一个全面的机器人系统,包括仿真、分布式学习框架和物理机器人组件。随后,我们提出并评估旨在高效培训合作与竞争性政策的强化学习技术。为应对多智能体模拟到现实传输的挑战,我们引入了分布外状态初始化(OODSI),以减轻模拟到真实差距的影响。在实验中,OODSI 提高了 Sim2Real 的性能 20%。我们通过多机器人汽车竞赛游戏和现实世界中的合作任务实验,展示了我们方法的有效性。
SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards
SELAUR:通过不确定性感知奖励实现自我进化的LLM代理
- Authors: Dengjia Zhang, Xiaoou Liu, Lu Cheng, Yaqing Wang, Kenton Murray, Hua Wei
- Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2602.21158
- Pdf link: https://arxiv.org/pdf/2602.21158
- Abstract
Large language models (LLMs) are increasingly deployed as multi-step decision-making agents, where effective reward design is essential for guiding learning. Although recent work explores various forms of reward shaping and step-level credit assignment, a key signal remains largely overlooked: the intrinsic uncertainty of LLMs. Uncertainty reflects model confidence, reveals where exploration is needed, and offers valuable learning cues even in failed trajectories. We introduce SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards, a reinforcement learning framework that incorporates uncertainty directly into the reward design. SELAUR integrates entropy-, least-confidence-, and margin-based metrics into a combined token-level uncertainty estimate, providing dense confidence-aligned supervision, and employs a failure-aware reward reshaping mechanism that injects these uncertainty signals into step- and trajectory-level rewards to improve exploration efficiency and learning stability. Experiments on two benchmarks, ALFWorld and WebShop, show that our method consistently improves success rates over strong baselines. Ablation studies further demonstrate how uncertainty signals enhance exploration and robustness.
- 中文摘要
大型语言模型(LLMs)越来越多地被部署为多步骤决策智能体,其中有效的奖励设计对于引导学习至关重要。尽管近期研究探讨了各种形式的奖励塑造和步级信用分配,但一个关键信号仍在很大程度上被忽视:LLMs的内在不确定性。不确定性反映了模型的置信度,揭示需要探索的地方,并且即使在失败的轨迹中也能提供宝贵的学习线索。我们介绍了SELAUR:通过不确定性感知奖励实现自我进化的LLM智能体,这是一个将不确定性直接融入奖励设计的强化学习框架。SELAUR将基于熵、最小置信度和间隔(margin)的指标整合为一个综合的token级不确定性估计,提供密集的、与置信度对齐的监督,并采用失败感知的奖励重塑机制,将这些不确定性信号注入步级和轨迹级奖励中,以提升探索效率和学习稳定性。在两个基准测试ALFWorld和WebShop上的实验显示,我们的方法相较于强基线持续提升成功率。消融研究进一步展示了不确定性信号如何增强探索和稳健性。
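The three uncertainty signals named in the abstract have standard definitions, sketched below for a single next-token distribution. The equal-weight average used to combine them is an assumption; the abstract does not say how SELAUR weights the components or folds them into rewards.

```python
import math

def token_uncertainty(probs):
    """Combine entropy, least-confidence, and margin uncertainty into one score in [0, 1].

    probs: next-token probabilities summing to 1, with at least two entries.
    Assumption: the three components are averaged with equal weight.
    """
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    normalized_entropy = entropy / math.log(len(probs))  # 1.0 for a uniform distribution
    ranked = sorted(probs, reverse=True)
    least_confidence = 1.0 - ranked[0]       # low when the top token dominates
    margin = 1.0 - (ranked[0] - ranked[1])   # high when the top two tokens are close
    return (normalized_entropy + least_confidence + margin) / 3.0

uncertain = token_uncertainty([0.25, 0.25, 0.25, 0.25])
confident = token_uncertainty([0.97, 0.01, 0.01, 0.01])
certain = token_uncertainty([1.0, 0.0, 0.0, 0.0])
```

Averaging such a score over the tokens of a step would give a dense, step-level signal that is available even when the trajectory as a whole fails.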
Squint: Fast Visual Reinforcement Learning for Sim-to-Real Robotics
Squint:模拟到现实机器人的快速视觉强化学习
- Authors: Abdulaziz Almuzairee, Henrik I. Christensen
- Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2602.21203
- Pdf link: https://arxiv.org/pdf/2602.21203
- Abstract
Visual reinforcement learning is appealing for robotics but expensive -- off-policy methods are sample-efficient yet slow; on-policy methods parallelize well but waste samples. Recent work has shown that off-policy methods can train faster than on-policy methods in wall-clock time for state-based control. Extending this to vision remains challenging, where high-dimensional input images complicate training dynamics and introduce substantial storage and encoding overhead. To address these challenges, we introduce Squint, a visual Soft Actor Critic method that achieves faster wall-clock training than prior visual off-policy and on-policy methods. Squint achieves this via parallel simulation, a distributional critic, resolution squinting, layer normalization, a tuned update-to-data ratio, and an optimized implementation. We evaluate on the SO-101 Task Set, a new suite of eight manipulation tasks in ManiSkill3 with heavy domain randomization, and demonstrate sim-to-real transfer to a real SO-101 robot. We train policies for 15 minutes on a single RTX 3090 GPU, with most tasks converging in under 6 minutes.
- 中文摘要
视觉强化学习对机器人技术很有吸引力,但成本高昂:离策略方法样本效率高但速度较慢;同策略方法并行性好,但会浪费样本。最新研究表明,在基于状态的控制中,离策略方法在实际耗时(wall-clock time)上可以比同策略方法训练得更快。将其扩展到视觉仍然具有挑战性,因为高维输入图像使训练动态复杂,并带来了大量的存储和编码开销。为应对这些挑战,我们引入了Squint,一种视觉Soft Actor Critic方法,其实际训练耗时比以往的视觉离策略和同策略方法更短。Squint通过并行仿真、分布型评论家(distributional critic)、分辨率压缩(resolution squinting)、层归一化、经过调优的更新与数据比例以及优化的实现来达到这一点。我们在SO-101任务集上进行评估,这是ManiSkill3中一套包含八个操作任务、带有强域随机化的新套件,并演示了到真实SO-101机器人的模拟到现实迁移。我们在一块RTX 3090 GPU上训练策略15分钟,大多数任务在6分钟内收敛。
Keyword: diffusion policy
Recursive Belief Vision Language Model
递归信念视觉语言模型
- Authors: Vaidehi Bagaria, Bijo Sebastian, Nirav Patel
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2602.20659
- Pdf link: https://arxiv.org/pdf/2602.20659
- Abstract
Current vision-language-action (VLA) models struggle with long-horizon manipulation under partial observability. Most existing approaches remain observation-driven, relying on short context windows or repeated queries to vision-language models (VLMs). This leads to loss of task progress, action repetition under perceptual aliasing, and high inference latency. Semantic reasoning alone is not the primary bottleneck in long-horizon manipulation. Instead, VLAs lack persistent, action-conditioned state representations and exhibit limited temporal and physical reasoning, making them ill-suited for multi-stage control. This paper introduces RB-VLA, a belief-centric architecture trained with self-supervised world-model objectives that maintains a compact latent state encoding task-relevant history, dynamics, and object interactions. Queried once for high-level intent, the VLM provides task specification, while the belief tracks task progress and enables phase-aware, causally grounded control under partial observability without storing raw observations or scaling memory with time. The belief and intent jointly condition a diffusion policy for robust closed-loop execution. RB-VLA outperforms prior VLAs on long-horizon benchmarks, achieving 52.5% and 37.5% higher success on multi-stage pick-and-place and stacking tasks, respectively, compared to $\pi_0$. It also reduces inference latency by up to 5x relative to baselines and eliminates memory growth across timesteps observed in existing VLAs. Ablations show that the belief module is the primary driver of performance, increasing success rates from 32.5% to 77.5%. These results demonstrate the effectiveness of belief-based state representations for long-horizon VLA policies.
- 中文摘要
当前的视觉-语言-动作(VLA)模型在部分可观测性下难以完成长时程操作。大多数现有方法仍以观测为驱动,依赖短上下文窗口或对视觉语言模型(VLM)的重复查询。这导致任务进度丢失、感知混叠下的动作重复以及较高的推理延迟。仅靠语义推理并不是长时程操作的主要瓶颈。相反,VLA缺乏持久的、以动作为条件的状态表征,且表现出有限的时间和物理推理能力,因此不适合多阶段控制。本文介绍了RB-VLA,一种以信念为中心的架构,通过自监督的世界模型目标训练,维护一个紧凑的潜在状态,编码任务相关的历史、动态和物体交互。VLM只需针对高层意图查询一次,即可提供任务规范,而信念则跟踪任务进展,并在部分可观测性下实现阶段感知、有因果依据的控制,无需存储原始观测数据,内存也不随时间增长。信念和意图共同作为扩散策略的条件,以实现稳健的闭环执行。RB-VLA在长时程基准测试中优于以往VLA,在多阶段拾取放置和堆叠任务上的成功率分别比$\pi_0$高出52.5%和37.5%。它还将推理延迟相较于基线降低了最多5倍,并消除了在现有VLA中观察到的跨时间步内存增长。消融实验显示,信念模块是性能的主要驱动力,将成功率从32.5%提升至77.5%。这些结果证明了基于信念的状态表示对于长时程VLA策略的有效性。