Arxiv Papers of Today

Generated: 2026-03-11 16:45:52 (UTC+8); Arxiv release time: 2026-03-11 20:00 EDT (2026-03-12 08:00 UTC+8)

35 relevant papers today

Keyword: reinforcement learning

VisionCreator-R1: A Reflection-Enhanced Native Visual-Generation Agentic Model

Authors: Jinxiang Lai, Wenzhe Zhao, Zexin Lu, Hualei Zhang, Qinyu Yang, Rongwei Quan, Zhimin Li, Shuai Shao, Song Guo, Qinglin Lu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.08812
Pdf link: https://arxiv.org/pdf/2603.08812
Abstract Visual content generation has advanced from single-image to multi-image workflows, yet existing agents remain largely plan-driven and lack systematic reflection mechanisms to correct mid-trajectory visual errors. To address this limitation, we propose VisionCreator-R1, a native visual generation agent with explicit reflection, together with a Reflection-Plan Co-Optimization (RPCO) training methodology. Through extensive experiments and trajectory-level analysis, we uncover a reflection-plan optimization asymmetry in reinforcement learning (RL): planning can be reliably optimized via plan rewards, while reflection learning is hindered by noisy credit assignment. Guided by this insight, RPCO first trains on the self-constructed VCR-SFT dataset, which contains reflection-strong single-image trajectories and planning-strong multi-image trajectories, then co-optimizes reflection and planning on the VCR-RL dataset via RL. This yields our unified VisionCreator-R1 agent, which consistently outperforms Gemini2.5Pro on existing benchmarks and on our VCR-bench, covering single-image and multi-image tasks.

APPLV: Adaptive Planner Parameter Learning from Vision-Language-Action Model

Authors: Yuanjie Lu, Beichen Wang, Zhengqi Wu, Yang Li, Xiaomin Lin, Chengzhi Mao, Xuesu Xiao
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.08862
Pdf link: https://arxiv.org/pdf/2603.08862
Abstract Autonomous navigation in highly constrained environments remains challenging for mobile robots. Classical navigation approaches offer safety assurances but require environment-specific parameter tuning; end-to-end learning bypasses parameter tuning but struggles with precise control in constrained spaces. To this end, recent robot learning approaches automate parameter tuning while retaining classical systems' safety, yet still face challenges in generalizing to unseen environments. Recently, Vision-Language-Action (VLA) models have shown promise by leveraging foundation models' scene understanding capabilities, but still struggle with precise control and inference latency in navigation tasks. In this paper, we propose Adaptive Planner Parameter Learning from Vision-Language-Action Model (APPLV). Unlike traditional VLA models that directly output actions, APPLV leverages pre-trained vision-language models with a regression head to predict planner parameters that configure classical planners. We develop two training strategies: supervised fine-tuning on collected navigation trajectories and reinforcement learning fine-tuning to further optimize navigation performance. We evaluate APPLV across multiple motion planners on the simulated Benchmark Autonomous Robot Navigation (BARN) dataset and in physical robot experiments. Results demonstrate that APPLV outperforms existing methods in both navigation performance and generalization to unseen environments.

Optimizing Reinforcement Learning Training over Digital Twin Enabled Multi-fidelity Networks

Authors: Hanzhi Yu, Hasan Farooq, Julien Forgeat, Shruti Bothe, Kristijonas Cyras, Md Moin Uddin Chowdhury, Mingzhe Chen
Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2603.08931
Pdf link: https://arxiv.org/pdf/2603.08931
Abstract In this paper, we investigate a novel digital network twin (DNT) assisted deep learning (DL) model training framework. In particular, we consider a physical network where a base station (BS) uses several antennas to serve multiple mobile users, and a DNT that is a virtual representation of the physical network. The BS must adjust its antenna tilt angles to optimize the data rates of all users. Due to user mobility, the BS may not be able to accurately track network dynamics such as wireless channels and user positions. Hence, a reinforcement learning (RL) approach is used to dynamically adjust the antenna tilt angles. To train the RL model, we can use data collected from the physical network and the DNT. The data collected from the physical network is more accurate but incurs more communication overhead than the data collected from the DNT. Therefore, it is necessary to determine the ratio of data collected from the physical network to data collected from the DNT to improve the training of the RL model. We formulate this as an optimization problem whose goal is to jointly optimize the tilt angle adjustment policy and the data collection strategy, maximizing the data rates of all users while constraining the time delay introduced by collecting data from the physical network. To solve this problem, we propose a hierarchical RL framework that integrates robust adversarial loss and proximal policy optimization (PPO). Simulation results show that our proposed method reduces the physical network data collection delay by up to 28.01% and 1x compared, respectively, to a hierarchical RL method that uses vanilla PPO as the first-level RL and to a baseline that uses robust RL at the first level and selects the data collection ratio randomly.
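
For readers unfamiliar with PPO, the per-sample clipped surrogate term that vanilla PPO maximizes can be sketched as below. This shows only the standard PPO clipping rule; the paper's robust adversarial loss and hierarchical structure are not reproduced, and the function name and the default `eps` are illustrative assumptions.

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate for one sample.

    ratio     -- new_policy_prob / old_policy_prob for the taken action
    advantage -- estimated advantage of that action
    eps       -- clipping range (0.2 is a common default, assumed here)
    """
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Taking the minimum removes any incentive to move the ratio
    # outside [1 - eps, 1 + eps] in the direction that helps.
    return min(ratio * advantage, clipped * advantage)
```

With a positive advantage, a ratio above `1 + eps` earns no extra credit; with a negative advantage, a ratio below `1 - eps` is penalized at the clipped value.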

Interpretable Markov-Based Spatiotemporal Risk Surfaces for Missing-Child Search Planning with Reinforcement Learning and LLM-Based Quality Assurance

Authors: Joshua Castillo, Ravi Mukkamala
Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.08933
Pdf link: https://arxiv.org/pdf/2603.08933
Abstract The first 72 hours of a missing-child investigation are critical for successful recovery. However, law enforcement agencies often face fragmented, unstructured data and a lack of dynamic, geospatial predictive tools. Our system, Guardian, provides an end-to-end decision-support system for missing-child investigation and early search planning. It converts heterogeneous, unstructured case documents into a schema-aligned spatiotemporal representation, enriches cases with geocoding and transportation context, and provides probabilistic search products spanning 0-72 hours. In this paper, we present an overview of Guardian as well as a detailed description of a three-layer predictive component of the system. The first layer is a Markov chain, a sparse, interpretable model with transitions incorporating road accessibility costs, seclusion preferences, and corridor bias with separate day/night parameterizations. The Markov chain's output prediction distributions are then transformed into operationally useful search plans by the second layer's reinforcement learning. Finally, the third layer's LLM performs post hoc validation of layer 2 search plans prior to their release. Using a synthetic but realistic case study, we report quantitative outputs across 24/48/72-hour horizons and analyze sensitivity, failure modes, and tradeoffs. Results show that the proposed predictive system with the three-layer architecture produces interpretable priors for zone optimization and human review.
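
The core operation of a Markov-chain layer like the one described, pushing a location distribution through a transition matrix over several time steps, can be sketched as follows. The three-zone layout and every probability below are made up for illustration; the abstract does not give Guardian's actual states, costs, or day/night parameters.

```python
def step(dist, P):
    """One Markov step: new[j] = sum_i dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def propagate(dist, P, steps):
    """Apply `steps` transitions to the initial distribution."""
    for _ in range(steps):
        dist = step(dist, P)
    return dist


# Hypothetical 3-zone example: row i holds zone i's outgoing probabilities.
P_DAY = [
    [0.6, 0.3, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
]
start = [1.0, 0.0, 0.0]  # last known position: zone 0

risk = propagate(start, P_DAY, steps=3)  # distribution after three steps
```

Separate day and night behavior could be modeled by switching between two such matrices depending on the time step, though that is an assumption about the design rather than something the abstract spells out.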

FAME: Force-Adaptive RL for Expanding the Manipulation Envelope of a Full-Scale Humanoid

Authors: Niraj Pudasaini, Yutong Zhang, Jensen Lavering, Alessandro Roncone, Nikolaus Correll
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.08961
Pdf link: https://arxiv.org/pdf/2603.08961
Abstract Maintaining balance under external hand forces is critical for humanoid bimanual manipulation, where interaction forces propagate through the kinematic chain and constrain the feasible manipulation envelope. We propose FAME, a force-adaptive reinforcement learning framework that conditions a standing policy on a learned latent context encoding upper-body joint configuration and bimanual interaction forces. During training, we apply diverse, spherically sampled 3D forces on each hand to inject disturbances in simulation together with an upper-body pose curriculum, exposing the policy to manipulation-induced perturbations across continuously varying arm configurations. At deployment, interaction forces are estimated from the robot dynamics and fed to the same encoder, enabling online adaptation without wrist force/torque sensors. In simulation across five fixed arm configurations with randomized hand forces and commanded base heights, FAME improves mean standing success to 73.84%, compared to 51.40% for the curriculum-only baseline and 29.44% for the base policy. We further deploy the learned policy on a full-scale Unitree H12 humanoid and evaluate robustness in representative load-interaction scenarios, including asymmetric single-arm load and symmetric bimanual load. Code and videos are available on this https URL.
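
One common way to draw a force with a uniformly random 3D direction is to normalize a Gaussian sample and then scale it. The sketch below assumes that approach and a uniform magnitude in `[0, max_newtons]`; the abstract only says the forces are "spherically sampled", so both choices, along with the function name, are assumptions.

```python
import math
import random


def sample_force(max_newtons, rng=random):
    """Return a 3D force with uniform random direction and magnitude in [0, max]."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 1e-12:  # guard against a (vanishingly unlikely) zero vector
            break
    magnitude = rng.uniform(0.0, max_newtons)
    return [magnitude * c / norm for c in v]
```

Normalizing an isotropic Gaussian gives directions that are uniform on the sphere, which avoids the pole clustering produced by sampling two angles uniformly.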

MAPLE: Elevating Medical Reasoning from Statistical Consensus to Process-Led Alignment

Authors: Kailong Fan, Anqi Pu, Yichen Wu, Wanhua Li, Yicong Li, Hanspeter Pfister, Huafeng Liu, Xiang Li, Quanzheng Li, Ning Guo
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.08987
Pdf link: https://arxiv.org/pdf/2603.08987
Abstract Recent advances in medical large language models have explored Test-Time Reinforcement Learning (TTRL) to enhance reasoning. However, standard TTRL often relies on majority voting (MV) as a heuristic supervision signal, which can be unreliable in complex medical scenarios where the most frequent reasoning path is not necessarily the clinically correct one. In this work, we propose a novel and unified training paradigm that integrates medical process reward models with TTRL to bridge the gap between test-time scaling (TTS) and parametric model optimization. Specifically, we advance the TTRL framework by replacing the conventional MV with a fine-grained, expert-aligned supervision paradigm using Med-RPM. This integration ensures that reinforcement learning is guided by medical correctness rather than mere consensus, effectively distilling search-based intelligence into the model's parametric memory. Extensive evaluations on four different benchmarks demonstrate that our method consistently and significantly outperforms current TTRL and standalone PRM selection. Our findings establish that transitioning from stochastic heuristics to structured, step-wise rewards is essential for developing reliable and scalable medical AI systems.
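
The contrast between majority voting and process-reward-based selection can be illustrated with a toy example. The answers, the per-step scores, and the choice to aggregate step scores by their mean are all hypothetical; the abstract does not describe how the paper's reward model scores or aggregates steps.

```python
from collections import Counter


def majority_vote(answers):
    """Pick the most frequent final answer (ties: first seen wins)."""
    return Counter(answers).most_common(1)[0][0]


def prm_select(answers, step_scores):
    """Pick the answer whose reasoning path has the highest mean step score."""
    means = [sum(s) / len(s) for s in step_scores]
    best = max(range(len(answers)), key=lambda i: means[i])
    return answers[best]


# Hypothetical sampled answers and per-step scores from a process reward model.
answers = ["A", "A", "B"]
step_scores = [[0.4, 0.3], [0.5, 0.2], [0.9, 0.8]]
```

Here majority voting returns "A" because it appears twice, while score-based selection returns "B" because its reasoning steps are rated higher. This is the kind of disagreement the abstract points to.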

PlayWorld: Learning Robot World Models from Autonomous Play

Authors: Tenny Yin, Zhiting Mei, Zhonghe Zheng, Miyu Yamane, David Wang, Jade Sceats, Samuel M. Bateman, Lihan Zha, Apurva Badithela, Ola Shorinwa, Anirudha Majumdar
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.09030
Pdf link: https://arxiv.org/pdf/2603.09030
Abstract Action-conditioned video models offer a promising path to building general-purpose robot simulators that can improve directly from data. Yet, despite training on large-scale robot datasets, current state-of-the-art video models still struggle to predict physically consistent robot-object interactions that are crucial in robotic manipulation. To close this gap, we present PlayWorld, a simple, scalable, and fully autonomous pipeline for training high-fidelity video world simulators from interaction experience. In contrast to prior approaches that rely on success-biased human demonstrations, PlayWorld is the first system capable of learning entirely from unsupervised robot self-play, enabling naturally scalable data collection while capturing complex, long-tailed physical interactions essential for modeling realistic object dynamics. Experiments across diverse manipulation tasks show that PlayWorld generates high-quality, physically consistent predictions for contact-rich interactions that are not captured by world models trained on human-collected data. We further demonstrate the versatility of PlayWorld in enabling fine-grained failure prediction and policy evaluation, with up to 40% improvements over human-collected data. Finally, we demonstrate how PlayWorld enables reinforcement learning in the world model, improving policy performance by 65% in success rates when deployed in the real world.

Synergistic Directed Execution and LLM-Driven Analysis for Zero-Day AI-Generated Malware Detection

Authors: George Edwards, Mahdi Eslamimehr
Subjects: Cryptography and Security (cs.CR); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2603.09044
Pdf link: https://arxiv.org/pdf/2603.09044
Abstract The weaponization of LLMs for automated malware generation poses an existential threat to conventional detection paradigms. AI-generated malware exhibits polymorphic, metamorphic, and context-aware evasion capabilities that render signature-based and shallow heuristic defenses obsolete. This paper introduces a novel hybrid analysis framework that synergistically combines concolic execution with LLM-augmented path prioritization and deep-learning-based vulnerability classification to detect zero-day AI-generated malware with provable guarantees. We formalize the detection problem within a first-order temporal logic over program execution traces, define a lattice-theoretic abstraction for path constraint spaces, and prove both the soundness and relative completeness of our detection algorithm, assuming classifier correctness. The framework introduces three novel algorithms: (i) an LLM-guided concolic exploration strategy that reduces the average number of explored paths by 73.2% compared to depth-first search while maintaining equivalent malicious-path coverage; (ii) a transformer-based path-constraint classifier trained on symbolic execution traces; and (iii) a feedback loop that iteratively refines the LLM's prioritization policy using reinforcement learning from detection outcomes. We provide a comprehensive implementation built upon angr 9.2, Z3 4.12, Hugging Face Transformers 4.38, and PyTorch 2.2, with configuration details enabling reproducibility. Experimental evaluation on EMBER, Malimg, SOREL-20M, and a novel AI-Gen-Malware benchmark comprising 2,500 LLM-synthesized samples demonstrates that the framework achieves 98.7% accuracy on conventional malware and 97.5% accuracy on AI-generated threats, outperforming ClamAV, YARA, MalConv, and EMBER-GBDT baselines by margins of 8.4 to 52.2 percentage points on AI-generated samples.

Learning Adaptive LLM Decoding

Authors: Chloe H. Su, Zhe Ye, Samuel Tenka, Aidan Yang, Soonho Kong, Udaya Ghai
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.09065
Pdf link: https://arxiv.org/pdf/2603.09065
Abstract Decoding from large language models (LLMs) typically relies on fixed sampling hyperparameters (e.g., temperature, top-p), despite substantial variation in task difficulty and uncertainty across prompts and individual decoding steps. We propose to learn adaptive decoding policies that dynamically select sampling strategies at inference time, conditioned on available compute resources. Rather than fine-tuning the language model itself, we introduce lightweight decoding adapters trained with reinforcement learning and verifiable terminal rewards (e.g. correctness on math and coding tasks). At the sequence level, we frame decoding as a contextual bandit problem: a policy selects a decoding strategy (e.g. greedy, top-k, min-p) for each prompt, conditioned on the prompt embedding and a parallel sampling budget. At the token level, we model decoding as a partially observable Markov decision process (POMDP), where a policy selects sampling actions at each token step based on internal model features and the remaining token budget. Experiments on the MATH and CodeContests benchmarks show that the learned adapters improve the accuracy-budget tradeoff: on MATH, the token-level adapter improves Pass@1 accuracy by up to 10.2% over the best static baseline under a fixed token budget, while the sequence-level adapter yields 2-3% gains under fixed parallel sampling. Ablation analyses support the contribution of both sequence- and token-level adaptation.
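
The decoding strategies named in the abstract (greedy, top-k, min-p) can be written as small filters over a next-token probability vector, which is roughly the action set an adaptive policy would choose from. The sketch below covers only these static strategies under their usual definitions; the learned adapter itself is not shown, and the example distribution is made up.

```python
def renormalize(probs):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(probs)
    return [p / total for p in probs]


def top_k(probs, k):
    """Keep the k most likely tokens, zero out the rest."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    return renormalize([p if i in keep else 0.0 for i, p in enumerate(probs)])


def min_p(probs, ratio):
    """Keep tokens whose probability is at least ratio * max probability."""
    threshold = ratio * max(probs)
    return renormalize([p if p >= threshold else 0.0 for p in probs])


def greedy(probs):
    """Return the index of the most likely token."""
    return max(range(len(probs)), key=lambda i: probs[i])


probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical next-token distribution
```

A sequence-level policy in the spirit of the abstract would pick one of these functions, and its parameter, per prompt; a token-level policy would make that choice again at every step.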

Overcoming Valid Action Suppression in Unmasked Policy Gradient Algorithms

Authors: Renos Zabounidis, Roy Siegelmann, Mohamad Qadri, Woojun Kim, Simon Stepputtis, Katia P. Sycara
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.09090
Pdf link: https://arxiv.org/pdf/2603.09090
Abstract In reinforcement learning environments with state-dependent action validity, action masking consistently outperforms penalty-based handling of invalid actions, yet existing theory only shows that masking preserves the policy gradient theorem. We identify a distinct failure mode of unmasked training: it systematically suppresses valid actions at states the agent has not yet visited. This occurs because gradients pushing down invalid actions at visited states propagate through shared network parameters to unvisited states where those actions are valid. We prove that for softmax policies with shared features, when an action is invalid at visited states but valid at an unvisited state $s^*$, the probability $\pi(a \mid s^*)$ is bounded by exponential decay due to parameter sharing and the zero-sum identity of softmax logits. This bound reveals that entropy regularization trades off between protecting valid actions and sample efficiency, a tradeoff that masking eliminates. We validate empirically that deep networks exhibit the feature alignment condition required for suppression, and experiments on Craftax, Craftax-Classic, and MiniHack confirm the predicted exponential suppression and demonstrate that feasibility classification enables deployment without oracle masks.
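
Action masking, as contrasted with penalty-based handling, is usually implemented by setting the logits of invalid actions to negative infinity before the softmax. The sketch below shows that standard construction; the function names are illustrative, and it assumes at least one action is valid in every state.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def masked_softmax(logits, valid):
    """Softmax over valid actions only; invalid actions get probability 0.

    Assumes at least one entry of `valid` is True.
    """
    masked = [x if ok else -math.inf for x, ok in zip(logits, valid)]
    return softmax(masked)
```

Because a masked action has exactly zero probability, it contributes no gradient that pushes its logit down, which is the suppression pathway the abstract attributes to unmasked training.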

Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards

Authors: Zhengzhao Ma, Xueru Wen, Boxi Cao, Yaojie Lu, Hongyu Lin, Jinglin Yang, Min He, Xianpei Han, Le Sun
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.09117
Pdf link: https://arxiv.org/pdf/2603.09117
Abstract Reinforcement Learning from Verifiable Rewards (RLVR) significantly enhances the reasoning of large language models (LLMs) but suffers severely from calibration degeneration, where models become excessively over-confident in incorrect answers. Previous studies focus on directly incorporating a calibration objective into the existing optimization target. However, our theoretical analysis demonstrates that there is a fundamental gradient conflict between maximizing policy accuracy and minimizing calibration error. Building on this insight, we propose DCPO, a simple yet effective framework that systematically decouples reasoning and calibration objectives. Extensive experiments demonstrate that DCPO not only preserves accuracy on par with GRPO but also achieves the best calibration performance and substantially mitigates the over-confidence issue. Our study provides valuable insights and a practical solution for more reliable LLM deployment.
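
Calibration error is often measured with Expected Calibration Error (ECE), which bins predictions by stated confidence and compares each bin's accuracy with its mean confidence. The abstract does not say which metric the paper reports, so ECE with ten equal-width bins is an assumption here.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - average confidence|."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += len(items) / n * abs(accuracy - avg_conf)
    return ece
```

A model that reports 0.95 confidence on four answers but gets only one right scores an ECE of about 0.70, which is the kind of over-confidence the abstract describes.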

RubiCap: Rubric-Guided Reinforcement Learning for Dense Image Captioning

Authors: Tzu-Heng Huang, Sirajul Salekin, Javier Movellan, Frederic Sala, Manjot Bilkhu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.09160
Pdf link: https://arxiv.org/pdf/2603.09160
Abstract Dense image captioning is critical for cross-modal alignment in vision-language pretraining and text-to-image generation, but scaling expert-quality annotations is prohibitively expensive. While synthetic captioning via strong vision-language models (VLMs) is a practical alternative, supervised distillation often yields limited output diversity and weak generalization. Reinforcement learning (RL) could overcome these limitations, but its successes have so far been concentrated in verifiable domains that rely on deterministic checkers -- a luxury not available in open-ended captioning. We address this bottleneck with RubiCap, a novel RL framework that derives fine-grained, sample-specific reward signals from LLM-written rubrics. RubiCap first assembles a diverse committee of candidate captions, then employs an LLM rubric writer to extract consensus strengths and diagnose deficiencies in the current policy. These insights are converted into explicit evaluation criteria, enabling an LLM judge to decompose holistic quality assessment and replace coarse scalar rewards with structured, multi-faceted evaluations. Across extensive benchmarks, RubiCap achieves the highest win rates on CapArena, outperforming supervised distillation, prior RL methods, human-expert annotations, and GPT-4V-augmented outputs. On CaptionQA, it demonstrates superior word efficiency: our 7B model matches Qwen2.5-VL-32B-Instruct, and our 3B model surpasses its 7B counterpart. Remarkably, using the compact RubiCap-3B as a captioner produces stronger pretrained VLMs than those trained on captions from proprietary models.

Evaluate-as-Action: Self-Evaluated Process Rewards for Retrieval-Augmented Agents

Authors: Jiangming Shu, Yuxiang Zhang, Ye Ma, Xueyuan Lin, Jitao Sang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.09203
Pdf link: https://arxiv.org/pdf/2603.09203
Abstract Retrieval-augmented agents can query external evidence, yet their reliability in multi-step reasoning remains limited: noisy retrieval may derail multi-hop question answering, while outcome-only reinforcement learning provides credit signals that are too coarse to optimize intermediate steps. We propose EvalAct (Evaluate-as-Action), which converts implicit retrieval quality assessment into an explicit action and enforces a coupled Search-to-Evaluate protocol so that each retrieval is immediately followed by a structured evaluation score, yielding process signals aligned with the interaction trajectory. To leverage these signals, we introduce Process-Calibrated Advantage Rescaling (PCAR), a GRPO-based optimization method that rescales advantages at the segment level according to evaluation scores, emphasizing reliable segments while updating uncertain ones conservatively. Experiments on seven open-domain QA benchmarks show that EvalAct achieves the best average accuracy, with the largest gains on multi-hop tasks, and ablations verify that the explicit evaluation loop drives the primary improvements while PCAR provides consistent additional benefits.
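
The idea of rescaling a group-normalized advantage per segment can be sketched as follows. `group_advantages` follows the usual GRPO-style standardization within a sampled group. `rescale_by_segment` is purely hypothetical: the abstract does not give PCAR's actual formula, so multiplying by a score in [0, 1] is only one plausible reading of "emphasizing reliable segments".

```python
import statistics


def group_advantages(rewards):
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0 for _ in rewards]  # identical rewards carry no signal
    return [(r - mean) / std for r in rewards]


def rescale_by_segment(advantage, segment_scores):
    """Hypothetical rule: weight a trajectory's advantage by each segment's
    evaluation score in [0, 1], so low-scored segments update less."""
    return [advantage * s for s in segment_scores]
```

Under this reading, a segment whose retrieval was self-evaluated at 0.2 would receive a fifth of the trajectory-level advantage, while a segment scored 0.9 would receive most of it.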

Strategically Robust Multi-Agent Reinforcement Learning with Linear Function Approximation

Authors: Jake Gonzales, Max Horwitz, Eric Mazumdar, Lillian J. Ratliff
Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2603.09208
Pdf link: https://arxiv.org/pdf/2603.09208
Abstract Provably efficient and robust equilibrium computation in general-sum Markov games remains a core challenge in multi-agent reinforcement learning. Nash equilibrium is computationally intractable in general and brittle due to equilibrium multiplicity and sensitivity to approximation error. We study Risk-Sensitive Quantal Response Equilibrium (RQRE), which yields a unique, smooth solution under bounded rationality and risk sensitivity. We propose RQRE-OVI, an optimistic value iteration algorithm for computing RQRE with linear function approximation in large or continuous state spaces. Through finite-sample regret analysis, we establish convergence and explicitly characterize how sample complexity scales with rationality and risk-sensitivity parameters. The regret bounds reveal a quantitative tradeoff: increasing rationality tightens regret, while risk sensitivity induces regularization that enhances stability and robustness. This exposes a Pareto frontier between expected performance and robustness, with Nash recovered in the limit of perfect rationality and risk neutrality. We further show that the RQRE policy map is Lipschitz continuous in estimated payoffs, unlike Nash, and RQRE admits a distributionally robust optimization interpretation. Empirically, we demonstrate that RQRE-OVI achieves competitive performance under self-play while producing substantially more robust behavior under cross-play compared to Nash-based approaches. These results suggest RQRE-OVI offers a principled, scalable, and tunable path for equilibrium learning with improved robustness and generalization.
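
The bounded-rationality part of a quantal response equilibrium rests on the logit response, in which an action's probability grows smoothly with its payoff. The sketch below shows that standard logit rule only; the risk-sensitive component of RQRE and the optimistic value iteration are not reproduced, and the parameter name `rationality` is an illustrative choice.

```python
import math


def quantal_response(payoffs, rationality):
    """Logit response: P(a) proportional to exp(rationality * payoff[a])."""
    scaled = [rationality * u for u in payoffs]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

With `rationality = 0` the response is uniform, and as it grows the response concentrates on the best action. That is consistent with the abstract's remark that Nash is recovered in the limit of perfect rationality.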

Embodied Human Simulation for Quantitative Design and Analysis of Interactive Robotics

Authors: Chenhui Zuo, Jinhao Xu, Michael Qian Vergnolle, Yanan Sui
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.09218
Pdf link: https://arxiv.org/pdf/2603.09218
Abstract Physical interactive robotics, ranging from wearable devices to collaborative humanoid robots, require close coordination between mechanical design and control. However, evaluating interactive dynamics is challenging due to complex human biomechanics and motor responses. Traditional experiments rely on indirect metrics without measuring human internal states, such as muscle forces or joint loads. To address this issue, we develop a scalable simulation-based framework for the quantitative analysis of physical human-robot interaction. At its core is a full-body musculoskeletal model serving as a predictive surrogate for the human dynamical system. Driven by a reinforcement learning controller, it generates adaptive, physiologically grounded motor behaviors. We employ a sequential training pipeline where the pre-trained human motion control policy acts as a consistent evaluator, making large-scale design space exploration computationally tractable. By simulating the coupled human-robot system, the framework provides access to internal biomechanical metrics, offering a systematic way to concurrently co-optimize a robot's structural parameters and control policy. We demonstrate its capability in optimizing human-exoskeleton interactions, showing improved joint alignment and reduced contact forces. This work establishes embodied human simulation as a scalable paradigm for interactive robotics design.
中文摘要 从可穿戴设备到协作类人机器人的物理交互机器人，需要机械设计与控制之间的紧密协调。然而，由于复杂的人体生物力学和运动反应，评估交互动力学具有挑战性。传统实验依赖间接指标，而不测量人体内部状态，如肌肉力量或关节负荷。为解决这一问题，我们开发了一个可扩展的基于仿真的物理人机交互定量分析框架。其核心是一个全身肌肉骨骼模型，作为人体动力系统的预测替代。它由强化学习控制器驱动，生成适应性、生理基准的运动行为。我们采用顺序训练流程，预训练的人体运动控制策略作为一致的评估者，使大规模设计空间探索在计算上变得可行。通过模拟耦合人机系统，该框架提供了内部生物力学指标的访问，从而系统化地同时优化机器人的结构参数和控制策略。我们展示了其优化人-外骨骼相互作用的能力，展示了关节排列的改善和接触力的减少。这项工作确立了具身人类模拟作为交互式机器人设计可扩展范式的范式。

Beyond Test-Time Training: Learning to Reason via Hardware-Efficient Optimal Control

超越测试时训练：通过硬件高效的最优控制学习推理

Authors: Peihao Wang, Shan Yang, Xijun Wang, Tesi Xiao, Xin Liu, Changlong Yu, Yu Lou, Pan Li, Zhangyang Wang, Ming Lin, René Vidal
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.09221
Pdf link: https://arxiv.org/pdf/2603.09221
Abstract Associative memory has long underpinned the design of sequential models. Beyond recall, humans reason by projecting future states and selecting goal-directed actions, a capability that modern language models increasingly require but do not natively encode. While prior work uses reinforcement learning or test-time training, planning remains external to the model architecture. We formulate reasoning as optimal control and introduce the Test-Time Control (TTC) layer, which performs finite-horizon LQR planning over latent states at inference time, represents a value function within neural architectures, and leverages it as the nested objective to enable planning before prediction. To ensure scalability, we derive a hardware-efficient LQR solver based on a symplectic formulation and implement it as a fused CUDA kernel, enabling parallel execution with minimal overhead. Integrated as an adapter into pretrained LLMs, TTC layers improve mathematical reasoning performance by up to +27.8% on MATH-500 and deliver 2-3x Pass@8 improvements on AMC and AIME, demonstrating that embedding optimal control as an architectural component provides an effective and scalable mechanism for reasoning beyond test-time training.
中文摘要 联想记忆长期以来一直是序列模型设计的基础。除了回忆之外，人类通过预测未来状态和选择目标导向的行动来推理，这种能力是现代语言模型越来越需要但尚未原生编码的。虽然之前的工作使用强化学习或测试时训练，但规划仍超出模型架构的范畴。我们将推理表述为最优控制，并引入测试时间控制层（TTC），该层在推断时间对潜在状态执行有限视野LQR规划，代表神经架构中的价值函数，并将其作为嵌套目标，实现预测前的规划。为确保可扩展性，我们基于辛表述推导出一个硬件高效的LQR求解器，并将其实现为融合CUDA内核，实现并行执行且开销最小。TTC层作为预训练LLM的适配器集成，在MATH-500上提升数学推理性能高达+27.8%，在AMC和AIME上提升2-3倍Pass@8，证明将最优控制嵌入架构组件为超越测试时训练的推理提供了有效且可扩展的机制。
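
The finite-horizon LQR planning mentioned in the abstract can be illustrated with the textbook backward Riccati recursion. The sketch below is a minimal scalar version in plain Python; it is not the paper's TTC layer or its symplectic CUDA solver, and the function names, the scalar system, and all parameter values are assumptions chosen only for illustration.

```python
def lqr_backward(a, b, q, r, q_final, horizon):
    """Backward Riccati recursion for the scalar system x[t+1] = a*x[t] + b*u[t].

    Cost: sum over t of (q*x[t]^2 + r*u[t]^2), plus q_final*x[T]^2 at the end.
    Returns time-ordered feedback gains k[0..T-1] for the control u[t] = -k[t]*x[t].
    """
    p = q_final                      # cost-to-go weight at the final step
    gains = []
    for _ in range(horizon):         # iterate from t = T-1 down to t = 0
        k = (b * p * a) / (r + b * p * b)
        p = q + a * p * a - a * p * b * k
        gains.append(k)
    gains.reverse()                  # gains were collected backward in time
    return gains


def rollout(a, b, gains, x0):
    """Apply the time-varying feedback gains and return the state trajectory."""
    xs = [x0]
    for k in gains:
        u = -k * xs[-1]
        xs.append(a * xs[-1] + b * u)
    return xs


# Hypothetical unstable system (a > 1) that the planned controls stabilize.
gains = lqr_backward(a=1.2, b=1.0, q=1.0, r=0.1, q_final=1.0, horizon=20)
trajectory = rollout(1.2, 1.0, gains, x0=5.0)
```

The value function here is the quadratic `p * x**2`; a matrix-valued version would replace the scalar division with a linear solve.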

MO-Playground: Massively Parallelized Multi-Objective Reinforcement Learning for Robotics

MO-Playground：机器人学的大规模并行多目标强化学习

Authors: Neil Janwani, Ellen Novoseller, Vernon J. Lawhern, Maegan Tucker
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.09237
Pdf link: https://arxiv.org/pdf/2603.09237
Abstract Multi-objective reinforcement learning (MORL) is a powerful tool to learn Pareto-optimal policy families across conflicting objectives. However, unlike traditional RL algorithms, existing MORL algorithms do not effectively leverage large-scale parallelization to concurrently simulate thousands of environments, resulting in vastly increased computation time. Ultimately, this has limited MORL's application towards complex multi-objective robotics problems. To address these challenges, we present 1) MORLAX, a new GPU-native, fast MORL algorithm, and 2) MO-Playground, a pip-installable playground of GPU-accelerated multi-objective environments. Together, MORLAX and MO-Playground approximate Pareto sets within minutes, offering 25-270x speed-ups compared to legacy CPU-based approaches whilst achieving superior Pareto front hypervolumes. We demonstrate the versatility of our approach by implementing a custom BRUCE humanoid robot environment using MO-Playground and learning Pareto-optimal locomotion policies across 6 realistic objectives for BRUCE, such as smoothness, efficiency and arm swinging.
中文摘要 多目标强化学习（MORL）是一种强大的工具，用于在相互冲突的目标之间学习帕累托最优策略族。然而，与传统强化学习算法不同，现有的MORL算法并不能有效利用大规模并行化来同时模拟数千个环境，导致计算时间大幅增加。最终，这限制了MORL在复杂多目标机器人问题中的应用。为应对这些挑战，我们提出了1）MORLAX，一种新的GPU原生快速MORL算法，以及2）MO-Playground，一个可通过 pip 安装的GPU加速多目标环境集合。MORLAX和MO-Playground结合使用，可在几分钟内近似帕累托集，相比传统基于CPU的方法提速25-270倍，同时取得更优的帕累托前沿超体积。我们使用 MO-Playground 实现了定制的 BRUCE 人形机器人环境，并针对 BRUCE 的六个现实目标（如平滑度、效率和手臂摆动）学习帕累托最优的运动策略，以此展示我们方法的通用性。
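
For readers unfamiliar with the Pareto-front and hypervolume terms used above, the following is a minimal sketch of both for two maximized objectives. It does not use or describe the MORLAX or MO-Playground APIs; the function names, the sample returns, and the reference point are hypothetical.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Keep only the points that no other point dominates (maximization)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def hypervolume_2d(points, reference=(0.0, 0.0)):
    """Area dominated by the points and bounded below by the reference point (2 objectives)."""
    ref_x, ref_y = reference
    area = 0.0
    best_y = ref_y
    # Sweep from the largest first objective to the smallest.
    for x, y in sorted(points, reverse=True):
        if y > best_y:
            area += (x - ref_x) * (y - best_y)
            best_y = y
    return area


# Hypothetical per-policy returns on two objectives.
returns = [(1, 5), (2, 4), (3, 3), (2, 2), (1, 1)]
front = pareto_front(returns)
volume = hypervolume_2d(front)
```

A larger hypervolume means the front covers more of the objective space relative to the chosen reference point, which is why it is a common scalar summary of a policy family.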

Social-R1: Towards Human-like Social Reasoning in LLMs

Social-R1：迈向LLM中的类人社会推理

Authors: Jincenzi Wu, Yuxuan Lei, Jianxun Lian, Yitian Huang, Lexin Zhou, Haotian Li, Xing Xie, Helen Meng
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.09249
Pdf link: https://arxiv.org/pdf/2603.09249
Abstract While large language models demonstrate remarkable capabilities across numerous domains, social intelligence - the capacity to perceive social cues, infer mental states, and generate appropriate responses - remains a critical challenge, particularly for enabling effective human-AI collaboration and developing AI that truly serves human needs. Current models often rely on superficial patterns rather than genuine social reasoning. We argue that cultivating human-like social intelligence requires training with challenging cases that resist shortcut solutions. To this end, we introduce ToMBench-Hard, an adversarial benchmark designed to provide hard training examples for social reasoning. Building on this, we propose Social-R1, a reinforcement learning framework that aligns model reasoning with human cognition through multi-dimensional rewards. Unlike outcome-based RL, Social-R1 supervises the entire reasoning process, enforcing structural alignment, logical integrity, and information density. Results show that our approach enables a 4B parameter model to surpass much larger counterparts and generalize robustly across eight diverse benchmarks. These findings demonstrate that challenging training cases with trajectory-level alignment offer a path toward efficient and reliable social intelligence.
中文摘要 尽管大型语言模型在多个领域展现出卓越能力，但社会智能——感知社会线索、推断心理状态并生成适当反应的能力——仍是一个关键挑战，尤其是在实现有效人机协作和开发真正满足人类需求的人工智能方面。现有模型往往依赖于表面模式，而非真正的社会推理。我们认为，培养类人社会智慧需要接受具有挑战性的案例训练，这些案例难以通过捷径解决。为此，我们引入了 ToMBench-Hard，一个对抗性基准测试，旨在为社会推理提供硬训练示例。基于此，我们提出了Social-R1，一种强化学习框架，通过多维奖励将模型推理与人类认知对齐。与基于结果的强化学习不同，Social-R1监督整个推理过程，强制执行结构对齐、逻辑完整性和信息密度。结果显示，我们的方法使4B参数模型能够超越更大规模的模型，并在八个多样化基准测试中实现强有力的推广。这些发现表明，具有轨迹层次对齐的挑战性训练案例为通向高效且可靠的社会智能提供了途径。

OddGridBench: Exposing the Lack of Fine-Grained Visual Discrepancy Sensitivity in Multimodal Large Language Models

OddGridBench：揭示多模态大型语言模型中缺乏细粒度视觉差异敏感性的问题

Authors: Tengjin Weng, Wenhao Jiang, Jingyi Wang, Ming Li, Lin Ma, Zhong Ming
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.09326
Pdf link: https://arxiv.org/pdf/2603.09326
Abstract Multimodal large language models (MLLMs) have achieved remarkable performance across a wide range of vision language tasks. However, their ability in low-level visual perception, particularly in detecting fine-grained visual discrepancies, remains underexplored and lacks systematic analysis. In this work, we introduce OddGridBench, a controllable benchmark for evaluating the visual discrepancy sensitivity of MLLMs. OddGridBench comprises over 1,400 grid-based images, where a single element differs from all others by one or multiple visual attributes such as color, size, rotation, or position. Experiments reveal that all evaluated MLLMs, including open-source families such as Qwen3-VL and InternVL3.5, and proprietary systems like Gemini-2.5-Pro and GPT-5, perform far below human levels in visual discrepancy detection. We further propose OddGrid-GRPO, a reinforcement learning framework that integrates curriculum learning and distance-aware reward. By progressively controlling the difficulty of training samples and incorporating spatial proximity constraints into the reward design, OddGrid-GRPO significantly enhances the model's fine-grained visual discrimination ability. We hope OddGridBench and OddGrid-GRPO will lay the groundwork for advancing perceptual grounding and visual discrepancy sensitivity in multimodal intelligence. Code and dataset are available at this https URL.
中文摘要 多模态大型语言模型（MLLMs）在广泛的视觉语言任务中取得了卓越的表现。然而，它们在低层次视觉感知，尤其是细微视觉差异检测方面的能力，仍然缺乏充分探索，缺乏系统分析。在本研究中，我们介绍了OddGridBench，这是一个用于评估MLLM视觉差异敏感性的可控基准测试。OddGridBench包含1400多张基于网格的图像，其中单个元素在颜色、大小、旋转或位置等视觉属性上与其他所有元素有所不同。实验显示，所有评估的MLLMs，包括开源家族如Qwen3-VL和InternVL3.5，以及专有系统如Gemini-2.5-Pro和GPT-5，在视觉差异检测方面都远低于人类水平。我们还提出了OddGrid-GRPO，这是一种集成课程学习与远程感知奖励的强化学习框架。通过逐步控制训练样本的难度并将空间接近约束纳入奖励设计，OddGrid-GRPO显著提升了模型的细粒度视觉辨别能力。我们希望OddGridBench和OddGrid-GRPO能为多模态智能中推进感知基础和视觉差异敏感性奠定基础。代码和数据集可在此 https URL 获取。

Reward-Zero: Language Embedding Driven Implicit Reward Mechanisms for Reinforcement Learning

Reward-Zero：用于强化学习的语言嵌入驱动隐式奖励机制

Authors: Heng Zhang, Haddy Alchaer, Arash Ajoudani, Yu She
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.09331
Pdf link: https://arxiv.org/pdf/2603.09331
Abstract We introduce Reward-Zero, a general-purpose implicit reward mechanism that transforms natural-language task descriptions into dense, semantically grounded progress signals for reinforcement learning (RL). Reward-Zero serves as a simple yet sophisticated universal reward function that leverages language embeddings for efficient RL training. By comparing the embedding of a task specification with embeddings derived from an agent's interaction experience, Reward-Zero produces a continuous, semantically aligned sense-of-completion signal. This reward supplements sparse or delayed environmental feedback without requiring task-specific engineering. When integrated into standard RL frameworks, it accelerates exploration, stabilizes training, and enhances generalization across diverse tasks. Empirically, agents trained with Reward-Zero converge faster and achieve higher final success rates than conventional methods such as PPO with common reward-shaping baselines, and in some complex tasks they succeed where hand-designed rewards fail. In addition, we develop a mini benchmark for evaluating the sense of completion during task execution via language embeddings. These results highlight the promise of language-driven implicit reward functions as a practical path toward more sample-efficient, generalizable, and scalable RL for embodied agents. Code will be released after peer review.
中文摘要 我们介绍了Reward-Zero，一种通用隐性奖励机制，将自然语言任务描述转化为密集、语义基础的强化学习进展信号。Reward-Zero 是一个简单而复杂的通用奖励函数，利用语言嵌入实现高效的强化学习训练。通过比较任务规格的嵌入与源自代理交互体验的嵌入，Reward-Zero 产生了一个连续且语义对齐的完成感信号。这种奖励补充了稀疏或延迟的环境反馈，无需针对特定任务的工程设计。当它集成到标准强化学习框架中时，它加速了探索，稳定了训练，并增强了跨越多样任务的泛化能力。从经验上看，采用Reward-Zero训练的智能体收敛速度更快，最终成功率也高于拥有共同奖励塑造基线的PPO等传统方法，能够成功解决一些手工设计的奖励无法完成的任务。此外，我们还开发了一个通过语言嵌入评估任务执行中完成感的迷你基准。这些结果凸显了语言驱动的隐性奖励函数作为实现更具样本效率、可推广性和可扩展的具身主体强化学习的实用路径的前景。代码将在同行评审后发布。
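
One plausible reading of "comparing the embedding of a task specification with embeddings derived from an agent's interaction experience" is a similarity score added to the environment reward. The sketch below assumes cosine similarity and an additive weight; the abstract does not state the actual comparison function, so both choices, along with the function names and toy vectors, are assumptions for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def shaped_reward(env_reward, task_embedding, experience_embedding, weight=0.1):
    """Sparse environment reward plus an assumed embedding-similarity progress bonus."""
    progress = cosine_similarity(task_embedding, experience_embedding)
    return env_reward + weight * progress


# Toy 2-D vectors standing in for language-model embeddings (hypothetical values).
task = [1.0, 0.0]
aligned_experience = [1.0, 0.0]
unrelated_experience = [0.0, 1.0]
bonus_when_aligned = shaped_reward(0.0, task, aligned_experience, weight=0.5)
bonus_when_unrelated = shaped_reward(0.0, task, unrelated_experience, weight=0.5)
```

In practice the two embeddings would come from a language encoder applied to the task text and to a textual description of the agent's recent experience; that encoder is outside this sketch.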

Robust Regularized Policy Iteration under Transition Uncertainty

转移不确定性下的鲁棒正则化策略迭代

Authors: Hongqiang Lin, Zhenghui Fu, Weihao Tang, Pengfei Wang, Yiding Sun, Qixian Huang, Dongxu Zhang
Subjects: Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2603.09344
Pdf link: https://arxiv.org/pdf/2603.09344
Abstract Offline reinforcement learning (RL) enables data-efficient and safe policy learning without online exploration, but its performance often degrades under distribution shift. The learned policy may visit out-of-distribution state-action pairs where value estimates and learned dynamics are unreliable. To address policy-induced extrapolation and transition uncertainty in a unified framework, we formulate offline RL as robust policy optimization, treating the transition kernel as a decision variable within an uncertainty set and optimizing the policy against the worst-case dynamics. We propose Robust Regularized Policy Iteration (RRPI), which replaces the intractable max-min bilevel objective with a tractable KL-regularized surrogate and derives an efficient policy iteration procedure based on a robust regularized Bellman operator. We provide theoretical guarantees by showing that the proposed operator is a $\gamma$-contraction and that iteratively updating the surrogate yields monotonic improvement of the original robust objective with convergence. Experiments on D4RL benchmarks demonstrate that RRPI achieves strong average performance, outperforming recent baselines including percentile-based methods such as PMDB on the majority of environments while remaining competitive on the rest. Moreover, RRPI exhibits robust behavior. The learned $Q$-values decrease in regions with higher epistemic uncertainty, suggesting that the resulting policy avoids unreliable out-of-distribution actions under transition uncertainty.
中文摘要 离线强化学习（RL）实现数据高效且安全的策略学习，无需在线探索，但其性能在分布转移时常常下降。学习策略可能访问分布外的状态-动作对，这些对的价值估计和学习动态不可靠。为了在统一框架下解决策略引发的外推和过渡不确定性，我们将离线强化学习表述为稳健策略优化，将过渡核视为不确定性集合中的决策变量，并针对最坏情况动态优化策略。我们提出了强健正则化策略迭代（RRPI），它用可解的KL正则化代理替代了难以处理的最大最小双层目标，并基于稳健正则化贝尔曼算符推导出高效的策略迭代过程。我们通过证明所提算符是$\gamma$收缩，并且迭代更新代理算符能单调地改进原始稳健目标并收敛来提供理论保证。D4RL基准测试的实验表明，RRPI在大多数环境中实现了强劲的平均性能，优于包括基于百分位的方法（如PMDB）在内的近期基线，同时在其他环境中保持竞争力。此外，RRPI表现出强健的行为。在认识不确定性较高的区域，习得的$Q$值会下降，表明该政策在过渡不确定性下避免了不可靠的分布外行为。

SPAARS: Safer RL Policy Alignment through Abstract Exploration and Refined Exploitation of Action Space

SPAARS：通过动作空间的抽象探索与精细利用实现更安全的强化学习策略对齐

Authors: Swaminathan S K, Aritra Hazra
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.09378
Pdf link: https://arxiv.org/pdf/2603.09378
Abstract Offline-to-online reinforcement learning (RL) offers a promising paradigm for robotics by pre-training policies on safe, offline demonstrations and fine-tuning them via online interaction. However, a fundamental challenge remains: how to safely explore online without deviating from the behavioral support of the offline data? While recent methods leverage conditional variational autoencoders (CVAEs) to bound exploration within a latent space, they inherently suffer from an exploitation gap -- a performance ceiling imposed by the decoder's reconstruction loss. We introduce SPAARS, a curriculum learning framework that initially constrains exploration to the low-dimensional latent manifold for sample-efficient, safe behavioral improvement, then seamlessly transfers control to the raw action space, bypassing the decoder bottleneck. SPAARS has two instantiations: the CVAE-based variant requires only unordered (s,a) pairs and no trajectory segmentation; SPAARS-SUPE pairs SPAARS with OPAL temporal skill pretraining for stronger exploration structure at the cost of requiring trajectory chunks. We prove an upper bound on the exploitation gap using the Performance Difference Lemma, establish that latent-space policy gradients achieve provable variance reduction over raw-space exploration, and show that concurrent behavioral cloning during the latent phase directly controls curriculum transition stability. Empirically, SPAARS-SUPE achieves 0.825 normalized return on kitchen-mixed-v0 versus 0.75 for SUPE, with 5x better sample efficiency; standalone SPAARS achieves 92.7 and 102.9 normalized return on hopper-medium-v2 and walker2d-medium-v2 respectively, surpassing IQL baselines of 66.3 and 78.3 respectively, confirming the utility of the unordered-pair CVAE instantiation.
中文摘要 离线到在线强化学习（RL）为机器人学提供了一个有前景的范式，通过预先训练安全离线演示的政策，并通过在线互动进行微调。然而，一个根本性挑战依然存在：如何在不偏离离线下数据行为支持的情况下安全地在线探索？虽然最新方法利用条件变分自编码器（CVAEs）在潜空间内进行界限探索，但它们本质上存在利用缺口——即解码器重建损失带来的性能上限。我们引入了SPAARS，一种课程学习框架，最初将探索限制在低维潜流形中，以实现样本高效、安全的行为改进，随后无缝地将控制权转移到原始动作空间，绕过解码器瓶颈。SPAARS有两种实例：基于CVAE的变体仅要求无序（s，a）对，无需轨迹分割;SPAARS-SUPE 将 SPAARS 与 OPAL 时间技能预训练配对，以增强探索结构，但代价是需要轨迹块。我们利用性能差引理证明了利用差距的上界，确立了潜空间策略梯度在原始空间探索中实现可证明的方差缩小，并证明潜伏期同时进行行为克隆直接控制课程过渡稳定性。从经验上看，SPAARS-SUPE在厨房混合v0上实现了0.825的归一化回报，而SUPE为0.75，采样效率提高了5倍;独立SPAARS分别在Hopper-Medium-v2和Walker2D-Medium-V2上实现92.7和102.9的归一化回报，超过了IQL的66.3和78.3，证实了无序对CVAE实例的实用性。

Impact of Markov Decision Process Design on Sim-to-Real Reinforcement Learning

马尔可夫决策过程设计对模拟到现实强化学习的影响

Authors: Tatjana Krau, Jorge Mandlmaier, Tobias Damm, Frieder Heieck
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.09427
Pdf link: https://arxiv.org/pdf/2603.09427
Abstract Reinforcement Learning (RL) has demonstrated strong potential for industrial process control, yet policies trained in simulation often suffer from a significant sim-to-real gap when deployed on physical hardware. This work systematically analyzes how core Markov Decision Process (MDP) design choices -- state composition, target inclusion, reward formulation, termination criteria, and environment dynamics models -- affect this transfer. Using a color mixing task, we evaluate different MDP configurations and mixing dynamics across simulation and real-world experiments. We validate our findings on physical hardware, demonstrating that physics-based dynamics models achieve up to 50% real-world success under strict precision constraints where simplified models fail entirely. Our results provide practical MDP design guidelines for deploying RL in industrial process control.
中文摘要 强化学习（RL）已展现出工业过程控制的强大潜力，但模拟训练的策略在物理硬件上部署时常存在显著的模拟与现实差距。本研究系统分析了核心马尔可夫决策过程（MDP）设计选择——状态组成、目标包含、奖励表述、终止标准和环境动态模型——如何影响这种转移。利用颜色混合任务，我们评估不同的MDP配置以及模拟与现实实验中的混合动力学。我们在物理硬件上验证了我们的发现，证明基于物理的动力学模型在严格的精度约束下，实际成功率高达50%，而简化模型则完全失败。我们的结果为工业过程控制中部署强化学习提供了实用的MDP设计指南。

SEA-Nav: Efficient Policy Learning for Safe and Agile Quadruped Navigation in Cluttered Environments

SEA-Nav：面向杂乱环境中安全敏捷四足导航的高效策略学习

Authors: Shiyi Chen, Mingye Yang, Haiyan Mao, Jiaqi Zhang, Haiyi Liu, Shuheng He, Debing Zhang, Zihao Qiu, Chun Zhang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.09460
Pdf link: https://arxiv.org/pdf/2603.09460
Abstract Efficiently training quadruped robot navigation in densely cluttered environments remains a significant challenge. Existing methods are either limited by a lack of safety and agility in simple obstacle distributions or suffer from slow locomotion in complex environments, often requiring excessively long training phases. To this end, we propose SEA-Nav (Safe, Efficient, and Agile Navigation), a reinforcement learning framework for quadruped navigation. Within diverse and dense obstacle environments, a differentiable control barrier function (CBF)-based shield constrains the navigation policy to output safe velocity commands. An adaptive collision replay mechanism and hazardous exploration rewards are introduced to increase the probability of learning from critical experiences, guiding efficient exploration and exploitation. Finally, kinematic action constraints are incorporated to ensure safe velocity commands, facilitating successful physical deployment. To the best of our knowledge, this is the first approach that achieves highly challenging quadruped navigation in the real world with minute-level training time.
中文摘要 在密集拥挤环境中高效训练四足机器人导航仍是一项重大挑战。现有方法要么受限于简单障碍物分布的安全性和灵活性不足，要么在复杂环境中移动缓慢，常常需要过长的训练阶段。为此，我们提出了SEA-Nav（安全、高效、敏捷导航）——一种用于四足导航的强化学习框架。在多样且密集的障碍环境中，基于可微控制障碍函数（CBF）的屏障约束导航策略，以输出安全速度指令。引入了自适应碰撞重放机制和危险探索奖励，以提高从关键体验中学习的概率，指导高效的探索和利用。最后，加入了运动学动作约束，以确保速度指令的安全，促进物理部署的成功。据我们所知，这是首个在现实世界中以极具挑战性的四足导航，训练时间仅需分钟级的方案。

MORE-R1: Guiding LVLM for Multimodal Object-Entity Relation Extraction via Stepwise Reasoning with Reinforcement Learning

MORE-R1：通过强化学习的逐步推理引导LVLM进行多模态对象-实体关系抽取

Authors: Xiang Yuan, Xu Chu, Xinrong Chen, Haochen Li, Zonghong Dai, Hongcheng Fan, Xiaoyue Yuan, Weiping Li, Tong Mo
Subjects: Multimedia (cs.MM)
Arxiv link: https://arxiv.org/abs/2603.09478
Pdf link: https://arxiv.org/pdf/2603.09478
Abstract Multimodal Object-Entity Relation Extraction (MORE) is a challenging task in information extraction research. It aims to identify relations between visual objects and textual entities, requiring complex multimodal understanding and cross-modal reasoning abilities. Existing methods, mainly classification-based or generation-based without reasoning, struggle to handle complex extraction scenarios in the MORE task and suffer from limited scalability and limited transparency of intermediate reasoning. To address these challenges, we propose MORE-R1, a novel model that introduces explicit stepwise reasoning with Reinforcement Learning (RL) to enable a Large Vision-Language Model (LVLM) to address the MORE task effectively. MORE-R1 integrates a two-stage training process, including an initial cold-start training stage with Supervised Fine-Tuning (SFT) and a subsequent RL stage for reasoning ability optimization. In the initial stage, we design an efficient way to automatically construct a high-quality SFT dataset containing fine-grained stepwise reasoning tailored to the MORE task, enabling the model to learn an effective reasoning paradigm. In the subsequent stage, we employ the Group Relative Policy Optimization (GRPO) RL algorithm with a Progressive Sample-Mixing Strategy to stabilize training and further enhance the model's reasoning ability on hard samples. Comprehensive experiments on the MORE benchmark demonstrate that MORE-R1 achieves state-of-the-art performance with significant improvement over baselines.
中文摘要 多模态对象-实体关系提取（MORE）是信息提取研究中的一项具有挑战性的任务。它旨在识别视觉对象与文本实体之间的关系，这需要复杂的多模态理解和跨模态推理能力。现有方法，主要是基于分类或无推理的基于生成的方法，难以处理MORE任务中的复杂提取场景，且存在有限的可扩展性和中间推理透明度。为应对这些挑战，我们提出了MORE-R1这一新模型，通过强化学习（RL）引入显性逐步推理，使大型视觉语言模型（LVLM）能够有效应对MORE任务。MORE-R1整合了两阶段训练过程，包括带监督微调（SFT）的初始冷启动训练阶段，以及随后用于推理能力优化的强化学习阶段。在初期阶段，我们设计了一种高效的方法，自动构建包含细粒度分级推理的高质量SFT数据集，针对MORE任务量身定制，使模型能够学习有效的推理范式。在接下来的阶段，我们采用了带有渐进样本混合策略的群体相对策略优化（GRPO）强化学习算法，以稳定训练并进一步提升模型在硬样本上的推理能力。MORE基准的综合实验表明，MORE-R1实现了最先进的性能，且比基线有显著提升。
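
Several entries in this digest build on Group Relative Policy Optimization (GRPO). Its core step, as commonly described, scores each sampled response against the other responses drawn for the same prompt. The sketch below shows only that normalization; it is not MORE-R1's training code, and the function name, the epsilon value, and the use of the population standard deviation are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its own group.

    `rewards` holds the scalar rewards of all responses sampled for one prompt.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)   # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of four sampled responses: two correct (1.0), two wrong (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline is the group mean, no separate value network is needed, and a group whose responses all receive the same reward yields zero advantage for every response.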

Towards Unified Multimodal Interleaved Generation via Group Relative Policy Optimization

通过群相对策略优化实现统一多模态交错生成

Authors: Ming Nie, Chunwei Wang, Jianhua Han, Hang Xu, Li Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.09538
Pdf link: https://arxiv.org/pdf/2603.09538
Abstract Unified vision-language models have made significant progress in multimodal understanding and generation, yet they largely fall short in producing multimodal interleaved outputs, which is a crucial capability for tasks like visual storytelling and step-by-step visual reasoning. In this work, we propose a reinforcement learning-based post-training strategy to unlock this capability in existing unified models, without relying on large-scale multimodal interleaved datasets. We begin with a warm-up stage using a hybrid dataset comprising curated interleaved sequences and limited data for multimodal understanding and text-to-image generation, which exposes the model to interleaved generation patterns while preserving its pretrained capabilities. To further refine interleaved generation, we propose a unified policy optimization framework that extends Group Relative Policy Optimization (GRPO) to the multimodal setting. Our approach jointly models text and image generation within a single decoding trajectory and optimizes it with our novel hybrid rewards covering textual relevance, visual-text alignment, and structural fidelity. Additionally, we incorporate process-level rewards to provide step-wise guidance, enhancing training efficiency in complex multimodal tasks. Experiments on MMIE and InterleavedBench demonstrate that our approach significantly enhances the quality and coherence of multimodal interleaved generation.
中文摘要 统一的视觉语言模型在多模态理解和生成方面取得了显著进展，但在产生多模态交错输出方面大多不足，而这对于视觉叙事和逐步视觉推理等任务至关重要。本研究提出一种基于强化学习的训练后策略，以在现有统一模型中释放该能力，而无需依赖大规模多模态交错数据集。我们从一个热身阶段开始，使用由精心策划的交错序列和有限数据组成的混合数据集，用于多模态理解和文本到图像生成，这使模型暴露于交错生成模式的同时保持其预训练能力。为了进一步完善交错生成，我们提出了一个统一的策略优化框架，将群相对策略优化（GRPO）扩展到多模态环境。我们的方法结合在单一解码轨迹中建模文本和图像生成，并结合涵盖文本相关性、视觉-文本对齐和结构忠实度的创新混合奖励进行优化。此外，我们还加入了流程级奖励，提供分阶段指导，提升复杂多模态任务的培训效率。MMIE和InterleavedBench的实验表明，我们的方法显著提升了多模交织生成的质量和连贯性。

NS-VLA: Towards Neuro-Symbolic Vision-Language-Action Models

NS-VLA：迈向神经符号视觉-语言-行动模型

Authors: Ziyue Zhu, Shangyang Wu, Shuai Zhao, Zhiqiu Zhao, Shengjie Li, Yi Wang, Fang Li, Haoran Luo
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.09542
Pdf link: https://arxiv.org/pdf/2603.09542
Abstract Vision-Language-Action (VLA) models are formulated to ground instructions in visual context and generate action sequences for robotic manipulation. Despite recent progress, VLA models still face challenges in learning related and reusable primitives, reducing reliance on large-scale data and complex architectures, and enabling exploration beyond demonstrations. To address these challenges, we propose a novel Neuro-Symbolic Vision-Language-Action (NS-VLA) framework via online reinforcement learning (RL). It introduces a symbolic encoder to embed vision and language features and extract structured primitives, utilizes a symbolic solver for data-efficient action sequencing, and leverages online RL to optimize generation via expansive exploration. Experiments on robotic manipulation benchmarks demonstrate that NS-VLA outperforms previous methods in both one-shot training and data-perturbed settings, while simultaneously exhibiting superior zero-shot generalizability, high data efficiency and expanded exploration space. Our code is available.
中文摘要 视觉-语言-行动（VLA）模型旨在将指令置于视觉语境中，并生成用于机器人操作的动作序列。尽管近期取得了进展，VLA模型在学习相关且可复用的原语、减少对大规模数据和复杂架构的依赖以及实现超越演示的探索方面仍面临挑战。为应对这些挑战，我们提出了一种通过在线强化学习（RL）实现的新型神经符号视觉-语言-行动（NS-VLA）框架。它引入符号编码器，用于嵌入视觉和语言特征并提取结构化原语，利用符号求解器实现数据高效的动作序列，并利用在线强化学习通过广泛的探索优化生成。在机器人操作基准上的实验表明，NS-VLA在单样本训练和数据扰动环境下均优于以往方法，同时展现出卓越的零样本泛化能力、高数据效率和扩展的探索空间。我们的代码已开放。

GeoSolver: Scaling Test-Time Reasoning in Remote Sensing with Fine-Grained Process Supervision

GeoSolver：在细粒度过程监督下进行遥感测试时间推理的尺度化

Authors: Lang Sun, Ronghao Fu, Zhuoran Duan, Haoran Liu, Xueyan Liu, Bo Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.09551
Pdf link: https://arxiv.org/pdf/2603.09551
Abstract While Vision-Language Models (VLMs) have significantly advanced remote sensing interpretation, enabling them to perform complex, step-by-step reasoning remains highly challenging. Recent efforts to introduce Chain-of-Thought (CoT) reasoning to this domain have shown promise, yet ensuring the visual faithfulness of these intermediate steps remains a critical bottleneck. To address this, we introduce GeoSolver, a novel framework that transitions remote sensing reasoning toward verifiable, process-supervised reinforcement learning. We first construct Geo-PRM-2M, a large-scale, token-level process supervision dataset synthesized via entropy-guided Monte Carlo Tree Search (MCTS) and targeted visual hallucination injection. Building upon this dataset, we train GeoPRM, a token-level process reward model (PRM) that provides granular faithfulness feedback. To effectively leverage these verification signals, we propose Process-Aware Tree-GRPO, a reinforcement learning algorithm that integrates tree-structured exploration with a faithfulness-weighted reward mechanism to precisely assign credit to intermediate steps. Extensive experiments demonstrate that our resulting model, GeoSolver-9B, achieves state-of-the-art performance across diverse remote sensing benchmarks. Crucially, GeoPRM unlocks robust Test-Time Scaling (TTS). Serving as a universal geospatial verifier, it seamlessly scales the performance of GeoSolver-9B and directly enhances general-purpose VLMs, highlighting its remarkable cross-model generalization.
中文摘要 虽然视觉语言模型（VLMs）在遥感解读方面取得了显著进步，使其能够执行复杂且逐步推理的过程依然极具挑战性。近期将思维链（CoT）推理引入该领域的努力展现出潜力，但确保这些中间步骤的视觉忠实性仍是一个关键瓶颈。为此，我们引入了GeoSolver，一个新颖框架，将遥感推理向可验证的过程监督强化学习过渡。我们首先构建了Geo-PRM-2M，这是一个通过熵引导蒙特卡洛树搜索（MCTS）和靶向视觉幻觉注入合成的大规模代币级过程监督数据集。基于该数据集，我们训练GeoPRM，一种代币级流程奖励模型（PRM），提供细致的忠实度反馈。为有效利用这些验证信号，我们提出了过程感知树（Process-Aware Tree-GRPO）强化学习算法，将树结构探索与忠实度加权奖励机制相结合，精确地为中间步骤分配功劳。大量实验表明，我们最终生成的模型GeoSolver-9B在多种遥感基准测试中实现了最先进的性能。关键是，GeoPRM解锁了强大的测试时间缩放（TTS）。作为通用地理空间验证器，它无缝扩展了GeoSolver-9B的性能，并直接增强了通用VLM，突出其卓越的跨模型泛化能力。

ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning

ActiveUltraFeedback：利用主动学习高效生成偏好数据

Authors: Davit Melikidze, Marian Schneider, Jessica Lam, Martin Wertich, Ido Hakimi, Barna Pásztor, Andreas Krause
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.09692
Pdf link: https://arxiv.org/pdf/2603.09692
Abstract Reinforcement Learning from Human Feedback (RLHF) has become the standard for aligning Large Language Models (LLMs), yet its efficacy is bottlenecked by the high cost of acquiring preference data, especially in low-resource and expert domains. To address this, we introduce ACTIVEULTRAFEEDBACK, a modular active learning pipeline that leverages uncertainty estimates to dynamically identify the most informative responses for annotation. Our pipeline facilitates the systematic evaluation of standard response selection methods alongside DOUBLE REVERSE THOMPSON SAMPLING (DRTS) and DELTAUCB, two novel methods prioritizing response pairs with large predicted quality gaps, leveraging recent results showing that such pairs provide good signals for fine-tuning. Our experiments demonstrate that ACTIVEULTRAFEEDBACK yields high-quality datasets that lead to significant improvements in downstream performance, notably achieving comparable or superior results with as little as one-sixth of the annotated data relative to static baselines. Our pipeline is available at this https URL and our preference datasets at this https URL.
中文摘要 来自人类反馈的强化学习（RLHF）已成为大型语言模型（LLMs）对齐的标准，但其效能因获取偏好数据的高昂成本而受限，尤其是在资源匮乏和专家领域。为此，我们引入了ACTIVEULTRAFEEDBACK，一个模块化主动学习流水线，利用不确定性估计动态识别最具信息量的注释响应。我们的流程促进了标准反应选择方法的系统评估，同时结合双重反向汤普森采样（DRTS）和DELTAUCB这两种新颖方法，优先考虑预测质量差距较大的反应对，利用近期结果表明此类对能提供良好信号进行微调。我们的实验表明，ACTIVEULTRAFEEDBACK 能够产生高质量数据集，从而显著提升下游性能，尤其是在标注数据仅占静态基线六分之一的情况下，就能获得相当甚至更优的结果。我们的流水线可通过这个 https URL 访问，偏好数据集则在此 https URL。

GSStream: 3D Gaussian Splatting based Volumetric Scene Streaming System

GSStream：基于3D高斯喷溅的体积场景流系统

Authors: Zhiye Tang, Qiudan Zhang, Lei Zhang, Junhui Hou, You Yang, Xu Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.09718
Pdf link: https://arxiv.org/pdf/2603.09718
Abstract Recently, the 3D Gaussian splatting (3DGS) technique for real-time radiance field rendering has revolutionized the field of volumetric scene representation, providing users with an immersive experience. In return, however, it produces a large volume of data, making it extremely bandwidth-intensive. Recent work has introduced different approaches and constructed multiple 3DGS variants to obtain a more compact scene representation, but real-time distribution remains challenging. In this paper, we propose GSStream, a novel volumetric scene streaming system that supports the 3DGS data format. Specifically, GSStream integrates a collaborative viewport prediction module, which better predicts users' future behaviors by learning collaborative priors and historical priors from multiple users and users' viewport sequences, and a deep reinforcement learning (DRL)-based bitrate adaptation module, which tackles the state and action space variability challenge of the bitrate adaptation problem, achieving efficient volumetric scene delivery. In addition, we first build a user viewport trajectory dataset for volumetric scenes to support the training and streaming simulation. Extensive experiments demonstrate that our proposed GSStream system outperforms existing representative volumetric scene streaming systems in visual quality and network usage. Demo video: this https URL.
中文摘要 最近，3D高斯喷涂（3DGS）技术用于实时辐射场渲染，彻底革新了体积场景表现领域，为用户提供了沉浸式体验。但作为回报，它也带来了大量数据量，极其耗费带宽。前沿研究人员尝试引入不同方法并构建多种3DGS变体以获得更紧凑的场景表示，但实时分布仍具挑战性。本文提出了GSStream，一种支持3DGS数据格式的新型体积场景流系统。具体来说，GSStream集成了协同视口预测模块，通过学习多个用户和用户视口序列的协同先验和历史先验，更好地预测用户的未来行为;同时还集成了基于深度强化学习（DRL）的比特率自适应模块，解决比特率适应问题中的状态和动作空间变异性挑战，实现高效的体积场景交付。此外，我们首先构建了一个用于体积场景的用户视口轨迹数据集，以支持训练和流式模拟。大量实验证明，我们提出的GSStream系统在视觉质量和网络使用率上优于现有的代表性体积场景流系统。演示视频：这个 https URL。

Good Reasoning Makes Good Demonstrations: Implicit Reasoning Quality Supervision via In-Context Reinforcement Learning

良好的推理造就好演示：通过上下文强化学习实现隐性推理的优质监督

Authors: Tiehua Mei, Minxuan Lv, Leiyu Pan, Zhenpeng Su, Hongru Hou, Hengrui Chen, Ao Xu, Deqing Yang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.09803
Pdf link: https://arxiv.org/pdf/2603.09803
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) improves reasoning in large language models but treats all correct solutions equally, potentially reinforcing flawed traces that get correct answers by chance. We observe that better reasoning makes for better teaching: high-quality solutions serve as more effective demonstrations than low-quality ones. We term this teaching ability Demonstration Utility, and show that the policy model's own in-context learning ability provides an efficient way to measure it, yielding a quality signal termed Evidence Gain. To employ this signal during training, we introduce In-Context RLVR. By Bayesian analysis, we show that this objective implicitly reweights rewards by Evidence Gain, assigning higher weights to high-quality traces and lower weights to low-quality ones, without requiring costly computation or external evaluators. Experiments on mathematical benchmarks show improvements in both accuracy and reasoning quality over standard RLVR.
中文摘要 带可验证奖励的强化学习（RLVR）提升了大型语言模型中的推理能力，但对所有正确解法一视同仁，可能强化那些偶然获得正确答案的有缺陷痕迹。我们观察到，更好的推理才是更好的教师：高质量的解决方案比低质量的解决方案更有效地示范。我们将这种教学能力称为示范效用，并展示了该政策模型自身的上下文学习能力为衡量它提供了一种高效的方法，从而产生一种称为证据获得的高质量信号。为了在培训中使用该信号，我们引入了上下文RLVR。通过贝叶斯分析，我们表明该目标隐含地重新加权了证据增益的奖励，给高质量的痕迹赋予更高权重，低质量的痕迹赋予较低权重，而无需昂贵的计算或外部评估者。数学基准测试的实验显示，准确性和推理质量均优于标准RLVR。
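
To make the phrase "reweights rewards by Evidence Gain" concrete, here is a minimal sketch of what an explicit reweighting could look like. It is not the paper's method: the abstract says In-Context RLVR performs the reweighting implicitly through its objective and does not give a formula. The function name `reweight_rewards`, the softmax weighting, and the `temperature` parameter are all assumptions made for illustration, and the per-trace quality scores are simply taken as given inputs.

```python
import math
from typing import List


def reweight_rewards(rewards: List[float],
                     gains: List[float],
                     temperature: float = 1.0) -> List[float]:
    """Scale each positive reward by a softmax weight over quality scores.

    ASSUMPTION: `gains` holds one hypothetical quality score per trace
    (standing in for Evidence Gain, whose definition is not in the abstract).
    Weights are normalized to average 1 over the correct traces, so the
    total reward mass in the group is unchanged.
    """
    if len(rewards) != len(gains):
        raise ValueError("rewards and gains must have the same length")
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Only traces with a positive verifiable reward are reweighted.
    correct = [i for i, r in enumerate(rewards) if r > 0]
    if not correct:
        return list(rewards)

    # Numerically stable softmax over the correct traces only.
    top = max(gains[i] for i in correct)
    exps = {i: math.exp((gains[i] - top) / temperature) for i in correct}
    total = sum(exps.values())
    n = len(correct)

    return [rewards[i] * n * exps[i] / total if i in exps else rewards[i]
            for i in range(len(rewards))]


if __name__ == "__main__":
    # Two correct traces (reward 1) and one incorrect trace (reward 0).
    print(reweight_rewards([1.0, 1.0, 0.0], [2.0, 0.0, 5.0]))
```

With equal scores the function returns the rewards unchanged, which recovers standard RLVR's equal treatment of all correct solutions.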

RecThinker: An Agentic Framework for Tool-Augmented Reasoning in Recommendation

RecThinker：一种用于工具增强推理的代理框架

Authors: Haobo Zhang, Yutao Zhu, Kelong Mao, Tianhao Li, Zhicheng Dou
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2603.09843
Pdf link: https://arxiv.org/pdf/2603.09843
Abstract Large Language Models (LLMs) have revolutionized recommendation agents by providing superior reasoning and flexible decision-making capabilities. However, existing methods mainly follow a passive information acquisition paradigm, where agents either rely on static pre-defined workflows or perform reasoning with constrained information. This limits the agent's ability to identify information sufficiency, often leading to suboptimal recommendations when faced with fragmented user profiles or sparse item metadata. To address these limitations, we propose RecThinker, an agentic framework for tool-augmented reasoning in recommendation, which shifts recommendation from passive processing to autonomous investigation by dynamically planning reasoning paths and proactively acquiring essential information via autonomous tool-use. Specifically, RecThinker adopts an Analyze-Plan-Act paradigm, which first analyzes the sufficiency of user-item information and then autonomously invokes tool-calling sequences to bridge information gaps between available knowledge and reasoning requirements. We develop a suite of specialized tools for RecThinker, enabling the model to acquire user-side, item-side, and collaborative information for better reasoning and user-item matching. Furthermore, we introduce a self-augmented training pipeline, comprising a Supervised Fine-Tuning (SFT) stage to internalize high-quality reasoning trajectories and a Reinforcement Learning (RL) stage to optimize for decision accuracy and tool-use efficiency. Extensive experiments on multiple benchmark datasets demonstrate that RecThinker consistently outperforms strong baselines in the recommendation scenario.
中文摘要 大型语言模型（LLMs）通过提供更优越的推理能力和灵活的决策能力，彻底革新了推荐代理。然而，现有方法主要遵循被动信息获取范式，代理要么依赖静态预定义的工作流程，要么在受限信息下进行推理。它限制了智能体识别信息充分性的能力，常导致在面对碎片化的用户配置文件或稀疏的项目元数据时，给出次优推荐。为解决这些局限性，我们提出了RecThinker，这是一种用于工具增强推理的代理框架，通过动态规划推理路径并主动获取关键信息，将推荐从被动处理转向自主调查。具体来说，RecThinker 采用了分析-计划-行动范式，首先分析用户-项目信息的充分性，并自主调用工具调用序列，以弥合可用知识与推理需求之间的信息差距。我们为RecThinker开发了一套专用工具，使模型能够获取用户端、条目端和协作信息，从而更好地推理和用户-项目匹配。此外，我们引入了自我增强培训流程，包括监督式微调（SFT）阶段，用于内化高质量的推理轨迹，以及强化学习（RL）阶段，以优化决策准确性和工具使用效率。对多个基准数据集的大量实验表明，RecThinker 在推荐场景中始终优于强基线。

Emerging Extrinsic Dexterity in Cluttered Scenes via Dynamics-aware Policy Learning

通过动态感知策略学习，在杂乱场景中新兴的外在灵活性

Authors: Yixin Zheng, Jiangran Lyu, Yifan Zhang, Jiayi Chen, Mi Yan, Yuntian Deng, Xuesong Shi, Xiaoguang Zhao, Yizhou Wang, Zhizheng Zhang, He Wang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.09882
Pdf link: https://arxiv.org/pdf/2603.09882
Abstract Extrinsic dexterity leverages environmental contact to overcome the limitations of prehensile manipulation. However, achieving such dexterity in cluttered scenes remains challenging and underexplored, as it requires selectively exploiting contact among multiple interacting objects with inherently coupled dynamics. Existing approaches lack explicit modeling of such complex dynamics and therefore fall short in non-prehensile manipulation in cluttered environments, which in turn limits their practical applicability in real-world environments. In this paper, we introduce a Dynamics-Aware Policy Learning (DAPL) framework that can facilitate policy learning with a learned representation of contact-induced object dynamics in cluttered environments. This representation is learned through explicit world modeling and used to condition reinforcement learning, enabling extrinsic dexterity to emerge without hand-crafted contact heuristics or complex reward shaping. We evaluate our approach in both simulation and the real world. Our method outperforms prehensile manipulation, human teleoperation, and prior representation-based policies by over 25% in success rate on unseen simulated cluttered scenes with varying densities. The real-world success rate reaches around 50% across 10 cluttered scenes, while a practical grocery deployment further demonstrates robust sim-to-real transfer and applicability.
中文摘要 外在灵巧利用环境接触克服抓握操作的局限。然而，在杂乱场景中实现这种灵巧度仍然具有挑战性且未被充分探索，因为这需要有选择性地利用多个相互作用物体之间的接触，这些物体本身具有耦合的动力学。现有方法缺乏对此类复杂动力学的显式建模，因此在杂乱环境中的非抓握操作能力不足，这也限制了其在现实环境中的实际应用。本文介绍了一种动态感知策略学习（DAPL）框架，能够通过学习性表示接触诱导对象动态，促进策略学习，尤其是在杂乱环境中。这种表征通过显式世界建模学习，并用于条件化强化学习，使外在灵巧度能够在无需手工设计接触启发式或复杂奖励塑造的情况下自然流露。我们会在模拟和现实世界中评估我们的方法。在不同密度且未看见的模拟杂乱场景中，我们的方法成功率超过25%，优于抓握操作、人工远程操作和基于先例表征的策略。在10个杂乱场景中，实际成功率约为50%，而实际的杂货部署进一步展示了强大的模拟到现实传输及其适用性。

Influencing LLM Multi-Agent Dialogue via Policy-Parameterized Prompts

通过策略参数化提示影响LLM多智能体对话

Authors: Hongbo Bo, Jingyu Hu, Weiru Liu
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2603.09890
Pdf link: https://arxiv.org/pdf/2603.09890
Abstract Large Language Models (LLMs) have emerged as a new paradigm for multi-agent systems. However, existing research on the behaviour of LLM-based multi-agents relies on ad hoc prompts and lacks a principled policy perspective. Rather than using reinforcement learning, we investigate whether prompt-as-action can be parameterized so as to construct a lightweight policy, consisting of a sequence of state-action pairs, that influences conversational behaviours without training. Our framework regards prompts as actions executed by LLMs, and dynamically constructs prompts through five components based on the current state of the agent. To test the effectiveness of parameterized control, we evaluate the dialogue flow based on five indicators: responsiveness, rebuttal, evidence usage, non-repetition, and stance shift. We conduct experiments using different LLM-driven agents in two discussion scenarios related to the general public and show that prompt parameterization can influence the dialogue dynamics. This result shows that policy-parameterized prompts offer a simple and effective mechanism to influence the dialogue process, which can support multi-agent systems research in the direction of social simulation.
中文摘要 大型语言模型（LLM）已成为多智能体系统的新范式。然而，现有关于基于LLM的多智能体行为的研究依赖于临时提示，缺乏有原则的政策视角。与强化学习不同，我们研究提示即动作是否可以参数化，构建一个轻量级策略，该策略由一系列状态-行动对组成，以影响无需训练的会话行为。我们的框架将提示视为大型语言模型执行的动作，并根据代理当前状态动态构建五个组件的提示。为测试参数化控制的有效性，我们基于五个指标评估对话流程：响应性、反驳、证据使用、不重复和立场转变。我们使用不同的LLM驱动代理在两个面向公众的讨论场景中进行了实验，展示了提示参数化可以影响对话动态。这一结果表明，基于政策参数的提示提供了一种简单且有效的机制来影响对话过程，这将有助于多智能体系统朝向社会模拟的研究。

When Learning Rates Go Wrong: Early Structural Signals in PPO Actor-Critic

当学习率出错：PPO演员批评中的早期结构信号

Authors: Alberto Fernández-Hernández, Cristian Pérez-Corral, Jose I. Mestre, Manuel F. Dolz, Jose Duato, Enrique S. Quintana-Ortí
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.09950
Pdf link: https://arxiv.org/pdf/2603.09950
Abstract Deep Reinforcement Learning systems are highly sensitive to the learning rate (LR), and selecting stable and performant training runs often requires extensive hyperparameter search. In Proximal Policy Optimization (PPO) actor-critic methods, small LR values lead to slow convergence, whereas large LR values may induce instability or collapse. We analyse this phenomenon from the behavior of the hidden neurons in the network using the Overfitting-Underfitting Indicator (OUI), a metric that quantifies the balance of binary activation patterns over a fixed probe batch. We introduce an efficient batch-based formulation of OUI and derive a theoretical connection between LR and activation sign changes, clarifying how a correct evolution of the neuron's inner structure depends on the step size. Empirically, across three discrete-control environments and multiple seeds, we show that OUI measured at only 10% of training already discriminates between LR regimes. We observe a consistent asymmetry: critic networks achieving highest return operate in an intermediate OUI band (avoiding saturation), whereas actor networks achieving highest return exhibit comparatively high OUI values. We then compare OUI-based screening rules against early return, clip-based, divergence-based, and flip-based criteria under matched recall over successful runs. In this setting, OUI provides the strongest early screening signal: OUI alone achieves the best precision at broader recall, while combining early return with OUI yields the highest precision in best-performing screening regimes, enabling aggressive pruning of unpromising runs without requiring full training.
中文摘要 深度强化学习系统对学习率（LR）高度敏感，选择稳定且高效的训练运行通常需要大量超参数搜索。在近端策略优化（PPO）actor-critic方法中，较小的LR值会导致收敛缓慢，而较大的LR值则可能导致不稳定或崩溃。我们通过过拟合-欠拟合指标（OUI）分析该现象，该指标量化固定探针批次中二元激活模式的平衡。我们引入了高效的批量基础OUI表述，并推导出LR与激活符号变化之间的理论联系，阐明了神经元内部结构的正确演化如何依赖于步长。通过实证，在三种离散对照环境和多个种子中，我们表明仅10%的OUI训练量度已能区分LR模式。我们观察到一个一致的不对称性：获得最高回报的批评网络运行在中间OUI带（避免饱和），而获得最高回报的行为者网络则表现出相对较高的OUI值。随后，我们将基于OUI的筛查规则与早期回归、基于剪辑、基于发散和基于翻转的匹配回忆标准对比成功运行。在此环境下，OUI提供了最强的早期筛查信号：单靠OUI在更广泛的回忆中能实现最佳精度，而早期返回与OUI结合则在最佳筛查过程中获得最高精度，能够在无需全面训练的情况下积极修剪不良序列。
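
The comparison "under matched recall over successful runs" can be illustrated with a generic sketch: for each early signal, keep the top-scoring runs until a target fraction of the eventually successful runs is retained, then report what fraction of the kept runs succeeded (precision). This is standard precision/recall bookkeeping, not the paper's code. The function name `precision_at_recall` and the toy scores are hypothetical, and a single "higher is better" threshold is a simplification, since the abstract reports that the best critics sit in an intermediate OUI band.

```python
from typing import List, Sequence, Tuple


def precision_at_recall(scores: Sequence[float],
                        succeeded: Sequence[bool],
                        target_recall: float) -> Tuple[float, float, float]:
    """Return (threshold, precision, recall) for a keep-if-score>=threshold rule.

    The threshold is the highest score value at which the kept set contains
    at least `target_recall` of all successful runs.
    """
    if len(scores) != len(succeeded):
        raise ValueError("scores and succeeded must have the same length")
    if not 0.0 < target_recall <= 1.0:
        raise ValueError("target_recall must be in (0, 1]")
    total_pos = sum(1 for s in succeeded if s)
    if total_pos == 0:
        raise ValueError("need at least one successful run")

    # Rank runs from most to least promising according to the early signal.
    order: List[int] = sorted(range(len(scores)),
                              key=lambda i: scores[i], reverse=True)
    kept_pos = 0
    for k, i in enumerate(order, start=1):
        kept_pos += 1 if succeeded[i] else 0
        # Tied scores cannot be split by a threshold; evaluate after the tie.
        if k < len(order) and scores[order[k]] == scores[i]:
            continue
        recall = kept_pos / total_pos
        if recall >= target_recall:
            return scores[i], kept_pos / k, recall
    raise AssertionError("unreachable: keeping every run gives recall 1.0")


if __name__ == "__main__":
    # Hypothetical early-signal scores for six runs and their final outcome.
    signal = [0.9, 0.8, 0.7, 0.4, 0.2, 0.1]
    success = [True, True, False, True, False, False]
    print(precision_at_recall(signal, success, target_recall=0.6))
    print(precision_at_recall(signal, success, target_recall=1.0))
```

Running the function once per screening signal at the same `target_recall` gives precisions that can be compared directly, which is the sense in which the recall is "matched".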

Keyword: diffusion policy

There are no results.