Arxiv Papers of Today

Generated: 2026-03-17 17:00:01 (UTC+8); arXiv release time: 2026-03-17 20:00 EDT (2026-03-18 08:00 UTC+8)

88 relevant papers today

Keyword: reinforcement learning

Agentic AI, Retrieval-Augmented Generation, and the Institutional Turn: Legal Architectures and Financial Governance in the Age of Distributional AGI

代理人工智能、检索增强生成与制度转向：分布式AGI时代的法律架构与金融治理

Authors: Marcel Osmond
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Arxiv link: https://arxiv.org/abs/2603.13244
Pdf link: https://arxiv.org/pdf/2603.13244
Abstract The proliferation of agentic artificial intelligence systems--characterized by autonomous goal-seeking, tool use, and multi-agent coordination--presents unprecedented challenges to existing legal and financial regulatory frameworks. While traditional AI governance has focused on model-level alignment through training-time interventions such as Reinforcement Learning from Human Feedback (RLHF), the deployment of large language models (LLMs) as persistent agents necessitates a paradigm shift toward institutional governance structures. This paper examines the intersection of agentic AI, Retrieval-Augmented Generation (RAG), and their implications for legal accountability and financial market integrity. Through analysis of the Institutional AI framework, we argue that alignment must be reconceptualized as a mechanism design problem involving runtime governance graphs, sanction functions, and observable behavioral constraints rather than internalized constitutional values[...]. The analysis concludes that the future of AI governance lies not in perfecting isolated model behavior, but in architecting institutional environments where compliant behavior emerges as the dominant strategy through carefully calibrated payoff landscapes.
中文摘要 代理型人工智能系统的激增——以自主目标追求、工具使用和多智能体协调为特征——对现有的法律和金融监管框架带来了前所未有的挑战。传统人工智能治理通过训练时间干预如人类反馈强化学习（RLHF）实现模型层级对齐，但大型语言模型（LLMs）作为持久代理的部署，迫使范式转向机构治理结构。本文探讨了代理人工智能、检索增强生成（RAG）及其对法律问责和金融市场诚信的影响。通过对制度性人工智能框架的分析，我们认为对齐必须被重新概念化为一种机制设计问题，涉及运行时治理图、制裁函数和可观察的行为约束，而非内化的宪法价值观[...]。分析总结指出，人工智能治理的未来不在于完善孤立的模型行为，而在于构建一种制度环境，使合规行为通过精心校调的收益环境成为主导策略。

Distilling Deep Reinforcement Learning into Interpretable Fuzzy Rules: An Explainable AI Framework

将深度强化学习提炼为可解释的模糊规则：一个可解释的人工智能框架

Authors: Sanup S. Araballi, Simon Khan, Chilukuri K. Mohan
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.13257
Pdf link: https://arxiv.org/pdf/2603.13257
Abstract Deep Reinforcement Learning (DRL) agents achieve remarkable performance in continuous control but remain opaque, hindering deployment in safety-critical domains. Existing explainability methods either provide only local insights (SHAP, LIME) or employ over-simplified surrogates failing to capture continuous dynamics (decision trees). This work proposes a Hierarchical Takagi-Sugeno-Kang (TSK) Fuzzy Classifier System (FCS) distilling neural policies into human-readable IF-THEN rules through K-Means clustering for state partitioning and Ridge Regression for local action inference. Three quantifiable metrics are introduced: Fuzzy Rule Activation Density (FRAD) measuring explanation focus, Fuzzy Set Coverage (FSC) validating vocabulary completeness, and Action Space Granularity (ASG) assessing control mode diversity. Dynamic Time Warping (DTW) validates temporal behavioral fidelity. Empirical evaluation on Lunar Lander (Continuous) shows the Triangular membership function variant achieves 81.48% $\pm$ 0.43% fidelity, outperforming Decision Trees by 21 percentage points. The framework exhibits statistically superior interpretability (FRAD = 0.814 vs. 0.723 for Gaussian, $p < 0.001$) with low MSE (0.0053) and DTW distance (1.05). Extracted rules such as "IF lander drifting left at high altitude THEN apply upward thrust with rightward correction" enable human verification, establishing a pathway toward trustworthy autonomous systems.
中文摘要 深度强化学习（DRL）智能体在持续控制方面表现出色，但在安全关键领域仍保持不透明，阻碍部署。现有的可解释方法要么只提供局部洞见（如SHAP、LIME），要么采用过于简化的替代工具，未能捕捉连续动态（决策树）。本研究提出了一种分层的高木-苏格诺-康（TSK）模糊分类系统（FCS），通过K-Means聚类进行状态划分和脊回归进行局部动作推断，将神经策略提炼为人类可读的IF-THEN规则。引入了三种可量化的指标：模糊规则激活密度（FRAD）测量解释焦点，模糊集覆盖率（FSC）验证词汇完整性，以及动作空间粒度（ASG）评估控制模式多样性。动态时间扭曲（DTW）验证了时间行为的忠实度。在\textit{Lunar Lander（Continuous）}上的实证评估显示，三角成员函数变体实现了81.48\% $\pm$ 0.43\%的忠真度，比决策树高出21个百分点。该框架在统计上表现出优越性的解释性（FRAD = 0.814，高斯分布为0.723，高斯分布$p <0.001$），MSE较低（0.0053）和DTW距离（1.05）。提取的规则如“如果着陆器在高空向左漂移，则施加向上推力并向右修正”，使人类能够验证，建立通往可信自主系统的路径。
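
The rule-based inference described in this abstract can be illustrated with a minimal sketch of first-order TSK inference using triangular membership functions. This is not the authors' implementation: the two rules, the fuzzy-set breakpoints, the consequent weights, and the names `triangular` and `tsk_infer` are all hypothetical, and the product t-norm is simply one common choice.

```python
def triangular(x, left, peak, right):
    """Triangular membership: 0 outside (left, right), rising to 1 at peak."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def tsk_infer(state, rules):
    """First-order TSK inference: firing-strength-weighted average of linear consequents."""
    weighted_sum, total_strength = 0.0, 0.0
    for rule in rules:
        # Firing strength = product of per-feature memberships (assumed t-norm).
        strength = 1.0
        for x, (left, peak, right) in zip(state, rule["sets"]):
            strength *= triangular(x, left, peak, right)
        # Linear consequent: bias + sum(w_i * x_i), e.g. as fitted by ridge regression.
        action = rule["bias"] + sum(w * x for w, x in zip(rule["weights"], state))
        weighted_sum += strength * action
        total_strength += strength
    # If no rule fires, fall back to a zero action (an arbitrary choice for this sketch).
    return weighted_sum / total_strength if total_strength > 0 else 0.0


# Hypothetical rules over (horizontal velocity, altitude); all numbers are illustrative.
rules = [
    # "IF drifting left AND high altitude THEN correct rightward"
    {"sets": [(-2.0, -1.0, 0.0), (0.5, 1.0, 1.5)], "weights": [-0.5, 0.0], "bias": 0.0},
    # "IF centered AND high altitude THEN no lateral correction"
    {"sets": [(-1.0, 0.0, 1.0), (0.5, 1.0, 1.5)], "weights": [0.0, 0.0], "bias": 0.0},
]
correction = tsk_infer([-0.5, 1.0], rules)
```

Here both rules fire with strength 0.5, so the output is the average of their consequents (0.25 and 0.0). Each rule stays readable as an IF-THEN statement, which is the property the abstract emphasizes.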

Demand Acceptance using Reinforcement Learning for Dynamic Vehicle Routing Problem with Emission Quota

利用强化学习解决带有排放配额的动态车辆路由问题的需求接受

Authors: Farid Najar, Dominique Barth, Yann Strozecki
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.13279
Pdf link: https://arxiv.org/pdf/2603.13279
Abstract This paper introduces and formalizes the Dynamic and Stochastic Vehicle Routing Problem with Emission Quota (DS-QVRP-RR), a novel routing problem that integrates dynamic demand acceptance and routing with a global emission constraint. A key contribution is a two-layer optimization framework designed to facilitate anticipatory rejections of demands and generation of new routes. To solve this, we develop hybrid algorithms that combine reinforcement learning with combinatorial optimization techniques. We present a comprehensive computational study that compares our approach against traditional methods. Our findings demonstrate the relevance of our approach for different types of inputs, even when the horizon of the problem is uncertain.
中文摘要 本文介绍并形式化了带排放配额的动态与随机车辆路由问题（DS-QVRP-RR），这是一项将动态需求接受与路由与全局排放约束整合的新型路由问题。其关键贡献是建立一个双层优化框架，旨在促进需求的预先拒绝和新路由的生成。为此，我们开发了结合强化学习与组合优化技术的混合算法。我们提出了一项全面的计算研究，比较了我们的方法与传统方法。我们的发现表明，即使问题的前景尚不确定，我们的方法对不同类型投入的适用性依然存在。

Pragma-VL: Towards a Pragmatic Arbitration of Safety and Helpfulness in MLLMs

Pragma-VL：迈向多模态大语言模型（MLLM）中安全性与有用性的务实仲裁

Authors: Ming Wen, Kun Yang, Xin Chen, Jingyu Zhang, Dingding Han, Shiwen Cui, Yuedong Xu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.13292
Pdf link: https://arxiv.org/pdf/2603.13292
Abstract Multimodal Large Language Models (MLLMs) pose critical safety challenges, as they are susceptible not only to adversarial attacks such as jailbreaking but also to inadvertently generating harmful content for benign users. While internal safety alignment via Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) is a primary mitigation strategy, current methods often face a safety-utility trade-off: they either refuse benign queries out of excessive caution or overlook latent risks in cross-modal interactions. To resolve this, we introduce Pragma-VL, an end-to-end alignment algorithm that enables MLLMs to pragmatically arbitrate between safety and helpfulness. First, we enhance visual risk perception with a novel cold-start SFT stage. This is achieved by applying risk-aware clustering to the visual encoder and using an interleaved dataset of risk descriptions and high-quality data. Second, we introduce a theoretically-guaranteed reward model that leverages synergistic learning. We train it with a novel data augmentation method that assigns dynamic weights based on the queries, enabling contextual arbitration between safety and helpfulness. Extensive experiments show that Pragma-VL effectively balances safety and helpfulness, outperforming baselines by 5% to 20% on most multimodal safety benchmarks while preserving its general capabilities in areas such as mathematics and knowledge reasoning.
中文摘要 多模态大型语言模型（MLLM）面临着严重的安全挑战，因为它们不仅容易遭受越狱等对抗性攻击，还可能无意中生成对无害用户有害的内容。虽然通过监督微调（SFT）和强化学习（RL）实现内部安全对齐是主要的缓解策略，但现有方法常常面临安全与效用权衡：它们要么出于过度谨慎而拒绝无害查询，要么忽视跨模态交互中的潜在风险。为解决这个问题，我们引入了Pragma-VL，一种端到端对齐算法，使多层次语言学习模型能够务实地在安全性与帮助性之间做出仲裁。首先，我们通过一种新型冷启动SFT阶段增强视觉风险感知。这通过对可视化编码器应用风险感知聚类，并使用交错的数据集，包括风险描述和高质量数据来实现。其次，我们引入一种理论上有保障的奖励模型，利用协同学习。我们用一种新颖的数据增强方法训练它，根据查询分配动态权重，实现安全性与帮助性的上下文仲裁。大量实验表明，Pragma-VL在安全性与帮助性之间取得了有效平衡，在大多数多模态安全基准中比基线高出5%至20%，同时在数学和知识推理等领域保持了其整体能力。

ICPRL: Acquiring Physical Intuition from Interactive Control

ICPRL：从交互式控制获得物理直觉

Authors: Xinrun Xu, Pi Bu, Ye Wang, Börje F. Karlsson, Ziming Wang, Tengtao Song, Qi Zhu, Jun Song, Shuo Zhang, Zhiming Ding, Bo Zheng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.13295
Pdf link: https://arxiv.org/pdf/2603.13295
Abstract VLMs excel at static perception but falter in interactive reasoning in dynamic physical environments, which demands planning and adaptation to dynamic outcomes. Existing physical reasoning methods often depend on abstract symbolic inputs or lack the ability to learn and adapt from direct, pixel-based visual interaction in novel scenarios. We introduce ICPRL (In-Context Physical Reinforcement Learning), a framework inspired by In-Context Reinforcement Learning (ICRL) that empowers VLMs to acquire physical intuition and adapt their policies in-context. Our approach trains a vision-grounded policy model via multi-turn Group Relative Policy Optimization (GRPO) over diverse multi-episode interaction histories. This enables the agent to adapt strategies by conditioning on past trial-and-error sequences, without requiring any weight updates. This adaptive policy works in concert with a separately trained world model that provides explicit physical reasoning by predicting the results of potential actions. At inference, the policy proposes candidate actions, while the world model predicts outcomes to guide a root-node PUCT search to select the most promising action. Evaluated on the diverse physics-based puzzle-solving tasks in the DeepPHY benchmark, ICPRL demonstrates significant improvements across both its (I) policy-only and (II) world-model-augmented stages. Notably, these gains are retained in unseen physical environments, demonstrating that our framework facilitates genuine in-context acquisition of the environment's physical dynamics from interactive experience.
中文摘要 VLM在静态感知方面表现出色，但在动态物理环境中的互动推理能力不足，这需要规划并适应动态结果。现有的物理推理方法通常依赖抽象的符号输入，或者缺乏从直接的像素视觉交互中学习和适应新颖场景的能力。我们引入了ICPRL（情境内物理强化学习），这是一个受情境内强化学习（ICRL）启发的框架，赋能VLM获得物理直觉并在情境中调整策略。我们的方法通过多回合群体相对策略优化（GRPO）训练一个基于愿景的政策模型，涵盖多重事件的交互历史。这使得智能体能够通过条件条件来调整策略，而无需任何权重更新。这种自适应策略与一个单独训练的世界模型协同工作，后者通过预测潜在行动的结果，提供明确的物理推理。在推断时，策略提出候选动作，而世界模型则预测结果，以引导根节点PUCT搜索以选择最有前景的动作。在DeepPHY基准测试中，基于多种基于物理的解谜任务评估ICPRL在I.仅政策和II.政策方面均有显著改进。世界模型增强阶段。值得注意的是，这些收益在看不见的物理环境中得以保留，表明我们的框架能够通过互动体验，真实地在情境中获取环境的物理动态。
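
The root-node PUCT search mentioned in this abstract can be sketched with the AlphaZero-style PUCT score, Q + c * P * sqrt(N_total) / (1 + N). This is a generic illustration, not the ICPRL code: it assumes the priors come from the policy's candidate actions and the values from world-model predictions, and the function name `puct_select`, the constant `c_puct`, and all numbers are hypothetical.

```python
import math


def puct_select(priors, values, visits, c_puct=1.0):
    """Return the index of the root child that maximizes Q + U (AlphaZero-style PUCT)."""
    total_visits = sum(visits)
    best_idx, best_score = 0, float("-inf")
    for i, (p, q, n) in enumerate(zip(priors, values, visits)):
        # The exploration bonus grows with the prior and shrinks as the child is visited.
        u = c_puct * p * math.sqrt(total_visits) / (1 + n)
        if q + u > best_score:
            best_idx, best_score = i, q + u
    return best_idx


# Three candidate actions: policy priors, predicted values, and visit counts so far.
chosen = puct_select(priors=[0.6, 0.3, 0.1], values=[0.2, 0.5, 0.0], visits=[3, 1, 0])
# Scores are 0.5, 0.8, and 0.2, so index 1 is selected.
```

Restricting the search to the root keeps inference cheap: each candidate is scored once per simulation instead of expanding a deep tree.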

Evidence-based Distributional Alignment for Large Language Models

大型语言模型的循证分布对齐

Authors: Viet-Thanh Pham, Lizhen Qu, Zhuang Li, Gholamreza Haffari
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.13305
Pdf link: https://arxiv.org/pdf/2603.13305
Abstract Distributional alignment enables large language models (LLMs) to predict how a target population distributes its responses across answer options, rather than collapsing disagreement into a single consensus answer. However, existing LLM-based distribution prediction is often unstable and degrades under cultural and domain shift. Token score-based estimates can change with minor option wording or formatting, response sampling-based estimates are expensive and sensitive to prompts and decoding settings, and directly generated distributions are frequently miscalibrated. We propose Evi-DA, an evidence-based alignment technique that improves the fidelity and robustness of LLM-based distribution estimation under domain and cultural shift. Given a target country and a multiple-choice question, Evi-DA retrieves related World Values Survey items and their answer distributions, predicts a coarse Welzel value signature for each option, and infers the country-conditioned answer distribution in a structured format. We train the LLMs using a two-stage pipeline, where reinforcement learning optimizes survey-derived rewards that encourage accurate intermediate value predictions, faithful final distributions, well-formed structured outputs, and reduced cultural bias. Across in-domain and out-of-domain benchmarks and multiple open-source backbones, Evi-DA reduces Jensen-Shannon divergence between predicted and gold distributions relative to strong baselines, with average relative improvements of up to 44%.
中文摘要 分布对齐使大型语言模型（LLMs）能够预测目标人群如何将答案分布在不同选项中，而不是将分歧合并为单一共识答案。然而，现有基于LLM的分布预测往往不稳定，且在文化和领域转换下会退化。基于代币分数的估计值可能会因选项措辞或格式的细微变化而变化，基于响应抽样的估计成本高昂且对提示和解码设置敏感，直接生成的分布经常被误校。我们提出了Evi-DA，一种基于证据的比对技术，能够提升基于LLM的分布估计在领域和文化转变下的保真度和稳健性。给定一个目标国家和一个多项选择题，Evi-DA检索相关的世界价值观调查题目及其答案分布，预测每个选项的粗略韦尔泽尔值签名，并以结构化格式推断出国家条件的答案分布。我们采用两阶段流程训练大型语言模型，强化学习优化调查奖励，鼓励准确的中间值预测、忠实的最终分布、结构化的输出以及减少文化偏见。跨越域内和域外基准以及多个开源骨干，Evi-DA相较于强基线减少了预测分布与黄金分布之间的Jensen-Shannon分歧，平均相对改善高达44%。
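
The evaluation metric named in this abstract, Jensen-Shannon divergence between predicted and gold answer distributions, can be computed as in the sketch below. The function names and the example distributions are illustrative; the natural logarithm is an assumption, since the abstract does not state the base.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical gold vs. predicted distributions over four answer options.
gold = [0.10, 0.40, 0.30, 0.20]
predicted = [0.25, 0.25, 0.25, 0.25]
score = js_divergence(predicted, gold)
```

Unlike plain KL divergence, this quantity is symmetric and bounded above by ln 2, so it stays finite even when a model assigns zero probability to an option that the population actually chose.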

LightningRL: Breaking the Accuracy-Parallelism Trade-off of Block-wise dLLMs via Reinforcement Learning

LightningRL：通过强化学习打破分块式扩散大语言模型（dLLM）的准确性与并行性权衡

Authors: Yanzhe Hu, Yijie Jin, Pengfei Liu, Kai Yu, Zhijie Deng
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.13319
Pdf link: https://arxiv.org/pdf/2603.13319
Abstract Diffusion Large Language Models (dLLMs) have emerged as a promising paradigm for parallel token generation, with block-wise variants garnering significant research interest. Despite their potential, existing dLLMs typically suffer from a rigid accuracy-parallelism trade-off: increasing the number of tokens per forward (TPF) via aggressive parallel decoding often leads to performance degradation and increased generation instability. We identify that this limitation stems from the model's inability to navigate high-parallelism regimes where approximation errors and local corruptions accumulate, ultimately undermining the reliability of parallel generation. To address this, we propose LightningRL, a post-training framework designed to directly optimize the speed-quality Pareto frontier of pre-trained dLLMs. Instead of forcing uniform parallelization, our approach leverages reinforcement learning to identify and reinforce high-parallelism trajectories that maintain generation accuracy. Built upon the Group Relative Policy Optimization (GRPO) framework, LightningRL introduces several enhancements tailored for dLLMs: (1) stabilized training via per-reward decoupled normalization; (2) token-level negative log-likelihood (NLL) regularization on correct trajectories to anchor model performance; and (3) a dynamic sampling strategy with TPF-aware filtering to enhance training efficiency. Experimental results across mathematical and coding benchmarks demonstrate that LightningRL consistently advances the Pareto frontier, achieving competitive task accuracy while significantly increasing parallelism, reaching an average TPF of 7.32 (with a peak of 11.10 on the MBPP dataset). Our code is available at this https URL.
中文摘要 扩散大型语言模型（dLLMs）已成为并行令牌生成的有前景范式，分块变体也引起了广泛研究关注。尽管潜力巨大，现有的dLLM通常存在严格的准确率与并行权衡：通过激进的并行解码增加每个前向代币（TPF）数量，常导致性能下降和生成不稳定性增加。我们发现，这一局限源于模型无法适应高并行性环境，在该区域近似误差和局部损坏会累积，最终削弱并行生成的可靠性。为此，我们提出了LightningRL，一个后训练框架，旨在直接优化预训练dLLM的速度质量帕累托前沿。我们的方法不是强制统一并行化，而是利用强化学习识别并强化高并行性轨迹，从而保持生成精度。LightningRL基于群体相对策略优化（GRPO）框架，引入了多项针对dLLM量身定制的改进：（1）通过每个奖励解耦规范化稳定训练;（2）基于正确轨迹的标记级负对数似然（NLL）正则化以锚定模型性能;以及（3）带有TPF感知过滤的动态抽样策略，以提升训练效率。数学和编码基准的实验结果表明，LightningRL 持续推进帕累托前沿，既能实现竞争任务精度，又显著提升并行性，平均 TPF 为 7.32（MBPP 数据集峰值为 11.10）。我们的代码可在此 https URL 访问。
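
As a rough illustration of the GRPO-based training signal in this abstract, the sketch below computes group-relative advantages by standardizing rewards within a group of sampled responses. The second function is only one possible reading of "per-reward decoupled normalization" (normalize each reward type separately, then sum); that reading, the function names, and the example rewards are assumptions, not the paper's method.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one group of samples."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def decoupled_advantages(reward_components, eps=1e-8):
    """ASSUMPTION: normalize each reward type separately within the group, then sum."""
    per_component = [group_relative_advantages(c, eps) for c in reward_components]
    return [sum(vals) for vals in zip(*per_component)]


# Four sampled trajectories for one prompt (hypothetical numbers).
accuracy_rewards = [1.0, 0.0, 1.0, 0.0]   # 1.0 if the final answer is correct
tpf_rewards = [8.0, 2.0, 4.0, 6.0]        # tokens per forward pass
advantages = decoupled_advantages([accuracy_rewards, tpf_rewards])
```

Normalizing each component on its own scale keeps a large-magnitude signal such as TPF from swamping a binary correctness reward, which is one plausible motivation for decoupling.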

AutoTool: Automatic Scaling of Tool-Use Capabilities in RL via Decoupled Entropy Constraints

AutoTool：通过解耦熵约束自动扩展强化学习工具使用能力

Authors: Yirong Zeng, Xiao Ding, Yufei Liu, Yuxian Wang, Qunyao Du, Yutai Hou, Wu Ning, Haonan Song, Duyu Tang, Dandan Tu, Bing Qin, Ting Liu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.13348
Pdf link: https://arxiv.org/pdf/2603.13348
Abstract Tool use represents a critical capability for AI agents, with recent advances focusing on leveraging reinforcement learning (RL) to scale up the explicit reasoning process to achieve better performance. However, there are some key challenges for tool use in current RL-based scaling approaches: (a) direct RL training often struggles to scale up thinking length sufficiently to solve complex problems, and (b) scaled-up models tend to overthink simpler problems, resulting in substantial token inefficiency. To address these challenges, we propose a novel training paradigm that first employs warm-up supervised fine-tuning to help models distinguish between simple and complex problems, followed by RL that enables models to automatically determine appropriate reasoning trajectories. Furthermore, to tackle the issue of automatic thinking-length scaling, we discover that entropy-based optimization objectives effectively maintain model diversity while successfully unlocking the model's scaling capabilities. Based on this insight, we introduce an entropy-based long-short reasoning fusion RL strategy. Our experiments on three benchmarks demonstrate that the model successfully achieves auto-scaling for efficient tool use, achieving a significant 9.8% accuracy improvement while reducing computational overhead by ~81%.
中文摘要 工具的使用是人工智能代理的关键能力，近期的进展集中在利用强化学习（RL）来扩大显性推理过程，以实现更好的性能。然而，当前基于强化学习的扩展方法中工具使用面临一些关键挑战：（a）直接的强化学习往往难以将思维长度扩展到足以解决复杂问题;（b）放大模型往往过度思考简单问题，导致显著的代币效率低下。为应对这些挑战，我们提出了一种新颖的训练范式，首先采用预热监督微调，帮助模型区分简单问题和复杂问题，随后是强化学习，使模型能够自动确定合适的推理轨迹。此外，为了解决自动思维长度缩放的问题，我们发现基于熵的优化目标在成功解锁模型扩展能力的同时，能够有效保持模型多样性。基于这一见解，我们引入了基于熵的长短推理融合强化学习策略。我们在三个基准测试上的实验表明，模型成功实现了工具使用效率的自动缩放，实现了显著的9.8%准确率提升，同时计算开销降低了81%。

Learning When to Trust in Contextual Bandits

在上下文赌博机中学习何时信任

Authors: Majid Ghasemi, Mark Crowley
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.13356
Pdf link: https://arxiv.org/pdf/2603.13356
Abstract Standard approaches to Robust Reinforcement Learning assume that feedback sources are either globally trustworthy or globally adversarial. In this paper, we challenge this assumption and identify a more subtle failure mode, which we term Contextual Sycophancy, where evaluators are truthful in benign contexts but strategically biased in critical ones. We prove that standard robust methods fail in this setting, suffering from Contextual Objective Decoupling. To address this, we propose CESA-LinUCB, which learns a high-dimensional Trust Boundary for each evaluator. We prove that CESA-LinUCB achieves sublinear regret $\tilde{O}(\sqrt{T})$ against contextual adversaries, recovering the ground truth even when no evaluator is globally reliable.
中文摘要 标准的强化强化学习方法假设反馈源要么是全球可信的，要么是全球对抗的。本文挑战这一假设，并识别出一种更微妙的失效模式。我们称这种模式为情境谄媚，评估者在良性情境下诚实，但在批判性情境中策略性偏见。我们证明标准稳健方法在此环境中失效，存在情境目标解耦问题。为此，我们提出了CESA-LinUCB，该方法为每个评估者学习一个高维信任边界。我们证明了CESA-LinUCB能够在对情境对手面前实现亚线性遗憾$\tilde{O}（\sqrt{T}）$，即使没有评估者在全球范围内可靠，也能恢复真实的基础信息。
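
For background on the base algorithm named here, the sketch below implements a single arm of standard (disjoint) LinUCB with a Sherman-Morrison update of the inverse design matrix. It is not CESA-LinUCB: the abstract does not describe how the per-evaluator Trust Boundary is learned, so that part is omitted. The class name, `alpha`, and the example context are hypothetical.

```python
import math


class LinUCBArm:
    """Standard disjoint LinUCB model for one arm (ridge prior A = I, b = 0)."""

    def __init__(self, dim, alpha=1.0):
        self.alpha = alpha
        self.a_inv = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim

    def _a_inv_times(self, x):
        return [sum(row[j] * x[j] for j in range(len(x))) for row in self.a_inv]

    def theta(self):
        """Ridge-regression estimate of the reward weights: A^{-1} b."""
        return self._a_inv_times(self.b)

    def ucb(self, x):
        """Upper confidence bound: theta . x + alpha * sqrt(x^T A^{-1} x)."""
        mean = sum(t * xi for t, xi in zip(self.theta(), x))
        width = math.sqrt(sum(xi * vi for xi, vi in zip(x, self._a_inv_times(x))))
        return mean + self.alpha * width

    def update(self, x, reward):
        """Rank-one update A <- A + x x^T, applied to A^{-1} via Sherman-Morrison."""
        v = self._a_inv_times(x)  # A^{-1} x (A^{-1} is symmetric)
        denom = 1.0 + sum(xi * vi for xi, vi in zip(x, v))
        for i in range(len(x)):
            for j in range(len(x)):
                self.a_inv[i][j] -= v[i] * v[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]


arm = LinUCBArm(dim=2, alpha=1.0)
arm.update([1.0, 0.0], reward=1.0)
bound = arm.ucb([1.0, 0.0])
```

In a full bandit loop, one such model is kept per arm and the arm with the largest `ucb(x)` is played for each context `x`.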

Language-Guided Token Compression with Reinforcement Learning in Large Vision-Language Models

大型视觉语言模型中的语言引导令牌压缩与强化学习

Authors: Sihan Cao, Jianwei Zhang, Pengcheng Zheng, Jiaxin Yan, Caiyan Qin, Yalan Ye, Wei Dong, Peng Wang, Yang Yang, Chaoning Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.13394
Pdf link: https://arxiv.org/pdf/2603.13394
Abstract Large Vision-Language Models (LVLMs) incur substantial inference costs due to the processing of a vast number of visual tokens. Existing methods typically struggle to model progressive visual token reduction as a multi-step decision process with sequential dependencies and often rely on hand-engineered scoring rules that lack adaptive optimization for complex reasoning trajectories. To overcome these limitations, we propose TPRL, a reinforcement learning framework that learns adaptive pruning trajectories through language-guided sequential optimization tied directly to end-task performance. We formulate visual token pruning as a sequential decision process with explicit state transitions and employ a self-supervised autoencoder to compress visual tokens into a compact state representation for efficient policy learning. The pruning policy is initialized through learning from demonstrations and subsequently fine-tuned using Proximal Policy Optimization (PPO) to jointly optimize task accuracy and computational efficiency. Our experimental results demonstrate that TPRL removes up to 66.7% of visual tokens and achieves up to a 54.2% reduction in FLOPs during inference while maintaining a near-lossless average accuracy drop of only 0.7%. Code is released at this https URL.
中文摘要 大型视觉语言模型（LVLM）由于处理大量视觉标记，会产生大量的推理成本。现有方法通常难以将渐进式视觉令牌减少建模为多步骤决策过程，且依赖于手工设计的评分规则，缺乏复杂推理轨迹的自适应优化。为克服这些局限，我们提出了TPRL，一种强化学习框架，通过语言引导的顺序优化学习自适应剪枝轨迹，直接关联终端任务表现。我们将可视化令牌剪枝制定为带有显式状态转换的顺序决策过程，并利用自监督自编码器将视觉令牌压缩为紧凑的状态表示，以实现高效的策略学习。剪枝策略通过演示学习初始化，随后通过近端策略优化（PPO）进行微调，以联合优化任务准确性和计算效率。我们的实验结果表明，TPRL可去除高达66.7%的视觉标记，并在推理过程中实现最多54.2%的FLOP减少，同时保持几乎无损的平均准确率下降仅0.7%。代码发布于 \href{this https URL}{\textcolor{mypink}{this https URL}}。
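
The PPO fine-tuning step mentioned in this abstract relies on the standard clipped surrogate objective, sketched below in plain Python. This shows only the generic PPO objective; how TPRL defines its reward from task accuracy and compute is not given in the abstract and is not modeled here. The function name and the sample numbers are illustrative.

```python
def ppo_clipped_objective(ratios, advantages, clip_eps=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over a batch (to be maximized)."""
    total = 0.0
    for r, a in zip(ratios, advantages):
        # r is the probability ratio pi_new(action) / pi_old(action).
        clipped_r = max(1.0 - clip_eps, min(1.0 + clip_eps, r))
        total += min(r * a, clipped_r * a)
    return total / len(ratios)


# Two pruning decisions: one made more likely with positive advantage,
# one made less likely with negative advantage.
objective = ppo_clipped_objective(ratios=[1.5, 0.5], advantages=[1.0, -1.0])
```

Taking the minimum of the clipped and unclipped terms removes the incentive to move the probability ratio beyond `1 + clip_eps` for good actions, and keeps the penalty in place when the ratio for a bad action falls below `1 - clip_eps`.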

Scalable Machines with Intrinsic Higher Mental-State Dynamics

具有内在高阶心理状态动力学的可扩展机器

Authors: Ahsan Adeel, M. Bilal
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.13453
Pdf link: https://arxiv.org/pdf/2603.13453
Abstract Drawing on recent breakthroughs in cellular neurobiology and detailed biophysical modeling linking neocortical pyramidal neurons to distinct mental-state regimes, this work introduces a mathematically grounded formulation showing how models (e.g., Transformers) can implement computational principles underlying awake imaginative thought to pre-select relevant information before attention is applied via triadic modulation loops among queries ($Q$), keys ($K$), and values ($V$). Scalability experiments on ImageNet-1K, benchmarked against a standard Vision Transformer (ViT), demonstrate significantly faster learning with reduced computational demand (fewer heads, layers, and tokens), consistent with our prior findings in reinforcement learning and language modeling. The approach operates at approximately $\mathcal{O}(N)$ complexity with respect to the number of input tokens $N$.
中文摘要 本研究借鉴了细胞神经生物学的最新突破和将新皮层锥体神经元与不同心理状态体系联系起来的详细生物物理建模，提出了一个数学基础的表述，展示了模型（如变形金刚）如何实现基于清醒想象思维的计算原理，通过查询（$Q$）、键（$K$）之间的三元调制环路，在注意力应用前预先选择相关信息。以及值（$V$）。~基于标准视觉变换器（ViT）基准测试的ImageNet-1K可扩展性实验，展示了显著更快的学习速度和更低的计算需求（减少的字头、层和标记），这与我们之前在强化学习和语言建模方面的发现一致。该方法在输入标记数 $N$ 的复杂度下，大约在 $\mathcal{O}（N）$ 下运行。

REFINE-DP: Diffusion Policy Fine-tuning for Humanoid Loco-manipulation via Reinforcement Learning

REFINE-DP：通过强化学习微调扩散策略以实现人形机器人移动操作

Authors: Zhaoyuan Gu, Yipu Chen, Zimeng Chai, Alfred Cueva, Thong Nguyen, Yifan Wu, Huishu Xue, Minji Kim, Isaac Legene, Fukang Liu, Matthew Kim, Ayan Barula, Yongxin Chen, Ye Zhao
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.13707
Pdf link: https://arxiv.org/pdf/2603.13707
Abstract Humanoid loco-manipulation requires coordinated high-level motion plans with stable, low-level whole-body execution under complex robot-environment dynamics and long-horizon tasks. While diffusion policies (DPs) show promise for learning from demonstrations, deploying them on humanoids poses critical challenges: the motion planner trained offline is decoupled from the low-level controller, leading to poor command tracking, compounding distribution shift, and task failures. The common approach of scaling demonstration data is prohibitively expensive for high-dimensional humanoid systems. To address this challenge, we present REFINE-DP (REinforcement learning FINE-tuning of Diffusion Policy), a hierarchical framework that jointly optimizes a DP high-level planner and an RL-based low-level loco-manipulation controller. The DP is fine-tuned via a PPO-based diffusion policy gradient to improve task success rate, while the controller is simultaneously updated to accurately track the planner's evolving command distribution, reducing the distributional mismatch that degrades motion quality. We validate REFINE-DP on a humanoid robot performing loco-manipulation tasks, including door traversal and long-horizon object transport. REFINE-DP achieves a success rate of over 90% in simulation, even in out-of-distribution cases not seen in the pre-trained data, and enables smooth autonomous task execution in real-world dynamic environments. Our proposed method substantially outperforms pre-trained DP baselines and demonstrates that RL fine-tuning is key to reliable humanoid loco-manipulation. this https URL
中文摘要 类人机车操作需要协调的高级运动计划，并在复杂的机器人环境动力学和长视野任务下实现稳定、低层次的全身执行。虽然扩散策略（DP）在演示中展现出学习潜力，但在类人生物上部署它们存在关键挑战：离线训练的运动规划器与低级控制器解耦，导致指令跟踪不良、分配转移和任务失败。对高维类人生物系统来说，常见的演示数据缩放方式成本高昂。为应对这一挑战，我们提出了REFINE-DP（扩散政策的强化学习微调），这是一个分层框架，联合优化了DP高级规划器和基于RL的低级机车操作控制器。DP通过基于PPO的扩散策略梯度进行微调以提高任务成功率，同时控制器也同步更新，准确跟踪规划师不断变化的指令分布，减少导致运动质量下降的分布不匹配。我们在执行机动操作任务的人形机器人上验证了REFINE-DP的效果，该机器人包括门穿行和长视距物体运输。REFINE-DP在模拟中实现超过90%%的成功率，即使在预训练数据中未出现的非分布情况中，也能实现在真实动态环境中的平滑自主任务执行。我们提出的方法远超预训练的DP基线，并证明强化学习微调是可靠人形机车操作的关键。这个 https 网址

Implicit Maximum Likelihood Estimation for Real-time Generative Model Predictive Control

实时生成模型预测控制的隐式最大似然估计

Authors: Grayson Lee, Minh Bui, Shuzi Zhou, Yankai Li, Mo Chen, Ke Li
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.13733
Pdf link: https://arxiv.org/pdf/2603.13733
Abstract Diffusion-based models have recently shown strong performance in trajectory planning, as they are capable of capturing diverse, multimodal distributions of complex behaviors. A key limitation of these models is their slow inference speed, which results from the iterative denoising process. This makes them less suitable for real-time applications such as closed-loop model predictive control (MPC), where plans must be generated quickly and adapted continuously to a changing environment. In this paper, we investigate Implicit Maximum Likelihood Estimation (IMLE) as an alternative generative modeling approach for planning. IMLE offers strong mode coverage while enabling inference that is two orders of magnitude faster, making it particularly well suited for real-time MPC tasks. Our results demonstrate that IMLE achieves competitive performance on standard offline reinforcement learning benchmarks compared to the standard diffusion-based planner, while substantially improving planning speed in both open-loop and closed-loop settings. We further validate IMLE in a closed-loop human navigation scenario, operating in real-time, demonstrating how it enables rapid and adaptive plan generation in dynamic environments.
中文摘要 基于扩散的模型近年来在轨迹规划方面表现出强劲表现，能够捕捉复杂行为的多样多模分布。这些模型的一个主要局限是推断速度较慢，这源于迭代去噪过程。这使得它们不适合实时应用，如闭环模型预测控制（MPC），因为这些应用需要快速生成计划并持续适应不断变化的环境。本文探讨隐性最大似然估计（IMLE）作为规划的替代生成建模方法。IMLE提供了强的模式覆盖，同时推断速度快了两个数量级，特别适合实时MPC任务。我们的结果表明，IMLE在标准离线强化学习基准测试中优于标准扩散规划器，同时在开环和闭环环境中显著提升了规划速度。我们还进一步验证了IMLE在闭环人类导航场景中的实时运行，展示了其如何在动态环境中实现快速且自适应的计划生成。

Knowledge Distillation for Large Language Models

大型语言模型的知识蒸馏

Authors: Alejandro Paredes La Torre, Barbara Flores, Diego Rodriguez
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.13765
Pdf link: https://arxiv.org/pdf/2603.13765
Abstract We propose a resource-efficient framework for compressing large language models through knowledge distillation, combined with guided chain-of-thought reinforcement learning. Using Qwen 3B as the teacher and Qwen 0.5B as the student, we apply knowledge distillation across English Dolly-15k, Spanish Dolly-15k, and code BugNet and PyTorrent datasets, with hyperparameters tuned in the English setting to optimize student performance. Across tasks, the distilled student retains a substantial portion of the teacher's capability while remaining significantly smaller: 70% to 91% in English, up to 95% in Spanish, and up to 93.5% Rouge-L in code. For coding tasks, integrating chain-of-thought prompting with Group Relative Policy Optimization using CoT-annotated Codeforces data improves reasoning coherence and solution correctness compared to knowledge distillation alone. Post-training 4-bit weight quantization further reduces memory footprint and inference latency. These results show that knowledge distillation combined with chain-of-thought guided reinforcement learning can produce compact, efficient models suitable for deployment in resource-constrained settings.
中文摘要 我们提出了一个资源高效框架，通过知识蒸馏和引导式思维链强化学习来压缩大型语言模型。以Qwen 3B为教师，Qwen 0.5B作为学生，我们在英语Dolly-15k、西班牙语Dolly-15k以及代码BugNet和PyTorrent数据集上应用知识蒸馏，超参数调整为英语设置以优化学生表现。在各项任务中，精炼学生保留了教师相当一部分的能力，同时保持显著缩小：英语中70%至91%，西班牙语中最高95%，代码中最高93.5%为Rouge-L。对于编码任务，将思维链提示与使用CoT注释码力数据的群相对策略优化结合，相比单纯的知识蒸馏，可以提升推理的连贯性和解的正确性。训练后4位权重量化进一步降低了内存占用和推理延迟。这些结果表明，知识蒸馏与思维链引导强化学习结合，可以生成适合资源有限环境中部署的紧凑高效模型。
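
The knowledge-distillation objective behind this abstract is commonly written as a KL divergence between temperature-softened teacher and student distributions. The sketch below shows that textbook form for a single token position; the abstract does not specify the exact loss or temperature used, so the T^2 scaling, the default `temperature=2.0`, and the example logits are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2 (common convention)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0)
    return temperature ** 2 * kl


# Hypothetical logits over a three-token vocabulary at one position.
teacher = [4.0, 1.0, 0.5]
student = [2.0, 1.5, 1.0]
loss = distillation_loss(student, teacher)
```

A higher temperature flattens both distributions, so the student also learns from the teacher's relative preferences among low-probability tokens rather than only from its top choice.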

Retrieve, Schedule, Reflect: LLM Agents for Chip QoR Optimization

检索、调度、反思：用于芯片QoR（结果质量）优化的大型语言模型智能体

Authors: Yikang Ouyang, Yang Luo, Dongsheng Zuo, Yuzhe Ma
Subjects: Hardware Architecture (cs.AR)
Arxiv link: https://arxiv.org/abs/2603.13767
Pdf link: https://arxiv.org/pdf/2603.13767
Abstract Modern chip design requires multi-objective optimization of timing, power, and area under stringent time-to-market constraints. Although powerful optimization algorithms are integrated into EDA tools, achieving high QoR hinges on effective long-horizon scheduling, which relies heavily on manual expert intervention. To address this issue and automate chip design, we propose an agentic LLM framework that schedules chip optimizations through direct interaction with EDA tools. The agent is grounded in natural language expertise expressed as a search tree through retrieval-augmented generation (RAG). We further improve scheduling quality with Pareto-driven QoR feedback through language reflection. Experimental results show that, compared with black-box search methods such as reinforcement learning, our framework achieves 10% greater timing improvement while consuming less power and area, with more than 4x speedup. The post-optimization QoR is also comparable to that achieved by human experts. Finally, the agent supports customized tasks expressed in natural language, enabling preferential QoR trade-offs. The code and chip design data will be publicly available at this https URL.
中文摘要 现代芯片设计需要在严格的上市时间限制下，多目标地优化时序、功耗和面积。尽管强大的优化算法集成到EDA工具中，但实现高服务质量依赖于有效的长期调度，而这在很大程度上依赖于专家的人工干预。为解决这一问题并实现芯片设计自动化，我们提出了一个代理式大型语言模型框架，通过与EDA工具的直接交互来安排芯片优化。该代理基于自然语言专业知识，通过检索增强生成（RAG）表达搜索树。我们通过语言反思通过帕累托驱动的服务质量反馈进一步提升了排班质量。实验结果显示，与强化学习等黑箱搜索方法相比，我们的框架在耗电和面积更少的情况下，实现了10%的时序提升，速度提升超过4倍。优化后的服务质量也可与人类专家所达到的水平相媲美。最后，代理支持用自然语言表达的定制任务，实现优先服务质量的权衡。代码和芯片设计数据将在此HTTPS网址公开。
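
The Pareto-driven QoR feedback in this abstract presupposes a way to tell which optimization outcomes are non-dominated across timing, power, and area. A minimal Pareto filter is sketched below; treating every objective as lower-is-better, and the tuple layout and sample values, are assumptions made for illustration, not details from the paper.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Keep only the points that no other point dominates (all objectives minimized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical QoR results as (critical-path delay, power, area); lower is better.
results = [
    (1.0, 5.0, 3.0),
    (1.2, 4.0, 3.0),
    (1.3, 5.5, 3.2),  # worse than the first result on all three objectives
    (0.9, 6.0, 2.9),
]
front = pareto_front(results)
```

Only the third result is dropped. The remaining three represent different trade-offs, which is the kind of set an agent could weigh when a user states a preference among timing, power, and area.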

Your Vision-Language-Action Model Already Has Attention Heads For Path Deviation Detection

你的视觉-语言-动作模型已经具备用于路径偏离检测的注意力头

Authors: Jaehwan Jeong, Evelyn Zhu, Jinying Lin, Emmanuel Jaimes, Tuan-Anh Vu, Jungseock Joo, Sangpil Kim, M. Khalid Jawed
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.13782
Pdf link: https://arxiv.org/pdf/2603.13782
Abstract Vision-Language-Action (VLA) models have demonstrated strong potential for predicting semantic actions in navigation tasks, demonstrating the ability to reason over complex linguistic instructions and visual contexts. However, they are fundamentally hindered by visual-reasoning hallucinations that lead to trajectory deviations. Addressing this issue has conventionally required training external critic modules or relying on complex uncertainty heuristics. In this work, we discover that monitoring a few attention heads within a frozen VLA model can accurately detect path deviations without incurring additional computational overhead. We refer to these heads, which inherently capture the spatiotemporal causality between historical visual sequences and linguistic instructions, as Navigation Heads. Using these heads, we propose an intuitive, training-free anomaly-detection framework that monitors their signals to detect hallucinations in real time. Surprisingly, among over a thousand attention heads, a combination of just three is sufficient to achieve a 44.6% deviation detection rate with a low false-positive rate of 11.7%. Furthermore, upon detecting a deviation, we bypass the heavy VLA model and trigger a lightweight Reinforcement Learning (RL) policy to safely execute a shortest-path rollback. By integrating this entire detection-to-recovery pipeline onto a physical robot, we demonstrate its practical robustness. All source code will be publicly available.
中文摘要 视觉-语言-行动（VLA）模型在预测导航任务中的语义动作方面展现出强大潜力，展示了在复杂语言指令和视觉语境中进行推理的能力。然而，它们从根本上受到视觉推理幻觉的阻碍，这些幻觉会导致轨迹偏离。解决这一问题通常需要外部批评模块的培训或依赖复杂的不确定性启发式方法。在这项工作中，我们发现在冻结的VLA模型中监测几个注意力头，可以在不增加计算开销的情况下准确检测路径偏离。我们称这些本质上捕捉历史视觉序列与语言指令之间时空因果关系的“导航头”。利用这些头部，我们提出了一个直观、无需训练的异常检测框架，能够实时监测他们的信号，从而检测幻觉。令人惊讶的是，在一千多名注意力引导者中，仅用三个就足以实现44.6%的偏差检测率，且假阳性率低至11.7%。此外，检测到偏差时，我们绕过重VLA模型，触发轻量级强化学习（RL）策略，安全地执行最短路径回滚。通过将整个检测到恢复的流程集成到实体机器人上，我们展示了其实用的稳健性。所有源代码将公开。

Fine-tuning is Not Enough: A Parallel Framework for Collaborative Imitation and Reinforcement Learning in End-to-end Autonomous Driving

微调还不够：端到端自动驾驶中协作模仿与强化学习的并行框架

Authors: Zhexi Lian, Haoran Wang, Xuerun Yan, Weimeng Lin, Xianhong Zhang, Yongyu Chen, Jia Hu
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.13842
Pdf link: https://arxiv.org/pdf/2603.13842
Abstract End-to-end autonomous driving is typically built upon imitation learning (IL), yet its performance is constrained by the quality of human demonstrations. To overcome this limitation, recent methods incorporate reinforcement learning (RL) through sequential fine-tuning. However, such a paradigm remains suboptimal: sequential RL fine-tuning can introduce policy drift and often leads to a performance ceiling due to its dependence on the pretrained IL policy. To address these issues, we propose PaIR-Drive, a general Parallel framework for collaborative Imitation and Reinforcement learning in end-to-end autonomous driving. During training, PaIR-Drive separates IL and RL into two parallel branches with conflict-free training objectives, enabling fully collaborative optimization. This design eliminates the need to retrain RL when applying a new IL policy. During inference, RL leverages the IL policy to further optimize the final plan, allowing performance beyond prior knowledge of IL. Furthermore, we add a tree-structured trajectory neural sampler to Group Relative Policy Optimization (GRPO) in the RL branch, which enhances exploration capability. Extensive analysis on the NAVSIM v1 and v2 benchmarks demonstrates that PaIR-Drive achieves competitive performance of 91.2 PDMS and 87.9 EPDMS, building upon Transfuser and DiffusionDrive IL baselines. PaIR-Drive consistently outperforms existing RL fine-tuning methods, and could even correct human experts' suboptimal behaviors. Qualitative results further confirm that PaIR-Drive can effectively explore and generate high-quality trajectories.
中文摘要 端到端自动驾驶通常基于模仿学习（IL），但其性能受限于人类演示的质量。为克服这一局限，近期方法通过顺序微调结合强化学习（RL）。然而，这种范式仍然不够优：顺序强化学习微调可能会引入策略漂移，且由于依赖预训练的IL策略，常常导致性能上限。为解决这些问题，我们提出了PaIR-Drive，这是一种用于端到端自动驾驶中协作模仿与强化学习的通用并行框架。在培训过程中，PaIR-Drive将IL和RL分为两个平行分支，目标无冲突，实现完全协作优化。这种设计消除了在应用新IL策略时对强化学习进行重新培训的需求。在推断过程中，强化学习利用IL策略进一步优化最终计划，实现超越IL已知的表现。此外，我们在强化学习分支引入了树结构轨迹神经采样器到群组相对策略优化（GRPO），提升了探索能力。对NAVSIMv1和v2基准的广泛分析表明，PaIR-Drive在Transfuser和DiffusionDrive IL基线基础上实现了91.2 PDMS和87.9 EPDMS的竞争性能。PaIR-Drive持续优于现有的强化学习微调方法，甚至可能纠正人类专家的次优行为。定性结果进一步证实，PaIR-Drive能够有效探索并生成高质量的轨迹。

APEX-Searcher: Augmenting LLMs' Search Capabilities through Agentic Planning and Execution

APEX-Searcher：通过代理规划与执行增强大型语言模型的搜索能力

Authors: Kun Chen, Qingchao Kong, Zhao Feifei, Wenji Mao
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.13853
Pdf link: https://arxiv.org/pdf/2603.13853
Abstract Retrieval-augmented generation (RAG), based on large language models (LLMs), serves as a vital approach to retrieving and leveraging external knowledge in various domain applications. When confronted with complex multi-hop questions, single-round retrieval is often insufficient for accurate reasoning and problem solving. To enhance search capabilities for complex tasks, most existing works integrate multi-round iterative retrieval with reasoning processes via end-to-end training. While these approaches significantly improve problem-solving performance, they still face challenges in task reasoning and model training, especially ambiguous retrieval execution paths and sparse rewards in the end-to-end reinforcement learning (RL) process, leading to inaccurate retrieval results and performance degradation. To address these issues, in this paper, we propose APEX-Searcher, a novel Agentic Planning and Execution framework to augment LLM search capabilities. Specifically, we introduce a two-stage agentic framework that decouples the retrieval process into planning and execution: it first employs RL with decomposition-specific rewards to optimize strategic planning; then, building on the sub-task decomposition, it applies supervised fine-tuning on high-quality multi-hop trajectories to equip the model with robust iterative sub-task execution capabilities. Extensive experiments demonstrate that our proposed framework achieves significant improvements in both multi-hop RAG and task planning performance across multiple benchmarks.
中文摘要 基于大型语言模型（LLM）的检索增强生成（RAG）是检索和利用各种领域应用中外部知识的重要方法。面对复杂的多跳题时，单轮检索往往不足以实现准确的推理和问题解决。为了增强复杂任务的搜索能力，大多数现有研究将多轮迭代检索与通过端到端训练的推理过程相结合。虽然这些方法显著提升了问题解决性能，但在任务推理和模型训练方面仍面临挑战，尤其是端到端强化学习（RL）过程中检索路径模糊和奖励稀疏，导致检索结果不准确和性能下降。为解决这些问题，本文提出了APEX-Searcher，一种新型代理规划与执行框架，旨在增强LLM搜索能力。具体来说，我们引入了一个两阶段的代理框架，将检索过程解耦为规划和执行：首先采用强化学习和分解专属奖励以优化战略规划;基于子任务分解，随后对高质量多跳轨迹进行监督微调，赋予模型稳健的迭代子任务执行能力。大量实验表明，我们提出的框架在多跳RAG和任务规划性能上均有显著提升，跨越多个基准测试。

Path-conditioned Reinforcement Learning-based Local Planning for Long-Range Navigation

基于路径条件强化学习的远程导航局部规划

Authors: Mateo Haro, Julia Richter, Fan Yang, Cesar Cadena, Marco Hutter
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.13888
Pdf link: https://arxiv.org/pdf/2603.13888
Abstract Long-range navigation is commonly addressed through hierarchical pipelines in which a global planner generates a path, decomposed into waypoints, and followed sequentially by a local planner. These systems are sensitive to global path quality, as inaccurate remote sensing data can result in locally infeasible waypoints, which degrade local execution. At the same time, the limited global context available to the local planner hinders long-range efficiency. To address this issue, we propose a reinforcement learning-based local navigation policy that leverages path information as contextual guidance. The policy is conditioned on reference path observations and trained with a reward function mainly based on goal-reaching objectives, without any explicit path-following reward. Through this implicit conditioning, the policy learns to opportunistically exploit path information while remaining robust to misleading or degraded guidance. Experimental results show that the proposed approach significantly improves navigation efficiency when high-quality paths are available and maintains baseline-level performance when path observations are severely degraded or even non-existent. These properties make the method particularly well-suited for long-range navigation scenarios in which high-level plans are approximate and local execution must remain adaptive to uncertainty.
中文摘要 长期导航通常通过分层流程处理，其中全局规划者生成路径，分解为航点，然后由本地规划者依次跟踪。这些系统对全球路径质量非常敏感，因为不准确的遥感数据可能导致本地不可行的航点，从而降低本地执行能力。与此同时，地方规划者可用的有限全球环境限制了长期效率。为解决这一问题，我们提出了一种基于强化学习的本地导航策略，利用路径信息作为上下文指导。该策略基于参考路径观测，并以奖励函数主要基于目标达成目标进行训练，没有明确的路径追踪奖励。通过这种隐含条件，该策略学会机会主义地利用路径信息，同时保持对误导性或降级指引的韧性。实验结果显示，当高质量路径可用时，所提方法显著提升导航效率，并在路径观测严重下降甚至不存在时保持基线性能。这些特性使该方法特别适合长期导航场景，在这些场景中，高层次计划是近似的，且局部执行必须保持对不确定性的适应性。

ATCC: Adaptive Concurrency Control for Unforeseen Agentic Transactions

ATCC：针对不可预见代理事务的自适应并发控制

Authors: Weixing Zhou, Zhiyou Wang, Zeshun Peng, Hetian Chen, Yanfeng Zhang, Ge Yu
Subjects: Databases (cs.DB)
Arxiv link: https://arxiv.org/abs/2603.13906
Pdf link: https://arxiv.org/pdf/2603.13906
Abstract Data agents, empowered by Large Language Models (LLMs), introduce a new paradigm in transaction processing. Unlike traditional applications with fixed patterns, data agents run online-generated workflows that repeatedly issue SQL statements, reason over intermediate results, and revise subsequent plans. To ensure data consistency, these SQL statements issued by an agent should be integrated into a transaction, referred to as agentic transactions. Agentic transactions exhibit unforeseen characteristics, including long execution times, irregular execution intervals, and non-deterministic access patterns, breaking the assumptions underlying concurrency control (CC) (e.g., short-lived, predefined). Traditional CC schemes, which rely on fixed policies, fail to capture such dynamic behavior, resulting in inadequate performance. This paper introduces ATCC, an adaptive Concurrency Control for Agentic Transactions. ATCC continuously monitors and interprets the runtime behavior of each agentic transaction, evaluates its interactive phases, and dynamically adapts optimistic or pessimistic execution for each transaction. To ensure precise timing for adaptive switches, ATCC employs a reinforcement learning-based policy to balance immediate blocking against future abort costs. Additionally, to mitigate contention-induced tail latency and wasted reasoning cost caused by abort, a cost-aware priority-based lock scheduling is integrated to prioritize expensive or latency-sensitive transactions. Experimental results under agentic-like YCSB and TPC-C workloads demonstrate that ATCC improves the throughput of agentic transactions by up to four orders of magnitude and reduces tail latency by up to 90% compared to state-of-the-art CC schemes.
中文摘要 数据代理借助大型语言模型（LLM）为事务处理引入了新的范式。与固定模式的传统应用不同，数据代理运行在线生成的工作流程，反复发布SQL语句，推理中间结果，并修订后续计划。为了确保数据一致性，代理发布的这些SQL语句应集成到一个事务中，称为代理事务。代理事务表现出不可预见的特性，包括较长的执行时间、不规则的执行间隔和非确定性的访问模式，打破了并发控制（CC）背后的假设（例如短暂、预定义）。依赖固定策略的传统CC方案无法捕捉这种动态行为，导致性能不足。本文介绍了ATCC，一种针对代理事务的自适应并发控制。ATCC 持续监控和解释每个代理事务的运行时行为，评估其交互阶段，并动态调整每笔交易的乐观或悲观执行。为确保自适应切换的精确时机，ATCC采用基于强化学习的策略，平衡即时阻断与未来中止成本。此外，为了减轻因争用引起的尾部延迟和中止导致的推理浪费成本，集成了基于优先级的成本的锁调度，优先处理昂贵或延迟敏感的交易。在类似代理的YCSB和TPC-C工作负载下，实验结果表明，ATCC相比最先进的CC方案，代理事务吞吐量提升了多达四个数量级，并将尾延迟降低了高达90%。
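
The cost-aware, priority-based lock scheduling mentioned in the ATCC abstract can be pictured with a small wait queue. This is only a minimal sketch under stated assumptions: the class name `CostAwareLockQueue`, the scoring rule in `priority`, and the `sensitivity_bonus` parameter are all hypothetical and are not taken from the paper, which does not give its scheduling formula in the abstract.

```python
import heapq
from itertools import count


class CostAwareLockQueue:
    """Wait queue for a single lock; the next grant goes to the highest-priority waiter.

    Hypothetical illustration, not ATCC's actual scheduler.
    """

    def __init__(self, sensitivity_bonus: float = 10.0):
        self._heap = []                # entries: (-priority, arrival_seq, txn_id)
        self._seq = count()            # tie-breaker: FIFO among equal priorities
        self._bonus = sensitivity_bonus

    def priority(self, reasoning_cost: float, latency_sensitive: bool) -> float:
        # Assumed scoring: reasoning cost that would be wasted on abort,
        # plus a flat bonus for latency-sensitive transactions.
        return reasoning_cost + (self._bonus if latency_sensitive else 0.0)

    def enqueue(self, txn_id: str, reasoning_cost: float,
                latency_sensitive: bool = False) -> None:
        p = self.priority(reasoning_cost, latency_sensitive)
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-p, next(self._seq), txn_id))

    def grant_next(self):
        """Pop and return the transaction that should receive the lock, or None."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

With this scoring, a transaction that has already spent a lot on LLM reasoning is served before a cheap one, and a latency-sensitive transaction jumps ahead of cheap background work.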

SmoothVLA: Aligning Vision-Language-Action Models with Physical Constraints via Intrinsic Smoothness Optimization

SmoothVLA：通过内在平滑性优化将视觉-语言-行动模型与物理约束对齐

Authors: Jiashun Li, Xiaoyu Shi, Hong Xie, Mingsheng Shang, Yun Lu
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.13925
Pdf link: https://arxiv.org/pdf/2603.13925
Abstract Vision-Language-Action (VLA) models have emerged as a powerful paradigm for robotic manipulation. However, existing post-training methods face a dilemma between stability and exploration: Supervised Fine-Tuning (SFT) is constrained by demonstration quality and lacks generalization, whereas Reinforcement Learning (RL) improves exploration but often induces erratic, jittery trajectories that violate physical constraints. To bridge this gap, we propose SmoothVLA, a novel reinforcement learning fine-tuning framework that synergistically optimizes task performance and motion smoothness. The technical core is a physics-informed hybrid reward function that integrates binary sparse task rewards with a continuous dense term derived from trajectory jerk. Crucially, this reward is intrinsic: it is computed directly from policy rollouts, without requiring extrinsic environment feedback or laborious reward engineering. Leveraging Group Relative Policy Optimization (GRPO), SmoothVLA establishes trajectory smoothness as an explicit optimization prior, guiding the model toward physically feasible and stable control. Extensive experiments on the LIBERO benchmark demonstrate that SmoothVLA outperforms standard RL by 13.8% in smoothness and significantly surpasses SFT in generalization across diverse tasks. Our work offers a scalable approach to aligning VLA models with physical-world constraints through intrinsic reward optimization.
中文摘要 视觉-语言-行动（VLA）模型已成为机器人操作的强大范式。然而，现有的训练后方法在稳定性与探索之间面临两难：监督式微调（SFT）受限于演示质量，缺乏泛化性，而强化学习（RL）提升了探索性，但常常导致不稳定、抖动的轨迹，违反物理约束。为弥合这一差距，我们提出了SmoothVLA，一种新型强化学习微调框架，协同优化任务性能和运动流畅性。技术核心是一个基于物理的混合奖励函数，将二元稀疏任务奖励与由轨迹抖动衍生的连续密集项整合。关键是，这种奖励是内在的，直接从政策推送中计算，无需外部环境反馈或繁琐的奖励工程。利用群相对策略优化（GRPO），SmoothVLA将轨迹平滑性作为显式优化先验，引导模型朝向物理可行且稳定的控制。在LIBERO基准测试上的大量实验表明，SmoothVLA在平滑度上比标准强化学习高出13.8%，在推广性上显著优于SFT。我们的工作通过内在奖励优化，提供了一种可扩展的方法，使VLA模型与物理世界约束保持一致。
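
The hybrid reward described in the SmoothVLA abstract (a binary task term plus a dense term derived from trajectory jerk) could look roughly like the sketch below. It is an illustration under assumptions, not the paper's reward: the function names, the 1-D trajectory, the mean-squared-jerk penalty, and the `weight` and `dt` values are all hypothetical. Jerk is approximated by the third finite difference of position.

```python
def mean_squared_jerk(positions, dt):
    """Mean squared third finite difference of a 1-D position sequence."""
    if len(positions) < 4:
        return 0.0  # too short to estimate jerk
    jerks = [
        (positions[i + 3] - 3 * positions[i + 2] + 3 * positions[i + 1] - positions[i]) / dt ** 3
        for i in range(len(positions) - 3)
    ]
    return sum(j * j for j in jerks) / len(jerks)


def hybrid_reward(positions, success, dt=0.1, weight=0.01):
    """Sparse binary task reward minus a dense jerk penalty (assumed form)."""
    task_term = 1.0 if success else 0.0
    smoothness_term = -weight * mean_squared_jerk(positions, dt)
    return task_term + smoothness_term
```

Because the penalty depends only on the rolled-out trajectory, it needs no extra signal from the environment, which matches the "intrinsic" property the abstract emphasizes. A constant-acceleration path has zero jerk and keeps the full task reward, while an oscillating path is penalized.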

LLM-Guided Reinforcement Learning for Audio-Visual Speech Enhancement

视听语音增强的LLM引导强化学习

Authors: Chih-Ning Chen, Jen-Cheng Hou, Hsin-Min Wang, Shao-Yi Chien, Yu Tsao, Fan-Gang Zeng
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2603.13952
Pdf link: https://arxiv.org/pdf/2603.13952
Abstract In existing Audio-Visual Speech Enhancement (AVSE) methods, objectives such as Scale-Invariant Signal-to-Noise Ratio (SI-SNR) and Mean Squared Error (MSE) are widely used; however, they often correlate poorly with perceptual quality and provide limited interpretability for optimization. This work proposes a reinforcement learning-based AVSE framework with a Large Language Model (LLM)-based interpretable reward model. An audio LLM generates natural language descriptions of enhanced speech, which are converted by a sentiment analysis model into a 1-5 rating score serving as the PPO reward for fine-tuning a pretrained AVSE model. Compared with scalar metrics, LLM-generated feedback is semantically rich and explicitly describes improvements in speech quality. Experiments on the 4th COG-MHEAR AVSE Challenge (AVSEC-4) dataset show that the proposed method outperforms a supervised baseline and a DNSMOS-based RL baseline in PESQ, STOI, neural quality metrics, and subjective listening tests.
中文摘要 在现有的视听语音增强（AVSE）方法中，广泛使用了尺度不变信噪比（SI-SNR）和均方误差（MSE）等目标;然而，它们通常与感知质量相关性较差，且在优化方面提供有限的解释性。本研究提出了基于强化学习的AVSE框架，基于大型语言模型（LLM）的可解释奖励模型。音频LLM生成增强语音的自然语言描述，这些描述被情感分析模型转换为1-5的评分，作为对预训练AVSE模型微调的PPO奖励。与标量指标相比，LLM生成的反馈语义丰富，明确描述了语音质量的提升。在第四次COG-MHEAR AVSE挑战数据集（AVSEC-4）上的实验显示，该方法在PESQ、STOI、神经质量指标和主观听力测试中优于监督基线和基于DNSMOS的强化学习基线。

Chunk-Guided Q-Learning

区块引导Q学习

Authors: Gwanwoo Song, Kwanyoung Park, Youngwoon Lee
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.13971
Pdf link: https://arxiv.org/pdf/2603.13971
Abstract In offline reinforcement learning (RL), single-step temporal-difference (TD) learning can suffer from bootstrapping error accumulation over long horizons. Action-chunked TD methods mitigate this by backing up over multiple steps, but can introduce suboptimality by restricting the policy class to open-loop action sequences. To resolve this trade-off, we present Chunk-Guided Q-Learning (CGQ), a single-step TD algorithm that guides a fine-grained single-step critic by regularizing it toward a chunk-based critic trained using temporally extended backups. This reduces compounding error while preserving fine-grained value propagation. We theoretically show that CGQ attains tighter critic optimality bounds than either single-step or action-chunked TD learning alone. Empirically, CGQ achieves strong performance on challenging long-horizon OGBench tasks, often outperforming both single-step and action-chunked methods.
中文摘要 在离线强化学习（RL）中，单步时间差分（TD）学习可能会在较长的时间线内出现自举错误累积。动作分块TD方法通过多步备份来缓解这一问题，但通过将策略类限制为开环动作序列，可能会引入次优性。为解决这一权衡，我们提出了块引导Q学习（CGQ），一种单步TD算法，通过正则化细粒度单步批评者，引导其向使用时间扩展备份训练的块状批评者。这减少了复利误差，同时保持了细粒度的值传播。理论上我们证明，CGQ比单步或动作分块TD学习达到更严格的批评最优界限。从实证角度看，CGQ在具有挑战性的长视野OGBench任务中表现优异，常常优于单步和动作分块方法。
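
To make the two kinds of backup in the Chunk-Guided Q-Learning abstract concrete, the sketch below contrasts a single-step TD target with a multi-step target over an action chunk, and then combines a TD error with a pull toward the chunk critic. The quadratic regularizer and the `reg_weight` parameter are assumptions for illustration; the abstract does not state the exact loss used by CGQ.

```python
def one_step_target(reward, next_value, gamma):
    """Single-step TD target: bootstraps after every transition."""
    return reward + gamma * next_value


def chunk_target(rewards, bootstrap_value, gamma):
    """h-step backup over an action chunk: discounted reward sum plus one bootstrap."""
    total = 0.0
    for i, r in enumerate(rewards):
        total += (gamma ** i) * r
    return total + (gamma ** len(rewards)) * bootstrap_value


def guided_critic_loss(q_value, td_target, chunk_q_value, reg_weight):
    """Squared TD error plus a squared gap to the chunk critic (assumed regularizer)."""
    td_error = q_value - td_target
    guide_gap = q_value - chunk_q_value
    return td_error ** 2 + reg_weight * guide_gap ** 2
```

The chunk target bootstraps once per `h` steps, so estimation error compounds fewer times over a long horizon, while the single-step critic being trained still assigns a value to each individual action.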

Supervised Fine-Tuning versus Reinforcement Learning: A Study of Post-Training Methods for Large Language Models

监督式微调与强化学习：大型语言模型训练后方法的研究

Authors: Haitao Jiang, Wenbo Zhang, Jiarui Yao, Hengrui Cai, Sheng Wang, Rui Song
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.13985
Pdf link: https://arxiv.org/pdf/2603.13985
Abstract Pre-trained Large Language Models (LLMs) exhibit broad capabilities, yet for specific tasks or domains, achieving higher accuracy and more reliable reasoning generally depends on post-training through Supervised Fine-Tuning (SFT) or Reinforcement Learning (RL). Although often treated as distinct methodologies, recent theoretical and empirical developments demonstrate that SFT and RL are closely connected. This study presents a comprehensive and unified perspective on LLM post-training with SFT and RL. We first provide an in-depth overview of both techniques, examining their objectives, algorithmic structures, and data requirements. We then systematically analyze their interplay, highlighting frameworks that integrate SFT and RL, hybrid training pipelines, and methods that leverage their complementary strengths. Drawing on a representative set of recent application studies from 2023 to 2025, we identify emerging trends, characterize the rapid shift toward hybrid post-training paradigms, and distill key takeaways that clarify when and why each method is most effective. By synthesizing theoretical insights, practical methodologies, and empirical evidence, this study establishes a coherent understanding of SFT and RL within a unified framework and outlines promising directions for future research in scalable, efficient, and generalizable LLM post-training.
中文摘要 预训练的大语言模型（LLM）能力广泛，但对于特定任务或领域，其更高的准确性和更可靠的推理通常依赖于通过监督微调（SFT）或强化学习（RL）进行的后期训练。尽管常被视为不同的方法论，但近期的理论和实证发展表明SFT与RL密切相关。本研究提供了关于SFT和RL后培训LLM的全面且统一视角。我们首先深入介绍这两种技术，考察它们的目标、算法结构和数据需求。随后，我们系统地分析它们的相互作用，突出整合SFT和RL的框架、混合培训流程以及利用其互补优势的方法。我们结合2023年至2025年间具代表性的近期应用研究，识别新兴趋势，描述了向混合培训后范式的快速转变，并总结关键要点，澄清了何时及为何每种方法最为有效。通过综合理论见解、实践方法和实证证据，本研究在统一框架内建立了对SFT和RL的连贯理解，并为未来可扩展、高效且可推广的LLM后期研究提出了有前景的方向。

LLM-Guided Safe Reinforcement Learning for Energy System Topology Reconfiguration

用于能源系统拓扑重构的LLM引导安全强化学习

Authors: Zongyan Zhang, Chao Shen, Xu Wan, Jie Song, Mingyang Sun
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2603.14018
Pdf link: https://arxiv.org/pdf/2603.14018
Abstract The increasing penetration of renewable generation and the growing variability of electrified demand introduce substantial operational uncertainty to modern power systems. Topology reconfiguration is widely recognized as an effective and economical means to enhance grid resilience. Due to the coexistence of AC power-flow constraints and discrete switching decisions, topology reconfiguration in large-scale systems leads to a highly nonlinear and nonconvex optimization problem, making traditional methods computationally prohibitive. Consequently, several studies have explored reinforcement learning-based approaches to improve scalability and operational efficiency. However, their practical implementation is challenged by the high-dimensional combinatorial action space and the need to ensure safety during learning-based decision-making. To address these challenges, this paper presents a safe and intelligent topology control framework that integrates Large Language Models (LLMs) with a Safety Soft Actor-Critic (Safety-SAC) architecture. Operational voltage and thermal limits are reformulated into smooth safety-cost signals, enabling risk-aware policy optimization within a constrained Markov decision process. A knowledge-based Safety-LLM module is further introduced to refine unsafe or suboptimal transitions through domain knowledge and state-informed reasoning, thus guiding the learning agent toward safer and more effective switching actions. Experiments on the IEEE 36-bus and 118-bus Grid2Op benchmarks show that the proposed method consistently achieves higher reward, longer survival time, and lower safety cost than SAC, ACE, and their safety-enhanced variants. These results demonstrate the potential of combining LLM-based reasoning with safe reinforcement learning to achieve scalable and reliable grid topology control.
中文摘要 可再生能源日益普及以及电气化需求日益变化，给现代电力系统带来了重大的运营不确定性。拓扑重构被广泛认为是提升网格韧性的有效且经济的方法。由于交流电流约束与离散切换决策的共存，大规模系统中的拓扑重构导致高度非线性和非凸的优化问题，使传统方法在计算上难以实现。因此，多项研究探讨了基于强化学习的方法以提高可扩展性和运营效率。然而，其实际实现面临高维组合作用空间的挑战，以及在基于学习的决策过程中确保安全性。为应对这些挑战，本文提出了一个安全且智能的拓扑控制框架，该框架将大型语言模型（LLMs）与安全软演员-批判者（Safety-SAC）架构集成。操作电压和热极限被重新表述为平稳的安全-成本信号，使得在受限的马尔可夫决策过程中实现风险意识的策略优化。进一步引入了基于知识的Safety-LLM模块，通过领域知识和状态知情推理优化不安全或次优的转移，从而引导学习主体走向更安全、更有效的切换动作。IEEE 36总线和118总线Grid2Op基准测试的实验表明，所提方法持续提升奖励、生存时间和安全指标，相较于SAC、ACE及其安全增强变体，实现了更高的奖励、更长的生存时间和更低的安全成本。这些结果展示了将基于LLM的推理与安全强化学习结合，实现可扩展且可靠的网格拓扑控制的潜力。
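
One way to read "operational voltage and thermal limits are reformulated into smooth safety-cost signals" is to replace a hard limit check with a smooth penalty that grows as a quantity approaches and crosses its limit. The sketch below uses a softplus penalty for that purpose. This is an assumed construction, not the paper's formulation: the softplus choice, the `sharpness` parameter, and the per-unit limits of 0.95 and 1.05 are illustrative values only.

```python
import math


def softplus(x, sharpness=20.0):
    """Smooth approximation of max(x, 0), computed in a numerically stable way."""
    z = sharpness * x
    return (max(z, 0.0) + math.log1p(math.exp(-abs(z)))) / sharpness


def voltage_cost(v_pu, v_min=0.95, v_max=1.05):
    """Smooth cost for a bus voltage (per unit) leaving the band [v_min, v_max]."""
    return softplus(v_min - v_pu) + softplus(v_pu - v_max)


def thermal_cost(loading, limit=1.0):
    """Smooth cost for a line loading ratio exceeding its thermal limit."""
    return softplus(loading - limit)
```

Unlike a 0/1 violation flag, these costs are differentiable and already rise slightly before a limit is crossed, which gives a constrained policy-optimization method a usable gradient near the boundary.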

GRPO and Reflection Reward for Mathematical Reasoning in Large Language Models

GRPO与反射对大型语言模型中数学推理的奖励

Authors: Zhijie Wang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.14041
Pdf link: https://arxiv.org/pdf/2603.14041
Abstract The enhancement of reasoning capabilities in large language models (LLMs) has garnered significant attention, with supervised fine-tuning (SFT) and reinforcement learning emerging as dominant paradigms. While recent studies recognize the importance of reflection in reasoning processes, existing methodologies seldom address proactive reflection encouragement during training. This study focuses on mathematical reasoning by proposing a four-stage framework integrating Group Relative Policy Optimization (GRPO) with reflection reward mechanisms to strengthen LLMs' self-reflective capabilities. The approach also incorporates established accuracy and format rewards. Experimental results demonstrate GRPO's state-of-the-art performance through reflection-encouraged training, with ablation studies confirming the reflection reward's pivotal role. Comparative evaluations demonstrate full-parameter SFT's superiority over low-rank adaptation (LoRA) despite heightened computational demands. Building on these cumulative findings, this research substantiates GRPO's methodological significance in post-training optimization and envisions its potential to serve as a pivotal enabler for future LLM-based intelligent agents through the synergistic integration of cognitive rewards with dynamic environmental interactions.
中文摘要 大型语言模型（LLMs）推理能力的提升引起了广泛关注，监督式微调（SFT）和强化学习成为主流范式。尽管近期研究认识到反思在推理过程中的重要性，但现有方法很少涉及培训期间的主动反思鼓励。本研究聚焦于数学推理，提出一个将群体相对政策优化（GRPO）与反思奖励机制整合的四阶段框架，以增强LLM的自我反思能力。此外，这种方法还包括已建立的准确性和格式的奖励。实验结果通过反思鼓励训练展示了GRPO的先进性能，消融研究证实了反思奖励的关键作用。比较评估显示，尽管计算需求增加，全参数SFT仍优于低秩适应（LoRA）。基于这些累积发现，本研究证实了GRPO在训练后优化中方法论的重要性，并展望其通过认知奖励与动态环境交互的协同整合，有望成为未来基于LLM智能代理的关键推动力。
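
The core of GRPO is that each sampled answer is scored relative to the other answers drawn for the same prompt, so no learned value function is needed. The sketch below shows that group-relative normalization together with a combined reward. The normalization (subtract the group mean, divide by the group standard deviation) is the commonly described GRPO advantage. The reward weights and the boolean `reflected` flag are hypothetical, since the abstract does not say how the reflection reward is computed.

```python
from statistics import mean, pstdev


def total_reward(correct, well_formatted, reflected, w_format=0.5, w_reflect=0.5):
    """Accuracy + format + reflection reward; the weights are illustrative assumptions."""
    return float(correct) + w_format * float(well_formatted) + w_reflect * float(reflected)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one prompt's group of sampled completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]
```

If every completion in a group receives the same reward, all advantages are zero and that prompt contributes no gradient. An extra reward term such as reflection can therefore help by separating completions that would otherwise tie.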

Amortizing Trajectory Diffusion with Keyed Drift Fields

利用密钥漂移场摊销轨迹扩散

Authors: Gokul Puthumanaillam, Melkior Ornik
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2603.14056
Pdf link: https://arxiv.org/pdf/2603.14056
Abstract Diffusion-based trajectory planners can synthesize rich, multimodal action sequences for offline reinforcement learning, but their iterative denoising incurs substantial inference-time cost, making closed-loop planning slow under tight compute budgets. We study the problem of achieving diffusion-like trajectory planning behavior with one-step inference, while retaining the ability to sample diverse candidate plans and condition on the current state in a receding-horizon control loop. Our key observation is that conditional trajectory generation fails under naïve distribution-matching objectives when the similarity measure used to align generated trajectories with the dataset is dominated by unconstrained future dimensions. In practice, this causes attraction toward average trajectories, collapses action diversity, and yields near-static behavior. Our key insight is that conditional generative planning requires a conditioning-aware notion of neighborhood: trajectory updates should be computed using distances in a compact key space that reflects the condition, while still applying updates in the full trajectory space. Building on this, we introduce Keyed Drifting Policies (KDP), a one-step trajectory generator trained with a drift-field objective that attracts generated trajectories toward condition-matched dataset windows and repels them from nearby generated samples, using a stop-gradient drifted target to amortize iterative refinement into training. At inference, the resulting policy produces a full trajectory window in a single forward pass. Across standard RL benchmarks and real-time hardware deployments, KDP achieves strong performance with one-step inference and substantially lower planning latency than diffusion sampling. Project website, code and videos: this https URL
中文摘要 基于扩散的轨迹规划器可以合成丰富的多模态动作序列进行离线强化学习，但其迭代去噪会产生巨大的推理时间成本，使得在有限的计算预算下闭环规划变得缓慢。我们研究如何通过一步推断实现扩散类轨迹规划行为的问题，同时保留在退远视界控制循环中对当前状态采样多样化候选计划和条件的能力。我们的关键观察是，当用于对齐生成轨迹与数据集的相似度度量被无约束的未来维度主导时，条件轨迹生成在朴素分布匹配目标下失败。在实际操作中，这会导致对平均轨迹的吸引，导致作用多样性的崩解，并产生近乎静态的行为。我们的关键见解是，条件生成规划需要一个条件感知的邻域概念：轨迹更新应利用反映条件的紧凑密钥空间中的距离计算，同时在完整轨迹空间中应用更新。基于此，我们引入了关键漂移策略（KDP），这是一种单步轨迹生成器，通过漂移场目标训练，吸引生成轨迹向条件匹配数据集窗口，并排斥其从附近生成的样本中，利用停止梯度漂移目标将迭代细化摊销到训练中。推断时，所得策略在一次前向传递中产生一个完整的轨迹窗口。在标准强化学习基准测试和实时硬件部署中，KDP通过一步推断实现了强劲的性能，规划延迟远低于扩散采样。项目网站、代码和视频：这个 https URL

Improving Visual Reasoning with Iterative Evidence Refinement

通过迭代证据精炼提升视觉推理

Authors: Zeru Shi, Kai Mei, Yihao Quan, Dimitris N. Metaxas, Ruixiang Tang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.14117
Pdf link: https://arxiv.org/pdf/2603.14117
Abstract Vision language models (VLMs) are increasingly capable of reasoning over images, but robust visual reasoning often requires re-grounding intermediate steps in the underlying visual evidence. Recent approaches typically rely on external image operations such as zooming or cropping to re-access fine-grained details during inference, which requires additional image re-encoding and can disrupt the reasoning trajectory. We argue that VLMs already provide strong internal signals for identifying and reusing visual evidence, and that these signals can be directly leveraged to support image-grounded reasoning. Motivated by this insight, we propose an end-to-end self-revisit framework, SIEVE, that trains models to re-engage image evidence through internal representations. SIEVE automatically extracts embeddings of salient image regions and injects them into the reasoning chain when additional grounding is needed, enabling later steps to condition on relevant visual cues without external tool calls or re-encoding. We use reinforcement learning to teach the model when to trigger visual revisiting and which region embeddings to retrieve and insert during the reasoning process. Experiments on multiple visual reasoning benchmarks, together with perception, reasoning, and hallucination evaluations, show that SIEVE yields consistent gains, improving performance by 8 percent on average across several benchmarks.
中文摘要 视觉语言模型（VLM）越来越能够基于图像进行推理，但要实现稳健的视觉推理，通常需要在基础视觉证据中重新建立中间步骤。近期方法通常依赖外部图像操作，如缩放或裁剪，在推断过程中重新访问细粒度细节，这需要额外的图像重新编码，并可能干扰推理轨迹。我们认为，VLM已经为识别和重复利用视觉证据提供了强有力的内部信号，这些信号可以直接用于支持基于图像的推理。基于这一见解，我们提出了一个端到端自我再访框架SIEVE，训练模型通过内部表征重新激活图像证据。SIEVE自动提取显著图像区域的嵌入，并在需要额外基础时将其注入推理链，使后续步骤能够基于相关视觉线索进行条件化，而无需外部工具调用或重新编码。我们利用强化学习教模型何时触发视觉回访，以及在推理过程中应检索和插入哪些区域嵌入。对多个视觉推理基准测试的实验，以及感知、推理和幻觉评估显示，SIEVE在多个基准测试中平均提升8%的性能，持续稳定地提升。

Diffusion Reinforcement Learning via Centered Reward Distillation

通过中心奖励蒸馏的扩散强化学习

Authors: Yuanzhi Zhu, Xi Wang, Stéphane Lathuilière, Vicky Kalogeiton
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.14128
Pdf link: https://arxiv.org/pdf/2603.14128
Abstract Diffusion and flow models achieve State-Of-The-Art (SOTA) generative performance, yet many practically important behaviors such as fine-grained prompt fidelity, compositional correctness, and text rendering are weakly specified by score or flow matching pretraining objectives. Reinforcement Learning (RL) fine-tuning with external, black-box rewards is a natural remedy, but diffusion RL is often brittle. Trajectory-based methods incur high memory cost and high-variance gradient estimates; forward-process approaches converge faster but can suffer from distribution drift, and hence reward hacking. In this work, we present Centered Reward Distillation (CRD), a diffusion RL framework derived from KL-regularized reward maximization built on forward-process-based fine-tuning. The key insight is that the intractable normalizing constant cancels under within-prompt centering, yielding a well-posed reward-matching objective. To enable reliable text-to-image fine-tuning, we introduce techniques that explicitly control distribution drift: (i) decoupling the sampler from the moving reference to prevent ratio-signal collapse, (ii) KL anchoring to a CFG-guided pretrained model to control long-run drift and align with the inference-time semantics of the pre-trained model, and (iii) reward-adaptive KL strength to accelerate early learning under large KL regularization while reducing late-stage exploitation of reward-model loopholes. Experiments on text-to-image post-training with GenEval and OCR rewards show that CRD achieves competitive SOTA reward optimization results with fast convergence and reduced reward hacking, as validated on unseen preference metrics.
中文摘要 扩散和流模型实现了最先进的（SOTA）生成性能，但许多实际重要的行为，如细粒度提示的忠实度、构图正确性和文本渲染，在分数或流匹配预训练目标中却被弱化。通过外部黑箱奖励进行强化学习（RL）微调是一种自然的解决办法，但扩散式强化学习往往较为脆弱。基于轨迹的方法会产生高内存成本和高方差梯度估计;前向过程方法收敛得更快，但可能存在分布漂移，因此会奖励黑客行为。本研究提出了Centered Reward Distillation（CRD），这是一种基于KL正则化奖励最大化的扩散强化学习框架，基于基于前向过程的微调。关键见解是，难解的归一化常数在within-prompt centering下会被抵消，从而得到一个合适的奖励匹配目标。为了实现可靠的文本到图像微调，我们引入了显式控制分布漂移的技术：（i）将采样器与移动参考解耦以防止比值-信号坍缩，（ii）KL锚定到CFG引导的预训练模型以控制长期漂移并与预训练模型的推理时间语义保持一致，以及（iii）奖励自适应KL强度，在大KL正则化下加速早期学习，同时减少后期阶段利用奖励模型漏洞。对 GenEval 和 OCR 奖励的文本转图像后训练实验表明，CRD 在未见偏好指标上验证，能够在竞争性 SOTA 奖励优化中实现快速收敛和减少奖励黑客行为。
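
The claim that the normalizing constant cancels under within-prompt centering can be checked numerically on a toy problem. For a KL-regularized objective, the optimum has the known closed form p*(x) ∝ p_ref(x) · exp(r(x)/β), so β · log(p*(x)/p_ref(x)) equals r(x) − β · log Z, and subtracting the per-prompt mean removes the log Z term. The sketch below uses a finite set of three "samples" for one prompt. The discrete setting and all numbers are illustrative assumptions and do not reproduce the paper's diffusion objective.

```python
import math


def tilted_distribution(ref_probs, rewards, beta):
    """Closed-form maximizer of E[r] - beta * KL(p || ref) over a finite set."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    z = sum(weights)  # the normalizing constant for this prompt
    return [w / z for w in weights], z


def centered(values):
    """Subtract the within-prompt mean from each value."""
    m = sum(values) / len(values)
    return [v - m for v in values]


ref = [0.5, 0.3, 0.2]      # reference model probabilities for one prompt
rewards = [1.0, 0.0, 2.0]  # black-box rewards of the same samples
beta = 0.7

optimal, z = tilted_distribution(ref, rewards, beta)
# Implied rewards differ from the true rewards by the unknown offset beta * log(z).
implied = [beta * math.log(p / q) for p, q in zip(optimal, ref)]
```

Here `implied` is shifted from `rewards` by a constant that depends on `z`, but `centered(implied)` and `centered(rewards)` coincide, so a reward-matching loss on centered quantities never needs to evaluate `z`.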

Understanding Strategic Platform Entry and Seller Exploration: A Stackelberg Model

理解战略平台进入与卖家探索：斯塔克尔伯格模型

Authors: Garrett Seo, Xintong Wang, David C. Parkes
Subjects: Multiagent Systems (cs.MA); Computer Science and Game Theory (cs.GT)
Arxiv link: https://arxiv.org/abs/2603.14206
Pdf link: https://arxiv.org/pdf/2603.14206
Abstract Online market platforms play an increasingly powerful role in the economy. An empirical phenomenon is that platforms, such as Amazon, Apple, and DoorDash, also enter their own marketplaces, imitating successful products developed by third-party sellers. We formulate a Stackelberg model, where the platform acts as the leader by committing to an entry policy: when will it enter and compete on a product? We study this model through a theoretical and computational framework. We begin with a single seller, and consider different kinds of policies for entry. We characterize the seller's optimal explore-exploit strategy via a Gittins-index policy, and give an algorithm to compute the platform's optimal entry policy. We then consider multiple sellers, to account for competition and information spillover. Here, the Gittins-index characterization fails, and we employ deep reinforcement learning to examine seller equilibrium behavior. Our findings highlight the incentives that drive platform entry and seller innovation, consistent with empirical evidence from markets such as Amazon and Google Play, with implications for regulatory efforts to preserve innovation and market diversity.
中文摘要 在线市场平台在经济中扮演着越来越重要的角色。一个实证现象是，亚马逊、苹果和DoorDash等平台也会进入自己的市场，模仿第三方卖家开发的成功产品。我们制定了斯塔克尔伯格模型，平台作为领导者，承诺进入市场政策：何时进入并以产品竞争？我们通过理论和计算框架研究该模型。我们从单一卖家开始，考虑不同类型的进入保单。我们通过Gittins指数策略描述卖家的最优探索-利用策略，并给出计算平台最优进入策略的算法。然后我们会考虑多个卖家，以应对竞争和信息溢出。在这里，Gittins指数的刻画失败了，我们采用深度强化学习来分析卖方均衡行为。我们的发现凸显了推动平台进入和卖家创新的激励机制，这与亚马逊和谷歌Play等市场的实证证据一致，这对保护创新和市场多样性的监管努力具有重要意义。

GoldenStart: Q-Guided Priors and Entropy Control for Distilling Flow Policies

GoldenStart：蒸馏流策略中的Q引导先验与熵控制

Authors: He Zhang, Ying Sun, Hui Xiong
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.14245
Pdf link: https://arxiv.org/pdf/2603.14245
Abstract Flow-matching policies hold great promise for reinforcement learning (RL) by capturing complex, multi-modal action distributions. However, their practical application is often hindered by prohibitive inference latency and ineffective online exploration. Although recent works have employed one-step distillation for fast inference, the structure of the initial noise distribution remains an overlooked factor with significant untapped potential. This factor, along with the challenge of controlling policy stochasticity, marks two critical areas for advancing distilled flow-matching policies. To overcome these limitations, we propose GoldenStart (GSFlow), a policy distillation method with Q-guided priors and explicit entropy control. Instead of initializing generation from uninformed noise, we introduce a Q-guided prior modeled by a conditional VAE. This state-conditioned prior repositions the starting points of the one-step generation process into high-Q regions, effectively providing a "golden start" that shortcuts the policy to promising actions. Furthermore, for effective online exploration, we enable our distilled actor to output a stochastic distribution instead of a deterministic point. This is governed by entropy regularization, allowing the policy to shift from pure exploitation to principled exploration. Our integrated framework demonstrates that by designing the generative starting point and explicitly controlling policy entropy, it is possible to achieve efficient and exploratory policies, bridging generative models and practical actor-critic methods. We conduct extensive experiments on offline and online continuous control benchmarks, where our method significantly outperforms prior state-of-the-art approaches. Code will be available at this https URL.
中文摘要 流量匹配策略通过捕捉复杂的多模态动作分布，在强化学习（RL）方面具有巨大潜力。然而，其实际应用常常受限于过高的推理延迟和无效的在线探索。尽管近期研究采用了一步蒸馏以实现快速推断，但初始噪声分布的结构仍是一个被忽视的因素，具有重要的未开发潜力。这一被忽视的因素，加上控制政策随机性的挑战，构成了推进精炼流动匹配政策的两个关键领域。为克服这些局限，我们提出了GoldenStart（GSFlow），这是一种具有Q引导先验和显式熵控制的策略提纯方法。我们不初始化从无知噪声生成，而是引入由条件VAE建模的Q引导先验。这种状态条件先验将一步生成过程的起点重新定位为高Q区域，有效地提供了一个“黄金起点”，使策略缩短到有前景的行动。此外，为了实现有效的在线探索，我们使提炼演员能够输出随机分布，而非确定性点。这受熵正则化控制，使策略能够从纯粹的利用转向有原则的探索。我们的集成框架表明，通过设计生成起点并明确控制策略熵，可以实现高效且探索性的策略，连接生成模型与实际的行为者-批评方法。我们在离线和在线连续对照基准测试中进行了大量实验，其方法显著优于以往的先进方法。代码将在此 https URL 上提供。

MistExit: Learning to Exit for Early Mistake Detection in Procedural Videos

MistExit：学习程序化视频中早期错误检测的退出方法

Authors: Sagnik Majumder, Anish Nethi, Ziad Al-Halah, Kristen Grauman
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.14252
Pdf link: https://arxiv.org/pdf/2603.14252
Abstract We introduce the task of early mistake detection in video, where the goal is to determine whether a keystep in a procedural activity is performed correctly while observing as little of the streaming video as possible. To tackle this problem, we propose a method comprising a mistake detector and a reinforcement learning policy. At each timestep, the detector processes recently observed frames to estimate the keystep's correctness while anticipating future visual features, enabling reliable early mistake estimates. Meanwhile, the policy aggregates the detector outputs and visual observations over time and adaptively decides when to exit (i.e., stop processing incoming frames) while producing the final prediction. Using diverse real-world procedural video datasets, we demonstrate that our MistExit model achieves superior mistake detection accuracy while reducing the fraction of video observed compared to state-of-the-art models. Project: this https URL.
中文摘要 我们引入了视频中的早期错误检测任务，目标是在尽可能少观察流媒体视频的情况下，确定程序性活动的关键步骤是否正确执行。为解决这个问题，我们提出了一种包含错误检测器和强化学习策略的方法。在每个时间步，探测器处理最近观测到的帧，以估算关键步的正确性，同时预测未来的视觉特征，从而实现可靠的早期错误估计。同时，策略汇总探测器输出和视觉观测值随时间变化，并自适应地决定何时退出（即停止处理输入帧），同时生成最终预测。利用多种真实世界的程序视频数据集，我们证明了我们的MistExit模型在错误检测准确率上更优，同时减少了视频观察比例，相较于最先进模型。项目：这个 https 网址。

Load-Aware Locomotion Control for Humanoid Robots in Industrial Transportation Tasks

工业运输任务中人形机器人的载荷感知运动控制

Authors: Lequn Fu, Yijun Zhong, Xiao Li, Yibin Liu, Zhiyuan Xu, Jian Tang, Shiqi Li
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.14308
Pdf link: https://arxiv.org/pdf/2603.14308
Abstract Humanoid robots deployed in industrial environments are required to perform load-carrying transportation tasks that tightly couple locomotion and manipulation. However, achieving stable and robust locomotion under varying payloads and upper-body motions is challenging due to dynamic coupling and partial observability. This paper presents a load-aware locomotion framework for industrial humanoids based on a decoupled yet coordinated loco-manipulation architecture. Lower-body locomotion is controlled via a reinforcement learning policy producing residual joint actions on kinematically derived nominal configurations. A kinematics-based locomotion reference with a height-conditioned joint-space offset guides learning, while a history-based state estimator infers base linear velocity and height and encodes residual load- and manipulation-induced disturbances in a compact latent representation. The framework is trained entirely in simulation and deployed on a full-size humanoid robot without fine-tuning. Simulation and real-world experiments demonstrate faster training, accurate height tracking, and stable loco-manipulation. Project page: this https URL
中文摘要 部署在工业环境中的人形机器人需要执行将移动和操作紧密结合的载重运输任务。然而，由于动态耦合和部分可观测性，在不同有效载荷和上半身运动下实现稳定且稳健的运动具有挑战性。本文提出了基于解耦但协调的机车操控架构的工业类人生物负载感知运动框架。下半身运动通过强化学习策略控制，产生对运动学推导名义配置的残余关节动作。基于运动学的运动参考带有高度条件的关节空间偏移指导学习，而基于历史的状态估计器推断基线线速度和高度，并编码残余负载和操作引起的扰动，形成紧凑的潜在表示。该框架完全基于仿真训练，并部署在全尺寸类人机器人上，无需微调。模拟和实际实验展示了更快的训练速度、准确的高度追踪和稳定的机车操控能力。项目页面：此 https URL

Data-Driven Physics Embedded Dynamics with Predictive Control and Reinforcement Learning for Quadrupeds

基于数据驱动的物理嵌入式动力学，结合四足动物的预测控制和强化学习

Authors: Prakrut Kotecha, Aditya Shirwatkar, Shishir Kolathaya
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.14333
Pdf link: https://arxiv.org/pdf/2603.14333
Abstract State-of-the-art quadrupedal locomotion approaches integrate Model Predictive Control (MPC) with Reinforcement Learning (RL), enabling complex motion capabilities with planning and terrain-adaptive behaviors. However, they often face compounding errors over long horizons and have limited interpretability due to the absence of physical inductive biases. We address these issues by integrating Lagrangian Neural Networks (LNNs) into an RL-MPC framework, enabling physically consistent dynamics learning. At deployment, our inverse-dynamics infinite-horizon MPC scheme avoids costly matrix inversions, improving computational efficiency by up to 4x with minimal loss of task performance. We validate our framework through multiple ablations of the proposed LNN and its variants. We show improved sample efficiency, reduced long-horizon error, and faster real-time planning compared to unstructured neural dynamics. Lastly, we also test our framework on the Unitree Go1 robot to show real-world viability.
中文摘要 最先进的四足行走方法将模型预测控制（MPC）与强化学习（RL）相结合，实现复杂的运动能力，实现规划和地形适应行为。然而，它们在较长的时间跨度内常常存在复合误差，且由于缺乏物理归纳偏倚，解释性有限。我们通过将拉格朗日神经网络（LNN）集成到强化学习MPC框架中，实现物理一致性的动态学习，解决了这些问题。部署时，我们的逆动力学无限视野MPC方案避免了昂贵的矩阵反演，计算效率提升了多达4倍，同时任务性能损失最小。我们通过对所提出的LNN及其变体进行多次消融来验证我们的框架。我们展示了样本效率提升、长视距误差减少以及与非结构神经动力学相比更快的实时规划。最后，我们还在Unitree Go1机器人上测试了我们的框架，以展示实际世界的可行性。
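
As a hedged illustration of why an inverse-dynamics formulation can sidestep matrix inversion, the standard rigid-body equations of motion are written below. The symbols ($M$ for the mass matrix, $C$ for Coriolis terms, $g$ for gravity, $\tau$ for joint torques) are textbook notation assumed here, not taken from the paper, and the paper's LNN parameterization may differ.

```latex
\begin{aligned}
\text{forward dynamics:}\quad & \ddot{q} = M(q)^{-1}\bigl(\tau - C(q,\dot{q})\,\dot{q} - g(q)\bigr) \\
\text{inverse dynamics:}\quad & \tau = M(q)\,\ddot{q} + C(q,\dot{q})\,\dot{q} + g(q)
\end{aligned}
```

The forward form requires solving a linear system in $M(q)$ at every prediction step, whereas the inverse form only needs matrix-vector products.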

AgroNVILA: Perception-Reasoning Decoupling for Multi-view Agricultural Multimodal Large Language Models

AgroNVILA：多视角农业多模态大型语言模型的感知-推理解耦

Authors: Jiarui Zhang, Junqi Hu, Zurong Mai, Yuhang Chen, Shuohong Lou, Henglian Huang, Lingyuan Zhao, Jianxi Huang, Yutong Lu, Haohuan Fu, Juepeng Zheng
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.14342
Pdf link: https://arxiv.org/pdf/2603.14342
Abstract Agricultural multimodal reasoning requires robust spatial understanding across varying scales, from ground-level close-ups to top-down UAV and satellite imagery. Existing Multi-modal Large Language Models (MLLMs) suffer from a significant "terrestrial-centric" bias, causing scale confusion and logic drift during complex agricultural planning. To address this, we introduce the first large-scale AgroOmni (288K), a multi-view training corpus designed to capture diverse spatial topologies and scales in modern precision agriculture. Built on this dataset, we propose AgroNVILA, an MLLM that utilizes a novel Perception-Reasoning Decoupling (PRD) architecture. On the perception side, we incorporate a View-Conditioned Meta-Net (VCMN), which injects macroscopic spatial context into visual tokens, resolving scale ambiguities with minimal computational overhead. On the reasoning side, Agriculture-aware Relative Policy Optimization (ARPO) leverages reinforcement learning to align the model's decision-making with expert agricultural logic, preventing statistical shortcuts. Extensive experiments demonstrate that AgroNVILA outperforms state-of-the-art MLLMs, achieving significant improvements (+15.18%) in multi-altitude agricultural reasoning, reflecting its robust capability for holistic agricultural spatial planning.
中文摘要 农业多模态推理需要在不同尺度上有扎实的空间理解，从地面特写到俯视无人机和卫星影像。现有的多模态大型语言模型（MLLM）存在明显的“以地球为中心”偏见，导致复杂农业规划中规模混乱和逻辑漂移。为此，我们推出了首个大规模AgroOmni（288K），这是一种多视角训练语料库，旨在捕捉现代精准农业中多样的空间拓扑和尺度。基于该数据集，我们提出了AgroNVILA，一种采用新颖感知-推理解耦（PRD）架构的MLLM。在感知方面，我们采用了视图条件元网（VCMN），将宏观空间上下文注入视觉标记，以最小的计算开销解决尺度模糊。在推理方面，农业感知相对政策优化（ARPO）利用强化学习，使模型决策与专家农业逻辑对齐，避免统计捷径。大量实验表明，AgroNVILA在多海拔农业推理方面取得了显著提升（+15.18%），体现了其在整体农业空间规划方面的强大能力。

VIP-Loco: A Visually Guided Infinite Horizon Planning Framework for Legged Locomotion

VIP-Loco：一种视觉引导的无限地平线规划框架，用于腿部移动

Authors: Aditya Shirwatkar, Satyam Gupta, Shishir Kolathaya
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.14345
Pdf link: https://arxiv.org/pdf/2603.14345
Abstract Perceptive locomotion for legged robots requires anticipating and adapting to complex, dynamic environments. Model Predictive Control (MPC) serves as a strong baseline, providing interpretable motion planning with constraint enforcement, but struggles with high-dimensional perceptual inputs and rapidly changing terrain. In contrast, model-free Reinforcement Learning (RL) adapts well across visually challenging scenarios but lacks planning. To bridge this gap, we propose VIP-Loco, a framework that integrates vision-based scene understanding with RL and planning. During training, an internal model maps proprioceptive states and depth images into compact kinodynamic features used by the RL policy. At deployment, the learned models are used within an infinite-horizon MPC formulation, combining adaptability with structured planning. We validate VIP-Loco in simulation on challenging locomotion tasks, including slopes, stairs, crawling, tilting, gap jumping, and climbing, across three robot morphologies: a quadruped (Unitree Go1), a biped (Cassie), and a wheeled-biped (TronA1-W). Through ablations and comparisons with state-of-the-art methods, we show that VIP-Loco unifies planning and perception, enabling robust, interpretable locomotion in diverse environments.
中文摘要 对有腿机器人来说，感知运动需要预判并适应复杂、动态的环境。模型预测控制（MPC）作为强有力的基线，提供可解释的运动规划和约束执行，但在高维感知输入和快速变化的地形方面存在困难。相比之下，无模型强化学习（RL）在视觉挑战场景中适应良好，但缺乏规划。为弥合这一差距，我们提出了VIP-Loco框架，该框架将基于视觉的场景理解与强化学习和规划相结合。在训练过程中，内部模型将本体感觉状态和深度图像映射为强化学习策略使用的紧凑运动动力学特征。部署时，所学模型被纳入无限视野MPC的表述中，结合了适应性和结构化规划。我们在模拟中验证了VIP-Loco在具有挑战性的运动任务中，包括坡道、楼梯、爬行、倾斜、跳跃和攀爬，涵盖三种机器人形态：四足（Unitree Go1）、双足（Cassie）和轮式双足（TronA1-W）。通过消融和与最先进方法的比较，我们展示了VIP-Loco统一了规划与感知，实现了在多样环境中稳健且可理解的移动。

Exposing Long-Tail Safety Failures in Large Language Models through Efficient Diverse Response Sampling

通过高效的多样响应抽样揭示大型语言模型中的长尾安全失效

Authors: Suvadeep Hajra, Palash Nandi, Tanmoy Chakraborty
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.14355
Pdf link: https://arxiv.org/pdf/2603.14355
Abstract Safety tuning through supervised fine-tuning and reinforcement learning from human feedback has substantially improved the robustness of large language models (LLMs). However, it often suppresses rather than eliminates unsafe behaviors, leaving rare but critical failures hidden in the long tail of the output distribution. While most red-teaming work emphasizes adversarial prompt search (input-space optimization), we show that safety failures can also be systematically exposed through diverse response generation (output-space exploration) for a fixed safety-critical prompt, where increasing the number and diversity of sampled responses can drive jailbreak success rates close to unity. To efficiently uncover such failures, we propose Progressive Diverse Population Sampling (PDPS), which combines stochastic token-level sampling with diversity-aware selection to explore a large candidate pool of responses and retain a compact, semantically diverse subset. Across multiple jailbreak benchmarks and open-source LLMs, PDPS achieves attack success rates comparable to large-scale IID sampling while using only 8% to 29% of the computational cost. Under limited-response settings, it improves success rates by 26% to 40% over IID sampling and Diverse Beam Search. Furthermore, responses generated by PDPS exhibit both a higher number and greater diversity of unsafe outputs, demonstrating its effectiveness in uncovering a broader range of failures.
中文摘要 通过监督微调和基于人类反馈的强化学习进行安全调优，显著提升了大型语言模型（LLMs）的稳健性。然而，它往往抑制而非消除不安全行为，使得罕见但关键的故障隐藏在输出分布的长尾中。虽然大多数红队工作强调对抗性提示搜索（输入空间优化），但我们也证明，对于固定的安全关键提示，通过多样化响应生成（输出空间探索）系统性地暴露安全失败，增加采样响应的数量和多样性可以使越狱成功率接近一。为了高效揭示此类失败，我们提出了渐进多样群体抽样（PDPS），它结合了随机令牌级抽样和多样性感知选择，探索庞大的候选反应池，同时保持紧凑且语义多样化的子集。在多个越狱基准测试和开源大型语言模型中，PDPS在仅占8%至29%计算成本的情况下，实现了与大规模IID采样相当的攻击成功率。在有限响应设置下，它比IID采样和多样光束搜索的成功率提高了26%至40%。此外，PDPS生成的响应显示出更多且更多样化的不安全输出，展示了其在发现更广泛故障方面的有效性。
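
The "diversity-aware selection" step can be pictured with a generic greedy max-min (farthest-point) heuristic over response embeddings. This is a minimal sketch and not the PDPS algorithm itself: the function name `select_diverse`, the use of Euclidean distance, and the choice of the first candidate as the seed are all assumptions made for illustration.

```python
import numpy as np

def select_diverse(embeddings, k):
    """Greedy max-min (farthest-point) selection of up to k rows from an (N, D) array.

    Hypothetical helper: returns the indices of the selected rows.
    """
    n = embeddings.shape[0]
    if k <= 0 or n == 0:
        return []
    chosen = [0]  # arbitrary seed: the first candidate
    # Distance from every candidate to its nearest already-chosen point.
    dist = np.linalg.norm(embeddings - embeddings[0], axis=1)
    while len(chosen) < min(k, n):
        nxt = int(np.argmax(dist))  # candidate farthest from the chosen set
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return chosen

# Toy usage: two near-duplicate points and two distant ones.
points = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [0.0, 10.0]])
subset = select_diverse(points, 3)
```

In the toy example the near-duplicate of the seed is skipped in favor of the two distant points, which is the behavior a compact, semantically diverse subset would need.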

SPARQ: Spiking Early-Exit Neural Networks for Energy-Efficient Edge AI

SPARQ：为节能边缘人工智能激增早期退出神经网络

Authors: Parth Patne, Mahdi Taheri, Ali Mahani, Maksim Jenihhin, Reza Mahani, Christian Herglotz
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Arxiv link: https://arxiv.org/abs/2603.14380
Pdf link: https://arxiv.org/pdf/2603.14380
Abstract Spiking neural networks (SNNs) offer inherent energy efficiency due to their event-driven computation model, making them promising for edge AI deployment. However, their practical adoption is limited by the computational overhead of deep architectures and the absence of input-adaptive control. This work presents SPARQ, a unified framework that integrates spiking computation, quantization-aware training, and reinforcement learning-guided early exits for efficient and adaptive inference. Evaluations across MLP, LeNet, and AlexNet architectures demonstrated that the proposed Quantised Dynamic SNNs (QDSNN) consistently outperform conventional SNNs and QSNNs, achieving up to 5.15% higher accuracy over QSNNs, over 330 times lower system energy compared to baseline SNNs, and over 90 percent fewer synaptic operations across different datasets. These results validate SPARQ as a hardware-friendly, energy-efficient solution for real-time AI at the edge.
中文摘要 尖峰神经网络（SNN）因其事件驱动计算模型而具有固有的能源效率，使其在边缘AI部署中充满潜力。然而，其实际应用受限于深度架构的计算开销和缺乏输入自适应控制。本研究提出了SPARQ，一个统一框架，集成了尖峰计算、量化感知训练和强化学习引导的早期退出，实现高效且自适应的推理。跨MLP、LeNet和AlexNet架构的评估表明，所提出的量化动态SNN（QDSNN）持续优于传统SNN和QSNN，准确率比QSNN高出多达5.15%，系统能量比基线SNN低330倍以上，且不同数据集中突触操作减少超过90%。这些结果验证了SPARQ作为边缘实时AI的硬件友好型、节能解决方案。

From $\boldsymbol{\log\pi}$ to $\boldsymbol{\pi}$: Taming Divergence in Soft Clipping via Bilateral Decoupled Decay of Probability Gradient Weight

从 $\boldsymbol{\log\pi}$ 到 $\boldsymbol{\pi}$：通过双边解耦衰减概率梯度权重来调控软剪裁中的发散

Authors: Xiaoliang Fu, Jiaye Lin, Yangyi Fang, Chaowen Hu, Cong Qin, Zekai Shao, Binbin Zheng, Lu Pan, Ke Zeng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.14389
Pdf link: https://arxiv.org/pdf/2603.14389
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed a leap in Large Language Model (LLM) reasoning, yet its optimization dynamics remain fragile. Standard algorithms like GRPO enforce stability via "hard clipping", which inadvertently stifles exploration by discarding gradients of tokens outside the trust region. While recent "soft clipping" methods attempt to recover these gradients, they suffer from a critical challenge: relying on log-probability gradient ($\nabla_\theta\log \pi_\theta$) yields divergent weights as probabilities vanish, destabilizing LLM training. We rethink this convention by establishing probability gradient ($\nabla_\theta \pi_\theta$) as the superior optimization primitive. Accordingly, we propose Decoupled Gradient Policy Optimization (DGPO), which employs a decoupled decay mechanism based on importance sampling ratios. By applying asymmetric, continuous decay to boundary tokens, DGPO resolves the conflict between stability and sustained exploration. Extensive experiments across DeepSeek-R1-Distill-Qwen series models (1.5B/7B/14B) demonstrate that DGPO consistently outperforms strong baselines on various mathematical benchmarks, offering a robust and scalable solution for RLVR. Our code and implementation are available at: this https URL.
中文摘要 带可验证奖励的强化学习（RLVR）催生了大型语言模型（LLM）推理的飞跃，但其优化动态仍然脆弱。像GRPO这样的标准算法通过“硬剪裁”强制稳定，这种方法无意中通过丢弃信任区外的代币梯度来抑制探索。虽然最新的“软裁剪”方法试图恢复这些梯度，但它们面临一个关键挑战：依赖对数概率梯度（$\nabla_\theta\log \pi_\theta$）会导致权重发散，概率为零，导致LLM训练不稳定。我们重新思考这一约定，将概率梯度（$\nabla_\theta \pi_\theta$）定为更优的优化原语。因此，我们提出了解耦梯度策略优化（DGPO），采用基于重要性抽样比的解耦衰减机制。通过对边界标记应用非对称、连续衰减，DGPO解决了稳定性与持续探索之间的冲突。在DeepSeek-R1-Distill-Qwen系列模型（1.5B/7B/14B）上的大量实验表明，DGPO在多个数学基准测试中始终优于强基线，为RLVR提供了稳健且可扩展的解决方案。我们的代码和实现可在以下链接获取：https URL。
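
One way to read the divergence claim is through the chain-rule identity relating the two gradients. The identity is standard calculus; the exact weighting DGPO applies is not given in the abstract, so nothing below should be taken as the paper's formula.

```latex
\nabla_\theta \log \pi_\theta(a \mid s)
  = \frac{1}{\pi_\theta(a \mid s)}\, \nabla_\theta \pi_\theta(a \mid s)
```

Any objective expressed through $\nabla_\theta \log \pi_\theta$ therefore carries an implicit factor $1/\pi_\theta(a \mid s)$ on the probability gradient, and that factor grows without bound as $\pi_\theta(a \mid s) \to 0$.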

Physics-Informed Policy Optimization via Analytic Dynamics Regularization

通过解析动力学正则化实现物理启发策略优化

Authors: Namai Chandra, Liu Mohan, Zhihao Gu, Lin Wang
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.14469
Pdf link: https://arxiv.org/pdf/2603.14469
Abstract Reinforcement learning (RL) has achieved strong performance in robotic control; however, state-of-the-art policy learning methods, such as actor-critic methods, still suffer from high sample complexity and often produce physically inconsistent actions. This limitation stems from neural policies implicitly rediscovering complex physics from data alone, despite accurate dynamics models being readily available in simulators. In this paper, we introduce a novel physics-informed RL framework, called PIPER, that seamlessly integrates physical constraints directly into neural policy optimization with analytical soft physics constraints. At the core of our method is the integration of a differentiable Lagrangian residual as a regularization term within the actor's objective. This residual, extracted from a robot's simulator description, subtly biases policy updates towards dynamically consistent solutions. Crucially, this physics integration is realized through an additional loss term during policy optimization, requiring no alterations to existing simulators or core RL algorithms. Extensive experiments demonstrate that our method significantly improves learning efficiency, stability, and control accuracy, establishing a new paradigm for efficient and physically consistent robotic control.
中文摘要 强化学习（RL）在机器人控制方面表现出色;然而，最先进的策略学习方法，如actor-critic方法，仍然存在高样本复杂度，且常常产生物理上不一致的动作。这一局限源于神经策略隐含地仅凭数据重新发现复杂物理，尽管模拟器中已有准确的动力学模型。本文介绍了一个新的物理导向强化学习框架，称为PIPER，它将物理约束无缝集成到神经策略优化中，并结合解析软物理约束。我们方法的核心是将可微拉格朗日残差作为正则化项整合进行为者的目标。这种从机器人模拟器描述中提取的残差，微妙地使政策更新趋向动态一致的解决方案。关键是，这种物理集成通过策略优化时的额外损耗项实现，无需对现有模拟器或核心强化学习算法进行修改。大量实验表明，我们的方法显著提升了学习效率、稳定性和控制精度，建立了高效且物理一致的机器人控制新范式。
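
To make the idea of a Lagrangian residual used as a regularizer concrete, here is a minimal sketch on a frictionless point-mass pendulum. It is not PIPER's implementation: the function names, the pendulum model, the squared-residual penalty, and the `weight` parameter are assumptions chosen for illustration, and a real setup would compute the residual inside an autodiff framework so that it can be differentiated with respect to the policy parameters.

```python
import numpy as np

def pendulum_residual(q, qdd, tau, m=1.0, l=1.0, g=9.81):
    """Euler-Lagrange residual m*l^2*qdd + m*g*l*sin(q) - tau.

    q is the angle from the downward vertical, qdd the angular acceleration,
    tau the applied torque. The residual is zero when the triple is
    dynamically consistent with the (assumed) pendulum model.
    """
    return m * l**2 * qdd + m * g * l * np.sin(q) - tau

def regularized_actor_loss(task_loss, q, qdd, tau, weight=0.1):
    """Hypothetical actor objective: task loss plus a mean squared physics residual."""
    residual = pendulum_residual(q, qdd, tau)
    return task_loss + weight * np.mean(residual**2)

# Toy usage: torques that satisfy the dynamics add no penalty.
q = np.array([0.3, -1.2])
qdd = np.array([0.5, 2.0])
tau_consistent = 1.0 * qdd + 9.81 * np.sin(q)
loss_consistent = regularized_actor_loss(2.0, q, qdd, tau_consistent)
loss_inconsistent = regularized_actor_loss(2.0, q, qdd, tau_consistent + 1.0)
```

Because the penalty enters only as an extra loss term, the underlying RL algorithm and simulator are left untouched, which matches the integration style the abstract describes.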

AI Can Learn Scientific Taste

人工智能可以学习科学品味

Authors: Jingqi Tong, Mingzhe Li, Hangcheng Li, Yongzhuo Yang, Yurong Mou, Weijie Ma, Zhiheng Xi, Hongji Chen, Xiaoran Liu, Qinyuan Cheng, Ming Zhang, Qiguang Chen, Weifeng Ge, Qipeng Guo, Tianlei Ying, Tianxiang Sun, Yining Zheng, Xinchi Chen, Jun Zhao, Ning Ding, Xuanjing Huang, Yugang Jiang, Xipeng Qiu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.14473
Pdf link: https://arxiv.org/pdf/2603.14473
Abstract Great scientists have strong judgement and foresight, closely tied to what we call scientific taste. Here, we use the term to refer to the capacity to judge and propose research ideas with high potential impact. However, most related research focuses on improving an AI scientist's executive capability, while enhancing an AI's scientific taste remains underexplored. In this work, we propose Reinforcement Learning from Community Feedback (RLCF), a training paradigm that uses large-scale community signals as supervision, and formulate scientific taste learning as a preference modeling and alignment problem. For preference modeling, we train Scientific Judge on 700K field- and time-matched pairs of high- vs. low-citation papers to judge ideas. For preference alignment, using Scientific Judge as a reward model, we train a policy model, Scientific Thinker, to propose research ideas with high potential impact. Experiments show Scientific Judge outperforms SOTA LLMs (e.g., GPT-5.2, Gemini 3 Pro) and generalizes to future-year test, unseen fields, and peer-review preference. Furthermore, Scientific Thinker proposes research ideas with higher potential impact than baselines. Our findings show that AI can learn scientific taste, marking a key step toward reaching human-level AI scientists.
中文摘要 伟大的科学家拥有强烈的判断力和远见，这与我们所称的科学品味紧密相关。这里，我们用该术语指的是判断和提出具有高潜在影响力研究想法的能力。然而，大多数相对研究都集中在提升人工智能科学家的执行能力，而提升人工智能科学品味的方面尚未被充分探讨。在本研究中，我们提出了基于社区反馈的强化学习（RLCF），这是一种以大规模社区信号为指导的训练范式，并将科学味觉学习作为偏好建模和对齐问题提出。在偏好建模方面，我们用70万篇高引用率和低引用率的田野和时间匹配论文对来训练Scientific Judge来评判观点。为了偏好对齐，我们以Scientific Judge作为奖励模型，训练一个政策模型Scientific Thinker，提出具有高潜力影响的研究想法。实验显示，Scientific Judge 优于 SOTA 大型语言模型（如 GPT-5.2、Gemini 3 Pro），并可推广到未来年度测试、未见领域和同行评审偏好。此外，《科学思想家》提出的研究理念具有比基线更具潜在影响力。我们的发现表明，人工智能能够学习科学品味，这标志着迈向人类水平人工智能科学家的关键一步。
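
Preference modeling on paired examples is commonly trained with a Bradley-Terry style pairwise loss. The abstract does not state which loss Scientific Judge uses, so the sketch below is an assumption offered for orientation only; the function name and the scalar scores are hypothetical.

```python
import numpy as np

def pairwise_preference_loss(score_preferred, score_rejected):
    """Mean Bradley-Terry negative log-likelihood: -log sigmoid(s_pref - s_rej).

    Uses logaddexp for numerical stability, since
    -log sigmoid(m) = log(1 + exp(-m)) = logaddexp(0, -m).
    """
    margin = np.asarray(score_preferred, dtype=float) - np.asarray(score_rejected, dtype=float)
    return float(np.mean(np.logaddexp(0.0, -margin)))

# Toy usage: scores for (high-citation, low-citation) pairs.
loss_tied = pairwise_preference_loss([0.0], [0.0])
loss_separated = pairwise_preference_loss([3.0], [0.0])
```

The loss shrinks as the preferred item's score pulls ahead of the rejected one, which is what lets such a model later serve as a reward signal.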

VLA-Thinker: Boosting Vision-Language-Action Models through Thinking-with-Image Reasoning

VLA-Thinker：通过图像思维推理提升视觉-语言-行动模型

Authors: Chaoyang Wang, Wenrui Bao, Sicheng Gao, Bingxin Xu, Yu Tian, Yogesh S. Rawat, Yunhao Ge, Yuzhang Shang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.14523
Pdf link: https://arxiv.org/pdf/2603.14523
Abstract Vision-Language-Action (VLA) models have shown promising capabilities for embodied intelligence, but most existing approaches rely on text-based chain-of-thought reasoning where visual inputs are treated as static context. This limits the ability of the model to actively revisit the environment and resolve ambiguities during long-horizon tasks. We propose VLA-Thinker, a thinking-with-image reasoning framework that models perception as a dynamically invocable reasoning action. To train such a system, we introduce a two-stage training pipeline consisting of (1) an SFT cold-start phase with curated visual Chain-of-Thought data to activate structured reasoning and tool-use behaviors, and (2) GRPO-based reinforcement learning to align complete reasoning-action trajectories with task-level success. Extensive experiments on LIBERO and RoboTwin 2.0 benchmarks demonstrate that VLA-Thinker significantly improves manipulation performance, achieving 97.5% success rate on LIBERO and strong gains across long-horizon robotic tasks. Project and Codes: this https URL .
中文摘要 视觉-语言-行动（VLA）模型已展现出具备具象智能的前景，但大多数现有方法依赖基于文本的思维链推理，将视觉输入视为静态上下文。这限制了模型在长期任务中主动重新访问环境和解决歧义的能力。我们提出了VLA-Thinker，一种带图像思考的推理框架，将感知建模为动态可调用的推理动作。为训练此类系统，我们引入了两阶段训练流程，包括：（1）SFT冷启动阶段，利用精心策划的视觉思维链数据激活结构化推理和工具使用行为;（2）基于GRPO的强化学习，将完整的推理-行动轨迹与任务层级成功对齐。在LIBERO和RoboTwin 2.0基准测试上的广泛实验表明，VLA-Thinker显著提升了操作性能，在LIBERO上实现了97.5%的成功率，并在长期机器人任务中取得了显著提升。项目与代码：此 https 网址。
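
For readers unfamiliar with GRPO, its core step scores each sampled trajectory relative to the other samples drawn for the same prompt or task. A minimal sketch of that group-relative advantage follows; the function name and the `eps` stabilizer are illustrative choices, and how VLA-Thinker defines its task-level reward is not specified in the abstract.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for the same task."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Toy usage: four rollouts of one manipulation task, reward 1 on success.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
no_signal = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

When every rollout in a group receives the same reward, the advantages are all zero and the group contributes no learning signal.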

Visualizing Critic Match Loss Landscapes for Interpretation of Online Reinforcement Learning Control Algorithms

可视化批评者匹配损失景观以解释在线强化学习控制算法

Authors: Jingyi Liu, Jian Guo, Eberhard Gill
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.14535
Pdf link: https://arxiv.org/pdf/2603.14535
Abstract Reinforcement learning has proven its power on various occasions. However, its performance is not always guaranteed when system dynamics change. Instead, it largely relies on users' empirical experience. For reinforcement learning algorithms with an actor-critic structure, the critic neural network reflects the approximation and optimization process in the RL algorithm. Analyzing the performance of the critic neural network helps to understand the mechanism of the algorithm. To support systematic interpretation of such algorithms in dynamic control problems, this work proposes a critic match loss landscape visualization method for online reinforcement learning. The method constructs a loss landscape by projecting recorded critic parameter trajectories onto a low-dimensional linear subspace. The critic match loss is evaluated over the projected parameter grid using fixed reference state samples and temporal-difference targets. This yields a three-dimensional loss surface together with a two-dimensional optimization path that characterizes critic learning behavior. To extend analysis beyond visual inspection, quantitative landscape indices and a normalized system performance index are introduced, enabling structured comparison across different training outcomes. The approach is demonstrated using the Action-Dependent Heuristic Dynamic Programming algorithm on cart-pole and spacecraft attitude control tasks. Comparative analyses across projection methods and training stages reveal distinct landscape characteristics associated with stable convergence and unstable learning. The proposed framework enables both qualitative and quantitative interpretation of critic optimization behavior in online reinforcement learning.
中文摘要 强化学习在多次场合证明了其威力。然而，当系统动态变化时，其性能并不总是有保障。相反，它主要依赖用户的实证体验。对于具有actor-critic结构的强化学习算法，critic神经网络反映了强化学习算法中的近似和优化过程。分析批判性神经网络的性能有助于理解该算法的机制。为支持动态控制问题中此类算法的系统解释，本研究提出了一种用于在线强化学习的批评匹配损失景观可视化方法。该方法通过将记录的批判参数轨迹投影到低维线性子空间上来构建损失景观。批判匹配损耗通过固定参考状态样本和时间差分目标在预测参数网格上进行评估。这会得到一个三维损耗曲面和一个二维优化路径，用于描述批判者学习行为。为了将分析扩展到视觉检查之外，引入了定量景观指数和归一化系统性能指数，实现不同训练结果之间的结构化比较。该方法通过作用依赖启发式动态规划算法在车杆和航天器姿态控制任务中得到演示。跨投影方法和训练阶段的比较分析揭示了稳定收敛和不稳定学习相关的明显景观特征。该框架支持对在线强化学习中批评优化行为的定性和定量解释。
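
The projection-and-grid procedure described in the abstract can be sketched as follows. This is a generic loss-landscape recipe under stated assumptions, not the authors' code: PCA via SVD stands in for the "low-dimensional linear subspace", a toy linear critic with fixed targets stands in for the critic network and its temporal-difference targets, and all function names are hypothetical.

```python
import numpy as np

def pca_plane(trajectory):
    """Mean and top-2 principal directions of a (T, P) parameter trajectory."""
    center = trajectory.mean(axis=0)
    _, _, vt = np.linalg.svd(trajectory - center, full_matrices=False)
    return center, vt[:2]

def loss_surface(loss_fn, center, directions, span=1.0, steps=21):
    """Evaluate loss_fn on a (steps x steps) grid spanned by the two directions."""
    alphas = np.linspace(-span, span, steps)
    betas = np.linspace(-span, span, steps)
    surface = np.empty((steps, steps))
    for i, a in enumerate(alphas):
        for j, b in enumerate(betas):
            surface[i, j] = loss_fn(center + a * directions[0] + b * directions[1])
    return alphas, betas, surface

def project_path(trajectory, center, directions):
    """2-D coordinates of each recorded parameter vector in the plane."""
    return (trajectory - center) @ directions.T

# Toy stand-in: linear critic, fixed reference features and fixed targets.
rng = np.random.default_rng(0)
features = rng.normal(size=(64, 5))
targets = rng.normal(size=64)

def critic_match_loss(theta):
    return float(np.mean((features @ theta - targets) ** 2))

# Record a parameter trajectory from plain gradient descent.
theta = np.zeros(5)
recorded = [theta.copy()]
for _ in range(30):
    grad = 2.0 * features.T @ (features @ theta - targets) / len(targets)
    theta = theta - 0.05 * grad
    recorded.append(theta.copy())
trajectory = np.stack(recorded)

center, directions = pca_plane(trajectory)
alphas, betas, surface = loss_surface(critic_match_loss, center, directions)
path = project_path(trajectory, center, directions)
```

Plotting `surface` over `(alphas, betas)` gives the three-dimensional landscape, and overlaying `path` gives the two-dimensional optimization trace.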

MorFiC: Fixing Value Miscalibration for Zero-Shot Quadruped Transfer

MorFiC：修正零射击四足运输的数值误校准

Authors: Prakhar Mishra, Amir Hossain Raj, Xuesu Xiao, Dinesh Manocha
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.14554
Pdf link: https://arxiv.org/pdf/2603.14554
Abstract Generalizing learned locomotion policies across quadrupedal robots with different morphologies remains a challenge. Policies trained on a single robot often break when deployed on embodiments with different mass distributions, kinematics, joint limits, or actuation constraints, forcing per-robot retraining. We present MorFiC, a reinforcement learning approach for zero-shot cross-morphology locomotion using a single shared policy. MorFiC resolves a key failure mode in multi-morphology actor-critic training: a shared critic tends to average incompatible value targets across embodiments, yielding miscalibrated advantages. To address this, MorFiC conditions the critic via morphology-aware modulation driven by robot physical and control parameters, generating morphology-specific value estimates within a shared network. Trained with a single source robot with morphology randomization in simulation, MorFiC can transfer to unseen robots and surpasses morphology-conditioned PPO baselines by improving stable average speed and longest stable run on multiple targets, including speed gains of +16.1% on A1, ~2x on Cheetah, and ~5x on B1. We additionally show that MorFiC reduces the value-prediction error variance across morphologies and stabilizes the advantage estimates, demonstrating that the improved value-function calibration corresponds to a stronger transfer performance. Finally, we demonstrate zero-shot deployment on two Unitree Go1 and Go2 robots without fine-tuning, indicating that critic-side conditioning is a practical approach for cross-morphology generalization.
中文摘要 将学习到的运动策略推广到形态不同的四足机器人仍是一个挑战。在单个机器人上训练的策略在部署于质量分布、运动学、关节极限或执行约束不同的实例时常常失效，迫使每台机器人重新训练。我们提出了MorFiC，这是一种基于单一共享策略的零射击交叉形态移动强化学习方法。MorFiC解决了多形态actor-critic训练中的一个关键失败模式：共享critic往往会在不同实例中平均不兼容的值目标，从而产生校准错误的优势。为此，MorFiC通过由机器人物理和控制参数驱动的形态感知调制来调节批评者，生成共享网络中的形态特有值估计。MorFiC在模拟中用单源机器人进行形态随机化训练，能够转移到未被发现的机器人上，并通过提升多目标的稳定平均速度和最长稳定运行，超越形态条件下的PPO基线，包括A1速度提升+16.1%，猎豹速度提升+2倍，B1速度提升~5倍。我们还展示了MorFiC降低了形态学间的值-预测误差方差，并稳定了优势估计，表明价值函数校准的改进对应于更强的转移性能。最后，我们演示了在两台Unitree Go1和Go2机器人上进行零次部署，无需微调，表明批判端条件是跨形态推广的实用方法。

Machine Learning-Driven Intelligent Memory System Design: From On-Chip Caches to Storage

机器学习驱动的智能内存系统设计：从片上缓存到存储

Authors: Rahul Bera, Rakesh Nadig, Onur Mutlu
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.14583
Pdf link: https://arxiv.org/pdf/2603.14583
Abstract Despite the data-rich environment in which memory systems of modern computing platforms operate, many state-of-the-art architectural policies employed in the memory system rely on static, human-designed heuristics that fail to truly adapt to the workload and system behavior via principled learning methodologies. In this article, we propose a fundamentally different design approach: using lightweight and practical machine learning (ML) methods to enable adaptive, data-driven control throughout the memory hierarchy. We present three ML-guided architectural policies: (1) Pythia, a reinforcement learning-based data prefetcher for on-chip caches, (2) Hermes, a perceptron learning-based off-chip predictor for multi-level cache hierarchies, and (3) Sibyl, a reinforcement learning-based data placement policy for hybrid storage systems. Our evaluation shows that Pythia, Hermes, and Sibyl significantly outperform the best-prior human-designed policies, while incurring modest hardware overheads. Collectively, this article demonstrates that integrating adaptive learning into memory subsystems can lead to intelligent, self-optimizing architectures that unlock performance and efficiency gains beyond what is possible with traditional human-designed approaches.
中文摘要 尽管现代计算平台的内存系统运行在数据丰富的环境中，许多内存系统采用的最先进架构策略仍依赖静态、人工设计的启发式方法，这些策略未能通过原则性学习方法真正适应工作负载和系统行为。本文提出了一种根本不同的设计方法：采用轻量化且实用的机器学习（ML）方法，实现贯穿内存层级的自适应、数据驱动控制。我们提出了三种机器学习引导的架构策略：（1）Pythia，基于强化学习的数据预取器，用于芯片缓存;（2）Hermes，基于感知器学习的芯片外预测器，用于多层缓存层级;（3）Sibyl，基于强化学习的数据放置策略，用于混合存储系统。我们的评估显示，Pythia、Hermes 和 Sibyl 的表现远超以往最佳的人类设计政策，同时硬件开销适中。本文总体证明，将自适应学习整合进记忆子系统可以带来智能、自我优化的架构，从而实现超越传统人类设计方法所能实现的性能和效率提升。

Adapting Critic Match Loss Landscape Visualization to Off-policy Reinforcement Learning

将批评者匹配损失景观可视化调整为非策略强化学习

Authors: Jingyi Liu, Jian Guo, Eberhard Gill
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.14589
Pdf link: https://arxiv.org/pdf/2603.14589
Abstract This work extends an established critic match loss landscape visualization method from online to off-policy reinforcement learning (RL), aiming to reveal the optimization geometry behind critic learning. Off-policy RL differs from stepwise online actor-critic learning in its replay-based data flow and target computation. Based on these two structural differences, the critic match loss landscape visualization method is adapted to the Soft Actor-Critic (SAC) algorithm by aligning the loss evaluation with its batch-based data flow and target computation, using a fixed replay batch and precomputed critic targets from the selected policy. Critic parameters recorded during training are projected onto a principal component plane, where the critic match loss is evaluated to form a 3-D landscape with an overlaid 2-D optimization path. Applied to a spacecraft attitude control problem, the resulting landscapes are analyzed both qualitatively and quantitatively using sharpness, basin area, and local anisotropy metrics, together with temporal landscape snapshots. Comparisons between convergent SAC, divergent SAC, and divergent Action-Dependent Heuristic Dynamic Programming (ADHDP) cases reveal distinct geometric patterns and optimization behaviors under different algorithmic structures. The results demonstrate that the adapted critic match loss visualization framework serves as a geometric diagnostic tool for analyzing critic optimization dynamics in replay-based off-policy RL-based control problems.
中文摘要 本研究将成熟的批评匹配损失景观可视化方法从在线推广到非策略强化学习（RL），旨在揭示批评学习背后的优化几何结构。非策略强化学习在其基于重放的数据流和目标计算方面，与逐步在线演员-批评者学习不同。基于这两种结构差异，批判者匹配损失景观可视化方法通过将损失评估与基于批次的数据流和目标计算对齐，采用固定重放批处理和从所选策略预计算的批评目标，适配软演员-批评者（SAC）算法。训练期间记录的批判参数被投影到主分量平面上，评估批判匹配损耗，形成带有叠加二维优化路径的三维景观。应用于航天器姿态控制问题时，所得的景观会通过锐度、盆地面积和局部各向异性指标，以及时间景观快照，进行定性和定量分析。收敛SAC、发散SAC和发散的动作依赖启发式动态规划（ADHDP）案例的比较揭示了不同算法结构下不同的几何模式和优化行为。结果表明，适应后的批评者匹配损失可视化框架作为分析基于重放的非策略强化学习控制问题中批评优化动态的几何诊断工具。

A Loss Landscape Visualization Framework for Interpreting Reinforcement Learning: An ADHDP Case Study

用于解读强化学习的损失景观可视化框架：ADHDP案例研究

Authors: Jingyi Liu, Jian Guo, Eberhard Gill
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.14600
Pdf link: https://arxiv.org/pdf/2603.14600
Abstract Reinforcement learning algorithms have been widely used in dynamic and control systems. However, interpreting their internal learning behavior remains a challenge. In the authors' previous work, a critic match loss landscape visualization method was proposed to study critic training. This study extends that method into a framework which provides a multi-perspective view of the learning dynamics, clarifying how value estimation, policy optimization, and temporal-difference (TD) signals interact during training. The proposed framework includes four complementary components: a three-dimensional reconstruction of the critic match loss surface that shows how TD targets shape the optimization geometry; an actor loss landscape under a frozen critic that reveals how the policy exploits that geometry; a trajectory combining time, Bellman error, and policy weights that indicates how updates move across the surface; and a state-TD map that identifies the state regions that drive those updates. The Action-Dependent Heuristic Dynamic Programming (ADHDP) algorithm for spacecraft attitude control is used as a case study. The framework is applied to compare several ADHDP variants and shows how training stabilizers and target updates change the optimization landscape and affect learning stability. Therefore, the proposed framework provides a systematic and interpretable tool for analyzing reinforcement learning behavior across algorithmic designs.
中文摘要 强化学习算法已被广泛应用于动态系统和控制系统中。然而，解读他们的内在学习行为仍然是个挑战。在作者之前的研究中，提出了一种批评匹配损失景观可视化方法来研究批评者训练。本研究将该方法扩展为一个框架，提供多视角的学习动态视角，阐明价值估计、策略优化和时间差分（TD）信号在训练过程中的相互作用。拟议框架包含四个互补组成部分;批判匹配损耗曲面的三维重建，展示了TD靶如何塑造优化几何;在冻结批评者下，显示该政策如何利用该几何;结合时间、贝尔曼误差和策略权重的轨迹，指示更新如何在表面上移动;以及一个状态-TD映射，用于识别驱动这些更新的状态区域。以作用依赖启发式动态规划（ADHDP）算法作为航天器姿态控制的案例研究。该框架被用于比较多种ADHDP变体，展示了训练稳定器和目标更新如何改变优化格局并影响学习稳定性。因此，所提出的框架提供了一个系统且可解释的工具，用于分析跨算法设计的强化学习行为。

EcoFair-CH-MARL: Scalable Constrained Hierarchical Multi-Agent RL with Real-Time Emission Budgets and Fairness Guarantees

EcoFair-CH-MARL：具备实时排放预算和公平性保证的可扩展受限分层多智能体强化学习

Authors: Saad Alqithami
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2603.14625
Pdf link: https://arxiv.org/pdf/2603.14625
Abstract Global decarbonisation targets and tightening market pressures demand maritime logistics solutions that are simultaneously efficient, sustainable, and equitable. We introduce EcoFair-CH-MARL, a constrained hierarchical multi-agent reinforcement learning framework that unifies three innovations: (i) a primal-dual budget layer that provably bounds cumulative emissions under stochastic weather and demand; (ii) a fairness-aware reward transformer with dynamically scheduled penalties that enforces max-min cost equity across heterogeneous fleets; and (iii) a two-tier policy architecture that decouples strategic routing from real-time vessel control, enabling linear scaling in agent count. New theoretical results establish O(\sqrt{T}) regret for both constraint violations and fairness loss. Experiments on a high-fidelity maritime digital twin (16 ports, 50 vessels) driven by automatic identification system traces, plus an energy-grid case study, show up to 15% lower emissions, 12% higher throughput, and a 45% fair-cost improvement over state-of-the-art hierarchical and constrained MARL baselines. In addition, EcoFair-CH-MARL achieves stronger equity (lower Gini and higher min-max welfare) than fairness-specific MARL baselines (e.g., SOTO, FEN), and its modular design is compatible with both policy- and value-based learners. EcoFair-CH-MARL therefore advances the feasibility of large-scale, regulation-compliant, and socially responsible multi-agent coordination in safety-critical domains.
中文摘要 全球脱碳目标和日益收紧的市场压力要求海事物流解决方案同时做到高效、可持续且公平。我们提出了EcoFair-CH-MARL，一种带约束的分层多智能体强化学习框架，统一了三项创新：（i）原始-对偶预算层，可证明在随机天气和需求下限制累计排放；（ii）具有动态调度惩罚的公平感知奖励变换器，能够在异构船队间强制实现最大最小成本公平；以及（iii）两层策略架构，将战略航线规划与实时船舶控制解耦，实现随智能体数量的线性扩展。新的理论结果为约束违反和公平性损失均建立了O(\sqrt{T})的遗憾界。在由自动识别系统轨迹驱动的高保真海事数字孪生（16个港口，50艘船舶）以及一个能源电网案例研究上的实验表明，与最先进的分层和带约束MARL基线相比，排放最多降低15%，吞吐量提高12%，公平成本改善45%。此外，EcoFair-CH-MARL比专注公平性的MARL基线（如SOTO、FEN）实现了更强的公平性（更低的基尼系数和更高的min-max福利），其模块化设计同时兼容基于策略和基于价值的学习器。因此，EcoFair-CH-MARL提升了在安全关键领域实现大规模、合规且具社会责任感的多智能体协调的可行性。
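
As a rough illustration of the ideas named in the abstract above, the following minimal sketch shows a generic projected dual-ascent update for a budget constraint, the corresponding penalized reward, and a Gini coefficient for measuring cost inequality. It is not the paper's implementation: the function names, the learning rate, and the choice of a scalar multiplier are all assumptions made for illustration.

```python
def dual_update(lmbda, episode_emissions, budget, lr=0.1):
    """Projected dual ascent: raise the multiplier when the budget is
    exceeded, lower it otherwise, and never let it go below zero."""
    return max(0.0, lmbda + lr * (episode_emissions - budget))


def penalized_reward(task_reward, emissions, lmbda):
    """Lagrangian-style reward seen by the primal (policy) learner."""
    return task_reward - lmbda * emissions


def gini(costs):
    """Gini coefficient of non-negative costs (0 means perfectly equal)."""
    xs = sorted(costs)
    n, total = len(xs), sum(xs)
    if total == 0:
        return 0.0
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n
```

In a training loop under these assumptions, `dual_update` would be called once per episode and the resulting multiplier fed into `penalized_reward` for the next episode.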

VisionCoach: Reinforcing Grounded Video Reasoning via Visual-Perception Prompting

VisionCoach：通过视觉感知提示强化扎根视频推理

Authors: Daeun Lee, Shoubin Yu, Yue Zhang, Mohit Bansal
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.14659
Pdf link: https://arxiv.org/pdf/2603.14659
Abstract Video reasoning requires models to locate and track question-relevant evidence across frames. While reinforcement learning (RL) with verifiable rewards improves accuracy, it still struggles to achieve reliable spatio-temporal grounding during the reasoning process. Moreover, improving grounding typically relies on scaled training data or inference-time perception tools, which increases annotation cost or computational cost. To address this challenge, we propose VisionCoach, an input-adaptive RL framework that improves spatio-temporal grounding through visual prompting as training-time guidance. During RL training, visual prompts are selectively applied to challenging inputs to amplify question-relevant evidence and suppress distractors. The model then internalizes these improvements through self-distillation, enabling grounded reasoning directly on raw videos without visual prompting at inference. VisionCoach consists of two components: (1) Visual Prompt Selector, which predicts appropriate prompt types conditioned on the video and question, and (2) Spatio-Temporal Reasoner, optimized with RL under visual prompt guidance and object-aware grounding rewards that enforce object identity consistency and multi-region bounding-box overlap. Extensive experiments demonstrate that VisionCoach achieves state-of-the-art performance under comparable settings, across diverse video reasoning, video understanding, and temporal grounding benchmarks (V-STAR, VideoMME, World-Sense, VideoMMMU, PerceptionTest, and Charades-STA), while maintaining a single efficient inference pathway without external tools. Our results show that visual prompting during training improves grounded video reasoning, while self-distillation enables the model to internalize this ability without requiring prompts at inference time.
中文摘要 视频推理要求模型跨帧定位并跟踪与问题相关的证据。虽然带有可验证奖励的强化学习（RL）提高了准确性，但在推理过程中仍难以实现可靠的时空定位（grounding）。此外，改善定位通常依赖于扩大训练数据规模或推理时的感知工具，这会增加标注成本或计算成本。为应对这一挑战，我们提出了VisionCoach，一种输入自适应的强化学习框架，通过将视觉提示作为训练时的指导来提升时空定位能力。在强化学习训练中，视觉提示被有选择地应用于困难输入，以放大与问题相关的证据并抑制干扰因素。模型随后通过自蒸馏内化这些改进，从而在推理时无需视觉提示即可直接在原始视频上进行有据可依的推理。VisionCoach由两个部分组成：（1）视觉提示选择器（Visual Prompt Selector），根据视频和问题预测合适的提示类型；（2）时空推理器（Spatio-Temporal Reasoner），在视觉提示指导和对象感知定位奖励下通过强化学习优化，这些奖励强制保证对象身份一致性和多区域边界框重叠。大量实验表明，在可比设置下，VisionCoach在多种视频推理、视频理解和时间定位基准（V-STAR、VideoMME、World-Sense、VideoMMMU、PerceptionTest和Charades-STA）上达到了最先进的性能，同时保持单一高效的推理路径，无需外部工具。我们的结果显示，训练中的视觉提示提升了有据可依的视频推理能力，而自蒸馏使模型能够内化这一能力，在推理时无需提示。

DeFRiS: Silo-Cooperative IoT Applications Scheduling via Decentralized Federated Reinforcement Learning

DeFRiS：通过去中心化联邦强化学习实现的孤岛协作物联网应用调度

Authors: Zhiyu Wang, Mohammad Goudarzi, Mingming Gong, Rajkumar Buyya
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2603.14729
Pdf link: https://arxiv.org/pdf/2603.14729
Abstract Next-generation IoT applications increasingly span across autonomous administrative entities, necessitating silo-cooperative scheduling to leverage diverse computational resources while preserving data privacy. However, realizing efficient cooperation faces significant challenges arising from infrastructure heterogeneity, Non-IID workload shifts, and the inherent risks of adversarial environments. Existing approaches, relying predominantly on centralized coordination or independent learning, fail to address the incompatibility of state-action spaces across heterogeneous silos and lack robustness against malicious attacks. This paper proposes DeFRiS, a Decentralized Federated Reinforcement Learning framework for robust and scalable Silo-cooperative IoT application scheduling. DeFRiS integrates three synergistic innovations: (i) an action-space-agnostic policy utilizing candidate resource scoring to enable seamless knowledge transfer across heterogeneous silos; (ii) a silo-optimized local learning mechanism combining Generalized Advantage Estimation (GAE) with clipped policy updates to resolve sparse delayed reward challenges; and (iii) a Dual-Track Non-IID robust decentralized aggregation protocol leveraging gradient fingerprints for similarity-aware knowledge transfer and anomaly detection, and gradient tracking for optimization momentum. Extensive experiments on a distributed testbed with 20 heterogeneous silos and realistic IoT workloads demonstrate that DeFRiS significantly outperforms state-of-the-art baselines, reducing average response time by 6.4% and energy consumption by 7.2%, while lowering tail latency risk (CVaR$_{0.95}$) by 10.4% and achieving near-zero deadline violations. Furthermore, DeFRiS achieves over 3 times better performance retention as the system scales and over 8 times better stability in adversarial environments compared to the best-performing baseline.
中文摘要 下一代物联网应用越来越多地跨越自治的行政实体，因此需要孤岛协作调度，以利用多样化的计算资源，同时保护数据隐私。然而，实现高效协作面临基础设施异构性、非IID工作负载漂移以及对抗环境固有风险带来的重大挑战。现有方法主要依赖集中式协调或独立学习，未能解决异构孤岛之间状态-动作空间不兼容的问题，且缺乏对恶意攻击的鲁棒性。本文提出了DeFRiS，一种去中心化联邦强化学习框架，用于稳健且可扩展的孤岛协作物联网应用调度。DeFRiS整合了三项协同创新：（i）一种与动作空间无关的策略，利用候选资源评分实现跨异构孤岛的无缝知识迁移；（ii）一种面向孤岛优化的本地学习机制，将广义优势估计（GAE）与裁剪策略更新相结合，以应对稀疏延迟奖励的挑战；以及（iii）一种双轨非IID鲁棒去中心化聚合协议，利用梯度指纹进行相似性感知的知识迁移和异常检测，并利用梯度跟踪保持优化动量。在拥有20个异构孤岛和真实物联网工作负载的分布式测试平台上进行的大量实验表明，DeFRiS显著优于最先进的基线，平均响应时间减少6.4%，能耗降低7.2%，同时将尾部延迟风险（CVaR$_{0.95}$）降低10.4%，并实现近乎零的截止期限违规。此外，与表现最佳的基线相比，DeFRiS在系统扩展时的性能保持能力提升3倍以上，在对抗环境中的稳定性提升8倍以上。
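
The abstract above pairs Generalized Advantage Estimation with clipped policy updates. The sketch below shows the textbook forms of both pieces in plain Python; it is a generic PPO-style illustration, not the DeFRiS code, and the default values of `gamma`, `lam`, and `eps` are assumptions.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.

    `values` must hold len(rewards) + 1 entries; the last one is the
    bootstrap value of the state reached after the final reward.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample clipped objective: the pessimistic minimum of the raw
    and the clipped importance-weighted advantage."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)
```

With sparse, delayed rewards, a `lam` close to 1 propagates a late reward further back along the trajectory, at the cost of higher variance.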

Ego to World: Collaborative Spatial Reasoning in Embodied Systems via Reinforcement Learning

自我到世界：通过强化学习实现具身系统中的协作空间推理

Authors: Heng Zhou, Li Kang, Yiran Qin, Xiufeng Song, Ao Yu, Zilu Zhang, Haoming Song, Kaixin Xu, Yuchen Fan, Dongzhan Zhou, Xiaohong Liu, Ruimao Zhang, Philip Torr, Lei Bai, Zhenfei Yin
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.14811
Pdf link: https://arxiv.org/pdf/2603.14811
Abstract Understanding the world from distributed, partial viewpoints is a fundamental challenge for embodied multi-agent systems. Each agent perceives the environment through an ego-centric view that is often limited by occlusion and ambiguity. To study this problem, we introduce the Ego-to-World (E2W) benchmark, which evaluates a vision-language model's ability to fuse heterogeneous viewpoints across three tasks: (i) global counting, (ii) relational location reasoning, and (iii) action-oriented grasping that requires predicting view-specific image coordinates. To address this setting, we propose CoRL, a two-stage framework that combines Chain-of-Thought supervised fine-tuning with reinforcement learning using Group-Relative Policy Optimization. Its core component, the Cross-View Spatial Reward (CVSR), provides dense task-aligned feedback by linking reasoning steps to visual evidence, ensuring coherent cross-view entity resolution, and guiding the model toward correct final predictions. Experiments on E2W show that CoRL consistently surpasses strong proprietary and open-source baselines on both reasoning and perception-grounding metrics, while ablations further confirm the necessity of each CVSR component. Beyond that, CoRL generalizes to external spatial reasoning benchmarks and enables effective real-world multi-robot manipulation with calibrated multi-camera rigs, demonstrating cross-view localization and successful grasp-and-place execution. Together, E2W and CoRL provide a principled foundation for learning world-centric scene understanding from distributed, ego-centric observations, advancing collaborative embodied AI.
中文摘要 从分布式、部分视角理解世界是具身多智能体系统面临的根本挑战。每个代理通过以自我为中心的视角感知环境，这种视角常常受限于遮蔽和模糊性。为研究该问题，我们引入了自我到世界（E2W）基准测试，该基准评估视觉语言模型在三项任务中融合异质视角的能力：（i）全局计数，（ii）关系位置推理，以及（iii）需要预测视角特定图像坐标的动作导向抓取。为应对这一设定，我们提出了CoRL，这是一个两阶段框架，结合了思维链监督微调与使用群体相对策略优化的强化学习。其核心组件——交叉视图空间奖励（CVSR），通过将推理步骤与视觉证据连接，提供密集且符合任务的反馈，确保交叉视图实体的连贯解析，并引导模型朝向正确的最终预测。E2W上的实验表明，CoRL在推理和感知基础指标上始终超越强专有和开源基线，而消融进一步证实了每个CVSR组件的必要性。除此之外，CoRL还推广到外部空间推理基准，并通过校准的多机位设备实现现实世界中多机器人的有效操作，展示了交叉视角定位和成功的抓取定位执行。E2W与CoRL共同为学习以世界为中心的场景理解提供了原则基础，从分布式、以自我为中心的观察中学习，推动协作式具身人工智能的发展。
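
The abstract above trains with Group-Relative Policy Optimization. A minimal sketch of the group-relative advantage step follows, assuming the common formulation in which each sampled response's reward is standardized against the other responses to the same prompt. The function name and the `eps` stabilizer are illustrative, and the paper's CVSR reward itself is not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to a prompt.

    Responses that beat the group average get a positive advantage,
    the rest a negative one; no learned value function is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every response in a group receives the same reward, all advantages are zero and that group contributes no gradient signal.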

Shopping Companion: A Memory-Augmented LLM Agent for Real-World E-Commerce Tasks

购物伴侣：一个用于现实电商任务的记忆增强大型语言模型代理

Authors: Zijian Yu, Kejun Xiao, Huaipeng Zhao, Tao Luo, Xiaoyi Zeng
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.14864
Pdf link: https://arxiv.org/pdf/2603.14864
Abstract In e-commerce, LLM agents show promise for shopping tasks such as recommendations, budgeting, and bundle deals, where accurately capturing user preferences from long-term conversations is critical. However, two challenges hinder realizing this potential: (1) the absence of benchmarks for evaluating long-term preference-aware shopping tasks, and (2) the lack of end-to-end optimization due to existing designs that treat preference identification and shopping assistance as separate components. In this paper, we introduce a novel benchmark with a long-term memory setup, spanning two shopping tasks over 1.2 million real-world products, and propose Shopping Companion, a unified framework that jointly tackles memory retrieval and shopping assistance while supporting user intervention. To train such capabilities, we develop a dual-reward reinforcement learning strategy with tool-wise rewards to handle the sparse and discontinuous rewards inherent in multi-turn interactions. Experimental results demonstrate that even state-of-the-art models (such as GPT-5) achieve success rates under 70% on our benchmark, highlighting the significant challenges in this domain. Notably, our lightweight LLM, trained with Shopping Companion, consistently outperforms strong baselines, achieving better preference capture and task performance, which validates the effectiveness of our unified design.
中文摘要 在电子商务领域，LLM代理在推荐、预算和套餐等购物任务中展现出潜力，准确捕捉长期对话中的用户偏好至关重要。然而，有两个挑战阻碍了实现这一潜力：（1）缺乏评估长期偏好感知购物任务的基准，（2）由于现有设计将偏好识别和购物辅助视为独立组成部分，缺乏端到端优化。本文介绍了一个新颖的基准测试，采用长期记忆构建，涵盖两个购物任务，涵盖120万个真实世界产品，并提出了Shopping Companion，这是一个统一框架，共同解决记忆检索和购物辅助，同时支持用户干预。为训练此类能力，我们开发了双重奖励强化学习策略，并以工具为单位奖励，以应对多回合互动中固有的稀疏和不连续奖励。实验结果显示，即使是最先进的模型（如GPT-5）在我们的基准测试中成功率也低于70%，凸显了该领域的重大挑战。值得注意的是，我们用Shopping Companion训练的轻量级大型语言模型，持续优于强基线，实现了更好的偏好捕获和任务表现，验证了我们统一设计的有效性。

Sample-Efficient Hypergradient Estimation for Decentralized Bi-Level Reinforcement Learning

去中心化双层强化学习的样本高效超梯度估计

Authors: Mikoto Kudo, Takumi Tanabe, Akifumi Wachi, Youhei Akimoto
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2603.14867
Pdf link: https://arxiv.org/pdf/2603.14867
Abstract Many strategic decision-making problems, such as environment design for warehouse robots, can be naturally formulated as bi-level reinforcement learning (RL), where a leader agent optimizes its objective while a follower solves a Markov decision process (MDP) conditioned on the leader's decisions. In many situations, a fundamental challenge arises when the leader cannot intervene in the follower's optimization process; it can only observe the optimization outcome. We address this decentralized setting by deriving the hypergradient of the leader's objective, i.e., the gradient of the leader's strategy that accounts for changes in the follower's optimal policy. Unlike prior hypergradient-based methods that require extensive data for repeated state visits or rely on gradient estimators whose complexity can increase substantially with the high-dimensional leader's decision space, we leverage the Boltzmann covariance trick to derive an alternative hypergradient formulation. This enables efficient hypergradient estimation solely from interaction samples, even when the leader's decision space is high-dimensional. Additionally, to our knowledge, this is the first method that enables hypergradient-based optimization for 2-player Markov games in decentralized settings. Experiments highlight the impact of hypergradient updates and demonstrate our method's effectiveness in both discrete and continuous state tasks.
中文摘要 许多战略决策问题，如仓库机器人的环境设计，可以自然地被表述为双级强化学习（RL），其中领导者代理优化目标，而跟随者则解决基于领导者决策的马尔可夫决策过程（MDP）。在许多情况下，领导者无法干预跟随者的优化过程，就会出现根本性的挑战;它只能观察优化结果。我们通过推导领导者目标的超梯度来解决这种去中心化环境，即领导者策略中考虑追随者最优政策变化的梯度。与以往需要大量数据进行状态访问或依赖梯度估计器（其复杂度随高维领导者决策空间增加）不同，我们利用玻尔兹曼协方差技巧推导出另一种超梯度表述。这使得仅从交互样本中高效估计超梯度成为可能，即使领导者的决策空间是高维的。此外，据我们所知，这是首个在去中心化环境中实现双人马尔可夫游戏基于超梯度优化的方法。实验强调了超梯度更新的影响，并展示了我们方法在离散和连续状态任务中的有效性。

ViSA: Visited-State Augmentation for Generalized Goal-Space Contrastive Reinforcement Learning

ViSA：广义目标空间对比强化学习的访问状态增强

Authors: Issa Nakamura, Tomoya Yamanokuchi, Yuki Kadokawa, Jia Qu, Shun Otsub, Ken Miyamoto, Shotaro Miwa, Takamitsu Matsubara
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.14887
Pdf link: https://arxiv.org/pdf/2603.14887
Abstract Goal-Conditioned Reinforcement Learning (GCRL) is a framework for learning a policy that can reach arbitrarily given goals. In particular, Contrastive Reinforcement Learning (CRL) provides a framework for policy updates using an approximation of the value function estimated via contrastive learning, achieving higher sample efficiency compared to conventional methods. However, since CRL treats the visited state as a pseudo-goal during learning, it can accurately estimate the value function only for limited goals. To address this issue, we propose a novel data augmentation approach for CRL called ViSA (Visited-State Augmentation). ViSA consists of two components: 1) generating augmented state samples, with the aim of augmenting hard-to-visit state samples during on-policy exploration, and 2) learning a consistent embedding space, which uses an augmented state as auxiliary information to regularize the embedding space by reformulating the objective function of the embedding space based on mutual information. We evaluate ViSA in simulation and real-world robotic tasks and show improved goal-space generalization, which permits accurate value estimation for hard-to-visit goals. Further details can be found on the project page: this https URL_ViSA/
中文摘要 目标条件强化学习（GCRL）是一种用于学习能够达到任意给定目标的策略的框架。特别是，对比强化学习（CRL）提供了一个利用对比学习估计的价值函数近似进行策略更新的框架，相较于传统方法实现了更高的样本效率。然而，由于CRL在学习过程中将已访问状态视为伪目标，它只能对有限的目标准确估计价值函数。为解决这一问题，我们提出了一种名为ViSA（访问状态增强）的新型CRL数据增强方法。ViSA包含两个组成部分：1）生成增强状态样本，旨在增强在策略探索中难以访问的状态样本；2）学习一致的嵌入空间，即利用增强状态作为辅助信息，通过基于互信息重新表述嵌入空间的目标函数来正则化嵌入空间。我们在仿真和真实机器人任务中评估了ViSA，展示了改进的目标空间泛化，从而能够对难以访问的目标进行准确的价值估计。更多详情可见项目页面：this https URL_ViSA/

PerlAD: Towards Enhanced Closed-loop End-to-end Autonomous Driving with Pseudo-simulation-based Reinforcement Learning

PerlAD：迈向基于伪仿真的强化学习的增强闭环端到端自动驾驶

Authors: Yinfeng Gao, Qichao Zhang, Deqing Liu, Zhongpu Xia, Guang Li, Kun Ma, Guang Chen, Hangjun Ye, Long Chen, Da-Wei Ding, Dongbin Zhao
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.14908
Pdf link: https://arxiv.org/pdf/2603.14908
Abstract End-to-end autonomous driving policies based on Imitation Learning (IL) often struggle in closed-loop execution due to the misalignment between inadequate open-loop training objectives and real driving requirements. While Reinforcement Learning (RL) offers a solution by directly optimizing driving goals via reward signals, the rendering-based training environments introduce the rendering gap and are inefficient due to high computational costs. To overcome these challenges, we present a novel Pseudo-simulation-based RL method for closed-loop end-to-end autonomous driving, PerlAD. Based on offline datasets, PerlAD constructs a pseudo-simulation that operates in vector space, enabling efficient, rendering-free trial-and-error training. To bridge the gap between static datasets and dynamic closed-loop environments, PerlAD introduces a prediction world model that generates reactive agent trajectories conditioned on the ego vehicle's plan. Furthermore, to facilitate efficient planning, PerlAD utilizes a hierarchical decoupled planner that combines IL for lateral path generation and RL for longitudinal speed optimization. Comprehensive experimental results demonstrate that PerlAD achieves state-of-the-art performance on the Bench2Drive benchmark, surpassing the previous E2E RL method by 10.29% in Driving Score without requiring expensive online interactions. Additional evaluations on the DOS benchmark further confirm its reliability in handling safety-critical occlusion scenarios.
中文摘要 基于模仿学习（IL）的端到端自动驾驶政策常因开环培训目标与实际驾驶需求不匹配而在闭环执行中遇到困难。虽然强化学习（RL）通过奖励信号直接优化驱动目标提供了解决方案，但基于渲染的训练环境引入了渲染差距，且由于计算成本高，效率低下。为克服这些挑战，我们提出了一种基于伪仿真的新式强化学习方法，用于闭环端到端自动驾驶，即PerlAD。基于离线数据集，PerlAD构建了一个在向量空间中工作的伪仿真，实现高效且无渲染的试错训练。为了弥合静态数据集与动态闭环环境之间的差距，PerlAD引入了一种预测世界模型，能够根据自我载体的计划生成反应性代理轨迹。此外，为促进高效规划，PerlAD采用分层解耦规划器，结合IL进行横向路径生成和强化学习进行纵向速度优化。全面的实验结果表明，PerlAD在Bench2Drive基准测试中实现了最先进的性能，在驾驶评分方面比之前的E2E RL方法高出10.29%，且无需昂贵的在线交互。对DOS基准测试的进一步评估进一步确认其在处理安全关键闭塞场景中的可靠性。

EditHF-1M: A Million-Scale Rich Human Preference Feedback for Image Editing

EditHF-1M：用于图像编辑的百万规模丰富人类偏好反馈

Authors: Zitong Xu, Huiyu Duan, Zhongpeng Ji, Xinyun Zhang, Yutao Liu, Xiongkuo Min, Ke Gu, Jian Zhang, Shusong Xu, Jinwei Chen, Bo Li, Guangtao Zhai
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Arxiv link: https://arxiv.org/abs/2603.14916
Pdf link: https://arxiv.org/pdf/2603.14916
Abstract Recent text-guided image editing (TIE) models have achieved remarkable progress, yet many edited images still suffer from issues such as artifacts, unexpected edits, and unaesthetic content. Although some benchmarks and methods have been proposed for evaluating edited images, scalable evaluation models are still lacking, which limits the development of human feedback reward models for image editing. To address these challenges, we first introduce EditHF-1M, a million-scale image editing dataset with over 29M human preference pairs and 148K human mean opinion ratings, both evaluated along three dimensions, i.e., visual quality, instruction alignment, and attribute preservation. Based on EditHF-1M, we propose EditHF, a multimodal large language model (MLLM) based evaluation model, to provide human-aligned feedback for image editing. Finally, we introduce EditHF-Reward, which utilizes EditHF as the reward signal to optimize text-guided image editing models through reinforcement learning. Extensive experiments show that EditHF achieves superior alignment with human preferences and demonstrates strong generalization on other datasets. Furthermore, we fine-tune Qwen-Image-Edit using EditHF-Reward, achieving significant performance improvements, which demonstrates the ability of EditHF to serve as a reward model to scale up image editing. Both the dataset and code will be released in our GitHub repository: this https URL.
中文摘要 近年来的文本引导图像编辑（TIE）模型取得了显著进步，但许多编辑后的图像仍存在伪影、意外编辑和内容不美观等问题。尽管已有一些基准和方法被提出用于评估编辑图像，但可扩展的评估模型仍然缺乏，这限制了面向图像编辑的人类反馈奖励模型的发展。为应对这些挑战，我们首先提出了EditHF-1M，这是一个百万规模的图像编辑数据集，包含超过2900万对人类偏好对和14.8万个人类平均意见评分，二者均从三个维度进行评估，即视觉质量、指令对齐和属性保持。基于EditHF-1M，我们提出了EditHF，这是一种基于多模态大型语言模型（MLLM）的评估模型，用于为图像编辑提供与人类对齐的反馈。最后，我们提出了EditHF-Reward，它利用EditHF作为奖励信号，通过强化学习优化文本引导图像编辑模型。大量实验表明，EditHF与人类偏好的一致性更高，并在其他数据集上表现出很强的泛化能力。此外，我们利用EditHF-Reward微调Qwen-Image-Edit，取得了显著的性能提升，展示了EditHF作为奖励模型推动图像编辑规模化的能力。数据集和代码都将发布到我们的GitHub仓库：this https URL。

CyCLeGen: Cycle-Consistent Layout Prediction and Image Generation in Vision Foundation Models

CyCLeGen：视觉基础模型中的循环一致布局预测与图像生成

Authors: Xiaojun Shan, Haoyu Shen, Yucheng Mao, Xiang Zhang, Abhay Anand, Bingnan Li, Haiyang Xu, Zhuowen Tu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.14957
Pdf link: https://arxiv.org/pdf/2603.14957
Abstract We present CyCLeGen, a unified vision-language foundation model capable of both image understanding and image generation within a single autoregressive framework. Unlike existing vision models that depend on separate modules for perception and synthesis, CyCLeGen adopts a fully integrated architecture that enforces cycle-consistent learning through image->layout->image and layout->image->layout generation loops. This unified formulation introduces two key advantages: introspection, enabling the model to reason about its own generations, and data efficiency, allowing self-improvement via synthetic supervision under a reinforcement learning objective guided by cycle consistency. Extensive experiments show that CyCLeGen achieves significant gains across diverse image understanding and generation benchmarks, highlighting the potential of unified vision-language foundation models.
中文摘要 我们提出了CyCLeGen，一个统一的视觉-语言基础模型，能够在单一自回归框架内同时实现图像理解和图像生成。与依赖独立模块分别进行感知和合成的现有视觉模型不同，CyCLeGen采用完全集成的架构，通过图像->布局->图像和布局->图像->布局两种生成循环强制实现循环一致性学习。这种统一的表述带来了两个关键优势：内省，使模型能够对自身的生成结果进行推理；以及数据效率，允许在由循环一致性引导的强化学习目标下通过合成监督实现自我改进。大量实验表明，CyCLeGen在多样的图像理解和生成基准上取得了显著提升，凸显了统一视觉-语言基础模型的潜力。

Molecular Identifier Visual Prompt and Verifiable Reinforcement Learning for Chemical Reaction Diagram Parsing

分子标识符视觉提示与可验证强化学习用于化学反应图解析

Authors: Jiahe Song, Chuang Wang, Yinfan Wang, Hao Zheng, Rui Nie, Bowen Jiang, Xingjian Wei, Junyuan Gao, Yubin Wang, Bin Wang, Lijun Wu, Jiang Wu, Qian Yu, Conghui He
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.15011
Pdf link: https://arxiv.org/pdf/2603.15011
Abstract Reaction diagram parsing (RxnDP) is critical for extracting chemical synthesis information from literature. Although recent Vision-Language Models (VLMs) have emerged as a promising paradigm to automate this complex visual reasoning task, their application is fundamentally bottlenecked by the inability to align visual chemical entities with pre-trained knowledge, alongside the inherent discrepancy between token-level training and reaction-level evaluation. To address these dual challenges, this work enhances VLM-based RxnDP from two complementary perspectives: prompting representation and learning paradigms. First, we propose Identifier as Visual Prompting (IdtVP), which leverages naturally occurring molecule identifiers (e.g., bold numerals like 1a) to activate the chemical knowledge acquired during VLM pre-training. IdtVP enables powerful zero-shot and out-of-distribution capabilities, outperforming existing prompting strategies. Second, to further optimize performance within fine-tuning paradigms, we introduce Re3-DAPO, a reinforcement learning algorithm that leverages verifiable rewards to directly optimize reaction-level metrics, thereby achieving consistent gains over standard supervised fine-tuning. Additionally, we release the ScannedRxn benchmark, comprising scanned historical reaction diagrams with real-world artifacts, to rigorously assess model robustness and out-of-distribution ability. Our contributions advance the accuracy and generalization of VLM-based reaction diagram parsing. We will release data, models, and code on GitHub.
中文摘要 反应图解析（RxnDP）对于从文献中提取化学合成信息至关重要。尽管近期视觉语言模型（VLMs）已成为自动化这一复杂视觉推理任务的有前景范式，但其应用从根本上受限于两点：无法将视觉化学实体与预训练知识对齐，以及token级训练与反应级评估之间的固有差异。为应对这双重挑战，本研究从提示表示和学习范式两个互补视角提升基于VLM的RxnDP。首先，我们提出了标识符作为视觉提示（IdtVP），它利用自然存在的分子标识符（例如加粗编号如1a）来激活VLM预训练期间获得的化学知识。IdtVP带来了强大的零样本和分布外能力，优于现有的提示策略。其次，为了在微调范式中进一步优化性能，我们引入了Re3-DAPO，一种利用可验证奖励直接优化反应级指标的强化学习算法，从而相较标准监督微调取得一致的提升。此外，我们还发布了ScannedRxn基准，其中包含带有真实世界伪影的扫描版历史反应图，用于严格评估模型的鲁棒性和分布外能力。我们的贡献提升了基于VLM的反应图解析的准确性和泛化能力。我们将在GitHub上发布数据、模型和代码。

CycleRL: Sim-to-Real Deep Reinforcement Learning for Robust Autonomous Bicycle Control

CycleRL：用于稳健自动自行车控制的模拟到真实深度强化学习

Authors: Gelu Liu, Teng Wang, Zhijie Wu, Junliang Wu, Songyuan Li, Xiangwei Zhu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.15013
Pdf link: https://arxiv.org/pdf/2603.15013
Abstract Autonomous bicycles offer a promising agile solution for urban mobility and last-mile logistics; however, conventional control strategies often struggle with their underactuated nonlinear dynamics, suffering from sensitivity to model mismatches and limited adaptability to real-world uncertainties. To address this, this paper presents CycleRL, the first sim-to-real deep reinforcement learning framework designed for robust autonomous bicycle control. Our approach trains an end-to-end neural control policy within the high-fidelity NVIDIA Isaac Sim environment, leveraging Proximal Policy Optimization (PPO) to circumvent the need for an explicit dynamics model. The framework features a composite reward function tailored for concurrent balance maintenance, velocity tracking, and steering control. Crucially, systematic domain randomization is employed to bridge the simulation-to-reality gap and facilitate direct transfer. In simulation, CycleRL achieves strong performance, including a 99.90% balance success rate, a low steering tracking error of 1.15°, and a velocity tracking error of 0.18 m/s. These quantitative results, coupled with successful hardware transfer, validate deep reinforcement learning (DRL) as an effective paradigm for autonomous bicycle control, offering superior adaptability over traditional methods. Video demonstrations are available at this https URL.
中文摘要 自动驾驶自行车为城市出行和最后一公里物流提供了有前景的敏捷解决方案；然而，传统控制策略常常难以应对其欠驱动的非线性动力学，对模型失配敏感，且对现实世界不确定性的适应性有限。为此，本文提出了CycleRL，这是首个为稳健的自行车自主控制设计的仿真到现实深度强化学习框架。我们的方法在高保真NVIDIA Isaac Sim环境中训练端到端神经控制策略，利用近端策略优化（PPO）避免了对显式动力学模型的需求。该框架采用复合奖励函数，专为同时实现平衡保持、速度跟踪和转向控制而设计。关键的是，系统化的域随机化被用来弥合仿真与现实的差距，促进直接迁移。在仿真中，CycleRL取得了出色的性能，包括99.90%的平衡成功率、1.15°的低转向跟踪误差以及0.18 m/s的速度跟踪误差。这些定量结果，加上成功的硬件迁移，验证了深度强化学习（DRL）是自行车自主控制的有效范式，具备优于传统方法的适应性。视频演示可在此 https 网址观看。
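
Domain randomization, as mentioned in the abstract above, typically means resampling simulator parameters at the start of every training episode so the policy cannot overfit to one set of physics. The sketch below is a generic, simulator-independent illustration: the parameter names and ranges are hypothetical, since the abstract does not list which quantities CycleRL randomizes, and no Isaac Sim API is used.

```python
import random

# Hypothetical parameters and ranges, chosen only for illustration.
PARAM_RANGES = {
    "payload_mass_kg": (0.0, 5.0),
    "tire_friction": (0.6, 1.2),
    "actuator_delay_s": (0.0, 0.04),
}


def sample_domain(rng, ranges=PARAM_RANGES):
    """Draw one set of physical parameters uniformly from the ranges."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}


# A seeded generator keeps the sampled episodes reproducible.
rng = random.Random(0)
episode_params = [sample_domain(rng) for _ in range(3)]
```

Each dictionary in `episode_params` would be applied to the simulator before the corresponding episode is rolled out.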

Interference-Aware K-Step Reachable Communication in Multi-Agent Reinforcement Learning

多智能体强化学习中的干扰感知K步可达通信

Authors: Ziyu Cheng, Jinsheng Ren, Zhouxian Jiang, Chenzhihang Li, Rongye Shi, Bin Liang, Jun Yang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.15054
Pdf link: https://arxiv.org/pdf/2603.15054
Abstract Effective communication is pivotal for addressing complex collaborative tasks in multi-agent reinforcement learning (MARL). Yet, limited communication bandwidth and dynamic, intricate environmental topologies present significant challenges in identifying high-value communication partners. Agents must consequently select collaborators under uncertainty, lacking a priori knowledge of which partners can deliver task-critical information. To this end, we propose Interference-Aware K-Step Reachable Communication (IA-KRC), a novel framework that enhances cooperation via two core components: (1) a K-Step reachability protocol that confines message passing to physically accessible neighbors, and (2) an interference-prediction module that optimizes partner choice by minimizing interference while maximizing utility. Compared to existing methods, IA-KRC enables substantially more persistent and efficient cooperation despite environmental interference. Comprehensive evaluations confirm that IA-KRC achieves superior performance compared to state-of-the-art baselines, while demonstrating enhanced robustness and scalability in complex topological and highly dynamic multi-agent scenarios.
中文摘要 有效的通信对于解决多智能体强化学习（MARL）中复杂的协作任务至关重要。然而，有限的通信带宽和动态复杂的环境拓扑给识别高价值通信伙伴带来了重大挑战。因此，智能体必须在不确定性下选择协作者，因为它们事先并不知道哪些伙伴能够提供任务关键信息。为此，我们提出了干扰感知K步可达通信（IA-KRC），这是一种通过两个核心组件增强协作的新框架：（1）将消息传递限制在物理可达邻居范围内的K步可达协议，以及（2）通过在最小化干扰的同时最大化效用来优化伙伴选择的干扰预测模块。与现有方法相比，IA-KRC在存在环境干扰的情况下仍能实现明显更持久且更高效的协作。综合评估证实，IA-KRC的性能优于最先进的基线，同时在复杂拓扑和高度动态的多智能体场景中展现出更强的鲁棒性和可扩展性。

Writer-R1: Enhancing Generative Writing in LLMs via Memory-augmented Replay Policy Optimization

Writer-R1：通过内存增强重放策略优化提升LLM中的生成式写作

Authors: Jihao Zhao, Shuaishuai Zu, Zhiyuan Ji, Chunlai Zhou, Biao Qin
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.15061
Pdf link: https://arxiv.org/pdf/2603.15061
Abstract As a typical open-ended generation task, creative writing lacks verifiable reference answers, which has long constrained reward modeling and automatic evaluation due to high human annotation costs, evaluative bias, and coarse feedback signals. To address these challenges, this paper first designs a multi-agent collaborative workflow based on Grounded Theory, performing dimensional decomposition and hierarchical induction of the problem to dynamically produce interpretable and reusable fine-grained criteria. Furthermore, we propose the Memory-augmented Replay Policy Optimization (MRPO) algorithm: on the one hand, without additional training, MRPO guides models to engage in self-reflection based on dynamic criteria, enabling controlled iterative improvement; on the other hand, we adopt the training paradigm that combines supervised fine-tuning with reinforcement learning to convert evaluation criteria into reward signals, achieving end-to-end optimization. Experimental results demonstrate that the automatically constructed criteria achieve performance gains comparable to human annotations. Writer-R1-4B models trained with this approach outperform baselines across multiple creative writing tasks and surpass some 100B+ parameter open-source models.
中文摘要 作为典型的开放式生成任务，创意写作缺乏可验证的参考答案，长期以来由于高人工注释成本、评估偏倚和粗糙反馈信号，限制了奖励建模和自动评估。为应对这些挑战，本文首先基于基础理论设计了一个多代理协作工作流程，通过维度分解和层级归纳问题，动态生成可解释且可重用的细粒度准则。此外，我们提出了内存增强重放策略优化（MRPO）算法：一方面，无需额外训练即可引导模型基于动态标准进行自我反思，实现受控迭代改进;另一方面，我们采用结合监督微调与强化学习的训练范式，将评估标准转化为奖励信号，实现端到端优化。实验结果表明，自动构建的标准实现的性能提升可与人工注释相当。采用此方法训练的Writer-R1-4B模型在多项创意写作任务中表现优于基线，甚至超过一些100B+参数的开源模型。

HALO: Closing Sim-to-Real Gap for Heavy-loaded Humanoid Agile Motion Skills via Differentiable Simulation

HALO：通过可微分模拟缩小重载人形敏捷运动技能的模拟与现实差距

Authors: Xingyi Wang, Chenyun Zhang, Weiji Xie, Chao Yu, Wei Song, Chenjia Bai, Shiqiang Zhu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.15084
Pdf link: https://arxiv.org/pdf/2603.15084
Abstract Humanoid robots deployed in real-world scenarios often need to carry unknown payloads, which introduce significant mismatch and degrade the effectiveness of simulation-to-reality reinforcement learning methods. To address this challenge, we propose a two-stage gradient-based system identification framework built on the differentiable simulator MuJoCo XLA. The first stage calibrates the nominal robot model using real-world data to reduce intrinsic sim-to-real discrepancies, while the second stage further identifies the mass distribution of the unknown payload. By explicitly reducing structured model bias prior to policy training, our approach enables zero-shot transfer of reinforcement learning policies to hardware under heavy-load conditions. Extensive simulation and real-world experiments demonstrate more precise parameter identification, improved motion tracking accuracy, and substantially enhanced agility and robustness compared to existing baselines. Project Page: this https URL
中文摘要 在现实场景中部署的人形机器人通常需要携带未知的有效载荷，这会带来显著的模型失配并降低模拟到现实强化学习方法的有效性。为应对这一挑战，我们提出了基于可微分模拟器MuJoCo XLA的两阶段梯度式系统辨识框架。第一阶段利用真实世界数据校准标称机器人模型，以减少模拟与现实之间的固有差异，第二阶段进一步辨识未知有效载荷的质量分布。通过在策略训练前显式减少结构化模型偏差，我们的方法使强化学习策略在重载条件下能够零样本迁移到硬件。大量模拟和现实实验表明，与现有基线相比，该方法的参数辨识更精确、运动跟踪精度更高，敏捷性和鲁棒性也显著增强。项目页面：此 https URL
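
Gradient-based system identification, as named in the HALO abstract, can be shown on a toy problem: recovering a single unknown mass from force and acceleration measurements by descending the squared prediction error. This sketch is only an illustration of the principle. It uses a hand-derived gradient on the model f = m * a instead of a differentiable simulator, and the function name and parameters are hypothetical.

```python
def identify_mass(accelerations, forces, mass_init=1.0,
                  learning_rate=0.05, steps=200):
    """Estimate mass m in f = m * a by gradient descent.

    Loss: mean over samples of (m * a - f) ** 2.
    Gradient w.r.t. m: mean of 2 * a * (m * a - f).
    """
    mass = mass_init
    n = len(accelerations)
    for _ in range(steps):
        grad = sum(2.0 * a * (mass * a - f)
                   for a, f in zip(accelerations, forces)) / n
        mass -= learning_rate * grad
    return mass


# Synthetic "real-world" data generated with a true mass of 2.5
accels = [1.0, 2.0, 3.0]
measured_forces = [2.5 * a for a in accels]
estimated_mass = identify_mass(accels, measured_forces)
```

A differentiable simulator plays the role of the closed-form gradient here: it supplies derivatives of a trajectory-matching loss with respect to physical parameters such as link or payload masses.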

Sampling-guided exploration of active feature selection policies

采样引导探索主动特征选择策略

Authors: Gabriel Bernardino, Anders Jonsson, Patrick Clarysse, Nicolas Duchateau
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.15110
Pdf link: https://arxiv.org/pdf/2603.15110
Abstract Determining the most appropriate features for machine learning predictive models is challenging regarding performance and feature acquisition costs. In particular, global feature choice is limited given that some features will only benefit a subset of instances. In previous work, we proposed a reinforcement learning approach to sequentially recommend which modality to acquire next to reach the best information/cost ratio, based on the instance-specific information already acquired. We formulated the problem as a Markov Decision Process where the state's dimensionality changes during the episode, avoiding data imputation, contrary to existing works. However, this only allowed processing a small number of features, as all possible combinations of features were considered. Here, we address these limitations with two contributions: 1) we expand our framework to larger datasets with a heuristic-based strategy that focuses on the most promising feature combinations, and 2) we introduce a post-fit regularisation strategy that reduces the number of different feature combinations, leading to compact sequences of decisions. We tested our method on four binary classification datasets (one involving high-dimensional variables), the largest of which had 56 features and 4500 samples. We obtained better performance than state-of-the-art methods, both in terms of accuracy and policy complexity.
中文摘要 在兼顾性能和特征获取成本的前提下，为机器学习预测模型确定最合适的特征具有挑战性。特别是，由于某些特征只对部分实例有益，全局特征选择存在局限。在之前的研究中，我们提出了一种强化学习方法，基于已获得的实例特定信息，顺序推荐下一步获取哪种模态，以达到最佳信息/成本比。我们将问题表述为马尔可夫决策过程，其中状态维度在回合过程中变化，从而避免了数据插补，这与现有研究不同。然而，由于需要考虑所有可能的特征组合，这种方法只能处理少量特征。在这里，我们通过两项贡献解决了这些局限性：1）采用聚焦于最有前景特征组合的启发式策略，将框架扩展到更大的数据集；2）引入拟合后正则化策略，减少不同特征组合的数量，从而得到紧凑的决策序列。我们在四个二元分类数据集（其中一个涉及高维变量）上测试了我们的方法，其中最大的数据集包含56个特征和4500个样本。我们在准确性和策略复杂度方面都优于最先进方法。

MMKU-Bench: A Multimodal Update Benchmark for Diverse Visual Knowledge

MMKU-Bench：面向多样化视觉知识的多模态更新基准

Authors: Baochen Fu, Yuntao Du, Cheng Chang, Baihao Jin, Wenzhi Deng, Muhao Xu, Hongmei Yan, Weiye Song, Yi Wan
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.15117
Pdf link: https://arxiv.org/pdf/2603.15117
Abstract As real-world knowledge continues to evolve, it becomes increasingly difficult for the parametric knowledge acquired by multimodal models during pretraining to remain consistent with it. Existing research on multimodal knowledge updating focuses only on learning previously unknown knowledge, while overlooking the need to update knowledge that the model has already mastered but that later changes; moreover, evaluation is limited to the same modality, lacking a systematic analysis of cross-modal consistency. To address these issues, this paper proposes MMKU-Bench, a comprehensive evaluation benchmark for multimodal knowledge updating, which contains over 25k knowledge instances and more than 49k images, covering two scenarios, updated knowledge and unknown knowledge, thereby enabling comparative analysis of learning across different knowledge types. On this benchmark, we evaluate a variety of representative approaches, including supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and knowledge editing (KE). Experimental results show that SFT and RLHF are prone to catastrophic forgetting, while KE better preserves general capabilities but exhibits clear limitations in continual updating. Overall, MMKU-Bench provides a reliable and comprehensive evaluation benchmark for multimodal knowledge updating, advancing progress in this field.
中文摘要 随着现实世界知识的不断发展，多模态模型在预训练期间获得的参数化知识越来越难以与现实知识保持一致。现有关于多模态知识更新的研究仅关注学习之前未知的知识，忽视了更新模型已掌握但后来发生变化的知识；此外，评估仅限于同一模态，缺乏对跨模态一致性的系统分析。为解决这些问题，本文提出了MMKU-Bench，这是一个多模态知识更新的综合评估基准，包含超过2.5万个知识实例和超过4.9万张图像，涵盖更新知识和未知知识两种场景，从而实现不同知识类型学习的比较分析。基于该基准，我们评估了多种代表性方法，包括监督微调（SFT）、基于人类反馈的强化学习（RLHF）和知识编辑（KE）。实验结果显示，SFT和RLHF容易出现灾难性遗忘，而KE能更好地保留通用能力，但在持续更新方面存在明显局限。总体而言，MMKU-Bench为多模态知识更新提供了一个可靠且全面的评估基准，推动了该领域的进步。

Safe Flow Q-Learning: Offline Safe Reinforcement Learning with Reachability-Based Flow Policies

安全流Q-Learning：基于可达性流策略的离线安全强化学习

Authors: Mumuksh Tayal, Manan Tayal, Ravi Prakash
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.15136
Pdf link: https://arxiv.org/pdf/2603.15136
Abstract Offline safe reinforcement learning (RL) seeks reward-maximizing policies from static datasets under strict safety constraints. Existing methods often rely on soft expected-cost objectives or iterative generative inference, which can be insufficient for safety-critical real-time control. We propose Safe Flow Q-Learning (SafeFQL), which extends FQL to safe offline RL by combining a Hamilton-Jacobi reachability-inspired safety value function with an efficient one-step flow policy. SafeFQL learns the safety value via a self-consistency Bellman recursion, trains a flow policy by behavioral cloning, and distills it into a one-step actor for reward-maximizing safe action selection without rejection sampling at deployment. To account for finite-data approximation error in the learned safety boundary, we add a conformal prediction calibration step that adjusts the safety threshold and provides finite-sample probabilistic safety coverage. Empirically, SafeFQL trades modestly higher offline training cost for substantially lower inference latency than diffusion-style safe generative baselines, which is advantageous for real-time safety-critical deployment. Across boat navigation and Safety Gymnasium MuJoCo tasks, SafeFQL matches or exceeds prior offline safe RL performance while substantially reducing constraint violations.
中文摘要 离线安全强化学习（RL）在严格安全约束下，从静态数据集中寻求最大化奖励的策略。现有方法通常依赖软预期成本目标或迭代生成推理，这些方法可能不足以实现安全关键的实时控制。我们提出了安全流Q学习（SafeFQL），它通过结合Hamilton-Jacobi可达性启发的安全价值函数与高效的一步流策略，将FQL扩展到安全离线强化学习。SafeFQL 通过自一致性的 Bellman 递归学习安全值，通过行为克隆训练流策略，并将其提炼为一个单步演员，用于在部署时无拒绝采样的情况下实现奖励最大化的安全动作选择。为了考虑学习安全边界中的有限数据近似误差，我们增加了一个共形预测校准步骤，调整安全阈值并提供有限样本概率安全覆盖。从经验角度看，SafeFQL以略高的离线训练成本换取显著更低的推理延迟，这对于实时安全关键部署具有优势。在船舶导航和安全健身房MuJoCo任务中，SafeFQL能够匹配甚至超越以往的离线安全强化学习表现，同时大幅减少约束违规。
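
The conformal calibration step mentioned in the SafeFQL abstract can be illustrated with the standard split-conformal quantile: given n held-out nonconformity scores and a miscoverage level alpha, take the ceil((n + 1)(1 - alpha))-th smallest score as the margin. This is a generic sketch of split conformal prediction; how SafeFQL defines its scores and applies the margin to the safety threshold is not stated in the abstract, so the names below are hypothetical.

```python
import math


def conformal_quantile(scores, alpha):
    """Split-conformal margin for miscoverage level `alpha`.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest calibration score,
    or infinity when the calibration set is too small for the requested
    coverage level.
    """
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(scores)[rank - 1]


# Hypothetical calibration scores, e.g. errors of a learned safety value
calibration_scores = [0.1, 0.4, 0.2, 0.3, 0.5, 0.6, 0.7, 0.8, 0.9]
margin = conformal_quantile(calibration_scores, alpha=0.2)
```

One plausible use is to tighten the learned safety threshold by `margin`, so that a state is treated as safe only when its predicted safety value clears the boundary with room to spare.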

Multi-Scale Control of Large Agent Populations: From Density Dynamics to Individual Actuation

大规模智能体群体的多尺度控制：从密度动力学到个体驱动

Authors: Mario di Bernardo
Subjects: Systems and Control (eess.SY); Dynamical Systems (math.DS)
Arxiv link: https://arxiv.org/abs/2603.15160
Pdf link: https://arxiv.org/pdf/2603.15160
Abstract We review a body of recent work by the author and collaborators on controlling the spatial organisation of large agent populations across multiple scales. A central theme is the systematic bridging of microscopic agent-level dynamics and macroscopic density descriptions, enabling control design at the most natural level of abstraction and subsequent translation across scales. We show how this multi-scale perspective provides a unified approach to both direct control, where every agent is actuated, and indirect control, where few leaders or herders steer a larger uncontrolled population. The review covers continuification-based control with robustness under limited sensing and decentralised implementation via distributed density estimation; leader-follower density regulation with dual-feedback stability guarantees and bio-inspired plasticity; optimal-transport methods for coverage control and macro-to-micro discretisation; nonreciprocal field theory for collective decision-making; mean-field control barrier functions for population-level safety; and hierarchical reinforcement learning for settings where closed-form solutions are intractable. Together, these results demonstrate the breadth and versatility of a multi-scale control framework that integrates analytical methods, learning, and physics-inspired approaches for large agent populations.
中文摘要 我们回顾了作者及其合作者近期关于在多个尺度上控制大规模智能体群体空间组织的工作。一个核心主题是系统地连接微观的个体级动力学与宏观的密度描述，使控制设计能够在最自然的抽象层面进行，并随后在不同尺度之间转换。我们展示了这种多尺度视角如何为直接控制（每个智能体都被驱动）和间接控制（少数领导者或驱赶者引导更大的不受控群体）提供统一的方法。综述涵盖了在有限传感下具有鲁棒性、并通过分布式密度估计实现分散式实施的连续化控制；具备双反馈稳定性保证和仿生可塑性的领导者-跟随者密度调节；用于覆盖控制和宏观到微观离散化的最优传输方法；用于集体决策的非互易场论；用于群体层面安全的平均场控制障碍函数；以及用于闭式解难以求得的场景的分层强化学习。这些结果共同展示了一个整合解析方法、学习方法和物理启发方法的多尺度控制框架在大规模智能体群体上的广度与通用性。

KiRAS: Keyframe Guided Self-Imitation for Robust and Adaptive Skill Learning in Quadruped Robots

KiRAS：关键帧引导自我模仿，实现四足机器人中稳健且自适应的技能学习

Authors: Xiaoyi Wei, Peng Zhai, Jiaxin Tu, Yueqi Zhang, Yuqi Li, Zonghao Zhang, Hu Zhou, Lihua Zhang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.15179
Pdf link: https://arxiv.org/pdf/2603.15179
Abstract With advances in reinforcement learning and imitation learning, quadruped robots can acquire diverse skills within a single policy by imitating multiple skill-specific datasets. However, the lack of datasets on complex terrains limits the ability of such multi-skill policies to generalize effectively in unstructured environments. Inspired by animation, we adopt keyframes as minimal and universal skill representations, relaxing dataset constraints and enabling the integration of terrain adaptability with skill diversity. We propose Keyframe Guided Self-Imitation for Robust and Adaptive Skill Learning (KiRAS), an end-to-end framework for acquiring and transitioning between diverse skill primitives on complex terrains. KiRAS first learns diverse skills on flat terrain through keyframe-guided self-imitation, eliminating the need for expert datasets; then continues training the same policy network on rough terrains to enhance robustness. To eliminate catastrophic forgetting, a proficiency-based Skill Initialization Technique is introduced. Experiments on Solo-8 and Unitree Go1 robots show that KiRAS enables robust skill acquisition and smooth transitions across challenging terrains. This framework demonstrates its potential as a lightweight platform for multi-skill generation and dataset collection. It further enables flexible skill transitions that enhance locomotion on challenging terrains.
中文摘要 随着强化学习和模仿学习的进步，四足机器人可以通过模拟多个技能特定数据集，在单一策略内获得多样化技能。然而，复杂地形数据集的缺乏限制了此类多技能策略在非结构化环境中有效推广的能力。受动画启发，我们采用关键帧作为极简且通用的技能表示方式，放松数据集限制，实现地形适应性与技能多样性的整合。我们提出了关键帧引导自我模仿（Keyframe Guided Self-Imitation for Robust and Adaptive Skill Learning，KiRAS），这是一个端到端的框架，用于在复杂地形上获取和转换不同技能原语。KiRAS首先通过关键帧引导的自我模仿，在平坦地形上学习多样化技能，无需专家数据集;然后继续在崎岖地形上训练同一政策网络，以增强稳健性。为消除灾难性遗忘，引入了基于熟练度的技能初始化技术。在Solo-8和Unitree Go1机器人上的实验表明，KiRAS能够实现稳健的技能习得和在挑战地形上的平稳过渡。该框架展示了其作为多技能生成和数据集收集的轻量级平台的潜力。它还促进了灵活的技能转换，增强了在复杂地形上的移动能力。

Iterative Learning Control-Informed Reinforcement Learning for Batch Process Control

迭代学习控制驱动强化学习用于批量过程控制

Authors: Runze Lin, Ziqi Zhuo, Junghui Chen, Lei Xie, Hongye Su
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.15180
Pdf link: https://arxiv.org/pdf/2603.15180
Abstract A significant limitation of Deep Reinforcement Learning (DRL) is the stochastic uncertainty in actions generated during exploration-exploitation, which poses substantial safety risks during both training and deployment. In industrial process control, the lack of formal stability and convergence guarantees further inhibits adoption of DRL methods by practitioners. Conversely, Iterative Learning Control (ILC) represents a well-established autonomous control methodology for repetitive systems, particularly in batch process optimization. ILC achieves desired control performance through iterative refinement of control laws, either between consecutive batches or within individual batches, to compensate for both repetitive and non-repetitive disturbances. This study introduces an Iterative Learning Control-Informed Reinforcement Learning (IL-CIRL) framework for training DRL controllers in dual-layer batch-to-batch and within-batch control architectures for batch processes. The proposed method incorporates Kalman filter-based state estimation within the iterative learning structure to guide DRL agents toward control policies that satisfy operational constraints and ensure stability guarantees. This approach enables the systematic design of DRL controllers for batch processes operating under multiple disturbance conditions.
中文摘要 深度强化学习（DRL）的一个重大局限是探索-利用过程中生成的动作存在随机不确定性，这在训练和部署过程中都带来了重大安全风险。在工业过程控制中，缺乏形式稳定性和趋同性保障进一步阻碍了从业者采用DRL方法。相反，迭代学习控制（ILC）代表了一种成熟的自主控制方法论，适用于重复系统，特别是在批量过程优化领域。ILC通过迭代细化控制定律，无论是在连续批次之间还是单个批次内，以补偿重复性和非重复性扰动，从而实现理想的控制性能。本研究引入了一种迭代学习控制-知情强化学习（IL-CIRL）框架，用于在双层批对批及批内控制架构中训练DRL控制器，用于批处理过程。所提方法在迭代学习结构中结合了基于卡尔曼滤波器的状态估计，以引导DRL代理制定满足操作约束并确保稳定性的控制策略。这种方法使得在多干扰条件下批量操作的DRL控制器能够系统化地设计。
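
The batch-to-batch refinement that the abstract attributes to Iterative Learning Control can be illustrated with the textbook P-type update u_{k+1}(t) = u_k(t) + L * e_k(t), where e_k is the tracking error of batch k. The sketch below assumes a static scalar plant y = g * u purely for illustration; it does not model the paper's dual-layer architecture, Kalman filtering, or DRL agent, and all names are hypothetical.

```python
def run_p_type_ilc(reference, plant_gain, learning_gain, batches):
    """Run P-type ILC on a toy static plant y(t) = plant_gain * u(t).

    Returns the final control profile and the maximum absolute tracking
    error recorded in each batch. For this toy plant the error contracts
    by |1 - learning_gain * plant_gain| per batch.
    """
    control = [0.0] * len(reference)
    max_errors = []
    for _ in range(batches):
        output = [plant_gain * u for u in control]
        error = [r - y for r, y in zip(reference, output)]
        max_errors.append(max(abs(e) for e in error))
        # Batch-to-batch correction: reuse the last batch's error profile
        control = [u + learning_gain * e for u, e in zip(control, error)]
    return control, max_errors


setpoints = [1.0, 2.0, 3.0]
final_control, error_history = run_p_type_ilc(
    setpoints, plant_gain=2.0, learning_gain=0.3, batches=40
)
```

With these values the contraction factor is 0.4, so the error shrinks geometrically from batch to batch and the control profile approaches the setpoints divided by the plant gain.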

Towards Foundation Models for Consensus Rank Aggregation

迈向共识排名聚合的基础模型

Authors: Yijun Jin, Simon Klüttermann, Chiara Balestra, Emmanuel Müller
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/2603.15218
Pdf link: https://arxiv.org/pdf/2603.15218
Abstract Aggregating a consensus ranking from multiple input rankings is a fundamental problem with applications in recommendation systems, search engines, job recruitment, and elections. Despite decades of research in consensus ranking aggregation, minimizing the Kemeny distance remains computationally intractable. Specifically, determining an optimal aggregation of rankings with respect to the Kemeny distance is an NP-hard problem, limiting its practical application to relatively small-scale instances. We propose the Kemeny Transformer, a novel Transformer-based algorithm trained via reinforcement learning to efficiently approximate the Kemeny optimal ranking. Experimental results demonstrate that our model outperforms classical majority-heuristic and Markov-chain approaches, achieving substantially faster inference than integer linear programming solvers. Our approach thus offers a practical, scalable alternative for real-world ranking-aggregation tasks.
中文摘要 从多个输入排名中汇总共识排名是推荐系统、搜索引擎、招聘和选举应用中的根本难题。尽管共识排名聚合已有数十年研究，最小化克门尼距离在计算上仍然难以解决。具体来说，确定相对于克门尼距离的最优排序聚合是一个NP难问题，其实际应用仅限于相对较小的规模实例。我们提出了Kemeny Transformer，这是一种基于Transformer的新算法，通过强化学习训练，以高效近似Kemeny最优排名。实验结果表明，我们的模型优于经典多数启发式和马尔可夫链方法，推断速度远快于整数线性规划求解器。因此，我们的方法为现实世界的排名聚合任务提供了一种实用且可扩展的替代方案。
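
To make the objective in the Kemeny Transformer abstract concrete: the Kemeny consensus is the ranking that minimizes the summed Kendall tau distance (number of pairwise disagreements) to all input rankings. The sketch below computes that score and finds the optimum by exhaustive search, which is only feasible for a handful of items and is exactly the cost that approximate solvers try to avoid. Function names are chosen for illustration.

```python
from itertools import combinations, permutations


def kendall_tau(r1, r2):
    """Count item pairs that the two rankings order differently."""
    pos1 = {item: i for i, item in enumerate(r1)}
    pos2 = {item: i for i, item in enumerate(r2)}
    return sum(
        1
        for a, b in combinations(r1, 2)
        if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0
    )


def kemeny_score(candidate, rankings):
    """Total pairwise disagreement between a candidate and all voters."""
    return sum(kendall_tau(candidate, r) for r in rankings)


def brute_force_kemeny(rankings):
    """Exhaustive search over all n! orderings; small n only."""
    items = rankings[0]
    return min(permutations(items), key=lambda c: kemeny_score(c, rankings))


votes = [("a", "b", "c"), ("a", "b", "c"), ("b", "a", "c")]
consensus = brute_force_kemeny(votes)
```

The factorial search space is why exact methods such as integer linear programming become slow as the number of items grows.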

SAGE: Multi-Agent Self-Evolution for LLM Reasoning

SAGE：多智能体自我演化用于大型语言模型推理

Authors: Yulin Peng, Xinxin Zhu, Chenxing Wei, Nianbo Zeng, Leilei Wang, Ying Tiffany He, F. Richard Yu
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2603.15255
Pdf link: https://arxiv.org/pdf/2603.15255
Abstract Reinforcement learning with verifiable rewards improves reasoning in large language models (LLMs), but many methods still rely on large human-labeled datasets. While self-play reduces this dependency, it often lacks explicit planning and strong quality control, limiting stability in long-horizon multi-step reasoning. We present SAGE (Self-evolving Agents for Generalized reasoning Evolution), a closed-loop framework in which four agents (Challenger, Planner, Solver, and Critic) co-evolve from a shared LLM backbone using only a small seed set. The Challenger continuously generates increasingly difficult tasks; the Planner converts each task into a structured multi-step plan; and the Solver follows the plan to produce an answer, whose correctness is determined by external verifiers. The Critic scores and filters both generated questions and plans to prevent curriculum drift and maintain training signal quality, enabling stable self-training. Across mathematics and code-generation benchmarks, SAGE delivers consistent gains across model scales, improving the Qwen-2.5-7B model by 8.9% on LiveCodeBench and 10.7% on OlympiadBench.
中文摘要 带有可验证奖励的强化学习提升了大型语言模型（LLM）的推理能力，但许多方法仍然依赖于大规模人工标注数据集。虽然自我博弈减少了这种依赖，但通常缺乏明确的规划和强有力的质量控制，限制了长程多步推理的稳定性。我们提出了SAGE（Self-evolving Agents for Generalized reasoning Evolution），这是一个闭环框架，其中四个智能体（Challenger、Planner、Solver 和 Critic）仅用一个小种子集，从共享的LLM骨干共同进化。Challenger 不断生成越来越难的任务；Planner 将每个任务转换为结构化的多步计划；Solver 则按照计划生成答案，其正确性由外部验证器判定。Critic 对生成的问题和计划进行评分和筛选，以防止课程漂移并保持训练信号质量，从而实现稳定的自训练。在数学和代码生成基准测试中，SAGE在不同模型规模上均带来稳定提升，使Qwen-2.5-7B模型在LiveCodeBench上提升8.9%，在OlympiadBench上提升10.7%。

Probe-then-Plan: Environment-Aware Planning for Industrial E-commerce Search

探查然后规划：工业电子商务搜索的环境感知规划

Authors: Mengxiang Chen, Zhouwei Zhai, Jin Li
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.15262
Pdf link: https://arxiv.org/pdf/2603.15262
Abstract Modern e-commerce search is evolving to resolve complex user intents. While Large Language Models (LLMs) offer strong reasoning, existing LLM-based paradigms face a fundamental blindness-latency dilemma: query rewriting is agnostic to retrieval capabilities and real-time inventory, yielding invalid plans; conversely, deep search agents rely on iterative tool calls and reflection, incurring seconds of latency incompatible with industrial sub-second budgets. To resolve this conflict, we propose Environment-Aware Search Planning (EASP), reformulating search planning as a dynamic reasoning process grounded in environmental reality. EASP introduces a Probe-then-Plan mechanism: a lightweight Retrieval Probe exposes the retrieval snapshot, enabling the Planner to diagnose execution gaps and generate grounded search plans. The methodology comprises three stages: (1) Offline Data Synthesis: A Teacher Agent synthesizes diverse, execution-validated plans by diagnosing the probed environment. (2) Planner Training and Alignment: The Planner is initialized via Supervised Fine-Tuning (SFT) to internalize diagnostic capabilities, then aligned with business outcomes (conversion rate) via Reinforcement Learning (RL). (3) Adaptive Online Serving: A complexity-aware routing mechanism selectively activates planning for complex queries, ensuring optimal resource allocation. Extensive offline evaluations and online A/B testing on this http URL demonstrate that EASP significantly improves relevant recall and achieves substantial lifts in UCVR and GMV. EASP has been successfully deployed in this http URL's AI-Search system.
中文摘要 现代电子商务搜索正在演进以解决复杂的用户意图。虽然大型语言模型（LLM）提供了强有力的推理，但现有基于LLM的范式面临根本性的盲延迟困境：查询重写对检索能力和实时库存无关，导致计划无效;相反，深度搜索代理依赖迭代工具调用和反射，导致的延迟与工业亚秒预算不兼容。为解决这一冲突，我们提出了环境感知搜索规划（EASP），将搜索规划重新表述为基于环境现实的动态推理过程。EASP 引入了“探测再计划”机制：轻量级检索探针暴露检索快照，使规划器能够诊断执行缺口并生成有根基的搜索计划。该方法论包含三个阶段：（1）离线数据综合：教师代理通过诊断探测环境，综合多样化且经过执行验证的计划。（2）规划师培训与对齐：规划师通过监督微调（SFT）初始化以内化诊断能力，然后通过强化学习（RL）与业务成果（转化率）对齐。（3）自适应在线服务：一种复杂度感知的路由机制，选择性地激活复杂查询的规划，确保资源的最佳分配。对该http网址进行的大量离线评估和在线A/B测试表明，EASP显著提升了相关回忆率，并在UCVR和GMV中取得了显著提升。EASP 已成功部署于该 http URL 的 AI-Search 系统中。

Evaluating the Robustness of Reinforcement Learning based Adaptive Traffic Signal Control

基于强化学习的自适应交通信号控制的鲁棒性评估

Authors: Dickens Kwesiga, Angshuman Guin, Khaled Abdelghany, Michael Hunter
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.15283
Pdf link: https://arxiv.org/pdf/2603.15283
Abstract Reinforcement learning (RL) has attracted increasing interest for adaptive traffic signal control due to its model-free ability to learn control policies directly from interaction with the traffic environment. However, several challenges remain before RL-based signal control can be considered ready for field deployment. Many existing studies rely on simplified signal timing structures, robustness of trained models under varying traffic demand conditions remains insufficiently evaluated, and runtime efficiency continues to pose challenges when training RL algorithms in traffic microscopic simulation environments. This study formulates an RL-based signal control algorithm capable of representing a full eight-phase ring-barrier configuration consistent with field signal controllers. The algorithm is trained and evaluated under varying traffic demand conditions and benchmarked against state-of-the-practice actuated signal control (ASC). To assess robustness, experiments are conducted across multiple traffic volumes and origin-destination (O-D) demand patterns with varying levels of structural similarity. To improve training efficiency, a distributed asynchronous training architecture is implemented that enables parallel simulation across multiple computing nodes. Results from a case study intersection show that the proposed RL-based signal control significantly outperforms optimized ASC, reducing average delay by 11-32% across movements. A model trained on a single O-D pattern generalizes well to similar unseen demand patterns but degrades under substantially different demand conditions. In contrast, a model trained on diverse O-D patterns demonstrates strong robustness, consistently outperforming ASC even under highly dissimilar unseen demand scenarios.
中文摘要 强化学习（RL）因其无需模型即可直接从交通环境交互中学习控制策略的能力，吸引了越来越多的自适应交通信号控制兴趣。然而，在基于强化学习的信号控制能够被视为准备投入现场部署之前，仍有若干挑战。许多现有研究依赖简化的信号时序结构，训练模型在不同交通需求条件下的鲁棒性评估不足，运行效率在交通微观模拟环境中训练强化学习算法时仍面临挑战。本研究提出了一种基于强化学习的信号控制算法，能够表示与现场信号控制器一致的完整八相环形态态。该算法在不同的交通需求条件下进行训练和评估，并以实践中有效的感应信号控制（ASC）进行基准测试。为评估鲁棒性，在多个交通量和起讫地（O-D）需求模式中进行不同结构相似度的实验。为提高训练效率，采用分布式异步训练架构，实现跨多个计算节点的并行仿真。案例研究交叉点的结果显示，基于强化学习的信号控制显著优于优化后的ASC，平均延迟减少了11%-32%。在单一O-D模式上训练的模型可以很好地推广到类似的未见需求模式，但在需求条件大不相同的情况下会退化。相比之下，在多样O-D模式上训练的模型表现出强烈的鲁棒性，即使在高度不同的未见需求场景下，也能持续优于ASC。

NavThinker: Action-Conditioned World Models for Coupled Prediction and Planning in Social Navigation

NavThinker：用于社会导航中耦合预测与规划的行动条件世界模型

Authors: Tianshuai Hu, Zeying Gong, Lingdong Kong, XiaoDong Mei, Yiyi Ding, Qi Zeng, Ao Liang, Rong Li, Yangyi Zhong, Junwei Liang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.15359
Pdf link: https://arxiv.org/pdf/2603.15359
Abstract Social navigation requires robots to act safely in dynamic human environments. Effective behavior demands thinking ahead: reasoning about how the scene and pedestrians evolve under different robot actions rather than reacting to current observations alone. This creates a coupled prediction-planning challenge, where robot actions and human motion mutually influence each other. To address this challenge, we propose NavThinker, a future-aware framework that couples an action-conditioned world model with on-policy reinforcement learning. The world model operates in the Depth Anything V2 patch feature space and performs autoregressive prediction of future scene geometry and human motion; multi-head decoders then produce future depth maps and human trajectories, yielding a future-aware state aligned with traversability and interaction risk. Crucially, we train the policy with DD-PPO while injecting world-model think-ahead signals via: (i) action-conditioned future features fused into the current observation embedding and (ii) social reward shaping from predicted human trajectories. Experiments on single- and multi-robot Social-HM3D show state-of-the-art navigation success, with zero-shot transfer to Social-MP3D and real-world deployment on a Unitree Go2, validating generalization and practical applicability. Webpage: this https URL.
中文摘要 社会导航要求机器人在动态的人类环境中安全行动。有效的行为需要前瞻性思考：推理场景和行人在不同机器人动作下如何演变，而不仅仅是对当前观察做出反应。这带来了预测与规划相互耦合的挑战，即机器人动作与人类运动相互影响。为应对这一挑战，我们提出了NavThinker，一种将动作条件世界模型与同策略（on-policy）强化学习相结合的未来感知框架。世界模型运行在Depth Anything V2的图像块特征空间中，并对未来场景几何和人体运动进行自回归预测；多头解码器随后生成未来深度图和人类轨迹，从而得到与可通行性和交互风险对齐的未来感知状态。关键在于，我们用DD-PPO训练策略，同时通过以下方式注入世界模型的前瞻信号：（i）将动作条件的未来特征融合进当前观察嵌入，（ii）基于预测的人类轨迹进行社会奖励塑造。在单机器人和多机器人Social-HM3D上的实验取得了最先进的导航成功率，并实现了向Social-MP3D的零样本迁移以及在Unitree Go2上的真实部署，验证了其泛化能力和实际适用性。网页：此 https URL。

Trajectory-Diversity-Driven Robust Vision-and-Language Navigation

轨迹多样性驱动的稳健视觉与语言导航

Authors: Jiangyang Li, Cong Wan, SongLin Dong, Chenhao Ding, Qiang Wang, Zhiheng Ma, Yihong Gong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.15370
Pdf link: https://arxiv.org/pdf/2603.15370
Abstract Vision-and-Language Navigation (VLN) requires agents to navigate photo-realistic environments following natural language instructions. Current methods predominantly rely on imitation learning, which suffers from limited generalization and poor robustness to execution perturbations. We present NavGRPO, a reinforcement learning framework that learns goal-directed navigation policies through Group Relative Policy Optimization. By exploring diverse trajectories and optimizing via within-group performance comparisons, our method enables agents to distinguish effective strategies beyond expert paths without requiring additional value networks. Built on ScaleVLN, NavGRPO achieves superior robustness on R2R and REVERIE benchmarks with +3.0% and +1.71% SPL improvements in unseen environments. Under extreme early-stage perturbations, we demonstrate +14.89% SPL gain over the baseline, confirming that goal-directed RL training builds substantially more robust navigation policies. Code and models will be released.
中文摘要 视觉与语言导航（VLN）要求智能体按照自然语言指令在照片级逼真的环境中导航。当前方法主要依赖模仿学习，其泛化能力有限，且对执行扰动的鲁棒性较差。我们提出NavGRPO，一个通过组相对策略优化（Group Relative Policy Optimization）学习目标导向导航策略的强化学习框架。通过探索多样化轨迹并基于组内表现比较进行优化，我们的方法使智能体能够在无需额外价值网络的情况下，识别出专家路径之外的有效策略。NavGRPO基于ScaleVLN构建，在R2R和REVERIE基准上实现了更强的鲁棒性，在未见环境中SPL分别提升+3.0%和+1.71%。在极端的早期扰动下，我们相对基线取得了+14.89%的SPL增益，证实目标导向的强化学习训练能够构建明显更稳健的导航策略。代码和模型将会发布。
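
The "within-group performance comparisons" in the NavGRPO abstract refer to the core step of Group Relative Policy Optimization: each sampled trajectory's reward is standardized against the other trajectories in its group, which removes the need for a learned value network. The sketch below shows only that normalization. The use of the population standard deviation and the `eps` stabilizer are assumptions, since implementations differ on both, and nothing here is specific to NavGRPO's reward design.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own sampling group.

    advantage_i = (r_i - mean(group)) / (std(group) + eps)
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical success rewards for four trajectories of one instruction
group_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(group_rewards)
```

Trajectories that beat the group average receive positive advantages and are reinforced; when every trajectory in a group earns the same reward, all advantages are zero and the group contributes no gradient signal.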

Fusian: Multi-LoRA Fusion for Fine-Grained Continuous MBTI Personality Control in Large Language Models

Fusian：多重LoRA融合用于大型语言模型中细粒度连续MBTI人格控制

Authors: Zehao Chen, Rong Pan
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.15405
Pdf link: https://arxiv.org/pdf/2603.15405
Abstract Large Language Models (LLMs) have demonstrated impressive capabilities in simulating diverse human behaviors and personalities. However, existing methods for personality control, which include prompt engineering and standard Supervised Fine-Tuning (SFT), typically treat personality traits as discrete categories (e.g., "Extroverted" vs. "Introverted"), lacking the ability to precisely control the intensity of a trait on a continuous spectrum. In this paper, we introduce Fusian, a novel framework for fine-grained, continuous personality control in LLMs. Fusian operates in two stages: (1) Trajectory Collection, where we capture the dynamic evolution of personality adoption during SFT by saving a sequence of LoRA adapters, effectively mapping the continuous manifold of a trait; and (2) RL-based Dynamic Fusion, where we train a policy network using Reinforcement Learning to dynamically compute mixing weights for these frozen adapters. By sampling from a Dirichlet distribution parameterized by the policy network, Fusian fuses multiple adapters to align the model's output with a specific numerical target intensity. Experiments on the Qwen3-14B model demonstrate that Fusian achieves high precision in personality control, significantly outperforming baseline methods in aligning with user-specified trait intensities.
中文摘要 大型语言模型（LLMs）在模拟多样的人类行为和个性方面展现出令人印象深刻的能力。然而，现有的人格控制方法，包括提示工程和标准的监督微调（SFT），通常将人格特质视为离散类别（例如“外向”与“内向”），缺乏在连续谱上精确控制特质强度的能力。本文介绍了Fusian，一种用于LLM中细粒度连续人格控制的新框架。Fusian分为两个阶段：（1）轨迹收集，通过保存一系列LoRA适配器，捕捉SFT过程中人格习得的动态演变，有效刻画某一特质的连续流形；以及（2）基于强化学习的动态融合，我们利用强化学习训练策略网络，动态计算这些冻结适配器的混合权重。通过从由策略网络参数化的狄利克雷分布中采样，Fusian融合多个适配器，使模型输出与特定的数值目标强度对齐。在Qwen3-14B模型上的实验表明，Fusian在人格控制方面实现了很高的精度，在匹配用户指定的特质强度方面显著优于基线方法。
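
The Dirichlet-based fusion step in the Fusian abstract can be sketched as follows: draw non-negative mixing weights that sum to one, then take a weighted combination of the frozen adapters. This is a toy illustration under stated assumptions. Adapters are represented as flat lists of floats, the concentration parameters are fixed by hand instead of being produced by a policy network, and the abstract does not say whether fusion happens in weight space or output space.

```python
import random


def sample_dirichlet(concentrations, rng):
    """Sample mixing weights by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentrations]
    total = sum(draws)
    return [d / total for d in draws]


def fuse_adapters(adapters, weights):
    """Weighted element-wise sum of equally shaped adapter vectors."""
    size = len(adapters[0])
    return [
        sum(w * adapter[i] for w, adapter in zip(weights, adapters))
        for i in range(size)
    ]


rng = random.Random(0)
# Hypothetical concentrations a policy network might output for 3 adapters
mix = sample_dirichlet([2.0, 1.0, 0.5], rng)
fused = fuse_adapters([[1.0, 0.0], [0.0, 1.0]], [0.25, 0.75])
```

Because Dirichlet samples lie on the probability simplex, the fused adapter stays inside the convex hull of the saved checkpoints, which is one way to read the abstract's "continuous manifold of a trait".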

Amplification Effects in Test-Time Reinforcement Learning: Safety and Reasoning Vulnerabilities

测试时强化学习中的放大效应：安全性与推理漏洞

Authors: Vanshaj Khattar, Md Rafi ur Rashid, Moumita Choudhury, Jing Liu, Toshiaki Koike-Akino, Ming Jin, Ye Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2603.15417
Pdf link: https://arxiv.org/pdf/2603.15417
Abstract Test-time training (TTT) has recently emerged as a promising method to improve the reasoning abilities of large language models (LLMs), in which the model directly learns from test data without access to labels. However, this reliance on test data also makes TTT methods vulnerable to harmful prompt injections. In this paper, we investigate safety vulnerabilities of TTT methods, where we study a representative self-consistency-based test-time learning method: test-time reinforcement learning (TTRL), a recent TTT method that improves LLM reasoning by rewarding self-consistency using majority vote as a reward signal. We show that harmful prompt injection during TTRL amplifies the model's existing behaviors, i.e., safety amplification when the base model is relatively safe, and harmfulness amplification when it is vulnerable to the injected data. In both cases, there is a decline in reasoning ability, which we refer to as the reasoning tax. We also show that TTT methods such as TTRL can be exploited adversarially using specially designed "HarmInject" prompts to force the model to answer jailbreak and reasoning queries together, resulting in stronger harmfulness amplification. Overall, our results highlight that TTT methods that enhance LLM reasoning by promoting self-consistency can lead to amplification behaviors and reasoning degradation, highlighting the need for safer TTT methods.
中文摘要 测试时训练（TTT）最近成为一种有前景的方法，可以提升大型语言模型（LLMs）的推理能力，其中模型直接从测试数据中学习而无需访问标签。然而，这种对测试数据的依赖也使TTT方法容易受到有害提示注入的影响。本文探讨了TTT方法的安全漏洞，研究了一种有代表性的基于自一致性的测试时学习方法：测试时强化学习（TTRL），这是一种近期的TTT方法，以多数投票作为奖励信号来奖励自一致性，从而提升LLM的推理能力。我们表明，TTRL期间的有害提示注入会放大模型已有的行为，即当基础模型相对安全时出现安全性放大，而当它易受注入数据影响时出现有害性放大。在这两种情况下，推理能力都会下降，我们称之为推理税。我们还展示了TTRL等TTT方法可以被专门设计的“HarmInject”提示对抗性地利用，迫使模型同时回答越狱和推理查询，从而造成更强的有害性放大。总体而言，我们的结果表明，通过促进自一致性来增强LLM推理的TTT方法可能导致放大行为和推理退化，凸显了对更安全TTT方法的需求。
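
The majority-vote reward that the abstract attributes to TTRL can be sketched in a few lines: sample several answers to an unlabeled prompt, treat the most common answer as a pseudo-label, and reward the samples that agree with it. The function name and the binary 1.0/0.0 reward are assumptions for illustration; answer extraction and the policy update are omitted.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Return the majority answer and a 0/1 reward for each sample.

    Ties are broken in favor of the answer that appears first, which is
    how Counter.most_common orders equal counts.
    """
    majority, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == majority else 0.0 for a in answers]
    return majority, rewards


sampled_answers = ["42", "41", "42", "42"]
pseudo_label, rewards = majority_vote_rewards(sampled_answers)
```

The sketch also makes the amplification effect easy to see: the reward never checks correctness or safety, so whatever behavior the model already produces most often, including a response to an injected prompt, is the behavior that gets reinforced.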

MA-VLCM: A Vision Language Critic Model for Value Estimation of Policies in Multi-Agent Team Settings

MA-VLCM：一种用于多智能体团队环境中策略价值估计的视觉语言批评模型

Authors: Shahil Shaik, Aditya Parameshwaran, Anshul Nayak, Jonathon M. Smereka, Yue Wang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.15418
Pdf link: https://arxiv.org/pdf/2603.15418
Abstract Multi-agent reinforcement learning (MARL) commonly relies on a centralized critic to estimate the value function. However, learning such a critic from scratch is highly sample-inefficient and often lacks generalization across environments. At the same time, large vision-language-action models (VLAs) trained on internet-scale data exhibit strong multimodal reasoning and zero-shot generalization capabilities, yet directly deploying them for robotic execution remains computationally prohibitive, particularly in heterogeneous multi-robot systems with diverse embodiments and resource constraints. To address these challenges, we propose Multi-Agent Vision-Language-Critic Models (MA-VLCM), a framework that replaces the learned centralized critic in MARL with a pretrained vision-language model fine-tuned to evaluate multi-agent behavior. MA-VLCM acts as a centralized critic conditioned on natural language task descriptions, visual trajectory observations, and structured multi-agent state information. By eliminating critic learning during policy optimization, our approach significantly improves sample efficiency while producing compact execution policies suitable for deployment on resource-constrained robots. Results show good zero-shot return estimation on models with differing VLM backbones on in-distribution and out-of-distribution scenarios in multi-agent team settings
中文摘要 多智能体强化学习（MARL）通常依赖集中式批评者来估计价值函数。然而，从零开始学习这样的批评者样本效率极低，且通常缺乏跨环境的泛化能力。与此同时，在互联网规模数据上训练的大型视觉-语言-动作模型（VLA）展现出强大的多模态推理和零样本泛化能力，但直接将其部署用于机器人执行在计算上仍然代价过高，尤其是在具有多样形态和资源限制的异构多机器人系统中。为应对这些挑战，我们提出了多智能体视觉-语言-批评模型（MA-VLCM），该框架用经过微调以评估多智能体行为的预训练视觉-语言模型，替代MARL中通过学习得到的集中式批评者。MA-VLCM 作为集中式批评者，以自然语言任务描述、视觉轨迹观测和结构化多智能体状态信息为条件。通过在策略优化过程中免去批评者学习，我们的方法显著提升了样本效率，同时生成适合部署在资源受限机器人上的紧凑执行策略。结果显示，在多智能体团队环境的分布内和分布外场景中，采用不同VLM骨干的模型均取得了良好的零样本回报估计。

Gym-V: A Unified Vision Environment System for Agentic Vision Research

Gym-V：面向智能体视觉研究的统一视觉环境系统

Authors: Fanqing Meng, Lingxiao Du, Jiawei Gu, Jiaqi Liao, Linjie Li, Zijian Wu, Xiangyan Liu, Ziqi Zhao, Mengkang Hu, Yue Zhang, Zichen Liu, Jiaheng Zhang, Michael Qizhe Shieh
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.15432
Pdf link: https://arxiv.org/pdf/2603.15432
Abstract As agentic systems increasingly rely on reinforcement learning from verifiable rewards, standardized "gym" infrastructure has become essential for rapid iteration, reproducibility, and fair comparison. Vision agents lack such infrastructure, limiting systematic study of what drives their learning and where current models fall short. We introduce Gym-V, a unified platform of 179 procedurally generated visual environments across 10 domains with controllable difficulty, enabling controlled experiments that were previously infeasible across fragmented toolkits. Using it, we find that observation scaffolding is more decisive for training success than the choice of RL algorithm, with captions and game rules determining whether learning succeeds at all. Cross-domain transfer experiments further show that training on diverse task categories generalizes broadly while narrow training can cause negative transfer, with multi-turn interaction amplifying all of these effects. Gym-V is released as a convenient foundation for training environments and evaluation toolkits, aiming to accelerate future research on agentic VLMs.
中文摘要 随着智能体系统越来越依赖基于可验证奖励的强化学习，标准化的“gym”基础设施已成为快速迭代、可复现性和公平比较的关键。视觉智能体缺乏这样的基础设施，限制了对其学习驱动因素及现有模型不足之处的系统研究。我们介绍了 Gym-V，这是一个涵盖10个领域、179个程序化生成视觉环境的统一平台，难度可控，使得此前在碎片化工具包中难以实现的受控实验成为可能。利用它，我们发现观测脚手架（observation scaffolding）对训练成功的影响比强化学习算法的选择更具决定性，图像描述（captions）和游戏规则决定了学习能否成功。跨领域迁移实验进一步表明，在多样化任务类别上训练可以广泛泛化，而狭窄的训练则可能导致负迁移，多轮交互会放大所有这些效应。Gym-V作为训练环境和评估工具包的便捷基础发布，旨在加速未来对智能体式VLM的研究。

Listening to the Echo: User-Reaction Aware Policy Optimization via Scalar-Verbal Hybrid Reinforcement Learning

倾听回声：通过标量-语言混合强化学习实现用户反应感知策略优化

Authors: Jing Ye, Xinpei Zhao, Lu Xiang, Yaping Zhang, Chengqing Zong
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.15434
Pdf link: https://arxiv.org/pdf/2603.15434
Abstract While current emotional support dialogue systems typically rely on expert-defined scalar rewards for alignment, these signals suffer from severe information sparsity. They cannot explain why a response failed or how to adapt to dynamic user states, often diverging from the actual goal of facilitating positive emotional shifts. In practice, the most direct and reliable learning signal emerges from the user's continuous reactions during ongoing interaction. We therefore propose Reaction Aware Policy Optimization (RAPO), a framework that optimizes over interaction consequences rather than rubric scores. RAPO treats dialogue as a reaction-driven process and utilizes simulated user responses to generate dense natural-language feedback through three core components: Hindsight Dialogue Selection, which isolates pivotal turns that meaningfully alter user emotional trajectories; Generative Hindsight Feedback, which transforms user reactions into contrastive ranking signals and natural-language critiques; and Scalar-Verbal Hybrid Policy Optimization, which couples scalar reward optimization for global alignment with verbal feedback distillation for fine-grained semantic refinement. Extensive experiments on ESC and Sotopia demonstrate that RAPO significantly outperforms strong reinforcement learning baselines in driving positive interaction outcomes.
中文摘要 虽然当前的情感支持对话系统通常依赖专家定义的标量奖励进行对齐，但这些信号存在严重的信息稀疏问题。它们无法解释某个回复为何失败，也无法说明如何适应动态的用户状态，往往偏离促进积极情绪转变的实际目标。实际上，最直接且可靠的学习信号来自用户在持续互动中的连续反应。因此，我们提出了反应感知策略优化（RAPO），这是一个针对交互后果而非评分细则分数进行优化的框架。RAPO将对话视为反应驱动的过程，利用模拟的用户回复，通过三个核心组件生成密集的自然语言反馈：后见对话选择（Hindsight Dialogue Selection），分离出显著改变用户情感轨迹的关键轮次；生成式后见反馈（Generative Hindsight Feedback），将用户反应转化为对比性排序信号和自然语言批评；以及标量-语言混合策略优化（Scalar-Verbal Hybrid Policy Optimization），将用于全局对齐的标量奖励优化与用于细粒度语义细化的语言反馈蒸馏相结合。在ESC和Sotopia上的大量实验表明，RAPO在推动积极互动结果方面显著优于强大的强化学习基线。

Unbiased and Biased Variance-Reduced Forward-Reflected-Backward Splitting Methods for Stochastic Composite Inclusions

面向随机复合包含问题的无偏与有偏方差缩减前向-反射-后向分裂方法

Authors: Quoc Tran-Dinh, Nghia Nguyen-Trung
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2603.15576
Pdf link: https://arxiv.org/pdf/2603.15576
Abstract This paper develops new variance-reduction techniques for the forward-reflected-backward splitting (FRBS) method to solve a class of possibly nonmonotone stochastic composite inclusions. Unlike unbiased estimators such as mini-batching, developing stochastic biased variants faces a fundamental technical challenge and has not been utilized before for inclusions and fixed-point problems. We fill this gap by designing a new framework that can handle both unbiased and biased estimators. Our main idea is to construct stochastic variance-reduced estimators for the forward-reflected direction and use them to perform iterate updates. First, we propose a class of unbiased variance-reduced estimators and show that increasing mini-batch SGD, loopless-SVRG, and SAGA estimators fall within this class. For these unbiased estimators, we establish a $\mathcal{O}(1/k)$ best-iterate convergence rate for the expected squared residual norm, together with almost-sure convergence of the iterate sequence to a solution. Consequently, we prove that the best oracle complexities for the $n$-finite-sum and expectation settings are $\mathcal{O}(n^{2/3}\epsilon^{-2})$ and $\mathcal{O}(\epsilon^{-10/3})$, respectively, when employing loopless-SVRG or SAGA, where $\epsilon$ is a desired accuracy. Second, we introduce a new class of biased variance-reduced estimators for the forward-reflected direction, which includes SARAH, Hybrid SGD, and Hybrid SVRG as special instances. While the convergence rates remain valid for these biased estimators, the resulting oracle complexities are $\mathcal{O}(n^{3/4}\epsilon^{-2})$ and $\mathcal{O}(\epsilon^{-5})$ for the $n$-finite-sum and expectation settings, respectively. Finally, we conduct two numerical experiments on AUC optimization for imbalanced classification and policy evaluation in reinforcement learning.
中文摘要 本文为前向-反射-后向分裂（FRBS）方法开发了新的方差缩减技术，用于求解一类可能非单调的随机复合包含问题。与小批量等无偏估计量不同，开发随机有偏变体面临根本性的技术挑战，此前尚未被用于包含问题和不动点问题。我们通过设计一个能够同时处理无偏和有偏估计量的新框架来填补这一空白。我们的主要思路是为前向-反射方向构建随机方差缩减估计量，并用它们进行迭代更新。首先，我们提出了一类无偏方差缩减估计量，并证明递增小批量SGD、无环SVRG（loopless-SVRG）和SAGA估计量都属于这一类。对于这些无偏估计量，我们为期望平方残差范数建立了$\mathcal{O}(1/k)$的最佳迭代收敛率，并证明迭代序列几乎必然收敛到解。因此，我们证明，在使用无环SVRG或SAGA时，$n$-有限和设置与期望设置下的最佳oracle复杂度分别为$\mathcal{O}(n^{2/3}\epsilon^{-2})$和$\mathcal{O}(\epsilon^{-10/3})$，其中$\epsilon$为期望精度。其次，我们为前向-反射方向引入了一类新的有偏方差缩减估计量，其中包括SARAH、Hybrid SGD和Hybrid SVRG作为特例。虽然收敛率对这些有偏估计量依然成立，但所得的oracle复杂度在$n$-有限和设置与期望设置下分别为$\mathcal{O}(n^{3/4}\epsilon^{-2})$和$\mathcal{O}(\epsilon^{-5})$。最后，我们进行了两个数值实验：不平衡分类中的AUC优化和强化学习中的策略评估。
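For orientation, the classical deterministic forward-reflected-backward iteration that this line of work builds on can be sketched as follows. The sketch uses exact operator evaluations rather than the variance-reduced stochastic estimators proposed in the paper, and the toy operator, box constraint, step size, and function names are all illustrative assumptions. The update is x_{k+1} = J(x_k - 2*step*G(x_k) + step*G(x_{k-1})), where J is the resolvent of the set-valued part (here, projection onto a box).

```python
def forward_reflected_backward(G, resolvent, x0, step, iters):
    """Deterministic FRB iteration: x+ = J(x - 2*step*G(x) + step*G(x_prev))."""
    x = list(x0)
    g_prev = G(x)  # with x_{-1} = x_0, the first step reduces to forward-backward
    for _ in range(iters):
        g = G(x)
        y = [xi - 2.0 * step * gi + step * gpi
             for xi, gi, gpi in zip(x, g, g_prev)]
        x, g_prev = resolvent(y), g
    return x


def G(x):
    # Toy monotone operator G(x) = M x + q with M = [[1, 2], [-2, 1]], q = (-1, -1).
    # Its Lipschitz constant is sqrt(5), so step < 1 / (2 * sqrt(5)) is required.
    return [x[0] + 2.0 * x[1] - 1.0, -2.0 * x[0] + x[1] - 1.0]


def project_box(y, lo=0.0, hi=10.0):
    # Resolvent of the normal cone of the box [lo, hi]^2 is the projection.
    return [min(max(v, lo), hi) for v in y]


solution = forward_reflected_backward(G, project_box, [5.0, 5.0],
                                      step=0.1, iters=2000)
```

On this toy problem the iterates approach (0, 1), the unique point where -G(x) lies in the normal cone of the box. Only one new evaluation of G is needed per iteration, because the previous evaluation is reused.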

From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation

从被动观察者到主动批评者：强化学习引发机器人操作的过程推理

Authors: Yibin Liu, Yaxing Lyu, Daqi Gao, Zhixuan Liang, Weiliang Tang, Shilong Mu, Xiaokang Yang, Yao Mu
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.15600
Pdf link: https://arxiv.org/pdf/2603.15600
Abstract Accurate process supervision remains a critical challenge for long-horizon robotic manipulation. A primary bottleneck is that current video MLLMs, trained primarily under a Supervised Fine-Tuning (SFT) paradigm, function as passive "Observers" that recognize ongoing events rather than evaluating the current state relative to the final task goal. In this paper, we introduce PRIMO R1 (Process Reasoning Induced Monitoring), a 7B framework that transforms video MLLMs into active "Critics". We leverage outcome-based Reinforcement Learning to incentivize explicit Chain-of-Thought generation for progress estimation. Furthermore, our architecture constructs a structured temporal input by explicitly anchoring the video sequence between initial and current state images. Supported by the proposed PRIMO Dataset and Benchmark, extensive experiments across diverse in-domain environments and out-of-domain real-world humanoid scenarios demonstrate that PRIMO R1 achieves state-of-the-art performance. Quantitatively, our 7B model achieves a 50% reduction in the mean absolute error of specialized reasoning baselines, demonstrating significant relative accuracy improvements over 72B-scale general MLLMs. Furthermore, PRIMO R1 exhibits strong zero-shot generalization on difficult failure detection tasks. We establish state-of-the-art performance on RoboFail benchmark with 67.0% accuracy, surpassing closed-source models like OpenAI o1 by 6.0%.
中文摘要 精确的过程监督仍然是长时程机器人操作的关键挑战。一个主要瓶颈是，当前的视频多模态大语言模型（MLLM）主要在监督式微调（SFT）范式下训练，只是被动的“观察者”：它们识别正在发生的事件，而非相对于最终任务目标评估当前状态。本文介绍了PRIMO R1（过程推理诱导监控），这是一个7B框架，将视频MLLM转化为主动的“批评者”。我们利用基于结果的强化学习，激励模型生成显式思维链以进行进度估计。此外，我们的架构通过将视频序列显式锚定在初始状态图像和当前状态图像之间，构建了结构化的时序输入。在所提出的PRIMO数据集和基准的支持下，在多样的域内环境和域外真实人形机器人场景中进行的大量实验表明，PRIMO R1实现了最先进的性能。定量来看，我们的7B模型将专用推理基线的平均绝对误差降低了50%，相较72B规模的通用MLLM在相对准确率上有显著提升。此外，PRIMO R1在困难的故障检测任务上表现出强大的零样本泛化能力。我们在RoboFail基准上取得了最先进的性能，准确率达到67.0%，比OpenAI o1等闭源模型高出6.0%。

Code-A1: Adversarial Evolving of Code LLM and Test LLM via Reinforcement Learning

Code-A1：通过强化学习实现代码LLM与测试LLM的对抗式演化

Authors: Aozhe Wang, Yuchen Yan, Nan Zhou, Zhengxi Lu, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.15611
Pdf link: https://arxiv.org/pdf/2603.15611
Abstract Reinforcement learning for code generation relies on verifiable rewards from unit test pass rates. Yet high-quality test suites are scarce, existing datasets offer limited coverage, and static rewards fail to adapt as models improve. Recent self-play methods unify code and test generation in a single model, but face an inherent dilemma: white-box access leads to self-collusion where the model produces trivial tests for easy rewards, yet black-box restriction yields generic tests that miss implementation-specific bugs. We introduce Code-A1, an adversarial co-evolution framework that jointly optimizes a Code LLM and a Test LLM with opposing objectives. The Code LLM is rewarded for passing more tests, while the Test LLM is rewarded for exposing more defects. This architectural separation eliminates self-collusion risks and safely enables white-box test generation, where the Test LLM can inspect candidate code to craft targeted adversarial tests. We further introduce a Mistake Book mechanism for experience replay and a composite reward balancing test validity with adversarial difficulty. Experiments on Qwen2.5-Coder models demonstrate that Code-A1 achieves code generation performance matching or exceeding models trained on human-annotated tests, while significantly improving test generation capability.
中文摘要 代码生成的强化学习依赖于来自单元测试通过率的可验证奖励。然而，高质量的测试套件稀缺，现有数据集覆盖有限，静态奖励也无法随着模型改进而调整。近期的自博弈方法将代码生成和测试生成统一在一个模型中，但面临固有的困境：白盒访问会导致自我串通，模型生成简单测试以轻松获得奖励，而黑盒限制则产生通用测试，漏掉特定于实现的缺陷。我们介绍Code-A1，一种对抗性协同演化框架，联合优化目标相反的代码LLM和测试LLM。代码LLM因通过更多测试而获得奖励，而测试LLM则因暴露更多缺陷而获得奖励。这种架构分离消除了自我串通的风险，并安全地实现了白盒测试生成，测试LLM可以检查候选代码，设计有针对性的对抗性测试。我们还引入了用于经验回放的错题本（Mistake Book）机制，以及一种在测试有效性与对抗难度之间取得平衡的复合奖励。在Qwen2.5-Coder模型上的实验表明，Code-A1的代码生成性能达到或超过使用人工标注测试训练的模型，同时显著提升了测试生成能力。

HSImul3R: Physics-in-the-Loop Reconstruction of Simulation-Ready Human-Scene Interactions

HSImul3R：仿真就绪的人-场景交互的物理在环重建

Authors: Yukang Cao, Haozhe Xie, Fangzhou Hong, Long Zhuo, Zhaoxi Chen, Liang Pan, Ziwei Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.15612
Pdf link: https://arxiv.org/pdf/2603.15612
Abstract We present HSImul3R, a unified framework for simulation-ready 3D reconstruction of human-scene interactions (HSI) from casual captures, including sparse-view images and monocular videos. Existing methods suffer from a perception-simulation gap: visually plausible reconstructions often violate physical constraints, leading to instability in physics engines and failure in embodied AI applications. To bridge this gap, we introduce a physically-grounded bi-directional optimization pipeline that treats the physics simulator as an active supervisor to jointly refine human dynamics and scene geometry. In the forward direction, we employ Scene-targeted Reinforcement Learning to optimize human motion under dual supervision of motion fidelity and contact stability. In the reverse direction, we propose Direct Simulation Reward Optimization, which leverages simulation feedback on gravitational stability and interaction success to refine scene geometry. We further present HSIBench, a new benchmark with diverse objects and interaction scenarios. Extensive experiments demonstrate that HSImul3R produces the first stable, simulation-ready HSI reconstructions and can be directly deployed to real-world humanoid robots.
中文摘要 我们介绍HSImul3R，一个统一框架，用于从随意拍摄的数据（包括稀疏视角图像和单目视频）中重建仿真就绪的3D人-场景交互（HSI）。现有方法存在感知与仿真之间的差距：视觉上合理的重建常常违反物理约束，导致在物理引擎中不稳定，并在具身人工智能应用中失败。为弥合这一差距，我们引入了基于物理的双向优化流水线，将物理仿真器视为主动监督者，联合优化人体动力学和场景几何。在正向方向，我们采用场景定向强化学习，在运动保真度和接触稳定性的双重监督下优化人体运动。在反向方向，我们提出了直接仿真奖励优化，利用关于重力稳定性和交互成功率的仿真反馈来优化场景几何。我们还提出了HSIBench，这是一个包含多样物体和交互场景的新基准。大量实验表明，HSImul3R生成了首批稳定、仿真就绪的HSI重建，并可直接部署到真实的人形机器人上。

GlyphPrinter: Region-Grouped Direct Preference Optimization for Glyph-Accurate Visual Text Rendering

GlyphPrinter：用于字形准确视觉文本渲染的区域分组直接偏好优化

Authors: Xincheng Shuai, Ziye Li, Henghui Ding, Dacheng Tao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.15616
Pdf link: https://arxiv.org/pdf/2603.15616
Abstract Generating accurate glyphs for visual text rendering is essential yet challenging. Existing methods typically enhance text rendering by training on a large amount of high-quality scene text images, but the limited coverage of glyph variations and excessive stylization often compromise glyph accuracy, especially for complex or out-of-domain characters. Some methods leverage reinforcement learning to alleviate this issue, yet their reward models usually depend on text recognition systems that are insensitive to fine-grained glyph errors, so images with incorrect glyphs may still receive high rewards. Inspired by Direct Preference Optimization (DPO), we propose GlyphPrinter, a preference-based text rendering method that eliminates reliance on explicit reward models. However, the standard DPO objective only models overall preference between two samples, which is insufficient for visual text rendering where glyph errors typically occur in localized regions. To address this issue, we construct the GlyphCorrector dataset with region-level glyph preference annotations and propose Region-Grouped DPO (R-GDPO), a region-based objective that optimizes inter- and intra-sample preferences over annotated regions, substantially enhancing glyph accuracy. Furthermore, we introduce Regional Reward Guidance, an inference strategy that samples from an optimal distribution with controllable glyph accuracy. Extensive experiments demonstrate that the proposed GlyphPrinter outperforms existing methods in glyph accuracy while maintaining a favorable balance between stylization and precision.
中文摘要 生成准确的字形用于视觉文本渲染既重要又具有挑战性。现有方法通常通过在大量高质量场景文本图像上训练来增强文本渲染，但字形变化覆盖有限以及过度风格化常常损害字形准确性，尤其是对于复杂或域外字符。一些方法利用强化学习来缓解这一问题，但其奖励模型通常依赖于对细粒度字形错误不敏感的文本识别系统，因此字形错误的图像仍可能获得高奖励。受直接偏好优化（DPO）启发，我们提出了GlyphPrinter，一种基于偏好的文本渲染方法，消除了对显式奖励模型的依赖。然而，标准DPO目标仅建模两个样本之间的整体偏好，而在视觉文本渲染中字形错误通常发生在局部区域，因此这并不充分。为解决这一问题，我们构建了带有区域级字形偏好标注的 GlyphCorrector 数据集，并提出了区域分组 DPO（R-GDPO），这是一种基于区域的目标，在标注区域上优化样本间和样本内的偏好，显著提升了字形准确性。此外，我们引入了区域奖励引导（Regional Reward Guidance），这是一种从最优分布中采样且字形准确性可控的推理策略。大量实验表明，所提出的 GlyphPrinter 在字形准确性上优于现有方法，同时保持了风格化与精确度之间的良好平衡。
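As background for the "overall preference between two samples" limitation, the standard DPO loss for a single preference pair can be sketched as below. This is the generic sequence-level DPO objective, not the region-grouped R-GDPO objective proposed in the paper, whose exact form is not given in the abstract; the argument names are illustrative.

```python
import math


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    loss = -log sigmoid(beta * [(logp_win - ref_logp_win)
                                - (logp_lose - ref_logp_lose)])
    Inputs are total log-probabilities of each whole sample under the
    trained policy and under the frozen reference model.
    """
    z = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(z)) = log(1 + exp(-z)), evaluated in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# No preference margin yet: the loss equals log(2).
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# The policy already favors the preferred sample: the loss is lower.
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

Because the loss sees only one scalar log-probability per sample, a localized glyph error contributes no more than any other part of the image, which is the gap a region-level objective is meant to address.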

Keyword: diffusion policy

REFINE-DP: Diffusion Policy Fine-tuning for Humanoid Loco-manipulation via Reinforcement Learning

REFINE-DP：通过强化学习微调扩散策略以实现人形机器人移动操作

Authors: Zhaoyuan Gu, Yipu Chen, Zimeng Chai, Alfred Cueva, Thong Nguyen, Yifan Wu, Huishu Xue, Minji Kim, Isaac Legene, Fukang Liu, Matthew Kim, Ayan Barula, Yongxin Chen, Ye Zhao
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.13707
Pdf link: https://arxiv.org/pdf/2603.13707
Abstract Humanoid loco-manipulation requires coordinated high-level motion plans with stable, low-level whole-body execution under complex robot-environment dynamics and long-horizon tasks. While diffusion policies (DPs) show promise for learning from demonstrations, deploying them on humanoids poses critical challenges: the motion planner trained offline is decoupled from the low-level controller, leading to poor command tracking, compounding distribution shift, and task failures. The common approach of scaling demonstration data is prohibitively expensive for high-dimensional humanoid systems. To address this challenge, we present REFINE-DP (REinforcement learning FINE-tuning of Diffusion Policy), a hierarchical framework that jointly optimizes a DP high-level planner and an RL-based low-level loco-manipulation controller. The DP is fine-tuned via a PPO-based diffusion policy gradient to improve task success rate, while the controller is simultaneously updated to accurately track the planner's evolving command distribution, reducing the distributional mismatch that degrades motion quality. We validate REFINE-DP on a humanoid robot performing loco-manipulation tasks, including door traversal and long-horizon object transport. REFINE-DP achieves an over 90% success rate in simulation, even in out-of-distribution cases not seen in the pre-trained data, and enables smooth autonomous task execution in real-world dynamic environments. Our proposed method substantially outperforms pre-trained DP baselines and demonstrates that RL fine-tuning is key to reliable humanoid loco-manipulation.
中文摘要 人形机器人移动操作（loco-manipulation）需要在复杂的机器人-环境动力学和长时程任务下，将协调的高层运动规划与稳定的低层全身执行相结合。虽然扩散策略（DP）在从演示中学习方面展现出潜力，但在人形机器人上部署它们存在关键挑战：离线训练的运动规划器与低层控制器解耦，导致指令跟踪不良、分布偏移不断累积以及任务失败。对高维人形机器人系统来说，扩大演示数据规模这一常见做法成本过高。为应对这一挑战，我们提出了REFINE-DP（扩散策略的强化学习微调），这是一个分层框架，联合优化DP高层规划器和基于RL的低层移动操作控制器。DP通过基于PPO的扩散策略梯度进行微调以提高任务成功率，同时控制器也同步更新，以准确跟踪规划器不断变化的指令分布，减少导致运动质量下降的分布不匹配。我们在执行移动操作任务（包括穿门和长时程物体搬运）的人形机器人上验证了REFINE-DP。REFINE-DP在仿真中实现了超过90%的成功率，即使在预训练数据中未出现的分布外情况下也是如此，并能在真实动态环境中实现平滑的自主任务执行。我们提出的方法大幅优于预训练的DP基线，并证明强化学习微调是实现可靠人形机器人移动操作的关键。

OCRA: Object-Centric Learning with 3D and Tactile Priors for Human-to-Robot Action Transfer

OCRA：以对象为中心的学习，结合3D和触觉先验，实现人到机器人的动作传递

Authors: Kuanning Wang, Ke Fan, Yuqian Fu, Siyu Lin, Hu Luo, Daniel Seita, Yanwei Fu, Yu-Gang Jiang, Xiangyang Xue
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.14401
Pdf link: https://arxiv.org/pdf/2603.14401
Abstract We present OCRA, an Object-Centric framework for video-based human-to-Robot Action transfer that learns directly from human demonstration videos to enable robust manipulation. Object-centric learning emphasizes task-relevant objects and their interactions while filtering out irrelevant background, providing a natural and scalable way to teach robots. OCRA leverages multi-view RGB videos, the state-of-the-art 3D foundation model VGGT, and advanced detection and segmentation models to reconstruct object-centric 3D point clouds, capturing rich interactions between objects. To handle properties not easily perceived by vision alone, we incorporate tactile priors via a large-scale dataset of over one million tactile images. These 3D and tactile priors are fused through a multimodal module (ResFiLM) and fed into a Diffusion Policy to generate robust manipulation actions. Extensive experiments on both vision-only and visuo-tactile tasks show that OCRA significantly outperforms existing baselines and ablations, demonstrating its effectiveness for learning from human demonstration videos.
中文摘要 我们介绍了OCRA，一个以对象为中心的框架，用于基于视频的人到机器人动作迁移，它直接从人类演示视频中学习，以实现稳健的操作。以对象为中心的学习强调与任务相关的对象及其交互，同时过滤掉无关背景，为教授机器人提供了一种自然且可扩展的方式。OCRA利用多视角RGB视频、最先进的3D基础模型VGGT，以及先进的检测和分割模型，重建以对象为中心的3D点云，捕捉对象间丰富的交互。为了处理仅靠视觉难以感知的属性，我们通过一个包含超过一百万张触觉图像的大规模数据集引入触觉先验。这些3D和触觉先验通过多模态模块（ResFiLM）融合，并输入扩散策略以生成稳健的操作动作。在纯视觉和视觉-触觉任务上的大量实验表明，OCRA显著优于现有基线和消融变体，证明了其从人类演示视频中学习的有效性。

ReMAP-DP: Reprojected Multi-view Aligned PointMaps for Diffusion Policy

ReMAP-DP：用于扩散策略的重投影多视角对齐PointMaps

Authors: Xinzhang Yang, Renjun Wu, Jinyan Liu, Xuesong Li
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.14977
Pdf link: https://arxiv.org/pdf/2603.14977
Abstract Generalist robot policies built upon 2D visual representations excel at semantic reasoning but inherently lack the explicit 3D spatial awareness required for high-precision tasks. Existing 3D integration methods struggle to bridge this gap due to the structural irregularity of sparse point clouds and the geometric distortion introduced by multi-view orthographic rendering. To overcome these barriers, we present ReMAP-DP, a novel framework synergizing standardized perspective reprojection with a structure-aware dual-stream diffusion policy. By coupling the re-projected views with pixel-aligned PointMaps, our dual-stream architecture leverages learnable modality embeddings to fuse frozen semantic features and explicit geometric descriptors, ensuring precise implicit patch-level alignment. Extensive experiments across simulation and real-world environments demonstrate ReMAP-DP's superior performance in diverse manipulation tasks. On RoboTwin 2.0, it attains a 59.3% average success rate, outperforming the DP3 baseline by +6.6%. On ManiSkill 3, our method yields a 28% improvement over DP3 on the geometrically challenging Stack Cube task. Furthermore, ReMAP-DP exhibits remarkable real-world robustness, executing high-precision and dynamic manipulations with superior data efficiency from only a handful of demonstrations. Project page is available at: this https URL
中文摘要 基于二维视觉表征的通用机器人策略擅长语义推理，但本质上缺乏高精度任务所需的显式三维空间感知。由于稀疏点云的结构不规则性以及多视角正交渲染带来的几何畸变，现有的三维融合方法难以弥合这一差距。为克服这些障碍，我们提出了ReMAP-DP，一个将标准化透视重投影与结构感知双流扩散策略相结合的新框架。通过将重投影视图与像素对齐的PointMaps相结合，我们的双流架构利用可学习的模态嵌入，融合冻结的语义特征和显式几何描述符，确保精确的隐式图像块级对齐。在仿真和真实环境中的大量实验展示了ReMAP-DP在多种操作任务中的优越性能。在RoboTwin 2.0上，其平均成功率达到59.3%，比DP3基线高出6.6%。在 ManiSkill 3 上，我们的方法在几何上具有挑战性的Stack Cube任务中比 DP3 提升了 28%。此外，ReMAP-DP展现出卓越的真实世界鲁棒性，仅凭少量演示就能以更高的数据效率执行高精度和动态操作。项目页面可访问：此 https URL

Master Micro Residual Correction with Adaptive Tactile Fusion and Force-Mixed Control for Contact-Rich Manipulation

主-微残差校正：结合自适应触觉融合与力混合控制的富接触操作

Authors: Xingting Li, Yifan Xie, Han Liu, Wei Hou, Guangyu Chen, Shoujie Li, Wenbo Ding
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.15152
Pdf link: https://arxiv.org/pdf/2603.15152
Abstract Robotic contact-rich and fine-grained manipulation remains a significant challenge due to complex interaction dynamics and the competing requirements of multi-timescale control. While current visual imitation learning methods excel at long-horizon planning, they often fail to perceive critical interaction cues like friction variations or incipient slip, and struggle to balance global task coherence with local reactive feedback. To address these challenges, we propose M2-ResiPolicy, a novel Master-Micro residual control architecture that synergizes high-level action guidance with low-level correction. The framework consists of a Master-Guidance Policy (MGP) operating at 10 Hz, which generates temporally consistent action chunks via a diffusion-based backbone and employs a tactile-intensity-driven adaptive fusion mechanism to dynamically modulate perceptual weights between vision and touch. Simultaneously, a high-frequency (60 Hz) Micro-Residual Corrector (MRC) utilizes a lightweight GRU to provide real-time action compensation based on TCP wrench feedback. This policy is further integrated with a force-mixed PBIC execution layer, effectively regulating contact forces to ensure interaction safety. Experiments across several demanding tasks, including fragile object grasping and precision insertion, demonstrate that M2-ResiPolicy significantly outperforms standard Diffusion Policy (DP) and state-of-the-art Reactive Diffusion Policy (RDP), achieving a 93% damage-free success rate in chip grasping and superior force regulation stability.
中文摘要 由于复杂的交互动力学和多时间尺度控制的相互竞争的需求，机器人富接触、细粒度操作仍是重大挑战。虽然当前的视觉模仿学习方法在长时程规划方面表现出色，但它们常常无法感知摩擦变化或初始滑移等关键交互线索，也难以平衡全局任务一致性与局部反应式反馈。为应对这些挑战，我们提出了M2-ResiPolicy，一种新颖的主-微（Master-Micro）残差控制架构，将高层动作引导与低层校正相结合。该框架包含一个以10 Hz运行的主引导策略（MGP），它通过基于扩散的骨干网络生成时间一致的动作块，并采用触觉强度驱动的自适应融合机制，动态调节视觉与触觉之间的感知权重。同时，高频（60 Hz）微残差校正器（MRC）利用轻量级GRU，基于TCP力旋量（wrench）反馈提供实时动作补偿。该策略进一步与力混合PBIC执行层集成，有效调节接触力以确保交互安全。在包括易碎物体抓取和精密插入在内的多项高要求任务上的实验表明，M2-ResiPolicy显著优于标准扩散策略（DP）和最先进的反应式扩散策略（RDP），在chip抓取任务中实现了93%的无损伤成功率，并具有更优的力调节稳定性。
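The two-rate structure described above (a 10 Hz guidance policy plus a 60 Hz residual corrector) can be sketched as a simple scheduling loop. This is only a timing illustration under simplifying assumptions: the master output is held constant for six micro ticks instead of being an action chunk, actions are scalars, and the callables `toy_master`, `toy_corrector`, and `toy_wrench` are hypothetical stand-ins for the diffusion backbone, the GRU corrector, and the wrench sensor.

```python
def run_multirate(master_policy, residual_corrector, read_wrench,
                  n_ticks, master_hz=10, micro_hz=60):
    """Run n_ticks of the fast loop, refreshing the base action at the slow rate."""
    if micro_hz % master_hz != 0:
        raise ValueError("micro_hz must be a multiple of master_hz in this sketch")
    ticks_per_master = micro_hz // master_hz  # 6 fast ticks per slow update
    commands = []
    base = None
    for tick in range(n_ticks):
        if tick % ticks_per_master == 0:
            base = master_policy(tick)        # slow, high-level guidance
        correction = residual_corrector(base, read_wrench(tick))
        commands.append(base + correction)    # residual added to the base action
    return commands


master_calls = []


def toy_master(tick):
    master_calls.append(tick)
    return 1.0


def toy_wrench(tick):
    # Contact force appears halfway through the one-second episode.
    return 0.5 if tick >= 30 else 0.0


def toy_corrector(base, wrench):
    # Back off in proportion to the measured force.
    return -0.2 * wrench


commands = run_multirate(toy_master, toy_corrector, toy_wrench, n_ticks=60)
```

Over one simulated second the slow policy is queried 10 times while the corrector runs on all 60 ticks, so the command can react to contact between guidance updates.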