Arxiv Papers of Today

Generated: 2025-12-03 16:32:53 (UTC+8); arXiv announcement time: 2025-12-03 20:00 EST (2025-12-04 09:00 UTC+8)

35 relevant papers today

Keyword: reinforcement learning

Reinforcement Learning for Robotic Safe Control with Force Sensing

基于力感的机器人安全控制强化学习

Authors: Nan Lin, Linrui Zhang, Yuxuan Chen, Zhenrui Chen, Yujun Zhu, Ruoxi Chen, Peichen Wu, Xiaoping Chen
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.02022
Pdf link: https://arxiv.org/pdf/2512.02022
Abstract For tasks involving complicated manipulation in unstructured environments, traditional hand-coded methods are ineffective, while reinforcement learning can provide more general and useful policies. Although reinforcement learning can obtain impressive results, its stability and reliability are hard to guarantee, which can cause potential safety threats. Moreover, the transfer from simulation to the real world can also lead to unpredictable situations. To enhance the safety and reliability of robots, we introduce force and haptic perception into reinforcement learning. Force and tactile sensation play key roles in robotic dynamic control and human-robot interaction. We demonstrate that the force-based reinforcement learning method can be more adaptive to the environment, especially in sim-to-real transfer. Experimental results show that, in an object pushing task, our strategy is safer and more efficient in both simulation and the real world, and thus holds promise for a wide variety of robotic applications.
中文摘要 对于非结构化环境中涉及复杂操作的任务，传统的手工编码方法效果不佳，而强化学习可以提供更通用且有用的策略。尽管强化学习能够取得令人印象深刻的结果，但其稳定性和可靠性难以保证，这会带来潜在的安全威胁。此外，从仿真到现实世界的迁移也会导致不可预测的情况。为了提升机器人的安全性和可靠性，我们将力觉和触觉感知引入强化学习。力觉和触觉在机器人动态控制和人机交互中起着关键作用。我们证明基于力的强化学习方法对环境的适应性更强，尤其是在仿真到现实的迁移中。实验结果表明，在物体推动任务中，我们的策略在仿真和现实世界中都更安全、更高效，因此在多种机器人应用中具有前景。

Deep Research: A Systematic Survey

深度研究：系统调查

Authors: Zhengliang Shi, Yiqun Chen, Haitao Li, Weiwei Sun, Shiyu Ni, Yougang Lyu, Run-Ze Fan, Bowen Jin, Yixuan Weng, Minjun Zhu, Qiujie Xie, Xinyu Guo, Qu Yang, Jiayi Wu, Jujia Zhao, Xiaqiang Tang, Xinbei Ma, Cunxiang Wang, Jiaxin Mao, Qingyao Ai, Jen-Tse Huang, Wenxuan Wang, Yue Zhang, Yiming Yang, Zhaopeng Tu, Zhaochun Ren
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2512.02038
Pdf link: https://arxiv.org/pdf/2512.02038
Abstract Large language models (LLMs) have rapidly evolved from text generators into powerful problem solvers. Yet, many open tasks demand critical thinking, multi-source, and verifiable outputs, which are beyond single-shot prompting or standard retrieval-augmented generation. Recently, numerous studies have explored Deep Research (DR), which aims to combine the reasoning capabilities of LLMs with external tools, such as search engines, thereby empowering LLMs to act as research agents capable of completing complex, open-ended tasks. This survey presents a comprehensive and systematic overview of deep research systems, including a clear roadmap, foundational components, practical implementation techniques, important challenges, and future directions. Specifically, our main contributions are as follows: (i) we formalize a three-stage roadmap and distinguish deep research from related paradigms; (ii) we introduce four key components: query planning, information acquisition, memory management, and answer generation, each paired with fine-grained sub-taxonomies; (iii) we summarize optimization techniques, including prompting, supervised fine-tuning, and agentic reinforcement learning; and (iv) we consolidate evaluation criteria and open challenges, aiming to guide and facilitate future development. As the field of deep research continues to evolve rapidly, we are committed to continuously updating this survey to reflect the latest progress in this area.
中文摘要 大型语言模型（LLM）已迅速从文本生成器演变为强大的问题解决器。然而，许多开放任务要求批判性思维、多源和可验证的输出，这超出了单次提示或标准的检索增强生成。近年来，许多研究探讨了深度研究（DR），旨在将大型语言模型的推理能力与外部工具（如搜索引擎）结合起来，从而赋能大型语言模型作为能够完成复杂开放式任务的研究代理。本综述对深度研究系统进行了全面且系统的概述，包括清晰的路线图、基础组成部分、实用实施技术、重要挑战及未来方向。具体来说，我们的主要贡献如下：（i）我们正式制定了三阶段路线图，并区分深度研究与相关范式;（ii）我们引入四个关键组成部分：查询规划、信息获取、内存管理和答案生成，每个部分都配有细粒度的子分类法;（iii）我们总结优化技术，包括提示、监督微调和智能强化学习;以及（iv）整合评估标准和开放挑战，旨在指导和促进未来发展。随着深度研究领域持续快速发展，我们致力于持续更新本次调查，以反映该领域的最新进展。

Modelling the Doughnut of social and planetary boundaries with frugal machine learning

用节俭机器学习建模社会与地球边界的甜甜圈

Authors: Stefano Vrizzi, Daniel W. O'Neill
Subjects: Machine Learning (cs.LG); General Economics (econ.GN)
Arxiv link: https://arxiv.org/abs/2512.02200
Pdf link: https://arxiv.org/pdf/2512.02200
Abstract The 'Doughnut' of social and planetary boundaries has emerged as a popular framework for assessing environmental and social sustainability. Here, we provide a proof-of-concept analysis that shows how machine learning (ML) methods can be applied to a simple macroeconomic model of the Doughnut. First, we show how ML methods can be used to find policy parameters that are consistent with 'living within the Doughnut'. Second, we show how a reinforcement learning agent can identify the optimal trajectory towards desired policies in the parameter space. The approaches we test, which include a Random Forest Classifier and $Q$-learning, are frugal ML methods that are able to find policy parameter combinations that achieve both environmental and social sustainability. The next step is the application of these methods to a more complex ecological macroeconomic model.
中文摘要 社会与地球边界的“甜甜圈”已成为评估环境和社会可持续性的流行框架。在这里，我们提供了一个概念验证分析，展示了机器学习（ML）方法如何应用于甜甜圈的简单宏观经济模型。首先，我们展示了机器学习方法如何找到与“生活在甜甜圈内”相符的政策参数。其次，我们展示了强化学习代理如何在参数空间中识别通往期望策略的最优轨迹。我们测试的方法包括随机森林分类器和 $Q$-learning，是节俭的机器学习方法，能够找到既实现环境可持续性又能实现社会可持续性的政策参数组合。下一步是将这些方法应用于更复杂的生态宏观经济模型。
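
Illustration (not from the paper): the abstract names $Q$-learning as one of the frugal methods used to find a trajectory through policy-parameter space. The following is a minimal tabular sketch on a toy 1-D chain. The six-cell state space, the single goal cell, the reward, and all hyperparameters are hypothetical choices made for illustration and do not represent the authors' macroeconomic model.

```python
import random


def q_learning_chain(n_states=6, goal=5, episodes=500,
                     alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning on a toy chain; action 0 moves left, action 1 moves right."""
    rng = random.Random(seed)
    moves = (-1, +1)
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _episode in range(episodes):
        s = 0
        for _step in range(100):  # step cap so every episode terminates
            if rng.random() < epsilon:
                a = rng.randrange(2)                 # explore
            else:
                a = 0 if q[s][0] > q[s][1] else 1    # exploit (ties go right)
            s_next = min(max(s + moves[a], 0), n_states - 1)
            reward = 1.0 if s_next == goal else 0.0
            # The goal is terminal, so it contributes no bootstrap term.
            bootstrap = 0.0 if s_next == goal else gamma * max(q[s_next])
            q[s][a] += alpha * (reward + bootstrap - q[s][a])
            s = s_next
            if s == goal:
                break
    return q


def greedy_policy(q):
    """Greedy action per state from a Q-table."""
    return [0 if left > right else 1 for left, right in q]
```

In this toy setting the greedy policy moves toward the goal cell from every non-terminal state, which is the chain analogue of following a learned trajectory toward a desired region of parameter space.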

Improved Training Mechanism for Reinforcement Learning via Online Model Selection

通过在线模型选择改进强化学习训练机制

Authors: Aida Afshar, Aldo Pacchiano
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.02214
Pdf link: https://arxiv.org/pdf/2512.02214
Abstract We study the problem of online model selection in reinforcement learning, where the selector has access to a class of reinforcement learning agents and learns to adaptively select the agent with the right configuration. Our goal is to establish the improved efficiency and performance gains achieved by integrating online model selection methods into reinforcement learning training procedures. We examine the theoretical characterizations that are effective for identifying the right configuration in practice, and address three practical criteria from a theoretical perspective: 1) Efficient resource allocation, 2) Adaptation under non-stationary dynamics, and 3) Training stability across different seeds. Our theoretical results are accompanied by empirical evidence from various model selection tasks in reinforcement learning, including neural architecture selection, step-size selection, and self model selection.
中文摘要 我们研究强化学习中的在线模型选择问题，选择者可以访问一类强化学习代理，并学习如何自适应地选择合适的配置主体。我们的目标是通过将在线模型选择方法整合进强化学习培训程序，实现了效率和性能的提升。我们考察了在实践中识别正确配置的理论特征，并从理论角度探讨三个实际标准：1）资源的高效分配，2）非平稳动力学下的适应性，3）不同种子间的训练稳定性。我们的理论结果伴随着强化学习中各种模型选择任务的实证证据，包括神经结构选择、步长选择和自模型选择。
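
Illustration (not from the paper): the abstract describes a selector that adaptively picks among reinforcement learning agents with different configurations. One simple way to picture this is a bandit over base agents. The sketch below uses UCB1 as a stand-in selector; the abstract does not say which selection rule the authors analyze, and the fixed per-agent returns are a hypothetical simplification of real training feedback.

```python
import math


def ucb1_select(counts, means, t):
    """Return the index of the base agent to run at round t (1-based)."""
    for i, n in enumerate(counts):
        if n == 0:
            return i  # run every agent once before using the bonus
    scores = [m + math.sqrt(2.0 * math.log(t) / n)
              for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


def run_selector(agent_returns, rounds=2000):
    """agent_returns[i] is a hypothetical fixed return in [0, 1] for agent i."""
    k = len(agent_returns)
    counts = [0] * k
    means = [0.0] * k
    for t in range(1, rounds + 1):
        i = ucb1_select(counts, means, t)
        r = agent_returns[i]
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]  # running average of returns
    return counts
```

For example, `run_selector([0.2, 0.5, 0.8])` spends most of its rounds on the third agent, so compute is concentrated on the best configuration instead of being split evenly.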

Lightweight Latent Reasoning for Narrative Tasks

叙事任务的轻量潜在推理

Authors: Alexander Gurung, Nikolay Malkin, Mirella Lapata
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.02240
Pdf link: https://arxiv.org/pdf/2512.02240
Abstract Large language models (LLMs) tackle complex tasks by generating long chains of thought or "reasoning traces" that act as latent variables in the generation of an output given a query. A model's ability to generate such traces can be optimized with reinforcement learning (RL) to improve their utility in predicting an answer. This optimization comes at a high computational cost, especially for narrative-related tasks that involve retrieving and processing many tokens. To this end, we propose LiteReason, a latent reasoning method that can be interleaved with standard token sampling and easily combined with RL techniques. LiteReason employs a lightweight Reasoning Projector module, trained to produce continuous latent tokens that help the model 'skip' reasoning steps. During RL, the policy model decides when to activate the projector, switching between latent and discrete reasoning as needed. Experimental results on plot hole detection and book chapter generation show that our method outperforms latent reasoning baselines and comes close to matching non-latent RL training, while reducing final reasoning length by 77-92%. Overall, LiteReason guides RL training to a more efficient part of the performance-computation tradeoff curve.
中文摘要 大型语言模型（LLM）通过生成长链思考或“推理痕迹”来处理复杂任务，这些轨迹作为生成查询输出的潜在变量。模型生成此类踪迹的能力可以通过强化学习（RL）进行优化，以提高其在预测答案中的效用。这种优化会带来高昂的计算成本，尤其是在涉及获取和处理大量令牌的叙事相关任务中。为此，我们提出了LiteReason，一种可以与标准令牌抽样交错使用，并能轻松与强化学习技术结合的潜在推理方法。LiteReason采用一个轻量级推理投影模块，经过训练产生连续潜在令牌，帮助模型“跳过”推理步骤。在强化学习过程中，策略模型决定何时激活投影器，并根据需要在潜在推理和离散推理之间切换。关于情节漏洞检测和书籍章节生成的实验结果显示，我们的方法优于潜在推理基线，接近非潜在强化学习训练，同时将最终推理长度缩短了77%-92%。总体而言，LiteReason引导强化学习在性能与计算权衡曲线中更高效地进行训练。

CAIRNS: Balancing Readability and Scientific Accuracy in Climate Adaptation Question Answering

凯恩斯：在气候适应问题回答中平衡可读性与科学准确性

Authors: Liangji Kong, Aditya Joshi, Sarvnaz Karimi
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2512.02251
Pdf link: https://arxiv.org/pdf/2512.02251
Abstract Climate adaptation strategies are proposed in response to climate change. They are practised in agriculture to sustain food production. These strategies can be found in unstructured data (for example, scientific literature from the Elsevier website) or structured (heterogeneous climate data via government APIs). We present Climate Adaptation question-answering with Improved Readability and Noted Sources (CAIRNS), a framework that enables experts -- farmer advisors -- to obtain credible preliminary answers from complex evidence sources from the web. It enhances readability and citation reliability through a structured ScholarGuide prompt and achieves robust evaluation via a consistency-weighted hybrid evaluator that leverages inter-model agreement with experts. Together, these components enable readable, verifiable, and domain-grounded question-answering without fine-tuning or reinforcement learning. Using a previously reported dataset of expert-curated question-answers, we show that CAIRNS outperforms the baselines on most of the metrics. Our thorough ablation study confirms the results on all metrics. To validate our LLM-based evaluation, we also report an analysis of correlations against human judgment.
中文摘要 针对气候变化提出了气候适应策略。它们在农业中被实践以维持粮食生产。这些策略可以存在于非结构化数据（例如爱思唯尔网站的科学文献）或结构化数据（通过政府API获取的异构气候数据）。我们提出了气候适应问题解答，采用改进的可读性和重要来源（CAIRNS）框架，使专家——农民顾问——能够从网络复杂的证据来源中获得可信的初步答案。它通过结构化的 ScholarGuide 提示提升可读性和引用可靠性，并通过一致性加权的混合评估器实现稳健评估，利用与专家的模型间一致性达成一致。这些组件共同实现了可读、可验证且基于领域的问题回答，无需微调或强化学习。利用此前报告的专家策划问答数据集，我们表明CAIRNS在大多数指标上优于基线数据。我们全面的消融研究在所有指标上都证实了这一结果。为了验证基于LLM的评估，我们还报告了与人类判断之间的相关性分析。

FOVA: Offline Federated Reinforcement Learning with Mixed-Quality Data

FOVA：面向混合质量数据的离线联邦强化学习

Authors: Nan Qiao, Sheng Yue, Ju Ren, Yaoxue Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.02350
Pdf link: https://arxiv.org/pdf/2512.02350
Abstract Offline Federated Reinforcement Learning (FRL), a marriage of federated learning and offline reinforcement learning, has attracted increasing interest recently. Albeit with some advancement, we find that the performance of most existing offline FRL methods drops dramatically when provided with mixed-quality data, that is, the logging behaviors (offline data) are collected by policies with varying qualities across clients. To overcome this limitation, this paper introduces a new vote-based offline FRL framework, named FOVA. It exploits a vote mechanism to identify high-return actions during local policy evaluation, alleviating the negative effect of low-quality behaviors from diverse local learning policies. Besides, building on advantage-weighted regression (AWR), we construct consistent local and global training objectives, significantly enhancing the efficiency and stability of FOVA. Further, we conduct an extensive theoretical analysis and rigorously show that the policy learned by FOVA enjoys strict policy improvement over the behavioral policy. Extensive experiments corroborate the significant performance gains of our proposed algorithm over existing baselines on widely used benchmarks.
中文摘要 离线联邦强化学习（FRL）是联邦学习与离线强化学习的结合，近年来引起了越来越多的关注。尽管有所进步，我们发现大多数现有离线FRL方法在提供混合质量数据时性能会大幅下降，即日志行为（离线数据）由不同客户端质量不同的策略收集。为克服这一限制，本文引入了一个新的基于投票的离线FRL框架，名为FOVA。它利用一种投票机制，在本地策略评估中识别高回报动作，缓解多样化本地学习策略带来的低质量行为的负面影响。此外，基于优势加权回归（AWR），我们构建了一致的本地和全局训练目标，显著提升了FOVA的效率和稳定性。此外，我们进行了广泛的理论分析，严谨地证明FOVA学习的策略相对于行为策略具有严格的策略改进。大量实验证实了我们提出的算法在广泛使用的基准测试中相对现有基线的显著性能提升。
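
Illustration (not from the paper): the abstract says FOVA builds its training objectives on advantage-weighted regression (AWR). The sketch below shows only the generic AWR idea, where each logged action's log-likelihood is weighted by the exponentiated advantage. It does not include FOVA's vote mechanism or its federated aggregation, and the temperature `beta` and the clip value `max_weight` are illustrative assumptions.

```python
import math


def awr_weights(advantages, beta=1.0, max_weight=20.0):
    """Exponentiated-advantage weights, clipped at max_weight.

    The exponent is capped before calling exp so that very large
    advantages cannot overflow.
    """
    cap = math.log(max_weight)
    return [math.exp(min(a / beta, cap)) for a in advantages]


def awr_loss(log_probs, advantages, beta=1.0, max_weight=20.0):
    """Negative weighted log-likelihood of the logged actions."""
    weights = awr_weights(advantages, beta, max_weight)
    total = sum(w * lp for w, lp in zip(weights, log_probs))
    return -total / len(log_probs)
```

Actions with higher estimated advantage receive larger weights, so minimizing this loss pulls the policy toward the better actions in the offline data while staying within the data's support.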

Beyond Playtesting: A Generative Multi-Agent Simulation System for Massively Multiplayer Online Games

超越游戏测试：一个面向大型多人在线游戏的生成式多智能体模拟系统

Authors: Ran Zhang, Kun Ouyang, Tiancheng Ma, Yida Yang, Dong Fang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.02358
Pdf link: https://arxiv.org/pdf/2512.02358
Abstract Optimizing numerical systems and mechanism design is crucial for enhancing player experience in Massively Multiplayer Online (MMO) games. Traditional optimization approaches rely on large-scale online experiments or parameter tuning over predefined statistical models, which are costly, time-consuming, and may disrupt player experience. Although simplified offline simulation systems are often adopted as alternatives, their limited fidelity prevents agents from accurately mimicking real player reasoning and reactions to interventions. To address these limitations, we propose a generative agent-based MMO simulation system empowered by Large Language Models (LLMs). By applying Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) on large-scale real player behavioral data, we adapt LLMs from general priors to game-specific domains, enabling realistic and interpretable player decision-making. In parallel, a data-driven environment model trained on real gameplay logs reconstructs dynamic in-game systems. Experiments demonstrate strong consistency with real-world player behaviors and plausible causal responses under interventions, providing a reliable, interpretable, and cost-efficient framework for data-driven numerical design optimization.
中文摘要 优化数值系统和机制设计对于提升大型多人在线（MMO）游戏中的玩家体验至关重要。传统优化方法依赖于大规模的在线实验或基于预定义统计模型的参数调优，这些方法成本高昂、耗时，且可能扰乱玩家体验。尽管简化的离线模拟系统常被用作替代方案，但其有限的保真度阻碍了智能体准确模拟真实玩家的推理和对干预的反应。为解决这些局限，我们提出了一个由大型语言模型（LLM）赋能的基于生成代理的MMO模拟系统。通过对大规模真实玩家行为数据应用监督微调（SFT）和强化学习（RL），我们将LLM从一般先验调整到游戏特定领域，实现真实且可解释的玩家决策。与此同时，基于真实游戏日志训练的数据驱动环境模型重建动态游戏内系统。实验显示出与现实玩家行为的高度一致性，以及干预下合理的因果反应，提供了可靠、可解释且成本效益高的数据驱动数值设计优化框架。

VACoT: Rethinking Visual Data Augmentation with VLMs

VACoT：重新思考用VLM进行视觉数据增强

Authors: Zhengzhuo Xu, Chong Sun, SiNan Du, Chen Li, Jing Lyu, Chun Yuan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.02361
Pdf link: https://arxiv.org/pdf/2512.02361
Abstract While visual data augmentation remains a cornerstone for training robust vision models, it has received limited attention in visual language models (VLMs), which predominantly rely on large-scale real data acquisition or synthetic diversity. Consequently, they may struggle with basic perception tasks that conventional models handle reliably. Given the substantial cost of pre-training and fine-tuning VLMs, continued training on augmented data yields limited and diminishing returns. In this paper, we present Visual Augmentation Chain-of-Thought (VACoT), a framework that dynamically invokes image augmentations during model inference. By incorporating post-hoc transformations such as denoising, VACoT substantially improves robustness on challenging and out-of-distribution inputs, especially in OCR-related adversarial scenarios. Distinct from prior approaches limited to local cropping, VACoT integrates a structured collection of general visual augmentations, broadening the query image views while reducing training complexity and computational overhead with efficient agentic reinforcement learning. We propose a conditional reward scheme that encourages necessary augmentation while penalizing verbose responses, ensuring concise and effective reasoning in perception tasks. We demonstrate the superiority of VACoT with extensive experiments on 13 perception benchmarks and further introduce AdvOCR to highlight the generalization benefits of post-hoc visual augmentations in adversarial scenarios.
中文摘要 虽然视觉数据增强仍是训练稳健视觉模型的基石，但在视觉语言模型（VLMs）中关注有限，后者主要依赖大规模真实数据采集或合成多样性。因此，他们可能会在传统模型可靠处理的基本感知任务上遇到困难。鉴于预训练和微调VLM的成本巨大，继续在增强数据上训练收益有限且递减。本文介绍了视觉增强思维链（VACoT），这一框架在模型推断过程中动态调用图像增强。通过引入如去噪等事后变换，VACoT显著提升了对具有挑战性和非分布输入的鲁棒性，尤其是在OCR相关的对抗场景中。与以往仅限于局部裁剪的方法不同，VACoT 集成了一组结构化的通用视觉增强，拓宽查询图像视图，同时通过高效的代理强化学习降低训练复杂性和计算开销。我们提出了一种条件奖励方案，鼓励必要的补充，同时惩罚冗长的回答，确保感知任务中的推理简洁有效。我们通过对13个感知基准的广泛实验证明了VACoT的优越性，并进一步介绍AdvOCR，以突出事后视觉增强在对抗场景中的泛化优势。

Risk-Sensitive Q-Learning in Continuous Time with Application to Dynamic Portfolio Selection

连续时间的风险敏感Q学习及其在动态投资组合选择中的应用

Authors: Chuhan Xie
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2512.02386
Pdf link: https://arxiv.org/pdf/2512.02386
Abstract This paper studies the problem of risk-sensitive reinforcement learning (RSRL) in continuous time, where the environment is characterized by a controllable stochastic differential equation (SDE) and the objective is a potentially nonlinear functional of cumulative rewards. We prove that when the functional is an optimized certainty equivalent (OCE), the optimal policy is Markovian with respect to an augmented environment. We also propose CT-RS-q, a risk-sensitive q-learning algorithm based on a novel martingale characterization approach. Finally, we run a simulation study on a dynamic portfolio selection problem and illustrate the effectiveness of our algorithm.
中文摘要 本文研究了连续时间内的风险敏感强化学习（RSRL）问题，其中环境由可控随机微分方程（SDE）表征，目标是累积奖励的潜在非线性泛函。我们证明，当泛函是优化确定性等价（OCE）时，最优策略相对于增强环境是马尔可夫的。我们还提出了CT-RS-q，一种基于新颖的鞅刻画方法的风险敏感q-learning算法。最后，我们对动态投资组合选择问题进行了模拟研究，展示了算法的有效性。
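
Illustration (not from the paper): an optimized certainty equivalent is commonly written as OCE_u(X) = sup over lam of { lam + E[u(X - lam)] } for a concave utility u. Choosing u(t) = -max(-t, 0) / alpha gives the lower-tail conditional value-at-risk (CVaR) of a reward X. The sketch below evaluates that special case on a finite sample. The abstract does not state which OCE the paper uses in its experiments, so CVaR is an assumed example here.

```python
def oce_cvar(samples, alpha):
    """Empirical lower-tail CVaR via the OCE form
    max over lam of  lam - E[max(lam - X, 0)] / alpha.
    """
    xs = list(samples)
    n = len(xs)

    def objective(lam):
        shortfall = sum(max(lam - x, 0.0) for x in xs) / n
        return lam - shortfall / alpha

    # The objective is concave and piecewise linear in lam with kinks only
    # at the sample values, so its maximum is attained at one of them.
    return max(objective(x) for x in xs)
```

For ten equally likely rewards 1, 2, ..., 10 and `alpha=0.2`, the result is the mean of the worst 20% of outcomes, (1 + 2) / 2 = 1.5; with `alpha=1.0` the risk-neutral mean 5.5 is recovered.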

Synthetic Error Injection Fails to Elicit Self-Correction In Language Models

合成错误注入在语言模型中未能引发自我纠正

Authors: David X. Wu, Shreyas Kapur, Anant Sahai, Stuart Russell
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.02389
Pdf link: https://arxiv.org/pdf/2512.02389
Abstract Reinforcement learning has become the dominant paradigm for eliciting reasoning and self-correction capabilities in large language models, but its computational expense motivates exploration of alternatives. Inspired by techniques from autonomous driving and robotics, we investigate whether supervised learning with synthetic error injection can induce self-correction abilities in language models. Our approach inserts artificial errors into reasoning chains, masks them, and supervises the model to recognize and correct these mistakes. Despite the intuitive appeal of this method, we find that it fails to significantly improve performance even on simple synthetic tasks across multiple models. Moreover, even when the model catches its own error, it often parrots the original mistake. We find that the distribution shift of synthetic errors to on-policy errors significantly degrades the error-correction capabilities of the fine-tuned model, even with good synthetic coverage of on-policy errors. Our results help explain why on-policy reinforcement learning methods have proven uniquely effective for eliciting self-correction.
中文摘要 强化学习已成为大型语言模型中引导推理和自我纠正能力的主导范式，但其计算成本促使人们探索替代方案。受自动驾驶和机器人技术启发，我们研究了带有合成错误注入的监督学习是否能在语言模型中诱导自我纠正能力。我们的方法在推理链中插入人工错误，掩盖它们，并监督模型识别和纠正这些错误。尽管这种方法直观上很有吸引力，但我们发现即使在多个模型的简单合成任务中，它也无法显著提升性能。此外，即使模型发现了自己的错误，它也常常会重复原始错误。我们发现，即使对策略上错误进行了良好的合成覆盖，将合成错误向策略内错误的分布转移会显著降低微调模型的纠错能力。我们的结果有助于解释为何策略强化学习方法在引发自我纠正方面被证明如此有效。

Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch

Skywork-R1V4：通过交织思维与图像与深度研究，迈向智能多模态智能

Authors: Yifan Zhang, Liang Hu, Haofeng Sun, Peiyu Wang, Yichen Wei, Shukang Yin, Jiangbo Pei, Wei Shen, Peng Xia, Yi Peng, Tianyidan Xie, Eric Li, Yang Liu, Xuchen Song, Yahui Zhou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.02395
Pdf link: https://arxiv.org/pdf/2512.02395
Abstract Despite recent progress in multimodal agentic systems, existing approaches often treat image manipulation and web search as disjoint capabilities, rely heavily on costly reinforcement learning, and lack planning grounded in real tool-execution traces. To address these limitations, we present Skywork-R1V4, a 30B (A3B) parameter multimodal agentic model that unifies multimodal planning, active image manipulation ("thinking with images"), deep multimodal search, and, most critically, interleaved reasoning that dynamically alternates between visual operations and external knowledge retrieval. Trained solely via supervised fine-tuning on fewer than 30,000 high-quality, planning-execution-consistent trajectories and validated through stepwise consistency filtering, Skywork-R1V4 achieves state-of-the-art results across perception and multimodal search benchmarks: it scores 66.1 on MMSearch and 67.2 on FVQA, surpassing Gemini 2.5 Flash on all 11 metrics. Skywork-R1V4 exhibits emergent long-horizon reasoning at inference time, successfully orchestrating more than 10 tool calls to solve complex, multi-step tasks. Our results demonstrate that sophisticated agentic multimodal intelligence can be achieved through carefully curated supervised learning alone, without any reliance on reinforcement learning.
中文摘要 尽管多模态智能体系统近期取得了进展，现有方法往往将图像处理和网页搜索视为不相交的能力，严重依赖昂贵的强化学习，且缺乏基于真实工具执行痕迹的规划。为解决这些局限性，我们提出了Skywork-R1V4，一个30B（A3B）参数多模态代理模型，统一了多模态规划、主动图像处理（“用图像思考”）、深度多模态搜索，以及最关键的交错推理，该推理在视觉操作与外部知识检索之间动态交替切换。Skywork-R1V4仅通过监督微调训练，针对不到3万条高质量、规划-执行一致的轨迹，并通过分阶段一致性过滤验证，在感知和多模态搜索基准测试中均取得最先进的成绩：在MMSearch上得分66.1，在FVQA上得分67.2，在所有11项指标上均超过Gemini 2.5 Flash。Skywork-R1V4在推理时展现了涌现的长视野推理能力，成功协调了10多个工具调用以解决复杂的多步任务。我们的结果表明，复杂的智能多模态智能仅通过精心策划的监督学习就能实现，无需依赖强化学习。

Dynamic Configuration of On-Street Parking Spaces using Multi Agent Reinforcement Learning

利用多智能体强化学习动态配置路边停车位

Authors: Oshada Jayasinghe, Farhana Choudhury, Egemen Tanin, Shanika Karunasekera
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.02406
Pdf link: https://arxiv.org/pdf/2512.02406
Abstract With travel demand higher than ever, traffic congestion has become a major concern in most urban areas. Allocating space for on-street parking further hinders traffic flow by limiting the effective road width available for driving. With the advancement of vehicle-to-infrastructure connectivity technologies, we explore how the impact of on-street parking on traffic congestion could be minimized by dynamically configuring on-street parking spaces. To that end, we formulate dynamic on-street parking space configuration as an optimization problem, and we follow a data-driven approach, considering the nature of our problem. Our proposed solution comprises a two-layer multi-agent reinforcement learning based framework, which is inherently scalable to large road networks. The lane-level agents are responsible for deciding the optimal parking space configuration for each lane, and we introduce a novel Deep Q-learning architecture which effectively utilizes long short-term memory networks and graph attention networks to capture the spatio-temporal correlations evident in the given problem. The block-level agents control the actions of the lane-level agents and maintain a sufficient level of parking around the block. We conduct a set of comprehensive experiments using SUMO, on both synthetic data and real-world data from the city of Melbourne. Our experiments show that the proposed framework could reduce the average travel time loss of vehicles significantly, by up to 47%, with a negligible increase in the walking distance for parking.
中文摘要 随着出行需求比以往任何时候都大，交通拥堵已成为大多数城市地区的重大问题。为路边停车分配空间进一步阻碍交通流动，因为限制了可行驶的有效道路宽度。随着车辆与基础设施互联技术的发展，我们探讨如何通过动态配置路边停车位，最大限度地减少路边停车对交通拥堵的影响。为此，我们将动态路边停车位配置表述为优化问题，并采用数据驱动的方法，考虑问题的性质。我们提出的解决方案包括一个基于两层的多智能体强化学习框架，具有对大型道路网络的本质可扩展性。车道级智能体负责决定每个车道的最佳停车位配置，我们引入了一种新型的深度Q学习架构，有效利用长短期记忆网络和图注意力网络，捕捉问题中明显的时空相关性。街区级智能体控制车道级智能体的动作，并保持街区周围足够的停车位。我们利用SUMO进行了一系列全面的实验，涵盖了合成数据以及墨尔本市的真实世界数据。我们的实验显示，所提框架可显著减少车辆的平均行车时间损失，最高可达47%，且停车步行距离增加可忽略不计。

Data Curation Through the Lens of Spectral Dynamics: Static Limits, Dynamic Acceleration, and Practical Oracles

通过谱动力学视角进行数据管理：静态极限、动态加速与实用预言机

Authors: Yizhou Zhang, Lun Du
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.02409
Pdf link: https://arxiv.org/pdf/2512.02409
Abstract Large-scale neural models are increasingly trained with data pruning, synthetic data generation, cross-model distillation, reinforcement learning from human feedback (RLHF), and difficulty-based sampling. While several of these data-centric strategies reliably improve training efficiency and downstream performance, others fail to provide meaningful gains -- most notably self-generated synthetic data, which often increases dataset volume without enhancing model capability. We formalize data curation as reweighting the sampling distribution and map its effect onto the eigenstructure of the data-induced operator. Our first main result shows that static pruning induces a bounded operator and therefore cannot change the spectral tail exponent; it provides at most finite-region improvements and cannot alter asymptotic neural scaling. Our second result analyzes time-dependent data curation, showing that an ideal oracle capable of tracking spectral residuals and continuously re-normalizing the tail can provably accelerate learning -- although practical systems can only approximate this behavior.
中文摘要 大规模神经模型越来越多地通过数据剪枝、合成数据生成、跨模型蒸馏、基于人类反馈的强化学习（RLHF）以及基于难度的抽样进行训练。虽然其中一些以数据为中心的策略可靠地提升了训练效率和下游性能，但也有一些未能带来实质性的提升，其中最显著的是自生成的合成数据，往往增加了数据集量却不增强模型能力。我们将数据整理形式化为对抽样分布的重新加权，并将其影响映射到数据诱导算子的特征结构上。我们的第一个主要结果表明，静态剪枝诱导一个有界算子，因此无法改变谱尾指数；它最多只能提供有限区域的改进，且无法改变渐近神经缩放规律。我们的第二个结果分析了时间依赖的数据整理，表明一个能够追踪谱残差并持续重整尾部的理想预言机可以被证明能够加速学习，尽管实际系统只能近似这种行为。

GUI Exploration Lab: Enhancing Screen Navigation in Agents via Multi-Turn Reinforcement Learning

图形界面探索实验室：通过多回合强化学习提升代理屏幕导航

Authors: Haolong Yan, Yeqing Shen, Xin Huang, Jia Wang, Kaijun Tan, Zhixuan Liang, Hongxin Li, Zheng Ge, Osamu Yoshie, Si Li, Xiangyu Zhang, Daxin Jiang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.02423
Pdf link: https://arxiv.org/pdf/2512.02423
Abstract With the rapid development of Large Vision Language Models, the focus of Graphical User Interface (GUI) agent tasks shifts from single-screen tasks to complex screen navigation challenges. However, real-world GUI environments, such as PC software and mobile Apps, are often complex and proprietary, making it difficult to obtain the comprehensive environment information needed for agent training and evaluation. This limitation hinders systematic investigation and benchmarking of agent navigation capabilities. To address this limitation, we introduce GUI Exploration Lab, a simulation environment engine for GUI agent navigation research that enables flexible definition and composition of screens, icons, and navigation graphs, while providing full access to environment information for comprehensive agent training and evaluation. Through extensive experiments, we find that supervised fine-tuning enables effective memorization of fundamental knowledge, serving as a crucial foundation for subsequent training. Building on this, single-turn reinforcement learning further enhances generalization to unseen scenarios. Finally, multi-turn reinforcement learning encourages the development of exploration strategies through interactive trial and error, leading to further improvements in screen navigation performance. We validate our methods on both static and interactive benchmarks, demonstrating that our findings generalize effectively to real-world scenarios. These findings demonstrate the advantages of reinforcement learning approaches in GUI navigation and offer practical guidance for building more capable and generalizable GUI agents.
中文摘要 随着大型视觉语言模型的快速发展，图形用户界面（GUI）代理任务的重点从单屏任务转向复杂的屏幕导航挑战。然而，现实世界的图形用户界面环境，如PC软件和移动应用，往往复杂且专有，难以获得代理培训和评估所需的全面环境信息。这一限制阻碍了对代理导航能力的系统性研究和基准测试。为解决这一限制，我们推出了GUI Exploration Lab，一款用于GUI代理导航研究的模拟环境引擎，能够灵活定义和组合屏幕、图标和导航图，同时提供全面的环境信息访问，支持全面的代理培训和评估。通过大量实验，我们发现监督式微调能够有效记忆基础知识，成为后续培训的重要基础。基于此，单回合强化学习进一步增强了对未见场景的泛化能力。最后，多回合强化学习鼓励通过互动试错法开发探索策略，进一步提升屏幕导航性能。我们在静态和交互基准上验证了我们的方法，证明我们的发现能够有效地推广到现实世界场景。这些发现展示了强化学习方法在图形界面导航中的优势，并为构建更具能力和可推广性的图形界面代理提供了实用指导。

Cross-Domain Offline Policy Adaptation with Dynamics- and Value-Aligned Data Filtering

跨域离线策略适配，采用动态和价值对齐的数据过滤

Authors: Zhongjian Qiao, Rui Yang, Jiafei Lyu, Chenjia Bai, Xiu Li, Zhuoran Yang, Siyang Gao, Shuang Qiu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.02435
Pdf link: https://arxiv.org/pdf/2512.02435
Abstract Cross-Domain Offline Reinforcement Learning aims to train an agent deployed in the target environment, leveraging both a limited target domain dataset and a source domain dataset with (possibly) sufficient data coverage. Due to the underlying dynamics misalignment between the source and target domain, simply merging the data from two datasets may incur inferior performance. Recent advances address this issue by selectively sharing source domain samples that exhibit dynamics alignment with the target domain. However, these approaches focus solely on dynamics alignment and overlook value alignment, i.e., selecting high-quality, high-value samples from the source domain. In this paper, we first demonstrate that both dynamics alignment and value alignment are essential for policy learning, by examining the limitations of the current theoretical framework for cross-domain RL and establishing a concrete sub-optimality gap of a policy trained on the source domain and evaluated on the target domain. Motivated by the theoretical insights, we propose to selectively share those source domain samples with both high dynamics and value alignment and present our Dynamics- and Value-aligned Data Filtering (DVDF) method. We design a range of dynamics shift settings, including kinematic and morphology shifts, and evaluate DVDF on various tasks and datasets, as well as in challenging extremely low-data settings where the target domain dataset contains only 5,000 transitions. Extensive experiments demonstrate that DVDF consistently outperforms prior strong baselines and delivers exceptional performance across multiple tasks and datasets.
中文摘要 跨域离线强化学习旨在训练部署在目标环境中的代理，利用有限的目标域数据集和（可能）足够数据覆盖的源域数据集。由于源域与目标域之间的动态不匹配，单纯合并两个数据集的数据可能会导致性能下降。近期的进展通过选择性共享与目标域动态对齐的源域样本来解决这一问题。然而，这些方法仅关注动态对齐，忽视了价值对齐，即从源域中选择高质量、高价值的样本。本文首先通过考察当前跨域强化学习理论框架的局限性，并确立基于源域训练、在目标域评估的策略的具体次优差距，证明动态对齐和价值对齐对策略学习至关重要。基于理论见解，我们提出选择性地共享那些同时具有高动态对齐和高价值对齐的源域样本，并提出我们的 Dynamics- and Value-aligned Data Filtering（DVDF）方法。我们设计了多种动力学偏移设置，包括运动学和形态变化，并在各种任务和数据集上评估DVDF，以及在极低数据的环境中，目标领域数据集仅包含5,000个转移。大量实验表明，DVDF在多个任务和数据集中持续优于以往强劲基线，表现出色。
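
Illustration (not from the paper): the abstract proposes sharing only those source-domain samples that are well aligned with the target domain in both dynamics and value. The sketch below shows one simple way such a two-criterion filter could look. The `SourceSample` fields, the rank-based rule, and `keep_fraction` are all hypothetical; the abstract does not describe how DVDF actually scores or selects samples.

```python
from dataclasses import dataclass


@dataclass
class SourceSample:
    dynamics_gap: float  # hypothetical score: lower means closer to target dynamics
    value: float         # hypothetical score: estimated value of the sample


def filter_source_samples(samples, keep_fraction=0.5):
    """Keep samples ranked in the best keep_fraction on BOTH criteria."""
    k = max(1, int(len(samples) * keep_fraction))
    indices = range(len(samples))
    best_gap = sorted(indices, key=lambda i: samples[i].dynamics_gap)[:k]
    best_value = sorted(indices, key=lambda i: -samples[i].value)[:k]
    keep = set(best_gap) & set(best_value)
    return [samples[i] for i in sorted(keep)]
```

A sample with a small dynamics gap but low value, or with high value but a large dynamics gap, is dropped; only samples that do well on both criteria are kept.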

A Visual Analytics System to Understand Behaviors of Multi Agents in Reinforcement Learning

一个可视化分析系统，用于理解强化学习中多智能体的行为

Authors: Changhee Lee, Jeongmin Rhee, DongHwa Shin
Subjects: Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2512.02442
Pdf link: https://arxiv.org/pdf/2512.02442
Abstract Multi-Agent Reinforcement Learning (MARL) is a branch of machine learning in which agents interact and learn optimal policies through trial and error, addressing complex scenarios where multiple agents interact and learn in the same environment at the same time. Analyzing and understanding these complex interactions is challenging, and existing analysis methods are limited in their ability to fully reflect and interpret this complexity. To address these challenges, we provide MARLViz, a visual analytics system for visualizing and analyzing the policies and interactions of agents in MARL environments. The system is designed to visually show the difference in behavior of agents under different environment settings and help users understand complex interaction patterns. In this study, we analyzed agents with similar behaviors and selected scenarios to understand the interactions of the agents, which made it easier to understand the strategies of agents in MARL.
中文摘要 多智能体强化学习（MARL）是机器学习的一个分支，智能体通过反复试验来交互并学习最优策略，解决多个智能体在同一环境中同时交互和学习的复杂场景。分析和理解这些复杂相互作用具有挑战性，现有分析方法在充分反映和解读这种复杂性方面有限。为应对这些挑战，我们提供了MARLViz，一套可视化分析系统，用于可视化和分析MARL环境中代理的策略与交互。该系统旨在直观地展示不同环境下代理行为的差异，帮助用户理解复杂的交互模式。本研究分析了行为相似的代理和选定场景，以理解代理之间的相互作用，从而更容易理解代理在MARL中的策略。

Dual-Robust Cross-Domain Offline Reinforcement Learning Against Dynamics Shifts

双稳健跨域离线强化学习：针对动态变化

Authors: Zhongjian Qiao, Rui Yang, Jiafei Lyu, Xiu Li, Zhongxiang Dai, Zhuoran Yang, Siyang Gao, Shuang Qiu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.02486
Pdf link: https://arxiv.org/pdf/2512.02486
Abstract Single-domain offline reinforcement learning (RL) often suffers from limited data coverage, while cross-domain offline RL handles this issue by leveraging additional data from other domains with dynamics shifts. However, existing studies primarily focus on train-time robustness (handling dynamics shifts from training data), neglecting the test-time robustness against dynamics perturbations when deployed in practical scenarios. In this paper, we investigate dual (both train-time and test-time) robustness against dynamics shifts in cross-domain offline RL. We first empirically show that the policy trained with cross-domain offline RL exhibits fragility under dynamics perturbations during evaluation, particularly when target domain data is limited. To address this, we introduce a novel robust cross-domain Bellman (RCB) operator, which enhances test-time robustness against dynamics perturbations while staying conservative to the out-of-distribution dynamics transitions, thus guaranteeing the train-time robustness. To further counteract potential value overestimation or underestimation caused by the RCB operator, we introduce two techniques, the dynamic value penalty and the Huber loss, into our framework, resulting in the practical Dual-RObust Cross-domain Offline RL (DROCO) algorithm. Extensive empirical results across various dynamics shift scenarios show that DROCO outperforms strong baselines and exhibits enhanced robustness to dynamics perturbations.
中文摘要 单域离线强化学习（RL）通常存在数据覆盖有限的问题，而跨域离线强化学习则通过利用来自其他域、带有动力学偏移的额外数据来解决这一问题。然而，现有研究主要关注训练时鲁棒性（即处理训练数据中的动力学偏移），忽视了在实际场景中部署时对动力学扰动的测试时鲁棒性。本文探讨跨域离线强化学习中对动力学偏移的双重（训练时和测试时）鲁棒性。我们首先通过实验表明，使用跨域离线强化学习训练的策略在评估时的动力学扰动下表现脆弱，尤其是在目标域数据有限时。为此，我们引入了一种新颖的鲁棒跨域贝尔曼（RCB）算子，它在增强对动力学扰动的测试时鲁棒性的同时，对分布外的动力学转移保持保守，从而保证训练时鲁棒性。为了进一步抵消RCB算子可能造成的价值高估或低估，我们在框架中引入了两种技术：动态价值惩罚和Huber损失，最终形成了实用的 Dual-RObust Cross-domain Offline RL（DROCO）算法。在各种动力学偏移场景中，大量实验结果表明，DROCO优于强基线，并展现出对动力学扰动更强的鲁棒性。
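The DROCO abstract names the Huber loss as one of two techniques for tempering value overestimation or underestimation. For reference, the standard Huber loss on a temporal-difference error is sketched below. The threshold `delta=1.0` is an assumed default, and how DROCO applies the loss inside its critic update is not stated in the abstract.

```python
def huber_loss(td_error, delta=1.0):
    """Standard Huber loss: quadratic near zero, linear for large errors.

    `delta` (assumed default 1.0) is the point where the loss switches from
    quadratic to linear, which limits the influence of large TD errors.
    """
    abs_err = abs(td_error)
    if abs_err <= delta:
        return 0.5 * td_error * td_error
    return delta * (abs_err - 0.5 * delta)


small = huber_loss(0.5)    # quadratic regime
large = huber_loss(3.0)    # linear regime
```

Compared with a squared error of `0.5 * 3.0**2 = 4.5`, the linear regime gives a smaller penalty and a bounded gradient for the same error, which is the usual motivation for using it on noisy value targets.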

AID: Agent Intent from Diffusion for Multi-Agent Informative Path Planning

辅助：来自扩散的智能体意图，用于多智能体信息路径规划

Authors: Jeric Lew, Yuhong Cao, Derek Ming Siang Tan, Guillaume Sartoretti
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.02535
Pdf link: https://arxiv.org/pdf/2512.02535
Abstract Information gathering in large-scale or time-critical scenarios (e.g., environmental monitoring, search and rescue) requires broad coverage within limited time budgets, motivating the use of multi-agent systems. These scenarios are commonly formulated as multi-agent informative path planning (MAIPP), where multiple agents must coordinate to maximize information gain while operating under budget constraints. A central challenge in MAIPP is ensuring effective coordination while the belief over the environment evolves with incoming measurements. Recent learning-based approaches address this by using distributions over future positions as "intent" to support coordination. However, these autoregressive intent predictors are computationally expensive and prone to compounding errors. Inspired by the effectiveness of diffusion models as expressive, long-horizon policies, we propose AID, a fully decentralized MAIPP framework that leverages diffusion models to generate long-term trajectories in a non-autoregressive manner. AID first performs behavior cloning on trajectories produced by existing MAIPP planners and then fine-tunes the policy using reinforcement learning via Diffusion Policy Policy Optimization (DPPO). This two-stage pipeline enables the policy to inherit expert behavior while learning improved coordination through online reward feedback. Experiments demonstrate that AID consistently improves upon the MAIPP planners it is trained from, achieving up to 4x faster execution and 17% increased information gain, while scaling effectively to larger numbers of agents. Our implementation is publicly available at this https URL.
中文摘要 在大规模或时间关键场景（如环境监测、搜救）中的信息收集需要在有限的时间预算内实现广泛的覆盖，这促使多智能体系统的使用。这些情景通常被表述为多智能体信息路径规划（MAIPP），即多个智能体必须协调，以在预算限制下最大化信息收益。MAIPP的核心挑战之一是确保在环境认知随着测量数据不断变化的同时，实现有效的协调。近期基于学习的方法通过将未来职位分布作为“意图”来支持协调。然而，这些自回归意图预测器计算成本高且容易出现复合误差。受扩散模型作为表达性、长远视野政策的有效性启发，我们提出了AID框架，这是一个完全去中心化的MAIPP框架，利用扩散模型以非自回归的方式生成长期轨迹。AID首先对现有MAIPP规划器生成的轨迹进行行为克隆，然后通过扩散策略优化（DPPO）通过强化学习对策略进行微调。这一两阶段培养流程使政策能够继承专家行为，同时通过在线奖励反馈学习改进协调。实验表明，AID持续优于其训练的MAIPP规划器，实现执行速度高达4倍，信息增益提升17%，同时有效扩展至更多代理。我们的实现可在此 https URL 公开获取。

CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning

CUDA-L2：通过强化学习超越 cuBLAS 矩阵乘法性能

Authors: Songqiao Su, Xiaofei Sun, Xiaoya Li, Albert Wang, Jiwei Li, Chris Shum
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.02551
Pdf link: https://arxiv.org/pdf/2512.02551
Abstract In this paper, we propose CUDA-L2, a system that combines large language models (LLMs) and reinforcement learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels. Using CUDA execution speed as the RL reward, CUDA-L2 automatically optimizes HGEMM kernels across 1,000 configurations. CUDA-L2 systematically outperforms major matmul baselines to date, from the widely-used this http URL to Nvidia's state-of-the-art closed-source libraries, i.e., cuBLAS and cuBLASLt. In offline mode, where kernels are executed consecutively without time intervals, CUDA-L2 yields +22.0% over this http URL on average; +19.2% over cuBLAS using the optimal layout configuration (normal-normal NN and transposed-normal TN); +16.8% over cuBLASLt-heuristic, which queries the cuBLASLt library and selects the algorithm based on the heuristic's suggestion; and +11.4% over the most competitive cuBLASLt-AutoTuning model, which selects the fastest algorithm from up to 100 candidates from cuBLASLt's suggestions. In server mode, where kernels are executed at random intervals simulating real-time inference, the speedups further increase to +28.7%, +26.0%, +22.4%, and +15.9% for this http URL, cuBLAS, cuBLASLt-heuristic, and cuBLASLt-AutoTuning respectively. CUDA-L2 shows that even the most performance-critical, heavily-optimized kernels like HGEMM can be improved through LLM-guided RL automation by systematically exploring configuration spaces at scales impractical for humans. Project and code can be found at this http URL
中文摘要 本文提出了CUDA-L2系统，该系统结合了大型语言模型（LLM）和强化学习（RL），自动优化半精度通用矩阵乘法（HGEMM）CUDA内核。以 CUDA 执行速度作为强化学习奖励，CUDA-L2 在 1,000 种配置上自动优化 HGEMM 内核。CUDA-L2 系统性地超越了迄今为止的主要 matmul 基线，从广泛使用的 this http URL 到 Nvidia 最先进的闭源库，即 cuBLAS 和 cuBLASLt。在离线模式下（内核连续执行且无时间间隔），CUDA-L2 相对 this http URL 平均提升 +22.0%；相对使用最优布局配置（normal-normal NN 和 transposed-normal TN）的 cuBLAS 提升 +19.2%；相对 cuBLASLt-heuristic（查询 cuBLASLt 库并根据启发式建议选择算法）提升 +16.8%；相对最具竞争力的 cuBLASLt-AutoTuning 模型（从 cuBLASLt 建议的最多 100 个候选算法中选择最快的算法）提升 +11.4%。在服务器模式下（内核以随机间隔执行以模拟实时推理），相对 this http URL、cuBLAS、cuBLASLt-heuristic 和 cuBLASLt-AutoTuning 的加速分别进一步提升至 +28.7%、+26.0%、+22.4% 和 +15.9%。CUDA-L2表明，即使是像HGEMM这样性能最关键、高度优化的内核，也可以通过LLM引导的强化学习自动化，在人类难以企及的规模上系统性地探索配置空间，从而得到改进。项目和代码可在此 http URL 找到

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

DeepSeek-v3.2：推动开放大型语言模型的前沿

Authors: DeepSeek-AI, Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenhao Xu, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Erhang Li, Fangqi Zhou, Fangyun Lin, Fucong Dai, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Hao Li, Haofen Liang, Haoran Wei, Haowei Zhang, Haowen Luo, Haozhe Ji, Honghui Ding, Hongxuan Tang, Huanqi Cao, Huazuo Gao, Hui Qu, Hui Zeng, Jialiang Huang, Jiashi Li, Jiaxin Xu, Jiewen Hu, Jingchang Chen, Jingting Xiang, Jingyang Yuan, Jingyuan Cheng, Jinhua Zhu, Jun Ran, Junguang Jiang, Junjie Qiu, Junlong Li, Junxiao Song, Kai Dong, Kaige Gao, Kang Guan, Kexin Huang, Kexing Zhou, Kezhao Huang, Kuai Yu, Lean Wang, Lecong Zhang, Lei Wang, Liang Zhao, Liangsheng Yin, Lihua Guo, Lingxiao Luo, Linwang Ma, Litong Wang, Liyue Zhang, M.S. Di, M.Y Xu, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingxu Zhou, Panpan Huang, Peixin Cong, Peiyi Wang, Qiancheng Wang, Qihao Zhu, Qingyang Li, Qinyu Chen, Qiushi Du, Ruiling Xu, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, Runqiu Yin, Runxin Xu, Ruomeng Shen, Ruoyu Zhang, S.H. Liu, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shaofei Cai
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.02556
Pdf link: https://arxiv.org/pdf/2512.02556
Abstract We introduce DeepSeek-V3.2, a model that harmonizes high computational efficiency with superior reasoning and agent performance. The key technical breakthroughs of DeepSeek-V3.2 are as follows: (1) DeepSeek Sparse Attention (DSA): We introduce DSA, an efficient attention mechanism that substantially reduces computational complexity while preserving model performance in long-context scenarios. (2) Scalable Reinforcement Learning Framework: By implementing a robust reinforcement learning protocol and scaling post-training compute, DeepSeek-V3.2 performs comparably to GPT-5. Notably, our high-compute variant, DeepSeek-V3.2-Speciale, surpasses GPT-5 and exhibits reasoning proficiency on par with Gemini-3.0-Pro, achieving gold-medal performance in both the 2025 International Mathematical Olympiad (IMO) and the International Olympiad in Informatics (IOI). (3) Large-Scale Agentic Task Synthesis Pipeline: To integrate reasoning into tool-use scenarios, we developed a novel synthesis pipeline that systematically generates training data at scale. This methodology facilitates scalable agentic post-training, yielding substantial improvements in generalization and instruction-following robustness within complex, interactive environments.
中文摘要 我们介绍DeepSeek-V3.2模型，该模型将高计算效率与优越的推理和代理表现相结合。DeepSeek-V3.2 的关键技术突破如下：（1） DeepSeek 稀疏注意力（DSA）：我们引入了 DSA，一种高效的注意力机制，在长时间上下文场景下大幅降低计算复杂度，同时保持模型性能。（2）可扩展强化学习框架：通过实现强化学习协议和训练后计算扩展，DeepSeek-V3.2的性能可与GPT-5相当。值得注意的是，我们的高计算变体DeepSeek-V3.2-Speciale超越了GPT-5，并在推理能力上与Gemini-3.0-Pro相当，在2025年国际数学奥林匹克（IMO）和国际信息学奥林匹克（IOI）中均获得金牌。（3）大规模代理任务综合流程：为了将推理整合到工具使用场景中，我们开发了一种新型综合流程，能够系统地大规模生成训练数据。该方法促进了可扩展的代理后训练，在复杂交互环境中显著提升泛化性和指令遵循鲁棒性。

From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks

从模仿到判别：迈向增强跨领域推理任务的通用课程优势机制

Authors: Changpeng Yang, Jinyang Wu, Yuchen Liu, Shuai Zhang, Yang Li, Qiliang Liang, Hongzhen Wang, Shuai Nie, Jiaming Xu, Runyu Shi, Ying Huang, Guoquan Zhang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.02580
Pdf link: https://arxiv.org/pdf/2512.02580
Abstract Reinforcement learning has emerged as a paradigm for post-training large language models, boosting their reasoning capabilities. Such approaches compute an advantage value for each sample, reflecting better or worse performance than expected, thereby yielding both positive and negative signals for training. However, the indiscriminate mixing of the two signals in existing methods, especially from the early stages, may lead to ambiguous guidance and limited gains. To address this issue, we propose CAPO (Curriculum Advantage Policy Optimization), an adaptive curriculum mechanism based on advantage signals. The proposed mechanism bootstraps imitation learning with positive-only advantage samples to establish robust foundations, and subsequently introduces negative signals to cultivate discriminative capabilities, thereby improving generalization across complex scenarios. Compatible with diverse optimization methods including GRPO, PPO, RLOO, and Reinforce++, our method consistently achieves stable and significant improvements in mathematical reasoning tasks, and further generalizes effectively to multimodal Graphical User Interface (GUI) reasoning scenarios, establishing itself as a versatile and robust optimization framework.
中文摘要 强化学习已成为大型语言模型后训练的范式，提升了它们的推理能力。此类方法为每个样本计算优势值，反映其表现优于还是劣于预期，从而为训练提供正负两种信号。然而，现有方法不加区分地混合这两种信号，尤其是在训练早期，可能导致指导模糊、收益有限。为解决这一问题，我们提出了CAPO（Curriculum Advantage Policy Optimization），这是一种基于优势信号的自适应课程机制。该机制先以仅含正优势的样本引导模仿学习，建立坚实基础，随后引入负信号以培养判别能力，从而提升复杂场景下的泛化能力。该方法兼容包括GRPO、PPO、RLOO和Reinforce++在内的多种优化方法，在数学推理任务中持续取得稳定且显著的改进，并进一步有效地推广到多模态图形用户界面（GUI）推理场景，确立了其作为多功能且稳健优化框架的地位。
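The two-phase idea in the CAPO abstract (positive-only advantages first, negative signals later) can be sketched as a simple mask over advantage values. The abstract describes the curriculum as adaptive but does not give the switching rule, so the fixed `switch_step` below is a stand-in assumption, not the paper's mechanism.

```python
def curriculum_advantages(advantages, step, switch_step):
    """Return the advantages actually used for the policy update at `step`.

    Phase 1 (step < switch_step): negative advantages are zeroed out, so the
    update only reinforces better-than-expected samples (imitation-like).
    Phase 2 (step >= switch_step): all advantages are used, so the update also
    pushes probability away from worse-than-expected samples.
    `switch_step` is a hypothetical fixed boundary used only for illustration.
    """
    if step < switch_step:
        return [a if a > 0 else 0.0 for a in advantages]
    return list(advantages)


raw = [1.0, -0.5, 0.2]
early = curriculum_advantages(raw, step=0, switch_step=10)
late = curriculum_advantages(raw, step=10, switch_step=10)
```

Because the mask acts only on the advantage values, it can sit in front of any advantage-based update rule, which is consistent with the abstract's claim of compatibility with GRPO, PPO, RLOO, and Reinforce++.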

GoRL: An Algorithm-Agnostic Framework for Online Reinforcement Learning with Generative Policies

GoRL：一个算法无关的在线强化学习框架，采用生成策略

Authors: Chubin Zhang, Zhenglin Wan, Feng Chen, Xingrui Yu, Ivor Tsang, Bo An
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.02581
Pdf link: https://arxiv.org/pdf/2512.02581
Abstract Reinforcement learning (RL) faces a persistent tension: policies that are stable to optimize are often too simple to represent the multimodal action distributions needed for complex control. Gaussian policies provide tractable likelihoods and smooth gradients, but their unimodal form limits expressiveness. Conversely, generative policies based on diffusion or flow matching can model rich multimodal behaviors; however, in online RL, they are frequently unstable due to intractable likelihoods and noisy gradients propagating through deep sampling chains. We address this tension with a key structural principle: decoupling optimization from generation. Building on this insight, we introduce GoRL (Generative Online Reinforcement Learning), a framework that optimizes a tractable latent policy while utilizing a conditional generative decoder to synthesize actions. A two-timescale update schedule enables the latent policy to learn stably while the decoder steadily increases expressiveness, without requiring tractable action likelihoods. Across a range of continuous-control tasks, GoRL consistently outperforms both Gaussian policies and recent generative-policy baselines. Notably, on the HopperStand task, it reaches a normalized return above 870, more than 3 times that of the strongest baseline. These results demonstrate that separating optimization from generation provides a practical path to policies that are both stable and highly expressive.
中文摘要 强化学习（RL）面临一个长期存在的矛盾：易于稳定优化的策略往往过于简单，无法表示复杂控制所需的多模态动作分布。高斯策略提供了易处理的似然和平滑的梯度，但其单峰形式限制了表达能力。相反，基于扩散或流匹配的生成策略可以建模丰富的多模态行为；然而，在在线强化学习中，由于似然难以处理，且噪声梯度需要在很深的采样链中传播，它们常常不稳定。我们通过一个关键的结构性原则来应对这一矛盾：将优化与生成解耦。基于这一见解，我们提出了 GoRL（Generative Online Reinforcement Learning），该框架优化一个易处理的潜在策略，同时利用条件生成解码器来合成动作。双时间尺度的更新计划使潜在策略能够稳定学习，同时解码器稳步提升表达能力，而无需易处理的动作似然。在一系列连续控制任务中，GoRL始终优于高斯策略和近期的生成策略基线。值得注意的是，在HopperStand任务中，其归一化回报超过870，是最强基线的三倍多。这些结果表明，将优化与生成分离，为实现既稳定又具有高度表达能力的策略提供了切实可行的路径。

SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization

SeeNav代理：通过可视化提示和步骤级策略优化提升视觉语言导航

Authors: Zhengcheng Wang, Zichuan Lin, Yijun Yang, Haobo Fu, Deheng Ye
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.02631
Pdf link: https://arxiv.org/pdf/2512.02631
Abstract Existing Vision-Language Navigation (VLN) agents based on Large Vision-Language Models (LVLMs) often suffer from perception errors, reasoning errors, and planning errors, which significantly hinder their navigation performance. To address these limitations, a novel VLN agent framework, named SeeNav-Agent, is proposed in this work. First, to reduce perception hallucinations of the visual module of the VLN agent, a dual-view Visual Prompt (VP) technique is introduced in the input space, which can also improve the agent's understanding of current spatial states. Subsequently, a novel step-level Reinforcement Fine-Tuning (RFT) method, Step Reward Group Policy Optimization (SRGPO), is designed for the post-training of VLN agents. In SRGPO, we first define verifiable process rewards for the navigation task, and then perform efficient step-level advantage estimation by randomly grouping different navigation steps. SRGPO provides dense reward signals for the reinforcement learning process of the VLN agent and enhances its planning capability. Experimental results on the EmbodiedBench Navigation benchmark indicate that by introducing the zero-shot VP module, the GPT-4.1 achieves a navigation success rate of 86.7%, surpassing the current best LVLM by approximately 20 percentage points (pp). Through post-training based on SRGPO, the Qwen2.5-VL-3B model reaches a navigation success rate of 72.3%, outperforming the best existing LVLM model by 5.6 pp. Moreover, compared to RFT algorithms such as GRPO and GiGPO, the proposed SRGPO demonstrates significant improvements in training stability, convergence efficiency, and generalization capability.
中文摘要 基于大型视觉语言模型（LVLM）的现有视觉语言导航（VLN）代理常常存在感知错误、推理错误和规划错误，严重影响其导航性能。为解决这些局限性，本文提出了一种名为SeeNav-Agent的新颖VLN代理框架。首先，为了减少VLN智能体视觉模块的幻觉感知，输入空间引入了双视视觉提示（VP）技术，这也能提升智能体对当前空间状态的理解。随后，一种新型的步级强化微调（RFT）方法——步级奖励组策略优化（SRGPO）被设计用于VLN代理的后期训练。在SRGPO中，我们首先定义了导航任务的可验证过程奖励，然后通过随机分组不同的导航步骤进行高效的步级优势估计。SRGPO为VLN智能体的强化学习过程提供密集的奖励信号，并增强其规划能力。EmbodiedBench导航基准测试的实验结果表明，通过引入零射VP模块，GPT-4.1实现了86.7%的导航成功率，比当前最佳LVLM高出约20个百分点（pp）。通过基于SRGPO的后期训练，Qwen2.5-VL-3B模型的导航成功率达到72.3%，比现有最佳LVLM模型高出5.6个百分点。此外，与GRPO和GiGPO等RFT算法相比，SRGPO在训练稳定性、收敛效率和泛化能力方面表现出显著提升。
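The SRGPO abstract describes step-level advantage estimation by randomly grouping navigation steps. A rough sketch of that idea follows, using the common group-relative normalization (reward minus group mean, divided by group standard deviation). The group size, the seeded shuffle, and the normalization details are assumptions, since the abstract does not specify them.

```python
import random
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def step_level_advantages(step_rewards, group_size, seed=0):
    """Randomly partition steps into groups and normalize within each group.

    `step_rewards[i]` is a hypothetical per-step process reward; the returned
    list holds one advantage per step, in the original step order.
    """
    rng = random.Random(seed)
    order = list(range(len(step_rewards)))
    rng.shuffle(order)
    advantages = [0.0] * len(step_rewards)
    for start in range(0, len(order), group_size):
        group = order[start:start + group_size]
        group_adv = group_relative_advantages([step_rewards[i] for i in group])
        for i, a in zip(group, group_adv):
            advantages[i] = a
    return advantages


adv = step_level_advantages([1.0, 0.0, 0.5, 0.2, 0.9, 0.1], group_size=3)
```

Each step receives its own advantage, which is what makes the reward signal dense compared with assigning one trajectory-level value to every step.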

Zero-Shot Instruction Following in RL via Structured LTL Representations

通过结构化LTL表示实现强化学习中的零样本指令遵循

Authors: Mattia Giuri, Mathias Jackermeier, Alessandro Abate
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.02633
Pdf link: https://arxiv.org/pdf/2512.02633
Abstract Linear temporal logic (LTL) is a compelling framework for specifying complex, structured tasks for reinforcement learning (RL) agents. Recent work has shown that interpreting LTL instructions as finite automata, which can be seen as high-level programs monitoring task progress, enables learning a single generalist policy capable of executing arbitrary instructions at test time. However, existing approaches fall short in environments where multiple high-level events (i.e., atomic propositions) can be true at the same time and potentially interact in complicated ways. In this work, we propose a novel approach to learning a multi-task policy for following arbitrary LTL instructions that addresses this shortcoming. Our method conditions the policy on sequences of simple Boolean formulae, which directly align with transitions in the automaton, and are encoded via a graph neural network (GNN) to yield structured task representations. Experiments in a complex chess-based environment demonstrate the advantages of our approach.
中文摘要 线性时序逻辑（LTL）是一个引人注目的框架，用于为强化学习（RL）智能体指定复杂且结构化的任务。最新研究表明，将LTL指令解释为有限自动机，即监控任务进展的高级程序，使得学习一个通用策略能够在测试时执行任意指令。然而，现有方法在多个高层事件（即原子命题）可以同时成立且可能以复杂方式相互作用的环境中表现不足。在本研究中，我们提出了一种新颖的方法来学习一套多任务策略，以遵循任意的LTL指令，以解决这一缺陷。我们的方法以一组简单的布尔公式序列为策略条件，这些公式直接对应自动机中的转移，并通过图神经网络（GNN）编码以生成结构化任务表示。在复杂的国际象棋环境中的实验展示了我们方法的优势。

RoboWheel: A Data Engine from Real-World Human Demonstrations for Cross-Embodiment Robotic Learning

RoboWheel：来自真实人类演示的数据引擎，用于跨具身机器人学习

Authors: Yuhong Zhang, Zihan Gao, Shengpeng Li, Ling-Hao Chen, Kaisheng Liu, Runqing Cheng, Xiao Lin, Junjia Liu, Zhuoheng Li, Jingyi Feng, Ziyan He, Jintian Lin, Zheyan Huang, Zhifang Liu, Haoqian Wang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.02729
Pdf link: https://arxiv.org/pdf/2512.02729
Abstract We introduce RoboWheel, a data engine that converts human hand-object interaction (HOI) videos into training-ready supervision for cross-morphology robotic learning. From monocular RGB or RGB-D inputs, we perform high-precision HOI reconstruction and enforce physical plausibility via a reinforcement learning (RL) optimizer that refines hand-object relative poses under contact and penetration constraints. The reconstructed, contact-rich trajectories are then retargeted across embodiments (robot arms with simple end effectors, dexterous hands, and humanoids), yielding executable actions and rollouts. To scale coverage, we build a simulation-augmented framework on Isaac Sim with diverse domain randomization (embodiments, trajectories, object retrieval, background textures, hand motion mirroring), which enriches the distributions of trajectories and observations while preserving spatial relationships and physical plausibility. The data pipeline runs end to end: video, reconstruction, retargeting, augmentation, and data acquisition. We validate the data on mainstream vision-language-action (VLA) and imitation learning architectures, demonstrating that trajectories produced by our pipeline are as stable as those from teleoperation and yield comparable continual performance gains. To our knowledge, this provides the first quantitative evidence that HOI modalities can serve as effective supervision for robotic learning. Compared with teleoperation, RoboWheel is lightweight: a single monocular RGB(D) camera is sufficient to extract a universal, embodiment-agnostic motion representation that can be flexibly retargeted across embodiments. We further assemble a large-scale multimodal dataset combining multi-camera captures, monocular videos, and public HOI corpora for training and evaluating embodied models.
中文摘要 我们介绍了RoboWheel，一个数据引擎，将人手-物体交互（HOI）视频转化为可直接用于训练的监督信号，用于跨形态机器人学习。基于单目RGB或RGB-D输入，我们进行高精度的HOI重建，并通过强化学习（RL）优化器保证物理合理性，该优化器在接触和穿透约束下优化手与物体的相对姿态。重建得到的、接触丰富的轨迹随后被重定向到不同具身形态（带有简单末端执行器的机械臂、灵巧手和人形机器人），从而产生可执行的动作和rollout。为扩大覆盖范围，我们在Isaac Sim上构建了一个仿真增强框架，采用多样的领域随机化（具身形态、轨迹、对象检索、背景纹理、手部动作镜像），丰富轨迹和观测的分布，同时保持空间关系和物理合理性。整个数据流水线构成了从视频、重建、重定向、增强到数据采集的端到端流程。我们在主流视觉-语言-动作（VLA）和模仿学习架构上验证了这些数据，证明我们的流水线产生的轨迹与遥操作得到的轨迹同样稳定，并能带来相当的持续性能提升。据我们所知，这首次提供了定量证据，表明HOI模态可以作为机器人学习的有效监督。与遥操作相比，RoboWheel更轻量：仅需一台单目RGB(D)相机即可提取通用的、与具身形态无关的运动表示，并可灵活地在不同具身形态之间重定向。我们还进一步构建了一个大规模多模态数据集，结合多机位采集、单目视频和公开的HOI语料库，用于训练和评估具身模型。

IC-World: In-Context Generation for Shared World Modeling

IC-World：共享世界建模的上下文生成

Authors: Fan Wu, Jiacheng Wei, Ruibo Li, Yi Xu, Junyou Li, Deheng Ye, Guosheng Lin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.02793
Pdf link: https://arxiv.org/pdf/2512.02793
Abstract Video-based world models have recently garnered increasing attention for their ability to synthesize diverse and dynamic visual environments. In this paper, we focus on shared world modeling, where a model generates multiple videos from a set of input images, each representing the same underlying world in different camera poses. We propose IC-World, a novel generation framework, enabling parallel generation for all input images via activating the inherent in-context generation capability of large video models. We further finetune IC-World via reinforcement learning, Group Relative Policy Optimization, together with two proposed novel reward models to enforce scene-level geometry consistency and object-level motion consistency among the set of generated videos. Extensive experiments demonstrate that IC-World substantially outperforms state-of-the-art methods in both geometry and motion consistency. To the best of our knowledge, this is the first work to systematically explore the shared world modeling problem with video-based world models.
中文摘要 基于视频的世界模型近年来因其综合多样且动态的视觉环境的能力而受到越来越多的关注。本文重点介绍共享世界建模，即模型从一组输入图像生成多个视频，每张图像代表同一底层世界，采用不同的摄像机姿态。我们提出了IC-World，一种新型生成框架，通过激活大型视频模型固有的上下文生成能力，实现所有输入图像的并行生成。我们通过强化学习、群相对策略优化以及两个提出的新颖奖励模型，进一步微调了IC-World，以强制生成视频集的场景级几何一致性和对象级运动一致性。大量实验表明，IC-World在几何和运动一致性方面远远优于最先进的方法。据我们所知，这是首个系统性探索基于视频的世界模型的共享世界建模问题的研究。

SR-GRPO: Stable Rank as an Intrinsic Geometric Reward for Large Language Model Alignment

SR-GRPO：稳定秩作为大型语言模型对齐的内在几何奖励

Authors: Yixuan Tang, Yi Yang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.02807
Pdf link: https://arxiv.org/pdf/2512.02807
Abstract Aligning Large Language Models (LLMs) with human preferences typically relies on external supervision, which faces critical limitations: human annotations are scarce and subjective, reward models are vulnerable to reward hacking, and self-evaluation methods suffer from prompt sensitivity and biases. In this work, we propose stable rank, an intrinsic, annotation-free quality signal derived from model representations. Stable rank measures the effective dimensionality of hidden states by computing the ratio of total variance to dominant-direction variance, capturing quality through how information distributes across representation dimensions. Empirically, stable rank achieves 84.04% accuracy on RewardBench and improves task accuracy by an average of 11.3 percentage points over greedy decoding via Best-of-N sampling. Leveraging this insight, we introduce Stable Rank Group Relative Policy Optimization (SR-GRPO), which uses stable rank as a reward signal for reinforcement learning. Without external supervision, SR-GRPO improves Qwen2.5-1.5B-Instruct by 10% on STEM and 19% on mathematical reasoning, outperforming both learned reward models and self-evaluation baselines. Our findings demonstrate that quality signals can be extracted from internal model geometry, offering a path toward scalable alignment without external supervision.
中文摘要 将大型语言模型（LLMs）与人类偏好对齐通常依赖外部监督，但这面临着关键局限：人工标注稀缺且主观，奖励模型容易受到奖励黑客（reward hacking）的影响，自我评估方法存在提示敏感性和偏见。在本研究中，我们提出了稳定秩，这是一种源自模型表示的、内在且无需标注的质量信号。稳定秩通过计算总方差与主导方向方差的比值来衡量隐藏状态的有效维度，通过信息在表示维度间的分布来刻画质量。实验表明，稳定秩在RewardBench上达到84.04%的准确率，并且通过Best-of-N采样，任务准确率比贪婪解码平均提高11.3个百分点。基于这一见解，我们引入了稳定秩组相对策略优化（SR-GRPO），该方法将稳定秩作为强化学习的奖励信号。在无外部监督的情况下，SR-GRPO使Qwen2.5-1.5B-Instruct在STEM上提升10%，在数学推理上提升19%，优于学习得到的奖励模型和自我评估基线。我们的发现表明，质量信号可以从模型内部几何结构中提取，为无需外部监督的可扩展对齐提供了路径。
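The SR-GRPO abstract defines stable rank as the ratio of total variance to dominant-direction variance. In linear-algebra terms this matches the usual stable rank of a matrix, the squared Frobenius norm divided by the squared largest singular value. The sketch below computes it in plain Python with power iteration. Which hidden states form the matrix, and whether they are centered first, is not stated in the abstract, so the input here is just a hypothetical token-by-dimension matrix.

```python
def stable_rank(matrix, iters=200):
    """Stable rank = ||H||_F^2 / sigma_max(H)^2 for a list-of-rows matrix H.

    sigma_max is estimated by power iteration on H^T H. This assumes the
    (non-uniform) start vector is not orthogonal to the top singular direction.
    """
    rows, cols = len(matrix), len(matrix[0])
    frob_sq = sum(x * x for row in matrix for x in row)
    if frob_sq == 0.0:
        raise ValueError("stable rank is undefined for the zero matrix")

    v = [1.0 / (j + 1) for j in range(cols)]  # start vector
    for _ in range(iters):
        u = [sum(row[j] * v[j] for j in range(cols)) for row in matrix]            # H v
        w = [sum(matrix[i][j] * u[i] for i in range(rows)) for j in range(cols)]   # H^T u
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            raise ValueError("power iteration failed: start vector in null space")
        v = [x / norm for x in w]

    u = [sum(row[j] * v[j] for j in range(cols)) for row in matrix]
    top_sq = sum(x * x for x in u)  # approximates sigma_max^2 since ||v|| = 1
    return frob_sq / top_sq


identity_rank = stable_rank([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
rank_one = stable_rank([[1.0, 2.0], [2.0, 4.0]])
```

The value ranges from 1 (all variance along one direction) up to the matrix rank (variance spread evenly), which is why it can act as a measure of effective dimensionality.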

Phase-Adaptive LLM Framework with Multi-Stage Validation for Construction Robot Task Allocation: A Systematic Benchmark Against Traditional Optimization Algorithms

阶段自适应大型语言模型框架，具备多阶段验证，用于建筑机器人任务分配：与传统优化算法的系统性基准比较

Authors: Shyam prasad reddy Kaitha, Hongrui Yu
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.02810
Pdf link: https://arxiv.org/pdf/2512.02810
Abstract Multi-robot task allocation in construction automation has traditionally relied on optimization methods such as Dynamic Programming and Reinforcement Learning. This research introduces the LangGraph-based Task Allocation Agent (LTAA), an LLM-driven framework that integrates phase-adaptive allocation strategies, multi-stage validation with hierarchical retries, and dynamic prompting for efficient robot coordination. Although recent LLM approaches show potential for construction robotics, they largely lack rigorous validation and benchmarking against established algorithms. This paper presents the first systematic comparison of LLM-based task allocation with traditional methods in construction this http URL study validates LLM feasibility through SMART-LLM replication and addresses implementation challenges using a Self-Corrective Agent Architecture. LTAA leverages natural-language reasoning combined with structured validation mechanisms, achieving major computational gains reducing token usage by 94.6% and allocation time by 86% through dynamic prompting. The framework adjusts its strategy across phases: emphasizing execution feasibility early and workload balance in later this http URL authors evaluate LTAA against Dynamic Programming, Q-learning, and Deep Q-Network (DQN) baselines using construction operations from the TEACh human-robot collaboration dataset. In the Heavy Excels setting, where robots have strong task specializations, LTAA achieves 77% task completion with superior workload balance, outperforming all traditional methods. These findings show that LLM-based reasoning with structured validation can match established optimization algorithms while offering additional advantages such as interpretability, adaptability, and the ability to update task logic without retraining.
中文摘要 建筑自动化中的多机器人任务分配传统上依赖于动态规划和强化学习等优化方法。本研究介绍了基于LangGraph的任务分配代理（LTAA），这是一个基于LLM的框架，集成了阶段自适应分配策略、多阶段验证与分层重试以及动态提示，实现机器人高效协调。尽管近期的大型语言模型方法显示出建筑机器人的潜力，但它们在很大程度上缺乏严格的验证和与既有算法的基准测试。本文首次系统地比较基于LLM的任务分配与传统构建方法，http URL研究通过SMART-LLM复制验证LLM的可行性，并利用自我纠正代理架构解决实现挑战。LTAA结合自然语言推理与结构化验证机制，通过动态提示实现了显著的计算收益，使代币使用率降低了94.6%，分配时间减少了86%。该框架在各个阶段调整策略：强调早期执行可行性，后期强调工作负载平衡。http URL 作者利用 TEACh 人机协作数据集中的构造作，将 LTAA 与动态规划、Q 学习和深度 Q-Network（DQN）基线进行评估。在Heavy Excels环境中，机器人具有强烈的任务专业化，LTAA实现了77%的任务完成率，且工作负荷平衡优越，超越所有传统方法。这些发现表明，基于LLM的推理结合结构化验证能够匹配既有优化算法，同时还具备可解释性、适应性以及无需重新训练即可更新任务逻辑等额外优势。

Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach

引导视觉-语言-行动模型作为反探索：一种测试时间缩放方法

Authors: Siyuan Yang, Yang Zhang, Haoran He, Ling Pan, Xiu Li, Chenjia Bai, Xuelong Li
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.02834
Pdf link: https://arxiv.org/pdf/2512.02834
Abstract Vision-Language-Action (VLA) models, trained via flow-matching or diffusion objectives, excel at learning complex behaviors from large-scale, multi-modal datasets (e.g., human teleoperation, scripted policies). However, since VLAs incorporate diverse data modes in the pre-training stage, and the finetuning dataset often contains demonstration data collected in a kinematically suboptimal or undesirable way, there exist redundant action modes that are irrelevant to the success action modes of the downstream task. Specifically, we observe a critical inference-time fragility among various sampled noises after supervised finetuning of pre-trained VLAs. In this paper, we attribute this instability to the distribution shift between the VLA policy and the policy induced by stable success modes of the downstream task dataset. Thus, we propose TACO, a test-time-scaling (TTS) framework that applies a lightweight pseudo-count estimator as a high-fidelity verifier of action chunks. The VLA models integrated with TACO can execute the actions with maximum pseudo-count from all sampled action chunks, thereby preventing distribution shifts while preserving the generalization ability of VLAs, since the constraint is applied only during inference. Our method resembles the classical anti-exploration principle in offline reinforcement learning (RL), and being gradient-free, it offers significant computational benefits compared to RL updates, especially for flow- or diffusion-based VLAs, for which RL updates are difficult to perform due to the denoising process. Extensive experiments across four simulation benchmarks (RoboTwin2.0, Robotwin, LIBERO, SimplerEnv) and a dual-arm platform demonstrate that our method significantly improves the inference stability and success rates in downstream-task adaptations.
中文摘要 视觉-语言-动作（VLA）模型通过流匹配或扩散目标训练，擅长从大规模多模态数据集（例如人类遥操作、脚本化策略）中学习复杂行为。然而，由于VLA在预训练阶段包含多种数据模式，且微调数据集通常包含以运动学上次优或不理想方式收集的演示数据，因此存在与下游任务成功动作模式无关的冗余动作模式。具体来说，我们在对预训练VLA进行监督微调后，观察到不同采样噪声之间存在严重的推理时脆弱性。本文将这种不稳定性归因于VLA策略与下游任务数据集中稳定成功模式所诱导的策略之间的分布偏移。因此，我们提出了TACO，一个测试时缩放（TTS）框架，应用轻量级伪计数估计器作为动作块的高保真验证器。与TACO集成的VLA模型可以从所有采样动作块中执行伪计数最大的动作，从而防止分布偏移，同时保持VLA的泛化能力，因为该约束仅在推理过程中应用。我们的方法类似于离线强化学习（RL）中的经典反探索原理，且由于无需梯度，相较于RL更新在计算上有显著优势，尤其是对于基于流或扩散的VLA，这些VLA由于去噪过程难以进行RL更新。在四个仿真基准（RoboTwin2.0、Robotwin、LIBERO、SimplerEnv）和一个双臂平台上的大量实验表明，我们的方法显著提升了下游任务适应中的推理稳定性和成功率。
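
Illustrative sketch (not from the paper): the abstract describes sampling several action chunks and executing the one with the maximum pseudo-count. The snippet below shows that selection step in Python under loud assumptions. The Gaussian-kernel `pseudo_count`, the `bandwidth` parameter, and the toy 2-D "chunks" are hypothetical stand-ins; the paper's actual estimator and action representation are not specified in the abstract.

```python
import math
import random

def pseudo_count(chunk, dataset, bandwidth=0.5):
    """Hypothetical pseudo-count: kernel-weighted number of dataset chunks
    close to `chunk`. This is an assumed stand-in, not the paper's estimator."""
    total = 0.0
    for ref in dataset:
        sq_dist = sum((a - b) ** 2 for a, b in zip(chunk, ref))
        total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return total

def select_chunk(candidates, dataset, bandwidth=0.5):
    """Return the sampled action chunk with the highest pseudo-count.
    No gradients are needed; this is a pure inference-time filter."""
    return max(candidates, key=lambda c: pseudo_count(c, dataset, bandwidth))

# Toy data: "success mode" chunks cluster around (1.0, 1.0).
rng = random.Random(0)
dataset = [(1.0 + rng.gauss(0, 0.1), 1.0 + rng.gauss(0, 0.1)) for _ in range(50)]

# Candidates as if sampled from a policy with different noise seeds.
candidates = [(1.05, 0.95), (3.0, -2.0), (0.0, 4.0)]
best = select_chunk(candidates, dataset)
```

In this toy setup the candidate closest to the dataset cluster is selected, which mirrors the anti-exploration idea of preferring well-supported actions.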

ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning

ReVSeg：通过强化学习激励视频分割的推理链

Authors: Yifan Li, Yingda Yin, Lingting Zhu, Weikai Chen, Shengju Qian, Xin Wang, Yanwei Fu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.02835
Pdf link: https://arxiv.org/pdf/2512.02835
Abstract Reasoning-centric video object segmentation is an inherently complex task: the query often refers to dynamics, causality, and temporal interactions, rather than static appearances. Yet existing solutions generally collapse these factors into simplified reasoning with latent embeddings, rendering the reasoning chain opaque and essentially intractable. We therefore adopt an explicit decomposition perspective and introduce ReVSeg, which executes reasoning as sequential decisions in the native interface of pretrained vision language models (VLMs). Rather than folding all reasoning into a single-step prediction, ReVSeg executes three explicit operations -- semantics interpretation, temporal evidence selection, and spatial grounding -- aligning pretrained capabilities. We further employ reinforcement learning to optimize the multi-step reasoning chain, enabling the model to self-refine its decision quality from outcome-driven signals. Experimental results demonstrate that ReVSeg attains state-of-the-art performances on standard video object segmentation benchmarks and yields interpretable reasoning trajectories. Project page is available at this https URL .
中文摘要 以推理为中心的视频对象分割本质上是一项复杂的任务：查询通常涉及动态、因果关系和时间交互，而非静态的外观。然而，现有的解决方案通常将这些因素压缩为基于潜在嵌入的简化推理，使推理链变得不透明且本质上难以处理。因此，我们采用显式分解视角，引入ReVSeg，在预训练视觉语言模型（VLMs）的原生接口中，以顺序决策方式执行推理。ReVSeg 没有将所有推理压缩到单步预测中，而是执行三种显式操作（语义解释、时间证据选择和空间定位），与预训练能力相对齐。我们还进一步运用强化学习优化多步推理链，使模型能够从结果驱动的信号中自我优化决策质量。实验结果表明，ReVSeg在标准视频对象分割基准测试上达到了最先进的性能，并产生了可解释的推理轨迹。项目页面可在此 https 网址访问。

Taming Camera-Controlled Video Generation with Verifiable Geometry Reward

利用可验证几何奖励驯服相机控制的视频生成

Authors: Zhaoqing Wang, Xiaobo Xia, Zhuolin Bie, Jinlin Liu, Dongdong Yu, Jia-Wang Bian, Changhu Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.02870
Pdf link: https://arxiv.org/pdf/2512.02870
Abstract Recent advances in video diffusion models have remarkably improved camera-controlled video generation, but most methods rely solely on supervised fine-tuning (SFT), leaving online reinforcement learning (RL) post-training largely underexplored. In this work, we introduce an online RL post-training framework that optimizes a pretrained video generator for precise camera control. To make RL effective in this setting, we design a verifiable geometry reward that delivers dense segment-level feedback to guide model optimization. Specifically, we estimate the 3D camera trajectories for both generated and reference videos, divide each trajectory into short segments, and compute segment-wise relative poses. The reward function then compares each generated-reference segment pair and assigns an alignment score as the reward signal, which helps alleviate reward sparsity and improve optimization efficiency. Moreover, we construct a comprehensive dataset featuring diverse large-amplitude camera motions and scenes with varied subject dynamics. Extensive experiments show that our online RL post-training clearly outperforms SFT baselines across multiple aspects, including camera-control accuracy, geometric consistency, and visual quality, demonstrating its superiority in advancing camera-controlled video generation.
中文摘要 视频扩散模型的最新进展显著提升了相机控制的视频生成，但大多数方法仅依赖监督微调（SFT），在线强化学习（RL）后训练在很大程度上仍未得到充分探索。本研究引入了一个在线强化学习后训练框架，优化预训练视频生成器以实现精确的相机控制。为了使强化学习在这种设定下有效，我们设计了一种可验证的几何奖励，提供密集的段级反馈以指导模型优化。具体来说，我们估计生成视频和参考视频的3D相机轨迹，将每条轨迹划分为短段，并计算逐段的相对位姿。奖励函数随后比较每个生成-参考片段对，并赋予一个对齐分数作为奖励信号，有助于缓解奖励稀疏性并提高优化效率。此外，我们构建了一个涵盖多样大幅度相机运动和多样主体动态场景的综合数据集。大量实验表明，我们的在线强化学习后训练在多个方面明显优于SFT基线，包括相机控制的准确性、几何一致性和视觉质量，证明其在推动相机控制视频生成方面的优势。
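
Illustrative sketch (not from the paper): the abstract describes splitting the generated and reference camera trajectories into short segments, computing segment-wise relative poses, and scoring each generated-reference pair. The Python below shows that structure under strong simplifying assumptions: poses are planar `(x, y, yaw)` rather than full 3D camera poses, and the exponential alignment score, `seg_len`, `trans_scale`, and `rot_scale` are hypothetical choices, since the abstract does not give the reward formula.

```python
import math

def wrap(angle):
    """Wrap an angle to the range [-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))

def relative_pose(p_start, p_end):
    """Pose of p_end expressed in the frame of p_start; poses are (x, y, yaw)."""
    dx, dy = p_end[0] - p_start[0], p_end[1] - p_start[1]
    c, s = math.cos(p_start[2]), math.sin(p_start[2])
    # Rotate the world-frame displacement into the start frame (R^T * d).
    return (c * dx + s * dy, -s * dx + c * dy, wrap(p_end[2] - p_start[2]))

def segment_rewards(generated, reference, seg_len=4, trans_scale=1.0, rot_scale=1.0):
    """One alignment score in (0, 1] per segment (assumed scoring rule)."""
    n = min(len(generated), len(reference))
    rewards = []
    for start in range(0, n - seg_len, seg_len):
        g = relative_pose(generated[start], generated[start + seg_len])
        r = relative_pose(reference[start], reference[start + seg_len])
        trans_err = math.hypot(g[0] - r[0], g[1] - r[1])
        rot_err = abs(wrap(g[2] - r[2]))
        rewards.append(math.exp(-(trans_err / trans_scale + rot_err / rot_scale)))
    return rewards

# Reference: straight forward motion. Generated: identical for the first
# segment, then drifts sideways during the second segment.
reference = [(float(i), 0.0, 0.0) for i in range(9)]
generated = reference[:5] + [(4.0 + k, 0.5 * k, 0.0) for k in range(1, 5)]
rewards = segment_rewards(generated, reference)
```

Because each segment is scored on its own relative pose, the first segment keeps a full score while only the drifting second segment is penalized, which is the kind of dense, localized feedback the abstract contrasts with a sparse trajectory-level reward.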

MindGPT-4ov: An Enhanced MLLM via a Multi-Stage Post-Training Paradigm

MindGPT-4ov：通过多阶段后训练范式增强的MLLM

Authors: Wei Chen, Chaoqun Du, Feng Gu, Wei He, Qizhen Li, Zide Liu, Xuhao Pan, Chang Ren, Xudong Rao, Chenfeng Wang, Tao Wei, Chengjun Yu, Pengfei Yu, Yufei Zheng, Chunpeng Zhou, Pan Zhou, Xuhan Zhu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.02895
Pdf link: https://arxiv.org/pdf/2512.02895
Abstract We present MindGPT-4ov, a multimodal large language model (MLLM) that introduces a general post-training paradigm spanning data production, model training, and efficient deployment. It achieves state-of-the-art performance across multiple benchmarks at low cost, effectively enhancing the foundational capabilities of MLLMs and the generalization ability. Focusing on data construction, supervised fine-tuning strategies, and multimodal reinforcement learning methods, this work proposes three key innovations: (1) An information density-based data generation scheme, integrated with a dual-dimensional tree-structured label system, enabling automated generation of high-quality cross-domain data. (2) A collaborative curriculum supervised fine-tuning approach that balances the injection of domain-specific knowledge with the preservation of general capabilities. (3) A hybrid reinforcement learning paradigm that enhances reasoning ability while simultaneously addressing multi-objective optimization such as diversity exploration, maintenance of multimodal perception, and response conciseness. Moreover, we implement a series of infrastructure optimizations, such as 5D parallel training, operator optimization, and inference quantization to enhance training and inference efficiency while reducing the cost of domain adaptation. Experimental results demonstrate that the MindGPT-4ov model outperforms state-of-the-art models on benchmarks such as MMBench, MMStar, MathVision, and MathVista. In addition, MindGPT-4ov also demonstrates superior user experience in vertical domain tasks, enabling a seamless transition from academic research to industrial deployment. MindGPT-4ov provides a general post-training paradigm applicable to a wide range of MLLMs. The model weights, datasets, and code for the Qwen3-VL-based variants will be open-sourced soon to support the community's development of MLLMs.
中文摘要 我们介绍MindGPT-4ov，一个多模态大型语言模型（MLLM），引入了涵盖数据生产、模型训练和高效部署的通用后训练范式。它以低成本在多个基准测试上实现了最先进的性能，有效提升了MLLM的基础能力和泛化能力。本研究聚焦于数据构建、监督微调策略和多模态强化学习方法，提出了三项关键创新：（1）基于信息密度的数据生成方案，结合双维树状标签体系，实现高质量跨域数据的自动生成。（2）一种协同课程监督微调方法，平衡领域特定知识的注入与通用能力的保留。（3）一种混合强化学习范式，在提升推理能力的同时，兼顾多目标优化，如多样性探索、多模态感知的保持和回答简洁性。此外，我们实施了一系列基础设施优化，如5D并行训练、算子优化和推理量化，以提升训练和推理效率，同时降低领域适配成本。实验结果表明，MindGPT-4ov模型在MMBench、MMStar、MathVision和MathVista等基准测试上表现优于最先进模型。此外，MindGPT-4ov还在垂直领域任务中展现出卓越的用户体验，实现了从学术研究到工业部署的无缝衔接。MindGPT-4ov 提供了一个适用于多种MLLM的通用后训练范式。基于Qwen3-VL的变体的模型权重、数据集和代码将于近期开源，以支持社区开发MLLM。

OneThinker: All-in-one Reasoning Model for Image and Video

OneThinker：图像与视频的一体化推理模型

Authors: Kaituo Feng, Manyuan Zhang, Hongyu Li, Kaixuan Fan, Shuang Chen, Yilei Jiang, Dian Zheng, Peiwen Sun, Yiyuan Zhang, Haoze Sun, Yan Feng, Peng Pei, Xunliang Cai, Xiangyu Yue
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.03043
Pdf link: https://arxiv.org/pdf/2512.03043
Abstract Reinforcement learning (RL) has recently achieved remarkable success in eliciting visual reasoning within Multimodal Large Language Models (MLLMs). However, existing approaches typically train separate models for different tasks and treat image and video reasoning as disjoint domains. This results in limited scalability toward a multimodal reasoning generalist, which restricts practical versatility and hinders potential knowledge sharing across tasks and modalities. To this end, we propose OneThinker, an all-in-one reasoning model that unifies image and video understanding across diverse fundamental visual tasks, including question answering, captioning, spatial and temporal grounding, tracking, and segmentation. To achieve this, we construct the OneThinker-600k training corpus covering all these tasks and employ commercial models for CoT annotation, resulting in OneThinker-SFT-340k for SFT cold start. Furthermore, we propose EMA-GRPO to handle reward heterogeneity in multi-task RL by tracking task-wise moving averages of reward standard deviations for balanced optimization. Extensive experiments on diverse visual benchmarks show that OneThinker delivers strong performance on 31 benchmarks, across 10 fundamental visual understanding tasks. Moreover, it exhibits effective knowledge transfer between certain tasks and preliminary zero-shot generalization ability, marking a step toward a unified multimodal reasoning generalist. All code, model, and data are released.
中文摘要 强化学习（RL）最近在激发多模态大型语言模型（MLLM）的视觉推理方面取得了显著成功。然而，现有方法通常为不同任务训练独立模型，并将图像和视频推理视为互不相交的领域。这导致向多模态推理通才扩展的能力有限，限制了实际应用的通用性，并阻碍了任务和模态间潜在的知识共享。为此，我们提出了OneThinker，一种一体化推理模型，统一了多种基本视觉任务中的图像和视频理解，包括问答、描述生成、空间和时间定位、跟踪和分割。为此，我们构建了涵盖所有这些任务的 OneThinker-600k 训练语料库，并采用商业模型进行 CoT 标注，最终形成了用于 SFT 冷启动的 OneThinker-SFT-340k。此外，我们提出EMA-GRPO，通过跟踪各任务奖励标准差的移动平均来处理多任务强化学习中的奖励异质性，实现平衡优化。在多种视觉基准上的大量实验表明，OneThinker 在涵盖 10 项基础视觉理解任务的 31 个基准测试中表现出色。此外，它在某些任务间展现出有效的知识迁移和初步的零样本泛化能力，标志着迈向统一多模态推理通才的一步。所有代码、模型和数据均已发布。
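
Illustrative sketch (not from the paper): the abstract says EMA-GRPO tracks task-wise moving averages of reward standard deviations to balance multi-task RL. One plausible reading, sketched below in Python, is to normalize group-relative advantages by a per-task exponential moving average of the reward standard deviation instead of the current group's own standard deviation. The class name `EmaStdNormalizer`, the `decay` and `eps` parameters, and the exact update rule are assumptions; the paper's formulation may differ.

```python
import statistics

class EmaStdNormalizer:
    """Per-task EMA of reward std, used to scale group-relative advantages.
    Hypothetical helper; not the paper's implementation."""

    def __init__(self, decay=0.9, eps=1e-6):
        self.decay = decay
        self.eps = eps
        self.ema_std = {}  # task name -> running average of reward std

    def advantages(self, task, rewards):
        """Update the task's EMA with this group's std, then return
        (reward - group mean) / (EMA std + eps) for each rollout."""
        mean = statistics.fmean(rewards)
        std = statistics.pstdev(rewards)
        prev = self.ema_std.get(task)
        if prev is None:
            self.ema_std[task] = std
        else:
            self.ema_std[task] = self.decay * prev + (1.0 - self.decay) * std
        scale = self.ema_std[task] + self.eps
        return [(r - mean) / scale for r in rewards]

norm = EmaStdNormalizer(decay=0.5)
# A task with binary rewards and a task with small-range dense rewards.
adv_qa = norm.advantages("qa", [0.0, 1.0])
adv_seg = norm.advantages("segmentation", [0.2, 0.4])
# A later "qa" group where every rollout fails: the EMA keeps the scale
# finite, so the advantages are all zero rather than undefined.
adv_qa_flat = norm.advantages("qa", [0.0, 0.0, 0.0, 0.0])
```

Keeping a separate running scale per task means a task whose rewards span a narrow range is not drowned out by one with large reward swings, which is the balancing effect the abstract attributes to the method.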

Keyword: diffusion policy

AID: Agent Intent from Diffusion for Multi-Agent Informative Path Planning

AID：基于扩散的智能体意图，用于多智能体信息路径规划

Authors: Jeric Lew, Yuhong Cao, Derek Ming Siang Tan, Guillaume Sartoretti
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.02535
Pdf link: https://arxiv.org/pdf/2512.02535
Abstract Information gathering in large-scale or time-critical scenarios (e.g., environmental monitoring, search and rescue) requires broad coverage within limited time budgets, motivating the use of multi-agent systems. These scenarios are commonly formulated as multi-agent informative path planning (MAIPP), where multiple agents must coordinate to maximize information gain while operating under budget constraints. A central challenge in MAIPP is ensuring effective coordination while the belief over the environment evolves with incoming measurements. Recent learning-based approaches address this by using distributions over future positions as "intent" to support coordination. However, these autoregressive intent predictors are computationally expensive and prone to compounding errors. Inspired by the effectiveness of diffusion models as expressive, long-horizon policies, we propose AID, a fully decentralized MAIPP framework that leverages diffusion models to generate long-term trajectories in a non-autoregressive manner. AID first performs behavior cloning on trajectories produced by existing MAIPP planners and then fine-tunes the policy using reinforcement learning via Diffusion Policy Policy Optimization (DPPO). This two-stage pipeline enables the policy to inherit expert behavior while learning improved coordination through online reward feedback. Experiments demonstrate that AID consistently improves upon the MAIPP planners it is trained from, achieving up to 4x faster execution and 17% increased information gain, while scaling effectively to larger numbers of agents. Our implementation is publicly available at this https URL.
中文摘要 在大规模或时间紧迫的场景（如环境监测、搜救）中，信息收集需要在有限的时间预算内实现广泛覆盖，这促使人们使用多智能体系统。这些场景通常被表述为多智能体信息路径规划（MAIPP），即多个智能体必须相互协调，在预算约束下最大化信息增益。MAIPP的核心挑战之一是在对环境的信念随测量数据不断演化的同时确保有效的协调。近期基于学习的方法通过将未来位置的分布作为“意图”来支持协调。然而，这些自回归意图预测器计算成本高且容易出现累积误差。受扩散模型作为表达力强的长时域策略的有效性启发，我们提出了AID，一个完全去中心化的MAIPP框架，利用扩散模型以非自回归的方式生成长期轨迹。AID首先对现有MAIPP规划器生成的轨迹进行行为克隆，然后借助 Diffusion Policy Policy Optimization（DPPO）通过强化学习对策略进行微调。这一两阶段流程使策略能够继承专家行为，同时通过在线奖励反馈学习更好的协调。实验表明，AID始终优于其训练所依据的MAIPP规划器，执行速度最高提升4倍，信息增益提升17%，同时能有效扩展到更多智能体。我们的实现可在此 https URL 公开获取。