Arxiv Papers of Today

Generated: 2025-12-19 16:33:00 (UTC+8); Arxiv release time: 2025-12-19 20:00 EST (2025-12-20 09:00 UTC+8)

29 relevant papers today

Keyword: reinforcement learning

Bilevel Optimization for Covert Memory Tampering in Heterogeneous Multi-Agent Architectures (XAMT)

异构多智能体架构（XAMT）中隐蔽内存篡改的双级优化

Authors: Akhil Sharma, Shaikh Yaser Arafat, Jai Kumar Sharma, Ken Huang
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2512.15790
Pdf link: https://arxiv.org/pdf/2512.15790
Abstract The increasing operational reliance on complex Multi-Agent Systems (MAS) across safety-critical domains necessitates rigorous adversarial robustness assessment. Modern MAS are inherently heterogeneous, integrating conventional Multi-Agent Reinforcement Learning (MARL) with emerging Large Language Model (LLM) agent architectures utilizing Retrieval-Augmented Generation (RAG). A critical shared vulnerability is reliance on centralized memory components: the shared Experience Replay (ER) buffer in MARL and the external Knowledge Base (K) in RAG agents. This paper proposes XAMT (Bilevel Optimization for Covert Memory Tampering in Heterogeneous Multi-Agent Architectures), a novel framework that formalizes attack generation as a bilevel optimization problem. The Upper Level minimizes perturbation magnitude (delta) to enforce covertness while maximizing system behavior divergence toward an adversary-defined target (Lower Level). We provide rigorous mathematical instantiations for CTDE MARL algorithms and RAG-based LLM agents, demonstrating that bilevel optimization uniquely crafts stealthy, minimal-perturbation poisons evading detection heuristics. Comprehensive experimental protocols utilize SMAC and SafeRAG benchmarks to quantify effectiveness at sub-percent poison rates (less than or equal to 1 percent in MARL, less than or equal to 0.1 percent in RAG). XAMT defines a new unified class of training-time threats essential for developing intrinsically secure MAS, with implications for trust, formal verification, and defensive strategies prioritizing intrinsic safety over perimeter-based detection.
中文摘要 在安全关键领域，对复杂多代理系统（MAS）的日益依赖，要求严格的对抗性鲁棒性评估。现代MAS本质上具有异构性，集成了传统的多智能体强化学习（MARL）与采用检索增强生成（RAG）的新兴大型语言模型（LLM）智能体架构。一个关键的共享漏洞是对集中式内存组件的依赖：MARL中的共享体验重放（ER）缓冲区和RAG代理中的外部知识库（K）。本文提出了XAMT（异构多智能体架构中隐蔽内存篡改的双级优化），这是一个新颖框架，将攻击生成形式化为双层优化问题。上层最小化微扰幅度（delta），以强制隐蔽性，同时最大化系统行为向对手定义目标（下层）的偏差。我们为CTDE MARL算法和基于RAG的大型语言模型代理提供了严谨的数学实例化，证明双层优化独特地构建了隐蔽、最小扰动的毒药，从而规避检测启发式。全面的实验方案利用SMAC和SafeRAG基准，量化亚百分度毒性率的有效性（MARL中小于或等于1%，RAG中低于或等于0.1%）。XAMT定义了一类新的统一训练时间威胁，对于开发本质安全的MAS至关重要，其影响包括信任、正式验证以及优先考虑内在安全而非基于边界的检测的防御策略。

DSO: Direct Steering Optimization for Bias Mitigation

DSO：直接转向优化以缓解偏置

Authors: Lucas Monteiro Paes, Nivedha Sivakumar, Yinong Oliver Wang, Masha Fedzechkina Donaldson, Luca Zappella, Nicholas Apostoloff
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2512.15926
Pdf link: https://arxiv.org/pdf/2512.15926
Abstract Generative models are often deployed to make decisions on behalf of users, such as vision-language models (VLMs) identifying which person in a room is a doctor to help visually impaired individuals. Yet, VLM decisions are influenced by the perceived demographic attributes of people in the input, which can lead to biased outcomes like failing to identify women as doctors. Moreover, when reducing bias leads to performance loss, users may have varying needs for balancing bias mitigation with overall model capabilities, highlighting the demand for methods that enable controllable bias reduction during inference. Activation steering is a popular approach for inference-time controllability that has shown potential in inducing safer behavior in large language models (LLMs). However, we observe that current steering methods struggle to correct biases, where equiprobable outcomes across demographic groups are required. To address this, we propose Direct Steering Optimization (DSO) which uses reinforcement learning to find linear transformations for steering activations, tailored to mitigate bias while maintaining control over model performance. We demonstrate that DSO achieves state-of-the-art trade-off between fairness and capabilities on both VLMs and LLMs, while offering practitioners inference-time control over the trade-off. Overall, our work highlights the benefit of designing steering strategies that are directly optimized to control model behavior, providing more effective bias intervention than methods that rely on pre-defined heuristics for controllability.
中文摘要 生成模型常被用来代表用户做决策，例如视觉语言模型（VLMs），用以识别房间内谁是医生，以帮助视障人士。然而，VLM的决策仍受输入中人员的人口统计属性影响，这可能导致偏见结果，比如未能将女性认定为医生。此外，在减少偏见导致性能损失时，用户可能对平衡偏差缓解与整体模型能力的需求不同，凸显了对推理过程中可控偏差减少方法的需求。激活引导是一种流行的推理时间可控方法，已被证明有潜力在大型语言模型（LLM）中诱导更安全的行为。然而，我们观察到现有的指导方法难以纠正偏差，而在这些偏差中，需要在不同人口统计群体之间取得等概率的结果。为此，我们提出了直接引导优化（DSO），利用强化学习寻找引导激活的线性变换，旨在减轻偏差并保持对模型性能的控制。我们证明了DSO在VLM和LLM上实现了公平性与能力的先进权衡，同时为从业者提供了对权衡的推理时间控制权。总体而言，我们的研究强调了设计直接优化以控制模型行为的引导策略的好处，这比依赖预设启发式方法提供更有效的偏差干预。
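A minimal sketch of inference-time activation steering with a linear transformation, as the DSO abstract describes at a high level. The affine form `h + strength * (W h + b)`, the `strength` knob, and all names are illustrative assumptions; the abstract does not give the paper's exact parametrization, and the reinforcement learning step that would fit `W` and `b` is not shown.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, h: Vector) -> Vector:
    """Multiply matrix W by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def steer(h: Vector, W: Matrix, b: Vector, strength: float) -> Vector:
    """Return h + strength * (W h + b).

    `strength` is a hypothetical inference-time knob: 0.0 leaves the
    activation untouched, larger values apply more of the learned shift.
    """
    delta = [wh + bi for wh, bi in zip(matvec(W, h), b)]
    return [x + strength * d for x, d in zip(h, delta)]


# Toy activation and a hand-picked (not learned) transformation.
h = [1.0, -2.0]
W = [[0.0, 1.0], [1.0, 0.0]]
b = [0.5, 0.5]

unchanged = steer(h, W, b, 0.0)  # no steering applied
steered = steer(h, W, b, 1.0)    # full steering applied
```

Exposing the strength as a scalar is one simple way to give a practitioner a single dial for the fairness/capability trade-off the abstract mentions.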

Small Language Models for Efficient Agentic Tool Calling: Outperforming Large Models with Targeted Fine-tuning

高效智能体工具调用的小型语言模型：通过精准微调优胜于大型模型

Authors: Polaris Jhandi, Owais Kazi, Shreyas Subramanian, Neel Sendas
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.15943
Pdf link: https://arxiv.org/pdf/2512.15943
Abstract As organizations scale adoption of generative AI, model cost optimization and operational efficiency have emerged as critical factors determining sustainability and accessibility. While Large Language Models (LLMs) demonstrate impressive capabilities across diverse tasks, their extensive computational requirements make them cost-prohibitive for routine enterprise use. This limitation motivates the exploration of Small Language Models (SLMs), which can deliver comparable performance in targeted applications while drastically reducing infrastructure overhead (Irugalbandara et al., 2023). In this work, we investigate the feasibility of replacing LLM-driven workflows with optimized SLMs. We trained a domain-adapted SLM to execute representative tasks traditionally handled by LLMs, such as document summarization, query answering, and structured data interpretation. As part of the experiment, we investigated fine-tuning the facebook/opt-350m model (single epoch only) using Hugging Face TRL (Transformer Reinforcement Learning), specifically its Supervised Fine-Tuning (SFT) trainer. The OPT-350M model was released by Meta AI in 2022 as part of the OPT (Open Pretrained Transformer) family of models. Similar studies demonstrate that even models at the 350M parameter scale can meaningfully contribute to instruction-tuning pipelines (Mekala et al., 2024). Experimental results demonstrated that our fine-tuned SLM achieves exceptional performance with a 77.55% pass rate on ToolBench evaluation, significantly outperforming all baseline models including ChatGPT-CoT (26.00%), ToolLLaMA-DFS (30.18%), and ToolLLaMA-CoT (16.27%). These findings emphasize that thoughtful design and targeted training of SLMs can significantly lower barriers to adoption, enabling cost-effective, large-scale integration of generative AI into production systems.
中文摘要 随着组织扩大生成式人工智能的采用，模型成本优化和运营效率成为决定可持续性和可及性的关键因素。虽然大型语言模型（LLM）在各种任务中展现出令人印象深刻的能力，但其庞大的计算需求使其在日常企业使用中成本过高。这一限制促使人们探索小型语言模型（SLM），它们可以在目标应用中提供相当的性能，同时大幅降低基础设施开销（Irugalbandara 等，2023）。在本研究中，我们探讨用优化的SLM取代LLM驱动工作流程的可行性。我们训练了一个领域适配的SLM，执行传统上由LLM处理的代表性任务，如文档摘要、查询回复和结构化数据解释。作为实验的一部分，我们研究了使用Hugging Face TRL（变压器强化学习），特别是监督微调（SFT）训练器，对facebook/opt-350m模型（仅单一历元）进行微调。OPT-350M模型由Meta AI于2022年发布，属于OPT（开放预训练变换器）系列模型。类似研究表明，即使是3.5亿参数尺度的模型，也能对指令调优流水线有意义贡献（Mekala等，2024）。实验结果表明，我们经过精细调优的SLM在ToolBench评估中以77.55%的通过率表现出色，显著优于包括ChatGPT-CoT（26.00%）、ToolLLaMA-DFS（30.18%）和ToolLLaMA-CoT（16.27%）在内的所有基线模型。这些发现强调，深思熟虑的设计和针对性SLM的培训可以显著降低采用门槛，从而实现生成式人工智能在生产系统中的成本效益高、大规模集成。

Dynamic Rank Reinforcement Learning for Adaptive Low-Rank Multi-Head Self Attention in Large Language Models

大型语言模型中自适应低秩多头自我注意力的动态秩强化学习

Authors: Caner Erden
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.15973
Pdf link: https://arxiv.org/pdf/2512.15973
Abstract We propose Dynamic Rank Reinforcement Learning (DR-RL), a novel framework that adaptively optimizes the low-rank factorization of Multi-Head Self-Attention (MHSA) in Large Language Models (LLMs) through the integration of reinforcement learning and online matrix perturbation theory. While traditional low-rank approximations often rely on static rank assumptions--limiting their flexibility across diverse input contexts--our method dynamically selects ranks based on real-time sequence dynamics, layer-specific sensitivities, and hardware constraints. The core innovation lies in an RL agent that formulates rank selection as a sequential policy optimization problem, where the reward function strictly balances attention fidelity against computational latency. Crucially, we employ online matrix perturbation bounds to enable incremental rank updates, thereby avoiding the prohibitive cost of full decomposition during inference. Furthermore, the integration of a lightweight Transformer-based policy network and batched Singular Value Decomposition (SVD) operations ensures scalable deployment on modern GPU architectures. Experiments demonstrate that DR-RL maintains downstream accuracy statistically equivalent to full-rank attention while significantly reducing Floating Point Operations (FLOPs), particularly in long-sequence regimes (L > 4096). This work bridges the gap between adaptive efficiency and theoretical rigor in MHSA, offering a principled, mathematically grounded alternative to heuristic rank reduction techniques in resource-constrained deep learning. Source code and experiment logs are available at: this https URL
中文摘要 我们提出了动态秩强化学习（DR-RL），这是一种新颖框架，通过强化学习与在线矩阵扰动理论的集成，自适应优化大型语言模型（LLMs）中多头自注意（MHSA）的低秩分解。传统的低秩近似通常依赖静态秩假设——限制了其在不同输入环境下的灵活性——而我们的方法则基于实时序列动态、层级敏感度和硬件约束动态选择秩。核心创新在于一个强化学习代理，将排名选择表述为顺序策略优化问题，其中奖励函数严格平衡注意力忠实度与计算延迟。关键是，我们采用在线矩阵扰动界限，实现增量秩更新，从而避免了推断过程中完全分解的高昂成本。此外，基于Transformer的轻量级策略网络和批量奇异值分解（SVD）作的集成确保了在现代GPU架构上的可扩展部署。实验表明，DR-RL在统计上保持下游准确性，与全秩注意力相当，同时显著减少浮点运算（FLOP），尤其是在长序列区间（L > 4096）。这项工作弥合了MHSA中自适应效率与理论严谨性的差距，提供了一种有原则、数学基础的替代资源受限深度学习中启发式秩次降低技术的替代方案。源代码和实验日志可在以下网站获取：此 https URL
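A minimal sketch of the fidelity-versus-cost trade-off that the DR-RL abstract assigns to its reward function. It relies on the standard fact that the best rank-r approximation of a matrix keeps the r largest singular values, so the retained share of squared singular values measures fidelity. The cost proxy (`rank / full_rank`), the `latency_weight` parameter, and the exhaustive search that stands in for the RL policy are all assumptions made for illustration, not the paper's method.

```python
def retained_energy(singular_values, rank):
    """Share of squared singular values kept by a rank-`rank` truncation."""
    total = sum(s * s for s in singular_values)
    top = sorted(singular_values, reverse=True)[:rank]
    return sum(s * s for s in top) / total


def reward(singular_values, rank, latency_weight):
    """Fidelity minus a weighted cost proxy (hypothetical reward shape)."""
    cost = rank / len(singular_values)
    return retained_energy(singular_values, rank) - latency_weight * cost


def best_rank(singular_values, latency_weight):
    """Exhaustive search over ranks; a stand-in for a learned policy."""
    ranks = range(1, len(singular_values) + 1)
    return max(ranks, key=lambda r: reward(singular_values, r, latency_weight))


# Toy spectrum of one attention matrix.
sigma = [4.0, 2.0, 1.0, 0.5]

rank_when_cost_matters = best_rank(sigma, 0.2)
rank_when_cost_is_free = best_rank(sigma, 0.0)
```

With a nonzero weight the search settles on a rank below full rank, because the last singular values add little fidelity relative to their cost.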

Techno-economic optimization of a heat-pipe microreactor, part I: theory and cost optimization

热管微型反应堆的技术经济优化，第一部分：理论与成本优化

Authors: Paul Seurin, Dean Price, Luis Nunez
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Arxiv link: https://arxiv.org/abs/2512.16032
Pdf link: https://arxiv.org/pdf/2512.16032
Abstract Microreactors, particularly heat-pipe microreactors (HPMRs), are compact, transportable, self-regulated power systems well-suited for access-challenged remote areas where costly fossil fuels dominate. However, they suffer from diseconomies of scale, and their financial viability remains unconvincing. One step in addressing this shortcoming is to design these reactors with comprehensive economic and physics analyses informing early-stage design iteration. In this work, we present a novel unifying geometric design optimization approach that accounts for techno-economic considerations. We start by generating random samples to train surrogate models, including Gaussian processes (GPs) and multi-layer perceptrons (MLPs). We then deploy these surrogates within a reinforcement learning (RL)-based optimization framework to optimize the levelized cost of electricity (LCOE), all the while imposing constraints on the fuel lifetime, shutdown margin (SDM), peak heat flux, and rod-integrated peaking factor. We study two cases: one in which the axial reflector cost is very high, and one in which it is inexpensive. We found that the operation and maintenance and capital costs are the primary contributors to the overall LCOE, particularly the cost of the axial reflectors (for the first case) and the control drum materials. The optimizer cleverly changes the design parameters so as to minimize one of them while still satisfying the constraints, ultimately reducing the LCOE by more than 57% in both instances. A comprehensive integration of fuel and HP performance with multi-objective optimization is currently being pursued to fully understand the interaction between constraints and cost performance.
中文摘要 微型反应堆，特别是热管微型反应堆（HPMR），是一种紧凑、可运输、自我调节的电力系统，非常适合交通不便、以昂贵化石燃料为主的偏远地区。然而，它们存在规模不经济的问题，其财务可行性仍然难以令人信服。解决这一不足的一步，是在设计这些反应堆时以全面的经济和物理分析指导早期设计迭代。在本研究中，我们提出了一种新的统一几何设计优化方法，能够考虑技术经济因素。我们首先生成随机样本以训练代理模型，包括高斯过程（GP）和多层感知器（MLP）。然后，我们将这些代理模型部署在基于强化学习（RL）的优化框架中，以优化平准化度电成本（LCOE），同时对燃料寿命、停堆裕度（SDM）、峰值热通量和棒积分峰值因子施加约束。我们研究两种情况：一种是轴向反射层成本非常高，另一种是其成本较低。我们发现，运行维护成本和资本成本是整体LCOE的主要贡献因素，尤其是轴向反射层（针对第一种情况）和控制鼓材料的成本。优化器巧妙地调整设计参数，使其中一项最小化，同时仍满足约束，最终在两种情况下都将LCOE降低了57%以上。目前正在将燃料和热管（HP）性能与多目标优化进行全面整合，以充分理解约束与成本性能之间的相互作用。

INTELLECT-3: Technical Report

INTELLECT-3：技术报告

Authors: Prime Intellect Team, Mika Senghaas, Fares Obeid, Sami Jaghouar, William Brown, Jack Min Ong, Daniel Auras, Matej Sirovatka, Jannik Straube, Andrew Baker, Sebastian Müller, Justus Mattern, Manveer Basra, Aiman Ismail, Dominik Scherm, Cooper Miller, Ameen Patel, Simon Kirsten, Mario Sieg, Christian Reetz, Kemal Erdem, Vincent Weisser, Johannes Hagemann
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.16144
Pdf link: https://arxiv.org/pdf/2512.16144
Abstract We present INTELLECT-3, a 106B-parameter Mixture-of-Experts model (12B active) trained with large-scale reinforcement learning on our end-to-end RL infrastructure stack. INTELLECT-3 achieves state of the art performance for its size across math, code, science and reasoning benchmarks, outperforming many larger frontier models. We open-source the model together with the full infrastructure stack used to create it, including RL frameworks, complete recipe, and a wide collection of environments, built with the verifiers library, for training and evaluation from our Environments Hub community platform. Built for this effort, we introduce prime-rl, an open framework for large-scale asynchronous reinforcement learning, which scales seamlessly from a single node to thousands of GPUs, and is tailored for agentic RL with first-class support for multi-turn interactions and tool use. Using this stack, we run both SFT and RL training on top of the GLM-4.5-Air-Base model, scaling RL training up to 512 H200s with high training efficiency.
中文摘要 我们介绍INTELLECT-3，一个106B参数的混合专家模型（12B激活参数），在我们的端到端强化学习基础设施栈上通过大规模强化学习训练而成。INTELLECT-3在数学、代码、科学和推理基准测试中实现了同等规模下的最先进性能，优于许多更大的前沿模型。我们将模型及用于创建它的完整基础设施栈开源，包括强化学习框架、完整训练配方，以及使用verifiers库构建、来自我们Environments Hub社区平台的大量训练和评估环境。为此，我们推出了prime-rl，一个面向大规模异步强化学习的开放框架，能够从单节点无缝扩展到数千个GPU，专为智能体强化学习量身定制，并对多轮交互和工具使用提供一流支持。利用该技术栈，我们在GLM-4.5-Air-Base模型基础上运行SFT和RL训练，将RL训练扩展至512块H200，并保持较高的训练效率。

MRG-R1: Reinforcement Learning for Clinically Aligned Medical Report Generation

MRG-R1：临床对齐医学报告生成的强化学习

Authors: Pengyu Wang, Shuchang Ye, Usman Naseem, Jinman Kim
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.16145
Pdf link: https://arxiv.org/pdf/2512.16145
Abstract Medical report generation (MRG) aims to automatically derive radiology-style reports from medical images to aid in clinical decision-making. However, existing methods often generate text that mimics the linguistic style of radiologists but fails to guarantee clinical correctness, because they are trained on token-level objectives which focus on word choice and sentence structure rather than actual medical accuracy. We propose a semantic-driven reinforcement learning (SRL) method for medical report generation, applied to a large vision-language model (LVLM). SRL adopts Group Relative Policy Optimization (GRPO) to encourage clinical-correctness-guided learning beyond imitation of language style. Specifically, we optimise a report-level reward: a margin-based cosine similarity (MCCS) computed between key radiological findings extracted from generated and reference reports, thereby directly aligning clinical-label agreement and improving semantic correctness. A lightweight reasoning format constraint further guides the model to generate structured "thinking report" outputs. We evaluate Medical Report Generation with Semantic-driven Reinforcement Learning (MRG-R1) on two datasets, IU X-Ray and MIMIC-CXR, using clinical efficacy (CE) metrics. MRG-R1 achieves state-of-the-art performance with CE-F1 51.88 on IU X-Ray and 40.39 on MIMIC-CXR. We found that the label-semantic reinforcement is better than conventional token-level supervision. These results indicate that optimizing a clinically grounded, report-level reward, rather than token overlap, meaningfully improves clinical correctness. This work is an early exploration of semantic reinforcement for supervising medical correctness in medical large vision-language model (Med-LVLM) training.
中文摘要 医学报告生成（MRG）旨在自动从医学图像中提取放射学风格的报告，以辅助临床决策。然而，现有方法常常生成模仿放射科医生语言风格的文本，但无法保证临床正确性，因为它们训练基于代币级目标，重点关注词汇选择和句子结构，而非实际医学准确性。我们提出了一种语义驱动强化学习（SRL）方法用于医疗报告生成，采用在大型视觉语言模型（LVLM）上。SRL采用群体相对政策优化（GRPO），鼓励临床正确性指导的学习，而不仅仅是模仿语言风格。具体来说，我们优化了报告层级奖励：基于边际的余弦相似度（MCCS），计算出从生成报告和参考报告中提取的关键放射学发现，从而直接对齐临床标签一致性并提升语义正确性。一个轻量级推理格式约束进一步引导模型生成结构化的“思维报告”输出。我们利用两个数据集——IU X光和MIMIC-CXR，利用临床疗效（CE）指标评估了使用语气驱动信息学习（MRG-R1）的医疗报告生成。MRG-R1在IU X光上达到CE-F1 51.88和MIMIC-CXR上40.39的先进性能。我们发现标签语义强化优于传统的令牌级监督。这些结果表明，优化临床基础的报告级奖励而非代币重叠，能显著提升临床正确性。本研究是探讨语义强化在医疗大视觉语言模型（Med-LVLM）培训中指导医疗正确性的先行。
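A minimal sketch of the two ingredients the MRG-R1 abstract names: a report-level reward based on cosine similarity between finding labels, and the group-relative advantage used by GRPO. The exact margin rule is not given in the abstract, so the `max(0, cosine - margin)` form, the `margin` value, and the binary label vectors are assumptions for illustration. The advantage normalization (reward minus group mean, divided by group standard deviation) follows the usual GRPO description.

```python
import math


def cosine(u, v):
    """Cosine similarity of two label vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def margin_reward(pred_labels, ref_labels, margin=0.5):
    """Hypothetical margin rule: only similarity above `margin` is rewarded."""
    return max(0.0, cosine(pred_labels, ref_labels) - margin)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


# One reference report and three sampled reports, as finding indicators.
reference = [1, 0, 1, 0]
samples = [[1, 0, 1, 0], [1, 0, 0, 0], [0, 1, 0, 1]]

rewards = [margin_reward(s, reference) for s in samples]
advantages = group_relative_advantages(rewards)
```

Because the reward is computed on extracted findings and not on tokens, a report that is worded differently but lists the same findings would score the same as the reference.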

Visual Alignment of Medical Vision-Language Models for Grounded Radiology Report Generation

医学视觉语言模型的视觉对齐，用于基础放射科报告生成

Authors: Sarosij Bose, Ravi K. Rajendran, Biplob Debnath, Konstantinos Karydis, Amit K. Roy-Chowdhury, Srimat Chakradhar
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.16201
Pdf link: https://arxiv.org/pdf/2512.16201
Abstract Radiology Report Generation (RRG) is a critical step toward automating healthcare workflows, facilitating accurate patient assessments, and reducing the workload of medical professionals. Despite recent progress in Large Medical Vision-Language Models (Med-VLMs), generating radiology reports that are both visually grounded and clinically accurate remains a significant challenge. Existing approaches often rely on large labeled corpora for pre-training, costly task-specific preference data, or retrieval-based methods. However, these strategies do not adequately mitigate hallucinations arising from poor cross-modal alignment between visual and linguistic representations. To address these limitations, we propose VALOR: Visual Alignment of Medical Vision-Language Models for GrOunded Radiology Report Generation. Our method introduces a reinforcement learning-based post-alignment framework utilizing Group-Relative Proximal Optimization (GRPO). The training proceeds in two stages: (1) improving the Med-VLM with textual rewards to encourage clinically precise terminology, and (2) aligning the vision projection module of the textually grounded model with disease findings, thereby guiding attention toward image regions most relevant to the diagnostic task. Extensive experiments on multiple benchmarks demonstrate that VALOR substantially improves factual accuracy and visual grounding, achieving significant performance gains over state-of-the-art report generation methods.
中文摘要 放射报告生成（RRG）是自动化医疗工作流程、促进准确患者评估和减轻医疗专业人员工作负担的关键一步。尽管大型医学视觉语言模型（Med-VLM）近年来取得了进展，但生成既具视觉基础又临床准确的放射报告仍是一大挑战。现有方法通常依赖大型标记语料库来预训练、昂贵的任务特定偏好数据或基于检索的方法。然而，这些策略并不能充分缓解因视觉与语言表征之间跨模态对齐不良而产生的幻觉。为解决这些局限性，我们提出了VALOR：用于GrOunded放射报告生成的医学视觉语言模型视觉对齐。我们的方法引入了基于强化学习的后比对框架，利用群体-相对近端优化（GRPO）。培训分为两个阶段：（1）通过文本奖励改进Med-VLM，鼓励临床精确术语使用;（2）将基于文本模型的视觉投影模块与疾病发现对齐，从而引导注意力集中在与诊断任务最相关的图像符号上。多项基准测试的广泛实验表明，VALOR显著提升了事实准确性和视觉基础，性能优于最先进的报告生成方法。

Hypernetworks That Evolve Themselves

自我进化的超网络

Authors: Joachim Winther Pedersen, Erwan Plantec, Eleni Nisioti, Marcello Barylli, Milton Montero, Kathrin Korte, Sebastian Risi
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.16406
Pdf link: https://arxiv.org/pdf/2512.16406
Abstract How can neural networks evolve themselves without relying on external optimizers? We propose Self-Referential Graph HyperNetworks, systems where the very machinery of variation and inheritance is embedded within the network. By uniting hypernetworks, stochastic parameter generation, and graph-based representations, Self-Referential GHNs mutate and evaluate themselves while adapting mutation rates as selectable traits. Through new reinforcement learning benchmarks with environmental shifts (CartPoleSwitch, LunarLander-Switch), Self-Referential GHNs show swift, reliable adaptation and emergent population dynamics. In the locomotion benchmark Ant-v5, they evolve coherent gaits, showing promising fine-tuning capabilities by autonomously decreasing variation in the population to concentrate around promising solutions. Our findings support the idea that evolvability itself can emerge from neural self-reference. Self-Referential GHNs reflect a step toward synthetic systems that more closely mirror biological evolution, offering tools for autonomous, open-ended learning agents.
中文摘要 神经网络如何在不依赖外部优化器的情况下自我进化？我们提出了自指图超网络，即将变异和继承机制嵌入网络中的系统。通过结合超网络、随机参数生成和基于图的表示，自指GHN在变异和自我评估的同时，将突变率作为可选择性状进行调整。通过新的强化学习基准测试（如CartPoleSwitch、LunarLander-Switch），自指GHNs展现了快速、可靠的适应能力和涌现的种群动态。在运动基准测试Ant-v5中，它们进化出连贯步态，通过自主减少群体变异以集中注意力围绕有前景的解决方案，展现出有前景的微调能力。我们的发现支持进化本身可以从神经自我指涉中产生的想法。自指GHNs反映了向更贴近生物进化的合成系统迈进的一步，为自主、开放式学习代理提供工具。

NDRL: Cotton Irrigation and Nitrogen Application with Nested Dual-Agent Reinforcement Learning

NDRL：含嵌套双代理强化学习的棉花灌溉与氮应用

Authors: Ruifeng Xu, Liang He
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2512.16408
Pdf link: https://arxiv.org/pdf/2512.16408
Abstract Effective irrigation and nitrogen fertilization have a significant impact on crop yield. However, existing research faces two limitations: (1) the high complexity of optimizing water-nitrogen combinations during crop growth and poor yield optimization results; and (2) the difficulty in quantifying mild stress signals and the delayed feedback, which results in less precise dynamic regulation of water and nitrogen and lower resource utilization efficiency. To address these issues, we propose a Nested Dual-Agent Reinforcement Learning (NDRL) method. The parent agent in NDRL identifies promising macroscopic irrigation and fertilization actions based on projected cumulative yield benefits, reducing ineffective exploration while maintaining alignment between objectives and yield. The child agent's reward function incorporates quantified Water Stress Factor (WSF) and Nitrogen Stress Factor (NSF), and uses a mixed probability distribution to dynamically optimize daily strategies, thereby enhancing both yield and resource efficiency. We used field experiment data from 2023 and 2024 to calibrate and validate the Decision Support System for Agrotechnology Transfer (DSSAT) to simulate real-world conditions and interact with NDRL. Experimental results demonstrate that, compared to the best baseline, the simulated yield increased by 4.7% in both 2023 and 2024, the irrigation water productivity increased by 5.6% and 5.1% respectively, and the nitrogen partial factor productivity increased by 6.3% and 1.0% respectively. Our method advances the development of cotton irrigation and nitrogen fertilization, providing new ideas for addressing the complexity and precision issues in agricultural resource management and for sustainable agricultural development.
中文摘要 有效的灌溉和氮肥对作物产量有显著影响。然而，现有研究面临两个局限：（1）作物生长过程中水氮组合优化复杂度高且产量优化效果较差;以及（2）轻微应力信号的量化困难和反馈延迟，导致水和氮的动态调节精度降低，资源利用效率降低。为解决这些问题，我们提出了一种嵌套双代理强化学习（NDRL）方法。NDRL的母因子基于预期的累计产量效益，识别有前景的宏观灌溉和施肥行动，减少无效的勘探，同时保持目标与产量的一致性。儿童代理的奖励函数包含定量的水应力因子（WSF）和氮应力因子（NSF），并采用混合概率分布动态优化每日策略，从而提升产量和资源效率。我们利用2023年和2024年的实地实验数据校准和验证农业技术转让决策支持系统（DSSAT），以模拟真实世界状况并与NDRL互动。实验结果显示，与最佳基线相比，2023年和2024年模拟产量均增长了4.7%，灌溉用水生产率分别提升了5.6%和5.1%，氮分因子生产率分别提升了6.3%和1.0%。我们的方法推动了棉花灌溉和氮肥的发展，为解决农业资源管理中的复杂性和精度问题以及可持续农业发展提供了新思路。

StarCraft+: Benchmarking Multi-agent Algorithms in Adversary Paradigm

星际争霸+：对手范式中多智能体算法的基准测试

Authors: Yadong Li, Tong Zhang, Bo Huang, Zhen Cui
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.16444
Pdf link: https://arxiv.org/pdf/2512.16444
Abstract Deep multi-agent reinforcement learning (MARL) algorithms are booming in the field of collaborative intelligence, and the StarCraft multi-agent challenge (SMAC) is widely used as the benchmark therein. However, imaginary opponents of MARL algorithms are practically configured and controlled in a fixed built-in AI mode, which causes less diversity and versatility in algorithm evaluation. To address this issue, in this work, we establish a multi-agent algorithm-vs-algorithm environment, named StarCraft II battle arena (SC2BA), to refresh the benchmarking of MARL algorithms in an adversary paradigm. Taking StarCraft as infrastructure, the SC2BA environment is specifically created for inter-algorithm adversary with the consideration of fairness, usability and customizability, and meanwhile an adversarial PyMARL (APyMARL) library is developed with easy-to-use interfaces/modules. Grounding in SC2BA, we benchmark those classic MARL algorithms in two types of adversarial modes: dual-algorithm paired adversary and multi-algorithm mixed adversary, where the former conducts the adversary of pairwise algorithms while the latter focuses on the adversary to multiple behaviors from a group of algorithms. The extensive benchmark experiments exhibit some thought-provoking observations/problems in the effectivity, sensibility and scalability of these completed algorithms. The SC2BA environment as well as reproduced experiments are released on GitHub (this https URL), and we believe that this work could mark a new step for the MARL field in the coming years.
中文摘要 深度多智能体强化学习（MARL）算法在协作智能领域蓬勃发展，而星际争霸多智能体挑战（SMAC）被广泛用作该领域的基准。然而，MARL算法的假想对手实际上是在固定的内置AI模式下配置和控制的，这导致算法评估的多样性和通用性不足。为解决这一问题，本研究建立了一个多智能体算法对算法环境，名为星际争霸II战场（SC2BA），以刷新对抗范式中MARL算法的基准测试。以星际争霸为基础设施，SC2BA环境专为算法间对抗设计，兼顾公平性、可用性和可定制性，同时开发了一个对抗性PyMARL（APyMARL）库，配备易用的接口/模块。基于SC2BA，我们在两种对抗模式下对经典MARL算法进行基准测试：双算法配对对抗和多算法混合对抗，前者进行成对算法之间的对抗，后者则聚焦于对抗来自一组算法的多种行为。这些广泛的基准测试实验在这些算法的有效性、敏感性和可扩展性方面展现了一些发人深省的观察和问题。SC2BA环境及复现的实验已发布在GitHub（this https URL）上，我们相信这项工作可能为MARL领域在未来几年迈出新的一步。

Guiding Perception-Reasoning Closer to Human in Blind Image Quality Assessment

引导盲图质量评估中的感知推理更接近人类

Authors: Yuan Li, Yahan Yu, Youyuan Lin, Yong-Hao Yang, Chenhui Chu, Shin'ya Nishida
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.16484
Pdf link: https://arxiv.org/pdf/2512.16484
Abstract Humans assess image quality through a perception-reasoning cascade, integrating sensory cues with implicit reasoning to form self-consistent judgments. In this work, we investigate how a model can acquire both human-like and self-consistent reasoning capability for blind image quality assessment (BIQA). We first collect human evaluation data that capture several aspects of the human perception-reasoning pipeline. Then, we adopt reinforcement learning, using human annotations as reward signals to guide the model toward human-like perception and reasoning. To enable the model to internalize self-consistent reasoning capability, we design a reward that drives the model to infer the image quality purely from self-generated descriptions. Empirically, our approach achieves score prediction performance comparable to state-of-the-art BIQA systems under general metrics, including Pearson and Spearman correlation coefficients. In addition to the rating score, we assess human-model alignment using ROUGE-1 to measure the similarity between model-generated and human perception-reasoning chains. On over 1,000 human-annotated samples, our model reaches a ROUGE-1 score of 0.512 (cf. 0.443 for baseline), indicating substantial coverage of human explanations and marking a step toward human-like interpretable reasoning in BIQA.
中文摘要 人类通过感知-推理级联评估图像质量，将感官线索与隐性推理整合，形成自洽的判断。本研究探讨模型如何获得类似人类和自洽的推理能力，用于盲图质量评估（BIQA）。我们首先收集人类评估数据，捕捉人类感知-推理流程的多个方面。然后，我们采用强化学习，利用人类注释作为奖励信号，引导模型趋向类人感知和推理。为了使模型能够内化自洽的推理能力，我们设计了一种奖励机制，使模型能够仅凭自生成的描述推断图像质量。从经验上看，我们的方法在一般指标（包括Pearson和Spearman相关系数）下，得分预测表现可与最先进的BIQA系统媲美。除了评分评分外，我们还利用ROUGE-1评估人机-模型对齐，以衡量模型生成链与人类感知推理链的相似性。在1000多个人类注释样本中，我们的模型获得了ROUGE-1得分0.512（基线为0.443），表明对人类解释的广泛覆盖，标志着BIQA向类人可解释推理迈出了一步。
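A minimal sketch of ROUGE-1, the unigram-overlap metric the BIQA abstract uses to compare model-generated and human reasoning chains. The abstract does not say which variant (precision, recall, or F1) its 0.512 score refers to, nor how text is tokenized, so this version returns all three and assumes lowercased whitespace tokenization. The example sentences are invented for illustration.

```python
from collections import Counter


def rouge1(candidate, reference):
    """Return (precision, recall, f1) over clipped unigram overlap."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand_counts & ref_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    if overlap == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical model explanation versus a hypothetical human explanation.
model_text = "the image is blurry and dark"
human_text = "the image looks blurry"

precision, recall, f1 = rouge1(model_text, human_text)
```

Recall is the variant that most directly reflects coverage of the human explanation, since it divides the overlap by the length of the human text.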

ParamExplorer: A framework for exploring parameters in generative art

ParamExplorer：用于探索生成艺术参数的框架

Authors: Julien Gachadoat, Guillaume Lagarde
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2512.16529
Pdf link: https://arxiv.org/pdf/2512.16529
Abstract Generative art systems often involve high-dimensional and complex parameter spaces in which aesthetically compelling outputs occupy only small, fragmented regions. Because of this combinatorial explosion, artists typically rely on extensive manual trial-and-error, leaving many potentially interesting configurations undiscovered. In this work we make two contributions. First, we introduce ParamExplorer, an interactive and modular framework inspired by reinforcement learning that helps the exploration of parameter spaces in generative art algorithms, guided by human-in-the-loop or even automated feedback. The framework also integrates seamlessly with existing this http URL projects. Second, within this framework we implement and evaluate several exploration strategies, referred to as agents.
中文摘要 生成艺术系统通常涉及高维且复杂的参数空间，其中具有美学吸引力的输出仅占据小而分散的区域。由于这种组合爆炸，艺术家通常依赖大量的手工反复试验，导致许多潜在有趣的配置未被发现。在本研究中，我们贡献了两项贡献。首先，我们介绍ParamExplorer，这是一个受强化学习启发的交互式模块化框架，帮助探索生成艺术算法中的参数空间，并借助人工反馈甚至自动反馈。该框架还能无缝集成现有的 http URL 项目。其次，在此框架下，我们实施并评估了多种探索策略，称为代理。

Stackelberg Learning from Human Feedback: Preference Optimization as a Sequential Game

Stackelberg 从人类反馈中学习：偏好优化作为顺序游戏

Authors: Barna Pásztor, Thomas Kleine Buening, Andreas Krause
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2512.16626
Pdf link: https://arxiv.org/pdf/2512.16626
Abstract We introduce Stackelberg Learning from Human Feedback (SLHF), a new framework for preference optimization. SLHF frames the alignment problem as a sequential-move game between two policies: a Leader, which commits to an action, and a Follower, which responds conditionally on the Leader's action. This approach decomposes preference optimization into a refinement problem for the Follower and an optimization problem against an adversary for the Leader. Unlike Reinforcement Learning from Human Feedback (RLHF), which assigns scalar rewards to actions, or Nash Learning from Human Feedback (NLHF), which seeks a simultaneous-move equilibrium, SLHF leverages the asymmetry of sequential play to capture richer preference structures. The sequential design of SLHF naturally enables inference-time refinement, as the Follower learns to improve the Leader's actions, and these refinements can be leveraged through iterative sampling. We compare the solution concepts of SLHF, RLHF, and NLHF, and lay out key advantages in consistency, data sensitivity, and robustness to intransitive preferences. Experiments on large language models demonstrate that SLHF achieves strong alignment across diverse preference datasets, scales from 0.5B to 8B parameters, and yields inference-time refinements that transfer across model families without further fine-tuning.
中文摘要 我们介绍了Stackelberg“人类反馈学习”（SLHF），这是一个用于偏好优化的新框架。SLHF将对齐问题框架为两个策略之间的顺序移动博弈：领导者承诺执行行动，跟随者则根据领导者的行动有条件响应。该方法将偏好优化分解为追随者的精炼问题和领导者的对手优化问题。与赋予行为标量奖励的人类反馈强化学习（RLHF）或寻求同时行动均衡的纳什人类反馈学习（NLHF）不同，SLHF利用顺序游戏的不对称性捕捉更丰富的偏好结构。SLHF 的顺序设计自然促进推理时间的细化，跟随者学习改进领导者的行为，这些细化可以通过迭代抽样加以利用。我们比较了SLHF、RLHF和NLHF的解概念，并阐述了一致性、数据敏感性和对传递偏好的鲁棒性等关键优势。大型语言模型的实验表明，SLHF能够在多样化偏好数据集中实现强对齐，参数范围可从0.5B到8B，并且能够实现跨模型族的推理时间细化，无需进一步微调。
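A minimal sketch of sequential (Stackelberg) play over an intransitive preference relation, the setting the SLHF abstract describes. It is restricted to pure actions on a hand-made three-response preference table, whereas the paper works with policies over language model outputs; the table `P`, the action names, and the function names are all illustrative assumptions. `P[a][b]` is the probability that response `a` is preferred over response `b`.

```python
# Cyclic preferences: A beats B, B beats C, C (narrowly) beats A.
P = {
    "A": {"A": 0.5, "B": 0.9, "C": 0.4},
    "B": {"A": 0.1, "B": 0.5, "C": 0.9},
    "C": {"A": 0.6, "B": 0.1, "C": 0.5},
}


def follower_response(leader_choice):
    """Follower sees the Leader's action and picks the best reply to it."""
    return max(P, key=lambda b: P[b][leader_choice])


def leader_value(a):
    """Leader's preference probability once the Follower has replied."""
    return P[a][follower_response(a)]


def leader_action():
    """Leader commits to the action that is hardest to improve upon."""
    return max(P, key=leader_value)


committed = leader_action()
refined = follower_response(committed)
```

No single response wins against every other here, so a scalar reward cannot rank them consistently; the sequential formulation still yields a well-defined Leader choice and a Follower refinement of it.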

Implementing a Sharia Chatbot as a Consultation Medium for Questions About Islam

实施伊斯兰教法聊天机器人作为伊斯兰问题咨询平台

Authors: Wisnu Uriawan, Aria Octavian Hamza, Ade Ripaldi Nuralim, Adi Purnama, Ahmad Juaeni Yunus, Anissya Auliani Supriadi Putri
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.16644
Pdf link: https://arxiv.org/pdf/2512.16644
Abstract This research presents the implementation of a Sharia-compliant chatbot as an interactive medium for consulting Islamic questions, leveraging Reinforcement Learning (Q-Learning) integrated with Sentence-Transformers for semantic embedding to ensure contextual and accurate responses. Utilizing the CRISP-DM methodology, the system processes a curated Islam QA dataset of 25,000 question-answer pairs from authentic sources like the Qur'an, Hadith, and scholarly fatwas, formatted in JSON for flexibility and scalability. The chatbot prototype, developed with a Flask API backend and Flutter-based mobile frontend, achieves 87% semantic accuracy in functional testing across diverse topics including fiqh, aqidah, ibadah, and muamalah, demonstrating its potential to enhance religious literacy, digital da'wah, and access to verified Islamic knowledge in the Industry 4.0 era. While effective for closed-domain queries, limitations such as static learning and dataset dependency highlight opportunities for future enhancements like continuous adaptation and multi-turn conversation support, positioning this innovation as a bridge between traditional Islamic scholarship and modern AI-driven consultation.
中文摘要 本研究展示了将符合伊斯兰教法的聊天机器人作为互动媒介实施，利用强化学习（Q-Learning）与句子变换器集成进行语义嵌入，确保回答语境准确。利用CRISP-DM方法，系统处理一个精心策划的伊斯兰QA数据集，包含25,000对来自《古兰经》、《圣训》和学术法特瓦等真实来源的问答对，并以JSON格式格式化，以实现灵活性和可扩展性。该聊天机器人原型采用Flask API后台和基于Flutter的移动前端开发，在涵盖伊斯兰法、宗教信仰、伊巴达和穆阿玛拉等多领域的功能测试中实现了87%的语义准确率，展示了其提升宗教素养、数字传教以及在工业4.0时代获取经验证伊斯兰知识的潜力。虽然对封闭域查询有效，但静态学习和数据集依赖等限制凸显了未来改进的潜力，如持续适应和多回合对话支持，使这一创新成为传统伊斯兰学术与现代人工智能驱动咨询之间的桥梁。
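For readers unfamiliar with the Q-Learning component named in the abstract, the following is a minimal sketch of the standard tabular Q-learning update. The abstract does not describe the chatbot's states, actions, or rewards, so the state and action names, the reward value, and the `alpha` and `gamma` settings below are all hypothetical.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new Q(s, a)."""
    # Value of the best action available in the next state.
    best_next = max(q[(next_state, a)] for a in actions)
    # Temporal-difference target and update.
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Hypothetical example: one rewarded answer choice for one question.
q = defaultdict(float)
actions = ["answer_a", "answer_b"]
new_value = q_learning_update(q, "question_1", "answer_a",
                              reward=1.0, next_state="done", actions=actions)
```

With an all-zero table, the first rewarded update moves Q(s, a) a fraction `alpha` of the way toward the reward.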

JustRL: Scaling a 1.5B LLM with a Simple RL Recipe

JustRL：用简单的强化学习配方扩展15亿大型语言模型

Authors: Bingxiang He, Zekai Qu, Zeyuan Liu, Yinghao Chen, Yuxin Zuo, Cheng Qian, Kaiyan Zhang, Weize Chen, Chaojun Xiao, Ganqu Cui, Ning Ding, Zhiyuan Liu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.16649
Pdf link: https://arxiv.org/pdf/2512.16649
Abstract Recent advances in reinforcement learning for large language models have converged on increasing complexity: multi-stage training pipelines, dynamic hyperparameter schedules, and curriculum learning strategies. This raises a fundamental question: Is this complexity necessary? We present JustRL, a minimal approach using single-stage training with fixed hyperparameters that achieves state-of-the-art performance on two 1.5B reasoning models (54.9% and 64.3% average accuracy across nine mathematical benchmarks) while using 2× less compute than sophisticated approaches. The same hyperparameters transfer across both models without tuning, and training exhibits smooth, monotonic improvement over 4,000+ steps without the collapses or plateaus that typically motivate interventions. Critically, ablations reveal that adding "standard tricks" like explicit length penalties and robust verifiers may degrade performance by collapsing exploration. These results suggest that the field may be adding complexity to solve problems that disappear with a stable, scaled-up baseline. We release our models and code to establish a simple, validated baseline for the community.
中文摘要 大型语言模型强化学习的最新进展趋向于不断增加复杂度：多阶段训练流程、动态超参数调度以及课程学习策略。这引出了一个根本性问题：这种复杂性有必要吗？我们提出了JustRL，这是一种采用单阶段训练、固定超参数的极简方法，在两个1.5B（15亿参数）推理模型上实现了最先进的性能（在九个数学基准测试上的平均准确率分别为54.9%和64.3%），同时计算量仅为复杂方法的一半。相同的超参数无需调优即可在两个模型间迁移，训练在4000多步中呈现平滑、单调的提升，没有出现通常促使人们进行干预的崩溃或停滞。关键的是，消融实验显示，加入显式长度惩罚和鲁棒验证器等“标准技巧”可能会使探索崩溃，从而降低性能。这些结果表明，该领域可能正在增加复杂性，以解决那些在稳定且规模化的基线下本会消失的问题。我们发布了模型和代码，为社区建立一个简单且经过验证的基线。

Olaf: Bringing an Animated Character to Life in the Physical World

奥拉夫：让一个动画角色在现实世界中栩栩如生

Authors: David Müller, Espen Knoop, Dario Mylonopoulos, Agon Serifi, Michael A. Hopkins, Ruben Grandia, Moritz Bächer
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.16705
Pdf link: https://arxiv.org/pdf/2512.16705
Abstract Animated characters often move in non-physical ways and have proportions that are far from a typical walking robot. This provides an ideal platform for innovation in both mechanical design and stylized motion control. In this paper, we bring Olaf to life in the physical world, relying on reinforcement learning guided by animation references for control. To create the illusion of Olaf's feet moving along his body, we hide two asymmetric legs under a soft foam skirt. To fit actuators inside the character, we use spherical and planar linkages in the arms, mouth, and eyes. Because the walk cycle results in harsh contact sounds, we introduce additional rewards that noticeably reduce impact noise. The large head, driven by small actuators in the character's slim neck, creates a risk of overheating, amplified by the costume. To keep actuators from overheating, we feed temperature values as additional inputs to policies, introducing new rewards to keep them within bounds. We validate the efficacy of our modeling in simulation and on hardware, demonstrating an unmatched level of believability for a costumed robotic character.
中文摘要 动画角色通常以非物理的方式移动，其身体比例也与典型的行走机器人相去甚远。这为机械设计和风格化运动控制的创新提供了理想的平台。本文将奥拉夫带入物理世界，依靠以动画参考为引导的强化学习进行控制。为了营造奥拉夫双脚沿身体移动的错觉，我们把两条不对称的腿藏在柔软的泡沫裙下。为了在角色内部安装执行器，我们在手臂、嘴巴和眼睛中使用球面和平面连杆机构。由于行走周期会产生刺耳的接触声，我们引入了额外的奖励，明显降低了撞击噪音。硕大的头部由角色细长脖子中的小型执行器驱动，带来了过热风险，而服装又加剧了这一风险。为了防止执行器过热，我们将温度值作为额外输入提供给策略，并引入新的奖励使温度保持在限定范围内。我们在仿真和硬件上验证了建模的有效性，展示了着装机器人角色前所未有的可信度。

Coordinated Anti-Jamming Resilience in Swarm Networks via Multi-Agent Reinforcement Learning

通过多智能体强化学习实现集群网络中的协调抗干扰韧性

Authors: Bahman Abolhassani, Tugba Erpek, Kemal Davaslioglu, Yalin E. Sagduyu, Sastry Kompella
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2512.16813
Pdf link: https://arxiv.org/pdf/2512.16813
Abstract Reactive jammers pose a severe security threat to robotic-swarm networks by selectively disrupting inter-agent communications and undermining formation integrity and mission success. Conventional countermeasures such as fixed power control or static channel hopping are largely ineffective against such adaptive adversaries. This paper presents a multi-agent reinforcement learning (MARL) framework based on the QMIX algorithm to improve the resilience of swarm communications under reactive jamming. We consider a network of multiple transmitter-receiver pairs sharing channels while a reactive jammer with Markovian threshold dynamics senses aggregate power and reacts accordingly. Each agent jointly selects transmit frequency (channel) and power, and QMIX learns a centralized but factorizable action-value function that enables coordinated yet decentralized execution. We benchmark QMIX against a genie-aided optimal policy in a no-channel-reuse setting, and against local Upper Confidence Bound (UCB) and a stateless reactive policy in a more general fading regime with channel reuse enabled. Simulation results show that QMIX rapidly converges to cooperative policies that nearly match the genie-aided bound, while achieving higher throughput and lower jamming incidence than the baselines, thereby demonstrating MARL's effectiveness for securing autonomous swarms in contested environments.
中文摘要 反应式干扰器通过选择性地扰乱智能体间通信，破坏编队完整性和任务成功，对机器人集群网络构成严重的安全威胁。固定功率控制或静态信道跳变等传统对抗措施对此类自适应对手基本无效。本文提出了一个基于QMIX算法的多智能体强化学习（MARL）框架，以提升集群通信在反应式干扰下的韧性。我们考虑一个由多对发射机-接收机共享信道组成的网络，其中具有马尔可夫阈值动态的反应式干扰器感知总功率并做出相应反应。每个智能体联合选择发射频率（信道）和功率，QMIX学习一个集中式但可分解的动作值函数，从而实现协调而去中心化的执行。我们在无信道复用的设置下将QMIX与精灵辅助（genie-aided）最优策略进行对比，并在启用信道复用的更一般的衰落场景下与局部上置信界（UCB）和无状态反应式策略进行对比。仿真结果显示，QMIX能够迅速收敛到几乎达到精灵辅助界限的合作策略，同时实现了比基线更高的吞吐量和更低的受干扰率，从而证明了MARL在对抗环境中保护自主集群的有效性。
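The "centralized but factorizable" action-value function in the abstract rests on QMIX's monotonicity constraint: the joint value is non-decreasing in every per-agent value, so each agent's local greedy action is also part of the joint greedy action. The toy below sketches that property only. It assumes a single linear mixing layer with fixed weights, whereas QMIX generates its mixing weights from the global state with hypernetworks; the Q-values and weights are made up for illustration.

```python
from itertools import product


def mix(agent_qs, raw_weights, bias=0.0):
    """Toy monotonic mixer: abs() keeps every mixing weight non-negative."""
    return sum(abs(w) * q for w, q in zip(raw_weights, agent_qs)) + bias


# Hypothetical per-agent Q-values over three (channel, power) choices each.
agent_q_tables = [
    [0.2, 1.0, -0.3],
    [0.5, 0.1, 0.9],
]
raw_weights = [0.7, -1.3]  # abs() inside mix() turns these into 0.7 and 1.3

# Decentralized execution: each agent takes the argmax of its own table.
greedy_joint = tuple(max(range(len(t)), key=t.__getitem__)
                     for t in agent_q_tables)


def joint_value(joint):
    """Mixed value of a joint action, given one action index per agent."""
    return mix([t[a] for t, a in zip(agent_q_tables, joint)], raw_weights)


# Centralized check: brute-force argmax of the mixed value over joint actions.
best_joint = max(product(*(range(len(t)) for t in agent_q_tables)),
                 key=joint_value)
```

Here `greedy_joint` and `best_joint` coincide, which is what allows training with a centralized value while executing with per-agent greedy choices.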

Meta-RL Induces Exploration in Language Agents

Meta-RL 诱导语言代理的探索

Authors: Yulun Jiang, Liangze Jiang, Damien Teney, Michael Moor, Maria Brbic
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.16848
Pdf link: https://arxiv.org/pdf/2512.16848
Abstract Reinforcement learning (RL) has enabled the training of large language model (LLM) agents to interact with the environment and to solve multi-turn long-horizon tasks. However, the RL-trained agents often struggle in tasks that require active exploration and fail to efficiently adapt from trial-and-error experiences. In this paper, we present LaMer, a general Meta-RL framework that enables LLM agents to actively explore and learn from the environment feedback at test time. LaMer consists of two key components: (i) a cross-episode training framework to encourage exploration and long-term rewards optimization; and (ii) in-context policy adaptation via reflection, allowing the agent to adapt their policy from task feedback signal without gradient update. Experiments across diverse environments show that LaMer significantly improves performance over RL baselines, with 11%, 14%, and 19% performance gains on Sokoban, MineSweeper and Webshop, respectively. Moreover, LaMer also demonstrates better generalization to more challenging or previously unseen tasks compared to the RL-trained agents. Overall, our results demonstrate that Meta-RL provides a principled approach to induce exploration in language agents, enabling more robust adaptation to novel environments through learned exploration strategies.
中文摘要 强化学习（RL）使得训练大型语言模型（LLM）智能体与环境交互并解决多轮长时程任务成为可能。然而，经过强化学习训练的智能体在需要主动探索的任务中常常表现不佳，且无法高效地从试错经验中适应。本文提出了LaMer，一种通用的Meta-RL框架，使LLM智能体能够在测试时主动探索并从环境反馈中学习。LaMer包含两个关键组成部分：（i）跨回合（cross-episode）训练框架，鼓励探索和长期奖励优化；以及（ii）通过反思实现的上下文内策略适应，使智能体无需梯度更新即可根据任务反馈信号调整策略。在多种环境中的实验显示，LaMer相较强化学习基线显著提升了性能，在Sokoban、MineSweeper和Webshop上分别提升了11%、14%和19%。此外，与强化学习训练的智能体相比，LaMer对更具挑战性或此前未见的任务也表现出更好的泛化能力。总体而言，我们的结果表明，Meta-RL为诱导语言智能体的探索提供了一种有原则的方法，通过学得的探索策略实现对新环境更稳健的适应。

ReinforceGen: Hybrid Skill Policies with Automated Data Generation and Reinforcement Learning

ReinforceGen：结合自动数据生成和强化学习的混合技能策略

Authors: Zihan Zhou, Animesh Garg, Ajay Mandlekar, Caelan Garrett
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.16861
Pdf link: https://arxiv.org/pdf/2512.16861
Abstract Long-horizon manipulation has been a long-standing challenge in the robotics community. We propose ReinforceGen, a system that combines task decomposition, data generation, imitation learning, and motion planning to form an initial solution, and improves each component through reinforcement-learning-based fine-tuning. ReinforceGen first segments the task into multiple localized skills, which are connected through motion planning. The skills and motion planning targets are trained with imitation learning on a dataset generated from 10 human demonstrations, and then fine-tuned through online adaptation and reinforcement learning. When benchmarked on the Robosuite dataset, ReinforceGen reaches an 80% success rate on all tasks with visuomotor controls in the highest reset range setting. Additional ablation studies show that our fine-tuning approaches contribute to an 89% average performance increase. More results and videos are available at this https URL
中文摘要 长时程操作一直是机器人领域长期面临的挑战。我们提出了ReinforceGen系统，该系统结合任务分解、数据生成、模仿学习和运动规划形成初始解决方案，并通过基于强化学习的微调改进每个组件。ReinforceGen首先将任务划分为多个局部技能，并通过运动规划将它们连接起来。技能和运动规划目标先在由10次人类演示生成的数据集上通过模仿学习进行训练，随后通过在线适应和强化学习进行微调。在Robosuite数据集上进行基准测试时，ReinforceGen在最高重置范围设置下、采用视觉运动控制的所有任务上达到了80%的成功率。额外的消融研究显示，我们的微调方法带来了89%的平均性能提升。更多结果和视频见此 https URL

RePlan: Reasoning-guided Region Planning for Complex Instruction-based Image Editing

RePlan：面向复杂指令式图像编辑的推理引导区域规划

Authors: Tianyuan Qu, Lei Ke, Xiaohang Zhan, Longxiang Tang, Yuqi Liu, Bohao Peng, Bei Yu, Dong Yu, Jiaya Jia
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.16864
Pdf link: https://arxiv.org/pdf/2512.16864
Abstract Instruction-based image editing enables natural-language control over visual modifications, yet existing models falter under Instruction-Visual Complexity (IV-Complexity), where intricate instructions meet cluttered or ambiguous scenes. We introduce RePlan (Region-aligned Planning), a plan-then-execute framework that couples a vision-language planner with a diffusion editor. The planner decomposes instructions via step-by-step reasoning and explicitly grounds them to target regions; the editor then applies changes using a training-free attention-region injection mechanism, enabling precise, parallel multi-region edits without iterative inpainting. To strengthen planning, we apply GRPO-based reinforcement learning using 1K instruction-only examples, yielding substantial gains in reasoning fidelity and format reliability. We further present IV-Edit, a benchmark focused on fine-grained grounding and knowledge-intensive edits. Across IV-Complex settings, RePlan consistently outperforms strong baselines trained on far larger datasets, improving regional precision and overall fidelity. Our project page: this https URL
中文摘要 基于指令的图像编辑使人们能够用自然语言控制视觉修改，但现有模型在指令-视觉复杂性（IV-Complexity）下表现不佳，即复杂的指令遇上杂乱或含糊的场景。我们提出了RePlan（区域对齐规划），这是一个先规划后执行的框架，将视觉语言规划器与扩散编辑器结合起来。规划器通过逐步推理分解指令，并将其显式地定位到目标区域；编辑器随后通过免训练的注意力区域注入机制应用修改，无需迭代修复（inpainting）即可实现精确、并行的多区域编辑。为加强规划能力，我们使用1K个仅含指令的示例进行基于GRPO的强化学习，显著提升了推理保真度和格式可靠性。我们还提出了IV-Edit，这是一个侧重细粒度定位和知识密集型编辑的基准。在各种IV-Complex设置下，RePlan始终优于在大得多的数据集上训练的强基线，提升了区域精度和整体保真度。我们的项目页面：此 https URL
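The GRPO step mentioned in the abstract scores each sampled response relative to the other responses drawn for the same prompt, rather than against a learned value function. A minimal sketch of that group-relative advantage follows. The reward values are invented, and the use of the population standard deviation and the `eps` term are assumptions, since implementations differ on both.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, see lead-in
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four plans sampled from the same instruction,
# e.g. 1.0 when the plan is well formed and correctly grounded, else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses above the group mean receive positive advantages and those below receive negative ones; a group whose rewards are all equal yields zero advantages and so carries no learning signal.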

AdaSearch: Balancing Parametric Knowledge and Search in Large Language Models via Reinforcement Learning

AdaSearch：通过强化学习平衡大型语言模型中的参数知识与搜索

Authors: Tzu-Han Lin, Wei-Lin Chen, Chen-An Li, Hung-yi Lee, Yun-Nung Chen, Yu Meng
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.16883
Pdf link: https://arxiv.org/pdf/2512.16883
Abstract Equipping large language models (LLMs) with search engines via reinforcement learning (RL) has emerged as an effective approach for building search agents. However, overreliance on search introduces unnecessary cost and risks exposure to noisy or malicious content, while relying solely on parametric knowledge risks hallucination. The central challenge is to develop agents that adaptively balance parametric knowledge with external search, invoking search only when necessary. Prior work mitigates search overuse by shaping rewards around the number of tool calls. However, these penalties require substantial reward engineering, provide ambiguous credit assignment, and can be exploited by agents that superficially reduce calls. Moreover, evaluating performance solely through call counts conflates necessary and unnecessary search, obscuring the measurement of true adaptive behavior. To address these limitations, we first quantify the self-knowledge awareness of existing search agents via an F1-based decision metric, revealing that methods such as Search-R1 often overlook readily available parametric knowledge. Motivated by these findings, we propose AdaSearch, a simple two-stage, outcome-driven RL framework that disentangles problem solving from the decision of whether to invoke search, and makes this decision process explicit and interpretable. This transparency is crucial for high-stakes domains such as finance and medical question answering, yet is largely neglected by prior approaches. Experiments across multiple model families and sizes demonstrate that AdaSearch substantially improves knowledge-boundary awareness, reduces unnecessary search calls, preserves strong task performance, and offers more transparent, interpretable decision behaviors.
中文摘要 通过强化学习（RL）为大型语言模型（LLM）配备搜索引擎，已成为构建搜索智能体的有效方法。然而，过度依赖搜索会带来不必要的成本，并有接触噪声或恶意内容的风险；而仅依赖参数化知识则有产生幻觉的风险。核心挑战是开发能够自适应地平衡参数化知识与外部搜索、仅在必要时调用搜索的智能体。以往的工作通过围绕工具调用次数来设计奖励，以缓解搜索的过度使用。然而，这些惩罚需要大量的奖励工程，信用分配含糊，并且可能被表面上减少调用的智能体所利用。此外，仅通过调用次数评估性能会混淆必要和不必要的搜索，从而掩盖对真正自适应行为的衡量。为解决这些局限性，我们首先通过基于F1的决策指标量化现有搜索智能体的自我知识感知能力，发现Search-R1等方法常常忽视现成可用的参数化知识。基于这些发现，我们提出了AdaSearch，一个简单的两阶段、结果驱动的强化学习框架，将问题求解与是否调用搜索的决策解耦，并使这一决策过程显式且可解释。这种透明性对于金融和医疗问答等高风险领域至关重要，但以往的方法大多忽视了这一点。在多个模型家族和规模上的实验表明，AdaSearch显著提升了知识边界感知能力，减少了不必要的搜索调用，保持了强劲的任务性能，并提供了更透明、可解释的决策行为。
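The abstract argues that raw call counts conflate necessary and unnecessary searches, and that an F1-based decision metric separates them. One plausible reading is sketched below: treat "search was needed" as the positive class and score the agent's search/no-search decisions against it. The paper's exact definition is not given in the abstract, so the labeling scheme, the function name, and the example data are all assumptions.

```python
def decision_f1(searched, needed):
    """F1 of search decisions, with 'search was needed' as the positive class."""
    pairs = list(zip(searched, needed))
    tp = sum(1 for s, n in pairs if s and n)          # searched, and needed to
    fp = sum(1 for s, n in pairs if s and not n)      # searched unnecessarily
    fn = sum(1 for s, n in pairs if not s and n)      # skipped a needed search
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical agent that searches on three of four questions, although
# only two of those questions lay outside its parametric knowledge.
searched = [True, True, True, False]
needed = [True, False, True, False]
score = decision_f1(searched, needed)
```

In this example recall is perfect but precision is 2/3, so the score drops to 0.8; an agent that always searches is penalized through precision even though its call count alone says nothing about which calls were wasted.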

MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning

MomaGraph：基于视觉语言模型、面向具身任务规划的状态感知统一场景图

Authors: Yuanchen Ju, Yongyuan Liang, Yen-Jen Wang, Nandiraju Gireesh, Yuanliang Ju, Seungjae Lee, Qiao Gu, Elvis Hsieh, Furong Huang, Koushil Sreenath
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.16909
Pdf link: https://arxiv.org/pdf/2512.16909
Abstract Mobile manipulators in households must both navigate and manipulate. This requires a compact, semantically rich scene representation that captures where objects are, how they function, and which parts are actionable. Scene graphs are a natural choice, yet prior work often separates spatial and functional relations, treats scenes as static snapshots without object states or temporal updates, and overlooks information most relevant for accomplishing the current task. To address these limitations, we introduce MomaGraph, a unified scene representation for embodied agents that integrates spatial-functional relationships and part-level interactive elements. However, advancing such a representation requires both suitable data and rigorous evaluation, which have been largely missing. We thus contribute MomaGraph-Scenes, the first large-scale dataset of richly annotated, task-driven scene graphs in household environments, along with MomaGraph-Bench, a systematic evaluation suite spanning six reasoning capabilities from high-level planning to fine-grained scene understanding. Built upon this foundation, we further develop MomaGraph-R1, a 7B vision-language model trained with reinforcement learning on MomaGraph-Scenes. MomaGraph-R1 predicts task-oriented scene graphs and serves as a zero-shot task planner under a Graph-then-Plan framework. Extensive experiments demonstrate that our model achieves state-of-the-art results among open-source models, reaching 71.6% accuracy on the benchmark (+11.4% over the best baseline), while generalizing across public benchmarks and transferring effectively to real-robot experiments.
中文摘要 家庭中的移动操作机器人既要导航又要操作。这需要一个紧凑、语义丰富的场景表示，能够刻画物体在哪里、如何发挥功能以及哪些部件可以操作。场景图是自然的选择，但以往工作常常将空间关系和功能关系分开处理，将场景视为没有物体状态或时间更新的静态快照，并忽略了与完成当前任务最相关的信息。为解决这些局限，我们引入了MomaGraph，一种面向具身智能体的统一场景表示，整合了空间-功能关系和部件级交互元素。然而，推进这种表示需要合适的数据和严谨的评估，而这两者此前基本缺失。因此，我们贡献了MomaGraph-Scenes，这是首个家庭环境中标注丰富、任务驱动的大规模场景图数据集，以及MomaGraph-Bench，一个涵盖从高层规划到细粒度场景理解的六种推理能力的系统评估套件。在此基础上，我们进一步开发了MomaGraph-R1，这是一个在MomaGraph-Scenes上通过强化学习训练的7B视觉语言模型。MomaGraph-R1预测面向任务的场景图，并在“先建图后规划”（Graph-then-Plan）框架下充当零样本任务规划器。大量实验表明，我们的模型在开源模型中取得了最先进的结果，在该基准上达到71.6%的准确率（比最佳基线高11.4%），同时能够泛化到公开基准，并有效迁移到真实机器人实验中。

Posterior Behavioral Cloning: Pretraining BC Policies for Efficient RL Finetuning

后验行为克隆：预训练BC策略以实现高效的强化学习微调

Authors: Andrew Wagenmaker, Perry Dong, Raymond Tsao, Chelsea Finn, Sergey Levine
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.16911
Pdf link: https://arxiv.org/pdf/2512.16911
Abstract Standard practice across domains from robotics to language is to first pretrain a policy on a large-scale demonstration dataset, and then finetune this policy, typically with reinforcement learning (RL), in order to improve performance on deployment domains. This finetuning step has proved critical in achieving human or super-human performance, yet while much attention has been given to developing more effective finetuning algorithms, little attention has been given to ensuring the pretrained policy is an effective initialization for RL finetuning. In this work we seek to understand how the pretrained policy affects finetuning performance, and how to pretrain policies in order to ensure they are effective initializations for finetuning. We first show theoretically that standard behavioral cloning (BC) -- which trains a policy to directly match the actions played by the demonstrator -- can fail to ensure coverage over the demonstrator's actions, a minimal condition necessary for effective RL finetuning. We then show that if, instead of exactly fitting the observed demonstrations, we train a policy to model the posterior distribution of the demonstrator's behavior given the demonstration dataset, we do obtain a policy that ensures coverage over the demonstrator's actions, enabling more effective finetuning. Furthermore, this policy -- which we refer to as the posterior behavioral cloning (PostBC) policy -- achieves this while ensuring pretrained performance is no worse than that of the BC policy. We then show that PostBC is practically implementable with modern generative models in robotic control domains -- relying only on standard supervised learning -- and leads to significantly improved RL finetuning performance on both realistic robotic control benchmarks and real-world robotic manipulation tasks, as compared to standard behavioral cloning.
中文摘要 从机器人到语言的各个领域，标准做法是先在大规模演示数据集上预训练策略，然后通常借助强化学习（RL）对该策略进行微调，以提升其在部署领域的性能。这一微调步骤已被证明对实现人类水平或超人水平的性能至关重要；然而，尽管人们对开发更有效的微调算法给予了大量关注，却很少关注如何确保预训练策略是强化学习微调的有效初始化。本研究旨在理解预训练策略如何影响微调性能，以及如何预训练策略以确保其成为微调的有效初始化。我们首先从理论上证明，标准行为克隆（BC），即训练策略直接匹配演示者所采取的动作，可能无法确保覆盖演示者的动作，而这是有效强化学习微调所必需的最低条件。随后我们证明，如果不去精确拟合观测到的演示，而是训练策略来建模给定演示数据集时演示者行为的后验分布，我们确实能得到一个确保覆盖演示者动作的策略，从而实现更有效的微调。此外，这一策略（我们称之为后验行为克隆（PostBC）策略）在做到这一点的同时，确保预训练性能不逊于BC策略。随后我们表明，PostBC在机器人控制领域可以借助现代生成模型实际实现，且仅依赖标准监督学习，并且与标准行为克隆相比，在逼真的机器人控制基准和真实世界机器人操作任务上都显著提升了强化学习微调性能。

Exploration v.s. Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward

探索与利用：通过裁剪、熵和虚假奖励重新思考RLVR

Authors: Peter Chen, Xiaopeng Li, Ziniu Li, Wotao Yin, Xi Chen, Tianyi Lin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.16912
Pdf link: https://arxiv.org/pdf/2512.16912
Abstract This paper examines the exploration-exploitation trade-off in reinforcement learning with verifiable rewards (RLVR), a framework for improving the reasoning of Large Language Models (LLMs). Recent studies suggest that RLVR can elicit strong mathematical reasoning in LLMs through two seemingly paradoxical mechanisms: spurious rewards, which suppress exploitation by rewarding outcomes unrelated to the ground truth, and entropy minimization, which suppresses exploration by pushing the model toward more confident and deterministic outputs, highlighting a puzzling dynamic: both discouraging exploitation and discouraging exploration improve reasoning performance, yet the underlying principles that reconcile these effects remain poorly understood. We focus on two fundamental questions: (i) how policy entropy relates to performance, and (ii) whether spurious rewards yield gains, potentially through the interplay of clipping bias and model contamination. Our results show that clipping bias under spurious rewards reduces policy entropy, leading to more confident and deterministic outputs, while entropy minimization alone is insufficient for improvement. We further propose a reward-misalignment model explaining why spurious rewards can enhance performance beyond contaminated settings. Our findings clarify the mechanisms behind spurious-reward benefits and provide principles for more effective RLVR training.
中文摘要 本文探讨了带可验证奖励的强化学习（RLVR）中的探索与利用权衡，RLVR是一种提升大型语言模型（LLM）推理能力的框架。最新研究表明，RLVR可以通过两种看似矛盾的机制在LLM中激发强大的数学推理能力：虚假奖励，通过奖励与真实答案无关的结果来抑制利用；以及熵最小化，通过推动模型产生更自信、更确定的输出来抑制探索。这凸显了一个令人困惑的现象：抑制利用和抑制探索都能提升推理性能，然而调和这些效应的基本原理仍未得到充分理解。我们关注两个基本问题：（i）策略熵与性能有何关系；（ii）虚假奖励是否带来收益，这可能是通过裁剪偏差与模型污染的相互作用实现的。我们的结果表明，在虚假奖励下，裁剪偏差会降低策略熵，从而产生更自信、更确定的输出，而仅靠熵最小化不足以带来改进。我们还提出了一个奖励错配模型，解释了为何虚假奖励在存在数据污染的设置之外也能提升性能。我们的发现阐明了虚假奖励带来收益的机制，并为更有效的RLVR训练提供了原则。
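As background for the "clipping" in the title, the sketch below shows the PPO-style clipped surrogate that GRPO-like RLVR objectives commonly use. Whether this is exactly the form the paper analyzes is an assumption, since the abstract does not spell it out; `clip_eps=0.2` and the example ratios are likewise illustrative.

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Per-sample PPO-style objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# ratio = pi_new(token) / pi_old(token); advantage comes from the reward.
gain_capped = clipped_surrogate(1.5, advantage=1.0)    # clipped at 1.2 * A
loss_capped = clipped_surrogate(0.5, advantage=-1.0)   # clipped at 0.8 * A
unclipped = clipped_surrogate(1.5, advantage=-1.0)     # min keeps r * A
```

The clip is one-sided for each sign of the advantage: the objective stops rewarding a positive-advantage token once its ratio exceeds `1 + clip_eps`, and stops rewarding a further decrease of a negative-advantage token once its ratio falls below `1 - clip_eps`. That asymmetry in which updates pass through is the kind of effect a clipping-bias analysis would examine.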

Generative Adversarial Reasoner: Enhancing LLM Reasoning with Adversarial Reinforcement Learning

生成式对抗性推理器：利用对抗强化学习增强LLM推理

Authors: Qihao Liu, Luoxin Ye, Wufei Ma, Yu-Cheng Chou, Alan Yuille
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.16917
Pdf link: https://arxiv.org/pdf/2512.16917
Abstract Large language models (LLMs) with explicit reasoning capabilities excel at mathematical reasoning yet still commit process errors, such as incorrect calculations, brittle logic, and superficially plausible but invalid steps. In this paper, we introduce Generative Adversarial Reasoner, an on-policy joint training framework designed to enhance reasoning by co-evolving an LLM reasoner and an LLM-based discriminator through adversarial reinforcement learning. A compute-efficient review schedule partitions each reasoning chain into logically complete slices of comparable length, and the discriminator evaluates each slice's soundness with concise, structured justifications. Learning couples complementary signals: the LLM reasoner is rewarded for logically consistent steps that yield correct answers, while the discriminator earns rewards for correctly detecting errors or distinguishing traces in the reasoning process. This produces dense, well-calibrated, on-policy step-level rewards that supplement sparse exact-match signals, improving credit assignment, increasing sample efficiency, and enhancing overall reasoning quality of LLMs. Across various mathematical benchmarks, the method delivers consistent gains over strong baselines with standard RL post-training. Specifically, on AIME24, we improve DeepSeek-R1-Distill-Qwen-7B from 54.0 to 61.3 (+7.3) and DeepSeek-R1-Distill-Llama-8B from 43.7 to 53.7 (+10.0). The modular discriminator also enables flexible reward shaping for objectives such as teacher distillation, preference alignment, and mathematical proof-based reasoning.
中文摘要 具备显式推理能力的大型语言模型（LLM）擅长数学推理，但仍会犯过程性错误，如计算错误、逻辑脆弱以及表面上看似合理但实际无效的步骤。本文提出了生成式对抗推理器（Generative Adversarial Reasoner），这是一种同策略（on-policy）联合训练框架，旨在通过对抗强化学习使LLM推理器与基于LLM的判别器共同进化，从而增强推理能力。一种计算高效的审查方案将每条推理链划分为长度相近、逻辑完整的片段，判别器以简明、结构化的理由评估每个片段的合理性。学习过程结合了互补的信号：LLM推理器因逻辑一致且得出正确答案的步骤而获得奖励，而判别器则因正确检测错误或区分推理过程中的轨迹而获得奖励。这会产生密集、校准良好、同策略的步骤级奖励，对稀疏的精确匹配信号形成补充，改善信用分配，提高样本效率，并提升LLM的整体推理质量。在多个数学基准测试中，该方法相较采用标准强化学习后训练的强基线取得了一致的提升。具体来说，在AIME24上，我们将DeepSeek-R1-Distill-Qwen-7B从54.0提升到61.3（+7.3），将DeepSeek-R1-Distill-Llama-8B从43.7提升到53.7（+10.0）。模块化的判别器还支持针对教师蒸馏、偏好对齐和基于数学证明的推理等目标进行灵活的奖励塑造。

AdaTooler-V: Adaptive Tool-Use for Images and Videos

AdaTooler-V：图像和视频自适应工具使用

Authors: Chaoyang Wang, Kaituo Feng, Dongyang Chen, Zhongyu Wang, Zhixun Li, Sicheng Gao, Meng Meng, Xu Zhou, Manyuan Zhang, Yuzhang Shang, Xiangyu Yue
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.16918
Pdf link: https://arxiv.org/pdf/2512.16918
Abstract Recent advances have shown that multimodal large language models (MLLMs) benefit from multimodal interleaved chain-of-thought (CoT) with vision tool interactions. However, existing open-source models often exhibit blind tool-use reasoning patterns, invoking vision tools even when they are unnecessary, which significantly increases inference overhead and degrades model performance. To this end, we propose AdaTooler-V, an MLLM that performs adaptive tool-use by determining whether a visual problem truly requires tools. First, we introduce AT-GRPO, a reinforcement learning algorithm that adaptively adjusts reward scales based on the Tool Benefit Score of each sample, encouraging the model to invoke tools only when they provide genuine improvements. Moreover, we construct two datasets to support training: AdaTooler-V-CoT-100k for SFT cold start and AdaTooler-V-300k for RL with verifiable rewards across single-image, multi-image, and video data. Experiments across twelve benchmarks demonstrate the strong reasoning capability of AdaTooler-V, outperforming existing methods in diverse visual reasoning tasks. Notably, AdaTooler-V-7B achieves an accuracy of 89.8% on the high-resolution benchmark V*, surpassing the commercial proprietary models GPT-4o and Gemini 1.5 Pro. All code, models, and data are released.
中文摘要 最新进展表明，多模态大型语言模型（MLLM）受益于多模态交错思维链（CoT）与视觉工具交互。然而，现有的开源模型常常表现出盲目的工具使用推理模式，即使视觉工具并非必要，也会调用这些工具，这大大增加了推理开销并降低了模型性能。为此，我们提出了AdaTooler-V，一种MLLM，通过判断视觉问题是否真的需要工具来实现自适应工具使用。首先，我们介绍了AT-GRPO，一种强化学习算法，能够根据每个样本的工具效益评分自适应调整奖励量表，鼓励模型仅在工具带来真实改进时调用。此外，我们构建了两个数据集支持训练：AdaTooler-V-CoT-100k用于SFT冷启动，AdaTooler-V-300k用于强化学习，并在单图、多图和视频数据中提供可验证的奖励。十二个基准测试的实验展示了AdaTooler-V强大的推理能力，在多样化的视觉推理任务中优于现有方法。值得注意的是，AdaTooler-V-7B 在高分辨率基准测试 V* 上实现了 89.8% 的准确率，超过了商业专有模型 GPT-4o 和 Gemini 1.5 Pro。所有代码、模型和数据均已公开。

Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification

重要的差异：审计模型以发现并纠正能力差距

Authors: Qihao Liu, Chengzhi Mao, Yaojie Liu, Alan Yuille, Wen-Sheng Chu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.16921
Pdf link: https://arxiv.org/pdf/2512.16921
Abstract Conventional evaluation methods for multimodal LLMs (MLLMs) lack interpretability and are often insufficient to fully disclose significant capability gaps across models. To address this, we introduce AuditDM, an automated framework that actively discovers and rectifies MLLM failure modes by auditing their divergence. AuditDM fine-tunes an MLLM as an auditor via reinforcement learning to generate challenging questions and counterfactual images that maximize disagreement among target models. Once trained, the auditor uncovers diverse, interpretable exemplars that reveal model weaknesses and serve as annotation-free data for rectification. When applied to SoTA models like Gemma-3 and PaliGemma-2, AuditDM discovers more than 20 distinct failure types. Fine-tuning on these discoveries consistently improves all models across 16 benchmarks, and enables a 3B model to surpass its 28B counterpart. Our results suggest that as data scaling hits diminishing returns, targeted model auditing offers an effective path to model diagnosis and improvement.
中文摘要 传统的多模态LLM（MLLM）评估方法缺乏可解释性，且往往不足以充分揭示模型之间显著的能力差距。为此，我们引入了AuditDM，一个自动化框架，通过审计MLLM之间的分歧，主动发现并纠正其失效模式。AuditDM通过强化学习将一个MLLM微调为审计器，使其生成具有挑战性的问题和反事实图像，以最大化目标模型之间的分歧。训练完成后，审计器能够发掘多样且可解释的范例，这些范例揭示模型的弱点，并可作为无需标注的数据用于纠正。应用于Gemma-3和PaliGemma-2等SoTA模型时，AuditDM发现了20多种不同的失效类型。基于这些发现进行微调，在16个基准测试上一致地提升了所有模型，并使一个3B模型超越了对应的28B模型。我们的结果表明，随着数据规模扩展的收益递减，有针对性的模型审计为模型诊断和改进提供了一条有效路径。

Keyword: diffusion policy

TS-DP: Reinforcement Speculative Decoding For Temporal Adaptive Diffusion Policy Acceleration

TS-DP：面向时间自适应扩散策略加速的强化推测解码

Authors: Ye Li, Jiahe Feng, Yuan Meng, Kangye Ji, Chen Tang, Xinwan Wen, Shutao Xia, Zhi Wang, Wenwu Zhu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.15773
Pdf link: https://arxiv.org/pdf/2512.15773
Abstract Diffusion Policy (DP) excels in embodied control but suffers from high inference latency and computational cost due to multiple iterative denoising steps. The temporal complexity of embodied tasks demands a dynamic and adaptable computation mode. Static and lossy acceleration methods, such as quantization, fail to handle such dynamic embodied tasks, while speculative decoding offers a lossless and adaptive yet underexplored alternative for DP. However, it is non-trivial to address the following challenges: how to match the base model's denoising quality at lower cost under time-varying task difficulty in embodied settings, and how to dynamically and interactively adjust computation based on task difficulty in such environments. In this paper, we propose Temporal-aware Reinforcement-based Speculative Diffusion Policy (TS-DP), the first framework that enables speculative decoding for DP with temporal adaptivity. First, to handle dynamic environments where task difficulty varies over time, we distill a Transformer-based drafter to imitate the base model and replace its costly denoising calls. Second, an RL-based scheduler further adapts to time-varying task difficulty by adjusting speculative parameters to maintain accuracy while improving efficiency. Extensive experiments across diverse embodied environments demonstrate that TS-DP achieves up to 4.17 times faster inference with over 94% accepted drafts, reaching an inference frequency of 25 Hz and enabling real-time diffusion-based control without performance degradation.
中文摘要 扩散策略（DP）在具身控制方面表现出色，但由于需要多次迭代去噪，推理延迟高、计算成本大。具身任务在时间上的复杂性要求一种动态且可自适应的计算模式。量化等静态且有损的加速方法无法应对这类动态的具身任务，而推测解码为DP提供了一种无损、自适应但尚未被充分探索的替代方案。然而，解决以下挑战并非易事：在具身环境中，如何在任务难度随时间变化的情况下以更低的成本达到基础模型的去噪质量，以及如何在此类环境中根据任务难度动态、交互式地调整计算量。本文提出了时间感知的基于强化学习的推测扩散策略（TS-DP），这是首个使DP的推测解码具备时间自适应性的框架。首先，为了应对任务难度随时间变化的动态环境，我们蒸馏出一个基于Transformer的草稿模型（drafter）来模仿基础模型，并替代其代价高昂的去噪调用。其次，基于强化学习的调度器通过调整推测参数，进一步适应随时间变化的任务难度，在保持精度的同时提高效率。在多种具身环境中的大量实验表明，TS-DP实现了最高4.17倍的推理加速，草稿接受率超过94%，推理频率达到25 Hz，在不降低性能的前提下实现了基于扩散的实时控制。