Arxiv Papers of Today

Generated: 2025-12-31 16:33:16 (UTC+8); Arxiv release time: 2025-12-30 20:00 EST (2025-12-31 09:00 UTC+8)

51 relevant papers today

Keyword: reinforcement learning

Unbiased Visual Reasoning with Controlled Visual Inputs

带有受控视觉输入的无偏视觉推理

Authors: Zhaonan Li, Shijie Lu, Fei Wang, Jacob Dineen, Xiao Ye, Zhikun Xu, Siyi Liu, Young Min Cho, Bangzheng Li, Daniel Chang, Kenny Nguyen, Qizheng Yang, Muhao Chen, Ben Zhou
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.22183
Pdf link: https://arxiv.org/pdf/2512.22183
Abstract End-to-end Vision-language Models (VLMs) often answer visual questions by exploiting spurious correlations instead of causal visual evidence, and can become more shortcut-prone when fine-tuned. We introduce VISTA (Visual-Information Separation for Text-based Analysis), a modular framework that decouples perception from reasoning via an explicit information bottleneck. A frozen VLM sensor is restricted to short, objective perception queries, while a text-only LLM reasoner decomposes each question, plans queries, and aggregates visual facts in natural language. This controlled interface defines a reward-aligned environment for training unbiased visual reasoning with reinforcement learning. Instantiated with Qwen2.5-VL and Llama3.2-Vision sensors, and trained with GRPO from only 641 curated multi-step questions, VISTA significantly improves robustness to real-world spurious correlations on SpuriVerse (+16.29% with Qwen-2.5-VL-7B and +6.77% with Llama-3.2-Vision-11B), while remaining competitive on MMVP and a balanced SeedBench subset. VISTA transfers robustly across unseen VLM sensors and is able to recognize and recover from VLM perception failures. Human analysis further shows that VISTA's reasoning traces are more neutral, less reliant on spurious attributes, and more explicitly grounded in visual evidence than end-to-end VLM baselines.
中文摘要 端到端视觉语言模型（VLMs）通常通过利用虚假的相关性而非因果视觉证据来回答视觉问题，经过精细调校后可能更容易走捷径。我们介绍了VISTA（基于文本的视觉信息分离），这是一个模块化框架，通过显式信息瓶颈将感知与推理分离。冻结的VLM传感器仅限于短暂、客观的感知查询，而纯文本的LLM推理器则分解每个问题，规划查询，并以自然语言汇总视觉事实。这种受控界面定义了一个与奖励对齐的环境，用于训练带有强化学习的无偏见视觉推理。VISTA采用Qwen2.5-VL和Llama3.2-Vision传感器实例化，并仅用641个精心策划的多步问题进行GRPO训练，在SpuriVerse上显著提升了对现实世界虚假相关性的鲁棒性（Qwen-2.5-VL-7B为+16.29%，Llama-3.2-Vision-11B为+6.77%），同时在MMVP和平衡的SeedBench子集上保持竞争力。VISTA可稳健地跨越未见的VLM传感器传输，能够识别并恢复VLM感知故障。人类分析进一步表明，VISTA的推理痕迹更为中立，较少依赖虚假属性，且比端到端VLM基线更明确地基于视觉证据。

Learning Tennis Strategy Through Curriculum-Based Dueling Double Deep Q-Networks

通过课程对决学习网球策略双深度Q网络

Authors: Vishnu Mohan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.22186
Pdf link: https://arxiv.org/pdf/2512.22186
Abstract Tennis strategy optimization is a challenging sequential decision-making problem involving hierarchical scoring, stochastic outcomes, long-horizon credit assignment, physical fatigue, and adaptation to opponent skill. I present a reinforcement learning framework that integrates a custom tennis simulation environment with a Dueling Double Deep Q-Network (DDQN) trained using curriculum learning. The environment models complete tennis scoring at the level of points, games, and sets, rally-level tactical decisions across ten discrete action categories, symmetric fatigue dynamics, and a continuous opponent skill parameter. The dueling architecture decomposes action-value estimation into state-value and advantage components, while double Q-learning reduces overestimation bias and improves training stability in this long-horizon stochastic domain. Curriculum learning progressively increases opponent difficulty from 0.40 to 0.50, enabling robust skill acquisition without the training collapse observed under fixed opponents. Across extensive evaluations, the trained agent achieves win rates between 98 and 100 percent against balanced opponents and maintains strong performance against more challenging opponents. Serve efficiency ranges from 63.0 to 67.5 percent, and return efficiency ranges from 52.8 to 57.1 percent. Ablation studies demonstrate that both the dueling architecture and curriculum learning are necessary for stable convergence, while a standard DQN baseline fails to learn effective policies. Despite strong performance, tactical analysis reveals a pronounced defensive bias, with the learned policy prioritizing error avoidance and prolonged rallies over aggressive point construction. These results highlight a limitation of win-rate-driven optimization in simplified sports simulations and emphasize the importance of reward design for realistic sports reinforcement learning.
中文摘要 网球策略优化是一个具有挑战性的连续决策问题，涉及层级得分、随机结果、长期计分分配、身体疲劳以及对对手技能的适应。我提出了一个强化学习框架，将定制网球模拟环境与通过课程学习训练的双重深度Q网络（DDQN）集成。环境模型涵盖了网球得分、局数和盘数，涵盖十个离散动作类别的拉力赛级战术决策，对称疲劳动态，以及连续的对手技能参数。对抗架构将动作值估计分解为状态值和优势成分，而双Q学习则减少了高估偏差，并改善了在这一长视野随机领域中的训练稳定性。课程学习逐步将对手难度从0.40提升到0.50，从而实现了稳健的技能习得，同时避免了固定对手下观察到的训练崩溃。经过广泛评估，受过训练的经纪人在面对平衡对手时的胜率在98%至100%之间，并在面对更具挑战性的对手时保持强劲表现。发球效率范围为63.0%至67.5%，回击效率范围为52.8%至57.1%。消融研究表明，对抗架构和课程学习对于稳定收敛是必要的，而标准的DQN基线则无法学习有效的策略。尽管表现强劲，战术分析显示明显的防御倾向，学到的策略优先考虑避免错误和持久反击，而非激进的得分构建。这些结果凸显了简化体育模拟中胜率驱动优化的局限性，并强调了奖励设计对真实体育强化学习的重要性。
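
Note: the abstract above names three mechanisms (dueling decomposition, double Q-learning, and a curriculum over opponent skill). The following is a minimal sketch of the textbook forms of each, not the author's implementation. The function names, the mean-advantage baseline, and the linear shape of the curriculum schedule are assumptions; only the 0.40 to 0.50 skill range comes from the abstract.

```python
from typing import List


def dueling_q_values(state_value: float, advantages: List[float]) -> List[float]:
    """Combine V(s) and A(s, a) into Q(s, a), subtracting the mean advantage."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + adv - mean_adv for adv in advantages]


def double_dqn_target(reward: float, done: bool, gamma: float,
                      next_q_online: List[float],
                      next_q_target: List[float]) -> float:
    """Double Q-learning target: the online network selects the next action,
    the target network evaluates it."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


def curriculum_skill(step: int, total_steps: int,
                     start: float = 0.40, end: float = 0.50) -> float:
    """Opponent skill at a given training step (linear ramp is an assumption)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)
```

Decoupling action selection (online network) from action evaluation (target network) is what reduces the overestimation bias mentioned in the abstract.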

Physics-Informed Machine Learning for Transformer Condition Monitoring -- Part I: Basic Concepts, Neural Networks, and Variants

基于物理的机器学习用于变压器状态监测——第一部分：基本概念、神经网络及其变体

Authors: Jose I. Aizpurua
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.22190
Pdf link: https://arxiv.org/pdf/2512.22190
Abstract Power transformers are critical assets in power networks, whose reliability directly impacts grid resilience and stability. Traditional condition monitoring approaches, often rule-based or purely physics-based, struggle with uncertainty, limited data availability, and the complexity of modern operating conditions. Recent advances in machine learning (ML) provide powerful tools to complement and extend these methods, enabling more accurate diagnostics, prognostics, and control. In this two-part series, we examine the role of Neural Networks (NNs) and their extensions in transformer condition monitoring and health management tasks. This first paper introduces the basic concepts of NNs, explores Convolutional Neural Networks (CNNs) for condition monitoring using diverse data modalities, and discusses the integration of NN concepts within the Reinforcement Learning (RL) paradigm for decision-making and control. Finally, perspectives on emerging research directions are also provided.
中文摘要 电力变压器是电力网络中的关键资产，其可靠性直接影响电网的韧性和稳定性。传统的状态监测方法，通常基于规则或纯物理，面临不确定性、有限的数据可用性以及现代作条件的复杂性。机器学习（ML）的最新进展为补充和扩展这些方法提供了强大的工具，使诊断、预后和控制更加准确。在本两部分系列中，我们将探讨神经网络（NNs）及其在变压器状态监测和健康管理任务中的作用。首篇论文介绍了神经网络的基本概念，探讨了利用多种数据模式进行状态监测的卷积神经网络（CNN），并讨论了神经网络概念在强化学习（RL）范式中的整合，用于决策和控制。最后，还提供了对新兴研究方向的视角。

Emotion-Inspired Learning Signals (EILS): A Homeostatic Framework for Adaptive Autonomous Agents

情感启发学习信号（EILS）：一种适应性自主智能体的稳态框架

Authors: Dhruv Tiwari
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.22200
Pdf link: https://arxiv.org/pdf/2512.22200
Abstract The ruling method in modern Artificial Intelligence spanning from Deep Reinforcement Learning (DRL) to Large Language Models (LLMs) relies on a surge of static, externally defined reward functions. While this "extrinsic maximization" approach has rendered superhuman performance in closed, stationary fields, it produces agents that are fragile in open-ended, real-world environments. Standard agents lack internal autonomy: they struggle to explore without dense feedback, fail to adapt to distribution shifts (non-stationarity), and require extensive manual tuning of static hyperparameters. This paper proposes that the unaddressed factor in robust autonomy is a functional analog to biological emotion, serving as a high-level homeostatic control mechanism. We introduce Emotion-Inspired Learning Signals (EILS), a unified framework that replaces scattered optimization heuristics with a coherent, bio-inspired internal feedback engine. Unlike traditional methods that treat emotions as semantic labels, EILS models them as continuous, homeostatic appraisal signals such as Curiosity, Stress, and Confidence. We formalize these signals as vector-valued internal states derived from interaction history. These states dynamically modulate the agent's optimization landscape in real time: curiosity regulates entropy to prevent mode collapse, stress modulates plasticity to overcome inactivity, and confidence adapts trust regions to stabilize convergence. We hypothesize that this closed-loop homeostatic regulation can enable EILS agents to outperform standard baselines in terms of sample efficiency and non-stationary adaptation.
中文摘要 现代人工智能的统治方法涵盖了从深度强化学习（DRL）到大型语言模型（LLMs）的广泛应用，依赖于大量静态的外部定义奖励函数。虽然这种“外在最大化”方法在封闭、静止的领域中实现了超人表现，但它却在开放的现实环境中产生了脆弱的代理。标准代理缺乏内部自主性：它们难以在没有密集反馈的情况下探索，无法适应分布变化（非平稳性），并且需要大量手动调优静态超参数。本文提出，鲁棒自主性中未被解决的因素是生物情绪的功能类比，作为一种高级稳态控制机制。我们引入了情感启发学习信号（EILS），这是一个统一框架，用连贯的仿生内部反馈引擎取代了分散的优化启发式。与将情绪视为语义标签的方法不同，EILS将其建模为连续的、稳态的评估信号，如好奇、压力和自信。我们将这些信号形式化为基于相互作用历史的矢量值内部状态。这些状态实时动态调节智能体的优化环境：好奇心调节熵以防止模式崩溃，压力调节可塑性以克服不活跃，信心调节信任区域以稳定收敛。我们假设这种闭环稳态调节能使EILS药物在样本效率和非平稳适应方面优于标准基线。

DiRL: An Efficient Post-Training Framework for Diffusion Language Models

DiRL：扩散语言模型的高效后期训练框架

Authors: Ying Zhu, Jiaxin Wan, Xiaoran Liu, Siyanag He, Qiqi Wang, Xu Guo, Tianyi Liang, Zengfeng Huang, Ziwei He, Xipeng Qiu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.22234
Pdf link: https://arxiv.org/pdf/2512.22234
Abstract Diffusion Language Models (dLLMs) have emerged as promising alternatives to Auto-Regressive (AR) models. While recent efforts have validated their pre-training potential and accelerated inference speeds, the post-training landscape for dLLMs remains underdeveloped. Existing methods suffer from computational inefficiency and objective mismatches between training and inference, severely limiting performance on complex reasoning tasks such as mathematics. To address this, we introduce DiRL, an efficient post-training framework that tightly integrates FlexAttention-accelerated blockwise training with LMDeploy-optimized inference. This architecture enables a streamlined online model update loop, facilitating efficient two-stage post-training (Supervised Fine-Tuning followed by Reinforcement Learning). Building on this framework, we propose DiPO, the first unbiased Group Relative Policy Optimization (GRPO) implementation tailored for dLLMs. We validate our approach by training DiRL-8B-Instruct on high-quality math data. Our model achieves state-of-the-art math performance among dLLMs and surpasses comparable models in the Qwen2.5 series on several benchmarks.
中文摘要 扩散语言模型（dLLMs）已成为自回归（AR）模型的有前景替代方案。尽管近期努力验证了其预训练潜力和推理速度加快，但dLLMs的后期训练环境仍然不够成熟。现有方法存在计算效率低下和训练与推理之间的客观不匹配问题，严重限制了复杂推理任务（如数学）的表现。为此，我们引入了DiRL，一个高效的训练后框架，紧密整合了FlexAttention加速的分块式训练与LMDeploy优化的推理。该架构实现了简化的在线模型更新循环，促进高效的两阶段训练后（监督微调后强化学习）。基于该框架，我们提出了DiPO，这是首个专为dLLM量身定制的无偏群相对策略优化（GRPO）实现。我们通过高质量的数学数据训练DiRL-8B-Instruct来验证我们的方法。我们的模型在dLLM中实现了最先进的数学性能，并在多个基准测试中超越了Qwen2.5系列的同类模型。
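
Note: the abstract above builds on Group Relative Policy Optimization (GRPO). As a rough illustration of the group-relative idea only, the sketch below standardizes the rewards of several responses sampled for the same prompt. It shows the commonly described GRPO advantage, not the DiPO estimator from the paper; the function name, the use of the population standard deviation, and the `eps` term are assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of sampled responses.

    Each response is scored relative to its siblings, so no learned
    value function (critic) is needed to form the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four sampled answers to one math prompt, reward 1.0 if correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no gradient signal.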

Masking Teacher and Reinforcing Student for Distilling Vision-Language Models

掩蔽教师与强化学生提炼视觉语言模型

Authors: Byung-Kwan Lee, Yu-Chiang Frank Wang, Ryo Hachiuma
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.22238
Pdf link: https://arxiv.org/pdf/2512.22238
Abstract Large-scale vision-language models (VLMs) have recently achieved remarkable multimodal understanding, but their massive size makes them impractical for deployment on mobile or edge devices. This raises the need for compact yet capable VLMs that can efficiently learn from powerful large teachers. However, distilling knowledge from a large teacher to a small student remains challenging due to their large size gap: the student often fails to reproduce the teacher's complex, high-dimensional representations, leading to unstable learning and degraded performance. To address this, we propose Masters (Masking Teacher and Reinforcing Student), a mask-progressive reinforcement learning (RL) distillation framework. Masters first masks non-dominant weights of the teacher to reduce unnecessary complexity, then progressively restores the teacher by gradually increasing its capacity during training. This strategy allows the student to learn richer representations from the teacher in a smooth and stable manner. To further refine knowledge transfer, Masters integrates an offline RL stage with two complementary rewards: an accuracy reward that measures the correctness of the generated responses, and a distillation reward that quantifies the ease of transferring responses from teacher to student. Unlike online think-answer RL paradigms that are computationally expensive and generate lengthy responses, our offline RL leverages pre-generated responses from masked teachers. These provide rich yet efficient guidance, enabling students to achieve strong performance without requiring the think-answer process.
中文摘要 大规模视觉语言模型（VLMs）近年来实现了卓越的多模态理解，但其庞大的体积使其在移动或边缘设备上部署时不切实际。这也带来了对紧凑但功能强大的VLM的需求，能够高效地向强大的大型教师学习。然而，由于教师规模差距巨大，将知识从大教师提炼到小学生仍然具有挑战性：学生常常无法复现教师复杂的高维表征，导致学习不稳定和表现下降。为此，我们提出了Masters（掩饰教师与强化学生）——一种掩饰渐进强化学习（RL）提炼框架。大师先掩盖教师的非主导体重以减少不必要的复杂性，然后通过逐步提升教师的训练能力来恢复教师的能力。这种策略使学生能够以更平稳、稳定的方式从教师那里学习更丰富的表象。为了进一步完善知识转移，Masters将离线强化学习阶段与两种互补奖励相结合：准确性奖励（衡量生成回答的正确性）和提炼奖励（量化从教师向学生传递回答的难易程度）。与计算成本高且生成冗长回答的在线思维-答案强化学习范式不同，我们的离线强化学习利用了蒙蔽教师预先生成的回答。这些指导既丰富又高效，使学生无需思考-回答过程即可取得优异表现。

Agentic Software Issue Resolution with Large Language Models: A Survey

大型语言模型下的代理软件问题解决：一项综述

Authors: Zhonghao Jiang, David Lo, Zhongxin Liu
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.22256
Pdf link: https://arxiv.org/pdf/2512.22256
Abstract Software issue resolution aims to address real-world issues in software repositories (e.g., bug fixing and efficiency optimization) based on natural language descriptions provided by users, representing a key aspect of software maintenance. With the rapid development of large language models (LLMs) in reasoning and generative capabilities, LLM-based approaches have made significant progress in automated software issue resolution. However, real-world software issue resolution is inherently complex and requires long-horizon reasoning, iterative exploration, and feedback-driven decision making, which demand agentic capabilities beyond conventional single-step approaches. Recently, LLM-based agentic systems have become mainstream for software issue resolution. Advancements in agentic software issue resolution not only greatly enhance software maintenance efficiency and quality but also provide a realistic environment for validating agentic systems' reasoning, planning, and execution capabilities, bridging artificial intelligence and software engineering. This work presents a systematic survey of 126 recent studies at the forefront of LLM-based agentic software issue resolution research. It outlines the general workflow of the task and establishes a taxonomy across three dimensions: benchmarks, techniques, and empirical studies. Furthermore, it highlights how the emergence of agentic reinforcement learning has brought a paradigm shift in the design and training of agentic systems for software engineering. Finally, it summarizes key challenges and outlines promising directions for future research.
中文摘要 软件问题解决旨在解决软件仓库中的现实问题（例如，基于用户提供的自然语言描述，如漏洞修复和效率优化），这是软件维护的关键方面。随着大型语言模型（LLMs）在推理和生成能力上的快速发展，基于LLM的方法在自动化软件问题解决方面取得了显著进展。然而，现实软件问题的解决本质上极为复杂，需要长期推理、迭代探索和反馈驱动的决策，这需要超越传统单步方法的代理能力。近年来，基于LLM的代理系统已成为软件问题解决的主流。代理软件问题解决的进步不仅极大提升了软件维护的效率和质量，还为验证代理系统的推理、规划和执行能力提供了现实的环境，连接了人工智能与软件工程。本研究系统综述了126项近期研究，这些研究处于基于LLM的代理软件问题解决研究前沿。它概述了任务的一般工作流程，并建立了三个维度的分类体系：基准测试、技术和实证研究。此外，它强调了代理强化学习的出现如何为软件工程代理系统设计和培训带来了范式转变。最后，总结了关键挑战，并提出了未来研究的有前景方向。

VideoZoomer: Reinforcement-Learned Temporal Focusing for Long Video Reasoning

VideoZoomer：强化学习的时间聚焦用于长视频推理

Authors: Yang Ding, Yizhen Zhang, Xin Lai, Ruihang Chu, Yujiu Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.22315
Pdf link: https://arxiv.org/pdf/2512.22315
Abstract Multimodal Large Language Models (MLLMs) have achieved remarkable progress in vision-language tasks yet remain limited in long video understanding due to the limited context window. Consequently, prevailing approaches tend to rely on uniform frame sampling or static pre-selection, which might overlook critical evidence and cannot correct initial selection errors during reasoning. To overcome these limitations, we propose VideoZoomer, a novel agentic framework that enables MLLMs to dynamically control their visual focus during reasoning. Starting from a coarse low-frame-rate overview, VideoZoomer invokes a temporal zoom tool to obtain high-frame-rate clips at autonomously chosen moments, thereby progressively gathering fine-grained evidence in a multi-turn interactive manner. Accordingly, we adopt a two-stage training strategy: a cold-start supervised fine-tuning phase on a curated dataset of distilled exemplar and reflection trajectories, followed by reinforcement learning to further refine the agentic policy. Extensive experiments demonstrate that our 7B model delivers diverse and complex reasoning patterns, yielding strong performance across a broad set of long video understanding and reasoning benchmarks. These emergent capabilities allow it to consistently surpass existing open-source models and even rival proprietary systems on challenging tasks, while achieving superior efficiency under reduced frame budgets.
中文摘要 多模态大型语言模型（MLLM）在视觉语言任务方面取得了显著进展，但由于上下文窗口有限，长时间视频理解仍受限。因此，主流方法往往依赖于统一帧抽样或静态预选，这可能会忽略关键证据，且在推理过程中无法纠正初始选择错误。为克服这些局限，我们提出了VideoZoomer，一种新型代理框架，使多层次多层次医学人员能够动态控制推理时的视觉焦点。从粗略的低帧率概览开始，VideoZoomer 调用时间缩放工具，在自主选择的时刻获取高帧率片段，从而以多回合交互方式逐步收集细粒度证据。因此，我们采用了两阶段训练策略：冷启动阶段，在精选的精选数据集上进行精细调优，包括精炼的范例和反思轨迹，随后进行强化学习以进一步完善能动策略。大量实验表明，我们的7B模型能够呈现多样且复杂的推理模式，在广泛的长视频理解和推理基准测试中表现出色。这些新兴能力使其能够持续超越现有开源模型，甚至在复杂任务中与专有系统竞争，同时在更低的帧预算下实现卓越的效率。

SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents

SmartSnap：主动寻找自我验证代理人的证据

Authors: Shaofei Cai, Yulei Qin, Haojia Lin, Zihan Xu, Gang Li, Yuchen Shi, Zongyi Li, Yong Mao, Siqi Cai, Xiaoyu Tan, Yitao Liang, Ke Li, Xing Sun
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2512.22322
Pdf link: https://arxiv.org/pdf/2512.22322
Abstract Agentic reinforcement learning (RL) holds great promise for the development of autonomous agents under complex GUI tasks, but its scalability remains severely hampered by the verification of task completion. Existing task verification is treated as a passive, post-hoc process: a verifier (e.g., a rule-based scoring script, a reward or critic model, or an LLM-as-a-Judge) analyzes the agent's entire interaction trajectory to determine if the agent succeeds. Such processing of verbose context that contains irrelevant, noisy history poses challenges to the verification protocols and therefore leads to prohibitive cost and low reliability. To overcome this bottleneck, we propose SmartSnap, a paradigm shift from this passive, post-hoc verification to proactive, in-situ self-verification by the agent itself. We introduce the Self-Verifying Agent, a new type of agent designed with dual missions: to not only complete a task but also to prove its accomplishment with curated snapshot evidence. Guided by our proposed 3C Principles (Completeness, Conciseness, and Creativity), the agent leverages its access to the online environment to perform self-verification on a minimal, decisive set of snapshots. This evidence is provided as the sole material for a general LLM-as-a-Judge verifier to determine its validity and relevance. Experiments on mobile tasks across model families and scales demonstrate that our SmartSnap paradigm allows training LLM-driven agents in a scalable manner, bringing performance gains up to 26.08% and 16.66% respectively to 8B and 30B models. The synergy between solution finding and evidence seeking facilitates the cultivation of efficient, self-verifying agents with competitive performance against DeepSeek V3.1 and Qwen3-235B-A22B.
中文摘要 智能强化学习（RL）在复杂图形界面任务下自主智能体开发方面具有巨大潜力，但其可扩展性仍受到任务完成验证的严重限制。现有任务验证被视为被动的、事后处理的过程：验证者（即基于规则的评分脚本、奖励或批评模型，以及作为评判的LLM）分析代理的整个交互轨迹，以判断代理是否成功。这种对包含无关且杂乱历史的冗长上下文处理，对验证协议构成挑战，因此成本过高且可靠性低。为克服这一瓶颈，我们提出了SmartSnap，这是一种从被动、事后验证转向代理自身主动、原位自我验证的范式转变。我们介绍了自我验证代理，这是一种新型代理，设计有双重使命：不仅完成任务，还要通过精心策划的快照证据证明其完成度。在我们提出的3C原则（完整性、简洁性和创造性）指导下，智能体利用其对在线环境的可访问性，对一组极少且决定性的快照进行自我验证。这些证据作为一般LLM作为法官验证者判断其有效性和相关性的唯一材料。跨模型家族和规模的移动任务实验表明，我们的SmartSnap范式允许以可扩展的方式训练LLM驱动的代理，8B和30B模型的性能提升分别达到26.08%和16.66%。解决方案寻找与证据寻求的协同促进了高效、自我验证的代理的培养，并在DeepSeek V3.1和Qwen3-235B-A22B中具有竞争力。

PHANTOM: Physics-Aware Adversarial Attacks against Federated Learning-Coordinated EV Charging Management System

幻影：物理感知对联邦学习协调电动汽车充电管理系统的对抗攻击

Authors: Mohammad Zakaria Haider, Amit Kumar Podder, Prabin Mali, Aranya Chakrabortty, Sumit Paudyal, Mohammad Ashiqur Rahman
Subjects: Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.22381
Pdf link: https://arxiv.org/pdf/2512.22381
Abstract The rapid deployment of electric vehicle charging stations (EVCS) within distribution networks necessitates intelligent and adaptive control to maintain the grid's resilience and reliability. In this work, we propose PHANTOM, a physics-aware adversarial network that is trained and optimized through a multi-agent reinforcement learning model. PHANTOM integrates a physics-informed neural network (PINN) enabled by federated learning (FL) that functions as a digital twin of EVCS-integrated systems, ensuring physically consistent modeling of operational dynamics and constraints. Building on this digital twin, we construct a multi-agent RL environment that utilizes deep Q-networks (DQN) and soft actor-critic (SAC) methods to derive adversarial false data injection (FDI) strategies capable of bypassing conventional detection mechanisms. To examine the broader grid-level consequences, a transmission and distribution (T and D) dual simulation platform is developed, allowing us to capture cascading interactions between EVCS disturbances at the distribution level and the operations of the bulk transmission system. Results demonstrate how learned attack policies disrupt load balancing and induce voltage instabilities that propagate across T and D boundaries. These findings highlight the critical need for physics-aware cybersecurity to ensure the resilience of large-scale vehicle-grid integration.
中文摘要 电动汽车充电站（EVC）在配电网络中的快速部署，需要智能且自适应的控制，以维持电网的韧性和可靠性。在本研究中，我们提出了PHANTOM，一种通过多智能体强化学习模型训练和优化的物理感知对抗网络。PHANTOM集成了一个由联邦学习（FL）支持的物理知情神经网络（PINN），作为EVCS集成系统的数字孪生，确保作动态和约束的物理一致建模。基于该数字孪生，我们构建了一个多智能体强化学习环境，利用深度Q网络（DQN）和软行为者-批判者（SAC）方法，推导出能够绕过传统检测机制的对抗性虚假数据注入（FDI）策略。为了考察更广泛的电网层面影响，开发了一个输配电（T和D）双重仿真平台，使我们能够捕捉配电层级EVCS扰动与大宗输电系统运行之间的级联交互。结果表明，学习到的攻击策略如何破坏负载均衡，并诱发跨T和D边界传播的电压不稳定性。这些发现凸显了物理感知网络安全的关键需求，以确保大规模车辆与电网整合的韧性。

AFA-LoRA: Enabling Non-Linear Adaptations in LoRA with Activation Function Annealing

AFA-LoRA：通过激活功能退火实现LoRA中的非线性适应

Authors: Jiacheng Li, Jianchao Tan, Zhidong Yang, Feiye Huo, Yerui Sun, Yuchen Xie, Xunliang Cai
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.22455
Pdf link: https://arxiv.org/pdf/2512.22455
Abstract Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient fine-tuning (PEFT) method. However, its linear adaptation process limits its expressive power. This means there is a gap between the expressive power of linear training and non-linear training. To bridge this gap, we propose AFA-LoRA, a novel training strategy that brings non-linear expressivity to LoRA while maintaining its seamless mergeability. Our key innovation is an annealed activation function that transitions from a non-linear to a linear transformation during training, allowing the adapter to initially adopt stronger representational capabilities before converging to a mergeable linear form. We implement our method on supervised fine-tuning, reinforcement learning, and speculative decoding. The results show that AFA-LoRA reduces the performance gap between LoRA and full-parameter training. This work enables a more powerful and practical paradigm of parameter-efficient adaptation.
中文摘要 低秩适应（LoRA）是一种广泛采用的参数高效微调（PEFT）方法。然而，其线性适应过程限制了其表现力。这意味着线性训练和非线性训练的表达能力之间存在差距。为弥合这一差距，我们提出了AFA-LoRA这一新颖的训练策略，该策略在保持无缝合并性的同时，为LoRA带来非线性表现力。我们的关键创新是退火激活函数，在训练过程中从非线性变换过渡到线性变换，使适配器在收敛为可合并线性形式之前能够先具备更强的表示能力。我们将方法应用于监督微调、强化学习和推测解码。结果显示，AFA-LoRA缩小了LoRA与全参数训练之间的性能差距。这项工作使参数高效适应的范式更加强大和实用。
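
Note: the abstract above describes an activation that is annealed from non-linear to linear so that the adapter can still be merged into the base weights after training. The sketch below illustrates that idea under stated assumptions: the blend `lam * tanh(z) + (1 - lam) * z`, the linear decay of `lam`, and all function names are hypothetical choices for illustration, since the abstract does not give the exact formulation.

```python
import math
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def nonlinearity_weight(step: int, total_steps: int) -> float:
    """Weight of the non-linear branch: 1.0 at the start, 0.0 at the end (assumed linear decay)."""
    return max(0.0, 1.0 - step / total_steps)


def annealed_activation(z: float, lam: float) -> float:
    """Blend a non-linearity with the identity; lam = 0 gives a purely linear map."""
    return lam * math.tanh(z) + (1.0 - lam) * z


def adapted_forward(w: Matrix, a: Matrix, b: Matrix, x: Vector, lam: float) -> Vector:
    """Base layer output plus low-rank update B * act(A x)."""
    hidden = [annealed_activation(z, lam) for z in matvec(a, x)]
    return [p + q for p, q in zip(matvec(w, x), matvec(b, hidden))]


def merge(w: Matrix, a: Matrix, b: Matrix) -> Matrix:
    """Fold the adapter into the base weights: W' = W + B A (valid only once lam = 0)."""
    ba = matmul(b, a)
    return [[wij + dij for wij, dij in zip(wr, dr)] for wr, dr in zip(w, ba)]
```

Once `lam` reaches 0, `adapted_forward` and a plain forward pass through `merge(w, a, b)` give the same output, which is the mergeability property the abstract refers to.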

RollArt: Scaling Agentic RL Training via Disaggregated Infrastructure

RollArt：通过拆分基础设施扩展智能强化学习训练

Authors: Wei Gao, Yuheng Zhao, Tianyuan Wu, Shaopan Xiong, Weixun Wang, Dakai An, Lunxi Cao, Dilxat Muhtar, Zichen Liu, Haizhou Zhao, Ju Huang, Siran Yang, Yongbin Li, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng, Wei Wang
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.22560
Pdf link: https://arxiv.org/pdf/2512.22560
Abstract Agentic Reinforcement Learning (RL) enables Large Language Models (LLMs) to perform autonomous decision-making and long-term planning. Unlike standard LLM post-training, agentic RL workloads are highly heterogeneous, combining compute-intensive prefill phases, bandwidth-bound decoding, and stateful, CPU-heavy environment simulations. We argue that efficient agentic RL training requires disaggregated infrastructure to leverage specialized, best-fit hardware. However, naive disaggregation introduces substantial synchronization overhead and resource underutilization due to the complex dependencies between stages. We present RollArc, a distributed system designed to maximize throughput for multi-task agentic RL on disaggregated infrastructure. RollArc is built on three core principles: (1) hardware-affinity workload mapping, which routes compute-bound and bandwidth-bound tasks to best-fit GPU devices, (2) fine-grained asynchrony, which manages execution at the trajectory level to mitigate resource bubbles, and (3) statefulness-aware computation, which offloads stateless components (e.g., reward models) to serverless infrastructure for elastic scaling. Our results demonstrate that RollArc effectively improves training throughput and achieves a 1.35-2.05× end-to-end training time reduction compared to monolithic and synchronous baselines. We also evaluate RollArc by training a hundreds-of-billions-parameter MoE model for the Qoder product on an Alibaba cluster with more than 3,000 GPUs, further demonstrating RollArc's scalability and robustness. The code is available at this https URL.
中文摘要 代理强化学习（RL）使大型语言模型（LLMs）能够自主决策和长期规划。与标准的LLM后训练不同，代理型强化学习工作负载高度异构，结合了计算密集型预填充阶段、带宽受限的解码以及有状态且CPU负载较大的环境仿真。我们认为，高效的智能化强化学习训练需要分散的基础设施，以利用专门且最合适的硬件。然而，朴素的拆分会带来大量同步开销和资源利用不足，因为各阶段之间存在复杂的依赖关系。我们介绍RollArc，一个分布式系统，旨在最大化在分解基础设施上实现多任务代理强化学习的吞吐量。RollArc 基于三大核心原则构建：（1）硬件亲和性工作负载映射，将计算和带宽受限任务路由到最适合 GPU 设备的位置;（2）细粒度异步，管理轨迹级执行以缓解资源泡沫;（3）状态感知计算，将无状态组件（如奖励模型）卸载到无服务器基础设施以实现弹性扩展。我们的结果表明，RollArc 有效提升了训练吞吐量，并实现了端到端训练时间的缩短，相较于单一和同步基线。我们还通过在拥有3000多GPU的阿里巴巴集群上训练数千亿参数的Qoder产品MoE模型，进一步展示了RollArc的可扩展性和稳健性。代码可在该 https URL 访问。

FinPercep-RM: A Fine-grained Reward Model and Co-evolutionary Curriculum for RL-based Real-world Super-Resolution

FinPercep-RM：基于强化学习的细粒度奖励模型与共进化课程，用于基于强化学习的现实世界超分辨率

Authors: Yidi Liu, Zihao Fan, Jie Huang, Jie Xiao, Dong Li, Wenlong Zhang, Lei Bai, Xueyang Fu, Zheng-Jun Zha
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.22647
Pdf link: https://arxiv.org/pdf/2512.22647
Abstract Reinforcement Learning with Human Feedback (RLHF) has proven effective in the image generation field, where reward models guide generation toward human preferences. Motivated by this, adapting RLHF for Image Super-Resolution (ISR) tasks has shown promise in optimizing perceptual quality with Image Quality Assessment (IQA) models as reward models. However, traditional IQA models usually output a single global score, which is exceptionally insensitive to local and fine-grained distortions. This insensitivity allows ISR models to produce perceptually undesirable artifacts that yield spuriously high scores, misaligning optimization objectives with perceptual quality and resulting in reward hacking. To address this, we propose a Fine-grained Perceptual Reward Model (FinPercep-RM) based on an Encoder-Decoder architecture. While providing a global quality score, it also generates a Perceptual Degradation Map that spatially localizes and quantifies local defects. We specifically introduce the FGR-30k dataset to train this model, consisting of diverse and subtle distortions from real-world super-resolution models. Despite the success of the FinPercep-RM model, its complexity introduces significant challenges in generator policy learning, leading to training instability. To address this, we propose a Co-evolutionary Curriculum Learning (CCL) mechanism, where both the reward model and the ISR model undergo synchronized curricula. The reward model progressively increases in complexity, while the ISR model starts with a simpler global reward for rapid convergence, gradually transitioning to the more complex model outputs. This easy-to-hard strategy enables stable training while suppressing reward hacking. Experiments validate the effectiveness of our method across ISR models in both global quality and local realism on RLHF methods.
中文摘要 基于人类反馈的强化学习（RLHF）在图像生成领域已被证明有效，该方法通过奖励模型引导以协调人类偏好。基于此，将RLHF应用于图像超分辨率（ISR）任务，已显示出利用图像质量评估（IQA）模型作为奖励模型优化感知质量的前景。然而，传统的IQA模型通常输出单一的全局分数，对局部和细粒度失真极为不敏感。这种不敏感性使得ISR模型产生感知上不理想的伪影，导致虚假的高分，使优化目标与感知质量错位，最终导致奖励黑客。为此，我们提出了基于编码器-解码器架构的细粒度感知奖励模型（FinPercep-RM）。在提供全局质量评分的同时，它还生成一个感知退化图，空间定位并量化局部缺陷。我们特别引入FGR-30k数据集来训练该模型，该数据集由来自现实世界超分辨率模型的多样且微妙的畸变组成。尽管FinPercep-RM模型取得了成功，但其复杂性在生成器策略学习中带来了重大挑战，导致训练不稳定。为此，我们提出了一种共进化课程学习（CCL）机制，即奖励模型和ISR模型都经历同步课程。奖励模型的复杂度逐渐增加，而ISR模型则以更简单的快速收敛全局奖励开始，逐步过渡到更复杂的模型输出。这种从简单到困难的策略能够实现稳定的训练，同时抑制奖励黑客行为。实验验证了我们方法在ISR模型中在全球质量和局部真实性上对RLHF方法的有效性。

Optimal Regulation of Nonlinear Input-Affine Systems via an Integral Reinforcement Learning-Based State-Dependent Riccati Equation Approach

通过基于积分强化学习的状态依赖 Riccati 方程方法对非线性输入仿射系统的最优调控

Authors: Arya Rashidinejad Meibodi, Mahbod Gholamali Sinaki, Khalil Alipour
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2512.22668
Pdf link: https://arxiv.org/pdf/2512.22668
Abstract The State-Dependent Riccati Equation (SDRE) technique generalizes the classical algebraic Riccati formulation to nonlinear systems by designing an input to the system that optimally (suboptimally) regulates system states toward the origin while simultaneously optimizing a quadratic performance index. In the SDRE technique, we solve the State-Dependent Riccati Equation to determine the control for regulating a nonlinear input-affine system. Since an analytic solution to SDRE is not straightforward, one method is to linearize the system at every state, solve the corresponding Algebraic Riccati Equation (ARE), and apply optimal control until the next state of the system. Completing this task with high frequency gives a result like the original SDRE technique. Both approaches require a complete model; therefore, here we propose a method that solves ARE in every state of the system using a partially model-free approach that learns optimal control in every state of the system, without explicit knowledge of the drift dynamics, based on Integral Reinforcement Learning (IRL). To show the effectiveness of our proposed approach, we apply it to the second-order nonlinear system in simulation and compare its performance with the classical SDRE method, which relies on the system's model and solves the ARE at each state. Our simulation results demonstrate that, with sufficient iterations, the IRL-based approach achieves approximately the same performance as the conventional SDRE method, demonstrating its capability as a reliable alternative for nonlinear system control that does not require an explicit environmental model. Index Terms: Algebraic Riccati Equation (ARE), Integral Reinforcement Learning (IRL), Nonlinear Input-Affine Systems, Optimal Regulation, State-Dependent Riccati Equation (SDRE)
中文摘要 状态依赖里卡蒂方程（SDRE）技术通过设计一个输入，使系统状态最优（次最优）地调节到原点，同时优化二次性能指数，将经典代数里卡蒂表述推广到非线性系统。在SDRE技术中，我们求解状态依赖性里卡蒂方程，以确定调控非线性输入仿射系统的控制。由于对SDRE的解析解并不简单，一种方法是在每个状态线性化系统，求解相应的代数里卡蒂方程（ARE），并对系统进行最优控制直到系统进入下一个状态。以高频完成此任务，效果类似于原始SDRE技术。这两种方法都需要完整的模型;因此，我们提出一种方法，利用部分无模型的方法在系统每个状态下求解ARE，该方法基于整合强化学习（IRL），在不显式了解漂移动力学的情况下，在系统每个状态下学习最优控制。为了展示我们提出方法的有效性，我们将它应用于仿真中的二阶非线性系统，并将其性能与依赖系统模型并在每个状态求解的经典SDRE方法进行比较。我们的模拟结果表明，经过足够迭代，基于IRL的方法可实现与传统SDRE方法大致相同的性能，证明其作为非线性系统控制的可靠替代方案，无需显式环境模型。指标项-代数里卡提方程（ARE）、积分强化学习（IRL）、非线性输入仿射系统、最优调控、状态依赖里卡提方程（SDRE）
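
Illustrative sketch (not from the paper): the abstract describes solving an ARE at each state without knowing the drift dynamics. A minimal scalar version, assuming the state-dependent coefficient is frozen at a constant `a` and that the plant is x' = a*x + b*u with cost q*x^2 + r*u^2, might look as follows. The function names, the interval length `T`, and the numbers are all hypothetical. The learner sees only the measured trajectory cost and the input gain `b`; the drift `a` appears only inside the simulated plant.

```python
import math

def rollout(x0, k, a, b, q, r, T, n=1000):
    """Simulate x' = a*x + b*u with u = -k*x over [0, T].

    Returns x(T) and the integral of q*x^2 + r*u^2 (trapezoidal rule).
    Only this simulated plant uses the drift coefficient `a`.
    """
    c = a - b * k                      # closed-loop rate
    h = T / n
    xs = [x0 * math.exp(c * i * h) for i in range(n + 1)]
    costs = [(q + r * k * k) * x * x for x in xs]
    integral = h * (sum(costs) - 0.5 * (costs[0] + costs[-1]))
    return xs[-1], integral

def irl_policy_iteration(a, b, q, r, k0=0.0, x0=1.0, T=0.5, iters=10):
    """Alternate integral policy evaluation and policy improvement."""
    k, p = k0, None
    for _ in range(iters):
        xT, cost = rollout(x0, k, a, b, q, r, T)
        # Integral Bellman equation: p*x0^2 = cost + p*xT^2
        p = cost / (x0 * x0 - xT * xT)
        # Improvement step uses the input gain b but not the drift a
        k = b * p / r
    return p, k

a, b, q, r = -1.0, 1.0, 1.0, 1.0
p_irl, k_irl = irl_policy_iteration(a, b, q, r)
# Positive root of the scalar ARE 2*a*p - (b*p)**2 / r + q = 0
p_are = r * (a + math.sqrt(a * a + b * b * q / r)) / (b * b)
print(p_irl, p_are)
```

The initial gain `k0` must be stabilizing for the iteration to make sense; here `k0 = 0` works only because the assumed open-loop plant is already stable.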

Memento-II: Learning by Stateful Reflective Memory

记忆书二：通过有状态反思记忆学习

Authors: Jun Wang
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.22716
Pdf link: https://arxiv.org/pdf/2512.22716
Abstract We propose a theoretical framework for continual and experiential learning in large language model agents that integrates episodic memory with reinforcement learning. The framework identifies reflection as the key mechanism that enables agents to adapt through interaction without back propagation or model fine tuning, thereby relaxing the conventional separation between training and this http URL formalise this process, we introduce the Stateful Reflective Decision Process, which models reflective learning as a two stage read write interaction with episodic memory. Writing stores interaction outcomes and corresponds to policy evaluation, while reading retrieves relevant past cases and corresponds to policy improvement. We show that this process induces an equivalent Markov decision process over augmented state memory representations, allowing the use of classical tools from dynamic programming and reinforcement learning. We further instantiate the framework using entropy regularised policy iteration and establish convergence guarantees. As episodic memory grows and achieves sufficient coverage of the state space, the resulting policy converges to the optimal solution. This work provides a principled foundation for memory augmented and retrieval based language model agents capable of continual adaptation without parameter updates.
中文摘要 我们提出了一个理论框架，用于大型语言模型代理的持续和体验式学习，将情节记忆与强化学习整合起来。该框架将反射视为使智能体能够通过交互适应而无需反向传播或模型微调的关键机制，从而放松了训练与该 http URL 之间的传统区分。我们将该过程形式化，介绍有状态反思决策过程，将反思学习建模为带有情节记忆的两阶段读写交互。书面存储互动结果并对应政策评估，而阅读则检索相关过去案例并对应政策改进。我们证明了这一过程在增强状态记忆表示上诱导了等效的马尔可夫决策过程，从而使得动态规划和强化学习中的经典工具得以使用。我们进一步通过熵正则化策略迭代实现框架，并建立收敛保证。随着情景记忆的增长并实现对状态空间的足够覆盖，最终的策略收敛到最优解。这项工作为内存增强和基于检索的语言模型代理提供了原则性基础，能够在不需参数更新的情况下持续适应。
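
Illustrative sketch (not from the paper): the abstract maps writing to policy evaluation and reading to policy improvement. One way to picture this, assuming a tabular memory keyed by (state, action) and a softmax read with a temperature standing in for entropy regularisation, is shown below. The class, its methods, and the example states are hypothetical.

```python
import math
from collections import defaultdict

class EpisodicMemory:
    """Toy read/write memory; no model parameters are updated."""

    def __init__(self):
        self.total = defaultdict(float)
        self.count = defaultdict(int)

    def write(self, state, action, reward):
        # Policy evaluation: accumulate observed outcomes per case
        self.total[(state, action)] += reward
        self.count[(state, action)] += 1

    def read(self, state, actions, temperature=1.0):
        # Policy improvement: softmax over the stored mean returns
        values = []
        for action in actions:
            n = self.count[(state, action)]
            values.append(self.total[(state, action)] / n if n else 0.0)
        top = max(values)
        weights = [math.exp((v - top) / temperature) for v in values]
        z = sum(weights)
        return {a: w / z for a, w in zip(actions, weights)}

memory = EpisodicMemory()
memory.write("question", "search", 1.0)
memory.write("question", "search", 1.0)
memory.write("question", "guess", 0.0)
policy = memory.read("question", ["search", "guess"], temperature=0.5)
print(policy)
```

A lower temperature concentrates the policy on the best-remembered action; a higher one keeps it closer to uniform.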

Cyber Resilience in Next-Generation Networks: Threat Landscape, Theoretical Foundations, and Design Paradigms

下一代网络中的网络韧性：威胁格局、理论基础与设计范式

Authors: Junaid Farooq, Quanyan Zhu
Subjects: Networking and Internet Architecture (cs.NI); Computer Science and Game Theory (cs.GT)
Arxiv link: https://arxiv.org/abs/2512.22721
Pdf link: https://arxiv.org/pdf/2512.22721
Abstract The evolution of networked systems, driven by innovations in software-defined networking (SDN), network function virtualization (NFV), open radio access networks (O-RAN), and cloud-native architectures, is redefining both the operational landscape and the threat surface of critical infrastructures. This book offers an in-depth, interdisciplinary examination of how resilience must be re-conceptualized and re-engineered to address the multifaceted challenges posed by these transformations. Structured across six chapters, this book begins by surveying the contemporary risk landscape, identifying emerging cyber, physical, and AI-driven threats, and analyzing their implications for scalable, heterogeneous network environments. It then establishes rigorous definitions and evaluation frameworks for resilience, going beyond robustness and fault-tolerance to address adaptive, anticipatory, and retrospective mechanisms across diverse application domains. The core of the book delves into advanced paradigms and practical strategies for resilience, including zero trust architectures, game-theoretic threat modeling, and self-healing design principles. A significant portion is devoted to the role of artificial intelligence, especially reinforcement learning and large language models (LLMs), in enabling dynamic threat response, autonomous network control, and multi-agent coordination under uncertainty.
中文摘要 网络系统的演进，由软件定义网络（SDN）、网络功能虚拟化（NFV）、开放无线接入网络（O-RAN）和云原生架构的创新推动，正在重新定义关键基础设施的运营格局和威胁面。本书深入且跨学科地探讨了韧性必须如何重新构思和重新设计，以应对这些变革带来的多方面挑战。本书分为六章，首先综述当代风险格局，识别新兴的网络、物理及人工智能驱动威胁，并分析其对可扩展、异构网络环境的影响。随后，它建立了严谨的定义和评估框架，超越了鲁棒性和容错性，涵盖了适应性、预期性和回顾性机制，涵盖了多样应用领域的机制。本书核心探讨了高级范式和韧性的实用策略，包括零信任架构、博弈论威胁建模和自我修复设计原则。相当一部分内容集中在人工智能，尤其是强化学习和大型语言模型（LLM）方面，在实现动态威胁响应、自主网络控制和多智能体协调中的作用。

FoldAct: Efficient and Stable Context Folding for Long-Horizon Search Agents

FoldAct：长视野搜索代理的高效稳定上下文折叠

Authors: Jiaqi Shao, Yufeng Miao, Wei Zhang, Bing Luo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.22733
Pdf link: https://arxiv.org/pdf/2512.22733
Abstract Long-horizon reinforcement learning (RL) for large language models faces critical scalability challenges from unbounded context growth, leading to context folding methods that compress interaction history during task execution. However, existing approaches treat summary actions as standard actions, overlooking that summaries fundamentally modify the agent's future observation space, creating a policy-dependent, non-stationary observation distribution that violates core RL assumptions. This introduces three fundamental challenges: (1) gradient dilution where summary tokens receive insufficient training signal, (2) self-conditioning where policy updates change summary distributions, creating a vicious cycle of training collapse, and (3) computational cost from processing unique contexts at each turn. We introduce FoldAct, a framework that explicitly addresses these challenges through three key innovations: separated loss computation for independent gradient signals on summary and action tokens, full context consistency loss to reduce distribution shift, and selective segment training to reduce computational cost. Our method enables stable training of long-horizon search agents with context folding, addressing the non-stationary observation problem while improving training efficiency with a 5.19× speedup.
中文摘要 大型语言模型的长视野强化学习（RL）面临着无界上下文增长带来的关键可扩展性挑战，由此催生了在任务执行过程中压缩交互历史的上下文折叠方法。然而，现有方法将摘要动作视为标准动作，忽视了摘要从根本上改变智能体未来的观察空间，从而产生一种依赖策略的非平稳观察分布，违反了强化学习的核心假设。这带来了三个根本性挑战：（1）梯度稀释，即摘要词元接收到的训练信号不足；（2）自条件化，即策略更新改变摘要分布，形成训练崩溃的恶性循环；（3）处理每个回合独特上下文的计算成本。我们引入了FoldAct，这是一个通过三项关键创新明确解决这些挑战的框架：针对摘要和动作词元独立梯度信号的分离损失计算、减少分布偏移的全上下文一致性损失，以及降低计算成本的选择性片段训练。我们的方法通过上下文折叠实现了长视野搜索智能体的稳定训练，解决了非平稳观察问题，同时提升了训练效率，实现了5.19倍的加速。
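
Illustrative sketch (not from the paper): gradient dilution can be seen by comparing a loss averaged over all tokens with one averaged separately over summary and action tokens. The per-token losses, the mask, and the `summary_weight` parameter below are hypothetical, and the paper's actual loss may differ.

```python
def mean(values):
    return sum(values) / len(values) if values else 0.0

def pooled_loss(losses):
    # Each of N tokens gets weight 1/N, so a few summary tokens barely count
    return mean(losses)

def separated_loss(losses, is_summary, summary_weight=1.0):
    # Average summary and action tokens independently, then combine
    summary = [l for l, s in zip(losses, is_summary) if s]
    action = [l for l, s in zip(losses, is_summary) if not s]
    return mean(action) + summary_weight * mean(summary)

# 98 action tokens with loss 1.0 and 2 summary tokens with loss 3.0
losses = [1.0] * 98 + [3.0] * 2
is_summary = [False] * 98 + [True] * 2
pooled = pooled_loss(losses)
separated = separated_loss(losses, is_summary)
print(pooled, separated)
```

In the pooled version each summary token carries weight 1/100; in the separated version it carries weight 1/2 within its own term, so its signal is no longer swamped by the action tokens.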

Parallel Diffusion Solver via Residual Dirichlet Policy Optimization

通过残差狄利克雷策略优化实现并行扩散求解器

Authors: Ruoyu Wang, Ziyu Li, Beier Zhu, Liangyu Yuan, Hanwang Zhang, Xun Yang, Xiaojun Chang, Chi Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.22796
Pdf link: https://arxiv.org/pdf/2512.22796
Abstract Diffusion models (DMs) have achieved state-of-the-art generative performance but suffer from high sampling latency due to their sequential denoising nature. Existing solver-based acceleration methods often face significant image quality degradation under a low-latency budget, primarily due to accumulated truncation errors arising from the inability to capture high-curvature trajectory segments. In this paper, we propose the Ensemble Parallel Direction solver (dubbed EPD-Solver), a novel ODE solver that mitigates these errors by incorporating multiple parallel gradient evaluations in each step. Motivated by the geometric insight that sampling trajectories are largely confined to a low-dimensional manifold, EPD-Solver leverages the Mean Value Theorem for vector-valued functions to approximate the integral solution more accurately. Importantly, since the additional gradient computations are independent, they can be fully parallelized, preserving the low-latency nature of sampling. We introduce a two-stage optimization framework. Initially, EPD-Solver optimizes a small set of learnable parameters via a distillation-based approach. We further propose a parameter-efficient Reinforcement Learning (RL) fine-tuning scheme that reformulates the solver as a stochastic Dirichlet policy. Unlike traditional methods that fine-tune the massive backbone, our RL approach operates strictly within the low-dimensional solver space, effectively mitigating reward hacking while enhancing performance in complex text-to-image (T2I) generation tasks. In addition, our method is flexible and can serve as a plugin (EPD-Plugin) to improve existing ODE samplers.
中文摘要 扩散模型（DM）已实现最先进的生成性能，但由于其顺序去噪特性，采样延迟较高。现有基于求解器的加速方法在低延迟预算下常常面临显著的图像质量下降，主要原因是无法捕捉高曲率轨迹段所导致的累积截断误差。本文提出了集合平行方向求解器（称为EPD-Solver），这是一种新颖的常微分方程求解器，通过在每一步中包含多个并行梯度评估来减轻这些误差。基于采样轨迹主要局限于低维流形的几何洞见，EPD-Solver 利用向量值函数的均值定理，更准确地近似积分解。重要的是，由于额外的梯度计算是独立的，它们可以实现完全并行化，保持低延迟采样特性。我们引入一个两阶段优化框架。最初，EPD-Solver 通过基于蒸馏的方法优化一小部分可学习参数。我们还提出了一种参数高效的强化学习（RL）微调方案，将求解器重新表述为随机狄利克雷策略。与传统的微调庞大骨干结构方法不同，我们的强化学习方法严格运行在低维求解器空间内，有效减少了奖励黑客行为，同时提升了复杂文本到图像（T2I）生成任务的性能。此外，我们的方法具有灵活性，可以作为插件（EPD-Plugin）来改进现有的常微分方程采样器。
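
Illustrative sketch (not from the paper): the idea of combining several independent gradient evaluations inside one step can be shown on a toy ODE. The intermediate offsets and simplex weights are fixed by hand here, whereas the abstract describes learning such parameters. The function names and the test equation x' = -x are assumptions made for illustration only.

```python
import math

def euler_step(f, x, t, h):
    return x + h * f(x, t)

def parallel_direction_step(f, x, t, h, offsets, weights):
    """One step using a weighted mix of gradients at intermediate points.

    The evaluations in the loop are independent of each other, so they
    could run in parallel. `weights` is assumed to sum to 1.
    """
    base = f(x, t)
    grads = []
    for c in offsets:
        x_mid = x + c * h * base          # Euler guess at time t + c*h
        grads.append(f(x_mid, t + c * h))
    direction = sum(w * g for w, g in zip(weights, grads))
    return x + h * direction

def integrate(step, x0, t0, t1, n):
    h = (t1 - t0) / n
    x, t = x0, t0
    for _ in range(n):
        x = step(x, t, h)
        t += h
    return x

f = lambda x, t: -x                        # toy ODE with solution exp(-t)
offsets = [0.5 - math.sqrt(3) / 6, 0.5 + math.sqrt(3) / 6]
weights = [0.5, 0.5]
x_euler = integrate(lambda x, t, h: euler_step(f, x, t, h), 1.0, 0.0, 2.0, 4)
x_epd = integrate(
    lambda x, t, h: parallel_direction_step(f, x, t, h, offsets, weights),
    1.0, 0.0, 2.0, 4,
)
exact = math.exp(-2.0)
print(abs(x_euler - exact), abs(x_epd - exact))
```

With the same four steps, the mixed-direction update lands noticeably closer to the exact value than plain Euler on this toy problem.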

ReDiF: Reinforced Distillation for Few Step Diffusion

ReDiF：少数步骤扩散的强化蒸馏

Authors: Amirhossein Tighkhorshid, Zahra Dehghanian, Gholamali Aminian, Chengchun Shi, Hamid R. Rabiee
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.22802
Pdf link: https://arxiv.org/pdf/2512.22802
Abstract Distillation addresses the slow sampling problem in diffusion models by creating models with smaller size or fewer steps that approximate the behavior of high-step teachers. In this work, we propose a reinforcement learning based distillation framework for diffusion models. Instead of relying on fixed reconstruction or consistency losses, we treat the distillation process as a policy optimization problem, where the student is trained using a reward signal derived from alignment with the teacher's outputs. This RL driven approach dynamically guides the student to explore multiple denoising paths, allowing it to take longer, optimized steps toward high-probability regions of the data distribution, rather than relying on incremental refinements. Our framework utilizes the inherent ability of diffusion models to handle larger steps and effectively manage the generative process. Experimental results show that our method achieves superior performance with significantly fewer inference steps and computational resources compared to existing distillation techniques. Additionally, the framework is model agnostic, applicable to any type of diffusion models with suitable reward functions, providing a general optimization paradigm for efficient diffusion learning.
中文摘要 蒸馏通过创建规模更小或步数更少的模型来解决扩散模型中的慢抽样问题，这些模型近似高步教师的行为。本研究提出了基于强化学习的扩散模型蒸馏框架。我们不依赖固定的重建或一致性损失，而是将蒸馏过程视为策略优化问题，学生通过与教师输出对齐得出的奖励信号进行训练。这种强化学习驱动的方法动态引导学生探索多条去噪路径，使其能够在数据分布的高概率区域前进行更长且优化的步骤，而非依赖渐进式细化。我们的框架利用扩散模型的固有能力处理较大步骤，并有效管理生成过程。实验结果表明，我们的方法相比现有蒸馏技术，以显著更少的推断步骤和计算资源实现了更优的性能。此外，该框架具有模型无关性，适用于任何带有适当奖励函数的扩散模型，为高效扩散学习提供了通用优化范式。

TEACH: Temporal Variance-Driven Curriculum for Reinforcement Learning

TEACH：基于时间方差的强化学习课程

Authors: Gaurav Chaudhary, Laxmidhar Behera
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.22824
Pdf link: https://arxiv.org/pdf/2512.22824
Abstract Reinforcement Learning (RL) has achieved significant success in solving single-goal tasks. However, uniform goal selection often results in sample inefficiency in multi-goal settings where agents must learn a universal goal-conditioned policy. Inspired by the adaptive and structured learning processes observed in biological systems, we propose a novel Student-Teacher learning paradigm with a Temporal Variance-Driven Curriculum to accelerate Goal-Conditioned RL. In this framework, the teacher module dynamically prioritizes goals with the highest temporal variance in the policy's confidence score, parameterized by the state-action value (Q) function. The teacher provides an adaptive and focused learning signal by targeting these high-uncertainty goals, fostering continual and efficient progress. We establish a theoretical connection between the temporal variance of Q-values and the evolution of the policy, providing insights into the method's underlying principles. Our approach is algorithm-agnostic and integrates seamlessly with existing RL frameworks. We demonstrate this through evaluation across 11 diverse robotic manipulation and maze navigation tasks. The results show consistent and notable improvements over state-of-the-art curriculum learning and goal-selection methods.
中文摘要 强化学习（RL）在解决单一目标任务方面取得了显著成功。然而，均匀的目标选择常导致多目标环境中样本效率低下，因为智能体必须学习通用的目标条件策略。受生物系统中观察到的适应性和结构化学习过程启发，我们提出了一种新的师生学习范式，采用时间方差驱动课程，以加速目标条件强化学习。在该框架中，教师模块动态地优先选择策略置信度评分时间方差最大的目标，该评分由状态-动作值（Q）函数参数化。教师通过针对这些高不确定性的目标，提供适应性和聚焦的学习信号，促进持续且高效的进步。我们建立了Q值的时间方差与策略演变之间的理论联系，提供了对方法基本原理的洞见。我们的方法不依赖具体算法，并能与现有的强化学习框架无缝集成。我们通过在11种不同的机器人操作和迷宫导航任务上的评估来展示这一点。结果显示，相较于最先进的课程学习和目标选择方法，取得了持续且显著的改进。
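
Illustrative sketch (not from the paper): a teacher that tracks recent Q-value estimates per goal and picks the goal whose estimates fluctuate most over time. The window length, the greedy argmax choice, and the goal names are hypothetical; the paper may prioritise goals differently.

```python
import statistics
from collections import defaultdict, deque

class VarianceTeacher:
    def __init__(self, window=5):
        # Keep only the most recent `window` Q estimates per goal
        self.history = defaultdict(lambda: deque(maxlen=window))

    def record(self, goal, q_value):
        self.history[goal].append(q_value)

    def temporal_variance(self, goal):
        values = self.history[goal]
        return statistics.pvariance(values) if len(values) > 1 else 0.0

    def pick_goal(self, goals):
        # Greedy choice of the goal with the least settled Q estimate
        return max(goals, key=self.temporal_variance)

teacher = VarianceTeacher(window=3)
for q in (0.9, 0.9, 0.9):      # already mastered: estimate is flat
    teacher.record("near", q)
for q in (0.2, 0.5, 0.8):      # being learned: estimate keeps moving
    teacher.record("mid", q)
for q in (0.0, 0.0, 0.0):      # currently out of reach: estimate is flat
    teacher.record("far", q)
chosen = teacher.pick_goal(["near", "mid", "far"])
print(chosen)
```

Both mastered and unreachable goals show flat Q estimates, so the variance signal steers sampling toward goals at the edge of the agent's ability.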

MARPO: A Reflective Policy Optimization for Multi Agent Reinforcement Learning

MARPO：多智能体强化学习的反思策略优化

Authors: Cuiling Wu, Yaozhong Gan, Junliang Xing, Ying Fu
Subjects: Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2512.22832
Pdf link: https://arxiv.org/pdf/2512.22832
Abstract We propose Multi Agent Reflective Policy Optimization (MARPO) to alleviate the issue of sample inefficiency in multi agent reinforcement learning. MARPO consists of two key components: a reflection mechanism that leverages subsequent trajectories to enhance sample efficiency, and an asymmetric clipping mechanism that is derived from the KL divergence and dynamically adjusts the clipping range to improve training stability. We evaluate MARPO in classic multi agent environments, where it consistently outperforms other methods.
中文摘要 我们提出多智能体反思策略优化（MARPO）以缓解多智能体强化学习中样本效率低下的问题。MARPO由两个关键组成部分组成：一个利用后续轨迹提升样本效率的反射机制，以及一个基于KL散度并动态调整截波范围以提升训练稳定性的非对称裁剪机制。我们在经典多智能体环境中评估MARPO，其表现持续优于其他方法。
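
Illustrative sketch (not from the paper): an asymmetric variant of the PPO clipped surrogate, with separate lower and upper bounds on the probability ratio. The abstract says the bounds are derived from the KL divergence and adjusted dynamically; the fixed values `eps_low` and `eps_high` below are placeholders, not the paper's rule.

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.3):
    """Per-sample clipped objective with an asymmetric clipping range."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Pessimistic bound, as in the standard PPO objective
    return min(ratio * advantage, clipped * advantage)

upper_hit = clipped_surrogate(1.5, 1.0)    # ratio above 1 + eps_high
lower_hit = clipped_surrogate(0.5, -1.0)   # ratio below 1 - eps_low
inside = clipped_surrogate(1.1, 2.0)       # ratio inside the range
print(upper_hit, lower_hit, inside)
```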

AutoForge: Automated Environment Synthesis for Agentic Reinforcement Learning

AutoForge：用于智能强化学习的自动化环境综合

Authors: Shihao Cai, Runnan Fang, Jialong Wu, Baixuan Li, Xinyu Wang, Yong Jiang, Liangcai Su, Liwen Zhang, Wenbiao Yin, Zhen Zhang, Fuli Feng, Pengjun Xie, Xiaobin Wang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.22857
Pdf link: https://arxiv.org/pdf/2512.22857
Abstract Conducting reinforcement learning (RL) in simulated environments offers a cost-effective and highly scalable way to enhance language-based agents. However, previous work has been limited to semi-automated environment synthesis or tasks lacking sufficient difficulty, offering little breadth or depth. In addition, the instability of simulated users integrated into these environments, along with the heterogeneity across simulated environments, poses further challenges for agentic RL. In this work, we propose: (1) a unified pipeline for automated and scalable synthesis of simulated environments associated with high-difficulty but easily verifiable tasks; and (2) an environment level RL algorithm that not only effectively mitigates user instability but also performs advantage estimation at the environment level, thereby improving training efficiency and stability. Comprehensive evaluations on agentic benchmarks, including tau-bench, tau2-Bench, and VitaBench, validate the effectiveness of our proposed method. Further in-depth analyses underscore its out-of-domain generalization.
中文摘要 在模拟环境中进行强化学习（RL）是一种成本效益高且高度可扩展的方式来增强基于语言的代理。然而，以往的工作多限于半自动化环境综合或缺乏足够难度的任务，缺乏广度和深度。此外，模拟用户集成于这些环境中的不稳定性，以及模拟环境中的异质性，给智能强化学习带来了进一步的挑战。在本研究中，我们提出了：（1）一个统一的流水线，用于自动化且可扩展地综合与高难度但易于验证任务相关的模拟环境;以及（2）一种环境级强化学习算法，不仅有效缓解用户不稳定性，还能在环境层面进行优势估计，从而提升训练效率和稳定性。对能动基准测试（包括tau-bench、tau2-bench和VitaBench）的全面评估验证了我们提出方法的有效性。进一步深入的分析强调了其域外推广性。
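
Illustrative sketch (not from the paper): one plausible reading of advantage estimation at the environment level is to normalise each rollout's reward against the other rollouts from the same environment, so that environments with different reward scales do not dominate each other. The grouping rule, the environment names, and the rewards are assumptions.

```python
import statistics
from collections import defaultdict

def environment_level_advantages(rollouts, eps=1e-8):
    """rollouts: list of (env_id, reward). Returns one advantage per rollout."""
    by_env = defaultdict(list)
    for env_id, reward in rollouts:
        by_env[env_id].append(reward)
    stats = {
        env_id: (statistics.fmean(rewards), statistics.pstdev(rewards))
        for env_id, rewards in by_env.items()
    }
    return [
        (reward - stats[env_id][0]) / (stats[env_id][1] + eps)
        for env_id, reward in rollouts
    ]

rollouts = [("airline", 1.0), ("airline", 0.0), ("retail", 0.2), ("retail", 0.4)]
advantages = environment_level_advantages(rollouts)
print(advantages)
```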

Adaptive Trust Consensus for Blockchain IoT: Comparing RL, DRL, and MARL Against Naive, Collusive, Adaptive, Byzantine, and Sleeper Attacks

区块链物联网自适应信任共识：比较强化学习、DRL和MARL对抗天真、串通、自适应、拜占庭和潜伏攻击

Authors: Soham Padia, Dhananjay Vaidya, Ramchandra Mangrulkar
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2512.22860
Pdf link: https://arxiv.org/pdf/2512.22860
Abstract Securing blockchain-enabled IoT networks against sophisticated adversarial attacks remains a critical challenge. This paper presents a trust-based delegated consensus framework integrating Fully Homomorphic Encryption (FHE) with Attribute-Based Access Control (ABAC) for privacy-preserving policy evaluation, combined with learning-based defense mechanisms. We systematically compare three reinforcement learning approaches -- tabular Q-learning (RL), Deep RL with Dueling Double DQN (DRL), and Multi-Agent RL (MARL) -- against five distinct attack families: Naive Malicious Attack (NMA), Collusive Rumor Attack (CRA), Adaptive Adversarial Attack (AAA), Byzantine Fault Injection (BFI), and Time-Delayed Poisoning (TDP). Experimental results on a 16-node simulated IoT network reveal significant performance variations: MARL achieves superior detection under collusive attacks (F1=0.85 vs. DRL's 0.68 and RL's 0.50), while DRL and MARL both attain perfect detection (F1=1.00) against adaptive attacks where RL fails (F1=0.50). All agents successfully defend against Byzantine attacks (F1=1.00). Most critically, the Time-Delayed Poisoning attack proves catastrophic for all agents, with F1 scores dropping to 0.11-0.16 after sleeper activation, demonstrating the severe threat posed by trust-building adversaries. Our findings indicate that coordinated multi-agent learning provides measurable advantages for defending against sophisticated trust manipulation attacks in blockchain IoT environments.
中文摘要 保护区块链驱动的物联网网络免受复杂对抗攻击仍是关键挑战。本文提出了一个基于信任的委派共识框架，整合了全同态加密（FHE）与基于属性的访问控制（ABAC）以实现隐私保护的策略评估，并结合基于学习的防御机制。我们系统地比较了三种强化学习方法：表格式Q学习（RL）、采用Dueling Double DQN的深度强化学习（DRL）和多智能体强化学习（MARL），并针对五类不同攻击进行评估：朴素恶意攻击（NMA）、共谋谣言攻击（CRA）、自适应对抗攻击（AAA）、拜占庭故障注入（BFI）和时间延迟投毒（TDP）。在16节点模拟物联网网络上的实验结果显示，性能差异显著：MARL在共谋攻击下能实现更优检测（F1=0.85，对比DRL的0.68和RL的0.50），而DRL和MARL在自适应攻击中均达到完美检测（F1=1.00），RL则失败（F1=0.50）。所有智能体都能成功防御拜占庭攻击（F1=1.00）。最关键的是，时间延迟投毒攻击对所有智能体造成灾难性影响，F1分数在潜伏者激活后降至0.11-0.16，显示出先建立信任的对手所构成的严重威胁。我们的发现表明，协调的多智能体学习在区块链物联网环境中防御复杂的信任操控攻击方面具有可衡量的优势。
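
Illustrative sketch (not from the paper): the simplest of the three compared learners is tabular Q-learning. A single update step is shown below; the state and action labels, the reward, and the hyperparameters are hypothetical and are not taken from the paper's setup.

```python
def q_update(q, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.9):
    """One tabular Q-learning update; unseen entries default to 0."""
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q[(state, action)]

actions = ["trust", "exclude"]
q_table = {}
# The agent excludes a node that later proves malicious and is rewarded
first = q_update(q_table, "suspicious", "exclude", 1.0, "clean", actions)
print(first)
```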

Reinforcement Networks: novel framework for collaborative Multi-Agent Reinforcement Learning tasks

强化网络：协作多智能体强化学习任务的新框架

Authors: Maksim Kryzhanovskiy, Svetlana Glazyrina, Roman Ischenko, Konstantin Vorontsov
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.22876
Pdf link: https://arxiv.org/pdf/2512.22876
Abstract Modern AI systems often comprise multiple learnable components that can be naturally organized as graphs. A central challenge is the end-to-end training of such systems without restrictive architectural or training assumptions. Such tasks fit the theory and approaches of the collaborative Multi-Agent Reinforcement Learning (MARL) field. We introduce Reinforcement Networks, a general framework for MARL that organizes agents as vertices in a directed acyclic graph (DAG). This structure extends hierarchical RL to arbitrary DAGs, enabling flexible credit assignment and scalable coordination while avoiding strict topologies, fully centralized training, and other limitations of current approaches. We formalize training and inference methods for the Reinforcement Networks framework and connect it to the LevelEnv concept to support reproducible construction, training, and evaluation. We demonstrate the effectiveness of our approach on several collaborative MARL setups by developing several Reinforcement Networks models that achieve improved performance over standard MARL baselines. Beyond empirical gains, Reinforcement Networks unify hierarchical, modular, and graph-structured views of MARL, opening a principled path toward designing and training complex multi-agent systems. We conclude with theoretical and practical directions - richer graph morphologies, compositional curricula, and graph-aware exploration. That positions Reinforcement Networks as a foundation for a new line of research in scalable, structured MARL.
中文摘要 现代人工智能系统通常包含多个可学习的组件，这些组件可以自然地组织成图表。一个核心挑战是如何在没有限制性架构或培训假设的情况下，对此类系统的端到端训练。这些任务符合协作多智能体强化学习（MARL）领域的理论和方法。我们介绍强化网络，这是一种通用的MARL框架，将代理组织为有向无环图（DAG）中的顶点。该结构将层级强化学习扩展到任意DAGs，实现灵活的学分分配和可扩展的协调，同时避免严格的拓扑结构、完全集中式培训及现有方法的其他限制。我们为强化网络框架正式化培训和推理方法，并将其与LevelEnv概念连接，以支持可重复的构建、培训和评估。我们通过开发多个强化网络模型，展示了我们方法在多个协作MARL设置上的有效性，这些模型在性能优于标准MARL基线上有所提升。除了经验上的收益外，强化网络还统一了MARL的层级、模块化和图结构视图，为设计和训练复杂多智能体系统开辟了一条有原则的路径。最后，我们以理论和实践方向作结——更丰富的图形态、合成课程以及图感知探索。这使强化网络成为可扩展、结构化MARL新研究方向的基础。
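
Illustrative sketch (not from the paper): inference over agents arranged as vertices of a DAG can be pictured as evaluating each agent after all of its parents, passing the parents' outputs down as messages. The agent functions and graph below are hypothetical, and the standard-library `graphlib` module (Python 3.9+) is used only as a convenient way to get a topological order.

```python
from graphlib import TopologicalSorter

def run_network(parents, agents, observation):
    """parents: {node: [parent nodes]}; agents: {node: fn(obs, messages)}."""
    outputs = {}
    # static_order() yields every node after all of its predecessors
    for node in TopologicalSorter(parents).static_order():
        messages = [outputs[p] for p in parents.get(node, [])]
        outputs[node] = agents[node](observation, messages)
    return outputs

parents = {
    "planner": [],
    "left": ["planner"],
    "right": ["planner"],
    "merger": ["left", "right"],
}
agents = {
    "planner": lambda obs, msgs: obs * 2,
    "left": lambda obs, msgs: msgs[0] + 1,
    "right": lambda obs, msgs: msgs[0] - 1,
    "merger": lambda obs, msgs: sum(msgs),
}
outputs = run_network(parents, agents, observation=3)
print(outputs)
```

A strict hierarchy is the special case where every node has at most one parent; the DAG form also allows a node such as `merger` to combine several upstream agents.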

SAMP-HDRL: Segmented Allocation with Momentum-Adjusted Utility for Multi-agent Portfolio Management via Hierarchical Deep Reinforcement Learning

SAMP-HDRL：通过层级深度强化学习实现多代理投资组合管理的分段分配与动量调整效用

Authors: Xiaotian Ren, Nuerxiati Abudurexiti, Zhengyong Jiang, Angelos Stefanidis, Hongbin Liu, Jionglong Su
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.22895
Pdf link: https://arxiv.org/pdf/2512.22895
Abstract Portfolio optimization in non-stationary markets is challenging due to regime shifts, dynamic correlations, and the limited interpretability of deep reinforcement learning (DRL) policies. We propose a Segmented Allocation with Momentum-Adjusted Utility for Multi-agent Portfolio Management via Hierarchical Deep Reinforcement Learning (SAMP-HDRL). The framework first applies dynamic asset grouping to partition the market into high-quality and ordinary subsets. An upper-level agent extracts global market signals, while lower-level agents perform intra-group allocation under mask constraints. A utility-based capital allocation mechanism integrates risky and risk-free assets, ensuring coherent coordination between global and local decisions. Backtests across three market regimes (2019-2021) demonstrate that SAMP-HDRL consistently outperforms nine traditional baselines and nine DRL benchmarks under volatile and oscillating conditions. Compared with the strongest baseline, our method achieves at least 5% higher Return, 5% higher Sharpe ratio, 5% higher Sortino ratio, and 2% higher Omega ratio, with substantially larger gains observed in turbulent markets. Ablation studies confirm that upper-lower coordination, dynamic clustering, and capital allocation are indispensable to robustness. SHAP-based interpretability further reveals a complementary "diversified + concentrated" mechanism across agents, providing transparent insights into decision-making. Overall, SAMP-HDRL embeds structural market constraints directly into the DRL pipeline, offering improved adaptability, robustness, and interpretability in complex financial environments.
中文摘要 由于市场状态转换、动态相关性以及深度强化学习（DRL）策略的可解释性有限，非平稳市场中的投资组合优化具有挑战性。我们提出了一种分段分配，采用动量调整效用，通过分层深度强化学习（SAMP-HDRL）进行多智能体投资组合管理。该框架首先应用动态资产分组，将市场划分为高质量和普通子集。上层智能体提取全局市场信号，而下层智能体则在掩码约束下执行组内分配。基于效用的资本配置机制整合了风险资产和无风险资产，确保全局与局部决策之间的协调一致。在三个市场状态（2019年至2021年）上的回测显示，SAMP-HDRL在波动和震荡条件下持续优于九个传统基线和九个DRL基准。与最强基线相比，我们的方法至少实现了5%更高的回报、5%更高的夏普比率、5%更高的索蒂诺比率和2%更高的Omega比率，在动荡市场中观察到的收益显著更大。消融研究证实，上下层协调、动态聚类和资本配置对于稳健性至关重要。基于SHAP的可解释性进一步揭示了跨智能体的互补“多元化+集中”机制，为决策过程提供透明的洞察。总体而言，SAMP-HDRL将结构性市场约束直接嵌入DRL流程中，在复杂金融环境中提升了适应性、稳健性和可解释性。
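
Illustrative sketch (not from the paper): two of the reported metrics, the Sharpe and Sortino ratios, computed per period without annualisation. Conventions vary (sample versus population deviation, choice of target return); this version uses population statistics and a zero target, and the return series is made up.

```python
import math
import statistics

def sharpe_ratio(returns, risk_free=0.0):
    excess = [r - risk_free for r in returns]
    return statistics.fmean(excess) / statistics.pstdev(excess)

def sortino_ratio(returns, risk_free=0.0):
    excess = [r - risk_free for r in returns]
    # Downside deviation counts only the periods with negative excess return
    downside = math.sqrt(sum(min(0.0, e) ** 2 for e in excess) / len(excess))
    return statistics.fmean(excess) / downside

returns = [0.02, -0.01, 0.03, -0.02, 0.01]   # hypothetical per-period returns
sharpe = sharpe_ratio(returns)
sortino = sortino_ratio(returns)
print(sharpe, sortino)
```

Because the Sortino ratio ignores upside volatility, it comes out higher than the Sharpe ratio for this series.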

Sat-EnQ: Satisficing Ensembles of Weak Q-Learners for Reliable and Compute-Efficient Reinforcement Learning

Sat-EnQ：满足弱Q-学习者群，实现可靠且计算高效的强化学习

Authors: Ünver Çiftçi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.22910
Pdf link: https://arxiv.org/pdf/2512.22910
Abstract Deep Q-learning algorithms remain notoriously unstable, especially during early training when the maximization operator amplifies estimation errors. Inspired by bounded rationality theory and developmental learning, we introduce Sat-EnQ, a two-phase framework that first learns to be "good enough" before optimizing aggressively. In Phase 1, we train an ensemble of lightweight Q-networks under a satisficing objective that limits early value growth using a dynamic baseline, producing diverse, low-variance estimates while avoiding catastrophic overestimation. In Phase 2, the ensemble is distilled into a larger network and fine-tuned with standard Double DQN. We prove theoretically that satisficing induces bounded updates and cannot increase target variance, with a corollary quantifying conditions for substantial reduction. Empirically, Sat-EnQ achieves 3.8x variance reduction, eliminates catastrophic failures (0% vs 50% for DQN), maintains 79% performance under environmental noise, and requires 2.5x less compute than bootstrapped ensembles. Our results highlight a principled path toward robust reinforcement learning by embracing satisficing before optimization.
中文摘要 深度Q学习算法依然以不稳定著称，尤其是在早期训练阶段，最大化算子会放大估计误差。受有限理性理论和发展性学习的启发，我们引入了Sat-EnQ，这是一个两阶段框架，先学习做到“足够好”，然后再积极优化。在第一阶段，我们在满意化（satisficing）目标下训练一组轻量级Q网络，利用动态基线限制早期价值增长，生成多样且低方差的估计，同时避免灾难性高估。在第二阶段，该集成被蒸馏到一个更大的网络中，并用标准的Double DQN进行微调。我们从理论上证明满意化会带来有界更新，且不会增加目标方差，并给出一个推论量化方差显著减少的条件。从经验上看，Sat-EnQ实现了3.8倍的方差减少，消除了灾难性故障（0%，而DQN为50%），在环境噪声下保持79%的性能，并且比自助式集成所需的计算量减少了2.5倍。我们的结果凸显了一条有原则的稳健强化学习路径：先满意化，再优化。
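
Illustrative sketch (not from the paper): the abstract does not give the satisficing objective, so this example assumes one simple form, capping the Bellman target at a baseline plus a margin. It shows the two properties the abstract mentions: the target is bounded, and capping does not increase its spread. All names and numbers are hypothetical.

```python
import statistics

def satisficing_target(reward, next_q_values, baseline, gamma=0.99, margin=1.0):
    """Bellman target capped at baseline + margin (assumed form)."""
    bellman = reward + gamma * max(next_q_values)
    return min(bellman, baseline + margin)

# Noisy next-state estimates, one of them badly overestimated
noisy_next_q = [[2.0, 3.0], [2.5, 50.0], [1.0, 4.0], [3.0, 2.0]]
plain = [1.0 + 0.99 * max(qs) for qs in noisy_next_q]
capped = [satisficing_target(1.0, qs, baseline=4.0) for qs in noisy_next_q]
print(statistics.pvariance(plain), statistics.pvariance(capped))
```

Capping is a non-expansive map, which is why the variance of the capped targets cannot exceed that of the uncapped ones in this toy setting.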

Heterogeneity in Multi-Agent Reinforcement Learning

多智能体强化学习中的异质性

Authors: Tianyi Hu, Zhiqiang Pu, Yuan Wang, Tenghai Qiu, Min Chen, Xin Yu
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.22941
Pdf link: https://arxiv.org/pdf/2512.22941
Abstract Heterogeneity is a fundamental property in multi-agent reinforcement learning (MARL), which is closely related not only to the functional differences of agents, but also to policy diversity and environmental interactions. However, the MARL field currently lacks a rigorous definition and deeper understanding of heterogeneity. This paper systematically discusses heterogeneity in MARL from the perspectives of definition, quantification, and utilization. First, based on an agent-level modeling of MARL, we categorize heterogeneity into five types and provide mathematical definitions. Second, we define the concept of heterogeneity distance and propose a practical quantification method. Third, we design a heterogeneity-based multi-agent dynamic parameter sharing algorithm as an example of the application of our methodology. Case studies demonstrate that our method can effectively identify and quantify various types of agent heterogeneity. Experimental results show that the proposed algorithm, compared to other parameter sharing baselines, has better interpretability and stronger adaptability. The proposed methodology will help the MARL community gain a more comprehensive and profound understanding of heterogeneity, and further promote the development of practical algorithms.
中文摘要 异质性是多智能体强化学习（MARL）中的一个基本属性，它不仅与智能体的功能差异密切相关，还与策略多样性和环境交互密切相关。然而，MARL领域目前缺乏对异质性的严格定义和更深入的理解。本文系统性地从定义、量化和利用三个角度讨论了MARL中的异质性。首先，基于对MARL的智能体级建模，我们将异质性分为五种类型，并给出数学定义。其次，我们定义了异质性距离的概念，并提出了一种实用的量化方法。第三，我们设计了一个基于异质性的多智能体动态参数共享算法，作为我们方法论应用的示例。案例研究表明，我们的方法能够有效识别和量化各种类型的智能体异质性。实验结果表明，与其他参数共享基线相比，所提算法具有更好的可解释性和更强的适应性。所提方法论将帮助MARL社区更全面、更深入地理解异质性，并进一步推动实用算法的发展。

APO: Alpha-Divergence Preference Optimization

APO：阿尔法-散度偏好优化

Authors: Wang Zixian
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.22953
Pdf link: https://arxiv.org/pdf/2512.22953
Abstract Two divergence regimes dominate modern alignment practice. Supervised fine-tuning and many distillation-style objectives implicitly minimize the forward KL divergence KL(q || pi_theta), yielding stable mode-covering updates but often under-exploiting high-reward modes. In contrast, PPO-style online reinforcement learning from human feedback behaves closer to reverse KL divergence KL(pi_theta || q), enabling mode-seeking improvements but risking mode collapse. Recent anchored methods, such as ADPO, show that performing the projection in anchored coordinates can substantially improve stability, yet they typically commit to a single divergence. We introduce Alpha-Divergence Preference Optimization (APO), an anchored framework that uses Csiszar alpha-divergence to continuously interpolate between forward and reverse KL behavior within the same anchored geometry. We derive unified gradient dynamics parameterized by alpha, analyze gradient variance properties, and propose a practical reward-and-confidence-guarded alpha schedule that transitions from coverage to exploitation only when the policy is both improving and confidently calibrated. Experiments on Qwen3-1.7B with math-level3 demonstrate that APO achieves competitive performance with GRPO and GSPO baselines while maintaining training stability.
中文摘要 现代对齐实践主导着两种发散体系。监督微调和许多蒸馏式目标隐式最小化了前向KL散度KL（q || pi_theta），实现稳定的模式覆盖更新，但往往未能充分利用高回报模式。相比之下，PPO风格的人类反馈在线强化学习更接近逆KL发散KL（pi_theta || q），允许模式寻求改进，但存在模式崩溃的风险。最近的锚定方法，如ADPO，表明在锚定坐标中进行投影可以显著提升稳定性，但通常会选择单一发度。我们介绍了Alpha-散度偏好优化（APO），这是一种锚定框架，利用Csiszar alpha发散在同一锚定几何内连续插值正向和反向KL行为。我们推导了以α为参数的统一梯度动态，分析梯度方差性质，并提出了一种实用的奖励与信心保护型α计划，只有当政策既改善且校准可靠时，才从覆盖过渡到利用。在Qwen3-1.7B和数学水平3上的实验表明，APO在保持训练稳定性的同时，能够在GRPO和GSPO基线上实现竞争性能。
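
Illustrative sketch (not from the paper): the interpolation property can be checked numerically with one common parametrisation of the alpha-divergence, D_alpha(p || q) = (sum_i p_i^alpha q_i^(1 - alpha) - 1) / (alpha (alpha - 1)), which tends to KL(p || q) as alpha approaches 1 and to KL(q || p) as alpha approaches 0. The paper may use a different parametrisation or sign convention, and the two distributions below are made up.

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def alpha_divergence(p, q, alpha):
    """Alpha-divergence for alpha not in {0, 1} (assumed parametrisation)."""
    s = sum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q))
    return (s - 1.0) / (alpha * (alpha - 1.0))

p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]
near_one = alpha_divergence(p, q, 1.0 - 1e-6)   # close to KL(p || q)
near_zero = alpha_divergence(p, q, 1e-6)        # close to KL(q || p)
print(near_one, kl(p, q))
print(near_zero, kl(q, p))
```

Intermediate values of alpha sit between the two KL behaviours, which is the property an alpha schedule would exploit.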

Diversity or Precision? A Deep Dive into Next Token Prediction

多样性还是精准？深入探讨下一个代币预测

Authors: Haoyuan Wu, Hai Wang, Jiajia Wu, Jinxiang Ou, Keyao Wang, Weile Chen, Zihao Zheng, Bei Yu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.22955
Pdf link: https://arxiv.org/pdf/2512.22955
Abstract Recent advancements have shown that reinforcement learning (RL) can substantially improve the reasoning abilities of large language models (LLMs). The effectiveness of such RL training, however, depends critically on the exploration space defined by the pre-trained model's token-output distribution. In this paper, we revisit the standard cross-entropy loss, interpreting it as a specific instance of policy gradient optimization applied within a single-step episode. To systematically study how the pre-trained distribution shapes the exploration potential for subsequent RL, we propose a generalized pre-training objective that adapts on-policy RL principles to supervised learning. By framing next-token prediction as a stochastic decision process, we introduce a reward-shaping strategy that explicitly balances diversity and precision. Our method employs a positive reward scaling factor to control probability concentration on ground-truth tokens and a rank-aware mechanism that treats high-ranking and low-ranking negative tokens asymmetrically. This allows us to reshape the pre-trained token-output distribution and investigate how to provide a more favorable exploration space for RL, ultimately enhancing end-to-end reasoning performance. Contrary to the intuition that higher distribution entropy facilitates effective exploration, we find that imposing a precision-oriented prior yields a superior exploration space for RL.
中文摘要 最新进展表明，强化学习（RL）可以显著提升大型语言模型（LLMs）的推理能力。然而，这种强化学习训练的有效性关键在于预训练模型的代币-输出分布所定义的探索空间。本文重新审视标准交叉熵损失，将其解释为单步事件中策略梯度优化的具体实例。为了系统性研究预训练分布如何塑造后续强化学习的探索潜力，我们提出了一个将政策化强化学习原则应用于监督学习的通用预训练目标。通过将下一代币预测框架为随机决策过程，我们引入了一种明确平衡多样性与精准性的奖励塑造策略。我们的方法采用正奖励缩放因子控制地面真实代币的概率集中度，并采用等级感知机制，对高排名和低排名负代币进行非对称处理。这使我们能够重塑预训练的代币输出分布，探索如何为强化学习提供更有利的探索空间，最终提升端到端推理性能。与高分布熵有助于有效探索的直觉相反，我们发现施加一个精确导向的先验能为强化学习带来更优的探索空间。
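
Illustrative sketch (not from the paper): the link between cross-entropy and a single-step policy gradient can be checked on raw logits. For a softmax policy, the descent direction of the cross-entropy loss on target token y equals the REINFORCE ascent direction for the action y with reward 1. The `reward` argument stands in for a reward scaling factor; how the paper shapes rewards for negative tokens is not reproduced here.

```python
import math

def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_grad(logits, target):
    """Gradient of -log pi(target) with respect to the logits."""
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

def policy_gradient(logits, action, reward):
    """reward * gradient of log pi(action) with respect to the logits."""
    probs = softmax(logits)
    return [reward * ((1.0 if i == action else 0.0) - p)
            for i, p in enumerate(probs)]

logits = [1.0, 0.5, -0.3, 2.0]
ce_grad = cross_entropy_grad(logits, target=2)
pg_unit = policy_gradient(logits, action=2, reward=1.0)
pg_scaled = policy_gradient(logits, action=2, reward=2.0)
print(ce_grad, pg_unit)
```

Scaling the reward above 1 pushes more probability mass onto the target token per update, which is one way to trade diversity for precision.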

Taming the Tail: Stable LLM Reinforcement Learning via Dynamic Vocabulary Pruning

驯服尾巴：通过动态词汇修剪实现的稳定大型语言模型强化学习

Authors: Yingru Li, Jiawei Xu, Jiacai Liu, Yuxuan Tong, Ziniu Li, Tianle Cai, Ge Zhang, Qian Liu, Baoxiang Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2512.23087
Pdf link: https://arxiv.org/pdf/2512.23087
Abstract Reinforcement learning for large language models (LLMs) faces a fundamental tension: high-throughput inference engines and numerically-precise training systems produce different probability distributions from the same parameters, creating a training-inference mismatch. We prove this mismatch has an asymmetric effect: the bound on log-probability mismatch scales as $(1-p)$ where $p$ is the token probability. For high-probability tokens, this bound vanishes, contributing negligibly to sequence-level mismatch. For low-probability tokens in the tail, the bound remains large, and moreover, when sampled, these tokens exhibit systematically biased mismatches that accumulate over sequences, destabilizing gradient estimation. Rather than applying post-hoc corrections, we propose constraining the RL objective to a dynamically-pruned "safe" vocabulary that excludes the extreme tail. By pruning such tokens, we trade large, systematically biased mismatches for a small, bounded optimization bias. Empirically, our method achieves stable training; theoretically, we bound the optimization bias introduced by vocabulary pruning.
中文摘要 大型语言模型（LLM）的强化学习面临一个根本性矛盾：高通量推理引擎和数值精确训练系统从相同参数产生不同的概率分布，导致训练-推理不匹配。我们证明了这种不匹配具有非对称效应：对数概率不匹配的上界可按$（1-p）$扩展，其中$p$是标记概率。对于高概率标记，这一界限消失，对序列层级不匹配的贡献微乎其微。对于尾部的低概率标记，界限依然较大，且在采样时，这些标记在序列中表现出系统性偏倚的不匹配，这些不匹配会在序列中累积，导致梯度估计不稳定。我们建议不应用事后修正，而是将强化学习目标限制在动态修剪的“安全”词汇表中，排除极端尾部。通过修剪这些代币，我们用一个小且有界的优化偏差，交换了大规模且系统性偏见的错配。从经验上看，我们的方法实现了稳定训练;理论上，我们对词汇剪枝引入的优化偏差进行了界限。
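
The idea of restricting the policy to a pruned "safe" vocabulary can be sketched as a masking step. The abstract does not state the pruning rule, so the threshold below (keep tokens whose probability is at least `min_ratio` times the largest probability) and the name `prune_tail` are assumptions made for illustration only.

```python
import numpy as np

def prune_tail(logits, min_ratio=1e-3):
    """Mask tail tokens and renormalise over the remaining vocabulary.

    Assumed rule (not from the paper): a token is kept when its probability
    is at least min_ratio times the maximum probability.
    """
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()
    probs = np.exp(shifted) / np.exp(shifted).sum()
    keep = probs >= min_ratio * probs.max()
    pruned_logits = np.where(keep, logits, -np.inf)   # tail gets zero mass
    s = pruned_logits - pruned_logits.max()
    safe_probs = np.exp(s) / np.exp(s).sum()
    return keep, safe_probs

keep, safe_probs = prune_tail([5.0, 4.0, 0.0, -6.0])
print(keep, safe_probs)
```

Because the mask depends on the current logits, the kept set changes from step to step, which is one plausible reading of "dynamically-pruned".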

Benchmark Success, Clinical Failure: When Reinforcement Learning Optimizes for Benchmarks, Not Patients

基准成功与临床失败：当强化学习优化的是基准，而非患者

Authors: Armin Berger, Manuela Bergau, Helen Schneider, Saad Ahmad, Tom Anglim Lagones, Gianluca Brugnara, Martha Foltyn-Dumitru, Kai Schlamp, Philipp Vollmuth, Rafet Sifa
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.23090
Pdf link: https://arxiv.org/pdf/2512.23090
Abstract Recent Reinforcement Learning (RL) advances for Large Language Models (LLMs) have improved reasoning tasks, yet their resource-constrained application to medical imaging remains underexplored. We introduce ChexReason, a vision-language model trained via R1-style methodology (SFT followed by GRPO) using only 2,000 SFT samples, 1,000 RL samples, and a single A100 GPU. Evaluations on CheXpert and NIH benchmarks reveal a fundamental tension: GRPO recovers in-distribution performance (23% improvement on CheXpert, macro-F1 = 0.346) but degrades cross-dataset transferability (19% drop on NIH). This mirrors high-resource models like NV-Reason-CXR-3B, suggesting the issue stems from the RL paradigm rather than scale. We identify a generalization paradox where the SFT checkpoint uniquely improves on NIH before optimization, indicating teacher-guided reasoning captures more institution-agnostic features. Furthermore, cross-model comparisons show structured reasoning scaffolds benefit general-purpose VLMs but offer minimal gain for medically pre-trained models. Consequently, curated supervised fine-tuning may outperform aggressive RL for clinical deployment requiring robustness across diverse populations.
中文摘要 大型语言模型（LLM）近年来的强化学习（RL）进展改善了推理任务，但其在医学影像中资源有限的应用仍未被充分探索。我们介绍了ChexReason，这是一种通过R1风格方法（SFT后接GRPO）训练的视觉语言模型，仅使用2000个SFT样本、1000个强化学习样本和一块A100 GPU。对CheXpert和NIH基准的评估显示出一种根本性矛盾：GRPO恢复了分布内表现（CheXpert提升23%，宏观F1=0.346），但降低了跨数据集可迁移性（NIH下降19%）。这与NV-Reason-CXR-3B等高资源模型类似，表明问题源自强化学习范式而非规模化。我们发现了一个泛化悖论，即SFT检查点在优化前独特地优于NIH，表明教师引导推理能捕捉更多机构无关性特征。此外，跨模型比较显示结构化推理支架对通用VLM有利，但对医学预训练模型的收益有限。因此，策划的监督微调在需要跨多元人群稳健性的临床部署中，可能优于积极的强化学习。

A Note on Hybrid Online Reinforcement and Imitation Learning for LLMs: Formulations and Algorithms

关于大型语言模型混合在线强化与模仿学习的说明：表述与算法

Authors: Yingru Li, Ziniu Li, Jiacai Liu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.23097
Pdf link: https://arxiv.org/pdf/2512.23097
Abstract We present a unified framework for Large Language Model (LLM) fine-tuning that integrates Imitation Learning and Reinforcement Learning. By analyzing the gradient of a composite objective combining trajectory-level KL divergence with task rewards, we derive a natural decomposition into two components: (1) an analytically computable Dense Gradient for token-level imitation, and (2) a Monte Carlo estimated Sparse Gradient for long-horizon reward optimization. The Dense Gradient admits a closed-form logit-level formula, enabling efficient GPU implementation.
中文摘要 我们提出了一个统一的大型语言模型（LLM）微调框架，整合了模仿学习和强化学习。通过分析结合轨迹级 KL 发散与任务奖励的复合目标的梯度，我们自然地将分解分为两个部分：（1）用于代币级模仿的解析计算稠密度梯度，以及（2）用于长视野奖励优化的蒙特卡洛估计稀疏梯度。稠密梯度采用闭式Logit级公式，从而实现GPU的高效运行。
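
To illustrate what a closed-form logit-level gradient can look like, the sketch below works through the token-level case only. It assumes a softmax policy with logits `z` and the KL direction KL(pi || pi_ref); the paper's trajectory-level formula and its choice of KL direction are not given in the abstract and may differ. For this assumed case the gradient with respect to logit j is pi_j * (log pi_j - log ref_j - KL), which needs no sampling.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def kl_and_logit_grad(z, ref_logp):
    """Return KL(pi || pi_ref) and its analytic gradient w.r.t. the logits z."""
    logp = log_softmax(z)
    p = np.exp(logp)
    log_ratio = logp - ref_logp
    kl = float((p * log_ratio).sum())
    grad = p * (log_ratio - kl)      # closed form for the assumed token-level KL
    return kl, grad

z = np.array([0.3, -1.2, 0.8, 0.0])                      # toy policy logits
ref_logp = log_softmax(np.array([0.1, 0.2, -0.5, 0.4]))  # toy reference policy
kl, grad = kl_and_logit_grad(z, ref_logp)
print(kl, grad)
```

The gradient entries sum to zero, as expected for a function of softmax logits, which is a quick sanity check when porting such a formula to a GPU kernel.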

Evaluating Parameter Efficient Methods for RLVR

评估RLVR参数效率方法

Authors: Qingyu Yin, Yulun Wu, Zhennan Shen, Sunbowen Li, Zhilin Wang, Yanshu Li, Chak Tou Leong, Jiale Kang, Jinjin Gu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.23165
Pdf link: https://arxiv.org/pdf/2512.23165
Abstract We systematically evaluate Parameter-Efficient Fine-Tuning (PEFT) methods under the paradigm of Reinforcement Learning with Verifiable Rewards (RLVR). RLVR incentivizes language models to enhance their reasoning capabilities through verifiable feedback; however, while methods like LoRA are commonly used, the optimal PEFT architecture for RLVR remains unidentified. In this work, we conduct the first comprehensive evaluation of over 12 PEFT methodologies across the DeepSeek-R1-Distill families on mathematical reasoning benchmarks. Our empirical results challenge the default adoption of standard LoRA with three main findings. First, we demonstrate that structural variants, such as DoRA, AdaLoRA, and MiSS, consistently outperform LoRA. Second, we uncover a spectral collapse phenomenon in SVD-informed initialization strategies (e.g., PiSSA, MiLoRA), attributing their failure to a fundamental misalignment between principal-component updates and RL optimization. Third, our ablations reveal that extreme parameter reduction (e.g., VeRA, Rank-1) severely bottlenecks reasoning capacity. We further conduct ablation studies and scaling experiments to validate our findings. This work provides a definitive guide and advocates for more exploration of parameter-efficient RL methods.
中文摘要 我们系统地评估了参数高效微调（PEFT）方法，采用可验证奖励强化学习（RLVR）范式。RLVR激励语言模型通过可验证的反馈提升推理能力;然而，尽管像LoRA这样的方法被广泛使用，RLVR的最佳PEFT架构仍未确定。本研究首次全面评估了DeepSeek-R1-Distill家族中12多种PEFT方法论的数学推理基准。我们的实证结果挑战了默认采用标准LoRA的做法，主要发现三项。首先，我们证明了结构变体，如DoRA、AdaLoRA和MiSS，始终优于LoRA。其次，我们发现了SVD启发初始化策略（如 PiSSA、MiLoRA）中的频谱坍缩现象，将其失败归因于主成分更新与强化学习优化之间的根本错位。此外，我们的消融显示极端参数减少（如 VeRA，Rank-1）严重限制推理能力。我们还会进行消融研究和量表实验以验证我们的发现。这项工作为倡导更多参数高效强化学习方法的探索提供了权威指南。
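
For readers unfamiliar with the baseline being challenged, the sketch below shows the standard LoRA parameterization: a frozen weight plus a scaled low-rank update. The class name, dimensions, rank, and scaling value are illustrative assumptions and are not the configuration evaluated in the paper; the variants named in the abstract (DoRA, AdaLoRA, MiSS, PiSSA, MiLoRA, VeRA) are not implemented here.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, d_in, d_out, r=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(d_out, d_in))          # frozen base weight
        self.A = rng.normal(scale=0.01, size=(r, d_in))  # trainable, small init
        self.B = np.zeros((d_out, r))                    # trainable, zero init
        self.scale = alpha / r

    def forward(self, x):
        # Base projection plus the low-rank correction.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size

layer = LoRALinear(d_in=64, d_out=32, r=4)
x = np.ones((2, 64))
print(layer.forward(x).shape, layer.trainable_params(), layer.W.size)
```

With `B` initialized to zero the adapted layer starts out identical to the frozen one, and the rank `r` sets the trainable parameter count, which is the capacity knob that the abstract's rank-1 finding concerns.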

A Human-Oriented Cooperative Driving Approach: Integrating Driving Intention, State, and Conflict

以人为本的合作驾驶方法：整合驾驶意图、状态与冲突

Authors: Qin Wang, Shanmin Pang, Jianwu Fang, Shengye Dong, Fuhao Liu, Jianru Xue, Chen Lv
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.23220
Pdf link: https://arxiv.org/pdf/2512.23220
Abstract Human-vehicle cooperative driving serves as a vital bridge to fully autonomous driving by improving driving flexibility and gradually building driver trust and acceptance of autonomous technology. To establish more natural and effective human-vehicle interaction, we propose a Human-Oriented Cooperative Driving (HOCD) approach that primarily minimizes human-machine conflict by prioritizing driver intention and state. In implementation, we take both tactical and operational levels into account to ensure seamless human-vehicle cooperation. At the tactical level, we design an intention-aware trajectory planning method, using intention consistency cost as the core metric to evaluate the trajectory and align it with driver intention. At the operational level, we develop a control authority allocation strategy based on reinforcement learning, optimizing the policy through a designed reward function to achieve consistency between driver state and authority allocation. The results of simulation and human-in-the-loop experiments demonstrate that our proposed approach not only aligns with driver intention in trajectory planning but also ensures a reasonable authority allocation. Compared to other cooperative driving approaches, the proposed HOCD approach significantly enhances driving performance and mitigates human-machine conflict. The code is available at this https URL.
中文摘要 人车合作驾驶通过提升驾驶灵活性、逐步建立驾驶员信任和对自动驾驶技术的接受度，成为通往完全自动驾驶的重要桥梁。为了建立更自然、更有效的人机交互，我们提出了一种以人为导向的合作驾驶（HOCD）方法，主要通过优先考虑驾驶者的意图和状态来最小化人机冲突。在实施过程中，我们兼顾战术和运营层面，确保人机协作无缝。在战术层面，我们设计了一种意图感知轨迹规划方法，以意图一致性成本为核心指标，评估轨迹并使其与驾驶员意图保持一致。在运营层面，我们基于强化学习开发控制权威分配策略，通过设计的奖励函数优化策略，实现驾驶员状态与权威分配的一致性。模拟和人机实验的结果表明，我们提出的方法不仅符合驾驶员在轨迹规划中的意图，还确保了合理的权限分配。与其他合作驾驶方法相比，提出的HOCD方法显著提升了驾驶性能并减轻了人机冲突。代码可在此 https URL 获取。

ViLaCD-R1: A Vision-Language Framework for Semantic Change Detection in Remote Sensing

ViLaCD-R1：一种用于遥感语义变更检测的视觉语言框架

Authors: Xingwei Ma, Shiyang Feng, Bo Zhang, Bin Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.23244
Pdf link: https://arxiv.org/pdf/2512.23244
Abstract Remote sensing change detection (RSCD), a complex multi-image inference task, traditionally uses pixel-based operators or encoder-decoder networks that inadequately capture high-level semantics and are vulnerable to non-semantic perturbations. Although recent multimodal and vision-language model (VLM)-based approaches enhance semantic understanding of change regions by incorporating textual descriptions, they still suffer from challenges such as inaccurate spatial localization, imprecise pixel-level boundary delineation, and limited interpretability. To address these issues, we propose ViLaCD-R1, a two-stage framework comprising a Multi-Image Reasoner (MIR) and a Mask-Guided Decoder (MGD). Specifically, the VLM is trained through supervised fine-tuning (SFT) and reinforcement learning (RL) on block-level dual-temporal inference tasks, taking dual-temporal image patches as input and outputting a coarse change mask. Then, the decoder integrates dual-temporal image features with this coarse mask to predict a precise binary change map. Comprehensive evaluations on multiple RSCD benchmarks demonstrate that ViLaCD-R1 substantially improves true semantic change recognition and localization, robustly suppresses non-semantic variations, and achieves state-of-the-art accuracy in complex real-world scenarios.
中文摘要 遥感变化检测（RSCD）是一项复杂的多图像推断任务，传统上使用基于像素的操作员或编码-解码器网络，这些网络无法充分捕捉高层语义，且容易受到非语义扰动的影响。尽管近期基于多模态和视觉语言模型（VLM）的方法通过融入文本描述增强了对变化区域的语义理解，但它们仍面临空间定位不准确、像素级边界划分不精确以及可解释性有限等挑战。为解决这些问题，我们提出了ViLaCD-R1，这是一个由多图像推理器（MIR）和掩模引导解码器（MGD）组成的两阶段框架。具体来说，VLM通过监督微调（SFT）和强化学习（RL）训练，针对块级双时序推理任务，输入为双时态图像补丁，输出粗变掩膜。然后，解码器将双时态图像特征与该粗糙掩模整合，预测精确的二元变化映射。对多个RSCD基准的综合评估表明，ViLaCD-R1显著提升了真实的语义变化识别和定位能力，能有效抑制非语义变异，并在复杂的现实场景中实现最先进的准确性。

Agentic AI-Enhanced Semantic Communications: Foundations, Architecture, and Applications

代理人工智能增强语义通信：基础、架构与应用

Authors: Haixiao Gao, Mengying Sun, Ruichen Zhang, Yanhan Wang, Xiaodong Xu, Nan Ma, Dusit Niyato, Ping Zhang
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2512.23294
Pdf link: https://arxiv.org/pdf/2512.23294
Abstract Semantic communications (SemCom), as one of the key technologies for 6G, is shifting networks from bit transmission to semantic information exchange. On this basis, introducing agentic artificial intelligence (AI) with perception, memory, reasoning, and action capabilities provides a practicable path to intelligent communications. This paper provides a systematic exposition of how agentic AI empowers SemCom from the perspectives of research foundations, system architecture, and application scenarios. We first provide a comprehensive review of existing studies by agent types, covering embedded agents, large language model (LLM)/large vision model (LVM) agents, and reinforcement learning (RL) agents. Additionally, we propose a unified agentic AI-enhanced SemCom framework covering the application layer, the semantic layer, and the cloud-edge collaboration layer, forming a closed loop from intent to encoding to transmission to decoding to action to evaluation. We also present several typical scenarios, including multi-vehicle collaborative perception, multi-robot cooperative rescue, and agentic operations for intellicise (intelligent and concise) networks. Furthermore, we introduce an agentic knowledge base (KB)-based joint source-channel coding case study, AKB-JSCC, where the source KB and channel KB are built by LLM/LVM agents and RL agents, respectively. Experimental results show that AKB-JSCC achieves higher information reconstruction quality under different channel conditions. Finally, we discuss future evolution and research directions, providing a reference for portable, verifiable, and controllable research and deployment of agentic SemCom.
中文摘要 语义通信（SemCom）作为6G的关键技术之一，正在将网络从比特传输转向语义信息交换。基于此，引入具备感知、记忆、推理和行动能力的智能人工智能（AI）为实现智能通信提供了切实可行的路径。本文系统地阐述了代理人工智能如何从研究基础、系统架构和应用场景的角度赋能SemCom。我们首先对现有研究进行了全面综述，涵盖了嵌入代理、大型语言模型（LLM）/大型视觉模型（LVM）代理以及强化学习（RL）代理。此外，我们还提出了一个统一的代理人工智能增强SemCom框架，涵盖应用层、语义层和云端协作层，形成从意图到编码、传输、解码、动作到评估的闭环。我们还展示了几种典型场景，包括多载具协作感知、多机器人协作救援以及智能化（智能简洁）网络的智能化作。此外，我们引入了一个基于代理知识库（KB）的联合源信道编码案例研究AKB-JSCC，其中源知识库和信道知识库分别由LLM/LVM代理和强化学习代理构建。实验结果表明，AKB-JSCC在不同信道条件下实现了更高的信息重建质量。最后，我们讨论了未来的发展和研究方向，为代理式SemCom的可携带、可验证和可控的研究与部署提供参考。

Splitwise: Collaborative Edge-Cloud Inference for LLMs via Lyapunov-Assisted DRL

Splitwise：通过Lyapunov辅助的DRL为LLM开发协作边缘云推断

Authors: Abolfazl Younesi, Abbas Shabrang Maryan, Elyas Oustad, Zahra Najafabadi Samani, Mohsen Ansari, Thomas Fahringer
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET); Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2512.23310
Pdf link: https://arxiv.org/pdf/2512.23310
Abstract Deploying large language models (LLMs) on edge devices is challenging due to their limited memory and power resources. Cloud-only inference reduces device burden but introduces high latency and cost. Static edge-cloud partitions optimize a single metric and struggle when bandwidth fluctuates. We propose Splitwise, a novel Lyapunov-assisted deep reinforcement learning (DRL) framework for fine-grained, adaptive partitioning of LLMs across edge and cloud environments. Splitwise decomposes transformer layers into attention heads and feed-forward sub-blocks, exposing more partition choices than layer-wise schemes. A hierarchical DRL policy, guided by Lyapunov optimization, jointly minimizes latency, energy consumption, and accuracy degradation while guaranteeing queue stability under stochastic workloads and variable network bandwidth. Splitwise also guarantees robustness via partition checkpoints with exponential backoff recovery in case of communication failures. Experiments on Jetson Orin NX, Galaxy S23, and Raspberry Pi 5 with GPT-2 (1.5B), LLaMA-7B, and LLaMA-13B show that Splitwise reduces end-to-end latency by 1.4x-2.8x and cuts energy consumption by up to 41% compared with existing partitioners. It lowers the 95th-percentile latency by 53-61% relative to cloud-only execution, while maintaining accuracy and modest memory requirements.
中文摘要 由于边缘设备的内存和功耗资源有限，在其上部署大型语言模型（LLM）具有挑战性。仅云推断降低了设备负担，但带来了高延迟和高成本。静态边缘云分区只优化单一指标，带宽波动时表现不佳。我们提出了Splitwise，一种新型的Lyapunov辅助深度强化学习（DRL）框架，用于在边缘和云环境中实现LLM的细粒度自适应分区。Splitwise将Transformer层分解为注意力头和前馈子块，提供比逐层方案更多的分区选择。以Lyapunov优化为指导的分层式DRL策略，在随机工作负载和可变网络带宽下保证队列稳定性的同时，联合最小化延迟、能耗和精度下降。Splitwise还通过分区检查点保证了鲁棒性，并在通信失败时采用指数退避进行恢复。在 Jetson Orin NX、Galaxy S23 和 Raspberry Pi 5 上使用 GPT-2（1.5B）、LLaMA-7B 和 LLaMA-13B 的实验显示，Splitwise 相比现有分区器，端到端延迟降低了 1.4 至 2.8 倍，能耗降低了高达 41%。相对于仅云端执行，它将第95百分位延迟降低了53%-61%，同时保持了准确性和适度的内存需求。
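
The role Lyapunov optimization plays in trading cost against queue stability can be illustrated with the textbook drift-plus-penalty rule. This is a greedy sketch under stated assumptions, not the hierarchical DRL policy the paper describes: the two actions, their `cost` and `service` values, and the weight `v` are all hypothetical.

```python
def choose_action(queue, actions, arrival, v=1.0):
    """Pick the action minimising V * cost + Q * (arrival - service)."""
    return min(actions, key=lambda a: v * a["cost"] + queue * (arrival - a["service"]))

def step_queue(queue, arrival, service):
    """Standard queue update: backlog never drops below zero."""
    return max(queue + arrival - service, 0.0)

# Hypothetical partition choices: cheap but slow on the edge, costly but fast in the cloud.
actions = [
    {"name": "edge", "cost": 1.0, "service": 1.0},
    {"name": "cloud", "cost": 3.0, "service": 4.0},
]

queue = 0.0
for _ in range(5):
    action = choose_action(queue, actions, arrival=2.0)
    queue = step_queue(queue, arrival=2.0, service=action["service"])
    print(action["name"], queue)
```

When the backlog is small the cheap action wins; as the queue grows, the backlog term dominates and the rule switches to the faster action, which is the mechanism that keeps the queue bounded.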

CME-CAD: Heterogeneous Collaborative Multi-Expert Reinforcement Learning for CAD Code Generation

CME-CAD：异构协作多专家强化学习用于CAD代码生成

Authors: Ke Niu, Haiyang Yu, Zhuofan Chen, Zhengtao Yao, Weitao Jia, Xiaodong Ge, Jingqun Tang, Benlei Cui, Bin Li, Xiangyang Xue
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.23333
Pdf link: https://arxiv.org/pdf/2512.23333
Abstract Computer-Aided Design (CAD) is essential in industrial design, but the complexity of traditional CAD modeling and workflows presents significant challenges for automating the generation of high-precision, editable CAD models. Existing methods that reconstruct 3D models from sketches often produce non-editable and approximate models that fall short of meeting the stringent requirements for precision and editability in industrial design. Moreover, the reliance on text or image-based inputs often requires significant manual annotation, limiting their scalability and applicability in industrial settings. To overcome these challenges, we propose the Heterogeneous Collaborative Multi-Expert Reinforcement Learning (CME-CAD) paradigm, a novel training paradigm for CAD code generation. Our approach integrates the complementary strengths of these models, facilitating collaborative learning and improving the model's ability to generate accurate, constraint-compatible, and fully editable CAD models. We introduce a two-stage training process: Multi-Expert Fine-Tuning (MEFT), and Multi-Expert Reinforcement Learning (MERL). Additionally, we present CADExpert, an open-source benchmark consisting of 17,299 instances, including orthographic projections with precise dimension annotations, expert-generated Chain-of-Thought (CoT) processes, executable CADQuery code, and rendered 3D models.
中文摘要 计算机辅助设计（CAD）在工业设计中至关重要，但传统CAD建模和工作流程的复杂性在自动化生成高精度、可编辑CAD模型方面带来了重大挑战。现有从草图重建3D模型的方法，往往会产生不可编辑且近似的模型，无法满足工业设计中对精度和可编辑性的严格要求。此外，依赖文本或图像输入通常需要大量手工注释，限制了其在工业环境中的可扩展性和适用性。为克服这些挑战，我们提出了异构协作多专家强化学习（CME-CAD）范式，这是一种用于CAD代码生成的新型训练范式。我们的方法整合了这些模型的互补优势，促进协作学习，并提升模型生成准确、约束兼容且完全可编辑CAD模型的能力。我们引入了两阶段的培训流程：多专家微调（MEFT）和多专家强化学习（MERL）。此外，我们还展示了CADExpert，这是一个开源基准测试，包含17,299个实例，包括带有精确维度注释的正交投影、专家生成的Chain-of-Thought（CoT）流程、可执行的CADQuery代码以及渲染的3D模型。

AGRO-SQL: Agentic Group-Relative Optimization with High-Fidelity Data Synthesis

AGRO-SQL：具备高精度数据综合的代理群相对优化

Authors: Cehua Yang, Dongyu Xiao, Junming Lin, Yuyang Song, Hanxu Yan, Shawn Guo, Wei Zhang, Jian Yang, Mingjie Tang, Bryan Dai
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.23366
Pdf link: https://arxiv.org/pdf/2512.23366
Abstract The advancement of Text-to-SQL systems is currently hindered by the scarcity of high-quality training data and the limited reasoning capabilities of models in complex scenarios. In this paper, we propose a holistic framework that addresses these issues through a dual-centric approach. From a Data-Centric perspective, we construct an iterative data factory that synthesizes RL-ready data characterized by high correctness and precise semantic-logic alignment, ensured by strict verification. From a Model-Centric perspective, we introduce a novel Agentic Reinforcement Learning framework. This framework employs a Diversity-Aware Cold Start stage to initialize a robust policy, followed by Group Relative Policy Optimization (GRPO) to refine the agent's reasoning via environmental feedback. Extensive experiments on BIRD and Spider benchmarks demonstrate that our synergistic approach achieves state-of-the-art performance among single-model methods.
中文摘要 文本转SQL系统的进步目前受限于高质量训练数据的稀缺和复杂场景模型推理能力有限的限制。本文提出了一个整体框架，通过双中心主义方法解决这些问题。从数据中心的角度，我们构建了一个迭代数据工厂，综合具有高度正确性和精确语义逻辑对齐的强化学习就绪数据，并通过严格验证确保。从以模型为中心的角度，我们引入了一个新的智能强化学习框架。该框架采用多样性感知冷启动阶段初始化稳健策略，随后进行群相对策略优化（GRPO）通过环境反馈优化智能体推理。BIRD和Spider基准测试的广泛实验表明，我们的协同方法在单模型方法中实现了最先进的性能。
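
The "group relative" part of GRPO can be sketched in a few lines: several responses are sampled for the same prompt, and each response's advantage is its reward standardized within that group. The reward values below are hypothetical (for example, 1 when a generated SQL query returns the expected rows), and the sketch omits the clipped policy-ratio objective and any KL term used in full GRPO training.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardise rewards within one group of responses to the same prompt."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical execution-based rewards for four sampled queries.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

A group in which every response receives the same reward yields zero advantages and therefore no learning signal, which is one reason the quality and difficulty of the training data matter for this kind of training.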

The World Is Bigger! A Computationally-Embedded Perspective on the Big World Hypothesis

世界更大了！对大世界假说的计算嵌入视角

Authors: Alex Lewandowski, Adtiya A. Ramesh, Edan Meyer, Dale Schuurmans, Marlos C. Machado
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.23419
Pdf link: https://arxiv.org/pdf/2512.23419
Abstract Continual learning is often motivated by the idea, known as the big world hypothesis, that "the world is bigger" than the agent. Recent problem formulations capture this idea by explicitly constraining an agent relative to the environment. These constraints lead to solutions in which the agent continually adapts to best use its limited capacity, rather than converging to a fixed solution. However, explicit constraints can be ad hoc, difficult to incorporate, and may limit the effectiveness of scaling up the agent's capacity. In this paper, we characterize a problem setting in which an agent, regardless of its capacity, is constrained by being embedded in the environment. In particular, we introduce a computationally-embedded perspective that represents an embedded agent as an automaton simulated within a universal (formal) computer. Such an automaton is always constrained; we prove that it is equivalent to an agent that interacts with a partially observable Markov decision process over a countably infinite state-space. We propose an objective for this setting, which we call interactivity, that measures an agent's ability to continually adapt its behaviour by learning new predictions. We then develop a model-based reinforcement learning algorithm for interactivity-seeking, and use it to construct a synthetic problem to evaluate continual learning capability. Our results show that deep nonlinear networks struggle to sustain interactivity, whereas deep linear networks sustain higher interactivity as capacity increases.
中文摘要 持续学习通常源于所谓的大世界假说，即“世界比主体更大”。近期的问题表述通过明确限制代理相对于环境来捕捉这一理念。这些约束导致智能体不断适应以最大化利用有限容量，而非趋向固定解。然而，显式约束可能很临时、难以纳入，并且可能限制智能体能力扩展的有效性。本文描述了一个问题环境，即代理人无论其能力如何，都被嵌入环境中所限制。特别地，我们引入了一种计算嵌入式视角，将嵌入式代理视为在通用（形式）计算机中模拟的自动机。这样的自动机总是受到约束;我们证明它等价于一个在可数无限状态空间上与部分可观测的马尔可夫决策过程相互作用的代理。我们提出了一个目标，称之为交互性，衡量智能体通过学习新预测不断调整行为的能力。随后，我们开发了基于模型的增强学习算法，用于寻求交互性，并用它构建一个综合问题，以评估持续学习能力。我们的结果表明，深度非线性网络难以维持交互性，而深度线性网络随着容量增加，能维持更高的交互性。

Replay Failures as Successes: Sample-Efficient Reinforcement Learning for Instruction Following

将失败作为成功：教学遵循的样本高效强化学习

Authors: Kongcheng Zhang, Qi Yao, Shunyu Liu, Wenjian Zhang, Min Cen, Yang Zhou, Wenkai Fang, Yiru Zhao, Baisheng Lai, Mingli Song
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.23457
Pdf link: https://arxiv.org/pdf/2512.23457
Abstract Reinforcement Learning (RL) has shown promise for aligning Large Language Models (LLMs) to follow instructions with various constraints. Despite the encouraging results, RL improvement inevitably relies on sampling successful, high-quality responses; however, the initial model often struggles to generate responses that satisfy all constraints due to its limited capabilities, yielding sparse or indistinguishable rewards that impede learning. In this work, we propose Hindsight instruction Replay (HiR), a novel sample-efficient RL framework for complex instruction following tasks, which employs a select-then-rewrite strategy to replay failed attempts as successes based on the constraints that have been satisfied in hindsight. We perform RL on these replayed samples as well as the original ones, theoretically framing the objective as dual-preference learning at both the instruction- and response-level to enable efficient optimization using only a binary reward signal. Extensive experiments demonstrate that the proposed HiR yields promising results across different instruction following tasks, while requiring less computational budget. Our code and dataset are available at this https URL.
中文摘要 强化学习（RL）已展现出在协调大型语言模型（LLM）以遵循各种约束指令方面展现出的潜力。尽管结果令人鼓舞，强化学习的提升不可避免地依赖于抽样成功且高质量的反应;然而，由于其能力有限，初始模型常常难以生成满足所有约束条件的响应，导致奖励稀疏或难以区分，阻碍学习。本研究提出事后诸葛亮指令重放（HiR），这是一种用于复杂指令跟踪任务的新型样本高效强化学习框架，采用选择后重写策略，基于事后诸葛亮满足的约束，将失败尝试重放为成功。我们对这些重放样本和原始样本进行强化学习，理论上将目标框架为双重偏好学习，涵盖指令和响应层面，以实现仅使用二元奖励信号的高效优化。大量实验表明，所提的HiR在不同指令跟随任务中取得了有前景的结果，同时所需的计算预算更少。我们的代码和数据集可在该 https URL 访问。
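
The select-then-rewrite idea can be sketched as follows: keep a failed response, find the constraints it did satisfy, and rewrite the instruction so that it asks only for those. Everything in this sketch is a hypothetical illustration (the constraint format, the checker functions, and the name `hindsight_rewrite`); the paper's actual selection and rewriting procedure is not described in the abstract.

```python
def hindsight_rewrite(task, constraints, response):
    """Relabel a partially failed response as a success for a rewritten instruction.

    Returns None when nothing is satisfied (no useful relabel) or when
    everything is satisfied (already a success under the original instruction).
    """
    satisfied = [c for c in constraints if c["check"](response)]
    if not satisfied or len(satisfied) == len(constraints):
        return None
    new_instruction = task + " " + " ".join(c["text"] for c in satisfied)
    return {"instruction": new_instruction, "response": response, "reward": 1}

# Hypothetical verifiable constraints.
constraints = [
    {"text": "Use fewer than 10 words.", "check": lambda r: len(r.split()) < 10},
    {"text": "End with a question mark.", "check": lambda r: r.endswith("?")},
]

sample = hindsight_rewrite("Describe cats.", constraints, "Cats sleep a lot.")
print(sample)
```

The replayed sample carries a binary reward of 1 under the rewritten instruction, so it can be mixed with the original rollouts without changing the reward format.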

Eliminating Inductive Bias in Reward Models with Information-Theoretic Guidance

利用信息理论指导消除奖励模型中的归纳偏倚

Authors: Zhuo Li, Pengyu Cheng, Zhechao Yu, Feifei Tong, Anningzhe Gao, Tsung-Hui Chang, Xiang Wan, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.23461
Pdf link: https://arxiv.org/pdf/2512.23461
Abstract Reward models (RMs) are essential in reinforcement learning from human feedback (RLHF) to align large language models (LLMs) with human values. However, RM training data is commonly recognized as low-quality, containing inductive biases that can easily lead to overfitting and reward hacking. For example, more detailed and comprehensive responses are usually human-preferred but with more words, leading response length to become one of the inevitable inductive biases. A limited number of prior RM debiasing approaches either target a single specific type of bias or model the problem with only simple linear correlations, e.g., Pearson coefficients. To mitigate more complex and diverse inductive biases in reward modeling, we introduce a novel information-theoretic debiasing method called Debiasing via Information optimization for RM (DIR). Inspired by the information bottleneck (IB), we maximize the mutual information (MI) between RM scores and human preference pairs, while minimizing the MI between RM outputs and biased attributes of preference inputs. With theoretical justification from information theory, DIR can handle more sophisticated types of biases with non-linear correlations, broadly extending the real-world application scenarios for RM debiasing methods. In experiments, we verify the effectiveness of DIR with three types of inductive biases: response length, sycophancy, and format. We discover that DIR not only effectively mitigates target inductive biases but also enhances RLHF performance across diverse benchmarks, yielding better generalization abilities. The code and training recipes are available at this https URL.
中文摘要 奖励模型（RM）在人类反馈强化学习（RLHF）中至关重要，旨在将大型语言模型（LLMs）与人类价值观对齐。然而，RM训练数据通常被认为是低质量的，存在归纳偏差，容易导致过拟合和奖励黑客行为。例如，更详细、更全面的回答通常更受人类偏好，但词汇更多，导致回答长度成为不可避免的归纳偏见之一。有限数量的先前RM去偏倚方法要么针对单一特定类型的偏置，要么仅用简单的线性相关性（例如皮尔逊系数）来建模问题。为了减轻奖励建模中更复杂和多样化的归纳偏见，我们引入了一种新颖的信息理论去偏见方法，名为 Debiasing via Information optimization for RM（DIR）。受信息瓶颈（IB）启发，我们最大化了RM分数与人类偏好对之间的互信息（MI），同时最小化RM输出与偏好输入偏置属性之间的MI。基于信息理论的理论依据，DIR能够处理更复杂的非线性相关偏差类型，广泛扩展了现实中RM去偏倚方法的实际应用场景。在实验中，我们验证了DIR的有效性，采用三种归纳偏差：响应长度、谄媚和格式。我们发现，DIR不仅有效减轻了目标归纳偏差，还能提升RLHF在多种基准测试中的表现，从而提升泛化能力。代码和训练配方可在此 https 网址获取。

HY-Motion 1.0: Scaling Flow Matching Models for Text-To-Motion Generation

HY-Motion 1.0：文本到动作生成的比例流匹配模型

Authors: Yuxin Wen, Qing Shuai, Di Kang, Jing Li, Cheng Wen, Yue Qian, Ningxin Jiao, Changhai Chen, Weijie Chen, Yiran Wang, Jinkun Guo, Dongyue An, Han Liu, Yanyu Tong, Chao Zhang, Qing Guo, Juan Chen, Qiao Zhang, Youyi Zhang, Zihao Yao, Cheng Zhang, Hong Duan, Xiaoping Wu, Qi Chen, Fei Cheng, Liang Dong, Peng He, Hao Zhang, Jiaxin Lin, Chao Zhang, Zhongyi Fan, Yifan Li, Zhichao Hu, Yuhong Liu, Linus, Jie Jiang, Xiaolong Li, Linchao Bao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Arxiv link: https://arxiv.org/abs/2512.23464
Pdf link: https://arxiv.org/pdf/2512.23464
Abstract We present HY-Motion 1.0, a series of state-of-the-art, large-scale, motion generation models capable of generating 3D human motions from textual descriptions. HY-Motion 1.0 represents the first successful attempt to scale up Diffusion Transformer (DiT)-based flow matching models to the billion-parameter scale within the motion generation domain, delivering instruction-following capabilities that significantly outperform current open-source benchmarks. Uniquely, we introduce a comprehensive, full-stage training paradigm -- including large-scale pretraining on over 3,000 hours of motion data, high-quality fine-tuning on 400 hours of curated data, and reinforcement learning from both human feedback and reward models -- to ensure precise alignment with the text instruction and high motion quality. This framework is supported by our meticulous data processing pipeline, which performs rigorous motion cleaning and captioning. Consequently, our model achieves the most extensive coverage, spanning over 200 motion categories across 6 major classes. We release HY-Motion 1.0 to the open-source community to foster future research and accelerate the transition of 3D human motion generation models towards commercial maturity.
中文摘要 我们展示了HY-Motion 1.0，这是一系列最先进的大型运动生成模型，能够根据文本描述生成三维人体运动。HY-Motion 1.0 代表了首次成功将基于扩散变压器（DiT）的流量匹配模型扩展到运动生成领域十亿参数尺度的尝试，提供了远超当前开源基准的指令跟踪能力。我们独特地引入了全面的全阶段训练范式——包括对3000多小时运动数据的大规模预训练、对400小时精选数据的高质量微调，以及从人类反馈和奖励模型中进行强化学习——以确保与文本教学的精确对齐和高质量的动作质量。该框架由我们严谨的数据处理流程支持，执行严格的动态清理和字幕制作。因此，我们的模型覆盖范围最广，涵盖6大类200多个运动类别。我们将HY-Motion 1.0发布给开源社区，以促进未来研究，加速3D人体运动生成模型向商业成熟的转型。

Agentic AI for Autonomous Defense in Software Supply Chain Security: Beyond Provenance to Vulnerability Mitigation

软件供应链安全中自主防御的代理人工智能：超越来源到漏洞缓解

Authors: Toqeer Ali Syed, Mohammad Riyaz Belgaum, Salman Jan, Asadullah Abdullah Khan, Saad Said Alqahtani
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.23480
Pdf link: https://arxiv.org/pdf/2512.23480
Abstract Software supply chain attacks increasingly target trusted development and delivery procedures, so conventional post-build integrity mechanisms are no longer adequate. Available frameworks such as SLSA, SBOM, and in-toto are mainly used to provide provenance and traceability but cannot actively identify and remove vulnerabilities in software production. This paper presents an example of agentic artificial intelligence (AI) for autonomous software supply chain security that combines large language model (LLM)-based reasoning, reinforcement learning (RL), and multi-agent coordination. The proposed system uses specialized security agents coordinated through LangChain and LangGraph, communicates with real CI/CD environments via the Model Context Protocol (MCP), and records all observations and actions in a blockchain security ledger to ensure integrity and auditability. Reinforcement learning is used to obtain adaptive mitigation strategies that balance security effectiveness against operational overhead, and LLMs are used for semantic vulnerability analysis and explainable decisions. The framework is tested on simulated pipelines as well as real-world CI/CD integrations on GitHub Actions and Jenkins, covering injection attacks, insecure deserialization, access control violations, and configuration errors. Experimental outcomes indicate better detection accuracy, shorter mitigation latency, and reasonable build-time overhead compared with rule-based, provenance-only, and RL-only baselines. These results show that agentic AI can facilitate the transition to self-defending, proactive software supply chains rather than reactive verification ones.
中文摘要 软件供应链攻击越来越聚焦于可信的开发和交付流程，因此传统的建后完整性机制已无法再使用。现有的框架如SLSA、SBOM等主要用于提供来源和可追溯性，但缺乏主动识别和消除软件生产漏洞的能力。本论文包含了一个基于自主软件供应链安全的代理人工智能（AI）示例，结合了基于大型语言模型（LLM）的推理、强化学习（RL）和多智能体协调。建议系统利用LangChain和LangGraph协调的专业安全代理，使用模型上下文协议（MCP）与实际CI/CD环境通信，并将所有观察和作记录在区块链安全账本中，以确保完整性和审计。强化学习可用于实现自适应缓解策略，考虑安全效能与运营开销之间的平衡，LLM可用于实现语义脆弱性分析以及可解释的决策。该框架基于模拟流水线以及 GitHub Actions 和 Jenkins 上的实际 CI/CD 集成进行测试，包括注入攻击、不安全的反序列化、访问控制违规和配置错误。实验结果显示，检测准确率更高，缓解延迟更短，构建时间开销合理，优于基于规则、仅来源和仅强化学习的基线。这些结果表明，代理人工智能可以促进向自我防御、主动的软件供应链转型，而非被动验证的供应链。

Hierarchical Decision Mamba Meets Agentic AI: A Novel Approach for RAN Slicing in 6G

分层决策Mamba遇见代理人工智能：6G中RAN切片的创新方法

Authors: Md Arafat Habib, Medhat Elsayed, Majid Bavand, Pedro Enrique Iturria Rivera, Yigit Ozcan, Melike Erol-Kantarci
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2512.23502
Pdf link: https://arxiv.org/pdf/2512.23502
Abstract Radio Access Network (RAN) slicing enables multiple logical networks to exist on top of the same physical infrastructure by allocating resources to distinct service groups, where radio resource scheduling plays a key role in ensuring compliance with slice-specific Service-Level Agreements (SLAs). Existing configuration-based or intent-driven Reinforcement Learning (RL) approaches usually rely on static mappings and SLA conversions. The current literature does not integrate natural language understanding with coordinated decision-making. To address these limitations, we propose an Agentic AI framework for 6G RAN slicing, driven by a super agent built using Hierarchical Decision Mamba (HDM) controllers and a Large Language Model (LLM). The super agent interprets operator intents and translates them into actionable goals using the LLM, which are used by HDM to coordinate inter-slice, intra-slice, and self-healing agents. Compared to transformer-based and reward-driven baselines, the proposed Agentic AI framework demonstrates consistent improvements across key performance indicators, including higher throughput, improved cell-edge performance, and reduced latency across different slices.
中文摘要 无线接入网（RAN）切片通过将资源分配给不同的服务组，使多个逻辑网络能够在同一物理基础设施之上共存，而无线资源调度在确保符合各切片特定的服务级别协议（SLA）方面起着关键作用。现有的基于配置或意图驱动的强化学习（RL）方法通常依赖静态映射和SLA转换。现有文献尚未将自然语言理解与协调决策相结合。为解决这些局限，我们提出了一个用于6G RAN切片的代理人工智能框架，该框架由一个超级智能体驱动，超级智能体使用分层决策Mamba（HDM）控制器和大型语言模型（LLM）构建。超级智能体解释运营商意图，并利用LLM将其转化为可操作的目标，HDM再利用这些目标协调切片间、切片内和自愈智能体。与基于Transformer和奖励驱动的基线相比，所提出的代理人工智能框架在关键性能指标上展现出持续的提升，包括更高的吞吐量、更好的小区边缘性能以及各切片更低的延迟。

PathFound: An Agentic Multimodal Model Activating Evidence-seeking Pathological Diagnosis

PathFound：激活寻求证据的病理诊断的代理多模态模型

Authors: Shengyi Hua, Jianfeng Wu, Tianle Shen, Kangzhe Hu, Zhongzhen Huang, Shujuan Ni, Zhihong Zhang, Yuan Li, Zhe Wang, Xiaofan Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.23545
Pdf link: https://arxiv.org/pdf/2512.23545
Abstract Recent pathological foundation models have substantially advanced visual representation learning and multimodal interaction. However, most models still rely on a static inference paradigm in which whole-slide images are processed once to produce predictions, without reassessment or targeted evidence acquisition under ambiguous diagnoses. This contrasts with clinical diagnostic workflows that refine hypotheses through repeated slide observations and further examination requests. We propose PathFound, an agentic multimodal model designed to support evidence-seeking inference in pathological diagnosis. PathFound integrates the power of pathological visual foundation models, vision-language models, and reasoning models trained with reinforcement learning to perform proactive information acquisition and diagnosis refinement by progressing through the initial diagnosis, evidence-seeking, and final decision stages. Across several large multimodal models, adopting this strategy consistently improves diagnostic accuracy, indicating the effectiveness of evidence-seeking workflows in computational pathology. Among these models, PathFound achieves state-of-the-art diagnostic performance across diverse clinical scenarios and demonstrates strong potential to discover subtle details, such as nuclear features and local invasions.
中文摘要 近期的病理基础模型在视觉表征学习和多模态交互方面取得了显著进步。然而，大多数模型仍依赖静态推理范式，即对全切片图像只处理一次便给出预测，在诊断存在歧义时既不重新评估，也不进行有针对性的证据获取。这与临床诊断工作流程形成对比，后者通过反复观察切片和提出进一步检查请求来细化假设。我们提出了PathFound，一种代理多模态模型，旨在支持病理诊断中的证据寻求式推理。PathFound 整合了病理视觉基础模型、视觉语言模型和通过强化学习训练的推理模型，依次经过初步诊断、证据寻求和最终决策三个阶段，实现主动信息获取和诊断细化。在多个大型多模态模型上，采用这一策略均能持续提升诊断准确性，表明证据寻求工作流程在计算病理学中的有效性。在这些模型中，PathFound 在多种临床场景下实现了最先进的诊断性能，并展现出发现细微细节（如细胞核特征和局部浸润）的强大潜力。

ThinkGen: Generalized Thinking for Visual Generation

ThinkGen：视觉生成的通用思维

Authors: Siyu Jiao, Yiheng Lin, Yujie Zhong, Qi She, Wei Zhou, Xiaohan Lan, Zilong Huang, Fei Yu, Yingchen Yu, Yunqing Zhao, Yao Zhao, Yunchao Wei
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.23568
Pdf link: https://arxiv.org/pdf/2512.23568
Abstract Recent progress in Multimodal Large Language Models (MLLMs) demonstrates that Chain-of-Thought (CoT) reasoning enables systematic solutions to complex understanding tasks. However, its extension to generation tasks remains nascent and limited by scenario-specific mechanisms that hinder generalization and adaptation. In this work, we present ThinkGen, the first think-driven visual generation framework that explicitly leverages MLLM's CoT reasoning in various generation scenarios. ThinkGen employs a decoupled architecture comprising a pretrained MLLM and a Diffusion Transformer (DiT), wherein the MLLM generates tailored instructions based on user intent, and DiT produces high-quality images guided by these instructions. We further propose a separable GRPO-based training paradigm (SepGRPO), alternating reinforcement learning between the MLLM and DiT modules. This flexible design enables joint training across diverse datasets, facilitating effective CoT reasoning for a wide range of generative scenarios. Extensive experiments demonstrate that ThinkGen achieves robust, state-of-the-art performance across multiple generation benchmarks. Code is available: this https URL
中文摘要 多模态大型语言模型（MLLM）的最新进展表明，思维链（CoT）推理能够系统化地解决复杂的理解任务。然而，其向生成任务的扩展仍处于萌芽阶段，并受限于场景特定的机制，阻碍了泛化和适应。在本研究中，我们介绍了ThinkGen，这是首个在各种生成场景中明确利用MLLM的CoT推理的思维驱动视觉生成框架。ThinkGen采用解耦架构，包括预训练的MLLM和扩散Transformer（DiT），其中MLLM根据用户意图生成定制指令，DiT则在这些指令的引导下生成高质量图像。我们还提出了一种可分离的基于GRPO的训练范式（SepGRPO），在MLLM和DiT模块之间交替进行强化学习。这种灵活的设计支持跨多样数据集的联合训练，促进了针对多种生成场景的有效CoT推理。大量实验表明，ThinkGen在多个生成基准测试中实现了稳健且最先进的性能。代码可用：这个 https URL
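
Illustrative note (not from the paper): the abstract above builds on GRPO but does not spell out how SepGRPO computes its updates. As a rough sketch of the group-relative advantage that GRPO-style methods generally rely on, the snippet below normalizes the rewards of several samples drawn for the same prompt. The function name, the `eps` constant, and the choice of population standard deviation are assumptions made for illustration only.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    Hypothetical helper: a generic GRPO-style normalization, not the
    SepGRPO procedure, whose details the abstract does not give.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one value")
    mean = statistics.fmean(rewards)
    # Population standard deviation; some implementations use the sample version.
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled outputs for one prompt, scored by some reward function.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Samples that score above their group mean receive a positive advantage and the rest a negative one, so no separate value network is needed to form a baseline.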

ProGuard: Towards Proactive Multimodal Safeguard

ProGuard：迈向主动多模态保障

Authors: Shaohan Yu, Lijun Li, Chenyang Si, Lu Sheng, Jing Shao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.23573
Pdf link: https://arxiv.org/pdf/2512.23573
Abstract The rapid evolution of generative models has led to a continuous emergence of multimodal safety risks, exposing the limitations of existing defense methods. To address these challenges, we propose ProGuard, a vision-language proactive guard that identifies and describes out-of-distribution (OOD) safety risks without the need for model adjustments required by traditional reactive approaches. We first construct a modality-balanced dataset of 87K samples, each annotated with both binary safety labels and risk categories under a hierarchical multimodal safety taxonomy, effectively mitigating modality bias and ensuring consistent moderation across text, image, and text-image inputs. Based on this dataset, we train our vision-language base model purely through reinforcement learning (RL) to achieve efficient and concise reasoning. To approximate proactive safety scenarios in a controlled setting, we further introduce an OOD safety category inference task and augment the RL objective with a synonym-bank-based similarity reward that encourages the model to generate concise descriptions for unseen unsafe categories. Experimental results show that ProGuard achieves performance comparable to closed-source large models on binary safety classification and substantially outperforms existing open-source guard models on unsafe content categorization. Most notably, ProGuard delivers a strong proactive moderation ability, improving OOD risk detection by 52.6% and OOD risk description by 64.8%.
中文摘要 生成模型的快速发展导致多模态安全风险不断出现，暴露了现有防御方法的局限性。为应对这些挑战，我们提出了ProGuard，一种视觉语言主动防护模型，能够识别并描述分布外（OOD）安全风险，无需传统被动式方法所需的模型调整。我们首先构建了一个由87K个样本组成的模态平衡数据集，每个样本均按照层级化的多模态安全分类体系标注了二元安全标签和风险类别，有效减轻模态偏差，确保对文本、图像及图文输入的审核保持一致。基于该数据集，我们仅通过强化学习（RL）训练视觉语言基础模型，以实现高效且简洁的推理。为了在受控环境中近似主动安全场景，我们进一步引入了OOD安全类别推断任务，并通过基于同义词库的相似性奖励来增强RL目标，鼓励模型为未见过的不安全类别生成简明描述。实验结果显示，ProGuard在二元安全分类方面的性能可与闭源大型模型相当，在不安全内容分类方面远超现有开源防护模型。最显著的是，ProGuard 展现了强大的主动审核能力，将OOD风险检测提升了 52.6%，OOD风险描述提升了 64.8%。

Le Cam Distortion: A Decision-Theoretic Framework for Robust Transfer Learning

Le Cam 失真：稳健迁移学习的决策理论框架

Authors: Deniz Akdemir
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2512.23617
Pdf link: https://arxiv.org/pdf/2512.23617
Abstract Distribution shift is the defining challenge of real-world machine learning. The dominant paradigm--Unsupervised Domain Adaptation (UDA)--enforces feature invariance, aligning source and target representations via symmetric divergence minimization [Ganin et al., 2016]. We demonstrate that this approach is fundamentally flawed: when domains are unequally informative (e.g., high-quality vs degraded sensors), strict invariance necessitates information destruction, causing "negative transfer" that can be catastrophic in safety-critical applications [Wang et al., 2019]. We propose a decision-theoretic framework grounded in Le Cam's theory of statistical experiments [Le Cam, 1986], using constructive approximations to replace symmetric invariance with directional simulability. We introduce Le Cam Distortion, quantified by the Deficiency Distance $\delta(E_1, E_2)$, as a rigorous upper bound for transfer risk conditional on simulability. Our framework enables transfer without source degradation by learning a kernel that simulates the target from the source. Across five experiments (genomics, vision, reinforcement learning), Le Cam Distortion achieves: (1) near-perfect frequency estimation in HLA genomics (correlation $r=0.999$, matching classical methods), (2) zero source utility loss in CIFAR-10 image classification (81.2% accuracy preserved vs 34.7% drop for CycleGAN), and (3) safe policy transfer in RL control where invariance-based methods suffer catastrophic collapse. Le Cam Distortion provides the first principled framework for risk-controlled transfer learning in domains where negative transfer is unacceptable: medical imaging, autonomous systems, and precision medicine.
中文摘要 分布偏移是现实世界机器学习的决定性挑战。主导范式，即无监督域适应（UDA），强制特征不变性，通过对称散度最小化来对齐源域和目标域表示[Ganin等，2016]。我们证明了这种方法存在根本缺陷：当各域信息量不均时（例如高质量传感器与劣化传感器），严格不变性必然要求销毁信息，导致“负迁移”，在安全关键应用中可能造成灾难性后果[Wang 等，2019]。我们提出了一个基于Le Cam统计实验理论[Le Cam, 1986]的决策理论框架，利用构造性近似，以方向性可模拟性取代对称不变性。我们引入Le Cam失真，以亏量距离（Deficiency Distance）$\delta(E_1, E_2)$量化，作为以可模拟性为条件的迁移风险的严格上界。我们的框架通过学习一个从源域模拟目标域的核，实现了不损害源域的迁移。在五项实验（基因组学、视觉、强化学习）中，Le Cam 失真实现了：（1）HLA 基因组学中近乎完美的频率估计（相关性 $r=0.999$，与经典方法相当），（2）CIFAR-10 图像分类中零源域效用损失（保持81.2%的准确率，而 CycleGAN 下降34.7%），以及（3）在基于不变性的方法遭遇灾难性崩溃的强化学习控制中实现安全的策略迁移。Le Cam 失真为负迁移不可接受的领域（医学影像、自主系统和精准医疗）提供了首个有原则的风险可控迁移学习框架。
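
Illustrative note (not from the paper): the abstract names the Deficiency Distance $\delta(E_1, E_2)$ without defining it. The textbook form of Le Cam's deficiency is sketched below for two experiments $E_1 = (P_\theta)_{\theta \in \Theta}$ and $E_2 = (Q_\theta)_{\theta \in \Theta}$ on a common parameter space. The symbols $P_\theta$, $Q_\theta$, $K$, $R_1$, $R_2$ and the $[0,1]$ loss range are assumptions introduced here, and the paper's exact normalization may differ.

```latex
% Deficiency of E_1 with respect to E_2: how well E_2 can be simulated
% from E_1 through some Markov kernel K (a randomization of the data).
\delta(E_1, E_2) \;=\; \inf_{K} \, \sup_{\theta \in \Theta}
    \bigl\| K P_\theta - Q_\theta \bigr\|_{\mathrm{TV}},
\qquad
\| P - Q \|_{\mathrm{TV}} \;=\; \sup_{A} \, \bigl| P(A) - Q(A) \bigr| .

% Consequence for risk, assuming a loss taking values in [0, 1]:
% any decision rule \rho available in E_2 can be matched in E_1,
% up to the deficiency, by composing \rho with a near-optimal kernel K.
R_1(\theta, \rho \circ K) \;\le\; R_2(\theta, \rho) + \delta(E_1, E_2)
\quad \text{for all } \theta \in \Theta .
```

The quantity is directional: $\delta(E_1, E_2)$ can be small while $\delta(E_2, E_1)$ is large, which is the asymmetry the abstract contrasts with symmetric invariance.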

Robo-Dopamine: General Process Reward Modeling for High-Precision Robotic Manipulation

机器人多巴胺：高精度机器人操作的通用过程奖励建模

Authors: Huajie Tan, Sixiang Chen, Yijie Xu, Zixiao Wang, Yuheng Ji, Cheng Chi, Yaoxu Lyu, Zhongxia Zhao, Xiansheng Chen, Peterson Co, Shaoxuan Xie, Guocai Yao, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.23703
Pdf link: https://arxiv.org/pdf/2512.23703
Abstract The primary obstacle for applying reinforcement learning (RL) to real-world robotics is the design of effective reward functions. While recent learning-based Process Reward Models (PRMs) are a promising direction, they are often hindered by two fundamental limitations: their reward models lack step-aware understanding and rely on single-view perception, leading to unreliable assessments of fine-grained manipulation progress; and their reward shaping procedures are theoretically unsound, often inducing a semantic trap that misguides policy optimization. To address these, we introduce Dopamine-Reward, a novel reward modeling method for learning a general-purpose, step-aware process reward model from multi-view inputs. At its core is our General Reward Model (GRM), trained on a vast 3,400+ hour dataset, which leverages Step-wise Reward Discretization for structural understanding and Multi-Perspective Reward Fusion to overcome perceptual limitations. Building upon Dopamine-Reward, we propose Dopamine-RL, a robust policy learning framework that employs a theoretically sound Policy-Invariant Reward Shaping method, which enables the agent to leverage dense rewards for efficient self-improvement without altering the optimal policy, thereby fundamentally avoiding the semantic trap. Extensive experiments across diverse simulated and real-world tasks validate our approach. GRM achieves state-of-the-art accuracy in reward assessment, and Dopamine-RL built on GRM significantly improves policy learning efficiency. For instance, after GRM is adapted to a new task in a one-shot manner from a single expert trajectory, the resulting reward model enables Dopamine-RL to improve the policy from near-zero to 95% success with only 150 online rollouts (approximately 1 hour of real robot interaction), while retaining strong generalization across tasks. Project website: this https URL
中文摘要 将强化学习（RL）应用于现实世界机器人的主要障碍是设计有效的奖励函数。尽管近期基于学习的过程奖励模型（PRM）是一个有前景的方向，但它们常常受到两个根本性限制的制约：其奖励模型缺乏步骤感知的理解，且依赖单一视角感知，导致对细粒度操作进展的评估不可靠；其奖励塑形过程在理论上不严谨，常常引发语义陷阱，误导策略优化。为应对这些问题，我们引入了Dopamine-Reward，这是一种新颖的奖励建模方法，用于从多视角输入中学习通用的、步骤感知的过程奖励模型。其核心是我们的通用奖励模型（GRM），该模型在一个超过3400小时的庞大数据集上训练，利用逐步奖励离散化实现结构理解，并利用多视角奖励融合克服感知限制。基于Dopamine-Reward，我们提出了Dopamine-RL，这是一个稳健的策略学习框架，采用理论上严谨的策略不变奖励塑形方法，使智能体能够利用密集奖励实现高效自我提升，而不改变最优策略，从而从根本上避免语义陷阱。在各种模拟和现实世界任务中进行的大量实验验证了我们的方法。GRM在奖励评估方面达到了最先进的准确性，基于GRM的Dopamine-RL显著提升了策略学习效率。例如，在GRM通过单条专家轨迹以单样本方式适应新任务后，所得的奖励模型使Dopamine-RL仅需150次在线rollout（约1小时的真实机器人交互）即可将策略的成功率从接近零提升到95%，同时保持跨任务的强泛化能力。项目网站：此 https URL
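
Illustrative note (not from the paper): the abstract does not give the form of its Policy-Invariant Reward Shaping method. The classical way to add dense rewards without changing the optimal policy is potential-based shaping, which is what the sketch below implements as a stand-in. The potential function, the scalar "progress" states, and all names are hypothetical.

```python
from typing import Callable, List, Sequence

Potential = Callable[[float], float]


def shaped_reward(reward: float, state: float, next_state: float,
                  potential: Potential, gamma: float, done: bool) -> float:
    """Potential-based shaping: r + gamma * phi(s') - phi(s).

    The potential of a terminal state is taken to be zero.
    """
    next_phi = 0.0 if done else potential(next_state)
    return reward + gamma * next_phi - potential(state)


def shape_trajectory(rewards: Sequence[float], states: Sequence[float],
                     potential: Potential, gamma: float) -> List[float]:
    """Shape every reward of one episode; the last transition is terminal."""
    if len(states) != len(rewards) + 1:
        raise ValueError("need exactly one more state than rewards")
    last = len(rewards) - 1
    return [
        shaped_reward(r, states[t], states[t + 1], potential, gamma, done=(t == last))
        for t, r in enumerate(rewards)
    ]


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma**t * r_t over one episode."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def progress_potential(state: float) -> float:
    """Hypothetical potential: the state is already a progress estimate in [0, 1]."""
    return state


if __name__ == "__main__":
    gamma = 0.9
    states = [0.2, 0.5, 0.9, 1.0]   # made-up progress values along one episode
    rewards = [0.0, 0.0, 1.0]       # sparse task reward
    shaped = shape_trajectory(rewards, states, progress_potential, gamma)
    # The two returns differ only by -phi(s_0), a constant that does not depend on actions.
    print(discounted_return(shaped, gamma) - discounted_return(rewards, gamma))
```

Because the shaping terms telescope, every trajectory from the same start state has its return shifted by the same constant, so the ranking of policies is unchanged while intermediate steps receive a dense signal.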

Training AI Co-Scientists Using Rubric Rewards

使用评分标准奖励训练AI共同科学家

Authors: Shashwat Goel, Rishi Hazra, Dulhan Jayalath, Timon Willi, Parag Jain, William F. Shen, Ilias Leontiadis, Francesco Barbieri, Yoram Bachrach, Jonas Geiping, Chenxi Whitehouse
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2512.23707
Pdf link: https://arxiv.org/pdf/2512.23707
Abstract AI co-scientists are emerging as a tool to assist human researchers in achieving their research goals. A crucial feature of these AI co-scientists is the ability to generate a research plan given a set of aims and constraints. The plan may be used by researchers for brainstorming, or may even be implemented after further refinement. However, language models currently struggle to generate research plans that follow all constraints and implicit requirements. In this work, we study how to leverage the vast corpus of existing research papers to train language models that generate better research plans. We build a scalable, diverse training corpus by automatically extracting research goals and goal-specific grading rubrics from papers across several domains. We then train models for research plan generation via reinforcement learning with self-grading. A frozen copy of the initial policy acts as the grader during training, with the rubrics creating a generator-verifier gap that enables improvements without external human supervision. To validate this approach, we conduct a study with human experts for machine learning research goals, spanning 225 hours. The experts prefer plans generated by our finetuned Qwen3-30B-A3B model over the initial model for 70% of research goals, and approve 84% of the automatically extracted goal-specific grading rubrics. To assess generality, we also extend our approach to research goals from medical papers, and new arXiv preprints, evaluating with a jury of frontier models. Our finetuning yields 12-22% relative improvements and significant cross-domain generalization, proving effective even in problem settings like medical research where execution feedback is infeasible. Together, these findings demonstrate the potential of a scalable, automated training recipe as a step towards improving general AI co-scientists.
中文摘要 人工智能共同科学家正作为辅助人类研究人员实现研究目标的工具而兴起。这些人工智能共同科学家的一个关键能力是在给定一组目标和约束的情况下生成研究计划。该计划可供研究人员用于头脑风暴，甚至可在进一步完善后付诸实施。然而，目前语言模型难以生成符合所有约束和隐性要求的研究计划。在本研究中，我们研究如何利用现有的大量研究论文来训练语言模型，使其生成更好的研究计划。我们通过自动从多个领域的论文中提取研究目标和针对各目标的评分标准，构建了可扩展、多样化的训练语料库。随后，我们通过带自评分的强化学习训练用于研究计划生成的模型。初始策略的冻结副本在训练期间充当评分器，评分标准形成生成器与验证器之间的差距，使模型无需外部人工监督即可改进。为验证这一方法，我们针对机器学习研究目标开展了一项人类专家研究，总计225小时。在70%的研究目标上，专家更偏好我们微调后的Qwen3-30B-A3B模型生成的计划，而非初始模型生成的计划；专家还认可了84%自动提取的目标特定评分标准。为了评估通用性，我们还将该方法扩展到来自医学论文和新arXiv预印本的研究目标，并由前沿模型组成的评审团进行评估。我们的微调带来了12-22%的相对提升和显著的跨领域泛化，即使在医学研究等无法获得执行反馈的问题场景中也有效。这些发现共同展示了可扩展的自动化训练方案作为迈向改进通用人工智能共同科学家的一步所具有的潜力。

Keyword: diffusion policy

There is no result

Keyword: reinforcement learning

Unbiased Visual Reasoning with Controlled Visual Inputs

带有受控视觉输入的无偏视觉推理

Learning Tennis Strategy Through Curriculum-Based Dueling Double Deep Q-Networks

通过基于课程的对决双深度Q网络学习网球策略

Physics-Informed Machine Learning for Transformer Condition Monitoring -- Part I: Basic Concepts, Neural Networks, and Variants

面向变压器状态监测的物理信息机器学习 -- 第一部分：基本概念、神经网络及其变体

Emotion-Inspired Learning Signals (EILS): A Homeostatic Framework for Adaptive Autonomous Agents

情感启发学习信号（EILS）：一种适应性自主智能体的稳态框架

DiRL: An Efficient Post-Training Framework for Diffusion Language Models

DiRL：扩散语言模型的高效后期训练框架

Masking Teacher and Reinforcing Student for Distilling Vision-Language Models

用于蒸馏视觉语言模型的掩蔽教师与强化学生

Agentic Software Issue Resolution with Large Language Models: A Survey

大型语言模型下的代理软件问题解决：一项综述

VideoZoomer: Reinforcement-Learned Temporal Focusing for Long Video Reasoning

VideoZoomer：强化学习的时间聚焦用于长视频推理

SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents

SmartSnap：面向自验证智能体的主动证据寻求

PHANTOM: Physics-Aware Adversarial Attacks against Federated Learning-Coordinated EV Charging Management System

PHANTOM：针对联邦学习协调的电动汽车充电管理系统的物理感知对抗攻击

AFA-LoRA: Enabling Non-Linear Adaptations in LoRA with Activation Function Annealing

AFA-LoRA：通过激活函数退火实现LoRA中的非线性适应

RollArt: Scaling Agentic RL Training via Disaggregated Infrastructure

RollArt：通过解耦式基础设施扩展智能体强化学习训练

FinPercep-RM: A Fine-grained Reward Model and Co-evolutionary Curriculum for RL-based Real-world Super-Resolution

FinPercep-RM：面向基于强化学习的真实世界超分辨率的细粒度奖励模型与协同进化课程

Optimal Regulation of Nonlinear Input-Affine Systems via an Integral Reinforcement Learning-Based State-Dependent Riccati Equation Approach

通过基于积分强化学习的状态依赖 Riccati 方程方法对非线性输入仿射系统的最优调控

Memento-II: Learning by Stateful Reflective Memory

Memento-II：通过有状态反思记忆学习

Cyber Resilience in Next-Generation Networks: Threat Landscape, Theoretical Foundations, and Design Paradigms

下一代网络中的网络韧性：威胁格局、理论基础与设计范式

FoldAct: Efficient and Stable Context Folding for Long-Horizon Search Agents

FoldAct：长视野搜索代理的高效稳定上下文折叠

Parallel Diffusion Solver via Residual Dirichlet Policy Optimization

通过残差狄利克雷策略优化实现并行扩散求解器

ReDiF: Reinforced Distillation for Few Step Diffusion

ReDiF：面向少步扩散的强化蒸馏

TEACH: Temporal Variance-Driven Curriculum for Reinforcement Learning

TEACH：基于时间方差的强化学习课程

MARPO: A Reflective Policy Optimization for Multi Agent Reinforcement Learning

MARPO：多智能体强化学习的反思策略优化

AutoForge: Automated Environment Synthesis for Agentic Reinforcement Learning

AutoForge：面向智能体强化学习的自动化环境合成

Adaptive Trust Consensus for Blockchain IoT: Comparing RL, DRL, and MARL Against Naive, Collusive, Adaptive, Byzantine, and Sleeper Attacks

区块链物联网自适应信任共识：比较RL、DRL和MARL对抗朴素、串通、自适应、拜占庭和潜伏攻击的表现

Reinforcement Networks: novel framework for collaborative Multi-Agent Reinforcement Learning tasks

强化网络：协作多智能体强化学习任务的新框架

SAMP-HDRL: Segmented Allocation with Momentum-Adjusted Utility for Multi-agent Portfolio Management via Hierarchical Deep Reinforcement Learning

SAMP-HDRL：通过层级深度强化学习实现多代理投资组合管理的分段分配与动量调整效用

Sat-EnQ: Satisficing Ensembles of Weak Q-Learners for Reliable and Compute-Efficient Reinforcement Learning

Sat-EnQ：弱Q学习器的满意化集成，实现可靠且计算高效的强化学习

Heterogeneity in Multi-Agent Reinforcement Learning

多智能体强化学习中的异质性

APO: Alpha-Divergence Preference Optimization

APO：阿尔法-散度偏好优化

Diversity or Precision? A Deep Dive into Next Token Prediction

多样性还是精准？深入探讨下一个词元预测

Taming the Tail: Stable LLM Reinforcement Learning via Dynamic Vocabulary Pruning

驯服尾巴：通过动态词汇修剪实现的稳定大型语言模型强化学习

Benchmark Success, Clinical Failure: When Reinforcement Learning Optimizes for Benchmarks, Not Patients

基准成功与临床失败：当强化学习优化的是基准，而非患者

A Note on Hybrid Online Reinforcement and Imitation Learning for LLMs: Formulations and Algorithms

关于大型语言模型混合在线强化与模仿学习的说明：表述与算法

Evaluating Parameter Efficient Methods for RLVR

评估RLVR参数效率方法

A Human-Oriented Cooperative Driving Approach: Integrating Driving Intention, State, and Conflict

以人为本的合作驾驶方法：整合驾驶意图、状态与冲突

ViLaCD-R1: A Vision-Language Framework for Semantic Change Detection in Remote Sensing

ViLaCD-R1：一种用于遥感语义变化检测的视觉语言框架

Agentic AI-Enhanced Semantic Communications: Foundations, Architecture, and Applications

代理人工智能增强语义通信：基础、架构与应用

Splitwise: Collaborative Edge-Cloud Inference for LLMs via Lyapunov-Assisted DRL

Splitwise：通过Lyapunov辅助的DRL实现面向LLM的协作式边云推理

CME-CAD: Heterogeneous Collaborative Multi-Expert Reinforcement Learning for CAD Code Generation

CME-CAD：异构协作多专家强化学习用于CAD代码生成

AGRO-SQL: Agentic Group-Relative Optimization with High-Fidelity Data Synthesis

AGRO-SQL：结合高保真数据合成的智能体式组相对优化

The World Is Bigger! A Computationally-Embedded Perspective on the Big World Hypothesis

世界更大！从计算嵌入视角看大世界假说