Arxiv Papers of Today

Generated: 2025-12-26 16:30:45 (UTC+8); arXiv release time: 2025-12-25 20:00 EST (2025-12-26 09:00 UTC+8)

20 relevant papers today

Keyword: reinforcement learning

BitRL-Light: 1-bit LLM Agents with Deep Reinforcement Learning for Energy-Efficient Smart Home Lighting Optimization

Authors: Ravi Gupta, Shabista Haider
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.20623
Pdf link: https://arxiv.org/pdf/2512.20623
Abstract Smart home lighting systems consume 15-20% of residential energy but lack adaptive intelligence to optimize for user comfort and energy efficiency simultaneously. We present BitRL-Light, a novel framework combining 1-bit quantized Large Language Models (LLMs) with Deep Q-Network (DQN) reinforcement learning for real-time smart home lighting control on edge devices. Our approach deploys a 1-bit quantized Llama-3.2-1B model on Raspberry Pi hardware, achieving 71.4 times energy reduction compared to full-precision models while maintaining intelligent control capabilities. Through multi-objective reinforcement learning, BitRL-Light learns optimal lighting policies from user feedback, balancing energy consumption, comfort, and circadian alignment. Experimental results demonstrate 32% energy savings compared to rule-based systems, with inference latency under 200ms on Raspberry Pi 4 and 95% user satisfaction. The system processes natural language commands via Google Home/IFTTT integration and learns from implicit feedback through manual overrides. Our comparative analysis shows 1-bit models achieve 5.07 times speedup over 2-bit alternatives on ARM processors while maintaining 92% task accuracy. This work establishes a practical framework for deploying adaptive AI on resource-constrained IoT devices, enabling intelligent home automation without cloud dependencies.
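
The trade-off between energy consumption, comfort, and circadian alignment mentioned in the abstract could be expressed as a single scalarized reward feeding a one-step Q-learning target. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names `lighting_reward` and `td_target`, the weights, and the normalisation of every term to [0, 1] are all hypothetical.

```python
def lighting_reward(energy_cost, comfort, circadian, weights=(0.4, 0.4, 0.2)):
    """Scalarize three objectives into one reward.

    All inputs are assumed to be normalised to [0, 1]. The weights are
    hypothetical and are not taken from the paper.
    """
    w_energy, w_comfort, w_circadian = weights
    # Energy is a cost, so it enters with a negative sign.
    return -w_energy * energy_cost + w_comfort * comfort + w_circadian * circadian


def td_target(reward, next_q_values, gamma=0.99, done=False):
    """One-step Q-learning target: r + gamma * max_a' Q(s', a')."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


# Example: moderate energy use, full comfort, partial circadian alignment.
r = lighting_reward(energy_cost=0.5, comfort=1.0, circadian=0.5)
target = td_target(r, next_q_values=[0.0, 1.0, 0.5])
```

A weighted sum is only one way to combine objectives; it keeps the learner a standard DQN, at the cost of fixing the trade-off in advance.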

Quantum-Inspired Multi Agent Reinforcement Learning for Exploration Exploitation Optimization in UAV-Assisted 6G Network Deployment

Authors: Mazyar Taghavi, Javad Vahidi
Subjects: Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2512.20624
Pdf link: https://arxiv.org/pdf/2512.20624
Abstract This study introduces a quantum-inspired framework for optimizing the exploration-exploitation tradeoff in multi-agent reinforcement learning (MARL), applied to UAV-assisted 6G network deployment. We consider a cooperative scenario where ten intelligent UAVs autonomously coordinate to maximize signal coverage and support efficient network expansion under partial observability and dynamic conditions. The proposed approach integrates classical MARL algorithms with quantum-inspired optimization techniques, leveraging variational quantum circuits (VQCs) as the core structure and employing the Quantum Approximate Optimization Algorithm (QAOA) as a representative VQC-based method for combinatorial optimization. Complementary probabilistic modeling is incorporated through Bayesian inference, Gaussian processes, and variational inference to capture latent environmental dynamics. A centralized training with decentralized execution (CTDE) paradigm is adopted, where shared memory and local view grids enhance local observability among agents. Comprehensive experiments, including scalability tests, sensitivity analysis, and comparisons with PPO and DDPG baselines, demonstrate that the proposed framework improves sample efficiency, accelerates convergence, and enhances coverage performance while maintaining robustness. Radar chart and convergence analyses further show that QI MARL achieves a superior balance between exploration and exploitation compared to classical methods. All implementation code and supplementary materials are publicly available on GitHub to ensure reproducibility.

Mechanism-Based Intelligence (MBI): Differentiable Incentives for Rational Coordination and Guaranteed Alignment in Multi-Agent Systems

Authors: Stefano Grassi
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2512.20688
Pdf link: https://arxiv.org/pdf/2512.20688
Abstract Autonomous multi-agent systems are fundamentally fragile: they struggle to solve the Hayekian Information problem (eliciting dispersed private knowledge) and the Hurwiczian Incentive problem (aligning local actions with global objectives), making coordination computationally intractable. I introduce Mechanism-Based Intelligence (MBI), a paradigm that reconceptualizes intelligence as emergent from the coordination of multiple "brains", rather than a single one. At its core, the Differentiable Price Mechanism (DPM) computes the exact loss gradient $$ \mathbf{G}_i = - \frac{\partial \mathcal{L}}{\partial \mathbf{x}_i} $$ as a dynamic, VCG-equivalent incentive signal, guaranteeing Dominant Strategy Incentive Compatibility (DSIC) and convergence to the global optimum. A Bayesian extension ensures incentive compatibility under asymmetric information (BIC). The framework scales linearly ($\mathcal{O}(N)$) with the number of agents, bypassing the combinatorial complexity of Dec-POMDPs and is empirically 50x faster than Model-Free Reinforcement Learning. By structurally aligning agent self-interest with collective objectives, it provides a provably efficient, auditable and generalizable approach to coordinated, trustworthy and scalable multi-agent intelligence grounded in economic principles.
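
To make the signal $\mathbf{G}_i = -\partial \mathcal{L} / \partial \mathbf{x}_i$ concrete, here is a toy sketch in which each agent adjusts a scalar action using only its own negative loss gradient. It is an illustration under assumptions, not the paper's Differentiable Price Mechanism: the quadratic shared loss, the target value, the step size, and the function names are all hypothetical, and the toy says nothing about the incentive-compatibility guarantees claimed in the abstract.

```python
def loss(xs, target=10.0):
    """Shared quadratic loss: squared gap between the agents' total and a target."""
    return 0.5 * (sum(xs) - target) ** 2


def incentive_signals(xs, target=10.0):
    """G_i = -dL/dx_i for the quadratic loss above.

    For this particular loss the gradient is the same for every agent.
    """
    gap = sum(xs) - target
    return [-gap for _ in xs]


def run(xs, steps=200, lr=0.1):
    """Each agent repeatedly moves its own action along its own signal."""
    xs = list(xs)
    for _ in range(steps):
        signals = incentive_signals(xs)
        xs = [x + lr * g for x, g in zip(xs, signals)]
    return xs


final = run([0.0, 1.0, 2.0])
```

With three agents and a step size of 0.1, the gap to the target shrinks by a factor of 0.7 per step, so the shared loss goes to zero. Computing one gradient per agent is what gives the per-step cost linear in the number of agents.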

AI-Driven Green Cognitive Radio Networks for Sustainable 6G Communication

Authors: Anshul Sharma, Shujaatali Badami, Biky Chouhan, Pushpanjali Pandey, Brijeena Rana, Navneet Kaur
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.20739
Pdf link: https://arxiv.org/pdf/2512.20739
Abstract 6G wireless is expected to deliver Tb/s peak data rates, sub-millisecond latency, and massive Internet of Things/vehicle connectivity, which requires sustainable access to audio over the air and energy-saving functionality. Cognitive Radio Networks (CRNs) help alleviate spectrum scarcity, but classical sensing and allocation remain energy-intensive and sensitive to rapid spectrum variations. Our AI-driven green CRN framework integrates deep reinforcement learning (DRL) with transfer learning, energy harvesting (EH), reconfigurable intelligent surfaces (RIS), and lightweight genetic refinement operations that jointly optimize sensing timelines, transmit power, bandwidth distribution, and RIS phase selection. In MATLAB + NS-3 simulations under dense loads, compared with two baselines (a traditional CRN with energy sensing under fixed policies, and a hybrid CRN with cooperative sensing under heuristic resource distribution), the framework uses 25-30% less energy, achieves a sensing AUC greater than 0.90, and raises PDR by 6-13 percentage points. The integrated framework scales to large IoT and vehicular applications and provides a feasible and sustainable roadmap to 6G CRNs. Index Terms: Cognitive Radio Networks (CRNs), 6G, Green Communication, Energy Efficiency, Deep Reinforcement Learning (DRL), Spectrum Sensing, RIS, Energy Harvesting, QoS, IoT.

AgentMath: Empowering Mathematical Reasoning for Large Language Models via Tool-Augmented Agent

Authors: Haipeng Luo, Huawen Feng, Qingfeng Sun, Can Xu, Kai Zheng, Yufei Wang, Tao Yang, Han Hu, Yansong Tang, Di Wang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.20745
Pdf link: https://arxiv.org/pdf/2512.20745
Abstract Large Reasoning Models (LRMs) like o3 and DeepSeek-R1 have achieved remarkable progress in natural language reasoning with long chain-of-thought. However, they remain computationally inefficient and struggle with accuracy when solving problems requiring complex mathematical operations. In this work, we present AgentMath, an agent framework that seamlessly integrates language models' reasoning capabilities with code interpreters' computational precision to efficiently tackle complex mathematical problems. Our approach introduces three key innovations: (1) An automated method that converts natural language chain-of-thought into structured tool-augmented trajectories, generating high-quality supervised fine-tuning (SFT) data to alleviate data scarcity; (2) A novel agentic reinforcement learning (RL) paradigm that dynamically interleaves natural language generation with real-time code execution. This enables models to autonomously learn optimal tool-use strategies through multi-round interactive feedback, while fostering emergent capabilities in code refinement and error correction; (3) An efficient training system incorporating innovative techniques, including request-level asynchronous rollout scheduling, agentic partial rollout, and prefix-aware weighted load balancing, achieving 4-5x speedup and making efficient RL training feasible on ultra-long sequences with scenarios with massive tool [...]. [...] evaluations show that AgentMath achieves state-of-the-art performance on challenging mathematical competition benchmarks including AIME24, AIME25, and HMMT25. Specifically, AgentMath-30B-A3B attains 90.6%, 86.4%, and 73.8% accuracy respectively, achieving advanced [...]. [...] results validate the effectiveness of our approach and pave the way for building more sophisticated and scalable mathematical reasoning agents.

Generalization of RLVR Using Causal Reasoning as a Testbed

Authors: Brian Lu, Hongyu Zhao, Shuo Sun, Hao Peng, Rui Ding, Hongyuan Mei
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.20760
Pdf link: https://arxiv.org/pdf/2512.20760
Abstract Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for post-training large language models (LLMs) on complex reasoning tasks. Yet, the conditions under which RLVR yields robust generalization remain poorly understood. This paper provides an empirical study of RLVR generalization in the setting of probabilistic inference over causal graphical models. This setting offers two natural axes along which to examine generalization: (i) the level of the probabilistic query -- associational, interventional, or counterfactual -- and (ii) the structural complexity of the query, measured by the size of its relevant subgraph. We construct datasets of causal graphs and queries spanning these difficulty axes and fine-tune Qwen-2.5-Instruct models using RLVR or supervised fine-tuning (SFT). We vary both the model scale (3B-32B) and the query level included in training. We find that RLVR yields stronger within-level and across-level generalization than SFT, but only for specific combinations of model size and training query level. Further analysis shows that RLVR's effectiveness depends on the model's initial reasoning competence. With sufficient initial competence, RLVR improves an LLM's marginalization strategy and reduces errors in intermediate probability calculations, producing substantial accuracy gains, particularly on more complex queries. These findings show that RLVR can improve specific causal reasoning subskills, with its benefits emerging only when the model has sufficient initial competence.
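
The difference between an associational and an interventional query can be seen on a three-node graph with a confounder (Z -> X, Z -> Y, X -> Y). The sketch below is a worked example with made-up probability tables, not data from the paper; the graph, the numbers, and the function names are assumptions. Counterfactual queries are not covered.

```python
# Hypothetical conditional probability tables for binary Z, X, Y.
P_Z = {0: 0.5, 1: 0.5}
P_X1_GIVEN_Z = {0: 0.2, 1: 0.8}
P_Y1_GIVEN_XZ = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 0.9}


def p_x_given_z(x, z):
    """P(X=x | Z=z) for binary X."""
    p1 = P_X1_GIVEN_Z[z]
    return p1 if x == 1 else 1.0 - p1


def associational(x):
    """P(Y=1 | X=x): marginalise Z under the posterior P(Z | X=x)."""
    joint = {z: P_Z[z] * p_x_given_z(x, z) for z in P_Z}
    p_x = sum(joint.values())
    return sum(P_Y1_GIVEN_XZ[(x, z)] * joint[z] / p_x for z in P_Z)


def interventional(x):
    """P(Y=1 | do(X=x)): back-door adjustment, marginalise Z under its prior."""
    return sum(P_Y1_GIVEN_XZ[(x, z)] * P_Z[z] for z in P_Z)


observed = associational(1)      # conditioning on X=1 shifts belief about Z
intervened = interventional(1)   # setting X=1 leaves the distribution of Z unchanged
```

With these tables, observing X=1 gives 0.82 while intervening gives 0.70; the gap is the confounding through Z. Choosing which distribution over Z to marginalise with is the kind of intermediate step the abstract refers to as the marginalization strategy.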

Safety Alignment of LMs via Non-cooperative Games

Authors: Anselm Paulus, Ilia Kulikov, Brandon Amos, Rémi Munos, Ivan Evtimov, Kamalika Chaudhuri, Arman Zharmagambetov
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.20806
Pdf link: https://arxiv.org/pdf/2512.20806
Abstract Ensuring the safety of language models (LMs) while maintaining their usefulness remains a critical challenge in AI alignment. Current approaches rely on sequential adversarial training: generating adversarial prompts and fine-tuning LMs to defend against them. We introduce a different paradigm: framing safety alignment as a non-zero-sum game between an Attacker LM and a Defender LM trained jointly via online reinforcement learning. Each LM continuously adapts to the other's evolving strategies, driving iterative improvement. Our method uses a preference-based reward signal derived from pairwise comparisons instead of point-wise scores, providing more robust supervision and potentially reducing reward hacking. Our RL recipe, AdvGame, shifts the Pareto frontier of safety and utility, yielding a Defender LM that is simultaneously more helpful and more resilient to adversarial attacks. In addition, the resulting Attacker LM converges into a strong, general-purpose red-teaming agent that can be directly deployed to probe arbitrary target models.
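
One simple way to turn pairwise comparisons into a per-response reward is to score each response by its win rate against the others in a sampled group. This is a hedged sketch of that general idea only; the abstract does not specify how AdvGame aggregates comparisons, and the `prefer` judge, the function name, and the toy data are hypothetical.

```python
from itertools import combinations


def pairwise_win_rates(responses, prefer):
    """Reward each response by its fraction of wins over all pairwise comparisons.

    `prefer(a, b)` is a hypothetical judge returning True when `a` is
    preferred to `b`; no absolute (point-wise) score is ever requested.
    """
    wins = [0] * len(responses)
    for i, j in combinations(range(len(responses)), 2):
        if prefer(responses[i], responses[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    opponents = max(len(responses) - 1, 1)
    return [w / opponents for w in wins]


# Toy judge that prefers the longer response, used only to exercise the code.
rewards = pairwise_win_rates(
    ["ok", "a fuller answer", "short"],
    lambda a, b: len(a) > len(b),
)
```

Because the judge only ranks two candidates at a time, the resulting reward is relative to the sampled group rather than tied to an absolute scale.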

Context-Sensitive Abstractions for Reinforcement Learning with Parameterized Actions

Authors: Rashmeet Kaur Nayyar, Naman Shah, Siddharth Srivastava
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.20831
Pdf link: https://arxiv.org/pdf/2512.20831
Abstract Real-world sequential decision-making often involves parameterized action spaces that require both decisions regarding discrete actions and decisions about continuous action parameters governing how an action is executed. Existing approaches exhibit severe limitations in this setting: planning methods demand hand-crafted action models, standard reinforcement learning (RL) algorithms are designed for either discrete or continuous actions but not both, and the few RL methods that handle parameterized actions typically rely on domain-specific engineering and fail to exploit the latent structure of these spaces. This paper extends the scope of RL algorithms to long-horizon, sparse-reward settings with parameterized actions by enabling agents to autonomously learn both state and action abstractions online. We introduce algorithms that progressively refine these abstractions during learning, increasing fine-grained detail in the critical regions of the state-action space where greater resolution improves performance. Across several continuous-state, parameterized-action domains, our abstraction-driven approach enables TD($\lambda$) to achieve markedly higher sample efficiency than state-of-the-art baselines.
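
For readers unfamiliar with TD($\lambda$), the learner the abstract builds on, the following is a minimal tabular sketch with accumulating eligibility traces. It shows plain value prediction on a toy chain and does not include the paper's learned state or action abstractions; the function name, hyperparameters, and the chain environment are assumptions.

```python
def td_lambda(episodes, n_states, alpha=0.1, gamma=1.0, lam=0.8):
    """Tabular TD(lambda) value prediction with accumulating eligibility traces.

    `episodes` is a list of trajectories; each trajectory is a list of
    (state, reward, next_state) tuples, with next_state=None at termination.
    """
    values = [0.0] * n_states
    for episode in episodes:
        traces = [0.0] * n_states  # traces are reset at the start of each episode
        for state, reward, next_state in episode:
            next_value = 0.0 if next_state is None else values[next_state]
            td_error = reward + gamma * next_value - values[state]
            traces[state] += 1.0
            for s in range(n_states):
                # Every recently visited state shares in the current TD error.
                values[s] += alpha * td_error * traces[s]
                traces[s] *= gamma * lam
    return values


# Three-state chain 0 -> 1 -> 2 -> terminal, reward 1 on the final transition only.
chain = [(0, 0.0, 1), (1, 0.0, 2), (2, 1.0, None)]
values = td_lambda([chain] * 500, n_states=3)
```

The traces let a single sparse reward at the end of the chain update earlier states in the same episode, which is why TD($\lambda$) is a natural fit for the sparse-reward settings the abstract targets.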

NVIDIA Nemotron 3: Efficient and Open Intelligence

Authors: NVIDIA: Aaron Blakeman, Aaron Grattafiori, Aarti Basant, Abhibha Gupta, Abhinav Khattar, Adi Renduchintala, Aditya Vavre, Akanksha Shukla, Akhiad Bercovich, Aleksander Ficek, Aleksandr Shaposhnikov, Alex Kondratenko, Alexander Bukharin, Alexandre Milesi, Ali Taghibakhshi, Alisa Liu, Amelia Barton, Ameya Sunil Mahabaleshwarkar, Amir Klein, Amit Zuker, Amnon Geifman, Amy Shen, Anahita Bhiwandiwalla, Andrew Tao, Anjulie Agrusa, Ankur Verma, Ann Guan, Anubhav Mandarwal, Arham Mehta, Ashwath Aithal, Ashwin Poojary, Asif Ahamed, Asit Mishra, Asma Kuriparambil Thekkumpate, Ayush Dattagupta, Banghua Zhu, Bardiya Sadeghi, Barnaby Simkin, Ben Lanir, Benedikt Schifferer, Besmira Nushi, Bilal Kartal, Bita Darvish Rouhani, Boris Ginsburg, Brandon Norick, Brandon Soubasis, Branislav Kisacanin, Brian Yu, Bryan Catanzaro, Carlo del Mundo, Chantal Hwang, Charles Wang, Cheng-Ping Hsieh, Chenghao Zhang, Chenhan Yu, Chetan Mungekar, Chintan Patel, Chris Alexiuk, Christopher Parisien, Collin Neale, Cyril Meurillon, Damon Mosk-Aoyama, Dan Su, Dane Corneil, Daniel Afrimi, Daniel Lo, Daniel Rohrer, Daniel Serebrenik, Daria Gitman, Daria Levy, Darko Stosic, David Mosallanezhad, Deepak Narayanan, Dhruv Nathawani, Dima Rekesh, Dina Yared, Divyanshu Kakwani, Dong Ahn, Duncan Riach, Dusan Stosic, Edgar Minasyan, Edward Lin, Eileen Long, Eileen Peters Long, Elad Segal, Elena Lantz, Ellie Evans, Elliott Ning, Eric Chung, Eric Harper, Eric Tramel, Erick Galinkin, Erik Pounds, Evan Briones, Evelina Bakhturina, Evgeny Tsykunov, Faisal Ladhak, Fay Wang, Fei Jia
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.20856
Pdf link: https://arxiv.org/pdf/2512.20856
Abstract We introduce the Nemotron 3 family of models: Nano, Super, and Ultra. These models deliver strong agentic, reasoning, and conversational capabilities. The Nemotron 3 family uses a Mixture-of-Experts hybrid Mamba-Transformer architecture to provide best-in-class throughput and context lengths of up to 1M tokens. Super and Ultra models are trained with NVFP4 and incorporate LatentMoE, a novel approach that improves model quality. The two larger models also include MTP layers for faster text generation. All Nemotron 3 models are post-trained using multi-environment reinforcement learning, enabling reasoning, multi-step tool use, and granular reasoning budget control. Nano, the smallest model, outperforms comparable models in accuracy while remaining extremely cost-efficient for inference. Super is optimized for collaborative agents and high-volume workloads such as IT ticket automation. Ultra, the largest model, provides state-of-the-art accuracy and reasoning performance. Nano is released together with its technical report and this white paper, while Super and Ultra will follow in the coming months. We will openly release the model weights, pre- and post-training software, recipes, and all data for which we hold redistribution rights.

The Silent Scholar Problem: A Probabilistic Framework for Breaking Epistemic Asymmetry in LLM Agents

Authors: Zan-Kai Chong, Hiroyuki Ohsaki, Bryan Ng
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.20884
Pdf link: https://arxiv.org/pdf/2512.20884
Abstract Autonomous agents powered by LLMs and Retrieval-Augmented Generation (RAG) are proficient consumers of digital content but remain unidirectional, a limitation we term epistemic asymmetry. This isolation leads to redundant reasoning and stagnates collective intelligence. Current self-reflection frameworks remain largely heuristic and private, lacking a probabilistic foundation to quantify certainty or justify external [...]. To bridge this gap, we propose a formal probabilistic framework that provides agents with a non-altruistic motive for bidirectional knowledge exchange. We model an agent's belief in a proposition using a Beta-Bernoulli distribution with a forgetting factor ($\gamma$). This allows us to isolate epistemic uncertainty as the variance of belief, establishing a dual drive for interaction: (i) a homeostatic motive, the need to maintain certainty against the temporal decay introduced by $\gamma$; and (ii) an optimal learning strategy, targeting points of maximum ambiguity ($\mathbb{E}[\theta]=0.5$) to maximize information gain. Under this framework, public contribution is reframed as optimal active learning: sharing solutions to elicit feedback is the most efficient method for an agent to reduce its own uncertainty. To ensure scalability, we introduce epistemic caching, which leverages the forgetting factor to dynamically prioritize resources for the active head of non-stationary knowledge distributions. Finally, we demonstrate how these accumulated belief states serve as verifiable reward signals for Reinforcement Learning from Human Feedback (RLHF) and high-quality data filters for Supervised Fine-Tuning (SFT). Simulation results validate that this uncertainty-driven strategy significantly outperforms random baselines in heterogeneous (Zipfian) environments, maintaining high adaptability to concept drift.
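
A Beta-Bernoulli belief with a forgetting factor can be sketched in a few lines. The decay rule used here (shrinking the pseudo-counts towards the prior before each update) is one common discounting scheme and is an assumption; the abstract does not state the exact update. The class name, the uniform prior, and the value of `gamma` are likewise hypothetical.

```python
class DecayingBelief:
    """Beta(alpha, beta) belief about a proposition, with exponential forgetting."""

    def __init__(self, gamma=0.9, prior=(1.0, 1.0)):
        self.gamma = gamma
        self.prior = prior
        self.alpha, self.beta = prior

    def decay(self):
        """Shrink accumulated evidence towards the prior by a factor gamma."""
        a0, b0 = self.prior
        self.alpha = a0 + self.gamma * (self.alpha - a0)
        self.beta = b0 + self.gamma * (self.beta - b0)

    def update(self, success):
        """Forget a little, then add one Bernoulli observation."""
        self.decay()
        if success:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    def mean(self):
        return self.alpha / (self.alpha + self.beta)

    def variance(self):
        """Variance of the Beta distribution, used as epistemic uncertainty."""
        n = self.alpha + self.beta
        return self.alpha * self.beta / (n * n * (n + 1.0))


belief = DecayingBelief()
for _ in range(20):
    belief.update(True)          # twenty confirmations in a row
confident_variance = belief.variance()

for _ in range(50):
    belief.decay()               # no feedback: certainty erodes
stale_variance = belief.variance()
stale_mean = belief.mean()
```

Without fresh feedback the belief drifts back towards a mean of 0.5 and its variance rises, which is the temporal decay the homeostatic motive is meant to counteract.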

Embodied AI-Enhanced IoMT Edge Computing: UAV Trajectory Optimization and Task Offloading with Mobility Prediction

Authors: Siqi Mu, Shuo Wen, Yang Lu, Ruihong Jiang, Bo Ai
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.20902
Pdf link: https://arxiv.org/pdf/2512.20902
Abstract Due to their inherent flexibility and autonomous operation, unmanned aerial vehicles (UAVs) have been widely used in Internet of Medical Things (IoMT) to provide real-time biomedical edge computing service for wireless body area network (WBAN) users. In this paper, considering the time-varying task criticality characteristics of diverse WBAN users and the dual mobility between WBAN users and UAV, we investigate the dynamic task offloading and UAV flight trajectory optimization problem to minimize the weighted average task completion time of all the WBAN users, under the constraint of UAV energy consumption. To tackle the problem, an embodied AI-enhanced IoMT edge computing framework is established. Specifically, we propose a novel hierarchical multi-scale Transformer-based user trajectory prediction model based on the users' historical trajectory traces captured by the embodied AI agent (i.e., UAV). Afterwards, a prediction-enhanced deep reinforcement learning (DRL) algorithm that integrates predicted users' mobility information is designed for intelligently optimizing UAV flight trajectory and task offloading decisions. Real-world movement traces and simulation results demonstrate the superiority of the proposed methods in comparison with the existing benchmarks.

One Tool Is Enough: Reinforcement Learning for Repository-Level LLM Agents

Authors: Zhaoxi Zhang, Yitong Duan, Yanzhi Zhang, Yiming Xu, Jiyan He, Yunfang Wu
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.20957
Pdf link: https://arxiv.org/pdf/2512.20957
Abstract Locating the files and functions requiring modification in large open-source software (OSS) repositories is challenging due to their scale and structural complexity. Existing large language model (LLM)-based methods typically treat this as a repository-level retrieval task and rely on multiple auxiliary tools, which overlook code execution logic and complicate model control. We propose RepoNavigator, an LLM agent equipped with a single execution-aware tool: jumping to the definition of an invoked symbol. This unified design reflects the actual flow of code execution while simplifying tool manipulation. RepoNavigator is trained end-to-end via Reinforcement Learning (RL) directly from a pretrained model, without any closed-source distillation. Experiments demonstrate that RL-trained RepoNavigator achieves state-of-the-art performance, with the 7B model outperforming 14B baselines, the 14B model surpassing 32B competitors, and even the 32B model exceeding closed-source models such as Claude-3.7. These results confirm that integrating a single, structurally grounded tool with RL training provides an efficient and scalable solution for repository-level issue localization.
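
A jump-to-definition tool can be approximated for a single Python file with the standard `ast` module. This is a minimal sketch of the idea only, not RepoNavigator's tool: it matches definitions by name within one source string, whereas a repository-level tool would need to resolve imports and scopes across files. The function name `jump_to_definition` and the sample source are hypothetical.

```python
import ast


def jump_to_definition(source, symbol):
    """Return (kind, line_number) of the first def/class named `symbol`, or None."""
    tree = ast.parse(source)
    definition_types = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    for node in ast.walk(tree):
        if isinstance(node, definition_types) and node.name == symbol:
            kind = "class" if isinstance(node, ast.ClassDef) else "function"
            return kind, node.lineno
    return None


# The sample source starts with a blank line, so `helper` is defined on line 2.
SOURCE = '''
def helper(x):
    return x + 1

def main():
    return helper(41)
'''

result = jump_to_definition(SOURCE, "helper")
missing = jump_to_definition(SOURCE, "not_defined_here")
```

An agent following a call such as `helper(41)` inside `main` would call the tool with `"helper"` and receive the location to read next, which mirrors how execution itself moves from call site to definition.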

ReACT-Drug: Reaction-Template Guided Reinforcement Learning for de novo Drug Design

Authors: R Yadunandan, Nimisha Ghosh
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.20958
Pdf link: https://arxiv.org/pdf/2512.20958
Abstract De novo drug design is a crucial component of modern drug development, yet navigating the vast chemical space to find synthetically accessible, high-affinity candidates remains a significant challenge. Reinforcement Learning (RL) enhances this process by enabling multi-objective optimization and exploration of novel chemical space, capabilities that traditional supervised learning methods lack. In this work, we introduce ReACT-Drug, a fully integrated, target-agnostic molecular design framework based on Reinforcement Learning. Unlike models requiring target-specific fine-tuning, ReACT-Drug utilizes a generalist approach by leveraging ESM-2 protein embeddings to identify similar proteins for a given target from a knowledge base such as the Protein Data Bank (PDB). Thereafter, the known drug ligands corresponding to such proteins are decomposed to initialize a fragment-based search space, biasing the agent towards biologically relevant subspaces. For each such fragment, the pipeline employs a Proximal Policy Optimization (PPO) agent guiding a ChemBERTa-encoded molecule through a dynamic action space of chemically valid, reaction-template-based transformations. This results in the generation of de novo drug candidates with competitive binding affinities and high synthetic accessibility, while ensuring 100% chemical validity and novelty as per MOSES benchmarking. This architecture highlights the potential of integrating structural biology, deep representation learning, and chemical synthesis rules to automate and accelerate rational drug design. The dataset and code are available at this https URL.

Generalised Linear Models in Deep Bayesian RL with Learnable Basis Functions

Authors: Jingyang You, Hanna Kurniawati
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.20974
Pdf link: https://arxiv.org/pdf/2512.20974
Abstract Bayesian Reinforcement Learning (BRL) provides a framework for generalisation of Reinforcement Learning (RL) problems from its use of Bayesian task parameters in the transition and reward models. However, classical BRL methods assume known forms of transition and reward models, reducing their applicability in real-world problems. As a result, recent deep BRL methods have started to incorporate model learning, though the use of neural networks directly on the joint data and task parameters requires optimising the Evidence Lower Bound (ELBO). ELBOs are difficult to optimise and may result in indistinctive task parameters, hence compromised BRL policies. To this end, we introduce a novel deep BRL method, Generalised Linear Models in Deep Bayesian RL with Learnable Basis Functions (GLiBRL), that enables efficient and accurate learning of transition and reward models, with fully tractable marginal likelihood and Bayesian inference on task parameters and model noises. On challenging MetaWorld ML10/45 benchmarks, GLiBRL improves the success rate of one of the state-of-the-art deep BRL methods, VariBAD, by up to 2.7x. Comparing against representative or recent deep BRL / Meta-RL methods, such as MAML, RL2, SDVT, TrMRL and ECET, GLiBRL also demonstrates its low-variance and decent performance consistently.
中文摘要 贝叶斯强化学习（BRL）通过在转移模型和奖励模型中使用贝叶斯任务参数，为强化学习（RL）问题的泛化提供了一个框架。然而，经典BRL方法假设转移模型和奖励模型的形式已知，降低了其在现实问题中的适用性。因此，近期的深度BRL方法开始纳入模型学习，但直接对联合的数据和任务参数使用神经网络需要优化证据下界（ELBO）。ELBO难以优化，并可能导致任务参数缺乏区分度，从而损害BRL策略。为此，我们提出一种新的深度BRL方法，即具有可学习基函数的深度贝叶斯强化学习广义线性模型（GLiBRL），它能够高效且准确地学习转移模型和奖励模型，并具有完全可处理的边际似然，以及对任务参数和模型噪声的贝叶斯推断。在具有挑战性的MetaWorld ML10/45基准上，GLiBRL将最先进的深度BRL方法之一VariBAD的成功率提高了最多2.7倍。与MAML、RL2、SDVT、TrMRL和ECET等代表性或近期的深度BRL/Meta-RL方法相比，GLiBRL也始终表现出低方差和不错的性能。
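
The "fully tractable" inference mentioned in the abstract above can be illustrated with the simplest linear-Gaussian case. The sketch below is not GLiBRL itself: it assumes a fixed basis phi(x) = [1, x] (GLiBRL learns its basis functions), a known noise variance, and an isotropic Gaussian prior. The function name `fit_bayes_linear` and its parameters are hypothetical.

```python
def fit_bayes_linear(xs, ys, prior_precision=1.0, noise_var=0.1):
    """Closed-form Gaussian posterior over weights w in y = w0 + w1*x + noise.

    Assumptions (illustrative only): fixed basis phi(x) = [1, x],
    prior w ~ N(0, I / prior_precision), known noise variance.
    """
    # Posterior precision A = prior_precision * I + Phi^T Phi / noise_var
    a00 = prior_precision + len(xs) / noise_var
    a01 = sum(xs) / noise_var
    a11 = prior_precision + sum(x * x for x in xs) / noise_var
    # Right-hand side b = Phi^T y / noise_var
    b0 = sum(ys) / noise_var
    b1 = sum(x * y for x, y in zip(xs, ys)) / noise_var
    # Invert the 2x2 precision matrix to get the posterior covariance
    det = a00 * a11 - a01 * a01
    cov = [[a11 / det, -a01 / det], [-a01 / det, a00 / det]]
    # Posterior mean = cov @ b
    mean = [cov[0][0] * b0 + cov[0][1] * b1,
            cov[1][0] * b0 + cov[1][1] * b1]
    return mean, cov
```

Because the posterior is available in closed form, no ELBO has to be optimised for this toy model, which is the property the abstract emphasises.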

LLM-Empowered Agentic AI for QoE-Aware Network Slicing Management in Industrial IoT

LLM赋能的智能体AI：面向工业物联网的QoE感知网络切片管理

Authors: Xudong Wang, Lei Feng, Ruichen Zhang, Fanqin Zhou, Hongyang Du, Wenjing Li, Dusit Niyato, Abbas Jamalipour, Ping Zhang
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2512.20997
Pdf link: https://arxiv.org/pdf/2512.20997
Abstract The Industrial Internet of Things (IIoT) requires networks that deliver ultra-low latency, high reliability, and cost efficiency, which traditional optimization methods and deep reinforcement learning (DRL)-based approaches struggle to provide under dynamic and heterogeneous workloads. To address this gap, large language model (LLM)-empowered agentic AI has emerged as a promising paradigm, integrating reasoning, planning, and adaptation to enable QoE-aware network management. In this paper, we explore the integration of agentic AI into QoE-aware network slicing for IIoT. We first review the network slicing management architecture, QoE metrics for IIoT applications, and the challenges of dynamically managing heterogeneous network slices, while highlighting the motivations and advantages of adopting agentic AI. We then present the workflow of agentic AI-based slicing management, illustrating the full lifecycle of AI agents from processing slice requests to constructing slice instances and performing dynamic adjustments. Furthermore, we propose an LLM-empowered agentic AI approach for slicing management, which integrates a retrieval-augmented generation (RAG) module for semantic intent inference, a DRL-based orchestrator for slicing configuration, and an incremental memory mechanism for continual learning and adaptation. Through a case study on heterogeneous slice management, we demonstrate that the proposed approach significantly outperforms other baselines in balancing latency, reliability, and cost, and achieves up to a 19% improvement in slice availability ratio.
中文摘要 工业物联网（IIoT）需要能够提供超低延迟、高可靠性和成本效益的网络，而传统优化方法和基于深度强化学习（DRL）的方法在动态、异构的工作负载下难以满足这些需求。为弥补这一不足，大型语言模型（LLM）赋能的智能体AI（agentic AI）已成为一种有前景的范式，它整合推理、规划和适应能力，实现QoE感知的网络管理。本文探讨将智能体AI集成到面向IIoT的QoE感知网络切片中。我们首先回顾网络切片管理架构、IIoT应用的QoE指标以及动态管理异构网络切片的挑战，同时强调采用智能体AI的动机和优势。随后，我们介绍基于智能体AI的切片管理工作流程，展示AI智能体从处理切片请求到构建切片实例、再到执行动态调整的完整生命周期。此外，我们提出一种LLM赋能的智能体AI切片管理方法，它整合了用于语义意图推断的检索增强生成（RAG）模块、基于DRL的切片配置编排器，以及用于持续学习和适应的增量记忆机制。通过一个异构切片管理的案例研究，我们表明所提方法在平衡延迟、可靠性和成本方面显著优于其他基线，并使切片可用率最多提升19%。

Policy-Conditioned Policies for Multi-Agent Task Solving

面向多智能体任务求解的以策略为条件的策略

Authors: Yue Lin, Shuhui Zhu, Wenhao Li, Ang Li, Dan Qiao, Pascal Poupart, Hongyuan Zha, Baoxiang Wang
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.21024
Pdf link: https://arxiv.org/pdf/2512.21024
Abstract In multi-agent tasks, the central challenge lies in the dynamic adaptation of strategies. However, directly conditioning on opponents' strategies is intractable in the prevalent deep reinforcement learning paradigm due to a fundamental "representational bottleneck": neural policies are opaque, high-dimensional parameter vectors that are incomprehensible to other agents. In this work, we propose a paradigm shift that bridges this gap by representing policies as human-interpretable source code and utilizing Large Language Models (LLMs) as approximate interpreters. This programmatic representation allows us to operationalize the game-theoretic concept of Program Equilibrium. We reformulate the learning problem by utilizing LLMs to perform optimization directly in the space of programmatic policies. The LLM functions as a point-wise best-response operator that iteratively synthesizes and refines the ego agent's policy code to respond to the opponent's strategy. We formalize this process as Programmatic Iterated Best Response (PIBR), an algorithm where the policy code is optimized by textual gradients, using structured feedback derived from game utility and runtime unit tests. We demonstrate that this approach effectively solves several standard coordination matrix games and a cooperative Level-Based Foraging environment.
中文摘要 在多智能体任务中，核心挑战在于策略的动态适应。然而，在主流的深度强化学习范式中，直接以对手策略为条件是难以处理的，因为存在一个根本性的“表征瓶颈”：神经策略是不透明的高维参数向量，其他智能体无法理解。在本研究中，我们提出一种范式转变，通过将策略表示为人类可解释的源代码，并利用大型语言模型（LLM）作为近似解释器，来弥合这一差距。这种程序化表示使我们能够将博弈论中的程序均衡（Program Equilibrium）概念付诸实践。我们利用LLM直接在程序化策略空间中进行优化，从而重新表述了学习问题。LLM充当逐点最佳响应算子，迭代地合成并改进自我智能体的策略代码，以应对对手的策略。我们将这一过程形式化为程序化迭代最佳响应（Programmatic Iterated Best Response，PIBR），该算法通过文本梯度优化策略代码，并利用来自博弈效用和运行时单元测试的结构化反馈。我们证明该方法能够有效求解若干标准的协调矩阵博弈以及一个合作式的Level-Based Foraging环境。
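
The iterated best-response loop that the abstract above builds on can be sketched on a small matrix game. This is an illustration under strong assumptions, not PIBR: the LLM-based code synthesis and textual gradients are replaced by an exhaustive argmax over pure actions, and the names `best_response` and `iterated_best_response` are hypothetical.

```python
def best_response(payoff, opponent_action, as_row):
    """Return the pure action that maximises utility against a fixed opponent action.

    payoff[r][c] is this player's utility when row plays r and column plays c.
    """
    n_actions = len(payoff) if as_row else len(payoff[0])

    def utility(action):
        if as_row:
            return payoff[action][opponent_action]
        return payoff[opponent_action][action]

    return max(range(n_actions), key=utility)


def iterated_best_response(row_payoff, col_payoff, row_action=0, col_action=1, max_rounds=10):
    """Alternate best responses until neither player changes its action."""
    for _ in range(max_rounds):
        new_row = best_response(row_payoff, col_action, as_row=True)
        new_col = best_response(col_payoff, new_row, as_row=False)
        if (new_row, new_col) == (row_action, col_action):
            break  # fixed point: a pure-strategy equilibrium of this game
        row_action, col_action = new_row, new_col
    return row_action, col_action


# A 2x2 coordination game: both players share the same payoffs.
shared = [[2, 0], [0, 1]]
equilibrium = iterated_best_response(shared, shared)
```

Starting from the miscoordinated pair (0, 1), the loop settles on a coordinated outcome. Note that plain best-response dynamics can stop at a payoff-inferior equilibrium, depending on the starting point.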

LSTM-Based Modeling and Reinforcement Learning Control of a Magnetically Actuated Catheter

磁驱动导管的基于LSTM的建模与强化学习控制

Authors: Arya Rashidinejad Meibodi, Mahbod Gholamali Sinaki, Khalil Alipour
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2512.21063
Pdf link: https://arxiv.org/pdf/2512.21063
Abstract Autonomous magnetic catheter systems are emerging as a promising approach for the future of minimally invasive interventions. This study presents a novel approach that begins by modeling the nonlinear and hysteretic dynamics of a magnetically actuated catheter system. The system consists of a magnetic catheter manipulated by servo-controlled magnetic fields generated by two external permanent magnets, and its complex behavior is captured using a Long Short-Term Memory (LSTM) neural network. The model was validated against data from the experimental setup with a root mean square error (RMSE) of 0.42 mm and 99.8% coverage within 3 mm, establishing it as a reliable surrogate model. This LSTM enables Reinforcement Learning (RL) agents to be trained to control the system without risking damage to the real setup, with the potential for subsequent fine-tuning on the physical system. We implemented Deep Q-Network (DQN) and actor-critic RL controllers, comparing the two agents first for regulation and subsequently for path following along linear and half-sinusoidal paths for the catheter tip. The actor-critic controller outperforms the DQN controller in both regulation and path following, offering greater accuracy and faster performance with less error, along with smoother trajectories at a 10 Hz sampling rate. This performance, due to the continuous action space, suits dynamic navigation tasks such as navigating curved vascular structures in practical applications.
中文摘要 自主磁导管系统正成为未来微创介入的一种有前景的方法。本研究提出一种新方法，首先对磁驱动导管系统的非线性和迟滞动力学进行建模。该系统由一根磁导管组成，导管由两个外部永磁体产生的伺服控制磁场操控，其复杂行为通过长短期记忆（LSTM）神经网络来刻画。该模型经实验装置数据验证，均方根误差（RMSE）为0.42毫米，3毫米范围内的覆盖率为99.8%，从而确立了其作为可靠替代模型的地位。借助该LSTM，可以训练强化学习（RL）智能体来控制系统，同时避免损坏真实装置，并有望随后在物理系统上进行微调。我们实现了深度Q网络（DQN）和actor-critic两种RL控制器，先在调节任务上比较这两种智能体，随后在导管尖端沿直线和半正弦路径的路径跟踪任务上进行比较。actor-critic在调节和路径跟踪两方面均优于DQN控制器，精度更高、速度更快、误差更小，并且在10 Hz采样率下轨迹更平滑。得益于连续动作空间，这种性能适合动态导航任务，例如在实际应用中穿行于弯曲的血管结构。
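
The two validation figures quoted in the abstract above (RMSE and coverage within a tolerance) can be computed as in the following minimal sketch. It assumes scalar position errors in millimetres, which is a simplification; how the paper defines the error for a catheter tip is not stated in the abstract. The function names are hypothetical.

```python
import math


def rmse(predicted, measured):
    """Root mean square error between two equally long sequences."""
    errors = [p - m for p, m in zip(predicted, measured)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def coverage_within(predicted, measured, tolerance):
    """Fraction of samples whose absolute error is at most `tolerance`."""
    hits = sum(1 for p, m in zip(predicted, measured) if abs(p - m) <= tolerance)
    return hits / len(predicted)
```

A surrogate model with low RMSE but poor coverage would have a few large outliers, which is why reporting both numbers is informative.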

Dyna-Style Reinforcement Learning Modeling and Control of Non-linear Dynamics

Dyna式强化学习：非线性动力学的建模与控制

Authors: Karim Abdelsalam, Zeyad Gamal, Ayman El-Badawy
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.21081
Pdf link: https://arxiv.org/pdf/2512.21081
Abstract Controlling systems with complex, nonlinear dynamics poses a significant challenge, particularly in achieving efficient and robust control. In this paper, we propose a Dyna-Style Reinforcement Learning control framework that integrates Sparse Identification of Nonlinear Dynamics (SINDy) with Twin Delayed Deep Deterministic Policy Gradient (TD3) reinforcement learning. SINDy is used to identify a data-driven model of the system, capturing its key dynamics without requiring an explicit physical model. This identified model is used to generate synthetic rollouts that are periodically injected into the reinforcement learning replay buffer during training on the real environment, enabling efficient policy learning with limited data available. By leveraging this hybrid approach, we mitigate the sample inefficiency of traditional model-free reinforcement learning methods while ensuring accurate control of nonlinear systems. To demonstrate the effectiveness of this framework, we apply it to a bi-rotor system as a case study, evaluating its performance in stabilization and trajectory tracking. The results show that our SINDy-TD3 approach achieves superior accuracy and robustness compared to direct reinforcement learning techniques, highlighting the potential of combining data-driven modeling with reinforcement learning for complex dynamical systems.
中文摘要 控制具有复杂非线性动力学的系统是一项重大挑战，尤其是在实现高效且鲁棒的控制方面。本文提出一种Dyna式强化学习控制框架，将非线性动力学稀疏辨识（SINDy）与双延迟深度确定性策略梯度（TD3）强化学习相结合。SINDy用于辨识系统的数据驱动模型，无需显式的物理模型即可捕捉其关键动力学。辨识得到的模型用于生成合成轨迹（rollout），并在真实环境训练期间定期注入强化学习的经验回放缓冲区，从而在可用数据有限的情况下实现高效的策略学习。借助这种混合方法，我们缓解了传统无模型强化学习方法样本效率低的问题，同时确保对非线性系统的精确控制。为证明该框架的有效性，我们以双旋翼系统为案例，评估其在稳定和轨迹跟踪方面的表现。结果表明，与直接强化学习技术相比，我们的SINDy-TD3方法具有更高的精度和鲁棒性，凸显了将数据驱动建模与强化学习相结合用于复杂动力系统的潜力。
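
The Dyna-style data flow described in the abstract above, where model-generated rollouts are periodically injected into the replay buffer alongside real transitions, can be sketched as follows. This is a hedged illustration: the learned model is passed in as a plain callable (standing in for a SINDy-identified model), no TD3 update is shown, and `synthetic_rollout`, `dyna_step`, `inject_every` and `horizon` are hypothetical names and settings.

```python
import random
from collections import deque


def synthetic_rollout(model, policy, start_state, horizon):
    """Roll the learned model forward and collect (s, a, r, s') tuples."""
    transitions, state = [], start_state
    for _ in range(horizon):
        action = policy(state)
        next_state, reward = model(state, action)
        transitions.append((state, action, reward, next_state))
        state = next_state
    return transitions


def dyna_step(buffer, real_transition, model, policy, step,
              inject_every=5, horizon=3, rng=random):
    """Store one real transition; every `inject_every` steps add a model rollout."""
    buffer.append(real_transition)
    if step % inject_every == 0:
        # Branch the synthetic rollout from a state already seen in the buffer.
        start_state = rng.choice(list(buffer))[0]
        buffer.extend(synthetic_rollout(model, policy, start_state, horizon))


# Toy usage: scalar state, a model that adds the action, a constant policy.
buffer = deque(maxlen=1000)
toy_model = lambda s, a: (s + a, -abs(s + a))
toy_policy = lambda s: -0.1
for step in range(1, 11):
    dyna_step(buffer, (0.0, -0.1, -0.1, -0.1), toy_model, toy_policy, step)
```

An off-policy learner such as TD3 would then sample minibatches from `buffer` without distinguishing real from synthetic transitions, which is where the sample-efficiency gain comes from.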

Global End-Effector Pose Control of an Underactuated Aerial Manipulator via Reinforcement Learning

基于强化学习的欠驱动空中机械臂全局末端执行器位姿控制

Authors: Shlok Deshmukh, Javier Alonso-Mora, Sihao Sun
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.21085
Pdf link: https://arxiv.org/pdf/2512.21085
Abstract Aerial manipulators, which combine robotic arms with multi-rotor drones, face strict constraints on arm weight and mechanical complexity. In this work, we study a lightweight 2-degree-of-freedom (DoF) arm mounted on a quadrotor via a differential mechanism, capable of full six-DoF end-effector pose control. While the minimal design enables simplicity and reduced payload, it also introduces challenges such as underactuation and sensitivity to external disturbances, including manipulation of heavy loads and pushing tasks. To address these, we employ reinforcement learning, training a Proximal Policy Optimization (PPO) agent in simulation to generate feedforward commands for quadrotor acceleration and body rates, along with joint angle targets. These commands are tracked by an incremental nonlinear dynamic inversion (INDI) attitude controller and a PID joint controller, respectively. Flight experiments demonstrate centimeter-level position accuracy and degree-level orientation precision, with robust performance under external force disturbances. The results highlight the potential of learning-based control strategies for enabling contact-rich aerial manipulation using simple, lightweight platforms.
中文摘要 空中机械臂将机械臂与多旋翼无人机相结合，在机械臂重量和机械复杂度方面面临严格限制。本研究考察一种轻量化的2自由度（DoF）机械臂，它通过差动机构安装在四旋翼上，能够实现完整的六自由度末端执行器位姿控制。这种极简设计带来了结构简单和载荷减小的优点，但也引入了欠驱动以及对外部扰动敏感等挑战，这些扰动包括搬运重物和推动任务。为解决这些问题，我们采用强化学习，在仿真中训练一个近端策略优化（PPO）智能体，生成四旋翼加速度和机体角速率的前馈指令，以及关节角度目标。这些指令分别由增量非线性动态逆（INDI）姿态控制器和PID关节控制器跟踪。飞行实验展示了厘米级的位置精度和度级的姿态精度，并在外力扰动下表现出鲁棒的性能。结果凸显了基于学习的控制策略在利用简单、轻量平台实现接触丰富的空中操作方面的潜力。

MiST: Understanding the Role of Mid-Stage Scientific Training in Developing Chemical Reasoning Models

MiST：理解中期科学训练在化学推理模型开发中的作用

Authors: Andres M Bran, Tong Xie, Shai Pranesh, Jeffrey Meng, Xuan Vu Nguyen, Jeremy Goumaz, David Ming Segura, Ruizhi Xu, Dongzhan Zhou, Wenjie Zhang, Bram Hoex, Philippe Schwaller
Subjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
Arxiv link: https://arxiv.org/abs/2512.21231
Pdf link: https://arxiv.org/pdf/2512.21231
Abstract Large Language Models can develop reasoning capabilities through online fine-tuning with rule-based rewards. However, recent studies reveal a critical constraint: reinforcement learning succeeds only when the base model already assigns non-negligible probability to correct answers -- a property we term 'latent solvability'. This work investigates the emergence of chemical reasoning capabilities and what these prerequisites mean for chemistry. We identify two necessary conditions for RL-based chemical reasoning: 1) Symbolic competence, and 2) Latent chemical knowledge. We propose mid-stage scientific training (MiST): a set of mid-stage training techniques to satisfy these, including data-mixing with SMILES/CIF-aware pre-processing, continued pre-training on 2.9B tokens, and supervised fine-tuning on 1B tokens. These steps raise the latent-solvability score on 3B and 7B models by up to 1.8x, and enable RL to lift top-1 accuracy from 10.9 to 63.9% on organic reaction naming, and from 40.6 to 67.4% on inorganic material generation. Similar results are observed for other challenging chemical tasks, while producing interpretable reasoning traces. Our results define clear prerequisites for chemical reasoning training and highlight the broader role of mid-stage training in unlocking reasoning capabilities.
中文摘要 大型语言模型可以通过基于规则奖励的在线微调来发展推理能力。然而，最新研究揭示了一个关键限制：只有当基础模型已经为正确答案赋予不可忽略的概率时，强化学习才会成功，我们将这一性质称为“潜在可解性”。本研究探讨化学推理能力的涌现，以及这些先决条件对化学意味着什么。我们确定了基于强化学习的化学推理的两个必要条件：1）符号能力；2）潜在化学知识。我们提出中期科学训练（MiST）：一组用于满足这些条件的中期训练技术，包括采用SMILES/CIF感知预处理的数据混合、在29亿（2.9B）词元上的持续预训练，以及在10亿（1B）词元上的监督微调。这些步骤使3B和7B模型的潜在可解性得分最多提升1.8倍，并使强化学习将有机反应命名的top-1准确率从10.9%提升到63.9%，将无机材料生成的top-1准确率从40.6%提升到67.4%。在其他具有挑战性的化学任务上也观察到类似结果，同时还能产生可解释的推理轨迹。我们的结果明确了化学推理训练的先决条件，并凸显了中期训练在解锁推理能力方面更广泛的作用。
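
The idea of "latent solvability" in the abstract above, that the base model must already give correct answers non-negligible probability, is often probed by sampling several completions per problem. The sketch below uses the widely used unbiased pass@k estimator as one possible proxy. The abstract does not say how the paper's latent-solvability score is defined, so this is an assumption, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples is correct.

    n: number of completions sampled from the base model for one problem
    c: how many of those n completions were correct
    k: sampling budget being evaluated (k <= n)
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset contains a correct one.
        return 1.0
    # 1 - P(all k drawn samples are incorrect), drawing without replacement
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A problem with `pass_at_k(n, 0, k) == 0.0` gives rule-based RL no positive reward signal to amplify, which matches the constraint the abstract describes.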

Keyword: diffusion policy

There are no results.