Arxiv Papers of Today

Generated: 2025-12-23 16:34:42 (UTC+8); Arxiv release time: 2025-12-23 20:00 EST (2025-12-24 09:00 UTC+8)

58 related papers today

Keyword: reinforcement learning

Graph-O1 : Monte Carlo Tree Search with Reinforcement Learning for Text-Attributed Graph Reasoning

Graph-O1：基于文本属性图推理的蒙特卡洛树搜索与强化学习

Authors: Lihui Liu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.17912
Pdf link: https://arxiv.org/pdf/2512.17912
Abstract Text-attributed graphs, where nodes and edges contain rich textual information, are widely used across diverse domains. A central challenge in this setting is question answering, which requires jointly leveraging unstructured text and the structured relational signals within the graph. Although Large Language Models (LLMs) have made significant advances in natural language understanding, their direct use for reasoning over text-attributed graphs remains limited. Retrieval-augmented generation methods that operate purely on text often treat passages as isolated units, ignoring the interconnected structure of the graph. Conversely, graph-based RAG methods that serialize large subgraphs into long textual sequences quickly become infeasible due to LLM context-length constraints, resulting in fragmented reasoning and degraded accuracy. To overcome these limitations, we introduce Graph-O1, an agentic GraphRAG framework that enables LLMs to conduct stepwise, interactive reasoning over graphs. Our approach integrates Monte Carlo Tree Search (MCTS) with end-to-end reinforcement learning, allowing the model to selectively explore and retrieve only the most informative subgraph components. The reasoning procedure is framed as a multi-turn interaction between the agent and the graph environment, and the agent is trained through a unified reward mechanism. Extensive experiments across multiple LLM backbones demonstrate that Graph-O1 consistently surpasses state-of-the-art baselines, producing answers that are more accurate, reliable, and interpretable.
中文摘要 文本属性图，即节点和边包含丰富的文本信息，在不同领域被广泛应用。该环境中一个核心挑战是问答，这需要结合非结构化文本和图中的结构化关系信号。尽管大型语言模型（LLMs）在自然语言理解方面取得了显著进展，但它们在文本属性图上直接用于推理的应用仍然有限。纯文本的检索增强生成方法通常将段落视为孤立单元，忽视图的相互关联结构。相反，基于图的RAG方法将大型子图序列化为长文本序列，由于LLM上下文长度限制，迅速变得不可行，导致推理碎片化和准确性下降。为克服这些局限，我们引入了Graph-O1，一种代理式GraphRAG框架，使大型语言模型能够对图进行逐步、交互式推理。我们的方法将蒙特卡洛树搜索（MCTS）与端到端强化学习相结合，使模型能够选择性地探索和检索最具信息量的子图组件。推理过程被框架为智能体与图环境之间的多回合交互，智能体通过统一的奖励机制进行训练。跨多个LLM骨干的广泛实验表明，Graph-O1始终超越最先进的基线，生成更准确、更可靠且易于解释的答案。
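
Note: the abstract above names Monte Carlo Tree Search but does not say which selection rule Graph-O1 uses. As a rough illustration only, the sketch below shows the textbook UCT rule that many MCTS variants use to pick the next child to explore. The function names, the `(value_sum, visits)` layout, and the constant `c=1.4` are assumptions made for this example, not details from the paper.

```python
import math


def uct_score(value_sum, visits, parent_visits, c=1.4):
    """Textbook UCT score: mean value plus an exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited children are always tried first
    mean_value = value_sum / visits
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return mean_value + exploration


def select_child(children, parent_visits, c=1.4):
    """Return the key of the child with the highest UCT score.

    `children` maps a (hypothetical) action name to (value_sum, visits).
    """
    return max(
        children,
        key=lambda k: uct_score(children[k][0], children[k][1], parent_visits, c),
    )


if __name__ == "__main__":
    # Hypothetical graph actions an agent might choose between.
    stats = {"expand_neighbor": (9.0, 10), "read_node_text": (1.0, 10)}
    print(select_child(stats, parent_visits=20))
```

With equal visit counts the exploration bonus is identical for both children, so the child with the higher mean value is selected; an unvisited child would be chosen before either.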

QAISim: A Toolkit for Modeling and Simulation of AI in Quantum Cloud Computing Environments

QAISim：量子云计算环境中人工智能建模与仿真工具包

Authors: Irwindeep Singh, Sukhpal Singh Gill, Jinzhao Sun, Jan Mol
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.17918
Pdf link: https://arxiv.org/pdf/2512.17918
Abstract Quantum computing offers new ways to explore the theory of computation via the laws of quantum mechanics. Due to the rising demand for quantum computing resources, there is growing interest in developing cloud-based quantum resource sharing platforms that enable researchers to test and execute their algorithms on real quantum hardware. These cloud-based systems face a fundamental challenge in efficiently allocating quantum hardware resources to fulfill the growing computational demand of modern Internet of Things (IoT) applications. So far, attempts have been made in order to make efficient resource allocation, ranging from heuristic-based solutions to machine learning. In this work, we employ quantum reinforcement learning based on parameterized quantum circuits to address the resource allocation problem to support large IoT networks. We propose a python-based toolkit called QAISim for the simulation and modeling of Quantum Artificial Intelligence (QAI) models for designing resource management policies in quantum cloud environments. We have simulated policy gradient and Deep Q-Learning algorithms for reinforcement learning. QAISim exhibits a substantial reduction in model complexity compared to its classical counterparts with fewer trainable variables.
中文摘要 量子计算为通过量子力学定律探索计算理论提供了新途径。随着对量子计算资源需求的增长，开发基于云的量子资源共享平台的兴趣日益增长，这些平台使研究人员能够在真实的量子硬件上测试和执行他们的算法。这些基于云的系统面临着一个根本性的挑战，即如何高效分配量子硬件资源，以满足现代物联网（IoT）应用日益增长的计算需求。迄今为止，已有尝试实现高效的资源分配，涵盖从启发式解决方案到机器学习。在本研究中，我们采用基于参数化量子电路的量子强化学习，解决支持大型物联网网络的资源分配问题。我们提出了一个基于Python的工具包QAISim，用于量子人工智能（QAI）模型的仿真和建模，用于设计量子云环境中的资源管理策略。我们有模拟策略梯度和深度Q-学习算法用于强化学习。与可训练变量较少的经典对应程序相比，QAISim在模型复杂度上显著降低。

NystagmusNet: Explainable Deep Learning for Photosensitivity Risk Prediction

NystagmusNet：用于光敏感风险预测的可解释深度学习

Authors: Karthik Prabhakar
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.17943
Pdf link: https://arxiv.org/pdf/2512.17943
Abstract Nystagmus patients with photosensitivity face significant daily challenges due to involuntary eye movements exacerbated by environmental brightness conditions. Current assistive solutions are limited to symptomatic treatments without predictive personalization. This paper proposes NystagmusNet, an AI-driven system that predicts high-risk visual environments and recommends real-time visual adaptations. Using a dual-branch convolutional neural network trained on synthetic and augmented datasets, the system estimates a photosensitivity risk score based on environmental brightness and eye movement variance. The model achieves 75% validation accuracy on synthetic data. Explainability techniques including SHAP and GradCAM are integrated to highlight environmental risk zones, improving clinical trust and model interpretability. The system includes a rule-based recommendation engine for adaptive filter suggestions. Future directions include deployment via smart glasses and reinforcement learning for personalized recommendations.
中文摘要 患有光敏感的眼震患者因环境亮度条件加剧的不自主眼动，日常面临重大挑战。目前的辅助解决方案仅限于无预测个性化的有症状治疗。本文提出了NystagmusNet，一种由人工智能驱动的系统，能够预测高风险视觉环境并建议实时的视觉适应。该系统利用双分支卷积神经网络，训练于合成和增强数据集，基于环境亮度和眼动方差估算光敏感风险评分。该模型在合成数据上的验证准确率达到了75%。集成了包括SHAP和GradCAM在内的可解释性技术，突出环境风险区，提升临床信任度和模型可解释性。该系统包含基于规则的自适应过滤器建议推荐引擎。未来方向包括通过智能眼镜部署和强化学习以获得个性化推荐。

SuperFlow: Training Flow Matching Models with RL on the Fly

SuperFlow：实时使用强化学习的训练流匹配模型

Authors: Kaijie Chen, Zhiyang Xu, Ying Shen, Zihao Lin, Yuguang Yao, Lifu Huang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.17951
Pdf link: https://arxiv.org/pdf/2512.17951
Abstract Recent progress in flow-based generative models and reinforcement learning (RL) has improved text-image alignment and visual quality. However, current RL training for flow models still has two main problems: (i) GRPO-style fixed per-prompt group sizes ignore variation in sampling importance across prompts, which leads to inefficient sampling and slower training; and (ii) trajectory-level advantages are reused as per-step estimates, which biases credit assignment along the flow. We propose SuperFlow, an RL training framework for flow-based models that adjusts group sizes with variance-aware sampling and computes step-level advantages in a way that is consistent with continuous-time flow dynamics. Empirically, SuperFlow reaches promising performance while using only 5.4% to 56.3% of the original training steps and reduces training time by 5.2% to 16.7% without any architectural changes. On standard text-to-image (T2I) tasks, including text rendering, compositional image generation, and human preference alignment, SuperFlow improves over SD3.5-M by 4.6% to 47.2%, and over Flow-GRPO by 1.7% to 16.0%.
中文摘要 基于流的生成模型和强化学习（RL）的最新进展提升了文本与图像的对齐和视觉质量。然而，当前流模型的强化学习训练仍存在两个主要问题：（i）GRPO风格的固定每个提示词组规模忽视了不同提示词采样重要性的差异，导致采样效率低下且训练变慢;以及（ii）轨迹层面的优势被重复用作每步估计值，这会对信用分配在流程中产生偏差。我们提出了SuperFlow，这是一种基于流量模型的强化学习训练框架，通过方差感知抽样调整组规模，并以符合连续时间流动态的方式计算步级优势。从经验上看，SuperFlow 在仅使用原始训练步骤的 5.4% 到 56.3% 的情况下达到了有希望的性能，且在没有任何架构变动的情况下，训练时间减少了 5.2% 至 16.7%。在标准文本到图像（T2I）任务中，包括文本渲染、构图图像生成和人类偏好对齐，SuperFlow相较SD3.5-M提升4.6%至47.2%，较Flow-GRPO提升1.7%至16.0%。
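
Note: to make the two issues named in the abstract above concrete, the sketch below shows (a) the usual GRPO-style group-relative advantage, where one trajectory-level value is computed per sample and then typically reused at every step, and (b) a purely hypothetical way to split a sampling budget across prompts in proportion to their reward standard deviation. The allocation rule, the function names, and the `min_size` parameter are assumptions for illustration; the abstract does not give SuperFlow's actual formulas.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within one prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def allocate_group_sizes(reward_stds, total_samples, min_size=2):
    """Hypothetical variance-aware allocation (not the paper's rule).

    Every prompt gets `min_size` samples; the remaining budget is split
    in proportion to each prompt's reward standard deviation.
    """
    n = len(reward_stds)
    spare = total_samples - min_size * n
    if spare < 0:
        raise ValueError("budget too small for min_size samples per prompt")
    total_std = sum(reward_stds)
    if total_std == 0:
        extras = [spare // n] * n
    else:
        extras = [int(spare * s / total_std) for s in reward_stds]
    sizes = [min_size + e for e in extras]
    # Hand any rounding remainder to the highest-variance prompt.
    top = max(range(n), key=lambda i: reward_stds[i])
    sizes[top] += total_samples - sum(sizes)
    return sizes


if __name__ == "__main__":
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
    print(allocate_group_sizes([0.0, 1.0, 3.0], total_samples=16))
```

A prompt whose samples all receive the same reward has zero standard deviation, so its group-relative advantages carry no learning signal; that is the intuition behind spending fewer samples on it.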

Adaptive Agents in Spatial Double-Auction Markets: Modeling the Emergence of Industrial Symbiosis

空间双重拍卖市场中的适应性代理：工业共生的出现建模

Authors: Matthieu Mastio, Paul Saves, Benoit Gaudou, Nicolas Verstaevel
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Applications (stat.AP)
Arxiv link: https://arxiv.org/abs/2512.17979
Pdf link: https://arxiv.org/pdf/2512.17979
Abstract Industrial symbiosis fosters circularity by enabling firms to repurpose residual resources, yet its emergence is constrained by socio-spatial frictions that shape costs, matching opportunities, and market efficiency. Existing models often overlook the interaction between spatial structure, market design, and adaptive firm behavior, limiting our understanding of where and how symbiosis arises. We develop an agent-based model where heterogeneous firms trade byproducts through a spatially embedded double-auction market, with prices and quantities emerging endogenously from local interactions. Leveraging reinforcement learning, firms adapt their bidding strategies to maximize profit while accounting for transport costs, disposal penalties, and resource scarcity. Simulation experiments reveal the economic and spatial conditions under which decentralized exchanges converge toward stable and efficient outcomes. Counterfactual regret analysis shows that sellers' strategies approach a near Nash equilibrium, while sensitivity analysis highlights how spatial structures and market parameters jointly govern circularity. Our model provides a basis for exploring policy interventions that seek to align firm incentives with sustainability goals, and more broadly demonstrates how decentralized coordination can emerge from adaptive agents in spatially constrained markets.
中文摘要 工业共生通过使企业能够再利用剩余资源来促进循环，但其出现受限于塑造成本、匹配机会和市场效率的社会空间摩擦。现有模型常常忽视空间结构、市场设计和适应性企业行为之间的相互作用，限制了我们对共生关系何时何种方式产生的理解。我们开发了一个基于代理的模型，异质企业通过空间嵌入的双重拍卖市场交易副产品，价格和数量则由局部互动内生产生。利用强化学习，企业调整招标策略以最大化利润，同时考虑运输成本、处置罚款和资源稀缺性。模拟实验揭示了去中心化交易所趋向稳定高效结果的经济和空间条件。反事实遗憾分析显示，卖方策略趋近纳什均衡，而敏感性分析则强调空间结构与市场参数共同支配循环性。我们的模型为探索政策干预提供了基础，旨在将企业激励与可持续目标对齐，更广泛地展示了空间受限市场中适应性代理如何产生去中心化协调。

ReGal: A First Look at PPO-based Legal AI for Judgment Prediction and Summarization in India

ReGal：基于PPO的法律人工智能在印度判决预测与摘要中的首次解析

Authors: Shubham Kumar Nigam, Tanuj Tyagi, Siddharth Shukla, Aditya Kumar Guru, Balaramamahanthi Deepak Patnaik, Danush Khanna, Noel Shallum, Kripabandhu Ghosh, Arnab Bhattacharya
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.18014
Pdf link: https://arxiv.org/pdf/2512.18014
Abstract This paper presents an early exploration of reinforcement learning methodologies for legal AI in the Indian context. We introduce Reinforcement Learning-based Legal Reasoning (ReGal), a framework that integrates Multi-Task Instruction Tuning with Reinforcement Learning from AI Feedback (RLAIF) using Proximal Policy Optimization (PPO). Our approach is evaluated across two critical legal tasks: (i) Court Judgment Prediction and Explanation (CJPE), and (ii) Legal Document Summarization. Although the framework underperforms on standard evaluation metrics compared to supervised and proprietary models, it provides valuable insights into the challenges of applying RL to legal texts. These challenges include reward model alignment, legal language complexity, and domain-specific adaptation. Through empirical and qualitative analysis, we demonstrate how RL can be repurposed for high-stakes, long-document tasks in law. Our findings establish a foundation for future work on optimizing legal reasoning pipelines using reinforcement learning, with broader implications for building interpretable and adaptive legal AI systems.
中文摘要 本文初步探讨了印度语境下法律人工智能强化学习方法论。我们介绍基于强化学习的法律推理（ReGal），这是一个结合多任务指令调优与基于AI反馈的强化学习（RLAIF）的框架，采用近端策略优化（PPO）。我们的方法涵盖两个关键法律任务：（i）法院判决预测与解释（CJPE），以及（ii）法律文件摘要。尽管该框架在标准评估指标上表现不如监督和专有模型，但它为将强化学习应用于法律文本的挑战提供了宝贵见解。这些挑战包括奖励模型的对齐、法律语言的复杂性以及领域特定的适应性。通过实证和定性分析，我们展示了如何将强化学习重新用于法律中高风险、长文档任务。我们的发现为未来利用强化学习优化法律推理流程奠定了基础，并对构建可解释性和自适应的法律人工智能系统具有更广泛的启示。

Towards Autonomous Navigation in Endovascular Interventions

迈向血管内干预中的自主导航

Authors: Tudor Jianu, Anh Nguyen, Sebastiano Fichera, Pierre Berthet-Rayne
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.18081
Pdf link: https://arxiv.org/pdf/2512.18081
Abstract Cardiovascular diseases remain the leading cause of global mortality, with minimally invasive treatment options offered through endovascular interventions. However, the precision and adaptability of current robotic systems for endovascular navigation are limited by heuristic control, low autonomy, and the absence of haptic feedback. This thesis presents an integrated AI-driven framework for autonomous guidewire navigation in complex vascular environments, addressing key challenges in data availability, simulation fidelity, and navigational accuracy. A high-fidelity, real-time simulation platform, CathSim, is introduced for reinforcement learning based catheter navigation, featuring anatomically accurate vascular models and contact dynamics. Building on CathSim, the Expert Navigation Network is developed, a policy that fuses visual, kinematic, and force feedback for autonomous tool control. To mitigate data scarcity, the open-source, bi-planar fluoroscopic dataset Guide3D is proposed, comprising more than 8,700 annotated images for 3D guidewire reconstruction. Finally, SplineFormer, a transformer-based model, is introduced to directly predict guidewire geometry as continuous B-spline parameters, enabling interpretable, real-time navigation. The findings show that combining high-fidelity simulation, multimodal sensory fusion, and geometric modelling substantially improves autonomous endovascular navigation and supports safer, more precise minimally invasive procedures.
中文摘要 心血管疾病仍是全球死亡率的主要原因，微创治疗方案通过血管内干预提供。然而，当前用于血管内导航的机器人系统的精度和适应性受限于启发式控制、低自主性以及缺乏触觉反馈。本论文提出了一个集成的AI驱动框架，用于复杂血管环境中自主导线导航，解决数据可用性、仿真精度和导航精度等关键挑战。引入了高保真实时模拟平台CathSim，用于基于强化学习的导管导航，具备解剖学精确的血管模型和接触动力学。基于CathSim，开发了专家导航网络，该政策融合了视觉、运动学和力反馈，实现自主工具控制。为缓解数据稀缺，提出了开源的双平面透视数据集Guide3D，包含8700多张注释图像，用于3D导线重建。最后，引入了基于变压器的模型SplineFormer，直接预测导线几何形状作为连续的B样条参数，实现可解释的实时导航。研究结果显示，结合高精度仿真、多模态感觉融合和几何建模，显著提升了自主血管内导航能力，并支持更安全、更精准的微创手术。

Unifying Causal Reinforcement Learning: Survey, Taxonomy, Algorithms and Applications

统一因果强化学习：调查、分类学、算法与应用

Authors: Cristiano da Costa Cunha, Wei Liu, Tim French, Ajmal Mian
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.18135
Pdf link: https://arxiv.org/pdf/2512.18135
Abstract Integrating causal inference (CI) with reinforcement learning (RL) has emerged as a powerful paradigm to address critical limitations in classical RL, including low explainability, lack of robustness and generalization failures. Traditional RL techniques, which typically rely on correlation-driven decision-making, struggle when faced with distribution shifts, confounding variables, and dynamic environments. Causal reinforcement learning (CRL), leveraging the foundational principles of causal inference, offers promising solutions to these challenges by explicitly modeling cause-and-effect relationships. In this survey, we systematically review recent advancements at the intersection of causal inference and RL. We categorize existing approaches into causal representation learning, counterfactual policy optimization, offline causal RL, causal transfer learning, and causal explainability. Through this structured analysis, we identify prevailing challenges, highlight empirical successes in practical applications, and discuss open problems. Finally, we provide future research directions, underscoring the potential of CRL for developing robust, generalizable, and interpretable artificial intelligence systems.
中文摘要 将因果推理（CI）与强化学习（RL）整合，已成为解决经典强化学习中关键局限性的有力范式，包括低可解释性、缺乏鲁棒性和泛化失败。传统的强化学习技术通常依赖于相关性驱动的决策，但在面对分布变化、混杂变量和动态环境时表现不佳。因果强化学习（CRL）利用因果推理的基础原则，通过明确建模因果关系，为这些挑战提供了有前景的解决方案。在本综述中，我们系统地回顾了因果推断与强化学习交叉领域的最新进展。我们将现有方法分类为因果表征学习、反事实策略优化、离线因果强化学习、因果迁移学习和因果可解释性。通过这种结构化分析，我们识别当前的挑战，突出实际应用中的实证成功案例，并讨论未解问题。最后，我们提出了未来的研究方向，强调CRL在开发稳健、可通用且可解释的人工智能系统方面的潜力。

On Swarm Leader Identification using Probing Policies

关于使用探测策略识别群体领导者

Authors: Stergios E. Bachoumas, Panagiotis Artemiadis
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2512.18146
Pdf link: https://arxiv.org/pdf/2512.18146
Abstract Identifying the leader within a robotic swarm is crucial, especially in adversarial contexts where leader concealment is necessary for mission success. This work introduces the interactive Swarm Leader Identification (iSLI) problem, a novel approach where an adversarial probing agent identifies a swarm's leader by physically interacting with its members. We formulate the iSLI problem as a Partially Observable Markov Decision Process (POMDP) and employ Deep Reinforcement Learning, specifically Proximal Policy Optimization (PPO), to train the prober's policy. The proposed approach utilizes a novel neural network architecture featuring a Timed Graph Relationformer (TGR) layer combined with a Simplified Structured State Space Sequence (S5) model. The TGR layer effectively processes graph-based observations of the swarm, capturing temporal dependencies and fusing relational information using a learned gating mechanism to generate informative representations for policy learning. Extensive simulations demonstrate that our TGR-based model outperforms baseline graph neural network architectures and exhibits significant zero-shot generalization capabilities across varying swarm sizes and speeds different from those used during training. The trained prober achieves high accuracy in identifying the leader, maintaining performance even in out-of-training distribution scenarios, and showing appropriate confidence levels in its predictions. Real-world experiments with physical robots further validate the approach, confirming successful sim-to-real transfer and robustness to dynamic changes, such as unexpected agent disconnections.
中文摘要 在机器人群体中识别领导者至关重要，尤其是在对抗性情境下，领导者隐蔽对任务成功至关重要。这项工作引入了交互式群体领袖识别（iSLI）问题，这是一种新颖的方法，即对抗探测代理通过与群体成员的物理互动来识别群体的领导者。我们将iSLI问题表述为部分可观测马尔可夫决策过程（POMDP），并采用深度强化学习，特别是近端策略优化（PPO）来训练prober的策略。该方法采用了一种新颖的神经网络架构，结合了时序图关系器（TGR）层和简化结构化状态空间序列（S5）模型。TGR层有效处理基于图的群体观测，捕捉时间依赖关系，利用学习门控机制融合关系信息，生成策略学习所需的信息表征。大量模拟表明，基于TGR的模型优于基线图神经网络架构，并在不同群体规模和速度下展现出显著的零射点泛化能力，这些能力与训练时使用的群体不同。训练有素的探测器在识别领先者方面实现了高准确性，即使在训练外的分布场景中也能保持性能，并对预测显示适当的置信度水平。物理机器人的实际实验进一步验证了该方法，确认了模拟到现实的成功转移以及对动态变化（如意外代理断线）的鲁棒性。

NL2CA: Auto-formalizing Cognitive Decision-Making from Natural Language Using an Unsupervised CriticNL2LTL Framework

NL2CA：利用无监督CriticNL2LTL框架从自然语言自动形式化认知决策

Authors: Zihao Deng, Yijia Li, Renrui Zhang, Peijun Ye
Subjects: Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
Arxiv link: https://arxiv.org/abs/2512.18189
Pdf link: https://arxiv.org/pdf/2512.18189
Abstract Cognitive computing models offer a formal and interpretable way to characterize human's deliberation and decision-making, yet their development remains labor-intensive. In this paper, we propose NL2CA, a novel method for auto-formalizing cognitive decision-making rules from natural language descriptions of human experience. Different from most related work that exploits either pure manual or human guided interactive modeling, our method is fully automated without any human intervention. The approach first translates text into Linear Temporal Logic (LTL) using a fine-tuned large language model (LLM), then refines the logic via an unsupervised Critic Tree, and finally transforms the output into executable production rules compatible with symbolic cognitive frameworks. Based on the resulted rules, a cognitive agent is further constructed and optimized through cognitive reinforcement learning according to the real-world behavioral data. Our method is validated in two domains: (1) NL-to-LTL translation, where our CriticNL2LTL module achieves consistent performance across both expert and large-scale benchmarks without human-in-the-loop feed-backs, and (2) cognitive driving simulation, where agents automatically constructed from human interviews have successfully learned the diverse decision patterns of about 70 trials in different critical scenarios. Experimental results demonstrate that NL2CA enables scalable, interpretable, and human-aligned cognitive modeling from unstructured textual data, offering a novel paradigm to automatically design symbolic cognitive agents.
中文摘要 认知计算模型为描述人类的思考和决策提供了一种形式化且可解释的方式，但其开发过程仍然劳动密集。本文提出了NL2CA这一新方法，用于从自然语言描述人类经验中自动形式化认知决策规则。与大多数利用纯手工或人类引导交互建模的相关工作不同，我们的方法是完全自动化，无需人工干预。该方法首先将文本翻译为线性时序逻辑（LTL），使用微调的大语言模型（LLM），然后通过无监督的批评树对逻辑进行细化，最终将输出转换为与符号认知框架兼容的可执行生成规则。基于所得规则，认知代理通过认知强化学习进一步构建和优化，基于现实世界的行为数据。我们的方法在两个领域得到了验证：（1）NL到LTL的转换，其中我们的CriticNL2LTL模块在专家和大规模基准测试中都能实现稳定性能，无需人工反馈;（2）认知驱动模拟，由人工访谈自动构建的代理成功学习了约70个不同关键场景下试验的多样化决策模式。实验结果表明，NL2CA能够从非结构化文本数据实现可扩展、可解释且符合人类的认知建模，提供了一种全新的范式，用于自动设计符号认知代理。

Stable and Efficient Single-Rollout RL for Multimodal Reasoning

多模态推理的稳定高效单次扩展RL

Authors: Rui Liu, Dian Yu, Lei Ke, Haolin Liu, Yujun Zhou, Zhenwen Liang, Haitao Mi, Pratap Tokekar, Dong Yu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.18215
Pdf link: https://arxiv.org/pdf/2512.18215
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has become a key paradigm to improve the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, prevalent group-based algorithms such as GRPO require multi-rollout sampling for each prompt. While more efficient single-rollout variants have recently been explored in text-only settings, we find that they suffer from severe instability in multimodal contexts, often leading to training collapse. To address this training efficiency-stability trade-off, we introduce $\textbf{MSSR}$ (Multimodal Stabilized Single-Rollout), a group-free RLVR framework that achieves both stable optimization and effective multimodal reasoning performance. MSSR achieves this via an entropy-based advantage-shaping mechanism that adaptively regularizes advantage magnitudes, preventing collapse and maintaining training stability. While such mechanisms have been used in group-based RLVR, we show that in the multimodal single-rollout setting they are not merely beneficial but essential for stability. In in-distribution evaluations, MSSR demonstrates superior training compute efficiency, achieving similar validation accuracy to the group-based baseline with half the training steps. When trained for the same number of steps, MSSR's performance surpasses the group-based baseline and shows consistent generalization improvements across five diverse reasoning-intensive benchmarks. Together, these results demonstrate that MSSR enables stable, compute-efficient, and effective RLVR for complex multimodal reasoning tasks.
中文摘要 带可验证奖励的强化学习（RLVR）已成为提升多模态大型语言模型（MLLMs）推理能力的关键范式。然而，常见的基于组的算法如GRPO要求对每个提示进行多重展开抽样。虽然最近在纯文本环境中探索了更高效的单次展开变体，但我们发现它们在多模态环境中存在严重不稳定性，常导致训练崩溃。为解决训练效率与稳定性的权衡，我们引入了$\textbf{MSSR}$（多模态稳定单次展开），这是一个无需分组的RLVR框架，既实现了稳定优化，也实现了有效的多模态推理性能。MSSR通过基于熵的优势塑形机制实现这一点，该机制自适应地正则化优势幅度，防止崩溃并保持训练稳定性。虽然这些机制已在基于组的RLVR中使用，但我们证明在多模态单次展开环境中，它们不仅有益，更是稳定性的关键。在分布内评估中，MSSR展现出更优越的训练计算效率，仅用一半的训练步骤就能实现与基于组的基线相当的验证准确率。在相同步数训练时，MSSR的表现超过基于组的基线，并在五个多样化的推理密集基准中展现出持续的泛化提升。这些结果共同表明，MSSR能够在复杂的多模态推理任务中实现稳定、计算高效且有效的RLVR。

Embedded Safety-Aligned Intelligence via Differentiable Internal Alignment Embeddings

通过可微分内部对齐嵌入实现嵌入式安全对齐智能

Authors: Harsh Rathva, Ojas Srivastava, Pruthwik Mishra
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.18309
Pdf link: https://arxiv.org/pdf/2512.18309
Abstract We introduce Embedded Safety-Aligned Intelligence (ESAI), a theoretical framework for multi-agent reinforcement learning that embeds alignment constraints directly into agents internal representations using differentiable internal alignment embeddings. Unlike external reward shaping or post-hoc safety constraints, internal alignment embeddings are learned latent variables that predict externalized harm through counterfactual reasoning and modulate policy updates toward harm reduction through attention and graph-based propagation. The ESAI framework integrates four mechanisms: differentiable counterfactual alignment penalties computed from soft reference distributions, alignment-weighted perceptual attention, Hebbian associative memory supporting temporal credit assignment, and similarity-weighted graph diffusion with bias mitigation controls. We analyze stability conditions for bounded internal embeddings under Lipschitz continuity and spectral constraints, discuss computational complexity, and examine theoretical properties including contraction behavior and fairness-performance tradeoffs. This work positions ESAI as a conceptual contribution to differentiable alignment mechanisms in multi-agent systems. We identify open theoretical questions regarding convergence guarantees, embedding dimensionality, and extension to high-dimensional environments. Empirical evaluation is left to future work.
中文摘要 我们介绍嵌入式安全对齐智能（ESAI），这是一种多智能体强化学习的理论框架，通过可微分的内部对齐嵌入，将比对约束直接嵌入智能体内部表示。与外部奖励塑造或事后安全约束不同，内部对齐嵌入是学习的潜在变量，通过反事实推理预测外部伤害，并通过注意力和基于图的传播调节政策更新，以减少伤害。ESAI框架整合了四种机制：由软参考分布计算的可微反事实对齐惩罚、比对加权的感知注意力、支持时间归功的Hebbian联想记忆，以及带有偏差缓解控制的相似加权图扩散。我们分析了在Lipschitz连续性和谱约束下有界内部嵌入的稳定性条件，讨论计算复杂性，并考察了包括收缩行为和公平性-性能权衡在内的理论性质。本研究将ESAI定位为对多智能体系统中可微比对机制的概念性贡献。我们识别了关于收敛保证、嵌入维度以及向高维环境扩展等开放的理论问题。实证评估留待后续研究。

Monitoring Monitorability

监控可监控性

Authors: Melody Y. Guan, Miles Wang, Micah Carroll, Zehao Dou, Annie Y. Wei, Marcus Williams, Benjamin Arnav, Joost Huizinga, Ian Kivlichan, Mia Glaese, Jakub Pachocki, Bowen Baker
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2512.18311
Pdf link: https://arxiv.org/pdf/2512.18311
Abstract Observability into the decision making of modern AI systems may be required to safely deploy increasingly capable agents. Monitoring the chain-of-thought (CoT) of today's reasoning models has proven effective for detecting misbehavior. However, this "monitorability" may be fragile under different training procedures, data sources, or even continued system scaling. To measure and track monitorability, we propose three evaluation archetypes (intervention, process, and outcome-property) and a new monitorability metric, and introduce a broad evaluation suite. We demonstrate that these evaluations can catch simple model organisms trained to have obfuscated CoTs, and that CoT monitoring is more effective than action-only monitoring in practical settings. We compare the monitorability of various frontier models and find that most models are fairly, but not perfectly, monitorable. We also evaluate how monitorability scales with inference-time compute, reinforcement learning optimization, and pre-training model size. We find that longer CoTs are generally more monitorable and that RL optimization does not materially decrease monitorability even at the current frontier scale. Notably, we find that for a model at a low reasoning effort, we could instead deploy a smaller model at a higher reasoning effort (thereby matching capabilities) and obtain a higher monitorability, albeit at a higher overall inference compute cost. We further investigate agent-monitor scaling trends and find that scaling a weak monitor's test-time compute when monitoring a strong agent increases monitorability. Giving the weak monitor access to CoT not only improves monitorability, but it steepens the monitor's test-time compute to monitorability scaling trend. Finally, we show we can improve monitorability by asking models follow-up questions and giving their follow-up CoT to the monitor.
中文摘要 在现代人工智能系统决策中融入可观察性，可能有助于安全部署日益强大的代理。监测当今推理模型的思维链（CoT）已被证明对检测不良行为非常有效。然而，这种“可监控性”在不同的训练程序、数据源甚至系统持续扩展下可能变得脆弱。为了衡量和跟踪可监测性，我们提出了三种评估原型（干预、过程和结果属性）和一种新的可监测性指标，并引入了广泛的评估套件。我们证明，这些评估能够捕捉训练为模糊CoT的简单模式生物，并且CoT监测在实际环境中比仅行动监测更有效。我们比较了各种前沿模型的可监测性，发现大多数模型都相当可监控，但并非完美。我们还评估了可监测性如何随着推理时间计算、强化学习优化和预训练模型规模的扩展。我们发现，较长的CoT通常更容易被监控，且即使在当前前沿尺度上，强化学习优化也不会实质性降低可监测性。值得注意的是，我们发现对于一个推理投入较低的模型，我们可以部署一个较小且推理投入更高的模型（从而匹配能力），从而获得更高的可监控性，尽管整体推理计算成本更高。我们进一步研究了代理监控者的扩展趋势，发现在监控强代理时，对弱监控器的测试时间计算进行扩展，有助于提升可监控性。让弱监控器访问CoT不仅提升了可监控性，还加快了测试时间计算到可监控性扩展的趋势。最后，我们展示了通过向模型提出后续问题并向监测者提供后续CoT来提升可监测性。

Trustworthy and Explainable Deep Reinforcement Learning for Safe and Energy-Efficient Process Control: A Use Case in Industrial Compressed Air Systems

可信且可解释的深度强化学习，实现安全节能过程控制：工业压缩空气系统中的应用场景

Authors: Vincent Bezold, Patrick Wagner, Jakob Hofmann, Marco Huber, Alexander Sauer
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2512.18317
Pdf link: https://arxiv.org/pdf/2512.18317
Abstract This paper presents a trustworthy reinforcement learning approach for the control of industrial compressed air systems. We develop a framework that enables safe and energy-efficient operation under realistic boundary conditions and introduce a multi-level explainability pipeline combining input perturbation tests, gradient-based sensitivity analysis, and SHAP (SHapley Additive exPlanations) feature attribution. An empirical evaluation across multiple compressor configurations shows that the learned policy is physically plausible, anticipates future demand, and consistently respects system boundaries. Compared to the installed industrial controller, the proposed approach reduces unnecessary overpressure and achieves energy savings of approximately 4\,\% without relying on explicit physics models. The results further indicate that system pressure and forecast information dominate policy decisions, while compressor-level inputs play a secondary role. Overall, the combination of efficiency gains, predictive behavior, and transparent validation supports the trustworthy deployment of reinforcement learning in industrial energy systems.
中文摘要 本文提出了一种可信赖的强化学习方法，用于工业压缩空气系统控制。我们开发了一个框架，能够在现实边界条件下实现安全且节能的运行，并引入了结合输入微扰测试、基于梯度的敏感性分析和SHAP（SHapley加法解释）特征归因的多层次可解释流程。跨多种压缩机配置的实证评估表明，所学策略在物理上可行，能够预见未来需求，并且始终尊重系统边界。与已安装的工业控制器相比，该方法减少了不必要的过压，并在不依赖显式物理模型的情况下实现约4\,\%的节能。结果进一步表明，系统压力和预报信息主导策略决策，而压缩机级输入则处于次要地位。总体而言，效率提升、预测行为和透明验证的结合，支持了强化学习在工业能源系统中的可靠部署。

Reinforcement Learning Position Control of a Quadrotor Using Soft Actor-Critic (SAC)

使用软演员批评（SAC）强化学习四旋翼位置控制

Authors: Youssef Mahran, Zeyad Gamal, Ayman El-Badawy
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.18333
Pdf link: https://arxiv.org/pdf/2512.18333
Abstract This paper proposes a new Reinforcement Learning (RL) based control architecture for quadrotors. With the literature focusing on controlling the four rotors' RPMs directly, this paper aims to control the quadrotor's thrust vector. The RL agent computes the percentage of overall thrust along the quadrotor's z-axis along with the desired Roll ($\phi$) and Pitch ($\theta$) angles. The agent then sends the calculated control signals along with the current quadrotor's Yaw angle ($\psi$) to an attitude PID controller. The PID controller then maps the control signals to motor RPMs. The Soft Actor-Critic algorithm, a model-free off-policy stochastic RL algorithm, was used to train the RL agents. Training results show the faster training time of the proposed thrust vector controller in comparison to the conventional RPM controllers. Simulation results show smoother and more accurate path-following for the proposed thrust vector controller.
中文摘要 本文提出了一种基于强化学习（RL）的四旋翼控制架构。文献重点是直接控制四个旋翼的转速，本文旨在控制四旋翼的推力矢量。RL代理计算四旋翼z轴上的总推力百分比，以及所需的滚转角（$\phi$）和俯仰角（$\theta$）。随后，代理将计算出的控制信号连同当前四旋翼的偏航角（$\psi$）发送给姿态PID控制器。PID控制器随后将控制信号映射到电机转速。软演员-批判者算法是一种无模型的非策略随机强化学习算法，用于训练强化学习代理。训练结果显示，所提出的推力矢量控制器相比传统转速控制器的训练时间更快。仿真结果显示，所提出的推力矢量控制器的路径跟踪更平滑、更准确。
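
Note: the abstract above describes an agent that outputs a thrust percentage along the z-axis together with desired roll and pitch angles, then forwards these with the current yaw to an attitude PID controller. The sketch below shows one plausible way to turn a raw policy output into such a setpoint. The `[-1, 1]` action range, the 30-degree tilt limit, and all names are assumptions for illustration; the abstract does not specify them.

```python
import math


def scale_action(raw_action, max_tilt_rad=math.radians(30)):
    """Map a raw action in [-1, 1]^3 to (thrust fraction, roll, pitch).

    The [-1, 1] range and the tilt limit are illustrative assumptions.
    """
    thrust_raw, roll_raw, pitch_raw = (max(-1.0, min(1.0, a)) for a in raw_action)
    thrust_fraction = 0.5 * (thrust_raw + 1.0)  # rescale [-1, 1] to [0, 1]
    return thrust_fraction, roll_raw * max_tilt_rad, pitch_raw * max_tilt_rad


def attitude_setpoint(raw_action, current_yaw):
    """Bundle the scaled action with the current yaw for an attitude controller."""
    thrust, roll, pitch = scale_action(raw_action)
    return {"thrust": thrust, "roll": roll, "pitch": pitch, "yaw": current_yaw}


if __name__ == "__main__":
    # Out-of-range components are clipped before scaling.
    print(attitude_setpoint([0.0, 1.0, -2.0], current_yaw=0.3))
```

The PID stage that maps this setpoint to motor RPMs is left out, since its gains and structure are not described in the abstract.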

Dynamic Entropy Tuning in Reinforcement Learning Low-Level Quadcopter Control: Stochasticity vs Determinism

动态熵调谐在强化学习中低级四旋翼控制：随机性与决定论

Authors: Youssef Mahran, Zeyad Gamal, Ayman El-Badawy
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.18336
Pdf link: https://arxiv.org/pdf/2512.18336
Abstract This paper explores the impact of dynamic entropy tuning in Reinforcement Learning (RL) algorithms that train a stochastic policy. Its performance is compared against algorithms that train a deterministic one. Stochastic policies optimize a probability distribution over actions to maximize rewards, while deterministic policies select a single deterministic action per state. The effect of training a stochastic policy with both static entropy and dynamic entropy and then executing deterministic actions to control the quadcopter is explored. It is then compared against training a deterministic policy and executing deterministic actions. For the purpose of this research, the Soft Actor-Critic (SAC) algorithm was chosen for the stochastic algorithm while the Twin Delayed Deep Deterministic Policy Gradient (TD3) was chosen for the deterministic algorithm. The training and simulation results show the positive effect the dynamic entropy tuning has on controlling the quadcopter by preventing catastrophic forgetting and improving exploration efficiency.
中文摘要 本文探讨了动态熵调优在强化学习（RL）算法中用于随机策略训练的影响。其性能与训练确定性算法进行比较。随机策略优化行动的概率分布以最大化奖励，而确定性策略则为每个状态选择一个确定性动作。探讨了在静态熵和动态熵中训练随机策略，然后执行确定性作以控制四旋翼飞行器的影响。然后将其与训练确定性策略和执行确定性动作进行比较。本研究选择了软演员-批判者（SAC）算法作为随机算法，而双延迟深度确定性策略梯度（TD3）用于确定性算法。训练和模拟结果显示，动态熵调谐对控制四旋翼飞行器有积极影响，防止灾难性遗忘并提升探索效率。
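
As a rough illustration of what dynamic entropy tuning means in SAC, the sketch below applies one gradient step of the commonly used automatic temperature adjustment, where the temperature is raised when the policy's entropy falls below a target and lowered otherwise. Whether the paper uses exactly this update is an assumption; the function name `update_log_alpha`, the learning rate, and the target entropy value are hypothetical.

```python
import math


def update_log_alpha(log_alpha, log_probs, target_entropy, lr=1e-3):
    """One gradient-descent step on the SAC temperature loss.

    Uses the widely implemented loss
        L(log_alpha) = -log_alpha * mean(log_pi(a|s) + target_entropy),
    whose gradient w.r.t. log_alpha is -mean(log_pi + target_entropy).
    """
    mean_term = sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    grad = -mean_term
    return log_alpha - lr * grad


# Hypothetical batch of action log-probabilities and a target entropy of -4.
log_alpha = update_log_alpha(0.0, [-1.0, -1.0], target_entropy=-4.0)
alpha = math.exp(log_alpha)  # temperature used to weight the entropy bonus
```

With static entropy the coefficient `alpha` would simply stay fixed; the comparison in the abstract is between that setting, the tuned setting, and a deterministic TD3 policy.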

On the Universality of Transformer Architectures; How Much Attention Is Enough?

《论变压器架构的通用性》;多少关注才算足够？

Authors: Amirreza Abbasi, Mohsen Hooshmand
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.18445
Pdf link: https://arxiv.org/pdf/2512.18445
Abstract Transformers are crucial across many AI fields, such as large language models, computer vision, and reinforcement learning. This prominence stems from the architecture's perceived universality and scalability compared to alternatives. This work examines the problem of universality in Transformers, reviews recent progress, including architectural refinements such as structural minimality and approximation rates, and surveys state-of-the-art advances that inform both theoretical and practical understanding. Our aim is to clarify what is currently known about Transformers' expressiveness, separate robust guarantees from fragile ones, and identify key directions for future theoretical research.
中文摘要 变换器在许多人工智能领域都至关重要，如大型语言模型、计算机视觉和强化学习。这种突出性源于其相较于其他替代方案的普遍性和可扩展性。本研究探讨了变形金刚中的普遍性问题，回顾了近期进展，包括结构极简性和近似率等架构改进，并回顾了为理论和实践理解提供支持的最新进展。我们的目标是澄清目前关于变形金刚表现力的已知情况，区分强健的保证与脆弱的，并确定未来理论研究的关键方向。

When Robots Say No: The Empathic Ethical Disobedience Benchmark

当机器人说“不”：同理伦理不服从基准

Authors: Dmytro Kuzmenko, Nadiya Shvai
Subjects: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2512.18474
Pdf link: https://arxiv.org/pdf/2512.18474
Abstract Robots must balance compliance with safety and social expectations as blind obedience can cause harm, while over-refusal erodes trust. Existing safe reinforcement learning (RL) benchmarks emphasize physical hazards, while human-robot interaction trust studies are small-scale and hard to reproduce. We present the Empathic Ethical Disobedience (EED) Gym, a standardized testbed that jointly evaluates refusal safety and social acceptability. Agents weigh risk, affect, and trust when choosing to comply, refuse (with or without explanation), clarify, or propose safer alternatives. EED Gym provides different scenarios, multiple persona profiles, and metrics for safety, calibration, and refusals, with trust and blame models grounded in a vignette study. Using EED Gym, we find that action masking eliminates unsafe compliance, while explanatory refusals help sustain trust. Constructive styles are rated most trustworthy, empathic styles -- most empathic, and safe RL methods improve robustness but also make agents more prone to overly cautious behavior. We release code, configurations, and reference policies to enable reproducible evaluation and systematic human-robot interaction research on refusal and trust. At submission time, we include an anonymized reproducibility package with code and configs, and we commit to open-sourcing the full repository after the paper is accepted.
中文摘要 机器人必须在遵守安全与社会期望之间取得平衡，因为盲目服从可能带来伤害，而过度拒绝则会削弱信任。现有的安全强化学习（RL）基准强调物理危害，而人机交互信任研究规模小且难以复制。我们介绍了同理伦理不服从（EED）健身房，这是一个标准化测试平台，联合评估拒绝的安全性和社会接受度。代理人在选择遵守、拒绝（无论有无解释）、澄清或提出更安全的替代方案时，权衡风险、影响和信任。EED Gym提供不同的情景、多重人物画像，以及安全、校准和拒绝的指标，信任与指责模型基于一个案例研究。使用EED Gym，我们发现行动掩饰消除了不安全的合规行为，而解释性拒绝有助于维持信任。建设性风格被评为最值得信赖、最具同理心的风格——最有同理心，安全的强化学习方法不仅提升了稳健性，但也使客服更倾向于过度谨慎。我们发布代码、配置和参考政策，以实现可重复的评估和系统的人机交互研究，涉及拒绝与信任。在提交时，我们会附带匿名可重复性包，包含代码和配置，并承诺在论文被接受后开源整个仓库。
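
The abstract reports that action masking eliminates unsafe compliance. How EED Gym implements masking is not stated, so the following is only a generic sketch of the technique: disallowed actions receive a logit of negative infinity before the softmax, so they get exactly zero probability. The action ordering and the name `masked_action_probs` are hypothetical.

```python
import math


def masked_action_probs(logits, allowed):
    """Softmax over logits with disallowed actions forced to probability 0."""
    if not any(allowed):
        raise ValueError("at least one action must be allowed")
    masked = [l if ok else -math.inf for l, ok in zip(logits, allowed)]
    m = max(masked)  # finite, because at least one action is allowed
    exps = [math.exp(x - m) for x in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical action order: comply, refuse, clarify, propose alternative.
# Here "comply" is judged unsafe in the current state and is masked out.
probs = masked_action_probs([2.0, 1.0, 0.5, 0.0], [False, True, True, True])
```

Because the masked action can never be sampled, the guarantee holds regardless of how the policy's logits evolve during training.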

Scaling up Stability: Reinforcement Learning for Distributed Control of Networked Systems in the Space of Stabilizing Policies

提升稳定性：在稳定策略领域实现网络系统分布式控制的强化学习

Authors: John Cao, Luca Furieri
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2512.18540
Pdf link: https://arxiv.org/pdf/2512.18540
Abstract We study distributed control of networked systems through reinforcement learning, where neural policies must be simultaneously scalable, expressive and stabilizing. We introduce a policy parameterization that embeds Graph Neural Networks (GNNs) into a Youla-like magnitude-direction parameterization, yielding distributed stochastic controllers that guarantee network-level closed-loop stability by design. The magnitude is implemented as a stable operator consisting of a GNN acting on disturbance feedback, while the direction is a GNN acting on local observations. We prove robustness of the closed loop to perturbations in both the graph topology and model parameters, and show how to integrate our parameterization with Proximal Policy Optimization. Experiments on a multi-agent navigation task show that policies trained on small networks transfer directly to larger ones and unseen network topologies, achieve higher returns and lower variance than a state-of-the-art MARL baseline while preserving stability.
中文摘要 我们研究通过强化学习对网络系统的分布式控制，其中神经策略必须同时具备可扩展性、表达性和稳定性。我们引入了一种策略参数化，将图神经网络（GNN）嵌入类似尤拉的大小方向参数化中，从而实现分布式随机控制器，设计上保证网络级闭环稳定性。该大小以稳定算符实现，由一个受扰动反馈影响的GNN组成，而方向为受局部观测影响的GNN。我们证明了闭环对图拓扑和模型参数的鲁棒性，并展示了如何将参数化与近端策略优化整合。多智能体导航任务的实验表明，在小型网络上训练的策略可直接转移到更大的网络和未见的网络拓扑，在保持稳定性的同时，实现更高的回报和更低的方差。

Toward Training Superintelligent Software Agents through Self-Play SWE-RL

通过自玩SWE-RL培训超级智能软件代理

Authors: Yuxiang Wei, Zhiqing Sun, Emily McMilin, Jonas Gehring, David Zhang, Gabriel Synnaeve, Daniel Fried, Lingming Zhang, Sida Wang
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.18552
Pdf link: https://arxiv.org/pdf/2512.18552
Abstract While current software agents powered by large language models (LLMs) and agentic reinforcement learning (RL) can boost programmer productivity, their training data (e.g., GitHub issues and pull requests) and environments (e.g., pass-to-pass and fail-to-pass tests) heavily depend on human knowledge or curation, posing a fundamental barrier to superintelligence. In this paper, we present Self-play SWE-RL (SSR), a first step toward training paradigms for superintelligent software agents. Our approach takes minimal data assumptions, only requiring access to sandboxed repositories with source code and installed dependencies, with no need for human-labeled issues or tests. Grounded in these real-world codebases, a single LLM agent is trained via reinforcement learning in a self-play setting to iteratively inject and repair software bugs of increasing complexity, with each bug formally specified by a test patch rather than a natural language issue description. On the SWE-bench Verified and SWE-Bench Pro benchmarks, SSR achieves notable self-improvement (+10.4 and +7.8 points, respectively) and consistently outperforms the human-data baseline over the entire training trajectory, despite being evaluated on natural language issues absent from self-play. Our results, albeit early, suggest a path where agents autonomously gather extensive learning experiences from real-world software repositories, ultimately enabling superintelligent systems that exceed human capabilities in understanding how systems are constructed, solving novel challenges, and autonomously creating new software from scratch.
中文摘要 虽然现有由大型语言模型（LLMs）和代理强化学习（RL）驱动的软件代理可以提升程序员生产力，但它们的训练数据（如GitHub问题和拉取请求）和环境（如通过测试和通过测试）高度依赖人类知识或管理，这成为超级智能的根本障碍。本文介绍了自玩型SWE-RL（SSR），这是为超级智能软件代理训练范式迈出的第一步。我们的方法采用极少的数据假设，只需访问带有源代码和安装依赖的沙盒仓库，无需人工标记的问题或测试。基于这些真实代码库，单个LLM代理通过自我游戏环境中的强化学习训练，迭代注入和修复日益复杂的软件漏洞，每个漏洞通过测试补丁正式指定，而非自然语言问题描述。在SWE-bench Verified和SWE-Bench Pro基准测试中，SSR实现了显著的自我提升（分别为+10.4和+7.8分），并且在整个训练轨迹中持续优于人类数据基线，尽管评估涉及自玩中缺乏的自然语言问题。我们的研究结果虽然尚早，但暗示了一条路径：智能体能够自主地从现实软件仓库中收集大量学习经验，最终实现超越人类能力的超智能系统，能够理解系统构建方式，解决新挑战，并自主从零开始创建新软件。

Distributionally Robust Multi-Agent Reinforcement Learning for Intelligent Traffic Control

分布式稳健的多智能体强化学习用于智能交通控制

Authors: Shuwei Pei, Joran Borger, Arda Kosay, Muhammed O. Sayin, Saeed Ahmed
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2512.18558
Pdf link: https://arxiv.org/pdf/2512.18558
Abstract Learning-based traffic signal control is typically optimized for average performance under a few nominal demand patterns, which can result in poor behavior under atypical traffic conditions. To address this, we develop a distributionally robust multi-agent reinforcement learning framework for signal control on a 3x3 urban grid calibrated from a contiguous 3x3 subarea of central Athens covered by the pNEUMA trajectory dataset (Barmpounakis and Geroliminis, 2020). Our approach proceeds in three stages. First, we train a baseline multi-agent RL controller in which each intersection is governed by a proximal policy optimization agent with discrete signal phases, using a centralized training, decentralized execution paradigm. Second, to capture demand uncertainty, we construct eight heterogeneous origin-destination-based traffic scenarios (one directly derived from pNEUMA and seven synthetically generated) to span a wide range of spatial and temporal demand patterns. Over this scenario set, we train a contextual-bandit worst-case estimator that assigns mixture weights to estimate adversarial demand distributions conditioned on context. Finally, without modifying the controller architecture, we fine-tune the baseline multi-agent reinforcement learning agents under these estimated worst-case mixtures to obtain a distributionally robust multi-agent reinforcement learning controller. Across all eight scenarios, as well as on an unseen validation network based on the Sioux Falls configuration, the distributionally robust multi-agent reinforcement learning controller consistently reduces horizon-averaged queues and increases average speeds relative to the baseline, achieving up to 51% shorter queues and 38% higher speeds on the worst-performing scenarios.
中文摘要 基于学习的交通信号控制通常针对少数名义需求模式的平均性能进行优化，这可能导致在非典型交通条件下表现不佳。为此，我们开发了一个分布稳健的多智能体强化学习框架，用于在雅典市中心连续的3x3子区域校准信号控制信号（Barmpounakis和Geroliminis，2020）。我们的方法分为三个阶段进行。首先，我们训练一个基础多智能体强化学习控制器，每个交点由一个具有离散信号相位的近端策略优化代理管理，采用集中式训练和去中心化执行范式。其次，为了捕捉需求不确定性，我们构建了八种基于 pNEUMA 的异构起讫地交通场景——一种直接源自 pNEUMA，七种是合成生成的——涵盖广泛的空间和时间需求模式。在该情景集中，我们训练一个情境-强盗最坏情况估计器，赋予混合权重以估计基于情境的对抗性需求分布。最后，在不修改控制器架构的情况下，我们在这些估计的最坏情况混合下微调基础多智能体强化学习智能体，以获得分布稳健的多智能体强化学习控制器。在所有八种场景中，以及基于苏福尔斯配置的未见验证网络上，分布式稳健的多智能体强化学习控制器持续减少地平平均队列数，并相较于基线提升平均速度，在表现最差的场景中实现最多51%的队列短化和38%的高速。

Vox Deorum: A Hybrid LLM Architecture for 4X / Grand Strategy Game AI -- Lessons from Civilization V

Vox Deorum：一款用于4X/大战略游戏AI的混合大型语言模型架构——从《文明V》中汲取的经验教训

Authors: John Chen, Sihan Cheng, Can Gurkan, Ryan Lay, Moez Salahuddin
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.18564
Pdf link: https://arxiv.org/pdf/2512.18564
Abstract Large Language Models' capacity to reason in natural language makes them uniquely promising for 4X and grand strategy games, enabling more natural human-AI gameplay interactions such as collaboration and negotiation. However, these games present unique challenges due to their complexity and long-horizon nature, while latency and cost factors may hinder LLMs' real-world deployment. Working on a classic 4X strategy game, Sid Meier's Civilization V with the Vox Populi mod, we introduce Vox Deorum, a hybrid LLM+X architecture. Our layered technical design empowers LLMs to handle macro-strategic reasoning, delegating tactical execution to subsystems (e.g., algorithmic AI or reinforcement learning AI in the future). We validate our approach through 2,327 complete games, comparing two open-source LLMs with a simple prompt against Vox Populi's enhanced AI. Results show that LLMs achieve competitive end-to-end gameplay while exhibiting play styles that diverge substantially from algorithmic AI and from each other. Our work establishes a viable architecture for integrating LLMs in commercial 4X games, opening new opportunities for game design and agentic AI research.
中文摘要 大型语言模型在自然语言中推理的能力使其在4X和大战略游戏中极具潜力，能够实现更自然的人机互动，如协作和谈判。然而，这些游戏因其复杂性和长期视野特性，带来了独特的挑战，同时延迟和成本因素也可能阻碍LLM的实际部署。在制作经典4X策略游戏《席德·梅尔的文明V》并搭配Vox Populi模组时，我们引入了Vox Deorum，一种混合LLM+X架构。我们的分层技术设计使大型语言模型能够处理宏观战略推理，将战术执行委托给子系统（例如未来的算法AI或强化学习AI）。我们通过2327款完整游戏验证了我们的方法，将两个开源大型语言模型与一个简单提示对比Vox Populi的增强型人工智能。结果显示，LLM实现了具有竞争力的端到端游戏体验，同时其玩法风格与算法AI及彼此之间有显著差异。我们的工作建立了一种可行的架构，用于将大型语言模型集成到商业4X游戏中，为游戏设计和代理人工智能研究开辟新机遇。

Trajectory Planning for UAV-Based Smart Farming Using Imitation-Based Triple Deep Q-Learning

基于无人机的智能农业轨迹规划，基于模仿的三重深度Q-学习

Authors: Wencan Mao, Quanxi Zhou, Tomas Couso Coddou, Manabu Tsukada, Yunling Liu, Yusheng Ji
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.18604
Pdf link: https://arxiv.org/pdf/2512.18604
Abstract Unmanned aerial vehicles (UAVs) have emerged as a promising auxiliary platform for smart agriculture, capable of simultaneously performing weed detection, recognition, and data collection from wireless sensors. However, trajectory planning for UAV-based smart agriculture is challenging due to the high uncertainty of the environment, partial observations, and limited battery capacity of UAVs. To address these issues, we formulate the trajectory planning problem as a Markov decision process (MDP) and leverage multi-agent reinforcement learning (MARL) to solve it. Furthermore, we propose a novel imitation-based triple deep Q-network (ITDQN) algorithm, which employs an elite imitation mechanism to reduce exploration costs and utilizes a mediator Q-network over a double deep Q-network (DDQN) to accelerate and stabilize training and improve performance. Experimental results in both simulated and real-world environments demonstrate the effectiveness of our solution. Moreover, our proposed ITDQN outperforms DDQN by 4.43% in weed recognition rate and 6.94% in data collection rate.
中文摘要 无人机（UAV）已成为智能农业的有前景辅助平台，能够同时进行杂草检测、识别和无线传感器数据收集。然而，由于环境高度不确定、观测有限以及无人机电池容量有限，无人机智能农业的发展轨迹规划充满挑战。为解决这些问题，我们将轨迹规划问题提出为马尔可夫决策过程（MDP），并利用多智能体强化学习（MARL）来解决。此外，我们提出了一种基于模拟的新型三重深度Q网络（ITDQN）算法，采用精英模拟机制降低探索成本，并利用双深度Q网络（DDQN）上的中介Q网络加速稳定训练并提升性能。在模拟和现实环境中的实验结果证明了我们解决方案的有效性。此外，我们提出的ITDQN在杂草识别率上比DDQN高出4.43%，数据收集率高出6.94%。
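
For context on the DDQN baseline that ITDQN builds on, here is a minimal sketch of the standard double deep Q-network bootstrap target: the online network selects the next action and the target network evaluates it. The mediator Q-network and the elite imitation mechanism are not specified in the abstract and are deliberately not reproduced; the function name `ddqn_target` and the plain-list representation of Q-values are illustrative assumptions.

```python
def ddqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double DQN target: select with the online net, evaluate with the target net.

    next_q_online / next_q_target are the Q-values of every action in the
    next state, as produced by the online and target networks respectively.
    """
    if done:
        return reward  # no bootstrapping from a terminal state
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]
```

Decoupling selection from evaluation is what reduces the overestimation bias of plain Q-learning, which is the property the abstract's third network is layered on top of.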

A Multi-agent Text2SQL Framework using Small Language Models and Execution Feedback

一个多智能体Text2SQL框架，使用小语言模型和执行反馈

Authors: Thanh Dat Hoang, Thanh Trung Huynh, Matthias Weidlich, Thanh Tam Nguyen, Tong Chen, Hongzhi Yin, Quoc Viet Hung Nguyen
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2512.18622
Pdf link: https://arxiv.org/pdf/2512.18622
Abstract Text2SQL, the task of generating SQL queries from natural language text, is a critical challenge in data engineering. Recently, Large Language Models (LLMs) have demonstrated superior performance for this task due to their advanced comprehension and generation capabilities. However, privacy and cost considerations prevent companies from using Text2SQL solutions based on external LLMs offered as a service. Rather, small LLMs (SLMs) that are openly available and can be hosted in-house are adopted. These SLMs, in turn, lack the generalization capabilities of larger LLMs, which impairs their effectiveness for complex tasks such as Text2SQL. To address these limitations, we propose MATS, a novel Text2SQL framework designed specifically for SLMs. MATS uses a multi-agent mechanism that assigns specialized roles to auxiliary agents, reducing individual workloads and fostering interaction. A training scheme based on reinforcement learning aligns these agents using feedback obtained during execution, thereby maintaining competitive performance despite a limited LLM size. Evaluation results on benchmark datasets show that MATS, deployed on a single-GPU server, yields accuracy that is on par with large-scale LLMs while using significantly fewer parameters. Our source code and data are available at this https URL.
中文摘要 Text2SQL 是从自然语言文本生成 SQL 查询的任务，是数据工程中的一项关键挑战。近年来，大型语言模型（LLM）因其先进的理解和生成能力，在该任务中表现出更优越的性能。然而，隐私和成本考虑使公司无法使用基于外部大型语言模型（LLM）作为服务提供的Text2SQL解决方案。相反，采用的是公开且可内部托管的小型大型语言模型（SLM）。这些SLM缺乏大型大型语言模型的泛化能力，这影响了它们在复杂任务如Text2SQL中的有效性。为解决这些局限性，我们提出了MATS，一种专为SLM设计的新型Text2SQL框架。MATS采用多代理机制，为辅助代理分配专业角色，减少单个工作负载并促进交互。基于强化学习的训练方案利用执行过程中获得的反馈对齐这些代理，从而在LLM规模有限的情况下保持竞争力。基于基准数据集的评估结果表明，部署在单GPU服务器上的MATS在使用显著更少参数时，其准确率可与大规模大型语言模型相当。我们的源代码和数据可在该 https URL 访问。

LLM-CAS: Dynamic Neuron Perturbation for Real-Time Hallucination Correction

LLM-CAS：动态神经元扰动用于实时幻觉矫正

Authors: Jensen Zhang, Ningyuan Liu, Yijia Fan, Zihao Huang, Qinglin Zeng, Kaitong Cai, Jian Wang, Keze Wang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.18623
Pdf link: https://arxiv.org/pdf/2512.18623
Abstract Large language models (LLMs) often generate hallucinated content that lacks factual or contextual grounding, limiting their reliability in critical applications. Existing approaches such as supervised fine-tuning and reinforcement learning from human feedback are data intensive and computationally expensive, while static parameter editing methods struggle with context dependent errors and catastrophic forgetting. We propose LLM-CAS, a framework that formulates real-time hallucination correction as a hierarchical reinforcement learning problem. LLM-CAS trains an agent to learn a policy that dynamically selects temporary neuron perturbations during inference based on the current context. Unlike prior dynamic approaches that rely on heuristic or predefined adjustments, this policy driven mechanism enables adaptive and fine grained correction without permanent parameter modification. Experiments across multiple language models demonstrate that LLM-CAS consistently improves factual accuracy, achieving gains of 10.98 percentage points on StoryCloze, 2.71 points on TriviaQA, and 2.06 points on the MC1 score of TruthfulQA. These results outperform both static editing methods such as ITI and CAA and the dynamic SADI framework. Overall, LLM-CAS provides an efficient and context aware solution for improving the reliability of LLMs, with promising potential for future multimodal extensions.
中文摘要 大型语言模型（LLM）常常生成缺乏事实或上下文基础的幻觉内容，限制了其在关键应用中的可靠性。现有方法如监督微调和基于人类反馈的强化学习数据量大且计算成本高昂，而静态参数编辑方法则面临上下文依赖错误和灾难性遗忘问题。我们提出了LLM-CAS框架，将实时幻觉纠正构建为分层强化学习问题。LLM-CAS训练代理学习一种策略，在推理过程中根据当前上下文动态选择临时神经元扰动。与以往依赖启发式或预定义调整的动态方法不同，这种策略驱动机制使得无需永久参数修改即可实现自适应且细粒度的修正。跨多语言模型的实验表明，LLM-CAS持续提升事实准确性，StoryCloze提升10.98个百分点，TriviaQA提升2.71个百分点，TruthfulQA的MC1得分提升2.06个百分点。这些结果优于静态编辑方法如ITI和CAA以及动态SADI框架。总体而言，LLM-CAS为提升LLMs可靠性提供了高效且具上下文感知的解决方案，并为未来多模态扩展带来了前景。

Offline Reinforcement Learning for End-to-End Autonomous Driving

端到端自动驾驶的离线强化学习

Authors: Chihiro Noguchi, Takaki Yamamoto
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.18662
Pdf link: https://arxiv.org/pdf/2512.18662
Abstract End-to-end (E2E) autonomous driving models that take only camera images as input and directly predict a future trajectory are appealing for their computational efficiency and potential for improved generalization via unified optimization; however, persistent failure modes remain due to reliance on imitation learning (IL). While online reinforcement learning (RL) could mitigate IL-induced issues, the computational burden of neural rendering-based simulation and large E2E networks renders iterative reward and hyperparameter tuning costly. We introduce a camera-only E2E offline RL framework that performs no additional exploration and trains solely on a fixed simulator dataset. Offline RL offers strong data efficiency and rapid experimental iteration, yet is susceptible to instability from overestimation on out-of-distribution (OOD) actions. To address this, we construct pseudo ground-truth trajectories from expert driving logs and use them as a behavior regularization signal, suppressing imitation of unsafe or suboptimal behavior while stabilizing value learning. Training and closed-loop evaluation are conducted in a neural rendering environment learned from the public nuScenes dataset. Empirically, the proposed method achieves substantial improvements in collision rate and route completion compared with IL baselines. Our code will be available at [URL].
中文摘要 端到端（E2E）自动驾驶模型仅接收相机图像作为输入，并直接预测未来轨迹，因其计算效率和通过统一优化提升泛化潜力而备受青睐;然而，由于依赖模仿学习（IL），持续的失败模式依然存在。虽然在线强化学习（RL）可能缓解IL引发的问题，但基于神经渲染的模拟和大型端对端网络的计算负担使得迭代奖励和超参数调优成本高昂。我们引入了一个仅摄像头的端对端（E2E）离线强化学习框架，不进行额外探索，仅在固定的模拟器数据集上训练。离线强化学习提供强大的数据效率和快速的实验迭代，但由于对非分发（OOD）动作的高估，容易产生不稳定性。为此，我们从专家驾驶日志构建伪地面真实轨迹，并将其用作行为正则化信号，抑制不安全或次优行为的模仿，同时稳定价值学习。训练和闭环评估在从公开 nuScenes 数据集学习到的神经渲染环境中进行。从经验上看，所提方法相比IL基线在碰撞率和路线完成率方面取得了显著改善。我们的代码将在[URL]上发布。

Demonstration-Guided Continual Reinforcement Learning in Dynamic Environments

动态环境中的演示引导持续强化学习

Authors: Xue Yang, Michael Schukat, Junlin Lu, Patrick Mannion, Karl Mason, Enda Howley
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.18670
Pdf link: https://arxiv.org/pdf/2512.18670
Abstract Reinforcement learning (RL) excels in various applications but struggles in dynamic environments where the underlying Markov decision process evolves. Continual reinforcement learning (CRL) enables RL agents to continually learn and adapt to new tasks, but balancing stability (preserving prior knowledge) and plasticity (acquiring new knowledge) remains challenging. Existing methods primarily address the stability-plasticity dilemma through mechanisms where past knowledge influences optimization but rarely affects the agent's behavior directly, which may hinder effective knowledge reuse and efficient learning. In contrast, we propose demonstration-guided continual reinforcement learning (DGCRL), which stores prior knowledge in an external, self-evolving demonstration repository that directly guides RL exploration and adaptation. For each task, the agent dynamically selects the most relevant demonstration and follows a curriculum-based strategy to accelerate learning, gradually shifting from demonstration-guided exploration to fully self-exploration. Extensive experiments on 2D navigation and MuJoCo locomotion tasks demonstrate its superior average performance, enhanced knowledge transfer, mitigation of forgetting, and training efficiency. The additional sensitivity analysis and ablation study further validate its effectiveness.
中文摘要 强化学习（RL）在多种应用中表现出色，但在动态环境中表现不佳，因为底层马尔可夫决策过程不断演变。持续强化学习（CRL）使强化学习代理能够持续学习并适应新任务，但在稳定性（保留既有知识）和可塑性（获取新知识）之间取得平衡仍然具有挑战性。现有方法主要通过机制解决稳定性与可塑性的困境，即过去的知识影响优化，但很少直接影响智能体的行为，这可能阻碍知识的有效再利用和高效学习。相比之下，我们提出了演示引导的持续强化学习（DGCRL），它将先前知识存储在外部的自我演化示范仓库中，直接引导强化学习的探索和适应。对于每个任务，代理动态选择最相关的演示，并遵循基于课程的策略加速学习，逐步从演示引导探索转向完全自我探索。对二维导航和MuJoCo移动任务的广泛实验展示了其卓越的平均性能、增强的知识传递、减少遗忘和训练效率。额外的敏感性分析和消融研究进一步验证了其疗效。
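
The gradual shift from demonstration-guided exploration to full self-exploration can be pictured as a schedule on the probability of following the demonstration. The abstract does not describe the actual curriculum, so the linear decay below and the names `demo_probability` and `choose_action` are assumptions made purely for illustration.

```python
import random


def demo_probability(step, total_steps, start=1.0, end=0.0):
    """Linearly anneal the probability of following the demonstration."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def choose_action(step, total_steps, demo_action, policy_action, rng=random):
    """Follow the demonstration early in training, the learned policy later."""
    if rng.random() < demo_probability(step, total_steps):
        return demo_action
    return policy_action
```

Early in a task the agent mostly replays the selected demonstration; by the end of the schedule every action comes from its own policy.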

A Theoretical Lens for RL-Tuned Language Models via Energy-Based Models

通过基于能量的模型，为强化学习调谐语言模型提供理论透镜

Authors: Zhiquan Tan, Yinrong Hong
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.18730
Pdf link: https://arxiv.org/pdf/2512.18730
Abstract Large language models (LLMs) trained via KL-regularized reinforcement learning demonstrate strong instruction following, self-correction, and reasoning abilities. Yet their theoretical underpinnings remain limited. We exploit the closed-form energy-based model (EBM) structure of the optimal KL-regularized policy to provide a unified variational analysis of LLMs. For instruction-tuned models, under natural assumptions on reward potentials and pretraining symmetry, we prove that the transition kernel satisfies detailed balance with respect to a scalar potential encoding response quality. This yields monotonic KL convergence to a high-quality stationary distribution, bounded hitting times to superior states, and exponential mixing governed by the spectral gap. For reasoning models trained with verifiable rewards (RLVR), we show the objective is equivalent to expected KL minimization toward an optimal reasoning distribution, with the suboptimality gap reducing to the Bernoulli KL between target and current accuracies along the natural gradient flow. This helps explain empirical entropy-accuracy trade-offs.
中文摘要 通过KL正则化强化学习训练的大型语言模型（LLM）展现出强大的指令跟随、自我纠正和推理能力。然而，它们的理论基础仍然有限。我们利用最优KL正则化策略的封闭形式基于能量模型（EBM）结构，提供LLMs的统一变分分析。对于指令调优模型，在奖励势和预训练对称性自然假设下，我们证明转移核满足标量势编码响应质量的详细平衡。这导致单调KL收敛到高质量的平稳分布，达到优态的有界击中时间，以及由谱间隙支配的指数混合。对于训练可验证奖励（RLVR）的推理模型，我们表明目标等价于向最优推理分布的预期KL最小化，次最优差距缩减至目标精度与当前精度沿自然梯度流动之间的伯努利KL。这有助于解释经验熵与准确性的权衡。
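
For readers unfamiliar with the closed form the abstract refers to, the standard result for KL-regularized policy optimization is sketched below. The symbols ($r$ for the reward, $\beta$ for the regularization strength, $\pi_{\mathrm{ref}}$ for the reference policy, $Z$ for the partition function) are conventional notation assumed here, not taken from the paper, and the argument order of the Bernoulli KL used in the paper is not stated in the abstract.

```latex
% KL-regularized objective for a prompt x and response y
\max_{\pi}\; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\bigl[r(x, y)\bigr]
  \;-\; \beta\, \mathrm{KL}\bigl(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\bigr)

% Its maximizer has an energy-based (Gibbs) form
\pi^{*}(y \mid x) \;=\; \frac{\pi_{\mathrm{ref}}(y \mid x)\,\exp\bigl(r(x, y)/\beta\bigr)}{Z(x)},
\qquad
Z(x) \;=\; \sum_{y'} \pi_{\mathrm{ref}}(y' \mid x)\,\exp\bigl(r(x, y')/\beta\bigr)

% Bernoulli KL between two accuracies p and q
\mathrm{KL}\bigl(\mathrm{Bern}(p)\,\|\,\mathrm{Bern}(q)\bigr)
  \;=\; p \log\frac{p}{q} \;+\; (1 - p)\log\frac{1 - p}{1 - q}
```

The second line is what makes the optimal policy an energy-based model: the energy is $-r(x, y)/\beta$ measured relative to the reference policy.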

InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search

InSight-o3：通过广义视觉搜索赋能多模态基础模型

Authors: Kaican Li, Lewei Yao, Jiannan Wu, Tiezheng Yu, Jierun Chen, Haoli Bai, Lu Hou, Lanqing Hong, Wei Zhang, Nevin L. Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.18745
Pdf link: https://arxiv.org/pdf/2512.18745
Abstract The ability for AI agents to "think with images" requires a sophisticated blend of reasoning and perception. However, current open multimodal agents still largely fall short on the reasoning aspect crucial for real-world tasks like analyzing documents with dense charts/diagrams and navigating maps. To address this gap, we introduce O3-Bench, a new benchmark designed to evaluate multimodal reasoning with interleaved attention to visual details. O3-Bench features challenging problems that require agents to piece together subtle visual information from distinct image areas through multi-step reasoning. The problems are highly challenging even for frontier systems like OpenAI o3, which only obtains 40.8% accuracy on O3-Bench. To make progress, we propose InSight-o3, a multi-agent framework consisting of a visual reasoning agent (vReasoner) and a visual search agent (vSearcher) for which we introduce the task of generalized visual search -- locating relational, fuzzy, or conceptual regions described in free-form language, beyond just simple objects or figures in natural images. We then present a multimodal LLM purpose-trained for this task via reinforcement learning. As a plug-and-play agent, our vSearcher empowers frontier multimodal models (as vReasoners), significantly improving their performance on a wide range of benchmarks. This marks a concrete step towards powerful o3-like open systems. Our code and dataset can be found at this https URL.
中文摘要 人工智能智能体“用图像思考”的能力需要复杂的推理与感知结合。然而，目前开放的多模态代理在分析复杂图表/图表文件和导航地图等现实任务中至关重要的推理方面仍然大有不足。为弥补这一空白，我们推出了O3-Bench，这是一个旨在评估多模态推理并交织关注视觉细节的新基准。O3-Bench 具有挑战性的问题，要求代理通过多步推理拼凑出不同图像区域的微妙视觉信息。即使是像OpenAI o3这样的前沿系统，这些问题也极具挑战性，后者在O3-Bench上仅有40.8%的准确率。为了取得进展，我们提出了InSight-o3，这是一个由视觉推理代理（vReasoner）和视觉搜索代理（vSearcher）组成的多智能体框架，我们引入了广义视觉搜索任务——定位以自由形式语言描述的关系性、模糊或概念区域，而不仅仅是自然图像中的简单物体或图形。随后，我们展示了一个通过强化学习专门训练的多模态大型语言模型。作为即插即用的代理，我们的vSearcher为前沿多模态模型（vReasoner）赋能，显著提升了其在多种基准测试中的表现。这标志着迈向强大氧气开放系统的具体一步。我们的代码和数据集可在此 https URL 找到。

Gaussian-Mixture-Model Q-Functions for Policy Iteration in Reinforcement Learning

高斯混合模型Q函数用于强化学习中的策略迭代

Authors: Minh Vu, Konstantinos Slavakis
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.18763
Pdf link: https://arxiv.org/pdf/2512.18763
Abstract Unlike their conventional use as estimators of probability density functions in reinforcement learning (RL), this paper introduces a novel function-approximation role for Gaussian mixture models (GMMs) as direct surrogates for Q-function losses. These parametric models, termed GMM-QFs, possess substantial representational capacity, as they are shown to be universal approximators over a broad class of functions. They are further embedded within Bellman residuals, where their learnable parameters -- a fixed number of mixing weights, together with Gaussian mean vectors and covariance matrices -- are inferred from data via optimization on a Riemannian manifold. This geometric perspective on the parameter space naturally incorporates Riemannian optimization into the policy-evaluation step of standard policy-iteration frameworks. Rigorous theoretical results are established, and supporting numerical tests show that, even without access to experience data, GMM-QFs deliver competitive performance and, in some cases, outperform state-of-the-art approaches across a range of benchmark RL tasks, all while maintaining a significantly smaller computational footprint than deep-learning methods that rely on experience data.
中文摘要 与它们作为强化学习（RL）概率密度函数估计器的传统用途不同，本文引入了高斯混合模型（GMMs）作为Q函数损失直接替代的新功能功能。这些参数模型被称为GMM-QF，具有相当的表示能力，因为它们被证明是广泛类别函数的通用近似器。它们进一步嵌入在贝尔曼残差中，通过黎曼流形上的优化从数据推断出其可学习参数——固定数量的混合权重，以及高斯均向量和协方差矩阵。这种对参数空间的几何视角自然地将黎曼优化纳入标准策略迭代框架的策略评估步骤。建立了严谨的理论结果，支持性的数值测试表明，即使没有经验数据，GMM-QF在多种基准强化学习任务中仍能提供竞争性能，甚至在某些情况下超越最先进方法，同时保持显著更小于依赖经验数据的深度学习方法的计算占用。

MaskFocus: Focusing Policy Optimization on Critical Steps for Masked Image Generation

MaskFocus：聚焦于掩面图像生成的关键步骤的策略优化

Authors: Guohui Zhang, Hu Yu, Xiaoxiao Ma, Yaning Pan, Hang Xu, Feng Zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.18766
Pdf link: https://arxiv.org/pdf/2512.18766
Abstract Reinforcement learning (RL) has demonstrated significant potential for post-training language models and autoregressive visual generative models, but adapting RL to masked generative models remains challenging. The core factor is that policy optimization requires accounting for the likelihood of each step due to its multi-step and iterative refinement process. This reliance on entire sampling trajectories introduces high computational cost, whereas naively optimizing random steps often yields suboptimal results. In this paper, we present MaskFocus, a novel RL framework that achieves effective policy optimization for masked generative models by focusing on critical steps. Specifically, we determine the step-level information gain by measuring the similarity between the intermediate images at each sampling step and the final generated image. Crucially, we leverage this to identify the most critical and valuable steps and execute focused policy optimization on them. Furthermore, we design a dynamic routing sampling mechanism based on entropy to encourage the model to explore more valuable masking strategies for samples with low entropy. Extensive experiments on multiple Text-to-Image benchmarks validate the effectiveness of our method.
中文摘要 强化学习（RL）已在训练后语言模型和自回归视觉生成模型中展现出显著潜力，但将强化学习适应到掩蔽生成模型仍具挑战。核心因素是，策略优化需要考虑每一步的概率可能性，因为其多步且反复完善。这种依赖整个采样轨迹的做法带来了高计算成本，而原生优化随机步骤往往会产生次优结果。本文介绍了MaskFocus，一种新型强化学习框架，通过聚焦关键步骤实现掩蔽生成模型的有效策略优化。具体来说，我们通过测量每个采样步骤中中间图像与最终生成图像之间的相似性来确定步级信息增益。关键是，我们利用这一点识别最关键且最有价值的步骤，并针对这些步骤实施针对性的政策优化。此外，我们设计了基于熵的动态路由采样机制，鼓励模型探索低熵样本中更有价值的遮蔽策略。通过多项文本转图像基准测试的广泛实验验证了我们方法的有效性。
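
One way to picture the step-selection idea is sketched below: score each sampling step by how much closer its intermediate image moves to the final image, then keep the top-scoring steps for policy optimization. The abstract does not name the similarity measure or define the gain precisely, so the cosine similarity over flattened feature vectors, the difference-based gain, and the names `cosine` and `critical_steps` are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two flattened image (or feature) vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def critical_steps(intermediates, k):
    """Return the indices of the k steps with the largest similarity gain.

    intermediates[t] is the flattened image after sampling step t; the last
    entry is the final generated image. The gain of step 0 is measured
    against a zero-similarity starting point.
    """
    final = intermediates[-1]
    sims = [cosine(x, final) for x in intermediates]
    gains = [sims[0]] + [sims[t] - sims[t - 1] for t in range(1, len(sims))]
    ranked = sorted(range(len(gains)), key=lambda t: gains[t], reverse=True)
    return sorted(ranked[:k])
```

Restricting the policy-gradient update to the returned indices is what would avoid computing likelihoods over the whole sampling trajectory.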

From Word to World: Can Large Language Models be Implicit Text-based World Models?

从文字到世界：大型语言模型可以成为隐式基于文本的世界模型吗？

Authors: Yixia Li, Hongru Wang, Jiahao Qiu, Zhenfei Yin, Dongdong Zhang, Cheng Qian, Zeping Li, Pony Ma, Guanhua Chen, Heng Ji, Mengdi Wang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.18832
Pdf link: https://arxiv.org/pdf/2512.18832
Abstract Agentic reinforcement learning increasingly relies on experience-driven scaling, yet real-world environments remain non-adaptive, limited in coverage, and difficult to scale. World models offer a potential way to improve learning efficiency through simulated experience, but it remains unclear whether large language models can reliably serve this role and under what conditions they meaningfully benefit agents. We study these questions in text-based environments, which provide a controlled setting to reinterpret language modeling as next-state prediction under interaction. We introduce a three-level framework for evaluating LLM-based world models: (i) fidelity and consistency, (ii) scalability and robustness, and (iii) agent utility. Across five representative environments, we find that sufficiently trained world models maintain coherent latent state, scale predictably with data and model size, and improve agent performance via action verification, synthetic trajectory generation, and warm-starting reinforcement learning. Meanwhile, these gains depend critically on behavioral coverage and environment complexity, delineating clear boundaries on when world modeling effectively supports agent learning.
中文摘要 智能体强化学习越来越依赖经验驱动的扩展，但现实环境仍然缺乏适应性、覆盖范围有限且难以扩展。世界模型提供了通过模拟经验提升学习效率的潜在途径，但目前尚不清楚大型语言模型能否可靠地承担这一角色，以及它们在何种条件下能切实使智能体受益。我们在基于文本的环境中研究这些问题，这类环境提供了一个受控的设定，可以将语言建模重新解释为交互下的下一状态预测。我们引入了一个评估基于LLM的世界模型的三层框架：（i）保真度与一致性，（ii）可扩展性与鲁棒性，（iii）智能体效用。在五个代表性环境中，我们发现经过充分训练的世界模型能够保持连贯的潜在状态，随数据和模型规模可预测地扩展，并通过动作验证、合成轨迹生成和热启动强化学习提升智能体性能。与此同时，这些收益关键取决于行为覆盖率和环境复杂性，从而明确界定了世界建模何时能有效支持智能体学习。

InDRiVE: Reward-Free World-Model Pretraining for Autonomous Driving via Latent Disagreement

InDRiVE：通过潜在分歧实现无奖励的世界模型自动驾驶预训练

Authors: Feeza Khan Khanzada, Jaerock Kwon
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.18850
Pdf link: https://arxiv.org/pdf/2512.18850
Abstract Model-based reinforcement learning (MBRL) can reduce interaction cost for autonomous driving by learning a predictive world model, but it typically still depends on task-specific rewards that are difficult to design and often brittle under distribution shift. This paper presents InDRiVE, a DreamerV3-style MBRL agent that performs reward-free pretraining in CARLA using only intrinsic motivation derived from latent ensemble disagreement. Disagreement acts as a proxy for epistemic uncertainty and drives the agent toward under-explored driving situations, while an imagination-based actor-critic learns a planner-free exploration policy directly from the learned world model. After intrinsic pretraining, we evaluate zero-shot transfer by freezing all parameters and deploying the pretrained exploration policy in unseen towns and routes. We then study few-shot adaptation by training a task policy with limited extrinsic feedback for downstream objectives (lane following and collision avoidance). Experiments in CARLA across towns, routes, and traffic densities show that disagreement-based pretraining yields stronger zero-shot robustness and robust few-shot collision avoidance under town shift and matched interaction budgets, supporting the use of intrinsic disagreement as a practical reward-free pretraining signal for reusable driving world models.
中文摘要 基于模型的强化学习（MBRL）可以通过学习预测性世界模型来降低自动驾驶的交互成本，但通常仍依赖于任务特定的奖励，这些奖励难以设计，且在分布偏移下往往很脆弱。本文提出InDRiVE，一种DreamerV3风格的MBRL智能体，仅利用源自潜在集成分歧的内在动机，在CARLA中进行无奖励的预训练。分歧作为认知不确定性的代理指标，推动智能体走向未被充分探索的驾驶情境，而基于想象的演员-评论家则直接从学到的世界模型中学习无需规划器的探索策略。经过内在预训练后，我们冻结所有参数，并在未见过的城镇和路线上部署预训练的探索策略，以评估零样本迁移。随后，我们针对下游目标（车道保持和避免碰撞），用有限的外部反馈训练任务策略，研究少样本适应。在CARLA中跨城镇、路线和交通密度的实验显示，在城镇变化和相同交互预算下，基于分歧的预训练带来更强的零样本鲁棒性和稳健的少样本避撞能力，支持将内在分歧作为可复用驾驶世界模型的实用无奖励预训练信号。
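As a rough illustration of how ensemble disagreement can serve as an intrinsic reward, the sketch below scores a state-action pair by the variance of the next-latent predictions across ensemble members. This is a generic formulation, not InDRiVE's code: the function name `disagreement_reward`, the use of population variance, and the averaging over latent dimensions are assumptions.

```python
from statistics import pvariance

def disagreement_reward(predictions):
    """Intrinsic reward from ensemble disagreement.

    predictions: list of K latent vectors, one per ensemble member,
    all predicting the next latent state for the same (state, action).
    Returns the per-dimension variance across members, averaged over
    dimensions (an assumed aggregation).
    """
    per_dimension = list(zip(*predictions))  # regroup values by latent dimension
    return sum(pvariance(dim) for dim in per_dimension) / len(per_dimension)

# Members agree: reward is zero, so the state is not worth exploring further.
agree = disagreement_reward([[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]])
# Members disagree on the first dimension: reward is positive.
disagree = disagreement_reward([[0.0, 0.0], [2.0, 0.0]])
```

High values mark transitions the world model has not yet learned well, which is why the abstract describes disagreement as a proxy for epistemic uncertainty.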

CORE: Concept-Oriented Reinforcement for Bridging the Definition-Application Gap in Mathematical Reasoning

CORE：以概念为导向的强化，弥合数学推理中定义与应用之间的鸿沟

Authors: Zijun Gao, Zhikun Xu, Xiao Ye, Ben Zhou
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.18857
Pdf link: https://arxiv.org/pdf/2512.18857
Abstract Large language models (LLMs) often solve challenging math exercises yet fail to apply the concept right when the problem requires genuine understanding. Popular Reinforcement Learning with Verifiable Rewards (RLVR) pipelines reinforce final answers but provide little fine-grained conceptual signal, so models improve at pattern reuse rather than conceptual applications. We introduce CORE (Concept-Oriented REinforcement), an RL training framework that turns explicit concepts into a controllable supervision signal. Starting from a high-quality, low-contamination textbook resource that links verifiable exercises to concise concept descriptions, we run a sanity probe showing LLMs can restate definitions but fail concept-linked quizzes, quantifying the conceptual reasoning gap. CORE then (i) synthesizes concept-aligned quizzes, (ii) injects brief concept snippets during rollouts to elicit concept-primed trajectories, and (iii) reinforces conceptual reasoning via trajectory replacement after group failures, a lightweight forward-KL constraint that aligns unguided with concept-primed policies, or standard GRPO directly on concept-aligned quizzes. Across several models, CORE delivers consistent gains over vanilla and SFT baselines on both in-domain concept-exercise suites and diverse out-of-domain math benchmarks. CORE unifies direct training on concept-aligned quizzes and concept-injected rollouts under outcome regularization. It provides fine-grained conceptual supervision that bridges problem-solving competence and genuine conceptual reasoning, while remaining algorithm- and verifier-agnostic.
中文摘要 大型语言模型（LLMs）常常能解决具有挑战性的数学习题，但在问题需要真正理解时，却无法正确应用概念。流行的可验证奖励强化学习（RLVR）流水线强化最终答案，但几乎不提供细粒度的概念信号，因此模型提升的是模式重用能力而非概念应用能力。我们引入了CORE（Concept-Oriented REinforcement，概念导向强化），这是一种将显式概念转化为可控监督信号的强化学习训练框架。我们从一个高质量、低污染的教科书资源出发，该资源将可验证的练习与简明的概念描述关联起来；我们进行了一项合理性探测，表明大型语言模型可以重述定义，但在与概念关联的测验中失败，从而量化了概念推理的差距。随后，CORE（i）合成与概念对齐的测验，（ii）在rollout过程中注入简短的概念片段以引出概念引导的轨迹，（iii）通过以下方式强化概念推理：组失败后的轨迹替换、一种将无引导策略与概念引导策略对齐的轻量级前向KL约束，或直接在概念对齐的测验上使用标准GRPO。在多个模型上，CORE在领域内概念练习套件和多样的域外数学基准上均稳定优于原始模型和SFT基线。CORE在结果正则化下统一了在概念对齐测验上的直接训练和概念注入的rollout。它提供细粒度的概念监督，连接解题能力与真正的概念推理，同时保持与算法和验证器无关。

Remedy-R: Generative Reasoning for Machine Translation Evaluation without Error Annotations

Remedy-R：无错误注释的机器翻译评估生成推理

Authors: Shaomu Tan, Ryosuke Mitani, Ritvik Choudhary, Qiyu Wu, Toshiyuki Sekiya, Christof Monz
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.18906
Pdf link: https://arxiv.org/pdf/2512.18906
Abstract Over the years, automatic MT metrics have hillclimbed benchmarks and presented strong and sometimes human-level agreement with human ratings. Yet they remain black-box, offering little insight into their decision-making and often failing under real-world out-of-distribution (OOD) inputs. We introduce Remedy-R, a reasoning-driven generative MT metric trained with reinforcement learning from pairwise translation preferences, without requiring error-span annotations or distillation from closed LLMs. Remedy-R produces step-by-step analyses of accuracy, fluency, and completeness, followed by a final score, enabling more interpretable assessments. With only 60K training pairs across two language pairs, Remedy-R remains competitive with top scalar metrics and GPT-4-based judges on WMT22-24 meta-evaluation, generalizes to other languages, and exhibits strong robustness on OOD stress tests. Moreover, Remedy-R models generate self-reflective feedback that can be reused for translation improvement. Building on this finding, we introduce Remedy-R Agent, a simple evaluate-revise pipeline that leverages Remedy-R's evaluation analysis to refine translations. This agent consistently improves translation quality across diverse models, including Qwen2.5, ALMA-R, GPT-4o-mini, and Gemini-2.0-Flash, suggesting that Remedy-R's reasoning captures translation-relevant information and is practically useful.
中文摘要 多年来，自动机器翻译指标不断刷新基准，并与人类评分呈现出很强的、有时达到人类水平的一致性。然而，它们依然是黑箱，几乎无法让人了解其决策过程，且常常在现实中的分布外（OOD）输入下失效。我们引入了Remedy-R，一种推理驱动的生成式机器翻译指标，通过基于成对翻译偏好的强化学习训练，无需错误片段标注，也无需从闭源大型语言模型中蒸馏。Remedy-R对准确性、流畅性和完整性进行逐步分析，随后给出最终评分，使评估更具可解释性。Remedy-R仅使用两个语言对上的6万个训练对，在WMT22-24元评估中仍能与顶级标量指标和基于GPT-4的评判器相竞争，能够泛化到其他语言，并在OOD压力测试中表现出很强的鲁棒性。此外，Remedy-R模型还能生成自我反思式反馈，这些反馈可复用于翻译改进。基于这一发现，我们提出了Remedy-R Agent，一个简单的评估-修订流程，利用Remedy-R的评估分析来优化翻译。该智能体在包括Qwen2.5、ALMA-R、GPT-4o-mini和Gemini-2.0-Flash在内的多种模型上持续提升翻译质量，表明Remedy-R的推理能够捕捉与翻译相关的信息并具有实际价值。

A Framework for Deploying Learning-based Quadruped Loco-Manipulation

基于学习的四足机器人移动操作部署框架

Authors: Yadong Liu, Jianwei Liu, He Liang, Dimitrios Kanoulas
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.18938
Pdf link: https://arxiv.org/pdf/2512.18938
Abstract Quadruped mobile manipulators offer strong potential for agile loco-manipulation but remain difficult to control and transfer reliably from simulation to reality. Reinforcement learning (RL) shows promise for whole-body control, yet most frameworks are proprietary and hard to reproduce on real hardware. We present an open pipeline for training, benchmarking, and deploying RL-based controllers on the Unitree B1 quadruped with a Z1 arm. The framework unifies sim-to-sim and sim-to-real transfer through ROS, re-implementing a policy trained in Isaac Gym, extending it to MuJoCo via a hardware abstraction layer, and deploying the same controller on physical hardware. Sim-to-sim experiments expose discrepancies between Isaac Gym and MuJoCo contact models that influence policy behavior, while real-world teleoperated object-picking trials show that coordinated whole-body control extends reach and improves manipulation over floating-base baselines. The pipeline provides a transparent, reproducible foundation for developing and analyzing RL-based loco-manipulation controllers and will be released open source to support future research.
中文摘要 四足移动机械臂在敏捷的移动操作（loco-manipulation）方面潜力巨大，但仍然难以控制，也难以可靠地从仿真迁移到现实。强化学习（RL）在全身控制方面展现出潜力，但大多数框架是专有的，难以在真实硬件上复现。我们提出了一个开放的流程，用于在搭载Z1机械臂的Unitree B1四足机器人上训练、评测和部署基于RL的控制器。该框架通过ROS统一了仿真到仿真和仿真到现实的迁移：重新实现了在Isaac Gym中训练的策略，通过硬件抽象层将其扩展到MuJoCo，并在物理硬件上部署同一控制器。仿真到仿真实验揭示了Isaac Gym与MuJoCo接触模型之间的差异，这些差异会影响策略行为；而现实世界中的遥操作物体拾取试验表明，与浮动基座基线相比，协调的全身控制能够扩大可达范围并改进操作能力。该流程为基于强化学习的移动操作控制器的开发和分析提供了透明、可复现的基础，并将开源以支持未来研究。

Training Multimodal Large Reasoning Models Needs Better Thoughts: A Three-Stage Framework for Long Chain-of-Thought Synthesis and Selection

训练多模态大型推理模型需要更好的思考：一个用于长思维链合成与选择的三阶段框架

Authors: Yizhi Wang, Linan Yue, Min-Ling Zhang
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.18956
Pdf link: https://arxiv.org/pdf/2512.18956
Abstract Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex reasoning tasks through long Chain-of-Thought (CoT) reasoning. Extending these successes to multimodal reasoning remains challenging due to the increased complexity of integrating diverse input modalities and the scarcity of high-quality long CoT training data. Existing multimodal datasets and CoT synthesis methods still suffer from limited reasoning depth, modality conversion errors, and rigid generation pipelines, hindering model performance and stability. To this end, in this paper, we propose SynSelect, a novel three-stage Synthesis-Selection framework for generating high-quality long CoT data tailored to multimodal reasoning tasks. Specifically, SynSelect first leverages multiple heterogeneous multimodal LRMs to produce diverse candidate CoTs, and then applies both instance and batch level selection to filter high-quality CoTs that can effectively enhance the model's reasoning capabilities. Extensive experiments on multiple multimodal benchmarks demonstrate that models supervised fine-tuned on SynSelect-generated data significantly outperform baselines and achieve further improvements after reinforcement learning post-training. Our results validate SynSelect as an effective approach for advancing multimodal LRMs reasoning capabilities.
中文摘要 大型推理模型（LRM）通过长思维链（CoT）推理，在复杂推理任务中表现出色。由于整合多样输入模态的复杂度增加以及高质量长CoT训练数据稀缺，将这些成功推广到多模态推理仍然具有挑战性。现有的多模态数据集和CoT合成方法仍存在推理深度有限、模态转换错误和生成流程僵化的问题，影响了模型性能和稳定性。为此，本文提出了SynSelect，一种新的三阶段合成-选择框架，用于生成专为多模态推理任务定制的高质量长CoT数据。具体来说，SynSelect首先利用多个异构多模态LRM生成多样化的候选CoT，然后应用实例级和批次级选择，筛选出能够有效提升模型推理能力的高质量CoT。在多个多模态基准上的大量实验表明，在SynSelect生成的数据上进行监督微调的模型显著优于基线，并在强化学习后训练之后取得进一步提升。我们的结果验证了SynSelect是提升多模态LRM推理能力的有效方法。

Scaling Online Distributionally Robust Reinforcement Learning: Sample-Efficient Guarantees with General Function Approximation

在线分布鲁棒强化学习扩展：带有一般函数近似的样本高效保证

Authors: Debamita Ghosh, George K. Atia, Yue Wang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.18957
Pdf link: https://arxiv.org/pdf/2512.18957
Abstract The deployment of reinforcement learning (RL) agents in real-world applications is often hindered by performance degradation caused by mismatches between training and deployment environments. Distributionally robust RL (DR-RL) addresses this issue by optimizing worst-case performance over an uncertainty set of transition dynamics. However, existing work typically relies on substantial prior knowledge (such as access to a generative model or a large offline dataset) and largely focuses on tabular methods that do not scale to complex domains. We overcome these limitations by proposing an online DR-RL algorithm with general function approximation that learns an optimal robust policy purely through interaction with the environment, without requiring prior models or offline data, enabling deployment in high-dimensional tasks. We further provide a theoretical analysis establishing a near-optimal sublinear regret bound under a total variation uncertainty set, demonstrating the sample efficiency and effectiveness of our method.
中文摘要 在现实应用中部署强化学习（RL）智能体时，训练环境与部署环境不匹配常常导致性能下降。分布鲁棒强化学习（DR-RL）通过在转移动态的不确定性集合上优化最坏情况性能来解决这个问题。然而，现有工作通常依赖大量先验知识（例如可访问生成模型或大型离线数据集），且主要聚焦于无法扩展到复杂领域的表格型方法。我们提出一种采用通用函数近似的在线DR-RL算法来克服这些局限，该算法仅通过与环境的交互学习最优鲁棒策略，无需先验模型或离线数据，从而能够部署于高维任务。我们还提供了理论分析，在全变差不确定性集合下建立了近乎最优的次线性遗憾界，展示了我们方法的样本效率和有效性。
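To make "worst-case performance over a total variation uncertainty set" concrete, here is a minimal sketch for a finite set of next states. It computes the smallest expected next-state value over all distributions within total variation distance `rho` of a nominal distribution, which is the inner step of a robust Bellman backup. It is a textbook-style tabular illustration, not the paper's algorithm (which uses general function approximation); the function name and the convention TV = half the L1 distance are assumptions.

```python
def worst_case_expectation(p0, values, rho):
    """Minimize sum_i p[i] * values[i] over distributions p satisfying
    0.5 * sum_i |p[i] - p0[i]| <= rho.

    Greedy solution: move up to rho probability mass from the
    highest-value states onto the single lowest-value state.
    Returns (worst-case expectation, worst-case distribution).
    """
    p = list(p0)
    lowest = min(range(len(values)), key=lambda i: values[i])
    budget = rho
    # Visit states from highest to lowest value and drain their mass.
    for i in sorted(range(len(values)), key=lambda i: values[i], reverse=True):
        if i == lowest or budget <= 0:
            continue
        moved = min(p[i], budget)
        p[i] -= moved
        p[lowest] += moved
        budget -= moved
    return sum(pi * vi for pi, vi in zip(p, values)), p

# Nominal model: two next states, equally likely, with values 0 and 1.
robust_value, adversarial_p = worst_case_expectation([0.5, 0.5], [0.0, 1.0], rho=0.2)
```

With `rho = 0` the result is the ordinary expectation, and larger radii give more pessimistic values, which is the trade-off a robust policy optimizes against.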

ORPR: An OR-Guided Pretrain-then-Reinforce Learning Model for Inventory Management

ORPR：一种由运筹学指导的库存管理预训练再强化学习模型

Authors: Lingjie Zhao, Xue Yu, Yongzhi Qi, Hao Hu, Jianshen Zhang, Yingzheng Ma, Shuyu Han, Wei Qi, Zuo-Jun Max Shen
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.19001
Pdf link: https://arxiv.org/pdf/2512.19001
Abstract As the pursuit of synergy between Artificial Intelligence (AI) and Operations Research (OR) gains momentum in handling complex inventory systems, a critical challenge persists: how to effectively reconcile AI's adaptive perception with OR's structural rigor. To bridge this gap, we propose a novel OR-Guided "Pretrain-then-Reinforce" framework. To provide structured guidance, we propose a simulation-augmented OR model that generates high-quality reference decisions, implicitly capturing complex business constraints and managerial preferences. Leveraging these OR-derived decisions as foundational training labels, we design a domain-informed deep learning foundation model to establish foundational decision-making capabilities, followed by a reinforcement learning (RL) fine-tuning stage. Uniquely, we position RL as a deep alignment mechanism that enables the AI agent to internalize the optimality principles of OR, while simultaneously leveraging exploration for general policy refinement and allowing expert guidance for scenario-specific adaptation (e.g., promotional events). Validated through extensive numerical experiments and a field deployment at this http URL augmented by a Difference-in-Differences (DiD) analysis, our model significantly outperforms incumbent industrial practices, delivering real-world gains of a 5.27-day reduction in turnover and a 2.29% increase in in-stock rates, alongside a 29.95% decrease in holding costs. Contrary to the prevailing trend of brute-force model scaling, our study demonstrates that a lightweight, domain-informed model can deliver state-of-the-art performance and robust transferability when guided by structured OR logic. This approach offers a scalable and cost-effective paradigm for intelligent supply chain management, highlighting the value of deeply aligning AI with OR.
中文摘要 随着人工智能（AI）与运筹学（OR）在处理复杂库存系统方面的协同探索日益升温，一个关键挑战依然存在：如何有效调和人工智能的自适应感知与运筹学的结构严谨性。为弥合这一差距，我们提出了一种新的运筹学引导的“先预训练后强化”框架。为提供结构化指导，我们提出一个仿真增强的运筹学模型，生成高质量的参考决策，隐式捕捉复杂的业务约束和管理偏好。利用这些由运筹学得出的决策作为基础训练标签，我们设计了一个融入领域知识的深度学习基础模型，以建立基础决策能力，随后进入强化学习（RL）微调阶段。独特的是，我们将强化学习定位为一种深度对齐机制，使人工智能智能体能够内化运筹学的最优性原则，同时利用探索进行一般性的策略改进，并允许专家指导针对特定情景（如促销活动）的适应。通过大量数值实验和在该 http URL 进行的现场部署，并辅以双重差分（DiD）分析，我们的模型显著优于现有的工业实践，实际实现了周转天数减少5.27天、现货率提高2.29%，同时持有成本降低29.95%。与当前盛行的暴力扩大模型规模的趋势相反，我们的研究表明，在结构化运筹学逻辑的引导下，轻量级、融入领域知识的模型能够提供最先进的性能和稳健的可迁移性。这种方法为智能供应链管理提供了可扩展且具成本效益的范式，凸显了将人工智能与运筹学深度对齐的价值。

CoDrone: Autonomous Drone Navigation Assisted by Edge and Cloud Foundation Models

CoDrone：由边缘和云端基础模型辅助的自主无人机导航

Authors: Pengyu Chen, Tao Ouyang, Ke Luo, Weijie Hong, Xu Chen
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.19083
Pdf link: https://arxiv.org/pdf/2512.19083
Abstract Autonomous navigation for Unmanned Aerial Vehicles faces key challenges from limited onboard computational resources, which restrict deployed deep neural networks to shallow architectures incapable of handling complex environments. Offloading tasks to remote edge servers introduces high latency, creating an inherent trade-off in system design. To address these limitations, we propose CoDrone - the first cloud-edge-end collaborative computing framework integrating foundation models into autonomous UAV cruising scenarios - effectively leveraging foundation models to enhance performance of resource-constrained unmanned aerial vehicle platforms. To reduce onboard computation and data transmission overhead, CoDrone employs grayscale imagery for the navigation model. When enhanced environmental perception is required, CoDrone leverages the edge-assisted foundation model Depth Anything V2 for depth estimation and introduces a novel one-dimensional occupancy grid-based navigation method - enabling fine-grained scene understanding while advancing efficiency and representational simplicity of autonomous navigation. A key component of CoDrone is a Deep Reinforcement Learning-based neural scheduler that seamlessly integrates depth estimation with autonomous navigation decisions, enabling real-time adaptation to dynamic environments. Furthermore, the framework introduces a UAV-specific vision language interaction module incorporating domain-tailored low-level flight primitives to enable effective interaction between the cloud foundation model and the UAV. The introduction of VLM enhances open-set reasoning capabilities in complex unseen scenarios. Experimental results show CoDrone outperforms baseline methods under varying flight speeds and network conditions, achieving a 40% increase in average flight distance and a 5% improvement in average Quality of Navigation.
中文摘要 无人机的自主导航面临着机载计算资源有限的主要挑战，这些资源限制了部署的深度神经网络只能使用无法处理复杂环境的浅层架构。将任务卸载到远程边缘服务器会带来高延迟，从而在系统设计中带来固有的权衡。为解决这些局限，我们提出了CoDrone——首个将基础模型集成到自主无人机巡航场景的云端协作计算框架，有效利用基础模型提升资源受限无人机平台的性能。为减少机载计算和数据传输开销，CoDrone在导航模型中使用灰度图像。当需要增强环境感知时，CoDrone利用边缘辅助基础模型Depth Anything V2进行深度估计，并引入了一种新颖的一维占用网格导航方法——实现细致的场景理解，同时提升自主导航的效率和表征简洁性。CoDrone的一个关键组成部分是基于深度强化学习的神经调度器，能够无缝整合深度估计与自主导航决策，实现对动态环境的实时适应。此外，该框架引入了无人机专用视觉语言交互模块，集成了领域定制的低级别飞行原语，以实现云基础模型与无人机之间的有效交互。VLM的引入增强了在复杂未见场景中的开放集推理能力。实验结果显示，CoDrone在不同飞行速度和网络条件下优于基线方法，平均飞行距离增加了40%，平均导航质量提升了5%。

Tool-Augmented Hybrid Ensemble Reasoning with Distillation for Bilingual Mathematical Problem Solving

面向双语数学问题求解的工具增强混合集成推理与蒸馏

Authors: Peiqing Lu, Yuan Zhang, Haoyun Zhang, Jiasen Zheng, Kejian Tong, Wenjun Wu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.19093
Pdf link: https://arxiv.org/pdf/2512.19093
Abstract Bilingual mathematical problem solving needs a clear link between language reasoning and symbolic calculation. Large language models often handle language well but are weak in accurate computation. This paper presents HERALD (Hybrid Ensemble Reasoning with Adaptive Learning and Distillation), a framework that joins reasoning and calculation using NuminaMath-7B-TIR, GPT-4o, and Mistral-7B. HERALD uses adaptive routing, tool-based reinforcement learning, and knowledge distillation to connect different reasoning paths. Confidence calibration keeps weighting stable, and dual-path checking keeps results correct. Reinforcement learning controls tool use to cut redundancy, and distillation lowers delay without hurting accuracy. The system shows that combining symbolic checking, adaptive ensembles, and bilingual fine-tuning helps achieve both fluent reasoning and precise calculation. HERALD offers a practical solution for multilingual mathematical reasoning with better accuracy, stability, and clarity.
中文摘要 双语数学问题解决需要语言推理与符号计算之间有明确的联系。大型语言模型通常能很好地处理语言，但在精确计算方面较弱。本文介绍了HERALD（混合集合推理与自适应学习与提炼）框架，该框架结合了推理与计算，使用NuminaMath-7B-TIR、GPT-4o和Mistral-7B。HERALD通过自适应路由、基于工具的强化学习和知识提炼来连接不同的推理路径。置信校准保持权重稳定，双路径检查保持结果准确。强化学习控制工具用于减少冗余，蒸馏降低延迟而不影响准确性。该系统表明，结合符号检查、自适应集合和双语微调，有助于实现流畅推理和精确计算。HERALD为多语言数学推理提供了实用的解决方案，具有更高的准确性、稳定性和清晰度。

AWPO: Enhancing Tool-Use of Large Language Models through Explicit Integration of Reasoning Rewards

AWPO：通过明确整合推理奖励，提升大型语言模型的工具使用能力

Authors: Zihan Lin, Xiaohan Wang, Hexiong Yang, Jiajun Chai, Jie Cao, Guojun Yin, Wei Lin, Ran He
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.19126
Pdf link: https://arxiv.org/pdf/2512.19126
Abstract While reinforcement learning (RL) shows promise in training tool-use large language models (LLMs) using verifiable outcome rewards, existing methods largely overlook the potential of explicit reasoning rewards to bolster reasoning and tool utilization. Furthermore, naively combining reasoning and outcome rewards may yield suboptimal performance or conflict with the primary optimization objective. To address this, we propose advantage-weighted policy optimization (AWPO), a principled RL framework that effectively integrates explicit reasoning rewards to enhance tool-use capability. AWPO incorporates variance-aware gating and difficulty-aware weighting to adaptively modulate advantages from reasoning signals based on group-relative statistics, alongside a tailored clipping mechanism for stable optimization. Extensive experiments demonstrate that AWPO achieves state-of-the-art performance across standard tool-use benchmarks, significantly outperforming strong baselines and leading closed-source models in challenging multi-turn scenarios. Notably, with exceptional parameter efficiency, our 4B model surpasses Grok-4 by 16.0 percent in multi-turn accuracy while preserving generalization capability on the out-of-distribution MMLU-Pro benchmark.
中文摘要 虽然强化学习（RL）在使用可验证的结果奖励训练工具使用型大型语言模型（LLM）方面展现出前景，但现有方法大多忽视了显式推理奖励在增强推理和工具利用方面的潜力。此外，朴素地结合推理奖励和结果奖励可能导致性能次优，或与主要优化目标冲突。为此，我们提出了优势加权策略优化（AWPO），一种有原则的强化学习框架，有效整合显式推理奖励，以增强工具使用能力。AWPO引入了方差感知门控和难度感知加权，基于组相对统计量自适应地调节来自推理信号的优势，并配合定制的裁剪机制以实现稳定优化。大量实验表明，AWPO在标准工具使用基准上实现了最先进的性能，在具有挑战性的多轮场景中显著优于强基线和领先的闭源模型。值得注意的是，凭借出色的参数效率，我们的4B模型在多轮准确率上比Grok-4高出16.0%，同时在分布外的MMLU-Pro基准上保持了泛化能力。
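The "group-relative statistics" mentioned in the AWPO abstract can be illustrated with the usual GRPO-style normalization, where each sampled response is scored against the mean and standard deviation of its own group. The sketch below is only that baseline normalization plus a caller-supplied weight for the reasoning signal; the abstract does not specify AWPO's variance-aware gating or difficulty-aware weighting, so `gate` is a hypothetical placeholder parameter and both function names are assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def combined_advantages(outcome_rewards, reasoning_rewards, gate):
    """Add a gated reasoning advantage to the outcome advantage.

    gate is a hypothetical scalar in [0, 1] supplied by the caller;
    the rule AWPO uses to compute it is not given in the abstract.
    """
    outcome = group_relative_advantages(outcome_rewards)
    reasoning = group_relative_advantages(reasoning_rewards)
    return [o + gate * r for o, r in zip(outcome, reasoning)]

# Four rollouts for one prompt: two succeed, two fail.
advantages = combined_advantages([1.0, 0.0, 1.0, 0.0], [0.8, 0.6, 0.2, 0.4], gate=0.5)
```

Setting `gate=0` recovers outcome-only training. When every rollout in a group receives the same outcome reward, the outcome advantages are all zero, which is one situation where a reasoning signal could still provide a gradient.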

WorldRFT: Latent World Model Planning with Reinforcement Fine-Tuning for Autonomous Driving

WorldRFT：自动驾驶的潜在世界模型规划与强化微调

Authors: Pengxuan Yang, Ben Lu, Zhongpu Xia, Chao Han, Yinfeng Gao, Teng Zhang, Kun Zhan, XianPeng Lang, Yupeng Zheng, Qichao Zhang
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.19133
Pdf link: https://arxiv.org/pdf/2512.19133
Abstract Latent World Models enhance scene representation through temporal self-supervised learning, presenting a perception annotation-free paradigm for end-to-end autonomous driving. However, the reconstruction-oriented representation learning tangles perception with planning tasks, leading to suboptimal optimization for planning. To address this challenge, we propose WorldRFT, a planning-oriented latent world model framework that aligns scene representation learning with planning via a hierarchical planning decomposition and local-aware interactive refinement mechanism, augmented by reinforcement learning fine-tuning (RFT) to enhance safety-critical policy performance. Specifically, WorldRFT integrates a vision-geometry foundation model to improve 3D spatial awareness, employs hierarchical planning task decomposition to guide representation optimization, and utilizes local-aware iterative refinement to derive a planning-oriented driving policy. Furthermore, we introduce Group Relative Policy Optimization (GRPO), which applies trajectory Gaussianization and collision-aware rewards to fine-tune the driving policy, yielding systematic improvements in safety. WorldRFT achieves state-of-the-art (SOTA) performance on both open-loop nuScenes and closed-loop NavSim benchmarks. On nuScenes, it reduces collision rates by 83% (0.30% -> 0.05%). On NavSim, using camera-only sensors input, it attains competitive performance with the LiDAR-based SOTA method DiffusionDrive (87.8 vs. 88.1 PDMS).
中文摘要 潜在世界模型通过时序自监督学习增强场景表示，为端到端自动驾驶提供了一种无需感知标注的范式。然而，以重建为导向的表示学习将感知与规划任务纠缠在一起，导致对规划的优化不够理想。为应对这一挑战，我们提出了WorldRFT，这是一个以规划为导向的潜在世界模型框架，通过层级规划分解和局部感知的交互式细化机制，使场景表示学习与规划对齐，并辅以强化学习微调（RFT）以提升安全关键策略的性能。具体来说，WorldRFT集成了视觉几何基础模型以提升三维空间感知，采用层级规划任务分解来指导表示优化，并利用局部感知的迭代细化得到以规划为导向的驾驶策略。此外，我们引入了组相对策略优化（GRPO），利用轨迹高斯化和碰撞感知奖励微调驾驶策略，系统性地提升安全性。WorldRFT在开环nuScenes和闭环NavSim基准上均实现了最先进的（SOTA）性能。在nuScenes上，碰撞率降低了83%（0.30% -> 0.05%）。在NavSim上，仅使用摄像头传感器输入，其性能可与基于激光雷达的SOTA方法DiffusionDrive相竞争（87.8对88.1 PDMS）。
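The 83% figure quoted for nuScenes is a relative reduction, not a drop of 83 percentage points. Using only the two collision rates given in the abstract, the arithmetic works out as follows, together with the NavSim gap to DiffusionDrive:

```latex
\[
\text{relative reduction}
  = \frac{0.30\% - 0.05\%}{0.30\%}
  = \frac{0.25}{0.30}
  \approx 0.833 \;(\approx 83\%),
\qquad
\text{absolute reduction} = 0.30\% - 0.05\% = 0.25 \text{ percentage points}.
\]
\[
\text{NavSim gap} = 88.1 - 87.8 = 0.3 \text{ PDMS}.
\]
```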

RMLer: Synthesizing Novel Objects across Diverse Categories via Reinforcement Mixing Learning

RMLer：通过强化混合学习合成跨不同类别的新颖对象

Authors: Jun Li, Zikun Chen, Haibo Chen, Shuo Chen, Jian Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.19300
Pdf link: https://arxiv.org/pdf/2512.19300
Abstract Novel object synthesis by integrating distinct textual concepts from diverse categories remains a significant challenge in Text-to-Image (T2I) generation. Existing methods often suffer from insufficient concept mixing, lack of rigorous evaluation, and suboptimal outputs, manifesting as conceptual imbalance, superficial combinations, or mere juxtapositions. To address these limitations, we propose Reinforcement Mixing Learning (RMLer), a framework that formulates cross-category concept fusion as a reinforcement learning problem: mixed features serve as states, mixing strategies as actions, and visual outcomes as rewards. Specifically, we design an MLP-policy network to predict dynamic coefficients for blending cross-category text embeddings. We further introduce visual rewards based on (1) semantic similarity and (2) compositional balance between the fused object and its constituent concepts, optimizing the policy via proximal policy optimization. At inference, a selection strategy leverages these rewards to curate the highest-quality fused objects. Extensive experiments demonstrate RMLer's superiority in synthesizing coherent, high-fidelity objects from diverse categories, outperforming existing methods. Our work provides a robust framework for generating novel visual concepts, with promising applications in film, gaming, and design.
中文摘要 通过整合来自不同类别的不同文本概念来合成新颖对象，仍然是文本到图像（T2I）生成中的一大挑战。现有方法常常存在概念混合不足、缺乏严格评估和输出次优的问题，表现为概念失衡、表面化的组合或仅仅是并置。为解决这些局限性，我们提出了强化混合学习（RMLer）框架，将跨类别概念融合表述为强化学习问题：混合特征作为状态，混合策略作为动作，视觉结果作为奖励。具体来说，我们设计了一个MLP策略网络，用于预测混合跨类别文本嵌入的动态系数。我们进一步引入基于（1）语义相似性和（2）融合对象与其组成概念之间的组合平衡的视觉奖励，并通过近端策略优化来优化策略。在推理时，一种选择策略利用这些奖励筛选出质量最高的融合对象。大量实验证明，RMLer在合成来自不同类别的连贯、高保真对象方面优于现有方法。我们的工作为生成新颖视觉概念提供了稳健的框架，在电影、游戏和设计领域具有应用前景。
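The "dynamic coefficients for blending cross-category text embeddings" can be pictured as the action of the policy. The sketch below assumes the simplest possible form, a per-dimension convex combination of two embeddings; the abstract does not state the blending rule, so the function name `blend_embeddings` and the convex form are assumptions, and in RMLer the coefficients would come from the MLP policy rather than being fixed by hand.

```python
def blend_embeddings(embedding_a, embedding_b, coefficients):
    """Per-dimension convex blend: c * a + (1 - c) * b.

    coefficients stand in for the policy's action; each value is
    expected to lie in [0, 1] (an assumed convention).
    """
    if not (len(embedding_a) == len(embedding_b) == len(coefficients)):
        raise ValueError("embeddings and coefficients must have equal length")
    if any(c < 0.0 or c > 1.0 for c in coefficients):
        raise ValueError("coefficients must lie in [0, 1]")
    return [c * a + (1.0 - c) * b
            for a, b, c in zip(embedding_a, embedding_b, coefficients)]

# Toy 2-D "text embeddings" for two concepts, blended equally.
mixed = blend_embeddings([1.0, 0.0], [0.0, 1.0], [0.5, 0.5])
```

The mixed embedding would then condition the image generator, and the resulting image would be scored by the similarity and balance rewards described in the abstract.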

Bridging Semantics and Geometry: A Decoupled LVLM-SAM Framework for Reasoning Segmentation in Remote Sensing

桥接语义与几何：一个用于遥感推理分割的解耦LVLM-SAM框架

Authors: Xu Zhang, Junyao Ge, Yang Zheng, Kaitai Guo, Jimin Liang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.19302
Pdf link: https://arxiv.org/pdf/2512.19302
Abstract Large Vision-Language Models (LVLMs) hold great promise for advancing remote sensing (RS) analysis, yet existing reasoning segmentation frameworks couple linguistic reasoning and pixel prediction through end-to-end supervised fine-tuning, leading to weak geometric grounding and limited generalization across tasks. To address this, we developed Think2Seg-RS, a decoupled framework that trains an LVLM prompter to control a frozen Segment Anything Model (SAM) via structured geometric prompts. Through a mask-only reinforcement learning objective, the LVLM learns to translate abstract semantic reasoning into spatially grounded actions, achieving state-of-the-art performance on the EarthReason dataset. Remarkably, the learned prompting policy generalizes zero-shot to multiple referring segmentation benchmarks, exposing a distinct divide between semantic-level and instance-level grounding. We further found that compact segmenters outperform larger ones under semantic-level supervision, and that negative prompts are ineffective in heterogeneous aerial backgrounds. Together, these findings establish semantic-level reasoning segmentation as a new paradigm for geospatial understanding, opening the way toward unified, interpretable LVLM-driven Earth observation. Our code and model are available at this https URL.
中文摘要 大型视觉语言模型（LVLM）在推进遥感（RS）分析方面具有巨大潜力，但现有的推理分割框架通过端到端监督微调将语言推理与像素预测耦合在一起，导致几何定位能力薄弱，跨任务泛化有限。为此，我们开发了Think2Seg-RS，一个解耦框架，训练LVLM提示器通过结构化几何提示来控制冻结的Segment Anything Model（SAM）。通过仅基于掩码的强化学习目标，LVLM学会将抽象语义推理转化为空间上有依据的动作，在EarthReason数据集上实现了最先进的性能。值得注意的是，所学的提示策略能够零样本泛化到多个指代分割基准，揭示了语义级与实例级定位之间的明显分界。我们还发现，在语义级监督下，紧凑的分割器优于较大的分割器，而负提示在异质的航拍背景中无效。这些发现共同确立了语义级推理分割作为地理空间理解的新范式，为统一、可解释的LVLM驱动地球观测开辟了道路。我们的代码和模型可在该 https URL 访问。

Learning-Assisted Multi-Operator Variable Neighborhood Search for Urban Cable Routing

面向城市电缆布线的学习辅助多算子变邻域搜索

Authors: Wei Liu, Tao Zhang, Chenhui Lin, Kaiwen Li, Rui Wang
Subjects: Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/2512.19321
Pdf link: https://arxiv.org/pdf/2512.19321
Abstract Urban underground cable construction is essential for enhancing the reliability of city power grids, yet its high construction costs make planning a worthwhile optimization task. In urban environments, road layouts tightly constrain cable routing. This, on the one hand, renders relation-only models (i.e., those without explicit routes) used in prior work overly simplistic, and on the other hand, dramatically enlarges the combinatorial search space, thereby imposing much higher demands on algorithm design. In this study, we formulate urban cable routing as a connectivity-path co-optimization problem and propose a learning-assisted multi-operator variable neighborhood search (L-MVNS) algorithm. The framework first introduces an auxiliary task to generate high-quality feasible initial solutions. A hybrid genetic search (HGS) and A* serve as the connectivity optimizer and the route-planning optimizer, respectively. Building on these, a multi-operator variable neighborhood search (MVNS) iteratively co-optimizes inter-substation connectivity and detailed routes via three complementary destruction operators, a modified A* repair operator, and an adaptive neighborhood-sizing mechanism. A multi-agent deep reinforcement learning module is further embedded to prioritize promising neighborhoods. We also construct a standardized and scalable benchmark suite for evaluation. Across these cases, comprehensive experiments demonstrate effectiveness and stability: relative to representative approaches, MVNS and L-MVNS reduce total construction cost by approximately 30-50%, with L-MVNS delivering additional gains on larger instances and consistently higher stability.
中文摘要 城市地下电缆建设对于提升城市电网的可靠性至关重要，但其高昂的建设成本使得规划成为一项值得研究的优化任务。在城市环境中，道路布局严格限制了电缆的布线。一方面，这使得先前工作中使用的仅关系模型（即无显式路径的模型）过于简化；另一方面，极大地扩大了组合搜索空间，从而对算法设计提出了更高的要求。本研究将城市电缆布线表述为连通性-路径协同优化问题，并提出了一种学习辅助多算子变邻域搜索（L-MVNS）算法。该框架首先引入辅助任务，以生成高质量的可行初始解。混合遗传搜索（HGS）和A*分别作为连通性优化器和路径规划优化器。在此基础上，多算子变邻域搜索（MVNS）通过三个互补的破坏算子、一个改进的A*修复算子和自适应邻域大小机制，迭代地协同优化变电站间连通性和详细路径。此外还嵌入了一个多智能体深度强化学习模块，以优先考虑有前景的邻域。我们还构建了一个标准化且可扩展的基准测试套件用于评估。在这些案例中，综合实验证明了其有效性和稳定性：相较于代表性方法，MVNS和L-MVNS可降低约30-50%的总建造成本，L-MVNS在更大实例上带来额外收益，并始终保持更高的稳定性。

Enhancing PLS of Indoor IRS-VLC Systems for Colluding and Non-Colluding Eavesdroppers

增强室内IRS-VLC系统针对合谋与非合谋窃听者的PLS

Authors: Rashid Iqbal, Ahmed Zoha, Salama Ikki, Muhammad Ali Imran, Hanaa Abumarshoud
Subjects: Information Theory (cs.IT)
Arxiv link: https://arxiv.org/abs/2512.19339
Pdf link: https://arxiv.org/pdf/2512.19339
Abstract Most intelligent reflecting surface (IRS)-aided indoor visible light communication (VLC) studies ignore the time delays introduced by reflected paths, even though these delays are inherent in practical wideband systems. In this work, we adopt a realistic assumption of IRS-induced time delay for physical layer security (PLS) enhancement. We consider an indoor VLC system where an IRS is used to shape the channel so that the reflected signals add constructively at the legitimate user and create intersymbol interference at eavesdroppers located inside the coverage area. The resulting secrecy capacity maximisation over the IRS element allocation is formulated as a complex combinatorial optimisation problem and is solved using deep reinforcement learning with proximal policy optimisation (PPO). The approach is evaluated for both colluding eavesdroppers, which combine their received signals, and non-colluding eavesdroppers, which act independently. Simulation results are shown for various simulation setups, which demonstrate significant secrecy capacity gains. In a worst-case scenario, where the eavesdroppers have stronger channels than the legitimate user, the proposed PPO-based IRS allocation improves secrecy capacity by 107% and 235% in the colluding and non-colluding cases, respectively, compared with allocating all IRS elements to the legitimate user. These results demonstrate that time-delay-based IRS control can provide a strong secrecy advantage in practical indoor VLC scenarios.
中文摘要 大多数智能反射面（IRS）辅助室内可见光通信（VLC）研究忽略了反射路径带来的时间延迟，尽管这些延迟在实际宽带系统中是固有的。本研究采用了IRS诱导时间延迟的现实假设，用于物理层安全（PLS）增强。我们考虑一个室内VLC系统，利用IRS来塑造信道，使反射信号在合法用户处形成建设性叠加，并在覆盖区内的窃听者时产生符号间干扰。所得的IRS元素分配的保密容量最大化被表述为一个复杂的组合优化问题，并通过深度强化学习结合近端策略优化（PPO）求解。该方法适用于合谋窃听者（将接收到的信号合并）和非合谋窃听者（独立行动）。展示了各种模拟设置的模拟结果，显示出显著的保密容量提升。在最坏情况下，窃听者拥有比合法用户更强的渠道，基于PPO的IRS分配在勾结和非勾结情况下分别提升了107%和235%的保密能力，而将所有IRS元素分配给合法用户。这些结果表明，基于时间延迟的IRS控制在实际室内VLC场景中可以提供强大的保密优势。
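
The difference between the colluding and non-colluding cases can be illustrated with a small sketch. It is not the paper's channel model: it assumes a Shannon-type rate expression, models colluding eavesdroppers by summing their SINRs (as in maximal-ratio combining) and non-colluding ones by taking the strongest individual SINR. The function name `secrecy_capacity` and its parameters are hypothetical.

```python
import math
from typing import Sequence


def secrecy_capacity(sinr_user: float,
                     sinr_eves: Sequence[float],
                     colluding: bool) -> float:
    """Secrecy capacity in bits/s/Hz under a simplified Shannon-type model.

    Assumption (not from the paper): colluding eavesdroppers combine their
    signals so their SINRs add; non-colluding eavesdroppers are represented
    by the single strongest eavesdropper.
    """
    if colluding:
        sinr_eve = sum(sinr_eves)
    else:
        sinr_eve = max(sinr_eves, default=0.0)
    rate_user = math.log2(1.0 + sinr_user)
    rate_eve = math.log2(1.0 + sinr_eve)
    # Secrecy capacity is clipped at zero when the eavesdropper is stronger.
    return max(0.0, rate_user - rate_eve)


if __name__ == "__main__":
    print(secrecy_capacity(15.0, [1.0, 2.0], colluding=False))
    print(secrecy_capacity(15.0, [1.0, 2.0], colluding=True))
```

Under these assumptions, collusion can only lower the secrecy capacity for the same channel gains, which is why the two cases are evaluated separately.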

First-Order Representation Languages for Goal-Conditioned RL

目标条件强化学习的一阶表示语言

Authors: Simon Ståhlberg, Hector Geffner
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.19355
Pdf link: https://arxiv.org/pdf/2512.19355
Abstract First-order relational languages have been used in MDP planning and reinforcement learning (RL) for two main purposes: specifying MDPs in compact form, and representing and learning policies that are general and not tied to specific instances or state spaces. In this work, we instead consider the use of first-order languages in goal-conditioned RL and generalized planning. The question is how to learn goal-conditioned and general policies when the training instances are large and the goal cannot be reached by random exploration alone. The technique of Hindsight Experience Replay (HER) provides an answer to this question: it relabels unsuccessful trajectories as successful ones by replacing the original goal with one that was actually achieved. If the target policy must generalize across states and goals, trajectories that do not reach the original goal states can enable more data- and time-efficient learning. In this work, we show that further performance gains can be achieved when states and goals are represented by sets of atoms. We consider three versions: goals as full states, goals as subsets of the original goals, and goals as lifted versions of these subgoals. The result is that the latter two successfully learn general policies on large planning instances with sparse rewards by automatically creating a curriculum of easier goals of increasing complexity. The experiments illustrate the computational gains of these versions, their limitations, and opportunities for addressing them.
中文摘要 一阶关系语言在MDP规划与强化学习（RL）中主要用于两个目的：以紧凑形式指定MDP，以及表示和学习那些通用的策略，而不绑定于特定实例或状态空间。在本研究中，我们转而考虑一阶语言在目标条件强化学习和广义规划中的应用。问题在于，当训练实例很大且目标无法仅靠随机探索达成时，如何学习目标条件和通用策略。事后诸葛亮经验重放（HER）技术为这个问题提供了答案：它通过用实际实现的目标替代原目标，将未成功的轨迹重新标记为成功轨迹。如果目标策略必须跨状态和目标泛化，未能达到原始目标状态的轨迹可以实现更高效的数据和时间效率学习。本研究表明，当状态和目标用原子集合表示时，性能提升可以进一步实现。我们考虑三个版本：目标作为完整状态，目标作为原始目标的子集，以及目标作为这些子目标的提升版本。结果是，后两者通过自动创建更简单的、越来越复杂目标的课程，成功掌握了奖励稀少的大型规划实例的通用政策。这些实验展示了这些版本的计算收益、局限性以及解决这些问题的机会。
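
The relabeling step described in the abstract can be sketched as follows, with states and goals represented as sets of ground atoms. This is a minimal illustration and not the authors' implementation: the trajectory format, the sparse 0/1 reward, and the names `relabel_with_hindsight` and `achieved_subgoal` are hypothetical, and the lifted-goal variant is not shown.

```python
from typing import FrozenSet, List, Tuple

Atom = str
State = FrozenSet[Atom]
# Hypothetical transition format: (state, action, next_state).
Transition = Tuple[State, str, State]
Relabeled = Tuple[State, str, State, FrozenSet[Atom], float]


def goal_reached(state: State, goal: FrozenSet[Atom]) -> bool:
    """A goal given as a set of atoms holds when all its atoms are true."""
    return goal <= state


def achieved_subgoal(original_goal: FrozenSet[Atom],
                     final_state: State) -> FrozenSet[Atom]:
    """'Goal as subset of the original goal': the goal atoms that were achieved."""
    return original_goal & final_state


def relabel_with_hindsight(trajectory: List[Transition],
                           new_goal: FrozenSet[Atom]) -> List[Relabeled]:
    """Relabel a trajectory with a goal it actually achieved.

    Emits (state, action, next_state, goal, reward) tuples with reward 1.0
    on the first step whose next state satisfies the new goal, then stops.
    """
    relabeled: List[Relabeled] = []
    for state, action, next_state in trajectory:
        reward = 1.0 if goal_reached(next_state, new_goal) else 0.0
        relabeled.append((state, action, next_state, new_goal, reward))
        if reward == 1.0:
            break
    return relabeled


if __name__ == "__main__":
    s0 = frozenset({"on(a,b)"})
    s1 = frozenset({"on(a,b)", "clear(c)"})
    s2 = frozenset({"on(a,b)", "clear(c)", "on(c,a)"})
    traj = [(s0, "a1", s1), (s1, "a2", s2)]
    original_goal = frozenset({"on(c,a)", "on(d,c)"})  # never fully reached
    print(relabel_with_hindsight(traj, s2))                          # full state
    print(relabel_with_hindsight(traj, achieved_subgoal(original_goal, s2)))
```

Passing the final state as `new_goal` corresponds to the "goals as full states" version, while `achieved_subgoal` gives one plausible reading of "goals as subsets of the original goals".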

Interpretable Hybrid Deep Q-Learning Framework for IoT-Based Food Spoilage Prediction with Synthetic Data Generation and Hardware Validation

可解释的混合深度Q学习框架，用于基于物联网的食品变质预测，结合合成数据生成和硬件验证

Authors: Isshaan Singh, Divyansh Chawla, Anshu Garg, Shivin Mangal, Pallavi Gupta, Khushi Agarwal, Nimrat Singh Khalsa, Nandan Patel
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.19361
Pdf link: https://arxiv.org/pdf/2512.19361
Abstract The need for an intelligent, real-time spoilage prediction system has become critical in modern IoT-driven food supply chains, where perishable goods are highly susceptible to environmental conditions. Existing methods often lack adaptability to dynamic conditions and fail to optimize decision making in real time. To address these challenges, we propose a hybrid reinforcement learning framework integrating Long Short-Term Memory (LSTM) and Recurrent Neural Networks (RNN) for enhanced spoilage prediction. This hybrid architecture captures temporal dependencies within sensor data, enabling robust and adaptive decision making. In alignment with interpretable artificial intelligence principles, a rule-based classifier environment is employed to provide transparent ground truth labeling of spoilage levels based on domain-specific thresholds. This structured design allows the agent to operate within clearly defined semantic boundaries, supporting traceable and interpretable decisions. Model behavior is monitored using interpretability-driven metrics, including spoilage accuracy, reward-to-step ratio, loss reduction rate, and exploration decay. These metrics provide both quantitative performance evaluation and insights into learning dynamics. A class-wise spoilage distribution visualization is used to analyze the agent's decision profile and policy behavior. Extensive evaluations on simulated and real-time hardware data demonstrate that the LSTM and RNN based agent outperforms alternative reinforcement learning approaches in prediction accuracy and decision efficiency while maintaining interpretability. The results highlight the potential of hybrid deep reinforcement learning with integrated interpretability for scalable IoT-based food monitoring systems.
中文摘要 在现代物联网驱动的食品供应链中，智能实时变质预测系统的需求变得至关重要，因为易腐商品极易受环境条件影响。现有方法往往缺乏对动态条件的适应能力，且无法实时优化决策。为应对这些挑战，我们提出了一个集成长短期记忆（LSTM）和循环神经网络（RNN）的混合强化学习框架，以增强变质预测。这种混合架构捕捉了传感器数据中的时间依赖性，实现了稳健且自适应的决策。根据可解释人工智能原则，采用基于规则的分类器环境，基于特定领域阈值，提供透明的变质等级真实标注。这种结构化设计使智能体能够在明确定义的语义边界内操作，支持可追溯和可解释的决策。模型行为通过可解释性驱动的指标进行监控，包括变质预测准确率、奖励与步数比、损失下降率和探索衰减。这些指标既提供了量化的性能评估，也提供了对学习动态的洞见。采用按类别的变质分布可视化来分析智能体的决策特征和策略行为。对模拟和实时硬件数据的广泛评估表明，基于LSTM和RNN的智能体在预测准确性和决策效率上优于其他强化学习方法，同时保持可解释性。结果凸显了集成可解释性的混合深度强化学习在可扩展物联网食品监控系统中的潜力。

Learning General Policies with Policy Gradient Methods

利用策略梯度方法学习通用策略

Authors: Simon Ståhlberg, Blai Bonet, Hector Geffner
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.19366
Pdf link: https://arxiv.org/pdf/2512.19366
Abstract While reinforcement learning methods have delivered remarkable results in a number of settings, generalization, i.e., the ability to produce policies that generalize in a reliable and systematic way, has remained a challenge. The problem of generalization has been addressed formally in classical planning where provably correct policies that generalize over all instances of a given domain have been learned using combinatorial methods. The aim of this work is to bring these two research threads together to illuminate the conditions under which (deep) reinforcement learning approaches, and in particular, policy optimization methods, can be used to learn policies that generalize like combinatorial methods do. We draw on lessons learned from previous combinatorial and deep learning approaches, and extend them in a convenient way. From the former, we model policies as state transition classifiers, as (ground) actions are not general and change from instance to instance. From the latter, we use graph neural networks (GNNs) adapted to deal with relational structures for representing value functions over planning states, and in our case, policies. With these ingredients in place, we find that actor-critic methods can be used to learn policies that generalize almost as well as those obtained using combinatorial approaches while avoiding the scalability bottleneck and the use of feature pools. Moreover, the limitations of the DRL methods on the benchmarks considered have little to do with deep learning or reinforcement learning algorithms, and result from the well-understood expressive limitations of GNNs, and the tradeoff between optimality and generalization (general policies cannot be optimal in some domains). Both of these limitations are addressed without changing the basic DRL methods by adding derived predicates and an alternative cost structure to optimize.
中文摘要 虽然强化学习方法在多个场景中取得了显著成果，但泛化，即能够以可靠且系统的方式制定策略的能力，仍然是一大挑战。泛化问题在经典规划中已被正式解决，通过组合方法学习了可证明的正确策略，这些策略能够推广到给定域的所有实例。本研究旨在将这两条研究线索结合起来，阐明（深度）强化学习方法，特别是策略优化方法，在何种条件下可以用来学习像组合方法那样泛化的策略。我们借鉴了以往组合和深度学习方法的经验教训，并以便捷的方式进行扩展。基于前者，我们将策略建模为状态转换分类器，因为（地面）动作并非通用的，且会在不同实例之间变化。基于后者，我们使用图神经网络（GNN），适应处理关系结构，用于表示计划状态上的价值函数，在我们的案例中是策略。有了这些要素，我们发现可以用actor-critic方法学习几乎与组合方法相同的策略，同时避免了可扩展性瓶颈和特征池的使用。此外，DRL方法在所考虑基准测试上的局限性与深度学习或强化学习算法关系不大，而是源于GNN表达上的局限性，以及最优与泛化之间的权衡（一般策略在某些领域无法最优）。这两个限制都通过添加导出谓词和替代成本结构来优化，解决了基本的 DRL 方法。

CodeSimpleQA: Scaling Factuality in Code Large Language Models

CodeSimpleQA：在代码大型语言模型中提升事实性

Authors: Jian Yang, Wei Zhang, Yizhi Li, Shawn Guo, Haowen Wang, Aishan Liu, Ge Zhang, Zili Wang, Zhoujun Li, Xianglong Liu, Weifeng Lv
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.19424
Pdf link: https://arxiv.org/pdf/2512.19424
Abstract Large language models (LLMs) have made significant strides in code generation, achieving impressive capabilities in synthesizing code snippets from natural language instructions. However, a critical challenge remains in ensuring LLMs generate factually accurate responses about programming concepts, technical implementations, etc. Most previous code-related benchmarks focus on code execution correctness, overlooking the factual accuracy of programming knowledge. To address this gap, we present CodeSimpleQA, a comprehensive bilingual benchmark designed to evaluate the factual accuracy of code LLMs in answering code-related questions, which contains carefully curated question-answer pairs in both English and Chinese, covering diverse programming languages and major computer science domains. Further, we create CodeSimpleQA-Instruct, a large-scale instruction corpus with 66M samples, and develop a post-training framework combining supervised fine-tuning and reinforcement learning. Our comprehensive evaluation of diverse LLMs reveals that even frontier LLMs struggle with code factuality. Our proposed framework demonstrates substantial improvements over the base model, underscoring the critical importance of factuality-aware alignment in developing reliable code LLMs.
中文摘要 大型语言模型（LLMs）在代码生成方面取得了显著进步，能够从自然语言指令合成代码片段，实现了令人印象深刻的能力。然而，确保LLM能够生成关于编程概念、技术实现等事实准确的回答，仍是一个关键挑战。以往大多数与代码相关的基准测试关注代码执行的正确性，忽视了编程知识的事实准确性。为弥补这一空白，我们推出了CodeSimpleQA，这是一个全面的双语基准测试，旨在评估代码大型语言模型在回答代码相关问题时的事实准确性，包含精心策划的英文和中文问答对，涵盖多种编程语言和主要计算机科学领域。此外，我们创建了 CodeSimpleQA-Instruct，一个拥有 6600 万样本的大规模教学语料库，并开发了一个结合监督微调和强化学习的培训后框架。我们对各种大型语言模型的全面评估显示，即使是前沿的大型语言模型也难以体现代码的真实性。我们提出的框架相较基础模型有显著改进，强调了事实意识对齐在开发可靠代码大型语言模型中的关键性。

LacaDM: A Latent Causal Diffusion Model for Multiobjective Reinforcement Learning

LacaDM：多目标强化学习的潜在因果扩散模型

Authors: Xueming Yan, Bo Yin, Yaochu Jin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.19516
Pdf link: https://arxiv.org/pdf/2512.19516
Abstract Multiobjective reinforcement learning (MORL) poses significant challenges due to the inherent conflicts between objectives and the difficulty of adapting to dynamic environments. Traditional methods often struggle to generalize effectively, particularly in large and complex state-action spaces. To address these limitations, we introduce the Latent Causal Diffusion Model (LacaDM), a novel approach designed to enhance the adaptability of MORL in discrete and continuous environments. Unlike existing methods that primarily address conflicts between objectives, LacaDM learns latent temporal causal relationships between environmental states and policies, enabling efficient knowledge transfer across diverse MORL scenarios. By embedding these causal structures within a diffusion model-based framework, LacaDM achieves a balance between conflicting objectives while maintaining strong generalization capabilities in previously unseen environments. Empirical evaluations on various tasks from the MOGymnasium framework demonstrate that LacaDM consistently outperforms the state-of-the-art baselines in terms of hypervolume, sparsity, and expected utility maximization, showcasing its effectiveness in complex multiobjective tasks.
中文摘要 多目标强化学习（MORL）由于目标之间的固有冲突以及适应动态环境的困难，带来了重大挑战。传统方法常常难以有效泛化，尤其是在大型复杂状态作用空间中。为解决这些局限性，我们引入了潜在因果扩散模型（LacaDM），这是一种新颖的方法，旨在增强MORL在离散和连续环境中的适应性。与主要解决目标冲突的现有方法不同，LacaDM学习环境状态与政策之间潜在的时间因果关系，实现跨多样MORL场景的高效知识转移。通过将这些因果结构嵌入基于扩散模型的框架，LacaDM在保持前所未见环境中强有力的泛化能力的同时，实现了相互冲突的目标之间的平衡。针对MOGymnasium框架中多个任务的实证评估表明，LacaDM在超大容量、稀疏性和期望效用最大化方面持续优于最先进基线，展示了其在复杂多目标任务中的有效性。
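
For readers unfamiliar with the hypervolume metric mentioned in the abstract, a minimal two-objective sketch follows. It assumes both objectives are maximized and that a reference point is supplied by the user; the function name `hypervolume_2d` is hypothetical and this is not the evaluation code used in the paper.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]


def hypervolume_2d(points: Iterable[Point], ref: Point) -> float:
    """Area dominated by `points` and bounded below by `ref` (maximization).

    Sweeps the points from the largest first objective to the smallest and
    adds the new horizontal strip each non-dominated point contributes.
    """
    # Only points that strictly improve on the reference point contribute.
    pts = [(x, y) for x, y in points if x > ref[0] and y > ref[1]]
    pts.sort(key=lambda p: p[0], reverse=True)
    area = 0.0
    best_y = ref[1]
    for x, y in pts:
        if y > best_y:
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


if __name__ == "__main__":
    front = [(3.0, 1.0), (2.0, 2.0), (1.0, 3.0)]
    print(hypervolume_2d(front, ref=(0.0, 0.0)))
```

A larger hypervolume means the returned set of policies covers more of the objective space relative to the reference point; dominated points add nothing to the total.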

CARE What Fails: Contrastive Anchored-REflection for Verifiable Multimodal

注意哪些失败：可验证多模态的对比锚定反射

Authors: Yongxin Wang, Zhicheng Yang, Meng Cao, Mingfei Han, Haokun Lin, Yingying Zhu, Xiaojun Chang, Xiaodan Liang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.19554
Pdf link: https://arxiv.org/pdf/2512.19554
Abstract Group-relative reinforcement learning with verifiable rewards (RLVR) often wastes the most informative data it already has: the failures. When all rollouts are wrong, gradients stall; when one happens to be correct, the update usually ignores why the others are close-but-wrong, and credit can be misassigned to spurious chains. We present CARE (Contrastive Anchored REflection), a failure-centric post-training framework for multimodal reasoning that turns errors into supervision. CARE combines: (i) an anchored-contrastive objective that forms a compact subgroup around the best rollout and a set of semantically proximate hard negatives, performs within-subgroup z-score normalization with negative-only scaling, and includes an all-negative rescue to prevent zero-signal batches; and (ii) Reflection-Guided Resampling (RGR), a one-shot structured self-repair that rewrites a representative failure and re-scores it with the same verifier, converting near-misses into usable positives without any test-time reflection. CARE improves accuracy and training smoothness while explicitly increasing the share of learning signal that comes from failures. On Qwen2.5-VL-7B, CARE lifts macro-averaged accuracy by 4.6 points over GRPO across six verifiable visual-reasoning benchmarks; with Qwen3-VL-8B it reaches competitive or state-of-the-art results on MathVista and MMMU-Pro under an identical evaluation protocol.
中文摘要 带有可验证奖励的群体相对强化学习（RLVR）常常浪费了它已有的最有信息量的数据，导致失败。当所有滑行都不对时，梯度会停滞;当其中一条是正确的时，更新通常忽略了其他链接接近但错误的原因，导致信用错误分配给虚假链条。我们介绍CARE（对比锚定再指法），这是一个以失败为中心的多模态推理后培训框架，将错误转化为监督。CARE结合了：（i）锚定对比目标，围绕最佳展开形成紧凑子群，并有一组语义上接近的硬阴性，执行子组内z分数归一化并仅有负数尺度，并包含全负救援以防止零信号批次;以及（ii）反射引导重采样（RGR），这是一种一次性结构化自我修复，重写代表性失败并用同一验证器重新评分，将近距离未中转化为可用正值，无需测试时间反射。CARE提升了准确性和训练流畅度，同时明确增加了来自失败的学习信号比例。在Qwen2.5-VL-7B上，CARE在六个可验证的视觉推理基准测试中，宏观平均准确率比GRPO提升了4.6个百分点;通过Qwen3-VL-8B，在MathVista和MMMU-Pro上采用相同的评估协议，取得了具有竞争力或最先进的成绩。
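
The "gradients stall when all rollouts are wrong" observation follows directly from how group-relative advantages are usually computed. The sketch below shows the plain z-score normalization used in GRPO-style training, not CARE's anchored-contrastive objective; the function name and the `eps` stabilizer are assumptions for illustration.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-8) -> List[float]:
    """Z-score each rollout's reward within its group.

    If every rollout in the group gets the same reward (for example all are
    wrong), every advantage is zero and the group contributes no gradient.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))  # zero-signal group
    print(group_relative_advantages([1.0, 0.0, 0.0, 0.0]))  # one correct rollout
```

In the second call the three failures all receive the same negative advantage regardless of how close each was to being correct, which is the information the abstract says is being wasted.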

LeLaR: The First In-Orbit Demonstration of an AI-Based Satellite Attitude Controller

LeLaR：基于人工智能的卫星姿态控制器的首次在轨演示

Authors: Kirill Djebko, Tom Baumann, Erik Dilger, Frank Puppe, Sergio Montenegro
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2512.19576
Pdf link: https://arxiv.org/pdf/2512.19576
Abstract Attitude control is essential for many satellite missions. Classical controllers, however, are time-consuming to design and sensitive to model uncertainties and variations in operational boundary conditions. Deep Reinforcement Learning (DRL) offers a promising alternative by learning adaptive control strategies through autonomous interaction with a simulation environment. Overcoming the Sim2Real gap, which involves deploying an agent trained in simulation onto the real physical satellite, remains a significant challenge. In this work, we present the first successful in-orbit demonstration of an AI-based attitude controller for inertial pointing maneuvers. The controller was trained entirely in simulation and deployed to the InnoCube 3U nanosatellite, which was developed by the Julius-Maximilians-Universität Würzburg in cooperation with the Technische Universität Berlin, and launched in January 2025. We present the AI agent design, the methodology of the training procedure, the discrepancies between the simulation and the observed behavior of the real satellite, and a comparison of the AI-based attitude controller with the classical PD controller of InnoCube. Steady-state metrics confirm the robust performance of the AI-based controller during repeated in-orbit maneuvers.
中文摘要 姿态控制对许多卫星任务至关重要。然而，经典控制器设计耗时，且对模型不确定性和操作边界条件的变化敏感。深度强化学习（DRL）通过与仿真环境自主交互来学习自适应控制策略，提供了一种有前景的替代方案。克服Sim2Real差距（即将在仿真中训练的智能体部署到真实的物理卫星上）仍是一项重大挑战。本研究首次成功完成了基于人工智能的惯性指向机动姿态控制器的在轨演示。该控制器完全在仿真中训练，并部署到由维尔茨堡尤利乌斯-马克西米利安大学与柏林工业大学合作开发的InnoCube 3U纳米卫星上，该卫星于2025年1月发射。我们介绍了人工智能智能体的设计、训练过程的方法、仿真与真实卫星观测行为之间的差异，以及基于人工智能的姿态控制器与InnoCube经典PD控制器的比较。稳态指标证实了基于人工智能的控制器在多次在轨机动中的稳健性能。

Learning Generalizable Hand-Object Tracking from Synthetic Demonstrations

从合成演示中学习可推广的手对象跟踪

Authors: Yinhuai Wang, Runyi Yu, Hok Wai Tsui, Xiaoyi Lin, Hui Zhang, Qihan Zhao, Ke Fan, Miao Li, Jie Song, Jingbo Wang, Qifeng Chen, Ping Tan
Subjects: Robotics (cs.RO); Graphics (cs.GR)
Arxiv link: https://arxiv.org/abs/2512.19583
Pdf link: https://arxiv.org/pdf/2512.19583
Abstract We present a system for learning generalizable hand-object tracking controllers purely from synthetic data, without requiring any human demonstrations. Our approach makes two key contributions: (1) HOP, a Hand-Object Planner, which can synthesize diverse hand-object trajectories; and (2) HOT, a Hand-Object Tracker that bridges synthetic-to-physical transfer through reinforcement learning and interaction imitation learning, delivering a generalizable controller conditioned on target hand-object states. Our method extends to diverse object shapes and hand morphologies. Through extensive evaluations, we show that our approach enables dexterous hands to track challenging, long-horizon sequences including object re-arrangement and agile in-hand reorientation. These results represent a significant step toward scalable foundation controllers for manipulation that can learn entirely from synthetic data, breaking the data bottleneck that has long constrained progress in dexterous manipulation.
中文摘要 我们提出了一套系统，能够纯粹从合成数据中学习可泛化的手-物体跟踪控制器，无需任何人类演示。我们的方法有两个关键贡献：（1）HOP，一种手-物体规划器，可以合成多样的手-物体轨迹；以及（2）HOT，一种手-物体跟踪器，通过强化学习和交互模仿学习实现从合成到物理的迁移，提供以目标手-物体状态为条件的可泛化控制器。我们的方法适用于多种物体形状和手部形态。通过广泛评估，我们表明我们的方法使灵巧手能够跟踪具有挑战性的长时程序列，包括物体重新摆放和敏捷的手内重定向。这些结果代表了朝着可完全从合成数据中学习的可扩展操作基础控制器迈出的重要一步，打破了长期制约灵巧操作进展的数据瓶颈。

Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies

自下而上策略优化：你的语言模型策略暗含内部策略

Authors: Yuqiao Tan, Minzheng Wang, Shizhu He, Huanxuan Liao, Chengfeng Zhao, Qiunan Lu, Tian Liang, Jun Zhao, Kang Liu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.19673
Pdf link: https://arxiv.org/pdf/2512.19673
Abstract Existing reinforcement learning (RL) approaches treat large language models (LLMs) as a single unified policy, overlooking their internal mechanisms. Understanding how policy evolves across layers and modules is therefore crucial for enabling more targeted optimization and unraveling complex reasoning mechanisms. In this paper, we decompose the language model policy by leveraging the intrinsic split of the Transformer residual stream and the equivalence between the composition of hidden states with the unembedding matrix and the resulting samplable policy. This decomposition reveals Internal Layer Policies, corresponding to contributions from individual layers, and Internal Modular Policies, which align with the self-attention and feed-forward network (FFN) components within each layer. By analyzing the entropy of internal policy, we find that: (a) Early layers keep high entropy for exploration, top layers converge to near-zero entropy for refinement, with convergence patterns varying across model series. (b) Llama's prediction space rapidly converges in the final layer, whereas Qwen-series models, especially Qwen3, exhibit a more human-like, progressively structured reasoning pattern. Motivated by these findings, we propose Bottom-up Policy Optimization (BuPO), a novel RL paradigm that directly optimizes the internal layer policy during early training. By aligning training objective at lower layer, BuPO reconstructs foundational reasoning capabilities and achieves superior performance. Extensive experiments on complex reasoning benchmarks demonstrate the effectiveness of our method. Our code is available at this https URL.
中文摘要 现有的强化学习（RL）方法将大型语言模型（LLM）视为单一统一策略，忽略其内部机制。因此，理解策略如何在层和模块间演进，对于实现更有针对性优化和理清复杂推理机制至关重要。本文通过利用Transformer残差流的内在分裂以及隐藏态与未嵌入矩阵的复合与由此产生的可采样策略的等价性，来分解语言模型策略。这种分解揭示了内部层策略，对应各个层的贡献，以及内部模块化策略，这些策略与各层内的自关注和前馈网络（FFN）组件对齐。通过分析内部政策的熵，我们发现：（a）早期层保持高熵以便探索，顶层在细化时收敛至近零熵，收敛模式在模型系列中有所不同。（b） LLama的预测空间在最后一层迅速收敛，而Qwen级数模型，尤其是Qwen3，则表现出更接近人类、渐进结构化的推理模式。基于这些发现，我们提出了自下而上策略优化（BuPO）的新颖强化学习范式，直接优化早期训练中的内部层策略。通过将训练目标与底层对齐，BuPO重建了基础推理能力，实现了卓越的绩效。复杂推理基准测试的广泛实验证明了我们方法的有效性。我们的代码可在此 https URL 访问。
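
The idea of reading a "policy" out of an intermediate layer can be sketched in a few lines: project a hidden state through the unembedding matrix, apply softmax, and measure the entropy of the result. This toy version uses plain Python lists and omits the final normalization layer that real Transformers apply before unembedding; the function names and the tiny 2x3 matrix are hypothetical and not taken from the paper's code.

```python
import math
from typing import List, Sequence


def internal_policy(hidden: Sequence[float],
                    unembed: Sequence[Sequence[float]]) -> List[float]:
    """softmax(hidden @ unembed) for a (d,) hidden state and a (d, V) matrix."""
    vocab_size = len(unembed[0])
    logits = [sum(h * row[v] for h, row in zip(hidden, unembed))
              for v in range(vocab_size)]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - top) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability tokens contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


if __name__ == "__main__":
    unembed = [[1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0]]          # toy d=2, V=3 unembedding matrix
    early = [0.0, 0.0]                    # uninformative hidden state
    late = [50.0, 0.0]                    # strongly committed hidden state
    print(entropy(internal_policy(early, unembed)))
    print(entropy(internal_policy(late, unembed)))
```

Computing this entropy layer by layer is one way to observe the pattern the abstract reports, with high entropy in early layers and near-zero entropy at the top.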

Scalably Enhancing the Clinical Validity of a Task Benchmark with Physician Oversight

在医生监督下，可扩展地提升任务基准的临床效度

Authors: Junze Ye, Daniel Tawfik, Alex J. Goodell, Nikhil V. Kotha, Mark K. Buyyounouski, Mohsen Bayati
Subjects: Artificial Intelligence (cs.AI); Applications (stat.AP)
Arxiv link: https://arxiv.org/abs/2512.19691
Pdf link: https://arxiv.org/pdf/2512.19691
Abstract Automating the calculation of clinical risk scores offers a significant opportunity to reduce physician administrative burden and enhance patient care. The current standard for evaluating this capability is MedCalc-Bench, a large-scale dataset constructed using LLM-based feature extraction and rule-based aggregation. However, treating such model-generated benchmarks as static oracles risks enshrining historical model errors as evaluation gold standards, a problem dangerously amplified when these datasets serve as reward signals for Reinforcement Learning (RL). In this work, we propose viewing benchmarks for complex tasks such as clinical score computation as "in-progress living documents" that should be periodically re-evaluated as the processes for creating them improve. We introduce a systematic, physician-in-the-loop pipeline that leverages advanced agentic verifiers to audit and relabel MedCalc-Bench, utilizing automated triage to reserve scarce clinician attention for the most contentious instances. Our audit reveals that a notable fraction of original labels diverge from medical ground truth due to extraction errors, calculator logic mismatches, and clinical ambiguity. To study whether this label noise meaningfully impacts downstream RL training, we fine-tune a Qwen3-8B model via Group Relative Policy Optimization (GRPO) and demonstrate that training on corrected labels yields an 8.7% absolute improvement in accuracy over the original baseline, validating that label noise materially affects model evaluation. These findings underscore that in safety-critical domains, rigorous benchmark maintenance is a prerequisite for genuine model alignment.
中文摘要 自动化临床风险评分的计算为减轻医生行政负担和提升患者护理提供了重大机会。当前评估该能力的标准是MedCalc-Bench，这是一个基于LLM的特征提取和基于规则聚合构建的大规模数据集。然而，将此类模型生成的基准测试视为静态预言机，有可能将历史模型错误确立为评估的黄金标准，而当这些数据集作为强化学习（RL）的奖励信号时，这一问题将被严重放大。在本研究中，我们提出将临床评分计算等复杂任务的基准视为“正在进行的活文档”，随着创建过程的改进，应定期重新评估。我们引入了系统化的医生环路流程，利用先进的代理验证器审计和重新标记MedCalc-Bench，利用自动分诊技术将稀缺的临床医生注意力留给最具争议的病例。我们的审计显示，由于提取错误、计算器逻辑不匹配和临床模糊，许多原始标签与医学真实情况存在差异。为了研究这种标签噪声是否对下游强化学习有意义的影响，我们通过群相对策略优化（GRPO）微调了Qwen3-8B模型，并证明在校正标签上训练后，准确率比原始基线提升了8.7%，验证了标签噪声对模型评估有实质性影响。这些发现强调，在安全关键领域，严格的基准维护是实现模型真正对齐的前提。

Keyword: diffusion policy

A Flexible Field-Based Policy Learning Framework for Diverse Robotic Systems and Sensors

面向多样化机器人系统和传感器的灵活的基于场的策略学习框架

Authors: Jose Gustavo Buenaventura Carreon, Floris Erich, Roman Mykhailyshyn, Tomohiro Motoda, Ryo Hanai, Yukiyasu Domae
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.19148
Pdf link: https://arxiv.org/pdf/2512.19148
Abstract We present a cross-robot visuomotor learning framework that integrates diffusion-policy-based control with 3D semantic scene representations from D3Fields to enable category-level generalization in manipulation. Its modular design supports diverse robot-camera configurations, including UR5 arms with Microsoft Azure Kinect arrays and bimanual manipulators with Intel RealSense sensors, through a low-latency control stack and intuitive teleoperation. A unified configuration layer enables seamless switching between setups for flexible data collection, training, and evaluation. In a grasp-and-lift block task, the framework achieved an 80 percent success rate after only 100 demonstration episodes, demonstrating robust skill transfer between platforms and sensing modalities. This design paves the way for scalable real-world studies in cross-robot generalization.
中文摘要 我们提出了一个跨机器人视觉运动学习框架，将基于扩散策略的控制与D3Fields的三维语义场景表示相结合，实现了操作任务中的类别级泛化。其模块化设计支持多种机器人-相机配置，包括配备Microsoft Azure Kinect阵列的UR5机械臂和配备Intel RealSense传感器的双臂机械手，并采用低延迟控制栈和直观的遥操作。统一配置层支持在不同配置之间无缝切换，便于灵活地进行数据收集、训练和评估。在抓取并抬起方块的任务中，该框架仅经过100次演示就达到了80%的成功率，展示了跨平台和跨传感模态的稳健技能迁移。这一设计为跨机器人泛化的可扩展真实世界研究铺平了道路。