Arxiv Papers of Today

Generated: 2025-12-22 16:34:35 (UTC+8); arXiv release time: 2025-12-22 20:00 EST (2025-12-23 09:00 UTC+8)

31 relevant papers today

Keyword: reinforcement learning

Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows

通过科学家对齐的工作流程探究大型语言模型的科学通用智能

Authors: Wanghan Xu, Yuhao Zhou, Yifan Zhou, Qinglong Cao, Shuo Li, Jia Bu, Bo Liu, Yixin Chen, Xuming He, Xiangyu Zhao, Xiang Zhuang, Fengxiang Wang, Zhiwang Zhou, Qiantai Feng, Wenxuan Huang, Jiaqi Wei, Hao Wu, Yuejin Yang, Guangshuai Wang, Sheng Xu, Ziyan Huang, Xinyao Liu, Jiyao Liu, Cheng Tang, Wei Li, Ying Chen, Junzhi Ning, Pengfei Jiang, Chenglong Ma, Ye Du, Changkai Ji, Huihui Xu, Ming Hu, Jiangbin Zheng, Xin Chen, Yucheng Wu, Feifei Jiang, Xi Chen, Xiangru Tang, Yuchen Fu, Yingzhou Lu, Yuanyuan Zhang, Lihao Sun, Chengbo Li, Jinzhe Ma, Wanhao Liu, Yating Liu, Kuo-Cheng Wu, Shengdu Chai, Yizhou Wang, Ouwen Zhangjin, Chen Tang, Shufei Zhang, Wenbo Cao, Junjie Ren, Taoyong Cui, Zhouheng Yao, Juntao Deng, Yijie Sun, Feng Liu, Wangxu Wei, Jingyi Xu, Zhangrui Li, Junchao Gong, Zijie Guo, Zhiyu Yao, Zaoyu Chen, Tianhao Peng, Fangchen Yu, Bo Zhang, Dongzhan Zhou, Shixiang Tang, Jiaheng Liu, Fenghua Ling, Yan Lu, Yuchen Ren, Ben Fei, Zhen Zhao, Xinyu Gu, Rui Su, Xiao-Ming Wu, Weikang Si, Yang Liu, Hao Chen, Xiangchao Yan, Xue Yang, Junchi Yan, Jiamin Wu, Qihao Zheng, Chenhui Li, Zhiqiang Gao, Hao Kong, Junjun He, Mao Su, Tianfan Fu, Peng Ye, Chunfeng Song, Nanqing Dong, Yuqiang Li, Huazhu Fu
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.16969
Pdf link: https://arxiv.org/pdf/2512.16969
Abstract Despite advances in scientific AI, a coherent framework for Scientific General Intelligence (SGI), the ability to autonomously conceive, investigate, and reason across scientific domains, remains lacking. We present an operational SGI definition grounded in the Practical Inquiry Model (PIM: Deliberation, Conception, Action, Perception) and operationalize it via four scientist-aligned tasks: deep research, idea generation, dry/wet experiments, and experimental reasoning. SGI-Bench comprises over 1,000 expert-curated, cross-disciplinary samples inspired by Science's 125 Big Questions, enabling systematic evaluation of state-of-the-art LLMs. Results reveal gaps: low exact match (10-20%) in deep research despite step-level alignment; ideas lacking feasibility and detail; high code executability but low execution result accuracy in dry experiments; low sequence fidelity in wet protocols; and persistent multimodal comparative-reasoning challenges. We further introduce Test-Time Reinforcement Learning (TTRL), which optimizes retrieval-augmented novelty rewards at inference, enhancing hypothesis novelty without reference answers. Together, our PIM-grounded definition, workflow-centric benchmark, and empirical insights establish a foundation for AI systems that genuinely participate in scientific discovery.
中文摘要 尽管科学人工智能取得了进步，科学通用智能（SGI），即跨科学领域自主构思、调查和推理的能力，仍然缺乏一个连贯的框架。我们提出了基于实用探究模型（PIM：审议、构思、行动、感知）的可操作SGI定义，并通过四个科学家对齐任务将其付诸实践：深度研究、想法生成、干/湿实验和实验推理。SGI-Bench包含1000多个专家策划的跨学科样本，灵感来自《科学》杂志的125个大问题，支持系统性评估最先进的大型语言模型。结果揭示了不足：尽管步骤层面对齐，深度研究中精确匹配率低（10%-20%）；想法缺乏可行性和细节；干实验中代码可执行性高但执行结果准确率低；湿实验协议的序列保真度低；以及持续存在的多模态比较推理挑战。我们进一步提出测试时强化学习（TTRL），该方法在推理时优化检索增强的新颖性奖励，无需参考答案即可增强假设的新颖性。我们基于PIM的定义、以工作流程为中心的基准和实证洞察共同为真正参与科学发现的AI系统奠定了基础。

Turn-PPO: Turn-Level Advantage Estimation with PPO for Improved Multi-Turn RL in Agentic LLMs

Turn-PPO：利用PPO估算回合级优势，提升代理型大型语言模型中的多回合强化学习

Authors: Junbo Li, Peng Zhou, Rui Meng, Meet P. Vadera, Lihong Li, Yang Li
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.17008
Pdf link: https://arxiv.org/pdf/2512.17008
Abstract Reinforcement learning (RL) has re-emerged as a natural approach for training interactive LLM agents in real-world environments. However, directly applying the widely used Group Relative Policy Optimization (GRPO) algorithm to multi-turn tasks exposes notable limitations, particularly in scenarios requiring long-horizon reasoning. To address these challenges, we investigate more stable and effective advantage estimation strategies, especially for multi-turn settings. We first explore Proximal Policy Optimization (PPO) as an alternative and find it to be more robust than GRPO. To further enhance PPO in multi-turn scenarios, we introduce turn-PPO, a variant that operates on a turn-level MDP formulation, as opposed to the commonly used token-level MDP. Our results on the WebShop and Sokoban datasets demonstrate the effectiveness of turn-PPO, both with and without long reasoning components.
中文摘要 强化学习（RL）已重新成为在现实环境中训练交互式LLM代理的自然方法。然而，直接将广泛使用的群体相对策略优化（GRPO）算法应用于多回合任务，尤其在需要长期视野推理的场景中，会暴露出显著的局限性。为应对这些挑战，我们研究了更稳定且有效的优势估计策略，特别是针对多回合情形。我们首先探讨了近端策略优化（PPO）作为替代方案，发现它比GRPO更稳健。为了进一步增强多回合场景中的PPO，我们引入了turn-PPO，这是一种基于回合级MDP表述的变体，而非常用的token级MDP。我们在WebShop和Sokoban数据集上的结果展示了turn-PPO的有效性，无论是否包含长推理成分。
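
The contrast between a token-level and a turn-level MDP can be made concrete with a small sketch. The abstract does not spell out the estimator, so the code below is only one plausible reading: standard generalized advantage estimation (GAE) in which each time step is a whole agent turn, with every token in a turn sharing that turn's advantage. The function names and the `tokens_per_turn` argument are illustrative assumptions, not taken from the paper.

```python
from typing import List


def turn_level_gae(rewards: List[float], values: List[float],
                   gamma: float = 1.0, lam: float = 0.95) -> List[float]:
    """GAE where each index is one agent turn rather than one token.

    rewards[t]: reward observed after turn t.
    values[t]:  critic estimate at the start of turn t; `values` carries one
                extra entry for the state after the last turn (0.0 if terminal).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each turn accumulates the discounted future TD errors.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def broadcast_to_tokens(advantages: List[float],
                        tokens_per_turn: List[int]) -> List[List[float]]:
    """Assumption: all tokens generated within a turn share its advantage."""
    return [[a] * n for a, n in zip(advantages, tokens_per_turn)]
```

With a sparse terminal reward, `lam=1.0` spreads credit to every turn, while `lam=0.0` leaves only the one-step TD error of the final turn.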

GB-DQN: Gradient Boosted DQN Models for Non-stationary Reinforcement Learning

GB-DQN：用于非平稳强化学习的梯度提升DQN模型

Authors: Chang-Hwan Lee, Chanseung Lee
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.17034
Pdf link: https://arxiv.org/pdf/2512.17034
Abstract Non-stationary environments pose a fundamental challenge for deep reinforcement learning, as changes in dynamics or rewards invalidate learned value functions and cause catastrophic forgetting. We propose Gradient-Boosted Deep Q-Networks (GB-DQN), an adaptive ensemble method that addresses model drift through incremental residual learning. Instead of retraining a single Q-network, GB-DQN constructs an additive ensemble in which each new learner is trained to approximate the Bellman residual of the current ensemble after drift. We provide theoretical results showing that each boosting step reduces the empirical Bellman residual and that the ensemble converges to the post-drift optimal value function under standard assumptions. Experiments across a diverse set of control tasks with controlled dynamics changes demonstrate faster recovery, improved stability, and greater robustness compared to DQN and common non-stationary baselines.
中文摘要 非平稳环境对深度强化学习构成根本挑战，因为动态或奖励的变化会使学到的价值函数失效，并导致灾难性遗忘。我们提出了梯度提升深度Q网络（GB-DQN），这是一种自适应集成方法，通过增量残差学习解决模型漂移问题。GB-DQN不是重新训练单个Q网络，而是构建一个加性集成，其中每个新学习器被训练以近似当前集成在漂移后的贝尔曼残差。我们提供了理论结果，表明每一步提升都会减少经验贝尔曼残差，并且在标准假设下，集成收敛到漂移后的最优价值函数。在一组动力学受控变化的多样化控制任务上的实验表明，相较于DQN和常见的非平稳基线，该方法恢复更快、稳定性更好且鲁棒性更强。
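
The additive-ensemble idea can be illustrated in a tabular toy setting. This is a minimal sketch, assuming lookup tables in place of neural networks and a hypothetical `shrinkage` factor; it is not the authors' implementation. Each boosting step fits a new table to the mean Bellman residual of the current ensemble on post-drift transitions and appends it to the sum.

```python
from collections import defaultdict


def q_value(ensemble, state, action):
    """Ensemble prediction is the sum of all learners (additive model)."""
    return sum(learner.get((state, action), 0.0) for learner in ensemble)


def mean_residuals(ensemble, transitions, actions, gamma):
    """Mean Bellman residual r + gamma * max_a' Q(s', a') - Q(s, a) per (s, a).

    transitions: iterable of (state, action, reward, next_state), with
    next_state set to None for terminal transitions.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for s, a, r, s_next in transitions:
        bootstrap = 0.0
        if s_next is not None:
            bootstrap = max(q_value(ensemble, s_next, b) for b in actions)
        sums[(s, a)] += r + gamma * bootstrap - q_value(ensemble, s, a)
        counts[(s, a)] += 1
    return {key: sums[key] / counts[key] for key in sums}


def boost(ensemble, transitions, actions, gamma=0.9, shrinkage=1.0):
    """Append one learner fitted to the current residual (one boosting step)."""
    residuals = mean_residuals(ensemble, transitions, actions, gamma)
    ensemble.append({key: shrinkage * value for key, value in residuals.items()})
    return ensemble
```

For example, if a one-step task paid reward 1 before drift and pays 3 afterwards, a single boosting step on post-drift data adds a correction of 2 and leaves the original learner untouched.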

UniRel-R1: RL-tuned LLM Reasoning for Knowledge Graph Relational Question Answering

UniRel-R1：基于强化学习的大型语言模型推理知识图关系问答

Authors: Yinxu Tang, Chengsong Huang, Jiaxin Huang, William Yeoh
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.17043
Pdf link: https://arxiv.org/pdf/2512.17043
Abstract Knowledge Graph Question Answering (KGQA) has traditionally focused on entity-centric queries that return a single answer entity. However, real-world queries are often relational, seeking to understand how entities are associated. In this work, we introduce relation-centric KGQA, a complementary setting where the answer is a subgraph capturing the semantic connections among entities rather than an individual entity. The main challenge lies in the abundance of candidate subgraphs, where trivial or overly common connections often obscure the identification of unique and informative answers. To tackle this, we propose UniRel-R1, a unified framework that integrates subgraph selection, multi-stage graph pruning, and an LLM fine-tuned with reinforcement learning. The reward function is designed to encourage compact and specific subgraphs with more informative relations and lower-degree intermediate entities. Extensive experiments show that UniRel-R1 achieves significant gains in connectivity and reward over Vanilla baselines and generalizes effectively to unseen entities and relations.
中文摘要 知识图谱问答（KGQA）传统上专注于以实体为中心的查询，返回单一答案实体。然而，现实世界的查询通常是关系性的，旨在理解实体之间的关联方式。在本研究中，我们引入了以关系为中心的KGQA，这是一种互补的设定，其答案是一个子图，捕捉实体间的语义联系，而非单个实体。主要挑战在于候选子图的过多，其中简单或过于常见的联系往往掩盖了识别独特且有信息量的答案。为此，我们提出了UniRel-R1，一个统一框架，集成了子图选择、多阶段图剪枝和通过强化学习微调的大型语言模型。奖励函数旨在鼓励紧凑且具体的子图，具有更多信息的关系和较低阶的中间实体。大量实验表明，UniRel-R1在连接性和奖励方面相较于原版基线取得了显著提升，并且能够有效推广到看不见的实体和关系。
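
To make the reward description more tangible, here is a hypothetical scoring function in the same spirit: it favors informative (rare) relations and penalizes large subgraphs and high-degree intermediate entities. The formula, the weights `alpha` and `beta`, and all argument names are assumptions made for illustration; the abstract does not give the actual reward.

```python
import math


def subgraph_reward(edges, degree, relation_count, endpoints,
                    alpha=0.1, beta=0.5):
    """Hypothetical reward for a candidate answer subgraph.

    edges:          list of (head, relation, tail) triples in the subgraph.
    degree:         entity -> degree in the full knowledge graph.
    relation_count: relation -> number of occurrences in the full graph.
    endpoints:      the query entities; all other entities are intermediates.
    """
    if not edges:
        return 0.0
    total = sum(relation_count.values())
    # Rare relations carry more information (negative log relative frequency).
    informativeness = sum(
        -math.log(relation_count[rel] / total) for _, rel, _ in edges
    ) / len(edges)
    entities = {e for head, _, tail in edges for e in (head, tail)}
    intermediates = entities - set(endpoints)
    # Hub entities connect almost anything, so their paths say little.
    hub_penalty = sum(
        math.log(1 + degree[e]) for e in intermediates
    ) / max(len(intermediates), 1)
    return informativeness - alpha * len(edges) - beta * hub_penalty
```

Under these assumptions, a two-hop path through a low-degree entity via rare relations scores higher than a path of the same length through a hub via a very common relation.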

Value Under Ignorance in Universal Artificial Intelligence

在通用人工智能中无知下的价值

Authors: Cole Wyeth, Marcus Hutter
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.17086
Pdf link: https://arxiv.org/pdf/2512.17086
Abstract We generalize the AIXI reinforcement learning agent to admit a wider class of utility functions. Assigning a utility to each possible interaction history forces us to confront the ambiguity that some hypotheses in the agent's belief distribution only predict a finite prefix of the history, which is sometimes interpreted as implying a chance of death equal to a quantity called the semimeasure loss. This death interpretation suggests one way to assign utilities to such history prefixes. We argue that it is as natural to view the belief distributions as imprecise probability distributions, with the semimeasure loss as total ignorance. This motivates us to consider the consequences of computing expected utilities with Choquet integrals from imprecise probability theory, including an investigation of their computability level. We recover the standard recursive value function as a special case. However, our most general expected utilities under the death interpretation cannot be characterized as such Choquet integrals.
中文摘要 我们推广AIXI强化学习代理，以容纳更广泛的效用函数类别。为每种可能的交互历史赋予效用，迫使我们面对一个模糊性：代理人信念分布中的某些假设仅预测历史的有限前缀，这有时被解释为死亡概率等于称为半测度损失的量。这种死亡解释暗示了一种为此类历史前缀分配效用的方法。我们认为，将信念分布视为不精确的概率分布，而半测度的损失则是完全无知，是同样自然的。这促使我们考虑用不精确概率论计算Choquet积分的期望效用，包括对其可计算水平的研究。我们恢复标准递归价值函数作为特例。然而，我们在死亡解释下最一般的期望效用不能被描述为此类Choquet积分。
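
A finite toy version may help with the "semimeasure loss as total ignorance" reading. The sketch below assumes a semimeasure over a finite outcome set whose masses sum to at most 1, and treats the missing mass as a belief-function mass on the whole outcome set. For such a belief function the Choquet integral reduces to the ordinary expectation plus the loss times the worst-case utility. The paper works with interaction histories, so this is only an analogy under stated assumptions.

```python
def belief(nu, subset):
    """Lower probability of `subset` under the 'loss = total ignorance' reading.

    nu: outcome -> mass, with sum(nu.values()) <= 1 (a semimeasure).
    The missing mass counts only towards the full outcome set.
    """
    mass = sum(nu[x] for x in subset)
    if set(subset) == set(nu):
        mass += 1.0 - sum(nu.values())
    return mass


def choquet_expected_utility(nu, utility):
    """Choquet integral of `utility` with respect to `belief`.

    Outcomes are sorted by increasing utility; each utility increment is
    weighted by the belief in the set of outcomes that are at least as good.
    """
    order = sorted(nu, key=lambda x: utility[x])
    total, previous = 0.0, 0.0
    for i, outcome in enumerate(order):
        total += (utility[outcome] - previous) * belief(nu, order[i:])
        previous = utility[outcome]
    return total


def closed_form(nu, utility):
    """Expectation plus loss times the worst utility (pessimistic completion)."""
    loss = 1.0 - sum(nu.values())
    expectation = sum(nu[x] * utility[x] for x in nu)
    return expectation + loss * min(utility[x] for x in nu)
```

When the masses sum to exactly 1 the loss term vanishes and both functions return the usual expected utility.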

Learning to Plan, Planning to Learn: Adaptive Hierarchical RL-MPC for Sample-Efficient Decision Making

学习规划，规划学习：自适应层级RL-MPC用于样本高效决策

Authors: Toshiaki Hori, Jonathan DeCastro, Deepak Gopinath, Avinash Balachandran, Guy Rosman
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.17091
Pdf link: https://arxiv.org/pdf/2512.17091
Abstract We propose a new approach for solving planning problems with a hierarchical structure, fusing reinforcement learning and MPC planning. Our formulation tightly and elegantly couples the two planning paradigms. It leverages reinforcement learning actions to inform the MPPI sampler, and adaptively aggregates MPPI samples to inform the value estimation. The resulting adaptive process leverages further MPPI exploration where value estimates are uncertain, and improves training robustness and the overall resulting policies. This results in a robust planning approach that can handle complex planning problems and easily adapts to different applications, as demonstrated over several domains, including race driving, modified Acrobot, and Lunar Lander with added obstacles. Our results in these domains show better data efficiency and overall performance in terms of both rewards and task success, with up to a 72% increase in success rate compared to existing approaches, as well as accelerated convergence (x2.1) compared to non-adaptive sampling.
中文摘要 我们提出了一种解决具有层级结构的规划问题的新方法，融合强化学习与MPC规划。我们的表述紧密且优雅地结合了这两种规划范式。它利用强化学习动作来为MPPI采样器提供信息，并自适应地聚合MPPI样本以支持价值估计。由此产生的自适应过程在价值估计不确定之处进行更多MPPI探索，提升了训练的稳健性和最终策略的整体质量。这带来了一种稳健的规划方法，能够处理复杂的规划问题，并轻松适应不同应用，这一点在多个领域得到了验证，包括赛车驾驶、改进版Acrobot和带有额外障碍的Lunar Lander。我们在这些领域的结果显示，在奖励和任务成功率两方面，数据效率和整体性能均更优：成功率比现有方法最高提升72%，收敛速度相比非自适应采样加快（x2.1）。
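
For readers unfamiliar with MPPI (Model Predictive Path Integral control), the core update is a softmax-weighted average of sampled controls, with lower-cost samples weighted more. The sketch below shows that standard weighting for a scalar control and, as an assumption about how "RL actions inform the sampler", centers the samples on a policy-proposed action. The function names, the noise scale, and the one-dimensional setting are illustrative only.

```python
import math
import random


def mppi_weights(costs, temperature):
    """Softmax weights exp(-(cost - min_cost) / temperature), normalized."""
    best = min(costs)  # subtracting the minimum keeps the exponentials stable
    unnormalized = [math.exp(-(c - best) / temperature) for c in costs]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def mppi_update(candidates, costs, temperature):
    """Cost-weighted average of candidate controls."""
    weights = mppi_weights(costs, temperature)
    return sum(w * u for w, u in zip(weights, candidates))


def sample_around_policy(policy_action, num_samples, noise_std, rng):
    """Assumption: the sampler is centered on the RL policy's action."""
    return [policy_action + rng.gauss(0.0, noise_std) for _ in range(num_samples)]


if __name__ == "__main__":
    rng = random.Random(0)
    candidates = sample_around_policy(0.0, 32, 0.5, rng)
    costs = [(u - 2.0) ** 2 for u in candidates]  # toy cost with optimum at 2.0
    print(mppi_update(candidates, costs, temperature=1.0))
```

A small temperature concentrates the weight on the best sample, while a large one approaches a plain average of the candidates.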

Reinforcement Learning for Self-Improving Agent with Skill Library

带有技能库的自我提升智能体强化学习

Authors: Jiongxiao Wang, Qiaojing Yan, Yawei Wang, Yijun Tian, Soumya Smruti Mishra, Zhichao Xu, Megha Gandhi, Panpan Xu, Lin Lee Cheong
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.17102
Pdf link: https://arxiv.org/pdf/2512.17102
Abstract Large Language Model (LLM)-based agents have demonstrated remarkable capabilities in complex reasoning and multi-turn interactions but struggle to continuously improve and adapt when deployed in new environments. One promising approach is implementing skill libraries that allow agents to learn, validate, and apply new skills. However, current skill library approaches rely primarily on LLM prompting, making consistent skill library implementation challenging. To overcome these challenges, we propose a Reinforcement Learning (RL)-based approach to enhance agents' self-improvement capabilities with a skill library. Specifically, we introduce Skill Augmented GRPO for self-Evolution (SAGE), a novel RL framework that systematically incorporates skills into learning. The framework's key component, Sequential Rollout, iteratively deploys agents across a chain of similar tasks for each rollout. As agents navigate through the task chain, skills generated from previous tasks accumulate in the library and become available for subsequent tasks. Additionally, the framework enhances skill generation and utilization through a Skill-integrated Reward that complements the original outcome-based rewards. Experimental results on AppWorld demonstrate that SAGE, when applied to a supervised-finetuned model with expert experience, achieves 8.9% higher Scenario Goal Completion while requiring 26% fewer interaction steps and generating 59% fewer tokens, substantially outperforming existing approaches in both accuracy and efficiency.
中文摘要 基于大型语言模型（LLM）的代理在复杂推理和多回合交互方面展现出卓越能力，但在新环境中部署时难以持续改进和适应。一种有前景的方法是实现技能库，使代理能够学习、验证并应用新技能。然而，当前的技能库方法主要依赖LLM提示，使得技能库的一致性实现具有挑战性。为克服这些挑战，我们提出了基于强化学习（RL）的方法，通过技能库提升代理的自我提升能力。具体来说，我们提出了用于自我进化的技能增强GRPO（SAGE），这是一个新颖的强化学习框架，系统地将技能融入学习中。该框架的关键组件Sequential Rollout（顺序展开）在每次rollout中将代理依次部署到一条由相似任务组成的任务链上。当代理在任务链中推进时，从之前任务生成的技能会积累在库中，并可供后续任务使用。此外，该框架通过技能集成奖励提升技能的生成和利用，补充了原有的基于结果的奖励。AppWorld上的实验结果显示，当SAGE应用于具有专家经验的监督微调模型时，情景目标完成率提高了8.9%，交互步骤减少26%，生成的token数量减少59%，在准确性和效率上均大幅超越现有方法。

Towards Senior-Robot Interaction: Reactive Robot Dog Gestures

迈向老年人与机器人互动：反应性机器人狗的手势

Authors: Chunyang Meng, Eduardo B. Sandoval, Ricardo Sosa, Francisco Cruz
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.17136
Pdf link: https://arxiv.org/pdf/2512.17136
Abstract As the global population ages, many seniors face the problem of loneliness. Companion robots offer a potential solution. However, current companion robots often lack advanced functionality, while task-oriented robots are not designed for social interaction, limiting their suitability and acceptance by seniors. Our work introduces a senior-oriented system for quadruped robots that allows for more intuitive user input and provides more socially expressive output. For user input, we implemented a MediaPipe-based module for hand gesture and head movement recognition, enabling control without a remote. For output, we designed and trained robotic dog gestures using curriculum-based reinforcement learning in Isaac Gym, progressing from simple standing to three-legged balancing, leg extensions, and more. The final tests achieved over 95% success on average in simulation, and we validated a key social gesture (the paw-lift) on a Unitree robot. Real-world tests demonstrated the feasibility and social expressiveness of this framework, while also revealing sim-to-real challenges in joint compliance, load distribution, and balance control. These contributions advance the development of practical quadruped robots as social companions for seniors, outline pathways for sim-to-real adaptation, and inform future user studies.
中文摘要 随着全球人口老龄化，许多老年人面临孤独问题。伴侣机器人提供了一个潜在的解决方案。然而，现有的伴侣机器人通常缺乏高级功能，而任务导向机器人并非为社交互动设计，限制了其在老年人中的适用性和接受度。我们的研究引入了一套面向老年人的四足机器人系统，允许用户更直观地输入，并提供更具社交表现力的输出。在用户输入方面，我们实现了基于MediaPipe的手势和头部动作识别模块，实现无需遥控器即可控制。在输出方面，我们在Isaac Gym中利用基于课程的强化学习设计并训练了机器狗动作，从简单的站立逐步发展到三足平衡和腿部伸展等。最终测试在仿真中平均成功率超过95%，我们还在Unitree机器人上验证了一个关键社交动作（抬爪）。真实环境测试展示了该框架的可行性和社交表现力，同时也揭示了关节柔顺性、负载分配和平衡控制等方面的仿真到现实挑战。这些贡献推动了实用四足机器人作为老年人社交伙伴的发展，勾勒出仿真到现实适应的路径，并为未来的用户研究提供参考。

Enhancing AIGC Service Efficiency with Adaptive Multi-Edge Collaboration in A Distributed System

通过分布式系统中的自适应多边缘协作提升AIGC服务效率

Authors: Changfu Xu, Jianxiong Guo, Jiandian Zeng, Houming Qiu, Tian Wang, Xiaowen Chu, Jiannong Cao
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2512.17158
Pdf link: https://arxiv.org/pdf/2512.17158
Abstract The Artificial Intelligence Generated Content (AIGC) technique has gained significant traction for producing diverse content. However, existing AIGC services typically operate within a centralized framework, resulting in high response times. To address this issue, we integrate collaborative Mobile Edge Computing (MEC) technology to reduce processing delays for AIGC services. Current collaborative MEC methods primarily support single-server offloading or facilitate interactions among fixed Edge Servers (ESs), limiting flexibility and resource utilization across all ESs to meet the varying computing and networking requirements of AIGC services. We propose AMCoEdge, an adaptive multi-server collaborative MEC approach to enhancing AIGC service efficiency. The AMCoEdge fully utilizes the computing and networking resources across all ESs through adaptive multi-ES selection and dynamic workload allocation, thereby minimizing the offloading make-span of AIGC services. Our design features an online distributed algorithm based on deep reinforcement learning, accompanied by theoretical analyses that confirm an approximate linear time complexity. Simulation results show that our method outperforms state-of-the-art baselines, achieving at least an 11.04% reduction in task offloading make-span and a 44.86% decrease in failure rate. Additionally, we develop a distributed prototype system to implement and evaluate our AMCoEdge method for real AIGC service execution, demonstrating service delays that are 9.23% - 31.98% lower than the three representative methods.
中文摘要 人工智能生成内容（AIGC）技术在生成多样化内容方面获得了显著关注。然而，现有的AIGC服务通常在集中式框架内运行，导致响应时间较长。为解决这一问题，我们整合了协作移动边缘计算（MEC）技术，以减少AIGC服务的处理延迟。当前的协作MEC方法主要支持单服务器卸载或促进固定边缘服务器（ES）之间的交互，这限制了跨所有ES的灵活性和资源利用，难以满足AIGC服务多变的计算和网络需求。我们提出AMCoEdge，一种自适应多服务器协作MEC方法，旨在提升AIGC服务效率。AMCoEdge通过自适应多ES选择和动态工作负载分配，充分利用所有ES的计算和网络资源，从而最小化AIGC服务的卸载完成时长（make-span）。我们的设计采用基于深度强化学习的在线分布式算法，并辅以理论分析，证实其具有近似线性的时间复杂度。仿真结果显示，我们的方法优于最先进的基线，任务卸载完成时长至少减少了11.04%，失败率降低了44.86%。此外，我们还开发了一个分布式原型系统，用于在真实AIGC服务执行中实现和评估AMCoEdge方法，其服务延迟比三种代表性方法低9.23%至31.98%。

Conservative Bias in Multi-Teacher Learning: Why Agents Prefer Low-Reward Advisors

多教师学习中的保守偏见：为什么代理更喜欢低回报的顾问

Authors: Maher Mesto, Francisco Cruz
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.17180
Pdf link: https://arxiv.org/pdf/2512.17180
Abstract Interactive reinforcement learning (IRL) has shown promise in enabling autonomous agents and robots to learn complex behaviours from human teachers, yet the dynamics of teacher selection remain poorly understood. This paper reveals an unexpected phenomenon in IRL: when given a choice between teachers with different reward structures, learning agents overwhelmingly prefer conservative, low-reward teachers (93.16% selection rate) over those offering 20x higher rewards. Through 1,250 experimental runs in navigation tasks with multiple expert teachers, we discovered: (1) Conservative bias dominates teacher selection: agents systematically choose the lowest-reward teacher, prioritising consistency over optimality; (2) Critical performance thresholds exist at teacher availability rho >= 0.6 and accuracy omega >= 0.6, below which the framework fails catastrophically; (3) The framework achieves 159% improvement over baseline Q-learning under concept drift. These findings challenge fundamental assumptions about optimal teaching in RL and suggest potential implications for human-robot collaboration, where human preferences for safety and consistency may align with the observed agent selection behaviour, potentially informing training paradigms for safety-critical robotic applications.
中文摘要 交互式强化学习（IRL）在使自主智能体和机器人能够从人类教师那里学习复杂行为方面展现出潜力，但教师选择的动态仍鲜为人知。本文揭示了IRL中一个意想不到的现象：当可以在具有不同奖励结构的教师之间选择时，学习智能体压倒性地偏好保守且奖励较低的教师（93.16%的选择率），而非那些提供高20倍奖励的教师。通过在有多位专家教师的导航任务中进行1250次实验运行，我们发现：（1）保守偏见主导教师选择：智能体系统性地选择奖励最低的教师，优先考虑一致性而非最优性；（2）在教师可用性 rho >= 0.6 和准确率 omega >= 0.6 处存在临界性能阈值，低于此阈值框架将灾难性失效；（3）该框架在概念漂移下相比基线Q学习实现了159%的提升。这些发现挑战了关于强化学习中最优教学的基本假设，并暗示了对人机协作的潜在启示：人类对安全性和一致性的偏好可能与观察到的智能体选择行为相一致，从而可能为安全关键机器人应用的训练范式提供参考。
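
The two teacher parameters named in the abstract, availability rho and accuracy omega, can be read as probabilities in a simple advice model. The sketch below assumes that reading: at each step the teacher responds with probability `availability`, and a response is the optimal action with probability `accuracy`, otherwise a uniformly chosen wrong action. The function name and this exact sampling scheme are assumptions, not details confirmed by the paper.

```python
import random


def teacher_advice(optimal_action, actions, availability, accuracy, rng):
    """Return the teacher's suggested action, or None if no advice is given.

    availability (rho):  probability that the teacher responds at all.
    accuracy (omega):    probability that a response is the optimal action.
    """
    if rng.random() >= availability:
        return None  # teacher is unavailable at this step
    if rng.random() < accuracy:
        return optimal_action
    wrong = [a for a in actions if a != optimal_action]
    # With a single available action there is nothing wrong to suggest.
    return rng.choice(wrong) if wrong else optimal_action


if __name__ == "__main__":
    rng = random.Random(0)
    advice = [teacher_advice(1, [0, 1, 2, 3], 0.6, 0.6, rng) for _ in range(10)]
    print(advice)
```

Under this model the chance of receiving correct advice on a given step is the product of the two parameters, which is one way to see why performance can degrade sharply when both are low.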

MAPPO-LCR: Multi-Agent Policy Optimization with Local Cooperation Reward in Spatial Public Goods Games

MAPPO-LCR：空间公共物品博弈中的多智能体策略优化与本地合作奖励

Authors: Zhaoqilin Yang, Axin Xiang, Kedi Yang, Tianjun Liu, Youliang Tian
Subjects: Multiagent Systems (cs.MA); Computer Science and Game Theory (cs.GT)
Arxiv link: https://arxiv.org/abs/2512.17187
Pdf link: https://arxiv.org/pdf/2512.17187
Abstract Spatial public goods games model collective dilemmas where individual payoffs depend on population-level strategy configurations. Most existing studies rely on evolutionary update rules or value-based reinforcement learning methods. These approaches struggle to represent payoff coupling and non-stationarity in large interacting populations. This work introduces Multi-Agent Proximal Policy Optimization (MAPPO) into spatial public goods games for the first time. In these games, individual returns are intrinsically coupled through overlapping group interactions. Proximal Policy Optimization (PPO) treats agents as independent learners and ignores this coupling during value estimation. MAPPO addresses this limitation through a centralized critic that evaluates joint strategy configurations. To study neighborhood-level cooperation signals under this framework, we propose MAPPO with Local Cooperation Reward, termed MAPPO-LCR. The local cooperation reward aligns policy updates with surrounding cooperative density without altering the original game structure. MAPPO-LCR preserves decentralized execution while enabling population-level value estimation during training. Extensive simulations demonstrate stable cooperation emergence and reliable convergence across enhancement factors. Statistical analyses further confirm the learning advantage of MAPPO over PPO in spatial public goods games.
中文摘要 空间公共物品博弈模拟了集体困境，其中个体收益依赖于群体层面的策略配置。大多数现有研究依赖进化更新规则或基于价值的强化学习方法。这些方法难以表示大型相互作用群体中的收益耦合和非平稳性。这项工作首次将多智能体近端策略优化（MAPPO）引入空间公共物品博弈。在这些博弈中，个体回报通过重叠的群体互动内在地耦合在一起。近端策略优化（PPO）将智能体视为独立学习者，并在价值估计时忽略这种耦合。MAPPO通过一个评估联合策略配置的集中式评论家（critic）来解决这一局限性。为了研究该框架下的邻域层面合作信号，我们提出了带有本地合作奖励的MAPPO，称为MAPPO-LCR。本地合作奖励使策略更新与周边合作密度保持一致，而不改变原有的博弈结构。MAPPO-LCR在保持分散执行的同时，支持训练期间的群体级价值估计。大量仿真表明，在不同增强因子下都能稳定地涌现合作并可靠收敛。统计分析进一步证实了MAPPO在空间公共物品博弈中相较于PPO的学习优势。
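
The payoff coupling through overlapping groups, and a reward based on surrounding cooperative density, can be sketched on a small periodic lattice. The code assumes the common setup of von Neumann neighborhoods (each agent plays in the group centered on itself and in each neighbor's group), a unit contribution cost, and a hypothetical mixing weight `lcr_weight`; the paper's exact reward definition is not given in the abstract.

```python
def neighbors(i, j, size):
    """Von Neumann neighbors on a periodic size x size lattice (size >= 3)."""
    return [((i - 1) % size, j), ((i + 1) % size, j),
            (i, (j - 1) % size), (i, (j + 1) % size)]


def group_payoff(grid, center, player, r, cost=1.0):
    """Payoff to `player` from the group centered at `center`.

    grid[x][y] is 1 for a cooperator and 0 for a defector; r is the
    enhancement factor applied to the pooled contributions.
    """
    members = [center] + neighbors(center[0], center[1], len(grid))
    cooperators = sum(grid[x][y] for x, y in members)
    share = r * cost * cooperators / len(members)
    return share - cost * grid[player[0]][player[1]]


def total_payoff(grid, player, r):
    """Sum over the overlapping groups the player belongs to."""
    centers = [player] + neighbors(player[0], player[1], len(grid))
    return sum(group_payoff(grid, c, player, r) for c in centers)


def local_cooperation_density(grid, player):
    """Fraction of the player's neighbors that cooperate."""
    nbrs = neighbors(player[0], player[1], len(grid))
    return sum(grid[x][y] for x, y in nbrs) / len(nbrs)


def shaped_reward(grid, player, r, lcr_weight=0.5):
    """Game payoff plus a hypothetical local cooperation bonus."""
    return total_payoff(grid, player, r) + lcr_weight * local_cooperation_density(grid, player)
```

Because the bonus is added to the training reward only, the underlying game payoffs stay as defined by `total_payoff`.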

MMRAG-RFT: Two-stage Reinforcement Fine-tuning for Explainable Multi-modal Retrieval-augmented Generation

MMRAG-RFT：可解释多模态检索增强生成的两阶段强化微调

Authors: Shengwei Zhao, Jingwen Yao, Sitong Wei, Linhai Xu, Yuying Liu, Dong Zhang, Zhiqiang Tian, Shaoyi Du
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.17194
Pdf link: https://arxiv.org/pdf/2512.17194
Abstract Multi-modal Retrieval-Augmented Generation (MMRAG) enables highly credible generation by integrating external multi-modal knowledge, thus demonstrating impressive performance in complex multi-modal scenarios. However, existing MMRAG methods fail to clarify the reasoning logic behind retrieval and response generation, which limits the explainability of the results. To address this gap, we propose to introduce reinforcement learning into multi-modal retrieval-augmented generation, enhancing the reasoning capabilities of multi-modal large language models through a two-stage reinforcement fine-tuning framework to achieve explainable multi-modal retrieval-augmented generation. Specifically, in the first stage, rule-based reinforcement fine-tuning is employed to perform coarse-grained point-wise ranking of multi-modal documents, effectively filtering out those that are significantly irrelevant. In the second stage, reasoning-based reinforcement fine-tuning is utilized to jointly optimize fine-grained list-wise ranking and answer generation, guiding multi-modal large language models to output explainable reasoning logic in the MMRAG process. Our method achieves state-of-the-art results on WebQA and MultimodalQA, two benchmark datasets for multi-modal retrieval-augmented generation, and its effectiveness is validated through comprehensive ablation experiments.
中文摘要 多模态检索增强生成（MMRAG）通过整合外部多模态知识实现高度可信的生成，从而在复杂的多模态场景中展现出令人印象深刻的性能。然而，现有的MMRAG方法未能澄清检索和反应生成背后的推理逻辑，限制了结果的可解释性。为弥补这一空白，我们提出将强化学习引入多模态检索增强生成，通过两阶段强化微调框架提升多模态大型语言模型的推理能力，实现可解释的多模态检索增强生成。具体来说，在第一阶段，采用基于规则的强化微调，对多模态文档进行粗粒度的逐点排序，有效过滤掉那些显著无关的文档。第二阶段，基于推理的强化微调用于联合优化细粒度列表排名和答案生成，引导多模态大型语言模型在MMRAG过程中输出可解释的推理逻辑。我们的方法在WebQA和MultimodalQA这两个多模态检索增强生成的基准数据集上取得了最先进的成果，并通过全面的消融实验验证了其有效性。

Reasoning Palette: Modulating Reasoning via Latent Contextualization for Controllable Exploration for (V)LMs

推理调色板：通过潜在情境化调节推理，实现（V）LMs的可控探索

Authors: Rujiao Long, Yang Li, Xingyao Zhang, Weixun Wang, Tianqianjin Lin, Xi Zhao, Yuchi Xu, Wenbo Su, Junchi Yan, Bo Zheng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.17206
Pdf link: https://arxiv.org/pdf/2512.17206
Abstract Exploration capacity shapes both inference-time performance and reinforcement learning (RL) training for large (vision-) language models, as stochastic sampling often yields redundant reasoning paths with little high-level diversity. This paper proposes Reasoning Palette, a novel latent-modulation framework that endows the model with a stochastic latent variable for strategic contextualization, guiding its internal planning prior to token generation. This latent context is inferred from the mean-pooled embedding of a question-answer pair via a variational autoencoder (VAE), where each sampled latent potentially encodes a distinct reasoning context. During inference, a sampled latent is decoded into learnable token prefixes and prepended to the input prompt, modulating the model's internal reasoning trajectory. In this way, the model performs internal sampling over reasoning strategies prior to output generation, which shapes the style and structure of the entire response sequence. A brief supervised fine-tuning (SFT) warm-up phase allows the model to adapt to this latent conditioning. Within RL optimization, Reasoning Palette facilitates structured exploration by enabling on-demand injection for diverse reasoning modes, significantly enhancing exploration efficiency and sustained learning capability. Experiments across multiple reasoning benchmarks demonstrate that our method enables interpretable and controllable control over the (vision-) language model's strategic behavior, thereby achieving consistent performance gains over standard RL methods.
中文摘要 探索能力既影响大型（视觉）语言模型的推理时性能，也影响其强化学习（RL）训练，因为随机采样常常产生冗余的推理路径，高层次多样性很少。本文提出了推理调色板（Reasoning Palette），一种新型潜在调制框架，赋予模型一个随机潜在变量以实现策略性情境化，在token生成之前引导其内部规划。该潜在上下文是通过变分自编码器（VAE）从问答对的均值池化嵌入中推断的，其中每个采样的潜在变量可能编码一个不同的推理上下文。在推理过程中，采样的潜在变量被解码为可学习的token前缀，并加在输入提示前，从而调节模型的内部推理轨迹。通过这种方式，模型在输出生成前对推理策略进行内部采样，从而塑造整个响应序列的风格和结构。短暂的监督微调（SFT）预热阶段使模型能够适应这种潜在条件化。在强化学习优化中，推理调色板通过支持按需注入多种推理模式，促进结构化探索，显著提升探索效率和持续学习能力。多项推理基准测试的实验表明，我们的方法能够对（视觉）语言模型的策略行为实现可解释且可控的调控，从而相较于标准强化学习方法取得一致的性能提升。

CheXPO-v2: Preference Optimization for Chest X-ray VLMs with Knowledge Graph Consistency

CheXPO-v2：具有知识图谱一致性的胸部X光VLM的偏好优化

Authors: Xiao Liang, Yuxuan An, Di Wang, Jiawei Hu, Zhicheng Jiao, Bin Jing, Quan Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.17213
Pdf link: https://arxiv.org/pdf/2512.17213
Abstract Medical Vision-Language Models (VLMs) are prone to hallucinations, compromising clinical reliability. While reinforcement learning methods like Group Relative Policy Optimization (GRPO) offer a low-cost alignment solution, their reliance on sparse, outcome-based rewards inadvertently encourages models to "overthink", generating verbose, convoluted, and unverifiable Chain-of-Thought reasoning to justify answers. This focus on outcomes obscures factual errors and poses significant safety risks. To address this, we propose CheXPO-v2, a novel alignment framework that shifts from outcome to process supervision. Our core innovation is a Knowledge Graph Consistency Reward mechanism driven by Entity-Relation Matching. By explicitly parsing reasoning steps into structured "Disease, Relation, Anatomy" triplets, we provide fine-grained supervision that penalizes incoherent logic and hallucinations at the atomic level. Integrating this with a hard-example mining strategy, our approach significantly outperforms GRPO and state-of-the-art models on benchmarks like MIMIC-CXR-VQA. Crucially, CheXPO-v2 achieves new state-of-the-art accuracy using only 5k samples, demonstrating exceptional data efficiency while producing clinically sound and verifiable reasoning. The project source code is publicly available (the link appears on the arXiv abstract page).
中文摘要 医学视觉语言模型（VLMs）容易出现幻觉，影响临床可靠性。虽然像群体相对策略优化（GRPO）这样的强化学习方法提供了低成本的对齐解决方案，但它们对稀疏、基于结果的奖励的依赖无意中鼓励模型“过度思考”，即产生冗长、复杂且无法验证的思维链推理来为答案辩护。这种对结果的关注掩盖了事实错误，并带来了重大的安全风险。为此，我们提出了CheXPO-v2，一种从结果监督转向过程监督的新型对齐框架。我们的核心创新是由实体关系匹配驱动的知识图谱一致性奖励机制。通过将推理步骤显式解析为结构化的“疾病、关系、解剖”三元组，我们提供了细粒度的监督，在原子层面惩罚不连贯的逻辑和幻觉。结合难例挖掘策略，我们的方法在 MIMIC-CXR-VQA 等基准测试上显著优于 GRPO 和最先进模型。关键是，CheXPO-v2仅用5000个样本就实现了新的最先进准确率，展现了卓越的数据效率，同时产生了临床上合理且可验证的推理。项目源代码已公开（链接见arXiv摘要页面）。
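
A process-level reward built on triplet matching can be sketched as follows. The sketch assumes the reasoning steps have already been parsed into (disease, relation, anatomy) triplets and that the knowledge graph is a plain set of such triplets; the reward is then the fraction of triplets the graph supports. The function name, the normalization, and the example triplets are illustrative assumptions rather than the paper's actual pipeline.

```python
def kg_consistency_reward(step_triplets, knowledge_graph):
    """Fraction of reasoning-step triplets that appear in the knowledge graph.

    step_triplets:   list of (disease, relation, anatomy) string triplets
                     extracted from a model's chain of thought.
    knowledge_graph: set of lowercase (disease, relation, anatomy) triplets.
    """
    if not step_triplets:
        return 0.0  # nothing verifiable was stated
    normalized = [
        tuple(part.strip().lower() for part in triplet)
        for triplet in step_triplets
    ]
    supported = sum(1 for triplet in normalized if triplet in knowledge_graph)
    return supported / len(normalized)


if __name__ == "__main__":
    # Toy graph and steps, for illustration only.
    graph = {
        ("pleural effusion", "located_in", "pleural space"),
        ("cardiomegaly", "located_in", "heart"),
    }
    steps = [
        ("Pleural Effusion", "located_in", "pleural space"),
        ("cardiomegaly", "located_in", "left lung"),  # unsupported claim
    ]
    print(kg_consistency_reward(steps, graph))
```

Scoring each triplet separately is what lets an unsupported intermediate claim lower the reward even when the final answer happens to be right.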

Learning When to Look: A Disentangled Curriculum for Strategic Perception in Multimodal Reasoning

学会何时观察：用于多模态推理中策略性感知的解耦课程

Authors: Siqi Yang, Zilve Gao, Haibo Qiu, Fanfan Liu, Peng Shi, Zhixiong Zeng, Qingmin Liao, Lin Ma
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.17227
Pdf link: https://arxiv.org/pdf/2512.17227
Abstract Multimodal Large Language Models (MLLMs) demonstrate significant potential but remain brittle in complex, long-chain visual reasoning tasks. A critical failure mode is "visual forgetting", where models progressively lose visual grounding as reasoning extends, a phenomenon aptly described as "think longer, see less". We posit this failure stems from current training paradigms prematurely entangling two distinct cognitive skills: (1) abstract logical reasoning ("how-to-think") and (2) strategic visual perception ("when-to-look"). This creates a foundational cold-start deficiency (weakening abstract reasoning) and a strategic perception deficit, as models lack a policy for when to perceive. In this paper, we propose a novel curriculum-based framework to disentangle these skills. First, we introduce a disentangled Supervised Fine-Tuning (SFT) curriculum that builds a robust abstract reasoning backbone on text-only data before anchoring it to vision with a novel Perception-Grounded Chain-of-Thought (PG-CoT) paradigm. Second, we resolve the strategic perception deficit by formulating timing as a reinforcement learning problem. We design a Pivotal Perception Reward that teaches the model when to look by coupling perceptual actions to linguistic markers of cognitive uncertainty (e.g., "wait", "verify"), thereby learning an autonomous grounding policy. Our contributions include the formalization of these two deficiencies and the development of a principled, two-stage framework to address them, transforming the model from a heuristic-driven observer to a strategic, grounded reasoner. Code: the link appears on the arXiv abstract page.
中文摘要 多模态大型语言模型（MLLM）展现出显著潜力，但在复杂的长链视觉推理任务中仍然脆弱。一个关键的失败模式是“视觉遗忘”，即随着推理的推进，模型逐渐失去视觉基础，这种现象恰如其分地被称为“思考更久，看到更少”。我们认为这种失败源于当前的训练范式过早地将两种不同的认知技能纠缠在一起：（1）抽象逻辑推理（“如何思考”）和（2）策略性视觉感知（“何时观察”）。这造成了基础性的冷启动缺陷（削弱抽象推理能力）以及策略性感知缺陷，因为模型缺乏决定何时感知的策略。本文提出了一个基于课程的新框架，以解耦这些技能。首先，我们引入了一套解耦的监督微调（SFT）课程，先在纯文本数据上构建稳健的抽象推理骨干，然后通过一种新的感知锚定思维链（PG-CoT）范式将其锚定到视觉上。其次，我们通过将观察时机表述为强化学习问题来解决策略性感知缺陷。我们设计了关键感知奖励，通过将感知动作与认知不确定性的语言标记（如“wait”、“verify”）耦合，教导模型何时观察，从而学习自主的视觉锚定策略。我们的贡献包括形式化这两个缺陷，并开发一个有原则的两阶段框架来应对它们，将模型从启发式驱动的观察者转变为策略性的、有视觉依据的推理者。代码：链接见arXiv摘要页面。

Cooperative Energy Scheduling of Multi-Microgrids Based on Risk-Sensitive Reinforcement Learning

基于风险敏感强化学习的多微电网合作能源调度

Authors: Rongxiang Zhang, Bo Li, Jinghua Li, Yuguang Song, Ziqing Zhu, Wentao Yang, Zhengmao Li, Edris Pouresmaeil, Joshua Y. Kim
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2512.17246
Pdf link: https://arxiv.org/pdf/2512.17246
Abstract With the rapid development of distributed renewable energy, multi-microgrids play an increasingly important role in improving the flexibility and reliability of energy supply. Reinforcement learning has shown great potential in coordination strategies due to its model-free nature. Current methods lack explicit quantification of the relationship between individual and joint risk values, resulting in obscured credit assignment. Moreover, they often depend on explicit communication, which becomes inefficient as system complexity grows. To address these challenges, this paper proposes a risk-sensitive reinforcement learning framework with shared memory (RRL-SM) for multi-microgrid scheduling. Specifically, a risk-sensitive value factorization scheme is proposed to quantify the relationship between individual and joint risk values by leveraging distributional modeling and attention-based representations, thereby aligning local decisions with global risk objectives. An implicit shared-memory coordination mechanism is implemented through a global memory space to enhance the overall efficiency of decentralized decision-making. Collectively, the integrated approach delivers more reliable cooperative scheduling under renewable energy uncertainty. Simulation results show that RRL-SM reduces load-shedding risk by 84.5%, demonstrating a favorable balance between reliability and economic performance.
中文摘要 随着分布式可再生能源的快速发展，多微电网在提升能源供应的灵活性和可靠性方面发挥着越来越重要的作用。强化学习因其无模型特性，在协调策略中展现出巨大潜力。现有方法缺乏对个人与联合风险值之间关系的明确量化，导致信用分配被模糊。此外，它们通常依赖显式通信，而随着系统复杂度的增加，这种通信效率会降低。为应对这些挑战，本文提出了一个基于多微电网调度的风险敏感强化学习框架，采用共享内存（RRL-SM）。具体来说，提出了一种风险敏感价值因子化方案，通过利用分布建模和基于注意力的表示来量化个人与联合风险值之间的关系，从而使本地决策与全球风险目标保持一致。通过全局内存空间实现隐式共享内存协调机制，以提升去中心化决策的整体效率。综合而言，这种综合方法在可再生能源不确定性下提供了更可靠的合作调度。模拟结果显示，RRL-SM可将负载切断风险降低84.5%，显示出可靠性与经济性能之间的良好平衡。

Seed-Prover 1.5: Mastering Undergraduate-Level Theorem Proving via Learning from Experience

Seed-Prover 1.5：通过经验学习掌握本科水平定理证明

Authors: Jiangjie Chen, Wenxiang Chen, Jiacheng Du, Jinyi Hu, Zhicheng Jiang, Allan Jie, Xiaoran Jin, Xing Jin, Chenggang Li, Wenlei Shi, Zhihong Wang, Mingxuan Wang, Chenrui Wei, Shufa Wei, Huajian Xin, Fan Yang, Weihao Gao, Zheng Yuan, Tianyang Zhan, Zeyu Zheng, Tianxi Zhou, Thomas Hanwen Zhu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.17260
Pdf link: https://arxiv.org/pdf/2512.17260
Abstract Large language models have recently made significant progress in generating rigorous mathematical proofs. In contrast, utilizing LLMs for theorem proving in formal languages (such as Lean) remains challenging and computationally expensive, particularly when addressing problems at the undergraduate level and beyond. In this work, we present Seed-Prover 1.5, a formal theorem-proving model trained via large-scale agentic reinforcement learning, alongside an efficient test-time scaling (TTS) workflow. Through extensive interactions with Lean and other tools, the model continuously accumulates experience during the RL process, substantially enhancing the capability and efficiency of formal theorem proving. Furthermore, leveraging recent advancements in natural language proving, our TTS workflow efficiently bridges the gap between natural and formal languages. Compared to state-of-the-art methods, Seed-Prover 1.5 achieves superior performance with a smaller compute budget. It solves 88% of PutnamBench (undergraduate-level), 80% of Fate-H (graduate-level), and 33% of Fate-X (PhD-level) problems. Notably, using our system, we solved 11 out of 12 problems from Putnam 2025 within 9 hours. Our findings suggest that scaling learning from experience, driven by high-quality formal feedback, holds immense potential for the future of formal mathematical reasoning.
中文摘要 大型语言模型近年来在生成严谨数学证明方面取得了显著进展。相比之下，利用LLM在形式语言（如Lean）中进行定理证明仍然具有挑战性且计算成本高昂，尤其是在处理本科及以上水平的问题时。在本研究中，我们提出了Seed-Prover 1.5，这是一个通过大规模智能体强化学习训练的形式定理证明模型，同时配备了高效的测试时扩展（TTS）工作流程。通过与Lean及其他工具的广泛交互，模型在强化学习过程中不断积累经验，显著提升了形式定理证明的能力和效率。此外，借助自然语言证明的最新进展，我们的TTS工作流程高效地弥合了自然语言与形式语言之间的鸿沟。与最先进的方法相比，Seed-Prover 1.5以更小的计算预算实现了更优的性能。它解决了88%的PutnamBench（本科水平）、80%的Fate-H（研究生水平）和33%的Fate-X（博士水平）问题。值得注意的是，使用我们的系统，我们在9小时内解决了Putnam 2025的12个问题中的11个。我们的发现表明，由高质量形式反馈驱动的、规模化的经验学习，对形式数学推理的未来具有巨大潜力。

A Theoretical Analysis of State Similarity Between Markov Decision Processes

马尔可夫决策过程状态相似性的理论分析

Authors: Zhenyu Tao, Wei Xu, Xiaohu You
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.17265
Pdf link: https://arxiv.org/pdf/2512.17265
Abstract The bisimulation metric (BSM) is a powerful tool for analyzing state similarities within a Markov decision process (MDP), revealing that states closer in BSM have more similar optimal value functions. While BSM has been successfully utilized in reinforcement learning (RL) for tasks like state representation learning and policy exploration, its application to state similarity between multiple MDPs remains challenging. Prior work has attempted to extend BSM to pairs of MDPs, but a lack of well-established mathematical properties has limited further theoretical analysis between MDPs. In this work, we formally establish a generalized bisimulation metric (GBSM) for measuring state similarity between arbitrary pairs of MDPs, which is rigorously proven with three fundamental metric properties, i.e., GBSM symmetry, inter-MDP triangle inequality, and a distance bound on identical spaces. Leveraging these properties, we theoretically analyze policy transfer, state aggregation, and sampling-based estimation across MDPs, obtaining explicit bounds that are strictly tighter than existing ones derived from the standard BSM. Additionally, GBSM provides a closed-form sample complexity for estimation, improving upon existing asymptotic results based on BSM. Numerical results validate our theoretical findings and demonstrate the effectiveness of GBSM in multi-MDP scenarios.
中文摘要 双模拟度量（BSM）是分析马尔可夫决策过程（MDP）中状态相似性的强大工具，揭示了在BSM下更接近的状态具有更相似的最优值函数。虽然BSM已成功应用于强化学习（RL）中的状态表示学习和策略探索等任务，但将其用于多个MDP之间的状态相似性仍然具有挑战性。此前已有研究尝试将BSM扩展到MDP对，但由于缺乏成熟的数学性质，MDP之间的进一步理论分析受到限制。本研究正式建立了一种广义双模拟度量（GBSM），用于测量任意MDP对之间的状态相似性，并严格证明了它的三个基本度量性质，即GBSM对称性、MDP间三角不等式以及相同空间上的距离界。利用这些性质，我们从理论上分析了跨MDP的策略迁移、状态聚合和基于采样的估计，获得了比基于标准BSM的现有结果严格更紧的显式界。此外，GBSM还为估计提供了闭式的样本复杂度，改进了基于BSM的现有渐近结果。数值结果验证了我们的理论发现，并展示了GBSM在多MDP场景中的有效性。
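
Illustrative sketch (not from the paper, and not the GBSM itself): the standard single-MDP bisimulation metric that the abstract builds on can be computed by fixed-point iteration. For a deterministic MDP the Wasserstein term reduces to the distance between successor states, so the update is d(s,t) = max_a [ |r(s,a) - r(t,a)| + gamma * d(next(s,a), next(t,a)) ]. The toy rewards and transitions and the name bisimulation_metric are assumptions made for this example.

```python
from itertools import product

def bisimulation_metric(states, actions, reward, next_state, gamma=0.5, iters=100):
    """Fixed-point iteration for a bisimulation metric on a deterministic MDP."""
    # Start from the all-zero distance and apply the update repeatedly.
    d = {(s, t): 0.0 for s, t in product(states, states)}
    for _ in range(iters):
        d = {
            (s, t): max(
                abs(reward[s, a] - reward[t, a])
                + gamma * d[next_state[s, a], next_state[t, a]]
                for a in actions
            )
            for s, t in product(states, states)
        }
    return d

# Hypothetical toy MDP: three absorbing states and a single action.
states, actions = [0, 1, 2], [0]
reward = {(0, 0): 1.0, (1, 0): 1.0, (2, 0): 0.0}
next_state = {(0, 0): 0, (1, 0): 1, (2, 0): 2}
d = bisimulation_metric(states, actions, reward, next_state)
```

In this toy case states 0 and 1 behave identically, so their distance is zero, while the distance between states 0 and 2 converges to 2, which equals the gap between their optimal values (2 and 0 at gamma = 0.5).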

Understanding Generalization in Role-Playing Models via Information Theory

通过信息理论理解角色扮演模型中的泛化

Authors: Yongqi Li, Hao Lang, Fei Huang, Tieyun Qian, Yongbin Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.17270
Pdf link: https://arxiv.org/pdf/2512.17270
Abstract Role-playing models (RPMs) are widely used in real-world applications but underperform when deployed in the wild. This degradation can be attributed to distribution shifts, including user, character, and dialogue compositional shifts. Existing methods like LLM-as-a-judge fall short in providing a fine-grained diagnosis of how these shifts affect RPM generalization, and formal frameworks for characterizing RPM generalization behaviors are lacking. To bridge these gaps, we introduce an information-theoretic metric, named reasoning-based effective mutual information difference (R-EMID), to measure RPM performance degradation in an interpretable way. We also derive an upper bound on R-EMID to predict the worst-case generalization performance of RPMs and theoretically reveal how various shifts contribute to the RPM performance degradation. Moreover, we propose a co-evolving reinforcement learning framework to adaptively model the connection among user, character, and dialogue context and thus enhance the estimation of dialogue response generation probability, which is critical for calculating R-EMID. Finally, we evaluate the generalization performance of various RPMs using R-EMID, finding that user shift poses the highest risk among all shifts and reinforcement learning is the most effective approach for enhancing RPM generalization.
中文摘要 角色扮演模型（RPM）在现实应用中被广泛使用，但在实际部署时表现不佳。这种退化可归因于分布偏移，包括用户、角色和对话组合的偏移。现有方法如LLM-as-a-judge无法细粒度地诊断这些偏移如何影响RPM泛化，因此缺乏刻画RPM泛化行为的形式化框架。为弥合这些差距，我们引入了一种信息论指标，即基于推理的有效互信息差（R-EMID），以可解释的方式衡量RPM性能的下降。我们还推导了R-EMID的上界，以预测RPM在最坏情况下的泛化性能，并从理论上揭示了各种偏移如何导致RPM性能下降。此外，我们提出了一种协同进化的强化学习框架，以自适应地建模用户、角色与对话上下文之间的联系，从而改进对话回复生成概率的估计，这对计算R-EMID至关重要。最后，我们利用R-EMID评估了各种RPM的泛化性能，发现用户偏移在所有偏移中风险最高，而强化学习是提升RPM泛化的最有效方法。

Large Language Models as Pokémon Battle Agents: Strategic Play and Content Generation

大型语言模型作为宝可梦战斗代理人：战略玩法与内容生成

Authors: Daksh Jain, Aarya Jain, Ashutosh Desai, Avyakt Verma, Ishan Bhanuka, Pratik Narang, Dhruv Kumar
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.17308
Pdf link: https://arxiv.org/pdf/2512.17308
Abstract Strategic decision-making in Pokémon battles presents a unique testbed for evaluating large language models. Pokémon battles demand reasoning about type matchups, statistical trade-offs, and risk assessment, skills that mirror human strategic thinking. This work examines whether Large Language Models (LLMs) can serve as competent battle agents, capable of both making tactically sound decisions and generating novel, balanced game content. We developed a turn-based Pokémon battle system where LLMs select moves based on battle state rather than pre-programmed logic. The framework captures essential Pokémon mechanics: type effectiveness multipliers, stat-based damage calculations, and multi-Pokémon team management. Through systematic evaluation across multiple model architectures, we measured win rates, decision latency, type-alignment accuracy, and token efficiency. These results suggest LLMs can function as dynamic game opponents without domain-specific training, offering a practical alternative to reinforcement learning for turn-based strategic games. The dual capability of tactical reasoning and content creation positions LLMs as both players and designers, with implications for procedural generation and adaptive difficulty systems in interactive entertainment.
中文摘要 宝可梦对战中的战略决策为评估大型语言模型提供了独特的试验场。宝可梦对战需要对属性匹配、统计取舍和风险评估进行推理，这些技能与人类的战略思维相符。本研究探讨大型语言模型（LLMs）是否能作为合格的战斗代理，既能做出战术性决策，也能生成新颖且平衡的游戏内容。我们开发了一个回合制宝可梦战斗系统，LLM根据战斗状态选择招式，而非预设逻辑。该框架涵盖宝可梦的核心机制：属性效果倍增器、基于属性的伤害计算以及多宝可梦队伍管理。通过对多种模型架构的系统评估，我们测量了获胜率、决策延迟、类型对齐准确性和代币效率。这些结果表明，大型语言模型无需领域特定训练即可作为动态的游戏对手，为回合制战略游戏提供了强化学习的实用替代方案。战术推理和内容创作的双重能力使大型语言模型既是玩家也是设计师，这对交互娱乐中的程序生成和自适应难度系统具有重要影响。
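
Illustrative sketch (not the paper's battle system): the type-effectiveness multipliers and stat-based damage calculation mentioned in the abstract could look roughly like this. The three-type chart, the function names, and the use of the commonly cited main-series damage formula without random or critical-hit factors are all assumptions for this example.

```python
# Tiny illustrative subset of a type chart: attacker type -> defender type -> multiplier.
TYPE_CHART = {
    "fire": {"grass": 2.0, "water": 0.5, "fire": 0.5},
    "water": {"fire": 2.0, "grass": 0.5, "water": 0.5},
    "grass": {"water": 2.0, "fire": 0.5, "grass": 0.5},
}

def type_multiplier(move_type, defender_types):
    """Multiply the effectiveness against each of the defender's types."""
    mult = 1.0
    for t in defender_types:
        # Pairs missing from the chart are treated as neutral (1.0).
        mult *= TYPE_CHART.get(move_type, {}).get(t, 1.0)
    return mult

def damage(level, power, attack, defense, move_type, defender_types):
    """Stat-based damage with the type multiplier applied last."""
    base = ((2 * level / 5 + 2) * power * attack / defense) / 50 + 2
    return int(base * type_multiplier(move_type, defender_types))

# A level-50 attacker using an 80-power water move on a fire-type defender.
hit = damage(50, 80, 100, 100, "water", ["fire"])
```

A battle loop in this style would serialize the current state (active Pokémon, HP, available moves) into a prompt and let the model pick a move, with a function like damage resolving the outcome.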

Neuro-Symbolic Control with Large Language Models for Language-Guided Spatial Tasks

利用大型语言模型进行语言引导空间任务的神经符号控制

Authors: Momina Liaqat Ali, Muhammad Abid
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.17321
Pdf link: https://arxiv.org/pdf/2512.17321
Abstract Although large language models (LLMs) have recently become effective tools for language-conditioned control in embodied systems, instability, slow convergence, and hallucinated actions continue to limit their direct application to continuous control. A modular neuro-symbolic control framework that clearly distinguishes between low-level motion execution and high-level semantic reasoning is proposed in this work. While a lightweight neural delta controller performs bounded, incremental actions in continuous space, a locally deployed LLM interprets symbolic tasks. We assess the suggested method in a planar manipulation setting with spatial relations between objects specified by language. Numerous tasks and local language models, such as Mistral, Phi, and LLaMA-3.2, are used in extensive experiments to compare LLM-only control, neural-only control, and the suggested LLM+DL framework. In comparison to LLM-only baselines, the results show that the neuro-symbolic integration consistently increases both success rate and efficiency, achieving average step reductions exceeding 70% and speedups of up to 8.83x while remaining robust to language model quality. The suggested framework enhances interpretability, stability, and generalization without any need for reinforcement learning or costly rollouts by restricting the LLM to symbolic outputs and allocating uninterpreted execution to a neural controller trained on artificial geometric data. These results show empirically that neuro-symbolic decomposition offers a scalable and principled way to integrate language understanding with ongoing control. This approach promotes the creation of dependable and effective language-guided embodied systems.
中文摘要 尽管大型语言模型（LLMs）近年来已成为具身系统中语言条件控制的有效工具，但不稳定性、收敛缓慢和幻觉动作仍限制了其在连续控制中的直接应用。本研究提出了一个模块化的神经符号控制框架，明确区分低层运动执行与高层语义推理。轻量级神经增量（delta）控制器在连续空间内执行有界的增量动作，而本地部署的LLM则解释符号任务。我们在一个平面操作环境中评估所提方法，其中对象间的空间关系由语言指定。大量实验使用多种任务和本地语言模型（如Mistral、Phi和LLaMA-3.2），比较了仅LLM控制、仅神经控制以及所提的LLM+DL框架。与仅LLM基线相比，结果显示神经符号整合持续提升成功率和效率，平均步数减少超过70%，加速最高达8.83倍，同时对语言模型质量保持稳健。所提框架将LLM限制为符号输出，并将未解释的执行交给在人工几何数据上训练的神经控制器，从而提升了可解释性、稳定性和泛化性，且无需强化学习或昂贵的rollout。这些结果从实证上表明，神经符号分解为语言理解与持续控制的结合提供了一种可扩展且有原则的方法，有助于构建可靠且高效的语言引导具身系统。

Xiaomi MiMo-VL-Miloco Technical Report

小米MiMo-VL-Miloco技术报告

Authors: Jiaze Li, Jingyang Chen, Yuxun Qu, Jianzhong Ju, Zhenbo Luo, Jian Luan, Shijie Xu, Zhenru Lin, Junyou Zhu, Boshen Xu, Wenhui Tan, Pei Fu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.17436
Pdf link: https://arxiv.org/pdf/2512.17436
Abstract We open-source MiMo-VL-Miloco-7B and its quantized variant MiMo-VL-Miloco-7B-GGUF, a pair of home-centric vision-language models that achieve strong performance on both home-scenario understanding and general multimodal reasoning. Built on the MiMo-VL-7B backbone, MiMo-VL-Miloco-7B is specialized for smart-home environments, attaining leading F1 scores on gesture recognition and common home-scenario understanding, while also delivering consistent gains across video benchmarks such as Video-MME, Video-MMMU, and Charades-STA, as well as language understanding benchmarks including MMMU-Pro and MMLU-Pro. In our experiments, MiMo-VL-Miloco-7B outperforms strong closed-source and open-source baselines on home-scenario understanding and several multimodal reasoning benchmarks. To balance specialization and generality, we design a two-stage training pipeline that combines supervised fine-tuning with reinforcement learning based on Group Relative Policy Optimization, leveraging efficient multi-domain data. We further incorporate chain-of-thought supervision and token-budget-aware reasoning, enabling the model to learn knowledge in a data-efficient manner while also performing reasoning efficiently. Our analysis shows that targeted home-scenario training not only enhances activity and gesture understanding, but also improves text-only reasoning with only modest trade-offs on document-centric tasks. Model checkpoints, quantized GGUF weights, and our home-scenario evaluation toolkit are publicly available at this https URL to support research and deployment in real-world smart-home applications.
中文摘要 我们开源了MiMo-VL-Miloco-7B及其量化变体MiMo-VL-Miloco-7B-GGUF，这是一对以家庭为中心的视觉语言模型，在家庭场景理解和通用多模态推理方面均表现出色。基于MiMo-VL-7B骨干，MiMo-VL-Miloco-7B专门面向智能家居环境，在手势识别和常见家庭场景理解方面取得领先的F1分数，同时在Video-MME、Video-MMMU和Charades-STA等视频基准，以及MMMU-Pro和MMLU-Pro等语言理解基准上持续取得提升。在我们的实验中，MiMo-VL-Miloco-7B在家庭场景理解和若干多模态推理基准上优于强劲的闭源和开源基线。为了平衡专业化与通用性，我们设计了一个两阶段训练流水线，将监督微调与基于组相对策略优化（Group Relative Policy Optimization）的强化学习相结合，并利用高效的多域数据。我们还进一步融入了思维链监督和token预算感知推理，使模型能够以数据高效的方式学习知识，同时高效执行推理。我们的分析显示，针对性的家庭场景训练不仅提升了活动和手势理解，还能提升纯文本推理能力，且在以文档为中心的任务上仅有有限的权衡。模型检查点、量化GGUF权重以及我们的家庭场景评估工具包均公开于this https URL，以支持现实智能家居应用中的研究和部署。
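
Illustrative sketch (not the report's training code): Group Relative Policy Optimization scores each sampled response relative to the other responses drawn for the same prompt, instead of using a learned value baseline. A minimal version of that group-normalized advantage is shown below; the function name, the epsilon term, and the choice of population standard deviation are assumptions for this example.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and spread of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four responses to one prompt, scored 1.0 if correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat their group's average get a positive advantage and the rest get a negative one; a group with identical rewards yields all-zero advantages and therefore no learning signal.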

Assessing Long-Term Electricity Market Design for Ambitious Decarbonization Targets using Multi-Agent Reinforcement Learning

利用多智能体强化学习评估长期电力市场设计以实现雄心勃勃的脱碳目标

Authors: Javier Gonzalez-Ruiz, Carlos Rodriguez-Pardo, Iacopo Savelli, Alice Di Bella, Massimo Tavoni
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); General Economics (econ.GN)
Arxiv link: https://arxiv.org/abs/2512.17444
Pdf link: https://arxiv.org/pdf/2512.17444
Abstract Electricity systems are key to transforming today's society into a carbon-free economy. Long-term electricity market mechanisms, including auctions, support schemes, and other policy instruments, are critical in shaping the electricity generation mix. In light of the need for more advanced tools to support policymakers and other stakeholders in designing, testing, and evaluating long-term markets, this work presents a multi-agent reinforcement learning model capable of capturing the key features of decarbonizing energy systems. Profit-maximizing generation companies make investment decisions in the wholesale electricity market, responding to system needs, competitive dynamics, and policy signals. The model employs independent proximal policy optimization, selected for its suitability to the decentralized and competitive environment. Nevertheless, given the inherent challenges of independent learning in multi-agent settings, an extensive hyperparameter search ensures that decentralized training yields market outcomes consistent with competitive behavior. The model is applied to a stylized version of the Italian electricity system and tested under varying levels of competition, market designs, and policy scenarios. Results highlight the critical role of market design for decarbonizing the electricity sector and avoiding price volatility. The proposed framework allows assessing long-term electricity markets in which multiple policy and market mechanisms interact simultaneously, with market participants responding and adapting to decarbonization pathways.
中文摘要 电力系统是将当今社会转变为无碳经济的关键。长期的电力市场机制，包括拍卖、支持计划及其他政策工具，对于塑造电力发电结构至关重要。鉴于支持政策制定者及其他利益相关者设计、测试和评估长期市场所需的更先进工具，本研究提出了一种多智能体强化学习模型，能够捕捉能源系统脱碳的关键特征。追求利润最大化的发电公司在批发电力市场做出投资决策，响应系统需求、竞争动态和政策信号。该模型采用独立的近端策略优化，该优化被选中适合去中心化和竞争环境。然而，鉴于多智能体环境中独立学习的固有挑战，广泛的超参数搜索确保去中心化训练能够产生与竞争行为一致的市场结果。该模型应用于意大利电力系统的风格化版本，并在不同竞争水平、市场设计和政策情景下进行测试。结果凸显了市场设计在电力行业脱碳和避免价格波动中的关键作用。该框架允许评估长期电力市场，其中多种政策和市场机制同时互动，市场参与者响应并适应脱碳路径。

Learning Safe Autonomous Driving Policies Using Predictive Safety Representations

利用预测性安全表征学习安全的自动驾驶策略

Authors: Mahesh Keswani, Raunak Bhattacharyya
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.17586
Pdf link: https://arxiv.org/pdf/2512.17586
Abstract Safe reinforcement learning (SafeRL) is a prominent paradigm for autonomous driving, where agents are required to optimize performance under strict safety requirements. This dual objective creates a fundamental tension, as overly conservative policies limit driving efficiency while aggressive exploration risks safety violations. The Safety Representations for Safer Policy Learning (SRPL) framework addresses this challenge by equipping agents with a predictive model of future constraint violations and has shown promise in controlled environments. This paper investigates whether SRPL extends to real-world autonomous driving scenarios. Systematic experiments on the Waymo Open Motion Dataset (WOMD) and NuPlan demonstrate that SRPL can improve the reward-safety tradeoff, achieving statistically significant improvements in success rate (effect sizes r = 0.65-0.86) and cost reduction (effect sizes r = 0.70-0.83), with p < 0.05 for observed improvements. However, its effectiveness depends on the underlying policy optimizer and the dataset distribution. The results further show that predictive safety representations play a critical role in improving robustness to observation noise. Additionally, in zero-shot cross-dataset evaluation, SRPL-augmented agents demonstrate improved generalization compared to non-SRPL methods. These findings collectively demonstrate the potential of predictive safety representations to strengthen SafeRL for autonomous driving.
中文摘要 安全强化学习（SafeRL）是自动驾驶的一个重要范式，要求智能体在严格的安全要求下优化性能。这一双重目标造成了根本张力：过于保守的策略限制了驾驶效率，而激进的探索则可能引发安全违规。Safety Representations for Safer Policy Learning（SRPL）框架通过为智能体配备未来约束违规的预测模型来应对这一挑战，并在受控环境中展现出潜力。本文探讨SRPL是否适用于现实世界的自动驾驶场景。在Waymo开放运动数据集（WOMD）和NuPlan上的系统实验表明，SRPL能够改善奖励与安全性的权衡，在成功率（效应量r = 0.65-0.86）和成本降低（效应量r = 0.70-0.83）上取得统计学显著的提升，观察到的提升均满足p < 0.05。然而，其有效性取决于底层策略优化器和数据集分布。结果进一步表明，预测性安全表征在提升对观测噪声的鲁棒性方面起着关键作用。此外，在零样本跨数据集评估中，SRPL增强的智能体相较于非SRPL方法展现出更好的泛化能力。这些发现共同展示了预测性安全表征在增强面向自动驾驶的SafeRL方面的潜力。

SCOPE: Sequential Causal Optimization of Process Interventions

SCOPE：过程干预的顺序因果优化

Authors: Jakob De Moor, Hans Weytjens, Johannes De Smedt, Jochen De Weerdt
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.17629
Pdf link: https://arxiv.org/pdf/2512.17629
Abstract Prescriptive Process Monitoring (PresPM) recommends interventions during business processes to optimize key performance indicators (KPIs). In realistic settings, interventions are rarely isolated: organizations need to align sequences of interventions to jointly steer the outcome of a case. Existing PresPM approaches fall short in this respect. Many focus on a single intervention decision, while others treat multiple interventions independently, ignoring how they interact over time. Methods that do address these dependencies depend either on simulation or data augmentation to approximate the process to train a Reinforcement Learning (RL) agent, which can create a reality gap and introduce bias. We introduce SCOPE, a PresPM approach that learns aligned sequential intervention recommendations. SCOPE employs backward induction to estimate the effect of each candidate intervention action, propagating its impact from the final decision point back to the first. By leveraging causal learners, our method can utilize observational data directly, unlike methods that require constructing process approximations for reinforcement learning. Experiments on both an existing synthetic dataset and a new semi-synthetic dataset show that SCOPE consistently outperforms state-of-the-art PresPM techniques in optimizing the KPI. The novel semi-synthetic setup, based on a real-life event log, is provided as a reusable benchmark for future work on sequential PresPM.
中文摘要 规范性流程监控（PresPM）在业务流程执行期间推荐干预措施，以优化关键绩效指标（KPI）。在现实环境中，干预很少是孤立的：组织需要协调一系列干预，以共同引导案例的结果。现有的PresPM方法在这方面存在不足。许多方法只关注单一干预决策，而另一些则独立处理多项干预，忽视它们随时间的相互作用。确实处理这些依赖关系的方法则依赖仿真或数据增强来近似流程，以训练强化学习（RL）智能体，这可能造成现实差距并引入偏差。我们提出SCOPE，一种学习对齐的顺序干预建议的PresPM方法。SCOPE采用逆向归纳法来估计每个候选干预动作的效应，并将其影响从最终决策点传播回第一个决策点。通过利用因果学习器，我们的方法可以直接利用观察数据，这与需要构建流程近似以进行强化学习的方法不同。在现有合成数据集和新的半合成数据集上的实验显示，SCOPE在优化KPI方面始终优于最先进的PresPM技术。基于真实事件日志的新型半合成设置，可作为未来顺序PresPM研究的可复用基准。
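
Illustrative sketch (a generic textbook routine, not SCOPE itself): backward induction evaluates the last decision point first and then propagates the best achievable outcome back to earlier decision points. In the paper the per-action effects are estimated by causal learners from observational data; here they are replaced by a hard-coded toy outcome table, and all names are assumptions for this example.

```python
def backward_induction(horizon, states, actions, transition, final_outcome):
    """Return value[t][s] and policy[t][s] for a finite-horizon decision problem."""
    # At the end of the case, the value of a state is simply its outcome (the KPI).
    value = {horizon: {s: final_outcome(s) for s in states}}
    policy = {}
    for t in range(horizon - 1, -1, -1):  # last decision point first
        value[t], policy[t] = {}, {}
        for s in states:
            # Score each action by the value of the state it leads to.
            q = {a: value[t + 1][transition(s, a)] for a in actions}
            best = max(q, key=q.get)
            policy[t][s], value[t][s] = best, q[best]
    return value, policy

# Toy case: the state counts interventions applied so far (capped at 2),
# and applying exactly one intervention gives the best KPI.
outcome = {0: 1.0, 1: 3.0, 2: 2.0}
value, policy = backward_induction(
    horizon=2,
    states=[0, 1, 2],
    actions=[0, 1],  # 0 = do nothing, 1 = intervene
    transition=lambda s, a: min(s + a, 2),
    final_outcome=lambda s: outcome[s],
)
```

The resulting policy intervenes at the last decision point only if no intervention has happened yet, which is the kind of coordination across decision points that independent per-step recommendations would miss.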

Trust-Region Adaptive Policy Optimization

信任区域自适应策略优化

Authors: Mingyu Su, Jian Guan, Yuxian Gu, Minlie Huang, Hongning Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.17636
Pdf link: https://arxiv.org/pdf/2512.17636
Abstract Post-training methods, especially Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), play an important role in improving large language models' (LLMs) complex reasoning abilities. However, the dominant two-stage pipeline (SFT then RL) suffers from a key inconsistency: SFT enforces rigid imitation that suppresses exploration and induces forgetting, limiting RL's potential for improvements. We address this inefficiency with TRAPO (Trust-Region Adaptive Policy Optimization), a hybrid framework that interleaves SFT and RL within each training instance by optimizing SFT loss on expert prefixes and RL loss on the model's own completions, unifying external supervision and self-exploration. To stabilize training, we introduce Trust-Region SFT (TrSFT), which minimizes forward KL divergence inside a trust region but attenuates optimization outside, effectively shifting toward reverse KL and yielding stable, mode-seeking updates favorable for RL. An adaptive prefix-selection mechanism further allocates expert guidance based on measured utility. Experiments on five mathematical reasoning benchmarks show that TRAPO consistently surpasses standard SFT, RL, and SFT-then-RL pipelines, as well as recent state-of-the-art approaches, establishing a strong new paradigm for reasoning-enhanced LLMs.
中文摘要 后训练方法，尤其是监督微调（SFT）和强化学习（RL），在提升大型语言模型（LLMs）复杂推理能力方面发挥着重要作用。然而，主流的两阶段流程（先SFT后RL）存在一个关键矛盾：SFT强制执行僵化的模仿，抑制探索并导致遗忘，限制了RL的改进潜力。我们用TRAPO（Trust-Region Adaptive Policy Optimization）来解决这一低效问题，这是一种混合框架，通过在专家前缀上优化SFT损失、在模型自身的补全上优化RL损失，将SFT和RL交错于每个训练实例之内，统一外部监督和自我探索。为稳定训练，我们引入了信任区域SFT（TrSFT），该方法在信任区域内最小化前向KL散度，而在区域外减弱优化，实际上向反向KL偏移，并产生有利于RL的稳定的模式寻求更新。自适应前缀选择机制进一步根据测得的效用分配专家指导。在五个数学推理基准上的实验显示，TRAPO持续超越标准的SFT、RL及SFT后接RL的流程，也超越了近期最先进的方法，为推理增强LLM树立了强有力的新范式。
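
Illustrative sketch (not TrSFT itself): the abstract contrasts forward KL, which SFT minimizes and which tends to cover every mode of the expert distribution, with reverse KL, which is mode-seeking. The toy comparison below shows that difference on a bimodal target; the distributions and names are assumptions chosen for this example.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Bimodal "expert" distribution over three outcomes.
target = [0.49, 0.02, 0.49]
# Two candidate models: one spreads mass everywhere, one commits to a single mode.
cover = [1 / 3, 1 / 3, 1 / 3]
seek = [0.96, 0.02, 0.02]

forward = {"cover": kl(target, cover), "seek": kl(target, seek)}  # KL(target || model)
reverse = {"cover": kl(cover, target), "seek": kl(seek, target)}  # KL(model || target)
```

Forward KL prefers the spread-out candidate because it heavily penalizes missing a mode of the target, while reverse KL prefers the candidate that commits to one mode, which is the behavior the abstract describes as favorable for subsequent RL.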

About Time: Model-free Reinforcement Learning with Timed Reward Machines

关于时间：无模型的定时奖励机强化学习

Authors: Anirban Majumdar, Ritam Raha, Rajarshi Roy, David Parker, Marta Kwiatkowska
Subjects: Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL); Logic in Computer Science (cs.LO)
Arxiv link: https://arxiv.org/abs/2512.17637
Pdf link: https://arxiv.org/pdf/2512.17637
Abstract Reward specification plays a central role in reinforcement learning (RL), guiding the agent's behavior. To express non-Markovian rewards, formalisms such as reward machines have been introduced to capture dependencies on histories. However, traditional reward machines lack the ability to model precise timing constraints, limiting their use in time-sensitive applications. In this paper, we propose timed reward machines (TRMs), which are an extension of reward machines that incorporate timing constraints into the reward structure. TRMs enable more expressive specifications with tunable reward logic, for example, imposing costs for delays and granting rewards for timely actions. We study model-free RL frameworks (i.e., tabular Q-learning) for learning optimal policies with TRMs under digital and real-time semantics. Our algorithms integrate the TRM into learning via abstractions of timed automata, and employ counterfactual-imagining heuristics that exploit the structure of the TRM to improve the search. Experimentally, we demonstrate that our algorithm learns policies that achieve high rewards while satisfying the timing constraints specified by the TRM on popular RL benchmarks. Moreover, we conduct comparative studies of performance under different TRM semantics, along with ablations that highlight the benefits of counterfactual-imagining.
中文摘要 奖励指定在强化学习（RL）中起着核心作用，指导智能体的行为。为了表达非马尔可夫奖励，引入了诸如奖励机等形式主义来捕捉对历史的依赖关系。然而，传统奖励机缺乏精确时序约束建模的能力，限制了其在时间敏感应用中的应用。本文提出了定时奖励机（TRM），它是将时序约束纳入奖励结构的奖励机的扩展。TRM通过可调节的奖励逻辑实现更具表现力的规范，例如对延迟施加成本，及时行动给予奖励。我们研究无模型的强化学习框架（即表格Q学习），用于在数字和实时语义下学习TRM的最优策略。我们的算法通过定时自动机的抽象将TRM整合进学习中，并采用反事实想象启发式方法，利用TRM的结构来改进搜索。实验显示，我们的算法能够学习能够在满足TRM对流行强化学习基准测试时限约束的同时，获得高回报的策略。此外，我们还对不同TRM语义下的表现进行了比较研究，并结合了凸显反事实想象优势的消融分析。
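
Illustrative sketch (not the paper's formalism or algorithm): a reward machine with a single clock, under a discrete-time reading in which each step takes one time unit. The reward for delivering depends on whether the clock guard is met. The labels, deadline, reward values, and class name are assumptions made for this example.

```python
from dataclasses import dataclass

@dataclass
class TimedRewardMachine:
    deadline: int = 3   # maximum number of time units allowed between pickup and delivery
    state: str = "u0"   # u0: waiting for pickup, u1: carrying, done: finished
    clock: int = 0

    def step(self, label):
        """Advance one time unit, react to the observed label, and return the reward."""
        self.clock += 1
        if self.state == "u0" and label == "pickup":
            self.state, self.clock = "u1", 0  # take the transition and reset the clock
            return 0.0
        if self.state == "u1" and label == "deliver":
            on_time = self.clock <= self.deadline  # clock guard on the transition
            self.state = "done"
            return 10.0 if on_time else 1.0
        return 0.0

# A timely run: pickup, one idle step, then delivery after two time units.
trm = TimedRewardMachine()
rewards = [trm.step(label) for label in ["pickup", None, "deliver"]]
```

A tabular Q-learner could then index its table by the tuple (environment state, machine state, clock value), which makes the otherwise history-dependent reward Markovian over that product state.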

Planning as Descent: Goal-Conditioned Latent Trajectory Synthesis in Learned Energy Landscapes

作为下降的规划：学习能量景观中的目标条件潜在轨迹综合

Authors: Carlos Vélez García, Miguel Cazorla, Jorge Pomares
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.17846
Pdf link: https://arxiv.org/pdf/2512.17846
Abstract We present Planning as Descent (PaD), a framework for offline goal-conditioned reinforcement learning that grounds trajectory synthesis in verification. Instead of learning a policy or explicit planner, PaD learns a goal-conditioned energy function over entire latent trajectories, assigning low energy to feasible, goal-consistent futures. Planning is realized as gradient-based refinement in this energy landscape, using identical computation during training and inference to reduce train-test mismatch common in decoupled modeling pipelines. PaD is trained via self-supervised hindsight goal relabeling, shaping the energy landscape around the planning dynamics. At inference, multiple trajectory candidates are refined under different temporal hypotheses, and low-energy plans balancing feasibility and efficiency are selected. We evaluate PaD on OGBench cube manipulation tasks. When trained on narrow expert demonstrations, PaD achieves state-of-the-art 95% success, strongly outperforming prior methods that peak at 68%. Remarkably, training on noisy, suboptimal data further improves success and plan efficiency, highlighting the benefits of verification-driven planning. Our results suggest learning to evaluate and refine trajectories provides a robust alternative to direct policy learning for offline, reward-free planning.
中文摘要 我们提出了“规划即下降”（PaD），一种离线目标条件强化学习框架，将轨迹合成建立在验证之上。PaD不学习策略或显式规划器，而是学习一个定义在整条潜在轨迹上的目标条件能量函数，为可行且与目标一致的未来分配低能量。规划通过在该能量景观中进行基于梯度的细化来实现，在训练和推理中使用相同的计算，以减少解耦建模流水线中常见的训练与测试不匹配。PaD通过自监督的事后目标重标注（hindsight goal relabeling）进行训练，围绕规划动态塑造能量景观。在推理时，在不同的时间假设下细化多个轨迹候选，并选择在可行性和效率之间取得平衡的低能量方案。我们在OGBench立方体操作任务上评估PaD。在狭窄的专家演示上训练时，PaD达到了95%的最先进成功率，远远优于以往最高为68%的方法。值得注意的是，在含噪声的次优数据上训练进一步提升了成功率和规划效率，凸显了验证驱动规划的优势。我们的结果表明，学习评估和细化轨迹，为离线、无奖励规划提供了一种可替代直接策略学习的稳健方案。

AnyTask: an Automated Task and Data Generation Framework for Advancing Sim-to-Real Policy Learning

AnyTask：一个自动化任务与数据生成框架，用于推动仿真到现实的策略学习

Authors: Ran Gong, Xiaohan Zhang, Jinghuan Shang, Maria Vittoria Minniti, Jigarkumar Patel, Valerio Pepe, Riedana Yan, Ahmet Gundogdu, Ivan Kapelyukh, Ali Abbas, Xiaoqiang Yan, Harsh Patel, Laura Herlant, Karl Schmeckpeper
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.17853
Pdf link: https://arxiv.org/pdf/2512.17853
Abstract Generalist robot learning remains constrained by data: large-scale, diverse, and high-quality interaction data are expensive to collect in the real world. While simulation has become a promising way for scaling up data collection, the related tasks, including simulation task design, task-aware scene generation, expert demonstration synthesis, and sim-to-real transfer, still demand substantial human effort. We present AnyTask, an automated framework that pairs massively parallel GPU simulation with foundation models to design diverse manipulation tasks and synthesize robot data. We introduce three AnyTask agents for generating expert demonstrations aiming to solve as many tasks as possible: 1) ViPR, a novel task and motion planning agent with VLM-in-the-loop Parallel Refinement; 2) ViPR-Eureka, a reinforcement learning agent with generated dense rewards and LLM-guided contact sampling; 3) ViPR-RL, a hybrid planning and learning approach that jointly produces high-quality demonstrations with only sparse rewards. We train behavior cloning policies on generated data, validate them in simulation, and deploy them directly on real robot hardware. The policies generalize to novel object poses, achieving 44% average success across a suite of real-world pick-and-place, drawer opening, contact-rich pushing, and long-horizon manipulation tasks. Our project website is at this https URL.
中文摘要 通用机器人学习仍受限于数据：大规模、多样且高质量的交互数据在现实世界中收集成本高昂。尽管仿真已成为扩大数据收集的有前景方式，但相关任务，包括仿真任务设计、任务感知场景生成、专家演示合成以及仿真到现实的迁移，仍需大量人力投入。我们提出AnyTask，一个自动化框架，将大规模并行GPU仿真与基础模型结合，设计多样化的操作任务并合成机器人数据。我们引入了三种AnyTask智能体，用于生成专家演示，旨在解决尽可能多的任务：1）ViPR，一种新型任务与运动规划智能体，采用VLM在环并行细化（VLM-in-the-loop Parallel Refinement）；2）ViPR-Eureka，一种强化学习智能体，采用生成的稠密奖励和LLM引导的接触采样；3）ViPR-RL，一种混合规划与学习方法，仅用稀疏奖励即可共同产生高质量的演示。我们在生成的数据上训练行为克隆策略，在仿真中验证，并直接部署到真实机器人硬件上。这些策略能够泛化到新的物体位姿，在一系列真实世界的抓取放置、打开抽屉、接触丰富的推动以及长时程操作任务中平均成功率达44%。我们的项目网站见this https URL。

Distributionally Robust Imitation Learning: Layered Control Architecture for Certifiable Autonomy

分布鲁棒模仿学习：面向可认证自主性的分层控制架构

Authors: Aditya Gahlawat, Ahmed Aboudonia, Sandeep Banik, Naira Hovakimyan, Nikolai Matni, Aaron D. Ames, Gioele Zardini, Alberto Speranzon
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.17899
Pdf link: https://arxiv.org/pdf/2512.17899
Abstract Imitation learning (IL) enables autonomous behavior by learning from expert demonstrations. While more sample-efficient than comparative alternatives like reinforcement learning, IL is sensitive to compounding errors induced by distribution shifts. There are two significant sources of distribution shifts when using IL-based feedback laws on systems: distribution shifts caused by policy error and distribution shifts due to exogenous disturbances and endogenous model errors due to lack of learning. Our previously developed approaches, Taylor Series Imitation Learning (TaSIL) and $\mathcal{L}_1$-Distributionally Robust Adaptive Control ($\mathcal{L}_1$-DRAC), address the challenge of distribution shifts in complementary ways. While TaSIL offers robustness against policy error-induced distribution shifts, $\mathcal{L}_1$-DRAC offers robustness against distribution shifts due to aleatoric and epistemic uncertainties. To enable certifiable IL for learned and/or uncertain dynamical systems, we formulate the Distributionally Robust Imitation Policy (DRIP) architecture, a Layered Control Architecture (LCA) that integrates TaSIL and $\mathcal{L}_1$-DRAC. By judiciously designing individual layer-centric input and output requirements, we show how we can guarantee certificates for the entire control pipeline. Our solution paves the path for designing fully certifiable autonomy pipelines, by integrating learning-based components, such as perception, with certifiable model-based decision-making through the proposed LCA approach.
中文摘要 模仿学习（IL）通过向专家演示学习来实现自主行为。虽然IL比强化学习等同类替代方案具有更高的样本效率，但它对分布偏移引起的复合误差十分敏感。在系统上使用基于IL的反馈律时，分布偏移有两个重要来源：由策略误差引起的分布偏移，以及由外生扰动和因学习不足而产生的内生模型误差引起的分布偏移。我们此前提出的方法，泰勒级数模仿学习（TaSIL）和$\mathcal{L}_1$-分布鲁棒自适应控制（$\mathcal{L}_1$-DRAC），以互补的方式应对分布偏移的挑战。TaSIL对策略误差引起的分布偏移具有鲁棒性，而$\mathcal{L}_1$-DRAC则对由偶然不确定性和认知不确定性引起的分布偏移具有鲁棒性。为了在学习得到的和/或不确定的动力系统上实现可认证的IL，我们构建了分布鲁棒模仿策略（DRIP）架构，这是一种集成了TaSIL和$\mathcal{L}_1$-DRAC的分层控制架构（LCA）。通过审慎设计以各层为中心的输入和输出要求，我们展示了如何为整个控制流水线提供可保证的证书。借助所提出的LCA方法，将感知等基于学习的组件与可认证的基于模型的决策相结合，我们的方案为设计完全可认证的自主性流水线铺平了道路。

Keyword: diffusion policy

Kinematics-Aware Diffusion Policy with Consistent 3D Observation and Action Space for Whole-Arm Robotic Manipulation

运动学感知扩散策略：具有一致三维观测与动作空间的全臂机器人操作

Authors: Kangchen Lv, Mingrui Yu, Yongyi Jia, Chenyu Zhang, Xiang Li
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.17568
Pdf link: https://arxiv.org/pdf/2512.17568
Abstract Whole-body control of robotic manipulators with awareness of full-arm kinematics is crucial for many manipulation scenarios involving body collision avoidance or body-object interactions, which makes it insufficient to consider only the end-effector poses in policy learning. The typical approach for whole-arm manipulation is to learn actions in the robot's joint space. However, the unalignment between the joint space and actual task space (i.e., 3D space) increases the complexity of policy learning, as generalization in task space requires the policy to intrinsically understand the non-linear arm kinematics, which is difficult to learn from limited demonstrations. To address this issue, this letter proposes a kinematics-aware imitation learning framework with consistent task, observation, and action spaces, all represented in the same 3D space. Specifically, we represent both robot states and actions using a set of 3D points on the arm body, naturally aligned with the 3D point cloud observations. This spatially consistent representation improves the policy's sample efficiency and spatial generalizability while enabling full-body control. Built upon the diffusion policy, we further incorporate kinematics priors into the diffusion processes to guarantee the kinematic feasibility of output actions. The joint angle commands are finally calculated through an optimization-based whole-body inverse kinematics solver for execution. Simulation and real-world experimental results demonstrate higher success rates and stronger spatial generalizability of our approach compared to existing methods in body-aware manipulation policy learning.
中文摘要 具备全臂运动学感知的机械臂全身控制，对于许多涉及本体避碰或本体与物体交互的操作场景至关重要，因此在策略学习中仅考虑末端执行器位姿是不够的。全臂操作的典型方法是在机器人的关节空间中学习动作。然而，关节空间与实际任务空间（即三维空间）之间的不一致增加了策略学习的复杂性，因为在任务空间中的泛化要求策略从本质上理解非线性的机械臂运动学，而这很难从有限的演示中学到。为解决这一问题，本文提出了一种运动学感知的模仿学习框架，其任务空间、观测空间和动作空间保持一致，均在同一三维空间中表示。具体来说，我们用机械臂本体上的一组三维点同时表示机器人的状态和动作，使其与三维点云观测自然对齐。这种空间一致的表示提升了策略的样本效率和空间泛化能力，同时实现了全身控制。在扩散策略的基础上，我们进一步将运动学先验融入扩散过程，以保证输出动作的运动学可行性。最后，通过基于优化的全身逆运动学求解器计算关节角度指令以供执行。仿真和真实世界实验结果表明，与现有的本体感知操作策略学习方法相比，我们的方法具有更高的成功率和更强的空间泛化能力。
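
Illustrative sketch (not the authors' implementation): the abstract above describes representing the arm by a set of 3D points on its body and then recovering joint commands with an optimization-based inverse kinematics solver. The toy example below shows that idea on a planar two-link arm. The link lengths, the choice of two keypoints (elbow and tip), the function names, and the plain gradient-descent solver with finite-difference gradients are all assumptions made for illustration; the paper's actual solver and point set are not specified in the abstract.

```python
import math

LINK_LENGTHS = (1.0, 1.0)  # hypothetical link lengths of a planar 2-link arm


def arm_points(q):
    """Forward kinematics: joint angles -> 3D keypoints (elbow, tip) on the arm body."""
    l1, l2 = LINK_LENGTHS
    elbow = (l1 * math.cos(q[0]), l1 * math.sin(q[0]), 0.0)
    tip = (
        elbow[0] + l2 * math.cos(q[0] + q[1]),
        elbow[1] + l2 * math.sin(q[0] + q[1]),
        0.0,
    )
    return [elbow, tip]


def point_loss(q, targets):
    """Sum of squared distances between the arm keypoints and the target keypoints."""
    return sum(
        (p - t) ** 2
        for point, target in zip(arm_points(q), targets)
        for p, t in zip(point, target)
    )


def solve_ik(targets, q_init=(0.0, 0.0), step=0.05, iters=5000, eps=1e-6):
    """Minimize point_loss by gradient descent with central-difference gradients."""
    q = list(q_init)
    for _ in range(iters):
        grad = []
        for i in range(len(q)):
            q_hi = list(q)
            q_lo = list(q)
            q_hi[i] += eps
            q_lo[i] -= eps
            grad.append(
                (point_loss(q_hi, targets) - point_loss(q_lo, targets)) / (2 * eps)
            )
        q = [qi - step * gi for qi, gi in zip(q, grad)]
    return q


# "Action" expressed as desired 3D keypoints, here generated from known joint angles.
desired_points = arm_points([0.6, -0.4])
q_solution = solve_ik(desired_points)
```

Because the elbow keypoint is constrained along with the tip, the whole arm configuration is pinned down, which is what distinguishes this from end-effector-only inverse kinematics, where the same tip position admits more than one arm configuration.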