Arxiv Papers of Today

Generated: 2025-12-01 16:35:37 (UTC+8); arXiv release time: 2025-12-01 20:00 EST (2025-12-02 09:00 UTC+8)

45 related papers today

Keyword: reinforcement learning

GPS: General Per-Sample Prompter

Authors: Pawel Batorski, Paul Swoboda
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.21714
Pdf link: https://arxiv.org/pdf/2511.21714
Abstract LLMs are sensitive to prompting, with task performance often hinging on subtle, sometimes imperceptible variations in phrasing. As a result, crafting effective prompts manually remains challenging and time-consuming. Recent automatic prompting methods mitigate this difficulty but face three key limitations: (i) for each new task, they require large datasets to train good prompts; (ii) they rely on costly optimization loops that may take hours; (iii) they typically produce a single task-level prompt that does not adapt to the individual input problem to be solved. We propose GPS, the first general-purpose, per-sample prompting method. Without any task-specific tuning, GPS generates a tailored prompt for each unseen input, improving performance across diverse tasks. The prompter is trained with reinforcement learning on a suite of training tasks and includes a novel regularization for effectively adapting to per-sample prompting. Finally, we employ Minimum Bayes Risk decoding to stabilize inference. Empirically, GPS demonstrates competitive performance: we attain second-best results among baselines on text simplification, third-best results on summarization, and on-par results on classification, while not training on any of these tasks, in contrast to the baselines. For in-domain prompting, we obtain state-of-the-art results on GSM8K. Our work shows the potential of a novel and effective paradigm for automatic prompting: generating adaptive, input-specific prompts without extensive optimization and without access to a task-specific training set. Our code is available at this https URL.
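
The Minimum Bayes Risk decoding step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the token-level Jaccard utility and the names `jaccard` and `mbr_select` are assumptions chosen for illustration. The idea shown is only the generic one, namely picking the sampled candidate that agrees most with the other samples.

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity, used here as a stand-in utility function."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def mbr_select(candidates, utility=jaccard):
    """Return the candidate with the highest average utility against the others."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return candidates[0]
    best, best_score = candidates[0], float("-inf")
    for i, hyp in enumerate(candidates):
        # Treat every other sample as a pseudo-reference for this hypothesis.
        others = [c for j, c in enumerate(candidates) if j != i]
        score = sum(utility(hyp, ref) for ref in others) / len(others)
        if score > best_score:
            best, best_score = hyp, score
    return best


# Hypothetical sampled prompts for one input.
samples = [
    "solve the problem step by step",
    "solve the problem step by step carefully",
    "translate the text into french",
]
selected = mbr_select(samples)
print(selected)
```

In practice the utility would be a task-appropriate similarity metric; the outlier sample is unlikely to be selected because it scores poorly against the rest.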

Goal-Directed Search Outperforms Goal-Agnostic Memory Compression in Long-Context Memory Tasks

Authors: Yicong Zheng, Kevin L. McKee, Thomas Miconi, Zacharie Bugaud, Mick van Gelderen, Jed McCaleb
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.21726
Pdf link: https://arxiv.org/pdf/2511.21726
Abstract How to enable human-like long-term memory in large language models (LLMs) has been a central question for unlocking more general capabilities such as few-shot generalization. Existing memory frameworks and benchmarks focus on finding the optimal memory compression algorithm for higher performance in tasks that require recollection and sometimes further reasoning. However, such efforts have ended up building more human bias into the compression algorithm, through the search for the best prompts and memory architectures that suit specific benchmarks, rather than finding a general solution that would work on other data distributions. On the other hand, goal-directed search on uncompressed information could potentially exhibit superior performance because compression is lossy, and a predefined compression algorithm will not fit all raw data distributions. Here we present SUMER (Search in Uncompressed Memory via Experience Replay), an end-to-end reinforcement learning agent with verifiable reward (RLVR) that learns to use search tools to gather information and answer a target question. On the LoCoMo dataset for long-context conversation understanding, SUMER with Qwen2.5-7B-Instruct learned to use search tools and outperformed all other biased memory compression approaches and also the full-context baseline, reaching SOTA performance (43% gain over the prior best). We demonstrate that a simple search method applied to raw data outperforms goal-agnostic and biased compression algorithms in current long-context memory tasks, arguing for new paradigms and benchmarks that are more dynamic and autonomously scalable. Code for SUMER and all implemented baselines is publicly available at this https URL.

Factors That Support Grounded Responses in LLM Conversations: A Rapid Review

Authors: Gabriele Cesar Iwashima, Claudia Susie Rodrigues, Claudio Dipolitto, Geraldo Xexéo
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.21762
Pdf link: https://arxiv.org/pdf/2511.21762
Abstract Large language models (LLMs) may generate outputs that are misaligned with user intent, lack contextual grounding, or exhibit hallucinations during conversation, which compromises the reliability of LLM-based applications. This review aimed to identify and analyze techniques that align LLM responses with conversational goals, ensure grounding, and reduce hallucination and topic drift. We conducted a Rapid Review guided by the PRISMA framework and the PICO strategy to structure the search, filtering, and selection processes. The alignment strategies identified were categorized according to the LLM lifecycle phase in which they operate: inference-time, post-training, and reinforcement learning-based methods. Among these, inference-time approaches emerged as particularly efficient, aligning outputs without retraining while supporting user intent, contextual grounding, and hallucination mitigation. The reviewed techniques provided structured mechanisms for improving the quality and reliability of LLM responses across key alignment objectives.

Prompted Policy Search: Reinforcement Learning through Linguistic and Numerical Reasoning in LLMs

Authors: Yifan Zhou, Sachin Grover, Mohamed El Mistiri, Kamalesh Kalirathnam, Pratyush Kerhalkar, Swaroop Mishra, Neelesh Kumar, Sanket Gaurav, Oya Aran, Heni Ben Amor
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.21928
Pdf link: https://arxiv.org/pdf/2511.21928
Abstract Reinforcement Learning (RL) traditionally relies on scalar reward signals, limiting its ability to leverage the rich semantic knowledge often available in real-world tasks. In contrast, humans learn efficiently by combining numerical feedback with language, prior knowledge, and common sense. We introduce Prompted Policy Search (ProPS), a novel RL method that unifies numerical and linguistic reasoning within a single framework. Unlike prior work that augments existing RL components with language, ProPS places a large language model (LLM) at the center of the policy optimization loop, directly proposing policy updates based on both reward feedback and natural language input. We show that LLMs can perform numerical optimization in-context, and that incorporating semantic signals, such as goals, domain knowledge, and strategy hints, can lead to more informed exploration and sample-efficient learning. ProPS is evaluated across fifteen Gymnasium tasks, spanning classic control, Atari games, and MuJoCo environments, and compared to seven widely adopted RL algorithms (e.g., PPO, SAC, TRPO). It outperforms all baselines on eight out of fifteen tasks and demonstrates substantial gains when provided with domain knowledge. These results highlight the potential of unifying semantics and numerics for transparent, generalizable, and human-aligned RL.

Heterogeneous Multi-Agent Reinforcement Learning with Attention for Cooperative and Scalable Feature Transformation

Authors: Tao Zhe, Huazhen Fang, Kunpeng Liu, Qian Lou, Tamzidul Hoque, Dongjie Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.21934
Pdf link: https://arxiv.org/pdf/2511.21934
Abstract Feature transformation enhances downstream task performance by generating informative features through mathematical feature crossing. Despite the advancements in deep learning, feature transformation remains essential for structured data, where deep models often struggle to capture complex feature interactions. Prior literature on automated feature transformation has achieved success but often relies on heuristics or exhaustive searches, leading to inefficient and time-consuming processes. Recent works employ reinforcement learning (RL) to enhance traditional approaches through a more effective trial-and-error way. However, two limitations remain: 1) Dynamic feature expansion during the transformation process, which causes instability and increases the learning complexity for RL agents; 2) Insufficient cooperation and communication between agents, which results in suboptimal feature crossing operations and degraded model performance. To address them, we propose a novel heterogeneous multi-agent RL framework to enable cooperative and scalable feature transformation. The framework comprises three heterogeneous agents, grouped into two types, each designed to select essential features and operations for feature crossing. To enhance communication among these agents, we implement a shared critic mechanism that facilitates information exchange during feature transformation. To handle the dynamically expanding feature space, we tailor multi-head attention-based feature agents to select suitable features for feature crossing. Additionally, we introduce a state encoding technique during the optimization process to stabilize and enhance the learning dynamics of the RL agents, resulting in more robust and reliable transformation policies. Finally, we conduct extensive experiments to validate the effectiveness, efficiency, robustness, and interpretability of our model.

Selecting User Histories to Generate LLM Users for Cold-Start Item Recommendation

Authors: Nachiket Subbaraman (1), Jaskinder Sarai (1), Aniruddh Nath (2), Lichan Hong (3), Lukasz Heldt (2), Li Wei (2), Zhe Zhao (1) ((1) UC Davis, (2) Google Inc., (3) Google DeepMind)
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2511.21989
Pdf link: https://arxiv.org/pdf/2511.21989
Abstract Large Language Models (LLMs) have demonstrated remarkable capabilities in reasoning, generalization, and simulating human-like behavior across a wide range of tasks. These strengths present new opportunities to enhance traditional recommendation systems (RS), especially in the cold-start item scenario where newly introduced items lack interactions. Existing works have used LLMs to address cold-start issues in traditional RS through data augmentation, but they have limitations. One recent work directly addresses this issue by prompting LLMs to generate augmented interaction data between randomly sampled users and cold-start items. Then, they train the traditional RS with augmented data, incorporating collaborative signals for cold-start items. Although they use LLMs to provide cold-start items with feedback, they use partial user histories, which does not allow the LLM to fully emulate the user. Furthermore, randomly selecting users is not optimal for augmentation. To address these challenges, we leverage the LLM as a user and develop a reinforcement learning (RL) framework that trains a policy to select users for augmentation, optimizing for cold-start item performance after augmented training. The policy model learns to select users for cold-start item data augmentation based on their behavioral features and histories. To optimize user selection for cold-start item performance, we employ a policy gradient method that updates the policy in the direction of actions that lead to high rewards. Experiments on Amazon Product Review datasets show substantial gains in cold-start item recall, demonstrating the effectiveness of our method as a scalable, serving-efficient augmentation strategy for modern RS.
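
The policy gradient update described in the abstract ("updates the policy in the direction of actions that lead to high rewards") can be sketched on a toy problem. The setup below is an assumption for illustration, not the paper's framework: three hypothetical candidate users, each with a fixed made-up reward, and a softmax policy over them. It uses the exact expected gradient rather than sampled rollouts so that the result is deterministic.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(logits, rewards, lr=0.5):
    """One exact gradient-ascent step on J = sum_i pi_i * r_i.

    For a softmax policy, dJ/dlogit_i = pi_i * (r_i - J), which equals the
    expectation of the REINFORCE gradient estimator.
    """
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    return [x + lr * p * (r - expected) for x, p, r in zip(logits, probs, rewards)]


# Hypothetical rewards: cold-start recall gain after augmenting with each user.
rewards = [0.1, 0.5, 0.9]
logits = [0.0, 0.0, 0.0]
for _ in range(200):
    logits = policy_gradient_step(logits, rewards)
probs = softmax(logits)
print([round(p, 3) for p in probs])
```

Probability mass shifts toward the user with the highest reward. A real system would estimate the gradient from sampled selections and noisy rewards measured after augmented training.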

MedEyes: Learning Dynamic Visual Focus for Medical Progressive Diagnosis

Authors: Chunzheng Zhu, Yangfang Lin, Shen Chen, Yijun Wang, Jianxin Lin
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.22018
Pdf link: https://arxiv.org/pdf/2511.22018
Abstract Accurate medical diagnosis often involves progressive visual focusing and iterative reasoning, characteristics commonly observed in clinical workflows. While recent vision-language models demonstrate promising chain-of-thought (CoT) reasoning capabilities via reinforcement learning with verifiable rewards (RLVR), their purely on-policy learning paradigm tends to reinforce superficially coherent but clinically inaccurate reasoning paths. We propose MedEyes, a novel reinforcement learning framework that dynamically models clinician-style diagnostic reasoning by progressively attending to and interpreting relevant medical image regions. By incorporating off-policy expert guidance, MedEyes converts expert visual search trajectories into structured external behavioral signals, guiding the model toward clinically aligned visual reasoning. We design the Gaze-guided Reasoning Navigator (GRN) to emulate the diagnostic process through a dual-mode exploration strategy, scanning for systematic abnormality localization and drilling for detailed regional analysis. To balance expert imitation and autonomous discovery, we introduce the Confidence Value Sampler (CVS), which employs nucleus sampling and adaptive termination to create diverse yet credible exploration paths. Finally, the dual-stream GRPO optimization framework decouples on-policy and off-policy learning signals, mitigating reward assimilation and entropy collapse. Experiments demonstrate that MedEyes achieves an average performance improvement of +8.5% across multiple medical VQA benchmarks, validating its potential in building interpretable medical AI systems.
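
Nucleus (top-p) sampling, which the abstract says the Confidence Value Sampler employs, can be sketched generically as follows. This is the standard technique, not the CVS itself: the function names and the four-region distribution are hypothetical, and adaptive termination is not modeled.

```python
import random


def nucleus_filter(probs, top_p):
    """Keep the smallest set of highest-probability items whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}


def nucleus_sample(probs, top_p, rng=random):
    """Sample one index from the renormalized nucleus."""
    nucleus = nucleus_filter(probs, top_p)
    indices = list(nucleus)
    weights = [nucleus[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]


# Hypothetical distribution over four candidate image regions.
region_probs = [0.5, 0.3, 0.15, 0.05]
nucleus = nucleus_filter(region_probs, top_p=0.75)
choice = nucleus_sample(region_probs, top_p=0.75, rng=random.Random(0))
print(nucleus, choice)
```

Truncating the low-probability tail keeps exploration diverse while excluding implausible choices, which is the property the abstract relies on.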

Hybrid Stackelberg Game and Diffusion-based Auction for Two-tier Agentic AI Task Offloading in Internet of Agents

Authors: Yue Zhong, Yongju Tong, Jiawen Kang, Minghui Dai, Hong-Ning Dai, Zhou Su, Dusit Niyato
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.22076
Pdf link: https://arxiv.org/pdf/2511.22076
Abstract The Internet of Agents (IoA) is rapidly gaining prominence as a foundational architecture for interconnected intelligent systems, designed to facilitate seamless discovery, communication, and collaborative reasoning among a vast network of Artificial Intelligence (AI) agents. Powered by Large Language and Vision-Language Models, IoA enables the development of interactive, rational agents capable of complex cooperation, moving far beyond traditional isolated models. IoA involves physical entities, i.e., Wireless Agents (WAs) with limited onboard resources, which need to offload their compute-intensive agentic AI services to nearby servers. Such servers can be Mobile Agents (MAs), e.g., vehicle agents, or Fixed Agents (FAs), e.g., end-side units agents. Given their fixed geographical locations and stable connectivity, FAs can serve as reliable communication gateways and task aggregation points. This stability allows them to effectively coordinate with and offload to an Aerial Agent (AA) tier, which has an advantage not affordable for highly mobile MAs with dynamic connectivity limitations. As such, we propose a two-tier optimization approach. The first tier employs a multi-leader multi-follower Stackelberg game. In the game, MAs and FAs act as the leaders who set resource prices. WAs are the followers to determine task offloading ratios. However, when FAs become overloaded, they can further offload tasks to available aerial resources. Therefore, the second tier introduces a Double Dutch Auction model where overloaded FAs act as the buyers to request resources, and AAs serve as the sellers for resource provision. We then develop a diffusion-based Deep Reinforcement Learning algorithm to solve the model. Numerical results demonstrate the superiority of our proposed scheme in facilitating task offloading.

Adaptive Dueling Double Deep Q-networks in Uniswap V3 Replication and Extension with Mamba

Authors: Zhaofeng Zhang
Subjects: Machine Learning (cs.LG); Trading and Market Microstructure (q-fin.TR)
Arxiv link: https://arxiv.org/abs/2511.22101
Pdf link: https://arxiv.org/pdf/2511.22101
Abstract The report goes through the main steps of replicating and improving the article "Adaptive Liquidity Provision in Uniswap V3 with Deep Reinforcement Learning." The replication part includes how to obtain data from the Uniswap Subgraph, details of the implementation, and comments on the results. After the replication, I propose a new structure based on the original model, which combines Mamba with DDQN and a new reward function. In this new structure, I clean the data again and introduce two new baselines for comparison. As a result, although the model has not yet been applied to all datasets, it shows stronger theoretical support than the original model and performs better in some tests.
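
For readers unfamiliar with the "dueling double" part of the title, the two standard ingredients can be sketched in a few lines. This shows only the textbook formulas, not the report's model: the Mamba component and the new reward function are not represented, and all numbers are made up.

```python
def dueling_q(value, advantages):
    """Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


def double_dqn_target(reward, gamma, done, q_online_next, q_target_next):
    """Double DQN target: the online net picks the action, the target net scores it."""
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[best_action]


# Toy values for a state with three actions.
q_values = dueling_q(1.0, [1.0, 2.0, 3.0])
target = double_dqn_target(1.0, 0.9, False, [1.0, 5.0, 2.0], [10.0, 3.0, 7.0])
print(q_values, target)
```

Note that the target uses the target network's value for action 1 (chosen by the online network) rather than the target network's own maximum, which is how Double DQN reduces overestimation.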

Representative Action Selection for Large Action Space: From Bandits to MDPs

Authors: Quan Zhou, Shie Mannor
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2511.22104
Pdf link: https://arxiv.org/pdf/2511.22104
Abstract We study the problem of selecting a small, representative action subset from an extremely large action space shared across a family of reinforcement learning (RL) environments -- a fundamental challenge in applications like inventory management and recommendation systems, where direct learning over the entire space is intractable. Our goal is to identify a fixed subset of actions that, for every environment in the family, contains a near-optimal action, thereby enabling efficient learning without exhaustively evaluating all actions. This work extends our prior results for meta-bandits to the more general setting of Markov Decision Processes (MDPs). We prove that our existing algorithm achieves performance comparable to using the full action space. This theoretical guarantee is established under a relaxed, non-centered sub-Gaussian process model, which accommodates greater environmental heterogeneity. Consequently, our approach provides a computationally and sample-efficient solution for large-scale combinatorial decision-making under uncertainty.

Energy Efficient Sleep Mode Optimization in 5G mmWave Networks via Multi Agent Deep Reinforcement Learning

Authors: Saad Masrur, Ismail Guvenc, David Lopez Perez
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.22105
Pdf link: https://arxiv.org/pdf/2511.22105
Abstract Dynamic sleep mode optimization (SMO) in millimeter-wave (mmWave) networks is essential for maximizing energy efficiency (EE) under stringent quality-of-service (QoS) constraints. However, existing optimization and reinforcement learning (RL) approaches rely on aggregated, static base station (BS) traffic models that fail to capture non-stationary traffic dynamics and suffer from large state-action spaces, limiting real-world deployment. To address these challenges, this paper proposes a multi-agent deep reinforcement learning (MARL) framework using a Double Deep Q-Network (DDQN), referred to as MARL-DDQN, for adaptive SMO in a 3D urban environment with a time-varying and community-based user equipment (UE) mobility model. Unlike conventional single-agent RL, MARL-DDQN enables scalable, distributed decision-making with minimal signaling overhead. A realistic BS power consumption model and beamforming are integrated to accurately quantify EE, while QoS is defined in terms of throughput. The method adapts SMO policies to maximize EE while mitigating inter-cell interference and ensuring throughput fairness. Simulations show that MARL-DDQN outperforms state-of-the-art strategies, including All On, iterative QoS-aware load-based (IT-QoS-LB), MARL-DDPG, and MARL-PPO, achieving up to 0.60 Mbit/Joule EE, 8.5 Mbps 10th-percentile throughput, and meeting QoS constraints 95% of the time under dynamic scenarios.

An energy-efficient spiking neural network with continuous learning for self-adaptive brain-machine interface

Authors: Zhou Biyan, Arindam Basu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.22108
Pdf link: https://arxiv.org/pdf/2511.22108
Abstract The number of simultaneously recorded neurons follows an exponentially increasing trend in implantable brain-machine interfaces (iBMIs). Integrating the neural decoder in the implant is an effective data compression method for future wireless iBMIs. However, the non-stationarity of the system makes the performance of the decoder unreliable. To avoid frequent retraining of the decoder and to ensure the safety and comfort of the iBMI user, continuous learning is essential for real-life applications. Since Deep Spiking Neural Networks (DSNNs) are being recognized as a promising approach for developing resource-efficient neural decoders, we propose continuous learning approaches with Reinforcement Learning (RL) algorithms adapted for DSNNs. Banditron and AGREL are chosen as the two candidate RL algorithms since they can be trained with limited computational resources, effectively addressing the non-stationarity problem and fitting the energy constraints of implantable devices. To assess the effectiveness of the proposed methods, we conducted both open-loop and closed-loop experiments. The accuracy of open-loop experiments conducted with DSNN Banditron and DSNN AGREL remains stable over extended periods. Meanwhile, in the closed-loop experiment with perturbations, DSNN Banditron achieved a time-to-target comparable to that of DSNN AGREL while reducing memory access usage by 98% and the multiply-and-accumulate (MAC) operations required during training by 99%. Compared to previous continuous learning SNN decoders, DSNN Banditron requires 98% less compute, making it a prime candidate for future wireless iBMI systems.

PROMPTMINER: Black-Box Prompt Stealing against Text-to-Image Generative Models via Reinforcement Learning and Fuzz Optimization

Authors: Mingzhe Li, Renhao Zhang, Zhiyang Wen, Siqi Pan, Bruno Castro da Silva, Juan Zhai, Shiqing Ma
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.22119
Pdf link: https://arxiv.org/pdf/2511.22119
Abstract Text-to-image (T2I) generative models such as Stable Diffusion and FLUX can synthesize realistic, high-quality images directly from textual prompts. The resulting image quality depends critically on well-crafted prompts that specify both subjects and stylistic modifiers, which have become valuable digital assets. However, the rising value and ubiquity of high-quality prompts expose them to security and intellectual-property risks. One key threat is the prompt stealing attack, i.e., the task of recovering the textual prompt that generated a given image. Prompt stealing enables unauthorized extraction and reuse of carefully engineered prompts, yet it can also support beneficial applications such as data attribution, model provenance analysis, and watermarking validation. Existing approaches often assume white-box gradient access, require large-scale labeled datasets for supervised training, or rely solely on captioning without explicit optimization, limiting their practicality and adaptability. To address these challenges, we propose PROMPTMINER, a black-box prompt stealing framework that decouples the task into two phases: (1) a reinforcement learning-based optimization phase to reconstruct the primary subject, and (2) a fuzzing-driven search phase to recover stylistic modifiers. Experiments across multiple datasets and diffusion backbones demonstrate that PROMPTMINER achieves superior results, with CLIP similarity up to 0.958 and textual alignment with SBERT up to 0.751, surpassing all baselines. Even when applied to in-the-wild images with unknown generators, it outperforms the strongest baseline by 7.5 percent in CLIP similarity, demonstrating better generalization. Finally, PROMPTMINER maintains strong performance under defensive perturbations, highlighting remarkable robustness. Code: this https URL

TinyLLM: Evaluation and Optimization of Small Language Models for Agentic Tasks on Edge Devices

Authors: Mohd Ariful Haque (1), Fahad Rahman (2), Kishor Datta Gupta (1), Khalil Shujaee (1), Roy George (1) ((1) Clark Atlanta University, (2) United International University)
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.22138
Pdf link: https://arxiv.org/pdf/2511.22138
Abstract This paper investigates the effectiveness of small language models (SLMs) for agentic tasks (function/tool/API calling) with a focus on running agents on edge devices without reliance on cloud infrastructure. We evaluate SLMs using the Berkeley Function Calling Leaderboard (BFCL) framework and describe parameter-driven optimization strategies that include supervised fine-tuning (SFT), parameter-efficient fine-tuning (PEFT), reinforcement learning (RL)-based optimization, preference alignment via Direct Preference Optimization (DPO), and hybrid methods. We report results for models including TinyAgent, TinyLlama, Qwen, and xLAM across BFCL categories (simple, multiple, parallel, parallel-multiple, and relevance detection), both in live and non-live settings, and in multi-turn evaluations. We additionally detail a DPO training pipeline constructed from AgentBank data (e.g., ALFRED), including our conversion of SFT data to chosen-rejected pairs using TinyLlama responses as rejected outputs and manual validation. Our results demonstrate clear accuracy differences across model scales where medium-sized models (1-3B parameters) significantly outperform ultra-compact models (<1B parameters), achieving up to 65.74% overall accuracy, and 55.62% multi-turn accuracy with hybrid optimization. This study highlights the importance of hybrid optimization strategies that enable small language models to deliver accurate, efficient, and stable agentic AI on edge devices, making privacy-preserving, low-latency autonomous agents practical beyond the cloud.
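
The conversion of SFT data into chosen-rejected pairs described in the abstract might look roughly like the sketch below. The field names (`prompt`, `response`, `chosen`, `rejected`), the helper `build_preference_pairs`, and the rule of dropping identical pairs are all assumptions for illustration; the paper's actual pipeline also includes manual validation, which is not modeled here.

```python
def build_preference_pairs(sft_examples, weak_model_responses):
    """Turn SFT (prompt, response) rows into DPO-style chosen/rejected pairs.

    The reference SFT response becomes "chosen" and the weaker model's output
    becomes "rejected". Pairs where the two match are dropped because they
    carry no preference signal (an assumption, not stated in the abstract).
    """
    pairs = []
    for example in sft_examples:
        prompt, chosen = example["prompt"], example["response"]
        rejected = weak_model_responses.get(prompt)
        if rejected is None or rejected.strip() == chosen.strip():
            continue
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


# Hypothetical SFT rows and hypothetical outputs from a smaller model.
sft_rows = [
    {"prompt": "Turn on the lamp.", "response": "toggle(lamp, on)"},
    {"prompt": "Open the fridge.", "response": "open(fridge)"},
]
weak_outputs = {
    "Turn on the lamp.": "goto(lamp)",
    "Open the fridge.": "open(fridge)",
}
pairs = build_preference_pairs(sft_rows, weak_outputs)
print(pairs)
```

Only the first row yields a pair here, since the weak model's answer to the second prompt matches the reference.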
中文摘要 本文探讨了小型语言模型（SLMs）在代理任务（功能/工具/API调用）中的有效性，重点是如何在边缘设备上运行代理，而不依赖云基础设施。我们使用伯克利函数调用排行榜（BFCL）框架评估SLM，并描述了参数驱动优化策略，包括监督微调（SFT）、参数高效微调（PEFT）、基于强化学习（RL）的优化、通过直接偏好优化（DPO）进行偏好对齐以及混合方法。我们报告了包括 TinyAgent、TinyLlama、Qwen 和 xLAM 在 BFCL 类别（简单、多重、并行、并行多重和相关性检测）中的模型结果，涵盖实时和非实时环境以及多回合评估。我们还详细介绍了基于AgentBank数据（如ALFRED）构建的DPO训练流程，包括利用TinyLlama反应作为拒绝输出将SFT数据转换为选择拒绝对，以及手动验证。我们的结果显示，不同模型尺度的准确率差异显著优于超紧凑模型（<1B参数），采用混合优化实现65.74%的整体准确率和55.62%的多回合精度。本研究强调了混合优化策略的重要性，使小型语言模型能够在边缘设备上提供准确、高效且稳定的代理人工智能，使得保护隐私、低延迟的自主智能体在云端之外变得实用。
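The abstract above mentions converting SFT data into chosen-rejected pairs and training with Direct Preference Optimization. As a rough illustration only, the sketch below shows the standard per-example DPO loss together with a hypothetical pair builder. The function names, the dictionary layout, and the default `beta` are assumptions made for this example and are not taken from the paper.

```python
import math


def to_preference_pair(prompt, sft_response, weak_response):
    """Hypothetical helper: the SFT target becomes 'chosen', the weaker
    model's response (TinyLlama in the abstract) becomes 'rejected'."""
    return {"prompt": prompt, "chosen": sft_response, "rejected": weak_response}


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities
    under the policy and under the frozen reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference model, the margin is zero and the loss is log 2. The loss falls as the policy shifts probability toward the chosen response relative to the reference.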

Guiding the Inner Eye: A Framework for Hierarchical and Flexible Visual Grounded Reasoning

引导内心之眼：层级与灵活视觉基础推理的框架

Authors: Zhaoyang Wei, Wenchao Ding, Yanchao Hao, Xi Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.22172
Pdf link: https://arxiv.org/pdf/2511.22172
Abstract Models capable of "thinking with images" by dynamically grounding their reasoning in visual evidence represent a major leap in multimodal AI. However, replicating and advancing this ability is non-trivial, with current methods often trapped between the instability of end-to-end reinforcement learning (RL) and the rigidity of supervised fine-tuning (SFT). This leads to models that either struggle to learn or lack the cognitive flexibility required for complex, real-world scenes. To navigate this dilemma, we introduce GRiP (Guided Reasoning and Perception), a novel two-stage training framework that cultivates robust and flexible visual grounded reasoning by explicitly guiding the model's perceptual focus and logical pathways. GRiP's core lies in its cognitive-enhanced RL stage, which features two key innovations: (1) a Salience-Weighted IoU Reward that incentivizes the model to prioritize the localization of mission-critical objects over trivial distractors, and (2) a Multi-Heuristic Reward that encourages cognitive flexibility by rewarding diverse yet logically valid reasoning pathways. Initialized from the Qwen2.5-VL-7B model, GRiP demonstrates significant performance gains across multiple challenging benchmarks. It achieves state-of-the-art results among open-source models on the highly challenging TreeBench and V* Bench, proving its effectiveness in complex visual reasoning. Our work demonstrates that moving beyond simplistic rewards and instead guiding models with cognitively-inspired signals for what to see and how to think is crucial for unlocking the next level of multimodal intelligence. The code will be made publicly available.
中文摘要 能够通过动态以视觉证据为基础“用图像思考”的模型，代表了多模态人工智能的一大飞跃。然而，复制和提升这一能力并非易事，当前方法常常夹在端到端强化学习（RL）的不稳定性和监督微调（SFT）的僵化之间。这导致模型要么难以学习，要么缺乏应对复杂现实场景所需的认知灵活性。为解决这一困境，我们引入了GRiP（引导推理与感知），这是一种新的两阶段训练框架，通过明确引导模型的感知焦点和逻辑路径，培养稳健灵活的视觉基础推理能力。GRiP的核心在于其认知增强强化学习阶段，该阶段具有两项关键创新：（1）显著性加权的IoU奖励，激励模型优先定位关键任务对象而非琐碎干扰;（2）多启发式奖励，通过奖励多样但逻辑有效的推理路径，鼓励认知灵活性。GRiP基于Qwen2.5-VL-7B模型初始化，在多个具有挑战性的基准测试中展现出显著的性能提升。它在极具挑战性的 TreeBench 和 V* Bench 上，在开源模型中取得了最先进的成绩，证明了其在复杂视觉推理中的有效性。我们的研究表明，超越简单的奖励，转而用认知启发的信号引导模型，指引该看什么、如何思考，是解锁多模态智能下一层次的关键。代码将公开。
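The abstract does not give the exact form of the Salience-Weighted IoU Reward. One plausible reading, shown here purely as an assumption, is a salience-weighted average of per-object IoU, so that localizing a mission-critical object counts for more than localizing a distractor. The box format `(x1, y1, x2, y2)` and the function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def salience_weighted_iou(pred_boxes, gt_boxes, salience):
    """Assumed reward form: weighted mean of per-object IoU, where
    salience[i] is the importance weight of object i."""
    total = sum(salience)
    if total <= 0:
        return 0.0
    weighted = sum(w * iou(p, g) for p, g, w in zip(pred_boxes, gt_boxes, salience))
    return weighted / total
```

Under this form, a prediction that matches the critical object exactly but misses a low-salience distractor still earns most of the reward.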

Focused Chain-of-Thought: Efficient LLM Reasoning via Structured Input Information

聚焦思维链：通过结构化输入信息实现高效的大型语言模型推理

Authors: Lukas Struppek, Dominik Hintersdorf, Hannah Struppek, Daniel Neider, Kristian Kersting
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.22176
Pdf link: https://arxiv.org/pdf/2511.22176
Abstract Recent large language models achieve strong reasoning performance by generating detailed chain-of-thought traces, but this often leads to excessive token use and high inference latency. Existing efficiency approaches typically focus on model-centric interventions, such as reinforcement learning or supervised fine-tuning, to reduce verbosity. In contrast, we propose a training-free, input-centric approach. Inspired by cognitive psychology, we introduce Focused Chain-of-Thought (F-CoT), which separates information extraction from the reasoning process. F-CoT first organizes the essential information from a query into a concise, structured context and then guides the model to reason exclusively over this context. By preventing attention to irrelevant details, F-CoT naturally produces shorter reasoning paths. On arithmetic word problems, F-CoT reduces generated tokens by 2-3x while maintaining accuracy comparable to standard zero-shot CoT. These results highlight structured input as a simple yet effective lever for more efficient LLM reasoning.
中文摘要 近年来的大型语言模型通过生成详细的思维链追踪实现了强大的推理性能，但这常常导致过度的令牌使用和较高的推理延迟。现有的效率方法通常侧重于以模型为中心的干预，如强化学习或监督式微调，以减少冗长。相比之下，我们提出一种无培训、以输入为中心的方法。受认知心理学启发，我们引入了聚焦思维链（F-CoT），将信息提取与推理过程分离。F-CoT 首先将查询中的核心信息组织成简明、结构化的上下文，然后引导模型仅基于该上下文进行推理。通过避免对无关细节的关注，F-CoT 自然产生了更短的推理路径。在算术应用题中，F-CoT 在保持与标准零点 CoT 相当的准确性的情况下，将生成的符号减少 2-3 倍。这些结果强调了结构化输入作为一种简单但有效的杠杆，能够实现更高效的大型语言模型推理。
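The two-step flow described in the abstract (first extract the essential information, then reason only over that structured context) can be sketched as two prompt builders. The prompt wording and function names below are invented for illustration and are not the prompts used by the authors. Calling an actual model is left out.

```python
def build_extraction_prompt(question):
    """Step 1 (assumed wording): ask the model to list only the facts
    and the target quantity, without solving anything yet."""
    return (
        "Extract the essential information from the problem below as a short "
        "structured list of known quantities and the quantity asked for. "
        "Do not solve the problem.\n\n"
        f"Problem: {question}"
    )


def build_reasoning_prompt(structured_context):
    """Step 2 (assumed wording): reason over the structured context only.
    The original question text is deliberately not included."""
    return (
        "Using only the information listed below, reason step by step and "
        "give the final answer.\n\n"
        f"Context:\n{structured_context}"
    )
```

A caller would send the first prompt to the model, take its structured output, and pass that output (not the original question) into the second prompt.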

BiCQL-ML: A Bi-Level Conservative Q-Learning Framework for Maximum Likelihood Inverse Reinforcement Learning

BiCQL-ML：一种用于最大似然逆强化学习的双级保守Q-学习框架

Authors: Junsung Park
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.22210
Pdf link: https://arxiv.org/pdf/2511.22210
Abstract Offline inverse reinforcement learning (IRL) aims to recover a reward function that explains expert behavior using only fixed demonstration data, without any additional online interaction. We propose BiCQL-ML, a policy-free offline IRL algorithm that jointly optimizes a reward function and a conservative Q-function in a bi-level framework, thereby avoiding explicit policy learning. The method alternates between (i) learning a conservative Q-function via Conservative Q-Learning (CQL) under the current reward, and (ii) updating the reward parameters to maximize the expected Q-values of expert actions while suppressing over-generalization to out-of-distribution actions. This procedure can be viewed as maximum likelihood estimation under a soft value matching principle. We provide theoretical guarantees that BiCQL-ML converges to a reward function under which the expert policy is soft-optimal. Empirically, we show on standard offline RL benchmarks that BiCQL-ML improves both reward recovery and downstream policy performance compared to existing offline IRL baselines.
中文摘要 离线逆强化学习（IRL）旨在仅用固定的演示数据恢复一个奖励函数，解释专家行为，而无需额外的在线交互。我们提出了BiCQL-ML，这是一种无策略的离线IRL算法，在双层框架下共同优化奖励函数和保守Q函数，从而避免显式策略学习。该方法交替进行：（i）在当前奖励下通过保守Q-学习（CQL）学习保守Q函数，以及（ii）更新奖励参数以最大化专家行为的预期Q值，同时抑制对分布外动作的过度泛化。该过程可视为软值匹配原则下的最大似然估计。我们提供理论保证，证明BiCQL-ML收敛到一个奖励函数，使专家策略在该函数下软最优。通过实证，我们在标准离线强化学习基准中显示，BiCQL-ML相比现有的离线现实基准，在奖励回收和下游策略表现上均有提升。
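For readers unfamiliar with the conservative term used in step (i), the sketch below shows the commonly used CQL regularizer for a single state with a discrete action set: the log-sum-exp of all Q-values minus the Q-value of the action seen in the data. This is the generic CQL penalty and not necessarily the exact objective used in BiCQL-ML. The function name is hypothetical.

```python
import math


def cql_penalty(q_values, data_action):
    """Generic CQL regularizer for one state:
    logsumexp_a Q(s, a) - Q(s, a_data).
    It is small when the dataset action already has the highest Q-value
    and large when out-of-distribution actions are overvalued."""
    m = max(q_values)
    log_sum_exp = m + math.log(sum(math.exp(q - m) for q in q_values))
    return log_sum_exp - q_values[data_action]
```

In a full training loop this term would be added, with a weight, to the usual Bellman error computed under the current reward estimate.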

Optimizing NetGPT via Routing-Based Synergy and Reinforcement Learning

通过基于路由的协同与强化学习优化NetGPT

Authors: Yuxuan Chen, Rongpeng Li, Xianfu Chen, Celimuge Wu, Chenghui Peng, Zhifeng Zhao, Honggang Zhang
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2511.22217
Pdf link: https://arxiv.org/pdf/2511.22217
Abstract Large language model (LLM) agents at the network edge offer low-latency execution for routine queries. In contrast, complex requests often require the superior capability of cloud models, incurring higher latency and cost. To navigate this quality-cost trade-off under dynamic network conditions, we propose a cloud-edge synergy for NetGPT that integrates network-aware routing with on-edge self-improvement. Specifically, our framework routes structured tool-calling requests to cloud or edge agents via a novel scoring policy. We prove that, under mild regularity assumptions, the optimal routing rule admits a unique fallback threshold with monotone dependence on bandwidth and round-trip time (RTT). Concurrently, based on the dataset collected from requests routed to the cloud and corresponding responses, we instantiate a schema-preserving reinforcement learning (RL) to improve the capability of the edge agent. We analyze a supervised finetuning (SFT)-anchored composite objective that combines a reverse-KL trust-region step with a forward-KL realignment toward the SFT prior, explaining stability and constraining policy drift. Both the network-aware routing policy and the edge agent are updated coherently. Experiments across controlled network states and pricing schedules demonstrate smooth quality-cost frontiers, consistent gains of dynamic fallback thresholds over fixed policies, and sustained reductions in offloading while maintaining task success and schema-correct outputs.
中文摘要 网络边缘的大型语言模型（LLM）代理为常规查询提供低延迟执行。相比之下，复杂的请求通常需要云模型的更强能力，导致更高的延迟和成本。为了在动态网络条件下应对这种质量与成本的权衡，我们提出了一种将网络感知路由与边缘自我改进相结合的云端协同方案。具体来说，我们的框架通过一种新颖的评分策略，将结构化工具调用请求路由到云端或边缘代理。我们证明，在轻度正则性假设下，最优路由规则允许唯一的退回阈值，且对带宽和往返时间（RTT）单调依赖。同时，基于从云端请求收集的数据集及相应响应，我们实例化了保持模式的强化学习（RL），以提升边缘代理的能力。我们分析了一个监督微调（SFT）锚定复合目标，结合了逆KL信任区域步进与向SFT前向KL重对齐，解释了稳定性并限制策略漂移。网络感知的路由策略和边缘代理都保持一致更新。跨受控网络状态和定价计划的实验显示，质量与成本边界平滑，动态后备阈值相较于固定策略持续提升，同时持续减少卸载，同时保持任务成功和模式正确输出。

Embedded Universal Predictive Intelligence: a coherent framework for multi-agent learning

嵌入式通用预测智能：一个多智能体学习的连贯框架

Authors: Alexander Meulemans, Rajai Nasser, Maciej Wołczyk, Marissa A. Weis, Seijin Kobayashi, Blake Richards, Guillaume Lajoie, Angelika Steger, Marcus Hutter, James Manyika, Rif A. Saurous, João Sacramento, Blaise Agüera y Arcas
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.22226
Pdf link: https://arxiv.org/pdf/2511.22226
Abstract The standard theory of model-free reinforcement learning assumes that the environment dynamics are stationary and that agents are decoupled from their environment, such that policies are treated as being separate from the world they inhabit. This leads to theoretical challenges in the multi-agent setting where the non-stationarity induced by the learning of other agents demands prospective learning based on prediction models. To accurately model other agents, an agent must account for the fact that those other agents are, in turn, forming beliefs about it to predict its future behavior, motivating agents to model themselves as part of the environment. Here, building upon foundational work on universal artificial intelligence (AIXI), we introduce a mathematical framework for prospective learning and embedded agency centered on self-prediction, where Bayesian RL agents predict both future perceptual inputs and their own actions, and must therefore resolve epistemic uncertainty about themselves as part of the universe they inhabit. We show that in multi-agent settings, self-prediction enables agents to reason about others running similar algorithms, leading to new game-theoretic solution concepts and novel forms of cooperation unattainable by classical decoupled agents. Moreover, we extend the theory of AIXI, and study universally intelligent embedded agents which start from a Solomonoff prior. We show that these idealized agents can form consistent mutual predictions and achieve infinite-order theory of mind, potentially setting a gold standard for embedded multi-agent learning.
中文摘要 标准的无模型强化学习理论假设环境动态是平稳的，代理与其环境解耦，因此策略被视为与其所处世界分离。这在多智能体环境中带来了理论挑战，因为学习其他智能体所引起的非平稳性要求基于预测模型进行前瞻性学习。为了准确建模其他代理，代理必须考虑这些代理反过来形成关于它的信念以预测其未来行为，从而激励代理将自己建模为环境的一部分。在此基础上，基于通用人工智能（AIXI），我们引入了一个以自我预测为核心的前瞻性学习和嵌入式能动性的数学框架，贝叶斯强化学习代理既预测未来的感知输入，也能预测自身行为，因此必须解决关于自身作为宇宙一部分的认知不确定性。我们表明，在多智能体环境中，自我预测使智能体能够推理他人运行类似算法，从而产生新的博弈论解概念和经典解耦智能体无法实现的新型合作形式。此外，我们扩展了AIXI理论，研究了从所罗门诺夫先验出发的普遍智能嵌入代理。我们证明这些理想化的智能体能够形成一致的相互预测，实现无限次心智理论，有望为嵌入式多智能体学习树立金标准。

Training High-Level Schedulers with Execution-Feedback Reinforcement Learning for Long-Horizon GUI Automation

通过执行反馈强化学习培训高级调度员，实现长期图形界面自动化

Authors: Zehao Deng, Tianjie Ju, Zheng Wu, Zhuosheng Zhang, Gongshen Liu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.22235
Pdf link: https://arxiv.org/pdf/2511.22235
Abstract The rapid development of large vision-language models (VLMs) has greatly advanced research on GUI agents. However, GUI agents still face significant challenges in handling long-horizon tasks. First, single-agent models struggle to balance high-level capabilities with low-level execution capabilities, facing prevalent issues of responsibility coupling and capability conflicts. Second, agents lack awareness of the task state, leading to progress loss in long-horizon tasks. To address these challenges, we propose a staged execution-feedback reinforcement learning algorithm. Rather than training a unified policy model, we focus on training high-level scheduling models. Specifically, we propose and train two agents: a Coordinator, responsible for strategic planning and task decomposition; and a State Tracker, responsible for context compression and information management to maintain the task's state and coherence. Based on this, we build the Coordinator-Executor-State Tracker (CES) multi-agent framework, which can be integrated with any low-level Executor model, assisting the Executor in solving long-horizon tasks through task scheduling and state management. Experiments on long-horizon task benchmarks demonstrate that CES significantly enhances the system's planning and state management capabilities. Furthermore, analysis confirms that our trained high-level scheduling module is a generalizable, plug-and-play module that significantly enhances the long-horizon capabilities of various Executors. Code is available at this https URL.
中文摘要 大型视觉语言模型（VLM）的快速发展极大推动了GUI代理的研究。然而，图形界面代理在处理长期任务时仍面临重大挑战。首先，单智能体模型难以平衡高级能力与低级执行能力，面临责任耦合和能力冲突等普遍问题。其次，智能体缺乏对任务状态的意识，导致长期任务进度丢失。为应对这些挑战，我们提出了一种分阶段执行反馈强化学习算法。与训练统一政策模型不同，我们专注于训练高层调度模型。具体来说，我们提出并培训两名代理：协调员，负责战略规划和任务分解;以及一个状态跟踪器，负责上下文压缩和信息管理，以维护任务的状态和一致性。基于此，我们构建了协调者-执行者-状态追踪器（CES）多代理框架，可与任何低级执行者模型集成，协助执行者通过任务调度和状态管理解决长期任务。长期任务基准测试的实验表明，CES显著提升了系统的规划和状态管理能力。此外，分析确认我们训练好的高级调度模块是一个通用的即插即用模块，显著增强了各执行者的长期能力。代码可以通过这个 https URL 获取。

Beyond Query-Level Comparison: Fine-Grained Reinforcement Learning for Text-to-SQL with Automated Interpretable Critiques

超越查询级比较：文本转SQL的细粒度强化学习，并具备自动化可解释性批评

Authors: Guifeng Wang, Yuanfeng Song, Meng Yang, Tao Zhu, Xiaoming Yin, Xing Chen
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.22258
Pdf link: https://arxiv.org/pdf/2511.22258
Abstract Text-to-SQL, a pivotal natural language processing (NLP) task that converts textual queries into executable SQL, has seen substantial progress in recent years. However, existing evaluation and reward mechanisms used to train and assess the text-to-SQL models remain a critical bottleneck. Current approaches heavily rely on manually annotated gold SQL queries, which are costly to produce and impractical for large-scale evaluation. More importantly, most reinforcement learning (RL) methods in text-to-SQL leverage only the final binary execution outcome as the reward signal, a coarse-grained supervision that overlooks detailed structural and semantic errors from the perspective of rubrics. To address these challenges, we propose RuCo-C, a novel generative judge model for fine-grained, query-specific automatic evaluation using interpretable critiques without human intervention. Our framework first automatically generates query-specific evaluation rubrics for human-free annotation, linking them to interpretable critiques. Subsequently, it integrates densified reward feedback through a "progressive exploration" strategy during the RL training process, which dynamically adjusts the rewards to enhance the model's performance. Comprehensive experiments demonstrate that RuCo-C outperforms existing methods in text-to-SQL evaluation, yielding significant performance gains.
中文摘要 文本转SQL是一项关键的自然语言处理（NLP）任务，将文本查询转换为可执行SQL的，近年来取得了显著进展。然而，现有用于训练和评估文本转SQL模型的评估和奖励机制仍是一个关键瓶颈。当前方法高度依赖手动注释的金色SQL查询，这种查询成本高昂且不适合大规模评估。更重要的是，文本转SQL中的大多数强化学习（RL）方法仅利用最终的二进制执行结果作为奖励信号，这是一种粗粒度的监督，从评分标准的角度忽略了详细的结构性和语义错误。为应对这些挑战，我们提出了RuCo-C，一种新型生成式裁判模型，用于利用可解释的批评实现细粒度、查询特定自动评估，无需人工干预。我们的框架首先自动生成针对查询的评估评分标准，用于无人注释，并将其与可解释的批评链接起来。随后，它通过“渐进探索”策略在强化学习训练过程中整合密集的奖励反馈，动态调整奖励以提升模型表现。综合实验表明，RuCo-C在文本转SQL评估中优于现有方法，带来了显著的性能提升。

Improving Stochastic Action-Constrained Reinforcement Learning via Truncated Distributions

通过截断分布改进随机动作约束强化学习

Authors: Roland Stolz, Michael Eichelbeck, Matthias Althoff
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2511.22406
Pdf link: https://arxiv.org/pdf/2511.22406
Abstract In reinforcement learning (RL), it is often advantageous to consider additional constraints on the action space to ensure safety or action relevance. Existing work on such action-constrained RL faces challenges regarding effective policy updates, computational efficiency, and predictable runtime. Recent work proposes to use truncated normal distributions for stochastic policy gradient methods. However, the computation of key characteristics, such as the entropy, log-probability, and their gradients, becomes intractable under complex constraints. Hence, prior work approximates these using the non-truncated distributions, which severely degrades performance. We argue that accurate estimation of these characteristics is crucial in the action-constrained RL setting, and propose efficient numerical approximations for them. We also provide an efficient sampling strategy for truncated policy distributions and validate our approach on three benchmark environments, which demonstrate significant performance improvements when using accurate estimations.
中文摘要 在强化学习（RL）中，通常考虑对动作空间施加额外约束以确保安全性或动作相关性是有利的。现有关于此类受动作限制强化学习的研究在有效策略更新、计算效率和可预测运行时间方面面临挑战。近期研究提出在随机策略梯度方法中使用截断正态分布。然而，计算关键特征，如熵、对数概率及其梯度，在复杂约束下变得难以处理。因此，先前的工作是用非截断分布来近似这些数据，这严重降低了性能。我们认为，在动作约束强化学习（RL）环境中，准确估计这些特征至关重要，并提出了对它们的高效数值近似。我们还提供了截断策略分布的高效抽样策略，并在三个基准环境中验证了我们的方法，这些环境在使用准确估计时展示了显著的性能提升。
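To make the gap between truncated and non-truncated quantities concrete, the sketch below gives the closed-form log-density and entropy of a one-dimensional normal distribution truncated to an interval. This is only the simplest case. The paper targets complex constraint sets where such closed forms are unavailable, and its numerical approximations are not reproduced here. The function names are hypothetical.

```python
import math


def _std_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def _std_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def truncnorm_logpdf(x, mu, sigma, low, high):
    """Log-density of N(mu, sigma^2) truncated to [low, high]."""
    if not (low <= x <= high):
        return float("-inf")
    alpha, beta = (low - mu) / sigma, (high - mu) / sigma
    mass = _std_cdf(beta) - _std_cdf(alpha)  # probability kept after truncation
    u = (x - mu) / sigma
    return -0.5 * u * u - 0.5 * math.log(2.0 * math.pi) - math.log(sigma) - math.log(mass)


def truncnorm_entropy(mu, sigma, low, high):
    """Differential entropy of N(mu, sigma^2) truncated to [low, high]."""
    alpha, beta = (low - mu) / sigma, (high - mu) / sigma
    mass = _std_cdf(beta) - _std_cdf(alpha)
    correction = (alpha * _std_pdf(alpha) - beta * _std_pdf(beta)) / (2.0 * mass)
    return math.log(math.sqrt(2.0 * math.pi * math.e) * sigma * mass) + correction
```

For a standard normal truncated to [-1, 1], the log-density at 0 is noticeably higher and the entropy noticeably lower than the untruncated values, which is the kind of mismatch the abstract attributes to the non-truncated approximation.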

Exposing Vulnerabilities in RL: A Novel Stealthy Backdoor Attack through Reward Poisoning

揭示强化学习中的漏洞：一种通过奖励毒药进行的新型隐秘后门攻击

Authors: Bokang Zhang, Chaojun Lu, Jianhui Li, Junfeng Wu
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2511.22415
Pdf link: https://arxiv.org/pdf/2511.22415
Abstract Reinforcement learning (RL) has achieved remarkable success across diverse domains, enabling autonomous systems to learn and adapt to dynamic environments by optimizing a reward function. However, this reliance on reward signals creates a significant security vulnerability. In this paper, we study a stealthy backdoor attack that manipulates an agent's policy by poisoning its reward signals. The effectiveness of this attack highlights a critical threat to the integrity of deployed RL systems and calls for urgent defenses against training-time manipulation. We evaluate the attack across classic control and MuJoCo environments. The backdoored agent remains highly stealthy in Hopper and Walker2D, with minimal performance drops of only 2.18% and 4.59% under non-triggered scenarios, while achieving strong attack efficacy with up to 82.31% and 71.27% declines under trigger conditions.
中文摘要 强化学习（RL）在多个领域取得了显著成功，使自主系统能够通过优化奖励函数来学习和适应动态环境。然而，这种对奖励信号的依赖带来了重大的安全漏洞。本文研究了一种隐秘后门攻击，通过毒害代理的奖励信号来操控其策略。此次攻击的有效性凸显了对已部署强化学习系统完整性的严重威胁，呼吁尽快建立针对训练阶段操纵的防御。我们评估了该攻击在经典控制和MuJoCo环境中的应用。后门代理在Hopper和Walker2D中保持高度隐匿，在非触发场景下性能下降仅为2.18%和4.59%，而在触发条件下攻击效果显著，性能下降最高达82.31%和71.27%。

DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning

DeepSeekMath-V2：迈向自我验证的数学推理

Authors: Zhihong Shao, Yuxiang Luo, Chengda Lu, Z.Z. Ren, Jiewen Hu, Tian Ye, Zhibin Gou, Shirong Ma, Xiaokang Zhang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.22570
Pdf link: https://arxiv.org/pdf/2511.22570
Abstract Large language models have made significant progress in mathematical reasoning, which serves as an important testbed for AI and could impact scientific research if further advanced. By scaling reasoning with reinforcement learning that rewards correct final answers, LLMs have improved from poor performance to saturating quantitative reasoning competitions like AIME and HMMT in one year. However, this approach faces fundamental limitations. Pursuing higher final answer accuracy doesn't address a key issue: correct answers don't guarantee correct reasoning. Moreover, many mathematical tasks like theorem proving require rigorous step-by-step derivation rather than numerical answers, making final answer rewards inapplicable. To push the limits of deep reasoning, we believe it is necessary to verify the comprehensiveness and rigor of mathematical reasoning. Self-verification is particularly important for scaling test-time compute, especially for open problems without known solutions. Towards self-verifiable mathematical reasoning, we investigate how to train an accurate and faithful LLM-based verifier for theorem proving. We then train a proof generator using the verifier as the reward model, and incentivize the generator to identify and resolve as many issues as possible in their own proofs before finalizing them. To maintain the generation-verification gap as the generator becomes stronger, we propose to scale verification compute to automatically label new hard-to-verify proofs, creating training data to further improve the verifier. Our resulting model, DeepSeekMath-V2, demonstrates strong theorem-proving capabilities, achieving gold-level scores on IMO 2025 and CMO 2024 and a near-perfect 118/120 on Putnam 2024 with scaled test-time compute.
中文摘要 大型语言模型在数学推理方面取得了显著进展，数学推理成为人工智能的重要试验场，如果技术进一步发展，可能影响科学研究。通过通过奖励正确最终答案的强化学习来扩展推理，LLM在一年内从表现不佳提升到饱和的AIME和HMMT等定量推理竞赛。然而，这种方法面临根本性的局限性。追求更高的最终答案准确性并不能解决一个关键问题：正确答案并不保证推理正确。此外，许多数学任务如定理证明需要严格的逐步推导，而非数值答案，因此最终答案奖励不适用。为了突破深度推理的极限，我们认为有必要验证数学推理的全面性和严谨性。自我验证对于测试时间计算的扩展尤为重要，尤其是对于没有已知解的未解决问题。为了实现自我验证的数学推理，我们研究如何训练一个准确且忠实的基于LLM的定理验证器。然后我们用验证器作为奖励模型训练证明生成器，激励生成器在最终确定前尽可能多地识别和解决自己的证明问题。为了随着生成器增强，保持世代验证差距，我们提议扩大验证计算，自动标记新的难以验证证明，创建训练数据以进一步改进验证器。我们最终生成的模型DeepSeekMath-V2展现了强大的定理证明能力，在IMO 2025和CMO 2024中获得金级分数，在Putnam 2024通过缩放测试时间计算中几乎完美的118/120。

GeoZero: Incentivizing Reasoning from Scratch on Geospatial Scenes

GeoZero：从零开始激励地理空间场景推理

Authors: Di Wang, Shunyu Liu, Wentao Jiang, Fengxiang Wang, Yi Liu, Xiaolei Qin, Zhiming Luo, Chaoyang Zhou, Haonan Guo, Jing Zhang, Bo Du, Dacheng Tao, Liangpei Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.22645
Pdf link: https://arxiv.org/pdf/2511.22645
Abstract Multimodal large language models (MLLMs) have undergone rapid development in advancing geospatial scene understanding. Recent studies have sought to enhance the reasoning capabilities of remote sensing MLLMs, typically through cold-start training with elaborately curated chain-of-thought (CoT) data. However, this approach not only incurs substantial annotation costs but also introduces human biases that may limit the diversity of model reasoning. To address these challenges, we propose GeoZero, a framework that enables MLLMs to perform geospatial reasoning without any predefined CoT supervision. Specifically, we construct two datasets, GeoZero-Instruct and GeoZero-Hard. GeoZero-Instruct allows the model to acquire preliminary geospatial knowledge through supervised fine-tuning, while GeoZero-Hard stimulates deep reasoning during the subsequent reinforcement learning stage. Furthermore, we introduce Answer-Anchored Group Relative Policy Optimization (A$^2$GRPO), where the reasoning process is regularized by the model's own answers, encouraging diverse yet accurate thinking. Extensive experiments on multiple remote sensing vision-language benchmarks demonstrate that GeoZero not only surpasses existing state-of-the-art methods but also fosters universal emergent reasoning capabilities across diverse geospatial tasks. Code, data, and models will be publicly available at this https URL.
中文摘要 多模态大型语言模型（MLLM）在推进地理空间场景理解方面经历了快速发展。近期研究旨在提升遥感MLLM的推理能力，通常通过冷启动训练，配合精心策划的思维链（CoT）数据。然而，这种方法不仅带来了大量的注释成本，还引入了人为偏见，可能限制模型推理的多样性。为应对这些挑战，我们提出了GeoZero框架，使多层次语言模型能够在没有预设CoT监督的情况下进行地理空间推理。具体来说，我们构建了两个数据集，分别是GeoZero-Instruct和GeoZero-Hard。GeoZero-Instruct 允许模型通过监督微调获得初步地理空间知识，而 GeoZero-Hard 则在后续强化学习阶段激发深度推理。此外，我们引入了答案锚定群体相对策略优化（A$^2$GRPO），其中推理过程由模型自身的答案规范化，鼓励多样化但准确的思考。对多个遥感视觉语言基准测试的广泛实验表明，GeoZero不仅超越了现有的最先进方法，还促进了跨越多种地理空间任务的通用涌现推理能力。代码、数据和模型将在此 https URL 公开。

Deadlock-Free Hybrid RL-MAPF Framework for Zero-Shot Multi-Robot Navigation

零点多机器人导航的无死锁混合RL-MAPF框架

Authors: Haoyi Wang (2), Licheng Luo (1), Yiannis Kantaros (2), Bruno Sinopoli (2), Mingyu Cai (1) ((1) Department of Mechanical Engineering, University of California Riverside, CA, USA, (2) Department of Electrical and Systems Engineering, Washington University in St. Louis, MO, USA)
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.22685
Pdf link: https://arxiv.org/pdf/2511.22685
Abstract Multi-robot navigation in cluttered environments presents fundamental challenges in balancing reactive collision avoidance with long-range goal achievement. When navigating through narrow passages or confined spaces, deadlocks frequently emerge that prevent agents from reaching their destinations, particularly when Reinforcement Learning (RL) control policies encounter novel configurations outside the training distribution. Existing RL-based approaches suffer from limited generalization capability in unseen environments. We propose a hybrid framework that seamlessly integrates RL-based reactive navigation with on-demand Multi-Agent Path Finding (MAPF) to explicitly resolve topological deadlocks. Our approach integrates a safety layer that monitors agent progress to detect deadlocks and, when one is detected, triggers a coordination controller for the affected agents. The framework constructs globally feasible trajectories via MAPF and regulates waypoint progression to reduce inter-agent conflicts during navigation. Extensive evaluation on dense multi-agent benchmarks shows that our method boosts task completion from marginal to near-universal success, markedly reducing deadlocks and collisions. When integrated with hierarchical task planning, it enables coordinated navigation for heterogeneous robots, demonstrating that coupling reactive RL navigation with selective MAPF intervention yields robust zero-shot performance.
中文摘要 在杂乱环境中的多机器人导航在平衡反应性碰撞避免与实现远程目标之间带来了根本性的挑战。在穿越狭窄通道或受限空间时，常常会出现僵局，阻碍智能体到达目的地，尤其是在强化学习（RL）控制策略遇到学习分布中新颖配置时。现有基于强化学习的方法在未见环境中泛化能力有限。我们提出了一个混合框架，能够无缝整合基于强化学习的反应式导航与按需多代理路径寻觅（MAPF），以显式解决拓扑死锁。我们的方法集成了一个安全层，监控代理进度以检测死锁，并在检测到时触发受影响代理的协调控制器。该框架通过MAPF构建全球可行的轨迹，并调节航点的推进，以减少导航过程中的代理间冲突。对密集多智能体基准的广泛评估表明，我们的方法将任务完成率从边缘成功提升到几乎普遍成功，显著减少了僵局和碰撞。当与分层任务规划结合时，它能够为异构机器人实现协调导航，证明将反应式强化学习导航与选择性MAPF干预相结合，能够实现稳健的零发射性能。
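The abstract describes a safety layer that monitors agent progress to detect deadlocks, but does not state the detection rule. A minimal sketch under one assumed rule is given below: an agent is flagged when it is not at its goal and its net displacement over a sliding window of recent positions falls below a threshold. The class name, window length, and thresholds are all hypothetical.

```python
import math
from collections import deque


class ProgressMonitor:
    """Assumed deadlock heuristic: too little net movement over the
    last `window` positions while the goal has not been reached."""

    def __init__(self, window=20, min_progress=0.05, goal_tolerance=0.1):
        self.window = window
        self.min_progress = min_progress
        self.goal_tolerance = goal_tolerance
        self.history = deque(maxlen=window)

    def update(self, position, goal):
        """Record the latest position; return True if a deadlock is suspected."""
        self.history.append(tuple(position))
        if math.dist(position, goal) <= self.goal_tolerance:
            return False  # resting at the goal is not a deadlock
        if len(self.history) < self.window:
            return False  # not enough evidence yet
        return math.dist(self.history[0], self.history[-1]) < self.min_progress
```

In a hybrid setup like the one described, a `True` result would be the signal to hand the affected agents over to the MAPF-based coordination controller.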

ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering

ReAG：基于知识的视觉问答推理增强生成

Authors: Alberto Compagnoni, Marco Morini, Sara Sarto, Federico Cocchi, Davide Caffagni, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Arxiv link: https://arxiv.org/abs/2511.22715
Pdf link: https://arxiv.org/pdf/2511.22715
Abstract Multimodal Large Language Models (MLLMs) have shown impressive capabilities in jointly understanding text, images, and videos, often evaluated via Visual Question Answering (VQA). However, even state-of-the-art MLLMs struggle with domain-specific or knowledge-intensive queries, where relevant information is underrepresented in pre-training data. Knowledge-based VQA (KB-VQA) addresses this by retrieving external documents to condition answer generation, but current retrieval-augmented approaches suffer from low precision, noisy passages, and limited reasoning. To address this, we propose ReAG, a novel Reasoning-Augmented Multimodal RAG approach that combines coarse- and fine-grained retrieval with a critic model that filters irrelevant passages, ensuring high-quality additional context. The model follows a multi-stage training strategy leveraging reinforcement learning to enhance reasoning over retrieved content, while supervised fine-tuning serves only as a cold start. Extensive experiments on Encyclopedic-VQA and InfoSeek demonstrate that ReAG significantly outperforms prior methods, improving answer accuracy and providing interpretable reasoning grounded in retrieved evidence. Our source code is publicly available at: this https URL.
中文摘要 多模态大型语言模型（MLLM）在共同理解文本、图像和视频方面展现了令人印象深刻的能力，通常通过视觉问答（VQA）进行评估。然而，即使是最先进的MLLM也难以应对领域特定或知识密集型查询，因为相关信息在预训练数据中代表性不足。基于知识的 VQA（KB-VQA）通过检索外部文档来满足答案生成条件，但当前检索增强方法存在低精度、噪声大且推理有限的问题。为此，我们提出了ReAG，这是一种新颖的推理增强多模态RAG方法，结合了粗粒度和细粒度检索，以及过滤无关段落的批评模型，确保高质量的额外上下文。该模型采用多阶段训练策略，利用强化学习提升推理能力，而监督微调仅作为冷启动。在Encyclopedic-VQA和InfoSeek上的大量实验表明，ReAG显著优于以往方法，提高了答案准确性，并提供了基于已检索证据的可解释推理。我们的源代码公开于：https URL。

ORION: Teaching Language Models to Reason Efficiently in the Language of Thought

ORION：教授语言模型高效推理思维语言

Authors: Kumar Tanmay, Kriti Aggarwal, Paul Pu Liang, Subhabrata Mukherjee
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.22891
Pdf link: https://arxiv.org/pdf/2511.22891
Abstract Large Reasoning Models (LRMs) achieve strong performance in mathematics, code generation, and task planning, but their reliance on long chains of verbose "thinking" tokens leads to high latency, redundancy, and incoherent reasoning paths. Inspired by the Language of Thought Hypothesis, which posits that human reasoning operates over a symbolic, compositional mental language called Mentalese, we introduce a framework that trains models to reason in a similarly compact style. Mentalese encodes abstract reasoning as ultra-compressed, structured tokens, enabling models to solve complex problems with far fewer steps. To improve both efficiency and accuracy, we propose SHORTER LENGTH PREFERENCE OPTIMIZATION (SLPO), a reinforcement learning method that rewards concise solutions that stay correct, while still allowing longer reasoning when needed. Applied to Mentalese-aligned models, SLPO yields significantly higher compression rates by enabling concise reasoning that preserves the benefits of detailed thinking without the computational overhead. Across benchmarks including AIME 2024 and 2025, MinervaMath, OlympiadBench, Math500, and AMC, our ORION models produce reasoning traces with 4-16x fewer tokens, achieve up to 5x lower inference latency, and reduce training costs by 7-9x relative to the DeepSeek R1 Distilled model, while maintaining 90-98% of its accuracy. ORION also surpasses Claude and ChatGPT-4o by up to 5% in accuracy while maintaining 2x compression. These results show that Mentalese-style compressed reasoning offers a step toward human-like cognitive efficiency, enabling real-time, cost-effective reasoning without sacrificing accuracy.
中文摘要 大型推理模型（LRM）在数学、代码生成和任务规划方面表现出色，但它们依赖冗长的“思考”代币链，导致高延迟、冗余和推理路径不连贯。受“思维语言假说”启发，该假说认为人类推理作用于一种象征性的、组合性的心理语言——Mentalese，我们提出了一个以类似紧凑风格训练模型为理性的框架。Mentalese 将抽象推理编码为超压缩的结构化代币，使模型能够以更少的步骤解决复杂问题。为了提高效率和准确性，我们提出了短长度偏好优化（SLPO）的强化学习方法，奖励保持准确的简明解法，同时在需要时允许更长时间的推理。应用于Mentalese对齐模型时，SLPO通过实现简洁推理，保持细致思考的优势且不增加计算开销，从而显著提升压缩率。在包括AIME 2024和2025、MinervaMath、OlympiadBench、Math500和AMC等基准测试中，我们的ORION模型能以4-16倍少的标记生成推理轨迹，推理延迟降低最多5倍，训练成本比DeepSeek R1 Distilled模型降低7-9倍，同时保持90-98%的准确率。ORION在保持2倍压缩率的同时，准确率也比Claude和ChatGPT-4o高出多达5%。这些结果表明，Mentalese式压缩推理为迈向类人认知效率迈出了一步，实现实时且经济高效的推理，同时不牺牲准确性。

Switching-time bioprocess control with pulse-width-modulated optogenetics

采用脉宽调制光遗传学的切换时间生物工艺控制

Authors: Sebastián Espinel-Ríos
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.22893
Pdf link: https://arxiv.org/pdf/2511.22893
Abstract Biotechnology can benefit from dynamic control to improve production efficiency. In this context, optogenetics enables modulation of gene expression using light as an external input, allowing fine-tuning of protein levels to unlock dynamic metabolic control and regulation of cell growth. Optogenetic systems can be actuated by light intensity. However, relying solely on intensity-driven control (i.e., signal amplitude) may fail to properly tune optogenetic bioprocesses when the dose-response relationship (i.e., light intensity versus gene-expression strength) is steep. In these cases, tunability is effectively constrained to either fully active or fully repressed gene expression, with little intermediate regulation. Pulse-width modulation, a concept widely used in electronics, can alleviate this issue by alternating between fully ON and OFF light intensity within forcing periods, thereby smoothing the average response and enhancing process controllability. Naturally, optimizing pulse-width-modulated optogenetics entails a switching-time optimal control problem with a binary input over many forcing periods. While this can be formulated as a mixed-integer program on a refined time grid, the number of decision variables can grow rapidly with increasing time-grid resolution and number of forcing periods, compromising tractability. Here, we propose an alternative solution based on reinforcement learning. We parametrize control actions via the duty cycle, a continuous variable that encodes the ON-to-OFF switching time within each forcing period, thereby respecting the intrinsic binary nature of the light intensity.
中文摘要 生物技术可以通过动态控制提升生产效率。在此背景下，光遗传学使得利用光线作为外部输入调节基因表达成为可能，从而实现蛋白质水平的微调，从而实现细胞生长的动态代谢控制和调控。光遗传系统可以由光强驱动。然而，仅依赖强度驱动的控制（即信号幅度）在剂量-反应关系（即光强与基因表达强度）较大时，可能无法正确调节光遗传生物过程。在这些情况下，调谐性实际上被限制在完全活跃或完全抑制的基因表达之间，中间调节很少。脉宽调制是电子学中广泛使用的概念，可以通过在强制周期内交替完全导通和关闭光强来缓解这一问题，从而平滑平均响应并增强工艺可控性。自然地，优化脉宽调制光遗传学涉及一个切换时间最优控制问题，涉及多个强制周期的二进制输入。虽然这可以作为精炼时间网格上的混合整数规划来表述，但随着时间网格分辨率和强制周期数量的增加，决策变量数量会迅速增加，从而影响可处理性。在这里，我们提出了基于强化学习的替代解决方案。我们通过占空比参数化控制动作，该变量是一个连续变量，编码每个强制周期内的开到关切换时间，从而尊重光强的内在二元性质。
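The duty-cycle parametrization described in the abstract can be illustrated with a small sketch. It is not the author's implementation: the function names, the fixed forcing period, and the sampling step `dt` are assumptions made here for illustration. Each duty cycle is a continuous value in [0, 1] that fixes the ON-to-OFF switching time inside one forcing period, so the resulting light signal stays binary.

```python
def pwm_light_signal(duty_cycles, period, dt):
    """Expand per-period duty cycles into a binary ON/OFF light sequence.

    duty_cycles: fractions in [0, 1], one per forcing period (hypothetical input).
    period: length of one forcing period, in the same time unit as dt.
    dt: sampling step used to discretize the signal.
    """
    steps_per_period = round(period / dt)
    signal = []
    for duty in duty_cycles:
        if not 0.0 <= duty <= 1.0:
            raise ValueError("duty cycle must lie in [0, 1]")
        # Index at which the light switches from ON to OFF within this period.
        switch_step = round(duty * steps_per_period)
        signal.extend([1] * switch_step + [0] * (steps_per_period - switch_step))
    return signal


def mean_intensity(signal):
    """Average light input seen by the culture over the whole horizon."""
    return sum(signal) / len(signal)


if __name__ == "__main__":
    light = pwm_light_signal([0.25, 0.5], period=4.0, dt=1.0)
    print(light, mean_intensity(light))
```

A learning agent would then choose one continuous duty cycle per forcing period instead of one binary decision per grid point, which keeps the number of decision variables equal to the number of periods.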

Language-conditioned world model improves policy generalization by reading environmental descriptions

语言条件世界模型通过阅读环境描述提升策略泛化能力

Authors: Anh Nguyen, Stefan Lee
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.22904
Pdf link: https://arxiv.org/pdf/2511.22904
Abstract To interact effectively with humans in the real world, it is important for agents to understand language that describes the dynamics of the environment--that is, how the environment behaves--rather than just task instructions specifying "what to do". Understanding this dynamics-descriptive language is important for human-agent interaction and agent behavior. Recent work addresses this problem using a model-based approach: language is incorporated into a world model, which is then used to learn a behavior policy. However, these existing methods either do not demonstrate policy generalization to unseen games or rely on limiting assumptions, for instance, that the latency induced by inference-time planning is tolerable for the target task or that expert demonstrations are available. Expanding on this line of research, we focus on improving policy generalization from a language-conditioned world model while dropping these assumptions. We propose a model-based reinforcement learning approach, where a language-conditioned world model is trained through interaction with the environment, and a policy is learned from this model--without planning or expert demonstrations. Our method proposes Language-aware Encoder for Dreamer World Model (LED-WM) built on top of DreamerV3. LED-WM features an observation encoder that uses an attention mechanism to explicitly ground language descriptions to entities in the observation. We show that policies trained with LED-WM generalize more effectively to unseen games described by novel dynamics and language compared to other baselines in several settings in two environments: MESSENGER and this http URL. To highlight how the policy can leverage the trained world model before real-world deployment, we demonstrate the policy can be improved through fine-tuning on synthetic test trajectories generated by the world model.
中文摘要 为了在现实世界中与人类有效互动，智能体需要理解描述环境动态（即环境如何运作）的语言，而不仅仅是说明“该做什么”的任务指令。理解这种描述动态的语言对于人与智能体的交互以及智能体的行为都很重要。近期工作采用基于模型的方法来解决这一问题：将语言纳入世界模型，再用该模型学习行为策略。然而，现有方法要么没有展示策略对未见游戏的泛化，要么依赖限制性假设，例如假设推理时规划带来的延迟对目标任务而言可以接受，或者假设有专家演示可用。沿着这一研究方向，我们着重在摒弃这些假设的同时，提升基于语言条件世界模型的策略泛化能力。我们提出一种基于模型的强化学习方法：通过与环境交互训练语言条件世界模型，并从该模型中学习策略，无需规划或专家演示。我们的方法提出了构建在DreamerV3之上的Dreamer世界模型语言感知编码器（LED-WM）。LED-WM的观测编码器利用注意力机制，将语言描述显式地对应到观测中的实体。我们表明，在MESSENGER和 this http URL 两个环境的多种设置下，用LED-WM训练的策略比其他基线更有效地泛化到由新的动态和语言描述的未见游戏。为了突出策略如何在实际部署前利用训练好的世界模型，我们展示了可以通过在世界模型生成的合成测试轨迹上微调来改进策略。

Commanding Humanoid by Free-form Language: A Large Language Action Model with Unified Motion Vocabulary

通过自由形式语言指挥人形机器人：一个具有统一运动词汇的大型语言动作模型

Authors: Zhirui Liu, Kaiyang Ji, Ke Yang, Jingyi Yu, Ye Shi, Jingya Wang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.22963
Pdf link: https://arxiv.org/pdf/2511.22963
Abstract Enabling humanoid robots to follow free-form language commands is critical for seamless human-robot interaction, collaborative task execution, and general-purpose embodied intelligence. While recent advances have improved low-level humanoid locomotion and robot manipulation, language-conditioned whole-body control remains a significant challenge. Existing methods are often limited to simple instructions and sacrifice either motion diversity or physical plausibility. To address this, we introduce Humanoid-LLA, a Large Language Action Model that maps expressive language commands to physically executable whole-body actions for humanoid robots. Our approach integrates three core components: a unified motion vocabulary that aligns human and humanoid motion primitives into a shared discrete space; a vocabulary-directed controller distilled from a privileged policy to ensure physical feasibility; and a physics-informed fine-tuning stage using reinforcement learning with dynamics-aware rewards to enhance robustness and stability. Extensive evaluations in simulation and on a real-world Unitree G1 humanoid show that Humanoid-LLA delivers strong language generalization while maintaining high physical fidelity, outperforming existing language-conditioned controllers in motion naturalness, stability, and execution success rate.
中文摘要 使类人机器人能够执行自由形式的语言指令，对于无缝的人机交互、协作任务执行以及通用具身智能至关重要。尽管近期进展改善了低水平类人生物的运动和机器人控能力，但语言条件化的全身控制仍是一个重大挑战。现有方法通常仅限于简单的指令，牺牲了运动多样性或物理可信度。为此，我们引入了Humanoid-LLA，一种大型语言动作模型，将表达性语言命令映射到人形机器人的物理可执行全身动作。我们的方法整合了三个核心组成部分：统一的运动词汇，将人类和类人生物的运动原语对齐到一个共享的离散空间中;一个词汇导向控制器，源自特权政策，以确保物理可行性;以及基于物理的强化学习微调阶段，并配备动态感知奖励，以增强鲁棒性和稳定性。在模拟和现实世界Unitree G1类人生物上的广泛评估表明，Humanoid-LLA在保持高物理保真度的同时，能够实现强有力的语言泛化，在运动自然性、稳定性和执行成功率方面优于现有的语言条件控制器。

McSc: Motion-Corrective Preference Alignment for Video Generation with Self-Critic Hierarchical Reasoning

McSc：面向视频生成的运动校正偏好对齐与自我批评层级推理

Authors: Qiushi Yang, Yingjie Chen, Yuan Yao, Yifang Men, Huaizhuo Liu, Miaomiao Cui
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.22974
Pdf link: https://arxiv.org/pdf/2511.22974
Abstract Text-to-video (T2V) generation has achieved remarkable progress in producing high-quality videos aligned with textual prompts. However, aligning synthesized videos with nuanced human preference remains challenging due to the subjective and multifaceted nature of human judgment. Existing video preference alignment methods rely on costly human annotations or use proxy metrics to predict preference, and thus lack an understanding of human preference logic. Moreover, they usually align T2V models directly with the overall preference distribution, ignoring potentially conflicting dimensions such as motion dynamics and visual quality, which may bias models towards low-motion content. To address these issues, we present Motion-corrective alignment with Self-critic hierarchical Reasoning (McSc), a three-stage reinforcement learning framework for robust preference modeling and alignment. First, Self-critic Dimensional Reasoning (ScDR) trains a generative reward model (RM) to decompose preferences into per-dimension assessments, using self-critic reasoning chains for reliable learning. Second, to achieve holistic video comparison, we introduce Hierarchical Comparative Reasoning (HCR) for structural multi-dimensional reasoning with hierarchical reward supervision. Finally, using RM-preferred videos, we propose Motion-corrective Direct Preference Optimization (McDPO) to optimize T2V models, while dynamically re-weighting the alignment objective to mitigate bias towards low-motion content. Experiments show that McSc achieves superior performance in human preference alignment and generates videos with high motion dynamics.
中文摘要 文本转视频（T2V）生成在生成高质量视频并配合文本提示方面取得了显著进展。然而，由于人类判断的主观性和多面性，将合成视频与细微的人类偏好相匹配仍然具有挑战性。现有的视频偏好对齐方法依赖昂贵的人工注释或使用代理指标来预测偏好，这缺乏对人类偏好逻辑的理解。此外，它们通常会直接将T2V模型与整体偏好分布对齐，忽略潜在冲突维度，如运动动力学和视觉质量，这些因素可能使模型偏向低运动内容。为解决这些问题，我们介绍了带有自我批判层级推理（MCSC）的动作纠正对齐，这是一种三阶段强化学习框架，用于稳健的偏好建模和对齐。首先，自我批判维度推理（ScDR）训练生成奖励模型（RM），将偏好分解为每维度评估，利用自我批判推理链实现可靠学习。其次，为了实现整体视频比较，我们引入了层级比较推理（HCR），用于结构性多维推理并结合层级奖励监督。最后，利用RM偏好视频，我们提出了运动纠正直接偏好优化（McDPO）以优化T2V模型，同时动态重权对齐目标，以减少对低动态内容的偏见。实验显示，MCSC 在人类偏好比对方面表现更优，并生成高动态视频。

Evolutionary Discovery of Heuristic Policies for Traffic Signal Control

交通信号控制启发式策略的进化发现

Authors: Ruibing Wang, Shuhan Guo, Zeen Li, Zhen Wang, Quanming Yao
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.23122
Pdf link: https://arxiv.org/pdf/2511.23122
Abstract Traffic Signal Control (TSC) involves a challenging trade-off: classic heuristics are efficient but oversimplified, while Deep Reinforcement Learning (DRL) achieves high performance yet suffers from poor generalization and opaque policies. Online Large Language Models (LLMs) provide general reasoning but incur high latency and lack environment-specific optimization. To address these issues, we propose Temporal Policy Evolution for Traffic, which uses LLMs as an evolution engine to derive specialized heuristic policies. The framework introduces two key modules: (1) Structured State Abstraction (SSA), converting high-dimensional traffic data into temporal-logical facts for reasoning; and (2) Credit Assignment Feedback (CAF), tracing flawed micro-decisions to poor macro-outcomes for targeted critique. Operating entirely at the prompt level without training, the framework yields lightweight, robust policies optimized for specific traffic environments, outperforming both heuristics and online LLM actors.
中文摘要 交通信号控制（TSC）面临一个具有挑战性的权衡：经典启发式方法高效但过于简化，而深度强化学习（DRL）虽然性能高，但泛化性差且策略不透明。在线大型语言模型（LLM）提供通用推理，但延迟较高且缺乏针对环境的优化。为解决这些问题，我们提出了面向交通的时序策略演化（Temporal Policy Evolution for Traffic），利用LLM作为演化引擎推导专用启发式策略。该框架引入了两个关键模块：（1）结构化状态抽象（SSA），将高维交通数据转换为用于推理的时序逻辑事实；以及（2）信用分配反馈（CAF），将不佳的宏观结果追溯到有缺陷的微观决策，以便进行有针对性的批评。该框架完全在提示层级运行且无需训练，生成针对特定交通环境优化的轻量级、稳健策略，优于启发式方法和在线LLM执行者。

Peer-to-Peer Energy Trading in Dairy Farms using Multi-Agent Reinforcement Learning

利用多智能体强化学习实现奶牛场点对点能源交易

Authors: Mian Ibad Ali Shah, Marcos Eduardo Cruz Victorio, Maeve Duffy, Enda Barrett, Karl Mason
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.23148
Pdf link: https://arxiv.org/pdf/2511.23148
Abstract The integration of renewable energy resources in rural areas, such as dairy farming communities, enables decentralized energy management through Peer-to-Peer (P2P) energy trading. This research highlights the role of P2P trading in efficient energy distribution and its synergy with advanced optimization techniques. While traditional rule-based methods perform well under stable conditions, they struggle in dynamic environments. To address this, Multi-Agent Reinforcement Learning (MARL), specifically Proximal Policy Optimization (PPO) and Deep Q-Networks (DQN), is combined with community/distributed P2P trading mechanisms. By incorporating auction-based market clearing, a price advisor agent, and load and battery management, the approach achieves significant improvements. Results show that, compared to baseline models, DQN reduces electricity costs by 14.2% in Ireland and 5.16% in Finland, while increasing electricity revenue by 7.24% and 12.73%, respectively. PPO achieves the lowest peak hour demand, reducing it by 55.5% in Ireland, while DQN reduces peak hour demand by 50.0% in Ireland and 27.02% in Finland. These improvements are attributed to both the MARL algorithms and P2P energy trading, which together reduce electricity cost and peak hour demand and increase electricity selling revenue. This study highlights the complementary strengths of DQN, PPO, and P2P trading in achieving efficient, adaptable, and sustainable energy management in rural communities.
中文摘要 农村地区如奶牛社区中可再生能源资源的整合，使得通过点对点（P2P）能源交易实现分散式能源管理。本研究强调了P2P交易在高效能源分销中的作用及其与先进优化技术的协同效应。虽然传统的基于规则的方法在稳定条件下表现良好，但在动态环境中表现不佳。为此，多智能体强化学习（MARL），特别是近端策略优化（PPO）和深度Q网络（DQN），与社区/分布式P2P交易机制结合。通过结合基于拍卖的市场清算、价格顾问代理以及负载和电池管理，该方法实现了显著改进。结果显示，与基线模型相比，DQN在爱尔兰降低了14.2%的电费，芬兰降低了5.16%，同时分别增加了7.24%和12.73%的电力收入。PPO在爱尔兰的高峰时段需求最低，减少了55.5%，而DQN在爱尔兰的高峰时段需求减少了50.0%，在芬兰减少了27.02%。这些改进归功于MARL算法和点对点能源交易，两者共同降低了电力成本和高峰时段的需求，并增加了电力销售收入。本研究强调了DQN、PPO和P2P交易在实现农村社区高效、适应性和可持续能源管理方面的互补优势。

REVEAL: Reasoning-enhanced Forensic Evidence Analysis for Explainable AI-generated Image Detection

REVEAL：用于可解释AI生成图像检测的推理增强取证证据分析

Authors: Huangsen Cao, Qin Mei, Zhiheng Li, Yuxi Li, Ying Zhang, Chen Li, Zhimeng Zhang, Xin Ding, Yongwei Wang, Jing Lyu, Fei Wu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.23158
Pdf link: https://arxiv.org/pdf/2511.23158
Abstract With the rapid advancement of generative models, visually realistic AI-generated images have become increasingly difficult to distinguish from authentic ones, posing severe threats to social trust and information integrity. Consequently, there is an urgent need for efficient and truly explainable image forensic methods. Recent detection paradigms have shifted towards explainable forensics. However, state-of-the-art approaches primarily rely on post-hoc rationalizations or visual discrimination, lacking a verifiable chain of evidence. This reliance on surface-level pattern matching limits the generation of causally grounded explanations and often results in poor generalization. To bridge this critical gap, we introduce REVEAL-Bench, the first reasoning-enhanced multimodal benchmark for AI-generated image detection that is explicitly structured around a chain-of-evidence derived from multiple lightweight expert models and that records step-by-step reasoning traces and evidential justifications. Building upon this dataset, we propose REVEAL (Reasoning-enhanced Forensic Evidence Analysis), an effective and explainable forensic framework that integrates detection with a novel expert-grounded reinforcement learning. Our reward mechanism is specially tailored to jointly optimize detection accuracy, explanation fidelity, and logical coherence grounded in explicit forensic evidence, enabling REVEAL to produce fine-grained, interpretable, and verifiable reasoning chains alongside its detection outcomes. Extensive experimental results demonstrate that REVEAL significantly enhances detection accuracy, explanation fidelity, and robust cross-model generalization, setting a new state of the art for explainable image forensics.
中文摘要 随着生成模型的快速发展，视觉逼真的AI生成图像越来越难以与真实图像区分，这对社会信任和信息完整性构成严重威胁。因此，迫切需要高效且真正可解释的图像取证方法。近期的检测范式已转向可解释取证。然而，最先进的方法主要依赖事后合理化或视觉判别，缺乏可验证的证据链。这种对表层模式匹配的依赖限制了基于因果的解释的生成，并常常导致泛化不佳。为弥合这一关键空白，我们引入了REVEAL-Bench，这是首个面向AI生成图像检测的推理增强多模态基准，它明确围绕由多个轻量级专家模型导出的证据链构建，并记录逐步推理轨迹和证据依据。基于该数据集，我们提出了REVEAL（Reasoning-enhanced Forensic Evidence Analysis），这是一个有效且可解释的取证框架，将检测与一种新的基于专家依据的强化学习相结合。我们的奖励机制经过专门设计，以显式取证证据为基础，联合优化检测准确率、解释保真度和逻辑连贯性，使REVEAL能够在给出检测结果的同时，生成细粒度、可解释且可验证的推理链。大量实验结果表明，REVEAL显著提升了检测准确率、解释保真度和稳健的跨模型泛化能力，在可解释图像取证领域达到了新的最先进水平。

Fault-Tolerant MARL for CAVs under Observation Perturbations for Highway On-Ramp Merging

观测扰动下面向高速公路匝道合流的网联自动驾驶车辆（CAV）容错多智能体强化学习（MARL）

Authors: Yuchen Shi, Huaxin Pei, Yi Zhang, Danya Yao
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2511.23193
Pdf link: https://arxiv.org/pdf/2511.23193
Abstract Multi-Agent Reinforcement Learning (MARL) holds significant promise for enabling cooperative driving among Connected and Automated Vehicles (CAVs). However, its practical application is hindered by a critical limitation, i.e., insufficient fault tolerance against observational faults. Such faults, which appear as perturbations in the vehicles' perceived data, can substantially compromise the performance of MARL-based driving systems. Addressing this problem presents two primary challenges. One is to generate adversarial perturbations that effectively stress the policy during training, and the other is to equip vehicles with the capability to mitigate the impact of corrupted observations. To overcome the challenges, we propose a fault-tolerant MARL method for cooperative on-ramp vehicles incorporating two key agents. First, an adversarial fault injection agent is co-trained to generate perturbations that actively challenge and harden the vehicle policies. Second, we design a novel fault-tolerant vehicle agent equipped with a self-diagnosis capability, which leverages the inherent spatio-temporal correlations in vehicle state sequences to detect faults and reconstruct credible observations, thereby shielding the policy from misleading inputs. Experiments in a simulated highway merging scenario demonstrate that our method significantly outperforms baseline MARL approaches, achieving near-fault-free levels of safety and efficiency under various observation fault patterns.
中文摘要 多智能体强化学习（MARL）在实现互联与自动驾驶（CAV）之间的合作驾驶方面具有重要潜力。然而，其实际应用受到一个关键限制，即对观测性错误容错能力不足。这些故障表现为车辆感知数据中的扰动，会显著影响基于MARL的驾驶系统的性能。解决这一问题面临两个主要挑战。一是产生对抗扰动，有效压缩培训期间的政策，二是赋予车辆以减轻观测数据损坏影响的能力。为克服这些挑战，我们提出了一种容错的MARL方法，结合了两种关键机制，用于协作的匝道车辆。首先，对抗性故障注入代理被协同训练，以产生主动挑战并加固车辆策略的扰动。其次，我们设计了一种具备自我诊断功能的新型容错车辆代理，利用车辆状态序列中固有的时空相关性来检测故障并重建可信的观察，从而保护政策免受误导输入的影响。在模拟高速公路合流场景中的实验表明，我们的方法显著优于基线MARL方法，在各种观测故障模式下实现近乎无故障的安全性和效率水平。

Adapting Like Humans: A Metacognitive Agent with Test-time Reasoning

像人类一样适应：具备测试时推理的元认知智能体

Authors: Yang Li, Zhiyuan He, Yuxuan Huang, Zhuhanling Xiao, Chao Yu, Meng Fang, Kun Shao, Jun Wang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.23262
Pdf link: https://arxiv.org/pdf/2511.23262
Abstract Recent Vision-Language Models (VLMs) exhibit strong perceptual reasoning abilities, yet they often struggle to adapt efficiently when encountering novel tasks at test time. In contrast, humans leverage the metacognitive model with memory, enabling continuous strategy refinement through metacognitive control when faced with new challenges. To bridge this gap, we propose metacognitive test-time reasoning (MCTR), a framework that equips models with the ability to learn, adapt, and improve during test time through metacognitive self-updating. Inspired by the dual structure of human metacognition, MCTR comprises meta-level and object-level VLM reasoning modules, each equipped with dedicated memory systems for hierarchical adaptive reasoning. Specifically, MCTR consists of (1) a meta-reasoning module which incrementally builds a structured memory by discovering and storing task-relevant rules, environmental patterns, and action-outcome relationships from test-time observations as natural language descriptions; and (2) an action-reasoning module that determines optimal actions through context-aware perception and strategic reasoning by dynamically retrieving and integrating knowledge from memory. The action-reasoning module continuously updates its policy through proposed metacognitive test-time reinforcement learning, adapting as knowledge memory evolves. We evaluate MCTR on 45 Atari games (33 seen, 12 unseen). MCTR demonstrates robust test-time adaptation, achieving 9/12 top-1 results on unseen games compared with baselines. Analyses through ablations, learning dynamics, and case studies reveal the complementary contributions of both components and show meta-reasoning evolving toward human-like adaptation strategies.
中文摘要 最新的视觉语言模型（VLMs）表现出强大的感知推理能力，但在测试时遇到新任务时常常难以高效适应。相比之下，人类利用元认知模型与记忆结合，通过元认知控制实现持续策略的精炼，以应对新挑战。为了弥合这一差距，我们提出了元认知测试时间推理（MCTR）框架，该框架通过元认知自我更新赋予模型学习、适应和改进的能力。MCTR受人类元认知双重结构启发，包含元层和对象层VLM推理模块，每个模块都配备了用于层级自适应推理的专用记忆系统。具体来说，MCTR包括（1）元推理模块，通过发现和存储测试时间观察中的任务相关规则、环境模式及动作-结果关系作为自然语言描述，逐步构建结构化记忆;以及（2）一个通过上下文感知和战略推理，动态检索和整合记忆知识来确定最佳行动的行动推理模块。动作推理模块通过提出的元认知测试时间强化学习不断更新其策略，并随着知识记忆的演变而调整。我们对45款雅达利游戏进行了评估（33款已见，12款未见）。MCTR展现出强大的测试时间适应能力，在未见游戏中与基线相比，12场中有9次获得前一。通过消融分析、学习动态和案例研究，揭示了两者互补的贡献，并展示了元推理正向类人适应策略演变。

Emergent Coordination and Phase Structure in Independent Multi-Agent Reinforcement Learning

独立多智能体强化学习中的涌现协调与相结构

Authors: Azusa Yamaguchi
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.23315
Pdf link: https://arxiv.org/pdf/2511.23315
Abstract A clearer understanding of when coordination emerges, fluctuates, or collapses in decentralized multi-agent reinforcement learning (MARL) is increasingly sought in order to characterize the dynamics of multi-agent learning systems. We revisit fully independent Q-learning (IQL) as a minimal decentralized testbed and run large-scale experiments across environment size L and agent density rho. We construct a phase map using two axes - the cooperative success rate (CSR) and a stability index derived from TD-error variance - revealing three distinct regimes: a coordinated and stable phase, a fragile transition region, and a jammed or disordered phase. A sharp double Instability Ridge separates these regimes and corresponds to persistent kernel drift, the time-varying shift of each agent's effective transition kernel induced by others' policy updates. Synchronization analysis further shows that temporal alignment is required for sustained cooperation, and that competition between drift and synchronization generates the fragile regime. Removing agent identifiers eliminates drift entirely and collapses the three-phase structure, demonstrating that small inter-agent asymmetries are a necessary driver of drift. Overall, the results show that decentralized MARL exhibits a coherent phase structure governed by the interaction between scale, density, and kernel drift, suggesting that emergent coordination behaves as a distribution-interaction-driven phase phenomenon.
中文摘要 为了表征多主体学习系统的动态，越来越多的人希望更清晰地理解去中心化多智能体强化学习（MARL）中协调何时出现、波动或崩溃。我们重新审视完全独立的Q学习（IQL），作为一个最小分散的测试平台，并在环境大小L和代理密度rho上进行大规模实验。我们利用两个轴——合作成功率（CSR）和基于TD误差方差的稳定性指数——构建了相位图，揭示了三种不同的状态：协调稳定的相、脆弱过渡区以及卡滞或无序的相。一个锐利的双不稳定性脊将这些区间分隔开，对应于持久核漂移，即每个代理有效过渡核因他人策略更新而发生的时间变化变化。同步分析进一步表明，时间对齐是持续合作的必要条件，漂移与同步之间的竞争产生了脆弱状态。移除代理标识符完全消除漂移，并崩溃三相结构，表明小的代理间不对称是漂移的必要驱动因素。总体来看，结果表明分散式MARL表现出相位结构，受尺度、密度和核漂移相互作用的控制，表明涌现配位表现为分布-交互驱动的相态现象。
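As a rough illustration of the testbed described above, the sketch below shows one fully independent tabular Q-learner that logs its TD errors. Each agent would own one such learner and treat the other agents as part of the environment. The class name, hyperparameters, and the use of plain TD-error variance as a stability measure are assumptions for illustration; the paper's exact stability index is not specified in the abstract.

```python
from collections import defaultdict
from statistics import pvariance


class IndependentQLearner:
    """One agent's tabular Q-learning, with no access to other agents' state."""

    def __init__(self, n_actions, alpha=0.1, gamma=0.95):
        self.q = defaultdict(lambda: [0.0] * n_actions)
        self.alpha = alpha
        self.gamma = gamma
        self.td_errors = []

    def update(self, state, action, reward, next_state, done):
        # Standard one-step Q-learning target.
        bootstrap = 0.0 if done else self.gamma * max(self.q[next_state])
        td_error = reward + bootstrap - self.q[state][action]
        self.q[state][action] += self.alpha * td_error
        self.td_errors.append(td_error)
        return td_error

    def td_error_variance(self):
        # Hypothetical stability proxy: population variance of logged TD errors.
        return pvariance(self.td_errors) if self.td_errors else 0.0


if __name__ == "__main__":
    agent = IndependentQLearner(n_actions=2, alpha=0.5, gamma=0.9)
    agent.update("s0", 1, 1.0, "s1", done=False)
    agent.update("s0", 1, 1.0, "s1", done=False)
    print(agent.q["s0"], agent.td_error_variance())
```

Because the other agents keep updating their policies, the transitions each learner sees are non-stationary, which is the kernel drift the abstract refers to.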

Ambiguity Awareness Optimization: Towards Semantic Disambiguation for Direct Preference Optimization

歧义意识优化：直接偏好优化的语义消歧

Authors: Jian Li, Shenglin Yin, Yujia Zhang, Alan Zhao, Xi Chen, Xiaohui Zhou, Pengfei Xu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.23391
Pdf link: https://arxiv.org/pdf/2511.23391
Abstract Direct Preference Optimization (DPO) is a widely used reinforcement learning from human feedback (RLHF) method across various domains. Recent research has increasingly focused on the role of token importance in improving DPO effectiveness. It is observed that identical or semantically similar content (defined as ambiguous content) frequently appears within the preference pairs. We hypothesize that the presence of ambiguous content during DPO training may introduce ambiguity, thereby limiting further improvements in alignment. Through mathematical analysis and proof-of-concept experiments, we reveal that ambiguous content can introduce ambiguities, thereby degrading performance. To address this issue, we introduce Ambiguity Awareness Optimization (AAO), a simple yet effective approach that automatically re-weights ambiguous content to reduce ambiguities by calculating semantic similarity from preference pairs. Through extensive experiments, we demonstrate that AAO consistently and significantly surpasses state-of-the-art approaches in performance, without markedly increasing response length, across multiple model scales and widely adopted benchmark datasets, including AlpacaEval 2, MT-Bench, and Arena-Hard. Specifically, AAO outperforms DPO by up to 8.9 points on AlpacaEval 2 and achieves an improvement of up to 15.0 points on Arena-Hard.
中文摘要 直接偏好优化（DPO）是一种广泛应用于多个领域的人类反馈强化学习（RLHF）方法。近期研究越来越关注代币重要性在提升DPO效果中的作用。观察到，偏好对中经常出现相同或语义相似的内容（定义为歧义内容）。我们假设，DPO培训中存在歧义内容可能会引入歧义，从而限制进一步的对齐改善。通过数学分析和概念验证实验，我们发现模糊内容可能引入歧义，从而降低性能。为解决这一问题，我们引入了歧义意识优化（AAO），这是一种简单但有效的方法，通过从偏好对计算语义相似性，自动重新加权歧义内容以减少歧义。通过大量实验，我们证明AAO在多个模型尺度和广泛采用的基准数据集（包括AlpacaEval 2、MT-Bench和Arena-Hard）中，在性能上持续且显著地超越最先进方法，且响应长度并未显著增加。具体来说，AAO在AlpacaEval 2中比DPO高出最多8.9分，在Arena-Hard上提升最多15.0分。
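For context, the standard DPO objective for one preference pair can be written in a few lines. The sketch below also shows where a per-token weight could enter. The `weights` argument is a hypothetical stand-in for a re-weighting scheme of the kind the abstract describes, not the authors' actual AAO rule, and the function names are illustrative.

```python
import math


def sequence_logprob(token_logprobs, weights=None):
    """Sum token log-probabilities, optionally scaled by per-token weights.

    weights is a hypothetical hook: down-weighting tokens shared by the
    chosen and rejected responses is one way ambiguous content could be muted.
    """
    if weights is None:
        weights = [1.0] * len(token_logprobs)
    if len(weights) != len(token_logprobs):
        raise ValueError("weights and token_logprobs must have equal length")
    return sum(w * lp for w, lp in zip(weights, token_logprobs))


def dpo_loss(policy_chosen, ref_chosen, policy_rejected, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair: -log sigmoid(beta * reward margin)."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # Numerically stable -log(sigmoid(margin)).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    chosen = sequence_logprob([-1.0, -2.0], weights=[1.0, 0.5])
    print(chosen, dpo_loss(chosen, -2.5, -3.0, -2.5, beta=1.0))
```

When policy and reference agree on both responses the margin is zero and the loss equals log 2; the loss shrinks as the policy prefers the chosen response more strongly than the reference does.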

ASTRO: Adaptive Stitching via Dynamics-Guided Trajectory Rollouts

ASTRO：通过动力学引导轨迹展开的自适应缝合

Authors: Hang Yu, Di Zhang, Qiwei Du, Yanping Zhao, Hai Zhang, Guang Chen, Eduardo E. Veas, Junqiao Zhao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.23442
Pdf link: https://arxiv.org/pdf/2511.23442
Abstract Offline reinforcement learning (RL) enables agents to learn optimal policies from pre-collected datasets. However, datasets containing suboptimal and fragmented trajectories present challenges for reward propagation, resulting in inaccurate value estimation and degraded policy performance. While trajectory stitching via generative models offers a promising solution, existing augmentation methods frequently produce trajectories that are either confined to the support of the behavior policy or violate the underlying dynamics, thereby limiting their effectiveness for policy improvement. We propose ASTRO, a data augmentation framework that generates distributionally novel and dynamics-consistent trajectories for offline RL. ASTRO first learns a temporal-distance representation to identify distinct and reachable stitch targets. We then employ a dynamics-guided stitch planner that adaptively generates connecting action sequences via Rollout Deviation Feedback, defined as the gap between the target state sequence and the state sequence actually reached by executing the predicted actions, to improve the feasibility and reachability of trajectory stitching. This approach facilitates effective augmentation through stitching and ultimately enhances policy learning. ASTRO outperforms prior offline RL augmentation methods across various algorithms, achieving notable performance gains on the challenging OGBench suite and demonstrating consistent improvements on standard offline RL benchmarks such as D4RL.
中文摘要 离线强化学习（RL）使代理能够从预先收集的数据集中学习最优策略。然而，包含次优且碎片化轨迹的数据集对奖励传播构成挑战，导致价值估计不准确，政策性能下降。虽然通过生成模型进行轨迹拼接提供了有前景的解决方案，但现有的增强方法常常产生的轨迹要么局限于支持行为策略，要么违反了其基础动态，从而限制了其对策略改进的有效性。我们提出了ASTRO数据增强框架，能够生成分布上新颖且动态一致的离线强化学习轨迹。ASTRO首先学习时间距离表示，以识别明显且可达的针法目标。随后，我们采用动态引导的缝合计划器，通过滚动偏差反馈（Rollout Deviation Feedback）自适应生成连接动作序列，定义为目标状态序列与实际到达状态序列之间的间隙，通过执行预测动作，以提升轨迹缝合的可行性和可达性。这种方法通过缝合促进了有效的补充，最终提升了政策学习。ASTRO在多种算法上优于以往的离线强化学习增强方法，在具有挑战性强的OGBench套件中实现了显著的性能提升，并且在标准离线强化学习基准测试如D4RL上持续展现出持续的改进。
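The definition of Rollout Deviation Feedback given above (the gap between the target state sequence and the states actually reached by executing the predicted actions) can be made concrete with a toy example. Everything below is an assumption for illustration: the additive `toy_dynamics`, the Euclidean distance, and the function names are not taken from the paper, which would use a learned dynamics model.

```python
import math


def toy_dynamics(state, action):
    """Hypothetical dynamics: the action is a displacement added to the state."""
    return tuple(s + a for s, a in zip(state, action))


def rollout(dynamics, start_state, actions):
    """Execute the predicted actions and collect the states actually reached."""
    states, state = [], start_state
    for action in actions:
        state = dynamics(state, action)
        states.append(state)
    return states


def rollout_deviation(target_states, reached_states):
    """Per-step Euclidean gap between target and reached state sequences."""
    if len(target_states) != len(reached_states):
        raise ValueError("sequences must have equal length")
    return [math.dist(t, r) for t, r in zip(target_states, reached_states)]


if __name__ == "__main__":
    targets = [(1.0, 0.0), (1.0, 2.0)]
    reached = rollout(toy_dynamics, (0.0, 0.0), [(1.0, 0.0), (0.0, 1.0)])
    print(rollout_deviation(targets, reached))
```

A planner could feed these per-step gaps back to revise the action sequence until the stitched segment is reachable under the dynamics model.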

ThetaEvolve: Test-time Learning on Open Problems

ThetaEvolve：面向开放问题的测试时学习

Authors: Yiping Wang, Shao-Rong Su, Zhiyuan Zeng, Eva Xu, Liliang Ren, Xinyu Yang, Zeyi Huang, Xuehai He, Luyao Ma, Baolin Peng, Hao Cheng, Pengcheng He, Weizhu Chen, Shuohang Wang, Simon Shaolei Du, Yelong Shen
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.23473
Pdf link: https://arxiv.org/pdf/2511.23473
Abstract Recent advances in large language models (LLMs) have enabled breakthroughs in mathematical discovery, exemplified by AlphaEvolve, a closed-source system that evolves programs to improve bounds on open problems. However, it relies on ensembles of frontier LLMs to achieve new bounds and is a pure inference system in which models cannot internalize the evolving strategies. We introduce ThetaEvolve, an open-source framework that simplifies and extends AlphaEvolve to efficiently scale both in-context learning and Reinforcement Learning (RL) at test time, allowing models to continually learn from their experiences in improving open optimization problems. ThetaEvolve features a single LLM, a large program database for enhanced exploration, batch sampling for higher throughput, lazy penalties to discourage stagnant outputs, and optional reward shaping for stable training signals, among other components. ThetaEvolve is the first evolving framework that enables a small open-source model, like DeepSeek-R1-0528-Qwen3-8B, to achieve new best-known bounds on open problems (circle packing and first auto-correlation inequality) mentioned in AlphaEvolve. Moreover, across two models and four open tasks, we find that ThetaEvolve with RL at test time consistently outperforms inference-only baselines, and the model indeed learns evolving capabilities, as the RL-trained checkpoints demonstrate faster progress and better final performance on both the trained target task and other unseen tasks. We release our code publicly: this https URL
中文摘要 大型语言模型（LLMs）的最新进展推动了数学发现的突破，AlphaEvolve 是一个闭源系统，能够进化程序以改善未解决问题的边界。然而，它依赖前沿大型语言模型的集合来实现新的边界，并且是一个纯粹的推理系统，模型无法内化不断演变的策略。我们介绍ThetaEvolve，一个开源框架，简化并扩展了AlphaEvolve，使其能够在测试时高效扩展上下文学习和强化学习（RL），使模型能够持续从改进开放优化问题的经验中学习。ThetaEvolve 具备单一大型语言模型、大型程序数据库以增强探索、批量采样以提高通量、懒惰惩罚以抑制停滞输出，以及可选的奖励塑形以稳定训练信号等。ThetaEvolve 是第一个不断演进的框架，使像 DeepSeek-R1-0528-Qwen3-8B 这样的小型开源模型能够在 AlphaEvolve 中提到的开放问题（圆封和首个自相关不等式）上实现新的已知边界。此外，在两个模型和四个开放任务中，我们发现带强化学习的ThetaEvolve在测试时始终优于仅推断基线，模型确实学习了不断演化的能力，因为强化学习的检查点在训练目标任务和其他未公开任务上表现出更快的进展和更优的最终表现。我们公开发布代码：这个 https URL

Video-CoM: Interactive Video Reasoning via Chain of Manipulations

Video-CoM：通过操作链进行交互式视频推理

Authors: Hanoona Rasheed, Mohammed Zumri, Muhammad Maaz, Ming-Hsuan Yang, Fahad Shahbaz Khan, Salman Khan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.23477
Pdf link: https://arxiv.org/pdf/2511.23477
Abstract Recent multimodal large language models (MLLMs) have advanced video understanding, yet most still "think about videos", i.e., once a video is encoded, reasoning unfolds entirely in text, treating visual input as a static context. This passive paradigm creates a semantic bottleneck: models cannot rewatch, refocus, or verify evidence, leading to shallow visual reasoning on tasks requiring fine-grained spatio-temporal understanding. In this work, we introduce Interactive Video Reasoning, a new paradigm that transforms video into an active cognitive workspace, enabling models to "think with videos". Our model, Video-CoM, reasons through a Chain of Manipulations (CoM), performing iterative visual actions to gather and refine evidence. To support this behavior, we construct Video-CoM Instruct, an 18K instruction-tuning dataset curated for multi-step manipulation reasoning. Beyond supervised learning, we further optimize the manipulation policy via reinforcement learning with reasoning-aware Group Relative Policy Optimization (GRPO). Unlike prior work that relies solely on sparse answer rewards, our method introduces step-level reasoning rewards, guiding the model toward grounded and consistent reasoning. Video-CoM achieves strong results across nine video reasoning benchmarks, improving average performance by 3.6 percent over recent state-of-the-art models, while training on only 25K SFT and 3K GRPO video samples, significantly fewer than comparable large-scale models. Ablation studies demonstrate that reasoning-aware rewards improve both accuracy and interpretability. Code: this https URL
中文摘要 最近的多模态大型语言模型（MLLMs）提升了视频理解能力，但大多数仍然只是“思考视频”，即一旦视频编码完成，推理完全在文本中展开，将视觉输入视为静态上下文。这种被动范式造成语义瓶颈：模型无法重看、重新聚焦或验证证据，导致在需要细粒度时空理解的任务上只能进行浅层的视觉推理。在本研究中，我们提出了交互式视频推理，这是一种将视频转化为主动认知工作空间的新范式，使模型能够“用视频思考”。我们的模型Video-CoM通过操作链（CoM）进行推理，执行迭代的视觉操作来收集和完善证据。为支持这种行为，我们构建了Video-CoM Instruct，这是一个为多步操作推理精心整理的18K指令微调数据集。除了监督学习外，我们还通过带有推理感知的群体相对策略优化（GRPO）的强化学习进一步优化操作策略。与以往仅依赖稀疏答案奖励的研究不同，我们的方法引入了步骤级推理奖励，引导模型进行有依据且一致的推理。Video-CoM在九个视频推理基准测试中取得了强劲成绩，平均性能比近期最先进模型提升了3.6%，同时仅在25K SFT和3K GRPO视频样本上训练，远少于同类大规模模型。消融研究表明，推理感知奖励同时提高了准确性和可解释性。代码：this https URL
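The group-relative part of GRPO can be sketched briefly: several responses are sampled for the same prompt, and each response's advantage is its reward standardized against the group. The reward mix below (a 0/1 answer reward plus a weighted mean of step scores) and the `step_weight` value are assumptions for illustration, not the reward design used in the paper; the population standard deviation is also a choice made here.

```python
from statistics import mean, pstdev


def combined_reward(answer_correct, step_scores, step_weight=0.5):
    """Hypothetical reward: sparse answer reward plus dense step-level reward."""
    answer_reward = 1.0 if answer_correct else 0.0
    step_reward = mean(step_scores) if step_scores else 0.0
    return answer_reward + step_weight * step_reward


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    group = [
        combined_reward(True, [1.0, 1.0]),
        combined_reward(True, [0.0, 0.0]),
        combined_reward(False, [1.0, 0.0]),
        combined_reward(False, []),
    ]
    print(group, group_relative_advantages(group))
```

With answer-only rewards, two correct responses in a group would receive identical advantages; adding a step-level term lets the better-grounded one be ranked higher.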

Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models

Video-R2：在多模态语言模型中强化一致且有依据的推理

Authors: Muhammad Maaz, Hanoona Rasheed, Fahad Shahbaz Khan, Salman Khan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.23478
Pdf link: https://arxiv.org/pdf/2511.23478
Abstract Reasoning over dynamic visual content remains a central challenge for multimodal large language models. Recent thinking models generate explicit reasoning traces for interpretability; however, their reasoning often appears convincing while being logically inconsistent or weakly grounded in visual evidence. We identify and formalize these issues through two diagnostic metrics: Think Answer Consistency (TAC), which measures the alignment between reasoning and answers, and Video Attention Score (VAS), which captures the extent to which reasoning depends on visual versus textual cues. Analysis across 11 video reasoning benchmarks shows that current models rely heavily on linguistic priors rather than visual content. To address this, we propose a reinforcement learning approach that enhances both temporal precision and reasoning consistency. Our approach combines timestamp-aware supervised fine-tuning with Group Relative Policy Optimization (GRPO) guided by a novel Temporal Alignment Reward (TAR). This dual-step post-training stage encourages temporally aligned and causally coherent video reasoning. The resulting model, Video-R2, achieves consistently higher TAC, VAS, and accuracy across multiple benchmarks, demonstrating that improvements in temporal alignment and reasoning coherence lead to more accurate and trustworthy video understanding. Our code, dataset, and model will be open-sourced.
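Chinese abstract (English translation) Reasoning over dynamic visual content remains a core challenge for multimodal large language models. Recent thinking models produce explicit reasoning traces for interpretability; however, their reasoning often sounds convincing while being logically inconsistent or poorly grounded in visual evidence. We identify and formalize these problems with two diagnostic metrics: Think Answer Consistency (TAC), which measures how well the reasoning aligns with the answer, and Video Attention Score (VAS), which measures how much the reasoning depends on visual versus textual cues. An analysis of 11 video reasoning benchmarks shows that current models rely heavily on linguistic priors rather than visual content. To address this, we propose a reinforcement learning method that improves both temporal precision and reasoning consistency. Our method combines timestamp-aware supervised fine-tuning with Group Relative Policy Optimization (GRPO) guided by a novel Temporal Alignment Reward (TAR). This two-step post-training stage encourages temporally aligned and causally coherent video reasoning. The resulting model, Video-R2, consistently achieves higher TAC, VAS, and accuracy across multiple benchmarks, showing that gains in temporal alignment and reasoning consistency lead to more accurate and more trustworthy video understanding. Our code, dataset, and model will be open-sourced.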
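
As a rough illustration of the GRPO step named in the Video-R2 abstract above, the following minimal sketch computes group-relative advantages: each sampled response's reward is normalized against the mean and standard deviation of its own sampling group. The function names, the `combined_reward` weighting, and the example scores are assumptions made for illustration only; the abstract does not say how the Temporal Alignment Reward is computed or how it is combined with answer accuracy.

```python
import statistics


def combined_reward(accuracy, temporal_alignment, weight=0.5):
    """Hypothetical scalar reward: answer accuracy plus a weighted alignment term.

    The weighting is an assumption; the abstract gives no formula.
    """
    return accuracy + weight * temporal_alignment


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampling group (the core idea of GRPO).

    Uses the population standard deviation; some implementations use the
    sample standard deviation instead.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four responses sampled for the same video question (made-up scores).
group_rewards = [
    combined_reward(1.0, 0.8),
    combined_reward(0.0, 0.2),
    combined_reward(1.0, 0.4),
    combined_reward(0.0, 0.0),
]
advantages = group_relative_advantages(group_rewards)
for reward, advantage in zip(group_rewards, advantages):
    print(f"reward={reward:.2f} advantage={advantage:+.3f}")
```

Responses that score above their group's average receive a positive advantage and the rest a negative one, so no separate value network is needed to provide a baseline.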

Keyword: diffusion policy

Visual-Geometry Diffusion Policy: Robust Generalization via Complementarity-Aware Multimodal Fusion

Visual-Geometry Diffusion Policy: Robust Generalization via Complementarity-Aware Multimodal Fusion

Authors: Yikai Tang, Haoran Geng, Sheng Zang, Pieter Abbeel, Jitendra Malik
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.22445
Pdf link: https://arxiv.org/pdf/2511.22445
Abstract Imitation learning has emerged as a crucial approach for acquiring visuomotor skills from demonstrations, where designing effective observation encoders is essential for policy generalization. However, existing methods often struggle to generalize under spatial and visual randomizations, instead tending to overfit. To address this challenge, we propose Visual Geometry Diffusion Policy (VGDP), a multimodal imitation learning framework built around a Complementarity-Aware Fusion Module where modality-wise dropout enforces balanced use of RGB and point-cloud cues, with cross-attention serving only as a lightweight interaction layer. Our experiments show that the expressiveness of the fused latent space is largely induced by the enforced complementarity from modality-wise dropout, with cross-attention serving primarily as a lightweight interaction mechanism rather than the main source of robustness. Across a benchmark of 18 simulated tasks and 4 real-world tasks, VGDP outperforms seven baseline policies with an average performance improvement of 39.1%. More importantly, VGDP demonstrates strong robustness under visual and spatial perturbations, surpassing baselines with an average improvement of 41.5% in different visual conditions and 15.2% in different spatial settings.
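Chinese abstract (English translation) Imitation learning has become a key approach for acquiring visuomotor skills from demonstrations, and designing effective observation encoders is essential for policy generalization. However, existing methods often struggle to generalize under spatial and visual randomization and tend to overfit instead. To address this challenge, we propose Visual Geometry Diffusion Policy (VGDP), a multimodal imitation learning framework built around a Complementarity-Aware Fusion Module, in which modality-wise dropout enforces balanced use of RGB and point-cloud cues and cross-attention serves only as a lightweight interaction layer. Our experiments show that the expressiveness of the fused latent space is largely induced by the complementarity that modality-wise dropout enforces, with cross-attention acting mainly as a lightweight interaction mechanism rather than the main source of robustness. On a benchmark of 18 simulated tasks and 4 real-world tasks, VGDP outperforms seven baseline policies, with an average performance improvement of 39.1%. More importantly, VGDP is strongly robust to visual and spatial perturbations, surpassing the baselines by an average of 41.5% under different visual conditions and 15.2% under different spatial settings.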
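
The modality-wise dropout mentioned in the VGDP abstract above can be pictured with a small sketch. This is one common form of the technique, not the paper's implementation: during training, at most one modality's feature vector is zeroed per sample, so the policy cannot lean on RGB or point-cloud cues alone. The function name, the `p_drop` value, and the even split between the two modalities are all assumptions.

```python
import random


def modality_dropout(rgb_feat, pc_feat, p_drop=0.3, rng=random):
    """Zero out at most one modality's features for a training sample.

    With probability p_drop / 2 the RGB features are dropped, with
    probability p_drop / 2 the point-cloud features are dropped, and
    otherwise both are kept. Both modalities are never dropped together.
    """
    u = rng.random()
    if u < p_drop / 2:
        rgb_feat = [0.0] * len(rgb_feat)
    elif u < p_drop:
        pc_feat = [0.0] * len(pc_feat)
    return rgb_feat, pc_feat


# Toy feature vectors standing in for encoder outputs (hypothetical values).
rng = random.Random(0)
rgb = [0.4, -1.2, 0.7]
point_cloud = [1.5, 0.3]
for _ in range(5):
    print(modality_dropout(rgb, point_cloud, p_drop=0.5, rng=rng))
```

At evaluation time the function would simply not be applied, so both modalities are always available to the fusion layer.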

CAPE: Context-Aware Diffusion Policy Via Proximal Mode Expansion for Collision Avoidance

CAPE: Context-Aware Diffusion Policy via Proximal Mode Expansion for Collision Avoidance

Authors: Rui Heng Yang, Xuan Zhao, Leo Maxime Brunswic, Montgomery Alban, Mateo Clemente, Tongtong Cao, Jun Jin, Amir Rasouli
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.22773
Pdf link: https://arxiv.org/pdf/2511.22773
Abstract In robotics, diffusion models can capture multi-modal trajectories from demonstrations, making them a transformative approach in imitation learning. However, achieving optimal performance following this regimen requires a large-scale dataset, which is costly to obtain, especially for challenging tasks, such as collision avoidance. In those tasks, generalization at test time demands coverage of many obstacle types and their spatial configurations, which is impractical to acquire purely via data. To remedy this problem, we propose Context-Aware diffusion policy via Proximal mode Expansion (CAPE), a framework that expands trajectory distribution modes with context-aware prior and guidance at inference via a novel prior-seeded iterative guided refinement procedure. The framework generates an initial trajectory plan and executes a short prefix trajectory, and then the remaining trajectory segment is perturbed to an intermediate noise level, forming a trajectory prior. Such a prior is context-aware and preserves task intent. Repeating the process with context-aware guided denoising iteratively expands mode support to allow finding smoother, less collision-prone trajectories. For collision avoidance, CAPE expands trajectory distribution modes with collision-aware context, enabling the sampling of collision-free trajectories in previously unseen environments while maintaining goal consistency. We evaluate CAPE on diverse manipulation tasks in cluttered unseen simulated and real-world settings and show up to 26% and 80% higher success rates respectively compared to SOTA methods, demonstrating better generalization to unseen environments.
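Chinese abstract (English translation) In robotics, diffusion models can capture multi-modal trajectories from demonstrations, making them a transformative approach to imitation learning. However, achieving optimal performance under this regimen requires a large-scale dataset, which is costly to obtain, especially for challenging tasks such as collision avoidance. In these tasks, generalization at test time requires coverage of many obstacle types and their spatial configurations, which is impractical to obtain from data alone. To solve this problem, we propose Context-Aware diffusion policy via Proximal mode Expansion (CAPE), a framework that expands trajectory distribution modes with a context-aware prior and guidance at inference, using a novel prior-seeded iterative guided refinement procedure. The framework generates an initial trajectory plan and executes a short prefix trajectory; the remaining trajectory segment is then perturbed to an intermediate noise level to form a trajectory prior. This prior is context-aware and preserves task intent. Repeating the process with context-aware guided denoising iteratively expands mode support, allowing smoother, less collision-prone trajectories to be found. For collision avoidance, CAPE expands trajectory distribution modes with collision-aware context, making it possible to sample collision-free trajectories in previously unseen environments while maintaining goal consistency. We evaluate CAPE on diverse manipulation tasks in cluttered, unseen simulated and real-world settings, where it achieves up to 26% and 80% higher success rates, respectively, than SOTA methods, demonstrating better generalization to unseen environments.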
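
The "perturb the remaining segment to an intermediate noise level" step in the CAPE abstract above can be sketched as follows, assuming a standard DDPM-style forward process with a linear beta schedule. The schedule, the function names, and the toy 2-D waypoints are assumptions for illustration; the abstract does not describe CAPE's actual noise schedule, guidance term, or denoiser, and the guided denoising that would follow is not shown.

```python
import math
import random


def alpha_bar(k, num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) over the first k steps of a linear schedule."""
    prod = 1.0
    for t in range(k):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def perturb_to_level(trajectory, k, num_steps, rng):
    """Noise each waypoint to level k: sqrt(a) * x + sqrt(1 - a) * eps, with a = alpha_bar(k)."""
    a = alpha_bar(k, num_steps)
    return [
        [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for x in waypoint]
        for waypoint in trajectory
    ]


def trajectory_prior(plan, prefix_len, k, num_steps, rng):
    """Split a plan into an executed prefix and a partially noised remainder."""
    executed, remaining = plan[:prefix_len], plan[prefix_len:]
    return executed, perturb_to_level(remaining, k, num_steps, rng)


# Toy 2-D plan with four waypoints (hypothetical values).
plan = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.1], [0.3, 0.2]]
executed, prior = trajectory_prior(plan, prefix_len=1, k=30, num_steps=100, rng=random.Random(0))
print("executed prefix:", executed)
print("noised remainder:", prior)
```

Because `k` is well below `num_steps`, the noised remainder still resembles the original plan, which is one way to read the abstract's claim that the prior preserves task intent while leaving room for refinement.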