Arxiv Papers of Today

Generated: 2025-11-25 16:33:28 (UTC+8); arXiv release time: 2025-11-25 20:00 EST (2025-11-26 09:00 UTC+8)

67 related papers today

Keyword: reinforcement learning

AURA: Adaptive Unified Reasoning and Automation with LLM-Guided MARL for NextG Cellular Networks

AURA：面向NextG蜂窝网络的自适应统一推理与自动化，采用LLM引导的MARL

Authors: Narjes Nourzad, Mingyu Zong, Bhaskar Krishnamachari
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.17506
Pdf link: https://arxiv.org/pdf/2511.17506
Abstract Next-generation (NextG) cellular networks are expected to manage dynamic traffic while sustaining high performance. Large language models (LLMs) provide strategic reasoning for 6G planning, but their computational cost and latency limit real-time use. Multi-agent reinforcement learning (MARL) supports localized adaptation, yet coordination at scale remains challenging. We present AURA, a framework that integrates cloud-based LLMs for high-level planning with base stations modeled as MARL agents for local decision-making. The LLM generates objectives and subgoals from its understanding of the environment and reasoning capabilities, while agents at base stations execute these objectives autonomously, guided by a trust mechanism that balances local learning with external input. To reduce latency, AURA employs batched communication so that agents update the LLM's view of the environment and receive improved feedback. In a simulated 6G scenario, AURA improves resilience, reducing dropped handoff requests by more than half under normal and high traffic and lowering system failures. Agents use LLM input in fewer than 60% of cases, showing that guidance augments rather than replaces local adaptability, thereby mitigating latency and hallucination risks. These results highlight the promise of combining LLM reasoning with MARL adaptability for scalable, real-time NextG network management.
中文摘要 下一代（NextG）蜂窝网络需要在保持高性能的同时管理动态流量。大型语言模型（LLM）可为6G规划提供战略推理，但其计算成本和延迟限制了实时应用。多智能体强化学习（MARL）支持局部适应，但大规模协调仍然充满挑战。我们提出AURA框架，它将用于高层规划的云端LLM与建模为MARL智能体、负责本地决策的基站相结合。LLM根据其对环境的理解和推理能力生成目标和子目标，基站上的智能体则自主执行这些目标，并由一种在本地学习与外部输入之间取得平衡的信任机制加以引导。为降低延迟，AURA采用批量通信，使智能体更新LLM对环境的认识并获得改进的反馈。在模拟的6G场景中，AURA提升了韧性，在正常和高流量下将被丢弃的切换请求减少一半以上，并降低了系统故障。智能体在不到60%的情况下使用LLM输入，表明该指导是对局部适应能力的增强而非替代，从而降低了延迟和幻觉风险。这些结果凸显了将LLM推理与MARL适应性相结合，以实现可扩展、实时的NextG网络管理的潜力。

Enhancing Robustness of Offline Reinforcement Learning Under Data Corruption via Sharpness-Aware Minimization

通过锐度感知最小化，增强离线强化学习在数据损坏下的稳健性

Authors: Le Xu, Jiayu Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.17568
Pdf link: https://arxiv.org/pdf/2511.17568
Abstract Offline reinforcement learning (RL) is vulnerable to real-world data corruption, with even robust algorithms failing under challenging observation and mixture corruptions. We posit this failure stems from data corruption creating sharp minima in the loss landscape, leading to poor generalization. To address this, we are the first to apply Sharpness-Aware Minimization (SAM) as a general-purpose, plug-and-play optimizer for offline RL. SAM seeks flatter minima, guiding models to more robust parameter regions. We integrate SAM into strong baselines for data corruption: IQL, a top-performing offline RL algorithm in this setting, and RIQL, an algorithm designed specifically for data-corruption robustness. We evaluate them on D4RL benchmarks with both random and adversarial corruption. Our SAM-enhanced methods consistently and significantly outperform the original baselines. Visualizations of the reward surface confirm that SAM finds smoother solutions, providing strong evidence for its effectiveness in improving the robustness of offline RL agents.
中文摘要 离线强化学习（RL）容易受到现实世界数据损坏的影响，即使是稳健的算法也会在具有挑战性的观测损坏和混合损坏下失效。我们认为这种失败源于数据损坏在损失地形中形成了尖锐的极小值，导致泛化能力不佳。为此，我们率先将锐度感知最小化（SAM）作为一种通用的即插即用优化器应用于离线强化学习。SAM寻求更平坦的极小值，引导模型进入更稳健的参数区域。我们将SAM集成到应对数据损坏的强基线中：IQL，这是该设定下表现最出色的离线强化学习算法；以及RIQL，一种专门为数据损坏鲁棒性设计的算法。我们在带有随机损坏和对抗性损坏的D4RL基准上对它们进行评估。我们的SAM增强方法持续且显著地优于原始基线。奖励曲面的可视化证实SAM能找到更平滑的解，为其提升离线强化学习智能体鲁棒性的有效性提供了有力证据。
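Illustrative note: the abstract above relies on Sharpness-Aware Minimization. As a rough sketch of the generic SAM update (step to a nearby higher-loss point, then descend using the gradient taken there), the code below applies it to a toy quadratic loss. The loss function and the values of `lr` and `rho` are assumptions chosen for illustration; this is not the paper's IQL/RIQL integration.

```python
import math

def loss(w):
    # Toy quadratic loss; an assumed stand-in for an offline RL objective.
    return 0.5 * sum(x * x for x in w)

def grad(w):
    # Gradient of the toy loss above.
    return list(w)

def sam_step(w, lr=0.1, rho=0.05):
    """One generic SAM update on the parameter list w."""
    g = grad(w)
    norm = math.sqrt(sum(x * x for x in g))
    if norm == 0.0:
        return list(w)  # stationary point: nothing to do
    # Ascent step: move to the approximate worst case within radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent step: apply the gradient from the perturbed point to w itself.
    g_adv = grad(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

w0 = [1.0, -2.0]
w = list(w0)
for _ in range(50):
    w = sam_step(w)
print(loss(w0), loss(w))
```

The only difference from plain gradient descent is that the gradient is evaluated at `w_adv` rather than at `w`, which is what biases the optimizer toward flatter regions.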

Multi-Value Alignment for LLMs via Value Decorrelation and Extrapolation

通过价值去相关和外推实现LLM的多价值对齐

Authors: Hefei Xu, Le Wu, Chen Cheng, Hao Liu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.17579
Pdf link: https://arxiv.org/pdf/2511.17579
Abstract With the rapid advancement of large language models (LLMs), aligning them with human values for safety and ethics has become a critical challenge. This problem is especially challenging when multiple, potentially conflicting human values must be considered and balanced. Although several variants of existing alignment methods (such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO)) have been proposed to address multi-value alignment, they suffer from notable limitations: 1) they are often unstable and inefficient in multi-value optimization; and 2) they fail to effectively handle value conflicts. As a result, these approaches typically struggle to achieve optimal trade-offs when aligning multiple values. To address this challenge, we propose a novel framework called Multi-Value Alignment (MVA). It mitigates alignment degradation caused by parameter interference among diverse human values by minimizing their mutual information. Furthermore, we propose a value extrapolation strategy to efficiently explore the Pareto frontier, thereby constructing a set of LLMs with diverse value preferences. Extensive experiments demonstrate that MVA consistently outperforms existing baselines in aligning LLMs with multiple human values.
中文摘要 随着大型语言模型（LLM）的快速发展，使其在安全和伦理方面与人类价值观对齐已成为一项关键挑战。当必须考虑并平衡多种可能相互冲突的人类价值观时，这个问题尤为棘手。尽管已有多种现有对齐方法（如基于人类反馈的强化学习（RLHF）和直接偏好优化（DPO））的变体被提出以解决多价值对齐问题，但它们存在显著局限性：1）在多价值优化中通常不稳定且效率低下；2）未能有效处理价值冲突。因此，这些方法在对齐多种价值观时通常难以实现最佳权衡。为应对这一挑战，我们提出了一种新颖的框架，称为多价值对齐（MVA）。它通过最小化不同人类价值观之间的互信息，减轻由它们之间的参数干扰导致的对齐退化。此外，我们提出了一种价值外推策略，以高效探索帕累托前沿，从而构建一组具有多样价值偏好的LLM。大量实验表明，MVA在使LLM与多种人类价值观对齐方面始终优于现有基线。

Boosting Reinforcement Learning in 3D Visuospatial Tasks Through Human-Informed Curriculum Design

通过借鉴人类经验的课程设计，提升三维视觉空间任务中的强化学习

Authors: Markus D. Solbach, John K. Tsotsos
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.17595
Pdf link: https://arxiv.org/pdf/2511.17595
Abstract Reinforcement Learning is a mature technology, often suggested as a potential route towards Artificial General Intelligence, with the ambitious goal of replicating the wide range of abilities found in natural and artificial intelligence, including the complexities of human cognition. While RL has shown successes in relatively constrained environments, such as the classic Atari games and specific continuous control problems, recent years have seen efforts to expand its applicability. This work investigates the potential of RL in demonstrating intelligent behaviour and its progress in addressing more complex and less structured problem domains. We present an investigation into the capacity of modern RL frameworks in addressing a seemingly straightforward 3D Same-Different visuospatial task. While initial applications of state-of-the-art methods, including PPO, behavioural cloning and imitation learning, revealed challenges in directly learning optimal strategies, the successful implementation of curriculum learning offers a promising avenue. Effective learning was achieved by strategically designing the lesson plan based on the findings of a real-world human experiment.
中文摘要 强化学习是一项成熟的技术，常被视为通往通用人工智能的潜在途径，其雄心勃勃的目标是复现自然智能和人工智能中的广泛能力，包括人类认知的复杂性。虽然强化学习已在相对受限的环境中取得成功，比如经典的雅达利游戏和特定的连续控制问题，但近年来人们一直在努力扩展其适用范围。本研究探讨了强化学习在展示智能行为方面的潜力，以及其在解决更复杂、结构性更弱的问题领域方面的进展。我们考察了现代强化学习框架处理一个看似简单的三维“相同-不同”视觉空间任务的能力。虽然最初应用包括PPO、行为克隆和模仿学习在内的最先进方法时，暴露出直接学习最优策略的困难，但课程学习的成功实施提供了一条有前景的途径。通过基于一项真实世界人类实验的发现有策略地设计课程计划，实现了有效的学习。

Non-stationary and Varying-discounting Markov Decision Processes for Reinforcement Learning

用于强化学习的非平稳、变折扣马尔可夫决策过程

Authors: Zhizuo Chen, Theodore T. Allen
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2511.17598
Pdf link: https://arxiv.org/pdf/2511.17598
Abstract Algorithms developed under stationary Markov Decision Processes (MDPs) often face challenges in non-stationary environments, and infinite-horizon formulations may not directly apply to finite-horizon tasks. To address these limitations, we introduce the Non-stationary and Varying-discounting MDP (NVMDP) framework, which naturally accommodates non-stationarity and allows discount rates to vary with time and transitions. Infinite-horizon, stationary MDPs emerge as special cases of NVMDPs for identifying an optimal policy, and finite-horizon MDPs are also subsumed within the NVMDP formulations. Moreover, NVMDPs provide a flexible mechanism to shape optimal policies, without altering the state space, action space, or the reward structure. We establish the theoretical foundations of NVMDPs, including assumptions, state- and action-value formulation and recursion, matrix representation, optimality conditions, and policy improvement under finite state and action spaces. Building on these results, we adapt dynamic programming and generalized Q-learning algorithms to NVMDPs, along with formal convergence proofs. For problems requiring function approximation, we extend the Policy Gradient Theorem and the policy improvement bound in Trust Region Policy Optimization (TRPO), offering proofs in both scalar and matrix forms. Empirical evaluations in a non-stationary gridworld environment demonstrate that NVMDP-based algorithms successfully recover optimal trajectories under multiple reward and discounting schemes, whereas original Q-learning fails. These results collectively show that NVMDPs provide a theoretically sound and practically effective framework for reinforcement learning, requiring only minor algorithmic modifications while enabling robust handling of non-stationarity and explicit optimal policy shaping.
中文摘要 在平稳马尔可夫决策过程（MDP）下开发的算法在非平稳环境中常面临挑战，而无限时域的表述可能无法直接适用于有限时域任务。为解决这些限制，我们引入了非平稳且变折扣的MDP（NVMDP）框架，该框架自然地容纳非平稳性，并允许折扣率随时间和状态转移变化。就寻找最优策略而言，无限时域的平稳MDP是NVMDP的特例，有限时域MDP也被纳入NVMDP的表述中。此外，NVMDP提供了一种灵活的机制来塑造最优策略，而无需改变状态空间、动作空间或奖励结构。我们建立了NVMDP的理论基础，包括假设、状态值和动作值的表述与递归、矩阵表示、最优性条件，以及有限状态和动作空间下的策略改进。基于这些结果，我们将动态规划和广义Q学习算法适配到NVMDP，并给出形式化的收敛性证明。对于需要函数近似的问题，我们扩展了策略梯度定理和信赖域策略优化（TRPO）中的策略改进界，并给出标量和矩阵两种形式的证明。在非平稳网格世界环境中的实证评估表明，基于NVMDP的算法在多种奖励和折扣方案下成功恢复最优轨迹，而原始Q学习则失败。这些结果共同表明，NVMDP为强化学习提供了一个理论上可靠且实践中有效的框架，只需少量算法修改，即可稳健地处理非平稳性并显式地塑造最优策略。
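Illustrative note: one simple way to picture a discount that varies with time is a return in which each step carries its own discount factor. The sketch below assumes the recursion G_t = r_t + d_t * G_{t+1}; the function name and this exact form are assumptions made for illustration, and the paper's NVMDP definition (which also lets discounts depend on transitions) may differ.

```python
def varying_discount_return(rewards, discounts):
    """Return G_0 where G_t = rewards[t] + discounts[t] * G_{t+1}.

    discounts[t] is the factor applied to everything after step t.
    """
    if len(rewards) != len(discounts):
        raise ValueError("rewards and discounts must have equal length")
    g = 0.0
    # Work backwards from the last step so each step needs one multiply-add.
    for r, d in zip(reversed(rewards), reversed(discounts)):
        g = r + d * g
    return g

# A constant discount of 0.5 recovers the usual discounted sum 1 + 0.5 + 0.25.
print(varying_discount_return([1.0, 1.0, 1.0], [0.5, 0.5, 0.0]))  # 1.75
# A zero discount after the first step truncates the horizon to one step.
print(varying_discount_return([1.0, 1.0], [0.0, 0.9]))  # 1.0
```

Under this assumed form, a stationary infinite-horizon problem corresponds to a constant discount list, and a finite horizon corresponds to setting the discount to zero at the final step.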

Can we use LLMs to bootstrap reinforcement learning? -- A case study in digital health behavior change

我们能用大型语言模型来启动强化学习吗？——数字健康行为改变的案例研究

Authors: Nele Albers, Esra Cemre Su de Groot, Loes Keijsers, Manon H. Hillegers, Emiel Krahmer
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2511.17630
Pdf link: https://arxiv.org/pdf/2511.17630
Abstract Personalizing digital applications for health behavior change is a promising route to making them more engaging and effective. This especially holds for approaches that adapt to users and their specific states (e.g., motivation, knowledge, wants) over time. However, developing such approaches requires making many design choices, whose effectiveness is difficult to predict from literature and costly to evaluate in practice. In this work, we explore whether large language models (LLMs) can be used out-of-the-box to generate samples of user interactions that provide useful information for training reinforcement learning models for digital behavior change settings. Using real user data from four large behavior change studies as comparison, we show that LLM-generated samples can be useful in the absence of real data. Comparisons to the samples provided by human raters further show that LLM-generated samples reach the performance of human raters. Additional analyses of different prompting strategies including shorter and longer prompt variants, chain-of-thought prompting, and few-shot prompting show that the relative effectiveness of different strategies depends on both the study and the LLM with also relatively large differences between prompt paraphrases alone. We provide recommendations for how LLM-generated samples can be useful in practice.
中文摘要 对促进健康行为改变的数字应用进行个性化，是让它们更具吸引力、更有效的一条有前景的途径。对于那些随时间适应用户及其特定状态（如动机、知识、需求）的方法而言尤其如此。然而，开发此类方法需要做出许多设计选择，其有效性难以从文献中预测，且在实践中评估成本高昂。本研究探讨了大型语言模型（LLM）能否开箱即用地生成用户交互样本，为训练数字行为改变场景下的强化学习模型提供有用信息。我们以四项大型行为改变研究的真实用户数据作为对照，表明在缺乏真实数据的情况下，LLM生成的样本是有用的。与人类评分者提供的样本的比较进一步表明，LLM生成的样本达到了人类评分者的水平。对不同提示策略（包括较短和较长的提示变体、思维链提示和少样本提示）的进一步分析显示，不同策略的相对有效性既取决于具体研究也取决于所用的LLM，而且仅仅是提示词的不同改写之间也存在较大差异。我们就LLM生成的样本如何在实践中发挥作用给出了建议。

Smart Manufacturing: MLOps-Enabled Event-Driven Architecture for Enhanced Control in Steel Production

智能制造：支持MLOps的事件驱动架构，增强钢铁生产控制

Authors: Bestoun S. Ahmed, Tommaso Azzalin, Andreas Kassler, Andreas Thore, Hans Lindback
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.17632
Pdf link: https://arxiv.org/pdf/2511.17632
Abstract We explore a Digital Twin-Based Approach for Smart Manufacturing to improve Sustainability, Efficiency, and Cost-Effectiveness for a steel production plant. Our system is based on a micro-service edge-compute platform that ingests real-time sensor data from the process into a digital twin over a converged network infrastructure. We implement agile machine learning-based control loops in the digital twin to optimize induction furnace heating, enhance operational quality, and reduce process waste. Key to our approach is a Deep Reinforcement learning-based agent used in our machine learning operation (MLOps) driven system to autonomously correlate the system state with its digital twin to identify correction actions that aim to optimize power settings for the plant. We present the theoretical basis, architectural details, and practical implications of our approach to reduce manufacturing waste and increase production quality. We design the system for flexibility so that our scalable event-driven architecture can be adapted to various industrial applications. With this research, we propose a pivotal step towards the transformation of traditional processes into intelligent systems, aligning with sustainability goals and emphasizing the role of MLOps in shaping the future of data-driven manufacturing.
中文摘要 我们探讨一种基于数字孪生的智能制造方法，以提升一家钢铁生产工厂的可持续性、效率和成本效益。我们的系统基于一个微服务边缘计算平台，该平台通过融合网络基础设施将生产过程中的实时传感器数据导入数字孪生。我们在数字孪生中实现敏捷的、基于机器学习的控制回路，以优化感应炉加热、提升运行质量并减少工艺浪费。我们方法的关键是一个基于深度强化学习的智能体，它用于我们由机器学习运维（MLOps）驱动的系统中，自主地将系统状态与其数字孪生相关联，以识别旨在优化工厂功率设置的纠正动作。我们介绍了该方法在减少制造浪费和提高生产质量方面的理论基础、架构细节及实际意义。我们在系统设计上注重灵活性，使我们的可扩展事件驱动架构能够适应各种工业应用。通过这项研究，我们提出了将传统流程转变为智能系统的关键一步，这与可持续发展目标保持一致，并强调了MLOps在塑造数据驱动制造的未来中的作用。

Dialogue Diplomats: An End-to-End Multi-Agent Reinforcement Learning System for Automated Conflict Resolution and Consensus Building

对话外交官：一个端到端的多智能体强化学习系统，用于自动化冲突解决与共识构建

Authors: Deepak Bolleddu
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.17654
Pdf link: https://arxiv.org/pdf/2511.17654
Abstract Conflict resolution and consensus building represent critical challenges in multi-agent systems, negotiations, and collaborative decision-making processes. This paper introduces Dialogue Diplomats, a novel end-to-end multi-agent reinforcement learning (MARL) framework designed for automated conflict resolution and consensus building in complex, dynamic environments. The proposed system integrates advanced deep reinforcement learning architectures with dialogue-based negotiation protocols, enabling autonomous agents to engage in sophisticated conflict resolution through iterative communication and strategic adaptation. We present three primary contributions: first, a novel Hierarchical Consensus Network (HCN) architecture that combines attention mechanisms with graph neural networks to model inter-agent dependencies and conflict dynamics; second, a Progressive Negotiation Protocol (PNP) that structures multi-round dialogue interactions with adaptive concession strategies; and third, a Context-Aware Reward Shaping mechanism that balances individual agent objectives with collective consensus goals.
中文摘要 冲突解决和共识构建是多智能体系统、谈判和协作决策过程中的关键挑战。本文介绍了Dialogue Diplomats（对话外交官），这是一种新颖的端到端多智能体强化学习（MARL）框架，旨在复杂、动态的环境中实现自动化的冲突解决和共识构建。该系统将先进的深度强化学习架构与基于对话的谈判协议相结合，使自主智能体能够通过迭代沟通和策略性调整进行复杂的冲突解决。我们提出了三项主要贡献：第一，一种新型的分层共识网络（HCN）架构，将注意力机制与图神经网络相结合，用于建模智能体间的依赖关系和冲突动态；第二，一套渐进式谈判协议（PNP），以自适应让步策略组织多轮对话交互；第三，一种情境感知的奖励塑造机制，在个体智能体目标与集体共识目标之间取得平衡。

LEARN: Learning End-to-End Aerial Resource-Constrained Multi-Robot Navigation

LEARN：学习端到端的资源受限多机器人空中导航

Authors: Darren Chiu, Zhehui Huang, Ruohai Ge, Gaurav S. Sukhatme
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2511.17765
Pdf link: https://arxiv.org/pdf/2511.17765
Abstract Nano-UAV teams offer great agility yet face severe navigation challenges due to constrained onboard sensing, communication, and computation. Existing approaches rely on high-resolution vision or compute-intensive planners, rendering them infeasible for these platforms. We introduce LEARN, a lightweight, two-stage safety-guided reinforcement learning (RL) framework for multi-UAV navigation in cluttered spaces. Our system combines low-resolution Time-of-Flight (ToF) sensors and a simple motion planner with a compact, attention-based RL policy. In simulation, LEARN outperforms two state-of-the-art planners by 10% while using substantially fewer resources. We demonstrate LEARN's viability on six Crazyflie quadrotors, achieving fully onboard flight in diverse indoor and outdoor environments at speeds up to 2.0 m/s and traversing 0.2 m gaps.
中文摘要 纳米无人机团队具备极高的灵活性，但由于机载感知、通信和计算受限，导航面临严峻挑战。现有方法依赖高分辨率视觉或计算密集型规划器，因而在这些平台上不可行。我们提出LEARN，一种轻量级、两阶段、安全引导的强化学习（RL）框架，用于多无人机在杂乱空间中的导航。我们的系统将低分辨率飞行时间（ToF）传感器和简单的运动规划器与一个紧凑的、基于注意力的强化学习策略相结合。在仿真中，LEARN的表现比两种最先进的规划器高出10%，同时使用的资源大幅减少。我们在六架Crazyflie四旋翼飞行器上展示了LEARN的可行性，在多样的室内外环境中实现了完全机载的飞行，速度最高可达2.0 m/s，并能穿越0.2 m的间隙。

Deterministic Inference across Tensor Parallel Sizes That Eliminates Training-Inference Mismatch

跨张量并行规模的确定性推理，消除训练-推理不匹配

Authors: Ziyang Zhang, Xinheng Ding, Jiayi Yuan, Rixin Liu, Huizi Mao, Jiarong Xing, Zirui Liu
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2511.17826
Pdf link: https://arxiv.org/pdf/2511.17826
Abstract Deterministic inference is increasingly critical for large language model (LLM) applications such as LLM-as-a-judge evaluation, multi-agent systems, and Reinforcement Learning (RL). However, existing LLM serving frameworks exhibit non-deterministic behavior: identical inputs can yield different outputs when system configurations (e.g., tensor parallel (TP) size, batch size) vary, even under greedy decoding. This arises from the non-associativity of floating-point arithmetic and inconsistent reduction orders across GPUs. While prior work has addressed batch-size-related nondeterminism through batch-invariant kernels, determinism across different TP sizes remains an open problem, particularly in RL settings, where the training engine typically uses Fully Sharded Data Parallel (i.e., TP = 1) while the rollout engine relies on multi-GPU TP to maximize the inference throughput, creating a natural mismatch between the two. This precision mismatch problem may lead to suboptimal performance or even collapse for RL training. We identify and analyze the root causes of TP-induced inconsistency and propose Tree-Based Invariant Kernels (TBIK), a set of TP-invariant matrix multiplication and reduction primitives that guarantee bit-wise identical results regardless of TP size. Our key insight is to align intra- and inter-GPU reduction orders through a unified hierarchical binary tree structure. We implement these kernels in Triton and integrate them into vLLM and FSDP. Experiments confirm zero probability divergence and bit-wise reproducibility for deterministic inference across different TP sizes. Also, we achieve bit-wise identical results between vLLM and FSDP in RL training pipelines with different parallel strategies. Code is available at this https URL.
中文摘要 确定性推理对大型语言模型（LLM）应用日益重要，例如LLM-as-a-judge评估、多智能体系统和强化学习（RL）。然而，现有的LLM服务框架表现出非确定性行为：当系统配置（如张量并行（TP）规模、批大小）变化时，即使在贪婪解码下，相同的输入也可能产生不同的输出。这源于浮点运算的非结合性以及GPU间归约顺序的不一致。虽然此前的工作已通过批不变内核解决了与批大小相关的非确定性，但不同TP规模间的确定性仍是一个未解决的问题，尤其是在RL场景中：训练引擎通常使用全分片数据并行（Fully Sharded Data Parallel，即TP = 1），而rollout引擎依赖多GPU张量并行来最大化推理吞吐量，导致两者之间天然不匹配。这种精度不匹配问题可能导致RL训练效果欠佳，甚至崩溃。我们识别并分析了TP引起不一致的根本原因，并提出了基于树的不变内核（TBIK），这是一组TP不变的矩阵乘法和归约原语，保证无论TP规模如何都能得到逐位相同的结果。我们的核心思路是通过统一的层级二叉树结构对齐GPU内和GPU间的归约顺序。我们用Triton实现了这些内核，并将其集成到vLLM和FSDP中。实验证实，在不同TP规模下进行确定性推理时，概率偏差为零且结果可逐位复现。此外，我们在采用不同并行策略的RL训练流水线中，实现了vLLM与FSDP之间逐位相同的结果。代码可在此 https URL 访问。
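Illustrative note: the root cause named in the abstract, non-associative floating-point addition, can be reproduced in a few lines, along with the general idea of fixing the reduction order with a binary tree. The sketch below is plain Python, not the paper's Triton kernels; `tree_sum` and `sharded_tree_sum` are hypothetical helpers, and the shard layout (equal power-of-two shard sizes standing in for different TP sizes) is an assumption.

```python
def tree_sum(xs):
    """Pairwise (binary-tree) reduction whose order depends only on len(xs)."""
    xs = list(xs)
    if not xs:
        return 0.0
    while len(xs) > 1:
        nxt = [xs[i] + xs[i + 1] for i in range(0, len(xs) - 1, 2)]
        if len(xs) % 2:
            nxt.append(xs[-1])  # odd element is carried up unchanged
        xs = nxt
    return xs[0]

def sharded_tree_sum(xs, num_shards):
    """Reduce each shard with a tree, then reduce the partials with a tree.

    Assumes len(xs) is a power of two and divisible by num_shards, so each
    shard is exactly one subtree of the global reduction tree.
    """
    size = len(xs) // num_shards
    partials = [tree_sum(xs[i * size:(i + 1) * size]) for i in range(num_shards)]
    return tree_sum(partials)

# Floating-point addition is not associative:
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False

# With a shared tree order, the result does not depend on the shard count.
values = [0.1 * k for k in range(1, 9)]
results = {sharded_tree_sum(values, n) for n in (1, 2, 4, 8)}
print(len(results))  # 1
```

Because every shard boundary coincides with a subtree boundary, each addition is performed on the same operands in the same order regardless of how many shards are used, which is why the results are bit-identical.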

Transformers with RL or SFT Provably Learn Sparse Boolean Functions, But Differently

采用RL或SFT的Transformer可证明地学习稀疏布尔函数，但方式不同

Authors: Bochen Lyu, Yiyang Jia, Xiaohao Cai, Zhanxing Zhu
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2511.17852
Pdf link: https://arxiv.org/pdf/2511.17852
Abstract Transformers can acquire Chain-of-Thought (CoT) capabilities to solve complex reasoning tasks through fine-tuning. Reinforcement learning (RL) and supervised fine-tuning (SFT) are two primary approaches to this end, yet their underlying mechanisms and differences remain theoretically unclear. In this work, we examine these aspects specifically for learning $k$-sparse Boolean functions with a one-layer transformer and intermediate supervision that is akin to CoT. In particular, we consider $k$-sparse Boolean functions that can be recursively decomposed into fixed 2-sparse Boolean functions. We analyze the learning dynamics of fine-tuning the transformer via either RL or SFT with CoT to identify sufficient conditions for it to provably learn these functions. We verify that these conditions hold for three basic examples, including $k$-PARITY, $k$-AND, and $k$-OR, thus demonstrating the learnability of both approaches. Notably, we reveal that RL and SFT exhibit distinct learning behaviors: RL learns the whole CoT chain simultaneously, whereas SFT learns the CoT chain step-by-step. Overall, our findings provide theoretical insights into the underlying mechanisms of RL and SFT as well as how they differ in triggering the CoT capabilities of transformers.
中文摘要 Transformer可以通过微调获得思维链（Chain-of-Thought，简称CoT）能力，以解决复杂的推理任务。强化学习（RL）和监督微调（SFT）是实现这一目标的两种主要方法，但它们的内在机制和差异在理论上仍不清楚。在本研究中，我们针对用单层Transformer和类似CoT的中间监督来学习$k$-稀疏布尔函数这一问题，专门研究这些方面。具体而言，我们考虑可以递归分解为固定的2-稀疏布尔函数的$k$-稀疏布尔函数。我们分析了借助CoT、通过RL或SFT微调Transformer的学习动力学，以找出使其可证明地学会这些函数的充分条件。我们验证了这些条件对三个基本例子成立，即$k$-PARITY、$k$-AND和$k$-OR，从而证明了两种方法的可学习性。值得注意的是，我们发现RL和SFT表现出不同的学习行为：RL同时学习整条CoT链，而SFT则逐步学习CoT链。总体而言，我们的发现为RL和SFT的内在机制，以及它们在激发Transformer的CoT能力方面的差异提供了理论见解。
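Illustrative note: to make the "recursive decomposition into fixed 2-sparse Boolean functions" concrete, the sketch below evaluates $k$-PARITY, $k$-AND, and $k$-OR by folding one fixed two-input function over the $k$ relevant bits and recording every intermediate value, which plays the role of a CoT-like trace. The left-to-right fold and the helper names are assumptions made for illustration; the paper's exact decomposition and supervision format are not given in the abstract.

```python
from itertools import product

def xor2(a, b):
    return a ^ b

def and2(a, b):
    return a & b

def or2(a, b):
    return a | b

def chain_of_thought(bits, op):
    """Fold a fixed 2-input Boolean function over bits, keeping each step."""
    steps = [bits[0]]
    for b in bits[1:]:
        steps.append(op(steps[-1], b))
    return steps

def k_sparse(x, idx, op):
    """Evaluate the k-sparse function that depends only on x[i] for i in idx."""
    return chain_of_thought([x[i] for i in idx], op)[-1]

# Intermediate results for 4-PARITY on the bits 1, 1, 0, 1.
print(chain_of_thought([1, 1, 0, 1], xor2))  # [1, 0, 0, 1]

# 3-PARITY over a 5-bit input: only positions 0, 2, 4 matter.
relevant = (0, 2, 4)
ok = all(
    k_sparse(x, relevant, xor2) == (x[0] + x[2] + x[4]) % 2
    for x in product([0, 1], repeat=5)
)
print(ok)  # True
```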

Training Emergent Joint Associations: A Reinforcement Learning Approach to Creative Thinking in Language Models

培养涌现的联合联想：语言模型中创造性思维的强化学习方法

Authors: Mukul Singh, Ananya Singha, Aishni Parab, Pronita Mehrotra, Sumit Gulwani
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.17876
Pdf link: https://arxiv.org/pdf/2511.17876
Abstract Associative thinking--the ability to connect seemingly unrelated ideas--is a foundational element of human creativity and problem-solving. This paper explores whether reinforcement learning (RL) guided by associative thinking principles can enhance a model's performance across diverse generative tasks, including story writing, code generation, and chart creation. We introduce a reinforcement learning framework that uses a prompt-based evaluation mechanism, incorporating established divergent thinking metrics from creativity research. A base language model is fine-tuned using this framework to reward outputs demonstrating higher novelty through higher degrees of conceptual connectivity. Interestingly, the experimental results suggest that RL-based associative thinking-trained models not only generate more original and coherent stories but also exhibit improved abstraction and flexibility in tasks such as programming and data visualization. Our findings provide initial evidence that modeling cognitive creativity principles through reinforcement learning can yield more adaptive and generative AI.
中文摘要 联想思维——连接看似无关的想法的能力——是人类创造力和解决问题的基础要素。本文探讨了以联想思维原则为指导的强化学习（RL）是否能提升模型在多种生成任务中的表现，包括故事编写、代码生成和图表制作。我们引入了一个基于提示的评估机制强化学习框架，融合了创造力研究中既有的发散性思维指标。利用该框架微调基础语言模型，奖励通过更高概念连接性展现更高新颖性的输出。有趣的是，实验结果表明基于强化学习的联想思维训练模型不仅能产生更具原创性和连贯性的故事，还在编程和数据可视化等任务中展现出更好的抽象性和灵活性。我们的发现提供了初步证据，表明通过强化学习建模认知创造力原则可以带来更具适应性和生成性的人工智能。

Generative Adversarial Post-Training Mitigates Reward Hacking in Live Human-AI Music Interaction

生成对抗式后训练缓解了现场人机音乐交互中的奖励黑客行为

Authors: Yusong Wu, Stephen Brade, Teng Ma, Tia-Jane Fowler, Enning Yang, Berker Banar, Aaron Courville, Natasha Jaques, Cheng-Zhi Anna Huang
Subjects: Machine Learning (cs.LG); Sound (cs.SD)
Arxiv link: https://arxiv.org/abs/2511.17879
Pdf link: https://arxiv.org/pdf/2511.17879
Abstract Most applications of generative AI involve a sequential interaction in which a person inputs a prompt and waits for a response, and where reaction time and adaptivity are not important factors. In contrast, live jamming is a collaborative interaction that requires real-time coordination and adaptation without access to the other player's future moves, while preserving diversity to sustain a creative flow. Reinforcement learning post-training enables effective adaptation through on-policy interaction, yet it often reduces output diversity by exploiting coherence-based rewards. This collapse, known as "reward hacking", affects many RL post-training pipelines, but is especially harmful in live jamming, where musical creativity relies on dynamic variation and mutual responsiveness. In this paper, we propose a novel adversarial training method on policy-generated trajectories to mitigate reward hacking in RL post-training for melody-to-chord accompaniment. A co-evolving discriminator separates policy trajectories from the data distribution, while the policy maximizes the discriminator output in addition to coherence rewards to prevent collapse to trivial outputs. We evaluate accompaniment quality and output diversity in simulation with both fixed test melodies and learned melody agents, and we conduct a user study with the model deployed in a real-time interactive system with expert musicians. Quantitative evaluation and user feedback demonstrate improved output diversity, harmonic coherence, adaptation speed and user agency. Our results demonstrate a simple yet effective method to mitigate reward hacking in RL post-training of generative sequence models.
中文摘要 大多数生成式人工智能的应用涉及顺序交互：用户输入提示并等待响应，反应时间和适应性并非重要因素。相比之下，现场即兴合奏是一种协作式互动，需要在无法获知对方未来动作的情况下进行实时协调和适应，同时保持多样性以维持创作的流畅。强化学习后训练通过同策略（on-policy）交互实现有效适应，但它常常因利用基于连贯性的奖励而降低输出多样性。这种崩溃被称为“奖励黑客”，影响着许多强化学习后训练流程，但在现场即兴合奏中尤其有害，因为音乐创造力依赖动态变化和相互响应。本文提出了一种在策略生成的轨迹上进行的新型对抗训练方法，以缓解旋律到和弦伴奏的强化学习后训练中的奖励黑客行为。一个共同演化的判别器将策略轨迹与数据分布区分开来，而策略除了最大化连贯性奖励之外还最大化判别器输出，以防止坍缩为平凡输出。我们在仿真中使用固定测试旋律和学习得到的旋律智能体评估伴奏质量和输出多样性，并将模型部署在实时交互系统中与专业音乐家一起开展用户研究。定量评估和用户反馈表明，输出多样性、和声连贯性、适应速度和用户自主性均有所提升。我们的结果展示了一种简单而有效的方法，可以在生成式序列模型的强化学习后训练中缓解奖励黑客行为。

MobileVLA-R1: Reinforcing Vision-Language-Action for Mobile Robots

MobileVLA-R1：强化移动机器人的视觉-语言-动作

Authors: Ting Huang, Dongjian Li, Rui Yang, Zeyu Zhang, Zida Yang, Hao Tang
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.17889
Pdf link: https://arxiv.org/pdf/2511.17889
Abstract Grounding natural-language instructions into continuous control for quadruped robots remains a fundamental challenge in vision language action. Existing methods struggle to bridge high-level semantic reasoning and low-level actuation, leading to unstable grounding and weak generalization in the real world. To address these issues, we present MobileVLA-R1, a unified vision-language-action framework that enables explicit reasoning and continuous control for quadruped robots. We construct MobileVLA-CoT, a large-scale dataset of multi-granularity chain-of-thought (CoT) for embodied trajectories, providing structured reasoning supervision for alignment. Built upon this foundation, we introduce a two-stage training paradigm that combines supervised CoT alignment with GRPO reinforcement learning to enhance reasoning consistency, control stability, and long-horizon execution. Extensive evaluations on VLN and VLA tasks demonstrate superior performance over strong baselines, with approximately a 5% improvement. Real-world deployment on a quadruped robot validates robust performance in complex environments. Code: this https URL. Website: this https URL.
中文摘要 将自然语言指令落实为四足机器人的连续控制，仍然是视觉-语言-动作领域的根本挑战。现有方法难以衔接高层语义推理与低层执行，导致在现实世界中的落地不稳定、泛化能力弱。为解决这些问题，我们提出了MobileVLA-R1，一个统一的视觉-语言-动作框架，能够实现四足机器人的显式推理和连续控制。我们构建了MobileVLA-CoT，一个面向具身轨迹的多粒度思维链（CoT）大规模数据集，为对齐提供结构化的推理监督。在此基础上，我们引入了一种两阶段训练范式，将监督式CoT对齐与GRPO强化学习相结合，以增强推理一致性、控制稳定性和长时程执行能力。在VLN和VLA任务上的广泛评估显示，其性能优于强基线，提升约5%。在四足机器人上的真实部署验证了其在复杂环境中的稳健性能。代码：此 https URL。网站：此 https URL。
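Illustrative note: GRPO, named in the abstract above, is commonly described as scoring each sampled response relative to the other responses drawn for the same prompt. The sketch below shows that group-relative normalization only. The function name, the use of the population standard deviation, and the `eps` value are assumptions; the paper's reward design and training loop are not described in the abstract.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std of the group (assumed)
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts for the same instruction; two succeeded and two failed.
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in adv])  # [1.0, -1.0, -1.0, 1.0]
```

Responses that beat the group average get a positive advantage and the rest get a negative one, so no separate value network is needed to form a baseline.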

DISPATCH -- Decentralized Informed Spatial Planning and Assignment of Tasks for Cooperative Heterogeneous Agents

DISPATCH——为合作异构代理提供分散式知情的空间规划与任务分配

Authors: Yao Liu, Sampad Mohanty, Elizabeth Ondula, Bhaskar Krishnamachari
Subjects: Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2511.17915
Pdf link: https://arxiv.org/pdf/2511.17915
Abstract Spatial task allocation in systems such as multi-robot delivery or ride-sharing requires balancing efficiency with fair service across tasks. Greedy assignment policies that match each agent to its highest-preference or lowest-cost task can maximize efficiency but often create inequities: some tasks receive disproportionately favorable service (e.g., shorter delays or better matches), while others face long waits or poor allocations. We study fairness in heterogeneous multi-agent systems where tasks vary in preference alignment and urgency. Most existing approaches either assume centralized coordination or largely ignore fairness under partial observability. Distinct from this prior work, we establish a connection between the Eisenberg-Gale (EG) equilibrium convex program and decentralized, partially observable multi-agent learning. Building on this connection, we develop two equilibrium-informed algorithms that integrate fairness and efficiency: (i) a multi-agent reinforcement learning (MARL) framework, EG-MARL, whose training is guided by centralized fair assignment algorithms (EG and a preference-aware Hungarian method); and (ii) a stochastic online optimization mechanism that performs guided exploration and subset-based fair assignment as tasks are discovered. We evaluate our frameworks across a range of team sizes and assignment formulations against centralized EG, Hungarian, and Min-Max Distance baselines. Both algorithms preserve the fairness-efficiency balance of the Eisenberg-Gale equilibrium under partial observability. EG-MARL achieves near-centralized coordination and reduced travel distances, while the stochastic online mechanism enables real-time allocation with competitive fairness. Together, these results demonstrate that spatially aware EG formulations can effectively guide decentralized coordination in agents with heterogeneous capabilities.
中文摘要 在多机器人配送或拼车等系统中，空间任务分配需要在效率与各任务间的公平服务之间取得平衡。贪心分配策略将每个智能体匹配到其偏好最高或成本最低的任务，可以最大化效率，但常常造成不公平：某些任务获得不成比例的优待（例如，延迟更短或匹配更好），而另一些任务则面临长时间等待或较差的分配。我们研究异构多智能体系统中的公平性，其中各任务在偏好一致性和紧迫性上有所不同。大多数现有方法要么假设集中式协调，要么在部分可观测条件下基本忽视公平性。与这些已有工作不同，我们建立了Eisenberg-Gale（EG）均衡凸规划与去中心化、部分可观测的多智能体学习之间的联系。基于这一联系，我们开发了两种融合公平性与效率的、以均衡为指导的算法：（i）一个多智能体强化学习（MARL）框架EG-MARL，其训练由集中式公平分配算法（EG和一种偏好感知的匈牙利方法）指导；以及（ii）一种随机在线优化机制，在任务被发现的过程中执行引导式探索和基于子集的公平分配。我们在一系列团队规模和分配形式下，将我们的框架与集中式EG、匈牙利算法和最小最大距离（Min-Max Distance）基线进行比较评估。两种算法都在部分可观测条件下保持了Eisenberg-Gale均衡的公平性与效率平衡。EG-MARL实现了接近集中式的协调并缩短了行驶距离，而随机在线机制则实现了具有竞争力的公平性的实时分配。这些结果共同表明，具备空间感知的EG表述能够有效引导能力各异的智能体进行去中心化协调。
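Illustrative note: for readers unfamiliar with the Eisenberg-Gale program mentioned above, its textbook form for a linear Fisher market is shown below. Here $B_i$ is the budget (weight) of buyer $i$, $v_{ij}$ is buyer $i$'s value for good $j$, and $x_{ij}$ is the fraction of good $j$ given to buyer $i$. This is the generic formulation, stated as background; how the paper maps agents, tasks, urgency, and spatial costs onto these symbols is not specified in the abstract.

```latex
\begin{aligned}
\max_{x \ge 0} \quad & \sum_{i} B_i \log u_i \\
\text{s.t.} \quad & u_i = \sum_{j} v_{ij}\, x_{ij} \quad \forall i, \\
& \sum_{i} x_{ij} \le 1 \quad \forall j .
\end{aligned}
```

The logarithm is what trades efficiency against fairness: total utility is rewarded, but a participant whose utility approaches zero drives the objective toward negative infinity, so no one is starved.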

PA-FAS: Towards Interpretable and Generalizable Multimodal Face Anti-Spoofing via Path-Augmented Reinforcement Learning

PA-FAS：通过路径增强强化学习实现可解释且可推广的多模态人脸反欺骗

Authors: Yingjie Ma, Xun Lin, Yong Xu, Weicheng Xie, Zitong Yu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.17927
Pdf link: https://arxiv.org/pdf/2511.17927
Abstract Face anti-spoofing (FAS) has recently advanced in multimodal fusion, cross-domain generalization, and interpretability. With large language models and reinforcement learning (RL), strategy-based training offers new opportunities to jointly model these aspects. However, multimodal reasoning is more complex than unimodal reasoning, requiring accurate feature representation and cross-modal verification while facing scarce, high-quality annotations, which makes direct application of RL sub-optimal. We identify two key limitations of supervised fine-tuning plus RL (SFT+RL) for multimodal FAS: (1) limited multimodal reasoning paths restrict the use of complementary modalities and shrink the exploration space after SFT, weakening the effect of RL; and (2) mismatched single-task supervision versus diverse reasoning paths causes reasoning confusion, where models may exploit shortcuts by mapping images directly to answers and ignoring the intended reasoning. To address this, we propose PA-FAS, which enhances reasoning paths by constructing high-quality extended reasoning sequences from limited annotations, enriching paths and relaxing exploration constraints. We further introduce an answer-shuffling mechanism during SFT to force comprehensive multimodal analysis instead of using superficial cues, thereby encouraging deeper reasoning and mitigating shortcut learning. PA-FAS significantly improves multimodal reasoning accuracy and cross-domain generalization, and better unifies multimodal fusion, generalization, and interpretability for trustworthy FAS.
中文摘要 面部反欺骗（FAS）近年来在多模融合、跨域泛化和可解释性方面取得了进展。借助大型语言模型和强化学习（RL），基于策略的训练为共同建模这些方面提供了新的机会。然而，多模推理比单模推理更复杂，需要准确的特征表示和跨模态验证，同时面对稀缺且高质量的注释，这使得直接应用强化学习并不理想。我们确定了监督式微调加RL（SFT+RL）对多模态FAS的两个关键局限：（1）有限的多模态推理路径限制了互补模态的使用，并缩小了SFT后的探索空间，削弱了强化学习的效果;以及（2）单任务监督与多样推理路径不匹配会导致推理混淆，模型可能利用捷径，直接将图像映射到答案，忽略预期推理。为此，我们提出了PA-FAS，通过从有限的注释构建高质量的扩展推理序列，丰富路径并放宽探索约束，从而增强推理路径。我们在SFT中进一步引入了答案洗牌机制，以强制全面的多模态分析，而非使用表面线索，从而鼓励更深层次的推理并减少捷径学习。PA-FAS显著提升了多模态推理的准确性和跨域泛化能力，更好地统一了多模态融合、泛化和可解释性，以实现值得信赖的FAS。

A Reinforcement Learning Framework for Resource Allocation in Uplink Carrier Aggregation in the Presence of Self Interference

在存在自干扰的情况下，上行载波聚合资源分配的强化学习框架

Authors: Jaswanth Bodempudi, Batta Siva Sairam, Madepalli Haritha, Sandesh Rao Mattu, Ananthanarayanan Chockalingam
Subjects: Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2511.17931
Pdf link: https://arxiv.org/pdf/2511.17931
Abstract Carrier aggregation (CA) is a technique that allows mobile networks to combine multiple carriers to increase user data rate. On the uplink, for power constrained users, this translates to the need for an efficient resource allocation scheme, where each user distributes its available power among its assigned uplink carriers. Choosing a good set of carriers and allocating appropriate power on the carriers is important. If the carrier allocation on the uplink is such that a harmonic of a user's uplink carrier falls on the downlink frequency of that user, it leads to a self coupling-induced sensitivity degradation of that user's downlink receiver. In this paper, we model the uplink carrier aggregation problem as an optimal resource allocation problem with the associated constraints of non-linearities induced self interference (SI). This involves optimization over a discrete variable (which carriers need to be turned on) and a continuous variable (what power needs to be allocated on the selected carriers) in dynamic environments, a problem which is hard to solve using traditional methods owing to the mixed nature of the optimization variables and the additional need to consider the SI constraint. We adopt a reinforcement learning (RL) framework involving a compound-action actor-critic (CA2C) algorithm for the uplink carrier aggregation problem. We propose a novel reward function that is critical for enabling the proposed CA2C algorithm to efficiently handle SI. The CA2C algorithm along with the proposed reward function learns to assign and activate suitable carriers in an online fashion. Numerical results demonstrate that the proposed RL based scheme is able to achieve higher sum throughputs compared to naive schemes. The results also demonstrate that the proposed reward function allows the CA2C algorithm to adapt the optimization both in the presence and absence of SI.
中文摘要 载波聚合（CA）是一种技术，允许移动网络将多个载波合并以提升用户数据速率。对于功率受限的用户来说，上行需要高效的资源分配方案，每个用户将其可用功率分配给指定的上行载波。选择一组合适的载波并在其上分配合适的功率非常重要。如果上行链路的载波分配使得用户的上行载波的谐波落在该用户的下行频率上，则会因自耦合导致该用户下行接收机的灵敏度下降。本文将上行载波聚合问题建模为一个最优资源分配问题，并附带由非线性引起的自干扰（SI）的相关约束。这涉及在动态环境中对离散变量（需要开启哪些载波）和连续变量（在选定载波上需要分配多少功率）进行优化，由于优化变量的混合性质以及需要额外考虑SI约束，传统方法难以解决这一问题。我们采用了强化学习（RL）框架，使用复合动作演员-评论家（CA2C）算法来解决上行载波聚合问题。我们提出了一种新颖的奖励函数，这对于使所提CA2C算法能够高效处理SI至关重要。CA2C算法结合所提的奖励函数，能够以在线方式学习分配和激活合适的载波。数值结果表明，所提出的基于强化学习的方案能够比朴素方案实现更高的和吞吐量。结果还表明，所提出的奖励函数使CA2C算法无论在存在还是不存在SI的情况下都能自适应地进行优化。

SPINE: Token-Selective Test-Time Reinforcement Learning with Entropy-Band Regularization

SPINE：带熵带正则化的令牌选择性测试时强化学习

Authors: Jianghao Wu, Yasmeen George, Jin Ye, Yicheng Wu, Daniel F. Schmidt, Jianfei Cai
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.17938
Pdf link: https://arxiv.org/pdf/2511.17938
Abstract Large language models (LLMs) and multimodal LLMs (MLLMs) excel at chain-of-thought reasoning but face distribution shift at test-time and a lack of verifiable supervision. Recent test-time reinforcement learning (TTRL) methods derive label-free pseudo-rewards from self-consistency voting over sampled trajectories, yet they often collapse: the majority-vote reward prevails, responses shorten, and Pass@1 declines. We trace this to uniform sequence updates in which most tokens are low-entropy followers, while a small high-entropy subset determines the reasoning branches. Thus we propose SPINE, a token-selective test-time reinforcement learning framework that (i) updates only forking tokens, the high-entropy branch points identified from forward-pass statistics, and (ii) applies an entropy-band regularizer at those tokens to sustain exploration when entropy is too low and to suppress noisy supervision when it is too high. SPINE plugs into GRPO-style objectives, optionally with a KL anchor, and requires no labels or reward models. Across ten benchmarks spanning multimodal VQA, general and expert QA, mathematical reasoning, and medical QA, SPINE consistently improves Pass@1 over TTRL while avoiding response-length collapse and yielding more stable training dynamics on both LLM and MLLM backbones. These results indicate that aligning updates with chain-of-thought branch points is a simple and label-free mechanism for stable and effective test-time adaptation in reasoning models. Code is available at this https URL.
中文摘要 大型语言模型（LLM）和多模态LLM（MLLM）在思维链推理方面表现出色，但在测试时会面临分布偏移以及缺乏可验证监督的问题。最新的测试时强化学习（TTRL）方法通过对采样轨迹的自一致性投票推导出无标签伪奖励，但这些方法常常崩溃：多数票奖励占上风，响应缩短，Pass@1下降。我们将此归因于均匀的序列更新，其中大多数令牌是低熵跟随者，而一个小的高熵子集决定了推理分支。因此，我们提出了SPINE，一种令牌选择性的测试时强化学习框架，它（i）仅更新分叉令牌，即通过前向传递统计识别的高熵分支点，（ii）在这些令牌处应用熵带正则化器，在熵过低时维持探索，在熵过高时抑制噪声监督。SPINE可接入GRPO式目标函数，可选配KL锚点，无需标签或奖励模型。在涵盖多模态VQA、通用与专家QA、数学推理和医学QA的十个基准测试中，SPINE相比TTRL持续提升Pass@1，同时避免响应长度崩溃，并在LLM和MLLM骨干上实现更稳定的训练动态。这些结果表明，将更新与思维链分支点对齐是一种简单且无标签的机制，有助于推理模型中稳定且有效的测试时适应。代码可在此 https URL 访问。
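The token-selection idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `top_frac` cutoff, the band limits `low` and `high`, and the hinge form of the penalty are all assumptions chosen for illustration, since the abstract does not give those details.

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution at each position."""
    z = logits - logits.max(axis=-1, keepdims=True)  # stabilise the softmax
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def forking_mask(entropy, top_frac=0.2):
    """Flag the highest-entropy positions as 'forking' tokens (hypothetical rule)."""
    k = max(1, int(round(top_frac * entropy.size)))
    threshold = np.sort(entropy)[-k]
    return entropy >= threshold

def entropy_band_penalty(entropy, mask, low=0.5, high=2.0):
    """Hinge penalty: zero inside [low, high], linear outside, averaged over masked tokens."""
    below = np.clip(low - entropy, 0.0, None)   # entropy too low: exploration at risk
    above = np.clip(entropy - high, 0.0, None)  # entropy too high: noisy supervision
    return float(((below + above) * mask).sum() / max(mask.sum(), 1))

# Toy sequence of two positions over a 4-word vocabulary.
logits = np.array([[0.0, 0.0, 0.0, 0.0],     # uniform: high entropy
                   [10.0, 0.0, 0.0, 0.0]])   # peaked: low entropy
H = token_entropy(logits)
mask = forking_mask(H, top_frac=0.5)
penalty = entropy_band_penalty(H, mask, low=0.5, high=1.0)
```

In a full training loop the mask would gate which token positions contribute to the policy-gradient loss, and the penalty would be added to that loss as a regularizer.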

Hybrid LSTM and PPO Networks for Dynamic Portfolio Optimization

动态组合优化的混合LSTM和PPO网络

Authors: Jun Kevin, Pujianto Yugopuspito
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Portfolio Management (q-fin.PM)
Arxiv link: https://arxiv.org/abs/2511.17963
Pdf link: https://arxiv.org/pdf/2511.17963
Abstract This paper introduces a hybrid framework for portfolio optimization that fuses Long Short-Term Memory (LSTM) forecasting with a Proximal Policy Optimization (PPO) reinforcement learning strategy. The proposed system leverages the predictive power of deep recurrent networks to capture temporal dependencies, while the PPO agent adaptively refines portfolio allocations in continuous action spaces, allowing the system to anticipate trends while adjusting dynamically to market shifts. Using multi-asset datasets covering U.S. and Indonesian equities, U.S. Treasuries, and major cryptocurrencies from January 2018 to December 2024, the model is evaluated against several baselines, including equal-weight, index-style, and single-model variants (LSTM-only and PPO-only). Performance is measured by annualized return, volatility, Sharpe ratio, and maximum drawdown, each adjusted for transaction costs. The results indicate that the hybrid architecture delivers higher returns and stronger resilience under non-stationary market regimes, suggesting its promise as a robust, AI-driven framework for dynamic portfolio optimization.
中文摘要 本文介绍了一个混合框架，将长短期记忆（LSTM）预测与近距离策略优化（PPO）强化学习策略融合在一起。该系统利用深度循环网络的预测能力捕捉时间依赖关系，而PPO代理则在连续行动空间中自适应地优化投资组合配置，使系统能够在动态调整市场变化的同时预测趋势。利用涵盖2018年1月至2024年12月期间美国和印尼股票、美国国债及主要加密货币的多资产数据集，该模型结合多个基线进行评估，包括等权、指数风格和单一模型变体（仅LSTM和仅PPO）。该框架的性能基于等权重、基于指数和单一模型的方法（仅LSTM和仅PPO）进行基准测试，采用年化收益率、波动率、夏普比率和最大回撤指标，均调整了交易成本。结果显示，混合架构在非固定市场环境中带来更高的回报和更强的韧性，显示其作为一个稳健的人工智能驱动动态投资组合优化框架的潜力。
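The four evaluation metrics named in this abstract have standard definitions, sketched below for a series of periodic returns. The 252-day annualization factor, the zero risk-free rate, and the proportional cost model in `net_returns` are assumptions made here; the abstract does not state which conventions the paper uses.

```python
import numpy as np

TRADING_DAYS = 252  # assumption: daily returns, 252 trading days per year

def annualized_return(r):
    """Geometric annualized return from a series of periodic simple returns."""
    growth = np.prod(1.0 + r)
    return growth ** (TRADING_DAYS / len(r)) - 1.0

def annualized_volatility(r):
    """Sample standard deviation of returns, scaled to a yearly horizon."""
    return np.std(r, ddof=1) * np.sqrt(TRADING_DAYS)

def sharpe_ratio(r, risk_free_daily=0.0):
    """Annualized mean excess return divided by annualized volatility."""
    excess = r - risk_free_daily
    return np.mean(excess) / np.std(excess, ddof=1) * np.sqrt(TRADING_DAYS)

def max_drawdown(r):
    """Largest peak-to-trough loss of the wealth curve, as a positive fraction."""
    wealth = np.concatenate(([1.0], np.cumprod(1.0 + r)))  # start from unit wealth
    peak = np.maximum.accumulate(wealth)
    return float(np.max(1.0 - wealth / peak))

def net_returns(gross, turnover, cost_rate=0.001):
    """Subtract a proportional transaction cost (hypothetical rate) from gross returns."""
    return gross - cost_rate * turnover
```

Applying `net_returns` before the other functions gives cost-adjusted versions of each metric.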

Reward Engineering for Spatial Epidemic Simulations: A Reinforcement Learning Platform for Individual Behavioral Learning

空间流行病模拟的奖励工程：个体行为学习的强化学习平台

Authors: Radman Rakhshandehroo, Daniel Coombs
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Populations and Evolution (q-bio.PE)
Arxiv link: https://arxiv.org/abs/2511.18000
Pdf link: https://arxiv.org/pdf/2511.18000
Abstract We present ContagionRL, a Gymnasium-compatible reinforcement learning platform specifically designed for systematic reward engineering in spatial epidemic simulations. Unlike traditional agent-based models that rely on fixed behavioral rules, our platform enables rigorous evaluation of how reward function design affects learned survival strategies across diverse epidemic scenarios. ContagionRL integrates a spatial SIRS+D epidemiological model with configurable environmental parameters, allowing researchers to stress-test reward functions under varying conditions including limited observability, different movement patterns, and heterogeneous population dynamics. We evaluate five distinct reward designs, ranging from sparse survival bonuses to a novel potential field approach, across multiple RL algorithms (PPO, SAC, A2C). Through systematic ablation studies, we identify that directional guidance and explicit adherence incentives are critical components for robust policy learning. Our comprehensive evaluation across varying infection rates, grid sizes, visibility constraints, and movement patterns reveals that reward function choice dramatically impacts agent behavior and survival outcomes. Agents trained with our potential field reward consistently achieve superior performance, learning maximal adherence to non-pharmaceutical interventions while developing sophisticated spatial avoidance strategies. The platform's modular design enables systematic exploration of reward-behavior relationships, addressing a knowledge gap in models of this type where reward engineering has received limited attention. ContagionRL is an effective platform for studying adaptive behavioral responses in epidemic contexts and highlights the importance of reward design, information structure, and environmental predictability in learning.
中文摘要 我们介绍ContagionRL，一款兼容Gymnasium的强化学习平台，专为空间流行病模拟中的系统奖励工程设计。与依赖固定行为规则的传统基于主体的模型不同，我们的平台能够严谨评估奖励函数设计如何影响多样化流行病情景下的生存策略。ContagionRL将空间SIRS+D流行病学模型与可配置的环境参数整合，使研究人员能够在可观测性有限、不同运动模式和异质种群动态等不同条件下对奖励函数进行压力测试。我们评估了五种不同的奖励设计，涵盖从稀疏的生存加成到新型潜在场法，涵盖多种强化学习算法（PPO、SAC、A2C）。通过系统性消融研究，我们发现方向性指导和明确的遵守激励是稳健政策学习的关键组成部分。我们对不同感染率、网格大小、可见性限制和运动模式的综合评估显示，奖励功能选择对代理行为和生存结果有显著影响。接受我们潜在现场奖励训练的代理持续获得优异表现，学习最大程度的非药物干预依从性，同时发展复杂的空间回避策略。该平台的模块化设计使得系统性探索奖励-行为关系成为可能，弥补了此类模型中奖励工程关注有限的知识空白。ContagionRL是一个有效的平台，用于研究流行病情境下的适应性行为反应，强调奖励设计、信息结构和环境可预测性在学习中的重要性。

IE-Critic-R1: Advancing the Explanatory Measurement of Text-Driven Image Editing for Human Perception Alignment

IE-Critic-R1：推进文本驱动图像编辑在人类感知对齐中的解释性测量

Authors: Bowen Qu, Shangkun Sun, Xiaoyu Liang, Wei Gao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.18055
Pdf link: https://arxiv.org/pdf/2511.18055
Abstract Recent advances in text-driven image editing have been significant, yet accurately evaluating the edited images remains a considerable challenge. Unlike text-driven image generation, text-driven image editing conditions simultaneously on both text and a source image. The edited images often retain an intrinsic connection to the original image, which changes dynamically with the semantics of the text. However, previous methods tend to focus solely on text-image alignment or are not well aligned with human perception. In this work, we introduce the Text-driven Image Editing Benchmark suite (IE-Bench) to enhance the assessment of text-driven edited images. IE-Bench includes a database containing diverse source images, various editing prompts, and the corresponding edited results from different editing methods, along with nearly 4,000 samples with corresponding Mean Opinion Scores (MOS) provided by 15 human subjects. Furthermore, we introduce IE-Critic-R1, which, benefiting from Reinforcement Learning from Verifiable Rewards (RLVR), provides more comprehensive and explainable quality assessment for text-driven image editing that aligns with human perception. Extensive experiments demonstrate that IE-Critic-R1 aligns better with subjective human judgments on the text-driven image editing task than previous metrics. Related data and code are publicly available.
中文摘要 文本驱动图像编辑的最新进展显著，但准确评估这些编辑图像仍是一项相当大的挑战。与文本驱动图像生成的评估不同，文本驱动图像编辑的特点是同时对文本和源图像进行条件处理。编辑后的图像通常保留与原始图像的内在联系，且这些连接会随着文本语义动态变化。然而，以往的方法往往仅关注文本与图像的对齐，或与人类感知不完全一致。在本研究中，我们介绍了文本驱动图像编辑基准套件（IE-Bench），以增强对文本驱动编辑图像的评估。IE-Bench 包含一个数据库，包含多样的原始图片、各种编辑提示及不同编辑方法的相应编辑结果，以及近 4,000 个样本，并配有对应的平均意见评分（MOS），由 15 名受试者提供。此外，我们引入了IE-Critic-R1，借助可验证奖励强化学习（RLVR），为文本驱动图像编辑提供更全面且可解释的质量评估，符合人类感知。大量实验表明，IE-Critic-R1在文本驱动图像编辑任务中的主观对齐性优于以往指标。相关数据和代码对公众开放。

Anti-Jamming based on Null-Steering Antennas and Intelligent UAV Swarm Behavior

基于零转向天线和智能无人机群行为的反干扰

Authors: Miguel Lourenço, António Grilo
Subjects: Robotics (cs.RO); Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2511.18086
Pdf link: https://arxiv.org/pdf/2511.18086
Abstract Unmanned Aerial Vehicle (UAV) swarms represent a key advancement in autonomous systems, enabling coordinated missions through inter-UAV communication. However, their reliance on wireless links makes them vulnerable to jamming, which can disrupt coordination and mission success. This work investigates whether a UAV swarm can effectively overcome jamming while maintaining communication and mission efficiency. To address this, a unified optimization framework combining Genetic Algorithms (GA), Supervised Learning (SL), and Reinforcement Learning (RL) is proposed. The mission model, structured into epochs and timeslots, allows dynamic path planning, antenna orientation, and swarm formation while progressively enforcing collision rules. Null-steering antennas enhance resilience by directing antenna nulls toward interference sources. Results show that the GA achieved stable, collision-free trajectories but with high computational cost. SL models replicated GA-based configurations but struggled to generalize under dynamic or constrained settings. RL, trained via Proximal Policy Optimization (PPO), demonstrated adaptability and real-time decision-making with consistent communication and lower computational demand. Additionally, the Adaptive Movement Model generalized UAV motion to arbitrary directions through a rotation-based mechanism, validating the scalability of the proposed system. Overall, UAV swarms equipped with null-steering antennas and guided by intelligent optimization algorithms effectively mitigate jamming while maintaining communication stability, formation cohesion, and collision safety. The proposed framework establishes a unified, flexible, and reproducible basis for future research on resilient swarm communication systems.
中文摘要 无人机群是自主系统的关键进展，通过无人机间通信实现协调任务。然而，由于依赖无线链路，容易受到干扰，干扰协调和任务成功。这项研究探讨了无人机群是否能在保持通信和任务效率的同时有效克服干扰。为此，提出了一个结合遗传算法（GA）、监督学习（SL）和强化学习（RL）的统一优化框架。任务模型按时代和时间段结构化，支持动态路径规划、天线朝向和群体形成，同时逐步执行碰撞规则。零引导天线通过将天线零点指向干扰源来增强韧性。结果显示，GA实现了稳定、无碰撞的轨迹，但计算成本较高。SL模型复制了基于GA的配置，但在动态或受限条件下难以泛化。通过近端策略优化（PPO）训练的强化学习展现了适应性和实时决策能力，实现了一致的通信和较低的计算需求。此外，自适应运动模型通过基于旋转的机制将无人机运动推广至任意方向，验证了所提系统的可扩展性。总体而言，配备零转向天线并由智能优化算法引导的无人机群有效减少干扰，同时保持通信稳定性、编队凝聚力和碰撞安全。该框架为未来韧性群体通信系统研究奠定了一个统一、灵活且可重复的基础。

A New Error Temporal Difference Algorithm for Deep Reinforcement Learning in Microgrid Optimization

一种用于微电网优化中深度强化学习的新误差时差算法

Authors: Fulong Yao, Wanqing Zhao, Matthew Forshaw
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.18093
Pdf link: https://arxiv.org/pdf/2511.18093
Abstract Predictive control approaches based on deep reinforcement learning (DRL) have gained significant attention in microgrid energy optimization. However, existing research often overlooks the issue of uncertainty stemming from imperfect prediction models, which can lead to suboptimal control strategies. This paper presents a new error temporal difference (ETD) algorithm for DRL to address the uncertainty in predictions, aiming to improve the performance of microgrid operations. First, a microgrid system integrated with renewable energy sources (RES) and energy storage systems (ESS), along with its Markov decision process (MDP), is modelled. Second, a predictive control approach based on a deep Q network (DQN) is presented, in which a weighted average algorithm and a new ETD algorithm are designed to quantify and address the prediction uncertainty, respectively. Finally, simulations on a real-world US dataset suggest that the developed ETD effectively improves the performance of DRL in optimizing microgrid operations.
中文摘要 基于深度强化学习（DRL）的预测控制方法在微电网能源优化领域备受关注。然而，现有研究常常忽视了预测模型不完美带来的不确定性问题，这可能导致控制策略次优。本文提出了一种用于DRL的新的误差时间差分（ETD）算法来解决预测中的不确定性，旨在提升微电网运行的性能。首先，对集成可再生能源（RES）和储能系统（ESS）的微电网系统及其马尔可夫决策过程（MDP）进行建模。其次，提出了基于深度Q网络（DQN）的预测控制方法，其中分别设计了加权平均算法和新的ETD算法，用于量化和解决预测不确定性。最后，基于美国真实数据集的模拟表明，所开发的ETD有效提升了DRL在优化微电网运行方面的性能。

MOMA-AC: A preference-driven actor-critic framework for continuous multi-objective multi-agent reinforcement learning

MOMA-AC：一种基于偏好的演员-批评框架，用于连续多目标多代理强化学习

Authors: Adam Callaghan, Karl Mason, Patrick Mannion
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.18181
Pdf link: https://arxiv.org/pdf/2511.18181
Abstract This paper addresses a critical gap in Multi-Objective Multi-Agent Reinforcement Learning (MOMARL) by introducing the first dedicated inner-loop actor-critic framework for continuous state and action spaces: Multi-Objective Multi-Agent Actor-Critic (MOMA-AC). Building on single-objective, single-agent algorithms, we instantiate this framework with Twin Delayed Deep Deterministic Policy Gradient (TD3) and Deep Deterministic Policy Gradient (DDPG), yielding MOMA-TD3 and MOMA-DDPG. The framework combines a multi-headed actor network, a centralised critic, and an objective preference-conditioning architecture, enabling a single neural network to encode the Pareto front of optimal trade-off policies for all agents across conflicting objectives in a continuous MOMARL setting. We also outline a natural test suite for continuous MOMARL by combining a pre-existing multi-agent single-objective physics simulator with its multi-objective single-agent counterpart. Evaluating cooperative locomotion tasks in this suite, we show that our framework achieves statistically significant improvements in expected utility and hypervolume relative to outer-loop and independent training baselines, while demonstrating stable scalability as the number of agents increases. These results establish our framework as a foundational step towards robust, scalable multi-objective policy learning in continuous multi-agent domains.
中文摘要 本文通过引入首个专门的连续状态和动作空间内环路演员-批评者框架：多目标多代理演员-批判者（MOMA-AC），解决了多目标多代理强化学习（MOMARL）的关键空白。基于单目标单代理算法，我们用双延迟深度确定性策略梯度（TD3）和深度确定性策略梯度（DDPG）实现该框架，生成了MOMA-TD3和MOMA-DDPG。该框架结合了多头行为者网络、集中批评者和客观偏好条件架构，使单一神经网络能够在连续的MOMARL环境中，编码所有代理在冲突目标下的最优权衡策略的帕累托前沿。我们还概述了一套自然的连续MOMARL测试套件，通过结合已有的多智能体单目标物理模拟器与其多目标单智能体模拟器。通过评估该套件中的协作运动任务，我们表明我们的框架相较于外环和独立训练基线，在期望效用和超量方面实现了统计学上的显著提升，同时随着代理数量的增加，其可扩展性依然稳定。这些结果确立了我们的框架作为连续多代理领域中稳健、可扩展的多目标政策学习的基础步骤。

Deep Gaussian Process Proximal Policy Optimization

深度高斯过程近端策略优化

Authors: Matthijs van der Lende, Juan Cardenas-Cartagena
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.18214
Pdf link: https://arxiv.org/pdf/2511.18214
Abstract Uncertainty estimation for Reinforcement Learning (RL) is a critical component in control tasks where agents must balance safe exploration and efficient learning. While deep neural networks have enabled breakthroughs in RL, they often lack calibrated uncertainty estimates. We introduce Deep Gaussian Process Proximal Policy Optimization (GPPO), a scalable, model-free actor-critic algorithm that leverages Deep Gaussian Processes (DGPs) to approximate both the policy and value function. GPPO maintains competitive performance with respect to Proximal Policy Optimization on standard high-dimensional continuous control benchmarks while providing well-calibrated uncertainty estimates that can inform safer and more effective exploration.
中文摘要 强化学习（RL）中的不确定性估计是控制任务中的关键组成部分，主体必须在安全探索与高效学习之间取得平衡。虽然深度神经网络在强化学习领域取得了突破，但它们往往缺乏校准的不确定性估计。我们介绍了深高斯过程近端策略优化（GPPO），这是一种可扩展、无模型的演员-批判算法，利用深度高斯过程（DGPs）来近似策略函数和价值函数。GPPO在标准高维连续控制基准上保持近端策略优化的竞争性能，同时提供校准良好的不确定性估计，以指导更安全、更有效的勘探。

A Novel and Practical Universal Adversarial Perturbations against Deep Reinforcement Learning based Intrusion Detection Systems

针对基于深度强化学习的入侵检测系统的新颖且实用的普遍对抗扰动

Authors: H. Zhang, L. Zhang, G. Epiphaniou, C. Maple
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.18223
Pdf link: https://arxiv.org/pdf/2511.18223
Abstract Intrusion Detection Systems (IDS) play a vital role in defending modern cyber physical systems against increasingly sophisticated cyber threats. Deep Reinforcement Learning-based IDS have shown promise due to their adaptive and generalization capabilities. However, recent studies reveal their vulnerability to adversarial attacks, including Universal Adversarial Perturbations (UAPs), which can deceive models with a single, input-agnostic perturbation. In this work, we propose a novel UAP attack against Deep Reinforcement Learning (DRL)-based IDS under domain-specific constraints derived from network data rules and feature relationships. To the best of our knowledge, no existing study has explored UAP generation for DRL-based IDS. In addition, this is the first work that focuses on developing a UAP against a DRL-based IDS under realistic domain constraints based on not only the basic domain rules but also mathematical relations between the features. Furthermore, we enhance the evasion performance of the proposed UAP by introducing a customized loss function based on the Pearson Correlation Coefficient (PCC), and we denote the result as Customized UAP. To the best of our knowledge, this is also the first work to use the PCC value in UAP generation, even in the broader context. Four additional established UAP baselines are implemented for a comprehensive comparison. Experimental results demonstrate that our proposed Customized UAP outperforms two input-dependent attacks, the Fast Gradient Sign Method (FGSM) and the Basic Iterative Method (BIM), as well as the four UAP baselines, highlighting its effectiveness for real-world adversarial scenarios.
中文摘要 入侵检测系统（IDS）在防御现代网络物理系统免受日益复杂的网络威胁方面发挥着至关重要的作用。基于深度强化学习的IDS因其自适应性和泛化能力展现出潜力。然而，最新研究表明它们容易受到对抗性攻击，包括通用对抗扰动（UAP），这类攻击可以通过单一的、与输入无关的扰动欺骗模型。本研究提出一种针对基于深度强化学习（DRL）的IDS的新颖UAP攻击，该攻击满足由网络数据规则和特征关系导出的领域特定约束。据我们所知，目前尚无研究探讨针对基于DRL的IDS的UAP生成。此外，这是首个在现实领域约束下针对基于DRL的IDS开发UAP的工作，这些约束不仅基于基本领域规则，还基于特征间的数学关系。此外，我们通过引入基于皮尔逊相关系数（PCC）的定制损失函数，提升了所提UAP的规避性能，并称之为定制UAP。据我们所知，即使在更广泛的背景下，这也是首次在UAP生成中使用PCC值的研究。另外还实现了四个已有的UAP基线，以进行全面比较。实验结果表明，我们提出的定制UAP优于两种输入依赖攻击，即快速梯度符号法（FGSM）和基本迭代法（BIM），以及四个UAP基线，凸显其在现实对抗场景中的有效性。

Carbon-Aware Intrusion Detection: A Comparative Study of Supervised and Unsupervised DRL for Sustainable IoT Edge Gateways

碳感知入侵检测：可持续物联网边缘网关监督与非监督DRL的比较研究

Authors: Saeid Jamshidi, Foutse Khomh, Kawser Wazed Nafi, Amin Nikanjam, Samira Keivanpour, Omar Abdul-Wahab, Martine Bellaiche
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2511.18240
Pdf link: https://arxiv.org/pdf/2511.18240
Abstract The rapid expansion of the Internet of Things (IoT) has intensified cybersecurity challenges, particularly in mitigating Distributed Denial-of-Service (DDoS) attacks at the network edge. Traditional Intrusion Detection Systems (IDSs) face significant limitations, including poor adaptability to evolving and zero-day attacks, reliance on static signatures and labeled datasets, and inefficiency on resource-constrained edge gateways. Moreover, most existing DRL-based IDS studies overlook sustainability factors such as energy efficiency and carbon impact. To address these challenges, this paper proposes two novel Deep Reinforcement Learning (DRL)-based IDS: DeepEdgeIDS, an unsupervised Autoencoder-DRL hybrid, and AutoDRL-IDS, a supervised LSTM-DRL model. Both DRL-based IDS are validated through theoretical analysis and experimental evaluation on edge gateways. Results demonstrate that AutoDRL-IDS achieves 94% detection accuracy using labeled data, while DeepEdgeIDS attains 98% accuracy and adaptability without labels. Distinctly, this study introduces a carbon-aware, multi-objective reward function optimized for sustainable and real-time IDS operations in dynamic IoT networks.
中文摘要 物联网（IoT）的快速扩展加剧了网络安全挑战，尤其是在缓解网络边缘的分布式拒绝服务（DDoS）攻击方面。传统的入侵检测系统（IDS）面临重大限制，包括对不断演变的攻击和零日攻击的适应性差、对静态签名和标记数据集的依赖，以及在资源受限的边缘网关上效率低下。此外，大多数现有基于DRL的IDS研究忽视了能源效率和碳影响等可持续性因素。为应对这些挑战，本文提出了两种基于深度强化学习（DRL）的新颖IDS：DeepEdgeIDS，一种非监督的自编码器-DRL混合体，以及AutoDRL-IDS，一种监督式LSTM-DRL模型。这两种基于DRL的IDS均通过理论分析和边缘网关上的实验评估得到验证。结果显示，AutoDRL-IDS使用带标签数据可实现94%的检测准确率，而DeepEdgeIDS无需标签即可实现98%的准确率和适应性。本研究的独特之处在于提出了一种碳感知的多目标奖励函数，针对动态物联网网络中可持续且实时的IDS运行进行了优化。

EgoVITA: Learning to Plan and Verify for Egocentric Video Reasoning

EgoVITA：学习如何规划并验证以应对以自我为中心的视频推理

Authors: Yogesh Kulkarni, Pooyan Fazli
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.18242
Pdf link: https://arxiv.org/pdf/2511.18242
Abstract Reasoning about intentions and actions from a first-person (egocentric) perspective remains a fundamental challenge for multimodal large language models (MLLMs). Unlike third-person (exocentric) videos that capture scenes from an outside observer, egocentric videos reflect the actor's continuously changing viewpoint, introducing partial observability, limited field of view, and self-referenced motion. We introduce $\textbf{EgoVITA}$, a reinforcement learning framework that enables MLLMs to reason through structured planning and verification. Built on Group Relative Policy Optimization (GRPO), EgoVITA alternates between two stages: (1) an $\textbf{egocentric planning phase}$, where the model reasons from a first-person viewpoint to predict a step-by-step plan of future actions, and (2) an $\textbf{exocentric verification phase}$, where it switches to a third-person perspective to check the visual and logical consistency of that plan. Through GRPO, the model learns to make plans that are causally predictive of upcoming visual observations, leading to more coherent and visually grounded reasoning. EgoVITA achieves significant gains on egocentric reasoning tasks, outperforming the baseline Qwen2.5-VL-7B by $\mathbf{+7.7}$ on EgoBlind and $\mathbf{+4.4}$ on EgoOrient, while maintaining strong generalization on exocentric video tasks.
中文摘要 从第一人称（以自我为中心）视角推理意图和行为仍然是多模态大型语言模型（MLLM）面临的根本挑战。与从外部观察者角度捕捉场景的第三人称（外中心）视频不同，自我中心视频反映了行动者不断变化的视角，引入了部分可观察性、有限的视野和自我参照的运动。我们介绍了$\textbf{EgoVITA}$，这是一个强化学习框架，使MLLM能够通过结构化的规划和验证进行推理。基于群体相对策略优化（GRPO），EgoVITA 在两个阶段之间交替进行：（1） $\textbf{以自我为中心的规划阶段}$，模型从第一人称视角进行推理，预测未来行动的逐步计划;（2） $\textbf{外中心验证阶段}$，切换到第三人称视角以检查该计划的视觉和逻辑一致性。通过GRPO，模型学会制定能够因果地预测即将到来的视觉观察的计划，从而实现更连贯且具视觉基础的推理。EgoVITA 在以自我为中心的推理任务中取得了显著进步，在 EgoBlind 上比基线 Qwen2.5-VL-7B 高出 $\mathbf{+7.7}$，在 EgoOrient 上高出 $\mathbf{+4.4}$，同时在外中心视频任务中保持了强有力的泛化能力。

Dreaming Falcon: Physics-Informed Model-Based Reinforcement Learning for Quadcopters

梦境猎鹰：基于物理的模型强化学习，适用于四旋翼飞机

Authors: Eashan Vytla, Bhavanishankar Kalavakolanu, Andrew Perrault, Matthew McCrink
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.18243
Pdf link: https://arxiv.org/pdf/2511.18243
Abstract Current control algorithms for aerial robots struggle with robustness in dynamic environments and adverse conditions. Model-based reinforcement learning (RL) has shown strong potential in handling these challenges while remaining sample-efficient. Additionally, Dreamer has demonstrated that online model-based RL can be achieved using a recurrent world model trained on replay buffer data. However, applying Dreamer to aerial systems has been quite challenging due to its sample inefficiency and poor generalization of dynamics models. Our work explores a physics-informed approach to world model learning and improves policy performance. The world model treats the quadcopter as a free-body system and predicts the net forces and moments acting on it, which are then passed through a 6-DOF Runge-Kutta integrator (RK4) to predict future state rollouts. In this paper, we compare this physics-informed method to a standard RNN-based world model. Although both models perform well on the training data, we observed that they fail to generalize to new trajectories, leading to rapid divergence in state rollouts, preventing policy convergence.
中文摘要 当前用于空中机器人的控制算法在动态环境和恶劣条件下稳健性不足。基于模型的强化学习（RL）在应对这些挑战且保持样本效率方面展现出强大潜力。此外，Dreamer还证明了在线基于模型的强化学习可以通过在回放缓冲区数据上训练的循环世界模型实现。然而，由于其样本效率低和动力学模型泛化能力较差，将Dreamer应用到航空系统中一直相当具有挑战性。我们的工作探索以物理为基础的世界模型学习方法，并提升策略性能。世界模型将四旋翼视为自由体系统，预测作用于其上的净力和力矩，然后通过6自由度的Runge-Kutta积分器（RK4）预测未来的状态推演。本文将这种基于物理学的方法与标准的基于RNN的世界模型进行比较。尽管两种模型在训练数据上表现良好，但我们观察到它们未能泛化到新的轨迹，导致状态推演迅速发散，阻碍策略收敛。
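The rollout mechanism described in this abstract, where predicted net forces are passed through an RK4 integrator, can be sketched as follows. This is a simplified illustration and not the paper's model: it integrates translation only (a full 6-DOF model would also carry attitude and angular rates driven by the predicted moments), the force is a fixed input rather than a learned prediction, and `MASS`, `GRAVITY`, and the function names are assumptions.

```python
import numpy as np

MASS = 1.0                               # kg, hypothetical vehicle mass
GRAVITY = np.array([0.0, 0.0, -9.81])    # m/s^2, world frame with z pointing up

def translational_dynamics(x, force):
    """State derivative for x = [position(3), velocity(3)] under a net applied force."""
    velocity = x[3:]
    acceleration = force / MASS + GRAVITY
    return np.concatenate([velocity, acceleration])

def rk4_step(f, x, u, dt):
    """One classical fourth-order Runge-Kutta step with the input u held constant."""
    k1 = f(x, u)
    k2 = f(x + 0.5 * dt * k1, u)
    k3 = f(x + 0.5 * dt * k2, u)
    k4 = f(x + dt * k3, u)
    return x + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

# Free fall from rest for one second: z = -g/2 and v_z = -g.
x0 = np.zeros(6)
x1 = rk4_step(translational_dynamics, x0, np.zeros(3), dt=1.0)

# Thrust that exactly cancels gravity keeps the vehicle at rest.
hover = rk4_step(translational_dynamics, x0, np.array([0.0, 0.0, 9.81]), dt=1.0)
```

In a world model of this kind, the learned network would supply `force` at each step and repeated calls to `rk4_step` would produce the state rollout.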

Tail Distribution of Regret in Optimistic Reinforcement Learning

乐观强化学习中的遗憾尾分布

Authors: Sajad Khodadadian, Mehrdad Moharrami
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2511.18247
Pdf link: https://arxiv.org/pdf/2511.18247
Abstract We derive instance-dependent tail bounds for the regret of optimism-based reinforcement learning in finite-horizon tabular Markov decision processes with unknown transition dynamics. Focusing on a UCBVI-type algorithm, we characterize the tail distribution of the cumulative regret $R_K$ over $K$ episodes, rather than only its expectation or a single high-probability quantile. We analyze two natural exploration-bonus schedules: (i) a $K$-dependent scheme that explicitly incorporates the total number of episodes $K$, and (ii) a $K$-independent scheme that depends only on the current episode index. For both settings, we obtain an upper bound on $\Pr(R_K \ge x)$ that exhibits a distinctive two-regime structure: a sub-Gaussian tail starting from an instance-dependent scale $m_K$ up to a transition threshold, followed by a sub-Weibull tail beyond that point. We further derive corresponding instance-dependent bounds on the expected regret $\mathbb{E}[R_K]$. The proposed algorithm depends on a tuning parameter $\alpha$, which balances the expected regret and the range over which the regret exhibits a sub-Gaussian tail. To the best of our knowledge, our results provide one of the first comprehensive tail-regret guarantees for a standard optimistic algorithm in episodic reinforcement learning.
中文摘要 我们在有限视界表马尔可夫决策过程中，推导了基于乐观主义的强化学习遗憾的实例相关尾界，且转换动力学未知。我们聚焦于UCBVI类型的算法，描述累计后悔$R_K$在$K$发作中的尾部分布，而不仅仅是其期望值或单一高概率分位数。我们分析了两种自然的探索奖励计划：（i）一种基于$K$的方案，明确包含$K$的总集数;（ii）一种仅依赖当前集数的$K$无关方案。对于这两种设定，我们都得到 $\Pr（R_K \ge x）$ 的上界，该上界具有独特的两区间结构：从实例依赖尺度 $m_K$ 起始的亚高斯尾部至过渡阈值，之后是次魏布尔尾部。我们进一步推导出对应实例依赖的期望后悔 $\mathbb{E}[R_K]$ 的界限。所提算法依赖于一个调优参数 $\alpha$，该参数平衡了预期的遗憾值和遗憾表现出亚高斯尾部的范围。据我们所知，我们的结果为情节强化学习中的标准乐观算法提供了最早全面的尾部后悔保证。

LLM Reasoning for Cold-Start Item Recommendation

冷启动项目推荐的LLM推理

Authors: Shijun Li, Yu Wang, Jin Wang, Ying Li, Joydeep Ghosh, Anne Cocos
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.18261
Pdf link: https://arxiv.org/pdf/2511.18261
Abstract Large Language Models (LLMs) have shown significant potential for improving recommendation systems through their inherent reasoning capabilities and extensive knowledge base. Yet, existing studies predominantly address warm-start scenarios with abundant user-item interaction data, leaving the more challenging cold-start scenarios, where sparse interactions hinder traditional collaborative filtering methods, underexplored. To address this limitation, we propose novel reasoning strategies designed for cold-start item recommendations within the Netflix domain. Our method utilizes the advanced reasoning capabilities of LLMs to effectively infer user preferences, particularly for newly introduced or rarely interacted items. We systematically evaluate supervised fine-tuning, reinforcement learning-based fine-tuning, and hybrid approaches that combine both methods to optimize recommendation performance. Extensive experiments on real-world data demonstrate significant improvements in both methodological efficacy and practical performance in cold-start recommendation contexts. Remarkably, our reasoning-based fine-tuned models outperform Netflix's production ranking model by up to 8% in certain cases.
中文摘要 大型语言模型（LLMs）凭借其固有的推理能力和丰富的知识库，展现出显著的改进推荐系统的潜力。然而，现有研究主要针对拥有丰富用户与项目交互数据的热启动场景，而那些缺乏互动、阻碍传统协作过滤方法的更具挑战性的冷启动场景则鲜有深入探讨。为解决这一限制，我们提出了为Netflix领域中冷启动项目推荐设计的新推理策略。我们的方法利用大型语言模型的高级推理能力，有效推断用户偏好，特别是针对新引入或很少互动的项目。我们系统地评估了监督式微调、基于强化学习的微调以及结合这两种方法以优化推荐表现的混合方法。大量基于真实世界数据的实验显示，冷启动建议情境下方法学的有效性和实际表现均有显著提升。令人惊讶的是，我们基于推理的微调模型在某些情况下，比Netflix的制作排名模型高出多达8%。

MammothModa2: A Unified AR-Diffusion Framework for Multimodal Understanding and Generation

MammothModa2：一个统一的AR扩散框架，用于多模态理解与生成

Authors: Tao Shen, Xin Wan, Taicai Chen, Rui Zhang, Junwen Pan, Dawei Lu, Fanding Lei, Zhilin Lu, Yunfei Yang, Chen Cheng, Qi She, Chang Liu, Zhenbang Sun
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.18262
Pdf link: https://arxiv.org/pdf/2511.18262
Abstract Unified multimodal models aim to integrate understanding and generation within a single framework, yet bridging the gap between discrete semantic reasoning and high-fidelity visual synthesis remains challenging. We present MammothModa2 (Mammoth2), a unified autoregressive-diffusion (AR-Diffusion) framework designed to effectively couple autoregressive semantic planning with diffusion-based generation. Mammoth2 adopts a serial design: an AR path equipped with generation experts performs global semantic modeling over discrete tokens, while a single-stream Diffusion Transformer (DiT) decoder handles high-fidelity image synthesis. A carefully designed AR-Diffusion feature alignment module combines multi-layer feature aggregation, unified condition encoding, and in-context conditioning to stably align AR's representations with the diffusion decoder's continuous latents. Mammoth2 is trained end-to-end with joint Next-Token Prediction and Flow Matching objectives, followed by supervised fine-tuning and reinforcement learning over both generation and editing. With roughly 60M supervised generation samples and no reliance on pre-trained generators, Mammoth2 delivers strong text-to-image and instruction-based editing performance on public benchmarks, achieving 0.87 on GenEval, 87.2 on DPGBench, and 4.06 on ImgEdit, while remaining competitive with understanding-only backbones (e.g., Qwen3-VL-8B) on multimodal understanding tasks. These results suggest that a carefully coupled AR-Diffusion architecture can provide high-fidelity generation and editing while maintaining strong multimodal comprehension within a single, parameter- and data-efficient model.
中文摘要 统一的多模态模型旨在将理解与生成整合到单一框架内，但弥合离散语义推理与高精度视觉综合之间的鸿沟仍具挑战性。我们介绍MammothModa2（Mammoth2），一个统一的自回归扩散（AR-Diffusion）框架，旨在有效将自回归语义规划与基于扩散的生成结合。Mammoth2采用串行设计：配备生成专家的增强现实路径对离散代币进行全局语义建模，而单流扩散转换器（DiT）解码器则处理高精度图像合成。精心设计的AR-扩散特征对齐模块结合了多层特征聚合、统一条件编码和上下文条件处理，稳定地将AR的表示与扩散解码器的连续潜在值对齐。Mammoth2 通过端到端训练，结合下一个标记预测和流量匹配目标，随后在生成和编辑过程中进行监督式微调和强化学习。Mammoth2拥有约6000万个监督生成样本，且不依赖预训练生成器，在公开基准测试中实现了强大的文本转图像和基于指令的编辑性能，在GenEval上达到0.87，DPGBench为87.2，在ImgEdit上为4.06，同时在多模态理解任务中仍能与仅理解的骨干网（如Qwen3-VL-8B）竞争。这些结果表明，精心耦合的AR-扩散架构可以在单一、参数和数据高效型模型内，同时实现高保真生成和编辑，同时保持强大的多模态理解能力。

DiVE-k: Differential Visual Reasoning for Fine-grained Image Recognition

DiVE-k：细粒度图像识别的差分视觉推理

Authors: Raja Kumar, Arka Sadhu, Ram Nevatia
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.18305
Pdf link: https://arxiv.org/pdf/2511.18305
Abstract Large Vision Language Models (LVLMs) possess extensive text knowledge but struggle to utilize this knowledge for fine-grained image recognition, often failing to differentiate between visually similar categories. Existing fine-tuning methods using Reinforcement Learning (RL) with exact-match reward signals are often brittle, encourage memorization of training categories, and fail to elicit the differential reasoning needed for generalization to unseen classes. To address this, we propose DiVE-k (Differential Visual rEasoning using top-k generations), a framework that leverages the model's own top-k predictions as a training signal. For each training image, DiVE-k creates a multiple-choice question from the model's top-k outputs and uses RL to train the model to select the correct answer. This approach requires the model to perform fine-grained differential reasoning among plausible options and provides a simple, verifiable reward signal that mitigates memorization and improves generalization. Experiments on five standard fine-grained datasets show that our method significantly outperforms existing approaches. In the standard base-to-novel generalization setting, DiVE-k surpasses QWEN2.5-VL-7B and ViRFT by 10.04% and 6.16% on the Harmonic Mean metric, respectively. Further experiments show similar gains in mixed-domain and few-shot scenarios.
中文摘要 大型视觉语言模型（LVLM）拥有丰富的文本知识，但在精细图像识别中难以利用这些知识，常常无法区分视觉相似的类别。现有使用强化学习（RL）进行精确匹配奖励信号的微调方法通常较为脆弱，鼓励记忆训练类别，且无法引发对未见类别推广所需的差分推理。为此，我们提出了 $\textbf{DiVE-k}$， $\textbf{Di}$fferential $\textbf{V}$isual r$\textbf{E}$asoning 使用 top-$\textbf{k}$ 生成，利用模型自身的 top-k 预测作为训练信号的框架。对于每张训练图像，DiVE-k从模型的top-k输出中生成一个选择题，并利用强化学习训练模型选择正确答案。这种方法要求模型在合理选项之间进行细致的差分推理，并提供简单且可验证的奖励信号，减轻记忆负担并提升泛化能力。在五个标准细粒度数据集上的实验显示，我们的方法显著优于现有方法。在标准的基础到新颖推广中，DiVE-k 在谐波平均度量上分别比 QWEN2.5-VL-7B 和 ViRFT 高出 10.04% 和 6.16%。进一步的实验显示，混合域和少样本场景中也有类似的收益。
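
The multiple-choice construction and verifiable reward described above might look roughly like this. It is a sketch under assumptions: the abstract does not say what happens when the ground-truth label is absent from the top-k list, so here the label replaces the last candidate. `build_mcq` and `mcq_reward` are hypothetical names.

```python
import random
from typing import List, Tuple


def build_mcq(top_k: List[str], label: str,
              rng: random.Random) -> Tuple[List[str], int]:
    """Turn the model's top-k predictions into a shuffled multiple-choice question.

    Returns the option list and the index of the correct option.
    Assumption (not from the abstract): if the true label is not among
    the top-k predictions, it replaces the last candidate.
    """
    options = list(dict.fromkeys(top_k))  # drop duplicates, keep order
    if label not in options:
        options[-1:] = [label]
    rng.shuffle(options)
    return options, options.index(label)


def mcq_reward(selected: int, correct: int) -> float:
    """Exact, verifiable reward: 1 for the right option, 0 otherwise."""
    return 1.0 if selected == correct else 0.0
```

Because the distractors are the model's own plausible guesses, a correct selection requires telling apart visually similar categories rather than recalling a memorized class name.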

Synthetic Curriculum Reinforces Compositional Text-to-Image Generation

综合课程强化构图文本生成

Authors: Shijian Wang, Runhao Fu, Siyi Zhao, Qingqin Zhan, Xingjian Wang, Jiarui Jin, Yuan Lu, Hanqian Wu, Cunjian Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.18378
Pdf link: https://arxiv.org/pdf/2511.18378
Abstract Text-to-Image (T2I) generation has long been an open problem, with compositional synthesis remaining particularly challenging. This task requires accurate rendering of complex scenes containing multiple objects that exhibit diverse attributes as well as intricate spatial and semantic relationships, demanding both precise object placement and coherent inter-object interactions. In this paper, we propose a novel compositional curriculum reinforcement learning framework named CompGen that addresses compositional weakness in existing T2I models. Specifically, we leverage scene graphs to establish a novel difficulty criterion for compositional ability and develop a corresponding adaptive Markov Chain Monte Carlo graph sampling algorithm. This difficulty-aware approach enables the synthesis of curriculum training data that progressively optimizes T2I models through reinforcement learning. We integrate our curriculum learning approach into Group Relative Policy Optimization (GRPO) and investigate different curriculum scheduling strategies. Our experiments reveal that CompGen exhibits distinct scaling curves under different curriculum scheduling strategies, with easy-to-hard and Gaussian sampling strategies yielding superior scaling performance compared to random sampling. Extensive experiments demonstrate that CompGen significantly enhances compositional generation capabilities for both diffusion-based and auto-regressive T2I models, highlighting its effectiveness in improving compositional T2I generation systems.
中文摘要 文本生成（T2I）长期以来一直是个未解之谜，合成尤为具有挑战性。这项任务需要精确渲染包含多个具有不同属性的对象的复杂场景，以及复杂的空间和语义关系，要求既要精确地放置物体，也需要物间的连贯互动。本文提出了一种名为CompGen的新型作文课程强化学习框架，旨在解决现有T2I模型中的作文弱点。具体来说，我们利用场景图建立了一个新的构图难度标准，并开发了相应的自适应马尔可夫链蒙特卡洛采样算法。这种难度感知方法使得训练课程数据能够综合，通过强化学习逐步优化T2I模型。我们将课程学习方法整合进群体相对政策优化（GRPO），并探讨不同的课程安排策略。我们的实验显示，CompGen在不同课程调度策略下表现出明显的缩放曲线，其中从简单到困难和高斯抽样策略相比随机抽样在扩展表现上更优。大量实验表明，CompGen显著增强了基于扩散和自回归T2I模型的合成生成能力，凸显了其在改进合成T2I生成系统方面的有效性。
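
The three curriculum scheduling strategies named above (random, easy-to-hard, Gaussian) can be illustrated with a small sampler. The exact definitions are not given in the abstract, so everything here is an assumption: difficulty is normalized to [0, 1], easy-to-hard tracks training progress directly, and Gaussian sampling draws around the current progress with a hypothetical width `sigma`.

```python
import random


def sample_difficulty(strategy: str, progress: float,
                      rng: random.Random, sigma: float = 0.15) -> float:
    """Pick a target difficulty in [0, 1] for the next training prompt.

    progress: fraction of training completed, in [0, 1].
    """
    if strategy == "random":
        # Ignore progress entirely.
        return rng.random()
    if strategy == "easy_to_hard":
        # Difficulty grows in lockstep with training progress.
        return progress
    if strategy == "gaussian":
        # Sample around the current progress, clipped to the valid range.
        return min(1.0, max(0.0, rng.gauss(progress, sigma)))
    raise ValueError(f"unknown strategy: {strategy}")
```

A scene-graph sampler would then be asked for a prompt whose difficulty score is close to the returned value.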

General Agentic Memory Via Deep Research

通过深入研究获得一般能动记忆

Authors: B.Y. Yan, Chaofan Li, Hongjin Qian, Shuqi Lu, Zheng Liu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.18423
Pdf link: https://arxiv.org/pdf/2511.18423
Abstract Memory is critical for AI agents, yet the widely-adopted static memory, aiming to create readily available memory in advance, is inevitably subject to severe information loss. To address this limitation, we propose a novel framework called general agentic memory (GAM). GAM follows the principle of "just-in-time (JIT) compilation", where it focuses on creating optimized contexts for its client at runtime while keeping only simple but useful memory during the offline stage. To this end, GAM employs a duo-design with the following components. 1) Memorizer, which highlights key historical information using a lightweight memory, while maintaining complete historical information within a universal page-store. 2) Researcher, which retrieves and integrates useful information from the page-store for its online request, guided by the pre-constructed memory. This design allows GAM to effectively leverage the agentic capabilities and test-time scalability of frontier large language models (LLMs), while also facilitating end-to-end performance optimization through reinforcement learning. In our experimental study, we demonstrate that GAM achieves substantial improvement on various memory-grounded task completion scenarios against existing memory systems.
中文摘要 记忆对人工智能代理至关重要，但广泛采用的静态记忆旨在提前创建可用内存，却不可避免地面临严重的信息丢失。为解决这一限制，我们提出了一种新框架，称为 \textbf{一般智能记忆（GAM）}。GAM遵循“\textbf{即时编译}”的原则，专注于在运行时为客户端创建优化上下文，同时在离线阶段只保留简单但有用的内存。为此，GAM采用了包含以下组件的双重设计。1）\textbf{记忆器}，利用轻量级内存突出关键历史信息，同时在通用页面存储中保持完整的历史信息。2）\textbf{Researcher}，它从页面存储中检索并整合有用信息，以供其在线请求，并由预建存储器引导。这种设计使 GAM 能够有效利用前沿大型语言模型（LLM）的代理能力和测试时的可扩展性，同时通过强化学习促进端到端性能优化。在我们的实验研究中，我们证明了GAM在多种基于记忆的任务完成场景中，与现有记忆系统相比取得了显著改进。
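
The Memorizer/Researcher split described above can be mimicked with a toy data structure. In GAM both roles are played by LLMs; the keyword heuristics below are stand-in assumptions, and `Memory`, `memorize`, and `research` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Memory:
    pages: Dict[int, str] = field(default_factory=dict)       # full history (page-store)
    memos: Dict[int, Set[str]] = field(default_factory=dict)  # lightweight memo per page


def _words(text: str) -> List[str]:
    return [w.strip(".,?!").lower() for w in text.split()]


def memorize(mem: Memory, text: str, memo_size: int = 5) -> int:
    """Offline stage: keep the full page, plus a small keyword memo."""
    page_id = len(mem.pages)
    mem.pages[page_id] = text
    # Crude stand-in for an LLM summary: the longest distinct words.
    distinct = sorted(set(_words(text)), key=lambda w: (-len(w), w))
    mem.memos[page_id] = set(distinct[:memo_size])
    return page_id


def research(mem: Memory, query: str, top_n: int = 1) -> List[str]:
    """Online stage: use the memos to locate pages, then return them in full."""
    q = set(_words(query))
    ranked = sorted(mem.memos, key=lambda pid: len(q & mem.memos[pid]), reverse=True)
    return [mem.pages[pid] for pid in ranked[:top_n]]
```

The point of the design is visible even in this toy: the memo is lossy, but the answer context is built at request time from the lossless page-store.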

Perceptual-Evidence Anchored Reinforced Learning for Multimodal Reasoning

感知证据锚定的强化学习用于多模态推理

Authors: Chi Zhang, Haibo Qiu, Qiming Zhang, Yufei Xu, Zhixiong Zeng, Siqi Yang, Peng Shi, Lin Ma, Jing Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.18437
Pdf link: https://arxiv.org/pdf/2511.18437
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs) and is now being applied to Vision-Language Models (VLMs). However, vanilla RLVR for VLMs verifies only the final textual output, critically neglecting the foundational step of visual perception. This oversight leads to visual hallucinations and reward hacking, as reasoning built upon flawed perception is inherently unreliable. To address this, we propose PEARL (Perceptual-Evidence Anchored Reinforced Learning), a dual-branch, perception-reasoning synergistic framework that strengthens multimodal reasoning by explicitly anchoring it to verified visual evidence. For each reasoning-oriented QA instance, PEARL first derives a perception checklist -- a set of perception-oriented sub-questions with verifiable answers that probe the model's understanding of key visual evidence. During training, auxiliary rollouts on this checklist yield a perceptual reward that both directly reinforces the model's perception ability and acts as a fidelity gate for reasoning. If the model passes the perception check, its policy update is biased towards evidence-anchored reasoning. Otherwise, the process is halted to prevent reasoning from flawed premises. PEARL can be seamlessly integrated with popular RL methods like GRPO and DAPO. Comprehensive experiments show PEARL achieves substantial gains on multimodal reasoning benchmarks, e.g., a +9.7% improvement over the baseline and +6.6% over GRPO on MathVerse.
中文摘要 可验证奖励强化学习（RLVR）显著提升了大型语言模型（LLM）的推理能力，目前正应用于视觉语言模型（VLMs）。然而，VLM的原版RLVR仅验证最终文本输出，严重忽视了视觉感知这一基础步骤。这种疏忽导致视觉幻觉和奖励黑客行为，因为基于错误感知的推理本质上是不可靠的。为此，我们提出了PEARL（感知-证据锚定强化学习），这是一种双分支、感知-推理协同机制，通过明确将多模态推理锚定于经过验证的视觉证据来增强其能力。对于每个以推理为导向的质询实例，PEARL首先推导出一个感知检查表——一组带有可验证答案的感知导向子问题，以探究模型对关键视觉证据的理解。在训练过程中，该清单上的辅助展开会产生感知奖励，既直接强化模型的感知能力，也作为推理的忠实度门。如果模型通过了感知检验，其策略更新将偏向于基于证据的推理。否则，该过程将被暂停，以防止基于有缺陷前提的推理。PEARL可以无缝集成于流行的强化学习方法，如GRPO和DAPO。综合实验显示，PEARL在多模态推理基准测试上取得了显著提升，例如在MathVerse上，基线提升+9.7%，GRPO提升+6.6%。
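
The perception checklist and fidelity gate described above could be wired together as in the sketch below. The exact-match scoring, the pass threshold, and the use of `None` to signal a halted reasoning update are all assumptions for illustration; `perception_reward` and `gated_reasoning_reward` are hypothetical names.

```python
from typing import List, Optional


def perception_reward(answers: List[str], gold: List[str]) -> float:
    """Fraction of checklist sub-questions answered correctly (exact match)."""
    if not gold:
        return 0.0
    correct = sum(a.strip().lower() == g.strip().lower()
                  for a, g in zip(answers, gold))
    return correct / len(gold)


def gated_reasoning_reward(reasoning_reward: float, p_reward: float,
                           threshold: float = 1.0) -> Optional[float]:
    """Fidelity gate: pass the reasoning reward through only if perception passed.

    Returns None when the check fails, meaning "skip the reasoning update".
    Assumption: the pass criterion is a simple threshold on the checklist score.
    """
    return reasoning_reward if p_reward >= threshold else None
```

The perceptual reward is still usable as a training signal on its own when the gate closes, which matches the abstract's description of it serving two roles.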

Energy-Efficient Task Computation at the Edge for Vehicular Services

车载服务的节能任务计算边缘

Authors: Paniz Parastar, Giuseppe Caso, Jesus Alberto Omana Iglesias, Andra Lutu, Ozgu Alay
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2511.18449
Pdf link: https://arxiv.org/pdf/2511.18449
Abstract Multi-access edge computing (MEC) is a promising solution for providing the computational resources and low latency required by vehicular services such as autonomous driving. It enables cars to offload computationally intensive tasks to nearby servers. Effective offloading involves determining when to offload tasks, selecting the appropriate MEC site, and efficiently allocating resources to ensure good performance. Car mobility poses significant challenges to guaranteeing reliable task completion, and today we still lack energy efficient solutions to this problem, especially when considering real-world car mobility traces. In this paper, we begin by examining the mobility patterns of cars using data obtained from a leading mobile network operator in Europe. Based on the insights from this analysis, we design an optimization problem for task computation and offloading, considering both static and mobility scenarios. Our objective is to minimize the total energy consumption at the cars and at the MEC nodes while satisfying the latency requirements of various tasks. We evaluate our solution, based on multi-agent reinforcement learning, both in simulations and in a realistic setup that relies on datasets from the operator. Our solution shows a significant reduction of user dissatisfaction and task interruptions in both static and mobile scenarios, while achieving energy savings of 47 percent in the static case and 14 percent in the mobile case compared to state-of-the-art schemes.
中文摘要 多址边缘计算（MEC）是一种有前景的解决方案，能够满足自动驾驶等车辆服务所需的计算资源和低延迟。它使汽车能够将计算密集型任务卸载到附近的服务器。有效的卸载包括确定何时卸载任务、选择合适的MEC站点，以及高效分配资源以确保良好绩效。汽车出行在确保任务的可靠完成方面面临重大挑战，而如今我们仍然缺乏能效的解决方案，尤其是在考虑现实世界的汽车出行踪迹时。本文首先利用欧洲一家领先的移动网络运营商的数据，分析汽车的出行模式。基于分析所得的见解，我们设计了一个任务计算和卸载的优化问题，考虑静态和移动场景。我们的目标是在满足各项任务的延迟要求的同时，最大限度地减少车辆和MEC节点的总能耗。我们基于多智能体强化学习，在模拟和依赖操作员数据集的真实设置中评估我们的解决方案。我们的解决方案显著减少了静态和移动场景中的用户不满和任务中断，静态场景节能为47%，移动场景为14%，均为最先进方案。

ORIGAMISPACE: Benchmarking Multimodal LLMs in Multi-Step Spatial Reasoning with Mathematical Constraints

ORIGAMISPACE：多步空间推理中多模态大型语言模型的基准测试，且有数学约束

Authors: Rui Xu, Dakuan Lu, Zicheng Zhao, Xiaoyu Tan, Xintao Wang, Siyu Yuan, Jiangjie Chen, Yinghui Xu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.18450
Pdf link: https://arxiv.org/pdf/2511.18450
Abstract Spatial reasoning is a key capability in the field of artificial intelligence, especially crucial in areas such as robotics, computer vision, and natural language understanding. However, evaluating the ability of multimodal large language models (MLLMs) in complex spatial reasoning still faces challenges, particularly in scenarios requiring multi-step reasoning and precise mathematical constraints. This paper introduces ORIGAMISPACE, a new dataset and benchmark designed to evaluate the multi-step spatial reasoning ability and the capacity to handle mathematical constraints of MLLMs through origami tasks. The dataset contains 350 data instances, each comprising a strictly formatted crease pattern (CP diagram), the Compiled Flat Pattern, the complete Folding Process, and the final Folded Shape Image. We propose four evaluation tasks: Pattern Prediction, Multi-step Spatial Reasoning, Spatial Relationship Prediction, and End-to-End CP Code Generation. For the CP code generation task, we design an interactive environment and explore the possibility of using reinforcement learning methods to train MLLMs. Through experiments on existing MLLMs, we initially reveal the strengths and weaknesses of these models in handling complex spatial reasoning tasks.
中文摘要 空间推理是人工智能领域的一项关键能力，尤其在机器人学、计算机视觉和自然语言理解等领域尤为重要。然而，评估多模态大型语言模型（MLLM）在复杂空间推理中的表现仍面临挑战，尤其是在需要多步推理和精确数学约束的场景中。本文介绍了ORIGAMISPACE，这是一个新的数据集和基准测试，旨在评估MLLs通过折纸任务处理多步空间推理能力及数学约束处理能力。数据集包含350个数据实例，每个实例包含严格格式化的折痕图案（CP图）、编译平面图案、完整折叠过程和最终折叠形状图像。我们提出了四个评估任务：模式预测、多步空间推理、空间关系预测和端到端CP代码生成。对于CP代码生成任务，我们设计了一个互动环境，并探索使用强化学习方法训练MLM的可能性。通过对现有多层次语言模型的实验，我们初步揭示了这些模型在处理复杂空间推理任务中的优缺点。

SafeFall: Learning Protective Control for Humanoid Robots

安全坠落：学习人形机器人的保护控制

Authors: Ziyu Meng, Tengyu Liu, Le Ma, Yingying Wu, Ran Song, Wei Zhang, Siyuan Huang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.18509
Pdf link: https://arxiv.org/pdf/2511.18509
Abstract Bipedal locomotion makes humanoid robots inherently prone to falls, causing catastrophic damage to the expensive sensors, actuators, and structural components of full-scale robots. To address this critical barrier to real-world deployment, we present SafeFall, a framework that learns to predict imminent, unavoidable falls and execute protective maneuvers to minimize hardware damage. SafeFall is designed to operate seamlessly alongside an existing nominal controller, ensuring no interference during normal operation. It combines two synergistic components: a lightweight, GRU-based fall predictor that continuously monitors the robot's state, and a reinforcement learning policy for damage mitigation. The protective policy remains dormant until the predictor identifies a fall as unavoidable, at which point it activates to take control and execute a damage-minimizing response. This policy is trained with a novel, damage-aware reward function that incorporates the robot's specific structural vulnerabilities, learning to shield critical components like the head and hands while absorbing energy with more robust parts of its body. Validated on a full-scale Unitree G1 humanoid, SafeFall demonstrated significant performance improvements over unprotected falls. It reduced peak contact forces by 68.3%, peak joint torques by 78.4%, and eliminated 99.3% of collisions with vulnerable components. By enabling humanoids to fail safely, SafeFall provides a crucial safety net that allows for more aggressive experiments and accelerates the deployment of these robots in complex, real-world environments.
中文摘要 双足行走使人形机器人天生容易摔倒，导致全尺寸机器人昂贵的传感器、执行器和结构部件遭受灾难性损坏。为解决这一实际部署的关键障碍，我们提出了 SafeFall 框架，该框架学习预测即将发生的不可避免的摔倒，并执行保护性动作以最大限度减少硬件损伤。SafeFall设计为与现有名义控制器无缝协作，确保正常运行期间无干扰。它结合了两个协同组件：一个基于GRU的轻量级摔倒预测器，持续监测机器人状态；以及用于减损的强化学习策略。保护策略处于休眠状态，直到预测器判定摔倒不可避免，随后启动接管控制并执行损害最小化的响应。该策略通过一种新型的、感知损伤的奖励函数训练，整合机器人特定的结构弱点，学习保护头部和手部等关键部件，同时用身体更坚固的部位吸收能量。在全尺寸Unitree G1人形机器人上验证后，SafeFall相较于无保护摔倒展现出显著的性能提升。它将峰值接触力降低了68.3%，峰值关节扭矩减少了78.4%，并消除了99.3%的易受损部件碰撞。通过使人形机器人能够安全地失败，SafeFall为更激进的实验提供了关键的安全网，加速了这些机器人在复杂现实环境中的部署。
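
The hand-off between the nominal controller and the protective policy can be sketched as a latched switch. This is an illustration under assumptions: the predictor output is treated as a probability, the 0.9 threshold is arbitrary, and latching (never handing control back once triggered) is an interpretation of "unavoidable". `SafeFallSwitch` is a hypothetical name.

```python
from typing import Sequence


class SafeFallSwitch:
    """Pass nominal actions through until a fall is flagged as unavoidable."""

    def __init__(self, threshold: float = 0.9) -> None:
        self.threshold = threshold  # assumed decision threshold on fall probability
        self.triggered = False      # latched once the predictor fires

    def select(self, fall_prob: float,
               nominal_action: Sequence[float],
               protective_action: Sequence[float]) -> Sequence[float]:
        if fall_prob >= self.threshold:
            self.triggered = True
        # Before triggering, the nominal controller is never modified.
        return protective_action if self.triggered else nominal_action
```

In the paper `fall_prob` would come from the GRU-based predictor and `protective_action` from the RL policy; here both are plain inputs.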

From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence

从代码基础模型到代理与应用：代码智能实用指南

Authors: Jian Yang, Wei Zhang, Shark Liu, Jiajun Wu, Shawn Guo, Yizhi Li
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.18538
Pdf link: https://arxiv.org/pdf/2511.18538
Abstract Large language models (LLMs) have fundamentally transformed automated software development by enabling direct translation of natural language descriptions into functional code, driving commercial adoption through tools like GitHub Copilot (Microsoft), Cursor (Anysphere), Trae (ByteDance), and Claude Code (Anthropic). The field has evolved dramatically from rule-based systems to Transformer-based architectures, with success rates on benchmarks like HumanEval improving from single digits to over 95%. In this work, we provide a comprehensive synthesis and practical guide (a series of analytic and probing experiments) about code LLMs, systematically examining the complete model life cycle from data curation to post-training through advanced prompting paradigms, code pre-training, supervised fine-tuning, reinforcement learning, and autonomous coding agents. We analyze the code capability of the general LLMs (GPT-4, Claude, LLaMA) and code-specialized LLMs (StarCoder, Code LLaMA, DeepSeek-Coder, and QwenCoder), critically examining the techniques, design decisions, and trade-offs. Further, we articulate the research-practice gap between academic research (e.g., benchmarks and tasks) and real-world deployment (e.g., software-related code tasks), including code correctness, security, contextual awareness of large codebases, and integration with development workflows, and map promising research directions to practical needs. Last, we conduct a series of experiments to provide a comprehensive analysis of code pre-training, supervised fine-tuning, and reinforcement learning, covering scaling law, framework selection, hyperparameter sensitivity, model architectures, and dataset comparisons.
中文摘要 大型语言模型（LLM）通过将自然语言描述直接转化为函数式代码，从根本上改变了自动化软件开发，推动了通过Github Copilot（Microsoft）、Cursor（Anysphere）、Trae（字节跳动）和Claude Code（Anthropic）等工具的商业应用。尽管该领域从基于规则的系统向基于Transformer的架构进行了巨大演变，在HumanEval等基准测试中实现了从个位数到超过95%的成功率提升。在本研究中，我们提供了关于代码大型语言模型的综合与实践指南（一系列分析和探测实验），系统地考察了从数据整理到后期训练，再到高级提示范式、代码预训练、监督微调、强化学习和自主编码代理的完整模型生命周期。我们分析了通用大型语言模型（GPT-4、Claude、LLaMA）和代码专用大型语言模型（StarCoder、Code LLaMA、DeepSeek-Coder和QwenCoder）的代码能力，批判性地审视了其技术、设计决策和权衡。此外，我们阐明了学术研究（如基准测试和任务）与现实部署（如软件相关代码任务）之间的研究与实践差距，包括代码正确性、安全性、大型代码库的上下文感知以及与开发工作流的集成，并将有前景的研究方向映射到实际需求。最后，我们进行了一系列实验，全面分析代码预训练、监督微调和强化学习，涵盖缩放律、框架选择、超参数敏感性、模型架构和数据集比较。

How to Train Your Latent Control Barrier Function: Smooth Safety Filtering Under Hard-to-Model Constraints

如何训练你的潜在控制屏障功能：在难以建模的约束下实现平滑的安全过滤

Authors: Kensuke Nakamura, Arun L. Bishop, Steven Man, Aaron M. Johnson, Zachary Manchester, Andrea Bajcsy
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.18606
Pdf link: https://arxiv.org/pdf/2511.18606
Abstract Latent safety filters extend Hamilton-Jacobi (HJ) reachability to operate on latent state representations and dynamics learned directly from high-dimensional observations, enabling safe visuomotor control under hard-to-model constraints. However, existing methods implement "least-restrictive" filtering that discretely switches between nominal and safety policies, potentially undermining the task performance that makes modern visuomotor policies valuable. While reachability value functions can, in principle, be adapted to be control barrier functions (CBFs) for smooth optimization-based filtering, we theoretically and empirically show that current latent-space learning methods produce fundamentally incompatible value functions. We identify two sources of incompatibility: First, in HJ reachability, failures are encoded via a "margin function" in latent space, whose sign indicates whether or not a latent is in the constraint set. However, representing the margin function as a classifier yields saturated value functions that exhibit discontinuous jumps. We prove that the value function's Lipschitz constant scales linearly with the margin function's Lipschitz constant, revealing that smooth CBFs require smooth margins. Second, reinforcement learning (RL) approximations trained solely on safety policy data yield inaccurate value estimates for nominal policy actions, precisely where CBF filtering needs them. We propose the LatentCBF, which addresses both challenges through gradient penalties that lead to smooth margin functions without additional labeling, and a value-training procedure that mixes data from both nominal and safety policy distributions. Experiments on simulated benchmarks and hardware with a vision-based manipulation policy demonstrate that LatentCBF enables smooth safety filtering while doubling the task-completion rate over prior switching methods.
中文摘要 潜在安全滤波器扩展了哈密顿-雅可比（HJ）可达性，使其能够直接从高维观测中获得的潜态表示和动力学，从而在难以建模的约束下实现安全的视觉运动控制。然而，现有方法实现了“最小限制”过滤，在名义策略和安全策略之间离散切换，这可能削弱了现代视觉运动策略价值所在的任务性能。虽然原则上可达性价值函数可以被改编为基于平滑优化的控制障碍函数（CBF），但我们从理论和经验上证明，当前潜空间学习方法产生根本不兼容的价值函数。我们识别出两个不兼容的来源：首先，在HJ可达性中，失败通过潜在空间中的“margin函数”编码，其符号表示潜在函数是否在约束集中。然而，将边际函数表示为分类器时，会产生饱和值函数，表现出不连续的跳跃。我们证明了价值函数的利普希茨常数与边际函数的利普希茨常数线性成比例，表明光滑的CBF需要平滑的边际。其次，仅基于安全政策数据训练的强化学习（RL）近似，在CBF过滤需要的地方，往往对名义策略动作产生不准确的价值估计。我们提出了潜在CBF，通过梯度惩罚实现平滑的边际函数而无需额外标记，以及混合名义与安全政策分布数据的价值训练程序，解决了这两个挑战。在模拟基准测试和基于视觉的作策略硬件上的实验表明，LatentCBF能够实现平滑的安全过滤，同时将任务完成率提高到以往切换方法的两倍。
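
The contrast between least-restrictive switching and smooth CBF filtering can be made concrete with the standard single-constraint CBF quadratic program, which has a closed-form solution. This is a generic textbook sketch, not the paper's LatentCBF: the affine constraint `a . u + b >= 0` is taken as given, whereas in the paper it would be derived from the learned latent value function and dynamics. `cbf_filter` and `switching_filter` are hypothetical names.

```python
from typing import List, Sequence


def cbf_filter(u_nom: Sequence[float], a: Sequence[float], b: float) -> List[float]:
    """Solve min ||u - u_nom||^2 subject to a . u + b >= 0 in closed form.

    The nominal action is returned unchanged when it already satisfies the
    constraint; otherwise it is projected onto the constraint boundary.
    """
    slack = sum(ai * ui for ai, ui in zip(a, u_nom)) + b
    norm2 = sum(ai * ai for ai in a)
    if slack >= 0 or norm2 == 0:
        return list(u_nom)
    return [ui - slack / norm2 * ai for ui, ai in zip(u_nom, a)]


def switching_filter(u_nom: Sequence[float], u_safe: Sequence[float],
                     value: float, eps: float = 0.0) -> List[float]:
    """Least-restrictive filter: jump to the safety action once value <= eps."""
    return list(u_nom) if value > eps else list(u_safe)
```

The projection changes the action only as much as the constraint requires, so the filtered action varies continuously with the inputs, while the switching filter can jump between two unrelated actions.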

Multi-Agent Cross-Entropy Method with Monotonic Nonlinear Critic Decomposition

多智能体交叉熵方法，采用单调非线性批判分解

Authors: Yan Wang, Ke Deng, Yongli Ren
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2511.18671
Pdf link: https://arxiv.org/pdf/2511.18671
Abstract Cooperative multi-agent reinforcement learning (MARL) commonly adopts centralized training with decentralized execution (CTDE), where centralized critics leverage global information to guide decentralized actors. However, centralized-decentralized mismatch (CDM) arises when the suboptimal behavior of one agent degrades others' learning. Prior approaches mitigate CDM through value decomposition, but linear decompositions allow per-agent gradients at the cost of limited expressiveness, while nonlinear decompositions improve representation but require centralized gradients, reintroducing CDM. To overcome this trade-off, we propose the multi-agent cross-entropy method (MCEM), combined with monotonic nonlinear critic decomposition (NCD). MCEM updates policies by increasing the probability of high-value joint actions, thereby excluding suboptimal behaviors. For sample efficiency, we extend off-policy learning with a modified k-step return and Retrace. Analysis and experiments demonstrate that MCEM outperforms state-of-the-art methods across both continuous and discrete action benchmarks.
中文摘要 合作多智能体强化学习（MARL）通常采用集中式训练与去中心化执行（CTDE），其中集中式批评者利用全球信息引导去中心化行为者。然而，中心化-去中心化错配（CDM）是指当一个代理的次优行为降低了其他代理的学习能力时。以往方法通过值分解来缓解CDM，但线性分解允许每个代理的梯度，但表现力有限;而非线性分解则提升了表示性，但需要集中梯度，重新引入CDM。为克服这一权衡，我们提出了多智能体交叉熵法（MCEM）结合单调非线性批判分解（NCD）。MCEM通过提高高价值联合行动的概率来更新策略，从而排除次优行为。为了提高样本效率，我们通过修改后的k步返回和回溯来扩展非策略学习。分析和实验表明，MCEM在连续和离散动作基准测试中均优于最先进方法。
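
The idea of updating policies by raising the probability of high-value joint actions can be illustrated with a plain cross-entropy-method step over discrete per-agent policies. This is a generic CEM sketch, not the MCEM algorithm itself: the critic `q` is an arbitrary callable rather than a monotonic nonlinear decomposition, and the sample count, elite fraction, and step size are arbitrary assumptions.

```python
import random
from typing import Callable, List, Tuple


def cem_update(policies: List[List[float]],
               q: Callable[[Tuple[int, ...]], float],
               rng: random.Random,
               n_samples: int = 64,
               elite_frac: float = 0.25,
               lr: float = 0.5) -> List[List[float]]:
    """One cross-entropy step: move each agent toward its elite action frequencies."""
    # Sample joint actions from the current decentralized policies.
    samples = [
        tuple(rng.choices(range(len(p)), weights=p)[0] for p in policies)
        for _ in range(n_samples)
    ]
    # Keep the joint actions that the critic scores highest.
    samples.sort(key=q, reverse=True)
    elites = samples[: max(1, int(n_samples * elite_frac))]
    updated = []
    for i, p in enumerate(policies):
        freq = [sum(e[i] == a for e in elites) / len(elites) for a in range(len(p))]
        updated.append([(1 - lr) * pa + lr * fa for pa, fa in zip(p, freq)])
    return updated
```

Because only elite joint actions contribute to the update, an agent's poor samples are simply discarded instead of being propagated to its teammates through a shared gradient.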

Seeing What Matters: Visual Preference Policy Optimization for Visual Generation

看清什么重要：视觉偏好政策优化，用于视觉生成

Authors: Ziqi Ni, Yuanzhi Liang, Rui Li, Yi Zhou, Haibing Huang, Chi Zhang, Xuelong Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.18719
Pdf link: https://arxiv.org/pdf/2511.18719
Abstract Reinforcement learning (RL) has become a powerful tool for post-training visual generative models, with Group Relative Policy Optimization (GRPO) increasingly used to align generators with human preferences. However, existing GRPO pipelines rely on a single scalar reward per sample, treating each image or video as a holistic entity and ignoring the rich spatial and temporal structure of visual content. This coarse supervision hinders the correction of localized artifacts and the modeling of fine-grained perceptual cues. We introduce Visual Preference Policy Optimization (ViPO), a GRPO variant that lifts scalar feedback into structured, pixel-level advantages. ViPO employs a Perceptual Structuring Module that uses pretrained vision backbones to construct spatially and temporally aware advantage maps, redistributing optimization pressure toward perceptually important regions while preserving the stability of standard GRPO. Across both image and video benchmarks, ViPO consistently outperforms vanilla GRPO, improving in-domain alignment with human-preference rewards and enhancing generalization on out-of-domain evaluations. The method is architecture-agnostic, lightweight, and fully compatible with existing GRPO training pipelines, providing a more expressive and informative learning signal for visual generation.
中文摘要 强化学习（RL）已成为后训练可视化生成模型的强大工具，群体相对策略优化（Group Relative Policy Optimization，GRPO）越来越多地用于将生成器与人类偏好对齐。然而，现有的GRPO流程依赖于每个样本的单一标量奖励，将每张图片或视频视为整体整体，忽视了视觉内容丰富的空间和时间结构。这种粗糙的监督阻碍了局部伪影的纠正和细粒度感知线索的建模。我们引入了视觉偏好策略优化（ViPO），这是一种GRPO变体，将标量反馈提升为结构化的像素级优势。ViPO采用感知结构模块，利用预训练的视觉骨干构建空间和时间感知的优势图，将优化压力重新分配到感知重要区域，同时保持标准GRPO的稳定性。在图像和视频基准测试中，ViPO持续优于普通GRPO，提升了与人类偏好奖励的领域内对齐度，并增强了域外评估的泛化性。该方法不依赖架构，轻量级，且与现有GRPO训练流程完全兼容，为视觉生成提供更具表现力和信息量的学习信号。
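
Lifting a scalar advantage into a spatial advantage map can be sketched as a reweighting by a saliency map. This is an assumption-laden illustration: the abstract does not specify the normalization, so the weights here are scaled to have mean 1, which keeps the average advantage equal to the original scalar. The `saliency` input stands in for the output of the Perceptual Structuring Module, and `advantage_map` is a hypothetical name.

```python
from typing import List


def advantage_map(scalar_adv: float, saliency: List[List[float]]) -> List[List[float]]:
    """Redistribute one scalar advantage over pixels in proportion to saliency.

    The weights are normalized to mean 1, so the map averages back to
    `scalar_adv`. A map with no positive mass falls back to a uniform map.
    """
    flat = [s for row in saliency for s in row]
    mean = sum(flat) / len(flat)
    if mean <= 0:
        return [[scalar_adv for _ in row] for row in saliency]
    return [[scalar_adv * s / mean for s in row] for row in saliency]
```

Pixels with high saliency receive a larger share of the optimization pressure, while the overall scale of the update stays comparable to standard GRPO.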

Reinforcement Learning for Self-Healing Material Systems

自我修复材料系统的强化学习

Authors: Maitreyi Chatterjee, Devansh Agarwal, Biplab Chatterjee
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.18728
Pdf link: https://arxiv.org/pdf/2511.18728
Abstract The transition to autonomous material systems necessitates adaptive control methodologies to maximize structural longevity. This study frames the self-healing process as a Reinforcement Learning (RL) problem within a Markov Decision Process (MDP), enabling agents to autonomously derive optimal policies that efficiently balance structural integrity maintenance against finite resource consumption. A comparative evaluation of discrete-action (Q-learning, DQN) and continuous-action (TD3) agents in a stochastic simulation environment revealed that RL controllers significantly outperform heuristic baselines, achieving near-complete material recovery. Crucially, the TD3 agent utilizing continuous dosage control demonstrated superior convergence speed and stability, underscoring the necessity of fine-grained, proportional actuation in dynamic self-healing applications.
中文摘要 向自主材料系统的过渡需要自适应控制方法以最大化结构寿命。本研究将自我修复过程框架为马尔可夫决策过程（MDP）中的强化学习（RL）问题，使智能体能够自主推导最优策略，高效平衡结构完整性维护与有限资源消耗。在随机模拟环境中对离散作用（Q-learning，DQN）和连续作用（TD3）代理的比较评估显示，强化学习控制器的表现显著优于启发式基线，实现了近乎完整的材料回收。关键是，采用连续剂量控制的TD3药物展现出卓越的收敛速度和稳定性，凸显了动态自愈应用中细粒度、比例驱动的必要性。

ProxT2I: Efficient Reward-Guided Text-to-Image Generation via Proximal Diffusion

ProxT2I：通过近端扩散实现高效的奖励引导文本转图像生成

Authors: Zhenghan Fang, Jian Zheng, Qiaozi Gao, Xiaofeng Gao, Jeremias Sulam
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.18742
Pdf link: https://arxiv.org/pdf/2511.18742
Abstract Diffusion models have emerged as a dominant paradigm for generative modeling across a wide range of domains, including prompt-conditional generation. The vast majority of samplers, however, rely on forward discretization of the reverse diffusion process and use score functions that are learned from data. Such forward and explicit discretizations can be slow and unstable, requiring a large number of sampling steps to produce good-quality samples. In this work we develop a text-to-image (T2I) diffusion model based on backward discretizations, dubbed ProxT2I, relying on learned and conditional proximal operators instead of score functions. We further leverage recent advances in reinforcement learning and policy optimization to optimize our samplers for task-specific rewards. Additionally, we develop a new large-scale and open-source dataset comprising 15 million high-quality human images with fine-grained captions, called LAION-Face-T2I-15M, for training and evaluation. Our approach consistently enhances sampling efficiency and human-preference alignment compared to score-based baselines, and achieves results on par with existing state-of-the-art and open-source text-to-image models while requiring lower compute and smaller model size, offering a lightweight yet performant solution for human text-to-image generation.
中文摘要 扩散模型已成为生成建模在多个领域（包括提示条件生成）中的主导范式。然而，绝大多数采样器依赖于逆扩散过程的正向离散化，并使用从数据中学习的评分函数。这种前向且显式的离散化可能缓慢且不稳定，需要大量采样步骤才能产生高质量的样本。本研究中，我们基于逆向离散化开发了文本到图像（T2I）扩散模型，称为ProxT2I，依赖学习和条件近端算子代替评分函数。我们还进一步利用强化学习和策略优化的最新进展，优化采样器以获得任务特定奖励。此外，我们还开发了一个包含1500万张高质量人类图像和细粒度说明的新型大规模开源数据集，名为LAION-Face-T2I-15M，用于训练和评估。我们的方法持续提升了抽样效率和人类偏好对齐，相较于基于分数的基线，在需要更低计算和更小模型的同时，实现与现有最先进和开源文本转图像模型相当的结果，为人类文本转图像提供了轻量级但性能优异的解决方案。

VideoPerceiver: Enhancing Fine-Grained Temporal Perception in Video Multimodal Large Language Models

视频感知器：增强视频多模态大型语言模型中的细粒度时间感知

Authors: Fufangchen Zhao, Liao Zhang, Daiqi Shi, Yuanjun Gao, Chen Ye, Yang Cai, Jian Gao, Danfeng Yan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.18823
Pdf link: https://arxiv.org/pdf/2511.18823
Abstract We propose VideoPerceiver, a novel video multimodal large language model (VMLLM) that enhances fine-grained perception in video understanding, addressing VMLLMs' limited ability to reason about brief actions in short clips or rare transient events in long videos. VideoPerceiver adopts a two-stage training framework. During supervised fine-tuning (SFT), we construct "key-information-missing" videos by extracting event-action keywords from captions, identifying corresponding key frames, and replacing them with adjacent frames. We jointly encode original and modified video tokens with text tokens, aligning intermediate visual representations with keywords via an auxiliary contrastive loss to enhance sensitivity to fine-grained motion cues. In reinforcement learning (RL), both video variants are fed into the model to generate descriptions, and a novel relative reward ensures responses from complete videos outperform those from degraded inputs, explicitly training the model to recover temporally precise action details. We also curate a dataset of 80,000 videos with fine-grained actions and transient events. Experiments show VideoPerceiver substantially outperforms state-of-the-art VMLLMs on fine-grained action understanding and rare event captioning benchmarks, while maintaining strong performance on standard tasks. By prioritizing task-relevant visual features, our work redefines video-language model training for fine-grained perception.
中文摘要 我们提出了VideoPerceiver，一种新型视频多模态大型语言模型（VMLLM），能够增强视频理解中的细粒度感知，解决VMLLM在推理短视频中短暂动作或长视频中罕见瞬态事件时有限的能力。VideoPerceiver 采用两阶段培训框架。在监督微调（SFT）过程中，我们通过从字幕中提取事件-动作关键词，识别对应的关键帧，并用相邻的帧替换，构建“缺失关键信息”视频。我们将原始和修改后的视频符号与文本符号共同编码，通过辅助对比损失对中间视觉表示与关键词对齐，以增强对细粒度运动线索的敏感度。在强化学习（RL）中，两种视频变体都输入模型生成描述，一种新的相对奖励确保完整视频的响应优于退化输入的响应，明确训练模型恢复时间上的精确动作细节。我们还策划了8万个细粒度动作和瞬时事件视频数据集。实验显示，VideoPerceiver 在细粒度动作理解和罕见事件字幕基准测试上远超最先进的 VMLLM，同时在标准任务中保持强劲表现。通过优先考虑任务相关的视觉特征，我们的工作重新定义了针对细粒度感知的视频语言模型训练。

PrismAudio: Decomposed Chain-of-Thoughts and Multi-dimensional Rewards for Video-to-Audio Generation

PrismAudio：分解的思维链条与视频转音频生成的多维奖励

Authors: Huadai Liu, Kaicheng Luo, Wen Wang, Qian Chen, Peiwen Sun, Rongjie Huang, Xiangang Li, Jieping Ye, Wei Xue
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2511.18833
Pdf link: https://arxiv.org/pdf/2511.18833
Abstract Video-to-Audio (V2A) generation requires balancing four critical perceptual dimensions: semantic consistency, audio-visual temporal synchrony, aesthetic quality, and spatial accuracy; yet existing methods suffer from objective entanglement that conflates competing goals in single loss functions and lack human preference alignment. We introduce PrismAudio, the first framework to integrate Reinforcement Learning into V2A generation with specialized Chain-of-Thought (CoT) planning. Our approach decomposes monolithic reasoning into four specialized CoT modules (Semantic, Temporal, Aesthetic, and Spatial CoT), each paired with targeted reward functions. This CoT-reward correspondence enables multidimensional RL optimization that guides the model to jointly generate better reasoning across all perspectives, solving the objective entanglement problem while preserving interpretability. To make this optimization computationally practical, we propose Fast-GRPO, which employs hybrid ODE-SDE sampling that dramatically reduces the training overhead compared to existing GRPO implementations. We also introduce AudioCanvas, a rigorous benchmark that is more distributionally balanced and covers more realistically diverse and challenging scenarios than existing datasets, with 300 single-event classes and 501 multi-event samples. Experimental results demonstrate that PrismAudio achieves state-of-the-art performance across all four perceptual dimensions on both the in-domain VGGSound test set and out-of-domain AudioCanvas benchmark. The project page is available at this https URL.
中文摘要 视频转音频（V2A）生成需要平衡四个关键的感知维度：语义一致性、视听时间同步性、美学质量和空间准确性;然而，现有方法存在客观纠缠，导致单个损失函数中将竞争目标混淆，且缺乏人类偏好的对齐。我们引入了PrismAudio，这是首个将强化学习集成到V2A生成中，并采用专业思维链（CoT）规划的框架。我们的方法将单一推理分解为四个专门的CoT模块（语义、时间、美学和空间CoT），每个模块都配有针对性的奖励函数。这种CoT-奖励对应关系使得多维强化学习优化成为可能，引导模型在所有视角共同生成更好的推理，解决客观纠缠问题同时保持可解释性。为了使这一优化在计算上更为实用，我们提出了Fast-GRPO，采用混合的常微分方程-SDE采样，相比现有GRPO实现大幅降低了训练开销。我们还推出了AudioCanvas，这是一个更为分布平衡的严谨基准测试，涵盖了比现有数据集更真实且多样化且更具挑战性的场景，包含300个单事件类和501个多事件样本。实验结果表明，PrismAudio在域内VGGSound测试集和域外AudioCanvas基准测试中，在所有四个感知维度上都实现了最先进的性能。项目页面可在此 https 网址访问。

Periodic Asynchrony: An Effective Method for Accelerating On-Policy Reinforcement Learning

周期性非同步：加速策略上强化学习的有效方法

Authors: Jian Lu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.18871
Pdf link: https://arxiv.org/pdf/2511.18871
Abstract Since the introduction of the GRPO algorithm, reinforcement learning (RL) has attracted increasing attention, with growing efforts to reproduce and apply it. However, training efficiency remains a critical challenge. In mainstream RL frameworks, inference and training are typically deployed on the same devices. While this approach reduces costs through resource consolidation, its synchronous execution imposes a computational coupling that prevents concurrent inference and training. In this study, we return to the strategy of deploying inference and training separately. By improving the data loader, we transform the conventional synchronous architecture into a periodically asynchronous framework that allows demand-driven, independent, and elastic scaling of each component. The accuracy of the algorithm remains completely equivalent to that of the synchronous method, and both are on-policy. We also apply a unified tri-model architecture in the training phase and propose a shared-prompt attention mask to reduce repetitive computation. In practice, these changes achieve at least a threefold overall performance improvement in RL training on NPU platforms, indicating their potential for widespread application.
中文摘要 自GRPO算法引入以来，强化学习（RL）受到越来越多的关注，复制和应用的努力也在不断增加。然而，培训效率仍是一个关键挑战。在主流强化学习框架中，推理和训练通常部署在同一设备上。虽然这种方法通过资源整合降低了成本，但其同步执行会带来计算耦合，从而阻止并发推理和训练。本研究回归推理与训练部署分离的策略，通过改进数据加载器，我们将传统的同步架构转变为周期性异步框架，允许需求驱动、独立且弹性地扩展每个组件，同时算法的准确性与同步方法完全等效，两者都属于政策策略。值得强调的是，我们在训练阶段采用了统一的三模型架构，同时我们还提出了共享提示注意力掩码以减少重复计算。实际上，这些工作在NPU平台上的强化学习训练整体性能提升了至少三倍，显示出其广泛应用潜力。
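
The abstract above mentions a shared-prompt attention mask but does not define it. The following is a minimal sketch of one plausible reading, assuming several sampled responses are packed behind a single copy of their prompt: each response token may attend to the whole prompt and to earlier tokens of its own response, but never to a sibling response. The function name `shared_prompt_mask` and its parameters are hypothetical, not taken from the paper.

```python
def shared_prompt_mask(prompt_len, response_lens):
    """Boolean attention mask for one prompt followed by several packed responses.

    mask[i][j] is True when token i may attend to token j.
    This layout is an assumption for illustration, not the paper's definition.
    """
    total = prompt_len + sum(response_lens)
    mask = [[False] * total for _ in range(total)]

    # Prompt tokens use ordinary causal attention within the prompt.
    for i in range(prompt_len):
        for j in range(i + 1):
            mask[i][j] = True

    # Each response sees the whole prompt plus its own earlier tokens only.
    start = prompt_len
    for length in response_lens:
        for i in range(start, start + length):
            for j in range(prompt_len):
                mask[i][j] = True
            for j in range(start, i + 1):
                mask[i][j] = True
        start += length
    return mask


# One 2-token prompt shared by a 2-token response and a 1-token response.
mask = shared_prompt_mask(prompt_len=2, response_lens=[2, 1])
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
```

Under this reading the prompt is encoded once per group instead of once per response, which is where the saving in repeated computation would come from. A real implementation would also need per-response position indices, which this sketch omits.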

Accelerating Reinforcement Learning via Error-Related Human Brain Signals

通过错误相关的人脑信号加速强化学习

Authors: Suzie Kim, Hye-Bin Shin, Hyo-Jeong Jang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.18878
Pdf link: https://arxiv.org/pdf/2511.18878
Abstract In this work, we investigate how implicit neural feedback can accelerate reinforcement learning in complex robotic manipulation settings. While prior electroencephalogram (EEG) guided reinforcement learning studies have primarily focused on navigation or low-dimensional locomotion tasks, we aim to understand whether such neural evaluative signals can improve policy learning in high-dimensional manipulation tasks involving obstacles and precise end-effector control. We integrate error-related potentials decoded from offline-trained EEG classifiers into reward shaping and systematically evaluate the impact of human-feedback weighting. Experiments on a 7-DoF manipulator in an obstacle-rich reaching environment show that neural feedback accelerates reinforcement learning and, depending on the human-feedback weighting, can yield task success rates that at times exceed those of sparse-reward baselines. Moreover, when applying the best-performing feedback weighting across all subjects, we observe consistent acceleration of reinforcement learning relative to the sparse-reward setting. Furthermore, leave-one-subject-out evaluations confirm that the proposed framework remains robust despite the intrinsic inter-individual variability in EEG decodability. Our findings demonstrate that EEG-based reinforcement learning can scale beyond locomotion tasks and provide a viable pathway for human-aligned manipulation skill acquisition.
中文摘要 本研究探讨隐式神经反馈如何加速复杂机器人作环境中的强化学习。虽然此前脑电图（EEG）引导强化学习研究主要聚焦于导航或低维移动任务，我们旨在了解此类神经评估信号是否能提升涉及障碍物和精确末端执行器控制的高维作任务中的策略学习。我们将离线训练的脑电分类器解码的误差相关电位整合进奖励塑造，并系统评估人类反馈加权的影响。在障碍物丰富且可及的环境中，使用7-DoF作器进行的实验表明，神经反馈加速强化学习，并且根据人类反馈权重的不同，任务成功率有时甚至超过稀疏奖励基线。此外，当对所有子要素施加最佳反馈权重时，我们观察到相对于稀疏-奖励设置，强化学习的加速持续。此外，遗漏一主体评估确认，尽管脑电图可解码性存在个体间固有变异性，该框架依然稳健。我们的发现表明，基于脑电图的强化学习可以超越移动任务，为人类对齐的作技能习得提供可行的路径。
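
The reward shaping described in the abstract above can be illustrated with a small sketch. It assumes the decoded error-related potential is available as a probability in [0, 1] and enters the reward as a weighted penalty. The abstract does not give the exact form, so the function `shaped_reward` and the parameter names `errp_probability` and `feedback_weight` are hypothetical.

```python
def shaped_reward(sparse_reward, errp_probability, feedback_weight):
    """Combine the task reward with a penalty from a decoded error-related potential.

    errp_probability: classifier estimate in [0, 1] that the human observer
        perceived the robot's last action as an error (assumed input format).
    feedback_weight: non-negative weight of the human-feedback term
        (hypothetical name for the weighting the abstract refers to).
    """
    if not 0.0 <= errp_probability <= 1.0:
        raise ValueError("errp_probability must lie in [0, 1]")
    if feedback_weight < 0.0:
        raise ValueError("feedback_weight must be non-negative")
    return sparse_reward - feedback_weight * errp_probability


# A step with no task reward where the decoder reports a likely error.
print(shaped_reward(sparse_reward=0.0, errp_probability=0.8, feedback_weight=0.5))
```

Setting `feedback_weight` to zero recovers the sparse-reward baseline, so sweeping this single parameter is one simple way to study the effect of human-feedback weighting.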

Learning to Compress Graphs via Dual Agents for Consistent Topological Robustness Evaluation

学习通过对偶代理压缩图以实现一致的拓扑鲁棒性评估

Authors: Qisen Chai, Yansong Wang, Junjie Huang, Tao Jia
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.18958
Pdf link: https://arxiv.org/pdf/2511.18958
Abstract As graph-structured data grow increasingly large, evaluating their robustness under adversarial attacks becomes computationally expensive and difficult to scale. To address this challenge, we propose to compress graphs into compact representations that preserve both topological structure and robustness profile, enabling efficient and reliable evaluation. We propose Cutter, a dual-agent reinforcement learning framework composed of a Vital Detection Agent (VDA) and a Redundancy Detection Agent (RDA), which collaboratively identify structurally vital and redundant nodes for guided compression. Cutter incorporates three key strategies to enhance learning efficiency and compression quality: trajectory-level reward shaping to transform sparse trajectory returns into dense, policy-equivalent learning signals; prototype-based shaping to guide decisions using behavioral patterns from both high- and low-return trajectories; and cross-agent imitation to enable safer and more transferable exploration. Experiments on multiple real-world graphs demonstrate that Cutter generates compressed graphs that retain essential static topological properties and exhibit robustness degradation trends highly consistent with the original graphs under various attack scenarios, thereby significantly improving evaluation efficiency without compromising assessment fidelity.
中文摘要 随着图结构化数据日益庞大，评估其在对抗性攻击下的鲁棒性变得计算成本高且难以扩展。为应对这一挑战，我们提出将图压缩为紧凑的表示形式，既保持拓扑结构又保持鲁棒性，从而实现高效可靠的Cutter，这是一个由重要检测代理（VDA）和冗余检测代理（RDA）组成的双代理强化学习框架，协同识别结构上重要和冗余的节点进行引导压缩。Cutter 采用了三项关键策略来提升学习效率和压缩质量：轨迹级奖励塑造，将稀疏的轨迹回报转化为密集、策略对应的学习信号;基于原型的塑造，利用高回报和低回报轨迹的行为模式指导决策;以及跨智能体仿制，以实现更安全、更易转移的探索。多重真实世界图的实验表明，Cutter生成的压缩图保留了基本的静态拓扑性质，并在各种攻击场景下表现出与原始图高度一致的鲁棒性退化趋势，从而显著提升了评估效率，同时不影响评估准确性。

FastForward Pruning: Efficient LLM Pruning via Single-Step Reinforcement Learning

快进剪枝：通过单步强化学习实现高效的大型语言模型修剪

Authors: Xin Yuan, Siqi Li, Jiateng Wei, Chengrui Zhu, Yanming Wu, Qingpeng Li, Jiajun Lv, Xiaoke Lan, Jun Chen, Yong Liu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.18977
Pdf link: https://arxiv.org/pdf/2511.18977
Abstract Pruning is an effective method for compressing Large Language Models, but finding an optimal, non-uniform layer-wise sparsity allocation remains a key challenge. While heuristic methods are fast but yield suboptimal performance, more powerful search-based approaches like Reinforcement Learning are often hindered by prohibitive computational costs on large-scale models. To overcome this efficiency barrier, we propose FastForward Pruning. Its core is a decoupled, single-step RL framework that separates policy optimization from the complex budget satisfaction problem. Such a decoupling is crucial for efficiently searching the vast policy space of LLMs. This curriculum-based strategy begins with low-cost, simple tasks and gradually increases in complexity, significantly reducing the search's computational overhead. Evaluated on the LLaMA, Mistral, and OPT model families, our framework discovers pruning policies that achieve superior performance over strong heuristic baselines. Crucially, when compared to other search-based algorithms, our method achieves competitive or superior results at a fraction of the computational cost, demonstrating a clear advantage in search efficiency.
中文摘要 剪枝是压缩大型语言模型的有效方法，但找到最优且非均匀的层间稀疏度分配仍是一个关键挑战。虽然启发式方法速度快但性能不理想，更强大的基于搜索的方法如强化学习常因大规模模型中高昂的计算成本而受阻。为了克服这一效率障碍，我们提出了快速剪枝技术。其核心是一个解耦的单步强化学习框架，将政策优化与复杂的预算满足问题区分开来。这种解耦对于高效搜索广阔的大型语言模型政策空间至关重要。这种基于课程的策略从低成本、简单的任务开始，逐步增加复杂度，显著降低搜索的计算开销。在LLaMA、Mistral和OPT模型家族的评估中，我们的框架发现了在强启发式基线上实现更优绩效的剪枝策略。关键是，与其他基于搜索的算法相比，我们的方法以极低的计算成本实现了具有竞争力或更优的结果，显示出搜索效率上的明显优势。

Dynamic Mixture of Experts Against Severe Distribution Shifts

专家动态组合以应对严重的分布变动

Authors: Donghu Kim
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.18987
Pdf link: https://arxiv.org/pdf/2511.18987
Abstract The challenge of building neural networks that can continuously learn and adapt to evolving data streams is central to the fields of continual learning (CL) and reinforcement learning (RL). This lifelong learning problem is often framed in terms of the plasticity-stability dilemma, focusing on issues like loss of plasticity and catastrophic forgetting. Unlike neural networks, biological brains maintain plasticity through capacity growth, inspiring researchers to explore similar approaches in artificial networks, such as adding capacity dynamically. Prior solutions often lack parameter efficiency or depend on explicit task indices, but Mixture-of-Experts (MoE) architectures offer a promising alternative by specializing experts for distinct distributions. This paper aims to evaluate a DynamicMoE approach for continual and reinforcement learning environments and benchmark its effectiveness against existing network expansion methods.
中文摘要 构建能够持续学习并适应不断变化数据流的神经网络，是持续学习（CL）和强化学习（RL）领域的核心挑战。这一终身学习问题常被用可塑性与稳定性困境来框架，重点关注可塑性丧失和灾难性遗忘等问题。与神经网络不同，生物大脑通过容量增长保持可塑性，这激励研究人员探索人工网络中的类似方法，如动态增加容量。以往的解决方案通常缺乏参数效率或依赖显式任务指标，但专家混合架构（MoE）通过专门化专家处理不同分布，提供了有前景的替代方案。本文旨在评估动态MoE方法在持续和强化学习环境中的应用，并对其与现有网络扩展方法的有效性进行基准对比。

Energy-Efficient Routing Protocol in Vehicular Opportunistic Networks: A Dynamic Cluster-based Routing Using Deep Reinforcement Learning

车载机会网络中的节能路由协议：基于深度强化学习的动态集群路由

Authors: Meisam Sahrifi Sani, Saeid Iranmanesh, Raad Raad, Faisel Tubbal
Subjects: Networking and Internet Architecture (cs.NI); Methodology (stat.ME)
Arxiv link: https://arxiv.org/abs/2511.19026
Pdf link: https://arxiv.org/pdf/2511.19026
Abstract Opportunistic Networks (OppNets) employ the Store-Carry-Forward (SCF) paradigm to maintain communication during intermittent connectivity. However, routing performance suffers due to dynamic topology changes, unpredictable contact patterns, and resource constraints including limited energy and buffer capacity. These challenges compromise delivery reliability, increase latency, and reduce node longevity in highly dynamic environments. This paper proposes Cluster-based Routing using Deep Reinforcement Learning (CR-DRL), an adaptive routing approach that integrates an Actor-Critic learning framework with a heuristic function. CR-DRL enables real-time optimal relay selection and dynamic cluster overlap adjustment to maintain connectivity while minimizing redundant transmissions and enhancing routing efficiency. Simulation results demonstrate significant improvements over state-of-the-art baselines. CR-DRL extends node lifetimes by up to 21%, overall energy use is reduced by 17%, and nodes remain active for 15% longer. Communication performance also improves, with up to 10% higher delivery ratio, 28.5% lower delay, 7% higher throughput, and data requiring 30% fewer transmission steps across the network.
中文摘要 机会性网络（OppNet）采用存储-转发（SCF）范式，在间歇性连接期间保持通信。然而，由于动态拓扑变化、不可预测的接触模式以及有限的能量和缓冲容量等资源限制，布线性能会受到影响。这些挑战在高度动态环境中损害了传输可靠性，增加了延迟，并缩短了节点寿命。本文提出了基于集群的深度强化学习（CR-DRL）路由方法，这是一种将演员-批判者学习框架与启发式函数集成的自适应路由方法。CR-DRL支持实时最优中继选择和动态集群重叠调整，以保持连接性，同时减少冗余传输并提升路由效率。模拟结果显示，比最先进的基线有显著提升。CR-DRL将节点寿命延长多达21%，整体能耗减少17%，节点的活跃时间延长15%。通信性能也有所提升，传输率提升了最多10%，延迟降低了28.5%，吞吐量提高了7%，数据传输步骤减少了30%。

ReEXplore: Improving MLLMs for Embodied Exploration with Contextualized Retrospective Experience Replay

ReEXplore：通过情境化回顾式经验回放改进MLLM的具身探索

Authors: Gengyuan Zhang, Mingcong Ding, Jingpei Wu, Ruotong Liao, Volker Tresp
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.19033
Pdf link: https://arxiv.org/pdf/2511.19033
Abstract Embodied exploration is a target-driven process that requires embodied agents to possess fine-grained perception and knowledge-enhanced decision making. While recent attempts leverage MLLMs for exploration due to their strong perceptual and reasoning abilities, we find that MLLM-based embodied agents remain suboptimal in exploring new environments: (i) they rely on profound but stale pre-trained knowledge, (ii) training-based approaches such as imitation learning or reinforcement learning are expensive for long-horizon tasks with sparse outcome rewards, and (iii) frontier-based exploration yields a large, visually nuanced action space in which it is difficult for MLLMs to make reliable decisions. We address these challenges with ReEXplore, a training-free framework that performs retrospective experience replay to inject distilled, abstract experience at inference time, and hierarchical frontier selection to decompose frontier ranking into coarse-to-fine decisions. Our approach enables robust, traceable, and efficient exploration. Across multiple embodied exploration benchmarks, ReEXplore yields large improvements over strong MLLM baselines, with up to 3x higher performance in both success rate and navigation efficiency under open-source backbones.
中文摘要 具身探索是一种目标驱动的过程，要求具身者具备细致的感知和知识增强的决策能力。尽管近期尝试利用MLLM进行探索，因其强大的感知和推理能力，但我们发现基于MLLM的具身智能体在探索新环境方面仍不尽如人意：（i）它们依赖深刻但陈旧的预训练知识，（ii）模仿学习或强化学习等基于训练的方法对于长期任务成本高昂且结果奖励稀疏，（iii）基于前沿的探索产生了大量，视觉上细腻的动作空间，使MLLM难以做出可靠决策。我们通过ReEXPLORE来应对这些挑战，这是一个无培训的框架，通过回溯经验回放，在推理时注入精炼抽象的经验，并通过层级前沿选择将前沿排名分解为粗细决策。我们的方法实现了鲁健、可追溯且高效的探索。在多个具象探索基准中，ReEXplore相较于强大的MLLM基线有了显著提升，在开源骨干网下，成功率和导航效率均提升了多达3倍。

DeCoRL: Decoupling Reasoning Chains via Parallel Sub-Step Generation and Cascaded Reinforcement for Interpretable and Scalable RLHF

DeCoRL：通过并行子步生成和级联强化实现可解释和可扩展的RLHF的推理链解耦

Authors: Ziyuan Gao, Di Liang, Xianjie Wu, Philippe Morel, Minlong Peng
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.19097
Pdf link: https://arxiv.org/pdf/2511.19097
Abstract Existing reinforcement learning methods for Chain-of-Thought reasoning suffer from two critical limitations. First, they operate as monolithic black boxes that provide undifferentiated reward signals, obscuring individual step contributions and hindering error diagnosis. Second, sequential decoding has O(n) time complexity. This makes real-time deployment impractical for complex reasoning tasks. We present DeCoRL (Decoupled Reasoning Chains via Coordinated Reinforcement Learning), a novel framework that transforms reasoning from sequential processing into collaborative modular orchestration. DeCoRL trains lightweight specialized models to generate reasoning sub-steps concurrently, eliminating sequential bottlenecks through parallel processing. To enable precise error attribution, the framework designs modular reward functions that score each sub-step independently. Cascaded DRPO optimization then coordinates these rewards while preserving inter-step dependencies. Comprehensive evaluation demonstrates state-of-the-art results across RM-Bench, RMB, and RewardBench, outperforming existing methods including large-scale models. DeCoRL delivers 3.8 times faster inference while maintaining superior solution quality and offers a 22.7% improvement in interpretability through explicit reward attribution. These advancements, combined with a 72.4% reduction in energy consumption and a 68% increase in throughput, make real-time deployment of complex reasoning systems a reality.
中文摘要 现有的思维链推理强化学习方法存在两个关键局限。首先，它们作为单一黑箱运作，提供未区分的奖励信号，掩盖单个步数贡献，阻碍错误诊断。其次，顺序解码的时间复杂度为O（n）。这使得实时部署在复杂推理任务中变得不切实际。我们提出了DeCoRL（通过协调强化学习实现解耦推理链），这是一个新颖框架，将推理从顺序处理转变为协作模块化编排。DeCoRL训练轻量级专用模型，并行生成推理子步骤，通过并行处理消除顺序瓶颈。为了实现精确的错误归因，该框架设计了模块化奖励函数，独立对每个子步骤进行评分。级联DRPO优化随后协调这些奖励，同时保持步间依赖关系。全面的评估展示了RM-Bench、RMB和RewardBench的先进成果，优于包括大型模型在内的现有方法。DeCoRL在保持优越解质量的同时，推理速度提升3.8倍，并通过显式奖励归因提升22.7%的可解释性。这些进步，加上能耗减少72.4%和吞吐量提升68%，使复杂推理系统的实时部署成为现实。

VIL2C: Value-of-Information Aware Low-Latency Communication for Multi-Agent Reinforcement Learning

VIL2C：信息价值感知的低延迟通信，用于多智能体强化学习

Authors: Qian Zhang, Zhuo Sun, Yao Zhang, Zhiwen Yu, Bin Guo, Jun Zhang
Subjects: Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2511.19146
Pdf link: https://arxiv.org/pdf/2511.19146
Abstract Inter-agent communication serves as an effective mechanism for enhancing performance in collaborative multi-agent reinforcement learning (MARL) systems. However, the inherent communication latency in practical systems induces both action decision delays and outdated information sharing, impeding MARL performance gains, particularly in time-critical applications like autonomous driving. In this work, we propose a Value-of-Information aware Low-latency Communication (VIL2C) scheme that proactively adjusts the latency distribution to mitigate its effects in MARL systems. Specifically, we define a Value of Information (VoI) metric to quantify the importance of transmitting each delayed message. Moreover, we propose a progressive message reception mechanism to adaptively adjust the reception duration based on received messages. We derive the optimized VoI-aware resource allocation and theoretically prove the performance advantage of the proposed VIL2C scheme. Extensive experiments demonstrate that VIL2C outperforms existing approaches under various communication conditions. These gains are attributed to the low-latency transmission of high-VoI messages via resource allocation and the elimination of unnecessary waiting periods via adaptive reception duration.
中文摘要 代理间通信作为提升协作多智能体强化学习（MARL）系统性能的有效机制。然而，实际系统固有的通信延迟导致动作决策延迟和信息共享过时，阻碍了MARL的性能提升，尤其是在自动驾驶等时间关键应用中。本研究提出一种信息价值感知的低延迟通信（VIL2C）方案，主动调整延迟分布以减轻其在MARL系统中的影响。具体来说，我们定义了一个信息价值（VOI）指标，用以量化延迟消息传输的重要性，基于每个延迟消息的重要性。此外，我们提出了一种渐进式消息接收机制，根据接收消息量自适应地调整接收时长。我们推导出了优化的VoI感知资源分配，并理论上证明了所提VIL2C方案的性能优势。大量实验表明，VIL2C在各种通信条件下优于现有方法。这些优势归因于通过资源分配低延迟传输高VoI消息，以及通过自适应接收时长消除不必要的等待时间。

RAVEN++: Pinpointing Fine-Grained Violations in Advertisement Videos with Active Reinforcement Reasoning

RAVEN++：通过主动强化推理精准定位广告视频中的细粒度违规

Authors: Deyi Ji, Yuekui Yang, Liqun Liu, Peng Shu, Haiyang Wu, Shaogang Tang, Xudong Chen, Shaoping Ma, Tianrun Chen, Lanyun Zhu
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.19168
Pdf link: https://arxiv.org/pdf/2511.19168
Abstract Advertising (Ad) is a cornerstone of the digital economy, yet the moderation of video advertisements remains a significant challenge due to their complexity and the need for precise violation localization. While recent advancements, such as the RAVEN model, have improved coarse-grained violation detection, critical gaps persist in fine-grained understanding, explainability, and generalization. To address these limitations, we propose RAVEN++, a novel framework that introduces three key innovations: 1) Active Reinforcement Learning (RL), which dynamically adapts training to samples of varying difficulty; 2) Fine-Grained Violation Understanding, achieved through hierarchical reward functions and reasoning distillation; and 3) Progressive Multi-Stage Training, which systematically combines knowledge injection, curriculum-based passive RL, and active RL. Extensive experiments on both public and proprietary datasets, on both offline scenarios and online deployed A/B Testing, demonstrate that RAVEN++ outperforms general-purpose LLMs and specialized models like RAVEN in terms of fine-grained violation understanding, reasoning capabilities, and generalization ability.
中文摘要 广告（Ad）是数字经济的基石，但由于视频广告的复杂性和精确的违规本地化需求，其审核仍是一大挑战。尽管近年来如RAVEN模型等技术进步提高了粗粒度违规检测，但在细粒度理解、可解释性和泛化方面仍存在关键空白。为解决这些局限性，我们提出了RAVEN++，这是一个引入三项关键创新的新框架：1）主动强化学习（RL），能够动态调整训练以适应不同难度的样本;2）细粒度违规理解，通过层级奖励函数和推理提炼实现;以及3）渐进式多阶段训练，系统性地结合了知识注入、基于课程的被动强化学习和主动强化学习。在公开和专有数据集上，无论是离线场景还是在线部署的A/B测试，经过大量实验，都表明RAVEN++在细粒度违规理解、推理能力和泛化能力方面优于通用大型语言模型和RAVEN等专用模型。

Adversarial Attack-Defense Co-Evolution for LLM Safety Alignment via Tree-Group Dual-Aware Search and Optimization

通过树群双重感知搜索与优化实现LLM安全对齐的对抗性攻防共演

Authors: Xurui Li, Kaisong Song, Rui Zhu, Pin-Yu Chen, Haixu Tang
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.19218
Pdf link: https://arxiv.org/pdf/2511.19218
Abstract Large Language Models (LLMs) have developed rapidly in web services, delivering unprecedented capabilities while amplifying societal risks. Existing works tend to focus on either isolated jailbreak attacks or static defenses, neglecting the dynamic interplay between evolving threats and safeguards in real-world web contexts. To mitigate these challenges, we propose ACE-Safety (Adversarial Co-Evolution for LLM Safety), a novel framework that jointly optimizes attack and defense models by seamlessly integrating two key innovative procedures: (1) Group-aware Strategy-guided Monte Carlo Tree Search (GS-MCTS), which efficiently explores jailbreak strategies to uncover vulnerabilities and generate diverse adversarial samples; (2) Adversarial Curriculum Tree-aware Group Policy Optimization (AC-TGPO), which jointly trains attack and defense LLMs with challenging samples via curriculum reinforcement learning, enabling robust mutual improvement. Evaluations across multiple benchmarks demonstrate that our method outperforms existing attack and defense approaches, and provides a feasible pathway for developing LLMs that can sustainably support responsible AI ecosystems.
中文摘要 大型语言模型（LLMs）在网络服务中迅速发展，提供了前所未有的能力，同时放大了社会风险。现有研究往往聚焦于孤立的越狱攻击或静态防御，忽视了现实世界中不断演变的威胁与防护措施之间的动态互动。为缓解这些挑战，我们提出了ACE-Safety（LLM安全的对抗共进化），这是一个新颖框架，通过无缝整合两项关键创新程序，共同优化攻防模型：（1）群觉策略引导蒙特卡洛树搜索（GS-MCTS），高效探索越狱策略以发现漏洞并生成多样化的对抗样本;（2）对抗性课程树感知群策略优化（AC-TGPO），通过课程强化学习联合训练攻击型和防御型大型语言模型，实现强健的相互改进。多项基准测试的评估表明，我们的方法优于现有的攻防方法，并为开发能够可持续支持负责任人工智能生态系统的大型语言模型提供了可行的路径。

MAESTRO: Multi-Agent Environment Shaping through Task and Reward Optimization

MAESTRO：通过任务和奖励优化塑造多智能体环境

Authors: Boyuan Wu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.19253
Pdf link: https://arxiv.org/pdf/2511.19253
Abstract Cooperative Multi-Agent Reinforcement Learning (MARL) faces two major design bottlenecks: crafting dense reward functions and constructing curricula that avoid local optima in high-dimensional, non-stationary environments. Existing approaches rely on fixed heuristics or use Large Language Models (LLMs) directly in the control loop, which is costly and unsuitable for real-time systems. We propose MAESTRO (Multi-Agent Environment Shaping through Task and Reward Optimization), a framework that moves the LLM outside the execution loop and uses it as an offline training architect. MAESTRO introduces two generative components: (i) a semantic curriculum generator that creates diverse, performance-driven traffic scenarios, and (ii) an automated reward synthesizer that produces executable Python reward functions adapted to evolving curriculum difficulty. These components guide a standard MARL backbone (MADDPG) without increasing inference cost at deployment. We evaluate MAESTRO on large-scale traffic signal control (Hangzhou, 16 intersections) and conduct controlled ablations. Results show that combining LLM-generated curricula with LLM-generated reward shaping yields improved performance and stability. Across four seeds, the full system achieves +4.0% higher mean return (163.26 vs. 156.93) and 2.2% better risk-adjusted performance (Sharpe 1.53 vs. 0.70) over a strong curriculum baseline. These findings highlight LLMs as effective high-level designers for cooperative MARL training.
中文摘要 协作多智能体强化学习（MARL）面临两个主要设计瓶颈：设计密集的奖励函数和构建避免高维非定稳环境中局域最优的课程。现有方法依赖固定启发式或直接在控制环中使用大型语言模型（LLM），这成本高且不适合实时系统。我们提出了MAESTRO（通过任务和奖励优化实现多智能体环境塑造）框架，将LLM从执行循环中移出，作为离线训练架构师使用。MAESTRO 引入了两个生成组件：（i）语义课程生成器，创建多样化、以性能为驱动的交通场景;（ii）自动奖励合成器，生成可执行的 Python 奖励函数，适应不断演变的课程难度。这些组件在不增加推断成本的情况下引导标准MARL骨干网（MADDPG）。我们评估MAESTRO在大型交通信号控制（杭州，16个路口）中的应用，并进行受控消融。结果显示，将LLM生成的课程与LLM生成的奖励塑造结合，能够提升表现和稳定性。在四个种子项目中，完整系统在强有力的课程基线下实现了+4.0%的平均回报（163.26对156.93），风险调整后表现提升2.2%（夏普1.53对0.70）。这些发现凸显了LLM作为协作式MARL训练高效设计者的高效设计者。
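
The figures quoted in the abstract above can be rechecked with a few lines of arithmetic. The sketch below computes the relative gain in mean return from the two quoted values and the ratio of the two quoted Sharpe values. It also shows a conventional Sharpe-style ratio (mean over sample standard deviation of per-seed returns). The abstract does not say how its Sharpe figure is computed, so `sharpe_ratio` is an assumed definition and both helper names are hypothetical.

```python
from statistics import mean, stdev


def relative_gain_percent(new, baseline):
    """Percentage by which `new` exceeds `baseline`."""
    return 100.0 * (new - baseline) / baseline


def sharpe_ratio(returns):
    """Mean return divided by its sample standard deviation.

    Assumed definition for illustration; the abstract does not state its own.
    """
    return mean(returns) / stdev(returns)


# Values quoted in the abstract: mean return 163.26 vs. 156.93.
print(round(relative_gain_percent(163.26, 156.93), 1))

# Ratio of the quoted Sharpe values, 1.53 vs. 0.70.
print(round(1.53 / 0.70, 2))
```

The first line printed is 4.0, matching the quoted gain in mean return. The second is 2.19, the ratio of the two Sharpe values. The abstract does not explain how this relates to its quoted 2.2% figure.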

Syn-GRPO: Self-Evolving Data Synthesis for MLLM Perception Reasoning

Syn-GRPO：MLLM感知推理的自我演化数据综合

Authors: Qihan Huang, Haofei Zhang, Rong Wei, Yi Wang, Rui Tang, Mingli Song, Jie Song
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.19343
Pdf link: https://arxiv.org/pdf/2511.19343
Abstract RL (reinforcement learning) methods (e.g., GRPO) for MLLM (Multimodal LLM) perception ability have attracted wide research interest owing to their remarkable generalization ability. Nevertheless, existing reinforcement learning methods still face the problem of low data quality, where data samples cannot elicit diverse responses from MLLMs, thus restricting the exploration scope for MLLM reinforcement learning. Some methods attempt to mitigate this problem by imposing constraints on entropy, but none address it at its root. Therefore, to tackle this problem, this work proposes Syn-GRPO (Synthesis-GRPO), which employs an online data generator to synthesize high-quality training data with diverse responses in GRPO training. Specifically, Syn-GRPO consists of two components: (1) data server; (2) GRPO workflow. The data server synthesizes new samples from existing ones using an image generation model, featuring a decoupled and asynchronous scheme to achieve high generation efficiency. The GRPO workflow provides the data server with the new image descriptions, and it leverages a diversity reward to supervise the MLLM to predict image descriptions for synthesizing samples with diverse responses. Experiment results across three visual perception tasks demonstrate that Syn-GRPO improves the data quality by a large margin, achieving significantly superior performance to existing MLLM perception methods, and Syn-GRPO presents promising potential for scaling long-term self-evolving RL. Our code is available at this https URL.
中文摘要 强化学习（RL）方法（如GRPO）因其卓越的泛化能力而吸引了广泛的研究关注。然而，现有的强化学习方法仍面临数据质量低的问题，即数据样本无法从MLLM中引发多样化的反应，从而限制了MLLM强化学习的探索范围。有些方法试图通过对熵施加约束来缓解这一问题，但没有方法能从根本上解决熵。因此，为解决这一问题，本研究提出了Syn-GRPO（综合GRPO），利用在线数据生成器综合GRPO训练中具有多样化响应的高质量训练数据。具体来说，Syn-GRPO由两个组成部分组成：（1）数据服务器;（2）GRPO工作流程。数据服务器通过图像生成模型从现有样本中合成新样本，采用解耦异步方案以实现高生成效率。GRPO工作流程为数据服务器提供新的图像描述，并利用多样性奖励监督MLLM预测图像描述，以合成具有多样响应的样本。三项视觉感知任务的实验结果表明，Syn-GRPO大幅提升了数据质量，显著优于现有MLLM感知方法，并展现出长期自我演化强化学习扩展的潜力。我们的代码可在此 https URL 访问。
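
To see why samples that fail to elicit diverse responses are a problem for GRPO-style training, it helps to look at the group-relative advantage, which normalizes each response's reward against the other responses to the same prompt. The sketch below is a minimal illustration, assuming the common mean and standard-deviation normalization. The function name, the `eps` constant, and the use of the population standard deviation are choices made here, not details from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    eps guards against division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group whose responses all earn the same reward carries no learning signal.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))

# A group with varied outcomes yields non-zero advantages of both signs.
print(group_relative_advantages([0.0, 1.0, 1.0, 0.0]))
```

When every response in a group receives the same reward, all advantages are zero and the sample contributes no gradient. This is one way to read the abstract's point that low-diversity data restricts exploration.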

Leveraging LLMs for reward function design in reinforcement learning control tasks

在强化学习控制任务中利用LLM进行奖励函数设计

Authors: Franklin Cardenoso, Wouter Caarls
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.19355
Pdf link: https://arxiv.org/pdf/2511.19355
Abstract The challenge of designing effective reward functions in reinforcement learning (RL) represents a significant bottleneck, often requiring extensive human expertise and being time-consuming. Previous work and recent advancements in large language models (LLMs) have demonstrated their potential for automating the generation of reward functions. However, existing methodologies often require preliminary evaluation metrics, human-engineered feedback for the refinement process, or the use of environmental source code as context. To address these limitations, this paper introduces LEARN-Opt (LLM-based Evaluator and Analyzer for Reward functioN Optimization). This LLM-based, fully autonomous, and model-agnostic framework eliminates the need for preliminary metrics and environmental source code as context to generate, execute, and evaluate reward function candidates from textual descriptions of systems and task objectives. LEARN-Opt's main contribution lies in its ability to autonomously derive performance metrics directly from the system description and the task objective, enabling unsupervised evaluation and selection of reward functions. Our experiments indicate that LEARN-Opt achieves performance comparable to or better than that of state-of-the-art methods, such as EUREKA, while requiring less prior knowledge. We find that automated reward design is a high-variance problem, where the average-case candidate fails, requiring a multi-run approach to find the best candidates. Finally, we show that LEARN-Opt can unlock the potential of low-cost LLMs to find high-performing candidates that are comparable to, or even better than, those of larger models. This demonstrated performance affirms its potential to generate high-quality reward functions without requiring any preliminary human-defined metrics, thereby reducing engineering overhead and enhancing generalizability.
中文摘要 在强化学习（RL）中设计有效奖励函数的挑战是一个重大瓶颈，通常需要大量人类专业知识且耗时。以往的研究和大型语言模型（LLMs）的最新进展已证明它们在自动化生成奖励函数方面的潜力。然而，现有方法通常需要初步评估指标、人工工程反馈以进行优化，或使用环境源代码作为上下文。为解决这些局限性，本文介绍了LEARN-Opt（基于LLM的奖励函数优化评估器和分析器）。这一基于LLM的完全自主且模型无关的框架，无需依赖初步指标和环境源代码作为上下文，即可从系统和任务目标的文本描述中生成、执行和评估奖励函数候选。LEARN-Opt 的主要贡献在于能够自主地直接从系统描述和任务目标中推导出性能指标，从而实现无监督的奖励函数评估和选择。我们的实验表明，LEARN-Opt的性能可与EUREKA等最先进方法媲美甚至更好，同时所需的先验知识更少。我们发现自动化奖励设计是一个高方差问题，平均情况下的候选函数会失败，因此需要多次运行才能找到最佳候选。最后，我们展示了LEARN-Opt能够释放低成本LLM的潜力，找到与大型模型相当甚至更优的高性能候选函数。这一性能证明了其无需任何初步人工定义指标即可生成高质量奖励函数的潜力，从而降低工程开销并提升泛化性。

Growing with the Generator: Self-paced GRPO for Video Generation

与生成器共成长：视频生成的自节奏GRPO

Authors: Rui Li, Yuanzhi Liang, Ziqi Ni, Haibing Huang, Chi Zhang, Xuelong Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.19356
Pdf link: https://arxiv.org/pdf/2511.19356
Abstract Group Relative Policy Optimization (GRPO) has emerged as a powerful reinforcement learning paradigm for post-training video generation models. However, existing GRPO pipelines rely on static, fixed-capacity reward models whose evaluation behavior is frozen during training. Such rigid rewards introduce distributional bias, saturate quickly as the generator improves, and ultimately limit the stability and effectiveness of reinforcement-based alignment. We propose Self-Paced GRPO, a competence-aware GRPO framework in which reward feedback co-evolves with the generator. Our method introduces a progressive reward mechanism that automatically shifts its emphasis from coarse visual fidelity to temporal coherence and fine-grained text-video semantic alignment as generation quality increases. This self-paced curriculum alleviates reward-policy mismatch, mitigates reward exploitation, and yields more stable optimization. Experiments on VBench across multiple video generation backbones demonstrate consistent improvements in both visual quality and semantic alignment over GRPO baselines with static rewards, validating the effectiveness and generality of Self-Paced GRPO.
中文摘要 群体相对策略优化（GRPO）已成为视频生成模型后训练的一种强大的强化学习范式。然而，现有的GRPO流水线依赖于静态、固定容量的奖励模型，其评估行为在训练过程中被冻结。这种僵化的奖励会引入分布偏差，随着生成器的提升而迅速饱和，最终限制基于强化学习的对齐的稳定性和有效性。我们提出了Self-Paced GRPO，一种能力感知型GRPO框架，其中奖励反馈与生成器协同演化。我们的方法引入了一种渐进式奖励机制，随着生成质量的提升，其重点自动从粗粒度的视觉保真度转向时间连贯性和细粒度的文本-视频语义对齐。这种自定进度的课程缓解了奖励与策略的不匹配，减少了奖励利用，并实现了更稳定的优化。在多个视频生成骨干网络上的VBench实验显示，相比使用静态奖励的GRPO基线，该方法在视觉质量和语义对齐方面均取得了一致的提升，验证了Self-Paced GRPO的有效性和通用性。

LLM-Driven Stationarity-Aware Expert Demonstrations for Multi-Agent Reinforcement Learning in Mobile Systems

面向移动系统多智能体强化学习的LLM驱动平稳性感知专家演示

Authors: Tianyang Duan, Zongyuan Zhang, Zheng Lin, Songxiao Guo, Xiuxian Guan, Guangyu Wu, Zihan Fang, Haotian Meng, Xia Du, Ji-Zhe Zhou, Heming Cui, Jun Luo, Yue Gao
Subjects: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2511.19368
Pdf link: https://arxiv.org/pdf/2511.19368
Abstract Multi-agent reinforcement learning (MARL) has been increasingly adopted in many real-world applications. While MARL enables decentralized deployment on resource-constrained edge devices, it suffers from severe non-stationarity due to the synchronous updates of agent policies. This non-stationarity results in unstable training and poor policy convergence, especially as the number of agents increases. In this paper, we propose RELED, a scalable MARL framework that integrates large language model (LLM)-driven expert demonstrations with autonomous agent exploration. RELED incorporates a Stationarity-Aware Expert Demonstration module, which leverages theoretical non-stationarity bounds to enhance the quality of LLM-generated expert trajectories, thus providing high reward and training-stable samples for each agent. Moreover, a Hybrid Expert-Agent Policy Optimization module adaptively balances each agent's learning from both expert-generated and agent-generated trajectories, accelerating policy convergence and improving generalization. Extensive experiments with real city networks based on OpenStreetMap demonstrate that RELED achieves superior performance compared to state-of-the-art MARL methods.
中文摘要 多智能体强化学习（MARL）在许多实际应用中日益被采用。虽然MARL支持在资源受限的边缘设备上进行去中心化部署，但由于智能体策略的同步更新，它存在严重的非平稳性。这种非平稳性导致训练不稳定、策略收敛性差，尤其是在智能体数量增加时。本文提出了RELED，一种可扩展的MARL框架，将大型语言模型（LLM）驱动的专家演示与智能体自主探索相结合。RELED包含一个平稳性感知专家演示模块，利用理论上的非平稳性界限提升LLM生成的专家轨迹的质量，从而为每个智能体提供高奖励且训练稳定的样本。此外，混合专家-智能体策略优化模块自适应地平衡每个智能体从专家生成轨迹和智能体生成轨迹中的学习，加速策略收敛并提升泛化能力。基于OpenStreetMap真实城市网络的大量实验表明，RELED的性能优于最先进的MARL方法。

DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research

DR Tulu：面向深度研究的演化评分标准强化学习

Authors: Rulin Shao, Akari Asai, Shannon Zejiang Shen, Hamish Ivison, Varsha Kishore, Jingming Zhuo, Xinran Zhao, Molly Park, Samuel G. Finlayson, David Sontag, Tyler Murray, Sewon Min, Pradeep Dasigi, Luca Soldaini, Faeze Brahman, Wen-tau Yih, Tongshuang Wu, Luke Zettlemoyer, Yoon Kim, Hannaneh Hajishirzi, Pang Wei Koh
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.19399
Pdf link: https://arxiv.org/pdf/2511.19399
Abstract Deep research models perform multi-step research to produce long-form, well-attributed answers. However, most open deep research models are trained on easily verifiable short-form QA tasks via reinforcement learning with verifiable rewards (RLVR), which does not extend to realistic long-form tasks. We address this with Reinforcement Learning with Evolving Rubrics (RLER), in which we construct and maintain rubrics that co-evolve with the policy model during training; this allows the rubrics to incorporate information that the model has newly explored and to provide discriminative, on-policy feedback. Using RLER, we develop Deep Research Tulu (DR Tulu-8B), the first open model that is directly trained for open-ended, long-form deep research. Across four long-form deep research benchmarks in science, healthcare and general domains, DR Tulu substantially outperforms existing open deep research models, and matches or exceeds proprietary deep research systems, while being significantly smaller and cheaper per query. To facilitate future research, we release all data, models, and code, including our new MCP-based agent infrastructure for deep research systems.
中文摘要 深度研究模型通过多步骤研究，生成长篇且引用充分的答案。然而，大多数开放深度研究模型是通过可验证奖励强化学习（RLVR）在易于验证的短问答任务上训练的，而这种训练并不适用于现实中的长篇任务。我们通过演化评分标准强化学习（RLER）来解决这个问题，在训练过程中构建并维护与策略模型共同演化的评分标准；这使得评分标准能够纳入模型新探索到的信息，并提供具有区分度的同策略（on-policy）反馈。利用RLER，我们开发了Deep Research Tulu（DR Tulu-8B），这是首个直接针对开放式长篇深度研究进行训练的开放模型。在科学、医疗及通用领域的四个长篇深度研究基准上，DR Tulu 大幅超越现有开放深度研究模型，并能匹敌甚至超越专有的深度研究系统，同时模型规模显著更小、单次查询成本显著更低。为了促进未来的研究，我们发布了所有数据、模型和代码，包括我们用于深度研究系统的全新基于MCP的智能体基础设施。

Learning Robust Social Strategies with Large Language Models

利用大型语言模型学习稳健的社会策略

Authors: Dereck Piche, Mohammed Muqeeth, Milad Aghajohari, Juan Duque, Michael Noukhovitch, Aaron Courville
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.19405
Pdf link: https://arxiv.org/pdf/2511.19405
Abstract As agentic AI becomes more widespread, agents with distinct and possibly conflicting goals will interact in complex ways. These multi-agent interactions pose a fundamental challenge, particularly in social dilemmas, where agents' individual incentives can undermine collective welfare. While reinforcement learning (RL) has been effective for aligning large language models (LLMs) in the single-agent regime, prior small-network results suggest that standard RL in multi-agent settings often converges to defecting, self-interested policies. We show the same effect in LLMs: despite cooperative priors, RL-trained LLM agents develop opportunistic behavior that can exploit even advanced closed-source models. To address this tendency of RL to converge to poor equilibria, we adapt a recent opponent-learning awareness algorithm, Advantage Alignment, to fine-tune LLMs toward multi-agent cooperation and non-exploitability. We then introduce a group-relative baseline that simplifies advantage computation in iterated games, enabling multi-agent training at LLM scale. We also contribute a novel social dilemma environment, Trust and Split, which requires natural language communication to achieve high collective welfare. Across a wide range of social dilemmas, policies learned with Advantage Alignment achieve higher collective payoffs while remaining robust against exploitation by greedy agents.
中文摘要 随着代理式人工智能的普及，拥有不同且可能相互冲突目标的智能体将以复杂的方式互动。这些多智能体互动构成了根本性挑战，尤其是在社会困境中，智能体的个体激励可能损害集体福祉。虽然强化学习（RL）在单智能体环境中对齐大型语言模型（LLMs）方面表现有效，但以往的小型网络结果表明，多智能体环境中的标准强化学习往往收敛到背叛、自利的策略。我们在大型语言模型中也观察到了同样的现象：尽管具有合作先验，经强化学习训练的LLM智能体仍会发展出机会主义行为，甚至能够利用先进的闭源模型。为解决强化学习收敛到不良均衡的倾向，我们改造了近期的对手学习感知算法Advantage Alignment，用于微调LLM以实现多智能体合作和不可被利用性。随后，我们引入了一个群体相对基线，简化了重复博弈中的优势计算，使LLM规模下的多智能体训练成为可能。我们还提出了一种新的社会困境环境Trust and Split，它需要通过自然语言交流才能实现较高的集体福祉。在多种社会困境中，通过Advantage Alignment学到的策略能够实现更高的集体收益，同时对贪婪智能体的利用保持稳健。

SLMFix: Leveraging Small Language Models for Error Fixing with Reinforcement Learning

SLMFix：利用小语言模型进行强化学习的错误修复

Authors: David Jiahao Fu, Aryan Gupta, Aaron Councilman, David Grove, Yu-Xiong Wang, Vikram Adve
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Arxiv link: https://arxiv.org/abs/2511.19422
Pdf link: https://arxiv.org/pdf/2511.19422
Abstract Recent advancements in large language models (LLMs) have shown very impressive capabilities in code generation across many programming languages. However, even state-of-the-art LLMs generate programs that contain syntactic errors and fail to complete the given tasks, especially for low-resource programming languages (LRPLs). In addition, high training cost makes finetuning LLMs unaffordable with constrained computational resources, further undermining the effectiveness of LLMs for code generation. In this work, we propose SLMFix, a novel code generation pipeline that leverages a small language model (SLM) finetuned using reinforcement learning (RL) techniques to fix syntactic errors in LLM-generated programs to improve the quality of LLM-generated programs for domain-specific languages (DSLs). Specifically, we applied RL on the SLM for the program repair task using a reward calculated using both a static validator and a static semantic similarity metric. Our experimental results demonstrate the effectiveness and generalizability of our approach across multiple DSLs, achieving more than 95% pass rate on the static validator. Notably, SLMFix brings substantial improvement to the base model and outperforms the supervised finetuning approach even for 7B models on an LRPL, showing the potential of our approach as an alternative to traditional finetuning approaches.
中文摘要 大型语言模型（LLM）的最新进展在多种编程语言的代码生成方面展示了令人印象深刻的能力。然而，即使是最先进的大型语言模型，也会生成包含语法错误且无法完成给定任务的程序，尤其是对于低资源编程语言（LRPL）。此外，高昂的训练成本使得在有限的计算资源下微调LLM变得负担不起，进一步削弱了LLM在代码生成方面的有效性。在本研究中，我们提出了SLMFix，一种新型代码生成流水线，利用通过强化学习（RL）技术微调的小型语言模型（SLM）修复LLM生成程序中的语法错误，从而提升LLM为领域特定语言（DSL）生成的程序的质量。具体来说，我们在SLM的程序修复任务中应用了强化学习，奖励基于静态验证器和静态语义相似度指标计算。我们的实验结果证明了我们方法在多个DSL上的有效性和可推广性，静态验证器通过率超过95%。值得注意的是，SLMFix对基础模型带来了显著改进，甚至在LRPL上的7B模型中也优于监督微调方法，显示了我们方法作为传统微调替代方案的潜力。
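
The abstract describes a repair reward computed from both a static validator and a static semantic similarity metric, without giving the formula. A minimal sketch of one way such a reward could be combined is shown below. Everything in it is an assumption made for illustration: Python's `ast.parse` stands in for a DSL validator, character overlap from `difflib.SequenceMatcher` stands in for the semantic similarity metric, and the equal weights `w_valid` and `w_sim` are hypothetical. It is not the SLMFix implementation.

```python
import ast
from difflib import SequenceMatcher


def is_valid(program: str) -> bool:
    """Stand-in static validator: does the text parse as Python source?"""
    try:
        ast.parse(program)
        return True
    except SyntaxError:
        return False


def similarity(candidate: str, reference: str) -> float:
    """Stand-in similarity in [0, 1] based on character overlap."""
    return SequenceMatcher(None, candidate, reference).ratio()


def repair_reward(candidate: str, reference: str,
                  w_valid: float = 0.5, w_sim: float = 0.5) -> float:
    """Weighted sum of validity and similarity (hypothetical weights)."""
    return (w_valid * float(is_valid(candidate))
            + w_sim * similarity(candidate, reference))


reference = "x = (1 + 2)\n"
broken = "x = (1 + 2\n"    # unclosed parenthesis, fails the validator
repaired = "x = (1 + 2)\n"  # parses and matches the reference

reward_broken = repair_reward(broken, reference)
reward_repaired = repair_reward(repaired, reference)
```

The similarity term keeps a repair from earning full reward by emitting a valid but unrelated program, while the validity term separates near-identical texts that differ only in whether they parse.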

Keyword: diffusion policy

Learning Diffusion Policies for Robotic Manipulation of Timber Joinery under Fabrication Uncertainty

学习在制造不确定性下木材接合机器人操作的扩散策略

Authors: Salma Mozaffari (1), Daniel Ruan (1), William van den Bogert (2), Nima Fazeli (2), Sigrid Adriaenssens (1), Arash Adel (1) ((1) Princeton University, (2) University of Michigan)
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.17774
Pdf link: https://arxiv.org/pdf/2511.17774
Abstract Construction uncertainties such as fabrication inaccuracies and material imperfections pose a significant challenge to contact-rich robotic manipulation by hindering precise and robust assembly. In this paper, we explore the performance and robustness of diffusion policy learning as a promising solution for contact-sensitive robotic assembly at construction scale, using timber mortise and tenon joints as a case study. A two-phase study is conducted: first, to evaluate policy performance and applicability; second, to assess robustness in handling fabrication uncertainties simulated as randomized perturbations to the mortise position. The best-performing policy achieved a total average success rate of 75% with perturbations up to 10 mm, including 100% success in unperturbed cases. The results demonstrate the potential of sensory-motor diffusion policies to generalize to a wide range of complex, contact-rich assembly tasks across construction and manufacturing, advancing robotic construction under uncertainty and contributing to safer, more efficient building practices.
中文摘要 加工误差和材料缺陷等施工不确定性会妨碍精准且稳健的装配，对接触密集的机器人操作构成重大挑战。本文以木材榫眼与榫头接合为案例，探讨扩散策略学习作为建筑尺度接触敏感机器人装配解决方案的性能和稳健性。研究分两个阶段进行：首先评估策略的性能和适用性；其次评估策略应对制造不确定性的鲁棒性，这些不确定性被模拟为对榫眼位置的随机扰动。表现最佳的策略在扰动量最高达10毫米的情况下，总体平均成功率为75%，其中未扰动情况下成功率为100%。结果表明，感觉运动扩散策略有潜力推广到建筑和制造领域中各种复杂、接触密集的装配任务，推动不确定性条件下机器人施工的发展，并促进更安全、更高效的建筑实践。