Arxiv Papers of Today

Generated: 2025-11-13 16:30:02 (UTC+8); Arxiv announcement time: 2025-11-13 20:00 EST (2025-11-14 09:00 UTC+8)

25 relevant papers today

Keyword: reinforcement learning

Interpretable by Design: Query-Specific Neural Modules for Explainable Reinforcement Learning

Authors: Mehrdad Zakershahrak
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.08749
Pdf link: https://arxiv.org/pdf/2511.08749
Abstract Reinforcement learning has traditionally focused on a singular objective: learning policies that select actions to maximize reward. We challenge this paradigm by asking: what if we explicitly architected RL systems as inference engines that can answer diverse queries about their environment? In deterministic settings, trained agents implicitly encode rich knowledge about reachability, distances, values, and dynamics - yet current architectures are not designed to expose this information efficiently. We introduce Query Conditioned Deterministic Inference Networks (QDIN), a unified architecture that treats different types of queries (policy, reachability, paths, comparisons) as first-class citizens, with specialized neural modules optimized for each inference pattern. Our key empirical finding reveals a fundamental decoupling: inference accuracy can reach near-perfect levels (99% reachability IoU) even when control performance remains suboptimal (31% return), suggesting that the representations needed for accurate world knowledge differ from those required for optimal control. Experiments demonstrate that query specialized architectures outperform both unified models and post-hoc extraction methods, while maintaining competitive control performance. This work establishes a research agenda for RL systems designed from inception as queryable knowledge bases, with implications for interpretability, verification, and human-AI collaboration.

Structured Uncertainty guided Clarification for LLM Agents

Authors: Manan Suri, Puneet Mathur, Nedim Lipka, Franck Dernoncourt, Ryan A. Rossi, Dinesh Manocha
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.08798
Pdf link: https://arxiv.org/pdf/2511.08798
Abstract LLM agents extend large language models with tool-calling capabilities, but ambiguous user instructions often lead to incorrect invocations and task failures. We introduce a principled formulation of structured uncertainty over tool-call parameters, modeling joint tool-argument clarification as a POMDP with Expected Value of Perfect Information (EVPI) objective for optimal question selection and aspect-based cost modeling to prevent redundancy. Our SAGE-Agent leverages this structured uncertainty to achieve superior efficiency: increasing coverage on ambiguous tasks by 7-39% while reducing clarification questions by 1.5-2.7× compared to strong prompting and uncertainty-based baselines. We present ClarifyBench, the first multi-turn tool-augmented disambiguation benchmark with realistic LLM-based user simulation across diverse domains including document editing, vehicle control, and travel booking. Additionally, we demonstrate that structured uncertainty provides effective training signals for reinforcement learning, boosting When2Call accuracy from 36.5% to 65.2% (3B model) and 36.7% to 62.9% (7B model) through uncertainty-weighted GRPO training. These results establish structured uncertainty as a principled, efficient approach for tool-augmented agents, improving both task success and interaction efficiency in real-world scenarios.
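Editor's sketch (not from the paper): the abstract describes choosing clarification questions by their expected value of information. The snippet below is a minimal illustration of that idea under strong assumptions: a hypothetical belief over complete tool calls, a 0/1 utility (the agent is right only if it picks the true call), and no question cost. The names `value_of_asking` and `belief` are invented for illustration and are not SAGE-Agent's API.

```python
from collections import defaultdict


def value_of_asking(candidates, aspect):
    """Expected gain in P(correct tool call) if the user reveals `aspect`.

    candidates: list of (probability, {argument_name: value}) pairs whose
    probabilities sum to 1. This is a hypothetical belief, assumed here.
    """
    # Without asking, the best guess is the most probable candidate.
    best_now = max(p for p, _ in candidates)

    # Group candidates by the answer the user would give for this aspect.
    by_answer = defaultdict(list)
    for p, args in candidates:
        by_answer[args.get(aspect)].append(p)

    # P(answer) * max_c P(c | answer) equals the largest unnormalised
    # probability among the candidates consistent with that answer.
    best_after = sum(max(ps) for ps in by_answer.values())
    return best_after - best_now


belief = [
    (0.5, {"city": "Paris", "date": "Mon"}),
    (0.3, {"city": "Paris", "date": "Tue"}),
    (0.2, {"city": "Rome", "date": "Mon"}),
]
gains = {aspect: value_of_asking(belief, aspect) for aspect in ("city", "date")}
best_question = max(gains, key=gains.get)
```

In this toy belief, asking about the date is worth more than asking about the city, because the two most probable candidates differ only in the date.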

TIGER-MARL: Enhancing Multi-Agent Reinforcement Learning with Temporal Information through Graph-based Embeddings and Representations

Authors: Nikunj Gupta, Ludwika Twardecka, James Zachary Hare, Jesse Milzman, Rajgopal Kannan, Viktor Prasanna
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.08832
Pdf link: https://arxiv.org/pdf/2511.08832
Abstract In this paper, we propose capturing and utilizing Temporal Information through Graph-based Embeddings and Representations (TIGER) to enhance multi-agent reinforcement learning (MARL). We explicitly model how inter-agent coordination structures evolve over time. While most MARL approaches rely on static or per-step relational graphs, they overlook the temporal evolution of interactions that naturally arise as agents adapt, move, or reorganize cooperation strategies. Capturing such evolving dependencies is key to achieving robust and adaptive coordination. To this end, TIGER constructs dynamic temporal graphs of MARL agents, connecting their current and historical interactions. It then employs a temporal attention-based encoder to aggregate information across these structural and temporal neighborhoods, yielding time-aware agent embeddings that guide cooperative policy learning. Through extensive experiments on two coordination-intensive benchmarks, we show that TIGER consistently outperforms diverse value-decomposition and graph-based MARL baselines in task performance and sample efficiency. Furthermore, we conduct comprehensive ablation studies to isolate the impact of key design parameters in TIGER, revealing how structural and temporal factors can jointly shape effective policy learning in MARL. All code can be found here: this https URL.

UCO: A Multi-Turn Interactive Reinforcement Learning Method for Adaptive Teaching with Large Language Models

Authors: Shouang Wei, Min Zhang, Xin Lin, Bo Jiang, Kun Kuang, Zhongxiang Dai
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.08873
Pdf link: https://arxiv.org/pdf/2511.08873
Abstract Large language models (LLMs) are shifting from answer providers to intelligent tutors in educational settings, yet current supervised fine-tuning methods only learn surface teaching patterns without dynamic adaptation capabilities. Recent reinforcement learning approaches address this limitation but face two critical challenges. First, they evaluate teaching effectiveness solely based on whether students produce correct outputs, unable to distinguish whether students genuinely understand or echo teacher-provided answers during interaction. Second, they cannot perceive students' evolving cognitive states in real time through interactive dialogue, thus failing to adapt teaching strategies to match students' cognitive levels dynamically. We propose the Unidirectional Cognitive Optimization (UCO) method to address these challenges. UCO uses a multi-turn interactive reinforcement learning paradigm where the innovation lies in two synergistic reward functions: the Progress Reward captures students' cognitive advancement, evaluating whether students truly transition from confusion to comprehension, while the Scaffold Reward dynamically identifies each student's Zone of Proximal Development (ZPD), encouraging teachers to maintain productive teaching within this zone. We evaluate UCO by comparing it against 11 baseline models on BigMath and MathTutorBench benchmarks. Experimental results demonstrate that our UCO model outperforms all models of equivalent scale and achieves performance comparable to advanced closed-source models. The code and data are available at this https URL.

A Shared Control Framework for Mobile Robots with Planning-Level Intention Prediction

Authors: Jinyu Zhang, Lijun Han, Feng Jian, Lingxi Zhang, Hesheng Wang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.08912
Pdf link: https://arxiv.org/pdf/2511.08912
Abstract In mobile robot shared control, effectively understanding human motion intention is critical for seamless human-robot collaboration. This paper presents a novel shared control framework featuring planning-level intention prediction. A path replanning algorithm is designed to adjust the robot's desired trajectory according to inferred human intentions. To represent future motion intentions, we introduce the concept of an intention domain, which serves as a constraint for path replanning. The intention-domain prediction and path replanning problems are jointly formulated as a Markov Decision Process and solved through deep reinforcement learning. In addition, a Voronoi-based human trajectory generation algorithm is developed, allowing the model to be trained entirely in simulation without human participation or demonstration data. Extensive simulations and real-world user studies demonstrate that the proposed method significantly reduces operator workload and enhances safety, without compromising task efficiency compared with existing assistive teleoperation approaches.

Diffusion Policies with Value-Conditional Optimization for Offline Reinforcement Learning

Authors: Yunchang Ma, Tenglong Liu, Yixing Lan, Xin Yin, Changxin Zhang, Xinglong Zhang, Xin Xu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.08922
Pdf link: https://arxiv.org/pdf/2511.08922
Abstract In offline reinforcement learning, value overestimation caused by out-of-distribution (OOD) actions significantly limits policy performance. Recently, diffusion models have been leveraged for their strong distribution-matching capabilities, enforcing conservatism through behavior policy constraints. However, existing methods often apply indiscriminate regularization to redundant actions in low-quality datasets, resulting in excessive conservatism and an imbalance between the expressiveness and efficiency of diffusion modeling. To address these issues, we propose DIffusion policies with Value-conditional Optimization (DIVO), a novel approach that leverages diffusion models to generate high-quality, broadly covered in-distribution state-action samples while facilitating efficient policy improvement. Specifically, DIVO introduces a binary-weighted mechanism that utilizes the advantage values of actions in the offline dataset to guide diffusion model training. This enables a more precise alignment with the dataset's distribution while selectively expanding the boundaries of high-advantage actions. During policy improvement, DIVO dynamically filters high-return-potential actions from the diffusion model, effectively guiding the learned policy toward better performance. This approach achieves a critical balance between conservatism and explorability in offline RL. We evaluate DIVO on the D4RL benchmark and compare it against state-of-the-art baselines. Empirical results demonstrate that DIVO achieves superior performance, delivering significant improvements in average returns across locomotion tasks and outperforming existing methods in the challenging AntMaze domain, where sparse rewards pose a major difficulty.
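Editor's sketch (not from the paper): the abstract mentions a binary-weighted mechanism that uses advantage values to guide diffusion model training. One plausible reading, assumed here, is a 0/1 weight per dataset sample that keeps only actions whose advantage exceeds a threshold. The threshold of zero, the function names, and the per-sample losses are illustrative placeholders, not DIVO's actual formulation.

```python
def binary_advantage_weights(advantages, threshold=0.0):
    """Return 1.0 for samples whose advantage exceeds `threshold`, else 0.0."""
    return [1.0 if a > threshold else 0.0 for a in advantages]


def weighted_loss(per_sample_losses, weights):
    """Average the per-sample losses over the selected samples only."""
    total = sum(weights)
    if total == 0:
        return 0.0  # no sample selected; nothing to train on in this batch
    return sum(w * l for w, l in zip(weights, per_sample_losses)) / total


# Hypothetical batch: advantages from a critic, losses from a denoising objective.
advantages = [0.8, -0.2, 0.1, -1.5]
losses = [0.4, 0.9, 0.2, 0.7]

weights = binary_advantage_weights(advantages)
batch_loss = weighted_loss(losses, weights)
```

Here only the first and third samples contribute, so low-advantage actions do not pull the generative model toward them.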

Achieving Equilibrium under Utility Heterogeneity: An Agent-Attention Framework for Multi-Agent Multi-Objective Reinforcement Learning

Authors: Zhuhui Li, Chunbo Luo, Liming Huang, Luyu Qi, Geyong Min
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.08926
Pdf link: https://arxiv.org/pdf/2511.08926
Abstract Multi-agent multi-objective systems (MAMOS) have emerged as powerful frameworks for modelling complex decision-making problems across various real-world domains, such as robotic exploration, autonomous traffic management, and sensor network optimisation. MAMOS offers enhanced scalability and robustness through decentralised control and more accurately reflects inherent trade-offs between conflicting objectives. In MAMOS, each agent uses utility functions that map return vectors to scalar values. Existing MAMOS optimisation methods face challenges in handling heterogeneous objective and utility function settings, where training non-stationarity is intensified due to private utility functions and the associated policies. In this paper, we first theoretically prove that direct access to, or structured modeling of, global utility functions is necessary for the Bayesian Nash Equilibrium under decentralised execution constraints. To access the global utility functions while preserving the decentralised execution, we propose an Agent-Attention Multi-Agent Multi-Objective Reinforcement Learning (AA-MAMORL) framework. Our approach implicitly learns a joint belief over other agents' utility functions and their associated policies during centralised training, effectively mapping global states and utilities to each agent's policy. In execution, each agent independently selects actions based on local observations and its private utility function to approximate a BNE, without relying on inter-agent communication. We conduct comprehensive experiments in both a custom-designed MAMO Particle environment and the standard MOMALand benchmark. The results demonstrate that access to global preferences and our proposed AA-MAMORL significantly improve performance and consistently outperform state-of-the-art methods.

SpiralThinker: Latent Reasoning through an Iterative Process with Text-Latent Interleaving

Authors: Shengmin Piao, Sanghyun Park
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.08983
Pdf link: https://arxiv.org/pdf/2511.08983
Abstract Recent advances in large reasoning models have been driven by reinforcement learning and test-time scaling, accompanied by growing interest in latent rather than purely textual reasoning. However, existing latent reasoning methods lack mechanisms to ensure stable evolution of latent representations and a systematic way to interleave implicit and explicit reasoning. We introduce SpiralThinker, a unified framework that performs iterative updates over latent representations, enabling extended implicit reasoning without generating additional tokens. A progressive alignment objective combined with structured annotations maintains coherence between latent and textual reasoning. Across mathematical, logical, and commonsense reasoning tasks, SpiralThinker achieves the best overall performance among latent reasoning approaches, consistently surpassing previous methods across all benchmarks. Detailed analyses reveal that both iteration and alignment are indispensable, the numbers of latent tokens and iterations exhibit dataset-specific optima, and appropriate alignment proves critical for an effective iterative process. Overall, SpiralThinker bridges iterative computation and latent reasoning, demonstrating that aligned iterative updates can reliably steer reasoning in the latent space.

Advancing Autonomous Emergency Response Systems: A Generative AI Perspective

Authors: Yousef Emami, Radha Reddy, Azadeh Pourkabirian, Miguel Gutierrez Gaitan
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.09044
Pdf link: https://arxiv.org/pdf/2511.09044
Abstract Autonomous Vehicles (AVs) are poised to revolutionize emergency services by enabling faster, safer, and more efficient responses. This transformation is driven by advances in Artificial Intelligence (AI), particularly Reinforcement Learning (RL), which allows AVs to navigate complex environments and make critical decisions in real time. However, conventional RL paradigms often suffer from poor sample efficiency and lack adaptability in dynamic emergency scenarios. This paper reviews next-generation AV optimization strategies to address these limitations. We analyze the shift from conventional RL to Diffusion Model (DM)-augmented RL, which enhances policy robustness through synthetic data generation, albeit with increased computational cost. Additionally, we explore the emerging paradigm of Large Language Model (LLM)-assisted In-Context Learning (ICL), which offers a lightweight and interpretable alternative by enabling rapid, on-the-fly adaptation without retraining. By reviewing the state of the art in AV intelligence, DM-augmented RL, and LLM-assisted ICL, this paper provides a critical framework for understanding the next generation of autonomous emergency response systems from a Generative AI perspective.

APEX: Action Priors Enable Efficient Exploration for Robust Motion Tracking on Legged Robots

Authors: Shivam Sood, Laukik Nakhwa, Sun Ge, Yuhong Cao, Jin Cheng, Fatemah Zargarbashi, Taerim Yoon, Sungjoon Choi, Stelian Coros, Guillaume Sartoretti
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.09091
Pdf link: https://arxiv.org/pdf/2511.09091
Abstract Learning natural, animal-like locomotion from demonstrations has become a core paradigm in legged robotics. Despite the recent advancements in motion tracking, most existing methods demand extensive tuning and rely on reference data during deployment, limiting adaptability. We present APEX (Action Priors enable Efficient Exploration), a plug-and-play extension to state-of-the-art motion tracking algorithms that eliminates any dependence on reference data during deployment, improves sample efficiency, and reduces parameter tuning effort. APEX integrates expert demonstrations directly into reinforcement learning (RL) by incorporating decaying action priors, which initially bias exploration toward expert demonstrations but gradually allow the policy to explore independently. This is combined with a multi-critic framework that balances task performance with motion style. Moreover, APEX enables a single policy to learn diverse motions and transfer reference-like styles across different terrains and velocities, while remaining robust to variations in reward design. We validate the effectiveness of our method through extensive experiments in both simulation and on a Unitree Go2 robot. By leveraging demonstrations to guide exploration during RL training, without imposing explicit bias toward them, APEX enables legged robots to learn with greater stability, efficiency, and generalization. We believe this approach paves the way for guidance-driven RL to boost natural skill acquisition in a wide array of robotic tasks, from locomotion to manipulation. Website and code: this https URL.
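Editor's sketch (not from the paper): the abstract describes decaying action priors that bias early exploration toward expert demonstrations and then fade. A minimal way to picture this, assuming an additive bias with an exponential schedule, is shown below. The schedule, the `decay` value, and the function names are assumptions for illustration; the abstract does not give APEX's actual formulation.

```python
def prior_weight(step, initial=1.0, decay=0.999):
    """Exponentially decaying weight applied to the expert action prior."""
    return initial * decay ** step


def blended_action(policy_action, expert_action, step, initial=1.0, decay=0.999):
    """Add a decaying expert-derived bias to the policy's own action.

    Early in training the expert term dominates exploration; as `step`
    grows the weight shrinks and the policy acts almost on its own.
    """
    w = prior_weight(step, initial, decay)
    return [p + w * e for p, e in zip(policy_action, expert_action)]


# Hypothetical two-joint example: the policy outputs zeros, the expert
# prior suggests [1.0, -1.0].
early = blended_action([0.0, 0.0], [1.0, -1.0], step=0)
late = blended_action([0.0, 0.0], [1.0, -1.0], step=10_000)
```

At step 0 the executed action equals the expert suggestion, while after 10,000 steps the bias has effectively vanished.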

Thinking Forward and Backward: Multi-Objective Reinforcement Learning for Retrieval-Augmented Reasoning

Authors: Wenda Wei, Yu-An Liu, Ruqing Zhang, Jiafeng Guo, Lixin Su, Shuaiqiang Wang, Dawei Yin, Maarten de Rijke, Xueqi Cheng
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2511.09109
Pdf link: https://arxiv.org/pdf/2511.09109
Abstract Retrieval-augmented generation (RAG) has proven to be effective in mitigating hallucinations in large language models, yet its effectiveness remains limited in complex, multi-step reasoning. Recent efforts have incorporated search-based interactions into RAG, enabling iterative reasoning with real-time retrieval. Most approaches rely on outcome-based supervision, offering no explicit guidance for intermediate steps. This often leads to reward hacking and degraded response quality. We propose Bi-RAR, a novel retrieval-augmented reasoning framework that evaluates each intermediate step jointly in both forward and backward directions. To assess the information completeness of each step, we introduce a bidirectional information distance grounded in Kolmogorov complexity, approximated via language model generation probabilities. This quantification measures both how far the current reasoning is from the answer and how well it addresses the question. To optimize reasoning under these bidirectional signals, we adopt a multi-objective reinforcement learning framework with a cascading reward structure that emphasizes early trajectory alignment. Empirical results on seven question answering benchmarks demonstrate that Bi-RAR surpasses previous methods and enables efficient interaction and reasoning with the search engine during training and inference.
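Editor's sketch (not from the paper): the abstract approximates a Kolmogorov-complexity-based distance with language model generation probabilities. A common way to do this, assumed here, is to treat the negative log-likelihood of a text given its context as its code length. The two directions below (answer given the reasoning step, question given the reasoning step) and all names are illustrative guesses at the setup, and the log-probabilities are made-up inputs rather than real model outputs.

```python
import math


def code_length_bits(token_logprobs):
    """Approximate K(x | context) by the negative log-likelihood of x, in bits.

    token_logprobs: natural-log probabilities of x's tokens under some
    language model, conditioned on the context.
    """
    return -sum(token_logprobs) / math.log(2)


def bidirectional_distance(answer_logprobs, question_logprobs):
    """Return (forward, backward) code lengths for one reasoning step."""
    forward = code_length_bits(answer_logprobs)     # how far the step is from the answer
    backward = code_length_bits(question_logprobs)  # how well the step covers the question
    return forward, backward


# Hypothetical per-token log-probabilities for one intermediate step.
fwd, bwd = bidirectional_distance(
    answer_logprobs=[math.log(0.5), math.log(0.25)],
    question_logprobs=[math.log(0.5)] * 3,
)
```

Smaller values in both directions mean the step makes the answer easier to produce while remaining tied to the question.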

Towards a Generalisable Cyber Defence Agent for Real-World Computer Networks

Authors: Tim Dudman, Martyn Bull
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2511.09114
Pdf link: https://arxiv.org/pdf/2511.09114
Abstract Recent advances in deep reinforcement learning for autonomous cyber defence have resulted in agents that can successfully defend simulated computer networks against cyber-attacks. However, many of these agents would need retraining to defend networks with differing topology or size, making them poorly suited to real-world networks where topology and size can vary over time. In this research we introduce a novel set of Topological Extensions for Reinforcement Learning Agents (TERLA) that provide generalisability for the defence of networks with differing topology and size, without the need for retraining. Our approach involves the use of heterogeneous graph neural network layers to produce a fixed-size latent embedding representing the observed network state. This representation learning stage is coupled with a reduced, fixed-size, semantically meaningful and interpretable action space. We apply TERLA to a standard deep reinforcement learning Proximal Policy Optimisation (PPO) agent model, and to reduce the sim-to-real gap, conduct our research using Cyber Autonomy Gym for Experimentation (CAGE) Challenge 4. This Cyber Operations Research Gym environment has many of the features of a real-world network, such as realistic Intrusion Detection System (IDS) events and multiple agents defending network segments of differing topology and size. TERLA agents retain the defensive performance of vanilla PPO agents whilst showing improved action efficiency. Generalisability has been demonstrated by showing that all TERLA agents have the same network-agnostic neural network architecture, and by deploying a single TERLA agent multiple times to defend network segments with differing topology and size, showing improved defensive performance and efficiency.

History-Aware Reasoning for GUI Agents

Authors: Ziwei Wang, Leyang Yang, Xiaoxuan Tang, Sheng Zhou, Dajun Chen, Wei Jiang, Yong Li
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2511.09127
Pdf link: https://arxiv.org/pdf/2511.09127
Abstract Advances in Multimodal Large Language Models have significantly enhanced Graphical User Interface (GUI) automation. Equipping GUI agents with reliable episodic reasoning capabilities is essential for bridging the gap between users' concise task descriptions and the complexities of real-world execution. Current methods integrate Reinforcement Learning (RL) with System-2 Chain-of-Thought, yielding notable gains in reasoning enhancement. For long-horizon GUI tasks, historical interactions connect each screen to the goal-oriented episode chain, and effectively leveraging these clues is crucial for the current decision. However, existing native GUI agents exhibit weak short-term memory in their explicit reasoning, interpreting the chained interactions as discrete screen understanding, i.e., unawareness of the historical interactions within the episode. This history-agnostic reasoning challenges their performance in GUI automation. To alleviate this weakness, we propose a History-Aware Reasoning (HAR) framework, which encourages an agent to reflect on its own errors and acquire episodic reasoning knowledge from them via tailored strategies that enhance short-term memory in long-horizon interaction. The framework mainly comprises constructing a reflective learning scenario, synthesizing tailored correction guidelines, and designing a hybrid RL reward function. Using the HAR framework, we develop a native end-to-end model, HAR-GUI-3B, which alters the inherent reasoning mode from history-agnostic to history-aware, equipping the GUI agent with stable short-term memory and reliable perception of screen details. Comprehensive evaluations across a range of GUI-related benchmarks demonstrate the effectiveness and generalization of our method.

Efficient Reasoning via Reward Model

Authors: Yuhao Wang, Xiaopeng Li, Cheng Gong, Ziru Liu, Suiyun Zhang, Rui Liu, Xiangyu Zhao
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.09158
Pdf link: https://arxiv.org/pdf/2511.09158
Abstract Reinforcement learning with verifiable rewards (RLVR) has been shown to enhance the reasoning capabilities of large language models (LLMs), enabling the development of large reasoning models (LRMs). However, LRMs such as DeepSeek-R1 and OpenAI o1 often generate verbose responses containing redundant or irrelevant reasoning steps, a phenomenon known as overthinking, which substantially increases computational costs. Prior efforts to mitigate this issue commonly incorporate length penalties into the reward function, but we find they frequently suffer from two critical issues: length collapse and training collapse, resulting in sub-optimal performance. To address them, we propose a pipeline for training a Conciseness Reward Model (CRM) that scores the conciseness of the reasoning path. Additionally, we introduce a novel reward formulation named Conciseness Reward Function (CRF) with explicit dependency between the outcome reward and conciseness score, thereby fostering both more effective and more efficient reasoning. From a theoretical standpoint, we demonstrate the superiority of the new reward from the perspective of variance reduction and improved convergence properties. On the practical side, extensive experiments on five mathematical benchmark datasets demonstrate the method's effectiveness and token efficiency: it achieves an 8.1% accuracy improvement and a 19.9% reduction in response token length on Qwen2.5-7B. Furthermore, the method generalizes well to other LLMs including Llama and Mistral. The implementation code and datasets are publicly available for reproduction: this https URL.
中文摘要 具有可验证奖励的强化学习（RLVR）已被证明可以增强大型语言模型（LLM）的推理能力，从而实现大型推理模型（LRM）的开发。然而，DeepSeek-R1 和 OpenAI o1 等 LRM 通常会生成包含冗余或不相关推理步骤的冗长响应——这种现象被称为过度思考——这大大增加了计算成本。先前缓解这个问题的努力通常将长度惩罚纳入奖励函数，但我们发现它们经常遇到两个关键问题：长度崩溃和训练崩溃，导致性能欠佳。为了解决这些问题，我们提出了一个用于训练简洁奖励模型（CRM）的管道，该模型对推理路径的简洁性进行评分。此外，我们还引入了一种名为简洁奖励函数（CRF）的新颖奖励公式，该公式在结果奖励和简洁度分数之间具有明确的依赖性，从而促进更有效和更高效的推理。从理论角度，我们从方差减少和收敛特性改进的角度展示了新奖励的优越性。此外，在实践方面，在5个数学基准数据集上的大量实验证明了该方法的有效性和token效率，在Qwen2.5-7B上实现了8.1%的准确率提升和19.9%的响应token长度减少。此外，该方法可以很好地推广到其他 LLM，包括 Llama 和 Mistral。实现代码和数据集公开可供复制：此 https URL。
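
The abstract above does not state the CRF formula, so the following is only a hypothetical illustration of what an "explicit dependency between the outcome reward and conciseness score" could look like. It contrasts an additive bonus, where a wrong but short answer still earns reward, with a gated form where conciseness only counts for correct answers. The function names, the `alpha` weight, and the multiplicative form are all assumptions, not the paper's definition.

```python
def additive_reward(correct: bool, conciseness: float, alpha: float = 0.5) -> float:
    """Baseline sketch: outcome reward plus an independent conciseness bonus."""
    outcome = 1.0 if correct else 0.0
    return outcome + alpha * conciseness


def gated_reward(correct: bool, conciseness: float, alpha: float = 0.5) -> float:
    """Hypothetical coupled form: conciseness is rewarded only when the answer is correct."""
    if not 0.0 <= conciseness <= 1.0:
        raise ValueError("conciseness is assumed to be a score in [0, 1]")
    outcome = 1.0 if correct else 0.0
    # The conciseness term is multiplied by the outcome, so it vanishes for wrong answers.
    return outcome * (1.0 + alpha * conciseness)
```

Under the additive form a model can collect reward by producing short wrong answers, which is one plausible route to the length collapse the abstract mentions; the gated form removes that incentive.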

Learning Efficient Communication Protocols for Multi-Agent Reinforcement Learning

学习多智能体强化学习的高效通信协议

Authors: Xinren Zhang, Jiadong Yu, Zixin Zhong
Subjects: Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2511.09171
Pdf link: https://arxiv.org/pdf/2511.09171
Abstract Multi-Agent Systems (MAS) have emerged as a powerful paradigm for modeling complex interactions among autonomous entities in distributed environments. In Multi-Agent Reinforcement Learning (MARL), communication enables coordination but can lead to inefficient information exchange, since agents may generate redundant or non-essential messages. While prior work has focused on boosting task performance with information exchange, the existing research lacks a thorough investigation of both the appropriate definition and the optimization of communication protocols (communication topology and message). To fill this gap, we introduce a generalized framework for learning multi-round communication protocols that are both effective and efficient. Within this framework, we propose three novel Communication Efficiency Metrics (CEMs) to guide and evaluate the learning process: the Information Entropy Efficiency Index (IEI) and Specialization Efficiency Index (SEI) for efficiency-augmented optimization, and the Topology Efficiency Index (TEI) for explicit evaluation. We integrate IEI and SEI as the adjusted loss functions to promote informative messaging and role specialization, while using TEI to quantify the trade-off between communication volume and task performance. Through comprehensive experiments, we demonstrate that our learned communication protocol can significantly enhance communication efficiency and achieves better cooperation performance with improved success rates.
中文摘要 多代理系统（MAS）已成为对分布式环境中自治实体之间复杂交互进行建模的强大范式。在多智能体强化学习（MARL）中，通信可以实现协调，但可能导致信息交换效率低下，因为智能体可能会生成冗余或非必要的消息。虽然之前的工作侧重于通过信息交换提高任务性能，但现有研究缺乏对通信协议（通信拓扑和消息）的适当定义和优化的彻底调查。为了填补这一空白，我们引入了一个通用框架，用于学习既有效又高效的多轮通信协议。在这个框架内，我们提出了三种新的通信效率指标（CEM）来指导和评估学习过程：用于效率增强优化的信息熵效率指数（IEI）和专业化效率指数（SEI），以及用于显式评估的拓扑效率指数（TEI）。我们将 IEI 和 SEI 整合为调整后的损失函数，以促进信息信息传递和角色专业化，同时使用 TEI 量化通信量和任务绩效之间的权衡。通过综合实验，我们证明了我们学习的通信协议可以显着提高通信效率，并实现更好的合作绩效和更高的成功率。
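
The abstract does not define the Information Entropy Efficiency Index, so the sketch below is not the paper's IEI. It only computes the plain Shannon entropy of a batch of discrete messages, the kind of quantity an entropy-based efficiency metric would plausibly build on. The function name and the choice of discrete, hashable messages are assumptions for illustration.

```python
import math
from collections import Counter


def message_entropy(messages):
    """Shannon entropy (in bits) of the empirical distribution over discrete messages."""
    total = len(messages)
    if total == 0:
        return 0.0
    counts = Counter(messages)
    # H = -sum_i p_i * log2(p_i), with p_i the empirical frequency of message i.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

An agent that always sends the same message has zero entropy (the message carries no information), while four equally likely messages give 2 bits.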

Iterated Population Based Training with Task-Agnostic Restarts

与任务无关的重启的基于群体的迭代训练

Authors: Alexander Chebykin, Tanja Alderliesten, Peter A. N. Bosman
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/2511.09190
Pdf link: https://arxiv.org/pdf/2511.09190
Abstract Hyperparameter Optimization (HPO) can lift the burden of tuning hyperparameters (HPs) of neural networks. HPO algorithms from the Population Based Training (PBT) family are efficient thanks to dynamically adjusting HPs every few steps of the weight optimization. Recent results indicate that the number of steps between HP updates is an important meta-HP of all PBT variants that can substantially affect their performance. Yet, no method or intuition is available for efficiently setting its value. We introduce Iterated Population Based Training (IPBT), a novel PBT variant that automatically adjusts this HP via restarts that reuse weight information in a task-agnostic way and leverage time-varying Bayesian optimization to reinitialize HPs. Evaluation on 8 image classification and reinforcement learning tasks shows that, on average, our algorithm matches or outperforms 5 previous PBT variants and other HPO algorithms (random search, ASHA, SMAC3), without requiring a budget increase or any changes to its HPs. The source code is available at this https URL.
中文摘要 超参数优化（HPO）可以减轻神经网络超参数（HP）的调整负担。基于群体的训练（PBT）系列的 HPO 算法非常高效，这要归功于权重优化的每几个步骤动态调整 HP。最近的结果表明，HP 更新之间的步数是所有 PBT 变体的重要元 HP，可以显着影响其性能。然而，没有任何方法或直觉可用于有效地设置其值。我们引入了基于群体的迭代训练（IPBT），这是一种新型 PBT 变体，它通过重新启动自动调整该 HP，以与任务无关的方式重用权重信息，并利用时变贝叶斯优化来重新初始化 HP。对 8 个图像分类和强化学习任务的评估表明，平均而言，我们的算法匹配或优于之前的 5 个 PBT 变体和其他 HPO 算法（随机搜索， ASHA，SMAC3），无需增加预算或对其 HP 进行任何更改。源代码可在此 https URL 中找到。
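
For context, the sketch below shows one generic PBT exploit/explore step, the operation that the "number of steps between HP updates" controls the frequency of. It is a minimal version of standard PBT, not IPBT: the restarts and the time-varying Bayesian optimization described in the abstract are not modeled. The dictionary layout, the truncation fraction, and the perturbation factors are illustrative assumptions.

```python
import random


def pbt_step(population, rng, perturb=(0.8, 1.2), frac=0.25):
    """One exploit/explore step over a list of dicts with 'score', 'weights', 'lr'."""
    ranked = sorted(population, key=lambda m: m["score"], reverse=True)
    k = max(1, int(len(ranked) * frac))
    top, bottom = ranked[:k], ranked[-k:]
    for member in bottom:
        parent = rng.choice(top)
        # Exploit: copy the weights of a top performer.
        member["weights"] = list(parent["weights"])
        # Explore: perturb the parent's hyperparameter by a random factor.
        member["lr"] = parent["lr"] * rng.choice(perturb)
    return population
```

Calling `pbt_step` every N training steps makes N the meta-hyperparameter that the abstract identifies as hard to set by hand.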

Planning in Branch-and-Bound: Model-Based Reinforcement Learning for Exact Combinatorial Optimization

分支和边界规划：基于模型的强化学习进行精确组合优化

Authors: Paul Strang, Zacharie Alès, Côme Bissuel, Safia Kedad-Sidhoum, Emmanuel Rachelson
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.09219
Pdf link: https://arxiv.org/pdf/2511.09219
Abstract Mixed-Integer Linear Programming (MILP) lies at the core of many real-world combinatorial optimization (CO) problems, traditionally solved by branch-and-bound (B&B). A key driver of B&B solver efficiency is the variable selection heuristic that guides branching decisions. Looking to move beyond static, hand-crafted heuristics, recent work has explored adapting traditional reinforcement learning (RL) algorithms to the B&B setting, aiming to learn branching strategies tailored to specific MILP distributions. In parallel, RL agents have achieved remarkable success in board games, a very specific type of combinatorial problem, by leveraging environment simulators to plan via Monte Carlo Tree Search (MCTS). Building on these developments, we introduce Plan-and-Branch-and-Bound (PlanB&B), a model-based reinforcement learning (MBRL) agent that leverages a learned internal model of the B&B dynamics to discover improved branching strategies. Computational experiments empirically validate our approach, with our MBRL branching agent outperforming previous state-of-the-art RL methods across four standard MILP benchmarks.
中文摘要 混合整数线性规划（MILP）是许多现实世界组合优化（CO）问题的核心，传统上通过分支和界限（B&B）解决。影响 B&B 求解器效率的一个关键驱动因素是指导分支决策的变量选择启发式方法。为了超越静态的、手工制作的启发式方法，最近的工作探索了将传统的强化学习（RL）算法应用于 B&B 设置，旨在学习针对特定 MILP 分布量身定制的分支策略。与此同时，RL 代理通过利用环境模拟器通过蒙特卡洛树搜索（MCTS）进行规划，在棋盘游戏（一种非常特殊的组合问题类型）中取得了显着的成功。基于这些发展，我们引入了 Plan-and-Branch-and-Bound （PlanB&B），这是一种基于模型的强化学习（MBRL）代理，它利用学习到的 B&B 动力学内部模型来发现改进的分支策略。计算实验通过经验验证了我们的方法，我们的 MBRL 分支代理在四个标准 MILP 基准测试中优于之前最先进的 RL 方法。

Stabilizing Reinforcement Learning for Honesty Alignment in Language Models on Deductive Reasoning

稳定强化学习以实现演绎推理语言模型中的诚实一致性

Authors: Jiarui Liu, Kaustubh Dhole, Yingheng Wang, Haoyang Wen, Sarah Zhang, Haitao Mao, Gaotang Li, Neeraj Varshney, Jingguo Liu, Xiaoman Pan
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.09222
Pdf link: https://arxiv.org/pdf/2511.09222
Abstract Reinforcement learning with verifiable rewards (RLVR) has recently emerged as a promising framework for aligning language models with complex reasoning objectives. However, most existing methods optimize only for final task outcomes, leaving models vulnerable to collapse when negative rewards dominate early training. This challenge is especially pronounced in honesty alignment, where models must not only solve answerable queries but also identify when conclusions cannot be drawn from the given premises. Deductive reasoning provides an ideal testbed because it isolates reasoning capability from reliance on external factual knowledge. To investigate honesty alignment, we curate two multi-step deductive reasoning datasets from graph structures, one for linear algebra and one for logical inference, and introduce unanswerable cases by randomly perturbing an edge in half of the instances. We find that GRPO, with or without supervised fine-tuning initialization, struggles on these tasks. Through extensive experiments across three models, we evaluate stabilization strategies and show that curriculum learning provides some benefit but requires carefully designed in-distribution datasets with controllable difficulty. To address these limitations, we propose Anchor, a reinforcement learning method that injects ground truth trajectories into rollouts, preventing early training collapse. Our results demonstrate that this method stabilizes learning and significantly improves the overall reasoning performance, underscoring the importance of training dynamics for enabling reliable deductive reasoning in aligned language models.
中文摘要 具有可验证奖励的强化学习（RLVR）最近已成为一种有前途的框架，用于使语言模型与复杂的推理目标保持一致。然而，大多数现有方法仅针对最终任务结果进行优化，当负面奖励主导早期训练时，模型很容易崩溃。这一挑战在诚实对齐中尤为明显，模型不仅必须解决可回答的查询，还必须识别何时无法从给定前提中得出结论。演绎推理提供了一个理想的测试平台，因为它将推理能力与对外部事实知识的依赖隔离开来。为了研究诚实对齐，我们从图结构中策划了两个多步演绎推理数据集，一个用于线性代数，一个用于逻辑推理，并通过在一半的实例中随机扰动一条边来引入无法回答的情况。我们发现，无论有没有监督微调初始化，GRPO 在这些任务上都表现不佳。通过对三个模型的广泛实验，我们评估了稳定策略，并表明课程学习提供了一些好处，但需要在难度可控的分布数据集中仔细设计。为了解决这些限制，我们提出了 Anchor，这是一种强化学习方法，可将地面实况轨迹注入到推出中，防止早期训练崩溃。我们的结果表明，这种方法稳定了学习并显着提高了整体推理性能，强调了训练动力学对于在对齐语言模型中实现可靠的演绎推理的重要性。
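
To make the collapse mechanism concrete, the sketch below computes GRPO-style group-relative advantages (reward minus group mean, divided by group standard deviation). When every rollout in a group fails, all advantages are zero and the group contributes no learning signal. The `inject_anchor` helper is a hypothetical stand-in for the idea of placing a ground-truth trajectory in the group; the abstract does not describe how Anchor actually performs the injection, and the reward values are illustrative.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: standardize rewards within one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def inject_anchor(rewards, anchor_reward=1.0):
    """Hypothetical: replace the last sampled rollout with a ground-truth trajectory."""
    return rewards[:-1] + [anchor_reward]
```

With rewards `[-1, -1, -1, -1]` every advantage is zero; after `inject_anchor` the ground-truth trajectory gets a positive advantage and the failed rollouts a negative one, so the gradient is no longer degenerate.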

A Distributed Training Architecture For Combinatorial Optimization

用于组合优化的分布式训练架构

Authors: Yuyao Long
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.09261
Pdf link: https://arxiv.org/pdf/2511.09261
Abstract In recent years, graph neural networks (GNNs) have been widely applied to combinatorial optimization problems. However, existing methods still suffer from limited accuracy on complex graphs and exhibit poor scalability: full training requires loading the whole adjacency matrix and all embeddings at once, which may exhaust the memory of a single machine. This limitation significantly restricts their applicability to large-scale scenarios. To address these challenges, we propose a distributed GNN-based training framework for combinatorial optimization. In detail, the large graph is first partitioned into several small subgraphs. The individual subgraphs are then fully trained, providing a foundation for efficient local optimization. Finally, reinforcement learning (RL) is employed to take actions according to the GNN output, so that the constraints between cross nodes can be learned. Extensive experiments are conducted on both real large-scale social network datasets (e.g., Facebook, YouTube) and synthetically generated high-complexity graphs, which demonstrate that our framework outperforms state-of-the-art approaches in both solution quality and computational efficiency. Moreover, the experiments on large graph instances also validate the scalability of the model.
中文摘要 近年来，图神经网络（GNNs）在解决组合优化问题方面得到了广泛的应用。然而，现有方法在处理复杂图时仍然存在精度有限的问题，并且可扩展性很差，因为完整的训练需要一次加载整个相邻矩阵和所有嵌入，这可能会导致单台机器的内存不足。这种限制极大地限制了它们在大规模场景中的适用性。为了应对这些挑战，我们提出了一种基于分布式GNN的组合优化训练框架。详细来说，首先，大图被划分为几个小子图。然后，各个子图经过全面训练，为高效的局部优化奠定基础。最后，采用强化学习（RL）根据GNN输出采取行动，确保跨节点之间的限制能够被学习。在真实的大规模社交网络数据集（例如 Facebook、Youtube）和合成生成的高复杂度图上进行了广泛的实验，这表明我们的框架在解决方案质量和计算效率方面都优于最先进的方法。此外，在大型图实例上的实验也验证了模型的可扩展性。

CoRL-MPPI: Enhancing MPPI With Learnable Behaviours For Efficient And Provably-Safe Multi-Robot Collision Avoidance

CoRL-MPPI：通过可学习行为增强MPPI，以实现高效且可证明安全的多机器人碰撞避免

Authors: Stepan Dergachev, Artem Pshenitsyn, Aleksandr Panov, Alexey Skrynnik, Konstantin Yakovlev
Subjects: Robotics (cs.RO); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2511.09331
Pdf link: https://arxiv.org/pdf/2511.09331
Abstract Decentralized collision avoidance remains a core challenge for scalable multi-robot systems. One of the promising approaches to tackle this problem is Model Predictive Path Integral (MPPI), a framework that is naturally suited to handle any robot motion model and provides strong theoretical guarantees. Still, in practice an MPPI-based controller may provide suboptimal trajectories, as its performance relies heavily on uninformed random sampling. In this work, we introduce CoRL-MPPI, a novel fusion of Cooperative Reinforcement Learning and MPPI to address this limitation. We train an action policy (approximated as a deep neural network) in simulation that learns local cooperative collision avoidance behaviors. This learned policy is then embedded into the MPPI framework to guide its sampling distribution, biasing it towards more intelligent and cooperative actions. Notably, CoRL-MPPI preserves all the theoretical guarantees of regular MPPI. We evaluate our approach in dense, dynamic simulation environments against state-of-the-art baselines, including ORCA, BVC, and a multi-agent MPPI implementation. Our results demonstrate that CoRL-MPPI significantly improves navigation efficiency (measured by success rate and makespan) and safety, enabling agile and robust multi-robot navigation.
中文摘要 分散式防撞仍然是可扩展多机器人系统的核心挑战。解决这个问题的一个有前途的方法是模型预测路径积分（MPPI）——一个自然适合处理任何机器人运动模型并提供强大理论保证的框架。尽管如此，在实践中，基于MPPI的控制器可能会提供次优的轨迹，因为它的性能在很大程度上依赖于不知情的随机采样。在这项工作中，我们引入了CoRL-MPPI，这是协同强化学习和MPPI的一种新颖融合，以解决这一限制。我们在模拟中训练一个动作策略（近似为深度神经网络），学习局部合作碰撞规避行为。然后，将这一学习到的策略嵌入到 MPPI 框架中，以指导其抽样分布，使其偏向于更智能和合作的行动。值得注意的是，CoRL-MPPI 保留了常规 MPPI 的所有理论保证。我们根据最先进的基线（包括 ORCA、BVC 和多代理 MPPI 实现）在密集、动态仿真环境中评估我们的方法。我们的结果表明，CoRL-MPPI显著提高了导航效率（以成功率和制造跨度衡量）和安全性，实现了敏捷而稳健的多机器人导航。
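
As a rough illustration of why the sampling distribution matters in MPPI, the sketch below performs one generic MPPI-style update for a scalar control: sample perturbed controls, weight each by `exp(-cost / lam)`, and return the weighted average. The `center` argument stands in for where a learned policy could place the sampling mean. This is a simplified assumption, not the CoRL-MPPI implementation; in particular it omits the correction terms and safety machinery a real controller needs when the proposal is shifted, and all names and defaults are hypothetical.

```python
import math
import random


def mppi_update(center, cost_fn, rng, num_samples=256, sigma=0.5, lam=1.0):
    """One MPPI-style update for a 1-D control, sampling around `center`."""
    samples = [center + rng.gauss(0.0, sigma) for _ in range(num_samples)]
    costs = [cost_fn(u) for u in samples]
    best = min(costs)  # subtract the minimum cost for numerical stability
    # Low-cost samples receive exponentially larger weights.
    weights = [math.exp(-(c - best) / lam) for c in costs]
    total = sum(weights)
    return sum(w * u for w, u in zip(weights, samples)) / total
```

With a cost minimized at `u = 2`, sampling around `center=2.0` returns a value close to the optimum in one step, whereas sampling around `center=0.0` only moves part of the way, which is the intuition behind biasing the samples with a learned policy.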

Potent but Stealthy: Rethink Profile Pollution against Sequential Recommendation via Bi-level Constrained Reinforcement Paradigm

有效但隐蔽：通过双级约束强化范式根据顺序推荐重新思考剖面污染

Authors: Jiajie Su, Zihan Nan, Yunshan Ma, Xiaobo Xia, Xiaohua Feng, Weiming Liu, Xiaolin Zheng, Chaochao Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.09392
Pdf link: https://arxiv.org/pdf/2511.09392
Abstract Sequential Recommenders, which exploit dynamic user intents through interaction sequences, are vulnerable to adversarial attacks. While existing attacks primarily rely on data poisoning, they require large-scale user access or fake profiles and thus lack practicality. In this paper, we focus on the Profile Pollution Attack (PPA), which subtly contaminates partial user interactions to induce targeted mispredictions. Previous PPA methods suffer from two limitations: i) over-reliance on sequence horizon impact restricts fine-grained perturbations on item transitions, and ii) holistic modifications cause detectable distribution shifts. To address these challenges, we propose a constrained reinforcement driven attack, CREAT, that synergizes a bi-level optimization framework with multi-reward reinforcement learning to balance adversarial efficacy and stealthiness. We first develop a Pattern Balanced Rewarding Policy, which integrates pattern inversion rewards to invert critical patterns and distribution consistency rewards to minimize detectable shifts via unbalanced co-optimal transport. Then we employ a Constrained Group Relative Reinforcement Learning paradigm, enabling step-wise perturbations through dynamic barrier constraints and group-shared experience replay, achieving targeted pollution with minimal detectability. Extensive experiments demonstrate the effectiveness of CREAT.
中文摘要 顺序推荐器通过交互序列利用动态用户意图，容易受到对抗性攻击。虽然现有的攻击主要依赖于数据中毒，但它们需要大规模的用户访问或虚假的配置文件，因此缺乏实用性。在本文中，我们重点关注配置文件污染攻击，它巧妙地污染了部分用户交互以诱发有针对性的错误预测。以前的PPA方法有两个局限性，即i）过度依赖序列视界影响限制了对项目过渡的细粒度扰动，ii）整体修改会导致可检测到的分布偏移。为了应对这些挑战，我们提出了一种约束强化驱动攻击CREAT，它将双级优化框架与多奖励强化学习协同作用，以平衡对抗性效能和隐蔽性。我们首先开发了一种模式平衡奖励策略，该策略集成了模式反演奖励以反转关键模式和分布一致性奖励，以通过不平衡协优传输最大限度地减少可检测到的偏移。然后，我们采用约束组相对强化学习范式，通过动态障碍约束和组共享经验回放实现逐步扰动，以最小的可检测性实现有针对性的污染。广泛的实验证明了 CREAT 的有效性。

AdaCuRL: Adaptive Curriculum Reinforcement Learning with Invalid Sample Mitigation and Historical Revisiting

AdaCuRL：具有无效样本缓解和历史重温的自适应课程强化学习

Authors: Renda Li, Hailang Huang, Fei Wei, Feng Xiong, Yong Wang, Xiangxiang Chu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.09478
Pdf link: https://arxiv.org/pdf/2511.09478
Abstract Reinforcement learning (RL) has demonstrated considerable potential for enhancing reasoning in large language models (LLMs). However, existing methods suffer from Gradient Starvation and Policy Degradation when training directly on samples with mixed difficulty. To mitigate this, prior approaches leverage Chain-of-Thought (CoT) data, but the construction of high-quality CoT annotations remains labor-intensive. Alternatively, curriculum learning strategies have been explored but frequently encounter challenges, such as difficulty mismatch, reliance on manual curriculum design, and catastrophic forgetting. To address these issues, we propose AdaCuRL, an Adaptive Curriculum Reinforcement Learning framework that integrates coarse-to-fine difficulty estimation with adaptive curriculum scheduling. This approach dynamically aligns data difficulty with model capability and incorporates a data revisitation mechanism to mitigate catastrophic forgetting. Furthermore, AdaCuRL employs adaptive reference and sparse KL strategies to prevent Policy Degradation. Extensive experiments across diverse reasoning benchmarks demonstrate that AdaCuRL consistently achieves significant performance improvements on both LLMs and MLLMs.
中文摘要 强化学习（RL）在增强大型语言模型（LLM）的推理方面显示出相当大的潜力。然而，现有方法在直接对难度混合的样本进行训练时存在梯度饥饿和策略降级的问题。为了缓解这种情况，以前的方法利用了思维链（CoT）数据，但构建高质量的CoT注释仍然是劳动密集型的。或者，已经探索了课程学习策略，但经常遇到挑战，例如难度不匹配、依赖手动课程设计和灾难性遗忘。为了解决这些问题，我们提出了 AdaCuRL，这是一种自适应课程强化学习框架，它将粗到细的难度估计与自适应课程安排相结合。这种方法将数据难度与模型能力动态结合起来，并结合数据重访机制来减轻灾难性遗忘。此外，AdaCuRL 采用自适应参考和稀疏 KL 策略来防止策略降级。跨不同推理基准的广泛实验表明，AdaCuRL 在 LLM 和 MLLM 上始终如一地实现了显着的性能改进。

SPIDER: Scalable Physics-Informed Dexterous Retargeting

SPIDER：可扩展的物理知情灵巧重定向

Authors: Chaoyi Pan, Changhao Wang, Haozhi Qi, Zixi Liu, Homanga Bharadhwaj, Akash Sharma, Tingfan Wu, Guanya Shi, Jitendra Malik, Francois Hogan
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.09484
Pdf link: https://arxiv.org/pdf/2511.09484
Abstract Learning dexterous and agile policies for humanoid and dexterous hand control requires large-scale demonstrations, but collecting robot-specific data is prohibitively expensive. In contrast, abundant human motion data is readily available from motion capture, videos, and virtual reality, which could help address the data scarcity problem. However, due to the embodiment gap and missing dynamic information like force and torque, these demonstrations cannot be directly executed on robots. To bridge this gap, we propose Scalable Physics-Informed DExterous Retargeting (SPIDER), a physics-based retargeting framework to transform and augment kinematic-only human demonstrations into dynamically feasible robot trajectories at scale. Our key insight is that human demonstrations should provide global task structure and objective, while large-scale physics-based sampling with curriculum-style virtual contact guidance should refine trajectories to ensure dynamical feasibility and correct contact sequences. SPIDER scales across 9 diverse humanoid/dexterous hand embodiments and 6 datasets, improving success rates by 18% compared to standard sampling, while being 10X faster than reinforcement learning (RL) baselines, and enabling the generation of a 2.4M-frame dynamically feasible robot dataset for policy learning. As a universal physics-based retargeting method, SPIDER can work with data of diverse quality and generate diverse and high-quality data to enable efficient policy learning with methods like RL.
中文摘要 学习人形和灵巧手控的灵巧和敏捷策略需要大规模演示，但收集特定于机器人的数据的成本高得令人望而却步。相比之下，从动作捕捉、视频和虚拟现实中可以轻松获得丰富的人体运动数据，这有助于解决数据稀缺问题。然而，由于具身差距和力和扭矩等动态信息的缺失，这些演示无法直接在机器人上执行。为了弥补这一差距，我们提出了可扩展的物理信息灵巧重定向（SPIDER），这是一种基于物理的重定向框架，用于将仅运动学的人类演示转换为大规模动态可行的机器人轨迹。我们的主要见解是，人类演示应该提供全局任务结构和目标，而基于物理的大规模采样和课程式的虚拟接触指导应该细化轨迹，以确保动态可行性和正确的接触顺序。SPIDER可跨9种不同的人形/灵巧手具身形态和6个数据集进行扩展，与标准采样相比，成功率提高了18%，同时比强化学习（RL）基线快10倍，并能够生成用于策略学习的2.4M帧动态可行机器人数据集。作为一种基于物理的通用重定向方法，SPIDER可以处理各种质量的数据，生成多样化的高质量数据，通过RL等方法实现高效的策略学习。

Quasi-Newton Compatible Actor-Critic for Deterministic Policies

确定性策略的准牛顿兼容行为者-批评者

Authors: Arash Bahari Kordabad, Dean Brandner, Sebastien Gros, Sergio Lucia, Sadegh Soudjani
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2511.09509
Pdf link: https://arxiv.org/pdf/2511.09509
Abstract In this paper, we propose a second-order deterministic actor-critic framework in reinforcement learning that extends the classical deterministic policy gradient method to exploit curvature information of the performance function. Building on the concept of compatible function approximation for the critic, we introduce a quadratic critic that simultaneously preserves the true policy gradient and an approximation of the performance Hessian. A least-squares temporal difference learning scheme is then developed to estimate the quadratic critic parameters efficiently. This construction enables a quasi-Newton actor update using information learned by the critic, yielding faster convergence compared to first-order methods. The proposed approach is general and applicable to any differentiable policy class. Numerical examples demonstrate that the method achieves improved convergence and performance over standard deterministic actor-critic baselines.
中文摘要 在本文中，我们提出了一种强化学习中的二阶确定性行为者-批评框架，该框架扩展了经典的确定性策略梯度方法，以利用性能函数的曲率信息。基于对批评者的兼容函数近似的概念，我们引入了一个二次批评，它同时保留了真实的策略梯度和性能 Hessian 的近似值。然后开发了一种最小二乘时间差学习方案，以有效地估计二次批评参数。这种结构可以使用批评者学到的信息进行准牛顿参与者更新，与一阶方法相比产生更快的收敛速度。拟议的方法是通用的，适用于任何可区分的政策类别。数值示例表明，该方法比标准确定性行为者-批评者基线实现了更好的收敛性和性能。
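
The benefit of curvature information can be seen on a toy problem. The sketch below compares a first-order gradient-ascent step with a Newton-type step on an ill-conditioned concave quadratic objective. Here the gradient and Hessian are computed analytically from the toy objective; in the paper they would instead be estimated through the quadratic critic, which this sketch does not model. The objective and all function names are assumptions chosen for illustration.

```python
def grad_and_hess(theta):
    """Toy objective J(theta) = -(theta0 - 1)^2 - 10 * (theta1 + 2)^2, maximized at (1, -2)."""
    grad = [-2.0 * (theta[0] - 1.0), -20.0 * (theta[1] + 2.0)]
    hess = [[-2.0, 0.0], [0.0, -20.0]]
    return grad, hess


def solve_2x2(h, g):
    """Solve h @ x = g for a 2x2 matrix h using Cramer's rule."""
    (a, b), (c, d) = h
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("Hessian approximation is singular")
    return [(d * g[0] - b * g[1]) / det, (a * g[1] - c * g[0]) / det]


def newton_step(theta, grad, hess):
    """Second-order update: theta - H^{-1} grad."""
    step = solve_2x2(hess, grad)
    return [t - s for t, s in zip(theta, step)]


def gradient_step(theta, grad, lr=0.1):
    """First-order ascent update: theta + lr * grad."""
    return [t + lr * g for t, g in zip(theta, grad)]
```

Starting from `[0.0, 0.0]`, the Newton-type step lands on the maximizer `[1.0, -2.0]` in a single update, while the gradient step with `lr=0.1` overshoots along the steep coordinate.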

WMPO: World Model-based Policy Optimization for Vision-Language-Action Models

WMPO：基于世界模型的视觉-语言-行动模型政策优化

Authors: Fangqi Zhu, Zhengyang Yan, Zicong Hong, Quanxin Shou, Xiao Ma, Song Guo
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.09515
Pdf link: https://arxiv.org/pdf/2511.09515
Abstract Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation, but their reliance on expert demonstrations limits their ability to learn from failures and perform self-corrections. Reinforcement learning (RL) addresses these through self-improving interactions with the physical environment, but suffers from high sample complexity on real robots. We introduce World-Model-based Policy Optimization (WMPO), a principled framework for on-policy VLA RL without interacting with the real environment. In contrast to widely used latent world models, WMPO focuses on pixel-based predictions that align the "imagined" trajectories with the VLA features pretrained with web-scale images. Crucially, WMPO enables the policy to perform on-policy GRPO that provides stronger performance than the often-used off-policy methods. Extensive experiments in both simulation and real-robot settings demonstrate that WMPO (i) substantially improves sample efficiency, (ii) achieves stronger overall performance, (iii) exhibits emergent behaviors such as self-correction, and (iv) demonstrates robust generalization and lifelong learning capabilities.
中文摘要 视觉-语言-行动（VLA）模型在通用机器人操作方面显示出强大的潜力，但它们对专家演示的依赖限制了它们从失败中学习和执行自我纠正的能力。强化学习（RL）通过与物理环境的自我改进交互来解决这些问题，但在真实机器人上存在很高的样本复杂性。我们介绍了基于世界模型的策略优化（WMPO），这是一个无需与真实环境交互即可实现同策略（on-policy）VLA RL 的原则框架。与广泛使用的潜在世界模型相比，WMPO 专注于基于像素的预测，将“想象”轨迹与用网络规模图像预训练的 VLA 特征对齐。至关重要的是，WMPO 使策略能够执行同策略 GRPO，从而提供比常用的异策略（off-policy）方法更强的性能。在模拟和真实机器人环境中的大量实验表明，WMPO （i）显着提高了样本效率，（ii）实现了更强的整体性能，（iii）表现出自我校正等涌现行为，以及（iv）表现出强大的泛化和终身学习能力。

Keyword: diffusion policy

No results.