Arxiv Papers of Today

Generated: 2026-01-20 16:36:37 (UTC+8); arXiv release time: 2026-01-19 20:00 EST (2026-01-20 09:00 UTC+8)

24 relevant papers today

Keyword: reinforcement learning

Energy-Efficient Omnidirectional Locomotion for Wheeled Quadrupeds via Predictive Energy-Aware Nominal Gait Selection

Authors: Xu Yang, Wei Yang, Kaibo He, Bo Yang, Yanan Sui, Yilin Mo
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.10723
Pdf link: https://arxiv.org/pdf/2601.10723
Abstract Wheeled-legged robots combine the efficiency of wheels with the versatility of legs, but face significant energy optimization challenges when navigating diverse environments. In this work, we present a hierarchical control framework that integrates predictive power modeling with residual reinforcement learning to optimize omnidirectional locomotion efficiency for wheeled quadrupedal robots. Our approach employs a novel power prediction network that forecasts energy consumption across different gait patterns over a 1-second horizon, enabling intelligent selection of the most energy-efficient nominal gait. A reinforcement learning policy then generates residual adjustments to this nominal gait, fine-tuning the robot's actions to balance energy efficiency with performance objectives. Comparative analysis shows our method reduces energy consumption by up to 35% compared to fixed-gait approaches while maintaining comparable velocity tracking performance. We validate our framework through extensive simulations and real-world experiments on a modified Unitree Go1 platform, demonstrating robust performance even under external disturbances. Videos and implementation details are available at this https URL.
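
The selection step described in the abstract above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_power_model`, its coefficients, the gait names, and the residual `scale` are all hypothetical stand-ins for the learned power prediction network and the RL policy output.

```python
from typing import Callable, List, Sequence


def toy_power_model(speed: float, gait: str) -> float:
    """Hypothetical stand-in for a learned power predictor.

    Returns a made-up energy estimate over the prediction horizon.
    """
    coeffs = {"walk": (20.0, 60.0), "trot": (35.0, 25.0), "drive": (50.0, 8.0)}
    base, slope = coeffs[gait]
    return base + slope * speed ** 2


def select_nominal_gait(
    predict_energy: Callable[[float, str], float],
    gaits: Sequence[str],
    speed: float,
) -> str:
    """Pick the gait with the lowest predicted energy for the commanded speed."""
    return min(gaits, key=lambda g: predict_energy(speed, g))


def compose_action(
    nominal: Sequence[float], residual: Sequence[float], scale: float = 0.1
) -> List[float]:
    """Add a scaled residual correction (e.g. from an RL policy) to the nominal action."""
    return [n + scale * r for n, r in zip(nominal, residual)]


if __name__ == "__main__":
    gaits = ["walk", "trot", "drive"]
    for speed in (0.2, 0.7, 1.0):
        print(speed, select_nominal_gait(toy_power_model, gaits, speed))
```

Keeping the predictor behind a plain callable means the same selection logic works whether the energy estimate comes from a lookup table, a fitted curve, or a neural network.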

Explore with Long-term Memory: A Benchmark and Multimodal LLM-based Reinforcement Learning Framework for Embodied Exploration

Authors: Sen Wang, Bangwei Liu, Zhenkun Gao, Lizhuang Ma, Xuhong Wang, Yuan Xie, Xin Tan
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.10744
Pdf link: https://arxiv.org/pdf/2601.10744
Abstract An ideal embodied agent should possess lifelong learning capabilities to handle long-horizon and complex tasks, enabling continuous operation in general environments. This not only requires the agent to accurately accomplish given tasks but also to leverage long-term episodic memory to optimize decision-making. However, existing mainstream one-shot embodied tasks primarily focus on task completion results, neglecting the crucial process of exploration and memory utilization. To address this, we propose Long-term Memory Embodied Exploration (LMEE), which aims to unify the agent's exploratory cognition and decision-making behaviors to promote lifelong learning. We further construct a corresponding dataset and benchmark, LMEE-Bench, incorporating multi-goal navigation and memory-based question answering to comprehensively evaluate both the process and outcome of embodied exploration. To enhance the agent's memory recall and proactive exploration capabilities, we propose MemoryExplorer, a novel method that fine-tunes a multimodal large language model through reinforcement learning to encourage active memory querying. By incorporating a multi-task reward function that includes action prediction, frontier selection, and question answering, our model achieves proactive exploration. Extensive experiments against state-of-the-art embodied exploration models demonstrate that our approach achieves significant advantages in long-horizon embodied tasks.

Reasoning Models Generate Societies of Thought

Authors: Junsol Kim, Shiyang Lai, Nino Scherrer, Blaise Agüera y Arcas, James Evans
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.10825
Pdf link: https://arxiv.org/pdf/2601.10825
Abstract Large language models have achieved remarkable capabilities across domains, yet mechanisms underlying sophisticated reasoning remain elusive. Recent reasoning models outperform comparable instruction-tuned models on complex cognitive tasks, attributed to extended computation through longer chains of thought. Here we show that enhanced reasoning emerges not from extended computation alone, but from simulating multi-agent-like interactions -- a society of thought -- which enables diversification and debate among internal cognitive perspectives characterized by distinct personality traits and domain expertise. Through quantitative analysis and mechanistic interpretability methods applied to reasoning traces, we find that reasoning models like DeepSeek-R1 and QwQ-32B exhibit much greater perspective diversity than instruction-tuned models, activating broader conflict between heterogeneous personality- and expertise-related features during reasoning. This multi-agent structure manifests in conversational behaviors, including question-answering, perspective shifts, and the reconciliation of conflicting views, and in socio-emotional roles that characterize sharp back-and-forth conversations, together accounting for the accuracy advantage in reasoning tasks. Controlled reinforcement learning experiments reveal that base models increase conversational behaviors when rewarded solely for reasoning accuracy, and fine-tuning models with conversational scaffolding accelerates reasoning improvement over base models. These findings indicate that the social organization of thought enables effective exploration of solution spaces. We suggest that reasoning models establish a computational parallel to collective intelligence in human groups, where diversity enables superior problem-solving when systematically structured, which suggests new opportunities for agent organization to harness the wisdom of crowds.

Action Shapley: A Training Data Selection Metric for World Model in Reinforcement Learning

Authors: Rajat Ghosh, Debojyoti Dutta
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)
Arxiv link: https://arxiv.org/abs/2601.10905
Pdf link: https://arxiv.org/pdf/2601.10905
Abstract Numerous offline and model-based reinforcement learning systems incorporate world models to emulate the inherent environments. A world model is particularly important in scenarios where direct interaction with the real environment is costly, dangerous, or impractical. The efficacy and interpretability of such world models are notably contingent upon the quality of the underlying training data. In this context, we introduce Action Shapley as an agnostic metric for the judicious and unbiased selection of training data. To facilitate the computation of Action Shapley, we present a randomized dynamic algorithm specifically designed to mitigate the exponential complexity inherent in traditional Shapley value computations. Through empirical validation across five data-constrained real-world case studies, the algorithm demonstrates a computational efficiency improvement exceeding 80% in comparison to conventional exponential time computations. Furthermore, our Action Shapley-based training data selection policy consistently outperforms ad-hoc training data selection.
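
For readers unfamiliar with why exact Shapley values are exponential and how randomization helps, the sketch below shows the standard permutation-sampling estimator. It is not the paper's randomized dynamic algorithm, whose details the abstract does not give; the function name, the `value_fn` interface, and the toy additive value function in the usage example are assumptions made for illustration.

```python
import random
from typing import Callable, Dict, FrozenSet, Hashable, Sequence


def shapley_permutation_estimate(
    players: Sequence[Hashable],
    value_fn: Callable[[FrozenSet], float],
    num_permutations: int = 200,
    seed: int = 0,
) -> Dict[Hashable, float]:
    """Estimate Shapley values by averaging marginal contributions
    over randomly sampled orderings of the players.

    Exact computation needs all 2^n coalitions; this uses
    num_permutations * n evaluations of value_fn instead.
    """
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    for _ in range(num_permutations):
        order = list(players)
        rng.shuffle(order)
        coalition = []
        previous = value_fn(frozenset())
        for p in order:
            coalition.append(p)
            current = value_fn(frozenset(coalition))
            totals[p] += current - previous  # marginal contribution of p
            previous = current
    return {p: total / num_permutations for p, total in totals.items()}


if __name__ == "__main__":
    # Toy additive "model quality" score: each data point adds a fixed amount.
    weights = {"a": 3, "b": 1, "c": 2}
    score = lambda subset: float(sum(weights[p] for p in subset))
    print(shapley_permutation_estimate(list(weights), score, num_permutations=50))
```

In a data-selection setting, each "player" would be a candidate training data point and `value_fn` would train and score a world model on the given subset, which is what makes every evaluation expensive.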

Realistic Curriculum Reinforcement Learning for Autonomous and Sustainable Marine Vessel Navigation

Authors: Zhang Xiaocai, Xiao Zhe, Liang Maohan, Liu Tao, Li Haijiang, Zhang Wenbin
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.10911
Pdf link: https://arxiv.org/pdf/2601.10911
Abstract Sustainability is becoming increasingly critical in maritime transport, encompassing both environmental and social impacts, such as Greenhouse Gas (GHG) emissions and navigational safety. Traditional vessel navigation heavily relies on human experience, often lacking autonomy and emission awareness, and is prone to human errors that may compromise safety. In this paper, we propose a Curriculum Reinforcement Learning (CRL) framework integrated with a realistic, data-driven marine simulation environment and a machine learning-based fuel consumption prediction module. The simulation environment is constructed using real-world vessel movement data and enhanced with a Diffusion Model to simulate dynamic maritime conditions. Vessel fuel consumption is estimated using historical operational data and learning-based regression. The surrounding environment is represented as image-based inputs to capture spatial complexity. We design a lightweight, policy-based CRL agent with a comprehensive reward mechanism that considers safety, emissions, timeliness, and goal completion. This framework effectively handles complex tasks progressively while ensuring stable and efficient learning in continuous action spaces. We validate the proposed approach in a sea area of the Indian Ocean, demonstrating its efficacy in enabling sustainable and safe vessel navigation.

Where to Touch, How to Contact: Hierarchical RL-MPC Framework for Geometry-Aware Long-Horizon Dexterous Manipulation

Authors: Zhixian Xie, Yu Xiang, Michael Posa, Wanxin Jin
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.10930
Pdf link: https://arxiv.org/pdf/2601.10930
Abstract A key challenge in contact-rich dexterous manipulation is the need to jointly reason over geometry, kinematic constraints, and intricate, nonsmooth contact dynamics. End-to-end visuomotor policies bypass this structure, but often require large amounts of data, transfer poorly from simulation to reality, and generalize weakly across tasks/embodiments. We address those limitations by leveraging a simple insight: dexterous manipulation is inherently hierarchical. At a high level, a robot decides where to touch (geometry) and move the object (kinematics); at a low level it determines how to realize that plan through contact dynamics. Building on this insight, we propose a hierarchical RL-MPC framework in which a high-level reinforcement learning (RL) policy predicts a contact intention, a novel object-centric interface that specifies (i) an object-surface contact location and (ii) a post-contact object-level subgoal pose. Conditioned on this contact intention, a low-level contact-implicit model predictive control (MPC) optimizes local contact modes and replans with contact dynamics to generate robot actions that robustly drive the object toward each subgoal. We evaluate the framework on non-prehensile tasks, including geometry-generalized pushing and object 3D reorientation. It achieves near-100% success with substantially reduced data (10x less than end-to-end baselines), highly robust performance, and zero-shot sim-to-real transfer.

MMedExpert-R1: Strengthening Multimodal Medical Reasoning via Domain-Specific Adaptation and Clinical Guideline Reinforcement

Authors: Meidan Ding, Jipeng Zhang, Wenxuan Wang, Haiqin Zhong, Xiaoling Luo, Wenting Chen, Linlin Shen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.10949
Pdf link: https://arxiv.org/pdf/2601.10949
Abstract Medical Vision-Language Models (MedVLMs) excel at perception tasks but struggle with complex clinical reasoning required in real-world scenarios. While reinforcement learning (RL) has been explored to enhance reasoning capabilities, existing approaches face critical mismatches: the scarcity of deep reasoning data, cold-start limits multi-specialty alignment, and standard RL algorithms fail to model clinical reasoning diversity. We propose MMedExpert-R1, a novel reasoning MedVLM that addresses these challenges through domain-specific adaptation and clinical guideline reinforcement. We construct MMedExpert, a high-quality dataset of 10K samples across four specialties with step-by-step reasoning traces. Our Domain-Specific Adaptation (DSA) creates specialty-specific LoRA modules to provide diverse initialization, while Guideline-Based Advantages (GBA) explicitly models different clinical reasoning perspectives to align with real-world diagnostic strategies. Conflict-Aware Capability Integration then merges these specialized experts into a unified agent, ensuring robust multi-specialty alignment. Comprehensive experiments demonstrate state-of-the-art performance, with our 7B model achieving 27.50 on MedXpert-MM and 83.03 on OmniMedVQA, establishing a robust foundation for reliable multimodal medical reasoning systems.

Toward Adaptive Grid Resilience: A Gradient-Free Meta-RL Framework for Critical Load Restoration

Authors: Zain ul Abdeen, Waris Gill, Ming Jin
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2601.10973
Pdf link: https://arxiv.org/pdf/2601.10973
Abstract Restoring critical loads after extreme events demands adaptive control to maintain distribution-grid resilience, yet uncertainty in renewable generation, limited dispatchable resources, and nonlinear dynamics make effective restoration difficult. Reinforcement learning (RL) can optimize sequential decisions under uncertainty, but standard RL often generalizes poorly and requires extensive retraining for new outage configurations or generation patterns. We propose a meta-guided gradient-free RL (MGF-RL) framework that learns a transferable initialization from historical outage experiences and rapidly adapts to unseen scenarios with minimal task-specific tuning. MGF-RL couples first-order meta-learning with evolutionary strategies, enabling scalable policy search without gradient computation while accommodating nonlinear, constrained distribution-system dynamics. Experiments on IEEE 13-bus and IEEE 123-bus test systems show that MGF-RL outperforms standard RL, MAML-based meta-RL, and model predictive control across reliability, restoration speed, and adaptation efficiency under renewable forecast errors. MGF-RL generalizes to unseen outages and renewable patterns while requiring substantially fewer fine-tuning episodes than conventional RL. We also provide sublinear regret bounds that relate adaptation efficiency to task similarity and environmental variation, supporting the empirical gains and motivating MGF-RL for real-time load restoration in renewable-rich distribution grids.
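
The coupling of first-order meta-learning with evolution strategies mentioned in the abstract can be illustrated with a generic sketch. Everything below is an assumption made for illustration rather than the MGF-RL algorithm itself: the inner loop uses a textbook antithetic evolution-strategies estimator, the outer loop uses a Reptile-style interpolation as one example of a first-order meta-update, and the quadratic "tasks" stand in for outage scenarios.

```python
import random
from typing import Callable, List, Sequence

Params = List[float]
RewardFn = Callable[[Params], float]


def es_step(theta: Params, reward_fn: RewardFn, rng: random.Random,
            sigma: float = 0.1, lr: float = 0.05, pop: int = 20) -> Params:
    """One evolution-strategies update using antithetic perturbations.

    Only reward evaluations are needed; no gradients of reward_fn are taken.
    """
    grad = [0.0] * len(theta)
    for _ in range(pop):
        eps = [rng.gauss(0.0, 1.0) for _ in theta]
        r_plus = reward_fn([t + sigma * e for t, e in zip(theta, eps)])
        r_minus = reward_fn([t - sigma * e for t, e in zip(theta, eps)])
        for i, e in enumerate(eps):
            grad[i] += (r_plus - r_minus) * e / (2.0 * sigma * pop)
    return [t + lr * g for t, g in zip(theta, grad)]


def adapt(theta: Params, reward_fn: RewardFn, rng: random.Random,
          steps: int = 10) -> Params:
    """Task-specific fine-tuning: a few ES steps from the shared initialization."""
    for _ in range(steps):
        theta = es_step(theta, reward_fn, rng)
    return theta


def meta_train(tasks: Sequence[RewardFn], theta: Params, rng: random.Random,
               meta_lr: float = 0.5, rounds: int = 30) -> Params:
    """Reptile-style outer loop: move the initialization toward adapted parameters."""
    for _ in range(rounds):
        task = rng.choice(list(tasks))
        adapted = adapt(theta, task, rng)
        theta = [t + meta_lr * (a - t) for t, a in zip(theta, adapted)]
    return theta


def make_task(center: float) -> RewardFn:
    """Hypothetical task family: reward peaks when every parameter equals `center`."""
    return lambda theta: -sum((t - center) ** 2 for t in theta)


if __name__ == "__main__":
    rng = random.Random(0)
    tasks = [make_task(c) for c in (1.5, 2.0, 2.5)]
    print(meta_train(tasks, [0.0], rng))
```

The learned initialization ends up near the region where the task optima cluster, so a new task from the same family needs only a few adaptation steps.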

BAPO: Boundary-Aware Policy Optimization for Reliable Agentic Search

Authors: Shiyu Liu, Yongjing Yin, Jianhao Yan, Yunbo Tang, Qinggang Zhang, Bei Li, Xin Chen, Jingang Wang, Xunliang Cai, Jinsong Su
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.11037
Pdf link: https://arxiv.org/pdf/2601.11037
Abstract RL-based agentic search enables LLMs to solve complex questions via dynamic planning and external search. While this approach significantly enhances accuracy with agent policies optimized via large-scale reinforcement learning, we identify a critical gap in reliability: these agents fail to recognize their reasoning boundaries and rarely admit "I DON'T KNOW" (IDK) even when evidence is insufficient or reasoning reaches its limit. The lack of reliability often leads to plausible but unreliable answers, introducing significant risks in many real-world scenarios. To this end, we propose Boundary-Aware Policy Optimization (BAPO), a novel RL framework designed to cultivate reliable boundary awareness without compromising accuracy. BAPO introduces two key components: (i) a group-based boundary-aware reward that encourages an IDK response only when the reasoning reaches its limit, and (ii) an adaptive reward modulator that strategically suspends this reward during early exploration, preventing the model from exploiting IDK as a shortcut. Extensive experiments on four benchmarks demonstrate that BAPO substantially enhances the overall reliability of agentic search.
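
One way to picture the two components named in the abstract is the toy reward function below. It is a loose interpretation, not BAPO's actual reward: treating "reasoning reaches its limit" as "no rollout in the group answered correctly", replacing the adaptive modulator with a fixed `warmup_steps` threshold, and the `idk_bonus` value are all assumptions introduced here.

```python
from typing import List, Sequence


def boundary_aware_rewards(
    outcomes: Sequence[str],
    step: int,
    warmup_steps: int = 100,
    idk_bonus: float = 0.5,
) -> List[float]:
    """Assign rewards to a group of rollouts for the same question.

    outcomes: one of "correct", "wrong", or "idk" per rollout.
    An "idk" answer earns a partial reward only when (a) training is past
    the warm-up phase and (b) no rollout in the group was correct, which
    this sketch uses as a proxy for the question being beyond the model's
    current reach.
    """
    any_correct = any(o == "correct" for o in outcomes)
    idk_active = step >= warmup_steps and not any_correct
    rewards = []
    for outcome in outcomes:
        if outcome == "correct":
            rewards.append(1.0)
        elif outcome == "idk" and idk_active:
            rewards.append(idk_bonus)
        else:
            rewards.append(0.0)
    return rewards


if __name__ == "__main__":
    print(boundary_aware_rewards(["wrong", "idk", "wrong"], step=200))
    print(boundary_aware_rewards(["correct", "idk", "wrong"], step=200))
    print(boundary_aware_rewards(["wrong", "idk", "wrong"], step=10))
```

Withholding the bonus while any sibling rollout succeeds is what keeps "IDK" from becoming a cheap shortcut on questions the model can in fact answer.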

Spurious Rewards Paradox: Mechanistically Understanding How RLVR Activates Memorization Shortcuts in LLMs

Authors: Lecheng Yan, Ruizhe Li, Guanhua Chen, Qing Li, Jiahui Geng, Wenxi Li, Vincent Wang, Chris Lee
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.11061
Pdf link: https://arxiv.org/pdf/2601.11061
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) is highly effective for enhancing LLM reasoning, yet recent evidence shows models like Qwen 2.5 achieve significant gains even with spurious or incorrect rewards. We investigate this phenomenon and identify a "Perplexity Paradox": spurious RLVR triggers a divergence where answer-token perplexity drops while prompt-side coherence degrades, suggesting the model is bypassing reasoning in favor of memorization. Using Path Patching, Logit Lens, JSD analysis, and Neural Differential Equations, we uncover a hidden Anchor-Adapter circuit that facilitates this shortcut. We localize a Functional Anchor in the middle layers (L18-20) that triggers the retrieval of memorized solutions, followed by Structural Adapters in later layers (L21+) that transform representations to accommodate the shortcut signal. Finally, we demonstrate that scaling specific MLP keys within this circuit allows for bidirectional causal steering, artificially amplifying or suppressing contamination-driven performance. Our results provide a mechanistic roadmap for identifying and mitigating data contamination in RLVR-tuned models. Code is available at this https URL.

Visual Marker Search for Autonomous Drone Landing in Diverse Urban Environments

Authors: Jiaohong Yao, Linfeng Liang, Yao Deng, Xi Zheng, Richard Han, Yuankai Qi
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.11078
Pdf link: https://arxiv.org/pdf/2601.11078
Abstract Marker-based landing is widely used in drone delivery and return-to-base systems for its simplicity and reliability. However, most approaches assume idealized landing site visibility and sensor performance, limiting robustness in complex urban settings. We present a simulation-based evaluation suite on the AirSim platform with systematically varied urban layouts, lighting, and weather to replicate realistic operational diversity. Using onboard camera sensors (RGB for marker detection and depth for obstacle avoidance), we benchmark two heuristic coverage patterns and a reinforcement learning-based agent, analyzing how exploration strategy and scene complexity affect success rate, path efficiency, and robustness. Results underscore the need to evaluate marker-based autonomous landing under diverse, sensor-relevant conditions to guide the development of reliable aerial navigation systems.

PhysRVG: Physics-Aware Unified Reinforcement Learning for Video Generative Models

Authors: Qiyuan Zhang, Biao Gong, Shuai Tan, Zheng Zhang, Yujun Shen, Xing Zhu, Yuyuan Li, Kelu Yao, Chunhua Shen, Changqing Zou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.11087
Pdf link: https://arxiv.org/pdf/2601.11087
Abstract Physical principles are fundamental to realistic visual simulation, but remain a significant oversight in transformer-based video generation. This gap highlights a critical limitation in rendering rigid body motion, a core tenet of classical mechanics. While computer graphics and physics-based simulators can easily model such collisions using Newton formulas, modern pretrain-finetune paradigms discard the concept of object rigidity during pixel-level global denoising. Even perfectly correct mathematical constraints are treated as suboptimal solutions (i.e., conditions) during model optimization in post-training, fundamentally limiting the physical realism of generated videos. Motivated by these considerations, we introduce, for the first time, a physics-aware reinforcement learning paradigm for video generation models that enforces physical collision rules directly in high-dimensional spaces, ensuring the physics knowledge is strictly applied rather than treated as conditions. Subsequently, we extend this paradigm to a unified framework, termed Mimicry-Discovery Cycle (MDcycle), which allows substantial fine-tuning while fully preserving the model's ability to leverage physics-grounded feedback. To validate our approach, we construct a new benchmark, PhysRVGBench, and perform extensive qualitative and quantitative experiments to thoroughly assess its effectiveness.

Learning Quadrupedal Locomotion for a Heavy Hydraulic Robot Using an Actuator Model

Authors: Minho Lee, Hyeonseok Kim, Jin Tak Kim, Sangshin Park, Jeong Hyun Lee, Jungsan Cho, Jemin Hwangbo
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.11143
Pdf link: https://arxiv.org/pdf/2601.11143
Abstract The simulation-to-reality (sim-to-real) transfer of large-scale hydraulic robots presents a significant challenge in robotics because of the inherently slow control response and complex fluid dynamics. The complex dynamics result from the structure of multiple interconnected cylinders and the difference in fluid rates of the cylinders. These characteristics complicate detailed simulation for all joints, making it unsuitable for reinforcement learning (RL) applications. In this work, we propose an analytical actuator model driven by hydraulic dynamics to represent the complicated actuators. The model predicts joint torques for all 12 actuators in under 1 microsecond, allowing rapid processing in RL environments. We compare our model with neural network-based actuator models and demonstrate the advantages of our model in data-limited scenarios. The locomotion policy trained in RL with our model is deployed on a hydraulic quadruped robot that weighs over 300 kg. This work is the first demonstration of a successful transfer of stable and robust command-tracking locomotion with RL on a heavy hydraulic quadruped robot, demonstrating advanced sim-to-real transferability.

Deep GraphRAG: A Balanced Approach to Hierarchical Retrieval and Adaptive Integration

Authors: Yuejie Li, Ke Yang, Tao Wang, Bolin Chen, Bowen Li, Chengjun Mao
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.11144
Pdf link: https://arxiv.org/pdf/2601.11144
Abstract Graph-based Retrieval-Augmented Generation (GraphRAG) frameworks face a trade-off between the comprehensiveness of global search and the efficiency of local search. Existing methods are often challenged by navigating large-scale hierarchical graphs, optimizing retrieval paths, and balancing exploration-exploitation dynamics, frequently lacking robust multi-stage re-ranking. To overcome these deficits, we propose Deep GraphRAG, a framework designed for a balanced approach to hierarchical retrieval and adaptive integration. It introduces a hierarchical global-to-local retrieval strategy that integrates macroscopic inter-community and microscopic intra-community contextual relations. This strategy employs a three-stage process: (1) inter-community filtering, which prunes the search space using local context; (2) community-level refinement, which prioritizes relevant subgraphs via entity-interaction analysis; and (3) entity-level fine-grained search within target communities. A beam search-optimized dynamic re-ranking module guides this process, continuously filtering candidates to balance efficiency and global comprehensiveness. Deep GraphRAG also features a Knowledge Integration Module leveraging a compact LLM, trained with Dynamic Weighting Reward GRPO (DW-GRPO). This novel reinforcement learning approach dynamically adjusts reward weights to balance three key objectives: relevance, faithfulness, and conciseness. This training enables compact models (1.5B) to approach the performance of large models (70B) in the integration task. Evaluations on Natural Questions and HotpotQA demonstrate that Deep GraphRAG significantly outperforms baseline graph retrieval methods in both accuracy and efficiency.
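
The idea of re-weighting several reward terms before a group-relative update can be sketched as follows. The group-relative normalization (reward minus group mean, divided by group standard deviation) follows the commonly described GRPO advantage; the weighting rule in `dynamic_weights`, which simply gives more weight to objectives whose recent mean score is lower, is a hypothetical placeholder and not the DW-GRPO rule, which the abstract does not specify.

```python
from typing import Dict, List, Sequence


def dynamic_weights(mean_scores: Dict[str, float], eps: float = 1e-6) -> Dict[str, float]:
    """Hypothetical rule: weight each objective by how far its recent
    mean score (assumed to lie in [0, 1]) is from the maximum of 1."""
    deficits = {k: max(1.0 - v, 0.0) + eps for k, v in mean_scores.items()}
    total = sum(deficits.values())
    return {k: d / total for k, d in deficits.items()}


def combined_reward(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-objective scores for one sampled answer."""
    return sum(weights[k] * scores[k] for k in weights)


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within a group of samples for the same prompt."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    recent = {"relevance": 0.9, "faithfulness": 0.5, "conciseness": 0.7}
    weights = dynamic_weights(recent)
    group = [
        {"relevance": 0.8, "faithfulness": 0.9, "conciseness": 0.4},
        {"relevance": 0.9, "faithfulness": 0.3, "conciseness": 0.9},
        {"relevance": 0.6, "faithfulness": 0.6, "conciseness": 0.6},
    ]
    rewards = [combined_reward(s, weights) for s in group]
    print(weights)
    print(group_relative_advantages(rewards))
```

With the weights above, the answer that scores well on the lagging objective (faithfulness) receives the largest advantage, which is the balancing effect a dynamic weighting scheme is meant to produce.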

TANDEM: Temporal-Aware Neural Detection for Multimodal Hate Speech

Authors: Girish A. Koushik, Helen Treharne, Diptesh Kanojia
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM); Social and Information Networks (cs.SI)
Arxiv link: https://arxiv.org/abs/2601.11178
Pdf link: https://arxiv.org/pdf/2601.11178
Abstract Social media platforms are increasingly dominated by long-form multimodal content, where harmful narratives are constructed through a complex interplay of audio, visual, and textual cues. While automated systems can flag hate speech with high accuracy, they often function as "black boxes" that fail to provide the granular, interpretable evidence, such as precise timestamps and target identities, required for effective human-in-the-loop moderation. In this work, we introduce TANDEM, a unified framework that transforms audio-visual hate detection from a binary classification task into a structured reasoning problem. Our approach employs a novel tandem reinforcement learning strategy where vision-language and audio-language models optimize each other through self-constrained cross-modal context, stabilizing reasoning over extended temporal sequences without requiring dense frame-level supervision. Experiments across three benchmark datasets demonstrate that TANDEM significantly outperforms zero-shot and context-augmented baselines, achieving 0.73 F1 in target identification on HateMM (a 30% improvement over state-of-the-art) while maintaining precise temporal grounding. We further observe that while binary detection is robust, differentiating between offensive and hateful content remains challenging in multi-class settings due to inherent label ambiguity and dataset imbalance. More broadly, our findings suggest that structured, interpretable alignment is achievable even in complex multimodal settings, offering a blueprint for the next generation of transparent and actionable online safety moderation tools.

Policy-Based Deep Reinforcement Learning Hyperheuristics for Job-Shop Scheduling Problems

基于策略的深度强化学习超启发式方法，用于作业车间调度问题

Authors: Sofiene Lassoued, Asrat Gobachew, Stefan Lier, Andreas Schwung
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.11189
Pdf link: https://arxiv.org/pdf/2601.11189
Abstract This paper proposes a policy-based deep reinforcement learning hyper-heuristic framework for solving the Job Shop Scheduling Problem. The hyper-heuristic agent learns to dynamically switch scheduling rules based on the system state. We extend the hyper-heuristic framework with two key mechanisms. First, action prefiltering restricts decision-making to feasible low-level actions, enabling low-level heuristics to be evaluated independently of environmental constraints and providing an unbiased assessment. Second, a commitment mechanism regulates the frequency of heuristic switching. We investigate the impact of different commitment strategies, from step-wise switching to full-episode commitment, on both training behavior and makespan. Additionally, we compare two action selection strategies at the policy level: deterministic greedy selection and stochastic sampling. Computational experiments on standard JSSP benchmarks demonstrate that the proposed approach outperforms traditional heuristics, metaheuristics, and recent neural network-based scheduling methods.
中文摘要 本文提出了一种基于策略的深度强化学习超启发式框架，用于求解作业车间调度问题。超启发式智能体学习根据系统状态动态切换调度规则。我们通过两个关键机制扩展了超启发式框架。首先，动作预过滤将决策限制在可行的低层动作中，使得低层启发式能够独立于环境约束进行评估，从而提供无偏的评估。其次，承诺机制调节启发式切换的频率。我们探讨了从逐步切换到整回合承诺等不同承诺策略对训练行为和最大完工时间（makespan）的影响。此外，我们还比较了策略层面的两种动作选择策略：确定性贪婪选择和随机采样。在标准JSSP基准上的计算实验表明，所提方法优于传统启发式、元启发式以及近期基于神经网络的调度方法。
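
The two mechanisms named in the abstract can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration, not the authors' implementation: the rule names `SPT` and `LPT`, the operation dictionaries with `duration` and `feasible` fields, the `CommittedAgent` class, and the use of plain logits in place of a trained policy network are all hypothetical.

```python
import math
import random

# Hypothetical low-level scheduling rules: each picks one operation from a list.
def spt(ops):
    return min(ops, key=lambda op: op["duration"])  # shortest processing time

def lpt(ops):
    return max(ops, key=lambda op: op["duration"])  # longest processing time

HEURISTICS = {"SPT": spt, "LPT": lpt}

def prefilter(ops):
    """Action prefiltering: keep only operations marked feasible."""
    return [op for op in ops if op["feasible"]]

def pick_heuristic(logits, greedy, rng):
    """Greedy argmax or softmax sampling over heuristic logits."""
    names = list(logits)
    if greedy:
        return max(names, key=lambda n: logits[n])
    top = max(logits.values())
    weights = [math.exp(logits[n] - top) for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

class CommittedAgent:
    """Holds the chosen heuristic for `commit_steps` decisions before re-selecting."""

    def __init__(self, commit_steps, greedy=True, seed=0):
        self.commit_steps = commit_steps  # must be >= 1
        self.greedy = greedy
        self.rng = random.Random(seed)
        self.current = None
        self.steps_left = 0

    def act(self, logits, ops):
        if self.steps_left == 0:
            self.current = pick_heuristic(logits, self.greedy, self.rng)
            self.steps_left = self.commit_steps
        self.steps_left -= 1
        # Raises ValueError if no operation is feasible.
        return self.current, HEURISTICS[self.current](prefilter(ops))

ops = [
    {"id": 0, "duration": 5, "feasible": True},
    {"id": 1, "duration": 1, "feasible": False},
    {"id": 2, "duration": 3, "feasible": True},
]
agent = CommittedAgent(commit_steps=2)
first = agent.act({"SPT": 1.0, "LPT": 0.0}, ops)
second = agent.act({"SPT": 0.0, "LPT": 1.0}, ops)  # still committed to SPT
third = agent.act({"SPT": 0.0, "LPT": 1.0}, ops)   # commitment expired, re-select
```

Setting `commit_steps=1` corresponds to step-wise switching, while a value equal to the episode length corresponds to full-episode commitment.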

Knowledge is Not Enough: Injecting RL Skills for Continual Adaptation

知识不够：注入强化学习技能以实现持续适应

Authors: Pingzhi Tang, Yiding Wang, Muhan Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.11258
Pdf link: https://arxiv.org/pdf/2601.11258
Abstract Large Language Models (LLMs) face the "knowledge cutoff" challenge, where their frozen parametric memory prevents direct internalization of new information. While Supervised Fine-Tuning (SFT) is commonly used to update model knowledge, it often updates factual content without reliably improving the model's ability to use the newly incorporated information for question answering or decision-making. Reinforcement Learning (RL) is essential for acquiring reasoning skills; however, its high computational cost makes it impractical for efficient online adaptation. We empirically observe that the parameter updates induced by SFT and RL are nearly orthogonal. Based on this observation, we propose Parametric Skill Transfer (PaST), a framework that supports modular skill transfer for efficient and effective knowledge adaptation. By extracting a domain-agnostic Skill Vector from a source domain, we can linearly inject knowledge manipulation skills into a target model after it has undergone lightweight SFT on new data. Experiments on knowledge-incorporation QA (SQuAD, LooGLE) and agentic tool-use benchmarks (ToolBench) demonstrate the effectiveness of our method. On SQuAD, PaST outperforms the state-of-the-art self-editing SFT baseline by up to 9.9 points. PaST further scales to long-context QA on LooGLE with an 8.0-point absolute accuracy gain, and improves zero-shot ToolBench success rates by +10.3 points on average with consistent gains across tool categories, indicating strong scalability and cross-domain transferability of the Skill Vector.
中文摘要 大型语言模型（LLMs）面临“知识截止”挑战，其冻结的参数记忆阻碍了新信息的直接内化。虽然监督微调（SFT）常用于更新模型知识，但它往往只更新事实内容，却不能可靠地提升模型利用新纳入信息进行问答或决策的能力。强化学习（RL）对于获得推理技能至关重要；然而，其高计算成本使其难以用于高效的在线适应。我们通过实验观察到，SFT和RL引起的参数更新几乎正交。基于这一观察，我们提出了参数化技能迁移（Parametric Skill Transfer，简称PaST）框架，支持模块化技能迁移，实现高效且有效的知识适应。通过从源域提取一个领域无关的技能向量，我们可以在目标模型对新数据完成轻量级SFT之后，将知识操作技能线性注入该模型。在知识整合问答（SQuAD、LooGLE）和智能体工具使用基准（ToolBench）上的实验展示了我们方法的有效性。在SQuAD上，PaST比最先进的自编辑SFT基线最多高出9.9分。PaST进一步扩展到LooGLE上的长上下文问答，绝对准确率提升8.0分，并将零样本ToolBench成功率平均提升10.3分，且在各工具类别上提升一致，表明技能向量具有很强的可扩展性和跨领域可迁移性。
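
The linear injection described in the abstract can be sketched on toy parameter dictionaries. This is a hedged illustration only: the abstract does not state how the Skill Vector is computed, so the sketch assumes it is the parameter difference between an RL-trained and an SFT-trained source model, and the helper names `delta`, `cosine`, `inject` and the `scale` factor are hypothetical.

```python
import math

def delta(after, before):
    """Per-parameter difference between two checkpoints."""
    return {name: [a - b for a, b in zip(after[name], before[name])]
            for name in after}

def cosine(u, v):
    """Cosine similarity between two flattened update dictionaries."""
    flat_u = [x for name in sorted(u) for x in u[name]]
    flat_v = [x for name in sorted(u) for x in v[name]]
    dot = sum(a * b for a, b in zip(flat_u, flat_v))
    norm = math.sqrt(sum(a * a for a in flat_u)) * math.sqrt(sum(b * b for b in flat_v))
    return dot / norm if norm else 0.0

def inject(params, skill, scale=1.0):
    """Linearly add a scaled skill vector to a checkpoint."""
    return {name: [p + scale * s for p, s in zip(params[name], skill[name])]
            for name in params}

# Toy source-domain checkpoints (values chosen only for illustration).
base = {"w": [0.0, 0.0]}
sft_source = {"w": [1.0, 0.0]}
rl_source = {"w": [1.0, 2.0]}

# Assumption: skill vector = RL checkpoint minus SFT checkpoint on the source domain.
skill_vector = delta(rl_source, sft_source)
sft_update = delta(sft_source, base)
overlap = cosine(sft_update, skill_vector)  # orthogonal in this toy case

# Target model after lightweight SFT on new data, then skill injection.
target_sft = {"w": [3.0, 0.0]}
adapted = inject(target_sft, skill_vector, scale=0.5)
```

In this toy setup the SFT update and the skill vector are exactly orthogonal by construction, which mirrors (but does not reproduce) the near-orthogonality the authors report.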

Offline Reinforcement-Learning-Based Power Control for Application-Agnostic Energy Efficiency

基于离线强化学习的功率控制，实现应用无关的能效

Authors: Akhilesh Raj, Swann Perarnau, Aniruddha Gokhale, Solomon Bekele Abera
Subjects: Machine Learning (cs.LG); Performance (cs.PF); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2601.11352
Pdf link: https://arxiv.org/pdf/2601.11352
Abstract Energy efficiency has become an integral aspect of modern computing infrastructure design, impacting the performance, cost, scalability, and durability of production systems. The incorporation of power actuation and sensing capabilities in CPU designs is indicative of this, enabling the deployment of system software that can actively monitor and adjust energy consumption and performance at runtime. While reinforcement learning (RL) would seem ideal for the design of such energy efficiency control systems, online training presents challenges ranging from the lack of proper models for setting up an adequate simulated environment, to perturbation (noise) and reliability issues, if training is deployed on a live system. In this paper we discuss the use of offline reinforcement learning as an alternative approach for the design of an autonomous CPU power controller, with the goal of improving the energy efficiency of parallel applications at runtime without unduly impacting their performance. Offline RL sidesteps the issues incurred by online RL training by leveraging a dataset of state transitions collected from arbitrary policies prior to training. Our methodology applies offline RL to a gray-box approach to energy efficiency, combining online application-agnostic performance data (e.g., heartbeats) and hardware performance counters to ensure that the scientific objectives are met with limited performance degradation. Evaluating our method on a variety of compute-bound and memory-bound benchmarks and controlling power on a live system through Intel's Running Average Power Limit, we demonstrate that such an offline-trained agent can substantially reduce energy consumption at a tolerable performance degradation cost.
中文摘要 能源效率已成为现代计算基础设施设计的重要组成部分，影响着生产系统的性能、成本、可扩展性和耐久性。CPU设计中集成的功率调节和传感能力便说明了这一点，它使得系统软件能够在运行时主动监控和调整能耗与性能。虽然强化学习（RL）似乎非常适合设计此类能效控制系统，但在线训练面临诸多挑战：从缺乏用于搭建合适仿真环境的恰当模型，到在实际运行的系统上进行训练时出现的扰动（噪声）和可靠性问题。本文讨论了将离线强化学习作为自主CPU功率控制器设计的替代方法，旨在提升并行应用运行时的能效，同时不过度影响其性能。离线强化学习利用训练前由任意策略收集的状态转移数据集，规避了在线强化学习训练带来的问题。我们的方法将离线强化学习应用于灰盒能效方法，结合在线的应用无关性能数据（如心跳）和硬件性能计数器，确保在有限的性能下降下达成科学目标。我们在多种计算受限和内存受限的基准测试上评估了该方法，并通过英特尔的Running Average Power Limit在实际运行的系统上控制功耗，结果表明这样一个离线训练的智能体能够以可接受的性能下降为代价显著降低能耗。

The Mini Wheelbot Dataset: High-Fidelity Data for Robot Learning

Mini Wheelbot数据集：面向机器人学习的高保真数据

Authors: Henrik Hose, Paul Brunzema, Devdutt Subhasish, Sebastian Trimpe
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2601.11394
Pdf link: https://arxiv.org/pdf/2601.11394
Abstract The development of robust learning-based control algorithms for unstable systems requires high-quality, real-world data, yet access to specialized robotic hardware remains a significant barrier for many researchers. This paper introduces a comprehensive dynamics dataset for the Mini Wheelbot, an open-source, quasi-symmetric balancing reaction wheel unicycle. The dataset provides 1 kHz synchronized data encompassing all onboard sensor readings, state estimates, ground-truth poses from a motion capture system, and third-person video logs. To ensure data diversity, we include experiments across multiple hardware instances and surfaces using various control paradigms, including pseudo-random binary excitation, nonlinear model predictive control, and reinforcement learning agents. We include several example applications in dynamics model learning, state estimation, and time-series classification to illustrate common robotics algorithms that can be benchmarked on our dataset.
中文摘要 为不稳定系统开发稳健的基于学习的控制算法需要高质量的真实世界数据，但对许多研究人员来说，获得专用机器人硬件仍是一大障碍。本文介绍了Mini Wheelbot的全面动力学数据集，Mini Wheelbot是一款开源的准对称平衡反作用轮独轮车。该数据集提供1 kHz同步数据，涵盖所有机载传感器读数、状态估计、来自动作捕捉系统的真值位姿以及第三人称视频记录。为确保数据多样性，我们纳入了在多个硬件实例和多种地面上、采用多种控制范式进行的实验，包括伪随机二元激励、非线性模型预测控制和强化学习智能体。我们还提供了动力学模型学习、状态估计和时间序列分类方面的若干示例应用，以展示可在我们数据集上进行基准测试的常见机器人算法。
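
For readers unfamiliar with pseudo-random binary excitation, the following Python sketch generates a standard PRBS7 bit sequence with a linear-feedback shift register and maps it to a two-level input signal. The abstract does not say which sequence, amplitude, or hold time the dataset uses, so the polynomial choice (x^7 + x^6 + 1) and the `amplitude` and `hold` parameters here are assumptions for illustration.

```python
def prbs7_bits(n_bits, seed=0x7F):
    """Generate n_bits of a PRBS7 sequence (polynomial x^7 + x^6 + 1)."""
    if not 0 < seed <= 0x7F:
        raise ValueError("seed must be a non-zero 7-bit value")
    state = seed
    bits = []
    for _ in range(n_bits):
        # Feedback is the XOR of the two most significant register bits.
        new_bit = ((state >> 6) ^ (state >> 5)) & 1
        state = ((state << 1) | new_bit) & 0x7F
        bits.append(new_bit)
    return bits

def excitation(bits, amplitude, hold):
    """Map bits to +/- amplitude, holding each level for `hold` samples."""
    signal = []
    for bit in bits:
        level = amplitude if bit else -amplitude
        signal.extend([level] * hold)
    return signal

bits = prbs7_bits(254)  # two full periods of 127 bits
signal = excitation(bits[:4], amplitude=0.5, hold=2)
```

A longer `hold` shifts the excitation energy toward lower frequencies, which is the usual knob when matching the input signal to the bandwidth of the system being identified.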

Factored Value Functions for Graph-Based Multi-Agent Reinforcement Learning

基于图的多智能体强化学习中的分解值函数

Authors: Ahmed Rashwan, Keith Briggs, Chris Budd, Lisa Kreusser
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.11401
Pdf link: https://arxiv.org/pdf/2601.11401
Abstract Credit assignment is a core challenge in multi-agent reinforcement learning (MARL), especially in large-scale systems with structured, local interactions. Graph-based Markov decision processes (GMDPs) capture such settings via an influence graph, but standard critics are poorly aligned with this structure: global value functions provide weak per-agent learning signals, while existing local constructions can be difficult to estimate and ill-behaved in infinite-horizon settings. We introduce the Diffusion Value Function (DVF), a factored value function for GMDPs that assigns to each agent a value component by diffusing rewards over the influence graph with temporal discounting and spatial attenuation. We show that DVF is well-defined, admits a Bellman fixed point, and decomposes the global discounted value via an averaging property. DVF can be used as a drop-in critic in standard RL algorithms and estimated scalably with graph neural networks. Building on DVF, we propose Diffusion A2C (DA2C) and a sparse message-passing actor, Learned DropEdge GNN (LD-GNN), for learning decentralised algorithms under communication costs. Across the firefighting benchmark and three distributed computation tasks (vector graph colouring and two transmit power optimisation problems), DA2C consistently outperforms local and global critic baselines, improving average reward by up to 11%.
中文摘要 信用分配是多智能体强化学习（MARL）中的核心挑战，尤其是在具有结构化、局部交互的大规模系统中。基于图的马尔可夫决策过程（GMDPs）通过影响图刻画此类场景，但标准评论家（critic）与这一结构并不匹配：全局价值函数为每个智能体提供的学习信号较弱，而现有的局部构造可能难以估计，且在无限时域设置中性质不佳。我们提出扩散价值函数（DVF），这是一种针对GMDP的分解价值函数，通过在影响图上以时间折扣和空间衰减的方式扩散奖励，为每个智能体分配一个价值分量。我们证明了DVF是良定义的，存在贝尔曼不动点，并通过平均性质分解全局折扣价值。DVF可以作为即插即用的评论家用于标准强化学习算法，并可通过图神经网络进行可扩展的估计。基于DVF，我们提出了扩散A2C（DA2C）和一种稀疏消息传递的演员Learned DropEdge GNN（LD-GNN），用于在存在通信成本的情况下学习去中心化算法。在消防基准测试和三个分布式计算任务（向量图着色和两个发射功率优化问题）中，DA2C始终优于局部和全局评论家基线，平均奖励最多提升11%。

Generative Scenario Rollouts for End-to-End Autonomous Driving

端到端自动驾驶的生成式场景推演

Authors: Rajeev Yasarla, Deepti Hegde, Shizhong Han, Hsin-Pai Cheng, Yunxiao Shi, Meysam Sadeghigooghari, Shweta Mahajan, Apratim Bhattacharyya, Litian Liu, Risheek Garrepalli, Thomas Svantesson, Fatih Porikli, Hong Cai
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.11475
Pdf link: https://arxiv.org/pdf/2601.11475
Abstract Vision-Language-Action (VLA) models are emerging as highly effective planning models for end-to-end autonomous driving systems. However, current works mostly rely on imitation learning from sparse trajectory annotations and under-utilize their potential as generative models. We propose Generative Scenario Rollouts (GeRo), a plug-and-play framework for VLA models that jointly performs planning and generation of language-grounded future traffic scenes through an autoregressive rollout strategy. First, a VLA model is trained to encode ego vehicle and agent dynamics into latent tokens under supervision from planning, motion, and language tasks, facilitating text-aligned generation. Next, GeRo performs language-conditioned autoregressive generation. Given multi-view images, a scenario description, and ego-action questions, it generates future latent tokens and textual responses to guide long-horizon rollouts. A rollout-consistency loss stabilizes predictions using ground truth or pseudo-labels, mitigating drift and preserving text-action alignment. This design enables GeRo to perform temporally consistent, language-grounded rollouts that support long-horizon reasoning and multi-agent planning. On Bench2Drive, GeRo improves driving score and success rate by +15.7 and +26.2, respectively. By integrating reinforcement learning with generative rollouts, GeRo achieves state-of-the-art closed-loop and open-loop performance, demonstrating strong zero-shot robustness. These results highlight the promise of generative, language-conditioned reasoning as a foundation for safer and more interpretable end-to-end autonomous driving.
中文摘要 视觉-语言-动作（VLA）模型正逐渐成为端到端自动驾驶系统中高效的规划模型。然而，当前的工作大多依赖于从稀疏轨迹标注中进行模仿学习，未能充分发挥其作为生成模型的潜力。我们提出了生成式场景推演（Generative Scenario Rollouts，GeRo），这是一个面向VLA模型的即插即用框架，通过自回归推演策略联合完成规划以及基于语言的未来交通场景生成。首先，在规划、运动和语言任务的监督下，训练VLA模型将自车和其他交通参与者的动态编码为潜在token，从而促进文本对齐的生成。接下来，GeRo执行语言条件的自回归生成。给定多视角图像、场景描述和自车动作问题，它生成未来的潜在token和文本响应，以指导长时域推演。推演一致性损失利用真值或伪标签稳定预测，减少漂移并保持文本与动作的对齐。该设计使GeRo能够执行时间一致、基于语言的推演，支持长时域推理和多智能体规划。在Bench2Drive上，GeRo将驾驶得分和成功率分别提升了+15.7和+26.2。通过将强化学习与生成式推演相结合，GeRo实现了最先进的闭环和开环性能，展现了强大的零样本鲁棒性。这些结果凸显了生成式、语言条件推理作为更安全、更可解释的端到端自动驾驶基础的前景。

Do explanations generalize across large reasoning models?

解释是否能在大型推理模型中泛化？

Authors: Koyena Pal, David Bau, Chandan Singh
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.11517
Pdf link: https://arxiv.org/pdf/2601.11517
Abstract Large reasoning models (LRMs) produce a textual chain of thought (CoT) in the process of solving a problem, which serves as a potentially powerful tool to understand the problem by surfacing a human-readable, natural-language explanation. However, it is unclear whether these explanations generalize, i.e. whether they capture general patterns about the underlying problem rather than patterns which are esoteric to the LRM. This is a crucial question in understanding or discovering new concepts, e.g. in AI for science. We study this generalization question by evaluating a specific notion of generalizability: whether explanations produced by one LRM induce the same behavior when given to other LRMs. We find that CoT explanations often exhibit this form of generalization (i.e. they increase consistency between LRMs) and that this increased generalization is correlated with human preference rankings and post-training with reinforcement learning. We further analyze the conditions under which explanations yield consistent answers and propose a straightforward, sentence-level ensembling strategy that improves consistency. Taken together, these results prescribe caution when using LRM explanations to yield new insights and outline a framework for characterizing LRM explanation generalization.
中文摘要 大型推理模型（LRM）在解决问题的过程中会产生文本形式的思维链（CoT），它通过呈现人类可读的自然语言解释，成为理解问题的一种潜在有力工具。然而，尚不清楚这些解释是否能够泛化，即它们捕捉的是关于底层问题的一般模式，还是仅为该LRM所特有的模式。这是理解或发现新概念（例如在AI for science中）的关键问题。我们通过评估一个特定的可泛化性概念来研究这一泛化问题：一个LRM产生的解释在提供给其他LRM时是否会引发相同的行为。我们发现CoT解释常常表现出这种形式的泛化（即它们提高了LRM之间的一致性），并且这种泛化的提升与人类偏好排名以及基于强化学习的后训练相关。我们进一步分析了解释在何种条件下能够产生一致的答案，并提出了一种简单的句子级集成策略来提升一致性。综合来看，这些结果提示在利用LRM解释获取新见解时应保持谨慎，并勾勒出一个刻画LRM解释泛化的框架。

Keyword: diffusion policy

Multi-Agent Formation Navigation Using Diffusion-Based Trajectory Generation

基于扩散轨迹生成的多智能体编队导航

Authors: Hieu Do Quang, Chien Truong-Quoc, Quoc Van Tran
Subjects: Robotics (cs.RO); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2601.10725
Pdf link: https://arxiv.org/pdf/2601.10725
Abstract This paper introduces a diffusion-based planner for leader-follower formation control in cluttered environments. The diffusion policy is used to generate the trajectory of the midpoint of two leaders, modeled as a rigid bar in the plane, thereby defining their desired motion paths in a planar formation. The followers track the leaders and form the desired formation geometry using a distance-constrained formation controller based only on the relative positions in the followers' local coordinates. The proposed approach produces smooth motions and low tracking errors, with most failures occurring in narrow obstacle-free space or in obstacle configurations that are not in the training data set. Simulation results demonstrate the potential of diffusion models for reliable multi-agent formation planning.
中文摘要 本文介绍了一种基于扩散的规划器，用于在杂乱环境中实现领导者-跟随者编队控制。扩散策略用于生成两个领导者中点的轨迹，这两个领导者被视为平面上的一根刚性杆，从而确定它们在平面编队中的期望运动路径。跟随者则通过距离约束的编队控制器，仅根据跟随者局部坐标系中的相对位置，跟踪领导者并形成期望的编队几何形状。所提方法能够产生平滑的运动和较低的跟踪误差，大多数失败发生在狭窄的无障碍空间，或训练数据集中未出现的障碍物配置中。仿真结果表明了扩散模型在可靠的多智能体编队规划中的潜力。

X-Distill: Cross-Architecture Vision Distillation for Visuomotor Learning

X-Distill：面向视觉运动学习的跨架构视觉蒸馏

Authors: Maanping Shao, Feihong Zhang, Gu Zhang, Baiye Cheng, Zhengrong Xue, Huazhe Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.11269
Pdf link: https://arxiv.org/pdf/2601.11269
Abstract Visuomotor policies often leverage large pre-trained Vision Transformers (ViTs) for their powerful generalization capabilities. However, their significant data requirements present a major challenge in the data-scarce context of most robotic learning settings, where compact CNNs with strong inductive biases can be more easily optimized. To address this trade-off, we introduce X-Distill, a simple yet highly effective method that synergizes the strengths of both architectures. Our approach involves an offline, cross-architecture knowledge distillation, transferring the rich visual representations of a large, frozen DINOv2 teacher to a compact ResNet-18 student on the general-purpose ImageNet dataset. This distilled encoder, now endowed with powerful visual priors, is then jointly fine-tuned with a diffusion policy head on the target manipulation tasks. Extensive experiments on 34 simulated benchmarks and 5 challenging real-world tasks demonstrate that our method consistently outperforms policies equipped with from-scratch ResNet or fine-tuned DINOv2 encoders. Notably, X-Distill also surpasses 3D encoders that utilize privileged point cloud observations or much larger Vision-Language Models. Our work highlights the efficacy of a simple, well-founded distillation strategy for achieving state-of-the-art performance in data-efficient robotic manipulation.
中文摘要 视觉运动策略通常利用大型预训练视觉Transformer（ViT）来获得强大的泛化能力。然而，它们对大量数据的需求在数据稀缺的多数机器人学习场景中构成了重大挑战，而在这些场景中，具有强归纳偏置的紧凑型CNN更容易被优化。为解决这一权衡，我们引入了X-Distill，这是一种简单但极为有效的方法，能够结合两种架构的优势。我们的方法采用离线的跨架构知识蒸馏，在通用的ImageNet数据集上将大型冻结DINOv2教师模型的丰富视觉表征迁移到紧凑的ResNet-18学生模型中。这个蒸馏得到的编码器具备了强大的视觉先验，随后在目标操作任务上与扩散策略头一起进行联合微调。在34个仿真基准和5个具有挑战性的真实世界任务上进行的大量实验表明，我们的方法始终优于配备从零训练的ResNet或微调DINOv2编码器的策略。值得注意的是，X-Distill还超过了使用特权点云观测的3D编码器以及规模大得多的视觉-语言模型。我们的研究强调了简单且有理据的蒸馏策略在实现数据高效的机器人操作最先进性能方面的有效性。