Arxiv Papers of Today

Generated: 2025-12-16 16:35:47 (UTC+8); arXiv release time: 2025-12-16 20:00 EST (2025-12-17 09:00 UTC+8)

65 relevant papers today

Keyword: reinforcement learning

Reinforcement Learning for Latent-Space Thinking in LLMs

Authors: Enes Özeren, Matthias Aßenmacher
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.11816
Pdf link: https://arxiv.org/pdf/2512.11816
Abstract Chain-of-Thought (CoT) reasoning typically utilizes the discrete language space for thinking, which is inherently inefficient, as many generated tokens only enforce linguistic rules that are not required for reasoning. To bypass this, latent-space thinking allows models to think using the continuous embedding space. While existing methods for training those models show domain-specific gains, they fail to maintain performance in complex tasks, such as mathematical reasoning. We experimentally demonstrate that the Coconut approach, a form of supervised fine-tuning for latent-space thinking, is highly sensitive to design choices and exhibits several inherent limitations. To address these issues, we investigate reinforcement learning (RL) techniques -- an underexplored direction in latent-space thinking -- including GRPO and design a novel Latent RL method for directly optimizing the latent thinking steps. Our experimental results reveal that these RL-trained models still lag behind traditional language-space CoT models in the mathematical reasoning domain. We make our codebase publicly available.
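
The abstract above names GRPO as one of the RL techniques investigated. As a rough illustration only (not the authors' code), the group-relative advantage at the core of GRPO can be sketched as follows. The function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions made for this example.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples drawn for the same prompt.

    Hypothetical helper: GRPO scores a group of sampled answers and uses the
    group mean (and spread) as the baseline instead of a learned critic.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one math prompt, scored 1.0 if correct else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers receive a positive advantage and incorrect ones a negative advantage; when every sample in a group gets the same reward, all advantages are zero and the group contributes no gradient signal.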

Hierarchical Task Offloading and Trajectory Optimization in Low-Altitude Intelligent Networks Via Auction and Diffusion-based MARL

Authors: Jiahao You, Ziye Jia, Can Cui, Chao Dong, Qihui Wu, Zhu Han
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.11862
Pdf link: https://arxiv.org/pdf/2512.11862
Abstract The low-altitude intelligent networks (LAINs) emerge as a promising architecture for delivering low-latency and energy-efficient edge intelligence in dynamic and infrastructure-limited environments. By integrating unmanned aerial vehicles (UAVs), aerial base stations, and terrestrial base stations, LAINs can support mission-critical applications such as disaster response, environmental monitoring, and real-time sensing. However, these systems face key challenges, including energy-constrained UAVs, stochastic task arrivals, and heterogeneous computing resources. To address these issues, we propose an integrated air-ground collaborative network and formulate a time-dependent integer nonlinear programming problem that jointly optimizes UAV trajectory planning and task offloading decisions. The problem is challenging to solve due to temporal coupling among decision variables. Therefore, we design a hierarchical learning framework with two timescales. At the large timescale, a Vickrey-Clarke-Groves auction mechanism enables the energy-aware and incentive-compatible trajectory assignment. At the small timescale, we propose the diffusion-heterogeneous-agent proximal policy optimization, a generative multi-agent reinforcement learning algorithm that embeds latent diffusion models into actor networks. Each UAV samples actions from a Gaussian prior and refines them via observation-conditioned denoising, enhancing adaptability and policy diversity. Extensive simulations show that our framework outperforms baselines in energy efficiency, task success rate, and convergence performance.
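
The large-timescale step above relies on a Vickrey-Clarke-Groves (VCG) auction. The sketch below shows the generic VCG rule on a tiny assignment problem, under stated assumptions: the bid table, the names `uav1`, `uav2`, and `route`, and the brute-force search are illustrative and are not taken from the paper. Values are assumed to be non-negative.

```python
from itertools import permutations

def best_assignment(values, agents, items):
    """Brute-force the welfare-maximizing one-to-one assignment.

    values[a][t] is agent a's reported value for item t. Agents may stay
    unassigned (None) when there are fewer items than agents.
    """
    slots = list(items) + [None] * max(0, len(agents) - len(items))
    best_welfare, best_match = float("-inf"), {}
    for perm in permutations(slots, len(agents)):
        welfare = sum(values[a][t] for a, t in zip(agents, perm) if t is not None)
        if welfare > best_welfare:
            best_welfare, best_match = welfare, dict(zip(agents, perm))
    return best_welfare, best_match

def vcg(values, items):
    """Return (assignment, payments) under the VCG rule."""
    agents = list(values)
    welfare, match = best_assignment(values, agents, items)
    payments = {}
    for a in agents:
        others = [b for b in agents if b != a]
        # Welfare the other agents could reach if a were absent.
        welfare_without_a, _ = best_assignment(values, others, items)
        own = values[a][match[a]] if match[a] is not None else 0.0
        # Welfare the other agents actually get with a present.
        welfare_of_others = welfare - own
        payments[a] = welfare_without_a - welfare_of_others
    return match, payments

# Two hypothetical UAVs bidding for a single trajectory.
bids = {"uav1": {"route": 10}, "uav2": {"route": 7}}
match, payments = vcg(bids, ["route"])
```

With a single item the rule reduces to a second-price auction: the winner pays the externality it imposes on the others, which is what makes truthful bidding a dominant strategy.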

WAM-Diff: A Masked Diffusion VLA Framework with MoE and Online Reinforcement Learning for Autonomous Driving

Authors: Mingwang Xu, Jiahao Cui, Feipeng Cai, Hanlin Shang, Zhihao Zhu, Shan Luan, Yifang Xu, Neng Zhang, Yaoyi Li, Jia Cai, Siyu Zhu
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.11872
Pdf link: https://arxiv.org/pdf/2512.11872
Abstract End-to-end autonomous driving systems based on vision-language-action (VLA) models integrate multimodal sensor inputs and language instructions to generate planning and control signals. While autoregressive large language models and continuous diffusion policies are prevalent, the potential of discrete masked diffusion for trajectory generation remains largely unexplored. This paper presents WAM-Diff, a VLA framework that employs masked diffusion to iteratively refine a discrete sequence representing future ego-trajectories. Our approach features three key innovations: a systematic adaptation of masked diffusion for autonomous driving that supports flexible, non-causal decoding orders; scalable model capacity via a sparse MoE architecture trained jointly on motion prediction and driving-oriented visual question answering (VQA); and online reinforcement learning using Group Sequence Policy Optimization (GSPO) to optimize sequence-level driving rewards. Remarkably, our model achieves 91.0 PDMS on NAVSIM-v1 and 89.7 EPDMS on NAVSIM-v2, demonstrating the effectiveness of masked diffusion for autonomous driving. The approach provides a promising alternative to autoregressive and diffusion-based policies, supporting scenario-aware decoding strategies for trajectory generation. The code for this paper will be released publicly at: this https URL
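
The abstract above uses Group Sequence Policy Optimization (GSPO) to optimize sequence-level rewards. A minimal sketch of the sequence-level quantities follows, assuming per-token log-probabilities are already available. The function names and the default `clip_eps` are illustrative assumptions, not values from the paper.

```python
import math

def gspo_sequence_ratio(new_logprobs, old_logprobs):
    """Length-normalized sequence likelihood ratio.

    Equals (pi_new(y|x) / pi_old(y|x)) ** (1 / len(y)), computed in log space
    from per-token log-probabilities of the same sampled sequence.
    """
    assert new_logprobs and len(new_logprobs) == len(old_logprobs)
    mean_log_ratio = sum(n - o for n, o in zip(new_logprobs, old_logprobs)) / len(new_logprobs)
    return math.exp(mean_log_ratio)

def clipped_sequence_objective(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate applied once per sequence, not per token."""
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)

ratio = gspo_sequence_ratio([-1.0, -1.0], [-2.0, -2.0])  # exp(1)
objective = clipped_sequence_objective(ratio, advantage=1.0)
```

Because the ratio is defined on the whole sequence, clipping accepts or rejects an entire sampled trajectory at once, which matches a reward that is only defined at the sequence level.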

Mirror Mode in Fire Emblem: Beating Players at their own Game with Imitation and Reinforcement Learning

Authors: Yanna Elizabeth Smid, Peter van der Putten, Aske Plaat
Subjects: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.11902
Pdf link: https://arxiv.org/pdf/2512.11902
Abstract Enemy strategies in turn-based games should be surprising and unpredictable. This study introduces Mirror Mode, a new game mode where the enemy AI mimics the personal strategy of a player to challenge them to keep changing their gameplay. A simplified version of the Nintendo strategy video game Fire Emblem Heroes has been built in Unity, with a Standard Mode and a Mirror Mode. Our first set of experiments finds a suitable model for imitating player demonstrations, using Reinforcement Learning and Imitation Learning: a combination of Generative Adversarial Imitation Learning, Behavioral Cloning, and Proximal Policy Optimization. The second set of experiments evaluates the constructed model with player tests, where models are trained on demonstrations provided by participants. The gameplay of the participants indicates good imitation in defensive behavior, but not in offensive strategies. Participant surveys indicated that players recognized their own retreating tactics and showed higher overall player satisfaction for Mirror Mode. Refining the model further may improve imitation quality and increase player satisfaction, especially when players face their own strategies. The full code and survey results are stored at: this https URL

Safe Learning for Contact-Rich Robot Tasks: A Survey from Classical Learning-Based Methods to Safe Foundation Models

Authors: Heng Zhang, Rui Dai, Gokhan Solak, Pokuang Zhou, Yu She, Arash Ajoudani
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.11908
Pdf link: https://arxiv.org/pdf/2512.11908
Abstract Contact-rich tasks pose significant challenges for robotic systems due to inherent uncertainty, complex dynamics, and the high risk of damage during interaction. Recent advances in learning-based control have shown great potential in enabling robots to acquire and generalize complex manipulation skills in such environments, but ensuring safety, both during exploration and execution, remains a critical bottleneck for reliable real-world deployment. This survey provides a comprehensive overview of safe learning-based methods for robot contact-rich tasks. We categorize existing approaches into two main domains: safe exploration and safe execution. We review key techniques, including constrained reinforcement learning, risk-sensitive optimization, uncertainty-aware modeling, control barrier functions, and model predictive safety shields, and highlight how these methods incorporate prior knowledge, task structure, and online adaptation to balance safety and efficiency. A particular emphasis of this survey is on how these safe learning principles extend to and interact with emerging robotic foundation models, especially vision-language models (VLMs) and vision-language-action models (VLAs), which unify perception, language, and control for contact-rich manipulation. We discuss both the new safety opportunities enabled by VLM/VLA-based methods, such as language-level specification of constraints and multimodal grounding of safety signals, and the amplified risks and evaluation challenges they introduce. Finally, we outline current limitations and promising future directions toward deploying reliable, safety-aligned, and foundation-model-enabled robots in complex contact-rich environments. More details and materials are available at our Project GitHub Repository: this https URL

Evolutionary Reinforcement Learning based AI tutor for Socratic Interdisciplinary Instruction

Authors: Mei Jiang, Haihai Shen, Zhuo Luo, Bingdong Li, Wenjing Hong, Ke Tang, Aimin Zhou
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.11930
Pdf link: https://arxiv.org/pdf/2512.11930
Abstract Cultivating higher-order cognitive abilities -- such as knowledge integration, critical thinking, and creativity -- in modern STEM education necessitates a pedagogical shift from passive knowledge transmission to active Socratic construction. Although Large Language Models (LLMs) hold promise for STEM Interdisciplinary education, current methodologies employing Prompt Engineering (PE), Supervised Fine-tuning (SFT), or standard Reinforcement Learning (RL) often fall short of supporting this paradigm. Existing methods are hindered by three fundamental challenges: the inability to dynamically model latent student cognitive states; severe reward sparsity and delay inherent in long-term educational goals; and a tendency toward policy collapse lacking strategic diversity due to reliance on behavioral cloning. Recognizing the unobservability and dynamic complexity of these interactions, we formalize the Socratic Interdisciplinary Instructional Problem (SIIP) as a structured Partially Observable Markov Decision Process (POMDP), demanding simultaneous global exploration and fine-grained policy refinement. To this end, we propose ERL4SIIP, a novel Evolutionary Reinforcement Learning (ERL) framework specifically tailored for this domain. ERL4SIIP integrates: (1) a dynamic student simulator grounded in a STEM knowledge graph for latent state modeling; (2) a Hierarchical Reward Mechanism that decomposes long-horizon goals into dense signals; and (3) a LoRA-Division based optimization strategy coupling evolutionary algorithms for population-level global search with PPO for local gradient ascent.

A Review of Learning-Based Motion Planning: Toward a Data-Driven Optimal Control Approach

Authors: Jia Hu, Yang Chang, Haoran Wang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.11944
Pdf link: https://arxiv.org/pdf/2512.11944
Abstract Motion planning for high-level autonomous driving is constrained by a fundamental trade-off between the transparent, yet brittle, nature of pipeline methods and the adaptive, yet opaque, "black-box" characteristics of modern learning-based systems. This paper critically synthesizes the evolution of the field -- from pipeline methods through imitation learning, reinforcement learning, and generative AI -- to demonstrate how this persistent dilemma has hindered the development of truly trustworthy systems. To resolve this impasse, we conduct a comprehensive review of learning-based motion planning methods. Based on this review, we outline a data-driven optimal control paradigm as a unifying framework that synergistically integrates the verifiable structure of classical control with the adaptive capacity of machine learning, leveraging real-world data to continuously refine key components such as system dynamics, cost functions, and safety constraints. We explore this framework's potential to enable three critical next-generation capabilities: "Human-Centric" customization, "Platform-Adaptive" dynamics adaptation, and "System Self-Optimization" via self-tuning. We conclude by proposing future research directions based on this paradigm, aimed at developing intelligent transportation systems that are simultaneously safe, interpretable, and capable of human-like autonomy.

Learning to Extract Context for Context-Aware LLM Inference

Authors: Minseon Kim, Lucas Caccia, Zhengyan Shi, Matheus Pereira, Marc-Alexandre Côté, Xingdi Yuan, Alessandro Sordoni
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.11986
Pdf link: https://arxiv.org/pdf/2512.11986
Abstract User prompts to large language models (LLMs) are often ambiguous or under-specified, and subtle contextual cues shaped by user intentions, prior knowledge, and risk factors strongly influence what constitutes an appropriate response. Misinterpreting intent or risks may lead to unsafe outputs, while overly cautious interpretations can cause unnecessary refusal of benign requests. In this paper, we question the conventional framework in which LLMs generate immediate responses to requests without considering broader contextual factors. User requests are situated within broader contexts such as intentions, knowledge, and prior experience, which strongly influence what constitutes an appropriate answer. We propose a framework that extracts and leverages such contextual information from the user prompt itself. Specifically, a reinforcement learning based context generator, designed in an autoencoder-like fashion, is trained to infer contextual signals grounded in the prompt and use them to guide response generation. This approach is particularly important for safety tasks, where ambiguous requests may bypass safeguards while benign but confusing requests can trigger unnecessary refusals. Experiments show that our method reduces harmful responses by an average of 5.6% on the SafetyInstruct dataset across multiple foundation models and improves the harmonic mean of attack success rate and compliance on benign prompts by 6.2% on XSTest and WildJailbreak. These results demonstrate the effectiveness of context extraction for safer and more reliable LLM inferences.

Policy Gradient Algorithms for Age-of-Information Cost Minimization

Authors: José-Ramón Vidal, Vicent Pla, Luis Guijarro, Israel Leyva-Mayorga
Subjects: Networking and Internet Architecture (cs.NI); Information Theory (cs.IT); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.11990
Pdf link: https://arxiv.org/pdf/2512.11990
Abstract Recent developments in cyber-physical systems have increased the importance of maximizing the freshness of the information about the physical environment. However, optimizing the access policies of Internet of Things devices to maximize the data freshness, measured as a function of the Age-of-Information (AoI) metric, is a challenging task. This work introduces two algorithms to optimize the information update process in cyber-physical systems operating under the generate-at-will model, by finding an online policy without knowing the characteristics of the transmission delay or the age cost function. The optimization seeks to minimize the time-average cost, which integrates the AoI at the receiver and the data transmission cost, making the approach suitable for a broad range of scenarios. Both algorithms employ policy gradient methods within the framework of model-free reinforcement learning (RL) and are specifically designed to handle continuous state and action spaces. Each algorithm minimizes the cost using a distinct strategy for deciding when to send an information update. Moreover, we demonstrate that it is feasible to apply the two strategies simultaneously, leading to an additional reduction in cost. The results demonstrate that the proposed algorithms exhibit good convergence properties and achieve a time-average cost within 3% of the optimal value, when the latter is computable. A comparison with other state-of-the-art methods shows that the proposed algorithms outperform them in one or more of the following aspects: being applicable to a broader range of scenarios, achieving a lower time-average cost, and requiring a computational cost at least one order of magnitude lower.
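
The time-average cost described above combines the Age-of-Information at the receiver with a per-update transmission cost. The sketch below evaluates that quantity for a given sequence of delays and waiting times under the generate-at-will model. It assumes a linear age cost and a fixed per-update cost `tx_cost`, whereas the paper allows a general, unknown age cost function; all names are illustrative.

```python
def time_average_cost(delays, waits, tx_cost=0.0):
    """Average age plus transmission cost per unit time.

    delays[i]: transmission delay of update i (assumed positive).
    waits[i]:  idle time between delivery of update i and generation of
               update i + 1 (the decision variable of an update policy).
    The age equals delays[i] right after delivery i and grows at rate 1
    until the next delivery, giving the usual sawtooth.
    """
    assert len(delays) == len(waits) + 1
    area = 0.0
    elapsed = 0.0
    for i, wait in enumerate(waits):
        interval = wait + delays[i + 1]  # time between deliveries i and i + 1
        area += delays[i] * interval + 0.5 * interval ** 2  # trapezoid under the age curve
        elapsed += interval
    return (area + tx_cost * len(waits)) / elapsed

zero_wait = time_average_cost([1.0, 1.0, 1.0], [0.0, 0.0])
with_wait = time_average_cost([1.0, 1.0, 1.0], [1.0, 1.0])
with_tx_cost = time_average_cost([1.0, 1.0, 1.0], [0.0, 0.0], tx_cost=1.0)
```

A policy-gradient method would treat the waiting times as the output of a parameterized policy and adjust its parameters to lower this average, without needing the delay distribution in closed form.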

Goal Reaching with Eikonal-Constrained Hierarchical Quasimetric Reinforcement Learning

Authors: Vittorio Giammarino, Ahmed H. Qureshi
Subjects: Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2512.12046
Pdf link: https://arxiv.org/pdf/2512.12046
Abstract Goal-Conditioned Reinforcement Learning (GCRL) mitigates the difficulty of reward design by framing tasks as goal reaching rather than maximizing hand-crafted reward signals. In this setting, the optimal goal-conditioned value function naturally forms a quasimetric, motivating Quasimetric RL (QRL), which constrains value learning to quasimetric mappings and enforces local consistency through discrete, trajectory-based constraints. We propose Eikonal-Constrained Quasimetric RL (Eik-QRL), a continuous-time reformulation of QRL based on the Eikonal Partial Differential Equation (PDE). This PDE-based structure makes Eik-QRL trajectory-free, requiring only sampled states and goals, while improving out-of-distribution generalization. We provide theoretical guarantees for Eik-QRL and identify limitations that arise under complex dynamics. To address these challenges, we introduce Eik-Hierarchical QRL (Eik-HiQRL), which integrates Eik-QRL into a hierarchical decomposition. Empirically, Eik-HiQRL achieves state-of-the-art performance in offline goal-conditioned navigation and yields consistent gains over QRL in manipulation tasks, matching temporal-difference methods.
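
For orientation, the two mathematical ingredients named above can be written in their generic textbook form. The notation is an assumption made for this note: $d(s,g)$ is the learned distance from state $s$ to goal $g$, $w$ is any intermediate state, and $c(s) > 0$ is a local speed. The paper's exact formulation may differ.

```latex
\begin{align}
  d(s,s) &= 0, \qquad d(s,g) \ge 0, \\
  d(s,g) &\le d(s,w) + d(w,g), \\
  \lVert \nabla_s d(s,g) \rVert &= \frac{1}{c(s)}, \qquad d(g,g) = 0 .
\end{align}
```

The first two lines define a quasimetric, which need not be symmetric: $d(s,g) \neq d(g,s)$ in general. The third line is the Eikonal equation. It constrains the gradient of the distance at individual sampled states rather than along trajectories, which is consistent with the abstract's description of the method as trajectory-free.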

Learning to Get Up Across Morphologies: Zero-Shot Recovery with a Unified Humanoid Policy

Authors: Jonathan Spraggett
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.12230
Pdf link: https://arxiv.org/pdf/2512.12230
Abstract Fall recovery is a critical skill for humanoid robots in dynamic environments such as RoboCup, where prolonged downtime often decides the match. Recent techniques using deep reinforcement learning (DRL) have produced robust get-up behaviors, yet existing methods require training of separate policies for each robot morphology. This paper presents a single DRL policy capable of recovering from falls across seven humanoid robots with diverse heights (0.48 - 0.81 m), weights (2.8 - 7.9 kg), and dynamics. Trained with CrossQ, the unified policy transfers zero-shot up to 86 +/- 7% (95% CI [81, 89]) on unseen morphologies, eliminating the need for robot-specific training. Comprehensive leave-one-out experiments, morph scaling analysis, and diversity ablations show that targeted morphological coverage improves zero-shot generalization. In some cases, the shared policy even surpasses the specialist baselines. These findings illustrate the practicality of morphology-agnostic control for fall recovery, laying the foundation for generalist humanoid control. The software is open-source and available at: this https URL

Moment and Highlight Detection via MLLM Frame Segmentation

Authors: I Putu Andika Bagas Jiwanta, Ayu Purwarianti
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.12246
Pdf link: https://arxiv.org/pdf/2512.12246
Abstract Detecting video moments and highlights from natural-language queries has been unified by transformer-based methods. Other works use a generative Multimodal LLM (MLLM) to predict moments and/or highlights as text timestamps, utilizing its reasoning capability. While effective, text-based generation cannot provide direct gradients for frame-level predictions because the model only emits language tokens. Although recent Reinforcement Learning (RL) methods attempt to address the issue, we propose a novel approach by applying segmentation objectives directly on the LLM's output tokens. The LLM is fed with a fixed number of frames alongside a prompt that forces it to output a sequence of continuous "0" and/or "1" characters, with one character per frame. The "0"/"1" characters benefit from the LLM's inherent language capability while also acting as background and foreground probabilities, respectively. Training employs segmentation losses on the probabilities alongside a normal causal LM loss. At inference, beam search generates the sequence and logits, acting as moments and saliency scores, respectively. Despite sampling only 25 frames -- fewer than half of comparable methods -- our method achieved strong highlight detection (56.74 HIT@1) on QVHighlights. Additionally, our efficient method scores above the baseline (35.28 MAP) for moment retrieval. Empirically, segmentation losses provide a stable complementary learning signal even when the causal LM loss plateaus.
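
The per-frame "0"/"1" scheme above can be illustrated with a small sketch. It assumes the logits of the "0" and "1" tokens have already been read out at each frame position. Binary cross-entropy is used as a stand-in segmentation loss because the abstract does not name the specific losses, and all function names are hypothetical.

```python
import math

def foreground_probs(logits_zero, logits_one):
    """Two-way softmax over the "0" and "1" token logits, one pair per frame."""
    return [1.0 / (1.0 + math.exp(l0 - l1)) for l0, l1 in zip(logits_zero, logits_one)]

def bce_loss(probs, labels, eps=1e-7):
    """Mean binary cross-entropy between foreground probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)

def moments_from_bits(bits):
    """Turn a decoded "0"/"1" string into half-open (start, end) frame spans."""
    spans, start = [], None
    for i, ch in enumerate(bits):
        if ch == "1" and start is None:
            start = i
        elif ch != "1" and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(bits)))
    return spans

spans = moments_from_bits("0011100110")
loss = bce_loss(foreground_probs([0.0], [0.0]), [1])
```

The decoded string yields the moments as runs of "1" characters, while the foreground probabilities can double as per-frame saliency scores. Frame indices would still need to be mapped back to timestamps using the sampling rate.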

A Conflict-Aware Resource Management Framework for the Computing Continuum

Authors: Vlad Popescu-Vifor, Ilir Murturi, Praveen Kumar Donta, Schahram Dustdar
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2512.12299
Pdf link: https://arxiv.org/pdf/2512.12299
Abstract The increasing device heterogeneity and decentralization requirements in the computing continuum (i.e., spanning edge, fog, and cloud) introduce new challenges in resource orchestration. In such environments, agents are often responsible for optimizing resource usage across deployed services. However, agent decisions can lead to persistent conflict loops, inefficient resource utilization, and degraded service performance. To overcome such challenges, we propose a novel framework for adaptive conflict resolution in resource-oriented orchestration using a Deep Reinforcement Learning (DRL) approach. The framework enables handling resource conflicts across deployments and integrates a DRL model trained to mediate such conflicts based on real-time performance feedback and historical state information. The framework has been prototyped and validated on a Kubernetes-based testbed, illustrating its methodological feasibility and architectural resilience. Preliminary results show that the framework achieves efficient resource reallocation and adaptive learning in dynamic scenarios, thus providing a scalable and resilient solution for conflict-aware orchestration in the computing continuum.

The Role of AI in Modern Penetration Testing

Authors: J. Alexander Curtis, Nasir U. Eisty
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2512.12326
Pdf link: https://arxiv.org/pdf/2512.12326
Abstract Penetration testing is a cornerstone of cybersecurity, traditionally driven by manual, time-intensive processes. As systems grow in complexity, there is a pressing need for more scalable and efficient testing methodologies. This systematic literature review examines how Artificial Intelligence (AI) is reshaping penetration testing, analyzing 58 peer-reviewed studies from major academic databases. Our findings reveal that while AI-assisted pentesting is still in its early stages, notable progress is underway, particularly through Reinforcement Learning (RL), which was the focus of 77% of the reviewed works. Most research centers on the discovery and exploitation phases of pentesting, where AI shows the greatest promise in automating repetitive tasks, optimizing attack strategies, and improving vulnerability identification. Real-world applications remain limited but encouraging, including the European Space Agency's PenBox and various open-source tools. These demonstrate AI's potential to streamline attack path analysis, analyze complex network topology, and reduce manual workload. However, challenges persist: current models often lack flexibility and are underdeveloped for the reconnaissance and post-exploitation phases of pentesting. Applications involving Large Language Models (LLMs) remain relatively under-researched, pointing to a promising direction for future exploration. This paper offers a critical overview of AI's current and potential role in penetration testing, providing valuable insights for researchers, practitioners, and organizations aiming to enhance security assessments through advanced automation or looking for gaps in existing research.

ElasticVR: Elastic Task Computing in Multi-User Multi-Connectivity Wireless Virtual Reality (VR) Systems

Authors: Babak Badnava, Jacob Chakareski, Morteza Hashemi
Subjects: Information Theory (cs.IT); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2512.12366
Pdf link: https://arxiv.org/pdf/2512.12366
Abstract Diverse emerging VR applications integrate streaming of high fidelity 360 video content that requires ample amounts of computation and data rate. Scalable 360 video tiling enables having elastic VR computational tasks that can be scaled adaptively in computation and data rate based on the available user and system resources. We integrate scalable 360 video tiling in an edge-client wireless multi-connectivity architecture for joint elastic task computation offloading across multiple VR users called ElasticVR. To balance the trade-offs in communication, computation, energy consumption, and QoE that arise herein, we formulate a constrained QoE and energy optimization problem that integrates the multi-user/multi-connectivity action space with the elasticity of VR computational tasks. The ElasticVR framework introduces two multi-agent deep reinforcement learning solutions, namely CPPG and IPPG. CPPG adopts a centralized training and centralized execution approach to capture the coupling between users' communication and computational demands. This leads to globally coordinated decisions at the cost of increased computational overheads and limited scalability. To address the latter challenges, we also explore an alternative strategy denoted IPPG that adopts a centralized training with decentralized execution paradigm. IPPG leverages shared information and parameter sharing to learn robust policies; however, during execution, each user takes action independently based on its local state information only. The decentralized execution alleviates the communication and computation overhead of centralized decision-making and improves scalability. We show that the ElasticVR framework improves the PSNR by 43.21%, while reducing the response time and energy consumption by 42.35% and 56.83%, respectively, compared with a case where no elasticity is incorporated into VR computations.
中文摘要 多种新兴的VR应用集成了高保真360度视频内容的流式传输，这需要大量的计算资源和数据速率。可扩展的360度视频分块使VR计算任务具有弹性，能够根据可用的用户和系统资源在计算量和数据速率上自适应伸缩。我们将可扩展的360度视频分块整合到边缘-客户端无线多连接架构中，实现跨多个VR用户的联合弹性任务计算卸载，该框架称为ElasticVR。为了平衡由此产生的通信、计算、能耗和体验质量（QoE）之间的权衡，我们构建了一个带约束的QoE与能耗优化问题，将多用户/多连接的动作空间与VR计算任务的弹性结合起来。ElasticVR框架引入了两种多智能体深度强化学习解决方案，分别是CPPG和IPPG。CPPG采用集中式训练、集中式执行的方法，以捕捉用户通信需求与计算需求之间的耦合。这带来了全局协调的决策，但代价是计算开销增加和可扩展性受限。为应对上述挑战，我们还探讨了一种名为IPPG的替代策略，采用集中式训练、去中心化执行的范式。IPPG利用共享信息和参数共享来学习稳健的策略；然而，在执行过程中，每个用户仅根据其本地状态信息独立采取行动。去中心化执行减轻了集中式决策的通信和计算开销，并提升了可扩展性。我们证明，与VR计算中未引入弹性的情况相比，ElasticVR框架将PSNR提升了43.21%，同时将响应时间和能耗分别降低了42.35%和56.83%。

Sim2Real Reinforcement Learning for Soccer skills

Sim2Real 足球技能强化学习

Authors: Jonathan Spraggett
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.12437
Pdf link: https://arxiv.org/pdf/2512.12437
Abstract This thesis work presents a more efficient and effective approach to training control-related tasks for humanoid robots using Reinforcement Learning (RL). Traditional RL methods are limited in adapting to real-world environments, complexity, and natural motions, but the proposed approach overcomes these limitations by using curriculum training and the Adversarial Motion Priors (AMP) technique. The results show that the developed RL policies for kicking, walking, and jumping are more dynamic and adaptive, and outperform previous methods. However, the transfer of the learned policy from simulation to the real world was unsuccessful, highlighting the limitations of current RL methods in fully adapting to real-world scenarios.
中文摘要 本论文提出了一种更高效、更有效的方法，利用强化学习（RL）训练人形机器人的控制相关任务。传统的强化学习方法在适应现实环境、处理复杂性和生成自然运动方面存在局限，而所提方法通过课程式训练和对抗运动先验（AMP）技术克服了这些局限。结果显示，所训练的踢球、行走和跳跃强化学习策略更具动态性和适应性，且优于以往方法。然而，将所学策略从仿真迁移到现实世界并未成功，凸显了当前强化学习方法在完全适应现实场景方面的局限性。

HetRL: Efficient Reinforcement Learning for LLMs in Heterogeneous Environments

HetRL：异构环境中LLM的高效强化学习

Authors: Yongjun He, Shuai Zhang, Jiading Gai, Xiyuan Zhang, Boran Han, Bernie Wang, Huzefa Rangwala, George Karypis
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2512.12476
Pdf link: https://arxiv.org/pdf/2512.12476
Abstract As large language models (LLMs) continue to scale and new GPUs are released even more frequently, there is an increasing demand for LLM post-training in heterogeneous environments to fully leverage underutilized mid-range or previous-generation GPUs across regions and alleviate the shortage of homogeneous high-end GPUs within a single region. However, achieving high-performance reinforcement learning (RL) training for LLMs on such computing resources remains challenging because the workflow involves multiple models and tasks with complex computation and data dependencies. In this paper, we present HetRL, a distributed system for efficient RL training in infrastructures with heterogeneous GPUs and networks. HetRL formulates the scheduling of RL training in heterogeneous environments as a constrained joint optimization problem and introduces a novel scheduling algorithm that (1) decomposes the complex search space with a multi-level search framework; and (2) allocates the search budget via successive halving. Our extensive evaluation, consuming 20,000 GPU-hours, shows that HetRL delivers up to 9.17x the throughput of state-of-the-art systems, and 3.17x on average, under various workloads and settings.
中文摘要 随着大型语言模型（LLM）的持续扩展和新GPU发布频率的提升，在异构环境中进行LLM后训练的需求日益增长，以充分利用跨地区未被充分利用的中端或上一代GPU，并缓解单一地区内同构高端GPU的短缺。然而，在此类计算资源上实现LLM的高性能强化学习（RL）训练仍然具有挑战性，因为工作流程涉及多个模型和任务，具有复杂的计算和数据依赖关系。本文介绍了HetRL，一种分布式系统，用于在具有异构GPU和网络的基础设施中进行高效的强化学习训练。HetRL将异构环境中的强化学习训练调度建模为带约束的联合优化问题，并引入了一种新颖的调度算法，该算法（1）通过多层次搜索框架分解复杂的搜索空间；以及（2）通过连续减半（successive halving）分配搜索预算。我们耗费2万GPU小时的广泛评估显示，在各种工作负载和设置下，HetRL的吞吐量最高可达最先进系统的9.17倍，平均为3.17倍。
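The HetRL abstract says the search budget is allocated via successive halving but gives no further detail. The sketch below shows only the generic successive-halving idea: score every candidate cheaply, keep the best fraction, and re-score the survivors with a larger budget. The names `successive_halving` and `toy_score`, the integer plan IDs, and the deterministic pseudo-noise are all hypothetical illustrations, not the paper's scheduler.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def successive_halving(
    candidates: Iterable[T],
    evaluate: Callable[[T, int], float],
    min_budget: int = 1,
    eta: int = 2,
) -> T:
    """Return the surviving candidate after repeated keep-the-top-1/eta rounds.

    `evaluate(candidate, budget)` returns a score (higher is better); a larger
    budget is assumed to give a more reliable estimate.
    """
    if eta < 2:
        raise ValueError("eta must be at least 2 so the pool shrinks each round")
    survivors = list(candidates)
    if not survivors:
        raise ValueError("need at least one candidate")
    budget = min_budget
    while len(survivors) > 1:
        # Rank all survivors using the current (cheap) budget.
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        # Keep the top 1/eta fraction, then spend more budget on each survivor.
        survivors = ranked[: max(1, len(ranked) // eta)]
        budget *= eta
    return survivors[0]


def toy_score(plan_id: int, budget: int) -> float:
    """Hypothetical stand-in for a measured throughput of a scheduling plan."""
    true_score = -float((plan_id - 7) ** 2)           # plan 7 is the best plan
    pseudo_noise = ((plan_id * 37) % 5 - 2) / budget  # error shrinks with budget
    return true_score + pseudo_noise


if __name__ == "__main__":
    print(successive_halving(range(16), toy_score))
```

Cheap early rounds discard clearly poor candidates, so most of the evaluation budget goes to the few plans that remain competitive.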

超越最终答案：提升视觉语言模型中的视觉提取和逻辑一致性

Authors: Hoang Anh Just, Yifei Fan, Handong Zhao, Jiuxiang Gu, Ruiyi Zhang, Simon Jenni, Kushal Kafle, Ruoxi Jia, Jing Shi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.12487
Pdf link: https://arxiv.org/pdf/2512.12487
Abstract Reinforcement learning from verifiable rewards (RLVR) has recently been extended from text-only LLMs to vision-language models (VLMs) to elicit long-chain multimodal reasoning. However, RLVR-trained VLMs still exhibit two persistent failure modes: inaccurate visual extraction (missing or hallucinating details) and logically inconsistent chains-of-thought, largely because verifiable signals supervise only the final answer. We propose PeRL-VL (Perception and Reasoning Learning for Vision-Language Models), a decoupled framework that separately improves visual perception and textual reasoning on top of RLVR. For perception, PeRL-VL introduces a VLM-based description reward that scores the model's self-generated image descriptions for faithfulness and sufficiency. For reasoning, PeRL-VL adds a text-only Reasoning SFT stage on logic-rich chain-of-thought data, enhancing coherence and logical consistency independently of vision. Across diverse multimodal benchmarks, PeRL-VL improves average Pass@1 accuracy from 63.3% (base Qwen2.5-VL-7B) to 68.8%, outperforming standard RLVR, text-only reasoning SFT, and naive multimodal distillation from GPT-4o.
中文摘要 基于可验证奖励的强化学习（RLVR）最近已从纯文本LLM扩展到视觉语言模型（VLM），以激发长链多模态推理。然而，RLVR训练的VLM仍表现出两种持续存在的失败模式：不准确的视觉提取（遗漏细节或产生幻觉细节）和逻辑不一致的思维链，这主要是因为可验证信号仅监督最终答案。我们提出了PeRL-VL（视觉语言模型的感知与推理学习），这是一个解耦框架，在RLVR基础上分别提升视觉感知和文本推理能力。在感知方面，PeRL-VL引入基于VLM的描述奖励，对模型自生成的图像描述的忠实性和充分性进行评分。在推理方面，PeRL-VL在逻辑丰富的思维链数据上增加了纯文本推理SFT阶段，独立于视觉提升连贯性和逻辑一致性。在多种多模态基准测试中，PeRL-VL将平均Pass@1准确率从63.3%（基础模型Qwen2.5-VL-7B）提升至68.8%，优于标准RLVR、纯文本推理SFT以及来自GPT-4o的朴素多模态蒸馏。

Adaptive Detector-Verifier Framework for Zero-Shot Polyp Detection in Open-World Settings

开放世界环境中零样本息肉检测的自适应检测器-验证器框架

Authors: Shengkai Xu, Hsiang Lun Kao, Tianxiang Xu, Honghui Zhang, Junqiao Wang, Runmeng Ding, Guanyu Liu, Tianyu Shi, Zhenyu Yu, Guofeng Pan, Ziqian Bi, Yuqi Ouyang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.12492
Pdf link: https://arxiv.org/pdf/2512.12492
Abstract Polyp detectors trained on clean datasets often underperform in real-world endoscopy, where illumination changes, motion blur, and occlusions degrade image quality. Existing approaches struggle with the domain gap between controlled laboratory conditions and clinical practice, where adverse imaging conditions are prevalent. In this work, we propose AdaptiveDetector, a novel two-stage detector-verifier framework comprising a YOLOv11 detector with a vision-language model (VLM) verifier. The detector adaptively adjusts per-frame confidence thresholds under VLM guidance, while the verifier is fine-tuned with Group Relative Policy Optimization (GRPO) using an asymmetric, cost-sensitive reward function specifically designed to discourage missed detections -- a critical clinical requirement. To enable realistic assessment under challenging conditions, we construct a comprehensive synthetic testbed by systematically degrading clean datasets with adverse conditions commonly encountered in clinical practice, providing a rigorous benchmark for zero-shot evaluation. Extensive zero-shot evaluation on synthetically degraded CVC-ClinicDB and Kvasir-SEG images demonstrates that our approach improves recall by 14 to 22 percentage points over YOLO alone, while precision remains within 0.7 points below to 1.7 points above the baseline. This combination of adaptive thresholding and cost-sensitive reinforcement learning achieves clinically aligned, open-world polyp detection with substantially fewer false negatives, thereby reducing the risk of missed precancerous polyps and improving patient outcomes.
中文摘要 在干净数据集上训练的息肉检测器在真实的内镜检查中常常表现不佳，因为光照变化、运动模糊和遮挡会降低图像质量。现有方法难以应对受控实验室条件与临床实践之间的领域差距，而在临床实践中不良成像条件十分普遍。在本研究中，我们提出了AdaptiveDetector，一种新型的两阶段检测器-验证器框架，由YOLOv11检测器和视觉语言模型（VLM）验证器组成。检测器在VLM指导下自适应调整每帧的置信度阈值，而验证器则通过组相对策略优化（GRPO）进行微调，采用专门设计的非对称、代价敏感的奖励函数来抑制漏检，这是一项关键的临床需求。为了在具有挑战性的条件下实现贴近现实的评估，我们用临床实践中常见的不良条件对干净数据集进行系统性退化，构建了一个全面的合成测试平台，为零样本评估提供了严格的基准。在合成退化的CVC-ClinicDB和Kvasir-SEG图像上进行的广泛零样本评估表明，我们的方法比单独使用YOLO将召回率提高了14到22个百分点，而精确率保持在比基线低0.7个百分点到高1.7个百分点的范围内。这种自适应阈值与代价敏感强化学习的结合，实现了符合临床需求的开放世界息肉检测，假阴性大幅减少，从而降低漏诊癌前息肉的风险并改善患者预后。
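The abstract describes an asymmetric, cost-sensitive reward that discourages missed detections, without giving its form. As a minimal sketch of what such a reward could look like, the function below penalizes a false negative more heavily than a false positive. The function name, the four-way case split, and every numeric weight are assumptions chosen for illustration; the paper's actual reward may differ.

```python
def asymmetric_reward(
    is_polyp: bool,
    verifier_accepts: bool,
    correct_reward: float = 1.0,       # hypothetical value
    miss_penalty: float = 5.0,         # hypothetical: cost of a missed polyp
    false_alarm_penalty: float = 1.0,  # hypothetical: cost of a false alarm
) -> float:
    """Toy cost-sensitive reward for a verifier's accept/reject decision."""
    if is_polyp == verifier_accepts:
        # True positive or true negative: the decision matches the ground truth.
        return correct_reward
    if is_polyp:
        # False negative: a real polyp was rejected, the costliest mistake.
        return -miss_penalty
    # False positive: a non-polyp region was accepted.
    return -false_alarm_penalty


if __name__ == "__main__":
    for truth in (True, False):
        for decision in (True, False):
            print(truth, decision, asymmetric_reward(truth, decision))
```

Setting `miss_penalty` above `false_alarm_penalty` pushes a policy trained on this signal toward higher recall, at some expected cost in precision.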

World Models Unlock Optimal Foraging Strategies in Reinforcement Learning Agents

世界模型解锁强化学习智能体中的最优觅食策略

Authors: Yesid Fonseca, Manuel S. Ríos, Nicanor Quijano, Luis F. Giraldo
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.12548
Pdf link: https://arxiv.org/pdf/2512.12548
Abstract Patch foraging involves the deliberate and planned process of determining the optimal time to depart from a resource-rich region and investigate potentially more beneficial alternatives. The Marginal Value Theorem (MVT) is frequently used to characterize this process, offering an optimality model for such foraging behaviors. Although this model has been widely used to make predictions in behavioral ecology, discovering the computational mechanisms that facilitate the emergence of optimal patch-foraging decisions in biological foragers remains under investigation. Here, we show that artificial foragers equipped with learned world models naturally converge to MVT-aligned strategies. Using a model-based reinforcement learning agent that acquires a parsimonious predictive representation of its environment, we demonstrate that anticipatory capabilities, rather than reward maximization alone, drive efficient patch-leaving behavior. Compared with standard model-free RL agents, these model-based agents exhibit decision patterns similar to many of their biological counterparts, suggesting that predictive world models can serve as a foundation for more explainable and biologically grounded decision-making in AI systems. Overall, our findings highlight the value of ecological optimality principles for advancing interpretable and adaptive AI.
中文摘要 斑块觅食是指经过深思熟虑和规划，确定离开资源丰富区域的最佳时机，并去探索可能更有利的替代区域的过程。边际价值定理（MVT）常被用来刻画这一过程，为此类觅食行为提供了最优性模型。尽管该模型已被广泛用于行为生态学中的预测，但促使生物觅食者做出最优斑块觅食决策的计算机制仍有待探明。在这里，我们展示了配备习得世界模型的人工觅食者会自然收敛到与MVT一致的策略。我们使用一种基于模型的强化学习智能体，它能够习得环境的简约预测表示，并由此证明驱动高效离开斑块行为的是预见能力，而不仅仅是奖励最大化。与标准的无模型强化学习智能体相比，这些基于模型的智能体表现出与许多生物觅食者相似的决策模式，表明预测性世界模型可以作为人工智能系统中更具可解释性、更有生物学依据的决策的基础。总体而言，我们的发现凸显了生态最优性原理在推进可解释、自适应人工智能方面的价值。
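For readers unfamiliar with the Marginal Value Theorem, a small worked example may help. In its textbook form, the MVT says a forager should leave a patch when the instantaneous gain rate g'(t) falls to the long-run average rate g(t) / (t + T), where T is the travel time between patches. The sketch below assumes a saturating gain curve g(t) = A(1 - exp(-t / tau)); this curve and the parameters `A`, `tau`, and `travel` are illustrative assumptions, not the environment used in the paper.

```python
import math


def gain(t: float, A: float, tau: float) -> float:
    """Cumulative reward collected after t time units in a patch (assumed form)."""
    return A * (1.0 - math.exp(-t / tau))


def gain_rate(t: float, A: float, tau: float) -> float:
    """Instantaneous intake rate g'(t) for the gain curve above."""
    return (A / tau) * math.exp(-t / tau)


def mvt_leave_time(A: float, tau: float, travel: float, tol: float = 1e-10) -> float:
    """Residence time t* solving g'(t*) = g(t*) / (t* + travel), by bisection."""
    if A <= 0 or tau <= 0 or travel < 0:
        raise ValueError("A and tau must be positive and travel non-negative")

    def f(t: float) -> float:
        # f is strictly decreasing in t because g is concave; its root is t*.
        return gain_rate(t, A, tau) * (t + travel) - gain(t, A, tau)

    lo, hi = 0.0, tau
    while f(hi) > 0:  # grow the bracket until it contains the root
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    print(mvt_leave_time(A=10.0, tau=2.0, travel=3.0))
```

Under this assumed gain curve, a longer travel time yields a longer optimal stay, which is the classic qualitative prediction of the MVT.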

Coupled Variational Reinforcement Learning for Language Model General Reasoning

语言模型通用推理中的耦合变分强化学习

Authors: Xueru Wen, Jie Lou, Yanjiang Liu, Hongyu Lin, Ben He, Xianpei Han, Le Sun, Yaojie Lu, Debing Zhang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.12576
Pdf link: https://arxiv.org/pdf/2512.12576
Abstract While reinforcement learning has achieved impressive progress in language model reasoning, it is constrained by the requirement for verifiable rewards. Recent verifier-free RL methods address this limitation by utilizing the intrinsic probabilities of LLMs generating reference answers as reward signals. However, these approaches typically sample reasoning traces conditioned only on the question. This design decouples reasoning-trace sampling from answer information, leading to inefficient exploration and incoherence between traces and final answers. In this paper, we propose Coupled Variational Reinforcement Learning (CoVRL), which bridges variational inference and reinforcement learning by coupling prior and posterior distributions through a hybrid sampling strategy. By constructing and optimizing a composite distribution that integrates these two distributions, CoVRL enables efficient exploration while preserving strong thought-answer coherence. Extensive experiments on mathematical and general reasoning benchmarks show that CoVRL improves performance by 12.4% over the base model and achieves an additional 2.3% improvement over strong state-of-the-art verifier-free RL baselines, providing a principled framework for enhancing the general reasoning capabilities of language models.
中文摘要 虽然强化学习在语言模型推理方面取得了显著进展，但它受限于对可验证奖励的要求。近期无验证器的强化学习方法利用LLM生成参考答案的内在概率作为奖励信号，以解决这一局限。然而，这些方法通常仅以问题为条件采样推理轨迹。这种设计使推理轨迹的采样与答案信息脱钩，导致探索效率低下，且轨迹与最终答案之间缺乏连贯性。本文提出耦合变分强化学习（Coupled Variational Reinforcement Learning，CoVRL），它通过混合采样策略将先验分布与后验分布耦合起来，从而连接变分推断与强化学习。通过构建并优化整合这两种分布的复合分布，CoVRL在保持较强的思维-答案一致性的同时实现了高效探索。在数学和通用推理基准上的大量实验表明，CoVRL相比基础模型性能提升了12.4%，并比强大的最先进无验证器强化学习基线额外提升了2.3%，为增强语言模型的通用推理能力提供了有原则的框架。

CogDoc: Towards Unified thinking in Documents

CogDoc：迈向文档中的统一思维

Authors: Qixin Xu, Haozhe Wang, Che Liu, Fangzhen Lin, Wenhu Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.12658
Pdf link: https://arxiv.org/pdf/2512.12658
Abstract Current document reasoning paradigms are constrained by a fundamental trade-off between scalability (processing long-context documents) and fidelity (capturing fine-grained, multimodal details). To bridge this gap, we propose CogDoc, a unified coarse-to-fine thinking framework that mimics human cognitive processes: a low-resolution "Fast Reading" phase for scalable information localization, followed by a high-resolution "Focused Thinking" phase for deep reasoning. We conduct a rigorous investigation into post-training strategies for the unified thinking framework, demonstrating that a Direct Reinforcement Learning (RL) approach outperforms RL with Supervised Fine-Tuning (SFT) initialization. Specifically, we find that direct RL avoids the "policy conflict" observed in SFT. Empirically, our 7B model achieves state-of-the-art performance within its parameter class, notably surpassing significantly larger proprietary models (e.g., GPT-4o) on challenging, visually rich document benchmarks.
中文摘要 当前的文档推理范式受限于可扩展性（处理长上下文文档）和保真度（捕捉细粒度、多模态细节）之间的根本权衡。为弥合这一差距，我们提出了CogDoc，一个统一的由粗到细的思维框架，模拟人类认知过程：先是低分辨率的“快速阅读”阶段，用于可扩展的信息定位，随后是高分辨率的“专注思考”阶段，用于深度推理。我们对该统一思维框架的后训练策略进行了严格研究，证明直接强化学习（RL）方法优于以监督微调（SFT）初始化的强化学习。具体来说，我们发现直接强化学习避免了SFT中观察到的“策略冲突”。实验上，我们的7B模型在其参数规模级别内实现了最先进的性能，尤其是在具有挑战性且视觉丰富的文档基准上超过了规模大得多的专有模型（如GPT-4o）。

Reassessing the Role of Supervised Fine-Tuning: An Empirical Study in VLM Reasoning

重新评估监督式微调的作用：VLM推理中的实证研究

Authors: Yongcan Yu, Lingxiao He, Shuo Lu, Lijun Sheng, Yinuo Xu, Yanbo Wang, Kuangpu Guo, Jianjie Cheng, Meng Wang, Qianlong Xie, Xingxing Wang, Dapeng Hu, Jian Liang
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.12690
Pdf link: https://arxiv.org/pdf/2512.12690
Abstract Recent advances in vision-language model (VLM) reasoning have been largely attributed to the rise of reinforcement learning (RL), which has shifted the community's focus away from the supervised fine-tuning (SFT) paradigm. Many studies suggest that introducing the SFT stage not only fails to improve reasoning ability but may also negatively impact model training. In this study, we revisit this RL-centric belief through a systematic and controlled comparison of SFT and RL on VLM reasoning. Using identical data sources, we find that the relative effectiveness of SFT and RL is conditional and strongly influenced by model capacity, data scale, and data distribution. Contrary to common assumptions, our findings show that SFT plays a crucial role across several scenarios: (1) Effectiveness for weaker models. SFT more reliably elicits reasoning capabilities in smaller or weaker VLMs. (2) Data efficiency. SFT with only 2K achieves comparable or better reasoning performance to RL with 20K. (3) Cross-modal transferability. SFT demonstrates stronger generalization across modalities. Moreover, we identify a pervasive issue of deceptive rewards, where higher rewards fail to correlate with better reasoning accuracy in RL. These results challenge the prevailing "RL over SFT" narrative. They highlight that the role of SFT may have been underestimated and support a more balanced post-training pipeline in which SFT and RL function as complementary components.
中文摘要 视觉语言模型（VLM）推理的最新进展主要归功于强化学习（RL）的兴起，这使社区的关注点从监督微调（SFT）范式上转移开来。许多研究表明，引入SFT阶段不仅无法提升推理能力，还可能对模型训练产生负面影响。本研究通过在VLM推理上对SFT与RL进行系统且受控的比较，重新审视这一以强化学习为中心的观点。使用相同的数据源，我们发现SFT和RL的相对有效性是有条件的，并且受到模型容量、数据规模和数据分布的强烈影响。与常见假设相反，我们的发现表明SFT在多个场景中起着关键作用：（1）对较弱模型的有效性。SFT能更可靠地在较小或较弱的VLM中激发推理能力。（2）数据效率。仅用2K的SFT就能实现与使用20K的RL相当甚至更好的推理性能。（3）跨模态可迁移性。SFT在不同模态间展现出更强的泛化能力。此外，我们还发现了一个普遍存在的欺骗性奖励问题，即在RL中更高的奖励并不对应更高的推理准确率。这些结果挑战了当前流行的“RL优于SFT”的说法。它们表明SFT的作用可能被低估，并支持一种更均衡的后训练流程，其中SFT与RL作为互补的组成部分发挥作用。

Synergizing Code Coverage and Gameplay Intent: Coverage-Aware Game Playtesting with LLM-Guided Reinforcement Learning

协同代码覆盖率与游戏意图：覆盖感知游戏测试与大语言模型引导强化学习

Authors: Enhong Mu, Minami Yoda, Yan Zhang, Mingyue Zhang, Yutaka Matsuno, Jialong Li
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2512.12706
Pdf link: https://arxiv.org/pdf/2512.12706
Abstract The widespread adoption of the "Games as a Service" model necessitates frequent content updates, placing immense pressure on quality assurance. In response, automated game testing has been viewed as a promising solution to cope with this demanding release cadence. However, existing automated testing approaches typically create a dichotomy: code-centric methods focus on structural coverage without understanding gameplay context, while player-centric agents validate high-level intent but often fail to cover specific underlying code changes. To bridge this gap, we propose SMART (Structural Mapping for Augmented Reinforcement Testing), a novel framework that synergizes structural verification and functional validation for game update testing. SMART leverages large language models (LLMs) to interpret abstract syntax tree (AST) differences and extract functional intent, constructing a context-aware hybrid reward mechanism. This mechanism guides reinforcement learning agents to sequentially fulfill gameplay goals while adaptively exploring modified code branches. We evaluate SMART on two environments, Overcooked and Minecraft. The results demonstrate that SMART significantly outperforms state-of-the-art baselines; it achieves over 94% branch coverage of modified code, nearly double that of traditional reinforcement learning methods, while maintaining a 98% task completion rate, effectively balancing structural comprehensiveness with functional correctness.
中文摘要 “游戏即服务”模式的广泛采用需要频繁的内容更新，给质量保证带来了巨大压力。为此，自动化游戏测试被视为应对这种高强度发布节奏的有前景方案。然而，现有的自动化测试方法通常形成了二分法：以代码为中心的方法只关注结构覆盖，却不理解游戏背景，而以玩家为中心的代理验证高层次意图，却常常未能覆盖具体的底层代码变更。为弥合这一差距，我们提出了SMART（增强强化测试结构映射），这是一个结合结构验证与功能验证的新型框架，用于游戏更新测试。SMART利用大型语言模型（LLMs）来解释抽象语法树（AST）差异并提取功能意图，构建了一个上下文感知的混合奖励机制。该机制引导强化学习代理顺序完成游戏目标，同时自适应探索修改后的代码分支。我们在两个环境上评估SMART，分别是《Overcooked》和《Minecraft》。结果表明，SMART的表现显著优于最先进的基线;它实现了修改代码分支覆盖率超过94%，几乎是传统强化学习方法的两倍，同时保持98%的任务完成率，有效平衡了结构的全面性和功能正确性。

Self-Motivated Growing Neural Network for Adaptive Architecture via Local Structural Plasticity

通过局部结构可塑性的自驱成长神经网络实现适应性架构

Authors: Yiyang Jia, Chengxu Zhou
Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.12713
Pdf link: https://arxiv.org/pdf/2512.12713
Abstract Control policies in deep reinforcement learning are often implemented with fixed-capacity multilayer perceptrons trained by backpropagation, which lack structural plasticity and depend on global error signals. This paper introduces the Self-Motivated Growing Neural Network (SMGrNN), a controller whose topology evolves online through a local Structural Plasticity Module (SPM). The SPM monitors neuron activations and edge-wise weight update statistics over short temporal windows and uses these signals to trigger neuron insertion and pruning, while synaptic weights are updated by a standard gradient-based optimizer. This allows network capacity to be regulated during learning without manual architectural tuning. SMGrNN is evaluated on control benchmarks via policy distillation. Compared with multilayer perceptron baselines, it achieves similar or higher returns, lower variance, and task-appropriate network sizes. Ablation studies with growth disabled and growth-only variants isolate the role of structural plasticity, showing that adaptive topology improves reward stability. The local and modular design of SPM enables future integration of a Hebbian plasticity module and spike-timing-dependent plasticity, so that SMGrNN can support both artificial and spiking neural implementations driven by local rules.
中文摘要 深度强化学习中的控制策略通常由通过反向传播训练的固定容量多层感知器实现，这类网络缺乏结构可塑性，且依赖全局误差信号。本文介绍了自驱动生长神经网络（SMGrNN），这是一种通过局部结构可塑性模块（SPM）在线演化其拓扑结构的控制器。SPM在短时间窗口内监测神经元激活和逐边的权重更新统计量，并利用这些信号触发神经元的插入和剪枝，而突触权重则由标准的基于梯度的优化器更新。这使得网络容量可以在学习过程中得到调节，而无需手动调整架构。SMGrNN通过策略蒸馏在控制基准上进行评估。与多层感知器基线相比，它实现了相近或更高的回报、更低的方差以及与任务相适应的网络规模。使用禁用生长和仅生长两种变体的消融研究分离出了结构可塑性的作用，表明自适应拓扑能够提高奖励的稳定性。SPM的局部化、模块化设计使未来可以集成Hebbian可塑性模块和脉冲时序依赖可塑性，从而使SMGrNN能够支持由局部规则驱动的人工神经网络和脉冲神经网络实现。

CoDA: A Context-Decoupled Hierarchical Agent with Reinforcement Learning

CoDA：带有强化学习的上下文解耦层级代理

Authors: Xuanzhang Liu, Jianglun Feng, Zhuoran Zhuang, Junzhe Zhao, Maofei Que, Jieting Li, Dianlei Wang, Hao Tong, Ye Chen, Pan Li
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.12716
Pdf link: https://arxiv.org/pdf/2512.12716
Abstract Large Language Model (LLM) agents trained with reinforcement learning (RL) show great promise for solving complex, multi-step tasks. However, their performance is often crippled by "Context Explosion", where the accumulation of long text outputs overwhelms the model's context window and leads to reasoning failures. To address this, we introduce CoDA, a Context-Decoupled hierarchical Agent, a simple but effective reinforcement learning framework that decouples high-level planning from low-level execution. It employs a single, shared LLM backbone that learns to operate in two distinct, contextually isolated roles: a high-level Planner that decomposes tasks within a concise strategic context, and a low-level Executor that handles tool interactions in an ephemeral, isolated workspace. We train this unified agent end-to-end using PECO (Planner-Executor Co-Optimization), a reinforcement learning methodology that applies a trajectory-level reward to jointly optimize both roles, fostering seamless collaboration through context-dependent policy updates. Extensive experiments demonstrate that CoDA achieves significant performance improvements over state-of-the-art baselines on complex multi-hop question-answering benchmarks, and it exhibits strong robustness in long-context scenarios, maintaining stable performance while all other baselines suffer severe degradation, thus further validating the effectiveness of our hierarchical design in mitigating context overload.
中文摘要 经过强化学习（RL）训练的大型语言模型（LLM）智能体在解决复杂多步骤任务方面展现出巨大潜力。然而，它们的性能常常被“上下文爆炸”所拖累，即长文本输出的积累导致模型的上下文窗口不堪重负，导致推理失败。为此，我们引入了CoDA，一种上下文解耦的层级代理，这是一种简单但有效的强化学习框架，将高层规划与低层执行解耦。它采用单一的共享大型语言模型骨干网，学习在两个不同且上下文隔离的角色中运作：高级规划器在简洁的战略语境中拆解任务，以及低级别执行器，负责在短暂孤立的工作空间中处理工具交互。我们通过PECO（规划者-执行者协同优化）进行端到端训练这一统一代理，这是一种强化学习方法，通过轨迹级奖励共同优化两个角色，通过上下文相关的策略更新促进无缝协作。大量实验表明，CoDA在复杂的多跳问答基准测试中相较于最先进基线实现了显著的性能提升，并且在长上下文场景中表现出强的鲁棒性，在其他基线严重退化时保持稳定性能，进一步验证了我们分层设计在缓解上下文过载方面的有效性。

Distributed Reinforcement Learning using Local Smart Meter Data for Voltage Regulation in Distribution Networks

利用本地智能电表数据进行配电网络电压调节的分布式强化学习

Authors: Dong Liu, Juan S. Giraldo, Peter Palensky, Pedro P. Vergara
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2512.12803
Pdf link: https://arxiv.org/pdf/2512.12803
Abstract Centralised reinforcement learning (RL) for voltage magnitude regulation in distribution networks typically involves numerous agent-environment interactions and power flow (PF) calculations, inducing computational overhead and privacy concerns over shared data. Thus, we propose a distributed RL algorithm to regulate voltage magnitude. First, a dynamic Thevenin equivalent model is integrated within smart meters (SM), enabling local voltage magnitude estimation using local SM data for RL agent training, and mitigating the dependency of synchronised data collection and centralised PF calculations. To mitigate estimation errors induced by Thevenin model inaccuracies, a voltage magnitude correction strategy that combines piecewise functions and neural networks is introduced. The piecewise function corrects the large errors of estimated voltage magnitude, while a neural network mimics the grid's sensitivity to control actions, improving action adjustment precision. Second, a coordination strategy is proposed to refine local RL agent actions online, preventing voltage magnitude violations induced by excessive actions from multiple independently trained agents. Case studies on energy storage systems validate the feasibility and effectiveness of the proposed approach, demonstrating its potential to improve voltage regulation in distribution networks.
中文摘要 配电网络中用于电压幅值调节的集中式强化学习（RL）通常涉及大量智能体-环境交互和潮流（PF）计算，带来计算开销以及对共享数据的隐私担忧。因此，我们提出了一种分布式强化学习算法来调节电压幅值。首先，在智能电表（SM）中集成动态戴维南（Thevenin）等效模型，使其能够利用本地SM数据进行本地电压幅值估计以训练RL智能体，并减轻对同步数据采集和集中式潮流计算的依赖。为减小戴维南模型不准确引起的估计误差，引入了一种结合分段函数和神经网络的电压幅值修正策略。分段函数修正电压幅值估计中的较大误差，而神经网络模拟电网对控制动作的灵敏度，提高动作调整的精度。其次，提出一种协调策略来在线修正本地RL智能体的动作，防止多个独立训练的智能体因动作过度而引发电压幅值越限。关于储能系统的案例研究验证了所提方法的可行性和有效性，展示了其改善配电网络电压调节的潜力。

Information-Consistent Language Model Recommendations through Group Relative Policy Optimization

通过群组相对策略优化实现信息一致的语言模型推荐

Authors: Sonal Prabhune, Balaji Padmanabhan, Kaushik Dutta
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.12858
Pdf link: https://arxiv.org/pdf/2512.12858
Abstract Large Language Models (LLMs) are increasingly deployed in business-critical domains such as finance, education, healthcare, and customer support, where users expect consistent and reliable recommendations. Yet LLMs often exhibit variability when prompts are phrased with minor differences, even when semantically equivalent. Such inconsistency undermines trust, complicates compliance, and disrupts user experience. While personalization is desirable in certain contexts, many enterprise scenarios -- such as HR onboarding, customer support, or policy disclosure -- require invariant information delivery regardless of phrasing or prior conversational history. Existing approaches, including retrieval-augmented generation (RAG) and temperature tuning, improve factuality or reduce stochasticity but cannot guarantee stability across equivalent prompts. In this paper, we propose a reinforcement learning framework based on Group Relative Policy Optimization (GRPO) to directly optimize for consistency. Unlike prior applications of GRPO, which have been limited to reasoning and code generation, we adapt GRPO to enforce stability of information content across groups of semantically equivalent prompts. We introduce entropy-based helpfulness and stability rewards, treating prompt variants as groups and resetting conversational context to isolate phrasing effects. Experiments on investment and job recommendation tasks show that our GRPO-trained model reduces variability more effectively than fine-tuning or decoding-based baselines. To our knowledge, this is a novel application of GRPO for aligning LLMs toward information consistency, reframing variability not as an acceptable feature of generative diversity but as a correctable flaw in enterprise deployments.
中文摘要 大型语言模型（LLMs）越来越多地应用于金融、教育、医疗和客户支持等关键业务领域，用户期望获得一致且可靠的推荐。然而，当提示词表达时有细微差别，即使语义等价，LLMs常表现出变异性。这种不一致削弱了信任，使合规变得复杂，并扰乱了用户体验。虽然个性化在某些情境下是可取的，但许多企业场景——如人力资源入职、客户支持或政策披露——要求无论措辞如何或之前的对话历史，都必须保持信息传递不变。现有方法，包括检索增强生成（RAG）和温度调优，可以提升事实性或降低随机性，但无法保证在等效提示之间的稳定性。本文提出基于群体相对策略优化（GRPO）的强化学习框架，直接优化一致性。与以往仅限于推理和代码生成的GRPO应用不同，我们调整GRPO以强制信息内容在语义等价提示组间的稳定性。我们引入基于熵的助益性和稳定性奖励，将提示变体视为组，并重置会话语境以分离措辞效果。投资和职位推荐任务的实验表明，我们GRPO训练的模型比基于微调或解码的基线更有效地减少了变异性。据我们所知，这是GRPO在对齐LLM以实现信息一致性的新应用，将变异性重新定义为企业部署中可纠正的缺陷，而非生成多样性的可接受特征。
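GRPO scores each response relative to the other responses in its group rather than against a learned value function. The sketch below shows only that standard group-relative normalization step: subtract the group mean reward, then divide by the group standard deviation. Treating the responses to paraphrases of one question as a single group follows the abstract's description. The function name, the use of the population standard deviation, the `eps` term, and the example reward values are assumptions, and the paper's entropy-based rewards are not reproduced here.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    # Population standard deviation; some implementations use the sample version.
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for answers to four paraphrases of the same question.
    print(group_relative_advantages([1.0, 0.0, 0.5, 0.5]))
```

When every response in a group earns the same reward, all advantages are zero, so that group contributes no policy-gradient signal.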

LLM-based Personalized Portfolio Recommender: Integrating Large Language Models and Reinforcement Learning for Intelligent Investment Strategy Optimization

基于LLM的个性化投资组合推荐工具：整合大型语言模型与强化学习以实现智能投资策略优化

Authors: Bangyu Li, Boping Gu, Ziyang Ding
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.12922
Pdf link: https://arxiv.org/pdf/2512.12922
Abstract In modern financial markets, investors increasingly seek personalized and adaptive portfolio strategies that reflect their individual risk preferences and respond to dynamic market conditions. Traditional rule-based or static optimization approaches often fail to capture the nonlinear interactions among investor behavior, market volatility, and evolving financial objectives. To address these limitations, this paper introduces the LLM-based Personalized Portfolio Recommender, an integrated framework that combines Large Language Models, reinforcement learning, and individualized risk preference modeling to support intelligent investment decision-making.
中文摘要 在现代金融市场中，投资者越来越寻求个性化且灵活的投资组合策略，以反映个人风险偏好并响应动态市场状况。传统的基于规则或静态优化方法往往无法捕捉投资者行为、市场波动性和不断变化的财务目标之间的非线性相互作用。为解决这些局限性，本文介绍了基于LLM的个性化投资组合推荐器（Personalized Portfolio Recommender），这是一个集成框架，结合了大型语言模型、强化学习和个性化风险偏好建模，支持智能投资决策。

QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management

QwenLong-L1.5：面向长上下文推理与记忆管理的后训练方案

Authors: Weizhou Shen, Ziyi Yang, Chenliang Li, Zhiyuan Lu, Miao Peng, Huashan Sun, Yingcheng Shi, Shengyi Liao, Shaopeng Lai, Bo Zhang, Dayiheng Liu, Fei Huang, Jingren Zhou, Ming Yan
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.12967
Pdf link: https://arxiv.org/pdf/2512.12967
Abstract We introduce QwenLong-L1.5, a model that achieves superior long-context reasoning capabilities through systematic post-training innovations. The key technical breakthroughs of QwenLong-L1.5 are as follows: (1) Long-Context Data Synthesis Pipeline: We develop a systematic synthesis framework that generates challenging reasoning tasks requiring multi-hop grounding over globally distributed evidence. By deconstructing documents into atomic facts and their underlying relationships, and then programmatically composing verifiable reasoning questions, our approach creates high-quality training data at scale, moving substantially beyond simple retrieval tasks to enable genuine long-range reasoning capabilities. (2) Stabilized Reinforcement Learning for Long-Context Training: To overcome the critical instability in long-context RL, we introduce task-balanced sampling with task-specific advantage estimation to mitigate reward bias, and propose Adaptive Entropy-Controlled Policy Optimization (AEPO) that dynamically regulates exploration-exploitation trade-offs. (3) Memory-Augmented Architecture for Ultra-Long Contexts: Recognizing that even extended context windows cannot accommodate arbitrarily long sequences, we develop a memory management framework with multi-stage fusion RL training that seamlessly integrates single-pass reasoning with iterative memory-based processing for tasks exceeding 4M tokens. Based on Qwen3-30B-A3B-Thinking, QwenLong-L1.5 achieves performance comparable to GPT-5 and Gemini-2.5-Pro on long-context reasoning benchmarks, surpassing its baseline by 9.90 points on average. On ultra-long tasks (1M~4M tokens), QwenLong-L1.5's memory-agent framework yields a 9.48-point gain over the agent baseline. Additionally, the acquired long-context reasoning ability translates to enhanced performance in general domains like scientific reasoning, memory tool using, and extended dialogue.
中文摘要 我们介绍QwenLong-L1.5模型，该模型通过系统化的后训练创新实现了更优越的长上下文推理能力。QwenLong-L1.5的关键技术突破如下：（1）长上下文数据合成流程：我们开发了一个系统化的合成框架，生成需要在全局分布的证据上进行多跳定位的高难度推理任务。通过将文档拆解为原子事实及其底层关系，然后以程序化方式组合出可验证的推理问题，我们的方法能够大规模创建高质量的训练数据，远远超越简单的检索任务，实现真正的长程推理能力。（2）面向长上下文训练的稳定强化学习：为克服长上下文强化学习中严重的不稳定性，我们引入了任务均衡采样和任务特定的优势估计以减轻奖励偏差，并提出了动态调节探索与利用权衡的自适应熵控制策略优化（AEPO）。（3）面向超长上下文的记忆增强架构：认识到即使是扩展的上下文窗口也无法容纳任意长的序列，我们开发了一个采用多阶段融合强化学习训练的记忆管理框架，能够将单次推理与基于记忆的迭代处理无缝整合，以应对超过4M token的任务。基于Qwen3-30B-A3B-Thinking，QwenLong-L1.5在长上下文推理基准上的性能可与GPT-5和Gemini-2.5-Pro相当，平均比其基线高出9.90分。在超长任务（1M~4M token）上，QwenLong-L1.5的记忆智能体框架比智能体基线提升9.48分。此外，所获得的长上下文推理能力还能提升科学推理、记忆工具使用和长对话等通用领域的表现。

Tackling Snow-Induced Challenges: Safe Autonomous Lane-Keeping with Robust Reinforcement Learning

应对积雪带来的挑战：基于鲁棒强化学习的安全自动驾驶车道保持

Authors: Amin Jalal Aghdasian, Farzaneh Abdollahi, Ali Kamali Iglie
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.12987
Pdf link: https://arxiv.org/pdf/2512.12987
Abstract This paper proposes two new algorithms for the lane keeping system (LKS) in autonomous vehicles (AVs) operating under snowy road conditions. These algorithms use deep reinforcement learning (DRL) to handle uncertainties and slippage. They include Action-Robust Recurrent Deep Deterministic Policy Gradient (AR-RDPG) and end-to-end Action-Robust convolutional neural network Attention Deterministic Policy Gradient (AR-CADPG), two action-robust approaches for decision-making. In the AR-RDPG method, within the perception layer, camera images are first denoised using multi-scale neural networks. Then, the centerline coefficients are extracted by a pre-trained deep convolutional neural network (DCNN). These coefficients, concatenated with the driving characteristics, are used as input to the control layer. The AR-CADPG method presents an end-to-end approach in which a convolutional neural network (CNN) and an attention mechanism are integrated within a DRL framework. Both methods are first trained in the CARLA simulator and validated under various snowy scenarios. Real-world experiments on a Jetson Nano-based autonomous vehicle confirm the feasibility and stability of the learned policies. Among the two models, the AR-CADPG approach demonstrates superior path-tracking accuracy and robustness, highlighting the effectiveness of combining temporal memory, adversarial resilience, and attention mechanisms in AVs.
中文摘要 本文提出了两种用于在雪地条件下运行的自动驾驶车辆（AV）车道保持系统（LKS）的新算法。这些算法使用深度强化学习（DRL）来处理不确定性和滑移。它们包括动作稳健循环深度确定性策略梯度（AR-RDPG）和端到端动作稳健卷积神经网络注意力确定性策略梯度（AR-CADPG），这是两种动作稳健的决策方法。在AR-RDPG方法中，在感知层内，首先通过多尺度神经网络对相机图像进行去噪处理。然后，通过预训练的深度卷积神经网络（DCNN）提取中心线系数。这些系数与驾驶特性拼接，作为控制层的输入。AR-CADPG方法采用端到端方法，将卷积神经网络（CNN）和注意力机制集成在DRL框架内。这两种方法都先在CARLA模拟器中训练，并在各种雪地场景下进行验证。基于Jetson Nano的自动驾驶车辆的实际实验证实了所学策略的可行性和稳定性。在这两种模型中，AR-CADPG方法展现出更优越的路径追踪准确性和鲁棒性，突出了将时间记忆、对抗韧性和注意力机制结合在自动驾驶车辆中的有效性。

Learning Terrain Aware Bipedal Locomotion via Reduced Dimensional Perceptual Representations

通过简化维度的感知表征学习地形感知双足行走

Authors: Guillermo A. Castillo, Himanshu Lodha, Ayonga Hereid
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.12993
Pdf link: https://arxiv.org/pdf/2512.12993
Abstract This work introduces a hierarchical strategy for terrain-aware bipedal locomotion that integrates reduced-dimensional perceptual representations to enhance reinforcement learning (RL)-based high-level (HL) policies for real-time gait generation. Unlike end-to-end approaches, our framework leverages latent terrain encodings via a Convolutional Variational Autoencoder (CNN-VAE) alongside reduced-order robot dynamics, optimizing the locomotion decision process with a compact state. We systematically analyze the impact of latent space dimensionality on learning efficiency and policy robustness. Additionally, we extend our method to be history-aware, incorporating sequences of recent terrain observations into the latent representation to improve robustness. To address real-world feasibility, we introduce a distillation method to learn the latent representation directly from depth camera images and provide preliminary hardware validation by comparing simulated and real sensor data. We further validate our framework using the high-fidelity Agility Robotics (AR) simulator, incorporating realistic sensor noise, state estimation, and actuator dynamics. The results confirm the robustness and adaptability of our method, underscoring its potential for hardware deployment.
中文摘要 本研究提出了一种基于地形感知的双足行走分层策略，整合了降维感知表征，以增强基于强化学习（RL）的高级（HL）策略，实现实时步态生成。与端到端方法不同，我们的框架利用卷积变分自编码器（CNN-VAE）利用潜地形编码，结合低阶机器人动力学，优化紧凑状态下的运动决策过程。我们系统分析潜空间维度对学习效率和政策稳健性的影响。此外，我们扩展方法以实现历史感知，将近期地形观测序列纳入潜在表示中，以提高鲁棒性。为应对现实可行性，我们引入了一种蒸馏方法，直接从深度相机图像中学习潜在表示，并通过比较模拟与真实传感器数据进行初步硬件验证。我们还利用高精度的敏捷机器人（AR）模拟器验证了我们的框架，结合了真实的传感器噪声、状态估计和执行器动力学。结果证实了我们方法的稳健性和适应性，凸显了其硬件部署的潜力。

What Happens Next? Next Scene Prediction with a Unified Video Model

接下来会发生什么？使用统一视频模型预测下一场景

Authors: Xinjie Li, Zhimin Chen, Rui Zhao, Florian Schiffers, Zhenyu Liao, Vimal Bhat
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.13015
Pdf link: https://arxiv.org/pdf/2512.13015
Abstract Recent unified models for joint understanding and generation have significantly advanced visual generation capabilities. However, their focus on conventional tasks like text-to-video generation has left the temporal reasoning potential of unified models largely underexplored. To address this gap, we introduce Next Scene Prediction (NSP), a new task that pushes unified video models toward temporal and causal reasoning. Unlike text-to-video generation, NSP requires predicting plausible futures from preceding context, demanding deeper understanding and reasoning. To tackle this task, we propose a unified framework combining Qwen-VL for comprehension and LTX for synthesis, bridged by a latent query embedding and a connector module. This model is trained in three stages on our newly curated, large-scale NSP dataset: text-to-video pre-training, supervised fine-tuning, and reinforcement learning (via GRPO) with our proposed causal consistency reward. Experiments demonstrate our model achieves state-of-the-art performance on our benchmark, advancing the capability of generalist multimodal systems to anticipate what happens next.
中文摘要 近期统一的联合理解与生成模型显著提升了视觉生成能力。然而，它们对文本转视频生成等传统任务的关注，使得统一模型的时间推理潜力大多未被充分开发。为弥补这一空白，我们引入了下一场景预测（NSP），这是一项推动统一视频模型朝向时间和因果推理的新任务。与文本转视频生成不同，NSP需要从前置上下文预测合理的未来，这需要更深入的理解和推理。为完成此任务，我们提出了一个统一框架，结合了理解用的Qwen-VL和合成用的LTX，并通过潜在查询嵌入和连接器模块进行桥接。该模型在我们新整理的大规模NSP数据集上分三阶段训练：文本转视频预训练、监督式微调和强化学习（通过GRPO），并采用我们提出的因果一致性奖励。实验表明，我们的模型在基准测试上达到了最先进的性能，推动了通用多模态系统预测下一步发展的能力。

GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training

GTR-Turbo：合并检查点实际上是代理VLM训练的免费教师

Authors: Tong Wei, Yijun Yang, Changhao Zhang, Junliang Xing, Yuanchun Shi, Zongqing Lu, Deheng Ye
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.13043
Pdf link: https://arxiv.org/pdf/2512.13043
Abstract Multi-turn reinforcement learning (RL) for multi-modal agents built upon vision-language models (VLMs) is hampered by sparse rewards and long-horizon credit assignment. Recent methods densify the reward by querying a teacher that provides step-level feedback, e.g., Guided Thought Reinforcement (GTR) and On-Policy Distillation, but rely on costly, often privileged models as the teacher, limiting practicality and reproducibility. We introduce GTR-Turbo, a highly efficient upgrade to GTR, which matches the performance without training or querying an expensive teacher model. Specifically, GTR-Turbo merges the weights of checkpoints produced during the ongoing RL training, and then uses this merged model as a "free" teacher to guide the subsequent RL via supervised fine-tuning or soft logit distillation. This design removes dependence on privileged VLMs (e.g., GPT or Gemini), mitigates the "entropy collapse" observed in prior work, and keeps training stable. Across diverse visual agentic tasks, GTR-Turbo improves the accuracy of the baseline model by 10-30% while reducing wall-clock training time by 50% and compute cost by 60% relative to GTR.
中文摘要 基于视觉语言模型（VLMs）的多模态智能体多回合强化学习（RL）受限于奖励稀疏和长期学分分配。近期方法通过向教师提出反馈，提供步骤级反馈，例如引导思维强化（GTR）和政策提炼，但依赖成本高昂且常被特权的教师模型作为教师，限制了实用性和可重复性。我们推出了GTR-Turbo，这是一种高效的GTR升级版，无需培训或询问昂贵教师模型，即可匹配性能。具体来说，GTR-Turbo 合并了在持续的 RL 训练中产生的检查点权重，然后利用该合并模型作为“自由”教师，通过监督微调或软 logit 蒸馏来指导后续的 RL。该设计消除了对特权VLM（如GPT或Gemini）的依赖，减轻了先前工作中观察到的“熵坍缩”，并保持训练稳定。在多种视觉代理任务中，GTR-Turbo 将基线模型的准确率提升了 10-30%，同时将壁钟训练时间缩短了 50%，计算成本降低了 60%。
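
The checkpoint-merging step described in the abstract can be sketched as a weighted average of parameters. This is a minimal illustration only: the abstract does not state which merging rule GTR-Turbo uses, so uniform averaging, the function name `merge_checkpoints`, and the toy parameter dictionaries are all assumptions.

```python
def merge_checkpoints(checkpoints, weights=None):
    """Average parameters across checkpoints (hypothetical helper).

    Each checkpoint is a dict mapping a parameter name to a flat list of
    floats. Uniform weights are an assumption, not the paper's stated rule.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    if weights is None:
        weights = [1.0 / len(checkpoints)] * len(checkpoints)
    if len(weights) != len(checkpoints):
        raise ValueError("one weight per checkpoint is required")
    total = sum(weights)
    merged = {}
    for name, first_values in checkpoints[0].items():
        merged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints)) / total
            for i in range(len(first_values))
        ]
    return merged


# Toy checkpoints saved at two points of an RL run (made-up values).
ckpt_a = {"layer.weight": [0.0, 2.0], "layer.bias": [1.0]}
ckpt_b = {"layer.weight": [2.0, 4.0], "layer.bias": [3.0]}
teacher = merge_checkpoints([ckpt_a, ckpt_b])
```

In a real setup the merged parameters would be loaded into a copy of the model, which would then supply the supervised or soft-logit targets mentioned in the abstract.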

Deep Q-Learning-Based Intelligent Scheduling for ETL Optimization in Heterogeneous Data Environments

基于深度Q学习的智能调度，用于异构数据环境中的ETL优化

Authors: Kangning Gao, Yi Hu, Cong Nie, Wei Li
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.13060
Pdf link: https://arxiv.org/pdf/2512.13060
Abstract This paper addresses the challenges of low scheduling efficiency, unbalanced resource allocation, and poor adaptability in ETL (Extract-Transform-Load) processes under heterogeneous data environments by proposing an intelligent scheduling optimization framework based on deep Q-learning. The framework formalizes the ETL scheduling process as a Markov Decision Process and enables adaptive decision-making by a reinforcement learning agent in high-dimensional state spaces to dynamically optimize task allocation and resource scheduling. The model consists of a state representation module, a feature embedding network, a Q-value estimator, and a reward evaluation mechanism, which collectively consider task dependencies, node load states, and data flow characteristics to derive the optimal scheduling strategy in complex environments. A multi-objective reward function is designed to balance key performance indicators such as average scheduling delay, task completion rate, throughput, and resource utilization. Sensitivity experiments further verify the model's robustness under changes in hyperparameters, environmental dynamics, and data scale. Experimental results show that the proposed deep Q-learning scheduling framework significantly reduces scheduling delay, improves system throughput, and enhances execution stability under multi-source heterogeneous task conditions, demonstrating the strong potential of reinforcement learning in complex data scheduling and resource management, and providing an efficient and scalable optimization strategy for intelligent data pipeline construction.
中文摘要 本文通过提出基于深度Q学习的智能调度优化框架，解决了ETL（提取-转换-加载）过程在异构数据环境中调度效率低、资源分配不平衡以及适应性差的挑战。该框架将ETL调度过程形式化为马尔可夫决策过程，并使强化学习代理在高维状态空间中实现自适应决策，动态优化任务分配和资源调度。该模型由状态表示模块、特征嵌入网络、Q值估计器和奖励评估机制组成，这些因素共同考虑任务依赖性、节点负载状态和数据流特性，以在复杂环境中推导出最优调度策略。多目标奖励函数旨在平衡平均调度延迟、任务完成率、吞吐量和资源利用率等关键绩效指标。灵敏度实验进一步验证了模型在超参数、环境动力学和数据尺度变化下的鲁棒性。实验结果表明，所提出的深度Q学习调度框架显著减少了调度延迟，提高了系统吞吐量，并增强了多源异构任务条件下的执行稳定性，展示了强化学习在复杂数据调度和资源管理中的强大潜力，并为智能数据流水线构建提供了高效且可扩展的优化策略。
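
To make the multi-objective reward and the Q-learning update concrete, here is a minimal sketch. The weighted-sum form, the weight values, and the assumption that all four indicators are normalized to [0, 1] are illustrative choices, not taken from the paper; only the one-step Q-learning target is the standard formula.

```python
def scheduling_reward(delay, completion_rate, throughput, utilization,
                      weights=(0.4, 0.2, 0.2, 0.2)):
    """Weighted sum of normalized indicators (hypothetical weights).

    Delay is penalized; the other three indicators are rewarded.
    """
    w_delay, w_done, w_tp, w_util = weights
    return (-w_delay * delay
            + w_done * completion_rate
            + w_tp * throughput
            + w_util * utilization)


def td_target(reward, next_q_values, gamma=0.99, done=False):
    """Standard one-step Q-learning target: r + gamma * max_a' Q(s', a')."""
    if done or not next_q_values:
        return reward
    return reward + gamma * max(next_q_values)


# One scheduling step with made-up, already-normalized measurements.
r = scheduling_reward(delay=0.5, completion_rate=1.0, throughput=0.5, utilization=0.5)
target = td_target(r, next_q_values=[0.0, 2.0], gamma=0.5)
```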

M-GRPO: Stabilizing Self-Supervised Reinforcement Learning for Large Language Models with Momentum-Anchored Policy Optimization

M-GRPO：基于动量锚定策略优化的大型语言模型稳定自监督强化学习

Authors: Bizhe Bai, Hongming Wu, Peng Ye, Tao Chen
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.13070
Pdf link: https://arxiv.org/pdf/2512.13070
Abstract Self-supervised reinforcement learning (RL) presents a promising approach for enhancing the reasoning capabilities of Large Language Models (LLMs) without reliance on expensive human-annotated data. However, we find that existing methods suffer from a critical failure mode under long-horizon training: a "policy collapse" where performance precipitously degrades. We diagnose this instability and demonstrate that simply scaling the number of rollouts -- a common strategy to improve performance -- only delays, but does not prevent, this collapse. To counteract this instability, we first introduce M-GRPO (Momentum-Anchored Group Relative Policy Optimization), a framework that leverages a slowly evolving momentum model to provide a stable training target. In addition, we identify that this process is often accompanied by a rapid collapse in policy entropy, resulting in a prematurely confident and suboptimal policy. To specifically address this issue, we propose a second contribution: an adaptive filtering method based on the interquartile range (IQR) that dynamically prunes low-entropy trajectories, preserving essential policy diversity. Our extensive experiments on multiple reasoning benchmarks demonstrate that M-GRPO stabilizes the training process while the IQR filter prevents premature convergence. The combination of these two innovations leads to superior training stability and state-of-the-art performance.
中文摘要 自监督强化学习（RL）提供了一种有前景的方法，可以在不依赖昂贵的人工注释数据的情况下，增强大型语言模型（LLM）的推理能力。然而，我们发现现有方法在长期训练中存在一个关键失败模式：“策略崩溃”，即性能急剧下降。我们诊断了这种不稳定性，并证明仅仅扩大推广次数——这是提升性能的常见策略——只是延缓了崩溃，但并不能阻止。为应对这种不稳定性，我们首先引入了M-GRPO（动量锚定群相对策略优化），该框架利用缓慢演进的动量模型提供稳定的训练目标。此外，我们发现这一过程常伴随着政策熵的快速崩溃，导致政策过早自信且次优。为具体解决这一问题，我们提出了第二项贡献：基于四分位区间（IQR）的自适应过滤方法，动态修剪低熵轨迹，保持基本政策多样性。我们在多重推理基准测试上的大量实验表明，M-GRPO稳定训练过程，而IQR滤波器则防止过早收敛。这两项创新的结合带来了卓越的训练稳定性和最先进的性能。
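
The two ingredients named in the abstract, a slowly evolving momentum model and an IQR-based filter on trajectory entropy, can be sketched as follows. The exponential-moving-average update, the Tukey-style lower fence `Q1 - k * IQR`, and the default values of `tau` and `k` are assumptions made for illustration; the abstract does not give the exact rules.

```python
import statistics


def momentum_update(momentum_params, policy_params, tau=0.99):
    """EMA update: the momentum model drifts slowly toward the policy."""
    return [tau * m + (1.0 - tau) * p
            for m, p in zip(momentum_params, policy_params)]


def iqr_entropy_filter(entropies, k=1.5):
    """Return indices of trajectories whose entropy is not an IQR low outlier.

    Trajectories below Q1 - k * IQR are pruned (assumed threshold rule).
    Requires at least two entropy values.
    """
    q1, _, q3 = statistics.quantiles(entropies, n=4)
    lower_fence = q1 - k * (q3 - q1)
    return [i for i, h in enumerate(entropies) if h >= lower_fence]


# Mean token entropy of eight rollouts; the first one has collapsed.
rollout_entropies = [0.05, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6]
kept = iqr_entropy_filter(rollout_entropies)
```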

PvP: Data-Efficient Humanoid Robot Learning with Proprioceptive-Privileged Contrastive Representations

PvP：数据高效的类人机器人学习，带有本体感觉特权的对比表征

Authors: Mingqi Yuan, Tao Yu, Haolin Song, Bo Li, Xin Jin, Hua Chen, Wenjun Zeng
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.13093
Pdf link: https://arxiv.org/pdf/2512.13093
Abstract Achieving efficient and robust whole-body control (WBC) is essential for enabling humanoid robots to perform complex tasks in dynamic environments. Despite the success of reinforcement learning (RL) in this domain, its sample inefficiency remains a significant challenge due to the intricate dynamics and partial observability of humanoid robots. To address this limitation, we propose PvP, a Proprioceptive-Privileged contrastive learning framework that leverages the intrinsic complementarity between proprioceptive and privileged states. PvP learns compact and task-relevant latent representations without requiring hand-crafted data augmentations, enabling faster and more stable policy learning. To support systematic evaluation, we develop SRL4Humanoid, the first unified and modular framework that provides high-quality implementations of representative state representation learning (SRL) methods for humanoid robot learning. Extensive experiments on the LimX Oli robot across velocity tracking and motion imitation tasks demonstrate that PvP significantly improves sample efficiency and final performance compared to baseline SRL methods. Our study further provides practical insights into integrating SRL with RL for humanoid WBC, offering valuable guidance for data-efficient humanoid robot learning.
中文摘要 实现高效且稳健的全身控制（WBC）对于让类人机器人在动态环境中执行复杂任务至关重要。尽管强化学习（RL）在该领域取得了成功，但由于人形机器人的复杂动力学和部分可观测性，其样本效率低下仍是重大挑战。为解决这一限制，我们提出了PvP，一种本体感觉-特权对比学习框架，利用本体感觉与特权状态之间的内在互补性。PvP学习紧凑且与任务相关的潜在表示，无需手工数据增强，从而实现更快更稳定的策略学习。为支持系统评估，我们开发了SRL4Humanoid，这是首个统一且模块化的框架，提供代表性状态表示学习（SRL）方法的高质量实现，用于人形机器人学习。LimX Oli机器人在速度跟踪和运动模仿任务上的广泛实验表明，PvP相比基线SRL方法显著提升了样本效率和最终性能。我们的研究进一步提供了关于将SRL与强化学习整合用于人形机器人全身控制的实用见解，为数据高效的人形机器人学习提供了宝贵指导。

ADHint: Adaptive Hints with Difficulty Priors for Reinforcement Learning

ADHint：带有困难先验的自适应提示用于强化学习

Authors: Feng Zhang, Zezhong Tan, Xinhong Ma, Ziqiang Dong, Xi Leng, Jianfei Zhao, Xin Sun, Yang Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.13095
Pdf link: https://arxiv.org/pdf/2512.13095
Abstract To combine the advantages of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), recent methods have integrated "hints" into post-training, which are prefix segments of complete reasoning trajectories, aiming for powerful knowledge expansion and reasoning generalization. However, existing hint-based RL methods typically ignore difficulty when scheduling hint ratios and estimating relative advantages, leading to unstable learning and excessive imitation of off-policy hints. In this work, we propose ADHint, which treats difficulty as a key factor in both hint-ratio schedule and relative-advantage estimation to achieve a better trade-off between exploration and imitation. Specifically, we propose Adaptive Hint with Sample Difficulty Prior, which evaluates each sample's difficulty under the policy model and accordingly schedules an appropriate hint ratio to guide its rollouts. We also introduce Consistency-based Gradient Modulation and Selective Masking for Hint Preservation to modulate token-level gradients within hints, preventing biased and destructive updates. Additionally, we propose Advantage Estimation with Rollout Difficulty Posterior, which leverages the relative difficulty of rollouts with and without hints to estimate their respective advantages, thereby achieving more balanced updates. Extensive experiments across diverse modalities, model scales, and domains demonstrate that ADHint delivers superior reasoning ability and out-of-distribution generalization, consistently surpassing existing methods in both pass@1 and avg@8. Our code and dataset will be made publicly available upon paper acceptance.
中文摘要 为了结合监督式微调（SFT）和强化学习（RL）的优势，近期方法将“提示”整合进训练后，这些提示是完整推理轨迹的前缀段，旨在实现强大的知识扩展和推理泛化。然而，现有基于提示的强化学习方法通常忽视调度提示比率和估算相对优势的困难，导致学习不稳定和过度模仿非策略提示。在本研究中，我们提出了ADHint，将难度视为提示比率计划和相对优势估计的关键因素，以实现探索与模仿之间的更好权衡。具体来说，我们提出了带有样本难度先验的自适应提示，该方法在政策模型下评估每个样本的难度，并相应安排合适的提示比率以指导其推广。我们还引入了基于一致性的梯度调制和选择性遮罩以保持提示，以调节提示中的标记级梯度，防止偏向和破坏性的更新。此外，我们提出了带有发布难度后置的优势估计法，利用有提示和无提示的发布相对难度来估算各自的优势，从而实现更平衡的更新。跨越多种模态、模型尺度和领域的广泛实验表明，ADHint在推理能力和分布外推广方面表现出色，在pass@1和和avg@8上均持续超越现有方法。我们的代码和数据集将在论文接受后公开。
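
As a rough illustration of difficulty-aware hint scheduling, the sketch below maps a sample's pass rate under the current policy to a hint ratio and cuts the corresponding prefix from a reference trajectory. The linear mapping, the cap `max_ratio`, and both function names are hypothetical; the abstract only states that harder samples should receive an appropriately scheduled hint ratio.

```python
def hint_ratio(pass_rate, max_ratio=0.8):
    """Map policy pass rate in [0, 1] to a hint ratio (assumed linear rule).

    A sample the policy always solves gets no hint; a sample it never
    solves gets the longest allowed hint.
    """
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must lie in [0, 1]")
    return max_ratio * (1.0 - pass_rate)


def hint_prefix(trajectory_tokens, ratio):
    """Return the leading fraction of a complete reasoning trajectory."""
    cut = int(len(trajectory_tokens) * ratio)
    return trajectory_tokens[:cut]


# A ten-token reference trajectory for a sample the policy solves half the time.
reference = list("abcdefghij")
prefix = hint_prefix(reference, hint_ratio(pass_rate=0.5, max_ratio=1.0))
```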

Toward Self-Healing Networks-on-Chip: RL-Driven Routing in 2D Torus Architectures

迈向自愈的片上网络：二维环面架构中的强化学习驱动路由

Authors: Mohammad Walid Charrwi, Zaid Hussain
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2512.13096
Pdf link: https://arxiv.org/pdf/2512.13096
Abstract We investigate adaptive minimal routing in 2D torus networks-on-chip (NoCs) under node-fault conditions, comparing a reinforcement learning (RL)-based strategy to an adaptive routing baseline. A torus topology is used for its low-diameter, high-connectivity properties. The RL approach models each router as an agent that learns to forward packets based on network state, while the adaptive scheme uses fixed minimal paths with simple rerouting around faults. We implement both methods in simulation, injecting up to 50 node faults uniformly at random. Key metrics are measured: (1) throughput vs. offered load at fault density 0.2, (2) packet delivery ratio (PDR) vs. fault density, and (3) a fault-adaptive score (FT) vs. fault density. Experimental results show the RL method achieves significantly higher throughput at high load (approximately 20-30% gain) and maintains higher reliability under increasing faults. The RL router delivers more packets per cycle and adapts to faults by exploiting path diversity, whereas the adaptive scheme degrades sharply as faults accumulate. In particular, the RL approach preserves end-to-end connectivity longer: PDR remains above 90% until approximately 30-40 faults, while adaptive PDR drops to approximately 70% at the same point. The fault-adaptive score likewise favors RL routing. Thus, RL-based adaptive routing demonstrates clear advantages in throughput and fault resilience for torus NoCs.
中文摘要 我们研究了节点故障条件下二维环面片上网络（NoC）中的自适应最小路由，并将基于强化学习（RL）的策略与自适应路由基线进行比较。采用环面拓扑是因为其具有低直径、高连通性的特性。RL方法将每个路由器建模为一个根据网络状态学习转发数据包的智能体，而自适应方案则使用固定最小路径，并在遇到故障时简单地绕行。我们在仿真中实现了这两种方法，均匀随机注入最多50个节点故障。测量的关键指标包括：（1）故障密度为0.2时吞吐量与提供负载的关系；（2）数据包投递率（PDR）与故障密度的关系；（3）故障自适应评分（FT）与故障密度的关系。实验结果表明，RL方法在高负载下实现了显著更高的吞吐量（约20-30%的增益），并在故障增加时保持更高的可靠性。RL路由器每周期投递更多数据包，并通过利用路径多样性来适应故障，而自适应方案随着故障积累而急剧退化。特别是，RL方法能更长时间地保持端到端连通性：PDR在约30-40个故障之前保持在90%以上，而自适应方案的PDR在同一点降至约70%。故障自适应评分同样有利于RL路由。因此，基于RL的自适应路由在环面NoC的吞吐量和故障韧性方面展现出明显优势。
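
For readers unfamiliar with torus routing, the sketch below shows what "minimal" means on a 2D torus and how a router can enumerate fault-free minimal next hops. It illustrates the topology only; it is neither the paper's RL policy nor its adaptive baseline, and the function names are hypothetical.

```python
def torus_delta(src, dst, size):
    """Signed shortest offset from src to dst on a ring of `size` nodes.

    When both directions are equally short, the positive one is returned.
    """
    d = (dst - src) % size
    return d if d <= size // 2 else d - size


def hop_distance(src, dst, width, height):
    """Minimal hop count between two nodes, using wraparound links."""
    return (abs(torus_delta(src[0], dst[0], width))
            + abs(torus_delta(src[1], dst[1], height)))


def minimal_next_hops(src, dst, faulty, width, height):
    """Neighbors that move one hop closer to dst and are not faulty."""
    sx, sy = src
    step_x = torus_delta(sx, dst[0], width)
    step_y = torus_delta(sy, dst[1], height)
    candidates = []
    if step_x:
        candidates.append(((sx + (1 if step_x > 0 else -1)) % width, sy))
    if step_y:
        candidates.append((sx, (sy + (1 if step_y > 0 else -1)) % height))
    return [node for node in candidates if node not in faulty]


# On a 4x4 torus, (0, 0) reaches (3, 3) in two hops via wraparound links.
hops = hop_distance((0, 0), (3, 3), 4, 4)
options = minimal_next_hops((0, 0), (3, 3), faulty={(3, 0)}, width=4, height=4)
```

An empty result from `minimal_next_hops` is the case where a minimal scheme must misroute or drop the packet, which is where a learned policy has room to choose differently.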

TraPO: A Semi-Supervised Reinforcement Learning Framework for Boosting LLM Reasoning

TraPO：一个半监督式强化学习框架，用于提升LLM推理能力

Authors: Shenzhi Yang, Guangcheng Zhu, Xing Zheng, Yingfan MA, Zhongqi Chen, Bowen Song, Weiqiang Wang, Junbo Zhao, Gang Chen, Haobo Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.13106
Pdf link: https://arxiv.org/pdf/2512.13106
Abstract Reinforcement learning with verifiable rewards (RLVR) has proven effective in training large reasoning models (LRMs) by leveraging answer-verifiable signals to guide policy optimization, which, however, suffers from high annotation costs. To alleviate this problem, recent work has explored unsupervised RLVR methods that derive rewards solely from the model's internal consistency, such as through entropy and majority voting. While seemingly promising, these methods often suffer from model collapse in the later stages of training, which may arise from the reinforcement of incorrect reasoning patterns in the absence of external supervision. In this work, we investigate a novel semi-supervised RLVR paradigm that utilizes a small labeled set to guide RLVR training on unlabeled samples. Our key insight is that supervised rewards are essential for stabilizing consistency-based training on unlabeled samples, ensuring that only reasoning patterns verified on labeled instances are incorporated into RL training. Technically, we propose an effective policy optimization algorithm, TraPO, that identifies reliable unlabeled samples by matching their learning trajectory similarity to labeled ones. Building on this, TraPO achieves remarkable data efficiency and strong generalization on six widely used mathematical reasoning benchmarks (AIME24/25, AMC, MATH-500, Minerva, and Olympiad) and three out-of-distribution tasks (ARC-c, GPQA-diamond, and MMLU-pro). With only 1K labeled and 3K unlabeled samples, TraPO reaches 42.6% average accuracy, surpassing the best unsupervised method trained on 45K unlabeled samples (38.3%). Notably, when using 4K labeled and 12K unlabeled samples, TraPO even outperforms the fully supervised model trained on the full 45K labeled samples on all benchmarks, while using only 10% of the labeled data. The code is available via this https URL.
中文摘要 带有可验证奖励的强化学习（RLVR）已被证明在训练大型推理模型（LRM）方面非常有效，该方法利用可验证的答案信号来指导策略优化，但该策略存在较高的注释成本。为缓解这一问题，近期研究探索了仅通过模型内部一致性获得奖励的无监督RLVR方法，如熵和多数投票。虽然看似有前景，但这些方法在训练后期常常出现模型崩溃，这可能是由于缺乏外部监督时，错误推理模式的强化所致。本研究中，我们研究了一种新型半监督RLVR范式，利用少量标记样本指导RLVR训练。我们的核心见解是，监督奖励对于稳定无标签样本上的基于一致性的训练至关重要，确保只有在标记样本上验证的推理模式才被纳入强化学习训练。从技术上讲，我们提出了一种有效的策略优化算法TraPO，通过匹配未标记样本的学习轨迹与标记样本的相似度来识别它们。基于此，TraPO在六个广泛使用的数学推理基准测试（AIME24/25、AMC、MATH-500、Minerva和Olympiad）以及三个非发行任务（ARC-c、GPQA-diamond和MMLU-pro）上实现了卓越的数据效率和强有力的泛化能力。TraPO仅有1000个标记样本和3000个未标记样本，平均准确率达到42.6%，超过了在4.5万个未标记样本上训练的最佳无监督方法（38.3%）。值得注意的是，在使用4K标记样本和12K未标记样本时，TraPO在所有基准测试中甚至优于在所有基准测试中训练的全监督模型，且仅使用了10%的标记数据。代码可以通过这个 https URL 获取。
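
One way to picture "matching learning trajectory similarity" is to compare each unlabeled sample's per-epoch statistic with the average trajectory of the labeled samples. The sketch below does this with cosine similarity. The tracked statistic (for example, per-epoch pass rate), the centroid comparison, the threshold, and the function names are all assumptions; the abstract does not specify TraPO's actual criterion.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_reliable(unlabeled, labeled, threshold=0.9):
    """Keep unlabeled samples whose trajectory resembles the labeled centroid."""
    steps = len(labeled[0])
    centroid = [sum(traj[i] for traj in labeled) / len(labeled)
                for i in range(steps)]
    return [key for key, traj in unlabeled.items()
            if cosine(traj, centroid) >= threshold]


# Made-up per-epoch pass rates over three training epochs.
labeled_trajs = [[0.1, 0.4, 0.8], [0.2, 0.5, 0.9]]
unlabeled_trajs = {"u1": [0.1, 0.5, 0.9], "u2": [0.9, 0.5, 0.1]}
reliable = select_reliable(unlabeled_trajs, labeled_trajs)
```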

SpeakRL: Synergizing Reasoning, Speaking, and Acting in Language Models with Reinforcement Learning

SpeakRL：语言模型中的推理、口语和行动与强化学习的协同

Authors: Emre Can Acikgoz, Jinoh Oh, Jie Hao, Joo Hyuk Jeon, Heng Ji, Dilek Hakkani-Tür, Gokhan Tur, Xiang Li, Chengyuan Ma, Xing Fan
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.13159
Pdf link: https://arxiv.org/pdf/2512.13159
Abstract Effective human-agent collaboration is increasingly prevalent in real-world applications. Current trends in such collaborations are predominantly unidirectional, with users providing instructions or posing questions to agents, where agents respond directly without seeking necessary clarifications or confirmations. However, the evolving capabilities of these agents require more proactive engagement, where agents should dynamically participate in conversations to clarify user intents, resolve ambiguities, and adapt to changing circumstances. Prior work under-utilizes the conversational capabilities of language models (LMs), thereby optimizing agents as better followers rather than effective speakers. In this work, we introduce SpeakRL, a reinforcement learning (RL) method that enhances agents' conversational capabilities by rewarding proactive interactions with users, such as asking the right clarification questions when necessary. To support this, we curate SpeakER, a synthetic dataset that includes diverse scenarios from task-oriented dialogues, where tasks are resolved through interactive clarification questions. We present a systematic analysis of reward design for conversational proactivity and propose a principled reward formulation for teaching agents to balance asking with acting. Empirical evaluations demonstrate that our approach achieves a 20.14% absolute improvement in task completion over base models without increasing conversation turns, even surpassing much larger proprietary models, demonstrating the promise of clarification-centric user-agent interactions.
中文摘要 有效的人-代理协作在现实应用中日益普及。目前此类合作的趋势主要是单向的，用户向代理提供指令或提问，代理直接回应，无需寻求必要的澄清或确认。然而，这些代理能力的演变要求更主动的参与，代理应动态参与对话，以澄清用户意图、解决歧义并适应变化的环境。现有研究未能充分发挥语言模型（LM）的会话能力，从而优化了代理成为更好的跟随者而非有效的说话者。在本研究中，我们介绍了SpeakRL，一种强化学习（RL）方法，通过奖励主动与用户互动（如必要时提出正确的澄清问题）来增强代理的对话能力。为此，我们策划了SpeakER综合数据集，包含任务导向对话中的多样场景，任务通过互动澄清问题解决。我们提出了对会话主动性奖励设计的系统分析，并提出了一种原则性的奖励表述，帮助教学者在请求与行动之间取得平衡。实证评估表明，我们的方法在任务完成率上比基础模型绝对提升了20.14%，且对话次数不增加，甚至超过了更大规模的专有模型，展示了以澄清为中心的用户-代理交互的潜力。

SACn: Soft Actor-Critic with n-step Returns

SACn：带n步回报的Soft Actor-Critic

Authors: Jakub Łyskawa, Jakub Lewandowski, Paweł Wawrzyński
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.13165
Pdf link: https://arxiv.org/pdf/2512.13165
Abstract Soft Actor-Critic (SAC) is widely used in practical applications and is now one of the most relevant off-policy online model-free reinforcement learning (RL) methods. The technique of n-step returns is known to increase the convergence speed of RL algorithms compared to their 1-step returns-based versions. However, SAC is notoriously difficult to combine with n-step returns, since their usual combination introduces bias in off-policy algorithms due to the changes in action distribution. While this problem is solved by importance sampling, a method for estimating expected values of one distribution using samples from another distribution, importance sampling may result in numerical instability. In this work, we combine SAC with n-step returns in a way that overcomes this issue. We present an approach to applying numerically stable importance sampling with simplified hyperparameter selection. Furthermore, we analyze the entropy estimation approach of Soft Actor-Critic in the context of the n-step maximum entropy framework and formulate the $\tau$-sampled entropy estimation to reduce the variance of the learning target. Finally, we formulate the Soft Actor-Critic with n-step returns (SAC$n$) algorithm that we experimentally verify on MuJoCo simulated environments.
中文摘要 软演员批评（SAC）在实际应用中被广泛应用，目前是最相关的非策略在线无模型强化学习（RL）方法之一。已知n步返回技术能提高强化学习算法的收敛速度，相较于基于1步返回的版本。然而，SAC与n步返回组合非常困难，因为它们通常的组合会因动作分布变化而引入偏置，导致非策略算法出现偏差。虽然这个问题通过重要性抽样（一种利用另一个分布样本估计一个分布期望值的方法）得到解决，但重要性抽样可能导致数值不稳定。在本研究中，我们将SAC与n步返回结合起来，以克服这一问题。我们提出了一种应用数值稳定重要性抽样的方法，并采用简化的超参数选择。此外，我们在n步最大熵框架的背景下分析了软演员-批判者的熵估计方法，并提出了$\tau$采样熵估计，以降低学习目标的方差。最后，我们提出了带有n步返回的软演员-批判者算法（SAC$n$），并在MuJoCo模拟环境中进行了实验验证。
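
For reference, the plain n-step return that the abstract builds on can be computed as below. This sketch covers only the generic target G = r_0 + gamma * r_1 + ... + gamma^(n-1) * r_(n-1) + gamma^n * V; it deliberately omits the importance-sampling correction and the entropy terms that the paper adds for SAC, and the function name is hypothetical.

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """Discounted n-step return with a bootstrapped tail value.

    `rewards` holds r_t ... r_{t+n-1}; `bootstrap_value` estimates the
    value of the state reached after those n steps.
    """
    target = bootstrap_value
    for reward in reversed(rewards):
        target = reward + gamma * target
    return target


# Three unit rewards followed by a tail value of 10, with gamma = 0.5.
g3 = n_step_return([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5)
```

With a single reward the function reduces to the usual 1-step target, which makes it easy to compare the two settings in an experiment.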

Reflective Preference Optimization (RPO): Enhancing On-Policy Alignment via Hint-Guided Reflection

反思偏好优化（RPO）：通过提示引导反思提升政策上的对齐

Authors: Zihui Zhao, Zechang Li
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.13240
Pdf link: https://arxiv.org/pdf/2512.13240
Abstract Direct Preference Optimization (DPO) has emerged as a lightweight and effective alternative to Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with AI Feedback (RLAIF) for aligning large language and vision-language models. However, the standard DPO formulation, in which both the chosen and rejected responses are generated by the same policy, suffers from a weak learning signal because the two responses often share similar errors and exhibit small Kullback-Leibler (KL) divergence. This leads to slow and unstable convergence. To address this limitation, we introduce Reflective Preference Optimization (RPO), a new framework that incorporates hint-guided reflection into the DPO paradigm. RPO uses external models to identify hallucination sources and generate concise reflective hints, enabling the construction of on-policy preference pairs with stronger contrastiveness and clearer preference signals. We theoretically show that conditioning on hints increases the expected preference margin through mutual information and improves sample efficiency while remaining within the policy distribution family. Empirically, RPO achieves superior alignment with fewer training samples and iterations, substantially reducing hallucination rates and delivering state-of-the-art performance across multimodal benchmarks.
中文摘要 直接偏好优化（DPO）已成为一种轻量且有效的替代方案，取代来自人类反馈的强化学习（RLHF）和基于人工智能反馈的强化学习（RLAIF），用于对齐大型语言模型和视觉语言模型。然而，标准DPO表述中，选定和拒绝的响应均由同一策略生成，但由于两种响应常常存在相似误差且表现出较小的Kullback-Leibler（KL）发散，学习信号较弱。这导致收敛缓慢且不稳定。为解决这一限制，我们引入了反思偏好优化（RPO），这是一个将提示引导反射纳入DPO范式的新框架。RPO利用外部模型识别幻觉源并生成简明的反射提示，从而构建具有更强对比性和更清晰偏好信号的政策偏好对。我们理论上证明，条件化提示通过互信息提高了期望偏好裕度，并提高了样本效率，同时保持在政策分布族内。从经验上看，RPO通过更少的训练样本和迭代实现了更优的对齐，显著降低了幻觉率，并在多模态基准测试中展现了最先进的性能。
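
The standard DPO objective that RPO extends is, for a single preference pair, the negative log-sigmoid of a scaled log-ratio margin. The sketch below computes it from sequence log-probabilities; the hint-conditioning step of RPO is not modeled, and the numeric inputs are made up.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen ratio - rejected ratio)).

    Each ratio is the policy log-probability minus the reference one.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) equals softplus(-m); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Near-identical responses give a margin near zero, so the loss sits near log 2.
weak_pair = dpo_loss(-10.0, -10.1, -10.0, -10.0)
strong_pair = dpo_loss(-5.0, -20.0, -10.0, -10.0)
```

Here `weak_pair` stays close to log 2 while `strong_pair` is clearly smaller, which is one way to read the abstract's remark that similar chosen and rejected responses yield a weak learning signal.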

Post-Training and Test-Time Scaling of Generative Agent Behavior Models for Interactive Autonomous Driving

面向交互式自动驾驶的生成式智能体行为模型的后训练与测试时扩展

Authors: Hyunki Seong, Jeong-Kyun Lee, Heesoo Myeong, Yongho Shin, Hyun-Mook Cho, Duck Hoon Kim, Pranav Desai, Monu Surana
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.13262
Pdf link: https://arxiv.org/pdf/2512.13262
Abstract Learning interactive motion behaviors among multiple agents is a core challenge in autonomous driving. While imitation learning models generate realistic trajectories, they often inherit biases from datasets dominated by safe demonstrations, limiting robustness in safety-critical cases. Moreover, most studies rely on open-loop evaluation, overlooking compounding errors in closed-loop execution. We address these limitations with two complementary strategies. First, we propose Group Relative Behavior Optimization (GRBO), a reinforcement learning post-training method that fine-tunes pretrained behavior models via group relative advantage maximization with human regularization. Using only 10% of the training dataset, GRBO improves safety performance by over 40% while preserving behavioral realism. Second, we introduce Warm-K, a warm-started Top-K sampling strategy that balances consistency and diversity in motion selection. Our Warm-K method-based test-time scaling enhances behavioral consistency and reactivity at test time without retraining, mitigating covariate shift and reducing performance discrepancies. Demo videos are available in the supplementary material.
中文摘要 在多个智能体之间学习互动运动行为是自动驾驶的核心挑战。虽然模仿学习模型能生成真实的轨迹，但它们常常继承了以安全示范为主导的数据集的偏见，限制了安全关键案例中的稳健性。此外，大多数研究依赖开环评估，忽视闭环执行中的复利错误。我们通过两种互补策略来应对这些局限。首先，我们提出了群体相对行为优化（Group Relative Behavior Optimization，GRBO），这是一种强化学习的训练后方法，通过人类正则化实现群体相对优势最大化，对预训练行为模型进行微调。仅使用10%的训练数据集，GRBO在保持行为真实性的同时，安全性提升超过40%。其次，我们引入Warm-K，一种热启动的Top-K采样策略，在动作选择中平衡一致性和多样性。我们基于Warm-K方法的测试时间尺度提升了测试时的行为一致性和反应性，无需重新训练，从而减轻协变量偏移并减少性能差异。演示视频可在补充资料中提供。
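
"Group relative advantage" is commonly computed, as in GRPO, by standardizing each rollout's reward against the other rollouts sampled for the same scenario. The sketch below shows that normalization only; GRBO's full objective, including the human regularization term, is not given in the abstract and is not reproduced here.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts (GRPO-style).

    Rollouts better than the group mean get a positive advantage.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four rollouts of one driving scenario: 1.0 means no collision (made-up rewards).
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```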

SPARS: A Reinforcement Learning-Enabled Simulator for Power Management in HPC Job Scheduling

SPARS：一个基于强化学习的高性能计算作业调度功率管理模拟器

Authors: Muhammad Alfian Amrizal, Raka Satya Prasasta, Santana Yuda Pradata, Kadek Gemilang Santiyuda, Reza Pulungan, Hiroyuki Takizawa
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2512.13268
Pdf link: https://arxiv.org/pdf/2512.13268
Abstract High-performance computing (HPC) clusters consume enormous amounts of energy, with idle nodes as a major source of waste. Powering down unused nodes can mitigate this problem, but poorly timed transitions introduce long delays and reduce overall performance. To address this trade-off, we present SPARS, a reinforcement learning-enabled simulator for power management in HPC job scheduling. SPARS integrates job scheduling and node power state management within a discrete-event simulation framework. It supports traditional scheduling policies such as First Come First Served and EASY Backfilling, along with enhanced variants that employ reinforcement learning agents to dynamically decide when nodes should be powered on or off. Users can configure workloads and platforms in JSON format, specifying job arrivals, execution times, node power models, and transition delays. The simulator records comprehensive metrics (including energy usage, wasted power, job waiting times, and node utilization) and provides Gantt chart visualizations to analyze scheduling dynamics and power transitions. Unlike widely used Batsim-based frameworks that rely on heavy inter-process communication, SPARS provides lightweight event handling and consistent simulation results, making experiments easier to reproduce and extend. Its modular design allows new scheduling heuristics or learning algorithms to be integrated with minimal effort. By providing a flexible, reproducible, and extensible platform, SPARS enables researchers and practitioners to systematically evaluate power-aware scheduling strategies, explore the trade-offs between energy efficiency and performance, and accelerate the development of sustainable HPC operations.
中文摘要 高性能计算（HPC）集群消耗巨大的能量，空闲节点是主要的废弃物来源。关闭未使用节点可以缓解这一问题，但切换时机不当会导致较长延迟并降低整体性能。为解决这一权衡，我们介绍了SPARS，一种基于强化学习的高性能计算作业调度功率管理模拟器。SPARS将作业调度和节点电源状态管理集成在离散事件仿真框架内。它支持传统的调度政策，如先到先得和简易回填，以及利用强化学习代理动态决定节点何时开机或关闭的增强变体。用户可以用 JSON 格式配置工作负载和平台，指定作业到达时间、执行时间、节点功率模型和转换延迟。模拟器记录全面的指标——包括能耗、浪费电力、作业等待时间和节点利用率——并提供甘特图可视化，分析调度动态和电力转换。与广泛使用的基于Batsim、依赖大量进程间通信的框架不同，SPARS提供轻量级事件处理和一致的模拟结果，使实验更易于复现和扩展。其模块化设计允许以最小的努力整合新的调度启发式或学习算法。通过提供一个灵活、可重复且可扩展的平台，SPARS使研究人员和从业者能够系统地评估能耗感知调度策略，探讨能效与性能之间的权衡，加速可持续高性能计算运营的发展。
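
Illustrative sketch (not SPARS code): the trade-off between idling and powering down can be made concrete with a small energy calculation. The function name `idle_energy`, the wattages, and the delays below are all hypothetical values chosen for illustration; the abstract does not give SPARS's actual power model.

```python
def idle_energy(gap_s, idle_w, sleep_w, off_delay_s, on_delay_s, transition_w):
    """Return (stay_on_joules, power_down_joules) for one idle gap.

    Every parameter is an illustrative assumption, not a SPARS default.
    """
    stay_on = idle_w * gap_s
    transition_s = off_delay_s + on_delay_s
    if gap_s < transition_s:
        # The node cannot switch off and back on within the gap, so the
        # full transition cost is paid and the next job is also delayed.
        power_down = transition_w * transition_s
    else:
        power_down = transition_w * transition_s + sleep_w * (gap_s - transition_s)
    return stay_on, power_down


# Hypothetical node: 100 W idle, 10 W asleep, 150 W while switching,
# 30 s to power off and 60 s to power on.
short_gap = idle_energy(60, 100, 10, 30, 60, 150)
long_gap = idle_energy(3600, 100, 10, 30, 60, 150)
```

Under these assumed numbers, powering down loses on the 60-second gap and wins on the one-hour gap, which is the kind of timing decision the abstract assigns to the learning agent.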

AutoTool: Dynamic Tool Selection and Integration for Agentic Reasoning

AutoTool：智能推理的动态工具选择与集成

Authors: Jiaru Zou, Ling Yang, Yunzhe Qi, Sirui Chen, Mengting Ai, Ke Shen, Jingrui He, Mengdi Wang
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.13278
Pdf link: https://arxiv.org/pdf/2512.13278
Abstract Agentic reinforcement learning has advanced large language models (LLMs) to reason through long chain-of-thought trajectories while interleaving external tool use. Existing approaches assume a fixed inventory of tools, limiting LLM agents' adaptability to new or evolving toolsets. We present AutoTool, a framework that equips LLM agents with dynamic tool-selection capabilities throughout their reasoning trajectories. We first construct a 200k dataset with explicit tool-selection rationales across 1,000+ tools and 100+ tasks spanning mathematics, science, code generation, and multimodal reasoning. Building on this data foundation, AutoTool employs a dual-phase optimization pipeline: (i) supervised and RL-based trajectory stabilization for coherent reasoning, and (ii) KL-regularized Plackett-Luce ranking to refine consistent multi-step tool selection. Across ten diverse benchmarks, we train two base models, Qwen3-8B and Qwen2.5-VL-7B, with AutoTool. With fewer parameters, AutoTool consistently outperforms advanced LLM agents and tool-integration methods, yielding average gains of 6.4% in math & science reasoning, 4.5% in search-based QA, 7.7% in code generation, and 6.9% in multimodal understanding. In addition, AutoTool exhibits stronger generalization by dynamically leveraging unseen tools from evolving toolsets during inference.
中文摘要 代理强化学习拥有先进的大型语言模型（LLM），能够通过长链思考轨迹推理，同时交织外部工具的使用。现有方法假设工具库存固定，限制了LLM代理对新工具集或不断演变的工具集的适应能力。我们介绍AutoTool，一个框架，为LLM代理在其推理轨迹中提供动态工具选择能力。我们首先构建了一个20万个数据集，明确了涵盖1000+工具和100+任务的工具选择理由，涵盖数学、科学、代码生成和多模态推理。基于此数据基础，AutoTool采用双阶段优化流程：（i）监督式和基于强化学习的轨迹稳定以实现连贯推理，（ii）KL正则化的Plackett-Luce排序以优化一致的多步工具选择。通过十个多样化的基准测试，我们用AutoTool训练两个基础模型Qwen3-8B和Qwen2.5-VL-7B。参数减少后，AutoTool持续优于高级LLM代理和工具集成方法，数学与科学推理平均提升6.4%，基于搜索的质量保证提升4.5%，代码生成7.7%，多模态理解提升6.9%。此外，AutoTool通过动态利用来自演化工具集的未见工具，展现出更强的泛化性。
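
Illustrative sketch (not AutoTool code): a Plackett-Luce model scores a ranking by repeatedly picking the next item with a softmax over the items that remain. The version below is a minimal NumPy rendering of that idea. Applying the KL penalty to the softmax over tool scores against a reference model is an assumption made for this sketch, and the names `kl_regularized_loss` and `beta` are hypothetical.

```python
import numpy as np


def log_softmax(x):
    x = x - np.max(x)  # shift for numerical stability
    return x - np.log(np.sum(np.exp(x)))


def plackett_luce_log_likelihood(scores, ranking):
    """Log-probability of `ranking` (best item first) under Plackett-Luce."""
    remaining = list(ranking)
    total = 0.0
    while remaining:
        # Probability that remaining[0] is picked next among what is left.
        total += log_softmax(scores[remaining])[0]
        remaining.pop(0)
    return total


def kl_regularized_loss(scores, ref_scores, ranking, beta=0.1):
    """Negative PL log-likelihood plus beta * KL(policy || reference).

    The KL term over softmax(scores) is an assumed form of regularization.
    """
    log_p = log_softmax(scores)
    log_q = log_softmax(ref_scores)
    kl = float(np.sum(np.exp(log_p) * (log_p - log_q)))
    return -plackett_luce_log_likelihood(scores, ranking) + beta * kl


# Three hypothetical tools with equal scores: every ordering has probability 1/6.
uniform_loss = kl_regularized_loss(np.zeros(3), np.zeros(3), [0, 1, 2])
```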

Intrinsic-Motivation Multi-Robot Social Formation Navigation with Coordinated Exploration

内在动机多机器人社会形成导航与协调探索

Authors: Hao Fua, Wei Liu, Shuai Zhoua
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.13293
Pdf link: https://arxiv.org/pdf/2512.13293
Abstract This paper investigates the application of reinforcement learning (RL) to multi-robot social formation navigation, a critical capability for enabling seamless human-robot coexistence. While RL offers a promising paradigm, the inherent unpredictability and often uncooperative dynamics of pedestrian behavior pose substantial challenges, particularly concerning the efficiency of coordinated exploration among robots. To address this, we propose a novel coordinated-exploration multi-robot RL algorithm that introduces intrinsic-motivation exploration. Its core component is a self-learning intrinsic reward mechanism designed to collectively alleviate policy conservatism. Moreover, this algorithm incorporates a dual-sampling mode within the centralized training and decentralized execution framework to enhance the representation of both the navigation policy and the intrinsic reward, leveraging a two-time-scale update rule to decouple parameter updates. Empirical results on social formation navigation benchmarks demonstrate the proposed algorithm's superior performance over existing state-of-the-art methods across crucial metrics. Our code and video demos are available at: this https URL.
中文摘要 本文探讨了强化学习（RL）在多机器人社会形成导航中的应用，这是实现无缝人机共存的关键能力。虽然强化学习提供了一个有前景的范式，但行人行为固有的不可预测性和常常不合作的动态带来了重大挑战，尤其是在机器人协调探索的效率方面。为此，我们提出了一种新的协调探索多机器人强化学习算法，引入了内在动机探索。其核心组成部分是一种自我学习的内在奖励机制，旨在集体缓解政策保守主义。此外，该算法在集中训练和去中心化执行框架中采用双重采样模式，增强导航策略和内在奖励的表现，利用双时间尺度更新规则解耦参数更新。社会形成导航基准测试的实证结果显示，该算法在关键指标上优于现有最先进方法。我们的代码和视频演示可在以下网站获取：https URL。

Control of a Twin Rotor using Twin Delayed Deep Deterministic Policy Gradient (TD3)

使用双延迟深确定性策略梯度（TD3）控制双旋翼

Authors: Zeyad Gamal, Youssef Mahran, Ayman El-Badawy
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.13356
Pdf link: https://arxiv.org/pdf/2512.13356
Abstract This paper proposes a reinforcement learning (RL) framework for controlling and stabilizing the Twin Rotor Aerodynamic System (TRAS) at specific pitch and azimuth angles and tracking a given trajectory. The complex dynamics and non-linear characteristics of the TRAS make it challenging to control using traditional control algorithms. However, recent developments in RL have attracted interest due to their potential applications in the control of multirotors. The Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm was used in this paper to train the RL agent. This algorithm is used for environments with continuous state and action spaces, similar to the TRAS, as it does not require a model of the system. The simulation results illustrated the effectiveness of the RL control method. Next, external disturbances in the form of wind disturbances were used to test the controller's effectiveness compared to conventional PID controllers. Lastly, experiments on a laboratory setup were carried out to confirm the controller's effectiveness in real-world applications.
中文摘要 本文提出了一种强化学习（RL）框架，用于在特定俯仰角和方位角下控制和稳定双旋翼气动系统（TRAS），并跟踪给定轨迹。TRAS复杂的动力学和非线性特性使其难以用传统控制算法进行控制。然而，强化学习的最新发展因其在多旋翼控制中的潜在应用而引起了关注。本文使用了双延迟深度确定性策略梯度（TD3）算法来训练强化学习代理。该算法适用于具有连续状态和作用空间的环境，类似于TRAS系统，因为它不需要系统的模型。模拟结果展示了强化学习控制方法的有效性。接着，使用风扰动等外部干扰来测试控制器相较于传统PID控制器的有效性。最后，在实验室环境中进行了实验，以验证控制器在实际应用中的有效性。
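
Illustrative sketch (not the authors' controller): the core of TD3 is how it builds the critic's learning target, using target policy smoothing and the minimum of two target critics. The NumPy function below shows that computation in isolation. The hyperparameter defaults are commonly used values, not ones reported for the TRAS experiments, and the callables passed in stand for target networks.

```python
import numpy as np


def td3_target(reward, done, next_state, target_actor, target_q1, target_q2,
               rng, gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0):
    """Compute the TD3 critic target for a batch of transitions."""
    # Target policy smoothing: perturb the target action with clipped noise.
    action = target_actor(next_state)
    noise = np.clip(rng.normal(0.0, noise_std, size=action.shape),
                    -noise_clip, noise_clip)
    action = np.clip(action + noise, -max_action, max_action)
    # Clipped double-Q learning: take the smaller of the two target critics
    # to counter overestimation.
    q_min = np.minimum(target_q1(next_state, action),
                       target_q2(next_state, action))
    # No bootstrapping from terminal states (done == 1).
    return reward + gamma * (1.0 - done) * q_min


# Toy stand-ins for the target networks (hypothetical, for illustration only).
rng = np.random.default_rng(0)
states = np.zeros((2, 4))
targets = td3_target(
    reward=np.array([1.0, 1.0]),
    done=np.array([0.0, 1.0]),
    next_state=states,
    target_actor=lambda s: np.zeros((s.shape[0], 1)),
    target_q1=lambda s, a: np.full(s.shape[0], 2.0),
    target_q2=lambda s, a: np.full(s.shape[0], 1.0),
    rng=rng,
)
```

The "delayed" part of TD3, updating the actor and target networks less often than the critics, lives in the training loop and is not shown here.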

Fast Policy Learning for 6-DOF Position Control of Underwater Vehicles

6自由度水下飞行器位置控制的快速策略学习

Authors: Sümer Tunçay, Alain Andres, Ignacio Carlucho
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.13359
Pdf link: https://arxiv.org/pdf/2512.13359
Abstract Autonomous Underwater Vehicles (AUVs) require reliable six-degree-of-freedom (6-DOF) position control to operate effectively in complex and dynamic marine environments. Traditional controllers are effective under nominal conditions but exhibit degraded performance when faced with unmodeled dynamics or environmental disturbances. Reinforcement learning (RL) provides a powerful alternative but training is typically slow and sim-to-real transfer remains challenging. This work introduces a GPU-accelerated RL training pipeline built in JAX and MuJoCo-XLA (MJX). By jointly JIT-compiling large-scale parallel physics simulation and learning updates, we achieve training times of under two this http URL systematic evaluation of multiple RL algorithms, we show robust 6-DOF trajectory tracking and effective disturbance rejection in real underwater experiments, with policies transferred zero-shot from simulation. Our results provide the first explicit real-world demonstration of RL-based AUV position control across all six degrees of freedom.
中文摘要 自主水下载具（AUV）需要可靠的六自由度（6-DOF）位置控制，才能在复杂且动态的海洋环境中有效运行。传统控制器在名义条件下有效，但在面对未建模的动力学或环境干扰时表现下降。强化学习（RL）提供了强大的替代方案，但训练通常较慢，模拟到现实的转移依然具有挑战性。本研究引入了基于 JAX 和 MuJoCo-XLA（MJX）构建的 GPU 加速强化学习流水线。通过联合JIT编译大规模并行物理模拟和学习更新，我们实现了多个强化学习算法的训练时间低于两秒，展示了在真实水下实验中稳健的6自由度轨迹跟踪和有效扰动拒绝，策略从模拟中零时转移。我们的结果首次明确展示了基于强化学习的自主飞行器在所有六自由度范围内的位置控制。

Universal Dexterous Functional Grasping via Demonstration-Editing Reinforcement Learning

通过演示编辑强化学习实现普遍灵巧功能抓握

Authors: Chuan Mao, Haoqi Yuan, Ziye Huang, Chaoyi Xu, Kai Ma, Zongqing Lu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.13380
Pdf link: https://arxiv.org/pdf/2512.13380
Abstract Reinforcement learning (RL) has achieved great success in dexterous grasping, significantly improving grasp performance and generalization from simulation to the real world. However, fine-grained functional grasping, which is essential for downstream manipulation tasks, remains underexplored and faces several challenges: the complexity of specifying goals and reward functions for functional grasps across diverse objects, the difficulty of multi-task RL exploration, and the challenge of sim-to-real transfer. In this work, we propose DemoFunGrasp for universal dexterous functional grasping. We factorize functional grasping conditions into two complementary components - grasping style and affordance - and integrate them into an RL framework that can learn to grasp any object with any functional grasping condition. To address the multi-task optimization challenge, we leverage a single grasping demonstration and reformulate the RL problem as one-step demonstration editing, substantially enhancing sample efficiency and performance. Experimental results in both simulation and the real world show that DemoFunGrasp generalizes to unseen combinations of objects, affordances, and grasping styles, outperforming baselines in both success rate and functional grasping accuracy. In addition to strong sim-to-real capability, by incorporating a vision-language model (VLM) for planning, our system achieves autonomous instruction-following grasp execution.
中文摘要 强化学习（RL）在灵巧抓取方面取得了巨大成功，显著提升了抓握性能和从模拟到现实世界的泛化能力。然而，细粒度的功能抓取——对下游作任务至关重要——仍未被充分探索，面临诸多挑战：在不同对象间指定功能抓握目标和奖励函数的复杂性、多任务强化学习的探索困难，以及模拟到现实转移的挑战。在本研究中，我们提出了用于通用灵巧功能抓握的DemoFunGrasp。我们将功能性抓握条件分解为两个互补组成部分——抓取风格和可供性——并将其集成到一个能够学习掌握任何具有函数抓握条件的对象的强化学习框架中。为应对多任务优化挑战，我们利用单一抓取演示，将强化学习问题重新表述为一步演示编辑，显著提升样本效率和性能。在模拟和现实世界中的实验结果显示，DemoFunGrasp 能够推广到未见的对象组合、可供性和抓取风格，在成功率和函数抓取准确性上都优于基线。除了强大的模拟到现实能力外，通过引入视觉语言模型（VLM）进行规划，我们的系统实现了自主指令跟随抓取执行。

QoS-Aware State-Augmented Learnable Framework for 5G NR-U/Wi-Fi Coexistence: Impact of Parameter Selection and Enhanced Collision Resolution

QoS感知状态增强可学习框架用于5G NR-U/Wi-Fi共存：参数选择与增强碰撞分辨率的影响

Authors: Mohammad Reza Fasihi, Brian L. Mark
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2512.13393
Pdf link: https://arxiv.org/pdf/2512.13393
Abstract Unlicensed spectrum supports diverse traffic with stringent Quality-of-Service (QoS) requirements. In NR-U/Wi-Fi coexistence, the values of MAC parameters critically influence delay, collision behavior, and airtime fairness and efficiency. In this paper, we investigate the impact of (i) cost scaling and violation modeling, (ii) choice of MAC parameters, and (iii) an enhanced collision resolution scheme for the Listen-Before-Talk (LBT) mechanism on the performance of a state-augmented constrained reinforcement learning controller for NR-U/Wi-Fi coexistence. Coexistence control is formulated as a constrained Markov decision process with an explicit delay constraint for high-priority traffic and fairness as the optimization goal. Our simulation results show three key findings: (1) signed, threshold-invariant cost scaling with temporal smoothing stabilizes learning and strengthens long-term constraint adherence; (2) use of the contention window parameter for control provides smoother adaptation and better delay compliance than other MAC parameters; and (3) enhanced LBT significantly reduces collisions and improves airtime efficiency. These findings provide practical insights for achieving robust, QoS-aware coexistence control.
中文摘要 无许可频谱支持多样化流量，并有严格的服务质量（QoS）要求。在NR-U/Wi-Fi共存中，MAC参数的数值对延迟、碰撞行为以及播出时间的公平性和效率有着关键影响。本文探讨了（i）成本尺度和违规建模，（ii）MAC参数的选择，以及（iii）增强碰撞分辨率方案对状态增强的受限强化学习控制器（NR-U/Wi-Fi共存）机制的影响。共存控制被表述为一个受限的马尔可夫决策过程，明确以高优先级流量的延迟约束和公平性为优化目标。我们的模拟结果显示了三个关键发现：（1）带符号、阈值不变成本尺度配合时间平滑，稳定学习并增强长期约束依从性;（2）使用争用窗口参数进行控制，比其他MAC参数更平滑的适应性和更好的延迟合规性;以及（3）增强的LBT显著减少碰撞并提高空时效率。这些发现为实现稳健且具备QoS意识的共存控制提供了实用见解。
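
Illustrative sketch (not the authors' controller): one plausible reading of "signed, threshold-invariant cost scaling with temporal smoothing" is a delay cost normalized by its threshold, smoothed with an exponential moving average, and fed to a dual-ascent update of the constraint multiplier. The exact formulas, the names `signed_cost` and `dual_update`, and the values of `alpha` and `lr` are assumptions made for this sketch.

```python
def signed_cost(delay_ms, threshold_ms):
    """Positive when the delay constraint is violated, negative when it is met.

    Dividing by the threshold keeps the scale independent of its value.
    """
    return (delay_ms - threshold_ms) / threshold_ms


def dual_update(lmbda, smoothed, delay_ms, threshold_ms, alpha=0.5, lr=0.5):
    """One dual-ascent step on an exponentially smoothed signed cost."""
    smoothed = (1.0 - alpha) * smoothed + alpha * signed_cost(delay_ms, threshold_ms)
    lmbda = max(0.0, lmbda + lr * smoothed)  # the multiplier stays non-negative
    return lmbda, smoothed


# Hypothetical 10 ms delay bound: one violating epoch, then one compliant epoch.
lmbda1, smoothed1 = dual_update(0.0, 0.0, delay_ms=15.0, threshold_ms=10.0)
lmbda2, smoothed2 = dual_update(lmbda1, smoothed1, delay_ms=5.0, threshold_ms=10.0)
```

In a state-augmented formulation the multiplier would be appended to the agent's observation, so the policy can adjust how aggressively it contends as the constraint pressure rises or falls.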

Differentiable Evolutionary Reinforcement Learning

可微分进化强化学习

Authors: Sitao Cheng, Tianle Li, Xuhan Huang, Xunjian Yin, Difan Zou
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.13399
Pdf link: https://arxiv.org/pdf/2512.13399
Abstract The design of effective reward functions presents a central and often arduous challenge in reinforcement learning (RL), particularly when developing autonomous agents for complex reasoning tasks. While automated reward optimization approaches exist, they typically rely on derivative-free evolutionary heuristics that treat the reward function as a black box, failing to capture the causal relationship between reward structure and task performance. To bridge this gap, we propose Differentiable Evolutionary Reinforcement Learning (DERL), a bilevel framework that enables the autonomous discovery of optimal reward signals. In DERL, a Meta-Optimizer evolves a reward function (i.e., Meta-Reward) by composing structured atomic primitives, guiding the training of an inner-loop policy. Crucially, unlike previous evolution, DERL is differentiable in its meta-optimization: it treats the inner-loop validation performance as a signal to update the Meta-Optimizer via reinforcement learning. This allows DERL to approximate the "meta-gradient" of task success, progressively learning to generate denser and more actionable feedback. We validate DERL across three distinct domains: robotic agent (ALFWorld), scientific simulation (ScienceWorld), and mathematical reasoning (GSM8k, MATH). Experimental results show that DERL achieves state-of-the-art performance on ALFWorld and ScienceWorld, significantly outperforming methods relying on heuristic rewards, especially in out-of-distribution scenarios. Analysis of the evolutionary trajectory demonstrates that DERL successfully captures the intrinsic structure of tasks, enabling self-improving agent alignment without human intervention.
中文摘要 有效奖励函数的设计在强化学习（RL）中是一个核心且常常艰巨的挑战，尤其是在开发用于复杂推理任务的自主代理时。虽然存在自动奖励优化方法，但通常依赖于无导数的进化启发式方法，将奖励函数视为黑箱，未能捕捉奖励结构与任务表现之间的因果关系。为弥合这一差距，我们提出了可微进化强化学习（DERL）框架，这是一种双层框架，能够自主发现最优奖励信号。在DERL中，元优化器通过组合结构化的原子原语来演化奖励函数（即元奖励），指导内环策略的训练。关键是，与以往的演变不同，DERL在元优化方面具有微分性：它将内环验证性能视为通过强化学习更新元优化器的信号。这使得DERL能够近似任务成功的“元梯度”，逐步学习生成更密集且更具可作性的反馈。我们在三个不同领域验证DERL：机器人代理（ALFWorld）、科学模拟（ScienceWorld）和数学推理（GSM8k、MATH）。实验结果显示，DERL在ALFWorld和ScienceWorld上实现了最先进的性能，显著优于依赖启发式奖励的方法，尤其是在非分发场景中。对进化轨迹的分析表明，DERL成功捕捉了任务的内在结构，实现了无需人工干预的自主体自我改进对齐。

Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model

Seedance 1.5 pro：原生视听联合生成基础模型

Authors: Siyan Chen, Yanfei Chen, Ying Chen, Zhuo Chen, Feng Cheng, Xuyan Chi, Jian Cong, Qinpeng Cui, Qide Dong, Junliang Fan, Jing Fang, Zetao Fang, Chengjian Feng, Han Feng, Mingyuan Gao, Yu Gao, Qiushan Guo, Boyang Hao, Qingkai Hao, Bibo He, Qian He, Tuyen Hoang, Ruoqing Hu, Xi Hu, Weilin Huang, Zhaoyang Huang, Zhongyi Huang, Siqi Jiang, Wei Jiang, Yunpu Jiang, Zhuo Jiang, Ashley Kim, Jianan Kong, Zhichao Lai, Shanshan Lao, Ai Li, Feiya Li, Gen Li, Huixia Li, JiaShi Li, Liang Li, Ming Li, Tao Li, Xian Li, Xiaojie Li, Xiaoyang Li, Xingxing Li, Yameng Li, Yifu Li, Yiying Li, Chao Liang, Ying Liang, Zhiqiang Liang, Wang Liao, Yalin Liao, Heng Lin, Kengyu Lin, Shanchuan Lin, Xi Lin, Zhijie Lin, Feng Ling, Fangfang Liu, Gaohong Liu, Jiawei Liu, Jie Liu, Shouda Liu, Shu Liu, Sichao Liu, Songwei Liu, Xin Liu, Xue Liu, Yibo Liu, Zikun Liu, Zuxi Liu, Junlin Lyu, Lecheng Lyu, Qian Lyu, Han Mu, Xiaonan Nie, Jingzhe Ning, Xitong Pan, Yanghua Peng, Lianke Qin, Xueqiong Qu, Yuxi Ren, Yuchen Shen, Guang Shi, Lei Shi, Yan Song, Yinglong Song, Fan Sun, Li Sun, Renfei Sun, Zeyu Sun, Wenjing Tang, Zirui Tao, Feng Wang, Furui Wang, Jinran Wang, Junkai Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.13507
Pdf link: https://arxiv.org/pdf/2512.13507
Abstract Recent strides in video generation have paved the way for unified audio-visual generation. In this work, we present Seedance 1.5 pro, a foundational model engineered specifically for native, joint audio-video generation. Leveraging a dual-branch Diffusion Transformer architecture, the model integrates a cross-modal joint module with a specialized multi-stage data pipeline, achieving exceptional audio-visual synchronization and superior generation quality. To ensure practical utility, we implement meticulous post-training optimizations, including Supervised Fine-Tuning (SFT) on high-quality datasets and Reinforcement Learning from Human Feedback (RLHF) with multi-dimensional reward models. Furthermore, we introduce an acceleration framework that boosts inference speed by over 10X. Seedance 1.5 pro distinguishes itself through precise multilingual and dialect lip-syncing, dynamic cinematic camera control, and enhanced narrative coherence, positioning it as a robust engine for professional-grade content creation. Seedance 1.5 pro is now accessible on Volcano Engine at this https URL.
中文摘要 视频生成领域的最新进展为统一视听生成铺平了道路。在本研究中，我们介绍了Seedance 1.5 pro，这是一个专为原生联合音视频生成设计的基础模型。该模型采用双分支扩散变压器架构，集成了跨模态联合模块与专用多级数据流水线，实现了卓越的视听同步和卓越的生成质量。为确保实用性，我们实施了细致的训练后优化，包括对高质量数据集进行监督微调（SFT）和基于人类反馈的强化学习（RLHF），采用多维奖励模型。此外，我们还引入了加速框架，将推理速度提升了10倍以上。Seedance 1.5 pro通过精准的多语言和方言口型同步、动态电影摄像机控制以及增强的叙事连贯性脱颖而出，使其成为专业级内容创作的强大引擎。Seedance 1.5 pro 现已可在 Volcano Engine 访问，网址为 https。

MedCEG: Reinforcing Verifiable Medical Reasoning with Critical Evidence Graph

MedCEG：用关键证据图强化可验证的医学推理

Authors: Linjie Mu, Yannian Gu, Zhongzhen Huang, Yakun Zhu, Shaoting Zhang, Xiaofan Zhang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.13510
Pdf link: https://arxiv.org/pdf/2512.13510
Abstract Large language models with reasoning capabilities have demonstrated impressive performance across a wide range of domains. In clinical applications, a transparent, step-by-step reasoning process provides physicians with strong evidence to support decision-making. While reinforcement learning has effectively enhanced reasoning performance in medical contexts, the clinical reliability of these reasoning processes remains limited because their accuracy and validity are often overlooked during training. To address this gap, we propose MedCEG, a framework that augments medical language models with clinically valid reasoning pathways by explicitly supervising the reasoning process through a Critical Evidence Graph (CEG). We curate a dataset of challenging clinical cases and algorithmically construct a CEG for each sample to represent a high-quality verifiable reasoning pathway. To guide the reasoning process, we introduce a Clinical Reasoning Procedure Reward, which evaluates Node Coverage, Structural Correctness, and Chain Completeness, thereby providing a holistic assessment of reasoning quality. Experimental results show that MedCEG surpasses existing methods in performance while producing clinically valid reasoning chains, representing a solid advancement in reliable medical AI reasoning. The code and models are available at this https URL.
中文摘要 具备推理能力的大型语言模型在广泛领域展现了令人印象深刻的性能。在临床应用中，透明的、逐步的推理过程为医生提供了有力的证据支持决策。虽然强化学习在医学环境中有效提升了推理表现，但这些推理过程的临床可靠性仍然有限，因为它们的准确性和有效性在培训过程中常被忽视。为弥补这一空白，我们提出了MedCEG，这一框架通过关键证据图（CEG）明确监督推理过程，增强具有临床有效推理通路的医学语言模型。我们策划了一批具有挑战性的临床案例数据集，并通过算法为每个样本构建了CEG，以代表高质量且可验证的推理路径。为指导推理过程，我们引入了临床推理过程奖励，评估节点覆盖度、结构正确性和链完整性，从而提供推理质量的整体评估。实验结果显示，MedCEG在性能上超越现有方法，同时产生临床有效的推理链，代表了可靠医疗AI推理的坚实进展。代码和模型可在该 https URL 访问。

Reinforcement Learning based 6-DoF Maneuvers for Microgravity Intravehicular Docking: A Simulation Study with Int-Ball2 in ISS-JEM

基于强化学习的6-DoF微重力舱内对接机动：ISS-JEM中Int-Ball2模拟研究

Authors: Aman Arora, Matteo El-Hariry, Miguel Olivares-Mendez
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.13514
Pdf link: https://arxiv.org/pdf/2512.13514
Abstract Autonomous free-flyers play a critical role in intravehicular tasks aboard the International Space Station (ISS), where their precise docking under sensing noise, small actuation mismatches, and environmental variability remains a nontrivial challenge. This work presents a reinforcement learning (RL) framework for six-degree-of-freedom (6-DoF) docking of JAXA's Int-Ball2 robot inside a high-fidelity Isaac Sim model of the Japanese Experiment Module (JEM). Using Proximal Policy Optimization (PPO), we train and evaluate controllers under domain-randomized dynamics and bounded observation noise, while explicitly modeling propeller drag-torque effects and polarity structure. This enables a controlled study of how Int-Ball2's propulsion physics influence RL-based docking performance in constrained microgravity interiors. The learned policy achieves stable and reliable docking across varied conditions and lays the groundwork for future extensions pertaining to Int-Ball2 in collision-aware navigation, safe RL, propulsion-accurate sim-to-real transfer, and vision-based end-to-end docking.
中文摘要 自主自由飞行器在国际空间站（ISS）的舱内任务中扮演着关键角色，在感知噪声、小的驱动不匹配和环境变异条件下，精确对接仍是一项非小挑战。本研究提出了一个强化学习（RL）框架，用于将JAXA的Int-Ball2机器人以六自由度（6-DoF）对接，嵌入日本实验舱（JEM）的高精度Isaac Sim模型内。利用近端策略优化（PPO），我们在域随机动力学和有界观测噪声下训练和评估控制器，同时显式建模螺旋桨阻力-扭矩效应和极性结构。这使得能够受控研究Int-Ball2的推进物理如何影响基于强化环境的对接在受限微重力内部的对接性能。该政策实现了在多种条件下的稳定可靠对接，并为未来Int-Ball2在碰撞感知导航、安全强化学习、推进精确模拟到实物传输以及基于视觉的端到端对接等方面扩展奠定了基础。
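
Illustrative sketch (not the authors' training code): Proximal Policy Optimization limits how far each update can move the policy by clipping the probability ratio in its surrogate objective. The NumPy function below computes that clipped objective for a batch. The clip range of 0.2 is a commonly used default, not a value reported for the Int-Ball2 experiments.

```python
import numpy as np


def ppo_clip_objective(new_log_prob, old_log_prob, advantage, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximized)."""
    ratio = np.exp(new_log_prob - old_log_prob)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    # Taking the minimum removes any incentive to push the ratio
    # outside the interval [1 - clip_eps, 1 + clip_eps].
    return float(np.mean(np.minimum(unclipped, clipped)))


# A ratio of 1.5 with positive advantage is capped at 1.2.
capped_gain = ppo_clip_objective(np.log([1.5]), np.zeros(1), np.array([1.0]))
# A ratio of 0.5 with negative advantage is evaluated at the clipped 0.8.
capped_loss = ppo_clip_objective(np.log([0.5]), np.zeros(1), np.array([-1.0]))
```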

How Low Can You Go? The Data-Light SE Challenge

你能降到多低？Data-Light SE 挑战

Authors: Kishan Kumar Ganguly, Tim Menzies
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2512.13524
Pdf link: https://arxiv.org/pdf/2512.13524
Abstract Much of software engineering (SE) research assumes that progress depends on massive datasets and CPU-intensive optimizers. Yet has this assumption been rigorously tested? The counter-evidence presented in this paper suggests otherwise: across dozens of optimization problems from recent SE literature, including software configuration and performance tuning, cloud and systems optimization, project and process-level decision modeling, behavioral analytics, financial risk modeling, project health prediction, reinforcement learning tasks, sales forecasting, and software testing, even with just a few dozen labels, very simple methods (e.g. diversity sampling, a minimal Bayesian learner, or random probes) achieve near 90% of the best reported results. Further, these simple methods perform just as well as more state-of-the-art optimizers like SMAC, TPE, DEHB etc. While some tasks would require better outcomes and more sampling, these results seen after a few dozen samples would suffice for many engineering needs (particularly when the goal is rapid and cost-efficient guidance rather than slow and exhaustive optimization). Our results highlight that some SE tasks may be better served by lightweight approaches that demand fewer labels and far less computation. We hence propose the data-light challenge: when will a handful of labels suffice for SE tasks? To enable a large-scale investigation of this issue, we contribute (1) a mathematical formalization of labeling, (2) lightweight baseline algorithms, and (3) results on public-domain data showing the conditions under which lightweight methods excel or fail. For the purposes of open science, our scripts and data are online at this https URL.
中文摘要 许多软件工程（SE）研究假设进展依赖于庞大的数据集和CPU密集型优化器。然而，这一假设是否被严格检验过？本文提出的反证则相反：在数十个来自近期软件工程文献中的优化问题中，包括软件配置与性能调优、云和系统优化、项目与流程级决策建模、行为分析、财务风险建模、项目健康预测、强化学习任务、销售预测和软件测试，即使只有几十个标签，方法也非常简单（例如多样性抽样，最小贝叶斯学习者或随机探针）能实现近90%的最佳报告结果。此外，这些简单方法的性能与更先进的优化器如SMAC、TPE、DEHB等相当。虽然某些任务需要更好的结果和更多的采样，但经过几十个采样后看到的这些结果足以满足许多工程需求（尤其是当目标是快速且经济地提供指导，而非缓慢且全面优化时）。我们的结果表明，某些软件工程任务可能更适合采用轻量化的方法，这些方法需要更少的标签和更少的计算量。因此，我们提出了数据轻量化的挑战：何时只需少量标签就能满足软件工程任务？为了实现对该问题的大规模调查，我们贡献了（1）标记的数学形式化，（2）轻量级基线算法，以及（3）基于公共领域数据的结果，展示了轻量级方法在何种条件下表现优异或失败。为了开放科学的目的，我们的脚本和数据已在此 https URL 上线。
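
Illustrative sketch (not the authors' scripts): the simplest of the lightweight methods named in the abstract, random probes, can be written in a few lines. The function name `random_probe`, the budget of 30 labels, and the toy objective are assumptions for illustration; the paper's actual baselines may differ.

```python
import random


def random_probe(candidates, label, budget=30, seed=0):
    """Label `budget` randomly chosen candidates and return the best one.

    `label` is the expensive evaluation; lower values are better here.
    """
    rng = random.Random(seed)
    sample = rng.sample(candidates, min(budget, len(candidates)))
    return min(sample, key=label)


# Hypothetical objective: squared distance from an unknown optimum at 500.
best = random_probe(list(range(1000)), lambda x: (x - 500) ** 2)
```

Only the sampled candidates are ever labeled, so the cost is fixed by the budget regardless of how large the candidate pool is.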

Memory in the Age of AI Agents

人工智能代理时代的记忆

Authors: Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, Senjie Jin, Jiejun Tan, Yanbin Yin, Jiongnan Liu, Zeyu Zhang, Zhongxiang Sun, Yutao Zhu, Hao Sun, Boci Peng, Zhenrong Cheng, Xuanbo Fan, Jiaxin Guo, Xinlei Yu, Zhenhong Zhou, Zewen Hu, Jiahao Huo, Junhao Wang, Yuwei Niu, Yu Wang, Zhenfei Yin, Xiaobin Hu, Yue Liao, Qiankun Li, Kun Wang, Wangchunshu Zhou, Yixin Liu, Dawei Cheng, Qi Zhang, Tao Gui, Shirui Pan, Yan Zhang, Philip Torr, Zhicheng Dou, Ji-Rong Wen, Xuanjing Huang, Yu-Gang Jiang, Shuicheng Yan
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.13564
Pdf link: https://arxiv.org/pdf/2512.13564
Abstract Memory has emerged as, and will continue to remain, a core capability of foundation model-based agents. As research on agent memory rapidly expands and attracts unprecedented attention, the field has also become increasingly fragmented. Existing works that fall under the umbrella of agent memory often differ substantially in their motivations, implementations, and evaluation protocols, while the proliferation of loosely defined memory terminologies has further obscured conceptual clarity. Traditional taxonomies such as long/short-term memory have proven insufficient to capture the diversity of contemporary agent memory systems. This work aims to provide an up-to-date landscape of current agent memory research. We begin by clearly delineating the scope of agent memory and distinguishing it from related concepts such as LLM memory, retrieval augmented generation (RAG), and context engineering. We then examine agent memory through the unified lenses of forms, functions, and dynamics. From the perspective of forms, we identify three dominant realizations of agent memory, namely token-level, parametric, and latent memory. From the perspective of functions, we propose a finer-grained taxonomy that distinguishes factual, experiential, and working memory. From the perspective of dynamics, we analyze how memory is formed, evolved, and retrieved over time. To support practical development, we compile a comprehensive summary of memory benchmarks and open-source frameworks. Beyond consolidation, we articulate a forward-looking perspective on emerging research frontiers, including memory automation, reinforcement learning integration, multimodal memory, multi-agent memory, and trustworthiness issues. We hope this survey serves not only as a reference for existing work, but also as a conceptual foundation for rethinking memory as a first-class primitive in the design of future agentic intelligence.
中文摘要 内存已成为并将继续成为基于模型的基础代理的核心能力。随着对代理记忆的研究迅速扩展并吸引前所未有的关注，该领域也变得越来越分散。现有属于代理记忆范畴的研究在动机、实现和评估协议上往往存在显著差异，而模糊定义的记忆术语的泛滥进一步模糊了概念的清晰度。传统的分类法如长短期记忆已被证明不足以涵盖当代代理记忆系统的多样性。本研究旨在提供当前代理记忆研究的最新情况。我们首先明确划分了代理记忆的范围，并将其与相关概念如LLM内存、检索增强生成（RAG）和上下文工程区分开来。随后，我们通过形式、功能和动态的统一视角审视代理记忆。从形式角度，我们识别出代理记忆的三种主要实现，即代币级、参数性记忆和潜在记忆。从函数的角度，我们提出了一种更细致的分类法，区分事实记忆、经验记忆和工作记忆。从动力学的角度，我们分析记忆如何随着时间形成、演化和恢复。为了支持实用开发，我们汇编了一份全面的内存基准测试和开源框架摘要。除了巩固，我们还阐述了对新兴研究前沿的前瞻性视角，包括记忆自动化、强化学习集成、多模态记忆、多智能体记忆以及可信度问题。我们希望这项调查不仅能作为现有工作的参考，也能成为重新思考记忆作为未来智能设计中一类原始元素的概念基础。

MMhops-R1: Multimodal Multi-hop Reasoning

MMhops-R1：多模多跳推理

Authors: Tao Zhang, Ziqi Zhang, Zongyang Ma, Yuxin Chen, Bing Li, Chunfeng Yuan, Guangting Wang, Fengyun Rao, Ying Shan, Weiming Hu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.13573
Pdf link: https://arxiv.org/pdf/2512.13573
Abstract The ability to perform multi-modal multi-hop reasoning by iteratively integrating information across various modalities and external knowledge is critical for addressing complex real-world challenges. However, existing Multi-modal Large Language Models (MLLMs) are predominantly limited to single-step reasoning, as existing benchmarks lack the complexity needed to evaluate and drive multi-hop abilities. To bridge this gap, we introduce MMhops, a novel, large-scale benchmark designed to systematically evaluate and foster multi-modal multi-hop reasoning. The MMhops dataset comprises two challenging task formats, Bridging and Comparison, which require models to dynamically construct complex reasoning chains by integrating external knowledge. To tackle the challenges posed by MMhops, we propose MMhops-R1, a novel multi-modal Retrieval-Augmented Generation (mRAG) framework for dynamic reasoning. Our framework utilizes reinforcement learning to optimize the model for autonomously planning reasoning paths, formulating targeted queries, and synthesizing multi-level information. Comprehensive experiments demonstrate that MMhops-R1 significantly outperforms strong baselines on MMhops, highlighting that dynamic planning and multi-modal knowledge integration are crucial for complex reasoning. Moreover, MMhops-R1 demonstrates strong generalization to tasks requiring fixed-hop reasoning, underscoring the robustness of our dynamic planning approach. In conclusion, our work contributes a challenging new benchmark and a powerful baseline model, and we will release the associated code, data, and weights to catalyze future research in this critical area.
中文摘要 通过迭代整合不同模态和外部知识的信息进行多模态多跳推理的能力，对于应对复杂的现实世界挑战至关重要。然而，现有的多模态大型语言模型（MLLM）主要限于单步推理，因为现有基准测试缺乏评估和驱动多跳能力所需的复杂性。为了弥合这一差距，我们引入了MMhops，一个新型、大规模的基准测试，旨在系统评估和促进多模态多跳推理。MMhops数据集包含两种具有挑战性的任务格式：桥接和比较，这要求模型通过整合外部知识动态构建复杂的推理链。为应对MMhops带来的挑战，我们提出了MMhops-R1，一种新型多模态检索增强生成（mRAG）动态推理框架。我们的框架利用强化学习优化模型，实现自主规划推理路径、制定目标查询和综合多层信息。综合实验表明，MMhops-R1在强基线上表现显著优于MMhops，强调动态规划和多模态知识整合对于复杂推理至关重要。此外，MMhops-R1 在需要固定跳推理的任务中展现了强有力的推广能力，凸显了我们动态规划方法的稳健性。总之，我们的工作提供了一个具有挑战性的新基准和强大的基线模型，并将发布相关代码、数据和权重，以推动这一关键领域的未来研究。

Image Diffusion Preview with Consistency Solver

图像扩散预览与一致性求解器

Authors: Fu-Yun Wang, Hao Zhou, Liangzhe Yuan, Sanghyun Woo, Boqing Gong, Bohyung Han, Ming-Hsuan Yang, Han Zhang, Yukun Zhu, Ting Liu, Long Zhao
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.13592
Pdf link: https://arxiv.org/pdf/2512.13592
Abstract The slow inference process of image diffusion models significantly degrades interactive user experiences. To address this, we introduce Diffusion Preview, a novel paradigm employing rapid, low-step sampling to generate preliminary outputs for user evaluation, deferring full-step refinement until the preview is deemed satisfactory. Existing acceleration methods, including training-free solvers and post-training distillation, struggle to deliver high-quality previews or ensure consistency between previews and final outputs. We propose ConsistencySolver, a lightweight, trainable high-order solver derived from general linear multistep methods and optimized via reinforcement learning, that enhances preview quality and consistency. Experimental results demonstrate that ConsistencySolver significantly improves generation quality and consistency in low-step scenarios, making it ideal for efficient preview-and-refine workflows. Notably, it achieves FID scores on par with Multistep DPM-Solver using 47% fewer steps, while outperforming distillation baselines. Furthermore, user studies indicate our approach reduces overall user interaction time by nearly 50% while maintaining generation quality. Code is available at this https URL.
中文摘要 图像扩散模型的缓慢推断过程显著降低了交互式用户体验。为此，我们引入了扩散预览（Diffusion Preview），这是一种采用快速、低步采样生成初步输出供用户评估的新范式，推迟全步骤的细化，直到预览被认为满意。现有的加速方法，包括无训练求解器和训练后蒸馏，难以提供高质量的预览，也难以确保预览与最终输出的一致性。我们提出基于通用线性多步法的一致性求解器，这是一种轻量级、可训练的高阶求解器，通过强化学习优化，提升预览质量和一致性。实验结果表明，ConsistencySolver 在低步场景中显著提升了生成质量和一致性，非常适合高效的预览和优化工作流程。值得注意的是，它在FID得分上与多步DPM-Solver持平，且步数减少47%，同时优于蒸馏基线。此外，用户研究显示，我们的方法在保持生成质量的同时，将整体用户互动时间减少近50%。代码可在此 https URL 访问。
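
Illustrative sketch (not ConsistencySolver itself): a linear multistep method advances an ODE by combining the derivative at the current step with derivatives saved from earlier steps. The classic two-step Adams-Bashforth scheme below uses the fixed textbook coefficients 3/2 and -1/2. That a trainable solver would replace such fixed coefficients with learned ones is an assumption about the general idea, not a description of the paper's parameterization.

```python
import math


def adams_bashforth2(f, t0, y0, h, steps):
    """Integrate dy/dt = f(t, y) with the two-step Adams-Bashforth method."""
    t, y = t0, y0
    f_prev = f(t, y)
    y = y + h * f_prev  # bootstrap the first step with forward Euler
    t = t + h
    for _ in range(steps - 1):
        f_curr = f(t, y)
        # Fixed textbook coefficients; a learned solver could tune these (assumption).
        y = y + h * (1.5 * f_curr - 0.5 * f_prev)
        f_prev = f_curr
        t = t + h
    return y


# Toy problem dy/dt = -y with y(0) = 1, whose exact solution is exp(-t).
approx = adams_bashforth2(lambda t, y: -y, 0.0, 1.0, 0.01, 100)
error = abs(approx - math.exp(-1.0))
```

Reusing stored derivatives is what makes multistep solvers attractive in low-step sampling: each step costs one new function evaluation while still achieving higher-order accuracy.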

Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models

Nemotron-级联：通用推理模型中的级联强化学习尺度化

Authors: Boxin Wang, Chankyu Lee, Nayeon Lee, Sheng-Chieh Lin, Wenliang Dai, Yang Chen, Yangyi Chen, Zhuolin Yang, Zihan Liu, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.13607
Pdf link: https://arxiv.org/pdf/2512.13607
Abstract Building general-purpose reasoning models with reinforcement learning (RL) entails substantial cross-domain heterogeneity, including large variation in inference-time response lengths and verification latency. Such variability complicates the RL infrastructure, slows training, and makes training curriculum (e.g., response length extension) and hyperparameter selection challenging. In this work, we propose cascaded domain-wise reinforcement learning (Cascade RL) to develop general-purpose reasoning models, Nemotron-Cascade, capable of operating in both instruct and deep thinking modes. Departing from conventional approaches that blend heterogeneous prompts from different domains, Cascade RL orchestrates sequential, domain-wise RL, reducing engineering complexity and delivering state-of-the-art performance across a wide range of benchmarks. Notably, RLHF for alignment, when used as a pre-step, boosts the model's reasoning ability far beyond mere preference optimization, and subsequent domain-wise RLVR stages rarely degrade the benchmark performance attained in earlier domains and may even improve it (see an illustration in Figure 1). Our 14B model, after RL, outperforms its SFT teacher, DeepSeek-R1-0528, on LiveCodeBench v5/v6/Pro and achieves silver-medal performance in the 2025 International Olympiad in Informatics (IOI). We transparently share our training and data recipes.
中文摘要 使用强化学习（RL）构建通用推理模型涉及显著的跨域异质性，包括推理响应时间长度和验证延迟的巨大差异。这种变异性使强化学习基础设施复杂化，减缓训练速度，并使训练课程（如响应长度延长）和超参数选择变得具有挑战性。本研究提出级联域强化学习（Cascade RL），以开发通用推理模型Nemotron-Cascade，能够在指令和深度思考模式下运行。Cascade RL 不同于传统融合不同领域异构提示的方法，能够按顺序、按领域协同进行强化学习，降低工程复杂性，并在广泛的基准测试中实现最先进的性能。值得注意的是，当RLHF作为前置步骤使用时，模型的推理能力远超单纯的偏好优化，后续按域的RLVR阶段很少会降低早期领域达到的基准性能，甚至可能有所提升（见图1示意）。我们的14B型号，继RL之后，在LiveCodeBench v5/v6/Pro上表现优于其SFT教师DeepSeek-R1-0528，并在2025年国际信息奥林匹克竞赛（IOI）中获得银牌。我们透明地分享我们的训练和数据配方。

SCR2-ST: Combine Single Cell with Spatial Transcriptomics for Efficient Active Sampling via Reinforcement Learning

SCR2-ST：结合单细胞与空间转录组学，通过强化学习实现高效主动采样

Authors: Junchao Zhu, Ruining Deng, Junlin Guo, Tianyuan Yao, Chongyu Qu, Juming Xiong, Siqi Lu, Zhengyi Lu, Yanfan Zhu, Marilyn Lionts, Yuechen Yang, Yalin Zheng, Yu Wang, Shilin Zhao, Haichun Yang, Yuankai Huo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.13635
Pdf link: https://arxiv.org/pdf/2512.13635
Abstract Spatial transcriptomics (ST) is an emerging technology that enables researchers to investigate the molecular relationships underlying tissue morphology. However, acquiring ST data remains prohibitively expensive, and traditional fixed-grid sampling strategies lead to redundant measurements of morphologically similar or biologically uninformative regions, thus resulting in scarce data that constrain current methods. The well-established single-cell sequencing field, however, could provide rich biological data as an effective auxiliary source to mitigate this limitation. To bridge these gaps, we introduce SCR2-ST, a unified framework that leverages single-cell prior knowledge to guide efficient data acquisition and accurate expression prediction. SCR2-ST integrates single-cell guided reinforcement learning-based (SCRL) active sampling with a hybrid regression-retrieval prediction network, SCR2Net. SCRL combines single-cell foundation model embeddings with spatial density information to construct biologically grounded reward signals, enabling selective acquisition of informative tissue regions under constrained sequencing budgets. SCR2Net then leverages the actively sampled data through a hybrid architecture combining regression-based modeling with retrieval-augmented inference, where a majority cell-type filtering mechanism suppresses noisy matches and retrieved expression profiles serve as soft labels for auxiliary supervision. We evaluated SCR2-ST on three public ST datasets, demonstrating SOTA performance in both sampling efficiency and prediction accuracy, particularly under low-budget scenarios. Code is publicly available at: this https URL
中文摘要 空间转录组学（ST）是一项新兴技术，使研究人员能够探究组织形态背后的分子关系。然而，获取ST数据仍然成本高昂，传统的固定网格采样策略导致形态相似或生物学信息不足的区域进行冗余测量，导致数据稀缺，限制了现有方法。然而，成熟的单细胞测序领域作为有效的辅助来源，可以提供丰富的生物学数据，缓解这一限制。为弥合这些差距，我们引入了SCR2-ST这一统一框架，利用单细胞先验知识指导高效数据采集和准确表达预测。SCR2-ST集成了单细胞引导强化学习（SCRL）主动采样和混合回归-反演预测网络SCR2Net。SCRL结合单细胞基础模型嵌入与空间密度信息，构建生物学基础的奖励信号，使得在测序预算受限下选择性获取信息性组织区域成为可能。SCR2Net随后通过结合回归建模与检索增强推断的混合架构利用主动采样的数据，采用多数细胞类型过滤机制抑制噪声匹配，检索的表达谱作为辅助监督的软标签。我们在三个公开的ST数据集上评估了SCR2-ST，展示了SOTA在采样效率和预测准确性方面的表现，尤其是在低预算场景下。代码公开可访问：此 https URL

MindDrive: A Vision-Language-Action Model for Autonomous Driving via Online Reinforcement Learning

MindDrive：一种通过在线强化学习实现自动驾驶的视觉-语言-行动模型

Authors: Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Hongwei Xie, Bing Wang, Guang Chen, Dingkang Liang, Xiang Bai
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.13636
Pdf link: https://arxiv.org/pdf/2512.13636
Abstract Current Vision-Language-Action (VLA) paradigms in autonomous driving primarily rely on Imitation Learning (IL), which introduces inherent challenges such as distribution shift and causal confusion. Online Reinforcement Learning offers a promising pathway to address these issues through trial-and-error learning. However, applying online reinforcement learning to VLA models in autonomous driving is hindered by inefficient exploration in continuous action spaces. To overcome this limitation, we propose MindDrive, a VLA framework comprising a large language model (LLM) with two distinct sets of LoRA parameters. One serves as a Decision Expert for scenario reasoning and driving decision-making, while the other acts as an Action Expert that dynamically maps linguistic decisions into feasible trajectories. By feeding trajectory-level rewards back into the reasoning space, MindDrive enables trial-and-error learning over a finite set of discrete linguistic driving decisions, instead of operating directly in a continuous action space. This approach effectively balances optimal decision-making in complex scenarios, human-like driving behavior, and efficient exploration in online reinforcement learning. MindDrive achieves strong closed-loop performance on the challenging Bench2Drive benchmark, with a Driving Score (DS) of 78.04 and a Success Rate (SR) of 55.09%. To the best of our knowledge, this is the first work to demonstrate the effectiveness of online reinforcement learning for the VLA model in autonomous driving.
中文摘要 当前自动驾驶中的视觉-语言-行动（VLA）范式主要依赖模仿学习（IL），这带来了如分布偏移和因果混淆等固有挑战。在线强化学习为通过试错学习解决这些问题提供了有前景的途径。然而，在线强化学习应用于自动驾驶中的VLA模型，由于在连续动作空间中探索效率低下，受到阻碍。为克服这一限制，我们提出了MindDrive，这是一个由大型语言模型（LLM）组成的VLA框架，该模型具有两组不同的LoRA参数。一个LLM作为决策专家，负责情景推理和决策驱动，另一个则作为行动专家，动态地将语言决策映射到可行的轨迹中。通过将轨迹级奖励反馈回推理空间，MindDrive使得在有限的离散语言驱动决策中实现试错学习，而非直接在连续行动空间中运行。这种方法有效平衡了复杂场景下的决策、类人驾驶行为以及在线强化学习中的高效探索。MindDrive在具有挑战性的Bench2Drive基准测试中实现了强劲的闭环性能，驾驶得分（DS）为78.04，成功率（SR）为55.09%。据我们所知，这是首次证明在线强化学习在VLA模型中自动驾驶的有效性。

A Scientific Reasoning Model for Organic Synthesis Procedure Generation

有机合成程序生成的科学推理模型

Authors: Guoqing Liu, Junren Li, Zihan Zhao, Eray Inanc, Krzysztof Maziarz, Jose Garrido Torres, Victor Garcia Satorras, Shoko Ueda, Christopher M. Bishop, Marwin Segler
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.13668
Pdf link: https://arxiv.org/pdf/2512.13668
Abstract Solving computer-aided synthesis planning is essential for enabling fully automated, robot-assisted synthesis workflows and improving the efficiency of drug discovery. A key challenge, however, is bridging the gap between computational route design and practical laboratory execution, particularly the accurate prediction of viable experimental procedures for each synthesis step. In this work, we present QFANG, a scientific reasoning language model capable of generating precise, structured experimental procedures directly from reaction equations, with explicit chain-of-thought reasoning. To develop QFANG, we curated a high-quality dataset comprising 905,990 chemical reactions paired with structured action sequences, extracted and processed from patent literature using large language models. We introduce a Chemistry-Guided Reasoning (CGR) framework that produces chain-of-thought data grounded in chemical knowledge at scale. The model subsequently undergoes supervised fine-tuning to elicit complex chemistry reasoning. Finally, we apply Reinforcement Learning from Verifiable Rewards (RLVR) to further enhance procedural accuracy. Experimental results demonstrate that QFANG outperforms advanced general-purpose reasoning models and nearest-neighbor retrieval baselines, measured by traditional NLP similarity metrics and a chemically aware evaluator using an LLM-as-a-judge. Moreover, QFANG generalizes to certain out-of-domain reaction classes and adapts to variations in laboratory conditions and user-specific constraints. We believe that QFANG's ability to generate high-quality synthesis procedures represents an important step toward bridging the gap between computational synthesis planning and fully automated laboratory synthesis.
中文摘要 解决计算机辅助合成规划对于实现全自动化机器人辅助合成工作流程和提升药物发现效率至关重要。然而，一个关键挑战是弥合计算路径设计与实际实验室执行之间的差距，特别是每个合成步骤中可行实验程序的准确预测。在本研究中，我们提出了QFANG，一种科学推理语言模型，能够直接从反应方程生成精确、结构化的实验程序，并具备显式的思维链推理。为开发QFANG，我们策划了一个高质量数据集，包含905,990个化学反应及结构化作用序列，这些数据通过大型语言模型从专利文献中提取和处理。我们引入了化学引导推理（CGR）框架，能够大规模生成基于化学知识的思维链数据。随后，模型经过监督式微调，以引发复杂的化学推理。最后，我们应用可验证奖励强化学习（RLVR）以进一步提升程序准确性。实验结果表明，QFANG优于传统的NLP相似度指标和化学意识评估者（如LLM作为评判）所衡量的先进通用推理模型和最近邻检索基线。此外，QFANG可推广到某些域外反应类别，并适应实验室条件和用户特定约束的差异。我们认为，QFANG 生成高质量合成程序的能力，代表了弥合计算合成规划与全自动化实验室合成之间差距的重要一步。

AgentIAD: Tool-Augmented Single-Agent for Industrial Anomaly Detection

AgentIAD：面向工业异常检测的工具增强单智能体

Authors: Junwen Miao, Penghui Du, Yi Liu, Yu Wang, Yan Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.13671
Pdf link: https://arxiv.org/pdf/2512.13671
Abstract Industrial anomaly detection (IAD) is difficult due to the scarcity of normal reference samples and the subtle, localized nature of many defects. Single-pass vision-language models (VLMs) often overlook small abnormalities and lack explicit mechanisms to compare against canonical normal patterns. We propose AgentIAD, a tool-driven agentic framework that enables multi-stage visual inspection. The agent is equipped with a Perceptive Zoomer (PZ) for localized fine-grained analysis and a Comparative Retriever (CR) for querying normal exemplars when evidence is ambiguous. To teach these inspection behaviors, we construct structured perceptive and comparative trajectories from the MMAD dataset and train the model in two stages: supervised fine-tuning followed by reinforcement learning. A two-part reward design drives this process: a perception reward that supervises classification accuracy, spatial alignment, and type correctness, and a behavior reward that encourages efficient tool use. Together, these components enable the model to refine its judgment through step-wise observation, zooming, and verification. AgentIAD achieves a new state-of-the-art 97.62% classification accuracy on MMAD, surpassing prior MLLM-based approaches while producing transparent and interpretable inspection traces.
中文摘要 由于正常参考样品稀少且许多缺陷存在微妙且局部的特性，工业异常检测（IAD）较为困难。单次视觉语言模型（VLM）常常忽略微小异常，且缺乏明确的机制与典型正常模式进行比较。我们提出了AgentIAD，一种工具驱动的智能框架，支持多阶段的视觉检查。该代理配备了感知式Zoomer（PZ），用于局部细粒度分析，以及比较检索器（CR），用于在证据存在歧义时查询正常范例。为了教授这些检查行为，我们从MMAD数据集构建结构化的感知轨迹和比较轨迹，并将模型分为两个阶段进行训练：监督微调，随后是强化学习。两部分奖励设计推动这一过程：感知奖励监督分类准确性、空间对齐和类型正确性，以及鼓励高效工具使用的行为奖励。这些组成部分共同使模型能够通过逐步观察、缩放和验证来优化判断。AgentIAD实现了MMAD上97.62%的先进分类准确率，超越以往基于MLLM的方法，同时产生透明且可解释的检查痕迹。
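
The two-part reward described in the abstract above can be pictured with a small sketch. Everything here is an assumption made for illustration: the dictionary keys, the equal weighting of the three perception terms, the IoU-based spatial term, the `max_calls` budget, and the `w_behavior` weight are hypothetical and are not taken from the paper.

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def perception_reward(pred, truth):
    """Average of classification, spatial-alignment, and defect-type terms."""
    cls = 1.0 if pred["is_anomaly"] == truth["is_anomaly"] else 0.0
    if not truth["is_anomaly"]:
        # Normal samples have no box or defect type to score.
        return cls
    loc = box_iou(pred["box"], truth["box"]) if pred.get("box") else 0.0
    typ = 1.0 if pred.get("defect_type") == truth["defect_type"] else 0.0
    return (cls + loc + typ) / 3.0


def behavior_reward(num_tool_calls, correct, max_calls=2):
    """Reward fewer tool calls, but only when the final decision is correct."""
    if not correct:
        return 0.0
    return max(0.0, 1.0 - num_tool_calls / (max_calls + 1))


def total_reward(pred, truth, num_tool_calls, w_behavior=0.2):
    """Perception reward plus a down-weighted behavior bonus (weight is assumed)."""
    correct = pred["is_anomaly"] == truth["is_anomaly"]
    return perception_reward(pred, truth) + w_behavior * behavior_reward(
        num_tool_calls, correct
    )


truth = {"is_anomaly": True, "box": (0, 0, 10, 10), "defect_type": "scratch"}
pred = {"is_anomaly": True, "box": (0, 0, 10, 10), "defect_type": "scratch"}
reward = total_reward(pred, truth, num_tool_calls=1)
```

Gating the behavior term on correctness is one possible design choice: it keeps the agent from collecting an efficiency bonus by skipping tools and guessing.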

Keyword: diffusion policy

World Models Can Leverage Human Videos for Dexterous Manipulation

世界模型可以利用人类视频进行灵巧操作

Authors: Raktim Gautam Goswami, Amir Bar, David Fan, Tsung-Yen Yang, Gaoyue Zhou, Prashanth Krishnamurthy, Michael Rabbat, Farshad Khorrami, Yann LeCun
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.13644
Pdf link: https://arxiv.org/pdf/2512.13644
Abstract Dexterous manipulation is challenging because it requires understanding how subtle hand motion influences the environment through contact with objects. We introduce DexWM, a Dexterous Manipulation World Model that predicts the next latent state of the environment conditioned on past states and dexterous actions. To overcome the scarcity of dexterous manipulation datasets, DexWM is trained on over 900 hours of human and non-dexterous robot videos. To enable fine-grained dexterity, we find that predicting visual features alone is insufficient; therefore, we introduce an auxiliary hand consistency loss that enforces accurate hand configurations. DexWM outperforms prior world models conditioned on text, navigation, and full-body actions, achieving more accurate predictions of future states. DexWM also demonstrates strong zero-shot generalization to unseen manipulation skills when deployed on a Franka Panda arm equipped with an Allegro gripper, outperforming Diffusion Policy by over 50% on average in grasping, placing, and reaching tasks.
中文摘要 灵巧操作具有挑战性，因为它需要理解细微的手部动作如何通过与物体的接触影响环境。我们介绍DexWM，一种灵巧操作世界模型，基于过去状态和灵巧动作预测环境的下一个潜在状态。为弥补灵巧操作数据集的稀缺，DexWM基于900多小时的人类和非灵巧机器人视频进行训练。为了实现细粒度的灵巧性，我们发现仅靠预测视觉特征是不够的；因此，我们引入辅助的手部一致性损失，以强制获得准确的手部姿态。DexWM优于以往以文本、导航和全身动作为条件的世界模型，实现了对未来状态的更准确预测。当DexWM部署在配备Allegro夹爪的Franka Panda机械臂上时，还展现出对未见过的操作技能的强大零样本泛化能力，在抓取、放置和够取任务上平均比Diffusion Policy高出50%以上。

Keyword: reinforcement learning

Reinforcement Learning for Latent-Space Thinking in LLMs

LLM潜空间思维的强化学习

Hierarchical Task Offloading and Trajectory Optimization in Low-Altitude Intelligent Networks Via Auction and Diffusion-based MARL

低高度智能网络中的分层任务卸载与轨迹优化，通过基于拍卖和扩散的MARL实现

WAM-Diff: A Masked Diffusion VLA Framework with MoE and Online Reinforcement Learning for Autonomous Driving

WAM-Diff：一个具备MoE和在线强化学习的掩码扩散VLA框架，用于自动驾驶

Mirror Mode in Fire Emblem: Beating Players at their own Game with Imitation and Reinforcement Learning

火焰纹章中的镜像模式：通过模仿与强化学习击败玩家

Safe Learning for Contact-Rich Robot Tasks: A Survey from Classical Learning-Based Methods to Safe Foundation Models

为接触丰富机器人任务提供安全学习：从经典基于学习的方法到安全基础模型的综述

Evolutionary Reinforcement Learning based AI tutor for Socratic Interdisciplinary Instruction

基于进化强化学习的苏格拉底跨学科教学人工智能导师

A Review of Learning-Based Motion Planning: Toward a Data-Driven Optimal Control Approach

基于学习的运动规划综述：迈向数据驱动的最优控制方法

Learning to Extract Context for Context-Aware LLM Inference

学习如何提取上下文以实现上下文感知的大型语言模型推理

Policy Gradient Algorithms for Age-of-Information Cost Minimization

用于信息年龄（Age-of-Information）成本最小化的策略梯度算法

Goal Reaching with Eikonal-Constrained Hierarchical Quasimetric Reinforcement Learning

基于Eikonal约束的分层拟度量强化学习的目标到达

Learning to Get Up Across Morphologies: Zero-Shot Recovery with a Unified Humanoid Policy

学习跨形态起身：基于统一人形机器人策略的零样本恢复

Moment and Highlight Detection via MLLM Frame Segmentation

通过MLLM帧分割进行时刻和高光检测

A Conflict-Aware Resource Management Framework for the Computing Continuum

计算连续体的冲突感知资源管理框架

The Role of AI in Modern Penetration Testing

人工智能在现代渗透测试中的作用

ElasticVR: Elastic Task Computing in Multi-User Multi-Connectivity Wireless Virtual Reality (VR) Systems

ElasticVR：多用户多连接无线虚拟现实（VR）系统中的弹性任务计算

Sim2Real Reinforcement Learning for Soccer skills

Sim2Real 足球技能强化学习

HetRL: Efficient Reinforcement Learning for LLMs in Heterogeneous Environments

HetRL：异构环境中LLM的高效强化学习

More Than the Final Answer: Improving Visual Extraction and Logical Consistency in Vision-Language Models

超越最终答案：提升视觉语言模型中的视觉提取和逻辑一致性

Adaptive Detector-Verifier Framework for Zero-Shot Polyp Detection in Open-World Settings

开放世界环境中零样本息肉检测的自适应检测器-验证器框架

World Models Unlock Optimal Foraging Strategies in Reinforcement Learning Agents

世界模型解锁强化学习智能体的最优觅食策略

Coupled Variational Reinforcement Learning for Language Model General Reasoning

语言模型通用推理中的耦合变分强化学习

CogDoc: Towards Unified thinking in Documents

CogDoc：迈向文档中的统一思维

Reassessing the Role of Supervised Fine-Tuning: An Empirical Study in VLM Reasoning

重新评估监督式微调的作用：VLM推理中的实证研究

Synergizing Code Coverage and Gameplay Intent: Coverage-Aware Game Playtesting with LLM-Guided Reinforcement Learning

协同代码覆盖率与游戏意图：覆盖感知游戏测试与大语言模型引导强化学习

Self-Motivated Growing Neural Network for Adaptive Architecture via Local Structural Plasticity

通过局部结构可塑性的自驱成长神经网络实现适应性架构

CoDA: A Context-Decoupled Hierarchical Agent with Reinforcement Learning

CoDA：带有强化学习的上下文解耦层级代理

Distributed Reinforcement Learning using Local Smart Meter Data for Voltage Regulation in Distribution Networks

利用本地智能电表数据进行配电网络电压调节的分布式强化学习

Information-Consistent Language Model Recommendations through Group Relative Policy Optimization

通过群组相对策略优化实现信息一致的语言模型推荐

LLM-based Personalized Portfolio Recommender: Integrating Large Language Models and Reinforcement Learning for Intelligent Investment Strategy Optimization

基于LLM的个性化投资组合推荐工具：整合大型语言模型与强化学习以实现智能投资策略优化

QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management

QwenLong-L1.5：长上下文推理与记忆管理的训练后配方

Tackling Snow-Induced Challenges: Safe Autonomous Lane-Keeping with Robust Reinforcement Learning

应对积雪带来的挑战：基于鲁棒强化学习的安全自主车道保持

Learning Terrain Aware Bipedal Locomotion via Reduced Dimensional Perceptual Representations

通过简化维度的感知表征学习地形感知双足行走

What Happens Next? Next Scene Prediction with a Unified Video Model

接下来会发生什么？使用统一视频模型预测下一场景

GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training

GTR-Turbo：合并检查点实际上是代理VLM训练的免费教师

Deep Q-Learning-Based Intelligent Scheduling for ETL Optimization in Heterogeneous Data Environments

基于深度Q学习的智能调度，用于异构数据环境中的ETL优化

M-GRPO: Stabilizing Self-Supervised Reinforcement Learning for Large Language Models with Momentum-Anchored Policy Optimization

M-GRPO：基于动量锚定策略优化的大型语言模型稳定自监督强化学习

PvP: Data-Efficient Humanoid Robot Learning with Proprioceptive-Privileged Contrastive Representations

PvP：数据高效的类人机器人学习，带有本体感觉特权的对比表征

ADHint: Adaptive Hints with Difficulty Priors for Reinforcement Learning

ADHint：带有难度先验的自适应提示用于强化学习

Toward Self-Healing Networks-on-Chip: RL-Driven Routing in 2D Torus Architectures

迈向自愈的片上网络：二维环面架构中的强化学习驱动路由

TraPO: A Semi-Supervised Reinforcement Learning Framework for Boosting LLM Reasoning