Arxiv Papers of Today

Generated: 2026-01-06 16:36:01 (UTC+8); Arxiv release time: 2026-01-06 20:00 EST (2026-01-07 09:00 UTC+8)

42 related papers today

Keyword: reinforcement learning

Horizon Reduction as Information Loss in Offline Reinforcement Learning

视界约简作为离线强化学习中的信息丢失

Authors: Uday Kumar Nidadala, Venkata Bhumika Guthi
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.00831
Pdf link: https://arxiv.org/pdf/2601.00831
Abstract Horizon reduction is a common design strategy in offline reinforcement learning (RL), used to mitigate long-horizon credit assignment, improve stability, and enable scalable learning through truncated rollouts, windowed training, or hierarchical decomposition (Levine et al., 2020; Prudencio et al., 2023; Park et al., 2025). Despite recent empirical evidence that horizon reduction can improve scaling on challenging offline RL benchmarks, its theoretical implications remain underdeveloped (Park et al., 2025). In this paper, we show that horizon reduction can induce fundamental and irrecoverable information loss in offline RL. We formalize horizon reduction as learning from fixed-length trajectory segments and prove that, under this paradigm and any learning interface restricted to fixed-length trajectory segments, optimal policies may be statistically indistinguishable from suboptimal ones even with infinite data and perfect function approximation. Through a set of minimal counterexample Markov decision processes (MDPs), we identify three distinct structural failure modes: (i) prefix indistinguishability leading to identifiability failure, (ii) objective misspecification induced by truncated returns, and (iii) offline dataset support and representation aliasing. Our results establish necessary conditions under which horizon reduction can be safe and highlight intrinsic limitations that cannot be overcome by algorithmic improvements alone, complementing algorithmic work on conservative objectives and distribution shift that addresses a different axis of offline RL difficulty (Fujimoto et al., 2019; Kumar et al., 2020; Gulcehre et al., 2020).
中文摘要 视距缩减是离线强化学习（RL）中常见的设计策略，用于减轻长视野学分分配、提升稳定性，并通过截断展开、窗口训练或层级分解实现可扩展学习（Levine 等，2020;Prudencio等，2023;Park 等，2025）。尽管近期实证证据表明视野缩减可以改善具有挑战性的离线强化学习基准的尺度，但其理论意义仍然不充分（Park 等，2025）。本文展示了视距缩减可以诱导离线强化学习中根本且不可恢复的信息丢失。我们将视界约简形式化为从固定长度轨迹段学习，并证明在该范式及任何限制于固定长度轨迹段的学习接口下，即使数据无限且函数近似完美，最优策略也可能在统计上无法区分次优策略。通过一组最小反例马尔可夫决策过程（MDP），我们识别出三种不同的结构性失效模式：（i）前缀不可区分性导致可识别性失败，（ii）截断返回引发的客观错误指定，以及（iii）离线数据集支持与表示混叠。我们的结果确立了视距缩减安全的必要条件，并突出了仅靠算法改进无法克服的内在限制，补充了在保守目标和分布偏移方面的算法工作，解决了离线强化学习难度的不同轴（Fujimoto 等，2019;Kumar 等，2020;Gulcehre 等，2020）。
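Failure mode (ii) in the abstract above, objective misspecification induced by truncated returns, can be illustrated with a toy example. This is a minimal sketch, not one of the paper's counterexample MDPs: the two reward sequences, the action names, and the helper functions are hypothetical and chosen only to show that ranking actions by a truncated return can disagree with ranking them by the full return.

```python
# Hypothetical toy setup: each action deterministically yields a fixed reward sequence.
reward_sequences = {
    "greedy": [1.0, 0.0, 0.0, 0.0, 0.0],    # small reward now, nothing later
    "patient": [0.0, 0.0, 0.0, 0.0, 10.0],  # large reward only at the final step
}


def truncated_return(rewards, horizon):
    """Undiscounted sum of the first `horizon` rewards."""
    return sum(rewards[:horizon])


def best_action(horizon):
    """Action that maximizes the return truncated at `horizon` steps."""
    return max(
        reward_sequences,
        key=lambda action: truncated_return(reward_sequences[action], horizon),
    )


short_choice = best_action(horizon=3)  # learner sees only 3-step segments
full_choice = best_action(horizon=5)   # learner sees the whole trajectory
print(short_choice, full_choice)
```

With a 3-step window the delayed reward never enters the objective, so the truncated objective prefers a different action than the full-horizon objective, regardless of how much data is collected.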

SmartFlow Reinforcement Learning and Agentic AI for Bike-Sharing Optimisation

SmartFlow强化学习与智能人工智能用于共享单车优化

Authors: Aditya Sreevatsa K, Arun Kumar Raveendran, Jesrael K Mani, Prakash G Shigli, Rajkumar Rangadore, Narayana Darapaneni, Anwesh Reddy Paduri
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.00868
Pdf link: https://arxiv.org/pdf/2601.00868
Abstract SmartFlow is a multi-layered framework that integrates Reinforcement Learning and Agentic AI to address the dynamic rebalancing problem in urban bike-sharing services. Its architecture separates strategic, tactical, and communication functions for clarity and scalability. At the strategic level, a Deep Q-Network (DQN) agent, trained in a high-fidelity simulation of New York's Citi Bike network, learns robust rebalancing policies by modelling the challenge as a Markov Decision Process. These high-level strategies feed into a deterministic tactical module that optimises multi-leg journeys and schedules just-in-time dispatches to minimise fleet travel. Evaluation across multiple seeded runs demonstrates SmartFlow's high efficacy, reducing network imbalance by over 95% while requiring minimal travel distance and achieving strong truck utilisation. A communication layer, powered by a grounded Agentic AI with a Large Language Model (LLM), translates logistical plans into clear, actionable instructions for operational staff, ensuring interpretability and execution readiness. This integration bridges machine intelligence with human operations, offering a scalable solution that reduces idle time, improves bike availability, and lowers operational costs. SmartFlow provides a blueprint for interpretable, AI-driven logistics in complex urban mobility networks.
中文摘要 SmartFlow是一个多层次框架，集成了强化学习和代理人工智能，解决城市共享自行车服务中的动态再平衡问题。其架构将战略、战术和通信功能分离，以提高清晰性和可扩展性。在战略层面，经过纽约Citi Bike网络高保真模拟训练的深度Q网络（DQN）代理，通过将挑战建模为马尔可夫决策过程，学习稳健的再平衡策略。这些高层战略汇入一个确定性战术模块，优化多段行程并安排准时调度，以减少车队行驶里程。多次固定随机种子运行的评估显示，SmartFlow效果显著，减少网络不平衡超过95%，同时只需极少的行驶距离，并实现较高的卡车利用率。由基于大型语言模型（LLM）的代理人工智能驱动的通信层，将后勤计划转化为清晰、可操作的指令，供运营人员使用，确保可解释性和执行准备度。这种整合连接了机器智能与人类操作，提供了可扩展的解决方案，减少闲置时间，改善自行车可用性，并降低运营成本。SmartFlow为复杂城市出行网络中可解释的AI驱动物流提供了蓝图。

VideoCuRL: Video Curriculum Reinforcement Learning with Orthogonal Difficulty Decomposition

VideoCuRL：基于正交难度分解的视频课程强化学习

Authors: Hongbo Jin, Kuanwei Lin, Wenhao Zhang, Yichen Jin, Ge Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.00887
Pdf link: https://arxiv.org/pdf/2601.00887
Abstract Reinforcement Learning (RL) is crucial for empowering VideoLLMs with complex spatiotemporal reasoning. However, current RL paradigms predominantly rely on random data shuffling or naive curriculum strategies based on scalar difficulty metrics. We argue that scalar metrics fail to disentangle two orthogonal challenges in video understanding: Visual Temporal Perception Load and Cognitive Reasoning Depth. To address this, we propose VideoCuRL, a novel framework that decomposes difficulty into these two axes. We employ efficient, training-free proxies (optical flow and keyframe entropy for visual complexity, Calibrated Surprisal for cognitive complexity) to map data onto a 2D curriculum grid. A competence-aware Diagonal Wavefront strategy then schedules training from base alignment to complex reasoning. Furthermore, we introduce Dynamic Sparse KL and Structured Revisiting to stabilize training against reward collapse and catastrophic forgetting. Extensive experiments show that VideoCuRL surpasses strong RL baselines on reasoning (+2.5 on VSI-Bench) and perception (+2.9 on VideoMME) tasks. Notably, VideoCuRL eliminates the prohibitive inference overhead of generation-based curricula, offering a scalable solution for robust video post-training.
中文摘要 强化学习（RL）对于赋能视频大型语言模型（VideoLLM）具备复杂的时空推理至关重要。然而，当前的强化学习范式主要依赖于随机数据洗牌或基于标量难度指标的朴素课程策略。我们认为标量指标未能解开视频理解中的两个正交挑战：视觉时间感知负载和认知推理深度。为此，我们提出了VideoCuRL，一种将难度分解为这两条轴的新框架。我们采用高效且无需训练的代理指标（以光流和关键帧熵衡量视觉复杂度，以校准惊异度（Calibrated Surprisal）衡量认知复杂度），将数据映射到二维课程网格上。一个能力感知的对角波前策略随后安排从基础对齐到复杂推理的训练。此外，我们引入了动态稀疏KL（Dynamic Sparse KL）和结构化重访，以稳定训练，防止奖励崩溃和灾难性遗忘。大量实验表明，VideoCuRL在推理（VSI-Bench上+2.5）和感知（VideoMME上+2.9）任务上均超越了强大的强化学习基线。值得注意的是，VideoCuRL消除了基于生成的课程中高昂的推理开销，为稳健的视频后训练提供了可扩展的解决方案。
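The 2D curriculum grid and diagonal ordering described above can be pictured with a short sketch. This is only one plausible reading, under stated assumptions: samples are already binned along the two difficulty axes, and the schedule visits cells in order of increasing combined bin index. The function name and the bin counts are hypothetical, and the competence-aware gating mentioned in the abstract is not modelled.

```python
def diagonal_wavefront(n_visual_bins, n_reasoning_bins):
    """Group grid cells (visual_bin, reasoning_bin) into stages by anti-diagonal.

    Stage k contains every cell whose two bin indices sum to k, so training
    would move from cells that are easy on both axes toward cells that are
    hard on both axes.
    """
    n_stages = n_visual_bins + n_reasoning_bins - 1
    stages = [[] for _ in range(n_stages)]
    for visual_bin in range(n_visual_bins):
        for reasoning_bin in range(n_reasoning_bins):
            stages[visual_bin + reasoning_bin].append((visual_bin, reasoning_bin))
    return stages


# Example with a hypothetical 3 x 3 grid of difficulty bins.
for stage_index, cells in enumerate(diagonal_wavefront(3, 3)):
    print(stage_index, cells)
```

A scalar difficulty score would collapse each anti-diagonal into a single value; keeping the cells separate is what lets a scheduler treat "visually hard, logically easy" differently from "visually easy, logically hard".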

Dichotomous Diffusion Policy Optimization

二分扩散策略优化

Authors: Ruiming Liang, Yinan Zheng, Kexin Zheng, Tianyi Tan, Jianxiong Li, Liyuan Mao, Zhihao Wang, Guang Chen, Hangjun Ye, Jingjing Liu, Jinqiao Wang, Xianyuan Zhan
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.00898
Pdf link: https://arxiv.org/pdf/2601.00898
Abstract Diffusion-based policies have gained growing popularity in solving a wide range of decision-making tasks due to their superior expressiveness and controllable generation during inference. However, effectively training large diffusion policies using reinforcement learning (RL) remains challenging. Existing methods either suffer from unstable training due to directly maximizing value objectives, or face computational issues due to relying on crude Gaussian likelihood approximation, which requires a large number of sufficiently small denoising steps. In this work, we propose DIPOLE (Dichotomous diffusion Policy improvement), a novel RL algorithm designed for stable and controllable diffusion policy optimization. We begin by revisiting the KL-regularized objective in RL, which offers a desirable weighted regression objective for diffusion policy extraction, but often struggles to balance greediness and stability. We then formulate a greedified policy regularization scheme, which naturally enables decomposing the optimal policy into a pair of stably learned dichotomous policies: one aims at reward maximization, and the other focuses on reward minimization. Under such a design, optimized actions can be generated by linearly combining the scores of dichotomous policies during inference, thereby enabling flexible control over the level of greediness. Evaluations in offline and offline-to-online RL settings on ExORL and OGBench demonstrate the effectiveness of our approach. We also use DIPOLE to train a large vision-language-action (VLA) model for end-to-end autonomous driving (AD) and evaluate it on the large-scale real-world AD benchmark NAVSIM, highlighting its potential for complex real-world applications.
中文摘要 基于扩散的策略因其卓越的表达性和推理过程中可控生成的能力，在解决各种决策任务中越来越受欢迎。然而，有效利用强化学习（RL）训练大规模扩散策略仍然具有挑战性。现有方法要么因直接最大化值目标而导致训练不稳定，要么因依赖粗糙的高斯似然近似而面临计算问题，这需要大量足够小的去噪步骤。本研究提出了DIPOLE（二分扩散策略改进），这是一种用于稳定且可控扩散策略优化的新型强化学习算法。我们首先回顾了强化学习中的 KL-正则化目标，它提供了一个理想的加权回归目标以提取扩散策略，但常常难以平衡贪婪与稳定性。随后，我们提出了一个贪婪化的策略正则化方案，自然地将最优策略分解为一对稳定学习的二分策略：一个旨在奖励最大化，另一个专注于奖励最小化。在这种设计下，通过在推理时线性组合二分策略的分数，可以生成优化动作，从而灵活控制贪婪程度。在ExORL和OGBench上的离线及离线到在线强化学习设置中的评估展示了我们方法的有效性。我们还使用DIPOLE训练一个大型视觉-语言-动作（VLA）模型，用于端到端自动驾驶（AD），并在大规模现实世界自动驾驶基准测试NAVSIM上进行评估，凸显其在复杂现实应用中的潜力。

DVGBench: Implicit-to-Explicit Visual Grounding Benchmark in UAV Imagery with Large Vision-Language Models

DVGBench：无人机影像中基于大型视觉语言模型的隐式到显式视觉基础基准测试

Authors: Yue Zhou, Jue Chen, Zilun Zhang, Penghui Huang, Ran Ding, Zhentao Zou, PengFei Gao, Yuchen Wei, Ke Li, Xue Yang, Xue Jiang, Hongxin Yang, Jonathan Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.00998
Pdf link: https://arxiv.org/pdf/2601.00998
Abstract Remote sensing (RS) large vision-language models (LVLMs) have shown strong promise across visual grounding (VG) tasks. However, existing RS VG datasets predominantly rely on explicit referring expressions, such as relative position, relative size, and color cues, thereby constraining performance on implicit VG tasks that require scenario-specific domain knowledge. This article introduces DVGBench, a high-quality implicit VG benchmark for drones, covering six major application scenarios: traffic, disaster, security, sport, social activity, and productive activity. Each object provides both explicit and implicit queries. Based on the dataset, we design DroneVG-R1, an LVLM that integrates the novel Implicit-to-Explicit Chain-of-Thought (I2E-CoT) within a reinforcement learning paradigm. This enables the model to take advantage of scene-specific expertise, converting implicit references into explicit ones and thus reducing grounding difficulty. Finally, an evaluation of mainstream models on both explicit and implicit VG tasks reveals substantial limitations in their reasoning capabilities. These findings provide actionable insights for advancing the reasoning capacity of LVLMs for drone-based agents. The code and datasets will be released at this https URL
中文摘要 遥感（RS）大型视觉语言模型（LVLM）在视觉定位（VG）任务中展现出强大潜力。然而，现有的RS VG数据集主要依赖显式的指代表达（如相对位置、相对大小和颜色线索），从而限制了需要特定场景领域知识的隐式VG任务的性能。本文介绍了DVGBench，一个高质量的无人机隐式VG基准，涵盖六大主要应用场景：交通、灾害、安全、体育、社交活动和生产活动。每个对象都提供显式和隐式查询。基于该数据集，我们设计了DroneVG-R1，这是一种LVLM，将新颖的隐式到显式思维链（I2E-CoT）整合进强化学习范式中。这使得模型能够利用场景特定的专业知识，将隐式指代转化为显式指代，从而降低定位难度。最后，对主流模型在显式和隐式VG任务上的评估显示其推理能力存在显著局限。这些发现为提升面向无人机智能体的LVLM的推理能力提供了可操作的见解。代码和数据集将在此 https URL 下发布

Performance and Security Aware Distributed Service Placement in Fog Computing

雾计算中性能与安全感知的分布式服务部署

Authors: Mohammad Goudarzi, Arash Shaghaghi, Zhiyu Wang, Rajkumar Buyya
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2601.01125
Pdf link: https://arxiv.org/pdf/2601.01125
Abstract The rapid proliferation of IoT applications has intensified the demand for efficient and secure service placement in Fog computing. However, heterogeneous resources, dynamic workloads, and diverse security requirements make optimal service placement highly challenging. Most solutions focus primarily on performance metrics while overlooking the security implications of deployment decisions. This paper proposes a Security and Performance-Aware Distributed Deep Reinforcement Learning (SPA-DDRL) framework for joint optimization of service response time and security compliance in Fog computing. The problem is formulated as a weighted multi-objective optimization task, minimizing latency while maximizing a security score derived from the security capabilities of Fog nodes. The security score features a new three-tier hierarchy, where configuration-level checks verify proper settings, capability-level assessments evaluate the resource security features, and control-level evaluations enforce stringent policies, thereby ensuring compliant solutions that align with performance objectives. SPA-DDRL adopts a distributed broker-learner architecture where multiple brokers perform autonomous service-placement decisions and a centralized learner coordinates global policy optimization through shared prioritized experiences. It integrates three key improvements, including Long Short-Term Memory networks, Prioritized Experience Replay, and off-policy correction mechanisms to improve the agent's performance. Experiments based on real IoT workloads show that SPA-DDRL significantly improves both service response time and placement security compared to current approaches, achieving a 16.3% improvement in response time and a 33% faster convergence rate. It also maintains consistent, feasible, security-compliant solutions across all system scales, while baseline techniques fail or show performance degradation.
中文摘要 物联网应用的快速普及加剧了雾计算中对高效且安全服务部署的需求。然而，异构资源、动态工作负载和多样化的安全需求使得最佳服务部署极具挑战性。大多数解决方案主要关注性能指标，忽视部署决策的安全影响。本文提出了一个安全与性能感知的分布式深度强化学习（SPA-DDRL）框架，用于联合优化雾计算中的服务响应时间和安全合规。该问题被表述为一个加权多目标优化任务，旨在最小化延迟，同时最大化由雾节点安全能力得出的安全评分。安全评分采用了新的三层级结构，配置级检查验证设置正确，能力级评估评估资源安全特性，控制级评估执行严格政策，确保解决方案符合性能目标。SPA-DDRL采用分布式经纪人-学习者架构，多个经纪人自主执行服务配置决策，集中式学习者通过共享优先级经验协调全局策略优化。它集成了三项关键改进，包括长短期记忆网络、优先级体验回放和非策略修正机制，以提升代理性能。基于真实物联网工作负载的实验显示，SPA-DDRL相比现有方法显著提升了服务响应时间和部署安全性，响应时间提升了16.3%，收敛速度提升了33%。它还能在所有系统规模下保持一致、可行且符合安全标准的解决方案，而基线技术则失败或性能下降。

Latent Space Reinforcement Learning for Multi-Robot Exploration

多机器人探索的潜空间强化学习

Authors: Sriram Rajasekar, Ashwini Ratnoo
Subjects: Robotics (cs.RO); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2601.01139
Pdf link: https://arxiv.org/pdf/2601.01139
Abstract Autonomous mapping of unknown environments is a critical challenge, particularly in scenarios where time is limited. Multi-agent systems can enhance efficiency through collaboration, but the scalability of motion-planning algorithms remains a key limitation. Reinforcement learning has been explored as a solution, but existing approaches are constrained by the limited input size required for effective learning, restricting their applicability to discrete environments. This work addresses that limitation by leveraging autoencoders to perform dimensionality reduction, compressing high-fidelity occupancy maps into latent state vectors while preserving essential spatial information. Additionally, we introduce a novel procedural generation algorithm based on Perlin noise, designed to generate topologically complex training environments that simulate asteroid fields, caves and forests. These environments are used for training the autoencoder and the navigation algorithm using a hierarchical deep reinforcement learning framework for decentralized coordination. We introduce a weighted consensus mechanism that modulates reliance on shared data via a tuneable trust parameter, ensuring robustness to accumulation of errors. Experimental results demonstrate that the proposed system scales effectively with number of agents and generalizes well to unfamiliar, structurally distinct environments and is resilient in communication-constrained settings.
中文摘要 对未知环境进行自主测绘是一个关键挑战，尤其是在时间有限的场景下。多智能体系统可以通过协作提升效率，但动作规划算法的可扩展性仍是一个关键限制。强化学习作为解决方案被探索过，但现有方法受限于有效学习所需的输入量，限制了其在离散环境中的适用性。这项工作通过利用自编码器进行降维，将高保真占用映射压缩为潜态矢量，同时保留重要的空间信息，解决了这一限制。此外，我们还引入了基于Perlin噪声的新型程序生成算法，旨在生成拓扑复杂的训练环境，模拟小行星带、洞穴和森林。这些环境用于通过分层深度强化学习框架训练自编码器和导航算法，实现去中心化协调。我们引入了一种加权共识机制，通过可调优的信任参数调节对共享数据的依赖，确保对错误累积的鲁棒性。实验结果表明，所提系统能有效扩展到代理数量，并能很好地推广到陌生且结构差异化的环境中，并且在通信受限的环境中具有韧性。

ORION: Option-Regularized Deep Reinforcement Learning for Cooperative Multi-Agent Online Navigation

ORION：用于合作多智能体在线导航的选项正则化深度强化学习

Authors: Zhang Shizhe, Liang Jingsong, Zhou Zhitao, Ye Shuhan, Wang Yizhuo, Tan Ming Siang Derek, Chiun Jimmy, Cao Yuhong, Sartoretti Guillaume
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.01155
Pdf link: https://arxiv.org/pdf/2601.01155
Abstract Existing methods for multi-agent navigation typically assume fully known environments, offering limited support for partially known scenarios such as warehouses or factory floors. There, agents may need to plan trajectories that balance their own path optimality with their ability to collect and share information about the environment that can help their teammates reach their own goals. To these ends, we propose ORION, a novel deep reinforcement learning framework for cooperative multi-agent online navigation in partially known environments. Starting from an imperfect prior map, ORION trains agents to make decentralized decisions, coordinate to reach their individual targets, and actively reduce map uncertainty by sharing online observations in a closed perception-action loop. We first design a shared graph encoder that fuses prior map with online perception into a unified representation, providing robust state embeddings under dynamic map discrepancies. At the core of ORION is an option-critic framework that learns to reason about a set of high-level cooperative modes that translate into sequences of low-level actions, allowing agents to switch between individual navigation and team-level exploration adaptively. We further introduce a dual-stage cooperation strategy that enables agents to assist teammates under map uncertainty, thereby reducing the overall makespan. Across extensive maze-like maps and large-scale warehouse environments, our simulation results show that ORION achieves high-quality, real-time decentralized cooperation over varying team sizes, outperforming state-of-the-art classical and learning-based baselines. Finally, we validate ORION on physical robot teams, demonstrating its robustness and practicality for real-world cooperative navigation.
中文摘要 现有的多智能体导航方法通常假设完全已知的环境，对部分已知的场景（如仓库或工厂车间）支持有限。在那里，代理可能需要规划轨迹，既能平衡自身路径的最优性，又能收集和共享环境信息，帮助队友实现自身目标。为此，我们提出了ORION，一种用于部分已知环境中合作多智能体在线导航的新型深度强化学习框架。从不完美的先验地图出发，ORION训练代理做出去中心化决策，协调以达成各自目标，并通过在封闭的感知-行动循环中共享在线观测数据，主动减少地图不确定性。我们首先设计了一个共享图编码器，将先前的映射与在线感知融合为统一表示，在动态映射差异下提供稳健的状态嵌入。ORION的核心是一个选项批判框架，能够学习推理一组高级合作模式，这些模式转化为一系列低级动作，使代理能够自适应地在个人导航和团队层面探索之间切换。我们还引入了双阶段合作策略，使特工能够在地图不确定性下协助队友，从而缩短整体完成时间。在广泛的迷宫式地图和大规模仓库环境中，我们的模拟结果表明，ORION能够在不同团队规模下实现高质量、实时的去中心化协作，优于最先进的经典和基于学习的基线。最后，我们在物理机器人团队中验证了ORION的应用，展示了其在现实世界协作导航中的稳健性和实用性。

Reinforcement Learning Based Whittle Index Policy for Scheduling Wireless Sensors

基于强化学习的Whittle索引计划无线传感器调度策略

Authors: Sokipriala Jonah, Seong Ki Yoo, Saurav Sthapit
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2601.01179
Pdf link: https://arxiv.org/pdf/2601.01179
Abstract Wireless Sensor nodes used in remote monitoring applications typically transmit data in a timely manner, often optimising for the Age of Information (AoI). However, this approach focuses solely on keeping updates at the sink fresh without considering the actual value of each update. This method is inefficient for sensor networks, which have limited resources. Transmitting data indiscriminately without evaluating its significance not only increases energy consumption but also raises storage requirements. To address this challenge, we propose a reinforcement learning-based scheduling strategy that prioritises sensor transmissions based on the Age of Incorrect Information (AoII) using an edge mining technique. This ensures that the most valuable updates are received while reducing energy consumption. We frame the scheduling problem as a Restless Multi-Armed Bandit (RMAB) and introduce a Whittle Index-based Q-learning (WIQL) policy to dynamically select the most informative sensors. Additionally, we employ an edge mining technique, where raw sensor data is processed locally before transmission, enhancing state estimation at the sink. Experimental results demonstrate that WIQL achieves near-optimal performance while significantly reducing the number of transmitted packets by up to 70%. This reinforcement learning-based approach provides a scalable and adaptive solution for efficient data scheduling in resource-constrained WSNs.
中文摘要 用于远程监控应用的无线传感器节点通常能够及时传输数据，常常针对信息年龄（AoI）进行优化。然而，这种方法仅专注于保持汇聚节点处更新的新鲜度，而忽视了每次更新的实际价值。这种方法对于资源有限的传感器网络来说效率较低。在不评估数据重要性的情况下随意传输数据，不仅增加了能源消耗，也增加了存储需求。为应对这一挑战，我们提出了一种基于强化学习的调度策略，基于错误信息年龄（AoII）优先处理传感器传输，并采用边缘挖掘技术。这确保了最有价值的更新被接收到，同时降低能源消耗。我们将调度问题建模为一个不安定多臂老虎机（RMAB），并引入基于Whittle指数的Q-learning（WIQL）策略，动态选择最具信息量的传感器。此外，我们还采用了边缘挖掘技术，在传输前先在本地处理原始传感器数据，增强了汇聚节点处的状态估计。实验结果表明，WIQL在将传输数据包数量显著减少多达70%的同时，实现了接近最优的性能。这种基于强化学习的方法为资源受限的WSN中高效的数据调度提供了可扩展且自适应的解决方案。
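The contrast between freshness-driven and value-driven transmission can be made concrete with a small sketch. It assumes the commonly used linear form of the Age of Incorrect Information, where the age grows by one for every time step in which the sink's estimate differs from the source state and resets to zero when they agree. The function name, the example state sequence, and the transmission schedules are hypothetical, and this is not the paper's WIQL policy or its edge mining step.

```python
def aoii_trace(source_states, transmit_flags):
    """Per-step Age of Incorrect Information at the sink (assumed linear form).

    The sink keeps the last received state as its estimate. The age counts
    consecutive steps in which that estimate is wrong and resets when correct.
    """
    estimate = source_states[0]  # assume the sink starts with a correct estimate
    age = 0
    trace = []
    for state, transmit in zip(source_states, transmit_flags):
        if transmit:
            estimate = state
        age = 0 if estimate == state else age + 1
        trace.append(age)
    return trace


source = [0, 0, 1, 1, 1, 0, 0]  # hypothetical monitored state over time

never = aoii_trace(source, [False] * 7)
always = aoii_trace(source, [True] * 7)
on_change = aoii_trace(source, [False, False, True, False, False, True, False])

print(never)      # age accumulates while the estimate is stale
print(always)     # 7 packets sent
print(on_change)  # 2 packets sent
```

In this toy trace, sending only when the state changes keeps the age at zero with two packets instead of seven, which is the kind of saving an AoII-aware scheduler aims for when updates carry no new information.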

SecureCodeRL: Security-Aware Reinforcement Learning for Code Generation with Partial-Credit Rewards

SecureCodeRL：基于部分学分奖励的代码生成安全意识强化学习

Authors: Suryansh Singh Sijwali, Suman Saha
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2601.01184
Pdf link: https://arxiv.org/pdf/2601.01184
Abstract Large Language Models (LLMs) can generate plausible code, but in settings that require exact stdin/stdout behavior they frequently produce programs that compile yet fail tests, and in some cases they introduce security-sensitive patterns. This paper presents SecureCodeRL, a reinforcement learning (RL) pipeline for security-aware code generation that optimizes a combined reward $R = \alpha R_{\mathrm{func}} + \beta R_{\mathrm{sec}}$. The key idea is a partial-credit functional reward that assigns intermediate scores for syntactic validity, successful execution, and producing output, reducing reward sparsity that otherwise stalls learning on competitive programming style tasks. I evaluate supervised fine-tuning (SFT) and PPO variants on a small held-out prompt set from APPS+ and observe that PPO with partial credit (using a continued-training variant) improves syntax validity from 45% (SFT) to 60% and achieves the only non-zero test success signal in this pilot evaluation (5% at-least-one-test-pass), while remaining 100% clean under Bandit static analysis. Although Bandit findings were absent in this small evaluation, the security term is integrated into training to discourage insecure shortcuts when they appear.
中文摘要 大型语言模型（LLM）可以生成合理的代码，但在需要精确 stdin/stdout 行为的环境中，它们经常生成能够编译却无法通过测试的程序，有时甚至引入了安全敏感的模式。本文介绍了SecureCodeRL，一种用于安全感知代码生成的强化学习（RL）流水线，优化组合奖励 $R = \alpha R_{\mathrm{func}} + \beta R_{\mathrm{sec}}$。核心思想是部分得分的功能奖励，为语法有效性、成功执行和产生输出分配中间分数，减少奖励稀疏性，否则这种稀疏性会阻碍竞赛编程类任务上的学习。我在APPS+的一小组留出提示词上评估了监督微调（SFT）和PPO变体，观察到带部分得分的PPO（使用持续训练变体）能将语法有效性从45%（SFT）提升到60%，并且在本次试点评估中实现了唯一非零的测试成功信号（至少通过一个测试的比例为5%），同时在Bandit静态分析下保持100%干净。虽然这次小规模评估中Bandit未报告任何问题，但安全项仍被纳入训练，以便在出现不安全的捷径时加以抑制。
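The combined reward and the partial-credit idea in the abstract above can be sketched as follows. The abstract gives only the form R = alpha * R_func + beta * R_sec and names the intermediate stages (syntactic validity, successful execution, producing output); the stage weights, the default alpha and beta, the binary security term, and all names in this sketch are assumptions made for illustration, not the paper's values.

```python
from dataclasses import dataclass


@dataclass
class RolloutOutcome:
    """Hypothetical summary of one generated program's evaluation."""
    parses: bool            # program is syntactically valid
    runs: bool              # program executes without crashing
    produces_output: bool   # program writes something to stdout
    tests_passed: int
    tests_total: int
    security_findings: int  # e.g. number of static-analysis findings


def functional_reward(outcome):
    """Partial-credit functional reward in [0, 1] (stage weights are assumed)."""
    if not outcome.parses:
        return 0.0
    score = 0.1                      # credit for syntactic validity
    if outcome.runs:
        score += 0.1                 # credit for successful execution
        if outcome.produces_output:
            score += 0.1             # credit for producing output
            if outcome.tests_total > 0:
                score += 0.7 * outcome.tests_passed / outcome.tests_total
    return score


def security_reward(outcome):
    """Assumed binary security term: 1.0 when no findings are reported."""
    return 1.0 if outcome.security_findings == 0 else 0.0


def combined_reward(outcome, alpha=0.8, beta=0.2):
    """R = alpha * R_func + beta * R_sec, with hypothetical default weights."""
    return alpha * functional_reward(outcome) + beta * security_reward(outcome)


parse_only = RolloutOutcome(True, False, False, 0, 4, 0)
half_tests = RolloutOutcome(True, True, True, 2, 4, 0)
print(combined_reward(parse_only), combined_reward(half_tests))
```

Under a pass/fail functional reward both example rollouts would score the same on the functional term, whereas the staged version separates them, which is the sparsity reduction the abstract describes.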

OrchestrRL: Dynamic Compute and Network Orchestration for Disaggregated RL

OrchestrRL：去中心化强化学习的动态计算与网络编排

Authors: Xin Tan, Yicheng Feng, Yu Zhou, Yimin Jiang, Yibo Zhu, Hong Xu
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2601.01209
Pdf link: https://arxiv.org/pdf/2601.01209
Abstract Post-training with reinforcement learning (RL) has greatly enhanced the capabilities of large language models. Disaggregating the generation and training stages in RL into a parallel, asynchronous pipeline offers the potential for flexible scaling and improved throughput. However, it still faces two critical challenges. First, the generation stage often becomes a bottleneck due to dynamic workload shifts and severe execution imbalances. Second, the decoupled stages result in diverse and dynamic network traffic patterns that overwhelm conventional network fabrics. This paper introduces OrchestrRL, an orchestration framework that dynamically manages compute and network rhythms in disaggregated RL. To improve generation efficiency, OrchestrRL employs an adaptive compute scheduler that dynamically adjusts parallelism to match workload characteristics within and across generation steps. This accelerates execution while continuously rebalancing requests to mitigate stragglers. To address the dynamic network demands inherent in disaggregated RL -- further intensified by parallelism switching -- we co-design RFabric, a reconfigurable hybrid optical-electrical fabric. RFabric leverages optical circuit switches at selected network tiers to reconfigure the topology in real time, enabling workload-aware circuits for (i) layer-wise collective communication during training iterations, (ii) generation under different parallelism configurations, and (iii) periodic inter-cluster weight synchronization. We evaluate OrchestrRL on a physical testbed with 48 H800 GPUs, demonstrating up to a 1.40x throughput improvement. Furthermore, we develop RLSim, a high-fidelity simulator, to evaluate RFabric at scale. Our results show that RFabric achieves superior performance-cost efficiency compared to static Fat-Tree networks, establishing it as a highly effective solution for large-scale RL workloads.
中文摘要 强化学习（RL）的后期训练极大地提升了大型语言模型的能力。将强化学习中的生成和训练阶段拆分成并行异步流水线，有望实现灵活的扩展和提升吞吐量。然而，它仍面临两个关键挑战。首先，生成阶段常因动态工作量变化和严重执行失衡而成为瓶颈。其次，解耦阶段导致多样且动态的网络流量模式，导致传统网络结构不堪重负。本文介绍了OrchestrRL，一种在分解强化学习中动态管理计算和网络节奏的编排框架。为提高生成效率，OrchestrRL 采用自适应计算调度器，动态调整并行性以匹配生成步骤内及跨世代的工作负载特性。这加快了执行速度，同时不断重新平衡请求以减少落后请求。为了应对分散强化学习固有的动态网络需求——并行交换进一步加剧——我们共同设计了RFabric，一种可重构的混合光电结构。RFabric利用选定网络层的光电路交换机实时重配置拓扑，实现工作负载感知电路，支持（i）训练迭代期间的层级集体通信，（ii）不同并行配置下的生成，以及（iii）周期性集群间权重同步。我们在物理测试平台上用48块H800 GPU评估OrchestrRL，显示吞吐量提升高达1.40倍。此外，我们还开发了RLSim，一款高精度模拟器，用于大规模评估RFabric。我们的结果表明，RFabric在性能和成本效率上优于静态胖树网络，使其成为大规模强化学习工作负载的高效解决方案。

PyBatchRender: A Python Library for Batched 3D Rendering at Up to One Million FPS

PyBatchRender：一个用于批量渲染、最高可达百万帧率的 Python 库

Authors: Evgenii Rudakov, Jonathan Shock, Benjamin Ultan Cowley
Subjects: Graphics (cs.GR); Artificial Intelligence (cs.AI); Performance (cs.PF); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.01288
Pdf link: https://arxiv.org/pdf/2601.01288
Abstract Reinforcement learning from pixels is often bottlenecked by the performance and complexity of 3D rendered environments. Researchers face a trade-off between high-speed, low-level engines and slower, more accessible Python frameworks. To address this, we introduce PyBatchRender, a Python library for high-throughput, batched 3D rendering that achieves over 1 million FPS on simple scenes. Built on the Panda3D game engine, it utilizes its mature ecosystem while enhancing performance through optimized batched rendering for up to 1000X speedups. Designed as a physics-agnostic renderer for reinforcement learning from pixels, PyBatchRender offers greater flexibility than dedicated libraries, simpler setup than typical game-engine wrappers, and speeds rivaling state-of-the-art C++ engines like Madrona. Users can create custom scenes entirely in Python with tens of lines of code, enabling rapid prototyping for scalable AI training. Open-source and easy to integrate, it serves to democratize high-performance 3D simulation for researchers and developers. The library is available at this https URL.
中文摘要 从像素进行强化学习常常被3D渲染环境的性能和复杂性所阻碍。研究人员面临高速、低层级引擎与更慢、更易访问的Python框架之间的权衡。为此，我们引入了PyBatchRender，一个用于高吞吐量批量3D渲染的Python库，在简单场景上实现超过100万帧/秒。它基于Panda3D游戏引擎，利用其成熟的生态系统，同时通过优化的批量渲染提升性能，实现高达1000倍的加速。PyBatchRender 设计为一种物理无关的渲染器，用于从像素进行强化学习，它比专用库更具灵活性，设置比典型游戏引擎包装器更简单，速度可媲美 Madrona 等最先进的 C++ 引擎。用户可以用Python完全创建自定义场景，只需数十行代码，便于快速原型设计以实现可扩展的AI训练。该软件开源且易于集成，旨在为研究人员和开发者普及高性能3D仿真。该库可通过此 https URL 访问。

dataRLsec: Safety, Security, and Reliability With Robust Offline Reinforcement Learning for DPAs

dataRLsec：DPA的安全、保障与可靠性，结合强健的离线强化学习

Authors: Shriram KS Pandian, Naresh Kshetri
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2601.01289
Pdf link: https://arxiv.org/pdf/2601.01289
Abstract Data poisoning attacks (DPAs) are becoming popular as artificial intelligence (AI) algorithms, machine learning (ML) algorithms, and deep learning (DL) algorithms in this artificial intelligence (AI) era. Hackers and penetration testers are excessively injecting malicious contents in the training data (and in testing data too) that leads to false results that are very hard to inspect and predict. We have analyzed several recent technologies used (from deep reinforcement learning to federated learning) for the DPAs and their safety, security, & countermeasures. The problem setup along with the problem estimation is shown in the MuJoCo environment with performance of HalfCheetah before the dataset is poisoned and after the dataset is poisoned. We have analyzed several risks associated with the DPAs and falsification in medical data from popular poisoning data attacks to some popular data defenses. We have proposed robust offline reinforcement learning (Offline RL) for the safety and reliability with weighted hash verification along with density-ratio weighted behavioral cloning (DWBC) algorithm. The four stages of the proposed algorithm (as the Stage 0, the Stage 1, the Stage 2, and the Stage 3) are described with respect to offline RL, safety, and security for DPAs. The conclusion and future scope are provided with the intent to combine DWBC with other data defense strategies to counter and protect future contamination cyberattacks.
中文摘要 数据中毒攻击（DPA）正作为人工智能（AI）算法、机器学习（ML）算法和深度学习（DL）算法在人工智能（AI）时代变得越来越流行。黑客和渗透测试人员过度在训练数据（以及测试数据）中注入恶意内容，导致错误结果难以检测和预测。我们分析了多项近期用于DPA的技术（从深度强化学习到联邦学习）及其安全性、保障和对策。问题设置及问题估计在MuJoCo环境中展示了HalfCheetah在数据集中毒前和中毒后的性能。我们分析了与数据中毒攻击（DPA）及医疗数据伪造相关的多种风险，从流行的数据中毒攻击到一些流行的数据防御。我们提出了基于加权哈希验证和密度比加权行为克隆（DWBC）算法的稳健离线强化学习（Offline RL），以保障安全性和可靠性。所提算法的四个阶段（第0阶段、第1阶段、第2阶段和第3阶段）围绕离线强化学习以及针对DPA的安全性与保障进行了描述。结论和未来展望旨在将DWBC与其他数据防御策略结合起来，以应对并防范未来的数据污染网络攻击。

DreamID-V: Bridging the Image-to-Video Gap for High-Fidelity Face Swapping via Diffusion Transformer

DreamID-V：通过扩散变换器弥合图像与视频之间的高保真面部交换差距

Authors: Xu Guo, Fulong Ye, Xinghui Li, Pengqi Tu, Pengze Zhang, Qichao Sun, Songtao Zhao, Xiangwang Hou, Qian He
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.01425
Pdf link: https://arxiv.org/pdf/2601.01425
Abstract Video Face Swapping (VFS) requires seamlessly injecting a source identity into a target video while meticulously preserving the original pose, expression, lighting, background, and dynamic information. Existing methods struggle to maintain identity similarity and attribute preservation while preserving temporal consistency. To address the challenge, we propose a comprehensive framework to seamlessly transfer the superiority of Image Face Swapping (IFS) to the video domain. We first introduce a novel data pipeline SyncID-Pipe that pre-trains an Identity-Anchored Video Synthesizer and combines it with IFS models to construct bidirectional ID quadruplets for explicit supervision. Building upon paired data, we propose the first Diffusion Transformer-based framework DreamID-V, employing a core Modality-Aware Conditioning module to discriminatively inject multi-model conditions. Meanwhile, we propose a Synthetic-to-Real Curriculum mechanism and an Identity-Coherence Reinforcement Learning strategy to enhance visual realism and identity consistency under challenging scenarios. To address the issue of limited benchmarks, we introduce IDBench-V, a comprehensive benchmark encompassing diverse scenes. Extensive experiments demonstrate DreamID-V outperforms state-of-the-art methods and further exhibits exceptional versatility, which can be seamlessly adapted to various swap-related tasks.
中文摘要 视频人脸交换（VFS）需要无缝地将源身份注入目标视频，同时细致地保留原始姿态、表情、光线、背景和动态信息。现有方法难以在保持身份相似性和属性保存的同时保持时间一致性。为应对这一挑战，我们提出了一个综合框架，旨在无缝将人脸交换（IFS）的优势转移到视频领域。我们首先介绍一种新颖的数据流水线SyncID-Pipe，它预训练身份锚定视频合成器，并将其与IFS模型结合，构建双向ID四元组以实现显式监督。基于配对数据，我们提出了首个基于扩散变换器（Diffusion Transformer）的框架DreamID-V，采用核心模态感知条件模块，判别性地注入多模型条件。同时，我们提出了一种合成到现实的课程机制和身份一致性强化学习策略，以增强在挑战场景下的视觉真实性和身份一致性。为解决基准有限的问题，我们引入了涵盖多元场景的综合基准IDBench-V。大量实验表明DreamID-V优于最先进的方法，并展现出卓越的多功能性，能够无缝适应各种交换相关任务。

SWE-Lego: Pushing the Limits of Supervised Fine-tuning for Software Issue Resolving

SWE-Lego：推动监督式微调的极限，用于软件问题解决

Authors: Chaofan Tao, Jierun Chen, Yuxin Jiang, Kaiqi Kou, Shaowei Wang, Ruoyu Wang, Xiaohui Li, Sidi Yang, Yiming Du, Jianbo Dai, Zhiming Mao, Xinyu Wang, Lifeng Shang, Haoli Bai
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.01426
Pdf link: https://arxiv.org/pdf/2601.01426
Abstract We present SWE-Lego, a supervised fine-tuning (SFT) recipe designed to achieve state-of-the-art performance in software engineering (SWE) issue resolving. In contrast to prevalent methods that rely on complex training paradigms (e.g., mid-training, SFT, reinforcement learning, and their combinations), we explore how to push the limits of a lightweight SFT-only approach for SWE tasks. SWE-Lego comprises three core building blocks, with key findings summarized as follows: 1) the SWE-Lego dataset, a collection of 32k high-quality task instances and 18k validated trajectories, combining real and synthetic data to complement each other in both quality and quantity; 2) a refined SFT procedure with error masking and a difficulty-based curriculum, which demonstrably improves action quality and overall performance. Empirical results show that with these two building bricks alone, SFT can push SWE-Lego models to state-of-the-art performance among open-source models of comparable size on SWE-bench Verified: SWE-Lego-Qwen3-8B reaches 42.2%, and SWE-Lego-Qwen3-32B attains 52.6%. 3) We further evaluate and improve test-time scaling (TTS) built upon the SFT foundation. Based on a well-trained verifier, SWE-Lego models can be significantly boosted, for example from 42.2% to 49.6% and from 52.6% to 58.8% under TTS@16 for the 8B and 32B models, respectively.
中文摘要 我们介绍SWE-Lego，这是一种监督式微调（SFT）配方，旨在实现软件工程（SWE）问题解决的先进性能。与依赖复杂训练范式的流行方法（如中训练、SFT、强化学习及其组合）不同，我们探讨如何推动轻量级仅SFT方法的极限，应用于SWE任务。SWE-Lego包含三个核心构建模块，主要发现总结如下：1）SWE-Lego数据集，包含3.2万个高质量任务实例和1.8万个验证轨迹，结合真实和合成数据，在质量和数量上相互补充;2）改进的SFT程序，包含错误掩蔽和基于难度的课程，显著提升了动作质量和整体表现。实证结果显示，仅凭这两块积木，SFT就能将SWE-Lego模型推向同尺寸开源模型中最先进的性能。验证：SWE-Lego-Qwen3-8B达到42.2%，SWE-Lego-Qwen3-32B达到52.6%。3）我们进一步评估并改进基于SFT基础的测试时间缩放（TTS）。基于经过良好训练的验证器，SWE-Lego模型可以显著提升——例如，8B和32B模型在TTS@16下分别提升了42.2%到49.6%和52.6%到58.8%。
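
The error masking mentioned in the abstract can be pictured with a minimal sketch. The abstract does not say how erroneous steps are identified or how the mask is applied, so the function name `masked_sft_loss` and the per-token error flags below are assumptions made purely for illustration: tokens flagged as belonging to an erroneous step are excluded from the averaged loss.

```python
from typing import Sequence


def masked_sft_loss(token_losses: Sequence[float],
                    is_error_token: Sequence[bool]) -> float:
    """Average the per-token loss over tokens NOT flagged as erroneous.

    Hypothetical helper: both the flagging scheme and the plain mean
    are assumptions, not the SWE-Lego implementation.
    """
    if len(token_losses) != len(is_error_token):
        raise ValueError("token_losses and is_error_token must align")
    kept = [loss for loss, bad in zip(token_losses, is_error_token) if not bad]
    # If every token is masked there is nothing to learn from this sample.
    return sum(kept) / len(kept) if kept else 0.0


# The middle token belongs to a failed action and is ignored.
loss = masked_sft_loss([0.5, 2.0, 1.5], [False, True, False])
```

In a real training loop the same effect is usually obtained by multiplying the per-token loss tensor by a 0/1 mask before reduction.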

Context-Aware Information Transfer via Digital Semantic Communication in UAV-Based Networks

基于无人机的网络中通过数字语义通信实现上下文感知信息传输

Authors: Poorvi Joshi, Mohan Gurusamy (National University of Singapore)
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2601.01430
Pdf link: https://arxiv.org/pdf/2601.01430
Abstract In smart cities, bandwidth-constrained Unmanned Aerial Vehicles (UAVs) often fail to relay mission-critical data in time, compromising real-time decision-making. This highlights the need for faster and more efficient transmission of only the most relevant information. To address this, we propose the DSC-UAV model, leveraging a context-adaptive Digital Semantic Communication (DSC) framework. This model redefines aerial data transmission through three core components: prompt-aware encoding, dynamic UAV-enabled relaying, and user mobility-optimized reinforcement learning. Ground users transmit context-driven visual content. Images are encoded via a Vision Transformer combined with a prompt-text encoder to generate semantic features based on the desired context (generic or object-specific). These features are then quantized and transmitted over a UAV network that dynamically relays the data. Joint trajectory and resource allocation are optimized using a Truncated Quantile Critic (TQC)-aided reinforcement learning technique, which offers greater stability and precision than standard SAC and TD3 due to its resistance to overestimation bias. Simulations demonstrate significant performance improvement, with up to a 22% gain in semantic-structural similarity and a 14% reduction in Age of Information (AoI) compared to digital and prior UAV-semantic communication baselines. By integrating mobility control with context-driven visual abstraction, DSC-UAV advances resilient, information-centric surveillance for next-generation UAV networks in bandwidth-constrained environments.
中文摘要 在智慧城市中，带宽受限的无人机（UAV）常常无法及时传递关键任务数据，从而影响实时决策。这凸显了只传输最相关信息的更快、更高效的必要性。为此，我们提出了DSC-UAV模型，利用上下文自适应数字语义通信（DSC）框架。该模型通过三个核心组成部分重新定义了空中数据传输：提示感知编码、动态无人机中继以及用户移动性优化的强化学习。地面用户传递基于上下文的视觉内容。图像通过Vision Transformer与提示文本编码器结合，根据所需上下文（通用或对象特定）生成语义特征。这些特征随后被量化并通过无人机网络传输，该网络动态中继数据。联合轨迹和资源分配采用截断分位数批判（TQC）辅助强化学习技术进行优化，该技术因抗高估偏差，比标准SAC和TD3提供了更高的稳定性和精度。模拟显示出显著的性能提升，语义结构相似度提升多达22%，信息时代（AoI）降低14%，均与数字及以往无人机语义通信基线相比。通过将移动控制与基于情境的视觉抽象相结合，DSC-UAV在带宽受限环境中推动了下一代无人机网络的韧性、以信息为中心的监控。
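
The resistance to overestimation bias that the abstract attributes to TQC comes from truncation: the quantile atoms predicted by several critics are pooled, sorted, and the largest ones are discarded before the target is formed. The sketch below shows only that truncation step, following the general TQC idea rather than this paper's implementation; the function name and the example numbers are illustrative assumptions.

```python
from typing import List, Sequence


def truncated_target_atoms(critic_atoms: Sequence[Sequence[float]],
                           drop_per_critic: int) -> List[float]:
    """Pool the atoms of all critics, sort them, and drop the largest ones.

    drop_per_critic * number_of_critics atoms are removed from the top of
    the pooled, sorted list (illustrative sketch of TQC-style truncation).
    """
    pooled = sorted(atom for atoms in critic_atoms for atom in atoms)
    n_drop = drop_per_critic * len(critic_atoms)
    if n_drop >= len(pooled):
        raise ValueError("truncation would remove every atom")
    return pooled[:len(pooled) - n_drop]


# Two critics with three atoms each; the two largest pooled atoms are dropped.
kept = truncated_target_atoms([[1.0, 2.0, 9.0], [0.5, 3.0, 8.0]], drop_per_critic=1)
truncated_mean = sum(kept) / len(kept)
```

Dropping the optimistic tail lowers the mean of the target distribution, which is what counteracts the tendency of learned critics to overestimate returns.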

Unified Generation and Self-Verification for Vision-Language Models via Advantage Decoupled Preference Optimization

通过优势解耦偏好优化实现视觉语言模型的统一生成与自我验证

Authors: Xinyu Qiu, Heng Jia, Zhengwen Zeng, Shuheng Shen, Changhua Meng, Yi Yang, Linchao Zhu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.01483
Pdf link: https://arxiv.org/pdf/2601.01483
Abstract Parallel test-time scaling typically trains separate generation and verification models, incurring high training and inference costs. We propose Advantage Decoupled Preference Optimization (ADPO), a unified reinforcement learning framework that jointly learns answer generation and self-verification within a single policy. ADPO introduces two innovations: a preference verification reward improving verification capability and a decoupled optimization mechanism enabling synergistic optimization of generation and verification. Specifically, the preference verification reward computes mean verification scores from positive and negative samples as decision thresholds, providing positive feedback when prediction correctness aligns with answer correctness. Meanwhile, the advantage decoupled optimization computes separate advantages for generation and verification, applies token masks to isolate gradients, and combines masked GRPO objectives, preserving generation quality while calibrating verification scores. ADPO achieves up to +34.1% higher verification AUC and -53.5% lower inference time, with significant gains of +2.8%/+1.4% accuracy on MathVista/MMMU, +1.9 cIoU on ReasonSeg, and +1.7%/+1.0% step success rate on AndroidControl/GUI Odyssey.
中文摘要 并行测试时间缩放通常训练不同的生成和验证模型，导致高强度的训练和推断成本。我们提出了优势解耦偏好优化（ADPO），这是一种统一的强化学习框架，能够在单一策略中共同学习答案生成和自我验证。ADPO引入了两项创新：优先验证奖励提升验证能力，以及解耦优化机制，实现生成与验证的协同优化。具体来说，偏好验证奖励计算正负样本的平均验证分数作为决策阈值，当预测正确性与答案正确性一致时，提供正反馈。与此同时，优势解耦优化计算生成和验证的独立优势，应用令牌掩码隔离梯度，并结合掩蔽GRPO目标，保持生成质量同时校准验证分数。ADPO在验证AUC上实现了高达+34.1%的提升和-53.5%的推理时间，在MathVista/MMMU上实现了+2.8%/+1.4%的准确率，在ReasonSeg上获得+1.9 cIoU，在AndroidControl/GUI Odyssey上实现了+1.7%/+1.0%的步进成功率。
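
The decoupled optimization described in the abstract (separate advantages for generation and verification, with token masks isolating the gradients) can be sketched roughly as follows. This is a simplified illustration under assumptions: the group normalization follows the usual GRPO form, and the helper names, rewards, and mask are hypothetical rather than taken from the paper.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Group-normalized advantages, GRPO style: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def per_token_advantages(gen_adv: float, ver_adv: float,
                         is_verification_token: Sequence[bool]) -> List[float]:
    """Route each token to the advantage of the part of the output it belongs to."""
    return [ver_adv if is_ver else gen_adv for is_ver in is_verification_token]


# A group of four sampled responses with separate rewards for each role.
generation_rewards = [1.0, 0.0, 1.0, 0.0]
verification_rewards = [0.0, 1.0, 1.0, 1.0]
gen_adv = group_advantages(generation_rewards)
ver_adv = group_advantages(verification_rewards)

# First response: two answer tokens followed by one verification token.
token_mask = [False, False, True]
token_adv = per_token_advantages(gen_adv[0], ver_adv[0], token_mask)
```

Here the first response answered correctly but verified poorly, so its answer tokens receive a positive advantage while its verification token receives a negative one, without either signal leaking into the other.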

Logics-STEM: Empowering LLM Reasoning via Failure-Driven Post-Training and Document Knowledge Enhancement

Logics-STEM：通过失败驱动的后训练和文档知识增强赋能LLM推理

Authors: Mingyu Xu, Cheng Fang, Keyue Jiang, Yuqian Zheng, Yanghua Xiao, Baojian Zhou, Qifang Zhao, Suhang Zheng, Xiuwen Zhu, Jiyang Tang, Yongchi Zhao, Yijia Luo, Zhiqi Bai, Yuchi Xu, Wenbo Su, Wei Wang, Bing Zhao, Lin Qu, Xiaoxiao Xu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.01562
Pdf link: https://arxiv.org/pdf/2601.01562
Abstract We present Logics-STEM, a state-of-the-art reasoning model fine-tuned on Logics-STEM-SFT-Dataset, a high-quality and diverse dataset at 10M scale that represents one of the largest-scale open-source long chain-of-thought corpora. Logics-STEM targets reasoning tasks in the domains of Science, Technology, Engineering, and Mathematics (STEM), and exhibits exceptional performance on STEM-related benchmarks with an average improvement of 4.68% over the next-best model at 8B scale. We attribute the gains to our data-algorithm co-design engine, where they are jointly optimized to fit a gold-standard distribution behind reasoning. Data-wise, the Logics-STEM-SFT-Dataset is constructed from a meticulously designed data curation engine with 5 stages to ensure the quality, diversity, and scalability, including annotation, deduplication, decontamination, distillation, and stratified sampling. Algorithm-wise, our failure-driven post-training framework leverages targeted knowledge retrieval and data synthesis around model failure regions in the Supervised Fine-tuning (SFT) stage to effectively guide the second-stage SFT or the reinforcement learning (RL) for better fitting the target distribution. The superior empirical performance of Logics-STEM reveals the vast potential of combining large-scale open-source data with carefully designed synthetic data, underscoring the critical role of data-algorithm co-design in enhancing reasoning capabilities through post-training. We make both the Logics-STEM models (8B and 32B) and the Logics-STEM-SFT-Dataset (10M and downsampled 2.2M versions) publicly available to support future research in the open-source community.
中文摘要 我们介绍Logics-STEM，一个基于Logics-STEM-SFT-Dataset微调的先进推理模型，这是一个高质量且多样化的1000万规模数据集，代表了规模最大的开源长链思维语料库之一。Logics-STEM针对科学、技术、工程和数学（STEM）领域的推理任务，在STEM相关基准测试中表现出色，平均比8B规模下的次优模型提升4.68%。我们将这些收益归功于我们的数据算法协同设计引擎，其中数据与算法经过共同优化，以符合推理背后的黄金标准分布。在数据方面，Logics-STEM-SFT数据集由精心设计的数据管理引擎构建，设有5个阶段，确保质量、多样性和可扩展性，包括注释、去重、去污、蒸馏和分层抽样。在算法方面，我们的失败驱动后训练框架利用针对性知识检索和监督式微调（SFT）阶段中模型失效区域的数据合成，有效引导第二阶段SFT或强化学习（RL），更好地拟合目标分布。Logics-STEM卓越的实证表现揭示了将大规模开源数据与精心设计的合成数据结合的巨大潜力，凸显了数据-算法协同设计在通过后训练提升推理能力中的关键作用。我们公开了Logics-STEM模型（8B和32B）以及Logics-STEM-SFT数据集（1000万和220万下采样版本），以支持开源社区未来的研究。

HanoiWorld: A Joint Embedding Predictive Architecture Based World Model for Autonomous Vehicle Controller

HanoiWorld：基于联合嵌入预测架构的自动驾驶车辆控制器世界模型

Authors: Tran Tien Dat, Nguyen Hai An, Nguyen Khanh Viet Dung, Nguyen Duy Duc
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.01577
Pdf link: https://arxiv.org/pdf/2601.01577
Abstract Current attempts at Reinforcement Learning for autonomous controllers are data-demanding, while the results underperform, are unstable, fail to grasp and anchor on the concept of safety, and over-concentrate on noise features due to the nature of pixel reconstruction. Current Self-Supervised Learning approaches that learn high-dimensional representations by leveraging the Joint Embedding Predictive Architecture (JEPA) are an interesting and effective alternative, as the idea mimics the natural ability of the human brain to acquire new skills using imagination and minimal samples of observations. This study introduces Hanoi-World, a JEPA-based world model that uses a recurrent neural network (RNN) for long-horizon planning with effective inference time. Experiments conducted on the Highway-Env package across different environments showcase an effective capability of making a driving plan with safety awareness, with a considerable collision rate in comparison with SOTA baselines.
中文摘要 当前自主控制器强化学习的尝试数据需求较大，但结果表现不足、不稳定，无法掌握和锚定安全概念，且由于像素重建的特性，过度关注噪声特征。虽然当前的自我监督学习方法通过利用联合嵌入预测架构（JEPA）在高维表征上学习，是有趣且有效的替代方案，因为该方法模拟了人脑在利用想象力和少量观察样本获得新技能时的自然能力。本研究介绍了Hanoi-World，这是一个基于JEPA的世界模型，利用循环神经网络（RNN）实现具有有效推断时间的长期水平规划。在不同环境下进行的高速公路环境套件实验展示了在安全意识下制定驾驶计划的有效能力，相较于SOTA基线，碰撞率显著提升

DemoBot: Efficient Learning of Bimanual Manipulation with Dexterous Hands From Third-Person Human Videos

DemoBot：从第三人称人类视频中高效学习灵巧手的双手操作

Authors: Yucheng Xu, Xiaofeng Mao, Elle Miller, Xinyu Yi, Yang Li, Zhibin Li, Robert B. Fisher
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.01651
Pdf link: https://arxiv.org/pdf/2601.01651
Abstract This work presents DemoBot, a learning framework that enables a dual-arm, multi-finger robotic system to acquire complex manipulation skills from a single unannotated RGB-D video demonstration. The method extracts structured motion trajectories of both hands and objects from raw video data. These trajectories serve as motion priors for a novel reinforcement learning (RL) pipeline that learns to refine them through contact-rich interactions, thereby eliminating the need to learn from scratch. To address the challenge of learning long-horizon manipulation skills, we introduce: (1) Temporal-segment based RL to enforce temporal alignment of the current state with demonstrations; (2) Success-Gated Reset strategy to balance the refinement of readily acquired skills and the exploration of subsequent task stages; and (3) Event-Driven Reward curriculum with adaptive thresholding to guide the RL learning of high-precision manipulation. The novel video processing and RL framework successfully achieved long-horizon synchronous and asynchronous bimanual assembly tasks, offering a scalable approach for direct skill acquisition from human videos.
中文摘要 本研究提出了DemoBot，一种学习框架，使双臂、多指机器人系统能够通过一个无注释的RGB-D视频演示习得复杂的操作技能。该方法从原始视频数据中提取手部和物体的结构化运动轨迹。这些轨迹作为一种新型强化学习（RL）流水线的运动先验，通过丰富的接触互动来细化这些轨迹，从而消除从零学习的需求。为应对学习长视野操作技能的挑战，我们引入了：（1）基于时间段的强化学习，强制当前状态与演示在时间上对齐；（2）成功门控重置策略，平衡已习得技能的精炼与后续任务阶段的探索；以及（3）事件驱动奖励课程，采用自适应阈值法指导强化学习习得高精度操作。这一新颖的视频处理和强化学习框架成功实现了长视野同步和异步双手组装任务，为直接从人类视频中获得技能提供了可扩展的方法。

Adversarial Instance Generation and Robust Training for Neural Combinatorial Optimization with Multiple Objectives

多目标的对抗实例生成与神经组合优化的稳健训练

Authors: Wei Liu, Yaoxin Wu, Yingqian Zhang, Thomas Bäck, Yingjie Fan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.01665
Pdf link: https://arxiv.org/pdf/2601.01665
Abstract Deep reinforcement learning (DRL) has shown great promise in addressing multi-objective combinatorial optimization problems (MOCOPs). Nevertheless, the robustness of these learning-based solvers has remained insufficiently explored, especially across diverse and complex problem distributions. In this paper, we propose a unified robustness-oriented framework for preference-conditioned DRL solvers for MOCOPs. Within this framework, we develop a preference-based adversarial attack to generate hard instances that expose solver weaknesses, and quantify the attack impact by the resulting degradation on Pareto-front quality. We further introduce a defense strategy that integrates hardness-aware preference selection into adversarial training to reduce overfitting to restricted preference regions and improve out-of-distribution performance. The experimental results on multi-objective traveling salesman problem (MOTSP), multi-objective capacitated vehicle routing problem (MOCVRP), and multi-objective knapsack problem (MOKP) verify that our attack method successfully learns hard instances for different solvers. Furthermore, our defense method significantly strengthens the robustness and generalizability of neural solvers, delivering superior performance on hard or out-of-distribution instances.
中文摘要 深度强化学习（DRL）在解决多目标组合优化问题（MOCOPs）方面展现出巨大潜力。然而，这些基于学习的求解器的鲁棒性尚未被充分探索，尤其是在多样且复杂的问题分布中。本文提出了一个统一的鲁棒性导向框架，用于偏好条件的MOCOPDRL求解器。在此框架下，我们开发了基于偏好的对抗攻击，生成硬实例以暴露求解器弱点，并通过对帕累托前线质量的劣化量化攻击影响。我们还进一步引入了一种防御策略，将硬度感知偏好选择整合进对抗训练中，以减少对受限偏好区域的过度拟合，并提升分布外表现。多目标旅行推销员问题（MOTSP）、多目标电容车辆路由问题（MOCVRP）和多目标背包问题（MOKP）的实验结果验证了我们的攻击方法能够成功学习不同求解器的硬实例。此外，我们的防御方法显著增强了神经求解器的鲁棒性和泛化性，在硬或非分布实例上提供更优的性能。

SRAS: A Lightweight Reinforcement Learning-based Document Selector for Edge-Native RAG Pipelines

SRAS：基于强化学习的轻量级文档选择器，适用于边缘原生RAG管道

Authors: Rajiv Chaitanya Muttur
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.01785
Pdf link: https://arxiv.org/pdf/2601.01785
Abstract Retrieval-Augmented Generation (RAG) systems often rely on fixed top-k document selection mechanisms that ignore downstream generation quality and impose computational overheads. We propose SRAS (Sparse Reward-Aware Selector), a lightweight document selector trained via reinforcement learning (RL) for edge-native RAG deployment. Unlike prior RL-based retrievers that assume large memory and latency budgets, SRAS learns a compact (~0.76MB) policy using Proximal Policy Optimization (PPO), guided by a hybrid reward signal combining Relaxed F1 and BERTScore. Our method operates under tight token and compute constraints, maintaining <1s latency on CPU. SRAS outperforms supervised and random selectors on a synthetic QA benchmark, and generalizes to real-world data, achieving BERTScore F1 of 0.8546 on SQuAD v2 without domain-specific tuning. This work is the first to demonstrate that RL-based document selection can be made ultra-lightweight, latency-aware, and effective for on-device RAG pipelines.
中文摘要 检索增强生成（RAG）系统通常依赖固定的顶K文档选择机制，忽视下游生成质量并增加计算开销。我们提出SRAS（稀疏奖励感知选择器），这是一个通过强化学习（RL）训练的轻量级文档选择器，用于边缘原生RAG部署。与以往假设大量内存和延迟预算的基于强化学习器的检索器不同，SRAS通过近端策略优化（PPO）学习紧凑（~0.76MB）策略，并由结合Relaxed F1和BERTScore的混合奖励信号指导。我们的方法在严格的令牌和计算约束下运行，保持 CPU 延迟<1秒。SRAS在合成质量保证基准测试中优于监督选择器和随机选择器，并可推广到真实世界数据，在SQuAD v2上实现了BERTScore F1的0.8546，且未进行特定领域调优。这项工作首次证明基于强化学习的文档选择可以实现超轻量级、延迟感知能力，并且对设备端RAG流水线非常有效。
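
The hybrid reward described in the abstract combines a lexical score (Relaxed F1) with a semantic score (BERTScore). As a rough sketch only: the abstract does not define Relaxed F1 or the mixing weights, so the code below substitutes a plain token-overlap F1 and takes the semantic score as a precomputed number. `token_f1`, `hybrid_reward`, and `f1_weight` are hypothetical names.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1, used here as a stand-in for the paper's Relaxed F1."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def hybrid_reward(prediction: str, reference: str,
                  semantic_score: float, f1_weight: float = 0.5) -> float:
    """Convex combination of the lexical score and a precomputed semantic score.

    semantic_score is assumed to come from an external scorer such as
    BERTScore; the 0.5 default weight is an arbitrary illustration.
    """
    if not 0.0 <= f1_weight <= 1.0:
        raise ValueError("f1_weight must lie in [0, 1]")
    lexical = token_f1(prediction, reference)
    return f1_weight * lexical + (1.0 - f1_weight) * semantic_score


reward = hybrid_reward("the cat sat", "the cat slept", semantic_score=0.9)
```

Mixing the two signals lets the selector earn partial credit for answers that are semantically right but worded differently from the reference.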

Sparse Threats, Focused Defense: Criticality-Aware Robust Reinforcement Learning for Safe Autonomous Driving

稀疏威胁，聚焦防御：临界感知强化学习以实现安全自动驾驶

Authors: Qi Wei, Junchao Fan, Zhao Yang, Jianhua Wang, Jingkai Mao, Xiaolin Chang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.01800
Pdf link: https://arxiv.org/pdf/2601.01800
Abstract Reinforcement learning (RL) has shown considerable potential in autonomous driving (AD), yet its vulnerability to perturbations remains a critical barrier to real-world deployment. As a primary countermeasure, adversarial training improves policy robustness by training the AD agent in the presence of an adversary that deliberately introduces perturbations. Existing approaches typically model the interaction as a zero-sum game with continuous attacks. However, such designs overlook the inherent asymmetry between the agent and the adversary and thus fail to reflect the sparsity of safety-critical risks, rendering the achieved robustness inadequate for practical AD scenarios. To address these limitations, we introduce criticality-aware robust RL (CARRL), a novel adversarial training approach for handling sparse, safety-critical risks in autonomous driving. CARRL consists of two interacting components: a risk exposure adversary (REA) and a risk-targeted robust agent (RTRA). We model the interaction between the REA and RTRA as a general-sum game, allowing the REA to focus on exposing safety-critical failures (e.g., collisions) while the RTRA learns to balance safety with driving efficiency. The REA employs a decoupled optimization mechanism to better identify and exploit sparse safety-critical moments under a constrained budget. However, such focused attacks inevitably result in a scarcity of adversarial data. The RTRA copes with this scarcity by jointly leveraging benign and adversarial experiences via a dual replay buffer and enforces policy consistency under perturbations to stabilize behavior. Experimental results demonstrate that our approach reduces the collision rate by at least 22.66% across all cases compared to state-of-the-art baseline methods.
中文摘要 强化学习（RL）在自动驾驶（AD）领域展现出相当的潜力，但其对干扰的脆弱性仍然是实际应用的关键障碍。作为主要的对策，对抗训练通过在有意引入扰动的对手面前训练AD代理，从而提升策略的鲁棒性。现有方法通常将互动建模为零和博弈，伴随持续攻击。然而，此类设计忽视了智能体与对手之间固有的不对称性，未能反映安全关键风险的稀疏性，使得实现的鲁棒性不足以满足实际的AD场景。为解决这些局限，我们引入了临界感知强化强化学习（CARRL），这是一种新型对抗训练方法，用于处理自动驾驶中稀疏且安全关键的风险。CARRL由两个相互作用组成部分：风险暴露对抗剂（REA）和风险靶向强健剂（RTRA）。我们将REA与RTRA之间的互动建模为一般和博弈，使REA专注于揭露安全关键故障（如碰撞），而RTRA则学习如何在安全与驾驶效率之间取得平衡。REA采用解耦优化机制，在有限预算下更好地识别和利用稀疏的安全关键时刻。然而，这种针对性的攻击不可避免地导致对抗性数据的稀缺。RTRA通过双重回放缓冲区共同利用良性与对抗性经验，并在扰动下强制执行政策一致性以稳定行为，来应对这种稀缺性。实验结果表明，我们的方法在所有病例中相较于最先进的基线方法，至少降低了22.66%的碰撞率。
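
The dual replay buffer mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the sampling rule, so the fixed adversarial fraction, sampling with replacement, and the class name `DualReplayBuffer` are all assumptions chosen to show how scarce adversarial transitions can still appear in every batch.

```python
import random
from collections import deque
from typing import Any, List


class DualReplayBuffer:
    """Two FIFO buffers: one for benign and one for adversarial transitions."""

    def __init__(self, capacity: int, seed: int = 0) -> None:
        self.benign: deque = deque(maxlen=capacity)
        self.adversarial: deque = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, transition: Any, adversarial: bool) -> None:
        (self.adversarial if adversarial else self.benign).append(transition)

    def sample(self, batch_size: int, adversarial_fraction: float = 0.5) -> List[Any]:
        """Draw a mixed batch; both parts are sampled with replacement."""
        if not self.benign:
            raise ValueError("benign buffer is empty")
        # Fall back to a purely benign batch until adversarial data exists.
        n_adv = round(batch_size * adversarial_fraction) if self.adversarial else 0
        batch = self._rng.choices(list(self.adversarial), k=n_adv) if n_adv else []
        batch += self._rng.choices(list(self.benign), k=batch_size - n_adv)
        self._rng.shuffle(batch)
        return batch


buffer = DualReplayBuffer(capacity=1000)
for i in range(100):
    buffer.add(("benign", i), adversarial=False)
for i in range(3):
    buffer.add(("adv", i), adversarial=True)
batch = buffer.sample(batch_size=8, adversarial_fraction=0.25)
```

Sampling with replacement means the three adversarial transitions can be reused across batches instead of being drowned out by the hundred benign ones.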

PsychEval: A Multi-Session and Multi-Therapy Benchmark for High-Realism and Comprehensive AI Psychological Counselor

PsychEval：多次会谈和多治疗的高现实主义与综合人工智能心理咨询基准

Authors: Qianjun Pan, Junyi Wang, Jie Zhou, Yutao Yang, Junsong Li, Kaiyin Xu, Yougen Zhou, Yihan Li, Jingyuan Zhao, Qin Chen, Ningning Zhou, Kai Chen, Liang He
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.01802
Pdf link: https://arxiv.org/pdf/2601.01802
Abstract To develop a reliable AI for psychological assessment, we introduce PsychEval, a multi-session, multi-therapy, and highly realistic benchmark designed to address three key challenges: 1) Can we train a highly realistic AI counselor? Realistic counseling is a longitudinal task requiring sustained memory and dynamic goal tracking. We propose a multi-session benchmark (spanning 6-10 sessions across three distinct stages) that demands critical capabilities such as memory continuity, adaptive reasoning, and longitudinal planning. The dataset is annotated with extensive professional skills, comprising over 677 meta-skills and 4577 atomic skills. 2) How to train a multi-therapy AI counselor? While existing models often focus on a single therapy, complex cases frequently require flexible strategies among various therapies. We construct a diverse dataset covering five therapeutic modalities (Psychodynamic, Behaviorism, CBT, Humanistic Existentialist, and Postmodernist) alongside an integrative therapy with a unified three-stage clinical framework across six core psychological topics. 3) How to systematically evaluate an AI counselor? We establish a holistic evaluation framework with 18 therapy-specific and therapy-shared metrics across Client-Level and Counselor-Level dimensions. To support this, we also construct over 2,000 diverse client profiles. Extensive experimental analysis fully validates the superior quality and clinical fidelity of our dataset. Crucially, PsychEval transcends static benchmarking to serve as a high-fidelity reinforcement learning environment that enables the self-evolutionary training of clinically responsible and adaptive AI counselors.
中文摘要 为了开发可靠的心理评估AI，我们推出了PsychEval，这是一个多会话、多疗法且高度真实的基准，旨在解决三大关键挑战：1）我们能否训练一个高度真实的AI咨询师？真实的咨询是一项纵向任务，需要持续记忆和动态目标跟踪。我们提出了一个多会话基准测试（涵盖6-10个会话，三个不同阶段），要求具备记忆连续性、自适应推理和纵向规划等关键能力。该数据集标注了丰富的专业技能，包含超过677项元技能和4577项原子技能。2）如何训练一名多疗法AI咨询师？现有模型通常专注于单一疗法，但复杂病例往往需要在多种疗法之间灵活调整策略。我们构建了一个涵盖五种治疗模式（心理动力学、行为主义、认知行为疗法、人本存在主义和后现代主义）的多样化数据集，同时结合了一套整合疗法，采用统一的三阶段临床框架，涵盖六个核心心理学主题。3）如何系统地评估一位AI咨询师？我们建立了包含18项疗法专属及疗法共享指标的整体评估框架，涵盖来访者层面和咨询师层面。为此，我们还构建了2000多个多样化的来访者档案。广泛的实验分析充分验证了我们数据集的卓越质量和临床保真度。关键是，PsychEval超越了静态基准测试，作为一个高保真度强化学习环境，支持对临床负责任且适应性强的AI咨询师进行自我进化训练。

Moments Matter: Stabilizing Policy Optimization using Return Distributions

矩很重要：利用回报分布稳定策略优化

Authors: Dennis Jabs, Aditya Mohan, Marius Lindauer
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.01803
Pdf link: https://arxiv.org/pdf/2601.01803
Abstract Deep Reinforcement Learning (RL) agents often learn policies that achieve the same episodic return yet behave very differently, due to a combination of environmental (random transitions, initial conditions, reward noise) and algorithmic (minibatch selection, exploration noise) factors. In continuous control tasks, even small parameter shifts can produce unstable gaits, complicating both algorithm comparison and real-world transfer. Previous work has shown that such instability arises when policy updates traverse noisy neighborhoods and that the spread of post-update return distribution $R(\theta)$, obtained by repeatedly sampling minibatches, updating $\theta$, and measuring final returns, is a useful indicator of this noise. Although explicitly constraining the policy to maintain a narrow $R(\theta)$ can improve stability, directly estimating $R(\theta)$ is computationally expensive in high-dimensional settings. We propose an alternative that takes advantage of environmental stochasticity to mitigate update-induced variability. Specifically, we model state-action return distribution through a distributional critic and then bias the advantage function of PPO using higher-order moments (skewness and kurtosis) of this distribution. By penalizing extreme tail behaviors, our method discourages policies from entering parameter regimes prone to instability. We hypothesize that in environments where post-update critic values align poorly with post-update returns, standard PPO struggles to produce a narrow $R(\theta)$. In such cases, our moment-based correction narrows $R(\theta)$, improving stability by up to 75% in Walker2D, while preserving comparable evaluation returns.
中文摘要 深度强化学习（RL）智能体常常学习策略，这些策略实现相同的回合回报，但行为却截然不同，这主要受环境因素（随机转移、初始条件、奖励噪声）和算法因素（小批量选择、探索噪声）的组合影响。在连续控制任务中，即使是微小的参数偏移也可能导致步态不稳定，使算法比较和现实世界迁移变得复杂。先前研究表明，当策略更新穿越噪声邻域时，会出现这种不稳定性，而通过反复抽样小批次、更新$\theta$并测量最终回报获得的更新后回报分布$R(\theta)$的离散程度，是这种噪声的一个有用指标。虽然明确限制策略保持狭窄的$R(\theta)$可以提升稳定性，但在高维环境中直接估计$R(\theta)$计算成本较高。我们提出了一种利用环境随机性来减轻更新引起的变异性的替代方案。具体来说，我们通过分布式评论家对状态-动作回报分布进行建模，然后利用该分布的高阶矩（偏度和峰度）对PPO的优势函数进行偏置。通过惩罚极端尾部行为，我们的方法阻止策略进入容易不稳定的参数区域。我们假设，在更新后评论家的价值与更新后回报不匹配的环境中，标准PPO难以产生狭窄的$R(\theta)$。在这种情况下，基于矩的修正会收窄$R(\theta)$，使Walker2D的稳定性提升高达75%，同时保持可比的评估回报。
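
The moment-based correction described in the abstract can be sketched as follows. The abstract states only that skewness and kurtosis of the critic's return distribution are used to bias the PPO advantage and penalize extreme tails. The exact form is not given, so the subtractive penalty, the coefficients, and the function names below are assumptions for illustration.

```python
from typing import Sequence, Tuple


def standardized_moments(samples: Sequence[float]) -> Tuple[float, float]:
    """Return (skewness, excess kurtosis) of the samples, population form."""
    n = len(samples)
    mu = sum(samples) / n
    var = sum((x - mu) ** 2 for x in samples) / n
    if var == 0.0:
        return 0.0, 0.0
    std = var ** 0.5
    skew = sum(((x - mu) / std) ** 3 for x in samples) / n
    kurt = sum(((x - mu) / std) ** 4 for x in samples) / n - 3.0
    return skew, kurt


def moment_adjusted_advantage(advantage: float, return_samples: Sequence[float],
                              skew_coef: float = 0.1, kurt_coef: float = 0.1) -> float:
    """Subtract a penalty that grows with asymmetry and heavy tails.

    Hypothetical form: penalize |skewness| and positive excess kurtosis.
    """
    skew, kurt = standardized_moments(return_samples)
    return advantage - skew_coef * abs(skew) - kurt_coef * max(kurt, 0.0)


# Samples (e.g. quantiles from a distributional critic) with one extreme outcome.
skew, kurt = standardized_moments([0.0, 0.0, 0.0, 0.0, 10.0])
adjusted = moment_adjusted_advantage(1.0, [0.0, 0.0, 0.0, 0.0, 10.0])
symmetric = moment_adjusted_advantage(1.0, [-1.0, 0.0, 1.0])
```

A symmetric, light-tailed return distribution leaves the advantage unchanged, while a state-action pair whose returns hinge on a rare extreme outcome is made less attractive to the policy update.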

DermoGPT: Open Weights and Open Data for Morphology-Grounded Dermatological Reasoning MLLMs

DermoGPT：面向形态学基础皮肤病学推理MLLMs的开放权重与开放数据

Authors: Jinghan Ru, Siyuan Yan, Yuguo Yin, Yuexian Zou, Zongyuan Ge
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.01868
Pdf link: https://arxiv.org/pdf/2601.01868
Abstract Multimodal Large Language Models (MLLMs) show promise for medical applications, yet progress in dermatology lags due to limited training data, narrow task coverage, and lack of clinically-grounded supervision that mirrors expert diagnostic workflows. We present a comprehensive framework to address these gaps. First, we introduce DermoInstruct, a large-scale morphology-anchored instruction corpus comprising 211,243 images and 772,675 trajectories across five task formats, capturing the complete diagnostic pipeline from morphological observation and clinical reasoning to final diagnosis. Second, we establish DermoBench, a rigorous benchmark evaluating 11 tasks across four clinical axes: Morphology, Diagnosis, Reasoning, and Fairness, including a challenging subset of 3,600 expert-verified open-ended instances and human performance baselines. Third, we develop DermoGPT, a dermatology reasoning MLLM trained via supervised fine-tuning followed by our Morphologically-Anchored Visual-Inference-Consistent (MAVIC) reinforcement learning objective, which enforces consistency between visual observations and diagnostic conclusions. At inference, we deploy Confidence-Consistency Test-time adaptation (CCT) for robust predictions. Experiments show DermoGPT significantly outperforms 16 representative baselines across all axes, achieving state-of-the-art performance while substantially narrowing the human-AI gap. DermoInstruct, DermoBench and DermoGPT will be made publicly available at this https URL upon acceptance.
中文摘要 多模态大型语言模型（MLLM）在医疗应用中展现出潜力，但由于训练数据有限、任务覆盖范围狭窄，以及缺乏与专家诊断流程相符的临床指导，皮肤科的进展仍滞后。我们提出了一个全面的框架来弥补这些空白。首先，我们介绍DermoInstruct，这是一个大规模的形态学锚定教学语料库，包含211,243张图片和772,675条轨迹，涵盖五种任务格式，捕捉从形态观察、临床推理到最终诊断的完整诊断流程。其次，我们建立了DermoBench，这是一个严格的基准，评估11项任务，涵盖四个临床轴：形态学、诊断、推理和公平性，其中包括3600个专家验证的开放式实例和人体表现基线。第三，我们开发了DermoGPT，这是一种通过监督微调训练的皮肤科推理工具，随后采用形态锚定视觉-推断-一致（MAVIC）强化学习目标，确保视觉观察与诊断结论之间的一致性。在推断时，我们采用置信一致性测试时间适应（CCT）进行稳健预测。实验显示，DermoGPT在所有轴上都远超16个代表性基线，实现了最先进的性能，同时大幅缩小了人机与人工智能的差距。DermoInstruct、DermoBench 和 DermoGPT 将在接受后通过该 https URL 公开使用。

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

代理记忆：学习大型语言模型代理的统一长期与短期记忆管理

Authors: Yi Yu, Liuyi Yao, Yuexiang Xie, Qingquan Tan, Jiaqi Feng, Yaliang Li, Libing Wu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.01885
Pdf link: https://arxiv.org/pdf/2601.01885
Abstract Large language model (LLM) agents face fundamental limitations in long-horizon reasoning due to finite context windows, making effective memory management critical. Existing methods typically handle long-term memory (LTM) and short-term memory (STM) as separate components, relying on heuristics or auxiliary controllers, which limits adaptability and end-to-end optimization. In this paper, we propose Agentic Memory (AgeMem), a unified framework that integrates LTM and STM management directly into the agent's policy. AgeMem exposes memory operations as tool-based actions, enabling the LLM agent to autonomously decide what and when to store, retrieve, update, summarize, or discard information. To train such unified behaviors, we propose a three-stage progressive reinforcement learning strategy and design a step-wise GRPO to address sparse and discontinuous rewards induced by memory operations. Experiments on five long-horizon benchmarks demonstrate that AgeMem consistently outperforms strong memory-augmented baselines across multiple LLM backbones, achieving improved task performance, higher-quality long-term memory, and more efficient context usage.
中文摘要 大型语言模型（LLM）代理在长视野推理方面面临根本性局限，原因是上下文窗口有限，使得有效的记忆管理变得至关重要。现有方法通常将长期记忆（LTM）和短期记忆（STM）作为独立组件处理，依赖启发式或辅助控制器，这限制了适应性和端到端优化。本文提出了代理记忆（AgeMem），这是一个统一框架，将长期记忆和短期记忆管理直接整合进代理策略中。AgeMem 将记忆操作作为基于工具的动作，使 LLM 代理能够自主决定存储、检索、更新、总结或丢弃信息的时机和内容。为训练此类统一行为，我们提出了三阶段渐进强化学习策略，并设计逐步GRPO，以应对记忆操作引发的稀疏和不连续奖励。五个长视野基准测试的实验表明，AgeMem 在多个大型语言模型骨干上持续优于强记忆增强基线，实现了任务性能提升、长期记忆质量提升和更高效的上下文使用。

Evaluating Feature Dependent Noise in Preference-based Reinforcement Learning

基于偏好的强化学习中评估特征相关噪声

Authors: Yuxuan Li, Harshith Reddy Kethireddy, Srijita Das
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.01904
Pdf link: https://arxiv.org/pdf/2601.01904
Abstract Learning from Preferences in Reinforcement Learning (PbRL) has gained attention recently, as it serves as a natural fit for complicated tasks where the reward function is not easily available. However, preferences often come with uncertainty and noise if they are not from perfect teachers. Much prior literature aimed to detect noise, but considered only limited types of noise, most of which are uniformly distributed with no connection to observations. In this work, we formalize the notion of targeted feature-dependent noise and propose several variants like trajectory feature noise, trajectory similarity noise, uncertainty-aware noise, and Language Model noise. We evaluate feature-dependent noise, where noise is correlated with certain features, in complex continuous control tasks from DMControl and Meta-world. Our experiments show that in some feature-dependent noise settings, the learning performance of the state-of-the-art noise-robust PbRL method deteriorates significantly, while a PbRL method with no explicit denoising can surprisingly outperform noise-robust PbRL in the majority of settings. We also find that language model noise exhibits characteristics similar to feature-dependent noise, thereby simulating realistic humans, and we call for further study of robust learning with feature-dependent noise.
中文摘要 强化学习中的偏好学习（PbRL）近年来受到关注，因为它非常适合那些奖励函数不易获得的复杂任务。然而，如果偏好不是来自完美的老师，往往伴随着不确定性和噪音。许多早期文献旨在检测噪声，但噪声类型有限，且大多数均分布，与观测结果无关。在本研究中，我们形式化了目标特征相关噪声的概念，并提出了若干变体，如轨迹特征噪声、轨迹相似噪声、不确定性感知噪声和语言模型噪声。我们评估了DMControl和Meta-world复杂连续控制任务中，噪声与某些特征相关。我们的实验表明，在某些特征依赖噪声的环境中，最先进的噪声稳健PbRL方法的学习性能显著下降，而没有显式去噪的PbRL方法在大多数情况下却能惊人地优于抗噪稳健PbRL。我们还发现语言模型的噪声表现出与特征依赖噪声相似的特性，从而模拟了真实的人类，并呼吁对特征依赖噪声的学习进行更强有力的研究。
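
To make the contrast with uniform noise concrete, here is a minimal sketch of one feature-dependent variant, trajectory similarity noise. The abstract does not give the noise model, so the exponential form, the Euclidean distance, and the parameters `max_flip` and `temperature` are assumptions: the closer two trajectories are in feature space, the more likely the teacher's preference label is flipped.

```python
import math
import random
from typing import Sequence


def flip_probability(features_a: Sequence[float], features_b: Sequence[float],
                     max_flip: float = 0.5, temperature: float = 1.0) -> float:
    """Flip probability that decays with the distance between the two trajectories."""
    distance = math.dist(features_a, features_b)
    return max_flip * math.exp(-distance / temperature)


def noisy_preference(true_label: int, features_a: Sequence[float],
                     features_b: Sequence[float], rng: random.Random) -> int:
    """Return the binary preference label, flipped with a feature-dependent probability."""
    p = flip_probability(features_a, features_b)
    return 1 - true_label if rng.random() < p else true_label


rng = random.Random(0)
p_identical = flip_probability([0.0, 0.0], [0.0, 0.0])
p_distant = flip_probability([0.0, 0.0], [3.0, 4.0])
label = noisy_preference(1, [0.0, 0.0], [0.1, 0.0], rng)
```

Under uniform noise both pairs would be flipped at the same rate. Here nearly indistinguishable trajectories are mislabeled about half the time, while clearly different ones are almost always labeled correctly.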

Distorted Distributional Policy Evaluation for Offline Reinforcement Learning

离线强化学习中的失真分布策略评估

Authors: Ryo Iwaki, Takayuki Osogami
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.01917
Pdf link: https://arxiv.org/pdf/2601.01917
Abstract While Distributional Reinforcement Learning (DRL) methods have demonstrated strong performance in online settings, their success in offline scenarios remains limited. We hypothesize that a key limitation of existing offline DRL methods lies in their uniform underestimation of return quantiles. This uniform pessimism can lead to overly conservative value estimates, ultimately hindering generalization and performance. To address this, we introduce a novel concept called quantile distortion, which enables non-uniform pessimism by adjusting the degree of conservatism based on the availability of supporting data. Our approach is grounded in theoretical analysis and empirically validated, demonstrating improved performance over uniform pessimism.
中文摘要 虽然分布式强化学习（DRL）方法在在线环境中表现出色，但在离线场景中的成功仍然有限。我们假设现有离线DRL方法的一个关键局限在于其对返回分位数的均匀低估方法。这种一致的悲观可能导致过于保守的数值估计，最终阻碍概括和性能。为此，我们引入了一个新概念——分位数失真，通过根据支持数据的可用性调整保守度，实现非均匀悲观。我们的方法基于理论分析并经过实证验证，证明其性能优于单一悲观。

Thinking with Blueprints: Assisting Vision-Language Models in Spatial Reasoning via Structured Object Representation

用蓝图思考：通过结构化对象表示辅助视觉语言模型进行空间推理

Authors: Weijian Ma, Shizhao Sun, Tianyu Yu, Ruiyu Wang, Tat-Seng Chua, Jiang Bian
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.01984
Pdf link: https://arxiv.org/pdf/2601.01984
Abstract Spatial reasoning -- the ability to perceive and reason about relationships in space -- advances vision-language models (VLMs) from visual perception toward spatial semantic understanding. Existing approaches either revisit local image patches, improving fine-grained perception but weakening global spatial awareness, or mark isolated coordinates, which capture object locations but overlook their overall organization. In this work, we integrate the cognitive concept of an object-centric blueprint into VLMs to enhance spatial reasoning. Given an image and a question, the model first constructs a JSON-style blueprint that records the positions, sizes, and attributes of relevant objects, and then reasons over this structured representation to produce the final answer. To achieve this, we introduce three key techniques: (1) blueprint-embedded reasoning traces for supervised fine-tuning to elicit basic reasoning skills; (2) blueprint-aware rewards in reinforcement learning to encourage the blueprint to include an appropriate number of objects and to align final answers with this causal reasoning; and (3) anti-shortcut data augmentation that applies targeted perturbations to images and questions, discouraging reliance on superficial visual or linguistic cues. Experiments show that our method consistently outperforms existing VLMs and specialized spatial reasoning models.
中文摘要 空间推理——即感知和推理空间关系的能力——推动视觉语言模型（VLMs）从视觉感知向空间语义理解发展。现有方法要么重新审视局部图像块，提升细粒度感知但削弱整体空间感知，要么标记孤立坐标，捕捉物体位置却忽略其整体组织。在本研究中，我们将以对象为中心的认知蓝图概念整合进VLMs，以增强空间推理能力。给定一张图片和一个问题，模型首先构建一个JSON风格的蓝图，记录相关对象的位置、大小和属性，然后对该结构化表示进行推理以得出最终答案。为此，我们引入了三种关键技术：（1）蓝图嵌入的推理轨迹，用于监督微调以激发基本推理技能;（2）强化学习中的蓝图意识奖励，鼓励蓝图包含适当数量的对象，并将最终答案与因果推理对齐;以及（3）反捷径数据增强，针对图像和问题进行有针对性的扰动，避免依赖表面的视觉或语言线索。实验表明，我们的方法持续优于现有的VLM和专门的空间推理模型。
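
A small sketch of what a JSON-style blueprint and a reasoning step over it might look like. The abstract only says the blueprint records positions, sizes, and attributes; the field names (`objects`, `bbox`, `attributes`), the `[x, y, width, height]` box layout, and the helper `is_left_of` are assumptions made for illustration.

```python
import json

# Hypothetical blueprint; in the described pipeline the model would emit this text.
blueprint_text = """
{"objects": [
  {"name": "cup", "bbox": [120, 200, 60, 80], "attributes": ["red"]},
  {"name": "laptop", "bbox": [300, 180, 220, 140], "attributes": ["open"]}
]}
"""

def center_x(obj):
    """Horizontal center of an object, assuming bbox = [x, y, width, height]."""
    x, _, width, _ = obj["bbox"]
    return x + width / 2

def is_left_of(blueprint, name_a, name_b):
    """Answer a simple spatial question from the structured record alone."""
    objects = {obj["name"]: obj for obj in blueprint["objects"]}
    return center_x(objects[name_a]) < center_x(objects[name_b])

blueprint = json.loads(blueprint_text)
answer = is_left_of(blueprint, "cup", "laptop")
```

The point of the structure is that the final answer can be checked against the recorded coordinates, which is what a blueprint-aware reward would need.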

GDRO: Group-level Reward Post-training Suitable for Diffusion Models

GDRO：适用于扩散模型的组级奖励后训练

Authors: Yiyang Wang, Xi Chen, Xiaogang Xu, Yu Liu, Hengshuang Zhao
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.02036
Pdf link: https://arxiv.org/pdf/2601.02036
Abstract Recent advancements adopt online reinforcement learning (RL) from LLMs to text-to-image rectified flow diffusion models for reward alignment. The use of group-level rewards successfully aligns the model with the targeted reward. However, it faces challenges including low efficiency, dependency on stochastic samplers, and reward hacking. The problem is that rectified flow models are fundamentally different from LLMs: 1) For efficiency, online image sampling takes much more time and dominates the time of training. 2) For stochasticity, rectified flow is deterministic once the initial noise is fixed. Aiming at these problems and inspired by the effects of group-level rewards from LLMs, we design Group-level Direct Reward Optimization (GDRO). GDRO is a new post-training paradigm for group-level reward alignment that combines the characteristics of rectified flow models. Through rigorous theoretical analysis, we point out that GDRO supports full offline training that saves the large time cost for image rollout sampling. Also, it is diffusion-sampler-independent, which eliminates the need for the ODE-to-SDE approximation to obtain stochasticity. We also empirically study the reward hacking trap that may mislead the evaluation, and involve this factor in the evaluation using a corrected score that not only considers the original evaluation reward but also the trend of reward hacking. Extensive experiments demonstrate that GDRO effectively and efficiently improves the reward score of the diffusion model through group-wise offline optimization across the OCR and GenEval tasks, while demonstrating strong stability and robustness in mitigating reward hacking.
中文摘要 近期工作将大型语言模型（LLM）中的在线强化学习（RL）引入文本到图像的整流流（rectified flow）扩散模型，以实现奖励对齐。使用组级奖励成功使模型与目标奖励对齐。然而，它面临效率低、依赖随机采样器以及奖励黑客（reward hacking）等挑战。问题在于，整流流模型与LLM有根本不同：1）在效率方面，在线图像采样耗时长得多，并占据了训练的大部分时间。2）在随机性方面，一旦初始噪声固定，整流流就是确定性的。针对这些问题，并受到LLM中组级奖励效果的启发，我们设计了组级直接奖励优化（GDRO）。GDRO是一种用于组级奖励对齐的新型后训练范式，结合了整流流模型的特性。通过严谨的理论分析，我们指出GDRO支持完全离线训练，节省了图像rollout采样的高额时间成本。此外，它与扩散采样器无关，从而无需通过常微分方程（ODE）到随机微分方程（SDE）的近似来获得随机性。我们还实证研究了可能误导评估的奖励黑客陷阱，并用修正后的评分将该因素纳入评估，该评分不仅考虑原始评估奖励，还考虑奖励黑客的趋势。大量实验表明，GDRO通过组级离线优化，在OCR和GenEval任务中有效且高效地提升了扩散模型的奖励得分，同时在缓解奖励黑客方面表现出很强的稳定性和鲁棒性。

Higher-Order Action Regularization in Deep Reinforcement Learning: From Continuous Control to Building Energy Management

深度强化学习中的高阶动作正则化：从连续控制到建筑能源管理

Authors: Faizan Ahmed, Aniket Dixit, James Brusey
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.02061
Pdf link: https://arxiv.org/pdf/2601.02061
Abstract Deep reinforcement learning agents often exhibit erratic, high-frequency control behaviors that hinder real-world deployment due to excessive energy consumption and mechanical wear. We systematically investigate action smoothness regularization through higher-order derivative penalties, progressing from theoretical understanding in continuous control benchmarks to practical validation in building energy management. Our comprehensive evaluation across four continuous control environments demonstrates that third-order derivative penalties (jerk minimization) consistently achieve superior smoothness while maintaining competitive performance. We extend these findings to HVAC control systems where smooth policies reduce equipment switching by 60%, translating to significant operational benefits. Our work establishes higher-order action regularization as an effective bridge between RL optimization and operational constraints in energy-critical applications.
中文摘要 深度强化学习智能体常表现出不稳定的高频控制行为，因能量消耗过大和机械磨损而阻碍实际部署。我们系统地研究了基于高阶导数惩罚的动作平滑正则化，从连续控制基准中的理论理解逐步推进到建筑能源管理中的实际验证。我们在四个连续控制环境中的综合评估表明，三阶导数惩罚（jerk最小化）在保持有竞争力的性能的同时，始终实现更优的平滑度。我们将这些发现扩展到暖通空调（HVAC）控制系统，平滑的策略可将设备切换减少60%，带来显著的运营效益。我们的研究确立了高阶动作正则化作为强化学习优化与能源关键型应用中操作约束之间的有效桥梁。
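
A minimal sketch of a higher-order action smoothness penalty, using discrete finite differences of a scalar action sequence as a stand-in for time derivatives (order 3 corresponds to jerk). The function names, the squared-sum form, and the `weight` value are assumptions; the abstract does not specify how the penalty is scaled or where it enters the objective.

```python
def finite_difference(sequence, order):
    """Apply the first-difference operator `order` times."""
    values = list(sequence)
    for _ in range(order):
        values = [later - earlier for earlier, later in zip(values, values[1:])]
    return values

def smoothness_penalty(actions, order=3, weight=0.1):
    """Sum of squared order-th differences of the actions, scaled by `weight`."""
    return weight * sum(d * d for d in finite_difference(actions, order))

smooth_actions = [0.0, 1.0, 4.0, 9.0, 16.0]   # constant acceleration, zero jerk
erratic_actions = [0.0, 1.0, 0.0, 1.0, 0.0]   # rapid switching

smooth_cost = smoothness_penalty(smooth_actions)
erratic_cost = smoothness_penalty(erratic_actions)
```

In a training loop this cost would typically be subtracted from the environment reward or added to the policy loss. Lower orders (1 or 2) penalize action changes and accelerations instead of jerk.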

MDAgent2: Large Language Model for Code Generation and Knowledge Q&A in Molecular Dynamics

MDAgent2：分子动力学中的代码生成与知识问答的大型语言模型

Authors: Zhuofan Shi, Hubao A, Yufei Shao, Mengyan Dai, Yadong Yu, Pan Xiang, Dongliang Huang, Hongxu An, Chunxiao Xin, Haiyang Shen, Zhenyu Wang, Yunshan Na, Gang Huang, Xiang Jing
Subjects: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.02075
Pdf link: https://arxiv.org/pdf/2601.02075
Abstract Molecular dynamics (MD) simulations are essential for understanding atomic-scale behaviors in materials science, yet writing LAMMPS scripts remains a highly specialized and time-consuming task. Although LLMs show promise in code generation and domain-specific question answering, their performance in MD scenarios is limited by scarce domain data, the high deployment cost of state-of-the-art LLMs, and low code executability. Building upon our prior MDAgent, we present MDAgent2, the first end-to-end framework capable of performing both knowledge Q&A and code generation within the MD domain. We construct a domain-specific data-construction pipeline that yields three high-quality datasets spanning MD knowledge, question answering, and code generation. Based on these datasets, we adopt a three-stage post-training strategy--continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL)--to train two domain-adapted models, MD-Instruct and MD-Code. Furthermore, we introduce MD-GRPO, a closed-loop RL method that leverages simulation outcomes as reward signals and recycles low-reward trajectories for continual refinement. We further build MDAgent2-RUNTIME, a deployable multi-agent system that integrates code generation, execution, evaluation, and self-correction. Together with MD-EvalBench proposed in this work, the first benchmark for LAMMPS code generation and question answering, our models and system achieve performance surpassing several strong this http URL work systematically demonstrates the adaptability and generalization capability of large language models in industrial simulation tasks, laying a methodological foundation for automatic code generation in AI for Science and industrial-scale simulations. URL: this https URL
中文摘要 分子动力学（MD）模拟对于理解材料科学中的原子尺度行为至关重要，但编写LAMMPS脚本仍是高度专业化且耗时的任务。尽管LLM在代码生成和领域特定问答方面展现出潜力，但其在MD场景中的表现受限于领域数据稀缺、先进LLM的高部署成本以及代码可执行性低。基于之前的MDAgent，我们推出了MDAgent2，这是首个能够在MD领域内同时执行知识问答和代码生成的端到端框架。我们构建了一个领域特定的数据构建流程，产生三个高质量数据集，涵盖MD知识、问答和代码生成。基于这些数据集，我们采用三阶段后训练策略，即持续预训练（CPT）、监督微调（SFT）和强化学习（RL），来训练两个领域适配模型MD-Instruct和MD-Code。此外，我们引入了MD-GRPO，一种闭环强化学习方法，利用模拟结果作为奖励信号，并回收低奖励轨迹以持续优化。我们进一步构建了MDAgent2-RUNTIME，这是一个可部署的多智能体系统，集成了代码生成、执行、评估和自我纠正。结合本研究提出的MD-EvalBench（LAMMPS代码生成和问答的首个基准），我们的模型和系统性能超越了多个强模型。http URL工作系统地展示了大型语言模型在工业仿真任务中的适应性和泛化能力，为AI for Science和工业规模模拟中的自动代码生成奠定了方法论基础。URL：这个 https URL

Entropy-Adaptive Fine-Tuning: Resolving Confident Conflicts to Mitigate Forgetting

熵适应微调：解决自信冲突以减轻遗忘

Authors: Muxi Diao, Lele Yang, Wuxuan Gong, Yutong Zhang, Zhonghao Yan, Yufei Han, Kongming Liang, Weiran Xu, Zhanyu Ma
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.02151
Pdf link: https://arxiv.org/pdf/2601.02151
Abstract Supervised Fine-Tuning (SFT) is the standard paradigm for domain adaptation, yet it frequently incurs the cost of catastrophic forgetting. In sharp contrast, on-policy Reinforcement Learning (RL) effectively preserves general capabilities. We investigate this discrepancy and identify a fundamental distributional gap: while RL aligns with the model's internal belief, SFT forces the model to fit external supervision. This mismatch often manifests as "Confident Conflicts" tokens characterized by low probability but low entropy. In these instances, the model is highly confident in its own prediction but is forced to learn a divergent ground truth, triggering destructive gradient updates. To address this, we propose Entropy-Adaptive Fine-Tuning (EAFT). Unlike methods relying solely on prediction probability, EAFT utilizes token-level entropy as a gating mechanism to distinguish between epistemic uncertainty and knowledge conflict. This allows the model to learn from uncertain samples while suppressing gradients on conflicting data. Extensive experiments on Qwen and GLM series (ranging from 4B to 32B parameters) across mathematical, medical, and agentic domains confirm our hypothesis. EAFT consistently matches the downstream performance of standard SFT while significantly mitigating the degradation of general capabilities.
中文摘要 监督式微调（SFT）是领域适应的标准范式，但它常常伴随着灾难性遗忘的代价。与此形成鲜明对比的是，同策略（on-policy）强化学习（RL）能有效保留通用能力。我们研究了这一差异，并指出一个根本的分布差距：强化学习与模型的内部信念保持一致，而SFT强迫模型拟合外部监督。这种不匹配通常表现为“自信冲突”token，其特征是概率低且熵低。在这些情况下，模型对自身预测高度自信，却被迫学习与之相悖的真实标签（ground truth），从而触发破坏性的梯度更新。为此，我们提出了熵自适应微调（EAFT）。与仅依赖预测概率的方法不同，EAFT利用token级熵作为门控机制，区分认知不确定性与知识冲突。这使得模型能够从不确定的样本中学习，同时抑制冲突数据上的梯度。在Qwen和GLM系列（参数规模从4B到32B）上针对数学、医学和智能体（agentic）领域进行的大量实验证实了我们的假设。EAFT始终能与标准SFT的下游性能相当，同时显著减轻了通用能力的退化。
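
A rough sketch of entropy-gated token loss, to make the "low probability but low entropy" case concrete. The gate used here (entropy divided by its maximum, `log(vocab_size)`) is an assumption chosen for simplicity; the abstract states only that token-level entropy serves as the gating signal, not its exact form.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def gated_token_loss(probs, target_index):
    """Cross-entropy on the target token, scaled by a hypothetical entropy gate.

    Low entropy plus low target probability (a "confident conflict")
    gives a small gate, so that token contributes little gradient signal.
    """
    nll = -math.log(max(probs[target_index], 1e-12))
    gate = entropy(probs) / math.log(len(probs))  # normalized to [0, 1]
    return gate * nll

confident_conflict = gated_token_loss([0.98, 0.01, 0.01], target_index=1)
uncertain_token = gated_token_loss([0.34, 0.33, 0.33], target_index=1)
```

A probability-only filter cannot separate these two cases as cleanly, since the target probability is low in the first and only moderate in the second. The entropy of the full distribution is what distinguishes a confident disagreement from genuine uncertainty.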

ACDZero: Graph-Embedding-Based Tree Search for Mastering Automated Cyber Defense

ACDZero：基于图嵌入的树搜索，助力掌握自动化网络防御

Authors: Yu Li, Sizhe Tang, Rongqian Chen, Fei Xu Yu, Guangyu Jiang, Mahdi Imani, Nathaniel D. Bastian, Tian Lan
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.02196
Pdf link: https://arxiv.org/pdf/2601.02196
Abstract Automated cyber defense (ACD) seeks to protect computer networks with minimal or no human intervention, reacting to intrusions by taking corrective actions such as isolating hosts, resetting services, deploying decoys, or updating access controls. However, existing approaches for ACD, such as deep reinforcement learning (RL), often face difficult exploration in complex networks with large decision/state spaces and thus require an expensive amount of samples. Inspired by the need to learn sample-efficient defense policies, we frame ACD in CAGE Challenge 4 (CAGE-4 / CC4) as a context-based partially observable Markov decision problem and propose a planning-centric defense policy based on Monte Carlo Tree Search (MCTS). It explicitly models the exploration-exploitation tradeoff in ACD and uses statistical sampling to guide exploration and decision making. We make novel use of graph neural networks (GNNs) to embed observations from the network as attributed graphs, to enable permutation-invariant reasoning over hosts and their relationships. To make our solution practical in complex search spaces, we guide MCTS with learned graph embeddings and priors over graph-edit actions, combining model-free generalization and policy distillation with look-ahead planning. We evaluate the resulting agent on CC4 scenarios involving diverse network structures and adversary behaviors, and show that our search-guided, graph-embedding-based planning improves defense reward and robustness relative to state-of-the-art RL baselines.
中文摘要 自动化网络防御（ACD）旨在以最少甚至无需人工干预的方式保护计算机网络，通过采取纠正措施应对入侵，如隔离主机、重置服务、部署诱饵或更新访问控制。然而，现有的ACD方法，如深度强化学习（RL），在具有大决策/状态空间的复杂网络中常常面临探索困难，因此需要大量样本。受学习样本高效防御策略这一需求的启发，我们将CAGE Challenge 4（CAGE-4 / CC4）中的ACD建模为基于上下文的部分可观测马尔可夫决策问题，并提出了基于蒙特卡洛树搜索（MCTS）的以规划为中心的防御策略。它显式建模了ACD中的探索与利用权衡，并利用统计采样指导探索和决策。我们创新地利用图神经网络（GNN）将网络中的观测嵌入为属性图，从而实现对主机及其关系的置换不变推理。为了使我们的解决方案在复杂搜索空间中实用，我们利用学习到的图嵌入和图编辑动作上的先验来指导MCTS，将无模型的泛化和策略蒸馏与前瞻规划相结合。我们在涉及多样网络结构和对手行为的CC4场景下评估了所得智能体，表明我们基于搜索引导和图嵌入的规划相较于最先进的强化学习基线提升了防御奖励和鲁棒性。
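
One common way to guide MCTS with learned action priors is the PUCT selection rule popularized by AlphaZero-style agents. The sketch below shows that rule only; whether ACDZero uses exactly this formula is an assumption, and the action names, statistics, and `c_puct` value are made up for illustration.

```python
import math

def puct_score(q_value, prior, parent_visits, child_visits, c_puct=1.5):
    """Value estimate plus an exploration bonus weighted by the learned prior."""
    bonus = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q_value + bonus

def select_action(children, parent_visits):
    """children maps an action name to (q_value, prior, visit_count)."""
    return max(
        children,
        key=lambda action: puct_score(
            children[action][0], children[action][1],
            parent_visits, children[action][2],
        ),
    )

# Hypothetical statistics at one search node.
children = {
    "isolate_host": (0.2, 0.7, 0),    # unvisited, but the prior favors it
    "deploy_decoy": (0.5, 0.1, 10),   # well explored, modest value
}
chosen = select_action(children, parent_visits=10)
```

A strong prior lets the search try promising defensive actions early instead of sampling the large action space uniformly, which is where the sample-efficiency argument in the abstract comes from.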

CORE: Code-based Inverse Self-Training Framework with Graph Expansion for Virtual Agents

CORE：基于代码的逆向自训框架，支持虚拟代理图展开

Authors: Keyu Wang, Bingchen Miao, Wendong Bu, Yu Wu, Juncheng Li, Shengyu Zhang, Wenqiao Zhang, Siliang Tang, Jun Xiao, Yueting Zhuang
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.02201
Pdf link: https://arxiv.org/pdf/2601.02201
Abstract The development of Multimodal Virtual Agents has made significant progress through the integration of Multimodal Large Language Models. However, mainstream training paradigms face key challenges: Behavior Cloning is simple and effective through imitation but suffers from low behavioral diversity, while Reinforcement Learning is capable of discovering novel strategies through exploration but heavily relies on manually designed reward functions. To address the conflict between these two methods, we present CORE, a Code-based Inverse Self-Training Framework with Graph Expansion that bridges imitation and exploration, offering a novel training framework that promotes behavioral diversity while eliminating the reliance on manual reward design. Specifically, we introduce Semantic Code Abstraction to automatically infer reward functions from expert demonstrations without manual design. The inferred reward function, referred to as the Label Function, is executable code that verifies one key step within a task. Building on this, we propose Strategy Graph Expansion to enhance in-domain behavioral diversity, which constructs a multi-path graph called Strategy Graph that captures diverse valid solutions beyond expert demonstrations. Furthermore, we introduce Trajectory-Guided Extrapolation, which enriches out-of-domain behavioral diversity by utilizing both successful and failed trajectories to expand the task space. Experiments on Web and Android platforms demonstrate that CORE significantly improves both overall performance and generalization, highlighting its potential as a robust and generalizable training paradigm for building powerful virtual agents.
中文摘要 多模态虚拟代理的发展通过整合多模态大型语言模型取得了显著进展。然而，主流训练范式面临关键挑战：行为克隆通过模仿简单且有效，但行为多样性较低，强化学习通过探索发现新策略，但高度依赖手动设计的奖励函数。为了解决这两种方法之间的冲突，我们提出了基于代码的逆自训框架，带有图展开功能，桥接了模仿与探索，提供了一个促进行为多样性的新型训练框架，同时消除了对人工奖励设计的依赖。具体来说，我们引入了语义代码抽象，能够在无需人工设计的情况下，从专家演示中自动推断奖励函数。推断奖励函数称为标签函数，是可执行的代码，用于验证任务中的一个关键步骤。基于此，我们提出了策略图扩展以增强领域内行为多样性，构建一个多路径图，称为策略图，捕捉专家演示之外的多样有效解。此外，我们引入了轨迹引导外推法，通过利用成功和失败轨迹来扩展任务空间，丰富了域外行为多样性。在网页和安卓平台上的实验表明，CORE显著提升了整体性能和泛化性，凸显了其作为构建强大虚拟代理的稳健且可通用训练范式的潜力。

NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation

NextFlow：统一顺序建模激活多模态理解与生成

Authors: Huichao Zhang, Liao Qu, Yiheng Liu, Hang Chen, Yangyang Song, Yongsheng Dong, Shikun Sun, Xian Li, Xu Wang, Yi Jiang, Hu Ye, Bo Chen, Yiming Gao, Peng Liu, Akide Liu, Zhipeng Yang, Qili Deng, Linjie Xing, Jiyang Liu, Zhao Wang, Yang Zhou, Mingcong Liu, Yi Zhang, Qian He, Xiwei Hu, Zhongqi Qi, Jie Shao, Zhiye Fu, Shuai Wang, Fangmin Chen, Xuezhi Chai, Zhihua Wu, Yitong Wang, Zehuan Yuan, Daniel K. Du, Xinglong Wu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.02204
Pdf link: https://arxiv.org/pdf/2601.02204
Abstract We present NextFlow, a unified decoder-only autoregressive transformer trained on 6 trillion interleaved text-image discrete tokens. By leveraging a unified vision representation within a unified autoregressive architecture, NextFlow natively activates multimodal understanding and generation capabilities, unlocking abilities of image editing, interleaved content and video generation. Motivated by the distinct nature of modalities - where text is strictly sequential and images are inherently hierarchical - we retain next-token prediction for text but adopt next-scale prediction for visual generation. This departs from traditional raster-scan methods, enabling the generation of 1024x1024 images in just 5 seconds - orders of magnitude faster than comparable AR models. We address the instabilities of multi-scale generation through a robust training recipe. Furthermore, we introduce a prefix-tuning strategy for reinforcement learning. Experiments demonstrate that NextFlow achieves state-of-the-art performance among unified models and rivals specialized diffusion baselines in visual quality.
中文摘要 我们介绍NextFlow，一种统一的仅解码器自回归Transformer，在6万亿个交错的文本-图像离散token上训练。通过在统一的自回归架构中使用统一的视觉表示，NextFlow原生激活了多模态理解与生成能力，解锁了图像编辑、交错内容生成和视频生成的能力。鉴于各模态的不同特性（文本严格按序，图像本质上具有层级性），我们对文本保留下一token预测，但对视觉生成采用下一尺度预测。这不同于传统的光栅扫描方法，能够在仅5秒内生成1024x1024的图像，比同类AR模型快几个数量级。我们通过稳健的训练方案解决多尺度生成的不稳定性。此外，我们还引入了一种用于强化学习的前缀调优策略。实验表明，NextFlow在统一模型中实现了最先进的性能，并在视觉质量上可与专门的扩散基线媲美。

Enabling Deep Reinforcement Learning Research for Energy Saving in Open RAN

支持开放RAN节能的深度强化学习研究

Authors: Matteo Bordin, Andrea Lacava, Michele Polese, Francesca Cuomo, Tommaso Melodia
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2601.02240
Pdf link: https://arxiv.org/pdf/2601.02240
Abstract The growing performance demands and higher deployment densities of next-generation wireless systems emphasize the importance of adopting strategies to manage the energy efficiency of mobile networks. In this demo, we showcase a framework that enables research on Deep Reinforcement Learning (DRL) techniques for improving the energy efficiency of intelligent and programmable Open Radio Access Network (RAN) systems. Using the open-source simulator ns-O-RAN and the reinforcement learning environment Gymnasium, the framework makes it possible to train and evaluate DRL agents that dynamically control the activation and deactivation of cells in a 5G network. We show how to collect data for training and evaluate the impact of DRL on energy efficiency in a realistic 5G network scenario, including users' mobility and handovers, a full protocol stack, and 3rd Generation Partnership Project (3GPP)-compliant channel models. The tool will be open-sourced, together with a tutorial for energy efficiency testing in ns-O-RAN.
中文摘要 下一代无线系统日益增长的性能需求和更高的部署密度，凸显了采取策略管理移动网络能效的重要性。在本演示中，我们展示了一个框架，支持对深度强化学习（DRL）技术的研究，以提升智能且可编程的开放无线接入网（RAN）系统的能效。利用开源模拟器ns-O-RAN和强化学习环境Gymnasium，该框架能够训练和评估动态控制5G网络中小区激活和停用的DRL智能体。我们展示了如何收集训练数据，并评估DRL对现实5G网络场景中能效的影响，该场景包括用户的移动性和切换、完整的协议栈以及符合第三代合作伙伴计划（3GPP）标准的信道模型。该工具将开源，并提供ns-O-RAN中能效测试的教程。

VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation

VAR RL 正确执行：解决视觉自回归生成中的异步策略冲突

Authors: Shikun Sun, Liao Qu, Huichao Zhang, Yiheng Liu, Yangyang Song, Xian Li, Xu Wang, Yi Jiang, Daniel K. Du, Xinglong Wu, Jia Jia
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.02256
Pdf link: https://arxiv.org/pdf/2601.02256
Abstract Visual generation is dominated by three paradigms: AutoRegressive (AR), diffusion, and Visual AutoRegressive (VAR) models. Unlike AR and diffusion, VARs operate on heterogeneous input structures across their generation steps, which creates severe asynchronous policy conflicts. This issue becomes particularly acute in reinforcement learning (RL) scenarios, leading to unstable training and suboptimal alignment. To resolve this, we propose a novel framework to enhance Group Relative Policy Optimization (GRPO) by explicitly managing these conflicts. Our method integrates three synergistic components: 1) a stabilizing intermediate reward to guide early-stage generation; 2) a dynamic time-step reweighting scheme for precise credit assignment; and 3) a novel mask propagation algorithm, derived from principles of Reward Feedback Learning (ReFL), designed to isolate optimization effects both spatially and temporally. Our approach demonstrates significant improvements in sample quality and objective alignment over the vanilla GRPO baseline, enabling robust and effective optimization for VAR models.
中文摘要 视觉生成主要由三种范式主导：自回归（AR）、扩散和视觉自回归（VAR）模型。与AR和扩散模型不同，VAR在其各生成步骤中处理异构的输入结构，这导致严重的异步策略冲突。这个问题在强化学习（RL）场景中尤为突出，导致训练不稳定和对齐效果欠佳。为解决这一问题，我们提出了一个新框架，通过显式管理这些冲突来增强组相对策略优化（GRPO）。我们的方法整合了三个协同组件：1）稳定的中间奖励以指导早期生成；2）用于精确信用分配的动态时间步重加权方案；3）一种新颖的掩码传播算法，源自奖励反馈学习（ReFL）原理，旨在从空间和时间上隔离优化效应。我们的方法在样本质量和目标对齐度方面相较于原版GRPO基线有显著提升，从而实现对VAR模型稳健且有效的优化。
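
For context, the vanilla GRPO baseline mentioned above scores each sampled output relative to the other samples drawn for the same prompt. The sketch below shows only that group-relative normalization; it does not implement the paper's intermediate reward, time-step reweighting, or mask propagation. The use of the population standard deviation and the `eps` value are assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every rollout gets the same reward.
    return [(reward - mu) / (sigma + eps) for reward in rewards]

group_rewards = [1.0, 2.0, 3.0]  # hypothetical reward-model scores
advantages = group_relative_advantages(group_rewards)
```

Because the advantage is shared by every generation step of a rollout, a model whose steps operate on different input structures (as VAR scales do) gets the same signal at each step. That is the credit-assignment issue the proposed reweighting targets.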

Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes

Talk2Move：场景中文本指令对象级几何变换的强化学习

Authors: Jing Tan, Zhaoyang Zhang, Yantao Shen, Jiarui Cai, Shuo Yang, Jiajun Wu, Wei Xia, Zhuowen Tu, Stefano Soatto
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.02356
Pdf link: https://arxiv.org/pdf/2601.02356
Abstract We introduce Talk2Move, a reinforcement learning (RL) based diffusion framework for text-instructed spatial transformation of objects within scenes. Spatially manipulating objects in a scene through natural language poses a challenge for multimodal generation systems. While existing text-based manipulation methods can adjust appearance or style, they struggle to perform object-level geometric transformations (such as translating, rotating, or resizing objects) due to scarce paired supervision and pixel-level optimization limits. Talk2Move employs Group Relative Policy Optimization (GRPO) to explore geometric actions through diverse rollouts generated from input images and lightweight textual variations, removing the need for costly paired data. A spatial-reward-guided model aligns geometric transformations with linguistic description, while off-policy step evaluation and active step sampling improve learning efficiency by focusing on informative transformation stages. Furthermore, we design object-centric spatial rewards that evaluate displacement, rotation, and scaling behaviors directly, enabling interpretable and coherent transformations. Experiments on curated benchmarks demonstrate that Talk2Move achieves precise, consistent, and semantically faithful object transformations, outperforming existing text-guided editing approaches in both spatial accuracy and scene coherence.
中文摘要 我们介绍了Talk2Move，一种基于强化学习（RL）的扩散框架，用于按文本指令对场景中的物体进行空间变换。通过自然语言在空间上操控场景中的物体，对多模态生成系统来说是一大挑战。虽然现有基于文本的操控方法可以调整外观或风格，但由于配对监督稀缺和像素级优化的局限，它们难以执行物体级几何变换（如平移、旋转或缩放）。Talk2Move 采用组相对策略优化（Group Relative Policy Optimization，GRPO），通过由输入图像和轻量级文本变体生成的多样化rollout来探索几何动作，消除了对昂贵配对数据的需求。空间奖励引导的模型将几何变换与语言描述对齐，而离策略（off-policy）步评估和主动步采样通过聚焦信息量大的变换阶段来提升学习效率。此外，我们设计了以物体为中心的空间奖励，直接评估位移、旋转和缩放行为，实现可解释且连贯的变换。在精心构建的基准上的实验表明，Talk2Move 实现了精确、一致且语义忠实的物体变换，在空间准确性和场景连贯性上均优于现有的文本引导编辑方法。

Keyword: diffusion policy

Dichotomous Diffusion Policy Optimization

二分扩散策略优化

Authors: Ruiming Liang, Yinan Zheng, Kexin Zheng, Tianyi Tan, Jianxiong Li, Liyuan Mao, Zhihao Wang, Guang Chen, Hangjun Ye, Jingjing Liu, Jinqiao Wang, Xianyuan Zhan
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.00898
Pdf link: https://arxiv.org/pdf/2601.00898
Abstract Diffusion-based policies have gained growing popularity in solving a wide range of decision-making tasks due to their superior expressiveness and controllable generation during inference. However, effectively training large diffusion policies using reinforcement learning (RL) remains challenging. Existing methods either suffer from unstable training due to directly maximizing value objectives, or face computational issues due to relying on crude Gaussian likelihood approximation, which requires a large amount of sufficiently small denoising steps. In this work, we propose DIPOLE (Dichotomous diffusion Policy improvement), a novel RL algorithm designed for stable and controllable diffusion policy optimization. We begin by revisiting the KL-regularized objective in RL, which offers a desirable weighted regression objective for diffusion policy extraction, but often struggles to balance greediness and stability. We then formulate a greedified policy regularization scheme, which naturally enables decomposing the optimal policy into a pair of stably learned dichotomous policies: one aims at reward maximization, and the other focuses on reward minimization. Under such a design, optimized actions can be generated by linearly combining the scores of dichotomous policies during inference, thereby enabling flexible control over the level of this http URL in offline and offline-to-online RL settings on ExORL and OGBench demonstrate the effectiveness of our approach. We also use DIPOLE to train a large vision-language-action (VLA) model for end-to-end autonomous driving (AD) and evaluate it on the large-scale real-world AD benchmark NAVSIM, highlighting its potential for complex real-world applications.
中文摘要 基于扩散的策略因其卓越的表达性和推理过程中可控生成的能力，在解决各种决策任务中越来越受欢迎。然而，有效利用强化学习（RL）训练大规模扩散策略仍然具有挑战性。现有方法要么因直接最大化值目标而导致训练不稳定，要么因依赖粗糙的高斯似然近似而面临计算问题，这需要大量足够小的去噪步骤。本研究提出了DIPOLE（二分扩散策略改进），这是一种用于稳定且可控扩散策略优化的新型强化学习算法。我们首先回顾了强化学习中的 KL-正则化目标，它提供了一个理想的加权回归目标以提取扩散策略，但常常难以平衡贪婪与稳定性。随后，我们提出了一个贪婪化的政策正则化方案，自然地将最优政策分解为一对稳定学习的二分策略：一个旨在奖励最大化，另一个专注于奖励最小化。在这种设计下，通过线性组合推理中的二分策略分数，可以生成优化动作，从而在ExORL和OGBench的离线和离线强化环境中灵活控制该http URL的层级，展示了我们方法的有效性。我们还使用DIPOLE训练一个大型视觉-语言-动作（VLA）模型，用于端到端自动驾驶（AD），并在大规模现实世界自动驾驶基准测试NAVSIM上进行评估，凸显其在复杂现实应用中的潜力。

Contractive Diffusion Policies: Robust Action Diffusion via Contractive Score-Based Sampling with Differential Equations

收缩扩散策略：通过基于微分方程的收缩式得分采样实现稳健的动作扩散

Authors: Amin Abyaneh, Charlotte Morissette, Mohamad H. Danesh, Anas El Houssaini, David Meger, Gregory Dudek, Hsiu-Chin Lin
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.01003
Pdf link: https://arxiv.org/pdf/2601.01003
Abstract Diffusion policies have emerged as powerful generative models for offline policy learning, whose sampling process can be rigorously characterized by a score function guiding a Stochastic Differential Equation (SDE). However, the same score-based SDE modeling that grants diffusion policies the flexibility to learn diverse behavior also incurs solver and score-matching errors, large data requirements, and inconsistencies in action generation. While less critical in image generation, these inaccuracies compound and lead to failure in continuous control settings. We introduce Contractive Diffusion Policies (CDPs) to induce contractive behavior in the diffusion sampling dynamics. Contraction pulls nearby flows closer to enhance robustness against solver and score-matching errors while reducing unwanted action variance. We develop an in-depth theoretical analysis along with a practical implementation recipe to incorporate CDPs into existing diffusion policy architectures with minimal modification and computational cost. We evaluate CDPs for offline learning by conducting extensive experiments in simulation and real-world settings. Across benchmarks, CDPs often outperform baseline policies, with pronounced benefits under data scarcity.
中文摘要 扩散策略已成为离线策略学习的强大生成模型，其采样过程可以通过一个引导随机微分方程（SDE）的评分函数进行严格表征。然而，同样基于分数的SDE建模赋予扩散策略学习多样性行为的灵活性，也会导致求解器和分数匹配错误、大量数据需求以及动作生成的不一致。虽然在图像生成中影响较小，但这些不准确性会叠加，导致连续控制中的失败。我们引入了收缩扩散策略（CDP），以在扩散采样动力学中诱导收缩行为。收缩将附近的流拉得更近，以增强对求解器和分数匹配错误的鲁棒性，同时减少不必要的动作方差。我们开发了深入的理论分析和实用的实现方案，以最小的修改和计算成本将CDP整合进现有的扩散策略架构。我们通过在模拟和现实环境中进行大量实验，评估CDP的离线学习能力。在各基准指标中，CDP常常优于基线政策，在数据稀缺下带来显著益处。