Arxiv Papers of Today

Generated: 2026-04-30 18:08:15 (UTC+8); arXiv release time: 2026-04-30 20:00 EDT (2026-05-01 08:00 UTC+8)

18 relevant papers today

Keyword: reinforcement learning

Digital Twin-assisted belief-state reinforcement learning for latency-robust ISAC in 6G networks

Authors: Himanshu Tiwari (1 and 2), Binayak Kar (1 and 2), Priyanshu Tiwari (3) ((1) National Taiwan University of Science and Technology, Taipei, Taiwan, (2) Quantum Research Lab, National Taiwan University of Science and Technology, Taipei, Taiwan, (3) Sir M. Visvesvaraya Institute of Technology, Bengaluru, India)
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2604.25967
Pdf link: https://arxiv.org/pdf/2604.25967
Abstract Integrated Sensing and Communication (ISAC) enables joint data transmission and environmental perception for sixth-generation (6G) networks, but centralized and virtualized RAN control loops introduce telemetry latency that yields stale observations and unstable control. This paper proposes a Digital Twin-assisted belief-state reinforcement learning framework for latency-robust ISAC. A Digital Twin (DT) reconstructs a synchronized belief state from delayed telemetry using an Extended Kalman Filter, and a Proximal Policy Optimization agent performs joint beamforming and power allocation for communication and sensing. Closed-loop simulations with telemetry delays up to 100 ms demonstrate consistent performance gains over latency-unaware deep reinforcement learning (DRL) and heuristic baselines. At 50 ms latency, the proposed method improves median throughput by 12% and reduces sensing error by 7% relative to a DT-only controller, while achieving an order-of-magnitude reduction in reliability violations. Even at 100 ms latency, the proposed approach retains approximately 88% of its zero-latency throughput. These results show that Digital Twin-assisted belief-state control enables stable and efficient ISAC operation under realistic telemetry delays in 6G networks.
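Illustrative sketch (not taken from the paper): the abstract describes rebuilding a synchronized belief state from delayed telemetry with an Extended Kalman Filter. The snippet below shows the same idea in its simplest form, assuming a scalar linear model instead of the paper's nonlinear one: fuse the delayed measurement, then propagate the estimate forward over the delay. All function names, the model coefficient `a`, and the noise variances are assumptions made for illustration.

```python
def kalman_update(mean, var, z, meas_var):
    """Fuse a measurement z into a scalar Gaussian belief (mean, var)."""
    gain = var / (var + meas_var)
    return mean + gain * (z - mean), (1.0 - gain) * var


def predict_forward(mean, var, a, proc_var, steps):
    """Propagate the belief through x_{k+1} = a * x_k + noise for `steps` steps."""
    for _ in range(steps):
        mean = a * mean
        var = a * a * var + proc_var
    return mean, var


def belief_from_delayed(mean, var, z_delayed, delay_steps,
                        a=1.0, proc_var=0.1, meas_var=1.0):
    """(mean, var) is the belief at the time the delayed sample was taken.

    The delayed sample is fused first, then the estimate is rolled forward
    by the known delay so the controller acts on a present-time belief.
    """
    mean, var = kalman_update(mean, var, z_delayed, meas_var)
    return predict_forward(mean, var, a, proc_var, delay_steps)
```

The variance grows with the delay, which is one way a downstream policy could be told how much to trust the reconstructed state.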

A Survey of Multi-Agent Deep Reinforcement Learning with Graph Neural Network-Based Communication

Authors: Valentin Cuzin-Rambaud (LIRIS, UCBL), Laetitia Matignon (LIRIS, UCBL), Maxime Morge (LIRIS, UCBL)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2604.25972
Pdf link: https://arxiv.org/pdf/2604.25972
Abstract In multi-agent reinforcement learning (MARL), integrating a communication mechanism allows agents to better learn to coordinate their actions and converge on their objectives by sharing information. Based on an interaction graph, a subclass of methods employs graph neural networks (GNNs) to learn the communication, enabling agents to improve their internal representations by enriching them with exchanged information. With growing research, we note a lack of explicit structure and framework to distinguish and classify MARL approaches with communication based on GNNs. Thus, this paper surveys recent works in this field. We propose a generalized GNN-based communication process with the goal of making the underlying concepts behind the methods more obvious and accessible.

Application of Deep Reinforcement Learning to Event-Triggered Control for Networked Artificial Pancreas Systems

Authors: Junya Ikemoto, Satoshi Maruyama, Kazumune Hashimoto
Subjects: Systems and Control (eess.SY); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2604.26126
Pdf link: https://arxiv.org/pdf/2604.26126
Abstract This paper proposes a deep reinforcement learning (DRL)-based event-triggered controller design for networked artificial pancreas (AP) systems. Although existing DRL-based AP controllers typically assume periodic control updates, networked control systems (NCSs) require a reduction in communication frequency to achieve energy-efficient operation, which is directly tied to control updates. However, jointly learning both insulin dosing and update timing significantly increases the complexity of the learning problem. To alleviate this complexity, we develop a practical DRL-based controller design that avoids explicitly learning update timing by introducing a rule-based criterion defined by changes in blood glucose. As a result, decision-making occurs at irregular intervals, and the problem is naturally formulated as a semi-Markov decision process (SMDP), for which we extend a standard DRL algorithm. Numerical experiments demonstrate that the proposed method improves communication efficiency while maintaining control performance.
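Illustrative sketch (not taken from the paper): the abstract combines a rule-based trigger on blood-glucose change with a semi-Markov decision process, where decisions are separated by a variable number of steps. The snippet shows one plausible form of each piece: a threshold trigger and the standard SMDP bootstrapped target, which discounts by the elapsed interval. The threshold rule, function names, and parameters are assumptions; the paper's actual criterion and algorithm may differ.

```python
def should_update(bg_now, bg_at_last_update, threshold):
    """Hypothetical trigger: send a new control update only when blood
    glucose has moved by at least `threshold` since the last update."""
    return abs(bg_now - bg_at_last_update) >= threshold


def smdp_target(rewards, gamma, next_value):
    """Bootstrapped target over a variable-length interval.

    `rewards` holds the per-step rewards collected between two decisions
    (tau = len(rewards) steps). The next value is discounted by gamma**tau
    rather than by a single gamma, which is what distinguishes the SMDP
    backup from the fixed-interval MDP backup.
    """
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total + (gamma ** len(rewards)) * next_value
```

With an interval of length one, `smdp_target` reduces to the usual one-step target `r + gamma * next_value`.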

AI Observability for Large Language Model Systems: A Multi-Layer Analysis of Monitoring Approaches from Confidence Calibration to Infrastructure Tracing

Authors: Twinkll Sisodia
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2604.26152
Pdf link: https://arxiv.org/pdf/2604.26152
Abstract The deployment of large language models (LLMs) in production environments has created an urgent need for observability systems that span the full stack -- from model internals to GPU kernels. Yet existing monitoring approaches address isolated layers of this stack, and no comprehensive analysis has examined how these techniques relate, overlap, or complement each other. This paper presents a structured analysis of five recent research contributions (2025-2026) that collectively define the emerging landscape of AI observability: confidence calibration via reinforcement learning (MIT), internal state monitoring through propositional probes (UC Berkeley), chain-of-thought monitorability evaluation (OpenAI), autonomous cloud operations benchmarking (Microsoft Research, UC Berkeley, UIUC), and non-intrusive inference-level tracing (TRUFFLD). We organize these contributions into a five-layer observability taxonomy, synthesize their key findings into a unified comparison, and identify four critical gaps that remain unaddressed. We further contextualize these research directions against practical operational observability systems that translate infrastructure telemetry into actionable insights for site reliability teams. Our analysis reveals that while individual monitoring layers have matured rapidly, the integration challenge -- connecting model-level confidence signals with infrastructure-level anomalies into coherent operational intelligence -- remains the defining open problem for the field.

DORA: A Scalable Asynchronous Reinforcement Learning System for Language Model Training

Authors: Tianhao Hu, Xiangcheng Liu, Youshao Xiao, Yang Zheng, Xuan Huang, Jinrui Ding, Yufei Zhang, Tao Liang, Hongyu Zang, Quan Chen, Yueqing Sun, Wenjie Shi, Chao Zhang, Wei Wang, Qi Gu, Yerui Sun, Yucheng Xie, Xunliang Cai
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2604.26256
Pdf link: https://arxiv.org/pdf/2604.26256
Abstract Reinforcement learning (RL) has become a critical paradigm for LLM post-training, yet the rollout phase -- accounting for 50--80% of total step time -- is bottlenecked by skewed generation: long-tailed trajectories indispensable for model performance block the entire training pipeline. Asynchronous training offers a natural remedy by overlapping generation with training, but introduces a fundamental tension between efficiency and algorithmic correctness. We identify three constraints in asynchronous training to preserve convergence: intra-trajectory policy consistency, data integrity, and bounded staleness. Existing approaches fail to intrinsically address the long-tailed trajectory problem, which is further exacerbated by the imbalance characteristic of Mixture-of-Experts models, or deviate from the standard RL training formulation, thereby hindering model convergence. Therefore, we propose DORA (Dynamic ORchestration for Asynchronous Rollout), which addresses this challenge through algorithm-system co-design. DORA introduces multi-version streaming rollout, a novel asynchronous paradigm that maintains multiple policy versions concurrently, achieving full bubble elimination without compromising algorithmic constraints. Experimental results demonstrate that our DORA system achieves substantial improvements in throughput -- up to 2--3 times higher than state-of-the-art systems on open-source benchmarks -- without compromising convergence. Furthermore, in large-scale industrial applications with tens of thousands of accelerators, DORA accelerates RL training by 2--4 times compared to synchronous training across various scenarios. The resultant open-source models, LongCat-Flash-Thinking, exhibit competitive performance on complex reasoning benchmarks, matching the capability of most advanced LLMs.
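Illustrative sketch (not taken from the paper): two of the three constraints named in the abstract, intra-trajectory policy consistency and bounded staleness, can be read as simple admission checks on a finished rollout. The snippet encodes one such reading. The `Trajectory` record, the per-token version tags, and the exact definitions of both checks are assumptions; DORA's real data structures and rules are not described in the abstract.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    # Hypothetical record: the policy version that produced each generated token.
    token_versions: List[int]


def is_policy_consistent(traj: Trajectory) -> bool:
    """Assumed reading: every token of a trajectory comes from one policy version."""
    return len(set(traj.token_versions)) == 1


def within_staleness(traj: Trajectory, trainer_version: int, max_staleness: int) -> bool:
    """Assumed reading: the oldest contributing policy lags the trainer
    by at most `max_staleness` versions."""
    return trainer_version - min(traj.token_versions) <= max_staleness


def admissible(traj: Trajectory, trainer_version: int, max_staleness: int) -> bool:
    """Accept a rollout for training only if it is non-empty and passes both checks."""
    return (
        bool(traj.token_versions)
        and is_policy_consistent(traj)
        and within_staleness(traj, trainer_version, max_staleness)
    )
```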

MedSynapse-V: Bridging Visual Perception and Clinical Intuition via Latent Memory Evolution

Authors: Chunzheng Zhu, Jiaqi Zeng, Junyu Jiang, Jianxin Lin, Yijun Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.26283
Pdf link: https://arxiv.org/pdf/2604.26283
Abstract High-precision medical diagnosis relies not only on static imaging features but also on the implicit diagnostic memory experts instantly invoke during image interpretation. We pinpoint a fundamental cognitive misalignment in medical VLMs caused by discrete tokenization, leading to quantization loss, long-range information dissipation, and missing case-adaptive expertise. To bridge this gap, we propose MedSynapse-V, a framework for latent diagnostic memory evolution that simulates the experiential invocation of clinicians by dynamically synthesizing implicit diagnostic memories within the model's hidden stream. Specifically, it begins with a Meta Query for Prior Memorization mechanism, where learnable probes retrieve structured priors from an anatomical prior encoder to generate condensed implicit memories. To ensure clinical fidelity, we introduce Causal Counterfactual Refinement (CCR), which leverages reinforcement learning and counterfactual rewards derived from region-level feature masking to quantify the causal contribution of each memory, thereby pruning redundancies and aligning latent representations with diagnostic logic. This evolutionary process culminates in Intrinsic Memory Transition (IMT), a privileged-autonomous dual-branch paradigm that internalizes teacher-branch diagnostic patterns into the student branch via full-vocabulary divergence alignment. Comprehensive empirical evaluations across multiple datasets demonstrate that MedSynapse-V, by transferring external expertise into endogenous parameters, significantly outperforms existing state-of-the-art methods, particularly chain-of-thought paradigms, in diagnostic accuracy.

Addressing Performance Saturation for LLM RL via Precise Entropy Curve Control

Authors: Bolian Li, Yifan Wang, Yi Ding, Anamika Lochab, Ananth Grama, Ruqi Zhang
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2604.26326
Pdf link: https://arxiv.org/pdf/2604.26326
Abstract Reinforcement learning (RL) has unlocked complex reasoning abilities in large language models (LLMs). However, most RL algorithms suffer from performance saturation, preventing further gains as RL training scales. This problem can be characterized by the collapse of entropy, a key diagnostic for exploration in RL. Existing attempts have tried to prevent entropy collapse through regularization or clipping, but their resulting entropy curves often exhibit instability in the long term, which hinders performance gains. In this paper, we introduce Entrocraft, a simple rejection-sampling approach that realizes any user-customized entropy schedule by biasing the advantage distributions. Entrocraft requires no objective regularization and is advantage-estimator-agnostic. Theoretically, we relate per-step entropy change to the advantage distribution under minimal assumptions, which explains the behavior of existing RL and entropy-preserving methods. Entrocraft also enables a systematic study of entropy schedules, where we find that linear annealing, which starts high and decays to a slightly lower target, performs best. Empirically, Entrocraft addresses performance saturation, significantly improving generalization, output diversity, and long-term training. It enables a 4B model to outperform an 8B baseline, sustains improvement for up to 4x longer before plateauing, and raises pass@K by 50% over the baseline.
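Illustrative sketch (not taken from the paper): the abstract reports that a linear annealing entropy schedule, starting high and decaying to a slightly lower target, works best. The snippet shows what such a schedule and the quantity it tracks might look like. The function names and arguments are assumptions, and the rejection-sampling step that steers the policy toward the schedule is not reproduced here because the abstract does not specify it.

```python
import math


def linear_entropy_schedule(step, total_steps, start, target):
    """Target entropy at `step`: linear interpolation from `start` to
    `target`, held at `target` once `total_steps` is reached."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (target - start)


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_gap(probs, step, total_steps, start, target):
    """Positive when the policy is more random than the schedule asks for,
    negative when its entropy has fallen below the scheduled value."""
    return entropy(probs) - linear_entropy_schedule(step, total_steps, start, target)
```

A controller built on this could use the sign of `entropy_gap` to decide which samples to favor at each training step.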

Uncertainty-Aware Reward Discounting for Mitigating Reward Hacking

Authors: Disha Singha
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.26360
Pdf link: https://arxiv.org/pdf/2604.26360
Abstract Reinforcement learning (RL) systems typically optimize scalar reward functions that assume precise and reliable evaluation of outcomes. However, real-world objectives--especially those derived from human preferences--are often uncertain, context-dependent, and internally inconsistent. This mismatch can lead to alignment failures such as reward hacking, over-optimization, and overconfident behavior. We introduce a dual-source uncertainty-aware reward framework that explicitly models both epistemic uncertainty in value estimation and uncertainty in human preferences. Model uncertainty is captured via ensemble disagreement over value predictions, while preference uncertainty is derived from variability in reward annotations. We combine these signals through a confidence-adjusted Reliability Filter that adaptively modulates action selection, encouraging a balance between exploitation and caution. Empirical results across multiple discrete grid configurations (6x6, 8x8, 10x10) and high-dimensional continuous control environments (Hopper-v4, Walker2d-v4) demonstrate that our approach yields more stable training dynamics and reduces exploitative behaviors under reward ambiguity, achieving a 93.7% reduction in reward-hacking behavior as measured by trap visitation frequency. We demonstrate statistical significance of these improvements and robustness under up to 30% supervisory noise, albeit with a trade-off in peak observed reward compared to unconstrained baselines. By treating uncertainty as a first-class component of the reward signal, this work offers a principled approach toward more reliable and aligned reinforcement learning systems.
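Illustrative sketch (not taken from the paper): the abstract names two uncertainty sources, ensemble disagreement over value predictions and variability in reward annotations, and combines them in a confidence-adjusted filter. The snippet shows one simple way to turn both into a single confidence factor that discounts the reward. The combining formula `1 / (1 + u_model + u_pref)` and all names are assumptions chosen for clarity, not the paper's Reliability Filter.

```python
from statistics import mean, pstdev


def reliability(value_preds, reward_labels):
    """Confidence in (0, 1]: equals 1 when the ensemble and the annotators
    fully agree, and shrinks as either source of disagreement grows."""
    model_uncertainty = pstdev(value_preds)         # ensemble disagreement
    preference_uncertainty = pstdev(reward_labels)  # annotation variability
    return 1.0 / (1.0 + model_uncertainty + preference_uncertainty)


def discounted_reward(value_preds, reward_labels):
    """Mean annotated reward, scaled down by the confidence factor."""
    return reliability(value_preds, reward_labels) * mean(reward_labels)
```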

Lyapunov-Guided Self-Alignment: Test-Time Adaptation for Offline Safe Reinforcement Learning

Authors: Seungyub Han, Hyungjin Kim, Jungwoo Lee
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.26516
Pdf link: https://arxiv.org/pdf/2604.26516
Abstract Offline reinforcement learning (RL) agents often fail when deployed, as the gap between training datasets and real environments leads to unsafe behavior. To address this, we present SAS (Self-Alignment for Safety), a transformer-based framework that enables test-time adaptation in offline safe RL without retraining. In SAS, the main mechanism is self-alignment: at test time, the pretrained agent generates several imagined trajectories and selects those satisfying the Lyapunov condition. These feasible segments are then recycled as in-context prompts, allowing the agent to realign its behavior toward safety while avoiding parameter updates. In effect, SAS turns Lyapunov-guided imagination into control-invariant prompts, and its transformer architecture admits a hierarchical RL interpretation where prompting functions as Bayesian inference over latent skills. Across Safety Gymnasium and MuJoCo benchmarks, SAS consistently reduces cost and failure while maintaining or improving return.
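Illustrative sketch (not taken from the paper): the self-alignment step described in the abstract keeps only imagined trajectories that satisfy a Lyapunov condition. A common form of that condition is that a Lyapunov function V does not increase along the trajectory; the snippet filters candidates on that basis. The choice of V, the `margin` parameter, and the function names are assumptions; the paper's precise condition may differ.

```python
def satisfies_lyapunov(states, V, margin=0.0):
    """True if V decreases by at least `margin` at every step of the trajectory."""
    return all(V(nxt) - V(cur) <= -margin for cur, nxt in zip(states, states[1:]))


def select_feasible(trajectories, V, margin=0.0):
    """Keep the imagined trajectories that pass the Lyapunov check; these
    would then be reused as in-context prompts."""
    return [t for t in trajectories if satisfies_lyapunov(t, V, margin)]
```

For example, with `V = abs` on scalar states, a trajectory that moves toward the origin is kept and one that moves away is dropped.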

Learning to Route Electric Trucks Under Operational Uncertainty

Authors: Stavros Orfanoudakis, Ziyan Li, Ruixiao Yang, Nikolay Aristov, Pedro P. Vergara, Chuchu Fan, Elenna Dugundji
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.26566
Pdf link: https://arxiv.org/pdf/2604.26566
Abstract Electric truck operations require routing decisions that remain feasible under limited battery range, long charging times, travel and energy consumption, and competition for shared charging infrastructure. These features make electric truck routing a coupled logistics and energy problem, limiting the practicality of heuristics-based methods and rendering them computationally infeasible at scale. This paper proposes a learning-based framework for stochastic electric truck routing under charging constraints and operational uncertainty. The problem, solved with reinforcement learning, is formulated as an event-driven semi-Markov decision process with shared charging resources, stochastic travel and energy requirements, and realistic nonlinear fast-charging behavior. To support learning in this setting, a graph-based representation of system state and feasible decisions is introduced, together with a rule-based action mask that restricts policies to operationally admissible actions, thus improving training efficiency. Building on this formulation, an event-driven simulation environment is developed that supports both reinforcement learning and benchmarking against heuristic and mathematical programming baselines. Computational experiments across a range of fleet sizes show that the proposed learning-based algorithm consistently outperforms baselines and attains performance close to optimization benchmarks in many settings, while preserving high success rates under charging congestion and uncertainty.
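Illustrative sketch (not taken from the paper): a rule-based action mask, as mentioned in the abstract, is typically applied by giving inadmissible actions zero probability before sampling. The snippet pairs a toy feasibility rule (enough battery to reach the candidate stop) with a masked softmax. The rule, the function names, and the energy units are assumptions; the paper's mask covers more operational constraints than this.

```python
import math


def feasible_mask(battery_kwh, energy_needed_kwh):
    """Hypothetical rule: an action is admissible if the truck has enough
    energy to complete it."""
    return [need <= battery_kwh for need in energy_needed_kwh]


def masked_softmax(logits, mask):
    """Softmax over admissible actions only; masked actions get probability 0."""
    if not any(mask):
        raise ValueError("no admissible action")
    peak = max(l for l, ok in zip(logits, mask) if ok)  # for numerical stability
    exps = [math.exp(l - peak) if ok else 0.0 for l, ok in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]
```

Because masked actions can never be sampled, the policy does not spend training episodes discovering that they fail.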

PAINT: Partial-Solution Adaptive Interpolated Training for Self-Distilled Reasoners

Authors: Zhiquan Tan, Yinrong Hong
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.26573
Pdf link: https://arxiv.org/pdf/2604.26573
Abstract Improving large language model (LLM) reasoning requires supervision that is both aligned with the model's own test-time states and informative at the token level. Reinforcement learning with verifiable rewards provides on-policy exploration but offers sparse, high-variance credit; supervised fine-tuning and distillation provide dense targets but often train on fixed trajectories or rely on stronger teachers. Recent privileged on-policy self-distillation explores a middle ground by scoring student rollouts with the same model under verified solution context. We revisit this setting through a contextual re-scoring lens: for reasoning, the important choices are not only whether privileged context is available, but how much of it should be revealed and where its distribution should shape the student. We propose PAINT (Partial-solution Adaptive INterpolated Training), which masks the verified solution according to rollout-reference overlap and applies a small energy-space interpolation on a sparse set of entropy-mismatch token positions. Across competition-level math benchmarks, PAINT consistently improves over a strong prior on-policy self-distillation baseline at all three Qwen3 scales. On Qwen3-8B, it raises macro Avg@12 by 2.1 points over this prior baseline and 2.9 points over GRPO.

ATLAS: An Annotation Tool for Long-horizon Robotic Action Segmentation

Authors: Sergej Stanovcic, Daniel Sliwowski, Dongheui Lee
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.26637
Pdf link: https://arxiv.org/pdf/2604.26637
Abstract Annotating long-horizon robotic demonstrations with precise temporal action boundaries is crucial for training and evaluating action segmentation and manipulation policy learning methods. Existing annotation tools, however, are often limited: they are designed primarily for vision-only data, do not natively support synchronized visualization of robot-specific time-series signals (e.g., gripper state or force/torque), or require substantial effort to adapt to different dataset formats. In this paper, we introduce ATLAS, an annotation tool tailored for long-horizon robotic action segmentation. ATLAS provides time-synchronized visualization of multi-modal robotic data, including multi-view video and proprioceptive signals, and supports annotation of action boundaries, action labels, and task outcomes. The tool natively handles widely used robotics dataset formats such as ROS bags and the Reinforcement Learning Dataset (RLDS) format, and provides direct support for specific datasets such as REASSEMBLE. ATLAS can be easily extended to new formats via a modular dataset abstraction layer. Its keyboard-centric interface minimizes annotation effort and improves efficiency. In experiments on a contact-rich assembly task, ATLAS reduced the average per-action annotation time by at least 6% compared to ELAN, while the inclusion of time-series data improved temporal alignment with expert annotations by more than 2.8% and decreased boundary error fivefold compared to vision-only annotation tools.

FutureWorld: A Live Environment for Training Predictive Agents with Real-World Outcome Rewards

Authors: Zhixin Han, Yanzhi Zhang, Chuyang Wei, Maohang Gao, Xiawei Yue, Kefei Chen, Yu Zhuang, Haoxiang Guan, Jiyan He, Jian Li, Yitong Duan, Yu Shi, Mengting Hu, Shuxin Zheng
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.26733
Pdf link: https://arxiv.org/pdf/2604.26733
Abstract Live future prediction refers to the task of making predictions about real-world events before they unfold. This task is increasingly studied using large language model-based agent systems, and it is important for building agents that can continually learn from the real world. Just as interactive environments have often driven progress in agents, advancing live future prediction naturally motivates viewing it as a learning environment. Prior works have explored future prediction from several different angles, but have generally not framed it as a unified learning environment. This task is appealing for learning because it can provide a large number of prediction questions grounded in diverse real-world events, while preventing answer leakage. To leverage the advantages of live future prediction, we present FutureWorld, a live agentic reinforcement learning environment that closes the training loop between prediction, outcome realization, and parameter updates. In our environment, we take three open-source base models and train them for consecutive days. The results show that training is effective. Furthermore, we build a daily benchmark based on the environment and evaluate several frontier agents on it to establish performance baselines for current agent systems.

GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents

Authors: GLM-V Team: Wenyi Hong, Xiaotao Gu, Ziyang Pan, Zhen Yang, Yuting Wang, Yue Wang, Yuanchang Yue, Yu Wang, Yanling Wang, Yan Wang, Xijun Liu, Wenmeng Yu, Weihan Wang, Wei Li, Shuaiqi Duan, Sheng Yang, Ruiliang Lv, Mingdao Liu, Lihang Pan, Ke Ning, Junhui Ji, Jinjiang Wang, Jing Chen, Jiazheng Xu, Jiale Zhu, Jiale Cheng, Ji Qi, Guobing Gan, Guo Wang, Cong Yao, Zijun Dou, Zihao Zhou, Zihan Wang, Zhiqi Ge, Zhijie Li, Zhenyu Hou, Zhao Xue, Zehui Wang, Zehai He, Yusen Liu, Yukuo Cen, Yuchen Li, Yuan Wang, Yijian Lu, Yanzi Wang, Yadong Xue, Xinyu Zhang, Xinyu Liu, Wenkai Li, Tianyu Tong, Tianshu Zhang, Shengdong Yan, Qinkai Zheng, Mingde Xu, Licheng Bao, Jiaxing Xu, Jiaxin Fan, Jiawen Qian, Jiali Chen, Jiahui Lin, Haozhi Zheng, Haoran Wang, Haochen Li, Fan Yang, Dan Zhang, Chuangxin Zhao, Chengcheng Wu, Boyan Shi, Bowei Jia, Baoxu Wang, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Minlie Huang, Yuxiao Dong, Jie Tang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.26752
Pdf link: https://arxiv.org/pdf/2604.26752
Abstract We present GLM-5V-Turbo, a step toward native foundation models for multimodal agents. As foundation models are increasingly deployed in real environments, agentic capability depends not only on language reasoning, but also on the ability to perceive, interpret, and act over heterogeneous contexts such as images, videos, webpages, documents, and GUIs. GLM-5V-Turbo is built around this objective: multimodal perception is integrated as a core component of reasoning, planning, tool use, and execution, rather than as an auxiliary interface to a language model. This report summarizes the main improvements behind GLM-5V-Turbo across model design, multimodal training, reinforcement learning, toolchain expansion, and integration with agent frameworks. These developments lead to strong performance in multimodal coding, visual tool use, and framework-based agentic tasks, while preserving competitive text-only coding capability. More importantly, our development process offers practical insights for building multimodal agents, highlighting the central role of multimodal perception, hierarchical optimization, and reliable end-to-end verification.

Factorized Latent Reasoning for LLM-based Recommendation

Authors: Tianqi Gao, Chengkai Huang, Zihan Wang, Cao Liu, Ke Zeng, Lina Yao
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2604.26760
Pdf link: https://arxiv.org/pdf/2604.26760
Abstract Large language models (LLMs) have recently been adopted for recommendation by framing user preference modeling as a language generation problem. However, existing latent reasoning approaches typically represent user intent with a single latent vector, which struggles to capture the inherently multi-faceted nature of user preferences. We propose Factorized Latent Reasoning (FLR), a novel framework for LLM-based sequential recommendation that decomposes latent reasoning into multiple disentangled preference factors. FLR introduces a lightweight multi-factor attention module that iteratively refines a latent thought representation, where each factor attends to distinct aspects of the user's interaction history. To encourage diversity and specialization, we design orthogonality, attention diversity, and sparsity regularization objectives, and dynamically aggregate factor contributions for the final prediction. We further integrate FLR with an efficient reinforcement learning strategy based on group-relative policy optimization, enabling stable alignment directly in the latent reasoning space. Experiments on multiple benchmarks show that FLR consistently outperforms strong baselines while improving robustness and interpretability.
中文摘要 大型语言模型（LLMs）最近被用于推荐任务，其做法是将用户偏好建模表述为语言生成问题。然而，现有的潜在推理方法通常用单一的潜在向量来表示用户意图，难以捕捉用户偏好固有的多面性。我们提出了分解潜在推理（FLR），这是一种基于LLM的序列推荐新框架，将潜在推理分解为多个解耦的偏好因子。FLR引入了一个轻量级多因子注意力模块，用于迭代细化潜在思维表征，其中每个因子关注用户交互历史的不同方面。为了鼓励多样性和专门化，我们设计了正交性、注意力多样性和稀疏性正则化目标，并动态聚合各因子的贡献以完成最终预测。我们进一步将FLR与基于组相对策略优化的高效强化学习策略结合，从而直接在潜在推理空间中实现稳定的对齐。在多个基准上的实验表明，FLR始终优于强基线，同时提升了鲁棒性和可解释性。
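
As a rough illustration of two ingredients the abstract names, the sketch below computes an orthogonality penalty over preference-factor vectors and a group-relative advantage over sampled rewards. It is not the authors' implementation: the function names, the cosine-based form of the penalty, and the mean/standard-deviation normalization are assumptions chosen for clarity, and factor vectors are assumed to be non-zero.

```python
import math
from statistics import mean, pstdev


def orthogonality_penalty(factors):
    """Sum of squared cosine similarities over all distinct factor pairs.

    Assumed form of an orthogonality regularizer: the value is 0.0 when all
    factor vectors are mutually orthogonal and grows as factors align.
    Vectors must be non-zero.
    """
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    unit = [normalize(vec) for vec in factors]
    penalty = 0.0
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            cos = sum(a * b for a, b in zip(unit[i], unit[j]))
            penalty += cos * cos
    return penalty


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    One common form of the group-relative advantage: subtract the group mean
    and divide by the group standard deviation (eps avoids division by zero).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In a training loop, a penalty of this kind would typically be added to the recommendation loss with a small coefficient, while the advantages would weight the policy-gradient term for each sampled output in the group.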

Rule-based High-Level Coaching for Goal-Conditioned Reinforcement Learning in Search-and-Rescue UAV Missions Under Limited-Simulation Training

基于规则的高层指导：面向有限仿真训练下搜救无人机任务的目标条件强化学习

Authors: Mahya Ramezani, Holger Voos
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.26833
Pdf link: https://arxiv.org/pdf/2604.26833
Abstract This paper presents a hierarchical decision-making framework for unmanned aerial vehicle (UAV) missions motivated by search-and-rescue (SAR) scenarios under limited simulation training. The framework combines a fixed rule-based high-level advisor with an online goal-conditioned low-level reinforcement learning (RL) controller. To stress-test early adaptation, we also consider a strict no-pretraining deployment regime. The high-level advisor is defined offline from a structured task specification and compiled into deterministic rules. It provides interpretable mission- and safety-aware guidance through recommended actions, avoided actions, and regime-dependent arbitration weights. The low-level controller learns online from task-defined dense rewards and reuses experience through a mode-aware prioritized replay mechanism augmented with rule-derived metadata. We evaluate the framework on two tasks: battery-aware multi-goal delivery and moving-target delivery in obstacle-rich environments. Across both tasks, the proposed method improves early safety and sample efficiency primarily by reducing collision terminations, while preserving the ability to adapt online to scenario-specific dynamics.
中文摘要 本文提出了一个面向无人机（UAV）任务的层级决策框架，其动机来自有限仿真训练条件下的搜救（SAR）场景。该框架将固定的基于规则的高层顾问与在线的目标条件低层强化学习（RL）控制器相结合。为了对早期适应能力进行压力测试，我们还考虑了严格的无预训练部署设定。高层顾问根据结构化任务规范离线定义，并编译成确定性规则。它通过推荐动作、应避免的动作以及随工况变化的仲裁权重，提供可解释的、兼顾任务与安全的指导。低层控制器根据任务定义的稠密奖励在线学习，并通过一种模式感知的优先经验回放机制复用经验，该机制以规则衍生的元数据作为补充。我们在两个任务上评估该框架：电池感知的多目标投递，以及障碍物密集环境中的移动目标投递。在这两个任务中，所提方法主要通过减少因碰撞导致的终止来提升早期安全性和样本效率，同时保持在线适应特定场景动态的能力。
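
The interplay between a rule-based advisor and a learned controller can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's method: the state fields, thresholds, action names, weights, and the linear blending rule are all hypothetical, and the learned action values are assumed to be scaled to roughly [-1, 1].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Advice:
    recommended: frozenset  # actions the rules suggest
    avoided: frozenset      # actions the rules discourage
    weight: float           # arbitration weight in [0, 1] for the current regime


def advise(state):
    """Hypothetical deterministic rules mapping a state to advice."""
    if state["obstacle_distance"] < 2.0:   # collision-risk regime
        return Advice(frozenset({"climb"}), frozenset({"forward"}), 0.9)
    if state["battery"] < 0.2:             # low-battery regime
        return Advice(frozenset({"return_home"}), frozenset(), 0.7)
    return Advice(frozenset(), frozenset(), 0.1)  # nominal regime


def arbitrate(q_values, advice):
    """Blend learned action values with rule scores and pick the best action."""
    def rule_score(action):
        if action in advice.avoided:
            return -1.0
        if action in advice.recommended:
            return 1.0
        return 0.0

    blended = {
        action: (1.0 - advice.weight) * q + advice.weight * rule_score(action)
        for action, q in q_values.items()
    }
    return max(blended, key=blended.get)
```

With a high weight the rules dominate in safety-critical regimes, while a low weight in the nominal regime leaves the choice largely to the learned controller.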

Uncertainty-Aware Predictive Safety Filters for Probabilistic Neural Network Dynamics

概率性神经网络动力学的不确定性感知预测安全过滤器

Authors: Bernd Frauenknecht, Lukas Kesper, Daniel Mayfrank, Henrik Hose, Sebastian Trimpe
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2604.26836
Pdf link: https://arxiv.org/pdf/2604.26836
Abstract Predictive safety filters (PSFs) leverage model predictive control to enforce constraint satisfaction during deep reinforcement learning (RL) exploration, yet their reliance on first-principles models or Gaussian processes limits scalability and broader applicability. Meanwhile, model-based RL (MBRL) methods routinely employ probabilistic ensemble (PE) neural networks to capture complex, high-dimensional dynamics from data with minimal prior knowledge. However, existing attempts to integrate PEs into PSFs lack rigorous uncertainty quantification. We introduce the Uncertainty-Aware Predictive Safety Filter (UPSi), a PSF that provides rigorous safety predictions using PE dynamics models by formulating future outcomes as reachable sets. UPSi introduces an explicit certainty constraint that prevents model exploitation and integrates seamlessly into common MBRL frameworks. We evaluate UPSi within Dyna-style MBRL on standard safe RL benchmarks and report substantial improvements in exploration safety over prior neural network PSFs while maintaining performance on par with standard MBRL. UPSi bridges the gap between the scalability and generality of modern MBRL and the safety guarantees of predictive safety filters.
中文摘要 预测安全过滤器（PSFs）利用模型预测控制在深度强化学习（RL）探索过程中强制满足约束，但其对第一性原理模型或高斯过程的依赖限制了可扩展性和更广泛的适用性。与此同时，基于模型的强化学习（MBRL）方法通常采用概率集成（PE）神经网络，在先验知识极少的情况下从数据中捕捉复杂的高维动力学。然而，现有将PE整合进PSF的尝试缺乏严格的不确定性量化。我们提出了不确定性感知预测安全过滤器（UPSi），这是一种PSF，通过将未来结果表述为可达集，利用PE动力学模型提供严格的安全预测。UPSi 引入了显式的确定性约束，防止模型利用（model exploitation），并可无缝集成到常见的 MBRL 框架中。我们在标准安全强化学习基准上，于Dyna风格的MBRL中评估了UPSi，结果表明其探索安全性相较于以往的神经网络PSF有显著提升，同时保持与标准MBRL相当的性能。UPSi弥合了现代MBRL的可扩展性和通用性与预测安全过滤器的安全保证之间的鸿沟。
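
To make the idea of pairing a reachable-set check with a certainty constraint concrete, here is a deliberately simplified one-dimensional sketch. It is not UPSi's formulation: the interval over-approximation, the `k`-sigma band, the width threshold, and all names are assumptions used only for illustration.

```python
def reachable_interval(predictions, k=2.0):
    """Over-approximate the next-state set for a scalar state.

    predictions: list of (mean, std) pairs, one per ensemble member.
    Returns the union hull of each member's mean +/- k * std band.
    """
    lows = [m - k * s for m, s in predictions]
    highs = [m + k * s for m, s in predictions]
    return min(lows), max(highs)


def is_action_accepted(predictions, state_bounds, max_width, k=2.0):
    """Accept an action only if the prediction is both certain and safe.

    certain: the interval is narrower than max_width, so the ensemble is
             not being queried where it knows little (assumed constraint).
    safe:    the whole interval lies inside the allowed state bounds.
    """
    low, high = reachable_interval(predictions, k)
    certain = (high - low) <= max_width
    safe = state_bounds[0] <= low and high <= state_bounds[1]
    return certain and safe
```

A filter built on this check would fall back to a backup action whenever the proposed action is rejected; that fallback logic is omitted here.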

ClawGym: A Scalable Framework for Building Effective Claw Agents

ClawGym：构建高效 Claw 智能体的可扩展框架

Authors: Fei Bai, Huatong Song, Shuang Sun, Daixuan Cheng, Yike Yang, Chuan Hao, Renyuan Li, Feng Chang, Yuan Wei, Ran Tao, Bryan Dai, Jian Yang, Wayne Xin Zhao
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.26904
Pdf link: https://arxiv.org/pdf/2604.26904
Abstract Claw-style environments support multi-step workflows over local files, tools, and persistent workspace states. However, scalable development around these environments remains constrained by the absence of a systematic framework, especially one for synthesizing verifiable training data and integrating it with agent training and diagnostic evaluation. To address this challenge, we present ClawGym, a scalable framework that supports the full lifecycle of Claw-style personal agent development. Concretely, we construct ClawGym-SynData, a diverse dataset of 13.5K filtered tasks synthesized from persona-driven intents and skill-grounded operations, paired with realistic mock workspaces and hybrid verification mechanisms. We then train a family of capable Claw-style models, termed ClawGym-Agents, through supervised fine-tuning on black-box rollout trajectories, and further explore reinforcement learning via a lightweight pipeline that parallelizes rollouts across per-task [...]. To support reliable evaluation, we further construct ClawGym-Bench, a benchmark of 200 instances calibrated through automated filtering and human-LLM review. Relevant resources will soon be released at this https URL.
中文摘要 Claw 风格的环境支持基于本地文件、工具和持久工作区状态的多步骤工作流。然而，围绕这些环境的可扩展开发仍受限于缺乏系统化框架，尤其是缺乏用于合成可验证训练数据、并将其与智能体训练和诊断评估相结合的框架。为应对这一挑战，我们提出了ClawGym，一个支持Claw风格个人智能体开发全生命周期的可扩展框架。具体来说，我们构建了ClawGym-SynData，这是一个包含1.35万个经过筛选的任务的多样化数据集，这些任务由人设驱动的意图和基于技能的操作合成而来，并配有逼真的模拟工作区和混合验证机制。随后，我们通过在黑盒 rollout 轨迹上进行监督微调，训练了一系列能力较强的 Claw 风格模型，称为ClawGym-Agents，并进一步通过一个轻量级流水线探索强化学习，该流水线可跨各任务的[...]并行执行 rollout。为了支持可靠的评估，我们进一步构建了ClawGym-Bench，这是一个包含200个实例的基准，经自动筛选和人类与LLM联合审查校准。相关资源将很快在此 https 网址发布。

Keyword: diffusion policy

There is no result