Arxiv Papers of Today

Generated: 2026-06-17 20:18:21 (UTC+8); arXiv release time: 2026-06-17 20:00 EDT (2026-06-18 08:00 UTC+8)

36 relevant papers today

Keyword: reinforcement learning

GeoDisaster: Benchmarking Orchestrated Agents for Operational Disaster Geo-Intelligence

Authors: Maram Hasan, Aman Verma, Savitra Roy, Hariseetharam Gunduboina, Daksh Jain, Muhammad Haris Khan, Subhasis Chaudhuri, Biplab Banerjee
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2606.17246
Pdf link: https://arxiv.org/pdf/2606.17246
Abstract Remote-sensing vision-language models (RS-VLMs) have advanced Earth-observation analysis toward visual interpretation and instruction-following, yet fall short of operational geo-intelligence, which demands tool-grounded spatial reasoning and structured, evidence-backed decisions. We introduce GeoDisaster, an operational geospatial disaster reasoning benchmark with 2,921 verified instances across 43 question types and five task families: deforestation monitoring, multi-hazard analysis, building-damage assessment, flood-safe routing, and Sentinel-1 SAR flood monitoring. Instances integrate heterogeneous EO/GIS evidence (optical and SAR imagery, raster masks, vector geometries, road networks, and exposure layers), spanning hazard detection, damage assessment, exposure estimation, and diagnostic report generation. Ground-truth answers are grounded in executable geospatial workflows and deterministic consistency checks, removing the need for language-model annotation. We further propose an orchestrated multi-agent framework with 18 disaster-oriented tools, where role-specialized agents coordinate through explicit execution contracts, aligned via Role-Contract Expectation Alignment (RCEA): failure-aware supervised fine-tuning combined with contract-grounded reinforcement learning over dense step-level signals. Experiments show that GeoDisaster challenges existing RS-VLMs and agentic systems, while RCEA improves tool use, evidence grounding, state consistency, and decision generation.

Rethinking Groups in Critic-Free RLVR

Authors: Yihong Wu, Liheng Ma, Lingfeng Xiao, Muzhi Li, Xinyu Wang, Yingxue Zhang, Jian-Yun Nie
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.17250
Pdf link: https://arxiv.org/pdf/2606.17250
Abstract Reinforcement learning (RL) has become a central paradigm for post-training large language models. Existing critic-free RL methods typically generate a group of rollouts for the same question to estimate value baselines for advantage computation. However, this design suffers from data inefficiency, group synchronization barriers, and inflexibility with structured rollouts. In this work, we revisit the role of the "group" and show that its underlying function is not merely to estimate baselines but to prevent false penalties on negative samples. Building on this insight, we propose negative token filtering, a simple and effective strategy that enables stable single-rollout training. We apply it to two batch-level advantage methods, achieving comparable performance on reasoning tasks and stronger performance on agentic tasks relative to group-based RL techniques.
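For orientation, the group baseline that the abstract attributes to existing critic-free methods can be sketched as follows. This is a minimal illustration of a group-relative advantage (each rollout's reward minus the mean reward of its group), not the paper's negative token filtering. The function name, the optional standard-deviation normalization, and the toy rewards are assumptions made for the example.

```python
import numpy as np

def group_relative_advantages(rewards, normalize=True, eps=1e-8):
    """Compute advantages from a group baseline.

    rewards: array of shape (num_questions, group_size), one row per question.
    The baseline for each rollout is the mean reward of its own group.
    """
    rewards = np.asarray(rewards, dtype=float)
    baseline = rewards.mean(axis=1, keepdims=True)
    advantages = rewards - baseline
    if normalize:
        # Assumed variant: scale by the per-group standard deviation.
        advantages = advantages / (rewards.std(axis=1, keepdims=True) + eps)
    return advantages

# Toy verifiable rewards: 1.0 for a correct rollout, 0.0 otherwise.
rewards = np.array([[1.0, 0.0, 0.0, 1.0],   # mixed group
                    [0.0, 0.0, 0.0, 0.0]])  # every rollout failed
advantages = group_relative_advantages(rewards)
```

When every rollout in a group receives the same reward, all of its advantages are zero and the group contributes no learning signal, which is one plausible reading of the data inefficiency the abstract mentions.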

Training LLMs with Reinforcement Learning over Digital Twin Representations for Reasoning-Intensive Surgical VideoQA

Authors: Yiqing Shen, Han Zhang, Mathias Unberath
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.17279
Pdf link: https://arxiv.org/pdf/2606.17279
Abstract Surgical video question answering requires multi-step reasoning across semantic, spatial, and temporal dimensions. Existing methods architecturally compress videos into discrete token representations and couple visual perception with reasoning. This approach fragments continuous spatial-temporal relationships and has been shown to restrict multi-step reasoning capabilities. We introduce a reinforcement learning (RL) framework that trains large language models (LLMs) to decouple perception from reasoning by operating over digital twin representations constructed from surgical foundation models. Additionally, we introduce hierarchical representations across frame, temporal window, and procedure levels with probabilistic uncertainty estimates. Finally, we propose a novel reward that combines format validation with accuracy assessment through clinical plausibility evaluation and uncertainty-aware calibration for training. To demonstrate the capabilities of this approach, we introduce REAL-Colon-Reason, a colonoscopic benchmark with 2,000 question-answer pairs across three complexity levels. We achieve state-of-the-art performance on REAL-Colon-Reason and on two existing surgical VideoQA benchmarks, REAL-Colon-VQA and EndoVis18-VQA.

Decision-Driven Geosteering Under Uncertainty: A Unified Framework for Sequential Decision Optimization

Authors: Hibat Errahmen Djecta, Sergey Alyaev, Kristian Fossum, Reidar B. Bratvold, Ressi Bonti Muhammad, Apoorv Srivastava
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.17331
Pdf link: https://arxiv.org/pdf/2606.17331
Abstract Geosteering requires navigating a well trajectory through an unknown geological configuration, while sequentially updating decisions based on indirect measurements acquired during drilling. This work presents an uncertainty-aware geosteering framework that tightly integrates particle filtering for probabilistic subsurface interpretation with value-based reinforcement learning for sequential decision-making. Geological uncertainty ahead of the drill bit is represented explicitly through a particle filter (PF), enabling belief-informed control rather than deterministic trajectory correction. The framework couples PF belief updates with belief-informed decision policies and evaluates three decision-making options that operate under identical uncertainty representations: an interpretable Approximate Dynamic Programming (ADP) scheme, a Deep Q-learning baseline, and a Dual Deep Reinforcement Learning (Dual DRL) architecture trained with a target Q-network scheme for stability, using a dueling (value/advantage) decomposition for Q-value parameterization. Beyond final placement performance, we assess policy behavior using stability-oriented metrics that quantify steering smoothness over time, providing additional operational insight into how decision policies respond as uncertainty evolves. The framework is integrated with an API for validation within an industrial geosteering simulator under realistic measurement noise and drilling constraints. Using identical geological realizations, operational limits, and reward definitions across methods, the experiments provide a controlled and high-fidelity evaluation of how alternative decision policies behave throughout the drilling process, rather than evaluating performance solely from the final well trajectory.
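The dueling (value/advantage) decomposition named in the abstract is commonly written as Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a'). The sketch below shows only that aggregation step in NumPy. Whether the paper uses the mean or another aggregator is not stated in the abstract, and the three-action steering example (up, hold, down) is hypothetical.

```python
import numpy as np

def dueling_q_values(value, advantages):
    """Combine a state value and per-action advantages into Q-values.

    value:      array of shape (batch,), the scalar state value V(s).
    advantages: array of shape (batch, num_actions), the raw A(s, a).
    Subtracting the mean advantage makes the split between V and A identifiable.
    """
    value = np.asarray(value, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    centered = advantages - advantages.mean(axis=-1, keepdims=True)
    return value[..., None] + centered

# Hypothetical belief state with three steering actions: up, hold, down.
state_value = np.array([2.0])
raw_advantages = np.array([[1.0, 0.0, -1.0]])
q_values = dueling_q_values(state_value, raw_advantages)
greedy_action = int(q_values.argmax(axis=-1)[0])
```

With this aggregation the mean of the Q-values over actions equals V(s), so the value stream carries how good the state is and the advantage stream carries only the ranking of actions.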

Performance-Driven Environment Abstraction with Multi-Timescale Learning

Authors: Yue Guan, Dipankar Maity, Panagiotis Tsiotras
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2606.17377
Pdf link: https://arxiv.org/pdf/2606.17377
Abstract We study performance-driven environment abstraction for decision-making in large Markov decision processes. Rather than preserving geometric or topological structure, we seek abstractions that directly optimize decision quality. We model abstraction as a controlled approximation obtained by aggregating the state space and enforcing a shared action distribution within each aggregated state. For a fixed partition, we establish a performance guarantee that separates value-function approximation error from the loss introduced by action sharing. Guided by this analysis, we develop a multi-timescale reinforcement learning framework that jointly adapts the policy and a tree-structured environment abstraction. The resulting algorithm refines and coarsens regions of the state space based on Q-value discrepancies, balancing performance against abstraction size and complexity. Empirical results demonstrate substantial state compression, improved sample efficiency, and faster replanning compared to actor-critic baselines.

Treatment Response Optimized Clinical Decision Support AI System via Digital Twin Simulation

Authors: Xinyu Qin, Anil K. Sood, Ruiheng Yu, Sara Corvigno, Elaine Stur, Lu Wang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.17405
Pdf link: https://arxiv.org/pdf/2606.17405
Abstract Clinical decision support AI systems (CDSASs) must adapt to evolving patient conditions in real-time while adhering to strict safety constraints. We present an online adaptive framework that integrates Treatment Effect (TE) estimation to quantify clinical benefits, a patient Digital Twin (DT) to simulate treatment trajectories, and Reinforcement Learning (RL) for sequential decision-making. The AI system is initially trained on historical medical records and operates in a continuous learning loop. To ensure safety, a rule-based module monitors vital signs and blocks contraindicated treatments. Cases with strong internal model disagreement are flagged for clinician review, simulated in our experiments via a pre-trained outcome model. We validate our framework using both a synthetic clinical simulator and a real-world ovarian cancer dataset from The Cancer Genome Atlas (TCGA). In both simulated and clinical settings, our method demonstrated superior effectiveness and stability in recommending treatments compared to standard computational baselines. Furthermore, the AI system maintains low latency and requires expert consultation for only a minority of cases in our experimental validation, demonstrating its potential as a safe, clinician-supervised tool for personalized medicine that continuously improves through practical use.

Enhancing Pathological VLMs with Cross-scale Reasoning

Authors: Chi Phan, Tianyi Zhang, Qiaochu Xue, Yufeng Wu, Dan Hu, Zeyu Liu, Sudong Wang, Yueming Jin
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.17412
Pdf link: https://arxiv.org/pdf/2606.17412
Abstract Pathological images are inherently multi-scale, requiring pathologists to integrate evidence from global tissue architecture at low magnification to cellular morphology at higher magnification for accurate diagnosis. While existing pathological datasets for vision-language models (VLMs) include various scales, they often lack an explicit cross-scale reasoning objective. This limitation prevents VLMs from capturing essential cross-scale representations and learning evidence-based reasoning. To bridge this gap, we introduce the first cross-scale training and evaluation paradigm that formulates pathology interpretation as multi-magnification reasoning. However, creating such a task reveals a critical challenge: multi-image visual question answering (VQA) is prone to text-only shortcuts, which allow models to guess answers using magnification-dependent artifacts rather than visual evidence. To address this, we propose a leakage-aware curation pipeline that combines adversarial text-only screening with constraint-guided question design. Using this pipeline, we construct Scale-VQA, a high-quality benchmark with 4,685 multiple-choice questions grounded in 2,537 pathology images across multiple magnification levels. Finally, we present ScaleReasoner-R1, a model trained via reinforcement learning to optimize performance on the cross-scale VQA task. ScaleReasoner-R1 achieves state-of-the-art performance on our cross-scale reasoning benchmark and generalizes to state-of-the-art performance on established single-scale benchmarks. Our findings suggest that even limited cross-scale supervision can significantly improve pathological understanding. The code and demos will be open-sourced.

Memory-Efficient Meta-Reinforcement Learning for Adaptive Safety-Critical Control in Adversarial Spacecraft Proximity Operations

Authors: Alejandro Posadas-Nava, Richard Linares, Minduli Wijayatunga
Subjects: Machine Learning (cs.LG); Dynamical Systems (math.DS)
Arxiv link: https://arxiv.org/abs/2606.17414
Pdf link: https://arxiv.org/pdf/2606.17414
Abstract Autonomous spacecraft rendezvous and proximity operations (RPO) require controllers that guarantee safety under thrust constraints while minimizing fuel expenditure. Input-constrained control barrier functions (ICCBFs) provide a control method for nonlinear systems with actuation constraints by constructing a forward-invariant safe set. Previous work has shown that learning the class-$\mathcal{K}$ functions defining the ICCBF recursion via meta reinforcement learning (meta-RL) yields a robust, non-greedy approach to safety-critical control in RPO. This paper extends that framework by investigating the performance of three recurrent network architectures (Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), and Selective State Space Model (Mamba)) and two training algorithms (Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC)) to identify the best setup for tuning ICCBF class-$\mathcal{K}$ functions via meta-RL. In addition to cooperative test cases, performance is evaluated in the presence of adversarial behavior, where the target spacecraft behaves in a way that worsens the safety of the chaser spacecraft. Results indicate that state space models such as Mamba, when used with PPO, achieve superior task completion, safety, and fuel savings compared to other architectures across all cooperative and uncooperative scenarios tested.
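The role that a class-K function plays in a barrier-function safety filter can be illustrated on a toy problem. The sketch below uses a plain control barrier function on a one-dimensional single integrator, which is far simpler than the input-constrained ICCBF recursion described in the abstract. The dynamics, the linear class-K function alpha(h) = k * h, and every name and number are assumptions chosen for illustration.

```python
import numpy as np

def cbf_filter(u_nominal, distance, d_min, k, u_max):
    """Return the command closest to u_nominal that satisfies the CBF condition.

    Toy model: distance' = u, with barrier h = distance - d_min.
    The condition h' + alpha(h) >= 0 with alpha(h) = k * h gives u >= -k * h.
    """
    h = distance - d_min
    u_safe = max(u_nominal, -k * h)                # enforce the barrier condition
    return float(np.clip(u_safe, -u_max, u_max))   # respect the thrust limit

def simulate(k, d0=10.0, d_min=1.0, u_max=1.0, dt=0.1, steps=150):
    """Close on the target as fast as the filter allows; record the distance."""
    distance = d0
    history = [distance]
    for _ in range(steps):
        # Nominal controller: full closing speed toward the target.
        u = cbf_filter(-u_max, distance, d_min, k, u_max)
        distance += dt * u
        history.append(distance)
    return np.array(history)

cautious = simulate(k=0.5)     # small gain: brakes early
aggressive = simulate(k=2.0)   # large gain: brakes late
```

Both runs stay outside the keep-out distance, but the gain k changes how late the filter intervenes. In this toy setting, that trade-off between conservatism and performance is the kind of choice that tuning the class-K function controls.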

Embodiment Shapes Rolling Behavior in a Multimodal Infant Model

Authors: Leon Philipp, Francisco M. López, Jochen Triesch
Subjects: Robotics (cs.RO); Neurons and Cognition (q-bio.NC)
Arxiv link: https://arxiv.org/abs/2606.17456
Pdf link: https://arxiv.org/pdf/2606.17456
Abstract Rolling over is one of the earliest milestones in infant motor development, reflecting the emergence of coordinated, whole-body sensorimotor control. Here, we conduct a computational study of infant rolling using MIMo, a virtual infant embodiment equipped with proprioception and vestibular sensation. MIMo learns supine-to-prone rolls with reinforcement learning. Interestingly, the learned behaviors capture developmental trends and coordination patterns consistent with those reported in real infants, including improved performance and faster execution with age. Our results explain how infant capabilities and constraints can give rise to realistic behaviors in artificial agents, with a particular emphasis on how motor development is shaped by the changing body morphology. This work highlights the role of embodied computational models as a powerful tool for studying sensorimotor development.

Multi-Adapter PPO: A Cross-Attention Enhanced Wavelength Selection Framework for LIBS Quantitative Analysis

Authors: Hao Li, Man Fung Zhuo
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.17476
Pdf link: https://arxiv.org/pdf/2606.17476
Abstract Laser-induced breakdown spectroscopy (LIBS) quantitative analysis faces critical challenges in wavelength selection due to high-dimensional spectral data and the fundamental trade-off between prediction accuracy and feature efficiency. This paper presents a novel Multi-Adapter PPO framework that transforms wavelength selection into a reinforcement learning problem, leveraging cross-attention mechanisms and multiple specialized adapters to capture complex spectral relationships. Our approach outperforms traditional Particle Swarm Optimization (PSO) by an average of 28.4% in comprehensive score and 45.2% in prediction accuracy across steel and coal datasets. The proposed method demonstrates superior performance in balancing prediction accuracy with feature efficiency, achieving state-of-the-art results in LIBS quantitative analysis while maintaining interpretability and computational efficiency. We released our code and dataset (link available on the arXiv abstract page).

Theoretical Grounding of Out-Of-Distribution Detection With Reinforcement Learning Optimizer

Authors: Salimeh Sekeh, Xin Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.17477
Pdf link: https://arxiv.org/pdf/2606.17477
Abstract Out-of-distribution (OOD) detection in dynamic open-world environments requires a model to continually adapt to evolving data distributions while generalizing to covariate-shifted inputs and rejecting semantic-shifted OOD examples. Most existing OOD detection methods optimize only the current-step objective and do not explicitly account for how post-deployment environment changes affect future OOD behavior. In this paper, we establish a theoretical grounding for dynamic OOD detection using a reinforcement learning (RL)-guided optimizer that explicitly favors updates that reduce the semantic OOD false positive rate over time. We develop a novel augmented optimizer that uses an RL-guided correction term on top of standard gradient descent (GD) and show its improvement over both future-domain generalization and semantic-OOD rejection. We analyze temporal error decomposition in terms of model-change and environment-change generalization errors and develop a new theoretical framework for comparing the generalization errors under both GD and RL-guided optimizers.

When Robots Sleep: Offline Skill Consolidation for Shared-Policy Robot Learning

Authors: Nethmi Jayasinghe, Diana Gontero, Amit Ranjan Trivedi
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.17493
Pdf link: https://arxiv.org/pdf/2606.17493
Abstract Robots that learn over long deployments must add new skills without losing the shared policy structure that makes earlier skills reusable. We study sequential robot skill learning, where previous trajectories and task losses may be unavailable, and the deployed policy must remain a single shared controller without task-specific heads, routing, or adapters. We identify skill-coupling collapse, a failure mode in which individual skill success remains non-trivial while reliability among related skills deteriorates. We propose Sleeping Robots, a wake-sleep framework that learns each new skill during wake and consolidates the shared policy offline during sleep using compact frozen skill memories: frozen critics with unordered state buffers for reinforcement learning and frozen actor snapshots with unordered observation buffers for imitation learning. During sleep, these memories define differentiable surrogate objectives whose gradients are combined through Nash bargaining, with adaptive anchoring and local excitability for stable consolidation. On Meta-World MT5, Sleeping Robots improves average success by 64% and pairwise reliability by 2.0× over the strongest non-oracle baseline, and on SurgicAI it improves average success and backward transfer relative to continual imitation baselines while remaining competitive on pairwise reliability.

Reinforcing Dual-Path Reasoning in Spatial Vision Language Models

Authors: Yatai Ji, An-Chieh Cheng, Yang Fu, Yukang Chen, Han Zhang, Zhaojing Yang, Wei Huang, Ka Chun Cheung, Song Han, Vidya Nariyambut Murali, Pavlo Molchanov, Jan Kautz, Simon See, Hongxu Yin, Ping Luo, Sifei Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.17539
Pdf link: https://arxiv.org/pdf/2606.17539
Abstract Spatial VLMs have made substantial progress in geometric perception, yet complex spatial reasoning requiring multi-step inference over depth, distance, and scene relations remains challenging. Moreover, different spatial queries call for fundamentally different strategies: some are best addressed through purely linguistic, step-by-step deduction, while others require explicit 3D grounding before quantitative inference. We present Dual-Path Spatial Reasoning via Reinforcement Learning for Spatial VLMs (SR-REAL), a unified framework that equips a spatial VLM with two complementary reasoning paths: Language-Only Reasoning (LOR), which performs step-by-step linguistic deduction, and Detect-Then-Reason (DTR), which detects 3D geometric cues (e.g., centers or bounding boxes) via region tokens before explicit geometric inference. SR-REAL begins with a cold-start supervised fine-tuning stage that constructs LOR and DTR chain-of-thought supervision and exposes a region-to-3D interface, followed by RL that optimizes the policy model with accuracy and format rewards; for DTR, a discrete center-based detection reward further refines geometric alignment. Across diverse spatial benchmarks, SR-REAL significantly outperforms spatial VLM baselines: (i) a single RL-trained model supports both reasoning paths, with DTR excelling in region-aware tasks through precise 3D localization and LOR enhancing general spatial reasoning; (ii) jointly training both paths fosters mutual reinforcement; (iii) high-quality, blended cold-start data is crucial for stable RL optimization; and (iv) the model generalizes across datasets and domains without per-task tuning, demonstrating positive transfer between LOR and DTR.

Continuous-time Optimal Stopping through Deep Reinforcement Learning

Authors: Cosmin Borsa, Michael Ludkovski
Subjects: Machine Learning (cs.LG); Computational Finance (q-fin.CP); Pricing of Securities (q-fin.PR)
Arxiv link: https://arxiv.org/abs/2606.17545
Pdf link: https://arxiv.org/pdf/2606.17545
Abstract Simulation-based solvers for optimal stopping problems must discretize the stopping decision. Under classical dynamic programming, a coarse exercise grid with only a few stopping opportunities can materially undervalue the optimal expected reward, whereas on a very fine grid, approximation errors accumulate through the backward recursion. To remove this limitation, we develop a new reinforcement-learning-inspired algorithm that enables us to learn the exercise rule at arbitrarily fine time resolution. Our CARLOS (Continuous-time Adaptive Reinforcement Learning for Optimal Stopping) algorithm utilizes an aggregate deep neural network (ADNN) to learn a joint space-time decision boundary. Starting from a coarse time grid, we progressively increase the frequency of stopping opportunities while training the ADNN in parallel to refine its timing-value estimates. We also design an adaptive sampling strategy that gradually concentrates training effort near the stopping boundary. Benchmark results show that CARLOS delivers higher prices than existing Bermudan solvers, approaching the American upper bound, and achieves high computational efficiency relative to non-RL comparators.
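The claim that a coarse exercise grid undervalues the optimal reward can be checked on a small example. The sketch below prices a put on a Cox-Ross-Rubinstein binomial tree by classical backward induction and varies only how often early exercise is allowed. It is not CARLOS and involves no learning; the function name and the market parameters are assumptions for illustration.

```python
import math

def bermudan_put_binomial(s0, strike, rate, sigma, maturity, steps, exercise_every):
    """Price a put on a binomial tree with early exercise every few steps.

    Early exercise is allowed at tree steps n > 0 that are multiples of
    exercise_every; exercise at maturity is always allowed.
    """
    dt = maturity / steps
    up = math.exp(sigma * math.sqrt(dt))
    down = 1.0 / up
    discount = math.exp(-rate * dt)
    prob_up = (math.exp(rate * dt) - down) / (up - down)

    # Payoffs at maturity; index j counts the number of up moves.
    values = [max(strike - s0 * up**j * down**(steps - j), 0.0)
              for j in range(steps + 1)]
    for n in range(steps - 1, -1, -1):
        # Discounted risk-neutral expectation (continuation value).
        values = [discount * (prob_up * values[j + 1] + (1.0 - prob_up) * values[j])
                  for j in range(n + 1)]
        if n > 0 and n % exercise_every == 0:
            # Stopping opportunity: keep the larger of continuing and exercising.
            values = [max(v, strike - s0 * up**j * down**(n - j))
                      for j, v in enumerate(values)]
    return values[0]

params = dict(s0=100.0, strike=100.0, rate=0.05, sigma=0.2, maturity=1.0, steps=120)
european = bermudan_put_binomial(exercise_every=120, **params)  # maturity only
coarse = bermudan_put_binomial(exercise_every=30, **params)     # 3 early dates
fine = bermudan_put_binomial(exercise_every=1, **params)        # every tree step
print(european, coarse, fine)
```

Because each finer grid contains the exercise dates of the coarser one, the price cannot decrease as stopping opportunities are added. The gap between the coarse and fine values is the undervaluation the abstract refers to.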

Reversal Q-Learning

Authors: Aditya Oberai, Seohong Park, Sergey Levine
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.17551
Pdf link: https://arxiv.org/pdf/2606.17551
Abstract Iterative generative modeling techniques, such as flow matching, provide powerful tools to model complex behaviors for effective offline reinforcement learning (RL). In this work, we propose a new off-policy RL algorithm that trains a flow policy based on prior data. Our idea starts from the "expanded" Markov decision process (MDP) framework, which treats individual flow refinement steps as separate actions in an MDP. To enable off-policy RL within this framework, we apply two techniques: we generate virtual on-policy trajectories (by "reversing" flows) to make this framework compatible with prior data, and we apply a bias-and-variance reduction technique to mitigate the curse of horizon in off-policy RL. We call the resulting algorithm Reversal Q-learning (RQL). RQL has several advantages over previous flow-based RL methods: it does not suffer from backpropagation through time, makes better use of the learned value function, and directly trains the full, expressive flow policy. Through our experiments on 50 challenging simulated robotic tasks, we show that RQL leads to the best average offline RL performance compared to state-of-the-art flow-based offline RL algorithms.

Closing the Feedback Loop: From Experience Extraction to Insight Governance in Verbal Reinforcement Learning

Authors: Yanwei Cui, Xing Zhang, Yulong Zhang, Li Shao, Xiaofeng Shi, Guanghui Wang, Peiyang He
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.17591
Pdf link: https://arxiv.org/pdf/2606.17591
Abstract Training-free verbal reinforcement learning enables LLM agents to learn from world feedback (objective signals such as dynamic task outcomes, market returns, or demand forecasts) by extracting verbal rules from experience and injecting them as context, updating the agent's behavior without parameter changes. However, in non-stationary environments these agents face a retention-forgetting dilemma: retaining stale insights causes negative transfer, while discarding them causes catastrophic forgetting when conditions recur. We identify four requirements for navigating this dilemma (outcome-driven evaluation, persistent structured evidence, non-monotonic knowledge lifecycle, and compositional governance) and show that existing methods invest heavily in experience extraction while underinvesting in insight governance. We propose a three-layer architecture (rules, evidence, and skills) connected by a feedback-driven curation loop that closes the governance gap. Rules capture distilled experience from world outcomes; evidence logs track each rule's reliability across episodes; skills govern which rules to apply, how to resolve conflicts, and when to abstain. On financial forecasting as a case study, where world feedback is naturally abundant, noisy, and non-stationary, we show that the same accumulated experience either degrades performance below the zero-shot baseline or dramatically improves accuracy and risk-adjusted returns, depending on whether the curation loop is present.
中文摘要 免训练的语言强化学习使LLM智能体能够从世界反馈（如动态任务结果、市场回报或需求预测等客观信号）中学习：从经验中提取语言规则并将其作为上下文注入，无需改变参数即可更新智能体的行为。然而，在非平稳环境中，这些智能体面临“保留与遗忘”的两难困境：保留陈旧的洞见会导致负迁移，而舍弃它们则会在条件再次出现时导致灾难性遗忘。我们确定了应对这一困境的四项要求（以结果为导向的评估、持久的结构化证据、非单调的知识生命周期和组合式治理），并表明现有方法在经验提取上投入很大，而在洞察治理上投入不足。我们提出一个由规则、证据和技能组成的三层架构，各层通过反馈驱动的策展循环连接，从而弥合治理缺口。规则捕捉从世界结果中提炼的经验；证据日志跟踪每条规则在各回合中的可靠性；技能决定适用哪些规则、如何解决冲突以及何时弃权。以金融预测为案例（其世界反馈天然丰富、有噪声且非平稳），我们表明，相同的累积经验要么使性能降到零样本基线以下，要么显著提升准确性和风险调整后收益，具体取决于是否存在策展循环。

Using Cognitive Models to Improve Language Model Simulation of Human Persuasion Games

利用认知模型提升人类说服游戏的语言模型模拟

Authors: Zirui Cheng, Zeyu Shen, Thomas L. Griffiths, Peter Henderson
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.17657
Pdf link: https://arxiv.org/pdf/2606.17657
Abstract People make decisions differently in strategic interactions. Some update beliefs like a Bayesian; others exhibit biases like motivated reasoning. Although creators of large language models use simulated humans for safety evaluations and training, they often fail to cover this breadth of human behavior. We argue that cognitive science and economics provide a convenient tool for doing so, making use of mathematical models of human decision-making. We propose an approach that we call Equation-to-Behavior Prompting for guiding large language models to match cognitive models, and evaluate this approach on persuasion games based on legal decision-making. We find that large models can approximate equation-based specifications -- Bayesian updating, affine distortion, motivated updating, and Grether's $\alpha$-$\beta$ model -- using prompting, but small models fail to do so. However, training small models with reinforcement learning to adhere to mathematical rules, Equation-to-Behavior RL, reduces belief error by 26.5% in out-of-distribution parameterizations. We show that these simulations can help create diverse training environments; training small models to consider different kinds of decision-makers improves average belief change by 2.5%--12% over Bayesian-only training, even when persuading GPT-5-mini. Our work could improve human simulations for training and evaluation in increasingly realistic settings, and could also enable novel research into more complicated mathematical models of human decision-making.
中文摘要 在策略性互动中，人们做决策的方式各不相同。有些人像贝叶斯主体那样更新信念；另一些人则表现出动机性推理等偏差。尽管大型语言模型的开发者使用模拟人类进行安全评估和训练，但这些模拟往往未能覆盖人类行为的这种广度。我们认为，认知科学和经济学为此提供了便捷的工具，即人类决策的数学模型。我们提出了一种称为“方程到行为提示”（Equation-to-Behavior Prompting）的方法，用于引导大型语言模型匹配认知模型，并在基于法律决策的说服博弈上评估该方法。我们发现，大模型可以通过提示近似基于方程的设定（贝叶斯更新、仿射扭曲、动机性更新以及Grether的$\alpha$-$\beta$模型），但小模型做不到。然而，用强化学习训练小模型遵循数学规则（即Equation-to-Behavior RL），可在分布外参数化下将信念误差降低26.5%。我们还表明，这些模拟有助于构建多样化的训练环境；训练小模型考虑不同类型的决策者，平均信念变化比仅用贝叶斯训练提升2.5%至12%，即使说服对象是GPT-5-mini也是如此。我们的工作有望改进在日益真实的场景中用于训练和评估的人类模拟，也能推动对更复杂的人类决策数学模型的新研究。
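
A minimal sketch of the kind of equation-based belief update this entry refers to. It assumes the commonly cited form of Grether's generalization of Bayes' rule, in which posterior odds equal the likelihood ratio raised to alpha times the prior odds raised to beta. The abstract does not say which exponent plays which role, so that mapping, the function name `grether_posterior`, and the example numbers are illustrative assumptions, not the paper's setup.

```python
def grether_posterior(prior, lik_h, lik_not_h, alpha=1.0, beta=1.0):
    """Return P(H | evidence) under an alpha-beta update (assumed form).

    prior      -- P(H) before seeing the evidence, strictly between 0 and 1
    lik_h      -- P(evidence | H)
    lik_not_h  -- P(evidence | not H), must be non-zero
    alpha      -- weight on the likelihood ratio (assumption: 1.0 is Bayesian)
    beta       -- weight on the prior odds (assumption: 1.0 is Bayesian)
    """
    prior_odds = prior / (1.0 - prior)
    likelihood_ratio = lik_h / lik_not_h
    # Work in odds form so the two exponents act multiplicatively.
    posterior_odds = (likelihood_ratio ** alpha) * (prior_odds ** beta)
    return posterior_odds / (1.0 + posterior_odds)


if __name__ == "__main__":
    # alpha = beta = 1 recovers ordinary Bayesian updating.
    print(grether_posterior(0.5, 0.8, 0.2))
    # alpha < 1 under-weights the evidence (a conservative updater).
    print(grether_posterior(0.5, 0.8, 0.2, alpha=0.5))
```

With both exponents at 1 the function is plain Bayes' rule, which makes it easy to check a simulated decision-maker against the Bayesian baseline before varying the parameters.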

See First, Answer Later: Visual Evidence Pre-Alignment via Sufficiency-Driven RL

先看后答：通过充分性驱动的强化学习实现视觉证据预对齐

Authors: Yilian Liu, Sicong Leng, Guoshun Nan, Junyi Zhu, Jiayu Huang, Minghao Sun, Xuancheng Zhu, Yisong Chen, Zexian Wei, Xiaofeng Tao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.17678
Pdf link: https://arxiv.org/pdf/2606.17678
Abstract Multimodal large language models (MLLMs) integrate strong text reasoning with visual inputs, yet their responses can be inconsistent with the underlying images, indicating ineffective utilization of visual evidence during inference. The prevailing training paradigm relies on large-scale caption-based pretraining for general alignment, followed by supervised fine-tuning and reinforcement learning to enable instruction following and complex reasoning. However, such pretraining provides only weak visual grounding: short, coarse captions bias models toward salient objects while neglecting fine-grained visual evidence. In this paper, we introduce Visual Evidence Pre-Alignment (VEPA), an intermediate stage between pretraining and post-training that explores a novel sufficiency-driven objective with Group Relative Policy Optimization (GRPO) to optimize question-conditioned visual evidence descriptions. Extensive experiments across diverse benchmarks show that our VEPA consistently enhances performance on visually demanding evaluations and complements standard supervised post-training. Further analyses show that the improvement stems from strengthened, transferable visual grounding, rather than from additional task-specific training.
中文摘要 多模态大型语言模型（MLLM）将强大的文本推理与视觉输入结合，但其回答可能与底层图像不一致，表明推理过程中未能有效利用视觉证据。现有的训练范式依赖大规模的基于图像描述的预训练来实现通用对齐，随后进行监督微调和强化学习，以实现指令跟随和复杂推理。然而，这种预训练只能提供较弱的视觉定位：简短、粗糙的描述使模型偏向显著对象，而忽视细粒度的视觉证据。本文提出视觉证据预对齐（VEPA），这是介于预训练与后训练之间的中间阶段，它探索一种新的充分性驱动目标，利用组相对策略优化（GRPO）优化以问题为条件的视觉证据描述。在多个基准上的大量实验表明，VEPA在视觉要求高的评估中持续提升性能，并与标准的监督后训练互补。进一步分析显示，增益来自得到强化且可迁移的视觉定位，而非额外的任务专项训练。

EnvRL: Learn from Environment Dynamics in Agentic Reinforcement Learning

EnvRL：在智能体强化学习中从环境动态中学习

Authors: Zhitong Wang, Songze Li, Hao Peng, Shuzheng Si, Yi Wang, Maosong Sun, Juanzi Li
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.17680
Pdf link: https://arxiv.org/pdf/2606.17680
Abstract Reinforcement learning (RL) has emerged as a powerful paradigm for training Large Language Models (LLMs) as agents. However, conventional RL methods for long-horizon agentic tasks often struggle with sparse outcome rewards. Intuitively, this overlooks the rich environment dynamics information contained in rollout interaction trajectories. We argue that the interaction experience inherently serves as an implicit supervision signal, reveals the underlying transition mechanisms of the environment, and enables the agent to construct a more accurate internal model of the environment. Therefore, in this work, we investigate how to leverage this additional signal to improve policy learning. Specifically, we propose EnvRL, a framework that incorporates environment dynamics learning into agentic RL via two auxiliary objectives: state prediction and inverse dynamics. By jointly optimizing with the primary RL objective, we encourage the agent to internalize environment dynamics from its own interaction experience. Extensive experiments on two long-horizon agentic benchmarks demonstrate that EnvRL achieves significant improvements in success rate over RL-only baselines, e.g., when trained with GRPO, lifting Qwen-2.5-1.5B-Instruct from 72.8% to 77.4% on ALFWorld, and from 56.8% to 67.0% on WebShop.
中文摘要 强化学习（RL）已成为将大型语言模型（LLM）训练为智能体的强大范式。然而，面向长程智能体任务的传统强化学习方法常常受困于稀疏的结果奖励。直观上，这忽略了rollout交互轨迹中蕴含的丰富环境动态信息。我们认为，交互经验本身就是一种隐式监督信号，它揭示了环境的底层状态转移机制，使智能体能够构建更准确的环境内部模型。因此，在本研究中，我们探讨如何利用这一额外信号来改进策略学习。具体来说，我们提出了EnvRL，这是一个通过状态预测和逆动力学两个辅助目标，将环境动态学习纳入智能体强化学习的框架。通过与主要强化学习目标联合优化，我们鼓励智能体从自身的交互经验中内化环境动态。在两个长程智能体基准上的大量实验表明，EnvRL的成功率显著高于仅使用RL的基线，例如在使用GRPO训练时，将Qwen-2.5-1.5B-Instruct在ALFWorld上的成功率从72.8%提升至77.4%，在WebShop上从56.8%提升至67.0%。

From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning

从受训者到训练者：由LLM设计的多智能体推理强化学习训练环境

Authors: Chao Chen, Chengzu Li, Zhiwei Li, Yinhong Liu, Zhijiang Guo
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.17682
Pdf link: https://arxiv.org/pdf/2606.17682
Abstract Reinforcement learning pipelines for Large Language Model (LLM) training often rely on manually redesigned environments between stages, requiring practitioners to heuristically infer which configuration will best improve the current policy. To automate this process, we propose the LLM-as-Environment-Engineer framework in which the current policy model analyzes failure trajectories together with contextual information and proposes modifications to the next-stage training environment configuration. We also introduce MAPF-FrozenLake, a controllable testbed whose generator exposes multi-dimensional environment configurations, making it suitable for studying and benchmarking environment redesign. On this testbed, we condition the environment engineer on structured summaries of policy behavior, failure cases, and environment statistics, from which it produces the configuration for the next training stage. With Qwen3-4B as the backbone, our framework achieves the strongest aggregate performance on our benchmarks, outperforming larger proprietary LLMs (e.g., GPT, Gemini) and fixed-environment training baselines. We further analyze which forms of context are most effective, finding that successful environment updates rely on failure evidence and preserve configurations that already work. Interestingly, the current RL checkpoint serves as a better environment engineer than the original base model, suggesting that policy learning improves the model's ability to diagnose its remaining weaknesses.
中文摘要 用于大型语言模型（LLM）训练的强化学习流程通常依赖于在各阶段之间手动重新设计环境，这需要从业者凭经验推断哪种配置最能改进当前策略。为使这一过程自动化，我们提出了“LLM即环境工程师”（LLM-as-Environment-Engineer）框架：当前策略模型结合上下文信息分析失败轨迹，并对下一阶段的训练环境配置提出修改。我们还提出了MAPF-FrozenLake，这是一个可控的测试平台，其生成器开放了多维度的环境配置，适合用于环境再设计的研究和基准测试。在这个测试平台上，我们向环境工程师提供关于策略行为、失败案例和环境统计的结构化摘要，由它据此生成下一训练阶段的配置。以Qwen3-4B为骨干，我们的框架在我们的基准上取得了最强的综合性能，优于更大的专有LLM（如GPT、Gemini）和固定环境训练基线。我们进一步分析了哪些形式的上下文最有效，发现成功的环境更新依赖于失败证据，并保留已经有效的配置。有趣的是，当前的RL检查点比原始基础模型更胜任环境工程师的角色，这表明策略学习提升了模型诊断自身剩余弱点的能力。

SuCo: Sufficiency-guided Continuous Adaptive Reasoning

SuCo：充分性引导的连续自适应推理

Authors: Jiahao Wang, Bingyu Liang, Chenhao Hu, Longhui Zhang, Xuebo Liu, Min Zhang, Jing Li, Xuelong Li
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.17687
Pdf link: https://arxiv.org/pdf/2606.17687
Abstract Despite remarkable performance on complex tasks, Large Reasoning Models (LRMs) often generate excessively long Chain-of-Thoughts (CoT), inflating computational costs even for simple queries. Existing efforts to mitigate this inefficiency typically rely on discrete reasoning modes or fixed budget tiers, lacking a principled criterion of when reasoning is sufficient. In this work, we introduce Minimal Sufficient CoT (MSC), defined as the shortest prefix of a CoT trajectory which is adequate for producing the correct answer. We empirically show that MSC not only reduces reasoning tokens, but also improves accuracy across difficulty levels. Building on MSC, we propose Sufficiency-guided Continuous Adaptive Reasoning (SuCo), a two-stage training framework for autonomous reasoning control along a continuous spectrum. In stage 1, MSC-Aligned Fine-Tuning (MFT) constructs MSC data using problem-adaptive sufficiency thresholds that naturally scale with question difficulty, then fine-tunes the model to internalize concise yet sufficient reasoning patterns. In stage 2, Sufficiency-Aware Policy Optimization (SAPO) further optimizes the model through reinforcement learning with dynamic complexity tracking and sufficiency-aware rewards that penalize both over- and under-thinking. Extensive experiments across mathematics, code, and science benchmarks show that SuCo consistently achieves improvements in both accuracy and reasoning efficiency.
中文摘要 尽管在复杂任务中表现出色，大型推理模型（LRM）常常生成过长的思维链（CoT），即使对简单查询也会抬高计算成本。现有缓解这种低效的工作通常依赖离散的推理模式或固定的预算档位，缺乏判断推理何时已经充分的原则性标准。在本研究中，我们提出最小充分CoT（MSC），定义为CoT轨迹中足以得出正确答案的最短前缀。我们通过实证表明，MSC不仅减少了推理词元，还能在各难度级别上提高准确性。基于MSC，我们提出了充分性引导的连续自适应推理（SuCo），这是一个在连续谱上实现自主推理控制的两阶段训练框架。在第一阶段，MSC对齐微调（MFT）利用随问题难度自然伸缩的问题自适应充分性阈值构建MSC数据，然后微调模型，使其内化简洁而充分的推理模式。在第二阶段，充分性感知策略优化（SAPO）通过强化学习进一步优化模型，采用动态复杂度跟踪和充分性感知奖励，同时惩罚过度思考和思考不足。在数学、代码和科学基准上的大量实验表明，SuCo在准确性和推理效率上均持续提升。
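
A minimal sketch of the Minimal Sufficient CoT definition from this entry: the shortest prefix of a reasoning trajectory that is adequate for producing the correct answer. The abstract does not describe how sufficiency is tested, so the `is_sufficient` callback, the function name, and the toy steps below are hypothetical placeholders; in practice the check would involve querying a model rather than matching strings.

```python
from typing import Callable, List, Optional, Sequence


def minimal_sufficient_prefix(
    steps: Sequence[str],
    is_sufficient: Callable[[Sequence[str]], bool],
) -> Optional[List[str]]:
    """Return the shortest prefix of `steps` accepted by `is_sufficient`.

    Prefixes are tried from shortest to longest, so the first hit is the
    minimal one. Returns None when even the full trajectory is rejected.
    """
    for k in range(len(steps) + 1):
        prefix = steps[:k]
        if is_sufficient(prefix):
            return list(prefix)
    return None


if __name__ == "__main__":
    # Toy trajectory: the answer (20) is already reached after step two.
    cot = ["2 + 3 = 5", "5 * 4 = 20", "check: 20 / 4 = 5", "so the answer is 20"]
    # Hypothetical sufficiency test: some step in the prefix states the answer.
    reaches_answer = lambda prefix: any("20" in step for step in prefix)
    print(minimal_sufficient_prefix(cot, reaches_answer))
```

The linear scan costs one sufficiency check per prefix; a real pipeline would likely need a cheaper search, which this sketch does not attempt.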

Shattering the Autoregressive Curse: Dynamic Epistemic Entropy Orchestrated Erasable Reinforcement Learning for LLMs

打破自回归诅咒：面向大型语言模型的动态认知熵协同可擦除强化学习

Authors: Ziliang Wang, Kang An, Faqiang Qian, Jialu Cai, Cijun Ouyang, Yuhang Wang, Qibing Ren, Yichao Wu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.17735
Pdf link: https://arxiv.org/pdf/2606.17735
Abstract Although reinforcement learning (RL) has expanded the cognitive boundaries of large language models (LLMs), it often remains vulnerable to the autoregressive curse in long-horizon logical reasoning: small epistemic perturbations introduced early in generation can propagate irreversibly along the Markov decision process flow, triggering cascading failures that drive the reasoning trajectory toward collapse. To overcome this autoregressive cascade, in which a single early mistake can compromise all subsequent reasoning steps, we propose dynamic epistemic entropy orchestrated erasable reinforcement learning ($\text{E}^3\text{RL}$). $\text{E}^3\text{RL}$ eliminates reliance on external signals by grounding the model's endogenous local autoregressive cross-entropy as an intrinsic coordinate of epistemic uncertainty. By introducing segment-level adaptive dynamic thresholds and advantage allocation, $\text{E}^3\text{RL}$ enables the model to precisely excise localized logical defects while reusing historical key-value (KV) cache streams, thereby endowing the reasoning process with a self-healing capability. We train $\text{E}^3\text{RL}$ on the DeepMath-103k dataset. Experimental results show that $\text{E}^3\text{RL}$ reshapes the exploration efficiency of long-sequence reasoning and improves sample efficiency while maintaining linear memory overhead. On mathematical reasoning benchmarks such as AIME, $\text{E}^3\text{RL}$ achieves substantial performance gains, with the 4B and 8B parameter models surpassing previous state-of-the-art (SOTA) results by 5.349% and 6.514%, respectively. These findings suggest that $\text{E}^3\text{RL}$ shatters the autoregressive curse in long-sequence reasoning and establishes a theoretical and systems-level foundation for the next generation of self-healing artificial general intelligence (AGI).
中文摘要 尽管强化学习（RL）扩展了大型语言模型（LLM）的认知边界，但它在长程逻辑推理中仍常常受到自回归诅咒的影响：生成早期引入的微小认知扰动可能沿着马尔可夫决策过程不可逆地传播，触发级联失败，使推理轨迹走向崩溃。为克服这种自回归级联（即一个早期错误就可能损害所有后续推理步骤），我们提出了动态认知熵协同的可擦除强化学习（$\text{E}^3\text{RL}$）。$\text{E}^3\text{RL}$将模型内生的局部自回归交叉熵作为认知不确定性的内在坐标，从而消除了对外部信号的依赖。通过引入片段级自适应动态阈值和优势分配，$\text{E}^3\text{RL}$使模型能够精确切除局部逻辑缺陷，同时重用历史键值（KV）缓存流，从而赋予推理过程自愈能力。我们在DeepMath-103k数据集上训练$\text{E}^3\text{RL}$。实验结果表明，$\text{E}^3\text{RL}$重塑了长序列推理的探索效率，并在保持线性内存开销的同时提高了样本效率。在AIME等数学推理基准上，$\text{E}^3\text{RL}$取得了显著的性能提升，4B和8B参数模型分别比此前的最先进（SOTA）结果高出5.349%和6.514%。这些发现表明，$\text{E}^3\text{RL}$打破了长序列推理中的自回归诅咒，并为下一代具备自愈能力的通用人工智能（AGI）奠定了理论和系统层面的基础。

Continual Self-Improvement with Lightweight Experiential Latent Memories

借助轻量级经验潜在记忆实现持续自我改进

Authors: Vaggelis Dorovatas, Nancy Kalaj, Rahaf Aljundi
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.17803
Pdf link: https://arxiv.org/pdf/2606.17803
Abstract Large language models achieve strong reasoning performance by scaling inference-time compute, yet remain fundamentally stateless, discarding the rich, self-produced reasoning traces generated during this process. We investigate whether models can instead learn online from this experience, converting transient computation (reasoning traces) into persistent reusable knowledge, and without external supervision or access to future data. We show that In-Context Learning (ICL) over raw reasoning traces fails to generalize, reflecting a fundamental limitation of token-level reuse: individual traces lack the abstraction needed for transfer, even after refinement (e.g. self-reflection). In contrast, drawing inspiration from recent works on unsupervised reinforcement learning, we find that lightweight per-instance training with self-generated test-time signals (majority voting) as rewards yields substantial gains, often surpassing full-dataset offline training, motivating a shift from raw traces to learned latent representations. Building on this insight, we propose an online method that distills inference-time compute spent on encountered problems into compact modular latent memories capturing the underlying reasoning structure. These memories are stored and retrieved for future inputs, enabling continual improvement while avoiding catastrophic forgetting through modular design. Importantly, our method is highly efficient, parametrized as extremely lightweight soft prompt memories (~0.001% of model parameters) and trained with only a few gradient steps, yet achieving performance competitive with full parametric updates and offline training. Across challenging mathematical reasoning benchmarks, our approach significantly outperforms zero-shot and raw data ICL baselines, while transferring effectively across datasets.
中文摘要 大型语言模型通过扩展推理时计算获得了强大的推理性能，但本质上仍是无状态的，会丢弃这一过程中自行产生的丰富推理轨迹。我们研究模型能否从这些经验中在线学习，将瞬时计算（推理轨迹）转化为持久且可重用的知识，且无需外部监督或访问未来数据。我们表明，基于原始推理轨迹的上下文学习（ICL）无法泛化，这反映了词元级重用的一个根本局限：即使经过精炼（如自我反思），单条轨迹也缺乏迁移所需的抽象。相比之下，受近期无监督强化学习工作的启发，我们发现，以自生成的测试时信号（多数投票）作为奖励的轻量级逐实例训练能带来显著增益，常常超过全数据集离线训练，这促使我们从原始轨迹转向学习到的潜在表征。基于这一见解，我们提出了一种在线方法，将花在已遇到问题上的推理时计算蒸馏为紧凑的模块化潜在记忆，以捕捉底层推理结构。这些记忆被存储并在未来输入时检索，从而实现持续改进，同时通过模块化设计避免灾难性遗忘。重要的是，我们的方法非常高效：它被参数化为极轻量的软提示记忆（约占模型参数的0.001%），只需少量梯度步即可训练，却能取得与全参数更新和离线训练相当的性能。在具有挑战性的数学推理基准上，我们的方法显著优于零样本和原始数据ICL基线，并能有效地跨数据集迁移。
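
A minimal sketch of using majority voting as a self-generated reward, the test-time signal this entry mentions. It assumes the simplest scheme: sample several answers to one problem, treat the most frequent answer as a pseudo-label, and give reward 1 to samples that agree with it. The abstract does not specify the exact reward shaping, so the binary reward and the name `majority_vote_rewards` are assumptions.

```python
from collections import Counter
from typing import Hashable, List, Sequence, Tuple


def majority_vote_rewards(answers: Sequence[Hashable]) -> Tuple[Hashable, List[float]]:
    """Return (pseudo_label, rewards) for a group of sampled final answers.

    The pseudo-label is the most common answer; ties are broken in favor of
    the answer that appeared first. Each sample gets 1.0 if it matches the
    pseudo-label and 0.0 otherwise. No ground-truth label is used.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    pseudo_label, _count = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


if __name__ == "__main__":
    sampled = ["42", "42", "41", "42", "7"]
    label, rewards = majority_vote_rewards(sampled)
    print(label, rewards)
```

Because the pseudo-label comes from the model itself, the signal is only as good as the model's majority answer; when the majority is wrong, the reward reinforces the error.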

StepGuard: Guarding Web Navigation via Single-Step Calibration

StepGuard：通过单步校准保护网页导航

Authors: Zhihao Cui, Yuchen Zhang, Xiyang Sun, Yaxiong Wang, Li Zhu, Jinpeng Hu, Liu Liu, Mengjia Li, Yujiao Wu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.17871
Pdf link: https://arxiv.org/pdf/2606.17871
Abstract Web navigation requires agents to follow natural language goals, interact with web pages, and produce accurate answers. While recent advances leverage vision-language models and reinforcement learning, existing methods still suffer from single-step fragility due to reward misalignment and error propagation. To tackle the reward entanglement, we design Dynamic Dual-Policy Optimization (DDPO), which dynamically switches between a navigation-first mode for exploration and an answer-first mode for question-answering to mitigate reward conflict. To calibrate the single-step error, we propose Confidence-Guided Adaptive Navigation Reflection (CANR), a mechanism that estimates per-step confidence, triggers reflection only when necessary, and uses contrastive rewards to encourage self-correction to calibrate the single-step inaccuracy. With the above as the main components, we finally develop our StepGuard, a new framework of Guarding Web Navigation via Single-Step Calibration. Experiments demonstrate that our approach significantly improves navigation and answer accuracy, setting new state-of-the-art performance on standard web navigation benchmarks.
中文摘要 网页导航要求智能体遵循自然语言目标，与网页交互，并给出准确的答案。尽管近期工作利用了视觉语言模型和强化学习，但现有方法仍因奖励错位和错误传播而存在单步脆弱性。为解决奖励纠缠问题，我们设计了动态双策略优化（DDPO），它在用于探索的导航优先模式和用于问答的答案优先模式之间动态切换，以减轻奖励冲突。为了校准单步误差，我们提出了置信度引导的自适应导航反思（CANR）机制，该机制估计每一步的置信度，仅在必要时触发反思，并利用对比奖励鼓励自我纠正。以上述两者为主要组件，我们最终构建了StepGuard，这是一个通过单步校准守护网页导航的新框架。实验表明，我们的方法显著提升了导航和答案的准确性，在标准网页导航基准上取得了新的最先进性能。

Dynamic Rollout Editing for Reducing Overthinking in RL-Trained Reasoning Models

动态展开编辑，用于减少强化学习训练推理模型中的过度思考

Authors: Zihao Wei, Wenjie Shi, Liang Pang, Jingcheng Deng, Shicheng Xu, Shasha Guo, Zenghao Duan, Jiahao Liu, Jingang Wang, Huawei Shen, Xueqi Cheng
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.17890
Pdf link: https://arxiv.org/pdf/2606.17890
Abstract Long-form chain-of-thought reasoning can improve LLM performance on complex tasks, but models often continue generating unnecessary reasoning after a correct answer has emerged. We refer to this behavior as overthinking. We study this phenomenon from the perspective of GRPO-style reinforcement learning (RL) post-training, framing it as a training-time credit-assignment problem rather than merely a decoding-time stopping problem. In rollouts sampled at the onset of GRPO training, we observe that successful trajectories can exhibit a slightly higher degree of overthinking than unsuccessful trajectories for the same prompts. This early imbalance provides a starting point for an undesirable feedback loop: because GRPO assigns sequence-level credit, it cannot distinguish the solution-reaching prefix from the unnecessary continuation that lengthens a successful trajectory. Both receive positive update signal, allowing the initial imbalance to grow into more severe overthinking during training. To address this issue, we introduce Dynamic Rollout Editing (DRE), a training-time intervention for successful trajectories that continue thinking after answer emergence. DRE preserves the accepted verified prefix, edits the remaining thinking, and prefers the edited trajectory within the same RL group, weakening the preference signal for unnecessary thinking without penalizing the reasoning needed to reach the answer. Experiments across diverse tasks show the effectiveness of DRE.
中文摘要 长思维链推理可以提升LLM在复杂任务中的表现，但模型在正确答案出现后，往往仍会继续生成不必要的推理。我们称这种行为为过度思考。我们从GRPO式强化学习（RL）后训练的视角研究这一现象，将其视为训练时的信用分配问题，而不仅仅是解码时的停止问题。在GRPO训练初期采样的rollout中，我们观察到，对于相同的提示，成功轨迹的过度思考程度可能略高于失败轨迹。这种早期失衡为不良反馈循环提供了起点：由于GRPO在序列层面分配信用，它无法区分到达解答的前缀和拉长成功轨迹的不必要延续。两者都会收到正向的更新信号，使最初的失衡在训练中逐渐演变成更严重的过度思考。为解决这个问题，我们提出动态Rollout编辑（DRE），这是一种训练时干预，针对在答案出现后仍继续思考的成功轨迹。DRE保留已被接受并验证的前缀，编辑其余的思考内容，并在同一RL组内偏好编辑后的轨迹，从而削弱对不必要思考的偏好信号，同时不惩罚得出答案所需的推理。在多种任务上的实验显示了DRE的有效性。
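
A minimal sketch of the sequence-level credit assignment this entry points to. It assumes the group-normalized advantage commonly described for GRPO, (reward minus group mean) divided by group standard deviation, copied unchanged onto every token of a rollout. The use of the population standard deviation, the `eps` constant, and the function names are assumptions for illustration; this is not the DRE method itself.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's scalar reward against its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]


def per_token_advantages(rewards: Sequence[float], lengths: Sequence[int]) -> List[List[float]]:
    """Broadcast one advantage per rollout to all of that rollout's tokens.

    Every token in a rollout gets the same value, so tokens before and after
    the point where the answer emerged are indistinguishable to the update.
    """
    advantages = group_relative_advantages(rewards)
    return [[a] * n for a, n in zip(advantages, lengths)]


if __name__ == "__main__":
    # Four rollouts for one prompt: two correct (1.0), two incorrect (0.0).
    rewards = [1.0, 0.0, 0.0, 1.0]
    lengths = [5, 3, 4, 8]  # the last correct rollout keeps going the longest
    for row in per_token_advantages(rewards, lengths):
        print([round(a, 3) for a in row])
```

The longest correct rollout receives the same positive advantage on all eight tokens, which illustrates why unnecessary continuation after a correct answer is rewarded along with the reasoning that produced it.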

WAM-RL: World-Action Model Reinforcement Learning with Reconstruction Rewards and Online Video SFT

WAM-RL：世界行动模型强化学习，含重建奖励和在线视频SFT

Authors: Zezhong Qian, Xiaowei Chi, Yu Qi, Haozhan Li, Zhi Yang Chen, Shanghang Zhang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.17906
Pdf link: https://arxiv.org/pdf/2606.17906
Abstract Recent World-Action (WA) models demonstrate strong generalization ability and data efficiency, but they typically rely on expert trajectories for training. This reliance limits their ability to acquire fine-grained manipulation skills beyond the demonstration distribution and prevents them from continuously improving through real-world interaction. To address these limitations, we propose WAM-RL, a reinforcement learning framework that enables joint optimization of the world model and the action model through online interaction with the environment. By allowing the two components to co-evolve, our approach enhances fine-grained control and adaptability. Specifically, a WA model consists of a world model and an actor. We design a tailored reinforcement learning method with hierarchical optimization to coordinate their improvement. On the methodological side, we systematically investigate the effects of applying reinforcement learning to the action model, as well as online training of the world model within an RL setting. Our experiments reveal a key insight: optimizing only the actor yields improvements on short-horizon tasks, but fails to provide significant gains on long-horizon tasks. In contrast, jointly optimizing both the world model and the actor is critical for achieving strong performance in long-horizon settings. Our work is the first to introduce reinforcement learning into the World-Action paradigm, and provides insights into how online optimization of both the action head and the world model impacts overall performance.
中文摘要 近期的世界行动（WA）模型展现了强大的泛化能力和数据效率，但通常依赖专家轨迹进行训练。这种依赖限制了他们获得超越演示分布的细粒度操作技能的能力，也阻碍了他们通过现实世界互动持续提升的能力。为解决这些局限性，我们提出了WAM-RL，一种强化学习框架，通过与环境的在线交互实现世界模型和动作模型的联合优化。通过允许这两个组成部分共同进化，我们的方法增强了细致的控制和适应性。具体来说，WA模型由一个世界模型和一个演员组成。我们设计了一套针对性的强化学习方法，采用层级优化来协调他们的改进。在方法论方面，我们系统性地探讨了将强化学习应用于行动模型以及在强化学习环境中对世界模型的在线训练的影响。我们的实验揭示了一个关键见解：仅优化演员能在短期任务中获得改进，但在长期任务上却无法带来显著提升。相比之下，联合优化世界模型和演员对于在长视野环境中实现强劲表现至关重要。我们的工作首次将强化学习引入世界-行动范式，并提供了关于动作头和世界模型在线优化如何影响整体表现的洞见。

From Reasoning Traces to Reusable Modules: Understanding Compositional Generalization in Language Model Reasoning

从推理轨迹到可重用模块：理解语言模型推理中的组合泛化

Authors: Lingjing Kong, Xin Liu, Guangyi Chen, Martin Q. Ma, Xiangchen Song, Yuekai Sun, Mikhail Yurochkin, Taylor W. Killian, Ruslan Salakhutdinov, Kun Zhang, Eric P. Xing, Zhengzhong Liu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.18089
Pdf link: https://arxiv.org/pdf/2606.18089
Abstract Post-training pipelines that combine supervised fine-tuning (SFT) with reinforcement learning (RL) have emerged as the key recipe for transforming large language models (LLMs) into robust reasoners. We argue that this combined success is driven by compositional generalization, which we formalize through a hierarchical latent selection model. In this framework, reasoning traces are generated by a cascade of discrete latent selection variables corresponding to reusable atomic modules, including both skills (local operations) and routing mechanisms (how intermediate information is selected, reused, and composed). Within this model, we theoretically show that SFT and RL play asymmetric, complementary roles: SFT supplies the raw module materials in compositional traces, and RL decomposes those traces to identify the latent atomic modules and enable compositional generalization. We design controlled experiments to validate this theory. Our results demonstrate that RL can extract atomic modules from compound traces supplied by SFT and recombine them to solve new configurations. Moreover, we find that training on compound traces yields stronger generalization than training on isolated atomic modules. Finally, we investigate the relationship between SFT and RL data and identify an effective protocol in which SFT ensures coverage of all atomic modules through compositional traces, while RL focuses on novel compositions outside the SFT support to drive exploration.
中文摘要 结合监督微调（SFT）和强化学习（RL）的训练后流程，已成为将大型语言模型（LLMs）转变为稳健推理器的关键配方。我们认为这种综合成功是由组成泛化驱动的，我们通过层级潜在选择模型形式化了这一概括。在该框架中，推理痕迹由一系列离散潜在选择变量生成，这些变量对应于可复用的原子模块，包括技能（局部操作）和路由机制（中间信息的选择、重用和组合）。在该模型中，我们理论上表明SFT和RL发挥着不对称且互补的作用：SFT提供组成迹中的原材料，RL分解这些轨迹以识别潜在的原子模块并实现组合推广。我们设计了受控实验来验证这一理论。我们的结果表明，强化学习能够从SFT提供的复合痕迹中提取原子模块并重新组合以求解新的构型。此外，我们发现对复合迹的训练比在孤立原子模块上训练更能实现更强的泛化。最后，我们研究了SFT与RL数据之间的关系，并确定了一种有效的协议，使SFT通过组成痕迹确保覆盖所有原子模块，而RL则专注于SFT支持之外的新成分以推动探索。

WireCraft: A Simulation Benchmark for Industrial DLO Manipulation

WireCraft：工业DLO操作的仿真基准

Authors: Chongyu Zhu, Ramy ElMallah, Hyegang Kim, Zachary Tang, Jiachen Rao, Artem Arutyunov, Seungyeon Ha, Chi-Guhn Lee
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.18097
Pdf link: https://arxiv.org/pdf/2606.18097
Abstract Deformable Linear Objects (DLOs), such as wires and cables, are central to industrial assembly. Unlike rigid objects, whose state is captured by a 6-DoF pose, DLOs have an infinite-dimensional configuration space and deform continuously under contact with grippers, fixtures, and the workspace, making them a demanding benchmark for general dexterous manipulation. Despite their importance, policy development and comparison remain difficult: existing benchmarks are often tied to specific hardware setups, lack modular and customizable task assets, or study generic deformable-object tasks without the fixtures relevant to real-world industrial wire manipulation. Few benchmarks align simulation, real-world data, and shared evaluation protocols. To bridge this gap, we introduce WireCraft, a simulation benchmark for industrial DLO manipulation with configurable difficulty and assets, spanning three task families: connector insertion, clip routing, and channel seating. It supports two complementary DLO physics models, articulated and deformable, and the trajectories come from both simulation and a physical UR5. We benchmark reinforcement learning (RL), imitation learning (IL), and vision-language-action (VLA) policies under shared metrics. Privileged state-based RL solves a representative setting in each task family with over 82% success, confirming the tasks are well-posed. For connector insertion, however, the transition from reaching the socket to contact-rich alignment remains a key bottleneck for vision RL, IL, and VLA policies. These results indicate that industrial DLO manipulation, though tractable under privileged state, remains an open challenge for current vision-based learning. The benchmark, data, and tools will be open-sourced upon acceptance.
中文摘要 可变形线性物体（DLO），如电线和电缆，是工业装配的核心。刚体的状态可由6自由度位姿刻画，而DLO具有无限维的构型空间，并在与夹爪、工装夹具和工作空间接触时持续变形，因此是通用灵巧操作的一项高难度基准。尽管DLO很重要，策略的开发和比较仍然困难：现有基准往往绑定于特定硬件配置，缺乏模块化和可定制的任务资产，或者研究的是通用的可变形物体任务，而不包含与真实工业线缆操作相关的工装夹具。很少有基准能将仿真、真实世界数据和共享评估协议对齐。为弥合这一差距，我们推出了WireCraft，这是一个面向工业DLO操作的仿真基准，具有可配置的难度和资产，涵盖连接器插入、线夹布线和线槽就位三大任务族。它支持关节式和可变形式两种互补的DLO物理模型，轨迹既来自仿真，也来自实体UR5。我们在共享指标下对强化学习（RL）、模仿学习（IL）和视觉-语言-动作（VLA）策略进行基准测试。基于特权状态的RL在每个任务族的一个代表性设置中取得了超过82%的成功率，证实这些任务是适定的。然而，对于连接器插入，从到达插座到接触丰富的对齐这一过渡，仍然是视觉RL、IL和VLA策略的关键瓶颈。这些结果表明，工业DLO操作虽然在特权状态下可解，但对当前基于视觉的学习而言仍是一个开放挑战。基准、数据和工具将在论文录用后开源。

OmniPlan: An Adaptive Framework for Timely and Near-Optimal Network Planning Optimization

OmniPlan：一个用于及时且近乎最优网络规划优化的自适应框架

Authors: Longlong Zhu, Jiashuo Yu, Zedi Chen, Yuhan Wu, Zhifan Jiang, Yuchen Xian, Yimeng Liu, Jiajie Su, Shaopeng Zhou, Xingyuan Li, Hongyan Liu, Xuan Liu, Dong Zhang, Chunming Wu, Xiang Chen
Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.18105
Pdf link: https://arxiv.org/pdf/2606.18105
Abstract Network planning optimization is a fundamental problem across diverse domains, including transportation systems, communication networks, and power grids. It requires simultaneous optimization of multiple competing objectives under complex constraints. Existing network planning optimization frameworks rely on mixed integer programming (MIP) solvers, heuristics, and deep reinforcement learning (DRL) models to compute planning decisions. However, they lack effective adaptability to diverse and dynamic user intents, thus leading to the trade-off between execution time and optimality. In this paper, we propose OmniPlan, an adaptive framework that achieves both timeliness and near-optimality in network planning optimization. To achieve the adaptability lacking in existing solutions, OmniPlan employs a large language model (LLM)-based interpreter to convert heterogeneous natural-language intents into a unified and quantifiable user-preference vector. Then it employs a mixture-of-experts architecture that integrates MIP solvers, heuristics, and DRL models as specialized experts, where OmniPlan adapts to diverse intents by dynamically selecting timely and near-optimal experts. Finally, it incorporates a DRL-based expert configuration module that fine-tunes optimization objective weights to align planning decisions with user-specific preferences. We evaluate OmniPlan with a representative real-world workload, i.e., distributed machine learning (ML), where we leverage OmniPlan to offload a wide spectrum of ML inference tasks, e.g., decision trees, SVM, naive Bayes, XGBoost, and random forests, onto a network of hardware devices. Our experiments on a real-world testbed indicate that OmniPlan achieves near-optimal and low-execution-time offloading for real-world ML inference tasks, reducing latency by up to 97.8% and network device resource consumption by up to 11.5%.
中文摘要 网络规划优化是一个横跨交通系统、通信网络和电网等多个领域的基础性问题。它需要在复杂约束下同时优化多个相互竞争的目标。现有的网络规划优化框架依赖混合整数规划（MIP）求解器、启发式算法和深度强化学习（DRL）模型来计算规划决策。然而，它们缺乏对多样且动态的用户意图的有效适应能力，因此不得不在执行时间与最优性之间权衡。本文提出了OmniPlan，一种在网络规划优化中兼顾时效性和近似最优性的自适应框架。为了获得现有方案所缺乏的适应性，OmniPlan采用基于大型语言模型（LLM）的解释器，将异构的自然语言意图转换为统一且可量化的用户偏好向量。随后，它采用混合专家架构，将MIP求解器、启发式算法和DRL模型集成为专门的专家，并通过动态选择及时且近似最优的专家来适应多样化的意图。最后，它引入一个基于DRL的专家配置模块，微调优化目标的权重，使规划决策与用户的特定偏好保持一致。我们用一个有代表性的真实工作负载，即分布式机器学习（ML），来评估OmniPlan：利用OmniPlan将多种ML推理任务（如决策树、SVM、朴素贝叶斯、XGBoost和随机森林）卸载到由硬件设备组成的网络上。我们在真实测试平台上的实验表明，OmniPlan为真实的ML推理任务实现了近似最优且执行时间短的卸载，延迟最多降低97.8%，网络设备资源消耗最多降低11.5%。

Deep Reinforcement Learning for Minimum Zero-Forcing Sets

针对最小零强制集的深度强化学习

Authors: Steve Halley, Maurício Gruppi
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.18106
Pdf link: https://arxiv.org/pdf/2606.18106
Abstract This paper explores the problem of finding the minimum zero-forcing set on undirected graphs and proposes an adapted machine-learning framework to solve the problem. The minimum zero-forcing set problem is a graph coloring problem where the color of an initial set of nodes propagates throughout a network. The set of nodes is zero-forcing if it forces all uncolored nodes to change color under the constraint of the color-change rule. There are several applications to this problem across different domains such as network science, network control, and designing logical circuits. Finding the minimum zero-forcing set is shown to be NP-hard. We propose a reinforcement learning framework, SD-ZFS, that adapts the S2V-DQN architecture to the ZFS problem. We train several models on this adapted framework and analyze the performance across graph datasets that have varying structures. We evaluate how the models trained on the framework generalize, scale, and transfer to different network types. The results demonstrate the effectiveness of the framework when compared against the optimal solution and greedy heuristic. We provide further insight into how the ZFS problem can be solved through machine-learning and the influence of network structure on the problem.
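Abstract (translated from Chinese) This paper studies the problem of finding a minimum zero-forcing set on undirected graphs and proposes an adapted machine learning framework to solve it. The minimum zero-forcing set problem is a graph coloring problem in which the color of an initial set of nodes propagates through the network. A set of nodes is zero-forcing if, under the color-change rule, it forces every uncolored node to change color. The problem has applications in network science, network control, and logic circuit design. Finding a minimum zero-forcing set has been shown to be NP-hard. We propose a reinforcement learning framework, SD-ZFS, that adapts the S2V-DQN architecture to the ZFS problem. We train several models with this adapted framework and analyze their performance on graph datasets with varying structures. We evaluate how models trained with the framework generalize, scale, and transfer to different network types. The results demonstrate the framework's effectiveness relative to the optimal solution and a greedy heuristic. We also offer further insight into how the ZFS problem can be solved with machine learning and how network structure influences the problem.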
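
The abstract above refers to the color-change rule without spelling it out. The sketch below uses the standard zero-forcing rule (a colored node with exactly one uncolored neighbor forces that neighbor to become colored) and finds a minimum set by exhaustive search, which is only practical on tiny graphs. It is not the SD-ZFS method; the function names and the adjacency-dict representation are assumptions made for illustration.

```python
from itertools import combinations


def zero_forcing_closure(adj, seed):
    """Apply the color-change rule until no further node can be forced."""
    colored = set(seed)
    changed = True
    while changed:
        changed = False
        for u in list(colored):
            uncolored = [v for v in adj[u] if v not in colored]
            # A colored node forces a neighbor only when that neighbor
            # is its single remaining uncolored neighbor.
            if len(uncolored) == 1:
                colored.add(uncolored[0])
                changed = True
    return colored


def is_zero_forcing(adj, seed):
    """True if the seed set eventually colors every node of the graph."""
    return len(zero_forcing_closure(adj, seed)) == len(adj)


def minimum_zero_forcing_set(adj):
    """Brute force over subsets by increasing size (exponential time)."""
    nodes = sorted(adj)
    for k in range(1, len(nodes) + 1):
        for candidate in combinations(nodes, k):
            if is_zero_forcing(adj, candidate):
                return set(candidate)
    return set(nodes)


# Undirected graphs as adjacency dicts (hypothetical toy inputs).
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}

print(minimum_zero_forcing_set(path))
print(len(minimum_zero_forcing_set(cycle)))
```

On the path, one endpoint is enough because each newly colored node has a single uncolored neighbor. On the cycle, a single node has two uncolored neighbors and cannot force either, so two adjacent nodes are needed. A brute-force solver like this one can serve as the optimal reference on small instances when comparing against learned or greedy solutions.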

Learning Fair Pareto-Optimal Policies in Multi-Objective Reinforcement Learning

多目标强化学习中的公平帕累托最优策略学习

Authors: Umer Siddique, Peilang Li, Yongcan Cao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.18111
Pdf link: https://arxiv.org/pdf/2606.18111
Abstract Fairness is an important aspect of decision-making in multi-objective reinforcement learning (MORL), where policies must ensure both optimality and equity across multiple, potentially conflicting objectives. While single-policy MORL methods can learn fair policies for fixed user preferences using welfare functions such as the generalized Gini welfare function (GGF), they fail to provide the diverse set of policies necessary for dynamic or unknown user preferences. To address this limitation, we formalize the fair optimization problem in multi-policy MORL, where the goal is to learn a set of Pareto-optimal policies that ensure fairness across all possible user preferences. Our key technical contributions are threefold: (1) We show that for concave, piecewise-linear welfare functions (e.g., GGF), fair policies remain in the convex coverage set (CCS), which is an approximated Pareto front for linear scalarization. (2) We demonstrate that non-stationary policies, augmented with accrued reward histories, and stochastic policies improve fairness by dynamically adapting to historical inequities. (3) We propose three novel algorithms: an integration of GGF with multi-policy multi-objective Q-Learning (MOQL), a state-augmented multi-policy MOQL for learning non-stationary policies, and a novel extension of the latter for learning stochastic policies. We evaluate our algorithms across various domains and compare our methods against state-of-the-art MORL baselines. The empirical results show that our methods learn a set of fair policies that accommodate different user preferences.
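Abstract (translated from Chinese) Fairness is an important aspect of decision-making in multi-objective reinforcement learning (MORL), where policies must be both optimal and equitable across multiple, potentially conflicting objectives. Single-policy MORL methods can learn fair policies for fixed user preferences through welfare functions such as the generalized Gini welfare function (GGF), but they cannot provide the diverse set of policies needed when user preferences are dynamic or unknown. To address this limitation, we formalize the fair optimization problem in multi-policy MORL, where the goal is to learn a set of Pareto-optimal policies that ensure fairness under all possible user preferences. Our key technical contributions are threefold: (1) We show that for concave, piecewise-linear welfare functions (such as GGF), fair policies remain in the convex coverage set (CCS), an approximate Pareto front for linear scalarization. (2) We show that non-stationary policies, augmented with accrued reward histories, and stochastic policies improve fairness by dynamically adapting to historical inequities. (3) We propose three new algorithms: an integration of GGF with multi-policy multi-objective Q-learning (MOQL), a state-augmented multi-policy MOQL for learning non-stationary policies, and an extension of the latter for learning stochastic policies. We evaluate the algorithms across several domains and compare them against state-of-the-art MORL baselines. The empirical results show that our methods learn a set of fair policies that accommodate different user preferences.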
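
To make the welfare function in the abstract above concrete, here is a minimal sketch of the generalized Gini welfare function in its usual form: sort the per-objective returns in ascending order and take a weighted sum with non-increasing positive weights, so the worst-off objective counts most. The second function illustrates why GGF is concave and piecewise-linear: with non-increasing weights it equals the minimum over all permutations of the weights of a linear function. The weight values and function names are assumptions for illustration, not the paper's configuration.

```python
from itertools import permutations


def ggf(values, weights):
    """Generalized Gini welfare: largest weight goes to the smallest value.

    `weights` is assumed to be positive and sorted in non-increasing order.
    """
    ordered = sorted(values)
    return sum(w * v for w, v in zip(weights, ordered))


def ggf_as_min_of_linear(values, weights):
    """Same quantity written as a minimum of linear functions.

    Each permutation of the weights defines one linear scalarization;
    taking the minimum over all of them yields a concave,
    piecewise-linear function (feasible only for a few objectives).
    """
    return min(
        sum(w * v for w, v in zip(perm, values))
        for perm in permutations(weights)
    )


weights = [0.75, 0.25]       # hypothetical non-increasing weights
unequal = [2.0, 6.0]         # same total return, unevenly distributed
balanced = [4.0, 4.0]        # same total return, evenly distributed

print(ggf(unequal, weights), ggf(balanced, weights))
```

Both return vectors have the same total, yet the balanced one scores higher, which is the sense in which GGF encodes fairness. The function is also invariant to reordering the objectives.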

Knowledge Reutilization in Meta-Reinforcement Learning

元强化学习中的知识再利用

Authors: Yuan Meng, Bo Wang, Juan de los Rios Ruiz, Xiangtong Yao, Zhenshan Bing, Fuchun Sun, Alois Knoll
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.18132
Pdf link: https://arxiv.org/pdf/2606.18132
Abstract Meta-reinforcement learning enables fast adaptation by extracting shared structure from related tasks, but existing end-to-end methods often couple task inference with embodiment-specific control. This coupling can obscure non-parametric task semantics, reduce sample efficiency, and limit cross-agent reuse. We propose a meta-knowledge reutilization framework that learns task-level knowledge on a dynamics-simplified agent and transfers it to heterogeneous agents. The framework uses a Bayesian non-parametric prior to organize latent task modes and a high-level policy to generate task-level magnitude guidance. To bridge reusable task knowledge with different embodiments, we introduce a semantic-magnitude interface and a lightweight temporal adaptor, which convert frozen meta-knowledge into temporally aligned subgoals for embodiment-specific low-level controllers. Experiments on multiple locomotion agents show that our framework reduces final-step tracking error by 94.75% to 99.79% compared with recent state-of-the-art baselines and achieves comparable deployment performance with about 23.8% of their interaction data.
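Abstract (translated from Chinese) Meta-reinforcement learning achieves fast adaptation by extracting shared structure from related tasks, but existing end-to-end methods often couple task inference with embodiment-specific control. This coupling can obscure non-parametric task semantics, reduce sample efficiency, and limit reuse across agents. We propose a meta-knowledge reutilization framework that learns task-level knowledge on a dynamics-simplified agent and transfers it to heterogeneous agents. The framework uses a Bayesian non-parametric prior to organize latent task modes and a high-level policy to generate task-level magnitude guidance. To connect reusable task knowledge with different embodiments, we introduce a semantic-magnitude interface and a lightweight temporal adaptor, which convert frozen meta-knowledge into temporally aligned subgoals for embodiment-specific low-level controllers. Experiments on multiple locomotion agents show that the framework reduces final-step tracking error by 94.75% to 99.79% compared with recent state-of-the-art baselines, and achieves comparable deployment performance with about 23.8% of their interaction data.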

Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients

近端策略优化区：提示中的教师，而非梯度

Authors: Byung-Kwan Lee, Ximing Lu, Shizhe Diao, Minki Kang, Saurav Muralidharan, Karan Sapra, Andrew Tao, Pavlo Molchanov, Yejin Choi, Yu-Chiang Frank Wang, Ryo Hachiuma
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.18216
Pdf link: https://arxiv.org/pdf/2606.18216
Abstract Knowledge distillation transfers a teacher's competence to a small student but is brittle in the small-student regime: forcing the student to imitate logits from a much larger teacher concentrates it on the teacher's sharpest modes, hurting generalization on benchmark families beyond the training corpus. Reinforcement learning (RL) avoids logit imitation by training on the student's own rollouts. However, on questions where every rollout fails (yielding zero advantage and being silently discarded), injecting a stronger teacher's response into the policy gradient breaks the on-policy assumption and induces drift. We introduce Zone of Proximal Policy Optimization (ZPPO), inspired by Vygotsky's zone of proximal development, which keeps the teacher inside the prompt rather than the policy gradient. On hard questions, ZPPO constructs two reformulated prompts: a Binary Candidate-included Question (BCQ) pairs one correct teacher response with one incorrect student response as anonymized candidates the student must discriminate, and a Negative Candidate-included Question (NCQ) aggregates the student's wrong rollouts into a single prompt to surface their shared failure modes. A prompt replay buffer recirculates each hard question until it either graduates (the student's mean rollout accuracy on it reaches half) or is FIFO-evicted under finite capacity, amplifying BCQ and NCQ inside the student's current zone of proximal development. On the Qwen3.5 family at four student scales (0.8B-9B) with a 27B teacher, post-trained as vision-language models and evaluated on a 31-benchmark suite (16 VLM, 10 LLM, 5 Video), ZPPO outperforms off/on-policy distillation and GRPO, with the largest gains at the smallest scale.
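Abstract (translated from Chinese) Knowledge distillation transfers a teacher's competence to a small student, but it is brittle in the small-student regime: forcing the student to imitate the logits of a much larger teacher concentrates it on the teacher's sharpest modes, which hurts generalization on benchmark families outside the training corpus. Reinforcement learning (RL) avoids logit imitation by training on the student's own rollouts. However, on questions where every rollout fails (yielding zero advantage and being silently discarded), injecting a stronger teacher's response into the policy gradient breaks the on-policy assumption and induces drift. We introduce Zone of Proximal Policy Optimization (ZPPO), inspired by Vygotsky's zone of proximal development, which keeps the teacher in the prompt rather than in the policy gradient. For hard questions, ZPPO constructs two reformulated prompts: a Binary Candidate-included Question (BCQ) pairs one correct teacher response with one incorrect student response as anonymized candidates that the student must tell apart, and a Negative Candidate-included Question (NCQ) aggregates the student's wrong rollouts into a single prompt to expose their shared failure modes. A prompt replay buffer recirculates each hard question until it either graduates (the student's mean rollout accuracy on it reaches one half) or is evicted in FIFO order under finite capacity, amplifying BCQ and NCQ within the student's current zone of proximal development. On the Qwen3.5 family at four student scales (0.8B-9B) with a 27B teacher, post-trained as vision-language models and evaluated on a 31-benchmark suite (16 VLM, 10 LLM, 5 Video), ZPPO outperforms off-policy and on-policy distillation as well as GRPO, with the largest gains at the smallest scale.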
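
The prompt replay buffer described in the abstract above has two exit conditions: graduation once mean rollout accuracy reaches one half, and FIFO eviction when capacity is exceeded. The sketch below shows one way such a buffer could behave. The class name, method names, and details such as ignoring duplicate insertions are assumptions; the abstract does not specify the actual implementation.

```python
from collections import OrderedDict


class PromptReplayBuffer:
    """Hypothetical FIFO buffer of hard questions awaiting graduation."""

    def __init__(self, capacity, graduate_at=0.5):
        self.capacity = capacity
        self.graduate_at = graduate_at
        # Maps question id -> prompt, kept in insertion order.
        self._items = OrderedDict()

    def add(self, qid, prompt):
        """Insert a hard question, evicting the oldest one if full."""
        if qid in self._items:
            return  # assumption: re-adding does not refresh the position
        if len(self._items) >= self.capacity:
            self._items.popitem(last=False)  # FIFO eviction
        self._items[qid] = prompt

    def update(self, qid, rollout_correct):
        """Report rollout outcomes; return True if the question graduates."""
        if qid not in self._items or not rollout_correct:
            return False
        accuracy = sum(rollout_correct) / len(rollout_correct)
        if accuracy >= self.graduate_at:
            del self._items[qid]
            return True
        return False

    def pending(self):
        """Question ids still being recirculated, oldest first."""
        return list(self._items)


buffer = PromptReplayBuffer(capacity=2)
for qid in ("q1", "q2", "q3"):
    buffer.add(qid, prompt=f"reformulated prompt for {qid}")

after_eviction = buffer.pending()                      # q1 was evicted
graduated = buffer.update("q2", [True, False])         # accuracy 0.5
still_hard = buffer.update("q3", [False, False, True, False])
print(after_eviction, graduated, still_hard, buffer.pending())
```

With capacity 2, adding a third question evicts the oldest. A question whose rollouts are half correct leaves the buffer, while one at 25% accuracy keeps being replayed.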

Learning Red Agent Policy from Observations for Neurosymbolic Autonomous Cyber Agents

从神经符号自主网络代理观察中学习红代理策略

Authors: Ankita Samaddar, Sandeep Neema, Daniel Balasubramanian, Xenofon Koutsoukos
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2606.18223
Pdf link: https://arxiv.org/pdf/2606.18223
Abstract With sophisticated cyber-attacks becoming increasingly prevalent, modern networks require intelligent autonomous cyber-defense agents trained via Reinforcement Learning (RL). These agents employ neurosymbolic approaches such as behavior trees with learning-enabled components (LECs) to learn, reason, adapt, and implement security rules while maintaining critical operations. However, these autonomous networks are partially observable systems, i.e., the cyber-attacker's (red agent's) actions are not observable, making it difficult for the defender to predict red actions, learn red policies, or assess the attacker's intrusion levels. To address this, we propose a Policy Learning Technique that uses imitation learning to learn policies for partially observable RL agents with discrete states and discrete actions. We apply this technique in an autonomous cyber environment to predict the red agent's actions from network observations and defender actions. Integrated with a neurosymbolic cyber-defense agent, our method effectively handles different red policies and achieves high prediction accuracy across diverse simulated scenarios.
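Abstract (translated from Chinese) As sophisticated cyber-attacks become increasingly common, modern networks need intelligent autonomous cyber-defense agents trained with reinforcement learning (RL). These agents use neurosymbolic approaches, such as behavior trees with learning-enabled components (LECs), to learn, reason, adapt, and implement security rules while maintaining critical operations. However, these autonomous networks are partially observable systems: the actions of the cyber-attacker (the red agent) cannot be observed, which makes it hard for the defender to predict red actions, learn red policies, or assess how far the attacker has intruded. To address this, we propose a Policy Learning Technique that uses imitation learning to learn policies for partially observable RL agents with discrete states and discrete actions. We apply the technique in an autonomous cyber environment to predict the red agent's actions from network observations and defender actions. Integrated with a neurosymbolic cyber-defense agent, our method handles different red policies effectively and achieves high prediction accuracy across diverse simulated scenarios.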

Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification

统一多模自回归建模与共享上下文-可视化分词器是统一的关键

Authors: Wujian Peng, Lingchen Meng, Yuxuan Cai, Xianwei Zhuang, Yuhuan Yang, Rongyao Fang, Chenfei Wu, Junyang Lin, Zuxuan Wu, Shuai Bai
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.18249
Pdf link: https://arxiv.org/pdf/2606.18249
Abstract Unified Multimodal Modeling aims to integrate visual understanding and generation within a single system. However, existing approaches typically rely on two disparate visual tokenizers, which splits the representation space and hinders truly unified modeling. We propose UniAR, a unified autoregressive framework where a single discrete visual tokenizer serves as the key bridge between understanding and generation, enabling a shared context in which the model can directly interpret its own generated visual tokens without additional re-encoding. UniAR adapts a pretrained vision encoder with multi-level feature fusion and a lookup-free bitwise quantization scheme, preserving both high-level semantics and low-level details while scaling the effective visual vocabulary at minimal cost. Building on this, the unified autoregressive model adopts parallel-bitwise-prediction to jointly predict spatially grouped, multi-level visual codes, substantially reducing visual sequence length and accelerating generation. Finally, a diffusion-based visual decoder operates on discrete visual tokens to decode high-fidelity images. Through large-scale pre-training, followed by supervised fine-tuning and reinforcement learning, UniAR achieves state-of-the-art performance on image generation and image editing while remaining competitive on multimodal understanding benchmarks. The project page is available at this https URL.
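Abstract (translated from Chinese) Unified multimodal modeling aims to integrate visual understanding and generation in a single system. Existing approaches, however, typically rely on two separate visual tokenizers, which splits the representation space and hinders truly unified modeling. We propose UniAR, a unified autoregressive framework in which a single discrete visual tokenizer serves as the key bridge between understanding and generation, enabling a shared context in which the model can directly interpret its own generated visual tokens without additional re-encoding. UniAR adapts a pretrained vision encoder with multi-level feature fusion and a lookup-free bitwise quantization scheme, preserving both high-level semantics and low-level details while scaling the effective visual vocabulary at minimal cost. Building on this, the unified autoregressive model adopts parallel bitwise prediction to jointly predict spatially grouped, multi-level visual codes, substantially shortening the visual sequence and accelerating generation. Finally, a diffusion-based visual decoder operates on the discrete visual tokens to decode high-fidelity images. Through large-scale pre-training followed by supervised fine-tuning and reinforcement learning, UniAR achieves state-of-the-art performance on image generation and image editing while remaining competitive on multimodal understanding benchmarks. The project page is available at this https URL.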
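
The abstract above names a lookup-free bitwise quantization scheme without describing it. As a rough illustration of the general idea only, the sketch below binarizes each latent dimension by its sign, so the token id is the resulting bit pattern and no codebook search is needed. The vocabulary then grows as 2^d with the number of dimensions d. This sign-based variant and all names are assumptions, not UniAR's actual tokenizer.

```python
def quantize_bits(latent):
    """Binarize each latent dimension by sign: positive -> 1, else 0."""
    return [1 if x > 0 else 0 for x in latent]


def bits_to_index(bits):
    """Read the bit pattern as an integer token id (no codebook lookup)."""
    index = 0
    for bit in bits:
        index = (index << 1) | bit
    return index


def dequantize(bits):
    """Map bits back to a latent vector in {-1.0, +1.0}^d."""
    return [1.0 if bit else -1.0 for bit in bits]


latent = [0.3, -1.2, 0.8, -0.1]      # hypothetical 4-dimensional feature
bits = quantize_bits(latent)         # [1, 0, 1, 0]
token_id = bits_to_index(bits)       # 0b1010
vocab_size = 2 ** len(latent)        # 16 codes from only 4 dimensions

print(bits, token_id, vocab_size, dequantize(bits))
```

Because each bit is an independent binary decision, a model can predict the bits of a token in parallel instead of classifying over the full 2^d vocabulary, which is one plausible reading of the parallel bitwise prediction mentioned in the abstract.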

Keyword: diffusion policy

LAGO Policy: Latency-Aware Asynchronous Diffusion Policies with Goal-Directed Collision-Free Planning for Smooth Manipulation

LAGO 策略：延迟感知异步扩散策略，配合目标导向的无碰撞规划，实现平滑操作

Authors: Guowei Shi, Xupeng Xie, Yiming Luo, Jian Guo, Jun Ma, Boyu Zhou
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.17982
Pdf link: https://arxiv.org/pdf/2606.17982
Abstract Diffusion-based visuomotor policies deployed with asynchronous inference often exhibit inter-chunk discontinuities and lack explicit mechanisms for obstacle-aware execution, leading to jerky motions and collisions that hinder reliable manipulation in real-world scenes. To address these issues, we propose LAGO Policy, a unified asynchronous action-generation framework that integrates trajectory optimization with diffusion policy for smooth and safe execution. LAGO Policy improves inter-chunk consistency via latency-aware classifier-free guidance conditioning on future actions. It further enables goal-directed collision-free trajectory planning by predicting a task-relevant interaction goal from demonstrations. Finally, spatial-temporal trajectory optimization refines the actions to be executed for low-jerk and feasible motion. Extensive real-world experiments demonstrate that LAGO Policy achieves smooth collision-free execution with high task success across challenging manipulation tasks. Project Website: this https URL
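Abstract (translated from Chinese) Diffusion-based visuomotor policies deployed with asynchronous inference often exhibit discontinuities between action chunks and lack explicit mechanisms for obstacle-aware execution, leading to jerky motions and collisions that hinder reliable manipulation in real-world scenes. To address these issues, we propose LAGO Policy, a unified asynchronous action-generation framework that integrates trajectory optimization with diffusion policy for smooth and safe execution. LAGO Policy improves inter-chunk consistency through latency-aware classifier-free guidance conditioned on future actions. It also enables goal-directed, collision-free trajectory planning by predicting a task-relevant interaction goal from demonstrations. Finally, spatial-temporal trajectory optimization refines the actions to be executed so that the motion is low-jerk and feasible. Extensive real-world experiments show that LAGO Policy achieves smooth, collision-free execution with high task success on challenging manipulation tasks. Project website: this https URL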
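
For readers unfamiliar with classifier-free guidance, which the abstract above builds on, the sketch below shows the standard combination step: the denoiser is evaluated with and without the condition, and the two noise predictions are extrapolated with a guidance scale. How LAGO Policy constructs its latency-aware condition from future actions is not described in the abstract; the vectors and names here are placeholders.

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Standard CFG combination: eps = eps_uncond + scale * (eps_cond - eps_uncond).

    scale = 0 ignores the condition, scale = 1 uses the conditional
    prediction as-is, and scale > 1 pushes further toward the condition.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Placeholder noise predictions for a 2-dimensional action (hypothetical).
eps_uncond = [0.0, 1.0]   # denoiser output without the condition
eps_cond = [1.0, 3.0]     # denoiser output given the condition

guided = classifier_free_guidance(eps_uncond, eps_cond, scale=2.0)
print(guided)
```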