Arxiv Papers of Today

Generated: 2026-05-18 19:46:33 (UTC+8); Arxiv release time: 2026-05-18 20:00 EDT (2026-05-19 08:00 UTC+8)

40 relevant papers today

Keyword: reinforcement learning

ICRL: Learning to Internalize Self-Critique with Reinforcement Learning

Authors: Jianbo Lin, Xiaomin Yu, Yi Xin, Yifu Guo, Zhuosong Jiang, Zhongqi Yue, Weishi Wang, Heqing Zou, Chengwei Qin, Hui Xiong
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2605.15224
Pdf link: https://arxiv.org/pdf/2605.15224
Abstract Large language model-based agents make mistakes, yet critique can often guide the same model toward correct behavior. However, when critique is removed, the model may fail again on the same query, indicating that it has not internalized the critique's guidance into its underlying capability. Meanwhile, a frozen critic cannot improve its feedback quality over time, limiting the potential for iterative self-improvement. To address this, we propose learning to internalize self-critique with reinforcement learning (ICRL), a novel framework that jointly trains a solver and a critic from a shared backbone to convert critique-induced success into unassisted solver ability. The critic is rewarded based on the solver's subsequent performance gain, incentivizing actionable feedback. To address the distribution shift between critique-conditioned and critique-free behavior, ICRL introduces a distribution-calibration re-weighting ratio that selectively transfers critique-guided improvements compatible with the solver's own prompt distribution. Additionally, a role-wise group advantage estimation stabilizes joint optimization across the two roles. Together, these mechanisms ensure that the solver learns to improve itself without external critique, rather than becoming dependent on critique-conditioned behavior. We evaluate ICRL on diverse benchmarks spanning agentic and mathematical reasoning tasks, using Qwen3-4B and Qwen3-8B as backbones. Results show consistent improvements, with average gains of 6.4 points over GRPO on agentic tasks, and 7.0 points on mathematical reasoning. Notably, the learned 8B critic is comparable to 32B critics while using substantially fewer tokens. The code is available at this https URL.
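
The two reward ideas in the ICRL abstract (a critic rewarded by the solver's subsequent gain, and advantages estimated per role) can be illustrated with a minimal sketch. The abstract does not give the exact estimator, so the GRPO-style standardization below, and the function names `critic_reward`, `group_advantages`, and `role_wise_advantages`, are assumptions for illustration only. The distribution-calibration re-weighting ratio is not modeled here.

```python
from statistics import mean, pstdev


def critic_reward(score_before: float, score_after: float) -> float:
    """Assumed form: the critic earns the solver's gain after reading the critique."""
    return score_after - score_before


def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within one sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def role_wise_advantages(rewards_by_role):
    """Normalize each role's group separately so their reward scales do not mix."""
    return {role: group_advantages(rs) for role, rs in rewards_by_role.items()}


# Hypothetical rollouts: four solver attempts and four critiques of one query.
solver_rewards = [0.0, 1.0, 1.0, 0.0]
critic_rewards = [critic_reward(b, a) for b, a in [(0, 1), (0, 0), (1, 1), (0, 1)]]
adv = role_wise_advantages({"solver": solver_rewards, "critic": critic_rewards})
```

Keeping the normalization per role means a critique that flips a failure into a success is compared only against other critiques, not against raw solver outcomes.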

Training on Documents About Monitoring Leads to CoT Obfuscation

Authors: Reilly Haskins, Bilal Chughtai, Joshua Engels
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.15257
Pdf link: https://arxiv.org/pdf/2605.15257
Abstract Chain-of-thought (CoT) monitoring is one of the most promising tools we have for detecting model misbehavior, but its effectiveness depends on models faithfully externalizing their reasoning. Motivated by this vulnerability, we study whether monitor-aware models are capable of obfuscating their reasoning to evade detection. We use synthetic document finetuning to expose eight models to realistic pre-training-style documents describing a CoT monitor and find that monitor-aware models consistently achieve higher rates of undetected misbehavior compared to unaware controls. This effect is weaker but still present on a harder agentic task. We also show that CoT controllability, a model's ability to reshape its own reasoning trace under an imposed constraint, is closely correlated with obfuscation success across the eight models studied ($r=0.800$, $p=0.017$). Monitor-aware models placed under equal reinforcement learning optimization pressure also learn to reward-hack without triggering a CoT monitor substantially faster than unaware controls. Together, these results suggest that knowledge of monitoring combined with high CoT controllability poses a risk to CoT-based monitoring.
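
As a sanity check on the reported correlation, the sketch below recomputes a p-value for $r=0.800$ with eight models. It assumes the standard two-sided t-test for a Pearson correlation with n - 2 degrees of freedom; the abstract does not state which test the authors used. The t-distribution tail is integrated numerically with Simpson's rule so that only the standard library is needed.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(r, n, steps=2000):
    """Return (t statistic, two-sided p-value) for a Pearson r over n samples."""
    df = n - 2
    t = r * math.sqrt(df) / math.sqrt(1 - r * r)
    h = t / steps
    # Simpson's rule on [0, t]; steps must be even.
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return t, 1 - 2 * area


t_stat, p_value = two_sided_p(0.800, 8)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```

Under these assumptions the t statistic is about 3.27 on 6 degrees of freedom, and the p-value rounds to the 0.017 quoted in the abstract.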

Solvita: Enhancing Large Language Models for Competitive Programming via Agentic Evolution

Authors: Han Li, Jinyu Tian, Rili Feng, Yuqiao Du, Chong Zheng, Chenyu Wang, Chenchen Liu, Shihao Li, Xinping Lei, Yifan Yao, Weihao Xie, Letian Zhu, Jiaheng Liu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.15301
Pdf link: https://arxiv.org/pdf/2605.15301
Abstract Large language models (LLMs) still struggle with the rigorous reasoning demands of hard competitive programming. While recent multi-agent frameworks attempt to bridge this reliability gap, they remain fundamentally stateless: they rely on static retrieval and discard the valuable problem-solving and debugging experience gained from previous tasks. To address this, we present Solvita, an agentic evolution framework that enables continuous learning without requiring weight updates to the underlying LLM. Solvita reorganizes problem-solving into a closed-loop system of strategy selection, program synthesis, certified supervision, and targeted hacking, executed by four specialized agents: Planner, Solver, Oracle, and Hacker. Crucially, each agent is paired with a trainable, graph-structured knowledge network. As the system operates, outcome signals, such as pass/fail verdicts, test certification quality, and adversarial vulnerabilities discovered by the Hacker, are recast as reinforcement learning updates to these network weights. This allows the agents to dynamically route future queries based on past successes and failures, effectively accumulating transferable reasoning experience over time. Evaluated across CodeContests, APPS, AetherCode, and live Codeforces rounds, Solvita establishes a new state-of-the-art among code-generation agents, outperforming existing multi-agent pipelines and nearly doubling the accuracy of single-pass baselines.

LEAP: Trajectory-Level Evaluation of LLMs in Iterative Scientific Design

Authors: Marilyn Zhang, Tianfeng Chen, Fabián Barzuna, Ankita Rathod, Mark E. Whiting
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.15341
Pdf link: https://arxiv.org/pdf/2605.15341
Abstract LLMs are increasingly deployed in autonomous laboratories, under the assumption that their domain priors and reasoning over iterative feedback let them converge on good designs in fewer iterations than feedback-only baselines. Current iterative scientific design benchmarks, however, score only outcome snapshots at fixed horizons. This leaves the learning trajectory unmeasured, even though the trajectory is what captures learning efficiency, where each iteration saved is a real saving in cost and time. Motivated by this, we examine three evaluation choices that change the conclusions one draws about LLM learning efficiency in iterative scientific design: what to measure, what baseline to compare against, and what to ground against. We introduce LEAPBench, Learning Efficiency in Adaptive Processes, a 55-task framework that pairs a best-so-far area under the curve (AUC) trajectory metric with a classical Bayesian-optimization reference and an audit grounded in published literature. Applied to eight contemporary LLMs, switching from final-outcome to trajectory scoring changes the best-model decision on 53% of tasks at matched horizons, and exposes efficiency gains overlooked by outcome-based scoring. LLMs do not outperform a classical Bayesian baseline. On 16 biology tasks where the oracle's reward signal is aligned with configurations from the published-best design, domain-aware prompting leads to LLM choices that match the published-best design approximately 10 percentage points less often than domain-agnostic prompting at iteration 30. The pattern is sharpest on 6 tasks where the literature-typical and published-best configurations diverge, and domain-agnostic prompting matches the published-best more often on all 6. The trajectory metric also doubles as a tractable training target. Offline reinforcement learning with the metric as a reward improves performance on 14 of 21 held-out tasks.
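
To make the difference between final-outcome and trajectory scoring concrete, here is a minimal sketch of a best-so-far AUC. The exact normalization used by LEAPBench is not given in the abstract, so this version assumes a maximization task and simply averages the running maximum over the horizon; `best_so_far` and `best_so_far_auc` are illustrative names.

```python
def best_so_far(values):
    """Running maximum of the objective over iterations."""
    best, curve = float("-inf"), []
    for v in values:
        best = max(best, v)
        curve.append(best)
    return curve


def best_so_far_auc(values):
    """Mean of the running maximum: a discrete, horizon-normalized AUC (assumed form)."""
    curve = best_so_far(values)
    return sum(curve) / len(curve)


# Two hypothetical runs that end at the same final score.
fast = [0.2, 0.8, 0.8, 0.9]
slow = [0.2, 0.3, 0.4, 0.9]
print(best_so_far_auc(fast), best_so_far_auc(slow))
```

Both runs tie under final-outcome scoring, but the trajectory metric favors the run that reaches good designs earlier, which is the kind of ranking change the abstract describes.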

Controllable Molecular Generative Foundation Models

Authors: Yihan Zhu, Yuhan Liu, Weijiang Li, Tengfei Luo, Meng Jiang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.15354
Pdf link: https://arxiv.org/pdf/2605.15354
Abstract Despite the success of foundation models in language and vision, molecular graph generation still lacks a unified framework for heterogeneous design tasks with reliable controllability. While reinforcement learning (RL) offers a natural post-training mechanism for task-specific optimization, applying it to graph generative models is hindered by the vast atom-wise action spaces and chemically invalid intermediate states. We propose Controllable Molecular Generative Foundation Models (CoMole), built with a unified motif-aware graph diffusion pipeline. By learning a motif-aware graph space, CoMole transfers pretrained structural priors into controllable generation, where RL optimizes conditional reverse policies over chemically meaningful decisions. We theoretically characterize the bottleneck of atom-level RL and justify motif-aware policy optimization. Across three heterogeneous benchmarks spanning materials and drug discovery, CoMole ranks first in controllability on all nine targets, reduces MAE by up to 48.2% relative to the strongest baselines, and maintains validity above 0.94 without rule-based correction or post-hoc filtering. We further show that CoMole transfers controllability to unseen properties by optimizing only task embeddings with the generator frozen, achieving performance competitive with strong task-specific baselines.

Video Models Can Reason with Verifiable Rewards

Authors: Tinghui Zhu, Sheng Zhang, James Y. Huang, Selena Song, Xiaofei Wen, Yuankai Li, Hoifung Poon, Muhao Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.15458
Pdf link: https://arxiv.org/pdf/2605.15458
Abstract Video diffusion models have made rapid progress in perceptual realism and temporal coherence, but they remain primarily optimized for plausible generation rather than verifiable reasoning. This limitation is especially pronounced in tasks where generated videos must satisfy explicit spatial, temporal, or logical constraints. Inspired by the role of reinforcement learning with verifiable rewards (RLVR) in reasoning-oriented language models, we introduce VideoRLVR, a practical recipe for optimizing video diffusion models with rule-based feedback. VideoRLVR formulates video reasoning as the generation of verifiable visual trajectories and consists of an SDE-GRPO optimization backbone, dense decomposed rewards, and an Early-Step Focus strategy for efficient training. The Early-Step Focus strategy restricts policy optimization to the early denoising phase, reducing training latency by about 40% while preserving performance. We evaluate VideoRLVR on Maze, FlowFree, and Sokoban, three procedurally generated domains with objective success criteria. Across these tasks, VideoRLVR consistently improves over supervised fine-tuning baselines, with dense decomposed rewards proving especially important in low-success-rate settings. Our RL-optimized model also outperforms the evaluated proprietary and open-source video generation models on these verifiable reasoning benchmarks and out-of-domain benchmarks. These results suggest that verifiable RL can move video models beyond perceptual imitation toward more reliable rule-consistent visual reasoning.
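
A rule-based, decomposed reward of the kind the VideoRLVR abstract describes can be sketched for the Maze domain. This is not the paper's reward: it assumes the agent's trajectory has already been extracted from the generated frames as a list of grid cells, and the term names are hypothetical.

```python
def maze_reward(grid, path, start, goal):
    """Decomposed rule-based reward terms for a path on a grid maze.

    grid[r][c] == 1 marks a wall; path is a list of (row, col) cells.
    """
    valid_steps = 0
    for (r0, c0), (r1, c1) in zip(path, path[1:]):
        adjacent = abs(r0 - r1) + abs(c0 - c1) == 1
        in_bounds = 0 <= r1 < len(grid) and 0 <= c1 < len(grid[0])
        if adjacent and in_bounds and grid[r1][c1] == 0:
            valid_steps += 1
    n_steps = max(len(path) - 1, 1)
    terms = {
        "starts_at_start": float(bool(path) and path[0] == start),
        "valid_move_fraction": valid_steps / n_steps,
        "reaches_goal": float(bool(path) and path[-1] == goal),
    }
    # Sparse success signal: every decomposed term must be fully satisfied.
    terms["success"] = float(all(v == 1.0 for v in terms.values()))
    return terms


grid = [[0, 0],
        [1, 0]]
good = maze_reward(grid, [(0, 0), (0, 1), (1, 1)], (0, 0), (1, 1))
bad = maze_reward(grid, [(0, 0), (1, 0), (1, 1)], (0, 0), (1, 1))
```

The partial terms give a learning signal even when the sparse success term is zero, which is consistent with the abstract's remark that dense decomposed rewards matter most in low-success-rate settings.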

GRLO: Towards Generalizable Reinforcement Learning in Open-Ended Environments from Zero

Authors: Shangjian Yin, Yu Fu, Yue Dong, Zhouxing Shi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.15464
Pdf link: https://arxiv.org/pdf/2605.15464
Abstract Post-training has become a crucial step for unlocking the capabilities of large language models, with reinforcement learning (RL) emerging as a critical paradigm. Recent RL-based post-training has increasingly split into two paradigms: reinforcement learning from human feedback (RLHF), which optimizes models using human preference signals in target domains, and reinforcement learning from verifiable rewards (RLVR), which operates in verifier-backed environments. The latter has dominated recent reasoning-oriented post-training because it delivers stronger gains and higher efficiency on domain-specific tasks (e.g., reasoning). However, although in-domain RL training achieves promising performance, it still requires a substantial amount of GPU compute, which remains a major barrier to broad adoption. In this work, we study the generalization ability of RLHF learned from scratch from a small set of interactions in open-ended environments, and investigate whether the conversational abilities it explicitly acquires can implicitly transfer to downstream tasks such as mathematical reasoning and code generation, namely GRLO. Specifically, on Qwen3-4B-Base backbone, GRLO improves the average performance across all domains from 24.1 to 63.1 with only 5K prompts and 22.7 GPU hours, requiring about $46\times$ less data and $68\times$ less compute than a strong in-domain RLVR baseline. The resulting model is even competitive with Qwen's released post-trained models which required a much larger training cost. Notably, a subsequent in-domain RLVR stage brings only selective gains, mainly on harder competition-math benchmarks. We hope GRLO offers a simple and efficient recipe for building broadly capable post-trained models. Our code and data will be available at: this https URL.

Residual Reinforcement Learning for Robot Teleoperation under Stochastic Delays

Authors: Kaize Deng, Zewen Yang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.15480
Pdf link: https://arxiv.org/pdf/2605.15480
Abstract Stochastic communication delays in teleoperation introduce signal discontinuities that undermine control stability and degrade control performance. Consequently, conventional reinforcement learning (RL) methods struggle with the delayed observations, which leads to high-frequency chattering. To address this, we propose a hybrid control framework, delay-resilient RL, integrating a state estimator utilizing Long Short-Term Memory (LSTM) with a residual RL policy, which is resilient to stochastic delays. The LSTM reconstructs smooth, continuous state estimates from delayed observations, enabling the RL agent to learn a residual torque compensation policy that balances tracking accuracy with velocity smoothness. Experimental validation on Franka Panda robots demonstrates that our approach significantly outperforms the state-of-the-art baselines, ensuring robust and stable teleoperation even under high-variance stochastic delays.
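
The residual structure described in the abstract (a learned torque correction added to a nominal command, trained to trade tracking accuracy against velocity smoothness) can be sketched as follows. The clipping bound, the quadratic reward form, and the weights `w_track` and `w_smooth` are assumptions for illustration; the LSTM state estimator is not modeled.

```python
def combined_torque(nominal, residual, residual_limit=1.0):
    """Add a bounded residual correction to the nominal controller's torque."""
    clipped = [max(-residual_limit, min(residual_limit, r)) for r in residual]
    return [n + r for n, r in zip(nominal, clipped)]


def reward(q, q_ref, dq, dq_prev, w_track=1.0, w_smooth=0.1):
    """Assumed reward: penalize tracking error and step-to-step velocity changes."""
    tracking = sum((a - b) ** 2 for a, b in zip(q, q_ref))
    jerkiness = sum((a - b) ** 2 for a, b in zip(dq, dq_prev))
    return -(w_track * tracking + w_smooth * jerkiness)


tau = combined_torque([1.0, 2.0], [0.5, 3.0])    # second residual is clipped to 1.0
smooth = reward([0.0], [0.0], [1.0], [1.0])      # velocity unchanged between steps
chatter = reward([0.0], [0.0], [1.0], [-1.0])    # velocity flips sign
```

Bounding the residual keeps the nominal controller in charge of the bulk of the command, and the smoothness term penalizes the chattering that delayed observations tend to induce.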

Terrain Consistent Reference-Guided RL for Humanoid Navigation Autonomy

Authors: William D. Compton, Zachary Olkin, Aaron D. Ames
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2605.15517
Pdf link: https://arxiv.org/pdf/2605.15517
Abstract We present a method for training reference-guided, perceptive reinforcement learning locomotion policies for humanoid robots in which reference trajectories are modulated in training to be consistent with terrain geometry. Aiming to deploy our method with standard navigation autonomy infrastructure, we synthesize SE(2)-controllable reference trajectories inside the RL training loop, projecting desired footsteps onto valid footholds and adjusting swing-foot and center-of-mass trajectories to match the terrain. The resulting policy exposes a clean SE(2) velocity interface compatible with standard navigation planners. In simulation, environmentally conditioned references significantly improve reference tracking performance compared to environment-agnostic references. On hardware, we integrate the policy with an MPC + control barrier function planner and demonstrate long-horizon (>70m) closed-loop autonomous navigation on the Unitree G1 through outdoor environments containing rough terrain and consecutive flights of stairs, with all sensing and computation onboard.
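
The step of projecting desired footsteps onto valid footholds and adjusting the swing foot to the terrain can be illustrated with a toy sketch. It assumes footholds are given as a discrete list of (x, y, z) points and uses a nearest-neighbor projection in the horizontal plane; the paper's actual projection and trajectory generation are not specified in the abstract, and the `clearance` value is hypothetical.

```python
import math


def project_to_foothold(desired_xy, footholds):
    """Return the valid foothold (x, y, z) closest in the xy-plane to desired_xy."""
    return min(
        footholds,
        key=lambda f: math.hypot(f[0] - desired_xy[0], f[1] - desired_xy[1]),
    )


def swing_apex(z_start, z_end, clearance=0.08):
    """Swing-foot apex height that clears the higher of the two contact heights."""
    return max(z_start, z_end) + clearance


# Hypothetical staircase: three treads, each 0.15 m higher than the last.
stairs = [(0.0, 0.0, 0.0), (0.3, 0.0, 0.15), (0.6, 0.0, 0.30)]
target = project_to_foothold((0.35, 0.05), stairs)
apex = swing_apex(0.0, target[2])
```

Because the reference is corrected before the policy sees it, the policy is never asked to track a footstep that lands on a stair edge or in mid-air.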

DiffVAS: Diffusion-Guided Visual Active Search in Partially Observable Environments

Authors: Anindya Sarkar, Srikumar Sastry, Aleksis Pirinen, Nathan Jacobs, Yevgeniy Vorobeychik
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.15519
Pdf link: https://arxiv.org/pdf/2605.15519
Abstract Visual active search (VAS) has been introduced as a modeling framework that leverages visual cues to direct aerial (e.g., UAV-based) exploration and pinpoint areas of interest within extensive geospatial regions. Potential applications of VAS include detecting hotspots for rare wildlife poaching, aiding search-and-rescue missions, and uncovering illegal trafficking of weapons, among other uses. Previous VAS approaches assume that the entire search space is known upfront, which is often unrealistic due to constraints such as a restricted field of view and high acquisition costs, and they typically learn policies tailored to specific target objects, which limits their ability to search for multiple target categories simultaneously. In this work, we propose DiffVAS, a target-conditioned policy that searches for diverse objects simultaneously according to task requirements in partially observable environments, which advances the deployment of visual active search policies in real-world applications. DiffVAS leverages a diffusion model to reconstruct the entire geospatial area from sequentially observed partial glimpses, which enables a target-conditioned reinforcement learning-based planning module to effectively reason and guide subsequent search steps. Extensive experiments demonstrate that DiffVAS excels in searching diverse objects in partially observable environments, significantly surpassing state-of-the-art methods on several datasets.

Task-Semantic Graph-Driven Distributed Agent Networking for Underwater Target Tracking

Authors: Shengchao Zhu, Guangjie Han, Chuan Lin, Yu He
Subjects: Robotics (cs.RO); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2605.15528
Pdf link: https://arxiv.org/pdf/2605.15528
Abstract Autonomous underwater vehicle (AUV) swarms are emerging as intelligent underwater networks, where each node must sense, communicate, process local data, and make decisions under severe acoustic constraints. Persistent underwater target tracking is a typical task with moving targets, changing communication topology, intermittent acoustic links, and limited observation for each AUV. Multi-agent reinforcement learning (MARL) is a natural candidate for distributed tracking, yet existing studies still lack a unified open-source platform for evaluating different MARL algorithms under six-degree-of-freedom AUV dynamics. In addition, policies trained with raw geometric states and low-level force actions often struggle to represent task phases, observation reliability, link quality, and local cooperation roles. This paper addresses these issues by developing an open-source MARL-AUV platform that integrates DI-engine with a six-degree-of-freedom underwater AUV target-tracking simulator. To the best of our knowledge, it is the first open platform that connects a public MARL training framework with physically modeled AUV swarm-based tasks, and provides a unified experimental protocol for fair training, testing, and comparison of representative RL and MARL algorithms. Based on this platform, we propose STG-MAPPO, a Semantic Task Graph-enhanced variant of Multi-Agent Proximal Policy Optimization. STG-MAPPO builds semantic policy inputs from tracking diagnostics, task phases, observation confidence, link availability, neighbor tracking quality, and local role advantage. A compact semantic task graph links communication-constrained network states to decentralized actor decisions, and a velocity-level action abstraction maps high-level cooperative decisions to executable six-degree-of-freedom AUV control. The code is available at this https URL.

Rethinking Neural Network Learning Rates: A Stackelberg Perspective

Authors: Sihan Zeng, Sujay Bhatt, Sumitra Ganesh
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.15530
Pdf link: https://arxiv.org/pdf/2605.15530
Abstract Neural networks are typically trained with a single learning rate across all layers. While recent empirical evidence suggests that assigning layer-specific learning rates can accelerate training, a principled understanding of the conditions and mechanisms under which non-uniform learning rates are beneficial remains limited. In this work, we investigate non-uniform learning rates through the lens of Stackelberg optimization. Specifically, we demonstrate that training neural networks with a smaller learning rate for the body layers and a larger learning rate for the final layer can be interpreted as a two-time-scale alternating gradient descent algorithm applied to a Stackelberg reformulation of the original objective. We establish finite-time convergence guarantees for the algorithm under broad conditions that accommodate constraint sets and non-smooth activation functions. Beyond convergence, we identify two mechanisms by which non-uniform learning rates can outperform uniform learning rates: (i) we show that certain problem instances induce a Stackelberg objective with stronger optimization structure than the original objective, yielding faster convergence to globally optimal solutions, (ii) our numerical analysis reveals that the Stackelberg objective can exhibit substantially sharper local curvature, especially in early training, which leads to more informative gradients and learning acceleration. Experiments in supervised learning and reinforcement learning support our findings.
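
The training scheme the abstract interprets (a smaller learning rate for the body, a larger one for the final layer, updated in alternation) can be shown on a toy problem. The sketch below is an assumption-laden illustration, not the paper's algorithm or analysis: the "network" is the scalar model a * w * x with a single data point, and the learning rates are arbitrary.

```python
def loss(a, w, x=1.0, y=2.0):
    """Squared error of the toy two-layer model: head weight a, body weight w."""
    return (a * w * x - y) ** 2


def train(a=0.5, w=0.5, lr_head=0.1, lr_body=0.01, steps=500, x=1.0, y=2.0):
    """Alternating gradient descent with a fast head and a slow body."""
    for _ in range(steps):
        err = a * w * x - y
        a -= lr_head * 2 * err * w * x   # fast time scale: final layer
        err = a * w * x - y              # recompute with the updated head
        w -= lr_body * 2 * err * a * x   # slow time scale: body
    return a, w


a_final, w_final = train()
print(loss(0.5, 0.5), loss(a_final, w_final))
```

In the Stackelberg reading, the fast head plays the follower that nearly best-responds to the current body, while the slowly moving body plays the leader.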

NavRL++: A System-Level Framework for Improving Sim-to-Real Transfer in Reinforcement Learning-Based Robot Navigation

Authors: Zhefan Xu, Hanyu Jin, Kenji Shimada
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.15559
Pdf link: https://arxiv.org/pdf/2605.15559
Abstract Recent years have witnessed significant progress in autonomous navigation using reinforcement learning. However, existing approaches largely emphasize reinforcement learning framework design, such as input representations, action spaces, and reward functions, while providing limited analysis of sim-to-real transfer and insufficient insight into how training strategies affect real-world deployment performance. To bridge this gap, we not only introduce an effective RL framework but also present a complete training and deployment pipeline, along with a systematic empirical study that disentangles the key factors affecting sim-to-real transfer in reinforcement learning-based navigation, including sensor noise, perception failures, system latency, and control response. Building on insights from this analysis, we introduce perturbation-aware fine-tuning, a post-training adaptation strategy that improves transfer robustness by explicitly accounting for empirically identified domain discrepancies. To further mitigate perception degradation and enhance control smoothness in real-world deployment, we propose a Transformer-based temporal reasoning policy that leverages short-horizon observation for navigation control. We quantitatively evaluate how individual sim-to-real perturbations and training design choices impact navigation performance across environments. Experimental results demonstrate that the proposed training strategy and policy architecture outperform learning-based baselines in both static and dynamic environments, while achieving performance comparable to optimization-based planners in static settings. We validate our approach through real-world deployment on multiple robotic platforms, including aerial and legged robots, across navigation-centric tasks such as exploration and inspection, demonstrating zero-shot sim-to-real transfer.

AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs

Authors: Haizhong Zheng, Yizhuo Di, Jiahui Wang, Shuowei Jin, Xueshen Liu, Yongji Wu, Z. Morley Mao, Ion Stoica, Jiawei Zhao, Beidi Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.15565
Pdf link: https://arxiv.org/pdf/2605.15565
Abstract Reinforcement learning (RL) is increasingly used to improve the reasoning, coding, and tool-use capabilities of large language models, but agentic RL remains prohibitively expensive. Scaling RL to agentic LLMs requires supporting complex workloads, including multi-policy collaborative training, while efficiently using elastic, heterogeneous, and cross-region compute resources. Existing LLM RL systems support some of these capabilities, but each new extension often requires dedicated system engineering. This burden arises from trainer-centered control architectures and the lack of principled abstractions for RL system components. To address these limitations, we propose AstraFlow, a dataflow-oriented RL system that replaces conventional trainer-centered control with principled component abstractions. In AstraFlow, rollout services, dataflow management, and training are decoupled into autonomous components, enabling the system to natively support complex multi-policy agentic RL workloads and efficiently exploit diverse compute resources. We evaluate AstraFlow across math, code, search, and AgentBench workloads, showing that the same system supports multi-policy training, elastic scaling, heterogeneous cross-region execution, and composable data algorithms without system-level code changes. In multi-policy collaborative training, AstraFlow achieves comparable or better accuracy than existing RL systems while speeding up training time by 2.7x.

Calibrating LLMs with Semantic-level Reward

Authors: Fengfei Yu, Ruijia Niu, Dongxia Wu, Yian Ma, Rose Yu
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.15588
Pdf link: https://arxiv.org/pdf/2605.15588
Abstract As large language models (LLMs) are deployed in consequential settings such as medical question answering and legal reasoning, the ability to estimate when their outputs are likely to be correct is essential for safe and reliable use, requiring well-calibrated uncertainty. Standard reinforcement learning with verifiable rewards (RLVR) trains models with a binary correctness reward that is indifferent to confidence, providing no penalty for confident but wrong predictions and thereby degrading calibration. Recent work addresses this by training models to produce verbalized confidence scores alongside answers and rewarding agreement with correctness. However, verbalized confidence is calibrated at the token level and is therefore inconsistent across textual variations with the same semantic meaning. We propose Calibration with Semantic Reward (CSR), a framework that calibrates language models directly in semantic space without a verbalized confidence interface. CSR combines the correctness reward with a novel semantic calibration reward that encourages exploitation among correct rollouts by promoting semantic agreement, and exploration among incorrect ones by discouraging spurious consistency. Experiments across three model families on HotpotQA (in-distribution) and TriviaQA, MSMARCO, and NQ-Open (out-of-distribution) show that CSR achieves lower ECE and higher AUROC than verbalized-confidence baselines across nearly all settings, reducing ECE by up to 40% and improving AUROC by up to 31%, with calibration behavior generalizing robustly across all four evaluation settings.
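The abstract reports calibration as ECE (expected calibration error). As a reading aid, here is a minimal sketch of the standard equal-width-bin ECE. The function name, the bin count, and the toy data are illustrative assumptions, and this is not the CSR reward itself.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: size-weighted mean of |accuracy - confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 falls in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


# Toy data (hypothetical): the model is over-confident on some wrong answers.
confs = [0.95, 0.92, 0.91, 0.65, 0.55]
hits = [True, False, True, True, False]
ece = expected_calibration_error(confs, hits)
```

A perfectly calibrated model would have ECE 0. Lower values mean stated confidence tracks accuracy more closely.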

Offline Reinforcement Learning with Universal Horizon Models

Authors: Hojun Chung, Junseo Lee, Songhwai Oh
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.15603
Pdf link: https://arxiv.org/pdf/2605.15603
Abstract Model-based reinforcement learning (RL) offers a compelling approach to offline RL by enabling value learning on imagined on-policy trajectories. However, it often suffers from compounding errors due to repeated model inference on self-generated states. While geometric horizon models (GHM) alleviate this issue through direct prediction over a discounted infinite-horizon future, they remain challenged in accurately modeling distant future states. To this end, we introduce universal horizon models (UHM), a generalization of GHM that directly predicts future states under arbitrary horizons. Leveraging this flexibility, we propose a scalable value learning method that employs a winsorized horizon distribution to stabilize training by capping excessively large horizons. Experimental results on 100 challenging OGBench tasks demonstrate that the proposed method outperforms competitive baselines, particularly on tasks with highly suboptimal datasets and those requiring long-horizon reasoning. Project page: this https URL
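To make "winsorized horizon distribution" concrete, the sketch below samples a geometric horizon (the distribution implied by a discount factor) and caps it at a maximum. The abstract does not give the paper's exact construction, so the geometric base distribution, the cap value, and all names are assumptions.

```python
import math
import random


def sample_winsorized_horizon(gamma, h_max, rng):
    """Sample h ~ Geometric(1 - gamma) on {1, 2, ...}, then clip at h_max."""
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    h = int(math.log(u) / math.log(gamma)) + 1  # inverse-transform sampling
    return min(h, h_max)  # winsorize: large horizons collapse onto the cap


rng = random.Random(0)
horizons = [sample_winsorized_horizon(gamma=0.99, h_max=50, rng=rng) for _ in range(1000)]
```

With `gamma=0.99` the uncapped mean horizon is 100, so a cap of 50 absorbs a large share of the probability mass. That bounds how far ahead the model is asked to predict.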

Sharp Spectral Thresholds for Logit Fixed Points

Authors: Tongxi Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Arxiv link: https://arxiv.org/abs/2605.15651
Pdf link: https://arxiv.org/pdf/2605.15651
Abstract Softmax feedback systems are a common mathematical core of entropy-regularized reinforcement learning, logit game dynamics, population choice, and mean-field variational updates. Their central stability question is simple: when does a self-reinforcing softmax system produce a unique and globally predictable outcome? Classical theory gives a conservative answer. By treating softmax as a unit-scale response, it certifies stability only in a strongly randomized regime. We prove that the classical approach misses an entire stable regime and does not identify the point at which the qualitative change truly occurs. For finite-dimensional affine logit systems, the sharp dimension-free Euclidean threshold is $$\beta\|\Pi W\Pi\|_{\mathcal T\to\mathcal T}<2,$$ rather than the previously used condition, which certifies stability only while the softmax system remains safely over-regularized. Our theorem fills the previously missing pre-bifurcation regime, extending stability guarantees for affine softmax feedback systems to reward-responsive yet globally predictable systems. It enlarges the certified stability boundary for these systems and identifies where the model genuinely undergoes a phase transition.
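One way to see why the constant 2 is natural is a short contraction argument. It assumes the system is the map $F(x)=\mathrm{softmax}(\beta(Wx+b))$ on the simplex, that $\mathcal T$ is the sum-zero subspace, and that $\Pi$ is the orthogonal projector onto it. The abstract states none of these, and this is a sketch of sufficiency only, not the paper's proof.

```latex
% Jacobian of the assumed map F(x) = softmax(beta (W x + b)), with p = F(x):
\[
DF(x) = \beta\, S(p)\, W, \qquad S(p) = \operatorname{diag}(p) - p\,p^{\top}.
\]
% S(p) is a covariance form, so Popoviciu's inequality bounds its norm by 1/2:
\[
v^{\top} S(p)\, v = \operatorname{Var}_{p}(v)
\le \tfrac14\bigl(\max_i v_i - \min_i v_i\bigr)^2
\le \tfrac12 \|v\|_2^2
\quad\Longrightarrow\quad \|S(p)\|_2 \le \tfrac12 .
\]
% S(p) annihilates the all-ones vector, so S(p) = Pi S(p) Pi, and for v in T:
\[
\|DF(x)\,v\|_2 = \beta\,\|S(p)\,\Pi W \Pi\, v\|_2
\le \tfrac{\beta}{2}\,\|\Pi W\Pi\|_{\mathcal T\to\mathcal T}\,\|v\|_2 .
\]
```

Under these assumptions $F$ is a contraction on the simplex whenever $\beta\|\Pi W\Pi\|_{\mathcal T\to\mathcal T}<2$, so the Banach fixed-point theorem gives a unique, globally attracting fixed point. Treating softmax as having unit Lipschitz constant would only certify $\beta\|\Pi W\Pi\|<1$.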

PCASim: Promptable Closed-loop Adversarial Simulation for Urban Traffic Environment

Authors: Chuancheng Zhang, Zhenhao Wang, Kaizheng Li, Yaran Lin, Qiang Guo, Bin Jiang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.15654
Pdf link: https://arxiv.org/pdf/2605.15654
Abstract Real-world autonomous driving, particularly in urban environments with numerous corner cases, requires rigorous testing to ensure product safety and robustness. However, few studies have explored integrating adversarial scenario generation with the training of safety agents in closed-loop testing, enabling efficient co-evolution and mutual enhancement of both. To address this challenge, an adversarial behavior knowledge repository is constructed by applying rule-based filtering to an open-source dataset, combined with knowledge retrieval modules tailored for simulation environments. A large language model (LLM) is employed to integrate knowledge-, data-, and adversarial-driven approaches, generating safety-critical traffic scenarios customized to user needs. Additionally, while evaluating the generated scenarios, we employ reinforcement learning models to train the behaviors of different types of vehicles, thereby enriching scenario diversity beyond existing datasets while preserving realism. Experimental results demonstrate that the proposed framework improves the accuracy of domain-specific language generation by 12%. Moreover, the success rate of newly generated scenario transformations increases by 8%, while obstacle-avoidance capability is enhanced by 30%. For the complete manuscript, please refer to: this https URL

Tighter Regret Bounds for Contextual Action-Set Reinforcement Learning

Authors: Zijun Chen, Zihan Zhang
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2605.15692
Pdf link: https://arxiv.org/pdf/2605.15692
Abstract We study episodic reinforcement learning with fixed reward and transition functions, but with episode-dependent admissible action sets that are observed at the start of each episode. Performance is measured by cumulative regret against the episode-wise optimal value, $\sum_{k=1}^K [V^{*,M^k} - V^{\pi^k,M^k}]$, where $M^k$ represents the action context in the $k$-th episode. We show that the MVP algorithm naturally extends to this framework and enjoys strong theoretical guarantees. In particular, we establish a minimax regret bound of $\widetilde{O}(\sqrt{SAH^3K\log L})$ for adversarial contexts, where $L$ denotes the number of possible contexts. This result implies a regret bound of $\widetilde{O}(\sqrt{SAH^3K})$ for stochastic contexts. We further translate the stochastic regret guarantee into a sample complexity bound of $\widetilde{O}(SAH^3/\epsilon^2)$ for a fixed context distribution. In addition, we derive a gap-dependent regret bound of $$\widetilde O\left( \inf_{p\in [0,1)} \left( \frac{1}{\Delta_{\min}^{p}} + pK\Delta_{\min}^{p} \right)\log K \cdot \mathrm{poly}(S,A,H) \right),$$ where $\Delta_{\min}^{p}$ is the global $p$-trimmed positive-gap floor over suboptimal $(h,s,a)$ triples. This bound can substantially improve upon the minimax rate when the relevant suboptimality gaps are large.
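The step from the stochastic regret bound to the sample complexity bound can be read as a standard online-to-batch conversion. In the sketch below, $C$ is an assumed constant that absorbs logarithmic factors, and the output policy is assumed to be drawn uniformly from $\pi^1,\dots,\pi^K$. The paper's exact argument may differ.

```latex
\[
\frac{1}{K}\sum_{k=1}^{K}\bigl[V^{*,M^k} - V^{\pi^k,M^k}\bigr]
\;\le\; \frac{C\sqrt{SAH^3K}}{K}
\;=\; C\sqrt{\frac{SAH^3}{K}}
\;\le\; \epsilon
\quad\Longleftrightarrow\quad
K \;\ge\; \frac{C^2\,SAH^3}{\epsilon^2}.
\]
```

So about $SAH^3/\epsilon^2$ episodes (up to logarithmic factors) make the average suboptimality at most $\epsilon$, matching the stated $\widetilde{O}(SAH^3/\epsilon^2)$ rate.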

Distributed Zeroth-Order Policy Gradient for Networked Multi-agent Reinforcement Learning from Human Feedback

Authors: Pengcheng Dai, He Wang, Dongming Wang, Jian Qin, Wenwu Yu
Subjects: Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2605.15697
Pdf link: https://arxiv.org/pdf/2605.15697
Abstract We study a networked multi-agent reinforcement learning (NMARL) problem with human feedback in an infinite-horizon setting, where agents interact over an underlying network with localized state dependencies and aim to collaboratively maximize the average discounted return. Existing approaches with preference feedback are primarily developed for single-agent settings and rely on centralized training, which limits their scalability and applicability to large-scale networked multi-agent systems. To address this, we introduce a novel human feedback mechanism based on spatiotemporally truncated trajectories, defined as $H$-horizon trajectory pairs aggregated over each agent's $\kappa$-hop neighborhood. Building on this, we develop a distributed zeroth-order policy gradient algorithm, where each agent estimates its local policy gradient using human preference feedback generated from both the current joint policy and a perturbed joint policy drawn from a zero-mean Gaussian distribution. Specifically, the algorithm is fully distributed, as the feedback received by each agent depends solely on the state-action information within its $\kappa$-hop neighborhood and does not require explicit reward signals or centralized control. We further rigorously establish that the proposed algorithm converges to an $\epsilon$-stationary point with polynomial sample complexity. Finally, simulation results in a stochastic GridWorld environment and a predator-prey environment demonstrate the effectiveness and scalability of the proposed algorithm in achieving collaborative optimization based solely on human preference feedback.
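The core idea of a zeroth-order gradient estimate with Gaussian perturbations can be sketched on a toy objective. This is the generic single-agent Gaussian-smoothing estimator, which compares a perturbed parameter vector with the current one. It uses a numeric return in place of human preference feedback, and every name and constant is an assumption rather than the paper's distributed algorithm.

```python
import random


def zeroth_order_gradient(f, theta, mu=0.05, num_samples=20000, seed=0):
    """Estimate grad f(theta) from function values only (no analytic gradient)."""
    rng = random.Random(seed)
    dim = len(theta)
    base = f(theta)  # value under the current parameters
    grad = [0.0] * dim
    for _ in range(num_samples):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # zero-mean Gaussian direction
        perturbed = [t + mu * ui for t, ui in zip(theta, u)]
        scale = (f(perturbed) - base) / mu  # finite difference along u
        for i in range(dim):
            grad[i] += scale * u[i] / num_samples
    return grad


def toy_return(theta):
    # Hypothetical stand-in for a policy's return; the true gradient is -2 * theta.
    return -sum(t * t for t in theta)


estimate = zeroth_order_gradient(toy_return, [1.0, -2.0])
```

For this toy objective the true gradient at `[1.0, -2.0]` is `[-2.0, 4.0]`. The estimate approaches it as `num_samples` grows, which is why sample complexity is the natural measure of cost for such methods.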

Scale: Deep Reinforcement Learning for Container Scheduling in Serverless Edge Computing

Authors: Chen Chen, Zihan Jia, Andrea Sabbioni, Reza Farahani, Lei Jiao
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2605.15704
Pdf link: https://arxiv.org/pdf/2605.15704
Abstract Serverless computing has emerged as a promising computing paradigm for edge computing. However, adopting the event-driven model in highly dynamic, heterogeneous, and distributed edge systems poses significant challenges in request placement and resource management. Efficiently allocating requests to containers is therefore critical to reduce resource over-provisioning and unnecessary data movement. This paper proposes Scale, a Service Level Objective (SLO)-aware container scheduling and resource allocation framework designed for serverless edge computing. Scale employs a policy-based deep reinforcement learning algorithm to balance system stability and performance under dynamic workloads. The design jointly incorporates SLO constraints, end-to-end latency, and data locality into the scheduling decision process. Extensive simulations using large-scale real-world datasets from Huawei Cloud demonstrate that Scale achieves solutions within a factor of 1.11 to 1.15 of a state-of-the-art Integer Linear Programming solver, while reducing decision-making time by up to 99%.

Learning Dynamic Pick-and-Place for a Legged Manipulator

Authors: Moonkyu Jung, Jiseong Lee, Zhengmao He, Donghoon Youm, Juhyeok Mun, HyeongJun Kim, Hyunsik Oh, Donghyuk Choi, Jungwoo Hur, Jie Song, Jemin Hwangbo
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.15713
Pdf link: https://arxiv.org/pdf/2605.15713
Abstract Legged manipulators extend robotic capabilities beyond static manipulation by integrating agile locomotion with versatile arm control. However, achieving precise manipulation while maintaining coordinated locomotion remains a major challenge. This work presents a hierarchical reinforcement learning framework for dynamic pick-and-place tasks using a quadruped equipped with a 6-DOF robotic arm. The framework incorporates an explicit mass estimation module enabling adaptive whole-body control for objects with varying weights. In simulation, the system achieves an 86.05% success rate with payloads up to 2.3 kg. The approach is further validated through real-world experiments across six representative scenarios with controlled variations in object physical properties (size and mass) and task heights. Specifically, within a wide vertical workspace ranging from ground level to 1.1 m high tabletops, the system demonstrates an average success rate of 73.3% for payloads up to 1.3 kg, with an average execution time of 4.06 s. Unlike prior works that handle lightweight objects and execute pick-and-place with slow, piecewise motions, the proposed framework exploits concurrent locomotion and manipulation for dynamic, continuous execution. These results demonstrate the potential of quadrupedal mobile manipulators for adaptive, whole-body pick-and-place with heavier payloads and extended workspaces.

Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR

Authors: Chanuk Lee, Sangwoo Park, Minki Kang, Sung Ju Hwang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.15726
Pdf link: https://arxiv.org/pdf/2605.15726
Abstract Reinforcement learning with verifiable rewards (RLVR) has emerged as a scalable paradigm for improving the reasoning capabilities of large language models. However, its effectiveness is fundamentally limited by exploration: the policy can only improve on trajectories it has already sampled. While increasing the number of rollouts alleviates this issue, such brute-force scaling is computationally expensive, and existing approaches that modify the optimization objective provide limited control over what is explored. In this work, we propose NudgeRL, a framework for structured and diversity-driven exploration in RLVR. Our approach introduces Strategy Nudging, which conditions each rollout on lightweight, strategy-level contexts to induce diverse reasoning trajectories without relying on expensive oracle supervision. To effectively learn from such structured exploration, we further propose a unified objective, which decomposes the reward signal into inter- and intra-context components and incorporates a distillation objective to transfer discovered behaviors back to the base policy. Empirically, NudgeRL outperforms standard GRPO run with up to 8 times larger rollout budgets, and on average outperforms an oracle-guided RL baseline across five challenging math benchmarks. These results demonstrate that structured, context-driven exploration can serve as an efficient and scalable alternative to both brute-force rollout scaling and feasibility-oriented methods based on privileged information. Our code is available at this https URL.
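One plausible reading of "decomposes the reward signal into inter- and intra-context components" is a mean-based split: how a strategy context compares with the overall average, and how a rollout compares with its own context. The sketch below illustrates that reading only. The abstract does not give NudgeRL's actual objective, so the formula, the function name, and the context labels are all assumptions.

```python
from collections import defaultdict


def decompose_rewards(rollouts):
    """rollouts: list of (context_id, reward). Returns a list of (inter, intra) pairs."""
    global_mean = sum(r for _, r in rollouts) / len(rollouts)
    by_context = defaultdict(list)
    for ctx, r in rollouts:
        by_context[ctx].append(r)
    context_mean = {c: sum(v) / len(v) for c, v in by_context.items()}
    # inter: context mean vs. global mean; intra: rollout vs. its own context mean
    return [(context_mean[c] - global_mean, r - context_mean[c]) for c, r in rollouts]


# Hypothetical strategy contexts with binary verifiable rewards.
rollouts = [("algebraic", 1.0), ("algebraic", 0.0), ("geometric", 1.0), ("geometric", 1.0)]
parts = decompose_rewards(rollouts)
```

By construction the two components sum to the usual group-centered reward (reward minus global mean), so this split only redistributes the signal and does not change its total.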

ALSO: Adversarial Online Strategy Optimization for Social Agents

Authors: Xiang Li, Liping Yi, Mingze Kong, Min Zhang, Zhongxiang Dai, QingHua Hu
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2605.15768
Pdf link: https://arxiv.org/pdf/2605.15768
Abstract Social simulation provides a compelling testbed for studying social intelligence, where agents interact through multi-turn dialogues under evolving contexts and strategically adapting opponents. Such environments are inherently non-stationary, requiring agents to dynamically adjust their strategies over time. However, most Large Language Model (LLM) based social agents rely on static personas, while existing approaches for enhancing social intelligence, such as offline reinforcement learning or external planners, are ill-suited to these settings, typically assuming stationarity and incurring substantial training overhead. To bridge this gap, we propose ALSO (Adversarial onLine Strategy Optimization), the first framework for online strategy optimization in multi-agent social simulation. ALSO advances social adaptation through two key contributions. (1) ALSO formulates multi-turn interaction as an adversarial bandit problem, where combinations of static personas and dynamic strategy instructions are treated as arms, providing a principled solution to non-stationarity without relying on environmental stability assumptions. (2) To generalize sparse feedback in multi-turn dialogues, ALSO introduces a lightweight neural surrogate that predicts rewards from interaction histories, enabling sample-efficient exploration and continuous online adaptation. Experiments on the Sotopia benchmark demonstrate that ALSO consistently outperforms static baselines and existing optimization methods in dynamic environments, validating the effectiveness of adversarial online strategy optimization for building robust social agents.
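To illustrate the adversarial-bandit framing, the sketch below runs EXP3, a classic adversarial bandit algorithm, over arms that pair a persona with a strategy instruction. The abstract does not say which bandit algorithm ALSO uses and does not describe its neural surrogate, so EXP3, the arm names, and the toy reward are all assumptions.

```python
import math
import random


def exp3(arms, reward_fn, rounds=500, gamma=0.1, seed=0):
    """Run EXP3 and return the final normalized weights over arms."""
    rng = random.Random(seed)
    k = len(arms)
    weights = [1.0] * k
    for _ in range(rounds):
        total = sum(weights)
        # Mix the weight-proportional distribution with uniform exploration.
        probs = [(1 - gamma) * w / total + gamma / k for w in weights]
        i = rng.choices(range(k), weights=probs)[0]
        reward = reward_fn(arms[i])  # must lie in [0, 1]
        estimate = reward / probs[i]  # importance-weighted reward estimate
        weights[i] *= math.exp(gamma * estimate / k)
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical arms: (static persona, dynamic strategy instruction).
arms = [("friendly", "concede"), ("friendly", "probe"), ("firm", "probe")]
final = exp3(arms, lambda arm: 1.0 if arm == ("friendly", "probe") else 0.0)
```

EXP3 makes no stationarity assumption about rewards, which is the property the abstract highlights. The fixed toy reward here is only for readability.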

Lamarckian Inheritance in Dynamic Environments: How Key Variables Affect Evolutionary Dynamics

Authors: K. Ege de Bruin, Kyrre Glette, Kai Olav Ellefsen
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.15769
Pdf link: https://arxiv.org/pdf/2605.15769
Abstract The co-optimization of a robot's body and brain presents a coupled challenge: the morphology constrains which control strategies are effective, while the control determines how well the morphology performs. To address this, we combine morphology optimization as evolution with controller optimization as lifetime learning, utilizing Lamarckian inheritance to transfer learned controller parameters from parent to offspring. For dynamic environments, the existing literature presents conflicting evidence: while traditional evolutionary theory often suggests Lamarckian inheritance lacks benefit, recent studies in evolutionary robotics indicate it can improve performance. We hypothesize that this is because previous works have not included all relevant variables of dynamic environments. In this work, we show that the benefit of Lamarckian inheritance depends on two variables: how strongly the environmental changes conflict with robot control, and how predictable those changes are for the robotic agent. Using virtual soft robots and two different learning approaches, Bayesian optimization and reinforcement learning, we show that Lamarckian inheritance only underperforms Darwinian inheritance when the changes are both conflicting and unpredictable. We find that adding a sensor to detect environmental changes restores the benefits of Lamarckian inheritance in conflicting environments, by allowing robotic agents to predict the need for a different behavior, thereby generalizing their control.

Embedding-perturbed Exploration Preference Optimization for Flow Models

Authors: Sujie Hu, Chubin Chen, Jiashu Zhu, Jiahong Wu, Xiangxiang Chu, Xiu Li
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.15803
Pdf link: https://arxiv.org/pdf/2605.15803
Abstract Recent advancements have established Reinforcement Learning (RL) as a pivotal paradigm for aligning generative models with human intent. However, group-based optimization frameworks (e.g., GRPO) face a critical limitation: the rapid decay of intra-group variance. As the distinctiveness among samples within a group diminishes, the variance approaches zero. This eliminates the very learning signal required for optimization, rendering the process unstable and forcing the policy into premature stagnation or reward hacking. Existing strategies, such as varying the initial noise or increasing group sizes, often fail to address this fundamental issue, resulting in training instability or diminishing returns. To overcome these challenges, we propose Embedding-perturbed Exploration Preference Optimization ($E^2$PO), a novel framework that sustains optimization through embedding-level perturbation. Our method introduces structured, embedding-level perturbations within sample groups, guaranteeing a robust variance that preserves the discriminative signal throughout the training process. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art baselines, achieving a more faithful alignment with human preference.
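The variance-collapse problem is easy to see in the group-normalized advantage used by GRPO-style methods: each reward is centered and scaled by its group's statistics. The sketch below shows that computation on toy rewards. The helper name, the `eps` value, and the reward values are illustrative assumptions, and the embedding perturbation itself is not modeled.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: (reward - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


diverse = group_advantages([0.2, 0.9, 0.5, 0.4])    # samples differ: informative signal
collapsed = group_advantages([0.7, 0.7, 0.7, 0.7])  # samples identical: no signal
```

When every sample in a group earns the same reward, every advantage is zero and the policy gradient for that group vanishes. This is the loss of learning signal the abstract describes.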

Access Timing as Scaffolding: A Reinforcement Learning Approach to GenAI in Education

Authors: Janne Rotter, Pau Benazet i Montobbio, Davinia Hernández-Leo
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2605.15850
Pdf link: https://arxiv.org/pdf/2605.15850
Abstract In recent years, generative AI (GenAI) in educational settings has become ubiquitous in students' daily lives, despite its potential to induce over-reliance, metacognitive disengagement, and diminished learning when used without restriction. While most prior research has thus focused on how to pedagogically scaffold its usage, the question of when to allow off-the-shelf GenAI remains understudied and lacks pedagogically grounded empirical investigation. We treat access timing itself as a form of implicit scaffolding and operationalize it through a reinforcement learning (RL) agent that decides when students should access GenAI, with a reward function grounded in metacognitive theory, cognitive load theory, and productive failure. In a mixed-methods controlled lab study with N=105 participants, we compared the agent's effect on learning gains and metacognitive engagement to unrestricted and fully restricted use. Results show that strategically timed GenAI access under the reinforcement learning condition improved objective post-test performance and metacognitive accuracy compared with unrestricted access, while reducing task errors and time on task relative to complete withholding, all without the need for explicit metacognitive prompts or structured scaffolding. However, no between-condition differences emerged on self-reported metacognitive awareness. Overall, the timing of GenAI access is therefore a tractable, theoretically grounded, and scalable pedagogical paradigm that improves on both completely unrestricted and fully withheld access, is compatible with off-the-shelf tools, and potentially has a low adoption barrier. This opens up a new research area that explores how access timing can be facilitated by educators and implemented in human-AI learning system design.

A Multi-Layer Cloud-IDS Pipeline with LLM and Adaptive Q-Learning Calibration

Authors: Syed Waqas Ali, Ibrar Ali Shah, Farzana Zahid, Daniyal Munir, Hans D. Schotten
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.15889
Pdf link: https://arxiv.org/pdf/2605.15889
Abstract Security in cloud computing has become a major concern due to several factors such as layered cloud architectures, dynamic environments, and exposure to unseen or zero-day attacks. Moreover, intrusion detection systems (IDS) typically operate at specific layers and rely heavily on machine learning models, which often perform well in experimental settings but fail to sustain performance in real cloud deployments. In this work, we implement a confidence-aware multilevel intrusion detection system using reinforcement learning tailored for cloud environments. The system secures three distinct layers: network, host, and hypervisor. Machine learning models at each layer detect known attack patterns, while prediction confidence distinguishes reliable decisions from uncertain outcomes. Within the multi-gate flow, low-confidence events pass through a learned-threshold confidence gate (Gate-1), followed by a Chroma memory-matching gate (Gate-2), with unresolved events escalated to a large language model (LLM) for semantic analysis and explanation. Final attack promotion at Gate-3 uses calibrated LLM confidence or weighted-fusion fallback, while uncertain events are retained in a review bucket to avoid forced classification. Generated explanations and confirmed knowledge are stored in ChromaDB to support future analysis and retraining. The approach is first evaluated using static thresholds, establishing a baseline for comparison. Results show that the proposed system learns adaptive thresholds and reduces LLM escalation by 58.78%, lowering cost while maintaining strong performance (88.68% accuracy, 85.29% precision, 84.72% recall, 85.00% F1). The network and hypervisor layers achieve 98.02% and 97.08% accuracy, demonstrating a balanced and efficient detection system.
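The multi-gate flow can be pictured as an early-exit router in which the LLM is consulted only when cheaper gates cannot decide. The sketch below is a simplified illustration. The threshold values, the return labels, and the `ask_llm` callable are hypothetical, and the weighted-fusion fallback and the Q-learning threshold updates are omitted.

```python
def route_event(ml_confidence, memory_similarity, ask_llm,
                gate1=0.80, gate2=0.90, gate3=0.75):
    """Route one event through three confidence gates (thresholds are made up)."""
    if ml_confidence >= gate1:       # Gate-1: layer model is confident enough
        return "accept_ml_decision"
    if memory_similarity >= gate2:   # Gate-2: close match in stored knowledge
        return "resolved_from_memory"
    if ask_llm() >= gate3:           # Gate-3: escalate and check LLM confidence
        return "promote_attack"
    return "review_bucket"           # still uncertain: no forced classification


llm_calls = []


def fake_llm():
    # Stand-in for an LLM call; records each escalation and returns a confidence.
    llm_calls.append(1)
    return 0.6


decisions = [
    route_event(0.95, 0.10, fake_llm),
    route_event(0.50, 0.97, fake_llm),
    route_event(0.50, 0.20, fake_llm),
]
```

Only the third event reaches the LLM. Raising or lowering `gate1` changes how many events escalate, which is the quantity the learned thresholds are reported to reduce.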

Dynamic Plasma Shape Control with Arbitrary Sensor Subsets

Authors: D. Sorokin, M. Stokolesov, A. Granovskiy, I. Prokofyev, E. Adishchev, M. Nurgaliev, E. Khayrutdinov, G. Subbotin, R. Clark, D. Orlov
Subjects: Robotics (cs.RO); Systems and Control (eess.SY); Plasma Physics (physics.plasm-ph)
Arxiv link: https://arxiv.org/abs/2605.15935
Pdf link: https://arxiv.org/pdf/2605.15935
Abstract Plasma shape control in tokamaks requires a real-time controller that tracks dynamically changing shape targets while tolerating diagnostic failures. Classical approaches decompose the problem into equilibrium reconstruction followed by a linear controller, and assume a fixed, fully operational sensor set. We present a reinforcement learning agent that addresses both limitations simultaneously. The agent is trained in NSFsim, a high-fidelity tokamak simulator configured for DIII-D, on a curated dataset of 120 experimental plasma shapes. The shape targets are resampled as random step changes every 0.25 s, exposing the agent to diverse transitions across the full shape envelope. At test time the agent zero-shot tracks dynamic shape sequences; on a held-out static configuration in simulation it achieves a mean shape error of 2.01 cm, and dynamic trajectory following is demonstrated qualitatively in simulation and on the physical device. Diagnostic dropout randomly masks 30% of magnetic sensors per episode, yielding a single policy robust to arbitrary sensor subsets without backup controllers or mode-switching logic. An asymmetric actor-critic architecture with privileged equilibrium information improves value estimation under partial observability; an auxiliary shape reconstruction head on the actor enables end-to-end shape reconstruction from raw diagnostics and serves as an interpretability tool for policy analysis. The policy transfers to experimental DIII-D shots, where it directly commands the coil actuators on two dynamic shape maneuvers, and to the independent GSevolve simulator.
中文摘要 托卡马克中的等离子体形状控制需要一个实时控制器，既能跟踪动态变化的形状目标，又能容忍诊断设备故障。经典方法将问题分解为平衡重建和随后的线性控制器，并假设传感器集合固定且完全正常工作。我们提出了一个强化学习智能体，同时解决这两个局限。该智能体在 NSFsim 中训练，NSFsim 是一款针对 DIII-D 配置的高保真托卡马克模拟器，训练数据为一个由 120 个实验等离子体形状组成的精选数据集。形状目标每 0.25 秒以随机阶跃变化的方式重新采样，使智能体接触到覆盖整个形状包络的多样化转换。测试时，智能体以零样本方式跟踪动态形状序列；在仿真中一个留出的静态配置上，其平均形状误差为 2.01 厘米，动态轨迹跟踪则在仿真和物理装置上得到定性展示。诊断丢弃（diagnostic dropout）在每个回合随机屏蔽 30% 的磁传感器，从而得到一个对任意传感器子集都具有鲁棒性的单一策略，无需备用控制器或模式切换逻辑。带有特权平衡信息的非对称 actor-critic 架构改善了部分可观测条件下的价值估计；actor 上的辅助形状重建头支持从原始诊断数据进行端到端形状重建，并可作为策略分析的可解释性工具。该策略可迁移到 DIII-D 的实验放电中，在两次动态形状操作中直接控制线圈执行器，并可迁移到独立的 GSevolve 模拟器。
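
The diagnostic dropout mentioned in the abstract, which masks 30% of the magnetic sensors per episode, might look roughly like the sketch below. The function names are hypothetical, and feeding the mask to the policy alongside the zeroed readings is an assumption of this example, not something the abstract states.

```python
import random


def sample_sensor_mask(num_sensors, drop_fraction=0.3, rng=None):
    """Draw one per-episode mask: 1 keeps a sensor, 0 drops it."""
    rng = rng or random.Random()
    num_dropped = round(num_sensors * drop_fraction)
    dropped = set(rng.sample(range(num_sensors), num_dropped))
    return [0 if i in dropped else 1 for i in range(num_sensors)]


def masked_observation(readings, mask):
    """Zero the dropped readings and append the mask itself.

    Appending the mask is an assumption of this sketch; it lets a policy
    tell a dropped sensor apart from a sensor that genuinely reads zero.
    """
    return [r * m for r, m in zip(readings, mask)] + list(mask)


mask = sample_sensor_mask(10, rng=random.Random(0))
obs = masked_observation([1.0] * 10, mask)
print(sum(mask), len(obs))
```

Resampling the mask at the start of every episode exposes a single policy to many different sensor subsets during training.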

From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding

从失败到反馈：组修订攻克目标级定位中的困难样本

Authors: Yuyuan Liu, Yiping Ji, Anjie Le, Jiayuan Zhu, Jiazhen Pan, Can Peng, Jiajun Deng, Fengbei Liu, Junde Wu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.15951
Pdf link: https://arxiv.org/pdf/2605.15951
Abstract Finetuning Large Vision-Language Models with reinforcement learning has emerged as a promising approach to enhance their capability in object-level grounding. However, existing methods, mainly based on GRPO, assign rewards at the response level. Such sparse reward, often criterion-induced, leads to minimal learning signals when all candidate responses fail in challenging scenarios. In this work, we propose a group-revision optimisation paradigm that enhances learning on hard cases. It begins with a sampled initial response and generates a set of revised candidates to explore improved grounding outcomes. Inspired by reward shaping, we introduce a consolidation process that quantifies each candidate's improvement over the initial attempt and converts it into informative shaping signals. These signals are used to both refine the reward and modulate the advantage, amplifying the influence of high-quality revisions. Our method achieves consistent gains across referring and reasoning segmentation, REC, and counting benchmarks compared with prior GRPO-based models. Our code is available at this https URL.
中文摘要 利用强化学习微调大型视觉语言模型，已成为提升其目标级定位（grounding）能力的一种有前景的方法。然而，现有方法主要基于 GRPO，在响应层面分配奖励。这种稀疏奖励通常由评判标准决定，在困难场景中所有候选响应都失败时，只能产生极少的学习信号。本研究提出一种组修订（group revision）优化范式，以增强对困难样本的学习。它从一个采样得到的初始响应出发，生成一组修订后的候选响应，以探索更好的定位结果。受奖励塑形（reward shaping）的启发，我们引入了一个整合过程，量化每个候选响应相对于初始尝试的改进，并将其转化为信息丰富的塑形信号。这些信号既用于细化奖励，也用于调节优势，从而放大高质量修订的影响。与以往基于 GRPO 的模型相比，我们的方法在指代分割与推理分割、REC 和计数基准上取得了一致的提升。我们的代码可在此 https URL 获取。
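
To see why measuring improvement over the initial attempt can help on hard cases, consider a toy example in which every revised candidate misses the success criterion, so the task reward is zero for all of them. The shaping rule below (a bonus proportional to the score gain over the initial response, with a hypothetical weight `beta`) is an assumed stand-in for illustration, not the paper's consolidation process.

```python
def shaped_rewards(initial_score, revised_scores, task_rewards, beta=0.5):
    """Add a bonus proportional to each revision's gain over the initial attempt.

    `beta` and the additive form are assumptions of this sketch.
    """
    return [
        reward + beta * (score - initial_score)
        for reward, score in zip(task_rewards, revised_scores)
    ]


# Hypothetical quality scores (for example IoU) of one initial response and
# three revisions; all revisions fail the criterion, so task reward is 0.
initial = 0.20
revised = [0.10, 0.30, 0.45]
task = [0.0, 0.0, 0.0]

print(shaped_rewards(initial, revised, task))
```

With the raw rewards alone, all three candidates are indistinguishable. After shaping, the candidate that improved most ranks highest, so a group-based method has something to learn from.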

Imperfect World Models are Exploitable

不完美世界模型是可被利用的

Authors: Logan Mondal Bhamidipaty (University of Edinburgh), Esmeralda S. Whitammer (University of Edinburgh), David Abel (University of Edinburgh), Mykel J. Kochenderfer (Stanford University), Subramanian Ramamoorthy (University of Edinburgh)
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.15960
Pdf link: https://arxiv.org/pdf/2605.15960
Abstract We propose a novel definition of model exploitation in reinforcement learning. Informally, a world model is exploitable if it implies that one policy should be strictly preferred over another while the environment's true transition model implies the reverse. We analogize our definition with a prior characterization of reward hacking but show that the associated proof of inevitability does not transfer to exploitation. To overcome this obstruction, we develop a general theory of reward hacking and model exploitation that proves that exploitation is essentially unavoidable on large policy sets and yields the corresponding claim for hacking as a special case. Unfortunately, we also find that the conditions that guarantee unhackability in finite policy sets have no counterpart that precludes exploitation. Consequently, we introduce a relaxed notion of exploitation and derive a safe horizon within which it can be avoided. Taken together, our results establish a formal bridge between reward hacking and model exploitation and elucidate the limits of safe planning in world models.
中文摘要 我们提出了强化学习中模型利用（model exploitation）的一个新定义。非正式地说，如果世界模型表明一个策略应严格优于另一个策略，而环境的真实转移模型却表明相反的结论，那么该世界模型就是可被利用的。我们将这一定义与先前对奖励黑客（reward hacking）的刻画进行类比，但表明其相关的不可避免性证明并不能迁移到模型利用上。为克服这一障碍，我们建立了一个关于奖励黑客和模型利用的一般理论，证明在大的策略集上模型利用本质上不可避免，并将奖励黑客的相应结论作为特例导出。遗憾的是，我们还发现，那些在有限策略集中保证不可被黑客攻击的条件，并不存在能够排除模型利用的对应条件。因此，我们引入了一个放宽的模型利用概念，并推导出一个可以避免它的安全视界（safe horizon）。综合来看，我们的结果在奖励黑客与模型利用之间建立了形式化的桥梁，并阐明了在世界模型中进行安全规划的局限。
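
The informal definition in the abstract can be checked directly on a finite policy set. The sketch below assumes each policy's expected return is already known under both the world model and the true environment; the dictionaries and policy names are hypothetical, and the function covers only the informal statement, not the paper's formal treatment.

```python
from itertools import permutations


def find_exploitable_pair(model_values, true_values):
    """Return (a, b) where the model strictly prefers a but the truth strictly prefers b.

    Both arguments map a policy name to its expected return. Returns None
    when the model never reverses a strict preference of the true environment.
    """
    for a, b in permutations(model_values, 2):
        if model_values[a] > model_values[b] and true_values[a] < true_values[b]:
            return (a, b)
    return None


model_returns = {"safe": 1.0, "risky": 2.0}   # returns predicted by the world model
true_returns = {"safe": 1.0, "risky": 0.5}    # returns in the real environment

print(find_exploitable_pair(model_returns, true_returns))
```

Here a planner that trusts the model would choose `risky`, although `safe` is strictly better in the real environment.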

PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control

PAGER：弥合点精确几何 GUI 控制中的语义-执行差距

Authors: Jingxuan Wei, Xi Bai, Shan Liu, Caijun Jia, Zheng Sun, Xinglong Xu, Siyuan Li, Linzhuang Sun, Bihui Yu, Conghui He, Cheng Tan
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.15963
Pdf link: https://arxiv.org/pdf/2605.15963
Abstract Large vision-language models have significantly advanced GUI agents, enabling executable interaction across web, mobile, and desktop interfaces. Yet these gains largely rely on a forgiving region-tolerant paradigm, where many nearby pixels inside the same component remain valid. Precise geometric construction breaks this assumption: actions must land on points in continuous canvas space rather than tolerant regions. Because geometric primitives carry ontological dependencies, a local coordinate error can induce cascading topological failures that distort downstream objects and invalidate the final construction. We identify this regime as precision-sensitive GUI tasks, requiring point-level accuracy, geometry-aware verification, and robustness to dependency-driven error propagation. To benchmark it, we introduce PAGE Bench, with 4,906 problems and over 224K process-supervised, pixel-level GUI actions. We further propose PAGER, a topology-aware agent that decomposes construction into dependency-structured planning and pixel-level execution. Pixel-grounded supervised tuning establishes executable action grammar, while precision-aligned reinforcement learning mitigates rollout-induced exposure bias through state-conditioned geometric feedback. Experiments reveal a pronounced Semantic-Execution Gap: general multimodal models can exceed 88% action type accuracy yet remain below 6% task success. PAGER closes this gap, delivering 4.1x higher task success than the strongest evaluated general baseline and raising step success rate from below 9% for GUI-specialized agents to over 62%, establishing a new state of the art for point-precise GUI control.
中文摘要 大型视觉语言模型显著推进了 GUI 智能体的发展，使其能够在网页、移动和桌面界面上进行可执行的交互。然而，这些进步在很大程度上依赖于宽松的区域容忍范式，即同一组件内的许多相邻像素都是有效的。精确的几何作图打破了这一假设：动作必须落在连续画布空间中的点上，而非容忍区域内。由于几何图元之间存在本体依赖，局部坐标误差可能引发级联的拓扑失效，扭曲下游对象并使最终作图失效。我们将这一情形称为精度敏感的 GUI 任务，它要求点级精度、几何感知的验证，以及对依赖驱动的误差传播的鲁棒性。为了对其进行基准测试，我们引入了 PAGE Bench，包含 4,906 个问题和超过 224K 个带过程监督的像素级 GUI 动作。我们进一步提出了 PAGER，一种拓扑感知智能体，将作图分解为依赖结构化的规划和像素级执行。基于像素的监督微调建立可执行的动作语法，而精度对齐的强化学习则通过状态条件的几何反馈缓解由 rollout 引起的暴露偏差。实验揭示了明显的语义-执行差距（Semantic-Execution Gap）：通用多模态模型的动作类型准确率可超过 88%，但任务成功率仍低于 6%。PAGER 弥合了这一差距，其任务成功率是所评估的最强通用基线的 4.1 倍，并将步骤成功率从 GUI 专用智能体的 9% 以下提升至 62% 以上，确立了点精确 GUI 控制的新的最先进水平。

WorldVLN: Autoregressive World Action Model for Aerial Vision-Language Navigation

WorldVLN：用于空中视觉语言导航的自回归世界动作模型

Authors: Baining Zhao, Jiacheng Xu, Weicheng Feng, Xin Zhang, Zhaolu Wang, Haoyang Wang, Shilong Ji, Ziyou Wang, Jianjie Fang, Zhiheng Zheng, Weichen Zhang, Yu Shang, Wei Wu, Chen Gao, Xinlei Chen, Yong Li
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.15964
Pdf link: https://arxiv.org/pdf/2605.15964
Abstract Aerial vision-language navigation (VLN) requires agents to follow natural-language instructions through closed-loop perception and action in 3D environments. We argue that aerial VLN can be formulated as a prediction-driven world-action problem: the agent should anticipate latent world evolution and act according to the predicted consequences. To this end, we propose WorldVLN, the first autoregressive world action model for aerial VLN. Unlike full-sequence video-generation world models that generate an entire visual clip, WorldVLN adapts a latent autoregressive video backbone to predict short-horizon world-state transitions and directly decodes them into executable waypoint actions. After each action segment is executed, newly received observations are encoded back into the autoregressive context, enabling closed-loop world-action prediction. We further introduce a two-stage training framework that first grounds the video prior in instruction-conditioned navigation dynamics and then develops Action-aware GRPO, the first reinforcement learning method tailored to autoregressive WAMs, to optimize waypoint decisions through their downstream rollout consequences. On public outdoor and indoor benchmarks, WorldVLN consistently outperforms existing Vision-Language-Action baselines with 12%+ success-rate gains and larger advantages on challenging cases. It further transfers zero-shot to real drone deployment, suggesting that the proposed WorldVLN offers a promising route for spatial action tasks. Demos and code are available at this https URL.
中文摘要 空中视觉语言导航（VLN）要求智能体在三维环境中通过闭环的感知和行动来遵循自然语言指令。我们认为空中 VLN 可以被表述为一个预测驱动的世界-动作问题：智能体应预见潜在的世界演化，并根据预测的后果采取行动。为此，我们提出了 WorldVLN，这是首个用于空中 VLN 的自回归世界动作模型。与生成完整视觉片段的全序列视频生成世界模型不同，WorldVLN 改造了一个潜在自回归视频骨干网络，用以预测短时域的世界状态转移，并直接将其解码为可执行的航点动作。每个动作段执行后，新接收到的观测被重新编码进自回归上下文，从而实现闭环的世界-动作预测。我们还提出了一个两阶段训练框架：首先使视频先验扎根于指令条件下的导航动态，然后提出 Action-aware GRPO，这是首个为自回归 WAM 量身定制的强化学习方法，通过航点决策在下游 rollout 中的后果来优化这些决策。在公开的室外和室内基准上，WorldVLN 持续优于现有的视觉-语言-动作（Vision-Language-Action）基线，成功率提升超过 12%，在困难样例上优势更大。它还能以零样本方式迁移到真实无人机部署，表明所提出的 WorldVLN 为空间动作任务提供了一条有前景的途径。演示和代码可在此 https URL 获取。

OHP-RL: Online Human Preference as Guidance in Reinforcement Learning for Robot Manipulation

OHP-RL：在线人类偏好作为机器人操作强化学习指导

Authors: Yunyang Mo (1), Jian Li (1), Qiwei Wu (1), Yihang Kang (1), Renjing Xu (1) ((1) The Hong Kong University of Science and Technology (Guangzhou))
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.15971
Pdf link: https://arxiv.org/pdf/2605.15971
Abstract While reinforcement learning (RL) enables robots to acquire skills autonomously, its real-world deployment is severely limited by inefficient and unsafe exploration. Human-in-the-loop interventions offer a practical solution, yet existing methods typically exploit these interventions as auxiliary training signals, without fully capturing the richer information they provide about when and how autonomy should be guided. Human interventions often encode relative preferences over behavior under safety and task constraints, rather than prescribing exact actions to imitate. Motivated by this perspective, we propose Online Human Preference as Guidance in Reinforcement Learning (OHP-RL), a framework that leverages human interventions as preference information to guide policy learning. OHP-RL introduces a state-dependent preference gate that adaptively regulates when and to what extent human interventions should shape policy learning. This design enables the agent to benefit from intermittent and imperfect human feedback while preserving autonomous exploration and stable policy optimization. We evaluate OHP-RL on three challenging real-world contact-rich manipulation tasks on a Franka robot. Across all tasks, OHP-RL consistently achieves strong success rates, faster convergence, and substantially lower human intervention effort than prior approaches. Moreover, the learned policies exhibit more stable and human-aligned behavior throughout training.
中文摘要 虽然强化学习（RL）使机器人能够自主习得技能，但低效且不安全的探索严重限制了其在现实世界中的部署。人在回路的干预提供了一种切实可行的解决方案，但现有方法通常只把这些干预当作辅助训练信号，未能充分捕捉其中关于何时以及如何引导自主行为的更丰富信息。人类干预通常编码的是在安全和任务约束下对行为的相对偏好，而不是规定需要模仿的确切动作。基于这一观点，我们提出了 Online Human Preference as Guidance in Reinforcement Learning（OHP-RL），这是一个将人类干预作为偏好信息来指导策略学习的框架。OHP-RL 引入了一个状态依赖的偏好门，自适应地调节人类干预应在何时以及在多大程度上影响策略学习。这种设计使智能体能够受益于间歇性且不完美的人类反馈，同时保持自主探索和稳定的策略优化。我们在一台 Franka 机器人上，针对三项具有挑战性的真实世界接触丰富型操作任务评估了 OHP-RL。在所有任务中，OHP-RL 都取得了较高的成功率、更快的收敛速度，以及远低于以往方法的人类干预工作量。此外，所学策略在整个训练过程中表现出更稳定、更符合人类期望的行为。

Reference-Free Reinforcement Learning Fine-Tuning for MT: A Seq2Seq Perspective

面向机器翻译的无参考强化学习微调：Seq2Seq 视角

Authors: Ernesto Garcia-Estrada, Carlos Escolano, José A. R. Fonallosa
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.15976
Pdf link: https://arxiv.org/pdf/2605.15976
Abstract Production machine translation relies overwhelmingly on encoder-decoder Seq2Seq models, yet reinforcement learning approaches to MT fine-tuning have largely targeted decoder-only LLMs at $\geq$7B parameters, with limited systematic study of encoder-decoder architectures. We apply Group Relative Policy Optimization (GRPO) to NLLB-200 (600M and 1.3B) using a hybrid reference-free reward (LaBSE and COMET-Kiwi) that requires no parallel data at fine-tuning time, evaluating across 13 typologically diverse languages. GRPO yields consistent improvements on all 13 languages, up to $+$5.03 chrF++ for Traditional Chinese, and, without any target-language data, competes with 3-epoch supervised fine-tuning on morphologically complex languages. We identify a consistent empirical pattern in which gains are largest where baseline performance is weakest and reward discriminability is highest, making this approach most effective precisely where parallel data is scarcest, and replicate this pattern across English and Spanish source languages.
中文摘要 生产环境中的机器翻译绝大多数依赖编码器-解码器 Seq2Seq 模型，但用于机器翻译微调的强化学习方法主要针对参数量 $\geq$7B 的仅解码器 LLM，对编码器-解码器架构的系统性研究有限。我们将组相对策略优化（Group Relative Policy Optimization，GRPO）应用于 NLLB-200（600M 和 1.3B），采用一种混合的无参考奖励（LaBSE 和 COMET-Kiwi），在微调时无需平行数据，并在 13 种类型学上多样的语言上进行评估。GRPO 在全部 13 种语言上都带来了一致的提升，繁体中文最高可达 $+$5.03 chrF++；并且在不使用任何目标语言数据的情况下，在形态复杂的语言上可与 3 个 epoch 的监督微调相媲美。我们发现了一个一致的经验规律：基线性能最弱、奖励区分度最高之处收益最大，这使得该方法恰恰在平行数据最稀缺的地方最为有效；我们还在英语和西班牙语两种源语言上复现了这一规律。
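
The group-relative step at the core of GRPO can be sketched in a few lines: several translations are sampled for one source sentence, each is scored, and each score is normalized against the group's mean and standard deviation. The equal-weight combination in `hybrid_reward`, the use of the population standard deviation, and the example scores are assumptions of this sketch; the abstract does not say how the LaBSE and COMET-Kiwi signals are combined.

```python
from statistics import mean, pstdev


def hybrid_reward(similarity_score, quality_score, weight=0.5):
    """Blend two reference-free scores (the weighting is hypothetical)."""
    return weight * similarity_score + (1 - weight) * quality_score


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled translations of one source sentence, with made-up scores.
scores = [(0.80, 0.60), (0.70, 0.75), (0.90, 0.85), (0.60, 0.50)]
rewards = [hybrid_reward(s, q) for s, q in scores]
print(group_relative_advantages(rewards))
```

When every sample in a group gets the same reward, all advantages are zero and the group contributes no gradient. This is one plausible reading of why the abstract ties gains to reward discriminability.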

Adaptive Outer-Loop Control of Quadrotors via Reinforcement Learning

通过强化学习实现四旋翼的自适应外环控制

Authors: Vishnu Saj, Sushi Vemuri, Dileep Kalathil, Moble Benedict
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.16015
Pdf link: https://arxiv.org/pdf/2605.16015
Abstract Deep Reinforcement Learning (DRL) for quadrotor flight control typically relies on Domain Randomization (DR) for sim-to-real transfer, resulting in overly conservative policies that struggle with dynamic disturbances. To overcome this, we propose a novel adaptive control architecture that actively perceives and reacts to instantaneous perturbations. First, we train an optimal outer-loop policy, then replace its reliance on ground-truth disturbance data with a Residual Dynamics Predictor (RDP). The RDP estimates online the external forces and moments acting on the aircraft in flight, using only the history of states and control actions. For seamless hardware transfer, we introduce a data-efficient linear calibration bridge and an online thrust correction mechanism that align the simulated latent space with reality using mere seconds of flight data. Real-world validations on a Crazyflie micro-quadrotor demonstrate that our adaptive controller significantly outperforms baselines, maintaining precise trajectory tracking under severe uncertainties including mass variations, asymmetric payloads, and dynamic slung loads.
中文摘要 用于四旋翼飞行控制的深度强化学习（DRL）通常依赖域随机化（DR）实现仿真到现实的迁移，导致策略过于保守，难以应对动态扰动。为克服这一问题，我们提出了一种新型自适应控制架构，能够主动感知瞬时扰动并做出反应。首先，我们训练一个最优的外环策略，然后用残差动力学预测器（RDP）取代其对真实（ground-truth）扰动数据的依赖。RDP 仅利用状态和控制动作的历史，在线估计飞行中作用于飞行器的外力和外力矩。为了实现向硬件的无缝迁移，我们引入了一个数据高效的线性校准桥和一个在线推力修正机制，只需几秒钟的飞行数据即可将仿真中的潜空间与现实对齐。在 Crazyflie 微型四旋翼上的实机验证表明，我们的自适应控制器显著优于基线，在质量变化、非对称载荷和动态悬挂载荷等严重不确定性下仍能保持精确的轨迹跟踪。

Mind Dreamer: Untethering Imagination via Active Latent Intervention on Latent Manifolds

Mind Dreamer：通过潜在流形上的主动潜在干预解除想象的束缚

Authors: Shaojun Xu, Xiaoling Zhou, Yihan Lin, Yapeng Meng, Xinglong Ji, Luping Shi, Rong Zhao
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.16030
Pdf link: https://arxiv.org/pdf/2605.16030
Abstract Model-Based Reinforcement Learning (MBRL) leverages latent imagination for sample efficiency, yet remains constrained by Historical Tethering: imagination is typically initialized from observed states. This creates a learning asymmetry, where the world model's manifold discovery outpaces the policy's sparse-reward optimization. We propose Mind Dreamer (MD), a framework that operationalizes Active Latent Intervention (ALI) to transcend Markovian continuity. MD reformulates discovery as the minimization of a global Relay Manifold Expected Free Energy (R-EFE); by sampling initial states from a learned generator $s_0 \sim p_{gen}(\cdot)$ rather than the historical buffer, MD utilizes an adversarial generator to synthesize non-continuous latent jumps to epistemic blind spots that are physically plausible yet cognitively challenging. To resolve the credit assignment paradox across these spatial ruptures, we derive the Relay Value Function (RVF) and Relay Uncertainty Function (RUF). These potentials treat synthesized anchors as counterfactual intermediary states, propagating pragmatic and epistemic value through a principled Bellman-style formulation. Notably, we prove that uncertainty propagation across discontinuities necessitates a quadratic discount $\gamma^2$, establishing a formal epistemic horizon. Theoretically, MD approximates a variance-minimizing importance sampler that expands the manifold's spectral gap, reducing the hitting time to critical bottleneck states. Empirically, MD achieves a 1.67$\times$ average speedup over DreamerV3 on DeepMind Control Suite, reaching 8.8$\times$ in sparse-reward tasks.
中文摘要 基于模型的强化学习（MBRL）利用潜在想象来提高样本效率，但仍受制于历史束缚（Historical Tethering）：想象通常从观察到的状态开始初始化。这造成了一种学习不对称，即世界模型对流形的发现快于策略在稀疏奖励下的优化。我们提出了 Mind Dreamer（MD），这是一个将主动潜在干预（Active Latent Intervention，ALI）付诸实践、以超越马尔可夫连续性的框架。MD 将发现过程重新表述为最小化全局的接力流形期望自由能（Relay Manifold Expected Free Energy，R-EFE）；通过从学习到的生成器 $s_0 \sim p_{gen}(\cdot)$ 而非历史缓冲区中采样初始状态，MD 利用对抗性生成器合成非连续的潜在跳跃，跳向物理上合理但认知上具有挑战性的认知盲点。为了解决跨越这些空间断裂的信用分配悖论，我们推导出接力价值函数（Relay Value Function，RVF）和接力不确定性函数（Relay Uncertainty Function，RUF）。这些势函数将合成的锚点视为反事实的中间状态，通过有原则的贝尔曼式形式传播实用价值和认知价值。值得注意的是，我们证明了不确定性跨越不连续点的传播需要一个二次折扣 $\gamma^2$，从而确立了一个形式化的认知视界。理论上，MD 近似于一个方差最小化的重要性采样器，它扩大了流形的谱间隙，缩短了到达关键瓶颈状态的命中时间。实验上，MD 在 DeepMind Control Suite 上相对 DreamerV3 取得了平均 1.67$\times$ 的加速，在稀疏奖励任务中达到 8.8$\times$。

Look Before You Leap: Autonomous Exploration for LLM Agents

三思而后行：LLM代理的自主探索

Authors: Ziang Ye, Wentao Shi, Yuxin Liu, Yu Wang, Zhengzhou Cai, Yaorui Shi, Qi Gu, Xunliang Cai, Fuli Feng
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.16143
Pdf link: https://arxiv.org/pdf/2605.16143
Abstract Large language model based agents often fail in unfamiliar environments due to premature exploitation: a tendency to act on prior knowledge before acquiring sufficient environment-specific information. We identify autonomous exploration as a critical yet underexplored capability for building adaptive agents. To formalize and quantify this capability, we introduce Exploration Checkpoint Coverage, a verifiable metric that measures how broadly an agent discovers key states, objects, and affordances. Our systematic evaluation reveals that agents trained with standard task-oriented reinforcement learning consistently exhibit narrow and repetitive behaviors that impede downstream performance. To address this limitation, we develop a training strategy that interleaves task-execution rollouts and exploration rollouts, with each type of rollout optimized by its corresponding verifiable reward. Building on this training strategy, we propose the Explore-then-Act paradigm, which decouples information-gathering from task execution: agents first utilize an interaction budget to acquire grounded environmental knowledge, then leverage it for task resolution. Our results demonstrate that learning to systematically explore is imperative for building generalizable and real-world-ready agents.
中文摘要 基于大型语言模型的智能体在陌生环境中常因过早利用（premature exploitation）而失败：即在获得足够的环境特定信息之前，就倾向于依据先验知识行动。我们认为自主探索是构建自适应智能体的一项关键但尚未得到充分研究的能力。为了形式化并量化这一能力，我们引入了探索检查点覆盖率（Exploration Checkpoint Coverage），这是一个可验证的指标，衡量智能体发现关键状态、对象和可供性（affordances）的广度。我们的系统性评估表明，用标准的任务导向强化学习训练的智能体始终表现出狭窄且重复的行为，从而妨碍下游性能。为解决这一局限，我们开发了一种训练策略，将任务执行 rollout 与探索 rollout 交错进行，每种 rollout 都由其对应的可验证奖励进行优化。基于这一训练策略，我们提出了“先探索后行动”（Explore-then-Act）范式，将信息收集与任务执行解耦：智能体首先利用一定的交互预算获取有依据的环境知识，然后利用这些知识完成任务。我们的结果表明，学会系统性地探索对于构建可泛化且可用于真实世界的智能体至关重要。
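
A coverage metric of the kind the abstract describes could be as simple as the fraction of predefined checkpoints that an exploration rollout actually reached. The definition below is an assumed simplification for illustration, and the checkpoint names are invented; the paper's exact formulation of Exploration Checkpoint Coverage may differ.

```python
def checkpoint_coverage(visited, checkpoints):
    """Fraction of checkpoints that appear among the visited items."""
    targets = set(checkpoints)
    if not targets:
        return 0.0
    return len(targets & set(visited)) / len(targets)


# Hypothetical checkpoints for a household-style environment.
checkpoints = ["open_drawer", "find_key", "inspect_shelf", "unlock_door"]

# A narrow, repetitive rollout versus a broader one.
narrow = ["open_drawer", "open_drawer", "open_drawer"]
broad = ["open_drawer", "inspect_shelf", "find_key", "look_at_wall"]

print(checkpoint_coverage(narrow, checkpoints))
print(checkpoint_coverage(broad, checkpoints))
```

Because the score depends only on which checkpoints were reached, it can be verified automatically and used as a reward for exploration rollouts.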

Learn Where Outcomes Diverge: Efficient VLA RL via Probabilistic Chunk Masking

学习结果分歧之处：通过概率分块掩蔽实现高效的VLA RL

Authors: Vaidehi Bagaria, Nikshep Grampurohit, Pulkit Verma
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.16154
Pdf link: https://arxiv.org/pdf/2605.16154
Abstract Reinforcement learning (RL) allows vision-language-action (VLA) policies to generalize beyond their training distribution by optimizing directly for task success, but post-training is computationally expensive. A natural response has been to speed rollout collection through faster simulators and world models. In GRPO-based VLA RL, we find that the dominant cost lies elsewhere: gradient computation accounts for approximately 78% of wall-clock time per step in our runs, while rollout collection accounts for only 21%. Gradient cost dominates because much of this computation is spent on phases that contribute little to learning. GRPO's learning signal is driven by advantage variance: only phases where successful and failed rollouts diverge produce learning signal. However, GRPO assigns the same advantage to every chunk in a rollout. As a result, actor-update compute is spent uniformly across the trajectory, including phases the policy already handles after pre-training and supervised fine-tuning. This paper presents Probabilistic Chunk Masking (PCM), a drop-in modification to GRPO that allocates gradient computation to a small, probabilistically selected subset of chunks per trajectory. PCM scores semantic phases using success-failure action variance, a rollout-derived proxy for per-phase gradient variance, and samples a fixed chunk budget with online-updated phase-level keep probabilities. We formalize per-phase gradient variance as the quantity that determines where gradient computation is useful and show that success-failure action variance provides a measurable proxy for it. PCM requires no reward model or learned critic. On three LIBERO benchmarks, PCM matches the final success rate of standard GRPO while achieving 2.38 times wall-clock speedup, 4.8 times faster gradient updates, and 60% lower peak activation memory, while backpropagating through fewer than 20% of trajectory chunks.
中文摘要 强化学习（RL）通过直接针对任务成功进行优化，使视觉-语言-动作（VLA）策略能够泛化到训练分布之外，但后训练的计算开销很大。一种自然的应对方式是借助更快的模拟器和世界模型来加速 rollout 采集。在基于 GRPO 的 VLA 强化学习中，我们发现主要开销并不在此：在我们的实验中，梯度计算约占每步墙钟时间的 78%，而 rollout 采集仅占 21%。梯度开销之所以占主导，是因为其中大量计算花在了对学习贡献很小的阶段上。GRPO 的学习信号由优势方差驱动：只有成功与失败的 rollout 发生分歧的阶段才会产生学习信号。然而，GRPO 为一条 rollout 中的每个块（chunk）分配相同的优势。因此，actor 更新的计算量被均匀地花在整条轨迹上，包括策略在预训练和监督微调之后已经能够处理的阶段。本文提出概率块掩蔽（Probabilistic Chunk Masking，PCM），这是一种对 GRPO 的即插即用修改，将梯度计算分配给每条轨迹中按概率选出的一小部分块。PCM 使用成功-失败动作方差（一种由 rollout 得到的、对各阶段梯度方差的代理量）为语义阶段打分，并依据在线更新的阶段级保留概率采样固定的块预算。我们将各阶段梯度方差形式化为决定梯度计算在何处有用的量，并表明成功-失败动作方差为其提供了一个可测量的代理。PCM 不需要奖励模型或学习得到的 critic。在三个 LIBERO 基准上，PCM 达到了与标准 GRPO 相当的最终成功率，同时实现了 2.38 倍的墙钟加速、4.8 倍的梯度更新加速和 60% 的峰值激活内存降低，而反向传播所经过的轨迹块不到 20%。
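
The selection step of PCM, sampling a fixed chunk budget using phase-level keep probabilities, can be illustrated with weighted sampling without replacement. Everything concrete below is an assumption made for the sketch: the phase names, the probability values, and the sequential sampling scheme. How the keep probabilities are updated online is not shown.

```python
import random


def sample_chunks(chunk_phases, keep_prob, budget, rng=None):
    """Pick `budget` distinct chunk indices, favouring phases with high keep probability.

    chunk_phases[i] names the semantic phase of chunk i, and keep_prob maps a
    phase to a positive weight. Only the returned chunks would be
    backpropagated through; the others are masked out of the actor update.
    """
    rng = rng or random.Random()
    remaining = list(range(len(chunk_phases)))
    selected = []
    while remaining and len(selected) < budget:
        weights = [keep_prob[chunk_phases[i]] for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        selected.append(pick)
        remaining.remove(pick)  # sample without replacement
    return sorted(selected)


# Hypothetical trajectory of 10 chunks across three phases.
phases = ["reach"] * 4 + ["grasp"] * 4 + ["place"] * 2
keep = {"reach": 0.05, "grasp": 0.90, "place": 0.05}

print(sample_chunks(phases, keep, budget=2, rng=random.Random(0)))
```

With these weights most of the budget lands on the `grasp` chunks. This mirrors the stated idea of spending gradient computation where successful and failed rollouts diverge.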

Argus: Evidence Assembly for Scalable Deep Research Agents

Argus：可扩展深度研究代理的证据汇编

Authors: Zhen Zhang, Liangcai Su, Zhuo Chen, Xiang Lin, Haotian Xu, Simon Shaolei Du, Kaiyu Yang, Bo An, Lidong Bing, Xinyu Wang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2605.16217
Pdf link: https://arxiv.org/pdf/2605.16217
Abstract Deep research agents have achieved remarkable progress on complex information seeking tasks. Even long ReAct style rollouts explore only a single trajectory, while recent state of the art systems scale inference time compute via parallel search and aggregation. Yet deep research answers are composed of complementary pieces of evidence, which parallel rollouts often duplicate rather than complete, yielding diminishing returns while pushing the aggregation context toward the model's limit. We propose Argus, an agentic system in which a Searcher and a Navigator cooperate to treat deep research as assembling a jigsaw from complementary evidence pieces, rather than brute forcing the whole answer in parallel. The Searcher collects evidence traces for a given sub-query through ReAct-style interaction. The Navigator maintains a shared evidence graph, verifying which pieces are still missing, dispatching Searchers to gather them, and reasoning over the completed graph to produce a source-traced final answer. We train the Navigator with reinforcement learning to verify, dispatch, and synthesize, while independently training the Searcher to remain a standard ReAct agent. The resulting Navigator supports rollouts with a single Searcher or many in parallel without retraining. With both Searcher and Navigator built on a 35B-A3B MoE backbone, Argus gains 5.5 points with a single Searcher and 12.7 points with 8 parallel Searchers, averaged over eight benchmarks. With 64 Searchers it reaches 86.2 on BrowseComp, surpassing every proprietary agent we benchmark, while the Navigator's reasoning context stays under 21.5K tokens.
中文摘要 深度研究智能体在复杂的信息检索任务上取得了显著进展。即使是很长的 ReAct 式 rollout 也只探索单一轨迹，而近期最先进的系统则通过并行搜索和聚合来扩展推理时计算。然而，深度研究的答案由互补的证据片段组成，并行 rollout 往往重复而不是补全这些片段，导致收益递减，同时使聚合上下文逼近模型的上限。我们提出 Argus，这是一个智能体系统，其中搜索者（Searcher）和导航者（Navigator）相互协作，将深度研究视为用互补的证据片段拼装一幅拼图，而不是并行地强行求解整个答案。搜索者通过 ReAct 式交互为给定的子查询收集证据轨迹。导航者维护一个共享的证据图，核实哪些片段仍然缺失，派遣搜索者去收集它们，并在补全后的图上进行推理，以给出可溯源的最终答案。我们用强化学习训练导航者进行验证、派遣和综合，同时独立训练搜索者，使其保持为标准的 ReAct 智能体。由此得到的导航者无需重新训练即可支持使用单个搜索者或多个并行搜索者的 rollout。在搜索者和导航者均基于 35B-A3B MoE 骨干的情况下，Argus 在八个基准上平均，使用单个搜索者时提升 5.5 分，使用 8 个并行搜索者时提升 12.7 分。使用 64 个搜索者时，它在 BrowseComp 上达到 86.2，超过了我们评测的所有专有智能体，而导航者的推理上下文保持在 21.5K token 以下。

Keyword: diffusion policy

There is no result