Arxiv Papers of Today

Generated: 2026-05-05 18:07:03 (UTC+8); arXiv release time: 2026-05-05 20:00 EDT (2026-05-06 08:00 UTC+8)

79 related papers today

Keyword: reinforcement learning

RA-CMF: Region-Adaptive Conditional MeanFlow for CT Image Reconstruction

Authors: Md Shifatul Ahsan Apurba, Md Selim, Jin Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.00901
Pdf link: https://arxiv.org/pdf/2605.00901
Abstract The use of CT imaging is important for screening, diagnosis, therapy planning, and prognosis of lung cancers. Unfortunately, due to differences in imaging protocols and scanner models, CT images acquired by different means may show large differences in noise statistics, contrast, and texture. In this study, we develop a novel conditional MeanFlow pipeline for CT image reconstruction. We introduce a conditional MeanFlow network that models the enhancement trajectory by predicting image-conditioned flow fields given intermediate image states. The image enhancement network is trained with a MeanFlow consistency loss along with the image reconstruction loss. In order to provide an adaptive refinement process in terms of spatial location of enhancements, we integrate a regional reinforcement learning-driven policy network into our approach. The policy network receives information about the MeanFlow rollouts and provides predictions in terms of tile-wise refinement budgets, stopping criteria, and total budget allocation of enhancement processes. Our policy network is trained through reinforcement learning in a policy gradient framework, where the goal of the training reward is to maximize improvement of enhancements while minimizing unnecessary computations and avoiding instabilities. In this way, our approach combines conditional flow-based enhancement with reinforcement learning-based spatial enhancement control. This allows our approach to focus more attention on enhancing difficult areas while stabilizing areas already showing sufficient quality. Our results show high accuracy in the tumor ROI, with the average radiomic feature CCC being 0.96, an average PSNR of 31.30 $\pm$ 4.16, and average SSIM of 0.94 $\pm$ 0.07. Moreover, there is an improvement in the overall quality of images, with an average PSNR of 34.23 $\pm$ 1.71 and average SSIM of 0.95 $\pm$ 0.01.
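
For readers unfamiliar with the metrics quoted in this abstract, the sketch below gives the textbook definitions of PSNR and Lin's concordance correlation coefficient (CCC) in plain Python. It is illustrative only: the function names, the flat-list image representation, and the `max_val` default are assumptions, and the paper's own evaluation code may differ (for example in how the tumor ROI is cropped or how intensities are normalized).

```python
import math

def psnr(reference, test, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(test):
        raise ValueError("inputs must have the same length")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs: no noise at all
    return 10.0 * math.log10(max_val ** 2 / mse)

def ccc(x, y):
    """Lin's concordance correlation coefficient, using population moments.

    Undefined (division by zero) when both inputs are constant and equal.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return 2.0 * cov / (var_x + var_y + (mean_x - mean_y) ** 2)
```

Unlike Pearson correlation, CCC penalizes a constant offset between the two measurements, which is why it is a common choice for checking whether radiomic features agree in value rather than merely in trend.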

Interpretable experiential learning based on state history and global feedback

Authors: Anton Kolonin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.00940
Pdf link: https://arxiv.org/pdf/2605.00940
Abstract A new interpretable experiential learning model based on state history and global feedback is presented. It is capable of learning a behavioral model represented by a transition graph between sets of states, with transitions attributed with utility and evidence count. This model is expected to be suitable for solving reinforcement learning problems in resource-constrained environments. The model was thoroughly evaluated on the OpenAI Gym Atari Breakout benchmark, demonstrating performance comparable to some known neural network-based solutions.
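
The abstract describes a transition graph whose edges carry a utility and an evidence count. One way such a structure might look is sketched below. Everything here is hypothetical: the class and method names, labeling edges with actions, the running-mean utility update, and the evidence-weighted action choice are assumptions made for illustration, not the paper's algorithm.

```python
from collections import defaultdict

class TransitionGraph:
    """Hypothetical sketch: edges between states carry utility and evidence count."""

    def __init__(self):
        # (source, action, target) -> [mean utility, evidence count]
        self.edges = defaultdict(lambda: [0.0, 0])

    def update(self, source, action, target, feedback):
        """Record one observed transition and fold the feedback into its utility."""
        edge = self.edges[(source, action, target)]
        edge[1] += 1                                # one more piece of evidence
        edge[0] += (feedback - edge[0]) / edge[1]   # running mean of feedback

    def best_action(self, source):
        """Return the action with the highest evidence-weighted mean utility."""
        totals = defaultdict(lambda: [0.0, 0])
        for (src, action, _), (utility, count) in self.edges.items():
            if src == source:
                totals[action][0] += utility * count
                totals[action][1] += count
        if not totals:
            return None  # state never seen
        return max(totals, key=lambda a: totals[a][0] / totals[a][1])
```

Because every edge is an explicit (state, action, state) record with two numbers attached, the learned behavior can be inspected directly, which is the sense in which such a model is interpretable.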

PPO guided Agentic Pipeline for Adaptive Prompt Selection and Test Case Generation

Authors: Gourisetty Venkata Sai Koushik, Dama Aditya, Mahankali Harish Sai, Peddi Siddarhta, Shadab Ahmad, Vivek Yelleti
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.00942
Pdf link: https://arxiv.org/pdf/2605.00942
Abstract Developing effective test cases capable of thoroughly exercising large-scale software systems is inherently difficult, especially if such systems have voluminous, complex, and deeply nested source code. In this work, we present a novel approach for generating test cases using a reinforcement learning-driven agentic framework where Proximal Policy Optimization (PPO) is coupled with an LLM engine to guide prompt selection during test generation. Our approach consists of two phases. In Phase I, the ToT-guided optimization agent partitions and minimizes the source code by removing redundancies without changing the functional behavior of the source code. In Phase II, a PPO-based policy network is trained to select among eight different prompting techniques, such as Boundary Value Analysis and Random Fuzzing, based on an 11-dimensional input state vector representing source code complexity metrics and live coverage metrics, to direct the LLM engine towards exploring unvisited paths in the program. The PPO agent receives rewards based on a combination of increases in line and branch coverage, penalties for unexplored branches, and rewards for reducing source code length. Experiments on twenty benchmark programs show that the proposed approach, PPO-LLM, outperforms CBMC, kS-LLM, and kS-LLM++ in terms of branch and line coverage in almost all cases, for loop bound values ranging from BOUND 1 to BOUND 2000. At BOUND 1, branch coverage is 100% using PPO-LLM on the PALS suite, compared with around 86.8% using kS-LLM++. This confirms that adaptive prompt selection driven by PPO substantially outperforms static prompting strategies on PALS type programs.
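
The reward described above combines three ingredients: coverage gains, a penalty for unexplored branches, and a bonus for shrinking the source. A minimal sketch of how such a signal could be assembled follows. The function name, the dictionary layout, the linear form, and all four weights are placeholders chosen for illustration; the abstract does not give the actual formula.

```python
def coverage_reward(prev, curr, unexplored_branches, lines_removed,
                    w_line=1.0, w_branch=1.0, w_unexplored=0.1, w_reduction=0.01):
    """Hypothetical reward; the weights are placeholders, not values from the paper.

    prev / curr: dicts with "line" and "branch" coverage fractions in [0, 1],
    measured before and after the tests produced by the selected prompt.
    """
    gain = (w_line * (curr["line"] - prev["line"])
            + w_branch * (curr["branch"] - prev["branch"]))
    penalty = w_unexplored * unexplored_branches   # branches still unvisited
    bonus = w_reduction * lines_removed            # source-length reduction
    return gain - penalty + bonus
```

In a setup like this, the policy's action would be an index into the eight prompting techniques, and the reward would be computed after the LLM's generated tests are executed and coverage is re-measured.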

Your Loss is My Gain: Low Stake Attacks on Liquid Staking Pools

Authors: Sen Yang, Aviv Yaish, Arthur Gervais, Fan Zhang
Subjects: Computer Science and Game Theory (cs.GT)
Arxiv link: https://arxiv.org/abs/2605.01025
Pdf link: https://arxiv.org/pdf/2605.01025
Abstract Permissionless Proof-of-Stake (PoS) economic security is predicated on the high cost of violating consensus safety or liveness. We show that liquid staking introduces additional risks that are not captured by standard PoS economic security arguments. Through an empirical study of Ethereum data, we find that the operational performance of liquid staking pools is positively associated with subsequent normalized liquid staking token (LST) returns. Motivated by this, we present a cross-layer attack: a low-stake adversary can manipulate the consensus protocol to degrade a target pool's performance and take application-layer positions that profit if the market reprices the corresponding LST in line with the historically observed association. To make the consensus layer manipulation concrete, we develop a deep reinforcement learning (DRL) framework to automatically discover attack strategies. Our evaluation shows that the learned strategies can recover near-optimal theoretical attacks and uncover new manipulation behaviors that significantly degrade target pool performance. We further characterize feasible application-layer monetization channels and analyze leveraged shorting in detail using Monte Carlo simulations, showing that such attacks can be profitable with over one-half probability for LSTs of major staking pools. Our findings reveal a previously overlooked attack surface in PoS systems with liquid staking and expose a gap between consensus and economic security.

Separation Assurance between Heterogeneous Fleets of Small Unmanned Aerial Systems via Multi-Agent Reinforcement Learning

Authors: Iman Sharifi, Hyeong Tae Kim, Maheed Hatem Ahmed, Mahsa Ghasemi, Peng Wei
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.01041
Pdf link: https://arxiv.org/pdf/2605.01041
Abstract In the envisioned future dense urban airspace, multiple companies will operate heterogeneous fleets of small unmanned aerial systems (sUASs), where each fleet includes several homogeneous aircraft with identical policies and configurations, e.g., equipage, sensing, and communication ranges, making tactical deconfliction highly complex for the aircraft. This paper aims to address two core questions: (1) Can tactical deconfliction policies converge or reach an equilibrium to ensure a conflict-free airspace when companies operate heterogeneous fleets of homogeneous aircraft? (2) If so, will the converged policies discriminate against companies operating sUASs with weaker configurations? We investigate a multi-agent reinforcement learning paradigm in which homogeneous aircraft within heterogeneous fleets operate concurrently to perform package delivery missions over Dallas, Texas, USA. An attention-enhanced Proximal Policy Optimization-based Advantage Actor-Critic (PPOA2C) framework is employed to resolve intra- and inter-fleet conflicts, with each fleet independently training its own policy while preserving privacy. Experimental results show that two fleets with distinct, shared PPOA2C policies can reach an equilibrium to maintain safe separation. While two PPOA2C policies outperform two strong rule-based baselines in terms of conflict resolution, a PPOA2C policy exhibits safer interaction with a rule-based policy, indicating adaptive capabilities of PPOA2C policies. Furthermore, we conducted extensive policy-configuration evaluations, which reveal that equilibria between similar policy types tend to favor fleets with stronger configurations. Even under similar configurations but different policy types, the equilibrium favors one of the heterogeneous policies, underscoring the need for fairness-aware conflict management in heterogeneous sUAS operations.

Teaching LLMs Brazilian Healthcare: Injecting Knowledge from Official Clinical Guidelines

Authors: Hugo Abonizio, Filipe Rocha Lopes, Roberto Lotufo, Rodrigo Nogueira
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.01077
Pdf link: https://arxiv.org/pdf/2605.01077
Abstract Brazil's Unified Health System (SUS) relies on official clinical guidelines that define diagnostic criteria, treatments, dosages, and monitoring procedures for over 200 million citizens. Yet current LLMs perform poorly on this guideline-specific knowledge, and no benchmark evaluates clinical recall grounded in Brazilian Portuguese protocols. We address this gap by adapting Qwen2.5-14B-Instruct to the Brazilian clinical domain. From 178 official guidelines (~5.4M tokens), we generate ~70M tokens of synthetic data in three formats -- rephrases, wiki-style articles, and question-answer pairs -- using four generator LLMs. We then apply continual pre-training followed by Group Relative Policy Optimization (GRPO). We introduce HealthBench-BR, with 1,780 balanced true/false clinical assertions, and PCDT-QA, with 890 open-ended clinical questions scored by an LLM judge. Our best model achieves 83.9% on HealthBench-BR and 85.4% on PCDT-QA, outperforming GPT-5.2, Claude Sonnet 4.6, Gemini 3.1 Pro, and Google AI Overview's web-grounded RAG despite having only 14B parameters. Ablations show that generator diversity and reinforcement learning are critical to these gains. We release all datasets, benchmarks, and model weights to support reproducible clinical NLP research for Brazilian Portuguese. Code, data, and model weights are available at this https URL
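
Group Relative Policy Optimization scores each sampled response relative to the other responses drawn for the same prompt, rather than against a learned value function. The sketch below shows that normalization step only. The true/false reward is a hypothetical example suggested by the benchmark format; the abstract does not state which reward the authors used, and implementations differ on details such as population versus sample standard deviation.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each response's reward against its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]

def true_false_reward(predicted, gold):
    """Hypothetical verifiable reward: 1.0 if the true/false verdict matches."""
    return 1.0 if predicted == gold else 0.0

# Four sampled answers to one clinical assertion whose gold label is True.
rewards = [true_false_reward(p, True) for p in (True, False, True, False)]
advantages = group_relative_advantages(rewards)
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no gradient, which is one reason question difficulty matters for this family of methods.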

Learning to Race in Minutes: Infoprop Dyna on the Mini Wheelbot

Authors: Devdutt Subhasish, Henrik Hose, Sebastian Trimpe
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.01096
Pdf link: https://arxiv.org/pdf/2605.01096
Abstract Reinforcement Learning (RL) has the potential to enable robots with fast, nonlinear, and unstable dynamics to reach the limits of their performance. However, most recent advances rely on carefully designed physics-based simulators and domain randomization to achieve successful sim-to-real transfer within reasonable wall-clock time. In this work, we bypass the need for such simulators and demonstrate that Infoprop Dyna, a state-of-the-art uncertainty-aware model-based reinforcement learning (MBRL) framework, can enable robots to learn directly from real-world interactions. Using Infoprop Dyna, the Mini Wheelbot, an underactuated unicycle robot, learns to race around a track within 11 minutes of real-world experience.

PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs

Authors: Ravi Ranjan, Utkarsh Grover, Xiaomin Lin, Agoritsa Polyzou
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.01123
Pdf link: https://arxiv.org/pdf/2605.01123
Abstract Large language models (LLMs) can provide automated feedback in educational settings, but aligning an LLM's style with a specific instructor's tone while maintaining diagnostic correctness remains challenging. We ask: how can we update an LLM for automated feedback generation to align with a target instructor's style without sacrificing core knowledge? We study how Reinforcement Learning from Human Feedback (RLHF) can adapt a transformer-based LLM to generate programming feedback that matches a professor's grading voice. We introduce PERSA, an RLHF pipeline that combines supervised fine-tuning on professor demonstrations, reward modeling from pairwise preferences, and Proximal Policy Optimization (PPO), while deliberately constraining learning to style-bearing components. Motivated by analyses of transformer internals, PERSA applies parameter-efficient fine-tuning. It updates only the top transformer blocks and their feed-forward projections, minimizing global parameter drift while increasing stylistic controllability. We evaluate our proposed approach on three code-feedback benchmarks (APPS, PyFiXV, and CodeReviewQA) using complementary metrics for style alignment and fidelity. Across both Llama-3 and Gemma-2 backbones, PERSA delivers the strongest professor-style transfer while retaining correctness; for example, on APPS it boosts Style Alignment Score (SAC) to 96.2% (from 34.8% for Base) with Correctness Accuracy (CA) up to 100% on Llama-3 and Gemma-2. Overall, PERSA offers a practical route to personalized educational feedback by aligning both what it says (content correctness) and, crucially, how it says it (instructor-like tone and structure).

Forager: a lightweight testbed for continual learning with partial observability in RL

Authors: Steven Tang, Xinze Xiong, Anna Hakhverdyan, Andrew Patterson, Jacob Adkins, Jiamin He, Esraa Elelimy, Parham Mohammad Panahi, Martha White, Adam White
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.01131
Pdf link: https://arxiv.org/pdf/2605.01131
Abstract In continual reinforcement learning (CRL), good performance requires never-ending learning, acting, and exploration in a big, partially observable world. Most CRL experiments have focused on loss of plasticity -- the inability to keep learning -- in one-off experiments where some unobservable non-stationarity is added to classic fully observable MDPs. Further, these experiments rarely consider the role of partial observability and the importance of CRL agents that use memory or recurrence. One potential reason for this focus on mitigating loss of plasticity without considering partial observability is that many partially-observable CRL environments are prohibitively expensive. In this paper, we introduce Forager, a light-weight partially-observable CRL environment with a constant memory footprint. We provide a set of experiments and sample tasks demonstrating that Forager is challenging for current CRL agents and yet also allows for in-depth study of those agents. We demonstrate that agents exhibit loss of plasticity and that proposed mitigations can help, but that leveraging state construction is most useful. We conclude with a variant of Forager that generates an unending stream of new tasks to learn, which clearly highlights the limitations of current CRL agents.

The Hidden Cost of Thinking: Energy Use and Environmental Impact of LMs Beyond Pretraining

Authors: Jacob Morrison, Noah A. Smith, Emma Strubell
Subjects: Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2605.01158
Pdf link: https://arxiv.org/pdf/2605.01158
Abstract Modern language model development extends far beyond pretraining, yet environmental reporting remains narrowly focused on the cost of training a single final model. In this work, we provide the first detailed breakdown of the environmental impact of a full model development pipeline, from pretraining through supervised fine-tuning, preference optimization, and reinforcement learning, for Olmo 3, a family of 7 billion and 32 billion parameter models in both instruction-following and reasoning variants. We find that reasoning models are 17x more expensive to post-train than their instruction-tuned counterparts in terms of datacenter energy, driven by reinforcement learning rollout generation. Development costs (including experimentation, failed runs, and ablations) account for 82.2% of total compute, a roughly 65% increase over the ~50% reported for pretraining-focused pipelines in prior work. In total, we estimate our model development process consumed ~12.3 GWh of datacenter energy, emitted 4,251 tCO2eq, and consumed 15,887 kL of water, with water consumption driven entirely by power generation infrastructure rather than data center cooling. These costs, which are almost entirely unreported by model developers, are growing rapidly as post-training pipelines become more complex, and must be accounted for in environmental reporting standards and by the research community working to reduce AI's environmental impact.
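
The headline figures in this abstract can be sanity-checked with a few lines of arithmetic. The sketch below uses only numbers quoted in the abstract; the variable names are illustrative, and the derived intensities are implied averages over the whole pipeline, approximate because the energy total is itself given as "~12.3 GWh".

```python
def relative_increase_pct(new, old):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0

# Development share of total compute vs. the ~50% reported in prior work.
dev_share, prior_share = 82.2, 50.0
print(f"relative increase: {relative_increase_pct(dev_share, prior_share):.1f}%")  # 64.4%

# Totals quoted in the abstract, converted to base units.
energy_kwh = 12.3e6     # ~12.3 GWh
emissions_kg = 4251e3   # 4,251 tCO2eq
water_l = 15887e3       # 15,887 kL

carbon_intensity = emissions_kg / energy_kwh   # about 0.35 kg CO2eq per kWh
water_intensity = water_l / energy_kwh         # about 1.29 L per kWh
```

The "roughly 65% increase" is therefore a relative change (82.2 against 50), not a difference in percentage points, which would be 32.2.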

Zero-Shot Signal Temporal Logic Planning with Disjunctive Branch Selection in Dynamic Semantic Maps

Authors: Bowen Ye, Ancheng Hou, Junyue Huang, Ruijia Liu, Xiang Yin
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.01222
Pdf link: https://arxiv.org/pdf/2605.01222
Abstract Signal Temporal Logic (STL) offers verifiable task specifications and is crucial for safety-critical control. Yet STL planning remains challenging: exact optimization-based methods are often too slow, and learning-based methods struggle to generalize across varying environments. We propose a zero-shot STL planning solver for variable-map environments that generates feasible trajectories without retraining. By integrating a map-conditioned Transformer architecture with a lightweight heuristic, our approach effectively handles complex disjunctive (OR) subformulas. Furthermore, we leverage Transitive Reinforcement Learning (TRL) to ensure consistent temporal grounding and logical coherence across decomposed sub-tasks. Experiments on dynamic semantic maps with diverse obstacle layouts demonstrate consistent gains, highlighting the framework's superior zero-shot generalization to changing environments and broad STL coverage.

Breaking the Computational Barrier: Provably Efficient Actor-Critic for Low-Rank MDPs

Authors: Ruiquan Huang, Donghao Li, Yingbin Liang, Jing Yang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.01242
Pdf link: https://arxiv.org/pdf/2605.01242
Abstract Reinforcement learning (RL) is a fundamental framework for sequential decision-making, in which an agent learns an optimal policy through interactions with an unknown environment. In settings with function approximation, many existing RL algorithms achieve favorable sample complexity, but often rely on computationally intractable oracles. In this paper, we use supervised learning as a computational proxy to establish a clear hierarchy of commonly adopted RL oracles under low-rank Markov Decision Processes (MDPs). This hierarchy shows that policy evaluation is the most computationally efficient oracle, provided that supervised learning can be efficiently solved. Motivated by this observation, we propose a novel optimistic actor-critic algorithm that relies solely on the policy evaluation oracle. We prove that our algorithm outperforms the existing sample complexity guarantees for low-rank MDPs while avoiding computationally expensive planning or optimization oracles commonly assumed in prior works. We further extend our theoretical results to approximately low-rank MDPs and demonstrate that this setting captures a broad class of real-world environments. Finally, we validate our theoretical results with experiments on several standard Gym environments.

S^3-R1: Learning to Retrieve and Answer Step-by-Step with Synthetic Data

Authors: Harsh Goel, Akhil Udathu, Susmija Jabireddy, Pradnesh Kalkar, Atharva Parulekar
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.01248
Pdf link: https://arxiv.org/pdf/2605.01248
Abstract Reinforcement learning (RL) post-training has enabled newer capabilities in models, such as agentic tool-use for search. However, these models struggle primarily due to limitations with sparse outcome-based rewards and a lack of training data that encapsulates questions of differing hardness, which results in models not performing deeper searches with tools to collect evidence for question-answering. To address these limitations, we introduce S^3-R1 (Synthetic data and stabilized Search R1), a framework that couples a data-centric approach with denser learning signals. We first develop a synthetic generation and curation pipeline that programmatically derives diverse, multi-hop questions from existing documents. This pipeline incorporates a retrieval-based verification step to specifically isolate questions of intermediate difficulty. We then pair this expanded training set with a reward structure that evaluates both intermediate search quality and the correctness of the final answer. This setup directly mitigates the credit assignment problems inherent to sparse rewards. Our evaluations show that S^3-R1 outperforms existing baselines by learning more effective search and synthesis strategies, yielding up to a 10% improvement in robust generalization on out-of-domain datasets.

Bi-Level Reinforcement Learning Control for an Underactuated Blimp via Center-of-Mass Reconfiguration

Authors: Xiaorui Wang, Hongwu Wang, Yue Fan, Hao Cheng, Feitian Zhang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.01289
Pdf link: https://arxiv.org/pdf/2605.01289
Abstract This paper investigates goal-directed tracking control of underactuated blimps with center-of-mass (CoM) reconfiguration. Unlike conventional overactuated blimp designs that rely on redundant actuation for simplified control, this paper focuses on a compact architecture consisting of two thrusters and a movable internal slider, aiming to improve energy efficiency and payload capacity. This hardware-efficient configuration introduces significant underactuation and strong nonlinear coupling between CoM dynamics and vehicle motion. To address these challenges, this paper proposes a bi-level reinforcement learning framework that explicitly decouples task-level CoM planning from continuous thrust control. The outer policy determines a target-dependent CoM configuration prior to flight, while the inner policy generates thrust commands to track straight-line references. To ensure stable learning, this paper introduces a two-stage learning strategy, supported by a convergence analysis of the resulting bi-level process. Extensive simulations and real-world experiments on a 27-goal evaluation set demonstrate that the proposed method consistently outperforms fixed-CoM baselines and PID-based controllers, achieving higher tracking accuracy, enhanced robustness, and reliable sim-to-real transfer.

Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs

Authors: Jingze Wu, Quan Zhang, Hongfei Suo, Zeqiang Cai, Hongbo Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.01324
Pdf link: https://arxiv.org/pdf/2605.01324
Abstract Although reinforcement learning (RL) has significantly advanced reasoning capabilities in large multimodal language models (MLLMs), its efficacy remains limited for lightweight models essential for edge [...]. To address this issue, we leverage causal analysis and experiment to reveal the underlying phenomenon of perceptual bias, demonstrating that RL-based fine-tuning compels lightweight models to preferentially adopt perceptual shortcuts induced by data biases, rather than developing genuine reasoning [...]. Motivated by this insight, we propose VideoThinker, a causal-inspired framework that cultivates robust reasoning in lightweight models through a two-stage debiasing process. First, the Bias Aware Training stage forges a dedicated "bias model" to embody these shortcut behaviors. Then, the Causal Debiasing Policy Optimization (CDPO) algorithm fine-tunes the primary model, employing an innovative repulsive objective to actively push it away from the bias model's flawed logic while simultaneously pulling it toward correct, generalizable [...]. Our model, VideoThinker-R1, establishes a new state-of-the-art in video reasoning efficiency. For same-scale comparison, requiring no Supervised Fine-Tuning (SFT) and using only 1 of the training data for RL, it surpasses VideoRFT-3B with a 3.2% average gain on widely-used benchmarks and a 7% lead on VideoMME. For cross-scale comparison, it outperforms the larger Video-UTR-7B model on multiple benchmarks, including a 2.1% gain on MVBench and a 3.8% gain on TempCompass. Code is available at this https URL.

Segment-Aligned Policy Optimization for Multi-Modal Reasoning

多模态推理的分段对齐策略优化

Authors: Lei Gao, Zhuoming Li, Mengxi Jia, Jiakang Yuan, Hongbo Sun, Hao Sun, Xuelong Li
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.01327
Pdf link: https://arxiv.org/pdf/2605.01327
Abstract Existing reinforcement learning approaches for Large Language Models typically perform policy optimization at the granularity of individual tokens or entire response sequences. However, such formulations often misalign with the natural step-wise structure of reasoning processes, leading to suboptimal credit assignment and unstable training in multi-modal reasoning tasks. To bridge this gap, we propose Segment-Aligned Policy Optimization (SAPO), a novel reinforcement learning paradigm that treats coherent reasoning steps, rather than tokens or full sequences, as the fundamental units of policy update. SAPO introduces a step-wise Markov decision process abstraction over reasoning segments, accompanied by segment-level value estimation, advantage computation, and importance sampling mechanisms that are semantically aligned with reasoning boundaries. Experiments on representative reasoning benchmarks demonstrate that SAPO consistently outperforms token-level and sequence-level policy optimization methods, achieving significant accuracy improvements while exhibiting better training stability and value estimation consistency. Our work underscores the importance of aligning reinforcement learning updates with the intrinsic structure of reasoning, paving the way for more efficient and semantically grounded policy optimization in complex reasoning tasks. Code and models will be released to ensure full reproducibility.
中文摘要 现有的大型语言模型强化学习方法通常在单个标记或整个响应序列的粒度上进行策略优化。然而，这种表述常常与推理过程的自然分级结构不匹配，导致学分分配不优，并在多模态推理任务中训练不稳定。为弥合这一空白，我们提出了分段对齐策略优化（SAPO），这是一种新型强化学习范式，将连贯的推理步骤而非代币或完整序列视为策略更新的基本单元。SAPO引入了对推理片段的逐步马尔可夫决策过程抽象，并伴随着段级价值估计、优势计算以及与推理边界语义对齐的重要性抽样机制。代表性推理基准测试的实验表明，SAPO始终优于令牌级和序列级策略优化方法，在显著提升准确性的情况下，展现出更好的训练稳定性和价值估计一致性。我们的工作强调了将强化学习更新与推理内在结构对齐的重要性，为复杂推理任务中更高效、更有语义基础的策略优化铺平了道路。代码和模型将发布，以确保完全可重复。
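
The segment-level update described in this abstract can be illustrated with a minimal sketch. It assumes a response has already been split into reasoning segments and that per-token log-probabilities are available under the old and new policies. The function names, the PPO-style clipping, and the clip range of 0.2 are illustrative assumptions, not details taken from the paper.

```python
import math
from typing import List


def segment_ratios(new_logps: List[List[float]], old_logps: List[List[float]]) -> List[float]:
    """One importance ratio per segment: exp of the summed token log-prob differences."""
    ratios = []
    for new_seg, old_seg in zip(new_logps, old_logps):
        diff = sum(n - o for n, o in zip(new_seg, old_seg))
        ratios.append(math.exp(diff))
    return ratios


def clipped_segment_objective(ratios: List[float], advantages: List[float], clip: float = 0.2) -> float:
    """PPO-style clipped surrogate averaged over segments (assumed form)."""
    total = 0.0
    for r, a in zip(ratios, advantages):
        clipped = max(1.0 - clip, min(1.0 + clip, r))
        total += min(r * a, clipped * a)
    return total / len(ratios)
```

Treating each segment as one unit means a single ratio and a single advantage govern all tokens of a reasoning step, which is the granularity the abstract argues for.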

A Multi-View Media Profiling Suite: Resources, Evaluation, and Analysis

多视图媒体画像套件：资源、评估与分析

Authors: Muhammad Arslan Manzoor, Dilshod Azizov, Daniil Orel, Umer Siddique, Zain Muhammad Mujahid, Yufang Hou, Preslav Nakov
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.01336
Pdf link: https://arxiv.org/pdf/2605.01336
Abstract News outlets shape public opinion at a scale that makes automated detection of political bias and factuality essential. However, the field still lacks unified resources, comprehensive evaluations across diverse approaches, and systematic analyses of the representations and fusion strategies that matter most, especially under label sparsity and dataset diversity. In addition, there is little empirical work reporting broad, observation-driven findings about what consistently works, what fails, and why. We address these gaps through four main contributions. First, we introduce MBFC-2025, a large-scale label set covering approximately 2,600 outlets from Media Bias/Fact Check (MBFC). Second, we construct multiview representations for ACL-2020 (Panayotov et al., 2022), which includes around 900 outlets, as well as for MBFC-2025. These representations span Alexa graphs, hyperlink graphs, LLM-derived graphs, articles, and Wikipedia descriptions. Third, we provide a systematic evaluation and analysis of embedding views and fusion strategies, including a reinforcement learning-based fusion variant. Fourth, we conduct extensive experiments that achieve state-of-the-art results on ACL-2020 and establish strong benchmarks on MBFC-2025.
中文摘要 新闻媒体塑造公众舆论的规模，使得自动检测政治偏见和事实性变得至关重要。然而，该领域仍缺乏统一的资源、跨多样方法的全面评估，以及对最关键表示法和融合策略的系统分析，尤其是在标签稀疏性和数据集多样性下。此外，关于什么持续有效、什么失败及其原因的广泛观察研究也很少。我们通过四个主要贡献来应对这些空白。首先，我们介绍MBFC-2025，这是一套涵盖约2600个媒体的大型标签集，来自Media Bias/Fact Check（MBFC）。其次，我们构建了ACL-2020（Panayotov等，2022）的多视图表示，包含约900个插座，以及MBFC-2025。这些表示涵盖了Alexa图、超链接图、LLM衍生图、文章和维基百科描述。第三，我们系统地评估和分析嵌入视图和融合策略，包括基于强化学习的融合变体。第四，我们开展广泛实验，在ACL-2020上取得最先进的成果，并在MBFC-2025上建立了强有力的基准。

LLM Output Detectability and Task Performance Can be Jointly Optimized

LLM的输出可检测性和任务性能可以共同优化

Authors: Koshiro Saito, Ryuto Koike, Masahiro Kaneko, Naoaki Okazaki
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.01350
Pdf link: https://arxiv.org/pdf/2605.01350
Abstract Detecting machine-generated text is essential for transparency and accountability when deploying large language models (LLMs). Among detection approaches, watermarking is a statistically reliable method by design -- it embeds detectable signals into LLM outputs by biasing their token distributions. However, it has been reported that watermarked LLMs often perform worse on downstream tasks. We propose PUPPET, a framework that fine-tunes an LLM via reinforcement learning to generate text that is both more detectable and better performing on downstream tasks. We use two reward functions: a detector that outputs a machine-class likelihood and an evaluator that measures a task-specific metric. Experiments on long-form QA, summarization, and essay writing show that LLMs trained with PUPPET achieve high detectability competitive with watermarking methods while outperforming them on downstream tasks. The analysis shows that this optimization can be performed efficiently with only a few thousand samples in 1--2 GPU hours. Moreover, these gains are consistent across out-of-domain tasks, different LLM families, and model sizes, and are even robust to paraphrasing attacks.
中文摘要 在部署大型语言模型（LLM）时，检测机器生成文本对于透明度和问责制至关重要。在检测方法中，水印设计上是一种统计可靠的方法——它通过对符号分布进行偏差，将可检测信号嵌入LLM输出中。然而，有报道指出，水印的大型语言模型在下游任务中的表现往往更差。我们提出了PUPPET框架，通过强化学习微调LLM，生成既更易检测又在下游任务中表现更好的文本。我们使用两个奖励函数：一个输出机器类似然的检测器，以及衡量任务特定指标的评估器。长文质量保证、摘要和论文写作的实验表明，使用UPPET训练的大型语言模型在与水印方法竞争中具有高度的可检测性，同时在下游任务中表现优于水印。分析表明，这种优化只需几千个样本即可高效完成，耗时1至2个GPU小时。此外，这些提升在域外任务、不同LLM家族和模型规模中保持一致，甚至对改写攻击具有鲁棒性。
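
The abstract names two reward signals, a detector score and a task metric, but does not say how they are combined. A weighted sum is one simple possibility, sketched below. The function name, the weight, and the assumption that both signals lie in [0, 1] are hypothetical.

```python
def combined_reward(detector_prob: float, task_score: float, weight: float = 0.5) -> float:
    """Weighted sum of the two reward signals; both are assumed to lie in [0, 1]."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    return weight * detector_prob + (1.0 - weight) * task_score
```

With `weight=0.5`, a response that is fully detectable but scores zero on the task earns the same reward as one that is undetectable but solves the task, so the weight directly sets the trade-off between the two goals.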

Model-Based Proactive Cost Generation for Learning Safe Policies Offline with Limited Violation Data

基于模型的主动成本生成，用于离线学习安全政策，且违规数据有限

Authors: Ruiqi Xue, Lei Yuan, Kainuo Cheng, Jing-Wen Yang, Yang Yu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.01356
Pdf link: https://arxiv.org/pdf/2605.01356
Abstract Learning constraint-satisfying policies from offline data without risky online interaction is crucial for safety-critical decision making. Conventional methods typically learn cost value functions from abundant unsafe samples to define safety boundaries and penalize violations. However, in high-stakes scenarios, risky trial-and-error is infeasible, yielding datasets with few or no unsafe samples. Under this limitation, existing approaches often treat all data as uniformly safe, overlooking safe-but-infeasible states - states that currently satisfy constraints but inevitably violate them within a few steps - leading to deployment failures. Drawing inspiration from the concept of knowledge-data integration, we leverage large language models (LLMs) to incorporate natural language knowledge into the policy to address this challenge. Specifically, we propose PROCO, a model-based offline safe reinforcement learning (RL) framework tailored to datasets largely free of violations. PROCO first learns a dynamics model from offline data and constructs a conservative cost function by grounding natural-language knowledge of unsafe states in LLMs, enabling risk estimation even without observed violations. Using the cost function and learned model, PROCO performs model-based rollouts to synthesize diverse counterfactual unsafe samples, supporting reliable feasibility identification and feasibility-guided policy learning. Across a range of Safety-Gymnasium tasks with exclusively safe or minimally risky training data, PROCO integrates seamlessly with a variety of offline safe RL algorithms and consistently demonstrates reduced constraint violations and improved safety performance compared to both the original methods and other behavior cloning baselines.
中文摘要 从离线数据中学习满足约束的政策，而无需冒险的在线互动，对于安全关键决策至关重要。传统方法通常从大量不安全样本中学习成本值函数，以定义安全边界并惩罚违规行为。然而，在高风险情景下，冒险的试错法不可行，因此数据集中几乎没有或没有不安全的样本。在这种限制下，现有方法通常将所有数据视为统一安全，忽视了安全但不可行的状态——这些状态目前满足约束，但在几步内必然会违反，导致部署失败。我们从知识-数据整合的概念中汲取灵感，利用大型语言模型（LLMs）将自然语言知识融入政策中，以应对这一挑战。具体来说，我们提出了PROCO，这是一个基于模型的离线安全强化学习（RL）框架，专为基本无违规的数据集量身定制。PROCO首先从离线数据中学习动力学模型，并通过以LLMs为基础的不安全状态自然语言知识构建保守成本函数，从而实现即使未观察到违规也能进行风险估计。利用成本函数和学习模型，PROCO执行基于模型的推广，综合多样化的反事实不安全样本，支持可靠的可行性识别和可行性引导的政策学习。在一系列仅安全或风险极低的训练数据下的安全体育馆任务中，PROCO无缝集成多种离线安全强化学习算法，持续展现出比原始方法及其他行为克隆基线更少的约束违规和更优的安全性能。
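
The notion of a safe-but-infeasible state can be made concrete with a toy example. The sketch below replaces the learned dynamics model and the LLM-grounded cost function with hand-written stand-ins: a one-dimensional cart with a wall at position 10. All names and numbers are hypothetical. A state counts as feasible if some action sequence keeps it violation-free for the whole rollout horizon.

```python
from typing import Tuple

State = Tuple[int, int]  # (position, velocity)
ACTIONS = (-1, 0, 1)     # change in velocity per step


def step(state: State, action: int) -> State:
    """Hand-written stand-in for a learned dynamics model."""
    x, v = state
    v = v + action
    return (x + v, v)


def cost(state: State) -> int:
    """Stand-in cost function: positions at or beyond the wall are unsafe."""
    return 1 if state[0] >= 10 else 0


def feasible(state: State, horizon: int) -> bool:
    """True if some action sequence avoids every violation within the horizon."""
    if cost(state):
        return False
    if horizon == 0:
        return True
    return any(feasible(step(state, a), horizon - 1) for a in ACTIONS)
```

The state `(8, 3)` currently has zero cost, yet every action carries it into the wall on the next step, so a dataset that labels it "safe" would mislead the policy. Rollouts of this kind are what expose such states when no violations were ever observed.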

PACE: Parameter Change for Unsupervised Environment Design

PACE：无监督环境设计的参数变更

Authors: Fang Yuan, Quanjun Yin, Siqi Shen, Yuxiang Xie, Junqiang Yang, Long Qin, Junjie Zeng, Qinglun Li
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.01358
Pdf link: https://arxiv.org/pdf/2605.01358
Abstract Unsupervised Environment Design (UED) offers a promising paradigm for improving reinforcement learning generalization by adaptively shaping training environments, but it requires reliable environment evaluation to remain effective. However, existing UED methods evaluate environments using indirect proxy signals such as regret, value-based errors, or Monte Carlo estimates, which suffer from bias, high variance, or substantial computational overhead and fail to reflect the agent's realized learning progress. To address these limitations, we propose Parameter Change Environment Design (PACE), which evaluates an environment through the policy parameter change induced by training on that environment, directly grounding environment selection in realized learning progress. Specifically, PACE assigns environment value using a first-order approximation of the policy optimization objective, where the improvement induced by an environment is proportional to the squared L2 norm of the corresponding parameter update, enabling low-variance and computation-efficient evaluation without additional rollouts. Experiments on MiniGrid and Craftax show that PACE consistently outperforms established UED baselines, achieving higher IQM and smaller Optimality Gap on OOD evaluations, including an IQM of 96.4% and an Optimality Gap of 17.2% on MiniGrid.
中文摘要 无监督环境设计（UED）为通过自适应塑造训练环境来提升强化学习泛化提供了有前景的范式，但要保持有效性，需要可靠的环境评估。然而，现有的UED方法评估使用间接代理信号（如遗憾、基于价值的错误或蒙特卡洛）的环境，这些信号存在偏见、高方差或大量计算开销，无法反映代理实现的学习进展。为解决这些局限性，我们提出了参数变更环境设计（PACE），通过在环境上训练引发的政策参数变化来评估环境，直接将环境选择建立在已实现的学习进展基础上。具体来说，PACE通过策略优化目标的一阶近似来赋予环境价值，其中环境带来的改进与相应参数更新的L2范数平方成正比，从而实现低方差和计算效率高的评估，无需额外推广。MiniGrid和Craftax上的实验显示，PACE持续优于既定的UED基线，在户外评测中实现了更高的IQM和更小的最优差距，包括MiniGrid的IQM为96.4%，最优性差距为17.2%。
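
The scoring rule in this abstract can be sketched in a few lines. One way to see the proportionality, assuming plain gradient ascent with step size η, is that the update is Δθ = η∇J, so the first-order improvement ∇J·Δθ equals ||Δθ||² / η. The function names and the flat-list parameter representation below are illustrative assumptions.

```python
from typing import Dict, List


def squared_l2_change(before: List[float], after: List[float]) -> float:
    """Squared L2 norm of the parameter update induced by one environment."""
    return sum((a - b) ** 2 for a, b in zip(after, before))


def rank_environments(theta: List[float], updated: Dict[str, List[float]]) -> List[str]:
    """Order environments from largest to smallest induced parameter change."""
    scores = {name: squared_l2_change(theta, new) for name, new in updated.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

Because the score reuses parameters that training already produced, no additional rollouts are needed to evaluate an environment.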

Coordination Architecture Shapes Continuous Demand Response Outcomes in Building Districts

协调架构塑造建筑区内的持续需求响应结果

Authors: Ava Mohammadi, Rick Kramer, Zoltan Nagy
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2605.01362
Pdf link: https://arxiv.org/pdf/2605.01362
Abstract Grid-integrated building districts must provide energy flexibility while preserving occupant comfort and equitable distribution of control burden. We study how coordination architecture influences the ability of building clusters to track aggregated load profiles, comparing four paradigms: centralized model predictive control (MPC), decentralized independent reinforcement learning (SAC), centralized-training-decentralized-execution multi-agent RL (MAPPO), and a hybrid MPC--SAC controller that separates district-level battery optimization from building-level HVAC regulation. A rule-based controller serves as a baseline. We evaluate a 25-building residential district across three metrics: aggregate load tracking, thermal comfort, and spatial variability of control actions. We find that architecture choice determines the trade-off structure. Centralized MPC achieves low tracking bias (8.8% NMBE) but concentrates actuation on a subset of buildings, causing elevated comfort violations (24.8% exceedance) and spatial imbalance. Decentralized RL distributes control effort more evenly but fails to sustain accurate tracking. The hybrid architecture achieves the best balance: accurate tracking (4.8% NMBE), moderate comfort impact (16.8% exceedance), and the lowest spatial variability. These findings demonstrate that architecture choice determines the trade-off structure between tracking and comfort.
中文摘要 电网集成建筑区必须在保持居住舒适度和公平分配控制负担的同时，提供能源灵活性。我们研究协调架构如何影响集群跟踪汇总负载曲线的能力，比较了四种范式：集中式模型预测控制（MPC）、去中心化独立强化学习（SAC）、集中式训练-去中心化-执行多智能体强化学习（MAPPO），以及区分区级电池优化与建筑级暖通空调调节的混合MPC-SAC控制器。基于规则的控制器作为基线。我们通过三个指标评估一个拥有25栋建筑的住宅区：总负载追踪、热舒适度和控制行动的空间变异性。我们发现架构选择决定了权衡结构。集中式MPC实现了低跟踪偏差（8.8% NMBE），但将驱动集中在部分建筑上，导致舒适度违规率升高（24.8%超额）和空间不平衡。去中心化强化学习更均匀地分配了控制工作，但无法维持准确的跟踪。混合架构实现了最佳平衡：追踪准确度（4.8% NMBE）、适度舒适度影响（超额度16.8%）以及最低的空间变异性。这些发现表明，架构选择决定了追踪与舒适度之间的权衡结构。
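
The tracking figures above are reported as NMBE (normalized mean bias error). The abstract does not give the exact formula, so the sketch below uses one common convention: the summed signed error divided by the summed target, expressed as a percentage. The sign convention and function name are assumptions.

```python
from typing import Sequence


def nmbe_percent(target: Sequence[float], actual: Sequence[float]) -> float:
    """Signed bias of the achieved load relative to the target profile, in percent."""
    if len(target) != len(actual) or not target:
        raise ValueError("profiles must be non-empty and of equal length")
    return 100.0 * sum(a - t for t, a in zip(target, actual)) / sum(target)
```

Since NMBE is a bias metric, positive and negative errors cancel: a profile that overshoots and undershoots by equal amounts scores 0% even though it never matches the target.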

Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression

通过强化学习向MLLM注入分布意识以实现深度不平衡回归

Authors: Yao Du, Shanshan Li, Xiaomeng Li
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.01402
Pdf link: https://arxiv.org/pdf/2605.01402
Abstract Multimodal large language models (MLLMs) struggle with numerical regression under long-tailed target distributions. Token-level supervised fine-tuning (SFT) and point-wise regression rewards bias learning toward high-density regions, leading to regression-to-the-mean behavior and poor tail performance. We identify the lack of cross-sample relational supervision as a key limitation of existing MLLM training paradigms. To address it, we propose a distribution-aware reinforcement learning framework based on Group Relative Policy Optimization, which introduces batch-level comparison-based supervision via the Concordance Correlation Coefficient-based reward to align predicted and ground-truth distributions in terms of correlation, scale, and mean. The framework is plug-and-play, requiring no architectural modification. Experiments on a unified suite of long-tailed regression benchmarks show consistent improvements over SFT and existing MLLM regression methods, with particularly strong gains in medium- and few-shot regimes.
中文摘要 多模态大型语言模型（MLLMs）在长尾目标分布下的数值回归问题上存在困难。代币级监督微调（SFT）和点回归奖励偏向高密度区域的学习，导致回归均值行为和尾部表现较差。我们指出，缺乏跨样本关系监督是现有MLLM培训范式的一个关键局限。为此，我们提出了基于群体相对策略优化的分布感知强化学习框架，通过基于索引相关系数的奖励引入了批次级比较导演，使预测分布与真实分布在相关性、尺度和均值上保持一致。该框架即插即用，无需任何架构修改。在一组统一的长尾回归基准测试上，显示相较于SFT和现有MLLM回归方法有持续的提升，在中和少样本组中尤为显著。
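
The batch-level reward in this abstract is built on the Concordance Correlation Coefficient. A minimal sketch of Lin's CCC over a batch of predictions follows. How the paper turns this value into a per-sample GRPO reward is not stated in the abstract, so only the coefficient itself is shown, and the function name is an assumption.

```python
from typing import Sequence


def concordance_cc(pred: Sequence[float], target: Sequence[float]) -> float:
    """Lin's CCC: 2*cov / (var_pred + var_target + (mean_pred - mean_target)^2)."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_t = sum((t - mean_t) ** 2 for t in target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target)) / n
    denom = var_p + var_t + (mean_p - mean_t) ** 2
    return 2.0 * cov / denom if denom > 0 else 0.0
```

A model that collapses to the batch mean gets zero covariance and therefore a CCC of 0, however small its point-wise error. This is why such a reward discourages regression-to-the-mean behavior.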

Medmarks: A Comprehensive Open-Source LLM Benchmark Suite for Medical Tasks

Medmarks：一套全面的开源大型语言模型基准测试套件，适用于医疗任务

Authors: Benjamin Warner, Ratna Sagari Grandhi, Max Kieffer, Aymane Ouraq, Saurav Panigrahi, Geetu Ambwani, Kunal Bagga, Nikhil Khandekar, Arya Hariharan, Nishant Mishra, Manish Ram, Shamus Sim Zi Yang, Ahmed Essouaied, Adepoju Jeremiah Moyondafoluwa, Robert Scholz, Bofeng Huang, Molly Beavers, Srishti Gureja, Anish Mahishi, Sameed Khan, Maxime Griot, Hunar Batra, Jean-Benoit Delbrouck, Siddhant Bharadwaj, Ronald Clark, Ashish Vashist, Anas Zafar, Leema Krishna Murali, Harsh Deshpande, Ameen Patel, William Brown, Johannes Hagemann, Connor Lane, Paul Steven Scotti, Tanishq Mathew Abraham
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.01417
Pdf link: https://arxiv.org/pdf/2605.01417
Abstract Evaluating large language models (LLMs) for medical applications remains challenging due to benchmark saturation, limited data accessibility, and insufficient coverage of relevant tasks. Existing suites are either saturated, heavily dependent on restricted datasets, or lacking in comprehensive model coverage. We introduce Medmarks, a fully open-source evaluation suite with 30 benchmarks spanning question answering, information extraction, medical calculations, and open-ended clinical reasoning. We perform a systematic evaluation of 61 models across 71 configurations using verifiable metrics and LLM-as-a-Judge. Our results show that frontier reasoning models (Gemini 3 Pro Preview, GPT-5.1, and GPT-5.2) achieve the highest performance across both benchmarks, that most frontier proprietary models are significantly more token efficient than open-weight alternatives, that medically fine-tuned models outperform their generalist counterparts, and that models are susceptible to answer-order bias (particularly smaller models and Grok 4). A subset of our evals (Medmarks-T) can be directly used as reinforcement learning environments to post-train LLMs for medical reasoning. Code is available at this https URL
中文摘要 由于基准饱和、数据可访问有限以及相关任务覆盖不足，评估用于医疗应用的大型语言模型（LLMs）依然充满挑战。现有套件要么已饱和，要么高度依赖有限的数据集，或者缺乏全面的模型覆盖。我们推出了Medmarks，这是一套完全开源的评估套件，包含30项基准，涵盖问答、信息提取、医学计算和开放式临床推理。我们系统评估了61个模型，涵盖71种配置，使用可验证的指标和LLM作为评判。我们的结果显示，前沿推理模型（Gemini 3 Pro Preview、GPT-5.1和GPT-5.2）在两个基准测试中都实现了最高性能，大多数Frontier专有模型的token效率明显优于开放权重替代方案，医学微调模型优于通用模型，且模型容易出现答案顺序偏差（尤其是较小模型和Grok 4）。我们的评估子集（Medmarks-T）可以直接用作强化学习环境，用于医学推理的LLM进行后期训练。代码可在此 https URL 获取

CoFlow: Coordinated Few-Step Flow for Offline Multi-Agent Decision Making

CoFlow：为离线多智能体决策提供协调的少数步骤流程

Authors: Guowei Zou, Haitao Wang, Beiwen Zhang, Boning Zhang, Hejun Wu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.01457
Pdf link: https://arxiv.org/pdf/2605.01457
Abstract Generative models have emerged as a major paradigm for offline multi-agent reinforcement learning (MARL), but existing approaches require many iterative sampling steps. Recent few-step accelerations either distill a joint teacher into independent students or apply averaged velocities independently per agent, suggesting that few-step inference requires sacrificing inter-agent coordination. We show this trade-off is not necessary: single-pass multi-agent generation can preserve coordination when the velocity field is natively joint-coupled. We propose Coordinated few-step Flow (CoFlow), an architecture that combines Coordinated Velocity Attention (CVA) with Adaptive Coordination Gating. A finite-difference consistency surrogate further replaces memory-prohibitive Jacobian-vector product backpropagation through the averaged velocity field with two stop-gradient forward passes. Across 60 configurations spanning MPE, MA-MuJoCo, and SMAC, CoFlow matches or surpasses Gaussian / value-based, transformer, diffusion, and prior flow baselines on episodic return. Three independent coordination probes confirm that the gains flow through inter-agent coordination rather than per-agent capacity. A denoising-step sweep shows that single-pass inference suffices on every configuration. CoFlow reaches state-of-the-art coordination quality in 1-3 denoising steps under both centralized and decentralized execution. Project page: this https URL.
中文摘要 生成模型已成为离线多智能体强化学习（MARL）的主要范式，但现有方法需要大量迭代抽样步骤。最近的少步加速要么将联合教师提炼成独立学生，要么为每个智能体独立应用平均速度，表明少步骤推断需要牺牲代理间协调。我们证明了这种权衡并非必要：当速度场本身是联合耦合时，单次多代理生成可以保持协调。我们提出了协调少步流（CoFlow），这是一种结合协调速度注意力（CVA）与自适应协调门控的架构。有限差分一致性代理进一步替代了通过平均速度场的记忆限制的雅可比向量积反向传播，采用两次停止梯度前向传递。在涵盖MPE、MA-MuJoCo和SMAC的60种配置中，CoFlow在周期性回波上匹配甚至超过高斯/值基、变压器、扩散及先前的流量基线。三个独立的协调探针确认收益通过代理间协调流动，而非单个代理的能力。去噪步扫描表明，单遍推断在所有配置上都足够。CoFlow在中心化和去中心化执行下，通过1-3个去噪步骤即可达到最先进的协调质量。项目页面：这个 https URL。
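
The finite-difference surrogate mentioned in this abstract replaces a Jacobian-vector product with two forward passes. The generic idea is sketched below in plain Python. The central-difference form, the step size, and the function names are assumptions; the paper's exact surrogate and its stop-gradient placement are not specified in the abstract.

```python
from typing import Callable, List

Vector = List[float]


def fd_jvp(f: Callable[[Vector], Vector], x: Vector, v: Vector, eps: float = 1e-4) -> Vector:
    """Approximate J_f(x) @ v with two forward evaluations of f (central difference)."""
    plus = f([xi + eps * vi for xi, vi in zip(x, v)])
    minus = f([xi - eps * vi for xi, vi in zip(x, v)])
    return [(p - m) / (2.0 * eps) for p, m in zip(plus, minus)]


def toy_field(x: Vector) -> Vector:
    """Toy vector field used only to exercise fd_jvp."""
    return [x[0] ** 2, x[0] * x[1]]
```

For `toy_field` at `x = [3, 2]` with direction `v = [1, 0]`, the exact product is `[6, 2]`, which the approximation recovers to within floating-point error. In a deep-learning framework the two evaluations would be ordinary forward passes with gradients stopped, which avoids storing the graph needed to backpropagate through a true Jacobian-vector product.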

LLM-Foraging: Large Language Models for Decentralized Swarm Robot Foraging

LLM采集：用于去中心化群体机器人采集的大型语言模型

Authors: Peihan Li, Joanna Gutierrez, Fabian Hernandez, Qi Lu, Lifeng Zhou
Subjects: Robotics (cs.RO); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2605.01461
Pdf link: https://arxiv.org/pdf/2605.01461
Abstract Swarm foraging algorithms, such as the central-place foraging algorithm (CPFA), typically rely on offline parameter optimization using genetic algorithms (GA) or reinforcement learning, yielding policies tightly coupled to a specific combination of team size, arena size, and resource distribution. When deployment conditions change, performance degrades, and retraining is computationally expensive. We propose LLM-Foraging, a decentralized swarm controller that augments the CPFA state machine with a large language model (LLM) tactical decision-maker at three structured decision points, namely post-deposit, central-zone arrival, and search starvation. Each robot runs its own LLM client and queries it using only locally observable state, while the existing CPFA motion and sensing stack executes the selected action. Because the LLM serves as a general decision policy rather than parameters fitted to a single configuration, the controller is training-free at deployment and transfers across configurations without re-optimization. We evaluate LLM-Foraging in Gazebo with TurtleBot3 robots across 36 configurations spanning team sizes of 4 to 10 robots, arena sizes from 6x6 to 10x10 meters, and three resource distributions (clustered, powerlaw, random). LLM-Foraging collects more resources than the GA-tuned CPFA baseline across the evaluated configurations and is more consistent, a property that the GA's single-configuration tuning does not transfer.
中文摘要 群体采集算法，如中心位置采集算法（CPFA），通常依赖于使用遗传算法（GA）或强化学习进行离线参数优化，从而产生与团队规模、场地规模和资源分布等特定组合紧密耦合的策略。当部署条件变化时，性能下降，重新训练计算成本高昂。我们提出了LLM-Foraging，这是一种去中心化的群体控制器，在三个结构化决策点（即沉积后、中心区到达和搜索饥饿）通过大型语言模型（LLM）战术决策者增强CPFA状态机。每个机器人运行自己的LLM客户端，仅用本地可观测状态查询，而现有的CPFA运动和传感堆栈执行所选动作。由于LLM作为通用决策策略而非参数适配于单一配置，控制器在部署时无需训练，且可在不同配置间转移而无需重新优化。我们评估了在凉亭中用TurtleBot3机器人在36种配置中的LLM采集，涵盖4至10台团队规模、6x6至10x10米的竞技场尺寸，以及三种资源分布（聚类、幂律、随机）。LLM寻集在评估配置中收集的资源比GA调优的CPFA基线更多，且更为一致，这一点是GA单一配置调优无法转移的。
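
The controller structure in this abstract, a fixed state machine that consults an LLM only at three decision points, can be sketched as follows. The decision-point names come from the abstract. The action names, the `query_llm` callable, and the fallback rule are hypothetical, and no real LLM client API is used.

```python
from typing import Callable, Dict, Tuple

# Decision points named in the abstract; the action names are hypothetical.
ALLOWED_ACTIONS: Dict[str, Tuple[str, ...]] = {
    "post_deposit": ("return_to_site", "search_new_area"),
    "central_zone_arrival": ("follow_trail", "search_new_area"),
    "search_starvation": ("keep_searching", "return_to_center"),
}


def choose_action(point: str, local_state: dict, query_llm: Callable[[str, dict], str]) -> str:
    """Ask the robot's own LLM client for a tactic, falling back to a default on invalid replies."""
    allowed = ALLOWED_ACTIONS[point]
    reply = query_llm(point, local_state).strip()
    return reply if reply in allowed else allowed[0]
```

Restricting replies to a small action set keeps the existing motion and sensing stack in charge of execution, so a malformed LLM answer degrades to a default behavior rather than an undefined one.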

An Intelligent eUPF for Time-Sensitive Path Selection in B5G Edge Networks

用于B5G边缘网络中时间敏感路径选择的智能eUPF

Authors: Rodrigo Moreira, Larissa Ferreira Rodrigues Moreira, Tereza Cristina Carvalho, Flávio de Oliveira Silva
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2605.01475
Pdf link: https://arxiv.org/pdf/2605.01475
Abstract In Beyond 5G (B5G) networks, intelligent, flexible traffic management is essential to meet the stringent speed and reliability requirements of new applications. This paper presents an improved User Plane Function (eUPF) design that uses a Deep Q-Network (DQN) agent for real-time path selection between Multi-access Edge Computing (MEC) and cloud endpoints. The path selection problem is formulated as a Partially Observable Markov Decision Process (POMDP). We propose a novel passive delay measurement method that uses eBPF programs to link TEID-based timestamps in GTP-U traffic, allowing for low-cost delay estimation without active testing. Experiments show that the DQN agent substantially outperforms a random baseline, with lower average latency, more stable rewards, and more reliable low-delay path choices. These results demonstrate the effectiveness of AI-driven control in B5G core networks and the promise of reinforcement learning for modern network management.
中文摘要 在超越5G（B5G）网络中，智能且灵活的流量管理对于满足新应用对速度和可靠性的严格要求至关重要。本文提出了改进的用户平面功能（eUPF）设计，利用深度Q网络（DQN）代理实现多接入边缘计算（MEC）与云端点之间的实时路径选择。路径选择问题被表述为部分可观测马尔可夫决策过程（POMDP）。我们提出了一种新型被动延迟测量方法，利用eBPF程序将基于TEID的时间戳串联GTP-U流量，实现低成本延迟估计，无需主动测试。实验显示，DQN代理的表现远超随机基线，平均延迟更低，奖励更稳定，且低延迟路径选择更可靠。这些结果展示了人工智能驱动控制在B5G核心网络中的有效性，以及强化学习在现代网络管理中的潜力。
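
The passive measurement idea in this abstract, pairing timestamps that share a TEID instead of injecting probe traffic, can be illustrated in user-space Python. This is not eBPF code. The event format `(teid, observation point, timestamp)` and the `"in"`/`"out"` labels are hypothetical stand-ins for whatever the eBPF programs record.

```python
from typing import Dict, Iterable, List, Tuple

Event = Tuple[int, str, int]  # (TEID, observation point, timestamp in ns)


def passive_delays(events: Iterable[Event]) -> Dict[int, List[int]]:
    """Pair each 'in' timestamp with the next 'out' timestamp for the same TEID."""
    pending: Dict[int, int] = {}
    delays: Dict[int, List[int]] = {}
    for teid, point, ts in events:
        if point == "in":
            pending[teid] = ts
        elif point == "out" and teid in pending:
            delays.setdefault(teid, []).append(ts - pending.pop(teid))
    return delays
```

Per-path delay estimates of this kind could then serve as part of the observation fed to the path-selection agent. An `"out"` event with no matching `"in"` is ignored rather than guessed at.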

Grounding Multi-Hop Reasoning in Structural Causal Models via Group Relative Policy Optimization

通过群体相对策略优化，在结构因果模型中扎根多跳推理

Authors: Yunhan Bu, Quan Zhang, Huaping Zhang, Guotong Geng, Chunxiao Gao, Askar Hamdulla, Juan Wang, Qiuchi Li, Baohua Zhang, Shuai Lei, Yunbo Cao, Zhunchen Luo
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.01482
Pdf link: https://arxiv.org/pdf/2605.01482
Abstract Multi-Hop Fact Verification (MHFV) necessitates complex reasoning across disparate evidence, posing significant challenges for Large Language Models (LLMs) which often suffer from hallucinations and fractured logical chains. Existing methods, while improving transparency via Chain-of-Thought (CoT), lack explicit modeling of the causal dependencies between evidence and claims. In this work, we introduce a novel framework that grounds reasoning in a Structural Causal Model (SCM), treating verification as a constructive causal inference process. We empirically identify an "inverted U-shaped" correlation between reasoning chain length and accuracy, revealing that excessive structural complexity degrades performance. To address this, we propose a Rule-based Reinforcement Learning strategy using Group Relative Policy Optimization (GRPO). This approach dynamically optimizes the trade-off between structural depth and conciseness. Extensive experiments on HoVer and EX-FEVER demonstrate that our SCM-GRPO framework significantly outperforms state-of-the-art baselines, offering a reliable and interpretable solution for complex fact verification.
中文摘要 多跳事实验证（MHFV）需要在不同证据之间进行复杂的推理，这对大型语言模型（LLMs）构成了重大挑战，因为大型语言模型常常存在幻觉和逻辑链断裂的问题。现有方法虽然通过思维链（Chain-of-Thought，CoT）提高了透明度，但缺乏对证据与主张之间因果依赖性的明确建模。本研究引入了一个新框架，将推理建立在结构因果模型（SCM）之上，将验证视为一种建设性的因果推断过程。我们通过实证识别出链条长度与准确性之间的“倒U形”相关性，揭示了过度的结构复杂性会降低性能。为此，我们提出了一种基于规则的强化学习策略，采用群体相对策略优化（Group Relative Policy Optimization，GRPO）。这种方法动态优化了结构深度与简洁性的权衡。在 HoVer 和 EX-FEVER 上的广泛实验表明，我们的 SCM-GRPO 框架远超最先进的基线，为复杂事实验证提供了可靠且可解释的解决方案。
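
A rule-based reward with group-relative advantages, as named in this abstract, can be sketched as follows. The group normalization is the standard GRPO form (reward minus group mean, divided by group standard deviation). The reward rule itself, including the target chain length and the penalty weight, is a hypothetical illustration of trading depth against conciseness.

```python
import statistics
from typing import List


def rule_reward(correct: bool, chain_len: int, target_len: int = 4, penalty: float = 0.1) -> float:
    """Hypothetical rule: 1 for a correct verdict, minus a penalty for straying from a target length."""
    return (1.0 if correct else 0.0) - penalty * abs(chain_len - target_len)


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the other samples drawn for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

A length term that peaks at a moderate chain length mirrors the inverted U-shaped relation reported in the abstract: chains that are too short or too long both lose reward relative to the rest of their group.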

SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning

SciResearcher：扩展深度研究代理以实现前沿科学推理

Authors: Tianshi Zheng, Rui Wang, Xiyun Li, Yangqiu Song, Tianqing Fang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.01489
Pdf link: https://arxiv.org/pdf/2605.01489
Abstract Frontier scientific reasoning is rapidly emerging as a key foundation for advancing AI agents in automated scientific discovery. Deep research agents offer a promising approach to this challenge. These models develop robust problem-solving capabilities through post-training on information-seeking tasks, which are typically curated via knowledge graph construction or iterative web browsing. However, these strategies face inherent limitations in frontier science, where domain-specific knowledge is scattered across sparse and heterogeneous academic sources, and problem solving requires sophisticated computation and reasoning far beyond factual recall. To bridge this gap, we introduce SciResearcher, a fully automated agentic framework for frontier-science data construction. SciResearcher synthesizes diverse conceptual and computational tasks grounded in academic evidence, while eliciting information acquisition, tool-integrated reasoning, and long-horizon capabilities. Leveraging the curated data for supervised fine-tuning and agentic reinforcement learning, we develop SciResearcher-8B, an agent foundation model that achieves 19.46% on the HLE-Bio/Chem-Gold benchmark, establishing a new state of the art at its parameter scale and surpassing several larger proprietary agents. It further achieves 13-15% absolute gains on SuperGPQA-Hard-Biology and TRQA-Literature benchmarks. Overall, SciResearcher introduces a new paradigm for automated data construction for frontier scientific reasoning and offers a scalable path toward future scientific agents.
中文摘要 前沿科学推理正迅速成为推动人工智能代理在自动化科学发现中发展的关键基础。深度研究代理为这一挑战提供了有前景的解决方案。这些模型通过对信息寻求任务的后期训练，培养出强大的问题解决能力，这些任务通常通过知识图谱构建或反复网页浏览来整理。然而，这些策略在前沿科学中存在固有局限，领域特定知识分散在稀疏且异质的学术来源中，问题解决需要远超事实回忆的复杂计算和推理能力。为弥合这一差距，我们引入了SciResearcher，一个全自动化的前沿科学数据构建代理框架。SciResearcher 综合基于学术证据的多样概念和计算任务，同时培养信息获取、工具整合推理和长期视野能力。利用精心策划的数据进行监督式微调和代理强化学习，我们开发了SciResearcher-8B代理基础模型，在HLE-Bio/Chem-Gold基准测试中实现了19.46%的成功率，在其参数尺度上树立了新的尖端技术，超越了多个更大型专有代理。此外，在SuperGPQA-Hard-Biology和TRQA-Literature基准测试中实现了13-15%的绝对提升。总体而言，SciResearcher为前沿科学推理的自动化数据构建引入了新的范式，并为未来科学代理提供了可扩展的路径。

Protein-Conditioned Multi-Objective Reinforcement Learning for Full-Length mRNA Design

全长mRNA设计中的蛋白质条件多目标强化学习

Authors: Zixi Shao, Tao Wang, Yibei Xiao, Tianyi Huang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.01513
Pdf link: https://arxiv.org/pdf/2605.01513
Abstract Designing therapeutic messenger RNA (mRNA) requires creating full-length transcripts that carefully balance stability, translation efficiency, and immune safety. To address this challenge, we propose ProMORNA, a multi-objective generation framework that produces complete mRNA transcripts de novo directly from a target protein sequence. Our approach begins by training a BART-style encoder-decoder model on over 6 million natural protein-mRNA pairs. We then introduce Multi-Objective Group Relative Policy Optimization (MO-GRPO) to simultaneously optimize for various biological objectives in a unified way. As a case study, we evaluated ProMORNA on the widely used firefly luciferase target, excluding it from both our supervised training data and the prompt pool. The results indicate that ProMORNA improves the in silico Pareto frontier for predicted half-life and translation efficiency relative to standard supervised baselines. Additionally, it achieves higher predicted functional scores than a state-of-the-art baseline under the same evaluation pipeline. These computational findings demonstrate the feasibility of using multi-objective reinforcement learning for full-length mRNA design on unseen targets.
中文摘要 设计治疗信使RNA（mRNA）需要创建全长转录本，在稳定性、翻译效率和免疫安全性之间取得平衡。为应对这一挑战，我们提出了ProMORNA，一种多目标生成框架，可直接从目标蛋白序列生成完整的mRNA转录本。我们的方法首先在超过600万对天然蛋白质-mRNA上训练BART式编码-解码模型。随后，我们引入了多目标群体相对策略优化（MO-GRPO），以统一方式同时优化多个生物目标。作为案例研究，我们评估了ProMORNA在广泛使用的萤火虫荧光素酶靶标上，将其排除在监督训练数据和提示池中。结果表明，ProMORNA相较于标准监督基线，提升了\textit{in silico}帕累托前沿的预测半衰期和翻译效率。此外，在相同评估流程下，其预测功能评分高于最先进的基线。这些计算发现表明，在未可见靶点上使用多目标强化学习进行全长mRNA设计的可行性。
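
The Pareto frontier mentioned in this abstract is the set of candidates that no other candidate beats on both objectives at once. A minimal sketch for two objectives that are both maximized follows. The tuple layout and function names are assumptions, and the numbers in the usage note are made up for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (predicted half-life, predicted translation efficiency)


def dominates(a: Point, b: Point) -> bool:
    """a dominates b if it is at least as good on every objective and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

For the made-up candidates `[(1, 5), (2, 4), (3, 3), (2, 2), (1, 1)]`, the frontier is the first three points. Improving the frontier, as the abstract claims, means producing candidates that dominate points on the baseline's frontier.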

Dynamics Distillation for Efficient and Transferable Control Learning

动力学蒸馏以实现高效且可转移的控制学习

Authors: Xunjiang Gu, Kashyap Chitta, Mahsa Golchoubian, Vladimir Suplin, Igor Gilitschenski
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.01516
Pdf link: https://arxiv.org/pdf/2605.01516
Abstract Robust control policy learning for autonomous driving requires training environments to be both physically realistic and computationally scalable, properties that existing simulators provide only in isolation. We introduce Sim2Sim2Sim, a framework that bridges high-fidelity vehicle simulation and scalable reinforcement learning by distilling simulator dynamics into a highly parallelizable learned dynamics model. By training control policies purely within this distilled environment and deploying them back into the high-fidelity source simulator, we demonstrate more efficient policy optimization and reliable transfer under challenging dynamics. We further show that predictive accuracy alone does not fully characterize a learned dynamics model's suitability as a reinforcement learning training environment, which should also be assessed by the quality of the policies it enables.
中文摘要 自动驾驶的稳健控制策略学习要求训练环境既具物理真实性又具备计算可扩展性，而现有模拟器仅能单独具备这些特性。我们介绍Sim2Sim2Sim，这是一个通过将模拟器动力学提炼成高度可并行化的学习动力学模型，桥接高保真车辆模拟与可扩展强化学习的框架。通过在这种纯粹的精炼环境中训练控制策略并将其部署回高保真源模拟器，我们展示了在复杂动态环境下更高效的策略优化和更可靠的传输能力。我们还进一步表明，仅靠预测准确性并不能完全描述已学习动力学模型作为强化学习培训环境的适用性，因此还应通过其所支持的策略质量来评估。

VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation

VAnim：用于结构保持向量动画的渲染感知稀疏状态建模

Authors: Guotao Liang, Zhangcheng Wang, Chuang Wang, Juncheng Hu, Haitao Zhou, Junhua Liu, Jing Zhang, Dong Xu, Qian Yu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.01517
Pdf link: https://arxiv.org/pdf/2605.01517
Abstract Scalable Vector Graphics (SVG) animation generation is pivotal for professional design due to its structural editability and resolution independence. However, this task remains challenging as it requires bridging discrete code representations with continuous visual dynamics. Existing optimization-based methods often destroy topological consistency, while general-purpose LLMs rely on rigid CSS/SMIL transformations, failing to model geometry-level non-rigid deformations. To address these limitations, we present VAnim, the first LLM-based framework for open-domain text-to-SVG animation. We reconceptualize animation not as sequence generation, but as Sparse State Updates (SSU) on a persistent SVG DOM tree. This paradigm compresses sequence length by over 9.8x while preserving the SVG DOM structure and non-participating elements by construction. To enable precise control, we propose an Identification-First Motion Planning mechanism that grounds textual instructions in explicit visual entities. Furthermore, to overcome the non-differentiable nature of SVG rendering, we employ Rendering-Aware Reinforcement Learning via Group Relative Policy Optimization (GRPO). By leveraging a hybrid reward from a state-of-the-art video perception encoder, we align discrete code updates with high-fidelity visual feedback. We also introduce SVGAnim-134k, the first benchmark for vector animation. Extensive experiments demonstrate that VAnim significantly outperforms state-of-the-art baselines in semantic alignment and structural validity, with additional appendix metrics further validating motion quality and identity preservation.
中文摘要 可扩展矢量图形（SVG）动画生成因其结构可编辑性和分辨率独立性，对专业设计至关重要。然而，这一任务依然具有挑战性，因为它需要将离散代码表示与连续的视觉动态连接起来。现有基于优化的方法常常破坏拓扑一致性，而通用大型语言模型依赖刚性CSS/SMIL变换，无法模拟几何级的非刚性变形。为解决这些局限性，我们介绍了VAnim，这是首个基于大语言模型的开放领域文本转SVG动画框架。我们将动画重新定义为对持久SVG DOM树的稀疏状态更新（SSU）。该范式在保持 SVG DOM 结构和非参与元素的同时，将序列长度压缩超过 9.8 倍。为实现精确控制，我们提出了一种识别优先的运动规划机制，将文本指令建立在明确的视觉实体之上。此外，为了克服SVG渲染的不可微性，我们采用了通过组相对策略优化（GRPO）进行渲染感知强化学习。通过利用最先进的视频感知编码器提供的混合奖励，我们将离散的代码更新与高保真视觉反馈对齐。我们还推出了SVGAnim-134k，这是矢量动画的首个基准测试。大量实验表明，VAnim 在语义对齐和结构效度方面显著优于最先进的基线，附录指标进一步验证了运动质量和身份保护。
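
The group-relative step behind GRPO can be sketched in a few lines. This is a minimal illustration, not VAnim's training code: the reward values are made up, and using the population standard deviation with a small `eps` is an assumption.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Score each sample relative to the other samples drawn for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rendering-based rewards for four candidate updates of one prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Because the baseline is the group mean, no learned value function is needed, and the reward only has to be computable (for example from rendered frames), not differentiable.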

MIRL: Mutual Information-Guided Reinforcement Learning for Vision-Language Models

MIRL：视觉语言模型中的互信息引导强化学习

Authors: Yin Zhang, Jiaxuan Zhao, Zonghan Wu, Zengxiang Li, Junfeng Fang, Kun Wang, Qingsong Wen, Yilei Shao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.01520
Pdf link: https://arxiv.org/pdf/2605.01520
Abstract Vision-Language Models (VLMs) frequently suffer from visual perception errors and hallucinations that compromise answer accuracy in complex reasoning tasks. Reinforcement Learning with Verifiable Rewards (RLVR) offers a promising solution by optimizing policies using answer correctness signals. Despite their effectiveness, prevailing RLVR methods face two critical limitations. First, much of the sampling budget is wasted on trajectories doomed to fail due to early visual description errors. Second, sparse rewards cannot distinguish whether failures stem from visual perception or reasoning stages. We introduce MIRL, a decoupled framework that addresses both limitations by leveraging mutual information (MI) between generated descriptions and visual inputs as a cheap pre-screening signal. This enables intelligent budget allocation toward high-potential trajectories via forking, while decoupled training provides independent MI-based rewards for visual perception optimization, resolving reward blindness. Experiments on six vision-language reasoning benchmarks demonstrate that MIRL achieves 70.22% average accuracy and successfully surpasses the performance of sampling 16 complete trajectories using only 10 pre-samples with top-6 selection (25% fewer complete trajectories). Our code is available at: this https URL.
中文摘要 视觉语言模型（VLM）经常存在视觉感知错误和幻觉，这会影响复杂推理任务中的回答准确性。带可验证奖励的强化学习（RLVR）通过利用答案正确性信号优化策略，提供了一个有前景的解决方案。尽管有效，现行的RLVR方法仍面临两个关键局限。首先，大量采样预算浪费在因早期视觉描述错误而注定失败的轨迹上。其次，稀疏的奖励无法区分失败源于视觉感知还是推理阶段。我们介绍了MIRL，这是一种解耦框架，通过利用生成描述与视觉输入之间的互信息（MI）来解决这两个局限性，作为廉价的预筛选信号。这使得通过分叉实现高潜力轨迹的智能预算分配，而解耦训练则提供基于独立的基于智能的奖励以优化视觉感知，解决奖励盲点。在六个视觉-语言推理基准测试上的实验表明，MIRL的平均准确率达到70.22%，并成功超越仅用10个前期样本和前6名选择抽样16条完整轨迹的表现（减少了25%的完整轨迹）。我们的代码可在以下 https URL 获取。

Hybrid Quantum Reinforcement Learning with QAOA for Improved Vehicle Routing Optimization

结合QAOA的混合量子强化学习，提升车辆路径优化

Authors: T. Satyanarayana Murthy, B. Swathi Sowmya, Santhosh Voruganti, Sai Varshini Giridi, Chaitanyya Pratap Agarwal, Vanteddu Akshitha
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.01574
Pdf link: https://arxiv.org/pdf/2605.01574
Abstract The Vehicle Routing Problem (VRP) is one of the most complex NP-hard combinatorial optimization problems in transportation and logistics, and it requires a dynamic solution approach. In this paper we present a new hybrid approach that integrates the Quantum Approximate Optimization Algorithm (QAOA) into the Quantum Reinforcement Learning (QRL) policy network, using QAOA mixing and cost Hamiltonian layers in place of the usual variational layers. This enhancement enables the agent to exploit problem-specific quantum correlations when learning policies, and thus to explore the routing solution space more richly. The QAOA-augmented QRL framework converges faster in training and can tackle larger VRP instances that are beyond the reach of Grover's Adaptive Search (GAS) and QRL approaches. Experiments on standard VRP instances demonstrate better solutions, fewer episodes to converge, and good memory usage on near-term quantum hardware simulators. These findings demonstrate QAOA-integrated QRL as a viable approach to scalable, high-quality quantum-assisted combinatorial optimization.
中文摘要 车辆路由问题（VRP）是交通和物流领域最复杂的NP难组合优化问题之一，需要采用动态求解方法。本文提出了一种新的混合方法，将量子近似优化算法（QAOA）结合进QRL策略网络，而非通常的变分层、QAOA混合和成本哈密顿层。这种增强使智能体能够利用特定问题的量子相关性来学习策略，从而更丰富地探索路由解空间。QAOA增强的QRL框架在训练中收敛速度更快，能够应对Grover自适应搜索（GAS）和量子强化学习（QRL）方法无法覆盖的大型VRP实例。在标准VRP实例上的实验展示了更好的解决方案，收敛的集数更少，并且在近期量子硬件模拟器上能很好地利用内存。这些发现表明QAOA集成QRL作为一种可扩展、高质量的量子辅助组合优化方法。

TRIMMER: A New Paradigm for Video Summarization through Self-Supervised Reinforcement Learning

TRIMMER：通过自我监督强化学习实现视频摘要的新范式

Authors: Pritam Mishra, Coloma Ballester, Dimosthenis Karatzas
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.01659
Pdf link: https://arxiv.org/pdf/2605.01659
Abstract The rapid growth of video content across domains such as surveillance, education, and social media has made efficient content understanding increasingly critical. Video summarization addresses this challenge by generating concise yet semantically meaningful representations, but existing approaches often rely on expensive manual annotations, struggle to generalize across domains, and incur significant computational costs due to complex architectures. Moreover, unsupervised and weakly supervised methods typically underperform compared to supervised counterparts in capturing long-range temporal dependencies and semantic structure. In this work, we propose TRIMMER (Temporal Relative Information Maximization for Multi-objective Efficient Reinforcement), a novel self-supervised reinforcement learning framework for video summarization. TRIMMER operates in two stages: it first learns robust representations via self-supervised learning and then performs spatio-temporal decision making through reinforcement learning guided by information-theoretic reward functions. Unlike prior approaches that rely on similarity-based objectives, our method introduces entropy-based metrics to capture higher-order temporal dynamics and semantic diversity, while computing rewards directly over selected frame indices to improve computational efficiency. Extensive experiments on standard benchmarks demonstrate that TRIMMER achieves state-of-the-art performance among unsupervised and self-supervised methods, while remaining competitive with leading supervised approaches, highlighting its effectiveness for scalable and generalizable video summarization.
中文摘要 视频内容在监控、教育和社交媒体等多个领域的快速增长，使得高效理解内容变得愈发重要。视频摘要通过生成简洁且语义有意义的表示来应对这一挑战，但现有方法往往依赖昂贵的手工注释，难以跨域推广，且由于复杂的架构，计算成本较高。此外，无监督和弱监督方法在捕捉长距离时间依赖性和语义结构方面通常表现不如有监督方法。本研究提出TRIMMER（多目标高效强化时序相对信息最大化），这是一种用于视频摘要的自监督强化学习框架。TRIMMER分为两个阶段：首先通过自我监督学习学习稳健表征，然后通过信息理论奖励函数引导的强化学习进行时空决策。与以往依赖相似性目标的方法不同，我们的方法引入了基于熵的度量以捕捉高阶时间动态和语义多样性，同时直接对选定的帧索引进行奖励，以提高计算效率。对标准基准的大量实验表明，TRIMMER在无监督和自监督方法中达到了最先进的性能，同时在领先监督方法中保持竞争力，凸显了其在可扩展和通用视频摘要方面的有效性。

Towards Efficient and Expressive Offline RL via Flow-Anchored Noise-conditioned Q-Learning

通过流锚噪声条件Q学习实现高效且富有表现力的离线强化学习

Authors: Sungyoung Lee, Dohyeong Kim, Eshan Balachandar, Zelal Su Mustafaoglu, Keshav Pingali
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.01663
Pdf link: https://arxiv.org/pdf/2605.01663
Abstract We propose Flow-Anchored Noise-conditioned Q-Learning (FAN), a highly efficient and high-performing offline reinforcement learning (RL) algorithm. Recent work has shown that expressive flow policies and distributional critics improve offline RL performance, but at a high computational cost. Specifically, flow policies require iterative sampling to produce a single action, and distributional critics require computation over multiple samples (e.g., quantiles) to estimate value. To address these inefficiencies while maintaining high performance, we introduce FAN. Our method employs a behavior regularization technique that utilizes only a single flow policy iteration and requires only a single Gaussian noise sample for distributional critics. Our theoretical analysis of convergence and performance bounds demonstrates that these simplifications not only improve efficiency but also lead to superior task performance. Experiments on robotic manipulation and locomotion tasks demonstrate that FAN achieves state-of-the-art performance while significantly reducing both training and inference runtimes. We release our code at this https URL.
中文摘要 我们提出了流锚噪声条件Q-学习（FAN），一种高效且高效的离线强化学习（RL）算法。近期研究表明，表达式流策略和分布式批评能提升离线强化学习性能，但计算成本较高。具体来说，流策略需要迭代抽样来产生单一动作，而分布批判策略则需要对多个样本（例如分位数）进行计算来估算价值。为了在保持高性能的同时解决这些低效问题，我们引入了FAN。我们的方法采用了仅使用单次流策略迭代的行为正则化技术，并且仅需要一个高斯噪声样本作为分布批评者。我们对收敛和性能界限的理论分析表明，这些简化不仅提高了效率，还带来了更优越的任务表现。机器人操作和移动任务的实验表明，FAN在显著缩短训练和推理运行时间的同时，实现了最先进的性能。我们以这个 https URL 发布代码。

Zero-Shot, Safe and Time-Efficient UAV Navigation via Potential-Based Reward Shaping, Control Lyapunov and Barrier Functions

通过基于潜能的奖励形塑、控制李雅普诺夫和屏障功能实现零发射、安全且高效无人机导航

Authors: Ashik Abrar Naeem, Mohammad Ariful Haque
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.01787
Pdf link: https://arxiv.org/pdf/2605.01787
Abstract Autonomous navigation and obstacle avoidance remain core challenges for modern Unmanned Aerial Vehicles (UAVs). While traditional control methods struggle with the complexity and variability of the environment, reinforcement learning (RL) enables UAVs to learn adaptive behaviors through interaction with the environment. Existing RL research prioritizes mission success at the expense of mission time and UAV safety. This study integrates Potential-Based Reward Shaping (PBRS) with Control Lyapunov Functions (CLF) and Control Barrier Functions (CBF) to simultaneously optimize mission time and ensure formal safety guarantees. An RL model is trained in a simple, generalized environment and then used, without further training, in complex scenarios together with a CLF-CBF-QP filter. Experimental results in simulated environments demonstrate a significant reduction in mission time and outstanding performance in complex environments.
中文摘要 自主导航和障碍物规避仍是现代无人机（UAV）的核心挑战。传统控制方法难以应对环境的复杂性和变异性，强化学习（RL）使无人机能够通过与环境的交互学习适应性行为。现有的强化学习研究以牺牲任务时间和无人机安全为代价，优先考虑任务成功。本研究将基于势的奖励塑造（PBRS）与控制李雅普诺夫函数（CLF）和控制屏障函数（CBF）整合，以同时优化任务时间并确保正式的安全保障。强化学习模型在一个通用的简单环境中训练，然后在复杂场景中使用，包含 CLF-CBF-QP 滤波器，无需进一步训练。模拟环境中的实验结果显示任务时间显著缩短，在复杂环境中表现出色。
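
Potential-based reward shaping adds the term F(s, s') = γΦ(s') − Φ(s) to the environment reward. The sketch below illustrates only that standard form. The potential (negative Euclidean distance to the goal), the discount factor and the coordinates are assumptions chosen for illustration, not the paper's design.

```python
import math

GAMMA = 0.99  # discount factor (assumed value)

def potential(position, goal):
    """Hypothetical potential: negative Euclidean distance to the goal."""
    return -math.dist(position, goal)

def shaped_reward(base_reward, position, next_position, goal, gamma=GAMMA):
    """Add the potential-based shaping term F = gamma * Phi(s') - Phi(s)."""
    shaping = gamma * potential(next_position, goal) - potential(position, goal)
    return base_reward + shaping

goal = (10.0, 0.0, 5.0)
start = (0.0, 0.0, 5.0)
toward = shaped_reward(0.0, start, (1.0, 0.0, 5.0), goal)   # one step closer
away = shaped_reward(0.0, start, (-1.0, 0.0, 5.0), goal)    # one step farther
```

A step toward the goal earns a larger shaped reward than a step away from it, which is how a distance-based potential can encourage shorter missions. Shaping of this form is known to leave the optimal policy of the underlying task unchanged.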

MAGIC: Multi-Step Advantage-Gated Causal Influence for Multi-agent Reinforcement Learning

MAGIC：多步优势门槛因果影响，用于多智能体强化学习

Authors: Haohan Yu, Jinmiao Cong, Shengzhi Wang, Lu Wang, Chanjuan Liu
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.01805
Pdf link: https://arxiv.org/pdf/2605.01805
Abstract A key challenge in multi-agent reinforcement learning (MARL) lies in designing learning signals that effectively promote coordination among agents. Designing such signals necessitates the ability to quantify the true, long-term causal influence between agents. To address this, we introduce Multi-step Advantage-Gated Interventional Causal MARL (MAGIC), a framework that extracts multi-step causal influences between agents and selectively converts them into intrinsic rewards. MAGIC uses causal intervention with conditional mutual information to quantify long-horizon agent influence, and introduces an advantage-based gating mechanism to ensure exploration is directed toward beneficial, goal-aligned behaviors. Experiments across multiple standard MARL benchmarks and task families, including MPE and SMAC/SMACv2, demonstrate that MAGIC outperforms state-of-the-art methods by a significant margin, achieving an improvement of at least 10.1% in the main evaluation metric.
中文摘要 多智能体强化学习（MARL）中的一个关键挑战在于设计能够有效促进智能体间协调的学习信号。设计此类信号需要能够量化代理间真实的长期因果影响。为此，我们引入了多步优势门控介入因果MARL（MAGIC）框架，该框架提取代理间多步因果影响，并选择性转化为内在奖励。MAGIC利用条件互信息的因果干预来量化长期代理的影响，并引入基于优势的门槛机制，确保探索指向有益且目标一致的行为。涵盖多个标准MARL基准和任务家族（包括MPE和SMAC/SMACv2）的实验表明，MAGIC在主要评估指标上显著优于最先进方法，提升至少10.1%。

Selector-Guided Autonomous Curriculum for One-Shot Reinforcement Learning from Verifiable Rewards

选择者引导的自主课程，实现一次性强化学习，基于可验证奖励

Authors: Rudray Dave, Vedang Dubey, Smit Deoghare, Sudhakar Mishra
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.01823
Pdf link: https://arxiv.org/pdf/2605.01823
Abstract Recently, Reinforcement Learning from Verifiable Rewards (RLVR) has been established as a highly effective technique for augmenting the math reasoning skills of Large Language Models (LLMs) based on a single instance. Current state-of-the-art 1-shot RLVR models adopt heuristics for selecting instances, mostly based on historical variance in rewards, which we find to be inherently misleading as a measure of transferability value. In this paper, we propose a Selector-Guided Autonomous Curriculum (SGAC) approach, which employs a learnable selector model on a multi-dimensional feature space consisting of success probability, reward variance, output disagreement (entropy), and semantic difficulty level, instead of the static reward variance heuristic. In our empirical evaluation on pools of candidate problems, we observed that output disagreement, rather than reward variance, is the strongest predictor of reasoning gains in subsequent iterations. Leveraging this finding, we develop an autonomous curriculum algorithm for dynamically siphoning candidate problems from a large pool, ranking them by the learned selector, and running micro-bursts of 1-shot GRPO. Our framework is evaluated using the Hendrycks MATH benchmark, with the Qwen2.5-Math-1.5B model serving as the baseline. Our framework obtains an accuracy of 68.0% on the hold-out dataset, which is better than the accuracy obtained from the state-of-the-art model, 64.0%, as well as the 1-shot RLVR checkpoint proposed by Wang et al., which achieved an accuracy of 66.0%. The results confirm that entropy-based intelligent data curation leads to strict reasoning improvement over static training methods, particularly in severely limited data conditions.
中文摘要 最近，基于可验证奖励的强化学习（RLVR）已被确立为一种基于单一实例增强大型语言模型（LLM）数学推理能力的高效技术。当前最先进的单次RLVR模型采用启发式方法选择实例，主要基于奖励的历史方差，但我们认为这在衡量可转移价值方面具有误导性。本文提出一种选择者引导自主课程（SGAC）方法，采用可学习的选择者模型，基于多维特征空间，包括成功概率、奖励方差、输出不一致（熵）和语义难度，而非静态奖励方差启发式。在我们对候选问题池的实证评估中，我们观察到，输出分歧而非奖励方差，是后续迭代推理进步的最强预测因子。基于这一发现，我们开发了一种自主课程算法，用于动态从大池中抽取候选问题，按学习的选择器排序，并运行一次随机GRPO的微突发。我们的框架采用Hendrycks MATH基准进行评估，以Qwen2.5-Math-1.5B模型为基线。我们的框架在保留数据集上获得了68.0%的准确率，优于最先进模型的64.0%以及Wang等人提出的单次RLVR检查点的66.0%准确率。结果证实，基于熵的智能数据管理在严格推理上优于静态训练方法，尤其是在极度受限的数据条件下。
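
The abstract lists output disagreement (entropy) as a selector feature but does not define it. One plausible reading, assumed here, is the Shannon entropy of the final answers across several sampled outputs for the same problem. The answer strings below are invented.

```python
import math
from collections import Counter

def answer_entropy(answers):
    """Shannon entropy (in bits) of the empirical distribution of sampled answers."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical final answers from four sampled outputs per problem.
agree = answer_entropy(["42", "42", "42", "42"])   # every sample agrees
split = answer_entropy(["42", "41", "42", "41"])   # samples split evenly
```

Under this reading, a problem on which the model's samples disagree scores higher and would rank earlier in the curriculum.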

RMGAP: Benchmarking the Generalization of Reward Models across Diverse Preferences

RMGAP：针对不同偏好的奖励模型推广基准测试

Authors: Yangyang Zhou, Yi-Chen Li
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.01831
Pdf link: https://arxiv.org/pdf/2605.01831
Abstract Reinforcement Learning from Human Feedback has become the standard paradigm for language model alignment, where reward models directly determine alignment effectiveness. In this work, we focus on how to evaluate the generalizability of reward models. By "generalizability", we mean the ability of RMs to correctly rank responses to align with diverse user preferences. However, existing reward model benchmarks are typically designed around a universal preference, failing to assess this generalization. To address this critical gap, we introduce RMGAP, a benchmark comprising 1,097 instances across Chat, Writing, Reasoning, and Safety domains. Since different users exhibit diverse preferences for the same task, we first generate four distinct responses with different linguistic profiles for each collected prompt. However, the original prompt set lacks the specificity to convey different preferences. We therefore construct tailored prompts by contrasting these candidates and designing scenarios in which one response becomes the uniquely appropriate choice. Moreover, we observe that users often express the same preference using different phrasings, and thus extend each prompt with two paraphrased variants. Our evaluation of 24 state-of-the-art RMs reveals their substantial limitations: even the best RM achieves only 49.27% Best-of-N accuracy, highlighting considerable room for improvement in reward model generalization. Related data and code are available at this https URL.
中文摘要 从人类反馈中获得强化学习已成为语言模型对齐的标准范式，其中奖励模型直接决定对齐的有效性。本研究重点探讨如何评估奖励模型的推广性。所谓“可概括性”，我们指的是 RM 能够正确排序回答，以符合多样化用户偏好的能力。然而，现有的奖励模型基准通常围绕普遍偏好设计，未能评估这一泛化。为弥补这一关键缺口，我们推出了RMGAP基准，涵盖聊天、写作、推理和安全领域共计1097个实例。由于不同用户对同一任务有不同的偏好，我们首先为每个收集的提示生成四个具有不同语言特征的响应。然而，原始提示集缺乏具体性，难以传达不同的偏好。因此，我们通过对比这些候选题并设计情景，使某个回答成为唯一合适的选择，构建了定制化的提示。此外，我们观察到用户常用不同的表达方式表达相同的偏好，因此每个提示词都扩展了两个意译版本。我们对24个最先进的条件值的评估揭示了它们的显著局限性：即使是最好的条件值，也只能达到49.27%的N最佳准确率，凸显了奖励模型泛化还有很大改进空间。相关数据和代码可在此 https URL 获取。

Quality-Aware Exploration Budget Allocation for Cooperative Multi-Agent Reinforcement Learning

合作多智能体强化学习的质量感知探索预算分配

Authors: Dahyun Oh, Minhyuk Yoon, H.Jin Kim
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.01865
Pdf link: https://arxiv.org/pdf/2605.01865
Abstract Cooperative multi-agent reinforcement learning (MARL) requires agents to discover joint strategies in a combinatorially large state-action space, yet effective coordination configurations are exceedingly rare. Intrinsic motivation, which augments task rewards with novelty bonuses, is a popular approach for driving exploration, but its effectiveness hinges on the exploration intensity $\beta$, where too large a value overwhelms the task signal and causes coordination collapse, while too small a value prevents discovery of rare strategies. We address two complementary challenges: adapting $\beta$ globally over training, and allocating the exploration budget across agents whose intrinsic reward signals vary in reliability. Our framework combines a return-conditioned sigmoid schedule (RCB) for global intensity control with a per-agent Reward Signal Quality (RSQ) metric that concentrates the exploration budget on agents with reliable signals. The core insight is that agents receiving noisy intrinsic rewards should explore less aggressively, and this allocation can be determined automatically from signal-to-noise statistics. Successor Distance (SD), a quasimetric intrinsic reward, naturally produces distinguishable per-agent signal quality, completing the framework with convergence and ordering preservation guarantees. On seven cooperative benchmarks (MPE, SMAX, MABrax), our method achieves top-tier returns across all environments.
中文摘要 合作多智能体强化学习（MARL）要求智能体在组合规模庞大的状态-行动空间中发现联合策略，但有效的协调配置极为罕见。内在动机通过新奇奖励来增强任务奖励，是推动探索的流行方法，但其有效性依赖于探索强度的$\beta$，即过大值会压倒任务信号，导致协调崩溃，而值过小则阻碍发现稀有策略。我们解决了两个互补的挑战：在训练中全球调整$\beta$，以及在内在奖励信号可靠性不一的代理之间分配探索预算。我们的框架结合了返回条件Sigmoid计划（RCB）用于全局强度控制，与每个智能体的奖励信号质量（RSQ）指标，将探索预算集中在信号可靠代理上。核心见解是，接收噪声内在奖励的代理应减少激进探索，这种分配可通过信噪统计自动确定。后继距离（SD）是一种准指标的内在奖励，自然产生可区分的每个代理信号质量，并通过收敛和排序的保持保证完成了框架。在七个合作基准测试（MPE、SMAX、MABrax）上，我们的方法在所有环境中都实现了顶级回报。

Chart-FR1: Visual Focus-Driven Fine-Grained Reasoning on Dense Charts

图表-FR1：视觉聚焦驱动的细粒度推理，聚焦于密集图表

Authors: Hongkun Pan, Yuwei Wu, Wanyi Hong, Shenghui Hu, Qitong Yan, Yi Yang, Rufei Han, Changju Zhou, Minfeng Zhu, Dongming Han, Wei Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.01882
Pdf link: https://arxiv.org/pdf/2605.01882
Abstract Multimodal large language models (MLLMs) have shown considerable potential in chart understanding and reasoning tasks. However, they still struggle with high information density (HID) charts characterized by multiple subplots, legends, and dense annotations due to three major challenges: (1) limited fine-grained perception results in the omission of critical visual cues; (2) redundant or noisy visual information undermines the performance of multimodal reasoning; (3) lack of adaptive deep reasoning relative to the amount of visual information. To tackle these challenges, we present a novel focus-driven fine-grained chart reasoning model, Chart-FR1, to improve perception, focusing efficiency, and adaptive deep reasoning on HID charts. Specifically, we propose Focus-CoT, a visual focusing chain-of-thought that enhances fine-grained perception by explicitly linking reasoning steps to key visual cues, such as local image regions and OCR signals. Building on this, we introduce Focus-GRPO, a focus-driven reinforcement learning algorithm with an information-efficiency reward that compresses redundant visual information for efficient focusing, and an adaptive KL penalty mechanism that enables flexible control over reasoning depth as more visual cues are discovered. Furthermore, to fill the gap in benchmarks for HID charts, we build HID-Chart, a challenging benchmark with an information-density metric designed to evaluate fine-grained chart reasoning capabilities. Extensive experiments on multiple chart benchmarks demonstrate that Chart-FR1 outperforms state-of-the-art MLLMs in chart understanding and reasoning. Code is available at this https URL.
中文摘要 多模态大型语言模型（MLLMs）在图表理解和推理任务中展现出相当的潜力。然而，由于三大挑战，它们在处理由多子图、图例和密集注释组成的高信息密度（HID）图表中仍然存在困难：（1）细致感知有限导致关键视觉线索缺失;（2）冗余或噪声的视觉信息削弱了多模态推理的表现;（3）相对于视觉信息量缺乏适应性深度推理能力。为应对这些挑战，我们提出了一种新颖的聚焦驱动细粒度图表推理模型Chart-FR1，旨在提升HID图表上的感知、聚焦效率和自适应深度推理能力。具体来说，我们提出了Focus-CoT，一种视觉聚焦思维链，通过明确将推理步骤与关键视觉线索（如局部图像区域和OCR信号）联系起来，增强细粒度感知。在此基础上，我们介绍了Focus-GRPO，一种聚焦驱动的强化学习算法，具有信息效率奖励，压缩冗余视觉信息以实现高效聚焦，以及一种自适应的认知惩罚机制，随着发现更多视觉线索，实现对推理深度的灵活控制。此外，为了填补HID图表基准的空白，我们构建了HID-Chart，这是一个具有信息密度指标的挑战性基准，旨在评估细粒度的图表推理能力。多项图表基准的广泛实验表明，Chart-FR1在图表理解和推理方面优于最先进的MLLMs。代码可在此 https URL 访问。

Moira: Language-driven Hierarchical Reinforcement Learning for Pair Trading

Moira：语言驱动的层级强化学习用于配对交易

Authors: Polydoros Giannouris, Yuechen Jiang, Lingfei Qian, Yuyan Wang, Xueqing Peng, Jimin Huang, Guojun Xiong, Sophia Ananiadou
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2605.01954
Pdf link: https://arxiv.org/pdf/2605.01954
Abstract Many sequential decision-making problems exhibit hierarchical structure, where high-level semantic choices constrain downstream actions and feedback is delayed and ambiguous. Learning in such settings is challenging due to credit assignment: performance degradation may arise from flawed abstractions, suboptimal execution, or their interaction. We study this challenge through pair trading, a domain that naturally combines long-horizon semantic reasoning for asset pair selection with short-horizon execution under partial observability. We formulate pair trading as a hierarchical reinforcement learning problem and propose a language-driven optimization framework in which both high-level and low-level policies are parameterized by large language models (LLMs) and optimized exclusively through prompt updates. Our approach leverages pretrained LLMs as hierarchical policies and uses trajectory- and episode-level textual feedback to adapt abstractions and execution without gradient-based fine-tuning. By explicitly separating abstraction selection from execution, the framework reduces non-stationarity across hierarchical levels and enables targeted adaptation under delayed feedback. Experiments on real-world market data show consistent improvements over traditional and LLM-based baselines, demonstrating the effectiveness of language-driven hierarchical reinforcement learning.
中文摘要 许多顺序决策问题表现出层级结构，高层语义选择限制后续动作，反馈延迟且模糊。由于学分分配，学习具有挑战性：性能下降可能源于抽象缺陷、执行不优或它们之间的相互作用。我们通过配对交易来研究这一挑战，该领域自然结合了资产配对选择的长视野语义推理与部分可观测性的短期执行。我们将配对交易提出为一种分层强化学习问题，并提出了一个语言驱动的优化框架，其中高层和低级策略均由大型语言模型（LLM）参数化，并仅通过提示更新进行优化。我们的方法利用预训练的大型语言模型作为分层策略，利用轨迹级和事件级文本反馈来调整抽象和执行，无需基于梯度的微调。通过明确将抽象选择与执行分离，该框架减少了层级间的非平稳性，并支持在延迟反馈下的有针对性适应。基于真实市场数据的实验显示，相较于传统和基于大型语言模型的基线，持续有改善，证明了语言驱动的层级强化学习的有效性。

Multi-User Dueling Bandits: A Fair Approach using Nash Social Welfare

多用户决斗强盗：利用纳什社会福利的公平方法

Authors: Maheed H. Ahmed, Mahsa Ghasemi
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.01961
Pdf link: https://arxiv.org/pdf/2605.01961
Abstract Learning from human preference data is becoming a useful tool, from fine-tuning large language models to training reinforcement learning agents. However, in most scenarios, the model is trained on the average preference of all human evaluators, which, under large variations of preferences, can be unfair to minority groups. In this work, we consider fairness in dueling bandits, a standard framework for online learning from preference data. We assume that each user has a (potentially distinct) Condorcet winner, which is an arm preferred to every other arm. Using these user-specific Condorcet winners as reference points, we evaluate and score arms according to their performance relative to the corresponding winner. To promote fairness across heterogeneous users, we adopt the well-established Nash Social Welfare objective, which maximizes the product of user utilities, thereby inherently penalizing inequality and preventing the marginalization of any single user. Within this framework, we construct a hard instance to establish a regret lower bound of $\Omega(T^{2/3}\min(K,D)^\frac{1}{3})$ for a time horizon $T$, $K$ arms, and $D$ users, which, to the best of our knowledge, is the first result quantifying the cost of fairness in dueling bandits with heterogeneous preferences. We then present the Fair-Explore-Then-Commit and Fair-$\epsilon$-Greedy algorithms with a Condorcet winner identification phase. We further derive their regret upper bounds that match the lower-bound dependence on $T$ up to logarithmic factors.
中文摘要 从人类偏好数据中学习正成为一种有用的工具，从微调大型语言模型到训练强化学习代理。然而，在大多数情况下，模型是基于所有人类评估者的平均偏好进行训练的，在偏好差异较大的情况下，这对少数群体可能不公平。在本研究中，我们探讨了决斗强盗中的公平性，这是基于偏好数据进行在线学习的标准框架。我们假设每个用户都有一个（可能不同的）孔多塞赢家，即一个臂优先于其他所有手。以这些用户专属的Condorcet获奖者为参考点，我们根据其相对于对应获奖者的表现来评估和评分。为了促进异质用户之间的公平，我们采用了广为人知的纳什社会福利目标，最大化用户效用的产积，从而本质上惩罚不平等，防止任何单一用户被边缘化。在此框架下，我们构建了一个硬性实例，以建立一个时间范围$T$、$K$臂和$D$用户的遗憾下界$\Omega（T^{2/3}\min（K，D）^\frac{1}{3}）$，据我们所知，这是首个量化对抗偏好异质强盗时公平成本的结论。随后，我们介绍了公平探索然后承诺和公平\ε\ε$-贪婪算法，并采用孔多塞赢家识别阶段。我们进一步推导出它们的遗憾上界，这些上界与对$T$的下界依赖关系相匹配，且为对数因子。
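
A small numeric comparison shows why a product-of-utilities objective penalizes inequality where an average does not. The utilities below are invented, and the paper's actual utility definition (performance relative to each user's Condorcet winner) is not reproduced here.

```python
import math

def nash_social_welfare(utilities):
    """Product of user utilities (assumes every utility is non-negative)."""
    return math.prod(utilities)

def average_welfare(utilities):
    """Arithmetic mean of user utilities."""
    return sum(utilities) / len(utilities)

# Hypothetical per-user utilities of two arms across three users.
arm_a = [0.9, 0.9, 0.1]  # strong for two users, poor for the third
arm_b = [0.6, 0.6, 0.6]  # moderate for every user
```

With these numbers the average prefers `arm_a`, while Nash Social Welfare prefers `arm_b`, because a single near-zero utility drags the whole product down.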

AdamO: A Collapse-Suppressed Optimizer for Offline RL

AdamO：离线强化学习的抑制崩溃优化器

Authors: Nan Qiao, Sheng Yue, Shuning Wang, Ju Ren
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.01968
Pdf link: https://arxiv.org/pdf/2605.01968
Abstract Offline reinforcement learning (RL) can fail spectacularly when bootstrapped temporal-difference (TD) updates amplify their own errors, driving the critic toward extreme and unusable Q-values. A key counterintuitive insight of this work is that collapse is not only a property of the backup rule or network architecture: optimizer dynamics themselves can directly trigger or suppress instability. From a control-theoretic viewpoint, we model offline TD learning as a feedback system and analyze Adam-based critic updates. This yields a necessary and sufficient condition for stability of the induced local update dynamics: within the regime we analyze, these dynamics are stable if and only if the spectral radius of the corresponding update operator is strictly below one. Further analysis suggests that standard Adam updates can inadvertently distort the parameter geometry, motivating explicit orthogonality constraints to prevent TD error amplification. To this end, we propose AdamO, an Adam-based optimizer with a decoupled orthogonality correction regulated by a strict task-alignment budget. We prove that this design theoretically guarantees worst-case task safety and preserves Adam's continuous-time dissipative dynamics. Empirically, AdamO is broadly compatible with diverse offline RL baselines, improving stability and returns across a broad suite of benchmarks.
中文摘要 当自助式时间差（TD）更新放大自身错误时，离线强化学习（RL）可能会严重失败，导致批评者走向极端且无法使用的Q值。这项工作的一个关键反直觉见解是，崩溃不仅仅是备份规则或网络架构的属性：优化器动态本身可以直接触发或抑制不稳定。从控制理论的角度，我们将离线TD学习建模为反馈系统，并分析基于亚当的批评更新。这给出了诱导局部更新动态稳定性的必要且充分条件：在我们分析的区域内，这些动态稳定当且仅当相应更新算符的谱半径严格小于1时。进一步分析表明，标准的亚当更新可能无意中扭曲参数几何，促使显式正交约束以防止TD误差放大。为此，我们提出了AdamO，一个基于Adam的优化器，具有解耦的正交修正，并由严格的任务对齐预算调节。我们证明该设计理论上保证了最坏情况下任务的安全，并保持了亚当连续时间耗散动力学。从经验角度看，AdamO广泛兼容多种离线强化学习基线，提升了广泛基准测试的稳定性和回报。
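
The spectral-radius condition can be illustrated on a generic linear iteration x ← Mx, which contracts to zero exactly when the spectral radius of M is below one. The 2x2 matrices below are arbitrary stand-ins. The abstract does not give AdamO's actual update operator, so none is assumed here.

```python
import cmath

def spectral_radius_2x2(m):
    """Largest eigenvalue magnitude of a 2x2 matrix, via the characteristic polynomial."""
    (a, b), (c, d) = m
    trace, det = a + d, a * d - b * c
    root = cmath.sqrt(trace * trace - 4 * det)
    return max(abs((trace + root) / 2), abs((trace - root) / 2))

def iterate(m, x, steps):
    """Apply x <- M x repeatedly and return the final vector."""
    for _ in range(steps):
        x = (m[0][0] * x[0] + m[0][1] * x[1],
             m[1][0] * x[0] + m[1][1] * x[1])
    return x

stable = ((0.5, 0.2), (0.1, 0.6))     # eigenvalues 0.7 and 0.4
unstable = ((1.05, 0.2), (0.0, 0.9))  # eigenvalues 1.05 and 0.9

x_stable = iterate(stable, (1.0, 1.0), 200)
x_unstable = iterate(unstable, (1.0, 1.0), 200)
```

After 200 steps the first iterate has shrunk toward zero and the second has grown by orders of magnitude, which mirrors the error amplification the abstract describes.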

Stability of Control Lyapunov Function Guided Reinforcement Learning

控制稳定性李雅普诺夫函数引导强化学习

Authors: Zachary Olkin, William D. Compton, Aaron D. Ames
Subjects: Systems and Control (eess.SY); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.01978
Pdf link: https://arxiv.org/pdf/2605.01978
Abstract Reinforcement learning (RL) has become the de facto method for achieving locomotion on humanoid robots in practice, yet stability analysis of the corresponding control policies is lacking. Recent work has attempted to merge control theoretic ideas with reinforcement learning through control guided learning. A notable example of this is the use of a control Lyapunov function (CLF) to synthesize the reinforcement learning rewards, a technique known as CLF-RL, which has shown practical success. This paper investigates the stability properties of optimal controllers using CLF-RL with the goal of bridging experimentally observed stability with theoretical guarantees. The RL problem is viewed as an optimal control problem and exponential stability is proven in both continuous and discrete time using both core CLF reward terms and the additional terms used in practice. The theoretical bounds are numerically verified on systems such as the double integrator and cart-pole. Finally, the CLF guided rewards are implemented for a walking humanoid robot to generate stable periodic orbits.
中文摘要 强化学习（RL）已成为实际实现人形机器人运动的事实方法，但对相应控制策略的稳定性分析仍然不足。近期研究尝试通过控制引导学习将控制理论思想与强化学习融合。一个显著的例子是使用控制李雅普诺夫函数（CLF）合成强化学习奖励，这种技术称为CLF-RL，已证明在实际中取得了成功。本文研究使用CLF-RL最优控制器的稳定性特性，旨在将实验观测的稳定性与理论保证相结合。强化学习问题被视为最优控制问题，指数稳定性在连续和离散时间内均可通过核心CLF奖励项及实际使用的附加项证明。理论界限在双积分器和车极等系统上进行了数值验证。最后，CLF引导奖励实现了行走类人机器人以产生稳定的周期轨道。

Enhancing Judgment Document Generation via Agentic Legal Information Collection and Rubric-Guided Optimization

通过代理法律信息收集和评分标准引导优化，增强判决文件生成

Authors: Weihang Su, Xuanyi Chen, Yueyue Wu, Qingyao Ai, Yiqun Liu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2605.02011
Pdf link: https://arxiv.org/pdf/2605.02011
Abstract Automating the drafting of judgment documents is pivotal to judicial efficiency, yet it remains challenging due to the dual requirements of comprehensive retrieval of legal information and rigorous logical reasoning. Existing approaches, typically relying on standard Retrieval-Augmented Generation and Supervised Fine-Tuning, often suffer from insufficient evidence recall, hallucinated statutory references, and logically flawed legal reasoning. To bridge this gap, we propose Judge-R1, a unified framework designed to enhance LLM-based judgment document generation by jointly improving legal information collection and judgment document generation. First, we introduce Agentic Legal Information Collection, which employs a dynamic planning agent to retrieve precise statutes and precedents from multiple sources. Second, we implement Rubric-Guided Optimization, a reinforcement learning phase utilizing Group Relative Policy Optimization (GRPO) with a comprehensive legal reward function to enforce adherence to judicial standards and reasoning logic. Extensive experiments on the JuDGE benchmark demonstrate that Judge-R1 significantly outperforms state-of-the-art baselines in both legal accuracy and generation quality.
中文摘要 自动化起草判决文件对于司法效率至关重要，但由于全面检索法律信息和严谨逻辑推理的双重要求，这一过程依然具有挑战性。现有方法通常依赖标准的检索增强生成和监督微调，常常存在证据回忆不足、法定引用的幻觉以及逻辑上存在缺陷的法律推理。为弥合这一差距，我们提出了Judge-R1，这是一个统一框架，旨在通过共同提升法律信息收集和判决文件生成，提升基于LLM的判决文件生成。首先，我们介绍代理性法律信息收集，它利用动态规划代理从多个来源检索精确的法规和判例。其次，我们实施评分标准引导优化，这是一个利用群体相对政策优化（GRPO）的强化学习阶段，并配备全面的法律奖励函数，以强制执行司法标准和推理逻辑。在JuDGE基准测试上的大量实验表明，Judge-R1在法律准确性和生成质量方面均远超最先进基线。

Optimization of CV-QKD Under Practical Constraints

在实际约束下的CV-QKD优化

Authors: Svitlana Matsenko, Amirhossein Ghazisaeidi, Marcin Jarzyna, Konrad Banaszek, Darko Zibar
Subjects: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
Arxiv link: https://arxiv.org/abs/2605.02045
Pdf link: https://arxiv.org/pdf/2605.02045
Abstract Using reinforcement learning, we optimize for practical hardware constraints, including limited FIR filter taps at the transmitter and receiver, mean photon number and finite DAC/ADC resolution. Under these realistic conditions, the proposed approach achieves significant performance improvements.
中文摘要 通过强化学习，我们优化了实际硬件约束，包括发射端和接收端有限的FIR滤波器抽头、平均光子数以及有限的DAC/ADC分辨率。在这些现实条件下，该方法实现了显著的性能提升。

Coopetition-Gym v1: A Formally Grounded Platform for Mixed-Motive Multi-Agent Reinforcement Learning under Strategic Coopetition

Coopetition-Gym v1：一个在战略合作下实现混合动机多智能体强化学习的正式基础平台

Authors: Vik Pant, Eric Yu
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.02063
Pdf link: https://arxiv.org/pdf/2605.02063
Abstract We present Coopetition-Gym v1, a benchmark platform for mixed-motive multi-agent reinforcement learning under strategic coopetition. The platform comprises twenty environments organized into four mechanism classes that correspond to four foundational technical reports: interdependence and complementarity (arXiv:2510.18802), trust and reputation dynamics (arXiv:2510.24909), collective action and loyalty (arXiv:2601.16237), and sequential interaction and reciprocity (arXiv:2604.01240). Each environment carries a closed-form payoff structure and a calibrated interdependence matrix derived from the corresponding report. Every environment exposes a parameterized reward layer configurable across three structurally distinct modes (private, integrated, cooperative). This separation of payoff from reward enables reward-type ablation, the platform's principal methodological apparatus. Four of the twenty environments are calibrated against historically documented coopetitive relationships and reproduce their outcomes at 98.3, 81.7, 86.7, and 87.3 percent on the validation rubric (Samsung-Sony LCD, Renault-Nissan Alliance, Apache HTTP Server, Apple iOS App Store). The platform exposes Gymnasium, PettingZoo Parallel, and PettingZoo AEC interfaces and ships 126 reference algorithms: 16 learning algorithms, 7 game-theoretic oracles, 2 heuristic baselines, and 101 constant-action policies. A reference experimental study trained the 16 learning algorithms on every environment under every reward configuration with seven random seeds, producing a 25,708-run training corpus and a 1,116-run behavioral audit corpus, both released under CC-BY-4.0 with Croissant 1.0 metadata. Coopetition-Gym v1 is the first platform to combine continuous-action mixed-motive environments, parameterized reward mutuality, calibrated interdependence coefficients, game-theoretic oracle baselines, and validated case studies.
中文摘要 我们介绍Coopetition-Gym v1，这是一个基于战略合作的混合动机多智能体强化学习的基准平台。该平台包含二十个环境，分为四个机制类别，对应四项基础性技术报告：相互依存与互补性（arXiv：2510.18802）、信任与声誉动态（arXiv：2510.24909）、集体行动与忠诚（arXiv：2601.16237）以及顺序交互与互惠（arXiv：2604.01240）。每个环境都包含一个封闭形式的收益结构和由对应报告导出的校准相互依赖矩阵。每个环境都暴露出一个参数化的奖励层，可配置于三种结构上不同的模式（私密、集成、合作模式）。这种收益与奖励的分离使得奖励型消融成为可能，这是平台的主要方法论工具。20个环境中有4个根据历史记录的合作关系校准，并在验证评分标准上分别重现了98.3%、81.7%、86.7和87.3%的结果（三星-索尼LCD、雷诺-日产联盟、Apache HTTP Server、Apple iOS App Store）。该平台展示了Gymnasium、PettingZoo Parallel和PettingZoo AEC接口，并发布126个参考算法：16个学习算法、7个博弈论预言机、2个启发式基线和101个恒定动作策略。一项参考实验研究在每个奖励配置下的每个环境下用七个随机种子训练了16个学习算法，产生了25,708次运行的训练语料库和1,116次运行的行为审计语料库，均以CC-BY-4.0及Croissant 1.0元数据发布。Coopetition-Gym v1 是首个结合连续动作混合动机环境、参数化奖励互惠、校准相互依赖系数、博弈论预言基线和验证案例研究的平台。

Enhanced LLM Reasoning by Optimizing Reward Functions with Search-Driven Reinforcement Learning

通过优化奖励函数与搜索驱动强化学习，增强LLM推理能力

Authors: Arash Ahmadi, Sarah Sharif, Yaser (Mike) Banad
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.02073
Pdf link: https://arxiv.org/pdf/2605.02073
Abstract Mathematical reasoning is a key benchmark for large language models. Reinforcement learning is a standard post-training mechanism for improving the reasoning capabilities of large language models, yet performance remains sensitive to the design of the reward function that drives policy optimization. This paper introduces a search-driven framework that treats the reward specification itself as an object of optimization. The setting of interest is one in which the base model is held fixed and the reward specification is the primary remaining design lever. Candidate reward functions are generated by a frontier language model, validated automatically, screened through 500-step Group Relative Policy Optimization (GRPO) training runs on a Llama-3.2-3B-Instruct base model with Low-Rank Adaptation (LoRA), and ranked by F1 on the GSM8K test set. Ranked summaries from prior rounds are then fed back into the next round of generation. Over five rounds, the search produces 50 candidate rewards. The mean F1 rises from 0.596 in Round 1 to 0.632 in Round 5, and the top individual reward reaches F1 = 0.787. Seven ensemble configurations of top-ranked rewards are evaluated. The best ensemble achieves F1 = 0.795 (95% bootstrap CI [0.756, 0.832]) and accuracy 0.660 [0.635, 0.686], a 0.19 absolute F1 gain over a base-rewards-only GRPO baseline (F1 = 0.609). Pairwise McNemar tests with Bonferroni correction show all five-or-more-reward configurations are statistically indistinguishable at $\alpha = 0.05/21$. A three-seed re-training of the best ensemble yields F1 of 0.785. A randomly drawn 5-reward control collapses to F1 = 0.047, which shows that the ranked-feedback loop, not the additive signal of having more rewards, drives the gain.
中文摘要 数学推理是大型语言模型的关键基准。强化学习是提升大型语言模型推理能力的标准训练后机制，但性能仍对驱动策略优化的奖励函数设计保持敏感。本文引入了一个以搜索为驱动的框架，将奖励规范本身视为优化对象。关注的设定是基础模型固定，奖励规格是主要的设计杠杆。候选奖励函数由前沿语言模型生成，自动验证，并通过带有低秩适应（LoRA）的Llama-3.2-3B-Instruct基础模型上的500步组相对策略优化（GRPO）训练进行筛选，并在GSM8K测试集中通过F1排序。前几轮的排名总结会反馈到下一轮生成中。在五轮中，搜索会产生50个候选奖励。平均F1分数从第一轮的0.596上升到第五轮的0.632，个人最高奖金达到F1 = 0.787。评估了七种顶级奖励的集合配置。最佳集合可实现F1 = 0.795（95% 自助置信区间[0.756， 0.832]）和准确率0.660[0.635， 0.686]，相较于仅基础奖励的GRPO基线（F1 = 0.609）绝对F1提升0.19。结合Bonferroni修正的两两McNemar检验显示，所有五个或以上的奖励配置在{\alpha} = 0.05/21时在统计上无法区分。对最佳集合进行三种子再训练，得出F1为0.785。随机抽取的5奖励控制会崩解为F1 = 0.047，这表明排名反馈循环而非奖励增加的加法信号驱动了收益。
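
The significance threshold 0.05/21 in the abstract above appears to follow from a Bonferroni correction over all unordered pairs of the seven ensemble configurations. The abstract does not state where 21 comes from, so the pairing below is a plausible reading, not a confirmed detail of the paper.

```python
from math import comb

n_configs = 7         # ensemble configurations reported in the abstract
family_alpha = 0.05   # family-wise significance level

n_pairs = comb(n_configs, 2)             # unordered pairs of configurations
per_test_alpha = family_alpha / n_pairs  # Bonferroni-corrected threshold

print(n_pairs, per_test_alpha)
```

Under this reading, each pairwise McNemar test is compared against roughly 0.0024 instead of 0.05.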

Graph Transformers and Stabilized Reinforcement Learning for Large-Scale Dynamic Routing Modulation and Spectrum Allocation in Elastic Optical Networks

图变换器和稳定强化学习用于弹性光学网络中大规模动态路由调制和频谱分配

Authors: Michael Doherty, Alejandra Beghelli, Laura Toni
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2605.02075
Pdf link: https://arxiv.org/pdf/2605.02075
Abstract Reinforcement learning (RL) has been widely applied to dynamic routing, modulation and spectrum assignment (RMSA) in optical networks, yet no prior work has trained a transformer model for this task. We attribute this to the high data and compute requirements of transformers and potential training instabilities with RL. We address this gap by combining recent advances from the machine learning literature (rotary positional encodings for graph-structured data, off-policy invalid action masking, and valid mass regularization) with GPU-accelerated simulation to achieve, for the first time, stable RL training of a transformer for dynamic RMSA. We demonstrate, through systematic benchmarking against previous RL methods and heuristic algorithms, that ours is the first RL method to exceed all benchmarks, increasing the supportable traffic load by up to 13%. To demonstrate the scalability of our approach, we train on real network topologies from the TopologyBench database up to 143 nodes and 362 links, with 320 x 12.5 GHz frequency slot units per link, and 100 Gbps traffic requests. To our knowledge, these are the largest dynamic RMSA problems to which RL has been applied. We find up to 4% increased traffic load can be supported at low blocking probability (<0.1%) with our method compared to the best available benchmark algorithm. We present an ablation study of the components of our training algorithm, the dynamics of the loss function during training, and analyze the allocation decisions of the trained models. We make all code used to produce this paper openly available for reproduction and future benchmarking: this https URL.
中文摘要 强化学习（RL）已被广泛应用于光网络中的动态路由、调制和频谱分配（RMSA），但此前尚无研究为该任务训练过变压器模型。我们将此归因于变换器对数据和计算的需求较高，以及强化学习可能存在的训练不稳定性。我们通过结合机器学习文献的最新进展（如图结构化数据的旋转位置编码、非策略无效动作掩蔽和有效的质量正则化）与GPU加速仿真，首次实现动态RMSA变换器的稳定强化学习。通过系统地对以往强化学习方法和启发式算法进行基准对比，我们证明了我们的方法是首个超越所有基准测试的强化学习方法，将可支持的流量负载提高了多达13%。为了展示我们方法的可扩展性，我们在TopologyBench数据库中的真实网络拓扑上训练，最多可达143个节点和362条链路，每个链路有320 x 12.5GHz的频槽单元，以及100Gbps的流量请求。据我们所知，这些是强化学习应用中最大的动态RMSA问题。我们发现，与最佳基准算法相比，我们的方法在低阻断概率（<0.1\%）下，最多可支持4%的流量负载增加。我们对训练算法的组成部分进行了消融研究，分析了训练过程中损失函数的动态，并分析了训练模型的分配决策。我们公开发布了所有用于完成本文的代码，供复制和未来基准测试使用：这个 https URL。
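
Invalid action masking, one of the ingredients listed in the abstract above, is commonly implemented by setting the logits of infeasible actions to negative infinity before the softmax. The sketch below shows that basic form only; it is not the off-policy variant or the valid mass regularizer from the paper, and the logits and mask are made-up values.

```python
import math


def masked_softmax(logits, valid):
    """Softmax restricted to valid actions; invalid actions get probability 0."""
    masked = [l if ok else -math.inf for l, ok in zip(logits, valid)]
    peak = max(masked)
    if peak == -math.inf:
        raise ValueError("at least one action must be valid")
    exps = [math.exp(l - peak) for l in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical example: four candidate (path, slot) actions, two of them blocked.
probs = masked_softmax([2.0, 1.0, 0.5, 3.0], [True, False, True, False])
print(probs)
```

The policy can then never sample a blocked allocation, and the remaining probability mass is renormalized over the feasible ones.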

Reinforcement Learning Trained Observer Control for Bearings-Only Tracking

强化学习训练观察者控制，仅用方位跟踪

Authors: Branko Ristic, Sanjeev Arulampalam
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.02120
Pdf link: https://arxiv.org/pdf/2605.02120
Abstract This paper develops a deep reinforcement learning based observer control policy for autonomous bearings-only tracking of a moving target. The observer manoeuvre problem is formulated as a belief Markov decision process, where the belief state is represented by the posterior of a cubature Kalman filter (CKF). The reward function is designed to address two conflicting objectives: minimising the absolute target position estimation error (Euclidean distance) and maintaining CKF estimation consistency (Mahalanobis distance). The reward is formulated as a geometric interpolation between the two objectives on the Pareto front, parametrised by a weighting factor $\beta \in [0,1]$. The policy is implemented as a deep Q-network (DQN) trained over 50,000 episodes. Performance is evaluated over 5,000 Monte Carlo episodes and compared against two baselines: the perpendicular-to-bearing heuristic and the D-optimal Fisher information maximisation criterion. The results show that the DQN policy at $\beta = 0.7$ achieves the best trade-off between accuracy and robustness: it matches the information-theoretic baseline on mean tracking accuracy while reducing the worst-case error by nearly a factor of ten, owing to the implicit filter-consistency regularisation provided by the Mahalanobis term in the reward.
中文摘要 本文开发了基于深度强化学习的观察者控制策略，用于仅自主跟踪移动目标。观察者操作问题被表述为信念马尔可夫决策过程，其中信念态由立方卡尔曼滤波器（CKF）的后验表示。奖励函数旨在解决两个相互冲突的目标：最小化绝对目标位置估计误差（欧几里得距离）和保持CKF估计一致性（马哈拉诺比斯距离）。奖励被表述为帕累托前沿两个目标之间的几何插值，参数化为加权因子 $\beta \in [0,1]$。该政策以深度Q网络（DQN）形式实施，训练超过5万集。在5000个蒙特卡洛事件中评估性能，并与两个基线比较：垂直方位启发式和D最优费舍尔信息最大化准则。结果显示，在$\beta = 0.7$时，DQN策略在准确性和鲁棒性之间实现了最佳权衡：它与信息理论的平均跟踪基准相匹配，同时将最坏情况误差降低近十倍，这得益于奖励中Mahalanobis项所提供的隐含滤波器一致性正则化。
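
One way to read the "geometric interpolation" between the two objectives in the abstract above is a weighted geometric mean of the Euclidean and Mahalanobis error terms. The sketch below follows that reading. The functional form, the sign convention, and which term receives the exponent $\beta$ are all assumptions, since the abstract does not give the formula.

```python
def tracking_reward(position_error, mahalanobis_distance, beta):
    """Negative weighted geometric mean of two error terms (assumed form)."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    # Assumption: beta weights accuracy, (1 - beta) weights filter consistency.
    cost = (position_error ** beta) * (mahalanobis_distance ** (1.0 - beta))
    return -cost


print(tracking_reward(4.0, 9.0, 1.0))  # only the position error counts
print(tracking_reward(4.0, 9.0, 0.0))  # only the Mahalanobis distance counts
print(tracking_reward(4.0, 9.0, 0.5))  # equal-weight geometric mean
```

With this form, intermediate values of $\beta$ trade off the two terms multiplicatively, so a large value in either term still lowers the reward.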

Hierarchical Cooperative MARL for Joint Downlink PRB and Power Allocation in a 5G System

5G系统中用于联合下行PRB和功率分配的分层合作MARL

Authors: Alireza Ebrahimi Dorcheh, Tolunay Seyfi, Ryan Barker, Fatemeh Afghah
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2605.02149
Pdf link: https://arxiv.org/pdf/2605.02149
Abstract Efficient downlink radio resource management in 5G requires jointly optimizing user scheduling and transmit-power allocation under time-varying wireless conditions. This is challenging in OFDMA systems because PRB assignment is combinatorial, power allocation is continuous, and performance depends on channel evolution, link adaptation, and long-term fairness. We propose a hierarchical cooperative multi-agent reinforcement learning framework with staged curriculum training for joint downlink PRB and power allocation in a physically grounded 5G environment. System-level simulation is implemented in Sionna, while Sionna RT supports wireless scene construction and mobility-aware ray-traced channel generation. The control task is decomposed into two sequential stages: a PRB agent learns user-level resource shares, which are converted to exact PRB assignments by a deterministic channel-aware quota resolver, and a power agent distributes the base-station power budget across users and their assigned PRB-symbol resources. The framework operates in a cross-layer loop with adaptive modulation and coding, HARQ feedback, outer-loop link adaptation, and a fairness-aware reward based on smoothed throughput and Jain's fairness index. Training stability is improved through a three-phase curriculum for PRB allocation, power control, and joint fine-tuning. Under matched channel realizations, we compare against a PF scheduler with equal-power transmission and two ablations isolating the learned PRB and power-control components. Results show that both learned components improve throughput distribution relative to PF, while the full PRB and power controller achieves the largest cell-throughput gain with only a modest reduction in Jain's fairness index.
中文摘要 在5G中高效的下行无线资源管理需要在时变无线条件下共同优化用户调度和发射功率分配。这在OFDMA系统中具有挑战性，因为PRB分配是组合性的，功率分配是连续的，性能依赖于信道演进、链路适配和长期公平性。我们提出了一个分层协作多智能体强化学习框架，采用分阶段课程培训，用于物理接地的5G环境中联合下行PRB和功率分配。系统级仿真在 Sionna 实现，而 Sionna RT 支持无线场景构建和具备移动感知的光线追踪通道生成。控制任务被分解为两个顺序阶段：PRB代理学习用户级资源份额，这些份额由确定性通道感知配额解析器转换为精确的PRB分配;电力代理则将基站的电力预算分配给用户及其分配的PRB符号资源。该框架运行于跨层循环中，具备自适应调制与编码、HARQ反馈、外环链路适配，以及基于平滑吞吐量和Jain公平指数的公平感奖励。训练稳定性通过三阶段课程提升，包括PRB分配、功率控制和关节微调。在匹配信道实现下，我们与具有等功率传输和两个消融隔离所学PRB和功率控制组件的PF调度器进行比较。结果显示，这两个学到的组件相对于PF都能改善吞吐量分布，而完整的PRB和功率控制器在仅略微降低Jain公平指数的情况下，实现了最大的单元吞吐量提升。
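
Jain's fairness index, used in the reward and the evaluation above, is $J = (\sum_i x_i)^2 / (n \sum_i x_i^2)$ for per-user throughputs $x_i$. It equals 1 when all users receive the same throughput and $1/n$ when a single user receives everything. The throughput values below are illustrative only.

```python
def jain_index(throughputs):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2)."""
    n = len(throughputs)
    total = sum(throughputs)
    squares = sum(x * x for x in throughputs)
    if n == 0 or squares == 0:
        raise ValueError("need at least one non-zero throughput")
    return total * total / (n * squares)


print(jain_index([5.0, 5.0, 5.0, 5.0]))   # perfectly even split
print(jain_index([10.0, 0.0, 0.0, 0.0]))  # one user takes everything
```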

Combining Trained Models in Reinforcement Learning

强化学习中的训练模型结合

Authors: Ujjwal Patil, Javad Ghofrani
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/2605.02159
Pdf link: https://arxiv.org/pdf/2605.02159
Abstract Deep reinforcement learning (DRL) has delivered strong results in domains such as Atari and Go, but it still suffers from high sample cost and weak transfer beyond the training setting. A common response is to reuse information from previously trained models through transfer, distillation, ensemble methods, or federated training instead of learning each target task from random initialization. The literature on these mechanisms is fragmented, and published comparisons are hard to interpret because tasks, baselines, and compute budgets differ. This paper presents a PRISMA-guided systematic review of empirical studies on pretrained knowledge reuse in DRL. Starting from 589 records retrieved from IEEE Xplore, the ACM Digital Library, and citation tracing, we screened 570 unique records and assessed 89 full texts. After applying the final eligibility criteria, 15 empirical studies remained in the main synthesis. We analyzed them qualitatively across three factors: source-target similarity, diversity among reused models, and the fairness of comparisons against from-scratch baselines. Three patterns recur across the surviving corpus. First, positive results are concentrated in settings where source and target tasks share substantial structure or where the method includes an explicit gating or alignment mechanism. Second, evidence for ensembles and federated aggregation is promising but sparse and mostly limited to narrow settings. Third, compute-matched comparisons are rare, which weakens claims about efficiency gains over stronger single-agent baselines. The paper contributes a narrower and internally consistent review scope, a study-level synthesis of empirical evidence, and a provisional independence spectrum that should be treated as a hypothesis for future benchmarking rather than a validated metric.
中文摘要 深度强化学习（DRL）在雅达利和围棋等领域取得了显著成果，但仍面临高样本成本和训练环境外迁移较弱的问题。一种常见的做法是通过转移、提纯、集合方法或联邦训练，重用先前训练过的模型中的信息，而不是通过随机初始化学习每个目标任务。关于这些机制的文献零散，已发表的比较难以解释，因为任务、基线和计算预算各不相同。本文以PRISMA为指导，系统综述了关于DRL中预训练知识再利用的实证研究。我们从IEEE Xplore、ACM数字图书馆检索的589条记录和引用追踪中筛选了570条独立记录，评估了89条全文。在应用最终资格标准后，主综合中还剩下15项实证研究。我们对它们进行了定性分析，涵盖三个因素：源-目标相似性、重复使用模型之间的多样性，以及与从零基础基线进行比较的公平性。三种模式在现存的语料库中反复出现。首先，积极结果集中在源任务和目标任务结构较为一致或方法包含明确门槛或对齐机制的环境中。其次，关于集合和联邦聚合的证据有前景，但稀少且主要局限于狭窄的环境。第三，计算匹配比较很少见，这削弱了关于效率提升的说法，而相比更强的单一智能体基线。本文提供了更狭窄且内部一致的综述范围、研究层面的实证证据综合，以及一个应作为未来基准测试的假设而非验证指标的临时独立性光谱。
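
The screening counts in the abstract above imply how many records dropped out at each stage. The differences below are derived by subtraction from the four reported totals; the abstract itself does not list them, and the variable names are illustrative.

```python
# Totals reported in the abstract.
retrieved = 589
unique_screened = 570
full_text_assessed = 89
included = 15

# Derived by subtraction (not stated in the abstract).
duplicates_removed = retrieved - unique_screened
excluded_at_screening = unique_screened - full_text_assessed
excluded_at_full_text = full_text_assessed - included

print(duplicates_removed, excluded_at_screening, excluded_at_full_text)
```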

Experience Constrained Hierarchical Federated Reinforcement Learning for Large-scale UAV Teams in Hazardous Environments

在危险环境中，体验受限的层级联合强化学习，适用于大型无人机团队

Authors: Qinwei Huang, Rui Zuo, Simon Khan, Qinru Qiu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.02165
Pdf link: https://arxiv.org/pdf/2605.02165
Abstract Conventional federated learning assumes that greater learner participation improves training performance, by leveraging abundant, independently generated local data. However, in federated reinforcement learning (FRL) for unmanned aerial vehicle (UAV) teams in hazardous environments where experience generation is severely constrained by safety considerations, energy limitations, and mission duration, this assumption may break. This work introduces Experience-Constrained Hierarchical Federated Reinforcement Learning (EC-HFRL), a framework in which clusters act as federated learning agents, while multiple intra-cluster learners represent parallel learning resources that reuse a shared experience pool. We show that increasing participation does not necessarily improve learning performance. Instead, learning performance is strongly associated with experience reuse strategy and the dominance of key analytically identified gradient transition experiences within a cluster. In particular, minibatch size primarily determines effective replay exposure, while higher intra-cluster participation increases reuse level. Empirical results demonstrate that the performance regimes are strongly associated with the structure of the learning signal, rather than federated aggregation effects, clarifying the limited and secondary role of learner participation in experience-constrained FRL.
中文摘要 传统的联邦学习假设更高的学习者参与度通过利用丰富且独立生成的本地数据来提升训练表现。然而，在无人机（UAV）团队在危险环境中进行联邦强化学习（FRL）时，经验生成受到安全考量、能源限制和任务持续时间的严重限制，这一假设可能会被打破。本研究引入了经验约束层级联合强化学习（EC-HFRL），该框架中簇作为联邦学习代理，多个簇内学习者代表并行学习资源，重复利用共享经验池。我们证明，提高参与度并不一定能改善学习表现。相反，学习表现与经验再利用策略以及在集群中关键分析识别的梯度过渡体验的主导地位密切相关。特别是，迷你批次大小主要决定有效的重放暴露，而更高的团内参与度则提高重复使用率。实证结果表明，表现模式与学习信号的结构密切相关，而非联邦聚合效应，这进一步阐明了学习者参与在经验受限FRL中有限且次要的作用。

Planner Matters! An Efficient and Unbalanced Multi-agent Collaboration Framework for Long-horizon Planning

规划者很重要！一个高效且不平衡的多代理协作框架，用于长期规划

Authors: Wenyi Wu, Sibo Zhu, Kun Zhou, Biwei Huang
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2605.02168
Pdf link: https://arxiv.org/pdf/2605.02168
Abstract Language model (LM)-based agents have demonstrated promising capabilities in automating complex tasks from natural language instructions, yet they continue to struggle with long-horizon planning and reasoning. To address this, we propose an enhanced multi-agent framework that decomposes automation into three roles: a planner for high-level decision-making, an actor for task execution, and a memory manager for contextual reasoning. While this modular decomposition aligns with established design patterns, our core contribution lies in a systematic compute-allocation analysis, revealing that planning is the dominant factor influencing task performance. Execution and memory management require significantly less compute and model capacity to achieve competitive results. Building on these insights, we introduce a planner-centric reinforcement learning approach, which exclusively optimizes the planner using trajectory-level rewards from a VLM-as-judge, while freezing the other components. Extensive experiments on benchmarks spanning web navigation, OS control, and tool use demonstrate that concentrating model capacity and learning on high-level planning yields robust and compute-efficient improvements in long-horizon agent automation. Our code is publicly released.
中文摘要 基于语言模型（LM）的智能体已展现出自动化复杂任务的有前景的能力，但它们仍在长期规划和推理方面遇到困难。为此，我们提出了一个增强的多智能体框架，将自动化分解为三个角色：用于高级决策的规划器、执行任务的演员，以及用于上下文推理的内存管理器。虽然这种模块化分解符合既定设计模式，但我们的核心贡献在于系统的计算分配分析，揭示了规划是影响任务表现的主导因素。执行和内存管理所需的计算和模型容量显著减少，以实现竞争效果。基于这些见解，我们引入了以规划者为中心的强化学习方法，专门利用VLM作为评判的轨迹级奖励来优化规划者，同时冻结其他组件。在网页导航、操作系统控制和工具使用的基准测试中进行的大量实验表明，将模型容量和学习集中在高层规划上，能够带来稳健且计算高效的长期代理自动化改进。我们的代码已公开发布。

T$^2$PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning

T$^2$PO：稳定多回合代理强化学习的不确定性引导探索控制

Authors: Haixin Wang, Hejie Cui, Chenwei Zhang, Xin Liu, Shuowei Jin, Shijie Geng, Xinyang Zhang, Nasser Zalmout, Zhenyu Shi, Yizhou Sun
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.02178
Pdf link: https://arxiv.org/pdf/2605.02178
Abstract Recent progress in multi-turn reinforcement learning (RL) has significantly improved the performance of reasoning LLMs on complex interactive tasks. Despite advances in stabilization techniques such as fine-grained credit assignment and trajectory filtering, instability remains pervasive and often leads to training collapse. We argue that this instability stems from inefficient exploration in multi-turn settings, where policies continue to generate low-information actions that neither reduce uncertainty nor advance task progress. To address this issue, we propose Token- and Turn-level Policy Optimization (T$^2$PO), an uncertainty-aware framework that explicitly controls exploration at fine-grained levels. At the token level, T$^2$PO monitors uncertainty dynamics and triggers a thinking intervention once the marginal uncertainty change falls below a threshold. At the turn level, T$^2$PO identifies interactions with negligible exploration progress and dynamically resamples such turns to avoid wasted rollouts. We evaluate T$^2$PO in diverse environments, including WebShop, ALFWorld, and Search QA, demonstrating substantial gains in training stability and task performance together with better exploration efficiency. Code is available at: this https URL.
中文摘要 多回合强化学习（RL）的最新进展显著提升了大型语言模型在复杂交互任务中的推理性能。尽管稳定技术如细粒度信用分配和轨迹过滤取得了进步，不稳定性依然普遍存在，常常导致训练崩溃。我们认为，这种不稳定性源于多回合环境中的低效探索，政策持续产生低信息操作，既不减少不确定性，也不推动任务进展。为解决这个问题，我们提出了代币级和回合级策略优化（T$^2$PO），这是一个不确定性感知框架，明确控制细粒度层面的探索。在代币层面，T$^2$PO监测不确定性动态，并在边际不确定性变化低于阈值时触发思考干预。在回合层面，T$^2$PO识别探索进展微乎其微的互动，并动态重新采样此类回合以避免浪费的推出。我们在包括WebShop、ALFWorld和搜索质量保证在内的多种环境中评估了T$^2$PO，显示出训练稳定性和性能提升显著提升，探索效率也更高。代码可在以下 https URL 获取。
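
The token-level rule described above (intervene once the marginal change in uncertainty drops below a threshold) can be illustrated with a small detector. The sketch uses Shannon entropy of the next-token distribution as the uncertainty measure and an absolute difference between consecutive steps as the marginal change. Both choices, along with the function names and the toy distributions, are assumptions; the abstract does not say how T$^2$PO measures uncertainty.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def first_stall_step(token_dists, threshold):
    """Index of the first step whose entropy change is below threshold, else None."""
    previous = None
    for step, dist in enumerate(token_dists):
        current = entropy(dist)
        if previous is not None and abs(current - previous) < threshold:
            return step
        previous = current
    return None


# Hypothetical next-token distributions over a four-token vocabulary.
dists = [
    [0.25, 0.25, 0.25, 0.25],
    [0.70, 0.10, 0.10, 0.10],
    [0.70, 0.10, 0.10, 0.10],  # uncertainty stops changing here
    [0.97, 0.01, 0.01, 0.01],
]
stall = first_stall_step(dists, threshold=0.05)
print(stall)
```

A training loop could use the returned index as the point at which to insert the thinking intervention.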

Do We Really Need Immediate Resets? Rethinking Collision Handling for Efficient Robot Navigation

我们真的需要立即重置吗？重新思考碰撞处理以实现高效机器人导航

Authors: Shanze Wang, Xinming Zhang, Siwei Cheng, Xianghui Wang, Hailong Huang, Wei Zhang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.02192
Pdf link: https://arxiv.org/pdf/2605.02192
Abstract Should a single collision necessarily terminate an entire navigation episode? In most deep reinforcement learning (DRL) frameworks for robot navigation, this remains the standard practice: every collision immediately triggers a global environment reset and is penalized as a complete task failure. While a collision during deployment naturally indicates task failure, applying the same treatment during training prevents the agent from exploring challenging obstacle configurations, which slows learning progress in the early training phase. In this work, we challenge this convention and propose a Multi-Collision reset Budget (MCB) framework that decouples local collision termination from global environment resets, allowing the agent to retry difficult configurations within the same episode. Experiments on multiple simulated and real-world robotic platforms show that the framework accelerates early-stage exploration and improves both success rate and navigation efficiency over conventional single-collision reset baselines, with a small collision budget producing the largest gains.
中文摘要 单次碰撞是否必然会终止整个导航阶段？在大多数深度强化学习（DRL）机器人导航框架中，这仍然是标准做法：每次碰撞都会立即触发全局环境重置，并视为任务完全失败而惩罚。虽然部署期间的碰撞自然表示任务失败，但在培训时采用相同处理方式则阻止了智能体探索具有挑战性的障碍配置，从而在早期训练阶段减缓学习进展。在本研究中，我们挑战这一惯例，提出了一种多重碰撞重置预算（MCB）框架，该框架将局部碰撞终止与全局环境重置解耦，使智能体能够在同一事件内重试困难配置。在多个模拟和真实世界机器人平台上的实验表明，该框架加快了早期探索，并比传统的单次碰撞重置基线提高了成功率和导航效率，且在较小的碰撞预算下实现了最大的收益。
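
The decoupling described above can be sketched as a per-episode counter that decides, at each step, whether training continues, retries locally, or performs a global reset. The class name, the three return labels, and the rule that reaching the goal also ends the episode are assumptions made for illustration. How MCB repositions the robot for a local retry is not described in the abstract.

```python
class CollisionBudgetEpisode:
    """Tracks collisions in one episode and decides when a global reset is due."""

    def __init__(self, budget):
        if budget < 1:
            raise ValueError("budget must be at least 1")
        self.budget = budget
        self.collisions = 0

    def on_step(self, collided, reached_goal):
        """Return 'continue', 'local_retry', or 'global_reset'."""
        if reached_goal:
            return "global_reset"
        if collided:
            self.collisions += 1
            if self.collisions >= self.budget:
                return "global_reset"  # budget exhausted: end the episode
            return "local_retry"       # keep the episode, retry the hard spot
        return "continue"


episode = CollisionBudgetEpisode(budget=3)
events = [episode.on_step(c, False) for c in [False, True, True, False, True]]
print(events)
```

Setting `budget=1` reproduces the conventional behavior in which the first collision triggers a global reset.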

ARGUS: Policy-Adaptive Ad Governance via Evolving Reinforcement with Adversarial Umpiring

ARGUS：通过进化强化与对抗性裁判实现政策适应性广告治理

Authors: Deyi Ji, Junyu Lu, Xuanyi Liu, Liqun Liu, Hailong Zhang, Peng Shu, Huan Yu, Jie Jiang, Tianru Chen, Lanyun Zhu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.02200
Pdf link: https://arxiv.org/pdf/2605.02200
Abstract Online advertising governance faces significant challenges due to the non-stationary nature of regulatory policies, where emerging mandates (e.g., restrictions on education or aesthetic anxiety) create severe label inconsistencies and reasoning ambiguities in historical datasets. In this paper, we propose ARGUS, a policy-adaptive governance system that enables evolving reinforcement through multi-agent adversarial umpiring. ARGUS addresses the sparsity of new policy data by employing a three-stage framework: (1) Policy Seeding for initial perception; (2) Adversarial Label Rectification, which utilizes a "Prosecutor-Defender-Umpire" architecture to resolve conflicts between stale labels and new mandates; and (3) Latent Knowledge Discovery, which employs a tripartite dialectical discussion to unearth sophisticated, "gray-area" violations. By leveraging RAG-enhanced policy knowledge and Chain-of-Thought synthesis as dynamic rewards for reinforcement learning, ARGUS synchronizes its reasoning pathways with evolving regulations. Extensive experiments on both industrial and public datasets demonstrate that ARGUS significantly outperforms traditional fine-tuning baselines, achieving superior policy-adaptive learning with minimal gold data.
中文摘要 由于监管政策的非固定性，在线广告治理面临重大挑战，新兴的强制措施（如教育限制或审美焦虑）导致历史数据集中严重的标签不一致和推理模糊。本文提出ARGUS，一种策略适应治理系统，通过多代理对抗裁判实现不断演进的强化。ARGUS通过采用三阶段框架解决新政策数据的稀缺性：（1）初始感知的政策种子;（2）对抗性标签纠正，利用“检察官-辩护人-裁判”架构解决陈旧标签与新授权之间的冲突;以及（3）潜在知识发现，采用三方辩证讨论，揭示复杂的“灰色地带”违规。通过利用RAG增强的政策知识和思维链综合作为强化学习的动态奖励，ARGUS使其推理路径与不断演变的法规同步。在工业和公共数据集上的大量实验表明，ARGUS显著优于传统微调基线，以极少的黄金数据实现了更优越的政策适应学习。

Break the Block: Dynamic-size Reasoning Blocks for Diffusion Large Language Models via Monotonic Entropy Descent with Reinforcement Learning

打破障碍：通过单调熵下降与强化学习实现扩散大型语言模型的动态规模推理块

Authors: Yan Jiang, Ruihong Qiu, Zi Huang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.02263
Pdf link: https://arxiv.org/pdf/2605.02263
Abstract Recent diffusion large language models (dLLMs) have demonstrated both effectiveness and efficiency in reasoning via a block-based semi-autoregressive generation paradigm. Despite this progress, fixed-size block generation remains a critical bottleneck for effective and coherent reasoning. First, from a global perspective, different reasoning tasks correspond to different optimal decoding block sizes, which makes a "one-size-fits-all" assumption ineffective. Second, even within a single reasoning task, rigid block partitioning breaks the logical flow and reduces reasoning coherence. Through empirical observations, we reveal that for block-wise entropy, incorrect reasoning exhibits a fluctuating and unsteady trend between blocks, whereas correct reasoning follows a consistent descending trend. Therefore, this paper proposes b1, a novel post-training framework for dLLMs that learns dynamic-size reasoning blocks via a Monotonic Entropy Descent objective with reinforcement learning to enhance reasoning coherence. b1 integrates seamlessly as a plug-and-play module with existing dLLM post-training algorithms. Extensive experiments across various reasoning benchmarks showcase b1's consistent improvement over existing fixed-size block baselines. Our code has been released at this https URL.
中文摘要 近期的扩散大型语言模型（dLLMs）通过基于块的半自回归生成范式，展示了在推理方面的有效性和效率。尽管取得了进展，固定大小的块生成仍然是有效且连贯推理的关键瓶颈。1. 从全局视角看，不同的推理任务对应不同的最优解码块大小，这使得“一刀切”假设无效。2. 即使在单一推理任务中，僵化块划分也会破坏逻辑流程并降低推理的连贯性。通过实证观察，我们发现，对于块状熵，错误推理在块间呈现波动且不稳定的趋势，而正确生成的任务则呈持续下降趋势。因此，本文提出了b1，一种新的dLLM后训练框架，通过单调熵下降目标学习动态规模的推理模块，并通过强化学习来增强推理连贯性。b1作为即插即用模块无缝集成于现有dLLM的后训练算法中。在各种推理基准测试中的大量实验展示了 b1 相较于现有固定大小块基线的持续改进。我们的代码已在此 https URL 发布。
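
The observation above, that correct reasoning shows steadily falling block-wise entropy while incorrect reasoning fluctuates, suggests a simple diagnostic: add up every rise in entropy from one block to the next. The function below is an illustrative measure of that kind. It is not b1's actual Monotonic Entropy Descent objective, which the abstract does not spell out, and the entropy values are invented.

```python
def entropy_descent_violation(block_entropies):
    """Total amount by which entropy rises between consecutive blocks."""
    pairs = zip(block_entropies, block_entropies[1:])
    return sum(max(0.0, later - earlier) for earlier, later in pairs)


# Hypothetical per-block entropies for two generations.
steady = [2.0, 1.5, 1.0, 0.5]            # monotonically descending
erratic = [2.0, 1.0, 1.75, 0.5, 1.0]     # rises twice

print(entropy_descent_violation(steady))
print(entropy_descent_violation(erratic))
```

A value of zero means the sequence never rises; larger values indicate more fluctuation and could be subtracted from a reward as a penalty.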

Compositional Multi-hop Factual Error Correction via Decomposition-and-Injection

通过分解与注入进行合成多跳事实错误纠正

Authors: Lei Zhu, Xiaobao Wang, Jianbiao Yang, Chenyang Wang, Dongxiao He, Longbiao Wang, Jianwu Dang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.02277
Pdf link: https://arxiv.org/pdf/2605.02277
Abstract Factual Error Correction (FEC) aims to revise inaccurate text into statements that are factually consistent with external evidence. Although recent methods perform well on single-hop correction, they often treat claims as atomic units and struggle with multi-hop cases that require compositional reasoning across multiple evidence sources. This challenge is further amplified by limited paired data and difficulties in locating semantic errors within complex reasoning chains. We present CECoR (Compositional Error Correction via Reasoning-aware Synthesis), a reasoning-aware framework that introduces a Decomposition and Injection paradigm for compositional error correction. CECoR decomposes multi-hop claims into interpretable reasoning steps and injects controlled perturbations to synthesize high-quality training pairs. A two-stage learning strategy combining supervised fine-tuning and reinforcement learning improves factual accuracy and robustness. Comprehensive evaluations show that CECoR achieves strong performance on multi-hop benchmarks, outperforming both distantly supervised methods and few-shot LLM baselines. It also generalizes effectively to single-hop correction and remains stable under noisy evidence, demonstrating its versatility for real-world factual correction.
中文摘要 事实错误更正（FEC）旨在将不准确的文本修正为与外部证据事实一致的陈述。尽管近期方法在单跳校正方面表现良好，但它们常将主张视为原子单位，且在需要跨多个证据来源进行组合推理的多跳案例中存在困难。这一挑战因有限的配对数据和在复杂推理链中发现语义错误的困难而进一步加剧。我们介绍CECoR（通过推理感知合成实现合成错误纠正），这是一个推理感知框架，引入了用于合成错误纠正的分解与注入范式。CECoR 将多跳声明分解为可解释的推理步骤，并注入受控扰动以合成高质量的训练对。结合监督微调和强化学习的两阶段学习策略提升事实准确性和稳健性。综合评估显示，CECoR在多跳基准测试中表现优异，优于远程监督方法和少样本大型语言模型基线。它还有效推广到单跳修正，并且在噪声证据下保持稳定，展示了其在现实世界事实纠正中的多功能性。

A Meta Reinforcement Learning Approach to Goals-Based Wealth Management

基于目标的财富管理的元强化学习方法

Authors: Sanjiv R. Das, Harshad Khadilkar, Sukrit Mittal, Daniel Ostrov, Deep Srivastav, Hungjen Wang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.02300
Pdf link: https://arxiv.org/pdf/2605.02300
Abstract Applying concepts related to zero-shot meta-learning and pre-training of foundation models, we develop a meta reinforcement learning approach (denoted MetaRL) that is pre-trained on thousands of goals-based wealth management (GBWM) problems. Each GBWM problem involves a multi-year scenario in which the investor seeks, each year, to choose an optimal investment portfolio and to fulfill all, some, or none of the financial goals that arise that year. These choices seek to maximize the expected total investor utility obtained from the fulfilled financial goals. By eliminating separate training and optimization for each new investor problem, the MetaRL model in inference mode produces near-optimal dynamic investment portfolio and goal-fulfilling strategies for a new GBWM problem within a few hundredths of a second. This delivers expected utilities that are, on average, 97.8% of the optimal expected utilities (determined via Dynamic Programming). These results are remarkably robust to capital market regime changes, even when training uses only one capital market regime. Further, the MetaRL approach can enable solving problems with larger state spaces where Dynamic Programming becomes computationally infeasible.
中文摘要 应用零样本元学习和基础模型预训练相关的概念，我们开发了一种元强化学习方法（记为MetaRL），该方法在数千个基于目标的财富管理（GBWM）问题上进行了预训练。每个GBWM问题都涉及一个多年期情景，投资者每年寻求最佳选择投资组合，并选择实现所有、部分或不满足每年出现的不同财务目标。这些选择旨在最大化实现财务目标所带来的预期总投资者效用。通过消除每个新投资者问题的单独训练和优化，MetaRL模型的推理模式能在几百分之一秒内为新GBWM问题生成近乎最优的动态投资组合和目标实现策略。这能带来平均为最优期望效用（通过动态规划确定）的97.8%。即使培训仅使用一种资本市场体制，这些结果对资本市场体制变化也极为稳健。此外，MetaRL方法还能解决动态规划在计算上变得不可行的更大状态空间问题。

Differentiable Kernel Ridge Regression for Deep Learning Pipelines

面向深度学习流水线的可微核岭回归

Authors: Jean-Marc Mercier, Gabriele Santin
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.02313
Pdf link: https://arxiv.org/pdf/2605.02313
Abstract Deep neural networks dominate modern machine learning, while alternative function approximators remain comparatively underexplored at scale. In this work, we revisit kernel methods as drop-in components for standard deep learning pipelines. We introduce \emph{Sparse Kernels} (SKs), a differentiable, localized, and lazy variant of kernel ridge regression (KRR) that defers training to inference time and reduces to the solution of small local systems. We integrate SKs into PyTorch as modular layers that preserve end-to-end trainability, and we show that they expose three distinct sets of parameters -- feature representations, target values, and evaluation points -- each of which can be fixed or learned. This decomposition broadens the design space available to practitioners, enabling, in particular, training-free transfer, nonlinear probing, and hybrid kernel-neural models. Across convolutional networks, vision transformers, and reinforcement learning, SK-based modules serve two complementary roles: in some settings, they match the performance of trained neural readouts with substantially less training; in others, they augment existing models and improve their performance when used as additional components. Our results suggest that kernel methods, once made scalable and differentiable, can be readily integrated with deep learning rather than treated as a separate paradigm.
中文摘要 深度神经网络主导着现代机器学习，而其他函数逼近器在大规模场景下仍相对缺乏探索。在本研究中，我们重新审视核方法，将其作为标准深度学习流水线中即插即用的组件。我们引入了\emph{Sparse Kernels}（SKs，稀疏核），这是一种可微、局部化且惰性的核岭回归（KRR）变体，它将训练推迟到推理时进行，并归结为求解小型局部系统。我们将SK作为模块化层集成到PyTorch中，保持端到端可训练性，并表明它们暴露出三组不同的参数（特征表示、目标值和评估点），每一组都可以固定或学习。这种分解拓宽了从业者可用的设计空间，尤其使免训练迁移、非线性探测以及核-神经混合模型成为可能。在卷积网络、视觉Transformer和强化学习中，基于SK的模块承担两个互补角色：在某些场景中，它们以显著更少的训练达到已训练神经读出层的性能；在另一些场景中，它们作为附加组件增强现有模型并提升其性能。我们的结果表明，核方法一旦具备可扩展性和可微性，就可以方便地与深度学习集成，而不必被视为独立的范式。
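
The abstract describes a lazy, localized KRR variant that defers fitting to inference time and solves small local systems. The sketch below illustrates that general idea only: a plain k-nearest-neighbour kernel ridge regressor in pure Python. It is not the authors' Sparse Kernels implementation and not a PyTorch layer; the function names, the RBF kernel, and the parameters `k`, `lam`, and `gamma` are all assumptions made for illustration.

```python
import math

def rbf(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two points given as lists of floats."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def local_krr_predict(query, X, y, k=5, lam=1e-3, gamma=1.0):
    """Lazy local KRR: nothing is fitted until a query arrives."""
    # Pick the k stored points closest to the query (hypothetical locality rule).
    idx = sorted(range(len(X)),
                 key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], query)))[:k]
    # Small local system (K + lam * I) alpha = y, of size k x k.
    K = [[rbf(X[i], X[j], gamma) + (lam if i == j else 0.0) for j in idx] for i in idx]
    alpha = solve(K, [y[i] for i in idx])
    return sum(a * rbf(query, X[i], gamma) for a, i in zip(alpha, idx))

# Toy data: samples of sin(x) on [0, 2].
X = [[i / 10] for i in range(21)]
y = [math.sin(p[0]) for p in X]
print(local_krr_predict([1.05], X, y), math.sin(1.05))
```

Each query costs one k-by-k solve, so the cost per prediction is independent of the total number of stored points once the neighbours are found.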

Binary Rewards and Reinforcement Learning: Fundamental Challenges

二元奖励与强化学习：基本挑战

Authors: Marc Dymetman
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.02375
Pdf link: https://arxiv.org/pdf/2605.02375
Abstract Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for improving reasoning in language models, yet models trained with RLVR often suffer from diversity collapse: while single-sample accuracy improves, multi-sample coverage degrades, sometimes falling below the base model. We provide a structural account of this phenomenon grounded in the properties of binary rewards. Binary rewards create a fundamental degeneracy for policy gradient methods: the set of distributions maximizing expected reward is infinite, with no distinguished element. KL-control resolves this degeneracy by selecting, in the limit $\beta\to 0$, the filtered model $p_* := a(\cdot\mid\mathcal{Y}_1)$ -- the base model conditioned on validity -- which is the unique fully valid distribution closest to the base model in KL divergence. This selection operates through a nontrivial asymmetry: the tilted distribution $p_{[\beta]}\propto a(y)\,e^{v(y)/\beta}$ converges to $p_*$ in forward KL as $\beta\to 0$, yet $p_*$ cannot serve as a direct optimization target because $\mathrm{KL}(q\,\|\,p_*)$ is infinite for any full-support policy $q$. We develop explicit formulas relating the hyperparameter $\beta$ to the more interpretable target validity rate $\mu$. Under model misspecification -- the typical practical regime -- the pressure to decrease $\beta$ drives the optimizer toward highly concentrated distributions over a small number of valid outputs, collapsing toward ever fewer as $\beta$ decreases, rather than toward the filtered model. We illustrate this mechanism on a toy autoregressive experiment and discuss how alternative divergences that target $p_*$ directly -- as pursued empirically by \citet{kruszewski_whatever_2026} -- avoid this failure mode by rewarding coverage of $p_*$'s support rather than concentration on high-validity outputs.
中文摘要 带有可验证奖励的强化学习（RLVR）已成为提升语言模型推理能力的标准方法，但用RLVR训练的模型常常出现多样性崩溃：单样本准确率提升，多样本覆盖率却下降，有时甚至低于基础模型。我们基于二元奖励的性质，对这一现象给出结构性解释。二元奖励给策略梯度方法带来了根本性的退化：最大化期望奖励的分布集合是无限的，其中没有特殊的元素。KL控制通过在极限 $\beta\to 0$ 下选出过滤模型 $p_* := a(\cdot\mid\mathcal{Y}_1)$（即以有效性为条件的基础模型）来消除这种退化，它是在KL散度意义下最接近基础模型的唯一完全有效分布。这种选择通过一种非平凡的不对称性起作用：当 $\beta\to 0$ 时，倾斜分布 $p_{[\beta]}\propto a(y)\,e^{v(y)/\beta}$ 在前向KL意义下收敛到 $p_*$，但 $p_*$ 不能作为直接的优化目标，因为对任何全支撑策略 $q$，$\mathrm{KL}(q\,\|\,p_*)$ 都是无穷大。我们推导了显式公式，将超参数 $\beta$ 与更易解释的目标有效率 $\mu$ 联系起来。在模型误设（典型的实际情形）下，减小 $\beta$ 的压力会把优化器推向高度集中于少数有效输出的分布，并随着 $\beta$ 减小坍缩到越来越少的输出上，而不是趋向过滤模型。我们在一个玩具自回归实验中演示了这一机制，并讨论了直接以 $p_*$ 为目标的其他散度（如 \citet{kruszewski_whatever_2026} 在实证中采用的）如何通过奖励对 $p_*$ 支撑集的覆盖、而非对高有效性输出的集中，来避免这种失效模式。
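
The convergence claim in the abstract can be checked numerically on a toy example. The sketch below builds the tilted distribution proportional to a(y) e^{v(y)/β} and the filtered model (the base model conditioned on validity), then evaluates the forward KL from the filtered model to the tilted one for decreasing β. The four-outcome base model `a` and the validity labels `v` are made-up values, and the helper names are hypothetical; this is not the paper's autoregressive experiment.

```python
import math

def tilted(a, v, beta):
    """Distribution proportional to a(y) * exp(v(y) / beta), with v(y) in {0, 1}."""
    m = max(v) / beta  # subtract the largest exponent for numerical stability
    w = [ai * math.exp(vi / beta - m) for ai, vi in zip(a, v)]
    z = sum(w)
    return [wi / z for wi in w]

def filtered(a, v):
    """Base model conditioned on validity: mass only on outputs with v(y) = 1."""
    z = sum(ai for ai, vi in zip(a, v) if vi == 1)
    return [ai / z if vi == 1 else 0.0 for ai, vi in zip(a, v)]

def kl(p, q):
    """KL(p || q); outcomes with p(y) = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

a = [0.5, 0.3, 0.15, 0.05]  # hypothetical base model over four outputs
v = [0, 1, 0, 1]            # hypothetical binary validity labels
p_filtered = filtered(a, v)

for beta in (1.0, 0.3, 0.1):
    # KL(filtered || tilted) shrinks toward 0 as beta decreases.
    print(beta, kl(p_filtered, tilted(a, v, beta)))
```

The opposite direction is deliberately not evaluated: for any finite β > 0 the tilted distribution keeps positive mass on invalid outputs, where the filtered model has zero mass, so that divergence is infinite. This matches the asymmetry the abstract describes.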

Enhancing Multimodal In-Context Learning via Inductive-Deductive Reasoning

通过归纳演绎推理增强多模态上下文学习

Authors: Haoyu Wang, Haonan Wang, Yuyan Chen, Jun Chen, Gang Liu, Qian Wang, Jiahong Yan, Yanghua Xiao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.02378
Pdf link: https://arxiv.org/pdf/2605.02378
Abstract In-context learning (ICL) allows large models to adapt to tasks using a few examples, yet its extension to vision-language models (VLMs) remains fragile. Our analysis reveals that the fundamental limitation lies in an inductive gap, models often produce correct answers from flawed reasoning, while struggling to extract consistent rules across demonstrations. This gap is further exacerbated by two visual-level obstacles: an overwhelming proportion of redundant visual tokens that obscure textual cues, and a skewed attention distribution that favors the initial image at the expense of subsequent context. To address these issues, we introduce a framework that restructures multimodal ICL as a principled inductive-deductive process. The framework incorporates a similarity-based visual token compression module to filter out redundant patches, a dynamic attention rebalancing mechanism to distribute focus equitably across all images, and a chain-of-thought paradigm that explicitly guides the model to analyze individual examples, derive a generalizable rule, and then apply it to the query. An auxiliary learning pipeline combines supervised fine-tuning with reinforcement learning using verifiable rewards to reinforce faithful citation and noise filtering. Evaluations across eight benchmarks covering visual perception, logical reasoning, STEM problems, and sarcasm detection demonstrate consistent and significant improvements over standard ICL baselines for multiple open-source VLMs, highlighting the potential of equipping models with genuine inductive capabilities in multimodal settings.
中文摘要 上下文学习（ICL）使大型模型能够通过少量示例适应任务，但将其扩展到视觉语言模型（VLM）时仍然脆弱。我们的分析表明，根本局限在于归纳鸿沟：模型常常通过有缺陷的推理得出正确答案，却难以从多个示例中提取一致的规则。两个视觉层面的障碍进一步加剧了这一鸿沟：占比极高的冗余视觉token掩盖了文本线索，以及注意力分布偏向首张图像而忽视后续上下文。为解决这些问题，我们提出一个框架，将多模态ICL重构为有原则的归纳-演绎过程。该框架包含基于相似度的视觉token压缩模块，用于过滤冗余图像块；动态注意力再平衡机制，用于在所有图像间均衡分配注意力；以及一种思维链范式，明确引导模型分析各个示例、推导可泛化的规则，再将其应用于查询。辅助学习流程将监督微调与使用可验证奖励的强化学习相结合，以强化忠实引用和噪声过滤。在涵盖视觉感知、逻辑推理、STEM问题和讽刺检测的八个基准上的评估表明，该方法在多个开源VLM上相比标准ICL基线取得了一致且显著的提升，凸显了在多模态场景中赋予模型真正归纳能力的潜力。

HeavySkill: Heavy Thinking as the Inner Skill in Agentic Harness

HeavySkill：将重度思考作为智能体框架中的内在技能

Authors: Jianing Wang, Linsen Guo, Zhengyu Chen, Qi Guo, Hongyu Zang, Wenjie Shi, Haoxiang Ma, Xiangyu Xi, Xiaoyu Li, Wei Wang, Xunliang Cai
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.02396
Pdf link: https://arxiv.org/pdf/2605.02396
Abstract Recent advances in agentic harness with orchestration frameworks that coordinate multiple agents with memory, skills, and tool use have achieved remarkable success in complex reasoning tasks. However, the underlying mechanism that truly drives performance remains obscured behind intricate system designs. In this paper, we propose HeavySkill, a perspective that views heavy thinking not only as a minimal execution unit in orchestration harness but also as an inner skill internalized within the model's parameters that drives the orchestrator to solve complex tasks. We identify this skill as a two-stage pipeline, i.e., parallel reasoning then summarization, which can operate beneath any agentic harness. We present a systematic empirical study of HeavySkill across diverse domains. Our results show that this inner skill consistently outperforms traditional Best-of-N (BoN) strategies; notably, stronger LLMs can even approach Pass@N performance. Crucially, we demonstrate that the depth and width of heavy thinking, as a learnable skill, can be further scaled via reinforcement learning, offering a promising path toward self-evolving LLMs that internalize complex reasoning without relying on brittle orchestration layers.
中文摘要 近期，借助编排框架协调多个具备记忆、技能和工具使用能力的智能体，智能体框架（agentic harness）在复杂推理任务中取得了显著成功。然而，真正驱动性能的底层机制仍隐藏在复杂的系统设计背后。本文提出HeavySkill这一视角：重度思考（heavy thinking）不仅是编排框架中的最小执行单元，也是内化于模型参数之中、驱动编排器解决复杂任务的内在技能。我们将这一技能界定为一个两阶段流程，即先并行推理、再总结，它可以在任何智能体框架之下运行。我们在多个领域对HeavySkill进行了系统的实证研究。结果显示，这种内在技能始终优于传统的Best-of-N（BoN）策略；值得注意的是，更强的大语言模型甚至能接近Pass@N的性能。关键的是，我们证明了重度思考作为一种可学习的技能，其深度和广度可以通过强化学习进一步扩展，这为实现自我演进的大语言模型提供了一条有前景的路径：模型将复杂推理内化，而无需依赖脆弱的编排层。

Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent

KL正则化RLVR的参考采样玻尔兹曼投影：目标匹配加权SFT、有限单次差距和策略镜像下降

Authors: Yao Shu, Chenxing Wei, Hongbin Lin, Shuang Qiu, Hui Xiong
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.02469
Pdf link: https://arxiv.org/pdf/2605.02469
Abstract Online reinforcement learning with verifiable rewards (RLVR) turns checkable outcomes into a scalable training signal, but it keeps rollout generation, verifier scoring, and reference-policy evaluations on the optimization path. Static weighted supervised fine-tuning (SFT) on precomputed rollouts seems to remove this bottleneck, yet a weighted likelihood is not specified by rewards alone: its sampler and weights induce the policy being fit. This paper identifies the reference-sampled weighted-SFT objective whose induced policy equals the fixed-reference KL-regularized RLVR optimizer. The optimizer is the standard Boltzmann target policy, obtained by exponentially tilting the reference policy by verifier reward. Matching a weighted-SFT induced policy to this target forces density-ratio weights; in the reference-sampled subclass, this reduces uniquely, up to prompt scaling, to the prompt-normalized Boltzmann weight $\exp(r(x,y)/\beta)/Z(x)$. BOLT, a Boltzmann-Targeted SFT procedure, is the empirical estimator of this projection. The finite one-shot analysis separates the exact stored-support price $\beta\log(1/\pi^*(S_N\mid x))$ from partition estimation, effective-sample-size variance, generalization, optimization, and approximation errors. This decomposition explains why extra SFT epochs cannot repair missing reference-policy coverage and exposes the temperature--coverage--variance frontier. When coverage needs adaptive sampling, refreshed Boltzmann projections become KL policy mirror descent; finite inner solves enter as additive drift from the exact mirror step. Single-run Qwen experiments provide projection evidence for the target-matched weight, one-shot saturation, refreshed-sampler gains, and optimization-time savings, within the stated single-run scope.
中文摘要 带可验证奖励的在线强化学习（RLVR）将可检验的结果转化为可扩展的训练信号，但它让rollout生成、验证器评分和参考策略评估始终处于优化路径上。在预先计算的rollout上进行静态加权监督微调（SFT）似乎能消除这一瓶颈，但加权似然并不仅由奖励决定：其采样器和权重共同决定了被拟合的策略。本文找出了这样一个参考采样加权SFT目标：其诱导策略等于固定参考策略下KL正则化RLVR的最优解。该最优解是标准的玻尔兹曼目标策略，由参考策略按验证器奖励进行指数倾斜得到。要使加权SFT的诱导策略与该目标一致，就必须使用密度比权重；在参考采样子类中，这在相差一个提示级缩放的意义下唯一地归约为提示归一化的玻尔兹曼权重 $\exp(r(x,y)/\beta)/Z(x)$。BOLT（一种以玻尔兹曼分布为目标的SFT方法）是该投影的经验估计量。有限样本的单次分析将精确的存储支撑代价 $\beta\log(1/\pi^*(S_N\mid x))$ 与配分函数估计、有效样本量方差、泛化、优化和近似误差分离开来。这一分解解释了为何额外的SFT训练轮次无法弥补参考策略覆盖的缺失，并揭示了温度-覆盖-方差前沿。当覆盖需要自适应采样时，刷新的玻尔兹曼投影就成为KL策略镜像下降；有限步的内层求解则以相对于精确镜像步的加性漂移形式出现。单次运行的Qwen实验在所述的单次运行范围内，为目标匹配权重、单次饱和、刷新采样器的增益以及优化时间的节省提供了投影方面的证据。
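
The prompt-normalized Boltzmann weight $\exp(r(x,y)/\beta)/Z(x)$ named in the abstract can be illustrated for a single prompt with stored rollouts. The sketch below is a generic self-normalized estimate, not the BOLT procedure itself: the partition function is replaced by a sum over the stored samples (which fixes one particular prompt scaling), and the function names, the toy rewards, and the log-probabilities are assumptions.

```python
import math

def boltzmann_weights(rewards, beta):
    """Self-normalised weights exp(r_i / beta) / sum_j exp(r_j / beta) for one prompt."""
    m = max(rewards)  # shift by the max reward for numerical stability
    unnorm = [math.exp((r - m) / beta) for r in rewards]
    z_hat = sum(unnorm)
    return [u / z_hat for u in unnorm]

def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum w^2."""
    return sum(weights) ** 2 / sum(w * w for w in weights)

def weighted_nll(logprobs, weights):
    """Weighted SFT loss for one prompt: minus the weighted sum of rollout log-likelihoods."""
    return -sum(w * lp for w, lp in zip(weights, logprobs))

rewards = [1.0, 0.0, 0.0, 1.0]       # hypothetical binary verifier rewards
logprobs = [-2.0, -1.5, -3.0, -2.5]  # hypothetical policy log-likelihoods of the rollouts

for beta in (1.0, 0.5, 0.1):
    w = boltzmann_weights(rewards, beta)
    print(beta, w, effective_sample_size(w), weighted_nll(logprobs, w))
```

Lowering β pushes almost all weight onto the rewarded rollouts, so the effective sample size drops toward the number of rewarded samples. This is one way to see the temperature, coverage, and variance trade-off the abstract mentions.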

Efficient Preference Poisoning Attack on Offline RLHF

针对离线RLHF的高效偏好投毒攻击

Authors: Chenye Yang, Weiyu Xu, Lifeng Lai
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2605.02495
Pdf link: https://arxiv.org/pdf/2605.02495
Abstract Offline Reinforcement Learning from Human Feedback (RLHF) pipelines such as Direct Preference Optimization (DPO) train on a pre-collected preference dataset, which makes them vulnerable to preference poisoning attack. We study label flip attacks against log-linear DPO. We first illustrate that flipping one preference label induces a parameter-independent shift in the DPO gradient. Using this key property, we can then convert the targeted poisoning problem into a structured binary sparse approximation problem. To solve this problem, we develop two attack methods: Binary-Aware Lattice Attack (BAL-A) and Binary Matching Pursuit Attack (BMP-A). BAL-A embeds the binary flip selection problem into a binary-aware lattice and applies Lenstra-Lenstra-Lovász reduction and Babai's nearest plane algorithm; we provide sufficient conditions that enforce binary coefficients and recover the minimum-flip objective. BMP-A adapts binary matching pursuit to our non-normalized gradient dictionary and yields coherence-based recovery guarantees and robustness (impossibility) certificates for $K$-flip budgets. Experiments on synthetic dictionaries and the Stanford Human Preferences dataset validate the theory and highlight how dictionary geometry governs attack success.
中文摘要 基于人类反馈的离线强化学习（RLHF）流程，如直接偏好优化（DPO），在预先收集的偏好数据集上训练，因此容易受到偏好投毒攻击。我们研究针对对数线性DPO的标签翻转攻击。我们首先说明，翻转一个偏好标签会使DPO梯度产生一个与参数无关的偏移。利用这一关键性质，我们可以将定向投毒问题转化为结构化的二元稀疏逼近问题。为求解该问题，我们提出两种攻击方法：二元感知格攻击（BAL-A）和二元匹配追踪攻击（BMP-A）。BAL-A将二元翻转选择问题嵌入到一个二元感知格中，并应用Lenstra-Lenstra-Lovász约化和Babai最近平面算法；我们给出了保证系数为二元并恢复最小翻转目标的充分条件。BMP-A将二元匹配追踪适配到我们的非归一化梯度字典上，并针对 $K$ 次翻转预算给出基于相干性的恢复保证和鲁棒性（不可能性）证书。在合成字典和Stanford Human Preferences数据集上的实验验证了该理论，并凸显了字典几何结构如何决定攻击的成败。
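
The parameter-independent gradient shift can be verified on a small example. The sketch assumes a simplified log-linear DPO loss for one preference pair, -log σ(β⟨θ, d⟩), where d is the feature difference between the chosen and the rejected response; this parameterisation, the omission of the reference-policy term, and all names are assumptions, not the paper's exact setup. Flipping the label replaces d with -d, and the resulting change in the gradient works out to β·d for every θ.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_grad(theta, d, beta):
    """Gradient w.r.t. theta of -log(sigmoid(beta * <theta, d>))."""
    s = sigmoid(beta * sum(t * di for t, di in zip(theta, d)))
    return [-beta * (1.0 - s) * di for di in d]

def flip_shift(theta, d, beta):
    """Gradient after flipping the label minus gradient before flipping."""
    flipped = dpo_grad(theta, [-di for di in d], beta)
    original = dpo_grad(theta, d, beta)
    return [f - o for f, o in zip(flipped, original)]

d = [1.0, -2.0, 0.5]  # hypothetical feature difference phi(chosen) - phi(rejected)
beta = 0.1
for theta in ([0.0, 0.0, 0.0], [3.0, -1.0, 2.0], [-5.0, 4.0, 0.3]):
    # The shift equals beta * d regardless of theta.
    print(theta, flip_shift(theta, d, beta))
```

The reason is that the original gradient is -β(1 - σ(z))d and the flipped one is βσ(z)d, with z = β⟨θ, d⟩, so their difference is β·d and the dependence on θ cancels. Under this assumed form, each candidate flip contributes a fixed vector, which is consistent with the abstract's reduction to a binary sparse approximation problem over a gradient dictionary.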

Beyond Specialization: Robust Reinforcement Learning Navigation via Procedural Map Generators

超越专业化：通过程序化地图生成器实现稳健的强化学习导航

Authors: Christian Jestel, Nicolas Bach, Marvin Wiedemann, Jan Finke, Peter Detzner
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.02528
Pdf link: https://arxiv.org/pdf/2605.02528
Abstract Deep reinforcement learning (DRL) navigation policies often overfit to the structure of their training environments, as environmental diversity is typically constrained by the manual effort required to design diverse scenarios. While procedural map generation offers scalable diversity, no prior work systematically compares how different generator types affect policy generalization. We integrate four generators (sparse, maze, graph, and Wave Function Collapse) with guaranteed navigability into MuRoSim, a 2D simulator focusing on training efficiency for LiDAR-based navigation. We cross-evaluate five navigation policies on 1000 seeded maps per generator across three training seeds. Results show a strongly asymmetric cross-generator transfer: a specialist trained on sparse layouts falls to 3.3% success on mazes, whereas a policy trained on the combined generator set achieves 91.5 +/- 1.1% mean success. We further demonstrate that A* path-planner subgoal inputs are the dominant factor for robustness, raising success from the 90.2 +/- 1.4% feedforward baseline to 98.9 +/- 0.4% and outperforming GRU recurrence, which only improves the reactive baseline. The DRL policies outperform a classical Carrot+A* controller, which matches their success only at low speeds (1.0 m/s) but collapses to 24.9% at 2.0 m/s. This highlights learned speed adaptation as the decisive advantage of the learned approach. Real-world experiments on a RoboMaster confirm sim-to-real transfer in a cluttered arena, while a maze-like layout exposes remaining failure modes that recurrence helps mitigate.
中文摘要 深度强化学习（DRL）导航策略常常过拟合于训练环境的结构，因为环境多样性通常受限于设计多样场景所需的人工工作量。虽然程序化地图生成能提供可扩展的多样性，但此前没有工作系统地比较不同生成器类型如何影响策略泛化。我们将四种保证可通行性的生成器（稀疏、迷宫、图和波函数坍缩）集成到MuRoSim中，这是一个注重基于LiDAR导航训练效率的二维模拟器。我们使用三个训练种子，在每个生成器的1000张带种子的地图上对五种导航策略进行交叉评估。结果显示，跨生成器迁移具有很强的不对称性：在稀疏布局上训练的专用策略在迷宫中的成功率降至3.3%，而在组合生成器集合上训练的策略平均成功率达到91.5 +/- 1.1%。我们进一步证明，A*路径规划器的子目标输入是稳健性的主导因素，它将成功率从前馈基线的90.2 +/- 1.4%提升至98.9 +/- 0.4%，并优于仅能改善反应式基线的GRU循环结构。DRL策略优于经典的Carrot+A*控制器，后者仅在低速（1.0 m/s）下能达到相当的成功率，而在2.0 m/s时骤降至24.9%。这凸显了学习到的速度自适应是学习方法的决定性优势。在RoboMaster上的真实世界实验证实了在杂乱场地中的仿真到现实迁移，而类迷宫布局则暴露出仍然存在的失效模式，循环结构有助于缓解这些问题。

Recurrent Deep Reinforcement Learning for Chemotherapy Control under Partial Observability

部分可观察性下化疗控制的循环深度强化学习

Authors: Firas Mohamed Elamine Kiram, Imane Youkana, Rachida Saouli, Gian Antonio Susto, Laid Kahloul
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.02552
Pdf link: https://arxiv.org/pdf/2605.02552
Abstract Chemotherapy dose optimization can be formulated as a dynamic treatment regime, requiring sequential decisions under uncertainty that must balance tumor suppression against toxicity. However, most reinforcement learning approaches assume full observability of the patient state, a condition rarely met in clinical practice. We investigate whether memory-augmented policies can improve chemotherapy control under partial observability. To this end, we employ a recurrent TD3-based approach with separate LSTM actor-critic networks and evaluate it on the AhnChemoEnv benchmark from DTR-Bench, considering both off-policy and on-policy recurrent architectures against feed-forward TD3 and Soft Actor-Critic. Pharmacokinetic and pharmacodynamic variability are held fixed to isolate hidden-state uncertainty and observation noise and to avoid confounding effects from inter-patient variability. Across ten random seeds, recurrence yields modest benefit under full observability but substantially stronger and more stable performance under partial observability, with more consistent tumor suppression and improved normal-cell preservation. These findings indicate that memory-based policies are particularly beneficial when clinically relevant state information is incomplete or noisy.
中文摘要 化疗剂量优化可以表述为动态治疗方案，需要在不确定性下进行序贯决策，在肿瘤抑制与毒性之间取得平衡。然而，大多数强化学习方法假设患者状态完全可观测，而这在临床实践中很少成立。我们研究记忆增强策略能否在部分可观测条件下改善化疗控制。为此，我们采用基于TD3的循环方法，使用相互独立的LSTM actor-critic网络，并在DTR-Bench的AhnChemoEnv基准上对其进行评估，同时考察离策略（off-policy）和同策略（on-policy）循环架构，并与前馈TD3和Soft Actor-Critic进行比较。药代动力学和药效学变异性保持固定，以分离出隐藏状态不确定性和观测噪声的影响，并避免患者间变异性带来的混杂效应。在十个随机种子上，循环结构在完全可观测条件下带来的收益有限，但在部分可观测条件下表现显著更强、更稳定，肿瘤抑制更为一致，正常细胞的保留也有所改善。这些发现表明，当临床相关的状态信息不完整或含噪声时，基于记忆的策略尤其有益。

Gradient-Gated DPO: Stabilizing Preference Optimization in Language Models

梯度门控DPO：语言模型中偏好优化的稳定化

Authors: Inoussa Mouiche
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.02626
Pdf link: https://arxiv.org/pdf/2605.02626
Abstract Preference optimization has become a central paradigm for aligning large language models with human feedback. Direct Preference Optimization (DPO) simplifies reinforcement learning from human feedback by directly optimizing pairwise preferences, removing the need for reward modeling and policy optimization. However, recent work shows that DPO exhibits a squeezing effect, where negative gradients applied to rejected responses concentrate probability mass on high-confidence predictions while suppressing alternative responses. This phenomenon arises even in simple softmax models and can lead to systematic probability collapse during training. We introduce Gradient-Gated Preference Optimization (Gate-DPO), a method that stabilizes training by modulating rejected gradients according to the model's probability geometry. When updates target extremely low-probability responses, the gate attenuates harmful gradients while preserving standard optimization behavior. Gate-DPO addresses this optimization pathology without modifying the underlying preference objective and is complementary to existing methods such as extended SFT, IPO, and Cal-DPO. Experiments across multiple architectures and preference datasets show that Gate-DPO consistently reduces squeezing and improves chosen-response likelihood. Mass-dynamics analysis further reveals healthier optimization behavior, with improved preferred responses and reduced suppression of the overall distribution. Notably, smaller gated models can exhibit stronger chosen-response improvements than larger ungated models, suggesting that controlling gradient dynamics, rather than scale alone, is key to stable and efficient alignment.
中文摘要 偏好优化已成为使大型语言模型与人类反馈对齐的核心范式。直接偏好优化（DPO）通过直接优化成对偏好，简化了基于人类反馈的强化学习，不再需要奖励建模和策略优化。然而，近期研究表明DPO存在挤压效应：施加于被拒绝回答的负梯度会使概率质量集中到高置信度预测上，同时压制其他回答。这一现象即使在简单的softmax模型中也会出现，并可能导致训练过程中系统性的概率崩溃。我们提出梯度门控偏好优化（Gate-DPO），该方法根据模型的概率几何结构调节被拒绝回答的梯度，从而稳定训练。当更新针对概率极低的回答时，门控会衰减有害梯度，同时保留标准的优化行为。Gate-DPO在不修改底层偏好目标的情况下解决了这一优化病态，并且与扩展SFT、IPO和Cal-DPO等现有方法互补。在多种架构和偏好数据集上的实验表明，Gate-DPO能够持续减轻挤压并提高被选回答的似然。质量动态分析进一步显示出更健康的优化行为：偏好回答得到改善，对整体分布的压制减少。值得注意的是，较小的门控模型在被选回答上的改进可以强于较大的无门控模型，这表明稳定高效对齐的关键在于控制梯度动态，而不仅仅是规模。

AutoFocus: Uncertainty-Aware Active Visual Search for GUI Grounding

AutoFocus：面向GUI定位的不确定性感知主动视觉搜索

Authors: Ruilin Yao, Shegnwu Xiong, Tianyu Zou, Shili Xiong, Yi Rong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.02630
Pdf link: https://arxiv.org/pdf/2605.02630
Abstract Vision-Language Models (VLMs) have enabled autonomous GUI agents that translate natural language instructions into executable screen coordinates. However, grounding performance degrades in high-resolution interfaces, where dense layouts and small interactive elements expose a resolution gap between modern displays and model input constraints. Existing zoom-in strategies rely on fixed anchors, heuristic grids, or reinforcement learning, lacking a principled mechanism to adaptively determine where refinement is needed and how much spatial uncertainty should be explored. We propose AutoFocus, a training-free, uncertainty-aware active visual search framework for GUI grounding. Our key insight is that token-level perplexity in coordinate generation naturally reflects spatial uncertainty. Rather than committing to a single prediction, AutoFocus samples multiple coordinate hypotheses and converts their axial perplexities into an anisotropic gaussian spatial probability field, explicitly modeling directional uncertainty. Based on this field, we generate global and local region proposals and introduce Shape-Aware Zooming to balance tight localization with contextual preservation. A visual prompt-based aggregation step then selects the most consistent prediction via structured comparison. Extensive experiments on ScreenSpot-Pro and ScreenSpot-V2 demonstrate consistent improvements across both general-purpose and GUI-specialized VLMs.
中文摘要 视觉语言模型（VLM）催生了能够将自然语言指令转换为可执行屏幕坐标的自主GUI智能体。然而，在高分辨率界面中，密集的布局和细小的交互元素暴露出现代显示器与模型输入约束之间的分辨率差距，定位（grounding）性能随之下降。现有的放大策略依赖固定锚点、启发式网格或强化学习，缺乏有原则的机制来自适应地确定何处需要细化以及应探索多大的空间不确定性。我们提出AutoFocus，一种免训练、不确定性感知的主动视觉搜索框架，用于GUI定位。我们的关键洞见是，坐标生成中的token级困惑度自然反映了空间不确定性。AutoFocus不固守单一预测，而是采样多个坐标假设，并将其轴向困惑度转换为各向异性高斯空间概率场，显式建模方向性不确定性。基于该概率场，我们生成全局和局部区域候选，并引入形状感知缩放（Shape-Aware Zooming），以在紧致定位与上下文保留之间取得平衡。随后，一个基于视觉提示的聚合步骤通过结构化比较选出最一致的预测。在ScreenSpot-Pro和ScreenSpot-V2上的大量实验表明，该方法在通用VLM和GUI专用VLM上均带来了一致的提升。

Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE

Mamoda2.5：利用DiT-MoE增强统一多模态模型

Authors: Yangming Shi, Shixiang Zhu, Tao Shen, Zhimiao Yu, Dengsheng Chen, Taicai Chen, Yunfei Yang, Juan Zhou, Chen Cheng, Liang Ma, Xibin Wu, Benxuan Yan, Ge Li, Tuoyu Zhang, Dan Li, Chang Liu, Zhenbang Sun
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.02641
Pdf link: https://arxiv.org/pdf/2605.02641
Abstract We present Mamoda2.5, a unified AR-Diffusion framework that seamlessly integrates multimodal understanding and generation within a single architecture. To efficiently enhance the model's generation capability, we equip the Diffusion Transformer backbone with a fine-grained Mixture-of-Experts (MoE) design (128 experts, Top-8 routing), yielding a 25B-parameter model that activates only 3B parameters, significantly reducing training costs while scaling up the model capacity. Mamoda2.5 achieves top-tier generation performance on VBench 2.0 and sets a new record in video editing quality, surpassing evaluated open-source models and matching the performance of current top-tier proprietary models, including the Kling O1 on OpenVE-Bench. Furthermore, we introduce a joint few-step distillation and reinforcement learning framework that compresses the 30-step editing model into a 4-step model and greatly accelerates model inference. Compared to open-source baselines, Mamoda2.5 achieves up to $95.9\times$ faster video editing inference. In real-world applications, Mamoda2.5 has been successfully deployed for content moderation and creative restoration tasks in advertising scenarios, achieving a 98% success rate in internal advertising video editing scenario.
中文摘要 我们提出Mamoda2.5，一个统一的AR-Diffusion框架，在单一架构中无缝整合多模态理解与生成。为了高效提升模型的生成能力，我们为Diffusion Transformer骨干网络配备了细粒度的混合专家（MoE）设计（128个专家，Top-8路由），得到一个总参数量25B、仅激活3B参数的模型，在扩大模型容量的同时显著降低了训练成本。Mamoda2.5在VBench 2.0上取得了顶级的生成性能，并在视频编辑质量上创下新纪录，超越了所评估的开源模型，并与当前顶级闭源模型（包括OpenVE-Bench上的Kling O1）性能相当。此外，我们引入了一个联合少步蒸馏与强化学习的框架，将30步的编辑模型压缩为4步模型，大幅加速了模型推理。与开源基线相比，Mamoda2.5的视频编辑推理速度最高可提升 $95.9\times$。在实际应用中，Mamoda2.5已成功部署于广告场景中的内容审核和创意修复任务，在内部广告视频编辑场景中达到98%的成功率。
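
The abstract's "128 experts, Top-8 routing" means each token is processed by only 8 of the 128 experts (6.25% of them), which is how a 25B-parameter model can activate about 3B parameters per token. The sketch below shows a common form of top-k gating (softmax restricted to the k largest router logits). The abstract does not describe Mamoda2.5's actual router, so the function name, the renormalisation choice, and the toy logits are assumptions.

```python
import math

def top_k_route(logits, k):
    """Return {expert_index: gate_weight} for the k largest router logits.

    Gate weights are a softmax over the selected logits only, so they sum to 1.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)  # shift for numerical stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}

num_experts, k = 128, 8
# Hypothetical router logits for one token (deterministic toy values).
logits = [math.sin(i) for i in range(num_experts)]
gates = top_k_route(logits, k)

print(sorted(gates))                     # indices of the experts that run for this token
print(k / num_experts)                   # fraction of experts active per token
print(3 / 25)                            # active share of parameters stated in the abstract
```

The active parameter share (12%) is larger than the active expert share (6.25%), which is expected if some parameters, such as attention layers, are shared by all tokens rather than routed. The abstract does not give that breakdown.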

AcademiClaw: When Students Set Challenges for AI Agents

AcademiClaw：当学生为人工智能代理设定挑战时

Authors: Junjie Yu, Pengrui Lu, Weiye Si, Hongliang Lu, Jiabao Wu, Kaiwen Tao, Kun Wang, Lingyu Yang, Qiran Zhang, Xiuting Guo, Xuanyu Wang, Yang Wang, Yanjie Wang, Yi Yang, Zijian Hu, Ziyi Yang, Zonghan Zhou, Binghao Qiang, Borui Zhang, Chenning Li, Enchang Zhang, Feifan Chen, Feng Jian, Fengyin Sun, Hao Qiu, Hao Zheng, Haoran Zhu, Hongyu Liu, Jianbin Deng, Jiaxin Song, Jiaying Chi, Jiayou Shi, Jie Fang, Jinghui Zhong, Jingyu Zhou, Jinze Li, Junfeng Yi, Junyan Yu, Junzhi Xue, Ni Song, Pengyi Chen, Qi Chen, Quansheng Li, Rui Tao, Shenghai Gong, Shenhang Lu, Tianqi Shen, Tianxiang Zhu, Tiehan Kang, Tingyu Li, Wendi Wu, Xiao Shen, Xiao Zhou, Xiaotao Zhang, Xinrong Li, Xuankun Yang, Xun Zhang, Yan Li, Ye Lu, Yi Wang, Yibo Zhou, Yichi Zhang, Yihao Sun, Yijun Huang, Yixin Zhu, Yixuan Wu, Yuchen Sun, Yue Wu, Yuheng Sun, Yukun Li, Yutian Tu, Yuxuan Qin, Yuzhuo Wu, Zeyu Li, Zhengyu Lou, Zhenning Ran, Zizhu He, Pengfei Liu
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2605.02661
Pdf link: https://arxiv.org/pdf/2605.02661
Abstract Benchmarks within the OpenClaw ecosystem have thus far evaluated exclusively assistant-level tasks, leaving the academic-level capabilities of OpenClaw largely unexamined. We introduce AcademiClaw, a bilingual benchmark of 80 complex, long-horizon tasks sourced directly from university students' real academic workflows -- homework, research projects, competitions, and personal projects -- that they found current AI agents unable to solve effectively. Curated from 230 student-submitted candidates through rigorous expert review, the final task set spans 25+ professional domains, ranging from olympiad-level mathematics and linguistics problems to GPU-intensive reinforcement learning and full-stack system debugging, with 16 tasks requiring CUDA GPU execution. Each task executes in an isolated Docker sandbox and is scored on task completion by multi-dimensional rubrics combining six complementary techniques, with an independent five-category safety audit providing additional behavioral analysis. Experiments on six frontier models show that even the best achieves only a 55\% pass rate. Further analysis uncovers sharp capability boundaries across task domains, divergent behavioral strategies among models, and a disconnect between token consumption and output quality, providing fine-grained diagnostic signals beyond what aggregate metrics reveal. We hope that AcademiClaw and its open-sourced data and code can serve as a useful resource for the OpenClaw community, driving progress toward agents that are more capable and versatile across the full breadth of real-world academic demands. All data and code are available at this https URL.
中文摘要 迄今为止，OpenClaw生态系统内的基准只评估助理级任务，OpenClaw的学术级能力基本未被考察。我们提出AcademiClaw，这是一个双语基准，包含80个复杂的长程任务，这些任务直接来自大学生的真实学术工作流程（作业、研究项目、竞赛和个人项目），且学生发现现有AI智能体无法有效解决。最终任务集从学生提交的230个候选任务中经严格的专家评审筛选而来，涵盖25个以上的专业领域，从奥林匹克级别的数学和语言学问题，到GPU密集型强化学习和全栈系统调试，其中16个任务需要在CUDA GPU上执行。每个任务都在隔离的Docker沙箱中执行，并由结合六种互补技术的多维评分细则对任务完成情况评分，另有独立的五类安全审计提供额外的行为分析。在六个前沿模型上的实验显示，即使是表现最好的模型，通过率也只有55%。进一步分析揭示了不同任务领域之间清晰的能力边界、模型之间各异的行为策略，以及token消耗与输出质量之间的脱节，提供了聚合指标无法呈现的细粒度诊断信号。我们希望AcademiClaw及其开源的数据和代码能成为OpenClaw社区的有用资源，推动智能体在真实学术需求的全部范围内变得更强大、更通用。所有数据和代码均可在此 https URL 获取。

Federated Reinforcement Learning for Efficient Mobile Crowdsensing under Incomplete Information

不完整信息下面向高效移动群智感知的联邦强化学习

Authors: Sumedh J. Dongare, Patrick Weber, Andrea Ortiz, Walid Saad, Oliver Hinz, Anja Klein
Subjects: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2605.02705
Pdf link: https://arxiv.org/pdf/2605.02705
Abstract Mobile crowdsensing (MCS) is a distributed sensing architecture that utilizes existing sensors on mobile units (MUs) to perform sensing tasks. A mobile crowdsensing platform (MCSP) publishes the sensing tasks and the MUs decide whether to participate in exchange for money. The MCS system is dynamic: the task requirements, the MUs' availability, and their available resources change over time. The MUs aim to find an efficient task participation strategy to maximize their income while the MCSP focuses on maximizing the number of completed tasks. As optimal strategies require perfect non-causal information about the MCS system, which is unavailable in realistic scenarios, the main challenge is to find an efficient task participation strategy for the MUs under incomplete information. To this end, a novel fully decentralized federated deep reinforcement learning algorithm, FDRL-PPO, is proposed. FDRL-PPO enables every MU to learn its own task participation strategy based on its experiences, available resources, and preferences, without relying on perfect non-causal information about the MCS system. To replenish their batteries, the MUs rely on energy harvesting. As a result, their available energy varies over time, leading to varying availability and fragmented learning experiences. To mitigate these challenges, the proposed approach leverages federated learning, enabling MUs to collaboratively improve their models without sharing private raw data like their own experiences. By exchanging only learned models, MUs collectively compensate for individual limitations, and find more scalable, robust, and efficient task participation strategies. Comprehensive evaluations on both synthetic and real-world datasets show that FDRL-PPO consistently outperforms benchmark algorithms in terms of task completion ratio, fairness in task completion, energy consumption, and number of conflicting proposals.
中文摘要 移动群智感知（MCS）是一种分布式感知架构，利用移动单元（MU）上已有的传感器来执行感知任务。移动群智感知平台（MCSP）发布感知任务，MU决定是否参与以换取报酬。MCS系统是动态的：任务需求、MU的可用性及其可用资源都随时间变化。MU的目标是找到高效的任务参与策略以最大化自身收入，而MCSP则关注最大化已完成任务的数量。由于最优策略需要关于MCS系统的完美非因果信息，而这在现实场景中无法获得，主要挑战在于在信息不完整的情况下为MU找到高效的任务参与策略。为此，本文提出一种新的完全去中心化的联邦深度强化学习算法FDRL-PPO。FDRL-PPO使每个MU能够基于自身的经验、可用资源和偏好学习自己的任务参与策略，而无需依赖关于MCS系统的完美非因果信息。MU依靠能量收集为电池补充电量，因此其可用能量随时间变化，导致可用性不稳定、学习经验碎片化。为缓解这些挑战，所提方法利用联邦学习，使MU能够在不共享自身经验等私有原始数据的情况下协作改进模型。通过只交换学习到的模型，MU共同弥补个体的局限，找到更具可扩展性、更稳健、更高效的任务参与策略。在合成数据集和真实数据集上的综合评估表明，FDRL-PPO在任务完成率、任务完成的公平性、能耗以及冲突提议数量方面始终优于基准算法。
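
The idea of exchanging only learned models, never raw experience, can be sketched as a weighted average of parameter vectors among mobile units. This is a generic federated-averaging step, shown as an assumption about what the model exchange might look like; the abstract does not specify FDRL-PPO's aggregation rule, and the function name and the sample-count weights are hypothetical.

```python
def average_models(models, weights=None):
    """Element-wise weighted average of parameter vectors (lists of floats).

    `weights` could be, for example, the number of local experiences behind each
    model; equal weights are used when none are given.
    """
    if weights is None:
        weights = [1.0] * len(models)
    total = sum(weights)
    n = len(models[0])
    return [sum(w * m[j] for w, m in zip(weights, models)) / total for j in range(n)]

# Hypothetical flattened policy parameters from three mobile units.
mu_models = [
    [0.10, -0.40, 0.25],
    [0.30, -0.20, 0.05],
    [0.20, -0.30, 0.15],
]
experience_counts = [50, 10, 40]  # e.g. an energy-starved unit contributes fewer samples

print(average_models(mu_models))
print(average_models(mu_models, experience_counts))
```

Weighting by experience count lets a unit with fragmented learning history benefit from better-trained neighbours while contributing less to the shared model. Only parameter vectors appear in the exchange.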

Perceptual Flow Network for Visually Grounded Reasoning

视觉基础推理的感知流网络

Authors: Yangfu Li, Yuning Gong, Hongjian Zhan, Teng Li, Yuanhuiyi Lyu, Tianyi Chen, Qi Liu, Ziyuan Huang, Zhihang Zhong, Dandan Zheng, Yue Lu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.02730
Pdf link: https://arxiv.org/pdf/2605.02730
Abstract Despite the success of Large Vision-Language Models (LVLMs), general optimization objectives (e.g., standard MLE) fail to constrain visual trajectories, leading to language bias and hallucination. To mitigate this, current methods introduce geometric priors from visual experts as additional supervision. However, we observe that such supervision is typically suboptimal: it is biased toward geometric precision and offers limited reasoning utility. To bridge this gap, we propose Perceptual Flow Network (PFlowNet), which eschews rigid alignment with the expert priors and achieves interpretable yet more effective visual reasoning. Specifically, PFlowNet decouples perception from reasoning to establish a self-conditioned generation process. Based on this, it integrates multi-dimensional rewards with vicinal geometric shaping via variational reinforcement learning, thereby facilitating reasoning-oriented perceptual behaviors while preserving visual reliability. PFlowNet delivers a provable performance guarantee and competitive empirical results, particularly setting new SOTA records on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).
中文摘要 尽管大型视觉语言模型（LVLM）取得了成功，但通用优化目标（如标准最大似然估计 MLE）未能约束视觉轨迹，导致语言偏差和幻觉。为缓解这一问题，现有方法引入视觉专家的几何先验作为额外监督。然而，我们观察到这种监督通常并非最优：它偏向几何精度，且推理效用有限。为弥合这一差距，我们提出了感知流网络（PFlowNet），它摒弃了与专家先验的僵化对齐，实现了可解释且更有效的视觉推理。具体来说，PFlowNet将感知与推理解耦，建立一个自条件生成过程。基于此，它通过变分强化学习将多维奖励与邻域几何塑形结合起来，从而促进以推理为导向的感知行为，同时保持视觉可靠性。PFlowNet 提供可证明的性能保证和具有竞争力的实证结果，尤其是在 V* Bench（90.6%）和 MME-RealWorld-lite（67.0%）上创下了新的 SOTA 纪录。

A decoupled diffusion planner that adapts to changing cost limits by using cost-conditioned generation for safety and reward gradients for performance

一种解耦扩散规划器，通过成本条件生成保障安全性、通过奖励梯度提升性能，从而适应不断变化的成本限制

Authors: Rufeng Chen, Zhaofan Zhang, Zhejiang Yang, Hechang Chen, Sihong Xie
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.02777
Pdf link: https://arxiv.org/pdf/2605.02777
Abstract Offline safe reinforcement learning often requires policies to adapt at deployment time to safety budgets that vary across episodes or change within a single episode. While diffusion-based planners enable flexible trajectory generation, existing guidance schemes often treat reward improvement and constraint satisfaction as competing gradient objectives, which can lead to unreliable safety compliance under cost limits. We reinterpret adaptive safe trajectory generation as sampling from a constrained trajectory distribution, where the budget restricts the trajectory region, and reward shapes preferences within that region. This perspective motivates Safe Decoupled Guidance Diffusion (SDGD), which conditions classifier-free guidance on the cost limit to bias sampling toward trajectories satisfying the specified limit, while using reward-gradient guidance to refine trajectories for higher return. Because direct reward guidance can increase return while also steering samples toward trajectories with higher cumulative cost, we introduce Feasible Trajectory Relabeling (FTR) to reshape reward targets and discourage such directions. We further provide a first-order sampling-time analysis showing that FTR suppresses reward-induced cost drift under a prefix-restorative alignment condition. Extensive evaluations on the DSRL benchmark show that SDGD achieves the strongest safety compliance among baselines, satisfying the constraint on 94.7% of tasks (36/38), while obtaining the highest reward among safe methods on 21 tasks.
中文摘要 离线安全强化学习通常要求策略在部署时适应安全预算，这些预算可能在不同回合之间变化，或在单个回合内变化。虽然基于扩散的规划器支持灵活的轨迹生成，但现有引导方案常将奖励提升和约束满足视为相互竞争的梯度目标，这可能导致在成本限制下安全合规性不可靠。我们将自适应安全轨迹生成重新解释为从受约束的轨迹分布中采样，其中预算限制轨迹区域，奖励塑造该区域内的偏好。这一视角催生了安全解耦引导扩散（SDGD）：它将无分类器引导以成本限制为条件，使采样偏向满足指定限制的轨迹，同时利用奖励梯度引导细化轨迹以获得更高回报。由于直接的奖励引导在提高回报的同时，也可能将样本引向累计成本更高的轨迹，我们引入了可行轨迹重标注（FTR）以重塑奖励目标并抑制此类方向。我们还提供了一阶采样时分析，表明FTR在前缀恢复性对齐条件下抑制了奖励诱导的成本漂移。在DSRL基准上的广泛评估显示，SDGD在各基线中实现了最强的安全合规性，在94.7%的任务（36/38）上满足约束，同时在21个任务上取得了安全方法中最高的奖励。
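
The decoupling described in the abstract (cost limit handled through classifier-free guidance, reward handled through a gradient term) can be illustrated with a generic guidance combination. The sketch below is not the SDGD update itself: the function name `guided_noise`, the two scale parameters, and the sign convention are assumptions, any noise-schedule factor is folded into `reward_scale`, and Feasible Trajectory Relabeling is not modeled.

```python
import numpy as np

def guided_noise(eps_uncond, eps_cond, reward_grad,
                 cfg_scale=2.0, reward_scale=0.1):
    """Combine cost-conditioned classifier-free guidance with reward guidance.

    eps_uncond: noise prediction without the cost-limit condition.
    eps_cond: noise prediction conditioned on the cost limit.
    reward_grad: gradient of a reward estimate w.r.t. the noisy trajectory.
    """
    # Classifier-free guidance: extrapolate from the unconditional prediction
    # toward the cost-conditioned one.
    cfg = eps_uncond + cfg_scale * (eps_cond - eps_uncond)
    # Reward guidance: subtracting the gradient from the predicted noise
    # nudges the denoised sample toward higher reward (assumed convention).
    return cfg - reward_scale * reward_grad

eps_u = np.array([0.0, 0.0])
eps_c = np.array([1.0, -1.0])
grad_r = np.array([0.5, 0.5])
eps_hat = guided_noise(eps_u, eps_c, grad_r)
```

With `cfg_scale=1.0` and `reward_scale=0.0` the function reduces to the plain cost-conditioned prediction, which makes the two guidance terms easy to ablate separately.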

Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces

基于LLM的多智能体系统通过编排追踪进行强化学习

Authors: Chenchen Zhang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.02801
Pdf link: https://arxiv.org/pdf/2605.02801
Abstract As large language model (LLM) agents evolve from isolated tool users into coordinated teams, reinforcement learning (RL) must optimize not only individual actions but also how work is spawned, delegated, communicated, aggregated, and stopped. This paper studies RL for LLM-based multi-agent systems through orchestration traces: temporal interaction graphs whose events include sub-agent spawning, delegation, communication, tool use, return, aggregation, and stopping decisions. Using this lens, we identify three technical axes. First, reward design spans eight families, including orchestration rewards for parallelism speedup, split correctness, and aggregation quality. Second, reward and credit signals attach to eight credit- or signal-bearing units from token to team; explicit counterfactual message-level credit remains especially sparse in our curated pool. Third, orchestration learning decomposes into five sub-decisions: when to spawn, whom to delegate to, how to communicate, how to aggregate, and when to stop. In our curated pool as of May 4, 2026, we found no explicit RL training method for the stopping decision. We connect academic methods to public industrial evidence from Kimi Agent Swarm, OpenAI Codex, and Anthropic Claude Code. The resulting scale gap is a gap between publicly reported deployment envelopes and open academic evaluation regimes, not independent verification of industrial training traces. We release the artifact at this https URL, including an 84-entry tagged paper pool, a 32-record exclusion log, scripted corpus statistics, and a minimal JSON schema for replayable orchestration traces.
中文摘要 随着大型语言模型（LLM）智能体从孤立的工具使用者发展为相互协作的团队，强化学习（RL）不仅要优化单个动作，还要优化工作如何被生成、委派、沟通、聚合和停止。本文通过编排轨迹（orchestration traces）研究基于LLM的多智能体系统的强化学习：编排轨迹是一种时序交互图，其事件包括子智能体生成、委派、通信、工具使用、返回、聚合和停止决策。通过这个视角，我们确定了三个技术轴。首先，奖励设计涵盖八个类别，包括针对并行加速、拆分正确性和聚合质量的编排奖励。其次，奖励和信用信号附着于从词元（token）到团队的八种承载信用或信号的单元；在我们整理的论文池中，显式的反事实消息级信用分配尤其稀少。第三，编排学习分解为五个子决策：何时生成、委派给谁、如何沟通、如何聚合以及何时停止。截至2026年5月4日，我们在整理的论文池中未发现针对停止决策的显式强化学习训练方法。我们将学术方法与来自 Kimi Agent Swarm、OpenAI Codex 和 Anthropic Claude Code 的公开工业证据联系起来。由此得到的规模差距是公开报告的部署范围与开放学术评估体系之间的差距，而非对工业训练轨迹的独立验证。我们在 this https URL 发布了相关工件（artifact），包括一个含 84 个条目的带标签论文池、一个含 32 条记录的排除日志、脚本化的语料库统计，以及用于可重放编排轨迹的最小 JSON 模式。
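
To make the notion of a replayable orchestration trace concrete, here is a small sketch of an event record that round-trips through JSON. The seven event kinds come from the abstract; the field names (`step`, `agent`, `kind`, `target`, `payload`) and the class `TraceEvent` are hypothetical and are not the schema released with the paper.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import Optional

# Event kinds listed in the abstract.
EVENT_KINDS = {"spawn", "delegate", "communicate", "tool_use",
               "return", "aggregate", "stop"}

@dataclass
class TraceEvent:
    step: int                      # position in the temporal ordering
    agent: str                     # agent that emitted the event
    kind: str                      # one of EVENT_KINDS
    target: Optional[str] = None   # receiving agent or tool, if any
    payload: dict = field(default_factory=dict)

    def __post_init__(self):
        if self.kind not in EVENT_KINDS:
            raise ValueError(f"unknown event kind: {self.kind}")

trace = [
    TraceEvent(0, "orchestrator", "spawn", target="worker-1"),
    TraceEvent(1, "orchestrator", "delegate", target="worker-1",
               payload={"task": "summarize"}),
    TraceEvent(2, "worker-1", "return", target="orchestrator",
               payload={"result": "ok"}),
    TraceEvent(3, "orchestrator", "aggregate"),
    TraceEvent(4, "orchestrator", "stop"),
]

# Serialize to JSON and rebuild the trace, as a replay tool might.
serialized = json.dumps([asdict(event) for event in trace])
replayed = [TraceEvent(**record) for record in json.loads(serialized)]
```

Keeping each event flat and validated on construction means a corrupted or hand-edited trace fails at load time instead of midway through a replay.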

Enhancing RL Generalizability in Robotics through SHAP Analysis of Algorithms and Hyperparameters

通过算法和超参数的SHAP分析提升机器人学中的强化学习泛化性

Authors: Lingxiao Kong, Cong Yang, Oya Deniz Beyan, Zeyd Boukhers
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.02867
Pdf link: https://arxiv.org/pdf/2605.02867
Abstract Despite significant advances in Reinforcement Learning (RL), model performance remains highly sensitive to algorithm and hyperparameter configurations, while generalization gaps across environments complicate real-world deployment. Although prior work has studied RL generalization, the relative contribution of specific configurations to the generalization gap has not been quantitatively decomposed and systematically leveraged for configuration selection. To address this limitation, we propose an explainable framework that evaluates RL performance across robotic environments using SHapley Additive exPlanations (SHAP) to quantify configuration impacts. We establish a theoretical foundation connecting Shapley values to generalizability, empirically analyze configuration impact patterns, and introduce SHAP-guided configuration selection to enhance generalization. Our results reveal distinct patterns across algorithms and hyperparameters, with consistent configuration impacts across diverse tasks and environments. By applying these insights to configuration selection, we achieve improved RL generalizability and provide actionable guidance for practitioners.
中文摘要 尽管强化学习（RL）取得了重大进展，模型性能仍对算法和超参数配置高度敏感，而跨环境的泛化差距使实际部署变得复杂。尽管此前已有工作研究强化学习的泛化，但特定配置对泛化差距的相对贡献尚未被定量分解，也未被系统地用于配置选择。为解决这一局限，我们提出了一个可解释的框架，利用 SHapley Additive exPlanations（SHAP）来量化配置影响，评估跨机器人环境的强化学习性能。我们建立了将 Shapley 值与泛化性联系起来的理论基础，实证分析配置影响模式，并引入 SHAP 引导的配置选择以增强泛化性。我们的结果揭示了不同算法和超参数之间的明显模式，且配置影响在不同任务和环境中保持一致。通过将这些见解应用于配置选择，我们提升了强化学习的泛化性，并为从业者提供了可操作的指导。
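
The idea of attributing a generalization gap to individual configuration choices can be illustrated with exact Shapley values on a toy example. This is a sketch only: it enumerates all orderings instead of using the SHAP library, and the factor names and the gap function `toy_gap` are invented for illustration, not taken from the paper.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values by averaging marginal contributions
    over every ordering of the players (feasible only for small sets)."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}

def toy_gap(changed):
    """Hypothetical generalization gap when the factors in `changed`
    are switched from a reference configuration to a new one."""
    gap = 0.0
    if "algorithm" in changed:
        gap += 0.3
    if "learning_rate" in changed:
        gap += 0.1
    if "algorithm" in changed and "learning_rate" in changed:
        gap += 0.2  # interaction term, split evenly by the Shapley value
    return gap

phi = shapley_values(["algorithm", "learning_rate", "batch_size"], toy_gap)
```

Here `phi` assigns 0.4 to `algorithm`, 0.2 to `learning_rate`, and 0.0 to `batch_size`, and the three values sum to the full gap of 0.6, which is the efficiency property that makes Shapley attributions additive.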

Keyword: diffusion policy

Hydra-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control

Hydra-DP3：三维扩散策略的频率感知适定，用于视觉运动控制

Authors: Jinhao Zhang, Zhexuan Zhou, Huizhe Li, Yichen Lai, Wenlong Xia, Haoming Song, Youmin Gong, Jie Mei
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.01581
Pdf link: https://arxiv.org/pdf/2605.01581
Abstract Diffusion-based visuomotor policies perform well in robotic manipulation, yet current methods still inherit image-generation-style decoders and multi-step sampling. We revisit this design from a frequency-domain perspective. Robot action trajectories are highly smooth, with most energy concentrated in a few low-frequency discrete cosine transform modes. Under this structure, we show that the error of the optimal denoiser is bounded by the low-frequency subspace dimension and residual high-frequency energy, implying that denoising error saturates after very few reverse steps. This further suggests that action denoising requires a much simpler denoising model than image generation. Motivated by this insight, we propose Hydra-DP3 (HDP3), a pocket-scale 3D diffusion policy with a lightweight Diffusion Mixer decoder that supports two-step DDIM inference. Our synthetic experiments validate the theory and support the sufficiency of two-step denoising. Furthermore, across RoboTwin2.0, Adroit, MetaWorld, and real-world tasks, HDP3 achieves state-of-the-art performance with fewer than 1% of the parameters of prior 3D diffusion-based policies and substantially lower inference latency.
中文摘要 基于扩散的视觉运动策略在机器人操作中表现良好，但现有方法仍继承了图像生成式解码器和多步采样。我们从频域视角重新审视这一设计。机器人动作轨迹非常平滑，大部分能量集中在少数低频离散余弦变换模式中。在该结构下，我们证明最优去噪器的误差受低频子空间维数和残余高频能量的限制，这意味着去噪误差在极少的反向步后达到饱和。这进一步表明动作去噪需要比图像生成更简单的去噪模型。基于这一见解，我们提出了Hydra-DP3（HDP3），一种口袋级3D扩散策略，配备轻量化扩散混合器解码器，支持两步DDIM推断。我们的合成实验验证了该理论，并支持了两步去噪的充分性。此外，在RoboTwin2.0、Adroit、MetaWorld及实际任务中，HDP3实现了最先进的性能，且参数仅为以往3D扩散策略的1%以下，推断延迟显著降低。
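
The claim that smooth trajectories concentrate their energy in a few low-frequency DCT modes is easy to check numerically. The sketch below uses a hand-written orthonormal DCT-II and a synthetic one-dimensional trajectory; the trajectory, the choice of 8 modes, and the function names are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

def dct2_orthonormal(x):
    """Orthonormal DCT-II of a 1-D signal, written out with NumPy."""
    x = np.asarray(x, dtype=float)
    n = x.size
    idx = np.arange(n)
    # basis[k, j] = cos(pi * (j + 0.5) * k / n)
    basis = np.cos(np.pi * (idx[None, :] + 0.5) * idx[:, None] / n)
    scale = np.full(n, np.sqrt(2.0 / n))
    scale[0] = np.sqrt(1.0 / n)
    return scale * (basis @ x)

def low_frequency_energy_fraction(x, num_modes):
    """Share of total signal energy held by the first `num_modes` DCT modes."""
    energy = dct2_orthonormal(x) ** 2
    return energy[:num_modes].sum() / energy.sum()

# Synthetic smooth "action trajectory": one sine period plus a slow drift.
t = np.linspace(0.0, 1.0, 64)
trajectory = np.sin(2 * np.pi * t) + 0.5 * t
fraction = low_frequency_energy_fraction(trajectory, num_modes=8)
```

Because the transform is orthonormal, the squared coefficients sum to the signal's energy, so `fraction` is a true energy share; for a smooth signal like this one it should be close to 1 with only 8 of 64 modes.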