Arxiv Papers of Today

Generated at: 2026-04-07 17:08:04 (UTC+8); Arxiv release time: 2026-04-07 20:00 EDT (2026-04-08 08:00 UTC+8)

56 relevant papers today

Keyword: reinforcement learning

Self-Execution Simulation Improves Coding Models

自执行模拟改进编码模型

Authors: Gallil Maimon, Ori Yoran, Felix Kreuk, Michael Hassid, Gal Cohen, Pierre Chambon, Yossi Adi
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.03253
Pdf link: https://arxiv.org/pdf/2604.03253
Abstract A promising research direction in enabling LLMs to generate consistently correct code involves addressing their inability to properly estimate program execution, particularly for code they generate. In this work, we demonstrate that Code LLMs can be trained to simulate program execution in a step-by-step manner and that this capability can be leveraged to improve competitive programming performance. Our approach combines supervised fine-tuning on natural language execution traces, textual explanations grounded in true execution, with reinforcement learning using verifiable rewards. We introduce two complementary objectives: output prediction given code and inputs, and solving competitive programming tasks with either ground-truth or self-predicted execution feedback. These objectives enable models to perform self-verification over multiple candidate solutions, and iterative self-fixing by simulating test execution. Across multiple competitive programming benchmarks, our method yields consistent improvements over standard reasoning approaches. We further present ablations and analysis to elucidate the role of execution simulation and its limitations.
中文摘要 要让LLM稳定地生成正确代码，一个有前景的研究方向是解决其无法准确估计程序执行结果的问题，尤其是针对其自身生成的代码。在本研究中，我们展示了代码大型语言模型可以被训练为逐步模拟程序执行，并且这一能力可用于提升竞赛编程性能。我们的方法将基于自然语言执行轨迹（即以真实执行为依据的文本解释）的监督微调，与使用可验证奖励的强化学习相结合。我们引入了两个互补目标：给定代码和输入时的输出预测，以及借助真实执行反馈或自我预测的执行反馈求解竞赛编程任务。这些目标使模型能够对多个候选解进行自我验证，并通过模拟测试执行实现迭代式自我修复。在多个竞赛编程基准测试中，我们的方法相较于标准推理方法取得了一致的改进。我们还给出了消融实验和分析，以阐明执行模拟的作用及其局限性。
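
The self-verification idea in the abstract above (ranking several candidate solutions by simulated test execution) can be sketched roughly as follows. This is an illustrative toy, not the authors' method: `predict_output` is a hypothetical stand-in for the learned execution simulator, and here it simply runs the candidate, whereas a trained model would predict the output from a step-by-step trace.

```python
from typing import Callable, List, Tuple

Candidate = Callable[[int], int]
TestCase = Tuple[int, int]  # (input, expected output)


def predict_output(candidate: Candidate, x: int) -> int:
    # Hypothetical stand-in for the learned execution simulator.
    # Assumption: we just execute the code; a model would predict this value.
    return candidate(x)


def select_by_simulation(candidates: List[Candidate], tests: List[TestCase]) -> int:
    """Return the index of the candidate that passes the most simulated tests."""
    scores = [
        sum(predict_output(c, x) == y for x, y in tests)
        for c in candidates
    ]
    # max() returns the first index with the highest score on ties.
    return max(range(len(candidates)), key=scores.__getitem__)


# Three toy "solutions" to the task "square the input".
candidates = [lambda x: x + x, lambda x: x * x, lambda x: x ** 3]
tests = [(2, 4), (3, 9), (4, 16)]
best = select_by_simulation(candidates, tests)
```

With predicted rather than real outputs, the ranking is only as reliable as the simulator, which is presumably why the abstract also studies the limitations of execution simulation.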

SDVDiag: Using Context-Aware Causality Mining for the Diagnosis of Connected Vehicle Functions

SDVDiag：利用上下文感知因果挖掘诊断联网车辆功能

Authors: Matthias Weiß, Falk Dettinger, Elias Detrois, Nasser Jazdi, Michael Weyrich
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.03391
Pdf link: https://arxiv.org/pdf/2604.03391
Abstract Real-world implementations of connected vehicle functions are spreading steadily, yet operating these functions reliably remains challenging due to their distributed nature and the complexity of the underlying cloud, edge, and networking infrastructure. Quick diagnosis of problems and understanding the error chains that lead to failures is essential for reducing downtime. However, diagnosing these systems is still largely performed manually, as automated analysis techniques are predominantly data-driven and struggle with hidden relationships and the integration of context information. This paper addresses this gap by introducing a multimodal approach that integrates human feedback and system-specific information into the causal analysis process. Reinforcement Learning from Human Feedback is employed to continuously train a causality mining model while incorporating expert knowledge. Additional modules leverage distributed tracing data to prune false-positive causal links and enable the injection of domain-specific relationships to further refine the causal graph. Evaluation is performed using an automated valet parking application operated in a connected vehicle test field. Results demonstrate a significant increase in precision from 14% to 100% for the detection of causal edges and improved system interpretability compared to purely data-driven approaches, highlighting the potential for system operators in the connected vehicle domain.
中文摘要 联网车辆功能的实际部署正在稳步推广，但由于这些功能的分布式特性以及底层云、边缘和网络基础设施的复杂性，可靠运行仍具挑战性。快速诊断问题并理解导致故障的错误链对于减少停机时间至关重要。然而，这些系统的诊断仍然大多依赖人工完成，因为自动化分析技术主要是数据驱动的，难以处理隐藏关系和上下文信息的整合。本文通过引入一种多模态方法，将人类反馈和系统特定信息整合进因果分析过程，填补了这一空白。该方法采用基于人类反馈的强化学习，在融入专家知识的同时持续训练因果挖掘模型。其他模块利用分布式追踪数据剪除假阳性因果链接，并支持注入领域特定关系，以进一步细化因果图。评估使用在联网车辆测试场运行的自动代客泊车应用进行。结果显示，与纯数据驱动方法相比，因果边检测的精确率从14%显著提升至100%，系统可解释性也有所提升，凸显了该方法对联网车辆领域系统运营者的潜力。

Hypernetwork-Conditioned Reinforcement Learning for Robust Control of Fixed-Wing Aircraft under Actuator Failures

超网络条件强化学习用于执行器故障时稳健控制固定翼飞机

Authors: Dennis Marquis, Mazen Farhood
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2604.03392
Pdf link: https://arxiv.org/pdf/2604.03392
Abstract This paper presents a reinforcement learning-based path-following controller for a fixed-wing small uncrewed aircraft system (sUAS) that is robust to certain actuator failures. The controller is conditioned on a parameterization of actuator faults using hypernetwork-based adaptation. We consider parameter-efficient formulations based on Feature-wise Linear Modulation (FiLM) and Low-Rank Adaptation (LoRA), trained using proximal policy optimization. We demonstrate that hypernetwork-conditioned policies can improve robustness compared to standard multilayer perceptron policies. In particular, hypernetwork-conditioned policies generalize effectively to time-varying actuator failure modes not encountered during training. The approach is validated through high-fidelity simulations, using a realistic six-degree-of-freedom fixed-wing aircraft model.
中文摘要 本文提出了一种基于强化学习的路径跟踪控制器，适用于固定翼小型无人飞机系统（sUAS），该系统对某些执行器失效具有鲁棒性。控制器基于基于超网络自适应对执行器故障的参数化进行条件。我们考虑基于特征层次线性调制（FiLM）和低秩适应（LoRA）的参数高效表述，并通过近端策略优化训练。我们证明了超网络条件策略相较于标准多层感知器策略能够提升鲁棒性。特别是，超网络条件策略有效推广到训练中未遇到的时变执行器失效模式。该方法通过高精度仿真验证，采用真实的六自由度固定翼飞机模型。
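
As a rough illustration of the FiLM-style conditioning mentioned in the abstract above, the sketch below scales and shifts hidden features using values produced from a fault parameter. Everything here is an assumption made for illustration: the one-layer "hypernetwork", the weights, and the reading of `fault = [0.5]` as 50% loss of actuator effectiveness are hypothetical, not taken from the paper.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(w: Matrix, b: Vector, x: Vector) -> Vector:
    """Plain affine map: w @ x + b."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def film_layer(h: Vector, fault: Vector,
               w_gamma: Matrix, b_gamma: Vector,
               w_beta: Matrix, b_beta: Vector) -> Vector:
    """Feature-wise linear modulation: out_i = gamma_i(fault) * h_i + beta_i(fault)."""
    gamma = linear(w_gamma, b_gamma, fault)  # per-feature scale
    beta = linear(w_beta, b_beta, fault)     # per-feature shift
    return [g * hi + b for g, hi, b in zip(gamma, h, beta)]


h = [1.0, -2.0]   # hidden features of the policy network (toy values)
fault = [0.5]     # hypothetical fault parameter
w_gamma, b_gamma = [[-1.0], [0.0]], [1.0, 1.0]
w_beta, b_beta = [[0.0], [2.0]], [0.0, 0.0]
out = film_layer(h, fault, w_gamma, b_gamma, w_beta, b_beta)
```

With `fault = [0.0]` the layer reduces to the identity for these weights, which is one simple way to make the nominal (no-failure) policy the default behavior.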

Scaling Multi-agent Systems: A Smart Middleware for Improving Agent Interactions

扩展多智能体系统：提升智能体交互的智能中间件

Authors: Charles Fleming, Ramana Kompella, Peter Bosch, Vijoy Pandey
Subjects: Multiagent Systems (cs.MA); Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2604.03430
Pdf link: https://arxiv.org/pdf/2604.03430
Abstract As Large Language Model (LLM) based Multi-Agent Systems (MAS) evolve from experimental pilots to complex, persistent ecosystems, the limitations of direct agent-to-agent communication have become increasingly apparent. Current architectures suffer from fragmented context, stochastic hallucinations, rigid security boundaries, and inefficient topology management. This paper introduces Cognitive Fabric Nodes (CFN), a novel middleware layer that creates an omnipresent "Cognitive Fabric" between agents. Unlike traditional message queues or service meshes, CFNs are not merely pass-through mechanisms; they are active, intelligent intermediaries. Central to this architecture is the elevation of Memory from simple storage to an active functional substrate that informs four other critical capabilities: Topology Selection, Semantic Grounding, Security Policy Enforcement, and Prompt Transformation. We propose that each of these functions be governed by learning modules utilizing Reinforcement Learning (RL) and optimization algorithms to improve system performance dynamically. By intercepting, analyzing, and rewriting inter-agent communication, the Cognitive Fabric ensures that individual agents remain lightweight while the ecosystem achieves coherence, safety, and semantic alignment. We evaluate the effectiveness of the CFN on the HotPotQA and MuSiQue datasets in a multi-agent environment and demonstrate that the CFN improves performance by more than 10% on both datasets over direct agent-to-agent communication.
中文摘要 随着基于大型语言模型（LLM）的多智能体系统（MAS）从实验性试点发展为复杂且持久的生态系统，直接代理间通信的局限性日益显现。当前架构存在碎片化的上下文、随机幻觉、严格的安全边界和低效的拓扑管理。本文介绍了认知织物节点（CFN），这是一种新型中间件层，能够在代理之间创建无处不在的“认知织物”。与传统的消息队列或服务网格不同，CFN不仅仅是传递机制;他们是积极且聪明的中介者。该架构的核心是将内存从简单存储提升为主动功能基底，支持另外四项关键能力：拓扑选择、语义基础、安全策略执行和提示转换。我们提出，这些功能由利用强化学习（RL）和优化算法的学习模块来管理，以动态提升系统性能。通过拦截、分析和重写代理间通信，认知织物确保单个代理保持轻量级，同时生态系统实现连贯性、安全性和语义一致性。我们评估了CFN在多代理环境中HotPotQA和MuSiQue数据集上的有效性，并证明CFN在两者数据集上的性能均提升超过10%以上，优于直接代理间的通信。

Improving Feasibility via Fast Autoencoder-Based Projections

通过快速自编码器投影提升可行性

Authors: Maria Chzhen, Priya L. Donti
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2604.03489
Pdf link: https://arxiv.org/pdf/2604.03489
Abstract Enforcing complex (e.g., nonconvex) operational constraints is a critical challenge in real-world learning and control systems. However, existing methods struggle to efficiently enforce general classes of constraints. To address this, we propose a novel data-driven amortized approach that uses a trained autoencoder as an approximate projector to provide fast corrections to infeasible predictions. Specifically, we train an autoencoder using an adversarial objective to learn a structured, convex latent representation of the feasible set. This enables rapid correction of neural network outputs by projecting their associated latent representations onto a simple convex shape before decoding into the original feasible set. We test our approach on a diverse suite of constrained optimization and reinforcement learning problems with challenging nonconvex constraints. Results show that our method effectively enforces constraints at a low computational cost, offering a practical alternative to expensive feasibility correction techniques based on traditional solvers.
中文摘要 执行复杂（例如非凸）操作约束是现实学习与控制系统中的一项关键挑战。然而，现有方法在高效执行一般约束类别方面存在困难。为此，我们提出了一种新型数据驱动摊销方法，利用训练有素的自编码器作为近似投影器，快速修正不可行的预测。具体来说，我们用对抗目标训练自编码器，学习可行集合的结构化、凸潜在表示。这使得通过将相关潜在表示投影到简单的凸形状上，从而快速校正神经网络输出，然后再解码为原始可行集合。我们在一系列具有挑战性的非凸约束的受限优化和强化学习问题上测试我们的方法。结果表明，我们的方法以低计算成本有效执行约束，为基于传统求解器的昂贵可行性修正技术提供了实用替代方案。
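
The correction step described in the abstract above (encode, project the latent onto a simple convex shape, decode) might look roughly like this. It is a toy sketch under strong assumptions: `encode` and `decode` are hypothetical linear stand-ins for a trained autoencoder, and the convex latent shape is assumed to be the unit L2 ball.

```python
import math
from typing import List

Vector = List[float]


def project_to_unit_ball(z: Vector) -> Vector:
    """Euclidean projection onto {z : ||z||_2 <= 1}."""
    norm = math.sqrt(sum(v * v for v in z))
    scale = 1.0 if norm <= 1.0 else 1.0 / norm
    return [v * scale for v in z]


def encode(x: Vector) -> Vector:
    # Toy stand-in for a trained encoder (assumption: simple rescaling).
    return [v / 2.0 for v in x]


def decode(z: Vector) -> Vector:
    # Toy stand-in for a trained decoder (inverse of the toy encoder).
    return [v * 2.0 for v in z]


def correct(x: Vector) -> Vector:
    """Map a possibly infeasible prediction to a point the decoder treats as feasible."""
    return decode(project_to_unit_ball(encode(x)))


fixed = correct([6.0, 8.0])      # infeasible input gets pulled back
unchanged = correct([1.0, 0.0])  # input that is already feasible
```

The projection itself is a closed-form rescaling, so the cost of a correction is one encoder pass and one decoder pass, in contrast to calling an iterative solver per prediction.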

Sim2Real-AD: A Modular Sim-to-Real Framework for Deploying VLM-Guided Reinforcement Learning in Real-World Autonomous Driving

Sim2Real-AD：一个模块化模拟到现实框架，用于在现实世界自动驾驶中部署VLM引导强化学习

Authors: Zilin Huang, Zhengyang Wan, Zihao Sheng, Boyue Wang, Junwei You, Yue Leng, Sikai Chen
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.03497
Pdf link: https://arxiv.org/pdf/2604.03497
Abstract Deploying reinforcement learning policies trained in simulation to real autonomous vehicles remains a fundamental challenge, particularly for VLM-guided RL frameworks whose policies are typically learned with simulator-native observations and simulator-coupled action semantics that are unavailable on physical platforms. This paper presents Sim2Real-AD, a modular framework for zero-shot sim-to-real transfer of CARLA-trained VLM-guided RL policies to full-scale vehicles without any real-world RL training data. The framework decomposes the transfer problem into four components: a Geometric Observation Bridge (GOB) that converts monocular front-view images into simulator-compatible bird's-eye-view (BEV) observations, a Physics-Aware Action Mapping (PAM) that translates policy outputs into platform-agnostic physical commands, a Two-Phase Progressive Training (TPT) strategy that stabilizes adaptation by separating action-space and observation-space transfer, and a Real-time Deployment Pipeline (RDP) that integrates perception, policy inference, control conversion, and safety monitoring for closed-loop execution. Simulation experiments show that the framework preserves the relative performance ordering of representative RL algorithms across different reward paradigms and validate the contribution of each module. Zero-shot deployment on a full-scale Ford E-Transit achieves success rates of 90%, 80%, and 75% in car-following, obstacle avoidance, and stop-sign interaction scenarios, respectively. To the best of our knowledge, this study is among the first to demonstrate zero-shot closed-loop deployment of a CARLA-trained VLM-guided RL policy on a full-scale real vehicle without any real-world RL training data. The demo video and code are linked from the arXiv abstract page.
中文摘要 将在仿真中训练的强化学习策略部署到真实自动驾驶车辆仍是一个根本性挑战，对于VLM引导的RL框架尤其如此：其策略通常基于仿真器原生观测和与仿真器耦合的动作语义学习，而这些在物理平台上无法获得。本文提出Sim2Real-AD，这是一个模块化框架，用于将CARLA训练的VLM引导RL策略零样本地从仿真迁移到全尺寸车辆，无需任何真实世界RL训练数据。该框架将迁移问题分解为四个组成部分：几何观测桥（GOB），将单目前视图像转换为与仿真器兼容的鸟瞰图（BEV）观测；物理感知动作映射（PAM），将策略输出转换为平台无关的物理指令；两阶段渐进训练（TPT）策略，通过分离动作空间迁移和观测空间迁移来稳定适应过程；以及实时部署流水线（RDP），集成感知、策略推理、控制转换和安全监控，实现闭环执行。仿真实验表明，该框架保留了代表性强化学习算法在不同奖励范式下的相对性能排序，并验证了各模块的贡献。在全尺寸福特E-Transit上的零样本部署在跟车、避障和停车标志交互场景中分别实现了90%、80%和75%的成功率。据我们所知，本研究是首批在全尺寸真实车辆上演示CARLA训练的VLM引导RL策略零样本闭环部署、且不使用任何真实世界RL训练数据的研究之一。演示视频和代码的链接见arXiv摘要页面。

BioAlchemy: Distilling Biological Literature into Reasoning-Ready Reinforcement Learning Training Data

BioAlchemy：将生物学文献提炼为可直接用于推理的强化学习训练数据

Authors: Brian Hsu, Ozan Gökdemir, Carlo Siebenschuh, Bruce Parrello, Neil Getty, Thomas S. Brettin, Rick L. Stevens, Ian T. Foster, Nicholas Chia, Arvind Ramanathan
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.03506
Pdf link: https://arxiv.org/pdf/2604.03506
Abstract Despite the large corpus of biology training text, the impact of reasoning models on biological research generally lags behind math and coding. In this work, we show that biology questions from current large-scale reasoning datasets do not align well with modern research topic distributions in biology, and that this topic imbalance may negatively affect performance. In addition, we find that methods for extracting challenging and verifiable research problems from biology research text are a critical yet underdeveloped ingredient in applying reinforcement learning for better performance on biology research tasks. We introduce BioAlchemy, a pipeline for sourcing a diverse set of verifiable question-and-answer pairs from a scientific corpus of biology research text. We curate BioAlchemy-345K, a training dataset containing over 345K scientific reasoning problems in biology. Then, we demonstrate how aligning our dataset to the topic distribution of modern scientific biology can be used with reinforcement learning to improve reasoning performance. Finally, we present BioAlchemist-8B, which improves over its base reasoning model by 9.12% on biology benchmarks. These results demonstrate the efficacy of our approach for developing stronger scientific reasoning capabilities in biology. The BioAlchemist-8B model is linked from the arXiv abstract page.
中文摘要 尽管生物学训练文本语料庞大，推理模型对生物学研究的影响总体上仍落后于数学和编程。本研究表明，当前大规模推理数据集中的生物学问题与现代生物学研究主题分布并不匹配，这种主题不平衡可能对性能产生负面影响。此外，我们发现从生物学研究文本中提取具有挑战性且可验证的研究问题的方法，是应用强化学习提升生物学研究任务表现的关键但尚未充分发展的要素。我们提出BioAlchemy，这是一个从生物学研究文本科学语料库中获取多样化可验证问答对的流程。我们构建了BioAlchemy-345K，这是一个包含超过34.5万个生物学科学推理问题的训练数据集。随后，我们展示了如何将数据集与现代生物学的主题分布对齐，并结合强化学习来提升推理性能。最后，我们提出BioAlchemist-8B，其在生物学基准测试中较其基础推理模型提升了9.12%。这些结果证明了我们的方法在培养更强的生物学科学推理能力方面的有效性。BioAlchemist-8B模型的链接见arXiv摘要页面。

Optimizing Neurorobot Policy under Limited Demonstration Data through Preference Regret

通过偏好遗憾优化有限演示数据下的神经机器人策略

Authors: Viet Dung Nguyen, Yuhang Song, Anh Nguyen, Jamison Heard, Reynold Bailey, Alexander Ororbia
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.03523
Pdf link: https://arxiv.org/pdf/2604.03523
Abstract Robot reinforcement learning from demonstrations (RLfD) assumes that expert data is abundant; this is usually unrealistic in the real world given data scarcity as well as high collection cost. Furthermore, imitation learning algorithms assume that the data is independently and identically distributed, which ultimately results in poorer performance as gradual errors emerge and compound within test-time trajectories. We address these issues by introducing the "master your own expertise" (MYOE) framework, a self-imitation framework that enables robotic agents to learn complex behaviors from limited demonstration data samples. Inspired by human perception and action, we propose and design what we call the queryable mixture-of-preferences state space model (QMoP-SSM), which estimates the desired goal at every time step. These desired goals are used in computing the "preference regret", which is used to optimize the robot control policy. Our experiments demonstrate the robustness, adaptability, and out-of-sample performance of our agent compared to other state-of-the-art RLfD schemes. The GitHub repository that supports this work is linked from the arXiv abstract page.
中文摘要 机器人演示强化学习（RLfD）假设专家数据充足；鉴于数据稀缺和采集成本高昂，这在现实中通常并不现实。此外，模仿学习算法假设数据是独立同分布的，这最终导致性能下降，因为误差会在测试时的轨迹中逐渐出现并不断累积。我们通过引入“掌握自己的专业知识”（MYOE）框架来解决这些问题，这是一种自我模仿框架，使机器人智能体能够从有限的演示数据样本中学习复杂行为。受人类感知和行动的启发，我们提出并设计了可查询偏好混合状态空间模型（QMoP-SSM），该模型在每个时间步估计期望目标。这些期望目标用于计算“偏好遗憾”，后者用于优化机器人控制策略。我们的实验展示了我们的智能体相较于其他最先进RLfD方案的鲁棒性、适应性和样本外性能。支持这项工作的GitHub仓库链接见arXiv摘要页面。

Drift-Based Policy Optimization: Native One-Step Policy Learning for Online Robot Control

基于漂移的策略优化：在线机器人控制的原生一步策略学习

Authors: Yuxuan Gao, Yedong Shen, Shiqi Zhang, Wenhao Yu, Yifan Duan, Jia Pan, Jiajia Wu, Jiajun Deng, Yanyong Zhang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.03540
Pdf link: https://arxiv.org/pdf/2604.03540
Abstract Although multi-step generative policies achieve strong performance in robotic manipulation by modeling multimodal action distributions, they require multi-step iterative denoising at inference time. Each action therefore needs tens to hundreds of network function evaluations (NFEs), making them costly for high-frequency closed-loop control and online reinforcement learning (RL). To address this limitation, we propose a two-stage framework for native one-step generative policies that shifts refinement from inference to training. First, we introduce the Drift-Based Policy (DBP), which leverages fixed-point drifting objectives to internalize iterative refinement into the model parameters, yielding a one-step generative backbone by design while preserving multimodal action modeling capacity. Second, we develop Drift-Based Policy Optimization (DBPO), an online RL framework that equips the pretrained backbone with a compatible stochastic interface, enabling stable on-policy updates without sacrificing the one-step deployment property. Extensive experiments demonstrate the effectiveness of the proposed framework across offline imitation learning, online fine-tuning, and real-world control scenarios. DBP matches or exceeds the performance of multi-step diffusion policies while achieving up to 100× faster inference. It also consistently outperforms existing one-step baselines on challenging manipulation benchmarks. Moreover, DBPO enables effective and stable policy improvement in online settings. Experiments on a real-world dual-arm robot demonstrate reliable high-frequency control at 105.2 Hz.
中文摘要 尽管多步生成策略通过建模多模态动作分布在机器人操作中表现出色，但它们在推理时需要多步迭代去噪。因此，每个动作都需要数十到数百次网络函数评估（NFE），这使其在高频闭环控制和在线强化学习（RL）中成本高昂。为解决这一局限，我们提出了一个面向原生一步生成策略的两阶段框架，将细化过程从推理阶段转移到训练阶段。首先，我们提出基于漂移的策略（DBP），该策略利用不动点漂移目标将迭代细化内化到模型参数中，在设计上即为一步生成骨干网络，同时保留了多模态动作建模能力。其次，我们开发了基于漂移的策略优化（DBPO），这是一个在线强化学习框架，为预训练骨干网络配备了兼容的随机接口，实现稳定的同策略（on-policy）更新，同时不牺牲一步部署特性。大量实验证明了该框架在离线模仿学习、在线微调和真实世界控制场景中的有效性。DBP的性能达到甚至超过多步扩散策略，同时推理速度最高提升100倍。它在具有挑战性的操作基准测试中也持续优于现有的一步基线。此外，DBPO能够在在线环境中有效且稳定地改进策略。在真实世界双臂机器人上进行的实验展示了105.2 Hz的可靠高频控制。

When Adaptive Rewards Hurt: Causal Probing and the Switching-Stability Dilemma in LLM-Guided LEO Satellite Scheduling

当自适应奖励带来伤害：因果探测与LLM引导LEO卫星调度中的切换稳定性困境

Authors: Yuanhang Li
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.03562
Pdf link: https://arxiv.org/pdf/2604.03562
Abstract Adaptive reward design for deep reinforcement learning (DRL) in multi-beam LEO satellite scheduling is motivated by the intuition that regime-aware reward weights should outperform static ones. We systematically test this intuition and uncover a switching-stability dilemma: near-constant reward weights (342.1 Mbps) outperform carefully tuned dynamic weights (103.3 ± 96.8 Mbps) because PPO requires a quasi-stationary reward signal for value function convergence. Weight adaptation, regardless of quality, degrades performance by repeatedly restarting convergence. To understand why specific weights matter, we introduce a single-variable causal probing method that independently perturbs each reward term by ±20% and measures PPO response after 50k steps. Probing reveals counterintuitive leverage: a +20% increase in the switching penalty yields +157 Mbps for polar handover and +130 Mbps for hot-cold regimes, findings inaccessible to human experts or trained MLPs without systematic probing. We evaluate four MDP architect variants (fixed, rule-based, learned MLP, fine-tuned LLM) across known and novel traffic regimes. The MLP achieves 357.9 Mbps on known regimes and 325.2 Mbps on novel regimes, while the fine-tuned LLM collapses to 45.3 ± 43.0 Mbps due to weight oscillation rather than lack of domain knowledge: output consistency, not knowledge, is the binding constraint. Our findings provide an empirically grounded roadmap for LLM-DRL integration in communication systems, identifying where LLMs add irreplaceable value (natural language intent understanding) versus where simpler methods suffice.
中文摘要 多波束LEO卫星调度中深度强化学习（DRL）的自适应奖励设计，源于这样一种直觉：感知流量状态（regime）的奖励权重应优于静态权重。我们系统地检验了这一直觉，并发现了一个切换-稳定性困境：近乎恒定的奖励权重（342.1 Mbps）优于精心调校的动态权重（103.3 ± 96.8 Mbps），因为PPO需要准平稳的奖励信号才能使价值函数收敛。权重自适应无论质量如何，都会因反复重启收敛过程而降低性能。为了理解特定权重为何重要，我们引入了一种单变量因果探测方法，该方法独立地将每个奖励项扰动±20%，并在5万步后测量PPO的响应。探测揭示了反直觉的杠杆效应：将切换惩罚提高20%，在极地切换状态下可带来+157 Mbps，在冷热状态下可带来+130 Mbps；若不进行系统探测，人类专家或训练好的MLP都无法得到这些发现。我们在已知和全新的流量状态下评估了四种MDP architect变体（固定型、基于规则型、学习型MLP、微调LLM）。MLP在已知状态下达到357.9 Mbps，在全新状态下达到325.2 Mbps，而微调后的LLM则跌至45.3 ± 43.0 Mbps，原因是权重振荡而非缺乏领域知识：真正的约束是输出一致性，而不是知识。我们的发现为通信系统中的LLM-DRL集成提供了有实证依据的路线图，明确了LLM在哪些方面具有不可替代的价值（自然语言意图理解），以及在哪些方面更简单的方法就已足够。
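
The single-variable probing procedure from the abstract above (perturb one reward term by ±20%, hold the rest fixed, measure the change) can be sketched as below. The weight names and `toy_evaluate` are hypothetical: in the paper's setting, evaluation would mean training PPO for 50k steps and measuring throughput, which the toy function only imitates with a closed-form expression.

```python
from typing import Callable, Dict, Tuple

Weights = Dict[str, float]


def probe_weights(weights: Weights,
                  evaluate: Callable[[Weights], float],
                  rel: float = 0.20) -> Dict[Tuple[str, int], float]:
    """Perturb each weight by +/-rel, one at a time, and record the change in score."""
    base = evaluate(weights)
    effects = {}
    for name, value in weights.items():
        for sign in (1, -1):
            perturbed = dict(weights)  # copy so only one term changes per probe
            perturbed[name] = value * (1 + sign * rel)
            effects[(name, sign)] = evaluate(perturbed) - base
    return effects


def toy_evaluate(w: Weights) -> float:
    # Hypothetical stand-in for "train PPO, then measure throughput in Mbps".
    return 100.0 * w["switch_penalty"] - 10.0 * w["power"]


weights = {"switch_penalty": 1.0, "power": 2.0}  # illustrative names and values
effects = probe_weights(weights, toy_evaluate)
```

Each probe costs one full evaluation, so a reward with n terms needs 2n + 1 training runs; this is the price of isolating one term's effect from the others.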

HAD: Combining Hierarchical Diffusion with Metric-Decoupled RL for End-to-End Driving

HAD：将分层扩散与度量解耦强化学习结合，实现端到端驾驶

Authors: Wenhao Yao, Xinglong Sun, Zhenxin Li, Shiyi Lan, Zi Wang, Jose M. Alvarez, Zuxuan Wu
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.03581
Pdf link: https://arxiv.org/pdf/2604.03581
Abstract End-to-end planning has emerged as a dominant paradigm for autonomous driving, where recent models often adopt a scoring-selection framework to choose trajectories from a large set of candidates, with diffusion-based decoding showing strong promise. However, directly selecting from the entire candidate space remains difficult to optimize, and Gaussian perturbations used in diffusion often introduce unrealistic trajectories that complicate the denoising process. In addition, for training these models, reinforcement learning (RL) has shown promise, but existing end-to-end RL approaches typically rely on a single coupled reward without structured signals, limiting optimization effectiveness. To address these challenges, we propose HAD, an end-to-end planning framework with a Hierarchical Diffusion Policy that decomposes planning into a coarse-to-fine process. To improve trajectory generation, we introduce Structure-Preserved Trajectory Expansion, which produces realistic candidates while maintaining kinematic structure. For policy learning, we develop Metric-Decoupled Policy Optimization (MDPO) to enable structured RL optimization across multiple driving objectives. Extensive experiments show that HAD achieves new state-of-the-art performance on both NAVSIM and HUGSIM, outperforming prior work by a large margin: +2.3 EPDMS on NAVSIM and +4.9 Route Completion on HUGSIM.
中文摘要 端到端规划已成为自动驾驶的主导范式，近期模型常采用评分-选择框架从大量候选中选择轨迹，其中基于扩散的解码展现出良好前景。然而，直接从整个候选空间中选择仍然难以优化，扩散中使用的高斯扰动常常引入不现实的轨迹，使去噪过程变得复杂。此外，强化学习（RL）在训练这些模型方面展现出潜力，但现有端到端强化学习方法通常依赖单一耦合奖励，缺乏结构化信号，限制了优化效果。为应对这些挑战，我们提出了HAD，一种端到端规划框架，采用分层扩散策略，将规划分解为由粗到细的过程。为了改进轨迹生成，我们引入了结构保持轨迹扩展，该方法在保持运动学结构的同时生成逼真的候选轨迹。在策略学习方面，我们开发了度量解耦策略优化（MDPO），以实现跨多个驾驶目标的结构化强化学习优化。大量实验表明，HAD在NAVSIM和HUGSIM上均取得了新的最先进性能，大幅超越已有方法：在NAVSIM上EPDMS提升2.3，在HUGSIM上路线完成率提升4.9。

Delayed Homomorphic Reinforcement Learning for Environments with Delayed Feedback

延迟反馈环境的延迟同态强化学习

Authors: Jongsoo Lee, Jangwon Kim, Soohee Han
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.03641
Pdf link: https://arxiv.org/pdf/2604.03641
Abstract Reinforcement learning in real-world systems is often accompanied by delayed feedback, which breaks the Markov assumption and impedes both learning and control. Canonical state augmentation approaches cause state-space explosion, which introduces a severe sample-complexity burden. Despite recent progress, the state-of-the-art augmentation-based baselines remain incomplete: they either predominantly reduce the burden on the critic or adopt non-unified treatments for the actor and critic. To provide a structured and sample-efficient solution, we propose delayed homomorphic reinforcement learning (DHRL), a framework grounded in MDP homomorphisms that collapses belief-equivalent augmented states and enables efficient policy learning on the resulting abstract MDP without loss of optimality. We provide theoretical analyses of state-space compression bounds and sample complexity, and introduce a practical algorithm. Experiments on continuous control tasks in the MuJoCo benchmark confirm that our algorithm outperforms strong augmentation-based baselines, particularly under long delays.
中文摘要 现实系统中的强化学习常伴随着延迟反馈，这破坏了马尔可夫假设，阻碍学习和控制。经典的状态增广方法会导致状态空间爆炸，从而带来严重的样本复杂度负担。尽管近期取得了进展，最先进的基于增广的基线仍不完善：它们要么主要减轻critic的负担，要么对actor和critic采取不统一的处理方式。为了提供结构化且样本高效的解决方案，我们提出了延迟同态强化学习（DHRL），这是一种基于MDP同态的框架，能够合并信念等价的增广状态，并在不损失最优性的情况下在所得抽象MDP上实现高效策略学习。我们给出了状态空间压缩界和样本复杂度的理论分析，并提出了一种实用算法。在MuJoCo基准的连续控制任务上的实验证实，我们的算法优于强大的基于增广的基线，在长延迟下尤为明显。
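
To make the state-space explosion mentioned in the abstract above concrete, here is a minimal sketch of the canonical augmentation baseline (not DHRL itself): the agent's state is the last delayed observation plus the actions issued since. The class name, the no-op padding action, and the discrete-space size formula are illustrative assumptions.

```python
from collections import deque
from typing import Hashable, Tuple


class DelayAugmenter:
    """Augmented state = (most recent delayed observation, last `delay` actions)."""

    def __init__(self, delay: int, initial_obs: Hashable, noop_action: int = 0):
        self.obs = initial_obs
        # Assumption: pad the action history with a no-op before the first step.
        self.actions = deque([noop_action] * delay, maxlen=delay)

    def step(self, delayed_obs: Hashable, action: int) -> Tuple[Hashable, Tuple[int, ...]]:
        self.obs = delayed_obs
        self.actions.append(action)  # the oldest action drops out automatically
        return (self.obs, tuple(self.actions))


def augmented_state_count(n_states: int, n_actions: int, delay: int) -> int:
    """Size of the augmented space for discrete states and actions."""
    return n_states * n_actions ** delay


aug = DelayAugmenter(delay=2, initial_obs="s0")
first = aug.step("s0", 1)
second = aug.step("s1", 2)
size = augmented_state_count(n_states=10, n_actions=4, delay=3)
```

The count grows exponentially in the delay, which is the sample-complexity burden that collapsing belief-equivalent augmented states is meant to relieve.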

User Simulator-Guided Multi-Turn Preference Optimization for Reasoning LLM-based Conversational Recommendation

用户模拟器引导的多回合偏好优化，用于基于LLM的对话推荐推理

Authors: Xingyuan Xiang, Xiangchen Pan, Wei Wei
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2604.03671
Pdf link: https://arxiv.org/pdf/2604.03671
Abstract Conversational Recommender Systems (CRSs) leverage natural language interactions for personalized recommendation, yet information-scarce dialogue histories and single-turn recommendation paradigms may severely hinder accurate modeling of complex user preferences. To alleviate this issue, recent studies have introduced LLM-based user simulators, which generate natural language feedback and perform simulated multi-turn interactions to assist recommendation. Nevertheless, since simulators cannot access true user preference labels during inference, their feedback may deviate from actual user interests, causing errors to accumulate over multiple interactions and severely affecting the generalization of the recommender. Inspired by the multi-step reasoning capabilities of LLMs and the effectiveness of reinforcement learning in policy optimization, we propose SMTPO, a user simulator-guided multi-turn preference optimization conversational recommendation framework. To align simulator-generated feedback with true user preferences in the absence of explicit labels, we enhance feedback quality via multi-task supervised fine-tuning (SFT), enabling the simulator to better reflect users' complex and diverse needs. To address the challenge of biased feedback destabilizing multi-turn optimization, we first allow the reasoning LLM-based recommender to learn preference reasoning and recommendation patterns through SFT and then employ reinforcement learning with fine-grained reward design to progressively align with true user preferences, improving recommendation performance. Extensive experiments on public datasets demonstrate the effectiveness and transferability of our method.
中文摘要 会话推荐系统（CRS）利用自然语言交互实现个性化推荐，但信息稀缺的对话历史和单回合推荐范式可能严重阻碍复杂用户偏好的准确建模。为缓解这一问题，近期研究引入了基于LLM的用户模拟器，能够生成自然语言反馈并进行模拟多回合交互以辅助推荐。然而，由于模拟器在推断过程中无法访问真正的用户偏好标签，其反馈可能偏离实际用户兴趣，导致错误在多次交互中累积，严重影响推荐者的泛化。受大型语言模型（LLM）多步推理能力和强化学习在策略优化中的有效性启发，我们提出了SMTPO，这是一个由用户模拟器引导的多回合偏好优化对话推荐框架。为了在无明确标签的情况下，将模拟器生成的反馈与真实用户偏好对齐，我们通过多任务监督微调（SFT）提升反馈质量，使模拟器更好地反映用户复杂多样的需求。为应对偏向反馈导致多回合优化不稳定的挑战，我们首先允许基于LLM的推理推荐者通过SFT学习偏好推理和推荐模式，然后通过精细奖励设计的强化学习逐步与真实用户偏好对齐，提升推荐性能。在公开数据集上的大量实验证明了我们方法的有效性和可迁移性。

PRAISE: Prefix-Based Rollout Reuse in Agentic Search Training

PRAISE：智能体搜索训练中基于前缀的rollout复用

Authors: Erhan Zhang, Yiqun Chen, Zechun Niu, Wei Yang, Xiaochi Wei, Yan Gao, Yi Wu, Yao Hu, Jiaxin Mao
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.03675
Pdf link: https://arxiv.org/pdf/2604.03675
Abstract In agentic search, large language models (LLMs) are trained to perform multi-turn retrieval and reasoning for complex tasks such as multi-hop question answering (QA). However, current search-based Reinforcement Learning (RL) methods suffer from two core limitations: expensive long-horizon rollouts are under-utilized during training, and supervision is typically available only at the final answer, resulting in severe reward sparsity. We present Prefix-based Rollout reuse for Agentic search with Intermediate Step rEwards (PRAISE), a framework for improving both data efficiency and credit assignment in agentic search training. Given a complete search trajectory, PRAISE extracts prefix states at different search turns, elicits intermediate answers from them, and uses these prefixes both to construct additional training trajectories and to derive step-level rewards from performance differences across prefixes. Our method uses a single shared model for both search policy learning and prefix answer evaluation, enabling joint optimization without extra human annotations or a separate reward model. Experiments on multi-hop QA benchmarks show that PRAISE consistently improves performance over strong baselines.
中文摘要 在智能体搜索（agentic search）中，大型语言模型（LLM）被训练用于针对多跳问答（QA）等复杂任务执行多轮检索和推理。然而，当前基于搜索的强化学习（RL）方法存在两个核心局限：代价高昂的长程rollout在训练中未被充分利用，且监督信号通常仅在最终答案处可用，导致奖励严重稀疏。我们提出PRAISE（Prefix-based Rollout reuse for Agentic search with Intermediate Step rEwards，即带中间步骤奖励的智能体搜索前缀rollout复用），这是一个旨在同时提升智能体搜索训练中数据效率和信用分配的框架。给定一条完整的搜索轨迹，PRAISE在不同搜索轮次提取前缀状态，从中引出中间答案，并利用这些前缀构建额外的训练轨迹，同时根据各前缀之间的性能差异得到步骤级奖励。我们的方法使用单一共享模型进行搜索策略学习和前缀答案评估，实现联合优化，无需额外人工标注或独立的奖励模型。在多跳问答基准上的实验表明，PRAISE相较于强基线持续提升性能。
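
The abstract above says step-level rewards are derived "from performance differences across prefixes" without giving a formula. One plausible reading, shown here purely as an assumption, is that the reward for a search turn is the change in intermediate-answer quality from the previous prefix to the current one. The scores below are made-up numbers.

```python
from typing import List


def step_rewards(prefix_scores: List[float], initial_score: float = 0.0) -> List[float]:
    """Reward for turn t = score of prefix t minus score of prefix t-1.

    Assumption: prefix_scores[t] is the quality (e.g. answer F1) of the
    intermediate answer elicited after search turn t.
    """
    rewards = []
    prev = initial_score
    for score in prefix_scores:
        rewards.append(score - prev)
        prev = score
    return rewards


# Hypothetical trajectory: the first search turn does not help,
# the second and third each improve the intermediate answer.
rewards = step_rewards([0.0, 0.5, 1.0])
```

Under this reading the step rewards telescope, so their sum equals the final-answer score minus the initial score, which keeps the dense signal consistent with the sparse outcome reward.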

RL-Driven Sustainable Land-Use Allocation for the Lake Malawi Basin

强化学习驱动的马拉维湖流域可持续土地利用分配

Authors: Ying Yao
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.03768
Pdf link: https://arxiv.org/pdf/2604.03768
Abstract Unsustainable land-use practices in ecologically sensitive regions threaten biodiversity, water resources, and the livelihoods of millions. This paper presents a deep reinforcement learning (RL) framework for optimizing land-use allocation in the Lake Malawi Basin to maximize total ecosystem service value (ESV). Drawing on the benefit transfer methodology of Costanza et al., we assign biome-specific ESV coefficients -- locally anchored to a Malawi wetland valuation -- to nine land-cover classes derived from Sentinel-2 imagery. The RL environment models a 50x50 cell grid at 500m resolution, where a Proximal Policy Optimization (PPO) agent with action masking iteratively transfers land-use pixels between modifiable classes. The reward function combines per-cell ecological value with spatial coherence objectives: contiguity bonuses for ecologically connected land-use patches (forest, cropland, built area etc.) and buffer zone penalties for high-impact development adjacent to water bodies. We evaluate the framework across three scenarios: (i) pure ESV maximization, (ii) ESV with spatial reward shaping, and (iii) a regenerative agriculture policy scenario. Results demonstrate that the agent effectively learns to increase total ESV; that spatial reward shaping successfully steers allocations toward ecologically sound patterns, including homogeneous land-use clustering and slight forest consolidation near water bodies; and that the framework responds meaningfully to policy parameter changes, establishing its utility as a scenario-analysis tool for environmental planning.
中文摘要 生态敏感地区的不可持续土地利用行为威胁着生物多样性、水资源和数百万人的生计。本文提出了一个深度强化学习（RL）框架，用于优化马拉维湖流域的土地利用分配，以最大化生态系统服务价值（ESV）。借鉴Costanza等人的利益转移方法，我们将基于Sentinel-2影像的九个土地覆盖类别赋予了生物群系特定的ESV系数——这些系数在当地锚定于马拉维湿地估价。强化学习环境建模一个50x50单元格、分辨率500米，近端策略优化（PPO）代理带有动作掩蔽，通过迭代传输可修改类别之间的土地利用像素。奖励函数结合了每细胞的生态价值与空间连贯目标：对生态相连的土地利用斑块（森林、农田、建地等）提供连续性加成，以及对邻近水体的高影响开发的缓冲区惩罚。我们通过三种情景评估该框架：（i）纯ESV最大化，（ii）带有空间奖励塑造的ESV，以及（iii）再生农业政策情景。结果表明，该主体能够有效学习增加总ESV;空间奖励塑造成功引导分配朝向生态合理的模式，包括同质性土地利用聚集和水体附近的轻微森林整合;并且该框架能够对政策参数的变化做出有意义的响应，确立了其作为环境规划情景分析工具的实用价值。
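
The reward structure described in the abstract above (per-cell ecological value, plus contiguity bonuses, minus buffer-zone penalties near water) could be sketched as follows. All numbers are hypothetical: the ESV coefficients, the bonus and penalty magnitudes, the four class names, and the 4-neighbor adjacency rule are illustrative assumptions, not values from the paper.

```python
from typing import Iterator, List, Tuple

# Hypothetical per-cell ecosystem service values (arbitrary units).
ESV = {"forest": 3.0, "cropland": 1.0, "built": 0.0, "water": 5.0}


def neighbors(r: int, c: int, rows: int, cols: int) -> Iterator[Tuple[int, int]]:
    """Yield the in-bounds 4-neighbors of cell (r, c)."""
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        rr, cc = r + dr, c + dc
        if 0 <= rr < rows and 0 <= cc < cols:
            yield rr, cc


def reward(grid: List[List[str]],
           contiguity_bonus: float = 0.1,
           buffer_penalty: float = 1.0) -> float:
    rows, cols = len(grid), len(grid[0])
    total = 0.0
    for r in range(rows):
        for c in range(cols):
            cell = grid[r][c]
            total += ESV[cell]  # per-cell ecological value
            for rr, cc in neighbors(r, c, rows, cols):
                other = grid[rr][cc]
                # Same-class adjacency; each adjacent pair is counted from both sides.
                if other == cell and cell != "water":
                    total += contiguity_bonus
                # Built-up cell directly next to a water body.
                if cell == "built" and other == "water":
                    total -= buffer_penalty
    return total


grid = [["forest", "forest"],
        ["built", "water"]]
score = reward(grid)
```

Setting `contiguity_bonus` and `buffer_penalty` to zero recovers a pure ESV objective, which mirrors the abstract's distinction between scenario (i) and scenario (ii).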

Decomposing Communication Gain and Delay Cost Under Cross-Timestep Delays in Cooperative Multi-Agent Reinforcement Learning

在合作多智能体强化学习中，跨时间步延迟下的通信增益和延迟成本分解

Authors: Zihong Gao, Hongjian Liang, Lei Hao, Liangjun Ke
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2604.03785
Pdf link: https://arxiv.org/pdf/2604.03785
Abstract Communication is essential for coordination in cooperative multi-agent reinforcement learning under partial observability, yet cross-timestep delays cause messages to arrive multiple timesteps after generation, inducing temporal misalignment and making information stale when consumed. We formalize this setting as a delayed-communication partially observable Markov game (DeComm-POMG) and decompose a message's effect into communication gain and delay cost, yielding the Communication Gain and Delay Cost (CGDC) metric. We further establish a value-loss bound showing that the degradation induced by delayed messages is upper-bounded by a discounted accumulation of an information gap between the action distributions induced by timely versus delayed messages. Guided by CGDC, we propose CDCMA, an actor-critic framework that requests messages only when predicted CGDC is positive, predicts future observations to reduce misalignment at consumption, and fuses delayed messages via CGDC-guided attention. Experiments on no-teammate-vision variants of Cooperative Navigation and Predator Prey, and on SMAC maps across multiple delay levels, show consistent improvements in performance, robustness, and generalization, with ablations validating each component.
中文摘要 在部分可观察性下，\emph{cooperative}多智能体强化学习中的协调至关重要，但\emph{跨时间步}的延迟会导致消息在生成后多个时间步到达，导致时间错位，并在消费时使信息变得陈旧。我们将该设定形式化为延迟通信部分可观测马尔可夫博弈（DeComm-POMG），并将消息效果分解为\emph{通信收益}和\emph{延迟成本}，得到通信增益和延迟成本（CGDC）度量。我们进一步建立了价值损失上界，表明延迟消息引起的退化是由及时消息与延迟消息诱发的动作分布之间信息差距的折现累积所上界。在CGDC的指导下，我们提出了\textbf{CDCMA}，这是一个actor-critic框架，仅在预测CGDC为正时请求消息，预测未来观测以减少消费时的错位，并通过CGDC引导注意力融合延迟消息。在无队友视觉的Cooperative Navigation和Predator Prey变体，以及跨多延迟级别的SMAC地图上的实验显示，性能、鲁棒性和泛化性均有持续提升，消融验证了每个组件。
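
Two mechanisms named in the abstract above, requesting a message only when predicted CGDC is positive and fusing delayed messages with CGDC-guided attention, can be illustrated with a small sketch. The abstract does not give the formulas, so the definition `gain - delay_cost` and the softmax weighting below are assumptions chosen only to make the idea concrete.

```python
import math


def cgdc(predicted_gain, predicted_delay_cost):
    """Assumed form of the metric: communication gain minus delay cost."""
    return predicted_gain - predicted_delay_cost


def should_request(predicted_gain, predicted_delay_cost):
    """Request a message only when its predicted CGDC is positive."""
    return cgdc(predicted_gain, predicted_delay_cost) > 0.0


def fuse(messages, scores):
    """Softmax-weighted average of message vectors, weighted by their CGDC scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for numerical stability
    norm = sum(exps)
    weights = [e / norm for e in exps]
    dim = len(messages[0])
    return [sum(w * msg[i] for w, msg in zip(weights, messages)) for i in range(dim)]


print(should_request(0.5, 0.2))
print(fuse([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]))
```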

Provable Multi-Task Reinforcement Learning: A Representation Learning Framework with Low Rank Rewards

可证实的多任务强化学习：一种低秩奖励的表征学习框架

Authors: Yaoze Guo, Shana Moothedath
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.03891
Pdf link: https://arxiv.org/pdf/2604.03891
Abstract Multi-task representation learning (MTRL) is an approach that learns shared latent representations across related tasks, facilitating collaborative learning that improves the overall learning efficiency. This paper studies MTRL for multi-task reinforcement learning (RL), where multiple tasks have the same state-action space and transition probabilities, but different rewards. We consider T linear Markov Decision Processes (MDPs) where the reward functions and transition dynamics admit linear feature embeddings of dimension d. The relatedness among the tasks is captured by a low-rank structure on the reward matrices. Learning shared representations across multiple RL tasks is challenging due to the complex and policy-dependent nature of data that leads to a temporal progression of error. Our approach adopts a reward-free reinforcement learning framework to first learn a data-collection policy. This policy then informs an exploration strategy for estimating the unknown reward matrices. Importantly, the data collected under this well-designed policy enable accurate estimation, which ultimately supports the learning of a near-optimal policy. Unlike existing approaches that rely on restrictive assumptions such as Gaussian features, incoherence conditions, or access to optimal solutions, we propose a low-rank matrix estimation method that operates under more general feature distributions encountered in RL settings. Theoretical analysis establishes that accurate low-rank matrix recovery is achievable under these relaxed assumptions, and we characterize the relationship between representation error and sample complexity. Leveraging the learned representation, we construct near-optimal policies and prove a regret bound. Experimental results demonstrate that our method effectively learns robust shared representations and task dynamics from finite data.
中文摘要 多任务表示学习（MTRL）是一种学习相关任务间共享潜在表征的方法，促进协作学习，从而提升整体学习效率。本文研究了多任务强化学习（RL）中的MTRL，其中多个任务具有相同的状态-动作空间和过渡概率，但奖励不同。我们考虑T线性马尔可夫决策过程（MDP），其中奖励函数和转移动力学允许维数为d的线性特征嵌入。任务之间的关联性通过奖励矩阵上的低秩结构来捕捉。由于数据复杂且依赖策略，导致误差时间递进，学习跨多个强化学习的共享表示具有挑战性。我们的方法采用无奖励的强化学习框架，先学习数据收集策略。该策略随后指导了估计未知奖励矩阵的探索策略。重要的是，这一精心设计策略下收集的数据能够实现准确估计，最终支持学习近似最优策略。与依赖高斯特征、非相干条件或最优解等限制性假设的现有方法不同，我们提出了一种低秩矩阵估计方法，适用于强化学习中遇到的更一般特征分布。理论分析表明，在这些宽松假设下，准确的低秩矩阵恢复是可能的，我们描述了表示误差与样本复杂度之间的关系。利用所学表征，我们构建近优策略，证明了遗憾的约束。实验结果表明，我们的方法能够有效地从有限数据中学习稳健的共享表示和任务动态。
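
One common way to write the low-rank reward structure mentioned in the abstract above is shown below. The symbols \phi, \theta_t, B, W, and the rank k are notation assumed for this sketch; the paper may define the reward matrices differently (for example, one per step of the horizon).

```latex
\begin{align*}
r_t(s,a) &= \phi(s,a)^\top \theta_t, \qquad t = 1,\dots,T, \\
\Theta &= [\theta_1,\dots,\theta_T] = B\,W, \qquad
B \in \mathbb{R}^{d \times k},\; W \in \mathbb{R}^{k \times T},\; k \ll \min(d, T).
\end{align*}
```

Under this reading, the shared representation is the column space of B, and each task contributes only a k-dimensional coefficient vector (a column of W).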

Can LLMs Learn to Reason Robustly under Noisy Supervision?

大型语言模型（LLM）能否在嘈杂的监督下学会强有力的推理能力？

Authors: Shenzhi Yang, Guangcheng Zhu, Bowen Song, Sharon Li, Haobo Wang, Xing Zheng, Yingfan Ma, Zhongqi Chen, Weiqiang Wang, Gang Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.03993
Pdf link: https://arxiv.org/pdf/2604.03993
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) effectively trains reasoning models that rely on abundant perfect labels, but its vulnerability to unavoidable noisy labels due to expert scarcity remains critically underexplored. In this work, we take the first step toward a systematic analysis of noisy label mechanisms in RLVR. In contrast to supervised classification, most RLVR algorithms incorporate a rollout-based condition: a label's influence on training is contingent on whether the current policy can generate rollouts that realize it, a property that naturally extends to noisy labels. Based on this observation, we distinguish two types of noise: inactive noisy labels, which reduce data efficiency, and active noisy labels, which are reinforced and risk skewing the model toward incorrect distributions. From experiments on training with noisy samples, we identify an Early Correctness Coherence phenomenon: although noisy samples begin to lag behind in later stages, accuracy on both clean and noisy samples increases similarly in early training. Motivated by this dynamic, we propose Online Label Refinement (OLR), which progressively corrects potentially noisy labels with majority-voted answers when two conditions hold: a positive slope in the majority answer's rollout pass rate and stable historical consistency across updates, enabling gradual self-correction as the policy improves. We evaluate OLR on six in-distribution mathematical reasoning benchmarks (AIME24/25, AMC, MATH-500, Minerva, and Olympiad) and three out-of-distribution tasks (ARC-c, GPQA-diamond, and MMLU-pro). Across noise ratios from 0.1 to 0.9, OLR consistently improves robustness under both inactive and active noisy-label settings, achieving average gains of 3.6% to 3.9% on in-distribution benchmarks and 3.3% to 4.6% on out-of-distribution evaluations.
中文摘要 带可验证奖励的强化学习（RLVR）有效训练依赖大量完美标签的推理模型，但其对专家稀缺性导致不可避免噪声标签的脆弱性仍未被充分探讨。本研究迈出了对RLVR中噪声标记机制系统分析的第一步。与监督分类不同，大多数RLVR算法包含一个基于推广的条件：标签对训练的影响取决于当前策略是否能生成实现该功能的推广，这一特性自然延伸到噪声标签。基于这一观察，我们区分了两种噪声：非活跃噪声标签，降低数据效率;活跃噪声标签，后者被强化，可能导致模型偏向错误分布。通过对噪声样本训练的实验，我们识别出一种早期正确性相干现象：虽然噪声样本在后期阶段会落后，但早期训练中干净和噪声样本的准确率同样提升。基于这一动态，我们提出了在线标签细化（OLR），当满足两个条件时，通过多数票通过率的正斜率和更新间历史一致性稳定，逐步纠正可能存在噪声的标签：多数回答的覆盖率为正，从而在政策改进时实现逐步自我修正。我们基于六个发行内数学推理基准测试（AIME24/25、AMC、MATH-500、Minerva 和 Olympiad）以及三个非发行任务（ARC-c、GPQA-diamond 和 MMLU-pro）评估 OLR。在噪声比0.1到0.9的范围内，OLR在非活跃和活跃噪声标签设置下持续提升鲁棒性，分布内基准平均提升为3.6%至3.9%，分布外评估为3.3%至4.6%。
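
The refinement rule described in the abstract above (replace a label with the majority-voted answer when its pass rate is rising and the majority answer has been stable) can be sketched as below. The window length `min_stable` and the crude slope test (last pass rate minus first) are assumptions for illustration; the paper's exact criteria are not given in the abstract.

```python
from collections import Counter


def majority_answer(rollout_answers):
    """Return (answer, pass_rate) for the most frequent answer among rollouts."""
    answer, count = Counter(rollout_answers).most_common(1)[0]
    return answer, count / len(rollout_answers)


def refine_label(label, history, min_stable=3):
    """Return a possibly corrected label.

    history: list of rollout-answer lists, one per past policy update, oldest first.
    """
    recent = history[-min_stable:]
    if len(recent) < min_stable:
        return label  # not enough evidence yet
    stats = [majority_answer(rollouts) for rollouts in recent]
    answers = [a for a, _ in stats]
    rates = [p for _, p in stats]
    stable = len(set(answers)) == 1  # same majority answer across updates
    rising = rates[-1] > rates[0]    # crude positive-slope check
    return answers[0] if stable and rising else label


history = [["7", "7", "3", "5"], ["7", "7", "7", "3"], ["7", "7", "7", "7"]]
print(refine_label("3", history))
```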

VA-FastNavi-MARL: Real-Time Robot Control with Multimedia-Driven Meta-Reinforcement Learning

VA-FastNavi-MARL：实时机器人控制，结合多媒体驱动的元强化学习

Authors: Yang Zhang, Shengxi Jing, Fengxiang Wang, Yuan Feng, Hong Wang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.03998
Pdf link: https://arxiv.org/pdf/2604.03998
Abstract Interpreting dynamic, heterogeneous multimedia commands with real-time responsiveness is critical for Human-Robot Interaction. We present VA-FastNavi-MARL, a framework that aligns asynchronous audio-visual inputs into a unified latent representation. By treating diverse instructions as a distribution of navigable goals via Meta-Reinforcement Learning, our method enables rapid adaptation to unseen directives with negligible inference overhead. Unlike approaches bottlenecked by heavy sensory processing, our modality-agnostic stream ensures seamless, low-latency control. Validation on a multi-arm workspace confirms that VA-FastNavi-MARL significantly outperforms baselines in sample efficiency and maintains robust, real-time execution even under noisy multimedia streams.
中文摘要 以实时响应性解读动态、异构多媒体命令对于人机交互至关重要。我们介绍了VA-FastNavi-MARL，一个将异步视听输入对齐为统一潜在表示的框架。通过将多样化指令视为通过元强化学习的可导航目标分布，我们的方法能够快速适应看不见的指令，推理开销极低。与被大量感官处理所限制的方法不同，我们的模式无关流程确保了无缝、低延迟的控制。多臂工作区验证证实，VA-FastNavi-MARL在样本效率上显著优于基线，即使在噪声较大的多媒体流下也能保持稳健的实时执行。

Multi-AUV Trajectory Learning for Sustainable Underwater IoT with Acoustic Energy Transfer

多AUV轨迹学习，实现声能传输的可持续水下物联网

Authors: Mohamed Afouene Melki, Mohammad Shehab, Mohamed-Slim Alouini
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2604.04079
Pdf link: https://arxiv.org/pdf/2604.04079
Abstract The Internet of Underwater Things (IoUT) supports ocean sensing and offshore monitoring but requires coordinated mobility and energy-aware communication to sustain long-term operation. This letter proposes a multi-AUV framework that jointly addresses trajectory control and acoustic communication for sustainable IoUT operation. The problem is formulated as a Markov decision process that integrates continuous AUV kinematics, propulsion-aware energy consumption, acoustic energy transfer feasibility, and Age of Information (AoI) regulation. A centralized deep reinforcement learning policy based on Proximal Policy Optimization (PPO) is developed to coordinate multiple AUVs under docking and safety constraints. The proposed approach is evaluated against structured heuristic baselines and demonstrates significant reductions in average AoI while improving fairness and data collection efficiency. Results show that cooperative multi-AUV control provides scalable performance gains as the network size increases.
中文摘要 水下物联网（IoUT）支持海洋感测和海上监测，但需要协调的移动性和能量感知通信，以维持长期运营。该信提出了一个多AUV框架，共同解决轨迹控制和声学通信，以实现IoUT的可持续运行。该问题被表述为一个马尔可夫决策过程，整合了连续AUV运动学、推进感知能耗、声学能量传递的可行性以及信息时代（AoI）调控。基于近端策略优化（PPO）开发了集中深度强化学习策略，用于在对接和安全约束下协调多个AUV。该方法结合结构化启发式基线进行评估，显著降低了平均AoI，同时提升了公平性和数据收集效率。结果显示，随着网络规模的增大，协作式多AUV控制能够实现可扩展的性能提升。
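
Age of Information (AoI), the quantity the abstract above aims to reduce, has a simple per-slot update that can be sketched as follows. The reset value of 1 on data collection is one common convention and is an assumption here (some formulations reset to 0), as is the discrete-time model.

```python
def step_aoi(aoi, visited):
    """Advance one time slot.

    aoi: list of current ages, one per sensor node.
    visited: set of node indices whose data an AUV collected this slot.
    Visited nodes reset to age 1; all others grow one slot older.
    """
    return [1 if i in visited else age + 1 for i, age in enumerate(aoi)]


def average_aoi(aoi):
    """Mean age across all sensor nodes."""
    return sum(aoi) / len(aoi)


ages = step_aoi([3, 5, 2], visited={1})
print(ages, average_aoi(ages))
```

A trajectory policy that visits stale nodes more often keeps `average_aoi` low, which is the kind of objective the learned policy is evaluated on.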

Fine-grained Analysis of Stability and Generalization for Stochastic Bilevel Optimization

随机双层优化稳定性与泛化的细粒度分析

Authors: Xuelin Zhang, Hong Chen, Bin Gu, Tieliang Gong, Feng Zheng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.04090
Pdf link: https://arxiv.org/pdf/2604.04090
Abstract Stochastic bilevel optimization (SBO) has been integrated into many machine learning paradigms recently, including hyperparameter optimization, meta learning, and reinforcement learning. Along with the wide range of applications, there have been numerous studies on the computational behavior of SBO. However, the generalization guarantees of SBO methods are far less understood from the lens of statistical learning theory. In this paper, we provide a systematic generalization analysis of the first-order gradient-based bilevel optimization methods. Firstly, we establish the quantitative connections between the on-average argument stability and the generalization gap of SBO methods. Then, we derive the upper bounds of on-average argument stability for single-timescale stochastic gradient descent (SGD) and two-timescale SGD, where three settings (nonconvex-nonconvex (NC-NC), convex-convex (C-C), and strongly-convex-strongly-convex (SC-SC)) are considered respectively. Experimental analysis validates our theoretical findings. Compared with the previous algorithmic stability analysis, our results do not require reinitializing the inner-level parameters at each iteration and are applicable to more general objective functions.
中文摘要 随机双层优化（SBO）近年来已被整合进许多机器学习范式中，包括超参数优化、元学习和强化学习。除了广泛的应用外，关于SBO计算行为的研究也有很多。然而，从统计学习理论的角度来看，SBO方法的推广保证要鲜为人知。本文对基于一阶梯度的双层优化方法进行了系统推广分析。首先，我们建立了SBO方法的平均论证稳定性与泛化差距之间的定量联系。然后，我们推导单时间尺度随机梯度下降（SGD）和两时间尺度SGD的平均论证稳定性上界，其中考虑了三种环境（非凸-非凸（NC-NC）、凸-凸（C-C）和强凸-强凸（SC-SC））。实验分析验证了我们的理论发现。与之前的算法稳定性分析相比，我们的结果无需在每次迭代中重新初始化内层参数，且适用于更一般的目标函数。

Restless Bandits with Individual Penalty Constraints: A New Near-Optimal Index Policy and How to Learn It

个别惩罚约束的不安强盗：一种新的近优指数政策及其学习方法

Authors: Nida Zamir, I-Hong Hou
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.04101
Pdf link: https://arxiv.org/pdf/2604.04101
Abstract This paper investigates the Restless Multi-Armed Bandit (RMAB) framework under individual penalty constraints to address resource allocation challenges in dynamic wireless networked environments. Unlike conventional RMAB models, our model allows each user (arm) to have distinct and stringent performance constraints, such as energy limits, activation limits, or age of information minimums, enabling the capture of diverse objectives including fairness and efficiency. To find the optimal resource allocation policy, we propose a new Penalty-Optimal Whittle (POW) index policy. The POW index of a user depends only on the user's transition kernel and penalty constraints, and is invariant to system-wide features such as the number of users present and the amount of resource available. This makes it computationally tractable to calculate the POW indices offline without any need for online adaptation. Moreover, we theoretically prove that the POW index policy is asymptotically optimal while satisfying all individual penalty constraints. We also introduce a deep reinforcement learning algorithm to efficiently learn the POW index on the fly. Simulation results across various applications and system configurations further demonstrate that the POW index policy not only has near-optimal performance but also significantly outperforms other existing policies.
中文摘要 本文在个别惩罚约束下探讨了Restless Multi-Armed Bandit（RMAB）框架，以应对动态无线网络环境中的资源分配挑战。与传统的RMAB模型不同，我们的模型允许每个用户（臂）拥有独特且严格的性能约束，如能量限制、激活限制或信息年龄最小值，从而实现包括公平性和效率在内的多样化目标。为了找到最优资源分配策略，我们提出了新的惩罚-最优惠特尔（POW）指数策略。用户的POW索引仅依赖于用户的过渡内核和惩罚约束，并且不受系统范围的特征影响，如存在的用户数量和可用资源量。这使得离线计算战俘指数无需在线适应即可实现。此外，我们理论上证明战俘指数政策在满足所有个体惩罚约束的同时是渐近最优的。我们还引入了深度强化学习算法，以高效地实时学习POW指数。在各种应用和系统配置上的模拟结果进一步表明，POW指数策略不仅性能接近最优，而且显著优于其他现有策略。

Learning Dexterous Grasping from Sparse Taxonomy Guidance

从稀疏分类指导中学习灵巧抓握

Authors: Juhan Park, Taerim Yoon, Seungmin Kim, Joonggil Kim, Wontae Ye, Jeongeun Park, Yoonbyung Chai, Geonwoo Cho, Geunwoo Cho, Dohyeong Kim, Kyungjae Lee, Yongjae Kim, Sungjoon Choi
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.04138
Pdf link: https://arxiv.org/pdf/2604.04138
Abstract Dexterous manipulation requires planning a grasp configuration suited to the object and task, which is then executed through coordinated multi-finger control. However, specifying grasp plans with dense pose or contact targets for every object and task is impractical. Meanwhile, end-to-end reinforcement learning from task rewards alone lacks controllability, making it difficult for users to intervene when failures occur. To this end, we present GRIT, a two-stage framework that learns dexterous control from sparse taxonomy guidance. GRIT first predicts a taxonomy-based grasp specification from the scene and task context. Conditioned on this sparse command, a policy generates continuous finger motions that accomplish the task while preserving the intended grasp structure. Our results show that certain grasp taxonomies are more effective for specific object geometries. By leveraging this relationship, GRIT improves generalization to novel objects over baselines and achieves an overall success rate of 87.9%. Moreover, real-world experiments demonstrate controllability, enabling grasp strategies to be adjusted through high-level taxonomy selection based on object geometry and task intent.
中文摘要 灵巧操作需要规划适合物体和任务的握把配置，然后通过协调的多指控制来执行。然而，为每个物体和任务指定带有密集姿态或接触目标的抓取计划是不切实际的。与此同时，仅靠任务奖励进行端到端强化学习缺乏可控性，导致用户在失败发生时难以干预。为此，我们提出了GRIT，一个两阶段框架，通过稀疏的分类学指导学习灵巧的控制。GRIT首先从场景和任务上下文预测基于分类法的掌握规范。基于这种稀疏命令，策略生成连续的手指动作，完成任务同时保持预期抓握结构。我们的结果表明，某些抓握分类法对特定对象几何结构更有效。通过利用这一关系，GRIT提高了对基线新对象的泛化，整体成功率达到87.9%。此外，现实实验展示了可控性，使抓握策略能够通过基于物体几何和任务意图的高级分类选择进行调整。

DARE: Diffusion Large Language Models Alignment and Reinforcement Executor

DARE：扩散大型语言模型对齐与强化执行器

Authors: Jingyi Yang, Yuxian Jiang, Xuhao Hu, Shuang Cheng, Biqing Qi, Jing Shao
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.04215
Pdf link: https://arxiv.org/pdf/2604.04215
Abstract Diffusion large language models (dLLMs) are emerging as a compelling alternative to dominant autoregressive models, replacing strictly sequential token generation with iterative denoising and parallel generation dynamics. However, their open-source ecosystem remains fragmented across model families and, in particular, across post-training pipelines, where reinforcement learning objectives, rollout implementations, and evaluation scripts are often released as paper-specific codebases. This fragmentation slows research iteration, raises the engineering burden of reproduction, and makes fair comparison across algorithms difficult. We present DARE (dLLMs Alignment and Reinforcement Executor), an open framework for post-training and evaluating dLLMs. Built on top of verl [sheng2024hybridflow] and OpenCompass [2023opencompass], DARE unifies supervised fine-tuning, parameter-efficient fine-tuning, preference optimization, and dLLM-specific reinforcement learning under a shared execution stack for both masked and block diffusion language models. Across representative model families including LLaDA, Dream, SDAR, and LLaDA2.x, DARE provides broad algorithmic coverage, reproducible benchmark evaluation, and practical acceleration. Extensive empirical results position DARE as a reusable research substrate for developing, comparing, and deploying post-training methods for current and emerging dLLMs.
中文摘要 扩散大型语言模型（dLLMs）正作为主流自回归模型的有力替代方案出现，用迭代去噪和并行生成动力学取代严格顺序的令牌生成。然而，他们的开源生态系统在不同模型家族之间仍然分散，尤其是在培训后流程中，强化学习目标、推广实现和评估脚本通常以专用代码库的形式发布。这种碎片化减缓了研究迭代，增加了复制的工程负担，并使算法间的公平比较变得困难。我们提出了 \textbf{DARE}（\textbf{d}LLMs \textbf{A}lignment 和 \textbf{R}einforcement \textbf{E}xecutor），这是一个用于后期训练和评估 dLLMs 的开放框架。DARE建立在verl~\cite{sheng2024hybridflow}和OpenCompass~\cite{2023opencompass}之上，统一了监督式微调、参数高效微调、偏好优化和dLLM专属强化学习，采用共享执行栈，适用于掩蔽和块扩散语言模型。在包括LLaDA、Dream、SDAR和LLaDA2.x在内的代表性模型家族中，DARE提供了广泛的算法覆盖、可重复的基准测试和实用加速。大量实证结果表明，DARE作为开发、比较和部署当前及新兴dLLM训练后方法的可重复利用研究基底。

Learning from Imperfect Demonstrations via Temporal Behavior Tree-Guided Trajectory Repair

通过时间行为树引导轨迹修复从不完美演示中学习

Authors: Aniruddh G. Puranic, Sebastian Schirmer, John S. Baras, Calin Belta
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2604.04225
Pdf link: https://arxiv.org/pdf/2604.04225
Abstract Learning robot control policies from demonstrations is a powerful paradigm, yet real-world data is often suboptimal, noisy, or otherwise imperfect, posing significant challenges for imitation and reinforcement learning. In this work, we present a formal framework that leverages Temporal Behavior Trees (TBT), an extension of Signal Temporal Logic (STL) with Behavior Tree semantics, to repair suboptimal trajectories prior to their use in downstream policy learning. Given demonstrations that violate a TBT specification, a model-based repair algorithm corrects trajectory segments to satisfy the formal constraints, yielding a dataset that is both logically consistent and interpretable. The repaired trajectories are then used to extract potential functions that shape the reward signal for reinforcement learning, guiding the agent toward task-consistent regions of the state space without requiring knowledge of the agent's kinematic model. We demonstrate the effectiveness of this framework on discrete grid-world navigation and continuous single and multi-agent reach-avoid tasks, highlighting its potential for data-efficient robot learning in settings where high-quality demonstrations cannot be assumed.
中文摘要 通过演示学习机器人控制策略是一个强大的范式，但现实世界的数据往往不够优、噪声大或不完美，这给模仿和强化学习带来了重大挑战。在本研究中，我们提出了一个形式框架，利用时序行为树（TBT）——信号时序逻辑（STL）的扩展，结合行为树语义，在下游策略学习中使用前修复次优轨迹。在违反TBT规范的演示下，基于模型的修复算法会修正轨迹段以满足形式约束，从而生成一个逻辑上一致且可解释的数据集。修复后的轨迹随后被用来提取塑造强化学习奖励信号的潜在函数，引导智能体朝向状态空间中任务一致的区域，而无需了解智能体的运动学模型。我们展示了该框架在离散网格世界导航以及连续单代理和多代理距离回避任务中的有效性，突出其在无法假设高质量演示环境中实现数据高效机器人学习的潜力。
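
The abstract above says repaired trajectories are used to extract potential functions that shape the reward. A minimal sketch of the standard potential-based shaping form, r' = r + gamma * Phi(s') - Phi(s), is given below. The specific potential used here (negative Manhattan distance to the nearest repaired-trajectory state on a grid) is a hypothetical choice for illustration, not the paper's construction.

```python
def potential(state, repaired_states):
    """Hypothetical potential: negative Manhattan distance to the nearest repaired state."""
    return -min(abs(state[0] - x) + abs(state[1] - y) for x, y in repaired_states)


def shaped_reward(reward, phi_state, phi_next, gamma=0.99):
    """Potential-based shaping: add gamma * Phi(s') - Phi(s) to the environment reward."""
    return reward + gamma * phi_next - phi_state


repaired = [(0, 0), (1, 0), (2, 0)]   # states visited by a repaired demonstration
s, s_next = (2, 2), (2, 1)            # the agent moves one cell toward the trajectory
bonus = shaped_reward(0.0, potential(s, repaired), potential(s_next, repaired), gamma=1.0)
print(bonus)
```

Moving toward the repaired trajectory raises the potential and yields a positive shaping term, while moving away yields a negative one.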

Pedagogical Safety in Educational Reinforcement Learning: Formalizing and Detecting Reward Hacking in AI Tutoring Systems

教育强化学习中的教学安全：在AI辅导系统中形式化与检测奖励黑客

Authors: Oluseyi Olukola, Nick Rahimi
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.04237
Pdf link: https://arxiv.org/pdf/2604.04237
Abstract Reinforcement learning (RL) is increasingly used to personalize instruction in intelligent tutoring systems, yet the field lacks a formal framework for defining and evaluating pedagogical safety. We introduce a four-layer model of pedagogical safety for educational RL, comprising structural, progress, behavioral, and alignment safety, and propose the Reward Hacking Severity Index (RHSI) to quantify misalignment between proxy rewards and genuine learning. We evaluate the framework in a controlled simulation of an AI tutoring environment with 120 sessions across four conditions and three learner profiles, totaling 18,000 interactions. Results show that an engagement-optimized agent systematically over-selected a high-engagement action with no direct mastery gain, producing strong measured performance but limited learning progress. A multi-objective reward formulation reduced this problem but did not eliminate it, as the agent continued to favor proxy-rewarding behavior in many states. In contrast, a constrained architecture combining prerequisite enforcement and minimum cognitive demand substantially reduced reward hacking, lowering RHSI from 0.317 in the unconstrained multi-objective condition to 0.102. Ablation results further suggest that behavioral safety was the most influential safeguard against repetitive low-value action selection. These findings suggest that reward design alone may be insufficient to ensure pedagogically aligned behavior in educational RL, at least in the simulated environment studied here. More broadly, the paper positions pedagogical safety as an important research problem at the intersection of AI safety and intelligent educational systems.
中文摘要 强化学习（RL）越来越多地被用于智能辅导系统中的个性化教学，但该领域缺乏正式框架来定义和评估教学安全性。我们提出了教育强化学习的四层教学安全性模型，包括结构性、进步性、行为性和对齐性安全，并提出了奖励黑客严重度指数（RHSI）以量化代理奖励与真实学习之间的错位。我们在一个受控的AI辅导环境模拟中评估该框架，涉及120次会话，涵盖四种条件和三种学习者档案，共计18,000次交互。结果显示，参与优化的智能体系统性地过度选择高参与度动作，且无直接掌握提升，导致测量表现强劲但学习进展有限。多目标奖励表述减少了这一问题，但未能根除，因为在许多状态下，代理性奖励行为仍倾向于存在。相比之下，结合先决执行和最低认知需求的受限架构，显著降低了奖励黑客行为，将无约束多目标条件下的 RHSI 从 0.317 降至 0.102。消融结果进一步表明，行为安全是防止重复性低价值行动选择的最有力保障。这些发现表明，仅靠奖励设计可能不足以确保教育RL中教学行为与教学法一致，至少在此处所研究的模拟环境中是如此。更广泛地说，论文将教学安全定位为人工智能安全与智能教育系统交汇处的重要研究问题。

MC-CPO: Mastery-Conditioned Constrained Policy Optimization

MC-CPO：掌握条件约束策略优化

Authors: Oluseyi Olukola, Nick Rahimi
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.04251
Pdf link: https://arxiv.org/pdf/2604.04251
Abstract Engagement-optimized adaptive tutoring systems may prioritize short-term behavioral signals over sustained learning outcomes, creating structural incentives for reward hacking in reinforcement learning policies. We formalize this challenge as a constrained Markov decision process (CMDP) with mastery-conditioned feasibility, in which pedagogical safety constraints dynamically restrict admissible actions according to learner mastery and prerequisite structure. We introduce Mastery-Conditioned Constrained Policy Optimization (MC-CPO), a two-timescale primal-dual algorithm that integrates structural action masking with constrained policy optimization. In the tabular regime, we establish feasibility preservation and convergence to stationary feasible points under standard stochastic approximation conditions and derive a safety gap result showing that optimization within the mastery-conditioned feasible set can strictly dominate post-hoc filtering under identical safety budgets. Empirical validation is conducted in minimal and extended tabular environments and in a neural tutoring setting. Across 10 random seeds and one million training steps in the neural regime, MC-CPO satisfies constraint budgets within tolerance, reduces discounted safety costs relative to unconstrained and reward-shaped baselines, and substantially lowers the Reward Hacking Severity Index (RHSI). These results indicate that embedding pedagogical structure directly into the feasible action space provides a principled foundation for mitigating reward hacking in instructional reinforcement learning systems.
中文摘要 参与度优化的自适应辅导系统可能优先考虑短期行为信号而非持续学习成果，从而在强化学习政策中创造结构性激励机制。我们将这一挑战形式化为一个受限马尔可夫决策过程（CMDP），其掌握条件可行性，其中教学安全约束动态限制可接受行为，基于学习者的掌握度和前提结构。我们介绍了掌握条件约束策略优化（MC-CPO），这是一种两时间尺度的原始对偶算法，将结构性动作掩蔽与受约束策略优化整合在一起。在表格范畴中，我们建立了可行性保持和收敛于标准随机近似条件下的平稳可行点，并推导出安全差距结果，表明在掌握条件下的可行集合内优化可以严格支配事后过滤，且在相同安全预算下。实证验证可在最小和扩展的表格环境中进行，并在神经辅导环境中进行。在神经体系中的10个随机种子和100万次训练步中，MC-CPO满足了容忍范围内的约束预算，降低了相较于无约束和奖励形态基线的折扣安全成本，并显著降低了奖励黑客严重度指数（RHSI）。这些结果表明，将教学结构直接嵌入可行行动空间，为在教学强化学习系统中减轻奖励黑客行为提供了有原则的基础。
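
Two ingredients from the abstract above, mastery-conditioned action masking and a primal-dual treatment of the safety budget, can be sketched in a few lines. The mastery threshold, the prerequisite map, and the dual step size are hypothetical values; the projected dual-ascent update is the generic form used in constrained RL and is not claimed to match the paper's exact algorithm.

```python
def feasible_actions(mastery, prerequisites, threshold=0.7):
    """Return actions whose prerequisite skills are all mastered above the threshold.

    mastery: dict skill -> estimated mastery in [0, 1].
    prerequisites: dict action -> list of prerequisite skills.
    """
    return [action for action, prereqs in prerequisites.items()
            if all(mastery.get(skill, 0.0) >= threshold for skill in prereqs)]


def dual_update(multiplier, avg_cost, budget, step=0.05):
    """Projected dual ascent: raise the multiplier when safety cost exceeds its budget."""
    return max(0.0, multiplier + step * (avg_cost - budget))


prerequisites = {"add": [], "multiply": ["add"], "factor": ["multiply"]}
mastery = {"add": 0.9, "multiply": 0.4}
print(feasible_actions(mastery, prerequisites))
print(dual_update(0.0, avg_cost=0.3, budget=0.1))
```

Masking removes infeasible actions before the policy acts, while the multiplier penalizes residual constraint violation during optimization; the abstract argues the former cannot be replaced by filtering after the fact.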

APPA: Adaptive Preference Pluralistic Alignment for Fair Federated RLHF of LLMs

APPA：适用于大型语言模型公平联邦RLHF的自适应偏好多元对齐

Authors: Mahmoud Srewa, Tianyu Zhao, Salma Elmalaki
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.04261
Pdf link: https://arxiv.org/pdf/2604.04261
Abstract Aligning large language models (LLMs) with diverse human preferences requires pluralistic alignment, where a single model must respect the values of multiple distinct groups simultaneously. In federated reinforcement learning from human feedback (FedRLHF), these groups align a shared policy without centralizing preference data, which makes fair reward aggregation essential. Existing aggregation methods exhibit clear trade-offs: average-based aggregation systematically under-aligns worst-performing groups, while min aggregation prioritizes worst-group performance at the cost of overall alignment. We propose APPA, an Adaptive Preference Pluralistic Alignment framework that dynamically reweights group-level rewards based on historical alignment rewards. Our approach prioritizes under-aligned groups without degrading well-aligned ones, while requiring no access to raw preference data. Integrated into a proximal policy optimization (PPO)-based FedRLHF pipeline and evaluated on GLOBALQA and OQA across three model families (Gemma 2 2B, Llama 3.2 3B, Qwen3 0.6B), APPA achieves strong fairness-alignment trade-offs, improving worst-group alignment by up to 28% over average aggregation while maintaining higher overall alignment than min aggregation across most configurations.
中文摘要 使大型语言模型（LLMs）与多样化的人类偏好保持一致，需要多元对齐，即单个模型必须同时尊重多个不同群体的价值观。在联合强化学习（FedRLHF）中，这些群体在不集中偏好数据的情况下对齐共享策略，这使得公平奖励聚合变得不可或缺。现有聚合方法存在明显权衡：基于平均值的聚合系统性地低于表现最差的群体，而最小聚合优先考虑最差的群体表现，代价是整体对齐度下降。我们提出了APPA框架，这是一种适应性偏好多元对齐框架，基于历史对齐奖励动态重新加权群体级奖励。我们的方法在不削弱对齐群的情况下优先排序，同时不要求访问原始偏好数据。APPA集成到基于近端策略优化（PPO）的FedRLHF流水线中，并在三种模型家族（Gemma 2 2B、Llama 3.2 3B、Qwen3 0.6B）上基于GLOBALQA和OQA进行评估，实现了强的公平性对齐权衡，最差组比对比平均聚合提升了最多28%，同时在大多数配置中保持比最小聚合更高的整体对齐度。
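
The trade-off between average and min aggregation described in the abstract above, and the idea of reweighting groups by their historical alignment, can be illustrated as follows. The softmax over negated historical rewards is an assumed weighting rule chosen for this sketch; the abstract does not specify APPA's actual reweighting formula.

```python
import math


def average_agg(rewards):
    """Average aggregation: every group counts equally."""
    return sum(rewards) / len(rewards)


def min_agg(rewards):
    """Min aggregation: only the worst-aligned group counts."""
    return min(rewards)


def adaptive_agg(rewards, history, temperature=1.0):
    """Weight each group's reward more heavily when its historical alignment is low."""
    logits = [-h / temperature for h in history]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]  # subtract max for numerical stability
    norm = sum(exps)
    weights = [e / norm for e in exps]
    return sum(w * r for w, r in zip(weights, rewards))


rewards = [0.9, 0.5, 0.1]  # current per-group rewards
history = [0.9, 0.5, 0.1]  # historical per-group alignment
print(min_agg(rewards), adaptive_agg(rewards, history), average_agg(rewards))
```

With this rule the aggregate sits between the min and the average, and lowering `temperature` moves it toward min aggregation.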

Boosted Distributional Reinforcement Learning: Analysis and Healthcare Applications

增强分布式强化学习：分析与医疗应用

Authors: Zequn Chen, Wesley J. Marrero
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.04334
Pdf link: https://arxiv.org/pdf/2604.04334
Abstract Researchers and practitioners are increasingly considering reinforcement learning to optimize decisions in complex domains like robotics and healthcare. To date, these efforts have largely utilized expectation-based learning. However, relying on expectation-focused objectives may be insufficient for making consistent decisions in highly uncertain situations involving multiple heterogeneous groups. While distributional reinforcement learning algorithms have been introduced to model the full distributions of outcomes, they can yield large discrepancies in realized benefits among comparable agents. This challenge is particularly acute in healthcare settings, where physicians (controllers) must manage multiple patients (subordinate agents) with uncertain disease progression and heterogeneous treatment responses. We propose a Boosted Distributional Reinforcement Learning (BDRL) algorithm that optimizes agent-specific outcome distributions while enforcing comparability among similar agents and analyze its convergence. To further stabilize learning, we incorporate a post-update projection step formulated as a constrained convex optimization problem, which efficiently aligns individual outcomes with a high-performing reference within a specified tolerance. We apply our algorithm to manage hypertension in a large subset of the US adult population by categorizing individuals into cardiovascular disease risk groups. Our approach modifies treatment plans for median and vulnerable patients by mimicking the behavior of high-performing references in each risk group. Furthermore, we find that BDRL improves the number and consistency of quality-adjusted life years compared with reinforcement learning baselines.
中文摘要 研究人员和从业者越来越多地考虑通过强化学习来优化机器人和医疗等复杂领域的决策。迄今为止，这些努力主要采用基于期望的学习方法。然而，仅依赖以期望为中心的目标，可能不足以在涉及多个异质群体的高度不确定情境中做出一致决策。虽然分布强化学习算法已被引入以建模完整的结果分布，但它们在可比代理之间可能存在较大的实际效益差异。这一挑战在医疗环境中尤为严峻，医生（控制员）必须管理多名患者（下属代理），这些患者病情进展不确定且治疗反应异质。我们提出了一种增强分布强化学习（BDRL）算法，该算法在强化相似代理间可比性的同时优化主体特异性的结果分布，并分析其收敛性。为进一步稳定学习，我们采用了更新后预测步骤，作为受限凸优化问题，高效地将单个结果与在指定容差范围内的高效能参考对齐。我们应用算法，将美国成年人口的大部分人群用于管理高血压，将个体划分为心血管疾病风险组。我们的方法通过模仿每个风险组中高绩效参考患者的行为，调整中位数和易感患者的治疗方案。此外，我们发现与强化学习基线相比，BDRL提高了质量调整寿命年的数量和一致性。

Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning

强化学习，选择推理：视频推理的双范式

Authors: Songyuan Yang, Weijiang Yu, Jilin Ma, Ziyu Liu, Guijian Tang, Wenjing Yang, Huibin Tan, Nong Xiao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.04379
Pdf link: https://arxiv.org/pdf/2604.04379
Abstract Video reasoning has advanced with large multimodal models (LMMs), yet their inference is often a single pass that returns an answer without verifying whether the reasoning is evidence-aligned. We introduce Reinforce to Learn, Elect to Reason (RLER), a dual paradigm that decouples learning to produce evidence from obtaining a reliable answer. In RLER-Training, we optimize the policy with group-relative reinforcement learning (RL) and 3 novel task-driven rewards: Frame-sensitive reward grounds reasoning on explicit key frames, Think-transparency reward shapes readable and parsable reasoning traces, and Anti-repetition reward boosts information density. These signals teach the model to emit structured, machine-checkable evidence and potentiate reasoning capabilities. In RLER-Inference, we apply a train-free orchestrator that generates a small set of diverse candidates, parses their answers and cited frames, scores them by evidence consistency, confidence, transparency, and non-redundancy, and then performs a robust evidence-weighted election. This closes the loop between producing and using evidence, improving reliability and interpretability without enlarging the model. We comprehensively evaluate RLER against various open-source and RL-based LMMs on 8 representative benchmarks. RLER achieves state of the art across all benchmarks and delivers an average improvement of 6.3% over base models, while using on average 3.1 candidates per question, indicating a favorable balance between compute and quality. The results support a simple thesis: making evidence explicit during learning and electing by evidence during inference is a robust path to trustworthy video reasoning.
中文摘要 视频推理随着大型多模态模型（LMMs）的推进，但它们的推断往往是单次传递，返回的答案却未验证推理是否符合证据。我们引入了“强化学习，选择推理”（RLER），这是一种双范式，将产生证据的学习与获得可靠答案的学习脱钩。在RLER训练中，我们通过群体相对强化学习（RL）和三种新颖的任务驱动奖励来优化该策略：基于显性关键帧的框架敏感奖励基础推理，Think-transparency奖励塑造可读且可解析的推理痕迹，以及反重复奖励提升信息密度。这些信号教会模型发出结构化、可机器验证的证据，并增强推理能力。在RLER-Inference中，我们应用无列车的编排器，生成一小组多样化的候选人，解析其答案和引用框架，根据证据一致性、信心度、透明度和非冗余性评分，然后进行稳健的证据加权选举。这封闭了证据生成和使用之间的循环，提高了可靠性和可解释性，同时不扩大模型。我们全面评估了RLER与多个开源和基于强化学习的LMM的8个代表性基准测试。RLER在所有基准测试中均达到最先进水平，平均比基础模型提升6.3%，平均每题使用3.1名候选人，显示计算与质量之间的良好平衡。结果支持了一个简单的论点：在学习时明确证据，在推理中通过证据选择，是通往可信视频推理的有力途径。
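
The evidence-weighted election described in the abstract can be pictured with a small sketch. This is only an illustration under stated assumptions: the four criteria names come from the abstract, but the equal weights, the dictionary layout, and the function name `elect_answer` are hypothetical and are not the authors' implementation.

```python
from collections import defaultdict

def elect_answer(candidates, weights=None):
    """Return the answer with the largest total evidence-weighted score."""
    # Hypothetical equal weighting of the four criteria named in the abstract.
    weights = weights or {"consistency": 0.25, "confidence": 0.25,
                          "transparency": 0.25, "non_redundancy": 0.25}
    totals = defaultdict(float)
    for cand in candidates:
        # Combine the per-criterion scores of one candidate into a single score.
        score = sum(weights[k] * cand["scores"][k] for k in weights)
        # Candidates that agree on an answer pool their scores.
        totals[cand["answer"]] += score
    # Highest accumulated score wins; ties resolve to the first answer seen.
    return max(totals, key=totals.get), dict(totals)

# Made-up candidates for illustration only.
candidates = [
    {"answer": "B", "scores": {"consistency": 0.9, "confidence": 0.8,
                               "transparency": 1.0, "non_redundancy": 0.7}},
    {"answer": "A", "scores": {"consistency": 0.4, "confidence": 0.9,
                               "transparency": 0.5, "non_redundancy": 0.6}},
    {"answer": "B", "scores": {"consistency": 0.7, "confidence": 0.6,
                               "transparency": 0.8, "non_redundancy": 0.9}},
]
winner, totals = elect_answer(candidates)
print(winner, totals)
```

Summing scores per answer, rather than taking the single best candidate, is one way to make the vote robust to one overconfident outlier; the paper may use a different aggregation rule.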

Finite-Time Analysis of Q-Value Iteration for General-Sum Stackelberg Games

一般和斯塔克尔伯格博弈Q值迭代的有限时间分析

Authors: Narim Jeong, Donghwan Lee
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2604.04394
Pdf link: https://arxiv.org/pdf/2604.04394
Abstract Reinforcement learning has been successful both empirically and theoretically in single-agent settings, but extending these results to multi-agent reinforcement learning in general-sum Markov games remains challenging. This paper studies the convergence of Stackelberg Q-value iteration in two-player general-sum Markov games from a control-theoretic perspective. We introduce a relaxed policy condition tailored to the Stackelberg setting and model the learning dynamics as a switching system. By constructing upper and lower comparison systems, we establish finite-time error bounds for the Q-functions and characterize their convergence properties. Our results provide a novel control-theoretic perspective on Stackelberg learning. Moreover, to the best of the authors' knowledge, this paper offers the first finite-time convergence guarantees for Q-value iteration in general-sum Markov games under Stackelberg interactions.
中文摘要 强化学习在单智能体环境中既在经验上还是理论上都取得了成功，但将这些结果推广到广义和马尔可夫博弈中的多智能体强化学习仍然具有挑战性。本文从控制理论视角研究了两人广义和马尔可夫博弈中斯塔克尔伯格Q值迭代的收敛性。我们引入了针对斯塔克尔伯格设定的宽松策略条件，并将学习动态建模为切换系统。通过构建上下比较系统，我们为Q函数建立了有限时间误差界限，并刻画了它们的收敛性质。我们的结果为斯塔克伯格学习提供了新的控制理论视角。此外，据作者所知，本文首次提供了在斯塔克伯格相互作用下广义和马尔可夫博弈中Q值迭代的有限时间收敛保证。
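
For readers unfamiliar with the setting, a naive tabular form of Stackelberg Q-value iteration might look like the sketch below. It assumes a leader with actions `a`, a follower with actions `b` who best-responds to the leader's action, and known rewards and transitions; the array layout and function name are hypothetical. The paper's relaxed policy condition and its finite-time bounds are not reproduced here, and this plain iteration is not guaranteed to converge in general.

```python
import numpy as np

def stackelberg_q_iteration(R_L, R_F, P, gamma=0.9, iters=200):
    """R_L, R_F: (S, A, B) rewards; P: (S, A, B, S) transition probabilities."""
    Q_L = np.zeros_like(R_L, dtype=float)
    Q_F = np.zeros_like(R_F, dtype=float)
    states = np.arange(R_L.shape[0])
    for _ in range(iters):
        # Follower best response to each leader action: b*(s, a) = argmax_b Q_F[s, a, b].
        b_star = Q_F.argmax(axis=2)                                   # (S, A)
        # Leader value of each action given that response: Q_L[s, a, b*(s, a)].
        q_l_resp = np.take_along_axis(Q_L, b_star[..., None], axis=2)[..., 0]
        a_star = q_l_resp.argmax(axis=1)                              # (S,)
        b_sel = b_star[states, a_star]                                # (S,)
        # State values of both players under the Stackelberg action pair.
        V_L = Q_L[states, a_star, b_sel]
        V_F = Q_F[states, a_star, b_sel]
        # Bellman-style backup for both Q-functions.
        Q_L = R_L + gamma * (P @ V_L)
        Q_F = R_F + gamma * (P @ V_F)
    return Q_L, Q_F

# Small random game, for illustration only.
rng = np.random.default_rng(0)
S, A, B = 3, 2, 2
R_L = rng.uniform(-1, 1, (S, A, B))
R_F = rng.uniform(-1, 1, (S, A, B))
P = rng.dirichlet(np.ones(S), size=(S, A, B))   # each P[s, a, b] sums to 1
Q_L, Q_F = stackelberg_q_iteration(R_L, R_F, P)
print(Q_L.shape, Q_F.shape)
```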

ReinVBC: A Model-based Reinforcement Learning Approach to Vehicle Braking Controller

ReinVBC：基于模型的强化学习车辆制动控制器方法

Authors: Haoxin Lin, Junjie Zhou, Daheng Xu, Yang Yu
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2604.04401
Pdf link: https://arxiv.org/pdf/2604.04401
Abstract The braking system, the key module for ensuring the safety and steerability of current vehicles, relies on extensive manual calibration during production. Reducing labor and time consumption while maintaining the Vehicle Braking Controller (VBC) performance greatly benefits the vehicle industry. Model-based methods in offline reinforcement learning, which facilitate policy exploration within a data-driven dynamics model, offer a promising solution for addressing real-world control tasks. This work proposes ReinVBC, which applies an offline model-based reinforcement learning approach to deal with the vehicle braking control problem. We introduce useful engineering designs into the paradigm of model learning and utilization to obtain a reliable vehicle dynamics model and a capable braking policy. Several results demonstrate the capability of our method in real-world vehicle braking and its potential to replace the production-grade anti-lock braking system.
中文摘要 制动系统是确保现有车辆安全和转向能力的关键模块，生产过程中依赖大量人工校准。在保持车辆制动控制器（VBC）性能的同时减少劳动力和时间消耗，极大地有利于车辆行业。基于模型的离线强化学习方法，能够在数据驱动的动力学模型中进行策略探索，为解决现实控制任务提供了有前景的解决方案。该研究提出了ReinVBC，采用离线模型基础的强化学习方法来处理车辆制动控制问题。我们将有用的工程设计引入模型学习和利用范式，以获得可靠的车辆动力学模型和有效的制动策略。多项结果展示了我们方法在实际车辆制动中的能力，以及其取代量产级防抱死制动系统的潜力。

Structured Causal Video Reasoning via Multi-Objective Alignment

通过多目标对齐进行结构化因果视频推理

Authors: Zinuo Li, Yongxin Guo, Jun Liu, Jiawei Zhan, Xi Jiang, Chengjie Wang, Mohammed Bennamoun, Farid Boussaid, Feng Zheng, Qiuhong Ke
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.04415
Pdf link: https://arxiv.org/pdf/2604.04415
Abstract Human understanding of video dynamics is typically grounded in a structured mental representation of entities, actions, and temporal relations, rather than relying solely on immediate deductive reasoning. In contrast, existing Video-LLMs largely depend on unstructured video reasoning, where critical visual evidence is embedded in verbose textual descriptions and temporal causality is often weakly modeled. This leads to inefficient processes and fragile causal inference. To bridge this cognitive gap, we propose constructing a compact representation of salient events and their causal relationships, which we name Structured Event Facts, prior to the reasoning stage. This structured prior serves as an explicit constraint to promote concise and causally grounded reasoning, while also making intermediate evidence easier to verify. To effectively train models on such structured facts, we introduce CausalFact-60K and a four-stage training pipeline comprising facts alignment, format warm-start, thinking warm-start, and reinforcement learning-based post-training. During the RL stage, we find that this framework introduces competing objectives, as structural completeness and causal fidelity must be balanced against reasoning length, making it difficult to optimize. We address this challenge by formulating the optimization as a Multi-Objective Reinforcement Learning (MORL) problem and explicitly optimizing toward the Pareto-Frontier to balance these trade-offs. As a result, we introduce Factum-4B, which yields more reliable reasoning and delivers stronger performance on challenging video understanding tasks requiring fine-grained temporal inference.
中文摘要 人类对视频动态的理解通常基于对实体、动作和时间关系的结构化心理表征，而非仅依赖即时演绎推理。相比之下，现有的视频大型语言模型（LLM）主要依赖非结构化视频推理，关键的视觉证据嵌入冗长的文本描述中，时间因果关系的建模往往较弱。这导致了低效的过程和脆弱的因果推断。为弥合这一认知鸿沟，我们建议在推理阶段之前构建一个显著事件及其因果关系的紧凑表示，称为结构化事件事实。这种结构化先验作为明确的约束，促进简洁且有因果基础的推理，同时也使中间证据更容易验证。为了有效训练基于此类结构化事实的模型，我们引入了CausalFact-60K和一个四阶段训练流程，包括事实对齐、格式化热启、思维热启和基于强化学习的后期训练。在强化学习阶段，我们发现该框架引入了相互竞争的目标，因为结构完备性和因果忠实度必须与推理长度取得平衡，这使得优化变得困难。我们通过将优化定为多目标强化学习（MORL）问题，并明确向帕累托-前沿优化以平衡这些权衡来应对这一挑战。因此，我们引入了Factum-4B，它提供了更可靠的推理能力，并在需要细粒度时间推断的复杂视频理解任务中表现更优。
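
The trade-off between structural completeness, causal fidelity, and reasoning length can be illustrated with a plain Pareto filter. The sketch assumes each rollout is summarized as a tuple (completeness, fidelity, negative length) so that every objective is maximized; the numbers and the function name are hypothetical, and this is not the paper's optimization procedure.

```python
def pareto_front(points):
    """Return the points not dominated by any other (all objectives maximized)."""
    def dominates(p, q):
        # p dominates q if it is at least as good everywhere and better somewhere.
        return (all(a >= b for a, b in zip(p, q))
                and any(a > b for a, b in zip(p, q)))
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Made-up rollouts: (completeness, fidelity, -reasoning_length).
rollouts = [
    (0.9, 0.8, -300),
    (0.7, 0.9, -200),
    (0.6, 0.7, -350),
    (0.9, 0.8, -250),
]
front = pareto_front(rollouts)
print(front)
```

Here the first rollout is dropped because the fourth matches it on both quality objectives with a shorter trace, which is the kind of balance the abstract describes.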

Explainable Autonomous Cyber Defense using Adversarial Multi-Agent Reinforcement Learning

利用对抗性多智能体强化学习实现可解释的自主网络防御

Authors: Yiyao Zhang, Diksha Goel, Hussain Ahmad
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2604.04442
Pdf link: https://arxiv.org/pdf/2604.04442
Abstract Autonomous agents are increasingly deployed in both offensive and defensive cyber operations, creating high-speed, closed-loop interactions in critical infrastructure environments. Advanced Persistent Threat (APT) actors exploit "Living off the Land" techniques and targeted telemetry perturbations to induce ambiguity in monitoring systems, causing automated defenses to overreact or misclassify benign behavior as malicious activity. Existing monolithic and multi-agent defense pipelines largely operate on correlation-based signals, lack structural constraints on response actions, and are vulnerable to reasoning drift under ambiguous or adversarial inputs. We present the Causal Multi-Agent Decision Framework (C-MADF), a structurally constrained architecture for autonomous cyber defense that integrates causal modeling with adversarial dual-policy control. C-MADF first learns a Structural Causal Model (SCM) from historical telemetry and compiles it into an investigation-level Directed Acyclic Graph (DAG) that defines admissible response transitions. This roadmap is formalized as a Markov Decision Process (MDP) whose action space is explicitly restricted to causally consistent transitions. Decision-making within this constrained space is performed by a dual-agent reinforcement learning system in which a threat-optimizing Blue-Team policy is counterbalanced by a conservatively shaped Red-Team policy. Inter-policy disagreement is quantified through a Policy Divergence Score and exposed via a human-in-the-loop interface equipped with an Explainability-Transparency Score that serves as an escalation signal under uncertainty. On the real-world CICIoT2023 dataset, C-MADF reduces the false-positive rate from 11.2%, 9.7%, and 8.4% in three cutting-edge literature baselines to 1.8%, while achieving 0.997 precision, 0.961 recall, and 0.979 F1-score.
中文摘要 自主智能体越来越多地被部署在进攻和防御的网络行动中，在关键基础设施环境中创造高速、闭环的交互。高级持续威胁（APT）行为者利用“原生环境”技术和针对性的遥测扰动，使监控系统产生歧义，导致自动防御过度反应或将无害行为误判为恶意活动。现有的单体和多代理防御管道主要依赖基于相关性信号，缺乏对响应行动的结构约束，且在模糊或对抗输入下容易产生推理漂移。我们介绍了因果多智能体决策框架（C-MADF），这是一种结构约束的自主网络防御架构，集成了因果建模与对抗性双策略控制。C-MADF首先从历史遥测中学习结构因果模型（SCM），并将其编译成一个研究级的有向无环图（DAG），定义可接受的响应转变。该路线图被形式化为马尔可夫决策过程（MDP），其作用空间明确限制于因果一致的转移。在这一受限空间内，决策由双代理强化学习系统完成，其中一个威胁优化的蓝队策略与一个保守形态的红队策略相互平衡。政策间的分歧通过政策分歧评分来量化，并通过配备可解释性-透明度评分的人机界面揭示，该评分在不确定性下作为升级信号。在真实的CICIoT2023数据集中，C-MADF将三个前沿文献基线的假阳性率从11.2%、9.7%和8.4%降至1.8%，同时实现0.997的准确率、0.961的召回率和0.979的F1评分。
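
As a quick arithmetic check, the reported F1-score follows from the reported precision and recall through the standard harmonic-mean formula. Only the three figures quoted in the abstract are used; nothing else about the evaluation is assumed.

```python
# Values quoted in the abstract.
precision, recall = 0.997, 0.961

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))
```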

DeonticBench: A Benchmark for Reasoning over Rules

DeonticBench：基于规则推理的基准

Authors: Guangyao Dou, Luis Brena, Akhil Deo, William Jurayj, Jingyu Zhang, Nils Holzenberger, Benjamin Van Durme
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.04443
Pdf link: https://arxiv.org/pdf/2604.04443
Abstract Reasoning with complex, context-specific rules remains challenging for large language models (LLMs). In legal and policy settings, this manifests as deontic reasoning: reasoning about obligations, permissions, and prohibitions under explicit rules. While many recent benchmarks emphasize short-context mathematical reasoning, fewer focus on long-context, high-stakes deontic reasoning. To address this gap, we introduce DEONTICBENCH, a benchmark of 6,232 tasks across U.S. federal taxes, airline baggage policies, U.S. immigration administration, and U.S. state housing law. These tasks can be approached in multiple ways, including direct reasoning in language or with the aid of symbolic computation. Besides free-form chain-of-thought reasoning, DEONTICBENCH enables an optional solver-based workflow in which models translate statutes and case facts into executable Prolog, leading to formal problem interpretations and an explicit program trace. We release reference Prolog programs for all instances. Across frontier LLMs and coding models, best hard-subset performance reaches only 44.4% on SARA Numeric and 46.6 macro-F1 on Housing. We further study training with supervised fine-tuning and reinforcement learning for symbolic program generation. Although training improves Prolog generation quality, current RL methods still fail to solve these tasks reliably. Overall, DEONTICBENCH provides a benchmark for studying context-grounded rule reasoning in real-world domains under both symbolic and non-symbolic settings.
中文摘要 对于大型语言模型（LLM）来说，使用复杂且特定上下文的规则进行推理仍然具有挑战性。在法律和政策环境中，这表现为义务推理：关于义务、许可和在明确规则下的禁止行为进行推理。虽然许多近期基准强调短上下文数学推理，但较少关注长上下文、高风险的义务推理。为弥补这一空白，我们推出了DEONTICBENCH，这是一项涵盖美国联邦税收、航空行李政策、美国移民管理局和美国各州住房法的6,232项任务基准。这些任务可以通过多种方式完成，包括直接用语言推理或借助符号计算。除了自由形式的思维链推理外，DEONTICBENCH还支持一种可选的基于求解器的工作流程，模型将法规和案例事实转换为可执行的Prolog，从而实现形式化的问题解释和显式程序跟踪。我们为所有实例发布了参考Prolog程序。在前沿大型语言模型和编码模型中，SARA Numeric的最佳硬子集性能仅为44.4%，在Housing中为46.6%。我们还进一步研究了通过监督微调和强化学习进行符号程序生成的培训。尽管训练提高了Prolog生成的质量，但当前的强化学习方法仍然无法可靠地解决这些任务。总体而言，DEONTICBENCH为在符号和非符号环境中研究基于上下文的规则推理提供了基准。

Retrieval Augmented Conversational Recommendation with Reinforcement Learning

带强化学习的检索增强会话推荐

Authors: Zhenrui Yue, Honglei Zhuang, Zhen Qin, Zhankui He, Huimin Zeng, Julian McAuley, Dong Wang
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2604.04457
Pdf link: https://arxiv.org/pdf/2604.04457
Abstract Large language models (LLMs) exhibit enhanced capabilities in language understanding and generation. By utilizing their embedded knowledge, LLMs are increasingly used as conversational recommender systems (CRS), achieving improved performance across diverse scenarios. However, existing LLM-based methods rely on pretrained knowledge without external retrieval mechanisms for novel items. Additionally, the lack of a unified corpus poses challenges for integrating retrieval augmentation into CRS. Motivated by these challenges, we present RAR, a novel two-stage retrieval augmented conversational recommendation framework that aligns retrieval and generation to enhance both performance and factuality. To support this framework and provide a unified corpus, we construct a large-scale movie corpus, comprising over 300k movies with rich metadata, such as titles, casts and plot summaries. Leveraging this data, our primary contribution is RAR, the first framework to depart from standard two-stage CRS by dynamically bridging retrieval and generation. First, a retriever model generates candidate items based on user history; in the subsequent stage, an LLM refines the recommendations by incorporating conversational context with retrieved results. In addition, we introduce a novel reinforcement learning (RL) method that leverages LLM feedback to iteratively update the retriever. By creating a collaborative feedback loop that reinforces sampled candidate sets with higher ranking metrics, RAR effectively mitigates the misalignment between the retrieval and generation stages. Furthermore, grounding the LLM in factual metadata allows our RL-driven approach to capture subtle user intentions and generate context-aware recommendations with reduced hallucinations. We validate our approach through extensive experiments on multiple benchmarks, where RAR consistently outperforms state-of-the-art baseline methods.
中文摘要 大型语言模型（LLMs）在语言理解和生成方面展现出更强的能力。通过利用其内嵌知识，LLMs越来越多地被用作对话推荐系统（CRS），在多种场景下实现了更高的性能。然而，现有基于LLM的方法依赖于预训练知识，而没有外部检索机制来获取新项目。此外，缺乏统一语料库也带来了将检索增强整合进CRS的挑战。基于这些挑战，我们提出了RAR，一种新型的两阶段检索增强会话推荐框架，能够结合检索和生成，以提升表现性和事实性。为了支持该框架并提供统一语料库，我们构建了一个大型电影语料库，包含超过30万部电影，包含丰富的元数据，如片头、演员阵容和剧情摘要。基于这些数据，我们的主要贡献是RAR，这是首个通过动态桥接检索和生成，突破标准两阶段CRS的框架。首先，检索器模型基于用户历史生成候选项目;在下一阶段，LLM通过结合对话上下文和检索结果来细化推荐。此外，我们还引入了一种新型强化学习（RL）方法，利用LLM反馈迭代更新检索器。通过创建协作反馈循环，强化排名更高的样本集，RAR有效减少了检索阶段与生成阶段之间的错位。此外，基于事实元数据，使我们基于强化学习的方法能够捕捉微妙的用户意图，生成带有上下文感知的推荐，减少幻觉。我们通过多个基准测试的大量实验验证了我们的方法，RAR始终优于最先进的基线方法。

One Model for All: Multi-Objective Controllable Language Models

全民统一模型：多目标可控语言模型

Authors: Qiang He, Yucheng Yang, Tianyi Zhou, Meng Fang, Mykola Pechenizkiy, Setareh Maghsudi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.04497
Pdf link: https://arxiv.org/pdf/2604.04497
Abstract Aligning large language models (LLMs) with human preferences is critical for enhancing LLMs' safety, helpfulness, humor, faithfulness, etc. Current reinforcement learning from human feedback (RLHF) mainly focuses on a fixed reward learned from average human ratings, which may weaken the adaptability and controllability of varying preferences. However, creating personalized LLMs requires aligning LLMs with individual human preferences, which is non-trivial due to the scarce data per user and the diversity of user preferences in multi-objective trade-offs, varying from emphasizing empathy in certain contexts to demanding efficiency and precision in others. Can we train one LLM to produce personalized outputs across different user preferences on the Pareto front? In this paper, we introduce Multi-Objective Control (MOC), which trains a single LLM to directly generate responses in the preference-defined regions of the Pareto front. Our approach introduces multi-objective optimization (MOO) principles into RLHF to train an LLM as a preference-conditioned policy network. We improve the computational efficiency of MOC by applying MOO at the policy level, enabling us to fine-tune a 7B-parameter model on a single A6000 GPU. Extensive experiments demonstrate the advantages of MOC over baselines in three aspects: (i) controllability of LLM outputs w.r.t. user preferences on the trade-off among multiple rewards; (ii) quality and diversity of LLM outputs, measured by the hyper-volume of multiple solutions achieved; and (iii) generalization to unseen preferences. These results highlight MOC's potential for real-world applications requiring scalable and customizable LLMs.
中文摘要 将大型语言模型（LLMs）与人类偏好对齐，对于提升LLMs的安全性、帮助性、幽默感、忠实性等至关重要。当前的人类反馈强化学习（RLHF）主要关注从平均人类评分中学习到的固定奖励，这可能削弱了不同偏好的适应性和可控性。然而，创建个性化LLM需要与个体个人偏好对齐，这并非易事，因为每个用户的数据稀缺且用户偏好多样，从强调同理心到要求效率和精确性。我们能否训练一个大型语言模型，在帕累托方面针对不同用户偏好生成个性化输出？本文介绍了多目标控制（MOC），它通过训练单个大型语言模型，直接在帕累托前缘的偏好定义区域产生响应。我们的方法将多目标优化（MOO）原则引入RLHF，以训练LLM作为偏好条件策略网络。我们通过在政策层面应用MOO，提升了MOC的计算效率，使我们能够在单个A6000 GPU上微调7B参数模型。大量实验展示了MOC相较基线的三个优势：（i）LLM输出的可控性，取决于用户对多奖励权衡的偏好;（ii） LLM输出的质量和多样性，通过实现的多解的超大容量来衡量;以及（iii）对看不见偏好的推广。这些结果凸显了MOC在现实世界中需要可扩展和可定制LLM的潜力。
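
The hyper-volume metric mentioned in item (ii) can be sketched for the two-objective case. The sketch assumes both rewards are maximized and measured against a reference point; the reward pairs and the function name are hypothetical, and the paper's evaluation may involve more objectives or a different reference.

```python
def hypervolume_2d(points, ref=(0.0, 0.0)):
    """Area dominated by `points` (both objectives maximized) relative to `ref`."""
    # Keep only points that strictly improve on the reference in both objectives,
    # then sweep from the largest first objective to the smallest.
    pts = sorted((p for p in points if p[0] > ref[0] and p[1] > ref[1]),
                 key=lambda p: p[0], reverse=True)
    area, best_y = 0.0, ref[1]
    for x, y in pts:
        if y > best_y:
            # This point adds a horizontal strip above everything seen so far.
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area

# Made-up (reward_1, reward_2) pairs for four preference settings.
solutions = [(3, 1), (2, 2), (1, 3), (1, 1)]
print(hypervolume_2d(solutions))
```

A larger value means the set of solutions covers more of the trade-off region, which is why the metric reflects both quality and diversity.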

Memory Intelligence Agent

记忆智能代理

Authors: Jingyang Qiao, Weicheng Meng, Yu Cheng, Zhihang Lin, Zhizhong Zhang, Xin Tan, Jingyu Gong, Kun Shao, Yuan Xie
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2604.04503
Pdf link: https://arxiv.org/pdf/2604.04503
Abstract Deep research agents (DRAs) integrate LLM reasoning with external tools. Memory systems enable DRAs to leverage historical experiences, which are essential for efficient reasoning and autonomous evolution. Existing methods rely on retrieving similar trajectories from memory to aid reasoning, while suffering from key limitations of ineffective memory evolution and increasing storage and retrieval costs. To address these problems, we propose a novel Memory Intelligence Agent (MIA) framework, consisting of a Manager-Planner-Executor architecture. Memory Manager is a non-parametric memory system that can store compressed historical search trajectories. Planner is a parametric memory agent that can produce search plans for questions. Executor is another agent that can search and analyze information guided by the search plan. To build the MIA framework, we first adopt an alternating reinforcement learning paradigm to enhance cooperation between the Planner and the Executor. Furthermore, we enable the Planner to continuously evolve during test-time learning, with updates performed on-the-fly alongside inference without interrupting the reasoning process. Additionally, we establish a bidirectional conversion loop between parametric and non-parametric memories to achieve efficient memory evolution. Finally, we incorporate reflection and unsupervised judgment mechanisms to boost reasoning and self-evolution in the open world. Extensive experiments across eleven benchmarks demonstrate the superiority of MIA.
中文摘要 深度研究代理（DRA）将大型语言模型推理与外部工具整合。记忆系统使DRA能够利用历史经验，这对于高效的推理和自主进化至关重要。现有方法依赖于从记忆中检索相似轨迹来辅助推理，但存在记忆演化效率低下以及存储和检索成本增加的关键限制。为解决这些问题，我们提出了一种新型内存智能代理（MIA）框架，由管理者-规划-执行者架构组成。内存管理器是一种非参数内存系统，可以存储压缩后的历史搜索轨迹。Planner 是一个参数记忆代理，可以为问题生成搜索计划。执行者是另一个可以根据搜索计划进行搜索和分析信息的代理。为了构建MIA框架，我们首先采用交替强化学习范式，以加强规划者与执行者之间的合作。此外，我们使Planner在测试学习过程中能够持续演进，更新能在推理过程中实时进行，而不中断推理过程。此外，我们还建立了参数记忆与非参数记忆之间的双向转换循环，以实现高效的记忆演化。最后，我们融入了反思和无监督判断机制，以增强开放世界中的推理和自我进化。跨越十一个基准测试的广泛实验证明了MIA的优越性。

DRL-Based Phase Optimization for O-RIS in Dual-Hop Hard-Switching FSO/RIS-aided RF and UWOC Systems

基于DRL的双跳硬交换FSO/RIS辅助射频和UWOC系统中O-RIS的相位优化

Authors: Aboozar Heydaribeni, Hamzeh Beyranvand, Sahar Eslami
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2604.04531
Pdf link: https://arxiv.org/pdf/2604.04531
Abstract This paper presents a dual-hop hybrid framework that integrates a free-space optical (FSO)/RIS-aided radio frequency (RF) link operating under a hard-switching protocol as the first hop, and an optical reconfigurable intelligent surface (O-RIS)-assisted underwater wireless optical communication (UWOC) link as the second hop. To capture realistic underwater dynamics, the Oceanic Turbulence Optical Power Spectrum (OTOPS) is employed for accurate turbulence modeling. For efficient O-RIS phase control, deep reinforcement learning (DRL) algorithms, specifically the Deep Deterministic Policy Gradient (DDPG) and Twin Delayed DDPG (TD3), have been developed to optimize the phase shifts of O-RIS elements. Simulation results demonstrate that the proposed system substantially improves outage probability and channel capacity, with TD3 achieving superior robustness and adaptability. These findings highlight the DRL-enabled O-RIS as a promising approach for achieving reliable and high-capacity 6G cross-domain UWOC networks.
中文摘要 本文提出了一个双跳混合框架，将第一跳为硬交换协议下的自由空间光（FSO）/RIS辅助射频（RF）链路，第二跳为可重构智能水面（O-RIS）辅助水下无线光通信（UWOC）链路。为了捕捉真实的水下动力学，采用海洋湍流光功率谱（OTOPS）进行精确的湍流建模。为了高效的O-RIS相位控制，开发了深度强化学习（DRL）算法，特别是深度确定性策略梯度（DDPG）和双延迟DDPG（TD3），以优化O-RIS元素的相位移。模拟结果表明，所提系统显著提升了断线概率和信道容量，TD3实现了更优越的鲁棒性和适应性。这些发现凸显了支持DRL的O-RIS作为实现可靠且高容量6G跨域UWOC网络的有前景方法。

FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control

FlashSAC：高维机器人控制的快速稳定非策略强化学习

Authors: Donghu Kim, Youngdo Lee, Minho Park, Kinam Kim, I Made Aswin Nahendra, Takuma Seno, Sehee Min, Daniel Palenicek, Florian Vogt, Danica Kragic, Jan Peters, Jaegul Choo, Hojoon Lee
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.04539
Pdf link: https://arxiv.org/pdf/2604.04539
Abstract Reinforcement learning (RL) is a core approach for robot control when expert demonstrations are unavailable. On-policy methods such as Proximal Policy Optimization (PPO) are widely used for their stability, but their reliance on narrowly distributed on-policy data limits accurate policy evaluation in high-dimensional state and action spaces. Off-policy methods can overcome this limitation by learning from a broader state-action distribution, yet suffer from slow convergence and instability, as fitting a value function over diverse data requires many gradient updates, causing critic errors to accumulate through bootstrapping. We present FlashSAC, a fast and stable off-policy RL algorithm built on Soft Actor-Critic. Motivated by scaling laws observed in supervised learning, FlashSAC sharply reduces gradient updates while compensating with larger models and higher data throughput. To maintain stability at increased scale, FlashSAC explicitly bounds weight, feature, and gradient norms, curbing critic error accumulation. Across over 60 tasks in 10 simulators, FlashSAC consistently outperforms PPO and strong off-policy baselines in both final performance and training efficiency, with the largest gains on high-dimensional tasks such as dexterous manipulation. In sim-to-real humanoid locomotion, FlashSAC reduces training time from hours to minutes, demonstrating the promise of off-policy RL for sim-to-real transfer.
中文摘要 强化学习（RL）是机器人控制中在无法获得专家演示时的核心方法。如近距离策略优化（PPO）等策略上方法因其稳定性被广泛使用，但其对策略内数据的依赖，限制了高维状态和动作空间中的准确策略评估。非策略方法可以通过从更广泛的状态-动作分布中学习来克服这一限制，但由于在多样化数据上拟合价值函数需要多次梯度更新，导致批评错误通过自助法积累，收敛缓慢且不稳定。我们介绍FlashSAC，一种基于软演员批判（Soft Actor-Critic）构建的快速稳定非策略强化学习算法。受监督学习中观察到的缩放律动机驱动，FlashSAC 大幅减少梯度更新，同时补偿更大模型和更高数据吞吐量。为了在更大尺度下保持稳定性，FlashSAC明确限制权重、特征和梯度范数，抑制批评错误的累积。在10个模拟器中60多个任务中，FlashSAC在最终性能和训练效率上始终优于PPO和强的非策略基线，在高维度任务如灵巧操作上取得最大提升。在模拟到真实的人形移动中，FlashSAC将训练时间从数小时缩短到几分钟，展示了非策略强化学习在模拟到现实传输中的潜力。
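
The abstract says FlashSAC bounds weight, feature, and gradient norms but does not say how. The sketch below shows two generic techniques of that kind, global gradient-norm clipping and row-wise weight projection, purely as an assumption about what such bounding can look like; the function names and thresholds are hypothetical and should not be read as FlashSAC's actual mechanism.

```python
import numpy as np

def clip_grad_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their global L2 norm is at most max_norm."""
    total = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total

def project_weight_rows(W, max_norm=1.0):
    """Shrink any row of W whose L2 norm exceeds max_norm back onto that bound."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W * np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))

# Random arrays standing in for critic gradients and a weight matrix.
rng = np.random.default_rng(0)
grads = [rng.normal(size=(4, 3)) * 10, rng.normal(size=(3,)) * 10]
clipped, norm_before = clip_grad_norm(grads, max_norm=1.0)
W = project_weight_rows(rng.normal(size=(5, 8)) * 3)
print(norm_before, np.linalg.norm(W, axis=1).max())
```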

Paper Espresso: From Paper Overload to Research Insight

Paper Espresso：从论文过载到研究洞察

Authors: Mingzhe Du, Luu Anh Tuan, Dong Huang, See-kiong Ng
Subjects: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.04562
Pdf link: https://arxiv.org/pdf/2604.04562
Abstract The accelerating pace of scientific publishing makes it increasingly difficult for researchers to stay current. We present Paper Espresso, an open-source platform that automatically discovers, summarizes, and analyzes trending arXiv papers. The system uses large language models (LLMs) to generate structured summaries with topical labels and keywords, and provides multi-granularity trend analysis at daily, weekly, and monthly scales through LLM-driven topic consolidation. Over 35 months of continuous deployment, Paper Espresso has processed over 13,300 papers and publicly released all structured metadata, revealing rich dynamics in the AI research landscape: a mid-2025 surge in reinforcement learning for LLM reasoning, non-saturating topic emergence (6,673 unique topics), and a positive correlation between topic novelty and community engagement (2.0x median upvotes for the most novel papers). A live demo is available at this https URL.
中文摘要 科学发表的加速速度使研究人员越来越难以保持最新进展。我们介绍Paper Espresso，一个开源平台，能够自动发现、总结和分析热门的arXiv论文。该系统利用大型语言模型（LLMs）生成带有主题标签和关键词的结构化摘要，并通过基于LLM的主题整合，在日、周、月尺度提供多粒度趋势分析。经过35个月的持续部署，Paper Espresso已处理了13,300多篇论文，并公开了所有结构化元数据，揭示了AI研究领域的丰富动态：2025年中期，LLM推理强化学习激增，主题不饱和涌现（6,673个独特主题），以及主题新颖性与社区参与度之间的正相关（新颖论文中位数为2.0倍）。在线演示可在此 https 网址观看。

Digital Privacy in IoT: Exploring Challenges, Approaches and Open Issues

物联网中的数字隐私：探索挑战、方法与悬而未决的问题

Authors: Shini Girija, Pranav M. Pawar, Raja Muthalagu, Mithun Mukherjee
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2604.04572
Pdf link: https://arxiv.org/pdf/2604.04572
Abstract Privacy has always been a critical issue in the digital era, particularly with the increasing use of Internet of Things (IoT) devices. As the IoT continues to transform industries such as healthcare, smart cities, and home automation, it has also introduced serious challenges regarding the security of sensitive and private data. This paper examines the complex landscape of digital privacy in IoT ecosystems, highlighting the need to protect personally identifiable information (PII) of individuals and uphold their rights to digital independence. Global events, such as the COVID-19 pandemic, have accelerated the adoption of IoT, raising concerns about privacy and data protection. This paper provides an in-depth examination of digital privacy risks in the IoT domain and introduces a clear taxonomy for evaluating them using the IEEE Digital Privacy Model. The proposed framework categorizes privacy risks into five types: identity-oriented, behavioral, inference, data manipulation, and regulatory risks. We review existing digital privacy solutions, including encryption technologies, blockchain, federated learning, differential privacy, reinforcement learning, AI, and dynamic consent mechanisms, to mitigate these risks. We also highlight how these privacy-enhancing technologies (PETs) help with data confidentiality, access control, and trust management. Additionally, this study presents AURA-IoT, a futuristic framework that tackles AI-driven privacy risks through a multi-layered structure. AURA-IoT integrates adversarial robustness, explainability, transparency, fairness, compliance, dynamic consent, and policy enforcement mechanisms to ensure digital privacy, security, and accountable IoT operations. Finally, we discuss ongoing challenges and potential research directions for integrating AI and encryption-based privacy solutions to achieve comprehensive digital privacy in future IoT systems.
中文摘要 隐私一直是数字时代的关键问题，尤其是在物联网（IoT）设备的日益普及之际。随着物联网不断改变医疗、智慧城市和家庭自动化等行业，也带来了关于敏感和私人数据安全方面的严峻挑战。本文探讨物联网生态系统中数字隐私的复杂格局，强调保护个人可识别信息（PII）并维护其数字独立权利的必要性。全球事件，如新冠疫情，加速了物联网的普及，引发了对隐私和数据保护的担忧。本文深入探讨了物联网领域的数字隐私风险，并提出了利用IEEE数字隐私模型评估这些风险的清晰分类法。所提框架将隐私风险分为五类：面向身份、行为、推理、数据操作和监管风险。我们审查现有数字隐私解决方案，包括加密技术、区块链、联邦学习、差分隐私、强化学习、人工智能和动态同意机制，以降低这些风险。我们还强调了这些增强隐私技术（PET）如何帮助实现数据机密性、访问控制和信任管理。此外，本研究还介绍了AURA-IoT，这是一个通过多层结构应对AI驱动隐私风险的未来框架。AURA-IoT集成了对抗性的鲁棒性、可解释性、透明度、公平性、合规性、动态同意和政策执行机制，确保数字隐私、安全和负责任的物联网运营。最后，我们讨论了将人工智能与基于加密的隐私解决方案整合，实现未来物联网系统全面数字隐私的持续挑战及潜在研究方向。

Anticipatory Reinforcement Learning: From Generative Path-Laws to Distributional Value Functions

预期强化学习：从生成路径定律到分布价值函数

Authors: Daniel Bloch
Subjects: Machine Learning (cs.LG); Mathematical Finance (q-fin.MF); Pricing of Securities (q-fin.PR); Statistical Finance (q-fin.ST)
Arxiv link: https://arxiv.org/abs/2604.04662
Pdf link: https://arxiv.org/pdf/2604.04662
Abstract This paper introduces Anticipatory Reinforcement Learning (ARL), a novel framework designed to bridge the gap between non-Markovian decision processes and classical reinforcement learning architectures, specifically under the constraint of a single observed trajectory. In environments characterised by jump-diffusions and structural breaks, traditional state-based methods often fail to capture the essential path-dependent geometry required for accurate foresight. We resolve this by lifting the state space into a signature-augmented manifold, where the history of the process is embedded as a dynamical coordinate. By utilising a self-consistent field approach, the agent maintains an anticipated proxy of the future path-law, allowing for a deterministic evaluation of expected returns. This transition from stochastic branching to a single-pass linear evaluation significantly reduces computational complexity and variance. We prove that this framework preserves fundamental contraction properties and ensures stable generalisation even in the presence of heavy-tailed noise. Our results demonstrate that by grounding reinforcement learning in the topological features of path-space, agents can achieve proactive risk management and superior policy stability in highly volatile, continuous-time environments.
中文摘要 本文介绍了预期强化学习（ARL），这是一个新颖框架，旨在弥合非马尔可夫决策过程与经典强化学习架构之间的鸿沟，特别是在单一观测轨迹约束下。在以跳跃扩散和结构断裂为特征的环境中，传统的基于状态的方法往往无法捕捉准确预见所需的路径依赖几何结构。我们通过将状态空间提升到一个带签名增强的流形中，将过程的历史嵌入为动力坐标来解决这个问题。通过采用自洽场方法，智能体维持未来路径定律的预期代理，从而实现对预期收益的确定性评估。这种从随机分支向单次线性评估的转变显著降低了计算复杂性和方差。我们证明该框架保持基本收缩性质，并确保即使在存在重尾噪声的情况下也能保持稳定的泛化。我们的结果表明，通过将强化学习扎根于路径空间的拓扑特征，智能体可以在高度波动的连续时间环境中实现主动风险管理和卓越的策略稳定性。

Discovering Failure Modes in Vision-Language Models using RL

利用强化学习发现视觉语言模型中的失败模式

Authors: Kanishk Jain, Qian Yang, Shravan Nayak, Parisa Kordjamshidi, Nishanth Anand, Aishwarya Agrawal
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.04733
Pdf link: https://arxiv.org/pdf/2604.04733
Abstract Vision-language Models (VLMs), despite achieving strong performance on multimodal benchmarks, often misinterpret straightforward visual concepts that humans identify effortlessly, such as counting, spatial reasoning, and viewpoint understanding. Previous studies manually identified these weaknesses and found that they often stem from deficits in specific skills. However, such manual efforts are costly, unscalable, and subject to human bias, which often overlooks subtle details in favor of salient objects, resulting in an incomplete understanding of a model's vulnerabilities. To address these limitations, we propose a Reinforcement Learning (RL)-based framework to automatically discover the failure modes or blind spots of any candidate VLM on a given data distribution without human intervention. Our framework trains a questioner agent that adaptively generates queries based on the candidate VLM's responses to elicit incorrect answers. Our approach increases question complexity by focusing on fine-grained visual details and distinct skill compositions as training progresses, consequently identifying 36 novel failure modes in which VLMs struggle. We demonstrate the broad applicability of our framework by showcasing its generalizability across various model combinations.
中文摘要 视觉语言模型（VLMs）尽管在多模态基准测试上表现出色，但常常误解人类轻松识别的直观概念，如计数、空间推理和视角理解。之前的研究人工识别了这些弱点，发现它们往往源于特定技能的不足。然而，这种人工工作成本高昂、不可扩展，且容易受到人为偏见的影响，这些偏见往往忽视了细微细节，而偏重显著对象，导致对模型脆弱性的理解不完整。为解决这些局限性，我们提出了基于强化学习（RL）的框架，能够在不需人工干预的情况下自动发现给定数据分布中任一候选VLM的失败模式或盲点。我们的框架训练提问代理，根据候选VLM的回答自适应生成查询，以引出错误答案。我们的方法通过关注细致的视觉细节和独特的技能组合，随着训练进展，提高了问题复杂度，从而识别出36种VLM在中遇到困难的新失败模式。我们通过展示该框架在各种模型组合中的普遍适用性，展示了其广泛的适用性。

Cog-DRIFT: Exploration on Adaptively Reformulated Instances Enables Learning from Hard Reasoning Problems

Cog-DRIFT：探索自适应重构实例，使得从硬推理问题中学习

Authors: Justin Chih-Yao Chen, Archiki Prasad, Zaid Khan, Joykirat Singh, Runchu Tian, Elias Stengel-Eskin, Mohit Bansal
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.04767
Pdf link: https://arxiv.org/pdf/2604.04767
Abstract Reinforcement learning from verifiable rewards (RLVR) has improved the reasoning abilities of LLMs, yet a fundamental limitation remains: models cannot learn from problems that are too difficult to solve under their current policy, as these yield no meaningful reward signal. We propose a simple yet effective solution based on task reformulation. We transform challenging open-ended problems into cognitively simpler variants -- such as multiple-choice and cloze formats -- that preserve the original answer while reducing the effective search space and providing denser learning signals. These reformulations span a spectrum from discriminative to generative tasks, which we exploit to bootstrap learning: models first learn from structured, easier formats, and this knowledge transfers back to improve performance on the original open-ended problems. Building on this insight, we introduce Cog-DRIFT, a framework that constructs reformulated variants and organizes them into an adaptive curriculum based on difficulty. Training progresses from easier to harder formats, enabling the model to learn from problems that previously yielded zero signal under standard RL post-training. Cog-DRIFT not only improves on the originally unsolvable hard problems (absolute +10.11% for Qwen and +8.64% for Llama) but also generalizes well to other held-out datasets. Across 2 models and 6 reasoning benchmarks, our method consistently outperforms standard GRPO and strong guided-exploration baselines. On average, Cog-DRIFT shows +4.72% (Qwen) and +3.23% (Llama) improvements over the second-best baseline. We further show that Cog-DRIFT improves pass@k at test time, and the curriculum improves sample efficiency. Overall, our results highlight task reformulation and curriculum learning as an effective paradigm for overcoming the exploration barrier in LLM post-training.
中文摘要 可验证奖励强化学习（RLVR）提升了大型语言模型的推理能力，但一个根本性局限依然存在：模型无法从在其当前策略下过难而无法解决的问题中学习，因为这些问题无法产生有意义的奖励信号。我们提出一种基于任务重构的简单而有效的解决方案。我们将具有挑战性的开放式问题转化为认知上更简单的变体（如选择题和完形填空），既保留原始答案，又缩小有效搜索空间，并提供更密集的学习信号。这些重构形式涵盖了从判别式任务到生成式任务的谱系，我们利用这一点来引导学习：模型首先从结构化、更简单的格式中学习，这些知识再迁移回来，提升在原始开放式问题上的表现。基于这一见解，我们提出了 Cog-DRIFT 框架，该框架构建重构后的变体，并根据难度将它们组织成自适应课程。训练从较易的格式逐步推进到较难的格式，使模型能够从此前在标准强化学习后训练中产生零信号的问题中学习。Cog-DRIFT 不仅在原本无法解决的难题上取得提升（Qwen 绝对提升 +10.11%，Llama 绝对提升 +8.64%），还能很好地泛化到其他留出数据集。在 2 个模型和 6 个推理基准上，我们的方法持续优于标准 GRPO 和强大的引导探索基线。平均而言，Cog-DRIFT 相比次优基线提升了 +4.72%（Qwen）和 +3.23%（Llama）。我们还进一步表明，Cog-DRIFT 提升了测试时的 pass@k，且课程提高了样本效率。总体而言，我们的结果表明，任务重构和课程学习是克服 LLM 后训练中探索障碍的有效范式。
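The reformulation idea in this abstract (turning an open-ended problem into a multiple-choice variant that keeps the original answer) can be illustrated with a minimal sketch. This is not Cog-DRIFT's implementation: the function names, the distractor list, and the binary reward are all hypothetical choices made for illustration.

```python
import random

def to_multiple_choice(question, answer, distractors, seed=0):
    """Build a multiple-choice prompt that keeps the original answer as one option."""
    rng = random.Random(seed)  # seeded so the option order is reproducible
    options = [answer] + list(distractors)
    rng.shuffle(options)
    letters = "ABCDEFGH"[:len(options)]
    lines = [question] + [f"{letter}. {opt}" for letter, opt in zip(letters, options)]
    key = letters[options.index(answer)]  # letter of the correct option
    return "\n".join(lines), key

def verifiable_reward(predicted_letter, key):
    """Binary reward (an assumption): 1.0 if the chosen letter matches the key."""
    return 1.0 if predicted_letter.strip().upper() == key else 0.0

# Hypothetical example problem and distractors.
prompt, key = to_multiple_choice("What is 6 * 7?", "42", ["36", "48", "54"])
print(prompt)
print("reward for the correct letter:", verifiable_reward(key, key))
```

With four options, a uniformly random guess already earns a nonzero reward a quarter of the time, which is one way to read the "denser learning signals" claim.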

CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models

CLEAR：在统一多模态模型中释放面向退化图像理解的生成潜力

Authors: Xiangzhao Hao, Zefeng Zhang, Zhenyu Zhang, Linhao Yu, Yao Chen, Yiqian Zhang, Haiyun Guo, Shuohuan Wang, Yu Sun
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.04780
Pdf link: https://arxiv.org/pdf/2604.04780
Abstract Image degradation from blur, noise, compression, and poor illumination severely undermines multimodal understanding in real-world settings. Unified multimodal models that combine understanding and generation within a single architecture are a natural fit for this challenge, as their generative pathway can model the fine-grained visual structure that degradation destroys. Yet these models fail to leverage their own generative capacity on degraded inputs. We trace this disconnect to two compounding factors: existing training regimes never ask the model to invoke generation during reasoning, and the standard decode-reencode pathway does not support effective joint optimization. We present CLEAR, a framework that connects the two capabilities through three progressive steps: (1) supervised fine-tuning on a degradation-aware dataset to establish the generate-then-answer reasoning pattern; (2) a Latent Representation Bridge that replaces the decode-reencode detour with a direct, optimizable connection between generation and reasoning; (3) Interleaved GRPO, a reinforcement learning method that jointly optimizes text reasoning and visual generation under answer-correctness rewards. We construct MMD-Bench, covering three degradation severity levels across six standard multimodal benchmarks. Experiments show that CLEAR substantially improves robustness on degraded inputs while preserving clean-image performance. Our analysis further reveals that removing pixel-level reconstruction supervision leads to intermediate visual states with higher perceptual quality, suggesting that task-driven optimization and visual quality are naturally aligned.
中文摘要 由模糊、噪声、压缩和光照不足造成的图像退化，严重削弱了真实场景中的多模态理解。在单一架构中结合理解与生成的统一多模态模型天然适合应对这一挑战，因为其生成通路能够建模被退化破坏的细粒度视觉结构。然而，这些模型在面对退化输入时未能利用自身的生成能力。我们将这种脱节归因于两个相互叠加的因素：现有训练方式从未要求模型在推理过程中调用生成，且标准的解码-重编码路径不支持有效的联合优化。我们提出 CLEAR 框架，通过三个渐进步骤将这两种能力连接起来：（1）在退化感知数据集上进行监督微调，建立先生成后回答的推理模式；（2）潜在表示桥（Latent Representation Bridge），用生成与推理之间直接且可优化的连接取代解码-重编码的绕行；（3）交错 GRPO（Interleaved GRPO），一种在答案正确性奖励下联合优化文本推理和视觉生成的强化学习方法。我们构建了 MMD-Bench，在六个标准多模态基准上覆盖三个退化严重程度等级。实验表明，CLEAR 在保持干净图像性能的同时，显著提升了对退化输入的鲁棒性。我们的分析进一步表明，去除像素级重建监督会得到感知质量更高的中间视觉状态，这说明任务驱动的优化与视觉质量天然一致。

Selecting Decision-Relevant Concepts in Reinforcement Learning

强化学习中选择决策相关概念

Authors: Naveen Raman, Stephanie Milani, Fei Fang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.04808
Pdf link: https://arxiv.org/pdf/2604.04808
Abstract Training interpretable concept-based policies requires practitioners to manually select which human-understandable concepts an agent should reason with when making sequential decisions. This selection demands domain expertise, is time-consuming and costly, scales poorly with the number of candidates, and provides no performance guarantees. To overcome this limitation, we propose the first algorithms for principled automatic concept selection in sequential decision-making. Our key insight is that concept selection can be viewed through the lens of state abstraction: intuitively, a concept is decision-relevant if removing it would cause the agent to confuse states that require different actions. As a result, agents should rely on decision-relevant concepts; states with the same concept representation should share the same optimal action, which preserves the optimal decision structure of the original state space. This perspective leads to the Decision-Relevant Selection (DRS) algorithm, which selects a subset of concepts from a candidate set, along with performance bounds relating the selected concepts to the performance of the resulting policy. Empirically, DRS automatically recovers manually curated concept sets while matching or exceeding their performance, and improves the effectiveness of test-time concept interventions across reinforcement learning benchmarks and real-world healthcare environments.
中文摘要 训练可解释的基于概念的策略，需要从业者手动选择智能体在进行序贯决策时应基于哪些人类可理解的概念进行推理。这种选择需要领域专业知识，耗时且成本高昂，随候选概念数量增加而难以扩展，且无法提供性能保证。为克服这一限制，我们提出了首批用于序贯决策中有原则的自动概念选择的算法。我们的关键见解是，概念选择可以从状态抽象的视角来看待：直观上，如果移除某个概念会导致智能体混淆需要不同动作的状态，那么该概念就是决策相关的。因此，智能体应依赖决策相关的概念；具有相同概念表示的状态应共享相同的最优动作，从而保留原始状态空间的最优决策结构。这一视角引出了决策相关选择（DRS）算法，该算法从候选集合中选出一个概念子集，并给出将所选概念与所得策略性能联系起来的性能界。实验上，DRS 能够自动恢复人工整理的概念集，同时达到或超过其性能，并在强化学习基准和真实医疗环境中提升测试时概念干预的有效性。
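The intuition stated in this abstract (a concept is decision-relevant if removing it makes the agent confuse states that require different actions) can be checked directly on a small table of states. The sketch below is a leave-one-out test only, not the DRS algorithm, and the concept names, states, and actions are hypothetical.

```python
def confuses_states(concept_vectors, optimal_actions, drop_index):
    """True if dropping one concept makes two states with different optimal
    actions share the same reduced concept representation."""
    seen = {}  # reduced concept vector -> optimal action first seen for it
    for vec, action in zip(concept_vectors, optimal_actions):
        reduced = tuple(v for i, v in enumerate(vec) if i != drop_index)
        if seen.setdefault(reduced, action) != action:
            return True
    return False

def decision_relevant_concepts(concept_vectors, optimal_actions):
    """Indices of concepts whose removal would merge states needing different actions."""
    n = len(concept_vectors[0])
    return [i for i in range(n)
            if confuses_states(concept_vectors, optimal_actions, i)]

# Hypothetical concepts: (near_obstacle, has_key, is_daytime).
states = [(0, 0, 0), (0, 0, 1), (1, 0, 0), (1, 0, 1), (0, 1, 0), (0, 1, 1)]
actions = ["forward", "forward", "turn", "turn", "open", "open"]
relevant = decision_relevant_concepts(states, actions)
print(relevant)  # is_daytime (index 2) never changes the optimal action here
```

In this toy table the third concept can be dropped without merging any two states that need different actions, so only the first two are flagged.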

Synthetic Sandbox for Training Machine Learning Engineering Agents

用于训练机器学习工程代理的合成沙盒

Authors: Yuhang Zhou, Lizhu Zhang, Yifan Wu, Jiayi Liu, Xiangjun Fan, Zhuokai Zhao, Hong Yan
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.04872
Pdf link: https://arxiv.org/pdf/2604.04872
Abstract As large language model agents advance beyond software engineering (SWE) tasks toward machine learning engineering (MLE), verifying agent behavior becomes orders of magnitude more expensive: while SWE tasks can be verified via fast-executing unit tests, MLE verification requires running full ML pipelines -- data preprocessing, model training, and metric evaluation -- on large datasets at each rollout step, rendering trajectory-wise on-policy reinforcement learning (RL) prohibitively slow. Existing approaches retreat to supervised fine-tuning (SFT) or offline proxy rewards, sacrificing the exploration and generalization benefits of on-policy RL. We observe that sandbox data size is the primary source of this bottleneck. Based on this insight, we introduce SandMLE, a multi-agent framework that generates diverse, verifiable synthetic MLE environments from a small number of seed tasks, preserving the structural and technical complexity of real-world problems while constraining datasets to micro-scale (each task is paired with only 50-200 training samples). Through extensive experiments, we show that SandMLE reduces execution time by over 13 times, enabling large-scale, on-policy trajectory-wise RL for the first time in the MLE domain. On MLE-bench-lite, SandMLE yields significant gains over SFT baselines across Qwen3-8B, 14B, and 30B-A3B, with relative medal rate improvements ranging from 20.3% to 66.9%. Furthermore, the trained policy generalizes across unseen agentic scaffolds, achieving up to 32.4% better HumanRank score on MLE-Dojo.
中文摘要 随着大型语言模型智能体从软件工程（SWE）任务迈向机器学习工程（MLE），验证智能体行为的成本高出数个数量级：SWE 任务可以通过快速执行的单元测试来验证，而 MLE 验证需要在每个 rollout 步骤对大型数据集运行完整的机器学习流水线（数据预处理、模型训练和指标评估），这使得按轨迹进行的同策略（on-policy）强化学习（RL）慢得难以承受。现有方法退而采用监督微调（SFT）或离线代理奖励，牺牲了同策略强化学习在探索和泛化方面的优势。我们观察到，沙盒数据规模是这一瓶颈的主要来源。基于这一见解，我们提出了 SandMLE，一个多智能体框架，能够从少量种子任务生成多样且可验证的合成 MLE 环境，在保留真实问题的结构和技术复杂性的同时，将数据集限制在微型规模（每个任务仅配有 50-200 个训练样本）。通过大量实验，我们表明 SandMLE 将执行时间缩短了 13 倍以上，首次在 MLE 领域实现了大规模、按轨迹的同策略强化学习。在 MLE-bench-lite 上，SandMLE 在 Qwen3-8B、14B 和 30B-A3B 上相比 SFT 基线取得了显著提升，奖牌率的相对提升在 20.3% 到 66.9% 之间。此外，训练得到的策略能够泛化到未见过的智能体脚手架，在 MLE-Dojo 上的 HumanRank 分数最多提升 32.4%。

Data Attribution in Adaptive Learning

自适应学习中的数据归因

Authors: Amit Kiran Rege
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.04892
Pdf link: https://arxiv.org/pdf/2604.04892
Abstract Machine learning models increasingly generate their own training data -- online bandits, reinforcement learning, and post-training pipelines for language models are leading examples. In these adaptive settings, a single training observation both updates the learner and shifts the distribution of future data the learner will collect. Standard attribution methods, designed for static datasets, ignore this feedback. We formalize occurrence-level attribution for finite-horizon adaptive learning via a conditional interventional target, prove that replay-side information cannot recover it in general, and identify a structural class in which the target is identified from logged data.
中文摘要 机器学习模型越来越多地自行生成训练数据，在线赌博机（bandit）、强化学习和语言模型的后训练流水线是典型例子。在这些自适应设定中，单个训练观测既会更新学习器，又会改变学习器未来所收集数据的分布。为静态数据集设计的标准归因方法忽略了这种反馈。我们通过一个条件干预目标，形式化了有限时域自适应学习中的出现级（occurrence-level）归因，证明仅凭回放侧信息一般无法恢复该目标，并给出了一类可以从日志数据中识别该目标的结构类。

Rethinking Exploration in RLVR: From Entropy Regularization to Refinement via Bidirectional Entropy Modulation

重新思考RLVR中的探索：从熵正则化到通过双向熵调制进行精炼

Authors: Hengrui Gu, Xiaotian Han, Yujing Bian, Kaixiong Zhou
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.04894
Pdf link: https://arxiv.org/pdf/2604.04894
Abstract Reinforcement learning with verifiable rewards (RLVR) has significantly advanced the reasoning capabilities of large language models (LLMs). However, it faces a fundamental limitation termed "restricted exploration", where the policy rapidly converges to a narrow set of solutions. While entropy regularization is a popular approach used to sustain exploration, it often proves unreliable for LLMs, suffering from high hyperparameter sensitivity and yielding only marginal performance gains. Motivated by these inefficiencies, we propose to rethink the relationship between policy entropy and exploration. By deriving a parametric formulation of group-relative advantage estimation and analyzing entropy dynamics, we conceptually decompose policy entropy into "informative entropy", which preserves diverse solution paths, and "spurious entropy", which erodes reasoning patterns. Our analysis reveals that, in contrast to blind maximization, effective exploration requires "entropy refinement", a mechanism implicitly embedded in group-relative advantage estimation that sustains informative entropy on positive rollouts while suppressing spurious entropy on negative ones. Guided by this insight, we propose AsymGRPO, an exploratory framework that explicitly decouples the modulation of positive and negative rollouts. This allows for independent control over the preservation of informative entropy and the suppression of spurious noise. Extensive experiments demonstrate that AsymGRPO achieves superior performance compared to strong baselines and exhibits the potential to synergize with existing entropy regularization methods.
中文摘要 带有可验证奖励的强化学习（RLVR）显著提升了大型语言模型（LLMs）的推理能力。然而，它面临一个被称为“受限探索”（restricted exploration）的根本性限制，即策略会迅速收敛到一个狭窄的解集合。虽然熵正则化是维持探索的常用方法，但它对大型语言模型往往并不可靠，超参数敏感性高，且只能带来有限的性能提升。鉴于这些不足，我们提出重新思考策略熵与探索之间的关系。通过推导组相对优势估计的参数化形式并分析熵的动态变化，我们在概念上将策略熵分解为“信息性熵”（informative entropy）和“虚假熵”（spurious entropy）：前者保留多样的解题路径，后者侵蚀推理模式。我们的分析表明，与盲目最大化熵不同，有效的探索需要“熵精炼”（entropy refinement），这是一种隐含在组相对优势估计中的机制，它在正向 rollout 上维持信息性熵，同时在负向 rollout 上抑制虚假熵。基于这一见解，我们提出了 AsymGRPO，这是一个显式解耦正负 rollout 调制的探索框架，从而可以独立控制信息性熵的保留和虚假噪声的抑制。大量实验表明，AsymGRPO 相较于强基线取得了更优的性能，并展现出与现有熵正则化方法协同的潜力。
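For readers unfamiliar with group-relative advantage estimation, the sketch below shows the usual normalization: each rollout's reward minus the group mean, divided by the group standard deviation. The second function only illustrates what treating positive and negative rollouts separately could look like. Its `pos_scale` and `neg_scale` parameters are hypothetical and are not AsymGRPO's actual parameterization, which the abstract does not specify. The use of the population standard deviation and the `eps` constant are also assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]

def asymmetric_advantages(rewards, pos_scale=1.0, neg_scale=1.0, eps=1e-6):
    """Illustrative only: scale positive and negative advantages independently."""
    return [a * (pos_scale if a > 0 else neg_scale)
            for a in group_relative_advantages(rewards, eps)]

# Hypothetical binary verifiable rewards for four rollouts of one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_relative_advantages(rewards)
asym = asymmetric_advantages(rewards, pos_scale=1.0, neg_scale=0.5)
print(adv)
print(asym)
```

If every rollout in a group receives the same reward, all advantages are zero and the group contributes no gradient signal.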

QED-Nano: Teaching a Tiny Model to Prove Hard Theorems

QED-Nano：教授一个微小模型以证明难定理

Authors: LM-Provers, Yuxiao Qu, Amrith Setlur, Jasper Dekoninck, Edward Beeching, Jia Li, Ian Wu, Lewis Tunstall, Aviral Kumar
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.04898
Pdf link: https://arxiv.org/pdf/2604.04898
Abstract Proprietary AI systems have recently demonstrated impressive capabilities on complex proof-based problems, with gold-level performance reported at the 2025 International Mathematical Olympiad (IMO). However, the training pipelines behind these systems remain largely undisclosed, and their reliance on large "internal" models and scaffolds makes them expensive to run, difficult to reproduce, and hard to study or improve upon. This raises a central question: can small, open models also be trained to achieve competitive reasoning performance on difficult Olympiad-level math? In this paper, we answer this question by building QED-Nano, a 4B model post-trained for Olympiad-level proofs. Our training recipe has three stages: (1) supervised fine-tuning to imbue good proof-writing styles by distilling from DeepSeek-Math-V2, (2) reinforcement learning (RL) with rubric-based rewards, and (3) expanding RL with a reasoning cache, which decomposes long proofs into iterative summarize-and-refine cycles and enables stronger test-time reasoning. QED-Nano surpasses the proof-generation performance of much larger open models, including Nomos-1 and GPT-OSS-120B, and approaches the performance of proprietary models like Gemini 3 Pro, at a fraction of the inference cost. To support further research on open mathematical reasoning, we release the full QED-Nano pipeline, including the QED-Nano and QED-Nano-SFT models, the FineProofs-SFT and FineProofs-RL datasets, and the training and evaluation code.
中文摘要 专有人工智能系统最近在复杂的证明类问题上展现出令人印象深刻的能力，据报道在 2025 年国际数学奥林匹克（IMO）上达到了金牌级表现。然而，这些系统背后的训练流水线大多未公开，且它们依赖大型“内部”模型和脚手架，导致运行成本高昂、难以复现，也难以研究或改进。这引出了一个核心问题：小型开放模型能否也通过训练，在高难度的奥林匹克级数学问题上取得有竞争力的推理表现？本文通过构建 QED-Nano 来回答这个问题，QED-Nano 是一个针对奥林匹克级证明进行后训练的 4B 模型。我们的训练方案分为三个阶段：（1）监督微调，通过从 DeepSeek-Math-V2 蒸馏来培养良好的证明写作风格；（2）采用基于评分细则（rubric）奖励的强化学习（RL）；（3）通过推理缓存扩展强化学习，将长证明分解为迭代的“总结-精炼”循环，从而实现更强的测试时推理。QED-Nano 的证明生成性能超越了包括 Nomos-1 和 GPT-OSS-120B 在内的大得多的开放模型，并以极低的推理成本接近 Gemini 3 Pro 等专有模型的性能。为了支持对开放数学推理的进一步研究，我们发布了完整的 QED-Nano 流水线，包括 QED-Nano 和 QED-Nano-SFT 模型、FineProofs-SFT 和 FineProofs-RL 数据集，以及训练和评估代码。

Analyzing Symbolic Properties for DRL Agents in Systems and Networking

系统与网络中DRL代理的符号性质分析

Authors: Mohammad Zangooei, Jannis Weil, Amr Rizk, Mina Tahmasbi Arashloo, Raouf Boutaba
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.04914
Pdf link: https://arxiv.org/pdf/2604.04914
Abstract Deep reinforcement learning (DRL) has shown remarkable performance on complex control problems in systems and networking, including adaptive video streaming, wireless resource management, and congestion control. For safe deployment, however, it is critical to reason about how agents behave across the range of system states they encounter in practice. Existing verification-based methods in this domain primarily focus on point properties, defined around fixed input states, which offer limited coverage and require substantial manual effort to identify relevant input-output pairs for analysis. In this paper, we study symbolic properties, that specify expected behavior over ranges of input states, for DRL agents in systems and networking. We present a generic formulation for symbolic properties, with monotonicity and robustness as concrete examples, and show how they can be analyzed using existing DNN verification engines. Our approach encodes symbolic properties as comparisons between related executions of the same policy and decomposes them into practically tractable sub-properties. These techniques serve as practical enablers for applying existing verification tools to symbolic analysis. Using our framework, diffRL, we conduct an extensive empirical study across three DRL-based control systems, adaptive video streaming, wireless resource management, and congestion control. Through these case studies, we analyze symbolic properties over broad input ranges, examine how property satisfaction evolves during training, study the impact of model size on verifiability, and compare multiple verification backends. Our results show that symbolic properties provide substantially broader coverage than point properties and can uncover non-obvious, operationally meaningful counterexamples, while also revealing practical solver trade-offs and limitations.
中文摘要 深度强化学习（DRL）在系统与网络领域的复杂控制问题上表现出色，包括自适应视频流、无线资源管理和拥塞控制。然而，为了安全部署，必须能够推断智能体在实际遇到的各种系统状态下的行为。该领域现有的基于验证的方法主要关注点属性，这类属性围绕固定的输入状态定义，覆盖范围有限，且需要大量人工工作来确定用于分析的相关输入输出对。本文研究面向系统与网络中 DRL 智能体的符号属性，这类属性规定了智能体在一段输入状态范围内的预期行为。我们给出了符号属性的通用表述，以单调性和鲁棒性为具体示例，并展示了如何利用现有的 DNN 验证引擎对其进行分析。我们的方法将符号属性编码为同一策略的若干相关执行之间的比较，并将其分解为实际可处理的子属性。这些技术使现有验证工具能够实际用于符号分析。利用我们的框架 diffRL，我们对三个基于 DRL 的控制系统（自适应视频流、无线资源管理和拥塞控制）进行了广泛的实证研究。通过这些案例研究，我们分析了宽输入范围上的符号属性，考察了属性满足情况在训练过程中的演变，研究了模型规模对可验证性的影响，并比较了多种验证后端。我们的结果表明，符号属性提供的覆盖范围远比点属性广泛，能够发现不明显但在运维上有意义的反例，同时也揭示了求解器在实践中的权衡与局限。

Vero: An Open RL Recipe for General Visual Reasoning

Vero：一个开放的强化学习通用视觉推理配方

Authors: Gabriel Sarch, Linrong Cai, Qunzhong Wang, Haoyang Wu, Danqi Chen, Zhuang Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.04917
Pdf link: https://arxiv.org/pdf/2604.04917
Abstract What does it take to build a visual reasoner that works across charts, science, spatial understanding, and open-ended tasks? The strongest vision-language models (VLMs) show such broad visual reasoning is within reach, but the recipe behind them remains unclear, locked behind proprietary reinforcement learning (RL) pipelines with non-public data. We introduce Vero, a family of fully open VLMs that matches or exceeds existing open-weight models across diverse visual reasoning tasks. We scale RL data and rewards across six broad task categories, constructing Vero-600K, a 600K-sample dataset from 59 datasets, and designing task-routed rewards that handle heterogeneous answer formats. Vero achieves state-of-the-art performance, improving over four base models by 3.7-5.5 points on average across VeroEval, our suite of 30 challenging benchmarks. Starting from Qwen3-VL-8B-Instruct, Vero outperforms Qwen3-VL-8B-Thinking on 23 of 30 benchmarks without additional proprietary thinking data. When trained from the same base model, Vero-600K exceeds existing RL datasets across task categories. Systematic ablations reveal that different task categories elicit qualitatively distinct reasoning patterns that transfer poorly in isolation, suggesting that broad data coverage is the primary driver of strong RL scaling. All data, code, and models are released.
中文摘要 要构建一个能够处理图表、科学、空间理解和开放式任务的视觉推理器，需要什么？最强的视觉语言模型（VLMs）表明这种广泛的视觉推理已触手可及，但其背后的配方仍不清晰，被锁在使用非公开数据的专有强化学习（RL）流水线之后。我们提出 Vero，这是一个完全开放的 VLM 系列，在多样化的视觉推理任务上达到或超越现有的开放权重模型。我们在六大任务类别上扩展强化学习数据和奖励，构建了 Vero-600K（一个源自 59 个数据集的 60 万样本数据集），并设计了能够处理异构答案格式的任务路由奖励。Vero 取得了最先进的性能，在我们包含 30 个高难度基准的评测套件 VeroEval 上，相对四个基础模型平均提升 3.7-5.5 分。从 Qwen3-VL-8B-Instruct 出发，Vero 在不使用额外专有思维数据的情况下，在 30 个基准中的 23 个上优于 Qwen3-VL-8B-Thinking。在从同一基础模型训练时，Vero-600K 在各任务类别上均优于现有的强化学习数据集。系统性的消融实验表明，不同任务类别会引出性质上截然不同的推理模式，这些模式在单独训练时迁移效果较差，这说明广泛的数据覆盖是强化学习有效扩展的主要驱动力。所有数据、代码和模型均已公开。

Stratifying Reinforcement Learning with Signal Temporal Logic

用信号时间逻辑分层强化学习

Authors: Justin Curry, Alberto Speranzon
Subjects: Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Systems and Control (eess.SY); Algebraic Topology (math.AT)
Arxiv link: https://arxiv.org/abs/2604.04923
Pdf link: https://arxiv.org/pdf/2604.04923
Abstract In this paper, we develop a stratification-based semantics for Signal Temporal Logic (STL) in which each atomic predicate is interpreted as a membership test in a stratified space. This perspective reveals a novel correspondence principle between stratification theory and STL, showing that most STL formulas can be viewed as inducing a stratification of space-time. The significance of this interpretation is twofold. First, it offers a fresh theoretical framework for analyzing the structure of the embedding space generated by deep reinforcement learning (DRL) and relates it to the geometry of the ambient decision space. Second, it provides a principled framework that both enables the reuse of existing high-dimensional analysis tools and motivates the creation of novel computational techniques. To ground the theory, we (1) illustrate the role of stratification theory in Minigrid games and (2) apply numerical techniques to the latent embeddings of a DRL agent playing such a game where the robustness of STL formulas is used as the reward. In the process, we propose computationally efficient signatures that, based on preliminary evidence, appear promising for uncovering the stratification structure of such embedding spaces.
中文摘要 本文为信号时序逻辑（STL）建立了一种基于分层（stratification）的语义，其中每个原子谓词被解释为分层空间中的成员关系测试。这一视角揭示了分层理论与 STL 之间一种新的对应原理，表明大多数 STL 公式可以被视为诱导出时空的一个分层。这一解释的意义有两方面。首先，它为分析深度强化学习（DRL）所生成的嵌入空间的结构提供了新的理论框架，并将其与外围（ambient）决策空间的几何结构联系起来。其次，它提供了一个有原则的框架，既使现有的高维分析工具得以复用，也推动了新计算技术的创建。为了使理论落到实处，我们（1）说明了分层理论在 Minigrid 游戏中的作用，（2）将数值技术应用于玩此类游戏的 DRL 智能体的潜在嵌入，其中 STL 公式的鲁棒性被用作奖励。在此过程中，我们提出了计算高效的特征签名，初步证据表明，它们有望揭示此类嵌入空间的分层结构。
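The abstract mentions using the robustness of STL formulas as a reward. As a reminder of what that quantity is, the sketch below computes the standard quantitative semantics for two simple formulas over a finite discrete-time signal: "always x > c" (minimum margin over time) and "eventually x > c" (maximum margin over time). It does not touch the paper's stratification-based semantics. The signal values and the threshold are hypothetical.

```python
def robustness_always_gt(signal, c):
    """Robustness of G(x > c): the worst-case margin x(t) - c over the signal."""
    return min(x - c for x in signal)

def robustness_eventually_gt(signal, c):
    """Robustness of F(x > c): the best-case margin x(t) - c over the signal."""
    return max(x - c for x in signal)

# Hypothetical trajectory of a scalar signal, e.g. distance to an obstacle.
signal = [0.5, 0.2, 0.9]
threshold = 0.3
rob_always = robustness_always_gt(signal, threshold)
rob_eventually = robustness_eventually_gt(signal, threshold)
print(rob_always, rob_eventually)
```

A positive value means the formula is satisfied with that margin and a negative value means it is violated, which is why the number can serve directly as a scalar reward.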

Keyword: diffusion policy

Diffusion Policy with Bayesian Expert Selection for Active Multi-Target Tracking

采用贝叶斯专家选择的扩散策略，用于主动多目标跟踪

Authors: Haotian Xiang, Qin Lu, Yaakov Bar-Shalom
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.03404
Pdf link: https://arxiv.org/pdf/2604.03404
Abstract Active multi-target tracking requires a mobile robot to balance exploration for undetected targets with exploitation of uncertain tracked ones. Diffusion policies have emerged as a powerful approach for capturing diverse behavioral strategies by learning action sequences from expert demonstrations. However, existing methods implicitly select among strategies through the denoising process, without uncertainty quantification over which strategy to execute. We formulate expert selection for diffusion policies as an offline contextual bandit problem and propose a Bayesian framework for pessimistic, uncertainty-aware strategy selection. A multi-head Variational Bayesian Last Layer (VBLL) model predicts the expected tracking performance of each expert strategy given the current belief state, providing both a point estimate and predictive uncertainty. Following the pessimism principle for offline decision-making, a Lower Confidence Bound (LCB) criterion then selects the expert whose worst-case predicted performance is best, avoiding overcommitment to experts with unreliable predictions. The selected expert conditions a diffusion policy to generate corresponding action sequences. Experiments on simulated indoor tracking scenarios demonstrate that our approach outperforms both the base diffusion policy and standard gating methods, including Mixture-of-Experts selection and deterministic regression baselines.
中文摘要 主动多目标跟踪要求移动机器人在探索尚未检测到的目标与利用状态不确定的已跟踪目标之间取得平衡。扩散策略通过从专家演示中学习动作序列，已成为捕捉多样化行为策略的有力方法。然而，现有方法通过去噪过程隐式地在策略之间进行选择，并未对应执行哪种策略进行不确定性量化。我们将扩散策略的专家选择表述为一个离线上下文赌博机（contextual bandit）问题，并提出一个贝叶斯框架，用于悲观的、具备不确定性感知的策略选择。多头变分贝叶斯最后一层（VBLL）模型在给定当前信念状态的情况下预测每个专家策略的期望跟踪性能，同时给出点估计和预测不确定性。遵循离线决策中的悲观原则，置信下界（LCB）准则随后选出最坏情况预测性能最优的专家，避免过度依赖预测不可靠的专家。被选中的专家作为条件输入扩散策略，以生成相应的动作序列。在模拟室内跟踪场景中的实验表明，我们的方法优于基础扩散策略和标准门控方法，包括专家混合（Mixture-of-Experts）选择和确定性回归基线。
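The Lower Confidence Bound selection step described in this abstract can be sketched in a few lines: each expert gets a pessimistic score (predicted mean minus a multiple of the predictive standard deviation), and the expert with the best pessimistic score is chosen. The `beta` weight and the numbers below are hypothetical. In the paper the means and uncertainties would come from the VBLL model, which is not reproduced here.

```python
def select_expert_lcb(means, stds, beta=1.0):
    """Pick the expert with the highest lower confidence bound mean - beta * std."""
    lcbs = [m - beta * s for m, s in zip(means, stds)]
    best = max(range(len(lcbs)), key=lcbs.__getitem__)
    return best, lcbs

# Hypothetical predictions for three expert strategies.
means = [0.80, 0.75, 0.60]  # predicted tracking performance
stds = [0.30, 0.05, 0.02]   # predictive uncertainty
best, lcbs = select_expert_lcb(means, stds, beta=1.0)
print(best, lcbs)
```

Here the expert with the highest predicted mean is not selected, because its large uncertainty pulls its pessimistic score below that of the second expert.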

HAD: Combining Hierarchical Diffusion with Metric-Decoupled RL for End-to-End Driving

HAD：将分层扩散与度量解耦强化学习结合，实现端到端驾驶

Authors: Wenhao Yao, Xinglong Sun, Zhenxin Li, Shiyi Lan, Zi Wang, Jose M. Alvarez, Zuxuan Wu
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.03581
Pdf link: https://arxiv.org/pdf/2604.03581
Abstract End-to-end planning has emerged as a dominant paradigm for autonomous driving, where recent models often adopt a scoring-selection framework to choose trajectories from a large set of candidates, with diffusion-based decoding showing strong promise. However, directly selecting from the entire candidate space remains difficult to optimize, and Gaussian perturbations used in diffusion often introduce unrealistic trajectories that complicate the denoising process. In addition, for training these models, reinforcement learning (RL) has shown promise, but existing end-to-end RL approaches typically rely on a single coupled reward without structured signals, limiting optimization effectiveness. To address these challenges, we propose HAD, an end-to-end planning framework with a Hierarchical Diffusion Policy that decomposes planning into a coarse-to-fine process. To improve trajectory generation, we introduce Structure-Preserved Trajectory Expansion, which produces realistic candidates while maintaining kinematic structure. For policy learning, we develop Metric-Decoupled Policy Optimization (MDPO) to enable structured RL optimization across multiple driving objectives. Extensive experiments show that HAD achieves new state-of-the-art performance on both NAVSIM and HUGSIM, outperforming prior arts by a huge margin: +2.3 EPDMS on NAVSIM and +4.9 Route Completion on HUGSIM.
中文摘要 端到端规划已成为自动驾驶的主导范式，近期模型常采用评分-选择框架从大量候选中选择轨迹，其中基于扩散的解码展现出很强的潜力。然而，直接从整个候选空间中进行选择仍然难以优化，且扩散中使用的高斯扰动常常引入不真实的轨迹，使去噪过程变得复杂。此外，强化学习（RL）在训练这些模型方面已展现出潜力，但现有的端到端强化学习方法通常依赖单一的耦合奖励，缺乏结构化信号，限制了优化效果。为应对这些挑战，我们提出了 HAD，一种端到端规划框架，其分层扩散策略将规划分解为由粗到细的过程。为了改进轨迹生成，我们引入了结构保持轨迹扩展（Structure-Preserved Trajectory Expansion），它在保持运动学结构的同时生成真实的候选轨迹。在策略学习方面，我们提出了度量解耦策略优化（MDPO），以实现针对多个驾驶目标的结构化强化学习优化。大量实验表明，HAD 在 NAVSIM 和 HUGSIM 上均取得了新的最先进性能，大幅超越已有方法：在 NAVSIM 上 EPDMS 提升 2.3，在 HUGSIM 上路线完成率（Route Completion）提升 4.9。