Arxiv Papers of Today

Generated: 2026-03-19 16:45:58 (UTC+8); Arxiv release time: 2026-03-19 20:00 EDT (2026-03-20 08:00 UTC+8)

44 relevant papers today

Keyword: reinforcement learning

Multi-Modal Multi-Agent Reinforcement Learning for Radiology Report Generation: Radiologist-Like Workflow with Clinically Verifiable Rewards

Authors: Kaito Baba, Satoshi Kodera
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.16876
Pdf link: https://arxiv.org/pdf/2603.16876
Abstract We propose MARL-Rad, a novel multi-modal multi-agent reinforcement learning framework for radiology report generation that coordinates region-specific agents and a global integrating agent, optimized via clinically verifiable rewards. Unlike prior single-model reinforcement learning or post-hoc agentization of independently trained models, our method jointly trains multiple agents and optimizes the entire agent system through reinforcement learning. Experiments on the MIMIC-CXR and IU X-ray datasets show that MARL-Rad consistently improves clinical efficacy (CE) metrics such as RadGraph, CheXbert, and GREEN scores, achieving state-of-the-art CE performance. Further analyses confirm that MARL-Rad enhances laterality consistency and produces more accurate, detail-informed reports.

Federated Multi Agent Deep Learning and Neural Networks for Advanced Distributed Sensing in Wireless Networks

Authors: Nadine Muller, Stefano DeRosa, Su Zhang, Chun Lee Huan
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.16881
Pdf link: https://arxiv.org/pdf/2603.16881
Abstract Multi-agent deep learning (MADL), including multi-agent deep reinforcement learning (MADRL), distributed/federated training, and graph-structured neural networks, is becoming a unifying framework for decision-making and inference in wireless systems where sensing, communication, and computing are tightly coupled. Recent 5G-Advanced and 6G visions strengthen this coupling through integrated sensing and communication, edge intelligence, open programmable RAN, and non-terrestrial/UAV networking, which create decentralized, partially observed, time-varying, and resource-constrained control problems. This survey synthesizes the state of the art, with emphasis on 2021-2025 research, on MADL for distributed sensing and wireless communications. We present a task-driven taxonomy across (i) learning formulations (Markov games, Dec-POMDPs, CTDE), (ii) neural architectures (GNN-based radio resource management, attention-based policies, hierarchical learning, and over-the-air aggregation), (iii) advanced techniques (federated reinforcement learning, communication-efficient federated deep RL, and serverless edge learning orchestration), and (iv) application domains (MEC offloading with slicing, UAV-enabled heterogeneous networks with power-domain NOMA, intrusion detection in sensor networks, and ISAC-driven perceptive mobile networks). We also provide comparative tables of algorithms, training topologies, and system-level trade-offs in latency, spectral efficiency, energy, privacy, and robustness. Finally, we identify open issues including scalability, non-stationarity, security against poisoning and backdoors, communication overhead, and real-time safety, and outline research directions toward 6G-native sense-communicate-compute-learn systems.

Multi-Agent Reinforcement Learning for Dynamic Pricing: Balancing Profitability, Stability and Fairness

Authors: Krishna Kumar Neelakanta Pillai Santha Kumari Amma
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.16888
Pdf link: https://arxiv.org/pdf/2603.16888
Abstract Dynamic pricing in competitive retail markets requires strategies that adapt to fluctuating demand and competitor behavior. In this work, we present a systematic empirical evaluation of multi-agent reinforcement learning (MARL) approaches, specifically MAPPO and MADDPG, for dynamic price optimization under competition. Using a simulated marketplace environment derived from real-world retail data, we benchmark these algorithms against an Independent DDPG (IDDPG) baseline, a widely used independent learner in MARL literature. We evaluate profit performance, stability across random seeds, fairness, and training efficiency. Our results show that MAPPO consistently achieves the highest average returns with low variance, offering a stable and reproducible approach for competitive price optimization, while MADDPG achieves slightly lower profit but the fairest profit distribution among agents. These findings demonstrate that MARL methods, particularly MAPPO, provide a scalable and stable alternative to independent learning approaches for dynamic retail pricing.

Leveraging Large Vision Model for Multi-UAV Co-perception in Low-Altitude Wireless Networks

Authors: Yunting Xu, Jiacheng Wang, Ruichen Zhang, Changyuan Zhao, Yinqiu Liu, Dusit Niyato, Liang Yu, Haibo Zhou, Dong In Kim
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2603.16927
Pdf link: https://arxiv.org/pdf/2603.16927
Abstract Multi-uncrewed aerial vehicle (UAV) cooperative perception has emerged as a promising paradigm for diverse low-altitude economy applications, where complementary multi-view observations are leveraged to enhance perception performance via wireless communications. However, the massive visual data generated by multiple UAVs poses significant challenges in terms of communication latency and resource efficiency. To address these challenges, this paper proposes a communication-efficient cooperative perception framework, termed Base-Station-Helped UAV (BHU), which reduces communication overhead while enhancing perception performance. Specifically, we employ a Top-K selection mechanism to identify the most informative pixels from UAV-captured RGB images, enabling sparsified visual transmission with reduced data volume and latency. The sparsified images are transmitted to a ground server via multi-user MIMO (MU-MIMO), where a Swin-large-based MaskDINO encoder extracts bird's-eye-view (BEV) features and performs cooperative feature fusion for ground vehicle perception. Furthermore, we develop a diffusion model-based deep reinforcement learning (DRL) algorithm to jointly select cooperative UAVs, sparsification ratios, and precoding matrices, achieving a balance between communication efficiency and perception utility. Simulation results on the Air-Co-Pred dataset demonstrate that, compared with traditional CNN-based BEV fusion baselines, the proposed BHU framework improves perception performance by over 5% while reducing communication overhead by 85%, providing an effective solution for multi-UAV cooperative perception under resource-constrained wireless environments.
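The Top-K selection step described in this abstract can be illustrated with a minimal sketch. The abstract does not say how pixel informativeness is scored, so the `scores` map below is an assumed input, and `top_k_sparsify` is a hypothetical helper name rather than anything from the paper.

```python
import heapq

def top_k_sparsify(image, scores, k):
    """Keep the k highest-scoring pixels of `image` and zero out the rest.

    `image` and `scores` are equally sized 2D lists. The scores are an
    assumed input; the paper's own informativeness measure is not given
    in the abstract.
    """
    coords = [(r, c) for r in range(len(scores)) for c in range(len(scores[0]))]
    # Coordinates of the k largest scores.
    kept = set(heapq.nlargest(k, coords, key=lambda rc: scores[rc[0]][rc[1]]))
    sparse = [
        [image[r][c] if (r, c) in kept else 0 for c in range(len(image[0]))]
        for r in range(len(image))
    ]
    return sparse, sorted(kept)

image = [[10, 20, 30],
         [40, 50, 60]]
scores = [[0.1, 0.9, 0.2],
          [0.8, 0.3, 0.7]]
sparse, kept = top_k_sparsify(image, scores, k=3)
print(sparse)  # [[0, 20, 0], [40, 0, 60]]
print(kept)    # [(0, 1), (1, 0), (1, 2)]
```

Only the kept coordinates and their values would need to be transmitted, which is where the reduction in data volume would come from in a scheme like this.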

MHPO: Modulated Hazard-aware Policy Optimization for Stable Reinforcement Learning

Authors: Hongjun Wang, Wei Liu, Weibo Gu, Xing Sun, Kai Han
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.16929
Pdf link: https://arxiv.org/pdf/2603.16929
Abstract Regulating the importance ratio is critical for the training stability of Group Relative Policy Optimization (GRPO) based frameworks. However, prevailing ratio control methods, such as hard clipping, suffer from non-differentiable boundaries and vanishing gradient regions, failing to maintain gradient fidelity. Furthermore, these methods lack a hazard-aware mechanism to adaptively suppress extreme deviations, leaving the optimization process vulnerable to abrupt policy shifts. To address these challenges, we propose Modulated Hazard-aware Policy Optimization (MHPO), a novel framework designed for robust and stable reinforcement learning. The proposed MHPO introduces a Log-Fidelity Modulator (LFM) to map unbounded importance ratios into a bounded, differentiable domain. This mechanism effectively prevents high-variance outlier tokens from destabilizing the loss landscape while ensuring global gradient stability. Complementarily, a Decoupled Hazard Penalty (DHP) integrates cumulative hazard functions from survival analysis to independently regulate positive and negative policy shifts. By shaping the optimization landscape with hazard-aware penalties, the proposed MHPO achieves fine-grained regulation of asymmetric policy shifts, simultaneously mitigating mode collapse from over-expansion and preventing policy erosion from catastrophic contraction within a stabilized trust region. Extensive evaluations on diverse reasoning benchmarks across both text-based and vision-language tasks demonstrate that MHPO consistently outperforms existing methods, achieving superior performance while significantly enhancing training stability.
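To make the contrast between hard clipping and a bounded, differentiable map concrete, here is a small sketch. The abstract gives no formula for the Log-Fidelity Modulator, so `smooth_bound` below is only an assumed stand-in (a tanh squashing of the log-ratio), and `eps=0.2` is an illustrative value.

```python
import math

def hard_clip(ratio, eps=0.2):
    """Clip the importance ratio to [1 - eps, 1 + eps]."""
    return max(1.0 - eps, min(1.0 + eps, ratio))

def smooth_bound(ratio, eps=0.2):
    """Assumed stand-in, NOT the paper's LFM: squash the log-ratio with tanh.

    The output stays between exp(-eps) and exp(eps) and the map is
    differentiable for every positive ratio.
    """
    return math.exp(eps * math.tanh(math.log(ratio) / eps))

def numeric_grad(f, x, h=1e-6):
    """Central finite-difference estimate of df/dx."""
    return (f(x + h) - f(x - h)) / (2 * h)

for r in (0.5, 1.0, 1.5, 3.0):
    print(r,
          hard_clip(r), numeric_grad(hard_clip, r),
          round(smooth_bound(r), 4), numeric_grad(smooth_bound, r))
```

Outside the clip range the hard-clipped ratio has an exactly zero gradient, while the smooth map's gradient shrinks gradually and has no kink at the boundary. Whether the paper's modulator behaves like this particular function is not stated in the abstract.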

DeepStage: Learning Autonomous Defense Policies Against Multi-Stage APT Campaigns

Authors: Trung V. Phan, Tri Gia Nguyen, Thomas Bauschert
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.16969
Pdf link: https://arxiv.org/pdf/2603.16969
Abstract This paper presents DeepStage, a deep reinforcement learning (DRL) framework for adaptive, stage-aware defense against Advanced Persistent Threats (APTs). The enterprise environment is modeled as a partially observable Markov decision process (POMDP), where host provenance and network telemetry are fused into unified provenance graphs. Building on our prior work, StageFinder, a graph neural encoder and an LSTM-based stage estimator infer probabilistic attacker stages aligned with the MITRE ATT&CK framework. These stage beliefs, combined with graph embeddings, guide a hierarchical Proximal Policy Optimization (PPO) agent that selects defense actions across monitoring, access control, containment, and remediation. Evaluated in a realistic enterprise testbed using CALDERA-driven APT playbooks, DeepStage achieves a stage-weighted F1-score of 0.89, outperforming a risk-aware DRL baseline by 21.9%. The results demonstrate effective stage-aware and cost-efficient autonomous cyber defense.

Rewarding DINO: Predicting Dense Rewards with Vision Foundation Models

Authors: Pierre Krack, Tobias Jülg, Wolfram Burgard, Florian Walter
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.16978
Pdf link: https://arxiv.org/pdf/2603.16978
Abstract Well-designed dense reward functions in robot manipulation not only indicate whether a task is completed but also encode progress along the way. Generally, designing dense rewards is challenging and usually requires access to privileged state information available only in simulation, not in real-world experiments. This makes reward prediction models that infer task state information from camera images attractive. A common approach is to predict rewards from expert demonstrations based on visual similarity or sequential frame ordering. However, this biases the resulting reward function towards a specific solution and leaves it undefined in states not covered by the demonstrations. In this work, we introduce Rewarding DINO, a method for language-conditioned reward modeling that learns actual reward functions rather than specific trajectories. The model's compact size allows it to serve as a direct replacement for analytical reward functions with comparatively low computational overhead. We train our model on data sampled from 24 Meta-World+ tasks using a rank-based loss and evaluate pairwise accuracy, rank correlation, and calibration. Rewarding DINO achieves competitive performance in tasks from the training set and generalizes to new settings in simulation and the real world, indicating that it learns task semantics. We also test the model with off-the-shelf reinforcement learning algorithms to solve tasks from our Meta-World+ training set.

Efficient and Reliable Teleoperation through Real-to-Sim-to-Real Shared Autonomy

Authors: Shuo Sha, Yixuan Wang, Binghao Huang, Antonio Loquerico, Yunzhu Li
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.17016
Pdf link: https://arxiv.org/pdf/2603.17016
Abstract Fine-grained, contact-rich teleoperation remains slow, error-prone, and unreliable in real-world manipulation tasks, even for experienced operators. Shared autonomy offers a promising way to improve performance by combining human intent with automated assistance, but learning effective assistance in simulation requires a faithful model of human behavior, which is difficult to obtain in practice. We propose a real-to-sim-to-real shared autonomy framework that augments human teleoperation with learned corrective behaviors, using a simple yet effective k-nearest-neighbor (kNN) human surrogate to model operator actions in simulation. The surrogate is fit from less than five minutes of real-world teleoperation data and enables stable training of a residual copilot policy with model-free reinforcement learning. The resulting copilot is deployed to assist human operators in real-world fine-grained manipulation tasks. Through simulation experiments and a user study with sixteen participants on industry-relevant tasks, including nut threading, gear meshing, and peg insertion, we show that our system improves task success for novice operators and execution efficiency for experienced operators compared to direct teleoperation and shared-autonomy baselines that rely on expert priors or behavioral-cloning pilots. In addition, copilot-assisted teleoperation produces higher-quality demonstrations for downstream imitation learning.
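A k-nearest-neighbor surrogate of the kind mentioned in this abstract might look roughly like the sketch below. The state and action representations, the Euclidean distance, and the plain averaging of neighbor actions are all assumptions made for illustration; `knn_surrogate_action` is a hypothetical name.

```python
import math

def knn_surrogate_action(state, demos, k=3):
    """Predict an operator action as the mean action of the k nearest demo states.

    `demos` is a list of (state, action) pairs of float tuples recorded
    from teleoperation. Averaging the neighbors is an assumption, not a
    detail taken from the paper.
    """
    nearest = sorted(demos, key=lambda sa: math.dist(state, sa[0]))[:k]
    dim = len(nearest[0][1])
    return tuple(sum(a[i] for _, a in nearest) / len(nearest) for i in range(dim))

demos = [
    ((0.0, 0.0), (1.0, 0.0)),
    ((0.1, 0.0), (1.0, 0.2)),
    ((0.0, 0.1), (0.8, 0.0)),
    ((5.0, 5.0), (-1.0, -1.0)),  # far away, ignored for the query below
]
action = knn_surrogate_action((0.05, 0.05), demos, k=3)
print(action)
```

Because such a surrogate needs no training beyond storing the demonstrations, it is plausible that a few minutes of data would suffice, as the abstract reports.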

Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models

Authors: Songchun Zhang, Zeyue Xue, Siming Fu, Jie Huang, Xianghao Kong, Y Ma, Haoyang Huang, Nan Duan, Anyi Rao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.17051
Pdf link: https://arxiv.org/pdf/2603.17051
Abstract Distilled autoregressive (AR) video models enable efficient streaming generation but frequently misalign with human visual preferences. Existing reinforcement learning (RL) frameworks are not naturally suited to these architectures, typically requiring either expensive re-distillation or solver-coupled reverse-process optimization that introduces considerable memory and computational overhead. We present Astrolabe, an efficient online RL framework tailored for distilled AR models. To overcome existing bottlenecks, we introduce a forward-process RL formulation based on negative-aware fine-tuning. By contrasting positive and negative samples directly at inference endpoints, this approach establishes an implicit policy improvement direction without requiring reverse-process unrolling. To scale this alignment to long videos, we propose a streaming training scheme that generates sequences progressively via a rolling KV-cache, applying RL updates exclusively to local clip windows while conditioning on prior context to ensure long-range coherence. Finally, to mitigate reward hacking, we integrate a multi-reward objective stabilized by uncertainty-aware selective regularization and dynamic reference updates. Extensive experiments demonstrate that our method consistently enhances generation quality across multiple distilled AR video models, serving as a robust and scalable alignment solution.

PaAgent: Portrait-Aware Image Restoration Agent via Subjective-Objective Reinforcement Learning

Authors: Yijian Wang, Qingsen Yan, Jiantao Zhou, Duwei Dai, Wei Dong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.17055
Pdf link: https://arxiv.org/pdf/2603.17055
Abstract Image Restoration (IR) agents, leveraging multimodal large language models to perceive degradation and invoke restoration tools, have shown promise in automating IR tasks. However, existing IR agents typically lack an insight summarization mechanism for past interactions, which results in an exhaustive search for the optimal IR tool. To address this limitation, we propose a portrait-aware IR agent, dubbed PaAgent, which incorporates a self-evolving portrait bank for IR tools and Retrieval-Augmented Generation (RAG) to select a suitable IR tool for input. Specifically, to construct and evolve the portrait bank, the PaAgent continuously enriches it by summarizing the characteristics of various IR tools with restored images, selected IR tools, and degraded images. In addition, the RAG is employed to select the optimal IR tool for the input image by retrieving relevant insights from the portrait bank. Furthermore, to enhance PaAgent's ability to perceive degradation in complex scenes, we propose a subjective-objective reinforcement learning strategy that considers both image quality scores and semantic insights in reward generation, which accurately provides the degradation information even under partial and non-uniform degradation. Extensive experiments across 8 IR benchmarks, covering six single-degradation and eight mixed-degradation scenarios, validate PaAgent's superiority in addressing complex IR tasks. Our project page is available at this https URL.

SLowRL: Safe Low-Rank Adaptation Reinforcement Learning for Locomotion

Authors: Elham Daneshmand, Shafeef Omar, Glen Berseth, Majid Khadiv, Hsiu-Chin Lin
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.17092
Pdf link: https://arxiv.org/pdf/2603.17092
Abstract Sim-to-real transfer of locomotion policies often leads to performance degradation due to the inevitable sim-to-real gap. Naively fine-tuning these policies directly on hardware is problematic, as it poses risks of mechanical failure and suffers from high sample inefficiency. In this paper, we address the challenge of safely and efficiently fine-tuning reinforcement learning (RL) policies for dynamic locomotion tasks. Specifically, we focus on fine-tuning policies learned in simulation directly on hardware, while explicitly enforcing safety constraints. In doing so, we introduce SLowRL, a framework that combines Low-Rank Adaptation (LoRA) with training-time safety enforcement via a recovery policy. We evaluate our method both in simulation and on a real Unitree Go2 quadruped robot for jump and trot tasks. Experimental results show that our method achieves a 46.5% reduction in fine-tuning time and near-zero safety violations compared to standard proximal policy optimization (PPO) baselines. Notably, we find that a rank-1 adaptation alone is sufficient to recover pre-trained performance in the real world, while maintaining stable and safe real-world fine-tuning. These results demonstrate the practicality of safe, efficient fine-tuning for dynamic real-world robotic applications.
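For readers unfamiliar with what a rank-1 adaptation means, the sketch below applies a rank-1 update to a frozen weight matrix in the general LoRA style. The matrix sizes and values are made up for illustration, the usual LoRA scaling factor is omitted, and nothing here reflects how SLowRL wires the adapter into its policy network.

```python
def rank1_adapted_weight(W, b, a):
    """Return W + b a^T, the frozen weight plus a rank-1 correction.

    W is an (m x n) list of lists, b has m entries and a has n entries.
    Only b and a would be trained; W stays fixed.
    """
    return [[W[i][j] + b[i] * a[j] for j in range(len(a))] for i in range(len(b))]

W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]   # frozen pre-trained weight (2 x 3), illustrative values
b = [0.5, -1.0]         # trainable column factor (hypothetical values)
a = [1.0, 0.0, 2.0]     # trainable row factor (hypothetical values)

W_adapted = rank1_adapted_weight(W, b, a)
print(W_adapted)        # [[1.5, 2.0, 4.0], [3.0, 5.0, 4.0]]

# Trainable parameters: m + n for the adapter versus m * n for the full matrix.
m, n = 256, 256
print(m + n, m * n)     # 512 65536
```

The parameter count is the main point: for a 256 x 256 layer, a rank-1 adapter trains 512 values rather than 65,536, which is one reason low-rank fine-tuning can be cheaper on hardware.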

REAL: Regression-Aware Reinforcement Learning for LLM-as-a-Judge

Authors: Yasi Zhang, Tianyu Chen, Mingyuan Zhou, Oscar Leong, Ying Nian Wu, Michal Lukasik
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.17145
Pdf link: https://arxiv.org/pdf/2603.17145
Abstract Large language models (LLMs) are increasingly deployed as automated evaluators that assign numeric scores to model outputs, a paradigm known as LLM-as-a-Judge. However, standard Reinforcement Learning (RL) methods typically rely on binary rewards (e.g., 0-1 accuracy), thereby ignoring the ordinal structure inherent in regression tasks; for instance, they fail to recognize that predicting 4 is significantly better than predicting 1 when the ground truth is 5. Conversely, existing regression-aware approaches are often confined to Supervised Fine-Tuning (SFT), limiting their ability to explore optimal reasoning paths. To bridge this gap, we propose REAL (REgression-Aware Reinforcement Learning), a principled RL framework designed to optimize regression rewards, and also proven to be optimal for correlation metrics. A key technical challenge is that the regression objective is explicitly policy-dependent, thus invalidating standard policy gradient methods. To address this, we employ the generalized policy gradient estimator, which naturally decomposes optimization into two complementary components: (1) exploration over Chain-of-Thought (CoT) trajectories, and (2) regression-aware prediction refinement of the final score. Extensive experiments across model scales (8B to 32B) demonstrate that REAL consistently outperforms both regression-aware SFT baselines and standard RL methods, exhibiting significantly better generalization on out-of-domain benchmarks. On Qwen3-32B specifically, we achieve gains of +8.40 Pearson and +7.20 Spearman correlation over the SFT baseline, and +18.30/+11.20 over the base model. These findings highlight the critical value of integrating regression objectives into RL exploration for accurate LLM evaluation.
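The abstract's example (predicting 4 versus 1 when the ground truth is 5) can be worked through in a few lines. The negative squared error used below is an assumed, illustrative regression reward; the abstract does not state which regression reward REAL optimizes.

```python
def binary_reward(pred, truth):
    """1.0 for an exact match, otherwise 0.0."""
    return 1.0 if pred == truth else 0.0

def regression_reward(pred, truth):
    """Negative squared error: an illustrative choice, not necessarily the paper's."""
    return float(-(pred - truth) ** 2)

truth = 5
for pred in (5, 4, 1):
    print(pred, binary_reward(pred, truth), regression_reward(pred, truth))
# 5 1.0 0.0
# 4 0.0 -1.0
# 1 0.0 -16.0
```

Under the binary reward, predictions of 4 and 1 are indistinguishable (both score 0.0), whereas the regression-style reward ranks 4 well above 1, which is the ordinal structure the abstract refers to.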

Shielded Reinforcement Learning Under Dynamic Temporal Logic Constraints

Authors: Sadık Bera Yüksel, Ali Tevfik Buyukkocak, Derya Aksaray
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.17152
Pdf link: https://arxiv.org/pdf/2603.17152
Abstract Reinforcement Learning (RL) has shown promise in various robotics applications, yet its deployment on real systems is still limited due to safety and operational constraints. The safe RL field has gained considerable attention in recent years, which focuses on imposing safety constraints throughout the learning process. However, real systems often require more complex constraints than just safety, such as periodic recharging or time-bounded visits to specific regions. Imposing such spatio-temporal tasks during learning still remains a challenge. Signal Temporal Logic (STL) is a formal language for specifying temporal properties of real-valued signals and provides a way to express such complex tasks. In this paper, we propose a framework that leverages sequential control barrier functions and model-free RL to ensure that the given STL tasks are satisfied throughout the learning process. Our method extends beyond traditional safety constraints by enforcing rich STL specifications, which can involve visits to dynamic targets with unknown trajectories. We also demonstrate the effectiveness of our framework through various simulations.

MetaClaw: Just Talk -- An Agent That Meta-Learns and Evolves in the Wild

Authors: Peng Xia, Jianwen Chen, Xinyu Yang, Haoqin Tu, Jiaqi Liu, Kaiwen Xiong, Siwei Han, Shi Qiu, Haonian Ji, Yuyin Zhou, Zeyu Zheng, Cihang Xie, Huaxiu Yao
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.17187
Pdf link: https://arxiv.org/pdf/2603.17187
Abstract Large language model (LLM) agents are increasingly used for complex tasks, yet deployed agents often remain static, failing to adapt as user needs evolve. This creates a tension between the need for continuous service and the necessity of updating capabilities to match shifting task distributions. On platforms like OpenClaw, which handle diverse workloads across 20+ channels, existing methods either store raw trajectories without distilling knowledge, maintain static skill libraries, or require disruptive downtime for retraining. We present MetaClaw, a continual meta-learning framework that jointly evolves a base LLM policy and a library of reusable behavioral skills. MetaClaw employs two complementary mechanisms. Skill-driven fast adaptation analyzes failure trajectories via an LLM evolver to synthesize new skills, enabling immediate improvement with zero downtime. Opportunistic policy optimization performs gradient-based updates via cloud LoRA fine-tuning and Reinforcement Learning with a Process Reward Model (RL-PRM). This is triggered during user-inactive windows by the Opportunistic Meta-Learning Scheduler (OMLS), which monitors system inactivity and calendar data. These mechanisms are mutually reinforcing: a refined policy generates better trajectories for skill synthesis, while richer skills provide higher-quality data for policy optimization. To prevent data contamination, a versioning mechanism separates support and query data. Built on a proxy-based architecture, MetaClaw scales to production-size LLMs without local GPUs. Experiments on MetaClaw-Bench and AutoResearchClaw show that skill-driven adaptation improves accuracy by up to 32% relative. The full pipeline advances Kimi-K2.5 accuracy from 21.4% to 40.6% and increases composite robustness by 18.3%. Code is available at this https URL.

Adaptive Anchor Policies for Efficient 4D Gaussian Streaming

Authors: Ashim Dahal, Rabab Abdelfattah, Nick Rahimi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.17227
Pdf link: https://arxiv.org/pdf/2603.17227
Abstract Dynamic scene reconstruction with Gaussian Splatting has enabled efficient streaming for real-time rendering and free-viewpoint video. However, most pipelines rely on fixed anchor selection such as Farthest Point Sampling (FPS), typically using 8,192 anchors regardless of scene complexity, which over-allocates computation under strict budgets. We propose Efficient Gaussian Streaming (EGS), a plug-in, budget-aware anchor sampler that replaces FPS with a reinforcement-learned policy while keeping the Gaussian streaming reconstruction backbone unchanged. The policy jointly selects an anchor budget and a subset of informative anchors under discrete constraints, balancing reconstruction quality and runtime using spatial features of the Gaussian representation. We evaluate EGS in two settings: fast rendering, which prioritizes runtime efficiency, and high-quality refinement, which enables additional optimization. Experiments on dynamic multi-view datasets show consistent improvements in the quality-efficiency trade-off over FPS sampling. On unseen data, in fast rendering at 256 anchors (32× fewer than 8,192), EGS improves PSNR by +0.52 to 0.61 dB while running 1.29 to 1.35× faster than IGS@8192 (N3DV and MeetingRoom). In high-quality refinement, EGS remains competitive with the full-anchor baseline at substantially lower anchor budgets. Code and pretrained checkpoints will be released upon acceptance. Keywords: 4D Gaussian Splatting, 4D Gaussian Streaming, Reinforcement Learning
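Farthest Point Sampling, the fixed baseline this abstract replaces, is a standard greedy procedure and can be sketched as follows. The five toy points and the choice of starting index are assumptions for illustration; this is the generic algorithm, not the EGS policy or the authors' code.

```python
import math

def farthest_point_sampling(points, n_samples, start=0):
    """Greedy FPS: repeatedly pick the point farthest from all chosen points.

    Assumes n_samples <= len(points). Returns the chosen indices in order.
    """
    chosen = [start]
    # min_dist[i] is the distance from point i to its nearest chosen point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < n_samples:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        min_dist = [min(d, math.dist(p, points[nxt]))
                    for d, p in zip(min_dist, points)]
    return chosen

points = [(0, 0, 0), (1, 0, 0), (0.1, 0, 0), (10, 0, 0), (5, 0, 0)]
print(farthest_point_sampling(points, 3))  # [0, 3, 4]

# The anchor budgets quoted in the abstract differ by a factor of 32.
print(8192 // 256)  # 32
```

FPS spreads anchors evenly in space but always returns exactly the number requested, which is why a fixed budget such as 8,192 is spent regardless of scene complexity.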
中文摘要 采用高斯喷溅的动态场景重建技术，实现了实时渲染和自由视点视频的高效流媒体。然而，大多数流水线依赖固定锚点选择，如最远点采样（FPS），通常无论场景复杂度如何都使用8,192个锚点，这在严格的预算下导致计算过度分配。我们提出了高效高斯流（EGS），一种插件式、预算感知型锚点采样器，用强化学习策略取代FPS，同时保持高斯流重建骨干不变。该策略在离散约束下联合选择锚点预算和信息量大的锚点子集，利用高斯表示的空间特征平衡重建质量和运行时间。我们在两种设置下评估EGS：优先考虑运行效率的快速渲染，以及允许额外优化的高质量精细化。在动态多视图数据集上的实验显示，质量与效率的权衡相较于FPS采样持续提升。在未见过的数据上，在256个锚点的快速渲染中（为8,192个的1/32），EGS将PSNR提升0.52至0.61 dB，同时运行速度比IGS@8192快1.29至1.35倍（N3DV和MeetingRoom）。在高质量精细化方面，EGS在显著较低的锚点预算下仍能与全锚点基线竞争。代码和预训练检查点将在论文被接收后发布。关键词：4D高斯喷溅、4D高斯流、强化学习
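
For readers unfamiliar with the fixed baseline this abstract contrasts against, here is a minimal sketch of greedy Farthest Point Sampling in plain Python. It is not the EGS policy or the authors' code; the function name `farthest_point_sampling`, the `start` parameter, and the toy points are illustrative assumptions.

```python
import math

def farthest_point_sampling(points, k, start=0):
    """Greedy FPS: repeatedly pick the point farthest from the chosen set."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    chosen = [start]
    # min_dist[i] is the distance from point i to its nearest chosen anchor.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return chosen

# Toy 2D points (hypothetical); real pipelines sample 3D Gaussian centers.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (10.0, 10.0)]
anchors = farthest_point_sampling(points, k=3)
print(anchors)  # [0, 3, 2]
```

FPS always returns exactly `k` anchors chosen by geometry alone, which is the fixed-budget behavior the abstract says EGS replaces with a learned, budget-aware choice. The budget ratio quoted in the abstract is 8,192 / 256 = 32.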

Network- and Device-Level Cyber Deception for Contested Environments Using RL and LLMs

利用强化学习和大型语言模型（LLM）进行网络和设备级网络欺骗，针对有争议环境

Authors: Abhijeet Sahu, Shuva Paul, Rochard Macwan
Subjects: Cryptography and Security (cs.CR); Emerging Technologies (cs.ET)
Arxiv link: https://arxiv.org/abs/2603.17272
Pdf link: https://arxiv.org/pdf/2603.17272
Abstract Cyber deception assists in increasing the attacker's budget in reconnaissance or any early phases of threat intrusions. In the past, numerous methods of cyber deception have been adopted, such as IP address randomization, the creation of honeypots and honeynets mimicking an actual set of services, and networks deployed within an enterprise or operational technology (OT) network. These types of strategies follow naive approaches of recreating services that are expensive and that need a lot of human intervention. The advent of cloud services and other automations of containerized applications, such as Kubernetes, makes cyber defense easier. Yet, there remains a lot of potential to improve the accuracy of these deception strategies and to make them cost-effective using artificial intelligence (AI)-based solutions by making the deception more dynamic. Hence, in this work, we review various AI-based solutions in building network- and device-level cyber deception methods in contested environments. Specifically, we focus on leveraging the fusion of large language models (LLMs) and reinforcement learning (RL) in optimally learning these cyber deception strategies and validating the efficacy of such strategies in some stealthy attacks against OT systems in the literature.
中文摘要 网络欺骗有助于增加攻击者在侦察或威胁入侵初期阶段的预算。过去，网络欺骗手段被广泛采用，如IP地址随机化、创建模仿实际服务的蜜罐和蜜网，以及部署在企业或运营技术（OT）网络中的网络。这类策略遵循了天真地重现昂贵且需要大量人工干预的服务方法。云服务及其他容器化应用自动化（如 Kubernetes）的出现，使网络防御变得更加简单。然而，利用基于人工智能（AI）的解决方案，通过使欺诈更具动态性，提升这些欺骗策略的准确性并使其更具成本效益，仍有很大潜力。因此，在本研究中，我们将回顾在有争议环境中构建网络和设备级网络欺骗方法的各种基于人工智能的解决方案。具体来说，我们重点利用大型语言模型（LLM）与强化学习（RL）的融合，优化学习这些网络欺骗策略，并在文献中验证这些策略在针对OT系统的某些隐秘攻击中的有效性。

WINFlowNets: Warm-up Integrated Networks Training of Generative Flow Networks for Robotics and Machine Fault Adaptation

WINFlowNets：机器人和机器故障适配生成流网络的预热集成网络训练

Authors: Zahin Sufiyan, Shadan Golestan, Yoshihiro Mitsuka, Shotaro Miwa, Osmar Zaiane
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.17301
Pdf link: https://arxiv.org/pdf/2603.17301
Abstract Generative Flow Networks for continuous scenarios (CFlowNets) have shown promise in solving sequential decision-making tasks by learning stochastic policies using a flow and a retrieval network. Despite their demonstrated efficiency compared to state-of-the-art Reinforcement Learning (RL) algorithms, their practical application in robotic control tasks is constrained by the reliance on pre-training the retrieval network. This dependency poses challenges in dynamic robotic environments, where pre-training data may not be readily available or representative of the current environment. This paper introduces WINFlowNets, a novel CFlowNets framework that enables the co-training of flow and retrieval networks. WINFlowNets begins with a warm-up phase for the retrieval network to bootstrap its policy, followed by a shared training architecture and a shared replay buffer for co-training both networks. Experiments in simulated robotic environments demonstrate that WINFlowNets surpasses CFlowNets and state-of-the-art RL algorithms in terms of average reward and training stability. Furthermore, WINFlowNets exhibits strong adaptive capability in fault environments, making it suitable for tasks that demand quick adaptation with limited sample data. These findings highlight WINFlowNets' potential for deployment in dynamic and malfunction-prone robotic systems, where traditional pre-training or sample inefficient data collection may be impractical.
中文摘要 用于连续场景的生成流网络（CFlowNets）通过利用流网络和检索网络学习随机策略，在解决顺序决策任务方面展现出潜力。尽管与最先进的强化学习（RL）算法相比已展现出效率优势，但它们在机器人控制任务中的实际应用仍受限于对检索网络预训练的依赖。这种依赖在动态机器人环境中带来了挑战，因为预训练数据可能不易获得或无法代表当前环境。本文介绍了WINFlowNets，一种新颖的CFlowNets框架，能够实现流网络和检索网络的共训练。WINFlowNets 从检索网络的预热阶段开始，以启动其策略，随后采用共享训练架构和共享重放缓冲区来共同训练两个网络。模拟机器人环境的实验表明，WINFlowNets 在平均奖励和训练稳定性方面超越了 CFlowNets 和最先进的强化学习算法。此外，WINFlowNets 在故障环境中展现出强大的自适应能力，适合在样本数据有限的情况下需要快速适应的任务。这些发现凸显了WINFlowNets在动态且易故障的机器人系统中的部署潜力，在这类系统中，传统预训练或样本低效的数据收集可能不切实际。

Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations

对比推理对齐：基于隐藏表征的强化学习

Authors: Haozheng Luo, Yimin Wang, Jiahao Yu, Binghui Wang, Yan Chen
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.17305
Pdf link: https://arxiv.org/pdf/2603.17305
Abstract We propose CRAFT, a red-teaming alignment framework that leverages model reasoning capabilities and hidden representations to improve robustness against jailbreak attacks. Unlike prior defenses that operate primarily at the output level, CRAFT aligns large reasoning models to generate safety-aware reasoning traces by explicitly optimizing objectives defined over the hidden state space. Methodologically, CRAFT integrates contrastive representation learning with reinforcement learning to separate safe and unsafe reasoning trajectories, yielding a latent-space geometry that supports robust, reasoning-level safety alignment. Theoretically, we show that incorporating latent-textual consistency into GRPO eliminates superficially aligned policies by ruling them out as local optima. Empirically, we evaluate CRAFT on multiple safety benchmarks using two strong reasoning models, Qwen3-4B-Thinking and R1-Distill-Llama-8B, where it consistently outperforms state-of-the-art defenses such as IPO and SafeKey. Notably, CRAFT delivers an average 79.0% improvement in reasoning safety and 87.7% improvement in final-response safety over the base models, demonstrating the effectiveness of hidden-space reasoning alignment.
中文摘要 我们提出了CRAFT，一种红队对齐框架，利用模型推理能力和隐藏表征，提升对越狱攻击的鲁棒性。与以往主要在输出层面工作的防御不同，CRAFT通过显式优化在隐藏状态空间上定义的目标，对大推理模型进行对齐，从而生成安全意识推理痕迹。在方法论上，CRAFT将对比表示学习与强化学习相结合，以区分安全与不安全的推理轨迹，从而形成支持稳健推理层安全对齐的潜空间几何。理论上，我们证明将潜在文本一致性纳入GRPO可排除表面上对齐的策略，将其排除为局部最优。通过实证，我们利用两个强推理模型Qwen3-4B-Thinking和R1-Distill-Llama-8B，在多个安全基准测试中，CRAFT的表现持续优于IPO和SafeKey等最先进的防御。值得注意的是，CRAFT在推理安全性方面平均提升了79.0%，最终响应安全性提升了87.7%，展示了隐藏空间推理对齐的有效性。
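
As background for the GRPO reference in this abstract, the sketch below shows the group-relative advantage that GRPO-style methods commonly use: each sampled response's reward is normalized by the mean and standard deviation of its sampling group. It does not include CRAFT's latent-textual consistency term, which the abstract does not specify. The function name, the `eps` constant, and the use of the population standard deviation are assumptions.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its group's mean and standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations vary
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four responses sampled from the same prompt.
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])  # [1.0, -1.0, 1.0, -1.0]
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no gradient signal, which is one reason reward design matters for methods built on this estimator.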

ReLMXEL: Adaptive RL-Based Memory Controller with Explainable Energy and Latency Optimization

ReLMXEL：基于强化学习的自适应内存控制器，具备可解释的能量和延迟优化功能

Authors: Panuganti Chirag Sai, Gandholi Sarat, R. Raghunatha Sarma, Venkata Kalyan Tavva, Naveen M
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2603.17309
Pdf link: https://arxiv.org/pdf/2603.17309
Abstract Reducing latency and energy consumption is critical to improving the efficiency of memory systems in modern computing. This work introduces ReLMXEL (Reinforcement Learning for Memory Controller with Explainable Energy and Latency Optimization), an explainable multi-agent online reinforcement learning framework that dynamically optimizes memory controller parameters using reward decomposition. ReLMXEL operates within the memory controller, leveraging detailed memory behavior metrics to guide decision-making. Experimental evaluations across diverse workloads demonstrate consistent performance gains over baseline configurations, with refinements driven by workload-specific memory access behavior. By incorporating explainability into the learning process, ReLMXEL not only enhances performance but also increases the transparency of control decisions, paving the way for more accountable and adaptive memory system designs.
中文摘要 降低延迟和能耗对于提升现代计算中内存系统效率至关重要。本研究介绍了ReLMXEL（Reinforcement Learning for Memory Controller with Explainable Energy and Latency Optimization），这是一个可解释的多智能体在线强化学习框架，通过奖励分解动态优化内存控制器参数。ReLMXEL 在内存控制器内运行，利用详细的内存行为指标来指导决策。在不同工作负载下的实验评估显示，其性能持续优于基线配置，且改进由工作负载特定的内存访问行为驱动。通过将可解释性融入学习过程，ReLMXEL 不仅提升了性能，还提高了控制决策的透明度，为更可问责、更具适应性的内存系统设计铺平了道路。

InfoDensity: Rewarding Information-Dense Traces for Efficient Reasoning

信息密度：奖励信息密集的追踪以促进高效推理

Authors: Chengwei Wei, Jung-jae Kim, Longyin Zhang, Shengkai Chen, Nancy F. Chen
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.17310
Pdf link: https://arxiv.org/pdf/2603.17310
Abstract Large Language Models (LLMs) with extended reasoning capabilities often generate verbose and redundant reasoning traces, incurring unnecessary computational cost. While existing reinforcement learning approaches address this by optimizing final response length, they neglect the quality of intermediate reasoning steps, leaving models vulnerable to reward hacking. We argue that verbosity is not merely a length problem, but a symptom of poor intermediate reasoning quality. To investigate this, we conduct an empirical study tracking the conditional entropy of the answer distribution across reasoning steps. We find that high-quality reasoning traces exhibit two consistent properties: low uncertainty convergence and monotonic progress. These findings suggest that high-quality reasoning traces are informationally dense, that is, each step contributes meaningful entropy reduction relative to the total reasoning length. Motivated by this, we propose InfoDensity, a reward framework for RL training that combines an AUC-based reward and a monotonicity reward as a unified measure of reasoning quality, weighted by a length scaling term that favors achieving equivalent quality more concisely. Experiments on mathematical reasoning benchmarks demonstrate that InfoDensity matches or surpasses state-of-the-art baselines in accuracy while significantly reducing token usage, achieving a strong accuracy-efficiency trade-off.
中文摘要 具备扩展推理能力的大型语言模型（LLM）常常生成冗长冗余的推理轨迹，导致不必要的计算成本。现有的强化学习方法通过优化最终响应长度来解决这个问题，但它们忽视了中间推理步骤的质量，使模型容易受到奖励欺骗（reward hacking）的影响。我们认为冗长不仅仅是长度问题，而是中间推理质量较差的表现。为此，我们进行了一项实证研究，追踪答案分布在各推理步骤中的条件熵。我们发现高质量的推理轨迹表现出两个一致的特性：收敛到低不确定性和单调进展。这些发现表明，高质量的推理轨迹信息密度很高，即每一步相对于总推理长度都贡献了有意义的熵减少。基于此，我们提出了InfoDensity，这是一种用于强化学习训练的奖励框架，结合了基于AUC的奖励和单调性奖励，作为衡量推理质量的统一指标，并以一个长度缩放项加权，以鼓励更简洁地达到同等质量。在数学推理基准上的实验显示，InfoDensity在准确率上与最先进的基线相当甚至超过，同时显著减少了token使用量，实现了出色的准确率与效率权衡。
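
The two properties the abstract names, convergence to low uncertainty and monotonic progress, can be measured on a per-step entropy trace. The sketch below is a toy illustration under stated assumptions, not the InfoDensity reward itself: the abstract does not give the AUC reward, monotonicity reward, or length scaling formulas, so the helper names and the trapezoidal AUC here are hypothetical stand-ins.

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a discrete answer distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_auc(entropies):
    """Trapezoidal area under the per-step entropy curve (lower = faster convergence)."""
    return sum((a + b) / 2 for a, b in zip(entropies, entropies[1:]))

def monotonic_fraction(entropies):
    """Fraction of consecutive steps whose entropy does not increase."""
    pairs = list(zip(entropies, entropies[1:]))
    if not pairs:
        return 1.0
    return sum(b <= a for a, b in pairs) / len(pairs)

# Hypothetical answer distributions after each of three reasoning steps.
steps = [[0.5, 0.5], [0.9, 0.1], [1.0, 0.0]]
trace = [entropy(p) for p in steps]
print(monotonic_fraction(trace))  # 1.0
```

A trace whose entropy drops early and never rises has a small area and a monotonic fraction of 1.0, matching the abstract's description of an information-dense trace.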

Ruyi2.5 Technical Report

如意2.5技术报告

Authors: Huan Song, Shuyu Tian, Qingfei Zhao, Wenhao Hong, Jiang Liu, Ting Long, Jiawei Shao, Xuelong Li
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.17311
Pdf link: https://arxiv.org/pdf/2603.17311
Abstract We present Ruyi2.5, a multimodal familial model built on the AI Flow framework. Extending Ruyi2's "Train Once, Deploy Many" paradigm to the multimodal domain, Ruyi2.5 constructs a shared-backbone architecture that co-trains models of varying scales within a single unified pipeline, ensuring semantic consistency across all deployment tiers. Built upon Ruyi2.5, Ruyi2.5-Camera model is developed as a privacy-preserving camera service system, which instantiates Ruyi2.5-Camera into a two-stage recognition pipeline: an edge model applies information-bottleneck-guided irreversible feature mapping to de-identify raw frames at the source, while a cloud model performs deep behavior reasoning. To accelerate reinforcement learning fine-tuning, we further propose Binary Prefix Policy Optimization (BPPO), which reduces sample redundancy via binary response selection and focuses gradient updates on response prefixes, achieving a 2 to 3 times training speedup over GRPO. Experiments show Ruyi2.5 matches Qwen3-VL on the general multimodal benchmarks, while Ruyi2.5-Camera substantially outperforms Qwen3-VL on privacy-constrained surveillance tasks.
中文摘要 我们呈现Ruyi2.5，一个基于AI流程框架构建的多模态家庭模型。Ruyi2.5将Ruyi2的“一次训练，部署多”的范式扩展到多模态领域，构建了一个共享骨干架构，在统一的流水线中协同训练不同规模的模型，确保所有部署层级的语义一致性。基于如意2.5，如意2.5相机模型作为隐私保护的相机服务系统开发，将如意2.5相机实例化为两阶段识别流水线：边缘模型通过信息瓶颈引导的不可逆特征映射，在源端去识别原始帧，而云模型则进行深度行为推理。为加速强化学习的微调，我们进一步提出了二元前缀策略优化（BPPO），通过二元响应选择减少样本冗余，并将梯度更新聚焦于响应前缀，训练速度比GRPO快2到3倍。实验显示，如意2.5在通用多模态基准测试上与Qwen3-VL相当，而Ruyi2.5摄像头在隐私受限的监控任务中显著优于Qwen3-VL。

Recurrent Reasoning with Vision-Language Models for Estimating Long-Horizon Embodied Task Progress

使用视觉-语言模型进行反复推理估算长期具身任务进展

Authors: Yuelin Zhang, Sijie Cheng, Chen Li, Zongzhao Li, Yuxin Huang, Yang Liu, Wenbing Huang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.17312
Pdf link: https://arxiv.org/pdf/2603.17312
Abstract Accurately estimating task progress is critical for embodied agents to plan and execute long-horizon, multi-step tasks. Despite promising advances, existing Vision-Language Model (VLM)-based methods primarily leverage their video understanding capabilities, while neglecting their complex reasoning potential. Furthermore, processing long video trajectories with VLMs is computationally prohibitive for real-world deployment. To address these challenges, we propose the Recurrent Reasoning Vision-Language Model (R²VLM). Our model features a recurrent reasoning framework that processes local video snippets iteratively, maintaining a global context through an evolving Chain of Thought (CoT). This CoT explicitly records task decomposition, key steps, and their completion status, enabling the model to reason about complex temporal dependencies. This design avoids the high cost of processing long videos while preserving essential reasoning capabilities. We train R²VLM on large-scale, automatically generated datasets from ALFRED and Ego4D. Extensive experiments on progress estimation and downstream applications, including progress-enhanced policy learning, reward modeling for reinforcement learning, and proactive assistance, demonstrate that R²VLM achieves strong performance and generalization, setting a new state of the art in long-horizon task progress estimation. The models and benchmarks are publicly available on Hugging Face at this https URL.
中文摘要 准确估算任务进度对于具身智能体规划和执行长时程多步骤任务至关重要。尽管取得了一些有希望的进展，现有基于视觉语言模型（VLM）的方法主要利用其视频理解能力，而忽视了其复杂的推理潜力。此外，使用VLM处理长视频轨迹的计算开销过高，难以用于实际部署。为应对这些挑战，我们提出了循环推理视觉语言模型（R²VLM）。我们的模型采用循环推理框架，迭代处理局部视频片段，通过不断演变的思维链（CoT）维持全局上下文。该CoT明确记录任务分解、关键步骤及其完成状态，使模型能够推理复杂的时间依赖关系。这种设计避免了处理长视频的高昂成本，同时保留了必要的推理能力。我们在来自ALFRED和Ego4D的大规模自动生成数据集上训练R²VLM。关于进度估计及下游应用的大量实验，包括进度增强的策略学习、强化学习的奖励建模和主动辅助，表明R²VLM实现了强劲的性能和泛化性，在长时程任务进度估计上达到了新的最先进水平。模型和基准已在Hugging Face公开，地址为此 https URL。

Physics-informed offline reinforcement learning eliminates catastrophic fuel waste in maritime routing

基于物理的离线强化学习消除了海上航线中的灾难性燃料浪费

Authors: Aniruddha Bora, Julie Chalfant, Chryssostomos Chryssostomidis
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.17319
Pdf link: https://arxiv.org/pdf/2603.17319
Abstract International shipping produces approximately 3% of global greenhouse gas emissions, yet voyage routing remains dominated by heuristic methods. We present PIER (Physics-Informed, Energy-efficient, Risk-aware routing), an offline reinforcement learning framework that learns fuel-efficient, safety-aware routing policies from physics-calibrated environments grounded in historical vessel tracking data and ocean reanalysis products, requiring no online simulator. Validated on one full year (2023) of AIS data across seven Gulf of Mexico routes (840 episodes per method), PIER reduces mean CO2 emissions by 10% relative to great-circle routing. However, PIER's primary contribution is eliminating catastrophic fuel waste: great-circle routing incurs extreme fuel consumption (>1.5x median) in 4.8% of voyages; PIER reduces this to 0.5%, a 9-fold reduction. Per-voyage fuel variance is 3.5x lower (p<0.001), with bootstrap 95% CI for mean savings [2.9%, 15.7%]. Partial validation against observed AIS vessel behavior confirms consistency with the fastest real transits while exhibiting 23.1x lower variance. Crucially, PIER is forecast-independent: unlike A* path optimization whose wave protection degrades 4.5x under realistic forecast uncertainty, PIER maintains constant performance using only local observations. The framework combines physics-informed state construction, demonstration-augmented offline data, and a decoupled post-hoc safety shield, an architecture that transfers to wildfire evacuation, aircraft trajectory optimization, and autonomous navigation in unmapped terrain.
中文摘要 国际航运约占全球温室气体排放的3%，但航线规划仍以启发式方法为主导。我们介绍PIER（物理知情、节能、风险感知航线规划），这是一个离线强化学习框架，能够从基于历史船舶跟踪数据和海洋再分析产品的物理校准环境中学习节能、具备安全意识的航线策略，无需在线模拟器。基于2023年全年涵盖七条墨西哥湾航线的AIS数据（每种方法840个回合）验证，PIER相较大圆航线将平均CO2排放减少了10%。然而，PIER的主要贡献是消除灾难性燃料浪费：大圆航线在4.8%的航次中导致极端燃料消耗（>1.5倍中位数）；PIER将这一比例降至0.5%，实现了9倍的减少。每航次燃油方差低3.5倍（p<0.001），平均节省的自助法95%置信区间为[2.9%，15.7%]。对观察到的AIS船舶行为进行的部分验证确认其与最快的真实航行一致，且方差降低了23.1倍。关键是，PIER 不依赖预报：与 A* 路径优化不同，后者的波浪防护能力在现实预报不确定性下会下降 4.5 倍，PIER 仅通过局部观测即可保持稳定性能。该框架结合了基于物理的状态构建、演示增强的离线数据和解耦的事后安全防护层，这种架构可迁移至野火疏散、飞机轨迹优化和未测绘地形中的自主导航。
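
The abstract's tail metric, the share of voyages whose fuel use exceeds 1.5 times the median, is simple to compute. A minimal sketch follows, using made-up fuel figures rather than the paper's data; the function name `tail_rate` and the toy values are assumptions. The last lines check the arithmetic behind the reported reduction.

```python
import statistics

def tail_rate(fuel, factor=1.5):
    """Share of voyages whose fuel use exceeds `factor` times the median."""
    threshold = factor * statistics.median(fuel)
    return sum(f > threshold for f in fuel) / len(fuel)

# Toy fuel figures (arbitrary units), not data from the paper.
toy_fuel = [10.0, 10.0, 10.0, 10.0, 20.0]
print(tail_rate(toy_fuel))  # 0.2

# Tail rates reported in the abstract: 4.8% (great-circle) vs 0.5% (PIER).
reduction = 0.048 / 0.005
print(round(reduction, 1))  # 9.6
```

The ratio of the two reported rates is 9.6, which the abstract states as a 9-fold reduction.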

ShuttleEnv: An Interactive Data-Driven RL Environment for Badminton Strategy Modeling

ShuttleEnv：一个用于羽毛球策略建模的交互式数据驱动强化学习环境

Authors: Ang Li, Xinyang Gong, Bozhou Chen, Yunlong Lu, Jiaming Ji, Yongyi Wang, Yaodong Yang, Wenxin Li
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.17324
Pdf link: https://arxiv.org/pdf/2603.17324
Abstract We present ShuttleEnv, an interactive and data-driven simulation environment for badminton, designed to support reinforcement learning and strategic behavior analysis in fast-paced adversarial sports. The environment is grounded in elite-player match data and employs explicit probabilistic models to simulate rally-level dynamics, enabling realistic and interpretable agent-opponent interactions without relying on physics-based simulation. In this demonstration, we showcase multiple trained agents within ShuttleEnv and provide live, step-by-step visualization of badminton rallies, allowing attendees to explore different play styles, observe emergent strategies, and interactively analyze decision-making behaviors. ShuttleEnv serves as a reusable platform for research, visualization, and demonstration of intelligent agents in sports AI. Our ShuttleEnv demo video URL: this https URL
中文摘要 我们介绍ShuttleEnv，一个交互式且数据驱动的羽毛球模拟环境，旨在支持快节奏对抗运动中的强化学习和策略行为分析。该环境基于精英选手比赛数据，采用显式概率模型模拟回合级别的动态，实现真实且可解释的智能体与对手交互，无需依赖基于物理的模拟。在本次演示中，我们展示了ShuttleEnv中多个训练好的智能体，并实时、逐步地可视化羽毛球回合，使参与者能够探索不同的打法风格，观察涌现策略，并交互式地分析决策行为。ShuttleEnv可作为一个可复用的平台，用于体育人工智能中智能体的研究、可视化和演示。我们的ShuttleEnv演示视频网址：此https URL

A Progressive Visual-Logic-Aligned Framework for Ride-Hailing Adjudication

一个基于视觉逻辑的渐进式叫车裁定框架

Authors: Weiming Wu, Zi-Jian Cheng, Jie Meng, Peng Zhen, Shan Huang, Qun Li, Guobin Wu, Lan-Zhe Guo
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.17328
Pdf link: https://arxiv.org/pdf/2603.17328
Abstract The efficient adjudication of responsibility disputes is pivotal for maintaining marketplace fairness. However, the exponential surge in ride-hailing volume renders manual review intractable, while conventional automated methods lack the reasoning transparency required for quasi-judicial decisions. Although Multimodal LLMs offer a promising paradigm, they fundamentally struggle to bridge the gap between general visual semantics and rigorous evidentiary protocols, often leading to perceptual hallucinations and logical looseness. To address these systemic misalignments, we introduce RideJudge, a Progressive Visual-Logic-Aligned Framework. Instead of relying on generic pre-training, we bridge the semantic gap via SynTraj, a synthesis engine that grounds abstract liability concepts into concrete trajectory patterns. To resolve the conflict between massive regulation volume and limited context windows, we propose an Adaptive Context Optimization strategy that distills expert knowledge, coupled with a Chain-of-Adjudication mechanism to enforce active evidentiary inquiry. Furthermore, addressing the inadequacy of sparse binary feedback for complex liability assessment, we implement a novel Ordinal-Sensitive Reinforcement Learning mechanism that calibrates decision boundaries against hierarchical severity. Extensive experiments show that our RideJudge-8B achieves 88.41% accuracy, surpassing 32B-scale baselines and establishing a new standard for interpretable adjudication.
中文摘要 责任争议的高效裁决对于维护市场公平至关重要。然而，网约车数量的指数级激增使得人工审查变得难以应对，而传统的自动化方法缺乏准司法决策所需的推理透明度。尽管多模态大型语言模型提供了一个有前景的范式，但它们在弥合一般视觉语义与严谨证据协议之间的鸿沟方面存在根本困难，常常导致感知幻觉和逻辑松散。为了解决这些系统性的错位，我们推出了RideJudge，一个渐进式视觉逻辑对齐框架。我们不再依赖通用的预培训，而是通过SynTraj这一综合引擎弥合语义鸿沟，该引擎将抽象的责任概念扎根于具体的轨迹模式中。为解决庞大监管量与有限情境窗口之间的冲突，我们提出了一种自适应情境优化策略，提炼专家知识，结合审理链机制以强制主动证据调查。此外，针对稀疏二元反馈在复杂责任评估中的不足，我们实现了一种新的序数敏感强化学习机制，能够根据层级严重性校准决策边界。大量实验表明，我们的RideJudge-8B实现了88.41%的准确率，超越了32B尺度的基线，确立了可解释判决的新标准。

EvoGuard: An Extensible Agentic RL-based Framework for Practical and Evolving AI-Generated Image Detection

EvoGuard：一个可扩展的基于代理强化学习的框架，用于实用且不断发展的AI生成图像检测

Authors: Chenyang Zhu, Maorong Wang, Jun Liu, Ching-Chun Chang, Isao Echizen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.17343
Pdf link: https://arxiv.org/pdf/2603.17343
Abstract The rapid proliferation of AI-Generated Images (AIGIs) has introduced severe risks of misinformation, making AIGI detection a critical yet challenging task. While traditional detection paradigms mainly rely on low-level features, recent research increasingly focuses on leveraging the general understanding ability of Multimodal Large Language Models (MLLMs) to achieve better generalization, but such methods still suffer from limited extensibility and expensive training data annotations. To better address complex and dynamic real-world environments, we propose EvoGuard, a novel agentic framework for AIGI detection. It encapsulates various state-of-the-art (SOTA) off-the-shelf MLLM and non-MLLM detectors as callable tools, and coordinates them through a capability-aware dynamic orchestration mechanism. Empowered by the agent's capacities for autonomous planning and reflection, it intelligently selects suitable tools for given samples, reflects on intermediate results, and decides the next action, reaching a final conclusion through multi-turn invocation and reasoning. This design effectively exploits the complementary strengths among heterogeneous detectors, transcending the limits of any single model. Furthermore, optimized by a GRPO-based Agentic Reinforcement Learning algorithm using only low-cost binary labels, it eliminates the reliance on fine-grained annotations. Extensive experiments demonstrate that EvoGuard achieves SOTA accuracy while mitigating the bias between positive and negative samples. More importantly, it allows the plug-and-play integration of new detectors to boost overall performance in a training-free manner, offering a highly practical, long-term solution to ever-evolving AIGI threats. Source code will be publicly available upon acceptance.
中文摘要 AI生成图像（AIGIs）的快速普及带来了严重的错误信息风险，使得AIGI检测成为一项关键但充满挑战的任务。虽然传统检测范式主要依赖底层特征，但近期研究越来越多地关注利用多模态大型语言模型（MLLM）的通用理解能力以实现更好的泛化，但仍受限于可扩展性不足和昂贵的训练数据标注。为了更好地应对复杂且动态的现实环境，我们提出了EvoGuard，一种用于AIGI检测的新型智能体框架。它将各种最先进的（SOTA）现成MLLM和非MLLM检测器封装为可调用工具，并通过能力感知的动态编排机制进行协调。借助智能体自主规划和反思的能力，它智能选择适合给定样本的工具，反思中间结果，并决定下一步行动，通过多轮调用和推理达成最终结论。该设计有效利用异构检测器之间的互补优势，超越了任何单一模型的局限。此外，通过基于GRPO的智能体强化学习算法，仅使用低成本的二元标签进行优化，消除了对细粒度标注的依赖。大量实验表明，EvoGuard 在缓解正负样本间偏差的同时，实现了 SOTA 准确率。更重要的是，它允许即插即用地集成新检测器，以无需训练的方式提升整体性能，为不断演变的AIGI威胁提供了高度实用且长期的解决方案。源代码将在论文被接收后公开。

Efficient Exploration at Scale

大规模高效探索

Authors: Seyed Mohammad Asghari, Chris Chute, Vikranth Dwaracherla, Xiuyuan Lu, Mehdi Jafarnia, Victor Minden, Zheng Wen, Benjamin Van Roy
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.17378
Pdf link: https://arxiv.org/pdf/2603.17378
Abstract We develop an online learning algorithm that dramatically improves the data efficiency of reinforcement learning from human feedback (RLHF). Our algorithm incrementally updates reward and language models as choice data is received. The reward model is fit to the choice data, while the language model is updated by a variation of reinforce, with reinforcement signals provided by the reward model. Several features enable the efficiency gains: a small affirmative nudge added to each reinforcement signal, an epistemic neural network that models reward uncertainty, and information-directed exploration. With Gemma large language models (LLMs), our algorithm matches the performance of offline RLHF trained on 200K labels using fewer than 20K labels, representing more than a 10x gain in data efficiency. Extrapolating from our results, we expect our algorithm trained on 1M labels to match offline RLHF trained on 1B labels. This represents a 1,000x gain. To our knowledge, these are the first results to demonstrate that such large improvements are possible.
中文摘要 我们开发了一种在线学习算法，显著提升了基于人类反馈的强化学习（RLHF）的数据效率。我们的算法会随着选择数据的到来逐步更新奖励模型和语言模型。奖励模型拟合于选择数据，而语言模型则通过 reinforce 算法的一种变体进行更新，其强化信号由奖励模型提供。多项特性促成了效率提升：在每个强化信号中加入小幅正向推动、对奖励不确定性建模的认知神经网络，以及信息导向探索。使用Gemma大型语言模型（LLM）时，我们的算法仅用不到2万个标签就能达到在20万个标签上训练的离线RLHF的性能，数据效率提升超过10倍。根据我们的结果外推，我们预期在100万个标签上训练的算法将与在10亿个标签上训练的离线RLHF相当。这意味着1000倍的提升。据我们所知，这是首批证明如此巨大的改进可以实现的结果。

CRE-T1 Preview Technical Report: Beyond Contrastive Learning for Reasoning-Intensive Retrieval

CRE-T1预览技术报告：超越对比学习，实现推理密集型检索

Authors: Guangzhi Wang, Yinghao Jiao, Zhi Liu
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.17387
Pdf link: https://arxiv.org/pdf/2603.17387
Abstract The central challenge of reasoning-intensive retrieval lies in identifying implicit reasoning relationships between queries and documents, rather than superficial semantic or lexical similarity. The contrastive learning paradigm is fundamentally a static representation consolidation technique: during training, it encodes hierarchical relevance concepts into fixed geometric structures in the vector space, and at inference time it cannot dynamically adjust relevance judgments according to the specific reasoning demands of each query. Consequently, performance degrades noticeably when vocabulary mismatch exists between queries and documents or when implicit reasoning is required to establish relevance. This paper proposes Thought 1 (T1), a generative retrieval model that shifts relevance modeling from static alignment to dynamic reasoning. On the query side, T1 dynamically generates intermediate reasoning trajectories for each query to bridge implicit reasoning relationships and uses as a semantic aggregation point for the reasoning output. On the document side, it employs an instruction + text + encoding format to support high-throughput indexing. To internalize dynamic reasoning capabilities into vector representations, we adopt a three-stage training curriculum and introduce GRPO in the third stage, enabling the model to learn optimal derivation strategies for different queries through trial-and-error reinforcement learning. On the BRIGHT benchmark, T1-4B exhibits strong performance under the original query setting, outperforming larger models trained with contrastive learning overall, and achieving performance comparable to multi-stage retrieval pipelines. The results demonstrate that replacing static representation alignment with dynamic reasoning generation can effectively improve reasoning-intensive retrieval performance.
中文摘要 推理密集型检索的核心挑战在于识别查询与文档之间的隐式推理关系，而非表面的语义或词汇相似性。对比学习范式本质上是一种静态表示巩固技术：在训练过程中，它将层级化的相关性概念编码到向量空间中的固定几何结构中，在推理时无法根据每个查询的具体推理需求动态调整相关性判断。因此，当查询与文档之间存在词汇不匹配或需要隐式推理以确定相关性时，性能会明显下降。本文提出了Thought 1（T1），一种生成式检索模型，将相关性建模从静态对齐转向动态推理。在查询端，T1 为每个查询动态生成中间推理轨迹，以弥合隐式推理关系，并用作推理输出的语义聚合点。在文档端，它采用指令 + 文本 + 编码格式，以支持高吞吐量索引。为了将动态推理能力内化为向量表示，我们采用三阶段训练课程，并在第三阶段引入GRPO，使模型能够通过试错式强化学习掌握针对不同查询的最佳推导策略。在BRIGHT基准测试中，T1-4B在原始查询设置下表现出强劲的性能，整体表现优于使用对比学习训练的更大模型，并实现了与多阶段检索流水线相当的性能。结果表明，用动态推理生成替代静态表示对齐可以有效提升推理密集型检索性能。

AR-CoPO: Align Autoregressive Video Generation with Contrastive Policy Optimization

AR-CoPO：将自回归视频生成与对比策略优化对齐

Authors: Dailan He, Guanlin Feng, Xingtong Ge, Yi Zhang, Bingqi Ma, Guanglu Song, Yu Liu, Hongsheng Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.17461
Pdf link: https://arxiv.org/pdf/2603.17461
Abstract Streaming autoregressive (AR) video generators combined with few-step distillation achieve low-latency, high-quality synthesis, yet remain difficult to align via reinforcement learning from human feedback (RLHF). Existing SDE-based GRPO methods face challenges in this setting: few-step ODEs and consistency model samplers deviate from standard flow-matching ODEs, and their short, low-stochasticity trajectories are highly sensitive to initialization noise, rendering intermediate SDE exploration ineffective. We propose AR-CoPO (AutoRegressive Contrastive Policy Optimization), a framework that adapts the Neighbor GRPO contrastive perspective to streaming AR generation. AR-CoPO introduces chunk-level alignment via a forking mechanism that constructs neighborhood candidates at a randomly selected chunk, assigns sequence-level rewards, and performs localized GRPO updates. We further propose a semi-on-policy training strategy that complements on-policy exploration with exploitation over a replay buffer of reference rollouts, improving generation quality across domains. Experiments on Self-Forcing demonstrate that AR-CoPO improves both out-of-domain generalization and in-domain human preference alignment over the baseline, providing evidence of genuine alignment rather than reward hacking.
中文摘要 流式自回归（AR）视频生成器结合少步蒸馏技术，实现了低延迟、高质量的合成，但通过基于人类反馈的强化学习（RLHF）进行对齐仍然很困难。现有基于SDE的GRPO方法在此设置下面临挑战：少步ODE和一致性模型采样器偏离标准的流匹配ODE，且其短且低随机性的轨迹对初始化噪声高度敏感，使得中间SDE探索效果不佳。我们提出了AR-CoPO（自回归对比策略优化）框架，将Neighbor GRPO的对比视角适配到流式AR生成中。AR-CoPO通过分叉机制引入了块级对齐，该机制在随机选定的块处构建邻域候选，分配序列级奖励，并执行局部GRPO更新。我们还提出了一种半同策略（semi-on-policy）训练策略，以对参考轨迹回放缓冲区的利用来补充同策略探索，提升跨域生成质量。在Self-Forcing上的实验表明，AR-CoPO相较基线同时提升了域外泛化和域内人类偏好对齐，为真正的对齐而非奖励欺骗（reward hacking）提供了证据。

Efficient Soft Actor-Critic with LLM-Based Action-Level Guidance for Continuous Control

基于LLM动作级指导的高效Soft Actor-Critic，用于连续控制

Authors: Hao Ma, Zhiqiang Pu, Xiaolin Ai, Huimu Wang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.17468
Pdf link: https://arxiv.org/pdf/2603.17468
Abstract We present GuidedSAC, a novel reinforcement learning (RL) algorithm that facilitates efficient exploration in vast state-action spaces. GuidedSAC leverages large language models (LLMs) as intelligent supervisors that provide action-level guidance for the Soft Actor-Critic (SAC) algorithm. The LLM-based supervisor analyzes the most recent trajectory using state information and visual replays, offering action-level interventions that enable targeted exploration. Furthermore, we provide a theoretical analysis of GuidedSAC, proving that it preserves the convergence guarantees of SAC while improving convergence speed. Through experiments in both discrete and continuous control environments, including toy text tasks and complex MuJoCo benchmarks, we demonstrate that GuidedSAC consistently outperforms standard SAC and state-of-the-art exploration-enhanced variants (e.g., RND, ICM, and E3B) in terms of sample efficiency and final performance.
中文摘要 我们介绍了GuidedSAC，一种新型强化学习（RL）算法，能够在庞大的状态-动作空间中高效探索。GuidedSAC 利用大型语言模型（LLMs）作为智能监督器，为Soft Actor-Critic（SAC）算法提供动作级指导。基于LLM的监督器利用状态信息和视觉回放分析最近的轨迹，提供动作级干预，实现有针对性的探索。此外，我们还对GuidedSAC进行了理论分析，证明它在提升收敛速度的同时，依然保持了SAC的收敛保证。通过在离散和连续控制环境中的实验，包括玩具文本任务和复杂的MuJoCo基准测试，我们证明GuidedSAC在样本效率和最终性能方面持续优于标准SAC和最先进的探索增强型变体（如RND、ICM和E3B）。

Interpreting Context-Aware Human Preferences for Multi-Objective Robot Navigation

解读多目标机器人导航的情境感知人类偏好

Authors: Tharun Sethuraman, Subham Agrawal, Nils Dengler, Jorge de Heuvel, Teena Hassan, Maren Bennewitz
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.17510
Pdf link: https://arxiv.org/pdf/2603.17510
Abstract Robots operating in human-shared environments must not only achieve task-level navigation objectives such as safety and efficiency, but also adapt their behavior to human preferences. However, as human preferences are typically expressed in natural language and depend on environmental context, it is difficult to directly integrate them into low-level robot control policies. In this work, we present a pipeline that enables robots to understand and apply context-dependent navigation preferences by combining foundational models with a Multi-Objective Reinforcement Learning (MORL) navigation policy. Thus, our approach integrates high-level semantic reasoning with low-level motion control. A Vision-Language Model (VLM) extracts structured environmental context from onboard visual observations, while Large Language Models (LLM) convert natural language user feedback into interpretable, context-dependent behavioral rules stored in a persistent but updatable rule memory. A preference translation module then maps contextual information and stored rules into numerical preference vectors that parameterize a pretrained MORL policy for real-time navigation adaptation. We evaluate the proposed framework through quantitative component-level evaluations, a user study, and real-world robot deployments in various indoor environments. Our results demonstrate that the system reliably captures user intent, generates consistent preference vectors, and enables controllable behavior adaptation across diverse contexts. Overall, the proposed pipeline improves the adaptability, transparency, and usability of robots operating in shared human environments, while maintaining safe and responsive real-time control.
中文摘要 在人类共享环境中运行的机器人不仅要实现任务级导航目标，如安全和效率，还要根据人类偏好调整其行为。然而，由于人类偏好通常以自然语言表达，且依赖环境背景，难以直接将其整合到低层机器人控制策略中。在本研究中，我们提出了一个流程，使机器人能够通过结合基础模型与多目标强化学习（MORL）导航策略，理解并应用上下文依赖的导航偏好。因此，我们的方法将高级语义推理与低级运动控制相结合。视觉语言模型（VLM）从机载视觉观察中提取结构化环境上下文，而大型语言模型（LLM）则将自然语言用户反馈转换为可解释、上下文依赖的行为规则，存储在持久但可更新的规则内存中。偏好转换模块随后将上下文信息和存储规则映射为数值偏好向量，参数化预训练的MORL策略以实现实时导航适应。我们通过定量组件级评估、用户研究以及在各种室内环境中的实际机器人部署，评估了该框架。我们的结果表明，系统能够可靠地捕捉用户意图，生成一致的偏好向量，并实现在不同情境下的可控行为适应。总体而言，拟议的流程提升了机器人在共享人类环境中的适应性、透明度和可用性，同时保持安全且响应迅速的实时控制。

From Isolated Scoring to Collaborative Ranking: A Comparison-Native Framework for LLM-Based Paper Evaluation

从孤立评分到协作排名：基于LLM论文评估的比较原生框架

Authors: Pujun Zheng, Jiacheng Yao, Jinquan Zheng, Chenyang Gu, Guoxiu He, Jiawei Liu, Yong Huang, Tianrui Guo, Wei Lu
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.17588
Pdf link: https://arxiv.org/pdf/2603.17588
Abstract Large language models (LLMs) are currently applied to scientific paper evaluation by assigning an absolute score to each paper independently. However, since score scales vary across conferences, time periods, and evaluation criteria, models trained on absolute scores are prone to fitting narrow, context-specific rules rather than developing robust scholarly judgment. To overcome this limitation, we propose shifting paper evaluation from isolated scoring to collaborative ranking. In particular, we design a Comparison-Native framework for Paper Evaluation (CNPE), integrating comparison into both data construction and model learning. We first propose a graph-based similarity ranking algorithm to facilitate the sampling of more informative and discriminative paper pairs from a collection. We then enhance relative quality judgment through supervised fine-tuning and reinforcement learning with comparison-based rewards. At inference, the model performs pairwise comparisons over sampled paper pairs and aggregates these preference signals into a global relative quality ranking. Experimental results demonstrate that our framework achieves an average relative improvement of 21.8% over the strong baseline DeepReview-14B, while exhibiting robust generalization to five previously unseen datasets. Code is available at this https URL.
中文摘要 大型语言模型（LLMs）目前被应用于科学论文评估，通过独立为每篇论文赋予绝对分数。然而，由于评分量表因会议、时间段和评估标准而异，基于绝对分数训练的模型往往倾向于符合狭窄、特定情境的规则，而非形成扎实的学术判断。为克服这一限制，我们提议将论文评分从孤立评分转向协作排名。特别地，我们设计了 \textbf{C}omparison-\textbf{N}ative 框架，用于 \textbf{P}aper \textbf{E}估值（\textbf{CNPE}），将比较整合到数据构建和模型学习中。我们首先提出一种基于图的相似性排序算法，以便从一个集合中抽样更有信息量和辨别性的论文对。随后，我们通过监督微调和基于比较的奖励强化学习，提升相对质量判断。在推断时，模型对抽样论文对进行成对比较，并将这些偏好信号汇总为全局相对质量排名。实验结果表明，我们的框架相较于强基线DeepReview-14B实现了\textbf{21.8\%}的平均相对提升，同时对五个此前未见过的数据集表现出强劲的推广。\href{this https URL}{Code}。
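
The abstract says pairwise preference signals are aggregated into a global relative quality ranking but does not state the aggregation rule. As a rough illustration only, the sketch below ranks papers by win rate over the sampled pairs; this simple rule, the function name `rank_by_win_rate`, and the `(winner, loser)` tuple format are assumptions, and the paper may use a different aggregator.

```python
from collections import defaultdict


def rank_by_win_rate(comparisons):
    """Aggregate pairwise outcomes into a global ranking by win rate.

    `comparisons` is a list of (winner, loser) paper identifiers, one per
    sampled pair. Win-rate aggregation is an illustrative assumption, not
    necessarily the rule used by CNPE.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    # Sort by descending win rate; break ties by identifier for determinism.
    return sorted(games, key=lambda p: (-wins[p] / games[p], p))


# Example: paper A beats B and C, and B beats C.
ranking = rank_by_win_rate([("A", "B"), ("A", "C"), ("B", "C")])
```

Because only sampled pairs are compared, a paper's position depends on which opponents it happened to face, which is one reason informative pair sampling matters in a comparison-based setup.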

Complementary Reinforcement Learning

补充强化学习

Authors: Dilxat Muhtar, Jiashun Liu, Wei Gao, Weixun Wang, Shaopan Xiong, Ju Huang, Siran Yang, Wenbo Su, Jiamang Wang, Ling Pan, Bo Zheng
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.17621
Pdf link: https://arxiv.org/pdf/2603.17621
Abstract Reinforcement Learning (RL) has emerged as a powerful paradigm for training LLM-based agents, yet remains limited by low sample efficiency, stemming not only from sparse outcome feedback but also from the agent's inability to leverage prior experience across episodes. While augmenting agents with historical experience offers a promising remedy, existing approaches suffer from a critical weakness: the experience distilled from history is either stored statically or fails to co-evolve with the improving actor, causing a progressive misalignment between the experience and the actor's evolving capability that diminishes its utility over the course of training. Inspired by complementary learning systems in neuroscience, we present Complementary RL to achieve seamless co-evolution of an experience extractor and a policy actor within the RL optimization loop. Specifically, the actor is optimized via sparse outcome-based rewards, while the experience extractor is optimized according to whether its distilled experiences demonstrably contribute to the actor's success, thereby evolving its experience management strategy in lockstep with the actor's growing capabilities. Empirically, Complementary RL outperforms outcome-based agentic RL baselines that do not learn from experience, achieving a 10% performance improvement in single-task scenarios and exhibiting robust scalability in multi-task settings. These results establish Complementary RL as a paradigm for efficient experience-driven agent learning.
中文摘要 强化学习（RL）已成为训练基于LLM的代理的强大范式，但由于样本效率低，不仅源于结果反馈稀疏，还因代理无法跨剧集利用先前经验而受限。虽然增强具有历史经验的代理提供了有前景的解决方案，但现有方法存在一个关键弱点：从历史中提炼出来的经验要么静态存储，要么未能与进步中的行为者共同进化，导致经验与行为者不断进化的能力之间产生逐渐错位，随着培训过程其效用逐渐减弱。受神经科学中互补学习系统的启发，我们提出了补充强化学习，旨在实现强化学习优化循环中体验提取器和策略行为者的无缝共进化。具体来说，参与者通过稀疏的结果奖励进行优化，而体验提取器则根据其提炼体验是否明显促进参与者的成功进行优化，从而与参与者不断增长的能力同步演进体验管理策略。从经验角度看，补充强化学习优于基于结果的代理强化学习基线，在单任务场景中实现10%的性能提升，并在多任务环境中展现出强健的可扩展性。这些结果确立了补充强化学习作为高效体验驱动主体学习的范式。

Benchmarking Reinforcement Learning via Stochastic Converse Optimality: Generating Systems with Known Optimal Policies

通过随机逆最优性进行基准测试强化学习：生成具有已知最优策略的系统

Authors: Sinan Ibrahim, Grégoire Ouerdane, Hadi Salloum, Henni Ouerdane, Stefan Streif, Pavel Osinenko
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2603.17631
Pdf link: https://arxiv.org/pdf/2603.17631
Abstract The objective comparison of Reinforcement Learning (RL) algorithms is notoriously complex as outcomes and benchmarking of performances of different RL approaches are critically sensitive to environmental design, reward structures, and stochasticity inherent in both algorithmic learning and environmental dynamics. To manage this complexity, we introduce a rigorous benchmarking framework by extending converse optimality to discrete-time, control-affine, nonlinear systems with noise. Our framework provides necessary and sufficient conditions, under which a prescribed value function and policy are optimal for constructed systems, enabling the systematic generation of benchmark families via homotopy variations and randomized parameters. We validate it by automatically constructing diverse environments, demonstrating our framework's capacity for a controlled and comprehensive evaluation across algorithms. By assessing standard methods against a ground-truth optimum, our work delivers a reproducible foundation for precise and rigorous RL benchmarking.
中文摘要 强化学习（RL）算法的客观比较极为复杂，因为不同强化学习方法的结果和性能基准对环境设计、奖励结构和随机性极为敏感，这些因素都存在于算法学习和环境动态中。为管理这种复杂性，我们引入了严格的基准测试框架，将逆最优性扩展到离散时间、控制仿射、非线性噪声系统。我们的框架提供了必要且充分的条件，使得规定的价值函数和策略对构造系统最优，从而通过同伦变分和随机参数系统生成基准族。我们通过自动构建多样化环境来验证其能力，展示了我们框架在算法间进行受控且全面的评估能力。通过将标准方法与地面真实最优值进行评估，我们的工作为精确且严谨的强化学习基准测试提供了可重复的基础。

Post-Training Local LLM Agents for Linux Privilege Escalation with Verifiable Rewards

Linux权限升级本地LLM代理培训后可验证奖励

Authors: Philipp Normann, Andreas Happe, Jürgen Cito, Daniel Arp
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.17673
Pdf link: https://arxiv.org/pdf/2603.17673
Abstract LLM agents are increasingly relevant to research domains such as vulnerability discovery. Yet, the strongest systems remain closed and cloud-only, making them resource-intensive, difficult to reproduce, and unsuitable for work involving proprietary code or sensitive data. Consequently, there is an urgent need for small, local models that can perform security tasks under strict resource budgets, but methods for developing them remain underexplored. In this paper, we address this gap by proposing a two-stage post-training pipeline. We focus on the problem of Linux privilege escalation, where success is automatically verifiable and the task requires multi-step interactive reasoning. Using an experimental setup that prevents data leakage, we post-train a 4B model in two stages: supervised fine-tuning on traces from procedurally generated privilege-escalation environments, followed by reinforcement learning with verifiable rewards. On a held-out benchmark of 12 Linux privilege-escalation scenarios, supervised fine-tuning alone more than doubles the baseline success rate at 20 rounds, and reinforcement learning further lifts our resulting model, PrivEsc-LLM, to 95.8%, nearly matching Claude Opus 4.6 at 97.5%. At the same time, the expected inference cost per successful escalation is reduced by over 100x.
中文摘要 LLM代理在漏洞发现等研究领域中日益相关。然而，最强大的系统仍然封闭且仅运行于云端，使其资源密集、难以复现，且不适合处理专有代码或敏感数据的工作。因此，迫切需要能够在严格资源预算下执行安全任务的小型本地模型，但开发这些模型的方法尚未被充分探索。本文通过提出两阶段培训后流程来弥补这一空白。我们关注Linux权限升级的问题，成功是可自动验证的，任务需要多步交互推理。利用防止数据泄露的实验设置，我们将4B模型分两阶段进行后训练：对程序生成的权限提升环境中的痕迹进行监督微调，随后进行带有可验证奖励的强化学习。在一项包含12个Linux权限升级场景的基准测试中，仅监督式微调即可使20轮的基线成功率翻倍多，强化学习进一步提升了我们最终模型PrivEsc-LLM的95.8%，几乎与Claude Opus 4.6的97.5%持平。同时，每次成功升级的预期推理成本降低了100倍以上。

Flow Matching Policy with Entropy Regularization

带熵正则化的流匹配策略

Authors: Ting Gao, Stavros Orfanoudakis, Nan Lin, Elvin Isufi, Winnie Daamen, Serge Hoogendoorn
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.17685
Pdf link: https://arxiv.org/pdf/2603.17685
Abstract Diffusion-based policies have gained significant popularity in Reinforcement Learning (RL) due to their ability to represent complex, non-Gaussian distributions. Stochastic Differential Equation (SDE)-based diffusion policies often rely on indirect entropy control due to the intractability of the exact entropy, while also suffering from computationally prohibitive policy gradients through the iterative denoising chain. To overcome these issues, we propose Flow Matching Policy with Entropy Regularization (FMER), an Ordinary Differential Equation (ODE)-based online RL framework. FMER parameterizes the policy via flow matching and samples actions along a straight probability path, motivated by optimal transport. FMER leverages the model's generative nature to construct an advantage-weighted target velocity field from a candidate set, steering policy updates toward high-value regions. By deriving a tractable entropy objective, FMER enables principled maximum-entropy optimization for enhanced exploration. Experiments on sparse multi-goal FrankaKitchen benchmarks demonstrate that FMER outperforms state-of-the-art methods, while remaining competitive on standard MuJoCo benchmarks. Moreover, FMER reduces training time by 7x compared to heavy diffusion baselines (QVPO) and 10-15% relative to efficient variants.
中文摘要 基于扩散的策略因其能够表示复杂的非高斯分布而在强化学习（RL）中获得了极大的流行。基于随机微分方程（SDE）的扩散策略通常依赖于间接控制熵，因为精确熵难以处理，同时也存在通过迭代去噪链的计算限制性策略梯度。为克服这些问题，我们提出了基于常微分方程（ODE）的在线强化学习框架（FMER）的流匹配策略（FMER）。FMER通过流量匹配参数化策略，并采样沿直线概率路径的动作，以最优运输为动力。FMER利用模型的生成特性，从候选集构建一个优势加权目标速度场，引导政策更新朝向高价值区域。通过推导一个可解的熵目标，FMER实现了有原则的最大熵优化，从而增强了探索效果。在稀疏多目标FrankaKitchen基准测试上的实验表明，FMER优于最先进方法，同时在标准MuJoco基准测试中保持竞争力。此外，FMER的训练时间比重扩散基线（QVPO）缩短7倍，且相较于高效变体减少10-15%。
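
The abstract mentions a straight probability path and an advantage-weighted target velocity field built from a candidate set. The sketch below shows the standard straight-line flow-matching interpolation, x_t = (1 - t) x_0 + t x_1 with constant velocity x_1 - x_0, followed by one possible reading of the advantage weighting: a softmax over candidate advantages. The softmax weighting, the `temperature` parameter, and all function names are assumptions, not FMER's published objective.

```python
import math


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 to action x1 at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def straight_velocity(x0, x1):
    """Constant velocity of the straight path: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def advantage_weighted_velocity(x0, candidates, advantages, temperature=1.0):
    """Softmax-weighted average of straight-path velocities toward candidates.

    Hypothetical weighting scheme for illustration; candidates with higher
    advantage pull the target velocity more strongly.
    """
    m = max(advantages)  # subtract the max for numerical stability
    weights = [math.exp((a - m) / temperature) for a in advantages]
    total = sum(weights)
    target = [0.0] * len(x0)
    for w, cand in zip(weights, candidates):
        v = straight_velocity(x0, cand)
        for i, vi in enumerate(v):
            target[i] += (w / total) * vi
    return target


# Example: two one-dimensional candidate actions with equal advantages.
v_target = advantage_weighted_velocity([0.0], [[1.0], [3.0]], [0.0, 0.0])
```

With equal advantages the target is the plain average of the candidate velocities; lowering `temperature` concentrates the weight on the highest-advantage candidate.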

Machine Learning for Network Attacks Classification and Statistical Evaluation of Machine Learning for Network Attacks Classification and Adversarial Learning Methodologies for Synthetic Data Generation

网络攻击机器学习分类与统计评估：合成数据生成的分类与对抗学习方法

Authors: Iakovos-Christos Zarkadis, Christos Douligeris
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Applications (stat.AP); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2603.17717
Pdf link: https://arxiv.org/pdf/2603.17717
Abstract Supervised detection of network attacks has always been a critical part of network intrusion detection systems (NIDS). Nowadays, in a pivotal time for artificial intelligence (AI), with even more sophisticated attacks that utilize advanced techniques, such as generative artificial intelligence (GenAI) and reinforcement learning, it has become a vital component if we wish to protect our personal data, which are scattered across the web. In this paper, we address two tasks, in the first unified multi-modal NIDS dataset, which incorporates flow-level data, packet payload information and temporal contextual features, from the reprocessed CIC-IDS-2017, CIC-IoT-2023, UNSW-NB15 and CIC-DDoS-2019, with the same feature space. In the first task we use machine learning (ML) algorithms, with stratified cross validation, in order to prevent network attacks, with stability and reliability. In the second task we use adversarial learning algorithms to generate synthetic data, compare them with the real ones and evaluate their fidelity, utility and privacy using the SDV framework, f-divergences, distinguishability and non-parametric statistical tests. The findings provide stable ML models for intrusion detection and generative models with high fidelity and utility, by combining the Synthetic Data Vault framework, the TRTS and TSTR tests, with non-parametric statistical tests and f-divergence measures.
中文摘要 网络攻击的监督检测一直是网络入侵检测系统（NIDS）的关键组成部分。如今，在人工智能（AI）的关键时期，随着利用生成式人工智能（GenAI）和强化学习等先进技术的更复杂攻击，人工智能已成为保护散布在网络各处的个人数据的重要组成部分。本文中，我们解决了首个统一多模态NIDS数据集中的两个任务，该数据集包含流级数据、数据包有效载荷信息和时间上下文特征，数据来自重新处理的CIC-IDS-2017、CIC-IoT-2023、UNSW-NB15和CIC-DDoS-2019，且具有相同特征空间。在第一个任务中，我们使用机器学习（ML）算法，配合分层交叉验证，以防范网络攻击，同时保持稳定性和可靠性。在第二个任务中，我们使用对抗性学习算法生成合成数据，将其与真实数据进行比较，并利用SDV框架、f发散、可区分性和非参数统计检验评估其保真度、效用和隐私性。研究结果通过结合合成数据库框架、TRTS和TSTR测试，以及非参数统计检验和f发散度度，提供了稳定的入侵检测和生成模型，具有高保真度和实用性。

CoVerRL: Breaking the Consensus Trap in Label-Free Reasoning via Generator-Verifier Co-Evolution

CoVerRL：通过生成器-验证器共进化打破无标签推理中的共识陷阱

Authors: Teng Pan, Yuchen Yan, Zixuan Wang, Ruiqing Zhang, Gaiyang Han, Wanqi Zhang, Weiming Lu, Jun Xiao, Yongliang Shen
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.17775
Pdf link: https://arxiv.org/pdf/2603.17775
Abstract Label-free reinforcement learning enables large language models to improve reasoning capabilities without ground-truth supervision, typically by treating majority-voted answers as pseudo-labels. However, we identify a critical failure mode: as training maximizes self-consistency, output diversity collapses, causing the model to confidently reinforce systematic errors that evade detection. We term this the consensus trap. To escape it, we propose CoVerRL, a framework where a single model alternates between generator and verifier roles, with each capability bootstrapping the other. Majority voting provides noisy but informative supervision for training the verifier, while the improving verifier progressively filters self-consistent errors from pseudo-labels. This co-evolution creates a virtuous cycle that maintains high reward accuracy throughout training. Experiments across Qwen and Llama model families demonstrate that CoVerRL outperforms label-free baselines by 4.7-5.9% on mathematical reasoning benchmarks. Moreover, self-verification accuracy improves from around 55% to over 85%, confirming that both capabilities genuinely co-evolve.
中文摘要 无标签强化学习使大型语言模型能够在无需地面真实监督的情况下提升推理能力，通常通过将多数票通过的答案视为伪标签。然而，我们识别出一个关键的失败模式：随着训练最大化自洽性，输出多样性崩溃，导致模型自信地强化规避检测的系统性错误。我们称之为共识陷阱。为了摆脱这种模式，我们提出了CoVerRL，这是一个框架，单一模型在生成器和验证者之间交替，双方能力相互引导。多数投票为验证者训练提供了噪音但信息丰富的监督，而改进验证者则逐步过滤伪标签中的自洽错误。这种共同进化形成了一个良性循环，在整个训练过程中保持高奖励准确率。跨越Qwen和Llama模型家族的实验表明，CoVerRL在数学推理基准测试中比无标签基线高出4.7%至5.9%的水平。此外，自我验证的准确率从约55%提升到超过85%，证实了这两种能力确实是共同演进的。
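
The abstract describes majority-voted answers serving as pseudo-labels, with a verifier filtering out self-consistent errors. The sketch below illustrates that idea in its simplest form; the `verifier` callable, its score range, and the `threshold` parameter are hypothetical stand-ins, not CoVerRL's actual interface.

```python
from collections import Counter


def majority_pseudo_label(answers):
    """Return the most common sampled answer and its agreement ratio."""
    counts = Counter(answers)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers)


def filtered_pseudo_label(answers, verifier, threshold=0.5):
    """Keep the majority answer only if the verifier accepts it.

    `verifier` is a hypothetical callable mapping an answer to a score in
    [0, 1]; returning None means the sample provides no pseudo-label.
    """
    label, _agreement = majority_pseudo_label(answers)
    if verifier(label) >= threshold:
        return label
    return None


# Example: three sampled answers to one question, with a stub verifier.
label = filtered_pseudo_label(["4", "4", "5"], verifier=lambda ans: 0.9)
```

A high agreement ratio alone does not guarantee correctness, which is the consensus trap the abstract names: the verifier check is what lets a confidently repeated wrong answer be rejected.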

Federated Distributional Reinforcement Learning with Distributional Critic Regularization

结合分布批判正则化的联合分布强化学习

Authors: David Millard, Cecilia Alm, Rashid Ali, Pengcheng Shi, Ali Baheri
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.17820
Pdf link: https://arxiv.org/pdf/2603.17820
Abstract Federated reinforcement learning typically aggregates value functions or policies by parameter averaging, which emphasizes expected return and can obscure statistical multimodality and tail behavior that matter in safety-critical settings. We formalize federated distributional reinforcement learning (FedDistRL), where clients parametrize quantile value function critics and federate these networks only. We also propose TR-FedDistRL, which builds a per client, risk-aware Wasserstein barycenter over a temporal buffer. This local barycenter provides a reference region to constrain the parameter averaged critic, ensuring necessary distributional information is not averaged out during the federation process. The distributional trust region is implemented as a shrink-squash step around this reference. Under fixed-policy evaluation, the feasibility map is nonexpansive and the update is contractive in a probe-set Wasserstein metric under evaluation. Experiments on a bandit, multi-agent gridworld, and continuous highway environment show reduced mean-smearing, improved safety proxies (catastrophe/accident rate), and lower critic/policy drift versus mean-oriented and non-federated baselines.
中文摘要 联合强化学习通常通过参数平均来聚合价值函数或策略，强调期望回报，可能掩盖在安全关键环境中重要的统计多模态和尾部行为。我们形式化了联邦分布强化学习（FedDistRL），客户对分位数价值函数批判者进行参数化，并仅对这些网络进行联邦化。我们还提出了TR-FedDistRL，它构建一个针对每个客户端的风险感知的Wasserstein重心，覆盖时间缓冲区。该局部质心为参数平均批判者提供一个参考区域，确保在联合过程中所需的分布信息不会被平均化。分布信任区域作为围绕该参考的缩小-压缩步骤实现。在固定策略评估下，可行性映射为非扩展映射，更新在评估中的探针集Wasserstein度量中是收缩映射。在强盗、多智能体网格世界和连续高速公路环境中的实验显示，平均模糊减少，安全代理（灾难/事故率）改善，且与均值导向和非联邦基线相比，批评/政策漂移更低。

CodeScout: An Effective Recipe for Reinforcement Learning of Code Search Agents

CodeScout：代码搜索代理强化学习的有效配方

Authors: Lintang Sutawika, Aditya Bharat Soni, Bharath Sriraam R R, Apurva Gandhi, Taha Yassine, Sanidhya Vijayvargiya, Yuchen Li, Xuhui Zhou, Yilin Zhang, Leander Melroy Maben, Graham Neubig
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.17829
Pdf link: https://arxiv.org/pdf/2603.17829
Abstract A prerequisite for coding agents to perform tasks on large repositories is code localization - the identification of relevant files, classes, and functions to work on. While repository-level code localization has been performed using embedding-based retrieval approaches such as vector search, recent work has focused on developing agents to localize relevant code either as a standalone precursor to or interleaved with performing actual work. Most prior methods on agentic code search equip the agent with complex, specialized tools, such as repository graphs derived from static analysis. In this paper, we demonstrate that, with an effective reinforcement learning recipe, a coding agent equipped with nothing more than a standard Unix terminal can be trained to achieve strong results. Our experiments on three benchmarks (SWE-Bench Verified, Pro, and Lite) reveal that our models consistently achieve superior or competitive performance over 2-18x larger base and post-trained LLMs and sometimes approach performance provided by closed models like Claude Sonnet, even when using specialized scaffolds. Our work particularly focuses on techniques for re-purposing existing coding agent environments for code search, reward design, and RL optimization. We release the resulting model family, CodeScout, along with all our code and data for the community to build upon.
中文摘要 编码代理执行大型仓库任务的前提是代码本地化——识别相关文件、类和函数进行处理。虽然仓库级代码本地化曾通过基于嵌入的检索方法（如向量搜索）进行，但近期工作重点在于开发代理，将相关代码本地化，作为实际工作的独立前置或交织。大多数以往的代理代码搜索方法都为代理配备了复杂且专门的工具，例如基于静态分析的仓库图。本文展示了，只要采用有效的强化学习方案，只需配备标准Unix终端的编码代理即可训练以取得强有力的结果。我们在三个基准测试（SWE-Bench Verified、Pro和Lite）上的实验显示，我们的模型在基础和后训练LLM的2倍到18倍大时，始终能实现优越或有竞争力的性能，有时甚至能接近Claude Sonnet等封闭模型，即使使用专用支架。我们的工作特别聚焦于将现有编码代理环境重新利用用于代码搜索、奖励设计和强化学习优化的技术。我们发布了由此产生的模型家族CodeScout，以及所有代码和数据，供社区继续构建。

Procedural Generation of Algorithm Discovery Tasks in Machine Learning

机器学习中算法发现任务的过程生成

Authors: Alexander D. Goldie, Zilin Wang, Adrian Hayler, Deepak Nathani, Edan Toledo, Ken Thampiratwong, Aleksandra Kalisz, Michael Beukman, Alistair Letcher, Shashank Reddy, Clarisse Wibault, Theo Wolf, Charles O'Neill, Uljad Berdica, Nicholas Roberts, Saeed Rahmani, Hannah Erlebach, Roberta Raileanu, Shimon Whiteson, Jakob N. Foerster
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.17863
Pdf link: https://arxiv.org/pdf/2603.17863
Abstract Automating the development of machine learning algorithms has the potential to unlock new breakthroughs. However, our ability to improve and evaluate algorithm discovery systems has thus far been limited by existing task suites. They suffer from many issues, such as: poor evaluation methodologies; data contamination; and containing saturated or very similar problems. Here, we introduce DiscoGen, a procedural generator of algorithm discovery tasks for machine learning, such as developing optimisers for reinforcement learning or loss functions for image classification. Motivated by the success of procedural generation in reinforcement learning, DiscoGen spans millions of tasks of varying difficulty and complexity from a range of machine learning fields. These tasks are specified by a small number of configuration parameters and can be used to optimise algorithm discovery agents (ADAs). We present DiscoBench, a benchmark consisting of a fixed, small subset of DiscoGen tasks for principled evaluation of ADAs. Finally, we propose a number of ambitious, impactful research directions enabled by DiscoGen, in addition to experiments demonstrating its use for prompt optimisation of an ADA. DiscoGen is released open-source at this https URL.
中文摘要 自动化机器学习算法开发有潜力开启新的突破。然而，我们迄今为止在改进和评估算法发现系统方面的能力受限于现有任务套件。他们存在许多问题，例如：评估方法不佳;数据污染;并包含饱和或非常相似的问题。这里，我们介绍了DiscoGen，一种机器学习算法发现任务的过程生成器，例如开发强化学习的优化器或图像分类的损失函数。受强化学习中过程生成成功的激励，DiscoGen涵盖了来自多个机器学习领域的数百万个难度和复杂度各异的任务。这些任务由少量配置参数指定，可用于优化算法发现代理（ADA）。我们介绍DiscoBench，这是一个基准测试，由固定的、小部分DiscoGen任务组成，用于原则性评估ADA。最后，我们提出了多项由DiscoGen推动的雄心勃勃且有影响力的研究方向，以及展示其在快速优化ADA中的应用的实验。DiscoGen 以开源形式发布，网址为 https URL。

Operator-Theoretic Foundations and Policy Gradient Methods for General MDPs with Unbounded Costs

具有无界成本的一般MDP的算符理论基础与策略梯度方法

Authors: Abhishek Gupta, Aditya Mahajan
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2603.17875
Pdf link: https://arxiv.org/pdf/2603.17875
Abstract Markov decision processes (MDPs) are viewed as an optimization of an objective function over certain linear operators over general function spaces. Using the well-established perturbation theory of linear operators, this viewpoint allows one to identify derivatives of the objective function as a function of the linear operators. This leads to generalizations of many well-known results in reinforcement learning to cases with general state and action spaces. Prior results of this type were only established in the finite-state finite-action MDP settings and in settings with certain linear function approximations. The framework also leads to new low-complexity PPO-type reinforcement learning algorithms for general state and action space MDPs.
中文摘要 马尔可夫决策过程（MDP）被视为对目标函数在某些线性算子上，在一般函数空间上的优化。利用线性算符的成熟微扰理论，这一观点使得目标函数的导数可以识别为线性算符的函数。这促使许多已知强化学习结果推广到生成状态空间和动作空间的情形。此类先前结果仅在有限状态有限作用MDP和某些线性函数近似环境中建立。该框架还催生了适用于一般状态和动作空间MDP的新型低复杂度PPO型强化学习算法。

Training Diffusion Language Models for Black-Box Optimization

黑箱优化的扩散语言模型训练

Authors: Zipeng Sun, Can Chen, Ye Yuan, Haolun Wu, Jiayao Gu, Christopher Pal, Xue Liu
Subjects: Computational Engineering, Finance, and Science (cs.CE)
Arxiv link: https://arxiv.org/abs/2603.17919
Pdf link: https://arxiv.org/pdf/2603.17919
Abstract We study offline black-box optimization (BBO), aiming to discover improved designs from an offline dataset of designs and labels, a problem common in robotics, DNA, and materials science with limited labeled samples. While recent work applies autoregressive LLMs to BBO by formatting tasks as natural-language prompts, their left-to-right design generation struggles to capture the strong bidirectional dependencies inherent in design problems. To address this, we propose adapting diffusion LLMs to offline BBO to leverage their bidirectional modeling capabilities. However, a domain gap exists between the natural text pre-training of diffusion LLMs and the heterogeneous signals in BBO (prompts, designs, and labels). To bridge this gap, we construct a unified prompt-response corpus and introduce delimiter tokens to explicitly mark field boundaries for domain adaptation. We further propose a two-stage post-training framework to align the diffusion LLM generation with high-label designs. The first stage performs supervised fine-tuning on the unified dataset via masked-response prediction, and the second stage adopts reinforcement learning with rewards defined by label improvements. Our method achieves state-of-the-art results on Design-Bench small-data settings.
中文摘要 我们研究离线黑匣子优化（BBO），旨在通过离线设计和标签数据集发现改进设计，这是机器人学、DNA和材料科学中常见且标签样本有限的问题。虽然近期研究将自回归LLM应用于BBO，通过将任务格式化为自然语言提示，但其从左到右的设计生成难以捕捉设计问题中固有的强烈双向依赖关系。为此，我们提议将扩散大型语言模型适配为离线BBO，以利用其双向建模能力。然而，扩散大型语言模型的自然文本预训练与BBO中的异构信号（提示词、设计和标签）之间存在领域空白。为弥合这一空白，我们构建了统一的提示响应语料库，并引入分隔符标记以显式标记字段边界以便域适配。我们还提出了一个两阶段的后培训框架，以使扩散LLM生成与高标签设计保持一致。第一阶段通过掩蔽响应预测对统一数据集进行监督微调，第二阶段采用强化学习，奖励由标签改进定义。我们的方法在设计-实验室小数据环境中实现了最先进的结果。

Unified Policy Value Decomposition for Rapid Adaptation

快速适应的统一策略价值分解

Authors: Cristiano Capone, Luca Falorsi, Andrea Ciardiello, Luca Manneschi
Subjects: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Arxiv link: https://arxiv.org/abs/2603.17947
Pdf link: https://arxiv.org/pdf/2603.17947
Abstract Rapid adaptation in complex control systems remains a central challenge in reinforcement learning. We introduce a framework in which policy and value functions share a low-dimensional coefficient vector - a goal embedding - that captures task identity and enables immediate adaptation to novel tasks without retraining representations. During pretraining, we jointly learn structured value bases and compatible policy bases through a bilinear actor-critic decomposition. The critic factorizes as Q = sum_k G_k(g) y_k(s,a), where G_k(g) is a goal-conditioned coefficient vector and y_k(s,a) are learned value basis functions. This multiplicative gating - where a context signal scales a set of state-dependent bases - is reminiscent of gain modulation observed in Layer 5 pyramidal neurons, where top-down inputs modulate the gain of sensory-driven responses without altering their tuning. Building on Successor Features, we extend the decomposition to the actor, which composes a set of primitive policies weighted by the same coefficients G_k(g). At test time the bases are frozen and G_k(g) is estimated zero-shot via a single forward pass, enabling immediate adaptation to novel tasks without any gradient update. We train a Soft Actor-Critic agent on the MuJoCo Ant environment under a multi-directional locomotion objective, requiring the agent to walk in eight directions specified as continuous goal vectors. The bilinear structure allows each policy head to specialize to a subset of directions, while the shared coefficient layer generalizes across them, accommodating novel directions by interpolating in goal embedding space. Our results suggest that shared low-dimensional goal embeddings offer a general mechanism for rapid, structured adaptation in high-dimensional control, and highlight a potentially biologically plausible principle for efficient transfer in complex reinforcement learning systems.
中文摘要 复杂控制系统中的快速适应仍是强化学习的核心挑战。我们引入了一个框架，其中策略函数和价值函数共享一个低维系数向量——目标嵌入——该框架捕捉任务身份，并使得能够立即适应新任务而无需重新训练表示。在预训练过程中，我们通过双线性actor-critic分解共同学习结构化值基和兼容的策略基。批判者分解为 Q = sum_k G_k（g） y_k（s，a），其中 G_k（g）是目标条件系数向量，y_k（s，a）是学习到的值基函数。这种乘法门控——即上下文信号对一组状态依赖基进行扩展——让人联想到第5层金字塔神经元中观察到的增益调制，其中自上而下输入调制感官驱动响应的增益而不改变其调谐。基于继任特征，我们将分解扩展到actor，actor由一组由相同系数G_k（g）加权的原始策略组成。测试时，底座被冻结，并通过一次前向过估算G_k（g）为零射，便于立即适应新任务，无需梯度更新。我们在MuJoCo蚂蚁环境中训练软演员-批评代理，采用多方向运动目标，要求代理沿八个连续目标向量行走。双线性结构允许每个策略头专注于某个方向子集，而共享系数层则在这些方向上泛化，通过在目标嵌入空间中插值来适应新方向。我们的结果表明，共享的低维目标嵌入为高维控制中的快速、结构化适应提供了通用机制，并凸显了复杂强化学习系统中高效转移的潜在生物学合理原则。
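
The abstract gives the critic factorization Q = sum_k G_k(g) y_k(s,a) and says the actor composes primitive policies weighted by the same coefficients. The sketch below evaluates that sum for given coefficient and basis values, and shows one simple way the shared coefficients could mix primitive actions. The weighted-sum action composition and both function names are assumptions; the abstract does not specify how the primitive policies are combined.

```python
def bilinear_q(coeffs, basis_values):
    """Evaluate Q = sum_k G_k(g) * y_k(s, a) for one (state, action, goal).

    `coeffs` holds the goal-conditioned coefficients G_k(g) and
    `basis_values` the learned value bases y_k(s, a) at that point.
    """
    if len(coeffs) != len(basis_values):
        raise ValueError("coefficients and basis values must have equal length")
    return sum(g * y for g, y in zip(coeffs, basis_values))


def compose_action(coeffs, primitive_actions):
    """Mix primitive policy outputs with the same coefficients (assumed rule)."""
    dim = len(primitive_actions[0])
    action = [0.0] * dim
    for g, prim in zip(coeffs, primitive_actions):
        for i in range(dim):
            action[i] += g * prim[i]
    return action


# Example with K = 2 bases and a two-dimensional action.
q_value = bilinear_q([0.5, 2.0], [4.0, 1.0])
action = compose_action([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]])
```

Since only the coefficient vector depends on the goal, adapting to a new goal in this sketch means recomputing `coeffs` while the bases and primitive policies stay fixed, which matches the frozen-bases description in the abstract.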

Keyword: diffusion policy

There is no result