Arxiv Papers of Today

Generated: 2026-03-06 16:42:08 (UTC+8); arXiv release time: 2026-03-06 20:00 EST (2026-03-07 09:00 UTC+8)

37 relevant papers today

Keyword: reinforcement learning

CTRL-RAG: Contrastive Likelihood Reward Based Reinforcement Learning for Context-Faithful RAG Models

CTRL-RAG：基于对比似然奖励的强化学习，适用于情境忠实RAG模型

Authors: Zhehao Tan, Yihan Jiao, Dan Yang, Junjie Wang, Duolin Sun, Jie Feng, Xidong Wang, Lei Liu, Yue Shen, Jian Wang, Jinjie Gu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.04406
Pdf link: https://arxiv.org/pdf/2603.04406
Abstract With the growing use of Retrieval-Augmented Generation (RAG), training large language models (LLMs) for context-sensitive reasoning and faithfulness is increasingly important. Existing RAG-oriented reinforcement learning (RL) methods rely on external rewards that often fail to evaluate document faithfulness, and may misjudge similar answers in open-domain settings. In addition, there is no RAG-based self-reward mechanism. Moreover, although such a mechanism could in principle estimate answer confidence given documents, the absence of objective feedback in a self-judgment can cause hallucination accumulation and eventual model collapse. To tackle these issues, we propose a novel "internal-external" hybrid reward framework centered on a Contrastive Likelihood Reward (CLR). CLR directly optimizes the log-likelihood gap between responses conditioned on prompts with and without supporting evidence. This encourages the model to extract relevant evidence and increases its confidence when grounded in a specific context. Experiments show that our method (used alone or combined with external correctness rewards) achieves strong performance on single-hop, multi-hop, vertical-domain, and faithfulness benchmarks. Our training code and models are coming soon.
中文摘要 随着检索增强生成（RAG）的日益普及，训练大型语言模型（LLMs）以实现上下文敏感推理和忠实性变得越来越重要。现有的基于RAG的强化学习（RL）方法依赖外部奖励，这些奖励常常无法评估文档的忠实度，并且在开放领域环境中可能误判类似答案。此外，没有基于RAG的自我奖励机制。此外，尽管这种机制原则上可以估计给定文档的答案置信度，但自我判断缺乏客观反馈可能导致幻觉积累，最终导致模型崩溃。为解决这些问题，我们提出了一种以对比似然奖（CLR）为核心的新型“内外部”混合奖赏框架。CLR直接优化了基于提示（有证据与无证据）反应之间的对数似然差距。这鼓励模型提取相关证据，并在基于特定情境时增强其信心。实验表明，我们的方法（单独使用或结合外部正确性奖励）在单跳、多跳、垂直域和忠实度基准测试上表现出色。我们的训练代码和模型即将发布。
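
As a rough illustration of the contrastive likelihood idea described in the abstract, the sketch below scores one response as the gap between its log-likelihood with and without supporting evidence in the prompt. The function name, the per-token log-probability inputs, and the length normalization are assumptions made for illustration; the abstract does not say how CLR is computed in detail.

```python
from typing import Sequence


def contrastive_likelihood_reward(
    logp_with_evidence: Sequence[float],
    logp_without_evidence: Sequence[float],
    length_normalize: bool = True,  # assumption: not stated in the abstract
) -> float:
    """Log-likelihood gap of one response under two prompts.

    Each input holds the per-token log-probabilities of the SAME response,
    scored once with the retrieved documents in the prompt and once without.
    """
    if len(logp_with_evidence) != len(logp_without_evidence):
        raise ValueError("both scorings must cover the same response tokens")
    if not logp_with_evidence:
        raise ValueError("response must contain at least one token")
    gap = sum(logp_with_evidence) - sum(logp_without_evidence)
    return gap / len(logp_with_evidence) if length_normalize else gap


# Toy numbers: the response is more likely when evidence is present.
grounded = [-0.2, -0.1, -0.3]
ungrounded = [-1.2, -0.9, -1.5]
reward = contrastive_likelihood_reward(grounded, ungrounded)
```

A positive value means the evidence made the response more likely, and a value near zero suggests the response does not depend on the documents.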

Auction-Based RIS Allocation With DRL: Controlling the Cost-Performance Trade-Off

基于拍卖的RIS配额与日行车（DRL）：控制成本效益权衡

Authors: Martin Mark Zan, Stefan Schwarz
Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2603.04433
Pdf link: https://arxiv.org/pdf/2603.04433
Abstract We study the allocation of reconfigurable intelligent surfaces (RISs) in a multi-cell wireless network, where base stations compete for control of shared RIS units deployed at the cell edges. These RISs, provided by an independent operator, are dynamically leased to the highest bidder using a simultaneously ascending auction format. Each base station estimates the utility of acquiring additional RISs based on macroscopic channel parameters, enabling a scalable and low-overhead allocation mechanism. To optimize the bidding behavior, we integrate deep reinforcement learning (DRL) agents that learn to maximize performance while adhering to budget constraints. Through simulations in clustered cell-edge environments, we demonstrate that reinforcement learning (RL)-based bidding significantly outperforms heuristic strategies, achieving optimal trade-offs between cost and spectral efficiency. Furthermore, we introduce a tunable parameter that governs the bidding aggressiveness of RL agents, enabling a flexible control of the trade-off between network performance and expenditure. Our results highlight the potential of combining auction-based allocation with adaptive RL mechanisms for efficient and fair utilization of RISs in next-generation wireless networks.
中文摘要 我们研究了多小区无线网络中可重构智能表面（RIS）的分配，基站们争夺部署在小区边缘的共享RIS单元的控制权。这些RIS由独立运营商提供，通过同步递增的拍卖格式动态租赁给出价最高者。每个基站根据宏观信道参数估算获取额外RIS的效用，从而实现可扩展且低开销的分配机制。为了优化竞价行为，我们集成了深度强化学习（DRL）代理，这些智能体学习如何在遵守预算约束的同时最大化性能。通过在集群单元边缘环境中的模拟，我们证明基于强化学习（RL）的竞价显著优于启发式策略，实现了成本与频谱效率的最佳权衡。此外，我们引入了一个可调参数，用于控制强化学习代理的竞价激进度，实现了网络性能与支出之间权衡的灵活控制。我们的结果凸显了将基于拍卖的分配与自适应强化学习机制结合的潜力，从而在下一代无线网络中高效且公平地利用RIS的潜力。

Dynamic Model Routing and Cascading for Efficient LLM Inference: A Survey

高效大型语言模型推断的动态模型路由与级联：一项综述

Authors: Yasmin Moslem, John D. Kelleher
Subjects: Networking and Internet Architecture (cs.NI); Computation and Language (cs.CL); Performance (cs.PF)
Arxiv link: https://arxiv.org/abs/2603.04445
Pdf link: https://arxiv.org/pdf/2603.04445
Abstract The rapid growth of large language models (LLMs) with diverse capabilities, costs, and domains has created a critical need for intelligent model selection at inference time. While smaller models suffice for routine queries, complex tasks demand more capable models. However, static model deployment does not account for the complexity and domain of incoming queries, leading to suboptimal performance and increased costs. Dynamic routing systems that adaptively select models based on query characteristics have emerged as a solution to this challenge. We provide a systematic analysis of state-of-the-art multi-LLM routing and cascading approaches. In contrast to mixture-of-experts architectures, which route within a single model, we study routing across multiple independently trained LLMs. We cover diverse routing paradigms, including query difficulty, human preferences, clustering, uncertainty quantification, reinforcement learning, multimodality, and cascading. For each paradigm, we analyze representative methods and examine key trade-offs. Beyond taxonomy, we introduce a conceptual framework that characterizes routing systems along three dimensions: when decisions are made, what information is used, and how they are computed. This perspective highlights that practical systems are often compositional, integrating multiple paradigms under operational constraints. Our analysis demonstrates that effective multi-LLM routing requires balancing competing objectives. Choosing the optimal routing strategy depends on deployment and computational constraints. Well-designed routing systems can outperform even the most powerful individual models by strategically leveraging specialized capabilities across models while maximizing efficiency gains. Meanwhile, open challenges remain in developing routing mechanisms that generalize across diverse architectures, modalities, and applications.
中文摘要 大型语言模型（LLMs）的快速发展，其能力、成本和领域多样，催生了在推理时对智能模型选择的迫切需求。虽然较小的模型足以满足常规查询，但复杂任务需要更强大的模型。然而，静态模型部署未能考虑输入查询的复杂性和域，导致性能不理想且成本增加。基于查询特性自适应选择模型的动态路由系统已成为解决这一挑战的解决方案。我们系统分析了最先进的多大型语言模型路由和级联方法。与在单一模型内路由的专家混合架构不同，我们研究跨多个独立训练的大型语言模型之间的路由。我们涵盖了多种路由范式，包括查询难度、人类偏好、聚类、不确定性量化、强化学习、多模态和级联分析。对于每种范式，我们分析代表性方法并分析关键权衡。除了分类学，我们还引入了一个概念框架，将路由系统从三个维度上刻画：何时做出决策、使用哪些信息以及如何计算。这一观点强调，实用系统往往具有组合性质，在作约束下整合多种范式。我们的分析表明，有效的多LLM路由需要平衡多个竞争目标。选择最优路由策略取决于部署和计算约束。设计良好的路由系统通过战略性地利用不同模型的专业能力，能够超越最强大的单个模型，同时最大化效率提升。与此同时，开发跨多种架构、模式和应用的路由机制仍面临未解之谜。
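
To make the cascading paradigm mentioned in the abstract concrete, here is a minimal sketch in which a query is tried on models from cheapest to most capable and escalated only while confidence stays below a threshold. The `Model` callable type, the confidence scores, the threshold, and the two stand-in models are all hypothetical and are not taken from the survey.

```python
from typing import Callable, List, Tuple

# A "model" here is any callable returning (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


def cascade(query: str, models: List[Model], threshold: float = 0.8) -> Tuple[str, int]:
    """Try models from cheapest to most capable; stop at the first confident answer.

    Returns the chosen answer and the index of the model that produced it.
    If no model is confident, the last model's answer is returned.
    """
    if not models:
        raise ValueError("at least one model is required")
    answer, used = "", 0
    for used, model in enumerate(models):
        answer, confidence = model(query)
        if confidence >= threshold:
            break  # confident enough: skip the more expensive models
    return answer, used


# Hypothetical stand-ins for a small and a large LLM.
def small_model(query: str) -> Tuple[str, float]:
    return ("short answer", 0.9 if len(query) < 20 else 0.3)


def large_model(query: str) -> Tuple[str, float]:
    return ("detailed answer", 0.95)


easy = cascade("What is 2 + 2?", [small_model, large_model])
hard = cascade("Summarize the trade-offs of multi-LLM routing.", [small_model, large_model])
```

In this toy setup the short query is answered by the first model, while the longer one falls through to the second.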

Transformer-Based Multipath Congestion Control: A Decoupled Approach for Wireless Uplinks

基于变压器的多径拥塞控制：无线上行链路的解耦方法

Authors: Zongyuan Zhang, Tianyang Duan, Liang Wang, Zihan Fang, Zheng Lin, Yijun Lu, Jiening Wu, Xia Du, Miao Yang, Zhe Chen, Heming Cui, Jun Luo
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2603.04550
Pdf link: https://arxiv.org/pdf/2603.04550
Abstract The proliferation of artificial intelligence applications on edge devices necessitates efficient transport protocols that leverage multi-homed connectivity across heterogeneous networks. While Multipath TCP enables bandwidth aggregation, its in-kernel congestion control mechanisms lack the programmability and flexibility needed for achieving efficient transmission. Additionally, inherent measurement noise renders network state partially observable, challenging data-driven approaches like deep reinforcement learning (DRL). To address these challenges, we propose a Transformer-based Congestion Control Optimization (TCCO) framework for multipath transport. TCCO employs a decoupled architecture that offloads control decisions to an external decision engine via a lightweight in-kernel client and user-space proxy, enabling edge devices to leverage external computational resources while maintaining TCP/IP compatibility. The Transformer-based DRL agent in the external decision engine uses self-attention to capture temporal dependencies, filter noise, and coordinate control across subflows through a unified policy. Extensive evaluation on both simulated and real dual-band Wi-Fi testbeds demonstrates that TCCO achieves better adaptability and performance than state-of-the-art baselines, validating the feasibility and effectiveness of TCCO for wireless networks.
中文摘要 边缘设备上人工智能应用的激增，要求高效的传输协议，利用跨异构网络的多屋主连接。虽然多径TCP支持带宽聚合，但其内核拥塞控制机制缺乏实现高效传输所需的可编程性和灵活性。此外，固有的测量噪声使网络状态部分可观测，挑战了深度强化学习（DRL）等数据驱动方法。为应对这些挑战，我们提出了基于变压器的拥塞控制优化（TCCO）多径传输框架。TCCO 采用解耦架构，通过轻量级内核客户端和用户空间代理将控制决策卸载给外部决策引擎，使边缘设备能够利用外部计算资源，同时保持 TCP/IP 兼容性。基于Transformer的DRL代理在外部决策引擎中利用自我关注捕捉时间依赖关系，过滤噪声，并通过统一策略协调子流间的控制。对模拟和真实双频Wi-Fi测试平台的广泛评估表明，TCCO在适应性和性能方面优于最先进基线，验证了TCCO在无线网络中的可行性和有效性。

Risk-Aware Reinforcement Learning for Mobile Manipulation

移动作的风险感知强化学习

Authors: Michael Groom, James Wilson, Nick Hawes, Lars Kunze
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.04579
Pdf link: https://arxiv.org/pdf/2603.04579
Abstract For robots to successfully transition from lab settings to everyday environments, they must begin to reason about the risks associated with their actions and make informed, risk-aware decisions. This is particularly true for robots performing mobile manipulation tasks, which involve both interacting with and navigating within dynamic, unstructured spaces. However, existing whole-body controllers for mobile manipulators typically lack explicit mechanisms for risk-sensitive decision-making under uncertainty. To our knowledge, we are the first to (i) learn risk-aware visuomotor policies for mobile manipulation conditioned on egocentric depth observations with runtime-adjustable risk sensitivity, and (ii) show risk-aware behaviours can be transferred through Imitation Learning (IL) to a visuomotor policy conditioned on egocentric depth observations. Our method achieves this by first training a privileged teacher policy using Distributional Reinforcement Learning (DRL), with a risk-neutral distributional critic. Distortion risk-metrics are then applied to the critic's predicted return distribution to calculate risk-adjusted advantage estimates used in policy updates to achieve a range of risk-aware behaviours. We then distil teacher policies with IL to obtain risk-aware student policies conditioned on egocentric depth observations. We perform extensive evaluations demonstrating that our trained visuomotor policies exhibit risk-aware behaviour (specifically achieving better worst-case performance) while performing reactive whole-body motions in unmapped environments, leveraging live depth observations for perception.
中文摘要 为了让机器人成功从实验室环境过渡到日常环境，它们必须开始理性思考与自身行为相关的风险，并做出知情且有风险意识的决策。这对于执行移动作任务的机器人尤为明显，这些任务涉及与动态、无结构空间的交互和导航。然而，现有的全身移动作手控制器通常缺乏明确的机制，用于在不确定性下做出风险敏感的决策。据我们所知，我们是首个（i）基于以自我为中心深度观察为条件、运行时可调节风险敏感性，学习风险感知的移动作视觉运动策略，以及（ii）证明通过模仿学习（IL）将风险感知行为转移到基于自我深度观察的视觉运动策略的机构。我们的方法通过先使用分布强化学习（DRL）训练特权教师政策，并使用风险中性的分布批评者来实现这一目标。然后，将扭曲风险指标应用于批评者的预测收益分布，以计算用于政策更新以实现一系列风险意识行为的风险调整优势估计。然后我们与IL一起提炼教师政策，以获得基于自我中心深度观察的学生风险意识政策。我们进行了广泛评估，证明我们训练有素的视觉运动策略在未测绘环境中执行反应性全身动作时表现出风险意识行为（特别是在最坏情况下表现更佳），并利用实时深度观测进行感知。
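
The abstract describes applying distortion risk metrics to a critic's predicted return distribution. As one hedged example of such a metric, the sketch below computes Conditional Value-at-Risk (CVaR) from a set of equally weighted return quantiles. The choice of CVaR, the quantile representation, and the function name are assumptions for illustration; the abstract does not state which distortion metrics the authors use.

```python
import math
from typing import Sequence


def cvar_from_quantiles(quantile_returns: Sequence[float], alpha: float) -> float:
    """Mean of the worst alpha-fraction of equally weighted return quantiles.

    alpha = 1.0 recovers the risk-neutral mean; smaller alpha is more risk-averse.
    """
    if not quantile_returns:
        raise ValueError("need at least one quantile")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    ordered = sorted(quantile_returns)
    k = max(1, math.ceil(alpha * len(ordered)))  # number of tail quantiles kept
    return sum(ordered[:k]) / k


# Hypothetical critic output: 8 quantiles of the return for one state-action pair.
quantiles = [-10.0, -2.0, 0.0, 1.0, 2.0, 3.0, 4.0, 10.0]
risk_neutral = cvar_from_quantiles(quantiles, alpha=1.0)
risk_averse = cvar_from_quantiles(quantiles, alpha=0.25)
```

Because `alpha` is an ordinary argument, the same predicted distribution can be re-scored at different risk levels, which is one simple way to picture runtime-adjustable risk sensitivity.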

Bootstrapping Exploration with Group-Level Natural Language Feedback in Reinforcement Learning

强化学习中的自助探索与群体级自然语言反馈

Authors: Lei Huang, Xiang Cheng, Chenxiao Zhao, Guobin Shen, Junjie Yang, Xiaocheng Feng, Yuxuan Gu, Xing Yu, Bing Qin
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.04597
Pdf link: https://arxiv.org/pdf/2603.04597
Abstract Large language models (LLMs) typically receive diverse natural language (NL) feedback through interaction with the environment. However, current reinforcement learning (RL) algorithms rely solely on scalar rewards, leaving the rich information in NL feedback underutilized and leading to inefficient exploration. In this work, we propose GOLF, an RL framework that explicitly exploits group-level language feedback to guide targeted exploration through actionable refinements. GOLF aggregates two complementary feedback sources: (i) external critiques that pinpoint errors or propose targeted fixes, and (ii) intra-group attempts that supply alternative partial ideas and diverse failure patterns. These group-level feedbacks are aggregated to produce high-quality refinements, which are adaptively injected into training as off-policy scaffolds to provide targeted guidance in sparse-reward regions. Meanwhile, GOLF jointly optimizes generation and refinement within a unified RL loop, creating a virtuous cycle that continuously improves both capabilities. Experiments on both verifiable and non-verifiable benchmarks show that GOLF achieves superior performance and exploration efficiency, achieving 2.2× improvements in sample efficiency compared to RL methods trained solely on scalar rewards. Code is available at this https URL.
中文摘要 大型语言模型（LLMs）通常通过与环境的交互接收多样的自然语言（NL）反馈。然而，当前的强化学习（RL）算法仅依赖标量奖励，导致NL反馈中丰富的信息未被充分利用，导致探索效率低下。在本研究中，我们提出了GOLF框架，它明确利用群体层面的语言反馈，引导有针对性探索并通过可作的改进进行。GOLF汇总了两个互补的反馈来源：（i）外部批评，指出错误或提出针对性的修正方案;（ii）组内尝试，提供替代的部分想法和多样化的失败模式。这些群体级反馈被汇总生成高质量的改进，并作为非政策框架自适应地注入培训，在奖励稀疏区域提供有针对性的指导。与此同时，GOLF在统一的强化学习循环中共同优化生成和精炼，形成一个良性循环，持续提升这两项能力。在可验证和不可验证基准测试上的实验显示，GOLF在表现和探索效率上均有优异表现，相比仅基于标量奖励训练的强化学习方法，样本效率提升了2.2美元/倍数。代码可在此 https URL 访问。

When Sensors Fail: Temporal Sequence Models for Robust PPO under Sensor Drift

传感器失效时：传感器漂移下稳健PPO的时间序列模型

Authors: Kevin Vogt-Lowell, Theodoros Tsiligkaridis, Rodney Lafuente-Mercado, Surabhi Ghatti, Shanghua Gao, Marinka Zitnik, Daniela Rus
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.04648
Pdf link: https://arxiv.org/pdf/2603.04648
Abstract Real-world reinforcement learning systems must operate under distributional drift in their observation streams, yet most policy architectures implicitly assume fully observed and noise-free states. We study robustness of Proximal Policy Optimization (PPO) under temporally persistent sensor failures that induce partial observability and representation shift. To respond to this drift, we augment PPO with temporal sequence models, including Transformers and State Space Models (SSMs), to enable policies to infer missing information from history and maintain performance. Under a stochastic sensor failure process, we prove a high-probability bound on infinite-horizon reward degradation that quantifies how robustness depends on policy smoothness and failure persistence. Empirically, on MuJoCo continuous-control benchmarks with severe sensor dropout, we show Transformer-based sequence policies substantially outperform MLP, RNN, and SSM baselines in robustness, maintaining high returns even when large fractions of sensors are unavailable. These results demonstrate that temporal sequence reasoning provides a principled and practical mechanism for reliable operation under observation drift caused by sensor unreliability.
中文摘要 现实中的强化学习系统必须在其观察流的分布漂移下运行，但大多数策略架构隐含假设完全观察且无噪声状态。我们研究了在时间持续传感器失效（引发部分可观测性和表征偏移）下，近端策略优化（PPO）的鲁棒性。为应对这种漂移，我们用时间序列模型（包括变换器和状态空间模型（SSM））来增强PPO，使策略能够推断历史缺失信息并保持性能。在随机传感器失效过程中，我们证明了一个高概率界限，定义了无限视角奖励退化，量化了鲁棒性如何依赖于策略平滑性和失效持久性。在MuJoCo连续对照基准测试中，传感器严重掉落，我们表明基于Transformer的序列策略在鲁棒性上显著优于MLP、RNN和SSM基线，即使大量传感器不可用，也能保持高回报。这些结果表明，时间序列推理为在传感器不可靠性引起的观测漂移下，提供了一种原则性和实用的可靠机制。
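
One simple way to picture the temporally persistent sensor failures discussed in the abstract is a two-state Markov chain per sensor, where a failed sensor tends to stay failed for several steps. The sketch below generates such a dropout mask and applies it to an observation. The Markov model, the parameter names, and the zero-filling of failed readings are assumptions for illustration; the abstract only says the failure process is stochastic and persistent.

```python
import random
from typing import List


def simulate_sensor_mask(
    num_sensors: int,
    num_steps: int,
    p_fail: float,      # chance a working sensor fails at each step
    p_recover: float,   # chance a failed sensor recovers at each step
    seed: int = 0,
) -> List[List[int]]:
    """Return a num_steps x num_sensors mask (1 = working, 0 = failed)."""
    rng = random.Random(seed)
    working = [True] * num_sensors
    mask = []
    for _ in range(num_steps):
        for i in range(num_sensors):
            if working[i]:
                working[i] = rng.random() >= p_fail
            else:
                working[i] = rng.random() < p_recover
        mask.append([int(w) for w in working])
    return mask


def apply_mask(observation: List[float], mask_row: List[int]) -> List[float]:
    """Zero out readings from failed sensors."""
    return [x * m for x, m in zip(observation, mask_row)]


mask = simulate_sensor_mask(num_sensors=4, num_steps=50, p_fail=0.1, p_recover=0.05)
masked = apply_mask([1.0, 2.0, 3.0, 4.0], [1, 0, 1, 0])
```

A small `p_recover` gives long outages, so a policy that sees only the current masked observation loses information that a sequence model could in principle recover from history.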

Optimizing Language Models for Crosslingual Knowledge Consistency

优化跨语言知识一致性的语言模型

Authors: Tianyu Liu, Jirui Qi, Mrinmaya Sachan, Ryan Cotterell, Raquel Fernández, Arianna Bisazza
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.04678
Pdf link: https://arxiv.org/pdf/2603.04678
Abstract Large language models are known to often exhibit inconsistent knowledge. This is particularly problematic in multilingual scenarios, where models are likely to be asked similar questions in different languages, and inconsistent responses can undermine their reliability. In this work, we show that this issue can be mitigated using reinforcement learning with a structured reward function, which leads to an optimal policy with consistent crosslingual responses. We introduce Direct Consistency Optimization (DCO), a DPO-inspired method that requires no explicit reward model and is derived directly from the LLM itself. Comprehensive experiments show that DCO significantly improves crosslingual consistency across diverse LLMs and outperforms existing methods when training with samples of multiple languages, while complementing DPO when gold labels are available. Extra experiments demonstrate the effectiveness of DCO in bilingual settings, significant out-of-domain generalizability, and controllable alignment via direction hyperparameters. Taken together, these results establish DCO as a robust and efficient solution for improving knowledge consistency across languages in multilingual LLMs. All code, training scripts, and evaluation benchmarks are released at this https URL.
中文摘要 大型语言模型常常表现出不一致的知识。这在多语言场景中尤为严重，模型可能在不同语言中被问到相似问题，且回答不一致会削弱其可靠性。本研究展示了通过结构化奖励函数的强化学习来缓解这一问题，从而制定出最优策略，实现一致的跨语言反应。我们介绍了直接一致性优化（DCO），这是一种受DPO启发的方法，无需显式奖励模型，直接源自LLM本身。综合实验表明，DCO显著提升了跨语言模型间的一致性，并在使用多语言样本训练时优于现有方法，同时在有金标签时补充DPO。额外实验展示了DCO在双语环境中的有效性、显著的域外泛化能力以及通过方向超参数可控比对。综合来看，这些结果确立了 DCO 作为提升多语言大型语言模型中知识一致性的稳健高效解决方案。所有代码、训练脚本和评估基准均在此 https URL 发布。

LLM-Guided Decentralized Exploration with Self-Organizing Robot Teams

LLM引导的去中心化探索与自组织机器人团队

Authors: Hiroaki Kawashima, Shun Ikejima, Takeshi Takai, Mikita Miyaguchi, Yasuharu Kunii
Subjects: Robotics (cs.RO); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2603.04762
Pdf link: https://arxiv.org/pdf/2603.04762
Abstract When individual robots have limited sensing capabilities or insufficient fault tolerance, it becomes necessary for multiple robots to form teams during exploration, thereby increasing the collective observation range and reliability. Traditionally, swarm formation has often been managed by a central controller; however, from the perspectives of robustness and flexibility, it is preferable for the swarm to operate autonomously even in the absence of centralized control. In addition, the determination of exploration targets for each team is crucial for efficient exploration in such multi-team exploration scenarios. This study therefore proposes an exploration method that combines (1) an algorithm for self-organization, enabling the autonomous and dynamic formation of multiple teams, and (2) an algorithm that allows each team to autonomously determine its next exploration target (destination). In particular, for (2), this study explores a novel strategy based on large language models (LLMs), while classical frontier-based methods and deep reinforcement learning approaches have been widely studied. The effectiveness of the proposed method was validated through simulations involving tens to hundreds of robots.
中文摘要 当单个机器人的感应能力有限或容错不足时，多个机器人在探索时必须组成团队，从而增加集体观测范围和可靠性。传统上，群体形成通常由中央控制者管理;然而，从鲁棒性和灵活性的角度来看，即使没有集中控制，群体也更希望能够自主运行。此外，确定每个团队的勘探目标对于在此类多组探索场景中高效探索至关重要。因此，本研究提出了一种探索方法，结合了（1）自组织算法，实现多支团队的自主且动态组建，以及（2）允许每个团队自主确定下一个探索目标（目的地）的算法。特别是，对于（2），本研究探讨了基于大型语言模型（LLM）的新策略，而经典的基于前沿的方法和深度强化学习方法已被广泛研究。通过涉及数十到数百台机器人的模拟，验证了该方法的有效性。

Distributional Reinforcement Learning with Information Bottleneck for Uncertainty-Aware DRAM Equalization

信息瓶颈分布式强化学习用于不确定性感知DRAM均衡

Authors: Muhammad Usama, Dong Eui Chang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.04768
Pdf link: https://arxiv.org/pdf/2603.04768
Abstract Equalizer parameter optimization is critical for signal integrity in high-speed memory systems operating at multi-gigabit data rates. However, existing methods suffer from computationally expensive eye diagram evaluation, optimization of expected rather than worst-case performance, and absence of uncertainty quantification for deployment decisions. In this paper, we propose a distributional risk-sensitive reinforcement learning framework integrating Information Bottleneck latent representations with Conditional Value-at-Risk optimization. We introduce rate-distortion optimal signal compression achieving 51 times speedup over eye diagrams while quantifying epistemic uncertainty through Monte Carlo dropout. Distributional reinforcement learning with quantile regression enables explicit worst-case optimization, while PAC-Bayesian regularization certifies generalization bounds. Experimental validation on 2.4 million waveforms from eight memory units demonstrated mean improvements of 37.1% and 41.5% for 4-tap and 8-tap equalizer configurations with worst-case guarantees of 33.8% and 38.2%, representing 80.7% and 89.1% improvements over Q-learning baselines. The framework achieved 62.5% high-reliability classification eliminating manual validation for most configurations. These results suggest the proposed framework provides a practical solution for production-scale equalizer optimization with certified worst-case guarantees.
中文摘要 均衡器参数优化对于高速内存系统中以多千兆数据速率运行的信号完整性至关重要。然而，现有方法存在计算成本高昂的眼图评估、预期性能优化（而非最坏情况）以及部署决策缺乏不确定性量化的问题。本文提出一个分布式风险敏感强化学习框架，整合信息瓶颈潜在表征与条件风险价值优化。我们引入了速率失真最优信号压缩，实现眼图加速51倍，同时通过蒙特卡洛降频量化认知不确定性。带有分位数回归的分布强化学习实现了显式的最坏情况优化，而PAC-贝叶斯正则化则则证明了推广界限。对8个内存单元的240万波形进行实验验证，显示4-tap和8-tap均衡器配置的平均提升分别为37.1%和41.5%，最坏情况保证分别为33.8%和38.2%，相较Q学习基线分别提升80.7%和89.1%。该框架实现了62.5%的高可靠性分类，大多数配置无需人工验证。这些结果表明，所提出的框架为生产规模均衡器优化提供了实用的解决方案，并具备经过认证的最坏情况保证。
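
The abstract mentions distributional reinforcement learning with quantile regression. The sketch below shows the standard quantile-regression (pinball) loss that such methods are typically built on, written for a single scalar target. The function name, the plain (non-Huber) form of the loss, and the example quantile levels are assumptions for illustration rather than details from the paper.

```python
from typing import Sequence


def pinball_loss(
    target: float,
    predicted_quantiles: Sequence[float],
    taus: Sequence[float],
) -> float:
    """Average quantile-regression (pinball) loss over all quantile levels."""
    if len(predicted_quantiles) != len(taus) or not taus:
        raise ValueError("need one prediction per quantile level")
    total = 0.0
    for q, tau in zip(predicted_quantiles, taus):
        error = target - q
        # Under-prediction is weighted by tau, over-prediction by (1 - tau).
        total += tau * error if error >= 0 else (tau - 1.0) * error
    return total / len(taus)


# Hypothetical quantile levels and predictions for one observed return.
taus = [0.1, 0.5, 0.9]
loss = pinball_loss(target=1.0, predicted_quantiles=[0.0, 1.0, 2.0], taus=taus)
```

The asymmetric weighting is what pushes each prediction toward its own quantile level, so the low quantiles end up describing the worst-case tail of the return distribution.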

Selfish Cooperation Towards Low-Altitude Economy: Integrated Multi-Service Deployment with Resilient Federated Reinforcement Learning

自私合作迈向低空经济：集成多军种部署与韧性联邦强化学习

Authors: Yuxuan Yang, Bin Lyu, Abbas Jamalipour
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2603.04779
Pdf link: https://arxiv.org/pdf/2603.04779
Abstract The low-altitude economy (LAE) is a rapidly emerging paradigm that builds a service-centric economic ecosystem through large-scale and sustainable uncrewed aerial vehicle (UAV)-enabled service provisioning, reflecting the transition of the 6G era from technological advancement toward commercial deployment. The significant market potential of LAE attracts an increasing number of service providers (SPs), resulting in intensified competition in service deployment. In this paper, we study a realistic LAE scenario in which multiple SPs dynamically deploy UAVs to deliver multiple services to user hotspots, aiming to jointly optimize communication and computation resource allocation. To resolve deployment competition among SPs, an authenticity-guaranteed auction mechanism is designed, and game-theoretic analysis is conducted to establish the solvability of the proposed resource allocation problem. Furthermore, a resilient federated reinforcement learning (FRL)-based solution is developed with strong fault tolerance, effectively countering transmission errors and malicious competition while facilitating potential cooperation among self-interested SPs. Simulation results demonstrate that the proposed approach significantly improves service performance and robustness compared with baseline methods, providing a practical and scalable solution for competitive LAE service deployment.
中文摘要 低空经济（LAE）是一种迅速兴起的范式，通过大规模且可持续的无人机（UAV）服务提供，构建以服务为中心的经济生态系统，反映了6G时代从技术进步向商业部署的转变。LAE的巨大市场潜力吸引了越来越多的服务提供商（SP），导致服务部署竞争加剧。本文研究了一个现实的LAE场景，多个SP动态部署无人机，向用户热点提供多项服务，旨在联合优化通信和计算资源分配。为解决SP间的部署竞争，设计了一种真实性保证的拍卖机制，并进行了博弈论分析以确定所提资源分配问题的可解性。此外，开发了一种具有强容错能力的韧性联邦强化学习（FRL）解决方案，有效抵消传输错误和恶意竞争，同时促进了自利SP之间的潜在合作。模拟结果表明，所提方法相较基线方法显著提升了服务性能和鲁棒性，为竞争性LAE服务部署提供了实用且可扩展的解决方案。

Adaptive Personalized Federated Reinforcement Learning for RIS-Assisted Aerial Relays in SAGINs with Fluid Antennas

适用于带流体天线的SAGINs中RIS辅助天线中继的自适应个性化联合强化学习

Authors: Yuxuan Yang, Bin Lyu, Abbas Jamalipour
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2603.04788
Pdf link: https://arxiv.org/pdf/2603.04788
Abstract Space-air-ground integrated networks (SAGINs) interconnect satellites, uncrewed aerial vehicles (UAVs), and ground devices to enable flexible and ubiquitous wireless services. The integration of reconfigurable intelligent surfaces (RISs) and fluid antenna systems (FASs) further enhances radio environment controllability. However, the tight integration of cross-layer facilities and radio enhancement technologies leads to pronounced environmental dynamics and heterogeneity, posing fundamental challenges for system modeling and optimization in large-scale SAGINs. This paper investigates a SAGIN in which low Earth orbit (LEO) satellite constellations communicate with multiple ground hotspots via RIS-assisted UAV relays, serving both FAS-equipped and conventional users. A system model is developed that explicitly captures satellite mobility, UAV trajectories, RIS phase control, and heterogeneous user reception capabilities. Accordingly, a multi-hotspot downlink rate maximization problem is studied, whose solvability is analyzed through a hierarchical Stackelberg game. To address heterogeneous and time-varying multi-hotspot environments, an adaptive personalized federated reinforcement learning (FRL) algorithm is proposed for adaptive optimization of UAV trajectories and RIS phase controls. Simulation results demonstrate superior performance and validate the effectiveness of personalization in dynamic heterogeneous SAGIN scenarios.
中文摘要 空地集成网络（SAGIN）连接卫星、无人飞行器（UAV）和地面设备，实现灵活且普及的无线服务。可重构智能表面（RIS）和流体天线系统（FAS）的集成进一步提升了无线电环境的可控性。然而，跨层设施与无线增强技术的紧密集成导致明显的环境动态和异质性，给大规模SAGIN中的系统建模和优化带来了根本性挑战。本文探讨了一种SAGIN，即低地球轨道（LEO）卫星星座通过RIS辅助无人机中继与多个地面热点通信，服务于配备FAS的和常规用户。开发了一个系统模型，明确捕捉卫星移动性、无人机轨迹、RIS相位控制及异构用户接收能力。因此，研究了一个多热点下行速率最大化问题，其可解性通过分层斯塔克尔伯格博弈进行分析。为应对异构且时变的多热点环境，提出了一种自适应个性化联邦强化学习（FRL）算法，用于无人机轨迹和RIS相位控制的自适应优化。模拟结果显示了更优的性能，并验证了个性化在动态异构SAGIN场景中的有效性。

Diffusion Policy through Conditional Proximal Policy Optimization

通过条件近端策略优化实现扩散策略

Authors: Ben Liu, Shunpeng Yang, Hua Chen
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.04790
Pdf link: https://arxiv.org/pdf/2603.04790
Abstract Reinforcement learning (RL) has been extensively employed in a wide range of decision-making problems, such as games and robotics. Recently, diffusion policies have shown strong potential in modeling multi-modal behaviors, enabling more diverse and flexible action generation compared to the conventional Gaussian policy. Despite various attempts to combine RL with diffusion, a key challenge is the difficulty of computing action log-likelihood under the diffusion model. This greatly hinders the direct application of diffusion policies in on-policy reinforcement learning. Most existing methods calculate or approximate the log-likelihood through the entire denoising process in the diffusion model, which can be memory- and computationally inefficient. To overcome this challenge, we propose a novel and efficient method to train a diffusion policy in an on-policy setting that requires only evaluating a simple Gaussian probability. This is achieved by aligning the policy iteration with the diffusion process, which is a distinct paradigm compared to previous work. Moreover, our formulation can naturally handle entropy regularization, which is often difficult to incorporate into diffusion policies. Experiments demonstrate that the proposed method produces multimodal policy behaviors and achieves superior performance on a variety of benchmark tasks in both IsaacLab and MuJoCo Playground.
中文摘要 强化学习（RL）已被广泛应用于游戏和机器人等多种决策问题中。近年来，扩散策略在建模多模态行为方面展现出强大潜力，使行动生成比传统高斯策略更为丰富和灵活。尽管有多种尝试将强化学习与扩散结合，但一个关键挑战是在扩散模型下计算作用对数似然的困难。这极大地阻碍了扩散策略在策略内强化学习中的直接应用。大多数现有方法在扩散模型中计算或近似于整个去噪过程的对数似然，这在内存和计算上可能效率较低。为克服这一挑战，我们提出了一种新颖且高效的方法，在仅需评估简单高斯概率的策略参数中训练扩散策略。这通过将政策迭代与扩散过程对齐实现，扩散过程与以往工作形成了不同的范式。此外，我们的表述能够自然处理熵正则化，而熵通常难以纳入扩散策略。实验表明，所提方法在IsaacLab和MuJoCo Playground中能够产生多模态策略行为，并在多种基准任务中实现了更优的表现。

VISA: Value Injection via Shielded Adaptation for Personalized LLM Alignment

VISA：通过屏蔽适应注入价值，实现个性化LLM对齐

Authors: Jiawei Chen, Tianzhuo Yang, Guoxi Zhang, Jiaming Ji, Yaodong Yang, Juntao Dai
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.04822
Pdf link: https://arxiv.org/pdf/2603.04822
Abstract Aligning Large Language Models (LLMs) with nuanced human values remains a critical challenge, as existing methods like Reinforcement Learning from Human Feedback (RLHF) often handle only coarse-grained attributes. In practice, fine-tuning LLMs on task-specific datasets to optimize value alignment inevitably incurs an alignment tax: the model's pre-calibrated value system drifts significantly due to latent bias absorption from training data, while the fine-tuning process also causes severe hallucinations and semantic information loss in generated responses. To address this, we propose VISA (Value Injection via Shielded Adaptation), a closed-loop framework designed to navigate this trade-off. VISA's architecture features a high-precision value detector, a semantic-to-value translator, and a core value-rewriter. The value-rewriter is trained via Group Relative Policy Optimization (GRPO) with a composite reward function that simultaneously optimizes for fine-grained value precision, and the preservation of semantic integrity. By learning an optimal policy to balance these competing objectives, VISA effectively mitigates the alignment tax while staying loyal to the original knowledge. Our experiments demonstrate that this approach enables precise control over a model's value expression while maintaining its factual consistency and general capabilities, significantly outperforming both standard fine-tuning methods and prompting-based baselines, including GPT-4o.
中文摘要 将大型语言模型（LLMs）与细致的人类价值观对齐仍是一个关键挑战，因为现有方法如人类反馈强化学习（RLHF）通常只处理粗粒度属性。实际上，针对任务特定数据集进行微调以优化价值对齐的LLM必然会带来比对税：模型的预校准值系统因训练数据中的潜在偏误吸收而显著漂移，而微调过程还会导致生成响应出现严重的幻觉和语义信息丢失。为此，我们提出了VISA（通过屏蔽适应注入价值）框架，旨在应对这一权衡。VISA的架构具备高精度值检测器、语义转值转换器和核心值重写器。值重写器通过组相对策略优化（Group Relative Policy Optimization，GRPO）训练，采用复合奖励函数，同时优化细粒度的价值精度和保持语义完整性。通过学习平衡这些相互竞争目标的最优政策，VISA有效减轻了对齐税，同时忠于原始知识。我们的实验表明，这种方法能够精确控制模型的价值表达，同时保持其事实一致性和通用能力，显著优于标准微调方法和基于提示的基线，包括GPT-4o。
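
The abstract says the value-rewriter is trained with GRPO under a composite reward. The sketch below shows the group-relative step that GRPO is generally known for: rewards for several responses to the same input are standardized within the group. The linear reward mix, its weight, the use of the population standard deviation, and the example scores are assumptions for illustration and are not taken from the paper.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def composite_reward(value_score: float, semantic_score: float, weight: float = 0.5) -> float:
    """Weighted mix of a value-precision score and a semantic-preservation score.

    The linear form and the weight are assumptions for illustration.
    """
    return weight * value_score + (1.0 - weight) * semantic_score


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of responses to the same prompt."""
    if not rewards:
        raise ValueError("group must contain at least one reward")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical (value, semantic) scores for four rewrites of one input.
scores = [(0.9, 0.8), (0.4, 0.9), (0.7, 0.3), (0.2, 0.2)]
rewards = [composite_reward(v, s) for v, s in scores]
advantages = group_relative_advantages(rewards)
```

Because the baseline is the group mean, no separate value network is needed, and a rewrite is rewarded only for being better than its siblings on the combined score.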

SCoUT: Scalable Communication via Utility-Guided Temporal Grouping in Multi-Agent Reinforcement Learning

SCoUT：通过多智能体强化学习中的实用工具引导时间分组实现的可扩展通信

Authors: Manav Vora, Gokul Puthumanaillam, Hiroyasu Tsukamoto, Melkior Ornik
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.04833
Pdf link: https://arxiv.org/pdf/2603.04833
Abstract Communication can improve coordination in partially observed multi-agent reinforcement learning (MARL), but learning when and who to communicate with requires choosing among many possible sender-recipient pairs, and the effect of any single message on future reward is hard to isolate. We introduce SCoUT (Scalable Communication via Utility-guided Temporal grouping), which addresses both these challenges via temporal and agent abstraction within traditional MARL. During training, SCoUT resamples soft agent groups every K environment steps (macro-steps) via Gumbel-Softmax; these groups are latent clusters that induce an affinity used as a differentiable prior over recipients. Using the same assignments, a group-aware critic predicts values for each agent group and maps them to per-agent baselines through the same soft assignments, reducing critic complexity and variance. Each agent is trained with a three-headed policy: environment action, send decision, and recipient selection. To obtain precise communication learning signals, we derive counterfactual communication advantages by analytically removing each sender's contribution from the recipient's aggregated messages. This counterfactual computation enables precise credit assignment for both send and recipient-selection decisions. At execution time, all centralized training components are discarded and only the per-agent policy is run, preserving decentralized execution. Project website, videos and code: this https URL
中文摘要 在部分可观测的多智能体强化学习（MARL）中，通信可以改善协调，但学习何时通信以及与谁通信需要在众多可能的发送者-接收者组合中进行选择，且任何单条消息对未来奖励的影响难以单独分离。我们提出了SCoUT（Scalable Communication via Utility-guided Temporal grouping，基于效用引导时间分组的可扩展通信），在传统MARL框架内通过时间抽象和智能体抽象解决了这两个挑战。在训练过程中，SCoUT通过Gumbel-Softmax每隔K个环境步（宏步）重新采样软智能体分组；这些分组是潜在聚类，由其诱导出的亲和度被用作接收者上的可微先验。利用相同的分配，分组感知的评论家（critic）为每个智能体分组预测价值，并通过相同的软分配将其映射为每个智能体的基线，从而降低评论家的复杂度和方差。每个智能体采用三头策略进行训练：环境动作、发送决策和接收者选择。为了获得精确的通信学习信号，我们通过解析地从接收者的聚合消息中移除每个发送者的贡献，推导出反事实通信优势。这种反事实计算使发送决策和接收者选择决策都能得到精确的信用分配。执行时，所有集中式训练组件都被丢弃，仅运行每个智能体的策略，从而保持去中心化执行。项目网站、视频和代码：this https URL
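The soft group resampling described in the abstract can be sketched roughly as follows. This is a minimal illustration of Gumbel-Softmax sampling at macro-step boundaries, not the authors' implementation; the logits, the temperature, and the value of `K` are hypothetical placeholders.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Draw one relaxed (soft) one-hot sample over len(logits) groups."""
    noisy = []
    for logit in logits:
        # Clamp the uniform draw so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
K = 5                            # hypothetical macro-step length
group_logits = [0.5, 0.0, -0.5]  # hypothetical per-group logits for one agent
assignment = None
for step in range(12):
    if step % K == 0:  # resample only at macro-step boundaries
        assignment = gumbel_softmax(group_logits, temperature=0.5, rng=rng)
    print(step, [round(w, 3) for w in assignment])
```

Between boundaries the assignment is held fixed, which is what makes the grouping a temporal abstraction; lower temperatures push the soft assignment closer to a hard one-hot choice.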

BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning

BandPO：通过概率感知界限桥接信任区域与比率剪裁，用于LLM强化学习

Authors: Yuan Li, Bo Wang, Yufei Gao, Yuqian Yao, Xinyuan Wang, Zhangyue Yin, Xipeng Qiu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.04918
Pdf link: https://arxiv.org/pdf/2603.04918
Abstract Proximal constraints are fundamental to the stability of the Large Language Model reinforcement learning. While the canonical clipping mechanism in PPO serves as an efficient surrogate for trust regions, we identify a critical bottleneck: fixed bounds strictly constrain the upward update margin of low-probability actions, disproportionately suppressing high-advantage tail strategies and inducing rapid entropy collapse. To address this, we introduce Band-constrained Policy Optimization (BandPO). BandPO replaces canonical clipping with Band, a unified theoretical operator that projects trust regions defined by f-divergences into dynamic, probability-aware clipping intervals. Theoretical analysis confirms that Band effectively resolves this exploration bottleneck. We formulate this mapping as a convex optimization problem, guaranteeing a globally optimal numerical solution while deriving closed-form solutions for specific divergences. Extensive experiments across diverse models and datasets demonstrate that BandPO consistently outperforms canonical clipping and Clip-Higher, while robustly mitigating entropy collapse.
中文摘要 近端约束是大型语言模型强化学习稳定性的基础。虽然PPO中的标准裁剪机制可作为信任区域的高效替代，但我们发现了一个关键瓶颈：固定边界严格限制了低概率动作的向上更新幅度，不成比例地抑制高优势的尾部策略，并导致快速的熵坍缩。为此，我们引入了带约束策略优化（BandPO）。BandPO用Band替代了标准裁剪，Band是一种统一的理论算子，将由f-散度定义的信任区域投影为动态的、概率感知的裁剪区间。理论分析证实，Band有效解决了这一探索瓶颈。我们将该映射表述为凸优化问题，保证得到全局最优的数值解，同时针对特定散度推导出闭式解。在多种模型和数据集上的大量实验表明，BandPO始终优于标准裁剪和Clip-Higher，同时稳健地缓解了熵坍缩。
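The bottleneck the abstract attributes to fixed clipping bounds can be made concrete with a little arithmetic. The sketch below illustrates only standard PPO ratio clipping with a symmetric `eps`; it does not implement the Band operator, and the function name is hypothetical.

```python
def clipped_probability_range(p_old, eps=0.2):
    """Range of new action probabilities reachable before the PPO
    ratio clip [1 - eps, 1 + eps] stops the gradient."""
    low = p_old * (1.0 - eps)
    high = min(1.0, p_old * (1.0 + eps))  # a probability cannot exceed 1
    return low, high


for p_old in (0.01, 0.5, 0.9):
    low, high = clipped_probability_range(p_old)
    print(f"p_old={p_old:.2f}  new p in [{low:.3f}, {high:.3f}]  "
          f"upward margin={high - p_old:.3f}")
```

Because the bound is multiplicative, an action with old probability 0.01 can gain at most 0.002 in one update, while an action at 0.5 can gain 0.1. This is the asymmetry that probability-aware intervals are meant to address.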

Federated Heterogeneous Language Model Optimization for Hybrid Automatic Speech Recognition

面向混合自动语音识别的联邦异构语言模型优化

Authors: Mengze Hong, Yi Gu, Di Jiang, Hanlin Gu, Chen Jason Zhang, Lu Wang, Zhiyang Su
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.04945
Pdf link: https://arxiv.org/pdf/2603.04945
Abstract Training automatic speech recognition (ASR) models increasingly relies on decentralized federated learning to ensure data privacy and accessibility, producing multiple local models that require effective merging. In hybrid ASR systems, while acoustic models can be merged using established methods, the language model (LM) for rescoring the N-best speech recognition list faces challenges due to the heterogeneity of non-neural n-gram models and neural network models. This paper proposes a heterogeneous LM optimization task and introduces a match-and-merge paradigm with two algorithms: the Genetic Match-and-Merge Algorithm (GMMA), using genetic operations to evolve and pair LMs, and the Reinforced Match-and-Merge Algorithm (RMMA), leveraging reinforcement learning for efficient convergence. Experiments on seven OpenSLR datasets show RMMA achieves the lowest average Character Error Rate and better generalization than baselines, converging up to seven times faster than GMMA, highlighting the paradigm's potential for scalable, privacy-preserving ASR systems.
中文摘要 自动语音识别（ASR）模型的训练越来越依赖去中心化的联邦学习，以确保数据隐私和可访问性，从而产生多个需要有效合并的本地模型。在混合ASR系统中，虽然声学模型可以通过既有方法合并，但用于对N-best语音识别列表进行重打分的语言模型（LM）由于非神经n-gram模型和神经网络模型的异质性而面临挑战。本文提出了一个异构LM优化任务，并引入了一种匹配合并范式，包含两种算法：遗传匹配合并算法（GMMA），利用遗传操作来进化和配对LM；以及强化匹配合并算法（RMMA），利用强化学习实现高效收敛。在七个OpenSLR数据集上的实验显示，RMMA取得了最低的平均字符错误率和比基线更好的泛化能力，收敛速度最高可达GMMA的七倍，凸显了该范式在可扩展、保护隐私的ASR系统中的潜力。
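For readers unfamiliar with the metric reported above, Character Error Rate is conventionally the character-level edit distance divided by the reference length. A minimal sketch under that standard definition follows; the abstract does not state the paper's exact normalization or text preprocessing, so those details are assumptions here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert/delete/substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref, hyp):
    """CER = edit distance / number of reference characters."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


print(character_error_rate("speech recognition", "speech recogniton"))
```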

$\nabla$-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space

$\nabla$-Reasoner：通过潜空间中的测试时梯度下降进行大型语言模型推理

Authors: Peihao Wang, Ruisi Cai, Zhen Wang, Hongyuan Mei, Qiang Liu, Pan Li, Zhangyang Wang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.04948
Pdf link: https://arxiv.org/pdf/2603.04948
Abstract Scaling inference-time compute for Large Language Models (LLMs) has unlocked unprecedented reasoning capabilities. However, existing inference-time scaling methods typically rely on inefficient and suboptimal discrete search algorithms or trial-and-error prompting to improve the online policy. In this paper, we propose $\nabla$-Reasoner, an iterative generation framework that integrates differentiable optimization over token logits into the decoding loop to refine the policy on the fly. Our core component, Differentiable Textual Optimization (DTO), leverages gradient signals from both the LLM's likelihood and a reward model to refine textual representations. $\nabla$-Reasoner further incorporates rejection sampling and acceleration design to robustify and speed up decoding. Theoretically, we show that performing inference-time gradient descent in the sample space to maximize reward is dual to aligning an LLM policy via KL-regularized reinforcement learning. Empirically, $\nabla$-Reasoner achieves over 20% accuracy improvement on a challenging mathematical reasoning benchmark, while reducing number of model calls by approximately 10-40% compared to strong baselines. Overall, our work introduces a paradigm shift from zeroth-order search to first-order optimization at test time, offering a cost-effective path to amplify LLM reasoning.
中文摘要 扩展大型语言模型（LLM）的推理时计算解锁了前所未有的推理能力。然而，现有的推理时扩展方法通常依赖低效且次优的离散搜索算法或试错式提示来改进在线策略。本文提出了$\nabla$-Reasoner，一种迭代生成框架，将对token logits的可微优化整合进解码循环，以即时优化策略。我们的核心组件可微文本优化（DTO）利用来自LLM似然和奖励模型的梯度信号来优化文本表示。$\nabla$-Reasoner进一步结合了拒绝采样和加速设计，以增强解码的稳健性并加快解码速度。理论上，我们证明在样本空间中进行推理时梯度下降以最大化奖励，与通过KL正则化强化学习对齐LLM策略互为对偶。从实证角度看，$\nabla$-Reasoner在一个具有挑战性的数学推理基准上实现了超过20%的准确率提升，同时相比强基线减少了约10-40%的模型调用次数。总体而言，我们的工作在测试时引入了从零阶搜索到一阶优化的范式转变，为增强LLM推理提供了一条具有成本效益的路径。

3D-RFT: Reinforcement Fine-Tuning for Video-based 3D Scene Understanding

3D-RFT：基于视频的3D场景理解强化微调

Authors: Xiongkun Linghu, Jiangyong Huang, Baoxiong Jia, Siyuan Huang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.04976
Pdf link: https://arxiv.org/pdf/2603.04976
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a transformative paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs), yet its potential in 3D scene understanding remains under-explored. Existing approaches largely rely on Supervised Fine-Tuning (SFT), where the token-level cross-entropy loss acts as an indirect proxy for optimization, leading to a misalignment between training objectives and task performances. To bridge this gap, we present Reinforcement Fine-Tuning for Video-based 3D Scene Understanding (3D-RFT), the first framework to extend RLVR to video-based 3D perception and reasoning. 3D-RFT shifts the paradigm by directly optimizing the model towards evaluation metrics. 3D-RFT first activates 3D-aware Multi-modal Large Language Models (MLLMs) via SFT, followed by reinforcement fine-tuning using Group Relative Policy Optimization (GRPO) with strictly verifiable reward functions. We design task-specific reward functions directly from metrics like 3D IoU and F1-Score to provide more effective signals to guide model training. Extensive experiments demonstrate that 3D-RFT-4B achieves state-of-the-art performance on various video-based 3D scene understanding tasks. Notably, 3D-RFT-4B significantly outperforms larger models (e.g., VG LLM-8B) on 3D video detection, 3D visual grounding, and spatial reasoning benchmarks. We further reveal good properties of 3D-RFT such as robust efficacy, and valuable insights into training strategies and data impact. We hope 3D-RFT can serve as a robust and promising paradigm for future development of 3D scene understanding.
中文摘要 带可验证奖励的强化学习（RLVR）已成为提升大型语言模型（LLMs）推理能力的变革性范式，但其在三维场景理解中的潜力仍未被充分探索。现有方法主要依赖监督微调（SFT），其中token级交叉熵损失只是优化的间接代理，导致训练目标与任务表现之间出现错位。为弥合这一差距，我们提出了基于视频的三维场景理解强化微调（3D-RFT），这是首个将RLVR扩展到基于视频的三维感知与推理的框架。3D-RFT通过直接朝评估指标优化模型来改变这一范式。3D-RFT首先通过SFT激活具有3D感知能力的多模态大型语言模型（MLLMs），随后使用带有严格可验证奖励函数的Group Relative Policy Optimization（GRPO）进行强化微调。我们直接根据3D IoU和F1-Score等指标设计任务特定的奖励函数，以提供更有效的信号来指导模型训练。大量实验表明，3D-RFT-4B在多种基于视频的3D场景理解任务中实现了最先进的性能。值得注意的是，3D-RFT-4B在3D视频检测、3D视觉定位和空间推理基准测试中显著优于更大的模型（如VG LLM-8B）。我们还揭示了3D-RFT的良好特性，如稳健的有效性，以及关于训练策略和数据影响的宝贵见解。我们希望3D-RFT能成为未来3D场景理解发展的稳健且有前景的范式。
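A metric-derived reward combined with group-relative normalization might look roughly like the sketch below. It assumes axis-aligned boxes encoded as `(xmin, ymin, zmin, xmax, ymax, zmax)` and population-standard-deviation normalization; the paper's actual box parameterization and reward shaping are not given in the abstract.

```python
from statistics import mean, pstdev


def iou_3d_axis_aligned(a, b):
    """IoU of two axis-aligned 3D boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for i in range(3):
        overlap = min(a[i + 3], b[i + 3]) - max(a[i], b[i])
        if overlap <= 0:
            return 0.0  # disjoint along this axis
        inter *= overlap

    def volume(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    return inter / (volume(a) + volume(b) - inter)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within one sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


target = (0, 0, 0, 2, 2, 2)
predictions = [(0, 0, 0, 2, 2, 2), (1, 1, 1, 3, 3, 3), (5, 5, 5, 6, 6, 6)]
rewards = [iou_3d_axis_aligned(p, target) for p in predictions]
print(rewards, group_relative_advantages(rewards))
```

Here the verifiable reward is the IoU itself, so the policy is pushed directly toward the evaluation metric rather than toward token-level likelihood.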

Competitive Multi-Operator Reinforcement Learning for Joint Pricing and Fleet Rebalancing in AMoD Systems

竞争性多运营商强化学习，用于AMoD系统中的联合定价和车队再平衡

Authors: Emil Kragh Toft, Carolin Schmidt, Daniele Gammelli, Filipe Rodrigues
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2603.05000
Pdf link: https://arxiv.org/pdf/2603.05000
Abstract Autonomous Mobility-on-Demand (AMoD) systems promise to revolutionize urban transportation by providing affordable on-demand services to meet growing travel demand. However, realistic AMoD markets will be competitive, with multiple operators competing for passengers through strategic pricing and fleet deployment. While reinforcement learning has shown promise in optimizing single-operator AMoD control, existing work fails to capture competitive market dynamics. We investigate the impact of competition on policy learning by introducing a multi-operator reinforcement learning framework where two operators simultaneously learn pricing and fleet rebalancing policies. By integrating discrete choice theory, we enable passenger allocation and demand competition to emerge endogenously from utility-maximizing decisions. Experiments using real-world data from multiple cities demonstrate that competition fundamentally alters learned behaviors, leading to lower prices and distinct fleet positioning patterns compared to monopolistic settings. Notably, we demonstrate that learning-based approaches are robust to the additional stochasticity of competition, with competitive agents successfully converging to effective policies while accounting for partially unobserved competitor strategies.
中文摘要 自动驾驶按需出行（AMoD）系统有望通过提供经济实惠的按需服务来彻底革新城市交通，以满足日益增长的出行需求。然而，现实中的AMoD市场将是竞争性的，多个运营商通过策略性定价和车队部署来争夺乘客。尽管强化学习在优化单运营商AMoD控制方面展现出潜力，但现有工作未能刻画竞争性市场动态。我们通过引入多运营商强化学习框架来研究竞争对策略学习的影响，其中两家运营商同时学习定价和车队再平衡策略。通过整合离散选择理论，我们使乘客分配和需求竞争能够从效用最大化决策中内生地产生。利用多个城市的真实世界数据进行的实验表明，竞争从根本上改变了学习到的行为，与垄断环境相比，价格更低，车队定位模式也明显不同。值得注意的是，我们证明基于学习的方法对竞争带来的额外随机性具有鲁棒性，竞争中的智能体能够成功收敛到有效的策略，同时应对仅被部分观测到的竞争对手策略。
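The way discrete choice theory lets demand shares emerge from prices can be sketched with a multinomial logit model. The utility form below (negative price times a sensitivity, plus an outside option) is an illustrative assumption; the abstract does not specify the paper's utility specification.

```python
import math


def choice_probabilities(prices, price_sensitivity=1.0, outside_utility=0.0):
    """Multinomial-logit shares for each operator plus a final 'no trip' option."""
    utilities = [-price_sensitivity * p for p in prices] + [outside_utility]
    peak = max(utilities)  # subtract the max for numerical stability
    exps = [math.exp(u - peak) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]


# Two hypothetical operators; the second undercuts the first.
shares = choice_probabilities([2.0, 1.5], price_sensitivity=1.2, outside_utility=-3.0)
print([round(s, 3) for s in shares])
```

Under this model each operator's demand depends on the competitor's price, so neither operator faces a stationary environment while the other is still learning.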

BioLLMAgent: A Hybrid Framework with Enhanced Structural Interpretability for Simulating Human Decision-Making in Computational Psychiatry

BioLLMAgent：一种具有增强结构可解释性的混合框架，用于计算精神病学中模拟人类决策

Authors: Zuo Fei, Kezhi Wang, Xiaomin Chen, Yizhou Huang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.05016
Pdf link: https://arxiv.org/pdf/2603.05016
Abstract Computational psychiatry faces a fundamental trade-off: traditional reinforcement learning (RL) models offer interpretability but lack behavioral realism, while large language model (LLM) agents generate realistic behaviors but lack structural interpretability. We introduce BioLLMAgent, a novel hybrid framework that combines validated cognitive models with the generative capabilities of LLMs. The framework comprises three core components: (i) an Internal RL Engine for experience-driven value learning; (ii) an External LLM Shell for high-level cognitive strategies and therapeutic interventions; and (iii) a Decision Fusion Mechanism for integrating components via weighted utility. Comprehensive experiments on the Iowa Gambling Task (IGT) across six clinical and healthy datasets demonstrate that BioLLMAgent accurately reproduces human behavioral patterns while maintaining excellent parameter identifiability (correlations $>0.67$). Furthermore, the framework successfully simulates cognitive behavioral therapy (CBT) principles and reveals, through multi-agent dynamics, that community-wide educational interventions may outperform individual treatments. Validated across reward-punishment learning and temporal discounting tasks, BioLLMAgent provides a structurally interpretable "computational sandbox" for testing mechanistic hypotheses and intervention strategies in psychiatric research.
中文摘要 计算精神病学面临一个根本的权衡：传统的强化学习（RL）模型具有可解释性但缺乏行为真实性；而大型语言模型（LLM）智能体能生成逼真的行为但缺乏结构可解释性。我们提出BioLLMAgent，一个将经过验证的认知模型与LLM生成能力相结合的新型混合框架。该框架包含三个核心组件：（i）用于经验驱动价值学习的内部强化学习引擎；（ii）用于高级认知策略和治疗干预的外部LLM外壳；以及（iii）通过加权效用整合各组件的决策融合机制。在爱荷华赌博任务（IGT）上针对六个临床和健康人群数据集进行的全面实验表明，BioLLMAgent准确地复现了人类行为模式，同时保持了出色的参数可识别性（相关性 $>0.67$）。此外，该框架成功模拟了认知行为疗法（CBT）原则，并通过多智能体动态揭示，社区范围的教育干预可能优于个体治疗。BioLLMAgent在奖励-惩罚学习和时间折扣任务中得到了验证，为检验精神病学研究中的机制假设和干预策略提供了一个结构可解释的“计算沙盒”。

Formal Entropy-Regularized Control of Stochastic Systems

随机系统的形式熵正则化控制

Authors: Menno van Zutphen, Giannis Delimpaltadakis, Duarte J. Antunes
Subjects: Systems and Control (eess.SY); Information Theory (cs.IT); Dynamical Systems (math.DS); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2603.05021
Pdf link: https://arxiv.org/pdf/2603.05021
Abstract Analyzing and controlling system entropy is a powerful tool for regulating predictability of control systems. Applications benefiting from such approaches range from reinforcement learning and data security to human-robot collaboration. In continuous-state stochastic systems, accurate entropy analysis and control remains a challenge. In recent years, finite-state abstractions of continuous systems have enabled control synthesis with formal performance guarantees on objectives such as stage costs. However, these results do not extend to entropy-based performance measures. We solve this problem by first obtaining bounds on the entropy of system discretizations using traditional formal-abstractions results, and then obtaining an additional bound on the difference between the entropy of a continuous distribution and that of its discretization. The resulting theory enables formal entropy-aware controller synthesis that trades predictability against control performance while preserving formal guarantees for the original continuous system. More specifically, we focus on minimizing the linear combination of the KL divergence of the system trajectory distribution to uniform -- our system entropy metric -- and a generic cumulative cost. We note that the bound we derive on the difference between the KL divergence to uniform of a given continuous distribution and its discretization can also be relevant in more general information-theoretic contexts. A set of case studies illustrates the effectiveness of the method.
中文摘要 分析和控制系统熵是调节控制系统可预测性的强大工具。受益于此类方法的应用涵盖强化学习、数据安全以及人机协作等领域。在连续状态随机系统中，准确的熵分析和控制仍是一个挑战。近年来，连续系统的有限状态抽象使控制综合成为可能，并能对阶段成本等目标给出形式化的性能保证。然而，这些结果并不能推广到基于熵的性能指标。我们解决这一问题的方法是：先利用传统的形式化抽象结果求得系统离散化的熵的界限，再求得连续分布的熵与其离散化的熵之差的额外界限。由此得到的理论使形式化的熵感知控制器综合成为可能，可在可预测性与控制性能之间进行权衡，同时保持对原始连续系统的形式化保证。更具体地说，我们关注最小化系统轨迹分布到均匀分布的KL散度（即我们的系统熵度量）与一般累积成本的线性组合。我们注意到，我们针对给定连续分布到均匀分布的KL散度与其离散化的KL散度之差所推导出的界限，在更一般的信息论语境中也可能具有意义。一组案例研究展示了该方法的有效性。
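The reason a KL divergence to uniform can serve as an entropy metric is easiest to see in the finite discrete case, where KL(p || U) = log N - H(p). The sketch below checks that identity numerically; it covers only the discrete case and says nothing about the continuous-versus-discretized bound that the paper derives.

```python
import math


def entropy(p):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl_to_uniform(p):
    """KL(p || U) where U is uniform over the same len(p) outcomes."""
    n = len(p)
    return sum(x * math.log(x * n) for x in p if x > 0)


p = [0.7, 0.2, 0.1]
print(kl_to_uniform(p), math.log(len(p)) - entropy(p))  # the two values agree
```

Minimizing the KL term therefore maximizes entropy, which makes trajectories less predictable, and the weight on the cumulative cost controls how much performance is traded for that.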

WebFactory: Automated Compression of Foundational Language Intelligence into Grounded Web Agents

WebFactory：基础语言智能的自动压缩到有根基的网络代理中

Authors: Sicheng Fan, Qingyun Shi, Shengze Xu, Shengbo Cai, Tieyong Zeng, Li Ling, Yanyi Shang, Dehan Kong
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.05044
Pdf link: https://arxiv.org/pdf/2603.05044
Abstract Current paradigms for training GUI agents are fundamentally limited by a reliance on either unsafe, non-reproducible live web interactions or costly, scarce human-crafted data and environments. We argue this focus on data volume overlooks a more critical factor: the efficiency of compressing a large language model's (LLM) latent knowledge into actionable agent behavior. We introduce WebFactory, a novel, fully automated closed-loop reinforcement learning pipeline for GUI agents, systematically compressing LLM-encoded internet intelligence into efficient, grounded actions. Our pipeline features a process of scalable environment synthesis, knowledge-aware task generation, LLM-powered trajectory collection, decomposed reward RL training, and systematic agent evaluation. Remarkably, our agent demonstrates exceptional data efficiency and generalization. Trained on synthetic data from only 10 websites within WebFactory, it achieves performance comparable to GUI agents trained on the same amount of human-annotated data from a much larger set of environments. This superior performance is consistent across our internal offline and online transfer benchmarks, where our agent also significantly outperforms the base foundation model. We further provide critical insights into the "embodiment potential" of different LLM foundations, offering a new axis for model evaluation. This work presents a scalable and cost-effective paradigm for transforming passive internet knowledge into active, grounded intelligence, marking a critical step towards general-purpose interactive agents.
中文摘要 当前GUI智能体的训练范式从根本上受限于对不安全、不可复现的实时网络交互的依赖，或对昂贵且稀缺的人工构建数据和环境的依赖。我们认为，这种对数据量的关注忽视了一个更关键的因素：将大型语言模型（LLM）的潜在知识压缩为可操作的智能体行为的效率。我们提出WebFactory，一种全新的、全自动化的闭环强化学习流水线，面向GUI智能体，系统地将LLM编码的互联网智能压缩为高效且有依据的操作。我们的流水线包含可扩展的环境合成、知识感知的任务生成、基于LLM的轨迹收集、分解奖励的强化学习训练以及系统化的智能体评估。值得注意的是，我们的智能体展现了卓越的数据效率和泛化能力。它仅在WebFactory内10个网站的合成数据上训练，其性能即可与使用来自更大环境集合的同等数量人工标注数据训练的GUI智能体相当。这种优越表现在我们内部的离线和在线迁移基准上保持一致，我们的智能体在这些基准上也显著优于基座基础模型。我们还提供了关于不同LLM基座“具身潜力”的关键见解，为模型评估开辟了新的维度。这项工作提出了一种可扩展且经济高效的范式，将被动的互联网知识转化为主动的、有依据的智能，标志着迈向通用交互式智能体的关键一步。

Reward-Conditioned Reinforcement Learning

奖励条件强化学习

Authors: Michal Nauman, Marek Cygan, Pieter Abbeel
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.05066
Pdf link: https://arxiv.org/pdf/2603.05066
Abstract RL agents are typically trained under a single, fixed reward function, which makes them brittle to reward misspecification and limits their ability to adapt to changing task preferences. We introduce Reward-Conditioned Reinforcement Learning (RCRL), a framework that trains a single agent to optimize a family of reward specifications while collecting experience under only one nominal objective. RCRL conditions the agent on reward parameterizations and learns multiple reward objectives from a shared replay data entirely off-policy, enabling a single policy to represent reward-specific behaviors. Across single-task, multi-task, and vision-based benchmarks, we show that RCRL not only improves performance under the nominal reward parameterization, but also enables efficient adaptation to new parameterizations. Our results demonstrate that RCRL provides a scalable mechanism for learning robust, steerable policies without sacrificing the simplicity of single-task training.
中文摘要 强化学习智能体通常在单一固定的奖励函数下训练，这使其对奖励设定错误较为脆弱，并限制了其适应任务偏好变化的能力。我们引入了奖励条件强化学习（RCRL），这是一个训练单个智能体优化一族奖励设定的框架，同时仅在一个名义目标下收集经验。RCRL以奖励参数化作为智能体的条件输入，并完全以离策略方式从共享的回放数据中学习多个奖励目标，使单一策略能够表示针对不同奖励的行为。在单任务、多任务和基于视觉的基准测试中，我们证明RCRL不仅在名义奖励参数化下提升了性能，还能够高效适应新的参数化。我们的结果表明，RCRL提供了一种可扩展的机制，用于学习稳健且可引导的策略，同时不牺牲单任务训练的简单性。
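One way to picture learning several reward objectives from a shared replay buffer is to relabel stored transitions under sampled reward parameters. The sketch below assumes the parameterization is a weight vector over stored reward components and that the weights are appended as a conditioning input; both are illustrative assumptions, not details confirmed by the abstract.

```python
import random


def relabel(transition, weights):
    """Recompute the scalar reward of a stored transition under new weights."""
    reward = sum(w * c for w, c in zip(weights, transition["components"]))
    return {**transition, "reward": reward, "condition": tuple(weights)}


def sample_relabelled_batch(buffer, weight_sampler, batch_size, rng):
    """Sample transitions off-policy and relabel each with sampled weights."""
    return [relabel(rng.choice(buffer), weight_sampler(rng))
            for _ in range(batch_size)]


# Hypothetical buffer collected under the nominal objective only.
buffer = [
    {"obs": 0, "action": 1, "components": (1.0, -0.2)},  # (progress, energy)
    {"obs": 1, "action": 0, "components": (0.5, -0.1)},
]
rng = random.Random(0)
batch = sample_relabelled_batch(
    buffer, lambda r: (1.0, r.uniform(0.0, 2.0)), batch_size=4, rng=rng)
for item in batch:
    print(item["condition"], round(item["reward"], 3))
```

Because relabeling needs only the stored components, no additional environment interaction is required to train on objectives other than the nominal one.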

Decoupling Task and Behavior: A Two-Stage Reward Curriculum in Reinforcement Learning for Robotics

任务与行为的分离：机器人强化学习中的两阶段奖励课程

Authors: Kilian Freitag, Knut Åkesson, Morteza Haghir Chehreghani
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.05113
Pdf link: https://arxiv.org/pdf/2603.05113
Abstract Deep Reinforcement Learning is a promising tool for robotic control, yet practical application is often hindered by the difficulty of designing effective reward functions. Real-world tasks typically require optimizing multiple objectives simultaneously, necessitating precise tuning of their weights to learn a policy with the desired characteristics. To address this, we propose a two-stage reward curriculum where we decouple task-specific objectives from behavioral terms. In our method, we first train the agent on a simplified task-only reward function to ensure effective exploration before introducing the full reward that includes auxiliary behavior-related terms such as energy efficiency. Further, we analyze various transition strategies and demonstrate that reusing samples between phases is critical for training stability. We validate our approach on the DeepMind Control Suite, ManiSkill3, and a mobile robot environment, modified to include auxiliary behavioral objectives. Our method proves to be simple yet effective, substantially outperforming baselines trained directly on the full reward while exhibiting higher robustness to specific reward weightings.
中文摘要 深度强化学习是一种有前景的机器人控制工具，但实际应用常因设计有效奖励函数的困难而受阻。现实任务通常需要同时优化多个目标，因此需要精确调整它们的权重，以学习具有所需特性的策略。为此，我们提出了一个两阶段的奖励课程，将任务特定目标与行为项解耦。在我们的方法中，首先用简化的仅含任务项的奖励函数训练智能体以确保有效探索，然后再引入包含辅助行为相关项（如能量效率）的完整奖励。此外，我们分析了各种过渡策略，并证明在阶段间重复使用样本对训练稳定性至关重要。我们在DeepMind Control Suite、ManiSkill3以及一个移动机器人环境上验证了我们的方法，这些环境均经过修改以包含辅助行为目标。我们的方法简单而有效，显著优于直接在完整奖励上训练的基线，同时对具体的奖励权重设置展现出更高的鲁棒性。
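The two-stage curriculum can be sketched as a reward function that ignores behavioral terms until a switch point. The linear ramp below is one hypothetical transition strategy among the several the abstract says were analyzed; `switch_step`, `ramp_steps`, and `behavior_weight` are illustrative parameters.

```python
def curriculum_reward(task_reward, behavior_penalty, step,
                      switch_step, ramp_steps=0, behavior_weight=0.1):
    """Stage 1: task-only reward. Stage 2: task reward minus weighted penalty."""
    if step < switch_step:
        return task_reward  # stage 1: explore on the task term alone
    if ramp_steps > 0:
        # Optional linear ramp from 0 to 1 after the switch.
        scale = min(1.0, (step - switch_step) / ramp_steps)
    else:
        scale = 1.0  # hard switch
    return task_reward - behavior_weight * scale * behavior_penalty


for step in (0, 1000, 1500, 2000):
    print(step, curriculum_reward(1.0, 2.0, step, switch_step=1000, ramp_steps=1000))
```

Transitions stored during stage 1 would need their rewards recomputed under the stage-2 function before being reused, which is one way to read the abstract's point about reusing samples between phases.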

LBM: Hierarchical Large Auto-Bidding Model via Reasoning and Acting

LBM：通过推理和行动实现的层级大型自动竞价模型

Authors: Yewen Li, Zhiyi Lyu, Peng Jiang, Qingpeng Cai, Fei Pan, Bo An, Peng Jiang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.05134
Pdf link: https://arxiv.org/pdf/2603.05134
Abstract The growing scale of ad auctions on online advertising platforms has intensified competition, making manual bidding impractical and necessitating auto-bidding to help advertisers achieve their economic goals. Current auto-bidding methods have evolved to use offline reinforcement learning or generative methods to optimize bidding strategies, but they can sometimes behave counterintuitively due to the black-box training manner and limited mode coverage of datasets, leading to challenges in understanding task status and generalization in dynamic ad environments. Large language models (LLMs) offer a promising solution by leveraging prior human knowledge and reasoning abilities to improve auto-bidding performance. However, directly applying LLMs to auto-bidding faces difficulties due to the need for precise actions in competitive auctions and the lack of specialized auto-bidding knowledge, which can lead to hallucinations and suboptimal decisions. To address these challenges, we propose a hierarchical Large autoBidding Model (LBM) to leverage the reasoning capabilities of LLMs for developing a superior auto-bidding strategy. This includes a high-level LBM-Think model for reasoning and a low-level LBM-Act model for action generation. Specifically, we propose a dual embedding mechanism to efficiently fuse two modalities, including language and numerical inputs, for language-guided training of the LBM-Act; then, we propose an offline reinforcement fine-tuning technique termed GQPO for mitigating the LLM-Think's hallucinations and enhancing decision-making performance without simulation or real-world rollout like previous multi-turn LLM-based methods. Experiments demonstrate the superiority of a generative backbone based on our LBM, especially in an efficient training manner and generalization ability.
中文摘要 在线广告平台广告拍卖规模的扩大加剧了竞争，使得人工竞价变得不切实际，必须通过自动竞价来帮助广告主实现经济目标。当前的自动竞价方法已发展为采用离线强化学习或生成式方法来优化竞价策略，但由于黑盒式的训练方式和数据集模式覆盖有限，它们有时表现得与直觉相反，导致在动态广告环境中理解任务状态和泛化时面临挑战。大型语言模型（LLMs）通过利用人类先验知识和推理能力，为提升自动竞价的性能提供了一种有前景的解决方案。然而，直接将LLM应用于自动竞价面临困难，因为竞争性拍卖需要精确操作，且LLM缺乏专业的自动竞价知识，可能导致幻觉和次优决策。为应对这些挑战，我们提出了一个分层的大型自动竞价模型（LBM），以利用LLM的推理能力来开发更优越的自动竞价策略。它包括用于推理的高层LBM-Think模型和用于动作生成的低层LBM-Act模型。具体来说，我们提出了一种双重嵌入机制，以高效融合语言和数值输入这两种模态，用于LBM-Act的语言引导训练；随后，我们提出了一种名为GQPO的离线强化微调技术，用于减轻LLM-Think的幻觉并提升决策性能，且无需像以往基于LLM的多轮方法那样进行模拟或真实环境中的rollout。实验证明了基于我们LBM的生成式骨干的优越性，尤其是在高效的训练方式和泛化能力方面。

KARL: Knowledge Agents via Reinforcement Learning

KARL：通过强化学习实现知识智能体

Authors: Jonathan D. Chang, Andrew Drozdov, Shubham Toshniwal, Owen Oertell, Alexander Trott, Jacob Portes, Abhay Gupta, Pallavi Koppol, Ashutosh Baheti, Sean Kulinski, Ivan Zhou, Irene Dea, Krista Opsahl-Ong, Simon Favreau-Lessard, Sean Owen, Jose Javier Gonzalez Ortiz, Arnav Singhvi, Xabi Andrade, Cindy Wang, Kartik Sreenivasan, Sam Havens, Jialu Liu, Peyton DeNiro, Wen Sun, Michael Bendersky, Jonathan Frankle
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.05218
Pdf link: https://arxiv.org/pdf/2603.05218
Abstract We present a system for training enterprise search agents via reinforcement learning that achieves state-of-the-art performance across a diverse suite of hard-to-verify agentic search tasks. Our work makes four core contributions. First, we introduce KARLBench, a multi-capability evaluation suite spanning six distinct search regimes, including constraint-driven entity search, cross-document report synthesis, tabular numerical reasoning, exhaustive entity retrieval, procedural reasoning over technical documentation, and fact aggregation over internal enterprise notes. Second, we show that models trained across heterogeneous search behaviors generalize substantially better than those optimized for any single benchmark. Third, we develop an agentic synthesis pipeline that employs long-horizon reasoning and tool use to generate diverse, grounded, and high-quality training data, with iterative bootstrapping from increasingly capable models. Fourth, we propose a new post-training paradigm based on iterative large-batch off-policy RL that is sample efficient, robust to train-inference engine discrepancies, and naturally extends to multi-task training with out-of-distribution generalization. Compared to Claude 4.6 and GPT 5.2, KARL is Pareto-optimal on KARLBench across cost-quality and latency-quality trade-offs, including tasks that were out-of-distribution during training. With sufficient test-time compute, it surpasses the strongest closed models. These results show that tailored synthetic data in combination with multi-task reinforcement learning enables cost-efficient and high-performing knowledge agents for grounded reasoning.
中文摘要 我们提出了一套通过强化学习训练企业搜索智能体的系统，能够在一组多样化且难以验证的智能体式搜索任务上实现最先进的性能。我们的工作有四项核心贡献。首先，我们提出KARLBench，这是一个涵盖六种不同搜索场景的多能力评估套件，包括约束驱动的实体搜索、跨文档报告综合、表格数值推理、穷尽式实体检索、针对技术文档的程序性推理以及针对企业内部笔记的事实聚合。其次，我们证明了在异构搜索行为上训练的模型，其泛化效果远优于针对任何单一基准优化的模型。第三，我们开发了一个智能体式合成流水线，利用长程推理和工具使用来生成多样化、有依据且高质量的训练数据，并借助能力不断增强的模型进行迭代自举。第四，我们提出一种基于迭代式大批量离策略强化学习的新型后训练范式，该范式样本效率高，对训练引擎与推理引擎之间的差异具有鲁棒性，并能自然扩展到具有分布外泛化能力的多任务训练。与Claude 4.6和GPT 5.2相比，KARL在KARLBench上的成本-质量和延迟-质量权衡中达到帕累托最优，包括那些在训练时属于分布外的任务。在测试时计算充足的情况下，它超越了最强的闭源模型。这些结果表明，定制化的合成数据与多任务强化学习相结合，能够实现成本效益高且性能出色的知识智能体，用于有依据的推理。

Boosting ASR Robustness via Test-Time Reinforcement Learning with Audio-Text Semantic Rewards

通过测试时间强化学习与音频文本语义奖励提升ASR的稳健性

Authors: Linghan Fang, Tianxin Xie, Li Liu
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.05231
Pdf link: https://arxiv.org/pdf/2603.05231
Abstract Recently, Automatic Speech Recognition (ASR) systems (e.g., Whisper) have achieved remarkable accuracy improvements but remain highly sensitive to real-world unseen data (data with large distribution shifts), including noisy environments and diverse accents. To address this issue, test-time adaptation (TTA) has shown great potential in improving the model adaptability at inference time without ground-truth labels, and existing TTA methods often rely on pseudo-labeling or entropy minimization. However, by treating model confidence as a learning signal, these methods may reinforce high-confidence errors, leading to confirmation bias that undermines adaptation. To overcome these limitations, we present ASR-TRA, a novel Test-time Reinforcement Adaptation framework inspired by causal intervention. More precisely, our method introduces a learnable decoder prompt and utilizes temperature-controlled stochastic decoding to generate diverse transcription candidates. These are scored by a reward model that measures audio-text semantic alignment, and the resulting feedback is used to update both model and prompt parameters via reinforcement learning. Comprehensive experiments on LibriSpeech with synthetic noise and L2 Arctic accented English datasets demonstrate that our method achieves higher accuracy while maintaining lower latency than existing TTA baselines. Ablation studies further confirm the effectiveness of combining audio and language-based rewards, highlighting our method's enhanced stability and interpretability. Overall, our approach provides a practical and robust solution for deploying ASR systems in challenging real-world conditions.
中文摘要 近年来，自动语音识别（ASR）系统（如Whisper）取得了显著的精度提升，但对现实世界中的未见数据（分布偏移较大的数据）仍然高度敏感，包括噪声环境和多样的口音。为解决这个问题，测试时适应（TTA）在无需真实标签的情况下提升模型推理时的适应性方面已展现出极大潜力，现有TTA方法通常依赖伪标签或熵最小化。然而，由于将模型置信度视为学习信号，这些方法可能强化高置信度的错误，导致确认偏误，从而削弱适应效果。为克服这些局限，我们提出了ASR-TRA，一种受因果干预启发的新型测试时强化适应框架。更准确地说，我们的方法引入了可学习的解码器提示，并利用温度控制的随机解码生成多样化的转录候选。这些候选由一个衡量音频-文本语义对齐程度的奖励模型评分，所得反馈通过强化学习用于更新模型和提示参数。在带合成噪声的LibriSpeech和L2 Arctic口音英语数据集上进行的综合实验表明，我们的方法在保持更低延迟的同时，比现有TTA基线实现了更高的准确率。消融研究进一步证实了结合音频和语言奖励的有效性，凸显了我们方法更强的稳定性和可解释性。总体而言，我们的方法为在具有挑战性的现实环境中部署ASR系统提供了实用且稳健的解决方案。

Wiki-R1: Incentivizing Multimodal Reasoning for Knowledge-based VQA via Data and Sampling Curriculum

Wiki-R1：通过数据和抽样课程激励知识型VQA的多模态推理

Authors: Shan Ning, Longtian Qiu, Xuming He
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.05256
Pdf link: https://arxiv.org/pdf/2603.05256
Abstract Knowledge-Based Visual Question Answering (KB-VQA) requires models to answer questions about an image by integrating external knowledge, posing significant challenges due to noisy retrieval and the structured, encyclopedic nature of the knowledge base. These characteristics create a distributional gap from pretrained multimodal large language models (MLLMs), making effective reasoning and domain adaptation difficult in the post-training stage. In this work, we propose Wiki-R1, a data-generation-based curriculum reinforcement learning framework that systematically incentivizes reasoning in MLLMs for KB-VQA. Wiki-R1 constructs a sequence of training distributions aligned with the model's evolving capability, bridging the gap from pretraining to the KB-VQA target distribution. We introduce controllable curriculum data generation, which manipulates the retriever to produce samples at desired difficulty levels, and a curriculum sampling strategy that selects informative samples likely to yield non-zero advantages during RL updates. Sample difficulty is estimated using observed rewards and propagated to unobserved samples to guide learning. Experiments on two KB-VQA benchmarks, Encyclopedic VQA and InfoSeek, demonstrate that Wiki-R1 achieves new state-of-the-art results, improving accuracy from 35.5% to 37.1% on Encyclopedic VQA and from 40.1% to 44.1% on InfoSeek. The project page is available at this https URL.
中文摘要 基于知识的视觉问答（KB-VQA）要求模型通过整合外部知识来回答关于图像的问题，由于检索存在噪声且知识库具有结构化、百科全书式的特性，这带来了重大挑战。这些特性造成了与预训练多模态大型语言模型（MLLMs）之间的分布差距，使得在后训练阶段进行有效的推理和领域适应变得困难。在本研究中，我们提出了Wiki-R1，一种基于数据生成的课程强化学习框架，系统地激励MLLM在KB-VQA中进行推理。Wiki-R1构建了一系列与模型不断演进的能力相匹配的训练分布，弥合了从预训练到KB-VQA目标分布之间的差距。我们引入了可控课程数据生成，它通过操控检索器来生成处于期望难度水平的样本；以及一种课程采样策略，用于选择在强化学习更新中可能产生非零优势的信息量丰富的样本。样本难度通过观察到的奖励进行估计，并传播到未观察的样本以指导学习。在两个KB-VQA基准（Encyclopedic VQA和InfoSeek）上的实验表明，Wiki-R1取得了新的最先进结果，在Encyclopedic VQA上将准确率从35.5%提升到37.1%，在InfoSeek上从40.1%提升到44.1%。项目页面可在此 https URL 访问。
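Why some samples yield zero advantage, and how a sampler might prefer informative ones, can be sketched under simplifying assumptions: binary rewards, independent rollouts, and group-normalized advantages. With those assumptions a group of size G gives zero advantage exactly when all rollouts agree, which happens with probability p^G + (1 - p)^G. This is not the paper's difficulty estimator, and the sample names and rates are hypothetical.

```python
def prob_nonzero_advantage(success_rate, group_size):
    """Probability that a group of binary-reward rollouts is not unanimous."""
    p = success_rate
    return 1.0 - p ** group_size - (1.0 - p) ** group_size


def pick_informative(estimated_rates, group_size, k):
    """Return the k sample ids most likely to give a non-zero advantage."""
    ranked = sorted(estimated_rates,
                    key=lambda s: prob_nonzero_advantage(estimated_rates[s], group_size),
                    reverse=True)
    return ranked[:k]


rates = {"easy": 0.95, "medium": 0.5, "hard": 0.05, "impossible": 0.0}
print(pick_informative(rates, group_size=4, k=2))
```

Samples the model always solves or never solves contribute no gradient signal under group normalization, so a curriculum that tracks estimated success rates can skip them.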

SarcasmMiner: A Dual-Track Post-Training Framework for Robust Audio-Visual Sarcasm Reasoning

SarcasmMiner：用于稳健视听讽刺推理的双轨后训练框架

Authors: Zhu Li, Yongjian Chen, Huiyuan Lai, Xiyuan Gao, Shekhar Nayak, Matt Coler
Subjects: Multimedia (cs.MM); Computation and Language (cs.CL); Sound (cs.SD)
Arxiv link: https://arxiv.org/abs/2603.05275
Pdf link: https://arxiv.org/pdf/2603.05275
Abstract Multimodal sarcasm detection requires resolving pragmatic incongruity across textual, acoustic, and visual cues through cross-modal reasoning. To enable robust sarcasm reasoning with foundation models, we propose SarcasmMiner, a reinforcement-learning-based post-training framework that resists hallucination in multimodal reasoning. We reformulate sarcasm detection as structured reasoning and adopt a dual-track distillation strategy: high-quality teacher trajectories initialize the student model, while the full set of trajectories trains a generative reward model (GenRM) to evaluate reasoning quality. The student is optimized with group relative policy optimization (GRPO) using decoupled rewards for accuracy and reasoning quality. On MUStARD++, SarcasmMiner increases F1 from 59.83% (zero-shot) and 68.23% (supervised fine-tuning) to 70.22%. These findings suggest that reasoning-aware reward modeling enhances both performance and multimodal grounding.
中文摘要 多模态讽刺检测需要通过跨模态推理解决文本、声学和视觉线索间的语用不一致。为了基于基础模型实现鲁棒的讽刺推理，我们提出了 SarcasmMiner，一种基于强化学习的后训练框架，能够在多模态推理中抵抗幻觉。我们将讽刺检测重新表述为结构化推理，并采用双轨蒸馏策略：高质量教师轨迹用于初始化学生模型，而完整轨迹集用于训练生成式奖励模型（GenRM）以评估推理质量。学生模型通过组相对策略优化（GRPO）进行优化，并对准确性和推理质量使用解耦奖励。在 MUStARD++ 上，SarcasmMiner 将 F1 从 59.83%（零样本）和 68.23%（监督微调）提升到 70.22%。这些发现表明，推理感知的奖励建模同时提升了性能和多模态落地（grounding）能力。
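
The SarcasmMiner abstract mentions GRPO with decoupled rewards for accuracy and reasoning quality but does not say how the two signals are combined. The sketch below shows one plausible reading, offered as an assumption rather than the paper's method: each reward is normalized within the rollout group on its own, and the two normalized signals are then summed with a hypothetical weight `quality_weight`.

```python
from statistics import mean, pstdev


def group_normalize(rewards, eps=1e-8):
    """Group-relative normalization: subtract the group mean and divide
    by the group standard deviation (eps avoids division by zero)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def decoupled_advantages(accuracy_rewards, quality_rewards, quality_weight=0.5):
    """Normalize each reward stream separately, then combine.

    Assumption: 'decoupled' is read here as per-stream normalization, so
    the scale of one reward cannot swamp the other.
    """
    acc = group_normalize(accuracy_rewards)
    qual = group_normalize(quality_rewards)
    return [a + quality_weight * q for a, q in zip(acc, qual)]


# Four rollouts for one prompt (illustrative values only).
advantages = decoupled_advantages(
    accuracy_rewards=[1.0, 0.0, 1.0, 0.0],   # correct / incorrect label
    quality_rewards=[0.9, 0.2, 0.4, 0.7],    # hypothetical GenRM scores
)
```

In this sketch, a correct rollout with well-rated reasoning receives the largest advantage, and a group whose rewards are all identical receives zero advantage everywhere.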

Whispering to a Blackbox: Bootstrapping Frozen OCR with Visual Prompts

对黑盒低语：用视觉提示自举增强冻结的OCR

Authors: Samandar Samandarov, Nazirjon Ismoiljonov, Abdullah Sattorov, Temirlan Sabyrbayev
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.05276
Pdf link: https://arxiv.org/pdf/2603.05276
Abstract In the landscape of modern machine learning, frozen pre-trained models provide stability and efficiency but often underperform on specific tasks due to mismatched data distributions. This paper introduces the Whisperer, a novel visual prompting framework that learns diffusion-based preprocessors to adapt inputs in pixel space, effectively "whispering" enhancements to frozen downstream models like EasyOCR. By framing the process as behavioral cloning of stochastically discovered improvement policies, our method achieves an 8% absolute (10.6% relative) reduction in Character Error Rate (CER) on a challenging dataset of 300k degraded synthetic text images, surpassing hand-engineered baselines such as CLAHE. The key innovation is a four-stage training curriculum that uses behavioral cloning to amplify "lucky" improvements discovered through the stochastic exploration of a partially trained diffusion model. This approach is highly sample-efficient and avoids the pitfalls of traditional reinforcement learning. Crucially, we frame this not as naive reinforcement learning, but as behavioral cloning of an exploration policy: we stochastically sample intermediate diffusion outputs, select those that improve CER by chance, and then train the model to reproduce them. This bootstrapping curriculum (4 stages over 60 GPU-hours) amplifies random successes into a systematic strategy. In summary, by whispering to the frozen OCR through its inputs, we improve an imperfect classifier without touching its weights.
中文摘要 在现代机器学习领域，冻结的预训练模型提供了稳定性和效率，但由于数据分布不匹配，在特定任务上常常表现不佳。本文介绍了 Whisperer，一种新型视觉提示框架，它学习基于扩散的预处理器，在像素空间中调整输入，相当于向 EasyOCR 这样的冻结下游模型“低语”增强信息。通过将该过程表述为对随机发现的改进策略的行为克隆，我们的方法在一个包含 30 万张退化合成文本图像的高难度数据集上，将字符错误率（CER）绝对降低 8%（相对降低 10.6%），超过了 CLAHE 等手工设计的基线。关键创新是一套四阶段训练课程，利用行为克隆放大部分训练的扩散模型在随机探索中发现的“幸运”改进。这种方法样本效率高，避免了传统强化学习的陷阱。关键在于，我们将其视为对探索策略的行为克隆，而非朴素的强化学习：我们随机采样中间扩散输出，挑选其中偶然改善了 CER 的输出，然后训练模型复现它们。这种自举课程（4 个阶段，共 60 个 GPU 小时）将随机的成功放大为系统化的策略。总之，通过输入向冻结的 OCR 低语，我们可以在不改动其权重的情况下改进一个不完美的分类器。
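
The Whisperer abstract reports both an absolute (8%) and a relative (10.6%) CER reduction, which together pin down the approximate baseline. The short calculation below makes that explicit. It assumes "absolute" means percentage points and that both figures refer to the same baseline; the function name `implied_cer` is hypothetical, and the resulting values are derived here, not quoted from the paper.

```python
def implied_cer(absolute_reduction: float, relative_reduction: float):
    """Recover baseline and improved error rates from two reported deltas.

    relative = absolute / baseline  =>  baseline = absolute / relative
    """
    baseline = absolute_reduction / relative_reduction
    improved = baseline - absolute_reduction
    return baseline, improved


# Figures from the abstract: 8 points absolute, 10.6% relative.
baseline_cer, improved_cer = implied_cer(0.08, 0.106)
# baseline_cer is roughly 0.755 and improved_cer roughly 0.675,
# i.e. the dataset is hard enough that most characters are still misread.
```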

Knowledge Divergence and the Value of Debate for Scalable Oversight

知识分歧与辩论对可扩展监督的价值

Authors: Robin Young
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.05293
Pdf link: https://arxiv.org/pdf/2603.05293
Abstract AI safety via debate and reinforcement learning from AI feedback (RLAIF) are both proposed methods for scalable oversight of advanced AI systems, yet no formal framework relates them or characterizes when debate offers an advantage. We analyze this by parameterizing debate's value through the geometry of knowledge divergence between debating models. Using principal angles between models' representation subspaces, we prove that the debate advantage admits an exact closed form. When models share identical training corpora, debate reduces to an RLAIF-like setting in which a single-agent method recovers the same optimum. When models possess divergent knowledge, the debate advantage scales with a phase transition from a quadratic regime (debate offers negligible benefit) to a linear regime (debate is essential). We classify three regimes of knowledge divergence (shared, one-sided, and compositional) and provide existence results showing that debate can achieve outcomes inaccessible to either model alone, alongside a negative result showing that sufficiently strong adversarial incentives cause coordination failure in the compositional regime, with a sharp threshold separating effective from ineffective debate. We offer the first formal connection between debate and RLAIF, a geometric foundation for understanding when adversarial oversight protocols are justified, and a connection to the problem of eliciting latent knowledge across models with complementary information.
中文摘要 通过辩论实现人工智能安全和基于人工智能反馈的强化学习（RLAIF）都是已提出的对先进人工智能系统进行可扩展监督的方法，但目前没有正式框架将它们关联起来，也没有刻画辩论何时具有优势。我们通过辩论模型间知识分歧的几何结构对辩论的价值进行参数化，以此展开分析。利用模型表示子空间之间的主夹角，我们证明辩论优势存在精确的闭式表达。当模型共享相同的训练语料库时，辩论退化为类似 RLAIF 的情形，即单智能体方法能恢复相同的最优解。当模型拥有相异的知识时，辩论优势的增长呈现相变，从二次区域（辩论带来的收益微乎其微）过渡到线性区域（辩论至关重要）。我们将知识分歧分为三种情形（共享型、单边型和组合型），并给出存在性结果，表明辩论能够达到任一模型单独都无法达到的结果；同时给出一个负面结果，表明足够强的对抗激励会在组合型情形中导致协调失败，并存在一个将有效辩论与无效辩论分开的尖锐阈值。我们首次建立了辩论与 RLAIF 之间的正式联系，为理解对抗性监督协议何时合理提供了几何基础，并与在信息互补的模型之间引出潜在知识的问题建立了联系。
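
The debate abstract builds on principal angles between two models' representation subspaces. The closed-form debate advantage itself is not stated in the abstract, so the sketch below only shows the standard way to compute principal angles with NumPy (orthonormalize each basis, then take the singular values of the cross product). The function name `principal_angles` and the toy subspaces are illustrative assumptions, and the inputs are assumed to have full column rank.

```python
import numpy as np


def principal_angles(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Principal angles (radians, ascending) between span(A) and span(B).

    A and B hold basis vectors as columns and are assumed to have full
    column rank.
    """
    Qa, _ = np.linalg.qr(A)  # orthonormal basis for span(A)
    Qb, _ = np.linalg.qr(B)  # orthonormal basis for span(B)
    # Singular values of Qa^T Qb are the cosines of the principal angles.
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))


# Toy example in R^3: the subspaces share e1 but differ in the second direction.
A = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])  # span{e1, e2}
B = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]])  # span{e1, e3}
angles = principal_angles(A, B)
```

An angle of zero marks a direction both subspaces share, and an angle of pi/2 marks a direction present in only one of them, which is the kind of divergence the abstract's regimes are classified by.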

Latent Policy Steering through One-Step Flow Policies

通过一步流策略进行潜在策略引导

Authors: Hokyun Im, Andrey Kolobov, Jianlong Fu, Youngwoon Lee
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.05296
Pdf link: https://arxiv.org/pdf/2603.05296
Abstract Offline reinforcement learning (RL) allows robots to learn from offline datasets without risky exploration. Yet, offline RL's performance often hinges on a brittle trade-off between (1) return maximization, which can push policies outside the dataset support, and (2) behavioral constraints, which typically require sensitive hyperparameter tuning. Latent steering offers a structural way to stay within the dataset support during RL, but existing offline adaptations commonly approximate action values using latent-space critics learned via indirect distillation, which can lose information and hinder convergence. We propose Latent Policy Steering (LPS), which enables high-fidelity latent policy improvement by backpropagating original-action-space Q-gradients through a differentiable one-step MeanFlow policy to update a latent-action-space actor. By eliminating proxy latent critics, LPS allows an original-action-space critic to guide end-to-end latent-space optimization, while the one-step MeanFlow policy serves as a behavior-constrained generative prior. This decoupling yields a robust method that works out-of-the-box with minimal tuning. Across OGBench and real-world robotic tasks, LPS achieves state-of-the-art performance and consistently outperforms behavioral cloning and strong latent steering baselines.
中文摘要 离线强化学习（RL）使机器人能够从离线数据集中学习，而无需进行有风险的探索。然而，离线强化学习的性能往往取决于（1）回报最大化（可能使策略超出数据集支持范围）和（2）行为约束（通常需要敏感的超参数调优）之间的脆弱权衡。潜在引导提供了一种在强化学习期间保持在数据集支持范围内的结构性方法，但现有的离线改编方法通常使用经间接蒸馏学得的潜在空间评论家来近似动作价值，这可能丢失信息并阻碍收敛。我们提出了潜在策略引导（LPS），它将原始动作空间的 Q 梯度通过可微的一步 MeanFlow 策略反向传播，以更新潜在动作空间的执行者（actor），从而实现高保真的潜在策略改进。通过消除代理潜在评论家，LPS 允许原始动作空间的评论家引导端到端的潜在空间优化，而一步 MeanFlow 策略则充当行为约束的生成先验。这种解耦产生了一种稳健的方法，开箱即用且只需极少调参。在 OGBench 和真实世界机器人任务中，LPS 取得了最先进的性能，并持续优于行为克隆和强大的潜在引导基线。

DiSCTT: Consensus-Guided Self-Curriculum for Efficient Test-Time Adaptation in Reasoning

DiSCTT：共识引导的自课程，用于推理中的高效测试时适应

Authors: Mohammad Mahdi Moradi, Sudhir Mudur
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.05357
Pdf link: https://arxiv.org/pdf/2603.05357
Abstract Test-time adaptation offers a promising avenue for improving reasoning performance in large language models without additional supervision, but existing approaches often apply a uniform optimization objective across all inputs, leading to inefficient or unstable adaptation on heterogeneous reasoning problems. We propose DiSCTT, a difficulty-aware, consensus-guided self-curriculum framework that dynamically allocates test-time optimization strategies based on instance-level epistemic uncertainty estimated from agreement among sampled reasoning trajectories. Inputs with high consensus are consolidated via supervised fine-tuning using majority-agreed solutions as pseudo-labels, while low-consensus inputs are optimized via reinforcement learning with a consensus-regularized objective that encourages diversity under relevance constraints. Across a broad suite of mathematical and general reasoning benchmarks, DiSCTT consistently outperforms strong test-time adaptation baselines, achieving higher accuracy with reduced variance and substantially lower computation and wall-clock training times. These results demonstrate that explicitly accounting for instance difficulty and uncertainty enables more stable, efficient, and effective test-time adaptation for reasoning models.
中文摘要 测试时适应为在无需额外监督的情况下提升大型语言模型的推理性能提供了一条有前景的途径，但现有方法往往对所有输入采用统一的优化目标，导致在异构推理问题上适应效率低下或不稳定。我们提出了 DiSCTT，一个难度感知、共识引导的自课程框架，它根据采样推理轨迹之间的一致性估计实例级认知不确定性，并据此动态分配测试时优化策略。高共识输入以多数一致的解作为伪标签，通过监督微调加以巩固；低共识输入则通过强化学习优化，其共识正则化目标鼓励在相关性约束下保持多样性。在广泛的数学和通用推理基准上，DiSCTT 持续优于强大的测试时适应基线，以更低的方差取得更高的准确率，并显著减少计算量和实际训练时间。这些结果表明，显式考虑实例难度和不确定性，可以使推理模型的测试时适应更稳定、更高效、更有效。
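
The routing step described in the DiSCTT abstract (high-consensus inputs go to supervised fine-tuning with the majority answer as pseudo-label, low-consensus inputs go to reinforcement learning) can be sketched as follows. The agreement measure, the `threshold` value, and the function names are assumptions for illustration; the abstract does not specify them.

```python
from collections import Counter


def consensus(answers):
    """Return the majority answer and the fraction of samples agreeing with it."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


def route(answers, threshold=0.7):
    """Pick a test-time optimization track for one input.

    Assumption: agreement ratio stands in for (inverse) epistemic
    uncertainty, and 0.7 is an arbitrary illustrative cutoff.
    """
    answer, agreement = consensus(answers)
    if agreement >= threshold:
        return ("sft", answer)   # consolidate with the pseudo-label
    return ("rl", None)          # no trusted label: explore with RL


# Final answers extracted from sampled reasoning trajectories (toy data).
easy = route(["42", "42", "42", "41"])
hard = route(["1", "2", "3", "1"])
```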

Keyword: diffusion policy

Diffusion Policy through Conditional Proximal Policy Optimization

通过条件近端策略优化实现扩散策略

Authors: Ben Liu, Shunpeng Yang, Hua Chen
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.04790
Pdf link: https://arxiv.org/pdf/2603.04790
Abstract Reinforcement learning (RL) has been extensively employed in a wide range of decision-making problems, such as games and robotics. Recently, diffusion policies have shown strong potential in modeling multi-modal behaviors, enabling more diverse and flexible action generation compared to the conventional Gaussian policy. Despite various attempts to combine RL with diffusion, a key challenge is the difficulty of computing action log-likelihood under the diffusion model. This greatly hinders the direct application of diffusion policies in on-policy reinforcement learning. Most existing methods calculate or approximate the log-likelihood through the entire denoising process in the diffusion model, which can be memory- and computationally inefficient. To overcome this challenge, we propose a novel and efficient method to train a diffusion policy in an on-policy setting that requires only evaluating a simple Gaussian probability. This is achieved by aligning the policy iteration with the diffusion process, which is a distinct paradigm compared to previous work. Moreover, our formulation can naturally handle entropy regularization, which is often difficult to incorporate into diffusion policies. Experiments demonstrate that the proposed method produces multimodal policy behaviors and achieves superior performance on a variety of benchmark tasks in both IsaacLab and MuJoCo Playground.
中文摘要 强化学习（RL）已被广泛应用于游戏和机器人等多种决策问题。近年来，扩散策略在建模多模态行为方面展现出强大潜力，与传统高斯策略相比，能够生成更多样、更灵活的动作。尽管已有多种将强化学习与扩散相结合的尝试，但一个关键挑战是难以在扩散模型下计算动作的对数似然。这极大地阻碍了扩散策略在同策略（on-policy）强化学习中的直接应用。大多数现有方法通过扩散模型的整个去噪过程来计算或近似对数似然，这在内存和计算上都可能效率低下。为克服这一挑战，我们提出了一种新颖且高效的方法，在同策略设置下训练扩散策略，且只需评估一个简单的高斯概率。这是通过将策略迭代与扩散过程对齐实现的，与以往工作相比是一种不同的范式。此外，我们的表述能够自然地处理熵正则化，而熵正则化通常难以纳入扩散策略。实验表明，所提方法能够产生多模态的策略行为，并在 IsaacLab 和 MuJoCo Playground 的多种基准任务上取得了更优的性能。
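
The abstract above says the method needs only a simple Gaussian probability in place of a likelihood computed through the whole denoising chain. How the paper aligns policy iteration with the diffusion process is not described there, so the sketch below only shows the two generic ingredients such an on-policy update would rely on: a diagonal Gaussian log-density and the standard PPO clipped surrogate. All names and the `clip_eps` default are illustrative assumptions.

```python
import math


def gaussian_log_prob(action, mean, std):
    """Log-density of a diagonal Gaussian, summed over action dimensions."""
    return sum(
        -0.5 * ((a - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        for a, m, s in zip(action, mean, std)
    )


def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate for a single (state, action) pair."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    return min(ratio * advantage, clipped * advantage)


# One 2-D action scored under an old and an updated Gaussian (toy numbers).
action = [0.3, -0.1]
logp_old = gaussian_log_prob(action, mean=[0.0, 0.0], std=[1.0, 1.0])
logp_new = gaussian_log_prob(action, mean=[0.2, 0.0], std=[1.0, 1.0])
objective = ppo_clipped_objective(logp_new, logp_old, advantage=1.0)
```

Because each term is a closed-form Gaussian density, the cost per update does not grow with the number of denoising steps, which is the efficiency argument the abstract makes.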

Task-Relevant and Irrelevant Region-Aware Augmentation for Generalizable Vision-Based Imitation Learning in Agricultural Manipulation

面向农业操作中可泛化视觉模仿学习的任务相关与无关区域感知增强

Authors: Shun Hattori, Hikaru Sasaki, Takumi Hachimine, Yusuke Mizutani, Takamitsu Matsubara
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.04845
Pdf link: https://arxiv.org/pdf/2603.04845
Abstract Vision-based imitation learning has shown promise for robotic manipulation; however, its generalization remains limited in practical agricultural tasks. This limitation stems from scarce demonstration data and substantial visual domain gaps caused by i) crop-specific appearance diversity and ii) background variations. To address this limitation, we propose Dual-Region Augmentation for Imitation Learning (DRAIL), a region-aware augmentation framework designed for generalizable vision-based imitation learning in agricultural manipulation. DRAIL explicitly separates visual observations into task-relevant and task-irrelevant regions. The task-relevant region is augmented in a domain-knowledge-driven manner to preserve essential visual characteristics, while the task-irrelevant region is aggressively randomized to suppress spurious background correlations. By jointly handling both sources of visual variation, DRAIL promotes learning policies that rely on task-essential features rather than incidental visual cues. We evaluate DRAIL on diffusion policy-based visuomotor controllers through robot experiments on artificial vegetable harvesting and real lettuce defective leaf picking preparation tasks. The results show consistent improvements in success rates under unseen visual conditions compared to baseline methods. Further attention analysis and representation generalization metrics indicate that the learned policies rely more on task-essential visual features, resulting in enhanced robustness and generalization.
中文摘要 基于视觉的模仿学习已在机器人操作中展现出潜力；然而，其泛化能力在实际农业任务中仍然有限。这一局限源于示范数据稀缺，以及由 i）作物特有的外观多样性和 ii）背景变化引起的显著视觉域差距。为解决这一局限，我们提出了面向模仿学习的双区域增强（DRAIL），这是一种区域感知增强框架，旨在实现农业操作中可泛化的基于视觉的模仿学习。DRAIL 明确地将视觉观测分为任务相关区域和任务无关区域。任务相关区域以领域知识驱动的方式进行增强，以保留关键视觉特征，而任务无关区域则被大幅随机化，以抑制虚假的背景相关性。通过同时处理这两种视觉变化来源，DRAIL 促使学到的策略依赖任务关键特征，而非偶然的视觉线索。我们在基于扩散策略的视觉运动控制器上评估 DRAIL，机器人实验包括人工蔬菜采收和真实生菜缺陷叶片摘除准备任务。结果显示，与基线方法相比，在未见过的视觉条件下成功率持续提升。进一步的注意力分析和表征泛化指标表明，所学策略更多地依赖任务关键的视觉特征，从而增强了鲁棒性和泛化能力。

SeedPolicy: Horizon Scaling via Self-Evolving Diffusion Policy for Robot Manipulation

SeedPolicy：通过自演化扩散策略实现机器人操作的时域扩展

Authors: Youqiang Gui, Yuxuan Zhou, Shen Cheng, Xinyang Yuan, Haoqiang Fan, Peng Cheng, Shuaicheng Liu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.05117
Pdf link: https://arxiv.org/pdf/2603.05117
Abstract Imitation Learning (IL) enables robots to acquire manipulation skills from expert demonstrations. Diffusion Policy (DP) models multi-modal expert behaviors but suffers performance degradation as observation horizons increase, limiting long-horizon manipulation. We propose Self-Evolving Gated Attention (SEGA), a temporal module that maintains a time-evolving latent state via gated attention, enabling efficient recurrent updates that compress long-horizon observations into a fixed-size representation while filtering irrelevant temporal information. Integrating SEGA into DP yields Self-Evolving Diffusion Policy (SeedPolicy), which resolves the temporal modeling bottleneck and enables scalable horizon extension with moderate overhead. On the RoboTwin 2.0 benchmark with 50 manipulation tasks, SeedPolicy outperforms DP and other IL baselines. Averaged across both CNN and Transformer backbones, SeedPolicy achieves 36.8% relative improvement in clean settings and 169% relative improvement in randomized challenging settings over the DP. Compared to vision-language-action models such as RDT with 1.2B parameters, SeedPolicy achieves competitive performance with one to two orders of magnitude fewer parameters, demonstrating strong efficiency and scalability. These results establish SeedPolicy as a state-of-the-art imitation learning method for long-horizon robotic manipulation. Code is available at: this https URL.
中文摘要 模仿学习（IL）使机器人能够从专家演示中获得操作技能。扩散策略（DP）能够建模多模态的专家行为，但随着观测时域的增加，其性能会下降，限制了长时域操作。我们提出了自演化门控注意力（SEGA），这是一种通过门控注意力维持随时间演化的潜在状态的时序模块，能够实现高效的循环更新，将长时域观测压缩为固定大小的表示，同时过滤无关的时序信息。将 SEGA 集成到 DP 中得到自演化扩散策略（SeedPolicy），它解决了时序建模瓶颈，并以适度的开销实现可扩展的时域延伸。在包含 50 个操作任务的 RoboTwin 2.0 基准上，SeedPolicy 优于 DP 及其他 IL 基线。在 CNN 和 Transformer 两种骨干网络上取平均，SeedPolicy 相对 DP 在干净设置下取得 36.8% 的相对提升，在随机化的高难度设置下取得 169% 的相对提升。与拥有 12 亿参数的 RDT 等视觉-语言-动作模型相比，SeedPolicy 以少一到两个数量级的参数取得了有竞争力的性能，展现出很强的效率和可扩展性。这些结果确立了 SeedPolicy 作为长时域机器人操作的最先进模仿学习方法的地位。代码可在此 https URL 获取。