Arxiv Papers of Today

Generated: 2026-04-16 17:23:31 (UTC+8); arXiv release time: 2026-04-16 20:00 EDT (2026-04-17 08:00 UTC+8)

33 relevant papers today

Keyword: reinforcement learning

Integration of Deep Reinforcement Learning and Agent-based Simulation to Explore Strategies Counteracting Information Disorder

Authors: Luigi Lomasto, Andrea Camoia, Alfonso Guarino, Nicola Lettieri, Delfina Malandrino, Rocco Zaccagnino
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2604.13047
Pdf link: https://arxiv.org/pdf/2604.13047
Abstract In recent years, the spread of fake news has triggered a growing interest in Information Disorders (ID) on social media, a phenomenon that has become a focal point of research across fields ranging from complexity theory and computer science to cognitive sciences. Overall, this body of research can be traced back to two main approaches. On the one hand, there are works that exploit data mining to analyze news content and related metadata (data-driven approach); on the other hand, there are works that aim to make sense of the phenomenon and its evolution using explicit simulation models (model-driven approach). In this paper, we integrate these approaches to explore strategies for counteracting IDs. Heading in this direction, we put together: i. an Agent-Based model to simulate in a scientifically sound way both complex fake news dynamics and the effects produced by containment strategies therein; ii. Deep Reinforcement Learning to learn the strategies that can better mitigate the spread of misinformation. The outcomes of our work unfold on different levels. From a substantive point of view, preliminary experiments have begun to provide interesting cues about the conditions under which given policies can mitigate the spread of misinformation. From a technical and methodological point of view, we scratched the surface of promising and worthy research topics such as the integration of social simulation and artificial intelligence and the enhancement of social science simulation environments.

Adaptive Memory Crystallization for Autonomous AI Agent Learning in Dynamic Environments

Authors: Rajat Khanda, Mohammad Baqar, Sambuddha Chakrabarti, Satyasaran Changdar
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.13085
Pdf link: https://arxiv.org/pdf/2604.13085
Abstract Autonomous AI agents operating in dynamic environments face a persistent challenge: acquiring new capabilities without erasing prior knowledge. We present Adaptive Memory Crystallization (AMC), a memory architecture for progressive experience consolidation in continual reinforcement learning. AMC is conceptually inspired by the qualitative structure of synaptic tagging and capture (STC) theory, the idea that memories transition through discrete stability phases, but makes no claim to model the underlying molecular or synaptic mechanisms. AMC models memory as a continuous crystallization process in which experiences migrate from plastic to stable states according to a multi-objective utility signal. The framework introduces a three-phase memory hierarchy (Liquid-Glass-Crystal) governed by an Itô stochastic differential equation (SDE) whose population-level behavior is captured by an explicit Fokker-Planck equation admitting a closed-form Beta stationary distribution. We provide proofs of: (i) well-posedness and global convergence of the crystallization SDE to a unique Beta stationary distribution; (ii) exponential convergence of individual crystallization states to their fixed points, with explicit rates and variance bounds; and (iii) end-to-end Q-learning error bounds and matching memory-capacity lower bounds that link SDE parameters directly to agent performance. Empirical evaluation on Meta-World MT50, Atari 20-game sequential learning, and MuJoCo continual locomotion consistently shows improvements in forward transfer (+34-43% over the strongest baseline), reductions in catastrophic forgetting (67-80%), and a 62% decrease in memory footprint.

Design Conditions for Intra-Group Learning of Sequence-Level Rewards: Token Gradient Cancellation

Authors: Fei Ding, Yongkang Zhang, Youwei Wang, Zijian Zeng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.13088
Pdf link: https://arxiv.org/pdf/2604.13088
Abstract In sparse termination rewards, intra-group comparisons have become the dominant paradigm for fine-tuning reasoning models via reinforcement learning. However, long-term training often leads to issues like ineffective update accumulation (learning tax), solution probability drift, and entropy collapse. This paper presents a necessary condition for algorithm design from a token-level credit assignment perspective: to prevent reward-irrelevant drift, intra-group objectives must maintain gradient exchangeability across token updates, enabling gradient cancellation on weak-credit/high-frequency tokens. We show that two common mechanisms disrupting exchangeability make "non-cancellation" a structural norm. Based on this, we propose minimal intra-group transformations to restore or approximate the cancellation structure in the shared token space. Experimental results demonstrate that these transformations stabilize training, improve sample efficiency, and enhance final performance, validating the value of this design condition.
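Sketch: the cancellation idea in the abstract above can be illustrated with a toy calculation. This is a minimal sketch under stated assumptions, not the paper's method: it assumes group-mean-centered advantages, and it assumes a token that contributes the same log-probability gradient in every response of the group, so only the sum of its per-response coefficients matters. The 1/length weighting is one hypothetical way that exchangeability could break; the abstract does not name its two mechanisms. All function and variable names are illustrative.

```python
def centered_advantages(rewards):
    """Group-relative advantages: each reward minus the group mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def shared_token_coefficient(advantages, weights):
    """Net scalar multiplying the gradient of a token shared by all responses.

    Assumes (for illustration only) that the token contributes the same
    log-probability gradient in every response, so only this sum matters.
    """
    return sum(a * w for a, w in zip(advantages, weights))


rewards = [1.0, 0.0, 0.0, 1.0]   # hypothetical sequence-level rewards
lengths = [10, 40, 20, 30]       # hypothetical response lengths in tokens
adv = centered_advantages(rewards)

# Equal weight per response: the contributions cancel.
uniform = shared_token_coefficient(adv, [1.0] * len(adv))

# Per-response 1/length weighting: the contributions no longer cancel.
length_weighted = shared_token_coefficient(adv, [1.0 / n for n in lengths])
```

With equal weights the shared token receives no net update, while the length-dependent weights leave a nonzero residual that does not depend on whether the token mattered for the reward.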

C$^2$T: Captioning-Structure and LLM-Aligned Common-Sense Reward Learning for Traffic--Vehicle Coordination

Authors: Yuyang Chen, Kaiyan Zhao, Yiming Wang, Ming Yang, Bin Rao, Zhenning Li
Subjects: Multiagent Systems (cs.MA); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.13098
Pdf link: https://arxiv.org/pdf/2604.13098
Abstract State-of-the-art (SOTA) urban traffic control increasingly employs Multi-Agent Reinforcement Learning (MARL) to coordinate Traffic Light Controllers (TLCs) and Connected Autonomous Vehicles (CAVs). However, the performance of these systems is fundamentally capped by their hand-crafted, myopic rewards (e.g., intersection pressure), which fail to capture high-level, human-centric goals like safety, flow stability, and comfort. To overcome this limitation, we introduce C2T, a novel framework that learns a common-sense coordination model from traffic-vehicle dynamics. C2T distills "common-sense" knowledge from a Large Language Model (LLM) into a learned intrinsic reward function. This new reward is then used to guide the coordination policy of a cooperative multi-intersection TLC MARL system on CityFlow-based multi-intersection benchmarks. Our framework significantly outperforms strong MARL baselines in traffic efficiency, safety, and an energy-related proxy. We further highlight C2T's flexibility in principle, allowing distinct "efficiency-focused" versus "safety-focused" policies by modifying the LLM prompt.

Automated co-design of high-performance thermodynamic cycles via graph-based hierarchical reinforcement learning

Authors: Wenqing Li, Xu Feng, Peixue Jiang, Yinhai Zhu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.13133
Pdf link: https://arxiv.org/pdf/2604.13133
Abstract Thermodynamic cycles are pivotal in determining the efficacy of energy conversion systems. Traditional design methodologies, which rely on expert knowledge or exhaustive enumeration, are inefficient and lack scalability, thereby constraining the discovery of high-performance cycles. In this study, we introduce a graph-based hierarchical reinforcement learning approach for the co-design of structure parameters in thermodynamic cycles. These cycles are encoded as graphs, with components and connections depicted as nodes and edges, adhering to grammatical constraints. A deep learning-based thermophysical surrogate facilitates stable graph decoding and the simultaneous resolution of global parameters. Building on this foundation, we develop a hierarchical reinforcement learning framework wherein a high-level manager explores structural evolution and proposes candidate configurations, whereas a low-level worker optimizes parameters and provides performance rewards to steer the search towards high-performance regions. By integrating graph representation, thermophysical surrogate, and manager-worker learning, this method establishes a fully automated pipeline for encoding, decoding, and co-optimization. Using heat pump and heat engine cycles as case studies, the results demonstrate that the proposed method not only replicates classical cycle configurations but also identifies 18 and 21 novel heat pump and heat engine cycles, respectively. Relative to classical cycles, the novel configurations exhibit performance improvements of 4.6% and 133.3%, respectively, surpassing the traditional designs. This method effectively balances efficiency with broad applicability, providing a practical and scalable intelligent alternative to expert-driven thermodynamic cycle design.

Pareto-Optimal Offline Reinforcement Learning via Smooth Tchebysheff Scalarization

Authors: Aadyot Bhatnagar, Peter Mørch Groth, Ali Madani
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Arxiv link: https://arxiv.org/abs/2604.13175
Pdf link: https://arxiv.org/pdf/2604.13175
Abstract Large language models can be aligned with human preferences through offline reinforcement learning (RL) on small labeled datasets. While single-objective alignment is well-studied, many real-world applications demand the simultaneous optimization of multiple conflicting rewards, e.g. optimizing both catalytic activity and specificity in protein engineering, or helpfulness and harmlessness for chatbots. Prior work has largely relied on linear reward scalarization, but this approach provably fails to recover non-convex regions of the Pareto front. In this paper, instead of scalarizing the rewards directly, we frame multi-objective RL itself as an optimization problem to be scalarized via smooth Tchebysheff scalarization, a recent technique that overcomes the shortcomings of linear scalarization. We use this formulation to derive Smooth Tchebysheff Optimization of Multi-Objective Preferences (STOMP), a novel offline RL algorithm that extends direct preference optimization to the multi-objective setting in a principled way by standardizing the individual rewards based on their observed distributions. We empirically validate STOMP on a range of protein engineering tasks by aligning three autoregressive protein language models on three laboratory datasets of protein fitness. Compared to state-of-the-art baselines, STOMP achieves the highest hypervolumes in eight of nine settings according to both offline off-policy and generative evaluations. We thus demonstrate that STOMP is a powerful, robust multi-objective alignment algorithm that can meaningfully improve post-trained models for multi-attribute protein optimization and beyond.
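Sketch: for readers unfamiliar with the scalarization named in the abstract above, the following compares the classical Tchebysheff scalarization with its smooth log-sum-exp relaxation in the commonly used minimization form, g(x) = mu * log(sum_i exp(w_i * (f_i(x) - z_i) / mu)). This is a generic illustration only. How STOMP applies the scalarization to the multi-objective RL problem, and how it standardizes rewards, is not shown here. The names `losses`, `weights`, `ideal`, and `mu` are assumptions.

```python
import math


def tchebysheff(losses, weights, ideal):
    """Classical (non-smooth) Tchebysheff scalarization: worst weighted gap."""
    return max(w * (f - z) for f, w, z in zip(losses, weights, ideal))


def smooth_tchebysheff(losses, weights, ideal, mu=0.1):
    """Smooth relaxation: mu * logsumexp(weighted gaps / mu).

    Computed with the usual max-shift for numerical stability.
    """
    scaled = [w * (f - z) / mu for f, w, z in zip(losses, weights, ideal)]
    top = max(scaled)
    return mu * (top + math.log(sum(math.exp(s - top) for s in scaled)))


# Hypothetical two-objective example (both objectives are minimized).
losses, weights, ideal = [1.0, 0.5], [0.5, 0.5], [0.0, 0.0]
hard = tchebysheff(losses, weights, ideal)
soft = smooth_tchebysheff(losses, weights, ideal, mu=0.1)
```

The smooth value upper-bounds the hard maximum by at most mu * log(m) for m objectives and approaches it as mu shrinks, which is what makes it usable with gradient-based training.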

Synthesis and Deployment of Maximal Robust Control Barrier Functions through Adversarial Reinforcement Learning

Authors: Donggeon David Oh, Duy P. Nguyen, Haimin Hu, Jaime Fernández Fisac
Subjects: Systems and Control (eess.SY); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.13192
Pdf link: https://arxiv.org/pdf/2604.13192
Abstract Robust control barrier functions (CBFs) provide a principled mechanism for smooth safety enforcement under worst-case disturbances. However, existing approaches typically rely on explicit, closed-form structure in the dynamics (e.g., control-affine) and uncertainty models. This has led to limited scalability and generality, with most robust CBFs certifying only conservative subsets of the maximal robust safe set. In this paper, we introduce a new robust CBF framework for general nonlinear systems under bounded uncertainty. We first show that the safety value function solving the dynamic programming Isaacs equation is a valid robust discrete-time CBF that enforces safety on the maximal robust safe set. We then adopt the key reinforcement learning (RL) notion of quality function (or Q-function), which removes the need for explicit dynamics by lifting the barrier certificate into state-action space and yields a novel robust Q-CBF constraint for safety filtering. Combined with adversarial RL, this enables the synthesis and deployment of robust Q-CBFs on general nonlinear systems with black-box dynamics and unknown uncertainty structure. We validate the framework on a canonical inverted pendulum benchmark and a 36-D quadruped simulator, achieving substantially less conservative safe sets than barrier-based baselines on the pendulum and reliable safety enforcement even under adversarial uncertainty realizations on the quadruped.

From Prediction to Justification: Aligning Sentiment Reasoning with Human Rationale via Reinforcement Learning

Authors: Shihao Zhang, Ziwei Wang, Jie Zhou, Yulan Wu, Qin Chen, Zhikai Lei, Liyang Yu, Liang Dou, Liang He
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.13398
Pdf link: https://arxiv.org/pdf/2604.13398
Abstract While Aspect-based Sentiment Analysis (ABSA) systems have achieved high accuracy in identifying sentiment polarities, they often operate as "black boxes," lacking the explicit reasoning capabilities characteristic of human affective cognition. Humans do not merely categorize sentiment; they construct causal explanations for their judgments. To bridge this gap, we propose ABSA-R1, a large language model framework designed to mimic this "reason-before-predict" cognitive process. By leveraging reinforcement learning (RL), ABSA-R1 learns to articulate the why behind the what, generating natural language justifications that ground its sentiment predictions. We introduce a Cognition-Aligned Reward Model (formerly sentiment-aware reward model) that enforces consistency between the generated reasoning path and the final emotional label. Furthermore, inspired by metacognitive monitoring, we implement a performance-driven rejection sampling strategy that selectively targets hard cases where the model's internal reasoning is uncertain or inconsistent. Experimental results on four benchmarks demonstrate that equipping models with this explicit reasoning capability not only enhances interpretability but also yields superior performance in sentiment classification and triplet extraction compared to non-reasoning baselines.

Minimax Optimality and Spectral Routing for Majority-Vote Ensembles under Markov Dependence

Authors: Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.13414
Pdf link: https://arxiv.org/pdf/2604.13414
Abstract Majority-vote ensembles achieve variance reduction by averaging over diverse, approximately independent base learners. When training data exhibits Markov dependence, as in time-series forecasting, reinforcement learning (RL) replay buffers, and spatial grids, this classical guarantee degrades in ways that existing theory does not fully quantify. We provide a minimax characterization of this phenomenon for discrete classification in a fixed-dimensional Markov setting, together with an adaptive algorithm that matches the rate on a graph-regular subclass. We first establish an information-theoretic lower bound for stationary, reversible, geometrically ergodic chains in fixed ambient dimension, showing that no measurable estimator can achieve excess classification risk better than $\Omega(\sqrt{T_{\mathrm{mix}}/n})$. We then prove that, on the AR(1) witness subclass underlying the lower-bound construction, dependence-agnostic uniform bagging is provably suboptimal with excess risk bounded below by $\Omega(T_{\mathrm{mix}}/\sqrt{n})$, exhibiting a $\sqrt{T_{\mathrm{mix}}}$ algorithmic gap. Finally, we propose adaptive spectral routing, which partitions the training data via the empirical Fiedler eigenvector of a dependency graph and achieves the minimax rate $\mathcal{O}(\sqrt{T_{\mathrm{mix}}/n})$ up to a lower-order geometric cut term on a graph-regular subclass, without knowledge of $T_{\mathrm{mix}}$. Experiments on synthetic Markov chains, 2D spatial grids, the 128-dataset UCR archive, and Atari DQN ensembles validate the theoretical predictions. Consequences for deep RL target variance, scalability via Nyström approximation, and bounded non-stationarity are developed as supporting material in the appendix.
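Sketch: the partitioning step mentioned in the abstract above (splitting data by the Fiedler eigenvector of a dependency graph) can be illustrated on a toy graph. This is a minimal sketch under stated assumptions, not the paper's algorithm: the dependency graph is assumed to be a path linking consecutive samples of a sequence, the split is a plain sign cut, and the Fiedler vector is found by shifted power iteration in pure Python. The paper's graph construction, routing rule, and ensemble training are not reproduced, and all names are hypothetical.

```python
def laplacian(adj):
    """Unnormalized graph Laplacian L = D - W for a symmetric weight matrix."""
    n = len(adj)
    return [[(sum(adj[i]) if i == j else 0.0) - adj[i][j] for j in range(n)]
            for i in range(n)]


def fiedler_vector(adj, iters=2000):
    """Eigenvector of the second-smallest Laplacian eigenvalue.

    Power iteration on (shift * I - L), restricted to vectors orthogonal to
    the constant vector. Assumes a connected graph with at least two nodes.
    """
    n = len(adj)
    lap = laplacian(adj)
    shift = 2.0 * max(sum(row) for row in adj)  # upper bound on the top eigenvalue
    v = [float(i) for i in range(n)]            # deterministic starting vector
    for _ in range(iters):
        mean = sum(v) / n
        v = [x - mean for x in v]               # project out the constant vector
        v = [shift * v[i] - sum(lap[i][j] * v[j] for j in range(n))
             for i in range(n)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    return v


def spectral_split(adj):
    """Assign each node to one of two groups by the sign of its Fiedler entry."""
    return [1 if x > 0 else 0 for x in fiedler_vector(adj)]


# Toy dependency graph: six consecutive samples of one sequence (a path).
n = 6
path = [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)] for i in range(n)]
labels = spectral_split(path)
```

On the path graph the cut falls in the middle, so each group holds a contiguous block of samples and only one dependency edge crosses the partition.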

Bridging MARL to SARL: An Order-Independent Multi-Agent Transformer via Latent Consensus

Authors: Zijian Zhao, Jing Gao, Sen Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2604.13472
Pdf link: https://arxiv.org/pdf/2604.13472
Abstract Cooperative multi-agent reinforcement learning (MARL) is widely used to address large joint observation and action spaces by decomposing a centralized control problem into multiple interacting agents. However, such decomposition often introduces additional challenges, including non-stationarity, unstable training, weak coordination, and limited theoretical guarantees. In this paper, we propose the Consensus Multi-Agent Transformer (CMAT), a centralized framework that bridges cooperative MARL to a hierarchical single-agent reinforcement learning (SARL) formulation. CMAT treats all agents as a unified entity and employs a Transformer encoder to process the large joint observation space. To handle the extensive joint action space, we introduce a hierarchical decision-making mechanism in which a Transformer decoder autoregressively generates a high-level consensus vector, simulating the process by which agents reach agreement on their strategies in latent space. Conditioned on this consensus, all agents generate their actions simultaneously, enabling order-independent joint decision making and avoiding the sensitivity to action-generation order in conventional Multi-Agent Transformers (MAT). This factorization allows the joint policy to be optimized using single-agent PPO while preserving expressive coordination through the latent consensus. To evaluate the proposed method, we conduct experiments on benchmark tasks from StarCraft II, Multi-Agent MuJoCo, and Google Research Football. The results show that CMAT achieves superior performance over recent centralized solutions, sequential MARL methods, and conventional MARL baselines. The code for this paper is available at: this https URL.

Towards Scalable Lightweight GUI Agents via Multi-role Orchestration

Authors: Ziwei Wang, Junjie Zheng, Leyang Yang, Sheng Zhou, Xiaoxuan Tang, Zhouhua Fang, Zhiwei Liu, Dajun Chen, Yong Li, Jiajun Bu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.13488
Pdf link: https://arxiv.org/pdf/2604.13488
Abstract Autonomous Graphical User Interface (GUI) agents powered by Multimodal Large Language Models (MLLMs) enable digital automation on end-user devices. While scaling both parameters and data has yielded substantial gains, advanced methods still suffer from prohibitive deployment costs on resource-constrained devices. When facing complex in-the-wild scenarios, lightweight GUI agents are bottlenecked by limited capacity and poor task scalability under end-to-end episodic learning, impeding adaptation to multi-agent systems (MAS), while training multiple skill-specific experts remains costly. Can we strike an effective trade-off in this cost-scalability dilemma, enabling lightweight MLLMs to participate in realistic GUI workflows? To address these challenges, we propose the LAMO framework, which endows a lightweight MLLM with GUI-specific knowledge and task scalability, allowing multi-role orchestration to expand its capability boundary for GUI automation. LAMO combines role-oriented data synthesis with a two-stage training recipe: (i) supervised fine-tuning with Perplexity-Weighted Cross-Entropy optimization for knowledge distillation and visual perception enhancement, and (ii) reinforcement learning for role-oriented cooperative exploration. With LAMO, we develop a task-scalable native GUI agent, LAMO-3B, supporting monolithic execution and MAS-style orchestration. When paired with advanced planners as a plug-and-play policy executor, LAMO-3B can continuously benefit from planner advances, enabling a higher performance ceiling. Extensive static and online evaluations validate the effectiveness of our design.

Chain of Uncertain Rewards with Large Language Models for Reinforcement Learning

Authors: Shentong Mo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.13504
Pdf link: https://arxiv.org/pdf/2604.13504
Abstract Designing effective reward functions is a cornerstone of reinforcement learning (RL), yet it remains a challenging and labor-intensive process due to the inefficiencies and inconsistencies inherent in traditional methods. Existing methods often rely on extensive manual design and evaluation steps, which are prone to redundancy and overlook local uncertainties at intermediate decision points. To address these challenges, we propose the Chain of Uncertain Rewards (CoUR), a novel framework that integrates large language models (LLMs) to streamline reward function design and evaluation in RL environments. Specifically, our CoUR introduces code uncertainty quantification with a similarity selection mechanism that combines textual and semantic analyses to identify and reuse the most relevant reward function components. By reducing redundant evaluations and leveraging Bayesian optimization on decoupled reward terms, CoUR enables a more efficient and robust search for optimal reward feedback. We comprehensively evaluate CoUR across nine original environments from IsaacGym and all 20 tasks from the Bidexterous Manipulation benchmark. The experimental results demonstrate that CoUR not only achieves better performance but also significantly lowers the cost of reward evaluations.

Representation over Routing: Overcoming Surrogate Hacking in Multi-Timescale PPO

Authors: Jing Sun
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.13517
Pdf link: https://arxiv.org/pdf/2604.13517
Abstract Temporal credit assignment in reinforcement learning has long been a central challenge. Inspired by the multi-timescale encoding of the dopamine system in neurobiology, recent research has sought to introduce multiple discount factors into Actor-Critic architectures, such as Proximal Policy Optimization (PPO), to balance short-term responses with long-term planning. However, this paper reveals that blindly fusing multi-timescale signals in complex delayed-reward tasks can lead to severe algorithmic pathologies. We systematically demonstrate that exposing a temporal attention routing mechanism to policy gradients results in surrogate objective hacking, while adopting gradient-free uncertainty weighting triggers irreversible myopic degeneration, a phenomenon we term the Paradox of Temporal Uncertainty. To address these issues, we propose a Target Decoupling architecture: on the Critic side, we retain multi-timescale predictions to enforce auxiliary representation learning, while on the Actor side, we strictly isolate short-term signals and update the policy based solely on long-term advantages. Rigorous empirical evaluations across multiple independent random seeds in the LunarLander-v2 environment demonstrate that our proposed architecture achieves statistically significant performance improvements. Without relying on hyperparameter hacking, it consistently surpasses the "Environment Solved" threshold with minimal variance, completely eliminates policy collapse, and escapes the hovering local optima that trap single-timescale baselines.
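Sketch: the Target Decoupling idea in the abstract above (critic trained on several discount factors, actor updated only from the long-horizon signal) can be illustrated as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses Monte Carlo returns rather than whatever advantage estimator the paper uses, plain Python lists instead of a neural network, and hypothetical discount factors in `GAMMAS`.

```python
GAMMAS = (0.5, 0.9, 0.99)  # hypothetical discount factors, short to long horizon


def discounted_returns(rewards, gamma):
    """Return-to-go at every step of one episode for a given discount factor."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


def critic_loss(values, rewards):
    """Mean squared error summed over every timescale head.

    All heads are trained, which is the auxiliary representation-learning
    role the abstract assigns to the critic.
    """
    loss = 0.0
    for g in GAMMAS:
        targets = discounted_returns(rewards, g)
        loss += sum((t - v) ** 2 for t, v in zip(targets, values[g])) / len(rewards)
    return loss


def actor_advantages(values, rewards):
    """Advantages for the policy update, taken from the longest horizon only."""
    g = max(GAMMAS)
    targets = discounted_returns(rewards, g)
    return [t - v for t, v in zip(targets, values[g])]


# Toy episode with a single delayed reward and an untrained critic.
rewards = [0.0, 0.0, 1.0]
values = {g: [0.0, 0.0, 0.0] for g in GAMMAS}
adv = actor_advantages(values, rewards)
```

The short-horizon heads influence only the critic's loss, so they can shape shared features without ever entering the policy gradient.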

RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management

Authors: Renqi Chen, Zeyin Tao, Jianming Guo, Jing Wang, Zezhou Xu, Jingzhe Zhu, Qingqing Sun, Tianyi Zhang, Shuai Chen
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.13531
Pdf link: https://arxiv.org/pdf/2604.13531
Abstract Graphical User Interface (GUI) agents show strong capabilities for automating web tasks, but existing interactive benchmarks primarily target benign, predictable consumer environments. Their effectiveness in high-stakes, investigative domains such as authentic e-commerce risk management remains underexplored. To bridge this gap, we present RiskWebWorld, the first highly realistic interactive benchmark for evaluating GUI agents in e-commerce risk management. RiskWebWorld features 1,513 tasks sourced from production risk-control pipelines across 8 core domains, and captures the authentic challenges of risk operations on uncooperative websites, partially environmental hijackments. To support scalable evaluation and agentic reinforcement learning (RL), we further build a Gymnasium-compliant infrastructure that decouples policy planning from environment mechanics. Our evaluation across diverse models reveals a dramatic capability gap: top-tier generalist models achieve 49.1% success, while specialized open-weights GUI models lag at near-total failure. This highlights that foundation model scale currently matters more than zero-shot interface grounding in long-horizon professional tasks. We also demonstrate the viability of our infrastructure through agentic RL, which improves open-source models by 16.2%. These results position RiskWebWorld as a practical testbed for developing robust digital workers.
中文摘要 图形用户界面（GUI）智能体在自动化网页任务方面表现出强大能力，但现有的交互式基准主要针对良性、可预测的消费级环境。它们在高风险的调查类领域（如真实的电子商务风险管理）中的有效性仍未得到充分探索。为弥合这一差距，我们推出了RiskWebWorld，这是首个高度真实的交互式基准，用于评估电子商务风险管理中的GUI智能体。RiskWebWorld包含来自生产环境风控流水线、覆盖8个核心领域的1,513个任务，刻画了在不配合的网站上开展风险运营的真实挑战，部分任务涉及环境劫持。为支持可扩展的评估和智能体强化学习（RL），我们进一步构建了兼容Gymnasium的基础设施，将策略规划与环境机制解耦。我们对多种模型的评估显示出巨大的能力差距：顶级通用模型的成功率为49.1%，而专用的开放权重GUI模型则近乎完全失败。这表明在长程专业任务中，基础模型的规模目前比零样本界面定位（grounding）更为重要。我们还通过智能体强化学习验证了该基础设施的可行性，使开源模型提升了16.2%。这些结果使RiskWebWorld成为开发稳健数字员工的实用试验平台。
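A minimal sketch of what "Gymnasium-compliant" means in practice, assuming only the public Gymnasium convention that `reset` returns `(observation, info)` and `step` returns `(observation, reward, terminated, truncated, info)`. The `ToyRiskEnv` class, its page names, and the `submit_verdict` action are hypothetical stand-ins, not part of RiskWebWorld; the point is that the policy only sees the interface and never the environment internals.

```python
class ToyRiskEnv:
    """Hypothetical stand-in environment that follows the Gymnasium reset/step convention."""

    def __init__(self, max_steps=3):
        self.max_steps = max_steps
        self._steps = 0

    def reset(self, seed=None, options=None):
        # Gymnasium convention: reset returns (observation, info).
        self._steps = 0
        return {"page": "case_overview"}, {}

    def step(self, action):
        # Gymnasium convention: step returns
        # (observation, reward, terminated, truncated, info).
        self._steps += 1
        terminated = action == "submit_verdict"
        truncated = not terminated and self._steps >= self.max_steps
        reward = 1.0 if terminated else 0.0
        return {"page": "case_detail"}, reward, terminated, truncated, {}


def run_episode(env, policy):
    """Roll out one episode; the policy is any callable mapping observation -> action."""
    obs, info = env.reset()
    total = 0.0
    while True:
        obs, reward, terminated, truncated, info = env.step(policy(obs))
        total += reward
        if terminated or truncated:
            return total
```

Because `run_episode` depends only on the two method signatures, the same loop could drive a scripted baseline, an LLM-backed planner, or an RL policy without touching the environment code.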

MM-Doc-R1: Training Agents for Long Document Visual Question Answering through Multi-turn Reinforcement Learning

MM-Doc-R1：通过多轮强化学习训练代理进行长文档视觉问答

Authors: Jiahang Lin, Kai Hu, Binghai Wang, Yuhao Zhou, Zhiheng Xi, Honglin Guo, Shichun Liu, Junzhe Wang, Shihan Dou, Enyu Zhou, Hang Yan, Zhenhua Han, Tao Gui, Qi Zhang, Xuanjing Huang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.13579
Pdf link: https://arxiv.org/pdf/2604.13579
Abstract Conventional Retrieval-Augmented Generation (RAG) systems often struggle with complex multi-hop queries over long documents due to their single-pass retrieval. We introduce MM-Doc-R1, a novel framework that employs an agentic, vision-aware workflow to address long document visual question answering through iterative information discovery and synthesis. To incentivize the information seeking capabilities of our agents, we propose Similarity-based Policy Optimization (SPO), addressing baseline estimation bias in existing multi-turn reinforcement learning (RL) algorithms like GRPO. Our core insight is that in multi-turn RL, the more semantically similar two trajectories are, the more accurate their shared baseline estimation becomes. Leveraging this, SPO calculates a more precise baseline by similarity-weighted averaging of rewards across multiple trajectories, unlike GRPO which inappropriately applies the initial state's baseline to all intermediate states. This provides a more stable and accurate learning signal for our agents, leading to superior training performance that surpasses GRPO. Our experiments on the MMLongbench-Doc benchmark show that MM-Doc-R1 outperforms previous baselines by 10.4%. Furthermore, SPO demonstrates superior performance over GRPO, boosting results by 5.0% with Qwen3-8B and 6.1% with Qwen3-4B. These results highlight the effectiveness of our integrated framework and novel training algorithm in advancing the state-of-the-art for complex, long-document visual question answering.
中文摘要 传统的检索增强生成（RAG）系统由于单次检索，常常难以处理长文档上的复杂多跳查询。我们介绍MM-Doc-R1，这是一个新颖框架，采用代理性、视觉感知的工作流程，通过迭代信息发现和综合处理长文档的视觉问题解答。为了激励代理信息寻求能力，我们提出了基于相似度的策略优化（SPO），解决现有多回合强化学习（RL）算法如GRPO中的基线估计偏差。我们的核心见解是，在多回合强化学习中，两条轨迹语义上越相似，它们共享的基线估计就越准确。基于此，SPO通过多条轨迹上的相似加权平均奖励计算更精确的基线，不同于GRPO错误地将初始状态的基线应用于所有中间状态。这为我们的代理提供了更稳定、更准确的学习信号，从而实现了超越GRPO的优越训练表现。我们在MMLongbench-Doc基准上的实验显示，MM-Doc-R1比之前基线高出10.4%。此外，SPO表现优于GRPO，Qwen3-8B提升了5.0%，Qwen3-4B提升了6.1%。这些结果凸显了我们集成框架和新颖训练算法在推动复杂长文档视觉问答技术方面取得的有效性。
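The contrast between a group-mean baseline and a similarity-weighted baseline can be sketched as follows. This is an illustration under stated assumptions, not the paper's SPO implementation: the `similarity` matrix is a hypothetical input (for example, cosine similarity between trajectory embeddings), and the exact weighting used by the authors is not given in the abstract.

```python
def group_mean_baseline(rewards):
    """GRPO-style baseline: one mean reward shared by every trajectory in the group."""
    mean = sum(rewards) / len(rewards)
    return [mean for _ in rewards]


def similarity_weighted_baseline(rewards, similarity):
    """Per-trajectory baseline: rewards averaged with weights similarity[i][j].

    `similarity` is a hypothetical square matrix of non-negative scores;
    row i says how much each trajectory j should count for trajectory i.
    """
    baselines = []
    for i in range(len(rewards)):
        weights = similarity[i]
        total = sum(weights)
        baselines.append(sum(w * r for w, r in zip(weights, rewards)) / total)
    return baselines


def advantages(rewards, baselines):
    """Advantage = reward minus baseline, one value per trajectory."""
    return [r - b for r, b in zip(rewards, baselines)]


rewards = [1.0, 0.0, 1.0]
# Made-up similarities: neighbouring trajectories overlap, the two ends do not.
similarity = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.5],
    [0.0, 0.5, 1.0],
]
grpo_adv = advantages(rewards, group_mean_baseline(rewards))
weighted_adv = advantages(rewards, similarity_weighted_baseline(rewards, similarity))
```

With a uniform similarity matrix the weighted baseline collapses to the group mean, so the group-mean baseline can be read as the special case in which all trajectories are treated as equally similar.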

Enhancing Reinforcement Learning for Radiology Report Generation with Evidence-aware Rewards and Self-correcting Preference Learning

通过证据意识奖励和自我纠正偏好学习，增强放射科报告生成的强化学习

Authors: Qin Zhou, Guoyan Liang, Qianyi Yang, Jingyuan Chen, Sai Wu, Chang Yao, Zhe Wang
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)
Arxiv link: https://arxiv.org/abs/2604.13598
Pdf link: https://arxiv.org/pdf/2604.13598
Abstract Recent reinforcement learning (RL) approaches have advanced radiology report generation (RRG), yet two core limitations persist: (1) report-level rewards offer limited evidence-grounded guidance for clinical faithfulness; and (2) current methods lack an explicit self-improving mechanism to align with clinical preference. We introduce clinically aligned Evidence-aware Self-Correcting Reinforcement Learning (ESC-RL), comprising two key components. First, a Group-wise Evidence-aware Alignment Reward (GEAR) delivers group-wise, evidence-aware feedback. GEAR reinforces consistent grounding for true positives, recovers missed findings for false negatives, and suppresses unsupported content for false positives. Second, a Self-correcting Preference Learning (SPL) strategy automatically constructs a reliable, disease-aware preference dataset from multiple noisy observations and leverages an LLM to synthesize refined reports without human supervision. ESC-RL promotes clinically faithful, disease-aligned reward and supports continual self-improvement during training. Extensive experiments on two public chest X-ray datasets demonstrate consistent gains and state-of-the-art performance.
中文摘要 近年来的强化学习（RL）方法推动了放射科报告生成（RRG）的发展，但仍存在两个核心局限：（1）报告级奖励对临床忠实度提供了有限的循证指导;以及（2）现有方法缺乏明确的自我改进机制以符合临床偏好。我们引入了临床对齐的证据意识自我纠正强化学习（ESC-RL），由两个关键组成部分组成。首先，按组计算的证据意识对齐奖励（GEAR）提供按组、证据为意识的反馈。GEAR强化真阳性时的一致基础，恢复漏掉的假阴性发现，并抑制无支持的假阳性内容。其次，自我纠正偏好学习（SPL）策略自动构建一个可靠且具备疾病的偏好数据集，基于多个噪声观测，并利用大型语言模型在无人工监督的情况下综合精炼报告。ESC-RL促进临床忠实、符合疾病的奖励，并支持培训期间的持续自我提升。对两组公共胸部X光数据集的广泛实验显示了持续的提升和最先进的性能。

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

大型模型时代的奖励黑客：机制、涌现的错位与挑战

Authors: Xiaohua Wang, Muzhao Tian, Yuqi Zeng, Zisu Huang, Jiakang Yuan, Bowen Chen, Jingwen Xu, Mingbo Zhou, Wenhao Liu, Muling Wu, Zhengkang Guo, Qi Qian, Yifei Wang, Feiran Zhang, Ruicheng Yin, Shihan Dou, Changze Lv, Tao Chen, Kaitao Song, Xu Tan, Tao Gui, Xiaoqing Zheng, Xuanjing Huang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.13602
Pdf link: https://arxiv.org/pdf/2604.13602
Abstract Reinforcement Learning from Human Feedback (RLHF) and related alignment paradigms have become central to steering large language models (LLMs) and multimodal large language models (MLLMs) toward human-preferred behaviors. However, these approaches introduce a systemic vulnerability: reward hacking, where models exploit imperfections in learned reward signals to maximize proxy objectives without fulfilling true task intent. As models scale and optimization intensifies, such exploitation manifests as verbosity bias, sycophancy, hallucinated justification, benchmark overfitting, and, in multimodal settings, perception-reasoning decoupling and evaluator manipulation. Recent evidence further suggests that seemingly benign shortcut behaviors can generalize into broader forms of misalignment, including deception and strategic gaming of oversight mechanisms. In this survey, we propose the Proxy Compression Hypothesis (PCH) as a unifying framework for understanding reward hacking. We formalize reward hacking as an emergent consequence of optimizing expressive policies against compressed reward representations of high-dimensional human objectives. Under this view, reward hacking arises from the interaction of objective compression, optimization amplification, and evaluator-policy co-adaptation. This perspective unifies empirical phenomena across RLHF, RLAIF, and RLVR regimes, and explains how local shortcut learning can generalize into broader forms of misalignment, including deception and strategic manipulation of oversight mechanisms. We further organize detection and mitigation strategies according to how they intervene on compression, amplification, or co-adaptation dynamics. By framing reward hacking as a structural instability of proxy-based alignment under scale, we highlight open challenges in scalable oversight, multimodal grounding, and agentic autonomy.
中文摘要 来自人类反馈的强化学习（RLHF）及相关对齐范式已成为引导大型语言模型（LLMs）和多模态大型语言模型（MLLM）朝向人类偏好行为的核心。然而，这些方法引入了一个系统性漏洞：奖励黑客，即模型利用已学习奖励信号的缺陷以最大化代理目标，但未能实现真正的任务意图。随着模型规模和优化的加剧，这种利用表现为冗长偏见、谄媚、幻觉式的辩解、基准过拟合，以及在多模态环境中的感知——推理解耦和评估者操控。最新证据进一步表明，看似无害的捷径行为可能泛化为更广泛的错位形式，包括欺骗和对监督机制的战略性利用。在本次调查中，我们提出了代理压缩假说（PCH）作为理解奖励黑客的统一框架。我们将奖励黑客形式化为优化表达策略以应对高维人类目标的压缩奖励表征的一种涌现结果。根据这一观点，奖励黑客源于客观压缩、优化放大和评估器-策略共适应的相互作用。这一视角统一了RLHF、RLAIF和RLVR体系的实证现象，并解释了局部捷径学习如何推广成更广泛的错位形式，包括欺骗和对监督机制的战略操控。我们进一步组织检测和缓解策略，根据它们对压缩、放大或共适应动态的干预方式。通过将奖励黑客框架为大规模代理对齐的结构性不稳定性，我们凸显了可扩展监督、多模态基础和代理自主性方面面临的挑战。

VRAG-DFD: Verifiable Retrieval-Augmentation for MLLM-based Deepfake Detection

VRAG-DFD：基于MLLM的深度伪造检测可验证检索-增强

Authors: Hui Han, Shunli Wang, Yandan Zhao, Taiping Yao, Shouhong Ding
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.13660
Pdf link: https://arxiv.org/pdf/2604.13660
Abstract In Deepfake Detection (DFD) tasks, researchers have proposed two types of MLLM-based methods: complementary combination with small DFD detectors, or static forgery knowledge. The lack of professional forgery knowledge hinders the performance of these methods. To solve this, we consider two issues: how to provide high-quality associated forgery knowledge for MLLMs, and how to endow MLLMs with critical reasoning abilities given noisy reference information. We attempt to address these two questions with preliminary answers by combining Retrieval-Augmented Generation (RAG) and Reinforcement Learning (RL). Through RAG and RL techniques, we propose the VRAG-DFD framework with accurate dynamic forgery knowledge retrieval and powerful critical reasoning. In terms of data, we constructed two datasets with RAG: Forensic Knowledge Database (FKD) for DFD knowledge annotation, and Forensic Chain-of-Thought Dataset (F-CoT) for critical CoT. In terms of model training, we adopt a three-stage training method (Alignment->SFT->GRPO) to gradually cultivate the critical reasoning ability of the model. In terms of performance, VRAG-DFD achieved SOTA and competitive performance on DFD generalization testing.
中文摘要 在深度伪造检测（DFD）任务中，研究人员提出了两类基于MLLM的方法：与小型DFD检测器互补组合，或使用静态伪造知识。缺乏专业的伪造知识限制了这些方法的性能。为解决这一问题，我们深入思考了两个问题：如何为MLLM提供高质量的相关伪造知识？以及如何在参考信息含噪的情况下赋予MLLM批判性推理能力？我们尝试结合检索增强生成（RAG）和强化学习（RL），对上述两个问题给出初步回答。借助RAG和RL技术，我们提出了VRAG-DFD框架，具备准确的动态伪造知识检索和强大的批判性推理能力。在数据方面，我们利用RAG构建了两个数据集：用于DFD知识标注的取证知识库（FKD），以及用于批判性CoT的取证思维链数据集（F-CoT）。在模型训练方面，我们采用三阶段训练方法（Alignment->SFT->GRPO），逐步培养模型的批判性推理能力。在性能方面，VRAG-DFD在DFD泛化测试中取得了SOTA及有竞争力的表现。

Towards Fine-grained Temporal Perception: Post-Training Large Audio-Language Models with Audio-Side Time Prompt

迈向细粒度时间感知：带音频侧时间提示的大型音频语言模型训练后

Authors: Yanfeng Shi, Pengfei Cai, Jun Liu, Qing Gu, Nan Jiang, Lirong Dai, Ian McLoughlin, Yan Song
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.13715
Pdf link: https://arxiv.org/pdf/2604.13715
Abstract Large Audio-Language Models (LALMs) enable general audio understanding and demonstrate remarkable performance across various audio tasks. However, these models still face challenges in temporal perception (e.g., inferring event onset and offset), leading to limited utility in fine-grained scenarios. To address this issue, we propose Audio-Side Time Prompt and leverage Reinforcement Learning (RL) to develop the TimePro-RL framework for fine-grained temporal perception. Specifically, we encode timestamps as embeddings and interleave them within the audio feature sequence as temporal coordinates to prompt the model. Furthermore, we introduce RL following Supervised Fine-Tuning (SFT) to directly optimize temporal alignment performance. Experiments demonstrate that TimePro-RL achieves significant performance gains across a range of audio temporal tasks, such as audio grounding, sound event detection, and dense audio captioning, validating its robust effectiveness.
中文摘要 大型音频语言模型（LALMs）能够实现一般的音频理解，并在各种音频任务中展现出卓越的性能。然而，这些模型在时间感知方面仍面临挑战（例如推断事件发生和偏移），导致在细粒度场景中的实用性有限。为解决这一问题，我们提出了音频侧时间提示，并利用强化学习（RL）开发用于细粒度时间感知的TimePro-RL框架。具体来说，我们将时间戳编码为嵌入，并在音频特征序列中交错使用时间坐标，以提示模型。此外，我们在监督微调（SFT）后引入强化学习，直接优化时间对齐性能。实验表明，TimePro-RL在音频接地、声音事件检测和密集音频字幕等多种音频时间任务中实现了显著的性能提升，验证了其稳健的有效性。

Jump-Start Reinforcement Learning with Vision-Language-Action Regularization

带有视觉-语言-行动正则化的跳板强化学习

Authors: Angelo Moroncelli, Roberto Zanetti, Marco Maccarini, Loris Roveda
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.13733
Pdf link: https://arxiv.org/pdf/2604.13733
Abstract Reinforcement learning (RL) enables high-frequency, closed-loop control for robotic manipulation, but scaling to long-horizon tasks with sparse or imperfect rewards remains difficult due to inefficient exploration and poor credit assignment. Vision-Language-Action (VLA) models leverage large-scale multimodal pretraining to provide generalist, task-level reasoning, but current limitations hinder their direct use in fast and precise manipulation. In this paper, we propose Vision-Language-Action Jump-Starting (VLAJS), a method that bridges sparse VLA guidance with on-policy RL to improve exploration and learning efficiency. VLAJS treats VLAs as transient sources of high-level action suggestions that bias early exploration and improve credit assignment, while preserving the high-frequency, state-based control of RL. Our approach augments Proximal Policy Optimization (PPO) with a directional action-consistency regularization that softly aligns the RL agent's actions with VLA guidance during early training, without enforcing strict imitation, requiring demonstrations, or relying on continuous teacher queries. VLA guidance is applied sparsely and annealed over time, allowing the agent to adapt online and ultimately surpass the guiding policy. We evaluate VLAJS on six challenging manipulation tasks: lifting, pick-and-place, peg reorientation, peg insertion, poking, and pushing in simulation, and validate a subset on a real Franka Panda robot. VLAJS consistently outperforms PPO and distillation-style baselines in sample efficiency, reducing required environment interactions by over 50% in several tasks. Real-world experiments demonstrate zero-shot sim-to-real transfer and robust execution under clutter, object variation, and external perturbations.
中文摘要 强化学习（RL）能够为机器人操作提供高频闭环控制，但由于探索效率低和信用分配不佳，扩展到奖励稀疏或不完美的长程任务仍然困难。视觉-语言-动作（VLA）模型利用大规模多模态预训练提供通用的任务级推理，但当前的局限阻碍了其直接用于快速而精确的操作。本文提出了视觉-语言-动作跳跃启动（VLAJS）方法，将稀疏的VLA指导与同策略（on-policy）强化学习相结合，以提升探索和学习效率。VLAJS将VLA视为临时的高层动作建议来源，这些建议引导早期探索并改善信用分配，同时保留强化学习高频、基于状态的控制。我们的方法在近端策略优化（PPO）的基础上加入方向性动作一致性正则化，在训练早期温和地将强化学习智能体的动作与VLA指导对齐，无需严格模仿，无需演示，也不依赖持续的教师查询。VLA指导被稀疏地施加并随时间退火，使智能体能够在线适应并最终超越指导策略。我们在仿真中对六个具有挑战性的操作任务评估了VLAJS：抬升、抓取放置、销钉重定向、销钉插入、戳和推，并在真实的Franka Panda机器人上验证了其中一部分任务。VLAJS在样本效率上持续优于PPO和蒸馏式基线，在多个任务中将所需的环境交互次数减少了50%以上。真实世界实验展示了零样本的仿真到现实迁移，以及在杂乱环境、物体变化和外部扰动下的稳健执行。
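One way to picture an annealed, directional regularizer added to a policy loss is sketched below. Everything here is an assumption made for illustration: the linear schedule, the cosine-based penalty, and the function names are hypothetical, and the abstract does not specify the form VLAJS actually uses.

```python
import math


def annealed_weight(step, anneal_steps, initial_weight=1.0):
    """Linearly decay the guidance weight to zero over `anneal_steps` (assumed schedule)."""
    return initial_weight * max(0.0, 1.0 - step / anneal_steps)


def directional_penalty(agent_action, vla_action, eps=1e-8):
    """1 - cosine similarity: near zero when the two actions point the same way.

    Only the direction is compared, so the agent stays free to choose its own magnitude.
    """
    dot = sum(a * b for a, b in zip(agent_action, vla_action))
    norm_agent = math.sqrt(sum(a * a for a in agent_action))
    norm_vla = math.sqrt(sum(b * b for b in vla_action))
    return 1.0 - dot / (norm_agent * norm_vla + eps)


def regularized_loss(ppo_loss, agent_action, vla_action, step, anneal_steps):
    """PPO loss plus the annealed directional term (illustrative combination only)."""
    weight = annealed_weight(step, anneal_steps)
    return ppo_loss + weight * directional_penalty(agent_action, vla_action)
```

Once `step` reaches `anneal_steps` the extra term vanishes and the loss is plain PPO, which matches the idea that guidance shapes early exploration and then fades so that the agent can surpass the guiding policy.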

Soft $Q(λ)$: A multi-step off-policy method for entropy regularised reinforcement learning using eligibility traces

软 $Q(λ)$：一种基于资格迹的多步离策略熵正则化强化学习方法

Authors: Pranav Mahajan, Ben Seymour
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.13780
Pdf link: https://arxiv.org/pdf/2604.13780
Abstract Soft Q-learning has emerged as a versatile model-free method for entropy-regularised reinforcement learning, optimising for returns augmented with a penalty on the divergence from a reference policy. Despite its success, the multi-step extensions of soft Q-learning remain relatively unexplored and limited to on-policy action sampling under the Boltzmann policy. In this brief research note, we first present a formal $n$-step formulation for soft Q-learning and then extend this framework to the fully off-policy case by introducing a novel Soft Tree Backup operator. Finally, we unify these developments into Soft $Q(\lambda)$, an elegant online, off-policy, eligibility trace framework that allows for efficient credit assignment under arbitrary behaviour policies. Our derivations propose a model-free method for learning entropy-regularised value functions that can be utilised in future empirical experiments.
中文摘要 软Q学习已成为一种通用的无模型熵正则化强化学习方法，其优化目标是在回报基础上加入对偏离参考策略的散度惩罚。尽管取得了成功，软Q学习的多步扩展仍相对缺乏探索，且仅限于玻尔兹曼策略下的同策略动作采样。在这份简短的研究笔记中，我们首先给出软Q学习的形式化$n$步表述，随后通过引入一种新的软树回溯（Soft Tree Backup）算子，将该框架扩展到完全离策略的情形。最后，我们将这些进展统一为Soft $Q(\lambda)$，一个优雅的在线、离策略资格迹框架，可在任意行为策略下实现高效的信用分配。我们的推导给出了一种学习熵正则化价值函数的无模型方法，可用于未来的实证实验。
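For orientation, the standard one-step soft Q-learning update that the multi-step methods above generalise can be sketched as follows. This is the textbook KL-regularised form with soft value $V(s) = \frac{1}{\beta}\log\sum_a \pi_{\text{ref}}(a|s)\exp(\beta Q(s,a))$, not the paper's Soft Tree Backup or Soft $Q(\lambda)$ operator; the tabular layout and the default values for `alpha`, `gamma`, and `beta` are illustrative assumptions.

```python
import math


def soft_value(q_values, ref_probs, beta):
    """V(s) = (1/beta) * log sum_a ref(a|s) * exp(beta * Q(s, a))."""
    m = max(q_values)  # subtract the max before exponentiating for numerical stability
    total = sum(p * math.exp(beta * (q - m)) for q, p in zip(q_values, ref_probs))
    return m + math.log(total) / beta


def soft_q_update(q, state, action, reward, next_state, ref_probs,
                  alpha=0.1, gamma=0.99, beta=1.0):
    """One-step tabular update toward the target r + gamma * V_soft(next_state)."""
    target = reward + gamma * soft_value(q[next_state], ref_probs, beta)
    q[state][action] += alpha * (target - q[state][action])
    return q[state][action]


# Tiny two-state, two-action table with a uniform reference policy.
q_table = {0: [0.0, 0.0], 1: [0.0, 0.0]}
soft_q_update(q_table, state=0, action=1, reward=1.0, next_state=1, ref_probs=[0.5, 0.5])
```

As `beta` grows the soft value approaches the hard maximum over actions, so the update tends toward ordinary Q-learning; small `beta` keeps the implied policy close to the reference policy.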

DUET: Joint Exploration of User Item Profiles in Recommendation System

DUET：推荐系统中用户项目档案的联合探索

Authors: Yue Chen, Yifei Sun, Lu Wang, Fangkai Yang, Pu Zhao, Minjie Hong, Yifei Dong, Minghua He, Nan Hu, Jianjin Zhang, Zhiwei Dai, Yuefeng Zhan, Weihao Han, Hao Sun, Qingwei Lin, Weiwei Deng, Feng Sun, Qi Zhang, Saravan Rajmohan, Dongmei Zhang
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2604.13801
Pdf link: https://arxiv.org/pdf/2604.13801
Abstract Traditional recommendation systems represent users and items as dense vectors and learn to align them in a shared latent space for relevance estimation. Recent LLM-based recommenders instead leverage natural-language representations that are easier to interpret and integrate with downstream reasoning modules. This paper studies how to construct effective textual profiles for users and items, and how to align them for recommendation. A central difficulty is that the best profile format is not known a priori: manually designed templates can be brittle and misaligned with task objectives. Moreover, generating user and item profiles independently may produce descriptions that are individually plausible yet semantically inconsistent for a specific user-item pair. We propose Duet, an interaction-aware profile generator that jointly produces user and item profiles conditioned on both user history and item evidence. Duet follows a three-stage procedure: it first turns raw histories and metadata into compact cues, then expands these cues into paired profile prompts and generates profiles, and finally optimizes the generation policy with reinforcement learning using downstream recommendation performance as feedback. Experiments on three real-world datasets show that Duet consistently outperforms strong baselines, demonstrating the benefits of template-free profile exploration and joint user-item textual alignment.
中文摘要 传统推荐系统将用户和项目视为密集向量，并学习将它们对齐在共享的潜在空间中以进行相关性估计。近期基于LLM的推荐工具则利用更易解释和与下游推理模块整合的自然语言表示。本文研究如何为用户和项目构建有效的文本画像，以及如何将其对齐以供推荐。一个核心难题是最佳配置文件格式尚不存在事先：手动设计的模板可能脆弱且与任务目标不匹配。此外，独立生成用户和物品配置文件可能产生对特定用户——物品对——在语义上不一致的描述。我们提出了Duet，一种交互感知型配置文件生成器，能够结合用户历史和物品证据共同生成用户和物品配置文件。Duet遵循三阶段流程：首先将原始历史和元数据转化为紧凑的提示，然后将这些提示扩展为配对的配置文件提示，再生成配置文件，最后通过强化学习优化生成策略，利用下游推荐性能作为反馈。在三个真实世界数据集上的实验显示，Duet始终优于强基线，展示了无模板配置文件探索和用户-项目联合文本对齐的优势。

Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning

超越言语的性格：通过强化学习在音频大型语言模型中利用角色扮演评估

Authors: Dongjie Fu, Fangming Feng, Xize Cheng, Linjun Li, Zhou Zhao, Tao Jin
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.13804
Pdf link: https://arxiv.org/pdf/2604.13804
Abstract The rapid evolution of multimodal large models has revolutionized the simulation of diverse characters in speech dialogue systems, enabling a novel interactive paradigm. Character attributes are manifested not only in textual responses but also through vocal features, as speech conveys rich paralinguistic information that is challenging to quantify. This poses significant difficulties in evaluating the character alignment of role-playing agents. To address these challenges, we present RoleJudge, an evaluation framework that leverages audio large language models to systematically assess the alignment between speech and character across multiple modalities and dimensions. Furthermore, we introduce RoleChat, the first voice role-playing evaluation dataset enriched with chain-of-thought reasoning annotations, comprising a diverse set of authentic and LLM-generated speech samples. Utilizing this dataset, we implement a multi-stage training paradigm and incorporate Standard Alignment in reinforcement learning to mitigate reward misalignment during optimization. Experimental results in terms of accuracy and subjective assessment demonstrate that RoleJudge outperforms various baseline models, validating the effectiveness of our multidimensional evaluation framework.
中文摘要 多模态大型模型的快速演进彻底改变了语音对话系统中多样角色的模拟，开启了一种新的交互范式。性格特征不仅体现在文本回应中，也通过语音特征体现，因为语音传递的丰富副语言信息难以量化。这在评估角色扮演代理的角色阵营时带来了重大困难。为应对这些挑战，我们推出了RoleJudge，这是一个评估框架，利用音频大型语言模型系统性评估语音与性格在多种模态和维度上的对齐情况。此外，我们还介绍了RoleChat，这是首个充满思维链推理注释的语音角色扮演评估数据集，包含多样的真实性和大型语言模型生成的语音样本。利用该数据集，我们实现了多阶段训练范式，并在强化学习中融入标准对齐，以减轻优化过程中的奖励错位。实验结果显示，准确性和主观评估方面，RoleJudge优于多种基线模型，验证了我们多维评估框架的有效性。

AlphaCNOT: Learning CNOT Minimization with Model-Based Planning

AlphaCNOT：基于模型的规划学习 CNOT 最小化

Authors: Jacopo Cossio, Daniele Lizzio Bosco, Riccardo Romanello, Giuseppe Serra, Carla Piazza
Subjects: Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
Arxiv link: https://arxiv.org/abs/2604.13812
Pdf link: https://arxiv.org/pdf/2604.13812
Abstract Quantum circuit optimization is a central task in Quantum Computing, as current Noisy Intermediate Scale Quantum devices suffer from error propagation that often scales with the number of operations. Among quantum operations, the CNOT gate is of fundamental importance, being the only 2-qubit gate in the universal Clifford+T set. The problem of CNOT gate minimization has been addressed by heuristic algorithms such as the well-known Patel-Markov-Hayes (PMH) algorithm for linear reversible synthesis (i.e., CNOT minimization with no topological constraints), and more recently by Reinforcement Learning (RL) based strategies in the more complex case of topology-aware synthesis, where each CNOT can act on a subset of all qubit pairs. In this work we introduce AlphaCNOT, an RL framework based on Monte Carlo Tree Search (MCTS) that effectively addresses the CNOT minimization problem by modeling it as a planning problem. In contrast to other RL-based solutions, our method is model-based, i.e., it can leverage lookahead search to evaluate future trajectories, thus finding more efficient sequences of CNOTs. Our method achieves a reduction of up to 32% in CNOT gate count compared to the PMH baseline on linear reversible synthesis, while in the constrained version we report a consistent gate count reduction on a variety of topologies with up to 8 qubits, with respect to state-of-the-art RL-based solutions. Our results suggest the combination of RL with search-based strategies can be applied to different circuit optimization tasks, such as Clifford minimization, thus fostering the transition toward the "quantum utility" era.
中文摘要 量子电路优化是量子计算的核心任务，因为当前的含噪中等规模量子设备存在误差传播，且误差往往随操作数量增加而增长。在量子操作中，CNOT门具有根本性的重要性，因为它是通用Clifford+T门集中唯一的2量子比特门。CNOT门最小化问题此前已有启发式算法处理，例如著名的Patel-Markov-Hayes（PMH）算法，用于线性可逆综合（即无拓扑约束的CNOT最小化）；近期也有基于强化学习（RL）的策略用于更复杂的拓扑感知综合，其中每个CNOT只能作用于所有量子比特对的一个子集。在本研究中，我们提出AlphaCNOT，这是一种基于蒙特卡洛树搜索（MCTS）的强化学习框架，通过将CNOT最小化建模为规划问题来有效求解。与其他基于强化学习的方案不同，我们的方法是基于模型的，即可以利用前瞻搜索评估未来轨迹，从而找到更高效的CNOT序列。在线性可逆综合上，我们的方法与PMH基线相比CNOT门数最多减少32%；在带约束的版本中，我们在多种最多8个量子比特的拓扑结构上，相较于最先进的基于强化学习的方案取得了一致的门数减少。我们的结果表明，强化学习与基于搜索的策略相结合可应用于不同的电路优化任务，如Clifford最小化，从而推动向“量子效用”时代的过渡。
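To make the linear reversible synthesis problem concrete: a CNOT-only circuit on n qubits corresponds to an invertible n×n matrix over GF(2), and a CNOT with control c and target t adds row c to row t. The sketch below synthesises a circuit by plain Gaussian elimination. It is neither PMH nor AlphaCNOT, only an unoptimised reference whose gate count such methods aim to reduce; the function names are illustrative.

```python
def synthesize_cnots(matrix):
    """Return CNOT gates (control, target), in circuit order, implementing `matrix` over GF(2).

    Assumes `matrix` is invertible; a singular input raises StopIteration.
    """
    a = [row[:] for row in matrix]
    n = len(a)
    ops = []

    def add_row(control, target):
        # Row addition mod 2 is exactly the action of CNOT(control, target).
        a[target] = [x ^ y for x, y in zip(a[target], a[control])]
        ops.append((control, target))

    for col in range(n):
        if a[col][col] == 0:
            # Borrow a 1 from a lower row to create the pivot.
            pivot = next(r for r in range(col + 1, n) if a[r][col] == 1)
            add_row(pivot, col)
        for row in range(n):
            if row != col and a[row][col] == 1:
                add_row(col, row)
    # The recorded operations reduce the matrix to identity; since each CNOT
    # is its own inverse, reversing the list yields a circuit for the matrix.
    return ops[::-1]


def apply_cnots(n, gates):
    """Simulate a CNOT circuit on the identity matrix and return the resulting matrix."""
    m = [[int(i == j) for j in range(n)] for i in range(n)]
    for control, target in gates:
        m[target] = [x ^ y for x, y in zip(m[target], m[control])]
    return m


target_matrix = [[1, 1, 0], [0, 1, 1], [0, 0, 1]]
gates = synthesize_cnots(target_matrix)
```

The gate count `len(gates)` is the quantity a minimisation method tries to drive down; in the topology-aware setting, each `(control, target)` pair would additionally have to be an allowed edge of the device graph, which this sketch does not enforce.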

RPS: Information Elicitation with Reinforcement Prompt Selection

RPS：信息引导与强化提示选择

Authors: Tao Wang, Jingyao Lu, Xibo Wang, Haonan Huang, Su Yao, Zhiqiang Hu, Xingyan Chen, Enmao Diao
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.13817
Pdf link: https://arxiv.org/pdf/2604.13817
Abstract Large language models (LLMs) have shown remarkable capabilities in dialogue generation and reasoning, yet their effectiveness in eliciting user-known but concealed information in open-ended conversations remains limited. In many interactive AI applications, such as personal assistants, tutoring systems, and legal or clinical support, users often withhold sensitive or uncertain information due to privacy concerns, ambiguity, or social hesitation. This makes it challenging for LLMs to gather complete and contextually relevant inputs. In this work, we define the problem of information elicitation in open-ended dialogue settings and propose Reinforcement Prompt Selection (RPS), a lightweight reinforcement learning framework that formulates prompt selection as a sequential decision-making problem. To analyze this problem in a controlled setting, we design a synthetic experiment, where a reinforcement learning agent outperforms a random query baseline, illustrating the potential of policy-based approaches for adaptive information elicitation. Building on this insight, RPS learns a policy over a pool of prompts to adaptively elicit concealed or incompletely expressed information from users through dialogue. We also introduce IELegal, a new benchmark dataset constructed from real legal case documents, which simulates dialogue-based information elicitation tasks aimed at uncovering case-relevant facts. In this setting, RPS outperforms static prompt baselines, demonstrating the effectiveness of adaptive prompt selection for eliciting critical information in LLM-driven dialogue systems.
中文摘要 大型语言模型（LLMs）在对话生成和推理方面展现出了卓越的能力，但在引出用户已知但隐藏信息的开放式对话中，其效果仍然有限。在许多互动式人工智能应用中，如个人助理、辅导系统以及法律或临床支持，用户常因隐私问题、模糊性或社交顾虑而隐瞒敏感或不确定的信息。这使得大型语言模型难以收集完整且具上下文相关的输入。在本研究中，我们定义了开放式对话环境中的信息诱导问题，并提出了强化提示选择（Reinforcement Prompt Selection，RPS）这一轻量级强化学习框架，将提示选择构建为顺序决策问题。为了在受控环境中分析该问题，我们设计了一个合成实验，其中强化学习代理的表现优于随机查询基线，展示了基于策略的方法在自适应信息诱导中的潜力。基于这一洞察，RPS通过一系列提示学习策略，通过对话自适应地从用户那里引出隐藏或不完全表达的信息。我们还推出了IELegal，这是一个由真实法律案件文件构建的新基准数据集，模拟基于对话的信息获取任务，旨在揭示与案件相关的事实。在此环境中，RPS优于静态提示基线，展示了自适应提示选择在LLM驱动对话系统中提取关键信息的有效性。

MUSE: Multi-Domain Chinese User Simulation via Self-Evolving Profiles and Rubric-Guided Alignment

MUSE：通过自我演进配置文件和评分标准引导对齐实现的多领域中国用户模拟

Authors: Zihao Liu, Hantao Zhou, Jiguo Li, Jun Xu, Jiuchong Gao, Jinghua Hao, Renqing He, Peng Wang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.13828
Pdf link: https://arxiv.org/pdf/2604.13828
Abstract User simulators are essential for the scalable training and evaluation of interactive AI systems. However, existing approaches often rely on shallow user profiling, struggle to maintain persona consistency over long interactions, and are largely limited to English or single-domain settings. We present MUSE, a multi-domain Chinese user simulation framework designed to generate human-like, controllable, and behaviorally consistent responses. First, we propose Iterative Profile Self-Evolution (IPSE), which gradually optimizes user profiles by comparing and reasoning discrepancies between simulated trajectories and real dialogue behaviors. We then apply Role-Reversal Supervised Fine-Tuning to improve local response realism and human-like expression. To enable fine-grained behavioral alignment, we further train a specialized rubric-based reward model and incorporate it into rubric-guided multi-turn reinforcement learning, which optimizes the simulator at the dialogue level and enhances long-horizon behavioral consistency. Experiments show that MUSE consistently outperforms strong baselines in both utterance-level and session-level evaluations, generating responses that are more realistic, coherent, and persona-consistent over extended interactions.
中文摘要 用户模拟器对于交互式AI系统的可扩展训练和评估至关重要。然而，现有方法往往依赖浅层用户画像，难以在长时间互动中保持角色一致性，且主要局限于英语或单一领域环境。我们介绍MUSE，一个多域中国用户模拟框架，旨在生成类人化、可控且行为一致的响应。首先，我们提出了迭代配置文件自我进化（IPSE），通过比较和推理模拟轨迹与真实对话行为之间的差异，逐步优化用户配置文件。随后，我们应用角色反转监督微调，提升局部反应的真实性和类人化的表达。为实现细致行为对齐，我们进一步训练基于评分标准的奖励模型，并将其纳入评分标准引导的多回合强化学习中，优化对话层面的模拟器，增强长期行为一致性。实验显示，MUSE在话语层级和会话层面的评估中始终优于强基线，产生更真实、连贯且符合角色形象的反应，持续长时间互动。

Drowsiness-Aware Adaptive Autonomous Braking System based on Deep Reinforcement Learning for Enhanced Road Safety

基于深度强化学习的嗜睡感知自适应自主制动系统，提升道路安全

Authors: Hossem Eddine Hafidi, Elisabetta De Giovanni, Teodoro Montanaro, Ilaria Sergi, Massimo De Vittorio, Luigi Patrono
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.13878
Pdf link: https://arxiv.org/pdf/2604.13878
Abstract Driver drowsiness significantly impairs the ability to accurately judge safe braking distances and is estimated to contribute to 10%-20% of road accidents in Europe. Traditional driver-assistance systems lack adaptability to real-time physiological states such as drowsiness. This paper proposes a deep reinforcement learning-based autonomous braking system that integrates vehicle dynamics with driver physiological data. Drowsiness is detected from ECG signals using a Recurrent Neural Network (RNN), selected through an extensive benchmark analysis of 2-minute windows with varying segmentation and overlap configurations. The inferred drowsiness state is incorporated into the observable state space of a Double-Dueling Deep Q-Network (DQN) agent, where driver impairment is modeled as an action delay. The system is implemented and evaluated in a high-fidelity CARLA simulation environment. Experimental results show that the proposed agent achieves a 99.99% success rate in avoiding collisions under both drowsy and non-drowsy conditions. These findings demonstrate the effectiveness of physiology-aware control strategies for enhancing adaptive and intelligent driving safety systems.
中文摘要 驾驶员嗜睡会显著削弱其准确判断安全制动距离的能力，据估计欧洲10%-20%的道路事故与之相关。传统的驾驶辅助系统缺乏对嗜睡等实时生理状态的适应能力。本文提出了一种基于深度强化学习的自主制动系统，将车辆动力学与驾驶员生理数据相结合。嗜睡状态由循环神经网络（RNN）根据心电图（ECG）信号检测，该网络是通过对采用不同分段和重叠配置的2分钟窗口进行广泛基准分析后选出的。推断出的嗜睡状态被纳入Double-Dueling深度Q网络（DQN）智能体的可观测状态空间，其中驾驶员能力受损被建模为动作延迟。该系统在高保真CARLA仿真环境中实现并评估。实验结果显示，所提出的智能体在嗜睡和非嗜睡条件下避免碰撞的成功率均达到99.99%。这些发现证明了生理感知控制策略在增强自适应智能驾驶安全系统方面的有效性。
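Modelling impairment as an action delay can be pictured with a fixed-length queue: a command issued now only takes effect some steps later. The class below is a hypothetical sketch of that idea; the abstract does not say how the delay is implemented, and the mapping from drowsiness level to `delay_steps` is left as an assumption.

```python
from collections import deque


class DelayedActionBuffer:
    """Hypothetical helper: each pushed action takes effect `delay_steps` steps later."""

    def __init__(self, delay_steps, default_action=0.0):
        # Pre-fill with a neutral command (for example, no braking) so that the
        # first `delay_steps` calls return the default action.
        self._queue = deque([default_action] * delay_steps)

    def push(self, action):
        """Queue the new command and return the one that takes effect now."""
        self._queue.append(action)
        return self._queue.popleft()


# A drowsy driver might be given a longer delay than an alert one (assumed values).
alert = DelayedActionBuffer(delay_steps=0)
drowsy = DelayedActionBuffer(delay_steps=2)
applied_alert = [alert.push(1.0) for _ in range(3)]
applied_drowsy = [drowsy.push(1.0) for _ in range(3)]
```

With the delay exposed through the agent's observation, a braking policy could learn to act earlier when the delay is long, which is the kind of adaptation the abstract describes.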

Beyond Conservative Automated Driving in Multi-Agent Scenarios via Coupled Model Predictive Control and Deep Reinforcement Learning

超越多智能体场景中的保守自动驾驶，通过耦合模型预测控制和深度强化学习

Authors: Saeed Rahmani, Gözde Körpe, Zhenlin (Gavin) Xu, Bruno Brito, Simeon Craig Calvert, Bart van Arem
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2604.13891
Pdf link: https://arxiv.org/pdf/2604.13891
Abstract Automated driving at unsignalized intersections is challenging due to complex multi-vehicle interactions and the need to balance safety and efficiency. Model Predictive Control (MPC) offers structured constraint handling through optimization but relies on hand-crafted rules that often produce overly conservative behavior. Deep Reinforcement Learning (RL) learns adaptive behaviors from experience but often struggles with safety assurance and generalization to unseen environments. In this study, we present an integrated MPC-RL framework to improve navigation performance in multi-agent scenarios. Experiments show that MPC-RL outperforms standalone MPC and end-to-end RL across three traffic-density levels. Collectively, MPC-RL reduces the collision rate by 21% and improves the success rate by 6.5% compared to pure MPC. We further evaluate zero-shot transfer to a highway merging scenario without retraining. Both MPC-based methods transfer substantially better than end-to-end PPO, which highlights the role of the MPC backbone in cross-scenario robustness. The framework also shows faster loss stabilization than end-to-end RL during training, which indicates a reduced learning burden. These results suggest that the integrated approach can improve the balance between safety performance and efficiency in multi-agent intersection scenarios, while the MPC component provides a strong foundation for generalization across driving environments. The implementation code is available open-source.
中文摘要 由于多车交互复杂且需要平衡安全与效率，在无信号灯路口进行自动驾驶颇具挑战。模型预测控制（MPC）通过优化提供结构化的约束处理，但依赖手工制定的规则，这些规则常常产生过于保守的行为。深度强化学习（RL）从经验中学习自适应行为，但在安全保障和对未见环境的泛化方面常常遇到困难。本研究提出了一个集成的MPC-RL框架，以提升多智能体场景下的导航性能。实验显示，MPC-RL在三个交通密度等级上均优于独立MPC和端到端RL。总体而言，与纯MPC相比，MPC-RL将碰撞率降低了21%，并将成功率提高了6.5%。我们还进一步评估了在不重新训练的情况下向高速公路汇入场景的零样本迁移。两种基于MPC的方法的迁移效果都明显优于端到端PPO，这凸显了MPC骨干在跨场景鲁棒性中的作用。该框架在训练期间的损失稳定速度也快于端到端RL，这表明学习负担有所减轻。这些结果表明，该集成方法能够改善多智能体路口场景中安全性能与效率之间的平衡，而MPC组件为跨驾驶环境的泛化奠定了坚实基础。实现代码已开源。

DiPO: Disentangled Perplexity Policy Optimization for Fine-grained Exploration-Exploitation Trade-Off

DiPO：面向细粒度探索与利用权衡的解耦困惑度策略优化

Authors: Xiaofan Li, Ming Yang, Zhiyuan Ma, Shichao Ma, Jintao Du, Yu Cheng, Weiqiang Wang, Zhizhong Zhang, Xin Tan, Yanyun Qu, Lizhuang Ma, Yuan Xie
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.13902
Pdf link: https://arxiv.org/pdf/2604.13902
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed significant advances in the reasoning capabilities of Large Language Models (LLMs). However, effectively managing the exploration and exploitation trade-off remains a critical challenge. In this paper, we fully analyze the exploration and exploitation dilemma of extremely hard and easy samples during the training and propose a new fine-grained trade-off mechanism. Concretely, we introduce a perplexity space disentangling strategy that divides the sample space into distinct exploration (high perplexity) and exploitation (low perplexity) subspaces, thereby mining fine-grained samples requiring exploration-exploitation trade-off. Subsequently, we propose a bidirectional reward allocation mechanism with a minimum impact on verification rewards to implement perplexity-guided exploration and exploitation, enabling more stable policy optimization. Finally, we have evaluated our method on two mainstream tasks: mathematical reasoning and function calling, and experimental results demonstrate the superiority of the proposed method, confirming its effectiveness in enhancing LLM performance by fine-grained exploration-exploitation trade-off.
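Translated abstract Reinforcement learning with verifiable rewards (RLVR) has driven significant advances in the reasoning capabilities of large language models (LLMs). However, effectively managing the exploration-exploitation trade-off remains a key challenge. This paper thoroughly analyzes the exploration-exploitation dilemma posed by extremely hard and extremely easy samples during training and proposes a new fine-grained trade-off mechanism. Specifically, we introduce a perplexity-space disentangling strategy that divides the sample space into distinct exploration (high perplexity) and exploitation (low perplexity) subspaces, thereby mining fine-grained samples that require an exploration-exploitation trade-off. We then propose a bidirectional reward allocation mechanism with minimal impact on verification rewards to implement perplexity-guided exploration and exploitation, enabling more stable policy optimization. Finally, we evaluate the method on two mainstream tasks, mathematical reasoning and function calling; the experimental results demonstrate its superiority and confirm that a fine-grained exploration-exploitation trade-off improves LLM performance.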
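
The perplexity-based split mentioned in the DiPO abstract can be illustrated with a minimal sketch. It assumes that the perplexity of a sampled response is the exponential of its mean negative token log-probability, and that a single fixed `threshold` separates the exploration and exploitation subspaces. The function names, the threshold, and the toy inputs are hypothetical; the abstract does not specify the paper's actual disentangling rule or its reward allocation, so this is not the authors' implementation.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of one response: exp of the mean negative token log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def split_by_perplexity(samples, threshold):
    """Partition sample ids into (exploration, exploitation) lists.

    `samples` maps a sample id to its list of token log-probabilities.
    `threshold` is an assumed, hand-picked cut-off (hypothetical).
    """
    exploration, exploitation = [], []
    for sample_id, logprobs in samples.items():
        if perplexity(logprobs) > threshold:
            exploration.append(sample_id)   # high perplexity: uncertain response
        else:
            exploitation.append(sample_id)  # low perplexity: confident response
    return exploration, exploitation


if __name__ == "__main__":
    toy = {
        "uncertain": [math.log(0.5)] * 3,  # perplexity 2.0
        "confident": [math.log(0.9)] * 3,  # perplexity about 1.11
    }
    print(split_by_perplexity(toy, threshold=1.5))
```

A fixed threshold is the simplest possible choice; a batch-relative cut-off (for example, the median perplexity within a group of rollouts) would be an equally plausible assumption.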

Provably Efficient Offline-to-Online Value Adaptation with General Function Approximation

Provably Efficient Offline-to-Online Value Adaptation with General Function Approximation

Authors: Shangzhe Li, Weitong Zhang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.13966
Pdf link: https://arxiv.org/pdf/2604.13966
Abstract We study value adaptation in offline-to-online reinforcement learning under general function approximation. Starting from an imperfect offline pretrained $Q$-function, the learner aims to adapt it to the target environment using only a limited amount of online interaction. We first characterize the difficulty of this setting by establishing a minimax lower bound, showing that even when the pretrained $Q$-function is close to optimal $Q^\star$, online adaptation can be no more efficient than pure online RL on certain hard instances. On the positive side, under a novel structural condition on the offline-pretrained value functions, we propose O2O-LSVI, an adaptation algorithm with problem-dependent sample complexity that provably improves over pure online RL. Finally, we complement our theory with neural-network experiments that demonstrate the practical effectiveness of the proposed method.
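Translated abstract We study value adaptation in offline-to-online reinforcement learning under general function approximation. Starting from an imperfect offline-pretrained $Q$-function, the learner aims to adapt it to the target environment with a limited amount of online interaction. We first characterize the difficulty of this setting by establishing a minimax lower bound, showing that even when the pretrained $Q$-function is close to the optimal $Q^\star$, online adaptation can be no more efficient than pure online RL on certain hard instances. On the positive side, under a novel structural condition on the offline-pretrained value functions, we propose O2O-LSVI, an adaptation algorithm with problem-dependent sample complexity that provably improves over pure online RL. Finally, we complement the theory with neural-network experiments that demonstrate the method's practical effectiveness.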

Hierarchical Reinforcement Learning with Runtime Safety Shielding for Power Grid Operation

Hierarchical Reinforcement Learning with Runtime Safety Shielding for Power Grid Operation

Authors: Gitesh Malik
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.14032
Pdf link: https://arxiv.org/pdf/2604.14032
Abstract Reinforcement learning has shown promise for automating power-grid operation tasks such as topology control and congestion management. However, its deployment in real-world power systems remains limited by strict safety requirements, brittleness under rare disturbances, and poor generalization to unseen grid topologies. In safety-critical infrastructure, catastrophic failures cannot be tolerated, and learning-based controllers must operate within hard physical constraints. This paper proposes a safety-constrained hierarchical control framework for power-grid operation that explicitly decouples long-horizon decision-making from real-time feasibility enforcement. A high-level reinforcement learning policy proposes abstract control actions, while a deterministic runtime safety shield filters unsafe actions using fast forward simulation. Safety is enforced as a runtime invariant, independent of policy quality or training distribution. The proposed framework is evaluated on the Grid2Op benchmark suite under nominal conditions, forced line-outage stress tests, and zero-shot deployment on the ICAPS 2021 large-scale transmission grid without retraining. Results show that flat reinforcement learning policies are brittle under stress, while safety-only methods are overly conservative. In contrast, the proposed hierarchical and safety-aware approach achieves longer episode survival, lower peak line loading, and robust zero-shot generalization to unseen grids. These results indicate that safety and generalization in power-grid control are best achieved through architectural design rather than increasingly complex reward engineering, providing a practical path toward deployable learning-based controllers for real-world energy systems.
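Translated abstract Reinforcement learning has shown promise for automating power-grid operation tasks such as topology control and congestion management. However, its deployment in real power systems remains limited by strict safety requirements, brittleness under rare disturbances, and poor generalization to unseen grid topologies. In safety-critical infrastructure, catastrophic failures cannot be tolerated, and learning-based controllers must operate within hard physical constraints. This paper proposes a safety-constrained hierarchical control framework for power-grid operation that explicitly decouples long-horizon decision-making from real-time feasibility enforcement. A high-level reinforcement learning policy proposes abstract control actions, while a deterministic runtime safety shield filters out unsafe actions using fast forward simulation. Safety is enforced as a runtime invariant, independent of policy quality or training distribution. The framework is evaluated on the Grid2Op benchmark suite under nominal conditions, under forced line-outage stress tests, and in zero-shot deployment on the ICAPS 2021 large-scale transmission grid without retraining. Results show that flat reinforcement learning policies are brittle under stress, while safety-only methods are overly conservative. In contrast, the proposed hierarchical, safety-aware approach achieves longer episode survival, lower peak line loading, and robust zero-shot generalization to unseen grids. These results indicate that safety and generalization in power-grid control are best achieved through architectural design rather than increasingly complex reward engineering, providing a practical path toward deployable learning-based controllers for real-world energy systems.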
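
The runtime safety shield described in this abstract (a deterministic filter that rejects unsafe actions using forward simulation) can be sketched in a few lines. This is a generic illustration under stated assumptions, not the paper's code: `simulate_peak_loading` is a hypothetical callable standing in for a one-step forward simulation that returns the predicted peak line loading, `loading_limit=1.0` is an assumed threshold, and the `"do_nothing"` fallback is assumed to be acceptable whenever every proposal is rejected.

```python
from typing import Callable, Hashable, Iterable


def shield(
    candidates: Iterable[Hashable],
    simulate_peak_loading: Callable[[Hashable], float],
    loading_limit: float = 1.0,
    fallback: Hashable = "do_nothing",
) -> Hashable:
    """Return the first proposed action whose simulated peak loading is within the limit.

    Candidates are checked in the order the policy ranked them; if none passes,
    the (assumed safe) fallback action is returned instead.
    """
    for action in candidates:
        if simulate_peak_loading(action) <= loading_limit:
            return action
    return fallback


# Toy stand-in for a forward simulator: predicted peak line loading per action.
PREDICTED_LOADING = {"reconnect_line_3": 1.12, "split_bus_7": 0.94}


def toy_simulator(action: Hashable) -> float:
    return PREDICTED_LOADING[action]


if __name__ == "__main__":
    # The policy's first choice overloads a line, so the shield picks the second.
    print(shield(["reconnect_line_3", "split_bus_7"], toy_simulator))
```

Because the check runs on every step regardless of how the candidates were produced, the filter behaves the same for a well-trained policy and a poorly trained one, which is the sense in which the abstract calls safety a runtime invariant.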

Enhancing Local Life Service Recommendation with Agentic Reasoning in Large Language Model

Enhancing Local Life Service Recommendation with Agentic Reasoning in Large Language Models

Authors: Shiteng Cao, Xiaochong Lan, Yuwei Du, Jie Feng, Yinxing Liu, Xinlei Shi, Yong Li
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2604.14051
Pdf link: https://arxiv.org/pdf/2604.14051
Abstract Local life service recommendation is distinct from general recommendation scenarios due to its strong living need-driven nature. Fundamentally, accurately identifying a user's immediate living need and recommending the corresponding service are inextricably linked tasks. However, prior works typically treat them in isolation, failing to achieve a unified modeling of need prediction and service recommendation. In this paper, we propose a novel large language model based framework that jointly performs living need prediction and service recommendation. To address the challenge of noise in raw consumption data, we introduce a behavioral clustering approach that filters out accidental factors and selectively preserves typical patterns. This enables the model to learn a robust logical basis for need generation and spontaneously generalize to long-tail scenarios. To navigate the vast search space stemming from diverse needs, merchants, and complex mapping paths, we employ a curriculum learning strategy combined with reinforcement learning with verifiable rewards. This approach guides the model to sequentially learn the logic from need generation to category mapping and specific service selection. Extensive experiments demonstrate that our unified framework significantly enhances both living need prediction performance and recommendation accuracy, validating the effectiveness of jointly modeling living needs and user behaviors.
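Translated abstract Local life service recommendation differs from general recommendation scenarios because it is strongly driven by living needs. Fundamentally, accurately identifying a user's immediate living need and recommending the corresponding service are inextricably linked tasks. However, prior work typically treats them in isolation and fails to model need prediction and service recommendation in a unified way. This paper proposes a novel framework based on a large language model that jointly performs living need prediction and service recommendation. To address noise in raw consumption data, we introduce a behavioral clustering approach that filters out accidental factors and selectively preserves typical patterns. This lets the model learn a robust logical basis for need generation and generalize spontaneously to long-tail scenarios. To navigate the vast search space arising from diverse needs, merchants, and complex mapping paths, we combine a curriculum learning strategy with reinforcement learning with verifiable rewards. This approach guides the model to learn the logic sequentially, from need generation to category mapping to specific service selection. Extensive experiments show that the unified framework significantly improves both living need prediction performance and recommendation accuracy, validating the effectiveness of jointly modeling living needs and user behaviors.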

From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space

From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space

Authors: Yuqiao Tan, Minzheng Wang, Bo Liu, Zichen Liu, Tian Liang, Shizhu He, Jun Zhao, Kang Liu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.14142
Pdf link: https://arxiv.org/pdf/2604.14142
Abstract While reinforcement learning with verifiable rewards (RLVR) significantly enhances LLM reasoning by optimizing the conditional distribution P(y|x), its potential is fundamentally bounded by the base model's existing output distribution. Optimizing the marginal distribution P(y) in the Pre-train Space addresses this bottleneck by encoding reasoning ability and preserving broad exploration capacity. Yet, conventional pre-training relies on static corpora for passive learning, leading to a distribution shift that hinders targeted reasoning enhancement. In this paper, we introduce PreRL (Pre-train Space RL), which applies reward-driven online updates directly to P(y). We theoretically and empirically validate the strong gradient alignment between log P(y) and log P(y|x), establishing PreRL as a viable surrogate for standard RL. Furthermore, we uncover a critical mechanism: Negative Sample Reinforcement (NSR) within PreRL serves as an exceptionally effective driver for reasoning. NSR-PreRL rapidly prunes incorrect reasoning spaces while stimulating endogenous reflective behaviors, increasing transition and reflection thoughts by 14.89x and 6.54x, respectively. Leveraging these insights, we propose Dual Space RL (DSRL), a Policy Reincarnation strategy that initializes models with NSR-PreRL to expand the reasoning horizon before transitioning to standard RL for fine-grained optimization. Extensive experiments demonstrate that DSRL consistently outperforms strong baselines, proving that pre-train space pruning effectively steers the policy toward a refined correct reasoning subspace.
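Translated abstract While reinforcement learning with verifiable rewards (RLVR) significantly enhances LLM reasoning by optimizing the conditional distribution P(y|x), its potential is fundamentally bounded by the base model's existing output distribution. Optimizing the marginal distribution P(y) in the Pre-train Space addresses this bottleneck by encoding reasoning ability and preserving broad exploration capacity. However, conventional pre-training relies on static corpora for passive learning, which leads to a distribution shift that hinders targeted reasoning enhancement. This paper introduces PreRL (Pre-train Space RL), which applies reward-driven online updates directly to P(y). We validate, theoretically and empirically, the strong gradient alignment between log P(y) and log P(y|x), establishing PreRL as a viable surrogate for standard RL. We also uncover a critical mechanism: Negative Sample Reinforcement (NSR) within PreRL is an exceptionally effective driver of reasoning. NSR-PreRL rapidly prunes incorrect reasoning spaces while stimulating endogenous reflective behaviors, increasing transition and reflection thoughts by 14.89x and 6.54x, respectively. Building on these insights, we propose Dual Space RL (DSRL), a Policy Reincarnation strategy that initializes models with NSR-PreRL to expand the reasoning horizon before transitioning to standard RL for fine-grained optimization. Extensive experiments show that DSRL consistently outperforms strong baselines, demonstrating that pre-train space pruning effectively steers the policy toward a refined, correct reasoning subspace.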
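
One way to read "Negative Sample Reinforcement" is as a REINFORCE-style surrogate loss restricted to negatively rewarded samples. The sketch below illustrates that reading only. It assumes binary rewards of +1 (verified correct) and -1 (incorrect) and sequence-level log-probabilities supplied as plain numbers; in PreRL these would be log P(y) rather than log P(y|x), but the abstract gives no loss formula, so the function name, the reward coding, and the averaging are all hypothetical.

```python
def reinforce_surrogate(logprobs, rewards, negative_only=False):
    """REINFORCE-style surrogate loss: -mean(r_i * logp_i) over the kept samples.

    With negative_only=True, only samples with reward < 0 contribute, so
    minimizing the loss lowers the log-probability of incorrect responses
    and leaves correct ones without a direct gradient.
    """
    kept = [
        (logp, reward)
        for logp, reward in zip(logprobs, rewards)
        if not negative_only or reward < 0
    ]
    if not kept:
        return 0.0  # no qualifying samples in this batch
    return -sum(reward * logp for logp, reward in kept) / len(kept)


if __name__ == "__main__":
    logprobs = [-1.0, -2.0, -4.0]  # toy sequence log-probabilities
    rewards = [1, -1, -1]          # assumed +1 / -1 verifier outcomes
    print(reinforce_surrogate(logprobs, rewards))
    print(reinforce_surrogate(logprobs, rewards, negative_only=True))
```

With rewards of -1, the negative-only loss reduces to the mean log-probability of the incorrect samples, which makes the "pruning" intuition in the abstract concrete: gradient descent on it pushes probability mass away from those responses.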

Keyword: diffusion policy

No results found.