Arxiv Papers of Today

Generated: 2025-11-10 16:31:38 (UTC+8); arXiv release time: 2025-11-10 20:00 EST (2025-11-11 09:00 UTC+8)

25 relevant papers today

Keyword: reinforcement learning

Reasoning Up the Instruction Ladder for Controllable Language Models

Authors: Zishuo Zheng, Vidhisha Balachandran, Chan Young Park, Faeze Brahman, Sachin Kumar
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.04694
Pdf link: https://arxiv.org/pdf/2511.04694
Abstract As large language model (LLM) based systems take on high-stakes roles in real-world decision-making, they must reconcile competing instructions from multiple sources (e.g., model developers, users, and tools) within a single prompt context. Thus, enforcing an instruction hierarchy (IH) in LLMs, where higher-level directives override lower-priority requests, is critical for the reliability and controllability of LLMs. In this work, we reframe instruction hierarchy resolution as a reasoning task. Specifically, the model must first "think" about the relationship between a given user prompt and higher-priority (system) instructions before generating a response. To enable this capability via training, we construct VerIH, an instruction hierarchy dataset of constraint-following tasks with verifiable answers. This dataset comprises both aligned and conflicting system-user instructions. We show that lightweight reinforcement learning with VerIH effectively transfers general reasoning capabilities of models to instruction prioritization. Our finetuned models achieve consistent improvements on instruction following and instruction hierarchy benchmarks. This reasoning ability also generalizes to safety-critical settings beyond the training distribution. By treating safety issues as resolving conflicts between adversarial user inputs and predefined higher-priority policies, our trained model enhances robustness against jailbreak and prompt injection attacks. These results demonstrate that reasoning over instruction hierarchies provides a practical path to reliable LLMs, where updates to system prompts yield controllable and robust changes in model behavior.

NCSAC: Effective Neural Community Search via Attribute-augmented Conductance

Authors: Longlong Lin, Quanao Li, Miao Qiao, Zeli Wang, Jin Zhao, Rong-Hua Li, Xin Luo, Tao Jia
Subjects: Social and Information Networks (cs.SI)
Arxiv link: https://arxiv.org/abs/2511.04712
Pdf link: https://arxiv.org/pdf/2511.04712
Abstract Identifying locally dense communities closely connected to the user-initiated query node is crucial for a wide range of applications. Existing approaches either solely depend on rule-based constraints or exclusively utilize deep learning technologies to identify target communities. Therefore, an important question is proposed: can deep learning be integrated with rule-based constraints to elevate the quality of community search? In this paper, we affirmatively address this question by introducing a novel approach called Neural Community Search via Attribute-augmented Conductance, abbreviated as NCSAC. Specifically, NCSAC first proposes a novel concept of attribute-augmented conductance, which harmoniously blends the (internal and external) structural proximity and the attribute similarity. Then, NCSAC extracts a coarse candidate community of satisfactory quality using the proposed attribute-augmented conductance. Subsequently, NCSAC frames the community search as a graph optimization task, refining the candidate community through sophisticated reinforcement learning techniques, thereby producing high-quality results. Extensive experiments on six real-world graphs and ten competitors demonstrate the superiority of our solutions in terms of accuracy, efficiency, and scalability. Notably, the proposed solution outperforms state-of-the-art methods, achieving an impressive F1-score improvement ranging from 5.3% to 42.4%. For reproducibility purposes, the source code is available at this https URL.
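
As background for the abstract above, the classical (purely structural) conductance of a node set can be computed as in the sketch below. The abstract does not define the attribute-augmented variant, so this example covers only the standard definition, cut(S, rest) / min(vol(S), vol(rest)); the function name and the adjacency-dict input format are assumptions made for illustration.

```python
def conductance(adjacency, community):
    """Classical conductance of `community` in an undirected graph.

    adjacency: dict mapping each node to the set of its neighbours.
    Returns cut / min(volume inside, volume outside); a lower value means
    the community is better separated from the rest of the graph.
    """
    community = set(community)
    cut = 0       # edges with exactly one endpoint in the community
    vol_in = 0    # sum of degrees of nodes inside the community
    vol_out = 0   # sum of degrees of nodes outside the community
    for node, neighbours in adjacency.items():
        degree = len(neighbours)
        if node in community:
            vol_in += degree
            cut += sum(1 for n in neighbours if n not in community)
        else:
            vol_out += degree
    denom = min(vol_in, vol_out)
    if denom == 0:
        raise ValueError("conductance is undefined when one side has zero volume")
    return cut / denom
```

For two triangles joined by a single edge, taking one triangle as the community gives one cut edge over a volume of seven, which is a low conductance and matches the intuition of a well-separated community.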

SMART-WRITE: Adaptive Learning-based Write Energy Optimization for Phase Change Memory

Authors: Mahek Desai, Rowena Quinn, Marjan Asadinia
Subjects: Hardware Architecture (cs.AR); Emerging Technologies (cs.ET)
Arxiv link: https://arxiv.org/abs/2511.04713
Pdf link: https://arxiv.org/pdf/2511.04713
Abstract As dynamic random access memory (DRAM) and other current transistor-based memories approach their scalability limits, the search for alternative storage methods becomes increasingly urgent. Phase-change memory (PCM) emerges as a promising candidate due to its scalability, fast access time, and zero leakage power compared to many existing memory technologies. However, PCM has significant drawbacks that currently hinder its viability as a replacement. PCM cells suffer from a limited lifespan because write operations degrade the physical material, and these operations consume a considerable amount of energy. For PCM to be a practical option for data storage, which involves frequent write operations, its cell endurance must be enhanced, and write energy must be reduced. In this paper, we propose SMART-WRITE, a method that integrates neural networks (NN) and reinforcement learning (RL) to dynamically optimize write energy and improve performance. The NN model monitors real-time operating conditions and device characteristics to determine optimal write parameters, while the RL model dynamically adjusts these parameters to further optimize PCM's energy consumption. By continuously adjusting PCM write parameters based on real-time system conditions, SMART-WRITE reduces write energy consumption by up to 63% and improves performance by up to 51% compared to the baseline and previous models.

Explore Data Left Behind in Reinforcement Learning for Reasoning Language Models

Authors: Chenxi Liu, Junjie Liang, Yuqi Jia, Bochuan Cao, Yang Bai, Heng Huang, Xun Chen
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.04800
Pdf link: https://arxiv.org/pdf/2511.04800
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach for improving the reasoning abilities of large language models (LLMs). The Group Relative Policy Optimization (GRPO) family has demonstrated strong performance in training LLMs with RLVR. However, as models train longer and scale larger, more training prompts become residual prompts, those with zero variance rewards that provide no training signal. Consequently, fewer prompts contribute to training, reducing diversity and hindering effectiveness. To fully exploit these residual prompts, we propose the Explore Residual Prompts in Policy Optimization (ERPO) framework, which encourages exploration on residual prompts and reactivates their training signals. ERPO maintains a history tracker for each prompt and adaptively increases the sampling temperature for residual prompts that previously produced all correct responses. This encourages the model to generate more diverse reasoning traces, introducing incorrect responses that revive training signals. Empirical results on the Qwen2.5 series demonstrate that ERPO consistently surpasses strong baselines across multiple mathematical reasoning benchmarks.
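
The mechanism described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `ResidualPromptTracker`, the base temperature, the step size, and the temperature cap are all hypothetical. The first function shows why a prompt whose sampled responses all receive the same reward yields zero group-relative advantages; the tracker shows one way a per-prompt history could raise the sampling temperature for prompts that keep producing all-correct groups.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards):
    """GRPO-style advantages: rewards standardized within one prompt's group.

    If every reward in the group is identical, the standard deviation is zero
    and all advantages are zero, so the prompt provides no training signal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


class ResidualPromptTracker:
    """Hypothetical per-prompt history tracker; all constants are assumptions."""

    def __init__(self, base_temp=1.0, step=0.1, max_temp=1.5):
        self.base_temp = base_temp
        self.step = step
        self.max_temp = max_temp
        self.all_correct_streak = {}  # prompt_id -> consecutive all-correct groups

    def update(self, prompt_id, rewards):
        """Record one sampled group of binary rewards (1 = correct)."""
        if rewards and all(r == 1 for r in rewards):
            streak = self.all_correct_streak.get(prompt_id, 0) + 1
        else:
            streak = 0  # the prompt gave a mixed or all-wrong group again
        self.all_correct_streak[prompt_id] = streak

    def temperature(self, prompt_id):
        """Sampling temperature, raised for prompts with an all-correct streak."""
        streak = self.all_correct_streak.get(prompt_id, 0)
        return min(self.base_temp + self.step * streak, self.max_temp)
```

In a training loop, `temperature(prompt_id)` would be read before sampling a group and `update(prompt_id, rewards)` called after scoring it; how the paper actually schedules the temperature is not stated in the abstract.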

Quantum Boltzmann Machines for Sample-Efficient Reinforcement Learning

Authors: Thore Gerlach, Michael Schenk, Verena Kain
Subjects: Machine Learning (cs.LG); Quantum Physics (quant-ph)
Arxiv link: https://arxiv.org/abs/2511.04856
Pdf link: https://arxiv.org/pdf/2511.04856
Abstract We introduce theoretically grounded Continuous Semi-Quantum Boltzmann Machines (CSQBMs) that support continuous-action reinforcement learning. By combining exponential-family priors over visible units with quantum Boltzmann distributions over hidden units, CSQBMs yield a hybrid quantum-classical model that reduces qubit requirements while retaining strong expressiveness. Crucially, gradients with respect to continuous variables can be computed analytically, enabling direct integration into Actor-Critic algorithms. Building on this, we propose a continuous Q-learning framework that replaces global maximization by efficient sampling from the CSQBM distribution, thereby overcoming instability issues in continuous control.

FoodRL: A Reinforcement Learning Ensembling Framework For In-Kind Food Donation Forecasting

Authors: Esha Sharma, Lauren Davis, Julie Ivy, Min Chi
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.04865
Pdf link: https://arxiv.org/pdf/2511.04865
Abstract Food banks are crucial for alleviating food insecurity, but their effectiveness hinges on accurately forecasting highly volatile in-kind donations to ensure equitable and efficient resource distribution. Traditional forecasting models often fail to maintain consistent accuracy due to unpredictable fluctuations and concept drift driven by seasonal variations and natural disasters such as hurricanes in the Southeastern U.S. and wildfires on the West Coast. To address these challenges, we propose FoodRL, a novel reinforcement learning (RL) based metalearning framework that clusters and dynamically weights diverse forecasting models based on recent performance and contextual information. Evaluated on multi-year data from two structurally distinct U.S. food banks (one large regional West Coast food bank affected by wildfires and another state-level East Coast food bank consistently impacted by hurricanes), FoodRL consistently outperforms baseline methods, particularly during periods of disruption or decline. By delivering more reliable and adaptive forecasts, FoodRL can facilitate the redistribution of food equivalent to 1.7 million additional meals annually, demonstrating its significant potential for social impact as well as adaptive ensemble learning for humanitarian supply chains.
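
To make the idea of weighting forecasting models by recent performance concrete, here is a deliberately simplified stand-in. It is not FoodRL and involves no reinforcement learning: it only turns each model's recent error into a weight with a softmax, which is one common baseline for dynamic ensembling. The function names and the `temperature` parameter are assumptions made for this example.

```python
import math


def performance_weights(recent_errors, temperature=1.0):
    """Softmax over negative recent errors: lower error gives a higher weight."""
    scores = [-e / temperature for e in recent_errors]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def ensemble_forecast(forecasts, recent_errors, temperature=1.0):
    """Weighted average of per-model forecasts using the weights above."""
    weights = performance_weights(recent_errors, temperature)
    return sum(w * f for w, f in zip(weights, forecasts))
```

A learned policy, as the abstract describes, would replace the fixed softmax rule with weights chosen from recent performance and contextual information; the details of that policy are not given in the abstract.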

Self-Interest and Systemic Benefits: Emergence of Collective Rationality in Mixed Autonomy Traffic Through Deep Reinforcement Learning

Authors: Di Chen, Jia Li, Michael Zhang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.04883
Pdf link: https://arxiv.org/pdf/2511.04883
Abstract Autonomous vehicles (AVs) are expected to be commercially available in the near future, leading to mixed autonomy traffic consisting of both AVs and human-driven vehicles (HVs). Although numerous studies have shown that AVs can be deployed to benefit the overall traffic system performance by incorporating system-level goals into their decision making, it is not clear whether the benefits still exist when agents act out of self-interest -- a trait common to all driving agents, both human and autonomous. This study aims to understand whether self-interested AVs can bring benefits to all driving agents in mixed autonomy traffic systems. The research is centered on the concept of collective rationality (CR). This concept, originating from game theory and behavioral economics, means that driving agents may cooperate collectively even when pursuing individual interests. Our recent research has proven the existence of CR in an analytical game-theoretical model and empirically in mixed human-driven traffic. In this paper, we demonstrate that CR can be attained among driving agents trained using deep reinforcement learning (DRL) with a simple reward design. We examine the extent to which self-interested traffic agents can achieve CR without directly incorporating system-level objectives. Results show that CR consistently emerges in various scenarios, which indicates the robustness of this property. We also postulate a mechanism to explain the emergence of CR in the microscopic and dynamic environment and verify it based on simulation evidence. This research suggests the possibility of leveraging advanced learning methods (such as federated learning) to achieve collective cooperation among self-interested driving agents in mixed-autonomy systems.

You Need Reasoning to Learn Reasoning: The Limitations of Label-Free RL in Weak Base Models

Authors: Shuvendu Roy, Hossein Hajimirsadeghi, Mengyao Zhai, Golnoosh Samei
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.04902
Pdf link: https://arxiv.org/pdf/2511.04902
Abstract Recent advances in large language models have demonstrated the promise of unsupervised reinforcement learning (RL) methods for enhancing reasoning capabilities without external supervision. However, the generalizability of these label-free RL approaches to smaller base models with limited reasoning capabilities remains unexplored. In this work, we systematically investigate the performance of label-free RL methods across different model sizes and reasoning strengths, from 0.5B to 7B parameters. Our empirical analysis reveals critical limitations: label-free RL is highly dependent on the base model's pre-existing reasoning capability, with performance often degrading below baseline levels for weaker models. We find that smaller models fail to generate sufficiently long or diverse chain-of-thought reasoning to enable effective self-reflection, and that training data difficulty plays a crucial role in determining success. To address these challenges, we propose a simple yet effective method for label-free RL that uses curriculum learning to progressively introduce harder problems and masks no-majority rollouts during training. Additionally, we introduce a data curation pipeline to generate samples with predefined difficulty. Our approach demonstrates consistent improvements across all model sizes and reasoning capabilities, providing a path toward more robust unsupervised RL that can bootstrap reasoning abilities in resource-constrained models. We make our code available at this https URL.
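
The masking step mentioned in this abstract can be illustrated with a short sketch. It assumes that pseudo-labels come from majority voting over the sampled answers for a prompt, which the phrase "no-majority rollouts" suggests but the abstract does not spell out; the function name and the strict more-than-half threshold are assumptions.

```python
from collections import Counter


def majority_pseudo_rewards(answers, min_fraction=0.5):
    """Return (rewards, keep) for one prompt's sampled final answers.

    rewards[i] is 1.0 if answers[i] equals the majority answer, else 0.0.
    keep is False when no answer is given by more than `min_fraction` of the
    rollouts, signalling that the whole group should be masked from the loss.
    """
    if not answers:
        return [], False
    top_answer, top_count = Counter(answers).most_common(1)[0]
    if top_count / len(answers) <= min_fraction:
        return [0.0] * len(answers), False
    return [1.0 if a == top_answer else 0.0 for a in answers], True
```

When a weak model produces scattered answers, the group is dropped rather than rewarded toward an arbitrary plurality answer, which is the intuition behind masking.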

Multi-Agent Craftax: Benchmarking Open-Ended Multi-Agent Reinforcement Learning at the Hyperscale

Authors: Bassel Al Omari, Michael Matthews, Alexander Rutherford, Jakob Nicolaus Foerster
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2511.04904
Pdf link: https://arxiv.org/pdf/2511.04904
Abstract Progress in multi-agent reinforcement learning (MARL) requires challenging benchmarks that assess the limits of current methods. However, existing benchmarks often target narrow short-horizon challenges that do not adequately stress the long-term dependencies and generalization capabilities inherent in many multi-agent systems. To address this, we first present Craftax-MA: an extension of the popular open-ended RL environment, Craftax, that supports multiple agents and evaluates a wide range of general abilities within a single environment. Written in JAX, Craftax-MA is exceptionally fast with a training run using 250 million environment interactions completing in under an hour. To provide a more compelling challenge for MARL, we also present Craftax-Coop, an extension introducing heterogeneous agents, trading and more mechanics that require complex cooperation among agents for success. We provide analysis demonstrating that existing algorithms struggle with key challenges in this benchmark, including long-horizon credit assignment, exploration and cooperation, and argue for its potential to drive long-term research in MARL.

DeepForgeSeal: Latent Space-Driven Semi-Fragile Watermarking for Deepfake Detection Using Multi-Agent Adversarial Reinforcement Learning

Authors: Tharindu Fernando, Clinton Fookes, Sridha Sridharan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.04949
Pdf link: https://arxiv.org/pdf/2511.04949
Abstract Rapid advances in generative AI have led to increasingly realistic deepfakes, posing growing challenges for law enforcement and public trust. Existing passive deepfake detectors struggle to keep pace, largely due to their dependence on specific forgery artifacts, which limits their ability to generalize to new deepfake types. Proactive deepfake detection using watermarks has emerged to address the challenge of identifying high-quality synthetic media. However, these methods often struggle to balance robustness against benign distortions with sensitivity to malicious tampering. This paper introduces a novel deep learning framework that harnesses high-dimensional latent space representations and the Multi-Agent Adversarial Reinforcement Learning (MAARL) paradigm to develop a robust and adaptive watermarking approach. Specifically, we develop a learnable watermark embedder that operates in the latent space, capturing high-level image semantics, while offering precise control over message encoding and extraction. The MAARL paradigm empowers the learnable watermarking agent to pursue an optimal balance between robustness and fragility by interacting with a dynamic curriculum of benign and malicious image manipulations simulated by an adversarial attacker agent. Comprehensive evaluations on the CelebA and CelebA-HQ benchmarks reveal that our method consistently outperforms state-of-the-art approaches, achieving improvements of over 4.5% on CelebA and more than 5.3% on CelebA-HQ under challenging manipulation scenarios.

Multi-agent Coordination via Flow Matching

Authors: Dongsu Lee, Daehee Lee, Amy Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.05005
Pdf link: https://arxiv.org/pdf/2511.05005
Abstract This work presents MAC-Flow, a simple yet expressive framework for multi-agent coordination. We argue that requirements of effective coordination are twofold: (i) a rich representation of the diverse joint behaviors present in offline data and (ii) the ability to act efficiently in real time. However, prior approaches often sacrifice one for the other, i.e., denoising diffusion-based solutions capture complex coordination but are computationally slow, while Gaussian policy-based solutions are fast but brittle in handling multi-agent interaction. MAC-Flow addresses this trade-off by first learning a flow-based representation of joint behaviors, and then distilling it into decentralized one-step policies that preserve coordination while enabling fast execution. Across four different benchmarks, including 12 environments and 34 datasets, MAC-Flow alleviates the trade-off between performance and computational cost, specifically achieving about 14.5× faster inference compared to diffusion-based MARL methods, while maintaining good performance. At the same time, its inference speed is similar to that of prior Gaussian policy-based offline multi-agent reinforcement learning (MARL) methods.

FM4Com: Foundation Model for Scene-Adaptive Communication Strategy Optimization

Authors: Zhaoyang Li, Shangzhuo Xie, Qianqian Yang
Subjects: Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2511.05094
Pdf link: https://arxiv.org/pdf/2511.05094
Abstract The emergence of sixth-generation (6G) networks heralds an intelligent communication ecosystem driven by AI-native air interfaces. However, current physical-layer designs, which typically follow modular and isolated optimization paradigms, fail to achieve global end-to-end optimality due to neglected inter-module dependencies. Although large language models (LLMs) have recently been applied to communication tasks such as beam prediction and resource allocation, existing studies remain limited to single-task or single-modality scenarios and lack the ability to jointly reason over communication states and user intents for personalized strategy adaptation. To address these limitations, this paper proposes a novel multimodal communication decision-making model based on reinforcement learning. The proposed model semantically aligns channel state information (CSI) and textual user instructions, enabling comprehensive understanding of both physical-layer conditions and communication intents. It then generates physically realizable, user-customized link construction strategies that dynamically adapt to changing environments and preference tendencies. A two-stage reinforcement learning framework is employed: the first stage expands the experience pool via heuristic exploration and behavior cloning to obtain a near-optimal initialization, while the second stage fine-tunes the model through multi-objective reinforcement learning considering bit error rate, throughput, and complexity. Experimental results demonstrate that the proposed model significantly outperforms conventional planning-based algorithms under challenging channel conditions, achieving robust, efficient, and personalized 6G link construction.

Real-World Adverse Weather Image Restoration via Dual-Level Reinforcement Learning with High-Quality Cold Start

Authors: Fuyang Liu, Jiaqi Xu, Xiaowei Hu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.05095
Pdf link: https://arxiv.org/pdf/2511.05095
Abstract Adverse weather severely impairs real-world visual perception, while existing vision models trained on synthetic data with fixed parameters struggle to generalize to complex degradations. To address this, we first construct HFLS-Weather, a physics-driven, high-fidelity dataset that simulates diverse weather phenomena, and then design a dual-level reinforcement learning framework initialized with HFLS-Weather for cold-start training. Within this framework, at the local level, weather-specific restoration models are refined through perturbation-driven image quality optimization, enabling reward-based learning without paired supervision; at the global level, a meta-controller dynamically orchestrates model selection and execution order according to scene degradation. This framework enables continuous adaptation to real-world conditions and achieves state-of-the-art performance across a wide range of adverse weather scenarios. Code is available at this https URL

Emergence from Emergence: Financial Market Simulation via Learning with Heterogeneous Preferences

Authors: Ryuko Hashimoto, Ryosuke Takata, Masahiro Suzuki, Yuki Tanaka, Kiyoshi Izumi
Subjects: Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2511.05207
Pdf link: https://arxiv.org/pdf/2511.05207
Abstract Agent-based models help explain stock price dynamics as emergent phenomena driven by interacting investors. In this modeling tradition, investor behavior has typically been captured by two distinct mechanisms -- learning and heterogeneous preferences -- which have been explored as separate paradigms in prior studies. However, the impact of their joint modeling on the resulting collective dynamics remains largely unexplored. We develop a multi-agent reinforcement learning framework in which agents endowed with heterogeneous risk aversion, time discounting, and information access collectively learn trading strategies within a unified shared-policy framework. The experiment reveals that (i) learning with heterogeneous preferences drives agents to develop strategies aligned with their individual traits, fostering behavioral differentiation and niche specialization within the market, and (ii) the interactions by the differentiated agents are essential for the emergence of realistic market dynamics such as fat-tailed price fluctuations and volatility clustering. This study presents a constructive paradigm for financial market modeling in which the joint design of heterogeneous preferences and learning mechanisms enables two-stage emergence: individual behavior and the collective market dynamics.

An End-to-End Deep Reinforcement Learning Approach for Solving the Traveling Salesman Problem with Drones

Authors: Taihelong Zeng, Yun Lin, Yuhe Shi, Yan Li, Zhiqing Wei, Xuanru Ji
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.05265
Pdf link: https://arxiv.org/pdf/2511.05265
Abstract The emergence of truck-drone collaborative systems in last-mile logistics has positioned the Traveling Salesman Problem with Drones (TSP-D) as a pivotal extension of classical routing optimization, where synchronized vehicle coordination promises substantial operational efficiency and reduced environmental impact, yet introduces NP-hard combinatorial complexity beyond the reach of conventional optimization paradigms. Deep reinforcement learning offers a theoretically grounded framework to address TSP-D's inherent challenges through self-supervised policy learning and adaptive decision-making. This study proposes a hierarchical Actor-Critic deep reinforcement learning framework for solving the TSP-D problem. The architecture consists of two primary components: a Transformer-inspired encoder and an efficient Minimal Gated Unit decoder. The encoder incorporates a novel, optimized k-nearest neighbors sparse attention mechanism specifically for focusing on relevant spatial relationships, further enhanced by the integration of global node features. The Minimal Gated Unit decoder processes these encoded representations to efficiently generate solution sequences. The entire framework operates within an asynchronous advantage actor-critic paradigm. Experimental results show that, on benchmark TSP-D instances of various scales (N=10 to 100), the proposed model can obtain competitive or even superior solutions in shorter average computation times compared to high-performance heuristic algorithms and existing reinforcement learning methods. Moreover, compared to advanced reinforcement learning algorithm benchmarks, the proposed framework significantly reduces the total training time required while achieving superior final performance, highlighting its notable advantage in training efficiency.
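
The k-nearest neighbors sparse attention mentioned in this abstract can be pictured as restricting each node's attention to its spatially closest nodes. The abstract does not describe how the mask is built, so the sketch below shows only one common construction; the function name and the choice to always let a node attend to itself are assumptions.

```python
import math


def knn_attention_mask(coords, k):
    """Boolean attention mask from node coordinates.

    mask[i][j] is True if node j is one of the k nearest neighbours of node i
    (Euclidean distance) or if j == i; all other pairs are masked out.
    """
    n = len(coords)
    mask = [[False] * n for _ in range(n)]
    for i, ci in enumerate(coords):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(ci, coords[j]),
        )
        mask[i][i] = True  # assumption: self-attention is always allowed
        for j in others[:k]:
            mask[i][j] = True
    return mask
```

In an attention layer, positions where the mask is False would typically have their scores set to negative infinity before the softmax, so each node attends to at most k + 1 nodes instead of all N.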
中文摘要 卡车-无人机协作系统在最后一英里物流中的出现，将无人机旅行推销员问题（TSP-D）定位为经典路线优化的关键延伸，其中同步车辆协调有望显着提高运营效率并减少对环境的影响，但引入了超出传统优化范式范围的 NP 硬组合复杂性。深度强化学习提供了一个理论基础的框架，通过自监督政策学习和适应性决策来应对 TSP-D 的固有挑战。本研究提出了一种分层的Actor-Critic深度强化学习框架来解决TSP-D问题。该架构由两个主要组件组成：受 Transformer 启发的编码器和高效的最小门控单元解码器。该编码器采用了一种新颖的、优化的 k 最近邻稀疏注意力机制，专门用于关注相关空间关系，并通过集成全局节点特征进一步增强。最小门控单元解码器处理这些编码表示以有效地生成解序列。整个框架在异步优势行为者-批评者范式中运行。实验结果表明，在各种尺度（N=10至100）的基准TSP-D实例上，与高性能启发式算法和现有强化学习方法相比，所提模型能够以更短的平均计算时间获得具有竞争力甚至优越的解。此外，与先进的强化学习算法基准相比，所提出的框架在实现卓越最终性能的同时，显著减少了所需的总训练时间，凸显了其在训练效率方面的显著优势。
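
The k-nearest-neighbors sparse attention mentioned in the abstract can be illustrated with a minimal sketch: each node attends only to its k spatially closest nodes. The function names `knn_mask` and `sparse_attention`, the use of raw coordinates as queries, keys and values, and the value of k are assumptions for illustration; the abstract does not specify the paper's actual encoder.

```python
import math


def knn_mask(coords, k):
    """Mark, for each node, itself and its k nearest neighbours as attendable."""
    n = len(coords)
    mask = [[False] * n for _ in range(n)]
    for i, (xi, yi) in enumerate(coords):
        # Sort node indices by Euclidean distance to node i.
        # Node i is at distance 0, so it is kept when coordinates are distinct.
        order = sorted(
            range(n),
            key=lambda j: math.hypot(xi - coords[j][0], yi - coords[j][1]),
        )
        for j in order[: k + 1]:
            mask[i][j] = True
    return mask


def sparse_attention(queries, keys, values, mask):
    """Scaled dot-product attention restricted to positions allowed by mask."""
    d = len(queries[0])
    out = []
    for i, q in enumerate(queries):
        allowed = [j for j in range(len(keys)) if mask[i][j]]
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d) for j in allowed
        ]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w / total * values[j][c] for w, j in zip(weights, allowed))
            for c in range(len(values[0]))
        ])
    return out


# Hypothetical toy instance: three nearby customers and one distant customer.
coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
mask = knn_mask(coords, k=1)
encoded = sparse_attention(coords, coords, coords, mask)
```

Restricting each row of the attention matrix to k + 1 entries reduces the per-node cost from all N nodes to a fixed neighbourhood, which is the usual motivation for this kind of sparsity.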

DeepEyesV2: Toward Agentic Multimodal Model

DeepEyesV2：走向代理多模态模型

Authors: Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, Xing Yu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.05271
Pdf link: https://arxiv.org/pdf/2511.05271
Abstract Agentic multimodal models should not only comprehend text and images, but also actively invoke external tools, such as code execution environments and web search, and integrate these operations into reasoning. In this work, we introduce DeepEyesV2 and explore how to build an agentic multimodal model from the perspectives of data construction, training methods, and model evaluation. We observe that direct reinforcement learning alone fails to induce robust tool-use behavior. This phenomenon motivates a two-stage training pipeline: a cold-start stage to establish tool-use patterns, and a reinforcement learning stage to further refine tool invocation. We curate a diverse, moderately challenging training dataset, specifically including examples where tool use is beneficial. We further introduce RealX-Bench, a comprehensive benchmark designed to evaluate real-world multimodal reasoning, which inherently requires the integration of multiple capabilities, including perception, search, and reasoning. We evaluate DeepEyesV2 on RealX-Bench and other representative benchmarks, demonstrating its effectiveness across real-world understanding, mathematical reasoning, and search-intensive tasks. Moreover, DeepEyesV2 exhibits task-adaptive tool invocation, tending to use image operations for perception tasks and numerical computations for reasoning tasks. Reinforcement learning further enables complex tool combinations and allows the model to selectively invoke tools based on context. We hope our study can provide guidance for the community in developing agentic multimodal models.
中文摘要 代理多模态模型不仅应该理解文本和图像，还应该主动调用外部工具，如代码执行环境和网络搜索，并将这些操作整合到推理中。在这项工作中，我们介绍了 DeepEyesV2，并从数据构建、训练方法和模型评估的角度探讨了如何构建智能体多模态模型。我们观察到，仅靠直接强化学习无法诱导稳健的工具使用行为。这种现象激发了两个阶段的训练管道：建立工具使用模式的冷启动阶段和进一步完善工具调用的强化学习阶段。我们策划了一个多样化的、具有中等挑战性的训练数据集，特别是包括工具使用有益的示例。我们进一步介绍了 RealX-Bench，这是一个旨在评估现实世界多模态推理的综合基准测试，这本质上需要集成多种功能，包括感知、搜索和推理。我们在 RealX-Bench 和其他代表性基准测试上评估了 DeepEyesV2，展示了其在现实世界理解、数学推理和搜索密集型任务中的有效性。此外，DeepEyesV2 表现出任务自适应工具调用，倾向于使用图像操作进行感知任务，使用数值计算进行推理任务。强化学习进一步支持复杂的工具组合，并允许模型根据上下文有选择地调用工具。希望我们的研究能够为社区开发代理多模态模型提供指导。

Reflective Personalization Optimization: A Post-hoc Rewriting Framework for Black-Box Large Language Models

反射式个性化优化：黑盒大型语言模型的事后重写框架

Authors: Teqi Hao, Xioayu Tan, Shaojie Shi, Yinghui Xu, Xihe Qiu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.05286
Pdf link: https://arxiv.org/pdf/2511.05286
Abstract The personalization of black-box large language models (LLMs) is a critical yet challenging task. Existing approaches predominantly rely on context injection, where user history is embedded into the prompt to directly guide the generation process. However, this single-step paradigm imposes a dual burden on the model: generating accurate content while simultaneously aligning with user-specific styles. This often results in a trade-off that compromises output quality and limits precise control. To address this fundamental tension, we propose Reflective Personalization Optimization (RPO), a novel framework that redefines the personalization paradigm by decoupling content generation from alignment. RPO operates in two distinct stages: first, a base model generates a high-quality, generic response; then, an external reflection module explicitly rewrites this output to align with the user's preferences. This reflection module is trained using a two-stage process. Initially, supervised fine-tuning is employed on structured rewriting trajectories to establish a core personalized reasoning policy that models the transformation from generic to user-aligned responses. Subsequently, reinforcement learning is applied to further refine and enhance the quality of the personalized outputs. Comprehensive experiments on the LaMP benchmark demonstrate that RPO, by decoupling content generation from personalization, significantly outperforms state-of-the-art baselines. These findings underscore the superiority of explicit response shaping over implicit context injection. Moreover, RPO introduces an efficient, model-agnostic personalization layer that can be seamlessly integrated with any underlying base model, paving the way for a new and effective direction in user-centric generation scenarios.
中文摘要 黑盒大型语言模型（LLM）的个性化是一项关键但又具有挑战性的任务。现有方法主要依赖于上下文注入，其中用户历史记录嵌入到提示中以直接指导生成过程。然而，这种单步范式给模型带来了双重负担：生成准确的内容，同时与用户特定的风格保持一致。这通常会导致权衡，从而影响输出质量并限制精确控制。为了解决这一基本紧张关系，我们提出了反射性个性化优化（RPO），这是一种新颖的框架，它通过将内容生成与一致性解耦来重新定义个性化范式。RPO 在两个不同的阶段运行：首先，基本模型生成高质量的通用响应;然后，外部反射模块显式重写此输出以符合用户的偏好。该反射模块使用两阶段过程进行训练。最初，在结构化重写轨迹上采用监督微调来建立核心个性化推理策略，该策略对从通用响应到用户对齐响应的转换进行建模。随后，应用强化学习来进一步完善和提高个性化输出的质量。LaMP 基准测试的综合实验表明，通过将内容生成与个性化解耦，RPO 的性能明显优于最先进的基线。这些发现强调了显式响应塑造相对于隐式上下文注入的优越性。此外，RPO 引入了一个高效、与模型无关的个性化层，可以与任何底层基础模型无缝集成，为以用户为中心的生成场景中新的有效方向铺平了道路。

QUESTER: Query Specification for Generative Retrieval

QUESTER：生成检索的查询规范

Authors: Arthur Satouf, Yuxuan Zong, Habiboulaye Amadou-Boubacar, Pablo Piantanida, Benjamin Piwowarski
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.05301
Pdf link: https://arxiv.org/pdf/2511.05301
Abstract Generative Retrieval (GR) differs from the traditional index-then-retrieve pipeline by storing relevance in model parameters and directly generating document identifiers. However, GR often struggles to generalize and is costly to scale. We introduce QUESTER (QUEry SpecificaTion gEnerative Retrieval), which reframes GR as query specification generation - in this work, a simple keyword query handled by BM25 - using a (small) LLM. The policy is trained using reinforcement learning techniques (GRPO). Across in- and out-of-domain evaluations, we show that our model is more effective than BM25, and competitive with neural IR models, while maintaining good efficiency.
中文摘要 生成检索（GR）与传统的索引然后检索管道不同，它将相关性存储在模型参数中并直接生成文档标识符。然而，GR 通常难以泛化，并且扩展成本高昂。我们引入了 QUESTER（QUEry SpecificaTion gEnerative Retrieval），它使用（小型） LLM 将 GR 重新定义为查询规范生成 - 在这项工作中，由 BM25 处理的简单关键字查询。该策略使用强化学习技术（GRPO）进行训练。在域内和域外评估中，我们表明我们的模型比 BM25 更有效，并且与神经信息检索模型具有竞争力，同时保持良好的效率。
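
The abstract names GRPO but gives no training details. As a rough illustration of the group-relative idea behind GRPO, the sketch below normalizes the rewards of several completions sampled for the same prompt by the group's mean and standard deviation. The reward values, the `eps` constant, and the choice of the population standard deviation are assumptions for illustration, not details taken from the paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples drawn for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations vary
    # eps avoids division by zero when every sample receives the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical retrieval rewards for four keyword queries generated for one question.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

Samples that score above the group average receive positive advantages and those below receive negative ones, so no separate learned value function is needed to provide a baseline.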

TeaRAG: A Token-Efficient Agentic Retrieval-Augmented Generation Framework

TeaRAG：一种令牌高效的代理检索增强生成框架

Authors: Chao Zhang, Yuhao Wang, Derong Xu, Haoxin Zhang, Yuanjie Lyu, Yuhao Chen, Shuochen Liu, Tong Xu, Xiangyu Zhao, Yan Gao, Yao Hu, Enhong Chen
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.05385
Pdf link: https://arxiv.org/pdf/2511.05385
Abstract Retrieval-Augmented Generation (RAG) utilizes external knowledge to augment Large Language Models' (LLMs) reliability. For flexibility, agentic RAG employs autonomous, multi-round retrieval and reasoning to resolve queries. Although recent agentic RAG methods have improved via reinforcement learning, they often incur substantial token overhead from search and reasoning processes. This trade-off prioritizes accuracy over efficiency. To address this issue, this work proposes TeaRAG, a token-efficient agentic RAG framework capable of compressing both retrieval content and reasoning steps. 1) First, the retrieved content is compressed by augmenting chunk-based semantic retrieval with a graph retrieval using concise triplets. A knowledge association graph is then built from semantic similarity and co-occurrence. Finally, Personalized PageRank is leveraged to highlight key knowledge within this graph, reducing the number of tokens per retrieval. 2) Second, to reduce reasoning steps, Iterative Process-aware Direct Preference Optimization (IP-DPO) is proposed. Specifically, our reward function evaluates the knowledge sufficiency by a knowledge matching mechanism, while penalizing excessive reasoning steps. This design can produce high-quality preference-pair datasets, supporting iterative DPO to improve reasoning conciseness. Across six datasets, TeaRAG improves the average Exact Match by 4% and 2% while reducing output tokens by 61% and 59% on Llama3-8B-Instruct and Qwen2.5-14B-Instruct, respectively. Code is available at this https URL.
中文摘要 检索增强生成（RAG）利用外部知识来增强大型语言模型（LLM）的可靠性。为了提高灵活性，代理 RAG 采用自主、多轮检索和推理来解决查询。尽管最近的代理 RAG 通过强化学习得到了改进，但它们通常会从搜索和推理过程中产生大量的标记开销。这种权衡优先考虑准确性而不是效率。为了解决这个问题，这项工作提出了 TeaRAG，这是一个令牌高效的代理 RAG 框架，能够压缩检索内容和推理步骤。1）首先，通过使用简洁三元组的图检索来增强基于块的语义检索，从而压缩检索到的内容。然后根据语义相似性和共现性构建知识关联图。最后，利用个性化 PageRank 来突出显示该图中的关键知识，从而减少每次检索的令牌数量。2）此外，为了减少推理步骤，提出了迭代过程感知直接偏好优化（IP-DPO）。具体来说，我们的奖励函数通过知识匹配机制评估知识充分性，同时惩罚过多的推理步骤。这种设计可以生成高质量的偏好对数据集，支持迭代DPO以提高推理简洁性。在六个数据集中，TeaRAG 将 Llama3-8B-Instruct 和 Qwen2.5-14B-Instruct 的平均精确匹配率分别提高了 4% 和 2%，同时将输出标记分别减少了 61% 和 59%。代码可在此 https URL 中找到。
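
Personalized PageRank, which the abstract uses to highlight key knowledge in the association graph, can be sketched as a power iteration whose restart distribution is concentrated on seed nodes. The graph, the node names, the seed set, and the damping value below are hypothetical; how TeaRAG builds its graph and chooses seeds is not specified in the abstract.

```python
def personalized_pagerank(adjacency, seeds, damping=0.85, iterations=50):
    """Power iteration for PageRank with restarts biased toward the seed nodes.

    adjacency maps each node to a list of neighbours; every node must be a key.
    """
    nodes = list(adjacency)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        # With probability (1 - damping) the walk jumps back to a seed node.
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            neighbours = adjacency[n]
            if neighbours:
                share = damping * rank[n] / len(neighbours)
                for m in neighbours:
                    nxt[m] += share
            else:
                # Dangling node: return its mass to the seeds.
                for m in nodes:
                    nxt[m] += damping * rank[n] * restart[m]
        rank = nxt
    return rank


# Hypothetical graph: "q" is the query-linked node; "c" and "d" are unrelated.
graph = {
    "q": ["a", "b"],
    "a": ["q", "b"],
    "b": ["q", "a"],
    "c": ["d"],
    "d": ["c"],
}
scores = personalized_pagerank(graph, seeds={"q"})
top_nodes = sorted(scores, key=scores.get, reverse=True)[:3]
```

Keeping only the top-ranked nodes is one plausible way such scores could shrink the retrieved context, since nodes unreachable from the seeds receive zero score.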

PreResQ-R1: Towards Fine-Grained Rank-and-Score Reinforcement Learning for Visual Quality Assessment via Preference-Response Disentangled Policy Optimization

PreResQ-R1：通过偏好-响应解缠策略优化，实现面向视觉质量评估的细粒度排序与评分强化学习

Authors: Zehui Feng, Tian Qiu, Tong Wu, Junxuan Li, Huayuan Xu, Ting Han
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.05393
Pdf link: https://arxiv.org/pdf/2511.05393
Abstract Visual Quality Assessment (QA) seeks to predict human perceptual judgments of visual fidelity. While recent multimodal large language models (MLLMs) show promise in reasoning about image and video quality, existing approaches mainly rely on supervised fine-tuning or rank-only objectives, resulting in shallow reasoning, poor score calibration, and limited cross-domain generalization. We propose PreResQ-R1, a Preference-Response Disentangled Reinforcement Learning framework that unifies absolute score regression and relative ranking consistency within a single reasoning-driven optimization scheme. Unlike prior QA methods, PreResQ-R1 introduces a dual-branch reward formulation that separately models intra-sample response coherence and inter-sample preference alignment, optimized via Group Relative Policy Optimization (GRPO). This design encourages fine-grained, stable, and interpretable chain-of-thought reasoning about perceptual quality. To extend beyond static imagery, we further design a global-temporal and local-spatial data flow strategy for Video Quality Assessment. Remarkably, with reinforcement fine-tuning on only 6K images and 28K videos, PreResQ-R1 achieves state-of-the-art results across 10 IQA and 5 VQA benchmarks under both SRCC and PLCC metrics, surpassing by margins of 5.30% and 2.15% in the IQA task, respectively. Beyond quantitative gains, it produces human-aligned reasoning traces that reveal the perceptual cues underlying quality judgments. Code and model are available.
中文摘要 视觉质量评估（QA）旨在预测人类对视觉保真度的感知判断。虽然最近的多模态大型语言模型（MLLM）在图像和视频质量推理方面显示出前景，但现有方法主要依赖于监督微调或仅排名目标，导致推理浅层、分数校准差和跨域泛化有限。我们提出了PreResQ-R1，这是一个偏好-反应解缠强化学习框架，它将绝对分数回归和相对排名一致性统一在单一推理驱动的优化方案中。与以前的 QA 方法不同，PreResQ-R1 引入了一种双分支奖励公式，该公式分别对样本内响应一致性和样本间偏好对齐进行建模，并通过组相对策略优化（GRPO）进行优化。这种设计鼓励对感知质量进行细粒度、稳定和可解释的思维链推理。为了超越静态图像，我们进一步设计了用于视频质量评估的全局时间和局部空间数据流策略。值得注意的是，PreResQ-R1 仅对 6K 图像和 28K 视频进行强化微调，在 SRCC 和 PLCC 指标下，在 10 个 IQA 和 5 个 VQA 基准测试中都取得了最先进的结果，在 IQA 任务中分别超过了 5.30% 和 2.15%。除了定量收益之外，它还产生与人类一致的推理痕迹，揭示质量判断背后的感知线索。代码和模型可用。
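
SRCC and PLCC, the two metrics reported in the abstract, are the Spearman rank and Pearson linear correlation coefficients between predicted scores and human opinion scores. The sketch below computes both with the standard library only; the sample scores are invented for illustration and are not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation coefficient (SRCC): Pearson on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Invented example: predictions are monotone in the human scores but not linear.
predicted = [1.0, 2.0, 3.0, 4.0, 5.0]
human = [1.0, 4.0, 9.0, 16.0, 25.0]
srcc = spearman(predicted, human)
plcc = pearson(predicted, human)
```

In this example the ordering is perfect, so SRCC is 1, while PLCC is lower because the relationship is not linear. This difference is why ranking quality and score calibration are usually reported separately.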

Sample Complexity of Distributionally Robust Off-Dynamics Reinforcement Learning with Online Interaction

在线交互的分布鲁棒非动力强化学习的样本复杂度

Authors: Yiting He, Zhishuai Liu, Weixin Wang, Pan Xu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2511.05396
Pdf link: https://arxiv.org/pdf/2511.05396
Abstract Off-dynamics reinforcement learning (RL), where training and deployment transition dynamics are different, can be formulated as learning in a robust Markov decision process (RMDP) where uncertainties in transition dynamics are imposed. Existing literature mostly assumes access to generative models allowing arbitrary state-action queries or pre-collected datasets with a good state coverage of the deployment environment, bypassing the challenge of exploration. In this work, we study a more realistic and challenging setting where the agent is limited to online interaction with the training environment. To capture the intrinsic difficulty of exploration in online RMDPs, we introduce the supremal visitation ratio, a novel quantity that measures the mismatch between the training dynamics and the deployment dynamics. We show that if this ratio is unbounded, online learning becomes exponentially hard. We propose the first computationally efficient algorithm that achieves sublinear regret in online RMDPs with $f$-divergence based transition uncertainties. We also establish matching regret lower bounds, demonstrating that our algorithm achieves optimal dependence on both the supremal visitation ratio and the number of interaction episodes. Finally, we validate our theoretical results through comprehensive numerical experiments.
中文摘要 非动力强化学习（RL）的训练和部署过渡动力学不同，可以表述为鲁棒马尔可夫决策过程（RMDP）中的学习，其中过渡动力学存在不确定性。现有文献大多假设可以访问生成模型，允许任意状态-动作查询或预先收集的数据集，这些数据集具有良好的部署环境状态覆盖率，从而绕过了探索的挑战。在这项工作中，我们研究了一个更现实和更具挑战性的环境，其中代理仅限于与训练环境的在线交互。为了捕捉在线 RMDP 中探索的内在难度，我们引入了最高访问率，这是一种衡量训练动态和部署动态之间不匹配的新数量。我们表明，如果这个比率是无限的，在线学习就会变得指数级困难。我们提出了第一个计算高效的算法，该算法在具有基于$f$散度的转换不确定性的在线RMDP中实现亚线性遗憾。我们还建立了匹配的后悔下限，证明我们的算法实现了对最高访问率和交互次数的最佳依赖性。最后，我们通过综合数值实验验证了我们的理论结果。

Minority-Aware Satisfaction Estimation in Dialogue Systems via Preference-Adaptive Reinforcement Learning

通过偏好自适应强化学习对对话系统进行少数群体感知满意度估计

Authors: Yahui Fu, Zi Haur Pang, Tatsuya Kawahara
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.05407
Pdf link: https://arxiv.org/pdf/2511.05407
Abstract User satisfaction in dialogue systems is inherently subjective. When the same response strategy is applied across users, minority users may assign different satisfaction ratings than majority users due to variations in individual intents and preferences. However, existing alignment methods typically train one-size-fits-all models that aim for broad consensus, often overlooking minority perspectives and user-specific adaptation. We propose a unified framework that models both individual- and group-level preferences for user satisfaction estimation. First, we introduce Chain-of-Personalized-Reasoning (CoPeR) to capture individual preferences through interpretable reasoning chains. Second, we propose an expectation-maximization-based Majority-Minority Preference-Aware Clustering (M2PC) algorithm that discovers distinct user groups in an unsupervised manner to learn group-level preferences. Finally, we integrate these components into a preference-adaptive reinforcement learning framework (PAda-PPO) that jointly optimizes alignment with both individual and group preferences. Experiments on the Emotional Support Conversation dataset demonstrate consistent improvements in user satisfaction estimation, particularly for underrepresented user groups.
中文摘要 用户对对话系统的满意度本质上是主观的。当在用户之间应用相同的响应策略时，由于个人意图和偏好的差异，少数用户可能会分配与多数用户不同的满意度评级。然而，现有的对齐方法通常训练一刀切的模型，旨在达成广泛的共识，往往忽略少数人的观点和用户特定的适应。我们提出了一个统一的框架，对个人和群体层面的偏好进行建模，以进行用户满意度估计。首先，我们引入了个性化推理链（CoPeR），通过可解释的推理链来捕捉个人偏好。其次，我们提出了一种基于期望最大化的多数-少数偏好感知聚类（M2PC）算法，该算法以无监督的方式发现不同的用户群体，以学习群体级别的偏好。最后，我们将这些组件集成到偏好自适应强化学习框架（PAda-PPO）中，该框架共同优化与个人和群体偏好的一致性。情感支持对话数据集的实验表明，用户满意度估计的持续改进，特别是对于代表性不足的用户群体。

TimeSearch-R: Adaptive Temporal Search for Long-Form Video Understanding via Self-Verification Reinforcement Learning

TimeSearch-R：通过自我验证强化学习实现长视频理解的自适应时间搜索

Authors: Junwen Pan, Qizhe Zhang, Rui Zhang, Ming Lu, Xin Wan, Yuan Zhang, Chang Liu, Qi She
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.05489
Pdf link: https://arxiv.org/pdf/2511.05489
Abstract Temporal search aims to identify a minimal set of relevant frames from tens of thousands based on a given query, serving as a foundation for accurate long-form video understanding. Existing works attempt to progressively narrow the search space. However, these approaches typically rely on a hand-crafted search process, lacking end-to-end optimization for learning optimal search strategies. In this paper, we propose TimeSearch-R, which reformulates temporal search as interleaved text-video thinking, seamlessly integrating searching video clips into the reasoning process through reinforcement learning (RL). However, applying RL training methods, such as Group Relative Policy Optimization (GRPO), to video reasoning can result in unsupervised intermediate search decisions. This leads to insufficient exploration of the video content and inconsistent logical reasoning. To address these issues, we introduce GRPO with Completeness Self-Verification (GRPO-CSV), which gathers searched video frames from the interleaved reasoning process and utilizes the same policy model to verify the adequacy of searched frames, thereby improving the completeness of video reasoning. Additionally, we construct datasets specifically designed for the SFT cold-start and RL training of GRPO-CSV, filtering out samples with weak temporal dependencies to enhance task difficulty and improve temporal search capabilities. Extensive experiments demonstrate that TimeSearch-R achieves significant improvements on temporal search benchmarks such as Haystack-LVBench and Haystack-Ego4D, as well as long-form video understanding benchmarks like VideoMME and MLVU. Notably, TimeSearch-R establishes a new state-of-the-art on LongVideoBench with 4.1% improvement over the base model Qwen2.5-VL and 2.0% over the advanced video reasoning model Video-R1. Our code is available at this https URL.
中文摘要 时间搜索旨在根据给定查询从数万个相关帧中识别出最少的一组相关帧，作为准确理解长视频的基础。现有作品试图逐步缩小搜索空间。然而，这些方法通常依赖于手工制作的搜索过程，缺乏学习最佳搜索策略的端到端优化。在本文中，我们提出了TimeSearch-R，它将时间搜索重新表述为交错的文本-视频思维，通过强化学习（RL）将搜索视频片段无缝集成到推理过程中。然而，将 RL 训练方法（例如组相对策略优化（GRPO））应用于视频推理可能会导致无监督的中间搜索决策。这导致了对视频内容的探索不足和逻辑推理的不一致。为了解决这些问题，我们引入了GRPO with Completeness Self-Verification （GRPO-CSV），它从交错推理过程中收集搜索到的视频帧，并利用相同的策略模型来验证搜索到的帧是否充分，从而提高视频推理的完整性。此外，我们还构建了专门为GRPO-CSV的SFT冷启动和RL训练设计的数据集，过滤掉时间依赖性较弱的样本，以增强任务难度并提高时间搜索能力。大量实验表明，TimeSearch-R 在 Haystack-LVBench 和 Haystack-Ego4D 等时间搜索基准测试以及 VideoMME 和 MLVU 等长视频理解基准测试上取得了显着改进。值得注意的是，TimeSearch-R 在 LongVideoBench 上建立了新的先进技术，比基础模型 Qwen2.5-VL 提高了 4.1%，比高级视频推理模型 Video-R1 提高了 2.0%。我们的代码可在此 https URL 中找到。

Visual Spatial Tuning

视觉空间调整

Authors: Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, Yi Lin, Hengshuang Zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.05491
Pdf link: https://arxiv.org/pdf/2511.05491
Abstract Capturing spatial relationships from visual inputs is a cornerstone of human-like general intelligence. Several previous studies have tried to enhance the spatial awareness of Vision-Language Models (VLMs) by adding extra expert encoders, which brings extra overhead and usually harms general capabilities. To enhance the spatial ability in general architectures, we introduce Visual Spatial Tuning (VST), a comprehensive framework to cultivate VLMs with human-like visuospatial abilities, from spatial perception to reasoning. We first attempt to enhance spatial perception in VLMs by constructing a large-scale dataset termed VST-P, which comprises 4.1 million samples spanning 19 skills across single views, multiple images, and videos. Then, we present VST-R, a curated dataset with 135K samples that instruct models to reason in space. In particular, we adopt a progressive training pipeline: supervised fine-tuning to build foundational spatial knowledge, followed by reinforcement learning to further improve spatial reasoning abilities. Without side effects on general capabilities, the proposed VST consistently achieves state-of-the-art results on several spatial benchmarks, including $34.8\%$ on MMSI-Bench and $61.2\%$ on VSIBench. It turns out that the Vision-Language-Action models can be significantly enhanced with the proposed spatial tuning paradigm, paving the way for more physically grounded AI.
中文摘要 从视觉输入中捕获空间关系是类人通用智能的基石。之前的几项研究试图通过添加额外的专家编码器来增强视觉语言模型（VLM）的空间感知，这会带来额外的开销，并且通常会损害一般功能。为了增强通用架构中的空间能力，我们引入了视觉空间调优（VST），这是一个全面的框架，用于培养具有类人视觉空间能力的VLM，从空间感知到推理。我们首先尝试通过构建一个名为 VST-P 的大规模数据集来增强 VLM 中的空间感知，该数据集包含 410 万个样本，涵盖单视图、多图像和视频的 19 种技能。然后，我们展示了 VST-R，这是一个包含 135K 样本的精选数据集，可指导模型在空间中进行推理。特别是，我们采用了渐进式训练管道：监督微调以建立基础空间知识，然后进行强化学习以进一步提高空间推理能力。在没有对一般能力产生副作用的情况下，所提出的 VST 在多个空间基准测试上始终取得了最先进的结果，包括 MMSI-Bench 的 34.8% 和 VSIBench 的 61.2%。事实证明，视觉-语言-行动模型可以通过提出的空间调整范式得到显着增强，为更物理接地的人工智能铺平道路。

Keyword: diffusion policy

MoE-DP: An MoE-Enhanced Diffusion Policy for Robust Long-Horizon Robotic Manipulation with Skill Decomposition and Failure Recovery

MoE-DP：一种用于具有技能分解和故障恢复的鲁棒长视野机器人操作的 MoE 增强扩散策略

Authors: Baiye Cheng, Tianhai Liang, Suning Huang, Maanping Shao, Feihong Zhang, Botian Xu, Zhengrong Xue, Huazhe Xu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.05007
Pdf link: https://arxiv.org/pdf/2511.05007
Abstract Diffusion policies have emerged as a powerful framework for robotic visuomotor control, yet they often lack the robustness to recover from subtask failures in long-horizon, multi-stage tasks and their learned representations of observations are often difficult to interpret. In this work, we propose the Mixture of Experts-Enhanced Diffusion Policy (MoE-DP), where the core idea is to insert a Mixture of Experts (MoE) layer between the visual encoder and the diffusion model. This layer decomposes the policy's knowledge into a set of specialized experts, which are dynamically activated to handle different phases of a task. We demonstrate through extensive experiments that MoE-DP exhibits a strong capability to recover from disturbances, significantly outperforming standard baselines in robustness. On a suite of 6 long-horizon simulation tasks, this leads to a 36% average relative improvement in success rate under disturbed conditions. This enhanced robustness is further validated in the real world, where MoE-DP also shows significant performance gains. We further show that MoE-DP learns an interpretable skill decomposition, where distinct experts correspond to semantic task primitives (e.g., approaching, grasping). This learned structure can be leveraged for inference-time control, allowing for the rearrangement of subtasks without any this http URL. Video and code are available at this https URL.
中文摘要 扩散策略已成为机器人视觉运动控制的强大框架，但它们通常缺乏从长期、多阶段任务中的子任务故障中恢复的鲁棒性，并且它们对观察的学习表示通常难以解释。在这项工作中，我们提出了专家混合增强扩散策略（MoE-DP），其核心思想是在视觉编码器和扩散模型之间插入专家混合（MoE）层。该层将策略的知识分解为一组专业专家，这些专家被动态激活以处理任务的不同阶段。我们通过广泛的实验证明，MoE-DP 表现出强大的从干扰中恢复的能力，在鲁棒性方面显着优于标准基线。在一套 6 个远视野模拟任务中，这导致在干扰条件下的成功率平均相对提高 36%。这种增强的鲁棒性在现实世界中得到了进一步验证，MoE-DP 也显示出显着的性能提升。我们进一步表明，MoE-DP 学习了一种可解释的技能分解，其中不同的专家对应于语义任务原语（例如，接近、抓握）。这种学习到的结构可用于推理时间控制，允许重新排列子任务，而无需任何此 http URL 视频和代码，可在此 https URL 中找到。