Arxiv Papers of Today

Generated: 2025-11-26 16:31:32 (UTC+8); arXiv release time: 2025-11-26 20:00 EST (2025-11-27 09:00 UTC+8)

45 relevant papers today

Keyword: reinforcement learning

AI-driven Predictive Shard Allocation for Scalable Next Generation Blockchains

Authors: M. Zeeshan Haider, Tayyaba Noreen, M. D. Assuncao, Kaiwen Zhang
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2511.19450
Pdf link: https://arxiv.org/pdf/2511.19450
Abstract Sharding has emerged as a key technique to address blockchain scalability by partitioning the ledger into multiple shards that process transactions in parallel. Although this approach improves throughput, static or heuristic shard allocation often leads to workload skew, congestion, and excessive cross-shard communication, diminishing the scalability benefits of sharding. To overcome these challenges, we propose the Predictive Shard Allocation Protocol (PSAP), a dynamic and intelligent allocation framework that proactively assigns accounts and transactions to shards based on workload forecasts. PSAP integrates a Temporal Workload Forecasting (TWF) model with a safety-constrained reinforcement learning (Safe-PPO) controller, jointly enabling multi-block-ahead prediction and adaptive shard reconfiguration. The protocol enforces deterministic inference across validators through a synchronized quantized runtime and a safety gate that limits stake concentration, migration gas, and utilization thresholds. By anticipating hotspot formation and executing bounded, atomic migrations, PSAP achieves stable load balance while preserving Byzantine safety. Experimental evaluation on heterogeneous datasets, including Ethereum, NEAR, and Hyperledger Fabric mapped via address-clustering heuristics, demonstrates up to 2x throughput improvement, 35% lower latency, and 20% reduced cross-shard overhead compared to existing dynamic sharding baselines. These results confirm that predictive, deterministic, and security-aware shard allocation is a promising direction for next-generation scalable blockchain systems.
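
To make the safety-gate idea more concrete, the following is a minimal Python sketch of a check that rejects a proposed migration plan when it would exceed any of the three limits the abstract names (stake concentration, migration gas, utilization). The `MigrationPlan` and `SafetyLimits` structures, their field names, and the threshold values are hypothetical illustrations and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MigrationPlan:
    """Hypothetical summary of a proposed shard reconfiguration."""
    max_stake_share: float   # largest fraction of total stake in any one shard afterwards
    migration_gas: int       # total gas the migrations would consume
    max_utilization: float   # highest predicted shard utilization afterwards (0..1)


@dataclass(frozen=True)
class SafetyLimits:
    """Illustrative thresholds only; the abstract does not state actual values."""
    stake_share_cap: float = 0.33
    gas_budget: int = 1_000_000
    utilization_cap: float = 0.90


def safety_gate(plan, limits=SafetyLimits()):
    """Return True only if the plan respects every limit; otherwise reject it."""
    return (
        plan.max_stake_share <= limits.stake_share_cap
        and plan.migration_gas <= limits.gas_budget
        and plan.max_utilization <= limits.utilization_cap
    )
```

In a design like this, the learned controller only proposes plans, and a simple deterministic check of this kind decides whether a plan may be executed.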

SparOA: Sparse and Operator-aware Hybrid Scheduling for Edge DNN Inference

Authors: Ziyang Zhang, Jie Liu, Luca Mottola
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.19457
Pdf link: https://arxiv.org/pdf/2511.19457
Abstract The resource demands of deep neural network (DNN) models introduce significant performance challenges, especially when deployed on resource-constrained edge devices. Existing solutions like model compression often sacrifice accuracy, while specialized hardware remains costly and inflexible. Hybrid inference methods, however, typically overlook how operator characteristics impact performance. In this work, we present SparOA, a CPU-GPU hybrid inference framework, which leverages both sparsity and computational intensity to optimize operator scheduling. SparOA addresses the aforementioned challenges through three key components: (1) a threshold predictor that accurately determines optimal sparsity and computational intensity thresholds; (2) a reinforcement learning-based scheduler that dynamically optimizes resource allocation based on real-time hardware states; and (3) a hybrid inference engine that enhances efficiency through asynchronous execution and batch size [...]. Results show that SparOA achieves an average speedup of 1.22-1.31x compared to all baselines, and outperforms the CPU-only baseline by up to 50.7x. Also, SparOA achieves optimal energy-per-inference, consuming 7%-16% less energy than the SOTA co-execution baseline.

Position: The Complexity of Perfect AI Alignment -- Formalizing the RLHF Trilemma

Authors: Subramanyam Sahoo, Aman Chadha, Vinija Jain, Divya Chaudhary
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2511.19504
Pdf link: https://arxiv.org/pdf/2511.19504
Abstract Reinforcement Learning from Human Feedback (RLHF) is widely used for aligning large language models, yet practitioners face a persistent puzzle: improving safety often reduces fairness, scaling to diverse populations becomes computationally intractable, and making systems robust often amplifies majority biases. We formalize this tension as the Alignment Trilemma: no RLHF system can simultaneously achieve (i) epsilon-representativeness across diverse human values, (ii) polynomial tractability in sample and compute complexity, and (iii) delta-robustness against adversarial perturbations and distribution shift. Through a complexity-theoretic analysis integrating statistical learning theory and robust optimization, we prove that achieving both representativeness (epsilon <= 0.01) and robustness (delta <= 0.001) for global-scale populations requires Omega(2^{d_context}) operations, which is super-polynomial in the context dimensionality. We show that current RLHF implementations resolve this trilemma by sacrificing representativeness: they collect only 10^3--10^4 samples from homogeneous annotator pools while 10^7--10^8 samples are needed for true global representation. Our framework provides a unified explanation for documented RLHF pathologies including preference collapse, sycophancy, and systematic bias amplification. We conclude with concrete directions for navigating these fundamental trade-offs through strategic relaxations of alignment requirements.
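
The gap between the stated lower bound of Omega(2^{d_context}) and any polynomial budget can be illustrated numerically. The sketch below finds the first dimension d at which 2^d overtakes a polynomial d^k. The function name, the starting point d = 2, and the example degrees are arbitrary illustrations, not values from the paper.

```python
def first_dimension_exceeding(poly_degree, start=2, limit=10_000):
    """Return the first d >= start with 2**d > d**poly_degree, or None if not found."""
    for d in range(start, limit + 1):
        if 2 ** d > d ** poly_degree:
            return d
    return None


# Exponential cost overtakes a cubic budget at a small dimension.
crossover_cubic = first_dimension_exceeding(3)
```

For a cubic budget the crossover is at d = 10 (1024 versus 1000), which gives a feel for why an exponential dependence on context dimensionality rules out polynomial tractability.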

VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning

Authors: Boyu Chen, Zikang Wang, Zhengrong Yue, Kainan Yan, Chenyun Yu, Yi Huang, Zijun Liu, Yafei Wen, Xiaoxin Chen, Yang Liu, Peng Li, Yali Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2511.19524
Pdf link: https://arxiv.org/pdf/2511.19524
Abstract By leveraging tool-augmented Multimodal Large Language Models (MLLMs), multi-agent frameworks are driving progress in video understanding. However, most of them adopt static and non-learnable tool invocation mechanisms, which limit the discovery of diverse clues essential for robust perception and reasoning regarding temporally or spatially complex videos. To address this challenge, we propose a novel Multi-agent system for video understanding, namely VideoChat-M1. Instead of using a single or fixed policy, VideoChat-M1 adopts a distinct Collaborative Policy Planning (CPP) paradigm with multiple policy agents, which comprises three key processes. (1) Policy Generation: Each agent generates its unique tool invocation policy tailored to the user's query; (2) Policy Execution: Each agent sequentially invokes relevant tools to execute its policy and explore the video content; (3) Policy Communication: During the intermediate stages of policy execution, agents interact with one another to update their respective policies. Through this collaborative framework, all agents work in tandem, dynamically refining their preferred policies based on contextual insights from peers to effectively respond to the user's query. Moreover, we equip our CPP paradigm with a concise Multi-Agent Reinforcement Learning (MARL) method. Consequently, the team of policy agents can be jointly optimized to enhance VideoChat-M1's performance, guided by both the final answer reward and intermediate collaborative process feedback. Extensive experiments demonstrate that VideoChat-M1 achieves SOTA performance across eight benchmarks spanning four tasks. Notably, on LongVideoBench, our method outperforms the SOTA model Gemini 2.5 pro by 3.6% and GPT-4o by 15.6%.

Discover, Learn, and Reinforce: Scaling Vision-Language-Action Pretraining with Diverse RL-Generated Trajectories

Authors: Rushuai Yang, Zhiyuan Feng, Tianxiang Zhang, Kaixin Wang, Chuheng Zhang, Li Zhao, Xiu Su, Yi Chen, Jiang Bian
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.19528
Pdf link: https://arxiv.org/pdf/2511.19528
Abstract Scaling vision-language-action (VLA) model pre-training requires large volumes of diverse, high-quality manipulation trajectories. Most current data is obtained via human teleoperation, which is expensive and difficult to scale. Reinforcement learning (RL) methods learn useful skills through autonomous exploration, making them a viable approach for generating data. However, standard RL training collapses to a narrow execution pattern, limiting its utility for large-scale pre-training. We propose Discover, Learn and Reinforce (DLR), an information-theoretic pattern discovery framework that generates multiple distinct, high-success behavioral patterns for VLA pretraining. Empirically, DLR generates a markedly more diverse trajectory corpus on LIBERO. Specifically, it learns multiple distinct, high-success strategies for the same task where standard RL discovers only one, and hence it covers substantially broader regions of the state-action space. When adapted to unseen downstream task suites, VLA models pretrained on our diverse RL data surpass counterparts trained on equal-sized standard RL datasets. Moreover, DLR exhibits positive data-scaling behavior that single-pattern RL lacks. These results position multi-pattern RL as a practical, scalable data engine for embodied foundation models.

HunyuanOCR Technical Report

Authors: Hunyuan Vision Team, Pengyuan Lyu, Xingyu Wan, Gengluo Li, Shangpin Peng, Weinong Wang, Liang Wu, Huawen Shen, Yu Zhou, Canhui Tang, Qi Yang, Qiming Peng, Bin Luo, Hower Yang, Houwen Peng, Hongming Yang, Senhao Xie, Binghong Wu, Mana Yang, Sergey Wang, Raccoon Liu, Dick Zhu, Jie Jiang, Linus, Han Hu, Chengquan Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.19575
Pdf link: https://arxiv.org/pdf/2511.19575
Abstract This paper presents HunyuanOCR, a commercial-grade, open-source, and lightweight (1B parameters) Vision-Language Model (VLM) dedicated to OCR tasks. The architecture comprises a Native Vision Transformer (ViT) and a lightweight LLM connected via an MLP adapter. HunyuanOCR demonstrates superior performance, outperforming commercial APIs, traditional pipelines, and larger models (e.g., Qwen3-VL-4B). Specifically, it surpasses current public solutions in perception tasks (Text Spotting, Parsing) and excels in semantic tasks (IE, Text Image Translation), securing first place in the ICDAR 2025 DIMT Challenge (Small Model Track). Furthermore, it achieves state-of-the-art (SOTA) results on OCRBench among VLMs with fewer than 3B parameters. HunyuanOCR achieves breakthroughs in three key aspects: 1) Unifying Versatility and Efficiency: We implement comprehensive support for core capabilities including spotting, parsing, IE, VQA, and translation within a lightweight framework. This addresses the limitations of narrow "OCR expert models" and inefficient "General VLMs". 2) Streamlined End-to-End Architecture: Adopting a pure end-to-end paradigm eliminates dependencies on pre-processing modules (e.g., layout analysis). This fundamentally resolves error propagation common in traditional pipelines and simplifies system deployment. 3) Data-Driven and RL Strategies: We confirm the critical role of high-quality data and, for the first time in the industry, demonstrate that Reinforcement Learning (RL) strategies yield significant performance gains in OCR tasks. HunyuanOCR is officially open-sourced on HuggingFace. We also provide a high-performance deployment solution based on vLLM, placing its production efficiency in the top tier. We hope this model will advance frontier research and provide a solid foundation for industrial applications.

Learning Massively Multitask World Models for Continuous Control

Authors: Nicklas Hansen, Hao Su, Xiaolong Wang
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.19584
Pdf link: https://arxiv.org/pdf/2511.19584
Abstract General-purpose control demands agents that act across many tasks and embodiments, yet research on reinforcement learning (RL) for continuous control remains dominated by single-task or offline regimes, reinforcing a view that online RL does not scale. Inspired by the foundation model recipe (large-scale pretraining followed by light RL), we ask whether a single agent can be trained on hundreds of tasks with online interaction. To accelerate research in this direction, we introduce a new benchmark with 200 diverse tasks spanning many domains and embodiments, each with language instructions, demonstrations, and optionally image observations. We then present Newt, a language-conditioned multitask world model that is first pretrained on demonstrations to acquire task-aware representations and action priors, and then jointly optimized with online interaction across all tasks. Experiments show that Newt yields better multitask performance and data-efficiency than a set of strong baselines, exhibits strong open-loop control, and enables rapid adaptation to unseen tasks. We release our environments, demonstrations, code for training and evaluation, as well as 200+ checkpoints.

Scaling Agentic Reinforcement Learning for Tool-Integrated Reasoning in VLMs

Authors: Meng Lu, Ran Xu, Yi Fang, Wenxuan Zhang, Yue Yu, Gaurav Srivastava, Yuchen Zhuang, Mohamed Elhoseiny, Charles Fleming, Carl Yang, Zhengzhong Tu, Yang Xie, Guanghua Xiao, Hanrui Wang, Di Jin, Wenqi Shi, Xuan Wang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.19773
Pdf link: https://arxiv.org/pdf/2511.19773
Abstract While recent vision-language models (VLMs) demonstrate strong image understanding, their ability to "think with images", i.e., to reason through multi-step visual interactions, remains limited. We introduce VISTA-Gym, a scalable training environment for incentivizing tool-integrated visual reasoning capabilities in VLMs. VISTA-Gym unifies diverse real-world multimodal reasoning tasks (7 tasks from 13 datasets in total) with a standardized interface for visual tools (e.g., grounding, parsing), executable interaction loops, verifiable feedback signals, and efficient trajectory logging, enabling visual agentic reinforcement learning at scale. While recent VLMs exhibit strong text-only reasoning, both proprietary and open-source models still struggle with tool selection, invocation, and coordination. With VISTA-Gym, we train VISTA-R1 to interleave tool-use with agentic reasoning via multi-turn trajectory sampling and end-to-end reinforcement learning. Extensive experiments across 11 public reasoning-intensive VQA benchmarks show that VISTA-R1-8B outperforms state-of-the-art baselines with similar sizes by 9.51%-18.72%, demonstrating VISTA-Gym as an effective training ground to unlock the tool-integrated reasoning capabilities for VLMs.

Learning to Clean: Reinforcement Learning for Noisy Label Correction

Authors: Marzi Heidari, Hanping Zhang, Yuhong Guo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.19808
Pdf link: https://arxiv.org/pdf/2511.19808
Abstract The challenge of learning with noisy labels is significant in machine learning, as it can severely degrade the performance of prediction models if not addressed properly. This paper introduces a novel framework that conceptualizes noisy label correction as a reinforcement learning (RL) problem. The proposed approach, Reinforcement Learning for Noisy Label Correction (RLNLC), defines a comprehensive state space representing data and their associated labels, an action space that indicates possible label corrections, and a reward mechanism that evaluates the efficacy of label corrections. RLNLC learns a deep feature representation based policy network to perform label correction through reinforcement learning, utilizing an actor-critic method. The learned policy is subsequently deployed to iteratively correct noisy training labels and facilitate the training of the prediction model. The effectiveness of RLNLC is demonstrated through extensive experiments on multiple benchmark datasets, where it consistently outperforms existing state-of-the-art techniques for learning with noisy labels.

CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception

Authors: Miguel Carvalho, Helder Dias, Bruno Martins
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.19820
Pdf link: https://arxiv.org/pdf/2511.19820
Abstract Vision-Language Models (VLMs) often struggle with tasks that require fine-grained image understanding, such as scene-text recognition or document analysis, due to perception limitations and visual fragmentation. To address these challenges, we introduce CropVLM as an external low-cost method for boosting performance, enabling VLMs to dynamically "zoom in" on relevant image regions, enhancing their ability to capture fine details. CropVLM is trained using reinforcement learning, without using human-labeled bounding boxes as a supervision signal, and without expensive synthetic evaluations. The model is trained once and can be paired with both open-source and proprietary VLMs to improve their performance. Our approach delivers significant improvements on tasks that require high-resolution image understanding, notably for benchmarks that are out-of-domain for the target VLM, without modifying or fine-tuning the VLM, thus avoiding catastrophic forgetting.
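
The "zoom in" step can be pictured as a crop-then-ask wrapper around an unmodified VLM. The Python sketch below is only an illustration under stated assumptions: images are plain 2-D lists, the box is normalized to [0, 1], `propose_box` and `vlm` are hypothetical callables, and passing both the full image and the crop to the VLM is an assumption rather than something the abstract specifies.

```python
def crop_region(image, box):
    """Crop a 2-D image (list of rows) to a normalized box (x0, y0, x1, y1)."""
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = box
    left, top = int(x0 * width), int(y0 * height)
    # Keep at least one pixel in each direction so the crop is never empty.
    right = max(int(x1 * width), left + 1)
    bottom = max(int(y1 * height), top + 1)
    return [row[left:right] for row in image[top:bottom]]


def answer_with_zoom(image, question, propose_box, vlm):
    """Query a frozen VLM with the full image plus a zoomed-in crop."""
    crop = crop_region(image, propose_box(image, question))
    return vlm([image, crop], question)
```

Because the VLM is only called, never updated, a wrapper of this shape leaves its weights untouched, which is consistent with the abstract's point about avoiding catastrophic forgetting.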

Reinforcement Learning with $\omega$-Regular Objectives and Constraints

Authors: Dominik Wagner, Leon Witzman, Luke Ong
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.19849
Pdf link: https://arxiv.org/pdf/2511.19849
Abstract Reinforcement learning (RL) commonly relies on scalar rewards with limited ability to express temporal, conditional, or safety-critical goals, and can lead to reward hacking. Temporal logic expressible via the more general class of $\omega$-regular objectives addresses this by precisely specifying rich behavioural properties. Even still, measuring performance by a single scalar (be it reward or satisfaction probability) masks safety-performance trade-offs that arise in settings with a tolerable level of risk. We address both limitations simultaneously by combining $\omega$-regular objectives with explicit constraints, allowing safety requirements and optimisation targets to be treated separately. We develop a model-based RL algorithm based on linear programming, which in the limit produces a policy maximising the probability of satisfying an $\omega$-regular objective while also adhering to $\omega$-regular constraints within specified thresholds. Furthermore, we establish a translation to constrained limit-average problems with optimality-preserving guarantees.
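
One way to read the problem described above is as a constrained optimisation over policies. The formulation below is a sketch: the symbols $\pi$ (policy), $\varphi_0$ (the $\omega$-regular objective), $\varphi_1, \dots, \varphi_k$ (the $\omega$-regular constraints), and $p_i$ (the thresholds) are notation assumed here, and the direction of the inequality is an assumption, since the abstract only says the constraints are met "within specified thresholds".

```latex
\[
\max_{\pi} \; \Pr\nolimits^{\pi}\!\left[\varphi_0\right]
\quad \text{subject to} \quad
\Pr\nolimits^{\pi}\!\left[\varphi_i\right] \ge p_i ,
\qquad i = 1, \dots, k .
\]
```

Keeping the objective and the constraints as separate terms is what lets safety requirements and optimisation targets be treated separately, rather than folded into a single scalar.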

Complex Instruction Following with Diverse Style Policies in Football Games

Authors: Chenglu Sun, Shuo Shen, Haonan Hu, Wei Zhou, Chen Chen
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.19885
Pdf link: https://arxiv.org/pdf/2511.19885
Abstract Despite advancements in language-controlled reinforcement learning (LC-RL) for basic domains and straightforward commands (e.g., object manipulation and navigation), effectively extending LC-RL to comprehend and execute high-level or abstract instructions in complex, multi-agent environments, such as football games, remains a significant challenge. To address this gap, we introduce Language-Controlled Diverse Style Policies (LCDSP), a novel LC-RL paradigm specifically designed for complex scenarios. LCDSP comprises two key components: a Diverse Style Training (DST) method and a Style Interpreter (SI). The DST method efficiently trains a single policy capable of exhibiting a wide range of diverse behaviors by modulating agent actions through style parameters (SP). The SI is designed to accurately and rapidly translate high-level language instructions into these corresponding SP. Through extensive experiments in a complex 5v5 football environment, we demonstrate that LCDSP effectively comprehends abstract tactical instructions and accurately executes the desired diverse behavioral styles, showcasing its potential for complex, real-world applications.

Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning

Authors: Jiaqi Liu, Kaiwen Xiong, Peng Xia, Yiyang Zhou, Haonian Ji, Lu Feng, Siwei Han, Mingyu Ding, Huaxiu Yao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.19900
Pdf link: https://arxiv.org/pdf/2511.19900
Abstract Vision-language agents have achieved remarkable progress in a variety of multimodal reasoning tasks; however, their learning remains constrained by the limitations of human-annotated supervision. Recent self-rewarding approaches attempt to overcome this constraint by allowing models to act as their own critics or reward providers. Yet, purely text-based self-evaluation struggles to verify complex visual reasoning steps and often suffers from evaluation hallucinations. To address these challenges, inspired by recent advances in tool-integrated reasoning, we propose Agent0-VL, a self-evolving vision-language agent that achieves continual improvement with tool-integrated reasoning. Agent0-VL incorporates tool usage not only into reasoning but also into self-evaluation and self-repair, enabling the model to introspect, verify, and refine its reasoning through evidence-grounded analysis. It unifies two synergistic roles within a single LVLM: a Solver that performs multi-turn tool-integrated reasoning, and a Verifier that generates structured feedback and fine-grained self-rewards through tool-grounded critique. These roles interact through a Self-Evolving Reasoning Cycle, where tool-based verification and reinforcement learning jointly align the reasoning and evaluation distributions for stable self-improvement. Through this zero-external-reward evolution, Agent0-VL aligns its reasoning and verification behaviors without any human annotation or external reward models, achieving continual self-improvement. Experiments on geometric problem solving and visual scientific analysis show that Agent0-VL achieves a 12.5% improvement over the base model. Our code is publicly available.

Reasoning-VLA: A Fast and General Vision-Language-Action Reasoning Model for Autonomous Driving

Authors: Dapeng Zhang, Zhenlong Yuan, Zhangquan Chen, Chih-Ting Liao, Yinda Chen, Fei Shen, Qingguo Zhou, Tat-Seng Chua
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.19912
Pdf link: https://arxiv.org/pdf/2511.19912
Abstract Vision-Language-Action (VLA) models have recently shown strong decision-making capabilities in autonomous driving. However, existing VLAs often struggle with achieving efficient inference and generalizing to novel autonomous vehicle configurations and driving scenarios. In this paper, we propose Reasoning-VLA, a general and fast action-generation VLA framework. The proposed model employs a set of learnable action queries, initialized via Gaussian sampling from ground-truth trajectories within the training corpus. These learnable queries interact with reasoning-enhanced vision-language features to generate continuous action trajectories in parallel. To promote robust generalization, we consolidate eight publicly available autonomous driving datasets into a standardized, Chain-of-Thought reasoning-based, and easy-to-use data format for model training. Leveraging both supervised learning and reinforcement learning fine-tuning, extensive empirical evaluations across multiple benchmarks demonstrate that Reasoning-VLA achieves state-of-the-art performance, superior generalization capability, and excellent inference speed.

Designing Reputation Systems for Manufacturing Data Trading Markets: A Multi-Agent Evaluation with Q-Learning and IRL-Estimated Utilities

Authors: Kenta Yamamoto, Teruaki Hayashi
Subjects: Computer Science and Game Theory (cs.GT); Computers and Society (cs.CY); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.19930
Pdf link: https://arxiv.org/pdf/2511.19930
Abstract Recent advances in machine learning and big data analytics have intensified the demand for high-quality cross-domain datasets and accelerated the growth of data trading across organizations. As data become increasingly recognized as an economic asset, data marketplaces have emerged as a key infrastructure for data-driven innovation. However, unlike mature product or service markets, data-trading environments remain nascent and suffer from pronounced information asymmetry. Buyers cannot verify the content or quality before purchasing data, making trust and quality assurance central challenges. To address these issues, this study develops a multi-agent data-market simulator that models participant behavior and evaluates the institutional mechanisms for trust formation. Focusing on the manufacturing sector, where initiatives such as GAIA-X and Catena-X are advancing, the simulator integrates reinforcement learning (RL) for adaptive agent behavior and inverse reinforcement learning (IRL) to estimate utility functions from empirical behavioral data. Using the simulator, we examine the market-level effects of five representative reputation systems-Time-decay, Bayesian-beta, PageRank, PowerTrust, and PeerTrust-and found that PeerTrust achieved the strongest alignment between data price and quality, while preventing monopolistic dominance. Building on these results, we develop a hybrid reputation mechanism that integrates the strengths of existing systems to achieve improved price-quality consistency and overall market stability. This study extends simulation-based data-market analysis by incorporating trust and reputation as endogenous mechanisms and offering methodological and institutional insights into the design of reliable and efficient data ecosystems.
中文摘要 机器学习和大数据分析的最新进展加剧了对高质量跨域数据集的需求，加速了跨组织数据交易的增长。随着数据日益被认可为经济资产，数据市场已成为数据驱动创新的关键基础设施。然而，与成熟的产品或服务市场不同，数据交易环境仍处于萌芽阶段，且存在明显的信息不对称问题。买家无法在购买数据前核实内容或质量，这使得信任和质量保证成为核心挑战。为解决这些问题，本研究开发了一款多智能体数据市场模拟器，模拟参与者行为并评估信任形成的制度机制。该模拟器聚焦于GAIA-X和Catena-X等项目正在推进的制造业领域，整合了用于自适应智能体行为的强化学习（RL），以及用于从经验行为数据估计效用函数的逆强化学习（IRL）。利用模拟器，我们考察了五个代表性声誉系统（时间衰减、贝叶斯Beta、PageRank、PowerTrust 和 PeerTrust）的市场层面效应，发现 PeerTrust 在数据价格与质量之间实现了最强的对齐，同时防止了垄断性主导。基于这些成果，我们开发了一种混合声誉机制，整合现有系统的优势，以实现更高的价格质量一致性和整体市场稳定。本研究通过将信任和声誉纳入内生机制，扩展基于模拟的数据市场分析，并提供关于可靠高效数据生态系统设计的方法论和制度层面的见解。
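
To make two of the reputation systems named in the abstract concrete, here is a minimal sketch of a Bayesian-beta score and a time-decay score in their common textbook forms. The abstract does not say how the simulator parameterizes them, so the function names, the uniform Beta(1, 1) prior, and the `decay` value are illustrative assumptions, not the authors' implementation.

```python
def beta_reputation(positive: int, negative: int) -> float:
    """Mean of a Beta(positive + 1, negative + 1) posterior.

    Assumes a uniform Beta(1, 1) prior, so a seller with no history scores 0.5.
    """
    return (positive + 1) / (positive + negative + 2)


def time_decay_reputation(ratings, decay=0.9):
    """Exponentially weighted mean of ratings in [0, 1].

    ratings[-1] is the most recent rating; older ratings are discounted
    by `decay` per step (an assumed, hypothetical parameter).
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    newest = len(ratings) - 1
    weights = [decay ** (newest - i) for i in range(len(ratings))]
    return sum(w * r for w, r in zip(weights, ratings)) / sum(weights)
```

With these definitions, a recent good rating outweighs an older bad one under time decay, while the beta score moves gradually as evidence accumulates.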

Collaborate sim and real: Robot Bin Packing Learning in Real-world and Physical Engine

协同仿真与现实：真实世界与物理引擎中的机器人装箱学习

Authors: Lidi Zhang, Han Wu, Liyu Zhang, Ruofeng Liu, Haotian Wang, Chao Li, Desheng Zhang, Yunhuai Liu, Tian He
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.19932
Pdf link: https://arxiv.org/pdf/2511.19932
Abstract The 3D bin packing problem, with its diverse industrial applications, has garnered significant research attention in recent years. Existing approaches typically model it as a discrete and static process, while real-world applications involve continuous gravity-driven interactions. This idealized simplification leads to infeasible deployments (e.g., unstable packing) in practice. Simulation with a physics engine offers an opportunity to emulate continuous gravity effects, enabling the training of reinforcement learning (RL) agents to address such limitations and improve packing stability. However, a simulation-to-reality gap persists due to dynamic variations in the physical properties of real-world objects, such as varying friction coefficients, elasticity, and non-uniform weight distributions. To bridge this gap, we propose a hybrid RL framework that combines physical simulation with real-world data feedback. First, domain randomization is applied during simulation to expose agents to a spectrum of physical parameters, enhancing their generalization capability. Second, the RL agent is fine-tuned with real-world deployment feedback, further reducing collapse rates. Extensive experiments demonstrate that our method achieves lower collapse rates in both simulated and real-world scenarios. Large-scale deployments in logistics systems validate its practical effectiveness, with a 35% reduction in packing collapse compared to baseline methods.
中文摘要 三维装箱问题具有多样的工业应用，近年来引起了广泛的研究关注。现有方法通常将其建模为离散且静态的过程，而现实应用则涉及连续的重力驱动相互作用。这种理想化的简化导致了实际中不可行的部署（例如不稳定的装箱）。使用物理引擎的仿真提供了模拟连续重力效应的机会，使得可以训练强化学习（RL）智能体来克服这些限制并提升装箱稳定性。然而，由于现实物体物理属性的动态变化，如不同的摩擦系数、弹性和非均匀的重量分布，仿真与现实之间的差距依然存在。为弥合这一差距，我们提出了一个混合强化学习框架，将物理仿真与真实世界数据反馈相结合。首先，在仿真过程中应用域随机化，使智能体暴露于一系列物理参数，增强其泛化能力。其次，强化学习智能体根据实际部署反馈进行微调，进一步降低坍塌率。大量实验表明，我们的方法在仿真和现实场景中均实现了更低的坍塌率。物流系统的大规模部署验证了其实际有效性，与基线方法相比，装箱坍塌减少了35%。
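
The domain randomization step mentioned in the abstract can be sketched as drawing a fresh set of physical parameters for every simulated episode. The parameter names and ranges below are hypothetical placeholders chosen for illustration; the abstract only says that friction, elasticity, and weight distribution vary.

```python
import random
from dataclasses import dataclass

# Hypothetical sampling ranges; the paper's actual ranges are not given in the abstract.
RANGES = {
    "friction": (0.2, 1.0),        # surface friction coefficient
    "restitution": (0.0, 0.5),     # elasticity (bounciness) of a box
    "mass_offset": (-0.3, 0.3),    # centre-of-mass shift as a fraction of box size
}


@dataclass(frozen=True)
class PhysicsParams:
    friction: float
    restitution: float
    mass_offset: float


def sample_physics(rng: random.Random) -> PhysicsParams:
    """Draw one set of physical parameters uniformly from RANGES."""
    return PhysicsParams(**{name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()})
```

A training loop would call `sample_physics` at the start of each episode and pass the result to whichever physics engine is in use, so the policy never sees one fixed set of dynamics.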

Optimize Flip Angle Schedules In MR Fingerprinting Using Reinforcement Learning

利用强化学习优化MR指纹识别中的翻转角度安排

Authors: Shenjun Zhong, Zhifeng Chen, Zhaolin Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Arxiv link: https://arxiv.org/abs/2511.19941
Pdf link: https://arxiv.org/pdf/2511.19941
Abstract Magnetic Resonance Fingerprinting (MRF) leverages transient-state signal dynamics generated by tunable acquisition parameters, which makes designing an optimal, robust sequence a complex, high-dimensional sequential decision problem; optimizing the flip angle, one of the key parameters, is one example. Reinforcement learning (RL) offers a promising approach for automating parameter selection and optimizing pulse sequences to maximize the distinguishability of fingerprints across the parameter space. In this work, we introduce an RL framework for optimizing the flip-angle schedule in MRF and demonstrate a learned schedule exhibiting non-periodic patterns that enhance fingerprint separability. Additionally, an interesting observation is that the RL-optimized schedule may enable a reduction in the number of repetition times, potentially accelerating MRF acquisitions.
中文摘要 磁共振指纹识别（MRF）利用可调采集参数生成的瞬态信号动态，使得设计最优且稳健序列成为一个复杂的高维序列决策问题，例如优化关键参数之一——翻转角。强化学习（RL）提供了一种有前景的方法来自动化参数选择，优化脉冲序列，最大化指纹在参数空间中的可区分性。本研究介绍了一个用于优化MRF翻转角调度的强化学习框架，并展示了一个学习到的调度，具有非周期性模式，从而增强了指纹的可分离性。此外，一个有趣的观察是，强化学习优化的计划可能减少重复时间，从而加速MRF的采集速度。

Differential Smoothing Mitigates Sharpening and Improves LLM Reasoning

微分平滑减轻了锐化问题，提升了大型语言模型的推理能力

Authors: Jingchu Gai, Guanning Zeng, Huaqing Zhang, Aditi Raghunathan
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.19942
Pdf link: https://arxiv.org/pdf/2511.19942
Abstract It is widely recognized that reinforcement learning (RL) fine-tuning of large language models often leads to diversity collapse, where outputs lack variety. Prior work has proposed a range of heuristics to counteract this effect, but these methods are ad hoc: they frequently trade off correctness for diversity, their effectiveness varies across tasks, and in some cases they even contradict one another. In this work, we place these observations on a rigorous foundation. We first provide a formal proof of why RL fine-tuning exhibits diversity collapse via a selection and reinforcement bias. Next, we make a key observation that any reward modification to address diversity collapse only needs to be applied on the correct trajectories. Building directly on this analysis, we introduce a principled method, differential smoothing, that provably improves both correctness and diversity, outperforming vanilla RL as well as widely used entropy-based heuristics. Our theory precisely characterizes when existing heuristics help and why they fail, while showing that differential smoothing is universally superior. Extensive experiments with models from 1B to 7B parameters, across domains including CountDown and real-world mathematical reasoning, demonstrate consistent gains. Differential smoothing improves both Pass@1 and Pass@k, with up to 6.7% improvements on the AIME24 dataset.
中文摘要 人们普遍认识到，对大型语言模型进行强化学习（RL）微调常常导致多样性崩溃，即输出缺乏多样性。此前已有研究提出多种启发式方法来抵消这一效应，但这些方法都是临时性的：它们经常以牺牲正确性换取多样性，效果在不同任务中存在差异，有时甚至相互矛盾。在本研究中，我们将这些观察建立在严谨的基础上。我们首先正式证明了强化学习微调为何会因选择偏差和强化偏差而出现多样性崩溃。接下来，我们得出一个关键观察：任何用于应对多样性崩溃的奖励修改只需应用于正确的轨迹。基于该分析，我们引入了一种有原则的方法，即微分平滑（differential smoothing），它可证明同时提升正确性和多样性，优于原版强化学习以及广泛使用的基于熵的启发式方法。我们的理论精确刻画了现有启发式方法何时有效及其失效原因，同时表明微分平滑普遍更优。在包括CountDown和现实数学推理在内的多个领域，对1B到7B参数的模型进行的大量实验显示出一致的提升。微分平滑同时提升了Pass@1和Pass@k，在AIME24数据集上的提升最高达6.7%。
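
Since the abstract reports results in terms of Pass@1 and Pass@k, a short sketch of how Pass@k is commonly estimated may help. This is the widely used unbiased estimator, which computes the chance that at least one of k samples drawn without replacement from n generations is correct, given that c of the n are correct. Whether this paper uses exactly this estimator is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples (out of n, c correct) is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets containing no correct sample;
    # it is 0 whenever k > n - c, which gives a pass rate of exactly 1.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A model whose correct answers are concentrated on few distinct outputs can score well at k = 1 yet gain little as k grows, which is why diversity collapse shows up in Pass@k.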

Toward Trustworthy Digital Twins in Agentic AI-based Wireless Network Optimization: Challenges, Solutions, and Opportunities

迈向基于代理AI的无线网络优化中的可信数字孪生：挑战、解决方案与机遇

Authors: Zhenyu Tao, Wei Xu, Xiaohu You
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2511.19961
Pdf link: https://arxiv.org/pdf/2511.19961
Abstract Optimizing modern wireless networks is exceptionally challenging due to their high dynamism and complexity. While agentic artificial intelligence (AI) powered by reinforcement learning (RL) offers a promising solution, its practical application is limited by prohibitive exploration costs and potential risks in the real world. The emerging digital twin (DT) technology provides a safe and controlled virtual environment for agentic AI training, but its effectiveness critically depends on the DT's fidelity. Policies trained in a low-fidelity DT that does not accurately represent the physical network may experience severe performance degradation upon real-world deployment. In this article, we introduce a unified DT evaluation framework to ensure trustworthy DTs in agentic AI-based network optimization. This evaluation framework shifts from conventional isolated physical accuracy metrics, such as wireless channel and user trajectory similarities, to a more holistic, task-centric DT assessment. We show that it serves as an effective guideline for the design, selection, and lifecycle management of wireless network DTs. A comprehensive case study on a real-world wireless network testbed shows how this evaluation framework is used to pre-filter candidate DTs, leading to a significant reduction in training and testing costs without sacrificing deployment performance. Finally, potential research opportunities are discussed.
中文摘要 由于现代无线网络的高度动态性和复杂性，优化其极具挑战性。虽然由强化学习（RL）驱动的代理人工智能（AI）提供了一个有前景的解决方案，但其实际应用受限于高昂的探索成本和现实世界中潜在的风险。新兴的数字孪生（DT）技术为代理AI训练提供了一个安全且受控的虚拟环境，但其有效性关键在于DT的保真度。在低保真度DT中训练、无法准确反映物理网络的策略，在实际部署时可能会出现严重的性能下降。本文介绍了一个统一的DT评估框架，以确保基于人工智能的智能网络优化中DT的可靠性。该评估框架从传统的孤立物理准确度指标（如无线信道和用户轨迹相似度）转变为更全面、以任务为中心的DT评估。我们将其作为无线网络DT设计、选择和生命周期管理的有效指南进行了展示。一个基于真实无线网络测试平台的全面案例研究展示了该评估框架如何用于预筛选候选DT，从而显著降低训练和测试成本，同时不牺牲部署性能。最后，讨论了潜在的研究机会。

HiCoGen: Hierarchical Compositional Text-to-Image Generation in Diffusion Models via Reinforcement Learning

HiCoGen：通过强化学习实现扩散模型中的分层合成文本到图像生成

Authors: Hongji Yang, Yucheng Zhou, Wencheng Han, Runzhou Tao, Zhongying Qiu, Jianfei Yang, Jianbing Shen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.19965
Pdf link: https://arxiv.org/pdf/2511.19965
Abstract Recent advances in diffusion models have demonstrated impressive capability in generating high-quality images for simple prompts. However, when confronted with complex prompts involving multiple objects and hierarchical structures, existing models struggle to accurately follow instructions, leading to issues such as concept omission, confusion, and poor compositionality. To address these limitations, we propose a Hierarchical Compositional Generative framework (HiCoGen) built upon a novel Chain of Synthesis (CoS) paradigm. Instead of monolithic generation, HiCoGen first leverages a Large Language Model (LLM) to decompose complex prompts into minimal semantic units. It then synthesizes these units iteratively, where the image generated in each step provides crucial visual context for the next, ensuring all textual concepts are faithfully constructed into the final scene. To further optimize this process, we introduce a reinforcement learning (RL) framework. Crucially, we identify that the limited exploration of standard diffusion samplers hinders effective RL. We theoretically prove that sample diversity is maximized by concentrating stochasticity in the early generation stages and, based on this insight, propose a novel Decaying Stochasticity Schedule to enhance exploration. Our RL algorithm is then guided by a hierarchical reward mechanism that jointly evaluates the image at the global, subject, and relationship levels. We also construct HiCoPrompt, a new text-to-image benchmark with hierarchical prompts for rigorous evaluation. Experiments show our approach significantly outperforms existing methods in both concept coverage and compositional accuracy.
中文摘要 扩散模型的最新进展展示了为简单提示生成高质量图像的惊人能力。然而，当面对涉及多个对象和层级结构的复杂提示时，现有模型难以准确执行指令，导致概念遗漏、混淆和构造不佳等问题。为解决这些局限性，我们提出了一种基于新型合成链（CoS）范式的分层合成生成框架（HiCoGen）。HiCoGen 不再采用单一生成，而是首先利用大型语言模型（LLM）将复杂提示分解为最小语义单元。然后它对这些单元进行迭代综合，每一步生成的图像为下一步提供了关键的视觉背景，确保所有文本概念忠实构建到最终场景中。为了进一步优化这一过程，我们引入了一个强化学习（RL）框架。关键是，我们发现对标准扩散采样器的有限探索阻碍了有效的强化学习。我们理论上证明，将随机性集中在早期世代阶段可以最大化样本多样性，并基于这一见解提出了一种新的衰减随机性计划以增强探索性。我们的强化学习算法随后由一个层级奖励机制指导，该机制在整体、主体和关系层面共同评估图像。我们还构建了HiCoPrompt，这是一个新的文本到图像基准测试，采用层级提示进行严格评估。实验显示，我们的方法在概念覆盖和组成准确性方面显著优于现有方法。

Boosting Reasoning in Large Multimodal Models via Activation Replay

通过激活重放提升大型多模态模型中的推理能力

Authors: Yun Xing, Xiaobin Hu, Qingdong He, Jiangning Zhang, Shuicheng Yan, Shijian Lu, Yu-Gang Jiang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.19972
Pdf link: https://arxiv.org/pdf/2511.19972
Abstract Recently, Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach to incentivizing reasoning capability in Large Multimodal Models (LMMs), while the underlying mechanisms behind this post-training paradigm are poorly understood. We begin by exploring how input activations are affected by RLVR from the perspective of the logit lens. Our systematic investigations across multiple post-trained LMMs suggest that RLVR shifts low-entropy activations unexpectedly, while high-entropy ones are less affected. We further demonstrate through controlled experiments that such phenomena are associated with LMM reasoning, suggesting a potentially beneficial role of modulating low-entropy activations. To this end, we propose Activation Replay, a simple yet effective training-free approach that boosts multimodal reasoning of post-trained LMMs without requiring expensive policy optimization. Our design involves manipulation of visual tokens at test time, replaying low-entropy activations from the input context of base LMMs to regulate their RLVR counterparts. Activation Replay triggers better reasoning across diverse scenarios, including mathematics, o3-like visual agents, and video reasoning. We further show that Activation Replay boosts Pass@K and mitigates the narrower reasoning coverage of RLVR. Our design is compared against alternative choices, such as replaying high-entropy activations instead of low-entropy ones, or direct cross-model intervention instead of manipulating input tokens, demonstrating the superiority of our implementation. Code will be made publicly available.
中文摘要 最近，带可验证奖励的强化学习（RLVR）已成为激励大型多模态模型（LMM）推理能力的有效方法，但这一后训练范式背后的机制尚未得到充分理解。我们首先从logit lens的视角探讨RLVR如何影响输入激活。我们对多个后训练LMM的系统性研究表明，RLVR会意外地改变低熵激活，而高熵激活受到的影响较小。我们进一步通过受控实验证明，这些现象与LMM推理相关，表明调节低熵激活可能具有有益作用。为此，我们提出了激活重放（Activation Replay），这是一种简单而有效的免训练方法，能够提升后训练LMM的多模态推理能力，而无需昂贵的策略优化。我们的设计在测试时操作视觉标记，从基础LMM的输入上下文中重放低熵激活，以调节RLVR模型中对应的激活。激活重放能在多种场景中激发更好的推理，包括数学、类似o3的视觉智能体和视频推理。我们还进一步表明，激活重放提升了Pass@K，并缓解了RLVR推理覆盖范围变窄的问题。我们将该设计与其他替代方案进行了比较，例如重放高熵激活而非低熵激活，或直接进行跨模型干预而非操作输入标记，结果展示了我们实现的优越性。代码将公开发布。

OmniRefiner: Reinforcement-Guided Local Diffusion Refinement

OmniRefiner：强化引导局部扩散精炼

Authors: Yaoli Liu, Ziheng Ouyang, Shengtao Lou, Yiren Song
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.19990
Pdf link: https://arxiv.org/pdf/2511.19990
Abstract Reference-guided image generation has progressed rapidly, yet current diffusion models still struggle to preserve fine-grained visual details when refining a generated image using a reference. This limitation arises because VAE-based latent compression inherently discards subtle texture information, causing identity- and attribute-specific cues to vanish. Moreover, post-editing approaches that amplify local details based on existing methods often produce results inconsistent with the original image in terms of lighting, texture, or shape. To address this, we introduce OmniRefiner, a detail-aware refinement framework that performs two consecutive stages of reference-driven correction to enhance pixel-level consistency. We first adapt a single-image diffusion editor by fine-tuning it to jointly ingest the draft image and the reference image, enabling globally coherent refinement while maintaining structural fidelity. We then apply reinforcement learning to further strengthen localized editing capability, explicitly optimizing for detail accuracy and semantic consistency. Extensive experiments demonstrate that OmniRefiner significantly improves reference alignment and fine-grained detail preservation, producing faithful and visually coherent edits that surpass both open-source and commercial models on challenging reference-guided restoration benchmarks.
中文摘要 参考引导图像生成技术进展迅速，但当前扩散模型在借助参考图像细化生成图像时，仍难以保持细粒度的视觉细节。这一限制源于基于VAE的潜在压缩本质上会丢弃细微的纹理信息，导致身份和属性相关的线索消失。此外，基于现有方法放大局部细节的后期编辑方法，往往在光照、纹理或形状方面产生与原始图像不一致的结果。为此，我们引入了OmniRefiner，一个细节感知的细化框架，通过连续两个阶段的参考驱动修正来增强像素层面的一致性。我们首先对单图像扩散编辑器进行微调，使其能够同时接收草稿图像和参考图像，实现全局连贯的细化，同时保持结构保真度。随后，我们应用强化学习进一步强化局部编辑能力，明确优化细节准确性和语义一致性。大量实验表明，OmniRefiner显著提升了参考对齐和细粒度细节保留，产生忠实且视觉连贯的编辑结果，在具有挑战性的参考引导修复基准上超越了开源和商业模型。

Energy Costs and Neural Complexity Evolution in Changing Environments

能源成本与神经复杂性在变化环境中的演化

Authors: Sian Heesom-Green, Jonathan Shock, Geoff Nitschke
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.20018
Pdf link: https://arxiv.org/pdf/2511.20018
Abstract The Cognitive Buffer Hypothesis (CBH) posits that larger brains evolved to enhance survival in changing conditions. However, larger brains also carry higher energy demands, imposing additional metabolic burdens. Alongside brain size, brain organization plays a key role in cognitive ability and, with suitable architectures, may help mitigate energy challenges. This study evolves Artificial Neural Networks (ANNs) used by Reinforcement Learning (RL) agents to investigate how environmental variability and energy costs influence the evolution of neural complexity, defined in terms of ANN size and structure. Results indicate that under energy constraints, increasing seasonality led to smaller ANNs. This challenges CBH and supports the Expensive Brain Hypothesis (EBH), as highly seasonal environments reduced net energy intake and thereby constrained brain size. ANN structural complexity primarily emerged as a byproduct of size, where energy costs promoted the evolution of more efficient networks. These results highlight the role of energy constraints in shaping neural complexity, offering in silico support for biological theory and energy-efficient robotic design.
中文摘要 认知缓冲假说（CBH）认为，大脑较大是为了在变化环境中提高生存能力而进化出来。然而，大脑体积更大，能量需求也更高，带来额外的代谢负担。除了大脑大小，大脑组织在认知能力中起着关键作用，并且在合适的结构下，有助于缓解能量问题。本研究发展了人工神经网络（ANN），这些网络被强化学习（RL）智能体用于研究环境变异性和能量成本如何影响神经复杂性（以人工神经网络大小和结构定义）的演化。结果表明，在能量约束下，季节性增加导致人工神经网络变小。这挑战了CBH，支持昂贵脑假说（EBH），因为高度季节性环境降低了净能量摄入，从而限制了大脑体积。人工神经网络的结构复杂性主要是规模的副产品，能源成本促进了更高效网络的发展。这些结果凸显了能量约束在塑造神经复杂性中的作用，为生物理论和节能机器人设计提供了计算机支持。

SOMBRL: Scalable and Optimistic Model-Based RL

SOMBRL：可扩展且乐观的基于模型的强化学习

Authors: Bhavya Sukhija, Lenart Treven, Carmelo Sferrazza, Florian Dörfler, Pieter Abbeel, Andreas Krause
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.20066
Pdf link: https://arxiv.org/pdf/2511.20066
Abstract We address the challenge of efficient exploration in model-based reinforcement learning (MBRL), where the system dynamics are unknown and the RL agent must learn directly from online interactions. We propose Scalable and Optimistic MBRL (SOMBRL), an approach based on the principle of optimism in the face of uncertainty. SOMBRL learns an uncertainty-aware dynamics model and greedily maximizes a weighted sum of the extrinsic reward and the agent's epistemic uncertainty. SOMBRL is compatible with any policy optimizer or planner, and under common regularity assumptions on the system, we show that SOMBRL has sublinear regret for nonlinear dynamics in the (i) finite-horizon, (ii) discounted infinite-horizon, and (iii) non-episodic settings. Additionally, SOMBRL offers a flexible and scalable solution for principled exploration. We evaluate SOMBRL on state-based and visual-control environments, where it displays strong performance across all tasks and baselines. We also evaluate SOMBRL on dynamic RC car hardware and show that it outperforms the state of the art, illustrating the benefits of principled exploration for MBRL.
中文摘要 我们探讨了基于模型强化学习（MBRL）中高效探索的挑战，该领域系统动态未知，强化学习代理必须直接从在线交互中学习。我们提出可扩展且乐观的MBRL（SOMBRL），这是一种基于面对不确定性时保持乐观原则的方法。SOMBRL学习一个不确定性意识的动力学模型，贪婪地最大化外在奖励与主体认知不确定性的加权和。SOMBRL与任何策略优化器或规划器兼容，并且在系统上的常见正则性假设下，我们证明SOMBRL在（i）有限视界、（ii）折现无限视界和（iii）非情景设定下的非线性动态中存在亚线性遗憾。此外，SOMBRL还提供了一种灵活且可扩展的原则性探索解决方案。我们在基于状态和可视化控制的环境中评估SOMBRL，其在所有任务和基线上表现出强劲的性能。我们还在动态遥控车硬件上评估了SOMBRL，展示了其超越最先进技术的表现，展示了基于原则探索对MBRL的优势。
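
The objective described in the abstract, a weighted sum of extrinsic reward and epistemic uncertainty, can be sketched as follows. Using disagreement across an ensemble of dynamics models as the uncertainty estimate is an assumption made here for illustration; the abstract only says the model is uncertainty-aware, and the function names and `weight` parameter are hypothetical.

```python
from statistics import pstdev


def epistemic_uncertainty(ensemble_predictions):
    """Mean per-dimension standard deviation across ensemble members.

    ensemble_predictions: one predicted next-state vector per model,
    all of the same length.
    """
    per_dimension = list(zip(*ensemble_predictions))
    return sum(pstdev(values) for values in per_dimension) / len(per_dimension)


def optimistic_reward(extrinsic_reward, ensemble_predictions, weight=1.0):
    """Extrinsic reward plus a weighted exploration bonus."""
    return extrinsic_reward + weight * epistemic_uncertainty(ensemble_predictions)
```

A planner or policy optimizer would maximize `optimistic_reward` instead of the raw reward, so that states where the models disagree look more attractive until they have been visited.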

QiMeng-Kernel: Macro-Thinking Micro-Coding Paradigm for LLM-Based High-Performance GPU Kernel Generation

启梦内核：基于LLM的高性能GPU内核生成的宏思维微编码范式

Authors: Xinguo Zhu, Shaohui Peng, Jiaming Guo, Yunji Chen, Qi Guo, Yuanbo Wen, Hang Qin, Ruizhi Chen, Qirui Zhou, Ke Gao, Yanjun Wu, Chen Zhao, Ling Li
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.20100
Pdf link: https://arxiv.org/pdf/2511.20100
Abstract Developing high-performance GPU kernels is critical for AI and scientific computing, but remains challenging due to its reliance on expert crafting and poor portability. While LLMs offer promise for automation, both general-purpose and finetuned LLMs suffer from two fundamental and conflicting limitations: correctness and efficiency. The key reason is that existing LLM-based approaches directly generate entire optimized low-level programs, requiring exploration of an extremely vast space encompassing both optimization policies and implementation code. To address the challenge of exploring an intractable space, we propose Macro Thinking Micro Coding (MTMC), a hierarchical framework inspired by the staged optimization strategy of human experts. It decouples optimization strategy from implementation details, ensuring efficiency through high-level strategy and correctness through low-level implementation. Specifically, Macro Thinking employs reinforcement learning to guide lightweight LLMs in efficiently exploring and learning semantic optimization strategies that maximize hardware utilization. Micro Coding leverages general-purpose LLMs to incrementally implement the stepwise optimization proposals from Macro Thinking, avoiding full-kernel generation errors. Together, they effectively navigate the vast optimization space and intricate implementation details, enabling LLMs to generate high-performance GPU kernels. Comprehensive results on widely adopted benchmarks demonstrate the superior performance of MTMC on GPU kernel generation in both accuracy and running time. On KernelBench, MTMC achieves nearly 100% and 70% accuracy at Levels 1-2 and 3, respectively, over 50% higher than SOTA general-purpose and domain-finetuned LLMs, with up to 7.3x speedup over LLMs and 2.2x over expert-optimized PyTorch Eager kernels. On the more challenging TritonBench, MTMC attains up to 59.64% accuracy and 34x speedup.
中文摘要 开发高性能GPU内核对于人工智能和科学计算至关重要，但由于依赖专家制作和可移植性较差，仍具挑战性。虽然LLM在自动化方面展现出潜力，但无论是通用型还是微调型LLM，都存在两个根本且相互冲突的局限：正确性和效率。关键原因是现有基于LLM的方法直接生成整个优化的低层程序，需要探索涵盖优化策略和实现代码的极其庞大空间。为了应对探索一个难以解决的空间的挑战，我们提出了宏观思维微编码（MTMC），这是一个受人类专家分阶段优化策略启发的分层框架。它将优化策略与实施细节解耦，通过高层战略确保效率，通过低层实现实现正确性。具体来说，宏观思维运用强化学习引导轻量级大型语言模型高效探索和学习语义优化策略，最大化硬件利用率。微编码利用通用大型语言模型，逐步实现宏思维中的逐步优化方案，避免全内核生成错误。它们共同有效导航广阔的优化空间和复杂的实现细节，使大型语言模型实现高性能GPU内核生成。广泛采用的基准测试的综合结果显示，MTMC在GPU内核生成的准确性和运行时间上表现出优越。在KernelBench上，MTMC在1-2级和3级的准确率接近100%和70%，比SOTA通用和领域微调LLM高出50%以上，速度可比LLM快7.3倍，专家优化的PyTorch Eager内核提升2.2倍。在更具挑战性的TritonBench上，MTMC可实现最高59.64%的准确率和34倍的加速。

From data to concepts via wiring diagrams

从数据到概念，通过接线图

Authors: Jason Lo, Mohammadnima Jafari
Subjects: Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM); Machine Learning (cs.LG); Combinatorics (math.CO)
Arxiv link: https://arxiv.org/abs/2511.20138
Pdf link: https://arxiv.org/pdf/2511.20138
Abstract A wiring diagram is a labeled directed graph that represents an abstract concept such as a temporal process. In this article, we introduce the notion of a quasi-skeleton wiring diagram graph, and prove that quasi-skeleton wiring diagram graphs correspond to Hasse diagrams. Using this result, we designed algorithms that extract wiring diagrams from sequential data. We used our algorithms in analyzing the behavior of an autonomous agent playing a computer game, and the algorithms correctly identified the winning strategies. We compared the performance of our main algorithm with two other algorithms based on standard clustering techniques (DBSCAN and agglomerative hierarchical), including when some of the data was perturbed. Overall, this article brings together techniques in category theory, graph theory, clustering, reinforcement learning, and data engineering.
中文摘要 接线图是一种带标签的有向图，表示一个抽象概念，如时间过程。本文介绍了准骨架接线图的概念，并证明准骨架接线图对应于哈斯图。基于该结果，我们设计了从顺序数据中提取接线图的算法。我们用算法分析了自主智能体玩电脑游戏的行为，算法正确识别了获胜策略。我们将主算法的性能与基于标准聚类技术的另外两种算法（DBSCAN和聚合层级算法）进行了比较，包括部分数据被扰动时的表现。总体而言，本文汇集了范畴论、图论、聚类、强化学习和数据工程等技术。

Map-World: Masked Action planning and Path-Integral World Model for Autonomous Driving

Map-World：用于自动驾驶的掩码动作规划与路径积分世界模型

Authors: Bin Hu, Zijian Lu, Haicheng Liao, Chengran Yuan, Bin Rao, Yongkang Li, Guofa Li, Zhiyong Cui, Cheng-zhong Xu, Zhenning Li
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.20156
Pdf link: https://arxiv.org/pdf/2511.20156
Abstract Motion planning for autonomous driving must handle multiple plausible futures while remaining computationally efficient. Recent end-to-end systems and world-model-based planners predict rich multi-modal trajectories, but typically rely on handcrafted anchors or reinforcement learning to select a single best mode for training and control. This selection discards information about alternative futures and complicates optimization. We propose MAP-World, a prior-free multi-modal planning framework that couples masked action planning with a path-weighted world model. The Masked Action Planning (MAP) module treats future ego motion as masked sequence completion: past waypoints are encoded as visible tokens, future waypoints are represented as mask tokens, and a driving-intent path provides a coarse scaffold. A compact latent planning state is expanded into multiple trajectory queries with injected noise, yielding diverse, temporally consistent modes without anchor libraries or teacher policies. A lightweight world model then rolls out future BEV semantics conditioned on each candidate trajectory. During training, semantic losses are computed as an expectation over modes, using trajectory probabilities as discrete path weights, so the planner learns from the full distribution of plausible futures instead of a single selected path. On NAVSIM, our method matches anchor-based approaches and achieves state-of-the-art performance among world-model-based methods, while avoiding reinforcement learning and maintaining real-time inference latency.
中文摘要 自动驾驶的运动规划必须在保持计算效率的前提下，处理多种可能的未来。最新的端到端系统和基于世界模型的规划器预测了丰富的多模态轨迹，但通常依赖手工制作的锚点或强化学习来选择单一最佳的训练和控制模式。这种选择会丢弃替代未来的信息，使优化变得复杂。我们提出了MAP-World，一种无先验的多模态规划框架，将掩蔽行动规划与路径加权世界模型结合起来。蒙面行动规划（MAP）模块将未来自我运动视为掩蔽序列完成：过去的航点编码为可见的标记，未来路径以遮罩标记表示，驾驶意图路径则提供粗略的脚手架。紧凑的潜在规划状态被扩展为多条轨迹查询，并注入噪声，产生多样且时间一致的模式，无需锚点库或教师策略。然后，一个轻量级世界模型会根据每个候选轨迹展开未来的BEV语义。在训练过程中，语义损失作为对模式的期望计算，使用轨迹概率作为离散路径权重，因此规划者从合理的未来完整分布中学习，而非单一选择路径。在NAVSIM上，我们的方法匹配基于锚点的方法，在基于世界模型的方法中实现了最先进的性能，同时避免了强化学习并保持了实时推理延迟。
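
The training signal described in the abstract, a semantic loss taken as an expectation over trajectory modes with trajectory probabilities as weights, can be sketched like this. Turning per-mode scores into probabilities with a softmax is an assumption for illustration, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of per-mode scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def expected_mode_loss(mode_scores, mode_losses):
    """Probability-weighted average of per-mode losses.

    Every candidate trajectory contributes in proportion to its probability,
    in contrast to training only on a single selected mode.
    """
    probabilities = softmax(mode_scores)
    return sum(p * loss for p, loss in zip(probabilities, mode_losses))
```

With equal scores, every mode contributes equally; as one mode's score grows, the expectation approaches that mode's loss while the others still receive some gradient.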

Interactive AI NPCs Powered by LLMs: Technical Report for the CPDC Challenge 2025

由大型语言模型驱动的互动式AI NPC：CPDC 2025挑战赛技术报告

Authors: Yitian Huang, Yuxuan Lei, Jianxun Lian, Hao Liao
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.20200
Pdf link: https://arxiv.org/pdf/2511.20200
Abstract This report presents the solution and results of our team MSRA_SC in the Commonsense Persona-Grounded Dialogue Challenge (CPDC 2025). We propose a simple yet effective framework that unifies improvements across both GPU Track and API Track. Our method centers on two key components. First, Context Engineering applies dynamic tool pruning and persona clipping for input compression, combined with post-processing techniques such as parameter normalization and function merging. Together with manually refined prompts, this design improves tool call stability, execution reliability, and role-playing guidance. Second, in the GPU Track, we further adopt GRPO training, replacing supervised fine-tuning with reinforcement learning directly optimized by reward signals. This mitigates small-sample overfitting and significantly enhances task-oriented dialogue performance. In the final evaluation, our team ranks 1st in Task 2 API, 2nd in Task 1 API, and 3rd in both Task 3 API and GPU track, demonstrating the effectiveness of our approach. Our code is publicly available.
中文摘要 本报告展示了我们MSRA_SC团队在“常识人物-基础对话挑战赛”（CPDC 2025）中的解决方案和成果。我们提出了一个简单但有效的框架，统一了GPU轨道和API轨道的改进。我们的方法围绕两个关键组成部分展开。首先，上下文工程应用动态工具剪枝和人物裁剪来压缩输入，结合参数归一化和函数合并等后处理技术。结合手动优化的提示，这种设计提升了工具调用的稳定性、执行可靠性和角色扮演指导。其次，在GPU轨道中，我们进一步采用GRPO训练，用直接由奖励信号优化的强化学习取代监督式微调。这减少了小样本的过度拟合，显著提升了任务导向对话的表现。在最终评估中，我们团队在任务2 API中排名第一，任务1 API排名第2，任务3 API和GPU轨道均排名第三，充分证明了我们方法的有效性。我们的代码已公开发布。
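
For readers unfamiliar with GRPO, the core idea is that several responses are sampled for the same prompt and each response's reward is normalized against its own group, so no separate value network is needed. The sketch below shows that group-relative advantage in its commonly described form; the exact variant, reward design, and any clipping used in this report are not stated in the abstract, and `eps` is an assumed stabilizer.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group.

    rewards: scalar rewards for responses sampled from the same prompt.
    eps avoids division by zero when all rewards in the group are equal.
    """
    group_mean = mean(rewards)
    group_std = pstdev(rewards)
    return [(r - group_mean) / (group_std + eps) for r in rewards]
```

Responses that beat their group average get a positive advantage and are reinforced; a group with identical rewards yields zero advantages and contributes no learning signal.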

Leveraging weights signals - Predicting and improving generalizability in reinforcement learning

利用权重信号——预测和提升强化学习中的泛化性

Authors: Olivier Moulin, Vincent Francois-lavet, Paul Elbers, Mark Hoogendoorn
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.20234
Pdf link: https://arxiv.org/pdf/2511.20234
Abstract The generalizability of Reinforcement Learning (RL) agents (their ability to perform in environments different from the ones they were trained on) is a key problem, as agents tend to overfit to their training environments. To address this problem, we introduce a new methodology that predicts the generalizability score of an RL agent from the internal weights of its neural networks. Using this prediction capability, we propose changes to the Proximal Policy Optimization (PPO) loss function that boost the generalization score of agents trained with the modified loss. Experimental results demonstrate that our improved PPO algorithm yields agents with stronger generalizability than the original version.
中文摘要 强化学习（RL）代理的泛化性（即在与训练环境不同的环境中执行的能力）是一个关键问题，因为代理容易对其训练环境进行过拟合。为了解决这一问题并提出提高强化学习代理普遍化能力的解决方案，我们引入了一种基于强化学习代理神经网络内部权重预测强化学习代理的普遍性评分的新方法。利用这一预测能力，我们提出了对近端策略优化（PPO）损失函数的一些修改，以提升使用该升级版本训练的代理的泛化得分。实验结果表明，我们改进后的PPO算法相比原始版本，使得具有更强的泛化性。

Quantum-Enhanced Reinforcement Learning for Accelerating Newton-Raphson Convergence with Ising Machines: A Case Study for Power Flow Analysis

利用伊辛机的量子增强强化学习加速牛顿-拉夫森收敛：潮流分析案例研究

Authors: Zeynab Kaseb, Matthias Moller, Lindsay Spoor, Jerry J. Guo, Yu Xiang, Peter Palensky, Pedro P. Vergara
Subjects: Systems and Control (eess.SY); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.20237
Pdf link: https://arxiv.org/pdf/2511.20237
Abstract The Newton-Raphson (NR) method is widely used for solving power flow (PF) equations due to its quadratic convergence. However, its performance deteriorates under poor initialization or extreme operating scenarios, e.g., high levels of renewable energy penetration. Traditional NR initialization strategies often fail to address these challenges, resulting in slow convergence or even divergence. We propose using reinforcement learning (RL) to optimize the initialization of NR and introduce a novel quantum-enhanced RL environment update mechanism. Formulating the voltage adjustment task as a quadratic unconstrained binary optimization problem mitigates the significant computational cost of evaluating power system states over a combinatorially large action space at each RL timestep. Specifically, quantum/digital annealers are integrated into the RL environment update to evaluate state transitions using a problem Hamiltonian designed for PF. Results demonstrate significant improvements in convergence speed, a reduction in NR iteration counts, and enhanced robustness under different operating conditions.
中文摘要 牛顿-拉夫森（NR）方法因其二次收敛性被广泛用于求解潮流（PF）方程。然而，在初始化不良或极端运行场景（如高可再生能源渗透率）下，其性能会下降。传统的NR初始化策略往往无法解决这些挑战，导致收敛速度缓慢甚至发散。我们提出利用强化学习（RL）优化NR初始化，并引入一种新型量子增强RL环境更新机制，通过将电压调整任务表述为二次无约束二值优化问题，降低在每个RL时间步中在组合性大动作空间内评估电力系统状态的显著计算成本。具体来说，量子/数字退火器被集成到RL环境更新中，利用为PF设计的问题哈密顿量评估状态转移。结果显示收敛速度显著提升，NR迭代次数减少，并在不同运行条件下增强了鲁棒性。
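
A quadratic unconstrained binary optimization (QUBO) problem, the form the abstract casts the voltage adjustment task into, asks for the binary vector x that minimizes the sum of Q[i][j] * x[i] * x[j]. The sketch below evaluates that energy and solves a tiny instance by exhaustive search, which stands in for the quantum/digital annealer. The matrix values are made up for illustration and have no connection to the paper's power-flow Hamiltonian.

```python
from itertools import product


def qubo_energy(Q, x):
    """Energy of binary assignment x under QUBO matrix Q."""
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


def solve_qubo_brute_force(Q):
    """Return the lowest-energy binary assignment by checking all 2**n candidates.

    Only feasible for very small n; an annealer targets the same objective
    heuristically when n is large.
    """
    n = len(Q)
    return min(product((0, 1), repeat=n), key=lambda x: qubo_energy(Q, x))
```

The exhaustive search makes the scaling problem visible: the candidate count doubles with every added binary variable, which is the combinatorial cost the abstract says the annealer is meant to mitigate.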

The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation

图像作为自身的奖励：强化学习与对抗性图像生成奖励

Authors: Weijia Mao, Hao Chen, Zhenheng Yang, Mike Zheng Shou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.20256
Pdf link: https://arxiv.org/pdf/2511.20256
Abstract A reliable reward function is essential for reinforcement learning (RL) in image generation. Most current RL approaches depend on pre-trained preference models that output scalar rewards to approximate human preferences. However, these rewards often fail to capture human perception and are vulnerable to reward hacking, where higher scores do not correspond to better images. To address this, we introduce Adv-GRPO, an RL framework with an adversarial reward that iteratively updates both the reward model and the generator. The reward model is supervised using reference images as positive samples and can largely avoid being hacked. Unlike KL regularization that constrains parameter updates, our learned reward directly guides the generator through its visual outputs, leading to higher-quality images. Moreover, while optimizing existing reward functions can alleviate reward hacking, their inherent biases remain. For instance, PickScore may degrade image quality, whereas OCR-based rewards often reduce aesthetic fidelity. To address this, we take the image itself as a reward, using reference images and vision foundation models (e.g., DINO) to provide rich visual rewards. These dense visual signals, instead of a single scalar, lead to consistent gains across image quality, aesthetics, and task-specific metrics. Finally, we show that combining reference samples with foundation-model rewards enables distribution transfer and flexible style customization. In human evaluation, our method outperforms Flow-GRPO and SD3, achieving 70.0% and 72.4% win rates in image quality and aesthetics, respectively. Code and models have been released.
中文摘要 可靠的奖励函数对于图像生成中的强化学习（RL）至关重要。大多数当前强化学习方法依赖于预训练的偏好模型，这些模型输出标量奖励以近似人类偏好。然而，这些奖励往往无法捕捉人类感知，且容易受到奖励黑客攻击的影响，因为得分越高并不代表图像越好。为此，我们引入了 Adv-GRPO，这是一种具有对抗性奖励的强化学习框架，能够迭代更新奖励模型和生成器。奖励模型通过参考图像作为正样本进行监督，基本可以避免被黑客攻击。与限制参数更新的 KL 正则化不同，我们学习到的奖励直接引导生成器通过其视觉输出，从而产生更高质量的图像。此外，虽然优化现有奖励函数可以缓解奖励黑客行为，但其固有偏见依然存在。例如，PickScore可能会降低图像质量，而基于OCR的奖励往往会降低美学的真实度。为此，我们将图像本身作为奖励，使用参考图像和视觉基础模型（如DINO）来提供丰富的视觉奖励。这些密集的视觉信号，而非单一标量，带来了图像质量、美感和任务特定指标的持续提升。最后，我们证明将参考样本与基础模型奖励结合，能够实现分布转移和灵活的风格定制。在人类评估中，我们的方法在图像质量和美学方面分别优于Flow-GRPO和SD3，分别达到70.0%和72.4%。代码和模型已经发布。
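
The abstract describes rewarding a generated image by how close it is to reference images in a foundation-model feature space. A minimal sketch of that idea follows, paired with the group-normalized advantage commonly used in GRPO-style training. The short feature vectors stand in for DINO embeddings, and the cosine reward and the function names are assumptions for illustration, not the paper's learned adversarial reward.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical features: one reference image and three generated samples.
reference = [1.0, 0.0]
samples = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]

rewards = [cosine(s, reference) for s in samples]
advantages = group_advantages(rewards)
print(rewards, advantages)
```

Samples whose features sit closer to the reference receive a positive advantage and the rest a negative one, so the update is driven by relative quality within the group rather than by an absolute score.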

DRL-Guided Neural Batch Sampling for Semi-Supervised Pixel-Level Anomaly Detection

用于半监督像素级异常检测的DRL引导神经批次采样

Authors: Amirhossein Khadivi Noghredeh, Abdollah Safari, Fatemeh Ziaeetabar, Firoozeh Haghighi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.20270
Pdf link: https://arxiv.org/pdf/2511.20270
Abstract Anomaly detection in industrial visual inspection is challenging due to the scarcity of defective samples. Most existing methods rely on unsupervised reconstruction using only normal data, often resulting in overfitting and poor detection of subtle defects. We propose a semi-supervised deep reinforcement learning framework that integrates a neural batch sampler, an autoencoder, and a predictor. The RL-based sampler adaptively selects informative patches by balancing exploration and exploitation through a composite reward. The autoencoder generates loss profiles highlighting abnormal regions, while the predictor performs segmentation in the loss-profile space. This interaction enables the system to effectively learn both normal and defective patterns with limited labeled data. Experiments on the MVTec AD dataset demonstrate that our method achieves higher accuracy and better localization of subtle anomalies than recent state-of-the-art approaches while maintaining low complexity, yielding an average improvement of 0.15 in F1_max and 0.06 in AUC, with a maximum gain of 0.37 in F1_max in the best case.
中文摘要 由于缺陷样品稀少，工业目视检测中的异常检测具有挑战性。大多数现有方法依赖仅使用正常数据进行无监督重建，常导致过拟合和对细微缺陷的检测不佳。我们提出了一个半监督式深度强化学习框架，集成了神经批抽样器、自编码器和预测器。基于强化学习的采样器通过复合奖励平衡探索与利用，自适应地选择有信息的补丁。自编码器生成突出异常区域的损耗剖面，而预测器则在损耗剖面空间中进行分割。这种交互使系统能够在有限的标注数据下有效学习正常和缺陷模式。在MVTec AD数据集上的实验表明，我们的方法在保持低复杂度的同时，比近期最先进方法实现了更高的准确性和更佳的微妙异常定位，平均提升了0.15的F1_max和0.06的AUC，最佳情况下最大增益为0.37 F1_max。

VKnowU: Evaluating Visual Knowledge Understanding in Multimodal LLMs

VKnowU：评估多模态大型语言模型中的视觉知识理解

Authors: Tianxiang Jiang, Sheng Xia, Yicheng Xu, Linquan Wu, Xiangyu Zeng, Limin Wang, Yu Qiao, Yi Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.20272
Pdf link: https://arxiv.org/pdf/2511.20272
Abstract While Multimodal Large Language Models (MLLMs) have become adept at recognizing objects, they often lack the intuitive, human-like understanding of the world's underlying physical and social principles. This high-level vision-grounded semantics, which we term visual knowledge, forms a bridge between perception and reasoning, yet remains an underexplored area in current MLLMs. To systematically evaluate this capability, we present VKnowU, a comprehensive benchmark featuring 1,680 questions in 1,249 videos, covering 8 core types of visual knowledge spanning both world-centric (e.g., intuitive physics) and human-centric (e.g., subjective intentions) categories. Evaluation of 23 SOTA MLLMs reveals that leading models still fall short of human performance, with particularly notable gaps in the world-centric category. To bridge this gap, we introduce a new dataset, VKnowQA, and VideoKnow+, a baseline model that explicitly incorporates visual knowledge into MLLMs. VideoKnow+ follows a structured See-Think-Answer paradigm and adopts reinforcement learning with a visual knowledge reward, achieving a +3.7% improvement on VKnowU and consistent gains on MVBench, Video-MME, and MMVU. Our work highlights visual knowledge as a missing cornerstone for developing more generalizable MLLMs that can not only see but also truly understand our physical and social worlds.
中文摘要 虽然多模态大型语言模型（MLLM）已经擅长识别物体，但它们往往缺乏对世界底层物理和社会原理的直觉、类人理解。这种高层次的、以视觉为基础的语义，我们称之为视觉知识，构成了感知与推理之间的桥梁，但在当前MLLM中仍是一个尚未被充分探索的领域。为了系统评估这一能力，我们推出了VKnowU，这是一个综合基准测试，包含1,680个问题，分布在1,249个视频中，涵盖8种核心视觉知识类型，包括以世界为中心（如直觉物理）和以人为中心（如主观意图）两类。对23个SOTA MLLM的评估显示，领先模型仍未能达到人类表现水平，尤其在以世界为中心的类别上存在明显差距。为弥合这一差距，我们引入了新数据集VKnowQA和VideoKnow+，后者是一个明确将视觉知识纳入MLLM的基线模型。VideoKnow+采用结构化的“看-想-答”范式，并采用带有视觉知识奖励的强化学习，在VKnowU上提升+3.7%，在MVBench、Video-MME和MMVU上均有持续提升。我们的工作强调，视觉知识是开发更具通用性的MLLM的缺失基石，这些MLLM不仅能看见，还能真正理解我们的物理和社会世界。

HAFO: Humanoid Force-Adaptive Control for Intense External Force Interaction Environments

HAFO：针对强烈外部力相互作用环境的类人自适应控制

Authors: Chenhui Dong, Haozhe Xu, Wenhao Feng, Zhipeng Wang, Yanmin Zhou, Yifei Zhao, Bin He
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.20275
Pdf link: https://arxiv.org/pdf/2511.20275
Abstract Reinforcement learning controllers have made impressive progress in humanoid locomotion and light-load manipulation. However, achieving robust and precise motion under strong force interaction remains a significant challenge. To address this limitation, this paper proposes HAFO, a dual-agent reinforcement learning control framework that simultaneously optimizes a robust locomotion strategy and a precise upper-body manipulation strategy through coupled training in environments with external force interaction. We explicitly model external pulling disturbances with a spring-damper system and achieve fine-grained force control by manipulating the virtual spring. During this process, the reinforcement learning policy spontaneously generates a disturbance-rejection response by exploiting environmental feedback. Moreover, HAFO employs an asymmetric actor-critic framework in which the critic network's access to privileged spring-damper forces guides the actor network to learn a generalizable, robust policy for resisting external disturbances. Experimental results demonstrate that HAFO achieves stable control of a humanoid robot under various strong force interactions, showing remarkable performance in load tasks and ensuring stable robot operation under rope tension disturbances. Project website: this http URL.
中文摘要 强化学习控制器在类人移动和轻负载操控方面取得了显著进展。然而，在强力相互作用下实现稳健且精准的运动仍是重大挑战。针对上述局限，本文提出了HAFO这一双智能体强化学习控制框架，通过在外部力相互作用环境下的耦合训练，同时优化稳健的运动策略和精确的上半身操作策略。同时，我们通过弹簧阻尼系统显式建模外部拉力扰动，并通过操纵虚拟弹簧实现细粒度力控制。在此过程中，强化学习策略通过利用环境反馈自发生成抗扰动响应。此外，HAFO采用了非对称的Actor-Critic框架，其中Critic网络通过访问特权弹簧阻尼力引导Actor网络学习一种可泛化、稳健的策略来抵抗外部干扰。实验结果表明，HAFO在多种强力相互作用下能够稳定控制类人机器人，在负载任务中表现出色，并确保在绳索张力扰动下机器人稳定运行。项目网站：这个http网址。
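
The spring-damper disturbance model mentioned in the abstract can be sketched in one dimension as F = -k (x - anchor) - c v. The gains, the anchor, the point-mass rollout, and the function names below are illustrative assumptions; the abstract does not give the paper's actual parameters or simulator setup.

```python
def spring_damper_force(x, v, anchor, k, c):
    """Force from a virtual spring-damper tied to `anchor` (1-D, hypothetical gains)."""
    return -k * (x - anchor) - c * v

def simulate(x0, v0, anchor, k, c, mass=1.0, dt=0.001, steps=5000):
    """Semi-implicit Euler rollout of a point mass attached to the spring-damper."""
    x, v = x0, v0
    for _ in range(steps):
        force = spring_damper_force(x, v, anchor, k, c)
        v += (force / mass) * dt  # update velocity first
        x += v * dt               # then position, using the new velocity
    return x, v

# Moving the anchor (or changing k) changes the pulling force on the body.
pull_at_rest_anchor = spring_damper_force(0.1, 0.0, 0.0, k=100.0, c=5.0)
pull_with_shifted_anchor = spring_damper_force(0.1, 0.0, 0.2, k=100.0, c=5.0)
final_x, final_v = simulate(0.1, 0.0, 0.0, k=100.0, c=5.0)
print(pull_at_rest_anchor, pull_with_shifted_anchor, final_x)
```

Shifting the anchor or stiffness is one simple way to read "manipulating the virtual spring": the commanded pull changes without altering the rest of the simulation.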

AD-R1: Closed-Loop Reinforcement Learning for End-to-End Autonomous Driving with Impartial World Models

AD-R1：基于独立世界模型的端到端自动驾驶闭环强化学习

Authors: Tianyi Yan, Tao Tang, Xingtai Gui, Yongkang Li, Jiasen Zhesng, Weiyao Huang, Lingdong Kong, Wencheng Han, Xia Zhou, Xueyang Zhang, Yifei Zhan, Kun Zhan, Cheng-zhong Xu, Jianbing Shen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.20325
Pdf link: https://arxiv.org/pdf/2511.20325
Abstract End-to-end models for autonomous driving hold the promise of learning complex behaviors directly from sensor data, but face critical challenges in safety and handling long-tail events. Reinforcement Learning (RL) offers a promising path to overcome these limitations, yet its success in autonomous driving has been elusive. We identify a fundamental flaw hindering this progress: a deep-seated optimistic bias in the world models used for RL. To address this, we introduce a framework for post-training policy refinement built around an Impartial World Model. Our primary contribution is to teach this model to be honest about danger. We achieve this with a novel data synthesis pipeline, Counterfactual Synthesis, which systematically generates a rich curriculum of plausible collisions and off-road events. This transforms the model from a passive scene completer into a veridical forecaster that remains faithful to the causal link between actions and outcomes. We then integrate this Impartial World Model into our closed-loop RL framework, where it serves as an internal critic. During refinement, the agent queries the critic to "dream" of the outcomes for candidate actions. We demonstrate through extensive experiments, including on a new Risk Foreseeing Benchmark, that our model significantly outperforms baselines in predicting failures. Consequently, when used as a critic, it enables a substantial reduction in safety violations in challenging simulations, proving that teaching a model to dream of danger is a critical step towards building truly safe and intelligent autonomous agents.
中文摘要 端到端自动驾驶模型有望直接从传感器数据学习复杂行为，但在安全和处理长尾事件方面面临关键挑战。强化学习（RL）为克服这些局限提供了一条有前景的道路，但在自动驾驶领域的成功却一直难以实现。我们发现阻碍这一进展的一个根本缺陷：用于强化学习的世界模型中根深蒂固的乐观偏见。为此，我们引入了一个基于公正世界模型的培训后政策细化框架。我们的主要贡献是教导这个模型对危险保持诚实。我们通过一种新颖的数据综合流程——反事实综合（Counterfactual Synthesis）实现了这一点，系统地生成了丰富的合理碰撞和越野事件课程。这使得模型从被动的场景完成者转变为真实的预测者，忠实于行为与结果之间的因果联系。然后我们将这个公正世界模型整合进我们的闭环强化学习框架中，作为内部批评者。在精炼过程中，代理询问批评者“梦想”候选行动的结果。通过包括在新的风险预见基准测试在内的广泛实验中，我们证明了我们的模型在预测失效方面远超基线。因此，作为批评者，它显著减少了复杂模拟中的安全违规，证明教模型梦见危险是构建真正安全且智能自主智能体的关键一步。

NNGPT: Rethinking AutoML with Large Language Models

NNGPT：重新思考大型语言模型中的自动机器学习

Authors: Roman Kochnev, Waleed Khalid, Tolgay Atinc Uzun, Xi Zhang, Yashkumar Sanjaybhai Dhameliya, Furui Qin, Chandini Vysyaraju, Raghuvir Duvvuri, Avi Goyal, Dmitry Ignatov, Radu Timofte
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/2511.20333
Pdf link: https://arxiv.org/pdf/2511.20333
Abstract Building self-improving AI systems remains a fundamental challenge in the AI domain. We present NNGPT, an open-source framework that turns a large language model (LLM) into a self-improving AutoML engine for neural network development, primarily for computer vision. Unlike previous frameworks, NNGPT extends the dataset of neural networks by generating new models, enabling continuous fine-tuning of LLMs based on a closed-loop system of generation, assessment, and self-improvement. It integrates five synergistic LLM-based pipelines within one unified workflow: zero-shot architecture synthesis, hyperparameter optimization (HPO), code-aware accuracy/early-stop prediction, retrieval-augmented synthesis of scope-closed PyTorch blocks (NN-RAG), and reinforcement learning. Built on the LEMUR dataset as an audited corpus with reproducible metrics, NNGPT emits and validates the network architecture, preprocessing code, and hyperparameters from a single prompt, executes them end-to-end, and learns from the results. The PyTorch adapter makes NNGPT framework-agnostic, enabling strong performance: NN-RAG achieves 73% executability on 1,289 targets, 3-shot prompting boosts accuracy on common datasets, and hash-based deduplication saves hundreds of runs. One-shot prediction matches search-based AutoML, reducing the need for numerous trials. HPO on LEMUR achieves RMSE 0.60, outperforming Optuna (0.64), while the code-aware predictor reaches RMSE 0.14 with Pearson r=0.78. The system has already generated over 5K validated models, proving NNGPT as an autonomous AutoML engine. Upon acceptance, the code, prompts, and checkpoints will be released for public access to enable reproducibility and facilitate community usage.
中文摘要 构建自我提升的人工智能系统仍然是人工智能领域的根本挑战。我们介绍NNGPT，一个开源框架，将大型语言模型（LLM）转化为一个自我改进的AutoML引擎，用于神经网络开发，主要用于计算机视觉。与以往框架不同，NNGPT通过生成新模型扩展了神经网络数据集，实现基于生成、评估和自我提升闭环系统的持续微调。它在一个统一的工作流程中集成了五条协同的基于大型语言模型的流水线：零样本架构综合、超参数优化（HPO）、代码感知准确性/提前停止预测、范围闭合PyTorch块的检索增强综合（NN-RAG）以及强化学习。NNGPT基于LEMUR数据集，作为一个经过审计的语料库，具有可重复的指标，NNGPT从单一提示发出，验证网络架构、预处理代码和超参数，端到端执行，并从结果中学习。PyTorch 适配器使 NNGPT 框架无关，实现了强大的性能：NN-RAG 在 1,289 个目标上实现了 73% 的可执行性，三次提示提升了常见数据集的准确性，基于哈希的重复删除则节省了数百次运行。一次性预测与基于搜索的自动机器学习匹配，减少了大量试验的需求。LEMUR的HPO达到RMSE 0.60，优于Optuna（0.64），而代码感知预测器在Pearson r=0.78下达到RMSE 0.14。该系统已生成超过5000个经过验证的模型，证明NNGPT是自主AutoML引擎。一旦被接受，代码、提示和检查点将向公众开放，以促进复现并促进社区使用。

Soft Adaptive Policy Optimization

软自适应策略优化

Authors: Chang Gao, Chujie Zheng, Xiong-Hui Chen, Kai Dang, Shixuan Liu, Bowen Yu, An Yang, Shuai Bai, Jingren Zhou, Junyang Lin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.20347
Pdf link: https://arxiv.org/pdf/2511.20347
Abstract Reinforcement learning (RL) plays an increasingly important role in enhancing the reasoning capabilities of large language models (LLMs), yet stable and performant policy optimization remains challenging. Token-level importance ratios often exhibit high variance, a phenomenon exacerbated in Mixture-of-Experts models, leading to unstable updates. Existing group-based policy optimization methods, such as GSPO and GRPO, alleviate this problem via hard clipping, making it difficult to maintain both stability and effective learning. We propose Soft Adaptive Policy Optimization (SAPO), which replaces hard clipping with a smooth, temperature-controlled gate that adaptively attenuates off-policy updates while preserving useful learning signals. Compared with GSPO and GRPO, SAPO is both sequence-coherent and token-adaptive. Like GSPO, SAPO maintains sequence-level coherence, but its soft gating forms a continuous trust region that avoids the brittle hard clipping band used in GSPO. When a sequence contains a few highly off-policy tokens, GSPO suppresses all gradients for that sequence, whereas SAPO selectively down-weights only the offending tokens and preserves the learning signal from the near-on-policy ones, improving sample efficiency. Relative to GRPO, SAPO replaces hard token-level clipping with smooth, temperature-controlled scaling, enabling more informative and stable updates. Empirical results on mathematical reasoning benchmarks indicate that SAPO exhibits improved training stability and higher Pass@1 performance under comparable training budgets. Moreover, we employ SAPO to train the Qwen3-VL model series, demonstrating that SAPO yields consistent performance gains across diverse tasks and different model sizes. Overall, SAPO provides a more reliable, scalable, and effective optimization strategy for RL training of LLMs.
中文摘要 强化学习（RL）在提升大型语言模型（LLMs）推理能力方面扮演着越来越重要的角色，但稳定且高效的策略优化依然充满挑战。代币级重要性比率常表现出高方差——在专家混合模型中尤为严重——导致更新不稳定。现有的基于群体的策略优化方法，如GSPO和GRPO，通过硬剪裁缓解了这一问题，这使得保持稳定性和有效学习变得困难。我们提出了软自适应策略优化（SAPO），它用平滑、温控的门取代硬裁剪，自适应地衰减非策略更新，同时保留有用的学习信号。与GSPO和GRPO相比，SAPO既具序列一致性，也具有令牌自适应性。与GSPO类似，SAPO保持序列级相干性，但其软门控形成了一个连续的信任区，避免了GSPO中使用的脆弱硬剪断带。当序列包含少数高度偏离策略的标记时，GSPO会抑制该序列的所有梯度，而SAPO则选择性地仅降低有问题的标记权重，并保留接近策略标记的学习信号，从而提高样本效率。相较于 GRPO，SAPO 用平滑、温控的缩放替代硬令牌级裁剪，使更新更加有信息量且稳定。数学推理基准的实证结果表明，在可比的训练预算下，SAPO表现出更高的训练稳定性和更高的Pass@1性能。此外，我们利用SAPO训练Qwen3-VL模型系列，证明SAPO在不同任务和模型规模下都能持续提升性能。总体而言，SAPO为LLM的强化学习训练提供了更可靠、可扩展且高效的优化策略。
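
To make the contrast between hard clipping and a smooth, temperature-controlled gate concrete, here is a small sketch of the per-token gradient weight under each scheme. The hard-clip rule follows the standard PPO-style clipped objective. The sigmoid-shaped soft gate, the temperature value, and the function names are assumptions chosen for illustration, since the abstract does not state SAPO's exact formula.

```python
import math

def hard_clip_weight(ratio, advantage, eps=0.2):
    """Gradient weight under PPO-style hard clipping: 1 inside the band, 0 once clipped."""
    if advantage >= 0 and ratio > 1 + eps:
        return 0.0
    if advantage < 0 and ratio < 1 - eps:
        return 0.0
    return 1.0

def soft_gate_weight(ratio, tau=5.0):
    """Illustrative smooth gate (assumed form): 1 at ratio == 1, decaying toward 0."""
    s = 1.0 / (1.0 + math.exp(-tau * (ratio - 1.0)))
    return 4.0 * s * (1.0 - s)

# Compare how each scheme treats increasingly off-policy tokens (positive advantage).
for r in (1.0, 1.1, 1.3, 1.5, 2.0):
    print(f"ratio={r:.1f} hard={hard_clip_weight(r, 1.0):.2f} soft={soft_gate_weight(r):.2f}")
```

With the hard rule, a token just outside the band contributes nothing, while the smooth gate still passes a reduced signal that shrinks as the ratio drifts further from 1. A larger `tau` narrows the effective trust region.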

Complexity Reduction Study Based on RD Costs Approximation for VVC Intra Partitioning

基于 RD 成本近似的 VVC 内部分区复杂性降低研究

Authors: M.E.A. Kherchouche, F. Galpin, T. Dumas, F. Schnitzler, D. Menard, L. Zhang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.20349
Pdf link: https://arxiv.org/pdf/2511.20349
Abstract In this paper, a complexity study is conducted for Versatile Video Coding (VVC) intra partitioning to accelerate the exhaustive search involved in the Rate-Distortion Optimization (RDO) process. To address this problem, two main machine learning techniques are proposed and compared. Unlike existing methods, the proposed approaches are size-independent and incorporate the Rate-Distortion (RD) costs of neighboring blocks as input features. The first method is a regression-based technique that predicts the normalized RD costs of a given Coding Unit (CU). As partitioning possesses the Markov property, the associated decision-making problem can be modeled as a Markov Decision Process (MDP) and solved by Reinforcement Learning (RL). The second approach is an RL agent trained with the Deep Q-Network (DQN) algorithm on trajectories of CU decisions across two depths. Pre-determined thresholds are then applied in both methods to select a suitable split for the current CU.
中文摘要 本文对多功能视频编码器（VVC）内部分区进行了复杂性研究，以加速速率失真优化（RDO）过程中的穷尽搜索。为解决这一问题，提出了两种主要的机器学习技术并进行了比较。与现有方法不同，所提方法与大小无关，并将邻近块的速率失真（RD）成本作为输入特征。第一种方法是基于回归的技术，用于预测给定编码单元（CU）的归一化RD成本。由于划分具有马尔可夫性质，相关的决策问题可以建模为马尔可夫决策过程（MDP），并通过强化学习（RL）求解。第二种方法是通过深度Q网络（DQN）算法从CU决策轨迹中学习的强化学习代理。然后为两种方法应用预设阈值，以选择当前CU的合适分割。
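
The final thresholding step can be sketched as follows: given predicted normalized RD costs per split mode, keep only the modes within a margin of the best prediction, so the encoder evaluates fewer candidates. The specific rule, the threshold value, and the example costs are assumptions for illustration; the abstract only says that pre-determined thresholds are applied.

```python
def select_splits(predicted_costs, threshold=0.1):
    """Keep split modes whose predicted normalized RD cost is within `threshold` of the best."""
    best = min(predicted_costs.values())
    kept = [mode for mode, cost in predicted_costs.items() if cost - best <= threshold]
    # Order candidates from most to least promising.
    return sorted(kept, key=lambda mode: predicted_costs[mode])

# Hypothetical predictions for one CU (quadtree, binary, and ternary splits).
predicted = {
    "NO_SPLIT": 0.90,
    "QT": 0.35,
    "BT_H": 0.40,
    "BT_V": 0.70,
    "TT_H": 0.95,
    "TT_V": 0.50,
}
candidates = select_splits(predicted, threshold=0.1)
print(candidates)
```

A smaller threshold prunes more modes and saves more encoding time, at the risk of discarding the split that exhaustive RDO would have chosen.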

BRIC: Bridging Kinematic Plans and Physical Control at Test Time

BRIC：测试时运动计划与物理控制的桥接

Authors: Dohun Lim, Minji Kim, Jaewoon Lim, Sungchan Kim
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.20431
Pdf link: https://arxiv.org/pdf/2511.20431
Abstract We propose BRIC, a novel test-time adaptation (TTA) framework that enables long-term human motion generation by resolving execution discrepancies between diffusion-based kinematic motion planners and reinforcement learning-based physics controllers. While diffusion models can generate diverse and expressive motions conditioned on text and scene context, they often produce physically implausible outputs, leading to execution drift during simulation. To address this, BRIC dynamically adapts the physics controller to noisy motion plans at test time, while preserving pre-trained skills via a loss function that mitigates catastrophic forgetting. In addition, BRIC introduces a lightweight test-time guidance mechanism that steers the diffusion model in the signal space without updating its parameters. By combining both adaptation strategies, BRIC ensures consistent and physically plausible long-term executions across diverse environments in an effective and efficient manner. We validate the effectiveness of BRIC on a variety of long-term tasks, including motion composition, obstacle avoidance, and human-scene interaction, achieving state-of-the-art performance across all tasks.
中文摘要 我们提出了BRIC，一种新型测试时间适配（TTA）框架，通过解决基于扩散的运动学运动规划器与基于强化学习的物理控制器之间的执行差异，实现长期人类运动生成。虽然扩散模型可以根据文本和场景上下文生成多样且富有表现力的运动，但它们常常产生物理上不合理的输出，导致仿真过程中执行漂移。为此，BRIC在测试时动态调整物理控制器以适应噪声运动计划，同时通过损失函数保留预训练技能，减少灾难性遗忘。此外，BRIC引入了一种轻量级测试时间引导机制，可在不更新参数的情况下引导信号空间中的扩散模型。通过结合这两种适应策略，BRIC确保在多样环境中以有效高效的方式实现一致且物理上合理的长期执行。我们验证了BRIC在多种长期任务上的有效性，包括运动合成、障碍物规避和人机场景交互，实现了所有任务的先进性能。

DRAFT-RL: Multi-Agent Chain-of-Draft Reasoning for Reinforcement Learning-Enhanced LLMs

DRAFT-RL：强化学习增强型大型语言模型的多智能体草稿链推理

Authors: Yuanhao Li, Mingshan Liu, Hongbo Wang, Yiding Zhang, Yifei Ma, Wei Tan
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.20468
Pdf link: https://arxiv.org/pdf/2511.20468
Abstract Large Language Models (LLMs) have shown impressive capabilities in multi-step reasoning and [...]. [...] works introduce multi-agent reflection frameworks where multiple LLM agents critique and refine each other's outputs using reinforcement learning (RL). However, these approaches often rely on single-shot responses and lack structural diversity in reasoning exploration. In this paper, we propose DRAFT-RL, a novel framework that integrates Chain-of-Draft (CoD) reasoning into multi-agent RL training. Instead of generating single responses, each agent produces multiple drafts per query, which are then evaluated by peer agents and a learned reward model to identify the most promising trajectory. These selected drafts are used to refine future reasoning strategies through actor-critic [...]. DRAFT-RL enables explicit multi-path exploration, peer-guided reflection, and reward-aligned selection, resulting in more robust and interpretable LLM agent behavior. We evaluate our method on complex reasoning tasks including code synthesis, symbolic math, and knowledge-intensive QA, demonstrating that DRAFT-RL outperforms existing reflective and RL-based agents by significant margins in both accuracy and convergence speed.
中文摘要 大型语言模型（LLM）在多步推理方面展现了令人印象深刻的能力，而该http URL作品引入了多智能体反射框架，多个LLM智能体通过强化学习（RL）相互批判和完善对方的输出。然而，这些方法通常依赖单次回应，且缺乏推理探索的结构多样性。本文提出了DRAFT-RL，一种将连载（CoD）推理整合进多智能体强化学习训练的新框架。每个代理不生成单一回应，而是为每个查询生成多个草稿，随后由同伴代理和学习的奖励模型评估，以确定最有前景的路径。这些选定的草稿用于通过actor-critic优化未来推理策略。http URL-RL支持显式多路径探索、同伴引导反思和奖励对齐选择，从而实现更稳健且可解释的LLM代理行为。我们评估了该方法在包括代码合成、符号数学和知识密集型质量保证在内的复杂推理任务上的应用，证明DRAFT-RL在准确性和收敛速度方面远远优于现有的反射型和基于强化学习的智能体
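
The draft-selection step, in which several drafts per query are scored by peer agents and a learned reward model, can be sketched as below. The equal-weight blend of peer and reward-model scores, the field names, and the example numbers are all assumptions for illustration; the abstract does not say how the two signals are combined.

```python
def draft_score(draft, alpha=0.5):
    """Blend the mean peer score with the reward-model score (assumed combination rule)."""
    peer_mean = sum(draft["peer_scores"]) / len(draft["peer_scores"])
    return alpha * peer_mean + (1.0 - alpha) * draft["reward"]

def select_best_draft(drafts, alpha=0.5):
    """Return the draft with the highest blended score."""
    return max(drafts, key=lambda d: draft_score(d, alpha))

# Hypothetical drafts produced by one agent for a single query.
drafts = [
    {"text": "draft A", "peer_scores": [0.6, 0.7], "reward": 0.5},
    {"text": "draft B", "peer_scores": [0.8, 0.9], "reward": 0.7},
    {"text": "draft C", "peer_scores": [0.9, 0.9], "reward": 0.2},
]
best = select_best_draft(drafts)
print(best["text"], round(draft_score(best), 3))
```

Draft C is rated highest by peers but lowest by the reward model, so the blended score prefers draft B. In the framework described, the selected draft would then feed the policy update.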

Flash-DMD: Towards High-Fidelity Few-Step Image Generation with Efficient Distillation and Joint Reinforcement Learning

Flash-DMD：迈向高保真少步图像生成，兼具高效蒸馏和关节强化学习

Authors: Guanjie Chen, Shirui Huang, Kai Liu, Jianchen Zhu, Xiaoye Qu, Peng Chen, Yu Cheng, Yifu Sun
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.20549
Pdf link: https://arxiv.org/pdf/2511.20549
Abstract Diffusion Models have emerged as a leading class of generative models, yet their iterative sampling process remains computationally expensive. Timestep distillation is a promising technique to accelerate generation, but it often requires extensive training and leads to image quality degradation. Furthermore, fine-tuning these distilled models for specific objectives, such as aesthetic appeal or user preference, using Reinforcement Learning (RL) is notoriously unstable and easily falls into reward hacking. In this work, we introduce Flash-DMD, a novel framework that enables fast convergence with distillation and joint RL-based refinement. Specifically, we first propose an efficient timestep-aware distillation strategy that significantly reduces training cost with enhanced realism, outperforming DMD2 with only 2.1% of its training cost. Second, we introduce a joint training scheme where the model is fine-tuned with an RL objective while the timestep distillation training continues simultaneously. We demonstrate that the stable, well-defined loss from the ongoing distillation acts as a powerful regularizer, effectively stabilizing the RL training process and preventing policy collapse. Extensive experiments on score-based and flow matching models show that our proposed Flash-DMD not only converges significantly faster but also achieves state-of-the-art generation quality in the few-step sampling regime, outperforming existing methods in visual quality, human preference, and text-image alignment metrics. Our work presents an effective paradigm for training efficient, high-fidelity, and stable generative models. Code is coming soon.
中文摘要 扩散模型已成为生成模型的领先类别，但其迭代采样过程仍然计算成本高昂。时间步蒸馏是一种有前景的加速生成技术，但通常需要大量训练，并导致图像质量下降。此外，利用强化学习（RL）对这些蒸馏模型进行微调以满足特定目标，如美观或用户偏好，极不稳定，容易陷入奖励黑客的陷阱。在本研究中，我们介绍了Flash-DMD，这一新颖框架通过蒸馏与基于强化学习的联合优化实现快速收敛。具体来说，我们首先提出了一种高效的时间步感知蒸馏策略，在增强真实度的同时显著降低训练成本，仅用DMD2 2.1%的训练成本即可超越它。其次，我们引入一种联合训练方案，在同步进行时间步蒸馏训练的同时，模型通过强化学习目标进行微调。我们证明，持续蒸馏中稳定且明确的损失能作为强有力的正则化器，有效稳定强化学习训练过程，防止策略崩溃。基于分数模型和流匹配模型的广泛实验表明，我们提出的Flash-DMD不仅收敛速度显著加快，还在少步采样模式下实现了最先进的生成质量，在视觉质量、人类偏好和文本-图像对齐指标方面均优于现有方法。我们的工作为训练高效、高保真且稳定的生成模型提供了一种有效范式。代码即将发布。

Attention Trajectories as a Diagnostic Axis for Deep Reinforcement Learning

注意力轨迹作为深度强化学习的诊断轴

Authors: Charlotte Beylier, Hannah Selder, Arthur Fleig, Simon M. Hofmann, Nico Scherf
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.20591
Pdf link: https://arxiv.org/pdf/2511.20591
Abstract The learning process of a reinforcement learning (RL) agent remains poorly understood beyond the mathematical formulation of its learning algorithm. To address this gap, we introduce attention-oriented metrics (ATOMs) to investigate the development of an RL agent's attention during training. In a controlled experiment, we tested ATOMs on three variations of a Pong game, each designed to teach the agent distinct behaviours, complemented by a behavioural assessment. ATOMs successfully delineate the attention patterns of an agent trained on each game variation, and these differences in attention patterns translate into differences in the agent's behaviour. Through continuous monitoring of ATOMs during training, we observed that the agent's attention developed in phases, and that these phases were consistent across game variations. Overall, we believe that ATOMs could help improve our understanding of the learning processes of RL agents and of the relationship between attention and learning.
中文摘要 强化学习（RL）代理的学习过程在其学习算法的数学表述之外仍鲜有了解。为弥补这一空白，我们引入了注意力导向指标（ATOMs），以研究强化学习者在训练期间注意力的发展。在一项受控实验中，我们在三种乒乓球游戏的变体上测试了ATOMs，每种变体都设计用来教导代理人不同的行为，并辅以行为评估。ATOMs成功描绘了针对每种博弈变体训练的智能体的注意力模式，并指出这些注意力模式的差异转化为智能体行为的差异。通过训练期间对ATOMs的持续监测，我们观察到智能体的注意力分阶段发展，且这些阶段在不同游戏变体中保持一致。总体而言，我们认为ATOM有助于我们更好地理解强化学习代理的学习过程，并更好地理解注意力与学习之间的关系。

MapReduce LoRA: Advancing the Pareto Front in Multi-Preference Optimization for Generative Models

MapReduce LoRA：在生成模型多优先优化中推进帕累托前沿

Authors: Chieh-Yun Chen, Zhonghao Wang, Qi Chen, Zhifan Ye, Min Shi, Yue Zhao, Yinan Zhao, Hui Qu, Wei-An Lin, Yiru Shen, Ajinkya Kale, Irfan Essa, Humphrey Shi
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.20629
Pdf link: https://arxiv.org/pdf/2511.20629
Abstract Reinforcement learning from human feedback (RLHF) with reward models has advanced alignment of generative models to human aesthetic and perceptual preferences. However, jointly optimizing multiple rewards often incurs an alignment tax, improving one dimension while degrading others. To address this, we introduce two complementary methods: MapReduce LoRA and Reward-aware Token Embedding (RaTE). MapReduce LoRA trains preference-specific LoRA experts in parallel and iteratively merges them to refine a shared base model; RaTE learns reward-specific token embeddings that compose at inference for flexible preference control. Experiments on Text-to-Image generation (Stable Diffusion 3.5 Medium and FLUX.1-dev) show improvements of 36.1%, 4.6%, and 55.7%, and 32.7%, 4.3%, and 67.1% on GenEval, PickScore, and OCR, respectively. On Text-to-Video generation (HunyuanVideo), visual and motion quality improve by 48.1% and 90.0%, respectively. On the language task, Helpful Assistant, with Llama-2 7B, helpful and harmless improve by 43.4% and 136.7%, respectively. Our framework sets a new state-of-the-art multi-preference alignment recipe across modalities.
中文摘要 基于人类反馈的强化学习（RLHF）与奖励模型推动了生成模型与人类审美和感知偏好的对齐。然而，联合优化多个奖励通常会产生对齐税，提升一个维度而贬低其他维度。为此，我们引入了两种互补方法：MapReduce LoRA和奖励感知令牌嵌入（RaTE）。MapReduce LoRA 并行训练针对偏好的 LoRA 专家，并迭代合并以精炼共享基础模型;RaTE学习在推理时组合的奖励特定代币嵌入，实现灵活的偏好控制。文本生成实验（稳定扩散3.5 Medium和FLUX.1-开发）分别显示GenEval、PickScore和OCR分别提升了36.1%、4.6%和55.7%，以及32.7%、4.3%和67.1%。在文字转视频生成（HunyuanVideo）方面，视觉质量和动态质量分别提升了48.1%和90.0%。在语言任务“助人助手”中，Llama-2 7B的“助益”和“无害”分别提升了43.4%和136.7%。我们的框架设定了跨模式的先进多重偏好对齐配方。
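
The "train experts in parallel, then merge" pattern described for MapReduce LoRA can be sketched with plain matrices: each expert contributes a low-rank update B·A, and a reduce step folds the updates into the shared base weights. Merging by simple averaging, the tiny rank-1 factors, and the function names are assumptions for illustration; the abstract does not specify the merge rule.

```python
def matmul(left, right):
    """Multiply two matrices given as lists of rows."""
    return [[sum(l * r for l, r in zip(row, col)) for col in zip(*right)] for row in left]

def lora_delta(b_matrix, a_matrix, scale=1.0):
    """Low-rank weight update scale * (B @ A) for one preference expert."""
    return [[scale * value for value in row] for row in matmul(b_matrix, a_matrix)]

def merge_experts(base, deltas):
    """Reduce step (assumed rule): add the average of the expert updates to the base."""
    count = len(deltas)
    return [
        [base[i][j] + sum(d[i][j] for d in deltas) / count for j in range(len(base[0]))]
        for i in range(len(base))
    ]

# "Map": two hypothetical rank-1 experts, each tuned for a different reward.
base = [[0.0, 0.0], [0.0, 0.0]]
expert_a = lora_delta([[1.0], [0.0]], [[2.0, 0.0]])
expert_b = lora_delta([[0.0], [1.0]], [[0.0, 4.0]])

# "Reduce": fold both updates into the shared base; repeat for further rounds.
merged = merge_experts(base, [expert_a, expert_b])
print(merged)
```

In an iterative scheme, `merged` would serve as the base for the next round of expert training, so each preference is refined against a model that already reflects the others.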

Reinforcing Action Policies by Prophesying

通过预言强化行动政策

Authors: Jiahui Zhang, Ze Huang, Chun Gu, Zipei Ma, Li Zhang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.20633
Pdf link: https://arxiv.org/pdf/2511.20633
Abstract Vision-Language-Action (VLA) policies excel in aligning language, perception, and robot control. However, most VLAs are trained purely by imitation, which overfits to demonstrations, and is brittle under distribution shift. Reinforcement learning (RL) directly optimizes task reward and thus addresses this misalignment, but real-robot interaction is expensive and conventional simulators are hard to engineer and transfer. We address both data efficiency and optimization stability in VLA post-training via a learned world model and an RL procedure tailored to flow-based action heads. Specifically, we introduce Prophet, a unified action-to-video robot actuation pretrained across large-scale, heterogeneous robot data to learn reusable action-outcome dynamics. It is able to few-shot adapt to new robots, objects, and environments, yielding a rollout-ready simulator. Upon Prophet, we reinforce action policies with Flow-action-GRPO (FA-GRPO), which adapts Flow-GRPO to operate on VLA actions, and with FlowScale, a stepwise reweighting that rescales per-step gradients in the flow head. Together, Prophet, FA-GRPO, and FlowScale constitute ProphRL, a practical, data- and compute-efficient path to VLA post-training. Experiments show 5-17% success gains on public benchmarks and 24-30% gains on real robots across different VLA variants.
中文摘要 视觉-语言-行动（VLA）政策在语言、感知和机器人控制的一致性方面表现出色。然而，大多数VLA纯粹通过仿制训练，这种模拟对演示过拟合，且在分布偏移下具有脆性。强化学习（RL）直接优化任务奖励，从而解决了这种错位，但真实机器人交互成本高昂，传统模拟器难以设计和迁移。我们通过学习世界模型和针对基于流程的动作head量身定制的强化学习过程，同时解决了VLA训练后的数据效率和优化稳定性。具体来说，我们介绍了Prophet，一种统一的动作到视频机器人驱动系统，能够在大规模、异构的机器人数据中预训练，学习可重复使用的动作-结果动态。它能够对新的机器人、物体和环境进行少量适应，成为一款准备好推出的模拟器。在Prophet上，我们用Flow-action-GRPO（FA-GRPO）来强化动作策略，该工具将Flow-GRPO调整用于VLA动作，以及FlowScale，一种逐步重加权，重新调整流头中每步梯度。Prophet、FA-GRPO 和 FlowScale 共同构成了 ProphRL，这是一条实用、数据和计算高效的 VLA 后训练路径。实验显示，在公开基准测试中成功率提升5-17%，在真实机器人不同VLA变体中提升24-30%。

RubricRL: Simple Generalizable Rewards for Text-to-Image Generation

评分标准RL：文本转图像生成的简单可推广奖励

Authors: Xuelu Feng, Yunsheng Li, Ziyu Wan, Zixuan Gao, Junsong Yuan, Dongdong Chen, Chunming Qiao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.20651
Pdf link: https://arxiv.org/pdf/2511.20651
Abstract Reinforcement learning (RL) has recently emerged as a promising approach for aligning text-to-image generative models with human preferences. A key challenge, however, lies in designing effective and interpretable rewards. Existing methods often rely on either composite metrics (e.g., CLIP, OCR, and realism scores) with fixed weights or a single scalar reward distilled from human preference models, which can limit interpretability and flexibility. We propose RubricRL, a simple and general framework for rubric-based reward design that offers greater interpretability, composability, and user control. Instead of using a black-box scalar signal, RubricRL dynamically constructs a structured rubric for each prompt: a decomposable checklist of fine-grained visual criteria, such as object correctness, attribute accuracy, OCR fidelity, and realism, tailored to the input text. Each criterion is independently evaluated by a multimodal judge (e.g., o4-mini), and a prompt-adaptive weighting mechanism emphasizes the most relevant dimensions. This design not only produces interpretable and modular supervision signals for policy optimization (e.g., GRPO or PPO), but also enables users to directly adjust which aspects to reward or penalize. Experiments with an autoregressive text-to-image model demonstrate that RubricRL improves prompt faithfulness, visual detail, and generalizability, while offering a flexible and extensible foundation for interpretable RL alignment across text-to-image architectures.
中文摘要 强化学习（RL）最近被认为是一种有前景的方法，用于将文本生成模型与人类偏好对齐。然而，一个关键挑战在于设计有效且可理解的奖励。现有方法通常依赖于带有固定权重的复合指标（如CLIP、OCR和现实性评分），或是从人类偏好模型中提炼出的单一标量奖励，这可能限制了可解释性和灵活性。我们提出了RubricRL，这是一个简单通用的基于评分标准的奖励设计框架，提供更高的解释性、组合性和用户控制。RubricRL不使用黑框标量信号，而是动态构建一个结构化的标尺，为每个提示——一份可分解的细粒度视觉标准清单，如对象正确性、属性准确性、OCR准确度和真实性——并根据输入文本量身定制。每个标准由多模态评判（如o4-mini）独立评估，并采用提示自适应权重机制强调最相关的维度。该设计不仅产生可解释且模块化的监管信号以优化策略（如GRPO或PPO），还允许用户直接调整应奖励或惩罚的方面。自回归文本转图像模型的实验表明，RubricRL提升了提示的忠实度、视觉细节和泛化性，同时为文本到图像架构中可解释的强化学习提供了灵活且可扩展的基础。
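
A rubric-based reward of the kind described can be sketched as a weighted average of per-criterion judge scores. The judge is stubbed out with fixed scores here, and the criterion names, weights, and normalization are assumptions for illustration rather than the paper's actual rubric or weighting mechanism.

```python
def rubric_reward(scores, weights):
    """Weighted average of per-criterion scores in [0, 1]; weights are prompt-specific."""
    total_weight = sum(weights[name] for name in scores)
    weighted_sum = sum(weights[name] * score for name, score in scores.items())
    return weighted_sum / total_weight

# Hypothetical judge outputs for one generated image (1.0 = criterion fully met).
scores = {
    "object_correctness": 1.0,
    "attribute_accuracy": 0.5,
    "ocr_fidelity": 0.0,
    "realism": 1.0,
}
# Hypothetical prompt-adaptive weights: this prompt is mostly about objects.
weights = {
    "object_correctness": 2.0,
    "attribute_accuracy": 1.0,
    "ocr_fidelity": 0.5,
    "realism": 0.5,
}
reward = rubric_reward(scores, weights)
print(reward)
```

Because each criterion stays visible, a user can raise or lower one weight (for example `ocr_fidelity` for a text-heavy prompt) and see exactly how the reward shifts.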

Keyword: diffusion policy

There is no result