Arxiv Papers of Today

Generated: 2025-11-20 16:31:02 (UTC+8); Arxiv release time: 2025-11-20 20:00 EST (2025-11-21 09:00 UTC+8)

33 related papers today

Keyword: reinforcement learning

Causally-Informed Reinforcement Learning for Adaptive Emotion-Aware Social Media Recommendation

因果知情强化学习用于适应性情绪感知社交媒体推荐

Authors: Bhavika Jain, Robert Pitsko, Ananya Drishti, Mahfuza Farooque
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.14768
Pdf link: https://arxiv.org/pdf/2511.14768
Abstract Social media recommendation systems play a central role in shaping users' emotional experiences. However, most systems are optimized solely for engagement metrics, such as click rate, viewing time, or scrolling, without accounting for users' emotional states. Repeated exposure to emotionally charged content has been shown to negatively affect users' emotional well-being over time. We propose an Emotion-aware Social Media Recommendation (ESMR) framework that personalizes content based on users' evolving emotional trajectories. ESMR integrates a Transformer-based emotion predictor with a hybrid recommendation policy: a LightGBM model for engagement during stable periods and a reinforcement learning agent with causally informed rewards when negative emotional states persist. Through behaviorally grounded evaluation over 30-day interaction traces, ESMR demonstrates improved emotional recovery, reduced volatility, and strong engagement retention. ESMR offers a path toward emotionally aware recommendations without compromising engagement performance.
中文摘要 社交媒体推荐系统在塑造用户的情感体验中起着核心作用。然而，大多数系统仅针对用户参与度指标进行优化，如点击率、观看时间或滚动次数，而未考虑用户的情绪状态。反复接触情绪化内容已被证明会随着时间推移对用户的情绪健康产生负面影响。我们提出了一种情绪感知社交媒体推荐（ESMR）框架，根据用户不断变化的情绪轨迹个性化内容。ESMR结合了基于Transformer的情绪预测器与混合推荐策略：在稳定期内以LightGBM模型参与，以及在负面情绪状态持续时给予因果知情奖励的强化学习代理。通过基于行为的30天互动记录评估，ESMR显示出情绪恢复效果、波动性降低和强烈的参与度保持。ESMR为实现情感敏感的建议提供了一条路径，同时不影响互动表现。
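
The hybrid policy described in the abstract (an engagement model during stable periods, an RL agent when negative emotional states persist) can be pictured as a simple switch. The following is a minimal sketch, not the authors' implementation: the function name, the score scale, the threshold, and the persistence window are all hypothetical.

```python
from typing import Sequence


def choose_policy(
    emotion_scores: Sequence[float],
    negative_threshold: float = -0.2,  # hypothetical cutoff for a "negative" state
    persistence: int = 3,              # hypothetical number of consecutive steps
) -> str:
    """Return "rl" when the most recent predicted emotion scores are all
    negative, otherwise "engagement"."""
    recent = list(emotion_scores)[-persistence:]
    persistent_negative = len(recent) == persistence and all(
        score < negative_threshold for score in recent
    )
    return "rl" if persistent_negative else "engagement"
```

In this sketch a single non-negative score resets the switch, which is only one of several plausible ways to define persistence.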

Learning Interestingness in Automated Mathematical Theory Formation

学习自动化数学理论形成中的趣味性

Authors: George Tsoukalas, Rahul Saha, Amitayush Thakur, Sabrina Reguyal, Swarat Chaudhuri
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.14778
Pdf link: https://arxiv.org/pdf/2511.14778
Abstract We take two key steps in automating the open-ended discovery of new mathematical theories, a grand challenge in artificial intelligence. First, we introduce $\emph{FERMAT}$, a reinforcement learning (RL) environment that models concept discovery and theorem-proving using a set of symbolic actions, opening up a range of RL problems relevant to theory discovery. Second, we explore a specific problem through $\emph{FERMAT}$: automatically scoring the $\emph{interestingness}$ of mathematical objects. We investigate evolutionary algorithms for synthesizing nontrivial interestingness measures. In particular, we introduce an LLM-based evolutionary algorithm that features function abstraction, leading to notable improvements in discovering elementary number theory and finite fields over hard-coded baselines. We open-source the $\emph{FERMAT}$ environment at this https URL.
中文摘要 我们在自动化新数学理论的开放式发现过程中采取了两个关键步骤，这在人工智能领域是一项重大挑战。首先，我们介绍了$\emph{FERMAT}$，这是一个强化学习（RL）环境，通过一组符号动作模拟概念发现和定理证明，开启了与理论发现相关的一系列强化学习问题。其次，我们通过 $\emph{FERMAT}$ 探索一个具体问题：自动对数学对象的 $\emph{有趣度}$ 进行评分。我们研究用于合成非平凡有趣度量的进化算法。特别是，我们引入了基于LLM的进化算法，具有函数抽象特性，显著提升了在硬编码基线上发现初等数论和有限域的能力。我们将 $\emph{FERMAT}$ 环境开源于此 URL（此 https URL）。

Empowering Multi-Turn Tool-Integrated Reasoning with Group Turn Policy Optimization

通过组回合策略优化赋能多回合工具集成推理

Authors: Yifeng Ding, Hung Le, Songyang Han, Kangrui Ruan, Zhenghui Jin, Varun Kumar, Zijian Wang, Anoop Deoras
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.14846
Pdf link: https://arxiv.org/pdf/2511.14846
Abstract Training Large Language Models (LLMs) for multi-turn Tool-Integrated Reasoning (TIR) - where models iteratively reason, generate code, and verify through execution - remains challenging for existing reinforcement learning (RL) approaches. Current RL methods, exemplified by Group Relative Policy Optimization (GRPO), suffer from coarse-grained, trajectory-level rewards that provide insufficient learning signals for complex multi-turn interactions, leading to training stagnation. To address this issue, we propose Group Turn Policy Optimization (GTPO), a novel RL algorithm specifically designed for training LLMs on multi-turn TIR tasks. GTPO introduces three key innovations: (1) turn-level reward assignment that provides fine-grained feedback for individual turns, (2) return-based advantage estimation where normalized discounted returns are calculated as advantages, and (3) self-supervised reward shaping that exploits self-supervision signals from generated code to densify sparse binary outcome-based rewards. Our comprehensive evaluation demonstrates that GTPO outperforms GRPO by 3.0% on average across diverse reasoning benchmarks, establishing its effectiveness for advancing complex mathematical reasoning in the real world.
中文摘要 为多回合工具集成推理（TIR）训练大型语言模型（LLMs）——即模型通过迭代推理、生成代码并通过执行验证——对于现有强化学习（RL）方法来说仍是一项挑战。当前的强化学习方法，以群体相对策略优化（Group Relative Policy Optimization，GRPO）为例，存在粗粒度的轨迹级奖励，导致复杂的多回合交互缺乏足够的学习信号，导致训练停滞。为解决这一问题，我们提出了组回合策略优化（GTPO）算法，这是一种专门用于多回合TIR任务训练LLM的新型强化学习算法。GTPO引入了三项关键创新：（1）回合级奖励分配，为单个回合提供细粒度反馈;（2）基于收益的优势估计，将归一化后的贴现收益作为优势计算;（3）利用生成代码中的自我监督信号，使稀疏的二元结果奖励更密集。我们的综合评估表明，GTPO在多样推理基准中平均优于GRPO3.0%，证明其在推动复杂数学推理在现实世界中的有效性。
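
The second innovation listed in the abstract, return-based advantage estimation with normalized discounted returns, can be sketched as follows. This is an illustration under stated assumptions rather than the paper's algorithm: the abstract does not say how the normalization is pooled, so this sketch normalizes over every turn of every trajectory in the sampled group, and the function names and the default discount are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def discounted_returns(turn_rewards: List[float], gamma: float) -> List[float]:
    """Discounted return for each turn, accumulated from the last turn backwards."""
    returns: List[float] = []
    running = 0.0
    for reward in reversed(turn_rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def group_turn_advantages(
    group_rewards: List[List[float]], gamma: float = 0.9, eps: float = 1e-8
) -> List[List[float]]:
    """Normalize turn-level returns across a group of trajectories.

    Assumption: mean and standard deviation are pooled over all turns in the group.
    """
    all_returns = [discounted_returns(rewards, gamma) for rewards in group_rewards]
    flat = [g for trajectory in all_returns for g in trajectory]
    mu, sigma = mean(flat), pstdev(flat)
    return [[(g - mu) / (sigma + eps) for g in trajectory] for trajectory in all_returns]
```

With a sparse outcome reward on the final turn only, earlier turns still receive a discounted share of the credit, which is the fine-grained signal a single trajectory-level reward lacks.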

Transformer-Guided Deep Reinforcement Learning for Optimal Takeoff Trajectory Design of an eVTOL Drone

Transformer引导的深度强化学习，用于eVTOL无人机的最优起飞轨迹设计

Authors: Nathan M. Roberts II, Xiaosong Du
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.14887
Pdf link: https://arxiv.org/pdf/2511.14887
Abstract The rapid advancement of electric vertical take-off and landing (eVTOL) aircraft offers a promising opportunity to alleviate urban traffic congestion. Thus, developing optimal takeoff trajectories for minimum energy consumption becomes essential for broader eVTOL aircraft applications. Conventional optimal control methods (such as dynamic programming and linear quadratic regulator) provide highly efficient and well-established solutions but are limited by problem dimensionality and complexity. Deep reinforcement learning (DRL) emerges as a special type of artificial intelligence tackling complex, nonlinear systems; however, the training difficulty is a key bottleneck that limits DRL applications. To address these challenges, we propose the transformer-guided DRL to alleviate the training difficulty by exploring a realistic state space at each time step using a transformer. The proposed transformer-guided DRL was demonstrated on an optimal takeoff trajectory design of an eVTOL drone for minimal energy consumption while meeting takeoff conditions (i.e., minimum vertical displacement and minimum horizontal velocity) by varying control variables (i.e., power and wing angle to the vertical). Results showed that the transformer-guided DRL agent learned to take off in $4.57\times10^6$ time steps, representing 25% of the $19.79\times10^6$ time steps needed by a vanilla DRL agent. In addition, the transformer-guided DRL achieved 97.2% accuracy on the optimal energy consumption compared against the simulation-based optimal reference, while the vanilla DRL achieved 96.3% accuracy. Therefore, the proposed transformer-guided DRL outperformed vanilla DRL in both training efficiency and optimal design verification.
中文摘要 电动垂直起降（eVTOL）飞行器的快速发展为缓解城市交通拥堵提供了有前景的机会。因此，开发能耗最小的最优起飞轨迹对于eVTOL飞行器的更广泛应用至关重要。传统的最优控制方法（如动态规划和线性二次调节器）提供了高效且成熟的解决方案，但受限于问题的维度和复杂性。深度强化学习（DRL）作为一种特殊类型的人工智能出现，用于应对复杂非线性系统；然而，训练难度是限制DRL应用的关键瓶颈。为应对这些挑战，我们提出了Transformer引导的DRL，通过利用Transformer在每个时间步探索符合实际的状态空间，以缓解训练难度。所提出的Transformer引导DRL在eVTOL无人机的最优起飞轨迹设计上进行了演示：通过改变控制变量（即功率和机翼相对于垂直方向的角度），在满足起飞条件（即最小垂直位移和最小水平速度）的同时实现最小能耗。结果显示，Transformer引导的DRL智能体以$4.57\times10^6$个时间步学会起飞，占普通DRL智能体所需$19.79\times10^6$个时间步的25%。此外，与基于仿真的最优参考相比，Transformer引导的DRL在最优能耗方面达到了97.2%的准确率，而普通DRL达到了96.3%。因此，所提出的Transformer引导DRL在训练效率和最优设计验证方面都优于普通DRL。
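
The training-efficiency figure can be checked directly from the two step counts quoted in the abstract. The short calculation below uses only those two numbers; the variable names are illustrative. It suggests that the abstract's 25% is a rounded value, since the exact ratio is closer to 23%.

```python
# Step counts as stated in the abstract.
guided_steps = 4.57e6    # transformer-guided DRL
vanilla_steps = 19.79e6  # vanilla DRL

ratio = guided_steps / vanilla_steps    # fraction of the vanilla training budget
speedup = vanilla_steps / guided_steps  # how many times fewer steps were needed

print(f"{ratio:.1%} of the vanilla steps, a {speedup:.2f}x reduction")
```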

Skin-R1: Toward Trustworthy Clinical Reasoning for Dermatological Diagnosis

Skin-R1：迈向可信的皮肤科诊断临床推理

Authors: Zehao Liu, Wejieying Ren, Jipeng Zhang, Tianxiang Zhao, Jingxi Zhu, Xiaoting Li, Vasant G. Honavar
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.14900
Pdf link: https://arxiv.org/pdf/2511.14900
Abstract The emergence of vision-language models (VLMs) has opened new possibilities for clinical reasoning and has shown promising performance in dermatological diagnosis. However, their trustworthiness and clinical utility are often limited by three major factors: (1) Data heterogeneity, where diverse datasets lack consistent diagnostic labels and clinical concept annotations; (2) Absence of grounded diagnostic rationales, leading to a scarcity of reliable reasoning supervision; and (3) Limited scalability and generalization, as models trained on small, densely annotated datasets struggle to transfer nuanced reasoning to large, sparsely-annotated ones. To address these limitations, we propose SkinR1, a novel dermatological VLM that combines deep, textbook-based reasoning with the broad generalization capabilities of reinforcement learning (RL). SkinR1 systematically resolves the key challenges through a unified, end-to-end framework. First, we design a textbook-based reasoning generator that synthesizes high-fidelity, hierarchy-aware, and differential-diagnosis (DDx)-informed trajectories, providing reliable expert-level supervision. Second, we leverage the constructed trajectories for supervised fine-tuning (SFT) empowering the model with grounded reasoning ability. Third, we develop a novel RL paradigm that, by incorporating the hierarchical structure of diseases, effectively transfers these grounded reasoning patterns to large-scale, sparse data. Extensive experiments on multiple dermatology datasets demonstrate that SkinR1 achieves superior diagnostic accuracy. The ablation study demonstrates the importance of the reasoning foundation instilled by SFT.
中文摘要 视觉语言模型（VLM）的出现为临床推理开辟了新可能，并在皮肤科诊断中展现出有前景的表现。然而，其可信度和临床效用通常受限于三个主要因素：（1）数据异质性，即多样化数据集缺乏一致的诊断标签和临床概念注释;（2）缺乏扎实的诊断依据，导致可靠的推理监督稀缺;以及（3）有限的可扩展性和泛化性，因为在小型、注释密集的数据集上训练的模型难以将细致推理转移到大量且注释稀少的数据集中。为解决这些局限性，我们提出了SkinR1，一种结合了深度教材推理与强化学习（RL）广泛泛化能力的新型皮肤学VLM。SkinR1通过统一的端到端框架系统性地解决了关键挑战。首先，我们设计了一个基于教科书的推理生成器，综合高保真度、层级意识和鉴别诊断（DDx）知情的轨迹，提供可靠的专家级监督。其次，我们利用构建轨迹进行监督微调（SFT），赋予模型基础推理能力。第三，我们开发了一种新的强化学习范式，通过整合疾病的层级结构，有效地将这些扎根的推理模式转移到大规模、稀疏的数据中。对多个皮肤病学数据集的广泛实验表明，SkinR1在诊断准确度上具有更高的优势。消融研究展示了SFT所奠定的推理基础的重要性。

Z-Merge: Multi-Agent Reinforcement Learning for On-Ramp Merging with Zone-Specific V2X Traffic Information

Z-Merge：多智能体强化学习，用于匝道合并，结合特定区域的V2X交通信息

Authors: Yassine Ibork, Myounggyu Won, Lokesh Das
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.14910
Pdf link: https://arxiv.org/pdf/2511.14910
Abstract Ramp merging is a critical and challenging task for autonomous vehicles (AVs), particularly in mixed traffic environments with human-driven vehicles (HVs). Existing approaches typically rely on either lane-changing or inter-vehicle gap creation strategies based solely on local or neighboring information, often leading to suboptimal performance in terms of safety and traffic efficiency. In this paper, we present a V2X (vehicle-to-everything communication)-assisted Multiagent Reinforcement Learning (MARL) framework for on-ramp merging that effectively coordinates the complex interplay between lane-changing and inter-vehicle gap adaptation strategies by utilizing zone-specific global information available from a roadside unit (RSU). The merging control problem is formulated as a Multiagent Partially Observable Markov Decision Process (MA-POMDP), where agents leverage both local and global observations through V2X communication. To support both discrete and continuous control decisions, we design a hybrid action space and adopt a parameterized deep Q-learning approach. Extensive simulations, integrating the SUMO traffic simulator and the MOSAIC V2X simulator, demonstrate that our framework significantly improves merging success rate, traffic efficiency, and road safety across diverse traffic scenarios.
中文摘要 匝道合流是自动驾驶汽车（AV）中一项关键且具有挑战性的任务，尤其是在与人驾驶车辆（HV）混合交通环境中。现有方法通常依赖基于本地或邻近信息的变道或车间间隙制造策略，常导致安全和交通效率表现不佳。本文提出了一种V2X（车辆到一切通信）辅助多代理强化学习（MARL）框架，用于匝道并入，利用路边单元（RSU）提供的区域特定全球信息，有效协调变道与车间间隙适应策略之间的复杂相互作用。合并控制问题被表述为多智能体部分可观测马尔可夫决策过程（MA-POMDP），智能体通过V2X通信利用本地和全局观测数据。为了支持离散和连续控制决策，我们设计了一个混合动作空间，并采用参数化的深度Q学习方法。通过整合SUMO交通模拟器和MOSAIC V2X模拟器的广泛模拟，我们的框架显著提升了多种交通场景下的合并成功率、交通效率和道路安全。

Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation

康定斯基5.0：图像与视频生成的基础模型家族

Authors: Vladimir Arkhipkin, Vladimir Korviakov, Nikolai Gerasimenko, Denis Parkhomenko, Viacheslav Vasilev, Alexey Letunovskiy, Maria Kovaleva, Nikolai Vaulin, Ivan Kirillov, Lev Novitskiy, Denis Koposov, Nikita Kiselev, Alexander Varlamov, Dmitrii Mikhailov, Vladimir Polovnikov, Andrey Shutkin, Ilya Vasiliev, Julia Agafonova, Anastasiia Kargapoltseva, Anna Dmitrienko, Anastasia Maltseva, Anna Averchenkova, Olga Kim, Tatiana Nikulina, Denis Dimitrov
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.14993
Pdf link: https://arxiv.org/pdf/2511.14993
Abstract This report introduces Kandinsky 5.0, a family of state-of-the-art foundation models for high-resolution image and 10-second video synthesis. The framework comprises three core model line-ups: Kandinsky 5.0 Image Lite - a line-up of 6B parameter image generation models, Kandinsky 5.0 Video Lite - fast and lightweight 2B parameter text-to-video and image-to-video models, and Kandinsky 5.0 Video Pro - 19B parameter models that achieve superior video generation quality. We provide a comprehensive review of the data curation lifecycle - including collection, processing, filtering and clustering - for the multi-stage training pipeline that involves extensive pre-training and incorporates quality-enhancement techniques such as self-supervised fine-tuning (SFT) and reinforcement learning (RL)-based post-training. We also present novel architectural, training, and inference optimizations that enable Kandinsky 5.0 to achieve high generation speeds and state-of-the-art performance across various tasks, as demonstrated by human evaluation. As a large-scale, publicly available generative framework, Kandinsky 5.0 leverages the full potential of its pre-training and subsequent stages to be adapted for a wide range of generative applications. We hope that this report, together with the release of our open-source code and training checkpoints, will substantially advance the development and accessibility of high-quality generative models for the research community.
中文摘要 本报告介绍了Kandinsky 5.0，这是一系列用于高分辨率图像和10秒视频合成的先进基础模型。该框架包含三个核心型号系列：Kandinsky 5.0 Image Lite——6B参数图像生成模型系列;Kandinsky 5.0 Video Lite——快速且轻量级的2B参数文本到视频和图像到视频模型;以及Kandinsky 5.0 Video Pro——19B参数模型，实现了更优越的视频生成质量。我们全面回顾了数据整理生命周期——包括收集、处理、过滤和聚类——涵盖多阶段培训流程，涵盖大量预训练，并结合了基于自我监督的微调（SFT）和强化学习（RL）等质量提升技术。我们还提出了创新的架构、训练和推理优化，使Kandinsky 5.0能够在多种任务中实现高速生成和最先进的性能，这一点通过人工评估得到了验证。作为一个大规模、公开的生成式框架，Kandinsky 5.0充分发挥了其预训练及后续阶段的全部潜力，能够适应各种生成式应用。我们希望本报告连同我们开源代码和培训检查点的发布，能够显著推动高质量生成模型的开发和研究社区的可及性。

Task Specific Sharpness Aware O-RAN Resource Management using Multi Agent Reinforcement Learning

任务特定锐利度感知的O-RAN资源管理，采用多智能体强化学习

Authors: Fatemeh Lotfi, Hossein Rajoli, Fatemeh Afghah
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.15002
Pdf link: https://arxiv.org/pdf/2511.15002
Abstract Next-generation networks utilize the Open Radio Access Network (O-RAN) architecture to enable dynamic resource management, facilitated by the RAN Intelligent Controller (RIC). While deep reinforcement learning (DRL) models show promise in optimizing network resources, they often struggle with robustness and generalizability in dynamic environments. This paper introduces a novel resource management approach that enhances the Soft Actor Critic (SAC) algorithm with Sharpness-Aware Minimization (SAM) in a distributed Multi-Agent RL (MARL) framework. Our method introduces an adaptive and selective SAM mechanism, where regularization is explicitly driven by temporal-difference (TD)-error variance, ensuring that only agents facing high environmental complexity are regularized. This targeted strategy reduces unnecessary overhead, improves training stability, and enhances generalization without sacrificing learning efficiency. We further incorporate a dynamic $\rho$ scheduling scheme to refine the exploration-exploitation trade-off across agents. Experimental results show our method significantly outperforms conventional DRL approaches, yielding up to a $22\%$ improvement in resource allocation efficiency and ensuring superior QoS satisfaction across diverse O-RAN slices.
中文摘要 下一代网络采用开放无线接入网（O-RAN）架构，借助RAN智能控制器（RIC）实现动态资源管理。虽然深度强化学习（DRL）模型在优化网络资源方面展现出潜力，但在动态环境中常常面临鲁棒性和泛化性上的困难。本文介绍了一种新的资源管理方法，在分布式多智能体强化学习（MARL）框架中，用锐度感知最小化（SAM）增强软演员-评论家（SAC）算法。我们的方法引入了自适应且选择性的SAM机制，其中正则化由时间差分（TD）误差方差明确驱动，确保只有面对高环境复杂性的智能体被正则化。这种有针对性的策略减少了不必要的开销，提升了训练稳定性，并增强了泛化能力，同时不牺牲学习效率。我们还进一步采用动态的$\rho$调度方案，以优化各智能体间的探索与利用权衡。实验结果显示，我们的方法显著优于传统DRL方法，资源分配效率提升高达$22\%$，并确保在不同O-RAN切片中获得更优的QoS满意度。
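
The selective mechanism in the abstract, where only agents with high TD-error variance are regularized, can be sketched in two parts: a gate on the variance and a generic SAM update. This is a rough illustration under assumptions, not the paper's method. The fixed threshold, the function names, and the plain gradient-descent update are hypothetical, and the dynamic $\rho$ schedule is not modeled.

```python
import math
from statistics import pvariance
from typing import Callable, Dict, List


def agents_needing_sam(
    td_errors: Dict[str, List[float]], variance_threshold: float
) -> List[str]:
    """Pick the agents whose recent TD errors vary more than the threshold."""
    return [
        agent
        for agent, errors in td_errors.items()
        if len(errors) > 1 and pvariance(errors) > variance_threshold
    ]


def sam_step(
    weights: List[float],
    grad_fn: Callable[[List[float]], List[float]],
    lr: float,
    rho: float,
) -> List[float]:
    """One generic SAM update: step to the nearby worst-case point, then
    descend from the original weights using the gradient found there."""
    grad = grad_fn(weights)
    norm = math.sqrt(sum(g * g for g in grad)) or 1.0  # avoid division by zero
    perturbed = [w + rho * g / norm for w, g in zip(weights, grad)]
    perturbed_grad = grad_fn(perturbed)
    return [w - lr * g for w, g in zip(weights, perturbed_grad)]
```

Agents not returned by `agents_needing_sam` would take an ordinary gradient step, which is where the saving in overhead would come from in this sketch.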

Simulated Human Learning in a Dynamic, Partially-Observed, Time-Series Environment

在动态、部分观察、时间序列环境中的模拟人类学习

Authors: Jeffrey Jiang, Kevin Hong, Emily Kuczynski, Gregory Pottie
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2511.15032
Pdf link: https://arxiv.org/pdf/2511.15032
Abstract While intelligent tutoring systems (ITSs) can use information from past students to personalize instruction, each new student is unique. Moreover, the education problem is inherently difficult because the learning process is only partially observable. We therefore develop a dynamic, time-series environment to simulate a classroom setting, with student-teacher interventions - including tutoring sessions, lectures, and exams. In particular, we design the simulated environment to allow for varying levels of probing interventions that can gather more information. Then, we develop reinforcement learning ITSs that combine learning the individual state of students while pulling from population information through the use of probing interventions. These interventions can reduce the difficulty of student estimation, but also introduce a cost-benefit decision to find a balance between probing enough to get accurate estimates and probing so often that it becomes disruptive to the student. We compare the efficacy of standard RL algorithms with several greedy rules-based heuristic approaches to find that they provide different solutions, but with similar results. We also highlight the difficulty of the problem with increasing levels of hidden information, and the boost that we get if we allow for probing interventions. We show the flexibility of both heuristic and RL policies with regards to changing student population distributions, finding that both are flexible, but RL policies struggle to help harder classes. Finally, we test different course structures with non-probing policies and we find that our policies are able to boost the performance of quiz and midterm structures more than we can in a finals-only structure, highlighting the benefit of having additional information.
中文摘要 虽然智能辅导系统（ITS）可以利用过去学生的信息来个性化教学，但每个新生都是独一无二的。此外，教育问题本质上很困难，因为学习过程只能部分被观察到。因此，我们开发了一个动态的时间序列环境，模拟课堂环境，并结合师生干预——包括辅导、讲座和考试。特别是，我们设计模拟环境，允许不同层次的探测干预，以收集更多信息。随后，我们开发了强化学习 ITS，结合学习学生个体状态，并通过探询干预从人群信息中提取信息。这些干预措施可以降低学生估算的难度，但也带来了一个成本效益的决策，即在探查足够准确估算与频繁探查以造成干扰之间找到平衡。我们比较了标准强化学习算法与几种贪婪规则启发式方法的有效性，发现它们提供了不同的解决方案，但结果相似。我们还强调了隐藏信息水平增加的问题，以及如果允许深入干预，我们会获得的提升。我们展示了启发式和强化学习政策在改变学生群体分布方面的灵活性，发现两者都具有灵活性，但强化学习政策难以帮助难度较高的班级。最后，我们测试了采用非探究性政策的不同课程结构，发现我们的策略比仅限期末考试的结构更能提升测验和期中结构的表现，凸显了拥有额外信息的好处。

Distributed primal-dual algorithm for constrained multi-agent reinforcement learning under coupled policies

分布式原始对偶算法，用于耦合策略下的约束多智能体强化学习

Authors: Pengcheng Dai, He Wang, Dongming Wang, Wenwu Yu
Subjects: Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2511.15053
Pdf link: https://arxiv.org/pdf/2511.15053
Abstract In this work, we investigate constrained multi-agent reinforcement learning (CMARL), where agents collaboratively maximize the sum of their local objectives while satisfying individual safety constraints. We propose a framework where agents adopt coupled policies that depend on both local states and parameters, as well as those of their $\kappa_p$-hop neighbors, with $\kappa_p>0$ denoting the coupling distance. A distributed primal-dual algorithm is further developed under this framework, wherein each agent has access only to state-action pairs within its $2\kappa_p$-hop neighborhood and to reward information within its $\kappa + 2\kappa_p$-hop neighborhood, with $\kappa > 0$ representing the truncation distance. Moreover, agents are not permitted to directly share their true policy parameters or Lagrange multipliers. Instead, each agent constructs and maintains local estimates of these variables for other agents and employs such estimates to execute its policy. Additionally, these estimates are further updated and exchanged exclusively through independent, time-varying networks, which enhances the overall system security. We establish that, with high probability, our algorithm can achieve an $\epsilon$-first-order stationary convergence with an approximation error of $\mathcal{O}(\gamma^{\frac{\kappa+1}{\kappa_{p}}})$ for discount factor $\gamma\in(0,1)$. Finally, simulations in a GridWorld environment are conducted to demonstrate the effectiveness of the proposed algorithm.
中文摘要 本研究探讨了约束多智能体强化学习（CMARL），即智能体协作最大化其局部目标的总和，同时满足各自的安全约束。我们提出一个框架，其中智能体采用耦合策略，该策略既依赖于自身的局部状态和参数，也依赖于其 $\kappa_p$ 跳邻居的状态和参数，$\kappa_p>0$ 表示耦合距离。在该框架下进一步提出了分布式原始对偶算法，每个智能体只能访问其 $2\kappa_p$ 跳邻域内的状态-动作对，以及其 $\kappa + 2\kappa_p$ 跳邻域内的奖励信息，$\kappa > 0$ 代表截断距离。此外，智能体不得直接共享其真实的策略参数或拉格朗日乘子。相反，每个智能体为其他智能体构建并维护这些变量的局部估计值，并利用这些估计来执行其策略。此外，这些估计值仅通过独立的时变网络进行更新和交换，从而增强整体系统安全性。我们证明，对于折扣因子 $\gamma\in(0,1)$，算法能以高概率实现 $\epsilon$-一阶平稳收敛，近似误差为 $\mathcal{O}(\gamma^{\frac{\kappa+1}{\kappa_{p}}})$。最后，在GridWorld环境中进行仿真，以展示所提算法的有效性。
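
To get a feel for the approximation error $\mathcal{O}(\gamma^{\frac{\kappa+1}{\kappa_{p}}})$, the term inside the big-O can be evaluated for a few settings. The helper below is only a numerical illustration: its name is hypothetical, and the constants hidden by the big-O notation are not known from the abstract and are ignored.

```python
def truncation_error_term(gamma: float, kappa: float, kappa_p: float) -> float:
    """Evaluate gamma ** ((kappa + 1) / kappa_p), the term inside the big-O bound."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie in (0, 1)")
    if kappa <= 0 or kappa_p <= 0:
        raise ValueError("kappa and kappa_p must be positive")
    return gamma ** ((kappa + 1) / kappa_p)


for kappa in (1, 3, 7):
    print(kappa, truncation_error_term(gamma=0.5, kappa=kappa, kappa_p=2))
```

Because $\gamma < 1$, the term shrinks as the truncation distance $\kappa$ grows and increases as the coupling distance $\kappa_p$ grows, so a wider coupling needs a larger truncation radius for the same error level.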

Learning Human-Like RL Agents Through Trajectory Optimization With Action Quantization

通过轨迹优化与动作量化学习类人强化学习代理

Authors: Jian-Ting Guo, Yu-Cheng Chen, Ping-Chun Hsieh, Kuo-Hao Ho, Po-Wei Huang, Ti-Rong Wu, I-Chen Wu
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.15055
Pdf link: https://arxiv.org/pdf/2511.15055
Abstract Human-like agents have long been one of the goals in pursuing artificial intelligence. Although reinforcement learning (RL) has achieved superhuman performance in many domains, relatively little attention has been focused on designing human-like RL agents. As a result, many reward-driven RL agents often exhibit unnatural behaviors compared to humans, raising concerns for both interpretability and trustworthiness. To achieve human-like behavior in RL, this paper first formulates human-likeness as trajectory optimization, where the objective is to find an action sequence that closely aligns with human behavior while also maximizing rewards, and adapts the classic receding-horizon control to human-like learning as a tractable and efficient implementation. To achieve this, we introduce Macro Action Quantization (MAQ), a human-like RL framework that distills human demonstrations into macro actions via Vector-Quantized VAE. Experiments on D4RL Adroit benchmarks show that MAQ significantly improves human-likeness, increasing trajectory similarity scores, and achieving the highest human-likeness rankings among all RL agents in the human evaluation study. Our results also demonstrate that MAQ can be easily integrated into various off-the-shelf RL algorithms, opening a promising direction for learning human-like RL agents. Our code is available at this https URL.
中文摘要 类人智能体长期以来一直是追求人工智能的目标之一。尽管强化学习（RL）在许多领域实现了超人般的性能，但对设计类人强化学习代理的关注却相对较少。因此，许多以奖励为驱动的强化学习代理常表现出与人类相比不自然的行为，这引发了对可解释性和可信度的担忧。为了在强化学习中实现类人行为，本文首先将类人性表述为轨迹优化，目标是找到一个与人类行为紧密对齐且最大化奖励的动作序列，并将经典的后退视界控制应用于类人学习，作为一种可作且高效的实现方式。为此，我们引入了宏量化量化（MAQ），一种类人强化学习框架，通过矢量量化VAE将人类演示提炼为宏观动作。D4RL Adroit基准测试的实验显示，MAQ显著提升了人类相似度，提高了轨迹相似度评分，并在人类评估研究中获得了所有强化学习代理中最高的人类相似度排名。我们的结果还表明，MAQ可以轻松集成到各种现成的强化学习算法中，为学习类人强化学习代理开辟了有前景的方向。我们的代码可在此 https URL 访问。
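
The quantization step at the heart of a Vector-Quantized VAE is a nearest-neighbor lookup in a learned codebook. The sketch below shows only that lookup on plain lists; it is not the MAQ implementation, and the toy codebook and function name are hypothetical. In MAQ the codebook entries would correspond to macro actions distilled from human demonstrations.

```python
from typing import List, Sequence, Tuple


def quantize(
    vector: Sequence[float], codebook: List[List[float]]
) -> Tuple[int, List[float]]:
    """Return the index and value of the codebook entry closest to `vector`
    in squared Euclidean distance."""

    def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(
        range(len(codebook)), key=lambda i: squared_distance(vector, codebook[i])
    )
    return index, codebook[index]
```

Restricting the agent to such codebook indices turns a continuous action space into a discrete choice among human-derived behaviors.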

From Solving to Verifying: A Unified Objective for Robust Reasoning in LLMs

从求解到验证：大型语言模型中稳健推理的统一目标

Authors: Xiaoxuan Wang, Bo Liu, Song Jiang, Jingzhou Liu, Jingyuan Qi, Xia Chen, Baosheng He
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.15137
Pdf link: https://arxiv.org/pdf/2511.15137
Abstract The reasoning capabilities of large language models (LLMs) have been significantly improved through reinforcement learning (RL). Nevertheless, LLMs still struggle to consistently verify their own reasoning traces. This raises the research question of how to enhance the self-verification ability of LLMs and whether such an ability can further improve reasoning performance. In this work, we propose GRPO-Verif, an algorithm that jointly optimizes solution generation and self-verification within a unified loss function, with an adjustable hyperparameter controlling the weight of the verification signal. Experimental results demonstrate that our method enhances self-verification capability while maintaining comparable performance in reasoning.
中文摘要 大型语言模型（LLMs）的推理能力通过强化学习（RL）得到了显著提升。然而，大型语言模型仍然难以持续验证自身的推理痕迹。这引发了一个研究问题：如何提升LLM的自我验证能力，以及这种能力是否能进一步提升推理能力。在本研究中，我们提出了GRPO-Verif算法，该算法在统一的损耗函数内共同优化解生成和自验证，并配有可调超参数控制验证信号权重。实验结果表明，我们的方法在保持推理表现相当的同时，增强了自我验证能力。
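
A unified objective with an adjustable verification weight can be sketched as a weighted sum of two loss terms, alongside the group-relative advantage that GRPO-style methods compute from rewards within a sampled group. The additive form, the name `alpha`, and both function names are assumptions for illustration; the abstract does not give the exact loss.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one sampled group of responses."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def unified_loss(solution_loss: float, verification_loss: float, alpha: float) -> float:
    """Assumed additive form: alpha weights the self-verification signal."""
    return solution_loss + alpha * verification_loss
```

Setting `alpha` to zero in this sketch recovers a solution-only objective, which makes the hyperparameter a direct knob for how much the verification signal counts.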

Vehicle Routing Problems via Quantum Graph Attention Network Deep Reinforcement Learning

通过量子图注意力网络深度强化学习解决车辆路由问题

Authors: Le Tung Giang, Vu Hoang Viet, Nguyen Xuan Tung, Trinh Van Chien, Won-Joo Hwang
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Quantum Physics (quant-ph)
Arxiv link: https://arxiv.org/abs/2511.15175
Pdf link: https://arxiv.org/pdf/2511.15175
Abstract The vehicle routing problem (VRP) is a fundamental NP-hard task in intelligent transportation systems with broad applications in logistics and distribution. Deep reinforcement learning (DRL) with Graph Neural Networks (GNNs) has shown promise, yet classical models rely on large multi-layer perceptrons (MLPs) that are parameter-heavy and memory-bound. We propose a Quantum Graph Attention Network (Q-GAT) within a DRL framework, where parameterized quantum circuits (PQCs) replace conventional MLPs at critical readout stages. The hybrid model maintains the expressive capacity of graph attention encoders while reducing trainable parameters by more than 50%. Using proximal policy optimization (PPO) with greedy and stochastic decoding, experiments on VRP benchmarks show that Q-GAT achieves faster convergence and reduces routing cost by about 5% compared with classical GAT baselines. These results demonstrate the potential of PQC-enhanced GNNs as compact and effective solvers for large-scale routing and logistics optimization.
中文摘要 车辆路由问题（VRP）是智能交通系统中一个基础的NP难任务，广泛应用于物流和分销领域。基于图神经网络（GNN）的深度强化学习（DRL）已展现出潜力，但经典模型依赖于参数繁重且内存受限的大型多层感知器（MLP）。我们提出了在DRL框架下构建的量子图注意力网络（Q-GAT），其中参数化量子电路（PQCs）在关键读出阶段取代了传统的MLP。混合模型保持了图注意力编码器的表达能力，同时将可训练参数减少了50%以上。利用近端策略优化（PPO）配合贪婪和随机解码，VRP基准测试的实验表明，Q-GAT实现了更快的收敛，并比经典GAT基线降低了约5%的路由成本。这些结果展示了PQC增强GNN作为大规模路由和物流优化的紧凑高效求解器的潜力。

Masked Auto-Regressive Variational Acceleration: Fast Inference Makes Practical Reinforcement Learning

蒙面自回归变分加速：快速推断使实用的强化学习成为现实

Authors: Yuxuan Gu, Weimin Bai, Yifei Wang, Weijian Luo, He Sun
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.15190
Pdf link: https://arxiv.org/pdf/2511.15190
Abstract Masked auto-regressive diffusion models (MAR) benefit from the expressive modeling ability of diffusion models and the flexibility of masked auto-regressive ordering. However, vanilla MAR suffers from slow inference due to its hierarchical inference mechanism: an outer AR unmasking loop and an inner diffusion denoising chain. Such a decoupled structure not only harms the generation efficiency but also hinders the practical use of MAR for reinforcement learning (RL), an increasingly critical paradigm for generative model post-training. To address this fundamental issue, we introduce MARVAL (Masked Auto-regressive Variational Acceleration), a distillation-based framework that compresses the diffusion chain into a single AR generation step while preserving the flexible auto-regressive unmasking order. Such a distillation with MARVAL not only yields substantial inference acceleration but, crucially, makes RL post-training with verifiable rewards practical, resulting in scalable yet human-preferred fast generative models. Our contributions are twofold: (1) a novel score-based variational objective for distilling masked auto-regressive diffusion models into a single generation step without sacrificing sample quality; and (2) an efficient RL framework for masked auto-regressive models via MARVAL-RL. On ImageNet 256×256, MARVAL-Huge achieves an FID of 2.00 with more than 30 times speedup compared with MAR-diffusion, and MARVAL-RL yields consistent improvements in CLIP and image-reward scores on ImageNet datasets with entity names. In conclusion, MARVAL demonstrates the first practical path to distillation and RL of masked auto-regressive diffusion models, enabling fast sampling and better preference alignments.
中文摘要 蒙面自回归扩散模型（MAR）受益于扩散模型的表现力建模能力和掩蔽自回归排序的灵活性。然而，普通MAR由于其层级推断机制：外层AR解蔽环和内部扩散去噪链，导致推断速度较慢。这种解耦结构不仅损害生成效率，也阻碍了MAR在强化学习（RL）中的实际应用，RL是生成模型日益关键的范式。针对这一根本问题，我们引入了MARVAL（蒙面自回归变分加速），这是一个基于蒸馏的框架，将扩散链压缩为单一的AR生成步骤，同时保持灵活的自回归解密顺序。通过MARVAL进行这种提炼，不仅带来了显著的推理加速，更关键的是，使强化学习在训练后实现可验证的奖励变得切实可行，从而产生可扩展但更受人类青睐的快速生成模型。我们的贡献有两个方面：（1）一种基于评分的新变分客观，用于将掩蔽自回归扩散模型提炼成单代步骤，同时不牺牲样本质量;以及（2）通过MARVAL-RL构建的高效强化学习（RL）掩蔽自回归模型框架。在ImageNet 256*256上，MARVAL-Huge实现了2.00的FID，速度提升了30多倍，MARVAL-RL在带有实体名称的ImageNet数据集上，CLIP和图像奖励评分持续提升。总之，MARVAL展示了掩膜自回归扩散模型的首个实用蒸馏和强化学习路径，实现快速采样和更好的偏好比对。

Learning Where, What and How to Transfer: A Multi-Role Reinforcement Learning Approach for Evolutionary Multitasking

学习在哪里、什么以及如何转移：一种用于进化多任务的多角色强化学习方法

Authors: Jiajun Zhan, Zeyuan Ma, Yue-Jiao Gong, Kay Chen Tan
Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.15199
Pdf link: https://arxiv.org/pdf/2511.15199
Abstract Evolutionary multitasking (EMT) algorithms typically require tailored designs for knowledge transfer, in order to assure convergence and optimality in multitask optimization. In this paper, we explore designing a systematic and generalizable knowledge transfer policy through Reinforcement Learning. We first identify three major challenges: determining the task to transfer (where), the knowledge to be transferred (what) and the mechanism for the transfer (how). To address these challenges, we formulate a multi-role RL system where three (groups of) policy networks act as specialized agents: a task routing agent incorporates an attention-based similarity recognition module to determine source-target transfer pairs via attention scores; a knowledge control agent determines the proportion of elite solutions to transfer; and a group of strategy adaptation agents control transfer strength by dynamically controlling hyper-parameters in the underlying EMT framework. Through pre-training all network modules end-to-end over an augmented multitask problem distribution, a generalizable meta-policy is obtained. Comprehensive validation experiments show state-of-the-art performance of our method against representative baselines. Further in-depth analysis not only reveals the rationale behind our proposal but also provides insightful interpretations of what the system has learned.
中文摘要 进化多任务（EMT）算法通常需要针对知识转移的定制设计，以确保多任务优化的趋同和最优性。本文探讨通过强化学习设计一套系统且可推广的知识转移策略。我们首先确定三大挑战：确定要转移的任务（在哪里）、要转移的知识（什么）以及转移的机制（如何转移）。为应对这些挑战，我们构建了一个多功能强化学习系统，其中三个（组）策略网络充当专业代理：任务路由代理集成基于注意力的相似性识别模块，通过注意力评分确定源-目标转移对;知识控制代理决定精英解决方案的转移比例;一组策略适应代理通过动态控制底层EMT框架中的超参数来控制转移强度。通过对所有网络模块在增强型多任务问题分布上端到端预训练，可以获得一个可推广的元策略。全面的验证实验显示，我们方法在代表性基线下表现出最先进的性能。进一步深入的分析不仅揭示了我们提案背后的理由，还能对系统所学内容提供深刻的解读。

Reasoning in Diffusion Large Language Models is Concentrated in Dynamic Confusion Zones

扩散中的推理大语言模型集中在动态混淆区

Authors: Ranfei Chen, Ming Chen, Kaifei Wang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.15208
Pdf link: https://arxiv.org/pdf/2511.15208
Abstract Diffusion Large Language Models (dLLMs) are rapidly emerging alongside autoregressive models as a powerful paradigm for complex reasoning, with reinforcement learning increasingly used for downstream alignment. Existing trajectory-based RL methods uniformly allocate policy gradients across denoising steps, implicitly treating all steps as equally important. We challenge this assumption by analyzing trajectories with several step-level metrics: entropy-based uncertainty, Confidence-Margin (CM) uncertainty, and Rate of Entropy Change (RoEC). These reveal structured "zones of confusion": transient spikes in uncertainty and instability that strongly predict final success or failure, while most steps remain stable. We propose Adaptive Trajectory Policy Optimization (ATPO), a lightweight step-selection strategy that dynamically reallocates gradient updates to these high-leverage steps without changing the RL objective, rewards, or compute budget. Using a hybrid RoEC+CM rule, ATPO delivers substantial gains in reasoning accuracy and training stability across benchmarks, showing that exploiting trajectory dynamics is key to advancing dLLM RL.
中文摘要 扩散大型语言模型（dLLMs）正迅速崛起，与自回归模型并列，成为复杂推理的强大范式，强化学习也越来越多地用于下游对齐。现有基于轨迹的强化学习方法统一分配策略梯度于去噪步骤，隐含地将所有步骤视为同等重要。我们通过分析多个步骤级指标来挑战这一假设：基于熵的不确定性、置信裕度（CM）不确定性和熵变化率（RoEC）。这些揭示了结构化的“混乱区”：不确定性和不稳定性的短暂峰值，强烈预测最终成功或失败，而大多数步骤保持稳定。我们提出了自适应轨迹策略优化（ATPO），这是一种轻量级步选策略，能够动态重新分配梯度更新到这些高杠杆步骤，而不改变强化学习的目标、奖励或计算预算。采用混合RoEC+CM规则，ATPO在各基准测试中显著提升了推理准确性和训练稳定性，表明利用轨迹动态是推动dLLM强化学习发展的关键。
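
The three step-level metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact definitions, so everything here is an assumption: each denoising step is reduced to a single probability distribution, entropy is Shannon entropy, Confidence-Margin uncertainty is one minus the gap between the two largest probabilities, RoEC is the absolute step-to-step entropy change, and the hybrid rule is a plain sum of RoEC and CM. The function names are hypothetical.

```python
import numpy as np

def step_metrics(probs):
    """Per-step uncertainty metrics for an array of shape (steps, vocab)."""
    probs = np.asarray(probs, dtype=float)
    eps = 1e-12  # avoids log(0)
    entropy = -(probs * np.log(probs + eps)).sum(axis=1)
    top2 = np.sort(probs, axis=1)[:, -2:]          # two largest probabilities per step
    cm_uncertainty = 1.0 - (top2[:, 1] - top2[:, 0])
    # Rate of entropy change; the first step has no predecessor, so it gets 0.
    roec = np.abs(np.diff(entropy, prepend=entropy[0]))
    return entropy, cm_uncertainty, roec

def select_steps(probs, k):
    """Return the indices of the k steps with the highest hybrid score (assumed: RoEC + CM)."""
    _, cm_uncertainty, roec = step_metrics(probs)
    score = roec + cm_uncertainty
    return np.sort(np.argsort(score)[-k:])

# Toy trajectory: a confident step, a maximally uncertain step, a fairly confident step.
toy = [[1.0, 0.0, 0.0], [1 / 3, 1 / 3, 1 / 3], [0.9, 0.05, 0.05]]
print(select_steps(toy, k=1))
```

In this toy trajectory the uniform middle step scores highest on both terms, so it is the one that would receive the gradient update under the assumed rule.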

Symmetry-Breaking in Multi-Agent Navigation: Winding Number-Aware MPC with a Learned Topological Strategy

多智能体导航中的对称破缺：带有学习拓扑策略的绕数感知MPC。

Authors: Tomoki Nakao, Kazumi Kasaura, Tadashi Kozuno
Subjects: Robotics (cs.RO); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2511.15239
Pdf link: https://arxiv.org/pdf/2511.15239
Abstract We address the fundamental challenge of resolving symmetry-induced deadlocks in distributed multi-agent navigation by proposing a new hierarchical navigation method. When multiple agents interact, it is inherently difficult for them to autonomously break the symmetry of deciding how to pass each other. To tackle this problem, we introduce an approach that quantifies cooperative symmetry-breaking strategies using a topological invariant called the winding number, and learns the strategies themselves through reinforcement learning. Our method features a hierarchical policy consisting of a learning-based Planner, which plans topological cooperative strategies, and a model-based Controller, which executes them. Through reinforcement learning, the Planner learns to produce two types of parameters for the Controller: one is the topological cooperative strategy represented by winding numbers, and the other is a set of dynamic weights that determine which agent interaction to prioritize in dense scenarios where multiple agents cross simultaneously. The Controller then generates collision-free and efficient motions based on the strategy and weights provided by the Planner. This hierarchical structure combines the flexible decision-making ability of learning-based methods with the reliability of model-based approaches. Simulation and real-world robot experiments demonstrate that our method outperforms existing baselines, particularly in dense environments, by efficiently avoiding collisions and deadlocks while achieving superior navigation performance. The code for the experiments is available at this https URL.
中文摘要 我们通过提出一种新的分层导航方法，解决分布式多智能体导航中由对称性引起的死锁的根本性挑战。当多个代理相互作用时，他们本质上很难自主地打破彼此通过的对称性。为解决该问题，我们引入了一种方法，利用称为绕数的拓扑不变量量化合作对称破缺策略，并通过强化学习学习这些策略。我们的方法具有分层策略，由基于学习的规划器（规划拓扑合作策略）和基于模型的控制器（执行这些策略）组成。通过强化学习，规划者学会为控制器生成两种参数：一种是用绕数表示的拓扑合作策略，另一种是一组动态权重，决定在多个代理同时交叉的密集场景中优先处理哪种代理交互。控制器随后根据规划师提供的策略和权重生成无碰撞且高效的动作。这种层级结构结合了基于学习方法的灵活决策能力与基于模型方法的可靠性。模拟和真实机器人实验表明，我们的方法优于现有基线，尤其是在密集环境中，通过高效避免碰撞和死锁，同时实现卓越的导航性能。实验代码可在该 https URL 获取。
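
To make the winding number concrete, the sketch below computes it for a pair of planar trajectories as the total rotation of the relative position vector divided by 2π. This is the standard geometric definition, not necessarily the exact quantity used in the paper; the function name and the toy trajectories are assumptions for illustration.

```python
import numpy as np

def winding_number(traj_a, traj_b):
    """Net number of turns agent B makes around agent A (counter-clockwise positive)."""
    rel = np.asarray(traj_b, dtype=float) - np.asarray(traj_a, dtype=float)
    angles = np.arctan2(rel[:, 1], rel[:, 0])
    steps = np.diff(angles)
    # Wrap each angular step into [-pi, pi) so jumps across the branch cut are not counted.
    steps = (steps + np.pi) % (2 * np.pi) - np.pi
    return steps.sum() / (2 * np.pi)

# Two agents passing each other head-on with a small lateral offset.
s = np.linspace(0.0, 1.0, 50)
agent_a = np.stack([-1 + 2 * s, np.zeros_like(s)], axis=1)
agent_b = np.stack([1 - 2 * s, np.full_like(s, 0.5)], axis=1)
print(winding_number(agent_a, agent_b))
```

The sign of the result tells which side the agents pass on, which is the kind of discrete choice that a symmetry-breaking strategy has to settle.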

EntroPIC: Towards Stable Long-Term Training of LLMs via Entropy Stabilization with Proportional-Integral Control

EntroPIC：通过熵稳定与比例积分控制实现LLMs的稳定长期训练

Authors: Kai Yang, Xin Xu, Yangkun Chen, Weijie Liu, Jiafei Lyu, Zichuan Lin, Deheng Ye, Saiyong Yang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.15248
Pdf link: https://arxiv.org/pdf/2511.15248
Abstract Long-term training of large language models (LLMs) requires maintaining stable exploration to prevent the model from collapsing into sub-optimal behaviors. Entropy is crucial in this context, as it controls exploration and helps avoid premature convergence to sub-optimal solutions. However, existing reinforcement learning methods struggle to maintain an appropriate level of entropy, as the training process involves a mix of positive and negative samples, each affecting entropy in different ways across steps. To address this, we propose Entropy stabilization via Proportional-Integral Control (EntroPIC), a novel method that adaptively adjusts the influence of positive and negative samples by dynamically tuning their loss coefficients. This approach stabilizes entropy throughout training, ensuring efficient exploration and steady progress. We provide a comprehensive theoretical analysis for both on-policy and off-policy learning settings, demonstrating that EntroPIC is effective at controlling entropy in large-scale LLM training. Experimental results show that our method successfully maintains desired entropy levels, enabling stable and optimal RL training for LLMs.
中文摘要 大型语言模型（LLM）的长期训练需要保持稳定的探索，以防止模型崩溃为次优行为。熵在此背景下至关重要，因为它控制探索，帮助避免过早趋同到次优解。然而，现有的强化学习方法难以维持适当的熵水平，因为训练过程涉及正负样本的混合，每种样本在不同步骤中以不同方式影响熵。为此，我们提出了通过比例积分控制（EntroPIC）实现熵稳定的方法，这是一种通过动态调优正负样本损失系数，自适应地调整其影响的方法。这种方法在整个训练过程中稳定熵，确保探索高效且进展稳健。我们为政策内和非策略学习环境提供了全面的理论分析，证明EntroPIC在大规模LLM训练中有效控制熵。实验结果表明，我们的方法成功维持了所需的熵水平，从而实现LLM的稳定和最佳强化学习训练。
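
The control idea can be sketched with a textbook proportional-integral loop. This is a minimal sketch and not the paper's update rule: the gains, the class name, and in particular the assumption that raising the negative-sample coefficient (and lowering the positive-sample one) pushes entropy up are all illustrative choices.

```python
class PIEntropyController:
    """Textbook PI controller that turns an entropy error into two loss coefficients."""

    def __init__(self, target_entropy, kp=0.5, ki=0.1, base_coef=1.0):
        self.target = target_entropy
        self.kp = kp              # proportional gain (hypothetical value)
        self.ki = ki              # integral gain (hypothetical value)
        self.base = base_coef
        self.integral = 0.0       # running sum of past errors

    def update(self, measured_entropy):
        error = self.target - measured_entropy   # positive when entropy is too low
        self.integral += error
        adjust = self.kp * error + self.ki * self.integral
        # Assumption of this sketch: negative samples raise entropy, positive samples lower it.
        pos_coef = max(0.0, self.base - adjust)
        neg_coef = max(0.0, self.base + adjust)
        return pos_coef, neg_coef

controller = PIEntropyController(target_entropy=1.0)
for entropy in (0.8, 0.8, 1.0):
    print(controller.update(entropy))
```

The integral term is what keeps the correction active after the error returns to zero, which is the usual reason to prefer PI control over a purely proportional rule when a steady offset must be removed.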

GRPO-RM: Fine-Tuning Representation Models via GRPO-Driven Reinforcement Learning

GRPO-RM：通过GRPO驱动强化学习进行微调表示模型

Authors: Yanchen Xu, Ziheng Jiao, Hongyuan Zhang, Xuelong Li
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.15256
Pdf link: https://arxiv.org/pdf/2511.15256
Abstract The Group Relative Policy Optimization (GRPO), a reinforcement learning method used to fine-tune large language models (LLMs), has proved its effectiveness in practical applications such as DeepSeek-R1. It raises a question whether GRPO can be generalized to representation learning models. In this paper, we propose Group Relative Policy Optimization for Representation Model (GRPO-RM), and investigate the performance of GRPO-like policy in post-training representation models. Specifically, our method establishes a predefined output set to functionally replace token sequence sampling in LLMs, thereby generating an output group, which is essential for the probability-driven optimization of GRPO. In addition, a specialized reward function is designed to accommodate the properties of representation models. Extensive experiments are conducted on various real-world datasets to validate the effectiveness of our proposed method.
中文摘要 Group Relative Policy Optimization（GRPO）是一种用于微调大型语言模型（LLM）的强化学习方法，已在DeepSeek-R1等实际应用中证明了其有效性。这引发了一个问题：GRPO是否可以推广到表示学习模型。本文提出了表示模型的群相对策略优化（GRPO-RM），并研究了类似GRPO策略在训练后表示模型中的表现。具体来说，我们的方法建立了一个预定义的输出集，功能上替代LLM中的标记序列抽样，从而生成输出组，这对概率驱动的GRPO优化至关重要。此外，设计了一个专门的奖励函数以适应表示模型的属性。我们对各种真实世界数据集进行了大量实验，以验证我们提出方法的有效性。
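
The group-relative step that GRPO relies on can be shown in a few lines: each output in a group is scored, and its advantage is its reward standardized against the group mean and standard deviation. The sketch below covers only this normalization; how GRPO-RM builds its output group and reward is not specified in the abstract, so the reward values are placeholders.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group, as in GRPO-style advantage estimation."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Placeholder rewards for a group of four candidate outputs.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Because advantages are computed relative to the group, no separate value network is needed, which is why a well-defined output group is essential to the method.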

ChartEditor: A Reinforcement Learning Framework for Robust Chart Editing

ChartEditor：一个用于稳健图表编辑的强化学习框架

Authors: Liangyu Chen, Yichen Xu, Jianzhe Ma, Yuqi Liu, Donglu Yang, Liang Zhang, Wenxuan Wang, Qin Jin
Subjects: Multimedia (cs.MM); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.15266
Pdf link: https://arxiv.org/pdf/2511.15266
Abstract Chart editing reduces manual effort in visualization design. Typical benchmarks are limited in data diversity and assume access to complete chart code, which is seldom available in real-world scenarios. To address this gap, we present ChartEditVista, a comprehensive benchmark consisting of 7,964 samples spanning 31 chart categories. It encompasses diverse editing instructions and covers nearly all editable chart elements. The inputs in ChartEditVista include only the original chart image and natural language editing instructions, without the original chart code. ChartEditVista is generated through a fully automated pipeline that produces, edits, and verifies charts, ensuring high-quality chart editing data. In addition, we introduce two novel fine-grained, rule-based evaluation metrics: the layout metric, which evaluates the position, size and color of graphical components; and the text metric, which jointly assesses textual content and font styling. Building on top of ChartEditVista, we present ChartEditor, a model trained using a reinforcement learning framework that incorporates a novel rendering reward to simultaneously enforce code executability and visual fidelity. Through extensive experiments and human evaluations, we demonstrate that ChartEditVista provides a robust evaluation, while ChartEditor consistently outperforms models of similar and larger scale on chart editing tasks.
中文摘要 图表编辑减少了可视化设计中的人工工作。典型基准测试数据多样性有限，且假设访问完整的图表代码，而这在现实中很少见。为弥补这一空白，我们推出了ChartEditVista，这是一个涵盖31个图表类别的综合基准测试，包含7,964个样本。它涵盖了各种编辑指导，几乎涵盖了所有可编辑的图表元素。ChartEditVista 的输入仅包含原始图表图像和自然语言编辑指令，没有原始图表代码。ChartEditVista 通过全自动化流程生成，负责生成、编辑和验证图表，确保高质量的图表编辑数据。此外，我们还引入了两种新颖的细粒度、基于规则的评估指标：布局指标，用于评估图形组件的位置、大小和颜色;以及文本指标，用于联合评估文本内容和字体样式。在ChartEditVista之上，我们推出了ChartEditor模型，该模型采用强化学习框架训练，结合了新颖的渲染奖励，以同时强化代码可执行性和视觉忠实度。通过大量实验和人工评估，我们证明了ChartEditVista提供了稳健的评估，而ChartEditor在图表编辑任务中持续优于同等或更大尺度的模型。

Look, Zoom, Understand: The Robotic Eyeball for Embodied Perception

看，放大，理解：具身感知的机器人眼球

Authors: Jiashu Yang, Yifan Han, Yucheng Xie, Ning Guo, Wenzhao Lian
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.15279
Pdf link: https://arxiv.org/pdf/2511.15279
Abstract In embodied AI perception systems, visual perception should be active: the goal is not to passively process static images, but to actively acquire more informative data within pixel and spatial budget constraints. Existing vision models and fixed RGB-D camera systems fundamentally fail to reconcile wide-area coverage with fine-grained detail acquisition, severely limiting their efficacy in open-world robotic applications. To address this issue, we propose EyeVLA, a robotic eyeball for active visual perception that can take proactive actions based on instructions, enabling clear observation of fine-grained target objects and detailed information across a wide spatial extent. EyeVLA discretizes action behaviors into action tokens and integrates them with vision-language models (VLMs) that possess strong open-world understanding capabilities, enabling joint modeling of vision, language, and actions within a single autoregressive sequence. By using the 2D bounding box coordinates to guide the reasoning chain and applying reinforcement learning to refine the viewpoint selection policy, we transfer the open-world scene understanding capability of the VLM to a vision language action (VLA) policy using only minimal real-world data. Experiments show that our system efficiently performs instructed scenes in real-world environments and actively acquires more accurate visual information through instruction-driven actions of rotation and zoom, thereby achieving strong environmental perception capabilities. EyeVLA introduces a novel robotic vision system that leverages detailed and spatially rich, large-scale embodied data, and actively acquires highly informative visual observations for downstream embodied tasks.
中文摘要 在具身的AI感知系统中，视觉感知应当是主动的：目标不是被动处理静态图像，而是在像素和空间预算限制内主动获取更多信息。现有的视觉模型和固定RGB-D摄像系统根本无法调和广域覆盖与细粒度细节采集，严重限制了它们在开放世界机器人应用中的效能。为解决这一问题，我们提出了EyeVLA，一种能够根据指令主动采取主动行动的机器人眼球，从而清晰观察细粒度目标物体和广泛空间范围内的详细信息。EyeVLA将动作行为离散化为动作标记，并将其与具备强大开放世界理解能力的视觉语言模型（VLMs）集成，实现在单一自回归序列中对视觉、语言和动作的联合建模。通过使用二维边界框坐标引导推理链，并应用强化学习细化视点选择策略，我们将VLM的开放世界场景理解能力转移到仅使用极少真实世界数据的视觉语言动作（VLA）策略中。实验表明，我们的系统能够高效地在现实环境中执行指令场景，并通过旋转和缩放等指令驱动动作主动获取更准确的视觉信息，从而实现了强大的环境感知能力。EyeVLA引入了一种新型机器人视觉系统，利用详尽且空间丰富的大规模具身数据，并主动获取高度信息量的视觉观察，用于后续的具象任务。

Path Planning through Multi-Agent Reinforcement Learning in Dynamic Environments

动态环境中通过多智能体强化学习进行路径规划

Authors: Jonas De Maeyer, Hossein Yarahmadi, Moharram Challenger
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.15284
Pdf link: https://arxiv.org/pdf/2511.15284
Abstract Path planning in dynamic environments is a fundamental challenge in intelligent transportation and robotics, where obstacles and conditions change over time, introducing uncertainty and requiring continuous adaptation. While existing approaches often assume complete environmental unpredictability or rely on global planners, these assumptions limit scalability and practical deployment in real-world settings. In this paper, we propose a scalable, region-aware reinforcement learning (RL) framework for path planning in dynamic environments. Our method builds on the observation that environmental changes, although dynamic, are often localized within bounded regions. To exploit this, we introduce a hierarchical decomposition of the environment and deploy distributed RL agents that adapt to changes locally. We further propose a retraining mechanism based on sub-environment success rates to determine when policy updates are necessary. Two training paradigms are explored: single-agent Q-learning and multi-agent federated Q-learning, where local Q-tables are aggregated periodically to accelerate the learning process. Unlike prior work, we evaluate our methods in more realistic settings, where multiple simultaneous obstacle changes and increasing difficulty levels are present. Results show that the federated variants consistently outperform their single-agent counterparts and closely approach the performance of A* Oracle while maintaining shorter adaptation times and robust scalability. Although initial training remains time-consuming in large environments, our decentralized framework eliminates the need for a global planner and lays the groundwork for future improvements using deep RL and flexible environment decomposition.
中文摘要 动态环境中的路径规划是智能交通和机器人领域的根本挑战，障碍物和条件随时间变化，带来不确定性并需要持续适应。现有方法通常假设环境完全不可预测或依赖全球规划者，但这些假设限制了在现实环境中的可扩展性和实际部署。本文提出了一种可扩展的区域感知强化学习（RL）框架，用于动态环境中的路径规划。我们的方法基于这样一个观察：环境变化虽然动态，但通常局限于有界区域内。为此，我们引入了环境的层级分解，并部署了能够本地适应变化的分布式强化学习代理。我们还提出基于子环境成功率的再训练机制，以确定何时需要进行策略更新。探讨了两种训练范式：单智能体Q学习和多智能体联合Q学习，后者定期聚合本地Q表以加快学习进程。与以往工作不同，我们在更现实的环境中评估方法，这些环境存在多个同时变化的障碍物和难度递增。结果显示，联邦变体持续优于单代理版本，性能接近A* Oracle，同时保持更短的适应时间和稳健的可扩展性。尽管在大型环境中初期训练仍然耗时，我们的去中心化框架消除了对全局规划器的需求，并为利用深度强化学习和灵活环境分解的未来改进奠定了基础。
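
The two ingredients of the federated variant, a local tabular Q-learning update and a periodic aggregation of local Q-tables, can be sketched as follows. The abstract only says the tables are "aggregated periodically"; element-wise averaging, the learning rate, and the discount factor used here are assumptions.

```python
import numpy as np

def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One standard tabular Q-learning update, applied in place."""
    td_target = reward + gamma * q[next_state].max()
    q[state, action] += alpha * (td_target - q[state, action])

def aggregate(q_tables):
    """Combine local Q-tables by element-wise averaging (assumed aggregation rule)."""
    return np.mean(np.stack(q_tables), axis=0)

# Two agents with 2 states and 2 actions each; only the first one has seen a reward.
q_a = np.zeros((2, 2))
q_b = np.zeros((2, 2))
q_update(q_a, state=0, action=1, reward=1.0, next_state=1)
shared = aggregate([q_a, q_b])
print(shared)
```

After aggregation each agent would continue learning from the shared table, so experience gathered in one region can speed up learning in another.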

Platform-Agnostic Reinforcement Learning Framework for Safe Exploration of Cluttered Environments with Graph Attention

平台无关的强化学习框架，用于通过图关注安全探索杂乱环境

Authors: Gabriele Calzolari (1), Vidya Sumathy (1), Christoforos Kanellakis (1), George Nikolakopoulos (1) ((1) Lulea University of Technology)
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.15358
Pdf link: https://arxiv.org/pdf/2511.15358
Abstract Autonomous exploration of obstacle-rich spaces requires strategies that ensure efficiency while guaranteeing safety against collisions with obstacles. This paper investigates a novel platform-agnostic reinforcement learning framework that integrates a graph neural network-based policy for next-waypoint selection, with a safety filter ensuring safe mobility. Specifically, the neural network is trained using reinforcement learning through the Proximal Policy Optimization (PPO) algorithm to maximize exploration efficiency while minimizing safety filter interventions. Henceforth, when the policy proposes an infeasible action, the safety filter overrides it with the closest feasible alternative, ensuring consistent system behavior. In addition, this paper introduces a reward function shaped by a potential field that accounts for both the agent's proximity to unexplored regions and the expected information gain from reaching them. The proposed framework combines the adaptability of reinforcement learning-based exploration policies with the reliability provided by explicit safety mechanisms. This feature plays a key role in enabling the deployment of learning-based policies on robotic platforms operating in real-world environments. Extensive evaluations in both simulations and experiments performed in a lab environment demonstrate that the approach achieves efficient and safe exploration in cluttered spaces.
中文摘要 在障碍物密集空间中自主探索需要确保效率的同时，避免碰撞的策略。本文探讨了一种新型平台无关强化学习框架，该框架结合基于图神经网络的下一路径点选择策略，并设有安全过滤器确保安全移动。具体来说，神经网络通过近端策略优化（PPO）算法进行强化学习训练，以最大化探索效率，同时最小化安全滤波干预。因此，当策略提出不可行的行动时，安全过滤器会用最接近的可行替代方案覆盖该措施，确保系统行为一致。此外，本文介绍了一个由潜在场塑造的奖励函数，该函数同时考虑了智能体接近未探索区域的程度以及到达这些区域后预期获得的信息收益。所提框架结合了基于强化学习的探索策略的适应性与显式安全机制所提供的可靠性。该功能在实现基于学习的策略部署到实际环境中运行的机器人平台上起着关键作用。在模拟和实验室环境中进行的实验中，经过大量评估表明，该方法能够在杂乱空间中实现高效且安全的探索。
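
The override behaviour of the safety filter can be sketched as a nearest-feasible projection over a finite set of candidate waypoints. This is a simplified illustration: the feasibility test, the candidate set, and the Euclidean distance metric are assumptions, and the paper's filter may operate on a different action representation.

```python
import math

def safety_filter(proposed, candidates, is_feasible):
    """Return the proposed waypoint if feasible, else the closest feasible candidate."""
    if is_feasible(proposed):
        return proposed
    feasible = [c for c in candidates if is_feasible(c)]
    if not feasible:
        return None  # no safe alternative available
    return min(feasible, key=lambda c: math.dist(c, proposed))

# Hypothetical scene: a circular obstacle of radius 1 centred at the origin.
def outside_obstacle(point):
    return math.hypot(point[0], point[1]) > 1.0

waypoints = [(2.0, 0.0), (0.0, 3.0), (-2.0, 0.0)]
print(safety_filter((0.5, 0.0), waypoints, outside_obstacle))
```

Counting how often the filter has to intervene gives a natural penalty term, which matches the abstract's goal of minimizing safety filter interventions during training.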

Terra Nova: A Comprehensive Challenge Environment for Intelligent Agents

Terra Nova：智能代理的全面挑战环境

Authors: Trevor McInroe
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.15378
Pdf link: https://arxiv.org/pdf/2511.15378
Abstract We introduce Terra Nova, a new comprehensive challenge environment (CCE) for reinforcement learning (RL) research inspired by Civilization V. A CCE is a single environment in which multiple canonical RL challenges (e.g., partial observability, credit assignment, representation learning, enormous action spaces, etc.) arise simultaneously. Mastery therefore demands integrated, long-horizon understanding across many interacting variables. We emphasize that this definition excludes challenges that only aggregate unrelated tasks in independent, parallel streams (e.g., learning to play all Atari games at once). These aggregated multitask benchmarks primarily assess whether an agent can catalog and switch among unrelated policies rather than test an agent's ability to perform deep reasoning across many interacting challenges.
中文摘要 我们介绍Terra Nova，一个受《文明V》启发的新型综合挑战环境（CCE），用于强化学习（RL）研究。CCE是一个单一环境，同时出现多个典型的强化学习挑战（例如，部分可观测性、学分分配、表示学习、巨大动作空间等）。因此，掌握需要对多个相互作用变量进行整合且长远的理解。我们强调，该定义排除了那些仅将无关任务合并成独立并行流的挑战（例如，学习同时玩所有雅达利游戏）。这些聚合的多任务基准主要评估代理是否能够在无关策略之间进行目录和切换，而不是测试代理在众多交互挑战中进行深度推理的能力。

Communication-Pipelined Split Federated Learning for Foundation Model Fine-Tuning in UAV Networks

通信流水线分流式联合学习用于无人机网络基础模型微调

Authors: Zizhen Zhou, Ying-Chang Liang, Yanyu Cheng, Wei Yang Bryan Lim
Subjects: Information Theory (cs.IT)
Arxiv link: https://arxiv.org/abs/2511.15404
Pdf link: https://arxiv.org/pdf/2511.15404
Abstract Deploying foundation models (FMs) on uncrewed aerial vehicles (UAVs) promises broad "low-altitude economy" applications. Split federated learning (SFL)-based fine-tuning leverages distributed data while keeping raw data local and reduces client-side burden by partitioning the model between client and server. However, the per-round training latency is dominated by stragglers. Training paradigms featuring parallel gradient transmission (GT) allocate dedicated portions of downlink communication resources to each client. They may leave resources idle and suffer from prolonged GT latency, especially in UAV networks, where the communication latency typically far exceeds the computation latency. To address this, we propose a sequential GT paradigm, where the server dedicates all downlink resources to the current GT. We further propose communication-pipelined SFL (CPSFL), characterized by downlink GT priority scheduling and intra-round asynchronous training. We investigate CPSFL-based LoRA fine-tuning of FMs in UAV networks and formulate an optimization problem to minimize a weighted sum of per-round training latency and worst-case client energy consumption by optimizing the split point selection (SPS) and the computing and communication resource allocation (CCRA) (the uplink bandwidth allocation and the server computing frequency allocation). To solve this problem, we develop an attention-based deep reinforcement learning (DRL) framework, where the base station agent decides the split point and the CCRA in each round by leveraging previous round information, including UAV trajectories. Simulation results show that the proposed DRL-based CPSFL scheme outperforms the parallel GT benchmarks, the ablation variants, and the fixed CCRA scheme, while approaching the best fixed-SPS scheme.
中文摘要 在无人机（UAV）上部署基础模型（FM）有望实现广泛的“低空经济”应用。基于分体联合学习（SFL）的微调利用分布式数据，同时保持原始数据本地化，并通过在客户端和服务器之间划分模型来减轻客户端负担。然而，每回合的训练延迟主要由落后者主导。具有并行梯度传输（GT）的训练范式为每个客户端分配专用的下行通信资源。它们可能使资源闲置，并遭受较长的GT延迟，尤其是在无人机网络中，通信延迟通常远超计算延迟。为此，我们提出了一种顺序GT范式，服务器将所有下行资源专用于当前GT。我们还提出了通信流水线SFL（CPSFL），其特点是下行GT优先调度和轮内异步训练。我们研究基于CPSFL的LoRA对无人机网络中FM的微调，并通过优化分点选择（SPS）和计算与通信资源分配（CCRA）（上行带宽分配和服务器计算频率分配）来最小化每轮训练延迟和最坏情况下客户端能耗的加权总和，构建优化问题。为解决这一问题，我们开发了基于注意力的深度强化学习（DRL）框架，基站代理利用前几轮的信息（包括无人机轨迹）决定每轮的分点和CCRA。模拟结果显示，基于DRL的CPSFL方案优于并行GT基准、消融变体、固定CCRA方案，同时接近最佳固定SPS方案。

Meta-Black-Box Optimization with Bi-Space Landscape Analysis and Dual-Control Mechanism for SAEA

SAEA的元黑匣子优化结合双空间景观分析和双重控制机制

Authors: Yukun Du, Haiyue Yu, Xiaotong Xie, Yan Zheng, Lixin Zhan, Yudong Du, Chongshuang Hu, Boxuan Wang, Jiang Jiang
Subjects: Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/2511.15551
Pdf link: https://arxiv.org/pdf/2511.15551
Abstract Surrogate-Assisted Evolutionary Algorithms (SAEAs) are widely used for expensive Black-Box Optimization. However, their reliance on rigid, manually designed components such as infill criteria and evolutionary strategies during the search process limits their flexibility across tasks. To address these limitations, we propose Dual-Control Bi-Space Surrogate-Assisted Evolutionary Algorithm (DB-SAEA), a Meta-Black-Box Optimization (MetaBBO) framework tailored for multi-objective problems. DB-SAEA learns a meta-policy that jointly regulates candidate generation and infill criterion selection, enabling dual control. The bi-space Exploratory Landscape Analysis (ELA) module in DB-SAEA adopts an attention-based architecture to capture optimization states from both true and surrogate evaluation spaces, while ensuring scalability across problem dimensions, population sizes, and objectives. Additionally, we integrate TabPFN as the surrogate model for accurate and efficient prediction with uncertainty estimation. The framework is trained via reinforcement learning, leveraging parallel sampling and centralized training to enhance efficiency and transferability across tasks. Experimental results demonstrate that DB-SAEA not only outperforms state-of-the-art baselines across diverse benchmarks, but also exhibits strong zero-shot transfer to unseen tasks with higher-dimensional settings. This work introduces the first MetaBBO framework with dual-level control over SAEAs and a bi-space ELA that captures surrogate model information.
中文摘要 代理辅助进化算法（SAEA）被广泛用于昂贵的黑箱优化。然而，它们依赖于严格且手动设计的组件，如填充条件和搜索过程中的演进策略，限制了它们在各任务间的灵活性。为解决这些局限性，我们提出了双控制双空间替代辅助进化算法（DB-SAEA），这是一个面向多目标问题的元黑箱优化（MetaBBO）框架。DB-SAEA学习一套元政策，共同规范候选人生成和填充标准选择，实现双重控制。DB-SAEA中的双空间探索性景观分析（ELA）模块采用基于注意力的架构，能够捕捉真实和代理评估空间的优化状态，同时确保跨问题维度、种群规模和目标的可扩展性。此外，我们还将TabPFN集成为替代模型，实现准确高效的预测和不确定性估计。该框架通过强化学习进行训练，利用并行抽样和集中培训，提升任务间的效率和可迁移性。实验结果表明，DB-SAEA不仅在多种基准测试中优于最先进的基线，还在高维环境中展现出强烈的零样本转移能力，适用于未见任务。这项工作引入了首个具备SAEA双级控制和双空间ELA的MetaBBO框架，能够捕捉代理模型信息。

SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models

SRPO：视觉-语言-行动模型的自指政策优化

Authors: Senyu Fei, Siyin Wang, Li Ji, Ao Li, Shiduo Zhang, Liming Liu, Jinlong Hou, Jingjing Gong, Xianzhong Zhao, Xipeng Qiu
Subjects: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.15605
Pdf link: https://arxiv.org/pdf/2511.15605
Abstract Vision-Language-Action (VLA) models excel in robotic manipulation but are constrained by their heavy reliance on expert demonstrations, leading to demonstration bias and limiting performance. Reinforcement learning (RL) is a vital post-training strategy to overcome these limits, yet current VLA-RL methods, including group-based optimization approaches, are crippled by severe reward sparsity. Relying on binary success indicators wastes valuable information in failed trajectories, resulting in low training efficiency. To solve this, we propose Self-Referential Policy Optimization (SRPO), a novel VLA-RL framework. SRPO eliminates the need for external demonstrations or manual reward engineering by leveraging the model's own successful trajectories, generated within the current training batch, as a self-reference. This allows us to assign a progress-wise reward to failed attempts. A core innovation is the use of latent world representations to measure behavioral progress robustly. Instead of relying on raw pixels or requiring domain-specific fine-tuning, we utilize the compressed, transferable encodings from a world model's latent space. These representations naturally capture progress patterns across environments, enabling accurate, generalized trajectory comparison. Empirical evaluations on the LIBERO benchmark demonstrate SRPO's efficiency and effectiveness. Starting from a supervised baseline with 48.9% success, SRPO achieves a new state-of-the-art success rate of 99.2% in just 200 RL steps, representing a 103% relative improvement without any extra supervision. Furthermore, SRPO shows substantial robustness, achieving a 167% performance improvement on the LIBERO-Plus benchmark.
中文摘要 视觉-语言-行动（VLA）模型在机器人操作方面表现出色，但由于高度依赖专家演示，导致演示偏差和性能受限。强化学习（RL）是克服这些限制的重要训练后策略，但当前的VLA-RL方法，包括基于群体的优化方法，因严重的奖励稀疏性而受限。依赖二元成功指标会浪费失败轨迹中的宝贵信息，导致训练效率低下。为解决这个问题，我们提出了自指策略优化（SRPO），一种新的VLA-RL框架。SRPO通过利用模型自身在当前训练批次中生成的成功轨迹作为自我参照，消除了对外部演示或人工奖励工程的需求。这让我们能够为失败的尝试分配一个按进度划分的奖励。一项核心创新是利用潜在世界表征来稳健地衡量行为进展。我们不依赖原始像素或需要特定领域的微调，而是利用世界模型潜在空间中压缩的可转移编码。这些表示自然捕捉了跨环境的进展模式，从而实现准确、广义的轨迹比较。对LIBERO基准的实证评估显示SRPO的效率和有效性。从监督基线开始，成功率为48.9%，SRPO在仅200个强化学习步骤内达到了99.2%的成功率，在无额外监督的情况下相对提升了103%。此外，SRPO表现出显著的稳健性，在LIBERO-Plus基准测试上实现了167%的性能提升。
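
The "103% relative improvement" follows from the two success rates quoted in the abstract, as a quick check shows. The helper name is hypothetical; the inputs are the 48.9% baseline and the 99.2% final success rate.

```python
def relative_improvement(baseline, new):
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0

# (99.2 - 48.9) / 48.9 is roughly 1.029, i.e. about 103 percent.
print(round(relative_improvement(48.9, 99.2)))
```

The figure is relative to the baseline, so it is distinct from the absolute gain of 50.3 percentage points.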

Continual Reinforcement Learning for Cyber-Physical Systems: Lessons Learned and Open Challenges

网络物理系统的持续强化学习：经验教训与开放挑战

Authors: Kim N. Nolle, Ivana Dusparic, Rhodri Cusack, Vinny Cahill
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.15652
Pdf link: https://arxiv.org/pdf/2511.15652
Abstract Continual learning (CL) is a branch of machine learning that aims to enable agents to adapt and generalise previously learned abilities so that these can be reapplied to new tasks or environments. This is particularly useful in multi-task settings or in non-stationary environments, where the dynamics can change over time. This is particularly relevant in cyber-physical systems such as autonomous driving. However, despite recent advances in CL, successfully applying it to reinforcement learning (RL) is still an open problem. This paper highlights open challenges in continual RL (CRL) based on experiments in an autonomous driving environment. In this environment, the agent must learn to successfully park in four different scenarios corresponding to parking spaces oriented at varying angles. The agent is successively trained in these four scenarios one after another, representing a CL environment, using Proximal Policy Optimisation (PPO). These experiments exposed a number of open challenges in CRL: finding suitable abstractions of the environment, oversensitivity to hyperparameters, catastrophic forgetting, and efficient use of neural network capacity. Based on these identified challenges, we present open research questions that are important to be addressed for creating robust CRL systems. In addition, the identified challenges call into question the suitability of neural networks for CL. We also identify the need for interdisciplinary research, in particular between computer science and neuroscience.
中文摘要 持续学习（CL）是机器学习的一个分支，旨在使智能体能够适应和泛化已学到的能力，从而重新应用于新的任务或环境。这在多任务环境或非平稳环境中尤为有用，因为动态会随时间变化。这在自动驾驶等网络物理系统中尤为重要。然而，尽管持续学习（CL）最近取得了一些进展，成功将其应用于强化学习（RL）仍是一个悬而未决的问题。本文强调了基于自动驾驶环境实验的持续强化学习（CRL）面临的挑战。在这种环境下，代理必须学会在四种不同的场景中成功停车，这些场景对应于不同角度的停车位。代理会依次在这四个场景中进行训练，代表CL环境，使用近端策略优化（PPO）。这些实验揭示了CRL中许多未解决的挑战：寻找合适的环境抽象、对超参数的过度敏感、灾难性的遗忘以及神经网络容量的高效利用。基于这些已识别的挑战，我们提出了开放性研究问题，这些问题对于创建稳健的CRL系统至关重要。此外，已识别的挑战也质疑神经网络对持续学习（CL）的适用性。我们还认识到跨学科研究的需求，特别是计算机科学与神经科学之间的研究。

VisPlay: Self-Evolving Vision-Language Models from Images

VisPlay：从图像中自我进化的视觉语言模型

Authors: Yicheng He, Chengsong Huang, Zongxia Li, Jiaxin Huang, Yonghui Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.15661
Pdf link: https://arxiv.org/pdf/2511.15661
Abstract Reinforcement learning (RL) provides a principled framework for improving Vision-Language Models (VLMs) on complex reasoning tasks. However, existing RL approaches often rely on human-annotated labels or task-specific heuristics to define verifiable rewards, both of which are costly and difficult to scale. We introduce VisPlay, a self-evolving RL framework that enables VLMs to autonomously improve their reasoning abilities using large amounts of unlabeled image data. Starting from a single base VLM, VisPlay assigns the model into two interacting roles: an Image-Conditioned Questioner that formulates challenging yet answerable visual questions, and a Multimodal Reasoner that generates silver responses. These roles are jointly trained with Group Relative Policy Optimization (GRPO), which incorporates diversity and difficulty rewards to balance the complexity of generated questions with the quality of the silver answers. VisPlay scales efficiently across two model families. When trained on Qwen2.5-VL and MiMo-VL, VisPlay achieves consistent improvements in visual reasoning, compositional generalization, and hallucination reduction across eight benchmarks, including MM-Vet and MMMU, demonstrating a scalable path toward self-evolving multimodal intelligence. The project page is available at this https URL
中文摘要 强化学习（RL）为改进视觉语言模型（VLMs）在复杂推理任务中的表现提供了原则框架。然而，现有的强化学习方法通常依赖人工注释标签或任务特定的启发式方法来定义可验证的奖励，而这些方法成本高昂且难以扩展。我们介绍VisPlay，一种自我演进的强化学习框架，使VLM能够利用大量未标记图像数据自主提升推理能力。从单一基底VLM出发，VisPlay将模型分配为两个交互角色：一个是生成具有挑战性但可回答视觉问题的图像条件提问器，另一个是生成银色答案的多模态推理器。这些角色与Group Relative Policy Optimization（GRPO）联合培训，GRPO结合多样性和难度奖励，以平衡生成问题的复杂性与银色答案的质量。VisPlay 在两个模型族之间高效扩展。在Qwen2.5-VL和MiMo-VL上训练时，VisPlay在包括MM-Vet和MMMU在内的八个基准测试中持续提升视觉推理、组合泛化和幻觉减少，展示了迈向自我演化多模态智能的可扩展路径。项目页面可在此 https URL 访问。

DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models

DeepThinkVLA：增强视觉-语言-行动模型的推理能力

Authors: Cheng Yin, Yankai Lin, Wang Xu, Sikyuen Tam, Xiangrui Zeng, Zhiyuan Liu, Zhouping Yin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.15669
Pdf link: https://arxiv.org/pdf/2511.15669
Abstract Enabling Vision-Language-Action (VLA) models to "think before acting" via Chain-of-Thought (CoT) is a promising path to overcoming the data-hungry nature of end-to-end robot policies. However, progress is stalled by a fundamental conflict: existing models use a single autoregressive decoder for both sequential CoT reasoning and high-dimensional, parallelizable robot actions. This architectural mismatch degrades motor control and fails to forge a strong causal link between thought and action. We introduce DeepThinkVLA, which resolves this conflict through a tightly integrated architecture and training strategy. Architecturally, our hybrid-attention decoder generates sequential CoT with causal attention and then switches to bidirectional attention for fast, parallel decoding of action vectors. This design is complemented by a two-stage training pipeline: we first use Supervised Fine-Tuning (SFT) to teach the model foundational reasoning, then apply Reinforcement Learning (RL) with task-success rewards to causally align the full reasoning-action sequence with desired outcomes. This synergy leads to state-of-the-art performance, achieving a 97.0% success rate on the LIBERO benchmark. Our ablations confirm the design's effectiveness: the hybrid architecture alone outperforms standard decoders by 15.5%, and the final RL stage provides a crucial 2% boost to secure top performance.
中文摘要 通过思维链（Chain-of-Thought，CoT）让视觉-语言-动作（VLA）模型“先思考再行动”，是克服端到端机器人策略数据需求量大这一问题的有希望的途径。然而，进展因一个根本性冲突而停滞：现有模型用同一个自回归解码器同时处理顺序的CoT推理和高维、可并行的机器人动作。这种架构上的不匹配削弱了运动控制，也未能在思考与动作之间建立强有力的因果联系。我们提出DeepThinkVLA，通过紧密集成的架构和训练策略解决这一冲突。在架构上，我们的混合注意力解码器先用因果注意力生成顺序的CoT，然后切换为双向注意力，对动作向量进行快速并行解码。该设计辅以两阶段训练流程：首先使用监督微调（SFT）让模型学会基础推理，然后应用以任务成功为奖励的强化学习（RL），使完整的推理-动作序列与期望结果在因果上对齐。这种协同带来了最先进的性能，在LIBERO基准上实现了97.0%的成功率。我们的消融实验验证了该设计的有效性：仅混合架构就比标准解码器高出15.5%，最后的强化学习阶段又带来关键的2%提升，从而取得最佳性能。
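
To make the hybrid-attention idea in the abstract concrete, here is a minimal sketch of one way such a mask could be built: reasoning (CoT) positions use causal attention, while action positions attend to the whole reasoning prefix and to each other. The function name, the token layout (reasoning tokens first, then action tokens), and the boolean convention are assumptions for illustration, not the authors' implementation.

```python
def hybrid_attention_mask(num_reasoning: int, num_action: int) -> list[list[bool]]:
    """Return mask[i][j], True when query position i may attend to key j.

    Assumed layout: positions [0, num_reasoning) are CoT tokens and the
    remaining num_action positions are action slots.
    """
    total = num_reasoning + num_action
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if i < num_reasoning:
                # CoT tokens are causal: they see themselves and earlier tokens.
                mask[i][j] = j <= i
            else:
                # Action slots are bidirectional: they see the full reasoning
                # prefix and every other action slot, so they can be decoded
                # in parallel.
                mask[i][j] = True
    return mask


if __name__ == "__main__":
    # Print the mask for 3 reasoning tokens followed by 2 action slots.
    for row in hybrid_attention_mask(num_reasoning=3, num_action=2):
        print("".join("1" if allowed else "0" for allowed in row))
```

In this sketch the first three rows form a lower-triangular block and the last two rows are fully open, which is the structural difference between the sequential and parallel parts of the decoder.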

The Impact of Quantization on Large Reasoning Model Reinforcement Learning

量化对大型推理模型强化学习的影响

Authors: Medha Kumar, Zifei Xu, Xin Wang, Tristan Webb
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.15694
Pdf link: https://arxiv.org/pdf/2511.15694
Abstract Strong reasoning capabilities can now be achieved by large-scale reinforcement learning (RL) without any supervised fine-tuning. Although post-training quantization (PTQ) and quantization-aware training (QAT) are well studied in the context of fine-tuning, how quantization impacts RL in large reasoning models (LRMs) remains an open question. To answer this question, we conducted systematic experiments and discovered a significant gap in reasoning performance on mathematical benchmarks between post-RL quantized models and their quantization-aware RL optimized counterparts. Our findings suggest that quantization-aware RL training negatively impacted the learning process, whereas PTQ and QLoRA led to greater performance.
中文摘要 如今无需任何监督微调，仅通过大规模强化学习（RL）即可获得强大的推理能力。尽管训练后量化（PTQ）和量化感知训练（QAT）在微调场景下已有充分研究，但量化如何影响大型推理模型（LRM）中的强化学习仍是一个悬而未决的问题。为回答这一问题，我们进行了系统实验，发现强化学习后再量化的模型与经量化感知强化学习优化的模型在数学基准上的推理性能存在显著差距。我们的结果表明，量化感知强化学习训练对学习过程产生了负面影响，而PTQ和QLoRA则带来了更好的性能。
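
As background for the post-training quantization (PTQ) side of this comparison, the sketch below shows symmetric round-to-nearest quantization of a list of weights after training is finished. It is a generic illustration, not the paper's setup: the function name, the single per-tensor scale, the bit width, and the example weights are all assumptions.

```python
def quantize_symmetric(weights, num_bits=4):
    """Round-to-nearest symmetric quantization with one scale per tensor.

    Returns the integer codes, the scale, and the dequantized weights.
    """
    qmax = 2 ** (num_bits - 1) - 1  # e.g. 7 for 4 bits, 3 for 3 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round each weight to the nearest grid point and clamp to [-qmax, qmax].
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    dequantized = [c * scale for c in codes]
    return codes, scale, dequantized


if __name__ == "__main__":
    # Hypothetical weights; the rounding error per weight is at most scale / 2.
    codes, scale, dequantized = quantize_symmetric([1.0, -0.4, 0.2, 0.0], num_bits=3)
    print(codes, scale, dequantized)
```

Quantization-aware training differs in that this rounding is simulated inside the training loop, so the optimizer sees the quantization error while the weights are still being updated.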

GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization

GeoVista：面向地理定位的网络增强智能体视觉推理

Authors: Yikun Wang, Zuyan Liu, Ziyi Wang, Pengfei Liu, Han Hu, Yongming Rao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.15705
Pdf link: https://arxiv.org/pdf/2511.15705
Abstract Current research on agentic visual reasoning enables deep multimodal understanding but primarily focuses on image manipulation tools, leaving a gap toward more general-purpose agentic models. In this work, we revisit the geolocalization task, which requires not only nuanced visual grounding but also web search to confirm or refine hypotheses during reasoning. Since existing geolocalization benchmarks fail to meet the need for high-resolution imagery and the localization challenge for deep agentic reasoning, we curate GeoBench, a benchmark that includes photos and panoramas from around the world, along with a subset of satellite images of different cities to rigorously evaluate the geolocalization ability of agentic models. We also propose GeoVista, an agentic model that seamlessly integrates tool invocation within the reasoning loop, including an image-zoom-in tool to magnify regions of interest and a web-search tool to retrieve related web information. We develop a complete training pipeline for it, including a cold-start supervised fine-tuning (SFT) stage to learn reasoning patterns and tool-use priors, followed by a reinforcement learning (RL) stage to further enhance reasoning ability. We adopt a hierarchical reward to leverage multi-level geographical information and improve overall geolocalization performance. Experimental results show that GeoVista surpasses other open-source agentic models on the geolocalization task greatly and achieves performance comparable to closed-source models such as Gemini-2.5-flash and GPT-5 on most metrics.
中文摘要 当前关于智能体视觉推理的研究实现了深度多模态理解，但主要聚焦于图像操作工具，距离更通用的智能体模型仍有差距。在本工作中，我们重新审视地理定位任务，该任务不仅需要细致的视觉定位（visual grounding），还需要在推理过程中通过网络搜索来确认或修正假设。由于现有的地理定位基准无法满足对高分辨率图像的需求，也无法为深度智能体推理提供足够的定位挑战，我们构建了GeoBench，该基准包含来自世界各地的照片和全景图，以及部分不同城市的卫星图像，用于严格评估智能体模型的地理定位能力。我们还提出了GeoVista，一种将工具调用无缝整合进推理循环的智能体模型，其工具包括用于放大感兴趣区域的图像放大工具和用于检索相关网络信息的网络搜索工具。我们为其开发了完整的训练流程，包括冷启动监督微调（SFT）阶段，用于学习推理模式和工具使用先验，随后是强化学习（RL）阶段，用于进一步提升推理能力。我们采用分层奖励来利用多层次地理信息，提升整体地理定位性能。实验结果显示，GeoVista在地理定位任务上大幅超越其他开源智能体模型，并在大多数指标上取得了与Gemini-2.5-flash、GPT-5等闭源模型相当的性能。

Keyword: diffusion policy

Theoretical Closed-loop Stability Bounds for Dynamical System Coupled with Diffusion Policies

动力系统与扩散策略耦合的理论闭环稳定性界限

Authors: Gabriel Lauzier, Alexandre Girard, François Ferland
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.15520
Pdf link: https://arxiv.org/pdf/2511.15520
Abstract Diffusion Policy has shown great performance in robotic manipulation tasks under stochastic perturbations, due to its ability to model multimodal action distributions. Nonetheless, its reliance on a computationally expensive reverse-time diffusion (denoising) process, for action inference, makes it challenging to use for real-time applications where quick decision-making is mandatory. This work studies the possibility of conducting the denoising process only partially before executing an action, allowing the plant to evolve according to its dynamics in parallel to the reverse-time diffusion dynamics ongoing on the computer. In a classical diffusion policy setting, the plant dynamics are usually slow and the two dynamical processes are uncoupled. Here, we investigate theoretical bounds on the stability of closed-loop systems using diffusion policies when the plant dynamics and the denoising dynamics are coupled. The contribution of this work gives a framework for faster imitation learning and a metric that yields if a controller will be stable based on the variance of the demonstrations.
中文摘要 扩散策略（Diffusion Policy）能够建模多模态动作分布，因而在存在随机扰动的机器人操作任务中表现出色。然而，其动作推断依赖计算开销很大的逆时间扩散（去噪）过程，这使其难以用于必须快速决策的实时应用。本工作研究在执行动作前仅部分完成去噪过程的可能性，使被控对象按自身动力学演化的同时，计算机上的逆时间扩散动力学并行进行。在经典的扩散策略设定中，被控对象的动力学通常较慢，两个动力学过程是解耦的。本文研究当被控对象动力学与去噪动力学相互耦合时，采用扩散策略的闭环系统稳定性的理论界限。本工作的贡献是给出了一个实现更快模仿学习的框架，以及一个基于演示数据方差判断控制器是否稳定的度量。