Arxiv Papers of Today

Generated: 2025-11-18 16:33:33 (UTC+8); arXiv release time: 2025-11-18 20:00 EST (2025-11-19 09:00 UTC+8)

74 related papers today

Keyword: reinforcement learning

Mind Your Entropy: From Maximum Entropy to Trajectory Entropy-Constrained RL

注意熵：从最大熵到轨迹熵约束强化学习

Authors: Guojian Zhan, Likun Wang, Pengcheng Wang, Feihong Zhang, Jingliang Duan, Masayoshi Tomizuka, Shengbo Eben Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2511.11592
Pdf link: https://arxiv.org/pdf/2511.11592
Abstract Maximum entropy has become a mainstream off-policy reinforcement learning (RL) framework for balancing exploitation and exploration. However, two bottlenecks still limit further performance improvement: (1) non-stationary Q-value estimation caused by jointly injecting entropy and updating its weighting parameter, i.e., temperature; and (2) short-sighted local entropy tuning that adjusts temperature only according to the current single-step entropy, without considering the effect of cumulative entropy over time. In this paper, we extend the maximum entropy framework by proposing a trajectory entropy-constrained reinforcement learning (TECRL) framework to address these two challenges. Within this framework, we first separately learn two Q-functions, one associated with reward and the other with entropy, ensuring clean and stable value targets unaffected by temperature updates. Then, the dedicated entropy Q-function, explicitly quantifying the expected cumulative entropy, enables us to enforce a trajectory entropy constraint and consequently control the policy's long-term stochasticity. Building on this TECRL framework, we develop a practical off-policy algorithm, DSAC-E, by extending the state-of-the-art distributional soft actor-critic with three refinements (DSAC-T). Empirical results on the OpenAI Gym benchmark demonstrate that our DSAC-E can achieve higher returns and better stability.
中文摘要 最大熵已成为主流的非策略强化学习（RL）框架，用于平衡利用与探索。然而，仍有两个瓶颈限制了进一步的性能提升：（1）由于联合注入熵并更新其加权参数（即温度）导致的非平稳Q值估计;以及（2）仅根据当前单步熵调整温度的短视局部熵调优，而不考虑累积熵随时间的变化。本文通过提出轨迹熵约束强化学习（TECRL）框架，扩展了最大熵框架，以应对这两个挑战。在此框架下，我们首先分别学习两个Q函数，一个与奖励相关，另一个与熵相关，确保目标值干净且稳定，不受温度更新影响。然后，专门的熵Q函数，明确量化了预期的累积熵，使我们能够强制执行轨迹熵约束，从而控制政策的长期随机性。基于该 TECRL 框架，我们通过对最先进的分布式软演员-批判（DSAC-T）进行了三次改进，开发了实用的非策略算法 DSAC-E。OpenAI Gym基准的实证结果表明，我们的DSAC-E能够实现更高的回报和更好的稳定性。
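One way to read the two-critic decomposition described in this abstract is sketched below. The symbols $Q_r^{\pi}$, $Q_{\mathcal{H}}^{\pi}$, the discount $\gamma$, the entropy budget $\bar{\mathcal{H}}$, the time indexing, and the direction of the inequality are all assumptions made for illustration; the abstract does not give the paper's exact formulation.

```latex
\begin{align}
Q_r^{\pi}(s,a) &= \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}\, r(s_t,a_t)\,\Big|\, s_0=s,\ a_0=a\Big],\\
Q_{\mathcal{H}}^{\pi}(s,a) &= \mathbb{E}_{\pi}\Big[\sum_{t=1}^{\infty}\gamma^{t}\,\mathcal{H}\big(\pi(\cdot\mid s_t)\big)\,\Big|\, s_0=s,\ a_0=a\Big],\\
\max_{\pi}\ & \mathbb{E}_{s,\,a\sim\pi}\big[Q_r^{\pi}(s,a)\big]
\quad\text{s.t.}\quad
\mathbb{E}_{s,\,a\sim\pi}\big[Q_{\mathcal{H}}^{\pi}(s,a)\big]\ \ge\ \bar{\mathcal{H}}.
\end{align}
```

For contrast, the usual soft actor-critic target $r + \gamma\big(Q(s',a') - \alpha\log\pi(a'\mid s')\big)$ contains the temperature $\alpha$ directly, so the target moves whenever $\alpha$ is updated. In the sketch above neither critic's target depends on $\alpha$.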

Machine learning-based cloud resource allocation algorithms: a comprehensive comparative review

基于机器学习的云资源分配算法：全面比较综述

Authors: Deep Bodra, Sushil Khairnar
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.11603
Pdf link: https://arxiv.org/pdf/2511.11603
Abstract Cloud resource allocation has emerged as a major challenge in modern computing environments, with organizations struggling to manage complex, dynamic workloads while optimizing performance and cost efficiency. Traditional heuristic approaches prove inadequate for handling the multi-objective optimization demands of existing cloud infrastructures. This paper presents a comparative analysis of state-of-the-art artificial intelligence and machine learning algorithms for resource allocation. We systematically evaluate 10 algorithms across four categories: Deep Reinforcement Learning approaches, Neural Network architectures, Traditional Machine Learning enhanced methods, and Multi-Agent systems. Analysis of published results demonstrates significant performance improvements across multiple metrics including makespan reduction, cost optimization, and energy efficiency gains compared to traditional methods. The findings reveal that hybrid architectures combining multiple artificial intelligence and machine learning techniques consistently outperform single-method approaches, with edge computing environments showing the highest deployment readiness. Our analysis provides critical insights for both academic researchers and industry practitioners seeking to implement next-generation cloud resource allocation strategies in increasingly complex and dynamic computing environments.
中文摘要 云资源分配已成为现代计算环境中的一大挑战，组织在管理复杂且动态的工作负载时，努力优化性能和成本效益。传统的启发式方法无法满足现有云基础设施的多目标优化需求。本文对最先进的人工智能和机器学习算法进行资源分配的比较分析。我们系统地评估了10种算法，涵盖四大类：深度强化学习方法、神经网络架构、传统机器学习增强方法和多智能体系统。对已发表结果的分析显示，在多个指标上，包括使用寿命缩短、成本优化和能效提升，相较于传统方法，性能显著提升。研究结果显示，结合多种人工智能和机器学习技术的混合架构，持续优于单一方法，边缘计算环境展现出最高的部署准备度。我们的分析为学术研究人员和行业从业者提供了关键洞见，帮助他们在日益复杂和动态的计算环境中实施下一代云资源分配策略。

Clustering-Based Weight Orthogonalization for Stabilizing Deep Reinforcement Learning

基于聚类的权重正交化用于深度强化学习的稳定

Authors: Guoqing Ma, Yuhan Zhang, Yuming Dai, Guangfu Hao, Yang Chen, Shan Yu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.11607
Pdf link: https://arxiv.org/pdf/2511.11607
Abstract Reinforcement learning (RL) has made significant advancements, achieving superhuman performance in various tasks. However, RL agents often operate under the assumption of environmental stationarity, which poses a great challenge to learning efficiency since many environments are inherently non-stationary. This non-stationarity results in the requirement of millions of iterations, leading to low sample efficiency. To address this issue, we introduce the Clustering Orthogonal Weight Modified (COWM) layer, which can be integrated into the policy network of any RL algorithm and mitigate non-stationarity effectively. The COWM layer stabilizes the learning process by employing clustering techniques and a projection matrix. Our approach not only improves learning speed but also reduces gradient interference, thereby enhancing the overall learning efficiency. Empirically, the COWM outperforms state-of-the-art methods and achieves improvements of 9% and 12.6% on the vision-based and state-based DMControl benchmarks, respectively. It also shows robustness and generality across various algorithms and tasks.
中文摘要 强化学习（RL）取得了显著进步，在各种任务中达到了超人般的表现。然而，强化学习代理通常假设环境平稳性，这对学习效率构成巨大挑战，因为许多环境本质上是非静止的。这种非平稳性导致需要数百万次迭代，导致采样效率较低。为解决这一问题，我们引入了聚类正交权重修改（COWM）层，该层可集成到任何强化学习算法的策略网络中，有效缓解非平稳性。COWM层通过采用聚类技术和投影矩阵来稳定学习过程。我们的方法不仅提升学习速度，还减少梯度干扰，从而提升整体学习效率。从实证角度看，COWM在基于视觉和状态的DMControl基准测试中表现优于最先进方法，分别提升了9%和12.6%。它还展示了在各种算法和任务中的鲁棒性和通用性。

Environment-Aware Transfer Reinforcement Learning for Sustainable Beam Selection

环境感知迁移强化学习以实现可持续波束选择

Authors: Dariush Salami, Ramin Hashemi, Parham Kazemi, Mikko A. Uusitalo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Arxiv link: https://arxiv.org/abs/2511.11647
Pdf link: https://arxiv.org/pdf/2511.11647
Abstract This paper presents a novel and sustainable approach for improving beam selection in 5G and beyond networks using transfer learning and Reinforcement Learning (RL). Traditional RL-based beam selection models require extensive training time and computational resources, particularly when deployed in diverse environments with varying propagation characteristics, posing a major challenge for scalability and energy efficiency. To address this, we propose modeling the environment as a point cloud, where the points represent the locations of gNodeBs (gNBs) and surrounding scatterers. By computing the Chamfer distance between point clouds, structurally similar environments can be efficiently identified, enabling the reuse of pre-trained models through transfer learning. This methodology leads to a 16x reduction in training time and computational overhead, directly contributing to energy efficiency. By minimizing the need for retraining in each new deployment, our approach significantly lowers power consumption and supports the development of green and sustainable Artificial Intelligence (AI) in wireless systems. Furthermore, it accelerates time-to-deployment, reduces carbon emissions associated with training, and enhances the viability of deploying AI-driven communication systems at the edge. Simulation results confirm that our approach maintains high performance while drastically cutting energy costs, demonstrating the potential of transfer learning to enable scalable, adaptive, and environmentally conscious RL-based beam selection strategies in dynamic and diverse propagation environments.
中文摘要 本文提出了一种新颖且可持续的方法，利用迁移学习和强化学习（RL）改进5G及更高网络中的束流选择。传统的基于强化学习的波束选择模型需要大量的训练时间和计算资源，尤其是在不同传播特性的环境中部署时，这对可扩展性和能效构成重大挑战。为此，我们提出将环境建模为点云，每个点代表gNodeB（gNBs）及其周围散射体的位置。通过计算点云间的倒角距离，可以高效识别结构相似的环境，从而通过迁移学习重用预训练模型。该方法可将训练时间和计算开销减少16倍，直接提升能源效率。通过减少每次新部署中的再培训需求，我们的方法显著降低了功耗，支持无线系统中绿色可持续人工智能（AI）的发展。此外，它加快了部署时间，减少了与培训相关的碳排放，并增强了在边缘部署AI驱动通信系统的可行性。仿真结果证实，我们的方法在大幅降低能源成本的同时保持高性能，展示了迁移学习在动态多样传播环境中实现可扩展、自适应且环保意识强化学习束流选择策略的潜力。
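The environment-matching step in this abstract relies on the Chamfer distance between point clouds. A minimal NumPy sketch of that idea follows. It assumes the symmetric, squared-Euclidean, mean-normalized variant of the distance (conventions differ, and the abstract does not say which one is used); the helper `most_similar_environment` and the 2-D site layouts are hypothetical illustrations.

```python
import numpy as np


def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between point clouds a (N, D) and b (M, D).

    Assumption: squared Euclidean distances, averaged per direction.
    """
    # Pairwise squared distances via broadcasting, shape (N, M).
    diff = a[:, None, :] - b[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    # Mean nearest-neighbour distance from a to b, plus from b to a.
    return float(sq_dist.min(axis=1).mean() + sq_dist.min(axis=0).mean())


def most_similar_environment(new_env: np.ndarray, known_envs: dict) -> str:
    """Return the name of the stored environment closest to new_env (hypothetical helper)."""
    return min(known_envs, key=lambda name: chamfer_distance(new_env, known_envs[name]))


# Hypothetical 2-D layouts of gNBs and scatterers.
site_a = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
site_b = np.array([[5.0, 5.0], [6.0, 5.0], [5.0, 6.0]])
new_site = site_a + 0.1  # a slightly shifted copy of site_a

closest = most_similar_environment(new_site, {"site_a": site_a, "site_b": site_b})
```

In a transfer-learning setup of the kind the abstract describes, `closest` would select which pre-trained model to reuse; how the paper thresholds or ranks the distances is not stated there.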

Convergence of Multiagent Learning Systems for Traffic control

多智能体学习系统在交通控制中的收敛性

Authors: Sayambhu Sen, Shalabh Bhatnagar
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2511.11654
Pdf link: https://arxiv.org/pdf/2511.11654
Abstract Rapid urbanization in cities like Bangalore has led to severe traffic congestion, making efficient Traffic Signal Control (TSC) essential. Multi-Agent Reinforcement Learning (MARL), often modeling each traffic signal as an independent agent using Q-learning, has emerged as a promising strategy to reduce average commuter delays. While prior work by Prashant L A et al. has empirically demonstrated the effectiveness of this approach, a rigorous theoretical analysis of its stability and convergence properties in the context of traffic control has not been explored. This paper bridges that gap by focusing squarely on the theoretical basis of this multi-agent algorithm. We investigate the convergence problem inherent in using independent learners for the cooperative TSC task. Utilizing stochastic approximation methods, we formally analyze the learning dynamics. The primary contribution of this work is a proof that this specific multi-agent reinforcement learning algorithm for traffic control converges under the given conditions, extending single-agent convergence proofs for asynchronous value iteration.
中文摘要 班加罗尔等城市的快速城市化导致严重的交通拥堵，使得高效的交通信号控制（TSC）变得至关重要。多智能体强化学习（MARL）通常利用Q学习将每个交通信号作为独立代理建模，已成为减少平均通勤延误的有前景策略。而此前的工作中，Prashant L A 等人通过实证证明了该方法的有效性，但其在交通控制背景下的稳定性和收敛性质的严谨理论分析尚未被深入探讨。本文通过聚焦该多智能体算法的理论基础来弥合这一空白。我们研究在合作TSC任务中使用独立学习者时固有的收敛问题。利用随机近似方法，我们形式化地分析了学习动态。这项工作的主要贡献是证明了特定多智能体强化学习算法在给定条件下能够收敛，并从单智能体对异步值迭代的收敛证明中扩展。
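The "independent learners" setting mentioned in this abstract can be illustrated with a tabular sketch in which each traffic signal keeps its own Q-table and applies the standard Q-learning update to its own local transition. The state and action sizes, the rewards (negative delays), and the step size and discount below are hypothetical values chosen for illustration, not taken from the paper.

```python
import numpy as np


def q_learning_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one standard tabular Q-learning step to a single agent's table, in place."""
    td_target = reward + gamma * np.max(q[next_state])
    q[state, action] += alpha * (td_target - q[state, action])
    return q


# Two hypothetical signals, each an independent learner with its own table.
n_states, n_actions = 4, 2
q_tables = [np.zeros((n_states, n_actions)) for _ in range(2)]

# One local transition per agent: (state, action, reward, next_state).
transitions = [(0, 1, -3.0, 2), (1, 0, -1.0, 3)]
for q, (s, a, r, s_next) in zip(q_tables, transitions):
    q_learning_update(q, s, a, r, s_next)
```

Because every agent treats the others as part of its environment, each table is updated asynchronously and without access to the other agents' values, which is the regime whose convergence the abstract says the paper analyzes.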

OSGym: Super-Scalable Distributed Data Engine for Generalizable Computer Agents

OSGym：面向通用计算机代理的超可扩展分布式数据引擎

Authors: Zengyi Qin, Jinyuan Chen, Yunze Man, Shengcao Cao, Ziqi Pang, Zhuoyuan Wang, Xin Sun, Gen Lin, Han Fang, Ling Zhu, Zixin Xie, Zibu Wei, Tianshu Ran, Haoran Geng, Xander Wu, Zachary Bright, Qizhen Sun, Rui Wang, Yuyang Cai, Song Wang, Jiace Zhao, Han Cao, Yeyang Zhou, Tianrui Liu, Ray Pan, Chongye Yang, Xiang Ren, Bo Zhang, Yutong Ban, Jitendra Malik, Brian Anthony, Pieter Abbeel
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2511.11672
Pdf link: https://arxiv.org/pdf/2511.11672
Abstract We introduce OSGym, a super-scalable distributed data engine for training agents across diverse computer-related tasks. OSGym efficiently scales to over a thousand operating system (OS) replicas at an academia-affordable cost, serving as dynamic runtime environments for intelligent agents. It offers three key advantages. (1) Scalability: Despite the intensive resource requirements of running multiple OS replicas, OSGym parallelizes over a thousand instances while maintaining operational efficiency under constrained resources, generating up to 1420 multi-turn trajectories per minute. (2) Generality and Customizability: OSGym supports a broad spectrum of tasks that run on OS platforms, including tool use, browser interactions, software engineering, and office applications, with flexible support for diverse model training algorithms. (3) Economic Viability: OSGym operates at only 0.2-0.3 USD per day per OS replica using accessible on-demand compute providers. It is fully open-source and freely available for both research and commercial use. Experiments show that OSGym enables comprehensive data collection, supervised fine-tuning, and reinforcement learning pipelines for computer agents. Models trained with OSGym outperform state-of-the-art baselines, demonstrating its potential to advance scalability and universality in future agent research.
中文摘要 我们介绍OSGym，一款超可扩展的分布式数据引擎，用于培训代理处理各种计算机相关任务。OSGym 以学术界可负担的成本高效扩展到一千多个操作系统副本，作为智能代理的动态运行环境。它有三个关键优势。（1）可扩展性：尽管运行多个操作系统副本需要大量资源，OSGym 仍能在资源有限下并行化超过一千个实例，同时保持运营效率，每分钟生成多达 1420 条多回合轨迹。（2）通用性和可定制性：OSGym 支持在操作系统平台上运行的广泛任务，包括工具使用、浏览器交互、软件工程和办公应用，并灵活支持多种模型训练算法。（3）经济可行性：OSGym 通过可访问的按需计算提供商，每个操作系统副本每天仅以 0.2-0.3 美元运营。该软件完全开源，且对研究和商业用途均可免费获取。实验表明，OSGym 能够为计算机代理实现全面的数据收集、监督微调和强化学习流水线。用OSGym训练的模型表现优于最先进的基线，展示了其在未来代理研究中推动可扩展性和通用性的潜力。

Enhancing Reinforcement Learning in 3D Environments through Semantic Segmentation: A Case Study in ViZDoom

通过语义分割增强三维环境中的强化学习：ViZDoom 案例研究

Authors: Hugo Huang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.11703
Pdf link: https://arxiv.org/pdf/2511.11703
Abstract Reinforcement learning (RL) in 3D environments with high-dimensional sensory input poses two major challenges: (1) the high memory consumption induced by memory buffers required to stabilise learning, and (2) the complexity of learning in partially observable Markov Decision Processes (POMDPs). This project addresses these challenges by proposing two novel input representations: SS-only and RGB+SS, both employing semantic segmentation on RGB colour images. Experiments were conducted in deathmatches of ViZDoom, utilizing perfect segmentation results for controlled evaluation. Our results showed that SS-only was able to reduce the memory consumption of memory buffers by at least 66.6%, and up to 98.6% when a vectorisable lossless compression technique with minimal overhead such as run-length encoding is applied. Meanwhile, RGB+SS significantly enhances RL agents' performance with the additional semantic information provided. Furthermore, we explored density-based heatmapping as a tool to visualise RL agents' movement patterns and evaluate their suitability for data collection. A brief comparison with a previous approach highlights how our method overcame common pitfalls in applying semantic segmentation in 3D environments like ViZDoom.
中文摘要 在具有高维感官输入的三维环境中进行强化学习（RL）带来了两个主要挑战：（1）稳定学习所需的记忆缓冲区带来的高内存消耗，以及（2）部分可观测的马尔可夫决策过程（POMDPs）学习的复杂性。本项目通过提出两种新颖的输入表示方式：仅SS和RGB+SS，均在RGB彩色图像上采用语义分割来应对这些挑战。在ViZDoom的死亡匹配中进行了实验，利用完美分割结果进行受控评估。我们的结果显示，仅用SS能将内存缓冲区的内存消耗降低至少66.6%，而当采用可向量化且开销极低、开销极小的无损压缩技术（如游程编码）时，最高可达98.6%。与此同时，RGB+SS通过提供的语义信息显著提升了强化学习代理的性能。此外，我们探索了基于密度的热成像作为可视化强化学习代理运动模式并评估其数据采集适用性的工具。与之前方法的简要对比，突显了我们方法如何克服了在 ViZDoom 等三维环境中应用语义分割时常见的陷阱。
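The compression step named in this abstract, run-length encoding of a segmentation map, can be sketched with vectorized NumPy operations. This is a generic RLE and not the paper's implementation; the toy 4x6 mask and the class id 7 are hypothetical.

```python
import numpy as np


def rle_encode(mask):
    """Run-length encode a label map; returns (run values, run lengths) in row-major order."""
    flat = np.asarray(mask).ravel()
    if flat.size == 0:
        return flat, np.zeros(0, dtype=np.int64)
    # Positions where the label differs from the previous pixel start a new run.
    change = np.flatnonzero(flat[1:] != flat[:-1]) + 1
    starts = np.concatenate(([0], change))
    lengths = np.diff(np.concatenate((starts, [flat.size])))
    return flat[starts], lengths


def rle_decode(values, lengths, shape):
    """Invert rle_encode: expand each run and restore the original shape."""
    return np.repeat(values, lengths).reshape(shape)


# Hypothetical mask: background 0 with one object of class 7.
mask = np.zeros((4, 6), dtype=np.uint8)
mask[1:3, 2:5] = 7

values, lengths = rle_encode(mask)
restored = rle_decode(values, lengths, mask.shape)
```

Segmentation maps tend to contain long constant runs, which is why this lossless scheme can shrink them so much. Separately, replacing a three-channel RGB frame with a single-channel label map removes two of three channels, which would match the "at least 66.6%" figure if each label is stored in one byte; that reading is an assumption, since the abstract does not spell out the storage format.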

How Machine Learning-Data Driven Replication Strategies Enhance Fault Tolerance in Large-Scale Distributed Systems

机器学习数据驱动复制策略如何增强大规模分布式系统的容错能力

Authors: Almond Kiruthu Murimi
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2511.11749
Pdf link: https://arxiv.org/pdf/2511.11749
Abstract This research paper investigates how machine learning-driven data replication strategies can enhance fault tolerance in large-scale distributed systems. Traditional replication methods, which rely on static configurations, often struggle to adapt to dynamic workloads and unexpected failures, leading to inefficient resource utilization and prolonged downtime. By integrating machine learning techniques, specifically predictive analytics and reinforcement learning, the study proposes adaptive replication mechanisms capable of forecasting system failures and optimizing data placement in real time. Through an extensive literature review, qualitative analysis, and comparative evaluations with traditional approaches, the paper identifies key limitations in existing replication strategies and highlights the transformative potential of machine learning in creating more resilient, self-optimizing systems. The findings underscore both the promise and the challenges of implementing ML-driven solutions in real-world environments, offering recommendations for future research and practical deployment in cloud-based and enterprise systems.
中文摘要 本研究论文探讨了机器学习驱动的数据复制策略如何提升大规模分布式系统的容错能力。依赖静态配置的传统复制方法常难以适应动态工作负载和意外故障，导致资源利用效率低下和停机时间延长。通过整合机器学习技术——特别是预测分析和强化学习。本研究提出了能够实时预测系统故障和优化数据配置的自适应复制机制。通过广泛的文献综述、定性分析以及与传统方法的比较评估，本文指出了现有复制策略的关键局限性，并强调机器学习在创建更具韧性和自我优化系统的变革潜力。研究结果强调了在现实环境中实施机器学习驱动解决方案的前景与挑战，并为未来研究和在云端及企业系统中的实际部署提供了建议。

Learning to Refine: An Agentic RL Approach for Iterative SPARQL Query Construction

学习精炼：一种代理式强化学习方法用于迭代 SPARQL 查询构造

Authors: Floris Vossebeld, Shenghui Wang
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.11770
Pdf link: https://arxiv.org/pdf/2511.11770
Abstract Generating complex, logically-sound SPARQL queries for multi-hop questions remains a critical bottleneck for Knowledge Graph Question Answering, as the brittle nature of one-shot generation by Large Language Models (LLMs) hinders reliable interaction with structured data. Current methods lack the adaptive policies needed to dynamically debug queries based on real-time execution feedback. This paper introduces a novel agentic framework where an LLM learns a resilient policy for the sequential process of iterative SPARQL construction. We show that a compact 3B-parameter model, trained exclusively via outcome-driven Reinforcement Learning (GRPO) without supervised fine-tuning, can learn effective policies for this task, discovering how to systematically recover from execution errors and refine its queries toward a correct answer. On a curated, executable single-answer subset of LC-QuAD 2.0, our agent achieves 49.7% accuracy post-entity-linking, a significant 17.5 percentage point improvement over the strongest iterative zero-shot baseline. Further analysis reveals that while the agent's capability is driven by RL, its performance is enhanced by an explicit deliberative reasoning step that acts as a cognitive scaffold to improve policy precision. This work presents a generalizable blueprint for teaching agents to master formal, symbolic tools through interaction, bridging the gap between probabilistic LLMs and the structured world of Knowledge Graphs.
中文摘要 生成复杂且逻辑合理的多跳问题SPARQL查询仍然是知识图谱问答的关键瓶颈，因为大型语言模型（LLM）一次性生成的脆弱性阻碍了与结构化数据的可靠交互。当前方法缺乏基于实时执行反馈动态调试查询所需的自适应策略。本文介绍了一个新的代理框架，LLM在该框架中学习一个针对迭代SPARQL顺序构建过程的弹性策略。我们展示了一个紧凑的3B参数模型，仅通过结果驱动强化学习（GRPO）训练，无需监督微调，能够学习有效的策略，发现如何系统地从执行错误中恢复，并优化查询以获得正确答案。在 LC-QuAD 2.0 的精心策划、可执行的单一答案子集上，我们的代理实体链接后准确率达到 49.7%，比最强的迭代零样本基线提升了 17.5 个百分点。进一步分析显示，虽然智能体的能力由强化学习驱动，但其表现通过明确的审慎推理步骤提升，作为认知支架以提升政策精度。这项工作为教学代理通过互动掌握形式化、符号化工具提供了可推广的蓝图，弥合了概率性大型语言模型与知识图谱结构化世界之间的鸿沟。

Image-POSER: Reflective RL for Multi-Expert Image Generation and Editing

Image-POSER：面向多专家图像生成与编辑的反思式强化学习

Authors: Hossein Mohebbi, Mohammed Abdulrahman, Yanting Miao, Pascal Poupart, Suraj Kothawade
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.11780
Pdf link: https://arxiv.org/pdf/2511.11780
Abstract Recent advances in text-to-image generation have produced strong single-shot models, yet no individual system reliably executes the long, compositional prompts typical of creative workflows. We introduce Image-POSER, a reflective reinforcement learning framework that (i) orchestrates a diverse registry of pretrained text-to-image and image-to-image experts, (ii) handles long-form prompts end-to-end through dynamic task decomposition, and (iii) supervises alignment at each step via structured feedback from a vision-language model critic. By casting image synthesis and editing as a Markov Decision Process, we learn non-trivial expert pipelines that adaptively combine strengths across models. Experiments show that Image-POSER outperforms baselines, including frontier models, across industry-standard and custom benchmarks in alignment, fidelity, and aesthetics, and is consistently preferred in human evaluations. These results highlight that reinforcement learning can endow AI systems with the capacity to autonomously decompose, reorder, and combine visual models, moving towards general-purpose visual assistants.
中文摘要 文本到图像生成的最新进展催生了强大的单镜头模型，但没有任何单一系统能可靠地执行创意工作流程中典型的冗长构图提示。我们介绍了Image-POSER，一种反思强化学习框架，（i）协调多样化的预训练文本到图像和图像对图像专家注册，（ii）通过动态任务分解端到端处理长格式提示，（iii）通过视觉语言模型批评者的结构化反馈监督每一步的对齐。通过将图像合成和编辑视为马尔可夫决策过程，我们学习了能够适应性地组合不同模型优势的专家流程。实验显示，Image-POSER在行业标准和定制基准测试中，在对齐、保真度和美学方面均优于基线模型，包括前沿模型，并且在人工评估中始终被优先推荐。这些结果表明，强化学习能够赋予人工智能系统自主分解、重排和组合视觉模型的能力，逐步迈向通用视觉助手。

MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling

MiroThinker：通过模型、上下文和交互式扩展推动开源研究代理的性能边界

Authors: MiroMind Team, Song Bai, Lidong Bing, Carson Chen, Guanzheng Chen, Yuntao Chen, Zhe Chen, Ziyi Chen, Jifeng Dai, Xuan Dong, Yue Deng, Yunjie Fu, Junqi Ge, Chenxia Han, Tammy Huang, Zhenhang Huang, Jerry Jiao, Shilei Jiang, Tianyu Jiao, Xiaoqi Jian, Lei Lei, Ruilin Li, Ryan Luo, Tiantong Li, Xiang Lin, Ziyuan Liu, Zhiqi Li, Jie Ni, Qiang Ren, Pax Sun, Shiqian Su, Chenxin Tao, Bin Wang, Hellen Wang, Haonan Wang, James Wang, Jin Wang, Jojo Wang, Letian Wang, Shizun Wang, Weizhi Wang, Zixuan Wang, Jinfan Xu, Sen Xing, Chenyu Yang, Hai Ye, Jiaheng Yu, Yue Yu, Muyan Zhong, Tianchen Zhao, Xizhou Zhu, Yanpeng Zhou, Yifan Zhang, Zhi Zhu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.11793
Pdf link: https://arxiv.org/pdf/2511.11793
Abstract We present MiroThinker v1.0, an open-source research agent designed to advance tool-augmented reasoning and information-seeking capabilities. Unlike previous agents that only scale up model size or context length, MiroThinker explores interaction scaling at the model level, systematically training the model to handle deeper and more frequent agent-environment interactions as a third dimension of performance improvement. Unlike LLM test-time scaling, which operates in isolation and risks degradation with longer reasoning chains, interactive scaling leverages environment feedback and external information acquisition to correct errors and refine trajectories. Through reinforcement learning, the model achieves efficient interaction scaling: with a 256K context window, it can perform up to 600 tool calls per task, enabling sustained multi-turn reasoning and complex real-world research workflows. Across four representative benchmarks (GAIA, HLE, BrowseComp, and BrowseComp-ZH), the 72B variant achieves up to 81.9%, 37.7%, 47.1%, and 55.6% accuracy respectively, surpassing previous open-source agents and approaching commercial counterparts such as GPT-5-high. Our analysis reveals that MiroThinker benefits from interactive scaling consistently: research performance improves predictably as the model engages in deeper and more frequent agent-environment interactions, demonstrating that interaction depth exhibits scaling behaviors analogous to model size and context length. These findings establish interaction scaling as a third critical dimension for building next-generation open research agents, complementing model capacity and context windows.
中文摘要 我们介绍 MiroThinker v1.0，一款开源研究代理，旨在推动工具增强推理和信息寻求能力。与以往仅仅放大模型规模或上下文长度的代理不同，MiroThinker在模型层面探索交互缩放，系统地训练模型以处理更深层次且更频繁的代理-环境交互，作为性能提升的第三维度。与大型语言模型测试时间缩放不同，后者在较长的推理链下存在退化风险，而交互式扩展利用环境反馈和外部信息获取来纠正错误并优化轨迹。通过强化学习，该模型实现了高效的交互扩展：拥有256K上下文窗口，每项任务可执行多达600次工具调用，支持持续的多回合推理和复杂的现实研究工作流。在四个代表性基准测试——GAIA、HLE、BrowseComp和BrowseComp-ZH——中，72B版本分别实现了高达81.9%、37.7%、47.1%和55.6%的准确率，超越了以往的开源代理，并接近了如GPT-5高的商业对应产品。我们的分析显示，MiroThinker 持续受益于交互式扩展：随着模型参与更深层次且更频繁的代理-环境交互，研究表现可预测地提升，表明交互深度表现出类似于模型规模和上下文长度的缩放行为。这些发现确立了交互尺度作为构建下一代开放研究代理的第三关键维度，补充了模型容量和上下文窗口。

Conformal Constrained Policy Optimization for Cost-Effective LLM Agents

为成本效益高的LLM代理提供共形约束策略优化

Authors: Wenwen Si, Sooyong Jang, Insup Lee, Osbert Bastani
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.11828
Pdf link: https://arxiv.org/pdf/2511.11828
Abstract While large language models (LLMs) have recently made tremendous progress towards solving challenging AI problems, they have done so at increasingly steep computational and API costs. We propose a novel strategy where we combine multiple LLM models with varying cost/accuracy tradeoffs in an agentic manner, where models and tools are run in sequence as determined by an orchestration model to minimize cost subject to a user-specified level of reliability; this constraint is formalized using conformal prediction to provide guarantees. To solve this problem, we propose Conformal Constrained Policy Optimization (CCPO), a training paradigm that integrates constrained policy optimization with off-policy reinforcement learning and recent advances in online conformal prediction. CCPO jointly optimizes a cost-aware policy (score function) and an adaptive threshold. Across two multi-hop question answering benchmarks, CCPO achieves up to a 30% cost reduction compared to other cost-aware baselines and LLM-guided methods without compromising reliability. Our approach provides a principled and practical framework for deploying LLM agents that are significantly more cost-effective while maintaining reliability.
中文摘要 虽然大型语言模型（LLM）近年来在解决具有挑战性的人工智能问题方面取得了巨大进展，但其计算和API成本也日益高涨。我们提出了一种新颖策略，将多个具有不同成本/准确性权衡的LLM模型以代理方式结合，模型和工具按编排模型确定的顺序运行，以根据用户指定的可靠性水平来最小化成本;该约束通过共形预测形式化以提供保证。为解决这一问题，我们提出了共形约束策略优化（CCPO），这是一种将受限策略优化与非策略强化学习及在线共形预测的最新进展相结合的训练范式。CCPO共同优化成本感知策略（评分函数）和自适应阈值。在两个多跳问答基准测试中，CCPO相比其他成本意识基线和LLM引导方法，在不牺牲可靠性的情况下实现了高达30%的成本降低。我们的方法为部署更具成本效益且保持可靠性的LLM代理提供了一个有原则且实用的框架。
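The "adaptive threshold" in this abstract comes from online conformal prediction. The sketch below shows a generic online quantile-tracking rule of that family; it is not CCPO itself. The assumption that a higher score means a less reliable answer, the uniform random scores, and the values of `alpha`, `lr`, and `init` are all hypothetical, and how the paper couples the threshold with the learned policy is not described in the abstract.

```python
import random


def update_threshold(threshold, score, alpha, lr):
    """One online quantile-tracking step.

    A 'miss' occurs when the score exceeds the current threshold. The threshold
    rises after a miss and falls slightly otherwise, so that the long-run miss
    rate is pushed toward alpha.
    """
    miss = 1.0 if score > threshold else 0.0
    return threshold + lr * (miss - alpha), miss


def run_online_calibration(scores, alpha=0.1, lr=0.05, init=0.5):
    """Stream scores through the update and report the final threshold and miss rate."""
    threshold, misses = init, 0.0
    for score in scores:
        threshold, miss = update_threshold(threshold, score, alpha, lr)
        misses += miss
    return threshold, misses / len(scores)


# Hypothetical stream of nonconformity scores in [0, 1).
rng = random.Random(0)
scores = [rng.random() for _ in range(5000)]
final_threshold, miss_rate = run_online_calibration(scores)
```

For bounded scores the threshold itself stays bounded, so the realized miss rate can differ from `alpha` by at most a term that shrinks as the stream gets longer. That long-run property is what makes a rule of this kind usable as a reliability constraint.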

Better LLM Reasoning via Dual-Play

通过双人游戏更好地进行大型语言模型推理

Authors: Zhengxin Zhang, Chengyu Huang, Aochong Oliver Li, Claire Cardie
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.11881
Pdf link: https://arxiv.org/pdf/2511.11881
Abstract Large Language Models (LLMs) have achieved remarkable progress through Reinforcement Learning with Verifiable Rewards (RLVR), yet still rely heavily on external supervision (e.g., curated labels). Adversarial learning, particularly through self-play, offers a promising alternative that enables models to iteratively learn from themselves - thus reducing reliance on external supervision. Dual-play extends adversarial learning by assigning specialized roles to two models and training them against each other, fostering sustained competition and mutual evolution. Despite its promise, adapting dual-play training to LLMs remains limited, largely due to their susceptibility to reward hacking and training instability. In this paper, we introduce PasoDoble, a novel LLM dual-play framework. PasoDoble adversarially trains two models initialized from the same base model: a Proposer, which generates challenging questions with ground-truth answers, and a Solver, which attempts to solve them. We enrich the Proposer with knowledge from a pre-training dataset to ensure the questions' quality and diversity. To avoid reward hacking, the Proposer is rewarded for producing only valid questions that push the Solver's limit, while the Solver is rewarded for solving them correctly, and both are updated jointly. To further enhance training stability, we introduce an optional offline paradigm that decouples Proposer and Solver updates, alternately updating each for several steps while holding the other fixed. Notably, PasoDoble operates without supervision during training. Experimental results show that PasoDoble can improve the reasoning performance of LLMs. Our project page is available at this https URL.
中文摘要 大型语言模型（LLMs）通过可验证奖励强化学习（RLVR）取得了显著进展，但仍高度依赖外部监督（例如策划标签）。对抗式学习，尤其是通过自我游戏，提供了一种有前景的替代方案，使模型能够迭代自我学习——从而减少对外部监督的依赖。双人游戏通过为两个模型分配专门角色并相互训练，促进持续竞争和相互演化，从而扩展对抗学习。尽管前景看好，将双人对战训练应用于大型语言模型（LLM）的应用仍然有限，主要原因是它们容易受到黑客攻击和训练不稳定性的影响。本文介绍了PasoDoble，一种新颖的大型语言模型双重对弈框架。PasoDoble 对抗式训练两个基于同一基础模型的模型：一个提出者，生成具有基础真相答案的挑战性问题，另一个是求解器，试图解决这些问题。我们通过预训练数据集丰富提案人，确保问题的质量和多样性。为避免奖励被黑，提议者只提出有效且能挑战解答者极限的问题而获得奖励，而解答者则因正确解答而获得奖励，两者同时进行更新。为进一步提升训练稳定性，我们引入了一个可选的离线范式，将提案者和求解器更新解耦，交替更新各自若干步，同时保持对方固定。值得注意的是，PasoDoble在训练期间无需监督。实验结果表明，PasoDoble 可以提升 LLM 的推理性能。我们的项目页面可在此 https 网址访问。

Context-Emotion Aware Therapeutic Dialogue Generation: A Multi-component Reinforcement Learning Approach to Language Models for Mental Health Support

情境感知治疗对话生成：一种多元强化学习方法，用于心理健康支持的语言模型

Authors: Eric Hua Qing Zhang, Julia Ive
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.11884
Pdf link: https://arxiv.org/pdf/2511.11884
Abstract Mental health illness represents a substantial global socioeconomic burden, with COVID-19 further exacerbating accessibility challenges and driving increased demand for telehealth mental health support. While large language models (LLMs) offer promising solutions through 24/7 availability and non-judgmental interactions, pre-trained models often lack the contextual and emotional awareness necessary for appropriate therapeutic responses. This paper investigated the application of supervised fine-tuning (SFT) and reinforcement learning (RL) techniques to enhance GPT-2's capacity for therapeutic dialogue generation. The methodology restructured input formats to enable simultaneous processing of contextual information and emotional states alongside user input, employing a multi-component reward function that aligned model outputs with professional therapist responses and annotated emotions. Results demonstrated improvements through reinforcement learning over baseline GPT-2 across multiple evaluation metrics: BLEU (0.0111), ROUGE-1 (0.1397), ROUGE-2 (0.0213), ROUGE-L (0.1317), and METEOR (0.0581). LLM evaluation confirmed high contextual relevance and professionalism, while reinforcement learning achieved 99.34% emotion accuracy compared to 66.96% for baseline GPT-2. These findings demonstrate reinforcement learning's effectiveness in developing therapeutic dialogue systems that can serve as valuable assistive tools for therapists while maintaining essential human clinical oversight.
中文摘要 心理健康疾病是全球重大的社会经济负担，COVID-19进一步加剧了可及性挑战，并推动了对远程心理健康支持需求的增加。虽然大型语言模型（LLMs）通过全天候开放和非评判性互动提供了有前景的解决方案，但预训练模型往往缺乏提供适当治疗反应所需的情境和情感意识。本文探讨了监督式微调（SFT）和强化学习（RL）技术在提升GPT-2治疗性对话生成能力方面的应用。该方法重构了输入格式，使上下文信息和情绪状态能够与用户输入同时处理，采用多元奖励函数，使模型输出与专业治疗师的反应和注释情绪保持一致。结果显示，强化学习在多个评估指标上相较于基线GPT-2有显著提升：BLEU（0.0111）、ROUGE-1（0.1397）、ROUGE-2（0.0213）、ROUGE-L（0.1317）和METEOR（0.0581）。LLM评估确认了高情境相关性和专业性，强化学习的情感准确率达到99.34%，而基线GPT-2为66.96%。这些发现表明强化学习在开发治疗性对话系统方面具有有效性，这些系统既能作为治疗师有价值的辅助工具，又能维持必要的临床监督。

VULPO: Context-Aware Vulnerability Detection via On-Policy LLM Optimization

VULPO：通过策略内大型语言模型优化实现上下文感知漏洞检测

Authors: Youpeng Li, Fuxun Yu, Xinda Wang
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2511.11896
Pdf link: https://arxiv.org/pdf/2511.11896
Abstract The widespread reliance on open-source software dramatically increases the risk of vulnerability exploitation, underscoring the need for effective and scalable vulnerability detection (VD). Existing VD techniques, whether traditional machine learning-based or LLM-based approaches like prompt engineering, supervised fine-tuning, or off-policy preference optimization, remain fundamentally limited in their ability to perform context-aware analysis: they depend on fixed inputs or static preference datasets, cannot adaptively explore repository-level dependencies, and are constrained by function-level benchmarks that overlook critical vulnerability context. This paper introduces Vulnerability-Adaptive Policy Optimization (VULPO), an on-policy LLM reinforcement learning framework for context-aware VD. To support training and evaluation, we first construct ContextVul, a new dataset that augments high-quality function-level samples with repository-level context information extracted by a lightweight method. We then design multi-dimensional reward structuring that jointly captures prediction correctness, vulnerability localization accuracy, and the semantic relevance of vulnerability analysis, thereby guiding the model toward comprehensive contextual reasoning. To address the asymmetric difficulty of different vulnerability cases and mitigate reward hacking, VULPO incorporates label-level and sample-level difficulty-adaptive reward scaling, encouraging the model to explore challenging cases while maintaining balanced reward distribution. Extensive experiments demonstrate the superiority of our VULPO framework in context-aware VD: our VULPO-4B substantially outperforms existing VD baselines based on prompt engineering and off-policy optimization, improving F1 by 85% over Qwen3-4B and achieving performance comparable to a 150x larger-scale model, DeepSeek-R1-0528.
中文摘要 对开源软件的广泛依赖极大地增加了漏洞利用的风险，凸显了有效且可扩展的漏洞检测（VD）的必要性。现有的VD技术，无论是传统的机器学习基础还是基于LLM的方法，如提示工程、监督式微调或非策略偏好优化，在执行上下文感知分析方面仍然存在根本限制：它们依赖固定输入或静态偏好数据集，无法自适应探索仓库级依赖关系，且受限于功能级基准测试，忽视了关键的漏洞上下文。本文介绍了脆弱性自适应策略优化（VULPO），这是一种针对上下文感知性VD的策略上LLM强化学习框架。为支持训练和评估，我们首先构建了ContextVul，这是一个新数据集，通过轻量级方法增强高质量函数级样本，提取仓库级上下文信息。随后，我们设计了多维奖励结构，结合预测准确性、漏洞定位准确性以及漏洞分析的语义相关性，从而引导模型实现全面的情境推理。为了解决不同漏洞案例的非对称难度并减轻奖励黑客行为，VULPO采用了标签级和样本级难度自适应奖励尺度，鼓励模型在保持奖励分布平衡的同时探索具有挑战性的案例。大量实验证明了我们VULPO框架在上下文感知VD中的优势：基于即时工程和非策略优化，我们的VULPO-4B显著优于现有VD基线，F1性能比Qwen3-4B提升85%，性能可与150倍规模模型DeepSeek-R1-0528媲美。

Quantile Q-Learning: Revisiting Offline Extreme Q-Learning with Quantile Regression

分位Q学习：利用分位数回归重新审视离线极限Q学习

Authors: Xinming Gao, Shangzhe Li, Yujin Cai, Wenwu Yu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.11973
Pdf link: https://arxiv.org/pdf/2511.11973
Abstract Offline reinforcement learning (RL) enables policy learning from fixed datasets without further environment interaction, making it particularly valuable in high-risk or costly domains. Extreme $Q$-Learning (XQL) is a recent offline RL method that models Bellman errors using the Extreme Value Theorem, yielding strong empirical performance. However, XQL and its stabilized variant MXQL suffer from notable limitations: both require extensive hyperparameter tuning specific to each dataset and domain, and both exhibit instability during training. To address these issues, we propose a principled method to estimate the temperature coefficient $\beta$ via quantile regression under mild assumptions. To further improve training stability, we introduce a value regularization technique with mild generalization, inspired by recent advances in constrained value learning. Experimental results demonstrate that the proposed algorithm achieves competitive or superior performance across a range of benchmark tasks, including D4RL and NeoRL2, while maintaining stable training dynamics and using a consistent set of hyperparameters across all datasets and domains.
中文摘要 离线强化学习（RL）使得从固定数据集中学习策略，无需额外的环境交互，使其在高风险或成本高的领域尤为重要。极$Q$学习（XQL）是一种近期的离线强化学习方法，利用极值定理建模贝尔曼错误，取得了强大的实证表现。然而，XQL及其稳定变体MXQL存在显著局限性：两者都需要针对每个数据集和领域的特定超参数进行大量超参数调优，且在训练过程中也存在不稳定性。为解决这些问题，我们提出了一种原则性方法，在轻度假设下通过分位数回归估计温度系数$\beta$。为进一步提升训练稳定性，我们引入了一种带有轻度泛化的值正则化技术，灵感来自近期受限值学习的最新进展。实验结果表明，该算法在包括D4RL和NeoRL2在内的多种基准任务中实现了竞争甚至更优的性能，同时保持了稳定的训练动态，并在所有数据集和领域中使用一致的超参数集合。
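The abstract relies on quantile regression to estimate $\beta$ but does not spell out the estimator. As background only, the following is a minimal sketch of generic quantile regression with the pinball loss; the function names, the learning rate, and the plain subgradient loop are illustrative assumptions and are not taken from the paper.

```python
def pinball_loss(residuals, tau):
    """Mean pinball (quantile) loss for residuals u = target - prediction."""
    return sum(max(tau * u, (tau - 1.0) * u) for u in residuals) / len(residuals)


def fit_quantile(samples, tau, lr=1.0, steps=2000):
    """Estimate the tau-quantile of `samples` by subgradient descent.

    Hypothetical helper: a scalar estimate q is nudged until roughly a
    fraction tau of the samples lies at or below it.
    """
    q = sum(samples) / len(samples)  # start from the mean
    for _ in range(steps):
        # Subgradient of the pinball loss with respect to q:
        # -tau for samples above q, (1 - tau) for samples at or below q.
        grad = sum((-tau if x > q else 1.0 - tau) for x in samples) / len(samples)
        q -= lr * grad
    return q
```

For example, `fit_quantile([float(i) for i in range(1, 101)], 0.9)` settles close to 90, the 0.9-quantile of the integers 1 to 100.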

Goal-Oriented Multi-Agent Reinforcement Learning for Decentralized Agent Teams

去中心化智能体团队的目标导向多智能体强化学习

Authors: Hung Du, Hy Nguyen, Srikanth Thudumu, Rajesh Vasa, Kon Mouzakis
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.11992
Pdf link: https://arxiv.org/pdf/2511.11992
Abstract Connected and autonomous vehicles across land, water, and air must often operate in dynamic, unpredictable environments with limited communication, no centralized control, and partial observability. These real-world constraints pose significant challenges for coordination, particularly when vehicles pursue individual objectives. To address this, we propose a decentralized Multi-Agent Reinforcement Learning (MARL) framework that enables vehicles, acting as agents, to communicate selectively based on local goals and observations. This goal-aware communication strategy allows agents to share only relevant information, enhancing collaboration while respecting visibility limitations. We validate our approach in complex multi-agent navigation tasks featuring obstacles and dynamic agent populations. Results show that our method significantly improves task success rates and reduces time-to-goal compared to non-cooperative baselines. Moreover, task performance remains stable as the number of agents increases, demonstrating scalability. These findings highlight the potential of decentralized, goal-driven MARL to support effective coordination in realistic multi-vehicle systems operating across diverse domains.
中文摘要 跨陆、水、空的互联自动驾驶载具常常必须在动态、不可预测的环境中运行，通信有限，缺乏集中控制，且仅具备部分可观测性。这些现实世界的限制对协调构成重大挑战，尤其是在各载具追求各自目标时。为此，我们提出了一个去中心化的多智能体强化学习（MARL）框架，使作为智能体的载具能够基于局部目标和观测进行选择性通信。这种目标感知的通信策略允许智能体仅共享相关信息，增强协作，同时尊重可见性限制。我们在复杂的多智能体导航任务中验证了我们的方法，这些任务包含障碍物和动态智能体群体。结果显示，我们的方法相比非合作基线显著提高了任务成功率并缩短了到达目标的时间。此外，随着智能体数量的增加，任务性能依然稳定，展现了可扩展性。这些发现凸显了去中心化、目标驱动的MARL在支持跨多领域现实多载具系统中有效协调的潜力。

Look As You Think: Unifying Reasoning and Visual Evidence Attribution for Verifiable Document RAG via Reinforcement Learning

《如你所思：通过强化学习统一推理与视觉证据归因以实现可验证文档RAG》

Authors: Shuochen Liu, Pengfei Luo, Chao Zhang, Yuhao Chen, Haotian Zhang, Qi Liu, Xin Kou, Tong Xu, Enhong Chen
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.12003
Pdf link: https://arxiv.org/pdf/2511.12003
Abstract Aiming to identify precise evidence sources from visual documents, visual evidence attribution for visual document retrieval-augmented generation (VD-RAG) ensures reliable and verifiable predictions from vision-language models (VLMs) in multimodal question answering. Most existing methods adopt end-to-end training to facilitate intuitive answer verification. However, they lack fine-grained supervision and progressive traceability throughout the reasoning process. In this paper, we introduce the Chain-of-Evidence (CoE) paradigm for VD-RAG. CoE unifies Chain-of-Thought (CoT) reasoning and visual evidence attribution by grounding reference elements in reasoning steps to specific regions with bounding boxes and page indexes. To enable VLMs to generate such evidence-grounded reasoning, we propose Look As You Think (LAT), a reinforcement learning framework that trains models to produce verifiable reasoning paths with consistent attribution. During training, LAT evaluates the attribution consistency of each evidence region and provides rewards only when the CoE trajectory yields correct answers, encouraging process-level self-verification. Experiments on vanilla Qwen2.5-VL-7B-Instruct with Paper- and Wiki-VISA benchmarks show that LAT consistently improves the vanilla model in both single- and multi-image settings, yielding average gains of 8.23% in soft exact match (EM) and 47.0% in IoU@0.5. Meanwhile, LAT not only outperforms the supervised fine-tuning baseline, which is trained to directly produce answers with attribution, but also exhibits stronger generalization across domains.
中文摘要 旨在从视觉文档中确定精确的证据来源，视觉文档检索增强生成（VD-RAG）的视觉证据归因确保了视觉语言模型（VLMs）在多模态问答中可靠且可验证的预测。大多数现有方法采用端到端训练，以促进直观的答案验证。然而，它们缺乏细粒度的监督和在整个推理过程中的渐进可追溯性。本文介绍了VD-RAG的证据链（CoE）范式。CoE通过将推理步骤中的参考元素以边界框和页面索引定位到具体区域，统一了思维链（CoT）推理和视觉证据归因。为了使VLM能够生成这种基于证据的推理，我们提出了“Look As You Think”（LAT）强化学习框架，训练模型生成具有一致归因的可验证推理路径。在训练过程中，LAT评估每个证据区域的归因一致性，仅在CoE轨迹得出正确答案时给予奖励，鼓励过程层面的自我验证。基于原版Qwen2.5-VL-7B-Instruct在Paper-VISA和Wiki-VISA基准上的实验显示，LAT在单图和多图环境中持续提升原版模型，软精确匹配（EM）平均提升8.23%，IoU@0.5平均提升47.0%。与此同时，LAT 不仅优于被训练为直接生成归因答案的监督微调基线，而且在跨领域展现出更强的泛化性。
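The abstract grounds each reasoning step in a bounding box and a page index. A common way to score such a grounding against a reference region is intersection over union (IoU) with a fixed threshold. The sketch below shows that generic computation; the `(x1, y1, x2, y2)` box layout, the function names, and the 0.5 threshold default are assumptions for illustration, not the paper's evaluation code.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def attribution_hit(pred_page, pred_box, ref_page, ref_box, threshold=0.5):
    """Hypothetical check: same page and IoU at or above the threshold."""
    return pred_page == ref_page and iou(pred_box, ref_box) >= threshold
```

Two 2x2 boxes offset by one unit in each direction overlap in a 1x1 square, so their IoU is 1/7 and they would not count as a hit at a 0.5 threshold.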

EARL: Entropy-Aware RL Alignment of LLMs for Reliable RTL Code Generation

EARL：对熵感知强化学习的LLM对齐，实现可靠的RTL代码生成

Authors: Jiahe Shi, Zhengqi Gao, Ching-Yun Ko, Duane Boning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.12033
Pdf link: https://arxiv.org/pdf/2511.12033
Abstract Recent advances in large language models (LLMs) have demonstrated significant potential in hardware design automation, particularly in using natural language to synthesize Register-Transfer Level (RTL) code. Despite this progress, a gap remains between model capability and the demands of real-world RTL design, including syntax errors, functional hallucinations, and weak alignment to designer intent. Reinforcement Learning with Verifiable Rewards (RLVR) offers a promising approach to bridge this gap, as hardware provides executable and formally checkable signals that can be used to further align model outputs with design intent. However, in long, structured RTL code sequences, not all tokens contribute equally to functional correctness, and naïvely spreading gradients across all tokens dilutes learning signals. A key insight from our entropy analysis in RTL generation is that only a small fraction of tokens (e.g., always, if, assign, posedge) exhibit high uncertainty and largely influence control flow and module structure. To address these challenges, we present EARL, an Entropy-Aware Reinforcement Learning framework for Verilog generation. EARL performs policy optimization using verifiable reward signals and introduces entropy-guided selective updates that gate policy gradients to high-entropy tokens. This approach preserves training stability and concentrates gradient updates on functionally important regions of code. Our experiments on VerilogEval and RTLLM show that EARL improves functional pass rates over prior LLM baselines by up to 14.7%, while reducing unnecessary updates and improving training stability. These results indicate that focusing RL on critical, high-uncertainty tokens enables more reliable and targeted policy improvement for structured RTL code generation.
中文摘要 大型语言模型（LLM）的最新进展展示了硬件设计自动化的巨大潜力，尤其是在利用自然语言合成寄存器-传输层（RTL）代码方面。尽管取得了这些进展，模型能力与现实RTL设计的需求之间仍存在差距，包括语法错误、功能性幻觉以及与设计意图的对齐度较弱。带可验证奖励的强化学习（RLVR）为弥合这一差距提供了一种有前景的方法，因为硬件提供了可执行且形式化可检查的信号，可用于进一步使模型输出与设计意图对齐。然而，在长而结构化的RTL代码序列中，并非所有词符对功能正确性贡献均一，且天真地将梯度分散到所有词符上会稀释学习信号。我们在RTL生成中的熵分析中发现，只有极少数代币（如始终、if、assign、posedge）表现出高度不确定性，并对控制流程和模块结构产生重大影响。为应对这些挑战，我们介绍了EARL，一个用于Verilog生成的熵感知强化学习框架。EARL利用可验证的奖励信号进行策略优化，并引入了熵引导的选择性更新，将策略梯度门控到高熵代币。这种方法保持了训练的稳定性，并将梯度更新集中在功能重要的代码区域。我们在VerilogEval和RTLLM上的实验显示，EARL较以往LLM基线提升了多达14.7%的功能通过率，同时减少了不必要的更新并提升了训练稳定性。这些结果表明，将强化学习重点放在关键且高不确定性的代币上，可以实现结构化RTL代码生成的更可靠和有针对性的策略改进。
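The entropy-guided selective update described above can be pictured as a mask over token positions. This is a minimal sketch under stated assumptions: the top-fraction selection rule, the `keep_fraction` parameter, and the plain REINFORCE-style loss are illustrative choices, not the gating rule or objective reported for EARL.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_entropy_mask(token_dists, keep_fraction=0.2):
    """Return 1.0 for the highest-entropy positions and 0.0 elsewhere.

    Assumption: the top `keep_fraction` of positions is kept; positions tied
    with the threshold entropy are kept as well.
    """
    entropies = [token_entropy(d) for d in token_dists]
    k = max(1, round(keep_fraction * len(entropies)))
    threshold = sorted(entropies, reverse=True)[k - 1]
    return [1.0 if h >= threshold else 0.0 for h in entropies]


def gated_policy_loss(log_probs, advantages, mask):
    """Policy-gradient loss averaged over the unmasked positions only."""
    kept = sum(mask)
    total = sum(m * lp * a for m, lp, a in zip(mask, log_probs, advantages))
    return -total / kept
```

With this layout, low-entropy positions contribute nothing to the loss, so the gradient concentrates on the few positions where the policy is uncertain.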

Intelligent Collaborative Optimization for Rubber Tyre Film Production Based on Multi-path Differentiated Clipping Proximal Policy Optimization

基于多路径差异化裁剪近端策略优化的橡胶轮胎胶片生产智能协同优化

Authors: Yinghao Ruan, Wei Pang, Shuaihao Liu, Huili Yang, Leyi Han, Xinghui Dong
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.12060
Pdf link: https://arxiv.org/pdf/2511.12060
Abstract The advent of smart manufacturing is addressing the limitations of traditional centralized scheduling and inflexible production line configurations in the rubber tyre industry, especially in terms of coping with dynamic production demands. Contemporary tyre manufacturing systems form complex networks of tightly coupled subsystems with pronounced nonlinear interactions and emergent dynamics. This complexity makes the effective coordination of multiple subsystems an essential yet formidable task. For high-dimensional, multi-objective optimization problems in this domain, we introduce a deep reinforcement learning algorithm: Multi-path Differentiated Clipping Proximal Policy Optimization (MPD-PPO). This algorithm employs a multi-branch policy architecture with differentiated gradient clipping constraints to ensure stable and efficient high-dimensional policy updates. Validated through experiments on width and thickness control in rubber tyre film production, MPD-PPO demonstrates substantial improvements in both tuning accuracy and operational efficiency. The framework successfully tackles key challenges, including high dimensionality, multi-objective trade-offs, and dynamic adaptation, thus delivering enhanced performance and production stability for real-time industrial deployment in tyre manufacturing.
中文摘要 智能制造的出现正在解决橡胶轮胎行业传统集中调度和生产线配置僵化的局限性，尤其是在应对动态生产需求方面。当代轮胎制造系统形成了由紧密耦合的子系统组成的复杂网络，表现出明显的非线性相互作用和涌现动态。这种复杂性使得多个子系统的有效协调成为一项既重要又艰巨的任务。对于该领域的高维多目标优化问题，我们引入了一种深度强化学习算法：多路径差异化裁剪近端策略优化（MPD-PPO）。该算法采用多分支策略架构，并采用差异化梯度裁剪约束，以确保高维策略更新的稳定高效。通过橡胶轮胎胶片生产中宽度和厚度控制实验验证，MPD-PPO在调校精度和操作效率方面均有显著提升。该框架成功解决了高维度、多目标权衡和动态适应等关键挑战，从而为轮胎制造的实时工业部署提供更高的性能和生产稳定性。
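The abstract does not define "differentiated clipping" precisely. One plausible reading, assumed here and not confirmed by the source, is that each policy branch gets its own PPO clip range. The sketch below shows the standard PPO clipped surrogate with a per-branch `eps`; the `branches` layout and function names are hypothetical.

```python
def clipped_surrogate(ratio, advantage, eps):
    """Standard PPO clipped surrogate for a single sample."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)


def multi_branch_objective(branches):
    """Average the clipped surrogate over all samples of all branches.

    `branches` is a list of (ratios, advantages, eps) tuples, one per policy
    branch, so every branch carries its own clip range (assumed layout).
    """
    terms = [
        clipped_surrogate(r, a, eps)
        for ratios, advantages, eps in branches
        for r, a in zip(ratios, advantages)
    ]
    return sum(terms) / len(terms)
```

A branch with a tighter `eps` limits how far its probability ratio can push the objective, which is one way to keep sensitive action dimensions on a shorter leash than others.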

Treatment Stitching with Schrödinger Bridge for Enhancing Offline Reinforcement Learning in Adaptive Treatment Strategies

采用薛定谔桥进行治疗缝合，以增强自适应治疗策略中的离线强化学习

Authors: Dong-Hee Shin, Deok-Joong Lee, Young-Han Son, Tae-Eui Kam
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.12075
Pdf link: https://arxiv.org/pdf/2511.12075
Abstract Adaptive treatment strategies (ATS) are sequential decision-making processes that enable personalized care by dynamically adjusting treatment decisions in response to evolving patient symptoms. While reinforcement learning (RL) offers a promising approach for optimizing ATS, its conventional online trial-and-error learning mechanism is not permissible in clinical settings due to risks of harm to patients. Offline RL tackles this limitation by learning policies exclusively from historical treatment data, but its performance is often constrained by data scarcity, a pervasive challenge in clinical domains. To overcome this, we propose Treatment Stitching (TreatStitch), a novel data augmentation framework that generates clinically valid treatment trajectories by intelligently stitching segments from existing treatment data. Specifically, TreatStitch identifies similar intermediate patient states across different trajectories and stitches their respective segments. Even when intermediate states are too dissimilar to stitch directly, TreatStitch leverages the Schrödinger bridge method to generate smooth and energy-efficient bridging trajectories that connect dissimilar states. By adding these synthetic trajectories to the original dataset, offline RL can learn from a more diverse dataset, thereby improving its ability to optimize ATS. Extensive experiments across multiple treatment datasets demonstrate the effectiveness of TreatStitch in enhancing offline RL performance. Furthermore, we provide a theoretical justification showing that TreatStitch maintains clinical validity by avoiding out-of-distribution transitions.
中文摘要 适应性治疗策略（ATS）是一种顺序决策过程，通过动态调整治疗决策以应对患者症状的变化，实现个性化护理。虽然强化学习（RL）为优化ATS提供了有前景的方法，但其传统的在线试错学习机制在临床环境中因对患者有害的风险而不可行。离线强化学习通过仅从历史治疗数据学习策略来解决这一限制，但其性能常受限于数据稀缺性，这是临床领域普遍存在的挑战。为克服这一问题，我们提出了治疗缝合（TreatStitch）这一新型数据增强框架，通过智能缝合现有治疗数据的片段，生成临床有效的治疗轨迹。具体来说，TreatStitch识别不同轨迹上的相似中间患者状态，并对其相应的片段进行缝合。即使中间状态差异过大无法直接缝合，TreatStitch仍利用薛定谔桥方法生成平滑且能量高效的桥接轨迹，连接差异较大的状态。通过将这些合成轨迹补充到原始数据集中，离线强化学习可以从更多样化的数据集中学习，从而提升其优化ATS的能力。跨多个治疗数据集的大量实验证明了TreatStitch在提升离线强化学习表现方面的有效性。此外，我们还提供了理论依据，证明TreatStitch通过避免分布外转移来保持临床有效性。
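The direct-stitching step described above (join two trajectories where their intermediate states are similar) can be sketched as a nearest-state search. The Euclidean distance, the `max_dist` threshold, and the state-only trajectory layout are assumptions for illustration; the Schrödinger bridge step for dissimilar states is not covered here.

```python
import math


def stitch(traj_a, traj_b, max_dist):
    """Join a prefix of traj_a to a suffix of traj_b at their closest states.

    Trajectories are lists of state tuples (hypothetical layout; real
    treatment data would also carry actions and rewards). Returns None when
    no pair of states is within `max_dist`.
    """
    best = None
    for i, state_a in enumerate(traj_a):
        for j, state_b in enumerate(traj_b):
            d = math.dist(state_a, state_b)
            if d <= max_dist and (best is None or d < best[0]):
                best = (d, i, j)
    if best is None:
        return None
    _, i, j = best
    # Keep traj_a up to the matched state, then continue with what follows
    # the matched state in traj_b.
    return traj_a[: i + 1] + traj_b[j + 1 :]
```

Returning `None` for distant trajectories marks the case where, per the abstract, a generated bridging segment would be needed instead of a direct splice.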

HCPO: Hierarchical Conductor-Based Policy Optimization in Multi-Agent Reinforcement Learning

HCPO：多智能体强化学习中基于指挥者的分层策略优化

Authors: Zejiao Liu, Junqi Tu, Yitian Hong, Luolin Xiong, Yaochu Jin, Yang Tang, Fangfei Li
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.12123
Pdf link: https://arxiv.org/pdf/2511.12123
Abstract In cooperative Multi-Agent Reinforcement Learning (MARL), efficient exploration is crucial for optimizing the performance of joint policy. However, existing methods often update joint policies via independent agent exploration, without coordination among agents, which inherently constrains the expressive capacity and exploration of joint policies. To address this issue, we propose a conductor-based joint policy framework that directly enhances the expressive capacity of joint policies and coordinates exploration. In addition, we develop a Hierarchical Conductor-based Policy Optimization (HCPO) algorithm that instructs policy updates for the conductor and agents in a direction aligned with performance improvement. A rigorous theoretical guarantee further establishes the monotonicity of the joint policy optimization process. By deploying local conductors, HCPO retains centralized training benefits while eliminating inter-agent communication during execution. Finally, we evaluate HCPO on three challenging benchmarks: StarCraftII Multi-agent Challenge, Multi-agent MuJoCo, and Multi-agent Particle Environment. The results indicate that HCPO outperforms competitive MARL baselines regarding cooperative efficiency and stability.
中文摘要 在合作式多智能体强化学习（MARL）中，高效的探索对于优化联合策略的执行至关重要。然而，现有方法常通过独立的代理探索来更新联合策略，而代理间缺乏协调，这本质上限制了联合策略的表达能力和探索。为解决这一问题，我们提出了一个以指挥家为基础的联合政策框架，直接提升联合政策的表达能力并协调探索。此外，我们还开发了基于层级指挥者的策略优化（HCPO）算法，指示指挥者和代理在符合性能提升的方向上进行策略更新。严谨的理论保证进一步确立了联合策略优化过程的单调性。通过部署本地指挥员，HCPO保留了集中培训优势，同时消除执行过程中的代理间通信。最后，我们基于三个具有挑战性的基准测试：StarCraftII多智能体挑战、多智能体MuJoCo和多智能粒子环境对HCPO进行了评估。结果显示，HCPO在合作效率和稳定性方面优于竞争性MARL基线。

AI-Salesman: Towards Reliable Large Language Model Driven Telemarketing

人工智能销售员：迈向可靠的大型语言模型驱动电话营销

Authors: Qingyu Zhang, Chunlei Xin, Xuanang Chen, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun, Qing Ye, Qianlong Xie, Xingxing Wang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.12133
Pdf link: https://arxiv.org/pdf/2511.12133
Abstract Goal-driven persuasive dialogue, exemplified by applications like telemarketing, requires sophisticated multi-turn planning and strict factual faithfulness, which remains a significant challenge for even state-of-the-art Large Language Models (LLMs). A lack of task-specific data often limits previous works, and direct LLM application suffers from strategic brittleness and factual hallucination. In this paper, we first construct and release TeleSalesCorpus, the first real-world-grounded dialogue dataset for this domain. We then propose AI-Salesman, a novel framework featuring a dual-stage architecture. For the training stage, we design a Bayesian-supervised reinforcement learning algorithm that learns robust sales strategies from noisy dialogues. For the inference stage, we introduce the Dynamic Outline-Guided Agent (DOGA), which leverages a pre-built script library to provide dynamic, turn-by-turn strategic guidance. Moreover, we design a comprehensive evaluation framework that combines fine-grained metrics for key sales skills with the LLM-as-a-Judge paradigm. Experimental results demonstrate that our proposed AI-Salesman significantly outperforms baseline models in both automatic metrics and comprehensive human evaluations, showcasing its effectiveness in complex persuasive scenarios.
中文摘要 以目标为导向的说服性对话，以电话营销等应用为代表，需要复杂的多轮次规划和严格的事实忠实性，这对即使是最先进的大型语言模型（LLMs）依然是一大挑战。缺乏任务特定数据常常限制了以往的工作，直接应用LLM也存在战略脆弱和事实错觉的问题。本文首先构建并发布了TeleSalesCorpus，这是该领域首个基于真实世界的对话数据集。随后，我们提出了AI-Salesman，一个具有双阶段架构的创新框架。在训练阶段，我们设计了一个贝叶斯监督的强化学习算法，能够从嘈杂的对话中学习稳健的销售策略。在推理阶段，我们引入动态大纲引导代理（DOGA），它利用预构建的脚本库，提供动态的轮流战略指导。此外，我们设计了一个综合评估框架，结合关键销售技能的细粒度指标与“法官”的LLM范式。实验结果显示，我们提出的AI推销员在自动指标和全面人类评估中均显著优于基线模型，展示了其在复杂说服场景中的有效性。

CriticSearch: Fine-Grained Credit Assignment for Search Agents via a Retrospective Critic

CriticSearch：通过回顾式评论器为搜索智能体提供细粒度信用分配

Authors: Yaocheng Zhang, Haohuan Huang, Zijun Song, Yuanheng Zhu, Qichao Zhang, Zijie Zhao, Dongbin Zhao
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.12159
Pdf link: https://arxiv.org/pdf/2511.12159
Abstract Tool-Integrated Reasoning (TIR) with search engines enables large language models to iteratively retrieve up-to-date external knowledge, enhancing adaptability and generalization in complex question-answering tasks. However, existing search agent pipelines typically depend on reinforcement learning based optimization, which often suffers from sparse outcome rewards, leading to inefficient exploration and unstable training. We introduce CriticSearch, a fine-grained credit-assignment framework that supplies dense, turn-level feedback via a retrospective critic mechanism. During training, a frozen, asymmetric critique LLM retrospectively evaluates each turn using privileged information from the full trajectory and gold answers, converting these assessments into stable, dense rewards that guide policy improvement. Experimental results across diverse multi-hop reasoning benchmarks demonstrate that CriticSearch consistently outperforms existing baselines, achieving faster convergence, improved training stability, and higher performance.
中文摘要 工具集成推理（TIR）与搜索引擎结合，使大型语言模型能够迭代检索最新的外部知识，增强复杂问答任务中的适应性和泛化性。然而，现有的搜索代理流水线通常依赖基于强化学习的优化，这种优化常常导致结果奖励稀疏，导致探索效率低下和训练不稳定。我们介绍CriticSearch，这是一个细粒度的学分分配框架，通过回顾性批评机制提供密集的轮流反馈。在培训过程中，一个冻结的非对称批判性LLM会利用完整轨迹和金答案中的特权信息回顾性评估每一回合，将这些评估转化为稳定、密集的奖励，指导政策改进。跨越多种多跳推理基准的实验结果表明，CriticSearch始终优于现有基线，实现了更快的收敛、提升了训练稳定性和更高的性能。

SocialNav-Map: Dynamic Mapping with Human Trajectory Prediction for Zero-Shot Social Navigation

SocialNav-Map：结合人类轨迹预测的动态建图，用于零样本社交导航

Authors: Lingfeng Zhang, Erjia Xiao, Xiaoshuai Hao, Haoxiang Fu, Zeying Gong, Long Chen, Xiaojun Liang, Renjing Xu, Hangjun Ye, Wenbo Ding
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.12232
Pdf link: https://arxiv.org/pdf/2511.12232
Abstract Social navigation in densely populated dynamic environments poses a significant challenge for autonomous mobile robots, requiring advanced strategies for safe interaction. Existing reinforcement learning (RL)-based methods require more than 2,000 hours of extensive training and often struggle to generalize to unfamiliar environments without additional fine-tuning, limiting their practical application in real-world scenarios. To address these limitations, we propose SocialNav-Map, a novel zero-shot social navigation framework that combines dynamic human trajectory prediction with occupancy mapping, enabling safe and efficient navigation without the need for environment-specific training. Specifically, SocialNav-Map first transforms the task goal position into the constructed map coordinate system. Subsequently, it creates a dynamic occupancy map that incorporates predicted human movements as dynamic obstacles. The framework employs two complementary methods for human trajectory prediction: history prediction and orientation prediction. By integrating these predicted trajectories into the occupancy map, the robot can proactively avoid potential collisions with humans while efficiently navigating to its destination. Extensive experiments on the Social-HM3D and Social-MP3D datasets demonstrate that SocialNav-Map significantly outperforms state-of-the-art (SOTA) RL-based methods, which require 2,396 GPU hours of training. Notably, it reduces human collision rates by over 10% without necessitating any training in novel environments. By eliminating the need for environment-specific training, SocialNav-Map achieves superior navigation performance, paving the way for the deployment of social navigation systems in real-world environments characterized by diverse human behaviors. The code is available at: this https URL.
中文摘要 在人口密集且动态环境中的社交导航对自主移动机器人构成重大挑战，需要先进的安全交互策略。现有基于强化学习（RL）的方法需要超过2000+小时的深入训练，且通常难以在不熟悉环境中推广，除非额外微调，限制了其在现实场景中的实际应用。为解决这些局限性，我们提出了SocialNav-Map，一种新型零点社交导航框架，结合了动态人类轨迹预测与占用映射，实现安全高效的导航，无需环境特定训练。具体来说，SocialNav-Map首先将任务目标位置转换为构造的地图坐标系。随后，它生成一个动态占用地图，将预测的人类移动作为动态障碍物。该框架采用两种互补的人类轨迹预测方法：历史预测和方位预测。通过将这些预测轨迹整合到占用地图中，机器人可以主动避免与人类的潜在碰撞，同时高效导航到达目的地。在Social-HM3D和Social-MP3D数据集上的大量实验表明，SocialNav-Map显著优于最先进的（SOTA）强化学习方法，后者需要2,396小时GPU训练。值得注意的是，它在不需任何新环境中训练的情况下，将人与人碰撞的发生率降低了10%以上。通过消除对特定环境培训的需求，SocialNav-Map实现了卓越的导航性能，为在具有多样人类行为特征的现实环境中部署社交导航系统铺平了道路。代码可在以下网址获取：https URL。

Reward and Guidance through Rubrics: Promoting Exploration to Improve Multi-Domain Reasoning

通过评分标准奖励与指导：促进探索以提升多领域推理能力

Authors: Baolong Bi, Shenghua Liu, Yiwei Wang, Siqian Tong, Lingrui Mei, Yuyao Ge, Yilong Xu, Jiafeng Guo, Xueqi Cheng
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.12344
Pdf link: https://arxiv.org/pdf/2511.12344
Abstract Recent advances in reinforcement learning (RL) have significantly improved the complex reasoning capabilities of large language models (LLMs). Despite these successes, existing methods mainly focus on single-domain RL (e.g., mathematics) with verifiable rewards (RLVR), and their reliance on purely online RL frameworks restricts the exploration space, thereby limiting reasoning performance. In this paper, we address these limitations by leveraging rubrics to provide both fine-grained reward signals and offline guidance. We propose $\textbf{RGR-GRPO}$ (Reward and Guidance through Rubrics), a rubric-driven RL framework for multi-domain reasoning. RGR-GRPO enables LLMs to receive dense and informative rewards while exploring a larger solution space during GRPO training. Extensive experiments across 14 benchmarks spanning multiple domains demonstrate that RGR-GRPO consistently outperforms RL methods that rely solely on alternative reward schemes or offline guidance. Compared with a verifiable online RL baseline, RGR-GRPO achieves average improvements of +7.0%, +5.4%, +8.4%, and +6.6% on mathematics, physics, chemistry, and general reasoning tasks, respectively. Notably, RGR-GRPO maintains stable entropy fluctuations during off-policy training and achieves superior pass@k performance, reflecting sustained exploration and effective breakthroughs beyond existing performance bottlenecks.
中文摘要 强化学习（RL）的最新进展显著提升了大型语言模型（LLM）的复杂推理能力。尽管取得了这些成功，现有方法主要聚焦于单域强化学习（如数学）和可验证奖励（RLVR），其对纯在线强化学习框架的依赖限制了探索空间，从而限制了推理性能。本文通过利用评分标准提供细粒度的奖励信号和离线指导，解决了这些局限性。我们提出了$\textbf{RGR-GRPO}$（通过评分标准的奖励与指导），这是一个基于评分标准的多领域推理强化学习框架。RGR-GRPO使LLM在GRPO培训期间能够获得密集且信息丰富的奖励，同时探索更广泛的解决方案空间。跨越多个领域的14个基准测试的广泛实验表明，RGR-GRPO始终优于仅依赖替代奖励方案或离线指导的强化学习方法。与可验证的在线强化学习基线相比，RGR-GRPO在数学、物理、化学和普通推理任务中分别实现了+7.0%、+5.4%、+8.4%和+6.6%的平均提升。值得注意的是，RGR-GRPO在非政策训练期间保持稳定的熵波动，并实现了卓越的 pass@k 性能，体现了持续探索和有效突破现有性能瓶颈。
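To make the combination of rubric rewards and GRPO concrete, the sketch below scores each sampled response as a weighted fraction of satisfied rubric items and then applies the group-relative normalization that GRPO uses for advantages. The rubric weighting scheme and the function names are assumptions for illustration; the abstract does not specify how RGR-GRPO aggregates rubric items.

```python
def rubric_reward(checks, weights):
    """Weighted fraction of rubric items satisfied (hypothetical scoring)."""
    total = sum(weights)
    return sum(w for ok, w in zip(checks, weights) if ok) / total


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = variance ** 0.5
    # eps guards against division by zero when all rewards are equal.
    return [(r - mean) / (std + eps) for r in rewards]
```

Because a rubric yields graded scores rather than a single pass or fail bit, responses in the same group are less likely to share an identical reward, which keeps the normalized advantages informative.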

Dynamic Reward Scaling for Multivariate Time Series Anomaly Detection: A VAE-Enhanced Reinforcement Learning Approach

多变量时间序列异常检测的动态奖励尺度：VAE增强强化学习方法

Authors: Bahareh Golchin, Banafsheh Rekabdar
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.12351
Pdf link: https://arxiv.org/pdf/2511.12351
Abstract Detecting anomalies in multivariate time series is essential for monitoring complex industrial systems, where high dimensionality, limited labeled data, and subtle dependencies between sensors cause significant challenges. This paper presents a deep reinforcement learning framework that combines a Variational Autoencoder (VAE), an LSTM-based Deep Q-Network (DQN), dynamic reward shaping, and an active learning module to address these issues in a unified learning framework. The main contribution is the implementation of Dynamic Reward Scaling for Multivariate Time Series Anomaly Detection (DRSMT), which demonstrates how each component enhances the detection process. The VAE captures compact latent representations and reduces noise. The DQN enables adaptive, sequential anomaly classification, and the dynamic reward shaping balances exploration and exploitation during training by adjusting the importance of reconstruction and classification signals. In addition, active learning identifies the most uncertain samples for labeling, reducing the need for extensive manual supervision. Experiments on two multivariate benchmarks, namely Server Machine Dataset (SMD) and Water Distribution Testbed (WADI), show that the proposed method outperforms existing baselines in F1-score and AU-PR. These results highlight the effectiveness of combining generative modeling, reinforcement learning, and selective supervision for accurate and scalable anomaly detection in real-world multivariate systems.
中文摘要 在多变量时间序列中检测异常对于监测复杂工业系统至关重要，因为高维度、有限的标签数据以及传感器间的细微依赖性带来了重大挑战。本文提出了一个深度强化学习框架，结合了变分自编码器（VAE）、基于LSTM的深度Q网络（DQN）、动态奖励塑造和主动学习模块，以在统一学习框架中解决这些问题。主要贡献是实现了多变量时间序列异常检测的动态奖励标度（DRSMT），展示了每个组件如何提升检测过程。VAE捕捉紧凑的潜在表示并减少噪声。DQN支持自适应、顺序异常分类，动态奖励塑造通过调整重建和分类信号的重要性，平衡了训练中的探索与利用。此外，主动学习还能识别最不确定的样本进行标记，减少了大量人工监督的需求。对两个多变量基准测试——服务器机数据集（SMD）和水分配测试平台（WADI）的实验显示，所提方法在F1评分和AU-PR方面优于现有基线。这些结果凸显了生成建模、强化学习和选择性监督结合，在真实多变量系统中实现准确且可扩展异常检测的有效性。

Constructing and Interpreting Digital Twin Representations for Visual Reasoning via Reinforcement Learning

通过强化学习构建和解释数字孪生表示以实现视觉推理

Authors: Yiqing Shen, Mathias Unberath
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.12365
Pdf link: https://arxiv.org/pdf/2511.12365
Abstract Visual reasoning may require models to interpret images and videos and respond to implicit text queries across diverse output formats, from pixel-level segmentation masks to natural language descriptions. Existing approaches rely on supervised fine-tuning with task-specific architectures. For example, reasoning segmentation, grounding, summarization, and visual question answering each demand distinct model designs and training, preventing unified solutions and limiting cross-task and cross-modality generalization. Hence, we propose DT-R1, a reinforcement learning framework that trains large language models to construct digital twin representations of complex multi-modal visual inputs and then reason over these high-level representations as a unified approach to visual reasoning. Specifically, we train DT-R1 using GRPO with a novel reward that validates both structural integrity and output accuracy. Evaluations in six visual reasoning benchmarks, covering two modalities and four task types, demonstrate that DT-R1 consistently achieves improvements over state-of-the-art task-specific models. DT-R1 opens a new direction where visual reasoning emerges from reinforcement learning with digital twin representations.
中文摘要 视觉推理可能需要模型解释图像和视频，并响应跨多种输出格式的隐式文本查询，从像素级分割掩码到自然语言描述。现有方法依赖于针对任务的监督微调架构。例如，推理分割、基础化、总结和视觉问题回答都需要不同的模型设计和训练，阻碍统一解决方案，限制跨任务和跨模态的泛化。因此，我们提出了DT-R1，一种强化学习框架，训练大型语言模型构建复杂多模态视觉输入的数字孪生表示，并基于这些高层次表征进行推理，作为一种统一的视觉推理方法。具体来说，我们用GRPO训练DT-R1，并采用一种新颖的奖励，既验证了结构完整性，也验证了输出准确性。涵盖两种模态和四种任务类型的六个视觉推理基准测试的评估表明，DT-R1 持续优于最先进的任务特定模型。DT-R1开辟了视觉推理的新方向，使视觉推理从数字孪生表征的强化学习中诞生。

Learning Adaptive Neural Teleoperation for Humanoid Robots: From Inverse Kinematics to End-to-End Control

学习人形机器人自适应神经远程操作：从逆运动学到端到端控制

Authors: Sanjar Atamuradov
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.12390
Pdf link: https://arxiv.org/pdf/2511.12390
Abstract Virtual reality (VR) teleoperation has emerged as a promising approach for controlling humanoid robots in complex manipulation tasks. However, traditional teleoperation systems rely on inverse kinematics (IK) solvers and hand-tuned PD controllers, which struggle to handle external forces, adapt to different users, and produce natural motions under dynamic conditions. In this work, we propose a learning-based neural teleoperation framework that replaces the conventional IK+PD pipeline with learned policies trained via reinforcement learning. Our approach learns to directly map VR controller inputs to robot joint commands while implicitly handling force disturbances, producing smooth trajectories, and adapting to user preferences. We train our policies in simulation using demonstrations collected from IK-based teleoperation as initialization, then fine-tune them with force randomization and trajectory smoothness rewards. Experiments on the Unitree G1 humanoid robot demonstrate that our learned policies achieve 34% lower tracking error, 45% smoother motions, and superior force adaptation compared to the IK baseline, while maintaining real-time performance (50Hz control frequency). We validate our approach on manipulation tasks including object pick-and-place, door opening, and bimanual coordination. These results suggest that learning-based approaches can significantly improve the naturalness and robustness of humanoid teleoperation systems.
中文摘要 虚拟现实（VR）远程操作已成为在复杂操作任务中控制人形机器人的有前景方法。然而，传统的远程操作系统依赖逆运动学（IK）求解器和手工调校的PD控制器，这些方法难以应对外力、适应不同用户，并在动态条件下产生自然运动。本研究提出一种基于学习的神经远程操作框架，用强化学习训练的策略取代传统的IK+PD流程。我们的方法学习将VR控制器输入直接映射到机器人关节指令，同时隐式处理力扰动，生成平滑轨迹，并适应用户偏好。我们以基于IK的远程操作演示作为初始化，在仿真中训练策略，然后通过力随机化和轨迹平滑奖励进行微调。在Unitree G1人形机器人上的实验表明，我们学习的策略相比IK基线实现了34%的跟踪误差降低、45%的动作平滑度提升和更优越的力适应，同时保持实时性能（50Hz控制频率）。我们在操作任务中验证了方法，包括物体抓取与放置、开门和双手协调。这些结果表明，基于学习的方法可以显著提升人形机器人远程操作系统的自然性和稳健性。

Integrating Neural Differential Forecasting with Safe Reinforcement Learning for Blood Glucose Regulation

将神经微分预测与安全强化学习相结合以实现血糖调节

Authors: Yushen Liu, Yanfu Zhang, Xugui Zhou
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.12417
Pdf link: https://arxiv.org/pdf/2511.12417
Abstract Automated insulin delivery for Type 1 Diabetes must balance glucose control and safety under uncertain meals and physiological variability. While reinforcement learning (RL) enables adaptive personalization, existing approaches struggle to guarantee safety at the same time and remain prone to risks such as overdosing before meals or stacking corrections, leaving a gap in achieving glucose control that is both personalized and risk-aware. To bridge this gap, we propose TSODE, a safety-aware controller that integrates Thompson Sampling RL with a Neural Ordinary Differential Equation (NeuralODE) forecaster. Specifically, the NeuralODE predicts short-term glucose trajectories conditioned on proposed insulin doses, while a conformal calibration layer quantifies predictive uncertainty to reject or scale risky actions. In the FDA-approved UVa/Padova simulator (adult cohort), TSODE achieved 87.9% time-in-range with less than 10% time below 70 mg/dL, outperforming relevant baselines. These results demonstrate that integrating adaptive RL with calibrated NeuralODE forecasting enables interpretable, safe, and robust glucose regulation.
中文摘要 1型糖尿病的自动胰岛素输注必须在进餐不确定和生理变异的条件下兼顾血糖控制与安全性。虽然强化学习（RL）能够实现自适应个性化，但现有方法难以同时保证安全性，可能出现餐前过量给药或叠加校正等风险行为，在实现既个性化又具风险意识的血糖控制方面仍存在空白。为弥合这一差距，我们提出了TSODE，一种具备安全意识的控制器，将Thompson采样RL与神经常微分方程（NeuralODE）预测器相结合。具体来说，NeuralODE以拟议的胰岛素剂量为条件预测短期血糖轨迹，而共形校准层则量化预测不确定性，以拒绝或缩放有风险的动作。在FDA批准的UVa/Padova模拟器（成人队列）中，TSODE实现了87.9%的目标范围内时间，低于70 mg/dL的时间不到10%，优于相关基线。这些结果表明，将自适应强化学习与经过校准的NeuralODE预测相结合，能够实现可解释、安全且稳健的血糖调节。
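
Illustrative sketch (not from the paper): the TSODE abstract above describes a conformal calibration layer that rejects or scales risky insulin doses. The Python below shows one way such a gate could work, assuming split conformal prediction on absolute forecast residuals. The names `conformal_quantile`, `gate_dose` and `toy_forecaster`, the linear toy forecaster, the halving rule and all calibration numbers are hypothetical; only the 70 mg/dL threshold comes from the abstract.

```python
import math


def conformal_quantile(residuals, alpha):
    """Split-conformal quantile of absolute calibration residuals."""
    n = len(residuals)
    # Finite-sample corrected rank ceil((n + 1) * (1 - alpha)), clipped to n.
    rank = min(n, math.ceil((n + 1) * (1 - alpha)))
    return sorted(residuals)[rank - 1]


def gate_dose(dose, predict_min_glucose, q, hypo_threshold=70.0,
              shrink=0.5, max_steps=5):
    """Scale a proposed dose down until the conformal lower bound on the
    forecast minimum glucose clears the threshold; reject (0.0) otherwise."""
    for _ in range(max_steps):
        lower_bound = predict_min_glucose(dose) - q
        if lower_bound >= hypo_threshold:
            return dose
        dose *= shrink  # hypothetical scaling rule: halve the dose
    return 0.0


def toy_forecaster(dose):
    """Hypothetical stand-in for a learned forecaster (mg/dL)."""
    return 140.0 - 25.0 * dose


# Hypothetical absolute forecast errors on a held-out calibration set (mg/dL).
calibration_residuals = [3.0, 5.0, 8.0, 4.0, 10.0, 6.0, 7.0, 2.0, 9.0]
q = conformal_quantile(calibration_residuals, alpha=0.1)
safe_dose = gate_dose(3.0, toy_forecaster, q)
print(q, safe_dose)
```

In this toy setting the proposed dose of 3.0 units is halved once before its lower bound clears the threshold; a real controller would use the forecaster's full trajectory rather than a single minimum.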

Tailored Primitive Initialization is the Secret Key to Reinforcement Learning

定制化的原语初始化是强化学习的秘密钥匙

Authors: Yihang Yao, Guangtao Zeng, Raina Wu, Yang Zhang, Ding Zhao, Zhang-Wei Hong, Chuang Gan
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.12429
Pdf link: https://arxiv.org/pdf/2511.12429
Abstract Reinforcement learning (RL) has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs). While RL has demonstrated substantial performance gains, it still faces key challenges, including low sampling efficiency and a strong dependence on model initialization: some models achieve rapid improvements with minimal RL steps, while others require significant training data to make progress. In this work, we investigate these challenges through the lens of reasoning token coverage and argue that initializing LLMs with diverse, high-quality reasoning primitives is essential for achieving stable and sample-efficient RL training. We propose Tailor, a finetuning pipeline that automatically discovers and curates novel reasoning primitives, thereby expanding the coverage of reasoning-state distributions before RL. Extensive experiments on mathematical and logical reasoning benchmarks demonstrate that Tailor generates more diverse and higher-quality warm-start data, resulting in higher downstream RL performance.
中文摘要 强化学习（RL）已成为增强大型语言模型（LLM）推理能力的强大范式。尽管强化学习展现了显著的性能提升，但仍面临关键挑战，包括采样效率低和对模型初始化的高度依赖：有些模型在极少的强化学习步骤下快速提升，而另一些则需要大量训练数据才能取得进展。本文通过推理令牌覆盖的视角探讨这些挑战，并主张用多样且高质量的推理原语初始化LLM对于实现稳定且高效的强化学习训练至关重要。我们提出了Tailor，一种微调流程，能够自动发现并策划新的推理原语，从而扩大强化学习前推理状态分布的覆盖范围。大量数学和逻辑推理基准测试表明，Tailor能生成更丰富且更高质量的热启动数据，从而提升下游强化学习性能。

ClutterNav: Gradient-Guided Search for Efficient 3D Clutter Removal with Learned Costmaps

ClutterNav：利用学习成本图实现高效3D杂乱去除的梯度引导搜索

Authors: Navin Sriram Ravie, Keerthi Vasan M, Bijo Sebastian
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.12479
Pdf link: https://arxiv.org/pdf/2511.12479
Abstract Dense clutter removal for target object retrieval presents a challenging problem, especially when targets are embedded deep within densely-packed configurations. It requires foresight to minimize overall changes to the clutter configuration while accessing target objects, avoiding stack destabilization and reducing the number of object removals required. Rule-based planners, when applied to this problem, rely on rigid heuristics, leading to high computational overhead. End-to-end reinforcement learning approaches struggle with interpretability and generalizability over different conditions. To address these issues, we present ClutterNav, a novel decision-making framework that can identify the next best object to be removed so as to access a target object in a given clutter, while minimising stack disturbances. ClutterNav formulates the problem as a continuous reinforcement learning task, where each object removal dynamically updates the understanding of the scene. A removability critic, trained from demonstrations, estimates the cost of removing any given object based on geometric and spatial features. This learned cost is complemented by integrated gradients that assess how the presence or removal of surrounding objects influences the accessibility of the target. By dynamically prioritizing actions that balance immediate removability against long-term target exposure, ClutterNav achieves near human-like strategic sequencing, without predefined heuristics. The proposed approach is validated extensively in simulation and in real-world experiments. The results demonstrate real-time, occlusion-aware decision-making in partially observable environments.
中文摘要 在目标物体反演中去除密集杂波是一个具有挑战性的问题，尤其是在目标深处嵌入密集配置中时。在访问目标对象时，需要有远见地将杂波配置的整体变化降到最低，避免堆栈不稳定并减少所需移除对象的次数。基于规则的规划器在处理这个问题时依赖于僵化的启发式方法，导致计算开销过高。端到端强化学习方法在不同条件下的可解释性和泛化性方面存在困难。为解决这些问题，我们提出了ClutterNav，一种新的决策框架，能够识别下一个最佳对象，以便在给定杂波中访问目标对象，同时最大限度地减少堆栈干扰。ClutterNav将问题表述为一个持续强化学习任务，每次移除对象都会动态更新对场景的理解。可移除性批评者通过演示训练，根据几何和空间特征估算移除任何物体的成本。这种学习成本由综合梯度补充，后者评估周围物体的存在或移除如何影响目标的可达性。通过动态优先级调整动作，平衡即时移除与长期目标暴露，ClutterNav 实现了近乎人类的战略序列，无需预设启发式。该方法在模拟和现实实验中得到了广泛验证。结果展示了在部分可观察环境中实时、感知遮挡的决策能力。
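
Illustrative sketch (not from the paper): the ClutterNav abstract above uses integrated gradients to assess how surrounding objects influence target accessibility. The Python below shows the generic integrated-gradients computation on a hypothetical toy score, using finite differences in place of a trained critic. `toy_access_score`, the three-object occupancy vector and the empty-scene baseline are assumptions made for illustration.

```python
def numerical_gradient(f, x, eps=1e-5):
    """Central-difference gradient of a scalar function f at point x."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def integrated_gradients(f, x, baseline, steps=50):
    """Attribution_i = (x_i - b_i) * average gradient along the straight
    path from baseline b to input x (midpoint Riemann sum)."""
    total = [0.0] * len(x)
    for k in range(steps):
        t = (k + 0.5) / steps
        point = [b + t * (xi - b) for xi, b in zip(x, baseline)]
        grad = numerical_gradient(f, point)
        total = [acc + g for acc, g in zip(total, grad)]
    return [(xi - b) * acc / steps for xi, b, acc in zip(x, baseline, total)]


def toy_access_score(occupancy):
    """Hypothetical accessibility score: more occupied neighbours, less access."""
    a, b, c = occupancy
    return -(2.0 * a + a * b + 0.5 * c)


scene = [1.0, 1.0, 1.0]      # all three neighbouring objects present
empty = [0.0, 0.0, 0.0]      # baseline: no neighbouring objects
attributions = integrated_gradients(toy_access_score, scene, empty)
print(attributions)
```

The attributions sum to the score difference between the scene and the baseline (the completeness property), so the most negative entry points to the object whose presence most reduces accessibility in this toy model.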

Designed to Spread: Generative Approaches to Enhance Information Diffusion

旨在传播：生成式方法促进信息传播

Authors: Ziqing Qian, Jiaying Lei, Shengqi Dang, Nan Cao
Subjects: Social and Information Networks (cs.SI)
Arxiv link: https://arxiv.org/abs/2511.12516
Pdf link: https://arxiv.org/pdf/2511.12516
Abstract Social media has fundamentally transformed how people access information and form social connections, with content expression playing a critical role in driving information diffusion. While prior research has focused largely on network structures and tipping point identification, it provides limited tools for automatically generating content tailored for virality within a specific audience. To fill this gap, we propose the novel task of DOCG and introduce an information enhancement algorithm for generating content optimized for diffusion. Our method includes an influence indicator that enables content-level diffusion assessment without requiring access to network topology, and an information editor that employs reinforcement learning to explore interpretable editing strategies. The editor leverages generative models to produce semantically faithful, audience-aware textual or visual content. Experiments on real-world social media datasets and a user study demonstrate that our approach significantly improves diffusion effectiveness while preserving the core semantics of the original content.
中文摘要 社交媒体从根本上改变了人们获取信息和建立社交联系的方式，内容表达在推动信息传播中发挥着关键作用。虽然此前的研究主要聚焦于网络结构和临界点识别，但该研究提供了有限的工具，用于自动生成针对特定受众病毒式传播的内容。为填补这一空白，我们提出了新颖任务DOCG，并引入了一种信息增强算法，用于生成优化扩散内容。我们的方法包括一个影响指示器，可在无需网络拓扑的情况下实现内容层级扩散评估，以及一个信息编辑器，利用强化学习探索可解释的编辑策略。编辑器利用生成模型生成语义忠实、受众感知的文本或视觉内容。在真实世界社交媒体数据集上的实验和用户研究表明，我们的方法在保留原始内容核心语义的同时，显著提升了扩散效果。

TAdaRAG: Task Adaptive Retrieval-Augmented Generation via On-the-Fly Knowledge Graph Construction

TAdaRAG：通过动态知识图谱构建实现任务自适应检索增强生成

Authors: Jie Zhang, Bo Tang, Wanzi Shao, Wenqiang Wei, Jihao Zhao, Jianqing Zhu, Zhiyu Li, Wen Xi, Zehao Lin, Feiyu Xiong, Yanchao Tan
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.12520
Pdf link: https://arxiv.org/pdf/2511.12520
Abstract Retrieval-Augmented Generation (RAG) improves large language models by retrieving external knowledge, which is often truncated into smaller chunks to fit the input context window; the resulting information loss leads to hallucinated responses and broken reasoning chains. Moreover, traditional RAG retrieves unstructured knowledge, introducing irrelevant details that hinder accurate reasoning. To address these issues, we propose TAdaRAG, a novel RAG framework for on-the-fly task-adaptive knowledge graph construction from external sources. Specifically, we design an intent-driven routing mechanism to a domain-specific extraction template, followed by supervised fine-tuning and a reinforcement learning-based implicit extraction mechanism, ensuring concise, coherent, and non-redundant knowledge integration. Evaluations on six public benchmarks and a real-world business benchmark (NowNewsQA) across three backbone models demonstrate that TAdaRAG outperforms existing methods across diverse domains and long-text tasks, highlighting its strong generalization and practical effectiveness.
中文摘要 检索增强生成（RAG）通过检索外部知识来改进大型语言模型，这些外部知识常因输入上下文窗口而被截断成更小的块，导致信息丢失，进而导致反应幻觉和推理链断裂。此外，传统RAG会检索非结构化的知识，引入无关细节，阻碍准确推理。为解决这些问题，我们提出了TAdaRAG，一种用于从外部源实时构建任务自适应知识图谱的新型RAG框架。具体来说，我们设计了一个基于意图驱动的路由机制，连接到领域特定的提取模板，随后进行监督微调和基于强化学习的隐式提取机制，确保知识集成简洁、连贯且无重复。基于六个公开基准和一个真实世界业务基准（NowNewsQA）的三个骨干模型的评估显示，TAdaRAG在多个领域和长文本任务中优于现有方法，凸显了其强大的泛化性和实用性。

ReaSon: Reinforced Causal Search with Information Bottleneck for Video Understanding

ReaSon：带有信息瓶颈的强化因果搜索以促进视频理解

Authors: Yuan Zhou, Litao Hua, Shilong Jin, Wentao Huang, Haoran Duan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.12530
Pdf link: https://arxiv.org/pdf/2511.12530
Abstract Keyframe selection has become essential for video understanding with vision-language models (VLMs) due to limited input tokens and the temporal sparsity of relevant information across video frames. Video understanding often relies on effective keyframes that are not only informative but also causally decisive. To this end, we propose Reinforced Causal Search with Information Bottleneck (ReaSon), a framework that formulates keyframe selection as an optimization problem with the help of a novel Causal Information Bottleneck (CIB), which explicitly defines keyframes as those satisfying both predictive sufficiency and causal necessity. Specifically, ReaSon employs a learnable policy network to select keyframes from a visually relevant pool of candidate frames to capture predictive sufficiency, and then assesses causal necessity via counterfactual interventions. Finally, a composite reward aligned with the CIB principle is designed to guide the selection policy through reinforcement learning. Extensive experiments on NExT-QA, EgoSchema, and Video-MME demonstrate that ReaSon consistently outperforms existing state-of-the-art methods under limited-frame settings, validating its effectiveness and generalization ability.
中文摘要 由于输入代币有限且视频帧间相关信息的时间稀疏性，关键帧选择已成为视觉语言模型（VLM）视频理解的关键。视频理解通常依赖于不仅信息丰富且具有因果决定性的关键帧。为此，我们提出了带信息瓶颈的强化因果搜索（ReaSon），该框架利用一种新型因果信息瓶颈（CIB）将关键帧选择作为优化问题，明确定义关键帧为既满足预测充分性又满足因果必然性的关键帧。具体来说，ReaSon采用可学习的策略网络，从视觉相关的候选帧池中选择关键帧以捕捉预测充分性，然后通过反事实干预评估因果必然性。最后，设计了与CIB原则一致的复合奖励，通过强化学习指导选择策略。在NExT-QA、EgoSchema和Video-MME上的大量实验表明，ReaSon在有限帧设置下持续优于现有最先进方法，验证了其有效性和泛化能力。

Mitigating Length Bias in RLHF through a Causal Lens

通过因果视角缓解RLHF中的长度偏置

Authors: Hyeonji Kim, Sujeong Oh, Sanghack Lee
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.12573
Pdf link: https://arxiv.org/pdf/2511.12573
Abstract Reinforcement learning from human feedback (RLHF) is widely used to align large language models (LLMs) with human preferences. However, RLHF-trained reward models often exhibit length bias -- a systematic tendency to favor longer responses by conflating verbosity with quality. We propose a causal framework for analyzing and mitigating length bias in RLHF reward modeling. Central to our approach is a counterfactual data augmentation method that generates response pairs designed to isolate content quality from verbosity. These counterfactual examples are then used to train the reward model, enabling it to assess responses based on content quality independently of verbosity. Specifically, we construct (1) length-divergent pairs with similar content and (2) content-divergent pairs of similar length. Empirical evaluations show that our method reduces length bias in reward assignment and leads to more concise, content-focused outputs from the policy model. These findings demonstrate that the proposed approach effectively reduces length bias and improves the robustness and content sensitivity of reward modeling in RLHF pipelines.
中文摘要 来自人类反馈的强化学习（RLHF）被广泛用于将大型语言模型（LLMs）与人类偏好对齐。然而，RLHF训练的奖励模型常表现出长度偏倚——这是一种系统性倾向，倾向于通过将冗长与质量混为一谈，从而倾向于延长响应。我们提出了一个因果框架，用于分析和减轻RLHF奖励建模中的长度偏差。我们方法的核心是一种反事实数据增强方法，生成旨在区分内容质量与冗长的响应对。这些反事实例子随后被用来训练奖励模型，使其能够独立于冗长程度，基于内容质量评估回答。具体来说，我们构造了（1）内容相似的长度发散对和（2）长度相似的内容发散对。实证评估表明，我们的方法减少了奖励分配中的长度偏差，并促使策略模型产生更简洁、内容为中心的输出。这些发现表明，所提方法有效减少了长度偏置，并提升了RLHF流程中奖励建模的鲁棒性和内容敏感性。

NFQ2.0: The CartPole Benchmark Revisited

NFQ2.0：CartPole 基准测试再访

Authors: Sascha Lange, Roland Hafner, Martin Riedmiller
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.12644
Pdf link: https://arxiv.org/pdf/2511.12644
Abstract This article revisits the 20-year-old neural fitted Q-iteration (NFQ) algorithm on its classical CartPole benchmark. NFQ was a pioneering approach towards modern Deep Reinforcement Learning (Deep RL) in applying multi-layer neural networks to reinforcement learning for real-world control problems. We explore the algorithm's conceptual simplicity and its transition from online to batch learning, which contributed to its stability. Despite its initial success, NFQ required extensive tuning and was not easily reproducible on real-world control problems. We propose a modernized variant NFQ2.0 and apply it to the CartPole task, concentrating on a real-world system built from standard industrial components, to investigate and improve the learning process's repeatability and robustness. Through ablation studies, we highlight key design decisions and hyperparameters that enhance performance and stability of NFQ2.0 over the original variant. Finally, we demonstrate how our findings can assist practitioners in reproducing and improving results and applying deep reinforcement learning more effectively in industrial contexts.
中文摘要 本文回顾了已有20年历史的神经拟合Q迭代（NFQ）算法，基于其经典的CartPole基准测试。NFQ是现代深度强化学习（Deep RL）的开创性方法，将多层神经网络应用于现实控制问题的强化学习。我们探讨了该算法的概念简洁性及其从在线向批量学习的转变，这些都促进了其稳定性。尽管最初取得成功，NFQ 需要大量调校，且在实际控制问题中难以复现。我们提出了一个现代化的NFQ2.0变体，并将其应用于CartPole任务，重点是基于标准工业组件构建的真实系统，以研究并提升学习过程的重复性和鲁棒性。通过消融研究，我们重点介绍了关键的设计决策和超参数，这些因素提升了NFQ2.0相较于原始变体的性能和稳定性。最后，我们展示了我们的发现如何帮助从业者在工业环境中复制和改进结果，并更有效地应用深度强化学习。
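
Illustrative sketch (not from the paper): the NFQ abstract above turns on batch learning, where Q-value targets are recomputed over a fixed set of transitions and a regressor is refit on them at every iteration. The Python below shows that loop on a hypothetical three-state chain, with a tabular mean standing in for the neural network regressor that NFQ actually uses. The environment, the batch and the name `fitted_q_iteration` are assumptions.

```python
from collections import defaultdict


def fitted_q_iteration(batch, actions, gamma=0.9, iterations=50):
    """Batch fitted Q-iteration with a tabular 'regressor'.

    batch: list of (state, action, reward, next_state, done) transitions.
    """
    q = defaultdict(float)
    for _ in range(iterations):
        targets = defaultdict(list)
        # Step 1: build regression targets from the whole fixed batch.
        for s, a, r, s_next, done in batch:
            bootstrap = 0.0 if done else max(q[(s_next, b)] for b in actions)
            targets[(s, a)].append(r + gamma * bootstrap)
        # Step 2: refit the regressor; here the fit is a per-key mean.
        q = defaultdict(float, {k: sum(v) / len(v) for k, v in targets.items()})
    return q


# Hypothetical chain 0 -> 1 -> 2, where reaching state 2 yields reward 1.
ACTIONS = ["left", "right"]
batch = [
    (0, "right", 0.0, 1, False),
    (0, "left", 0.0, 0, False),
    (1, "right", 1.0, 2, True),
    (1, "left", 0.0, 0, False),
]
q_values = fitted_q_iteration(batch, ACTIONS)
greedy = {s: max(ACTIONS, key=lambda a: q_values[(s, a)]) for s in (0, 1)}
print(dict(q_values), greedy)
```

Because the batch never changes between iterations, every run over the same data gives the same result, which is the property that makes batch methods easier to reproduce than purely online updates.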

Task-Aware Morphology Optimization of Planar Manipulators via Reinforcement Learning

通过强化学习实现平面机械臂的任务感知形态优化

Authors: Arvind Kumar Mishra, Sohom Chakrabarty
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2511.12650
Pdf link: https://arxiv.org/pdf/2511.12650
Abstract In this work, Yoshikawa's manipulability index is used to investigate reinforcement learning (RL) as a framework for morphology optimization in planar robotic manipulators. A 2R manipulator tracking a circular end-effector path is first examined because this case has a known analytical optimum: equal link lengths and the second joint orthogonal to the first. This serves as a validation step to test whether RL can rediscover the optimum using reward feedback alone, without access to the manipulability expression or the Jacobian. Three RL algorithms (SAC, DDPG, and PPO) are compared with grid search and black-box optimizers, with morphology represented by a single action parameter phi that maps to the link lengths. All methods converge to the analytical solution, showing that numerical recovery of the optimum is possible without supplying analytical structure. Most morphology design tasks have no closed-form solutions, and grid or heuristic search becomes expensive as dimensionality increases. RL is therefore explored as a scalable alternative. The formulation used for the circular path is extended to elliptical and rectangular paths by expanding the action space to the full morphology vector (L1, L2, theta2). In these non-analytical settings, RL continues to converge reliably, whereas grid and black-box methods require far larger evaluation budgets. These results indicate that RL is effective for both recovering known optima and solving morphology optimization problems without analytical solutions.
中文摘要 本研究采用吉川（Yoshikawa）可操作度指标，探讨将强化学习（RL）作为平面机械臂形态优化的框架。首先研究跟踪圆形末端执行器路径的2R机械臂，因为该情形存在已知的解析最优解：两连杆长度相等，且第二关节与第一关节正交。这一步用于验证强化学习能否仅凭奖励反馈重新发现该最优解，而无需获得可操作度表达式或雅可比矩阵。将三种强化学习算法（SAC、DDPG和PPO）与网格搜索和黑箱优化器进行比较，形态由映射到连杆长度的单一动作参数phi表示。所有方法都收敛到解析解，表明在不提供解析结构的情况下，数值恢复最优解是可能的。大多数形态设计任务没有闭式解，且随着维度增加，网格或启发式搜索的代价会变得高昂。因此，强化学习被探索为一种可扩展的替代方案。通过将动作空间扩展为完整的形态向量（L1, L2, theta2），针对圆形路径的表述被推广到椭圆形和矩形路径。在这些没有解析解的情形中，强化学习依然能可靠收敛，而网格和黑箱方法则需要大得多的评估预算。这些结果表明，强化学习既能恢复已知最优解，也能求解没有解析解的形态优化问题。
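
Illustrative sketch (not from the paper): the abstract above states that a 2R manipulator has a known optimum at equal link lengths with the second joint at a right angle. The Python below checks this with a plain grid search over Yoshikawa's index, which for a planar 2R arm reduces to |det J| = L1 * L2 * |sin(theta2)|. The mapping L1 = phi * L, L2 = (1 - phi) * L with a fixed total length L is an assumption, since the abstract does not give the paper's mapping, and the sketch scores a single configuration rather than a tracked path.

```python
import math


def manipulability(l1, l2, theta1, theta2):
    """Yoshikawa index |det J| for a planar 2R arm with explicit Jacobian."""
    s1, c1 = math.sin(theta1), math.cos(theta1)
    s12, c12 = math.sin(theta1 + theta2), math.cos(theta1 + theta2)
    j11, j12 = -l1 * s1 - l2 * s12, -l2 * s12
    j21, j22 = l1 * c1 + l2 * c12, l2 * c12
    return abs(j11 * j22 - j12 * j21)


def grid_search(total_length=1.0, n=101, theta1=0.3):
    """Search phi in (0, 1) and theta2 in [0, pi]; theta1 does not affect det J."""
    best_w, best_phi, best_theta2 = -1.0, None, None
    for i in range(1, n - 1):
        phi = i / (n - 1)
        l1, l2 = phi * total_length, (1.0 - phi) * total_length
        for j in range(n):
            theta2 = math.pi * j / (n - 1)
            w = manipulability(l1, l2, theta1, theta2)
            if w > best_w:
                best_w, best_phi, best_theta2 = w, phi, theta2
    return best_w, best_phi, best_theta2


best_w, best_phi, best_theta2 = grid_search()
print(best_w, best_phi, math.degrees(best_theta2))
```

With a unit total length, the search recovers phi = 0.5 and theta2 = 90 degrees, matching the analytical optimum quoted in the abstract.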

Beyond Fixed Tasks: Unsupervised Environment Design for Task-Level Pairs

超越固定任务：面向任务-关卡对的无监督环境设计

Authors: Daniel Furelos-Blanco, Charles Pert, Frederik Kelbel, Alex F. Spies, Alessandra Russo, Michael Dennis
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.12706
Pdf link: https://arxiv.org/pdf/2511.12706
Abstract Training general agents to follow complex instructions (tasks) in intricate environments (levels) remains a core challenge in reinforcement learning. Random sampling of task-level pairs often produces unsolvable combinations, highlighting the need to co-design tasks and levels. While unsupervised environment design (UED) has proven effective at automatically designing level curricula, prior work has only considered a fixed task. We present ATLAS (Aligning Tasks and Levels for Autocurricula of Specifications), a novel method that generates joint autocurricula over tasks and levels. Our approach builds upon UED to automatically produce solvable yet challenging task-level pairs for policy training. To evaluate ATLAS and drive progress in the field, we introduce an evaluation suite that models tasks as reward machines in Minigrid levels. Experiments demonstrate that ATLAS vastly outperforms random sampling approaches, particularly when sampling solvable pairs is unlikely. We further show that mutations leveraging the structure of both tasks and levels accelerate convergence to performant policies.
中文摘要 训练通用智能体在复杂环境（关卡）中遵循复杂指令（任务）仍然是强化学习的核心挑战。对任务-关卡对进行随机采样常常产生无解的组合，这凸显了对任务和关卡进行协同设计的必要性。虽然无监督环境设计（UED）已被证明能有效地自动设计关卡课程，但以往的工作只考虑固定任务。我们提出ATLAS（Aligning Tasks and Levels for Autocurricula of Specifications），这是一种在任务和关卡上生成联合自动课程的新方法。我们的方法建立在UED之上，自动生成可解但具有挑战性的任务-关卡对用于策略训练。为了评估ATLAS并推动该领域的进展，我们引入了一套评估套件，将任务建模为Minigrid关卡中的奖励机（reward machine）。实验表明，ATLAS大幅优于随机采样方法，尤其是在不太可能采样到可解对的情况下。我们进一步表明，同时利用任务和关卡结构的变异操作能加速收敛到高性能策略。

Prompt-Driven Domain Adaptation for End-to-End Autonomous Driving via In-Context RL

通过上下文强化学习实现端到端自动驾驶的提示驱动域适配

Authors: Aleesha Khurram, Amir Moeini, Shangtong Zhang, Rohan Chandra
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.12755
Pdf link: https://arxiv.org/pdf/2511.12755
Abstract Despite significant progress in autonomous driving, many end-to-end systems still struggle with domain adaptation (DA), such as transferring a policy trained under clear weather to adverse weather conditions. Typical DA strategies in the literature include collecting additional data in the target domain or re-training the model, or both. Both these strategies quickly become impractical as we increase the scale and complexity of driving. These limitations have encouraged investigation into few-shot and zero-shot prompt-driven DA at inference time involving LLMs and VLMs. These methods work by adding a few state-action trajectories during inference to the prompt (similar to in-context learning). However, there are two limitations of such an approach: $(i)$ prompt-driven DA methods are currently restricted to perception tasks such as detection and segmentation and $(ii)$ they require expert few-shot data. In this work, we present a new approach to inference-time few-shot prompt-driven DA for closed-loop autonomous driving in adverse weather conditions using in-context reinforcement learning (ICRL). Similar to other prompt-driven DA methods, our approach requires neither updates to the model parameters nor additional data collection in the adverse weather regime. Furthermore, our approach advances the state-of-the-art in prompt-driven DA by extending to closed-loop driving using general trajectories observed during inference. Our experiments using the CARLA simulator show that ICRL results in safer, more efficient, and more comfortable driving policies in the target domain compared to state-of-the-art prompt-driven DA baselines.
中文摘要 尽管自动驾驶取得了重大进展，许多端到端系统在域适应（DA）方面仍有困难，例如将在晴朗天气下训练的策略迁移到恶劣天气条件。文献中典型的DA策略包括在目标域收集额外数据、重新训练模型，或两者兼用。随着驾驶场景规模和复杂度的增加，这两种策略很快变得不切实际。这些局限促使人们研究在推理阶段借助大型语言模型（LLM）和视觉语言模型（VLM）进行少样本和零样本的提示驱动DA。这些方法在推理时向提示中加入少量状态-动作轨迹（类似于上下文学习）。然而，这种做法存在两个局限：$(i)$ 提示驱动的DA方法目前仅限于检测和分割等感知任务；$(ii)$ 它们需要专家提供的少样本数据。本研究提出一种利用上下文强化学习（ICRL）的推理阶段少样本提示驱动DA新方法，用于恶劣天气条件下的闭环自动驾驶。与其他提示驱动的DA方法类似，我们的方法既不需要更新模型参数，也不需要在恶劣天气条件下额外收集数据。此外，我们的方法利用推理过程中观察到的一般轨迹，将提示驱动DA扩展到闭环驾驶，从而推进了该方向的最新水平。我们在CARLA模拟器上的实验表明，与最先进的提示驱动DA基线相比，ICRL在目标域中得到的驾驶策略更安全、更高效、更舒适。

Scalable Multi-Objective and Meta Reinforcement Learning via Gradient Estimation

通过梯度估计实现可扩展的多目标和元强化学习

Authors: Zhenshuo Zhang, Minxuan Duan, Youran Ye, Hongyang R. Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.12779
Pdf link: https://arxiv.org/pdf/2511.12779
Abstract We study the problem of efficiently estimating policies that simultaneously optimize multiple objectives in reinforcement learning (RL). Given $n$ objectives (or tasks), we seek the optimal partition of these objectives into $k \ll n$ groups, where each group comprises related objectives that can be trained together. This problem arises in applications such as robotics, control, and preference optimization in language models, where learning a single policy for all $n$ objectives is suboptimal as $n$ grows. We introduce a two-stage procedure -- meta-training followed by fine-tuning -- to address this problem. We first learn a meta-policy for all objectives using multitask learning. Then, we adapt the meta-policy to multiple randomly sampled subsets of objectives. The adaptation step leverages a first-order approximation property of well-trained policy networks, which is empirically verified to be accurate within a $2\%$ error margin across various RL environments. The resulting algorithm, PolicyGradEx, efficiently estimates an aggregate task-affinity score matrix given a policy evaluation algorithm. Based on the estimated affinity score matrix, we cluster the $n$ objectives into $k$ groups by maximizing the intra-cluster affinity scores. Experiments on three robotic control and the Meta-World benchmarks demonstrate that our approach outperforms state-of-the-art baselines by $16\%$ on average, while delivering up to a $26\times$ speedup relative to performing full training to obtain the clusters. Ablation studies validate each component of our approach. For instance, compared with random grouping and gradient-similarity-based grouping, our loss-based clustering yields an improvement of $19\%$. Finally, we analyze the generalization error of policy networks by measuring the Hessian trace of the loss surface, which gives non-vacuous measures relative to the observed generalization errors.
中文摘要 我们研究在强化学习（RL）中高效估计同时优化多个目标的策略的问题。给定$n$个目标（或任务），我们寻求将这些目标最优地划分为$k \ll n$个组，每个组包含可以一起训练的相关目标。这个问题出现在机器人学、控制和语言模型偏好优化等应用中；随着$n$的增长，为全部$n$个目标学习单一策略并非最优。我们引入一个两阶段流程（先元训练，再微调）来解决这个问题。我们首先通过多任务学习为所有目标学习一个元策略。然后，我们将元策略适配到多个随机采样的目标子集上。适配步骤利用了训练良好的策略网络的一阶近似性质，经实证验证，该近似在各种强化学习环境中的误差在$2\%$以内。由此得到的算法PolicyGradEx在给定策略评估算法的情况下，能高效估计一个聚合的任务亲和度得分矩阵。基于估计出的亲和度得分矩阵，我们通过最大化簇内亲和度得分，将$n$个目标聚类为$k$个组。在三个机器人控制基准和Meta-World基准上的实验表明，我们的方法平均比最先进的基线高出$16\%$，同时相对于通过完整训练来获得聚类，最高可实现$26\times$的加速。消融研究验证了我们方法的每个组成部分。例如，与随机分组和基于梯度相似度的分组相比，我们基于损失的聚类带来了$19\%$的提升。最后，我们通过测量损失曲面的Hessian迹来分析策略网络的泛化误差，相对于观测到的泛化误差，它给出了非空泛的度量。
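
Illustrative sketch (not from the paper): the abstract above clusters $n$ objectives into $k$ groups by maximizing intra-cluster affinity scores. The Python below shows that final step by brute force on a hypothetical 4x4 affinity matrix. The size-normalised score, the exhaustive enumeration and the matrix values are assumptions; the paper's actual clustering objective and solver are not given in the abstract.

```python
from itertools import product


def intra_cluster_score(assignment, affinity, k):
    """Sum over groups of (total pairwise affinity inside the group) / (group size)."""
    total = 0.0
    for group in range(k):
        members = [i for i, g in enumerate(assignment) if g == group]
        if not members:
            return float("-inf")  # disallow empty groups
        total += sum(affinity[i][j] for i in members for j in members) / len(members)
    return total


def best_partition(affinity, k):
    """Exhaustively search all k^n assignments; only feasible for small n."""
    n = len(affinity)
    best_score, best_assignment = float("-inf"), None
    for assignment in product(range(k), repeat=n):
        score = intra_cluster_score(assignment, affinity, k)
        if score > best_score:
            best_score, best_assignment = score, assignment
    return best_assignment, best_score


# Hypothetical affinity scores: objectives {0, 1} and {2, 3} help each other.
affinity = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.9, 1.0],
]
assignment, score = best_partition(affinity, k=2)
print(assignment, score)
```

Exhaustive search grows as k^n, so for realistic numbers of objectives a heuristic or relaxation would be needed in its place.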

Multi-Agent Reinforcement Learning for Heterogeneous Satellite Cluster Resources Optimization

多智能体强化学习用于异构卫星集群资源优化

Authors: Mohamad A. Hady, Siyi Hu, Mahardhika Pratama, Zehong Cao, Ryszard Kowalczyk
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.12792
Pdf link: https://arxiv.org/pdf/2511.12792
Abstract This work investigates resource optimization in heterogeneous satellite clusters performing autonomous Earth Observation (EO) missions using Reinforcement Learning (RL). In the proposed setting, two optical satellites and one Synthetic Aperture Radar (SAR) satellite operate cooperatively in low Earth orbit to capture ground targets and manage their limited onboard resources efficiently. Traditional optimization methods struggle to handle the real-time, uncertain, and decentralized nature of EO operations, motivating the use of RL and Multi-Agent Reinforcement Learning (MARL) for adaptive decision-making. This study systematically formulates the optimization problem from single-satellite to multi-satellite scenarios, addressing key challenges including energy and memory constraints, partial observability, and agent heterogeneity arising from diverse payload capabilities. Using a near-realistic simulation environment built on the Basilisk and BSK-RL frameworks, we evaluate the performance and stability of state-of-the-art MARL algorithms such as MAPPO, HAPPO, and HATRPO. Results show that MARL enables effective coordination across heterogeneous satellites, balancing imaging performance and resource utilization while mitigating non-stationarity and inter-agent reward coupling. The findings provide practical insights into scalable, autonomous satellite operations and contribute a foundation for future research on intelligent EO mission planning under heterogeneous and dynamic conditions.
中文摘要 本研究利用强化学习（RL）探讨执行自主地球观测（EO）任务的异构卫星集群中的资源优化。在所提出的设定中，两颗光学卫星和一颗合成孔径雷达（SAR）卫星在近地轨道协同运行，以拍摄地面目标并高效管理其有限的星上资源。传统优化方法难以应对EO操作的实时性、不确定性和去中心化特性，这促使采用强化学习和多智能体强化学习（MARL）进行自适应决策。本研究系统地构建了从单星到多星场景的优化问题，应对能量和存储约束、部分可观测性以及由不同有效载荷能力带来的智能体异质性等关键挑战。利用基于Basilisk和BSK-RL框架构建的近乎真实的仿真环境，我们评估了MAPPO、HAPPO和HATRPO等先进MARL算法的性能和稳定性。结果显示，MARL能够在异构卫星之间实现有效协调，平衡成像性能和资源利用，同时缓解非平稳性和智能体间的奖励耦合。这些发现为可扩展的自主卫星运行提供了实用见解，并为未来在异构和动态条件下的智能EO任务规划研究奠定了基础。

Maximizing the efficiency of human feedback in AI alignment: a comparative analysis

最大化人工智能对齐中人类反馈的效率：一项比较分析

Authors: Andreas Chouliaras, Dimitris Chatzopoulos
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.12796
Pdf link: https://arxiv.org/pdf/2511.12796
Abstract Reinforcement Learning from Human Feedback (RLHF) relies on preference modeling to align machine learning systems with human values, yet the popular approach of random pair sampling with Bradley-Terry modeling is statistically limited and inefficient under constrained annotation budgets. In this work, we explore alternative sampling and evaluation strategies for preference inference in RLHF, drawing inspiration from areas such as game theory, statistics, and social choice theory. Our best-performing method, Swiss InfoGain, employs a Swiss tournament system with a proxy mutual-information-gain pairing rule, which significantly outperforms all other methods in constrained annotation budgets while also being more sample-efficient. Even in high-resource settings, we can identify superior alternatives to the Bradley-Terry baseline. Our experiments demonstrate that adaptive, resource-aware strategies reduce redundancy, enhance robustness, and yield statistically significant improvements in preference learning, highlighting the importance of balancing alignment quality with human workload in RLHF pipelines.
中文摘要 人类反馈强化学习（RLHF）依赖偏好建模来使机器学习系统与人类价值观对齐，但采用Bradley-Terry建模的随机配对抽样方法在统计上有限且在受限的注释预算下效率低下。本研究探讨了RLHF偏好推断的替代抽样和评估策略，借鉴了博弈论、统计学和社会选择理论等领域的灵感。我们表现最好的方法Swiss InfoGain采用了带有代理互助信息增益配对规则的瑞士锦标赛系统，在受限的注释预算下显著优于其他方法，同时也提高了样本效率。即使在资源丰富的环境中，我们也能识别出比Bradley-Terry基线更优的替代方案。我们的实验表明，自适应且资源感知型策略能减少冗余、增强鲁棒性，并在偏好学习方面带来统计学上的显著改善，凸显了在RLHF流水线中平衡对齐质量与人类工作负荷的重要性。
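
Illustrative sketch (not from the paper): the abstract above takes Bradley-Terry modeling as its baseline for preference inference. The Python below fits Bradley-Terry strengths from pairwise win counts with the standard minorization-maximization update, under which P(i beats j) = p_i / (p_i + p_j). The win counts and function names are hypothetical, and the Swiss InfoGain pairing rule itself is not reproduced here.

```python
def fit_bradley_terry(wins, n_items, iterations=200):
    """Fit Bradley-Terry strengths by the MM update.

    wins[(i, j)] is the number of times item i was preferred over item j.
    Assumes every item wins at least once, so all strengths stay positive.
    """
    strengths = [1.0 / n_items] * n_items
    for _ in range(iterations):
        updated = []
        for i in range(n_items):
            total_wins = sum(wins.get((i, j), 0) for j in range(n_items) if j != i)
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0))
                / (strengths[i] + strengths[j])
                for j in range(n_items) if j != i
            )
            updated.append(total_wins / denom if denom > 0 else strengths[i])
        norm = sum(updated)
        strengths = [s / norm for s in updated]  # fix the scale: sum to 1
    return strengths


def win_probability(strengths, i, j):
    """Model probability that item i is preferred over item j."""
    return strengths[i] / (strengths[i] + strengths[j])


# Hypothetical annotation results for three candidate responses.
wins = {(0, 1): 8, (1, 0): 2, (1, 2): 7, (2, 1): 3, (0, 2): 9, (2, 0): 1}
strengths = fit_bradley_terry(wins, n_items=3)
print(strengths, win_probability(strengths, 0, 2))
```

Each comparison costs one human annotation, which is why the choice of which pairs to compare matters when the annotation budget is tight.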

Expressive Temporal Specifications for Reward Monitoring

奖励监测的表达性时间规范

Authors: Omar Adalat, Francesco Belardinelli
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Arxiv link: https://arxiv.org/abs/2511.12808
Pdf link: https://arxiv.org/pdf/2511.12808
Abstract Specifying informative and dense reward functions remains a pivotal challenge in Reinforcement Learning, as it directly affects the efficiency of agent training. In this work, we harness the expressive power of quantitative Linear Temporal Logic on finite traces ($\text{LTL}_f[\mathcal{F}]$) to synthesize reward monitors that generate a dense stream of rewards for runtime-observable state trajectories. By providing nuanced feedback during training, these monitors guide agents toward optimal behaviour and help mitigate the well-known issue of sparse rewards under long-horizon decision making, which arises under the Boolean semantics dominating the current literature. Our framework is algorithm-agnostic and only relies on a state labelling function, and naturally accommodates specifying non-Markovian properties. Empirical results show that our quantitative monitors consistently subsume and, depending on the environment, outperform Boolean monitors in maximizing a quantitative measure of task completion and in reducing convergence time.
中文摘要 指定信息丰富且密集的奖励函数仍然是强化学习中的关键挑战，因为它直接影响代理训练的效率。在本研究中，我们利用定量线性时序逻辑在有限迹（（$\text{LTL}_f[\mathcal{F}]$））上的表达力，合成了生成大量奖励流的奖励监视器，满足运行时可观测的状态轨迹。通过在培训过程中提供细致的反馈，这些监控器引导代理走向最佳行为，并帮助缓解长期决策下奖励稀疏的已知问题，这一问题源于当前文献中主导的布尔语义学。我们的框架与算法无关，仅依赖状态标记函数，自然支持非马尔可夫性质的指定。实证结果表明，我们的定量监测器在最大化任务完成量和缩短收敛时间方面，始终在概括甚至根据环境优于布尔监测器。
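
Illustrative sketch (not from the paper): the abstract above synthesizes reward monitors from quantitative temporal specifications so that rewards arrive densely rather than only on task completion. The Python below assumes a common quantitative reading in which atoms take values in [0, 1], "eventually" is a maximum over the trace, "always" is a minimum, and conjunction is a minimum. The specification (eventually near the goal, and always safe), the labels and the difference-based reward are hypothetical choices, not the paper's construction.

```python
def eventually(values):
    """Quantitative 'eventually': best value seen so far on the finite trace."""
    return max(values)


def always(values):
    """Quantitative 'always': worst value seen so far on the finite trace."""
    return min(values)


def monitor_value(trace):
    """Value of (eventually near_goal) AND (always safe) on a trace prefix."""
    reach = eventually([step["near_goal"] for step in trace])
    safety = always([step["safe"] for step in trace])
    return min(reach, safety)


def dense_rewards(trace):
    """Reward at step t = change in monitor value after observing step t."""
    rewards, previous = [], 0.0
    for t in range(1, len(trace) + 1):
        current = monitor_value(trace[:t])
        rewards.append(current - previous)
        previous = current
    return rewards


# Hypothetical labelled trajectory: the agent approaches the goal while
# its safety margin drops slightly.
trace = [
    {"near_goal": 0.2, "safe": 1.0},
    {"near_goal": 0.5, "safe": 0.8},
    {"near_goal": 0.9, "safe": 0.8},
]
rewards = dense_rewards(trace)
print(rewards, monitor_value(trace))
```

The per-step rewards telescope to the final monitor value, so the agent receives feedback at every step while the total still reflects how well the specification was satisfied. A Boolean monitor for the same formula would stay at zero until the goal label became true.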

Mapping fNIRS Signals to Agent Performance: Toward Reinforcement Learning from Neural Feedback

将fNIRS信号映射到代理表现：迈向神经反馈强化学习

Authors: Julia Santaniello, Matthew Russell, Benson Jiang, Donatello Sassaroli, Robert Jacob, Jivko Sinapov
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.12844
Pdf link: https://arxiv.org/pdf/2511.12844
Abstract Reinforcement Learning from Human Feedback (RLHF) is a methodology that aligns agent behavior with human preferences by integrating human feedback into the agent's training process. We introduce a possible framework that employs passive Brain-Computer Interfaces (BCI) to guide agent training from implicit neural signals. We present and release a novel dataset of functional near-infrared spectroscopy (fNIRS) recordings collected from 25 human participants across three domains: a Pick-and-Place Robot, Lunar Lander, and Flappy Bird. We train classifiers to predict levels of agent performance (optimal, sub-optimal, or worst-case) from windows of preprocessed fNIRS feature vectors, achieving an average F1 score of 67% for binary classification and 46% for multi-class models averaged across conditions and domains. We also train regressors to predict the degree of deviation between an agent's chosen action and a set of near-optimal policies, providing a continuous measure of performance. We evaluate cross-subject generalization and demonstrate that fine-tuning pre-trained models with a small sample of subject-specific data increases average F1 scores by 17% and 41% for binary and multi-class models, respectively. Our work demonstrates that mapping implicit fNIRS signals to agent performance is feasible and can be improved, laying the foundation for future brain-driven RLHF systems.
中文摘要 从人类反馈中强化学习（RLHF）是一种方法论，通过将人类反馈整合进智能体的训练过程，使智能体的行为与人类偏好保持一致。我们提出了一个可能的框架，利用被动脑机接口（BCI）来引导代理从隐性神经信号进行训练。我们展示了并发布了一项新功能近红外光谱（fNIRS）数据集，收集了来自25名人类参与者，涵盖三个领域：拾取与放置机器人、月球着陆器和拍翼鸟。我们训练分类器预测代理表现水平（最优、次优或最坏情况），从预处理的fNIRS特征向量窗口中预测，二元分类的平均F1得分为67%，跨条件和领域平均多类别模型为46%。我们还训练回归器预测代理所选动作与一组近似最优策略之间的偏差程度，提供连续的性能衡量。我们评估了跨学科推广，并证明用少量受试者特定数据微调预训练模型，分别使二元模型和多类别模型的平均F1分数提升17%和41%。我们的研究表明，将隐性fNIRS信号映射到代理表现是可行且可改进的，为未来脑驱动的RLHF系统奠定基础。

Think, Speak, Decide: Language-Augmented Multi-Agent Reinforcement Learning for Economic Decision-Making

思考、说话、决策：语言增强多智能体强化学习用于经济决策

Authors: Heyang Ma, Qirui Mi, Qipeng Yang, Zijun Fan, Bo Li, Haifeng Zhang
Subjects: Artificial Intelligence (cs.AI); General Economics (econ.GN)
Arxiv link: https://arxiv.org/abs/2511.12876
Pdf link: https://arxiv.org/pdf/2511.12876
Abstract Economic decision-making depends not only on structured signals such as prices and taxes, but also on unstructured language, including peer dialogue and media narratives. While multi-agent reinforcement learning (MARL) has shown promise in optimizing economic decisions, it struggles with the semantic ambiguity and contextual richness of language. We propose LAMP (Language-Augmented Multi-Agent Policy), a framework that integrates language into economic decision-making and narrows the gap to real-world settings. LAMP follows a Think-Speak-Decide pipeline: (1) Think interprets numerical observations to extract short-term shocks and long-term trends, caching high-value reasoning trajectories; (2) Speak crafts and exchanges strategic messages based on reasoning, updating beliefs by parsing peer communications; and (3) Decide fuses numerical data, reasoning, and reflections into a MARL policy to optimize language-augmented decision-making. Experiments in economic simulation show that LAMP outperforms both MARL and LLM-only baselines in cumulative return (+63.5%, +34.0%), robustness (+18.8%, +59.4%), and interpretability. These results demonstrate the potential of language-augmented policies to deliver more effective and robust economic strategies.
中文摘要 经济决策不仅依赖于价格和税收等结构化信号，还依赖于非结构化的语言，包括同行对话和媒体叙事。虽然多智能体强化学习（MARL）在优化经济决策方面展现出潜力，但它在语言的语义模糊性和语境丰富性方面存在困难。我们提出了LAMP（语言增强多代理政策），这是一个将语言融入经济决策并缩小与现实世界背景差距的框架。LAMP遵循“思考-说-决策”流程：（1）Think解读数值观察以提取短期冲击和长期趋势，缓存高价值推理轨迹;（2）口语基于推理进行策略性信息的表达和交换，通过解析同伴通信来更新信念;以及（3）决定将数值数据、推理和反思融合进MARL策略，以优化语言增强决策。经济模拟实验显示，LAMP在累计回报率（+63.5%、+34.0%）、稳健性（+18.8%、+59.4%）和可解释性方面均优于MARL和仅LLM的基线。这些结果表明，语言增强政策具有实现更有效、更稳健经济战略的潜力。

Green Emergency Communications in RIS- and MA-Assisted Multi-UAV SAGINs: A Partially Observable Reinforcement Learning Approach

RIS和MA辅助多无人机SAGIN中的绿色应急通信：一种部分可观察的强化学习方法

Authors: Liangshun Wu, Wen Chen, Shunqing Zhang, Yajun Wang, Kunlun Wang
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2511.12892
Pdf link: https://arxiv.org/pdf/2511.12892
Abstract In post-disaster space-air-ground integrated networks (SAGINs), terrestrial infrastructure is often impaired, and unmanned aerial vehicles (UAVs) must rapidly restore connectivity for mission-critical ground terminals in cluttered non-line-of-sight (NLoS) urban environments. To enhance coverage, UAVs employ movable antennas (MAs), while reconfigurable intelligent surfaces (RISs) on surviving high-rises redirect signals. The key challenge is communication-limited partial observability, leaving each UAV with a narrow, fast-changing neighborhood view that destabilizes value estimation. Existing multi-agent reinforcement learning (MARL) approaches are inadequate: non-communication methods rely on unavailable global critics, heuristic sharing is brittle and redundant, and learnable protocols (e.g., CommNet, DIAL) lose per-neighbor structure and aggravate non-stationarity under tight bandwidth. To address partial observability, we propose a spatiotemporal A2C where each UAV transmits prior-decision messages with local state, a compact policy fingerprint, and a recurrent belief, encoded per neighbor and concatenated. A spatial discount shapes value targets to emphasize local interactions, while analysis under one-hop-per-slot latency explains stable training with delayed views. Experimental results show our policy outperforms IA2C, ConseNet, FPrint, DIAL, and CommNet, achieving faster convergence, higher asymptotic reward, reduced temporal-difference (TD)/advantage errors, and a better communication throughput-energy trade-off.
中文摘要 在灾后空天地一体化网络（SAGIN）中，地面基础设施常常受损，无人机（UAV）必须在杂乱的非视距（NLoS）城市环境中迅速恢复关键任务地面终端的连接。为增强覆盖范围，无人机采用可移动天线（MA），而幸存高层建筑上的可重构智能表面（RIS）则重定向信号。关键挑战是通信受限的部分可观测性，使每架无人机的邻域视图狭窄且变化快速，导致价值估计不稳定。现有的多智能体强化学习（MARL）方法不够完善：非通信方法依赖于不可用的全局评论家，启发式共享脆弱且冗余，可学习协议（如CommNet、DIAL）在带宽紧张下会失去逐邻居结构并加剧非平稳性。为解决部分可观测性问题，我们提出一种时空A2C，其中每架无人机发送包含本地状态、紧凑策略指纹和循环信念的决策前消息，这些消息按邻居分别编码后拼接。空间折扣对价值目标进行塑形以强调局部交互，而在每时隙一跳延迟下的分析则解释了延迟视图下训练的稳定性。实验结果显示，我们的策略优于IA2C、ConseNet、FPrint、DIAL和CommNet，实现了更快的收敛速度、更高的渐近奖励、更小的时序差分（TD）/优势误差，以及更好的通信吞吐量与能耗权衡。

DeepSport: A Multimodal Large Language Model for Comprehensive Sports Video Reasoning via Agentic Reinforcement Learning

DeepSport：一个通过智能强化学习实现全面体育视频推理的多模态大型语言模型

Authors: Junbo Zou, Haotian Xia, Zhen Ye, Shengjie Zhang, Christopher Lai, Vicente Ordonez, Weining Shen, Hanjie Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.12908
Pdf link: https://arxiv.org/pdf/2511.12908
Abstract Sports video understanding presents unique challenges, requiring models to perceive high-speed dynamics, comprehend complex rules, and reason over long temporal contexts. While Multimodal Large Language Models (MLLMs) have shown promise in general domains, the current state of research in sports remains narrowly focused: existing approaches are either single-sport centric, limited to specific tasks, or reliant on training-free paradigms that lack a robust, learned reasoning process. To address this gap, we introduce DeepSport, the first end-to-end trained MLLM framework designed for multi-task, multi-sport video understanding. DeepSport shifts the paradigm from passive frame processing to active, iterative reasoning, empowering the model to "think with videos" by dynamically interrogating content via a specialized frame-extraction tool. To enable this, we propose a data distillation pipeline that synthesizes high-quality Chain-of-Thought (CoT) trajectories from 10 diverse data sources, creating a unified resource of 78k training samples. We then employ a two-stage training strategy, Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) with a novel gated tool-use reward, to optimize the model's reasoning process. Extensive experiments on a testing benchmark of 6.7k questions demonstrate that DeepSport achieves state-of-the-art performance, significantly outperforming both proprietary and open-source baselines. Our work establishes a new foundation for domain-specific video reasoning to address the complexities of diverse sports.
中文摘要 体育视频理解面临独特挑战，要求模型感知高速动态、理解复杂规则，并在长时间时间背景中进行推理。尽管多模态大型语言模型（MLLM）在通用领域展现出潜力，但当前体育研究的现状仍然狭隘：现有方法要么以单一运动为中心，要么局限于特定任务，或者依赖缺乏扎实、可学推理过程的无训练范式。为弥补这一空白，我们推出了DeepSport，这是首个端到端培训的MLLM框架，专为多任务、多运动视频理解而设计。DeepSport将模式从被动帧处理转向主动、迭代推理，通过专门的帧提取工具动态查询内容，赋予模型“用视频思考”的能力。为此，我们提出了一个数据提炼流程，从10个不同数据源中综合高质量的思维链（Chain-of-Thought，CoT）轨迹，创建一个包含7.8万条训练数据的统一资源。随后，我们采用两阶段训练策略：监督式微调（SFT），随后是带有新颖门槛工具使用奖励的强化学习（RL），以优化模型的推理过程。对6.7k题目测试基准的大量实验表明，DeepSport实现了最先进的性能，远超专有模型和开源模型的基线。我们的工作为领域特定视频推理奠定了新的基础，以应对多样化体育的复杂性。

Wide-Area Feedback Control for Renewables-Heavy Power Systems: A Comparative Study of Reinforcement Learning and Lyapunov-Based Design

高比例可再生能源电力系统的广域反馈控制：强化学习与基于李雅普诺夫设计的比较研究

Authors: Muhammad Nadeem, MirSaleh Bahavarnia, Ahmad F. Taha
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2511.12911
Pdf link: https://arxiv.org/pdf/2511.12911
Abstract As renewable energy sources become more prevalent, accurately modeling power grid dynamics is becoming increasingly complex. Concurrently, data acquisition and real-time system state monitoring are becoming more available to control centers. This motivates shifting from model- and Lyapunov-based feedback controller designs toward model-free ones. Reinforcement learning (RL) has emerged as a key tool for designing model-free controllers. Various studies have investigated voltage/frequency control strategies via RL. However, a simplified system model is usually used, neglecting the detailed dynamics of solar, wind, and composite loads; damping system-wide oscillations and modeling power flows are also usually ignored. To that end, we pose an optimal feedback control problem for a detailed renewables-heavy power system, defined by a set of nonlinear differential algebraic equations (NDAE). The control problem is solved using a completely model-free design via RL as well as using a model-based approach built upon Lyapunov stability theory with guarantees. The paper in its essence seeks to explore whether data-driven feedback control should be used in power grids over its model-driven counterpart. Theoretical developments and thorough case studies are presented with an eye on this exploration. Finally, a detailed analysis is provided to delineate the strengths and weaknesses of both approaches for renewables-heavy grids.
中文摘要 随着可再生能源的普及，准确建模电网动态变得越来越复杂。与此同时，数据采集和实时系统状态监控在控制中心中变得越来越普及。这促使人们从基于模型和Lyapunov的反馈控制器设计转向无模型的设计。强化学习（RL）已成为设计无模型控制器的关键工具。已有多项研究通过强化学习研究电压/频率控制策略。然而，这些研究通常使用简化的系统模型，忽略了太阳能、风能和综合负荷的详细动态，并且通常也忽略了对系统范围振荡的抑制和潮流建模。为此，我们针对一个细节详尽、高比例可再生能源的电力系统提出了最优反馈控制问题，该系统由一组非线性微分代数方程（NDAE）定义。该控制问题既通过完全无模型的强化学习设计求解，也通过基于李雅普诺夫稳定性理论、带有保证的基于模型的方法求解。本文本质上旨在探讨在电网中是否应当采用数据驱动的反馈控制来取代模型驱动的对应方法。理论推导和详尽的案例研究都围绕这一探索展开。最后，详细分析了这两种方法在高比例可再生能源电网中的优缺点。

Learning Branching Policies for MILPs with Proximal Policy Optimization

利用近端策略优化学习MILP分支策略

Authors: Abdelouahed Ben Mhamed, Assia Kamal-Idrissi, Amal El Fallah Seghrouchni
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2511.12986
Pdf link: https://arxiv.org/pdf/2511.12986
Abstract Branch-and-Bound (B&B) is the dominant exact solution method for Mixed Integer Linear Programs (MILP), yet its exponential time complexity poses significant challenges for large-scale instances. The growing capabilities of machine learning have spurred efforts to improve B&B by learning data-driven branching policies. However, most existing approaches rely on Imitation Learning (IL), which tends to overfit to expert demonstrations and struggles to generalize to structurally diverse or unseen instances. In this work, we propose Tree-Gate Proximal Policy Optimization (TGPPO), a novel framework that employs Proximal Policy Optimization (PPO), a Reinforcement Learning (RL) algorithm, to train a branching policy aimed at improving generalization across heterogeneous MILP instances. Our approach builds on a parameterized state space representation that dynamically captures the evolving context of the search tree. Empirical evaluations show that TGPPO often outperforms existing learning-based policies in terms of reducing the number of nodes explored and improving p-Primal-Dual Integrals (PDI), particularly in out-of-distribution instances. These results highlight the potential of RL to develop robust and adaptable branching strategies for MILP solvers.
中文摘要 分支定界（B&B）是混合整数线性规划（MILP）的主要精确解法，但其指数级时间复杂度对大规模实例构成了重大挑战。机器学习能力的不断提升推动了通过学习数据驱动分支策略来改进B&B的努力。然而，大多数现有方法依赖模仿学习（Imitation Learning，IL），该方法往往过拟合专家演示，难以推广到结构多样或未见的实例。本研究提出树门近端策略优化（TGPPO），这是一种新颖框架，利用强化学习（RL）算法近端策略优化（PPO）训练分支策略，旨在提升异构MILP实例间的泛化能力。我们的方法基于参数化状态空间表示，动态捕捉搜索树不断演变的上下文。实证评估表明，TGPPO在减少探索节点数量和改善p-原始-对偶积分（PDI）方面，尤其是在分布外的实例中，常常优于现有基于学习的策略。这些结果凸显了强化学习为MILP求解器开发稳健且可适应的分支策略的潜力。

The Good, The Bad, and The Hybrid: A Reward Structure Showdown in Reasoning Models Training

优点、缺点与混合：推理模型训练中的奖励结构对决

Authors: Subramanyam Sahoo
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.13016
Pdf link: https://arxiv.org/pdf/2511.13016
Abstract Reward design is central to reinforcement learning from human feedback (RLHF) and alignment research. In this work, we propose a unified framework to study hard, continuous, and hybrid reward structures for fine-tuning large language models (LLMs) on mathematical reasoning tasks. Using Qwen3-4B with LoRA fine-tuning on the GSM8K dataset, we formalize and empirically evaluate reward formulations that incorporate correctness, perplexity, reasoning quality, and consistency. We introduce an adaptive hybrid reward scheduler that transitions between discrete and continuous signals, balancing exploration and stability. Our results show that hybrid reward structures improve convergence speed and training stability over purely hard or continuous approaches, offering insights for alignment via adaptive reward modeling.
中文摘要 奖励设计是人类反馈强化学习（RLHF）和对齐研究的核心。本研究提出一个统一框架，用于研究硬性、连续性和混合型奖励结构，用于微调大型语言模型（LLMs）在数学推理任务中的微调。利用 Qwen3-4B 结合 LoRA 微调，在 GSM8K 数据集上，我们形式化并实证评估了包含正确性、困惑度、推理质量和一致性的奖励表述。我们引入了自适应混合奖励调度器，能在离散信号和连续信号之间切换，平衡探索与稳定性。我们的结果表明，混合奖励结构相比纯硬性或连续方法能提升收敛速度和训练稳定性，通过自适应奖励建模为对齐提供见解。
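
The abstract does not state how the adaptive scheduler moves between discrete and continuous signals. As a minimal sketch only, the snippet below assumes a linear schedule that starts from a continuous score and ends at a hard 0/1 correctness reward; the function name, its arguments, and the direction of the transition are all illustrative assumptions, not the paper's formulation.

```python
def hybrid_reward(is_correct, soft_score, step, total_steps):
    """Blend a hard 0/1 correctness reward with a continuous score.

    Assumption (not from the paper): the mixing weight moves linearly
    from the continuous signal early in training to the hard signal late.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Fraction of training completed, clamped to [0, 1].
    progress = min(max(step / total_steps, 0.0), 1.0)
    hard = 1.0 if is_correct else 0.0
    # Convex combination: all soft at progress 0, all hard at progress 1.
    return progress * hard + (1.0 - progress) * soft_score
```

Halfway through training, a wrong answer with a soft score of 0.4 would receive 0.5 * 0 + 0.5 * 0.4 = 0.2 under this schedule.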

REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding

REVISOR：超越文本反思，迈向长视频理解中的多模态内省推理

Authors: Jiaze Li, Hao Yin, Wenhui Tan, Jingyang Chen, Boshen Xu, Yuxun Qu, Yijing Chen, Jianzhong Ju, Zhenbo Luo, Jian Luan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.13026
Pdf link: https://arxiv.org/pdf/2511.13026
Abstract Self-reflection mechanisms that rely on purely text-based rethinking processes perform well in most multimodal tasks. However, when directly applied to long-form video understanding scenarios, they exhibit clear limitations. There are two fundamental reasons for this: (1) long-form video understanding involves richer and more dynamic visual input, so rethinking only the text information is insufficient and a further rethinking process specifically targeting visual information is needed; (2) purely text-based reflection mechanisms lack cross-modal interaction capabilities, preventing them from fully integrating visual information during reflection. Motivated by these insights, we propose REVISOR (REflective VIsual Segment Oriented Reasoning), a novel framework for tool-augmented multimodal reflection. REVISOR enables MLLMs to collaboratively construct introspective reflection processes across textual and visual modalities, significantly enhancing their reasoning capability for long-form video understanding. To ensure that REVISOR can learn to accurately review video segments highly relevant to the question during reinforcement learning, we designed the Dual Attribution Decoupled Reward (DADR) mechanism. Integrated into the GRPO training strategy, this mechanism enforces causal alignment between the model's reasoning and the selected video evidence. Notably, the REVISOR framework significantly enhances the long-form video understanding capability of MLLMs without requiring supplementary supervised fine-tuning or external models, achieving impressive results on four benchmarks: VideoMME, LongVideoBench, MLVU, and LVBench.
中文摘要 依赖纯文本反思过程的自我反思机制在大多数多模态任务中表现良好。然而，当直接应用于长视频理解场景时，它们表现出明显的局限性。其根本原因有两个：（1）长视频理解涉及更丰富、更具动态性的视觉输入，仅仅重新思考文本信息是不够的，需要进一步重新思考，专门针对视觉信息;（2）纯文本反射机制缺乏跨模态交互能力，阻碍其在反射时完全整合视觉信息。基于这些见解，我们提出了REVISOR（转向性虚拟片段导向推理），这是一种用于工具增强多模态反思的新框架。REVISOR使MLLM能够协作构建跨文本和视觉模态的内省反思过程，显著提升了他们在长视频理解中的推理能力。为了确保REVISOR能够在强化学习中准确复习与问题高度相关的视频片段，我们设计了双重归因解耦奖励（DADR）机制。该机制整合进GRPO训练策略，确保模型推理与所选视频证据之间的因果对齐。值得注意的是，REVISOR框架显著提升了MLLM的长视频理解能力，无需额外的监督微调或外部模型，在包括VideoMME、LongVideoBench、MLVU和LVBench在内的四个基准测试中取得了显著成果。

Scaling Generative Verifiers For Natural Language Mathematical Proof Verification And Selection

自然语言数学证明验证与选择的生成验证器规模化

Authors: Sadegh Mahdavi, Branislav Kisacanin, Shubham Toshniwal, Wei Du, Ivan Moshkov, George Armstrong, Renjie Liao, Christos Thrampoulidis, Igor Gitman
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.13027
Pdf link: https://arxiv.org/pdf/2511.13027
Abstract Large language models have achieved remarkable success on final-answer mathematical problems, largely due to the ease of applying reinforcement learning with verifiable rewards. However, the reasoning underlying these solutions is often flawed. Advancing to rigorous proof-based mathematics requires reliable proof verification capabilities. We begin by analyzing multiple evaluation setups and show that focusing on a single benchmark can lead to brittle or misleading conclusions. To address this, we evaluate both proof-based and final-answer reasoning to obtain a more reliable measure of model performance. We then scale two major generative verification methods (GenSelect and LLM-as-a-Judge) to millions of tokens and identify their combination as the most effective framework for solution verification and selection. We further show that the choice of prompt for LLM-as-a-Judge significantly affects the model's performance, but reinforcement learning can reduce this sensitivity. However, despite improving proof-level metrics, reinforcement learning does not enhance final-answer precision, indicating that current models often reward stylistic or procedural correctness rather than mathematical validity. Our results establish practical guidelines for designing and evaluating scalable proof-verification and selection systems.
中文摘要 大型语言模型在最终答案数学问题上取得了显著成功，这在很大程度上得益于强化学习的便捷应用，并能获得可验证的奖励。然而，这些解决方案背后的推理往往存在缺陷。进入严谨的基于证明的数学需要可靠的证明验证能力。我们首先分析了多种评估方案，并指出专注于单一基准可能导致脆弱或误导性的结论。为此，我们评估了基于证明和最终答案的推理，以获得更可靠的模型性能衡量。随后，我们将两种主要的生成验证方法（GenSelect和LLM即评判）扩展到数百万个代币，并确定其组合为解决方案验证和选择最有效的框架。我们还进一步表明，选择LLM作为评判的提示显著影响模型表现，但强化学习可以降低这种敏感性。然而，尽管证明层面的指标有所改进，强化学习并未提升最终答案的精度，表明当前模型往往奖励的是风格或程序上的正确性，而非数学上的有效性。我们的结果为设计和评估可扩展的证明验证与选择系统奠定了实用指南。

An Online Multiobjective Policy Gradient for Long-run Average-reward Markov Decision Process

长期平均回报马尔可夫决策过程的在线多目标政策梯度

Authors: Rahul Misra, Manuela L. Bujorianu, Rafał Wisniewski
Subjects: Systems and Control (eess.SY); Optimization and Control (math.OC); Probability (math.PR)
Arxiv link: https://arxiv.org/abs/2511.13034
Pdf link: https://arxiv.org/pdf/2511.13034
Abstract We propose a reinforcement learning (RL) framework for multi-objective decision-making, where the agent seeks to optimize a vector of rewards rather than a single scalar value. The objective is to ensure that the time-averaged reward vector converges asymptotically to a predefined target set. Since standard RL algorithms operate on scalar rewards, we introduce a dynamic scalarization mechanism guided by Blackwell's Approachability Theorem. This theorem enables adaptive updates of the scalarization vector to guarantee convergence toward the target set. Assuming ergodicity, the Markov chain induced by the learned policies admits a stationary distribution, ensuring all states recur with finite return times. Our algorithm exploits this property by defining an inner loop that applies a policy gradient method (with baseline) between successive visits to a designated recurrent state, enforcing Blackwell's condition at each iteration. An outer loop then updates the scalarization vector after each recurrence. We establish theoretical convergence of the long-run average reward vector to the target set and validate the approach through a numerical example.
中文摘要 我们提出了一种强化学习（RL）框架，用于多目标决策，智能体寻求优化奖励向量，而非单一标量值。目标是确保时间平均奖励向量逐渐收敛到预定义的目标集合。由于标准强化学习算法基于标量奖励，我们引入了基于布莱克韦尔可接近定理的动态标量化机制。该定理使标量向量的自适应更新成为可能，以确保向目标集收敛。假设遍历性，由学习策略诱导的马尔可夫链允许一个平稳分布，确保所有状态以有限的返回时间重复出现。我们的算法利用这一特性，定义了一个内环，在连续访问指定重复状态之间应用策略梯度法（带有基线），并在每次迭代中强制执行布莱克韦尔条件。每次递现后，外环会更新标量向量。我们建立了长期平均奖励向量与目标集的理论收敛性，并通过数值示例验证了该方法。
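
The dynamic scalarization step described above can be illustrated with the standard geometric form of Blackwell approachability: steer toward the target set along the direction from the current average reward to its projection onto the set. The sketch below assumes an axis-aligned box as the target set so that the projection is a simple clip; the function names and the box assumption are illustrative and are not taken from the paper.

```python
import math


def project_onto_box(point, lower, upper):
    """Euclidean projection of `point` onto the box [lower, upper]."""
    return [min(max(x, lo), hi) for x, lo, hi in zip(point, lower, upper)]


def scalarization_vector(avg_reward, lower, upper):
    """Unit vector pointing from the average reward toward the target box.

    Maximizing the reward scalarized by this vector pushes the running
    average toward the set; a zero vector means it is already inside.
    """
    proj = project_onto_box(avg_reward, lower, upper)
    direction = [p - x for p, x in zip(proj, avg_reward)]
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        return [0.0] * len(avg_reward)
    return [d / norm for d in direction]


def update_average(avg_reward, new_reward, n):
    """Running mean of reward vectors after the n-th observation (n >= 1)."""
    return [a + (r - a) / n for a, r in zip(avg_reward, new_reward)]
```

In an outer loop of the kind the abstract describes, `scalarization_vector` would be recomputed after each return to the designated recurrent state, and the inner policy-gradient loop would optimize the resulting scalar reward.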

One-Step Generative Policies with Q-Learning: A Reformulation of MeanFlow

带Q学习的一步生成策略：平均流的重新表述

Authors: Zeyuan Wang, Da Li, Yulin Chen, Ye Shi, Liang Bai, Tianyuan Yu, Yanwei Fu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.13035
Pdf link: https://arxiv.org/pdf/2511.13035
Abstract We introduce a one-step generative policy for offline reinforcement learning that maps noise directly to actions via a residual reformulation of MeanFlow, making it compatible with Q-learning. While one-step Gaussian policies enable fast inference, they struggle to capture complex, multimodal action distributions. Existing flow-based methods improve expressivity but typically rely on distillation and two-stage training when trained with Q-learning. To overcome these limitations, we propose to reformulate MeanFlow to enable direct noise-to-action generation by integrating the velocity field and noise-to-action transformation into a single policy network, eliminating the need for separate velocity estimation. We explore several reformulation variants and identify an effective residual formulation that supports expressive and stable policy learning. Our method offers three key advantages: 1) efficient one-step noise-to-action generation, 2) expressive modelling of multimodal action distributions, and 3) efficient and stable policy learning via Q-learning in a single-stage training setup. Extensive experiments on 73 tasks across the OGBench and D4RL benchmarks demonstrate that our method achieves strong performance in both offline and offline-to-online reinforcement learning settings. Code is available at this https URL.
中文摘要 我们引入了一种一步生成策略，用于离线强化学习，通过对平均流的残余重述直接将噪声映射到动作，使其与Q-learning兼容。虽然一步高斯策略能够实现快速推断，但它们难以捕捉复杂的多模态动作分布。现有基于流动的方法提升表现力，但通常依赖蒸馏和Q学习训练的两阶段训练。为克服这些限制，我们提议重新表述MeanFlow，通过将速度场和噪声到作用的转换整合为一个单一策略网络，实现直接噪声到动作生成，从而消除了单独的速度估计的需求。我们探讨了几种重述变体，并识别出支持表达性和稳定政策学习的有效残余表述。我们的方法有三个关键优势：1）高效的一步噪声到动作生成，2）多模态动作分布的表达建模，3）通过单阶段训练设置中的Q学习实现高效且稳定的策略学习。在OGBench和D4RL基准测试中对73项任务进行的广泛实验表明，我们的方法在离线和离线到在线强化学习环境中均表现出色。代码可在此 https URL 访问。

ViSS-R1: Self-Supervised Reinforcement Video Reasoning

ViSS-R1：自我监督强化视频推理

Authors: Bo Fang, Yuxin Song, Qiangqiang Wu, Haoyuan Sun, Wenhao Wu, Antoni B. Chan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.13054
Pdf link: https://arxiv.org/pdf/2511.13054
Abstract Complex video reasoning remains a significant challenge for Multimodal Large Language Models (MLLMs), as current R1-based methodologies often prioritize text-centric reasoning derived from text-based and image-based developments. In video tasks, such strategies frequently underutilize rich visual information, leading to potential shortcut learning and increased susceptibility to hallucination. To foster a more robust, visual-centric video understanding, we start by introducing a novel self-supervised reinforcement learning GRPO algorithm (Pretext-GRPO) within the standard R1 pipeline, in which positive rewards are assigned for correctly solving pretext tasks on transformed visual inputs, forcing the model to process the visual information non-trivially. Building on the effectiveness of Pretext-GRPO, we further propose the ViSS-R1 framework, which streamlines and integrates pretext-task-based self-supervised learning directly into the MLLM's R1 post-training paradigm. Instead of relying solely on sparse visual cues, our framework compels models to reason about transformed visual input by simultaneously processing both pretext questions (concerning transformations) and true user queries. This necessitates identifying the applied transformation and reconstructing the original video to formulate accurate final answers. Comprehensive evaluations on six widely used video reasoning and understanding benchmarks demonstrate the effectiveness and superiority of our Pretext-GRPO and ViSS-R1 for complex video reasoning. Our code and models will be publicly available.
中文摘要 复杂的视频推理仍然是多模态大型语言模型（MLLM）面临的重大挑战，因为当前基于R1的方法论往往优先考虑基于文本和图像发展的以文本为中心的推理。在视频任务中，此类策略常常未能充分利用丰富的视觉信息，导致可能的捷径学习和更易产生幻觉。为了促进更扎实、以视觉为中心的视频理解，我们首先在标准R1流水线中引入了一种新的自监督强化学习GRPO算法（Pretext-GRPO），该算法在转换后的视觉输入上正确解决伪装任务时会给予积极奖励，使模型能够非平易地处理视觉信息。基于Pretext-GRPO的有效性，我们进一步提出了ViSS-R1框架，该框架将基于Pretext任务的自我监督学习简化并整合进MLLM的R1后培训范式中。我们的框架不仅依赖稀疏的视觉线索，而是通过同时处理关于转换的前提问题和真实用户查询，迫使模型推理变换后的视觉输入。这需要识别所应用的变换并重建原始视频，以制定准确的最终答案。对六个广泛使用的视频推理和理解基准的综合评估展示了我们的Pretext-GRPO和ViSS-R1在复杂视频推理中的有效性和优越性。我们的代码和型号将对外公开。

STEP: Success-Rate-Aware Trajectory-Efficient Policy Optimization

STEP：成功率感知的轨迹高效策略优化

Authors: Yuhan Chen, Yuxuan Liu, Long Zhang, Pengzhi Gao, Jian Luan, Wei Liu
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.13091
Pdf link: https://arxiv.org/pdf/2511.13091
Abstract Multi-turn interaction remains challenging for online reinforcement learning. A common solution is trajectory-level optimization, which treats each trajectory as a single training sample. However, this approach can be inefficient and yield misleading learning signals: it applies uniform sampling across tasks regardless of difficulty, penalizes correct intermediate actions in failed trajectories, and incurs high sample-collection costs. To address these issues, we propose STEP (Success-rate-aware Trajectory-Efficient Policy optimization), a framework that dynamically allocates sampling based on per-task success rates and performs step-level optimization. STEP maintains a smoothed success-rate record to guide adaptive trajectory resampling, allocating more effort to harder tasks. It then computes success-rate-weighted advantages and decomposes trajectories into step-level samples. Finally, it applies a step-level GRPO augmentation to refine updates for low-success tasks. Experiments on OSWorld and AndroidWorld show that STEP substantially improves sample efficiency and training stability over trajectory-level GRPO, converging faster and generalizing better under the same sampling budget.
中文摘要 多回合互动对在线强化学习来说依然充满挑战。常见的解决方案是轨迹级优化，将每个轨迹视为单一训练样本。然而，这种方法可能效率低下，并产生误导性的学习信号：它在任务间应用均匀抽样，无论难度如何，错误轨迹中正确的中间动作会被惩罚，并且产生高昂的样本采集成本。为解决这些问题，我们提出了STEP（成功率感知轨迹高效策略优化）框架，该框架基于每项任务的成功率动态分配抽样，并执行步级优化。STEP保持平滑的成功率记录，以指导自适应轨迹重采样，将更多精力分配给更难的任务。然后计算成功率加权优势，并将轨迹分解为步骤级样本。最后，它应用一步级GRPO增强以优化低成功任务的更新。在OSWorld和AndroidWorld上的实验显示，STEP相比轨迹级GRPO显著提升了样本效率和训练稳定性，在相同采样预算下收敛更快，泛化也更佳。
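
To make the success-rate-aware sampling idea concrete, here is a minimal sketch. It assumes an exponential moving average as the "smoothed success-rate record" and a sampling budget split in proportion to one minus the success rate; the function names, the momentum value, and the floor parameter are hypothetical choices for illustration, not details given in the abstract.

```python
def update_success_rate(rate, succeeded, momentum=0.9):
    """Exponential moving average of a task's success indicator."""
    outcome = 1.0 if succeeded else 0.0
    return momentum * rate + (1.0 - momentum) * outcome


def allocate_rollouts(success_rates, budget, floor=0.05):
    """Split `budget` rollouts across tasks, favoring low success rates.

    `floor` keeps a small share for tasks that are already solved so
    their success-rate estimates do not go stale.
    """
    weights = {task: max(1.0 - rate, floor) for task, rate in success_rates.items()}
    total = sum(weights.values())
    alloc = {task: int(budget * w / total) for task, w in weights.items()}
    # Rounding down can leave a few rollouts unassigned; give them to
    # the hardest tasks first.
    leftover = budget - sum(alloc.values())
    for task in sorted(weights, key=weights.get, reverse=True)[:leftover]:
        alloc[task] += 1
    return alloc
```

For example, with success rates of 0.75 and 0.25 and a budget of 8 rollouts, this rule assigns 2 rollouts to the easier task and 6 to the harder one.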

Transformer-Based Scalable Multi-Agent Reinforcement Learning for Networked Systems with Long-Range Interactions

基于变换器的可扩展多智能体强化学习，适用于具有远程交互的网络系统

Authors: Vidur Sinha, Muhammed Ustaomeroglu, Guannan Qu
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2511.13103
Pdf link: https://arxiv.org/pdf/2511.13103
Abstract Multi-agent reinforcement learning (MARL) has shown promise for large-scale network control, yet existing methods face two major limitations. First, they typically rely on assumptions leading to decay properties of local agent interactions, limiting their ability to capture long-range dependencies such as cascading power failures or epidemic outbreaks. Second, most approaches lack generalizability across network topologies, requiring retraining when applied to new graphs. We introduce STACCA (Shared Transformer Actor-Critic with Counterfactual Advantage), a unified transformer-based MARL framework that addresses both challenges. STACCA employs a centralized Graph Transformer Critic to model long-range dependencies and provide system-level feedback, while its shared Graph Transformer Actor learns a generalizable policy capable of adapting across diverse network structures. Further, to improve credit assignment during training, STACCA integrates a novel counterfactual advantage estimator that is compatible with state-value critic estimates. We evaluate STACCA on epidemic containment and rumor-spreading network control tasks, demonstrating improved performance, network generalization, and scalability. These results highlight the potential of transformer-based MARL architectures to achieve scalable and generalizable control in large-scale networked systems.
中文摘要 多智能体强化学习（MARL）在大规模网络控制方面展现出潜力，但现有方法面临两大局限。首先，它们通常依赖于使局部智能体交互具有衰减特性的假设，限制了捕捉长程依赖的能力，如连锁停电或流行病爆发。其次，大多数方法缺乏跨网络拓扑的泛化性，应用到新图时需要重新训练。我们提出了STACCA（具有反事实优势的共享Transformer演员-评论家），这是一个统一的基于Transformer的MARL框架，解决了这两个挑战。STACCA采用集中式图Transformer评论家来建模长程依赖关系并提供系统级反馈，而其共享的图Transformer演员则学习一个能够适应不同网络结构的可泛化策略。此外，为了改善训练期间的信用分配，STACCA集成了一种新颖的反事实优势估计器，该估计器与状态值评论家的估计兼容。我们在疫情遏制和谣言传播网络控制任务上评估了STACCA，展示了其在性能、网络泛化性和可扩展性方面的提升。这些结果凸显了基于Transformer的MARL架构在大规模网络系统中实现可扩展且可泛化控制的潜力。

Soft Conflict-Resolution Decision Transformer for Offline Multi-Task Reinforcement Learning

软冲突解决决策变换器，用于离线多任务强化学习

Authors: Shudong Wang, Xinfei Wang, Chenhao Zhang, Shanchen Pang, Haiyuan Gui, Wenhao Ji, Xiaojian Liao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.13133
Pdf link: https://arxiv.org/pdf/2511.13133
Abstract Multi-task reinforcement learning (MTRL) seeks to learn a unified policy for diverse tasks, but often suffers from gradient conflicts across tasks. Existing masking-based methods attempt to mitigate such conflicts by assigning task-specific parameter masks. However, our empirical study shows that coarse-grained binary masks over-suppress key conflicting parameters, hindering knowledge sharing across tasks. Moreover, different tasks exhibit varying conflict levels, yet existing methods use a one-size-fits-all fixed sparsity strategy to maintain training stability and performance, which proves inadequate. These limitations hinder the model's generalization and learning efficiency. To address these issues, we propose SoCo-DT, a Soft Conflict-resolution method based on parameter importance. By leveraging Fisher information, mask values are dynamically adjusted to retain important parameters while suppressing conflicting ones. In addition, we introduce a dynamic sparsity adjustment strategy based on the Interquartile Range (IQR), which constructs task-specific thresholding schemes using the distribution of conflict and harmony scores during training. To enable adaptive sparsity evolution throughout training, we further incorporate an asymmetric cosine annealing schedule to continuously update the threshold. Experimental results on the Meta-World benchmark show that SoCo-DT outperforms the state-of-the-art method by 7.6% on MT50 and by 10.5% on the suboptimal dataset, demonstrating its effectiveness in mitigating gradient conflicts and improving overall multi-task performance.
中文摘要 多任务强化学习（MTRL）旨在为不同任务学习统一策略，但常常存在任务间的梯度冲突。现有基于掩蔽的方法试图通过分配任务特定的参数掩码来缓解此类冲突。然而，我们的实证研究表明，粗粒度二进制掩模存在过度抑制关键冲突参数的问题，阻碍了任务间的知识共享。此外，不同任务表现出不同的冲突水平，但现有方法采用一刀切的固定稀疏度策略来保持训练稳定性和性能，但这已被证明不够充分。这些限制限制了模型的泛化和学习效率。为解决这些问题，我们提出了SoCo-DT，一种基于参数重要性的软冲突解决方法。通过利用费舍尔信息，掩码值会动态调整，以保留重要参数同时抑制冲突参数。此外，我们引入了基于四分位区间（IQR）的动态稀疏度调整策略，利用训练期间冲突与和谐分数的分布构建任务特定的阈值方案。为了在训练过程中实现自适应稀疏度演化，我们进一步采用了非对称余弦退火计划，以持续更新阈值。Meta-World基准测试的实验结果显示，SoCo-DT在MT50上比最先进方法高出7.6%，在次优数据集上高出10.5%，证明了其在缓解梯度冲突和提升整体多任务性能方面的有效性。
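
The IQR-based thresholding and the annealed schedule can be sketched as follows. The abstract does not give the exact formulas, so this example assumes a conventional outlier-style rule (third quartile plus a scaled IQR) and a plain, symmetric cosine schedule for the scale; the paper's asymmetric variant and its conflict and harmony scores are not reproduced here, and all names are hypothetical.

```python
import math
import statistics


def cosine_anneal(start, end, step, total_steps):
    """Cosine interpolation from `start` (step 0) to `end` (final step)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * progress))


def iqr_threshold(scores, scale):
    """Threshold at Q3 + scale * IQR over one task's scores.

    Parameters whose score exceeds this value would be treated as
    strongly conflicting under the assumed rule.
    """
    q1, _median, q3 = statistics.quantiles(scores, n=4)
    return q3 + scale * (q3 - q1)
```

A training loop could call `iqr_threshold(task_scores, cosine_anneal(1.5, 0.5, step, total_steps))` once per update so that the threshold tightens as training progresses; the 1.5 and 0.5 endpoints are placeholders.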

Conditional Diffusion Model for Multi-Agent Dynamic Task Decomposition

多智能体动态任务分解的条件扩散模型

Authors: Yanda Zhu, Yuanyang Zhu, Daoyi Dong, Caihua Chen, Chunlin Chen
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.13137
Pdf link: https://arxiv.org/pdf/2511.13137
Abstract Task decomposition has shown promise in complex cooperative multi-agent reinforcement learning (MARL) tasks, which enables efficient hierarchical learning for long-horizon tasks in dynamic and uncertain environments. However, learning dynamic task decomposition from scratch generally requires a large number of training samples, especially exploring the large joint action space under partial observability. In this paper, we present the Conditional Diffusion Model for Dynamic Task Decomposition (C$\text{D}^\text{3}$T), a novel two-level hierarchical MARL framework designed to automatically infer subtask and coordination patterns. The high-level policy learns subtask representation to generate a subtask selection strategy based on subtask effects. To capture the effects of subtasks on the environment, C$\text{D}^\text{3}$T predicts the next observation and reward using a conditional diffusion model. At the low level, agents collaboratively learn and share specialized skills within their assigned subtasks. Moreover, the learned subtask representation is also used as additional semantic information in a multi-head attention mixing network to enhance value decomposition and provide an efficient reasoning bridge between individual and joint value functions. Experimental results on various benchmarks demonstrate that C$\text{D}^\text{3}$T achieves better performance than existing baselines.
中文摘要 任务分解在复杂的协作多智能体强化学习（MARL）任务中展现出潜力，它使得在动态和不确定环境中对长时域任务进行高效的分层学习成为可能。然而，从零学习动态任务分解通常需要大量训练样本，尤其是在部分可观测性下探索庞大的联合动作空间时。本文介绍了动态任务分解条件扩散模型（C$\text{D}^\text{3}$T），这是一种新型的两级分层MARL框架，旨在自动推断子任务和协调模式。高层策略学习子任务表示，基于子任务效果生成子任务选择策略。为了捕捉子任务对环境的影响，C$\text{D}^\text{3}$T 使用条件扩散模型预测下一次观察和奖励。在低层，智能体在分配的子任务中协作学习并共享专门技能。此外，所学子任务表示还被用作多头注意力混合网络中的额外语义信息，以增强值分解，并在个体值函数与联合值函数之间提供高效的推理桥梁。多项基准测试的实验结果表明，C$\text{D}^\text{3}$T 的性能优于现有基线。

DiffFP: Learning Behaviors from Scratch via Diffusion-based Fictitious Play

DiffFP：通过基于扩散的虚拟博弈从零开始学习行为

Authors: Akash Karthikeyan, Yash Vardhan Pant
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2511.13186
Pdf link: https://arxiv.org/pdf/2511.13186
Abstract Self-play reinforcement learning has demonstrated significant success in learning complex strategic and interactive behaviors in competitive multi-agent games. However, achieving such behaviors in continuous decision spaces remains challenging. Ensuring adaptability and generalization in self-play settings is critical for achieving competitive performance in dynamic multi-agent environments. These challenges often cause methods to converge slowly or fail to converge at all to a Nash equilibrium, making agents vulnerable to strategic exploitation by unseen opponents. To address these challenges, we propose DiffFP, a fictitious play (FP) framework that estimates the best response to unseen opponents while learning a robust and multimodal behavioral policy. Specifically, we approximate the best response using a diffusion policy that leverages generative modeling to learn adaptive and diverse strategies. Through empirical evaluation, we demonstrate that the proposed FP framework converges towards $\epsilon$-Nash equilibria in continuous-space zero-sum games. We validate our method on complex multi-agent environments, including racing and multi-particle zero-sum games. Simulation results show that the learned policies are robust against diverse opponents and outperform baseline reinforcement learning policies. Our approach achieves up to 3$\times$ faster convergence and 30$\times$ higher success rates on average against RL-based baselines, demonstrating its robustness to opponent strategies and stability across training iterations.
中文摘要 自博弈强化学习在竞争性多智能体博弈中学习复杂的策略性和交互性行为方面取得了显著成功。然而，在连续决策空间中实现此类行为仍然具有挑战性。确保自博弈设置中的适应性和泛化性，对于在动态多智能体环境中取得有竞争力的性能至关重要。这些挑战常常导致方法收敛缓慢，甚至根本无法收敛到纳什均衡，使智能体容易被未见过的对手策略性地利用。为应对这些挑战，我们提出了DiffFP，这是一种虚拟博弈（FP）框架，在学习稳健且多模态的行为策略的同时，估计针对未见过对手的最佳响应。具体来说，我们使用扩散策略来近似最佳响应，该策略利用生成建模学习自适应且多样化的策略。通过实证评估，我们证明所提出的FP框架在连续空间零和博弈中趋向$\epsilon$-Nash均衡。我们在复杂的多智能体环境中验证了我们的方法，包括赛车和多粒子零和博弈。仿真结果表明，所学策略对多样化的对手表现出鲁棒性，且优于基线强化学习策略。与基于强化学习的基线相比，我们的方法收敛速度最高提升3$\times$，成功率平均提高30$\times$，展示了其对对手策略的鲁棒性以及在训练迭代间的稳定性。
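
For readers unfamiliar with fictitious play, the following is a minimal sketch of the classical discrete version on a two-player zero-sum matrix game. It is not the DiffFP method: the diffusion-policy best response described in the abstract is replaced here by an exact argmax over a small payoff matrix, and the function name `fictitious_play`, the matching-pennies payoff, and the iteration count are illustrative assumptions.

```python
import numpy as np

# Row player's payoff in matching pennies (an assumed toy game, not from the paper).
# The column player receives the negative of this matrix (zero-sum).
MATCHING_PENNIES = np.array([[1.0, -1.0],
                             [-1.0, 1.0]])


def fictitious_play(payoff, iterations=10000):
    """Classical fictitious play: each player best-responds to the
    opponent's empirical action frequencies, then the counts are updated."""
    n_rows, n_cols = payoff.shape
    row_counts = np.zeros(n_rows)
    col_counts = np.zeros(n_cols)
    # Arbitrary initial beliefs: each player is assumed to have played action 0 once.
    row_counts[0] = 1.0
    col_counts[0] = 1.0
    for _ in range(iterations):
        # Row player maximizes expected payoff against the column history.
        row_best = int(np.argmax(payoff @ col_counts))
        # Column player minimizes the row player's payoff against the row history.
        col_best = int(np.argmin(row_counts @ payoff))
        row_counts[row_best] += 1.0
        col_counts[col_best] += 1.0
    # Empirical mixed strategies after all iterations.
    return row_counts / row_counts.sum(), col_counts / col_counts.sum()


row_strategy, col_strategy = fictitious_play(MATCHING_PENNIES)
print("row:", row_strategy, "col:", col_strategy)
```

In this toy game the empirical frequencies drift toward the uniform mixed strategy, which is the unique Nash equilibrium of matching pennies. In continuous action spaces the argmax step is no longer available in closed form, which is the gap a learned best-response model is meant to fill.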

Video Spatial Reasoning with Object-Centric 3D Rollout

基于以对象为中心的3D Rollout的视频空间推理

Authors: Haoran Tang, Meng Cao, Ruyang Liu, Xiaoxi Liang, Linglong Li, Ge Li, Xiaodan Liang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.13190
Pdf link: https://arxiv.org/pdf/2511.13190
Abstract Recent advances in Multi-modal Large Language Models (MLLMs) have showcased remarkable capabilities in vision-language understanding. However, enabling robust video spatial reasoning, the ability to comprehend object locations, orientations, and inter-object relationships in dynamic 3D scenes, remains a key unsolved challenge. Existing approaches primarily rely on spatially grounded supervised fine-tuning or reinforcement learning, yet we observe that such models often exhibit query-locked reasoning, focusing narrowly on objects explicitly mentioned in the prompt while ignoring critical contextual cues. To address this limitation, we propose Object-Centric 3D Rollout (OCR), a novel strategy that introduces structured perturbations to the 3D geometry of selected objects during training. By degrading object-specific visual cues and projecting the altered geometry into 2D space, OCR compels the model to reason holistically across the entire scene. We further design a rollout-based training pipeline that jointly leverages vanilla and region-noisy videos to optimize spatial reasoning trajectories. Experiments demonstrate state-of-the-art performance: our 3B-parameter model achieves 47.5% accuracy on VSI-Bench, outperforming several 7B baselines. Ablations confirm OCR's superiority over prior rollout strategies (e.g., T-GRPO, NoisyRollout).
中文摘要 多模态大型语言模型（MLLM）的最新进展展示了视觉语言理解的卓越能力。然而，实现稳健的视频空间推理（即理解动态3D场景中物体位置、朝向及物体间关系的能力）仍是一个关键的未解难题。现有方法主要依赖基于空间定位的监督微调或强化学习，但我们观察到这些模型常表现出查询锁定式推理，狭隘地关注提示中明确提及的对象，而忽视关键的上下文线索。为解决这一局限，我们提出了以对象为中心的3D Rollout（OCR），这是一种新颖的策略，在训练过程中为选定对象的三维几何引入结构化扰动。通过削弱特定对象的视觉线索并将改变后的几何投影到二维空间，OCR迫使模型在整个场景中进行整体推理。我们还设计了一个基于rollout的训练流程，联合利用原始视频和区域加噪视频来优化空间推理轨迹。实验展示了最先进的性能：我们的3B参数模型在VSI-Bench上准确率达到47.5%，优于多个7B基线模型。消融实验证实了OCR优于以往的rollout策略（如T-GRPO、NoisyRollout）。

PIGEON: VLM-Driven Object Navigation via Points of Interest Selection

PIGEON：通过兴趣点选择实现VLM驱动的目标导航

Authors: Cheng Peng, Zhenzhe Zhang, Cheng Chi, Xiaobao Wei, Yanhao Zhang, Heng Wang, Pengwei Wang, Zhongyuan Wang, Jing Liu, Shanghang Zhang
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.13207
Pdf link: https://arxiv.org/pdf/2511.13207
Abstract Navigating to a specified object in an unknown environment is a fundamental yet challenging capability of embodied intelligence. However, current methods struggle to balance decision frequency with intelligence, resulting in decisions lacking foresight or discontinuous actions. In this work, we propose PIGEON: Point of Interest Guided Exploration for Object Navigation with VLM, maintaining a lightweight and semantically aligned snapshot memory during exploration as semantic input for the exploration strategy. We use a large Visual-Language Model (VLM), named PIGEON-VL, to select Points of Interest (PoI) formed during exploration and then employ a lower-level planner for action output, increasing the decision frequency. Additionally, this PoI-based decision-making enables the generation of Reinforcement Learning with Verifiable Reward (RLVR) data suitable for simulators. Experiments on classic object navigation benchmarks demonstrate that our zero-shot transfer method achieves state-of-the-art performance, while RLVR further enhances the model's semantic guidance capabilities, enabling deep reasoning during real-time navigation.
中文摘要 在未知环境中导航到指定物体是具身智能的一项基本而又具有挑战性的能力。然而，当前的方法难以在决策频率与智能程度之间取得平衡，导致决策缺乏前瞻性或动作不连续。在本研究中，我们提出了PIGEON：基于VLM的兴趣点引导目标导航探索，在探索过程中维护轻量级且语义对齐的快照记忆，作为探索策略的语义输入。我们使用一个名为PIGEON-VL的大型视觉语言模型（VLM）来选择探索过程中形成的兴趣点（PoI），然后使用低层规划器输出动作，从而提高决策频率。此外，这种基于PoI的决策方式使得能够生成适用于模拟器的可验证奖励强化学习（RLVR）数据。在经典目标导航基准上的实验表明，我们的零样本迁移方法实现了最先进的性能，而RLVR进一步增强了模型的语义引导能力，使其能够在实时导航中进行深度推理。

Learning to Solve Resource-Constrained Project Scheduling Problems with Duration Uncertainty using Graph Neural Networks

利用图神经网络学习解决资源受限且时长不确定性的项目调度问题

Authors: Guillaume Infantes, Stéphanie Roussel, Antoine Jacquet, Emmanuel Benazera
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.13214
Pdf link: https://arxiv.org/pdf/2511.13214
Abstract The Resource-Constrained Project Scheduling Problem (RCPSP) is a classical scheduling problem that has received significant attention due to its numerous applications in industry. However, in practice, task durations are subject to uncertainty that must be considered in order to propose resilient scheduling. In this paper, we address the RCPSP variant with uncertain task durations (modeled using known probabilities) and aim to minimize the overall expected project duration. Our objective is to produce a baseline schedule that can be reused multiple times in an industrial setting regardless of the actual duration scenario. We leverage Graph Neural Networks in conjunction with Deep Reinforcement Learning (DRL) to develop an effective policy for task scheduling. This policy operates similarly to a priority dispatch rule and is paired with a Serial Schedule Generation Scheme to produce a schedule. Our empirical evaluation on standard benchmarks demonstrates the approach's superiority in terms of performance and its ability to generalize. The developed framework, Wheatley, is made publicly available online to facilitate further research and reproducibility.
中文摘要 资源受限项目调度问题（RCPSP）是一个经典的调度问题，因其在工业中的广泛应用而受到广泛关注。然而，在实践中，任务时长存在不确定性，必须加以考虑才能提出具有韧性的调度方案。本文针对任务持续时间不确定的RCPSP变体（基于已知概率建模），旨在最小化整体预期项目时长。我们的目标是制定一个基线调度方案，无论实际时长情形如何，都能在工业环境中多次重复使用。我们结合图神经网络与深度强化学习（DRL）来制定有效的任务调度策略。该策略的工作方式类似于优先级调度规则，并与串行调度生成方案配合生成调度。我们在标准基准上的实证评估展示了该方法在性能和泛化能力方面的优越性。所开发的Wheatley框架已在线公开，以便于进一步研究和结果复现。

MMD-Thinker: Adaptive Multi-Dimensional Thinking for Multimodal Misinformation Detection

MMD-Thinker：多模态虚假信息检测的自适应多维思维

Authors: Junjie Wu, Guohong Fu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.13242
Pdf link: https://arxiv.org/pdf/2511.13242
Abstract Multimodal misinformation floods various social media platforms and continues to evolve in the era of AI-generated content (AIGC). The emerging misinformation, with its low creation cost and high deceptiveness, poses significant threats to society. While recent studies leverage general-purpose multimodal large language models (MLLMs) to achieve remarkable results in detection, they encounter two critical limitations: (1) Insufficient reasoning, where general-purpose MLLMs often follow a uniform reasoning paradigm but generate inaccurate explanations and judgments due to the lack of task-specific knowledge of multimodal misinformation detection. (2) Reasoning biases, where a single thinking mode leads detectors down a suboptimal path for judgment, struggling to keep pace with the fast-growing and intricate multimodal misinformation. In this paper, we propose MMD-Thinker, a two-stage framework for multimodal misinformation detection through adaptive multi-dimensional thinking. First, we develop a tailor-designed thinking mode for multimodal misinformation detection. Second, we adopt task-specific instruction tuning to inject the tailored thinking mode into general-purpose MLLMs. Third, we further leverage a reinforcement learning strategy with a mixed advantage function, which incentivizes the reasoning capabilities in trajectories. Furthermore, we construct the multimodal misinformation reasoning (MMR) dataset, which encompasses more than 8K image-text pairs with both reasoning processes and classification labels, to make progress in the realm of multimodal misinformation detection. Experimental results demonstrate that our proposed MMD-Thinker achieves state-of-the-art performance on both in-domain and out-of-domain benchmark datasets, while maintaining flexible inference and token usage. Code will be publicly available on GitHub.
中文摘要 多模态错误信息在各种社交媒体上泛滥，并且在人工智能生成内容（AIGC）时代不断演变。这些制作成本低且欺骗性高的错误信息对社会构成重大威胁。尽管近期研究利用通用多模态大型语言模型（MLLM）在检测方面取得了显著成果，但它们面临两个关键局限：（1）推理不足，通用MLLM常遵循统一的推理范式，但由于缺乏多模态错误信息检测的任务特定知识，导致解释和判断不准确。（2）推理偏差，即单一思维模式使检测器走上次优的判断路径，难以跟上快速增长且复杂的多模态错误信息。本文提出了MMD-Thinker，一种通过自适应多维思维进行多模态错误信息检测的两阶段框架。首先，我们开发了针对多模态错误信息检测量身定制的思维模式。其次，我们采用任务特定的指令调优，将定制的思维模式注入通用MLLM。第三，我们进一步利用带有混合优势函数的强化学习策略，激励轨迹中的推理能力。此外，我们构建了多模态错误信息推理（MMR）数据集，涵盖8K多个图像-文本对，兼具推理过程和分类标签，以推动多模态错误信息检测领域的进展。实验结果表明，我们提出的MMD-Thinker在域内和域外基准数据集上都实现了最先进的性能，同时保持了推理和token使用的灵活性。代码将在 GitHub 上公开发布。

Explainable RL Policies by Distilling to Locally-Specialized Linear Policies with Voronoi State Partitioning

通过 Voronoi 状态划分蒸馏为局部专用线性策略，实现可解释的强化学习策略

Authors: Senne Deproost, Dennis Steckelmacher, Ann Nowé
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.13322
Pdf link: https://arxiv.org/pdf/2511.13322
Abstract Deep Reinforcement Learning is one of the state-of-the-art methods for producing near-optimal system controllers. However, deep RL algorithms train a deep neural network that lacks transparency, which poses challenges when the controller has to meet regulations or foster trust. To alleviate this, one could transfer the learned behaviour into a model that is human-readable by design using knowledge distillation. Often this is done with a single model which mimics the original model on average but could struggle in more dynamic situations. A key challenge is that this simpler model should have the right balance between flexibility and complexity, or the right balance between bias and accuracy. We propose a new model-agnostic method to divide the state space into regions where a simplified, human-understandable model can operate. In this paper, we use Voronoi partitioning to find regions where linear models can achieve similar performance to the original controller. We evaluate our approach on a gridworld environment and a classic control task. We observe that our proposed distillation to locally-specialized linear models produces policies that are explainable and show that the distillation matches or even slightly outperforms the black-box policy they are distilled from.
中文摘要 深度强化学习是生成近似最优系统控制器的最先进方法之一。然而，深度强化学习算法训练的是一个缺乏透明度的深度神经网络，这在控制器需要满足法规要求或建立信任时会带来挑战。为了缓解这一问题，可以利用知识蒸馏将学习到的行为迁移到一个在设计上可被人类阅读的模型中。通常只用单个模型来完成，它在平均意义上模仿原始模型，但在更动态的场景下可能表现吃力。一个关键挑战是，这种更简单的模型应在灵活性与复杂性之间，或者说在偏差与准确性之间取得恰当平衡。我们提出了一种新的模型无关方法，将状态空间划分为若干区域，使简化的、人类可理解的模型能够在其中运行。本文利用Voronoi划分寻找线性模型能够达到与原始控制器相近性能的区域。我们在网格世界环境和经典控制任务中评估了我们的方法。我们观察到，所提出的向局部专用线性模型的蒸馏能够产生可解释的策略，并表明蒸馏得到的策略与其来源的黑箱策略性能相当，甚至略优。
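
The idea of one linear model per Voronoi cell can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the abstract does not say how cell centroids are chosen or how the local models are fitted, so the fixed `centroids`, the ordinary least-squares fit, the toy teacher policy `|s|`, and all function names below are hypothetical.

```python
import numpy as np


def assign_regions(states, centroids):
    """Return the index of the nearest centroid (the Voronoi cell) for each state."""
    distances = np.linalg.norm(states[:, None, :] - centroids[None, :, :], axis=2)
    return np.argmin(distances, axis=1)


def fit_local_linear_policies(states, teacher_actions, centroids):
    """Fit one affine map per Voronoi cell to the teacher's actions (least squares)."""
    regions = assign_regions(states, centroids)
    # Append a constant feature so each local model has a bias term.
    features = np.hstack([states, np.ones((len(states), 1))])
    weights = np.zeros((len(centroids), features.shape[1], teacher_actions.shape[1]))
    for k in range(len(centroids)):
        mask = regions == k
        if mask.any():  # cells without data keep the zero model
            solution = np.linalg.lstsq(features[mask], teacher_actions[mask], rcond=None)[0]
            weights[k] = solution
    return weights


def predict(states, centroids, weights):
    """Route each state to its cell and apply that cell's affine model."""
    regions = assign_regions(states, centroids)
    features = np.hstack([states, np.ones((len(states), 1))])
    return np.einsum("nf,nfa->na", features, weights[regions])


# Toy "black-box" teacher on a 1D state space: a(s) = |s| (an assumption for the demo).
states = np.linspace(-2.0, 2.0, 41).reshape(-1, 1)
teacher_actions = np.abs(states)
centroids = np.array([[-1.0], [1.0]])  # two cells: s < 0 and s > 0

weights = fit_local_linear_policies(states, teacher_actions, centroids)
student_actions = predict(states, centroids, weights)
print("max imitation error:", np.max(np.abs(student_actions - teacher_actions)))
```

A single linear model cannot represent `|s|`, but two cells with one affine model each can, which is the intuition behind locally specialized policies. Each cell's weights can be read directly as a rule of the form "in this region, action = w · state + b".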

Finding Kissing Numbers with Game-theoretic Reinforcement Learning

利用博弈论强化学习寻找吻数

Authors: Chengdong Ma, Théo Tao Zhaowei, Pengyu Li, Minghao Liu, Haojun Chen, Zihao Mao, Yuan Cheng, Yuan Qi, Yaodong Yang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.13391
Pdf link: https://arxiv.org/pdf/2511.13391
Abstract Since Isaac Newton first studied the Kissing Number Problem in 1694, determining the maximal number of non-overlapping spheres around a central sphere has remained a fundamental challenge. This problem represents the local analogue of Hilbert's 18th problem on sphere packing, bridging geometry, number theory, and information theory. Although significant progress has been made through lattices and codes, the irregularities of high-dimensional geometry and exponentially growing combinatorial complexity beyond 8 dimensions, which exceeds the complexity of the game of Go, limit the scalability of existing methods. Here we model this problem as a two-player matrix completion game and train the game-theoretic reinforcement learning system, PackingStar, to efficiently explore high-dimensional spaces. The matrix entries represent pairwise cosines of sphere center vectors; one player fills entries while another corrects suboptimal ones, jointly maximizing the matrix size, corresponding to the kissing number. These cooperative dynamics substantially improve sample quality, making the extremely large spaces tractable. PackingStar reproduces previous configurations and surpasses all human-known records from dimensions 25 to 31, with the configuration in 25 dimensions geometrically corresponding to the Leech lattice and suggesting possible optimality. It achieves the first breakthrough beyond rational structures from 1971 in 13 dimensions and discovers over 6000 new structures in 14 and other dimensions. These results demonstrate AI's power to explore high-dimensional spaces beyond human intuition and open new pathways for the Kissing Number Problem and broader geometry problems.
中文摘要 自从艾萨克·牛顿于1694年首次研究吻数问题以来，确定围绕一个中心球且互不重叠的球的最大数量一直是一项根本性的挑战。该问题是希尔伯特第18问题（球堆积问题）的局部对应，连接了几何、数论和信息论。尽管通过格与编码取得了显著进展，但高维几何的不规则性以及超过8维后指数级增长的组合复杂度（超过围棋的复杂度）限制了现有方法的可扩展性。在这里，我们将该问题建模为双人矩阵补全博弈，并训练博弈论强化学习系统PackingStar，以高效探索高维空间。矩阵元素表示球心向量的两两余弦；一名玩家填写条目，另一名玩家纠正次优条目，共同最大化矩阵大小，即对应的吻数。这种协作动态显著提升了样本质量，使极其庞大的空间变得易于处理。PackingStar 重现了此前的构型，并超越了25至31维中人类已知的所有记录，其中25维构型在几何上对应于Leech格，暗示其可能是最优的。它在13维中首次突破了自1971年以来的有理结构，并在14维及其他维度中发现了6000多个新结构。这些结果展示了人工智能探索超越人类直觉的高维空间的能力，并为吻数问题及更广泛的几何问题开辟了新途径。
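
The "pairwise cosines" representation mentioned in the abstract rests on a standard fact: unit spheres touching a central unit sphere do not overlap exactly when every pair of center directions has cosine at most 1/2 (an angle of at least 60 degrees). The sketch below only checks this condition on a candidate configuration; it is not the PackingStar game, and the function name `is_valid_kissing_configuration` and the 2D test configurations are illustrative assumptions.

```python
import numpy as np


def is_valid_kissing_configuration(centers, tol=1e-9):
    """Check that all pairs of center directions have cosine <= 1/2.

    `centers` is an (n, d) array of nonzero vectors; each row is normalized
    to a unit direction before the pairwise cosine (Gram) matrix is formed.
    """
    vectors = np.asarray(centers, dtype=float)
    vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    gram = vectors @ vectors.T  # entry (i, j) is the cosine between directions i and j
    off_diagonal = gram[~np.eye(len(vectors), dtype=bool)]
    return bool(np.all(off_diagonal <= 0.5 + tol))


def regular_polygon(n):
    """Unit directions of a regular n-gon in the plane."""
    angles = np.arange(n) * 2.0 * np.pi / n
    return np.stack([np.cos(angles), np.sin(angles)], axis=1)


hexagon = regular_polygon(6)   # neighbors are exactly 60 degrees apart
heptagon = regular_polygon(7)  # neighbors are closer than 60 degrees

print("6 spheres in 2D valid:", is_valid_kissing_configuration(hexagon))
print("7 evenly spaced spheres in 2D valid:", is_valid_kissing_configuration(heptagon))
```

Six evenly spaced directions pass the check and seven do not, consistent with the kissing number in two dimensions being 6. Searching for larger valid Gram matrices in higher dimensions is the part the abstract attributes to the learned two-player game.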

Contact-Safe Reinforcement Learning with ProMP Reparameterization and Energy Awareness

结合ProMP重参数化和能量感知的接触安全强化学习

Authors: Bingkun Huang, Yuhe Gong, Zewen Yang, Tianyu Ren, Luis Figueredo
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.13459
Pdf link: https://arxiv.org/pdf/2511.13459
Abstract Reinforcement learning (RL) approaches based on Markov Decision Processes (MDPs) are predominantly applied in the robot joint space, often relying on limited task-specific information and partial awareness of the 3D environment. In contrast, episodic RL has demonstrated advantages over traditional MDP-based methods in terms of trajectory consistency, task awareness, and overall performance in complex robotic tasks. Moreover, traditional step-wise and episodic RL methods often neglect the contact-rich information inherent in task-space manipulation, especially considering the contact-safety and robustness. In this work, contact-rich manipulation tasks are tackled using a task-space, energy-safe framework, where reliable and safe task-space trajectories are generated through the combination of Proximal Policy Optimization (PPO) and movement primitives. Furthermore, an energy-aware Cartesian Impedance Controller objective is incorporated within the proposed framework to ensure safe interactions between the robot and the environment. Our experimental results demonstrate that the proposed framework outperforms existing methods in handling tasks on various types of surfaces in 3D environments, achieving high success rates as well as smooth trajectories and energy-safe interactions.
中文摘要 基于马尔可夫决策过程（MDP）的强化学习（RL）方法主要应用于机器人关节空间，通常依赖有限的任务特定信息和对三维环境的部分感知。相比之下，回合式（episodic）强化学习在轨迹一致性、任务感知以及复杂机器人任务中的整体性能方面，已证明优于传统基于MDP的方法。此外，传统的逐步式和回合式强化学习方法往往忽视任务空间操作中固有的接触丰富信息，尤其是在接触安全性和鲁棒性方面。本研究通过一个任务空间、能量安全的框架来处理接触丰富的操作任务，结合近端策略优化（PPO）和运动基元生成可靠且安全的任务空间轨迹。此外，所提框架中还纳入了能量感知的笛卡尔阻抗控制器目标，以确保机器人与环境之间的安全交互。我们的实验结果表明，该框架在处理三维环境中各类表面上的任务时优于现有方法，实现了高成功率、平滑轨迹和能量安全的交互。

Artificial Intelligence-driven Intelligent Wearable Systems: A full-stack Integration from Material Design to Personalized Interaction

人工智能驱动的智能可穿戴系统：从材料设计到个性化交互的全栈整合

Authors: Jingyi Zhao, Daqian Shi, Zhengda Wang, Xiongfeng Tang, Yanguo Qin
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.13565
Pdf link: https://arxiv.org/pdf/2511.13565
Abstract Intelligent wearable systems are at the forefront of precision medicine and play a crucial role in enhancing human-machine interaction. Traditional devices often encounter limitations due to their dependence on empirical material design and basic signal processing techniques. To overcome these issues, we introduce the concept of Human-Symbiotic Health Intelligence (HSHI), which is a framework that integrates multi-modal sensor networks with edge-cloud collaborative computing and a hybrid approach to data and knowledge modeling. HSHI is designed to adapt dynamically to both inter-individual and intra-individual variability, transitioning health management from passive monitoring to an active collaborative evolution. The framework incorporates AI-driven optimization of materials and micro-structures, provides robust interpretation of multi-modal signals, and utilizes a dual mechanism that merges population-level insights with personalized adaptations. Moreover, the integration of closed-loop optimization through reinforcement learning and digital twins facilitates customized interventions and feedback. In general, HSHI represents a significant shift in healthcare, moving towards a model that emphasizes prevention, adaptability, and a harmonious relationship between technology and health management.
中文摘要 智能可穿戴系统处于精准医疗的前沿，在增强人机交互方面发挥着关键作用。传统器件常因依赖经验材料设计和基础信号处理技术而面临限制。为克服这些问题，我们引入了人类-共生健康智能（HSHI）概念，这是一个将多模态传感器网络与边缘云协作计算以及数据与知识建模混合方法相结合的框架。HSHI旨在动态适应个体间和个体内部的变异性，将健康管理从被动监测转变为主动协作演进。该框架结合了由人工智能驱动的材料和微结构优化，提供了多模态信号的稳健解读，并采用了将群体层面洞察与个性化适应相结合的双重机制。此外，通过强化学习和数字孪生整合闭环优化，促进了个性化干预和反馈。总体而言，HSHI代表了医疗保健的重大转变，朝着强调预防、适应性以及技术与健康管理和谐关系的模式迈进。

P1: Mastering Physics Olympiads with Reinforcement Learning

P1：通过强化学习掌握物理奥林匹克竞赛

Authors: Jiacheng Chen, Qianjia Cheng, Fangchen Yu, Haiyuan Wan, Yuchen Zhang, Shenghe Zheng, Junchi Yao, Qingyang Zhang, Haonan He, Yun Luo, Yufeng Zhao, Futing Wang, Li Sheng, Chengxing Xie, Yuxin Zuo, Yizhuo Li, Wenxauan Zeng, Yulun Wu, Rui Huang, Dongzhan Zhou, Kai Chen, Yu Qiao, Lei Bai, Yu Cheng, Ning Ding, Bowen Zhou, Peng Ye, Ganqu Cui
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.13612
Pdf link: https://arxiv.org/pdf/2511.13612
Abstract Recent progress in large language models (LLMs) has moved the frontier from puzzle-solving to science-grade reasoning, the kind needed to tackle problems whose answers must stand against nature, not merely fit a rubric. Physics is the sharpest test of this shift, which binds symbols to reality in a fundamental way, serving as the cornerstone of most modern technologies. In this work, we manage to advance physics research by developing large language models with exceptional physics reasoning capabilities, which especially excel at solving Olympiad-level physics problems. We introduce P1, a family of open-source physics reasoning models trained entirely through reinforcement learning (RL). Among them, P1-235B-A22B is the first open-source model with Gold-medal performance at the latest International Physics Olympiad (IPhO 2025), and wins 12 gold medals out of 13 international/regional physics competitions in 2024/2025. P1-30B-A3B also surpasses almost all other open-source models on IPhO 2025, getting a silver medal. Further equipped with an agentic framework PhysicsMinions, P1-235B-A22B+PhysicsMinions achieves overall No.1 on IPhO 2025, and obtains the highest average score over the 13 physics competitions. Besides physics, P1 models also present great performance on other reasoning tasks like math and coding, showing the great generalizability of the P1 series.
中文摘要 大型语言模型（LLM）的最新进展已将前沿从解谜推向科学级推理，即解决那些答案必须经得起自然检验、而不仅仅是符合评分标准的问题所需的推理。物理学是这一转变最严峻的考验，它以根本的方式将符号与现实联系起来，是大多数现代技术的基石。在这项工作中，我们通过开发具有卓越物理推理能力、尤其擅长解决奥林匹克级物理问题的大型语言模型，推动了物理研究的发展。我们介绍了P1，这是一系列完全通过强化学习（RL）训练的开源物理推理模型。其中，P1-235B-A22B是首个在最新国际物理奥林匹克竞赛（IPhO 2025）中达到金牌水平的开源模型，并在2024/2025年的13项国际/区域物理竞赛中赢得12枚金牌。P1-30B-A3B在IPhO 2025上也超过了几乎所有其他开源模型，获得银牌。进一步配备智能体框架PhysicsMinions后，P1-235B-A22B+PhysicsMinions在IPhO 2025上取得总排名第一，并在13项物理竞赛中获得最高平均分。除了物理，P1模型在数学和编程等其他推理任务上也表现出色，显示出P1系列良好的泛化能力。

Distribution Matching Distillation Meets Reinforcement Learning

分布匹配蒸馏与强化学习结合

Authors: Dengyang Jiang, Dongyang Liu, Zanyi Wang, Qilong Wu, Xin Jin, David Liu, Zhen Li, Mengmeng Wang, Peng Gao, Harry Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.13649
Pdf link: https://arxiv.org/pdf/2511.13649
Abstract Distribution Matching Distillation (DMD) distills a pre-trained multi-step diffusion model to a few-step one to improve inference efficiency. However, the performance of the latter is often capped by the former. To circumvent this dilemma, we propose DMDR, a novel framework that combines Reinforcement Learning (RL) techniques into the distillation process. We show that for the RL of the few-step generator, the DMD loss itself is a more effective regularization compared to the traditional ones. In turn, RL can help to guide the mode coverage process in DMD more effectively. These allow us to unlock the capacity of the few-step generator by conducting distillation and RL simultaneously. Meanwhile, we design the dynamic distribution guidance and dynamic renoise sampling training strategies to improve the initial distillation process. The experiments demonstrate that DMDR can achieve leading visual quality, prompt coherence among few-step methods, and even exhibit performance that exceeds the multi-step teacher.
中文摘要 分布匹配蒸馏（DMD）将预训练的多步扩散模型蒸馏为少步扩散模型，以提高推理效率。然而，后者的性能往往受限于前者。为解决这一困境，我们提出了DMDR这一新框架，将强化学习（RL）技术融入蒸馏过程。我们证明，对于少步生成器的强化学习而言，DMD损失本身是比传统正则化更有效的正则项。反过来，强化学习可以帮助更有效地引导DMD中的模式覆盖过程。这使我们能够通过同时进行蒸馏和强化学习，释放少步生成器的能力。同时，我们设计了动态分布引导和动态重噪采样两种训练策略，以改进初始蒸馏过程。实验表明，DMDR能够在少步方法中取得领先的视觉质量和提示一致性，甚至表现出超越多步教师模型的性能。

Keyword: diffusion policy

MATT-Diff: Multimodal Active Target Tracking by Diffusion Policy

MATT-Diff：通过扩散策略实现多模态主动目标跟踪

Authors: Saida Liu, Nikolay Atanasov, Shumon Koga
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.11931
Pdf link: https://arxiv.org/pdf/2511.11931
Abstract This paper proposes MATT-Diff: Multi-Modal Active Target Tracking by Diffusion Policy, a control policy that captures multiple behavioral modes - exploration, dedicated tracking, and target reacquisition - for active multi-target tracking. The policy enables agent control without prior knowledge of target numbers, states, or dynamics. Effective target tracking demands balancing exploration for undetected or lost targets with following the motion of detected but uncertain ones. We generate a demonstration dataset from three expert planners including frontier-based exploration, an uncertainty-based hybrid planner switching between frontier-based exploration and RRT* tracking based on target uncertainty, and a time-based hybrid planner switching between exploration and tracking based on target detection time. We design a control policy utilizing a vision transformer for egocentric map tokenization and an attention mechanism to integrate variable target estimates represented by Gaussian densities. Trained as a diffusion model, the policy learns to generate multi-modal action sequences through a denoising process. Evaluations demonstrate MATT-Diff's superior tracking performance against expert and behavior cloning baselines across multiple target motions, empirically validating its advantages in target tracking.
中文摘要 本文提出了MATT-Diff：通过扩散策略实现多模态主动目标跟踪，这是一种控制策略，能够捕捉多种行为模式（探索、专注跟踪和目标重新捕获），用于主动多目标跟踪。该策略使智能体无需事先了解目标的数量、状态或动态即可进行控制。有效的目标跟踪要求在探索未被发现或已丢失的目标与跟随已发现但不确定的目标的运动之间取得平衡。我们从三种专家规划器生成了一个演示数据集，包括基于前沿的探索、基于不确定性的混合规划器（根据目标不确定性在基于前沿的探索与RRT*跟踪之间切换）以及基于时间的混合规划器（根据目标检测时间在探索与跟踪之间切换）。我们设计了一种控制策略，利用视觉Transformer对以自我为中心的地图进行token化，并采用注意力机制整合由高斯密度表示的数量可变的目标估计。该策略作为扩散模型进行训练，通过去噪过程学习生成多模态动作序列。评估表明，在多种目标运动下，MATT-Diff的跟踪性能优于专家规划器和行为克隆基线，实证验证了其在目标跟踪方面的优势。

Decoupled Action Head: Confining Task Knowledge to Conditioning Layers

解耦行动头：将任务知识限制在条件层

Authors: Jian Zhou, Sihao Lin, Shuai Fu, Qi WU
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.12101
Pdf link: https://arxiv.org/pdf/2511.12101
Abstract Behavior Cloning (BC) is a data-driven supervised learning approach that has gained increasing attention with the success of scaling laws in language and vision domains. Among its implementations in robotic manipulation, Diffusion Policy (DP), with its two variants DP-CNN (DP-C) and DP-Transformer (DP-T), is one of the most effective and widely adopted models, demonstrating the advantages of predicting continuous action sequences. However, both DP and other BC methods remain constrained by the scarcity of paired training data, and the internal mechanisms underlying DP's effectiveness remain insufficiently understood, leading to limited generalization and a lack of principled design in model development. In this work, we propose a decoupled training recipe that leverages nearly cost-free kinematics-generated trajectories as observation-free data to pretrain a general action head (action generator). The pretrained action head is then frozen and adapted to novel tasks through feature modulation. Our experiments demonstrate the feasibility of this approach in both in-distribution and out-of-distribution scenarios. As an additional benefit, decoupling improves training efficiency; for instance, DP-C achieves up to a 41% speedup. Furthermore, the confinement of task-specific knowledge to the conditioning components under decoupling, combined with the near-identical performance of DP-C in both normal and decoupled training, indicates that the action generation backbone plays a limited role in robotic manipulation. Motivated by this observation, we introduce DP-MLP, which replaces the 244M-parameter U-Net backbone of DP-C with only 4M parameters of simple MLP blocks, achieving an 83.9% faster training speed under normal training and 89.1% under decoupling.
中文摘要 行为克隆（BC）是一种数据驱动的监督学习方法，随着缩放定律在语言和视觉领域的成功，这一方法日益受到关注。在其用于机器人操作的各种实现中，扩散策略（DP）及其两种变体DP-CNN（DP-C）和DP-Transformer（DP-T）是最有效且被广泛采用的模型之一，展示了预测连续动作序列的优势。然而，DP和其他BC方法仍受限于配对训练数据的稀缺，而且对DP有效性背后的内部机制理解不足，导致泛化能力有限，模型开发也缺乏有原则的设计。在本研究中，我们提出了一种解耦训练方案，利用几乎零成本的运动学生成轨迹作为无观测数据，预训练一个通用动作头（动作生成器）。预训练的动作头随后被冻结，并通过特征调制适配到新任务。我们的实验证明了这种方法在分布内和分布外场景下都是可行的。此外，解耦还能提高训练效率；例如，DP-C可实现最高41%的加速。再者，在解耦条件下任务特定知识被限制在条件组件中，加上DP-C在正常训练和解耦训练中表现几乎相同，表明动作生成骨干网络在机器人操作中作用有限。基于这一观察，我们引入了DP-MLP，它用仅4M参数的简单MLP块替换了DP-C中244M参数的U-Net骨干网络，在正常训练下训练速度提升了83.9%，在解耦条件下提升了89.1%。

DiffFP: Learning Behaviors from Scratch via Diffusion-based Fictitious Play

DiffFP：通过基于扩散的虚拟博弈从零开始学习行为

Authors: Akash Karthikeyan, Yash Vardhan Pant
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2511.13186
Pdf link: https://arxiv.org/pdf/2511.13186
Abstract Self-play reinforcement learning has demonstrated significant success in learning complex strategic and interactive behaviors in competitive multi-agent games. However, achieving such behaviors in continuous decision spaces remains challenging. Ensuring adaptability and generalization in self-play settings is critical for achieving competitive performance in dynamic multi-agent environments. These challenges often cause methods to converge slowly or fail to converge at all to a Nash equilibrium, making agents vulnerable to strategic exploitation by unseen opponents. To address these challenges, we propose DiffFP, a fictitious play (FP) framework that estimates the best response to unseen opponents while learning a robust and multimodal behavioral policy. Specifically, we approximate the best response using a diffusion policy that leverages generative modeling to learn adaptive and diverse strategies. Through empirical evaluation, we demonstrate that the proposed FP framework converges towards $\epsilon$-Nash equilibria in continuous-space zero-sum games. We validate our method on complex multi-agent environments, including racing and multi-particle zero-sum games. Simulation results show that the learned policies are robust against diverse opponents and outperform baseline reinforcement learning policies. Our approach achieves up to 3$\times$ faster convergence and 30$\times$ higher success rates on average against RL-based baselines, demonstrating its robustness to opponent strategies and stability across training iterations.
中文摘要 自博弈强化学习在竞争性多智能体博弈中学习复杂的策略性和交互性行为方面取得了显著成功。然而，在连续决策空间中实现此类行为仍然具有挑战性。确保自博弈设置中的适应性和泛化性，对于在动态多智能体环境中取得有竞争力的性能至关重要。这些挑战常常导致方法收敛缓慢，甚至根本无法收敛到纳什均衡，使智能体容易被未见过的对手策略性地利用。为应对这些挑战，我们提出了DiffFP，这是一种虚拟博弈（FP）框架，在学习稳健且多模态的行为策略的同时，估计针对未见过对手的最佳响应。具体来说，我们使用扩散策略来近似最佳响应，该策略利用生成建模学习自适应且多样化的策略。通过实证评估，我们证明所提出的FP框架在连续空间零和博弈中趋向$\epsilon$-Nash均衡。我们在复杂的多智能体环境中验证了我们的方法，包括赛车和多粒子零和博弈。仿真结果表明，所学策略对多样化的对手表现出鲁棒性，且优于基线强化学习策略。与基于强化学习的基线相比，我们的方法收敛速度最高提升3$\times$，成功率平均提高30$\times$，展示了其对对手策略的鲁棒性以及在训练迭代间的稳定性。

Keyword: reinforcement learning

Mind Your Entropy: From Maximum Entropy to Trajectory Entropy-Constrained RL

注意熵：从最大熵到轨迹熵约束强化学习

Machine learning-based cloud resource allocation algorithms: a comprehensive comparative review

基于机器学习的云资源分配算法：全面比较综述

Clustering-Based Weight Orthogonalization for Stabilizing Deep Reinforcement Learning

基于聚类的权重正交化用于深度强化学习的稳定

Environment-Aware Transfer Reinforcement Learning for Sustainable Beam Selection

面向可持续波束选择的环境感知迁移强化学习

Convergence of Multiagent Learning Systems for Traffic control

用于交通控制的多智能体学习系统的收敛性

OSGym: Super-Scalable Distributed Data Engine for Generalizable Computer Agents

OSGym：面向通用计算机代理的超可扩展分布式数据引擎

Enhancing Reinforcement Learning in 3D Environments through Semantic Segmentation: A Case Study in ViZDoom

通过语义分割增强三维环境中的强化学习：ViZDoom 案例研究

How Machine Learning-Data Driven Replication Strategies Enhance Fault Tolerance in Large-Scale Distributed Systems

机器学习数据驱动复制策略如何增强大规模分布式系统的容错能力

Learning to Refine: An Agentic RL Approach for Iterative SPARQL Query Construction

学习精炼：一种代理式强化学习方法用于迭代 SPARQL 查询构造

Image-POSER: Reflective RL for Multi-Expert Image Generation and Editing

Image-POSER：面向多专家图像生成与编辑的反思式强化学习

MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling

MiroThinker：通过模型、上下文和交互式扩展推动开源研究代理的性能边界

Conformal Constrained Policy Optimization for Cost-Effective LLM Agents

为成本效益高的LLM代理提供共形约束策略优化

Better LLM Reasoning via Dual-Play

通过双人游戏更好地进行大型语言模型推理

Context-Emotion Aware Therapeutic Dialogue Generation: A Multi-component Reinforcement Learning Approach to Language Models for Mental Health Support

情境-情感感知的治疗对话生成：一种面向心理健康支持语言模型的多组件强化学习方法

VULPO: Context-Aware Vulnerability Detection via On-Policy LLM Optimization

VULPO：通过策略内大型语言模型优化实现上下文感知漏洞检测

Quantile Q-Learning: Revisiting Offline Extreme Q-Learning with Quantile Regression

分位Q学习：利用分位数回归重新审视离线极限Q学习

Goal-Oriented Multi-Agent Reinforcement Learning for Decentralized Agent Teams

去中心化智能体团队的目标导向多智能体强化学习

Look As You Think: Unifying Reasoning and Visual Evidence Attribution for Verifiable Document RAG via Reinforcement Learning

边想边看：通过强化学习统一推理与视觉证据归因，实现可验证的文档RAG

EARL: Entropy-Aware RL Alignment of LLMs for Reliable RTL Code Generation

EARL：面向可靠RTL代码生成的熵感知强化学习LLM对齐

Intelligent Collaborative Optimization for Rubber Tyre Film Production Based on Multi-path Differentiated Clipping Proximal Policy Optimization

基于多路径差异化裁剪近端策略优化的橡胶轮胎薄膜生产智能协同优化

Treatment Stitching with Schrödinger Bridge for Enhancing Offline Reinforcement Learning in Adaptive Treatment Strategies

采用薛定谔桥进行治疗缝合，以增强自适应治疗策略中的离线强化学习

HCPO: Hierarchical Conductor-Based Policy Optimization in Multi-Agent Reinforcement Learning

HCPO：多智能体强化学习中基于指挥者的层级策略优化

AI-Salesman: Towards Reliable Large Language Model Driven Telemarketing

人工智能销售员：迈向可靠的大型语言模型驱动电话营销

CriticSearch: Fine-Grained Credit Assignment for Search Agents via a Retrospective Critic

CriticSearch：通过回顾式评论器为搜索智能体提供细粒度信用分配

SocialNav-Map: Dynamic Mapping with Human Trajectory Prediction for Zero-Shot Social Navigation

SocialNav-Map：结合人类轨迹预测的动态建图，用于零样本社交导航

Reward and Guidance through Rubrics: Promoting Exploration to Improve Multi-Domain Reasoning

通过评分标准奖励与指导：促进探索以提升多领域推理能力

Dynamic Reward Scaling for Multivariate Time Series Anomaly Detection: A VAE-Enhanced Reinforcement Learning Approach

多变量时间序列异常检测的动态奖励尺度：VAE增强强化学习方法

Constructing and Interpreting Digital Twin Representations for Visual Reasoning via Reinforcement Learning

通过强化学习构建和解释数字孪生表示以实现视觉推理

Learning Adaptive Neural Teleoperation for Humanoid Robots: From Inverse Kinematics to End-to-End Control

学习人形机器人自适应神经遥操作：从逆运动学到端到端控制

Integrating Neural Differential Forecasting with Safe Reinforcement Learning for Blood Glucose Regulation

将神经微分预测与安全强化学习相结合以实现血糖调节

Tailored Primitive Initialization is the Secret Key to Reinforcement Learning

定制化的原始初始化是强化学习的秘密钥匙

ClutterNav: Gradient-Guided Search for Efficient 3D Clutter Removal with Learned Costmaps

ClutterNav：利用学习成本图实现高效3D杂乱去除的梯度引导搜索

Designed to Spread: Generative Approaches to Enhance Information Diffusion

旨在传播：生成式方法促进信息传播

TAdaRAG: Task Adaptive Retrieval-Augmented Generation via On-the-Fly Knowledge Graph Construction

TAdaRAG：通过动态知识图谱构建实现任务自适应检索增强生成

ReaSon: Reinforced Causal Search with Information Bottleneck for Video Understanding

ReaSon：带有信息瓶颈的强化因果搜索以促进视频理解

Mitigating Length Bias in RLHF through a Causal Lens

通过因果视角缓解RLHF中的长度偏置

NFQ2.0: The CartPole Benchmark Revisited

NFQ2.0：CartPole 基准测试再访

Task-Aware Morphology Optimization of Planar Manipulators via Reinforcement Learning

通过强化学习实现平面机械臂的任务感知形态优化

Beyond Fixed Tasks: Unsupervised Environment Design for Task-Level Pairs

超越固定任务：任务级对的无监督环境设计

Prompt-Driven Domain Adaptation for End-to-End Autonomous Driving via In-Context RL

通过上下文强化学习实现端到端自动驾驶的提示驱动领域自适应