Arxiv Papers of Today

Generated: 2025-11-04 16:33:19 (UTC+8); arXiv announcement time: 2025-11-04 20:00 EST (2025-11-05 09:00 UTC+8)

68 relevant papers today

Keyword: reinforcement learning

On the Fundamental Limitations of Decentralized Learnable Reward Shaping in Cooperative Multi-Agent Reinforcement Learning

Authors: Aditya Akella
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.00034
Pdf link: https://arxiv.org/pdf/2511.00034
Abstract Recent advances in learnable reward shaping have shown promise in single-agent reinforcement learning by automatically discovering effective feedback signals. However, the effectiveness of decentralized learnable reward shaping in cooperative multi-agent settings remains poorly understood. We propose DMARL-RSA, a fully decentralized system where each agent learns individual reward shaping, and evaluate it on cooperative navigation tasks in the simple_spread_v3 environment. Despite sophisticated reward learning, DMARL-RSA achieves only -24.20 +/- 0.09 average reward, compared to MAPPO with centralized training at 1.92 +/- 0.87--a 26.12-point gap. DMARL-RSA performs similarly to simple independent learning (IPPO: -23.19 +/- 0.96), indicating that advanced reward shaping cannot overcome fundamental decentralized coordination limitations. Interestingly, decentralized methods achieve higher landmark coverage (0.888 +/- 0.029 for DMARL-RSA, 0.960 +/- 0.045 for IPPO out of 3 total) but worse overall performance than centralized MAPPO (0.273 +/- 0.008 landmark coverage)--revealing a coordination paradox between local optimization and global performance. Analysis identifies three critical barriers: (1) non-stationarity from concurrent policy updates, (2) exponential credit assignment complexity, and (3) misalignment between individual reward optimization and global objectives. These results establish empirical limits for decentralized reward learning and underscore the necessity of centralized coordination for effective multi-agent cooperation.

Graph-Attentive MAPPO for Dynamic Retail Pricing

Authors: Krishna Kumar Neelakanta Pillai Santha Kumari Amma
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.00039
Pdf link: https://arxiv.org/pdf/2511.00039
Abstract Dynamic pricing in retail requires policies that adapt to shifting demand while coordinating decisions across related products. We present a systematic empirical study of multi-agent reinforcement learning for retail price optimization, comparing a strong MAPPO baseline with a graph-attention-augmented variant (MAPPO+GAT) that leverages learned interactions among products. Using a simulated pricing environment derived from real transaction data, we evaluate profit, stability across random seeds, fairness across products, and training efficiency under a standardized evaluation protocol. The results indicate that MAPPO provides a robust and reproducible foundation for portfolio-level price control, and that MAPPO+GAT further enhances performance by sharing information over the product graph without inducing excessive price volatility. These results indicate that graph-integrated MARL provides a more scalable and stable solution than independent learners for dynamic retail pricing, offering practical advantages in multi-product decision-making.

SpatialTraceGen: High-Fidelity Traces for Efficient VLM Spatial Reasoning Distillation

Authors: Gio Huh, Dhruv Sheth, Rayhan Zirvi, Frank Xiao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.00054
Pdf link: https://arxiv.org/pdf/2511.00054
Abstract While Vision-Language Models (VLMs) excel in many areas, they struggle with complex spatial reasoning, which requires problem decomposition and strategic tool use. Fine-tuning smaller, more deployable models offers an efficient path to strong performance, but this is hampered by a major bottleneck: the absence of high-quality, step-by-step reasoning data. To address this data-efficiency gap, we introduce SpatialTraceGen, a framework to distill the reasoning processes of a large teacher model into a high-quality dataset of multi-hop, multi-tool reasoning traces. A key innovation is our automated Verifier, which scalably ensures the fidelity of each reasoning step, providing a cost-effective alternative to manual human annotation. On the CLEVR-Humans benchmark, this verifier-guided process improves the average quality score of traces by 17% while reducing quality variance by over 40%. SpatialTraceGen delivers a dataset of expert traces, providing the structured, step-by-step examples of tool use necessary for effective fine-tuning and sample-efficient offline reinforcement learning.

World Simulation with Video Foundation Models for Physical AI

Authors: NVIDIA: Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, Prithvijit Chattopadhyay, Mike Chen, Yongxin Chen, Yu Chen, Shuai Cheng, Yin Cui, Jenna Diamond, Yifan Ding, Jiaojiao Fan, Linxi Fan, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Ruiyuan Gao, Yunhao Ge, Jinwei Gu, Aryaman Gupta, Siddharth Gururani, Imad El Hanafi, Ali Hassani, Zekun Hao, Jacob Huffman, Joel Jang, Pooya Jannaty, Jan Kautz, Grace Lam, Xuan Li, Zhaoshuo Li, Maosheng Liao, Chen-Hsuan Lin, Tsung-Yi Lin, Yen-Chen Lin, Huan Ling, Ming-Yu Liu, Xian Liu, Yifan Lu, Alice Luo, Qianli Ma, Hanzi Mao, Kaichun Mo, Seungjun Nah, Yashraj Narang, Abhijeet Panaskar, Lindsey Pavao, Trung Pham, Morteza Ramezanali, Fitsum Reda, Scott Reed, Xuanchi Ren, Haonan Shao, Yue Shen, Stella Shi, Shuran Song, Bartosz Stefaniak, Shangkun Sun, Shitao Tang, Sameena Tasmeen, Lyne Tchapmi, Wei-Cheng Tseng, Jibin Varghese, Andrew Z. Wang, Hao Wang, Haoxiang Wang, Heng Wang, Ting-Chun Wang, Fangyin Wei, Jiashu Xu, Dinghao Yang, Xiaodong Yang, Haotian Ye, Seonghyeon Ye, Xiaohui Zeng, Jing Zhang, Qinsheng Zhang, Kaiwen Zheng, Andrew Zhu, Yuke Zhu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.00062
Pdf link: https://arxiv.org/pdf/2511.00062
Abstract We introduce [Cosmos-Predict2.5], the latest generation of the Cosmos World Foundation Models for Physical AI. Built on a flow-based architecture, [Cosmos-Predict2.5] unifies Text2World, Image2World, and Video2World generation in a single model and leverages [Cosmos-Reason1], a Physical AI vision-language model, to provide richer text grounding and finer control of world simulation. Trained on 200M curated video clips and refined with reinforcement learning-based post-training, [Cosmos-Predict2.5] achieves substantial improvements over [Cosmos-Predict1] in video quality and instruction alignment, with models released at 2B and 14B scales. These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems. We further extend the family with [Cosmos-Transfer2.5], a control-net style framework for Sim2Real and Real2Real world translation. Despite being 3.5$\times$ smaller than [Cosmos-Transfer1], it delivers higher fidelity and robust long-horizon video generation. Together, these advances establish [Cosmos-Predict2.5] and [Cosmos-Transfer2.5] as versatile tools for scaling embodied intelligence. To accelerate research and deployment in Physical AI, we release source code, pretrained checkpoints, and curated benchmarks under the NVIDIA Open Model License at this https URL and this https URL. We hope these open resources lower the barrier to adoption and foster innovation in building the next generation of embodied intelligence.

Token-Regulated Group Relative Policy Optimization for Stable Reinforcement Learning in Large Language Models

Authors: Tue Le, Nghi D. Q. Bui, Linh Ngo Van, Trung Le
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.00066
Pdf link: https://arxiv.org/pdf/2511.00066
Abstract Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful approach for strengthening the reasoning capabilities of large language models (LLMs). Among existing algorithms, Group Relative Policy Optimization (GRPO) has demonstrated strong performance, yet it suffers from a critical issue: low-probability tokens disproportionately dominate gradient updates due to their inherently large gradient magnitudes. This imbalance leads to unstable training and suppresses the contribution of high-probability tokens that are more reliable for learning. In this work, we introduce Token-Regulated Group Relative Policy Optimization (TR-GRPO), a simple yet effective extension of GRPO that assigns token-level weights positively correlated with the model's predicted probability. By downweighting low-probability tokens and emphasizing high-probability ones, TR-GRPO mitigates gradient over-amplification while preserving informative learning signals. Extensive experiments demonstrate that TR-GRPO consistently outperforms GRPO across RLVR tasks, including logic, math, and agentic reasoning, highlighting the importance of regulating token contributions during RL training and establishing TR-GRPO as a robust framework for enhancing LLM reasoning.
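
The imbalance described in this abstract can be illustrated with a small sketch. For a softmax policy, the gradient of log p(token) with respect to the logits is onehot(token) - softmax(logits), so its norm grows as the token's probability shrinks. The linear weight `token_weight` and its floor `w_min` are assumptions made for illustration only; the abstract says the weights are positively correlated with the predicted probability but does not give their functional form.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logprob_grad_norm(logits, token):
    """L2 norm of d log p(token) / d logits = onehot(token) - softmax(logits)."""
    probs = softmax(logits)
    grad = [(1.0 if i == token else 0.0) - p for i, p in enumerate(probs)]
    return math.sqrt(sum(g * g for g in grad))


def token_weight(p, w_min=0.2):
    """Hypothetical weight that increases linearly with token probability p."""
    return w_min + (1.0 - w_min) * p


logits = [4.0, 1.0, 0.0]  # token 0 is likely, token 2 is unlikely
probs = softmax(logits)
for token, p in enumerate(probs):
    raw = logprob_grad_norm(logits, token)
    weighted = token_weight(p) * raw
    print(f"token={token} p={p:.3f} raw_grad={raw:.3f} weighted_grad={weighted:.3f}")
```

In this toy case the unlikely token still has the larger weighted gradient, but the ratio between the two tokens shrinks, which is the kind of regulation the abstract describes.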

Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail

Authors: NVIDIA: Yan Wang, Wenjie Luo, Junjie Bai, Yulong Cao, Tong Che, Ke Chen, Yuxiao Chen, Jenna Diamond, Yifan Ding, Wenhao Ding, Liang Feng, Greg Heinrich, Jack Huang, Peter Karkus, Boyi Li, Pinyi Li, Tsung-Yi Lin, Dongran Liu, Ming-Yu Liu, Langechuan Liu, Zhijian Liu, Jason Lu, Yunxiang Mao, Pavlo Molchanov, Lindsey Pavao, Zhenghao Peng, Mike Ranzinger, Ed Schmerling, Shida Shen, Yunfei Shi, Sarah Tariq, Ran Tian, Tilman Wekel, Xinshuo Weng, Tianjun Xiao, Eric Yang, Xiaodong Yang, Yurong You, Xiaohui Zeng, Wenyuan Zhang, Boris Ivanovic, Marco Pavone
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.00088
Pdf link: https://arxiv.org/pdf/2511.00088
Abstract End-to-end architectures trained via imitation learning have advanced autonomous driving by scaling model size and data, yet performance remains brittle in safety-critical long-tail scenarios where supervision is sparse and causal understanding is limited. To address this, we introduce Alpamayo-R1 (AR1), a vision-language-action model (VLA) that integrates Chain of Causation reasoning with trajectory planning to enhance decision-making in complex driving scenarios. Our approach features three key innovations: (1) the Chain of Causation (CoC) dataset, built through a hybrid auto-labeling and human-in-the-loop pipeline producing decision-grounded, causally linked reasoning traces aligned with driving behaviors; (2) a modular VLA architecture combining Cosmos-Reason, a Vision-Language Model pre-trained for Physical AI applications, with a diffusion-based trajectory decoder that generates dynamically feasible plans in real time; (3) a multi-stage training strategy using supervised fine-tuning to elicit reasoning and reinforcement learning (RL) to optimize reasoning quality via large reasoning model feedback and enforce reasoning-action consistency. Evaluation shows AR1 achieves up to a 12% improvement in planning accuracy on challenging cases compared to a trajectory-only baseline, with a 35% reduction in off-road rate and 25% reduction in close encounter rate in closed-loop simulation. RL post-training improves reasoning quality by 45% as measured by a large reasoning model critic and reasoning-action consistency by 37%. Model scaling from 0.5B to 7B parameters shows consistent improvements. On-vehicle road tests confirm real-time performance (99 ms latency) and successful urban deployment. By bridging interpretable reasoning with precise control, AR1 demonstrates a practical path towards Level 4 autonomous driving. We plan to release AR1 models and a subset of the CoC in a future update.

Self-Improving Vision-Language-Action Models with Data Generation via Residual RL

Authors: Wenli Xiao, Haotian Lin, Andy Peng, Haoru Xue, Tairan He, Yuqi Xie, Fengyuan Hu, Jimmy Wu, Zhengyi Luo, Linxi "Jim" Fan, Guanya Shi, Yuke Zhu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.00091
Pdf link: https://arxiv.org/pdf/2511.00091
Abstract Supervised fine-tuning (SFT) has become the de facto post-training strategy for large vision-language-action (VLA) models, but its reliance on costly human demonstrations limits scalability and generalization. We propose Probe, Learn, Distill (PLD), a three-stage plug-and-play framework that improves VLAs through residual reinforcement learning (RL) and distribution-aware data collection. In Stage 1, we train lightweight residual actors to probe failure regions of the VLA generalist. In Stage 2, we use a hybrid rollout scheme that aligns collected trajectories with the generalist's deployment distribution while capturing recovery behaviors. In Stage 3, we distill the curated trajectories back into the generalist with standard SFT. PLD achieves near-saturated 99% task success on LIBERO, over 50% gains in SimplerEnv, and 100% success on real-world Franka and YAM arm manipulation tasks. Ablations show that residual probing and distribution-aware replay are key to collecting deployment-aligned data that improves both seen and unseen tasks, offering a scalable path toward self-improving VLA models.

Real-DRL: Teach and Learn in Reality

Authors: Yanbing Mao, Yihao Cai, Lui Sha
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.00112
Pdf link: https://arxiv.org/pdf/2511.00112
Abstract This paper introduces the Real-DRL framework for safety-critical autonomous systems, enabling runtime learning of a deep reinforcement learning (DRL) agent to develop safe and high-performance action policies in real plants (i.e., real physical systems to be controlled), while prioritizing safety! The Real-DRL consists of three interactive components: a DRL-Student, a PHY-Teacher, and a Trigger. The DRL-Student is a DRL agent that innovates in the dual self-learning and teaching-to-learn paradigm and the real-time safety-informed batch sampling. On the other hand, PHY-Teacher is a physics-model-based design of action policies that focuses solely on safety-critical functions. PHY-Teacher is novel in its real-time patch for two key missions: i) fostering the teaching-to-learn paradigm for DRL-Student and ii) backing up the safety of real plants. The Trigger manages the interaction between the DRL-Student and the PHY-Teacher. Powered by the three interactive components, the Real-DRL can effectively address safety challenges that arise from the unknown unknowns and the Sim2Real gap. Additionally, Real-DRL notably features i) assured safety, ii) automatic hierarchy learning (i.e., safety-first learning and then high-performance learning), and iii) safety-informed batch sampling to address the learning experience imbalance caused by corner cases. Experiments with a real quadruped robot, a quadruped robot in NVIDIA Isaac Gym, and a cart-pole system, along with comparisons and ablation studies, demonstrate the Real-DRL's effectiveness and unique features.

End-to-End Framework Integrating Generative AI and Deep Reinforcement Learning for Autonomous Ultrasound Scanning

Authors: Hanae Elmekki, Amanda Spilkin, Ehsan Zakeri, Antonela Mariel Zanuttini, Ahmed Alagha, Hani Sami, Jamal Bentahar, Lyes Kadem, Wen-Fang Xie, Philippe Pibarot, Rabeb Mizouni, Hadi Otrok, Azzam Mourad, Sami Muhaidat
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.00114
Pdf link: https://arxiv.org/pdf/2511.00114
Abstract Cardiac ultrasound (US) is among the most widely used diagnostic tools in cardiology for assessing heart health, but its effectiveness is limited by operator dependence, time constraints, and human error. The shortage of trained professionals, especially in remote areas, further restricts access. These issues underscore the need for automated solutions that can ensure consistent, and accessible cardiac imaging regardless of operator skill or location. Recent progress in artificial intelligence (AI), especially in deep reinforcement learning (DRL), has gained attention for enabling autonomous decision-making. However, existing DRL-based approaches to cardiac US scanning lack reproducibility, rely on proprietary data, and use simplified models. Motivated by these gaps, we present the first end-to-end framework that integrates generative AI and DRL to enable autonomous and reproducible cardiac US scanning. The framework comprises two components: (i) a conditional generative simulator combining Generative Adversarial Networks (GANs) with Variational Autoencoders (VAEs), that models the cardiac US environment producing realistic action-conditioned images; and (ii) a DRL module that leverages this simulator to learn autonomous, accurate scanning policies. The proposed framework delivers AI-driven guidance through expert-validated models that classify image type and assess quality, supports conditional generation of realistic US images, and establishes a reproducible foundation extendable to other organs. To ensure reproducibility, a publicly available dataset of real cardiac US scans is released. The solution is validated through several experiments. The VAE-GAN is benchmarked against existing GAN variants, with performance assessed using qualitative and quantitative approaches, while the DRL-based scanning system is evaluated under varying configurations to demonstrate effectiveness.

LC-Opt: Benchmarking Reinforcement Learning and Agentic AI for End-to-End Liquid Cooling Optimization in Data Centers

Authors: Avisek Naug, Antonio Guillen, Vineet Kumar, Scott Greenwood, Wesley Brewer, Sahand Ghorbanpour, Ashwin Ramesh Babu, Vineet Gundecha, Ricardo Luna Gutierrez, Soumyendu Sarkar
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2511.00116
Pdf link: https://arxiv.org/pdf/2511.00116
Abstract Liquid cooling is critical for thermal management in high-density data centers with the rising AI workloads. However, machine learning-based controllers are essential to unlock greater energy efficiency and reliability, promoting sustainability. We present LC-Opt, a Sustainable Liquid Cooling (LC) benchmark environment, for reinforcement learning (RL) control strategies in energy-efficient liquid cooling of high-performance computing (HPC) systems. Built on the baseline of a high-fidelity digital twin of Oak Ridge National Lab's Frontier Supercomputer cooling system, LC-Opt provides detailed Modelica-based end-to-end models spanning site-level cooling towers to data center cabinets and server blade groups. RL agents optimize critical thermal controls like liquid supply temperature, flow rate, and granular valve actuation at the IT cabinet level, as well as cooling tower (CT) setpoints through a Gymnasium interface, with dynamic changes in workloads. This environment creates a multi-objective real-time optimization challenge balancing local thermal regulation and global energy efficiency, and also supports additional components like a heat recovery unit (HRU). We benchmark centralized and decentralized multi-agent RL approaches, demonstrate policy distillation into decision and regression trees for interpretable control, and explore LLM-based methods that explain control actions in natural language through an agentic mesh architecture designed to foster user trust and simplify system management. LC-Opt democratizes access to detailed, customizable liquid cooling models, enabling the ML community, operators, and vendors to develop sustainable data center liquid cooling control solutions.
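
As a rough illustration of what control through a Gymnasium-style interface looks like, the sketch below follows the Gymnasium convention that `reset` returns `(observation, info)` and `step` returns `(observation, reward, terminated, truncated, info)`. Everything else is an assumption made for illustration: `ToyCoolingEnv`, its one-line thermal model, the constants, and `proportional_policy` are hypothetical and are not taken from LC-Opt, whose models are Modelica-based.

```python
class ToyCoolingEnv:
    """Hypothetical single-cabinet model; not LC-Opt's actual environment."""

    def __init__(self, target_temp=30.0, horizon=50):
        self.target_temp = target_temp
        self.horizon = horizon
        self.temp = 40.0
        self.t = 0

    def reset(self, seed=None):
        # Gymnasium convention: reset returns (observation, info).
        self.temp = 40.0
        self.t = 0
        return self.temp, {}

    def step(self, flow_rate):
        # Action: normalized coolant flow rate, clipped to [0, 1].
        flow_rate = min(max(flow_rate, 0.0), 1.0)
        heat_in = 2.0  # made-up constant workload heat per step
        heat_out = 0.2 * flow_rate * (self.temp - 20.0)  # coolant assumed at 20 C
        self.temp += heat_in - heat_out
        self.t += 1
        # Penalize deviation from the target temperature and pumping effort.
        reward = -abs(self.temp - self.target_temp) - 0.5 * flow_rate
        terminated = self.temp > 80.0      # overheating ends the episode
        truncated = self.t >= self.horizon  # time limit reached
        return self.temp, reward, terminated, truncated, {}


def proportional_policy(temp, target=30.0, gain=0.1, bias=0.5):
    """Rule-based baseline: more flow when the cabinet runs hotter."""
    return min(max(bias + gain * (temp - target), 0.0), 1.0)


env = ToyCoolingEnv()
obs, info = env.reset(seed=0)
total_reward, steps = 0.0, 0
terminated = truncated = False
while not (terminated or truncated):
    action = proportional_policy(obs)
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    steps += 1
print(f"steps={steps} final_temp={obs:.2f} return={total_reward:.2f}")
```

An RL agent would replace `proportional_policy` with a learned policy while keeping the same loop.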

DCcluster-Opt: Benchmarking Dynamic Multi-Objective Optimization for Geo-Distributed Data Center Workloads

Authors: Antonio Guillen-Perez, Avisek Naug, Vineet Gundecha, Sahand Ghorbanpour, Ricardo Luna Gutierrez, Ashwin Ramesh Babu, Munther Salim, Shubhanker Banerjee, Eoin H. Oude Essink, Damien Fay, Soumyendu Sarkar
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2511.00117
Pdf link: https://arxiv.org/pdf/2511.00117
Abstract The increasing energy demands and carbon footprint of large-scale AI require intelligent workload management in globally distributed data centers. Yet progress is limited by the absence of benchmarks that realistically capture the interplay of time-varying environmental factors (grid carbon intensity, electricity prices, weather), detailed data center physics (CPUs, GPUs, memory, HVAC energy), and geo-distributed network dynamics (latency and transmission costs). To bridge this gap, we present DCcluster-Opt: an open-source, high-fidelity simulation benchmark for sustainable, geo-temporal task scheduling. DCcluster-Opt combines curated real-world datasets, including AI workload traces, grid carbon intensity, electricity markets, weather across 20 global regions, cloud transmission costs, and empirical network delay parameters with physics-informed models of data center operations, enabling rigorous and reproducible research in sustainable computing. It presents a challenging scheduling problem where a top-level coordinating agent must dynamically reassign or defer tasks that arrive with resource and service-level agreement requirements across a configurable cluster of data centers to optimize multiple objectives. The environment also models advanced components such as heat recovery. A modular reward system enables an explicit study of trade-offs among carbon emissions, energy costs, service level agreements, and water use. It provides a Gymnasium API with baseline controllers, including reinforcement learning and rule-based strategies, to support reproducible ML research and a fair comparison of diverse algorithms. By offering a realistic, configurable, and accessible testbed, DCcluster-Opt accelerates the development and validation of next-generation sustainable computing solutions for geo-distributed data centers.
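
One common way to build a modular reward of the kind this abstract mentions is a weighted sum of per-objective cost terms, so that changing the weights exposes the trade-offs. The sketch below assumes that form; `StepMetrics`, `modular_reward`, the field names, and all numbers are hypothetical and do not come from DCcluster-Opt.

```python
from dataclasses import dataclass


@dataclass
class StepMetrics:
    """Hypothetical per-step measurements for one scheduling decision."""
    carbon_kg: float
    energy_cost_usd: float
    sla_violations: int
    water_liters: float


def modular_reward(metrics, weights):
    """Negative weighted sum of cost terms; objectives without a weight are ignored."""
    terms = {
        "carbon": metrics.carbon_kg,
        "energy_cost": metrics.energy_cost_usd,
        "sla": float(metrics.sla_violations),
        "water": metrics.water_liters,
    }
    return -sum(weights.get(name, 0.0) * value for name, value in terms.items())


step = StepMetrics(carbon_kg=12.0, energy_cost_usd=30.0, sla_violations=1, water_liters=50.0)

# Two made-up weightings: carbon only, and a blend of all four objectives.
carbon_focused = modular_reward(step, {"carbon": 1.0})
balanced = modular_reward(
    step, {"carbon": 0.5, "energy_cost": 0.1, "sla": 5.0, "water": 0.01}
)
print(f"carbon_focused={carbon_focused:.2f} balanced={balanced:.2f}")
```

Scoring the same step under several weightings, as above, is one simple way to compare how different objective priorities would rank a scheduling decision.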

A Dual Large Language Models Architecture with Herald Guided Prompts for Parallel Fine Grained Traffic Signal Control

Authors: Qing Guo, Xinhang Li, Junyu Chen, Zheng Guo, Xiaocong Li, Lin Zhang, Lei Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.00136
Pdf link: https://arxiv.org/pdf/2511.00136
Abstract Leveraging large language models (LLMs) in traffic signal control (TSC) improves optimization efficiency and interpretability compared to traditional reinforcement learning (RL) methods. However, existing LLM-based approaches are limited by fixed time signal durations and are prone to hallucination errors, while RL methods lack robustness in signal timing decisions and suffer from poor generalization. To address these challenges, this paper proposes HeraldLight, a dual LLMs architecture enhanced by Herald guided prompts. The Herald Module extracts contextual information and forecasts queue lengths for each traffic phase based on real-time conditions. The first LLM, LLM-Agent, uses these forecasts to make fine grained traffic signal control, while the second LLM, LLM-Critic, refines LLM-Agent's outputs, correcting errors and hallucinations. These refined outputs are used for score-based fine-tuning to improve accuracy and robustness. Simulation experiments using CityFlow on real world datasets covering 224 intersections in Jinan (12), Hangzhou (16), and New York (196) demonstrate that HeraldLight outperforms state of the art baselines, achieving a 20.03% reduction in average travel time across all scenarios and a 10.74% reduction in average queue length on the Jinan and Hangzhou scenarios. The source code is available on GitHub: this https URL.

Study on Supply Chain Finance Decision-Making Model and Enterprise Economic Performance Prediction Based on Deep Reinforcement Learning

Authors: Shiman Zhang, Jinghan Zhou, Zhoufan Yu, Ningai Leng
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.00166
Pdf link: https://arxiv.org/pdf/2511.00166
Abstract To improve decision-making and planning efficiency in back-end centralized redundant supply chains, this paper proposes a decision model integrating deep learning with intelligent particle swarm optimization. A distributed node deployment model and optimal planning path are constructed for the supply chain network. Deep learning such as convolutional neural networks extracts features from historical data, and linear programming captures high-order statistical features. The model is optimized using fuzzy association rule scheduling and deep reinforcement learning, while neural networks fit dynamic changes. A hybrid mechanism of "deep learning feature extraction - intelligent particle swarm optimization" guides global optimization and selects optimal decisions for adaptive control. Simulations show reduced resource consumption, enhanced spatial planning, and in dynamic environments improved real-time decision adjustment, distribution path optimization, and robust intelligent control.
中文摘要 为了提高后端集中式冗余供应链的决策和规划效率，提出了一种深度学习与智能粒子群优化相结合的决策模型。构建供应链网络分布式节点部署模型和最优规划路径。卷积神经网络等深度学习从历史数据中提取特征，线性规划捕获高阶统计特征。该模型使用模糊关联规则调度和深度强化学习进行优化，而神经网络则拟合动态变化。“深度学习特征提取-智能粒子群优化”的混合机制指导全局优化，选择最优决策进行自适应控制。仿真显示，资源消耗减少，空间规划增强，动态环境下实时决策调整、分布路径优化和稳健智能控制得到改进。

Iterative Foundation Model Fine-Tuning on Multiple Rewards

多重奖励的迭代基础模型微调

Authors: Pouya M. Ghari, Simone Sciabola, Ye Wang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.00220
Pdf link: https://arxiv.org/pdf/2511.00220
Abstract Fine-tuning foundation models has emerged as a powerful approach for generating objects with specific desired properties. Reinforcement learning (RL) provides an effective framework for this purpose, enabling models to generate outputs that maximize a given reward function. However, in many applications such as text generation and drug discovery, it can be suboptimal to optimize using a single reward signal, as multiple evaluation criteria are often necessary. This paper proposes a novel reinforcement learning-based method for fine-tuning foundation models using multiple reward signals. By employing an iterative fine-tuning strategy across these rewards, our approach generalizes state-of-the-art RL-based methods. We further provide a theoretical analysis that offers insights into the performance of multi-reward RL fine-tuning. Experimental results across diverse domains including text, biological sequence, and small molecule generation, demonstrate the effectiveness of the proposed algorithm compared to state-of-the-art baselines.
中文摘要 微调基础模型已成为生成具有特定所需属性的对象的强大方法。强化学习（RL）为此目的提供了一个有效的框架，使模型能够生成最大化给定奖励函数的输出。然而，在文本生成和药物发现等许多应用中，使用单个奖励信号进行优化可能不是最佳选择，因为通常需要多个评估标准。该文提出了一种基于强化学习的新方法，用于利用多个奖励信号对基础模型进行微调。通过对这些奖励采用迭代微调策略，我们的方法推广了最先进的基于 RL 的方法。我们进一步提供了理论分析，为多奖励 RL 微调的性能提供了见解。跨文本、生物序列和小分子生成等不同领域的实验结果证明了所提出的算法与最先进的基线相比的有效性。

Consistently Simulating Human Personas with Multi-Turn Reinforcement Learning

通过多轮强化学习一致地模拟人类角色

Authors: Marwa Abdulhai, Ryan Cheng, Donovan Clay, Tim Althoff, Sergey Levine, Natasha Jaques
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.00222
Pdf link: https://arxiv.org/pdf/2511.00222
Abstract Large Language Models (LLMs) are increasingly used to simulate human users in interactive settings such as therapy, education, and social role-play. While these simulations enable scalable training and evaluation of AI agents, off-the-shelf LLMs often drift from their assigned personas, contradict earlier statements, or abandon role-appropriate behavior. We introduce a unified framework for evaluating and improving persona consistency in LLM-generated dialogue. We define three automatic metrics: prompt-to-line consistency, line-to-line consistency, and Q&A consistency, that capture different types of persona drift and validate each against human annotations. Using these metrics as reward signals, we apply multi-turn reinforcement learning to fine-tune LLMs for three user roles: a patient, a student, and a social chat partner. Our method reduces inconsistency by over 55%, resulting in more coherent and faithful simulated users.
中文摘要 大型语言模型（LLM）越来越多地用于在治疗、教育和社交角色扮演等交互环境中模拟人类用户。虽然这些模拟可以对人工智能代理进行可扩展的训练和评估，但现成的 LLM 通常会偏离其分配的角色，与之前的陈述相矛盾，或者放弃适合角色的行为。我们引入了一个统一的框架，用于评估和改进 LLM 生成的对话中的角色一致性。我们定义了三个自动指标：提示到行一致性、行到行一致性和问答一致性，用于捕获不同类型的角色漂移，并根据人工标注验证每个指标。使用这些指标作为奖励信号，我们应用多轮强化学习来微调三个用户角色的 LLM：患者、学生和社交聊天伙伴。我们的方法将不一致性减少了 55% 以上，从而产生了更连贯和忠实的模拟用户。

Improving the Robustness of Control of Chaotic Convective Flows with Domain-Informed Reinforcement Learning

利用域知情强化学习提高混沌对流控制的鲁棒性

Authors: Michiel Straat, Thorben Markmann, Sebastian Peitz, Barbara Hammer
Subjects: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
Arxiv link: https://arxiv.org/abs/2511.00272
Pdf link: https://arxiv.org/pdf/2511.00272
Abstract Chaotic convective flows arise in many real-world systems, such as microfluidic devices and chemical reactors. Stabilizing these flows is highly desirable but remains challenging, particularly in chaotic regimes where conventional control methods often fail. Reinforcement Learning (RL) has shown promise for control in laminar flow settings, but its ability to generalize and remain robust under chaotic and turbulent dynamics is not well explored, despite being critical for real-world deployment. In this work, we improve the practical feasibility of RL-based control of such flows focusing on Rayleigh-Bénard Convection (RBC), a canonical model for convective heat transport. To enhance generalization and sample efficiency, we introduce domain-informed RL agents that are trained using Proximal Policy Optimization across diverse initial conditions and flow regimes. We incorporate domain knowledge in the reward function via a term that encourages Bénard cell merging, as an example of a desirable macroscopic property. In laminar flow regimes, the domain-informed RL agents reduce convective heat transport by up to 33%, and in chaotic flow regimes, they still achieve a 10% reduction, which is significantly better than the conventional controllers used in practice. We compare the domain-informed to uninformed agents: Our results show that the domain-informed reward design results in steady flows, faster convergence during training, and generalization across flow regimes without retraining. Our work demonstrates that elegant domain-informed priors can greatly enhance the robustness of RL-based control of chaotic flows, bringing real-world deployment closer.
中文摘要 混沌对流出现在许多现实世界的系统中，例如微流体设备和化学反应器。稳定这些流动是非常可取的，但仍然具有挑战性，特别是在传统控制方法经常失败的混沌状态下。强化学习（RL）已显示出在层流环境中进行控制的前景，但尽管对实际部署至关重要，其在混沌和湍流动力学下泛化和保持稳健的能力尚未得到充分探索。在这项工作中，我们提高了基于 RL 的此类流动控制的实际可行性，重点关注 Rayleigh-Bénard 对流（RBC），这是一种对流热传输的规范模型。为了提高泛化和样本效率，我们引入了域知情的 RL 智能体，这些智能体在不同的初始条件和流动状态下使用近端策略优化进行训练。我们通过一个鼓励 Bénard 胞合并的项将领域知识纳入奖励函数中，作为理想宏观属性的一个例子。在层流状态下，域知情 RL 智能体可减少高达 33% 的对流热传输，而在混沌流状态下，它们仍可减少 10%，明显优于实际使用的常规控制器。我们将域知情智能体与不知情智能体进行了比较：我们的结果表明，域知情奖励设计带来稳定的流动、训练期间更快的收敛以及跨流动状态的泛化，而无需重新训练。我们的工作表明，优雅的域知情先验可以极大地增强基于 RL 的混沌流控制的鲁棒性，从而更接近现实世界的部署。

Reinforcement Learning for Resource Allocation in Vehicular Multi-Fog Computing

车载多雾计算中资源分配的强化学习

Authors: Mohammad Hadi Akbarzadeh, Mahmood Ahmadi, Mohammad Saeed Jahangiry, Jae Young Hur
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2511.00276
Pdf link: https://arxiv.org/pdf/2511.00276
Abstract The exponential growth of Internet of Things (IoT) devices, smart vehicles, and latency-sensitive applications has created an urgent demand for efficient distributed computing paradigms. Multi-Fog Computing (MFC), as an extension of fog and edge computing, deploys multiple fog nodes near end users to reduce latency, enhance scalability, and ensure Quality of Service (QoS). However, resource allocation in MFC environments is highly challenging due to dynamic vehicular mobility, heterogeneous resources, and fluctuating workloads. Traditional optimization-based methods often fail to adapt to such dynamics. Reinforcement Learning (RL), as a model-free decision-making framework, enables adaptive task allocation by continuously interacting with the environment. This paper formulates the resource allocation problem in MFC as a Markov Decision Process (MDP) and investigates the application of RL algorithms such as Q-learning, Deep Q-Networks (DQN), and Actor-Critic. We present experimental results demonstrating improvements in latency, workload balance, and task success rate. The contributions and novelty of this study are also discussed, highlighting the role of RL in addressing emerging vehicular computing challenges.
中文摘要 物联网（IoT）设备、智能汽车和延迟敏感型应用呈指数级增长，对高效的分布式计算范式产生了迫切的需求。多雾计算（MFC）作为雾和边缘计算的延伸，在终端用户附近部署多个雾节点，以减少延迟、增强扩展性并保证服务质量（QoS）。然而，由于动态的车辆流动性、异构资源和波动的工作负载，MFC 环境中的资源分配极具挑战性。传统的基于优化的方法往往无法适应这种动态。强化学习（Reinforcement Learning，RL）作为一种无模型的决策框架，通过与环境的持续交互，实现自适应任务分配。该文将MFC中的资源分配问题表述为马尔可夫决策过程（MDP），并研究了Q学习、深度Q网络（DQN）和Actor-Critic等RL算法的应用。我们提供了实验结果，证明了延迟、工作负载平衡和任务成功率的改进。还讨论了这项研究的贡献和新颖性，强调了 RL 在应对新出现的车辆计算挑战方面的作用。
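
The abstract above names Q-learning as one of the RL algorithms applied to the MDP formulation. As a rough illustration only, the sketch below shows the standard tabular Q-learning update for choosing a fog node for a task. The state labels, the reward (negative latency in milliseconds), the learning rate, and the discount factor are hypothetical choices made for this example and are not taken from the paper.

```python
import random
from collections import defaultdict


def q_update(q, state, action, reward, next_state, n_actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step:
    Q(s, a) += alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))."""
    best_next = max(q[(next_state, a)] for a in range(n_actions))
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


def choose_node(q, state, n_actions, epsilon=0.1, rng=random):
    """Epsilon-greedy choice of a fog node index (ties go to the lowest index)."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    return max(range(n_actions), key=lambda a: q[(state, a)])


# Hypothetical transition: under "low_load", sending the task to node 1
# took 20 ms, so the assumed reward is -20.0.
q_table = defaultdict(float)
new_value = q_update(q_table, "low_load", 1, -20.0, "high_load", n_actions=3)
greedy_node = choose_node(q_table, "low_load", n_actions=3, epsilon=0.0)
```

After this single update the value of node 1 in state "low_load" drops below the untouched nodes, so the greedy choice moves away from it. DQN and Actor-Critic, also named in the abstract, replace the table with function approximators.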

Who Can We Trust? Scope-Aware Video Moment Retrieval with Multi-Agent Conflict

我们可以信任谁？具有多代理冲突的范围感知视频时刻检索

Authors: Chaochen Wu, Guan Luo, Meiyun Zuo, Zhitao Fan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.00370
Pdf link: https://arxiv.org/pdf/2511.00370
Abstract Video moment retrieval uses a text query to locate a moment from a given untrimmed video reference. Locating corresponding video moments with text queries helps people interact with videos efficiently. Current solutions for this task have not considered conflicts among the localization results of different models, so the models cannot be integrated correctly to produce better results. This study introduces a reinforcement learning-based video moment retrieval model that can scan the whole video once to find the moment's boundary while producing its locational evidence. Moreover, we propose a multi-agent system framework that can use evidential learning to resolve conflicts between agents' localization outputs. As a by-product of observing and dealing with conflicts between agents, we can decide whether a query has no corresponding moment in a video (out-of-scope) without additional training, which is suitable for real-world applications. Extensive experiments on benchmark datasets show the effectiveness of our proposed methods compared with state-of-the-art approaches. Furthermore, the results of our study reveal that modeling competition and conflict in the multi-agent system is an effective way to improve RL performance in moment retrieval and show the new role of evidential learning in the multi-agent framework.
中文摘要 视频时刻检索使用文本查询从给定的未修剪视频引用中查找时刻。通过文本查询定位相应的视频时刻有助于人们有效地与视频互动。当前针对此任务的解决方案没有考虑来自不同模型的位置结果之间的冲突，因此各种模型无法正确集成以产生更好的结果。本研究引入了一种基于强化学习的视频瞬间检索模型，该模型可以扫描整个视频一次，找到瞬间的边界，同时产生其位置证据。此外，我们提出了一种多智能体系统框架，该框架可以使用证据学习来解决智能体定位输出之间的冲突。作为观察和处理智能体之间冲突的副产品，我们可以在没有额外训练的情况下决定查询在视频中是否没有对应的时刻（超出范围），这适用于现实世界的应用。在基准数据集上的大量实验表明，与最先进的方法相比，我们提出的方法具有有效性。此外，研究结果表明，对多智能体系统的竞争和冲突进行建模是提高RL在瞬间检索中性能的有效途径，并展示了证据学习在多智能体框架中的新作用。

Rethinking Facial Expression Recognition in the Era of Multimodal Large Language Models: Benchmark, Datasets, and Beyond

重新思考多模态大语言模型时代的面部表情识别：基准、数据集及其他

Authors: Fan Zhang, Haoxuan Li, Shengju Qian, Xin Wang, Zheng Lian, Hao Wu, Zhihong Zhu, Yuan Gao, Qiankun Li, Yefeng Zheng, Zhouchen Lin, Pheng-Ann Heng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.00389
Pdf link: https://arxiv.org/pdf/2511.00389
Abstract Multimodal Large Language Models (MLLMs) have revolutionized numerous research fields, including computer vision and affective computing. As a pivotal challenge in this interdisciplinary domain, facial expression recognition (FER) has evolved from separate, domain-specific models to more unified approaches. One promising avenue to unify FER tasks is converting conventional FER datasets into visual question-answering (VQA) formats, enabling the direct application of powerful generalist MLLMs for inference. However, despite the success of cutting-edge MLLMs in various tasks, their performance on FER tasks remains largely unexplored. To address this gap, we provide FERBench, a systematic benchmark that incorporates 20 state-of-the-art MLLMs across four widely used FER datasets. Our results reveal that, while MLLMs exhibit good classification performance, they still face significant limitations in reasoning and interpretability. To this end, we introduce post-training strategies aimed at enhancing the facial expression reasoning capabilities of MLLMs. Specifically, we curate two high-quality and large-scale datasets: UniFER-CoT-230K for cold-start initialization and UniFER-RLVR-360K for reinforcement learning with verifiable rewards (RLVR), respectively. Building upon them, we develop a unified and interpretable FER foundation model termed UniFER-7B, which outperforms many open-sourced and closed-source generalist MLLMs (e.g., Gemini-2.5-Pro and Qwen2.5-VL-72B).
中文摘要 多模态大型语言模型（MLLM）彻底改变了众多研究领域，包括计算机视觉和情感计算。作为这一跨学科领域的关键挑战，面部表情识别（FER）已从独立的、特定领域的模型发展成为更加统一的方法。统一 FER 任务的一种有前途的途径是将传统的 FER 数据集转换为视觉问答（VQA）格式，从而能够直接应用强大的通用 MLLM 进行推理。然而，尽管尖端的 MLLM 在各种任务中取得了成功，但它们在 FER 任务上的性能在很大程度上仍未得到探索。为了解决这一差距，我们提供了 FERBench，这是一个系统基准测试，其中包含四个广泛使用的 FER 数据集上的 20 个最先进的 MLLM。我们的结果表明，虽然 MLLM 表现出良好的分类性能，但它们在推理和可解释性方面仍面临重大局限性。为此，我们引入了旨在增强 MLLM 面部表情推理能力的后训练策略。具体来说，我们构建了两个高质量的大规模数据集：UniFER-CoT-230K 用于冷启动初始化，UniFER-RLVR-360K 用于具有可验证奖励的强化学习（RLVR）。在此基础上，我们开发了一个统一且可解释的 FER 基础模型，称为 UniFER-7B，它的性能优于许多开源和闭源通用 MLLM（例如 Gemini-2.5-Pro 和 Qwen2.5-VL-72B）。

VinciCoder: Unifying Multimodal Code Generation via Coarse-to-fine Visual Reinforcement Learning

VinciCoder：通过粗到细的视觉强化学习统一多模态代码生成

Authors: Xuanle Zhao, Deyang Jiang, Zhixiong Zeng, Lei Chen, Haibo Qiu, Jing Huang, Yufeng Zhong, Liming Zheng, Yilin Cao, Lin Ma
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.00391
Pdf link: https://arxiv.org/pdf/2511.00391
Abstract Multimodal code generation has garnered significant interest within the research community. Despite the notable success of recent vision-language models (VLMs) on specialized tasks like Chart-to-code generation, their reliance on single-task training regimens fosters a narrow paradigm that hinders the development of generalized VIsioN Code Intelligence. In this work, we introduce VinciCoder, a unified multimodal code generation model that addresses this limitation via a two-stage training framework. We begin by constructing a large-scale Supervised Finetuning (SFT) corpus comprising 1.6M image-code pairs for tasks involving direct code generation and visual-based code refinement. Subsequently, we introduce a Visual Reinforcement Learning (ViRL) strategy, which employs a coarse-to-fine reward mechanism to improve visual fidelity by calculating visual similarity across local and global image patches. Extensive experiments on various multimodal code generation benchmarks demonstrate that VinciCoder achieves state-of-the-art performance, underscoring the effectiveness of our coarse-to-fine ViRL strategy. The code and model will be available at this https URL.
中文摘要 多模态代码生成引起了研究界的极大兴趣。尽管最近的视觉语言模型（VLM）在图表到代码生成等专业任务上取得了显著成功，但它们对单任务训练方案的依赖助长了一种狭隘的范式，阻碍了广义 VIsioN Code Intelligence（视觉代码智能）的发展。在这项工作中，我们引入了 VinciCoder，这是一种统一的多模态代码生成模型，它通过两阶段训练框架解决了这一限制。我们首先构建一个包含 1.6M 图像代码对的大规模监督微调（SFT）语料库，用于涉及直接代码生成和基于视觉的代码细化的任务。随后，我们引入了视觉强化学习（ViRL）策略，该策略采用从粗到细的奖励机制，通过计算局部和全局图像块之间的视觉相似度来提高视觉保真度。对各种多模态代码生成基准的广泛实验表明，VinciCoder 实现了最先进的性能，凸显了我们从粗到细的 ViRL 策略的有效性。代码和模型将在此 https URL 中提供。

CoT-Saliency: Unified Chain-of-Thought Reasoning for Heterogeneous Saliency Tasks

CoT-Saliency：异构显著性任务的统一思维链推理

Authors: Long Li, Shuichen Ji, Ziyang Luo, Nian Liu, Dingwen Zhang, Junwei Han
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.00396
Pdf link: https://arxiv.org/pdf/2511.00396
Abstract We present the first unified framework that jointly handles three operationally heterogeneous saliency tasks, e.g., SOD, CoSOD, and SIS, by casting each as a Chain-of-Thought (CoT) reasoning process in a Vision-Language Model (VLM) to bridge task heterogeneity. CoT training follows a two-stage paradigm: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). To enhance CoT quality in RL, we propose Confidence-Guided Policy Optimization (CGPO), a lightweight single-sample algorithm that leverages the discrepancy between reward and model confidence as a per-sample advantage signal. This design naturally focuses updates on informative responses while eliminating group sampling, thereby addressing GRPO's key limitations: confidence-agnostic learning, signal dilution, and prohibitive computational overhead. We also introduce an "output-to-reasoning" strategy to construct high-fidelity SFT data that ensures logical consistency with ground-truth masks. Experiments show our model matches or outperforms specialized SOTA methods and strong closed-source VLMs across all tasks, especially achieving an S-measure of 0.899 on CoCA for CoSOD, surpassing the prior best by 8.0 percentage points, despite using far less training data.
中文摘要 我们提出了第一个统一的框架，它将三个操作上异构的显著性任务（例如 SOD、CoSOD 和 SIS）分别转换为视觉语言模型（VLM）中的思维链（CoT）推理过程，以弥合任务异构性，从而共同处理这些任务。CoT 训练遵循两阶段范式：监督微调（SFT）和强化学习（RL）。为了提高 RL 中的 CoT 质量，我们提出了置信度引导策略优化（CGPO），这是一种轻量级的单样本算法，它利用奖励和模型置信度之间的差异作为每个样本的优势信号。这种设计自然会将更新重点放在信息性响应上，同时消除组采样，从而解决 GRPO 的关键局限性：与置信度无关的学习、信号稀释和令人望而却步的计算开销。我们还引入了一种“输出到推理”策略来构建高保真 SFT 数据，确保与真值掩码的逻辑一致性。实验表明，我们的模型在所有任务中都达到或超过了专门的 SOTA 方法和强大的闭源 VLM，特别是在 CoSOD 的 CoCA 上实现了 0.899 的 S 度量，比之前的最佳值高出 8.0 个百分点，尽管使用的训练数据要少得多。
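
The abstract describes CGPO as using the discrepancy between reward and model confidence as a per-sample advantage, without group sampling. A minimal sketch of that idea follows. The abstract does not say how confidence is computed, so the geometric-mean token probability used here, the function names, and the assumption that rewards lie in [0, 1] are all hypothetical choices made for illustration.

```python
import math


def sequence_confidence(token_logprobs):
    """Assumed confidence measure: geometric-mean token probability, in (0, 1]."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def confidence_guided_advantage(reward, token_logprobs):
    """Per-sample advantage as the gap between the observed reward and the
    model's own confidence in the sampled response (no group baseline)."""
    return reward - sequence_confidence(token_logprobs)


# Three hypothetical single samples, each with four tokens.
confident_correct = confidence_guided_advantage(1.0, [math.log(0.9)] * 4)
unsure_correct = confidence_guided_advantage(1.0, [math.log(0.5)] * 4)
confident_wrong = confidence_guided_advantage(0.0, [math.log(0.9)] * 4)
```

Under these assumptions, a correct answer the model was unsure about receives a larger positive advantage than one it was already confident in, and a confidently wrong answer receives a strongly negative one, which is one way updates could concentrate on informative responses.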

UME-R1: Exploring Reasoning-Driven Generative Multimodal Embeddings

UME-R1：探索推理驱动的生成式多模态嵌入

Authors: Zhibin Lan, Liqiang Niu, Fandong Meng, Jie Zhou, Jinsong Su
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.00405
Pdf link: https://arxiv.org/pdf/2511.00405
Abstract The remarkable success of multimodal large language models (MLLMs) has driven advances in multimodal embeddings, yet existing models remain inherently discriminative, limiting their ability to benefit from the reasoning-driven generation paradigm. In this work, we pioneer the exploration of generative embeddings, unifying embedding tasks within a generative paradigm. We propose UME-R1, a universal multimodal embedding framework consisting of a two-stage training strategy: a cold-start supervised fine-tuning stage equips the model with reasoning capabilities and enables it to generate both discriminative and generative embeddings; a subsequent reinforcement learning stage enhances reasoning and further optimizes generative embedding quality. This pioneering work reveals four key insights: 1) generative embeddings unlock substantial performance gains over conventional discriminative embeddings by leveraging the powerful generative reasoning capabilities of MLLMs; 2) discriminative and generative embeddings are complementary, with combined oracle performance far exceeding that of either alone; 3) RL can effectively enhance generative embeddings, establishing a scalable optimization paradigm; 4) repeated sampling at inference boosts downstream task coverage (pass@k), highlighting the inference-time scalability potential of generative embeddings. Evaluated on the MMEB-V2 benchmark across 78 tasks spanning video, image, and visual documents, UME-R1 significantly outperforms conventional discriminative embedding models and offers a foundation for more interpretable, reasoning-driven generative multimodal embeddings. Our code, models, and datasets will be publicly available at this https URL.
中文摘要 多模态大型语言模型（MLLM）的巨大成功推动了多模态嵌入的进步，但现有模型本质上仍然具有辨别性，限制了它们从推理驱动的生成范式中受益的能力。在这项工作中，我们率先探索了生成嵌入，将嵌入任务统一在生成范式中。我们提出了UME-R1，这是一个通用的多模态嵌入框架，由两阶段训练策略组成：冷启动监督微调使模型具有推理能力，并使其能够生成判别和生成嵌入;随后的强化学习增强了推理能力并进一步优化了生成嵌入质量。这项开创性的工作揭示了四个关键见解：1）生成嵌入通过利用 MLLM 强大的生成推理能力，比传统判别嵌入实现了显着的性能提升;2）判别性和生成性嵌入是互补的，其组合预言机性能远远超过单独使用;3）RL可以有效增强生成式嵌入，建立可扩展的优化范式;4）推理时的重复采样提高了下游任务覆盖率（pass@k），凸显了生成嵌入的推理时间可扩展性潜力。在 MMEB-V2 基准测试中对涵盖视频、图像和视觉文档的 78 项任务进行了评估，UME-R1 的性能明显优于传统的判别嵌入模型，并为更具可解释性、推理驱动的生成式多模态嵌入奠定了基础。我们的代码、模型和数据集将在此 https URL 上公开提供。

Bootstrap Off-policy with World Model

使用世界模型的 Bootstrap Off-policy

Authors: Guojian Zhan, Likun Wang, Xiangteng Zhang, Jiaxin Gao, Masayoshi Tomizuka, Shengbo Eben Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.00423
Pdf link: https://arxiv.org/pdf/2511.00423
Abstract Online planning has proven effective in reinforcement learning (RL) for improving sample efficiency and final performance. However, using planning for environment interaction inevitably introduces a divergence between the collected data and the policy's actual behaviors, degrading both model learning and policy improvement. To address this, we propose BOOM (Bootstrap Off-policy with WOrld Model), a framework that tightly integrates planning and off-policy learning through a bootstrap loop: the policy initializes the planner, and the planner refines actions to bootstrap the policy through behavior alignment. This loop is supported by a jointly learned world model, which enables the planner to simulate future trajectories and provides value targets to facilitate policy improvement. The core of BOOM is a likelihood-free alignment loss that bootstraps the policy using the planner's non-parametric action distribution, combined with a soft value-weighted mechanism that prioritizes high-return behaviors and mitigates variability in the planner's action quality within the replay buffer. Experiments on the high-dimensional DeepMind Control Suite and Humanoid-Bench show that BOOM achieves state-of-the-art results in both training stability and final performance. The code is accessible at this https URL.
中文摘要 在线规划已被证明在强化学习（RL）中能有效提高样本效率和最终性能。然而，使用规划进行环境交互不可避免地会在收集的数据与策略的实际行为之间引入差异，从而损害模型学习和策略改进。为了解决这个问题，我们提出了 BOOM（Bootstrap Off-policy with WOrld Model），这是一个通过引导循环将规划和离策略学习紧密集成在一起的框架：策略初始化规划器，规划器细化动作，并通过行为对齐来引导策略。该循环由共同学习的世界模型支持，使规划器能够模拟未来的轨迹，并提供价值目标以促进策略改进。BOOM 的核心是一种无似然的对齐损失，它使用规划器的非参数动作分布引导策略，并结合软价值加权机制，优先考虑高回报行为并减轻重放缓冲区内规划器动作质量的可变性。在高维 DeepMind Control Suite 和 Humanoid-Bench 上的实验表明，BOOM 在训练稳定性和最终性能方面都取得了最先进的结果。该代码可通过此 https URL 访问。
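
The abstract mentions a soft value-weighted mechanism that prioritizes high-return behaviors when aligning the policy with the planner's actions. The sketch below is only one plausible reading: it weights each replayed sample by a softmax over its value estimate and uses a squared distance between the policy action and the planner action as a stand-in alignment term that needs no likelihood. The softmax weighting, the squared distance, the temperature, and all names are assumptions for illustration, not BOOM's actual loss.

```python
import math


def soft_value_weights(values, temperature=1.0):
    """Softmax over value estimates; higher-value samples get larger weights."""
    shift = max(values)  # subtract the max for numerical stability
    exps = [math.exp((v - shift) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_alignment_loss(policy_actions, planner_actions, values, temperature=1.0):
    """Value-weighted squared distance between policy and planner actions."""
    weights = soft_value_weights(values, temperature)
    loss = 0.0
    for w, a_pi, a_plan in zip(weights, policy_actions, planner_actions):
        sq_dist = sum((x - y) ** 2 for x, y in zip(a_pi, a_plan))
        loss += w * sq_dist
    return loss


# Two hypothetical replay samples with 2-D actions; the second has the higher value.
policy = [[0.0, 0.0], [1.0, 1.0]]
planner = [[1.0, 0.0], [1.0, 1.0]]
weights = soft_value_weights([0.0, 2.0])
loss = weighted_alignment_loss(policy, planner, [0.0, 2.0])
```

Here the mismatch sits on the low-value sample, so it contributes less to the loss than it would under uniform weighting.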

GraphChain: Large Language Models for Large-scale Graph Analysis via Tool Chaining

GraphChain：通过工具链进行大规模图分析的大型语言模型

Authors: Chunyu Wei, Wenji Hu, Xingjia Hao, Xin Wang, Yifan Yang, Yueguo Chen, Yang Tian, Yunhai Wang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.00457
Pdf link: https://arxiv.org/pdf/2511.00457
Abstract Large Language Models (LLMs) face significant limitations when applied to large-scale graphs, struggling with context constraints and inflexible reasoning. We present GraphChain, a framework that enables LLMs to analyze complex graphs through dynamic sequences of specialized tools, mimicking human exploratory intelligence. Our approach introduces two key innovations: (1) Progressive Graph Distillation, a reinforcement learning mechanism that generates optimized tool sequences balancing task relevance with information compression, and (2) Structure-aware Test-Time Adaptation, which efficiently tailors tool selection strategies to diverse graph topologies using spectral properties and lightweight adapters without costly retraining. Experiments show GraphChain significantly outperforms prior methods, enabling scalable and adaptive LLM-driven graph analysis.
中文摘要 大型语言模型（LLM）在应用于大规模图时面临重大限制，难以应对上下文限制和不灵活的推理。我们提出了 GraphChain，这是一个框架，使 LLM 能够通过专用工具的动态序列来分析复杂的图，模仿人类的探索智能。我们的方法引入了两项关键创新：（1）渐进式图蒸馏，一种强化学习机制，可生成优化的工具序列，平衡任务相关性与信息压缩，以及（2）结构感知测试时自适应，它使用谱属性和轻量级适配器高效地定制工具选择策略，以适应不同的图拓扑，而无需昂贵的重新训练。实验表明，GraphChain 的性能明显优于以前的方法，实现了可扩展和自适应的 LLM 驱动的图分析。

ID-Composer: Multi-Subject Video Synthesis with Hierarchical Identity Preservation

ID-Composer：具有分层身份保留的多主题视频合成

Authors: Panwang Pan, Jingjing Zhao, Yuchen Lin, Chenguo Lin, Chenxin Li, Haopeng Li, Honglei Yan, Tingting Shen, Yadong Mu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.00511
Pdf link: https://arxiv.org/pdf/2511.00511
Abstract Video generative models pretrained on large-scale datasets can produce high-quality videos, but are often conditioned on text or a single image, limiting controllability and applicability. We introduce ID-Composer, a novel framework that addresses this gap by tackling multi-subject video generation from a text prompt and reference images. This task is challenging as it requires preserving subject identities, integrating semantics across subjects and modalities, and maintaining temporal consistency. To faithfully preserve subject consistency and textual information in synthesized videos, ID-Composer designs a hierarchical identity-preserving attention mechanism, which effectively aggregates features within and across subjects and modalities. To make the generated video follow the semantics of the user's intention, we introduce semantic understanding via a pretrained vision-language model (VLM), leveraging the VLM's superior semantic understanding to provide fine-grained guidance and capture complex interactions between multiple subjects. Considering that the standard diffusion loss often fails to align critical concepts such as subject ID, we employ an online reinforcement learning phase to drive the overall training objective of ID-Composer into RLVR. Extensive experiments demonstrate that our model surpasses existing methods in identity preservation, temporal consistency, and video quality.
中文摘要 在大规模数据集上预训练的视频生成模型可以生成高质量的视频，但通常以文本或单个图像为条件，限制了可控性和适用性。我们介绍了 ID-Composer，这是一个新颖的框架，它通过处理基于文本提示和参考图像的多主体视频生成来解决这一差距。这项任务具有挑战性，因为它需要保留主体身份，整合跨主体和模态的语义，并保持时间一致性。为了忠实地保留合成视频中的主体一致性和文本信息，ID-Composer 设计了一种分层身份保留注意力机制，该机制有效地聚合了主体与模态内部及其之间的特征。为了使生成的视频在语义上遵循用户意图，我们引入了通过预训练视觉语言模型（VLM）的语义理解，利用 VLM 卓越的语义理解来提供细粒度的指导并捕获多个主体之间的复杂交互。考虑到标准扩散损失通常无法对齐主体 ID 等关键概念，我们采用在线强化学习阶段将 ID-Composer 的整体训练目标驱动到 RLVR 中。广泛的实验表明，我们的模型在身份保留、时间一致性和视频质量方面超越了现有方法。

Robust Single-Agent Reinforcement Learning for Regional Traffic Signal Control Under Demand Fluctuations

需求波动下区域交通信号控制的鲁棒单智能体强化学习

Authors: Qiang Li, Jin Niu, Lina Yu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.00549
Pdf link: https://arxiv.org/pdf/2511.00549
Abstract Traffic congestion, primarily driven by intersection queuing, significantly impacts urban living standards, safety, environmental quality, and economic efficiency. While Traffic Signal Control (TSC) systems hold potential for congestion mitigation, traditional optimization models often fail to capture real-world traffic complexity and dynamics. This study introduces a novel single-agent reinforcement learning (RL) framework for regional adaptive TSC, circumventing the coordination complexities inherent in multi-agent systems through a centralized decision-making paradigm. The model employs an adjacency matrix to unify the encoding of road network topology, real-time queue states derived from probe vehicle data, and current signal timing parameters. Leveraging the efficient learning capabilities of the DreamerV3 world model, the agent learns control policies where actions sequentially select intersections and adjust their signal phase splits to regulate traffic inflow/outflow, analogous to a feedback control system. Reward design prioritizes queue dissipation, directly linking congestion metrics (queue length) to control actions. Simulation experiments conducted in SUMO demonstrate the model's effectiveness: under inference scenarios with multi-level (10%, 20%, 30%) Origin-Destination (OD) demand fluctuations, the framework exhibits robust anti-fluctuation capability and significantly reduces queue lengths. This work establishes a new paradigm for intelligent traffic control compatible with probe vehicle technology. Future research will focus on enhancing practical applicability by incorporating stochastic OD demand fluctuations during training and exploring regional optimization mechanisms for contingency events.
中文摘要 交通拥堵主要是由十字路口排队驱动的，严重影响城市生活水平、安全、环境质量和经济效率。虽然交通信号控制（TSC）系统具有缓解拥堵的潜力，但传统的优化模型通常无法捕捉现实世界的交通复杂性和动态。本研究引入了一种用于区域自适应 TSC 的新型单智能体强化学习（RL）框架，通过集中决策范式规避了多智能体系统固有的协调复杂性。该模型采用邻接矩阵来统一编码路网拓扑、由探测车辆数据得出的实时队列状态以及当前信号配时参数。利用 DreamerV3 世界模型的高效学习能力，智能体学习控制策略，其中动作按顺序选择交叉路口并调整其信号相位分配以调节交通流入/流出，类似于反馈控制系统。奖励设计优先考虑队列消散，将拥堵指标（队列长度）直接与控制动作关联。在 SUMO 中进行的仿真实验证明了该模型的有效性：在多级（10%、20%、30%）起讫点（OD）需求波动的推理场景下，该框架表现出强大的抗波动能力，并显著减少了队列长度。这项工作为与探测车技术兼容的智能交通控制建立了新的范式。未来的研究将侧重于通过在训练期间纳入随机 OD 需求波动和探索突发事件的区域优化机制来增强实际适用性。
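
The abstract says an adjacency matrix unifies road network topology, queue states, and signal timing parameters in one encoding, but it does not give the layout. The sketch below shows one hypothetical layout for illustration: off-diagonal entry (i, j) holds the queue length on the link from intersection i to intersection j, and diagonal entry (i, i) holds a timing parameter (here a green split) for intersection i. The function name and this layout are assumptions, not the paper's definition.

```python
def encode_state(n_intersections, links, queues, green_splits):
    """Build an n x n state matrix.

    links        -- iterable of (i, j) pairs, a directed link from i to j
    queues       -- dict mapping (i, j) to the current queue length on that link
    green_splits -- list with one timing parameter per intersection
    """
    matrix = [[0.0] * n_intersections for _ in range(n_intersections)]
    for i, j in links:
        # Off-diagonal: queue on the link; missing observations default to 0.
        matrix[i][j] = queues.get((i, j), 0.0)
    for i in range(n_intersections):
        # Diagonal: current signal timing parameter of intersection i.
        matrix[i][i] = green_splits[i]
    return matrix


# Hypothetical 3-intersection corridor 0 -> 1 -> 2 with a queue only on link 0 -> 1.
state = encode_state(3, [(0, 1), (1, 2)], {(0, 1): 5.0}, [0.5, 0.4, 0.6])
```

One limitation of this simple layout is that an empty link and a missing link both appear as 0.0; a separate binary topology channel would remove that ambiguity.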

Single-agent Reinforcement Learning Model for Regional Adaptive Traffic Signal Control

面向区域自适应交通信号控制的单智能体强化学习模型

Authors: Qiang Li, Ningjing Zeng, Lina Yu
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.00551
Pdf link: https://arxiv.org/pdf/2511.00551
Abstract Several studies have employed reinforcement learning (RL) to address the challenges of regional adaptive traffic signal control (ATSC) and achieved promising results. In this field, existing research predominantly adopts multi-agent frameworks. However, the adoption of multi-agent frameworks presents challenges for scalability. Instead, the traffic signal control (TSC) problem necessitates a single-agent framework. TSC inherently relies on centralized management by a single control center, which can monitor traffic conditions across all roads in the study area and coordinate the control of all intersections. This work proposes a single-agent RL-based regional ATSC model compatible with probe vehicle technology. Key components of the RL design include state, action, and reward function definitions. To facilitate learning and manage congestion, both state and reward functions are defined based on queue length, with actions designed to regulate queue dynamics. The queue length definition used in this study differs slightly from conventional definitions but is closely correlated with congestion states. More importantly, it allows for reliable estimation using link travel time data from probe vehicles. With probe vehicle data already covering most urban roads, this feature enhances the proposed method's potential for widespread deployment. The method was comprehensively evaluated using the SUMO simulation platform. Experimental results demonstrate that the proposed model effectively mitigates large-scale regional congestion levels via coordinated multi-intersection control.
中文摘要 一些研究已经采用强化学习（RL）来应对区域自适应交通信号控制（ATSC）的挑战，并取得了可喜的成果。在该领域，现有研究主要采用多智能体框架。然而，多智能体框架的采用给可扩展性带来了挑战。相反，交通信号控制（TSC）问题需要一个单智能体框架。TSC 本质上依赖于单一控制中心的集中管理，该中心可以监控研究区域内所有道路的交通状况，并协调控制所有路口。该工作提出了一种与探测车技术兼容的基于单智能体 RL 的区域 ATSC 模型。RL 设计的关键组件包括状态、动作和奖励函数定义。为了促进学习和管理拥堵，状态和奖励函数都是根据队列长度定义的，动作则旨在调节队列动态。本研究中使用的队列长度定义与传统定义略有不同，但与拥堵状态密切相关。更重要的是，它允许使用来自探测车辆的路段行程时间数据进行可靠的估计。由于探测车辆数据已经覆盖了大多数城市道路，这一特性增强了所提出的方法广泛部署的潜力。利用 SUMO 仿真平台对该方法进行了综合评价。实验结果表明，所提模型通过协调多路口控制，有效缓解了大规模区域拥堵程度。

OpenSIR: Open-Ended Self-Improving Reasoner

OpenSIR：开放式自我改进推理器

Authors: Wai-Chung Kwan, Joshua Ong Jun Leang, Pavlos Vougiouklis, Jeff Z. Pan, Marco Valentino, Pasquale Minervini
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.00602
Pdf link: https://arxiv.org/pdf/2511.00602
Abstract Recent advances in large language model (LLM) reasoning through reinforcement learning rely on annotated datasets for verifiable rewards, which may limit models' ability to surpass human-level performance. While self-play offers a promising alternative, existing approaches depend on external verifiers or cannot learn open-endedly. We present Open-Ended Self-Improving Reasoner (OpenSIR), a self-play framework where an LLM learns to generate and solve novel problems by alternating teacher and student roles without external supervision. To generate novel problems, OpenSIR optimises for both difficulty and diversity, rewarding problems that challenge appropriately while exploring distinct concepts, enabling open-ended mathematical discovery. Starting from a single trivial seed problem, OpenSIR substantially improves instruction models: Llama-3.2-3B-Instruct advances from 73.9 to 78.3 on GSM8K, and from 28.8 to 34.4 on College Math, while Gemma-2-2B-Instruct rises from 38.5 to 58.7 on GSM8K. Our analyses reveal that OpenSIR achieves open-ended learning through co-evolving teacher-student roles that adaptively calibrate difficulty and drive diverse exploration, progressing autonomously from basic to advanced mathematics.
中文摘要 通过强化学习进行大型语言模型（LLM）推理的最新进展依赖于带注释的数据集来获得可验证的奖励，这可能会限制模型超越人类水平性能的能力。虽然自我博弈提供了一种有前途的替代方案，但现有的方法依赖于外部验证器或无法开放式学习。我们提出了开放式自我改进推理器（OpenSIR），这是一个自我博弈框架，LLM 在没有外部监督的情况下通过交替扮演教师和学生角色来学习生成和解决新问题。为了生成新问题，OpenSIR 针对难度和多样性进行了优化，奖励那些难度适当且探索不同概念的问题，从而实现开放式数学发现。从一个简单的种子问题开始，OpenSIR 极大地改进了指令模型：Llama-3.2-3B-Instruct 在 GSM8K 上从 73.9 上升到 78.3，在 College Math 上从 28.8 上升到 34.4，而 Gemma-2-2B-Instruct 在 GSM8K 上从 38.5 上升到 58.7。我们的分析表明，OpenSIR 通过师生角色的共同演化实现了开放式学习，这些角色自适应地校准难度并推动多样化的探索，从基础数学到高级数学自主发展。
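
The abstract says the teacher role is rewarded for problems that "challenge appropriately", with no external verifier. One way to picture such a signal is sketched below. It assumes, purely for illustration, that the student's solve rate is approximated by agreement with the majority answer among several sampled attempts, and that the reward peaks when this rate equals a target value. The target of 0.5, the triangular shape, and the function names are hypothetical and not taken from the paper.

```python
from collections import Counter


def estimated_solve_rate(student_answers):
    """Proxy solve rate without a reference answer: the share of sampled
    attempts that agree with the most common answer."""
    counts = Counter(student_answers)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(student_answers)


def difficulty_reward(solve_rate, target=0.5):
    """1.0 when solve_rate equals the target, falling linearly to 0.0
    at the solve rate farthest from the target."""
    if not 0.0 <= solve_rate <= 1.0:
        raise ValueError("solve_rate must lie in [0, 1]")
    widest_gap = max(target, 1.0 - target)
    return 1.0 - abs(solve_rate - target) / widest_gap


# Four hypothetical student attempts at one teacher-generated problem.
rate = estimated_solve_rate(["4", "4", "5", "4"])
reward = difficulty_reward(rate)
```

Problems that every attempt agrees on score 0.0 under this shape, which discourages trivially easy generations; a diversity term, also mentioned in the abstract, would be added separately.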

PreferThinker: Reasoning-based Personalized Image Preference Assessment

PreferThinker：基于推理的个性化图像偏好评估

Authors: Shengqi Xu, Xinpeng Zhou, Yabo Zhang, Ming Liu, Tao Liang, Tianyu Zhang, Yalong Bai, Zuxuan Wu, Wangmeng Zuo
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.00609
Pdf link: https://arxiv.org/pdf/2511.00609
Abstract Personalized image preference assessment aims to evaluate an individual user's image preferences by relying only on a small set of reference images as prior information. Existing methods mainly focus on general preference assessment, training models with large-scale data to tackle well-defined tasks such as text-image alignment. However, these approaches struggle to handle personalized preference because user-specific data are scarce and not easily scalable, and individual tastes are often diverse and complex. To overcome these challenges, we introduce a common preference profile that serves as a bridge across users, allowing large-scale user data to be leveraged for training profile prediction and capturing complex personalized preferences. Building on this idea, we propose a reasoning-based personalized image preference assessment framework that follows a predict-then-assess paradigm: it first predicts a user's preference profile from reference images, and then provides interpretable, multi-dimensional scores and assessments of candidate images based on the predicted profile. To support this, we first construct a large-scale Chain-of-Thought (CoT)-style personalized assessment dataset annotated with diverse user preference profiles and high-quality CoT-style reasoning, enabling explicit supervision of structured reasoning. Next, we adopt a two-stage training strategy: a cold-start supervised fine-tuning phase to empower the model with structured reasoning capabilities, followed by reinforcement learning to incentivize the model to explore more reasonable assessment paths and enhance generalization. Furthermore, we propose a similarity-aware prediction reward to encourage better prediction of the user's preference profile, which facilitates the exploration of more reasonable assessments. Extensive experiments demonstrate the superiority of the proposed method.
中文摘要 个性化图像偏好评估旨在仅依靠一小部分参考图像作为先验信息来评估单个用户的图像偏好。现有方法主要侧重于一般偏好评估，用大规模数据训练模型来处理定义明确的任务，例如文本-图像对齐。然而，这些方法难以处理个性化偏好，因为用户特定数据稀缺且不易扩展，而且个人品味往往多样化且复杂。为了克服这些挑战，我们引入了一种通用偏好配置文件，作为跨用户的桥梁，允许利用大规模用户数据进行训练配置文件预测并捕获复杂的个性化偏好。基于这一思路，我们提出了一个基于推理的个性化图像偏好评估框架，该框架遵循\textit{predict-then-assess}范式：它首先从参考图像中预测用户的偏好配置文件，然后根据预测的配置文件提供可解释的、多维的分数和候选图像的评估。为了支持这一点，我们首先构建了一个大规模的思维链（CoT）式个性化评估数据集，该数据集注释了不同的用户偏好画像和高质量的CoT式推理，从而能够对结构化推理进行显式监督。接下来，我们采用两阶段的训练策略：冷启动监督微调阶段，为模型赋能结构化推理能力，然后强化学习激励模型探索更合理的评估路径，增强泛化能力。此外，我们提出了一种相似性感知预测奖励，以鼓励更好地预测用户的偏好特征，从而促进更合理的评估探索。大量的实验证明了所提方法的优越性。

Ariadne: A Controllable Framework for Probing and Extending VLM Reasoning Boundaries

Ariadne：用于探测和扩展VLM推理边界的可控框架

Authors: Minghe Shen, Zhuo Zhi, Chonghan Liu, Shuo Xing, Zhengzhong Tu, Che Liu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.00710
Pdf link: https://arxiv.org/pdf/2511.00710
Abstract While Vision-Language Models (VLMs) post-trained with Reinforcement Learning (RL) show impressive general reasoning, their evaluation is often confined to language-dominant tasks (e.g., math). This raises a critical question: can RL post-training truly extend the inherent capability boundary of a base VLM, particularly for visual-centric spatial tasks where it initially fails? To investigate this, we introduce Ariadne, a framework utilizing synthetic mazes for multi-step spatial reasoning where task difficulty (e.g., path length, turns) is precisely controlled. We leverage this controllable environment to train VLMs using Reinforcement Learning with Verified Rewards (RLVR) in a difficulty-aware curriculum. Surprisingly, post-RLVR training, the VLM achieves over 50% accuracy on a problem set where the base model scored 0%, demonstrating that our approach expands the model's initial capability boundary. To assess real-world viability, we evaluate out-of-distribution (OOD) generalization on practical benchmarks. Despite training only on synthetic maze samples, Ariadne achieves significant zero-shot improvements, averaging 16% on MapBench (e.g., museum navigation) and 24% on ReasonMap (subway transfer tasks). These results confirm that our method not only broadens the model's fundamental limits but also enhances its generalization to real-world spatial reasoning. We acknowledge our study is limited to the post-training phase, given the opaqueness of pre-training data, and hope our research motivates further work on specialized, capability-extending alignment.
中文摘要 虽然使用强化学习（RL）进行后训练的视觉语言模型（VLM）显示出令人印象深刻的一般推理，但它们的评估通常仅限于语言主导的任务（例如数学）。这就提出了一个关键问题：RL 后训练能否真正扩展基础 VLM 的固有能力边界，特别是对于最初失败的以视觉为中心的空间任务？为了研究这一点，我们引入了 Ariadne，这是一个利用合成迷宫进行多步空间推理的框架，其中任务难度（例如，路径长度、转弯）被精确控制。我们利用这种可控环境在难度感知课程中使用具有验证奖励的强化学习（RLVR）来训练 VLM。令人惊讶的是，在 RLVR 训练后，VLM 在基础模型得分为 0% 的问题集上实现了超过 50% 的准确率，这表明我们的方法扩展了模型的初始能力边界。为了评估现实世界的可行性，我们在实际基准上评估了分布外（OOD）泛化。尽管仅对合成迷宫样本进行训练，但 Ariadne 实现了显着的零样本改进，在 MapBench（例如博物馆导航）上平均为 16%，在 ReasonMap（地铁换乘任务）上平均为 24%。这些结果证实，我们的方法不仅拓宽了模型的基本极限，而且增强了其对现实世界空间推理的泛化。我们承认，鉴于训练前数据的不透明性，我们的研究仅限于训练后阶段，并希望我们的研究能够激发进一步开展专业化、能力扩展对齐的工作。

Power Control Based on Multi-Agent Deep Q Network for D2D Communication

基于多智能体深度Q网络的D2D通信功率控制

Authors: Shi Gengtian, Takashi Koshimizu, Megumi Saito, Pan Zhenni, Liu Jiang, Shigeru Shimamoto
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2511.00767
Pdf link: https://arxiv.org/pdf/2511.00767
Abstract In device-to-device (D2D) communication within a cell operating in resource sharing mode, the spectrum resource utilization of the system is improved. However, if the interference generated by D2D users is not controlled, the performance of the entire system and the quality of service (QoS) of cellular users may be degraded. Power control is important because it helps to reduce interference in the system. In this paper, we propose a reinforcement learning algorithm for adaptive power control that reduces interference and thereby increases system throughput. Simulation results show that the proposed algorithm performs better than the traditional algorithm used in LTE (Long Term Evolution).
中文摘要 在蜂窝小区内采用资源共享模式的设备到设备（D2D）通信中，系统的频谱资源利用率将得到提高。但是，如果D2D用户产生的干扰不受控制，则整个系统的性能和蜂窝用户的服务质量（QoS）可能会下降。功率控制很重要，因为它有助于减少系统中的干扰。在本文中，我们提出了一种用于自适应功率控制的强化学习算法，该算法通过减少干扰来提高系统吞吐量。仿真结果表明，所提算法比LTE（Long Term Evolution）中使用的传统算法具有更好的性能。

Efficient Reinforcement Learning for Large Language Models with Intrinsic Exploration

具有内在探索的大型语言模型的高效强化学习

Authors: Yan Sun, Jia Guo, Stanley Kok, Zihao Wang, Zujie Wen, Zhiqiang Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.00794
Pdf link: https://arxiv.org/pdf/2511.00794
Abstract Reinforcement learning with verifiable rewards (RLVR) has improved the reasoning ability of large language models, yet training remains costly because many rollouts contribute little to optimization relative to the computation they require. This study investigates how simply leveraging intrinsic data properties, which come almost for free during training, can improve data efficiency for RLVR. We propose PREPO with two complementary components. First, we adopt prompt perplexity as an indicator of model adaptability in learning, enabling the model to progress from well-understood contexts to more challenging ones. Second, we amplify the discrepancy among the rollouts by differentiating their relative entropy, and prioritize sequences that exhibit a higher degree of exploration. Together, these mechanisms reduce rollout demand while preserving competitive performance. On the Qwen and Llama models, PREPO achieves effective results on mathematical reasoning benchmarks with up to 3 times fewer rollouts than the baselines. Beyond empirical gains, we provide theoretical and in-depth analyses explaining the underlying rationale of our method to improve the data efficiency of RLVR.
中文摘要 具有可验证奖励的强化学习（RLVR）提高了大型语言模型的推理能力，但训练仍然成本高昂，因为考虑到所需的计算量，许多推出对优化的贡献很小。本研究调查了如何简单地利用内在数据属性（在训练期间几乎免费受益）来提高 RLVR 的数据效率。我们提出了具有两个互补成分的 PREPO。首先，我们采用提示困惑度作为模型在学习中适应性的指标，使模型能够从易于理解的上下文发展到更具挑战性的上下文。其次，我们通过区分它们的相对熵来放大推出之间的差异，并优先考虑表现出更高程度探索的序列。这些机制共同减少了推出需求，同时保持了竞争性能。在 Qwen 和 Llama 模型上，PREPO 在数学推理基准上取得了有效结果，推出次数比基线少了 3 倍。除了经验收益之外，我们还提供了理论和深入的分析，解释了我们提高 RLVR 数据效率的方法的基本原理。
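The first PREPO component (ordering prompts from well-understood to more challenging by prompt perplexity) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the input format (per-token natural-log probabilities of each prompt under the current model), and the plain ascending sort are all assumptions made for illustration.

```python
import math


def prompt_perplexity(token_logprobs):
    """Perplexity of a prompt from its per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    # Perplexity is the exponential of the mean negative log-likelihood.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def order_by_perplexity(prompts):
    """Sort (prompt_id, token_logprobs) pairs from low to high perplexity.

    Hypothetical curriculum rule: prompts the model already models well
    (low perplexity) are scheduled before harder ones.
    """
    return sorted(prompts, key=lambda item: prompt_perplexity(item[1]))


# Toy log-probabilities; in practice these would come from the policy model.
prompts = [
    ("hard", [-2.0, -2.0]),
    ("easy", [-0.1, -0.1]),
    ("medium", [-1.0, -1.0]),
]
ordered_ids = [prompt_id for prompt_id, _ in order_by_perplexity(prompts)]
```

Here `ordered_ids` lists the prompts in the order a perplexity-based curriculum of this kind would visit them. The abstract does not say how the schedule is actually built, so the sort should be read only as one plausible reading.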

GrowthHacker: Automated Off-Policy Evaluation Optimization Using Code-Modifying LLM Agents

GrowthHacker：使用代码修改 LLM 代理进行自动策略外评估优化

Authors: Jie JW Wu, Ayanda Patrick Herlihy, Ahmad Saleem Mirza, Ali Afoud, Fatemeh Fard
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.00802
Pdf link: https://arxiv.org/pdf/2511.00802
Abstract With the software industry shifting toward a data-driven culture, online A/B testing is a key tool for evaluating new technologies. However, deploying such experiments requires substantial resources, may negatively impact users, and involves long data collection periods. To address this, off-policy evaluation (OPE), or offline A/B testing, uses logged data to assess technologies and is fundamental in Reinforcement Learning, making it crucial in domains where online testing is costly or risky, such as healthcare, recommender systems, education, dialog systems, and robotics. Despite advances in coding LLMs and agentic AI, little is known about leveraging them to optimize OPE results. We investigate whether LLMs and LLM-based agents can improve OPE performance via code optimization. We propose GrowthHacker, a benchmark with agent and baseline methods on large-scale real-world datasets, which iteratively optimizes code, evaluates results, and begins new optimization cycles. We collected datasets, established protocols, implemented baselines for OPE on the Open Bandit Pipeline (OBP)~\cite{saito2021openbanditdatasetpipeline} and Scope-RL~\cite{kiyohara2023scope}, and developed the two_agent framework, which reduces system complexity while preserving optimization effectiveness. Results show the two_agent framework achieves 100% reliability and the highest average improvement of 106.7% among positive outcomes. Both two_agent and CrewAI reach 45% success rates, outperforming AutoGen's 34%. These findings demonstrate the feasibility of LLM-based agents as automated "growth hackers" to enhance OPE systems, with implications for scaling data-driven decision-making in production.
中文摘要 随着软件行业向数据驱动的文化转变，在线 A/B 测试是评估新技术的关键工具。然而，部署此类实验需要大量资源，可能会对用户产生负面影响，并且涉及较长的数据收集周期。为了解决这个问题，off-policy evaluation（OPE），即离线 A/B 测试，使用记录的数据来评估技术，并且是强化学习的基础，这使得它在在线测试成本高昂或有风险的领域至关重要，例如医疗保健、推荐系统、教育、对话系统和机器人技术。尽管在编码 LLM 和代理 AI 方面取得了进步，但人们对利用它们来优化 OPE 结果知之甚少。我们研究了 LLM 和基于 LLM 的代理是否可以通过代码优化来提高 OPE 性能。我们提出了 GrowthHacker，这是一个在大规模真实世界数据集上使用代理和基线方法的基准测试，它迭代地优化代码、评估结果并开始新的优化周期。我们收集了数据集，建立了协议，在 Open Bandit Pipeline（OBP）~\cite{saito2021openbanditdatasetpipeline}和Scope-RL~\cite{kiyohara2023scope}上实现了OPE的基线，并开发了 two_agent 框架，该框架在保持优化效果的同时降低了系统复杂性。结果表明，two_agent 框架实现了 100% 的可靠性，在积极结果中平均提高了 106.7%。two_agent 和 CrewAI 的成功率均达到 45%，高于 AutoGen 的 34%。这些发现证明了基于 LLM 的代理作为自动化“增长黑客”增强 OPE 系统的可行性，这对在生产中扩展数据驱动的决策具有重要意义。

Logic-informed reinforcement learning for cross-domain optimization of large-scale cyber-physical systems

用于大规模信息物理系统跨域优化的逻辑知情强化学习

Authors: Guangxi Wan, Peng Zeng, Xiaoting Dong, Chunhe Song, Shijie Cui, Dong Li, Qingwei Dong, Yiyang Liu, Hongfei Bai
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.00806
Pdf link: https://arxiv.org/pdf/2511.00806
Abstract Cyber-physical systems (CPS) require the joint optimization of discrete cyber actions and continuous physical parameters under stringent safety logic constraints. However, existing hierarchical approaches often compromise global optimality, whereas reinforcement learning (RL) in hybrid action spaces often relies on brittle reward penalties, masking, or shielding and struggles to guarantee constraint satisfaction. We present logic-informed reinforcement learning (LIRL), which equips standard policy-gradient algorithms with a projection that maps a low-dimensional latent action onto the admissible hybrid manifold defined on-the-fly by first-order logic. This guarantees feasibility of every exploratory step without penalty tuning. Experimental evaluations have been conducted across multiple scenarios, including industrial manufacturing, electric vehicle charging stations, and traffic signal control, in all of which the proposed method outperforms existing hierarchical optimization approaches. Taking a robotic reducer assembly system in industrial manufacturing as an example, LIRL reduces the combined makespan-energy objective by up to 36.47% to 44.33% compared to conventional industrial hierarchical scheduling methods. Meanwhile, it consistently maintains zero constraint violations and significantly surpasses state-of-the-art hybrid-action reinforcement learning baselines. Thanks to its declarative logic-based constraint formulation, the framework can be seamlessly transferred to other domains such as smart transportation and smart grid, thereby paving the way for safe and real-time optimization in large-scale CPS.
中文摘要 网络物理系统（CPS）需要在严格的安全逻辑约束下联合优化离散网络动作和连续物理参数。然而，现有的分层方法往往会损害全局最优性，而混合动作空间中的强化学习（RL）往往依赖于脆弱的奖励惩罚、掩蔽或屏蔽，难以保证约束的满足。我们提出了逻辑知情强化学习（LIRL），它为标准策略梯度算法配备了投影，将低维潜在动作映射到由一阶逻辑动态定义的可接受混合流形上。这保证了每个探索步骤的可行性，而无需进行惩罚调整。在工业制造、电动汽车充电站和交通信号控制等多个场景中进行了实验评估，在所有这些场景中，所提方法都优于现有的分层优化方法。以工业制造中的机器人减速机装配系统为例，与传统的工业分层调度方法相比，LIRL在组合制造-能量目标上最多减少了36.47%至44.33%。同时，它始终保持零约束违规，并显着超过最先进的混合动作强化学习基线。由于其基于声明性逻辑的约束制定，该框架可以无缝转移到智能交通和智能电网等其他领域，从而为大规模CPS中的安全、实时优化铺平道路。

Do Math Reasoning LLMs Help Predict the Impact of Public Transit Events?

数学推理 LLM 是否有助于预测公共交通事件的影响？

Authors: Bowen Fang, Ruijian Zha, Xuan Di
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.00808
Pdf link: https://arxiv.org/pdf/2511.00808
Abstract Predicting public transit incident duration from unstructured text alerts is a critical but challenging task. Addressing the domain sparsity of transit operations with standard Supervised Fine-Tuning (SFT) is difficult, as the task involves noisy, continuous labels and lacks reliable expert demonstrations for reasoning. While Reinforcement Learning from Verifiable Rewards (RLVR) excels at tasks with binary correctness, like mathematics, its applicability to noisy, continuous forecasting is an open question. This work, to our knowledge, is the first to bridge the gap between RLVR LLM training and the critical, real-world forecasting challenges in public transit operations. We adapt RLVR to this task by introducing a tolerance-based, shaped reward function that grants partial credit within a continuous error margin, rather than demanding a single correct answer. We systematically evaluate this framework on a curated dataset of NYC MTA service alerts. Our findings show that general-purpose, instruction-tuned LLMs significantly outperform specialized math-reasoning models, which struggle with the ambiguous, real-world text. We empirically demonstrate that the binary reward is unstable and degrades performance, whereas our shaped reward design is critical and allows our model to dominate on the most challenging metrics. While classical regressors are superior at minimizing overall MAE or MSE, our RLVR approach achieved a 35% relative improvement in 5-minute accuracy (Acc@5) over the strongest baseline. This demonstrates that RLVR can be successfully adapted to real-world, noisy forecasting, but requires a verifier design that reflects the continuous nature of the problem.
中文摘要 从非结构化文本警报中预测公共交通事件持续时间是一项关键但具有挑战性的任务。使用标准监督微调（SFT）解决公交运营的领域稀疏性很困难，因为该任务涉及嘈杂、连续的标签，并且缺乏可靠的推理专家演示。虽然可验证奖励强化学习（RLVR）擅长数学等二元正确性的任务，但它对嘈杂、连续预测的适用性是一个悬而未决的问题。据我们所知，这项工作是第一个弥合 RLVR LLM 训练与公共交通运营中关键的现实世界预测挑战之间差距的工作。我们通过引入基于容差的整形奖励函数来使 RLVR 适应这项任务，该函数在连续误差范围内授予部分得分，而不是要求单个正确答案。我们在纽约市 MTA 服务警报的精选数据集上系统地评估了这个框架。我们的研究结果表明，通用的、经过指令调整的 LLM 的性能明显优于专门的数学推理模型，后者在处理模棱两可的现实世界文本时遇到困难。我们凭经验证明，二元奖励是不稳定的，会降低性能，而我们的整形奖励设计至关重要，允许我们的模型在最具挑战性的指标上占据主导地位。虽然经典回归模型在最小化总体 MAE 或 MSE 方面表现出色，但我们的 RLVR 方法比最强基线在 5 分钟准确度（Acc@5）方面实现了 35% 的相对提高。这表明 RLVR 可以成功地适应现实世界的嘈杂预测，但需要一个反映问题连续性质的验证器设计。
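The contrast between a binary reward and a tolerance-based shaped reward can be sketched as follows. The abstract does not give the reward's functional form, so the linear decay, the function names, and the 15-minute default tolerance are assumptions chosen purely for illustration.

```python
def binary_reward(predicted_min, actual_min):
    """All-or-nothing reward: credit only for an exact match."""
    return 1.0 if predicted_min == actual_min else 0.0


def shaped_reward(predicted_min, actual_min, tolerance_min=15.0):
    """Partial credit that decays linearly with absolute error.

    Assumed shape: 1.0 at zero error, falling to 0.0 once the error
    reaches `tolerance_min` (a hypothetical parameter).
    """
    if tolerance_min <= 0:
        raise ValueError("tolerance_min must be positive")
    error = abs(predicted_min - actual_min)
    return max(0.0, 1.0 - error / tolerance_min)


# A prediction that is 3 minutes off earns nothing under the binary reward
# but most of the credit under the shaped reward.
near_miss_binary = binary_reward(33.0, 30.0)
near_miss_shaped = shaped_reward(33.0, 30.0)
```

A reward of this shape still gives a learning signal for near misses, which is one way to read the abstract's point that a verifier for continuous targets should reflect the continuous nature of the problem.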

Equilibrium Policy Generalization: A Reinforcement Learning Framework for Cross-Graph Zero-Shot Generalization in Pursuit-Evasion Games

均衡策略泛化：在追避博弈中进行跨图零样本泛化的强化学习框架

Authors: Runyu Lu, Peng Zhang, Ruochuan Shi, Yuanheng Zhu, Dongbin Zhao, Yang Liu, Dong Wang, Cesare Alippi
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.00811
Pdf link: https://arxiv.org/pdf/2511.00811
Abstract Equilibrium learning in adversarial games is an important topic widely examined in the fields of game theory and reinforcement learning (RL). Pursuit-evasion games (PEGs), an important class of real-world games from the fields of robotics and security, require exponential time to be accurately solved. When the underlying graph structure varies, even state-of-the-art RL methods require recomputation or at least fine-tuning, which can be time-consuming and impair real-time applicability. This paper proposes an Equilibrium Policy Generalization (EPG) framework to effectively learn a generalized policy with robust cross-graph zero-shot performance. In the context of PEGs, our framework is generally applicable to both pursuer and evader sides in both no-exit and multi-exit scenarios. These two generalizability properties, to our knowledge, are the first to appear in this domain. The core idea of the EPG framework is to train an RL policy across different graph structures against the equilibrium policy for each single graph. To construct an equilibrium oracle for single-graph policies, we present a dynamic programming (DP) algorithm that provably generates a pure-strategy Nash equilibrium with near-optimal time complexity. To guarantee scalability with respect to the number of pursuers, we further extend DP and RL by designing a grouping mechanism and a sequence model for joint policy decomposition, respectively. Experimental results show that, using equilibrium guidance and a distance feature proposed for cross-graph PEG training, the EPG framework guarantees desirable zero-shot performance in various unseen real-world graphs. Besides, when trained under an equilibrium heuristic proposed for the graphs with exits, our generalized pursuer policy can even match the performance of the fine-tuned policies from state-of-the-art PEG methods.
中文摘要 对抗博弈中的均衡学习是博弈论和强化学习（RL）领域广泛研究的重要课题。追避游戏（PEG）作为一类来自机器人和安全领域的现实世界游戏，需要指数级时间才能准确求解。当底层图结构发生变化时，即使是最先进的 RL 方法也需要重新计算或至少进行微调，这可能非常耗时并损害实时适用性。该文提出了一种均衡策略泛化（EPG）框架，以有效地学习具有鲁棒的跨图零样本性能的广义策略。在 PEG 的背景下，我们的框架通常适用于无退出和多退出场景中的追捕方和逃避方。据我们所知，这两个泛化性属性是第一个出现在该领域中的。EPG 框架的核心思想是针对每个图的均衡策略跨不同图结构训练 RL 策略。为了构建单图策略的均衡预言机，我们提出了一种动态规划（DP）算法，该算法可以证明可以生成具有接近最佳时间复杂度的纯策略纳什均衡。为了保证追击者数量的可扩展性，我们通过分别设计分组机制和联合策略分解的序列模型，进一步扩展了DP和RL。实验结果表明，利用平衡引导和为跨图PEG训练提出的距离特征，EPG框架保证了在各种看不见的现实世界图中理想的零样本性能。此外，当在为具有出口的图提出的均衡启发式方法下进行训练时，我们的广义追击者策略甚至可以与最先进的 PEG 方法的微调策略的性能相匹配。

KFCPO: Kronecker-Factored Approximated Constrained Policy Optimization

KFCPO：克罗内克因数近似约束策略优化

Authors: Joonyoung Lim, Younghwan Yoo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.00880
Pdf link: https://arxiv.org/pdf/2511.00880
Abstract We propose KFCPO, a novel Safe Reinforcement Learning (Safe RL) algorithm that combines scalable Kronecker-Factored Approximate Curvature (K-FAC) based second-order policy optimization with safety-aware gradient manipulation. KFCPO leverages K-FAC to perform efficient and stable natural gradient updates by approximating the Fisher Information Matrix (FIM) in a layerwise, closed-form manner, avoiding iterative approximation overheads. To address the tradeoff between reward maximization and constraint satisfaction, we introduce a margin-aware gradient manipulation mechanism that adaptively adjusts the influence of reward and cost gradients based on the agent's proximity to safety boundaries. This method blends gradients using a direction-sensitive projection, eliminating harmful interference and avoiding abrupt changes caused by fixed hard thresholds. Additionally, a minibatch-level KL rollback strategy is adopted to ensure trust region compliance and to prevent destabilizing policy shifts. Experiments on Safety Gymnasium using OmniSafe show that KFCPO achieves 10.3% to 50.2% higher average return across environments compared to the best baseline that respected the safety constraint, demonstrating a superior balance of safety and performance.
中文摘要 我们提出了KFCPO，这是一种新型的安全强化学习（Safe RL）算法，它将基于可扩展的克罗内克因数近似曲率（K-FAC）的二阶策略优化与安全感知梯度操作相结合。KFCPO 利用 K-FAC 通过以分层、封闭的形式近似 Fisher 信息矩阵（FIM）来执行高效、稳定的自然梯度更新，避免迭代逼近开销。为了解决奖励最大化和约束满足之间的权衡，我们引入了一种边距感知梯度操作机制，该机制根据智能体与安全边界的接近程度自适应地调整奖励和成本梯度的影响。该方法使用方向敏感投影混合梯度，消除有害干扰并避免由固定硬阈值引起的突然变化。此外，还采用小批量级 KL 回滚策略来确保信任区域合规性并防止不稳定的策略转变。使用 OmniSafe 在 Safety Gymnasium 上进行的实验表明，与遵守安全约束的最佳基线相比，KFCPO 在各种环境中的平均回报率提高了 10.3% 至 50.2%，展示了安全性和性能的卓越平衡。

Optimizing Energy and Latency in 6G Smart Cities with Edge CyberTwins

利用 Edge CyberTwins 优化 6G 智慧城市的能源和延迟

Authors: Abouaomar, Badr Ben Elallid, Nabil Benamar
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2511.00955
Pdf link: https://arxiv.org/pdf/2511.00955
Abstract The proliferation of IoT devices in smart cities challenges 6G networks with conflicting energy-latency requirements across heterogeneous slices. Existing approaches struggle with the energy-latency trade-off, particularly for massive-scale deployments exceeding 50,000 devices/km². This paper proposes an edge-aware CyberTwin framework integrating hybrid federated learning for energy-latency co-optimization in 6G network slicing. Our approach combines centralized Artificial Intelligence scheduling for latency-sensitive slices with distributed federated learning for non-critical slices, enhanced by compressive sensing-based digital twins and renewable energy-aware resource allocation. The hybrid scheduler leverages a three-tier architecture with Physical Unclonable Function (PUF) based security attestation achieving 99.7% attack detection accuracy. Comprehensive simulations demonstrate 52% energy reduction for non-real-time slices compared to Diffusion-Reinforcement Learning baselines while maintaining 0.9 ms latency for URLLC applications with 99.1% SLA compliance. The framework scales to 50,000 devices/km² with CPU overhead below 25%, validated through NS-3 hybrid simulations across realistic smart city scenarios.
中文摘要 物联网设备在智慧城市中的激增给 6G 网络带来了挑战，其跨异构切片的能量延迟要求相互冲突。现有方法在能量延迟权衡方面遇到了困难，特别是对于超过每平方公里 50,000 台设备的大规模部署。该文提出了一种边缘感知的CyberTwin框架，该框架集成了混合联邦学习，用于6G网络切片中的能量-时延协同优化。我们的方法将延迟敏感切片的集中式人工智能调度与非关键切片的分布式联邦学习相结合，并通过基于压缩传感的数字孪生和可再生能源感知资源分配得到增强。混合调度器利用三层架构和基于物理不可克隆功能（PUF）的安全证明，实现 99.7% 的攻击检测准确率。综合模拟表明，与扩散强化学习基线相比，非实时切片的能耗降低了 52%，同时保持 URLLC 应用程序的 0.9 毫秒延迟，SLA 合规性为 99.1%。该框架可扩展到每平方公里 50,000 台设备，CPU 开销低于 25%，并通过 NS-3 混合模拟在现实智慧城市场景中进行了验证。

MARS-SQL: A multi-agent reinforcement learning framework for Text-to-SQL

MARS-SQL：用于文本到 SQL 的多代理强化学习框架

Authors: Haolin Yang, Jipeng Zhang, Zhitao He, Yi R. Fung
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.01008
Pdf link: https://arxiv.org/pdf/2511.01008
Abstract Translating natural language to SQL remains difficult for complex queries. Such queries often need environmental interaction and self-correction. To address this, we introduce MARS-SQL, a novel multi-agent framework that combines principled task decomposition and interactive reinforcement learning (RL). Our system comprises three specialized agents: a Grounding Agent for schema linking, a Generation Agent for query generation, and a Validation Agent for final selection. The core of our framework is the Generation Agent, which is trained via a multi-turn RL policy. Adopting a ReAct-style Think-Act-Observe loop, the agent iteratively generates thoughts, executes SQL actions against a live database, and revises its strategy based on execution feedback, enabling dynamic, stateful reasoning and self-correction. At inference time, we generate multiple interaction trajectories to explore diverse reasoning paths. The Validation Agent then selects the optimal trajectory by modeling verification as a next-token prediction task and choosing the solution with the highest generation probability. This structured workflow pipelines specialized agents. It combines interactive RL for generation with generative modeling for verification. The approach proves highly effective for robust and accurate SQL generation. Experiments show that MARS-SQL achieves state-of-the-art Execution Accuracy of 77.84% on the BIRD dev set and 89.75% on the Spider test set. Our code is available at this https URL.
中文摘要 对于复杂的查询来说，将自然语言转换为 SQL 仍然很困难。此类查询通常需要环境交互和自我纠正。为了解决这个问题，我们引入了 MARS-SQL，这是一种结合了原则性任务分解和交互式强化学习（RL）的新型多智能体框架。我们的系统由三个专用代理组成：用于模式链接的 Grounding 代理、用于查询生成的 Generation 代理和用于最终选择的 Validation 代理。我们框架的核心是 Generation 代理，它是通过多轮 RL 策略进行训练的。采用 ReAct 风格的 Think-Act-Observe 循环，代理迭代生成想法，针对实时数据库执行 SQL 操作，并根据执行反馈修改其策略，从而实现动态、有状态的推理和自我纠正。在推理时，我们生成多个交互轨迹，以探索不同的推理路径。然后，Validation 代理通过将验证建模为下一个标记预测任务并选择生成概率最高的解决方案来选择最佳轨迹。这种结构化的工作流将各专用代理串联成流水线。它将用于生成的交互式 RL 与用于验证的生成建模相结合。事实证明，该方法对于健壮且准确的 SQL 生成非常有效。实验表明，MARS-SQL 在 BIRD 开发集上实现了 77.84% 的最先进的执行准确率，在 Spider 测试集上实现了 89.75% 的执行准确率。我们的代码可在此 https URL 中找到。
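The selection step described for the Validation Agent (pick the candidate with the highest generation probability) can be sketched in a few lines. This is an assumption-laden illustration rather than the MARS-SQL code: the `select_best` name, the candidate format, and the idea that each trajectory arrives with a single verifier log-probability are all hypothetical.

```python
import math


def select_best(candidates):
    """Return the SQL text whose verifier log-probability is highest.

    `candidates` is a list of (sql_text, verifier_logprob) pairs, where the
    log-probability is assumed to be what a verifier model assigns to its
    "this solution is correct" continuation for that candidate.
    """
    if not candidates:
        raise ValueError("need at least one candidate trajectory")
    return max(candidates, key=lambda pair: pair[1])[0]


# Toy scores; a real system would obtain them from the verifier model.
candidates = [
    ("SELECT name FROM users", math.log(0.35)),
    ("SELECT name FROM users WHERE active = 1", math.log(0.80)),
    ("SELECT * FROM users", math.log(0.10)),
]
best_sql = select_best(candidates)
```

Working in log space keeps the comparison numerically stable when a score is built from many token probabilities, and the ranking is the same as ranking by raw probability.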

IF-CRITIC: Towards a Fine-Grained LLM Critic for Instruction-Following Evaluation

IF-CRITIC：迈向用于指令遵循评估的细粒度 LLM 评论家

Authors: Bosi Wen, Yilin Niu, Cunxiang Wang, Pei Ke, Xiaoying Ling, Ying Zhang, Aohan Zeng, Hongning Wang, Minlie Huang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.01014
Pdf link: https://arxiv.org/pdf/2511.01014
Abstract Instruction following is a fundamental ability of Large Language Models (LLMs), requiring their generated outputs to follow multiple constraints imposed in input instructions. Numerous studies have attempted to enhance this ability through preference optimization or reinforcement learning based on reward signals from LLM-as-a-Judge. However, existing evaluation models for instruction following still possess many deficiencies, such as substantial costs and unreliable assessments. To this end, we propose IF-CRITIC, an LLM critic that can provide efficient and reliable assessments of constraint following in the instructions. We first develop a checklist generator to decompose instructions and generate constraint checklists. With the assistance of the checklists, we collect high-quality critique training data through a multi-stage critique filtering mechanism and employ a constraint-level preference optimization method to train IF-CRITIC. Extensive experiments demonstrate that the evaluation performance of IF-CRITIC can beat strong LLM-as-a-Judge baselines, including Deepseek-R1 and o4-mini. With the scalable reward signals provided by IF-CRITIC, LLMs can achieve substantial performance gains in instruction-following optimization under lower computational overhead compared to strong LLM critic baselines.
中文摘要 指令遵循是大型语言模型（LLM）的一项基本能力，要求其生成的输出遵循输入指令中施加的多个约束。许多研究试图通过基于 LLM-as-a-Judge 的奖励信号的偏好优化或强化学习来增强这种能力。然而，现有的指导遵循评估模型仍然存在许多不足，例如成本高昂和评估不可靠。为此，我们提出了 IF-CRITIC，这是一个 LLM 评论家，可以按照说明对约束进行高效可靠的评估。我们首先开发一个清单生成器来分解指令并生成约束清单。在检查表的辅助下，我们通过多阶段的批评过滤机制收集高质量的批评训练数据，并采用约束级偏好优化方法来训练IF-CRITIC。大量实验表明，IF-CRITIC 的评估性能可以击败强大的 LLM-as-a-Judge 基线，包括 Deepseek-R1 和 o4-mini。借助 IF-CRITIC 提供的可扩展奖励信号，与强大的 LLM 批评基线相比，LLM 可以在较低的计算开销下在指令跟踪优化方面实现显着的性能提升。

Prompt-R1: Collaborative Automatic Prompting Framework via End-to-end Reinforcement Learning

Prompt-R1：通过端到端强化学习的协作自动提示框架

Authors: Wenjin Liu, Haoran Luo, Xueyuan Lin, Haoming Liu, Tiesunlong Shen, Jiapu Wang, Rui Mao, Erik Cambria
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.01016
Pdf link: https://arxiv.org/pdf/2511.01016
Abstract Recently, advanced large language models (LLMs) have emerged at an increasingly rapid pace. However, when faced with complex problems, most users are often unable to provide accurate and effective prompts to interact with LLMs, thus limiting the performance of LLMs. To address this challenge, we propose Prompt-R1, an end-to-end reinforcement learning framework that uses a small-scale LLM to collaborate with large-scale LLMs, replacing user interaction to solve problems better. This collaboration is cast as a multi-turn prompt interaction, where the small-scale LLM thinks and generates prompts, and the large-scale LLM performs complex reasoning. A dual-constrained reward is designed to optimize for correctness, generation quality, and reasoning accuracy. Prompt-R1 provides a plug-and-play framework that supports both inference and training with various large-scale LLMs. Experiments on multiple public datasets show that Prompt-R1 significantly outperforms baseline models across tasks. Our code is publicly available at this https URL.
中文摘要 最近，先进的大型语言模型（LLM）的出现速度越来越快。然而，在面对复杂问题时，大多数用户往往无法提供准确有效的提示来与LLM进行交互，从而限制了LLM的性能。为了应对这一挑战，我们提出了 Prompt-R1，这是一个端到端的强化学习框架，它使用小型 LLM 与大型 LLM 协作，取代用户交互以更好地解决问题。这种协作被铸造成多轮提示交互，其中小规模的 LLM 思考并生成提示，而大规模的 LLM 则执行复杂的推理。双重约束奖励旨在优化正确性、生成质量和推理准确性。Prompt-R1 提供了一个即插即用的框架，支持推理和训练各种大规模 LLM。在多个公共数据集上的实验表明，Prompt-R1 在任务中的性能明显优于基线模型。我们的代码在此 https URL 上公开可用。

Quantum Reinforcement Learning for 6G and Beyond Wireless Networks

适用于 6G 及更高无线网络的量子强化学习

Authors: Dinh-Hieu Tran, Thai Duong Nguyen, Thanh-Dao Nguyen, Ngoc-Tan Nguyen, Van Nhan Vo, Hung Tran, Mouhamad Chehaitly, Yan Kyaw Tun, Cedomir Stefanovic, Tu Ho Dac, Eva Lagunas, Symeon Chatzinotas, Nguyen Van Huynh
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2511.01070
Pdf link: https://arxiv.org/pdf/2511.01070
Abstract While 5G is being deployed worldwide, 6G is receiving increasing attention from researchers to meet the growing demand for higher data rates, lower latency, higher density, and seamless communications worldwide. To meet the stringent requirements of 6G wireless communications networks, AI-integrated communications have become an indispensable part of supporting 6G systems with intelligence, automation, and big data training capabilities. However, traditional artificial intelligence (AI) systems struggle to meet the stringent latency and high throughput requirements of 6G with limited resources. In this article, we summarize, analyze, and discuss the potential and benefits of Quantum Reinforcement Learning (QRL) in 6G. As an example, we show the superiority of QRL in dynamic spectrum access compared to the conventional Deep Reinforcement Learning (DRL) approach. In addition, we provide an overview of what DRL has accomplished in 6G and its challenges and limitations. From there, we introduce QRL and potential research directions that should continue to be of interest in 6G. To the best of our knowledge, this is the first review and vision article on QRL for 6G wireless communication networks.
中文摘要 虽然 5G 正在全球部署，但 6G 也越来越受到研究人员的关注，以满足全球对更高数据速率、更低延迟、更高密度和无缝通信日益增长的需求。为满足6G无线通信网络的严格要求，AI集成通信已成为支撑6G系统不可或缺的一部分，具有智能化、自动化和大数据训练能力。然而，传统的人工智能（AI）系统在资源有限的情况下难以满足6G严格的延迟和高吞吐量要求。在本文中，我们总结、分析、讨论量子强化学习（QRL）在 6G 中的潜力和优势。例如，与传统的深度强化学习（DRL）方法相比，我们展示了QRL在动态频谱接入方面的优越性。此外，我们还概述了 DRL 在 6G 方面取得的成就及其挑战和局限性。从那里，我们介绍了 QRL 和 6G 中应该继续引起人们兴趣的潜在研究方向。据我们所知，这是第一篇关于 6G 无线通信网络 QRL 的评论和愿景文章。

Predictive Auxiliary Learning for Belief-based Multi-Agent Systems

基于信念的多智能体系统的预测辅助学习

Authors: Qinwei Huang, Stefan Wang, Simon Khan, Garrett Katz, Qinru Qiu
Subjects: Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2511.01078
Pdf link: https://arxiv.org/pdf/2511.01078
Abstract The performance of multi-agent reinforcement learning (MARL) in partially observable environments depends on effectively aggregating information from observations, communications, and reward signals. While most existing multi-agent systems primarily rely on rewards as the only feedback for policy training, our research shows that introducing auxiliary predictive tasks can significantly enhance learning efficiency and stability. We propose Belief-based Predictive Auxiliary Learning (BEPAL), a framework that incorporates auxiliary training objectives to support policy optimization. BEPAL follows the centralized training with decentralized execution paradigm. Each agent learns a belief model that predicts unobservable state information, such as other agents' rewards or motion directions, alongside its policy model. By enriching hidden state representations with information that does not directly contribute to immediate reward maximization, this auxiliary learning process stabilizes MARL training and improves overall performance. We evaluate BEPAL in the predator-prey environment and Google Research Football, where it achieves an average improvement of about 16 percent in performance metrics and demonstrates more stable convergence compared to baseline methods.
中文摘要 多智能体强化学习（MARL）在部分可观测环境中的性能取决于有效地聚合来自观察、通信和奖励信号的信息。虽然大多数现有的多智能体系统主要依赖奖励作为策略训练的唯一反馈，但我们的研究表明，引入辅助预测任务可以显着提高学习效率和稳定性。我们提出了基于信念的预测辅助学习（BEPAL），这是一个包含辅助训练目标以支持政策优化的框架。BEPAL 遵循去中心化执行范式的集中式训练。每个智能体学习一个信念模型，该模型可以预测不可观察的状态信息，例如其他智能体的奖励或运动方向，以及其策略模型。通过用不直接有助于立即奖励最大化的信息丰富隐藏状态表示，这种辅助学习过程可以稳定 MARL 训练并提高整体性能。我们在捕食者-猎物环境和 Google Research Football 中评估了 BEPAL，与基线方法相比，它在性能指标上平均提高了约 16%，并表现出更稳定的收敛性。

Deployable Vision-driven UAV River Navigation via Human-in-the-loop Preference Alignment

通过人机交互偏好对齐实现可部署的视觉驱动无人机河流导航

Authors: Zihan Wang, Jianwen Li, Li-Fan Wu, Nina Mahmoudian
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.01083
Pdf link: https://arxiv.org/pdf/2511.01083
Abstract Rivers are critical corridors for environmental monitoring and disaster response, where Unmanned Aerial Vehicles (UAVs) guided by vision-driven policies can provide fast, low-cost coverage. However, deployment exposes simulation-trained policies to distribution shift and safety risks and requires efficient adaptation from limited human interventions. We study human-in-the-loop (HITL) learning with a conservative overseer who vetoes unsafe or inefficient actions and provides statewise preferences by comparing the agent's proposal with a corrective override. We introduce Statewise Hybrid Preference Alignment for Robotics (SPAR-H), which fuses direct preference optimization on policy logits with a reward-based pathway that trains an immediate-reward estimator from the same preferences and updates the policy using a trust-region surrogate. With five HITL rollouts collected from a fixed novice policy, SPAR-H achieves the highest final episodic reward and the lowest variance across initial conditions among tested methods. The learned reward model aligns with human-preferred actions and elevates nearby non-intervened choices, supporting stable propagation of improvements. We benchmark SPAR-H against imitation learning (IL), direct preference variants, and evaluative reinforcement learning (RL) in the HITL setting, and demonstrate real-world feasibility of continual preference alignment for UAV river following. Overall, dual statewise preferences empirically provide a practical route to data-efficient online adaptation in riverine navigation.
中文摘要 河流是环境监测和灾害响应的重要走廊，在视觉驱动政策的指导下，无人机（UAV）可以提供快速、低成本的覆盖。然而，部署暴露了经过模拟训练的策略，存在分布转移和安全风险，并且需要从有限的人工干预中进行有效调整。我们与一位保守的监督者一起研究人机交互（HITL）学习，该监督者否决了不安全或低效的行动，并通过将代理的建议与纠正性覆盖进行比较来提供状态偏好。我们引入了机器人技术的状态混合偏好对齐（SPAR-H），它将策略 logit 上的直接偏好优化与基于奖励的路径融合在一起，该路径从相同的偏好中训练即时奖励估计器，并使用信任区域代理更新策略。通过从固定的新手策略中收集的五次 HITL 推出，SPAR-H 在测试方法中实现了最高的最终情景奖励和最低的初始条件方差。学习的奖励模型与人类偏好的行为保持一致，并提升附近的非干预选择，支持改进的稳定传播。我们将 SPAR-H 与 HITL 环境中的模仿学习（IL）、直接偏好变体和评估性强化学习（RL）进行基准测试，并证明了无人机河流跟踪的持续偏好对齐的现实可行性。总体而言，双重状态偏好在经验上为河流航行中数据高效的在线适应提供了一条实用的途径。

SLAP: Shortcut Learning for Abstract Planning

SLAP：抽象规划的捷径学习

Authors: Y. Isabel Liu, Bowen Li, Benjamin Eysenbach, Tom Silver
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.01107
Pdf link: https://arxiv.org/pdf/2511.01107
Abstract Long-horizon decision-making with sparse rewards and continuous states and actions remains a fundamental challenge in AI and robotics. Task and motion planning (TAMP) is a model-based framework that addresses this challenge by planning hierarchically with abstract actions (options). These options are manually defined, limiting the agent to behaviors that we as human engineers know how to program (pick, place, move). In this work, we propose Shortcut Learning for Abstract Planning (SLAP), a method that leverages existing TAMP options to automatically discover new ones. Our key idea is to use model-free reinforcement learning (RL) to learn shortcuts in the abstract planning graph induced by the existing options in TAMP. Without any additional assumptions or inputs, shortcut learning leads to shorter solutions than pure planning, and higher task success rates than flat and hierarchical RL. Qualitatively, SLAP discovers dynamic physical improvisations (e.g., slap, wiggle, wipe) that differ significantly from the manually-defined ones. In experiments in four simulated robotic environments, we show that SLAP solves and generalizes to a wide range of tasks, reducing overall plan lengths by over 50% and consistently outperforming planning and RL baselines.
中文摘要 奖励稀疏、状态和动作连续的长期决策仍然是人工智能和机器人技术的一个基本挑战。任务和运动规划（TAMP）是一个基于模型的框架，它通过使用抽象动作（选项）进行分层规划来应对这一挑战。这些选项是手动定义的，将智能体限制在我们作为人类工程师知道如何编程的行为（拾取、放置、移动）。在这项工作中，我们提出了抽象规划的捷径学习（SLAP），这是一种利用现有的 TAMP 选项自动发现新选项的方法。我们的关键思想是使用无模型强化学习（RL）来学习由 TAMP 中现有选项诱导的抽象规划图中的捷径。在没有任何额外假设或输入的情况下，捷径学习比纯规划带来更短的解决方案，并且比扁平和分层的 RL 具有更高的任务成功率。从定性上讲，SLAP 发现了与手动定义的选项显着不同的动态物理即兴动作（例如，拍打、摆动、擦拭）。在四个模拟机器人环境中的实验中，我们表明 SLAP 可以求解并推广到广泛的任务，将总体计划长度减少 50% 以上，并且始终优于规划和 RL 基线。

DART: Difficulty-Adaptive Reasoning Truncation for Efficient Large Language Models

DART：用于高效大型语言模型的难度自适应推理截断

Authors: Ruofan Zhang, Bin Xia, Zhen Cheng, Cairen Jian, Minglun Yang, Ngai Wong, Yuan Cheng
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.01170
Pdf link: https://arxiv.org/pdf/2511.01170
Abstract Adaptive reasoning is essential for aligning the computational effort of large language models (LLMs) with the intrinsic difficulty of problems. Current chain-of-thought methods boost reasoning ability but indiscriminately generate long explanations, leading to evident inefficiency. However, existing reinforcement learning approaches to adaptive thinking remain unstable and heavily reward-dependent. Here we propose DART, a supervised Difficulty-Adaptive Reasoning Truncation framework that adjusts thinking length according to problem difficulty. By distilling concise reasoning patterns from stronger models, interpolating them into a continuum of reasoning styles, and curating optimal training data that balances correctness and compactness, DART learns when to "stop thinking". Across multiple mathematical benchmarks, experimental results demonstrate its remarkable efficiency while preserving or improving accuracy, achieving a significant 81.2% reasoning truncation (DeepSeek-R1-Distill-Qwen-7B on GSM8K dataset) with 5.33x computational acceleration. DART provides a stable and general paradigm for efficient reasoning, advancing the development of adaptive intelligence in LLMs.
中文摘要 自适应推理对于使大型语言模型（LLM）的计算工作与问题的内在难度保持一致至关重要。当前的思维链方法提高了推理能力，但不加区别地产生冗长的解释，导致效率明显低下。然而，现有的适应性思维强化学习方法仍然不稳定且严重依赖奖励。在这里，我们提出了 DART，一个有监督的难度自适应推理截断（Difficulty-Adaptive Reasoning Truncation）框架，它根据问题难度调整思维长度。通过从更强大的模型中提炼出简洁的推理模式，将它们插值成连续的推理风格，并策划平衡正确性和紧凑性的最佳训练数据，DART 学会了何时“停止思考”。在多个数学基准测试中，实验结果证明了其卓越的效率，同时保持或提高了准确性，实现了显着的 81.2% 推理截断（GSM8K 数据集上的 DeepSeek-R1-Distill-Qwen-7B），计算加速为 5.33 倍。DART 为高效推理提供了稳定且通用的范式，推动了 LLM 自适应智能的发展。

Self-Harmony: Learning to Harmonize Self-Supervision and Self-Play in Test-Time Reinforcement Learning

Self-Harmony：在测试时强化学习中学习协调自监督与自我博弈

Authors: Ru Wang, Wei Huang, Qi Cao, Yusuke Iwasawa, Yutaka Matsuo, Jiaxian Guo
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.01191
Pdf link: https://arxiv.org/pdf/2511.01191
Abstract Test-time reinforcement learning (TTRL) offers a label-free paradigm for adapting models using only synthetic signals at inference, but its success hinges on constructing reliable learning signals. Standard approaches such as majority voting often collapse to spurious yet popular answers. We introduce Self-Harmony, a framework built on a simple intuition: the correct answer should remain stable across both an original question and its paraphrase. Self-Harmony operationalizes this by employing a single model in two complementary roles: a Solver to produce answers and a Reframer to rephrase the input. Based on this, we further propose a pseudo-label method: instead of majority voting, it aggregates answer frequencies across these original and reframed views using the harmonic mean. This is a process that naturally selects for solutions stable under reframing, thereby avoiding the common trap of favoring view-dependent, spurious answers. Crucially, this requires no human supervision or auxiliary models. Across diverse reasoning benchmarks, Self-Harmony achieves state-of-the-art results at the label-free test-time setting, ranking first in 28 of 30 settings across multiple methods. Beyond accuracy, it demonstrates unprecedented robustness, with zero training failures in all experiments, underscoring its stability and reliability.
中文摘要 测试时强化学习（TTRL）提供了一种无标签范式，用于在推理时仅使用合成信号来调整模型，但其成功取决于构建可靠的学习信号。多数投票等标准方法经常会崩溃为虚假但流行的答案。我们介绍了自我和谐，这是一个建立在简单直觉之上的框架：正确答案应该在原始问题及其释义中保持稳定。Self-Harmony 通过使用具有两个互补角色的单个模型来实现这一点：一个 Solver 来生成答案，一个 Reframer 来改写输入。基于此，我们进一步提出了一种伪标签方法：它不是多数投票，而是使用调和平均值聚合这些原始视图和重新构建视图的回答频率。这是一个自然选择在重构下稳定的解决方案的过程，从而避免了偏爱依赖于视图的虚假答案的常见陷阱。至关重要的是，这不需要人工监督或辅助模型。在不同的推理基准中，Self-Harmony 在无标签测试时间设置下取得了最先进的结果，在多种方法的 30 种设置中的 28 种中排名第一。除了准确性之外，它还表现出前所未有的鲁棒性，所有实验中训练失败为零，凸显了其稳定性和可靠性。
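
The harmonic-mean aggregation described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `harmonic_pseudo_label`, the sample answers, and the choice to normalize counts into per-view relative frequencies are all hypothetical.

```python
from collections import Counter


def harmonic_pseudo_label(original_answers, reframed_answers):
    """Pick the answer that is frequent under BOTH views of a question.

    Hypothetical helper: each answer is scored by the harmonic mean of its
    relative frequency on the original question and on the reframed one.
    An answer missing from either view scores 0.
    """
    freq_orig = Counter(original_answers)
    freq_ref = Counter(reframed_answers)
    n_orig, n_ref = len(original_answers), len(reframed_answers)

    scores = {}
    for answer in set(freq_orig) | set(freq_ref):
        p = freq_orig[answer] / n_orig  # frequency on the original question
        q = freq_ref[answer] / n_ref    # frequency on the paraphrase
        scores[answer] = 2 * p * q / (p + q) if p > 0 and q > 0 else 0.0
    return max(scores, key=scores.get), scores


# Made-up samples: "7" wins a majority vote on the original question,
# but almost disappears once the question is rephrased.
original = ["7"] * 6 + ["5"] * 4
reframed = ["5"] * 5 + ["3"] * 4 + ["7"] * 1

label, scores = harmonic_pseudo_label(original, reframed)
print(label)  # 5
```

In this toy case a majority vote over the original samples alone would return "7", while the harmonic mean prefers "5" because it stays frequent after reframing. The harmonic mean is dominated by the smaller of the two frequencies, which is why view-dependent answers are penalized.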

DEER: Disentangled Mixture of Experts with Instance-Adaptive Routing for Generalizable Machine-Generated Text Detection

DEER：具有实例自适应路由的专家的解缠混合，用于可推广的机器生成文本检测

Authors: Guoxin Ma, Xiaoming Liu, Zhanhan Zhang, Chengzhengxu Li, Shengchao Liu, Yu Lan
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.01192
Pdf link: https://arxiv.org/pdf/2511.01192
Abstract Detecting machine-generated text (MGT) has emerged as a critical challenge, driven by the rapid advancement of large language models (LLMs) capable of producing highly realistic, human-like content. However, the performance of current approaches often degrades significantly under domain shift. To address this challenge, we propose a novel framework designed to capture both domain-specific and domain-general MGT patterns through a two-stage Disentangled mixturE-of-ExpeRts (DEER) architecture. First, we introduce a disentangled mixture-of-experts module, in which domain-specific experts learn fine-grained, domain-local distinctions between human and machine-generated text, while shared experts extract transferable, cross-domain features. Second, to mitigate the practical limitation of unavailable domain labels during inference, we design a reinforcement learning-based routing mechanism that dynamically selects the appropriate experts for each input instance, effectively bridging the train-inference gap caused by domain uncertainty. Extensive experiments on five in-domain and five out-of-domain benchmark datasets demonstrate that DEER consistently outperforms state-of-the-art methods, achieving average F1-score improvements of 1.39% and 5.32% on in-domain and out-of-domain datasets respectively, along with accuracy gains of 1.35% and 3.61% respectively. Ablation studies confirm the critical contributions of both disentangled expert specialization and adaptive routing to model performance.
中文摘要 在能够生成高度逼真的类人内容的大型语言模型（LLM）的快速发展的推动下，检测机器生成文本（MGT）已成为一项关键挑战。然而，当前方法的性能在领域转移下往往会显着下降。为了应对这一挑战，我们提出了一种新颖的框架，旨在通过两阶段的解缠绕混合体验（DEER）架构捕获特定领域和通用领域的MGT模式。首先，我们引入了一个解开的专家混合模块，其中特定领域的专家学习人类和机器生成的文本之间的细粒度、领域本地的区别，而共享专家则提取可转移的跨领域特征。其次，为了减轻推理过程中领域标签不可用的实际限制，我们设计了一种基于强化学习的路由机制，为每个输入实例动态选择合适的专家，有效弥合领域不确定性造成的训练-推理差距。在五个域内和五个域外基准数据集上的广泛实验表明，DEER 始终优于最先进的方法，在域内和域外数据集上分别实现了 1.39% 和 5.32% 的平均 F1 分数提升，准确率分别提高了 1.35% 和 3.61%。消融研究证实了解开的专家专业化和自适应路由对模型性能的关键贡献。

Thought-For-Food: Reasoning Chain Induced Food Visual Question Answering

Thought-For-Food：推理链诱导的食物视觉问答

Authors: Riddhi Jain, Manasi Patwardhan, Parijat Deshpande, Venkataramana Runkana
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.01213
Pdf link: https://arxiv.org/pdf/2511.01213
Abstract The immense cultural and culinary diversity of Indian cuisine highlights a major shortcoming of existing Visual Question Answering (VQA) systems, which are biased towards foods from Western regions. A recent attempt to build a VQA dataset for Indian food is a step towards addressing this challenge. However, that approach to VQA follows a two-step process in which the answer is generated first, followed by an explanation of the expected answer. In this work, we claim that food VQA requires a multi-step reasoning process to arrive at an accurate answer, especially in the context of Indian food, which involves understanding complex culinary context and identifying relationships between various food items. With this hypothesis, we create reasoning chains upon the QA with minimal human intervention. We fine-tune smaller LLMs and VLMs with auto-validated reasoning chains and further train them using reinforcement learning with larger data. With augmentation of reasoning chains, we observed an average accuracy improvement of 10 percentage points over the baseline. We provide a detailed analysis of the effect of adding reasoning chains for the Indian Food VQA task. Index Terms - FoodVQA, Reasoning Chains, Reinforcement Learning, Knowledge Graph.
中文摘要 印度美食文化和烹饪的巨大多样性引起了人们对现有视觉问答（VQA）系统的主要缺陷的关注，这些系统倾向于西方地区的食物。最近尝试为印度食品构建 VQA 数据集是应对这一挑战的一步。然而，他们对 VQA 的方法遵循一个两步过程，首先生成答案，然后解释预期答案。在这项工作中，我们声称食品 VQA 需要遵循多步骤推理过程才能得出准确的答案，特别是在印度食品的背景下，这涉及理解复杂的烹饪背景并识别各种食品之间的关系。通过这个假设，我们在 QA 上创建了推理链，而人为干预最少。我们使用自动验证的推理链微调较小的 LLM 和 VLM，并使用具有更大数据的强化学习进一步训练它们。随着推理链的增强，我们观察到准确性在基线上平均提高了 10 个百分点。我们就印度食品 VQA 任务添加推理链的效果进行了详细分析。索引术语 - FoodVQA、推理链、强化学习、知识图谱。

Optimizing Electric Vehicle Charging Station Placement Using Reinforcement Learning and Agent-Based Simulations

使用强化学习和基于智能体的模拟优化电动汽车充电站布局

Authors: Minh-Duc Nguyen, Dung D. Le, Phi Long Nguyen
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.01218
Pdf link: https://arxiv.org/pdf/2511.01218
Abstract The rapid growth of electric vehicles (EVs) necessitates the strategic placement of charging stations to optimize resource utilization and minimize user inconvenience. Reinforcement learning (RL) offers an innovative approach to identifying optimal charging station locations; however, existing methods face challenges due to their deterministic reward systems, which limit efficiency. Because real-world conditions are dynamic and uncertain, a deterministic reward structure cannot fully capture the complexities of charging station placement. As a result, evaluation becomes costly and time-consuming, and less reflective of real-world scenarios. To address this challenge, we propose a novel framework that integrates deep RL with agent-based simulations to model EV movement and estimate charging demand in real time. Our approach employs a hybrid RL agent with dual Q-networks to select optimal locations and configure charging ports, guided by a hybrid reward function that combines deterministic factors with simulation-derived feedback. Case studies in Hanoi, Vietnam, show that our method reduces average waiting times by 53.28% compared to the initial state, outperforming static baseline methods. This scalable and adaptive solution enhances EV infrastructure planning, effectively addressing real-world complexities and improving user experience.
中文摘要 电动汽车（EV）的快速增长需要战略性地放置充电站，以优化资源利用率并最大限度地减少用户的不便。强化学习（RL）提供了一种识别最佳充电站位置的创新方法;然而，现有方法由于其确定性奖励系统而面临挑战，限制了效率。由于现实世界的条件是动态和不确定的，因此确定性奖励结构无法完全捕捉充电站放置的复杂性。因此，评估变得昂贵且耗时，并且不太能反映现实世界的场景。为了应对这一挑战，我们提出了一种新颖的框架，将深度 RL 与基于代理的模拟相结合，以模拟电动汽车的运动并实时估计充电需求。我们的方法采用具有双 Q 网络的混合 RL 代理来选择最佳位置并配置充电端口，由将确定性因素与模拟衍生的反馈相结合的混合奖励函数指导。越南河内的案例研究表明，与初始状态相比，我们的方法将平均等待时间减少了 53.28%，优于静态基线方法。这种可扩展且自适应的解决方案增强了电动汽车基础设施规划，有效解决现实世界的复杂性并改善用户体验。

From Pixels to Cooperation: Multi Agent Reinforcement Learning based on Multimodal World Models

从像素到协作：基于多模态世界模型的多智能体强化学习

Authors: Sureyya Akin, Kavita Srivastava, Prateek B. Kapoor, Pradeep G. Sethi, Sunita Q. Patel, Rahu Srivastava
Subjects: Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2511.01310
Pdf link: https://arxiv.org/pdf/2511.01310
Abstract Learning cooperative multi-agent policies directly from high-dimensional, multimodal sensory inputs like pixels and audio (from pixels) is notoriously sample-inefficient. Model-free Multi-Agent Reinforcement Learning (MARL) algorithms struggle with the joint challenge of representation learning, partial observability, and credit assignment. To address this, we propose a novel framework based on a shared, generative Multimodal World Model (MWM). Our MWM is trained to learn a compressed latent representation of the environment's dynamics by fusing distributed, multimodal observations from all agents using a scalable attention-based mechanism. Subsequently, we leverage this learned MWM as a fast, "imagined" simulator to train cooperative MARL policies (e.g., MAPPO) entirely within its latent space, decoupling representation learning from policy learning. We introduce a new set of challenging multimodal, multi-agent benchmarks built on a 3D physics simulator. Our experiments demonstrate that our MWM-MARL framework achieves orders-of-magnitude greater sample efficiency compared to state-of-the-art model-free MARL baselines. We further show that our proposed multimodal fusion is essential for task success in environments with sensory asymmetry and that our architecture provides superior robustness to sensor-dropout, a critical feature for real-world deployment.
中文摘要 直接从像素和音频（来自像素）等高维、多模态感官输入中学习协作多智能体策略是出了名的样本效率低下。无模型多智能体强化学习（MARL）算法在表征学习、部分可观察性和学分分配的共同挑战中苦苦挣扎。为了解决这个问题，我们提出了一种基于共享生成多模态世界模型（MWM）的新框架。我们的 MWM 经过训练，通过使用可扩展的基于注意力的机制融合来自所有代理的分布式多模态观察结果，学习环境动态的压缩潜在表示。随后，我们将学习到的MWM作为一个快速的、“想象的”模拟器，完全在其潜在空间内训练合作的MARL策略（例如，MAPPO），将表示学习与策略学习解耦。我们引入了一组基于 3D 物理模拟器构建的具有挑战性的多模态、多智能体基准测试。我们的实验表明，与最先进的无模型 MARL 基线相比，我们的 MWM-MARL 框架实现了数量级的样品效率。我们进一步表明，我们提出的多模态融合对于在感官不对称的环境中成功执行任务至关重要，并且我们的架构提供了对传感器丢失的卓越鲁棒性，这是实际部署的关键特征。

RobustVLA: Robustness-Aware Reinforcement Post-Training for Vision-Language-Action Models

RobustVLA：视觉-语言-动作模型的鲁棒性感知后训练

Authors: Hongyin Zhang, Shuo Zhang, Junxi Jin, Qixin Zeng, Runze Li, Donglin Wang
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.01331
Pdf link: https://arxiv.org/pdf/2511.01331
Abstract Vision-Language-Action (VLA) models have recently emerged as powerful general-purpose policies for robotic manipulation, benefiting from large-scale multi-modal pre-training. However, they often fail to generalize reliably in out-of-distribution deployments, where unavoidable disturbances such as observation noise, sensor errors, or actuation perturbations become prevalent. While recent Reinforcement Learning (RL)-based post-training provides a practical means to adapt pre-trained VLA models, existing methods mainly emphasize reward maximization and overlook robustness to environmental uncertainty. In this work, we introduce RobustVLA, a lightweight online RL post-training method designed to explicitly enhance the resilience of VLA models. Through a systematic robustness analysis, we identify two key regularizations: Jacobian regularization, which mitigates sensitivity to observation noise, and smoothness regularization, which stabilizes policies under action perturbations. Extensive experiments across diverse robotic environments demonstrate that RobustVLA significantly outperforms prior state-of-the-art methods in robustness and reliability. Our results highlight the importance of principled robustness-aware RL post-training as a key step toward improving the reliability and robustness of VLA models.
中文摘要 视觉-语言-动作（VLA）模型最近成为机器人操纵的强大通用策略，受益于大规模多模态预训练。然而，它们通常无法在分布外部署中可靠地泛化，因为在这类部署中，观测噪声、传感器误差或驱动扰动等不可避免的干扰变得普遍。虽然最近基于强化学习（RL）的后训练提供了一种实用的方法来适应预训练的 VLA 模型，但现有方法主要强调奖励最大化，而忽视了对环境不确定性的鲁棒性。在这项工作中，我们介绍了 RobustVLA，这是一种轻量级的在线 RL 后训练方法，旨在明确增强 VLA 模型的韧性。通过系统的鲁棒性分析，我们确定了两个关键的正则化：雅可比正则化（降低对观测噪声的敏感性）和平滑正则化（在动作扰动下稳定策略）。跨不同机器人环境的广泛实验表明，RobustVLA 在鲁棒性和可靠性方面明显优于先前的最先进方法。我们的结果强调了有原则的鲁棒性感知 RL 后训练的重要性，这是提高 VLA 模型可靠性和鲁棒性的关键一步。
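
To make the idea of Jacobian regularization concrete, here is a small sketch of one common form of such a penalty: the squared Frobenius norm of the policy's Jacobian with respect to its observation. The abstract does not state the exact regularizer, so this form, the toy `policy` function, and the finite-difference estimator are assumptions for illustration only.

```python
import math


def policy(obs):
    """Hypothetical toy policy mapping a 2-D observation to a 2-D action."""
    x, y = obs
    return [math.tanh(2.0 * x - y), 0.5 * y]


def jacobian_penalty(f, obs, eps=1e-5):
    """Squared Frobenius norm of d f / d obs, via central differences.

    A large value means small observation noise can move the action a lot;
    adding this term to a training loss discourages that sensitivity.
    """
    total = 0.0
    for i in range(len(obs)):
        plus = list(obs)
        minus = list(obs)
        plus[i] += eps
        minus[i] -= eps
        # Column i of the Jacobian, one entry per action dimension.
        for a, b in zip(f(plus), f(minus)):
            total += ((a - b) / (2 * eps)) ** 2
    return total


penalty = jacobian_penalty(policy, [0.0, 0.0])
print(round(penalty, 4))  # 5.25
```

At the origin the Jacobian of the toy policy is [[2, -1], [0, 0.5]], so the penalty is 4 + 1 + 0 + 0.25 = 5.25. In practice the Jacobian of a neural policy would come from automatic differentiation rather than finite differences.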

Diffusion-Based Solver for CNF Placement on the Cloud-Continuum

基于扩散的求解器，用于在云连续体上放置 CNF

Authors: Álvaro Vázquez Rodríguez, Manuel Fernández-Veiga, Carlos Giraldo-Rodríguez
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.01343
Pdf link: https://arxiv.org/pdf/2511.01343
Abstract The placement of Cloud-Native Network Functions (CNFs) across the Cloud-Continuum represents a core challenge in the orchestration of current 5G and future 6G networks. The process involves the placement of interdependent computing tasks, structured as Service Function Chains, over distributed cloud infrastructures. This is achieved while satisfying strict resource, bandwidth and latency constraints. It is acknowledged that classical approaches, including mixed-integer nonlinear programming, heuristics and reinforcement learning are limited in terms of scalability, constraint handling and generalisation capacity. In the present study, a novel theoretical framework is proposed, which is based on Denoising Diffusion Probabilistic Models (DDPM) for CNF placement. The present approach proposes a reconceptualisation of placement as a generative graph to assignment task, where the placement problem is encoded as a heterogeneous graph, and a Graph Neural Network denoiser is trained to iteratively refine noisy CNF-to-cloud assignment matrices. The model incorporates constraint-specific losses directly into the loss function, thereby allowing it to learn feasible solution spaces. The integration of the DDPM formulation with structured combinatorial constraints is achieved through a rigorous and systematic approach. Extensive evaluations across diverse topologies have been conducted, which have confirmed that the model consistently produces feasible solutions with orders of magnitude faster inference than MINLP solvers. The results obtained demonstrate the potential of diffusion-based generative modelling for constrained network embedding problems, making an impact towards the practical, scalable orchestration of distributed Cloud-Native Network Functions.
中文摘要 云原生网络功能（CNF）在云连续体中的放置是当前 5G 和未来 6G 网络编排的核心挑战。该过程涉及在分布式云基础设施上放置相互依赖的计算任务，结构为服务功能链。这是在满足严格的资源、带宽和延迟限制的同时实现的。人们承认，包括混合整数非线性规划、启发式和强化学习在内的经典方法在可扩展性、约束处理和泛化能力方面受到限制。本研究提出了一种基于去噪扩散概率模型（DDPM）的CNF放置的新理论框架。目前的方法提出了将放置重新概念化为生成图到分配任务，其中放置问题被编码为异构图，并训练图神经网络降噪器以迭代细化有噪声的 CNF 到云分配矩阵。该模型将特定于约束的损失直接合并到损失函数中，从而允许它学习可行的解空间。DDPM 公式与结构化组合约束的集成是通过严格和系统的方法实现的。已经对不同拓扑进行了广泛的评估，证实该模型始终如一地生成可行的解决方案，其推理速度比 MINLP 求解器快几个数量级。获得的结果证明了基于扩散的生成建模在约束网络嵌入问题上的潜力，对分布式云原生网络函数的实用、可扩展编排产生了影响。

Thinking with DistilQwen: A Tale of Four Distilled Reasoning and Reward Model Series

与 DistilQwen 一起思考：四个提炼推理和奖励模型系列的故事

Authors: Wenrui Cai, Chengyu Wang, Junbing Yan, Jun Huang, Xiangzhong Fang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.01354
Pdf link: https://arxiv.org/pdf/2511.01354
Abstract Recently, the demand for small and efficient reasoning models to support real-world applications has driven the development of knowledge distillation techniques that balance reasoning performance and inference speed. In this paper, we further extend the DistilQwen model family, initialized from the Qwen models, by introducing four model series specifically designed to meet industrial requirements. The distilled model collection comprises: (1) slow-thinking models, optimized for reasoning tasks that require high accuracy; (2) two series of adaptive-thinking models, which dynamically adjust reasoning strategies based on input tasks to maximize efficiency across diverse scenarios; and (3) distilled reward models, which enable further reinforcement learning of reasoning models using distilled knowledge. Comprehensive evaluations across multiple benchmarks demonstrate both high inference efficiency and strong reasoning performance for these models, as well as the practical utility of distilled reward models. We further show that these models support industry practitioners by providing scalable training and inference functionalities on the Alibaba Cloud PAI (Platform for Artificial Intelligence) platform.
中文摘要 最近，对支持实际应用的小型高效推理模型的需求推动了平衡推理性能和推理速度的知识蒸馏技术的发展。在本文中，我们进一步扩展了从Qwen模型初始化的DistilQwen模型系列，引入了四个专门设计的模型系列，以满足工业需求。提炼模型集合包括：（1）慢思维模型，针对需要高精度的推理任务进行了优化;（2）两个系列的自适应思维模型，根据输入任务动态调整推理策略，在不同场景下实现效率最大化;（3）蒸馏奖励模型，能够使用蒸馏知识对推理模型进行进一步强化学习。跨多个基准的综合评估证明了这些模型的高推理效率和强大的推理性能，以及蒸馏奖励模型的实际实用性。我们进一步表明，这些模型通过在阿里云 PAI（人工智能平台）平台上提供可扩展的训练和推理功能来支持行业从业者。

Learning Intractable Multimodal Policies with Reparameterization and Diversity Regularization

通过重新参数化和多样性正则化学习棘手的多模态策略

Authors: Ziqi Wang, Jiashun Liu, Ling Pan
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.01374
Pdf link: https://arxiv.org/pdf/2511.01374
Abstract Traditional continuous deep reinforcement learning (RL) algorithms employ deterministic or unimodal Gaussian actors, which cannot express complex multimodal decision distributions. This limitation can hinder their performance in diversity-critical scenarios. There have been some attempts to design online multimodal RL algorithms based on diffusion or amortized actors. However, these actors are intractable, making existing methods struggle with balancing performance, decision diversity, and efficiency simultaneously. To overcome this challenge, we first reformulate existing intractable multimodal actors within a unified framework, and prove that they can be directly optimized by policy gradient via reparameterization. Then, we propose a distance-based diversity regularization that does not explicitly require decision probabilities. We identify two diversity-critical domains, namely multi-goal achieving and generative RL, to demonstrate the advantages of multimodal policies and our method, particularly in terms of few-shot robustness. In conventional MuJoCo benchmarks, our algorithm also shows competitive performance. Moreover, our experiments highlight that the amortized actor is a promising policy model class with strong multimodal expressivity and high performance. Our code is available at this https URL
中文摘要 传统的连续深度强化学习（RL）算法采用确定性或单模态高斯参与者，无法表达复杂的多模态决策分布。这种限制可能会阻碍它们在多样性关键场景中的性能。已经有一些尝试设计基于扩散或摊销参与者的在线多模态 RL 算法。然而，这些参与者是棘手的，使得现有方法难以同时平衡性能、决策多样性和效率。为了克服这一挑战，我们首先在统一的框架内重新制定现有的棘手多模态参与者，并证明它们可以通过重新参数化通过政策梯度直接优化。然后，我们提出了一种基于距离的多样性正则化，该正则化不明确需要决策概率。我们确定了两个多样性关键领域，即多目标实现和生成 RL，以展示多模态政策和我们的方法的优势，特别是在少样本鲁棒性方面。在传统的 MuJoCo 基准测试中，我们的算法也显示出具有竞争力的性能。此外，我们的实验强调，摊销行为者是一个有前途的策略模型类，具有较强的多模态表达能力和高性能。我们的代码可在此 https URL 中找到

Modulation of temporal decision-making in a deep reinforcement learning agent under the dual-task paradigm

双任务范式下深度强化学习智能体时间决策的调制

Authors: Amrapali Pednekar, Álvaro Garrido-Pérez, Yara Khaluf, Pieter Simoens
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.01415
Pdf link: https://arxiv.org/pdf/2511.01415
Abstract This study explores the interference in temporal processing within a dual-task paradigm from an artificial intelligence (AI) perspective. In this context, the dual-task setup is implemented as a simplified version of the Overcooked environment with two variations, single task (T) and dual task (T+N). Both variations involve an embedded time production task, but the dual task (T+N) additionally involves a concurrent number comparison task. Two deep reinforcement learning (DRL) agents were separately trained for each of these tasks. These agents exhibited emergent behavior consistent with human timing research. Specifically, the dual task (T+N) agent exhibited significant overproduction of time relative to its single task (T) counterpart. This result was consistent across four target durations. Preliminary analysis of neural dynamics in the agents' LSTM layers did not reveal any clear evidence of a dedicated or intrinsic timer. Hence, further investigation is needed to better understand the underlying time-keeping mechanisms of the agents and to provide insights into the observed behavioral patterns. This study is a small step towards exploring parallels between emergent DRL behavior and behavior observed in biological systems in order to facilitate a better understanding of both.
中文摘要 本研究从人工智能（AI）的角度探讨了双任务范式中时间处理的干扰。在这种情况下，双任务设置被实现为 Overcooked 环境的简化版本，具有两种变体，即单任务（T）和双任务（T+N）。这两种变体都涉及嵌入式时间生产任务，但双重任务（T+N）还涉及并发数字比较任务。两个深度强化学习（DRL）代理分别针对这些任务进行了训练。这些代理表现出与人类计时研究一致的紧急行为。具体来说，双任务（T+N）代理相对于其单任务（T）代理表现出显着的时间过剩。这一结果在四个目标持续时间内是一致的。对代理 LSTM 层中神经动力学的初步分析没有揭示任何专用或内在计时器的明确证据。因此，需要进一步调查以更好地了解智能体的潜在计时机制，并深入了解观察到的行为模式。这项研究是探索紧急 DRL 行为与生物系统中观察到的行为之间相似之处的一小步，以促进更好地理解两者。

Learning to Seek Evidence: A Verifiable Reasoning Agent with Causal Faithfulness Analysis

学习寻找证据：具有因果忠实性分析的可验证推理代理

Authors: Yuhang Huang, Zekai Lin, Fan Zhong, Lei Liu
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.01425
Pdf link: https://arxiv.org/pdf/2511.01425
Abstract Explanations for AI models in high-stakes domains like medicine often lack verifiability, which can hinder trust. To address this, we propose an interactive agent that produces explanations through an auditable sequence of actions. The agent learns a policy to strategically seek external visual evidence to support its diagnostic reasoning. This policy is optimized using reinforcement learning, resulting in a model that is both efficient and generalizable. Our experiments show that this action-based reasoning process significantly improves calibrated accuracy, reducing the Brier score by 18% compared to a non-interactive baseline. To validate the faithfulness of the agent's explanations, we introduce a causal intervention method. By masking the visual evidence the agent chooses to use, we observe a measurable degradation in its performance (ΔBrier = +0.029), confirming that the evidence is integral to its decision-making process. Our work provides a practical framework for building AI systems with verifiable and faithful reasoning capabilities.
中文摘要 对医学等高风险领域的人工智能模型的解释通常缺乏可验证性，这可能会阻碍信任。为了解决这个问题，我们提出了一种交互式智能体，它通过可审计的一系列动作来生成解释。智能体学习策略以战略性地寻求外部视觉证据来支持其诊断推理。该策略使用强化学习进行优化，从而形成一个既高效又可推广的模型。我们的实验表明，这种基于动作的推理过程显着提高了校准准确性，与非交互式基线相比，Brier 分数降低了 18%。为了验证智能体解释的忠实性，我们引入了一种因果干预方法。通过掩盖智能体选择使用的视觉证据，我们观察到其性能出现了可测量的下降（ΔBrier = +0.029），证实该证据是其决策过程中不可或缺的一部分。我们的工作为构建具有可验证和忠实推理能力的人工智能系统提供了一个实用的框架。
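
For readers unfamiliar with the metric, the sketch below shows how a Brier score and a masking-induced ΔBrier could be computed. It uses the binary form of the score and made-up probabilities; the paper may use a multi-class variant, and none of the numbers here come from its experiments.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probabilities and 0/1 labels.

    Lower is better; 0.0 is a perfectly confident, perfectly correct model.
    """
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


# Hypothetical predictions for three cases (label 1 = condition present).
labels = [1, 0, 1]
probs_with_evidence = [0.9, 0.2, 0.7]  # agent sees the evidence it requested
probs_masked = [0.6, 0.4, 0.5]         # same cases with that evidence masked

delta_brier = (brier_score(probs_masked, labels)
               - brier_score(probs_with_evidence, labels))
print(delta_brier > 0)  # True
```

A positive ΔBrier means predictions got worse once the evidence was removed, which is the direction of the +0.029 change reported in the abstract.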

BARD: budget-aware reasoning distillation

BARD：预算意识推理蒸馏

Authors: Lujie Niu, Lei Shen, Yi Jiang, Caixia Yuan, Xiaojie Wang, Wenbo Su, Bo Zheng
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.01470
Pdf link: https://arxiv.org/pdf/2511.01470
Abstract While long Chain-of-Thought (CoT) distillation effectively transfers reasoning capability to smaller language models, the reasoning process often remains redundant and the computational budget uncontrollable, leading to inefficient resource usage. To address this limitation, we propose Budget-Aware Reasoning Distillation (BARD), a novel framework that simultaneously distills reasoning capability and enables fine-grained control over the reasoning length. BARD uses the thinking budget as a user-specified control signal, allowing the model to dynamically balance reasoning performance and computational efficiency. To achieve this concept, BARD introduces a two-phase training regimen. The first phase is Supervised Fine-Tuning (SFT) on teacher-generated long CoT data compressed to various budget levels, which bootstraps the model's understanding of budget constraints. The second phase applies Reinforcement Learning (RL) with a reward signal that considers reasoning performance and budget fidelity simultaneously. Incorporating the two-phase regimen is crucial to avoiding policy degradation and ensuring that both objectives are optimized jointly. Extensive experiments demonstrate that our method empowers an 8B student model to achieve strong performance on challenging reasoning benchmarks (AIME24, AIME25, GPQA) while providing precise and adaptive control over its reasoning length across a wide range of budgets.
中文摘要 虽然长思维链（CoT）蒸馏有效地将推理能力转移到较小的语言模型上，但推理过程往往仍然是冗余的，计算预算不可控，导致资源使用效率低下。为了解决这一限制，我们提出了预算感知推理蒸馏（BARD），这是一种新颖的框架，可以同时蒸馏推理能力并实现对推理长度的细粒度控制。BARD 使用思维预算作为用户指定的控制信号，使模型能够动态平衡推理性能和计算效率。为了实现这一概念，BARD 引入了两阶段的训练方案。第一阶段是对教师生成的、压缩到不同预算水平的长 CoT 数据进行监督微调（SFT），引导模型对预算约束的理解。第二阶段利用强化学习（RL），其奖励信号同时考虑推理性能和预算保真度。纳入两阶段方案对于避免策略退化和确保共同优化这两个目标至关重要。广泛的实验表明，我们的方法使 8B 学生模型能够在具有挑战性的推理基准（AIME24、AIME25、GPQA）上取得强大的性能，同时在广泛的预算范围内对其推理长度进行精确和自适应的控制。

TPS-Bench: Evaluating AI Agents' Tool Planning & Scheduling Abilities in Compounding Tasks

TPS-Bench：评估AI智能体在复合任务中的工具规划和调度能力

Authors: Hanwen Xu, Xuyao Huang, Yuzhe Liu, Kai Yu, Zhijie Deng
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.01527
Pdf link: https://arxiv.org/pdf/2511.01527
Abstract Large language model (LLM) agents have exhibited strong problem-solving competence across domains like research and coding. Yet, it remains underexplored whether LLM agents can tackle compounding real-world problems that require a diverse set of tools to complete. Given a broad, heterogeneous tool repository, LLM agents must not only select appropriate tools based on task planning analysis but also strategically schedule the execution order to ensure efficiency. This paper introduces TPS-Bench to benchmark the ability of LLM agents in solving such problems that demand Tool Planning and Scheduling. TPS-Bench collects 200 compounding tasks of two difficulty levels, based on a tool repository containing hundreds of model context protocol (MCP) tools. In particular, each task is composed of multiple subtasks, such as web search, map navigation, calendar checking, etc., and each subtask can be completed by a basic tool. Our evaluation emphasizes both task completion rate and efficiency. The empirical studies on popular closed-source and open-source LLMs indicate that most models can perform reasonable tool planning, but differ in scheduling. For example, GLM-4.5 achieves the leading task completion rate of 64.72% with extensive sequential tool calls, and hence suffers from significantly long execution time. By contrast, GPT-4o prioritizes parallel tool calls but achieves only a 45.08% completion rate. Considering that reinforcement learning (RL) can be a viable way to improve scheduling efficiency without compromising performance, we perform an initial study on Qwen3-1.7B and observe a 14% reduction in execution time alongside a 6% gain in task completion rate using only 100 RL training samples. Our code is available at this https URL.
中文摘要 大型语言模型（LLM）智能体在研究和编码等领域表现出了强大的解决问题的能力。然而，LLM 智能体是否能够解决需要多种工具才能完成的复合现实问题，目前仍未得到充分探索。面对一个广泛的异构工具存储库，LLM 智能体不仅要根据任务规划分析选择合适的工具，还要战略性地调度执行顺序以确保效率。本文介绍了 TPS-Bench，对 LLM 智能体解决需要工具规划和调度的问题的能力进行基准测试。TPS-Bench 基于包含数百个模型上下文协议（MCP）工具的工具存储库，收集了 200 个分为两个难度级别的复合任务。具体而言，每个任务都由多个子任务组成，如网页搜索、地图导航、日历查询等，每个子任务都可以通过一个基础工具完成。我们的评估同时强调任务完成率和效率。对流行的闭源和开源 LLM 的实证研究表明，大多数模型可以执行合理的工具规划，但在调度上有所不同。例如，GLM-4.5 通过大量的顺序工具调用实现了 64.72% 的领先任务完成率，因此执行时间明显较长。相比之下，GPT-4o 优先考虑并行工具调用，但完成率仅为 45.08%。考虑到强化学习（RL）可能是在不影响性能的情况下提高调度效率的可行方法，我们对 Qwen3-1.7B 进行了初步研究，仅使用 100 个 RL 训练样本，就观察到执行时间减少了 14%，任务完成率提高了 6%。我们的代码可通过此 https URL 获得。

Learning what to say and how precisely: Efficient Communication via Differentiable Discrete Communication Learning

学习该说什么以及如何准确：通过可微分的离散通信学习实现高效通信

Authors: Aditya Kapoor, Yash Bhisikar, Benjamin Freed, Jan Peters, Mingfei Sun
Subjects: Multiagent Systems (cs.MA); Information Theory (cs.IT); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.01554
Pdf link: https://arxiv.org/pdf/2511.01554
Abstract Effective communication in multi-agent reinforcement learning (MARL) is critical for success but constrained by bandwidth, yet past approaches have been limited to complex gating mechanisms that only decide whether to communicate, not how precisely. Learning to optimize message precision at the bit-level is fundamentally harder, as the required discretization step breaks gradient flow. We address this by generalizing Differentiable Discrete Communication Learning (DDCL), a framework for end-to-end optimization of discrete messages. Our primary contribution is an extension of DDCL to support unbounded signals, transforming it into a universal, plug-and-play layer for any MARL architecture. We verify our approach with three key results. First, through a qualitative analysis in a controlled environment, we demonstrate how agents learn to dynamically modulate message precision according to the informational needs of the task. Second, we integrate our variant of DDCL into four state-of-the-art MARL algorithms, showing it reduces bandwidth by over an order of magnitude while matching or exceeding task performance. Finally, we provide direct evidence for the "Bitter Lesson" in MARL communication: a simple Transformer-based policy leveraging DDCL matches the performance of complex, specialized architectures, questioning the necessity of bespoke communication designs.
中文摘要 多智能体强化学习（MARL）中的有效通信对于成功至关重要，但受到带宽的限制，而过去的方法仅限于复杂的门控机制，这些机制仅决定是否进行通信，而不决定通信的精确程度。学习在比特级别优化消息精度从根本上来说更加困难，因为所需的离散化步骤会破坏梯度流。我们通过推广可微离散通信学习（DDCL）来解决这个问题，DDCL 是一个用于离散消息端到端优化的框架。我们的主要贡献是扩展 DDCL 以支持无界信号，将其转变为适用于任何 MARL 架构的通用即插即用层。我们用三个关键结果验证了我们的方法。首先，通过在受控环境中进行定性分析，我们展示了智能体如何学会根据任务的信息需求动态调节消息精度。其次，我们将 DDCL 的变体集成到四种最先进的 MARL 算法中，表明它在匹配或超过任务性能的同时将带宽减少了一个数量级以上。最后，我们为 MARL 通信中的“Bitter Lesson”提供了直接证据：利用 DDCL 的基于 Transformer 的简单策略与复杂、专业架构的性能相匹配，质疑定制通信设计的必要性。

L2T-Tune: LLM-Guided Hybrid Database Tuning with LHS and TD3

L2T-Tune：使用 LHS 和 TD3 进行 LLM 引导的混合数据库调优

Authors: Xinyue Yang, Chen Zheng, Yaoyang Hou, Renhao Zhang, Yiyan Zhang, Yanjun Wu, Heng Zhang
Subjects: Databases (cs.DB); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.01602
Pdf link: https://arxiv.org/pdf/2511.01602
Abstract Configuration tuning is critical for database performance. Although recent advancements in database tuning have shown promising results in throughput and latency improvement, challenges remain. First, the vast knob space makes direct optimization unstable and slow to converge. Second, reinforcement learning pipelines often lack effective warm-start guidance and require long offline training. Third, transferability is limited: when hardware or workloads change, existing models typically require substantial retraining to recover performance. To address these limitations, we propose L2T-Tune, a new LLM-guided hybrid database tuning framework that features a three-stage pipeline. Stage one performs a warm start that simultaneously generates uniform samples across the knob space and logs them into a shared pool. Stage two leverages a large language model to mine and prioritize tuning hints from manuals and community documents for rapid convergence. Stage three uses the warm-start sample pool to reduce the dimensionality of knobs and state features, then fine-tunes the configuration with the Twin Delayed Deep Deterministic Policy Gradient algorithm. We conduct experiments on L2T-Tune and the state-of-the-art models. Compared with the best-performing alternative, our approach improves performance by an average of 37.1% across all workloads, and by up to 73% on TPC-C. Compared with models trained with reinforcement learning, it achieves rapid convergence in the offline tuning stage on a single server. Moreover, during the online tuning stage, it only takes 30 steps to achieve best results.
中文摘要 配置调优对于数据库性能至关重要。尽管数据库调优的最新进展在吞吐量和延迟改善方面显示出有希望的结果，但挑战仍然存在。首先，巨大的旋钮空间使得直接优化不稳定且收敛缓慢。其次，强化学习流水线往往缺乏有效的热启动指导，需要长时间的离线训练。第三，可迁移性有限：当硬件或工作负载发生变化时，现有模型通常需要大量重新训练才能恢复性能。为了解决这些限制，我们提出了 L2T-Tune，这是一种新的 LLM 引导的混合数据库调优框架，具有三阶段流水线。第一阶段执行热启动，同时在旋钮空间中生成均匀的样本并将它们记录到共享池中。第二阶段利用大型语言模型从手册和社区文档中挖掘调优提示并确定其优先级，以实现快速收敛。第三阶段使用热启动样本池来降低旋钮和状态特征的维度，然后使用孪生延迟深度确定性策略梯度算法对配置进行微调。我们对 L2T-Tune 和最先进的模型进行了实验。与性能最佳的替代方案相比，我们的方法在所有工作负载上平均提高了 37.1% 的性能，在 TPC-C 上最高提高了 73%。与强化学习训练的模型相比，它在单台服务器上的离线调优阶段实现了快速收敛。而且，在在线调优阶段，只需 30 个步骤即可达到最佳效果。
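The title names LHS, and stage one is described as generating uniform samples across the knob space. A minimal sketch of Latin hypercube sampling follows, assuming "LHS" refers to that technique and that each knob is a continuous range. The function name `latin_hypercube` and the two knob ranges are hypothetical and are not taken from the paper.

```python
import numpy as np

def latin_hypercube(n_samples, bounds, seed=0):
    """Draw n_samples points so that each knob's range is split into
    n_samples equal strata and every stratum is hit exactly once."""
    rng = np.random.default_rng(seed)
    bounds = np.asarray(bounds, dtype=float)  # shape (n_knobs, 2): [low, high]
    n_knobs = bounds.shape[0]
    # One random point inside each stratum [k/n, (k+1)/n), for every knob.
    u = (np.arange(n_samples)[:, None] + rng.random((n_samples, n_knobs))) / n_samples
    # Shuffle the strata independently per knob so the knobs are decorrelated.
    for j in range(n_knobs):
        u[:, j] = rng.permutation(u[:, j])
    low, high = bounds[:, 0], bounds[:, 1]
    return low + u * (high - low)

# Hypothetical knob ranges (illustrative only): a buffer size and a connection limit.
knob_bounds = [(128.0, 4096.0), (10.0, 500.0)]
samples = latin_hypercube(8, knob_bounds)
```

Compared with independent uniform draws, this stratification guarantees that a small warm-start pool still covers every portion of each knob's range.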

Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models

Actial：激活多模态大语言模型的空间推理能力

Authors: Xiaoyu Zhan, Wenxuan Huang, Hao Sun, Xinyu Fu, Changfeng Ma, Shaosheng Cao, Bohan Jia, Shaohui Lin, Zhenfei Yin, Lei Bai, Wanli Ouyang, Yuanqi Li, Jie Guo, Yanwen Guo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.01618
Pdf link: https://arxiv.org/pdf/2511.01618
Abstract Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved 2D visual understanding, prompting interest in their application to complex 3D reasoning tasks. However, it remains unclear whether these models can effectively capture the detailed spatial information required for robust real-world performance, especially cross-view consistency, a key requirement for accurate 3D reasoning. Considering this issue, we introduce Viewpoint Learning, a task designed to evaluate and improve the spatial reasoning capabilities of MLLMs. We present the Viewpoint-100K dataset, consisting of 100K object-centric image pairs with diverse viewpoints and corresponding question-answer pairs. Our approach employs a two-stage fine-tuning strategy: first, foundational knowledge is injected into the baseline MLLM via Supervised Fine-Tuning (SFT) on Viewpoint-100K, resulting in significant improvements across multiple tasks; second, generalization is enhanced through Reinforcement Learning using the Group Relative Policy Optimization (GRPO) algorithm on a broader set of questions. Additionally, we introduce a hybrid cold-start initialization method designed to simultaneously learn viewpoint representations and maintain coherent reasoning. Experimental results show that our approach significantly activates the spatial reasoning ability of MLLMs, improving performance on both in-domain and out-of-domain reasoning tasks. Our findings highlight the value of developing foundational spatial skills in MLLMs, supporting future progress in robotics, autonomous systems, and 3D scene understanding.
中文摘要 多模态大型语言模型（MLLM）的最新进展显著提高了 2D 视觉理解能力，引发了人们对其在复杂 3D 推理任务中的应用的兴趣。然而，目前尚不清楚这些模型是否能够有效地捕获稳健的现实世界性能所需的详细空间信息，尤其是跨视图一致性，这是准确 3D 推理的关键要求。考虑到这个问题，我们引入了视点学习，这是一项旨在评估和提高 MLLM 空间推理能力的任务。我们提出了 Viewpoint-100K 数据集，该数据集由 100K 个以对象为中心的图像对组成，具有不同的视点和相应的问答对。我们的方法采用两阶段微调策略：首先，通过 Viewpoint-100K 上的监督微调（SFT）将基础知识注入基线 MLLM，从而在多个任务中实现显著改进；其次，通过使用组相对策略优化（GRPO）算法对更广泛的问题进行强化学习来增强泛化。此外，我们还引入了一种混合冷启动初始化方法，旨在同时学习视点表示并保持连贯的推理。实验结果表明，该方法显著激活了 MLLM 的空间推理能力，提高了域内和域外推理任务的性能。我们的研究结果强调了在 MLLM 中培养基本空间技能的价值，支持机器人、自主系统和 3D 场景理解的未来进步。
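The second stage relies on GRPO. As commonly described, GRPO scores each sampled answer relative to the other answers drawn for the same prompt rather than using a learned value baseline. A minimal sketch of that group-relative advantage follows. The function name, the example rewards, and the `eps` stabilizer are illustrative assumptions and do not reproduce the reward design used in the paper.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one prompt's sampled answers by the
    group's own mean and standard deviation."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical rewards for four answers sampled for the same viewpoint question
# (1.0 = correct, 0.0 = incorrect).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Answers that beat the group average receive a positive advantage and the rest a negative one. When every answer in the group earns the same reward, all advantages are zero and that prompt contributes no learning signal.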

Enhancing Diffusion-based Restoration Models via Difficulty-Adaptive Reinforcement Learning with IQA Reward

通过具有 IQA 奖励的难度自适应强化学习增强基于扩散的恢复模型

Authors: Xiaogang Xu, Ruihang Chu, Jian Wang, Kun Zhou, Wenjie Shu, Harry Yang, Ser-Nam Lim, Hao Chen, Liang Lin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.01645
Pdf link: https://arxiv.org/pdf/2511.01645
Abstract Reinforcement Learning (RL) has recently been incorporated into diffusion models, e.g., for tasks such as text-to-image generation. However, directly applying existing RL methods to diffusion-based image restoration models is suboptimal, as the objective of restoration fundamentally differs from that of pure generation: it places greater emphasis on fidelity. In this paper, we investigate how to effectively integrate RL into diffusion-based restoration models. First, through extensive experiments with various reward functions, we find that an effective reward can be derived from an Image Quality Assessment (IQA) model, instead of intuitive ground-truth-based supervision, which has already been optimized during the Supervised Fine-Tuning (SFT) stage prior to RL. Moreover, our strategy focuses on using RL for challenging samples that are significantly distant from the ground truth, and our RL approach initially uses MLLM-based IQA models to align distributions with high-quality images. As the samples approach the ground truth's distribution, RL is adaptively combined with SFT for more fine-grained alignment. This dynamic process is facilitated through an automatic weighting strategy that adjusts based on the relative difficulty of the training samples. Our strategy is plug-and-play and can be seamlessly applied to diffusion-based restoration models, boosting their performance across various restoration tasks. Extensive experiments across multiple benchmarks demonstrate the effectiveness of our proposed RL framework.
中文摘要 强化学习（RL）最近已被纳入扩散模型，例如文本到图像等任务。然而，直接将现有的RL方法应用于基于扩散的图像恢复模型是次优的，因为恢复的目标与纯生成的目标有根本的不同：它更加强调保真度。在本文中，我们研究了如何有效地将RL集成到基于扩散的恢复模型中。首先，通过对各种奖励函数的广泛实验，我们发现可以从图像质量评估（IQA）模型中得出有效的奖励，而不是直观的基于地面实况的监督，后者在RL之前的监督微调（SFT）阶段已经进行了优化。此外，我们的策略侧重于使用 RL 来处理与地面实况相距甚远的具有挑战性的样本，并且我们的 RL 方法是使用基于 MLLM 的 IQA 模型创新地实现的，以最初将分布与高质量图像对齐。当样本接近地面实况分布时，RL 与 SFT 自适应地结合，以实现更细粒度的对齐。这种动态过程是通过自动加权策略来促进的，该策略会根据训练样本的相对难度进行调整。我们的策略是即插即用的，可以无缝应用于基于扩散的恢复模型，从而提高其在各种恢复任务中的性能。跨多个基准的广泛实验证明了我们提出的 RL 框架的有效性。

Collaborative Large Language Model Inference via Resource-Aware Parallel Speculative Decoding

通过资源感知并行推测解码进行协同大型语言模型推理

Authors: Jungyeon Koh, Hyun Jong Yang
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2511.01695
Pdf link: https://arxiv.org/pdf/2511.01695
Abstract The growing demand for on-device large language model (LLM) inference highlights the need for efficient mobile edge computing (MEC) solutions, especially in resource-constrained settings. Speculative decoding offers a promising solution by partitioning token generation between a lightweight draft model on mobile devices and a powerful target model on edge servers, but suffers from communication overhead and asynchronous delays. This paper is the first to propose a unified framework that jointly optimizes user association and resource allocation (UARA) to support efficient parallel speculative decoding. We solve the UARA problem using a multi-agent deep reinforcement learning algorithm. To evaluate our approach under realistic conditions, we conduct experiments using the Sionna simulator. Results show that our method achieves up to 28.0% and an average of 23.7% reduction in end-to-end latency without compromising inference accuracy, enabling scalable and low-latency LLM services in MEC systems.
中文摘要 对设备上大型语言模型（LLM）推理的需求不断增长，凸显了对高效移动边缘计算（MEC）解决方案的需求，特别是在资源受限的环境中。推测解码通过在移动设备上的轻量级草稿模型和边缘服务器上的强大目标模型之间划分令牌生成，提供了一种有前途的解决方案，但存在通信开销和异步延迟。本文首次提出了一种统一框架，该框架联合优化用户关联和资源分配（UARA），以支持高效的并行推测解码。我们使用多智能体深度强化学习算法解决 UARA 问题。为了在现实条件下评估我们的方法，我们使用 Sionna 模拟器进行了实验。结果表明，在不影响推理精度的情况下，我们的方法实现了高达 28.0% 和平均 23.7% 的端到端延迟降低，从而在 MEC 系统中实现了可扩展和低延迟的 LLM 服务。

RLAC: Reinforcement Learning with Adversarial Critic for Free-Form Generation Tasks

RLAC：使用对抗性批评者进行自由形式生成任务的强化学习

Authors: Mian Wu, Gavin Zhang, Sewon Min, Sergey Levine, Aviral Kumar
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.01758
Pdf link: https://arxiv.org/pdf/2511.01758
Abstract Open-ended generation tasks require outputs to satisfy diverse and often implicit task-specific evaluation rubrics. The sheer number of relevant rubrics leads to prohibitively high verification costs and incomplete assessments of a response, making reinforcement learning (RL) post-training with rubric-based rewards difficult to scale. This problem is exacerbated by the fact that often the best way to combine these rubrics into one single reward is also highly prompt-specific. We propose Reinforcement Learning with Adversarial Critic (RLAC), a post-training approach that addresses these challenges via dynamic rubric verification. Our approach employs a large language model (LLM) as a critic that dynamically identifies only the most likely failure modes (e.g., a factual error or unhandled edge case), which are then verified by an external validator to optimize both generator and critic jointly. By training both the generator and the critic, this game enhances the critic's error detection and the generator's output quality while reducing required verifications. Our experiments demonstrate that RLAC improves factual accuracy in text generation and correctness in code generation, while also outperforming exhaustive verification and reward model methods. We show that dynamic critics are more effective than fixed critics, showcasing the potential of RLAC for scaling RL post-training to free-form generation tasks.
中文摘要 开放式生成任务要求输出满足多样且通常是隐含的特定于任务的评分标准。相关评分标准的数量庞大，导致验证成本高得令人望而却步，并且对响应的评估不完整，使得采用基于评分标准的奖励的强化学习（RL）后训练难以扩展。将这些评分标准组合成单一奖励的最佳方法通常也高度依赖于具体提示，这进一步加剧了该问题。我们提出了对抗性批评者强化学习（RLAC），这是一种通过动态评分标准验证来应对这些挑战的后训练方法。我们的方法采用大型语言模型（LLM）作为批评者，仅动态识别最有可能的失败模式（例如，事实错误或未处理的边缘情况），然后由外部验证器进行验证，以共同优化生成器和批评者。通过同时训练生成器和批评者，这一博弈增强了批评者的错误检测能力和生成器的输出质量，同时减少了所需的验证次数。我们的实验表明，RLAC 提高了文本生成的事实准确性和代码生成的正确性，同时也优于穷举验证和奖励模型方法。我们表明动态批评者比固定批评者更有效，展示了 RLAC 在将 RL 后训练扩展到自由形式生成任务方面的潜力。

MOBIUS: A Multi-Modal Bipedal Robot that can Walk, Crawl, Climb, and Roll

MOBIUS：一种多模态双足机器人，可以行走、爬行、攀爬和滚动

Authors: Alexander Schperberg, Yusuke Tanaka, Stefano Di Cairano, Dennis Hong
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2511.01774
Pdf link: https://arxiv.org/pdf/2511.01774
Abstract This article presents a Multi-Modal Bipedal Intelligent Urban Scout robot (MOBIUS) capable of walking, crawling, climbing, and rolling. MOBIUS features four limbs--two 6-DoF arms with two-finger grippers for manipulation and climbing, and two 4-DoF legs for locomotion--enabling smooth transitions across diverse terrains without reconfiguration. A hybrid control architecture combines reinforcement learning-based locomotion with model-based predictive and admittance control enhanced for safety by a Reference Governor toward compliant contact interactions. A high-level MIQCP planner autonomously selects locomotion modes to balance stability and energy efficiency. Hardware experiments demonstrate robust gait transitions, dynamic climbing, and full-body load support via pinch grasp. Overall, MOBIUS demonstrates the importance of tight integration between morphology, high-level planning, and control to enable mobile loco-manipulation and grasping, substantially expanding its interaction capabilities, workspace, and traversability.
中文摘要 本文介绍了一种能够行走、爬行、攀爬和滚动的多模态双足智能城市侦察机器人（MOBIUS）。MOBIUS 具有四个肢体：两个带有两指夹持器、用于操作和攀爬的 6 自由度手臂，以及两个用于运动的 4 自由度腿，无需重新配置即可在不同地形上平稳过渡。混合控制架构将基于强化学习的运动与基于模型的预测控制和导纳控制相结合，并通过参考调节器（Reference Governor）增强安全性，以实现柔顺的接触交互。高层 MIQCP 规划器自主选择运动模式，以平衡稳定性和能源效率。硬件实验展示了鲁棒的步态转换、动态攀爬以及通过捏握实现的全身负载支撑。总体而言，MOBIUS 展示了形态、高层规划和控制之间紧密集成的重要性，以实现移动运动操作（loco-manipulation）和抓取，从而大大扩展了其交互能力、工作空间和可通过性。

GenDexHand: Generative Simulation for Dexterous Hands

GenDexHand：灵巧手的生成模拟

Authors: Feng Chen, Zhuxiu Xu, Tianzhe Chu, Xunzhe Zhou, Li Sun, Zewen Wu, Shenghua Gao, Zhongyu Li, Yanchao Yang, Yi Ma
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.01791
Pdf link: https://arxiv.org/pdf/2511.01791
Abstract Data scarcity remains a fundamental bottleneck for embodied intelligence. Existing approaches use large language models (LLMs) to automate gripper-based simulation generation, but they transfer poorly to dexterous manipulation, which demands more specialized environment design. Meanwhile, dexterous manipulation tasks are inherently more difficult due to their higher degrees of freedom. Massively generating feasible and trainable dexterous hand tasks remains an open challenge. To this end, we present GenDexHand, a generative simulation pipeline that autonomously produces diverse robotic tasks and environments for dexterous manipulation. GenDexHand introduces a closed-loop refinement process that adjusts object placements and scales based on vision-language model (VLM) feedback, substantially improving the average quality of generated environments. Each task is further decomposed into sub-tasks to enable sequential reinforcement learning, reducing training time and increasing success rates. Our work provides a viable path toward scalable training of diverse dexterous hand behaviors in embodied intelligence by offering a simulation-based solution to synthetic data generation. Our website: this https URL.
中文摘要 数据稀缺仍然是具身智能的基本瓶颈。现有方法使用大型语言模型（LLM）来自动生成基于夹爪的模拟，但它们很难迁移到灵巧操作中，因为灵巧操作需要更专业的环境设计。同时，灵巧操作任务由于其更高的自由度而本质上更加困难。大规模生成可行且可训练的灵巧手任务仍然是一个开放的挑战。为此，我们提出了 GenDexHand，这是一种生成式模拟流水线，可以自主生成各种用于灵巧操作的机器人任务和环境。GenDexHand 引入了闭环细化过程，根据视觉语言模型（VLM）反馈调整对象位置和缩放，从而显著提高生成环境的平均质量。每个任务都进一步分解为子任务，以实现顺序强化学习，减少训练时间并提高成功率。我们的工作通过提供基于模拟的合成数据生成解决方案，为具身智能中各种灵巧手行为的可扩展训练提供了一条可行的途径。我们的网站：此 https URL。

Keyword: diffusion policy

Improving Robustness to Out-of-Distribution States in Imitation Learning via Deep Koopman-Boosted Diffusion Policy

通过深度 Koopman 增强扩散策略提高模仿学习中对分布外状态的鲁棒性

Authors: Dianye Huang, Nassir Navab, Zhongliang Jiang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.00555
Pdf link: https://arxiv.org/pdf/2511.00555
Abstract Integrating generative models with action chunking has shown significant promise in imitation learning for robotic manipulation. However, the existing diffusion-based paradigm often struggles to capture strong temporal dependencies across multiple steps, particularly when incorporating proprioceptive input. This limitation can lead to task failures, where the policy overfits to proprioceptive cues at the expense of capturing the visually derived features of the task. To overcome this challenge, we propose the Deep Koopman-boosted Dual-branch Diffusion Policy (D3P) algorithm. D3P introduces a dual-branch architecture to decouple the roles of different sensory modality combinations. The visual branch encodes the visual observations to indicate task progression, while the fused branch integrates both visual and proprioceptive inputs for precise manipulation. Within this architecture, when the robot fails to accomplish intermediate goals, such as grasping a drawer handle, the policy can dynamically switch to execute action chunks generated by the visual branch, allowing recovery to previously observed states and facilitating retrial of the task. To further enhance visual representation learning, we incorporate a Deep Koopman Operator module that captures structured temporal dynamics from visual inputs. During inference, we use the test-time loss of the generative model as a confidence signal to guide the aggregation of the temporally overlapping predicted action chunks, thereby enhancing the reliability of policy execution. In simulation experiments across six RLBench tabletop tasks, D3P outperforms the state-of-the-art diffusion policy by an average of 14.6%. On three real-world robotic manipulation tasks, it achieves a 15.0% improvement. Code: this https URL.
中文摘要 将生成模型与动作分块相结合在机器人操作的模仿学习中显示出巨大的前景。然而，现有的基于扩散的范式通常难以捕获跨多个步骤的强时间依赖性，特别是在引入本体感觉输入时。这种限制可能导致任务失败，即策略过度拟合本体感觉线索，而牺牲了对任务视觉衍生特征的捕获。为了克服这一挑战，我们提出了 Deep Koopman 增强的双分支扩散策略（D3P）算法。D3P 引入了双分支架构来解耦不同感觉模态组合的作用。视觉分支对视觉观察进行编码以指示任务进展，而融合分支则集成视觉和本体感觉输入以进行精确操作。在这种架构中，当机器人无法完成中间目标（例如抓住抽屉把手）时，策略可以动态切换以执行视觉分支生成的动作块，从而允许恢复到先前观察到的状态并促进任务的重试。为了进一步增强视觉表示学习，我们整合了一个 Deep Koopman Operator 模块，该模块从视觉输入中捕获结构化的时间动态。在推理过程中，我们使用生成模型的测试时损失作为置信信号，来指导时间重叠的预测动作块的聚合，从而增强策略执行的可靠性。在六个 RLBench 桌面任务的模拟实验中，D3P 的性能平均比最先进的扩散策略高出 14.6%。在三个真实世界的机器人操作任务中，它实现了 15.0% 的改进。代码：此 https URL。