Arxiv Papers of Today

Generated: 2025-11-05 18:05:23 (UTC+8); arXiv release time: 2025-11-05 20:00 EST (2025-11-06 09:00 UTC+8)

26 relevant papers today

Keyword: reinforcement learning

Tool Zero: Training Tool-Augmented LLMs via Pure RL from Scratch

Tool Zero：通过 Pure RL 从头开始训练工具增强的 LLM

Authors: Yirong Zeng, Xiao Ding, Yutai Hou, Yuxian Wang, Li Du, Juyi Dai, Qiuyang Ding, Duyu Tang, Dandan Tu, Weiwen Liu, Bing Qin, Ting Liu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.01934
Pdf link: https://arxiv.org/pdf/2511.01934
Abstract Training tool-augmented LLMs has emerged as a promising approach to enhancing language models' capabilities for complex tasks. The current supervised fine-tuning paradigm relies on constructing extensive domain-specific datasets to train models. However, this approach often struggles to generalize effectively to unfamiliar or intricate tool-use scenarios. More recently, the reinforcement learning (RL) paradigm has emerged as a way to endow LLMs with superior reasoning and generalization abilities. In this work, we address a key question: Can pure RL be used to effectively elicit a model's intrinsic reasoning capabilities and enhance tool-agnostic generalization? We propose a dynamic generalization-guided reward design for rule-based RL, which progressively shifts rewards from exploratory to exploitative tool-use patterns. Based on this design, we introduce the Tool-Zero series of models. These models are trained to enable LLMs to autonomously utilize general tools by directly scaling up RL from Zero models (i.e., base models without post-training). Experimental results demonstrate that our models achieve over 7% performance improvement compared to both SFT and RL-with-SFT models under the same experimental settings. These gains are consistently replicated across cross-dataset and intra-dataset evaluations, validating the effectiveness and robustness of our methods.
中文摘要 训练工具增强的 LLM 已成为增强语言模型处理复杂任务能力的一种有前途的方法。当前的监督微调范式依赖于构建广泛的特定领域数据集来训练模型。然而，这种方法通常难以有效地推广到不熟悉或复杂的工具使用场景。近年来，强化学习（RL）范式可以赋予LLM卓越的推理和泛化能力。在这项工作中，我们解决了一个关键问题：纯RL能否有效地引出模型的内在推理能力并增强与工具无关的泛化？我们提出了一种基于规则的 RL 的动态泛化引导奖励设计，该设计逐步将奖励从探索性工具使用模式转变为利用性工具使用模式。基于这种设计，我们推出了 Tool-Zero 系列模型。这些模型经过训练，通过直接从 Zero 模型（即未经后训练的基础模型）扩展 RL，使 LLM 能够自主使用通用工具。实验结果表明，在相同的实验设置下，与SFT和RL-with-SFT模型相比，我们的模型的性能提高了7%以上。这些收益在跨数据集和数据集内评估中一致复制，验证了我们方法的有效性和稳健性。
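
The reward-shifting idea in the abstract can be illustrated with a minimal sketch. This is not the paper's reward. It assumes a hypothetical exploratory signal (the model emitted a well-formed tool call) and a hypothetical exploitative signal (the call also matches a reference), and it moves weight linearly from the first to the second as training progresses. The names `dynamic_reward`, `format_ok`, `call_correct`, and the linear schedule are all assumptions made for illustration.

```python
def dynamic_reward(format_ok: bool, call_correct: bool,
                   step: int, total_steps: int) -> float:
    """Blend an exploratory and an exploitative reward (hypothetical schedule)."""
    # Training progress clipped to [0, 1].
    progress = min(max(step / total_steps, 0.0), 1.0)

    # Early in training, reward any well-formed tool call (exploration).
    explore_reward = 1.0 if format_ok else 0.0
    # Late in training, reward only well-formed calls that are also correct.
    exploit_reward = 1.0 if (format_ok and call_correct) else 0.0

    # Linear hand-over from the exploratory to the exploitative term.
    return (1.0 - progress) * explore_reward + progress * exploit_reward
```

Under this schedule a well-formed but wrong call earns full reward at step 0 and nothing at the final step, while a correct call earns full reward throughout.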

Shorter but not Worse: Frugal Reasoning via Easy Samples as Length Regularizers in Math RLVR

更短但不是更差：通过简单的样本作为数学 RLVR 中的长度正则化器进行节俭推理

Authors: Abdelaziz Bounhar, Hadi Abdine, Evan Dufraisse, Ahmad Chamma, Amr Mohamed, Dani Bouch, Michalis Vazirgiannis, Guokan Shang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2511.01937
Pdf link: https://arxiv.org/pdf/2511.01937
Abstract Large language models (LLMs) trained for step-by-step reasoning often become excessively verbose, raising inference cost. Standard Reinforcement Learning with Verifiable Rewards (RLVR) pipelines filter out "easy" problems for training efficiency, leaving the model to train primarily on harder problems that require longer reasoning chains. This skews the output length distribution upward, resulting in a model that conflates "thinking longer" with "thinking better". In this work, we show that retaining and modestly up-weighting moderately easy problems acts as an implicit length regularizer. Exposing the model to solvable short-chain tasks constrains its output distribution and prevents runaway verbosity. The result is emergent brevity for free: the model learns to solve harder problems without inflating the output length, despite the absence of any explicit length penalization. RLVR experiments using this approach on Qwen3-4B-Thinking-2507 (with a 16k token limit) achieve baseline pass@1 AIME25 accuracy while generating solutions that are, on average, nearly twice as short. The code is available on GitHub, with datasets and models on Hugging Face.
中文摘要 经过分步推理训练的大型语言模型（LLM）通常会变得过于冗长，从而增加推理成本。标准的可验证奖励强化学习（RLVR）流程会过滤掉“简单”问题以提高训练效率，使模型主要在需要更长推理链的较难问题上训练。这会使输出长度分布向上偏移，导致模型将“思考更久”与“思考得更好”混为一谈。在这项工作中，我们表明，保留中等偏易的问题并适度提高其权重，可以起到隐式长度正则化器的作用。让模型接触可解的短链任务可以约束其输出分布，防止冗长失控。结果是免费获得的涌现式简洁：模型学会解决更难的问题而不增加输出长度，尽管没有任何显式的长度惩罚。在 Qwen3-4B-Thinking-2507（16k token 限制）上使用这种方法的 RLVR 实验达到了基线的 pass@1 AIME25 准确率，同时生成的解答平均缩短近一半。代码可在 GitHub 获取，数据集和模型位于 Hugging Face。
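
As a rough illustration of "retaining and modestly up-weighting moderately easy problems", the sketch below builds per-problem sampling weights from estimated pass rates. The thresholds (`easy_low`, `easy_high`), the `boost` factor, and the choice to drop problems with a pass rate of exactly 0 or 1 are assumptions for this example, not values from the paper.

```python
import random

def sampling_weights(pass_rates, easy_low=0.6, easy_high=0.9, boost=2.0):
    """Per-problem sampling weights (all thresholds are hypothetical)."""
    weights = []
    for p in pass_rates:
        if p <= 0.0 or p >= 1.0:
            # Assumed uninformative here: every rollout gets the same 0/1 reward.
            weights.append(0.0)
        elif easy_low <= p <= easy_high:
            # Moderately easy problems are kept and modestly up-weighted.
            weights.append(boost)
        else:
            weights.append(1.0)
    return weights

def sample_batch(problems, pass_rates, batch_size, seed=0):
    """Draw a training batch with replacement according to the weights."""
    rng = random.Random(seed)
    weights = sampling_weights(pass_rates)
    return rng.choices(problems, weights=weights, k=batch_size)
```

The short-chain problems stay in the training mix, which is the mechanism the abstract credits with regularizing output length.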

Automated Reward Design for Gran Turismo

Gran Turismo的自动奖励设计

Authors: Michel Ma, Takuma Seno, Kaushik Subramanian, Peter R. Wurman, Peter Stone, Craig Sherstan
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.02094
Pdf link: https://arxiv.org/pdf/2511.02094
Abstract When designing reinforcement learning (RL) agents, a designer communicates the desired agent behavior through the definition of reward functions - numerical feedback given to the agent as reward or punishment for its actions. However, mapping desired behaviors to reward functions can be a difficult process, especially in complex environments such as autonomous racing. In this paper, we demonstrate how current foundation models can effectively search over a space of reward functions to produce desirable RL agents for the Gran Turismo 7 racing game, given only text-based instructions. Through a combination of LLM-based reward generation, VLM preference-based evaluation, and human feedback we demonstrate how our system can be used to produce racing agents competitive with GT Sophy, a champion-level RL racing agent, as well as generate novel behaviors, paving the way for practical automated reward design in real world applications.
中文摘要 在设计强化学习（RL）代理时，设计者通过定义奖励函数来传达所需的代理行为——给予代理的数字反馈，作为对其行为的奖励或惩罚。然而，将期望的行为映射到奖励功能可能是一个困难的过程，尤其是在自动驾驶赛车等复杂环境中。在本文中，我们展示了当前的基础模型如何有效地搜索奖励函数空间，以仅给定基于文本的指令，为《跑车浪漫旅 7》赛车游戏生成理想的 RL 代理。通过结合基于 LLM 的奖励生成、基于 VLM 偏好的评估和人类反馈，我们展示了如何使用我们的系统来生产与冠军级 RL 赛车代理 GT Sophy 竞争的赛车代理，并生成新颖的行为，为实际应用中的实际自动化奖励设计铺平道路。

Second-Order Policy Gradient Methods for the Linear Quadratic Regulator

线性二次调节器的二阶策略梯度方法

Authors: Amirreza Valaei, Arash Bahari Kordabad, Sadegh Soudjani
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2511.02095
Pdf link: https://arxiv.org/pdf/2511.02095
Abstract Policy gradient methods are a powerful family of reinforcement learning algorithms for continuous control that optimize a policy directly. However, standard first-order methods often converge slowly. Second-order methods can accelerate learning by using curvature information, but they are typically expensive to compute. The linear quadratic regulator (LQR) is a practical setting in which key quantities, such as the policy gradient, admit closed-form expressions. In this work, we develop second-order policy gradient algorithms for LQR by deriving explicit formulas for both the approximate and exact Hessians used in Gauss--Newton and Newton methods, respectively. Numerical experiments show a faster convergence rate for the proposed second-order approach over the standard first-order policy gradient baseline.
中文摘要 策略梯度方法是一类功能强大的强化学习算法，用于连续控制，可直接优化策略。然而，标准一阶方法通常收敛缓慢。二阶方法可以通过使用曲率信息来加速学习，但它们的计算成本通常很高。线性二次调节器（LQR）是一种实用的设置，其中关键量（例如策略梯度）具有闭式表达式。在这项工作中，我们分别推导了高斯-牛顿法和牛顿法所使用的近似和精确 Hessian 矩阵的显式公式，从而开发了 LQR 的二阶策略梯度算法。数值实验表明，所提出的二阶方法比标准一阶策略梯度基线收敛更快。
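
The gap between first-order and Gauss-Newton updates can be seen on a scalar discrete-time LQR, where everything is available in closed form. The sketch below uses the standard expressions from the LQR policy-gradient literature, not formulas taken from this paper: for x' = a x + b u with u = -k x, the cost matrix is P = (q + r k^2) / (1 - (a - b k)^2) and the gradient is 2 ((r + b^2 P) k - b P a) Sigma. The system values, step sizes, and function names are assumptions chosen for illustration.

```python
def lyapunov_cost(k, a, b, q, r):
    """P_k solving P = q + r k^2 + (a - b k)^2 P for the policy u = -k x."""
    c = a - b * k
    if abs(c) >= 1.0:
        raise ValueError("gain k is not stabilizing")
    return (q + r * k * k) / (1.0 - c * c)

def state_covariance(k, a, b, x0=1.0):
    """Sigma_k = sum_t x_t^2 for a fixed initial state x0."""
    return x0 * x0 / (1.0 - (a - b * k) ** 2)

def gradient(k, a, b, q, r):
    """Closed-form policy gradient dC/dk = 2 ((r + b^2 P) k - b P a) Sigma."""
    p = lyapunov_cost(k, a, b, q, r)
    return 2.0 * ((r + b * b * p) * k - b * p * a) * state_covariance(k, a, b)

def run(update, k0, tol=1e-12, max_iters=10_000):
    """Iterate k <- update(k) until the change falls below tol."""
    k = k0
    for i in range(1, max_iters + 1):
        k_new = update(k)
        if abs(k_new - k) < tol:
            return k_new, i
        k = k_new
    return k, max_iters

# Illustrative (assumed) system: open-loop unstable, k0 = 0.5 is stabilizing.
a, b, q, r = 1.1, 1.0, 1.0, 1.0

def gd_step(k, lr=0.05):
    # First-order policy gradient step.
    return k - lr * gradient(k, a, b, q, r)

def gn_step(k, eta=0.5):
    # Gauss-Newton step: precondition by (r + b^2 P) and Sigma.
    p = lyapunov_cost(k, a, b, q, r)
    precond = (r + b * b * p) * state_covariance(k, a, b)
    return k - eta * gradient(k, a, b, q, r) / precond

k_gd, iters_gd = run(gd_step, 0.5)
k_gn, iters_gn = run(gn_step, 0.5)

# Reference gain from iterating the discrete-time Riccati equation.
p = q
for _ in range(500):
    p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
k_star = a * b * p / (r + b * b * p)
```

With `eta=0.5` the Gauss-Newton step reduces to k <- a b P / (r + b^2 P), which is classical policy iteration. On this toy problem both methods reach the same gain and the Gauss-Newton iteration needs fewer steps than the fixed-step gradient iteration.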

A Quantitative Comparison of Centralised and Distributed Reinforcement Learning-Based Control for Soft Robotic Arms

基于集中式和分布式强化学习的软机械臂控制的定量比较

Authors: Linxin Hou, Qirui Wu, Zhihang Qin, Neil Banerjee, Yongxin Guo, Cecilia Laschi
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.02192
Pdf link: https://arxiv.org/pdf/2511.02192
Abstract This paper presents a quantitative comparison between centralised and distributed multi-agent reinforcement learning (MARL) architectures for controlling a soft robotic arm modelled as a Cosserat rod in simulation. Using PyElastica and the OpenAI Gym interface, we train both a global Proximal Policy Optimisation (PPO) controller and a Multi-Agent PPO (MAPPO) under identical budgets. Both approaches are based on the arm having $n$ number of controlled sections. The study systematically varies $n$ and evaluates the performance of the arm to reach a fixed target in three scenarios: default baseline condition, recovery from external disturbance, and adaptation to actuator failure. Quantitative metrics used for the evaluation are mean action magnitude, mean final distance, mean episode length, and success rate. The results show that there are no significant benefits of the distributed policy when the number of controlled sections $n\le4$. In very simple systems, when $n\le2$, the centralised policy outperforms the distributed one. When $n$ increases to $4< n\le 12$, the distributed policy shows a high sample efficiency. In these systems, distributed policy promotes a stronger success rate, resilience, and robustness under local observability and yields faster convergence given the same sample size. However, centralised policies achieve much higher time efficiency during training as it takes much less time to train the same size of samples. These findings highlight the trade-offs between centralised and distributed policy in reinforcement learning-based control for soft robotic systems and provide actionable design guidance for future sim-to-real transfer in soft rod-like manipulators.
中文摘要 本文对集中式和分布式多智能体强化学习（MARL）架构进行了定量比较，用于在仿真中控制建模为Cosserat杆的软机械臂。使用 PyElastica 和 OpenAI Gym 接口，我们在相同的预算下训练全局近端策略优化（PPO）控制器和多智能体 PPO （MAPPO）。这两种方法都基于具有 $n$ 个受控部分的手臂。该研究系统地改变了 $n$，并评估了手臂在三种情况下达到固定目标的性能：默认基线条件、从外部干扰中恢复以及适应执行器故障。用于评估的定量指标是平均动作幅度、平均最终距离、平均回合长度和成功率。结果表明，当受控部分数量$n\le4$时，分布式策略没有显著的收益。在非常简单的系统中，当$n\le2$时，集中式策略的性能优于分布式策略。当$n$增加到$4<n\le 12$时，分布式策略显示出较高的样本效率。在这些系统中，分布式策略在局部可观测性下促进了更强的成功率、弹性和鲁棒性，并在相同的样本量下产生更快的收敛。然而，集中式策略在训练过程中实现了更高的时间效率，因为训练相同大小的样本所需的时间要少得多。这些发现强调了软机器人系统基于强化学习的控制中集中式和分布式策略之间的权衡，并为软棒状机械手的未来模拟到真实转移提供了可操作的设计指导。
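
The four evaluation metrics named in the abstract are simple aggregates over episodes. The sketch below assumes a hypothetical episode record layout (`final_distance`, `length`, `action_magnitudes`) and assumes that success means ending within `success_radius` of the target. Neither is specified in the abstract.

```python
from statistics import mean

def summarize(episodes, success_radius):
    """Aggregate evaluation metrics over a list of episode records.

    Each record is a dict with keys 'final_distance', 'length', and
    'action_magnitudes' (hypothetical layout for this example).
    """
    return {
        "mean_final_distance": mean(e["final_distance"] for e in episodes),
        "mean_episode_length": mean(e["length"] for e in episodes),
        # Average per-episode mean action magnitude across episodes.
        "mean_action_magnitude": mean(
            mean(e["action_magnitudes"]) for e in episodes
        ),
        # Assumed success criterion: tip ends within success_radius of target.
        "success_rate": sum(
            e["final_distance"] <= success_radius for e in episodes
        ) / len(episodes),
    }
```

Running the same summary over centralised and distributed runs at each value of $n$ gives directly comparable numbers.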

Training Proactive and Personalized LLM Agents

培训主动和个性化的 LLM 代理

Authors: Weiwei Sun, Xuhui Zhou, Weihua Du, Xingyao Wang, Sean Welleck, Graham Neubig, Maarten Sap, Yiming Yang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.02208
Pdf link: https://arxiv.org/pdf/2511.02208
Abstract While existing work focuses primarily on task success, we argue that effective real-world agents require optimizing three dimensions: productivity (task completion), proactivity (asking essential questions), and personalization (adapting to diverse user preferences). We introduce UserVille, an interactive environment with LLM-based user simulators enabling diverse, configurable user preferences. Leveraging UserVille, we introduce PPP, a multi-objective reinforcement learning approach that jointly optimizes all three dimensions: Productivity, Proactivity, and Personalization. Experiments on software engineering and deep research tasks show that agents trained with PPP achieve substantial improvements over strong baselines such as GPT-5 (+21.6 on average), demonstrating the ability to ask strategic clarifying questions, adapt to unseen user preferences, and improve task success through better interaction. This work demonstrates that explicitly optimizing for user-centered interaction is critical for building practical and effective AI agents.
中文摘要 虽然现有工作主要关注任务成功，但我们认为有效的现实世界代理需要优化三个维度：生产力（任务完成）、主动性（提出基本问题）和个性化（适应不同的用户偏好）。我们介绍了 UserVille，这是一个交互式环境，具有基于 LLM 的用户模拟器，可实现多样化、可配置的用户偏好。利用 UserVille，我们引入了 PPP，这是一种多目标强化学习方法，可共同优化所有三个维度：生产力、主动性和个性化。软件工程和深度研究任务的实验表明，接受 PPP 训练的智能体比 GPT-5 等强基线（平均 +21.6）取得了实质性改进，展示了提出战略澄清问题的能力，适应看不见的用户偏好，并通过更好的交互提高任务成功率。这项工作表明，明确优化以用户为中心的交互对于构建实用且有效的人工智能代理至关重要。

Adaptive Cooperative Transmission Design for Ultra-Reliable Low-Latency Communications via Deep Reinforcement Learning

基于深度强化学习的超可靠低时延通信的自适应协同传输设计

Authors: Hyemin Yu, Hong-Chuan Yang
Subjects: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.02216
Pdf link: https://arxiv.org/pdf/2511.02216
Abstract Next-generation wireless communication systems must support ultra-reliable low-latency communication (URLLC) service for mission-critical applications. Meeting stringent URLLC requirements is challenging, especially for two-hop cooperative communication. In this paper, we develop an adaptive transmission design for a two-hop relaying communication system. Each hop transmission adaptively configures its transmission parameters separately, including numerology, mini-slot size, and modulation and coding scheme, for reliable packet transmission within a strict latency constraint. We formulate the hop-specific transceiver configuration as a Markov decision process (MDP) and propose a dual-agent reinforcement learning-based cooperative latency-aware transmission (DRL-CoLA) algorithm to learn latency-aware transmission policies in a distributed manner. Simulation results verify that the proposed algorithm achieves the near-optimal reliability while satisfying strict latency requirements.
中文摘要 下一代无线通信系统必须支持用于关键任务应用的超可靠低延迟通信（URLLC）服务。满足严格的 URLLC 要求具有挑战性，尤其是对于两跳协作通信。本文开发了一种两跳中继通信系统的自适应传输设计。每个跳传输都自适应地单独配置其传输参数，包括参数集（numerology）、微型时隙大小以及调制和编码方案，以便在严格的延迟限制内实现可靠的数据包传输。我们将跳特定收发器配置表述为马尔可夫决策过程（MDP），并提出一种基于双智能体强化学习的协同延迟感知传输（DRL-CoLA）算法，以分布式方式学习延迟感知传输策略。仿真结果验证了所提算法在满足严格的时延要求的同时实现了近乎最优的可靠性。
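
To see why numerology and mini-slot size matter for a latency budget, the sketch below uses the 5G NR relationship between numerology index mu, subcarrier spacing (15 * 2^mu kHz), and slot duration (1 ms / 2^mu, with 14 OFDM symbols per slot under normal cyclic prefix). The mini-slot lengths of 2, 4, and 7 symbols, the even split of slot time across symbols, and the helper `feasible_configs` are simplifying assumptions for this example and are not the paper's system model.

```python
SYMBOLS_PER_SLOT = 14  # normal cyclic prefix

def slot_duration_ms(mu: int) -> float:
    """Slot duration for numerology mu (subcarrier spacing 15 * 2**mu kHz)."""
    return 1.0 / (2 ** mu)

def mini_slot_duration_ms(mu: int, symbols: int) -> float:
    """Approximate mini-slot duration, splitting the slot evenly per symbol."""
    return slot_duration_ms(mu) * symbols / SYMBOLS_PER_SLOT

def feasible_configs(budget_ms, attempts, mus=(0, 1, 2, 3), sizes=(2, 4, 7)):
    """(mu, symbols) pairs whose back-to-back attempts fit in the budget.

    Hypothetical helper: ignores processing, queuing, and feedback delays.
    """
    return [
        (mu, n)
        for mu in mus
        for n in sizes
        if attempts * mini_slot_duration_ms(mu, n) <= budget_ms
    ]
```

A higher numerology or a shorter mini-slot leaves room for more transmission attempts within the same deadline, which is the kind of trade-off an adaptive per-hop policy has to weigh against reliability.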

Optimizing Multi-Lane Intersection Performance in Mixed Autonomy Environments

在混合自动驾驶环境中优化多车道交叉路口性能

Authors: Manonmani Sekar, Nasim Nezamoddini
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.02217
Pdf link: https://arxiv.org/pdf/2511.02217
Abstract One of the main challenges in managing traffic at multilane intersections is ensuring smooth coordination between human-driven vehicles (HDVs) and connected autonomous vehicles (CAVs). This paper presents a novel traffic signal control framework that combines Graph Attention Networks (GAT) with Soft Actor-Critic (SAC) reinforcement learning to address this challenge. GATs are used to model the dynamic graph-structured nature of traffic flow to capture spatial and temporal dependencies between lanes and signal phases. The proposed SAC is a robust off-policy reinforcement learning algorithm that enables adaptive signal control through entropy-optimized decision making. This design allows the system to coordinate the signal timing and vehicle movement simultaneously with objectives focused on minimizing travel time, enhancing performance, ensuring safety, and improving fairness between HDVs and CAVs. The model is evaluated using a SUMO-based simulation of a four-way intersection and incorporating different traffic densities and CAV penetration rates. The experimental results demonstrate the effectiveness of the GAT-SAC approach by achieving a 24.1% reduction in average delay and up to 29.2% fewer traffic violations compared to traditional methods. Additionally, the fairness ratio between HDVs and CAVs improved to 1.59, indicating more equitable treatment across vehicle types. These findings suggest that the GAT-SAC framework holds significant promise for real-world deployment in mixed-autonomy traffic systems.
中文摘要 管理多车道交叉口交通的主要挑战之一是确保人类驾驶车辆（HDV）和联网自动驾驶汽车（CAV）之间的顺畅协调。本文提出了一种新颖的交通信号控制框架，该框架将图注意力网络（GAT）与 Soft Actor-Critic（SAC）强化学习相结合，以应对这一挑战。GAT用于对交通流的动态图结构性质进行建模，以捕获车道和信号相位之间的空间和时间依赖性。所提出的SAC是一种鲁棒的离策略（off-policy）强化学习算法，通过熵优化决策实现自适应信号控制。这种设计使系统能够同时协调信号定时和车辆运动，其目标侧重于最大限度地减少行驶时间、提高性能、确保安全性以及提高 HDV 和 CAV 之间的公平性。该模型使用基于 SUMO 的四向交叉口模拟进行评估，并结合了不同的交通密度和 CAV 渗透率。实验结果证明了 GAT-SAC 方法的有效性，与传统方法相比，平均延迟减少了 24.1%，交通违规行为最多减少 29.2%。此外，HDV 和 CAV 之间的公平性比率提高到 1.59，表明不同车型的待遇更加公平。这些发现表明，GAT-SAC 框架在混合自动驾驶交通系统中的实际部署具有巨大的前景。

Structural Plasticity as Active Inference: A Biologically-Inspired Architecture for Homeostatic Control

结构可塑性作为主动推理：一种受生物启发的稳态控制架构

Authors: Brennen A. Hill
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Arxiv link: https://arxiv.org/abs/2511.02241
Pdf link: https://arxiv.org/pdf/2511.02241
Abstract Traditional neural networks, while powerful, rely on biologically implausible learning mechanisms such as global backpropagation. This paper introduces the Structurally Adaptive Predictive Inference Network (SAPIN), a novel computational model inspired by the principles of active inference and the morphological plasticity observed in biological neural cultures. SAPIN operates on a 2D grid where processing units, or cells, learn by minimizing local prediction errors. The model features two primary, concurrent learning mechanisms: a local, Hebbian-like synaptic plasticity rule based on the temporal difference between a cell's actual activation and its learned expectation, and a structural plasticity mechanism where cells physically migrate across the grid to optimize their information-receptive fields. This dual approach allows the network to learn both how to process information (synaptic weights) and also where to position its computational resources (network topology). We validated the SAPIN model on the classic Cart Pole reinforcement learning benchmark. Our results demonstrate that the architecture can successfully solve the CartPole task, achieving robust performance. The network's intrinsic drive to minimize prediction error and maintain homeostasis was sufficient to discover a stable balancing policy. We also found that while continual learning led to instability, locking the network's parameters after achieving success resulted in a stable policy. When evaluated for 100 episodes post-locking (repeated over 100 successful agents), the locked networks maintained an average 82% success rate.
中文摘要 传统的神经网络虽然功能强大，但依赖于生物学上难以置信的学习机制，例如全局反向传播。本文介绍了结构自适应预测推理网络（SAPIN），这是一种受主动推理原理和生物神经培养物中观察到的形态可塑性的启发的新型计算模型。SAPIN 在 2D 网格上运行，其中处理单元或单元通过最大限度地减少局部预测误差来学习。该模型具有两种主要的并发学习机制：基于细胞实际激活与其学习期望之间的时间差异的局部类赫布突触可塑性规则，以及细胞在网格上物理迁移以优化其信息感受野的结构可塑性机制。这种双重方法允许网络学习如何处理信息（突触权重）以及将其计算资源放置在何处（网络拓扑）。我们在经典的 Cart Pole 强化学习基准上验证了 SAPIN 模型。我们的结果表明，该架构可以成功解决CartPole任务，实现鲁棒的性能。该网络最小化预测误差和维持体内平衡的内在驱动力足以发现稳定的平衡策略。我们还发现，虽然持续学习会导致不稳定，但在取得成功后锁定网络参数会产生稳定的策略。当对锁定后的 100 个事件（重复 100 多个成功的代理）进行评估时，锁定的网络平均保持了 82% 的成功率。

SAIL-RL: Guiding MLLMs in When and How to Think via Dual-Reward RL Tuning

SAIL-RL：通过双重奖励 RL 调整指导 MLLM 何时以及如何思考

Authors: Fangxun Shu, Yongjie Ye, Yue Liao, Zijian Kang, Weijie Yin, Jiacong Wang, Xiao Liang, Shuicheng Yan, Chao Feng
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.02280
Pdf link: https://arxiv.org/pdf/2511.02280
Abstract We introduce SAIL-RL, a reinforcement learning (RL) post-training framework that enhances the reasoning capabilities of multimodal large language models (MLLMs) by teaching them when and how to think. Existing approaches are limited by outcome-only supervision, which rewards correct answers without ensuring sound reasoning, and by uniform thinking strategies, which often lead to overthinking on simple tasks and underthinking on complex ones. SAIL-RL addresses these challenges with a dual reward system: the Thinking Reward, which evaluates reasoning quality through factual grounding, logical coherence, and answer consistency, and the Judging Reward, which adaptively determines whether deep reasoning or direct answering is appropriate. Experiments on the state-of-the-art SAIL-VL2 show that SAIL-RL improves reasoning and multimodal understanding benchmarks at both 4B and 8B scales, achieving competitive performance against commercial closed-source models such as GPT-4o, and substantially reduces hallucinations, establishing it as a principled framework for building more reliable and adaptive MLLMs. The code will be available at this https URL.
中文摘要 我们介绍了 SAIL-RL，这是一种强化学习（RL）后训练框架，它通过教多模态大语言模型（MLLM）何时以及如何思考来增强其推理能力。现有方法受到仅结果监督的限制，即奖励正确答案而不确保合理的推理，以及统一的思维策略，这往往导致对简单任务的过度思考而对复杂任务的思考不足。SAIL-RL 通过双重奖励系统来应对这些挑战：思维奖励，通过事实基础、逻辑连贯性和答案一致性来评估推理质量，以及判断奖励，自适应地确定深度推理或直接回答是否合适。对最先进的 SAIL-VL2 的实验表明，SAIL-RL 在 4B 和 8B 尺度上都改进了推理和多模态理解基准，实现了与 GPT-4o 等商业闭源模型相比的竞争性能，并大大减少了幻觉，使其成为构建更可靠和自适应的 MLLM 的原则框架。该代码将在此 https URL 上提供。

Reinforcement learning based data assimilation for unknown state model

基于强化学习的未知状态模型数据同化

Authors: Ziyi Wang, Lijian Jiang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.02286
Pdf link: https://arxiv.org/pdf/2511.02286
Abstract Data assimilation (DA) has increasingly emerged as a critical tool for state estimation across a wide range of applications. It is significantly challenging when the governing equations of the underlying dynamics are unknown. To this end, various machine learning approaches have been employed to construct a surrogate state transition model in a supervised learning framework, which relies on pre-computed training datasets. However, it is often infeasible to obtain noise-free ground-truth state sequences in practice. To address this challenge, we propose a novel method that integrates reinforcement learning with ensemble-based Bayesian filtering methods, enabling the learning of surrogate state transition model for unknown dynamics directly from noisy observations, without using true state trajectories. Specifically, we treat the process for computing maximum likelihood estimation of surrogate model parameters as a sequential decision-making problem, which can be formulated as a discrete-time Markov decision process (MDP). Under this formulation, learning the surrogate transition model is equivalent to finding an optimal policy of the MDP, which can be effectively addressed using reinforcement learning techniques. Once the model is trained offline, state estimation can be performed in the online stage using filtering methods based on the learned dynamics. The proposed framework accommodates a wide range of observation scenarios, including nonlinear and partially observed measurement models. A few numerical examples demonstrate that the proposed method achieves superior accuracy and robustness in high-dimensional settings.
中文摘要 数据同化（DA）已日益成为各种应用中状态估计的关键工具。当基本动力学的控制方程未知时，这是非常具有挑战性的。为此，人们采用了各种机器学习方法在监督学习框架中构建代理状态转换模型，该框架依赖于预先计算的训练数据集。然而，在实践中获得无噪声的真实状态序列通常是不可行的。为了应对这一挑战，我们提出了一种新方法，将强化学习与基于集合的贝叶斯滤波方法相结合，能够直接从噪声观测中学习未知动力学的代理状态转换模型，而无需使用真实状态轨迹。具体来说，我们将计算代理模型参数的最大似然估计过程视为一个顺序决策问题，可以表述为离散时间马尔可夫决策过程（MDP）。在这种公式下，学习代理转换模型相当于寻找MDP的最优策略，可以使用强化学习技术来有效解决。一旦模型离线训练完成，就可以使用基于学习到的动力学的滤波方法在在线阶段进行状态估计。所提出的框架适用于广泛的观测场景，包括非线性和部分观测测量模型。一些数值示例表明，所提出的方法在高维环境中实现了卓越的精度和鲁棒性。

Unlocking the Power of Multi-Agent LLM for Reasoning: From Lazy Agents to Deliberation

释放多智能体 LLM 推理的力量：从懒惰智能体到深思熟虑

Authors: Zhiwei Zhang, Xiaomin Li, Yudi Lin, Hui Liu, Ramraj Chandradevan, Linlin Wu, Minhua Lin, Fali Wang, Xianfeng Tang, Qi He, Suhang Wang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.02303
Pdf link: https://arxiv.org/pdf/2511.02303
Abstract Large Language Models (LLMs) trained with reinforcement learning and verifiable rewards have achieved strong results on complex reasoning tasks. Recent work extends this paradigm to a multi-agent setting, where a meta-thinking agent proposes plans and monitors progress while a reasoning agent executes subtasks through sequential conversational turns. Despite promising performance, we identify a critical limitation: lazy agent behavior, in which one agent dominates while the other contributes little, undermining collaboration and collapsing the setup to an ineffective single agent. In this paper, we first provide a theoretical analysis showing why lazy behavior naturally arises in multi-agent reasoning. We then introduce a stable and efficient method for measuring causal influence, helping mitigate this issue. Finally, as collaboration intensifies, the reasoning agent risks getting lost in multi-turn interactions and trapped by previous noisy responses. To counter this, we propose a verifiable reward mechanism that encourages deliberation by allowing the reasoning agent to discard noisy outputs, consolidate instructions, and restart its reasoning process when necessary. Extensive experiments demonstrate that our framework alleviates lazy agent behavior and unlocks the full potential of multi-agent framework for complex reasoning tasks.
中文摘要 通过强化学习和可验证奖励训练的大型语言模型（LLM）在复杂的推理任务上取得了强劲的成果。最近的工作将这种范式扩展到多智能体设置，其中元思维智能体提出计划并监控进度，而推理智能体通过连续的对话轮流执行子任务。尽管性能很有希望，但我们发现了一个关键的限制：懒惰代理行为，其中一个代理占主导地位，而另一个代理贡献不大，破坏了协作并将设置崩溃为无效的单个代理。在本文中，我们首先提供了一个理论分析，说明为什么懒惰行为在多智能体推理中自然产生。然后，我们引入了一种稳定有效的方法来测量因果影响，帮助缓解这个问题。最后，随着协作的加强，推理代理可能会迷失在多轮交互中，并被之前的嘈杂响应所困。为了解决这个问题，我们提出了一种可验证的奖励机制，通过允许推理代理丢弃嘈杂的输出、整合指令并在必要时重新启动其推理过程来鼓励审议。广泛的实验表明，我们的框架减轻了惰性代理行为，并释放了多智能体框架在复杂推理任务中的全部潜力。

Automata-Conditioned Cooperative Multi-Agent Reinforcement Learning

自动机条件协同多智能体强化学习

Authors: Beyazit Yalcinkaya, Marcell Vazquez-Chanlatte, Ameesh Shah, Hanna Krasowski, Sanjit A. Seshia
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.02304
Pdf link: https://arxiv.org/pdf/2511.02304
Abstract We study the problem of learning multi-task, multi-agent policies for cooperative, temporal objectives, under centralized training, decentralized execution. In this setting, using automata to represent tasks enables the decomposition of complex tasks into simpler sub-tasks that can be assigned to agents. However, existing approaches remain sample-inefficient and are limited to the single-task case. In this work, we present Automata-Conditioned Cooperative Multi-Agent Reinforcement Learning (ACC-MARL), a framework for learning task-conditioned, decentralized team policies. We identify the main challenges to ACC-MARL's feasibility in practice, propose solutions, and prove the correctness of our approach. We further show that the value functions of learned policies can be used to assign tasks optimally at test time. Experiments show emergent task-aware, multi-step coordination among agents, e.g., pressing a button to unlock a door, holding the door, and short-circuiting tasks.
中文摘要 我们研究了在集中训练、分散执行下学习合作、时间目标的多任务、多智能体策略的问题。在此设置中，使用自动机表示任务可以将复杂的任务分解为可以分配给智能体的更简单的子任务。然而，现有方法仍然样本效率低下，并且仅限于单任务情况。在这项工作中，我们提出了自动机条件合作多智能体强化学习（ACC-MARL），这是一个用于学习任务条件、分散团队策略的框架。我们确定了ACC-MARL在实践中可行性面临的主要挑战，提出了解决方案，并证明了我们方法的正确性。我们进一步表明，学习策略的价值函数可用于在测试时以最佳方式分配任务。实验表明，智能体之间涌现出任务感知的多步骤协调，例如，按下按钮解锁门、按住门和短路任务。
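
Representing a temporal task as an automaton can be made concrete with a small deterministic finite automaton (DFA). The sketch below encodes a hypothetical "press the button, then go through the door" task. The class `TaskDFA`, the state names, and the event symbols are illustrative assumptions and not the paper's implementation.

```python
class TaskDFA:
    """Minimal DFA that tracks progress on a temporal task."""

    def __init__(self, transitions, start, accepting):
        self.transitions = transitions  # {(state, symbol): next_state}
        self.state = start
        self.accepting = set(accepting)

    def step(self, symbol):
        # Unlisted (state, symbol) pairs leave the state unchanged (self-loop).
        self.state = self.transitions.get((self.state, symbol), self.state)
        return self.state

    def done(self):
        return self.state in self.accepting


def button_then_door():
    """Task: observe 'button' first, then 'door' (hypothetical symbols)."""
    transitions = {
        ("start", "button"): "unlocked",
        ("unlocked", "door"): "done",
    }
    return TaskDFA(transitions, start="start", accepting={"done"})
```

A policy conditioned on the current automaton state sees which sub-task remains. Reaching the door before the button leaves the automaton in `start`, so order matters.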

Large-scale automatic carbon ion treatment planning for head and neck cancers via parallel multi-agent reinforcement learning

通过并行多智能体强化学习大规模自动头颈癌碳离子治疗规划

Authors: Jueye Zhang, Chao Yang, Youfang Lai, Kai-Wen Li, Wenting Yan, Yunzhou Xia, Haimei Zhang, Jingjing Zhou, Gen Yang, Chen Lin, Tian Li, Yibao Zhang
Subjects: Machine Learning (cs.LG); Medical Physics (physics.med-ph)
Arxiv link: https://arxiv.org/abs/2511.02314
Pdf link: https://arxiv.org/pdf/2511.02314
Abstract Head-and-neck cancer (HNC) planning is difficult because multiple critical organs-at-risk (OARs) are close to complex targets. Intensity-modulated carbon-ion therapy (IMCT) offers superior dose conformity and OAR sparing but remains slow due to relative biological effectiveness (RBE) modeling, leading to laborious, experience-based, and often suboptimal tuning of many treatment-planning parameters (TPPs). Recent deep learning (DL) methods are limited by data bias and plan feasibility, while reinforcement learning (RL) struggles to efficiently explore the exponentially large TPP search space. We propose a scalable multi-agent RL (MARL) framework for parallel tuning of 45 TPPs in IMCT. It uses a centralized-training decentralized-execution (CTDE) QMIX backbone with Double DQN, Dueling DQN, and recurrent encoding (DRQN) for stable learning in a high-dimensional, non-stationary environment. To enhance efficiency, we (1) use compact historical DVH vectors as state inputs, (2) apply a linear action-to-value transform mapping small discrete actions to uniform parameter adjustments, and (3) design an absolute, clinically informed piecewise reward aligned with plan scores. A synchronous multi-process worker system interfaces with the PHOENIX TPS for parallel optimization and accelerated data collection. On a head-and-neck dataset (10 training, 10 testing), the method tuned 45 parameters simultaneously and produced plans comparable to or better than expert manual ones (relative plan score: RL $85.93\pm7.85\%$ vs Manual $85.02\pm6.92\%$), with significant ($p < 0.05$) improvements for five OARs. The framework efficiently explores high-dimensional TPP spaces and generates clinically competitive IMCT plans through direct TPS interaction, notably improving OAR sparing.
中文摘要 头颈癌（HNC）规划很困难，因为多个关键风险器官（OAR）接近复杂的靶点。调强碳离子疗法（IMCT）提供卓越的剂量一致性和 OAR 保留，但由于相对生物学有效性（RBE）建模而仍然缓慢，导致许多治疗计划参数（TPP）的费力、基于经验且通常次优的调整。最近的深度学习（DL）方法受到数据偏差和计划可行性的限制，而强化学习（RL）则难以有效地探索指数级的TPP搜索空间。我们提出了一个可扩展的多智能体RL（MARL）框架，用于在IMCT中并行调整45个TPP。它使用具有双 DQN、决斗 DQN 和递归编码（DRQN）的集中训练分散执行（CTDE） QMIX 主干网，以便在高维、非平稳环境中进行稳定学习。为了提高效率，我们（1）使用紧凑的历史 DVH 向量作为状态输入，（2）应用线性动作到值转换，将小的离散动作映射到统一的参数调整，以及（3）设计一个绝对的、临床知情的分段奖励与计划分数一致。同步多进程工作器系统与 PHOENIX TPS 接口，实现并行优化和加速数据收集。在头颈数据集（10 次训练，10 次测试）上，该方法同时调整了 45 个参数，并生成了与专家手动相当或更好的计划（相对计划得分：RL $85.93\pm7.85%$ vs 手动 $85.02\pm6.92%$），五个 OAR 有显着改进（p 值 $<$ 0.05）。该框架有效地探索了高维TPP空间，并通过直接的TPS相互作用生成了具有临床竞争力的IMCT计划，特别是改善了OAR的保留。
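
Two of the ingredients named in the abstract have simple generic forms. The sketch below shows a linear map from a small discrete action index to a symmetric parameter adjustment, and the standard Dueling DQN aggregation Q(s, a) = V(s) + A(s, a) - mean_a A(s, a). The function names, the symmetric range, and the `max_step` parameter are assumptions for illustration. The paper's actual transform and network are not given in the abstract.

```python
def action_to_delta(action_index: int, num_actions: int, max_step: float) -> float:
    """Map actions 0..num_actions-1 linearly onto [-max_step, +max_step]."""
    if num_actions < 2:
        raise ValueError("need at least two actions")
    if not 0 <= action_index < num_actions:
        raise ValueError("action index out of range")
    # Evenly spaced adjustments; the middle action (if any) means "no change".
    return (2.0 * action_index / (num_actions - 1) - 1.0) * max_step

def dueling_q(value: float, advantages: list) -> list:
    """Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean over a of A(s, a)."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + adv - mean_adv for adv in advantages]
```

With three actions and `max_step=0.1`, each agent can decrease, hold, or increase its parameter by a fixed amount, which keeps the per-agent action space small while 45 parameters are tuned in parallel.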

ChartM$^3$: A Multi-Stage Code-Driven Pipeline for Constructing Multi-Dimensional and Multi-Step Visual Reasoning Data in Chart Comprehension

ChartM$^3$：用于构建图表理解中多维多步骤视觉推理数据的多阶段代码驱动管道

Authors: Duo Xu, Hao Cheng, Xin Lin, Zhen Xie, Hao Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.02415
Pdf link: https://arxiv.org/pdf/2511.02415
Abstract Complex chart understanding tasks demand advanced visual recognition and reasoning capabilities from multimodal large language models (MLLMs). However, current research provides limited coverage of complex chart scenarios and computation-intensive reasoning tasks prevalent in real-world applications. This study proposes an automated multi-stage code-driven pipeline for systematically generating visual reasoning datasets to address these limitations. The pipeline integrates retrieval-augmented generation (RAG) to retrieve professional chart templates and employs chain-of-thought (CoT) strategies to generate reasoning codes that simulate real data distributions, thereby driving chart rendering and question-related statistical computations. Through model-based evaluation, the pipeline enhances chart diversity and data quality. Using this framework, we construct ChartM$^3$, a multi-dimensional and multi-step dataset containing 38K charts and 142K Q&A pairs for training, along with 2,871 high-quality evaluation samples for enabling practical performance assessment. Supervised fine-tuning (SFT) and reinforcement learning (RL) experiments demonstrate that our dataset significantly improves reasoning capabilities and cross-domain generalization performance, enabling smaller models to achieve performance comparable to larger-scale models in complex chart comprehension.
中文摘要 复杂的图表理解任务需要多模态大型语言模型（MLLM）的高级视觉识别和推理能力。然而，目前的研究对现实世界应用中普遍存在的复杂图表场景和计算密集型推理任务的覆盖范围有限。本研究提出了一种自动化的多阶段代码驱动管道，用于系统地生成视觉推理数据集，以解决这些限制。该管道集成了检索增强生成（RAG）来检索专业的图表模板，并采用思维链（CoT）策略来生成模拟真实数据分布的推理代码，从而推动图表渲染和与问题相关的统计计算。通过基于模型的评估，该管道增强了图表的多样性和数据质量。使用这个框架，我们构建了 ChartM$^3$，这是一个多维多步骤的数据集，包含 38K 张图表和 142K 个用于训练的问答对，以及 2,871 个用于实际绩效评估的高质量评估样本。监督微调（SFT）和强化学习（RL）实验表明，我们的数据集显著提高了推理能力和跨域泛化性能，使较小的模型能够在复杂图表理解中实现与大规模模型相当的性能。

Auditable-choice reframing unlocks RL-based verification for open-ended tasks

可审计选择重构为开放式任务解锁了基于 RL 的验证

Authors: Mengyu Zhang, Xubo Liu, Siyu Ding, Weichong Yin, Yu Sun, Hua Wu, Wenya Guo, Ying Zhang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.02463
Pdf link: https://arxiv.org/pdf/2511.02463
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated great potential in enhancing the reasoning capabilities of large language models (LLMs), achieving remarkable progress in domains such as mathematics and programming where standard answers are available. However, for open-ended tasks lacking ground-truth solutions (e.g., creative writing and instruction following), existing studies typically regard them as non-reasoning scenarios, thereby overlooking the latent value of reasoning capabilities. This raises a key question: Can strengthening reasoning improve performance in open-ended tasks? To address this, we explore the transfer of the RLVR paradigm to the open domain. Yet, since RLVR fundamentally relies on verifiers that presuppose the existence of standard answers, it cannot be directly applied to open-ended tasks. To overcome this challenge, we introduce Verifiable Multiple-Choice Reformulation (VMR), a novel training strategy that restructures open-ended data into verifiable multiple-choice formats, enabling effective training even in the absence of explicit ground truth. Experimental results on multiple benchmarks validate the effectiveness of our method in improving LLM performance on open-ended tasks. Notably, across eight open-ended benchmarks, our VMR-based training delivers an average gain of 5.99 points over the baseline. Code will be released upon acceptance to facilitate reproducibility.
中文摘要 具有可验证奖励的强化学习（RLVR）在增强大型语言模型（LLM）的推理能力方面显示出巨大的潜力，在数学和编程等有标准答案的领域取得了显着进展。然而，对于缺乏基本事实解决方案的开放式任务（例如，创意写作和指令遵循），现有研究通常将其视为非推理场景，从而忽视了推理能力的潜在价值。这就提出了一个关键问题：加强推理能否提高开放式任务的表现？为了解决这个问题，我们探索了将 RLVR 范式转移到开放领域。然而，由于 RLVR 从根本上依赖于以标准答案存在为前提的验证者，因此它不能直接应用于开放式任务。为了克服这一挑战，我们引入了可验证的多项选择重新表述（VMR），这是一种新颖的训练策略，可将开放式数据重组为可验证的多项选择格式，即使在没有明确的基本事实的情况下也能实现有效的训练。在多个基准测试上的实验结果验证了我们的方法在提高开放式任务的 LLM 性能方面的有效性。值得注意的是，在八个开放式基准测试中，我们基于 VMR 的培训比基线平均提高了 5.99 分。代码将在接受后发布，以促进可重复性。
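
The reformulation described in this abstract can be illustrated with a minimal sketch. The abstract does not say how the answer options are constructed, so the sketch assumes each open-ended prompt comes with one preferred response and some less-preferred ones. The names `to_multiple_choice` and `choice_reward` are hypothetical and are not taken from the paper.

```python
import random
import string


def to_multiple_choice(prompt, preferred, others, rng):
    """Turn one open-ended example into a multiple-choice item with a checkable key."""
    options = [preferred] + list(others)
    rng.shuffle(options)  # hide the preferred response at a random position
    labels = string.ascii_uppercase[: len(options)]
    lines = [prompt, ""] + [f"{label}. {text}" for label, text in zip(labels, options)]
    answer = labels[options.index(preferred)]
    return "\n".join(lines), answer


def choice_reward(model_output, answer):
    """Rule-based reward: 1.0 if the last non-space character is the key, else 0.0."""
    stripped = model_output.strip()
    return 1.0 if stripped and stripped[-1].upper() == answer else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    question, key = to_multiple_choice(
        "Write an opening line for a story about the sea.",
        "The tide kept every promise the town had broken.",
        ["The sea is big and has water.", "Once upon a time there was a sea."],
        rng,
    )
    print(question)
    print("key:", key)
    print(choice_reward(f"I compared the options. Answer: {key}", key))
```

Because the key is a single letter, the reward needs no learned judge. That is the property that lets an RLVR-style verifier work on data with no ground-truth answer.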

Dexterous Robotic Piano Playing at Scale

灵巧的机器人钢琴大规模演奏

Authors: Le Chen, Yi Zhao, Jan Schneider, Quankai Gao, Simon Guist, Cheng Qian, Juho Kannala, Bernhard Schölkopf, Joni Pajarinen, Dieter Büchler
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.02504
Pdf link: https://arxiv.org/pdf/2511.02504
Abstract Endowing robot hands with human-level dexterity has been a long-standing goal in robotics. Bimanual robotic piano playing represents a particularly challenging task: it is high-dimensional, contact-rich, and requires fast, precise control. We present OmniPianist, the first agent capable of performing nearly one thousand music pieces via scalable, human-demonstration-free learning. Our approach is built on three core components. First, we introduce an automatic fingering strategy based on Optimal Transport (OT), allowing the agent to autonomously discover efficient piano-playing strategies from scratch without demonstrations. Second, we conduct large-scale Reinforcement Learning (RL) by training more than 2,000 agents, each specialized in distinct music pieces, and aggregate their experience into a dataset named RP1M++, consisting of over one million trajectories for robotic piano playing. Finally, we employ a Flow Matching Transformer to leverage RP1M++ through large-scale imitation learning, resulting in the OmniPianist agent capable of performing a wide range of musical pieces. Extensive experiments and ablation studies highlight the effectiveness and scalability of our approach, advancing dexterous robotic piano playing at scale.
中文摘要 赋予机器人手人类水平的灵活性一直是机器人技术的长期目标。双手机器人钢琴演奏是一项特别具有挑战性的任务：它是高维的、接触丰富的，并且需要快速、精确的控制。我们展示了 OmniPianist，这是第一个能够通过可扩展、无需人工演示的学习来演奏近 1000 首音乐作品的代理。我们的方法建立在三个核心组件之上。首先，我们引入了一种基于最优传输（OT）的自动指法策略，使智能体能够从头开始自主发现高效的钢琴演奏策略，而无需演示。其次，我们通过训练 2,000 多个智能体（每个智能体专门研究不同的音乐作品）来进行大规模强化学习（RL），并将他们的经验聚合到一个名为 RP1M++ 的数据集中，该数据集由超过 100 万条机器人钢琴演奏轨迹组成。最后，我们采用 Flow Matching Transformer 通过大规模模仿学习利用 RP1M++，使 OmniPianist 代理能够演奏各种音乐作品。广泛的实验和消融研究凸显了我们方法的有效性和可扩展性，推动了大规模灵巧的机器人钢琴演奏。

An End-to-End Learning Approach for Solving Capacitated Location-Routing Problems

一种用于解决容量定位路由问题的端到端学习方法

Authors: Changhao Miao, Yuntian Zhang, Tongyu Wu, Fang Deng, Chen Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.02525
Pdf link: https://arxiv.org/pdf/2511.02525
Abstract The capacitated location-routing problems (CLRPs) are classical problems in combinatorial optimization, which require simultaneously making location and routing decisions. In CLRPs, the complex constraints and the intricate relationships between various decisions make the problem challenging to solve. With the emergence of deep reinforcement learning (DRL), it has been extensively applied to address the vehicle routing problem and its variants, while the research related to CLRPs still needs to be explored. In this paper, we propose the DRL with heterogeneous query (DRLHQ) to solve CLRP and open CLRP (OCLRP), respectively. We are the first to propose an end-to-end learning approach for CLRPs, following the encoder-decoder structure. In particular, we reformulate the CLRPs as a Markov decision process tailored to various decisions, a general modeling framework that can be adapted to other DRL-based methods. To better handle the interdependency across location and routing decisions, we also introduce a novel heterogeneous querying attention mechanism designed to adapt dynamically to various decision-making stages. Experimental results on both synthetic and benchmark datasets demonstrate superior solution quality and better generalization performance of our proposed approach over representative traditional and DRL-based baselines in solving both CLRP and OCLRP.
中文摘要 容量定位路由问题（CLRP）是组合优化中的经典问题，需要同时做出定位和路由决策。在 CLRP 中，复杂的约束和各种决策之间的复杂关系使问题难以解决。随着深度强化学习（DRL）的出现，深度强化学习（DRL）已被广泛应用于解决车辆路线问题及其变体，而CLRPs的相关研究仍有待探索。本文提出了异构查询DRL（DRLHQ）分别求解CLRP和开放CLRP（OCLRP）。我们是第一个提出 CLRP 端到端学习方法的人，遵循编码器-解码器结构。特别是，我们将 CLRP 重新表述为针对各种决策量身定制的马尔可夫决策过程，这是一个可以适应其他基于 DRL 的方法的通用建模框架。为了更好地处理位置和路由决策之间的相互依赖性，我们还引入了一种新的异构查询注意力机制，旨在动态适应各种决策阶段。在合成数据集和基准数据集上的实验结果表明，在求解 CLRP 和 OCLRP 方面，我们提出的方法比具有代表性的传统和基于 DRL 的基线具有更优越的解决方案质量和更好的泛化性能。

Adaptive Neighborhood-Constrained Q Learning for Offline Reinforcement Learning

离线强化学习的自适应邻域约束Q学习

Authors: Yixiu Mao, Yun Qu, Qi Wang, Xiangyang Ji
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.02567
Pdf link: https://arxiv.org/pdf/2511.02567
Abstract Offline reinforcement learning (RL) suffers from extrapolation errors induced by out-of-distribution (OOD) actions. To address this, offline RL algorithms typically impose constraints on action selection, which can be systematically categorized into density, support, and sample constraints. However, we show that each category has inherent limitations: density and sample constraints tend to be overly conservative in many scenarios, while the support constraint, though least restrictive, faces challenges in accurately modeling the behavior policy. To overcome these limitations, we propose a new neighborhood constraint that restricts action selection in the Bellman target to the union of neighborhoods of dataset actions. Theoretically, the constraint not only bounds extrapolation errors and distribution shift under certain conditions, but also approximates the support constraint without requiring behavior policy modeling. Moreover, it retains substantial flexibility and enables pointwise conservatism by adapting the neighborhood radius for each data point. In practice, we employ data quality as the adaptation criterion and design an adaptive neighborhood constraint. Building on an efficient bilevel optimization framework, we develop a simple yet effective algorithm, Adaptive Neighborhood-constrained Q learning (ANQ), to perform Q learning with target actions satisfying this constraint. Empirically, ANQ achieves state-of-the-art performance on standard offline RL benchmarks and exhibits strong robustness in scenarios with noisy or limited data.
中文摘要 离线强化学习（RL）存在由分布外（OOD）动作引起的外推误差。为了解决这个问题，离线 RL 算法通常对动作选择施加约束，可以系统地将其分为密度、支撑和样本约束。然而，我们表明每个类别都有固有的局限性：在许多情况下，密度和样本约束往往过于保守，而支持约束虽然限制最少，但在准确建模行为策略方面面临挑战。为了克服这些限制，我们提出了一种新的邻域约束，将贝尔曼目标中的动作选择限制为数据集动作邻域的并集。从理论上讲，该约束不仅限制了特定条件下的外推误差和分布偏移，而且无需行为策略建模即可近似支持约束。此外，它保留了相当大的灵活性，并通过调整每个数据点的邻域半径来实现逐点保守性。在实践中，我们以数据质量为适应准则，设计自适应邻域约束。在高效的双层优化框架的基础上，我们开发了一种简单而有效的算法，即自适应邻域约束 Q 学习（ANQ），以满足该约束的目标动作执行 Q 学习。根据经验，ANQ 在标准离线 RL 基准测试上实现了最先进的性能，并在数据嘈杂或有限的场景中表现出很强的鲁棒性。
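
The neighborhood constraint described in this abstract restricts the maximization in the Bellman target to actions close to dataset actions. The sketch below shows that idea for scalar actions. It is not the ANQ algorithm: the abstract mentions a bilevel optimization framework, and this sketch swaps in plain random sampling inside each neighborhood. The function name, the per-point `radii` argument and the choice of which dataset actions are passed in are all assumptions.

```python
import random


def neighborhood_target(q_fn, next_state, dataset_actions, radii, reward, gamma, rng,
                        samples_per_action=16):
    """Bellman target whose max ranges only over actions near dataset actions."""
    best = float("-inf")
    for a_data, radius in zip(dataset_actions, radii):
        # The dataset action itself always belongs to its own neighborhood.
        candidates = [a_data] + [
            a_data + rng.uniform(-radius, radius) for _ in range(samples_per_action)
        ]
        best = max(best, max(q_fn(next_state, a) for a in candidates))
    return reward + gamma * best


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy Q-function that peaks at a = 2.0, far from the dataset actions 0.0 and 1.0.
    q_fn = lambda state, action: -(action - 2.0) ** 2
    target = neighborhood_target(q_fn, None, [0.0, 1.0], [0.5, 0.1], 0.0, 1.0, rng)
    print(target)  # stays well below the unconstrained maximum of 0.0
```

A smaller radius for a data point makes the target more conservative around that point. This corresponds to the pointwise conservatism the abstract attributes to adapting the radius per data point.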

Directional-Clamp PPO

定向钳式 PPO

Authors: Gilad Karpel, Ruida Zhou, Shoham Sabach, Mohammad Ghavamzadeh
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.02577
Pdf link: https://arxiv.org/pdf/2511.02577
Abstract Proximal Policy Optimization (PPO) is widely regarded as one of the most successful deep reinforcement learning algorithms, known for its robustness and effectiveness across a range of problems. The PPO objective encourages the importance ratio between the current and behavior policies to move to the "right" direction -- starting from importance sampling ratios equal to 1, increasing the ratios for actions with positive advantages and decreasing those with negative advantages. A clipping function is introduced to prevent over-optimization when updating the importance ratio in these "right" direction regions. Many PPO variants have been proposed to extend its success, most of which modify the objective's behavior by altering the clipping in the "right" direction regions. However, due to randomness in the rollouts and stochasticity of the policy optimization, we observe that the ratios frequently move to the "wrong" direction during the PPO optimization. This is a key factor hindering the improvement of PPO, but it has been largely overlooked. To address this, we propose the Directional-Clamp PPO algorithm (DClamp-PPO), which further penalizes the actions going to the strict "wrong" direction regions, where the advantage is positive (negative) and importance ratio falls below (above) $1 - \beta$ ($1+\beta$), for a tunable parameter $\beta \in (0, 1)$. The penalty is by enforcing a steeper loss slope, i.e., a clamp, in those regions. We demonstrate that DClamp-PPO consistently outperforms PPO, as well as its variants, by focusing on modifying the objective's behavior in the "right" direction, across various MuJoCo environments, using different random seeds. The proposed method is shown, both theoretically and empirically, to better avoid "wrong" direction updates while keeping the importance ratio closer to 1.
中文摘要 近端策略优化（PPO）被广泛认为是最成功的深度强化学习算法之一，以其在一系列问题上的鲁棒性和有效性而闻名。PPO 目标鼓励当前策略与行为策略之间的重要性比率朝“正确”方向移动，即从等于 1 的重要性采样比率出发，增大优势为正的动作的比率，减小优势为负的动作的比率。PPO 引入了裁剪函数，以防止在这些“正确”方向区域更新重要性比率时过度优化。已有许多 PPO 变体被提出以延续其成功，其中大多数通过改变“正确”方向区域的裁剪来修改目标函数的行为。然而，由于 rollout 的随机性和策略优化的随机性，我们观察到在 PPO 优化期间，比率经常向“错误”方向移动。这是阻碍 PPO 改进的关键因素，但在很大程度上被忽视了。为了解决这个问题，我们提出了方向钳位 PPO 算法（DClamp-PPO），该算法进一步惩罚进入严格“错误”方向区域的动作，即优势为正（负）且重要性比率低于（高于）$1 - \beta$（$1+\beta$）的区域，其中 $\beta \in (0, 1)$ 为可调参数。惩罚的方式是在这些区域施加更陡峭的损失斜率，即钳位。我们证明，在各种 MuJoCo 环境和不同随机种子下，DClamp-PPO 始终优于 PPO 以及那些专注于修改“正确”方向上目标函数行为的变体。理论和实验均表明，所提出的方法可以更好地避免“错误”方向的更新，同时使重要性比率更接近 1。
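
The clamp described in this abstract can be sketched as a per-sample surrogate objective. The abstract only says that the loss slope is made steeper in the strict "wrong" direction regions, so the multiplier `slope` and the exact piecewise form below are assumptions, and `dclamp_ppo_objective` is a hypothetical name rather than the authors' implementation.

```python
def dclamp_ppo_objective(ratio, advantage, eps=0.2, beta=0.1, slope=2.0):
    """Per-sample surrogate to maximize; `slope` > 1 is a hypothetical clamp steepness."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    ppo = min(ratio * advantage, clipped * advantage)  # standard PPO-clip term
    if advantage > 0 and ratio < 1.0 - beta:
        # Ratio dropped although the action was good: extra pull back toward 1 - beta.
        return ppo + (slope - 1.0) * advantage * (ratio - (1.0 - beta))
    if advantage < 0 and ratio > 1.0 + beta:
        # Ratio rose although the action was bad: extra pull back toward 1 + beta.
        return ppo + (slope - 1.0) * advantage * (ratio - (1.0 + beta))
    return ppo


if __name__ == "__main__":
    for r in (0.7, 0.8, 0.9, 1.0, 1.5):
        print(r, dclamp_ppo_objective(r, advantage=1.0))
```

In this sketch the extra term is zero at the region boundary, so the objective stays continuous. Inside the wrong-direction region its derivative with respect to the ratio is `slope * advantage` instead of `advantage`.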

Adaptive GR(1) Specification Repair for Liveness-Preserving Shielding in Reinforcement Learning

强化学习中保活性屏蔽的自适应 GR(1) 规范修复

Authors: Tiberiu-Andrei Georgescu, Alexander W. Goodall, Dalal Alrajeh, Francesco Belardinelli, Sebastian Uchitel
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.02605
Pdf link: https://arxiv.org/pdf/2511.02605
Abstract Shielding is widely used to enforce safety in reinforcement learning (RL), ensuring that an agent's actions remain compliant with formal specifications. Classical shielding approaches, however, are often static, in the sense that they assume fixed logical specifications and hand-crafted abstractions. While these static shields provide safety under nominal assumptions, they fail to adapt when environment assumptions are violated. In this paper, we develop the first adaptive shielding framework - to the best of our knowledge - based on Generalized Reactivity of rank 1 (GR(1)) specifications, a tractable and expressive fragment of Linear Temporal Logic (LTL) that captures both safety and liveness properties. Our method detects environment assumption violations at runtime and employs Inductive Logic Programming (ILP) to automatically repair GR(1) specifications online, in a systematic and interpretable way. This ensures that the shield evolves gracefully, ensuring liveness is achievable and weakening goals only when necessary. We consider two case studies: Minepump and Atari Seaquest; showing that (i) static symbolic controllers are often severely suboptimal when optimizing for auxiliary rewards, and (ii) RL agents equipped with our adaptive shield maintain near-optimal reward and perfect logical compliance compared with static shields.
中文摘要 屏蔽广泛用于加强强化学习（RL）中的安全性，确保智能体的动作始终符合形式化规范。然而，经典的屏蔽方法通常是静态的，因为它们假设固定的逻辑规范和手工制作的抽象。虽然这些静态屏蔽在标称假设下提供安全性，但当环境假设被违反时，它们无法适应。在本文中，我们开发了（据我们所知）第一个自适应屏蔽框架，它基于秩为 1 的广义反应性（GR(1)）规范，这是线性时序逻辑（LTL）的一个易于处理且富有表现力的片段，可同时刻画安全性和活性属性。我们的方法在运行时检测环境假设违规，并采用归纳逻辑编程（ILP）以系统且可解释的方式自动在线修复 GR(1) 规范。这确保了屏蔽能够平稳地演化，保证活性可以实现，并且仅在必要时才削弱目标。我们考虑了两个案例研究：Minepump 和 Atari Seaquest，结果表明（i）静态符号控制器在优化辅助奖励时通常严重次优，以及（ii）与静态屏蔽相比，配备我们自适应屏蔽的 RL 智能体保持了近乎最优的奖励和完美的逻辑合规性。

Natural-gas storage modelling by deep reinforcement learning

深度强化学习的天然气储存建模

Authors: Tiziano Balaconi, Aldo Glielmo, Marco Taboga
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); General Economics (econ.GN); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2511.02646
Pdf link: https://arxiv.org/pdf/2511.02646
Abstract We introduce GasRL, a simulator that couples a calibrated representation of the natural gas market with a model of storage-operator policies trained with deep reinforcement learning (RL). We use it to analyse how optimal stockpile management affects equilibrium prices and the dynamics of demand and supply. We test various RL algorithms and find that Soft Actor Critic (SAC) exhibits superior performance in the GasRL environment: multiple objectives of storage operators - including profitability, robust market clearing and price stabilisation - are successfully achieved. Moreover, the equilibrium price dynamics induced by SAC-derived optimal policies have characteristics, such as volatility and seasonality, that closely match those of real-world prices. Remarkably, this adherence to the historical distribution of prices is obtained without explicitly calibrating the model to price data. We show how the simulator can be used to assess the effects of EU-mandated minimum storage thresholds. We find that such thresholds have a positive effect on market resilience against unanticipated shifts in the distribution of supply shocks. For example, with unusually large shocks, market disruptions are averted more often if a threshold is in place.
中文摘要 我们介绍了 GasRL，这是一种模拟器，它将天然气市场的校准表示与通过深度强化学习（RL）训练的存储运营商策略模型相结合。我们用它来分析最佳库存管理如何影响均衡价格以及供需动态。我们测试了各种 RL 算法，发现 Soft Actor Critic （SAC）在 GasRL 环境中表现出卓越的性能：成功实现了存储运营商的多个目标——包括盈利能力、稳健的市场清算和价格稳定。此外，SAC 衍生的最优政策引发的均衡价格动态具有波动性和季节性等特征，与现实世界的价格非常匹配。值得注意的是，这种对价格历史分布的遵守是在没有明确校准模型以价格数据的情况下获得的。我们展示了如何使用模拟器来评估欧盟规定的最低存储阈值的影响。我们发现，这些阈值对市场抵御供应冲击分布意外变化的弹性有积极影响。例如，对于异常大的冲击，如果设定阈值，市场混乱就会更频繁地避免。

Curriculum Design for Trajectory-Constrained Agent: Compressing Chain-of-Thought Tokens in LLMs

轨迹约束智能体的课程设计：压缩 LLM 中的思维链标记

Authors: Georgios Tzannetos, Parameswaran Kamalaruban, Adish Singla
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.02690
Pdf link: https://arxiv.org/pdf/2511.02690
Abstract Training agents to operate under strict constraints during deployment, such as limited resource budgets or stringent safety requirements, presents significant challenges, especially when these constraints render the task complex. In this work, we propose a curriculum learning strategy that gradually tightens constraints during training, enabling the agent to incrementally master the deployment requirements. Inspired by self-paced learning techniques in unconstrained reinforcement learning (RL), our approach facilitates a smoother transition to challenging environments by initially training on simplified versions of the constraints and progressively introducing the full deployment conditions. We provide a theoretical analysis using an RL agent in a binary-tree Markov Decision Process (MDP) to demonstrate that our curriculum strategy can accelerate training relative to a baseline approach that imposes the trajectory constraints from the outset. Moreover, we empirically validate the effectiveness and generality of our method across both RL and large language model (LLM) agents in diverse settings, including a binary-tree MDP, a multi-task navigation domain, and a math reasoning task with two benchmarks. These results highlight the potential of curriculum design in enhancing the efficiency and performance of agents operating under complex trajectory constraints during deployment. Moreover, when applied to LLMs, our strategy enables compression of output chain-of-thought tokens, achieving a substantial inference speedup on consumer hardware, demonstrating its effectiveness for resource-constrained deployment.
中文摘要 训练智能体在部署期间于严格约束下运行（例如有限的资源预算或严格的安全要求）带来了重大挑战，尤其是当这些约束使任务变得复杂时。在这项工作中，我们提出了一种课程学习策略，在训练过程中逐渐收紧约束，使智能体能够逐步掌握部署需求。受无约束强化学习（RL）中自定进度学习技术的启发，我们的方法通过最初在约束的简化版本上进行训练并逐步引入完整的部署条件，促进更平稳地过渡到具有挑战性的环境。我们在二叉树马尔可夫决策过程（MDP）中使用 RL 智能体进行了理论分析，以证明相对于从一开始就施加轨迹约束的基线方法，我们的课程策略可以加速训练。此外，我们通过实验验证了我们的方法在不同环境中对 RL 和大型语言模型（LLM）智能体的有效性和通用性，包括二叉树 MDP、多任务导航域和具有两个基准的数学推理任务。这些结果凸显了课程设计在提高部署期间于复杂轨迹约束下运行的智能体的效率和性能方面的潜力。此外，当应用于 LLM 时，我们的策略可以压缩输出的思维链标记，在消费级硬件上实现大幅推理加速，证明其在资源受限部署中的有效性。
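
The gradual tightening described in this abstract can be pictured as a budget schedule. The sketch below assumes the constraint is a token budget on the chain of thought and that tightening is triggered by a success-rate threshold. The threshold, the shrink factor and both function names are hypothetical, and the paper's actual curriculum rule may differ.

```python
def next_budget(current, target, success_rate, threshold=0.8, shrink=0.9):
    """Tighten the budget toward `target` only once the agent copes with the current one."""
    if success_rate < threshold:
        return current  # not ready yet: keep the looser constraint
    return max(target, int(current * shrink))


def constrained_reward(correct, tokens_used, budget):
    """Reward 1.0 only for a correct answer produced within the current budget."""
    return 1.0 if correct and tokens_used <= budget else 0.0


if __name__ == "__main__":
    budget, target = 1000, 200
    # Hypothetical success rates measured after successive training rounds.
    for rate in (0.9, 0.5, 0.85, 0.95):
        budget = next_budget(budget, target, rate)
        print(rate, budget)
```

The baseline the abstract compares against corresponds to starting with `current == target`, that is, imposing the deployment constraint from the first step.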

VidEmo: Affective-Tree Reasoning for Emotion-Centric Video Foundation Models

VidEmo：以情感为中心的视频基础模型的情感树推理

Authors: Zhicheng Zhang, Weicheng Wang, Yongjie Zhu, Wenyu Qin, Pengfei Wan, Di Zhang, Jufeng Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.02712
Pdf link: https://arxiv.org/pdf/2511.02712
Abstract Understanding and predicting emotion from videos has gathered significant attention in recent studies, driven by advancements in video large language models (VideoLLMs). While advanced methods have made progress in video emotion analysis, the intrinsic nature of emotions poses significant challenges. Emotions are characterized by dynamic and cues-dependent properties, making it difficult to understand complex and evolving emotional states with reasonable rationale. To tackle these challenges, we propose a novel affective cues-guided reasoning framework that unifies fundamental attribute perception, expression analysis, and high-level emotional understanding in a stage-wise manner. At the core of our approach is a family of video emotion foundation models (VidEmo), specifically designed for emotion reasoning and instruction-following. These models undergo a two-stage tuning process: first, curriculum emotion learning for injecting emotion knowledge, followed by affective-tree reinforcement learning for emotion reasoning. Moreover, we establish a foundational data infrastructure and introduce an emotion-centric fine-grained dataset (Emo-CFG) consisting of 2.1M diverse instruction-based samples. Emo-CFG includes explainable emotional question-answering, fine-grained captions, and associated rationales, providing essential resources for advancing emotion understanding tasks. Experimental results demonstrate that our approach achieves competitive performance, setting a new milestone across 15 face perception tasks.
中文摘要 在视频大型语言模型（VideoLLM）的进步推动下，从视频中理解和预测情感在最近的研究中引起了广泛关注。虽然先进的方法在视频情绪分析方面取得了进展，但情绪的内在性质带来了重大挑战。情绪的特点是动态和依赖线索的特性，因此很难用合理的理由来理解复杂和不断变化的情绪状态。为了应对这些挑战，我们提出了一种新颖的情感线索引导推理框架，该框架以阶段性的方式统一了基本属性感知、表达分析和高级情感理解。我们方法的核心是一系列视频情感基础模型（VidEmo），专为情感推理和指令遵循而设计。这些模型经历了两个阶段的调优过程：首先是用于注入情感知识的课程情感学习，然后是用于情感推理的情感树强化学习。此外，我们还建立了基础数据基础设施，并引入了一个以情感为中心的细粒度数据集（Emo-CFG），该数据集由2.1M个不同的基于指令的样本组成。Emo-CFG 包括可解释的情感问答、细粒度字幕和相关基本原理，为推进情感理解任务提供了必要的资源。实验结果表明，我们的方法实现了具有竞争力的性能，在 15 项人脸感知任务中树立了新的里程碑。

Controlling Performance and Budget of a Centralized Multi-agent LLM System with Reinforcement Learning

基于强化学习的集中式多智能体LLM系统的性能和预算控制

Authors: Bowen Jin, TJ Collins, Donghan Yu, Mert Cemri, Shenao Zhang, Mengyu Li, Jay Tang, Tian Qin, Zhiyang Xu, Jiarui Lu, Guoli Yin, Jiawei Han, Zirui Wang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.02755
Pdf link: https://arxiv.org/pdf/2511.02755
Abstract Large language models (LLMs) exhibit complementary strengths across domains and come with varying inference costs, motivating the design of multi-agent LLM systems where specialized models collaborate efficiently. Existing approaches predominantly rely on decentralized frameworks, which invoke multiple LLMs for every input and thus lead to substantial and uncontrolled inference costs. In this work, we introduce a centralized multi-LLM framework, where a controller LLM selectively coordinates a pool of expert models in a cost-efficient and cost-controllable manner. We formulate this coordination problem as reinforcement learning with dual objectives: maximizing task performance while minimizing the overall inference cost. In addition, we expect the multi-agent system to have adapted behavior with different budget conditions during inference. To this end, we propose CoRL, a reinforcement learning framework that optimizes the performance cost trade-off in a controllable multi-budget setting. Experiments on four diverse benchmarks demonstrate that CoRL enables a single system to surpass the best expert LLM under high-budget settings, while maintaining strong performance in more economical low-budget modes, highlighting the effectiveness of centralized coordination for scalable and cost-efficient multi-agent LLM systems.
中文摘要 大型语言模型（LLM）在各个领域表现出互补的优势，并具有不同的推理成本，从而推动了多代理 LLM 系统的设计，其中专业模型可以有效协作。现有方法主要依赖于去中心化框架，该框架为每次输入调用多个 LLM，从而导致巨额且不受控制的推理成本。在这项工作中，我们引入了一个集中式多 LLM 框架，其中控制器 LLM 以经济高效且成本可控的方式有选择地协调专家模型池。我们将这个协调问题表述为具有双重目标的强化学习：最大化任务性能，同时最小化总体推理成本。此外，我们预计多智能体系统在推理过程中具有不同预算条件下的适应行为。为此，我们提出了CoRL，这是一种强化学习框架，可以在可控的多预算环境中优化性能成本权衡。在四个不同基准测试上的实验表明，CoRL 使单个系统能够在高预算设置下超越最好的专家 LLM，同时在更经济的低预算模式下保持强大的性能，凸显了集中协调对可扩展且具有成本效益的多智能体 LLM 系统的有效性。
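
The dual objective described in this abstract (task performance against inference cost, under different budget conditions) can be illustrated with a toy reward. This shows only a possible reward shape and does not reproduce CoRL's training. The linear overspend penalty, the `penalty` weight and the expert table are hypothetical.

```python
def budget_aware_reward(task_score, cost, budget, penalty=1.0):
    """Task score minus a penalty that grows with spending beyond the budget."""
    overspend = max(0.0, cost - budget)
    return task_score - penalty * overspend / budget


def best_expert(experts, budget):
    """Pick the (name, score, cost) triple with the highest budget-aware reward."""
    return max(experts, key=lambda e: budget_aware_reward(e[1], e[2], budget))


if __name__ == "__main__":
    # Hypothetical experts: (name, expected task score, inference cost).
    experts = [("small", 0.60, 1.0), ("medium", 0.75, 4.0), ("large", 0.90, 10.0)]
    for budget in (2.0, 5.0, 12.0):
        print(budget, best_expert(experts, budget)[0])
```

With these made-up numbers the preferred expert shifts from `small` to `medium` to `large` as the budget grows. This is the kind of budget-dependent behavior the abstract asks of the controller.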

From Solo to Symphony: Orchestrating Multi-Agent Collaboration with Single-Agent Demos

从独奏到交响乐：通过单代理演示协调多代理协作

Authors: Xun Wang, Zhuoran Li, Yanshan Lin, Hai Zhong, Longbo Huang
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2511.02762
Pdf link: https://arxiv.org/pdf/2511.02762
Abstract Training a team of agents from scratch in multi-agent reinforcement learning (MARL) is highly inefficient, much like asking beginners to play a symphony together without first practicing solo. Existing methods, such as offline or transferable MARL, can ease this burden, but they still rely on costly multi-agent data, which often becomes the bottleneck. In contrast, solo experiences are far easier to obtain in many important scenarios, e.g., collaborative coding, household cooperation, and search-and-rescue. To unlock their potential, we propose Solo-to-Collaborative RL (SoCo), a framework that transfers solo knowledge into cooperative learning. SoCo first pretrains a shared solo policy from solo demonstrations, then adapts it for cooperation during multi-agent training through a policy fusion mechanism that combines an MoE-like gating selector and an action editor. Experiments across diverse cooperative tasks show that SoCo significantly boosts the training efficiency and performance of backbone algorithms. These results demonstrate that solo demonstrations provide a scalable and effective complement to multi-agent data, making cooperative learning more practical and broadly applicable.
中文摘要 在多智能体强化学习（MARL）中从头开始训练智能体团队效率极低，就像要求初学者一起演奏交响乐而不先单独练习一样。现有的方法，例如离线或可迁移的 MARL，可以减轻这种负担，但它们仍然依赖于昂贵的多智能体数据，这往往成为瓶颈。相比之下，在许多重要场景中，例如协作编码、家庭协作和搜救，单智能体经验要容易获得得多。为了释放其潜力，我们提出了 Solo-to-Collaborative RL（SoCo），这是一个将单智能体知识迁移到协作学习中的框架。SoCo 首先从单智能体演示中预训练一个共享的单智能体策略，然后在多智能体训练期间通过一种策略融合机制使其适应协作，该机制结合了类似 MoE 的门控选择器和动作编辑器。跨不同协作任务的实验表明，SoCo 显著提高了骨干算法的训练效率和性能。这些结果表明，单智能体演示为多智能体数据提供了可扩展且有效的补充，使协作学习更加实用和广泛适用。

Keyword: diffusion policy

There is no result