Arxiv Papers of Today

Generated: 2025-11-28 16:30:16 (UTC+8); arXiv release time: 2025-11-27 20:00 EST (2025-11-28 09:00 UTC+8)

22 related papers today

Keyword: reinforcement learning

Video Object Recognition in Mobile Edge Networks: Local Tracking or Edge Detection?

移动边缘网络中的视频对象识别：本地跟踪还是边缘检测？

Authors: Kun Guo, Yun Shen, Xijun Wang, Chaoqun You, Yun Rui, Tony Q. S. Quek
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2511.20716
Pdf link: https://arxiv.org/pdf/2511.20716
Abstract Fast and accurate video object recognition, which relies on frame-by-frame video analytics, remains a challenge for resource-constrained devices such as traffic cameras. Recent advances in mobile edge computing have made it possible to offload computation-intensive object detection to edge servers equipped with high-accuracy neural networks, while lightweight and fast object tracking algorithms run locally on devices. This hybrid approach offers a promising solution but introduces a new challenge: deciding when to perform edge detection versus local tracking. To address this, we formulate two long-term optimization problems for both single-device and multi-device scenarios, taking into account the temporal correlation of consecutive frames and the dynamic conditions of mobile edge networks. Based on the formulation, we propose the LTED-Ada in single-device setting, a deep reinforcement learning-based algorithm that adaptively selects between local tracking and edge detection, according to the frame rate as well as recognition accuracy and delay requirement. In multi-device setting, we further enhance LTED-Ada using federated learning to enable collaborative policy training across devices, thereby improving its generalization to unseen frame rates and performance requirements. Finally, we conduct extensive hardware-in-the-loop experiments using multiple Raspberry Pi 4B devices and a personal computer as the edge server, demonstrating the superiority of LTED-Ada.
中文摘要 依赖逐帧视频分析的快速准确视频物体识别，对于资源有限的交通摄像头等设备来说仍是挑战。移动边缘计算的最新进展使得将计算密集型物体检测工作卸载到配备高精度神经网络的边缘服务器成为可能，同时轻量级且快速的物体跟踪算法可在设备本地运行。这种混合方法提供了一个有前景的解决方案，但也带来了新的挑战：决定何时执行边缘检测，何时进行局部跟踪。为此，我们针对单设备和多设备场景提出了两个长期优化问题，考虑连续帧的时间相关性及移动边缘网络的动态条件。基于该表述，我们提出了单设备环境下的LTED-Ada算法，这是一种基于深度强化学习的算法，能够根据帧率、识别精度和延迟需求，自适应地选择局部跟踪和边缘检测。在多设备环境中，我们进一步利用联邦学习增强LTED-Ada，实现跨设备协作策略训练，从而提升其对未见帧率和性能需求的推广能力。最后，我们利用多台树莓派4B设备和一台个人电脑作为边缘服务器，进行了广泛的硬件在环实验，展示了LTED-Ada的优势。

LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling

LongVT：通过原生工具调用激励“用长视频思考”

Authors: Zuhao Yang, Sudong Wang, Kaichen Zhang, Keming Wu, Sicong Leng, Yifan Zhang, Chengwei Qin, Shijian Lu, Xingxuan Li, Lidong Bing
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.20785
Pdf link: https://arxiv.org/pdf/2511.20785
Abstract Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucinations, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos - by first skimming globally and then examining relevant clips for details - we introduce LongVT, an end-to-end agentic framework that enables "Thinking with Long Videos" via interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames. This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence. Given the scarcity of fine-grained question-answering (QA) data for the long video reasoning task, we curate and will release a data suite named VideoSIAH to facilitate both training and evaluation. Specifically, our training dataset consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning, respectively. Our evaluation benchmark consists of 1,280 QA pairs that are carefully curated through a semi-automatic data pipeline with human-in-the-loop validation. With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks. Our codes, data, and model checkpoints are publicly available at this https URL .
中文摘要 大型多模态模型（LMM）在基于文本思维链的视频推理中展现出了巨大潜力。然而，它们仍然容易产生幻觉，尤其是在处理证据稀疏且在时间上分散的长视频时。受人类理解长视频方式的启发（先全局浏览，再仔细查看相关片段以获取细节），我们提出了LongVT，一个端到端的智能体框架，通过交错的多模态工具思维链（Multimodal Chain-of-Tool-Thought）实现“用长视频思考”。具体来说，我们利用LMM固有的时间定位能力作为原生视频裁剪工具，对特定视频片段进行放大，并重新采样更细粒度的视频帧。这种从全局到局部的推理循环会持续进行，直到答案建立在检索到的视觉证据之上。鉴于长视频推理任务中细粒度问答（QA）数据的稀缺，我们整理并将发布名为VideoSIAH的数据套件，以促进训练和评估。具体来说，我们的训练数据集包含247.9K个样本用于工具集成的冷启动监督微调，1.6K个样本用于智能体强化学习，15.4K个样本用于智能体强化微调。我们的评估基准包含1,280个问答对，这些问答对通过带有人在环验证的半自动数据管道精心整理。凭借精心设计的三阶段训练策略和广泛的实证验证，LongVT在四个具有挑战性的长视频理解与推理基准上始终优于现有的强基线。我们的代码、数据和模型检查点在此 https URL 公开。

SPHINX: A Synthetic Environment for Visual Perception and Reasoning

SPHINX：一个用于视觉感知与推理的合成环境

Authors: Md Tanvirul Alam, Saksham Aggarwal, Justin Yang Chae, Nidhi Rastogi
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.20814
Pdf link: https://arxiv.org/pdf/2511.20814
Abstract We present Sphinx, a synthetic environment for visual perception and reasoning that targets core cognitive primitives. Sphinx procedurally generates puzzles using motifs, tiles, charts, icons, and geometric primitives, each paired with verifiable ground-truth solutions, enabling both precise evaluation and large-scale dataset construction. The benchmark covers 25 task types spanning symmetry detection, geometric transformations, spatial reasoning, chart interpretation, and sequence prediction. Evaluating recent large vision-language models (LVLMs) shows that even state-of-the-art GPT-5 attains only 51.1% accuracy, well below human performance. Finally, we demonstrate that reinforcement learning with verifiable rewards (RLVR) substantially improves model accuracy on these tasks and yields gains on external visual reasoning benchmarks, highlighting its promise for advancing multimodal reasoning.
中文摘要 我们提出Sphinx，一个面向核心认知基元的视觉感知与推理合成环境。Sphinx 使用图案、拼块、图表、图标和几何基元以程序化方式生成谜题，每个谜题都配有可验证的真实解，既支持精确评估，也支持大规模数据集构建。该基准包含25种任务类型，涵盖对称性检测、几何变换、空间推理、图表解读和序列预测。对近期大型视觉语言模型（LVLM）的评估显示，即使是最先进的GPT-5，准确率也仅为51.1%，远低于人类表现。最后，我们证明了带可验证奖励的强化学习（RLVR）能显著提升模型在这些任务上的准确率，并在外部视觉推理基准上带来提升，凸显了其推动多模态推理的潜力。

Exploring Time-Step Size in Reinforcement Learning for Sepsis Treatment

探讨强化学习中败血症治疗的时间步长

Authors: Yingchuan Sun, Shengpu Tang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.20913
Pdf link: https://arxiv.org/pdf/2511.20913
Abstract Existing studies on reinforcement learning (RL) for sepsis management have mostly followed an established problem setup, in which patient data are aggregated into 4-hour time steps. Although concerns have been raised regarding the coarseness of this time-step size, which might distort patient dynamics and lead to suboptimal treatment policies, the extent to which this is a problem in practice remains unexplored. In this work, we conducted empirical experiments for a controlled comparison of four time-step sizes ($\Delta t = 1, 2, 4, 8$ h) on this domain, following an identical offline RL pipeline. To enable a fair comparison across time-step sizes, we designed action re-mapping methods that allow for evaluation of policies on datasets with different time-step sizes, and conducted cross-$\Delta t$ model selections under two policy learning setups. Our goal was to quantify how time-step size influences state representation learning, behavior cloning, policy training, and off-policy evaluation. Our results show that performance trends across $\Delta t$ vary as learning setups change, while policies learned at finer time-step sizes ($\Delta t = 1$ h and $2$ h) using a static behavior policy achieve the overall best performance and stability. Our work highlights time-step size as a core design choice in offline RL for healthcare and provides evidence supporting alternatives beyond the conventional 4-hour setup.
中文摘要 现有关于败血症管理强化学习（RL）的研究大多遵循既定的问题设置，即将患者数据聚合为4小时的时间步。尽管有人担心这种时间步长过于粗糙，可能扭曲患者动态并导致次优的治疗策略，但这一问题在实践中的严重程度仍未被探讨。本研究通过实证实验，在该领域对四个时间步长（$\Delta t = 1, 2, 4, 8$ h）进行了受控比较，采用相同的离线强化学习流水线。为了实现跨时间步长的公平比较，我们设计了动作重映射方法，允许在不同时间步长的数据集上评估策略，并在两种策略学习设置下进行了跨$\Delta t$模型选择。我们的目标是量化时间步长如何影响状态表征学习、行为克隆、策略训练以及离策略评估。我们的结果显示，不同$\Delta t$之间的性能趋势会随学习设置的变化而变化，而在更细的时间步长（$\Delta t = 1$ h 和 $2$ h）下使用静态行为策略学习的策略，总体上获得了最佳的性能和稳定性。我们的研究强调时间步长是医疗领域离线强化学习中的核心设计选择，并提供了支持传统4小时设置之外替代方案的证据。
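
To make the time-step setup above concrete, here is a minimal sketch of aggregating irregular patient measurements into steps of width $\Delta t$. The function name `aggregate`, the mean-per-bin rule, and the sample readings are illustrative assumptions; the abstract does not specify the paper's aggregation rule or its action re-mapping method, and neither is reproduced here.

```python
from collections import defaultdict


def aggregate(records, dt_hours):
    """Average (time_in_hours, value) records into bins of width dt_hours.

    Hypothetical helper: mean-per-bin is an assumed aggregation rule.
    """
    bins = defaultdict(list)
    for t, value in records:
        # Step index = which dt-wide window the measurement falls into.
        bins[int(t // dt_hours)].append(value)
    return {step: sum(vals) / len(vals) for step, vals in sorted(bins.items())}


# Made-up readings: (hours since admission, measured value).
records = [(0.5, 80.0), (1.5, 90.0), (2.5, 100.0), (3.5, 110.0), (5.0, 120.0)]

for dt in (1, 2, 4, 8):
    print(dt, aggregate(records, dt))
```

With these readings, a 1-hour step keeps all five measurements distinct, while a 4-hour step merges the first four into one averaged state. This is the kind of coarsening the abstract is concerned about.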

Independent policy gradient-based reinforcement learning for economic and reliable energy management of multi-microgrid systems

基于独立策略梯度的强化学习，用于多微电网系统经济且可靠的能源管理

Authors: Junkai Hu, Li Xia
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2511.20977
Pdf link: https://arxiv.org/pdf/2511.20977
Abstract Efficiency and reliability are both crucial for energy management, especially in multi-microgrid systems (MMSs) integrating intermittent and distributed renewable energy sources. This study investigates an economic and reliable energy management problem in MMSs under a distributed scheme, where each microgrid independently updates its energy management policy in a decentralized manner to optimize the long-term system performance collaboratively. We introduce the mean and variance of the exchange power between the MMS and the main grid as indicators for the economic performance and reliability of the system. Accordingly, we formulate the energy management problem as a mean-variance team stochastic game (MV-TSG), where conventional methods based on the maximization of expected cumulative rewards are unsuitable for variance metrics. To solve MV-TSGs, we propose a fully distributed independent policy gradient algorithm, with rigorous convergence analysis, for scenarios with known model parameters. For large-scale scenarios with unknown model parameters, we further develop a deep reinforcement learning algorithm based on independent policy gradients, enabling data-driven policy optimization. Numerical experiments in two scenarios validate the effectiveness of the proposed methods. Our approaches fully leverage the distributed computational capabilities of MMSs and achieve a well-balanced trade-off between economic performance and operational reliability.
中文摘要 效率和可靠性对能源管理至关重要，尤其是在集成间歇性和分布式可再生能源的多微电网系统（MMS）中。本研究探讨了分布式方案下MMS中经济且可靠的能源管理问题，每个微电网独立以去中心化方式更新其能源管理策略，以协作优化长期系统性能。我们引入了MMS与主电网之间交换功率的平均值和方差，作为系统经济性能和可靠性的指标。因此，我们将能量管理问题表述为平均方差团队随机博弈（MV-TSG），其中基于期望累计奖励最大化的传统方法不适合方差度量。为求解MV-TSGs，我们提出了一种全分布式独立策略梯度算法，并结合严格收敛分析，适用于已知模型参数的场景。对于模型参数未知的大规模场景，我们进一步开发基于独立策略梯度的深度强化学习算法，实现数据驱动的策略优化。两种情景中的数值实验验证了所提方法的有效性。我们的方法充分利用MMS的分布式计算能力，实现经济性能与运营可靠性之间的良好平衡。

Subgoal Graph-Augmented Planning for LLM-Guided Open-World Reinforcement Learning

面向大型语言模型引导的开放世界强化学习的子目标图增强规划

Authors: Shanwei Fan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.20993
Pdf link: https://arxiv.org/pdf/2511.20993
Abstract Large language models (LLMs) offer strong high-level planning capabilities for reinforcement learning (RL) by decomposing tasks into subgoals. However, their practical utility is limited by poor planning-execution alignment, which reflects a critical gap between abstract plans and actionable, environment-compatible behaviors. This misalignment arises from two interrelated limitations: (1) LLMs often produce subgoals that are semantically plausible but infeasible or irrelevant in the target environment due to insufficient grounding in environment-specific knowledge, and (2) single-LLM planning conflates generation with self-verification, resulting in overconfident yet unreliable subgoals that frequently fail during execution. To address these challenges, we propose Subgoal Graph-Augmented Actor-Critic-Refiner (SGA-ACR), a framework that integrates an environment-specific subgoal graph and structured entity knowledge with a multi-LLM planning pipeline that explicitly separates generation, critique, and refinement to produce executable and verifiable subgoals. A subgoal tracker further monitors execution progress, provides auxiliary rewards, and adaptively updates the subgoal graph to maintain alignment between plans and actions. Experimental results on 22 diverse tasks in the open-world game "Crafter" demonstrate the effectiveness of our proposed method.
中文摘要 大型语言模型（LLMs）通过将任务分解为子目标，为强化学习（RL）提供了强大的高层规划能力。然而，其实际效用受限于规划与执行的对齐不佳，这反映了抽象计划与可操作且与环境兼容的行为之间的关键差距。这种不一致源于两个相互关联的局限：（1）由于缺乏环境特定知识的支撑，LLMs常常产生语义上合理但在目标环境中不可行或无关的子目标；（2）单一LLM规划将生成与自我验证混为一谈，导致过于自信但不可靠的子目标，在执行过程中经常失败。为应对这些挑战，我们提出了子目标图增强的Actor-Critic-Refiner（SGA-ACR）框架，该框架将环境特定的子目标图和结构化实体知识与多LLM规划流水线结合，明确分离生成、批判和精炼，以生成可执行且可验证的子目标。子目标跟踪器进一步监控执行进度，提供辅助奖励，并自适应地更新子目标图，以保持计划与行动之间的对齐。在开放世界游戏“Crafter”的22个不同任务上的实验结果展示了所提方法的有效性。

ICPO: Intrinsic Confidence-Driven Group Relative Preference Optimization for Efficient Reinforcement Learning

ICPO：内在信心驱动群体相对偏好优化以实现高效强化学习

Authors: Jinpeng Wang, Chao Li, Ting Ye, Mengyuan Zhang, Wei Liu, Jian Luan
Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2511.21005
Pdf link: https://arxiv.org/pdf/2511.21005
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) demonstrates significant potential in enhancing the reasoning capabilities of Large Language Models (LLMs). However, existing RLVR methods are often constrained by issues such as coarse-grained rewards, reward noise, and inefficient exploration, which lead to unstable training and entropy collapse. To address this challenge, we propose the Intrinsic Confidence-Driven Group Relative Preference Optimization method (ICPO). The intuition behind it lies in the fact that the probabilities of an LLM generating different responses can inherently and directly reflect its self-assessment of the reasoning process. Inspired by the idea of preference modeling, ICPO calculates a preference advantage score for each response by comparing the relative generation probabilities of multiple responses under the same input prompt, and integrates this score with verifiable rewards to guide the exploration process. We have discovered that the preference advantage score not only alleviates the issues of coarse-grained rewards and reward noise but also effectively curbs overconfident errors, enhances the relative superiority of undervalued high-quality responses, and prevents the model from overfitting to specific strategies, thereby facilitating more thorough exploration. Comprehensive experiments across four general-domain benchmarks and three mathematical benchmarks demonstrate that ICPO steadily boosts reasoning compared to GRPO.
中文摘要 带可验证奖励的强化学习（RLVR）在增强大型语言模型（LLMs）推理能力方面展现出显著潜力。然而，现有的RLVR方法常常受到粗粒度奖励、奖励噪声和低效探索等问题的限制，导致训练不稳定和熵崩溃。为应对这一挑战，我们提出了内在置信驱动群体相对偏好优化方法（ICPO）。其直觉在于，大型语言模型生成不同反应的概率，本质上且直接反映了其对推理过程的自我评估。受偏好建模理念启发，ICPO通过比较同一输入提示下多个响应的相对生成概率，计算每个响应的偏好优势得分，并将该分数与可验证的奖励整合，以指导探索过程。我们发现，偏好优势评分不仅缓解了粗粒度奖励和奖励噪声的问题，还有效抑制了过度自信的误差，增强了低估高质量回答的相对优越性，并防止模型过度拟合特定策略，从而促进更深入的探索。涵盖四个广域基准和三个数学基准的全面实验表明，ICPO相较于GRPO稳步提升推理能力。
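
As background for the GRPO baseline named above, the sketch below shows the group-relative advantage that GRPO-style methods compute from verifiable rewards. It is not ICPO's preference advantage score, whose exact form the abstract does not give. The function name, the use of the population standard deviation, and the `eps` value are assumptions made for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Assumed form: (reward - group mean) / (group std + eps).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations differ
    return [(r - mean) / (std + eps) for r in rewards]


# Binary verifiable rewards for four sampled responses (made-up values).
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(mixed)

# When every response gets the same reward, all advantages are zero.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
print(uniform)
```

The second case hints at why coarse-grained rewards can be limiting: a group whose responses all receive the same reward yields no learning signal under this normalization, whatever the differences between the responses.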

Staggered Environment Resets Improve Massively Parallel On-Policy Reinforcement Learning

错开环境重置可改进大规模并行的同策略强化学习

Authors: Sid Bharthulwar, Stone Tao, Hao Su
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.21011
Pdf link: https://arxiv.org/pdf/2511.21011
Abstract Massively parallel GPU simulation environments have accelerated reinforcement learning (RL) research by enabling fast data collection for on-policy RL algorithms like Proximal Policy Optimization (PPO). To maximize throughput, it is common to use short rollouts per policy update, increasing the update-to-data (UTD) ratio. However, we find that, in this setting, standard synchronous resets introduce harmful nonstationarity, skewing the learning signal and destabilizing training. We introduce staggered resets, a simple yet effective technique where environments are initialized and reset at varied points within the task horizon. This yields training batches with greater temporal diversity, reducing the nonstationarity induced by synchronized rollouts. We characterize dimensions along which RL environments can benefit significantly from staggered resets through illustrative toy environments. We then apply this technique to challenging high-dimensional robotics environments, achieving significantly higher sample efficiency, faster wall-clock convergence, and stronger final performance. Finally, this technique scales better with more parallel environments compared to naive synchronized rollouts.
中文摘要 大规模并行GPU仿真环境能够为近端策略优化（PPO）等同策略RL算法快速收集数据，从而加速了强化学习（RL）研究。为了最大化吞吐量，通常在每次策略更新时使用较短的轨迹展开（rollout），从而提高更新与数据比（UTD）。然而，我们发现在这种设置下，标准的同步重置会引入有害的非平稳性，扭曲学习信号并破坏训练稳定性。我们引入了错开重置，这是一种简单但有效的技术，让环境在任务时间范围内的不同时刻初始化和重置。这使训练批次具有更高的时间多样性，减少了同步展开引起的非平稳性。我们通过示例性的玩具环境，刻画了强化学习环境能够从错开重置中显著受益的维度。随后我们将该技术应用于具有挑战性的高维机器人环境，实现了显著更高的样本效率、更快的墙钟时间收敛以及更强的最终性能。最后，与朴素的同步展开相比，这种技术在并行环境数量增加时扩展性更好。
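
The temporal-diversity argument above can be illustrated with a small sketch that compares which episode timesteps a single short rollout covers under synchronous and staggered starts. It assumes fixed-length episodes that wrap around on reset and evenly spaced starting offsets; the helper names are hypothetical and this is not the authors' implementation.

```python
def initial_steps(num_envs, horizon, staggered):
    """Episode timestep at which each parallel environment starts."""
    if not staggered:
        return [0] * num_envs  # synchronous: every env starts at step 0
    # Staggered: spread the starting points evenly over the task horizon.
    return [(i * horizon) // num_envs for i in range(num_envs)]


def rollout_timesteps(starts, horizon, rollout_len):
    """Episode timesteps each env visits in one rollout (wrapping on reset)."""
    return [[(s + k) % horizon for k in range(rollout_len)] for s in starts]


def distinct_timesteps(batch):
    """Number of different episode timesteps present in a training batch."""
    return len({t for row in batch for t in row})


horizon, num_envs, rollout_len = 100, 8, 5
sync_batch = rollout_timesteps(
    initial_steps(num_envs, horizon, False), horizon, rollout_len)
stag_batch = rollout_timesteps(
    initial_steps(num_envs, horizon, True), horizon, rollout_len)

print(distinct_timesteps(sync_batch))  # 5
print(distinct_timesteps(stag_batch))  # 40
```

In this toy setting the synchronous batch only ever contains the first five steps of an episode, while the staggered batch samples eight separate regions of the horizon in the same update.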

Breaking the Safety-Capability Tradeoff: Reinforcement Learning with Verifiable Rewards Maintains Safety Guardrails in LLMs

打破安全与能力权衡：带可验证奖励的强化学习维护大型语言模型的安全防护

Authors: Dongkyu Derek Cho, Huan Song, Arijit Ghosh Chowdhury, Haotian An, Yawei Wang, Rohit Thekkanal, Negin Sokhandan, Sharlina Keshava, Hannah Marlowe
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2511.21050
Pdf link: https://arxiv.org/pdf/2511.21050
Abstract Fine-tuning large language models (LLMs) for downstream tasks typically exhibits a fundamental safety-capability tradeoff, where improving task performance degrades safety alignment even on benign datasets. This degradation persists across standard approaches including supervised finetuning (SFT) and reinforcement learning from human feedback (RLHF). While reinforcement learning with verifiable rewards (RLVR) has emerged as a promising alternative that optimizes models on objectively measurable tasks, its safety implications remain unexplored. We present the first comprehensive theoretical and empirical analysis of safety properties in RLVR. Theoretically, we derive upper bounds on safety drift under KL-constrained optimization and prove conditions under which safety degradation is eliminated. Empirically, we conduct extensive experiments across five adversarial safety benchmarks, demonstrating that RLVR can simultaneously enhance reasoning capabilities while maintaining or improving safety guardrails. Our comprehensive ablation studies examine the effects of optimization algorithms, model scale, and task domains. Our findings challenge the prevailing assumption of an inevitable safety-capability tradeoff, and establish that a specific training methodology can achieve both objectives simultaneously, providing insights for the safe deployment of reasoning-capable LLMs.
中文摘要 为下游任务微调大型语言模型（LLMs）通常存在基本的安全与能力权衡，即提升任务性能也会降低安全对齐性，即使是在无害数据集上。这种退化在包括监督式微调（SFT）和人类反馈强化学习（RLHF）在内的标准方法中依然存在。尽管带可验证奖励的强化学习（RLVR）已成为一种有前景的替代方案，能够优化客观可测量任务的模型，但其安全性影响尚未被充分探讨。我们首次提出了RLVR安全性的全面理论和实证分析。理论上，我们在KL约束优化下推导安全漂移的上界，并证明了在哪些条件下安全退化被消除。通过实证，我们在五个对抗性安全基准中进行了大量实验，证明RLVR能够在维护或改善安全防护措施的同时提升推理能力。我们的综合消融研究考察了优化算法、模型规模和任务域的影响。我们的发现挑战了普遍存在的安全能力权衡假设，确立了特定训练方法可以同时实现这两个目标，为安全部署具备推理能力的大型语言模型提供了见解。

Aligning LLMs with Biomedical Knowledge using Balanced Fine-Tuning

利用平衡微调将LLM与生物医学知识对齐

Authors: Zhenchao Tang, Fang Wang, Haohuai He, Jiale Zhou, Tianxu Lv, Jun Zhu, Shouzhi Chen, Minghao Yang, Yu Wang, Jiayang Wu, Yidong Song, Jianhua Yao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.21075
Pdf link: https://arxiv.org/pdf/2511.21075
Abstract Effective post-training is essential to align Large Language Models (LLMs) with specialized biomedical knowledge to accelerate life science research. However, current approaches face significant limitations. First, biomedical reasoning involves intricate mechanisms often represented by sparse textual data. Standard Supervised Fine-Tuning (SFT) tends to overfit to surface-level instruction patterns without effectively internalizing this fragmented scientific knowledge. Second, Reinforcement Learning (RL) is impractical for this domain, as defining meaningful rewards often necessitates prohibitive experimental validation (e.g., wet-lab verification of drug responses), rendering real-time feedback unfeasible. We propose Balanced Fine-Tuning (BFT), an efficient post-training method designed to learn complex reasoning from sparse data without external reward signals. BFT operates through a two-layer weighting mechanism: 1. At the token level, it scales loss via prediction probabilities to stabilize gradients and prevent overfitting; 2. At the sample level, it uses "minimum group confidence" to adaptively enhance the learning of hard samples. Experiments demonstrate that BFT significantly outperforms SFT. In medical tasks, it enables LLMs to acquire knowledge that SFT misses. In biological tasks, BFT-based LLMs surpass GeneAgent (an accurate agent for biology analysis) in biological process reasoning. Moreover, the text embeddings generated by BFT can be directly applied to downstream tasks, such as gene interaction and single-cell perturbation response prediction. These results indicate that BFT facilitates broad applications of LLMs in biomedical research.
中文摘要 有效的后训练对于将大型语言模型（LLMs）与专业生物医学知识对齐、从而加速生命科学研究至关重要。然而，当前的方法面临显著局限。首先，生物医学推理涉及复杂的机制，而这些机制通常只由稀疏的文本数据表示。标准监督微调（SFT）往往过度拟合表层的指令模式，而未能有效内化这些零散的科学知识。其次，强化学习（RL）在该领域不切实际，因为定义有意义的奖励往往需要代价过高的实验验证（例如通过湿实验验证药物反应），使实时反馈不可行。我们提出了平衡微调（BFT），这是一种高效的后训练方法，旨在从稀疏数据中学习复杂推理，而无需外部奖励信号。BFT通过两层加权机制运作：1. 在词元（token）层面，通过预测概率缩放损失，以稳定梯度并防止过拟合；2. 在样本层面，使用“最小组置信度”自适应地增强对困难样本的学习。实验表明BFT显著优于SFT。在医学任务中，它使大型语言模型能够获得SFT所遗漏的知识。在生物学任务中，基于BFT的大型语言模型在生物过程推理方面超越了GeneAgent（一种用于生物学分析的精确智能体）。此外，BFT生成的文本嵌入可以直接应用于下游任务，如基因相互作用和单细胞扰动响应预测。这些结果表明，BFT促进了大型语言模型在生物医学研究中的广泛应用。

Dual-Agent Reinforcement Learning for Adaptive and Cost-Aware Visual-Inertial Odometry

双智能体强化学习用于自适应且成本感知的视觉惯性里程计

Authors: Feiyang Pan, Shenghe Zheng, Chunyan Yin, Guangbin Dou
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.21083
Pdf link: https://arxiv.org/pdf/2511.21083
Abstract Visual-Inertial Odometry (VIO) is a critical component for robust ego-motion estimation, enabling foundational capabilities such as autonomous navigation in robotics and real-time 6-DoF tracking for augmented reality. Existing methods face a well-known trade-off: filter-based approaches are efficient but prone to drift, while optimization-based methods, though accurate, rely on computationally prohibitive Visual-Inertial Bundle Adjustment (VIBA) that is difficult to run on resource-constrained platforms. Rather than removing VIBA altogether, we aim to reduce how often and how heavily it must be invoked. To this end, we cast two key design choices in modern VIO, when to run the visual frontend and how strongly to trust its output, as sequential decision problems, and solve them with lightweight reinforcement learning (RL) agents. Our framework introduces a lightweight, dual-pronged RL policy that serves as our core contribution: (1) a Select Agent intelligently gates the entire VO pipeline based only on high-frequency IMU data; and (2) a composite Fusion Agent that first estimates a robust velocity state via a supervised network, before an RL policy adaptively fuses the full (p, v, q) state. Experiments on the EuRoC MAV and TUM-VI datasets show that, in our unified evaluation, the proposed method achieves a more favorable accuracy-efficiency-memory trade-off than prior GPU-based VO/VIO systems: it attains the best average ATE while running up to 1.77 times faster and using less GPU memory. Compared to classical optimization-based VIO systems, our approach maintains competitive trajectory accuracy while substantially reducing computational load.
中文摘要 视觉惯性里程计（VIO）是稳健的自我运动估计的关键组成部分，支持机器人自主导航和增强现实实时6自由度（6-DoF）跟踪等基础能力。现有方法面临一个众所周知的权衡：基于滤波器的方法高效但容易漂移，而基于优化的方法虽然准确，但依赖计算量极高的视觉惯性光束法平差（VIBA），难以在资源受限的平台上运行。我们并非完全取消VIBA，而是希望减少其被调用的频率和强度。为此，我们将现代VIO中的两个关键设计选择（何时运行视觉前端，以及对其输出的信任程度）建模为序贯决策问题，并用轻量级强化学习（RL）智能体来解决。我们的框架引入了一套轻量级、双管齐下的强化学习策略，作为我们的核心贡献：（1）选择智能体（Select Agent）仅基于高频IMU数据智能地对整个VO流水线进行门控；（2）复合的融合智能体（Fusion Agent）先通过监督网络估计稳健的速度状态，再由强化学习策略自适应地融合完整的（p, v, q）状态。在EuRoC MAV和TUM-VI数据集上的实验表明，在我们的统一评估中，所提方法比以往基于GPU的VO/VIO系统实现了更有利的精度-效率-内存权衡：它取得了最佳的平均ATE，同时运行速度最高提升至1.77倍，且GPU内存消耗更少。与经典的基于优化的VIO系统相比，我们的方法在显著降低计算负载的同时，保持了具有竞争力的轨迹精度。

SocialNav: Training Human-Inspired Foundation Model for Socially-Aware Embodied Navigation

SocialNav：训练受人类启发的基础模型，用于具社会意识的具身导航

Authors: Ziyi Chen, Yingnan Guo, Zedong Chu, Minghua Luo, Yanfen Shen, Mingchao Sun, Junjun Hu, Shichao Xie, Kuan Yang, Pei Shi, Zhining Gu, Lu Liu, Honglin Han, Xiaolong Wu, Mu Xu, Yu Zhang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.21135
Pdf link: https://arxiv.org/pdf/2511.21135
Abstract Embodied navigation that adheres to social norms remains an open research challenge. Our SocialNav is a foundational model for socially-aware navigation with a hierarchical "brain-action" architecture, capable of understanding high-level social norms and generating low-level, socially compliant trajectories. To enable such dual capabilities, we construct the SocNav Dataset, a large-scale collection of 7 million samples, comprising (1) a Cognitive Activation Dataset providing social reasoning signals such as chain-of-thought explanations and social traversability prediction, and (2) an Expert Trajectories Pyramid aggregating diverse navigation demonstrations from internet videos, simulated environments, and real-world robots. A multi-stage training pipeline is proposed to gradually inject and refine navigation intelligence: we first inject general navigation skills and social norms understanding into the model via imitation learning, and then refine such skills through a deliberately designed Socially-Aware Flow Exploration GRPO (SAFE-GRPO), the first flow-based reinforcement learning framework for embodied navigation that explicitly rewards socially compliant behaviors. SocialNav achieves +38% success rate and +46% social compliance rate compared to the state-of-the-art method, demonstrating strong gains in both navigation performance and social compliance. Our project page: this https URL
中文摘要 遵循社会规范的具身导航仍是一个开放的研究挑战。我们的SocialNav是一个具有分层“大脑-动作”架构的社会感知导航基础模型，能够理解高层社会规范并生成低层的、符合社会规范的轨迹。为实现这种双重能力，我们构建了SocNav数据集，这是一个包含700万个样本的大规模集合，包括（1）认知激活数据集，提供思维链解释和社会可通行性预测等社会推理信号，以及（2）专家轨迹金字塔，汇集了来自互联网视频、模拟环境和真实机器人的多样化导航演示。我们提出了多阶段训练流程，以逐步注入并完善导航智能：首先通过模仿学习向模型注入通用导航技能和对社会规范的理解，然后通过专门设计的社会感知流探索GRPO（SAFE-GRPO）来完善这些技能；SAFE-GRPO是首个面向具身导航、明确奖励符合社会规范行为的基于流的强化学习框架。与最先进方法相比，SocialNav实现了+38%的成功率和+46%的社会合规率，在导航性能和社会合规性方面均有显著提升。我们的项目页面：此 https URL

Maglev-Pentabot: Magnetic Levitation System for Non-Contact Manipulation using Deep Reinforcement Learning

Maglev-Pentabot：基于深度强化学习的非接触式操控磁悬浮系统

Authors: Guoming Huang, Qingyi Zhou, Dianjing Liu, Shuai Zhang, Ming Zhou, Zongfu Yu
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.21149
Pdf link: https://arxiv.org/pdf/2511.21149
Abstract Non-contact manipulation has emerged as a transformative approach across various industrial fields. However, current flexible 2D and 3D non-contact manipulation techniques are often limited to microscopic scales, typically controlling objects in the milligram range. In this paper, we present a magnetic levitation system, termed Maglev-Pentabot, designed to address this limitation. The Maglev-Pentabot leverages deep reinforcement learning (DRL) to develop complex control strategies for manipulating objects in the gram range. Specifically, we propose an electromagnet arrangement optimized through numerical analysis to maximize controllable space. Additionally, an action remapping method is introduced to address sample sparsity issues caused by the strong nonlinearity in magnetic field intensity, hence allowing the DRL controller to converge. Experimental results demonstrate flexible manipulation capabilities, and notably, our system can generalize to transport tasks it has not been explicitly trained for. Furthermore, our approach can be scaled to manipulate heavier objects using larger electromagnets, offering a reference framework for industrial-scale robotic applications.
中文摘要 非接触式操控已成为各类工业领域的变革性方法。然而，目前灵活的二维和三维非接触式操控技术往往局限于微观尺度，通常只能控制毫克级的物体。本文提出了一种名为Maglev-Pentabot的磁悬浮系统，旨在解决这一限制。Maglev-Pentabot利用深度强化学习（DRL）开发复杂的控制策略，用于操控克级物体。具体来说，我们提出了一种通过数值分析优化的电磁铁布置，以最大化可控空间。此外，我们引入了动作重映射方法，以解决磁场强度的强非线性引起的样本稀疏问题，从而使DRL控制器能够收敛。实验结果展示了灵活的操控能力，值得注意的是，我们的系统能够泛化到未经明确训练的运输任务。此外，我们的方法还可以通过使用更大的电磁铁扩展到操控更重的物体，为工业级机器人应用提供了参考框架。

Kinematics-Aware Multi-Policy Reinforcement Learning for Force-Capable Humanoid Loco-Manipulation

运动学感知的多策略强化学习，用于具备施力能力的人形机器人移动操作

Authors: Kaiyan Xiao, Zihan Xu, Cheng Zhe, Chengju Liu, Qijun Chen
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.21169
Pdf link: https://arxiv.org/pdf/2511.21169
Abstract Humanoid robots, with their human-like morphology, hold great potential for industrial applications. However, existing loco-manipulation methods primarily focus on dexterous manipulation, falling short of the combined requirements for dexterity and proactive force interaction in high-load industrial scenarios. To bridge this gap, we propose a reinforcement learning-based framework with a decoupled three-stage training pipeline, consisting of an upper-body policy, a lower-body policy, and a delta-command policy. To accelerate upper-body training, a heuristic reward function is designed. By implicitly embedding forward kinematics priors, it enables the policy to converge faster and achieve superior performance. For the lower body, a force-based curriculum learning strategy is developed, enabling the robot to actively exert and regulate interaction forces with the environment.
中文摘要 人形机器人因其类人形态，在工业应用中具有巨大潜力。然而，现有的移动操作（loco-manipulation）方法主要侧重于灵巧操作，未能满足高负载工业场景中对灵巧性和主动力交互的综合要求。为弥合这一差距，我们提出了一个基于强化学习的框架，采用解耦的三阶段训练流程，包括上半身策略、下半身策略和增量指令（delta-command）策略。为了加速上半身训练，我们设计了一个启发式奖励函数。通过隐式嵌入正向运动学先验，它使策略能够更快地收敛并实现更优的性能。针对下半身，我们开发了基于力的课程学习策略，使机器人能够主动施加和调节与环境的交互力。

Sparse shepherding control of large-scale multi-agent systems via Reinforcement Learning

通过强化学习对大规模多智能体系统的稀疏牧羊控制

Authors: Luigi Catello, Italo Napolitano, Davide Salzano, Mario di Bernardo
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2511.21304
Pdf link: https://arxiv.org/pdf/2511.21304
Abstract We propose a reinforcement learning framework for sparse indirect control of large-scale multi-agent systems, where few controlled agents shape the collective behavior of many uncontrolled agents. The approach addresses this multi-scale challenge by coupling ODEs (modeling controlled agents) with a PDE (describing the uncontrolled population density), capturing how microscopic control achieves macroscopic objectives. Our method combines model-free reinforcement learning with adaptive interaction strength compensation to overcome sparse actuation limitations. Numerical validation demonstrates effective density control, with the system achieving target distributions while maintaining robustness to disturbances and measurement noise, confirming that learning-based sparse control can replace computationally expensive online optimization.
中文摘要 我们提出了一种强化学习框架，用于大规模多智能体系统的稀疏间接控制，在这些系统中，少数受控智能体塑造了许多非受控智能体的集体行为。该方法通过将常微分方程（建模受控主体）与偏微分方程（描述非受控种群密度）结合，解决了这一多尺度挑战，捕捉了微观控制如何实现宏观目标。我们的方法结合了无模型强化学习与自适应交互强度补偿，以克服稀疏驱动的限制。数值验证证明了有效的密度控制，系统在实现目标分布的同时保持对干扰和测量噪声的鲁棒性，证实基于学习的稀疏控制可以取代计算量高的在线优化。

Hybrid-AIRL: Enhancing Inverse Reinforcement Learning with Supervised Expert Guidance

Authors: Bram Silue, Santiago Amaya-Corredor, Patrick Mannion, Lander Willem, Pieter Libin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.21356
Pdf link: https://arxiv.org/pdf/2511.21356
Abstract Adversarial Inverse Reinforcement Learning (AIRL) has shown promise in addressing the sparse reward problem in reinforcement learning (RL) by inferring dense reward functions from expert demonstrations. However, its performance in highly complex, imperfect-information settings remains largely unexplored. To explore this gap, we evaluate AIRL in the context of Heads-Up Limit Hold'em (HULHE) poker, a domain characterized by sparse, delayed rewards and significant uncertainty. In this setting, we find that AIRL struggles to infer a sufficiently informative reward function. To overcome this limitation, we contribute Hybrid-AIRL (H-AIRL), an extension that enhances reward inference and policy learning by incorporating a supervised loss derived from expert data and a stochastic regularization mechanism. We evaluate H-AIRL on a carefully selected set of Gymnasium benchmarks and the HULHE poker setting. Additionally, we analyze the learned reward function through visualization to gain deeper insights into the learning process. Our experimental results show that H-AIRL achieves higher sample efficiency and more stable learning compared to AIRL. This highlights the benefits of incorporating supervised signals into inverse RL and establishes H-AIRL as a promising framework for tackling challenging, real-world settings.
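
The idea of adding a supervised expert signal to an adversarial reward can be sketched as follows. This is a hedged illustration, not the authors' implementation: `airl_reward` and `discriminator` follow the standard AIRL form, while `supervised_loss`, `hybrid_loss`, and the `weight` parameter are assumptions made for this example, and the stochastic regularization mechanism mentioned in the abstract is not modeled.

```python
import math

def discriminator(f_value, log_pi):
    """Standard AIRL discriminator D = exp(f) / (exp(f) + pi(a|s))."""
    return 1.0 / (1.0 + math.exp(log_pi - f_value))

def airl_reward(f_value, log_pi):
    """AIRL reward log D - log(1 - D), which simplifies to f - log pi."""
    return f_value - log_pi

def supervised_loss(expert_action_probs):
    """Assumed supervised term: mean negative log-likelihood that the
    policy assigns to the actions the expert actually took."""
    return -sum(math.log(p) for p in expert_action_probs) / len(expert_action_probs)

def hybrid_loss(policy_loss, expert_action_probs, weight=0.5):
    """Assumed hybrid objective: RL policy loss plus a weighted supervised term."""
    return policy_loss + weight * supervised_loss(expert_action_probs)

if __name__ == "__main__":
    f_value, log_pi = 0.3, math.log(0.25)
    print("reward:", airl_reward(f_value, log_pi))
    print("hybrid loss:", hybrid_loss(1.0, [0.5, 0.5]))
```

In this sketch the supervised term pulls the policy toward expert actions even when the inferred reward is uninformative, which is one plausible reading of how supervised guidance could stabilize learning.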

Monet: Reasoning in Latent Visual Space Beyond Images and Language

Authors: Qixun Wang, Yang Shi, Yifei Wang, Yuanxing Zhang, Pengfei Wan, Kun Gai, Xianghua Ying, Yisen Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.21395
Pdf link: https://arxiv.org/pdf/2511.21395
Abstract "Thinking with images" has emerged as an effective paradigm for advancing visual reasoning, extending beyond text-only chains of thought by injecting visual evidence into intermediate reasoning steps. However, existing methods fall short of human-like abstract visual thinking, as their flexibility is fundamentally limited by external tools. In this work, we introduce Monet, a training framework that enables multimodal large language models (MLLMs) to reason directly within the latent visual space by generating continuous embeddings that function as intermediate visual thoughts. We identify two core challenges in training MLLMs for latent visual reasoning: high computational cost in latent-vision alignment and insufficient supervision over latent embeddings, and address them with a three-stage distillation-based supervised fine-tuning (SFT) pipeline. We further reveal a limitation of applying GRPO to latent reasoning: it primarily enhances text-based reasoning rather than latent reasoning. To overcome this, we propose VLPO (Visual-latent Policy Optimization), a reinforcement learning method that explicitly incorporates latent embeddings into policy gradient updates. To support SFT, we construct Monet-SFT-125K, a high-quality text-image interleaved CoT dataset containing 125K real-world, chart, OCR, and geometry CoTs. Our model, Monet-7B, shows consistent gains across real-world perception and reasoning benchmarks and exhibits strong out-of-distribution generalization on challenging abstract visual reasoning tasks. We also empirically analyze the role of each training component and discuss our early unsuccessful attempts, providing insights for future developments in visual latent reasoning. Our model, data, and code are available at this https URL.

Decentralized Shepherding of Non-Cohesive Swarms Through Cluttered Environments via Deep Reinforcement Learning

Authors: Cristiana Punzo, Italo Napolitano, Cinzia Tomaselli, Mario di Bernardo
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2511.21405
Pdf link: https://arxiv.org/pdf/2511.21405
Abstract This paper investigates decentralized shepherding in cluttered environments, where a limited number of herders must guide a larger group of non-cohesive, diffusive targets toward a goal region in the presence of static obstacles. A hierarchical control architecture is proposed, integrating a high-level target assignment rule, where each herder is paired with a selected target, with a learning-based low-level driving module that enables effective steering of the assigned target. The low-level policy is trained in a one-herder-one-target scenario with a rectangular obstacle using Proximal Policy Optimization and then directly extended to multi-agent settings with multiple obstacles without requiring retraining. Numerical simulations demonstrate smooth, collision-free trajectories and consistent convergence to the goal region, highlighting the potential of reinforcement learning for scalable, model-free shepherding in complex environments.
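
A high-level target assignment rule of the kind described above could look like the sketch below. The abstract does not specify the rule, so everything here is hypothetical: each herder independently selects the nearest target that is still outside the goal region, and the names `select_target`, `goal`, and `goal_radius` are invented for illustration.

```python
import math

def select_target(herder, targets, goal, goal_radius):
    """Return the index of the nearest target still outside the goal region,
    or None if every target is already inside it (hypothetical rule)."""
    pending = [i for i, t in enumerate(targets)
               if math.dist(t, goal) > goal_radius]
    if not pending:
        return None
    # Ties are broken by the lower index so the choice is deterministic.
    return min(pending, key=lambda i: (math.dist(herder, targets[i]), i))

if __name__ == "__main__":
    targets = [(1.0, 0.0), (9.0, 0.0), (5.0, 5.0), (20.0, 20.5)]
    goal, goal_radius = (20.0, 20.0), 1.0
    for herder in [(0.0, 0.0), (10.0, 0.0)]:
        print(herder, "->", select_target(herder, targets, goal, goal_radius))
```

Because each herder decides from its own position only, the rule needs no coordination, although two herders may pick the same target; a learned low-level policy would then steer the selected target.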

Predictive Safety Shield for Dyna-Q Reinforcement Learning

Authors: Jin Pin, Krasowski Hanna, Vanneaux Elena
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2511.21531
Pdf link: https://arxiv.org/pdf/2511.21531
Abstract Obtaining safety guarantees for reinforcement learning is a major challenge to achieve applicability for real-world tasks. Safety shields extend standard reinforcement learning and achieve hard safety guarantees. However, existing safety shields commonly use random sampling of safe actions or a fixed fallback controller, therefore disregarding future performance implications of different safe actions. In this work, we propose a predictive safety shield for model-based reinforcement learning agents in discrete space. Our safety shield updates the Q-function locally based on safe predictions, which originate from a safe simulation of the environment model. This shielding approach improves performance while maintaining hard safety guarantees. Our experiments on gridworld environments demonstrate that even short prediction horizons can be sufficient to identify the optimal path. We observe that our approach is robust to distribution shifts, e.g., between simulation and reality, without requiring additional training.
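
The contrast between picking a random safe action and picking one by safe prediction can be illustrated on a toy gridworld. The sketch below is an assumption-laden illustration rather than the paper's algorithm: the grid layout, the known deterministic model, and the function names are hypothetical, and the Dyna-Q learning and planning loop is omitted. The shield only considers actions whose predicted successor is safe, rolls the model forward a few safe steps, and writes the predicted values into the local Q entries.

```python
GAMMA = 0.9
GRID = ["S.X",
        "...",
        "X.G"]  # S start, G goal, X unsafe cell (hypothetical layout)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def cell(state):
    row, col = state
    return GRID[row][col]

def step(state, action):
    """Deterministic model: move if the target cell is on the grid, else stay."""
    d_row, d_col = ACTIONS[action]
    row, col = state[0] + d_row, state[1] + d_col
    if not (0 <= row < len(GRID) and 0 <= col < len(GRID[0])):
        row, col = state
    nxt = (row, col)
    return nxt, (1.0 if cell(nxt) == "G" else 0.0)

def safe_actions(state):
    """Actions whose predicted successor is not an unsafe cell."""
    return [a for a in ACTIONS if cell(step(state, a)[0]) != "X"]

def safe_lookahead(state, q, depth):
    """Best discounted return over `depth` safe steps, bootstrapped from q."""
    if cell(state) == "G":
        return 0.0
    options = safe_actions(state)
    if not options:
        return 0.0
    if depth == 0:
        return max(q.get((state, a), 0.0) for a in options)
    return max(r + GAMMA * safe_lookahead(nxt, q, depth - 1)
               for nxt, r in (step(state, a) for a in options))

def shielded_action(state, q, horizon):
    """Refresh q for the safe actions in `state` and return the best one.
    Assumes at least one safe action exists, which holds on this grid."""
    for a in safe_actions(state):
        nxt, r = step(state, a)
        q[(state, a)] = r + GAMMA * safe_lookahead(nxt, q, horizon - 1)
    return max(safe_actions(state), key=lambda a: q[(state, a)])

if __name__ == "__main__":
    q, state, path = {}, (0, 0), []
    while cell(state) != "G" and len(path) < 10:
        action = shielded_action(state, q, horizon=4)
        state, _ = step(state, action)
        path.append(action)
    print(path)
```

Even with an empty Q-table, a horizon of four steps is enough here for the shield to find a route to the goal that never enters an unsafe cell, which mirrors the abstract's observation that short prediction horizons can suffice.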

BAMAS: Structuring Budget-Aware Multi-Agent Systems

Authors: Liming Yang, Junyu Luo, Xuanzhe Liu, Yiling Lou, Zhenpeng Chen
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.21572
Pdf link: https://arxiv.org/pdf/2511.21572
Abstract Large language model (LLM)-based multi-agent systems have emerged as a powerful paradigm for enabling autonomous agents to solve complex tasks. As these systems scale in complexity, cost becomes an important consideration for practical deployment. However, existing work rarely addresses how to structure multi-agent systems under explicit budget constraints. In this paper, we propose BAMAS, a novel approach for building multi-agent systems with budget awareness. BAMAS first selects an optimal set of LLMs by formulating and solving an Integer Linear Programming problem that balances performance and cost. It then determines how these LLMs should collaborate by leveraging a reinforcement learning-based method to select the interaction topology. Finally, the system is instantiated and executed based on the selected agents and their collaboration topology. We evaluate BAMAS on three representative tasks and compare it with state-of-the-art agent construction methods. Results show that BAMAS achieves comparable performance while reducing cost by up to 86%.
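
The model-selection step can be illustrated with a toy 0/1 selection problem under a budget. This is a minimal sketch, not the authors' formulation: the candidate names, scores, costs, and the additive objective are all hypothetical, and exhaustive search stands in for an Integer Linear Programming solver.

```python
from itertools import combinations

# Hypothetical candidates: (name, performance score, cost per task).
CANDIDATES = [
    ("model_a", 0.90, 10.0),
    ("model_b", 0.70, 4.0),
    ("model_c", 0.55, 2.0),
    ("model_d", 0.40, 1.0),
]

def select_models(candidates, budget):
    """Return the subset with the highest summed score whose total cost
    stays within the budget, together with that score."""
    best_subset, best_score = (), 0.0
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            cost = sum(c for _, _, c in subset)
            score = sum(s for _, s, _ in subset)
            if cost <= budget and score > best_score:
                best_subset, best_score = subset, score
    return [name for name, _, _ in best_subset], best_score

if __name__ == "__main__":
    names, score = select_models(CANDIDATES, budget=7.0)
    print(names, round(score, 2))
```

Exhaustive search is only practical for a handful of candidates; a real ILP solver would handle larger pools and richer constraints, and the interaction topology would still need to be chosen separately.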

Escaping the Verifier: Learning to Reason via Demonstrations

Authors: Locke Cai, Ivan Provilkov
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.21667
Pdf link: https://arxiv.org/pdf/2511.21667
Abstract Training Large Language Models (LLMs) to reason often relies on Reinforcement Learning (RL) with task-specific verifiers. However, many real-world reasoning-intensive tasks lack verifiers, despite offering abundant expert demonstrations that remain under-utilized for reasoning-focused training. We introduce RARO (Relativistic Adversarial Reasoning Optimization) that learns strong reasoning capabilities from only expert demonstrations via Inverse Reinforcement Learning. Our method sets up an adversarial interaction between a policy (generator) and a relativistic critic (discriminator): the policy learns to mimic expert answers, while the critic learns to compare and distinguish between policy and expert answers. Our method trains both the policy and the critic jointly and continuously via RL, and we identify the key stabilization techniques required for robust learning. Empirically, RARO significantly outperforms strong verifier-free baselines on all of our evaluation tasks -- Countdown, DeepMath, and Poetry Writing -- and enjoys the same robust scaling trends as RL on verifiable tasks. These results demonstrate that our method effectively elicits strong reasoning performance from expert demonstrations alone, enabling robust reasoning learning even when task-specific verifiers are unavailable.

ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration

Authors: Hongjin Su, Shizhe Diao, Ximing Lu, Mingjie Liu, Jiacheng Xu, Xin Dong, Yonggan Fu, Peter Belcak, Hanrong Ye, Hongxu Yin, Yi Dong, Evelina Bakhturina, Tao Yu, Yejin Choi, Jan Kautz, Pavlo Molchanov
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2511.21689
Pdf link: https://arxiv.org/pdf/2511.21689
Abstract Large language models are powerful generalists, yet solving deep and complex problems such as those of the Humanity's Last Exam (HLE) remains both conceptually challenging and computationally expensive. We show that small orchestrators managing other models and a variety of tools can both push the upper bound of intelligence and improve efficiency in solving difficult agentic tasks. We introduce ToolOrchestra, a method for training small orchestrators that coordinate intelligent tools. ToolOrchestra explicitly uses reinforcement learning with outcome-, efficiency-, and user-preference-aware rewards. Using ToolOrchestra, we produce Orchestrator, an 8B model that achieves higher accuracy at lower cost than previous tool-use agents while aligning with user preferences on which tools are to be used for a given query. On HLE, Orchestrator achieves a score of 37.1%, outperforming GPT-5 (35.1%) while being 2.5x more efficient. On tau2-Bench and FRAMES, Orchestrator surpasses GPT-5 by a wide margin while using only about 30% of the cost. Extensive analysis shows that Orchestrator achieves the best trade-off between performance and cost under multiple metrics, and generalizes robustly to unseen tools. These results demonstrate that composing diverse tools with a lightweight orchestration model is both more efficient and more effective than existing methods, paving the way for practical and scalable tool-augmented reasoning systems.
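
A reward that is aware of outcome, efficiency, and user preference might be combined as in the sketch below. The abstract names the three signals but not their form, so the linear combination, the weights, and the function name `orchestration_reward` are assumptions made purely for illustration.

```python
def orchestration_reward(correct, cost, used_tools, preferred_tools,
                         cost_weight=0.1, pref_weight=0.2):
    """Hypothetical scalar reward for one orchestrated trajectory.

    correct: whether the final answer was right (outcome signal)
    cost: total cost of the model and tool calls (efficiency signal)
    used_tools / preferred_tools: tools called vs. tools the user prefers
    """
    outcome = 1.0 if correct else 0.0
    efficiency_penalty = cost_weight * cost
    if used_tools:
        # Fraction of tool calls that used a tool the user prefers.
        preference = sum(t in preferred_tools for t in used_tools) / len(used_tools)
    else:
        preference = 0.0
    return outcome - efficiency_penalty + pref_weight * preference

if __name__ == "__main__":
    print(orchestration_reward(True, 2.0, ["search", "code"], {"search"}))
```

Under these assumed weights, a correct answer reached with cheaper or preferred tools scores higher than the same answer reached expensively, which is the trade-off the abstract describes.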

Keyword: diffusion policy

There is no result