Arxiv Papers of Today

Generated at: 2025-10-31 16:28:59 (UTC+8); Arxiv release time: 2025-10-31 20:00 EDT (2025-11-01 08:00 UTC+8)

38 relevant papers today

Keyword: reinforcement learning

Non-myopic Matching and Rebalancing in Large-Scale On-Demand Ride-Pooling Systems Using Simulation-Informed Reinforcement Learning

基于仿真信息强化学习的大规模按需拼车系统中的非近视匹配和再平衡

Authors: Farnoosh Namdarpour, Joseph Y. J. Chow
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2510.25796
Pdf link: https://arxiv.org/pdf/2510.25796
Abstract Ride-pooling, also known as ride-sharing, shared ride-hailing, or microtransit, is a service wherein passengers share rides. This service can reduce costs for both passengers and operators and reduce congestion and environmental impacts. A key limitation, however, is its myopic decision-making, which overlooks long-term effects of dispatch decisions. To address this, we propose a simulation-informed reinforcement learning (RL) approach. While RL has been widely studied in the context of ride-hailing systems, its application in ride-pooling systems has been less explored. In this study, we extend the learning and planning framework of Xu et al. (2018) from ride-hailing to ride-pooling by embedding a ride-pooling simulation within the learning mechanism to enable non-myopic decision-making. In addition, we propose a complementary policy for rebalancing idle vehicles. By employing n-step temporal difference learning on simulated experiences, we derive spatiotemporal state values and subsequently evaluate the effectiveness of the non-myopic policy using NYC taxi request data. Results demonstrate that the non-myopic policy for matching can increase the service rate by up to 8.4% versus a myopic policy while reducing both in-vehicle and wait times for passengers. Furthermore, the proposed non-myopic policy can decrease fleet size by over 25% compared to a myopic policy, while maintaining the same level of performance, thereby offering significant cost savings for operators. Incorporating rebalancing operations into the proposed framework cuts wait time by up to 27.3%, in-vehicle time by 12.5%, and raises service rate by 15.1% compared to using the framework for matching decisions alone at the cost of increased vehicle minutes traveled per passenger.
中文摘要 拼车，也称为拼车、共享网约车或微型交通，是乘客共享乘车的一种服务。这项服务可以降低乘客和运营商的成本，减少拥堵和环境影响。然而，一个关键的局限性是它的短视决策，它忽视了调度决策的长期影响。为了解决这个问题，我们提出了一种模拟知情强化学习（RL）方法。虽然 RL 在叫车系统中得到了广泛研究，但其在拼车系统中的应用却很少被探索。在这项研究中，我们通过在学习机制中嵌入拼车模拟，将Xu等人（2018）的学习和规划框架从网约车扩展到拼车，以实现非短视决策。此外，我们还提出了闲置车辆再平衡的补充政策。通过对模拟体验采用 n 步时间差学习，我们推导出时空状态值，随后使用纽约市出租车请求数据评估非近视政策的有效性。结果表明，与近视政策相比，非近视匹配政策可将服务率提高多达 8.4%，同时减少车内和乘客的等待时间。此外，与近视政策相比，拟议的非近视政策可以将车队规模减少 25% 以上，同时保持相同的性能水平，从而为运营商节省大量成本。与单独使用该框架进行匹配决策相比，将再平衡作纳入拟议的框架可减少多达 27.3% 的等待时间，减少 12.5% 的车内时间，并将服务率提高 15.1%，但代价是增加每位乘客的车辆行驶分钟数。
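The n-step temporal difference learning on simulated experiences mentioned in the abstract above can be illustrated with a minimal tabular sketch. This is not the authors' implementation: the state encoding (a zone and time-step pair), the reward values, and the function name `n_step_td_update` are assumptions made for illustration only.

```python
from collections import defaultdict


def n_step_td_update(values, trajectory, n=3, alpha=0.1, gamma=0.99):
    """Update tabular state values from one simulated trajectory.

    trajectory is a list of (state, reward) pairs, where the reward is the one
    received on leaving that state. The episode ends after the last pair.
    """
    horizon = len(trajectory)
    for t in range(horizon):
        end = min(t + n, horizon)
        # Discounted sum of up to n rewards starting at step t.
        target = sum(gamma ** (k - t) * trajectory[k][1] for k in range(t, end))
        # Bootstrap from the state n steps ahead if the episode has not ended.
        if end < horizon:
            target += gamma ** (end - t) * values[trajectory[end][0]]
        state = trajectory[t][0]
        values[state] += alpha * (target - values[state])
    return values


# Hypothetical spatiotemporal states: (zone, time step).
values = defaultdict(float)
episode = [(("zone_a", 0), 1.0), (("zone_b", 1), 0.0), (("zone_c", 2), 2.0)]
n_step_td_update(values, episode, n=2, alpha=0.5, gamma=1.0)
```

In a non-myopic matching policy, values of this kind would typically be added to the immediate reward of a vehicle-request assignment, so that assignments ending in high-value zones and times are favored.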

Metis-SPECS: Decoupling Multimodal Learning via Self-distilled Preference-based Cold Start

Metis-SPECS：通过基于自蒸馏偏好的冷启动解耦多模态学习

Authors: Kun Chen, Peng Shi, Haibo Qiu, Zhixiong Zeng, Siqi Yang, Wenji Mao, Lin Ma
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2510.25801
Pdf link: https://arxiv.org/pdf/2510.25801
Abstract Reinforcement learning (RL) with verifiable rewards has recently catalyzed a wave of "MLLM-r1" approaches that bring RL to vision language models. Most representative paradigms begin with a cold start, typically employing supervised fine-tuning (SFT), to initialize the policy before RL. However, SFT-based cold start adopts the reasoning paradigm intertwined with task solution and output format, which may induce instruction-style overfitting, weaken out-of-distribution generalization, and ultimately affect downstream RL. We revisit the cold start along two views, its training method and data construction, and introduce the Generalization Factor (GF) coefficient to quantify the generalization capability under different methods. Our empirical study finds that preference-based training methods (e.g., DPO) generalize better than SFT-based methods in cold start. Motivated by this, we propose SPECS, a Self-distilled, Preference-based Cold Start framework that decouples multimodal learning: (1) it generates introspective preference data pairs via self-distillation, avoiding reliance on larger teachers or manual annotation; (2) it performs preference-based training to learn, focusing on shallow, transferable surface-form criteria (format, structure, style) rather than memorizing content; and (3) it hands off to RL with verifiable rewards for deep reasoning results. Experimental results across multiple multimodal benchmarks show that our decoupling learning framework yields consistent performance gains over strong baselines, improving MEGA-Bench by 4.1% and MathVista by 12.2%. Additional experiments indicate that SPECS contributes to reducing in-distribution "stuckness," improving exploration, stabilizing training, and raising the performance ceiling.
中文摘要 具有可验证奖励的强化学习（RL）最近催化了一波“MLLM-r1”方法，将 RL 引入视觉语言模型。大多数代表性范式从冷启动开始，通常采用监督微调（SFT），在 RL 之前初始化策略。然而，基于SFT的冷启动采用了任务解和输出格式交织在一起的推理范式，可能会诱发指令式过拟合，削弱分布外泛化，最终影响下游RL。本文从冷启动的训练方法和数据构建两个角度重新审视了冷启动，并引入泛化因子（GF）系数来量化不同方法下的泛化能力。我们的实证研究发现，在冷启动中，基于偏好的训练方法（例如 DPO）比基于 SFT 的方法更通用。在此激励下，我们提出了 SPECS——一种自我蒸馏的、基于偏好的冷启动框架，它解耦了多模态学习：（1）通过自我蒸馏生成内省偏好数据对，避免依赖更大的教师或手动注释;（2）进行基于偏好的学习训练，重点关注浅层的、可转移的表面形式标准（格式、结构、风格），而不是记忆内容;（3）将深度推理结果的可验证奖励交给 RL。跨多个多模态基准的实验结果表明，我们的解耦学习框架在强基线上产生了一致的性能提升，将 MEGA-Bench 提高了 4.1%，将 MathVista 提高了 12.2%。其他实验表明，SPECS 有助于减少分布内“卡住”、改善探索、稳定训练和提高性能上限。
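The abstract above names DPO as an example of preference-based training. As a rough sketch, the standard DPO loss for a single preference pair can be computed as below. The log-probability inputs and the default `beta` are illustrative assumptions, and nothing here reflects how SPECS builds its preference pairs.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of responses.

    Each argument is the summed token log-probability of a full response
    under the trainable policy or under the frozen reference model.
    """
    # How much more the policy prefers the chosen response than the reference does.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Illustrative numbers only: the policy already leans toward the chosen response.
loss = dpo_loss(-12.0, -15.0, -13.0, -14.0)
```

When the policy and the reference agree exactly, the margin is zero and the loss equals log 2; the loss shrinks as the policy raises the chosen response relative to the rejected one.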

Adversarial Pre-Padding: Generating Evasive Network Traffic Against Transformer-Based Classifiers

对抗性预填充：针对基于 Transformer 的分类器生成规避网络流量

Authors: Quanliang Jing, Xinxin Fan, Yanyan Liu, Jingping Bi
Subjects: Cryptography and Security (cs.CR); Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2510.25810
Pdf link: https://arxiv.org/pdf/2510.25810
Abstract To date, traffic obfuscation techniques have been widely adopted to protect network data privacy and security by obscuring the true patterns of traffic. Nevertheless, as the pre-trained models emerge, especially transformer-based classifiers, existing traffic obfuscation methods become increasingly vulnerable, as witnessed by current studies reporting the traffic classification accuracy up to 99% or higher. To counter such high-performance transformer-based classification models, we in this paper propose a novel and effective adversarial traffic-generating approach (AdvTraffic; the code and data are available at: http://xxx). Our approach has two key innovations: (i) a pre-padding strategy is proposed to modify packets, which effectively overcomes the limitations of existing research against transformer-based models for network traffic classification; and (ii) a reinforcement learning model is employed to optimize network traffic perturbations, aiming to maximize adversarial effectiveness against transformer-based classification models. To the best of our knowledge, this is the first attempt to apply adversarial perturbation techniques to defend against transformer-based traffic classifiers. Furthermore, our method can be easily deployed into practical network environments. Finally, multi-faceted experiments are conducted across several real-world datasets, and the experimental results demonstrate that our proposed method can effectively undermine transformer-based classifiers, significantly reducing classification accuracy from 99% to as low as 25.68%.
中文摘要 迄今为止，流量混淆技术已被广泛采用，通过掩盖真实的流量模式来保护网络数据隐私和安全。然而，随着预训练模型的出现，尤其是基于 Transformer 的分类器，现有的流量混淆方法变得越来越脆弱，目前报告流量分类准确率高达 99% 或更高的研究就证明了这一点。为了对抗这种基于Transformer的高性能分类模型，本文提出了一种新颖有效的\underline{adv}ersarial \underline{traffic}生成方法（AdvTraffic\footnote{代码和数据可在：http://xxx}）中找到。我们的方法有两大关键创新：（i）提出了一种预填充策略来修改数据包，有效地克服了现有研究对基于变压器的网络流量分类模型的局限性;（ii）采用强化学习模型来优化网络流量扰动，旨在最大限度地提高对基于Transformer的分类模型的对抗性。据我们所知，这是首次尝试应用对抗性扰动技术来防御基于变压器的流量分类器。此外，我们的方法可以轻松部署到实际的网络环境中。最后，在多个真实世界数据集上进行了多方面的实验，实验结果表明，所提方法能够有效破坏基于Transformer的分类器，将分类准确率从99%显著降低到25.68%。

MedVLSynther: Synthesizing High-Quality Visual Question Answering from Medical Documents with Generator-Verifier LMMs

MedVLSynther：使用生成器验证器 LMM 从医疗文档中合成高质量的视觉问答

Authors: Xiaoke Huang, Ningsen Wang, Hui Liu, Xianfeng Tang, Yuyin Zhou
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.25867
Pdf link: https://arxiv.org/pdf/2510.25867
Abstract Large Multimodal Models (LMMs) are increasingly capable of answering medical questions that require joint reasoning over images and text, yet training general medical VQA systems is impeded by the lack of large, openly usable, high-quality corpora. We present MedVLSynther, a rubric-guided generator-verifier framework that synthesizes high-quality multiple-choice VQA items directly from open biomedical literature by conditioning on figures, captions, and in-text references. The generator produces self-contained stems and parallel, mutually exclusive options under a machine-checkable JSON schema; a multi-stage verifier enforces essential gates (self-containment, single correct answer, clinical validity, image-text consistency), awards fine-grained positive points, and penalizes common failure modes before acceptance. Applying this pipeline to PubMed Central yields MedSynVQA: 13,087 audited questions over 14,803 images spanning 13 imaging modalities and 28 anatomical regions. Training open-weight LMMs with reinforcement learning using verifiable rewards improves accuracy across six medical VQA benchmarks, achieving averages of 55.85 (3B) and 58.15 (7B), with up to 77.57 on VQA-RAD and 67.76 on PathVQA, outperforming strong medical LMMs. Ablations verify that both generation and verification are necessary and that more verified data consistently helps, and a targeted contamination analysis detects no leakage from evaluation suites. By operating entirely on open literature and open-weight models, MedVLSynther offers an auditable, reproducible, and privacy-preserving path to scalable medical VQA training data.
中文摘要 大型多模态模型（LMM）越来越有能力回答需要对图像和文本进行联合推理的医学问题，但由于缺乏大型、可公开使用的、高质量的语料库，训练通用医疗 VQA 系统受到阻碍。我们提出了 MedVLSynther，这是一个评分标准引导的生成器-验证器框架，它通过以图表、标题和文本参考为条件，直接从开放的生物医学文献中合成高质量的多项选择 VQA 项目。生成器在机器可检查的 JSON 模式下生成独立的词干和并行的互斥选项;多阶段验证器强制执行基本门（自包含、单一正确答案、临床有效性、图像文本一致性），奖励细粒度的正分，并在接受之前惩罚常见的失败模式。将此管道应用于 PubMed Central 可产生 MedSynVQA：13,087 个审核问题，涉及 14,803 张图像，涵盖 13 种成像模式和 28 个解剖区域。使用可验证奖励通过强化学习训练开放权重 LMM，提高了六个医疗 VQA 基准的准确性，平均得分为 55.85 （3B）和 58.15 （7B），VQA-RAD 高达 77.57，PathVQA 高达 67.76，优于强大的医疗 LMM。A：消融验证生成和验证都是必要的，并且更多经过验证的数据始终有帮助，并且有针对性的污染分析检测到评估套件没有泄漏。通过完全在开放文献和开放权重模型上运行，MedVLSynther 为可扩展的医疗 VQA 训练数据提供了一条可审计、可重复和保护隐私的途径。
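The abstract above describes items produced under a machine-checkable JSON schema, with a gate requiring a single correct answer. A minimal sketch of such a structural check follows. The field names `stem`, `options`, and `answer` are hypothetical, since the abstract does not give the actual schema, and the clinical-validity and image-text gates are not modeled.

```python
import json


def check_vqa_item(raw):
    """Return a list of problems; an empty list means the item passes these checks."""
    try:
        item = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(item, dict):
        return ["top level must be an object"]

    problems = []
    stem = item.get("stem")
    if not isinstance(stem, str) or not stem.strip():
        problems.append("missing stem")

    options = item.get("options")
    if not isinstance(options, dict) or len(options) < 2:
        problems.append("need at least two options")
        return problems

    # Options that differ only in case or whitespace are treated as duplicates.
    texts = [str(text).strip().lower() for text in options.values()]
    if len(set(texts)) != len(texts):
        problems.append("duplicate options")

    # The answer must be the key of exactly one listed option.
    answer = item.get("answer")
    if not isinstance(answer, str) or answer not in options:
        problems.append("answer must name exactly one option")
    return problems


# Hypothetical item, for illustration only.
example = json.dumps({
    "stem": "Which imaging modality is shown in the figure?",
    "options": {"A": "CT", "B": "MRI", "C": "Ultrasound", "D": "X-ray"},
    "answer": "B",
})
```

Calling `check_vqa_item(example)` returns an empty list, while an item whose `answer` is not among the option keys is reported as a problem.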

Approximating Human Preferences Using a Multi-Judge Learned System

使用多法官学习系统近似人类偏好

Authors: Eitán Sprejer, Fernando Avalos, Augusto Bernardi, Jose Pedro Brito de Azevedo Faustino, Jacob Haimes, Narmeen Fatimah Oozeer
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.25884
Pdf link: https://arxiv.org/pdf/2510.25884
Abstract Aligning LLM-based judges with human preferences is a significant challenge, as they are difficult to calibrate and often suffer from rubric sensitivity, bias, and instability. Overcoming this challenge advances key applications, such as creating reliable reward models for Reinforcement Learning from Human Feedback (RLHF) and building effective routing systems that select the best-suited model for a given user query. In this work, we propose a framework for modeling diverse, persona-based preferences by learning to aggregate outputs from multiple rubric-conditioned judges. We investigate the performance of this approach against naive baselines and assess its robustness through case studies on both human and LLM-judge biases. Our primary contributions include a persona-based method for synthesizing preference labels at scale and two distinct implementations of our aggregator: a Generalized Additive Model (GAM) and a Multi-Layer Perceptron (MLP).
中文摘要 使基于法学硕士的评委与人类偏好保持一致是一项重大挑战，因为他们难以校准，并且经常受到评分标准敏感性、偏见和不稳定性的困扰。克服这一挑战推进了关键应用，例如为人类反馈强化学习（RLHF）创建可靠的奖励模型，以及构建有效的路由系统，为给定的用户查询选择最适合的模型。在这项工作中，我们提出了一个框架，通过学习汇总多个评分标准评委的输出来模拟多样化的、基于角色的偏好。我们根据幼稚基线调查了这种方法的性能，并通过对人类和法学硕士法官偏见的案例研究来评估其稳健性。我们的主要贡献包括一种基于角色的大规模合成偏好标签的方法，以及我们聚合器的两种不同实现：广义加法模型（GAM）和多层感知器（MLP）。

$\pi_{\texttt{RL}}$: Online RL Fine-tuning for Flow-based Vision-Language-Action Models

$π_\texttt{RL}$：基于流的视觉-语言-行动模型的在线RL微调

Authors: Kang Chen, Zhihao Liu, Tonghe Zhang, Zhen Guo, Si Xu, Hao Lin, Hongzhi Zang, Quanlu Zhang, Zhaofei Yu, Guoliang Fan, Tiejun Huang, Yu Wang, Chao Yu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.25889
Pdf link: https://arxiv.org/pdf/2510.25889
Abstract Vision-Language-Action (VLA) models enable robots to understand and perform complex tasks from multimodal input. Although recent work explores using reinforcement learning (RL) to automate the laborious data collection process in scaling supervised fine-tuning (SFT), applying large-scale RL to flow-based VLAs (e.g., $\pi_0$, $\pi_{0.5}$) remains challenging due to intractable action log-likelihoods from iterative denoising. We address this challenge with $\pi_{\text{RL}}$, an open-source framework for training flow-based VLAs in parallel simulation. $\pi_{\text{RL}}$ implements two RL algorithms: (1) Flow-Noise models the denoising process as a discrete-time MDP with a learnable noise network for exact log-likelihood computation. (2) Flow-SDE integrates denoising with agent-environment interaction, formulating a two-layer MDP that employs ODE-to-SDE conversion for efficient RL exploration. We evaluate $\pi_{\text{RL}}$ on LIBERO and ManiSkill benchmarks. On LIBERO, $\pi_{\text{RL}}$ boosts few-shot SFT models $\pi_0$ and $\pi_{0.5}$ from 57.6% to 97.6% and from 77.1% to 98.3%, respectively. In ManiSkill, we train $\pi_{\text{RL}}$ in 320 parallel environments, improving $\pi_0$ from 41.6% to 85.7% and $\pi_{0.5}$ from 40.0% to 84.8% across 4352 pick-and-place tasks, demonstrating scalable multitask RL under heterogeneous simulation. Overall, $\pi_{\text{RL}}$ achieves significant performance gains and stronger generalization over SFT-models, validating the effectiveness of online RL for flow-based VLAs.
中文摘要 视觉-语言-动作（VLA）模型使机器人能够从多模态输入中理解和执行复杂的任务。尽管最近的工作探索了使用强化学习（RL）来自动化扩展监督微调（SFT）中繁琐的数据收集过程，但由于迭代去噪带来棘手的动作对数可能性，将大规模RL应用于基于流的VLA（例如，$\pi_0$、$\pi_{0.5}$）仍然具有挑战性。我们使用 $\pi_{\text{RL}}$ 来应对这一挑战，这是一个用于在并行模拟中训练基于流的 VLA 的开源框架。$\pi_{\text{RL}}$ 实现了两种 RL 算法：（1） {Flow-Noise} 将去噪过程建模为离散时间 MDP，具有可学习噪声网络，用于精确的对数似然计算。（2）{Flow-SDE}将去噪与智能体-环境交互相结合，制定了采用ODE到SDE转换的双层MDP，以实现高效的RL探索。我们在 LIBERO 和 ManiSkill 基准测试上评估 $\pi_{\text{RL}}$。在 LIBERO 上，$\pi_{\text{RL}}$ 将少样本 SFT 模型 $\pi_0$ 和 $\pi_{0.5}$ 分别从 57.6% 提高到 97.6% 和从 77.1% 提高到 98.3%。在ManiSkill中，我们在320个并行环境中训练了$\pi_{\text{RL}}$，在4352个拾取和放置任务中，$\pi_0$从41.6%提高到85.7%，$\pi_{0.5}$从40.0%提高到84.8%，展示了异构仿真下可扩展的多任务RL。总体而言，与 SFT 模型相比，$\pi_{\text{RL}}$ 实现了显着的性能提升和更强的泛化，验证了在线 RL 对基于流的 VLA 的有效性。

Multi-Agent Reinforcement Learning for Market Making: Competition without Collusion

多智能体强化学习做市：无串通竞争

Authors: Ziyi Wang, Carmine Ventre, Maria Polukarov
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.25929
Pdf link: https://arxiv.org/pdf/2510.25929
Abstract Algorithmic collusion has emerged as a central question in AI: Will the interaction between different AI agents deployed in markets lead to collusion? More generally, understanding how emergent behavior, be it a cartel or market dominance from more advanced bots, affects the market overall is an important research question. We propose a hierarchical multi-agent reinforcement learning framework to study algorithmic collusion in market making. The framework includes a self-interested market maker (Agent A), which is trained in an uncertain environment shaped by an adversary, and three bottom-layer competitors: the self-interested Agent B1 (whose objective is to maximize its own PnL), the competitive Agent B2 (whose objective is to minimize the PnL of its opponent), and the hybrid Agent B$^\star$, which can modulate between the behavior of the other two. To analyze how these agents shape the behavior of each other and affect market outcomes, we propose interaction-level metrics that quantify behavioral asymmetry and system-level dynamics, while providing signals potentially indicative of emergent interaction patterns. Experimental results show that Agent B2 secures dominant performance in a zero-sum setting against B1, aggressively capturing order flow while tightening average spreads, thus improving market execution efficiency. In contrast, Agent B$^\star$ exhibits a self-interested inclination when co-existing with other profit-seeking agents, securing dominant market share through adaptive quoting, yet exerting a milder adverse impact on the rewards of Agents A and B1 compared to B2. These findings suggest that adaptive incentive control supports more sustainable strategic co-existence in heterogeneous agent environments and offers a structured lens for evaluating behavioral design in algorithmic trading systems.
中文摘要 算法共谋已成为人工智能的一个核心问题：部署在市场上的不同人工智能代理之间的交互会导致共谋吗？更一般地说，了解新兴行为（无论是卡特尔还是更高级机器人的市场主导地位）如何影响整个市场是一个重要的研究问题。我们提出了一种分层多智能体强化学习框架来研究做市中的算法共谋。该框架包括一个自利的做市商（Agent~A），它在对手塑造的不确定环境中进行训练，以及三个底层竞争对手：自利的 Agent~B1（其目标是最大化自己的 PnL）、竞争性 Agent~B2（其目标是最小化对手的 PnL）和混合 Agent~B$^\star$，它可以在其他两者的行为之间进行调节。为了分析这些智能体如何塑造彼此的行为并影响市场结果，我们提出了量化行为不对称性和系统级动态的交互级指标，同时提供可能指示紧急交互模式的信号。实验结果表明，Agent~B2 在零和设置下对 B1 具有主导性，在收紧平均价差的同时积极捕获订单流，从而提高了市场执行效率。相比之下，Agent~B$^\star$ 在与其他逐利主体共存时表现出自利倾向，通过自适应报价确保主导市场份额，但与 B2 相比，对 Agent~A 和 B1 的奖励产生较轻微的不利影响。这些发现表明，自适应激励控制支持在异构代理环境中实现更可持续的战略共存，并为评估算法交易系统中的行为设计提供了结构化的视角。
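The three bottom-layer objectives in the abstract above (maximize own PnL, minimize the opponent's PnL, and a hybrid that modulates between the two) can be illustrated with one blended reward. The linear blend and the `weight` parameter are assumptions made for this sketch; the abstract does not say how Agent B$^\star$ actually modulates its behavior.

```python
def hybrid_reward(own_pnl, opponent_pnl, weight):
    """Blend a self-interested and a competitive objective.

    weight = 0.0 gives the self-interested objective (own PnL, as for Agent B1).
    weight = 1.0 gives the competitive objective (negative opponent PnL, as for Agent B2).
    Values in between are a hypothetical stand-in for the hybrid agent.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return (1.0 - weight) * own_pnl - weight * opponent_pnl


# Illustrative PnL values only.
self_interested = hybrid_reward(2.0, 4.0, 0.0)
competitive = hybrid_reward(2.0, 4.0, 1.0)
balanced = hybrid_reward(2.0, 4.0, 0.5)
```

With these numbers the self-interested reward is 2.0, the competitive reward is -4.0, and the balanced blend is -1.0.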

Estimating cognitive biases with attention-aware inverse planning

使用注意力感知逆向计划估计认知偏差

Authors: Sounak Banerjee, Daphne Cornelisse, Deepak Gopinath, Emily Sumner, Jonathan DeCastro, Guy Rosman, Eugene Vinitsky, Mark K. Ho
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.25951
Pdf link: https://arxiv.org/pdf/2510.25951
Abstract People's goal-directed behaviors are influenced by their cognitive biases, and autonomous systems that interact with people should be aware of this. For example, people's attention to objects in their environment will be biased in a way that systematically affects how they perform everyday tasks such as driving to work. Here, building on recent work in computational cognitive science, we formally articulate the attention-aware inverse planning problem, in which the goal is to estimate a person's attentional biases from their actions. We demonstrate how attention-aware inverse planning systematically differs from standard inverse reinforcement learning and how cognitive biases can be inferred from behavior. Finally, we present an approach to attention-aware inverse planning that combines deep reinforcement learning with computational cognitive modeling. We use this approach to infer the attentional strategies of RL agents in real-life driving scenarios selected from the Waymo Open Dataset, demonstrating the scalability of estimating cognitive biases with attention-aware inverse planning.
中文摘要 人们的目标导向行为受到认知偏差的影响，与人互动的自主系统应该意识到这一点。例如，人们对环境中物体的注意力会产生偏差，从而系统地影响他们执行日常任务（例如开车上班）的方式。在这里，基于计算认知科学的最新工作，我们正式阐明了注意力感知逆向规划问题，其目标是从一个人的行为中估计一个人的注意力偏差。我们展示了注意力感知逆向规划与标准逆向强化学习的系统性差异，以及如何从行为中推断出认知偏差。最后，我们提出了一种将深度强化学习与计算认知建模相结合的注意力感知逆向规划方法。我们使用这种方法来推断从 Waymo 开放数据集中选择的现实生活驾驶场景中 RL 智能体的注意力策略，证明了通过注意力感知逆向规划估计认知偏差的可扩展性。

Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning

监督强化学习：从专家轨迹到逐步推理

Authors: Yihe Deng, I-Hung Hsu, Jun Yan, Zifeng Wang, Rujun Han, Gufeng Zhang, Yanfei Chen, Wei Wang, Tomas Pfister, Chen-Yu Lee
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.25992
Pdf link: https://arxiv.org/pdf/2510.25992
Abstract Large Language Models (LLMs) often struggle with problems that require multi-step reasoning. For small-scale open-source models, Reinforcement Learning with Verifiable Rewards (RLVR) fails when correct solutions are rarely sampled even after many attempts, while Supervised Fine-Tuning (SFT) tends to overfit long demonstrations through rigid token-by-token imitation. To address this gap, we propose Supervised Reinforcement Learning (SRL), a framework that reformulates problem solving as generating a sequence of logical "actions". SRL trains the model to generate an internal reasoning monologue before committing to each action. It provides smoother rewards based on the similarity between the model's actions and expert actions extracted from the SFT dataset in a step-wise manner. This supervision offers richer learning signals even when all rollouts are incorrect, while encouraging flexible reasoning guided by expert demonstrations. As a result, SRL enables small models to learn challenging problems previously unlearnable by SFT or RLVR. Moreover, initializing training with SRL before refining with RLVR yields the strongest overall performance. Beyond reasoning benchmarks, SRL generalizes effectively to agentic software engineering tasks, establishing it as a robust and versatile training framework for reasoning-oriented LLMs.
中文摘要 大型语言模型（LLM）经常在需要多步骤推理的问题上苦苦挣扎。对于小规模的开源模型，当即使经过多次尝试也很少采样正确的解决方案时，具有可验证奖励的强化学习（RLVR）会失败，而监督微调（SFT）往往会通过严格的逐个标记模仿来过度拟合长时间的演示。为了解决这一差距，我们提出了监督强化学习（SRL），这是一个框架，它将问题解决重新表述为生成一系列逻辑“动作”。SRL 训练模型在执行每个动作之前生成内部推理独白。它根据模型的动作与从 SFT 数据集中提取的专家动作之间的相似性，逐步提供更流畅的奖励。即使所有推出都不正确，这种监督也能提供更丰富的学习信号，同时鼓励在专家演示的指导下进行灵活的推理。因此，SRL 使小型模型能够学习以前 SFT 或 RLVR 无法学习的具有挑战性的问题。此外，在使用 RLVR 进行细化之前使用 SRL 初始化训练会产生最强的整体性能。除了推理基准之外，SRL 还有效地推广到代理软件工程任务，使其成为面向推理的法学硕士的强大且多功能的训练框架。

PORTool: Tool-Use LLM Training with Rewarded Tree

PORTool：使用奖励树进行工具使用法学硕士训练

Authors: Feijie Wu, Weiwu Zhu, Yuxiang Zhang, Soumya Chatterjee, Jiarong Zhu, Fan Mo, Rodin Luo, Jing Gao
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.26020
Pdf link: https://arxiv.org/pdf/2510.26020
Abstract Current tool-use large language models (LLMs) are trained on static datasets, enabling them to interact with external tools and perform multi-step, tool-integrated reasoning, which produces tool-call trajectories. However, these models imitate how a query is resolved in a generic tool-call routine, thereby failing to explore possible solutions and demonstrating limited performance in an evolved, dynamic tool-call environment. In this work, we propose PORTool, a reinforcement learning (RL) method that encourages a tool-use LLM to explore various trajectories yielding the correct answer. Specifically, this method starts with generating multiple rollouts for a given query, and some of them share the first few tool-call steps, thereby forming a tree-like structure. Next, we assign rewards to each step, based on its ability to produce a correct answer and make successful tool calls. A shared step across different trajectories receives the same reward, while different steps under the same fork receive different rewards. Finally, these step-wise rewards are used to calculate fork-relative advantages, blended with trajectory-relative advantages, to train the LLM for tool use. The experiments utilize 17 tools to address user queries, covering both time-sensitive and time-invariant topics. We conduct ablation studies to systematically justify the necessity and the design robustness of step-wise rewards. Furthermore, we compare the proposed PORTool with other training approaches and demonstrate significant improvements in final accuracy and the number of tool-call steps.
中文摘要 当前使用工具的大型语言模型（LLM）是在静态数据集上训练的，使它们能够与外部工具交互并执行多步骤、工具集成的推理，从而产生工具调用轨迹。然而，这些模型模仿了在通用工具调用例程中解析查询的方式，因此无法探索可能的解决方案，并在演进的动态工具调用环境中表现出有限的性能。在这项工作中，我们提出了 PORTool，这是一种强化学习（RL）方法，它鼓励使用工具的 LLM 探索各种轨迹以产生正确答案。具体来说，此方法首先为给定查询生成多个转出，其中一些共享前几个工具调用步骤，从而形成树状结构。接下来，我们根据每个步骤产生正确答案和成功调用工具的能力为每个步骤分配奖励。跨不同轨迹的共享步骤获得相同的奖励，而同一分叉下的不同步骤获得不同的奖励。最后，这些逐步奖励用于计算分叉相对优势，并结合轨迹相对优势，以训练 LLM 以供工具使用。实验利用17种工具来解决用户查询，涵盖时间敏感和时不变主题。我们进行消融研究，以系统地证明逐步奖励的必要性和设计稳健性的合理性。此外，我们将所提出的 PORTool 与其他训练方法进行了比较，并证明了最终精度和工具调用步骤数量的显着改进。
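The abstract above describes step-wise rewards on a tree of rollouts, from which fork-relative advantages are computed and blended with trajectory-relative advantages. One plausible reading is sketched below: each step is compared with its siblings under the same fork. The mean baseline, the blending weight, and all names are assumptions, not the PORTool formulas.

```python
from statistics import mean


def fork_relative_advantages(forks):
    """Compare each step with its siblings under the same fork.

    forks maps a fork (parent step) id to a dict of {child step id: step reward}.
    The baseline for every child is the mean reward of that fork's children.
    """
    advantages = {}
    for children in forks.values():
        baseline = mean(children.values())
        for step_id, reward in children.items():
            advantages[step_id] = reward - baseline
    return advantages


def blend(fork_advantage, trajectory_advantage, weight=0.5):
    """Hypothetical linear blend of the two advantage signals."""
    return weight * fork_advantage + (1.0 - weight) * trajectory_advantage


# Toy tree: two rollouts share step "a", then diverge into "a1" and "a2";
# a third rollout takes step "b" at the root fork instead.
tree = {
    "root": {"a": 1.0, "b": 0.0},
    "a": {"a1": 0.5, "a2": 0.5},
}
step_advantages = fork_relative_advantages(tree)
```

In this toy tree, step "a" gets a positive advantage over "b", while "a1" and "a2" earn equal rewards and so both get zero.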

Morphology-Aware Graph Reinforcement Learning for Tensegrity Robot Locomotion

面向张力机器人运动的形态感知图强化学习

Authors: Chi Zhang, Mingrui Li, Wenzhe Tong, Xiaonan Huang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.26067
Pdf link: https://arxiv.org/pdf/2510.26067
Abstract Tensegrity robots combine rigid rods and elastic cables, offering high resilience and deployability but posing major challenges for locomotion control due to their underactuated and highly coupled dynamics. This paper introduces a morphology-aware reinforcement learning framework that integrates a graph neural network (GNN) into the Soft Actor-Critic (SAC) algorithm. By representing the robot's physical topology as a graph, the proposed GNN-based policy captures coupling among components, enabling faster and more stable learning than conventional multilayer perceptron (MLP) policies. The method is validated on a physical 3-bar tensegrity robot across three locomotion primitives, including straight-line tracking and bidirectional turning. It shows superior sample efficiency, robustness to noise and stiffness variations, and improved trajectory accuracy. Notably, the learned policies transfer directly from simulation to hardware without fine-tuning, achieving stable real-world locomotion. These results demonstrate the advantages of incorporating structural priors into reinforcement learning for tensegrity robot control.
中文摘要 张力机器人结合了刚性杆和弹性电缆，具有高弹性和可部署性，但由于其驱动不足和高度耦合的动力学，给运动控制带来了重大挑战。本文介绍了一种形态感知强化学习框架，该框架将图神经网络（GNN）集成到软Actor-Critic（SAC）算法中。通过将机器人的物理拓扑表示为图，所提出的基于GNN的策略捕获了组件之间的耦合，从而实现了比传统多层感知器（MLP）策略更快、更稳定的学习。该方法在物理 3 杆张势机器人上进行了验证，涵盖三个运动基元，包括直线跟踪和双向转弯。它显示出卓越的样品效率、对噪声和刚度变化的鲁棒性以及更高的轨迹精度。值得注意的是，学习到的策略直接从仿真转移到硬件，无需微调，实现了稳定的现实世界运动。这些结果证明了将结构先验纳入强化学习以进行张力机器人控制的优势。

Network-Constrained Policy Optimization for Adaptive Multi-agent Vehicle Routing

自适应多智能体车辆路线的网络约束策略优化

Authors: Fazel Arasteh, Arian Haghparast, Manos Papagelis
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2510.26089
Pdf link: https://arxiv.org/pdf/2510.26089
Abstract Traffic congestion in urban road networks leads to longer trip times and higher emissions, especially during peak periods. While the Shortest Path First (SPF) algorithm is optimal for a single vehicle in a static network, it performs poorly in dynamic, multi-vehicle settings, often worsening congestion by routing all vehicles along identical paths. We address dynamic vehicle routing through a multi-agent reinforcement learning (MARL) framework for coordinated, network-aware fleet navigation. We first propose Adaptive Navigation (AN), a decentralized MARL model where each intersection agent provides routing guidance based on (i) local traffic and (ii) neighborhood state modeled using Graph Attention Networks (GAT). To improve scalability in large networks, we further propose Hierarchical Hub-based Adaptive Navigation (HHAN), an extension of AN that assigns agents only to key intersections (hubs). Vehicles are routed hub-to-hub under agent control, while SPF handles micro-routing within each hub region. For hub coordination, HHAN adopts centralized training with decentralized execution (CTDE) under the Attentive Q-Mixing (A-QMIX) framework, which aggregates asynchronous vehicle decisions via attention. Hub agents use flow-aware state features that combine local congestion and predictive dynamics for proactive routing. Experiments on synthetic grids and real urban maps (Toronto, Manhattan) show that AN reduces average travel time versus SPF and learning baselines, maintaining 100% routing success. HHAN scales to networks with hundreds of intersections, achieving up to 15.9% improvement under heavy traffic. These findings highlight the potential of network-constrained MARL for scalable, coordinated, and congestion-aware routing in intelligent transportation systems.
中文摘要 城市道路网络的交通拥堵导致行程时间更长、排放量更高，尤其是在高峰期。虽然最短路径优先（SPF）算法最适合静态网络中的单辆车，但它在动态多车设置中表现不佳，通常会通过沿着相同的路径路由所有车辆来加剧拥堵。我们通过多智能体强化学习（MARL）框架来解决动态车辆路线问题，以实现协调的、网络感知的车队导航。我们首先提出了自适应导航（AN），这是一种去中心化的MARL模型，其中每个交叉路口代理根据（i）本地交通和（ii）使用图注意力网络（GAT）建模的邻域状态提供路由指导。为了提高大型网络的可扩展性，我们进一步提出了基于分层集线器的自适应导航（HHAN），这是AN的扩展，仅将代理分配给关键交叉点（集线器）。车辆在代理控制下在中心到中心之间路由，而 SPF 则处理每个中心区域内的微路由。对于枢纽协调，HHAN 在 A-QMIX （A-QMIX）框架下采用集中式训练和去中心化执行（CTDE），该框架通过注意力聚合异步车辆决策。Hub 代理使用流感知状态功能，将本地拥塞和预测动态相结合，以实现主动路由。对合成网格和真实城市地图（多伦多、曼哈顿）的实验表明，与 SPF 和学习基线相比，AN 减少了平均出行时间，保持了 100% 的路线成功率。HHAN 可扩展到具有数百个交叉路口的网络，在交通拥堵的情况下实现高达 15.9% 的改进。这些发现凸显了网络约束的 MARL 在智能交通系统中可扩展、协调和拥堵感知路线方面的潜力。
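The abstract above notes that shortest-path-first routing sends all vehicles along identical paths in a static network. A minimal Dijkstra sketch shows why: with fixed edge weights, the route depends only on origin and destination. The toy network and its travel times are invented for illustration; this is the SPF baseline, not the AN or HHAN method.

```python
import heapq


def shortest_path(graph, source, target):
    """Dijkstra's algorithm over a dict of {node: {neighbor: travel_time}}."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))
    if target not in dist:
        return None
    # Walk predecessor links back from the target.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


# Toy road network with static travel times (hypothetical values).
roads = {
    "A": {"B": 1.0, "C": 2.0},
    "B": {"D": 1.0},
    "C": {"D": 1.5},
    "D": {},
}
routes = [shortest_path(roads, "A", "D") for _ in range(3)]
```

Every vehicle travelling from A to D receives the route A, B, D, which is the congestion-inducing behavior that the learned routing policies aim to avoid.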

GUI Knowledge Bench: Revealing the Knowledge Gap Behind VLM Failures in GUI Tasks

GUI 知识台：揭示 GUI 任务中 VLM 故障背后的知识差距

Authors: Chenrui Shi, Zedong Yu, Zhi Gao, Ruining Feng, Enqi Liu, Yuwei Wu, Yunde Jia, Liuyu Xiang, Zhaofeng He, Qing Li
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.26098
Pdf link: https://arxiv.org/pdf/2510.26098
Abstract Large vision language models (VLMs) have advanced graphical user interface (GUI) task automation but still lag behind humans. We hypothesize this gap stems from missing core GUI knowledge, which existing training schemes (such as supervised fine-tuning and reinforcement learning) alone cannot fully address. By analyzing common failure patterns in GUI task execution, we distill GUI knowledge into three dimensions: (1) interface perception, knowledge about recognizing widgets and system states; (2) interaction prediction, knowledge about reasoning action state transitions; and (3) instruction understanding, knowledge about planning, verifying, and assessing task completion progress. We further introduce GUI Knowledge Bench, a benchmark with multiple-choice and yes/no questions across six platforms (Web, Android, macOS, Windows, Linux, iOS) and 292 applications. Our evaluation shows that current VLMs identify widget functions but struggle with perceiving system states, predicting actions, and verifying task completion. Experiments on real-world GUI tasks further validate the close link between GUI knowledge and task success. By providing a structured framework for assessing GUI knowledge, our work supports the selection of VLMs with greater potential prior to downstream training and provides insights for building more capable GUI agents.
中文摘要 大型视觉语言模型（VLM）具有先进的图形用户界面（GUI）任务自动化，但仍落后于人类。我们假设这种差距源于核心 GUI 知识的缺失，仅靠现有的训练方案（例如监督微调和强化学习）无法完全解决这些问题。通过分析GUI任务执行中常见的故障模式，我们将GUI知识提炼为三个维度：（1）界面感知，识别小部件和系统状态的知识;（2）交互预测，推理动作状态转换知识;（3）对教学的理解，有关计划、验证和评估任务完成进度的知识。我们进一步介绍了 GUI Knowledge Bench，这是一个在六个平台（Web、Android、MacOS、Windows、Linux、IOS）和 292 个应用程序中提供多项选择和是/否问题的基准测试。我们的评估表明，当前的 VLM 可以识别小部件功能，但在感知系统状态、预测作和验证任务完成方面遇到困难。对现实世界 GUI 任务的实验进一步验证了 GUI 知识与任务成功之间的密切联系。通过提供用于评估 GUI 知识的结构化框架，我们的工作支持在下游训练之前选择具有更大潜力的 VLM，并为构建更强大的 GUI 代理提供见解。

Do Not Step Into the Same River Twice: Learning to Reason from Trial and Error

不要两次踏入同一条河流：从反复试验中学习推理

Authors: Chenming Tang, Hsiu-Yuan Huang, Weijie Liu, Saiyong Yang, Yunfang Wu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.26109
Pdf link: https://arxiv.org/pdf/2510.26109
Abstract Reinforcement learning with verifiable rewards (RLVR) has significantly boosted the reasoning capability of large language models (LLMs) recently. However, existing RLVR approaches merely train LLMs based on their own generated responses and are constrained by the initial capability of LLMs, thus prone to exploration stagnation, in which LLMs fail to solve more training problems and cannot further learn from the training data. Some work tries to address this by leveraging off-policy solutions to training problems but requires external guidance from experts, which suffers from limited availability. In this work, we propose LTE (Learning to reason from Trial and Error), an approach that hints LLMs with their previously self-generated incorrect answers and with the problem of overlong responses, and that does not require any external expert guidance. Experiments validate the effectiveness of LTE, which outperforms the normal group relative policy optimization (GRPO) by 6.38 in Pass@1 and 9.00 in Pass@k on average across six mathematics benchmarks for Qwen3-4B-Base. Further analysis confirms that LTE successfully mitigates the problem of exploration stagnation and enhances both exploitation and exploration during training.
中文摘要 近年来，具有可验证奖励的强化学习（RLVR）显著提升了大型语言模型（LLMs）的推理能力。然而，现有的RLVR方法仅仅基于自身生成的响应来训练LLM，并受到LLM初始能力的约束，因此容易出现探索停滞，即LLM无法解决更多的训练问题，无法从训练数据中进一步学习。一些工作试图通过利用政策外的解决方案来解决培训问题，但需要专家的外部指导，而专家的可用性有限。在这项工作中，我们提出了LTE（Learning to Reason from Trial and Error），这是一种暗示LLM之前自行生成的错误答案和过长响应问题的方法，不需要任何外部专家的指导。实验验证了LTE的有效性，在Qwen3-4B-Base的六个数学基准中，LTE在Pass@1平均比正态组相对策略优化（GRPO）高出6.38，在Pass@k上平均高出9.00。进一步的分析证实，LTE成功地缓解了勘探停滞的问题，增强了训练过程中的开发和勘探。
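
Note on the entry above: the exploration stagnation it describes can be illustrated with the group-relative advantage used by GRPO-style training. The sketch below is not the paper's code. It assumes binary verifiable rewards and normalization by the population standard deviation; the function name and `eps` are illustrative. When every sampled response to a problem fails, all advantages are zero and the problem yields no learning signal.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]

# Two of four responses are correct: correct ones get positive advantages.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# All four responses fail: every advantage is zero, so nothing is learned.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

Under these assumptions, a problem the model never solves stays invisible to the optimizer, which is the gap that hints built from earlier failed attempts are meant to close.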

EgoExo-Con: Exploring View-Invariant Video Temporal Understanding

EgoExo-Con：探索视图不变视频时间理解

Authors: Minjoon Jung, Junbin Xiao, Junghyun Kim, Byoung-Tak Zhang, Angela Yao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.26113
Pdf link: https://arxiv.org/pdf/2510.26113
Abstract Can Video-LLMs achieve consistent temporal understanding when videos capture the same event from different viewpoints? To study this, we introduce EgoExo-Con (Consistency), a benchmark of comprehensively synchronized egocentric and exocentric video pairs with human-refined queries in natural language. EgoExo-Con emphasizes two temporal understanding tasks: Temporal Verification and Temporal Grounding. It evaluates not only correctness but also consistency across viewpoints. Our analysis reveals two critical limitations of existing Video-LLMs: (1) Models often fail to maintain consistency, with results far worse than their single-view performance. (2) When naively fine-tuned on synchronized videos of both viewpoints, the models show improved consistency but often underperform those trained on a single view. To improve on this, we propose View-GRPO, a novel reinforcement learning framework that effectively strengthens view-specific temporal reasoning while encouraging consistent comprehension across viewpoints. Our method demonstrates its superiority over naive SFT and GRPO, especially in improving cross-view consistency. All resources will be made publicly available.
中文摘要 当视频从不同视角捕捉同一事件时，Video-LLM 能否实现一致的时间理解？为了研究这一点，我们引入了 EgoExo-Con（Consistency，一致性），这是一个由全面同步的第一人称（egocentric）与第三人称（exocentric）视频对组成的基准，并配有经人工精修的自然语言查询。EgoExo-Con 强调两项时间理解任务：时间验证和时间定位。它不仅评估正确性，还评估跨视角的一致性。我们的分析揭示了现有 Video-LLM 的两个关键局限性：（1）模型通常无法保持一致性，结果远不如其单视角性能。（2）当用两种视角的同步视频进行朴素微调时，模型显示出更好的一致性，但通常表现不如在单一视角上训练的模型。为了进行改进，我们提出了 View-GRPO，这是一种新颖的强化学习框架，它有效地加强了特定于视角的时间推理，同时鼓励跨视角的一致理解。我们的方法证明了其优于朴素 SFT 和 GRPO，特别是在提高跨视角一致性方面。所有资源都将公开。

Reasoning Curriculum: Bootstrapping Broad LLM Reasoning from Math

推理课程：从数学中引导广泛的 LLM 推理

Authors: Bo Pang, Deqian Kong, Silvio Savarese, Caiming Xiong, Yingbo Zhou
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.26143
Pdf link: https://arxiv.org/pdf/2510.26143
Abstract Reinforcement learning (RL) can elicit strong reasoning in large language models (LLMs), yet most open efforts focus on math and code. We propose Reasoning Curriculum, a simple two-stage curriculum that first elicits reasoning skills in pretraining-aligned domains such as math, then adapts and refines these skills across other domains via joint RL. Stage 1 performs a brief cold start and then math-only RL with verifiable rewards to develop reasoning skills. Stage 2 runs joint RL on mixed-domain data to transfer and consolidate these skills. The curriculum is minimal and backbone-agnostic, requiring no specialized reward models beyond standard verifiability checks. Evaluated on Qwen3-4B and Llama-3.1-8B over a multi-domain suite, reasoning curriculum yields consistent gains. Ablations and a cognitive-skill analysis indicate that both stages are necessary and that math-first elicitation increases cognitive behaviors important for solving complex problems. Reasoning Curriculum provides a compact, easy-to-adopt recipe for general reasoning.
中文摘要 强化学习（RL）可以在大型语言模型（LLM）中引发强大的推理，但大多数开放工作都集中在数学和代码上。我们提出了推理课程，这是一个简单的两阶段课程，首先在数学等预训练领域中引出推理技能，然后通过联合 RL 在其他领域调整和完善这些技能。第 1 阶段执行简短的冷启动，然后进行纯数学 RL，并提供可验证的奖励，以培养推理技能。第 2 阶段对混合域数据运行联合 RL，以转移和整合这些技能。该课程是最小的，与骨干无关，除了标准的可验证性检查之外，不需要专门的奖励模型。在多域套件上对 Qwen3-4B 和 Llama-3.1-8B 进行评估，推理课程产生了一致的收益。消融和认知技能分析表明，这两个阶段都是必要的，并且数学优先的引出增加了对解决复杂问题很重要的认知行为。推理课程为一般推理提供了一个紧凑、易于采用的配方。

One Model to Critique Them All: Rewarding Agentic Tool-Use via Efficient Reasoning

一个模型来批评所有这些：通过有效推理奖励代理工具的使用

Authors: Renhao Li, Jianhong Tu, Yang Su, Hamid Alinejad-Rokny, Derek F. Wong, Junyang Lin, Min Yang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.26167
Pdf link: https://arxiv.org/pdf/2510.26167
Abstract Reward models (RMs) play a critical role in aligning large language models (LLMs) with human preferences. Yet in the domain of tool learning, the lack of RMs specifically designed for function-calling tasks has limited progress toward more capable agentic AI. We introduce ToolRM, a family of lightweight generative RMs tailored for general tool-use scenarios. To build these models, we propose a novel pipeline that constructs pairwise preference data using rule-based scoring and multidimensional sampling. This yields ToolPref-Pairwise-30K, a diverse, balanced, and challenging dataset of critique tasks that supports reinforcement learning with verifiable feedback. To evaluate tool-use RMs, we also introduce TRBench$_{BFCL}$, a benchmark built on the agentic evaluation suite BFCL. Trained on our constructed data, models from the Qwen3-4B/8B series achieve up to 14.28% higher accuracy, substantially outperforming frontier models such as Claude 4 and OpenAI o3 in pairwise reward judgments. Beyond training objectives, ToolRM generalizes to broader critique tasks, including Best-of-N sampling and self-correction. Experiments on ACEBench highlight its effectiveness and efficiency, enabling inference-time scaling and reducing output token usage by over 66%. We release data and model checkpoints to facilitate future research.
中文摘要 奖励模型（RM）在使大型语言模型（LLM）与人类偏好保持一致方面发挥着关键作用。然而，在工具学习领域，缺乏专门为函数调用任务设计的 RM，限制了在功能更强大的代理 AI 方面取得的进展。我们介绍了 ToolRM，这是一系列专为一般工具使用场景量身定制的轻量级生成式 RM。为了构建这些模型，我们提出了一种新颖的管道，该管道使用基于规则的评分和多维抽样来构建成对偏好数据。这产生了 ToolPref-Pairwise-30K，这是一个多样化、平衡且具有挑战性的批评任务数据集，支持具有可验证反馈的强化学习。为了评估工具使用的 RM，我们还引入了 TRBench$_{BFCL}$，这是一个基于代理评估套件 BFCL 构建的基准测试。根据我们构建的数据进行训练，Qwen3-4B/8B 系列的模型实现了高达 14.28% 的准确率，在成对奖励判断方面大大优于 Claude 4 和 OpenAI o3 等前沿模型。除了训练目标之外，ToolRM 还推广到更广泛的批评任务，包括 Best-of-N 采样和自我纠正。ACEBench 上的实验凸显了其有效性和效率，实现了推理时间扩展并将输出令牌使用量减少了 66% 以上。我们发布数据和模型检查点，以促进未来的研究。

A Game-Theoretic Spatio-Temporal Reinforcement Learning Framework for Collaborative Public Resource Allocation

一种面向公共资源协同配置的博弈论时空强化学习框架

Authors: Songxin Lei, Qiongyan Wang, Yanchen Zhu, Hanyu Yao, Sijie Ruan, Weilin Ruan, Yuyu Luo, Huaming Wu, Yuxuan Liang
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2510.26184
Pdf link: https://arxiv.org/pdf/2510.26184
Abstract Public resource allocation involves the efficient distribution of resources, including urban infrastructure, energy, and transportation, to effectively meet societal demands. However, existing methods focus on optimizing the movement of individual resources independently, without considering their capacity constraints. To address this limitation, we propose a novel and more practical problem: Collaborative Public Resource Allocation (CPRA), which explicitly incorporates capacity constraints and spatio-temporal dynamics in real-world scenarios. We propose a new framework called Game-Theoretic Spatio-Temporal Reinforcement Learning (GSTRL) for solving CPRA. Our contributions are twofold: 1) We formulate the CPRA problem as a potential game and demonstrate that there is no gap between the potential function and the optimal target, laying a solid theoretical foundation for approximating the Nash equilibrium of this NP-hard problem; and 2) Our designed GSTRL framework effectively captures the spatio-temporal dynamics of the overall system. We evaluate GSTRL on two real-world datasets, where experiments show its superior performance. Our source codes are available in the supplementary materials.
中文摘要 公共资源配置涉及城市基础设施、能源和交通等资源的有效分配，以有效满足社会需求。然而，现有方法侧重于独立优化单个资源的流动，而不考虑其能力限制。为了解决这一限制，我们提出了一个新颖且更实用的问题：协作公共资源分配（CPRA），它明确地将容量约束和现实场景中的时空动态纳入其中。我们提出了一个名为博弈论时空强化学习（GSTRL）的新框架来解决CPRA。我们的贡献是双重的：1）我们将CPRA问题表述为一个势博弈，并证明势函数与最优目标之间没有差距，为近似这个NP难问题的纳什均衡奠定了坚实的理论基础;2）我们设计的GSTRL框架有效地捕捉了整个系统的时空动态。我们在两个真实世界的数据集上评估了 GSTRL，其中的实验显示了其卓越的性能。我们的源代码可在补充材料中找到。
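
Note on the entry above: the abstract relies on a general property of potential games, namely that any unilateral improvement by one player lowers a shared potential, so repeated best responses end in a pure Nash equilibrium. The sketch below shows this on a toy congestion game, not on the paper's CPRA formulation or the GSTRL framework. The cost model (an agent pays the load of the site it picks) and all names are assumptions made for illustration.

```python
def move_cost(site, current, loads):
    """Cost an agent would pay at `site`: the site's load after the agent is there."""
    return loads[site] + (0 if site == current else 1)

def best_response_dynamics(choices, n_sites, max_rounds=100):
    """Let agents take turns switching to their cheapest site until nobody moves."""
    choices = list(choices)
    for _ in range(max_rounds):
        changed = False
        for agent, current in enumerate(choices):
            loads = [choices.count(s) for s in range(n_sites)]
            best = min(range(n_sites), key=lambda s: move_cost(s, current, loads))
            if move_cost(best, current, loads) < move_cost(current, current, loads):
                choices[agent] = best
                changed = True
        if not changed:
            break  # no agent can improve alone: a pure Nash equilibrium
    return choices

def rosenthal_potential(choices, n_sites):
    """Rosenthal potential for load-based costs: sum over sites of 1 + 2 + ... + load."""
    return sum(k for s in range(n_sites) for k in range(1, choices.count(s) + 1))

start = [0, 0, 0, 0, 0, 0]            # six agents all crowd onto site 0
final = best_response_dynamics(start, n_sites=3)
```

In this toy run the potential drops as agents spread out, and the loop stops once the three sites carry equal load.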

Graph-Enhanced Policy Optimization in LLM Agent Training

LLM 代理训练中的图增强策略优化

Authors: Jiazhen Yuan, Wei Zhao, Zhengbiao Bai
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.26270
Pdf link: https://arxiv.org/pdf/2510.26270
Abstract Group-based reinforcement learning (RL) has shown impressive results on complex reasoning and mathematical tasks. Yet, when applied to train multi-turn, interactive LLM agents, these methods often suffer from structural blindness: the inability to exploit the underlying connectivity of the environment. This manifests in three critical challenges: (1) inefficient, unguided exploration, (2) imprecise credit assignment due to overlooking pivotal states, and (3) myopic planning caused by static reward discounting. We address these issues with Graph-Enhanced Policy Optimization (GEPO), which dynamically constructs a state-transition graph from agent experience and employs graph-theoretic centrality to provide three synergistic learning signals: (1) structured intrinsic rewards that guide exploration toward high-impact states, (2) a graph-enhanced advantage function for topology-aware credit assignment, and (3) a dynamic discount factor adapted to each state's strategic value. On the ALFWorld and WebShop benchmarks and a proprietary Workbench benchmark, GEPO demonstrates strong performance, achieving absolute success rate gains of +4.1%, +5.3%, and +10.9%, respectively, over competitive baselines. These results highlight that explicitly modeling environmental structure is a robust, generalizable strategy for advancing LLM agent training.
中文摘要 基于组的强化学习（RL）在复杂的推理和数学任务上显示出令人印象深刻的结果。然而，当应用于训练多轮、交互式 LLM 代理时，这些方法往往存在结构盲性——无法利用环境的底层连接性。这体现在三个关键挑战：（1）低效、无指导的探索，（2）由于忽视关键状态而导致的信用分配不精确，以及（3）静态奖励贴现导致的短视规划。我们通过图增强策略优化（GEPO）解决了这些问题，该优化从代理经验中动态构建状态转换图，并利用图论中心性提供三种协同学习信号：（1）引导探索高影响力状态的结构化内在奖励，（2）用于拓扑感知信用分配的图增强优势函数，以及（3）适应每个状态战略价值的动态贴现因子。在 ALFWorld、WebShop 和专有的 Workbench 基准测试中，GEPO 表现出强劲的性能，与竞争基线相比，绝对成功率分别为 +4.1%、+5.3% 和 +10.9%。这些结果强调，显式对环境结构进行建模是推进 LLM 代理训练的稳健、可推广的策略。
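
Note on the entry above: the abstract does not say which centrality measure GEPO uses or how the reward is scaled. As a minimal sketch under those stated gaps, the code below builds an undirected state-transition graph from logged trajectories and turns degree centrality into an intrinsic reward. The state names, the choice of degree centrality, and the `scale` parameter are all hypothetical.

```python
from collections import defaultdict

def build_transition_graph(trajectories):
    """Collect observed state-to-state transitions as an undirected adjacency map."""
    graph = defaultdict(set)
    for states in trajectories:
        for src, dst in zip(states, states[1:]):
            graph[src].add(dst)
            graph[dst].add(src)
    return graph

def degree_centrality(graph):
    """Fraction of the other known states that each state is directly connected to."""
    n = len(graph)
    if n <= 1:
        return {state: 0.0 for state in graph}
    return {state: len(neighbors) / (n - 1) for state, neighbors in graph.items()}

def intrinsic_reward(state, centrality, scale=0.1):
    """Bonus for reaching a well-connected state; unseen states get no bonus."""
    return scale * centrality.get(state, 0.0)

trajectories = [
    ["start", "hall", "kitchen"],
    ["start", "hall", "bedroom"],
    ["hall", "bathroom"],
]
centrality = degree_centrality(build_transition_graph(trajectories))
```

Here `hall` connects to every other state, so it receives the largest bonus, which is the kind of pivotal state the abstract says unguided exploration overlooks.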

Thor: Towards Human-Level Whole-Body Reactions for Intense Contact-Rich Environments

雷神：在激烈的接触丰富环境中实现人类水平的全身反应

Authors: Gangyang Li, Qing Shi, Youhao Hu, Jincheng Hu, Zhongyuan Wang, Xinlong Wang, Shaqi Luo
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.26280
Pdf link: https://arxiv.org/pdf/2510.26280
Abstract Humanoids hold great potential for service, industrial, and rescue applications, in which robots must sustain whole-body stability while performing intense, contact-rich interactions with the environment. However, enabling humanoids to generate human-like, adaptive responses under such conditions remains a major challenge. To address this, we propose Thor, a humanoid framework for human-level whole-body reactions in contact-rich environments. Based on the robot's force analysis, we design a force-adaptive torso-tilt (FAT2) reward function to encourage humanoids to exhibit human-like responses during force-interaction tasks. To mitigate the high-dimensional challenges of humanoid control, Thor introduces a reinforcement learning architecture that decouples the upper body, waist, and lower body. Each component shares global observations of the whole body and jointly updates its parameters. Finally, we deploy Thor on the Unitree G1, and it substantially outperforms baselines in force-interaction tasks. Specifically, the robot achieves a peak pulling force of 167.7 N (approximately 48% of the G1's body weight) when moving backward and 145.5 N when moving forward, representing improvements of 68.9% and 74.7%, respectively, compared with the best-performing baseline. Moreover, Thor is capable of pulling a loaded rack (130 N) and opening a fire door with one hand (60 N). These results highlight Thor's effectiveness in enhancing humanoid force-interaction capabilities.
中文摘要 人形机器人在服务、工业和救援应用方面具有巨大潜力，在这些应用中，机器人必须保持全身稳定性，同时与环境进行激烈、接触丰富的交互。然而，使类人生物能够在这种情况下产生类似人类的适应性反应仍然是一个重大挑战。为了解决这个问题，我们提出了 Thor，这是一种在接触丰富的环境中进行人类水平全身反应的人形框架。基于机器人的力分析，我们设计了一种力自适应躯干倾斜（FAT2）奖励函数，以鼓励人形生物在力交互任务中表现出类似人类的反应。为了缓解人形控制的高维挑战，雷神引入了一种强化学习架构，将上半身、腰部和下半身解耦。每个组件共享对整个身体的全局观察，并共同更新其参数。最后，我们在 Unitree G1 上部署了 Thor，它在力交互任务中的性能大大优于基线。具体来说，机器人在向后移动时达到了167.7 N（约占G1体重的48%）和向前移动时达到145.5 N的峰值拉力，与性能最佳的基线相比，分别提高了68.9%和74.7%。此外，雷神能够拉动装载的机架（130 N）并用一只手打开防火门（60 N）。这些结果凸显了雷神在增强人形力相互作用能力方面的有效性。

Empowering RepoQA-Agent based on Reinforcement Learning Driven by Monte-carlo Tree Search

基于蒙特卡洛树搜索驱动的强化学习赋能RepoQA-Agent

Authors: Guochang Li, Yuchen Liu, Zhen Qin, Yunkun Wang, Jianping Zhong, Chen Zhi, Binhua Li, Fei Huang, Yongbin Li, Shuiguang Deng
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2510.26287
Pdf link: https://arxiv.org/pdf/2510.26287
Abstract Repository-level software engineering tasks require large language models (LLMs) to efficiently navigate and extract information from complex codebases through multi-turn tool interactions. Existing approaches face significant limitations: training-free, in-context learning methods struggle to guide agents effectively in tool utilization and decision-making based on environmental feedback, while training-based approaches typically rely on costly distillation from larger LLMs, introducing data compliance concerns in enterprise environments. To address these challenges, we introduce RepoSearch-R1, a novel agentic reinforcement learning framework driven by Monte Carlo Tree Search (MCTS). This approach allows agents to generate diverse, high-quality reasoning trajectories via self-training without requiring model distillation or external supervision. Based on RepoSearch-R1, we construct a RepoQA-Agent specifically designed for repository question-answering tasks. Comprehensive evaluation on repository question-answering tasks demonstrates that RepoSearch-R1 achieves substantial improvements in answer completeness, with a 16.0% enhancement over no-retrieval methods and a 19.5% improvement over iterative retrieval methods, together with a 33% increase in training efficiency compared to general agentic reinforcement learning approaches. Our cold-start training methodology eliminates data compliance concerns while maintaining robust exploration diversity and answer completeness across repository-level reasoning tasks.
中文摘要 存储库级软件工程任务需要大型语言模型（LLM）通过多轮工具交互高效地浏览复杂的代码库并从中提取信息。现有方法面临重大局限性：免训练的上下文学习方法难以有效地指导智能体根据环境反馈进行工具使用和决策，而基于训练的方法通常依赖于从更大的 LLM 进行代价高昂的蒸馏，从而在企业环境中引入数据合规性问题。为了应对这些挑战，我们推出了 RepoSearch-R1，这是一种由蒙特卡洛树搜索（MCTS）驱动的新型智能体强化学习框架。这种方法允许智能体通过自我训练生成多样化、高质量的推理轨迹，而无需模型蒸馏或外部监督。基于 RepoSearch-R1，我们构建了一个专门用于仓库问答任务的 RepoQA-Agent。对存储库问答任务的综合评估表明，RepoSearch-R1 在答案完整性方面取得了显著的提升：与无检索方法相比提高了16.0%，与迭代检索方法相比提高了19.5%；与通用智能体强化学习方法相比，训练效率提高了33%。我们的冷启动训练方法消除了数据合规性问题，同时在存储库级推理任务中保持了强大的探索多样性和答案完整性。
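
Note on the entry above: the abstract does not describe how RepoSearch-R1 configures its tree search. For readers unfamiliar with MCTS, the sketch below shows the standard UCT selection rule that most MCTS variants start from. The tool names and the exploration constant `c` are hypothetical and not taken from the paper.

```python
import math

def uct_score(total_value, visits, parent_visits, c=1.4):
    """Mean value plus an exploration bonus that shrinks as a child is visited more."""
    if visits == 0:
        return math.inf  # always try an unvisited child first
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)

def select_child(children, c=1.4):
    """Pick the child with the highest UCT score."""
    parent_visits = sum(child["visits"] for child in children)
    return max(
        children,
        key=lambda child: uct_score(child["value"], child["visits"], parent_visits, c),
    )

children = [
    {"action": "grep", "value": 3.0, "visits": 5},
    {"action": "open_file", "value": 1.0, "visits": 1},
    {"action": "list_dir", "value": 0.0, "visits": 0},
]
chosen = select_child(children)
```

Setting `c=0` reduces the rule to greedy selection on mean value, while larger values push the search toward rarely tried tool calls and so diversify the trajectories collected for training.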

Offline Clustering of Preference Learning with Active-data Augmentation

使用主动数据增强的偏好学习离线聚类

Authors: Jingyuan Liu, Fatemeh Ghaffari, Xuchuang Wang, Mohammad Hajiesmaili, Carlee Joe-Wong
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.26301
Pdf link: https://arxiv.org/pdf/2510.26301
Abstract Preference learning from pairwise feedback is a widely adopted framework in applications such as reinforcement learning with human feedback and recommendations. In many practical settings, however, user interactions are limited or costly, making offline preference learning necessary. Moreover, real-world preference learning often involves users with different preferences. For example, annotators from different backgrounds may rank the same responses differently. This setting presents two central challenges: (1) identifying similarity across users to effectively aggregate data, especially under scenarios where offline data is imbalanced across dimensions, and (2) handling the imbalanced offline data where some preference dimensions are underrepresented. To address these challenges, we study the Offline Clustering of Preference Learning problem, where the learner has access to fixed datasets from multiple users with potentially different preferences and aims to maximize utility for a test user. To tackle the first challenge, we first propose Off-C$^2$PL for the pure offline setting, where the learner relies solely on offline data. Our theoretical analysis provides a suboptimality bound that explicitly captures the tradeoff between sample noise and bias. To address the second challenge of imbalanced data, we extend our framework to the setting with active-data augmentation, where the learner is allowed to select a limited number of additional active data points for the test user based on the cluster structure learned by Off-C$^2$PL. In this setting, our second algorithm, A$^2$-Off-C$^2$PL, actively selects samples that target the least-informative dimensions of the test user's preference. We prove that these actively collected samples contribute more effectively than offline ones. Finally, we validate our theoretical results through simulations on synthetic and real-world datasets.
中文摘要 从成对反馈中学习偏好是应用中广泛采用的框架，例如具有人类反馈和建议的强化学习。然而，在许多实际环境中，用户交互是有限的或成本高昂的，因此需要离线偏好学习。此外，现实世界的偏好学习通常涉及具有不同偏好的用户。例如，来自不同背景的注释者可能会对相同的响应进行不同的排名。此设置带来了两个核心挑战：（1）识别用户之间的相似性以有效地聚合数据，尤其是在离线数据跨维度不平衡的情况下，以及（2）处理某些偏好维度代表性不足的不平衡离线数据。为了应对这些挑战，我们研究了偏好学习的离线聚类问题，其中学习者可以访问来自多个具有潜在不同偏好的用户的固定数据集，旨在最大限度地提高测试用户的效用。为了解决第一个挑战，我们首先为纯离线设置提出了 Off-C$^2$PL，其中学习者仅依赖离线数据。我们的理论分析提供了一个次优性边界，明确地捕获了样本噪声和偏差之间的权衡。为了解决不平衡数据的第二个挑战，我们将我们的框架扩展到具有主动数据增强的设置，其中允许学习者根据 Off-C$^2$PL 学习的集群结构为测试用户选择有限数量的额外活动数据。在此设置中，我们的第二个算法 A$^2$-Off-C$^2$PL 主动选择针对测试用户偏好中信息最少的维度的样本。我们证明，这些主动收集的样本比离线样本更有效。最后，我们通过对合成和真实世界数据集的模拟来验证我们的理论结果。

Reinforcement Learning for Pollution Detection in a Randomized, Sparse and Nonstationary Environment with an Autonomous Underwater Vehicle

使用自主水下航行器在随机、稀疏和非静止环境中进行污染检测的强化学习

Authors: Sebastian Zieglmeier, Niklas Erdmann, Narada D. Warakagoda
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.26347
Pdf link: https://arxiv.org/pdf/2510.26347
Abstract Reinforcement learning (RL) algorithms are designed to optimize problem-solving by learning actions that maximize rewards, a task that becomes particularly challenging in random and nonstationary environments. Even advanced RL algorithms are often limited in their ability to solve problems in these conditions. In applications such as searching for underwater pollution clouds with autonomous underwater vehicles (AUVs), RL algorithms must navigate reward-sparse environments, where actions frequently result in a zero reward. This paper aims to address these challenges by revisiting and modifying classical RL approaches to efficiently operate in sparse, randomized, and nonstationary environments. We systematically study a large number of modifications, including hierarchical algorithm changes, multigoal learning, and the integration of a location memory as an external output filter to prevent state revisits. Our results demonstrate that a modified Monte Carlo-based approach significantly outperforms traditional Q-learning and two exhaustive search patterns, illustrating its potential in adapting RL to complex environments. These findings suggest that reinforcement learning approaches can be effectively adapted for use in random, nonstationary, and reward-sparse environments.
中文摘要 强化学习（RL）算法旨在通过学习最大化奖励的动作来优化问题解决，这项任务在随机和非平稳环境中变得特别具有挑战性。即使是先进的 RL 算法在这些条件下解决问题的能力也往往受到限制。在使用自主水下航行器（AUV）搜索水下污染云等应用中，RL 算法必须在奖励稀疏的环境中导航，在这些环境中，动作通常会导致零奖励。本文旨在通过重新审视和修改经典的 RL 方法来应对这些挑战，以在稀疏、随机和非平稳环境中高效运行。我们系统地研究了大量的修改，包括分层算法更改、多目标学习以及集成位置存储器作为外部输出过滤器以防止状态重访。我们的结果表明，基于蒙特卡洛的改进方法明显优于传统的 Q 学习和两种详尽的搜索模式，说明了其在使 RL 适应复杂环境方面的潜力。这些发现表明，强化学习方法可以有效地适应随机、非平稳和奖励稀疏的环境。
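
Note on the entry above: the location memory used as an external output filter can be pictured as a thin wrapper around the policy's ranked actions. The sketch below is one possible reading of that idea and not the authors' implementation. The 2D grid, the four-move action set, and the fallback behavior are assumptions.

```python
# Hypothetical grid moves for a vehicle on a 2D search grid.
MOVES = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}

def filter_action(position, ranked_actions, visited):
    """Return the policy's best-ranked action that does not revisit a stored cell."""
    for action in ranked_actions:
        dx, dy = MOVES[action]
        target = (position[0] + dx, position[1] + dy)
        if target not in visited:
            return action
    # Every neighbor was already visited: fall back to the policy's top choice.
    return ranked_actions[0]

visited = {(0, 0), (0, 1)}
ranking = ["north", "east", "south", "west"]  # policy preference, best first
action = filter_action((0, 0), ranking, visited)
```

Because the filter sits outside the learner, the value estimates are untouched; only the executed action changes when the preferred move would return to a known cell.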

Towards Reinforcement Learning Based Log Loading Automation

迈向基于强化学习的原木装载自动化

Authors: Ilya Kurinov, Miroslav Ivanov, Grzegorz Orzechowski, Aki Mikkola
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.26363
Pdf link: https://arxiv.org/pdf/2510.26363
Abstract Forestry forwarders play a central role in mechanized timber harvesting by picking up logs and moving them from the felling site to a processing area or a secondary transport vehicle. Forwarder operation is challenging and is physically and mentally exhausting for the operator, who must control the machine in remote areas for prolonged periods. Therefore, even partial automation of the process may reduce stress on the operator. This study continues previous research on applying reinforcement learning agents to automate the log handling process, extending the task from grasping, which was studied previously, to the full log loading operation. The resulting agent is intended to automate the full loading procedure, from locating and grappling the log to transporting and delivering it to the forwarder bed. To train the agent, a simulation model of a trailer-type forestry forwarder in NVIDIA's Isaac Gym and a virtual environment for a typical log loading scenario were developed. Trained with reinforcement learning and a curriculum learning approach, the agent may be a stepping stone towards applying reinforcement learning to forwarder automation. The agent learned to grasp a log in a random position, starting from a random grapple position, and to transport it to the bed, with the best-performing agent reaching a 94% success rate.
中文摘要 林业集材机（forwarder）通过将原木从采伐现场拾取并运送到加工区或二次运输车辆，在机械化木材采伐中发挥着核心作用。集材机操作具有挑战性，操作员必须长时间在偏远地区控制机器，身心俱疲。因此，即使只是部分自动化也可以减轻操作员的压力。本研究延续了先前将强化学习智能体应用于原木搬运自动化的研究工作，将任务从先前研究的抓取扩展到完整的原木装载操作。由此产生的智能体将能够自动执行完整的装载过程，从定位和抓取原木到将其运输并放置到集材机的货台上。为了训练智能体，我们在 NVIDIA 的 Isaac Gym 中开发了拖车式林业集材机仿真模型，以及用于典型原木装载场景的虚拟环境。借助强化学习智能体和课程学习方法，训练得到的智能体可以成为强化学习在集材机自动化中应用的垫脚石。智能体学会了从抓斗的随机位置出发抓取处于随机位置的原木并将其运送到货台，其中表现最好的智能体成功率达到 94%。

Adaptive Context Length Optimization with Low-Frequency Truncation for Multi-Agent Reinforcement Learning

低频截断的自适应上下文长度优化，用于多智能体强化学习

Authors: Wenchang Duan, Yaoliang Yu, Jiwan He, Yi Shi
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2510.26389
Pdf link: https://arxiv.org/pdf/2510.26389
Abstract Recently, deep multi-agent reinforcement learning (MARL) has demonstrated promising performance for solving challenging tasks, such as long-term dependencies and non-Markovian environments. Its success is partly attributed to conditioning policies on large fixed context length. However, such large fixed context lengths may lead to limited exploration efficiency and redundant information. In this paper, we propose a novel MARL framework to obtain adaptive and effective contextual information. Specifically, we design a central agent that dynamically optimizes context length via temporal gradient analysis, enhancing exploration to facilitate convergence to global optima in MARL. Furthermore, to enhance the adaptive optimization capability of the context length, we present an efficient input representation for the central agent, which effectively filters redundant information. By leveraging a Fourier-based low-frequency truncation method, we extract global temporal trends across decentralized agents, providing an effective and efficient representation of the MARL environment. Extensive experiments demonstrate that the proposed method achieves state-of-the-art (SOTA) performance on long-term dependency tasks, including PettingZoo, MiniGrid, Google Research Football (GRF), and StarCraft Multi-Agent Challenge v2 (SMACv2).
中文摘要 最近，深度多智能体强化学习（MARL）在解决具有挑战性的任务（例如长期依赖关系和非马尔可夫环境）方面表现出了有希望的性能。它的成功部分归功于对大固定上下文长度的条件策略。然而，如此大的固定上下文长度可能会导致探索效率有限和信息冗余。在本文中，我们提出了一种新的MARL框架来获取自适应和有效的上下文信息。具体来说，我们设计了一个中心代理，通过时间梯度分析动态优化上下文长度，增强探索以促进收敛到 MARL 中的全局最优。此外，为了增强上下文长度的自适应优化能力，我们提出了一种高效的中央代理输入表示，可以有效地过滤冗余信息。通过利用基于傅里叶的低频截断方法，我们提取了分散代理的全局时间趋势，从而提供了 MARL 环境的有效且高效的表示。大量实验表明，所提出的方法在长期依赖任务上实现了最先进的（SOTA）性能，包括PettingZoo、MiniGrid、Google Research Football（GRF）和星际争霸多智能体挑战v2（SMACv2）。
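
Note on the entry above: Fourier-based low-frequency truncation keeps only the slowest-varying components of a sequence. The sketch below applies it to a single scalar history using a plain discrete Fourier transform. It is not the paper's input representation; the `keep` cutoff and the toy signal are assumptions chosen to make the effect visible.

```python
import cmath
import math

def low_frequency_truncate(signal, keep):
    """Zero every DFT component above frequency index `keep`, then transform back."""
    n = len(signal)
    spectrum = [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]
    # Indices k and n - k hold the same frequency, so both are kept or both dropped.
    filtered = [c if min(k, n - k) <= keep else 0 for k, c in enumerate(spectrum)]
    return [
        sum(c * cmath.exp(2j * cmath.pi * k * t / n) for k, c in enumerate(filtered)).real / n
        for t in range(n)
    ]

n = 16
trend = [math.sin(2 * math.pi * t / n) for t in range(n)]    # slow global trend
noisy = [s + 0.5 * (-1) ** t for t, s in enumerate(trend)]   # plus fast oscillation
smooth = low_frequency_truncate(noisy, keep=2)
```

The fast oscillation sits at the highest frequency and is removed, while the slow trend passes through unchanged. The naive transform is quadratic in sequence length; an FFT routine would be the usual choice for long histories.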

Human-in-the-loop Online Rejection Sampling for Robotic Manipulation

用于机器人操作的人在回路在线拒绝采样

Authors: Guanxing Lu, Rui Zhao, Haitao Lin, He Zhang, Yansong Tang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.26406
Pdf link: https://arxiv.org/pdf/2510.26406
Abstract Reinforcement learning (RL) is widely used to produce robust robotic manipulation policies, but fine-tuning vision-language-action (VLA) models with RL can be unstable due to inaccurate value estimates and sparse supervision at intermediate steps. In contrast, imitation learning (IL) is easy to train but often underperforms due to its offline nature. In this paper, we propose Hi-ORS, a simple yet effective post-training method that utilizes rejection sampling to achieve both training stability and high robustness. Hi-ORS stabilizes value estimation by filtering out negatively rewarded samples during online fine-tuning, and adopts a reward-weighted supervised training objective to provide dense intermediate-step supervision. For systematic study, we develop an asynchronous inference-training framework that supports flexible online human-in-the-loop corrections, which serve as explicit guidance for learning error-recovery behaviors. Across three real-world tasks and two embodiments, Hi-ORS fine-tunes a pi-base policy to master contact-rich manipulation in just 1.5 hours of real-world training, outperforming RL and IL baselines by a substantial margin in both effectiveness and efficiency. Notably, the fine-tuned policy exhibits strong test-time scalability by reliably executing complex error-recovery behaviors to achieve better performance.
中文摘要 强化学习（RL）被广泛用于生成稳健的机器人操纵策略，但由于值估计不准确和中间步骤的稀疏监督，使用 RL 微调视觉-语言-动作（VLA）模型可能不稳定。相比之下，模仿学习（IL）易于训练，但由于其离线性质，往往表现不佳。在本文中，我们提出了Hi-ORS，这是一种简单而有效的训练后方法，它利用拒绝采样来实现训练稳定性和高鲁棒性。Hi-ORS通过在线微调时滤除负奖励样本来稳定价值估计，并采用奖励加权监督训练目标，提供密集的中间步骤监督。为了进行系统研究，我们开发了一个异步推理训练框架，支持灵活的在线人机交互纠正，作为学习错误恢复行为的明确指导。在三个真实任务和两个实施例中，Hi-ORS微调了pi-base策略，以在短短1.5小时的真实世界训练中掌握接触丰富的操作，在有效性和效率上都大大优于RL和IL基线。值得注意的是，微调后的策略通过可靠地执行复杂的错误恢复行为来实现更好的性能，从而表现出强大的测试时可扩展性。
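
Note on the entry above: the two ingredients named in the abstract, filtering out negatively rewarded samples and a reward-weighted supervised objective, can be sketched in a few lines. This is an illustration under assumptions and not the Hi-ORS implementation. The sample fields, the threshold, and the normalization of the weights are hypothetical.

```python
def reject_and_weight(samples, min_reward=0.0):
    """Drop samples at or below the reward threshold and weight the rest by reward."""
    kept = [s for s in samples if s["reward"] > min_reward]
    if not kept:
        return []
    total = sum(s["reward"] for s in kept)  # positive because min_reward >= 0
    return [(s, s["reward"] / total) for s in kept]

def weighted_nll(weighted_samples):
    """Reward-weighted negative log-likelihood of the kept actions."""
    return -sum(weight * s["log_prob"] for s, weight in weighted_samples)

rollouts = [
    {"reward": 1.0, "log_prob": -2.0},
    {"reward": -1.0, "log_prob": -0.5},  # rejected: negative reward
    {"reward": 3.0, "log_prob": -1.0},
]
kept = reject_and_weight(rollouts)
loss = weighted_nll(kept)
```

The resulting loss is an ordinary supervised objective over accepted samples, so under this reading no value function is needed in the update.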

PolarZero: A Reinforcement Learning Approach for Low-Complexity Polarization Kernel Design

PolarZero：一种用于低复杂度极化核设计的强化学习方法

Authors: Yi-Ting Hong, Stefano Rini, Luca Barletta
Subjects: Information Theory (cs.IT)
Arxiv link: https://arxiv.org/abs/2510.26452
Pdf link: https://arxiv.org/pdf/2510.26452
Abstract Polar codes with large kernels can achieve improved error exponents but are challenging to design with low decoding complexity. This work investigates kernel construction under recursive maximum likelihood decoding (RMLD) using a reinforcement learning framework based on the Gumbel AlphaZero algorithm. The proposed method efficiently explores the design space and identifies large-size kernels that satisfy a given error exponent while minimizing decoding complexity. For a size-16 kernel, it achieves 17% lower decoding complexity than handcrafted designs while reaching an error exponent of 0.5183 compared to 0.5 for Arikan's kernel, demonstrating the effectiveness of the learning-based approach for practical polar code construction.
中文摘要 具有大内核的极性码可以实现改进的误差指数，但在低解码复杂度下设计具有挑战性。这项工作使用基于Gumbel AlphaZero算法的强化学习框架研究了递归最大似然解码（RMLD）下的内核构建。所提出的方法有效地探索了设计空间，并识别了满足给定误差指数的大尺寸核，同时最大限度地降低了解码复杂性。对于 16 号内核，它的解码复杂度比手工设计低 17%，同时达到 0.5183 的误差指数，而 Arikan 内核为 0.5，证明了基于学习的方法在实际极性代码构建方面的有效性。
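
Note on the entry above: the value 0.5 quoted for Arikan's kernel follows from the usual partial-distance formula for the error exponent of an $\ell \times \ell$ binary kernel, $E = \frac{1}{\ell}\sum_{i=1}^{\ell} \log_\ell D_i$, where $D_i$ is the Hamming distance from row $i$ to the span of the rows below it. The abstract does not restate this definition, so treating it as the paper's metric is an assumption. The brute-force sketch below works for small invertible kernels only.

```python
import math
from itertools import combinations

def partial_distances(kernel):
    """D_i: minimum Hamming distance from row i to the span of the rows below it."""
    ell = len(kernel)
    distances = []
    for i in range(ell):
        later = kernel[i + 1:]
        best = ell
        for r in range(len(later) + 1):
            for subset in combinations(later, r):
                word = list(kernel[i])
                for row in subset:
                    word = [a ^ b for a, b in zip(word, row)]
                best = min(best, sum(word))
        distances.append(best)
    return distances

def error_exponent(kernel):
    """Average of log base ell of the partial distances (kernel must be invertible)."""
    ell = len(kernel)
    return sum(math.log(d, ell) for d in partial_distances(kernel)) / ell

arikan = [[1, 0],
          [1, 1]]
exponent = error_exponent(arikan)
```

For Arikan's kernel the partial distances are 1 and 2, giving $(\log_2 1 + \log_2 2)/2 = 0.5$. The search enumerates every subset of later rows, so a size-16 kernel needs on the order of $2^{15}$ subsets for its first row.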

ReSpec: Towards Optimizing Speculative Decoding in Reinforcement Learning Systems

ReSpec：优化强化学习系统中的推测解码

Authors: Qiaoling Chen, Zijun Liu, Peng Sun, Shenggui Li, Guoteng Wang, Ziming Liu, Yonggang Wen, Siyuan Feng, Tianwei Zhang
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2510.26475
Pdf link: https://arxiv.org/pdf/2510.26475
Abstract Adapting large language models (LLMs) via reinforcement learning (RL) is often bottlenecked by the generation stage, which can consume over 75% of the training time. Speculative decoding (SD) accelerates autoregressive generation in serving systems, but its behavior under RL training remains largely unexplored. We identify three critical gaps that hinder the naive integration of SD into RL systems: diminishing speedups at large batch sizes, drafter staleness under continual actor updates, and drafter-induced policy degradation. To address these gaps, we present ReSpec, a system that adapts SD to RL through three complementary mechanisms: dynamically tuning SD configurations, evolving the drafter via knowledge distillation, and weighting updates by rollout rewards. On Qwen models (3B to 14B), ReSpec achieves up to 4.5x speedup while preserving reward convergence and training stability, providing a practical solution for efficient RL-based LLM adaptation.
中文摘要 通过强化学习（RL）适配大型语言模型（LLM）往往会受到生成阶段的瓶颈，生成阶段会消耗超过75%的训练时间。推测解码（SD）加速了服务系统中的自回归生成，但其在RL训练下的行为在很大程度上仍未得到探索。我们确定了阻碍 SD 天真集成到 RL 系统中的三个关键差距：大批量加速的减少、持续参与者更新下的起草陈旧以及起草者引起的政策退化。为了解决这些差距，我们提出了 ReSpec，这是一个通过三种互补机制使 SD 适应 RL 的系统：动态调整 SD 配置、通过知识蒸馏进化起草者以及通过推出奖励来加权更新。在Qwen模型（3B--14B）上，ReSpec在保持奖励收敛和训练稳定性的同时实现了高达4.5倍的加速，为基于RL的高效LLM适配提供了实用的解决方案。
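
Note on the entry above: drafter staleness hurts because the benefit of speculative decoding depends on how often the target model accepts drafted tokens. The sketch below uses the idealized model common in the speculative decoding literature, in which each drafted token is accepted independently with the same probability. That independence assumption and the function name are not from the ReSpec abstract.

```python
def expected_tokens_per_step(acceptance_rate, draft_length):
    """Expected tokens produced per target-model verification pass.

    Assumes each of the `draft_length` drafted tokens is accepted independently
    with probability `acceptance_rate`; the pass always yields one more token
    from the target model itself.
    """
    a = acceptance_rate
    if a >= 1.0:
        return float(draft_length + 1)
    return (1 - a ** (draft_length + 1)) / (1 - a)

fresh_drafter = expected_tokens_per_step(0.8, draft_length=4)
stale_drafter = expected_tokens_per_step(0.4, draft_length=4)
```

Under this model, a drafter whose acceptance rate falls as the actor is updated yields fewer tokens per pass, which is one way to read the motivation for keeping the drafter in sync through distillation.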

Data-Efficient RLVR via Off-Policy Influence Guidance

通过政策外影响指导实现数据高效的 RLVR

Authors: Erle Zhu, Dazhi Jiang, Yuan Wang, Xujun Li, Jiale Cheng, Yuxian Gu, Yilin Niu, Aohan Zeng, Jie Tang, Minlie Huang, Hongning Wang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.26491
Pdf link: https://arxiv.org/pdf/2510.26491
Abstract Data selection is a critical aspect of Reinforcement Learning with Verifiable Rewards (RLVR) for enhancing the reasoning capabilities of large language models (LLMs). Current data selection methods are largely heuristic-based, lacking theoretical guarantees and generalizability. This work proposes a theoretically-grounded approach using influence functions to estimate the contribution of each data point to the learning objective. To overcome the prohibitive computational cost of policy rollouts required for online influence estimation, we introduce an off-policy influence estimation method that efficiently approximates data influence using pre-collected offline trajectories. Furthermore, to manage the high-dimensional gradients of LLMs, we employ sparse random projection to reduce dimensionality and improve storage and computation efficiency. Leveraging these techniques, we develop Curriculum RL with Off-Policy Influence guidance (CROPI), a multi-stage RL framework that iteratively selects the most influential data for the current policy. Experiments on models up to 7B parameters demonstrate that CROPI significantly accelerates training. On a 1.5B model, it achieves a 2.66x step-level acceleration while using only 10% of the data per stage compared to full-dataset training. Our results highlight the substantial potential of influence-based data selection for efficient RLVR.
中文摘要 数据选择是具有可验证奖励的强化学习（RLVR）的一个重要方面，用于增强大型语言模型（LLM）的推理能力。目前的数据选择方法主要基于启发式，缺乏理论保证和普遍性。这项工作提出了一种基于理论的方法，使用影响函数来估计每个数据点对学习目标的贡献。为了克服在线影响估计所需的策略 rollout 带来的高昂计算成本，我们引入了一种离策略（off-policy）影响估计方法，该方法使用预先收集的离线轨迹有效地近似数据影响。此外，为了管理LLM的高维梯度，我们采用稀疏随机投影来降低维度，提高存储和计算效率。利用这些技术，我们开发了基于离策略影响引导的课程强化学习（Curriculum RL with Off-Policy Influence guidance，CROPI），这是一个多阶段的 RL 框架，迭代地选择对当前策略最有影响力的数据。在高达 7B 参数的模型上的实验表明，CROPI 显著加速了训练。在 1.5B 模型上，与全数据集训练相比，它实现了 2.66 倍的步级加速，同时每个阶段仅使用 10% 的数据。我们的结果凸显了基于影响的数据选择对高效 RLVR 的巨大潜力。
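
Note on the entry above: sparse random projection compresses a high-dimensional gradient into a short vector while approximately preserving inner products. The sketch below uses the classic sparse construction with entries of $+\sqrt{s/k}$, $0$, or $-\sqrt{s/k}$, and scores influence as a dot product of projected gradients. Both the specific construction and the dot-product influence score are assumptions, since the abstract does not give CROPI's exact estimator.

```python
import math
import random

def sparse_projection_matrix(dim_in, dim_out, sparsity=3, seed=0):
    """Rows stored as {column: value}; nonzeros are +/- sqrt(sparsity / dim_out)."""
    rng = random.Random(seed)
    scale = math.sqrt(sparsity / dim_out)
    matrix = []
    for _ in range(dim_out):
        row = {}
        for j in range(dim_in):
            u = rng.random()
            if u < 1 / (2 * sparsity):
                row[j] = scale
            elif u < 1 / sparsity:
                row[j] = -scale
        matrix.append(row)
    return matrix

def project(vector, matrix):
    """Multiply the sparse matrix by a dense vector, touching only nonzero entries."""
    return [sum(value * vector[j] for j, value in row.items()) for row in matrix]

def influence(train_gradient, target_gradient, matrix):
    """First-order influence proxy: inner product of the two projected gradients."""
    p_train = project(train_gradient, matrix)
    p_target = project(target_gradient, matrix)
    return sum(a * b for a, b in zip(p_train, p_target))

matrix = sparse_projection_matrix(dim_in=50, dim_out=8, seed=0)
gradient = [0.1 * (i % 5) for i in range(50)]
low_dim = project(gradient, matrix)
```

With `sparsity=3`, about two thirds of the entries are zero on average, so each projected coordinate reads only around a third of the gradient, which is where the storage and compute savings come from.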

Think Outside the Policy: In-Context Steered Policy Optimization

跳出政策思考：情境引导策略优化

Authors: Hsiu-Yuan Huang, Chenming Tang, Weijie Liu, Saiyong Yang, Yunfang Wu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.26519
Pdf link: https://arxiv.org/pdf/2510.26519
Abstract Existing Reinforcement Learning from Verifiable Rewards (RLVR) methods, such as Group Relative Policy Optimization (GRPO), have achieved remarkable progress in improving the reasoning capabilities of Large Reasoning Models (LRMs). However, they exhibit limited exploration because they rely on on-policy rollouts that are confined to the current policy's distribution, resulting in narrow trajectory diversity. Recent approaches attempt to expand policy coverage by incorporating trajectories generated from stronger expert models, yet this reliance increases computational cost and such advanced models are often inaccessible. To address these issues, we propose In-Context Steered Policy Optimization (ICPO), a unified framework that leverages the inherent in-context learning capability of LRMs to provide expert guidance using existing datasets. ICPO introduces Mixed-Policy GRPO with Implicit Expert Forcing, which expands exploration beyond the current policy distribution without requiring advanced LRM trajectories. To further stabilize optimization, ICPO integrates Expert Region Reject Sampling to filter unreliable off-policy trajectories and Annealed Expert-Bonus Reward Shaping to balance early expert guidance with later autonomous improvement. Results demonstrate that ICPO consistently enhances reinforcement learning performance and training stability on mathematical reasoning benchmarks, revealing a scalable and effective RLVR paradigm for LRMs.
中文摘要 现有的可验证奖励强化学习（RLVR）方法，如群体相对策略优化（GRPO），在提高大型推理模型（LRM）的推理能力方面取得了显著进展。然而，由于依赖政策推出，仅限于当前政策的分布，它们表现出有限的探索，导致轨迹多样性狭窄。最近的方法试图通过结合更强大的专家模型生成的轨迹来扩大政策覆盖范围，但这种依赖增加了计算成本，而且这种先进的模型通常无法访问。为了解决这些问题，我们提出了上下文引导策略优化（ICPO），这是一个统一的框架，它利用LRM固有的上下文学习能力，使用现有数据集提供专家指导。ICPO 引入了带有隐式专家强制的混合策略 GRPO，它将探索扩展到当前策略分布之外，而无需高级 LRM 轨迹。为了进一步稳定优化，ICPO 集成了专家区域拒绝抽样以过滤不可靠的脱离策略轨迹，并集成了退火专家奖励塑造，以平衡早期专家指导和后期自主改进。结果表明，ICPO在数学推理基准上持续增强了强化学习性能和训练稳定性，揭示了LRM的可扩展且有效的RLVR范式。
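For readers unfamiliar with GRPO, the group-relative advantage it is named for can be sketched in a few lines. This is the commonly described form (reward minus group mean, divided by group standard deviation), not ICPO's mixed-policy variant; the function name, the `eps` value, and the sample rewards are illustrative assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    # eps (an assumed small constant) avoids division by zero.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group: four rollouts for one prompt, scored by a verifier
# (1.0 = correct answer, 0.0 = incorrect answer).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every rollout in the group gets the same reward, all advantages are zero.
flat = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(advantages)
print(flat)
```

When every on-policy rollout in a group fails (or every one succeeds), the advantages are all zero and the group contributes no learning signal, which is one way to see why narrow trajectory diversity limits exploration.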

InfoFlow: Reinforcing Search Agent Via Reward Density Optimization

InfoFlow：通过奖励密度优化强化搜索代理

Authors: Kun Luo, Hongjin Qian, Zheng Liu, Ziyi Xia, Shitao Xiao, Siqi Bao, Jun Zhao, Kang Liu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.26575
Pdf link: https://arxiv.org/pdf/2510.26575
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) is a promising approach for enhancing agentic deep search. However, its application is often hindered by low Reward Density in deep search scenarios, where agents expend significant exploratory costs for infrequent and often null final rewards. In this paper, we formalize this challenge as the Reward Density Optimization problem, which aims to improve the reward obtained per unit of exploration cost. This paper introduces InfoFlow, a systematic framework that tackles this problem from three aspects. 1) Subproblem decomposition: breaking down long-range tasks to assign process rewards, thereby providing denser learning signals. 2) Failure-guided hints: injecting corrective guidance into stalled trajectories to increase the probability of successful outcomes. 3) Dual-agent refinement: employing a dual-agent architecture to offload the cognitive burden of deep exploration. A refiner agent synthesizes the search history, which effectively compresses the researcher's perceived trajectory, thereby reducing exploration cost and increasing the overall reward density. We evaluate InfoFlow on multiple agentic search benchmarks, where it significantly outperforms strong baselines, enabling lightweight LLMs to achieve performance comparable to advanced proprietary LLMs.
中文摘要 具有可验证奖励的强化学习（RLVR）是一种很有前途的增强智能体式深度搜索的方法。然而，在深度搜索场景中，它的应用经常受到低奖励密度（Reward Density）的阻碍：智能体要付出大量探索成本，才能获得稀少且往往为零的最终奖励。在本文中，我们将这一挑战形式化为奖励密度优化（Reward Density Optimization）问题，旨在提高单位探索成本所获得的奖励。本文提出 InfoFlow，这是一个从三个方面解决这一问题的系统框架。1）子问题分解：分解长程任务以分配过程奖励，从而提供更密集的学习信号。2）失败引导提示：在停滞的轨迹中注入纠正性指导，以提高成功的概率。3）双智能体精炼：采用双智能体架构来减轻深度探索的认知负担。精炼智能体对搜索历史进行综合，有效压缩了研究智能体所感知的轨迹，从而降低了探索成本，提高了整体奖励密度。我们在多个智能体搜索基准上评估了 InfoFlow，它明显优于强基线，使轻量级 LLM 能够达到与先进的专有 LLM 相当的性能。

Emu3.5: Native Multimodal Models are World Learners

Emu3.5：原生多模态模型是世界学习者

Authors: Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jinsheng Wang, Wenxuan Wang, Yueze Wang, Chengyuan Wang, Fan Zhang, Yingli Zhao, Ting Pan, Xianduo Li, Zecheng Hao, Wenxuan Ma, Zhuo Chen, Yulong Ao, Tiejun Huang, Zhongyuan Wang, Xinlong Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2510.26583
Pdf link: https://arxiv.org/pdf/2510.26583
Abstract We introduce Emu3.5, a large-scale multimodal world model that natively predicts the next state across vision and language. Emu3.5 is pre-trained end-to-end with a unified next-token prediction objective on a corpus of vision-language interleaved data containing over 10 trillion tokens, primarily derived from sequential frames and transcripts of internet videos. The model naturally accepts interleaved vision-language inputs and generates interleaved vision-language outputs. Emu3.5 is further post-trained with large-scale reinforcement learning to enhance multimodal reasoning and generation. To improve inference efficiency, we propose Discrete Diffusion Adaptation (DiDA), which converts token-by-token decoding into bidirectional parallel prediction, accelerating per-image inference by about 20x without sacrificing performance. Emu3.5 exhibits strong native multimodal capabilities, including long-horizon vision-language generation, any-to-image (X2I) generation, and complex text-rich image generation. It also exhibits generalizable world-modeling abilities, enabling spatiotemporally consistent world exploration and open-world embodied manipulation across diverse scenarios and tasks. For comparison, Emu3.5 achieves performance comparable to Gemini 2.5 Flash Image (Nano Banana) on image generation and editing tasks and demonstrates superior results on a suite of interleaved generation tasks. We open-source Emu3.5 at this https URL to support community research.
中文摘要 我们介绍了 Emu3.5，这是一个大规模的多模态世界模型，可以原生地预测视觉和语言的下一个状态。Emu3.5 在一个包含超过 10 万亿个 token 的视觉-语言交错数据语料库上，以统一的下一个 token 预测目标进行端到端预训练，这些数据主要来自互联网视频的连续帧和转录文本。该模型自然地接受交错的视觉-语言输入并生成交错的视觉-语言输出。Emu3.5 通过大规模强化学习进一步进行后训练，以增强多模态推理和生成。为了提高推理效率，我们提出了离散扩散自适应（DiDA），它将逐 token 解码转换为双向并行预测，在不牺牲性能的情况下将每张图像的推理速度提高约 20 倍。Emu3.5 表现出强大的原生多模态能力，包括长程视觉-语言生成、任意到图像（X2I）生成和复杂的富文本图像生成。它还表现出可泛化的世界建模能力，能够在不同场景和任务中实现时空一致的世界探索和开放世界具身操作。作为对比，Emu3.5 在图像生成和编辑任务上实现了与 Gemini 2.5 Flash Image（Nano Banana）相当的性能，并在一系列交错生成任务中表现出更优的结果。我们在此 https URL 上开源 Emu3.5 以支持社区研究。

A DRL-Empowered Multi-Level Jamming Approach for Secure Semantic Communication

DRL 赋能的多级干扰方法，实现安全语义通信

Authors: Weixuan Chen, Qianqian Yang
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2510.26610
Pdf link: https://arxiv.org/pdf/2510.26610
Abstract Semantic communication (SemCom) aims to transmit only task-relevant information, thereby improving communication efficiency but also exposing semantic information to potential eavesdropping. In this paper, we propose a deep reinforcement learning (DRL)-empowered multi-level jamming approach to enhance the security of SemCom systems over MIMO fading wiretap channels. This approach combines semantic layer jamming, achieved by encoding task-irrelevant text, and physical layer jamming, achieved by encoding random Gaussian noise. These two-level jamming signals are superposed with task-relevant semantic information to protect the transmitted semantics from eavesdropping. A deep deterministic policy gradient (DDPG) algorithm is further introduced to dynamically design and optimize the precoding matrices for both task-relevant semantic information and multi-level jamming signals, aiming to enhance the legitimate user's image reconstruction while degrading the eavesdropper's performance. To jointly train the SemCom model and the DDPG agent, we propose an alternating optimization strategy where the two modules are updated iteratively. Experimental results demonstrate that, compared with both the encryption-based (ESCS) and encoded jammer-based (EJ) benchmarks, our method achieves comparable security while improving the legitimate user's peak signal-to-noise ratio (PSNR) by up to approximately 0.6 dB.
中文摘要 语义通信（SemCom）旨在仅传输与任务相关的信息，从而提高通信效率，但也使语义信息暴露于潜在的窃听之下。在本文中，我们提出了一种深度强化学习（DRL）赋能的多级干扰方法，以增强 SemCom 系统在 MIMO 衰落窃听通道上的安全性。这种方法结合了通过对任务无关文本进行编码来实现的语义层干扰和通过对随机高斯噪声进行编码来实现的物理层干扰。这些两级干扰信号与任务相关的语义信息叠加在一起，以保护传输的语义免遭窃听。进一步引入深度确定性策略梯度（DDPG）算法，对任务相关语义信息和多级干扰信号的预编码矩阵进行动态设计和优化，旨在增强合法用户的图像重建能力，同时降低窃听者的性能。为了联合训练SemCom模型和DDPG代理，我们提出了一种交替优化策略，其中两个模块迭代更新。实验结果表明，与基于加密（ESCS）和基于编码干扰器（EJ）的基准相比，该方法在将合法用户峰值信噪比（PSNR）提高约0.6 dB的同时实现了相当的安全性。

Low-Altitude UAV-Carried Movable Antenna for Joint Wireless Power Transfer and Covert Communications

用于联合无线电力传输和隐蔽通信的低空无人机携带的可移动天线

Authors: Chuang Zhang, Geng Sun, Jiahui Li, Jiacheng Wang, Qingqing Wu, Dusit Niyato, Shiwen Mao, Tony Q. S. Quek
Subjects: Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2510.26628
Pdf link: https://arxiv.org/pdf/2510.26628
Abstract The proliferation of Internet of Things (IoT) networks has created an urgent need for sustainable energy solutions, particularly for battery-constrained, spatially distributed IoT nodes. While low-altitude uncrewed aerial vehicles (UAVs) equipped with wireless power transfer (WPT) capabilities offer a promising solution, the line-of-sight channels that facilitate efficient energy delivery also expose sensitive operational data to adversaries. This paper proposes a novel low-altitude UAV-carried movable antenna-enhanced transmission system for joint WPT and covert communications, which simultaneously supplies energy to IoT nodes and establishes transmission links with a covert user by leveraging wireless energy signals as a natural cover. Then, we formulate a multi-objective optimization problem that jointly maximizes the total harvested energy of IoT nodes and the sum achievable rate of the covert user, while minimizing the propulsion energy consumption of the low-altitude UAV. To address the non-convex and temporally coupled optimization problem, we propose a mixture-of-experts-augmented soft actor-critic (MoE-SAC) algorithm that employs a sparse Top-K gated mixture-of-shallow-experts architecture to represent multimodal policy distributions arising from the conflicting optimization objectives. We also incorporate an action projection module that explicitly enforces per-time-slot power budget constraints and antenna position constraints. Simulation results demonstrate that the proposed approach significantly outperforms some baseline approaches and other state-of-the-art deep reinforcement learning algorithms.
中文摘要 物联网（IoT）网络的激增迫切需要可持续能源解决方案，特别是对于电池受限、空间分布的物联网节点。虽然配备无线电力传输（WPT）功能的低空无人机（UAV）提供了一种有前途的解决方案，但促进高效能量传输的视距信道也会向对手暴露敏感的操作数据。本文提出了一种新型的低空无人机携带可移动天线增强的传输系统，用于联合 WPT 和隐蔽通信，该系统利用无线能量信号作为自然掩护，同时为物联网节点补充能量并与隐蔽用户建立传输链路。然后，我们提出了一个多目标优化问题，联合最大化物联网节点的总收集能量和隐蔽用户的总可达速率，同时最小化低空无人机的推进能耗。为了解决这一非凸且时间耦合的优化问题，我们提出了一种混合专家增强的软演员-评论家（MoE-SAC）算法，该算法采用稀疏的 Top-K 门控浅层专家混合架构来表示由相互冲突的优化目标产生的多模态策略分布。我们还集成了一个动作投影模块，该模块显式地施加每个时隙的功率预算约束和天线位置约束。仿真结果表明，所提出的方法明显优于一些基线方法和其他最先进的深度强化学习算法。
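The sparse Top-K gating mentioned in this abstract can be illustrated with a generic sketch: keep the K largest gate logits, apply a softmax over only those, and mix the selected experts' outputs. This is the general technique under assumed names (`top_k_gate`, `mixture_output`) and made-up numbers, not the MoE-SAC architecture itself, whose details the abstract does not give.

```python
import math

def top_k_gate(logits, k):
    """Softmax over the k largest logits; all other experts get weight 0."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return [exps[i] / z if i in exps else 0.0 for i in range(len(logits))]

def mixture_output(expert_outputs, weights):
    """Weighted sum of expert output vectors."""
    dim = len(expert_outputs[0])
    return [sum(w * out[d] for w, out in zip(weights, expert_outputs))
            for d in range(dim)]

# Hypothetical gate logits for four experts; only two are activated.
weights = top_k_gate([2.0, -1.0, 0.5, 2.0], k=2)

# Hypothetical expert outputs (for example, candidate action vectors).
experts = [[1.0, 0.0], [9.0, 9.0], [9.0, 9.0], [0.0, 1.0]]
action = mixture_output(experts, weights)

print(weights)
print(action)
```

Because unselected experts receive exactly zero weight, their outputs do not influence the result, so an implementation can skip evaluating them altogether.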

The Era of Agentic Organization: Learning to Organize with Language Models

智能体组织时代：学习使用语言模型进行组织

Authors: Zewen Chi, Li Dong, Qingxiu Dong, Yaru Hao, Xun Wu, Shaohan Huang, Furu Wei
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.26658
Pdf link: https://arxiv.org/pdf/2510.26658
Abstract We envision a new era of AI, termed agentic organization, where agents solve complex problems by working collaboratively and concurrently, enabling outcomes beyond individual intelligence. To realize this vision, we introduce asynchronous thinking (AsyncThink) as a new paradigm of reasoning with large language models, which organizes the internal thinking process into concurrently executable structures. Specifically, we propose a thinking protocol where an organizer dynamically assigns sub-queries to workers, merges intermediate knowledge, and produces coherent solutions. More importantly, the thinking structure in this protocol can be further optimized through reinforcement learning. Experiments demonstrate that AsyncThink achieves 28% lower inference latency compared to parallel thinking while improving accuracy on mathematical reasoning. Moreover, AsyncThink generalizes its learned asynchronous thinking capabilities, effectively tackling unseen tasks without additional training.
中文摘要 我们设想了一个人工智能的新时代，称为智能体组织，其中智能体通过协作和并发工作来解决复杂的问题，从而实现超越个体智能的结果。为了实现这一愿景，我们引入了异步思维（AsyncThink）作为大型语言模型推理的新范式，它将内部思维过程组织成可并发执行的结构。具体来说，我们提出了一种思维协议，其中组织者动态地将子查询分配给工作者，合并中间知识，并产生连贯的解决方案。更重要的是，该协议中的思维结构可以通过强化学习进一步优化。实验表明，与并行思维相比，AsyncThink 的推理延迟降低了 28%，同时提高了数学推理的准确性。此外，AsyncThink 能够泛化其学到的异步思维能力，无需额外训练即可有效处理未见过的任务。

Kimi Linear: An Expressive, Efficient Attention Architecture

Kimi Linear：一种富有表现力、高效的注意力架构

Authors: Kimi Team: Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, Wentao Li, Enzhe Lu, Weizhou Liu, Yanru Chen, Weixin Xu, Longhui Yu, Yejie Wang, Yu Fan, Longguang Zhong, Enming Yuan, Dehao Zhang, Yizhi Zhang, T.Y. Liu, Haiming Wang, Shengjun Fang, Weiran He, Shaowei Liu, Yiwei Li, Jianlin Su, Jiezhong Qiu, Bo Pang, Junjie Yan, Zhejun Jiang, Weixiao Huang, Bohong Yin, Jiacheng You, Chu Wei, Zhengtao Wang, Chao Hong, Yutian Chen, Guanduo Chen, Yucheng Wang, Huabin Zheng, Feng Wang, Yibo Liu, Mengnan Dong, Zheng Zhang, Siyuan Pan, Wenhao Wu, Yuhao Wu, Longyu Guan, Jiawen Tao, Guohong Fu, Xinran Xu, Yuzhi Wang, Guokun Lai, Yuxin Wu, Xinyu Zhou, Zhilin Yang, Yulun Du
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.26692
Pdf link: https://arxiv.org/pdf/2510.26692
Abstract We introduce Kimi Linear, a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across various scenarios -- including short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism, enabling more effective use of limited finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware efficiency through a specialized variant of the Diagonal-Plus-Low-Rank (DPLR) transition matrices, which substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule. We pretrain a Kimi Linear model with 3B activated parameters and 48B total parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). Our experiments show that with an identical training recipe, Kimi Linear outperforms full MLA by a sizeable margin across all evaluated tasks, while reducing KV cache usage by up to 75% and achieving up to 6 times the decoding throughput for a 1M context. These results demonstrate that Kimi Linear can be a drop-in replacement for full attention architectures with superior performance and efficiency, including on tasks with longer input and output lengths. To support further research, we open-source the KDA kernel and vLLM implementations, and release the pre-trained and instruction-tuned model checkpoints.
中文摘要 我们介绍了 Kimi Linear，这是一种混合线性注意力架构，在各种场景（包括短上下文、长上下文和强化学习（RL）扩展场景）的公平比较下，它首次优于全注意力。其核心是 Kimi Delta Attention（KDA），这是一个富有表现力的线性注意力模块，它通过更细粒度的门控机制扩展了 Gated DeltaNet，从而能够更有效地利用有限的有限状态 RNN 记忆。我们定制的分块算法通过对角加低秩（DPLR）转移矩阵的一种专用变体实现了高硬件效率，与一般的 DPLR 形式相比，这大大减少了计算量，同时与经典的 delta 规则更加一致。我们基于 KDA 和多头潜在注意力（MLA）的逐层混合，预训练了一个具有 3B 激活参数和 48B 总参数的 Kimi Linear 模型。我们的实验表明，在相同的训练配方下，Kimi Linear 在所有评估任务中都以相当大的优势优于完整 MLA，同时将 KV 缓存使用量减少多达 75%，并在 1M 上下文中实现高达 6 倍的解码吞吐量。这些结果表明，Kimi Linear 可以作为全注意力架构的直接替代品，具有更优的性能和效率，包括输入和输出长度更长的任务。为了支持进一步的研究，我们开源了 KDA 内核和 vLLM 实现，并发布了预训练和指令微调的模型检查点。
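As background for the "classical delta rule" this abstract refers to, the sketch below shows the ungated recurrence used by delta-rule linear attention: a fixed-size state matrix S is corrected toward each new key-value pair, S ← S − β (S k − v) kᵀ. It deliberately omits KDA's gating and the chunkwise DPLR algorithm, which the abstract does not specify; the function names and toy vectors are assumptions for illustration.

```python
def delta_rule_step(S, k, v, beta):
    """One delta-rule write: S <- S - beta * (S k - v) k^T."""
    # What the memory currently returns for key k.
    pred = [sum(s * kj for s, kj in zip(row, k)) for row in S]
    # Move the stored value for k toward v by a step of size beta.
    return [[s - beta * (p - vi) * kj for s, kj in zip(row, k)]
            for row, p, vi in zip(S, pred, v)]

def read(S, q):
    """Query the memory: returns S q."""
    return [sum(s * qj for s, qj in zip(row, q)) for row in S]

# Toy 2x2 state; its size never grows with sequence length.
S = [[0.0, 0.0], [0.0, 0.0]]
key_a, key_b = [1.0, 0.0], [0.0, 1.0]  # hypothetical orthonormal keys

S = delta_rule_step(S, key_a, [3.0, 4.0], beta=1.0)  # write under key_a
S = delta_rule_step(S, key_a, [5.0, 6.0], beta=1.0)  # overwrite key_a
S = delta_rule_step(S, key_b, [7.0, 8.0], beta=1.0)  # write under key_b

print(read(S, key_a))
print(read(S, key_b))
```

With unit-norm keys and beta = 1, a second write to the same key replaces the old value rather than adding to it, which is the behaviour that distinguishes the delta rule from purely additive linear attention.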

A General Incentives-Based Framework for Fairness in Multi-agent Resource Allocation

基于通用激励的多智能体资源分配公平框架

Authors: Ashwin Kumar, William Yeoh
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.26740
Pdf link: https://arxiv.org/pdf/2510.26740
Abstract We introduce the General Incentives-based Framework for Fairness (GIFF), a novel approach for fair multi-agent resource allocation that infers fair decision-making from standard value functions. In resource-constrained settings, agents optimizing for efficiency often create inequitable outcomes. Our approach leverages the action-value (Q-)function to balance efficiency and fairness without requiring additional training. Specifically, our method computes a local fairness gain for each action and introduces a counterfactual advantage correction term to discourage over-allocation to already well-off agents. This approach is formalized within a centralized control setting, where an arbitrator uses the GIFF-modified Q-values to solve an allocation problem. Empirical evaluations across diverse domains, including dynamic ridesharing, homelessness prevention, and a complex job allocation task, demonstrate that our framework consistently outperforms strong baselines and can discover far-sighted, equitable policies. The framework's effectiveness is supported by a theoretical foundation; we prove its fairness surrogate is a principled lower bound on the true fairness improvement and that its trade-off parameter offers monotonic tuning. Our findings establish GIFF as a robust and principled framework for leveraging standard reinforcement learning components to achieve more equitable outcomes in complex multi-agent systems.
中文摘要 我们引入了基于通用激励的公平框架（GIFF），这是一种公平多智能体资源分配的新方法，可从标准价值函数中推断出公平的决策。在资源受限的环境中，以效率为优化目标的智能体通常会产生不公平的结果。我们的方法利用动作价值（Q）函数来平衡效率和公平性，而无需额外训练。具体来说，我们的方法计算每个动作的局部公平性增益，并引入一个反事实优势校正项，以抑制对已经处境较好的智能体的过度分配。这种方法在集中控制设置中被形式化，其中仲裁者使用经 GIFF 修改的 Q 值来求解分配问题。跨不同领域的实证评估，包括动态拼车、无家可归预防和复杂的工作分配任务，表明我们的框架始终优于强基线，并且可以发现有远见、公平的策略。该框架的有效性有理论基础支撑：我们证明其公平性代理是真实公平性改进的有原则的下界，并且其权衡参数提供了单调调节。我们的研究结果将 GIFF 确立为一个稳健且有原则的框架，用于利用标准强化学习组件在复杂的多智能体系统中实现更公平的结果。

Defeating the Training-Inference Mismatch via FP16

通过 FP16 击败训练-推理不匹配

Authors: Penghui Qi, Zichen Liu, Xiangxin Zhou, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.26788
Pdf link: https://arxiv.org/pdf/2510.26788
Abstract Reinforcement learning (RL) fine-tuning of large language models (LLMs) often suffers from instability due to the numerical mismatch between the training and inference policies. While prior work has attempted to mitigate this issue through algorithmic corrections or engineering alignments, we show that its root cause lies in the floating point precision itself. The widely adopted BF16, despite its large dynamic range, introduces large rounding errors that break the consistency between training and inference. In this work, we demonstrate that simply reverting to FP16 effectively eliminates this mismatch. The change is simple, fully supported by modern frameworks with only a few lines of code change, and requires no modification to the model architecture or learning algorithm. Our results suggest that using FP16 uniformly yields more stable optimization, faster convergence, and stronger performance across diverse tasks, algorithms and frameworks. We hope these findings motivate a broader reconsideration of precision trade-offs in RL fine-tuning.
中文摘要 由于训练和推理策略之间的数值不匹配，大型语言模型（LLM）的强化学习（RL）微调经常会出现不稳定。虽然之前的工作试图通过算法校正或工程对齐来缓解这个问题，但我们表明其根本原因在于浮点精度本身。广泛采用的 BF16 尽管动态范围很大，但引入了较大的舍入误差，从而破坏了训练和推理之间的一致性。在这项工作中，我们证明了简单地改回 FP16 可以有效消除这种不匹配。这一更改很简单，现代框架完全支持，只需更改几行代码，无需修改模型架构或学习算法。我们的结果表明，统一使用 FP16 可以在不同的任务、算法和框架中产生更稳定的优化、更快的收敛和更强的性能。我们希望这些发现能够促使人们更广泛地重新考虑 RL 微调中的精度权衡。
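The precision gap behind this abstract's claim can be checked directly: FP16 keeps 10 explicit fraction bits, while BF16 keeps 7 and spends the difference on exponent range. The sketch below rounds the same values to each format using only the standard library and compares the worst relative error. It illustrates the number formats only and does not reproduce the paper's experiments; the helper names and the test grid are assumptions, and NaN or overflow handling is left out.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to IEEE half precision (FP16) and convert back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x: float) -> float:
    """Round x to bfloat16 (the top 16 bits of a float32) and convert back."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000                   # keep sign, exponent, 7 fraction bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def max_relative_error(round_fn, values):
    """Largest relative rounding error of round_fn over positive values."""
    return max(abs(round_fn(v) - v) / v for v in values)

# Hypothetical probability-like values in (0, 1).
values = [i / 1000 for i in range(1, 1000)]
max_fp16 = max_relative_error(to_fp16, values)
max_bf16 = max_relative_error(to_bf16, values)

print(f"max relative rounding error  FP16: {max_fp16:.2e}  BF16: {max_bf16:.2e}")
```

On this grid the FP16 error stays within 2^-11 and the BF16 error within 2^-8, roughly an eightfold difference in worst-case rounding error for values that fit comfortably in FP16's narrower range.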

Keyword: diffusion policy

There is no result