Arxiv Papers of Today

Generated: 2026-01-08 16:34:55 (UTC+8); arXiv announcement time: 2026-01-08 20:00 EST (2026-01-09 09:00 UTC+8)

There are 44 relevant papers today.

Keyword: reinforcement learning

PC2P: Multi-Agent Path Finding via Personalized-Enhanced Communication and Crowd Perception

Authors: Guotao Li, Shaoyun Xu, Yuexing Hao, Yang Wang, Yuhui Sun
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.03301
Pdf link: https://arxiv.org/pdf/2601.03301
Abstract Distributed Multi-Agent Path Finding (MAPF) integrated with Multi-Agent Reinforcement Learning (MARL) has emerged as a prominent research focus, enabling real-time cooperative decision-making in partially observable environments through inter-agent communication. However, due to insufficient collaborative and perceptual capabilities, existing methods are inadequate for scaling across diverse environmental conditions. To address these challenges, we propose PC2P, a novel distributed MAPF method derived from a Q-learning-based MARL framework. Initially, we introduce a personalized-enhanced communication mechanism based on dynamic graph topology, which ascertains the core aspects of "who" and "what" in the interactive process through three-stage operations: selection, generation, and aggregation. Concurrently, we incorporate local crowd perception to enrich agents' heuristic observation, thereby strengthening the model's guidance for effective actions via the integration of static spatial constraints and dynamic occupancy changes. To resolve extreme deadlock issues, we propose a region-based deadlock-breaking strategy that leverages expert guidance to implement efficient coordination within confined areas. Experimental results demonstrate that PC2P achieves superior performance compared to state-of-the-art distributed MAPF methods in varied environments. Ablation studies further confirm the effectiveness of each module for overall performance.

Autonomous Threat Detection and Response in Cloud Security: A Comprehensive Survey of AI-Driven Strategies

Authors: Gaurav Sarraf, Vibhor Pal
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2601.03303
Pdf link: https://arxiv.org/pdf/2601.03303
Abstract Cloud computing has changed online communities in three dimensions, which are scalability, adaptability and reduced overhead. But there are serious security concerns which are brought about by its distributed and multi-tenant characteristics. The old methods of detecting and reacting to threats which are mostly reliant on fixed signatures, predefined rules and human operators are becoming less and less effective even in the advanced stages of cyberattacks of cloud infrastructures. The recent trend in the field of addressing these limitations is the creation of technologies of artificial intelligence (AI). The strategies allow independent protection, anomaly detection, and real-time analysis with references to using deep learning, machine learning, and reinforcement learning. Through imbuing AI with a constantly-learning feature, it enables the intrusion detection system to be more accurate and generate a lesser number of false positives and it also enables the possibility of adaptive and predictive security. The fusion of large-scale language models with efficient orchestration platforms contributes to reacting to the arising threats with a quicker and more precise response. This allows automatic control over incidences, self-healing network, and defense mechanisms on a policy basis. Considering the current detection and response methods, this discussion assesses their strengths and weaknesses and outlines key issues such as data privacy, adversarial machine learning and integration complexity in the context of AI-based cloud security. These results suggest the future application of AI to support autonomous, scalable and active cloud security operations.

Mastering the Game of Go with Self-play Experience Replay

Authors: Jingbin Liu, Xuechun Wang
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.03306
Pdf link: https://arxiv.org/pdf/2601.03306
Abstract The game of Go has long served as a benchmark for artificial intelligence, demanding sophisticated strategic reasoning and long-term planning. Previous approaches such as AlphaGo and its successors, have predominantly relied on model-based Monte-Carlo Tree Search (MCTS). In this work, we present QZero, a novel model-free reinforcement learning algorithm that forgoes search during training and learns a Nash equilibrium policy through self-play and off-policy experience replay. Built upon entropy-regularized Q-learning, QZero utilizes a single Q-value network to unify policy evaluation and improvement. Starting tabula rasa without human data and trained for 5 months with modest compute resources (7 GPUs), QZero achieved a performance level comparable to that of AlphaGo. This demonstrates, for the first time, the efficiency of using model-free reinforcement learning to master the game of Go, as well as the feasibility of off-policy reinforcement learning in solving large-scale and complex environments.
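
The abstract says QZero builds on entropy-regularized Q-learning, with a single Q-value network serving both policy evaluation and improvement. The sketch below shows the generic single-agent soft Q-learning identities that make this possible; it is not the QZero implementation. The temperature `tau`, the discount `gamma`, and all function names are illustrative assumptions, and the turn alternation needed for two-player self-play is not modeled.

```python
import math

def soft_value(q_values, tau=1.0):
    """Soft state value tau * log(sum(exp(Q / tau))), computed stably."""
    m = max(q_values)
    return m + tau * math.log(sum(math.exp((q - m) / tau) for q in q_values))

def soft_policy(q_values, tau=1.0):
    """Boltzmann policy pi(a) proportional to exp(Q(a) / tau)."""
    v = soft_value(q_values, tau)
    return [math.exp((q - v) / tau) for q in q_values]

def soft_q_target(reward, next_q_values, gamma=0.99, tau=1.0, done=False):
    """One-step regression target r + gamma * V_soft(s')."""
    if done:
        return reward
    return reward + gamma * soft_value(next_q_values, tau)
```

Because the policy is read directly off the Q-values, improving the Q estimate and improving the policy are the same update, which is one way to read the "single network" claim.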

Ratio-Variance Regularized Policy Optimization for Efficient LLM Fine-tuning

Authors: Yu Luo, Shuo Han, Yihan Hu, Dong Li, Jianye Hao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.03320
Pdf link: https://arxiv.org/pdf/2601.03320
Abstract On-policy reinforcement learning (RL), particularly Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), has become the dominant paradigm for fine-tuning large language models (LLMs). While policy ratio clipping stabilizes training, this heuristic hard constraint incurs a fundamental cost: it indiscriminately truncates gradients from high-return yet high-divergence actions, suppressing rare but highly informative "eureka moments" in complex reasoning. Moreover, once data becomes slightly stale, hard clipping renders it unusable, leading to severe sample inefficiency. In this work, we revisit the trust-region objective in policy optimization and show that explicitly constraining the variance (second central moment) of the policy ratio provides a principled and smooth relaxation of hard clipping. This distributional constraint stabilizes policy updates while preserving gradient signals from valuable trajectories. Building on this insight, we propose $R^2VPO$ (Ratio-Variance Regularized Policy Optimization), a novel primal-dual framework that supports stable on-policy learning and enables principled off-policy data reuse by dynamically reweighting stale samples rather than discarding them. We extensively evaluate $R^2VPO$ on fine-tuning state-of-the-art LLMs, including DeepSeek-Distill-Qwen-1.5B and the openPangu-Embedded series (1B and 7B), across challenging mathematical reasoning benchmarks. Experimental results show that $R^2VPO$ consistently achieves superior asymptotic performance, with average relative gains of up to 17% over strong clipping-based baselines, while requiring approximately 50% fewer rollouts to reach convergence. These findings establish ratio-variance control as a promising direction for improving both stability and data efficiency in RL-based LLM alignment.
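
To make the contrast between hard clipping and a ratio-variance constraint concrete, here is a minimal sketch. It places the standard PPO clipped surrogate next to a hypothetical variance-penalized surrogate with a dual-ascent step on the multiplier. The penalty form, the variance budget `delta`, the step size `lr`, and every function name are assumptions for illustration; the abstract does not specify the exact $R^2VPO$ objective.

```python
import math

def policy_ratios(logp_new, logp_old):
    """Importance ratios pi_new(a|s) / pi_old(a|s) from log-probabilities."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def clipped_surrogate(ratios, advantages, eps=0.2):
    """Standard PPO clipped objective (to be maximized)."""
    total = 0.0
    for r, a in zip(ratios, advantages):
        clipped = min(max(r, 1.0 - eps), 1.0 + eps)
        total += min(r * a, clipped * a)
    return total / len(ratios)

def ratio_variance(ratios):
    """Second central moment of the ratios over the batch."""
    mean = sum(ratios) / len(ratios)
    return sum((r - mean) ** 2 for r in ratios) / len(ratios)

def variance_regularized_surrogate(ratios, advantages, lam):
    """Unclipped policy-gradient term minus lam * Var(ratio) (assumed form)."""
    pg = sum(r * a for r, a in zip(ratios, advantages)) / len(ratios)
    return pg - lam * ratio_variance(ratios)

def dual_update(lam, ratios, delta=0.05, lr=0.1):
    """Dual ascent: raise lam when the variance exceeds the budget delta."""
    return max(0.0, lam + lr * (ratio_variance(ratios) - delta))
```

In the clipped objective a sample with a large ratio and positive advantage contributes no gradient at all, whereas the variance penalty leaves its gradient intact and only discourages the batch of ratios from spreading too far.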

Aligning Findings with Diagnosis: A Self-Consistent Reinforcement Learning Framework for Trustworthy Radiology Reporting

Authors: Kun Zhao, Siyuan Dai, Pan Wang, Jifeng Song, Hui Ji, Chenghua Lin, Liang Zhan, Haoteng Tang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.03321
Pdf link: https://arxiv.org/pdf/2601.03321
Abstract Multimodal Large Language Models (MLLMs) have shown strong potential for radiology report generation, yet their clinical translation is hindered by architectural heterogeneity and the prevalence of factual hallucinations. Standard supervised fine-tuning often fails to strictly align linguistic outputs with visual evidence, while existing reinforcement learning approaches struggle with either prohibitive computational costs or limited exploration. To address these challenges, we propose a comprehensive framework for self-consistent radiology report generation. First, we conduct a systematic evaluation to identify optimal vision encoder and LLM backbone configurations for medical imaging. Building on this foundation, we introduce a novel "Reason-then-Summarize" architecture optimized via Group Relative Policy Optimization (GRPO). This framework restructures generation into two distinct components: a think block for detailed findings and an answer block for structured disease labels. By utilizing a multi-dimensional composite reward function, we explicitly penalize logical discrepancies between the generated narrative and the final diagnosis. Extensive experiments on the MIMIC-CXR benchmark demonstrate that our method achieves state-of-the-art performance in clinical efficacy metrics and significantly reduces hallucinations compared to strong supervised baselines.
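
The idea of penalizing disagreement between the think block and the answer block can be sketched as a toy reward function. Everything specific here is an assumption made for illustration: the `<think>`/`<answer>` tag syntax, the comma-separated label format, the weights, and the naive substring check (which ignores negation such as "no effusion"). The paper's actual multi-dimensional reward is not described in the abstract.

```python
import re

def parse_report(text):
    """Split a generation into (findings, labels); None if the format is wrong."""
    pattern = r"\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*"
    m = re.fullmatch(pattern, text, re.S)
    if m is None:
        return None
    findings = m.group(1).lower()
    labels = {x.strip().lower() for x in m.group(2).split(",") if x.strip()}
    return findings, labels

def composite_reward(text, gold_labels, w_acc=1.0, w_cons=0.5):
    """Label accuracy minus a penalty for labels unsupported by the findings."""
    parsed = parse_report(text)
    if parsed is None:
        return -1.0  # malformed output receives a fixed format penalty
    findings, labels = parsed
    gold = {g.lower() for g in gold_labels}
    union = labels | gold
    accuracy = len(labels & gold) / len(union) if union else 1.0
    # A predicted label never mentioned in the narrative counts as inconsistent.
    unsupported = sum(1 for label in labels if label not in findings)
    penalty = unsupported / len(labels) if labels else 0.0
    return w_acc * accuracy - w_cons * penalty
```

A report that names the correct disease but whose findings never mention it scores lower than one whose narrative supports the label, which is the self-consistency pressure the abstract describes.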

Exploration Through Introspection: A Self-Aware Reward Model

Authors: Michael Petrowski, Milica Gašić
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.03389
Pdf link: https://arxiv.org/pdf/2601.03389
Abstract Understanding how artificial agents model internal mental states is central to advancing Theory of Mind in AI. Evidence points to a unified system for self- and other-awareness. We explore this self-awareness by having reinforcement learning agents infer their own internal states in gridworld environments. Specifically, we introduce an introspective exploration component that is inspired by biological pain as a learning signal by utilizing a hidden Markov model to infer "pain-belief" from online observations. This signal is integrated into a subjective reward function to study how self-awareness affects the agent's learning abilities. Further, we use this computational framework to investigate the difference in performance between normal and chronic pain perception models. Results show that introspective agents in general significantly outperform standard baseline agents and can replicate complex human-like behaviors.
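
Inferring a "pain-belief" from online observations with a hidden Markov model amounts to running a forward filter. The sketch below shows one filtering step and one possible way to fold the belief into a subjective reward. The two-state layout (index 1 meaning "in pain"), the linear penalty, and the parameter `pain_weight` are assumptions, not the authors' formulation.

```python
def belief_update(belief, observation, transition, emission):
    """One HMM forward-filter step.

    belief[i]        : P(state = i | observations so far)
    transition[i][j] : P(next state = j | state = i)
    emission[j][o]   : P(observation = o | state = j)
    """
    n = len(belief)
    # Predict: push the current belief through the transition model.
    predicted = [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]
    # Correct: weight by the likelihood of the new observation, then normalize.
    unnorm = [predicted[j] * emission[j][observation] for j in range(n)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def subjective_reward(env_reward, belief, pain_weight=1.0):
    """Environment reward reduced in proportion to the belief of being in pain."""
    return env_reward - pain_weight * belief[1]
```

A "chronic pain" variant could be modeled, for example, by a transition matrix that rarely leaves the pain state, though the abstract does not say how the authors parameterize it.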

Sensor to Pixels: Decentralized Swarm Gathering via Image-Based Reinforcement Learning

Authors: Yigal Koifman, Eran Iceland, Erez Koifman, Ariel Barel, Alfred M. Bruckstein
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2601.03413
Pdf link: https://arxiv.org/pdf/2601.03413
Abstract This study highlights the potential of image-based reinforcement learning methods for addressing swarm-related tasks. In multi-agent reinforcement learning, effective policy learning depends on how agents sense, interpret, and process inputs. Traditional approaches often rely on handcrafted feature extraction or raw vector-based representations, which limit the scalability and efficiency of learned policies concerning input order and size. In this work we propose an image-based reinforcement learning method for decentralized control of a multi-agent system, where observations are encoded as structured visual inputs that can be processed by Neural Networks, extracting its spatial features and producing novel decentralized motion control rules. We evaluate our approach on a multi-agent convergence task of agents with limited-range and bearing-only sensing that aim to keep the swarm cohesive during the aggregation. The algorithm's performance is evaluated against two benchmarks: an analytical solution proposed by Bellaiche and Bruckstein, which ensures convergence but progresses slowly, and VariAntNet, a neural network-based framework that converges much faster but shows medium success rates in hard constellations. Our method achieves high convergence, with a pace nearly matching that of VariAntNet. In some scenarios, it serves as the only practical alternative.
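
One reason an image encoding helps with input order and size is that rasterizing neighbors onto a grid yields the same array regardless of how many neighbors there are or in what order they are listed. The sketch below illustrates this for bearing-only sensing by marking each neighbor on a fixed-radius circle around the agent. The grid size, the circular layout, and the function name are assumptions; the paper's actual encoding may differ.

```python
import math

def bearings_to_image(bearings, size=9):
    """Rasterize neighbor bearings (radians) onto a size x size grid.

    The agent sits at the center cell. Because range is unknown under
    bearing-only sensing, every neighbor is drawn on the same circle.
    """
    img = [[0.0] * size for _ in range(size)]
    center = size // 2
    radius = center
    for theta in bearings:
        col = center + int(round(radius * math.cos(theta)))
        row = center - int(round(radius * math.sin(theta)))  # row 0 is "up"
        img[row][col] += 1.0
    return img
```

The resulting fixed-shape array can be fed to a convolutional network without padding or sorting the neighbor list.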

FIRE-VLM: A Vision-Language-Driven Reinforcement Learning Framework for UAV Wildfire Tracking in a Physics-Grounded Fire Digital Twin

Authors: Chris Webb, Mobin Habibpour, Mayamin Hamid Raha, Ali Reza Tavakkoli, Janice Coen, Fatemeh Afghah
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.03449
Pdf link: https://arxiv.org/pdf/2601.03449
Abstract Wildfire monitoring demands autonomous systems capable of reasoning under extreme visual degradation, rapidly evolving physical dynamics, and scarce real-world training data. Existing UAV navigation approaches rely on simplified simulators and supervised perception pipelines, and lack embodied agents interacting with physically realistic fire environments. We introduce FIRE-VLM, the first end-to-end vision-language model (VLM) guided reinforcement learning (RL) framework trained entirely within a high-fidelity, physics-grounded wildfire digital twin. Built from USGS Digital Elevation Model (DEM) terrain, LANDFIRE fuel inventories, and semi-physical fire-spread solvers, this twin captures terrain-induced runs, wind-driven acceleration, smoke plume occlusion, and dynamic fuel consumption. Within this environment, a PPO agent with dual-view UAV sensing is guided by a CLIP-style VLM. Wildfire-specific semantic alignment scores, derived from a single prompt describing active fire and smoke plumes, are integrated as potential-based reward shaping signals. Our contributions are: (1) a GIS-to-simulation pipeline for constructing wildfire digital twins; (2) a VLM-guided RL agent for UAV firefront tracking; and (3) a wildfire-aware reward design that combines physical terms with VLM semantics. Across five digital-twin evaluation tasks, our VLM-guided policy reduces time-to-detection by up to 6 times, increases time-in-FOV, and is, to our knowledge, the first RL-based UAV wildfire monitoring system demonstrated in kilometer-scale, physics-grounded digital-twin fires.
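
Potential-based reward shaping adds the term gamma * Phi(s') - Phi(s) to the environment reward, which is known to leave the optimal policy unchanged. A minimal sketch follows, assuming the potential Phi is a CLIP-style cosine similarity between an image embedding and a fixed prompt embedding. The embeddings, the `scale` factor, and the function names are placeholders, not the FIRE-VLM code.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def shaped_reward(env_reward, phi_prev, phi_next, gamma=0.99, scale=1.0):
    """Environment reward plus the shaping term scale * (gamma * Phi(s') - Phi(s))."""
    return env_reward + scale * (gamma * phi_next - phi_prev)
```

In use, `phi_prev` and `phi_next` would be the similarity scores of consecutive camera frames against the single wildfire prompt, so the agent is rewarded for moving toward views that look more like active fire and smoke.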

ThinkRL-Edit: Thinking in Reinforcement Learning for Reasoning-Centric Image Editing

Authors: Hengjia Li, Liming Jiang, Qing Yan, Yizhi Song, Hao Kang, Zichuan Liu, Xin Lu, Boxi Wu, Deng Cai
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.03467
Pdf link: https://arxiv.org/pdf/2601.03467
Abstract Instruction-driven image editing with unified multimodal generative models has advanced rapidly, yet their underlying visual reasoning remains limited, leading to suboptimal performance on reasoning-centric edits. Reinforcement learning (RL) has been investigated for improving the quality of image editing, but it faces three key challenges: (1) limited reasoning exploration confined to denoising stochasticity, (2) biased reward fusion, and (3) unstable VLM-based instruction rewards. In this work, we propose ThinkRL-Edit, a reasoning-centric RL framework that decouples visual reasoning from image synthesis and expands reasoning exploration beyond denoising. To this end, we introduce Chain-of-Thought (CoT)-based reasoning sampling with planning and reflection stages prior to generation in online sampling, compelling the model to explore multiple semantic hypotheses and validate their plausibility before committing to a visual outcome. To avoid the failures of weighted aggregation, we propose an unbiased chain preference grouping strategy across multiple reward dimensions. Moreover, we replace interval-based VLM scores with a binary checklist, yielding more precise, lower-variance, and interpretable rewards for complex reasoning. Experiments show our method significantly outperforms prior work on reasoning-centric image editing, producing instruction-faithful, visually coherent, and semantically grounded edits.

Understanding Reward Hacking in Text-to-Image Reinforcement Learning

Authors: Yunqi Hong, Kuei-Chun Kao, Hengguang Zhou, Cho-Jui Hsieh
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.03468
Pdf link: https://arxiv.org/pdf/2601.03468
Abstract Reinforcement learning (RL) has become a standard approach for post-training large language models and, more recently, for improving image generation models, which uses reward functions to enhance generation quality and human preference alignment. However, existing reward designs are often imperfect proxies for true human judgment, making models prone to reward hacking--producing unrealistic or low-quality images that nevertheless achieve high reward scores. In this work, we systematically analyze reward hacking behaviors in text-to-image (T2I) RL post-training. We investigate how both aesthetic/human preference rewards and prompt-image consistency rewards individually contribute to reward hacking and further show that ensembling multiple rewards can only partially mitigate this issue. Across diverse reward models, we identify a common failure mode: the generation of artifact-prone images. To address this, we propose a lightweight and adaptive artifact reward model, trained on a small curated dataset of artifact-free and artifact-containing samples. This model can be integrated into existing RL pipelines as an effective regularizer for commonly used reward models. Experiments demonstrate that incorporating our artifact reward significantly improves visual realism and reduces reward hacking across multiple T2I RL setups, demonstrating the effectiveness of lightweight reward augment serving as a safeguard against reward hacking.

Adaptive Model-Based Reinforcement Learning for Orbit Feedback Control in NSLS-II Storage Ring

Authors: Zeyu Dong, Yuke Tian, Yu Sun
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2601.03486
Pdf link: https://arxiv.org/pdf/2601.03486
Abstract The National Synchrotron Light Source II (NSLS-II) uses a highly stable electron beam to produce high-quality X-ray beams with high brightness and low-emittance synchrotron radiation. The traditional algorithm to stabilize the beam applies singular value decomposition (SVD) on the orbit response matrix to remove noise and extract actions. Supervised learning has been studied on NSLS-II storage ring stabilization and other accelerator facilities recently. Several problems, for example, machine status drifting, environment noise, and non-linear accelerator dynamics, remain unresolved in the SVD-based and supervised learning algorithms. To address these problems, we propose an adaptive training framework based on model-based reinforcement learning. This framework consists of two types of optimizations: trajectory optimization attempts to minimize the expected total reward in a differentiable environment, and online model optimization learns non-linear machine dynamics through the agent-environment interaction. Through online training, this framework tracks the internal status drifting in the electron beam ring. Simulation and real in-facility experiments on NSLS-II reveal that our method stabilizes the beam position and minimizes the alignment error, defined as the root mean square (RMS) error between adjusted beam positions and the reference position, down to ~1 μm.

Semantic Belief-State World Model for 3D Human Motion Prediction

Authors: Sarim Chaudhry
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.03517
Pdf link: https://arxiv.org/pdf/2601.03517
Abstract Human motion prediction has traditionally been framed as a sequence regression problem where models extrapolate future joint coordinates from observed pose histories. While effective over short horizons, this approach does not separate observation reconstruction from dynamics modeling and offers no explicit representation of the latent causes governing motion. As a result, existing methods exhibit compounding drift, mean-pose collapse, and poorly calibrated uncertainty when rolled forward beyond the training regime. Here we propose a Semantic Belief-State World Model (SBWM) that reframes human motion prediction as latent dynamical simulation on the human body manifold. Rather than predicting poses directly, SBWM maintains a recurrent probabilistic belief state whose evolution is learned independently of pose reconstruction and explicitly aligned with the SMPL-X anatomical parameterization. This alignment imposes a structural information bottleneck that prevents the latent state from encoding static geometry or sensor noise, forcing it to capture motion dynamics, intent, and control-relevant structure. Inspired by belief-state world models developed for model-based reinforcement learning, SBWM adapts stochastic latent transitions and rollout-centric training to the domain of human motion. In contrast to RSSM-based, transformer, and diffusion approaches optimized for reconstruction fidelity, SBWM prioritizes stable forward simulation. We demonstrate coherent long-horizon rollouts and competitive accuracy at substantially lower computational cost. These results suggest that treating the human body as part of the world model's state space rather than its output fundamentally changes how motion is simulated and predicted.

VeRPO: Verifiable Dense Reward Policy Optimization for Code Generation

Authors: Longwen Wang, Xuan'er Wu, Xiaohui Hu, Yirui Liu, Yuankai Fan, Kaidong Yu, Qizhen Weng, Wei Xi, Xuelong Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.03525
Pdf link: https://arxiv.org/pdf/2601.03525
Abstract Effective reward design is a central challenge in Reinforcement Learning (RL) for code generation. Mainstream pass/fail outcome rewards enforce functional correctness via executing unit tests, but the resulting sparsity limits potential performance gains. While recent work has explored external Reward Models (RM) to generate richer, continuous rewards, the learned RMs suffer from reward misalignment and prohibitive computational cost. In this paper, we introduce VeRPO (Verifiable Dense Reward Policy Optimization), a novel RL framework for code generation that synthesizes robust and dense rewards fully grounded in verifiable execution feedback. The core idea of VeRPO is constructing dense rewards from weighted partial success: by dynamically estimating the difficulty weight of each unit test based on the execution statistics during training, a dense reward is derived from the sum of weights of the passed unit tests. To solidify the consistency between partial success and end-to-end functional correctness, VeRPO further integrates the dense signal with global execution outcomes, establishing a robust and dense reward paradigm relying solely on verifiable execution feedback. Extensive experiments across diverse benchmarks and settings demonstrate that VeRPO consistently outperforms outcome-driven and RM-based baselines, achieving up to +8.83% gain in pass@1 with negligible time cost (< 0.02%) and zero GPU memory overhead.
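
The "weighted partial success" idea can be sketched as follows: track how often each unit test passes during training, treat rarely passed tests as harder, and reward a solution with the summed weights of the tests it passes plus a bonus when every test passes. The Laplace-smoothed failure rate used as difficulty, the additive `outcome_bonus`, and the class name are assumptions; the abstract does not give VeRPO's exact weighting or combination rule.

```python
class DenseTestReward:
    """Dense reward from difficulty-weighted unit-test results (illustrative)."""

    def __init__(self, num_tests, outcome_bonus=1.0):
        self.attempts = [0] * num_tests
        self.passes = [0] * num_tests
        self.outcome_bonus = outcome_bonus

    def update(self, results):
        """Record one rollout's pass/fail outcome for every test."""
        for i, ok in enumerate(results):
            self.attempts[i] += 1
            self.passes[i] += int(ok)

    def weights(self):
        """Normalized difficulty: smoothed failure rate of each test."""
        raw = [1.0 - (p + 1) / (a + 2) for p, a in zip(self.passes, self.attempts)]
        total = sum(raw)
        return [r / total for r in raw]

    def reward(self, results):
        """Sum of weights of passed tests, plus a bonus if all tests pass."""
        w = self.weights()
        dense = sum(wi for wi, ok in zip(w, results) if ok)
        return dense + (self.outcome_bonus if all(results) else 0.0)
```

With no statistics yet, all tests weigh the same; as training proceeds, passing a test that most rollouts fail earns more than passing an easy one, while the bonus keeps full correctness strictly preferable to any partial score.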

SCRIBE: Structured Mid-Level Supervision for Tool-Using Language Models

Authors: Yuxuan Jiang, Francis Ferraro
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.03555
Pdf link: https://arxiv.org/pdf/2601.03555
Abstract Training reliable tool-augmented agents remains a significant challenge, largely due to the difficulty of credit assignment in multi-step reasoning. While process-level reward models offer a promising direction, existing LLM-based judges often produce noisy and inconsistent signals because they lack fine-grained, task-specific rubrics to distinguish high-level planning from low-level execution. In this work, we introduce SCRIBE (Skill-Conditioned Reward with Intermediate Behavioral Evaluation), a reinforcement learning framework that intervenes at a novel mid-level abstraction. SCRIBE grounds reward modeling in a curated library of skill prototypes, transforming open-ended LLM evaluation into a constrained verification problem. By routing each subgoal to a corresponding prototype, the reward model is equipped with precise, structured rubrics that substantially reduce reward variance. Experimental results show that SCRIBE achieves state-of-the-art performance across a range of reasoning and tool-use benchmarks. In particular, it improves the AIME25 accuracy of a Qwen3-4B model from 43.3% to 63.3%, and significantly increases success rates in complex multi-turn tool interactions. Further analysis of training dynamics reveals a co-evolution across abstraction levels, where mastery of mid-level skills consistently precedes the emergence of effective high-level planning behaviors. Finally, we demonstrate that SCRIBE is additive to low-level tool optimizations, providing a scalable and complementary pathway toward more autonomous and reliable tool-using agents.

From Score to Sound: An End-to-End MIDI-to-Motion Pipeline for Robotic Cello Performance

Authors: Samantha Sudhoff, Pranesh Velmurugan, Jiashu Liu, Vincent Zhao, Yung-Hsiang Lu, Kristen Yeon-Ji Yun
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.03562
Pdf link: https://arxiv.org/pdf/2601.03562
Abstract Robot musicians require precise control to obtain proper note accuracy, sound quality, and musical expression. Performance of string instruments, such as violin and cello, presents a significant challenge due to the precise control required over bow angle and pressure to produce the desired sound. While prior robotic cellists focus on accurate bowing trajectories, these works often rely on expensive motion capture techniques, and fail to sightread music in a human-like way. We propose a novel end-to-end MIDI score to robotic motion pipeline which converts musical input directly into collision-aware bowing motions for a UR5e robot cellist. Through use of Universal Robot Freedrive feature, our robotic musician can achieve human-like sound without the need for motion capture. Additionally, this work records live joint data via Real-Time Data Exchange (RTDE) as the robot plays, providing labeled robotic playing data from a collection of five standard pieces to the research community. To demonstrate the effectiveness of our method in comparison to human performers, we introduce the Musical Turing Test, in which a collection of 132 human participants evaluate our robot's performance against a human baseline. Human reference recordings are also released, enabling direct comparison for future studies. This evaluation technique establishes the first benchmark for robotic cello performance. Finally, we outline a residual reinforcement learning methodology to improve upon baseline robotic controls, highlighting future opportunities for improved string-crossing efficiency and sound quality.
中文摘要 机器人音乐家需要精确的控制，以获得正确的音符准确度、音质和音乐表现力。演奏弦乐器，如小提琴和大提琴，是一项重大挑战，因为需要精确控制弓的角度和压力，才能发出理想的音色。以往的机器人大提琴手专注于准确的运弓轨迹，但这些工作往往依赖昂贵的动作捕捉技术，且无法以类人的方式视奏乐谱。我们提出了一种全新的端到端流程，从MIDI乐谱到机器人运动，将音乐输入直接转换为具有碰撞感知的运弓动作，适用于UR5e机器人大提琴手。通过使用 Universal Robots 的 Freedrive 功能，我们的机器人音乐家无需动作捕捉就能实现类人的音色。此外，该工作通过实时数据交换（RTDE）记录机器人演奏时的实时关节数据，向研究界提供来自五首标准曲目的带标注机器人演奏数据。为了展示我们的方法相对于人类演奏者的有效性，我们引入了音乐图灵测试，由132名人类参与者将机器人的演奏与人类基线进行对比评估。同时还发布了人类参考录音，便于未来研究的直接对比。该评估方法确立了机器人大提琴演奏的首个基准。最后，我们概述了一种残差强化学习方法，以改进机器人基线控制，并指出未来在提升换弦效率和音质方面的机会。

Interleaved Tool-Call Reasoning for Protein Function Understanding

交错工具调用推理以理解蛋白质功能

Authors: Chuanliu Fan, Zicheng Ma, Huanran Meng, Aijia Zhang, Wenjie Du, Jun Zhang, Yi Qin Gao, Ziqiang Cao, Guohong Fu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.03604
Pdf link: https://arxiv.org/pdf/2601.03604
Abstract Recent advances in large language models (LLMs) have highlighted the effectiveness of chain-of-thought reasoning in symbolic domains such as mathematics and programming. However, our study shows that directly transferring such text-based reasoning paradigms to protein function understanding is ineffective: reinforcement learning mainly amplifies superficial keyword patterns while failing to introduce new biological knowledge, resulting in limited generalization. We argue that protein function prediction is a knowledge-intensive scientific task that fundamentally relies on external biological priors and computational tools rather than purely internal reasoning. To address this gap, we propose PFUA, a tool-augmented protein reasoning agent that unifies problem decomposition, tool invocation, and grounded answer generation. Instead of relying on long unconstrained reasoning traces, PFUA integrates domain-specific tools to produce verifiable intermediate evidence. Experiments on four benchmarks demonstrate that PFUA consistently outperforms text-only reasoning models with an average performance improvement of 103%.
中文摘要 大型语言模型（LLM）的最新进展凸显了思维链推理在数学和编程等符号领域的有效性。然而，我们的研究表明，直接将此类基于文本的推理范式转化为蛋白质功能理解是无效的：强化学习主要放大表面的关键词模式，未能引入新的生物学知识，导致泛化有限。我们认为蛋白质功能预测是一项知识密集型的科学任务，根本依赖外部生物先验和计算工具，而非纯粹的内部推理。为弥补这一空白，我们提出了PFUA，一种工具增强的蛋白质推理代理，统一了问题分解、工具调用和基于基础的答案生成。PFUA不再依赖冗长且不受约束的推理痕迹，而是集成了领域特定的工具，以产生可验证的中间证据。四个基准测试的实验表明，PFUA持续优于纯文本推理模型，平均性能提升达103%。

Locomotion Beyond Feet

脚之外的运动

Authors: Tae Hoon Yang, Haochen Shi, Jiacheng Hu, Zhicong Zhang, Daniel Jiang, Weizhuo Wang, Yao He, Zhen Wu, Yuming Chen, Yifan Hou, Monroe Kennedy III, Shuran Song, C. Karen Liu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.03607
Pdf link: https://arxiv.org/pdf/2601.03607
Abstract Most locomotion methods for humanoid robots focus on leg-based gaits, yet natural bipeds frequently rely on hands, knees, and elbows to establish additional contacts for stability and support in complex environments. This paper introduces Locomotion Beyond Feet, a comprehensive system for whole-body humanoid locomotion across extremely challenging terrains, including low-clearance spaces under chairs, knee-high walls, knee-high platforms, and steep ascending and descending stairs. Our approach addresses two key challenges: contact-rich motion planning and generalization across diverse terrains. To this end, we combine physics-grounded keyframe animation with reinforcement learning. Keyframes encode human knowledge of motor skills, are embodiment-specific, and can be readily validated in simulation or on hardware, while reinforcement learning transforms these references into robust, physically accurate motions. We further employ a hierarchical framework consisting of terrain-specific motion-tracking policies, failure recovery mechanisms, and a vision-based skill planner. Real-world experiments demonstrate that Locomotion Beyond Feet achieves robust whole-body locomotion and generalizes across obstacle sizes, obstacle instances, and terrain sequences.
中文摘要 大多数人形机器人的运动方法侧重于腿部步态，但自然界的双足动物常依赖手、膝和肘来建立额外的接触，以在复杂环境中获得稳定性和支撑。本文介绍了 Locomotion Beyond Feet，这是一个全面的系统，用于人形机器人的全身运动，可穿越极具挑战性的地形，包括椅子下方的低净空空间、膝盖高度的矮墙、膝盖高度的平台以及陡峭的上下楼梯。我们的方法解决了两个关键挑战：接触丰富的运动规划和跨多样地形的泛化。为此，我们将基于物理的关键帧动画与强化学习结合起来。关键帧编码了人类对运动技能的知识，针对特定的机器人本体，并且可以在仿真或硬件上方便地验证，而强化学习则将这些参考转化为稳健且物理准确的动作。我们还采用了层级框架，包括针对特定地形的运动跟踪策略、故障恢复机制以及基于视觉的技能规划器。真实世界实验表明，Locomotion Beyond Feet 实现了稳健的全身运动，并能泛化到不同的障碍物尺寸、障碍物实例和地形序列。

Shielded RecRL: Explanation Generation for Recommender Systems without Ranking Degradation

屏蔽RecRL：推荐系统解释生成，避免排名降级

Authors: Ansh Tiwari, Ayush Chauhan
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.03608
Pdf link: https://arxiv.org/pdf/2601.03608
Abstract We introduce Shielded RecRL, a reinforcement learning approach to generate personalized explanations for recommender systems without sacrificing the system's original ranking performance. Unlike prior RLHF-based recommender methods that directly optimize item rankings, our two-tower architecture keeps the recommender's ranking model intact while a language model learns to produce helpful explanations. We design a composite reward signal combining explanation length, content relevance, and coherence, and apply proximal policy optimization (PPO) with a KL-divergence constraint to fine-tune a large language model with only 0.4% of its parameters trainable via LoRA adapters. In experiments on an Amazon Books dataset (approximately 50K interactions in the fantasy and romance genres), Shielded RecRL improved the relative click-through rate (CTR) by 22.5% (1.225x over baseline) while keeping the recommender's item-ranking behavior virtually unchanged. An extensive ablation study confirms that our gradient shielding strategy and reward design effectively balance explanation quality and policy drift. Our results demonstrate that Shielded RecRL enhances user-facing aspects of recommendations through rich, personalized explanations without degrading core recommendation accuracy.
中文摘要 我们引入了Shielded RecRL，一种强化学习方法，旨在为推荐系统生成个性化解释，同时不牺牲系统原有的排序性能。与以往基于RLHF、直接优化物品排序的推荐方法不同，我们的双塔架构保持推荐系统的排序模型不变，同时由语言模型学习生成有用的解释。我们设计了一种结合解释长度、内容相关性和连贯性的复合奖励信号，并采用带KL散度约束的近端策略优化（PPO）来微调一个大型语言模型，其中仅有0.4%的参数通过LoRA适配器参与训练。在亚马逊图书数据集上的实验（奇幻和言情类型约5万次交互）中，Shielded RecRL将相对点击率（CTR）提升了22.5%（为基线的1.225倍），同时几乎完全保持了推荐系统的物品排序行为。一项广泛的消融研究证实，我们的梯度屏蔽策略和奖励设计有效地平衡了解释质量和策略漂移。我们的结果表明，Shielded RecRL通过丰富且个性化的解释，增强了推荐中面向用户的部分，同时不降低核心推荐的准确性。

AMIR-GRPO: Inducing Implicit Preference Signals into GRPO

AMIR-GRPO：将隐性偏好信号引入GRPO

Authors: Amir Hossein Yari, Fajri Koto
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.03661
Pdf link: https://arxiv.org/pdf/2601.03661
Abstract Reinforcement learning has become the primary paradigm for aligning large language models (LLMs) on complex reasoning tasks, with group relative policy optimization (GRPO) widely used in large-scale post-training. However, GRPO faces structural limitations in reasoning-heavy settings: sequence-level advantage normalization introduces systematic length bias, penalties for low-quality trajectories are diluted, and the scalar objective discards rich pairwise preference information embedded in within-group reward rankings. As a result, valuable supervision from costly rollouts remains underutilized. We propose AMIR-GRPO, which augments GRPO with an implicit DPO-style contrastive regularizer constructed directly from intra-group reward rankings, requiring no additional annotations. This mechanism amplifies suppression of low-reward trajectories, attenuates response-level length bias, and transforms each rollout group into a denser set of supervision constraints. Across multiple mathematical reasoning benchmarks, AMIR-GRPO consistently outperforms strong GRPO baselines, yields clearer separation between correct and incorrect reasoning chains, and delivers broader coverage gains beyond the subset of instances solved by standard GRPO.
中文摘要 强化学习已成为大型语言模型（LLMs）在复杂推理任务上对齐的主要范式，群体相对策略优化（GRPO）在大规模后训练中被广泛应用。然而，GRPO在推理密集的场景中面临结构性局限：序列级优势归一化引入系统性的长度偏差，对低质量轨迹的惩罚被稀释，标量目标则丢弃了组内奖励排名中蕴含的丰富成对偏好信息。因此，昂贵的 rollout 所带来的宝贵监督信号仍未被充分利用。我们提出AMIR-GRPO，它通过直接基于组内奖励排名构建的隐式DPO式对比正则项来增强GRPO，无需额外标注。该机制加强了对低奖励轨迹的抑制，减弱了响应级的长度偏差，并将每个 rollout 组转变为更密集的监督约束集合。在多个数学推理基准上，AMIR-GRPO持续优于强GRPO基线，更清晰地区分正确与错误的推理链，并在标准GRPO已能解决的实例子集之外带来更广的覆盖提升。
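
The mechanism described in the AMIR-GRPO abstract can be illustrated with a minimal sketch. The abstract does not give the exact loss, so this assumes a standard DPO-style logistic term over every within-group pair with different rewards, alongside the usual GRPO standardized advantage. The function names, the `beta` value, and the use of sequence-level log-probabilities are illustrative assumptions, not the authors' implementation.

```python
import math
from itertools import combinations


def group_advantages(rewards):
    """GRPO-style advantages: standardize rewards within one rollout group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + 1e-8) for r in rewards]


def pairwise_preference_loss(rewards, policy_logps, ref_logps, beta=0.1):
    """Assumed DPO-style contrastive term built from intra-group reward rankings.

    For every pair where one rollout has a strictly higher reward, treat it as
    "preferred" and penalize the policy when its implicit reward margin
    (policy log-prob minus reference log-prob) does not favor that rollout.
    """
    losses = []
    for i, j in combinations(range(len(rewards)), 2):
        if rewards[i] == rewards[j]:
            continue  # ties carry no preference signal
        win, lose = (i, j) if rewards[i] > rewards[j] else (j, i)
        margin = beta * ((policy_logps[win] - ref_logps[win])
                         - (policy_logps[lose] - ref_logps[lose]))
        # -log(sigmoid(margin)), guarded against overflow for very negative margins
        losses.append(math.log1p(math.exp(-margin)) if margin > -30 else -margin)
    return sum(losses) / len(losses) if losses else 0.0
```

A group of four rollouts with two correct answers yields four usable preference pairs, whereas the scalar GRPO objective only sees four standardized advantages. This is one way to read the abstract's "denser set of supervision constraints".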

Sandwich Reasoning: An Answer-Reasoning-Answer Approach for Low-Latency Query Correction

Authors: Chen Zhang, Kepu Zhang, Jiatong Zhang, Xiao Zhang, Jun Xu
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.03672
Pdf link: https://arxiv.org/pdf/2601.03672
Abstract Query correction is a critical entry point in modern search pipelines, demanding high accuracy strictly within real-time latency constraints. Chain-of-Thought (CoT) reasoning improves accuracy but incurs prohibitive latency for real-time query correction. A potential solution is to output an answer before reasoning to reduce latency; however, under autoregressive decoding, the early answer is independent of subsequent reasoning, preventing the model from leveraging its reasoning capability to improve accuracy. To address this issue, we propose Sandwich Reasoning (SandwichR), a novel approach that explicitly aligns a fast initial answer with post-hoc reasoning, enabling low-latency query correction without sacrificing reasoning-aware accuracy. SandwichR follows an Answer-Reasoning-Answer paradigm, producing an initial correction, an explicit reasoning process, and a final refined correction. To align the initial answer with post-reasoning insights, we design a consistency-aware reinforcement learning (RL) strategy: a dedicated consistency reward enforces alignment between the initial and final corrections, while margin-based rejection sampling prioritizes borderline samples where reasoning drives the most impactful corrective gains. Additionally, we construct a high-quality query correction dataset, addressing the lack of specialized benchmarks for complex query correction. Experimental results demonstrate that SandwichR achieves SOTA accuracy comparable to standard CoT while delivering a 40-70% latency reduction, resolving the latency-accuracy trade-off in online search.
中文摘要 查询修正是现代搜索管道中的关键切入点，要求在实时延迟限制下保持高准确性。思维链（CoT）推理提高了准确性，但实时查询纠正会带来极高的延迟。一种潜在的解决方案是在推理前先输出答案以减少延迟;然而，在自回归译码下，早期答案独立于后续推理，阻碍模型利用推理能力提升准确性。为解决这一问题，我们提出了夹心推理（SandwichR），这是一种新颖的方法，明确将快速的初始答案与事后推理对齐，实现低延迟的查询纠正，同时不牺牲推理感知准确性。SandwichR 遵循答案-推理-答案范式，产生初始纠正、显式推理过程和最终精炼纠正。为了使初始答案与推理后的洞察相匹配，我们设计了一致性感知强化学习（RL）策略：专门的一致性奖励强化初始与最终修正之间的对齐，而基于margin的拒绝抽样则优先选择推理带来最大修正收益的边界样本。此外，我们构建了一个高质量的查询更正数据集，解决了复杂查询更正缺乏专门基准的问题。实验结果表明，SandwichR 实现了与标准 CoT 相当的 SOTA 准确性，同时实现了 40-70% 的延迟降低，解决了在线搜索中延迟与准确性的权衡。

Dual-Attention Heterogeneous GNN for Multi-robot Collaborative Area Search via Deep Reinforcement Learning

通过深度强化学习实现多机器人协作区域搜索的双注意力异构GNN

Authors: Lina Zhu, Jiyu Cheng, Yuehu Liu, Wei Zhang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.03686
Pdf link: https://arxiv.org/pdf/2601.03686
Abstract In multi-robot collaborative area search, a key challenge is to dynamically balance the two objectives of exploring unknown areas and covering specific targets to be rescued. Existing methods are often constrained by homogeneous graph representations, thus failing to model and balance these distinct tasks. To address this problem, we propose a Dual-Attention Heterogeneous Graph Neural Network (DA-HGNN) trained using deep reinforcement learning. Our method constructs a heterogeneous graph that incorporates three entity types: robot nodes, frontier nodes, and interesting nodes, as well as their historical states. The dual-attention mechanism comprises the relational-aware attention and type-aware attention operations. The relational-aware attention captures the complex spatio-temporal relationships among robots and candidate goals. Building on this relational-aware heterogeneous graph, the type-aware attention separately computes the relevance between robots and each goal type (frontiers vs. points of interest), thereby decoupling the exploration and coverage from the unified tasks. Extensive experiments conducted in interactive 3D scenarios within the iGibson simulator, leveraging the Gibson and MatterPort3D datasets, validate the superior scalability and generalization capability of the proposed approach.
中文摘要 在多机器人协作区域搜索中，一个关键挑战是动态平衡探索未知区域和覆盖特定待救援目标这两个目标。现有方法常受同质图表示的限制，因此无法建模和平衡这些不同任务。为解决这一问题，我们提出了一种通过深度强化学习训练的双注意力异构图神经网络（DA-HGNN）。我们的方法构建了一个异构图，包含三种实体类型：机器人节点、前沿节点和兴趣节点，以及它们的历史状态。双注意力机制包括关系感知注意力和类型感知注意力两种操作。关系感知注意力捕捉机器人与候选目标之间复杂的时空关系。基于这一关系感知异构图，类型感知注意力分别计算机器人与各目标类型（前沿与兴趣点）之间的相关性，从而将探索与覆盖从统一任务中解耦。在iGibson模拟器中，利用Gibson和MatterPort3D数据集在交互式三维场景中进行的大量实验验证了该方法卓越的可扩展性和泛化能力。

TreeAdv: Tree-Structured Advantage Redistribution for Group-Based RL

TreeAdv：基于群体的强化学习中的树状结构优势再分配

Authors: Lang Cao, Hui Ruan, Yongqian Li, Peng Chao, Wu Ning, Haonan Song, Renhong Chen, Yitong Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.03703
Pdf link: https://arxiv.org/pdf/2601.03703
Abstract Reinforcement learning with group-based objectives, such as Group Relative Policy Optimization (GRPO), is a common framework for aligning large language models on complex reasoning tasks. However, standard GRPO treats each rollout trajectory as an independent flat sequence and assigns a single sequence-level advantage to all tokens, which leads to sample inefficiency and a length bias toward verbose, redundant chains of thought without improving logical depth. We introduce TreeAdv (Tree-Structured Advantage Redistribution for Group-Based RL), which makes the tree structure of group rollouts explicit for both exploration and advantage assignment. Specifically, TreeAdv builds a group of trees (a forest) based on an entropy-driven sampling method where each tree branches at high-uncertainty decisions while sharing low-uncertainty tokens across rollouts. Then, TreeAdv aggregates token-level advantages for internal tree segments by redistributing the advantages of complete rollouts (all leaf nodes), and TreeAdv can easily apply to group-based objectives such as GRPO or GSPO. Across 10 math reasoning benchmarks, TreeAdv consistently outperforms GRPO and GSPO, while using substantially fewer generated tokens under identical supervision, data, and decoding budgets.
中文摘要 基于群体目标的强化学习，如群体相对策略优化（Group Relative Policy Optimization，GRPO），是大型语言模型在复杂推理任务上对齐的常见框架。然而，标准GRPO将每条 rollout 轨迹视为独立的平坦序列，并为所有 token 分配同一个序列级优势，这导致样本效率低下，并产生偏向冗长、重复思维链的长度偏差，而逻辑深度并未提升。我们引入了TreeAdv（面向基于群体的强化学习的树结构优势再分配），它在探索和优势分配两方面都显式利用了组内 rollout 的树状结构。具体来说，TreeAdv 基于熵驱动的采样方法构建一组树（森林），每棵树在高不确定性决策处分支，同时在不同 rollout 之间共享低不确定性的 token。然后，TreeAdv通过重新分配完整 rollout（所有叶节点）的优势，聚合得到内部树段的 token 级优势，并且可以方便地应用于GRPO或GSPO等基于群体的目标。在10个数学推理基准测试中，TreeAdv持续优于GRPO和GSPO，同时在相同的监督、数据和解码预算下使用的生成 token 显著更少。

R$^3$L: Reflect-then-Retry Reinforcement Learning with Language-Guided Exploration, Pivotal Credit, and Positive Amplification

R$^3$L：反思后重试强化学习，结合语言引导探索、关键信用和正向放大

Authors: Weijie Shi, Yanxi Chen, Zexi Li, Xuchen Pan, Yuchang Sun, Jiajie Xu, Xiaofang Zhou, Yaliang Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.03715
Pdf link: https://arxiv.org/pdf/2601.03715
Abstract Reinforcement learning drives recent advances in LLM reasoning and agentic capabilities, yet current approaches struggle with both exploration and exploitation. Exploration suffers from low success rates on difficult tasks and high costs of repeated rollouts from scratch. Exploitation suffers from coarse credit assignment and training instability: trajectory-level rewards penalize valid prefixes for later errors, and failure-dominated groups overwhelm the few positive signals, leaving optimization without constructive direction. To this end, we propose R$^3$L, Reflect-then-Retry Reinforcement Learning with Language-Guided Exploration, Pivotal Credit, and Positive Amplification. To synthesize high-quality trajectories, R$^3$L shifts from stochastic sampling to active synthesis via reflect-then-retry, leveraging language feedback to diagnose errors, transform failed attempts into successful ones, and reduce rollout costs by restarting from identified failure points. With errors diagnosed and localized, Pivotal Credit Assignment updates only the diverging suffix where contrastive signals exist, excluding the shared prefix from gradient update. Since failures dominate on difficult tasks and reflect-then-retry produces off-policy data, risking training instability, Positive Amplification upweights successful trajectories to ensure positive signals guide the optimization process. Experiments on agentic and reasoning tasks demonstrate 5% to 52% relative improvements over baselines while maintaining training stability. Our code is released at this https URL.
中文摘要 强化学习推动了大型语言模型推理和智能体能力的最新进展，但当前的方法在探索和利用方面都存在困难。探索方面，困难任务上的成功率低，且从零开始反复 rollout 的成本高昂。利用方面，存在粗糙的信用分配和训练不稳定性：轨迹级奖励会因后续错误而惩罚有效的前缀，失败主导的组会淹没少数正面信号，导致优化缺乏建设性方向。为此，我们提出了R$^3$L，即反思后重试强化学习，结合语言引导探索、关键信用和正向放大。为了合成高质量的轨迹，R$^3$L从随机采样转向通过反思后重试进行主动合成，利用语言反馈诊断错误，将失败尝试转化为成功尝试，并通过从识别出的失败点重新开始来降低 rollout 成本。在错误被诊断并定位后，关键信用分配仅更新存在对比信号的分歧后缀，将共享前缀排除在梯度更新之外。由于在困难任务中失败占主导地位，且反思后重试会产生离策略（off-policy）数据，存在训练不稳定的风险，正向放大会提高成功轨迹的权重，以确保正信号引导优化过程。在智能体任务和推理任务上的实验显示，相较基线有5%到52%的相对提升，同时保持训练稳定性。我们的代码以这个 https URL 发布。

ETR: Outcome-Guided Elastic Trust Regions for Policy Optimization

ETR：策略优化的成果导向弹性信任区域

Authors: Shijie Zhang, Kevin Zhang, Zheyuan Gu, Xiang Guo, Rujun Guo, Shaoyu Liu, Guanjun Jiang, Xiaozhao Wang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.03723
Pdf link: https://arxiv.org/pdf/2601.03723
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an important paradigm for unlocking reasoning capabilities in large language models, exemplified by the success of OpenAI o1 and DeepSeek-R1. Currently, Group Relative Policy Optimization (GRPO) stands as the dominant algorithm in this domain due to its stable training and critic-free efficiency. However, we argue that GRPO suffers from a structural limitation: it imposes a uniform, static trust region constraint across all samples. This design implicitly assumes signal homogeneity, a premise misaligned with the heterogeneous nature of outcome-driven learning, where advantage magnitudes and variances fluctuate significantly. Consequently, static constraints fail to fully exploit high-quality signals while insufficiently suppressing noise, often precipitating rapid entropy collapse. To address this, we propose Elastic Trust Regions (ETR), a dynamic mechanism that aligns optimization constraints with signal quality. ETR constructs a signal-aware landscape through dual-level elasticity: at the micro level, it scales clipping boundaries based on advantage magnitude to accelerate learning from high-confidence paths; at the macro level, it leverages group variance to implicitly allocate larger update budgets to tasks in the optimal learning zone. Extensive experiments on AIME and MATH benchmarks demonstrate that ETR consistently outperforms GRPO, achieving superior accuracy while effectively mitigating policy entropy degradation to ensure sustained exploration.
中文摘要 带可验证奖励的强化学习（RLVR）已成为解锁大型语言模型推理能力的重要范式，OpenAI o1和DeepSeek-R1的成功就是典型例子。目前，Group Relative Policy Optimization（GRPO）因其稳定的训练和无需 critic 的高效性，成为该领域的主导算法。然而，我们认为GRPO存在结构性限制：它对所有样本施加了统一且静态的信任区域约束。这种设计隐含假设信号同质性，这一前提与结果驱动学习的异质性质不符，因为优势幅度和方差会显著波动。因此，静态约束无法充分利用高质量信号，且对噪声抑制不足，常常导致熵快速坍缩。为此，我们提出了 Elastic Trust Regions（ETR），这是一种动态机制，将优化约束与信号质量对齐。ETR通过双层弹性构建信号感知的优化格局：在微观层面，它根据优势幅度缩放裁剪边界，以加速从高置信路径学习；在宏观层面，它利用组内方差，隐式地将更大的更新预算分配给处于最优学习区间的任务。AIME和MATH基准测试上的大量实验表明，ETR始终优于GRPO，在有效减缓策略熵退化的同时实现更高的准确率，从而确保持续探索。
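
To make the micro-level idea in the ETR abstract concrete, here is a minimal sketch of a clipped surrogate whose clip range grows with the advantage magnitude. The linear scaling rule and the `base_eps` and `stretch` parameters are hypothetical, chosen only for illustration, since the abstract does not state the actual formula. The macro-level use of group variance is not modeled here.

```python
def elastic_clip_range(advantage, max_abs_advantage, base_eps=0.2, stretch=0.5):
    """Hypothetical micro-level rule: widen the clip range as |advantage| grows."""
    if max_abs_advantage <= 0:
        return base_eps
    scale = abs(advantage) / max_abs_advantage  # in [0, 1]
    return base_eps * (1.0 + stretch * scale)


def clipped_surrogate(ratio, advantage, eps):
    """Standard PPO/GRPO clipped objective for one sample (to be maximized)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def elastic_objective(ratios, advantages, base_eps=0.2, stretch=0.5):
    """Mean clipped surrogate where each sample gets its own clip range."""
    max_abs = max(abs(a) for a in advantages)
    total = 0.0
    for ratio, adv in zip(ratios, advantages):
        eps = elastic_clip_range(adv, max_abs, base_eps, stretch)
        total += clipped_surrogate(ratio, adv, eps)
    return total / len(advantages)
```

With a static range, every sample's probability ratio is clipped at the same `1 ± base_eps`. In this sketch, a sample with the largest advantage in its batch may move up to `1 ± base_eps * (1 + stretch)` before clipping.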

EDCO: Dynamic Curriculum Orchestration for Domain-specific Large Language Model Fine-tuning

EDCO：用于领域特定大型语言模型微调的动态课程编排

Authors: Jing-Cheng Pang, Liu Sun, Chang Zhou, Xian Tang, Haichuan Ma, Kun Jiang, Jianlong Wang, Kai Zhang, Sijie Wu, Haoran Cai, Chenwei Wu, Xubin Li, Xin Chen
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.03725
Pdf link: https://arxiv.org/pdf/2601.03725
Abstract Domain-specific large language models (LLMs), typically developed by fine-tuning a pre-trained general-purpose LLM on specialized datasets, represent a significant advancement in applied AI. A common strategy in LLM fine-tuning is curriculum learning, which pre-orders training samples based on metrics like difficulty to improve learning efficiency compared to a random sampling strategy. However, most existing methods for LLM fine-tuning rely on a static curriculum, designed prior to training, which lacks adaptability to the model's evolving needs during fine-tuning. To address this, we propose EDCO, a novel framework based on two key concepts: inference entropy and dynamic curriculum orchestration. Inspired by recent findings that maintaining high answer entropy benefits long-term reasoning gains, EDCO prioritizes samples with high inference entropy in a continuously adapted curriculum. EDCO integrates three core components: an efficient entropy estimator that uses prefix tokens to approximate full-sequence entropy, an entropy-based curriculum generator that selects data points with the highest inference entropy, and an LLM trainer that optimizes the model on the selected curriculum. In comprehensive experiments in the communication, medicine, and law domains, EDCO outperforms traditional curriculum strategies for fine-tuning Qwen3-4B and Llama3.2-3B models under supervised and reinforcement learning settings. Furthermore, the proposed efficient entropy estimation reduces computational time by 83.5% while maintaining high accuracy.
中文摘要 领域专用大型语言模型（LLM）通常通过在专用数据集上对预训练的通用LLM进行微调而得到，代表了应用人工智能的重大进展。LLM微调中的一个常见策略是课程学习，它根据难度等指标预先对训练样本排序，相较于随机采样策略可以提高学习效率。然而，大多数现有的LLM微调方法依赖于静态课程，这些课程在训练前设计，缺乏对模型在微调过程中不断变化需求的适应性。为此，我们提出了EDCO，这是一个基于两个关键概念的新框架：推理熵和动态课程编排。受近期研究发现保持较高的答案熵有助于长期推理提升的启发，EDCO在持续调整的课程中优先考虑高推理熵的样本。EDCO集成了三个核心组件：一个高效的熵估计器，利用前缀 token 近似全序列熵；一个基于熵的课程生成器，选择推理熵最高的数据点；以及一个在所选课程上优化模型的LLM训练器。在通信、医学和法律领域的全面实验中，EDCO在监督学习和强化学习设置下微调Qwen3-4B和Llama3.2-3B模型时优于传统课程策略。此外，所提出的高效熵估计在保持高精度的同时，计算时间减少了83.5%。
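
The first two EDCO components described above (prefix-based entropy estimation and entropy-based selection) can be sketched as follows. This is only an illustration under stated assumptions: each sample is represented by a list of next-token probability distributions, the estimate is the mean Shannon entropy over the first `prefix_len` tokens, and the names `prefix_entropy` and `select_curriculum` are hypothetical. The abstract does not specify the estimator in this detail.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def prefix_entropy(token_distributions, prefix_len):
    """Approximate sequence entropy by averaging over the first prefix_len tokens."""
    prefix = token_distributions[:prefix_len]
    if not prefix:
        return 0.0
    return sum(token_entropy(p) for p in prefix) / len(prefix)


def select_curriculum(samples, prefix_len, k):
    """Return the ids of the k samples with the highest prefix entropy."""
    scored = [(prefix_entropy(dists, prefix_len), sample_id)
              for sample_id, dists in samples.items()]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [sample_id for _, sample_id in scored[:k]]
```

In a dynamic curriculum the selection would be repeated as the model changes, because the per-sample distributions (and therefore the ranking) shift during fine-tuning. Scoring only a prefix is what keeps that repeated scoring cheap.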

O-Researcher: An Open Ended Deep Research Model via Multi-Agent Distillation and Agentic RL

O-Researcher：通过多智能体蒸馏和智能体强化学习构建的开放式深度研究模型

Authors: Yi Yao, He Zhu, Piaohong Wang, Jincheng Ren, Xinlong Yang, Qianben Chen, Xiaowan Li, Dingfeng Shi, Jiaxian Li, Qiexiang Wang, Sinuo Wang, Xinpeng Liu, Jiaqi Wu, Minghao Liu, Wangchunshu Zhou
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.03743
Pdf link: https://arxiv.org/pdf/2601.03743
Abstract The performance gap between closed-source and open-source large language models (LLMs) is largely attributed to disparities in access to high-quality training data. To bridge this gap, we introduce a novel framework for the automated synthesis of sophisticated, research-grade instructional data. Our approach centers on a multi-agent workflow where collaborative AI agents simulate complex tool-integrated reasoning to generate diverse and high-fidelity data end-to-end. Leveraging this synthesized data, we develop a two-stage training strategy that integrates supervised fine-tuning with a novel reinforcement learning method, designed to maximize model alignment and capability. Extensive experiments demonstrate that our framework empowers open-source models across multiple scales, enabling them to achieve new state-of-the-art performance on the major deep research benchmark. This work provides a scalable and effective pathway for advancing open-source LLMs without relying on proprietary data or models.
中文摘要 闭源与开源大型语言模型（LLM）之间的性能差距，主要归因于高质量训练数据的获取差异。为弥合这一差距，我们引入了一个新的框架，用于自动合成复杂的研究级指令数据。我们的方法围绕多智能体工作流程展开，协作式AI智能体模拟复杂的工具集成推理，端到端地生成多样且高保真的数据。利用这些合成数据，我们开发了一套两阶段训练策略，将监督微调与一种新型强化学习方法相结合，旨在最大化模型的对齐性和能力。大量实验表明，我们的框架能够在多个规模上增强开源模型，使其在主要的深度研究基准上实现新的最先进性能。这项工作为推进开源大型语言模型提供了可扩展且高效的路径，而无需依赖专有数据或模型。

MVP: Enhancing Video Large Language Models via Self-supervised Masked Video Prediction

MVP：通过自监督掩码视频预测增强视频大型语言模型

Authors: Xiaokun Sun, Zezhong Wu, Zewen Ding, Linli Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.03781
Pdf link: https://arxiv.org/pdf/2601.03781
Abstract Reinforcement learning based post-training paradigms for Video Large Language Models (VideoLLMs) have achieved significant success by optimizing for visual-semantic tasks such as captioning or VideoQA. However, while these approaches effectively enhance perception abilities, they primarily target holistic content understanding, often lacking explicit supervision for intrinsic temporal coherence and inter-frame correlations. This tendency limits the models' ability to capture intricate dynamics and fine-grained visual causality. To explicitly bridge this gap, we propose a novel post-training objective: Masked Video Prediction (MVP). By requiring the model to reconstruct a masked continuous segment from a set of challenging distractors, MVP forces the model to attend to the sequential logic and temporal context of events. To support scalable training, we introduce a scalable data synthesis pipeline capable of transforming arbitrary video corpora into MVP training samples, and further employ Group Relative Policy Optimization (GRPO) with a fine-grained reward function to enhance the model's understanding of video context and temporal properties. Comprehensive evaluations demonstrate that MVP enhances video reasoning capabilities by directly reinforcing temporal reasoning and causal understanding.
中文摘要 基于强化学习的视频大型语言模型（VideoLLM）后训练范式通过优化字幕生成或视频问答（VideoQA）等视觉语义任务取得了显著成功。然而，虽然这些方法有效提升了感知能力，但它们主要针对整体内容理解，常缺乏对内在时间连贯性和帧间相关性的明确监督。这种倾向限制了模型捕捉复杂动态和细粒度视觉因果关系的能力。为了明确弥合这一差距，我们提出了一个新的后训练目标：掩码视频预测（MVP）。通过要求模型从一组具有挑战性的干扰项中重建一个被掩码的连续片段，MVP迫使模型关注事件的顺序逻辑和时间上下文。为支持可扩展训练，我们引入了可扩展的数据合成流水线，能够将任意视频语料库转换为MVP训练样本，并进一步采用带有细粒度奖励函数的组相对策略优化（GRPO），以增强模型对视频上下文和时间属性的理解。综合评估表明，MVP通过直接强化时间推理和因果理解，增强了视频推理能力。

NeoAMT: Neologism-Aware Agentic Machine Translation with Reinforcement Learning

NeoAMT：基于强化学习的新词感知智能体机器翻译

Authors: Zhongtao Miao, Kaiyan Zhao, Masaaki Nagata, Yoshimasa Tsuruoka
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.03790
Pdf link: https://arxiv.org/pdf/2601.03790
Abstract Neologism-aware machine translation aims to translate source sentences containing neologisms into target languages. This field remains underexplored compared with general machine translation (MT). In this paper, we propose an agentic framework, NeoAMT, for neologism-aware machine translation using a Wiktionary search tool. Specifically, we first create a new dataset for neologism-aware machine translation and develop a search tool based on Wiktionary. The new dataset covers 16 languages and 75 translation directions and is derived from approximately 10 million records of an English Wiktionary dump. The retrieval corpus of the search tool is also constructed from around 3 million cleaned records of the Wiktionary dump. We then use it for training the translation agent with reinforcement learning (RL) and evaluating the accuracy of neologism-aware machine translation. Based on this, we also propose an RL training framework that contains a novel reward design and an adaptive rollout generation approach by leveraging "translation difficulty" to further improve the translation quality of translation agents using our search tool.
中文摘要 新词感知机器翻译旨在将包含新词的源句翻译成目标语言。与一般机器翻译（MT）相比，这一领域仍未被充分探索。本文提出了一个智能体框架NeoAMT，利用维基词典（Wiktionary）搜索工具进行新词感知的机器翻译。具体来说，我们首先创建了一个新词感知机器翻译的新数据集，并基于维基词典开发了搜索工具。新数据集涵盖16种语言和75个翻译方向，来源于英文维基词典转储中的约1000万条记录。搜索工具的检索语料库也基于该转储中约300万条清洗后的记录构建。然后我们用它通过强化学习（RL）训练翻译智能体，并评估新词感知机器翻译的准确性。基于此，我们还提出了一个强化学习训练框架，包含一种新颖的奖励设计和一种利用“翻译难度”的自适应 rollout 生成方法，以进一步提升使用我们搜索工具的翻译智能体的翻译质量。

From Brute Force to Semantic Insight: Performance-Guided Data Transformation Design with LLMs

从暴力搜索到语义洞察：利用大型语言模型进行性能引导的数据变换设计

Authors: Usha Shrestha, Dmitry Ignatov, Radu Timofte
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.03808
Pdf link: https://arxiv.org/pdf/2601.03808
Abstract Large language models (LLMs) have achieved notable performance in code synthesis; however, data-aware augmentation remains a limiting factor, handled via heuristic design or brute-force approaches. We introduce a performance-aware, closed-loop solution in the NNGPT ecosystem of projects that enables LLMs to autonomously engineer optimal transformations by internalizing empirical performance cues. We fine-tune LLMs with Low-Rank Adaptation on a novel repository of more than 6,000 empirically evaluated PyTorch augmentation functions, each annotated solely by downstream model accuracy. Training uses pairwise performance ordering (better-worse transformations), enabling alignment through empirical feedback without reinforcement learning, reward models, or symbolic objectives. This reduces the need for exhaustive search, achieving up to 600x fewer evaluated candidates than brute-force discovery while maintaining competitive peak accuracy and shifting generation from random synthesis to task-aligned design. Ablation studies show that structured Chain-of-Thought prompting introduces syntactic noise and degrades performance, whereas direct prompting ensures stable optimization in performance-critical code tasks. Qualitative and quantitative analyses demonstrate that the model internalizes semantic performance cues rather than memorizing syntax. These results show that LLMs can exhibit task-level reasoning through non-textual feedback loops, bypassing explicit symbolic rewards.
中文摘要 大型语言模型（LLM）在代码合成方面取得了显著的性能;然而，数据感知增强仍是一个限制因素，通常通过启发式设计或暴力破解方法来处理。我们在NNGPT项目生态系统中引入了一种性能感知型闭环解决方案，使LLM能够通过内化经验性能线索自主工程化最优转换。我们在一个包含6000多个经过实证评估的PyTorch增强函数的新仓库上，通过低秩适应微调LLMs，每个函数仅按下游模型准确性进行注释。训练采用两两绩效排序（优劣转换），通过经验反馈实现对齐，无需强化学习、奖励模型或符号目标。这减少了对穷尽搜索的需求，实现的评估候选数量比暴力破解少了多达600倍，同时保持了竞争性的峰值准确率，并将生成过程从随机综合转向任务对齐设计。消融研究表明，结构化思维链提示引入语法噪声并降低性能，而直接提示则确保性能关键代码任务的稳定优化。定性和定量分析表明，该模型内化语义表现线索，而非记忆语法。这些结果表明，大型语言模型可以通过非文本反馈循环展现任务级推理，绕过显式符号奖励。

ROI-Reasoning: Rational Optimization for Inference via Pre-Computation Meta-Cognition

ROI-推理：通过预计算元认知进行理性优化推理

Authors: Muyang Zhao, Qi Qi, Hao Sun
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.03822
Pdf link: https://arxiv.org/pdf/2601.03822
Abstract Large language models (LLMs) can achieve strong reasoning performance with sufficient computation, but they do not inherently know how much computation a task requires. We study budgeted inference-time reasoning for multiple tasks under a strict global token constraint and formalize it as an Ordered Stochastic Multiple-Choice Knapsack Problem (OS-MCKP). This perspective highlights a meta-cognitive requirement: anticipating task difficulty, estimating return over investment (ROI), and allocating computation strategically. We propose ROI-Reasoning, a two-stage framework that endows LLMs with intrinsic, budget-aware rationality. In the first stage, Meta-Cognitive Fine-Tuning teaches models to predict reasoning cost and expected utility before generation, enabling explicit solve-or-skip decisions. Next, Rationality-Aware Reinforcement Learning optimizes sequential decision making under a hard token budget, allowing models to learn long-horizon allocation strategies. Across budgeted mathematical reasoning benchmarks, ROI-Reasoning consistently improves overall score while substantially reducing regret under tight computation budgets.
中文摘要 大型语言模型（LLMs）在足够的计算条件下可以实现强大的推理性能，但它们本身并不了解任务需要多少计算。我们研究了在严格的全局 token 约束下面向多任务的预算化推理时推理，并将其形式化为有序随机多选择背包问题（OS-MCKP）。这一视角强调了元认知的需求：预判任务难度，估算投资回报率（ROI），并战略性地分配计算资源。我们提出了ROI-Reasoning，一种两阶段框架，赋予LLM内在且具备预算意识的理性。在第一阶段，元认知微调教会模型在生成之前预测推理成本和期望效用，从而实现明确的求解或跳过决策。接下来，理性感知强化学习在硬性 token 预算下优化序贯决策，使模型能够学习长时程的分配策略。在带预算的数学推理基准上，ROI-Reasoning持续提升整体得分，同时在紧张的计算预算下显著减少遗憾。
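
The knapsack framing in the ROI-Reasoning abstract can be illustrated with a small dynamic program for the plain, deterministic multiple-choice knapsack problem. This is a simplification: the ordered and stochastic aspects of OS-MCKP are not modeled, and the costs and utilities are assumed to be known in advance rather than predicted by the model. The `allocate_budget` name and the option encoding are hypothetical.

```python
def allocate_budget(tasks, budget):
    """Pick exactly one (cost, utility) option per task within a token budget.

    tasks: list of option lists; each option is (token_cost, expected_utility).
    Including a (0, 0.0) option models the "skip" decision.
    Returns (best_total_utility, chosen_option_index_per_task).
    """
    NEG = float("-inf")
    # best[b] = (utility, choices) using exactly b tokens on the tasks seen so far
    best = [(NEG, [])] * (budget + 1)
    best[0] = (0.0, [])
    for options in tasks:
        nxt = [(NEG, [])] * (budget + 1)
        for spent, (util, choices) in enumerate(best):
            if util == NEG:
                continue  # this spend level is unreachable
            for idx, (cost, gain) in enumerate(options):
                total = spent + cost
                if total <= budget and util + gain > nxt[total][0]:
                    nxt[total] = (util + gain, choices + [idx])
        best = nxt
    return max(best, key=lambda entry: entry[0])
```

For example, with three tasks whose options are skip, a cheap attempt, or an expensive attempt, the solver may prefer two cheap attempts over one expensive one when the budget is tight. That is the solve-or-skip trade-off the abstract describes.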

Step Potential Advantage Estimation: Harnessing Intermediate Confidence and Correctness for Efficient Mathematical Reasoning

步骤潜力优势估计：利用中间置信度和正确性实现高效的数学推理

Authors: Fei Wu, Zhenrong Zhang, Qikai Chang, Jianshu Zhang, Quan Liu, Jun Du
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.03823
Pdf link: https://arxiv.org/pdf/2601.03823
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) elicits long chain-of-thought reasoning in large language models (LLMs), but outcome-based rewards lead to coarse-grained advantage estimation. While existing approaches improve RLVR via token-level entropy or sequence-level length control, they lack a semantically grounded, step-level measure of reasoning progress. As a result, LLMs fail to distinguish necessary deduction from redundant verification: they may continue checking after reaching a correct solution and, in extreme cases, overturn a correct trajectory into an incorrect final answer. To remedy the lack of process supervision, we introduce a training-free probing mechanism that extracts intermediate confidence and correctness and combines them into a Step Potential signal that explicitly estimates the reasoning state at each step. Building on this signal, we propose Step Potential Advantage Estimation (SPAE), a fine-grained credit assignment method that amplifies potential gains, penalizes potential drops, and applies a penalty after the potential saturates to encourage timely termination. Experiments across multiple benchmarks show that SPAE consistently improves accuracy while substantially reducing response length, outperforming strong RL baselines as well as recent efficient-reasoning and token-level advantage estimation methods. The code is available at this https URL.
中文摘要 带可验证奖励的强化学习（RLVR）在大型语言模型（LLMs）中引发了长链思考，但基于结果的奖励则导致粗粒度优势估计。虽然现有方法通过符号级熵或序列级长度控制来提升RLVR，但它们缺乏一种语义基础的、阶级的推理进展衡量标准。因此，LLM未能区分必要的推理与冗余验证：它们可能在得出正确解后继续检查，极端情况下甚至会将正确路径推翻成错误的最终答案。为弥补过程监督的缺失，我们引入了一种无训练的探测机制，提取中间置信度和正确性，并将其组合成步电信号，明确估计每步的推理状态。基于这一信号，我们提出了阶梯潜在优势估计（SPAE），这是一种细粒度的信用分配方法，能够放大潜在收益，惩罚潜在的损失，并在潜在的收益后施加罚款，以鼓励及时终止。多个基准测试的实验显示，SPAE在显著缩短响应长度的同时持续提升准确性，表现优于强强化学习基线以及近期高效的推理和代币级优势估计方法。代码可在该 https URL 访问。

Staged Voxel-Level Deep Reinforcement Learning for 3D Medical Image Segmentation with Noisy Annotations

用于带噪声注释的3D医学图像分割的分级体素级深度强化学习

Authors: Yuyang Fu, Xiuzhen Guo, Ji Shi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.03875
Pdf link: https://arxiv.org/pdf/2601.03875
Abstract Deep learning has achieved significant advancements in medical image segmentation. Currently, obtaining accurate segmentation outcomes is critically reliant on large-scale datasets with high-quality annotations. However, noisy annotations are frequently encountered owing to the complex morphological structures of organs in medical images and variations among different annotators, which can substantially limit the efficacy of segmentation models. Motivated by the fact that medical imaging annotators can correct labeling errors during segmentation based on prior knowledge, we propose an end-to-end Staged Voxel-Level Deep Reinforcement Learning (SVL-DRL) framework for robust medical image segmentation under noisy annotations. This framework employs a dynamic iterative update strategy to automatically mitigate the impact of erroneous labels without requiring manual intervention. The key advancements of SVL-DRL over existing works include: i) formulating noisy annotations as a voxel-dependent problem and addressing it through a novel staged reinforcement learning framework which guarantees robust model convergence; ii) incorporating a voxel-level asynchronous advantage actor-critic (vA3C) module that conceptualizes each voxel as an autonomous agent, which allows each agent to dynamically refine its own state representation during training, thereby directly mitigating the influence of erroneous labels; iii) designing a novel action space for the agents, along with a composite reward function that strategically combines the Dice value and a spatial continuity metric to significantly boost segmentation accuracy while maintaining semantic integrity. Experiments on three public medical image datasets demonstrate state-of-the-art (SoTA) performance under various experimental settings, with an average improvement of over 3% in both Dice and IoU scores.
中文摘要 深度学习在医学图像分割方面取得了显著进展。目前，获得准确的分割结果关键依赖于具有高质量注释的大规模数据集。然而，由于医学图像中器官形态结构复杂且不同注释者间存在差异，常常遇到带噪声的注释，这会显著限制分割模型的有效性。受医学影像标注者能够依据先验知识在分割时纠正标注错误这一事实的启发，我们提出了一种端到端分级体素级深度强化学习（SVL-DRL）框架，用于在噪声标注下实现稳健的医学图像分割。该框架采用动态迭代更新策略，自动减轻错误标签的影响，无需人工干预。SVL-DRL相较于现有工作的主要进展包括：i）将噪声标注作为体素依赖问题提出，并通过一种新颖的分级强化学习框架解决，保证稳健的模型收敛性;ii）集成一个体素级异步优势演员-批评者（vA3C）模块，将每个体素概念化为自主代理，允许每个代理在训练过程中动态优化自身状态表示，从而直接减少错误标签的影响;iii）为代理设计一个新颖的行动空间，以及一个复合奖励函数，策略性地结合Dice值和空间连续性指标，显著提升分割准确性，同时保持语义完整性。在三个公共医学图像数据集上的实验展示了在各种实验环境下的最先进（SoTA）性能，Dice和IoU评分平均提升超过3%。

IndexTTS 2.5 Technical Report

IndexTTS 2.5 技术报告

Authors: Yunpei Li, Xun Zhou, Jinchao Wang, Lu Wang, Yong Wu, Siyi Zhou, Yiquan Zhou, Jingchen Shu
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.03888
Pdf link: https://arxiv.org/pdf/2601.03888
Abstract In prior work, we introduced IndexTTS 2, a zero-shot neural text-to-speech foundation model comprising two core components: a transformer-based Text-to-Semantic (T2S) module and a non-autoregressive Semantic-to-Mel (S2M) module, which together enable faithful emotion replication and establish the first autoregressive duration-controllable generative paradigm. Building upon this, we present IndexTTS 2.5, which significantly enhances multilingual coverage, inference speed, and overall synthesis quality through four key improvements: 1) Semantic Codec Compression: we reduce the semantic codec frame rate from 50 Hz to 25 Hz, halving sequence length and substantially lowering both training and inference costs; 2) Architectural Upgrade: we replace the U-DiT-based backbone of the S2M module with a more efficient Zipformer-based modeling architecture, achieving notable parameter reduction and faster mel-spectrogram generation; 3) Multilingual Extension: we propose three explicit cross-lingual modeling strategies (boundary-aware alignment, token-level concatenation, and instruction-guided generation), establishing practical design principles for zero-shot multilingual emotional TTS that supports Chinese, English, Japanese, and Spanish, and enables robust emotion transfer even without target-language emotional training data; 4) Reinforcement Learning Optimization: we apply GRPO in post-training of the T2S module, improving pronunciation accuracy and naturalness. Experiments show that IndexTTS 2.5 not only supports broader language coverage but also replicates emotional prosody in unseen languages under the same zero-shot setting. IndexTTS 2.5 achieves a 2.28 times improvement in RTF while maintaining comparable WER and speaker similarity to IndexTTS 2.
中文摘要 在之前的工作中，我们引入了IndexTTS 2，这是一个零样本神经文本转语音基础模型，包含两个核心组件：基于变换器的文本转语义（T2S）模块和非自回归的语义转Mel（S2M）模块，这两者共同实现了情感的忠实复制，并建立了首个可控时长的自回归生成范式。在此基础上，我们推出了IndexTTS 2.5，通过四项关键改进显著提升了多语言覆盖、推理速度和整体合成质量：1）语义编码压缩：我们将语义编码帧率从50 Hz降至25 Hz，序列长度减半，显著降低训练和推理成本;2）架构升级：我们用更高效的Zipformer建模架构取代了基于U-DiT的S2M模块骨干，实现了显著的参数缩减和更快的mel频谱图生成;3）多语言扩展：我们提出了三种显式跨语言建模策略：边界感知对齐、令牌级连接和指令引导生成，确立了支持中文、英语、日语和西班牙语的零时点多语言情感语音合成的实用设计原则，即使没有目标语言的情绪训练数据也能实现强健的情感传递;4）强化学习优化：我们在T2S模块的后期训练中应用GRPO，提升发音准确性和自然性。实验显示，IndexTTS 2.5 不仅支持更广泛的语言覆盖，还能在同一零时点设置下复制未见语言中的情感韵律。IndexTTS 2.5在RTF提升2.28倍的同时，WER和扬声器与IndexTTS 2保持相当。

Adaptive-Boundary-Clipping GRPO: Ensuring Bounded Ratios for Stable and Generalizable Training

自适应边界裁剪GRPO：确保稳定且可推广训练的有界比值

Authors: Chi Liu, Xin Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.03895
Pdf link: https://arxiv.org/pdf/2601.03895
Abstract Group Relative Policy Optimization (GRPO) has emerged as a popular algorithm for reinforcement learning with large language models (LLMs). However, upon analyzing its clipping mechanism, we argue that it is suboptimal in certain scenarios. With appropriate modifications, GRPO can be significantly enhanced to improve both flexibility and generalization. To this end, we propose Adaptive-Boundary-Clipping GRPO (ABC-GRPO), an asymmetric and adaptive refinement of the original GRPO framework. We demonstrate that ABC-GRPO achieves superior performance over standard GRPO on mathematical reasoning tasks using the Qwen3 LLMs. Moreover, ABC-GRPO maintains substantially higher entropy throughout training, thereby preserving the model's exploration capacity and mitigating premature convergence. The implementation code is available online to ease reproducibility at this https URL.
中文摘要 群体相对策略优化（GRPO）已成为大型语言模型（LLM）强化学习的流行算法。然而，在分析其削波机制后，我们认为它在某些情况下并不理想。通过适当修改，GRPO可以大幅增强，以提升灵活性和泛化性。为此，我们提出了自适应边界剪裁GRPO（ABC-GRPO），这是对原始GRPO框架的非对称且自适应的细化。我们证明，ABC-GRPO在使用Qwen3大型语言模型进行数学推理任务时，表现优于标准GRPO。此外，ABC-GRPO在整个训练过程中保持显著更高的熵，从而保留了模型的探索能力并减少了过早收敛。实现代码已在线获取，以便简化该 https URL 的可重复性。
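
To make the phrase "asymmetric clipping" concrete, here is a minimal Python sketch of a PPO/GRPO-style clipped surrogate for a single token, with separate lower and upper bounds on the probability ratio. The abstract does not state how ABC-GRPO adapts its boundaries, so the fixed `eps_low` and `eps_high` values below are placeholders chosen for illustration and should not be read as the paper's rule.

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.3):
    """Standard clipped surrogate, but with asymmetric clip bounds.

    ratio:     pi_new(token) / pi_old(token)
    advantage: group-relative advantage estimate for the sample
    eps_low, eps_high: assumed fixed bounds (illustrative only)
    """
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Taking the minimum keeps the objective pessimistic, as in PPO.
    return min(ratio * advantage, clipped_ratio * advantage)


# A large ratio with positive advantage is capped at 1 + eps_high.
print(clipped_surrogate(1.5, 1.0))
# A small ratio with negative advantage is capped at 1 - eps_low.
print(clipped_surrogate(0.5, -1.0))
```

With `eps_high > eps_low`, updates that raise a token's probability get more room than updates that lower it, which is one common motivation for asymmetric bounds; whether ABC-GRPO uses this direction is not stated in the abstract.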

Trade-R1: Bridging Verifiable Rewards to Stochastic Environments via Process-Level Reasoning Verification

交易-R1：通过过程层推理验证将可验证奖励桥接到随机环境

Authors: Rui Sun, Yifan Sun, Sheng Xu, Li Zhao, Jing Li, Daxin Jiang, Chen Hua, Zuo Bai
Subjects: Artificial Intelligence (cs.AI); Trading and Market Microstructure (q-fin.TR)
Arxiv link: https://arxiv.org/abs/2601.03948
Pdf link: https://arxiv.org/pdf/2601.03948
Abstract Reinforcement Learning (RL) has enabled Large Language Models (LLMs) to achieve remarkable reasoning in domains like mathematics and coding, where verifiable rewards provide clear signals. However, extending this paradigm to financial decision-making is challenged by the market's stochastic nature: rewards are verifiable but inherently noisy, causing standard RL to degenerate into reward hacking. To address this, we propose Trade-R1, a model training framework that bridges verifiable rewards to stochastic environments via process-level reasoning verification. Our key innovation is a verification method that transforms the problem of evaluating reasoning over lengthy financial documents into a structured Retrieval-Augmented Generation (RAG) task. We construct a triangular consistency metric, assessing pairwise alignment between retrieved evidence, reasoning chains, and decisions to serve as a validity filter for noisy market returns. We explore two reward integration strategies: Fixed-effect Semantic Reward (FSR) for stable alignment signals, and Dynamic-effect Semantic Reward (DSR) for coupled magnitude optimization. Experiments on asset selection across different countries demonstrate that our paradigm reduces reward hacking, with DSR achieving superior cross-market generalization while maintaining the highest reasoning consistency.
中文摘要 强化学习（RL）使大型语言模型（LLMs）能够在数学和编码等领域实现卓越的推理能力，这些领域可验证的奖励提供了清晰的信号。然而，将这一范式扩展到金融决策受到市场随机性质的挑战：奖励可验证但本质上存在噪声，导致标准强化学习退化为奖励黑客行为。为此，我们提出了Trade-R1模型训练框架，通过过程层推理验证，将可验证奖励与随机环境连接起来。我们的核心创新是一种验证方法，将对冗长财务文件推理的评估问题转变为结构化的检索增强生成（RAG）任务。我们构建了一个三角形一致性指标，评估检索到的证据、推理链和决策之间的两两对齐性，作为噪声市场回报的有效性过滤器。我们探讨了两种奖励整合策略：固定效应语义奖励（FSR）用于稳定比对信号，以及动态效应语义奖励（DSR）用于耦合幅度优化。不同国家资产选择的实验表明，我们的范式减少了奖励黑客行为，DSR在保持最高推理一致性的同时实现了更优越的跨市场泛化。

CoINS: Counterfactual Interactive Navigation via Skill-Aware VLM

CoINS：通过技能感知VLM实现的反事实交互式导航

Authors: Kangjie Zhou, Zhejia Wen, Zhiyong Zhuo, Zike Yan, Pengying Wu, Ieng Hou U, Shuaiyang Li, Han Gao, Kang Ding, Wenhan Cao, Wei Pan, Chang Liu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.03956
Pdf link: https://arxiv.org/pdf/2601.03956
Abstract Recent Vision-Language Models (VLMs) have demonstrated significant potential in robotic planning. However, they typically function as semantic reasoners, lacking an intrinsic understanding of the specific robot's physical capabilities. This limitation is particularly critical in interactive navigation, where robots must actively modify cluttered environments to create traversable paths. Existing VLM-based navigators are predominantly confined to passive obstacle avoidance, failing to reason about when and how to interact with objects to clear blocked paths. To bridge this gap, we propose Counterfactual Interactive Navigation via Skill-aware VLM (CoINS), a hierarchical framework that integrates skill-aware reasoning and robust low-level execution. Specifically, we fine-tune a VLM, named InterNav-VLM, which incorporates skill affordance and concrete constraint parameters into the input context and grounds them into a metric-scale environmental representation. By internalizing the logic of counterfactual reasoning through fine-tuning on the proposed InterNav dataset, the model learns to implicitly evaluate the causal effects of object removal on navigation connectivity, thereby determining interaction necessity and target selection. To execute the generated high-level plans, we develop a comprehensive skill library through reinforcement learning, specifically introducing traversability-oriented strategies to manipulate diverse objects for path clearance. A systematic benchmark in Isaac Sim is proposed to evaluate both the reasoning and execution aspects of interactive navigation. Extensive simulations and real-world experiments demonstrate that CoINS significantly outperforms representative baselines, achieving a 17% higher overall success rate and over 80% improvement in complex long-horizon scenarios compared to the best-performing baseline.
中文摘要 近期的视觉语言模型（VLMs）在机器人规划领域展现出显著潜力。然而，它们通常作为语义推理者，缺乏对特定机器人物理能力的内在理解。这一限制在交互式导航中尤为关键，机器人必须主动修改杂乱环境以创造可通行路径。现有基于VLM的导航器主要局限于被动避障，无法判断何时以及如何与物体互动以清除阻塞路径。为弥合这一差距，我们提出了通过技能感知VLM（CoINS）进行反事实交互式导航的方案，这是一个整合技能感知推理和稳健底层执行的分层框架。具体来说，我们微调了一个名为InterNav-VLM，它将技能可适用性和具体约束参数整合到输入上下文中，并将其扎根于度量级环境表征中。通过对拟议的导航数据集进行微调，内化反事实推理的逻辑，模型学会隐式评估物体移除对导航连接的因果影响，从而确定交互的必要性和目标选择。为执行生成的高级计划，我们通过强化学习开发了全面的技能库，特别是引入以可移动性为导向的策略，以控多样化物体以实现路径清理。提出了一个系统性基准测试，用于评估交互式导航的推理和执行方面。大量模拟和真实世界实验表明，CoINS显著优于代表性基线，整体成功率比最佳基线高出17%且在复杂长期情景中提升超过80%

Anti-Length Shift: Dynamic Outlier Truncation for Training Efficient Reasoning Models

反长度偏移：用于高效推理模型训练的动态离群值截断

Authors: Wei Wu, Liyi Chen, Congxi Xiao, Tianfu Wang, Qimeng Wang, Chengqiang Lu, Yan Gao, Yi Wu, Yao Hu, Hui Xiong
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.03969
Pdf link: https://arxiv.org/pdf/2601.03969
Abstract Large reasoning models enhanced by reinforcement learning with verifiable rewards have achieved significant performance gains by extending their chain-of-thought. However, this paradigm incurs substantial deployment costs as models often exhibit excessive verbosity on simple queries. Existing efficient reasoning methods relying on explicit length penalties often introduce optimization conflicts and leave the generative mechanisms driving overthinking largely unexamined. In this paper, we identify a phenomenon termed length shift where models increasingly generate unnecessary reasoning on trivial inputs during training. To address this, we introduce Dynamic Outlier Truncation (DOT), a training-time intervention that selectively suppresses redundant tokens. This method targets only the extreme tail of response lengths within fully correct rollout groups while preserving long-horizon reasoning capabilities for complex problems. To complement this intervention and ensure stable convergence, we further incorporate auxiliary KL regularization and predictive dynamic sampling. Experimental results across multiple model scales demonstrate that our approach significantly pushes the efficiency-performance Pareto frontier outward. Notably, on the AIME-24, our method reduces inference token usage by 78% while simultaneously increasing accuracy compared to the initial policy and surpassing state-of-the-art efficient reasoning methods.
中文摘要 通过强化学习并提供可验证奖励的大型推理模型，通过扩展思维链实现了显著的性能提升。然而，这种范式会带来较高的部署成本，因为模型在简单查询时常常会显得冗长冗长。依赖显式长度惩罚的现有高效推理方法常常引入优化冲突，并使驱动过度思考的生成机制在很大程度上未被审视。本文指出一种称为长度偏移的现象，即模型在训练过程中对琐碎输入产生不必要的推理。为此，我们引入了动态异常值截断（DOT），一种训练时间干预，选择性地抑制冗余标记。该方法仅针对完全正确展开组内响应长度的极端尾部，同时保留复杂问题的长视野推理能力。为补充这一干预并确保稳定收敛，我们进一步加入辅助KL正则化和预测动态抽样。跨多个模型尺度的实验结果表明，我们的方法显著地将效率与性能的帕累托边界向外推展。值得注意的是，在AIME-24上，我们的方法将推理令牌使用率降低了78%，同时比初始策略提高了准确性，并超越了最先进的高效推理方法。
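
The idea of targeting only the extreme tail of response lengths within fully correct rollout groups can be sketched in a few lines of Python. The abstract does not give the outlier criterion, so the `mean + k * std` cap below, the parameter `k`, and the function names are assumptions for illustration rather than the paper's definition of DOT.

```python
from statistics import mean, pstdev


def outlier_length_cap(lengths, correct, k=2.0):
    """Return a length cap for one rollout group, or None when the
    group is not fully correct (such groups are left untouched)."""
    if not lengths or not all(correct):
        return None
    # Assumed criterion: mean plus k population standard deviations.
    return mean(lengths) + k * pstdev(lengths)


def truncation_targets(lengths, correct, k=2.0):
    """Indices of rollouts whose length exceeds the group's cap."""
    cap = outlier_length_cap(lengths, correct, k)
    if cap is None:
        return []
    return [i for i, n in enumerate(lengths) if n > cap]


lengths = [100, 110, 105, 95, 100, 102, 98, 900]
print(truncation_targets(lengths, [True] * 8))
print(truncation_targets(lengths, [True] * 7 + [False]))
```

Gating on a fully correct group means that hard prompts, where some rollouts still fail, keep their long responses, which matches the abstract's stated aim of preserving long-horizon reasoning for complex problems.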

On-Device Deep Reinforcement Learning for Decentralized Task Offloading: Performance Trade-offs in the Training Process

设备内深度强化学习用于分散式任务卸载训练过程中的性能权衡

Authors: Gorka Nieto, Idoia de la Iglesia, Cristina Perfecto, Unai Lopez-Novoa
Subjects: Emerging Technologies (cs.ET); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2601.03976
Pdf link: https://arxiv.org/pdf/2601.03976
Abstract Allowing less capable devices to offload computational tasks to more powerful devices or servers enables the development of new applications that may not run correctly on the device itself. Deciding where and why to run each of those applications is a complex task. Therefore, different approaches have been adopted to make offloading decisions. In this work, we propose a decentralized Deep Reinforcement Learning (DRL) agent to address the selection of computing locations. Unlike most existing work, we analyze it in a real testbed composed of various edge devices running the agent to determine where to execute each task. These devices are connected to a Multi-Access Edge Computing (MEC) server and a Cloud server through 5G communications. We evaluate not only the agent's performance in meeting task requirements but also the implications of running this type of agent locally, assessing the trade-offs of training locally versus remotely in terms of latency and energy consumption.
中文摘要 允许能力较弱的设备将计算任务卸载给更强大的设备或服务器，从而开发出可能无法在设备本身正常运行的新应用。决定每个应用在哪里以及为什么运行是一项复杂的任务。因此，人们采用了不同的方法来做出卸载决策。在本研究中，我们提出了一个去中心化的深度强化学习（DRL）代理，用于计算位置的选择。与大多数现有工作不同，我们在由多个边缘设备组成的真实测试平台中分析，确定每个任务的执行位置。这些设备通过5G通信连接到多接入边缘计算（MEC）服务器和云服务器。我们不仅评估代理在满足任务需求方面的表现，还评估本地运行此类代理在延迟和能耗方面的权衡。

Thinking with Frames: Generative Video Distortion Evaluation via Frame Reward Model

用帧思考：通过帧奖励模型进行生成视频失真评估

Authors: Yuan Wang, Borui Liao, Huijuan Huang, Jinda Lu, Ouxiang Li, Kuien Liu, Meng Wang, Xiang Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.04033
Pdf link: https://arxiv.org/pdf/2601.04033
Abstract Recent advances in video reward models and post-training strategies have improved text-to-video (T2V) generation. While these models typically assess visual quality, motion quality, and text alignment, they often overlook key structural distortions, such as abnormal object appearances and interactions, which can degrade the overall quality of the generated video. To address this gap, we introduce REACT, a frame-level reward model designed specifically for structural distortion evaluation in generative videos. REACT assigns point-wise scores and attribution labels by reasoning over video frames, focusing on recognizing distortions. To support this, we construct a large-scale human preference dataset, annotated based on our proposed taxonomy of structural distortions, and generate additional data using an efficient Chain-of-Thought (CoT) synthesis pipeline. REACT is trained with a two-stage framework: (1) supervised fine-tuning with masked loss for domain knowledge injection, followed by (2) reinforcement learning with Group Relative Policy Optimization (GRPO) and pairwise rewards to enhance reasoning capability and align output scores with human preferences. During inference, a dynamic sampling mechanism is introduced to focus on frames most likely to exhibit distortion. We also present REACT-Bench, a benchmark for generative video distortion evaluation. Experimental results demonstrate that REACT complements existing reward models in assessing structural distortion, achieving both accurate quantitative evaluations and interpretable attribution analysis.
中文摘要 视频奖励模型和训练后策略的最新进展提升了文本转视频（T2V）生成。虽然这些模型通常评估视觉质量、运动质量和文本对齐，但它们常常忽视关键的结构畸变，如异常物体的外观和交互，这些都会降低生成视频的整体质量。为弥补这一空白，我们引入了REACT，一种专为生成视频结构扭曲评估设计的帧级奖励模型。REACT通过对视频帧进行推理，重点识别失真，从而为点数分配分数和归因标签。为此，我们构建了一个大规模的人类偏好数据集，基于我们提出的结构扭曲分类法进行注释，并通过高效的思维链（Chain-of-Thought，CoT）综合流程生成额外数据。REACT采用两阶段框架训练：（1）带掩蔽丢失的监督微调以注入领域知识，随后是（2）通过群体相对策略优化（GRPO）和成对奖励进行强化学习，以增强推理能力并使输出分数与人类偏好保持一致。在推断过程中，会引入动态采样机制，以聚焦最可能表现出失真的帧。我们还介绍了REACT-Bench，这是生成视频失真评估的基准测试。实验结果表明，REACT在评估结构扭曲方面补充了现有的奖励模型，实现了准确的定量评估和可解释的归因分析。

Cells on Autopilot: Adaptive Cell (Re)Selection via Reinforcement Learning

小区自动驾驶：通过强化学习实现自适应小区（重）选择

Authors: Marvin Illian, Ramin Khalili, Antonio A. de A. Rocha, Lin Wang
Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.04083
Pdf link: https://arxiv.org/pdf/2601.04083
Abstract The widespread deployment of 5G networks, together with the coexistence of 4G/LTE networks, provides mobile devices a diverse set of candidate cells to connect to. However, associating mobile devices to cells to maximize overall network performance, a.k.a. cell (re)selection, remains a key challenge for mobile operators. Today, cell (re)selection parameters are typically configured manually based on operator experience and rarely adapted to dynamic network conditions. In this work, we ask: Can an agent automatically learn and adapt cell (re)selection parameters to consistently improve network performance? We present a reinforcement learning (RL)-based framework called CellPilot that adaptively tunes cell (re)selection parameters by learning spatiotemporal patterns of mobile network dynamics. Our study with real-world data demonstrates that even a lightweight RL agent can outperform conventional heuristic reconfigurations by up to 167%, while generalizing effectively across different network scenarios. These results indicate that data-driven approaches can significantly improve cell (re)selection configurations and enhance mobile network performance.
中文摘要 5G网络的广泛部署，加上4G/LTE网络的共存，为移动设备提供了多样化的候选小区连接。然而，将移动设备与小区关联以最大化整体网络性能，即小区（重新）选择，仍是移动运营商面临的关键挑战。如今，单元（重新）选择参数通常根据操作员经验手动配置，很少适应动态网络条件。在本研究中，我们探讨：智能体能否自动学习并调整细胞（重新）选择参数，以持续提升网络性能？我们提出了一个基于强化学习（RL）的框架，名为CellPilot，通过学习移动网络动态的时空模式，自适应地调整细胞（再）选择参数。我们对真实世界数据的研究表明，即使是轻量级强化学习代理，也能比传统启发式重构多达167%，并且在不同网络场景中有效泛化。这些结果表明，数据驱动方法能够显著改善小区（重新）选择配置并提升移动网络性能。

GeoReason: Aligning Thinking And Answering In Remote Sensing Vision-Language Models Via Logical Consistency Reinforcement Learning

GeoReason：通过逻辑一致性强化学习，使遥感视觉语言模型中的思维与回答对齐

Authors: Wenshuai Li, Xiantai Xiang, Zixiao Wen, Guangyao Zhou, Ben Niu, Feng Wang, Lijia Huang, Qiantong Wang, Yuxin Hu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.04118
Pdf link: https://arxiv.org/pdf/2601.04118
Abstract The evolution of Remote Sensing Vision-Language Models (RS-VLMs) emphasizes the importance of transitioning from perception-centric recognition toward high-level deductive reasoning to enhance cognitive reliability in complex spatial tasks. However, current models often suffer from logical hallucinations, where correct answers are derived from flawed reasoning chains or rely on positional shortcuts rather than spatial logic. This decoupling undermines reliability in strategic spatial decision-making. To address this, we present GeoReason, a framework designed to synchronize internal thinking with final decisions. We first construct GeoReason-Bench, a logic-driven dataset containing 4,000 reasoning trajectories synthesized from geometric primitives and expert knowledge. We then formulate a two-stage training strategy: (1) Supervised Knowledge Initialization to equip the model with reasoning syntax and domain expertise, and (2) Consistency-Aware Reinforcement Learning to refine deductive reliability. This second stage integrates a novel Logical Consistency Reward, which penalizes logical drift via an option permutation strategy to anchor decisions in verifiable reasoning traces. Experimental results demonstrate that our framework significantly enhances the cognitive reliability and interpretability of RS-VLMs, achieving state-of-the-art performance compared to other advanced methods.
中文摘要 遥感视觉语言模型（RS-VLMs）的发展强调了从以感知为中心的识别向高阶演绎推理转变的重要性，以提升复杂空间任务中的认知可靠性。然而，当前模型常常存在逻辑幻觉，正确答案源自有缺陷的推理链，或依赖位置捷径而非空间逻辑。这种脱钩削弱了战略空间决策的可靠性。为此，我们提出了GeoReason，一个旨在同步内部思考与最终决策的框架。我们首先构建了GeoReason-Bench，这是一个逻辑驱动数据集，包含4000条推理轨迹，由几何原语和专家知识合成而成。随后，我们制定了两阶段训练策略：（1）监督式知识初始化，为模型提供推理语法和领域专长;（2）一致性感知强化学习，以优化演绎可靠性。第二阶段整合了一种新颖的逻辑一致性奖励，通过期权置换策略惩罚逻辑漂移，将决策锚定在可验证的推理轨迹中。实验结果表明，我们的框架显著提升了RS-VLMs的认知可靠性和可解释性，达到了与其他先进方法相比最先进的性能。

InfiniteWeb: Scalable Web Environment Synthesis for GUI Agent Training

InfiniteWeb：可扩展的网页环境综合用于图形界面代理培训

Authors: Ziyun Zhang, Zezhou Wang, Xiaoyi Zhang, Zongyu Guo, Jiahao Li, Bin Li, Yan Lu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.04126
Pdf link: https://arxiv.org/pdf/2601.04126
Abstract GUI agents that interact with graphical interfaces on behalf of users represent a promising direction for practical AI assistants. However, training such agents is hindered by the scarcity of suitable environments. We present InfiniteWeb, a system that automatically generates functional web environments at scale for GUI agent training. While LLMs perform well at generating a single webpage, building a realistic and functional website with many interconnected pages remains challenging. We address these challenges through unified specification, task-centric test-driven development, and a combination of website seeds with reference design images to ensure diversity. Our system also generates verifiable task evaluators, enabling dense reward signals for reinforcement learning. Experiments show that InfiniteWeb surpasses commercial coding agents at realistic website construction, and GUI agents trained on our generated environments achieve significant performance improvements on OSWorld and Online-Mind2Web, demonstrating the effectiveness of the proposed system.
中文摘要 代表用户与图形界面交互的图形界面代理，代表了实用人工智能助手的一个有前景的方向。然而，由于合适的环境稀缺，培训此类特工受到阻碍。我们介绍InfiniteWeb，这是一款能够大规模自动生成功能正常的网络环境用于GUI代理训练的系统。虽然大型语言模型在生成单一网页方面表现良好，但构建一个具有多个相互关联页面的真实且功能齐全的网站则面临挑战。我们通过统一规范、以任务为中心的测试驱动开发，以及网站种子与参考设计图像的结合，来应对这些挑战，以确保多样性。我们的系统还生成可验证的任务评估器，支持密集的奖励信号以促进强化学习。实验显示，InfiniteWeb 在真实的网站构建上超越了商业编码代理，且在我们生成环境中训练的图形界面代理在 OSWorld 和 Online-Mind2Web 上实现了显著的性能提升，证明了所提系统的有效性。

Agentic Rubrics as Contextual Verifiers for SWE Agents

作为SWE代理的情境验证工具的代理评分标准

Authors: Mohit Raghavendra, Anisha Gunjal, Bing Liu, Yunzhong He
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.04171
Pdf link: https://arxiv.org/pdf/2601.04171
Abstract Verification is critical for improving agents: it provides the reward signal for Reinforcement Learning and enables inference-time gains through Test-Time Scaling (TTS). Despite its importance, verification in software engineering (SWE) agent settings often relies on code execution, which can be difficult to scale due to environment setup overhead. Scalable alternatives such as patch classifiers and heuristic methods exist, but they are less grounded in codebase context and harder to interpret. To this end, we explore Agentic Rubrics: an expert agent interacts with the repository to create a context-grounded rubric checklist, and candidate patches are then scored against it without requiring test execution. On SWE-Bench Verified under parallel TTS evaluation, Agentic Rubrics achieve a score of 54.2% on Qwen3-Coder-30B-A3B and 40.6% on Qwen3-32B, with at least a +3.5 percentage-point gain over the strongest baseline in our comparison set. We further analyze rubric behavior, showing that rubric scores are consistent with ground-truth tests while also flagging issues that tests do not capture. Our ablations show that agentic context gathering is essential for producing codebase-specific, unambiguous criteria. Together, these results suggest that Agentic Rubrics provide an efficient, scalable, and granular verification signal for SWE agents.
中文摘要 验证对于改进代理至关重要：它为强化学习提供奖励信号，并通过测试时间缩放（TTS）实现推理时间的提升。尽管验证重要，软件工程（SWE）代理设置中的验证通常依赖代码执行，而由于环境设置开销，代码的扩展性较大。存在可扩展的替代方案，如补丁分类器和启发式方法，但它们对代码库上下文的根基较少，且更难解释。为此，我们探讨了代理评分标准：专家代理与仓库交互，创建基于上下文的评分标准清单，然后根据该评分标准对候选补丁进行评分，无需执行测试。在并行TTS评估下的SWE-Bench Verified中，Agentic Rubrics在Qwen3-Coder-30B-A3B上得分为54.2%，在Qwen3-32B上得分为40.6%，较我们对比组中最强基线至少提升了+3.5个百分点。我们进一步分析了评分标准的行为，表明评分与实地真实测试一致，同时也标记了测试未能捕捉的问题。我们的消融表明，能动上下文收集对于生成代码库特定且明确的标准至关重要。综合这些结果表明，代理评分标准为SWE代理提供了高效、可扩展且细致的验证信号。

Hierarchical GNN-Based Multi-Agent Learning for Dynamic Queue-Jump Lane and Emergency Vehicle Corridor Formation

基于GNN的多智能体学习，用于动态排队-跳跃车道和紧急车辆走廊形成

Authors: Haoran Su
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2601.04177
Pdf link: https://arxiv.org/pdf/2601.04177
Abstract Emergency vehicles require rapid passage through congested traffic, yet existing strategies fail to adapt to dynamic conditions. We propose a novel hierarchical graph neural network (GNN)-based multi-agent reinforcement learning framework to coordinate connected vehicles for emergency corridor formation. Our approach uses a high-level planner for global strategy and low-level controllers for trajectory execution, utilizing graph attention networks to scale with variable agent counts. Trained via Multi-Agent Proximal Policy Optimization (MAPPO), the system reduces emergency vehicle travel time by 28.3% compared to baselines and 44.6% compared to uncoordinated traffic in simulations. The design achieves near-zero collision rates (0.3%) while maintaining 81% of background traffic efficiency. Ablation and generalization studies confirm the framework's robustness across diverse scenarios. These results demonstrate the effectiveness of combining GNNs with hierarchical learning for intelligent transportation systems.
中文摘要 紧急车辆需要快速穿越拥堵的交通，但现有策略未能适应动态环境。我们提出了一种基于分层图神经网络（GNN）的新型多智能体强化学习框架，用于协调连接车辆以形成紧急走廊。我们的方法使用高层规划器进行全球战略，低层控制器执行轨迹，利用图关注网络以适应可变代理数量的扩展。通过多代理近端策略优化（MAPPO）训练，该系统在模拟中将紧急车辆行驶时间比基线减少28.3%，在模拟中减少44.6%的非协调交通。该设计实现了接近零的碰撞率（0.3%），同时保持了81%的背景交通效率。消融和泛化研究证实了该框架在多种情境下的稳健性。这些结果证明了将GNN与分层学习结合在智能交通系统中的有效性。

Keyword: diffusion policy

There is no result