Arxiv Papers of Today

Generated: 2026-01-21 16:37:43 (UTC+8); arXiv release time: 2026-01-21 20:00 EST (2026-01-22 09:00 UTC+8)

62 relevant papers today

Keyword: reinforcement learning

GRADE: Replacing Policy Gradients with Backpropagation for LLM Alignment

Authors: Lukas Abrie Nel
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.11574
Pdf link: https://arxiv.org/pdf/2601.11574
Abstract Reinforcement learning from human feedback (RLHF) has become the dominant paradigm for aligning large language models with human preferences. However, policy gradient methods such as PPO suffer from high variance gradient estimates, requiring careful hyperparameter tuning and extensive computational resources. We introduce GRADE (Gumbel-softmax Relaxation for Alignment via Differentiable Estimation), a method that replaces high-variance policy gradient estimation with direct backpropagation through a differentiable relaxation of the discrete token sampling process. Using the Gumbel-Softmax reparameterization with straight-through estimation (GRADE-STE), we enable end-to-end gradient flow from reward signals through generated tokens to model parameters. On sentiment-controlled text generation using the IMDB dataset, GRADE-STE achieves a test reward of 0.763 +- 0.344 compared to PPO's 0.510 +- 0.313 and REINFORCE's 0.617 +- 0.378, representing a 50% relative improvement over PPO. Critically, GRADE-STE exhibits gradient variance over 14 times lower than REINFORCE and maintains stable training dynamics throughout optimization. Our rigorous evaluation with proper train/validation/test splits demonstrates that these improvements generalize to held-out data, with GRADE-STE showing the best generalization characteristics among all methods tested. GRADE offers a simpler, more stable, and more effective alternative to reinforcement learning for LLM alignment.
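
The straight-through Gumbel-Softmax step described in the abstract can be sketched in a few lines. This is a toy under stated assumptions, not the authors' implementation: it covers a single token position, uses a hypothetical per-token reward vector `token_rewards` in place of a learned reward model, and writes out the gradient of the relaxed sample by hand instead of relying on an autodiff framework.

```python
import math
import random


def gumbel_softmax_ste(logits, tau, rng):
    """Return (soft, hard) for one token position.

    `soft` is the Gumbel-Softmax relaxation; `hard` is the one-hot argmax
    that a straight-through estimator would emit in the forward pass.
    """
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)  # guard against log(0)
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1).
        noisy.append((logit - math.log(-math.log(u))) / tau)
    peak = max(noisy)
    exps = [math.exp(z - peak) for z in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    winner = soft.index(max(soft))
    hard = [1 if i == winner else 0 for i in range(len(soft))]
    return soft, hard


def ste_grad(soft, token_rewards, tau):
    """Gradient of sum_i r_i * soft_i with respect to the logits.

    Straight-through estimation reuses this relaxed gradient for the
    backward pass even though the forward pass used `hard`.
    `token_rewards` is a hypothetical stand-in for a reward model.
    """
    expected = sum(r * s for r, s in zip(token_rewards, soft))
    return [s * (r - expected) / tau for r, s in zip(token_rewards, soft)]
```

Lower values of `tau` push `soft` closer to one-hot, at the cost of larger gradient magnitudes; how GRADE schedules the temperature is not stated in the abstract.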

Bielik 11B v3: Multilingual Large Language Model for European Languages

Authors: Krzysztof Ociepa, Łukasz Flis, Remigiusz Kinas, Krzysztof Wróbel, Adrian Gwoździej
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.11579
Pdf link: https://arxiv.org/pdf/2601.11579
Abstract We present Bielik 11B v3, a state-of-the-art language model highly optimized for the Polish language, while also maintaining strong capabilities in other European languages. This model extends the Mistral 7B v0.2 architecture, scaled to 11B parameters via depth up-scaling. Its development involved a comprehensive four-stage training pipeline: continuous pre-training, supervised fine-tuning (SFT), Direct Preference Optimization (DPO), and reinforcement learning. Comprehensive evaluations demonstrate that Bielik 11B v3 achieves exceptional performance. It significantly surpasses other specialized Polish language models and outperforms many larger models (with 2-6 times more parameters) on a wide range of tasks, from basic linguistic understanding to complex reasoning. The model's parameter efficiency, combined with extensive quantization options, allows for effective deployment across diverse hardware configurations. Bielik 11B v3 not only advances AI capabilities for the Polish language but also establishes a new benchmark for developing resource-efficient, high-performance models for less-represented languages.

Hindsight Preference Replay Improves Preference-Conditioned Multi-Objective Reinforcement Learning

Authors: Jonaid Shianifar, Michael Schukat, Karl Mason
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.11604
Pdf link: https://arxiv.org/pdf/2601.11604
Abstract Multi-objective reinforcement learning (MORL) enables agents to optimize vector-valued rewards while respecting user preferences. CAPQL, a preference-conditioned actor-critic method, achieves this by conditioning on weight vectors w, but it restricts data usage to the specific preferences under which the data was collected, leaving off-policy data from other preferences unused. We introduce Hindsight Preference Replay (HPR), a simple and general replay augmentation strategy that retroactively relabels stored transitions with alternative preferences. This densifies supervision across the preference simplex without altering the CAPQL architecture or loss functions. Evaluated on six MO-Gymnasium locomotion tasks at a fixed 300,000-step budget using expected utility (EUM), hypervolume (HV), and sparsity, HPR-CAPQL improves HV in five of six environments and EUM in four of six. On mo-humanoid-v5, for instance, EUM rises from $323 \pm 125$ to $1613 \pm 464$ and HV from 0.52M to 9.63M, with strong statistical support. mo-halfcheetah-v5 remains a challenging exception where CAPQL attains higher HV at comparable EUM. We report final summaries and Pareto-front visualizations across all tasks.
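
The relabeling idea can be illustrated with a small replay helper. The sketch below is a guess at the general shape rather than the paper's code: the transition layout `(state, action, reward_vec, next_state, w)`, the uniform sampling of new preferences, and the function names are all assumptions made for illustration.

```python
import random


def sample_preference(num_objectives, rng):
    """Draw a weight vector uniformly from the preference simplex."""
    # Normalised Exp(1) draws follow Dirichlet(1, ..., 1), which is
    # uniform on the simplex.
    draws = [rng.expovariate(1.0) for _ in range(num_objectives)]
    total = sum(draws)
    return [d / total for d in draws]


def hindsight_relabel(transition, num_relabels, rng):
    """Return the stored transition plus copies with alternative preferences.

    The vector reward is kept untouched; only the conditioning preference
    changes, so the same experience supervises several points of the simplex.
    """
    state, action, reward_vec, next_state, _ = transition
    relabelled = [transition]
    for _ in range(num_relabels):
        new_w = sample_preference(len(reward_vec), rng)
        relabelled.append((state, action, reward_vec, next_state, new_w))
    return relabelled


def scalarize(reward_vec, w):
    """Linear utility used to score a vector reward under preference w."""
    return sum(r * wi for r, wi in zip(reward_vec, w))
```

Because the environment returns the full reward vector regardless of the preference in force, relabeling needs no extra environment interaction.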

Reinforcement Learning for Dynamic Workflow Optimization in CI/CD Pipelines

Authors: Aniket Abhishek Soni, Milan Parikh, Rashi Nimesh Kumar Dhenia, Jubin Abhishek Soni, Ayush Raj Jha, Sneja Mitinbhai Shah
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.11647
Pdf link: https://arxiv.org/pdf/2601.11647
Abstract Continuous Integration and Continuous Deployment (CI/CD) pipelines are central to modern software delivery, yet their static workflows often introduce inefficiencies as systems scale. This paper proposes a reinforcement learning (RL) based approach to dynamically optimize CI/CD pipeline workflows. The pipeline is modeled as a Markov Decision Process, and an RL agent is trained to make runtime decisions such as selecting full, partial, or no test execution in order to maximize throughput while minimizing testing overhead. A configurable CI/CD simulation environment is developed to evaluate the approach across build, test, and deploy stages. Experimental results show that the RL optimized pipeline achieves up to a 30 percent improvement in throughput and approximately a 25 percent reduction in test execution time compared to static baselines, while maintaining a defect miss rate below 5 percent. The agent learns to selectively skip or abbreviate tests for low risk commits, accelerating feedback cycles without significantly increasing failure risk. These results demonstrate the potential of reinforcement learning to enable adaptive and intelligent DevOps workflows, providing a practical pathway toward more efficient, resilient, and sustainable CI/CD automation.
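
To make the full, partial, or no-test decision concrete, here is a deliberately tiny tabular sketch. Everything numeric in it is hypothetical (test times, defect probabilities, the miss penalty), the problem is reduced to a one-step decision per commit, and rewards are expected values rather than samples, so it only illustrates the trade-off and is not the paper's simulator or agent.

```python
# action -> (test time, probability that a present defect is missed); hypothetical
ACTIONS = {
    "full": (10.0, 0.0),
    "partial": (4.0, 0.3),
    "none": (0.0, 1.0),
}
# commit risk level -> probability the commit contains a defect; hypothetical
DEFECT_PROB = {"low_risk": 0.02, "high_risk": 0.5}
MISS_PENALTY = 50.0  # hypothetical cost of shipping a defect


def expected_reward(risk, action):
    """Negative cost: time spent testing plus expected cost of a missed defect."""
    test_time, miss_prob = ACTIONS[action]
    return -test_time - MISS_PENALTY * DEFECT_PROB[risk] * miss_prob


def train(sweeps=200, alpha=0.1):
    """One-step Q-learning with round-robin exploration over all pairs."""
    q = {(risk, action): 0.0 for risk in DEFECT_PROB for action in ACTIONS}
    for _ in range(sweeps):
        for key in q:
            q[key] += alpha * (expected_reward(*key) - q[key])
    return q


def greedy_policy(q):
    """Pick the highest-valued action for each risk level."""
    return {risk: max(ACTIONS, key=lambda a: q[(risk, a)]) for risk in DEFECT_PROB}
```

With these made-up numbers the greedy policy skips tests on low-risk commits and runs the full suite on high-risk ones, which mirrors the selective behavior the abstract describes.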

Advances and Frontiers of LLM-based Issue Resolution in Software Engineering: A Comprehensive Survey

Authors: Caihua Li, Lianghong Guo, Yanlin Wang, Daya Guo, Wei Tao, Zhenyu Shan, Mingwei Liu, Jiachi Chen, Haoyu Song, Duyu Tang, Hongyu Zhang, Zibin Zheng
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.11655
Pdf link: https://arxiv.org/pdf/2601.11655
Abstract Issue resolution, a complex Software Engineering (SWE) task integral to real-world development, has emerged as a compelling challenge for artificial intelligence. The establishment of benchmarks like SWE-bench revealed this task as profoundly difficult for large language models, thereby significantly accelerating the evolution of autonomous coding agents. This paper presents a systematic survey of this emerging domain. We begin by examining data construction pipelines, covering automated collection and synthesis approaches. We then provide a comprehensive analysis of methodologies, spanning training-free frameworks with their modular components to training-based techniques, including supervised fine-tuning and reinforcement learning. Subsequently, we discuss critical analyses of data quality and agent behavior, alongside practical applications. Finally, we identify key challenges and outline promising directions for future research. An open-source repository is maintained at this https URL to serve as a dynamic resource in this field.

AGGC: Adaptive Group Gradient Clipping for Stabilizing Large Language Model Training

Authors: Zhiyuan Li, Yuan Wu, Yi Chang
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.11864
Pdf link: https://arxiv.org/pdf/2601.11864
Abstract To stabilize the training of Large Language Models (LLMs), gradient clipping is a nearly ubiquitous heuristic used to alleviate exploding gradients. However, traditional global norm clipping erroneously presupposes gradient homogeneity across different functional modules, leading to an adverse "spill-over" effect where volatile parameters force unnecessary scaling on stable ones. To overcome this, we propose Adaptive Group-wise Gradient Clipping (AGGC). AGGC partitions parameters into groups based on functional types and regulates each according to its historical behavior using an Exponential Moving Average (EMA). Specifically, it constructs an adaptive interval to simultaneously mitigate gradient explosion and vanishing, while employing a time-dependent scheduling mechanism to balance exploration and convergence. Experiments on LLaMA 2-7B, Mistral-7B, and Gemma-7B models show that AGGC consistently outperforms LoRA and frequently surpasses Full Fine-Tuning. On the GSM8K benchmark, Mistral-7B fine-tuned with AGGC achieves an accuracy of 72.93%, exceeding LoRA's 69.5%. AGGC also effectively stabilizes Reinforcement Learning with Verifiable Rewards (RLVR), enhancing the logic deduction of Qwen 2.5 and Llama 3.2 models. Experimental results demonstrate that AGGC effectively addresses the limitations of traditional gradient clipping methods, particularly in overcoming gradient heterogeneity, by utilizing a modular, adaptive clipping strategy to stabilize the training process. Due to its lightweight design, AGGC can be seamlessly integrated into existing post-training pipelines with negligible overhead.
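
A rough sketch of group-wise clipping against an EMA of each group's gradient norm follows. The interval factors `low` and `high`, the decay `beta`, and the choice to update the EMA with the clipped norm are assumptions for illustration; the paper's time-dependent schedule is not reproduced here.

```python
import math


def group_norm(grads):
    """L2 norm of one parameter group's gradient."""
    return math.sqrt(sum(g * g for g in grads))


def adaptive_group_clip(grad_groups, ema_norms, beta=0.9, low=0.5, high=2.0):
    """Rescale each group so its norm stays within [low * ema, high * ema].

    `grad_groups` maps a group name to a flat list of gradients.
    `ema_norms` is updated in place and carries state across steps.
    """
    clipped = {}
    for name, grads in grad_groups.items():
        norm = group_norm(grads)
        ema = ema_norms.get(name)
        if ema is None or norm == 0.0:
            target = norm  # first step or zero gradient: leave as is
        else:
            target = min(max(norm, low * ema), high * ema)
        scale = target / norm if norm > 0.0 else 1.0
        clipped[name] = [g * scale for g in grads]
        # Track the clipped norm so one spike cannot inflate the history.
        ema_norms[name] = target if ema is None else beta * ema + (1 - beta) * target
    return clipped
```

Because each group is compared only with its own history, a spike in one group leaves the others unscaled, which is the "spill-over" effect that global-norm clipping cannot avoid.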

Controlling Underestimation Bias in Constrained Reinforcement Learning for Safe Exploration

Authors: Shiqing Gao, Jiaxin Ding, Luoyi Fu, Xinbing Wang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.11953
Pdf link: https://arxiv.org/pdf/2601.11953
Abstract Constrained Reinforcement Learning (CRL) aims to maximize cumulative rewards while satisfying constraints. However, existing CRL algorithms often encounter significant constraint violations during training, limiting their applicability in safety-critical scenarios. In this paper, we identify the underestimation of the cost value function as a key factor contributing to these violations. To address this issue, we propose the Memory-driven Intrinsic Cost Estimation (MICE) method, which introduces intrinsic costs to mitigate underestimation and control bias to promote safer exploration. Inspired by flashbulb memory, where humans vividly recall dangerous experiences to avoid risks, MICE constructs a memory module that stores previously explored unsafe states to identify high-cost regions. The intrinsic cost is formulated as the pseudo-count of the current state visiting these risk regions. Furthermore, we propose an extrinsic-intrinsic cost value function that incorporates intrinsic costs and adopts a bias correction strategy. Using this function, we formulate an optimization objective within the trust region, along with corresponding optimization methods. Theoretically, we provide convergence guarantees for the proposed cost value function and establish the worst-case constraint violation for the MICE update. Extensive experiments demonstrate that MICE significantly reduces constraint violations while preserving policy performance comparable to baselines.
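
One simple way to picture a memory-driven intrinsic cost is a neighbor count over stored unsafe states. The sketch below assumes Euclidean states, a fixed `radius`, and a linear `scale`; these are illustrative choices and the class name is hypothetical, so the pseudo-count used in the paper may well differ.

```python
import math


class UnsafeMemory:
    """Stores unsafe states and scores how close a query state is to them."""

    def __init__(self, radius):
        self.radius = radius
        self.states = []

    def add(self, state):
        """Remember a state in which a cost was incurred."""
        self.states.append(tuple(state))

    def intrinsic_cost(self, state, scale=1.0):
        """Count stored unsafe states within `radius` of `state`."""
        near = sum(1 for s in self.states if math.dist(s, state) <= self.radius)
        return scale * near
```

The returned value would be added to the environment cost before the cost value function is updated, so states near remembered violations look more expensive than the raw signal suggests.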

R$^2$PO: Decoupling Training Trajectories from Inference Responses for LLM Reasoning

Authors: Jingchu Wang, Bingbing Xu, Yige Yuan, Bin Xie, Xiaoqian Sun, Huawei Shen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.11960
Pdf link: https://arxiv.org/pdf/2601.11960
Abstract Reinforcement learning has become a central paradigm for improving LLM reasoning. However, existing methods use a single policy to produce both inference responses and training optimization trajectories. The objective conflict between generating stable inference responses and diverse training trajectories leads to insufficient exploration, which harms reasoning capability. In this paper, to address the problem, we propose R$^2$PO (Residual Rollout Policy Optimization), which introduces a lightweight Residual Rollout-Head atop the policy to decouple training trajectories from inference responses, enabling controlled trajectory diversification during training while keeping inference generation stable. Experiments across multiple benchmarks show that our method consistently outperforms baselines, achieving average accuracy gains of 3.1% on MATH-500 and 2.4% on APPS, while also reducing formatting errors and mitigating length bias for stable optimization. Our code is publicly available at this https URL.

Extreme Value Policy Optimization for Safe Reinforcement Learning

Authors: Shiqing Gao, Yihang Zhou, Shuai Shao, Haoyu Luo, Yiheng Bing, Jiaxin Ding, Luoyi Fu, Xinbing Wang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.12008
Pdf link: https://arxiv.org/pdf/2601.12008
Abstract Ensuring safety is a critical challenge in applying Reinforcement Learning (RL) to real-world scenarios. Constrained Reinforcement Learning (CRL) addresses this by maximizing returns under predefined constraints, typically formulated as the expected cumulative cost. However, expectation-based constraints overlook rare but high-impact extreme value events in the tail distribution, such as black swan incidents, which can lead to severe constraint violations. To address this issue, we propose the Extreme Value policy Optimization (EVO) algorithm, leveraging Extreme Value Theory (EVT) to model and exploit extreme reward and cost samples, reducing constraint violations. EVO introduces an extreme quantile optimization objective to explicitly capture extreme samples in the cost tail distribution. Additionally, we propose an extreme prioritization mechanism during replay, amplifying the learning signal from rare but high-impact extreme samples. Theoretically, we establish upper bounds on expected constraint violations during policy updates, guaranteeing strict constraint satisfaction at a zero-violation quantile level. Further, we demonstrate that EVO achieves a lower probability of constraint violations than expectation-based methods and exhibits lower variance than quantile regression methods. Extensive experiments show that EVO significantly reduces constraint violations during training while maintaining competitive policy performance compared to baselines.
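
For readers unfamiliar with Extreme Value Theory, the textbook peaks-over-threshold estimate of an extreme quantile is shown below. It assumes the excesses over `threshold` follow a generalized Pareto distribution with scale `sigma` and shape `xi` that have already been fitted; it is the standard EVT formula, not the EVO objective itself.

```python
import math


def gpd_tail_quantile(threshold, sigma, xi, n_total, n_exceed, p):
    """Peaks-over-threshold estimate of the p-quantile of a cost distribution.

    Only meaningful for quantiles above the threshold, i.e. when
    p > 1 - n_exceed / n_total.
    """
    tail_ratio = (n_total / n_exceed) * (1.0 - p)
    if abs(xi) < 1e-12:
        # Exponential tail, the xi -> 0 limit of the GPD.
        return threshold - sigma * math.log(tail_ratio)
    return threshold + (sigma / xi) * (tail_ratio ** (-xi) - 1.0)
```

A constraint placed on such a quantile reacts to rare, large costs that an expectation-based constraint would average away.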

Profit Maximization for Electric Vehicle Charging Stations Using Multiagent Reinforcement Learning

Authors: Kun-Yan Jiang, Wei-Yu Chiu, Yuan-Po Tsai
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2601.12028
Pdf link: https://arxiv.org/pdf/2601.12028
Abstract Electric vehicles (EVs) are increasingly integrated into power grids, offering economic and environmental benefits but introducing challenges due to uncoordinated charging. This study addresses the profit maximization problem for multiple EV charging stations (EVCSs) equipped with energy storage systems (ESS) and renewable energy sources (RES), with the capability for energy trading. We propose a Double Hypernetwork QMIX-based multi-agent reinforcement learning (MARL) framework to optimize cooperative energy management under uncertainty in EV demand, renewable generation, and real-time electricity prices. The framework mitigates overestimation bias in value estimation, enables distributed decision-making, and incorporates an internal energy trading mechanism. Numerical experiments using real-world data demonstrate that, compared to standard QMIX, the proposed method achieves approximately 5.3% and 12.7% higher total profit for the two regions, respectively, highlighting its economic and operational efficiency. Additionally, the approach maintains robust performance under varying levels of EV demand uncertainty and renewable energy fluctuations.

UniMo: Unified Motion Generation and Understanding with Chain of Thought

Authors: Guocun Wang, Kenkun Liu, Jing Lin, Guorui Song, Jian Li, Xiaoguang Han
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.12126
Pdf link: https://arxiv.org/pdf/2601.12126
Abstract Existing 3D human motion generation and understanding methods often exhibit limited interpretability, restricting effective mutual enhancement between these inherently related tasks. While current unified frameworks based on large language models (LLMs) leverage linguistic priors, they frequently encounter challenges in semantic alignment and task coherence. Moreover, the next-token prediction paradigm in LLMs is ill-suited for motion sequences, causing cumulative prediction errors. To address these limitations, we propose UniMo, a novel framework that integrates motion-language information and interpretable chain of thought (CoT) reasoning into the LLM via supervised fine-tuning (SFT). We further introduce reinforcement learning with Group Relative Policy Optimization (GRPO) as a post-training strategy that optimizes over groups of tokens to enforce structural correctness and semantic alignment, mitigating cumulative errors in motion token prediction. Extensive experiments demonstrate that UniMo significantly outperforms existing unified and task-specific models, achieving state-of-the-art performance in both motion generation and understanding.
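
The group-relative advantage at the core of GRPO, in its commonly published form, normalizes each sampled output's reward against the other samples drawn for the same prompt. The sketch below shows that generic computation; the use of the population standard deviation and the `eps` value are assumptions, and UniMo's motion-specific reward design is not shown.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

No learned value function is needed: the group mean plays the role of the baseline.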

Aletheia: What Makes RLVR For Code Verifiers Tick?

Authors: Vatsal Venkatkrishna, Indraneil Paul, Iryna Gurevych
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.12186
Pdf link: https://arxiv.org/pdf/2601.12186
Abstract Multi-domain thinking verifiers trained via Reinforcement Learning from Verifiable Rewards (RLVR) are a prominent fixture of the Large Language Model (LLM) post-training pipeline, owing to their ability to robustly rate and rerank model outputs. However, the adoption of such verifiers towards code generation has been comparatively sparse, with execution feedback constituting the dominant signal. Nonetheless, code verifiers remain valuable toward judging model outputs in scenarios where execution feedback is hard to obtain and are a potentially powerful addition to the code generation post-training toolbox. To this end, we create and open-source Aletheia, a controlled testbed that enables execution-grounded evaluation of code verifiers' robustness across disparate policy models and covariate shifts. We examine components of the RLVR-based verifier training recipe widely credited for its success: (1) intermediate thinking traces, (2) learning from negative samples, and (3) on-policy training. While experiments show the optimality of RLVR, we uncover important opportunities to simplify the recipe. Particularly, despite code verification exhibiting positive training- and inference-time scaling, on-policy learning stands out as the key component at small verifier sizes, and thinking-based training emerges as the most important component at larger scales.

Speculative Sampling with Reinforcement Learning

Authors: Chenan Wang, Daniel H. Shi, Haipeng Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.12212
Pdf link: https://arxiv.org/pdf/2601.12212
Abstract Inference-time latency has remained an open challenge for real-world applications of large language models (LLMs). State-of-the-art (SOTA) speculative sampling (SpS) methods for LLMs, like EAGLE-3, use tree-based drafting to explore multiple candidate continuations in parallel. However, the hyperparameters controlling the tree structure are static, which limits flexibility and efficiency across diverse contexts and domains. We introduce Reinforcement learning for Speculative Sampling (Re-SpS), the first reinforcement learning (RL)-based framework for draft tree hyperparameter optimization. Re-SpS dynamically adjusts draft tree hyperparameters in real time, learning context-aware policies that maximize generation speed by balancing speculative aggression with computational overhead. It leverages efficient state representations from target model hidden states and introduces multi-step action persistence for better context modeling. Evaluation results across five diverse benchmarks demonstrate consistent improvements over the SOTA method EAGLE-3, achieving up to 5.45$\times$ speedup over the backbone LLM and up to 1.12$\times$ speedup compared to EAGLE-3, with no loss in output fidelity.
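
As background for the "no loss in output fidelity" claim, the basic single-token speculative sampling rule is sketched below: a drafted token is accepted with probability min(1, p/q), and on rejection a replacement is drawn from the normalized positive part of p - q. This is the generic rule only; tree-based drafting as in EAGLE-3 and the RL controller of Re-SpS are not modeled, and the function names are illustrative.

```python
def acceptance_probs(target, draft):
    """Per-token probability that a drafted token is accepted.

    `target` and `draft` are next-token distributions over the same
    vocabulary (lists of probabilities).
    """
    return [min(1.0, p / q) if q > 0 else 1.0 for p, q in zip(target, draft)]


def residual_distribution(target, draft):
    """Distribution to resample from after a rejection."""
    resid = [max(0.0, p - q) for p, q in zip(target, draft)]
    total = sum(resid)
    # If the two distributions coincide, nothing is ever rejected.
    return [r / total for r in resid] if total > 0 else list(target)
```

Accept-or-resample with these two quantities yields samples distributed exactly as the target model's, which is why tuning the draft side can change speed without changing outputs.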

Optimal Power Allocation and Sub-Optimal Channel Assignment for Downlink NOMA Systems Using Deep Reinforcement Learning

Authors: WooSeok Kim, Jeonghoon Lee, Sangho Kim, Taesun An, WonMin Lee, Dowon Kim, Kyungseop Shin
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2601.12242
Pdf link: https://arxiv.org/pdf/2601.12242
Abstract In recent years, the Non-Orthogonal Multiple Access (NOMA) system has emerged as a promising candidate for multiple access frameworks, and the evolution of deep machine learning has prompted attempts to incorporate it into the NOMA system. The main motivation for such active studies is the growing need to optimize the utilization of network resources, as the expansion of the Internet of Things (IoT) has caused a scarcity of network resources. NOMA addresses this need by power multiplexing, allowing multiple users to access the network simultaneously. Nevertheless, the NOMA system has a few limitations. Several works have proposed ways to mitigate them, including the optimization of power allocation known as the joint resource allocation (JRA) method, and the integration of the JRA method with deep reinforcement learning (JRA-DRL). Despite this, the channel assignment problem remains unclear and requires further investigation. In this paper, we propose a deep reinforcement learning framework incorporating replay memory with an on-policy algorithm, allocating network resources in a NOMA system to generalize the learning. We also provide extensive simulations to evaluate the effects of varying the learning rate, batch size, type of model, and the number of features in the state.

Beyond the Dirac Delta: Mitigating Diversity Collapse in Reinforcement Fine-Tuning for Versatile Image Generation

Authors: Jinmei Liu, Haoru Li, Zhenhong Sun, Chaofeng Chen, Yatao Bian, Bo Wang, Daoyi Dong, Chunlin Chen, Zhi Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.12401
Pdf link: https://arxiv.org/pdf/2601.12401
Abstract Reinforcement learning (RL) has emerged as a powerful paradigm for fine-tuning large-scale generative models, such as diffusion and flow models, to align with complex human preferences and user-specified tasks. A fundamental limitation remains the curse of diversity collapse, where the objective formulation and optimization landscape inherently collapse the policy to a Dirac delta distribution. To address this challenge, we propose DRIFT (DiveRsity-Incentivized Reinforcement Fine-Tuning for Versatile Image Generation), an innovative framework that systematically incentivizes output diversity throughout the on-policy fine-tuning process, reconciling strong task alignment with high generation diversity to enhance versatility essential for applications that demand diverse candidate generations. We approach the problem across three representative perspectives: i) sampling a reward-concentrated subset that filters out reward outliers to prevent premature collapse; ii) prompting with stochastic variations to expand the conditioning space; and iii) optimization of the intra-group diversity with a potential-based reward shaping mechanism. Experimental results show that DRIFT achieves superior Pareto dominance regarding task alignment and generation diversity, yielding a $9.08\% \sim 43.46\%$ increase in diversity at equivalent alignment levels and a $59.65\% \sim 65.86\%$ increase in alignment at equivalent levels of diversity.
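
Potential-based reward shaping, the mechanism named in perspective iii), has a standard form: each step's reward is augmented by gamma * Phi(s') - Phi(s). The sketch below shows only that generic form; what DRIFT uses as the potential (presumably some intra-group diversity measure) is not specified in the abstract, so `potentials` here is a hypothetical input.

```python
def shape_rewards(rewards, potentials, gamma=1.0):
    """Apply potential-based shaping along one trajectory.

    rewards[t] belongs to the transition s_t -> s_{t+1};
    potentials[t] is Phi(s_t), so len(potentials) == len(rewards) + 1.
    """
    return [
        r + gamma * potentials[t + 1] - potentials[t]
        for t, r in enumerate(rewards)
    ]
```

With `gamma=1.0` the shaping terms telescope, so the total return changes only by Phi(s_T) - Phi(s_0); intermediate steps gain denser feedback without the trajectory-level objective being distorted.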

RLMiner: Finding the Most Frequent k-sized Subgraph via Reinforcement Learning

Authors: Wei Huang, Hanchen Wang, Dong Wen, Xin Cao, Ying Zhang, Wenjie Zhang
Subjects: Databases (cs.DB)
Arxiv link: https://arxiv.org/abs/2601.12416
Pdf link: https://arxiv.org/pdf/2601.12416
Abstract Identifying the most frequent induced subgraph of size $k$ in a target graph is a fundamental graph mining problem with direct implications for Web-related data mining and social network analysis. Despite its importance, finding the most frequent induced subgraph remains computationally expensive due to the NP-hard nature of the subgraph counting task. Traditional exact enumeration algorithms often suffer from high time complexity, especially for a large graph size $k$. To mitigate this, existing approaches often utilize frequency measurement with the Downward Closure Property to reduce the search space, imposing additional constraints on the task. In this paper, we first formulate this task as a Markov Decision Process and approach it using a multi-task reinforcement learning framework. Specifically, we introduce RLMiner, a novel framework that integrates reinforcement learning with our proposed task-state-aware Graph Neural Network to find the most frequent induced subgraph of size $k$ with a time complexity linear to $k$. Extensive experiments on real-world datasets demonstrate that our proposed RLMiner effectively identifies subgraphs with frequencies closely matching the ground-truth most frequent induced subgraphs, while achieving significantly shorter and more stable running times compared to traditional methods.
中文摘要 在目标图中识别大小为$k$的最频繁诱导子图是一个基础的图挖掘问题，对Web相关数据挖掘和社交网络分析有直接影响。尽管这一问题十分重要，但由于子图计数任务具有NP难性，寻找最频繁诱导子图的计算成本仍然很高。传统的精确枚举算法通常时间复杂度较高，在$k$较大时尤为明显。为缓解这一问题，现有方法通常采用满足向下闭包性质（Downward Closure Property）的频率度量来缩小搜索空间，但这给任务施加了额外约束。本文首先将该任务表述为马尔可夫决策过程，并采用多任务强化学习框架来求解。具体来说，我们提出了RLMiner，这是一个新颖的框架，将强化学习与我们提出的任务状态感知图神经网络相结合，以关于$k$线性的时间复杂度寻找大小为$k$的最频繁诱导子图。在真实世界数据集上的大量实验表明，RLMiner能够有效识别出频率与真实最频繁诱导子图高度接近的子图，同时运行时间比传统方法显著更短、更稳定。

ReWorld: Multi-Dimensional Reward Modeling for Embodied World Models

ReWorld：具身世界模型的多维奖励建模

Authors: Baorui Peng, Wenyao Zhang, Liang Xu, Zekun Qi, Jiazhao Zhang, Hongsi Liu, Wenjun Zeng, Xin Jin
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.12428
Pdf link: https://arxiv.org/pdf/2601.12428
Abstract Recently, video-based world models that learn to simulate the dynamics have gained increasing attention in robot learning. However, current approaches primarily emphasize visual generative quality while overlooking physical fidelity, dynamic consistency, and task logic, especially for contact-rich manipulation tasks, which limits their applicability to downstream tasks. To this end, we introduce ReWorld, a framework aimed to employ reinforcement learning to align the video-based embodied world models with physical realism, task completion capability, embodiment plausibility and visual quality. Specifically, we first construct a large-scale (~235K) video preference dataset and employ it to train a hierarchical reward model designed to capture multi-dimensional reward consistent with human preferences. We further propose a practical alignment algorithm that post-trains flow-based world models using this reward through a computationally efficient PPO-style algorithm. Comprehensive experiments and theoretical analysis demonstrate that ReWorld significantly improves the physical fidelity, logical coherence, embodiment and visual quality of generated rollouts, outperforming previous methods.
中文摘要 近年来，学习模拟动力学的基于视频的世界模型在机器人学习中受到越来越多的关注。然而，当前方法主要强调视觉生成质量，而忽视了物理保真度、动态一致性和任务逻辑，在接触丰富的操作任务中尤其如此，这限制了其在下游任务中的适用性。为此，我们提出了ReWorld框架，旨在利用强化学习使基于视频的具身世界模型在物理真实性、任务完成能力、具身合理性和视觉质量方面得到对齐。具体来说，我们首先构建了一个大规模（约235K）视频偏好数据集，并用它训练一个分层奖励模型，以捕捉符合人类偏好的多维奖励。我们进一步提出了一种实用的对齐算法，通过计算高效的PPO式算法，利用该奖励对基于流的世界模型进行后训练。全面的实验和理论分析表明，ReWorld显著提升了生成推演（rollout）的物理保真度、逻辑连贯性、具身合理性和视觉质量，优于以往方法。

Incentivizing In-depth Reasoning over Long Contexts with Process Advantage Shaping

通过过程优势塑形激励长上下文中的深度推理

Authors: Miao Peng, Weizhou Shen, Nuo Chen, Chenliang Li, Ming Yan, Jia Li
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.12465
Pdf link: https://arxiv.org/pdf/2601.12465
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective in enhancing LLMs' short-context reasoning, but its performance degrades in long-context scenarios that require both precise grounding and robust long-range reasoning. We identify the "almost-there" phenomenon in long-context reasoning, where trajectories are largely correct but fail at the final step, and attribute this failure to two factors: (1) the lack of high reasoning density in long-context QA data that push LLMs beyond mere grounding toward sophisticated multi-hop reasoning; and (2) the loss of valuable learning signals during long-context RL training due to the indiscriminate penalization of partially correct trajectories with incorrect outcomes. To overcome this bottleneck, we propose DeepReasonQA, a KG-driven synthesis framework that controllably constructs high-difficulty, multi-hop long-context QA pairs with inherent reasoning chains. Building on this, we introduce Long-context Process Advantage Shaping (LongPAS), a simple yet effective method that performs fine-grained credit assignment by evaluating reasoning steps along Validity and Relevance dimensions, which captures critical learning signals from "almost-there" trajectories. Experiments on three long-context reasoning benchmarks show that our approach substantially outperforms RLVR baselines and matches frontier LLMs while using far fewer parameters. Further analysis confirms the effectiveness of our methods in strengthening long-context reasoning while maintaining stable RL training.
中文摘要 带可验证奖励的强化学习（RLVR）已被证明能有效增强LLM的短上下文推理能力，但在既需要精确定位（grounding）又需要稳健长程推理的长上下文场景中，其性能会下降。我们发现了长上下文推理中的“几乎成功”（almost-there）现象，即轨迹大体正确但在最后一步失败，并将这种失败归因于两个因素：（1）长上下文问答（QA）数据缺乏高推理密度，无法推动LLM从单纯的定位走向复杂的多跳推理；（2）在长上下文强化学习训练中，结果错误但部分正确的轨迹受到无差别惩罚，导致宝贵的学习信号丢失。为克服这一瓶颈，我们提出了DeepReasonQA，一种由知识图谱（KG）驱动的合成框架，能够可控地构建带有内在推理链的高难度、多跳长上下文问答对。在此基础上，我们引入了长上下文过程优势塑形（LongPAS），这是一种简单而有效的方法，从有效性（Validity）和相关性（Relevance）两个维度评估推理步骤，实现细粒度的信用分配，从而捕捉“几乎成功”轨迹中的关键学习信号。在三个长上下文推理基准上的实验表明，我们的方法大幅优于RLVR基线，并以少得多的参数量达到与前沿LLM相当的水平。进一步的分析证实，我们的方法能够在保持强化学习训练稳定的同时有效增强长上下文推理能力。

Agentic Reasoning for Large Language Models

大型语言模型的智能体推理

Authors: Tianxin Wei, Ting-Wei Li, Zhining Liu, Xuying Ning, Ze Yang, Jiaru Zou, Zhichen Zeng, Ruizhong Qiu, Xiao Lin, Dongqi Fu, Zihao Li, Mengting Ai, Duo Zhou, Wenxuan Bao, Yunzhe Li, Gaotang Li, Cheng Qian, Yu Wang, Xiangru Tang, Yin Xiao, Liri Fang, Hui Liu, Xianfeng Tang, Yuji Zhang, Chi Wang, Jiaxuan You, Heng Ji, Hanghang Tong, Jingrui He
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.12538
Pdf link: https://arxiv.org/pdf/2601.12538
Abstract Reasoning is a fundamental cognitive process underlying inference, problem-solving, and decision-making. While large language models (LLMs) demonstrate strong reasoning capabilities in closed-world settings, they struggle in open-ended and dynamic environments. Agentic reasoning marks a paradigm shift by reframing LLMs as autonomous agents that plan, act, and learn through continual interaction. In this survey, we organize agentic reasoning along three complementary dimensions. First, we characterize environmental dynamics through three layers: foundational agentic reasoning, which establishes core single-agent capabilities including planning, tool use, and search in stable environments; self-evolving agentic reasoning, which studies how agents refine these capabilities through feedback, memory, and adaptation; and collective multi-agent reasoning, which extends intelligence to collaborative settings involving coordination, knowledge sharing, and shared goals. Across these layers, we distinguish in-context reasoning, which scales test-time interaction through structured orchestration, from post-training reasoning, which optimizes behaviors via reinforcement learning and supervised fine-tuning. We further review representative agentic reasoning frameworks across real-world applications and benchmarks, including science, robotics, healthcare, autonomous research, and mathematics. This survey synthesizes agentic reasoning methods into a unified roadmap bridging thought and action, and outlines open challenges and future directions, including personalization, long-horizon interaction, world modeling, scalable multi-agent training, and governance for real-world deployment.
中文摘要 推理是支撑推断、问题求解和决策的基本认知过程。虽然大型语言模型（LLM）在封闭世界环境中展现出强大的推理能力，但在开放式和动态环境中仍表现不佳。智能体推理（agentic reasoning）将LLM重新定位为通过持续交互进行规划、行动和学习的自主智能体，标志着一次范式转变。在本综述中，我们沿三个互补的维度梳理智能体推理。首先，我们通过三个层次刻画环境动态：基础智能体推理，在稳定环境中建立规划、工具使用和搜索等核心单智能体能力；自我进化智能体推理，研究智能体如何通过反馈、记忆和适应来完善这些能力；以及集体多智能体推理，将智能扩展到涉及协调、知识共享和共同目标的协作场景。在这些层次中，我们区分了上下文内推理（通过结构化编排扩展测试时交互）和后训练推理（通过强化学习和监督微调优化行为）。我们还回顾了面向现实世界应用和基准的代表性智能体推理框架，涵盖科学、机器人、医疗保健、自主研究和数学等领域。本综述将智能体推理方法整合为一条连接思考与行动的统一路线图，并概述了开放性挑战和未来方向，包括个性化、长时程交互、世界建模、可扩展的多智能体训练以及面向现实部署的治理。

STEP-LLM: Generating CAD STEP Models from Natural Language with Large Language Models

STEP-LLM：利用大型语言模型从自然语言生成CAD STEP模型

Authors: Xiangyu Shi, Junyang Ding, Xu Zhao, Sinong Zhan, Payal Mohapatra, Daniel Quispe, Kojo Welbeck, Jian Cao, Wei Chen, Ping Guo, Qi Zhu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.12641
Pdf link: https://arxiv.org/pdf/2601.12641
Abstract Computer-aided design (CAD) is vital to modern manufacturing, yet model creation remains labor-intensive and expertise-heavy. To enable non-experts to translate intuitive design intent into manufacturable artifacts, recent large language model-based text-to-CAD efforts focus on command sequences or script-based formats like CadQuery. However, these formats are kernel-dependent and lack universality for manufacturing. In contrast, the Standard for the Exchange of Product Data (STEP, ISO 10303) file is a widely adopted, neutral boundary representation (B-rep) format directly compatible with manufacturing, but its graph-structured, cross-referenced nature poses unique challenges for auto-regressive LLMs. To address this, we curate a dataset of ~40K STEP-caption pairs and introduce novel preprocessing tailored for the graph-structured format of STEP, including a depth-first search-based reserialization that linearizes cross-references while preserving locality, and chain-of-thought (CoT)-style structural annotations that guide global coherence. We integrate retrieval-augmented generation to ground predictions in relevant examples for supervised fine-tuning, and refine generation quality through reinforcement learning with a specific Chamfer Distance-based geometric reward. Experiments demonstrate consistent gains of our STEP-LLM in geometric fidelity over the Text2CAD baseline, with improvements arising from multiple stages of our framework: the RAG module substantially enhances completeness and renderability, the DFS-based reserialization strengthens overall accuracy, and the RL further reduces geometric discrepancy. Both metrics and visual comparisons confirm that STEP-LLM generates shapes with higher fidelity than Text2CAD. These results show the feasibility of LLM-driven STEP model generation from natural language and highlight its potential to democratize CAD design for manufacturing.
中文摘要 计算机辅助设计（CAD）对现代制造至关重要，但模型创建仍然耗费人力且高度依赖专业知识。为了让非专家能够将直观的设计意图转化为可制造的工件，近期基于大型语言模型的文本到CAD工作主要关注命令序列或CadQuery等基于脚本的格式。然而，这些格式依赖于几何内核，在制造场景中缺乏通用性。相比之下，产品数据交换标准（STEP，ISO 10303）文件是一种被广泛采用的中性边界表示（B-rep）格式，可直接用于制造，但其图结构、交叉引用的特性给自回归LLM带来了独特挑战。为此，我们整理了一个约4万个STEP-描述对的数据集，并针对STEP的图结构格式引入了新的预处理方法，包括基于深度优先搜索（DFS）的重序列化（在保持局部性的同时将交叉引用线性化），以及用于引导全局一致性的思维链（CoT）式结构注释。我们在监督微调中引入检索增强生成（RAG），使预测以相关示例为依据，并通过采用基于倒角距离（Chamfer Distance）的几何奖励的强化学习进一步提升生成质量。实验表明，STEP-LLM在几何保真度上持续优于Text2CAD基线，且提升来自框架的多个阶段：RAG模块显著提高了完整性和可渲染性，基于DFS的重序列化增强了整体准确性，强化学习则进一步减小了几何差异。指标和可视化对比均证实STEP-LLM生成的形状保真度高于Text2CAD。这些结果表明了由LLM驱动、从自然语言生成STEP模型的可行性，并显示出其在普及面向制造的CAD设计方面的潜力。
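
The abstract mentions a Chamfer Distance-based geometric reward but gives no formula. The following is a minimal sketch of one common variant (mean squared nearest-neighbour distance, summed over both directions), assuming shapes have already been sampled into 3D point sets. The function names, the exponential reward mapping, and the `scale` parameter are illustrative assumptions, not the paper's implementation.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty 3D point sets.

    Assumed variant: mean squared nearest-neighbour distance in each
    direction, added together. Other definitions use unsquared distances.
    """
    def mean_nearest_sq(src, dst):
        total = 0.0
        for p in src:
            # Squared Euclidean distance from p to its nearest point in dst.
            total += min(sum((pi - qi) ** 2 for pi, qi in zip(p, q)) for q in dst)
        return total / len(src)

    return mean_nearest_sq(points_a, points_b) + mean_nearest_sq(points_b, points_a)


def geometric_reward(points_pred, points_ref, scale=1.0):
    """Hypothetical reward in (0, 1]: 1.0 for identical sets, decaying with distance."""
    return math.exp(-scale * chamfer_distance(points_pred, points_ref))


# Usage: a unit square compared with a copy lifted by 0.5 along z.
square = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
lifted = [(x, y, z + 0.5) for x, y, z in square]
print(chamfer_distance(square, lifted))  # 0.25 in each direction, 0.5 in total
```

This brute-force version is quadratic in the number of points; a spatial index would be the usual choice for dense samples.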

Multiagent Reinforcement Learning in Enhancing Resilience of Microgrids under Extreme Weather Events

多智能体强化学习用于增强极端天气事件下微电网的韧性

Authors: Yin Wu, Wei-Yu Chiu, Yuan-Po Tsai, Shangyuan Liu, Weiqi Hua
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2601.12657
Pdf link: https://arxiv.org/pdf/2601.12657
Abstract Grid resilience is crucial in light of power interruptions caused by increasingly frequent extreme weather events. Well-designed energy management systems (EMS) have made progress in improving microgrid resilience through the coordination of distributed energy resources (DERs), but still face significant challenges in addressing the uncertainty of load demand caused by extreme weather. The integration of deep reinforcement learning (DRL) into EMS design enables optimized microgrid control strategies for coordinating DERs. Building on this, we proposed a cooperative multi-agent deep reinforcement learning (MADRL)-based EMS framework to provide flexible scalability for microgrids, enhance resilience and reduce operational costs during power outages. Specifically, the gated recurrent unit with a gating mechanism was introduced to extract features from temporal data, which enables the EMS to coordinate DERs more efficiently. Next, the proposed MADRL method incorporating action masking techniques was evaluated in the IEEE 33-Bus system using real-world data on renewable generation and power load. Finally, the numerical results demonstrated the superiority of the proposed method in reducing operating costs as well as the effectiveness in enhancing microgrid resilience during power interruptions.
中文摘要 鉴于日益频繁的极端天气事件会导致供电中断，电网韧性至关重要。设计良好的能源管理系统（EMS）通过协调分布式能源资源（DER）在提升微电网韧性方面取得了进展，但在应对极端天气引起的负荷需求不确定性方面仍面临重大挑战。将深度强化学习（DRL）融入EMS设计，可以得到用于协调DER的优化微电网控制策略。在此基础上，我们提出了一个基于协作式多智能体深度强化学习（MADRL）的EMS框架，为微电网提供灵活的可扩展性，并在停电期间增强韧性、降低运营成本。具体来说，我们引入带门控机制的门控循环单元，从时序数据中提取特征，使EMS能够更高效地协调DER。随后，我们利用可再生能源发电和电力负荷的真实数据，在IEEE 33-Bus系统中评估了结合动作掩码技术的MADRL方法。最后，数值结果证明了该方法在降低运营成本方面的优越性，以及在供电中断期间增强微电网韧性的有效性。

Decentralized Learning Strategies for Estimation Error Minimization with Graph Neural Networks

利用图神经网络实现估计误差最小化的去中心化学习策略

Authors: Xingran Chen, Navid NaderiAlizadeh, Alejandro Ribeiro, Shirin Saeedi Bidokhti
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2601.12662
Pdf link: https://arxiv.org/pdf/2601.12662
Abstract We address real-time sampling and estimation of autoregressive Markovian sources in dynamic yet structurally similar multi-hop wireless networks. Each node caches samples from others and communicates over wireless collision channels, aiming to minimize time-average estimation error via decentralized policies. Due to the high dimensionality of action spaces and complexity of network topologies, deriving optimal policies analytically is intractable. To address this, we propose a graphical multi-agent reinforcement learning framework for policy optimization. Theoretically, we demonstrate that our proposed policies are transferable, allowing a policy trained on one graph to be effectively applied to structurally similar graphs. Numerical experiments demonstrate that (i) our proposed policy outperforms state-of-the-art baselines; (ii) the trained policies are transferable to larger networks, with performance gains increasing with the number of agents; (iii) the graphical training procedure withstands non-stationarity, even when using independent learning techniques; and (iv) recurrence is pivotal in both independent learning and centralized training and decentralized execution, and improves the resilience to non-stationarity.
中文摘要 我们研究在动态但结构相似的多跳无线网络中对自回归马尔可夫信源进行实时采样和估计的问题。每个节点缓存来自其他节点的样本，并通过无线碰撞信道通信，目标是通过去中心化策略最小化时间平均估计误差。由于动作空间维度高且网络拓扑复杂，解析地推导最优策略是不可行的。为此，我们提出了一个用于策略优化的图多智能体强化学习框架。在理论上，我们证明了所提策略具有可迁移性，即在一个图上训练的策略可以有效应用于结构相似的图。数值实验表明：（i）所提策略优于最先进的基线；（ii）训练好的策略可迁移到更大的网络，且性能增益随智能体数量的增加而增大；（iii）即使使用独立学习技术，图训练过程也能抵御非平稳性；（iv）循环结构（recurrence）在独立学习以及集中训练、分散执行两种范式中都至关重要，并能提升对非平稳性的韧性。

Resource-Conscious RL Algorithms for Deep Brain Stimulation

资源意识型强化学习算法用于深脑刺激

Authors: Arkaprava Gupta, Nicholas Carter, William Zellers, Prateek Ganguli, Benedikt Dietrich, Vibhor Krishna, Parasara Sridhar Duggirala, Samarjit Chakraborty
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2601.12699
Pdf link: https://arxiv.org/pdf/2601.12699
Abstract Deep Brain Stimulation (DBS) has proven to be a promising treatment of Parkinson's Disease (PD). DBS involves stimulating specific regions of the brain's Basal Ganglia (BG) using electric impulses to alleviate symptoms of PD such as tremors, rigidity, and bradykinesia. Although most clinical DBS approaches today use a fixed frequency and amplitude, they suffer from side effects (such as slurring of speech) and shortened battery life of the implant. Reinforcement learning (RL) approaches have been used in recent research to perform DBS in a more adaptive manner to improve overall patient outcome. These RL algorithms are, however, too complex to be trained in vivo due to their long convergence time and requirement of high computational resources. We propose a new Time & Threshold-Triggered Multi-Armed Bandit (T3P MAB) RL approach for DBS that is more effective than existing algorithms. Further, our T3P agent is lightweight enough to be deployed in the implant, unlike current deep-RL strategies, and even forgoes the need for an offline training phase. Additionally, most existing RL approaches have focused on modulating only frequency or amplitude, and the possibility of tuning them together remains greatly unexplored in the literature. Our RL agent can tune both frequency and amplitude of DBS signals to the brain with better sample efficiency and requires minimal time to converge. We implement an MAB agent for DBS for the first time on hardware to report energy measurements and prove its suitability for resource-constrained platforms. Our T3P MAB algorithm is deployed on a variety of microcontroller unit (MCU) setups to show its efficiency in terms of power consumption as opposed to other existing RL approaches used in recent work.
中文摘要 深脑刺激（DBS）已被证明是治疗帕金森病（PD）的一种有前景的方法。DBS使用电脉冲刺激大脑基底节（BG）的特定区域，以缓解震颤、僵硬和运动迟缓等帕金森病症状。目前大多数临床DBS方法采用固定的频率和幅度，但存在副作用（如言语含糊）以及植入体电池寿命缩短的问题。近期研究采用强化学习（RL）方法以更自适应的方式实施DBS，以改善患者的整体疗效。然而，这些强化学习算法收敛时间长、计算资源需求高，过于复杂，无法在体内训练。我们提出了一种新的时间与阈值触发多臂老虎机（T3P MAB）强化学习方法用于DBS，其效果优于现有算法。此外，与现有的深度强化学习策略不同，我们的T3P智能体足够轻量，可以部署在植入体中，甚至无需离线训练阶段。另外，大多数现有强化学习方法只调节频率或幅度之一，同时调节二者的可能性在文献中仍鲜有探索。我们的强化学习智能体能够同时调节DBS信号的频率和幅度，样本效率更高，且收敛所需时间极短。我们首次在硬件上实现了用于DBS的MAB智能体，报告了能耗测量结果，并证明其适用于资源受限的平台。我们将T3P MAB算法部署在多种微控制器单元（MCU）配置上，以展示其相对于近期工作中其他强化学习方法在功耗方面的效率。
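
To make the idea of a bandit that tunes frequency and amplitude jointly more concrete, here is a minimal sketch using the textbook UCB1 rule, where each arm is one (frequency, amplitude) pair. It is not the paper's T3P algorithm: the time and threshold triggers are not modelled, and the arm grid, the 130 Hz / 2.0 mA optimum, and the `simulated_reward` patient model are all hypothetical placeholders.

```python
import math
import random


class UCB1Bandit:
    """Generic UCB1 bandit; each arm is one (frequency_hz, amplitude_ma) setting."""

    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = [0] * len(self.arms)   # pulls per arm
        self.means = [0.0] * len(self.arms)  # running mean reward per arm
        self.total = 0                       # total pulls so far

    def select(self):
        # Try every arm once before trusting the confidence bounds.
        for index, count in enumerate(self.counts):
            if count == 0:
                return index
        scores = [
            mean + math.sqrt(2.0 * math.log(self.total) / count)
            for mean, count in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, index, reward):
        self.total += 1
        self.counts[index] += 1
        # Incremental mean: constant memory per arm, no reward history stored.
        self.means[index] += (reward - self.means[index]) / self.counts[index]


def simulated_reward(arm, rng):
    """Hypothetical patient model: best response near 130 Hz and 2.0 mA."""
    frequency_hz, amplitude_ma = arm
    mean = 1.0 - 0.5 * abs(frequency_hz - 130) / 130 - 0.3 * abs(amplitude_ma - 2.0)
    return mean + rng.gauss(0.0, 0.1)


def run_simulation(rounds=5000, seed=0):
    rng = random.Random(seed)
    arms = [(f, a) for f in (60, 130, 180) for a in (1.0, 2.0, 3.0)]
    bandit = UCB1Bandit(arms)
    for _ in range(rounds):
        index = bandit.select()
        bandit.update(index, simulated_reward(bandit.arms[index], rng))
    return bandit
```

The per-arm state is two numbers, which is the kind of footprint that makes bandit methods plausible on a microcontroller, although this sketch says nothing about the energy figures reported in the paper.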

Decoding Rewards in Competitive Games: Inverse Game Theory with Entropy Regularization

解码竞争博弈中的奖励：带熵正则化的逆博弈论

Authors: Junyi Liao, Zihan Zhu, Ethan Fang, Zhuoran Yang, Vahid Tarokh
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2601.12707
Pdf link: https://arxiv.org/pdf/2601.12707
Abstract Estimating the unknown reward functions driving agents' behaviors is of central interest in inverse reinforcement learning and game theory. To tackle this problem, we develop a unified framework for reward function recovery in two-player zero-sum matrix games and Markov games with entropy regularization, where we aim to reconstruct the underlying reward functions given observed players' strategies and actions. This task is challenging due to the inherent ambiguity of inverse problems, the non-uniqueness of feasible rewards, and limited observational data coverage. To address these challenges, we establish the reward function's identifiability using the quantal response equilibrium (QRE) under linear assumptions. Building upon this theoretical foundation, we propose a novel algorithm to learn reward functions from observed actions. Our algorithm works in both static and dynamic settings and is adaptable to incorporate different methods, such as Maximum Likelihood Estimation (MLE). We provide strong theoretical guarantees for the reliability and sample efficiency of our algorithm. Further, we conduct extensive numerical studies to demonstrate the practical effectiveness of the proposed framework, offering new insights into decision-making in competitive environments.
中文摘要 估计驱动智能体行为的未知奖励函数是逆强化学习和博弈论关注的核心问题。为解决这一问题，我们针对带熵正则化的两人零和矩阵博弈和马尔可夫博弈，建立了一个统一的奖励函数恢复框架，目标是根据观测到的玩家策略和动作重建潜在的奖励函数。由于逆问题固有的模糊性、可行奖励的非唯一性以及观测数据覆盖有限，这一任务具有挑战性。为应对这些挑战，我们在线性假设下利用量化响应均衡（QRE）建立了奖励函数的可识别性。基于这一理论基础，我们提出了一种从观测到的动作中学习奖励函数的新算法。该算法适用于静态和动态两种设定，并可灵活结合不同方法，如最大似然估计（MLE）。我们为算法的可靠性和样本效率提供了强有力的理论保证。此外，我们进行了大量数值研究，以展示所提框架的实际有效性，为竞争环境中的决策提供了新的见解。
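
As background for the quantal response equilibrium (QRE) mentioned above, the sketch below computes the QRE of a small entropy-regularized zero-sum matrix game by damped fixed-point iteration on the two softmax responses. This is the forward problem only, not the paper's reward-recovery algorithm. The solver, its parameter names, and the damping step are assumptions chosen for illustration, and this simple iteration is not guaranteed to converge for every game or temperature.

```python
import math


def softmax(values, temperature):
    """Numerically stable softmax of values / temperature."""
    peak = max(values)
    weights = [math.exp((v - peak) / temperature) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def solve_qre(payoff, temperature=1.0, step=0.5, iterations=500, x0=None, y0=None):
    """Damped fixed-point iteration for the QRE of a zero-sum matrix game.

    payoff[i][j] is the row player's reward; the column player receives
    its negative. Returns the mixed strategies (x, y).
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    x = list(x0) if x0 else [1.0 / n_rows] * n_rows
    y = list(y0) if y0 else [1.0 / n_cols] * n_cols
    for _ in range(iterations):
        row_values = [sum(payoff[i][j] * y[j] for j in range(n_cols)) for i in range(n_rows)]
        col_values = [-sum(payoff[i][j] * x[i] for i in range(n_rows)) for j in range(n_cols)]
        target_x = softmax(row_values, temperature)
        target_y = softmax(col_values, temperature)
        # Move only part of the way toward the softmax response to damp oscillation.
        x = [(1 - step) * xi + step * ti for xi, ti in zip(x, target_x)]
        y = [(1 - step) * yi + step * ti for yi, ti in zip(y, target_y)]
    return x, y


def implied_payoff_gap(x, i, k, temperature=1.0):
    """At a QRE, temperature * log(x[i] / x[k]) equals the expected payoff gap of rows i and k."""
    return temperature * math.log(x[i] / x[k])


# Usage: matching pennies, started away from equilibrium.
matching_pennies = [[1.0, -1.0], [-1.0, 1.0]]
x, y = solve_qre(matching_pennies, x0=[0.9, 0.1], y0=[0.2, 0.8])
```

The relation in `implied_payoff_gap` shows why observed strategies carry information about rewards: log-ratios of action probabilities reveal expected payoff differences, but only differences, which is one source of the non-uniqueness the abstract refers to.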

Teaching Large Reasoning Models Effective Reflection

教授大型推理模型有效反思

Authors: Hanbin Wang, Jingwei Song, Jinpeng Li, Qi Zhu, Fei Mi, Ganqu Cui, Yasheng Wang, Lifeng Shang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.12720
Pdf link: https://arxiv.org/pdf/2601.12720
Abstract Large Reasoning Models (LRMs) have recently shown impressive performance on complex reasoning tasks, often by engaging in self-reflective behaviors such as self-critique and backtracking. However, not all reflections are beneficial: many are superficial, offering little to no improvement over the original answer and incurring computation overhead. In this paper, we identify and address the problem of superficial reflection in LRMs. We first propose Self-Critique Fine-Tuning (SCFT), a training framework that enhances the model's reflective reasoning ability using only self-generated critiques. SCFT prompts models to critique their own outputs, filters high-quality critiques through rejection sampling, and fine-tunes the model using a critique-based objective. Building on this strong foundation, we further introduce Reinforcement Learning with Effective Reflection Rewards (RLERR). RLERR leverages the high-quality reflections initialized by SCFT to construct reward signals, guiding the model to internalize the self-correction process via reinforcement learning. Experiments on two challenging benchmarks, AIME2024 and AIME2025, show that SCFT and RLERR significantly improve both reasoning accuracy and reflection quality, outperforming state-of-the-art baselines. All data and code are available at this https URL.
中文摘要 大型推理模型（LRM）最近在复杂推理任务上表现出色，这通常得益于自我批评和回溯等自我反思行为。然而，并非所有反思都有益：许多反思流于表面，相对原始答案几乎没有改进，还带来了额外的计算开销。本文指出并解决了LRM中的表层反思问题。我们首先提出自我批评微调（SCFT），这是一种仅使用自我生成的批评来增强模型反思推理能力的训练框架。SCFT提示模型对自己的输出进行批评，通过拒绝采样筛选出高质量的批评，并以基于批评的目标对模型进行微调。在这一坚实基础上，我们进一步引入了带有效反思奖励的强化学习（RLERR）。RLERR利用由SCFT初始化的高质量反思构建奖励信号，引导模型通过强化学习内化自我纠正过程。在AIME2024和AIME2025两个具有挑战性的基准上的实验表明，SCFT和RLERR显著提升了推理准确性和反思质量，优于最先进的基线。所有数据和代码均可在此 https URL 获取。

Distribution-Centric Policy Optimization Dominates Exploration-Exploitation Trade-off

以分布为中心的策略优化主导探索与利用的权衡

Authors: Zhaochun Li, Chen Wang, Jionghao Bai, Shisheng Cui, Ge Lan, Zhou Zhao, Yue Wang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.12730
Pdf link: https://arxiv.org/pdf/2601.12730
Abstract The exploration-exploitation (EE) trade-off is a central challenge in reinforcement learning (RL) for large language models (LLMs). With Group Relative Policy Optimization (GRPO), training tends to be exploitation driven: entropy decreases monotonically, samples converge, and exploration fades. Most existing fixes are sample-centric: they seek out or give bonuses to rare samples, assuming exploration comes from novel trajectories and tokens. These heuristics depend on the "luck" of informative samples, lack principled control of the policy, and often yield limited or inconsistent gains. In this work, we are the first to introduce a distribution-centric perspective for RL, in which exploration is always guided by a "better" target distribution, and reveal that a policy's ability to resist entropy collapse is governed by the distribution itself rather than individual samples. Building on this insight, we propose Distribution-Centric Policy Optimization (DCPO), which reformulates entropy regulation as distribution-level regularization. DCPO achieves controllable entropy fully on-policy without sampling from external distributions, enabling efficient exploration while maintaining training stability. Across multiple models and seven benchmarks, DCPO improves over GRPO by about 20% on average. Overall, DCPO replaces sample-level heuristics with distribution-level principles, offering a theoretically grounded and flexible framework for controllable exploration and a stronger EE trade-off. The code is available in this https URL.
中文摘要 探索与利用（EE）的权衡是大型语言模型（LLM）强化学习（RL）中的核心挑战。在群组相对策略优化（GRPO）中，训练往往由利用主导：熵单调下降，样本趋同，探索逐渐消失。大多数现有的改进方法都是以样本为中心（sample-centric）的：它们寻找稀有样本或给予其额外奖励，假设探索来自新颖的轨迹和token。这些启发式方法依赖于能否“碰巧”获得信息丰富的样本，缺乏对策略的原则性控制，且收益往往有限或不稳定。在本研究中，我们首次为强化学习引入以分布为中心（distribution-centric）的视角，其中探索始终由一个“更好”的目标分布引导，并揭示了策略抵抗熵坍缩的能力取决于分布本身，而非单个样本。基于这一见解，我们提出了以分布为中心的策略优化（DCPO），将熵调控重新表述为分布层面的正则化。DCPO完全以同策略（on-policy）的方式实现可控的熵，无需从外部分布采样，在保持训练稳定的同时实现高效探索。在多个模型和七个基准上，DCPO平均比GRPO提升约20%。总体而言，DCPO用分布层面的原则取代了样本层面的启发式方法，为可控探索和更优的EE权衡提供了一个有理论依据且灵活的框架。代码可在此 https URL 获取。
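
The two quantities this abstract revolves around, GRPO's group-relative advantage and the policy's token-level entropy, can be sketched in a few lines. This illustrates the baseline behaviour only, not DCPO itself. The function names are hypothetical, and using the population standard deviation with a small `eps` is an assumption, since implementations differ on that detail.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over next tokens."""
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Usage: four sampled responses with binary verifiable rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
uniform_entropy = token_entropy([0.0, 0.0, 0.0, 0.0])   # log(4), the maximum for 4 tokens
peaked_entropy = token_entropy([10.0, 0.0, 0.0, 0.0])   # close to zero
```

Tracking the mean of `token_entropy` over a batch during training is one simple way to observe the monotonic entropy decrease the abstract describes.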

Teaching LLMs to Learn Tool Trialing and Execution through Environment Interaction

通过环境交互教大语言模型学习工具试用和执行

Authors: Xingjie Gao, Pengcheng Huang, Zhenghao Liu, Yukun Yan, Shuo Wang, Zulong Chen, Chen Qian, Ge Yu, Yu Gu
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.12762
Pdf link: https://arxiv.org/pdf/2601.12762
Abstract Equipping Large Language Models (LLMs) with external tools enables them to solve complex real-world problems. However, the robustness of existing methods remains a critical challenge when confronting novel or evolving tools. Existing trajectory-centric paradigms primarily rely on memorizing static solution paths during training, which limits the ability of LLMs to generalize tool usage to newly introduced or previously unseen tools. In this paper, we propose ToolMaster, a framework that shifts tool use from imitating golden tool-calling trajectories to actively learning tool usage through interaction with the environment. To optimize LLMs for tool planning and invocation, ToolMaster adopts a trial-and-execution paradigm, which trains LLMs to first imitate teacher-generated trajectories containing explicit tool trials and self-correction, followed by reinforcement learning to coordinate the trial and execution phases jointly. This process enables agents to autonomously explore correct tool usage by actively interacting with environments and forming experiential knowledge that benefits tool execution. Experimental results demonstrate that ToolMaster significantly outperforms existing baselines in terms of generalization and robustness across unseen or unfamiliar tools. All code and data are available at this https URL.
中文摘要 为大型语言模型（LLMs）配备外部工具，使其能够解决复杂的现实问题。然而，面对新颖或不断发展的工具时，现有方法的稳健性仍是一个关键挑战。现有的轨迹中心范式主要依赖于在训练中记忆静态解路径，这限制了大型语言模型将工具应用推广到新引入或此前未见过的工具的能力。本文提出了ToolMaster框架，将工具使用从模仿黄金工具调用轨迹转变为通过与环境交互主动学习工具使用。为了优化LLMs的工具规划和调用，ToolMaster采用了试用与执行范式，训练LLM先模仿教师生成的轨迹，包含显式工具试用和自我纠正，随后进行强化学习，协调试用和执行阶段。这一过程使智能体能够通过积极与环境互动并形成有利于工具执行的体验式知识，自主探索正确的工具使用。实验结果表明，ToolMaster在泛化性和鲁棒性方面，在未见或陌生工具上的表现显著优于现有基线。所有代码和数据均在此 https 网址上获取。

Unleashing Efficient Asynchronous RL Post-Training via Staleness-Constrained Rollout Coordination

通过陈旧度约束的rollout协调释放高效的异步强化学习后训练

Authors: Haoyang Li, Sheng Lin, Fangcheng Fu, Yuming Zhou, Xiaodong Ji, Yanfeng Zhao, Lefeng Wang, Jie Jiang, Bin Cui
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2601.12784
Pdf link: https://arxiv.org/pdf/2601.12784
Abstract Reinforcement learning (RL) post-training has become pivotal for enhancing the capabilities of modern large models. A recent trend is to develop RL systems with a fully disaggregated architecture, which decouples the three RL phases (rollout, reward, and training) onto separate resources and executes them asynchronously. However, two critical data-level concerns arise: (1) asynchronous execution leads to data staleness in trajectories (the data generated by rollout) as the model parameters used in rollout may not be up to date, which impairs RL convergence; and (2) the length variation of trajectories introduces severe data skewness, leading to workload imbalance and degraded system performance. Existing systems fail to address these two concerns in a unified manner. Techniques that tightly control data staleness often constrain effective data skewness mitigation, while aggressive data skewness mitigation tends to exacerbate data staleness. As a result, systems are forced to trade off convergence for performance, or vice versa. To address this, we propose StaleFlow, an RL post-training system that jointly tackles data staleness and skewness. First, to control staleness, StaleFlow introduces a global consistency protocol that tracks the full lifecycle of each trajectory and constrains staleness. Second, to mitigate skewness, StaleFlow re-designs the RL system architecture by constructing data servers for trajectories and parameters to achieve flexible rollout coordination. Subsequently, we develop a suite of staleness-aware, throughput-oriented strategies to enhance system performance. Evaluations show that StaleFlow achieves up to 1.42-2.68$\times$ (1.17-2.01$\times$ on average) higher throughput than state-of-the-art systems, without compromising convergence.
中文摘要 强化学习（RL）后训练已成为提升现代大模型能力的关键。最近的趋势是开发完全解耦架构的强化学习系统，将三个强化学习阶段（rollout、奖励和训练）解耦到独立的资源上并异步执行。然而，这带来了两个关键的数据层面问题：（1）异步执行会导致轨迹（rollout生成的数据）中出现数据陈旧，因为rollout时使用的模型参数可能不是最新的，这会损害强化学习的收敛性；（2）轨迹长度的差异会带来严重的数据倾斜，导致工作负载不均衡和系统性能下降。现有系统未能以统一的方式解决这两个问题。严格控制数据陈旧的技术往往会限制数据倾斜的有效缓解，而激进地缓解数据倾斜又容易加剧数据陈旧。因此，系统被迫牺牲收敛性来换取性能，或者相反。为此，我们提出了StaleFlow，一个同时处理数据陈旧和数据倾斜的强化学习后训练系统。首先，为了控制陈旧度，StaleFlow引入了全局一致性协议，跟踪每条轨迹的完整生命周期并约束其陈旧度。其次，为了缓解倾斜，StaleFlow重新设计了强化学习系统架构，为轨迹和参数构建数据服务器，以实现灵活的rollout协调。随后，我们开发了一套感知陈旧度、面向吞吐量的策略，以提升系统性能。评估显示，StaleFlow的吞吐量比最先进的系统最高高出1.42-2.68$\times$（平均1.17-2.01$\times$），且不影响收敛性。
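
One simple way to read "constraining staleness" is as a bound on how many parameter versions a trajectory may lag behind the trainer. The sketch below applies such a bound when assembling a training batch under a token budget. It is a toy illustration under that assumption, not StaleFlow's consistency protocol; `Trajectory`, `select_batch`, and every parameter name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    traj_id: int
    policy_version: int  # parameter version used while generating the rollout
    num_tokens: int      # trajectory length, the source of data skewness


def staleness(trajectory, trainer_version):
    """Number of parameter updates the trajectory lags behind the trainer."""
    return trainer_version - trajectory.policy_version


def select_batch(buffer, trainer_version, max_staleness, token_budget):
    """Pick fresh-enough trajectories, oldest first, without exceeding the token budget."""
    admissible = [
        t for t in buffer
        if 0 <= staleness(t, trainer_version) <= max_staleness
    ]
    # Oldest admissible data first, so it is consumed before it expires.
    admissible.sort(key=lambda t: (t.policy_version, t.traj_id))
    batch, used = [], 0
    for t in admissible:
        if used + t.num_tokens <= token_budget:
            batch.append(t)
            used += t.num_tokens
    return batch


# Usage: trainer at version 7, tolerating a lag of at most 2 versions.
buffer = [
    Trajectory(0, 3, 100),
    Trajectory(1, 5, 400),
    Trajectory(2, 6, 300),
    Trajectory(3, 7, 500),
]
batch = select_batch(buffer, trainer_version=7, max_staleness=2, token_budget=800)
print([t.traj_id for t in batch])  # [1, 2]: id 0 is too stale, id 3 exceeds the budget
```

The tension described in the abstract is visible even here: a tighter `max_staleness` shrinks the admissible pool, which leaves fewer options for balancing long and short trajectories.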

FRoM-W1: Towards General Humanoid Whole-Body Control with Language Instructions

FRoM-W1：迈向基于语言指令的通用人形机器人全身控制

Authors: Peng Li, Zihan Zhuang, Yangfan Gao, Yi Dong, Sixian Li, Changhao Jiang, Shihan Dou, Zhiheng Xi, Enyu Zhou, Jixuan Huang, Hui Li, Jingjing Gong, Xingjun Ma, Tao Gui, Zuxuan Wu, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang, Xipeng Qiu
Subjects: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.12799
Pdf link: https://arxiv.org/pdf/2601.12799
Abstract Humanoid robots are capable of performing various actions such as greeting, dancing and even backflipping. However, these motions are often hard-coded or specifically trained, which limits their versatility. In this work, we present FRoM-W1, an open-source framework designed to achieve general humanoid whole-body motion control using natural language. To universally understand natural language and generate corresponding motions, as well as enable various humanoid robots to stably execute these motions in the physical world under gravity, FRoM-W1 operates in two stages: (a) H-GPT: utilizing massive human data, a large-scale language-driven human whole-body motion generation model is trained to generate diverse natural behaviors. We further leverage the Chain-of-Thought technique to improve the model's generalization in instruction understanding. (b) H-ACT: After retargeting generated human whole-body motions into robot-specific actions, a motion controller that is pretrained and further fine-tuned through reinforcement learning in physical simulation enables humanoid robots to accurately and stably perform corresponding actions. It is then deployed on real robots via a modular simulation-to-reality module. We extensively evaluate FRoM-W1 on Unitree H1 and G1 robots. Results demonstrate superior performance on the HumanML3D-X benchmark for human whole-body motion generation, and our introduced reinforcement learning fine-tuning consistently improves both motion tracking accuracy and task success rates of these humanoid robots. We open-source the entire FRoM-W1 framework and hope it will advance the development of humanoid intelligence.
中文摘要 人形机器人能够执行多种动作，如问候、跳舞，甚至后空翻。然而，这些动作通常是硬编码或专门训练的，这限制了它们的通用性。在本研究中，我们提出了FRoM-W1，一个旨在通过自然语言实现通用人形机器人全身运动控制的开源框架。为了普遍地理解自然语言并生成相应动作，同时使各种人形机器人能够在物理世界的重力作用下稳定执行这些动作，FRoM-W1分为两个阶段：（a）H-GPT：利用海量人类数据，训练一个大规模语言驱动的人体全身运动生成模型，以生成多样且自然的行为。我们进一步利用思维链（Chain-of-Thought）技术提升模型在指令理解上的泛化能力。（b）H-ACT：将生成的人体全身运动重定向为机器人特定的动作后，一个经过预训练并在物理仿真中通过强化学习进一步微调的运动控制器使人形机器人能够准确、稳定地执行相应动作。随后，该控制器通过模块化的仿真到现实（sim-to-real）模块部署到真实机器人上。我们在Unitree H1和G1机器人上对FRoM-W1进行了广泛评估。结果表明，其在HumanML3D-X基准的人体全身运动生成任务上表现优异，而我们引入的强化学习微调持续提升了这些人形机器人的运动跟踪精度和任务成功率。我们开源了整个FRoM-W1框架，希望它能推动人形智能的发展。

Communication Methods in Multi-Agent Reinforcement Learning

多智能体强化学习中的通信方法

Authors: Christoph Wittner
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.12886
Pdf link: https://arxiv.org/pdf/2601.12886
Abstract Multi-agent reinforcement learning is a promising research area that extends established reinforcement learning approaches to problems formulated as multi-agent systems. Recently, a multitude of communication methods have been introduced to this field to address problems such as partially observable environments, non-stationarity, and exponentially growing action spaces. Communication further enables efficient cooperation among all agents interacting in an environment. This work aims at providing an overview of communication techniques in multi-agent reinforcement learning. By an in-depth analysis of 29 publications on this topic, the strengths and weaknesses of explicit, implicit, attention-based, graph-based, and hierarchical/role-based communication are evaluated. The results of this comparison show that there is no general, optimal communication framework for every problem. On the contrary, the choice of communication depends heavily on the problem at hand. The comparison also highlights the importance of communication methods with low computational overhead to enable scalability to environments where many agents interact. Finally, the paper discusses current research gaps, emphasizing the need for standardized benchmarking of system-level metrics and improved robustness under realistic communication conditions to enhance the real-world applicability of these approaches.
中文摘要 多智能体强化学习是一个有前景的研究领域，它将既有的强化学习方法扩展到以多智能体系统形式表述的问题。近年来，该领域引入了多种通信方法，以解决部分可观测环境、非平稳性和动作空间指数增长等问题。通信还能促进环境中所有智能体之间的高效合作。本研究旨在概述多智能体强化学习中的通信技术。通过对29篇相关文献的深入分析，评估了显式、隐式、基于注意力、基于图以及层级/角色型通信的优缺点。比较结果表明，不存在适用于所有问题的通用最优通信框架。相反，通信方式的选择在很大程度上取决于具体问题。比较还强调了低计算开销的通信方法对于扩展到大量智能体交互的环境至关重要。最后，本文讨论了当前的研究空白，强调需要对系统级指标进行标准化基准测试，并提升在现实通信条件下的鲁棒性，以增强这些方法的实际适用性。

PaperGuide: Making Small Language-Model Paper-Reading Agents More Efficient

PaperGuide：让小型语言模型的论文阅读智能体更高效

Authors: Zijian Wang, Tiancheng Huang, Hanqi Li, Da Ma, Lu Chen, Kai Yu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.12988
Pdf link: https://arxiv.org/pdf/2601.12988
Abstract The accelerating growth of the scientific literature makes it increasingly difficult for researchers to track new advances through manual reading alone. Recent progress in large language models (LLMs) has therefore spurred interest in autonomous agents that can read scientific papers and extract task-relevant information. However, most existing approaches rely either on heavily engineered prompting or on a conventional SFT-RL training pipeline, both of which often lead to excessive and low-yield exploration. Drawing inspiration from cognitive science, we propose PaperCompass, a framework that mitigates these issues by separating high-level planning from fine-grained execution. PaperCompass first drafts an explicit plan that outlines the intended sequence of actions, and then performs detailed reasoning to instantiate each step by selecting the parameters for the corresponding function calls. To train such behavior, we introduce Draft-and-Follow Policy Optimization (DFPO), a tailored RL method that jointly optimizes both the draft plan and the final solution. DFPO can be viewed as a lightweight form of hierarchical reinforcement learning, aimed at narrowing the 'knowing-doing' gap in LLMs. We provide a theoretical analysis that establishes DFPO's favorable optimization properties, supporting a stable and reliable training process. Experiments on paper-based question answering (Paper-QA) benchmarks show that PaperCompass improves efficiency over strong baselines without sacrificing performance, achieving results comparable to much larger models.
中文摘要 科学文献的快速增长使研究人员仅凭人工阅读来追踪新进展变得越来越困难。大型语言模型（LLMs）的最新进展因此激发了对自主智能体的兴趣，这些智能体能够阅读科学论文并提取与任务相关的信息。然而，大多数现有方法要么依赖高度工程化的提示，要么依赖传统的SFT-RL训练流程，这两者常常导致过度且低收益的探索。我们从认知科学中汲取灵感，提出了PaperCompass框架，该框架通过将高层次规划与细粒度执行分离来缓解这些问题。PaperCompass 首先起草一个明确的计划，概述预期的动作顺序，然后进行详细推理，通过为相应的函数调用选择参数来落实每一步。为了训练此类行为，我们引入了草稿跟随策略优化（DFPO），这是一种定制的强化学习方法，能够联合优化草稿计划和最终解答。DFPO可以被视为一种轻量级的层级强化学习，旨在缩小LLMs中的“知与行”差距。我们提供了理论分析，确立了DFPO有利的优化特性，支持稳定可靠的训练过程。在基于论文的问答（Paper-QA）基准上的实验显示，PaperCompass 相比强基线提升了效率且不牺牲性能，结果可与规模大得多的模型媲美。

Graph Reasoning Paradigm: Structured and Symbolic Reasoning with Topology-Aware Reinforcement Learning for Large Language Models

图推理范式：结构化与符号推理，结合拓扑感知强化学习，适用于大型语言模型

Authors: Runxuan Liu, Xianhao Ou, Xinyan Ma, Jiyuan Wang, Jiafeng Liang, Jiaqi Li, Tao He, Zheng Chu, Rongchuan Mu, Zekun Wang, Baoxin Wang, Dayong Wu, Ming Liu, Shijin Wang, Guoping Hu, Bing Qin
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.12995
Pdf link: https://arxiv.org/pdf/2601.12995
Abstract Long Chain-of-Thought (LCoT), achieved by Reinforcement Learning with Verifiable Rewards (RLVR), has proven effective in enhancing the reasoning capabilities of Large Language Models (LLMs). However, reasoning in current LLMs is primarily generated as plain text, where performing semantic evaluation on such unstructured data creates a computational bottleneck during training. Despite RLVR-based optimization, existing methods still suffer from coarse-grained supervision, reward hacking, high training costs, and poor generalization. To address these issues, we propose the Graph Reasoning Paradigm (GRP), which realizes structured and symbolic reasoning, implemented via graph-structured representations with step-level cognitive labels. Building upon GRP, we further design Process-Aware Stratified Clipping Group Relative Policy Optimization (PASC-GRPO), which leverages structured evaluation to replace semantic evaluation, achieves process-aware verification through graph-structured outcome rewards, and mitigates reward hacking via stratified clipping advantage estimation. Experiments demonstrate significant improvements across mathematical reasoning and code generation tasks. Data, models, and code will be released later.
中文摘要 通过可验证奖励强化学习（RLVR）实现的长思维链（LCoT）已被证明能有效提升大型语言模型（LLMs）的推理能力。然而，当前大型语言模型中的推理主要以纯文本形式生成，在训练过程中对此类非结构化数据进行语义评估会造成计算瓶颈。尽管采用了基于RLVR的优化，现有方法仍存在监督粒度粗、奖励黑客（reward hacking）、训练成本高和泛化能力不足的问题。为解决这些问题，我们提出了图推理范式（GRP），通过带有步骤级认知标签的图结构表示实现结构化和符号化推理。基于GRP，我们进一步设计了过程感知分层裁剪组相对策略优化（PASC-GRPO），该方法利用结构化评估取代语义评估，通过图结构的结果奖励实现过程感知验证，并通过分层裁剪优势估计减轻奖励黑客行为。实验显示，该方法在数学推理和代码生成任务上取得了显著提升。数据、模型和代码将在稍后发布。

Think3D: Thinking with Space for Spatial Reasoning

Think3D：空间思维以实现空间推理

Authors: Zaibin Zhang, Yuhan Wu, Lianjie Jia, Yifan Wang, Zhongbo Zhang, Yijiang Li, Binghao Ran, Fuxi Zhang, Zhuohan Sun, Zhenfei Yin, Lijun Wang, Huchuan Lu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.13029
Pdf link: https://arxiv.org/pdf/2601.13029
Abstract Understanding and reasoning about the physical world requires spatial intelligence: the ability to interpret geometry, perspective, and spatial relations beyond 2D perception. While recent vision large models (VLMs) excel at visual understanding, they remain fundamentally 2D perceivers and struggle with genuine 3D reasoning. We introduce Think3D, a framework that enables VLM agents to think with 3D space. By leveraging 3D reconstruction models that recover point clouds and camera poses from images or videos, Think3D allows the agent to actively manipulate space through camera-based operations and ego/global-view switching, transforming spatial reasoning into an interactive 3D chain-of-thought process. Without additional training, Think3D significantly improves the spatial reasoning performance of advanced models such as GPT-4.1 and Gemini 2.5 Pro, yielding average gains of +7.8% on BLINK Multi-view and MindCube, and +4.7% on VSI-Bench. We further show that smaller models, which struggle with spatial exploration, benefit significantly from a reinforcement learning policy that enables the model to select informative viewpoints and operations. With RL, the benefit from tool usage increases from +0.7% to +6.8%. Our findings demonstrate that training-free, tool-augmented spatial exploration is a viable path toward more flexible and human-like 3D reasoning in multimodal agents, establishing a new dimension of multimodal intelligence. Code and weights are released at this https URL.
中文摘要 理解和推理物理世界需要空间智能：能够解读几何、透视和空间关系，超越二维感知。虽然最新的视觉大型模型（VLM）在视觉理解方面表现出色，但它们本质上仍是二维感知者，且在真正的三维推理上存在困难。我们介绍了Think3D，一个使VLM智能体能够用三维空间思考的框架。通过利用3D重建模型从图像或视频中恢复点云和相机位姿，Think3D使智能体能够通过基于相机的操作和自我/全局视角切换主动操控空间，将空间推理转化为交互式的3D思维链过程。无需额外训练，Think3D 显著提升了 GPT-4.1 和 Gemini 2.5 Pro 等先进模型的空间推理性能，在 BLINK Multi-view 和 MindCube 上平均提升 +7.8%，在 VSI-Bench 上提升 +4.7%。我们还进一步表明，较小的模型在空间探索方面有困难，但能从强化学习策略中显著受益，该策略使模型能够选择信息量大的视角和操作。使用强化学习后，工具使用带来的收益从+0.7%提升到+6.8%。我们的发现表明，无需训练、工具增强的空间探索是实现多模态智能体更灵活、更类人三维推理的可行路径，开辟了多模态智能的新维度。代码和权重发布于此 https URL。

Feedforward-Feedback Integration in Flight Control: Reinforcement Learning with Sliding Mode Control

飞行控制中的前馈-反馈集成：结合滑模控制的强化学习

Authors: Imran Sayyed, Nandan Kumar Sinha
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2601.13037
Pdf link: https://arxiv.org/pdf/2601.13037
Abstract Learning-based controllers leverage nonlinear couplings and enhance transients but seldom offer guarantees under tight input constraints. Robust feedback like sliding-mode control (SMC) provides these guarantees but is conservative in isolation. This paper creates a learning-augmented framework where a deep reinforcement learning policy produces feedforward commands and an SMC law imposes actuator limits, bounds learned authority, and guarantees robustness. The policy is modeled as a matched, bounded input, and Lyapunov-based conditions link SMC gains to the admissible feedforward bound, guaranteeing stability under saturation. This formulation is applicable to nonlinear, underactuated plants with hard constraints. To illustrate the methodology, the method is applied to a six-degree-of-freedom aircraft model and compared with Reinforcement Learning and isolated SMC. Simulation results show that the hybrid controller improves transient behavior and reduces control oscillations compared to standalone RL and SMC controllers, while preserving robustness under modeling uncertainties and disturbances. Even using it with partially trained policies, SMC component of the control stabilizes transients, whereas fully trained policies provide faster convergence, reduced constraint violations, and robustness. These results illustrate that learning-augmented control offers superior performance with robustness guarantees under tight input constraints.
中文摘要 基于学习的控制器能够利用非线性耦合并改善瞬态响应，但在严格的输入约束下很少提供保证。滑模控制（SMC）这类鲁棒反馈能提供这些保证，但单独使用时较为保守。本文构建了一个学习增强框架，其中深度强化学习策略产生前馈指令，SMC控制律施加执行器限制、限定学习策略的控制权限，并保证鲁棒性。该策略被建模为匹配的有界输入，基于李雅普诺夫的条件将SMC增益与可容许的前馈上界联系起来，保证饱和下的稳定性。该表述适用于具有硬约束的非线性欠驱动被控对象。为说明该方法，将其应用于六自由度飞机模型，并与强化学习和单独的SMC进行比较。仿真结果表明，混合控制器相比独立的强化学习控制器和SMC控制器，改善了瞬态行为并减少了控制振荡，同时在建模不确定性和干扰下保持了鲁棒性。即使与部分训练的策略结合使用，控制中的SMC部分也能稳定瞬态过程，而完全训练好的策略则带来更快的收敛、更少的约束违反和鲁棒性。这些结果表明，学习增强控制在严格输入约束下具有更优的性能和鲁棒性保证。
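
Illustrative sketch (not from the paper): the abstract describes a learned feedforward command whose authority is bounded, combined with a sliding-mode feedback term under actuator limits. A minimal scalar version of that structure might look as follows. The function names, the boundary-layer smoothing, and all numeric values are assumptions made for illustration; the paper's Lyapunov-based gain conditions are not reproduced here.

```python
def clip(value, limit):
    """Clamp value to the symmetric interval [-limit, limit]."""
    return max(-limit, min(limit, value))


def hybrid_command(u_ff, s, k_smc, ff_bound, u_max, boundary=0.05):
    """Combine a bounded feedforward command with a switching feedback term.

    u_ff     : raw feedforward command from a learned policy (hypothetical input)
    s        : sliding variable, e.g. a weighted tracking error
    k_smc    : switching gain of the feedback term
    ff_bound : maximum authority granted to the learned feedforward
    u_max    : actuator saturation limit
    boundary : boundary-layer width used to smooth sign(s) (an assumption)
    """
    # Bound the learned feedforward so it behaves like a bounded input.
    u_ff_bounded = clip(u_ff, ff_bound)
    # Saturating function in place of sign(s) to limit chattering.
    u_smc = -k_smc * clip(s / boundary, 1.0)
    # Enforce the actuator limit on the combined command.
    return clip(u_ff_bounded + u_smc, u_max)
```

In this sketch the feedback gain is chosen larger than the feedforward bound so the switching term can always dominate the learned command; how the two are actually related is specified by the paper, not by this example.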

Agentic Conversational Search with Contextualized Reasoning via Reinforcement Learning

通过强化学习进行情境化推理的智能体式对话搜索

Authors: Fengran Mo, Yifan Gao, Sha Li, Hansi Zeng, Xin Liu, Zhaoxuan Tan, Xian Li, Jianshu Chen, Dakuo Wang, Meng Jiang
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2601.13115
Pdf link: https://arxiv.org/pdf/2601.13115
Abstract Large Language Models (LLMs) have become a popular interface for human-AI interaction, supporting information seeking and task assistance through natural, multi-turn dialogue. To respond to users within multi-turn dialogues, the context-dependent user intent evolves across interactions, requiring contextual interpretation, query reformulation, and dynamic coordination between retrieval and generation. Existing studies usually follow static rewrite, retrieve, and generate pipelines, which optimize different procedures separately and overlook the mixed-initiative action optimization simultaneously. Although the recent developments in deep search agents demonstrate the effectiveness in jointly optimizing retrieval and generation via reasoning, these approaches focus on single-turn scenarios, which might lack the ability to handle multi-turn interactions. We introduce a conversational agent that interleaves search and reasoning across turns, enabling exploratory and adaptive behaviors learned through reinforcement learning (RL) training with tailored rewards towards evolving user goals. The experimental results across four widely used conversational benchmarks demonstrate the effectiveness of our methods by surpassing several existing strong baselines.
中文摘要 大型语言模型（LLMs）已成为人机交互的流行界面，通过自然的多轮对话支持信息获取和任务协助。在多轮对话中回应用户时，依赖上下文的用户意图会随着交互不断演变，这需要上下文解释、查询重构以及检索与生成之间的动态协调。现有研究通常遵循静态的重写、检索、生成流水线，分别优化各个环节，而忽略了对混合主动（mixed-initiative）动作的同时优化。尽管近期深度搜索智能体的发展展示了通过推理联合优化检索和生成的有效性，但这些方法主要关注单轮场景，可能无法处理多轮交互。我们引入了一种对话智能体，在各轮对话中交错进行搜索和推理，通过带有面向演变中用户目标的定制奖励的强化学习（RL）训练，学习探索性和适应性行为。在四个广泛使用的对话基准上的实验结果超越了多个现有强基线，证明了我们方法的有效性。

Training instability in deep learning follows low-dimensional dynamical principles

深度学习中的训练不稳定性遵循低维动力学原理

Authors: Zhipeng Zhang, Zhenjie Yao, Kai Li, Lei Yang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.13160
Pdf link: https://arxiv.org/pdf/2601.13160
Abstract Deep learning systems achieve remarkable empirical performance, yet the stability of the training process itself remains poorly understood. Training unfolds as a high-dimensional dynamical system in which small perturbations to optimization, data, parameters, or learning signals can induce abrupt and irreversible collapse, undermining reproducibility and scalability. We propose a unified dynamical perspective that characterizes training stability as an intrinsic property of learning systems, organized along four interacting dimensions: optimization, environmental/data, parametric, and learning-signal stability. We operationalize this perspective through controlled perturbation auditing of training trajectories, probing how learning dynamics respond to structured disturbances without modifying learning algorithms. Across reinforcement learning and large language model training, we identify three recurring regularities: high final performance is frequently decoupled from training stability; controlled stochasticity consistently buffers learning dynamics across paradigms; and deviations in low-dimensional latent meta-states systematically precede observable performance collapse. Together, these findings establish training stability as a measurable and comparable dynamical property of learning systems, providing a descriptive foundation for studying learning dynamics beyond final performance outcomes.
中文摘要 深度学习系统在经验上表现出色，但人们对训练过程本身的稳定性仍然了解不足。训练如同一个高维动力系统，优化、数据、参数或学习信号的小扰动都可能引发突发且不可逆的崩溃，削弱可复现性和可扩展性。我们提出了一种统一的动力学视角，将训练稳定性描述为学习系统的内在属性，围绕四个相互作用的维度组织：优化、环境/数据、参数和学习信号稳定性。我们通过对训练轨迹的受控扰动审计将这一视角操作化，在不修改学习算法的情况下探究学习动态如何响应结构化扰动。在强化学习和大型语言模型训练中，我们发现了三个反复出现的规律：高最终性能常常与训练稳定性脱钩；受控随机性能够在不同范式中持续缓冲学习动态；低维潜在元状态的偏离系统性地先于可观测的性能崩溃出现。这些发现共同确立了训练稳定性作为学习系统可测量且可比较的动力学属性，为研究超越最终性能结果的学习动态提供了描述性基础。

Autonomous Navigation at the Nano-Scale: Algorithms, Architectures, and Constraints

纳米尺度的自主导航：算法、架构与约束

Authors: Mahmud S. Zango, Jianglin Lan
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2601.13252
Pdf link: https://arxiv.org/pdf/2601.13252
Abstract Autonomous navigation for nano-scale unmanned aerial vehicles (nano-UAVs) is governed by extreme Size, Weight, and Power (SWaP) constraints (with the weight < 50 g and sub-100 mW onboard processor), distinguishing it fundamentally from standard robotic paradigms. This review synthesizes the state-of-the-art in sensing, computing, and control architectures designed specifically for these sub-100 mW computational envelopes. We critically analyse the transition from classical geometry-based methods to emerging "Edge AI" paradigms, including quantized deep neural networks deployed on ultra-low-power System-on-Chips (SoCs) and neuromorphic event-based control. Beyond algorithms, we evaluate the hardware-software co-design requisite for autonomy, covering advancements in dense optical flow, optimized Simultaneous Localization and Mapping (SLAM), and learning-based flight control. While significant progress has been observed in visual navigation and relative pose estimation, our analysis reveals persistent gaps in long-term endurance, robust obstacle avoidance in dynamic environments, and the "Sim-to-Real" transfer of reinforcement learning policies. This survey provides a roadmap for bridging these gaps, advocating for hybrid architectures that fuse lightweight classical control with data-driven perception to enable fully autonomous, agile nano-UAVs in GPS-denied environments.
中文摘要 纳米级无人机（nano-UAV）的自主导航受极端的尺寸、重量和功率（SWaP）约束（重量<50克，机载处理器功率低于100毫瓦），这使其从根本上区别于标准机器人范式。本综述综合了专为这些低于100毫瓦的计算包络设计的传感、计算和控制架构的最新进展。我们批判性地分析了从经典几何方法向新兴“边缘人工智能”范式的转变，包括部署在超低功耗片上系统（SoC）上的量化深度神经网络和基于事件的神经形态控制。除了算法，我们还评估了实现自主性所必需的软硬件协同设计，涵盖了稠密光流、优化的同时定位与建图（SLAM）以及基于学习的飞行控制等方面的进展。虽然在视觉导航和相对位姿估计方面取得了显著进展，我们的分析揭示了在长时续航、动态环境中的鲁棒避障以及强化学习策略的“仿真到现实”迁移方面仍存在差距。本综述为弥合这些差距提供了路线图，倡导将轻量级经典控制与数据驱动感知融合的混合架构，以在无GPS环境中实现完全自主、敏捷的纳米无人机。

CURE-Med: Curriculum-Informed Reinforcement Learning for Multilingual Medical Reasoning

CURE-Med：多语言医学推理的课程导向强化学习

Authors: Eric Onyame, Akash Ghosh, Subhadip Baidya, Sriparna Saha, Xiuying Chen, Chirag Agarwal
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.13262
Pdf link: https://arxiv.org/pdf/2601.13262
Abstract While large language models (LLMs) have shown to perform well on monolingual mathematical and commonsense reasoning, they remain unreliable for multilingual medical reasoning applications, hindering their deployment in multilingual healthcare settings. We address this by first introducing CUREMED-BENCH, a high-quality multilingual medical reasoning dataset with open-ended reasoning queries with a single verifiable answer, spanning thirteen languages, including underrepresented languages such as Amharic, Yoruba, and Swahili. Building on this dataset, we propose CURE-MED, a curriculum-informed reinforcement learning framework that integrates code-switching-aware supervised fine-tuning and Group Relative Policy Optimization to jointly improve logical correctness and language stability. Across thirteen languages, our approach consistently outperforms strong baselines and scales effectively, achieving 85.21% language consistency and 54.35% logical correctness at 7B parameters, and 94.96% language consistency and 70.04% logical correctness at 32B parameters. These results support reliable and equitable multilingual medical reasoning in LLMs. The code and dataset are available at this https URL
中文摘要 虽然大型语言模型（LLMs）在单语数学和常识推理方面表现出色，但在多语言医学推理应用中仍然不可靠，阻碍了其在多语言医疗环境中的部署。为此，我们首先介绍CUREMED-BENCH，这是一个高质量的多语言医学推理数据集，包含具有单一可验证答案的开放式推理问题，涵盖十三种语言，包括阿姆哈拉语、约鲁巴语和斯瓦希里语等代表性不足的语言。基于该数据集，我们提出了CURE-MED，一个课程导向的强化学习框架，集成了语码转换（code-switching）感知的监督微调和组相对策略优化，以共同提升逻辑正确性和语言稳定性。在十三种语言上，我们的方法始终优于强基线，并能有效扩展：在7B参数下实现了85.21%的语言一致性和54.35%的逻辑正确性，在32B参数下实现了94.96%的语言一致性和70.04%的逻辑正确性。这些结果支持了LLMs中可靠且公平的多语言医学推理。代码和数据集可在此 https URL 获取

Balancing Classification and Calibration Performance in Decision-Making LLMs via Calibration Aware Reinforcement Learning

通过校准感知强化学习平衡决策型大型语言模型中的分类与校准性能

Authors: Duygu Nur Yaldiz, Evangelia Spiliopoulou, Zheng Qi, Siddharth Varia, Srikanth Doss, Nikolaos Pappas
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.13284
Pdf link: https://arxiv.org/pdf/2601.13284
Abstract Large language models (LLMs) are increasingly deployed in decision-making tasks, where not only accuracy but also reliable confidence estimates are essential. Well-calibrated confidence enables downstream systems to decide when to trust a model and when to defer to fallback mechanisms. In this work, we conduct a systematic study of calibration in two widely used fine-tuning paradigms: supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR). We show that while RLVR improves task performance, it produces extremely overconfident models, whereas SFT yields substantially better calibration, even under distribution shift, though with smaller performance gains. Through targeted experiments, we diagnose RLVR's failure, showing that decision tokens act as extraction steps of the decision in reasoning traces and do not carry confidence information, which prevents reinforcement learning from surfacing calibrated alternatives. Based on this insight, we propose a calibration-aware reinforcement learning formulation that directly adjusts decision-token probabilities. Our method preserves RLVR's accuracy level while mitigating overconfidence, reducing ECE scores up to 9 points.
中文摘要 大型语言模型（LLMs）越来越多地被应用于决策任务中，在这些任务中不仅需要准确性，也需要可靠的置信度估计。校准良好的置信度使下游系统能够决定何时信任模型，何时转交备用机制。本研究系统地研究了两种广泛使用的微调范式中的校准：监督微调（SFT）和带可验证奖励的强化学习（RLVR）。我们表明，虽然RLVR能提升任务性能，但会产生极度过度自信的模型，而SFT即使在分布偏移下校准效果也明显更好，尽管性能提升较小。通过针对性实验，我们诊断了RLVR的失败原因，表明决策词元只是对推理轨迹中决策的提取步骤，不携带置信度信息，这使强化学习无法发掘出经过校准的替代方案。基于这一见解，我们提出了一种校准感知的强化学习形式，直接调整决策词元的概率。我们的方法在保持 RLVR 准确率水平的同时缓解了过度自信，将 ECE 分数降低最多 9 个点。
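
Illustrative sketch (not from the paper): the abstract reports calibration as ECE (expected calibration error). One common way to compute ECE uses equal-width confidence bins and sums the gap between accuracy and mean confidence in each bin, weighted by bin size. The function name and the default of 10 bins are assumptions; how the paper extracts confidences from decision tokens is not shown here.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE over equal-width bins on [0, 1].

    confidences : predicted confidence per example, each in [0, 1]
    correct     : truthy/falsy flag per example (prediction was right or wrong)
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    conf_sum = [0.0] * n_bins  # sum of confidences per bin
    acc_sum = [0.0] * n_bins   # number of correct predictions per bin
    count = [0] * n_bins       # number of examples per bin

    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes to the last bin
        conf_sum[idx] += c
        acc_sum[idx] += 1.0 if ok else 0.0
        count[idx] += 1

    ece = 0.0
    for s_conf, s_acc, cnt in zip(conf_sum, acc_sum, count):
        if cnt:
            # (cnt / n) * |acc - conf| with acc = s_acc / cnt and conf = s_conf / cnt
            ece += abs(s_acc - s_conf) / n
    return ece
```

For example, four predictions made at confidence 0.9 of which three are correct give a bin accuracy of 0.75 against a mean confidence of 0.9, so the ECE is 0.15. An overconfident model shows up as confidence consistently above accuracy.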

Group Relative Policy Optimization for Robust Blind Interference Alignment with Fluid Antennas

面向流体天线稳健盲干扰对齐的组相对策略优化

Authors: Jianqiu Peng, Tong Zhang, Shuai Wang, Mingjie Shao, Hao Xu, Rui Wang
Subjects: Information Theory (cs.IT)
Arxiv link: https://arxiv.org/abs/2601.13506
Pdf link: https://arxiv.org/pdf/2601.13506
Abstract Fluid antenna system (FAS) leverages dynamic reconfigurability to unlock spatial degrees of freedom and reshape wireless channels. This paper proposes, for the first time, a robust fluid antenna-driven blind interference alignment (BIA) framework for a K-user MISO downlink under imperfect channel state information (CSI). We formulate a robust sum-rate maximization problem through optimizing fluid antenna positions. To solve this challenging non-convex problem, we employ group relative policy optimization (GRPO), a novel deep reinforcement learning algorithm that eliminates the critic network. This robust design reduces model size and floating point operations (FLOPs) by nearly half compared to proximal policy optimization (PPO) while significantly enhancing performance through group-based exploration that escapes bad local optima. Simulation results demonstrate that GRPO outperforms PPO by 4.17%, and a 100K-step pre-trained PPO by 30.29%. Due to error distribution learning, GRPO exceeds heuristic MaximumGain and RandomGain by 200.78% and 465.38%, respectively.
中文摘要 流体天线系统（FAS）利用动态可重构性解锁空间自由度并重塑无线信道。本文首次提出了一种稳健的流体天线驱动盲干扰对齐（BIA）框架，用于不完美信道状态信息（CSI）下的K用户MISO下行链路。我们通过优化流体天线位置，构建了一个稳健的和速率最大化问题。为解决这一具有挑战性的非凸问题，我们采用了组相对策略优化（GRPO），这是一种无需评论家（critic）网络的新型深度强化学习算法。与近端策略优化（PPO）相比，这种稳健设计使模型规模和浮点运算量（FLOPs）减少了近一半，同时通过基于组的探索跳出较差的局部最优，显著提升了性能。仿真结果显示，GRPO比PPO高出4.17%，比经过10万步预训练的PPO高出30.29%。得益于对误差分布的学习，GRPO分别比启发式的MaximumGain和RandomGain高出200.78%和465.38%。
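
Illustrative sketch (not from the paper): the abstract notes that GRPO removes the critic network. In the commonly described form of GRPO, the baseline a critic would supply is replaced by statistics of a group of rewards sampled for the same state or prompt, and each sample's advantage is its reward standardized within that group. The function name and the `eps` value are assumptions; the paper's reward definition and policy update are not shown.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group.

    rewards : rewards of several actions sampled for the same state
    eps     : small constant to avoid division by zero (an assumption)
    """
    n = len(rewards)
    if n == 0:
        return []
    mean = sum(rewards) / n
    # Population standard deviation of the group.
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    # Samples above the group mean get a positive advantage, others negative.
    return [(r - mean) / (std + eps) for r in rewards]
```

With rewards `[1.0, 0.0, 0.0, 1.0]` the group mean is 0.5 and the standard deviation is 0.5, so the advantages are approximately `[1, -1, -1, 1]`. If every sample in a group receives the same reward, all advantages are zero and that group contributes no gradient signal.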

Reasoning While Recommending: Entropy-Guided Latent Reasoning in Generative Re-ranking Models

推荐时的推理：生成式重新排序模型中的熵引导潜在推理

Authors: Changshuo Zhang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.13533
Pdf link: https://arxiv.org/pdf/2601.13533
Abstract Reinforcement learning plays a crucial role in generative re-ranking scenarios due to its exploration-exploitation capabilities, but existing generative methods mostly fail to adapt to the dynamic entropy changes in model difficulty during list generation, making it challenging to accurately capture complex preferences. Given that language models have achieved remarkable breakthroughs by integrating reasoning capabilities, we draw on this approach to introduce a latent reasoning mechanism, and experimental validation demonstrates that this mechanism effectively reduces entropy in the model's decision-making process. Based on these findings, we introduce the Entropy-Guided Latent Reasoning (EGLR) recommendation model, which has three core advantages. First, it abandons the "reason first, recommend later" paradigm to achieve "reasoning while recommending", specifically designed for the high-difficulty nature of list generation by enabling real-time reasoning during generation. Second, it implements entropy-guided variable-length reasoning using context-aware reasoning token alongside dynamic temperature adjustment, expanding exploration breadth in reasoning and boosting exploitation precision in recommending to achieve a more precisely adapted exploration-exploitation trade-off. Third, the model adopts a lightweight integration design with no complex independent modules or post-processing, enabling easy adaptation to existing models. Experimental results on two real-world datasets validate the model's effectiveness, and its notable advantage lies in being compatible with existing generative re-ranking models to enhance their performance. Further analyses also demonstrate its practical deployment value and research potential.
中文摘要 强化学习因其探索-利用能力，在生成式重新排序场景中起着关键作用，但现有生成方法大多未能适应列表生成过程中模型难度的动态熵变化，这使得准确捕捉复杂偏好具有挑战性。鉴于语言模型通过整合推理能力取得了显著突破，我们借鉴该方法引入潜在推理机制，实验验证表明该机制能有效降低模型决策过程中的熵。基于这些发现，我们介绍了熵引导潜在推理（EGLR）推荐模型，该模型具有三大核心优势。首先，它摒弃了“先推理，后推荐”的范式，实现了“边推荐边推理”；这一设计专门针对列表生成的高难度特性，在生成过程中进行实时推理。其次，它通过上下文感知的推理词元和动态温度调整实现熵引导的变长推理，拓展了推理阶段的探索广度，并提升了推荐阶段的利用精度，从而实现更精确适配的探索-利用权衡。第三，模型采用轻量化集成设计，没有复杂的独立模块或后处理，便于适配现有模型。两个真实世界数据集上的实验结果验证了该模型的有效性，其显著优势在于与现有生成式重新排序模型兼容，从而提升其性能。进一步分析还展示了其实际部署价值和研究潜力。

Behavior Knowledge Merge in Reinforced Agentic Models

强化智能体模型中的行为知识合并

Authors: Xiangchi Yuan, Dachuan Shi, Chunhui Zhang, Zheyuan Liu, Shenglong Yao, Soroush Vosoughi, Wenke Lee
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.13572
Pdf link: https://arxiv.org/pdf/2601.13572
Abstract Reinforcement learning (RL) is central to post-training, particularly for agentic models that require specialized reasoning behaviors. In this setting, model merging offers a practical mechanism for integrating multiple RL-trained agents from different tasks into a single generalist model. However, existing merging methods are designed for supervised fine-tuning (SFT), and they are suboptimal to preserve task-specific capabilities on RL-trained agentic models. The root is a task-vector mismatch between RL and SFT: on-policy RL induces task vectors that are highly sparse and heterogeneous, whereas SFT-style merging implicitly assumes dense and globally comparable task vectors. When standard global averaging is applied under this mismatch, RL's non-overlapping task vectors that encode critical task-specific behaviors are reduced and parameter updates are diluted. To address this issue, we propose Reinforced Agent Merging (RAM), a distribution-aware merging framework explicitly designed for RL-trained agentic models. RAM disentangles shared and task-specific unique parameter updates, averaging shared components while selectively preserving and rescaling unique ones to counteract parameter update dilution. Experiments across multiple agent domains and model architectures demonstrate that RAM not only surpasses merging baselines, but also unlocks synergistic potential among agents to achieve performance superior to that of specialized agents in their domains.
中文摘要 强化学习（RL）是后训练的核心，尤其是针对需要专门推理行为的智能体模型。在这种情形下，模型合并提供了一种实用机制，将多个来自不同任务的强化学习训练智能体整合到单一通用模型中。然而，现有的合并方法是为监督微调（SFT）设计的，在保留强化学习训练的智能体模型的任务特定能力方面并不理想。根源在于RL与SFT之间的任务向量不匹配：同策略（on-policy）RL产生的任务向量高度稀疏且异构，而SFT式合并隐含假设任务向量稠密且全局可比较。当在这种不匹配下应用标准全局平均时，强化学习中编码关键任务特定行为的非重叠任务向量被削弱，参数更新被稀释。为解决这一问题，我们提出了强化智能体合并（Reinforced Agent Merging，简称RAM），这是一个专门为强化学习训练的智能体模型设计的分布感知合并框架。RAM 将共享的参数更新与任务特有的参数更新解耦，对共享部分取平均，同时有选择地保留并重新缩放特有部分，以抵消参数更新稀释。跨多个智能体领域和模型架构的实验表明，RAM 不仅超越了合并基线，还释放了智能体之间的协同潜力，取得了优于各领域专用智能体的性能。
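
Illustrative sketch (not the paper's algorithm): the abstract contrasts global averaging of task vectors, which dilutes updates that only one task made, with averaging shared updates while preserving and rescaling unique ones. The toy example below shows that contrast on flat lists of parameter deltas. The function names, the rule that a coordinate is "shared" when two or more tasks updated it, and the `unique_scale` parameter are all assumptions; RAM's actual selection and rescaling rules are defined in the paper.

```python
def global_average(task_vectors):
    """Plain averaging: every coordinate is divided by the number of tasks."""
    k = len(task_vectors)
    return [sum(column) / k for column in zip(*task_vectors)]


def merge_sparse_task_vectors(task_vectors, unique_scale=1.0, tol=0.0):
    """Average coordinates updated by several tasks; keep single-task updates.

    task_vectors : list of equal-length lists of parameter deltas, one per task
    unique_scale : factor applied to coordinates only one task updated (assumption)
    tol          : magnitude at or below which a delta counts as "no update"
    """
    merged = []
    for column in zip(*task_vectors):
        active = [d for d in column if abs(d) > tol]
        if not active:
            merged.append(0.0)                       # no task touched this coordinate
        elif len(active) == 1:
            merged.append(unique_scale * active[0])  # unique update: keep, do not dilute
        else:
            merged.append(sum(active) / len(active))  # shared update: average
    return merged
```

With `[[0.4, 0.0, 0.2], [0.0, -0.6, 0.4]]`, plain averaging halves the two single-task updates to about 0.2 and -0.3, whereas the sketch keeps them at 0.4 and -0.6 and averages only the third, shared coordinate.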

Communication-Free Collective Navigation for a Swarm of UAVs via LiDAR-Based Deep Reinforcement Learning

通过基于激光雷达的深度强化学习，为一群无人机实现无通信的集体导航

Authors: Myong-Yol Choi, Hankyoul Ko, Hanse Cho, Changseung Kim, Seunghwan Kim, Jaemin Seo, Hyondong Oh
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2601.13657
Pdf link: https://arxiv.org/pdf/2601.13657
Abstract This paper presents a deep reinforcement learning (DRL) based controller for collective navigation of unmanned aerial vehicle (UAV) swarms in communication-denied environments, enabling robust operation in complex, obstacle-rich environments. Inspired by biological swarms where informed individuals guide groups without explicit communication, we employ an implicit leader-follower framework. In this paradigm, only the leader possesses goal information, while follower UAVs learn robust policies using only onboard LiDAR sensing, without requiring any inter-agent communication or leader identification. Our system utilizes LiDAR point clustering and an extended Kalman filter for stable neighbor tracking, providing reliable perception independent of external positioning systems. The core of our approach is a DRL controller, trained in GPU-accelerated Nvidia Isaac Sim, that enables followers to learn complex emergent behaviors - balancing flocking and obstacle avoidance - using only local perception. This allows the swarm to implicitly follow the leader while robustly addressing perceptual challenges such as occlusion and limited field-of-view. The robustness and sim-to-real transfer of our approach are confirmed through extensive simulations and challenging real-world experiments with a swarm of five UAVs, which successfully demonstrated collective navigation across diverse indoor and outdoor environments without any communication or external localization.
中文摘要 本文提出了一种基于深度强化学习（DRL）的控制器，用于无人机（UAV）集群在通信拒止环境中的集体导航，使其能够在复杂且障碍物密集的环境中稳健运行。受生物群体中知情个体无需显式通信即可引导群体的启发，我们采用隐式领导者-跟随者框架。在这种范式下，只有领导者拥有目标信息，而跟随无人机仅通过机载激光雷达感知学习稳健的策略，无需任何智能体间通信或领导者识别。我们的系统采用LiDAR点聚类和扩展卡尔曼滤波器实现稳定的邻居跟踪，提供不依赖外部定位系统的可靠感知。我们方法的核心是一个在GPU加速的Nvidia Isaac Sim中训练的DRL控制器，使跟随者能够仅凭局部感知学习复杂的涌现行为，即在集群飞行与避障之间取得平衡。这使得集群能够隐式跟随领导者，同时稳健地应对遮挡和视场受限等感知挑战。我们方法的稳健性和仿真到现实的迁移能力通过大量仿真和五架无人机集群的高难度真实实验得到了验证，该集群在没有任何通信或外部定位的情况下，成功展示了在多种室内外环境中的集体导航。

Reinforcement Learning for Opportunistic Routing in Software-Defined LEO-Terrestrial Systems

软件定义LEO-地面系统中机会性路由的强化学习

Authors: Sivaram Krishnan, Zhouyou Gu, Jihong Park, Sung-Min Oh, Jinho Choi
Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.13662
Pdf link: https://arxiv.org/pdf/2601.13662
Abstract The proliferation of large-scale low Earth orbit (LEO) satellite constellations is driving the need for intelligent routing strategies that can effectively deliver data to terrestrial networks under rapidly time-varying topologies and intermittent gateway visibility. Leveraging the global control capabilities of a geostationary (GEO)-resident software-defined networking (SDN) controller, we introduce opportunistic routing, which aims to minimize delivery delay by forwarding packets to any currently available ground gateways rather than fixed destinations. This makes it a promising approach for achieving low-latency and robust data delivery in highly dynamic LEO networks. Specifically, we formulate a constrained stochastic optimization problem and employ a residual reinforcement learning framework to optimize opportunistic routing for reducing transmission delay. Simulation results over multiple days of orbital data demonstrate that our method achieves significant improvements in queue length reduction compared to classical backpressure and other well-known queueing algorithms.
中文摘要 大规模低地球轨道（LEO）卫星星座的激增，推动了对智能路由策略的需求，这些策略需要在快速时变的拓扑结构和间歇性的网关可见性下，有效地将数据传递到地面网络。利用驻留于地球静止轨道（GEO）的软件定义网络（SDN）控制器的全局控制能力，我们引入了机会性路由，旨在通过将数据包转发到任何当前可用的地面网关而非固定目的地，来最小化交付延迟。这使其成为在高度动态的LEO网络中实现低延迟和稳健数据传输的有前景方法。具体来说，我们构建了一个带约束的随机优化问题，并采用残差强化学习框架优化机会性路由以减少传输延迟。基于多天轨道数据的仿真结果表明，我们的方法相比经典背压算法和其他知名排队算法，在缩短队列长度方面取得了显著提升。

Finding RELIEF: Shaping Reasoning Behavior without Reasoning Supervision via Belief Engineering

寻找RELIEF：通过信念工程在无推理监督的情况下塑造推理行为

Authors: Chak Tou Leong, Dingwei Chen, Heming Xia, Qingyu Yin, Sunbowen Lee, Jian Wang, Wenjie Li
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.13752
Pdf link: https://arxiv.org/pdf/2601.13752
Abstract Large reasoning models (LRMs) have achieved remarkable success in complex problem-solving, yet they often suffer from computational redundancy or reasoning unfaithfulness. Current methods for shaping LRM behavior typically rely on reinforcement learning or fine-tuning with gold-standard reasoning traces, a paradigm that is both computationally expensive and difficult to scale. In this paper, we reveal that LRMs possess latent reasoning beliefs that internally track their own reasoning traits, which can be captured through simple logit probing. Building upon this insight, we propose Reasoning Belief Engineering (RELIEF), a simple yet effective framework that shapes LRM behavior by aligning the model's self-concept with a target belief blueprint. Crucially, RELIEF completely bypasses the need for reasoning-trace supervision. It internalizes desired traits by fine-tuning on synthesized, self-reflective question-answering pairs that affirm the target belief. Extensive experiments on efficiency and faithfulness tasks demonstrate that RELIEF matches or outperforms behavior-supervised and preference-based baselines while requiring lower training costs. Further analysis validates that shifting a model's reasoning belief effectively shapes its actual behavior.
中文摘要 大型推理模型（LRM）在复杂问题求解方面取得了显著成功，但它们常常存在计算冗余或推理不忠实的问题。当前塑造LRM行为的方法通常依赖于强化学习或用金标准推理轨迹进行微调，这种范式计算成本高且难以扩展。本文揭示，LRM具有潜在的推理信念，这些信念在内部追踪模型自身的推理特征，并可以通过简单的logit探测捕捉到。基于这一见解，我们提出了推理信念工程（RELIEF），这是一个简单而有效的框架，通过将模型的自我概念与目标信念蓝图对齐来塑造LRM行为。关键是，RELIEF完全绕过了对推理轨迹监督的需求。它通过在合成的、肯定目标信念的自我反思问答对上进行微调，来内化期望的特质。在效率和忠实性任务上的大量实验表明，RELIEF达到或优于行为监督和基于偏好的基线，同时所需的训练成本更低。进一步分析验证了改变模型的推理信念能有效塑造其实际行为。

TractRLFusion: A GPT-Based Multi-Critic Policy Fusion Framework for Fiber Tractography

TractRLFusion：基于GPT的多评论家策略融合框架，用于纤维束成像

Authors: Ankita Joshi, Ashutosh Sharma, Anoushkrit Goel, Ranjeet Ranjan Jha, Chirag Ahuja, Arnav Bhavsar, Aditya Nigam
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.13897
Pdf link: https://arxiv.org/pdf/2601.13897
Abstract Tractography plays a pivotal role in the non-invasive reconstruction of white matter fiber pathways, providing vital information on brain connectivity and supporting precise neurosurgical planning. Although traditional methods relied mainly on classical deterministic and probabilistic approaches, recent progress has benefited from supervised deep learning (DL) and deep reinforcement learning (DRL) to improve tract reconstruction. A persistent challenge in tractography is accurately reconstructing white matter tracts while minimizing spurious connections. To address this, we propose TractRLFusion, a novel GPT-based policy fusion framework that integrates multiple RL policies through a data-driven fusion strategy. Our method employs a two-stage training data selection process for effective policy fusion, followed by a multi-critic fine-tuning phase to enhance robustness and generalization. Experiments on HCP, ISMRM, and TractoInferno datasets demonstrate that TractRLFusion outperforms individual RL policies as well as state-of-the-art classical and DRL methods in accuracy and anatomical reliability.
中文摘要 纤维束成像在白质纤维通路的非侵入性重建中发挥着关键作用，提供关于大脑连接性的重要信息，支持精确的神经外科手术规划。尽管传统方法主要依赖经典的确定性和概率方法，但近期的进展得益于监督深度学习（DL）和深度强化学习（DRL），改善了纤维束重建。纤维束成像中一个持续的挑战是准确重建白质束，同时尽量减少虚假连接。为此，我们提出了TractRLFusion，一种基于GPT的新型策略融合框架，通过数据驱动的融合策略整合多个强化学习策略。我们的方法采用两阶段训练数据选择过程以实现有效的策略融合，随后进行多评论家微调阶段，以增强鲁棒性和泛化性。在HCP、ISMRM和TractoInferno数据集上的实验表明，TractRLFusion在准确性和解剖可靠性方面优于单个强化学习策略以及最先进的经典方法和DRL方法。

HyperWalker: Dynamic Hypergraph-Based Deep Diagnosis for Multi-Hop Clinical Modeling across EHR and X-Ray in Medical VLMs

HyperWalker：基于动态超图的深度诊断，用于医疗VLM中跨EHR和X光的多跳临床建模

Authors: Yuezhe Yang, Hao Wang, Yige Peng, Jinman Kim, Lei Bi
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.13919
Pdf link: https://arxiv.org/pdf/2601.13919
Abstract Automated clinical diagnosis remains a core challenge in medical AI, which usually requires models to integrate multi-modal data and reason across complex, case-specific contexts. Although recent methods have advanced medical report generation (MRG) and visual question answering (VQA) with medical vision-language models (VLMs), they predominantly operate under a sample-isolated inference paradigm, processing cases independently without access to longitudinal electronic health records (EHRs) or structurally related patient examples. This paradigm limits reasoning to image-derived information alone, which ignores external complementary medical evidence for potentially more accurate diagnosis. To overcome this limitation, we propose HyperWalker, a Deep Diagnosis framework that reformulates clinical reasoning via dynamic hypergraphs and test-time training. First, we construct a dynamic hypergraph, termed iBrochure, to model the structural heterogeneity of EHR data and implicit high-order associations among multimodal clinical information. Within this hypergraph, a reinforcement learning agent, Walker, navigates to and identifies optimal diagnostic paths. To ensure comprehensive coverage of diverse clinical characteristics in test samples, we incorporate a linger mechanism, a multi-hop orthogonal retrieval strategy that iteratively selects clinically complementary neighborhood cases reflecting distinct clinical attributes. Experiments on MRG with MIMIC and medical VQA on EHRXQA demonstrate that HyperWalker achieves state-of-the-art performance. Code is available at: this https URL
中文摘要 自动化临床诊断仍然是医疗人工智能的核心挑战，通常需要模型整合多模态数据，并在复杂、病例特定情境中进行推理。尽管近期方法通过医学视觉语言模型（VLMs）推进了医疗报告生成（MRG）和视觉问答（VQA）技术，但这些方法主要采用样本隔离推断范式，即在无法访问纵向电子健康记录（EHR）或结构相关患者示例的情况下独立处理病例。该范式将推理限制在仅依赖图像来源的信息，忽视了可能带来更准确诊断的外部补充医学证据。为克服这一限制，我们提出了HyperWalker，这是一个深度诊断（Deep Diagnosis）框架，通过动态超图和测试时训练重新表述临床推理。首先，我们构建了一个动态超图，称为iBrochure，用以建模EHR数据的结构异质性以及多模态临床信息间隐含的高阶关联。在这个超图中，强化学习智能体Walker导航并识别最佳诊断路径。为确保全面覆盖测试样本中多样的临床特征，我们采用了linger机制，这是一种多跳正交检索策略，迭代地选择反映不同临床属性、在临床上互补的邻近病例。在MIMIC上的MRG实验和EHRXQA上的医学VQA实验表明，HyperWalker实现了最先进的性能。代码可在以下 https URL 获取

Glance-or-Gaze: Incentivizing LMMs to Adaptively Focus Search via Reinforcement Learning

一瞥或凝视：通过强化学习激励LMM自适应聚焦搜索

Authors: Hongbo Bai, Yujin Zhou, Yile Wu, Chi-Min Chan, Pengcheng Wen, Kunhao Pan, Sirui Han, Yike Guo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.13942
Pdf link: https://arxiv.org/pdf/2601.13942
Abstract Large Multimodal Models (LMMs) have achieved remarkable success in visual understanding, yet they struggle with knowledge-intensive queries involving long-tail entities or evolving information due to static parametric knowledge. Recent search-augmented approaches attempt to address this limitation, but existing methods rely on indiscriminate whole-image retrieval that introduces substantial visual redundancy and noise, and lack deep iterative reflection, limiting their effectiveness on complex visual queries. To overcome these challenges, we propose Glance-or-Gaze (GoG), a fully autonomous framework that shifts from passive perception to active visual planning. GoG introduces a Selective Gaze mechanism that dynamically chooses whether to glance at global context or gaze into high-value regions, filtering irrelevant information before retrieval. We design a dual-stage training strategy: Reflective GoG Behavior Alignment via supervised fine-tuning instills the fundamental GoG paradigm, while Complexity-Adaptive Reinforcement Learning further enhances the model's capability to handle complex queries through iterative reasoning. Experiments across six benchmarks demonstrate state-of-the-art performance. Ablation studies confirm that both Selective Gaze and complexity-adaptive RL are essential for effective visual search. We will release our data and models for further exploration soon.
中文摘要 大型多模态模型（LMM）在视觉理解方面取得了显著成功，但由于参数化知识是静态的，它们在处理涉及长尾实体或不断变化的信息的知识密集型查询时仍面临困难。近期的搜索增强方法试图解决这一局限性，但现有方法依赖于无差别的全图检索，这会引入大量的视觉冗余和噪声，且缺乏深度迭代反思，限制了其在复杂视觉查询中的有效性。为克服这些挑战，我们提出了Glance-or-Gaze（GoG，一瞥或凝视），这是一种完全自主的框架，能够从被动感知转向主动视觉规划。GoG引入了选择性凝视机制，动态选择是一瞥全局上下文还是凝视高价值区域，在检索前过滤无关信息。我们设计了双阶段训练策略：通过监督微调进行反思性GoG行为对齐，奠定GoG的基本范式，而复杂度自适应强化学习则进一步增强模型通过迭代推理处理复杂查询的能力。在六个基准上的实验展示了最先进的性能。消融研究证实，选择性凝视和复杂度自适应强化学习对有效的视觉搜索都至关重要。我们将很快发布数据和模型，供进一步探索。

RL-BioAug: Label-Efficient Reinforcement Learning for Self-Supervised EEG Representation Learning

RL-BioAug：自监督脑电表征学习的标签高效强化学习

Authors: Cheol-Hui Lee, Hwa-Yeon Lee, Dong-Joo Kim
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.13964
Pdf link: https://arxiv.org/pdf/2601.13964
Abstract The quality of data augmentation serves as a critical determinant for the performance of contrastive learning in EEG tasks. Although this paradigm is promising for utilizing unlabeled data, static or random augmentation strategies often fail to preserve intrinsic information due to the non-stationarity of EEG signals where statistical properties change over time. To address this, we propose RL-BioAug, a framework that leverages a label-efficient reinforcement learning (RL) agent to autonomously determine optimal augmentation policies. While utilizing only a minimal fraction (10%) of labeled data to guide the agent's policy, our method enables the encoder to learn robust representations in a strictly self-supervised manner. Experimental results demonstrate that RL-BioAug significantly outperforms the random selection strategy, achieving substantial improvements of 9.69% and 8.80% in Macro-F1 score on the Sleep-EDFX and CHB-MIT datasets, respectively. Notably, this agent mainly chose optimal strategies for each task -- for example, Time Masking with a 62% probability for sleep stage classification and Crop & Resize with a 77% probability for seizure detection. Our framework suggests its potential to replace conventional heuristic-based augmentations and establish a new autonomous paradigm for data augmentation. The source code is available at this https URL.
中文摘要 数据增强的质量是脑电图任务中对比学习表现的关键决定因素。尽管该范式在利用无标记数据方面前景看好，但由于脑电信号具有非平稳性，统计属性随时间变化，静态或随机增强策略常常无法保留内在信息。为此，我们提出了RL-BioAug框架，利用标签高效的强化学习（RL）智能体自主确定最优增强策略。虽然仅使用极少比例（10%）的标记数据来指导智能体的策略，但我们的方法使编码器能够以严格自监督的方式学习稳健的表示。实验结果显示，RL-BioAug显著优于随机选择策略，在Sleep-EDFX和CHB-MIT数据集上的Macro-F1分数分别显著提升了9.69%和8.80%。值得注意的是，该智能体主要为每个任务选择了最优策略，例如，在睡眠分期任务中以62%的概率选择时间掩蔽（Time Masking），在癫痫发作检测任务中以77%的概率选择裁剪与缩放（Crop & Resize）。我们的框架显示其有潜力取代传统的启发式增强，建立一种新的自主数据增强范式。源代码可在 this https URL 获取。

RM-Distiller: Exploiting Generative LLM for Reward Model Distillation

RM-Distiller：利用生成式大型语言模型进行奖励模型蒸馏

Authors: Hongli Zhou, Hui Huang, Wei Liu, Chenglong Wang, Xingyuan Bu, Lvyuan Han, Fuhai Song, Muyun Yang, Wenhao Jiang, Hailong Cao, Tiejun Zhao
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.14032
Pdf link: https://arxiv.org/pdf/2601.14032
Abstract Reward models (RMs) play a pivotal role in aligning large language models (LLMs) with human preferences. Due to the difficulty of obtaining high-quality human preference annotations, distilling preferences from generative LLMs has emerged as a standard practice. However, existing approaches predominantly treat teacher models as simple binary annotators, failing to fully exploit the rich knowledge and capabilities for RM distillation. To address this, we propose RM-Distiller, a framework designed to systematically exploit the multifaceted capabilities of teacher LLMs: (1) Refinement capability, which synthesizes highly correlated response pairs to create fine-grained and contrastive signals. (2) Scoring capability, which guides the RM in capturing precise preference strength via a margin-aware optimization objective. (3) Generation capability, which incorporates the teacher's generative distribution to regularize the RM to preserve its fundamental linguistic knowledge. Extensive experiments demonstrate that RM-Distiller significantly outperforms traditional distillation methods both on RM benchmarks and reinforcement learning-based alignment, proving that exploiting multifaceted teacher capabilities is critical for effective reward modeling. To the best of our knowledge, this is the first systematic research on RM distillation from generative LLMs.
中文摘要 奖励模型（RM）在将大型语言模型（LLMs）与人类偏好对齐方面起着关键作用。由于获取高质量的人类偏好注释困难，从生成式大型语言模型中提炼偏好已成为一种标准做法。然而，现有方法主要将教师模型视为简单的二进制注释符，未能充分利用 RM 蒸馏的丰富知识和能力。为此，我们提出了RM-Distiller框架，旨在系统地利用教师LLMs的多方面能力：（1）精细化能力，将高度相关的反应对合成，生成细粒度和对比性强的信号。（2）评分能力，通过边际感知优化目标指导 RM 捕捉精确的偏好强度。（3）生成能力，结合教师的生成分布来规范 RM，以保持其基本语言知识。大量实验表明，RM-Distiller在RM基准和基于强化学习的对齐中显著优于传统蒸馏方法，证明利用多元教师能力对于有效奖励建模至关重要。据我们所知，这是首次系统性地研究生成式大型语言模型（LLM）进行 RM 蒸馏。

Optimizing Energy and Data Collection in UAV-aided IoT Networks using Attention-based Multi-Objective Reinforcement Learning

利用基于注意力的多目标强化学习优化无人机辅助物联网网络中的能源和数据收集

Authors: Babacar Toure, Dimitrios Tsilimantos, Omid Esrafilian, Marios Kountouris
Subjects: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2601.14092
Pdf link: https://arxiv.org/pdf/2601.14092
Abstract Due to their adaptability and mobility, Unmanned Aerial Vehicles (UAVs) are becoming increasingly essential for wireless network services, particularly for data harvesting tasks. In this context, Artificial Intelligence (AI)-based approaches have gained significant attention for addressing UAV path planning tasks in large and complex environments, bridging the gap with real-world deployments. However, many existing algorithms suffer from limited training data, which hampers their performance in highly dynamic environments. Moreover, they often overlook the inherently multi-objective nature of the task, treating it in an overly simplistic manner. To address these limitations, we propose an attention-based Multi-Objective Reinforcement Learning (MORL) architecture that explicitly handles the trade-off between data collection and energy consumption in urban environments, even without prior knowledge of wireless channel conditions. Our method develops a single model capable of adapting to varying trade-off preferences and dynamic scenario parameters without the need for fine-tuning or retraining. Extensive simulations show that our approach achieves substantial improvements in performance, model compactness, sample efficiency, and most importantly, generalization to previously unseen scenarios, outperforming existing RL solutions.
中文摘要 由于其适应性和机动性，无人机（UAV）在无线网络服务中变得越来越重要，尤其是在数据采集任务中。在此背景下，基于人工智能（AI）的方法在应对大型复杂环境中无人机路径规划任务中备受关注，弥合了与现实部署的差距。然而，许多现有算法训练数据有限，这影响了它们在高度动态环境中的性能。此外，他们常常忽视任务本质上的多目标性质，过于简化地处理。为解决这些局限性，我们提出了一种基于注意力的多目标强化学习（MORL）架构，明确处理城市环境中数据收集与能耗之间的权衡，即使事先不了解无线信道状况。我们的方法开发了一个能够适应不同权衡偏好和动态场景参数的单一模型，无需微调或重新训练。大量模拟表明，我们的方法在性能、模型紧凑性、样本效率等方面取得了显著提升，最重要的是，能够推广到前所未有的场景，优于现有的强化学习解决方案。

Diffusion-Guided Backdoor Attacks in Real-World Reinforcement Learning

扩散引导后门攻击在现实强化学习中的应用

Authors: Tairan Huang, Qingqing Ye, Yulin Jin, Jiawei Lian, Yi Wang, Haibo Hu
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.14104
Pdf link: https://arxiv.org/pdf/2601.14104
Abstract Backdoor attacks embed hidden malicious behaviors in reinforcement learning (RL) policies and activate them using triggers at test time. Most existing attacks are validated only in simulation, while their effectiveness in real-world robotic systems remains unclear. In physical deployment, safety-constrained control pipelines such as velocity limiting, action smoothing, and collision avoidance suppress abnormal actions, causing strong attenuation of conventional backdoor attacks. We study this previously overlooked problem and propose a diffusion-guided backdoor attack framework (DGBA) for real-world RL. We design small printable visual patch triggers placed on the floor and generate them using a conditional diffusion model that produces diverse patch appearances under real-world visual variations. We treat the robot control stack as a black-box system. We further introduce an advantage-based poisoning strategy that injects triggers only at decision-critical training states. We evaluate our method on a TurtleBot3 mobile robot and demonstrate reliable activation of targeted attacks while preserving normal task performance. Demo videos and code are available in the supplementary material.
中文摘要 后门攻击在强化学习（RL）策略中嵌入隐藏的恶意行为，并在测试时通过触发器激活它们。大多数现有攻击仅在仿真中得到验证，其在现实机器人系统中的有效性尚不明确。在物理部署中，安全约束的控制流水线如速度限制、动作平滑和碰撞避免抑制异常动作，从而大幅削弱传统后门攻击。我们研究了这一此前被忽视的问题，并提出了用于现实世界强化学习的扩散引导后门攻击框架（DGBA）。我们设计放置在地板上的小型可打印视觉补丁，并利用条件扩散模型生成，在现实视觉变化下产生多样化的补丁外观。我们把机器人控制栈当作黑箱系统。我们还进一步引入了一种基于优势的中毒策略，仅在决策关键的训练状态注入触发点。我们在TurtleBot3移动机器人上评估了该方法，并展示了在保持正常任务性能的同时，目标攻击的可靠激活。演示视频和代码可在补充资料中提供。

CREATE: Cross-Layer Resilience Characterization and Optimization for Efficient yet Reliable Embodied AI Systems

CREATE：跨层韧性特性描述与优化，以实现高效且可靠的具身人工智能系统

Authors: Tong Xie, Yijiahao Qi, Jinqi Wen, Zishen Wan, Yanchi Dong, Zihao Wang, Shaofei Cai, Yitao Liang, Tianyu Jia, Yuan Wang, Runsheng Wang, Meng Li
Subjects: Hardware Architecture (cs.AR)
Arxiv link: https://arxiv.org/abs/2601.14140
Pdf link: https://arxiv.org/pdf/2601.14140
Abstract Embodied Artificial Intelligence (AI) has recently attracted significant attention as it bridges AI with the physical world. Modern embodied AI systems often combine a Large Language Model (LLM)-based planner for high-level task planning and a reinforcement learning (RL)-based controller for low-level action generation, enabling embodied agents to tackle complex tasks in real-world environments. However, deploying embodied agents remains challenging due to their high computation requirements, especially for battery-powered local devices. Although techniques like lowering operating voltage can improve energy efficiency, they can introduce bit errors and result in task failures. In this work, we propose CREATE, a general design principle that leverages heterogeneous resilience at different layers for synergistic energy-reliability co-optimization. For the first time, we conduct a comprehensive error injection study on modern embodied AI systems and observe an inherent but heterogeneous fault tolerance. Building upon these insights, we develop an anomaly detection and clearance mechanism at the circuit level to eliminate outlier errors. At the model level, we propose a weight-rotation-enhanced planning algorithm to improve the fault tolerance of the LLM-based planner. Furthermore, we introduce an application-level technique, autonomy-adaptive voltage scaling, to dynamically adjust the operating voltage of the controllers. The voltage scaling circuit is co-designed to enable online voltage adjustment. Extensive experiments demonstrate that without compromising task quality, CREATE achieves 40.6% computational energy savings on average over nominal-voltage baselines and 35.0% over prior-art techniques. This further leads to 29.5% to 37.3% chip-level energy savings and approximately a 15% to 30% improvement in battery life.
中文摘要 具身人工智能（AI）最近引起了广泛关注，因为它将人工智能与现实世界连接起来。现代具身型人工智能系统通常结合基于大型语言模型（LLM）的规划器用于高级任务规划，基于强化学习（RL）的控制器用于低层次动作生成，使具身智能体能够在现实环境中处理复杂任务。然而，部署具身代理因其高计算需求，尤其是电池供电的本地设备，仍然具有挑战性。虽然降低工作电压等技术可以提高能源效率，但它们可能引入位误差，导致任务失败。在本研究中，我们提出了CREATE这一通用设计原则，利用不同层的异质韧性实现协同的能源可靠性协同优化。我们首次在现代具象人工智能系统上进行了全面的错误注入研究，观察到其固有但异构的容错能力。基于这些见解，我们开发了电路层面的异常检测和清除机制，以消除异常值错误。在模型层面，我们提出了一种权重旋转增强的规划算法，以提升基于LLM的规划器的容错能力。此外，我们还引入了一种应用级技术——自主自适应电压标度，用于动态调整控制器的工作电压。电压调节电路设计支持在线电压调节。大量实验表明，在不牺牲任务质量的情况下，CREATE在标称电压基线下平均节省了40.6%的计算能耗，较以往技术节省了35.0%。这进一步带来了29.5%至37.3%的芯片级节能，电池寿命提升约15%至30%。

Toward Efficient Agents: Memory, Tool learning, and Planning

迈向高效代理：记忆、工具学习与规划

Authors: Xiaofang Yang, Lijun Li, Heng Zhou, Tong Zhu, Xiaoye Qu, Yuchen Fan, Qianshan Wei, Rui Ye, Li Kang, Yiran Qin, Zhiqiang Kou, Daizong Liu, Qi Li, Ning Ding, Siheng Chen, Jing Shao
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.14192
Pdf link: https://arxiv.org/pdf/2601.14192
Abstract Recent years have witnessed increasing interest in extending large language models into agentic systems. While the effectiveness of agents has continued to improve, efficiency, which is crucial for real-world deployment, has often been overlooked. This paper therefore investigates efficiency from three core components of agents: memory, tool learning, and planning, considering costs such as latency, tokens, steps, etc. Aimed at conducting comprehensive research addressing the efficiency of the agentic system itself, we review a broad range of recent approaches that differ in implementation yet frequently converge on shared high-level principles including but not limited to bounding context via compression and management, designing reinforcement learning rewards to minimize tool invocation, and employing controlled search mechanisms to enhance efficiency, which we discuss in detail. Accordingly, we characterize efficiency in two complementary ways: comparing effectiveness under a fixed cost budget, and comparing cost at a comparable level of effectiveness. This trade-off can also be viewed through the Pareto frontier between effectiveness and cost. From this perspective, we also examine efficiency oriented benchmarks by summarizing evaluation protocols for these components and consolidating commonly reported efficiency metrics from both benchmark and methodological studies. Moreover, we discuss the key challenges and future directions, with the goal of providing promising insights.
中文摘要 近年来，将大型语言模型扩展到智能系统方面兴趣日益增长。尽管代理的效能持续提升，但效率——对实际部署至关重要——却常被忽视。因此，本文从智能体的三个核心组成部分——记忆、工具学习和规划——中探讨效率，考虑了延迟、令牌、步数等成本。旨在开展针对代理系统自身效率的综合研究，我们回顾了一系列近期方法，这些方法在实现上有所不同，但常常趋同于共同的高层原则，包括但不限于通过压缩和管理来界定上下文、设计强化学习奖励以最小化工具调用，以及采用受控搜索机制提升效率，我们会详细讨论。因此，我们将效率描述为两种互补方式：在固定成本预算下比较效率，以及在可比有效水平下比较成本。这种权衡也可以通过帕累托边界的效能与成本来理解。从这个角度，我们还通过总结这些组件的评估方案，整合基准和方法学研究中常见的效率指标，来审视以效率为导向的基准。此外，我们还讨论了主要挑战和未来方向，旨在提供有希望的见解。
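
The two ways of characterizing efficiency described in this abstract (effectiveness under a fixed cost budget, cost at a comparable level of effectiveness) and the Pareto-frontier view can be illustrated with a minimal sketch. The agent names, token costs, and success rates below are hypothetical values invented for illustration; they are not taken from the paper.

```python
from typing import List, Optional, Tuple

# Each point is (name, cost, effectiveness); lower cost and higher effectiveness are better.
Point = Tuple[str, float, float]


def pareto_frontier(points: List[Point]) -> List[str]:
    """Return the names of points that no other point dominates."""
    frontier = []
    for name, cost, eff in points:
        dominated = any(
            (c2 <= cost and e2 >= eff) and (c2 < cost or e2 > eff)
            for _, c2, e2 in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


def best_under_budget(points: List[Point], budget: float) -> Optional[str]:
    """Most effective point whose cost does not exceed the budget."""
    feasible = [p for p in points if p[1] <= budget]
    return max(feasible, key=lambda p: p[2])[0] if feasible else None


def cheapest_at_level(points: List[Point], min_eff: float) -> Optional[str]:
    """Cheapest point that reaches at least the required effectiveness."""
    good_enough = [p for p in points if p[2] >= min_eff]
    return min(good_enough, key=lambda p: p[1])[0] if good_enough else None


# Hypothetical agents: (name, tokens per task, task success rate).
agents = [
    ("A", 1000.0, 0.60),
    ("B", 2000.0, 0.75),
    ("C", 2500.0, 0.70),
    ("D", 4000.0, 0.80),
]

print(pareto_frontier(agents))
print(best_under_budget(agents, budget=2500.0))
print(cheapest_at_level(agents, min_eff=0.70))
```

In this toy data, agent C costs more than B while being less effective, so it falls off the frontier; the other three represent different cost/effectiveness trade-offs.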

Differentiated Pickup Point Offering for Emission Reduction in Last-Mile Delivery

差异化取货点服务，实现最后一公里配送的减排

Authors: Albina Galiullina, Wouter van Heeswijk, Tom van Woensel
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.14196
Pdf link: https://arxiv.org/pdf/2601.14196
Abstract Pickup points are widely recognized as a sustainable alternative to home delivery, as consolidating orders at pickup locations can shorten delivery routes and improve first-attempt success rates. However, these benefits may be negated when customers drive to pick up their orders. This study proposes a Differentiated Pickup Point Offering (DPO) policy that aims to jointly reduce emissions from delivery truck routes and customer travel. Under DPO, each arriving customer is offered a single recommended pickup point, rather than an unrestricted choice among all locations, while retaining the option of home delivery. We study this problem in a dynamic and stochastic setting, where the pickup point offered to each customer depends on previously realized customer locations and delivery choices. To design effective DPO policies, we adopt a reinforcement learning-based approach that accounts for spatial relationships between customers and pickup points and their implications for future route consolidation. Computational experiments show that differentiated pickup point offerings can substantially reduce total carbon emissions. The proposed policies reduce total emissions by up to 9% relative to home-only delivery and by 2% on average compared with alternative policies, including unrestricted pickup point choice and nearest pickup point assignment. Differentiated offerings are particularly effective in dense urban settings with many pickup points and short inter-location distances. Moreover, explicitly accounting for the dynamic nature of customer arrivals and choices is especially important when customers are less inclined to choose pickup point delivery over home delivery.
中文摘要 取货点被广泛认为是送货上门的可持续替代方案，因为将订单集中在取货点可以缩短送货路线并提高首次投递成功率。然而，当客户开车去取件时，这些好处可能会被抵消。本研究提出了一种差异化取货点服务（DPO）策略，旨在同时减少送货卡车路线和客户出行产生的排放。在DPO下，每位到达的客户只会获得一个推荐的取货点，而不是在所有地点中自由选择，同时保留送货上门的选项。我们在动态和随机环境中研究该问题，即提供给每位客户的取货点取决于先前已实现的客户位置和配送选择。为了设计有效的DPO策略，我们采用基于强化学习的方法，考虑客户与取货点之间的空间关系及其对未来路线整合的影响。计算实验表明，差异化的取货点服务可以显著减少总碳排放。所提出的策略相较仅送货上门最多减少9%的总排放，与包括无限制取货点选择和最近取货点分配在内的替代策略相比，平均减少2%。差异化服务在取货点众多、地点间距离较短的密集城市环境中尤其有效。此外，当客户不太倾向于选择取货点配送而非送货上门时，明确考虑客户到达和选择的动态特性尤为重要。

InT: Self-Proposed Interventions Enable Credit Assignment in LLM Reasoning

InT：自我提出的干预使LLM推理中的信用分配成为可能

Authors: Matthew Y. R. Yang, Hao Bai, Ian Wu, Gene Yang, Amrith Setlur, Aviral Kumar
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.14209
Pdf link: https://arxiv.org/pdf/2601.14209
Abstract Outcome-reward reinforcement learning (RL) has proven effective at improving the reasoning capabilities of large language models (LLMs). However, standard RL assigns credit only at the level of the final answer, penalizing entire reasoning traces when the outcome is incorrect and uniformly reinforcing all steps when it is correct. As a result, correct intermediate steps may be discouraged in failed traces, while spurious steps may be reinforced in successful ones. We refer to this failure mode as the problem of credit assignment. While a natural remedy is to train a process reward model, accurately optimizing such models to identify corrective reasoning steps remains challenging. We introduce Intervention Training (InT), a training paradigm in which the model performs fine-grained credit assignment on its own reasoning traces by proposing short, targeted corrections that steer trajectories toward higher reward. Using reference solutions commonly available in mathematical reasoning datasets and exploiting the fact that verifying a model-generated solution is easier than generating a correct one from scratch, the model identifies the first error in its reasoning and proposes a single-step intervention to redirect the trajectory toward the correct solution. We then apply supervised fine-tuning (SFT) to the on-policy rollout up to the point of error concatenated with the intervention, localizing error to the specific step that caused failure. We show that the resulting model serves as a far better initialization for RL training. After running InT and subsequent fine-tuning with RL, we improve accuracy by nearly 14% over a 4B-parameter base model on IMO-AnswerBench, outperforming larger open-source models such as gpt-oss-20b.
中文摘要 结果奖励强化学习（RL）已被证明能有效提升大型语言模型（LLMs）的推理能力。然而，标准强化学习只在最终答案层面分配信用，结果错误时惩罚整条推理轨迹，结果正确时则统一强化所有步骤。因此，失败轨迹中正确的中间步骤可能受到抑制，而成功轨迹中虚假的步骤则可能被强化。我们将这种失败模式称为信用分配问题。虽然自然的解决办法是训练过程奖励模型，但准确优化此类模型以识别纠正性推理步骤仍具挑战性。我们引入了干预训练（InT），这是一种训练范式，模型通过提出简短、有针对性的修正，引导轨迹朝向更高奖励，从而对自身推理轨迹进行细粒度的信用分配。利用数学推理数据集中常见的参考解，并利用验证模型生成的解比从零生成正确解更容易这一事实，模型识别出推理中的第一个错误，并提出单步干预以将轨迹重新引向正确解。然后，我们对截至出错位置的同策略（on-policy）rollout与该干预拼接后的序列应用监督微调（SFT），从而将错误定位到导致失败的具体步骤。我们证明，所得模型作为强化学习训练的初始化效果要好得多。在运行InT并随后用强化学习微调后，我们在IMO-AnswerBench上相比4B参数基础模型的准确率提升了近14%，优于更大型的开源模型如gpt-oss-20b。
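
The SFT data construction described in this abstract (the on-policy rollout up to the point of error, concatenated with the intervention) can be sketched roughly as follows. This is an illustrative sketch only: the function name, the step-list representation, and the choice to replace the erroneous step with the intervention are assumptions, not details confirmed by the paper.

```python
from typing import Dict, List


def build_int_sft_example(
    question: str,
    rollout_steps: List[str],
    first_error_index: int,
    intervention: str,
) -> Dict[str, str]:
    """Build one hypothetical SFT example from a failed rollout.

    Keeps the model's own steps before the first error and appends the
    self-proposed single-step intervention in place of the erroneous step
    (assumption: the erroneous step itself is dropped).
    """
    if not 0 <= first_error_index < len(rollout_steps):
        raise ValueError("first_error_index must point at a step in the rollout")
    prefix = rollout_steps[:first_error_index]  # steps judged correct so far
    completion = "\n".join(prefix + [intervention])
    return {"prompt": question, "completion": completion}


# Hypothetical failed rollout in which the second step is the first error.
example = build_int_sft_example(
    question="Compute 12 * 13.",
    rollout_steps=["12 * 13 = 12 * 10 + 12 * 3", "= 120 + 26", "= 146"],
    first_error_index=1,
    intervention="= 120 + 36",
)
print(example["completion"])
```

Training only on the prefix plus the correction, rather than on the whole failed trace, is what localizes the learning signal to the step that caused the failure.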

Attention-Based Offline Reinforcement Learning and Clustering for Interpretable Sepsis Treatment

用于可解释败血症治疗的基于注意力的离线强化学习与聚类

Authors: Punit Kumar, Vaibhav Saran, Divyesh Patel, Nitin Kulkarni, Alina Vereshchaka
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.14228
Pdf link: https://arxiv.org/pdf/2601.14228
Abstract Sepsis remains one of the leading causes of mortality in intensive care units, where timely and accurate treatment decisions can significantly impact patient outcomes. In this work, we propose an interpretable decision support framework. Our system integrates four core components: (1) a clustering-based stratification module that categorizes patients into low, intermediate, and high-risk groups upon ICU admission, using clustering with statistical validation; (2) a synthetic data augmentation pipeline leveraging variational autoencoders (VAE) and diffusion models to enrich underrepresented trajectories such as fluid or vasopressor administration; (3) an offline reinforcement learning (RL) agent trained using Advantage Weighted Regression (AWR) with a lightweight attention encoder and supported by an ensemble of models for conservative, safety-aware treatment recommendations; and (4) a rationale generation module powered by a multi-modal large language model (LLM), which produces natural-language justifications grounded in clinical context and retrieved expert knowledge. Evaluated on the MIMIC-III and eICU datasets, our approach achieves high treatment accuracy while providing clinicians with interpretable and robust policy recommendations.
中文摘要 败血症仍是重症监护病房死亡的主要原因之一，及时且准确的治疗决策能显著影响患者的治疗效果。在本研究中，我们提出了一个可解释的决策支持框架。我们的系统整合了四个核心组成部分：（1）基于聚类的分层模块，利用统计验证将患者在ICU入院时分为低风险、中风险和高风险组;（2）利用变分自编码器（VAE）和扩散模型的合成数据增强流水线，以丰富如液体或血管压压剂给药等未被充分代表的轨迹;（3）使用优势加权回归（AWR）训练的离线强化学习（RL）代理，配备轻量级注意力编码器，并由集合模型支持，提供保守且安全意识的治疗建议;以及（4）由多模态大型语言模型（LLM）驱动的理据生成模块，能够生成基于临床语境和检索到的专家知识的自然语言辩护。在MIMIC-III和eICU数据集上评估，我们的方法实现了高治疗准确性，同时为临床医生提供了可解释且稳健的政策建议。
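
For readers unfamiliar with Advantage Weighted Regression (AWR), mentioned in component (3), the core idea is a supervised policy update in which each logged action is weighted by the exponentiated advantage, exp(A / beta). The sketch below shows only that weighting on toy numbers; the temperature `beta`, the weight clip `max_weight`, and the sample values are illustrative assumptions, and the paper's attention encoder and model ensemble are not modeled here.

```python
import math
from typing import List


def awr_weights(
    advantages: List[float], beta: float = 1.0, max_weight: float = 20.0
) -> List[float]:
    """Exponentiated-advantage weights exp(A / beta), clipped for stability."""
    return [min(math.exp(a / beta), max_weight) for a in advantages]


def awr_policy_loss(
    log_probs: List[float],
    advantages: List[float],
    beta: float = 1.0,
    max_weight: float = 20.0,
) -> float:
    """Weighted negative log-likelihood of the logged actions."""
    weights = awr_weights(advantages, beta, max_weight)
    weighted = [w * lp for w, lp in zip(weights, log_probs)]
    return -sum(weighted) / len(weighted)


# Hypothetical batch: log pi(a|s) of the logged actions and their advantage estimates.
log_probs = [-0.5, -1.2, -2.0]
advantages = [1.0, 0.0, -1.0]

print(awr_weights(advantages))
print(awr_policy_loss(log_probs, advantages))
```

Actions with higher estimated advantage receive larger weights, so the policy is pulled toward better-than-average logged decisions without querying actions outside the dataset, which is why this family of methods is a common choice for offline settings.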

KAGE-Bench: Fast Known-Axis Visual Generalization Evaluation for Reinforcement Learning

KAGE-Bench：强化学习中的快速已知轴视觉泛化评估

Authors: Egor Cherepanov, Daniil Zelezetsky, Alexey K. Kovalev, Aleksandr I. Panov
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.14232
Pdf link: https://arxiv.org/pdf/2601.14232
Abstract Pixel-based reinforcement learning agents often fail under purely visual distribution shift even when latent dynamics and rewards are unchanged, but existing benchmarks entangle multiple sources of shift and hinder systematic analysis. We introduce KAGE-Env, a JAX-native 2D platformer that factorizes the observation process into independently controllable visual axes while keeping the underlying control problem fixed. By construction, varying a visual axis affects performance only through the induced state-conditional action distribution of a pixel policy, providing a clean abstraction for visual generalization. Building on this environment, we define KAGE-Bench, a benchmark of six known-axis suites comprising 34 train-evaluation configuration pairs that isolate individual visual shifts. Using a standard PPO-CNN baseline, we observe strong axis-dependent failures, with background and photometric shifts often collapsing success, while agent-appearance shifts are comparatively benign. Several shifts preserve forward motion while breaking task completion, showing that return alone can obscure generalization failures. Finally, the fully vectorized JAX implementation enables up to 33M environment steps per second on a single GPU, enabling fast and reproducible sweeps over visual factors. Code: this https URL.
中文摘要 基于像素的强化学习智能体在纯视觉分布偏移下常常失败，即使潜在动态和奖励未变，但现有基准将多个偏移来源纠缠在一起，阻碍系统分析。我们引入了KAGE-Env，一款JAX原生的2D平台游戏，将观测过程分解为独立可控的视觉轴，同时保持底层控制问题固定。从构造上看，改变某个视觉轴仅通过像素策略所诱导的状态条件动作分布影响性能，为视觉泛化提供了清晰的抽象。基于这一环境，我们定义了KAGE-Bench，这是一个由六个已知轴套件组成的基准，包含34对训练-评估配置，用于隔离单个视觉偏移。使用标准的PPO-CNN基线，我们观察到强烈的轴相关失败，背景和光度偏移常常导致成功率崩溃，而智能体外观偏移则相对温和。若干偏移在破坏任务完成的同时保留了前进运动，表明仅看回报可能掩盖泛化失败。最后，完全向量化的JAX实现支持在单个GPU上每秒高达3300万环境步，实现快速且可复现的视觉因素扫描。代码：这个 https URL。

Q-learning with Adjoint Matching

带伴随匹配的Q学习

Authors: Qiyang Li, Sergey Levine
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2601.14234
Pdf link: https://arxiv.org/pdf/2601.14234
Abstract We propose Q-learning with Adjoint Matching (QAM), a novel TD-based reinforcement learning (RL) algorithm that tackles a long-standing challenge in continuous-action RL: efficient optimization of an expressive diffusion or flow-matching policy with respect to a parameterized Q-function. Effective optimization requires exploiting the first-order information of the critic, but it is challenging to do so for flow or diffusion policies because direct gradient-based optimization via backpropagation through their multi-step denoising process is numerically unstable. Existing methods work around this either by only using the value and discarding the gradient information, or by relying on approximations that sacrifice policy expressivity or bias the learned policy. QAM sidesteps both of these challenges by leveraging adjoint matching, a recently proposed technique in generative modeling, which transforms the critic's action gradient to form a step-wise objective function that is free from unstable backpropagation, while providing an unbiased, expressive policy at the optimum. Combined with temporal-difference backup for critic learning, QAM consistently outperforms prior approaches on hard, sparse reward tasks in both offline and offline-to-online RL.
中文摘要 我们提出了带伴随匹配的Q学习（QAM），这是一种基于TD的新型强化学习（RL）算法，解决了连续动作强化学习长期面临的挑战：针对参数化Q函数高效优化富有表达力的扩散或流匹配策略。有效的优化需要利用评论家（critic）的一阶信息，但对于流或扩散策略来说，这样做具有挑战性，因为通过多步去噪过程反向传播进行的基于梯度的直接优化在数值上不稳定。现有方法要么只使用价值而舍弃梯度信息，要么依赖会牺牲策略表达力或使所学策略产生偏差的近似方法来绕过这个问题。QAM 通过利用伴随匹配（adjoint matching）这一近期提出的生成建模技术，绕过了这两个挑战，该技术将评论家的动作梯度转化为一个逐步的目标函数，无需不稳定的反向传播，同时在最优处给出无偏且富有表达力的策略。结合用于评论家学习的时序差分备份，QAM在离线和离线到在线的强化学习中，在困难且奖励稀疏的任务上始终优于以往方法。

Spatiotemporal Wildfire Prediction and Reinforcement Learning for Helitack Suppression

面向直升机扑火的时空野火预测与强化学习

Authors: Shaurya Mathur, Shreyas Bellary Manjunath, Nitin Kulkarni, Alina Vereshchaka
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.14238
Pdf link: https://arxiv.org/pdf/2601.14238
Abstract Wildfires are growing in frequency and intensity, devastating ecosystems and communities while causing billions of dollars in suppression costs and economic damage annually in the U.S. Traditional wildfire management is mostly reactive, addressing fires only after they are detected. We introduce \textit{FireCastRL}, a proactive artificial intelligence (AI) framework that combines wildfire forecasting with intelligent suppression strategies. Our framework first uses a deep spatiotemporal model to predict wildfire ignition. For high-risk predictions, we deploy a pre-trained reinforcement learning (RL) agent to execute real-time suppression tactics with helitack units inside a physics-informed 3D simulation. The framework generates a threat assessment report to help emergency responders optimize resource allocation and planning. In addition, we are publicly releasing a large-scale, spatiotemporal dataset containing $\mathbf{9.5}$ million samples of environmental variables for wildfire prediction. Our work demonstrates how deep learning and RL can be combined to support both forecasting and tactical wildfire response. More details can be found at this https URL.
中文摘要 野火的频率和强度都在不断增加，严重破坏生态系统和社区，同时每年在美国造成数十亿美元的扑救成本和经济损失。传统的野火管理大多是被动应对，只有在火灾被发现后才进行处理。我们提出FireCastRL，一个主动式人工智能（AI）框架，将野火预测与智能扑救策略相结合。我们的框架首先使用深度时空模型来预测野火的起火。对于高风险预测，我们部署预训练的强化学习（RL）智能体，在物理信息驱动的三维模拟中使用直升机灭火（helitack）单元执行实时扑救战术。该框架生成威胁评估报告，帮助应急响应人员优化资源分配和规划。此外，我们还公开发布了一个包含950万个环境变量样本的大规模时空数据集，用于野火预测。我们的研究展示了深度学习与强化学习如何结合，以同时支持预测和战术层面的野火响应。更多详情可见此 https URL。

Jet-RL: Enabling On-Policy FP8 Reinforcement Learning with Unified Training and Rollout Precision Flow

Jet-RL：通过统一的训练与rollout精度流程实现同策略（on-policy）FP8强化学习

Authors: Haocheng Xi, Charlie Ruan, Peiyuan Liao, Yujun Lin, Han Cai, Yilong Zhao, Shuo Yang, Kurt Keutzer, Song Han, Ligeng Zhu
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.14243
Pdf link: https://arxiv.org/pdf/2601.14243
Abstract Reinforcement learning (RL) is essential for enhancing the complex reasoning capabilities of large language models (LLMs). However, existing RL training pipelines are computationally inefficient and resource-intensive, with the rollout phase accounting for over 70% of total training time. Quantized RL training, particularly using FP8 precision, offers a promising approach to mitigating this bottleneck. A commonly adopted strategy applies FP8 precision during rollout while retaining BF16 precision for training. In this work, we present the first comprehensive study of FP8 RL training and demonstrate that the widely used BF16-training + FP8-rollout strategy suffers from severe training instability and catastrophic accuracy collapse under long-horizon rollouts and challenging tasks. Our analysis shows that these failures stem from the off-policy nature of the approach, which introduces substantial numerical mismatch between training and inference. Motivated by these observations, we propose Jet-RL, an FP8 RL training framework that enables robust and stable RL optimization. The key idea is to adopt a unified FP8 precision flow for both training and rollout, thereby minimizing numerical discrepancies and eliminating the need for inefficient inter-step calibration. Extensive experiments validate the effectiveness of Jet-RL: our method achieves up to 33% speedup in the rollout phase, up to 41% speedup in the training phase, and a 16% end-to-end speedup over BF16 training, while maintaining stable convergence across all settings and incurring negligible accuracy degradation.
中文摘要 强化学习（RL）对于增强大型语言模型（LLM）的复杂推理能力至关重要。然而，现有的强化学习训练流程计算效率低下且资源密集，其中rollout阶段占总训练时间的70%以上。量化强化学习训练，特别是使用FP8精度，为缓解这一瓶颈提供了有前景的途径。一种常见的策略是在rollout时采用FP8精度，同时在训练时保留BF16精度。本文首次对FP8强化学习训练进行了全面研究，并证明广泛使用的BF16训练+FP8 rollout策略在长程rollout和高难度任务下存在严重的训练不稳定性和灾难性的精度崩溃。我们的分析表明，这些失败源于该方法的离策略（off-policy）性质，它在训练与推理之间引入了显著的数值不匹配。基于这些观察，我们提出了Jet-RL，一个能够实现稳健、稳定强化学习优化的FP8强化学习训练框架。其核心思想是在训练和rollout中采用统一的FP8精度流程，从而最大限度地减少数值差异，并消除低效的步间校准需求。大量实验验证了Jet-RL的有效性：我们的方法在rollout阶段实现了最高33%的加速，在训练阶段实现了最高41%的加速，相比BF16训练实现了16%的端到端加速，同时在所有设置下保持稳定收敛，精度下降可以忽略不计。
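
The Jet-RL abstract attributes the instability to a numerical mismatch between rollout and training precision, which makes the sampled tokens slightly off-policy. The toy sketch below illustrates that effect only; it is not the Jet-RL implementation. It assumes a hypothetical single linear output layer, random hidden states, and a crude `fake_quantize` helper that rounds weights to 3 mantissa bits as a stand-in for FP8 (it ignores exponent range, saturation, and scaling). All names and sizes are illustrative assumptions.

```python
import numpy as np


def fake_quantize(x, mantissa_bits=3):
    """Round x to `mantissa_bits` explicit mantissa bits (crude low-precision stand-in)."""
    mantissa, exponent = np.frexp(x)  # x = mantissa * 2**exponent, 0.5 <= |mantissa| < 1
    scale = 2.0 ** (mantissa_bits + 1)
    return np.ldexp(np.round(mantissa * scale) / scale, exponent)


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def importance_ratios(weights_train, weights_rollout, hidden, rng):
    """Sample one token per row from the rollout policy; return pi_train / pi_rollout."""
    logp_rollout = log_softmax(hidden @ weights_rollout)
    logp_train = log_softmax(hidden @ weights_train)
    # Gumbel-max trick: argmax of log-probs plus Gumbel noise is a categorical sample.
    tokens = np.argmax(logp_rollout + rng.gumbel(size=logp_rollout.shape), axis=-1)
    rows = np.arange(hidden.shape[0])
    return np.exp(logp_train[rows, tokens] - logp_rollout[rows, tokens])


rng = np.random.default_rng(0)
hidden = rng.normal(size=(256, 64))           # hypothetical hidden states
weights = rng.normal(size=(64, 1000)) * 0.5   # hypothetical output projection
low_precision = fake_quantize(weights)

# Mixed precision: train with full-precision weights, roll out with quantized ones.
mismatched = importance_ratios(weights, low_precision, hidden, rng)
# Unified precision: the same quantized weights on both sides.
unified = importance_ratios(low_precision, low_precision, hidden, rng)

print("max |ratio - 1|, mixed precision:  ", np.abs(mismatched - 1.0).max())
print("max |ratio - 1|, unified precision:", np.abs(unified - 1.0).max())
```

In this toy setting the importance ratio is exactly 1 when both sides share the same weights and drifts away from 1 when they differ, which is the sense in which a mixed-precision pipeline is off-policy. Real FP8 training also quantizes activations and gradients, which this sketch does not model.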

Keyword: diffusion policy

Sparse ActionGen: Accelerating Diffusion Policy with Real-time Pruning

Sparse ActionGen：通过实时剪枝加速扩散策略

Authors: Kangye Ji, Yuan Meng, Zhou Jianbo, Ye Li, Hanyun Cui, Zhi Wang
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.12894
Pdf link: https://arxiv.org/pdf/2601.12894
Abstract Diffusion Policy has dominated action generation due to its strong capabilities for modeling multi-modal action distributions, but its multi-step denoising processes make it impractical for real-time visuomotor control. Existing caching-based acceleration methods typically rely on static schedules that fail to adapt to the dynamics of robot-environment interactions, thereby leading to suboptimal performance. In this paper, we propose Sparse ActionGen (SAG) for extremely sparse action generation. To accommodate the iterative interactions, SAG customizes a rollout-adaptive prune-then-reuse mechanism that first identifies prunable computations globally and then reuses cached activations to substitute them during action diffusion. To capture the rollout dynamics, SAG parameterizes an observation-conditioned diffusion pruner for environment-aware adaptation and instantiates it with a highly parameter- and inference-efficient design for real-time prediction. Furthermore, SAG introduces a one-for-all reusing strategy that reuses activations across both timesteps and blocks in a zig-zag manner, minimizing the global redundancy. Extensive experiments on multiple robotic benchmarks demonstrate that SAG achieves up to 4× generation speedup without sacrificing performance. Project Page: this https URL.
中文摘要 扩散策略（Diffusion Policy）因其强大的多模态动作分布建模能力而主导了动作生成，但其多步去噪过程使其难以用于实时视觉运动控制。现有基于缓存的加速方法通常依赖静态调度，无法适应机器人与环境交互的动态，导致性能欠佳。本文提出Sparse ActionGen（SAG），用于极稀疏的动作生成。为适应迭代交互，SAG设计了一种随rollout自适应的先剪枝后重用机制，先全局识别可剪枝的计算，然后在动作扩散期间重用缓存的激活来替代它们。为了捕捉rollout动态，SAG参数化了一个以观测为条件的扩散剪枝器以实现环境感知的自适应，并以参数和推理都高度高效的设计将其实例化，从而实现实时预测。此外，SAG引入了一种一对全体（one-for-all）的重用策略，以之字形方式跨时间步和模块重用激活，最大限度地减少全局冗余。在多个机器人基准上的大量实验表明，SAG在不牺牲性能的情况下实现了最高4倍的生成加速。项目页面：此 https URL。

Keyword: reinforcement learning

GRADE: Replacing Policy Gradients with Backpropagation for LLM Alignment

GRADE：用反向传播替代策略梯度以实现LLM对齐

Bielik 11B v3: Multilingual Large Language Model for European Languages

Bielik 11B v3：欧洲语言多语言大语言模型

Hindsight Preference Replay Improves Preference-Conditioned Multi-Objective Reinforcement Learning

事后偏好回放改进偏好条件多目标强化学习

Reinforcement Learning for Dynamic Workflow Optimization in CI/CD Pipelines

CI/CD管道中动态工作流优化的强化学习

Advances and Frontiers of LLM-based Issue Resolution in Software Engineering: A Comprehensive Survey

基于LLM的软件工程问题解决的进展与前沿：一项综合综述

AGGC: Adaptive Group Gradient Clipping for Stabilizing Large Language Model Training

AGGC：用于稳定大型语言模型训练的自适应群体梯度裁剪

Controlling Underestimation Bias in Constrained Reinforcement Learning for Safe Exploration

控制受限强化学习中的低估偏差以实现安全探索

R$^2$PO: Decoupling Training Trajectories from Inference Responses for LLM Reasoning

R$^2$PO：将训练轨迹与推理响应解耦用于大型语言模型推理

Extreme Value Policy Optimization for Safe Reinforcement Learning

安全强化学习的极值策略优化

Profit Maximization for Electric Vehicle Charging Stations Using Multiagent Reinforcement Learning

利用多智能体强化学习实现电动汽车充电站利润最大化

UniMo: Unified Motion Generation and Understanding with Chain of Thought

UniMo：基于思维链的统一运动生成与理解

Aletheia: What Makes RLVR For Code Verifiers Tick?

Aletheia：是什么让代码验证器的RLVR运作？

Speculative Sampling with Reinforcement Learning

基于强化学习的推测采样

Optimal Power Allocation and Sub-Optimal Channel Assignment for Downlink NOMA Systems Using Deep Reinforcement Learning

使用深度强化学习的下行NOMA系统的最优功率分配与次优信道分配

Beyond the Dirac Delta: Mitigating Diversity Collapse in Reinforcement Fine-Tuning for Versatile Image Generation

超越狄拉克δ：缓解强化微调中的多样性崩溃以实现多功能图像生成

RLMiner: Finding the Most Frequent k-sized Subgraph via Reinforcement Learning

RLMiner：通过强化学习寻找最频繁的大小为k的子图

ReWorld: Multi-Dimensional Reward Modeling for Embodied World Models

ReWorld：具身世界模型的多维奖励建模

Incentivizing In-depth Reasoning over Long Contexts with Process Advantage Shaping

通过过程优势塑造激励长上下文中的深度推理

Agentic Reasoning for Large Language Models

大型语言模型的能动推理

STEP-LLM: Generating CAD STEP Models from Natural Language with Large Language Models

STEP-LLM：利用大型语言模型从自然语言生成CAD STEP模型

Multiagent Reinforcement Learning in Enhancing Resilience of Microgrids under Extreme Weather Events

多智能体强化学习用于增强极端天气事件下微电网的韧性

Decentralized Learning Strategies for Estimation Error Minimization with Graph Neural Networks

利用图神经网络实现估计误差最小化的去中心化学习策略

Resource-Conscious RL Algorithms for Deep Brain Stimulation

资源意识型强化学习算法用于深脑刺激

Decoding Rewards in Competitive Games: Inverse Game Theory with Entropy Regularization

竞技游戏中的奖励解码：带熵正则化的逆博弈论

Teaching Large Reasoning Models Effective Reflection

教授大型推理模型有效反思

Distribution-Centric Policy Optimization Dominates Exploration-Exploitation Trade-off

以分布为中心的策略优化主导了探索与开发的权衡

Teaching LLMs to Learn Tool Trialing and Execution through Environment Interaction

通过环境交互教大语言模型学习工具试用和执行

Unleashing Efficient Asynchronous RL Post-Training via Staleness-Constrained Rollout Coordination

通过陈旧度约束的rollout协调，释放高效的异步强化学习后训练

FRoM-W1: Towards General Humanoid Whole-Body Control with Language Instructions

FRoM-W1：迈向基于语言指令的通用人形机器人全身控制

Communication Methods in Multi-Agent Reinforcement Learning

多智能体强化学习中的通信方法

PaperGuide: Making Small Language-Model Paper-Reading Agents More Efficient

PaperGuide：让小型语言模型论文阅读智能体更高效

Graph Reasoning Paradigm: Structured and Symbolic Reasoning with Topology-Aware Reinforcement Learning for Large Language Models

图推理范式：结构化与符号推理，结合拓扑感知强化学习，适用于大型语言模型

Think3D: Thinking with Space for Spatial Reasoning

Think3D：空间思维以实现空间推理

Feedforward-Feedback Integration in Flight Control: Reinforcement Learning with Sliding Mode Control

飞行控制中的前馈-反馈集成：结合滑模控制的强化学习

Agentic Conversational Search with Contextualized Reasoning via Reinforcement Learning

通过强化学习进行情境化推理的能动会话搜索

Training instability in deep learning follows low-dimensional dynamical principles

深度学习中的训练不稳定性遵循低维动力学原理

Autonomous Navigation at the Nano-Scale: Algorithms, Architectures, and Constraints

纳米尺度的自主导航：算法、架构与约束

CURE-Med: Curriculum-Informed Reinforcement Learning for Multilingual Medical Reasoning

CURE-Med：面向多语言医学推理的课程引导强化学习

Balancing Classification and Calibration Performance in Decision-Making LLMs via Calibration Aware Reinforcement Learning

通过校准感知强化学习平衡决策型大型语言模型中的分类与校准性能

Group Relative Policy Optimization for Robust Blind Interference Alignment with Fluid Antennas

基于群体相对策略优化的流体天线鲁棒盲干扰对齐