Arxiv Papers of Today

Generated: 2026-01-14 16:34:51 (UTC+8); arXiv release time: 2026-01-14 20:00 EST (2026-01-15 09:00 UTC+8)

40 relevant papers today

Keyword: reinforcement learning

Reinforcement Learning Methods for Neighborhood Selection in Local Search

Reinforcement learning methods for neighborhood selection in local search

Authors: Yannick Molinghen, Augustin Delecluse, Renaud De Landtsheer, Stefano Michelini
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.07948
Pdf link: https://arxiv.org/pdf/2601.07948
Abstract Reinforcement learning has recently gained traction as a means to improve combinatorial optimization methods, yet its effectiveness within local search metaheuristics specifically remains comparatively underexamined. In this study, we evaluate a range of reinforcement learning-based neighborhood selection strategies -- multi-armed bandits (upper confidence bound, $\epsilon$-greedy) and deep reinforcement learning methods (proximal policy optimization, double deep $Q$-network) -- and compare them against multiple baselines across three different problems: the traveling salesman problem, the pickup and delivery problem with time windows, and the car sequencing problem. We show how search-specific characteristics, particularly large variations in cost due to constraint violation penalties, necessitate carefully designed reward functions to provide stable and informative learning signals. Our extensive experiments reveal that algorithm performance varies substantially across problems, although $\epsilon$-greedy consistently ranks among the best performers. In contrast, the computational overhead of deep reinforcement learning approaches makes them competitive only with a substantially longer runtime. These findings highlight both the promise and the practical limitations of deep reinforcement learning in local search.
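Abstract (translated from Chinese) Reinforcement learning has recently attracted attention as a way to improve combinatorial optimization methods, but its effectiveness within local search metaheuristics has not yet been thoroughly evaluated. This study evaluates several reinforcement-learning-based neighborhood selection strategies, namely multi-armed bandits (upper confidence bound, $\epsilon$-greedy) and deep reinforcement learning methods (proximal policy optimization, double deep $Q$-network), and compares them with multiple baselines on three problems: the traveling salesman problem, the pickup and delivery problem with time windows, and the car sequencing problem. We show that search-specific characteristics, especially large swings in cost caused by constraint-violation penalties, call for carefully designed reward functions that give stable and informative learning signals. Our extensive experiments show that algorithm performance differs markedly across problems, although $\epsilon$-greedy is consistently among the best. By contrast, the computational overhead of deep reinforcement learning methods makes them competitive only when the runtime is substantially longer. These findings highlight both the promise and the practical limits of deep reinforcement learning in local search.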
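
The $\epsilon$-greedy bandit named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `EpsilonGreedySelector`, the `clipped_reward` helper, and the choice of a relative cost improvement clipped to [-1, 1] are assumptions made here for illustration, chosen because the abstract notes that penalty-driven cost swings call for stable reward signals.

```python
import random


class EpsilonGreedySelector:
    """Chooses among n_arms neighborhoods and tracks a running mean reward per arm."""

    def __init__(self, n_arms, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms
        self.rng = random.Random(seed)

    def select(self):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.values))
        return self.values.index(max(self.values))

    def update(self, arm, reward):
        # Incremental mean of the rewards observed for this arm.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


def clipped_reward(old_cost, new_cost):
    """Relative cost improvement, clipped to [-1, 1] (an assumed reward design)."""
    gain = (old_cost - new_cost) / max(abs(old_cost), 1e-9)
    return max(-1.0, min(1.0, gain))
```

In a local search loop, `select()` would pick the neighborhood to explore next, and `update()` would receive the clipped reward computed from the cost before and after the move.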

Reasoning over Precedents Alongside Statutes: Case-Augmented Deliberative Alignment for LLM Safety

Reasoning over precedents alongside statutes: case-augmented deliberative alignment for LLM safety

Authors: Can Jin, Rui Wu, Tong Che, Qixin Zhang, Hongwu Peng, Jiahui Zhao, Zhenting Wang, Wenqi Wei, Ligong Han, Zhao Zhang, Yuan Cao, Ruixiang Tang, Dimitris N. Metaxas
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2601.08000
Pdf link: https://arxiv.org/pdf/2601.08000
Abstract Ensuring that Large Language Models (LLMs) adhere to safety principles without refusing benign requests remains a significant challenge. While OpenAI introduces deliberative alignment (DA) to enhance the safety of its o-series models through reasoning over detailed "code-like" safety rules, the effectiveness of this approach in open-source LLMs, which typically lack advanced reasoning capabilities, is understudied. In this work, we systematically evaluate the impact of explicitly specifying extensive safety codes versus demonstrating them through illustrative cases. We find that referencing explicit codes inconsistently improves harmlessness and systematically degrades helpfulness, whereas training on case-augmented simple codes yields more robust and generalized safety behaviors. By guiding LLMs with case-augmented reasoning instead of extensive code-like safety rules, we avoid rigid adherence to narrowly enumerated rules and enable broader adaptability. Building on these insights, we propose CADA, a case-augmented deliberative alignment method for LLMs utilizing reinforcement learning on self-generated safety reasoning chains. CADA effectively enhances harmlessness, improves robustness against attacks, and reduces over-refusal while preserving utility across diverse benchmarks, offering a practical alternative to rule-only DA for improving safety while maintaining helpfulness.
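Abstract (translated from Chinese) Ensuring that large language models (LLMs) follow safety principles without refusing harmless requests remains a major challenge. OpenAI introduced deliberative alignment (DA), which improves the safety of its o-series models by reasoning over detailed "code-like" safety rules, but the effectiveness of this approach in open-source LLMs, which usually lack advanced reasoning capabilities, has not been adequately studied. This work systematically evaluates the effect of explicitly specifying extensive safety codes versus demonstrating them through illustrative cases. We find that referencing explicit codes improves harmlessness inconsistently and systematically reduces helpfulness, whereas training on case-augmented simple codes leads to more robust and more general safety behavior. By guiding LLMs with case-augmented reasoning rather than lengthy code-like safety rules, we avoid rigid adherence to narrowly enumerated rules and achieve broader adaptability. Building on these insights, we propose CADA, a case-augmented deliberative alignment method for LLMs that applies reinforcement learning to self-generated safety reasoning chains. CADA effectively improves harmlessness, strengthens robustness against attacks, and reduces over-refusal while preserving utility across diverse benchmarks, offering a practical alternative to rule-only DA that improves safety while maintaining helpfulness.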

FigEx2: Visual-Conditioned Panel Detection and Captioning for Scientific Compound Figures

FigEx2: visually conditioned panel detection and captioning for scientific compound figures

Authors: Jifeng Song, Arun Das, Pan Wang, Hui Ji, Kun Zhao, Yufei Huang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.08026
Pdf link: https://arxiv.org/pdf/2601.08026
Abstract Scientific compound figures combine multiple labeled panels into a single image, but captions in real pipelines are often missing or only provide figure-level summaries, making panel-level understanding difficult. In this paper, we propose FigEx2, a visual-conditioned framework that localizes panels and generates panel-wise captions directly from the compound figure. To mitigate the impact of diverse phrasing in open-ended captioning, we introduce a noise-aware gated fusion module that adaptively filters token-level features to stabilize the detection query space. Furthermore, we employ a staged optimization strategy combining supervised learning with reinforcement learning (RL), utilizing CLIP-based alignment and BERTScore-based semantic rewards to enforce strict multimodal consistency. To support high-quality supervision, we curate BioSci-Fig-Cap, a refined benchmark for panel-level grounding, alongside cross-disciplinary test suites in physics and chemistry. Experimental results demonstrate that FigEx2 achieves a superior 0.726 mAP@0.5:0.95 for detection and significantly outperforms Qwen3-VL-8B by 0.51 in METEOR and 0.24 in BERTScore. Notably, FigEx2 exhibits remarkable zero-shot transferability to out-of-distribution scientific domains without any fine-tuning.
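Abstract (translated from Chinese) Scientific compound figures merge several labeled panels into a single image, but in real pipelines captions are often missing or give only figure-level summaries, which makes panel-level understanding difficult. This paper proposes FigEx2, a visual-conditioned framework that localizes panels and generates per-panel captions directly from the compound figure. To reduce the impact of varied phrasing in open-ended captioning, we introduce a noise-aware gated fusion module that adaptively filters token-level features to stabilize the detection query space. In addition, we adopt a staged optimization strategy that combines supervised learning with reinforcement learning (RL), using CLIP-based alignment and BERTScore-based semantic rewards to enforce strict multimodal consistency. To support high-quality supervision, we curate BioSci-Fig-Cap, a refined benchmark for panel-level grounding, together with cross-disciplinary test suites in physics and chemistry. Experimental results show that FigEx2 reaches 0.726 mAP@0.5:0.95 for detection and clearly outperforms Qwen3-VL-8B, by 0.51 in METEOR and by 0.24 in BERTScore. Notably, FigEx2 shows strong zero-shot transfer to out-of-distribution scientific domains without any fine-tuning.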

Formalizing the Relationship between Hamilton-Jacobi Reachability and Reinforcement Learning

Formalizing the relationship between Hamilton-Jacobi reachability and reinforcement learning

Authors: Prashant Solanki, Isabelle El-Hajj, Jasper van Beers, Erik-Jan van Kampen, Coen de Visser
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2601.08050
Pdf link: https://arxiv.org/pdf/2601.08050
Abstract We unify Hamilton-Jacobi (HJ) reachability and Reinforcement Learning (RL) through a proposed running cost formulation. We prove that the resultant travel-cost value function is the unique bounded viscosity solution of a time-dependent Hamilton-Jacobi-Bellman (HJB) Partial Differential Equation (PDE) with zero terminal data, whose negative sublevel set equals the strict backward-reachable tube. Using a forward reparameterization and a contraction-inducing Bellman update, we show that fixed points of small-step RL value iteration converge to the viscosity solution of the forward discounted HJB. Experiments on a classical benchmark compare learned values to semi-Lagrangian HJB ground truth and quantify error.
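Abstract (translated from Chinese) We unify Hamilton-Jacobi (HJ) reachability and reinforcement learning (RL) through a proposed running-cost formulation. We prove that the resulting travel-cost value function is the unique bounded viscosity solution of a time-dependent Hamilton-Jacobi-Bellman (HJB) partial differential equation (PDE), and that its negative sublevel set equals the strict backward-reachable tube. Using a forward reparameterization and a contraction-inducing Bellman update, we show that the fixed points of small-step RL value iteration converge to the viscosity solution of the forward discounted HJB. Experiments on a classical benchmark compare the learned values with a semi-Lagrangian HJB ground truth and quantify the error.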

Forecast Aware Deep Reinforcement Learning for Efficient Electricity Load Scheduling in Dairy Farms

Forecast-aware deep reinforcement learning for efficient electricity load scheduling in dairy farms

Authors: Nawazish Ali, Rachael Shaw, Karl Mason
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.08052
Pdf link: https://arxiv.org/pdf/2601.08052
Abstract Dairy farming is an energy-intensive sector that relies heavily on grid electricity. With increasing renewable energy integration, sustainable energy management has become essential for reducing grid dependence and supporting the United Nations Sustainable Development Goal 7 on affordable and clean energy. However, the intermittent nature of renewables poses challenges in balancing supply and demand in real time. Intelligent load scheduling is therefore crucial to minimize operational costs while maintaining reliability. Reinforcement Learning has shown promise in improving energy efficiency and reducing costs. However, most RL-based scheduling methods assume complete knowledge of future prices or generation, which is unrealistic in dynamic environments. Moreover, standard PPO variants rely on fixed clipping or KL divergence thresholds, often leading to unstable training under variable tariffs. To address these challenges, this study proposes a Deep Reinforcement Learning framework for efficient load scheduling in dairy farms, focusing on battery storage and water heating under realistic operational constraints. The proposed Forecast Aware PPO incorporates short-term forecasts of demand and renewable generation using hour-of-day and month-based residual calibration, while the PID KL PPO variant employs a proportional-integral-derivative controller to adaptively regulate KL divergence for stable policy updates. Trained on real-world dairy farm data, the method achieves up to 1% lower electricity cost than PPO, 4.8% lower than DQN, and 1.5% lower than SAC. For battery scheduling, PPO reduces grid imports by 13.1%, demonstrating scalability and effectiveness for sustainable energy management in modern dairy farming.
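Abstract (translated from Chinese) Dairy farming is an energy-intensive sector that depends heavily on grid electricity. As renewable energy integration deepens, sustainable energy management has become key to reducing grid dependence and supporting United Nations Sustainable Development Goal 7 (affordable and clean energy). However, the intermittent nature of renewables makes it challenging to balance supply and demand in real time. Intelligent load scheduling is therefore essential for lowering operating costs while maintaining reliability. Reinforcement learning has shown potential for improving energy efficiency and reducing costs. However, most RL-based scheduling methods assume complete knowledge of future prices or generation, which is unrealistic in dynamic environments. Moreover, standard PPO variants rely on fixed clipping or KL divergence thresholds, which often leads to unstable training under variable tariffs. To address these challenges, this study proposes a deep reinforcement learning framework for efficient load scheduling in dairy farms, focusing on battery storage and water heating under realistic operational constraints. The proposed Forecast Aware PPO incorporates short-term forecasts of demand and renewable generation using residual calibration based on hour of day and month, while the PID KL PPO variant uses a proportional-integral-derivative controller to regulate the KL divergence and keep policy updates stable. Trained on real dairy farm data, the method achieves an electricity cost up to 1% lower than PPO, 4.8% lower than DQN, and 1.5% lower than SAC. For battery scheduling, PPO reduces grid imports by 13.1%, demonstrating scalability and effectiveness for sustainable energy management in modern dairy farming.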
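
As a rough illustration of the PID-regulated KL idea in the abstract, the sketch below adjusts a KL penalty coefficient so that the measured KL divergence between successive policies tracks a target. The abstract does not give the update rule, so everything here is an assumption: the class name `PIDKLController`, the gains, the multiplicative update, and the use of a penalty coefficient `beta` as in the KL-penalty form of PPO.

```python
import math


class PIDKLController:
    """PID controller that tunes a KL penalty coefficient (hypothetical sketch)."""

    def __init__(self, target_kl, kp=1.0, ki=0.1, kd=0.0,
                 beta=1.0, beta_min=1e-4, beta_max=100.0):
        self.target_kl = target_kl
        self.kp, self.ki, self.kd = kp, ki, kd
        self.beta = beta
        self.beta_min, self.beta_max = beta_min, beta_max
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, measured_kl):
        """Update beta from the KL divergence measured after a policy update."""
        error = measured_kl - self.target_kl  # positive: the policy moved too far
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        signal = self.kp * error + self.ki * self.integral + self.kd * derivative
        # A multiplicative update keeps beta positive; clamp it to a safe range.
        self.beta = min(self.beta_max, max(self.beta_min, self.beta * math.exp(signal)))
        return self.beta
```

A training loop would call `step()` once per policy update and use the returned `beta` to weight the KL term subtracted from the surrogate objective, so the penalty grows when updates overshoot the target and relaxes when they undershoot.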

DRL-based Power Allocation in LiDAL-Assisted RLNC-NOMA OWC Systems

DRL-based power allocation in LiDAL-assisted RLNC-NOMA OWC systems

Authors: Ahmed A. Hassan, Ahmad Adnan Qidan, Taisir Elgorashi, Jaafar Elmirghani
Subjects: Systems and Control (eess.SY); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2601.08060
Pdf link: https://arxiv.org/pdf/2601.08060
Abstract Non-orthogonal multiple access (NOMA) is a promising technique for optical wireless communication (OWC), enabling multiple users to share the optical spectrum simultaneously through the power domain. However, the imperfection of channel state information (CSI) and residual errors in the decoding process deteriorate the performance of NOMA, especially when multi-parametric and realistic dense-user indoor scenarios are considered. In this work, we model a LiDAL-assisted RLNC-NOMA OWC system, where the light detection and localization (LiDAL) technique exploits spatio-temporal information to improve user CSI, while random linear network coding (RLNC) enhances data resilience in the NOMA successive decoding process. Power allocation (PA) is a crucial issue in communication systems, particularly in the modeled system, due to the complex interactions between multiple users and the coding and detection processes. However, optimizing continuous PA dynamically requires advanced techniques to avoid excessive computational complexity. Therefore, we adopt a deep reinforcement learning (DRL) framework to efficiently learn near-optimal power allocation strategies, enabling enhanced system performance. In particular, a DRL-based normalized advantage function (NAF) algorithm is proposed to maximize the average sum rate of the system, and its performance is analyzed and compared to other widely used DRL-based and conventional PA schemes, such as deep deterministic policy gradient (DDPG), gain ratio PA (GRPA), and exhaustive search.
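Abstract (translated from Chinese) Non-orthogonal multiple access (NOMA) is a promising technique for optical wireless communication (OWC) that lets multiple users share the optical spectrum simultaneously through the power domain. However, imperfect channel state information (CSI) and residual errors in the decoding process degrade NOMA performance, especially in multi-parametric and realistic dense-user indoor scenarios. In this work, we model a LiDAL-assisted RLNC-NOMA system in which the light detection and localization (LiDAL) technique uses spatio-temporal information to improve user CSI, while random linear network coding (RLNC) strengthens data resilience in the NOMA successive decoding process. Power allocation (PA) is a key issue in communication systems, and particularly in the modeled system, because of the complex interactions among multiple users and the coding and detection processes. However, dynamically optimizing continuous PA requires advanced techniques to avoid excessive computational complexity. We therefore adopt a deep reinforcement learning (DRL) framework to efficiently learn near-optimal power allocation strategies and thereby improve system performance. In particular, a DRL-based normalized advantage function (NAF) algorithm is proposed to maximize the system's average sum rate, and its performance is analyzed and compared with other widely used DRL-based and conventional PA schemes, such as deep deterministic policy gradient (DDPG), gain ratio PA (GRPA), and exhaustive search.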

STO-RL: Offline RL under Sparse Rewards via LLM-Guided Subgoal Temporal Order

STO-RL: offline reinforcement learning under sparse rewards via LLM-guided subgoal temporal order

Authors: Chengyang Gu, Yuxin Pan, Hui Xiong, Yize Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2601.08107
Pdf link: https://arxiv.org/pdf/2601.08107
Abstract Offline reinforcement learning (RL) enables policy learning from pre-collected datasets, avoiding costly and risky online interactions, but it often struggles with long-horizon tasks involving sparse rewards. Existing goal-conditioned and hierarchical offline RL methods decompose such tasks and generate intermediate rewards to mitigate limitations of traditional offline RL, but usually overlook temporal dependencies among subgoals and rely on imprecise reward shaping, leading to suboptimal policies. To address these issues, we propose STO-RL (Offline RL using LLM-Guided Subgoal Temporal Order), an offline RL framework that leverages large language models (LLMs) to generate temporally ordered subgoal sequences and corresponding state-to-subgoal-stage mappings. Using this temporal structure, STO-RL applies potential-based reward shaping to transform sparse terminal rewards into dense, temporally consistent signals, promoting subgoal progress while avoiding suboptimal solutions. The resulting augmented dataset with shaped rewards enables efficient offline training of high-performing policies. Evaluations on four discrete and continuous sparse-reward benchmarks demonstrate that STO-RL consistently outperforms state-of-the-art offline goal-conditioned and hierarchical RL baselines, achieving faster convergence, higher success rates, and shorter trajectories. Ablation studies further confirm STO-RL's robustness to imperfect or noisy LLM-generated subgoal sequences, demonstrating that LLM-guided subgoal temporal structures combined with theoretically grounded reward shaping provide a practical and scalable solution for long-horizon offline RL.
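Abstract (translated from Chinese) Offline reinforcement learning (RL) makes it possible to learn policies from pre-collected datasets, avoiding costly and risky online interaction, but it often struggles on long-horizon tasks with sparse rewards. Existing goal-conditioned and hierarchical offline RL methods decompose such tasks and generate intermediate rewards to ease the limitations of traditional offline RL, but they usually ignore temporal dependencies among subgoals and rely on imprecise reward shaping, which leads to suboptimal policies. To address these problems, we propose STO-RL (Offline RL using LLM-Guided Subgoal Temporal Order), a framework that uses large language models (LLMs) to generate temporally ordered subgoal sequences and the corresponding state-to-subgoal-stage mappings. Using this temporal structure, STO-RL applies potential-based reward shaping to turn sparse terminal rewards into dense, temporally consistent signals that promote subgoal progress while avoiding suboptimal solutions. The resulting augmented dataset with shaped rewards enables efficient offline training of high-performing policies. Evaluations on four discrete and continuous sparse-reward benchmarks show that STO-RL consistently outperforms state-of-the-art offline goal-conditioned and hierarchical RL baselines, with faster convergence, higher success rates, and shorter trajectories. Ablation studies further confirm that STO-RL is robust to imperfect or noisy LLM-generated subgoal sequences, showing that LLM-guided subgoal temporal structure combined with theoretically grounded reward shaping offers a practical and scalable solution for long-horizon offline RL.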
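
The potential-based shaping step can be sketched as follows. It uses the standard form $r' = r + \gamma\Phi(s') - \Phi(s)$ from the reward shaping literature. The specific potential (subgoal stage index divided by the number of stages) and the function names are assumptions made for illustration, and the stage labels, which the paper obtains from an LLM-generated mapping, are simply given as integers here.

```python
def potential(stage, n_stages):
    """Potential of a state: fraction of subgoal stages already reached."""
    return stage / n_stages


def shaped_reward(reward, stage, next_stage, n_stages, gamma=0.99):
    """Potential-based shaping: r + gamma * phi(s') - phi(s)."""
    return reward + gamma * potential(next_stage, n_stages) - potential(stage, n_stages)


def relabel(transitions, n_stages, gamma=0.99):
    """Return a copy of an offline dataset with shaped rewards.

    Each transition is a dict with keys 'reward', 'stage', and 'next_stage'.
    """
    return [
        {**t, "reward": shaped_reward(t["reward"], t["stage"], t["next_stage"],
                                      n_stages, gamma)}
        for t in transitions
    ]
```

With this potential, a transition that advances to a later stage earns a positive bonus and one that falls back to an earlier stage is penalized, which is one way to turn a sparse terminal reward into a dense signal that respects subgoal order.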

Structure Detection for Contextual Reinforcement Learning

Structure detection for contextual reinforcement learning

Authors: Tianyue Zhou, Jung-Hoon Cho, Cathy Wu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.08120
Pdf link: https://arxiv.org/pdf/2601.08120
Abstract Contextual Reinforcement Learning (CRL) tackles the problem of solving a set of related Contextual Markov Decision Processes (CMDPs) that vary across different context variables. Traditional approaches--independent training and multi-task learning--struggle with either excessive computational costs or negative transfer. A recently proposed multi-policy approach, Model-Based Transfer Learning (MBTL), has demonstrated effectiveness by strategically selecting a few tasks to train and zero-shot transfer. However, CMDPs encompass a wide range of problems, exhibiting structural properties that vary from problem to problem. As such, different task selection strategies are suitable for different CMDPs. In this work, we introduce Structure Detection MBTL (SD-MBTL), a generic framework that dynamically identifies the underlying generalization structure of a CMDP and selects an appropriate MBTL algorithm. For instance, we observe a Mountain structure in which generalization performance degrades from the training performance of the target task as the context difference increases. We thus propose M/GP-MBTL, which detects the structure and adaptively switches between a Gaussian Process-based approach and a clustering-based approach. Extensive experiments on synthetic data and CRL benchmarks--covering continuous control, traffic control, and agricultural management--show that M/GP-MBTL surpasses the strongest prior method by 12.49% on the aggregated metric. These results highlight the promise of online structure detection for guiding source task selection in complex CRL environments.
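Abstract (translated from Chinese) Contextual reinforcement learning (CRL) addresses the problem of solving a set of related contextual Markov decision processes (CMDPs) that differ across context variables. Traditional approaches, namely independent training and multi-task learning, suffer from either excessive computational cost or negative transfer. A recently proposed multi-policy approach, Model-Based Transfer Learning (MBTL), has proven effective by strategically selecting a few tasks to train on and transferring zero-shot. However, CMDPs cover a wide range of problems whose structural properties vary from problem to problem, so different task selection strategies suit different CMDPs. This work introduces Structure Detection MBTL (SD-MBTL), a generic framework that dynamically identifies the underlying generalization structure of a CMDP and selects a suitable MBTL algorithm. For example, we observe a Mountain structure, in which generalization performance falls off from the training performance of the target task as the context difference grows. We therefore propose M/GP-MBTL, which detects the structure and adaptively switches between a Gaussian-process-based approach and a clustering-based approach. Extensive experiments on synthetic data and CRL benchmarks, covering continuous control, traffic control, and agricultural management, show that M/GP-MBTL exceeds the strongest prior method by 12.49% on the aggregated metric. These results highlight the potential of online structure detection for guiding source task selection in complex CRL environments.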

Reverse Flow Matching: A Unified Framework for Online Reinforcement Learning with Diffusion and Flow Policies

Reverse flow matching: a unified framework for online reinforcement learning with diffusion and flow policies

Authors: Zeyang Li, Sunbochen Tang, Navid Azizan
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2601.08136
Pdf link: https://arxiv.org/pdf/2601.08136
Abstract Diffusion and flow policies are gaining prominence in online reinforcement learning (RL) due to their expressive power, yet training them efficiently remains a critical challenge. A fundamental difficulty in online RL is the lack of direct samples from the target distribution; instead, the target is an unnormalized Boltzmann distribution defined by the Q-function. To address this, two seemingly distinct families of methods have been proposed for diffusion policies: a noise-expectation family, which utilizes a weighted average of noise as the training target, and a gradient-expectation family, which employs a weighted average of Q-function gradients. Yet, it remains unclear how these objectives relate formally or if they can be synthesized into a more general formulation. In this paper, we propose a unified framework, reverse flow matching (RFM), which rigorously addresses the problem of training diffusion and flow models without direct target samples. By adopting a reverse inferential perspective, we formulate the training target as a posterior mean estimation problem given an intermediate noisy sample. Crucially, we introduce Langevin Stein operators to construct zero-mean control variates, deriving a general class of estimators that effectively reduce importance sampling variance. We show that existing noise-expectation and gradient-expectation methods are two specific instances within this broader class. This unified view yields two key advancements: it extends the capability of targeting Boltzmann distributions from diffusion to flow policies, and enables the principled combination of Q-value and Q-gradient information to derive an optimal, minimum-variance estimator, thereby improving training efficiency and stability. We instantiate RFM to train a flow policy in online RL, and demonstrate improved performance on continuous-control benchmarks compared to diffusion policy baselines.
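Abstract (translated from Chinese) Diffusion and flow policies are receiving growing attention in online reinforcement learning (RL) because of their expressive power, but training them efficiently remains a key challenge. A fundamental difficulty in online RL is the lack of direct samples from the target distribution; instead, the target is an unnormalized Boltzmann distribution defined by the Q-function. To address this, two seemingly different families of methods have been proposed for diffusion policies: a noise-expectation family, which uses a weighted average of noise as the training target, and a gradient-expectation family, which uses a weighted average of Q-function gradients. It remains unclear, however, how these objectives are formally related or whether they can be combined into a more general formulation. This paper proposes a unified framework, reverse flow matching (RFM), that rigorously addresses the problem of training diffusion and flow models without direct target samples. Taking a reverse inference perspective, we formulate the training target as a posterior mean estimation problem given an intermediate noisy sample. Crucially, we introduce Langevin Stein operators to construct zero-mean control variates, deriving a general class of estimators that effectively reduce importance sampling variance. We show that existing noise-expectation and gradient-expectation methods are two specific instances of this broader class. This unified view brings two key advances: it extends the ability to target Boltzmann distributions from diffusion policies to flow policies, and it allows Q-value and Q-gradient information to be combined in a principled way to derive an optimal minimum-variance estimator, improving training efficiency and stability. We instantiate RFM to train a flow policy in online RL and show better performance than diffusion policy baselines on continuous-control benchmarks.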

ZeroDVFS: Zero-Shot LLM-Guided Core and Frequency Allocation for Embedded Platforms

ZeroDVFS: zero-shot LLM-guided core and frequency allocation for embedded platforms

Authors: Mohammad Pivezhandi, Mahdi Banisharif, Abusayeed Saifullah, Ali Jannesari
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.08166
Pdf link: https://arxiv.org/pdf/2601.08166
Abstract Dynamic voltage and frequency scaling (DVFS) and task-to-core allocation are critical for thermal management and balancing energy and performance in embedded systems. Existing approaches either rely on utilization-based heuristics that overlook stall times, or require extensive offline profiling for table generation, preventing runtime adaptation. We propose a model-based hierarchical multi-agent reinforcement learning (MARL) framework for thermal- and energy-aware scheduling on multi-core platforms. Two collaborative agents decompose the exponential action space, achieving 358ms latency for subsequent decisions. First decisions require 3.5 to 8.0s including one-time LLM feature extraction. An accurate environment model leverages regression techniques to predict thermal dynamics and performance states. When combined with LLM-extracted semantic features, the environment model enables zero-shot deployment for new workloads on trained platforms by generating synthetic training data without requiring workload-specific profiling samples. We introduce LLM-based semantic feature extraction that characterizes OpenMP programs through 13 code-level features without execution. The Dyna-Q-inspired framework integrates direct reinforcement learning with model-based planning, achieving 20x faster convergence than model-free methods. Experiments on BOTS and PolybenchC benchmarks across NVIDIA Jetson TX2, Jetson Orin NX, RubikPi, and Intel Core i7 demonstrate 7.09x better energy efficiency and 4.0x better makespan than the Linux ondemand governor. First-decision latency is 8,300x faster than table-based profiling, enabling practical deployment in dynamic embedded systems.
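Abstract (translated from Chinese) Dynamic voltage and frequency scaling (DVFS) and task-to-core allocation are critical for thermal management and for balancing energy and performance in embedded systems. Existing approaches either rely on utilization-based heuristics that ignore stall times or require extensive offline profiling to generate tables, which prevents runtime adaptation. We propose a model-based hierarchical multi-agent reinforcement learning (MARL) framework for thermal- and energy-aware scheduling on multi-core platforms. Two collaborative agents decompose the exponential action space, achieving a latency of 358 ms for subsequent decisions. First decisions take 3.5 to 8.0 s, including one-time LLM feature extraction. An accurate environment model uses regression techniques to predict thermal dynamics and performance states. Combined with LLM-extracted semantic features, the environment model enables zero-shot deployment for new workloads on trained platforms by generating synthetic training data, with no need for workload-specific profiling samples. We introduce LLM-based semantic feature extraction that characterizes OpenMP programs through 13 code-level features without executing them. The Dyna-Q-inspired framework combines direct reinforcement learning with model-based planning and converges 20x faster than model-free methods. Experiments with the BOTS and PolybenchC benchmarks on NVIDIA Jetson TX2, Jetson Orin NX, RubikPi, and Intel Core i7 show 7.09x better energy efficiency and 4.0x better makespan than the Linux ondemand governor. First-decision latency is 8,300x faster than table-based profiling, making deployment in dynamic embedded systems practical.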
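
For readers unfamiliar with Dyna-Q, which the abstract cites as inspiration, the sketch below shows the textbook tabular version: each real transition drives one Q-learning update, is stored in a learned model, and is followed by extra planning updates replayed from that model. This is a generic illustration under simplifying assumptions (a deterministic tabular model and hypothetical function names), not the hierarchical multi-agent system described in the paper.

```python
import random


def q_update(Q, s, a, r, s2, alpha, gamma):
    """One Q-learning backup on a table Q[state][action]."""
    Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])


def dyna_q_step(Q, model, s, a, r, s2, rng, alpha=0.5, gamma=0.9, planning_steps=5):
    """Process one real transition, then replay simulated ones from the model."""
    q_update(Q, s, a, r, s2, alpha, gamma)   # direct RL from real experience
    model[(s, a)] = (r, s2)                  # deterministic model of the environment
    for _ in range(planning_steps):          # model-based planning
        (ps, pa), (pr, ps2) = rng.choice(list(model.items()))
        q_update(Q, ps, pa, pr, ps2, alpha, gamma)


Q = [[0.0, 0.0], [0.0, 0.0]]  # 2 states x 2 actions
model = {}
dyna_q_step(Q, model, s=0, a=1, r=1.0, s2=1, rng=random.Random(0), planning_steps=1)
print(Q[0][1])  # 0.75: 0.5 from the real step, then 0.25 more from one planning replay
```

The planning replays reuse stored experience without touching the real system, which is the general mechanism by which model-based methods of this kind can converge in fewer real interactions than model-free ones.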

Scalable Multiagent Reinforcement Learning with Collective Influence Estimation

Scalable multiagent reinforcement learning with collective influence estimation

Authors: Zhenglong Luo, Zhiyong Chen, Aoxiang Liu, Ke Pan
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.08210
Pdf link: https://arxiv.org/pdf/2601.08210
Abstract Multiagent reinforcement learning (MARL) has attracted considerable attention due to its potential in addressing complex cooperative tasks. However, existing MARL approaches often rely on frequent exchanges of action or state information among agents to achieve effective coordination, which is difficult to satisfy in practical robotic systems. A common solution is to introduce estimator networks to model the behaviors of other agents and predict their actions; nevertheless, such designs cause the size and computational cost of the estimator networks to grow rapidly with the number of agents, thereby limiting scalability in large-scale systems. To address these challenges, this paper proposes a multiagent learning framework augmented with a Collective Influence Estimation Network (CIEN). By explicitly modeling the collective influence of other agents on the task object, each agent can infer critical interaction information solely from its local observations and the task object's states, enabling efficient collaboration without explicit action information exchange. The proposed framework effectively avoids network expansion as the team size increases; moreover, new agents can be incorporated without modifying the network structures of existing agents, demonstrating strong scalability. Experimental results on multiagent cooperative tasks based on the Soft Actor-Critic (SAC) algorithm show that the proposed method achieves stable and efficient coordination under communication-limited environments. Furthermore, policies trained with collective influence modeling are deployed on a real robotic platform, where experimental results indicate significantly improved robustness and deployment feasibility, along with reduced dependence on communication infrastructure.
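Abstract (translated from Chinese) Multiagent reinforcement learning (MARL) has attracted wide attention for its potential in solving complex cooperative tasks. However, existing MARL methods usually rely on frequent exchange of action or state information among agents to coordinate effectively, which is hard to achieve in practical robotic systems. A common solution is to introduce estimator networks that model the behavior of other agents and predict their actions; however, this design makes the size and computational cost of the estimator networks grow rapidly with the number of agents, limiting scalability in large-scale systems. To address these challenges, this paper proposes a multiagent learning framework augmented with a Collective Influence Estimation Network (CIEN). By explicitly modeling the collective influence of other agents on the task object, each agent can infer critical interaction information from only its local observations and the task object's states, enabling efficient collaboration without explicit exchange of action information. The framework avoids network expansion as the team grows; moreover, new agents can be added without modifying the network structures of existing agents, which demonstrates strong scalability. Experimental results on multiagent cooperative tasks based on the Soft Actor-Critic (SAC) algorithm show that the method achieves stable and efficient coordination in communication-limited environments. In addition, policies trained with collective influence modeling are deployed on a real robotic platform, where experiments indicate markedly better robustness and deployment feasibility along with reduced dependence on communication infrastructure.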

The End of Reward Engineering: How LLMs Are Redefining Multi-Agent Coordination

The end of reward engineering: how LLMs are redefining multi-agent coordination

Authors: Haoran Su, Yandong Sun, Congjia Yu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.08237
Pdf link: https://arxiv.org/pdf/2601.08237
Abstract Reward engineering, the manual specification of reward functions to induce desired agent behavior, remains a fundamental challenge in multi-agent reinforcement learning. This difficulty is amplified by credit assignment ambiguity, environmental non-stationarity, and the combinatorial growth of interaction complexity. We argue that recent advances in large language models (LLMs) point toward a shift from hand-crafted numerical rewards to language-based objective specifications. Prior work has shown that LLMs can synthesize reward functions directly from natural language descriptions (e.g., EUREKA) and adapt reward formulations online with minimal human intervention (e.g., CARD). In parallel, the emerging paradigm of Reinforcement Learning from Verifiable Rewards (RLVR) provides empirical evidence that language-mediated supervision can serve as a viable alternative to traditional reward engineering. We conceptualize this transition along three dimensions: semantic reward specification, dynamic reward adaptation, and improved alignment with human intent, while noting open challenges related to computational overhead, robustness to hallucination, and scalability to large multi-agent systems. We conclude by outlining a research direction in which coordination arises from shared semantic representations rather than explicitly engineered numerical signals.
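Abstract (translated from Chinese) Reward engineering, the manual specification of reward functions to induce desired agent behavior, remains a fundamental challenge in multi-agent reinforcement learning. The difficulty is compounded by ambiguity in credit assignment, environmental non-stationarity, and the combinatorial growth of interaction complexity. We argue that recent advances in large language models (LLMs) point to a shift from hand-crafted numerical rewards to language-based objective specifications. Prior work has shown that LLMs can synthesize reward functions directly from natural language descriptions (e.g., EUREKA) and adjust reward formulations online with minimal human intervention (e.g., CARD). Meanwhile, the emerging paradigm of Reinforcement Learning from Verifiable Rewards (RLVR) provides empirical evidence that language-mediated supervision can be a viable alternative to traditional reward engineering. We conceptualize this transition along three dimensions: semantic reward specification, dynamic reward adaptation, and better alignment with human intent, while noting open challenges in computational overhead, robustness to hallucination, and scalability to large multi-agent systems. Finally, we outline a research direction in which coordination arises from shared semantic representations rather than explicitly engineered numerical signals.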

Incorporating Cognitive Biases into Reinforcement Learning for Financial Decision-Making

Incorporating cognitive biases into reinforcement learning for financial decision-making

Authors: Liu He
Subjects: Machine Learning (cs.LG); Econometrics (econ.EM)
Arxiv link: https://arxiv.org/abs/2601.08247
Pdf link: https://arxiv.org/pdf/2601.08247
Abstract Financial markets are influenced by human behavior that deviates from rationality due to cognitive biases. Traditional reinforcement learning (RL) models for financial decision-making assume rational agents, potentially overlooking the impact of psychological factors. This study integrates cognitive biases into RL frameworks for financial trading, hypothesizing that such models can exhibit human-like trading behavior and achieve better risk-adjusted returns than standard RL agents. We introduce biases, such as overconfidence and loss aversion, into reward structures and decision-making processes and evaluate their performance in simulated and real-world trading environments. Despite its inconclusive or negative results, this study provides insights into the challenges of incorporating human-like biases into RL, offering valuable lessons for developing robust financial AI systems.
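Abstract (translated from Chinese) Financial markets are shaped by human behavior that departs from rationality because of cognitive biases. Traditional reinforcement learning (RL) models for financial decision-making assume rational agents and may overlook the influence of psychological factors. This study integrates cognitive biases into RL frameworks for financial trading, hypothesizing that such models can exhibit human-like trading behavior and achieve better risk-adjusted returns than standard RL agents. We introduce biases such as overconfidence and loss aversion into reward structures and decision-making processes and evaluate their performance in simulated and real-world trading environments. Although the results are inconclusive or negative, the study offers insight into the challenges of building human-like biases into RL and valuable lessons for developing robust financial AI systems.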
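
One simple way to encode loss aversion in a reward, shown here only as an illustration since the abstract does not specify the author's formulation, is to scale negative profit-and-loss by a factor greater than one. The function name and the default coefficient are assumptions; 2.25 is the loss-aversion estimate commonly quoted from Tversky and Kahneman's work on prospect theory.

```python
def loss_averse_reward(pnl, loss_aversion=2.25):
    """Reward that weights losses more heavily than equal-sized gains."""
    return pnl if pnl >= 0 else loss_aversion * pnl


# Per-step returns of a hypothetical trading agent and the rewards it would see.
returns = [0.02, -0.02, 0.01]
rewards = [loss_averse_reward(r) for r in returns]
```

Under this transform, a gain and a loss of equal size no longer cancel, so an agent trained on it is pushed away from strategies with frequent drawdowns even when their raw expected return is the same.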

Large Artificial Intelligence Model Guided Deep Reinforcement Learning for Resource Allocation in Non Terrestrial Networks

Large AI model guided deep reinforcement learning for resource allocation in non-terrestrial networks

Authors: Abdikarim Mohamed Ibrahim, Rosdiadee Nordin
Subjects: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2601.08254
Pdf link: https://arxiv.org/pdf/2601.08254
Abstract Large AI Models (LAMs) have been proposed for applications in Non-Terrestrial Networks (NTNs), where they offer better performance thanks to their strong generalization and reduced need for task-specific training. In this paper, we propose a Deep Reinforcement Learning (DRL) agent that is guided by a Large Language Model (LLM). The LLM operates as a high-level coordinator that generates textual guidance that shapes the reward of the DRL agent during training. The results show that the LAM-DRL outperforms the traditional DRL by 40% in nominal weather scenarios and by 64% in extreme weather scenarios compared to heuristics in terms of throughput, fairness, and outage probability.
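Abstract (translated from Chinese) Large AI models (LAMs) have been proposed for applications in non-terrestrial networks (NTNs), where they offer better performance thanks to strong generalization and less task-specific training. This paper proposes a deep reinforcement learning (DRL) agent guided by a large language model (LLM). The LLM acts as a high-level coordinator that generates textual guidance to shape the DRL agent's reward during training. The results show that LAM-DRL outperforms traditional DRL by 40% in nominal weather scenarios and by 64% in extreme weather scenarios in terms of throughput, fairness, and outage probability.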

Unleashing Tool Engineering and Intelligence for Agentic AI in Next-Generation Communication Networks

Unleashing tool engineering and intelligence for agentic AI in next-generation communication networks

Authors: Yinqiu Liu, Ruichen Zhang, Dusit Niyato, Abbas Jamalipour, Trung Q. Duong, Dong In Kim
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2601.08259
Pdf link: https://arxiv.org/pdf/2601.08259
Abstract Nowadays, agentic AI is emerging as a transformative paradigm for next-generation communication networks, promising to evolve large language models (LLMs) from passive chatbots into autonomous operators. However, unleashing this potential requires bridging the critical gap between abstract reasoning and physical actuation, a capability we term tool intelligence. In this article, we explore the landscape of tool engineering to empower agentic AI in communications. We first analyze the functionalities of tool intelligence and its effects on communications. We then propose a systematic review for tool engineering, covering the entire lifecycle from tool creation and discovery to selection, learning, and benchmarking. Furthermore, we present a case study on tool-assisted uncrewed aerial vehicle (UAV) trajectory planning to demonstrate the realization of tool intelligence in communications. By introducing a teacher-guided reinforcement learning approach with a feasibility shield, we enable agents to intelligently operate tools. They utilize external tools to eliminate navigational uncertainty while mastering cost-aware scheduling under strict energy constraints. This article aims to provide a roadmap for building the tool-augmented intelligent agents of the 6G era.
中文摘要 如今，代理式人工智能正作为下一代通信网络的变革范式崛起，有望将大型语言模型（LLM）从被动聊天机器人演进为自主操作者。然而，释放这种潜力需要弥合抽象推理与物理执行之间的关键鸿沟，我们将这种能力称为工具智能。本文探讨工具工程如何赋能通信领域的代理式人工智能。我们首先分析工具智能的功能及其对通信的影响。随后，我们对工具工程进行系统综述，涵盖从工具创建与发现到选择、学习和基准测试的全生命周期。此外，我们还给出了一个工具辅助无人机（UAV）轨迹规划的案例研究，以展示工具智能在通信中的实现。通过引入带可行性护盾的教师引导强化学习方法，我们使智能体能够智能地操作工具：它们利用外部工具消除导航不确定性，同时掌握严格能量约束下的成本感知调度。本文旨在为构建6G时代的工具增强智能体提供路线图。

Discovery and Reinforcement of Tool-Integrated Reasoning Chains via Rollout Trees

通过展开树发现并强化工具集成推理链

Authors: Kun Li, Zenan Xu, Junan Li, Zengrui Jin, Jinghao Deng, Zexuan Qiu, Bo Zhou
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.08274
Pdf link: https://arxiv.org/pdf/2601.08274
Abstract Tool-Integrated Reasoning has emerged as a key paradigm to augment Large Language Models (LLMs) with computational capabilities, yet integrating tool-use into long Chain-of-Thought (long CoT) remains underexplored, largely due to the scarcity of training data and the challenge of integrating tool-use without compromising the model's intrinsic long-chain reasoning. In this paper, we introduce DART (Discovery And Reinforcement of Tool-Integrated Reasoning Chains via Rollout Trees), a reinforcement learning framework that enables spontaneous tool-use during long CoT reasoning without human annotation. DART operates by constructing dynamic rollout trees during training to discover valid tool-use opportunities, branching out at promising positions to explore diverse tool-integrated trajectories. Subsequently, a tree-based process advantage estimation identifies and credits specific sub-trajectories where tool invocation positively contributes to the solution, effectively reinforcing these beneficial behaviors. Extensive experiments on challenging benchmarks like AIME and GPQA-Diamond demonstrate that DART significantly outperforms existing methods, successfully harmonizing tool execution with long CoT reasoning.
中文摘要 工具集成推理已成为增强大型语言模型（LLM）计算能力的关键范式，但将工具使用整合进长思考链（long CoT）仍然缺乏被充分探索，主要原因是训练数据稀缺以及在不破坏模型内在长链推理的前提下整合工具使用的挑战。本文介绍了DART（通过滚动树发现与强化工具集成推理链），这是一种强化学习框架，使得在长CoT推理过程中无需人工注释即可自发使用工具。DART通过在培训期间构建动态的推广树，发现有效的工具使用机会，并在有前景的位置分支，探索多样化的工具集成轨迹。随后，基于树的过程优势估计识别并赋予工具调用对解有积极贡献的特定子轨迹，有效强化这些有益行为。对AIME和GPQA-Diamond等挑战性基准的广泛实验表明，DART的表现显著优于现有方法，成功地将工具执行与长CoT推理协调一致。

D$^2$Plan: Dual-Agent Dynamic Global Planning for Complex Retrieval-Augmented Reasoning

D$^2$Plan：复杂检索增强推理的双代理动态全局规划

Authors: Kangcheng Luo, Tinglang Wu, Yansong Feng
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.08282
Pdf link: https://arxiv.org/pdf/2601.08282
Abstract Recent search-augmented LLMs trained with reinforcement learning (RL) can interleave searching and reasoning for multi-hop reasoning tasks. However, they face two critical failure modes as the accumulating context becomes flooded with both crucial evidence and irrelevant information: (1) ineffective search chain construction that produces incorrect queries or omits retrieval of critical information, and (2) reasoning hijacking by peripheral evidence that causes models to misidentify distractors as valid evidence. To address these challenges, we propose D$^2$Plan, a Dual-agent Dynamic global Planning paradigm for complex retrieval-augmented reasoning. D$^2$Plan operates through the collaboration of a Reasoner and a Purifier: the Reasoner constructs explicit global plans during reasoning and dynamically adapts them based on retrieval feedback; the Purifier assesses retrieval relevance and condenses key information for the Reasoner. We further introduce a two-stage training framework consisting of supervised fine-tuning (SFT) cold-start on synthesized trajectories and RL with plan-oriented rewards to teach LLMs to master the D$^2$Plan paradigm. Extensive experiments demonstrate that D$^2$Plan enables more coherent multi-step reasoning and stronger resilience to irrelevant information, thereby achieving superior performance on challenging QA benchmarks.
中文摘要 最近通过强化学习（RL）训练的搜索增强大型语言模型，可以在多跳推理任务中交错搜索和推理。然而，随着积累的上下文被关键证据和无关信息淹没，它们面临两种关键失败模式：（1）无效的搜索链构建产生错误查询或遗漏关键信息检索;（2）通过外围证据劫持推理，导致模型误将干扰因素误判为有效证据。为应对这些挑战，我们提出了D$^2$Plan，一种Dual-agent Dynamic 全局Planning范式，用于复杂检索增强推理。D$^2$Plan通过推理者和净化者的协作运行：推理者在推理过程中构建显式全局计划，并根据检索反馈动态调整;净化者评估检索相关性，并为推理者压缩关键信息。我们进一步引入了两阶段训练框架，包括监督式微调（SFT）、对合成轨迹的冷启动和强化学习，并以计划为导向的奖励，教导LLM掌握D$^2$Plan范式。大量实验表明，D$^2$Plan能够实现更连贯的多步推理和更强的无关信息韧性，从而在具有挑战性的质量保证基准测试中获得更优异的性能。

ORBIT: On-policy Exploration-Exploitation for Controllable Multi-Budget Reasoning

ORBIT：面向可控多预算推理的同策略探索-利用

Authors: Kun Liang, Clive Bai, Xin Xu, Chenming Tang, Sanwoo Lee, Weijie Liu, Saiyong Yang, Yunfang Wu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.08310
Pdf link: https://arxiv.org/pdf/2601.08310
Abstract Recent Large Reasoning Models (LRMs) achieve strong performance by leveraging long-form Chain-of-Thought (CoT) reasoning, but uniformly applying overlong reasoning at inference time incurs substantial and often unnecessary computational cost. To address this, prior work explores various strategies to infer an appropriate reasoning budget from the input. However, such approaches are unreliable in the worst case, as estimating the minimal required reasoning effort is fundamentally difficult, and they implicitly fix the trade-off between reasoning cost and accuracy during training, limiting flexibility under varying deployment scenarios. Motivated by these limitations, we propose ORBIT, a controllable multi-budget reasoning framework with well-separated reasoning modes triggered by input. ORBIT employs multi-stage reinforcement learning to discover Pareto-optimal reasoning behaviors at each effort, followed by on-policy distillation to fuse these behaviors into a single unified model. Experiments show that ORBIT achieves (1) controllable reasoning behavior over multiple modes, (2) competitive reasoning density within each mode, and (3) integration of these frontier policies into a single unified student model while preserving clear mode separation and high per-mode performance.
中文摘要 近期大型推理模型（LRM）通过利用长形式思维链（CoT）推理实现了强有力的性能，但在推理时统一应用过长推理会带来大量且常常不必要的计算成本。为此，先前的研究探索了多种策略，从输入中推断出合适的推理预算。然而，这种方法在最坏情况下不可靠，因为估算最低推理工作量本质上困难，且隐含地解决了训练中推理成本与准确性之间的权衡，限制了在不同部署场景下的灵活性。基于这些局限，我们提出了ORBIT，一种可控的多预算推理框架，推理模式由输入触发。ORBIT采用多阶段强化学习，在每次尝试中发现帕累托最优推理行为，随后进行策略内提炼，将这些行为融合为统一模型。实验显示，ORBIT实现了（1）多种模式可控推理行为，（2）每种模式内的竞争推理密度，以及（3）将这些前沿政策整合进统一的学生模型，同时保持清晰的模式分离和高单一模式表现。

AtomMem: Learnable Dynamic Agentic Memory with Atomic Memory Operation

AtomMem：具有原子记忆操作的可学习动态智能体记忆

Authors: Yupeng Huo, Yaxi Lu, Zhong Zhang, Haotian Chen, Yankai Lin
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.08323
Pdf link: https://arxiv.org/pdf/2601.08323
Abstract Equipping agents with memory is essential for solving real-world long-horizon problems. However, most existing agent memory mechanisms rely on static and hand-crafted workflows. This limits the performance and generalization ability of these memory designs, which highlights the need for a more flexible, learning-based memory framework. In this paper, we propose AtomMem, which reframes memory management as a dynamic decision-making problem. We deconstruct high-level memory processes into fundamental atomic CRUD (Create, Read, Update, Delete) operations, transforming the memory workflow into a learnable decision process. By combining supervised fine-tuning with reinforcement learning, AtomMem learns an autonomous, task-aligned policy to orchestrate memory behaviors tailored to specific task demands. Experimental results across 3 long-context benchmarks demonstrate that the trained AtomMem-8B consistently outperforms prior static-workflow memory methods. Further analysis of training dynamics shows that our learning-based formulation enables the agent to discover structured, task-aligned memory management strategies, highlighting a key advantage over predefined routines.
中文摘要 为智能体配备记忆对于解决现实世界的长程问题至关重要。然而，大多数现有的智能体记忆机制依赖静态且手工设计的工作流程。这限制了这些记忆设计的性能和泛化能力，凸显了对更灵活、基于学习的记忆框架的需求。本文提出了AtomMem，将记忆管理重新定义为一个动态决策问题。我们将高层记忆过程拆解为基本的原子CRUD（创建、读取、更新、删除）操作，将记忆工作流程转变为可学习的决策过程。通过结合监督微调与强化学习，AtomMem 学习自主且与任务对齐的策略，以协调针对特定任务需求的记忆行为。在3个长上下文基准上的实验结果表明，训练好的AtomMem-8B始终优于以往的静态工作流程记忆方法。对训练动态的进一步分析表明，我们的基于学习的表述使智能体能够发现结构化、任务对齐的记忆管理策略，突出了相较于预设例程的关键优势。
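
To make the idea of atomic CRUD memory operations concrete, here is a minimal sketch of a memory store whose only interface is Create, Read, Update, and Delete, driven by a sequence of actions. It is not the AtomMem implementation: the `MemoryStore` class, the `apply_action` helper, the tuple-based action format, and the keyword-match retrieval are all assumptions made for illustration. In the paper's setting a learned policy would choose the actions; here they are hard-coded.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class MemoryStore:
    """Hypothetical memory exposing only atomic CRUD operations."""
    items: dict = field(default_factory=dict)
    _ids: count = field(default_factory=count)

    def create(self, text):
        key = next(self._ids)          # allocate a fresh integer id
        self.items[key] = text
        return key

    def read(self, query):
        # Keyword match stands in for whatever retrieval a real system uses.
        return [k for k, v in self.items.items() if query.lower() in v.lower()]

    def update(self, key, text):
        if key not in self.items:
            raise KeyError(key)
        self.items[key] = text

    def delete(self, key):
        self.items.pop(key, None)      # deleting a missing key is a no-op


def apply_action(store, action):
    """Dispatch one (op, *args) action; a policy would emit these tuples."""
    op, *args = action
    handlers = {
        "create": store.create,
        "read": store.read,
        "update": store.update,
        "delete": store.delete,
    }
    return handlers[op](*args)


store = MemoryStore()
key = apply_action(store, ("create", "User prefers tea"))
apply_action(store, ("update", key, "User prefers coffee"))
hits = apply_action(store, ("read", "coffee"))
```

Restricting the interface to four operations keeps the action space small and fixed, which is one plausible reason such a decomposition suits a learned decision process.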

Safe Heterogeneous Multi-Agent RL with Communication Regularization for Coordinated Target Acquisition

具通信正则化的安全异构多智能体强化学习，用于协同目标获取

Authors: Gabriele Calzolari (1), Vidya Sumathy (1), Christoforos Kanellakis (1), George Nikolakopoulos (1) ((1) Lulea University of Technology)
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.08327
Pdf link: https://arxiv.org/pdf/2601.08327
Abstract This paper introduces a decentralized multi-agent reinforcement learning framework enabling structurally heterogeneous teams of agents to jointly discover and acquire randomly located targets in environments characterized by partial observability, communication constraints, and dynamic interactions. Each agent's policy is trained with the Multi-Agent Proximal Policy Optimization algorithm and employs a Graph Attention Network encoder that integrates simulated range-sensing data with communication embeddings exchanged among neighboring agents, enabling context-aware decision-making from both local sensing and relational information. In particular, this work introduces a unified framework that integrates graph-based communication and trajectory-aware safety through safety filters. The architecture is supported by a structured reward formulation designed to encourage effective target discovery and acquisition, collision avoidance, and de-correlation between the agents' communication vectors by promoting informational orthogonality. The effectiveness of the proposed reward function is demonstrated through a comprehensive ablation study. Moreover, simulation results demonstrate safe and stable task execution, confirming the framework's effectiveness.
中文摘要 本文介绍了一个去中心化的多智能体强化学习框架，使结构异质的智能体团队能够在部分可观测性、通信约束和动态交互的环境中共同发现并获取随机定位的目标。每个代理的策略都用多代理近端策略优化算法训练，并采用图注意力网络编码器，将模拟的距离感知数据与邻居代理之间交换的通信嵌入集成，实现基于本地感知和关系信息的上下文感知决策。特别是，本研究引入了一个统一框架，通过安全过滤器整合基于图的通信和轨迹感知安全。该架构由结构化奖励表述支持，旨在通过促进信息正交性，促进有效的目标发现与获取、避免碰撞以及智能体通信向量之间的去相关化。通过一项全面的消融研究，所提出的奖励函数的有效性得到了验证。此外，模拟结果显示任务执行安全稳定，确认了框架的有效性。

Owen-Shapley Policy Optimization (OSPO): A Principled RL Algorithm for Generative Search LLMs

Owen-Shapley 策略优化（OSPO）：一种用于生成式搜索大型语言模型的原则性强化学习算法

Authors: Abhijnan Nath, Alireza Bagheri Garakani, Tianchen Zhou, Fan Yang, Nikhil Krishnaswamy
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.08403
Pdf link: https://arxiv.org/pdf/2601.08403
Abstract Large language models are increasingly trained via reinforcement learning for personalized recommendation tasks, but standard methods like GRPO rely on sparse, sequence-level rewards that create a credit assignment gap, obscuring which tokens drive success. This gap is especially problematic when models must infer latent user intent from under-specified language without ground truth labels, a reasoning pattern rarely seen during pretraining. We introduce Owen-Shapley Policy Optimization (OSPO), a framework that redistributes sequence-level advantages based on tokens' marginal contributions to outcomes. Unlike value-model-based methods requiring additional computation, OSPO employs potential-based reward shaping via Shapley-Owen attributions to assign segment-level credit while preserving the optimal policy, learning directly from task feedback without parametric value models. By forming coalitions of semantically coherent units (phrases describing product attributes or sentences capturing preferences), OSPO identifies which response parts drive performance. Experiments on Amazon ESCI and H&M Fashion datasets show consistent gains over baselines, with notable test-time robustness to out-of-distribution retrievers unseen during training.
中文摘要 大型语言模型越来越多地通过强化学习训练个性化推荐任务，但像GRPO这样的标准方法依赖稀疏的序列级奖励，造成信用分配差距，模糊哪些代币驱动成功。当模型必须从未指定且没有真实标签的语言推断潜在用户意图时，这一差距尤其成问题，这种推理模式在预训练中很少见。我们介绍了Owen-Shapley策略优化（OSPO），这是一个基于代币对结果的边际贡献重新分配序列层面优势的框架。与需要额外计算的基于价值模型的方法不同，OSPO采用基于潜力的Shapley-Owen归因来分配细分层级的信用，同时保持最优策略，直接从任务反馈中学习，无需参数化价值模型。通过形成语义连贯单元联盟（描述产品属性的短语或捕捉偏好的句子），OSPO识别哪些反应部分驱动性能。在亚马逊ESCI和H&M Fashion数据集上的实验显示，基线数据持续提升，且对分布外检索器在测试时间上表现出显著的鲁棒性，这些是在训练中未曾见到的。
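
As a rough illustration of crediting response segments by their marginal contributions, the sketch below computes exact Shapley values by averaging each segment's marginal gain over all orderings. This is plain Shapley attribution on a toy value function, not the Owen-value variant and not the OSPO implementation; the segment names and the `toy_value` function are invented for the example. Exact enumeration is only feasible for a handful of segments.

```python
from itertools import permutations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: t / n_orders for p, t in totals.items()}


def toy_value(coalition):
    """Hypothetical reward for a response containing only these segments."""
    score = 0.0
    if "color" in coalition:
        score += 0.6
    if "size" in coalition:
        score += 0.3
    if "color" in coalition and "size" in coalition:
        score += 0.1   # bonus when both attributes are mentioned
    return score       # the "filler" segment never changes the reward


segments = ["color", "size", "filler"]
credit = shapley_values(segments, toy_value)
```

The credits sum to the full-response reward minus the empty-response reward, so a sequence-level signal is redistributed rather than rescaled, and a segment that never changes the outcome receives zero credit.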

RubricHub: A Comprehensive and Highly Discriminative Rubric Dataset via Automated Coarse-to-Fine Generation

RubricHub：通过自动化由粗到细生成构建的全面且高判别性的评分标准数据集

Authors: Sunzhu Li, Jiale Zhao, Miteto Wei, Huimin Ren, Yang Zhou, Jingwen Yang, Shunyu Liu, Kaike Zhang, Wei Chen
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.08430
Pdf link: https://arxiv.org/pdf/2601.08430
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has driven substantial progress in reasoning-intensive domains like mathematics. However, optimizing open-ended generation remains challenging due to the lack of ground truth. While rubric-based evaluation offers a structured proxy for verification, existing methods suffer from scalability bottlenecks and coarse criteria, resulting in a supervision ceiling effect. To address this, we propose an automated Coarse-to-Fine Rubric Generation framework. By synergizing principle-guided synthesis, multi-model aggregation, and difficulty evolution, our approach produces comprehensive and highly discriminative criteria capable of capturing the subtle nuances. Based on this framework, we introduce RubricHub, a large-scale ($\sim$110k) and multi-domain dataset. We validate its utility through a two-stage post-training pipeline comprising Rubric-based Rejection Sampling Fine-Tuning (RuFT) and Reinforcement Learning (RuRL). Experimental results demonstrate that RubricHub unlocks significant performance gains: our post-trained Qwen3-14B achieves state-of-the-art (SOTA) results on HealthBench (69.3), surpassing proprietary frontier models such as GPT-5. The code and data will be released soon.
中文摘要 带可验证奖励的强化学习（RLVR）在推理密集领域如数学领域取得了显著进展。然而，由于缺乏地面真实信息，优化开放式生成仍然具有挑战性。虽然基于评分标准的评估提供了验证的结构化代理，但现有方法存在可扩展性瓶颈和粗糙的标准，导致监督天花板效应。为此，我们提出了一个自动化的粗到细评分规生成框架。通过协同原理导向综合、多模型聚合和难度演化，我们的方法产生了全面且高度判别的标准，能够捕捉细微差别。基于该框架，我们介绍了RubricHub，一个大规模（$\sim$110k）多域数据集。我们通过包含基于评分标准的拒绝抽样微调（RuFT）和强化学习（RuRL）两阶段的训练后流程验证其实用性。实验结果表明，RubricHub 释放了显著的性能提升：我们经过后训练的 Qwen3-14B 在 HealthBench（69.3）上实现了最先进的（SOTA）成绩，超越了 GPT-5 等专有前沿模型。代码和数据将很快发布。

Large Multimodal Models for Embodied Intelligent Driving: The Next Frontier in Self-Driving?

具身智能驾驶的大型多模式模型：自动驾驶的下一个前沿？

Authors: Long Zhang, Yuchen Xia
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.08434
Pdf link: https://arxiv.org/pdf/2601.08434
Abstract The advent of Large Multimodal Models (LMMs) offers a promising technology to tackle the limitations of modular design in autonomous driving, which often falters in open-world scenarios requiring sustained environmental understanding and logical reasoning. In addition, embodied artificial intelligence facilitates policy optimization through closed-loop interactions to achieve continuous learning, thereby advancing autonomous driving toward embodied intelligent (EI) driving. However, this capability is constrained when EI driving relies solely on LMMs without joint decision-making. This article introduces a novel semantics and policy dual-driven hybrid decision framework to tackle this challenge, ensuring continuous learning and joint decision-making. The framework merges LMMs for semantic understanding and cognitive representation with deep reinforcement learning (DRL) for real-time policy optimization. We start by introducing the foundational principles of EI driving and LMMs. We then examine the emerging opportunities this framework enables, encompassing potential benefits and representative use cases. An experimental case study validates the performance superiority of our framework on a lane-change planning task. Finally, several future research directions to empower EI driving are identified to guide subsequent work.
中文摘要 大型多模态模型（LMMs）的出现为解决自动驾驶模块化设计的局限性提供了有前景的技术，而模块化设计在开放世界场景中常常出现不足，需要持续的环境理解和逻辑推理。此外，具身人工智能通过闭环交互促进政策优化，实现持续学习能力，从而推动自动驾驶迈向具身智能（El）驾驶。然而，这种能力将受到仅依赖LMM提升EI驾驶的限制，而没有共同决策。本文介绍了一种新的语义学和政策双重驱动混合决策框架，以应对这一挑战，确保持续学习和共同决策。该框架融合了用于语义理解和认知表征的LMMs，以及用于实时策略优化的深度强化学习（DRL）。我们首先介绍EI驾驶和LMM的基础原则。此外，我们还探讨了该框架带来的新兴机遇，涵盖潜在优势和具有代表性的应用场景。通过实验性案例研究验证我们框架在完成变道规划任务中的性能优势。最后，确定了若干未来研究方向以赋能EI驾驶，以指导后续工作。

Fine-Mem: Fine-Grained Feedback Alignment for Long-Horizon Memory Management

Fine-Mem：用于长视野内存管理的细粒度反馈对齐

Authors: Weitao Ma, Xiaocheng Feng, Lei Huang, Xiachong Feng, Zhanyu Ma, Jun Xu, Jiuchong Gao, Jinghua Hao, Renqing He, Bing Qin
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.08435
Pdf link: https://arxiv.org/pdf/2601.08435
Abstract Effective memory management is essential for large language model agents to navigate long-horizon tasks. Recent research has explored using Reinforcement Learning to develop specialized memory manager agents. However, existing approaches rely on final task performance as the primary reward, which results in severe reward sparsity and ineffective credit assignment, providing insufficient guidance for individual memory operations. To this end, we propose Fine-Mem, a unified framework designed for fine-grained feedback alignment. First, we introduce a Chunk-level Step Reward to provide immediate step-level supervision via auxiliary chunk-specific question answering tasks. Second, we devise Evidence-Anchored Reward Attribution to redistribute global rewards by anchoring credit to key memory operations, based on the specific memory items utilized as evidence in reasoning. Together, these components enable stable policy optimization and align local memory operations with the long-term utility of memory. Experiments on Memalpha and MemoryAgentBench demonstrate that Fine-Mem consistently outperforms strong baselines, achieving superior success rates across various sub-tasks. Further analysis reveals its adaptability and strong generalization capabilities across diverse model configurations and backbones.
中文摘要 有效的记忆管理对于大型语言模型智能体完成长程任务至关重要。近期研究探讨利用强化学习开发专用的记忆管理智能体。然而，现有方法依赖最终任务表现作为主要奖励，导致严重的奖励稀疏和无效的信用分配，无法为单个记忆操作提供足够的指导。为此，我们提出了Fine-Mem，一个用于细粒度反馈对齐的统一框架。首先，我们引入了块级步骤奖励（Chunk-level Step Reward），通过辅助的块特定问答任务提供即时的步骤级监督。其次，我们设计了证据锚定奖励归因（Evidence-Anchored Reward Attribution），基于推理中被用作证据的具体记忆条目，将信用锚定于关键记忆操作，从而重新分配全局奖励。这些组件共同实现了稳定的策略优化，并将局部记忆操作与记忆的长期效用对齐。在 Memalpha 和 MemoryAgentBench 上的实验表明，Fine-Mem 持续优于强基线，在多个子任务中取得了更高的成功率。进一步分析显示其适应性强，并在多种模型配置和骨干网络上具有很强的泛化能力。

Incentivizing Cardiologist-Like Reasoning in MLLMs for Interpretable Echocardiographic Diagnosis

激励MLLM中类似心脏病学家的推理以实现可解读的超声心动图诊断

Authors: Yi Qin, Lehan Wang, Chenxu Zhao, Alex P.W. Lee, Xiaomeng Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.08440
Pdf link: https://arxiv.org/pdf/2601.08440
Abstract Echocardiographic diagnosis is vital for cardiac screening yet remains challenging. Existing echocardiography foundation models do not effectively capture the relationships between quantitative measurements and clinical manifestations, whereas medical reasoning multimodal large language models (MLLMs) require costly construction of detailed reasoning paths and remain ineffective at directly incorporating such echocardiographic priors into their reasoning. To address these limitations, we propose a novel approach comprising the Cardiac Reasoning Template (CRT) and CardiacMind to enhance MLLMs' echocardiographic reasoning by introducing a cardiologist-like mindset. Specifically, CRT provides stepwise canonical diagnostic procedures for complex cardiac diseases to streamline reasoning path construction without the need for costly case-by-case verification. To incentivize MLLM reasoning under CRT, we develop CardiacMind, a new reinforcement learning scheme with three novel rewards: Procedural Quantity Reward (PQtR), Procedural Quality Reward (PQlR), and Echocardiographic Semantic Reward (ESR). PQtR promotes detailed reasoning, PQlR promotes integration of evidence across views and modalities, and ESR grounds stepwise descriptions in visual content. Our methods show a 48% improvement in multiview echocardiographic diagnosis for 15 complex cardiac diseases and a 5% improvement on CardiacNet-PAH over prior methods. The user study on our method's reasoning outputs shows 93.33% clinician agreement with cardiologist-like reasoning logic. Our code will be made available.
中文摘要 超声心动图诊断对心脏筛查至关重要，但仍然具有挑战性。现有超声心动图基础模型未能有效捕捉定量测量与临床表现之间的关系，而医学推理多模态大语言模型（MLLM）则需要昂贵的详细推理路径构建，且未能有效直接将这些超声先验纳入推理。为解决这些局限性，我们提出了一种结合心脏推理模板（CRT）和心脏心灵的新方法，通过引入类似心脏病学家的思维方式，增强MLLM的超声心动图推理能力。具体来说，CRT为复杂心脏病提供逐步的典型诊断程序，简化推理路径的构建，无需昂贵的逐案验证。为了激励CRT下的MLLM推理，我们开发了CardiacMind，一种新的强化学习方案，具有三种新颖奖励：程序数量奖励（PQtR）、程序质量奖励（PQlR）和超声心动图语义奖励（ESR）。PQtR 促进详细推理;PQlR促进跨视角和模式的证据整合，而ESR则基于视觉内容逐步描述。我们的方法显示，15种复杂心脏疾病的多视角超声心动图诊断提升了48%，而CardiacNet-PAH的诊断比以往方法提升了5%。基于我们方法推理输出的用户研究显示，临床医生对心脏病学家推理逻辑的认同率为93.33%。我们的代码将会被公开。

JudgeRLVR: Judge First, Generate Second for Efficient Reasoning

JudgeRLVR：先评判，后生成以实现高效推理

Authors: Jiangshan Duo, Hanyu Li, Hailin Zhang, Yudong Wang, Sujian Li, Liang Zhao
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.08468
Pdf link: https://arxiv.org/pdf/2601.08468
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for reasoning in Large Language Models. However, optimizing solely for final-answer correctness often drives models into aimless, verbose exploration, where they rely on exhaustive trial-and-error tactics rather than structured planning to reach solutions. While heuristic constraints like length penalties can reduce verbosity, they often truncate essential reasoning steps, creating a difficult trade-off between efficiency and verification. In this paper, we argue that discriminative capability is a prerequisite for efficient generation: by learning to distinguish valid solutions, a model can internalize a guidance signal that prunes the search space. We propose JudgeRLVR, a two-stage judge-then-generate paradigm. In the first stage, we train the model to judge solution responses with verifiable answers. In the second stage, we fine-tune the same model with vanilla generation RLVR, initialized from the judge. Compared to vanilla RLVR using the same math-domain training data, JudgeRLVR achieves a better quality-efficiency trade-off for Qwen3-30B-A3B: on in-domain math, it delivers an average accuracy gain of about +3.7 points with a 42% shorter average generation length; on out-of-domain benchmarks, it delivers an average accuracy improvement of about +4.5 points, demonstrating enhanced generalization.
中文摘要 带可验证奖励的强化学习（RLVR）已成为大型语言模型推理的标准范式。然而，仅仅为最终答案正确性进行优化，往往会使模型陷入无目的、冗长的探索，依赖于穷举式的试错策略，而非结构化的规划来达成解决方案。虽然像长度惩罚这样的启发式约束可以减少冗长，但它们往往会截断关键的推理步骤，导致效率与验证之间的艰难权衡。本文论证，判别能力是高效生成的前提：通过学习区分有效解，模型可以内化一个引导信号，从而修剪搜索空间。我们提出了JudgeRLVR，一种先评判、后生成的两阶段范式。第一阶段，我们训练模型对带有可验证答案的解答进行评判。第二阶段，我们以评判模型为初始化，用原版生成式RLVR微调同一模型。与使用相同数学领域训练数据的原版RLVR相比，JudgeRLVR在Qwen3-30B-A3B上实现了更好的质量-效率权衡：在领域内数学任务上，平均准确率提升约3.7点，平均生成长度减少42%；在域外基准测试中，平均准确率提升约4.5点，展现了更强的泛化能力。

Baiting AI: Deceptive Adversary Against AI-Protected Industrial Infrastructures

诱导人工智能：针对人工智能保护工业基础设施的欺骗性对手

Authors: Aryan Pasikhani, Prosanta Gope, Yang Yang, Shagufta Mehnaz, Biplab Sikdar
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2601.08481
Pdf link: https://arxiv.org/pdf/2601.08481
Abstract This paper explores a new cyber-attack vector targeting Industrial Control Systems (ICS), particularly focusing on water treatment facilities. Developing a new multi-agent Deep Reinforcement Learning (DRL) approach, adversaries craft stealthy, strategically timed, wear-out attacks designed to subtly degrade product quality and reduce the lifespan of field actuators. This sophisticated method leverages DRL methodology not only to execute precise and detrimental impacts on targeted infrastructure but also to evade detection by contemporary AI-driven defence systems. By developing and implementing tailored policies, the attackers ensure their hostile actions blend seamlessly with normal operational patterns, circumventing integrated security measures. Our research reveals the robustness of this attack strategy, shedding light on the potential for DRL models to be manipulated for adversarial purposes. Our research has been validated through testing and analysis in an industry-level setup. For reproducibility and further study, all related materials, including datasets and documentation, are publicly accessible.
中文摘要 本文探讨了一种针对工业控制系统（ICS）的新型网络攻击向量，特别关注水处理设施。攻击者开发了一种新的多智能体深度强化学习（DRL）方法，精心设计隐蔽的、时机经过策略性选择的磨损攻击，旨在悄然降低产品质量并缩短现场执行器的寿命。这种复杂的方法利用DRL，不仅能对目标基础设施造成精确且有害的影响，还能规避当代AI驱动防御系统的检测。通过制定和实施定制化策略，攻击者确保其敌对行为与正常运行模式无缝融合，绕过集成的安全措施。我们的研究揭示了该攻击策略的稳健性，也揭示了DRL模型被操纵用于对抗目的的可能性。我们的研究已在工业级环境中通过测试和分析得到验证。为了可重复性和进一步研究，所有相关材料，包括数据集和文档，均可公开访问。

AME-2: Agile and Generalized Legged Locomotion via Attention-Based Neural Map Encoding

AME-2：通过注意力神经图编码实现敏捷和广义腿部运动

Authors: Chong Zhang, Victor Klemm, Fan Yang, Marco Hutter
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.08485
Pdf link: https://arxiv.org/pdf/2601.08485
Abstract Achieving agile and generalized legged locomotion across terrains requires tight integration of perception and control, especially under occlusions and sparse footholds. Existing methods have demonstrated agility on parkour courses but often rely on end-to-end sensorimotor models with limited generalization and interpretability. By contrast, methods targeting generalized locomotion typically exhibit limited agility and struggle with visual occlusions. We introduce AME-2, a unified reinforcement learning (RL) framework for agile and generalized locomotion that incorporates a novel attention-based map encoder in the control policy. This encoder extracts local and global mapping features and uses attention mechanisms to focus on salient regions, producing an interpretable and generalized embedding for RL-based control. We further propose a learning-based mapping pipeline that provides fast, uncertainty-aware terrain representations robust to noise and occlusions, serving as policy inputs. It uses neural networks to convert depth observations into local elevations with uncertainties, and fuses them with odometry. The pipeline also integrates with parallel simulation so that we can train controllers with online mapping, aiding sim-to-real transfer. We validate AME-2 with the proposed mapping pipeline on a quadruped and a biped robot, and the resulting controllers demonstrate strong agility and generalization to unseen terrains in simulation and in real-world experiments.
中文摘要 实现跨地形的敏捷和通用腿部运动需要感知与控制的紧密结合，尤其是在遮挡和稀疏的足迹下。现有方法在跑酷赛道上展示了敏捷性，但通常依赖端到端的感觉运动模型，推广性和解释性有限。相比之下，针对广义运动的方法通常表现出有限的灵活性，且视觉遮挡问题较为困难。我们介绍AME-2，这是一种统一的强化学习（RL）框架，用于敏捷和泛化移动，在控制策略中集成了一种新的基于注意力的地图编码器。该编码器提取局部和全局映射特征，并利用注意力机制聚焦显著区域，生成基于强化学习控制的可解释且广义化的嵌入。我们还提出了一种基于学习的制图流程，能够快速、具备不确定性的地形表示，且能对噪声和遮蔽构成鲁棒，作为政策输入。它利用神经网络将深度观测转换为带有不确定性的局部高程，并将其与里程计结合。该流水线还集成了并行仿真，使我们能够通过在线映射训练控制器，帮助模拟到实际的传输。我们用拟议的映射流程验证了AME-2，在四足和双足机器人上，结果控制器在模拟和现实实验中展现出强烈的敏捷性和对未见地形的泛化能力。

AUV Trajectory Learning for Underwater Acoustic Energy Transfer and Age Minimization

AUV轨迹学习用于水下声能传递和年龄最小化

Authors: Mohamed Afouene Melki, Mohammad Shehab, Mohamed-Slim Alouini
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2601.08491
Pdf link: https://arxiv.org/pdf/2601.08491
Abstract The Internet of Underwater Things (IoUT) is attracting growing attention for monitoring sea life and the deep-ocean environment, underwater surveillance, and the maintenance of underwater installations. However, conventional IoUT devices, reliant on battery power, face limitations in lifespan and pose environmental hazards upon disposal. This paper introduces a sustainable approach for simultaneous information uplink from the IoUT devices and acoustic energy transfer (AET) to the devices via an autonomous underwater vehicle (AUV), potentially enabling them to operate indefinitely. To account for time sensitivity and fairness, we adopt the age of information (AoI) and Jain's fairness index. We develop two deep reinforcement learning (DRL) algorithms, offering a high-complexity, high-performance frequency division duplex (FDD) solution and a low-complexity, medium-performance time division duplex (TDD) approach. The results show that the proposed FDD and TDD solutions significantly reduce the average AoI and boost the harvested energy as well as data collection fairness compared to baseline approaches.
中文摘要 水下物联网（IoUT）正日益受到关注，旨在监测海洋生物和深海环境、水下监测以及水下设施的维护。然而，依赖电池供电的传统IoUT设备寿命有限，且在废弃时存在环境危害。本文提出了一种可持续的方法，用于通过自主水下载具（AUV）实现同时信息上行，并通过自主水下载具（AUV）实现声能传输（AET）到设备，从而有可能实现其无限期运行。为了解决时间敏感性问题，我们采用了信息年龄（AoI）和Jain的公平指数。我们开发了两种深度强化学习（DRL）算法，分别提供高复杂度、高性能的频分双工（FDD）解决方案和低复杂度、中性能的时分双工（TDD）方法。结果阐明了所提出的FDD和TDD解决方案相比基线方法显著降低了平均AoI，并提高了收获能量和数据收集的公平性。
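
The two metrics named in this abstract can be sketched in a few lines. Jain's fairness index for values $x_1, \dots, x_n$ is the standard $(\sum_i x_i)^2 / (n \sum_i x_i^2)$, which equals 1 when all values are equal and $1/n$ when one device gets everything. The AoI helper uses a simplified slotted model that is an assumption of this sketch, not the paper's system model: the age is 0 in a slot where an update is collected and grows by 1 per slot otherwise. The function names and sample inputs are hypothetical.

```python
def jain_fairness(values):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2), in [1/n, 1]."""
    total = sum(values)
    squares = sum(v * v for v in values)
    if squares == 0:
        return 1.0  # convention chosen here: all-zero allocation counts as fair
    return total * total / (len(values) * squares)


def average_aoi(update_slots, horizon):
    """Mean age over slots 0..horizon-1 under a simplified slotted model.

    Assumption: the age starts at 0, resets to 0 in any slot listed in
    update_slots, and otherwise increases by 1 each slot.
    """
    age = 0
    ages = []
    for t in range(horizon):
        if t in update_slots:
            age = 0
        ages.append(age)
        age += 1
    return sum(ages) / horizon


# Hypothetical per-device collected data (arbitrary units) and visit schedule.
fair = jain_fairness([2.0, 2.0, 2.0])
unfair = jain_fairness([1.0, 0.0, 0.0, 0.0])
aoi_with_visit = average_aoi({3}, horizon=6)
aoi_without_visit = average_aoi(set(), horizon=4)
```

A trajectory that visits devices more evenly raises the fairness index, while more frequent visits to a device lower its average AoI, which is why the two are natural joint objectives for the AUV.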

Your Group-Relative Advantage Is Biased

你的群体相对优势是有偏的

Authors: Fengkai Yang, Zherui Chen, Xiaohan Wang, Xiaodong Lu, Jiajun Chai, Guojun Yin, Wei Lin, Shuai Ma, Fuzhen Zhuang, Deqing Wang, Yaodong Yang, Jianxin Li, Yikun Ban
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.08521
Pdf link: https://arxiv.org/pdf/2601.08521
Abstract Reinforcement Learning from Verifier Rewards (RLVR) has emerged as a widely used approach for post-training large language models on reasoning tasks, with group-based methods such as GRPO and its variants gaining broad adoption. These methods rely on group-relative advantage estimation to avoid learned critics, yet its theoretical properties remain poorly understood. In this work, we uncover a fundamental issue of group-based RL: the group-relative advantage estimator is inherently biased relative to the true (expected) advantage. We provide the first theoretical analysis showing that it systematically underestimates advantages for hard prompts and overestimates them for easy prompts, leading to imbalanced exploration and exploitation. To address this issue, we propose History-Aware Adaptive Difficulty Weighting (HA-DW), an adaptive reweighting scheme that adjusts advantage estimates based on an evolving difficulty anchor and training dynamics. Both theoretical analysis and experiments on five mathematical reasoning benchmarks demonstrate that HA-DW consistently improves performance when integrated into GRPO and its variants. Our results suggest that correcting biased advantage estimation is critical for robust and efficient RLVR training.
中文摘要 验证者奖励强化学习（RLVR）已成为大型语言模型推理任务后训练的广泛方法，基于群体的方法如GRPO及其变体正被广泛采用。这些方法依赖群体相对优势估计以避免学识渊博的批评者，但其理论特性仍鲜为人知。在本研究中，我们揭示了基于群体的强化学习的一个根本问题：群体相对优势估计器相对于真实（预期）优势本身存在一定偏差。我们首次提供理论分析，显示它系统性地低估了困难提示的优势，而对简单提示则高估优势，导致探索和利用不平衡。为解决这一问题，我们提出了历史感知自适应难度加权（HA-DW），这是一种基于演变的难度锚点和训练动态调整优势估计的自适应重权重方案。理论分析和五个数学推理基准测试的实验表明，当HA-DW集成到GRPO及其变体中时，性能持续提升。我们的结果表明，校正偏倚优势估计对于稳健高效的RLVR训练至关重要。
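
For readers unfamiliar with the estimator this paper analyzes, the sketch below shows one common form of the group-relative advantage used in GRPO-style training: each rollout's reward is standardized against the other rollouts sampled for the same prompt, so no learned critic is needed. The function name, the `eps` constant, and the use of the population standard deviation are assumptions of this sketch; implementations differ in these details, and the HA-DW correction proposed in the paper is not shown.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own group (rollouts of one prompt)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)            # population std; a choice made for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of 4 rollouts with binary verifier rewards (a hard prompt:
# only one rollout is correct).
hard_group = [1.0, 0.0, 0.0, 0.0]
hard_adv = group_relative_advantages(hard_group)

# If every rollout gets the same reward, the group carries no learning signal.
flat_adv = group_relative_advantages([1.0, 1.0, 1.0])
```

Because the baseline is the mean of a small finite group rather than the true expected reward for the prompt, the estimate depends on which rollouts happened to be sampled, which is the kind of gap between the group estimate and the true advantage that the abstract refers to.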

Provably Safe Reinforcement Learning using Entropy Regularizer

使用熵正则化器实现可证明的安全强化学习

Authors: Abhijit Mazumdar, Rafal Wisniewski, Manuela L. Bujorianu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.08646
Pdf link: https://arxiv.org/pdf/2601.08646
Abstract We consider the problem of learning the optimal policy for Markov decision processes with safety constraints. We formulate the problem in a reach-avoid setup. Our goal is to design online reinforcement learning algorithms that ensure safety constraints with arbitrarily high probability during the learning phase. To this end, we first propose an algorithm based on the optimism in the face of uncertainty (OFU) principle. Based on the first algorithm, we propose our main algorithm, which utilizes entropy regularization. We investigate the finite-sample analysis of both algorithms and derive their regret bounds. We demonstrate that the inclusion of entropy regularization improves the regret and drastically controls the episode-to-episode variability that is inherent in OFU-based safe RL algorithms.
中文摘要 我们考虑在安全约束下学习马尔可夫决策过程最优策略的问题，并将其表述为到达-规避（reach-avoid）设定。我们的目标是设计在线强化学习算法，在学习阶段以任意高的概率保证安全约束。为此，我们首先提出一种基于“面对不确定性时保持乐观”（OFU）原则的算法。在第一个算法的基础上，我们提出了利用熵正则化的主要算法。我们对这两种算法进行了有限样本分析，并推导了它们的遗憾界。我们证明，加入熵正则化能够改善遗憾，并显著抑制基于OFU的安全强化学习算法固有的回合间波动。

From Classical to Quantum Reinforcement Learning and Its Applications in Quantum Control: A Beginner's Tutorial

从经典到量子强化学习及其在量子控制中的应用：初学者教程

Authors: Abhijit Sen, Sonali Panda, Mahima Arya, Subhajit Patra, Zizhan Zheng, Denys I. Bondar
Subjects: Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
Arxiv link: https://arxiv.org/abs/2601.08662
Pdf link: https://arxiv.org/pdf/2601.08662
Abstract This tutorial is designed to make reinforcement learning (RL) more accessible to undergraduate students by offering clear, example-driven explanations. It focuses on bridging the gap between RL theory and practical coding applications, addressing common challenges that students face when transitioning from conceptual understanding to implementation. Through hands-on examples and approachable explanations, the tutorial aims to equip students with the foundational skills needed to confidently apply RL techniques in real-world scenarios.
中文摘要 本教程旨在通过提供清晰且以实例为驱动的解释，使本科生更容易理解强化学习（RL）。它侧重于弥合强化学习理论与实际编码应用之间的差距，解决学生在从概念理解过渡到实现过程中面临的常见挑战。通过实践示例和通俗易懂的讲解，教程旨在为学生提供自信地在现实场景中应用强化学习技术所需的基础技能。

VLingNav: Embodied Navigation with Adaptive Reasoning and Visual-Assisted Linguistic Memory

VLingNav：具身导航结合自适应推理和视觉辅助语言记忆

Authors: Shaoan Wang, Yuanfei Luo, Xingyu Chen, Aocheng Luo, Dongyue Li, Chang Liu, Sheng Chen, Yangang Zhang, Junzhi Yu
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.08665
Pdf link: https://arxiv.org/pdf/2601.08665
Abstract VLA models have shown promising potential in embodied navigation by unifying perception and planning while inheriting the strong generalization abilities of large VLMs. However, most existing VLA models rely on reactive mappings directly from observations to actions, lacking the explicit reasoning capabilities and persistent memory required for complex, long-horizon navigation tasks. To address these challenges, we propose VLingNav, a VLA model for embodied navigation grounded in linguistic-driven cognition. First, inspired by the dual-process theory of human cognition, we introduce an adaptive chain-of-thought mechanism, which dynamically triggers explicit reasoning only when necessary, enabling the agent to fluidly switch between fast, intuitive execution and slow, deliberate planning. Second, to handle long-horizon spatial dependencies, we develop a visual-assisted linguistic memory module that constructs a persistent, cross-modal semantic memory, enabling the agent to recall past observations to prevent repetitive exploration and infer movement trends for dynamic environments. For the training recipe, we construct Nav-AdaCoT-2.9M, the largest embodied navigation dataset with reasoning annotations to date, enriched with adaptive CoT annotations that induce a reasoning paradigm capable of adjusting both when to think and what to think about. Moreover, we incorporate an online expert-guided reinforcement learning stage, enabling the model to surpass pure imitation learning and to acquire more robust, self-explored navigation behaviors. Extensive experiments demonstrate that VLingNav achieves state-of-the-art performance across a wide range of embodied navigation benchmarks. Notably, VLingNav transfers to real-world robotic platforms in a zero-shot manner, executing various navigation tasks and demonstrating strong cross-domain and cross-task generalization.
中文摘要 VLA模型通过统一感知与规划，并继承大型VLM强大的泛化能力，在具身导航中展现出良好的潜力。然而，大多数现有的VLA模型依赖于直接从观测到动作的反应式映射，缺乏复杂长时程导航任务所需的显式推理能力和持久记忆。为应对这些挑战，我们提出了VLingNav，一种基于语言驱动认知的具身导航VLA模型。首先，受人类认知的双过程理论启发，我们引入了自适应思维链机制，该机制仅在必要时动态触发显式推理，使智能体能够在快速直观的执行与缓慢审慎的规划之间流畅切换。其次，为处理长时程空间依赖，我们开发了一个视觉辅助语言记忆模块，构建持久的跨模态语义记忆，使智能体能够回忆过去的观察，避免重复探索，并推断动态环境中的运动趋势。在训练方案方面，我们构建了Nav-AdaCoT-2.9M，这是迄今为止最大的带推理标注的具身导航数据集，并辅以自适应CoT标注，从而形成一种既能调整何时思考、也能调整思考什么的推理范式。此外，我们还引入了在线专家引导的强化学习阶段，使模型超越纯粹的模仿学习，获得更稳健的、自主探索得到的导航行为。大量实验表明，VLingNav在广泛的具身导航基准测试中实现了最先进的性能。值得注意的是，VLingNav能够以零样本方式迁移到现实世界的机器人平台，执行各种导航任务，并展现出强大的跨域和跨任务泛化能力。

PersonaDual: Balancing Personalization and Objectivity via Adaptive Reasoning

PersonaDual：通过适应性推理平衡个性化与客观性

Authors: Xiaoyou Liu, Xinyi Mou, Shengbin Yue, Liang Wang, Yuqing Wang, Qiexiang Wang, Tianrui Qin, Wangchunshu Zhou, Zhongyu Wei
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.08679
Pdf link: https://arxiv.org/pdf/2601.08679
Abstract As users increasingly expect LLMs to align with their preferences, personalized information becomes valuable. However, personalized information can be a double-edged sword: it can improve interaction but may compromise objectivity and factual correctness, especially when it is misaligned with the question. To alleviate this problem, we propose PersonaDual, a framework that supports both general-purpose objective reasoning and personalized reasoning in a single model, and adaptively switches modes based on context. PersonaDual is first trained with SFT to learn two reasoning patterns, and then further optimized via reinforcement learning with our proposed DualGRPO to improve mode selection. Experiments on objective and personalized benchmarks show that PersonaDual preserves the benefits of personalization while reducing interference, achieving near interference-free performance and better leveraging helpful personalized signals to improve objective problem-solving.
中文摘要 随着用户越来越期望大型语言模型符合他们的偏好，个性化信息变得很有价值。然而，个性化信息可能是一把双刃剑：它能改善交互，但可能损害客观性和事实正确性，尤其是当它与问题不匹配时。为缓解这一问题，我们提出了PersonaDual框架，该框架在单一模型中同时支持通用客观推理和个性化推理，并根据上下文自适应切换模式。PersonaDual首先通过SFT训练以学习两种推理模式，然后使用我们提出的DualGRPO进行强化学习，进一步优化模式选择。在客观和个性化基准上的实验表明，PersonaDual在减少干扰的同时保留了个性化的优势，实现了近乎无干扰的性能，并能更好地利用有益的个性化信号来提升客观问题的求解能力。

QuantEval: A Benchmark for Financial Quantitative Tasks in Large Language Models

QuantEval：大型语言模型金融量化任务基准

Authors: Zhaolu Kang, Junhao Gong, Wenqing Hu, Shuo Yin, Kehan Jiang, Zhicheng Fang, Yingjie He, Chunlei Meng, Rong Fu, Dongyang Chen, Leqi Zheng, Eric Hanchen Jiang, Yunfei Feng, Yitong Leng, Junfan Zhu, Xiaoyou Chen, Xi Yang, Richeng Xuan
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.08689
Pdf link: https://arxiv.org/pdf/2601.08689
Abstract Large Language Models (LLMs) have shown strong capabilities across many domains, yet their evaluation in financial quantitative tasks remains fragmented and mostly limited to knowledge-centric question answering. We introduce QuantEval, a benchmark that evaluates LLMs across three essential dimensions of quantitative finance: knowledge-based QA, quantitative mathematical reasoning, and quantitative strategy coding. Unlike prior financial benchmarks, QuantEval integrates a CTA-style backtesting framework that executes model-generated strategies and evaluates them using financial performance metrics, enabling a more realistic assessment of quantitative coding ability. We evaluate some state-of-the-art open-source and proprietary LLMs and observe substantial gaps to human experts, particularly in reasoning and strategy coding. Finally, we conduct large-scale supervised fine-tuning and reinforcement learning experiments on domain-aligned data, demonstrating consistent improvements. We hope QuantEval will facilitate research on LLMs' quantitative finance capabilities and accelerate their practical adoption in real-world trading workflows. We additionally release the full deterministic backtesting configuration (asset universe, cost model, and metric definitions) to ensure strict reproducibility.
中文摘要 大型语言模型（LLMs）在多个领域展现出强大的能力，但针对金融量化任务的评估仍然零散，且主要局限于以知识为中心的问答。我们提出了QuantEval，这是一个从量化金融的三个关键维度评估LLMs的基准：基于知识的问答、量化数学推理和量化策略编码。与以往的金融基准不同，QuantEval集成了CTA风格的回测框架，执行模型生成的策略并利用金融绩效指标进行评估，从而能够更真实地评估量化编码能力。我们评估了若干最先进的开源和专有大型语言模型，发现它们与人类专家之间存在显著差距，尤其是在推理和策略编码方面。最后，我们在领域对齐的数据上进行了大规模监督微调和强化学习实验，展示了一致的改进。我们希望QuantEval能够促进对LLMs量化金融能力的研究，并加速其在现实交易工作流程中的实际应用。我们还发布了完整的确定性回测配置（资产范围、成本模型和指标定义），以确保严格的可复现性。

Model-Agnostic Solutions for Deep Reinforcement Learning in Non-Ergodic Contexts

非遍历语境下深度强化学习的模型无关解决方案

Authors: Bert Verbruggen, Arne Vanhoyweghen, Vincent Ginis
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.08726
Pdf link: https://arxiv.org/pdf/2601.08726
Abstract Reinforcement Learning (RL) remains a central optimisation framework in machine learning. Although RL agents can converge to optimal solutions, the definition of "optimality" depends on the environment's statistical properties. The Bellman equation, central to most RL algorithms, is formulated in terms of expected values of future rewards. However, when ergodicity is broken, long-term outcomes depend on the specific trajectory rather than on the ensemble average. In such settings, the ensemble average diverges from the time-average growth experienced by individual agents, with expected-value formulations yielding systematically suboptimal policies. Prior studies demonstrated that traditional RL architectures fail to recover the true optimum in non-ergodic environments. We extend this analysis to deep RL implementations and show that these, too, produce suboptimal policies under non-ergodic dynamics. Introducing explicit time dependence into the learning process can correct this limitation. By allowing the network's function approximation to incorporate temporal information, the agent can estimate value functions consistent with the process's intrinsic growth rate. This improvement does not require altering the environmental feedback, such as reward transformations or modified objective functions, but arises naturally from the agent's exposure to temporal trajectories. Our results contribute to the growing body of research on reinforcement learning methods for non-ergodic systems.
中文摘要 强化学习（RL）仍然是机器学习中核心的优化框架。尽管强化学习智能体可以收敛到最优解，但“最优性”的定义取决于环境的统计性质。贝尔曼方程是大多数强化学习算法的核心，其表述基于未来奖励的期望值。然而，当遍历性被打破时，长期结果取决于具体轨迹，而非系综平均。在此类环境中，系综平均偏离个体智能体所经历的时间平均增长，基于期望值的表述会导致系统性的次优策略。先前研究表明，传统强化学习架构在非遍历环境中无法恢复真正的最优解。我们将此分析扩展到深度强化学习实现，表明它们在非遍历动态下同样会产生次优策略。在学习过程中引入显式的时间依赖性可以纠正这一局限。通过让网络的函数逼近包含时间信息，智能体可以估计与过程内在增长率一致的价值函数。这种改进不需要改变环境反馈（例如奖励变换或修改目标函数），而是自然地源于智能体对时间轨迹的接触。我们的结果为日益增多的非遍历系统强化学习方法研究做出了贡献。
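
The gap between ensemble average and time-average growth described in this abstract can be illustrated with a standard multiplicative coin-toss game. This is a minimal sketch, not the environment used in the paper. The multipliers 1.5 and 0.6 and the helper `simulate` are assumptions chosen for illustration.

```python
import math
import random

UP, DOWN = 1.5, 0.6  # hypothetical wealth multipliers for heads and tails

# Ensemble average: expected wealth multiplier per round across many agents.
ensemble_factor = 0.5 * UP + 0.5 * DOWN

# Time average: per-round growth factor along one long trajectory,
# i.e. the geometric mean of the two multipliers.
time_factor = math.exp(0.5 * math.log(UP) + 0.5 * math.log(DOWN))


def simulate(rounds, seed):
    """Return the log-wealth of a single agent after `rounds` fair tosses."""
    rng = random.Random(seed)
    log_wealth = 0.0
    for _ in range(rounds):
        log_wealth += math.log(UP if rng.random() < 0.5 else DOWN)
    return log_wealth
```

With these multipliers the ensemble factor is above 1 while the time-average factor is below 1. An objective based on expected value therefore favors playing, even though a typical single trajectory shrinks over time.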

TerraFormer: Automated Infrastructure-as-Code with LLMs Fine-Tuned via Policy-Guided Verifier Feedback

TerraFormer：自动化基础设施即代码，LLM通过策略引导的验证反馈进行微调

Authors: Prithwish Jana, Sam Davidson, Bhavana Bhasker, Andrey Kan, Anoop Deoras, Laurent Callot
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.08734
Pdf link: https://arxiv.org/pdf/2601.08734
Abstract Automating Infrastructure-as-Code (IaC) is challenging, and large language models (LLMs) often produce incorrect configurations from natural language (NL). We present TerraFormer, a neuro-symbolic framework for IaC generation and mutation that combines supervised fine-tuning with verifier-guided reinforcement learning, using formal verification tools to provide feedback on syntax, deployability, and policy compliance. We curate two large, high-quality NL-to-IaC datasets, TF-Gen (152k instances) and TF-Mutn (52k instances), via multi-stage verification and iterative LLM self-correction. Evaluations against 17 state-of-the-art LLMs, including ~50x larger models like Sonnet 3.7, DeepSeek-R1, and GPT-4.1, show that TerraFormer improves correctness over its base LLM by 15.94% on IaC-Eval, 11.65% on TF-Gen (Test), and 19.60% on TF-Mutn (Test). It outperforms larger models on both TF-Gen (Test) and TF-Mutn (Test), ranks third on IaC-Eval, and achieves top best-practices and security compliance.
中文摘要 自动化基础设施即代码（IaC）具有挑战性，大型语言模型（LLMs）常常根据自然语言（NL）生成错误的配置。我们提出TerraFormer，一种用于IaC生成和变异的神经符号框架，它将监督微调与验证器引导的强化学习相结合，利用形式化验证工具提供关于语法、可部署性和策略合规性的反馈。我们通过多阶段验证和迭代式LLM自我修正，构建了两个大型、高质量的NL-to-IaC数据集：TF-Gen（152k实例）和TF-Mutn（52k实例）。与17个最先进的大型语言模型（包括Sonnet 3.7、DeepSeek-R1和GPT-4.1等规模约大50倍的模型）的对比评估表明，TerraFormer相对其基础LLM的正确率在IaC-Eval上提升了15.94%，在TF-Gen（Test）上提升了11.65%，在TF-Mutn（Test）上提升了19.60%。它在TF-Gen（Test）和TF-Mutn（Test）上均优于更大的模型，在IaC-Eval上排名第三，并在最佳实践和安全合规方面取得最优表现。

Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs

奖励稀有：独特性意识强化学习在大型语言模型中创造性解决问题

Authors: Zhiyuan Hu, Yucheng Wang, Yufei He, Jiaying Wu, Yilun Zhao, See-Kiong Ng, Cynthia Breazeal, Anh Tuan Luu, Hae Won Park, Bryan Hooi
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.08763
Pdf link: https://arxiv.org/pdf/2601.08763
Abstract Reinforcement learning (RL) has become a central paradigm for post-training large language models (LLMs), particularly for complex reasoning tasks, yet it often suffers from exploration collapse: policies prematurely concentrate on a small set of dominant reasoning patterns, improving pass@1 while limiting rollout-level diversity and gains in pass@k. We argue that this failure stems from regularizing local token behavior rather than diversity over sets of solutions. To address this, we propose Uniqueness-Aware Reinforcement Learning, a rollout-level objective that explicitly rewards correct solutions that exhibit rare high-level strategies. Our method uses an LLM-based judge to cluster rollouts for the same problem according to their high-level solution strategies, ignoring superficial variations, and reweights policy advantages inversely with cluster size. As a result, correct but novel strategies receive higher rewards than redundant ones. Across mathematics, physics, and medical reasoning benchmarks, our approach consistently improves pass@$k$ across large sampling budgets and increases the area under the pass@$k$ curve (AUC@$K$) without sacrificing pass@1, while sustaining exploration and uncovering more diverse solution strategies at scale.
中文摘要 强化学习（RL）已成为大型语言模型（LLMs）后训练的核心范式，尤其适用于复杂推理任务，但它常常出现探索坍缩的问题：策略过早集中于少数占主导地位的推理模式，提升了pass@1，却限制了rollout层面的多样性和pass@k的收益。我们认为，这种失败源于对局部token行为进行正则化，而非对解集的多样性进行正则化。为此，我们提出了独特性感知强化学习（Uniqueness-Aware Reinforcement Learning），这是一个rollout级目标，明确奖励展现罕见高层策略的正确解。我们的方法使用基于LLM的评判模型，根据高层求解策略对同一问题的rollout进行聚类，忽略表面差异，并按与聚类规模成反比的方式对策略优势重新加权。因此，正确且新颖的策略获得的奖励高于冗余的策略。在数学、物理和医学推理基准上，我们的方法在大采样预算下持续提升pass@$k$，并增大pass@$k$曲线下面积（AUC@$K$），同时不牺牲pass@1，还能在大规模下保持探索并发现更多样化的求解策略。
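
The reweighting step described in this abstract can be sketched as follows, assuming the cluster labels have already been produced by some judge. The abstract says only that advantages are reweighted inversely with cluster size. The exact `1 / size` form, the function name, and the example labels are assumptions, not the paper's definition.

```python
from collections import Counter


def reweight_by_cluster_size(advantages, cluster_ids):
    """Scale each rollout's advantage by 1 / (size of its strategy cluster).

    `cluster_ids[i]` is a hypothetical label for the high-level strategy
    of rollout i; in the paper these labels come from an LLM-based judge.
    """
    sizes = Counter(cluster_ids)
    return [a / sizes[c] for a, c in zip(advantages, cluster_ids)]


# Four correct rollouts with equal advantage: three share one strategy,
# one uses a rare strategy and therefore keeps its full advantage.
advantages = [1.0, 1.0, 1.0, 1.0]
cluster_ids = ["algebra", "algebra", "algebra", "geometry"]
reweighted = reweight_by_cluster_size(advantages, cluster_ids)
```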

Multiplex Thinking: Reasoning via Token-wise Branch-and-Merge

多工思维：通过令牌分支与合并进行推理

Authors: Yao Tang, Li Dong, Yaru Hao, Qingxiu Dong, Furu Wei, Jiatao Gu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.08808
Pdf link: https://arxiv.org/pdf/2601.08808
Abstract Large language models often solve complex reasoning tasks more effectively with Chain-of-Thought (CoT), but at the cost of long, low-bandwidth token sequences. Humans, by contrast, often reason softly by maintaining a distribution over plausible next steps. Motivated by this, we propose Multiplex Thinking, a stochastic soft reasoning mechanism that, at each thinking step, samples K candidate tokens and aggregates their embeddings into a single continuous multiplex token. This preserves the vocabulary embedding prior and the sampling dynamics of standard discrete generation, while inducing a tractable probability distribution over multiplex rollouts. Consequently, multiplex trajectories can be directly optimized with on-policy reinforcement learning (RL). Importantly, Multiplex Thinking is self-adaptive: when the model is confident, the multiplex token is nearly discrete and behaves like standard CoT; when it is uncertain, it compactly represents multiple plausible next steps without increasing sequence length. Across challenging math reasoning benchmarks, Multiplex Thinking consistently outperforms strong discrete CoT and RL baselines from Pass@1 through Pass@1024, while producing shorter sequences. The code and checkpoints are available at this https URL.
中文摘要 大型语言模型借助思维链（Chain-of-Thought，CoT）通常能更有效地解决复杂的推理任务，但代价是冗长且低带宽的token序列。相比之下，人类常常通过维持对合理下一步的分布来进行软推理。受此启发，我们提出了Multiplex Thinking（多工思维），这是一种随机软推理机制，在每个思考步骤中采样K个候选token，并将它们的嵌入聚合为单个连续的multiplex token。这保留了词表嵌入先验和标准离散生成的采样动态，同时在multiplex rollout上诱导出易于处理的概率分布。因此，multiplex轨迹可以直接用同策略（on-policy）强化学习（RL）进行优化。重要的是，Multiplex Thinking具有自适应性：当模型有把握时，multiplex token几乎是离散的，表现类似于标准CoT；当模型不确定时，它能紧凑地表示多个合理的下一步，而无需增加序列长度。在具有挑战性的数学推理基准上，Multiplex Thinking从Pass@1到Pass@1024始终优于强大的离散CoT和RL基线，同时生成更短的序列。代码和检查点可在该 https URL 访问。
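
The sample-and-merge step described in this abstract can be sketched in plain Python. The abstract states that K candidate tokens are sampled and their embeddings aggregated into one continuous token. The specific choices below, sampling with replacement and a simple mean, are assumptions for illustration, as are the function name and the toy embeddings.

```python
import random


def multiplex_token(probs, embeddings, k, rng):
    """Sample k token ids from `probs` and average their embeddings.

    Sampling is with replacement and the aggregate is an unweighted mean;
    both are assumed here and may differ from the paper's rule.
    """
    ids = rng.choices(range(len(probs)), weights=probs, k=k)
    dim = len(embeddings[0])
    merged = [sum(embeddings[i][d] for i in ids) / k for d in range(dim)]
    return ids, merged


rng = random.Random(0)
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # hypothetical vocabulary

# Confident model: all mass on token 0, so the merged vector is that embedding.
confident_ids, confident_vec = multiplex_token([1.0, 0.0, 0.0], toy_embeddings, 4, rng)

# Uncertain model: the merged vector blends several plausible tokens.
uncertain_ids, uncertain_vec = multiplex_token([0.4, 0.4, 0.2], toy_embeddings, 4, rng)
```

The confident case reproduces the behaviour the abstract calls "nearly discrete", and the sequence still advances by one position per step in both cases.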

Keyword: diffusion policy

Reverse Flow Matching: A Unified Framework for Online Reinforcement Learning with Diffusion and Flow Policies

逆流匹配：带有扩散和流策略的在线强化学习统一框架

Authors: Zeyang Li, Sunbochen Tang, Navid Azizan
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2601.08136
Pdf link: https://arxiv.org/pdf/2601.08136
Abstract Diffusion and flow policies are gaining prominence in online reinforcement learning (RL) due to their expressive power, yet training them efficiently remains a critical challenge. A fundamental difficulty in online RL is the lack of direct samples from the target distribution; instead, the target is an unnormalized Boltzmann distribution defined by the Q-function. To address this, two seemingly distinct families of methods have been proposed for diffusion policies: a noise-expectation family, which utilizes a weighted average of noise as the training target, and a gradient-expectation family, which employs a weighted average of Q-function gradients. Yet, it remains unclear how these objectives relate formally or if they can be synthesized into a more general formulation. In this paper, we propose a unified framework, reverse flow matching (RFM), which rigorously addresses the problem of training diffusion and flow models without direct target samples. By adopting a reverse inferential perspective, we formulate the training target as a posterior mean estimation problem given an intermediate noisy sample. Crucially, we introduce Langevin Stein operators to construct zero-mean control variates, deriving a general class of estimators that effectively reduce importance sampling variance. We show that existing noise-expectation and gradient-expectation methods are two specific instances within this broader class. This unified view yields two key advancements: it extends the capability of targeting Boltzmann distributions from diffusion to flow policies, and enables the principled combination of Q-value and Q-gradient information to derive an optimal, minimum-variance estimator, thereby improving training efficiency and stability. We instantiate RFM to train a flow policy in online RL, and demonstrate improved performance on continuous-control benchmarks compared to diffusion policy baselines.
中文摘要 扩散策略和流策略因其表达能力而在在线强化学习（RL）中日益受到重视，但如何高效训练它们仍是关键挑战。在线强化学习的一个根本难题是缺乏来自目标分布的直接样本；目标分布是由Q函数定义的未归一化玻尔兹曼分布。为此，针对扩散策略已提出两类看似不同的方法：噪声期望族，以噪声的加权平均作为训练目标；以及梯度期望族，采用Q函数梯度的加权平均。然而，这些目标在形式上如何关联，或是否能综合成更一般的表述，仍不清楚。本文提出了一个统一框架，即逆流匹配（RFM），严格地解决了在没有直接目标样本的情况下训练扩散模型和流模型的问题。通过采用反向推断视角，我们将训练目标表述为给定中间噪声样本时的后验均值估计问题。关键的是，我们引入Langevin Stein算子来构造零均值控制变量，推导出一类能有效降低重要性采样方差的一般估计量。我们表明，现有的噪声期望方法和梯度期望方法是这一更广泛类别中的两个特例。这一统一视角带来了两项关键进展：它将以玻尔兹曼分布为目标的能力从扩散策略扩展到流策略，并使Q值和Q梯度信息得以有原则地结合，从而推导出最优的最小方差估计量，提升训练效率和稳定性。我们将RFM实例化，在在线强化学习中训练流策略，并在连续控制基准上展示了优于扩散策略基线的性能。