Arxiv Papers of Today

Generated: 2026-02-12 16:53:04 (UTC+8); arXiv release time: 2026-02-12 20:00 EST (2026-02-13 09:00 UTC+8)

47 relevant papers today

Keyword: reinforcement learning

Multimodal Information Fusion for Chart Understanding: A Survey of MLLMs -- Evolution, Limitations, and Cognitive Enhancement

多模态信息融合用于图表理解：MLLM综述——演变、局限性与认知增强

Authors: Zhihang Yi, Jian Zhao, Jiancheng Lv, Tao Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.10138
Pdf link: https://arxiv.org/pdf/2602.10138
Abstract Chart understanding is a quintessential information fusion task, requiring the seamless integration of graphical and textual data to extract meaning. The advent of Multimodal Large Language Models (MLLMs) has revolutionized this domain, yet the landscape of MLLM-based chart analysis remains fragmented and lacks systematic organization. This survey provides a comprehensive roadmap of this nascent frontier by structuring the domain's core components. We begin by analyzing the fundamental challenges of fusing visual and linguistic information in charts. We then categorize downstream tasks and datasets, introducing a novel taxonomy of canonical and non-canonical benchmarks to highlight the field's expanding scope. Subsequently, we present a comprehensive evolution of methodologies, tracing the progression from classic deep learning techniques to state-of-the-art MLLM paradigms that leverage sophisticated fusion strategies. By critically examining the limitations of current models, particularly their perceptual and reasoning deficits, we identify promising future directions, including advanced alignment techniques and reinforcement learning for cognitive enhancement. This survey aims to equip researchers and practitioners with a structured understanding of how MLLMs are transforming chart information fusion and to catalyze progress toward more robust and reliable systems.
中文摘要 图表理解是一项典型的信息融合任务，需要图形和文本数据无缝融合以提取意义。多模态大型语言模型（MLLM）的出现彻底改变了这一领域，然而基于MLLM的图表分析领域仍然支离破碎，缺乏系统的组织。本调查通过构建该领域的核心组成部分，提供了这一新兴前沿的全面路线图。我们首先分析将视觉信息与语言信息融合在图表中的根本挑战。随后，我们对下游任务和数据集进行了分类，引入了规范与非规范基准的新分类法，以突出该领域的扩展范围。随后，我们全面介绍了方法论的演进，追溯从经典深度学习技术到利用复杂融合策略的先进MLLM范式的发展历程。通过批判性地审视当前模型的局限性，特别是其感知和推理缺陷，我们识别出有前景的未来方向，包括先进的对齐技术和用于认知增强的强化学习。本调查旨在使研究人员和从业者能够有结构地理解MLLM如何改变图表信息融合，并推动系统走向更稳健可靠的发展。

Internalizing Meta-Experience into Memory for Guided Reinforcement Learning in Large Language Models

将元经验内化为记忆，用于大型语言模型中的引导强化学习

Authors: Shiting Huang, Zecheng Li, Yu Zeng, Qingnan Ren, Zhen Fang, Qisheng Su, Kou Shi, Lin Chen, Zehui Chen, Feng Zhao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.10224
Pdf link: https://arxiv.org/pdf/2602.10224
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach for enhancing the reasoning capabilities of Large Language Models (LLMs). Despite its efficacy, RLVR faces a meta-learning bottleneck: it lacks mechanisms for error attribution and experience internalization intrinsic to the human learning cycle beyond practice and verification, thereby limiting fine-grained credit assignment and reusable knowledge formation. We term such reusable knowledge representations derived from past errors as meta-experience. Based on this insight, we propose Meta-Experience Learning (MEL), a novel framework that incorporates self-distilled meta-experience into the model's parametric memory. Building upon standard RLVR, we introduce an additional design that leverages the LLM's self-verification capability to conduct contrastive analysis on paired correct and incorrect trajectories, identify the precise bifurcation points where reasoning errors arise, and summarize them into generalizable meta-experience. The meta-experience is further internalized into the LLM's parametric memory by minimizing the negative log-likelihood, which induces a language-modeled reward signal that bridges correct and incorrect reasoning trajectories and facilitates effective knowledge reuse. Experimental results demonstrate that MEL achieves consistent improvements on benchmarks, yielding 3.92% to 4.73% Pass@1 gains across varying model sizes.
中文摘要 带有可验证奖励的强化学习（RLVR）已成为提升大型语言模型（LLMs）推理能力的有效方法。尽管有效，RLVR仍面临元学习瓶颈：缺乏超越实践和验证的人类学习循环内在的错误归因和经验内化机制，从而限制了细粒度的信用分配和可重复使用的知识形成。我们称这些源自过去错误的可复用知识表示为元经验。基于这一见解，我们提出了元经验学习（MEL）这一新框架，将自我提炼的元经验整合进模型的参数记忆中。基于标准RLVR，我们引入了一种额外设计，利用LLM的自我验证能力，对配对正确与错误轨迹进行对比分析，识别推理错误出现的精确分岔点，并将其总结为可推广的元经验。通过最小化负对数似然，元经验进一步内化到LLM的参数记忆中，从而诱导出语言建模的奖励信号，桥接正确与错误的推理轨迹，促进知识的有效再利用。实验结果表明，MEL在基准测试上取得了持续的改进，在不同模型规模下Pass@1提升3.92%至4.73%。

Learning to Evict from Key-Value Cache

学习从键值缓存中驱逐

Authors: Luca Moschella, Laura Manduchi, Ozan Sener
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.10238
Pdf link: https://arxiv.org/pdf/2602.10238
Abstract The growing size of Large Language Models (LLMs) makes efficient inference challenging, primarily due to the memory demands of the autoregressive Key-Value (KV) cache. Existing eviction or compression methods reduce cost but rely on heuristics, such as recency or past attention scores, which serve only as indirect proxies for a token's future utility and introduce computational overhead. We reframe KV cache eviction as a reinforcement learning (RL) problem: learning to rank tokens by their predicted usefulness for future decoding. To this end, we introduce KV Policy (KVP), a framework of lightweight per-head RL agents trained on pre-computed generation traces using only key and value vectors. Each agent learns a specialized eviction policy guided by future utility, which evaluates the quality of the ranking across all cache budgets, requiring no modifications to the underlying LLM or additional inference. Evaluated across two different model families on the long-context benchmark RULER and the multi-turn dialogue benchmark OASST2-4k, KVP significantly outperforms baselines. Furthermore, zero-shot tests on standard downstream tasks (e.g., LongBench, BOOLQ, ARC) indicate that KVP generalizes well beyond its training distribution and to longer context lengths. These results demonstrate that learning to predict future token utility is a powerful and scalable paradigm for adaptive KV cache management.
中文摘要 大型语言模型（LLM）规模的不断扩大使得高效推理变得具有挑战性，主要原因是自回归键值（KV）缓存对内存的需求。现有的驱逐或压缩方法降低成本，但依赖启发式方法，如近期性或过去注意力评分，这些方法仅作为代币未来效用的间接代理，并增加计算开销。我们将KV缓存驱逐重新框架为强化学习（RL）问题：学习根据代币对未来解码的预测价值进行排序。为此，我们引入了KV策略（KVP），这是一个基于预先计算的生成轨迹仅使用键和值向量训练的轻量级每人强化学习代理框架。每个代理学习一套由未来效用指导的专业驱逐策略，评估所有缓存预算的排名质量，无需修改底层大型语言模型或额外推断。在长上下文基准RULER和多回合对话基准OASST2-4k上，KVP在两个不同模型家族中评估，显著优于基线。此外，对标准下游任务（如LongBench、BOOLQ、ARC）的零样本测试表明，KVP的推广远超其训练分布，且覆盖更长上下文长度。这些结果表明，学习预测未来令牌效用是自适应KV缓存管理的强大且可扩展的范式。
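
The ranking formulation described in this abstract can be illustrated with a minimal sketch. This is not KVP itself: the per-token utility scores here are hypothetical inputs (in the paper they would come from the learned per-head agents), and the function name `evict_to_budget` is an assumption made for illustration. It only shows how a single ranking can serve any cache budget.

```python
def evict_to_budget(scores, budget):
    """Return the sorted positions of cache entries to keep.

    scores: hypothetical per-token utility estimates (higher = more useful).
    budget: maximum number of cache entries to retain.
    """
    if budget <= 0:
        return []
    # Rank token positions by predicted utility, highest first.
    # sorted() is stable, so ties keep their original (earlier-first) order.
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    # Keep the top `budget` positions, restored to sequence order.
    return sorted(ranked[:budget])


# Illustrative scores only; any budget reuses the same ranking.
keep = evict_to_budget([0.1, 0.9, 0.4, 0.7, 0.2], budget=3)
print(keep)
```

Because the ranking does not depend on the budget, tightening or loosening the budget only changes how far down the ranked list the cut falls.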

The Role of Learning in Attacking Intrusion Detection Systems

学习在攻击入侵检测系统中的作用

Authors: Kyle Domico, Jean-Charles Noirot Ferrand, Patrick McDaniel
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2602.10299
Pdf link: https://arxiv.org/pdf/2602.10299
Abstract Recent work on network attacks has demonstrated that ML-based network intrusion detection systems (NIDS) can be evaded with adversarial perturbations. However, these attacks rely on complex optimizations that have large computational overheads, making them impractical in many real-world settings. In this paper, we introduce a lightweight adversarial agent that implements strategies (policies) trained via reinforcement learning (RL) that learn to evade ML-based NIDS without requiring online optimization. This attack proceeds by (1) offline training, where the agent learns to evade a surrogate ML model by perturbing malicious flows using network traffic data assumed to be collected via reconnaissance, then (2) deployment, where the trained agent is used in a compromised device controlled by an attacker to evade ML-based NIDS using learned attack strategies. We evaluate our approach across diverse NIDS and several white-, gray-, and black-box threat models. We demonstrate that attacks using these lightweight agents can be highly effective (reaching up to 48.9% attack success rate), extremely fast (requiring as little as 5.72ms to craft an attack), and require negligible resources (e.g., 0.52MB of memory). Through this work, we demonstrate that future botnets driven by lightweight learning-based agents can be highly effective and widely deployable in diverse environments of compromised devices.
中文摘要 近期关于网络攻击的研究表明，基于机器学习的网络入侵检测系统（NIDS）可以通过对抗性扰动来规避。然而，这些攻击依赖于复杂的优化，且计算开销巨大，在许多现实环境中并不实用。本文介绍了一种轻量级对抗代理，它通过强化学习（RL）训练策略（策略），能够在无需在线优化的情况下学会规避基于机器学习的NIDS。该攻击通过（1）离线训练进行，代理通过利用假设通过侦察收集的网络流量数据扰动恶意流来规避代理机器学习模型，然后（2）部署，训练有素的代理被攻击者控制的设备中利用学习的攻击策略规避基于机器学习的NIDS。我们评估了多种NIDs以及多个白箱、灰箱和黑箱威胁模型的方法。我们证明，使用这些轻量级代理的攻击可以非常有效（攻击成功率可达48.9%），速度极快（设计攻击只需5.72毫秒），且资源需求极低（例如0.52MB内存）。通过这项工作，我们证明了未来由轻量级基于学习代理驱动的僵尸网络可以在受攻破设备多样环境中高效且广泛部署。

Confounding Robust Continuous Control via Automatic Reward Shaping

通过自动奖励塑造混淆稳健连续控制

Authors: Mateo Juliani, Mingxuan Li, Elias Bareinboim
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.10305
Pdf link: https://arxiv.org/pdf/2602.10305
Abstract Reward shaping has been applied widely to accelerate Reinforcement Learning (RL) agents' training. However, a principled way of designing effective reward shaping functions, especially for complex continuous control problems, remains largely under-explored. In this work, we propose to automatically learn a reward shaping function for continuous control problems from offline datasets, potentially contaminated by unobserved confounding variables. Specifically, our method builds upon the recently proposed causal Bellman equation to learn a tight upper bound on the optimal state values, which is then used as the potentials in the Potential-Based Reward Shaping (PBRS) framework. Our proposed reward shaping algorithm is tested with Soft-Actor-Critic (SAC) on multiple commonly used continuous control benchmarks and exhibits strong performance guarantees under unobserved confounders. More broadly, our work marks a solid first step towards confounding robust continuous control from a causal perspective. Code for training our reward shaping functions can be found at this https URL.
中文摘要 奖励塑造已被广泛应用于加速强化学习（RL）代理的训练。然而，设计有效奖励塑造函数的原则性方法，尤其是针对复杂的连续控制问题，仍然大多缺乏解释。本研究提出从离线数据集中自动学习一个奖励塑形函数，用于持续控制问题，这些数据集可能被未观察到的混杂变量污染。具体来说，我们的方法基于最近提出的因果贝尔曼方程，学习最优状态值的紧密上界，并将其用作基于势能的奖励塑形（PBRS）框架中的势能。我们提出的奖励整形算法在多个常用连续控制基准测试上采用软行为者-批判者（SAC）测试，在未观察到的混杂因素下表现出强的性能保证。更广泛地说，我们的工作标志着从因果角度混淆强健连续控制迈出了坚实的第一步。用于训练我们的奖励塑造功能的代码可以在这个 https 网址找到。
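
For readers unfamiliar with PBRS, the standard potential-based shaping term is F(s, s') = γΦ(s') − Φ(s), added to the environment reward. The sketch below shows only that generic formula, not the paper's method: the potentials are passed in as plain numbers (in the paper they would be the learned upper bound on optimal state values), and the function name, the default discount, and the zero-potential convention at terminal states are assumptions made for illustration.

```python
def shaped_reward(reward, phi_s, phi_next, gamma=0.99, done=False):
    """Environment reward plus the potential-based shaping term.

    phi_s, phi_next: hypothetical potential values for s and s'.
    done: if True, the next state is terminal and its potential is taken as 0
          (a common convention, assumed here).
    """
    next_potential = 0.0 if done else phi_next
    # F(s, s') = gamma * Phi(s') - Phi(s)
    return reward + gamma * next_potential - phi_s


# Illustrative numbers only.
print(shaped_reward(1.0, phi_s=2.0, phi_next=3.0, gamma=0.5))
```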

Are More Tokens Rational? Inference-Time Scaling in Language Models as Adaptive Resource Rationality

更多的代币合理吗？作为自适应资源理性，语言模型中的推理时间尺度

Authors: Zhimin Hu, Riya Roshan, Sashank Varma
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.10329
Pdf link: https://arxiv.org/pdf/2602.10329
Abstract Human reasoning is shaped by resource rationality -- optimizing performance under constraints. Recently, inference-time scaling has emerged as a powerful paradigm to improve the reasoning performance of Large Language Models by expanding test-time computation. Specifically, instruction-tuned (IT) models explicitly generate long reasoning steps during inference, whereas Large Reasoning Models (LRMs) are trained by reinforcement learning to discover reasoning paths that maximize accuracy. However, it remains unclear whether resource-rationality can emerge from such scaling without explicit reward related to computational costs. We introduce a Variable Attribution Task in which models infer which variables determine outcomes given candidate variables, input-output trials, and predefined logical functions. By varying the number of candidate variables and trials, we systematically manipulate task complexity. Both models exhibit a transition from brute-force to analytic strategies as complexity increases. IT models degrade on XOR and XNOR functions, whereas LRMs remain robust. These findings suggest that models can adjust their reasoning behavior in response to task complexity, even without explicit cost-based reward. It provides compelling evidence that resource rationality is an emergent property of inference-time scaling itself.
中文摘要 人类推理受资源理性影响——在约束条件下优化性能。近年来，推理时间尺度已成为一种强大的范式，通过扩展测试时间计算，提升大型语言模型的推理性能。具体来说，指令调优（IT）模型在推理过程中显式生成较长的推理步骤，而大型推理模型（LRM）则通过强化学习训练，以发现最大化准确率的推理路径。然而，目前尚不清楚资源理性是否能在没有计算成本相关明确奖励的情况下从这种扩展中产生。我们引入变量归因任务，模型推断哪些变量在候选变量、输入输出试验和预定义逻辑函数下决定结果。通过改变候选变量和试验的数量，我们系统地控任务复杂度。随着复杂度的增加，这两个模型都表现出从暴力破解向分析策略的转变。IT模型在异或和XNOR函数上会退化，而长程模型则保持稳健。这些发现表明，即使没有明确的成本奖励，模型也能根据任务复杂性调整推理行为。它提供了有力证据，表明资源合理性本身是推理时间尺度的一种涌现属性。

Efficient Policy Adaptation for Voltage Control Under Unknown Topology Changes

在未知拓扑变化下，电压控制的高效策略适配

Authors: Jie Feng, Yuanyuan Shi, Deepjyoti Deka
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2602.10355
Pdf link: https://arxiv.org/pdf/2602.10355
Abstract Reinforcement learning (RL) has shown great potential for designing voltage control policies, but their performance often degrades under changing system conditions such as topology reconfigurations and load variations. We introduce a topology-aware online policy optimization framework that leverages data-driven estimation of voltage-reactive power sensitivities to achieve efficient policy adaptation. Exploiting the sparsity of topology-switching events, where only a few lines change at a time, our method efficiently detects topology changes and identifies the affected lines and parameters, enabling fast and accurate sensitivity updates without recomputing the full sensitivity matrix. The estimated sensitivity is subsequently used for online policy optimization of a pre-trained neural-network-based RL controller. Simulations on both the IEEE 13-bus and SCE 56-bus systems demonstrate over 90 percent line identification accuracy, using only 15 data points. The proposed method also significantly improves voltage regulation performance compared with non-adaptive policies and adaptive policies that rely on regression-based online optimization methods for sensitivity estimation.
中文摘要 增强学习（RL）在设计电压控制策略方面展现出巨大潜力，但其性能常在系统条件变化（如拓扑重构和负载变化）下下降。我们引入了一个拓扑感知的在线策略优化框架，利用基于数据的电压无功功率敏感性估计，实现高效的策略适配。利用拓扑切换事件的稀疏性，即每次只有少数线发生变化，我们的方法高效检测拓扑变化并识别受影响的线和参数，实现快速且准确的灵敏度更新，而无需重新计算完整的灵敏度矩阵。估计的灵敏度随后被用于预训练的基于神经网络的强化学习控制器的在线策略优化。在IEEE 13-总线和SCE 56-总线系统上的仿真显示出超过90%的线路识别准确率，仅使用15个数据点。该方法还显著提升了电压调节性能，相较于依赖回归的在线优化方法的自适应策略。

Autonomous Continual Learning of Computer-Use Agents for Environment Adaptation

环境适应中计算机使用代理的自主持续学习

Authors: Tianci Xue, Zeyi Liao, Tianneng Shi, Zilu Wang, Kai Zhang, Dawn Song, Yu Su, Huan Sun
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.10356
Pdf link: https://arxiv.org/pdf/2602.10356
Abstract Real-world digital environments are highly diverse and dynamic. These characteristics cause agents to frequently encounter unseen scenarios and distribution shifts, making continual learning in specific environments essential for computer-use agents (CUAs). However, a key challenge lies in obtaining high-quality and environment-grounded agent data without relying on costly human annotation. In this work, we introduce ACuRL, an Autonomous Curriculum Reinforcement Learning framework that continually adapts agents to specific environments with zero human data. The agent first explores target environments to acquire initial experiences. During subsequent iterative training, a curriculum task generator leverages these experiences together with feedback from the previous iteration to synthesize new tasks tailored for the agent's current capabilities. To provide reliable reward signals, we introduce CUAJudge, a robust automatic evaluator for CUAs that achieves 93% agreement with human judgments. Empirically, our method effectively enables both intra-environment and cross-environment continual learning, yielding 4-22% performance gains without catastrophic forgetting on existing environments. Further analyses show highly sparse updates (e.g., 20% parameters), which helps explain the effective and robust adaptation. Our data and code are available at this https URL.
中文摘要 现实世界的数字环境极为多样且动态。这些特性导致代理经常遇到未见的场景和分布转移，因此在特定环境中持续学习对计算机使用代理（CUA）至关重要。然而，一个关键挑战在于如何在不依赖昂贵的人工注释的情况下，获得高质量且贴近环境的代理数据。在本研究中，我们介绍了ACuRL，一种自主课程强化学习框架，能够在零人类数据下不断调整代理到特定环境。智能体首先探索目标环境以获得初始体验。在后续迭代培训中，课程任务生成器会结合这些经验和前一次迭代的反馈，综合出针对代理当前能力量身定制的新任务。为了提供可靠的奖励信号，我们引入了CUAJudge，一款稳健的CUA自动评估器，与人类判断的一致性达到93%。从实证角度看，我们的方法有效实现了环境内和跨环境的持续学习，性能提升4%至22%，且不会在现有环境中出现灾难性遗忘。进一步分析显示，更新极为稀疏（例如参数仅20%），这有助于解释其有效且稳健的适应。我们的数据和代码可在该 https URL 访问。

Breaking the Curse of Repulsion: Optimistic Distributionally Robust Policy Optimization for Off-Policy Generative Recommendation

打破排斥的诅咒：乐观分布稳健的策略优化，用于非策略生成推荐

Authors: Jie Jiang, Yusen Huo, Xiangxin Zhan, Changping Wang, Jun Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.10430
Pdf link: https://arxiv.org/pdf/2602.10430
Abstract Policy-based Reinforcement Learning (RL) has established itself as the dominant paradigm in generative recommendation for optimizing sequential user interactions. However, when applied to offline historical logs, these methods suffer a critical failure: the dominance of low-quality data induces severe model collapse. We first establish the Divergence Theory of Repulsive Optimization, revealing that negative gradient updates inherently trigger exponential intensity explosion during off-policy training. This theory elucidates the inherent dilemma of existing methods, exposing their inability to reconcile variance reduction and noise imitation. To break this curse, we argue that the solution lies in rigorously identifying the latent high-quality distribution entangled within the noisy behavior policy. Accordingly, we reformulate the objective as an Optimistic Distributionally Robust Optimization (DRO) problem. Guided by this formulation, we propose Distributionally Robust Policy Optimization (DRPO). We prove that hard filtering is the exact solution to this DRO objective, enabling DRPO to optimally recover high-quality behaviors while strictly discarding divergence-inducing noise. Extensive experiments demonstrate that DRPO achieves state-of-the-art performance on mixed-quality recommendation benchmarks.
中文摘要 基于策略的强化学习（RL）已成为生成式推荐中优化顺序用户交互的主导范式。然而，当应用于离线历史日志时，这些方法出现了致命缺陷：低质量数据的主导导致了严重的模型崩溃。我们首先建立了排斥优化的发散理论，揭示了负梯度更新本质上会在非策略训练中触发指数强度爆炸。该理论阐明了现有方法固有的困境，揭示了它们无法调和方差减少和噪声模仿。为了打破这一困扰，我们认为解决方案在于严格识别纠缠在噪声行为策略中的潜在高质量分布。因此，我们将目标重新表述为乐观分布鲁棒优化（DRO）问题。基于这一表述，我们提出了分布稳健策略优化（DRPO）。我们证明硬滤波正是实现该 DRO 目标的正解，使 DRPO 能够在严格剔除发散噪声的同时，最优地恢复高质量行为。大量实验表明，DRPO在混合质量推荐基准测试中达到了最先进的性能。

Control Reinforcement Learning: Token-Level Mechanistic Analysis via Learned SAE Feature Steering

控制强化学习：通过学习SAE特征引导进行令牌级机制分析

Authors: Seonglae Cho, Zekun Wu, Adriano Koshiyama
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.10437
Pdf link: https://arxiv.org/pdf/2602.10437
Abstract Sparse autoencoders (SAEs) decompose language model activations into interpretable features, but existing methods reveal only which features activate, not which change model outputs when amplified. We introduce Control Reinforcement Learning (CRL), which trains a policy to select SAE features for steering at each token, producing interpretable intervention logs: the learned policy identifies features that change model outputs when amplified. Adaptive Feature Masking encourages diverse feature discovery while preserving single-feature interpretability. The framework yields new analysis capabilities: branch point tracking locates tokens where feature choice determines output correctness; critic trajectory analysis separates policy limitations from value estimation errors; layer-wise comparison reveals syntactic features in early layers and semantic features in later layers. On Gemma-2 2B across MMLU, BBQ, GSM8K, HarmBench, and XSTest, CRL achieves improvements while providing per-token intervention logs. These results establish learned feature steering as a mechanistic interpretability tool that complements static feature analysis with dynamic intervention probes.
中文摘要 稀疏自编码器（SAE）将语言模型激活分解为可解释的特征，但现有方法只揭示哪些特征被激活，而不会显示哪些特征在放大后改变模型输出。我们引入了控制强化学习（CRL），它训练策略选择SAE特征以引导每个符号，生成可解释的干预日志：学习到的策略识别放大后改变模型输出的特征。自适应特征掩盖鼓励多样化特征的发现，同时保持单一特征的可解释性。该框架带来了新的分析能力：分支点追踪定位特征选择决定输出正确性的标记;批评者轨迹分析将政策局限性与价值估计错误区分开;逐层比较揭示了早期层的句法特征和后期层的语义特征。在Gemma-2 2B平台上，跨越MMLU、BBQ、GSM8K、HarmBench和XSTest，CRL在提供每个代币干预日志的同时实现了改进。这些结果确立了习得特征引导作为一种机制性可解释工具，能够用动态干预探针补充静态特征分析

AudioRouter: Data Efficient Audio Understanding via RL based Dual Reasoning

AudioRouter：通过基于强化学习的双重推理实现数据高效音频理解

Authors: Liyang Chen, Hongkai Chen, Yujun Cai, Sifan Li, Qingwen Ye, Yiwei Wang
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2602.10439
Pdf link: https://arxiv.org/pdf/2602.10439
Abstract Large Audio Language Models (LALMs) have demonstrated strong capabilities in audio understanding and reasoning. However, their performance on fine grained auditory perception remains unreliable, and existing approaches largely rely on data intensive training to internalize perceptual abilities. We propose AudioRouter, a reinforcement learning framework that enables LALMs to improve audio understanding by learning when and how to use external audio tools. Rather than tightly coupling tool usage with audio reasoning, AudioRouter formulates tool use as an explicit decision making problem and optimizes a lightweight routing policy while keeping the underlying reasoning model frozen. Experimental results show that AudioRouter achieves substantial improvements on standard audio understanding benchmarks while requiring up to 600x less training data to learn tool usage compared with conventional training paradigms. These findings suggest that learning effective tool usage offers a data efficient and scalable alternative to internalizing perceptual abilities in LALMs.
中文摘要 大型音频语言模型（LALMs）在音频理解和推理方面展现出强大的能力。然而，它们在细粒度听觉感知上的表现仍然不可靠，现有方法主要依赖数据密集型训练来内化感知能力。我们提出了AudioRouter，一个强化学习框架，使LALM通过学习何时以及如何使用外部音频工具来提升音频理解。AudioRouter 不将工具使用与音频推理紧密耦合，而是将工具使用定义为显式的决策问题，优化轻量级路由策略，同时保持底层推理模型冻结。实验结果显示，AudioRouter 在标准音频理解基准测试基础上实现了显著提升，同时学习工具使用所需的训练数据比传统训练范式少了多达 600 倍。这些发现表明，学习有效的工具使用为LALM中内化感知能力提供了一种数据高效且可扩展的替代方案。

Found-RL: foundation model-enhanced reinforcement learning for autonomous driving

Found-RL：基于基础模型增强的自动驾驶强化学习

Authors: Yansong Qu, Zihao Sheng, Zilin Huang, Jiancong Chen, Yuhao Luo, Tianyi Wang, Yiheng Feng, Samuel Labi, Sikai Chen
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.10458
Pdf link: https://arxiv.org/pdf/2602.10458
Abstract Reinforcement Learning (RL) has emerged as a dominant paradigm for end-to-end autonomous driving (AD). However, RL suffers from sample inefficiency and a lack of semantic interpretability in complex scenarios. Foundation Models, particularly Vision-Language Models (VLMs), can mitigate this by offering rich, context-aware knowledge, yet their high inference latency hinders deployment in high-frequency RL training loops. To bridge this gap, we present Found-RL, a platform tailored to efficiently enhance RL for AD using foundation models. A core innovation is the asynchronous batch inference framework, which decouples heavy VLM reasoning from the simulation loop, effectively resolving latency bottlenecks to support real-time learning. We introduce diverse supervision mechanisms: Value-Margin Regularization (VMR) and Advantage-Weighted Action Guidance (AWAG) to effectively distill expert-like VLM action suggestions into the RL policy. Additionally, we adopt high-throughput CLIP for dense reward shaping. We address CLIP's dynamic blindness via Conditional Contrastive Action Alignment, which conditions prompts on discretized speed/command and yields a normalized, margin-based bonus from context-specific action-anchor scoring. Found-RL provides an end-to-end pipeline for fine-tuned VLM integration and shows that a lightweight RL model can achieve near-VLM performance compared with billion-parameter VLMs while sustaining real-time inference (approx. 500 FPS). Code, data, and models will be publicly available at this https URL.
中文摘要 强化学习（RL）已成为端到端自动驾驶（AD）的主导范式。然而，强化学习在复杂场景中存在样本效率低和语义解释性不足的问题。基础模型，尤其是视觉语言模型（VLM），可以通过提供丰富的上下文感知知识来缓解这一问题，但其高推理延迟阻碍了在高频强化学习训练循环中的应用。为弥合这一差距，我们推出了Found-RL，一个基于基础模型高效增强面向AD强化学习的平台。一项核心创新是异步批处理推理框架，它将繁重的VLM推理与仿真循环解耦，有效解决延迟瓶颈，支持实时学习。我们引入了多种监督机制：价值边际正则化（VMR）和优势加权行动指导（AWAG），有效将专家级VLM行动建议提炼进强化学习策略。此外，我们还采用高通量CLIP来实现高密度奖励塑造。我们通过条件对比动作对齐解决CLIP的动态盲点问题，该方法对提示进行离散化的速度/指令，并通过上下文特定动作锚点评分获得归一化、基于边距的加值。Found-RL 提供了端到端的细致 VLM 集成流水线，展示了轻量化 RL 模型相较于十亿参数 VLM 能够实现近乎 VLM 的性能，同时保持实时推理（约 500 FPS）。代码、数据和模型将在此 https URL 公开。

Towards Long-Lived Robots: Continual Learning VLA Models via Reinforcement Fine-Tuning

迈向长寿命机器人：通过强化微调持续学习VLA模型

Authors: Yuan Liu, Haoran Li, Shuai Tian, Yuxing Qin, Yuhui Chen, Yupeng Zheng, Yongzhen Huang, Dongbin Zhao
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.10503
Pdf link: https://arxiv.org/pdf/2602.10503
Abstract Pretrained on large-scale and diverse datasets, VLA models demonstrate strong generalization and adaptability as general-purpose robotic policies. However, Supervised Fine-Tuning (SFT), which serves as the primary mechanism for adapting VLAs to downstream domains, requires substantial amounts of task-specific data and is prone to catastrophic forgetting. To address these limitations, we propose LifeLong-RFT, a simple yet effective Reinforcement Fine-Tuning (RFT) strategy for VLA models independent of online environmental feedback and pre-trained reward models. By integrating chunking-level on-policy reinforcement learning with the proposed Multi-Dimensional Process Reward (MDPR) mechanism, LifeLong-RFT quantifies the heterogeneous contributions of intermediate action chunks across three dimensions to facilitate policy optimization. Specifically, (1) the Quantized Action Consistency Reward (QACR) ensures accurate action prediction within the discrete action space; (2) the Continuous Trajectory Alignment Reward (CTAR) aligns decoded continuous action chunks with reference trajectories to ensure precise control; (3) the Format Compliance Reward (FCR) guarantees the structural validity of outputs. Comprehensive experiments across SimplerEnv, LIBERO, and real-world tasks demonstrate that LifeLong-RFT exhibits strong performance in multi-task learning. Furthermore, for continual learning on the LIBERO benchmark, our method achieves a 22% gain in average success rate over SFT, while effectively adapting to new tasks using only 20% of the training data. Overall, our method provides a promising post-training paradigm for VLAs.
中文摘要 VLA模型在大规模且多样化的数据集上预训练，展现出作为通用机器人策略的强强泛化性和适应性。然而，监督式微调（SFT）作为将VLA适应到下游域的主要机制，需要大量任务特定的数据，且容易发生灾难性遗忘。为解决这些局限性，我们提出了LifeLong-RFT，这是一种简单但有效的强化微调（RFT）策略，适用于VLA模型，独立于在线环境反馈和预训练奖励模型。通过将分块级策略强化学习与提出的多维过程奖励（MDPR）机制整合，LifeLong-RFT 量化了中间动作块在三维空间的异构贡献，以促进策略优化。具体来说，（1）量化动作一致性奖励（QACR）确保在离散动作空间内的动作准确预测;（2）连续轨迹对齐奖励（CTAR）将解码后的连续动作块与参考轨迹对齐，以确保精确控制;（3）格式合规奖励（FCR）保证输出的结构有效性。在SimplerEnv、LIBERO及实际任务中的综合实验表明，LifeLong-RFT在多任务学习中表现出优异的性能。此外，在LIBERO基准的持续学习中，我们的方法平均成功率比SFT提升22%，同时仅用20%的训练数据有效适应新任务。总体而言，我们的方法为VLA提供了有前景的训练后范式。

Prioritize the Process, Not Just the Outcome: Rewarding Latent Thought Trajectories Improves Reasoning in Looped Language Models

优先考虑过程，而不仅仅是结果：奖励潜在的思维轨迹能提升循环语言模型中的推理能力

Authors: Williams Jonathan, Tureci Esin
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.10520
Pdf link: https://arxiv.org/pdf/2602.10520
Abstract Looped Language Models (LoopLMs) perform multi-step latent reasoning prior to token generation and outperform conventional LLMs on reasoning benchmarks at smaller parameter budgets. However, attempts to further improve LoopLM reasoning with reinforcement learning have failed - standard objectives such as Group Relative Policy Optimization (GRPO) only assign credit to the final latent state, creating a fundamental mismatch with the model's internal computation. To resolve this, we introduce RLTT (Reward Latent Thought Trajectories), a reinforcement learning framework which distributes reward across the full latent reasoning trajectory. RLTT provides dense, trajectory-level credit assignment without relying on external verifiers and can directly replace GRPO with negligible overhead. Across extensive experiments with Ouro-2.6B-Thinking under identical training and inference conditions, RLTT yields substantial improvements over GRPO on challenging mathematical reasoning benchmarks, improving accuracy by +14.4% on MATH-500, +16.6% on AIME24, and +10.0% on BeyondAIME. Despite being trained exclusively on mathematics, RLTT also transfers effectively to non-mathematical reasoning benchmarks, demonstrating the effectiveness of trajectory-level credit assignment for reinforcement learning in LoopLMs.
中文摘要 循环语言模型（LoopLMs）在代币生成前进行多步潜在推理，并在更小参数预算下优于传统大型语言模型的推理基准测试。然而，通过强化学习进一步改进LoopLM推理的尝试失败了——标准目标如群相对策略优化（GRPO）仅将功劳归功于最终潜态，导致与模型内部计算存在根本性不匹配。为解决这个问题，我们引入了RLTT（潜在思维奖励轨迹），这是一种强化学习框架，将奖励分布在整个潜在推理轨迹中。RLTT提供密集的轨迹级信用分配，无需依赖外部验证者，且可直接以极低的开销替代GRPO。在相同训练和推理条件下，RLTT在严格的Ouro-2.6B思维实验中，在具有挑战性的数学推理基准测试上相比GRPO取得了显著提升，MATH-500的准确率提升了+14.4%，AIME24的准确率提升了+16.6%，BeyondAIME的准确率提升了+10.0%。尽管RLTT仅专注于数学训练，但它也能有效地转移至非数学推理基准，展示了轨迹级学分分配在LoopLM强化学习中的有效性。
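
As background for the GRPO baseline named in this abstract, the sketch below shows the group-relative advantage commonly associated with GRPO: each sampled response's reward is normalized by the mean and standard deviation of its group. It does not implement RLTT, whose trajectory-level reward distribution is not specified in the abstract. The function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions made for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses.

    rewards: scalar rewards for responses sampled from the same prompt.
    eps: small constant (assumed) to avoid division by zero when all
         rewards in the group are equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]


# Illustrative binary rewards for four sampled responses.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

In this form every token of a response shares the same scalar advantage, which is the coarse credit assignment the abstract contrasts with distributing reward along the latent trajectory.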

What Makes Value Learning Efficient in Residual Reinforcement Learning?

是什么让价值学习在残余强化学习中高效？

Authors: Guozheng Ma, Lu Li, Haoyu Wang, Zixuan Liu, Pierre-Luc Bacon, Dacheng Tao
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.10539
Pdf link: https://arxiv.org/pdf/2602.10539
Abstract Residual reinforcement learning (RL) enables stable online refinement of expressive pretrained policies by freezing the base and learning only bounded corrections. However, value learning in residual RL poses unique challenges that remain poorly understood. In this work, we identify two key bottlenecks: cold start pathology, where the critic lacks knowledge of the value landscape around the base policy, and structural scale mismatch, where the residual contribution is dwarfed by the base action. Through systematic investigation, we uncover the mechanisms underlying these bottlenecks, revealing that simple yet principled solutions suffice: base-policy transitions serve as an essential value anchor for implicit warmup, and critic normalization effectively restores representation sensitivity for discerning value differences. Based on these insights, we propose DAWN (Data-Anchored Warmup and Normalization), a minimal approach targeting efficient value learning in residual RL. By addressing these bottlenecks, DAWN demonstrates substantial efficiency gains across diverse benchmarks, policy architectures, and observation modalities.
中文摘要 残差强化学习（RL）通过冻结基底并仅学习有界修正，实现表达式预训练策略的稳定在线细化。然而，残余强化学习中的价值学习带来了独特的挑战，且理解不足。本研究指出两个关键瓶颈：冷启动病理，即批评者缺乏对基础策略价值景观的了解;以及结构性规模错配，剩余贡献被基础行动所掩盖。通过系统研究，我们揭示了这些瓶颈背后的机制，揭示简单但有原则的解决方案就足够：基础政策转变是隐性预热的重要价值锚点，批评规范化有效恢复了辨别价值差异的代表敏感性。基于这些见解，我们提出了DAWN（数据锚定热身与归一化），这是一种针对残余强化学习高效价值学习的极简方法。通过解决这些瓶颈，DAWN在多种基准、政策架构和观察模式中展示了显著的效率提升。

ReSPEC: A Framework for Online Multispectral Sensor Reconfiguration in Dynamic Environments

ReSPEC：动态环境中在线多光谱传感器重配置的框架

Authors: Yanchen Liu, Yuang Fan, Minghui Zhao, Xiaofan Jiang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.10547
Pdf link: https://arxiv.org/pdf/2602.10547
Abstract Multi-sensor fusion is central to robust robotic perception, yet most existing systems operate under static sensor configurations, collecting all modalities at fixed rates and fidelity regardless of their situational utility. This rigidity wastes bandwidth, computation, and energy, and prevents systems from prioritizing sensors under challenging conditions such as poor lighting or occlusion. Recent advances in reinforcement learning (RL) and modality-aware fusion suggest the potential for adaptive perception, but prior efforts have largely focused on re-weighting features at inference time, ignoring the physical cost of sensor data collection. We introduce a framework that unifies sensing, learning, and actuation into a closed reconfiguration loop. A task-specific detection backbone extracts multispectral features (e.g., RGB, IR, mmWave, depth) and produces quantitative contribution scores for each modality. These scores are passed to an RL agent, which dynamically adjusts sensor configurations, including sampling frequency, resolution, sensing range, etc., in real time. Less informative sensors are down-sampled or deactivated, while critical sensors are sampled at higher fidelity as environmental conditions evolve. We implement and evaluate this framework on a mobile rover, showing that adaptive control reduces GPU load by 29.3% with only a 5.3% accuracy drop compared to a heuristic baseline. These results highlight the potential of resource-aware adaptive sensing for embedded robotic platforms.
中文摘要 多传感器融合是机器人稳健感知的核心，但大多数现有系统仍以静态传感器配置运行，以固定速率和保真度收集所有模态，无论其情境效用如何。这种僵化浪费了带宽、计算和能量，并使系统无法在光照不足或遮挡等挑战性条件下优先使用关键传感器。强化学习（RL）和模态感知融合技术的最新进展表明了自适应感知的潜力，但此前的工作主要集中在推理时对特征重新加权，忽视了传感器数据采集的物理成本。我们引入了一个将感知、学习和执行统一为闭环重配置回路的框架。任务特定的检测骨干网络提取多光谱特征（如RGB、红外、毫米波、深度），并为每种模态生成定量贡献分数。这些分数被传递给强化学习智能体，由其实时动态调整传感器配置，包括采样频率、分辨率和感知距离等。信息量较少的传感器会被降采样或停用，而关键传感器则随环境条件变化以更高保真度采样。我们在移动探测车上实现并评估了该框架，结果显示，与启发式基线相比，自适应控制将GPU负载降低了29.3%，而准确率仅下降5.3%。这些结果凸显了资源感知自适应感知在嵌入式机器人平台上的潜力。

SplitCom: Communication-efficient Split Federated Fine-tuning of LLMs via Temporal Compression

SplitCom：通过时间压缩实现通信高效的大型语言模型拆分联邦微调

Authors: Tao Li, Yulin Tang, Yiyang Song, Cong Wu, Xihui Liu, Pan Li, Xianhao Chen
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2602.10564
Pdf link: https://arxiv.org/pdf/2602.10564
Abstract Federated fine-tuning of on-device large language models (LLMs) mitigates privacy concerns by preventing raw data sharing. However, the intensive computational and memory demands pose significant challenges for resource-constrained edge devices. To overcome these limitations, split federated learning (SFL) emerges as a promising solution that partitions the model into lightweight client-side and compute-intensive server-side sub-models, thus offloading the primary training workload to a powerful server. Nevertheless, high-dimensional activation exchanges in SFL lead to excessive communication overhead. To overcome this, we propose SplitCom, a communication-efficient SFL framework for LLMs that exploits temporal redundancy in activations across consecutive training epochs. Inspired by video compression, the core innovation of our framework lies in selective activation uploading only when a noticeable deviation from previous epochs occurs. To balance communication efficiency and learning performance, we introduce two adaptive threshold control schemes based on 1) bang-bang control or 2) deep deterministic policy gradient (DDPG)-based reinforcement learning. Moreover, we implement dimensionality reduction techniques to alleviate client-side memory requirements. Furthermore, we extend SplitCom to the U-shape architecture, ensuring the server never accesses clients' labels. Extensive simulations and laboratory experiments demonstrate that SplitCom reduces uplink communication costs by up to 98.6% in its standard configuration and total communication costs by up to 95.8% in its U-shape variant without noticeably compromising model performance.
中文摘要 对设备端大型语言模型（LLM）进行联邦微调可避免共享原始数据，从而缓解隐私问题。然而，高强度的计算和内存需求对资源受限的边缘设备构成了重大挑战。为克服这些限制，拆分联邦学习（SFL）成为一种有前景的解决方案，它将模型划分为轻量级的客户端子模型和计算密集型的服务器端子模型，从而将主要训练负载卸载到强大的服务器上。然而，SFL中的高维激活交换导致了过高的通信开销。为解决这一问题，我们提出了SplitCom，一种面向LLM的通信高效SFL框架，利用连续训练轮次（epoch）之间激活的时间冗余。受视频压缩启发，该框架的核心创新在于仅当激活与此前轮次出现明显偏差时才选择性地上传激活。为了平衡通信效率和学习性能，我们引入了两种自适应阈值控制方案，分别基于1）bang-bang控制或2）基于深度确定性策略梯度（DDPG）的强化学习。此外，我们采用降维技术以减轻客户端内存需求，并将SplitCom扩展到U形架构，确保服务器永远无法访问客户端的标签。大量仿真和实验室实验表明，SplitCom在标准配置下可将上行通信成本降低多达98.6%，在U形变体中可将总通信成本降低多达95.8%，且不会明显影响模型性能。
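
The selective-upload idea in the SplitCom abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the paper's method: the relative L2 deviation metric, the per-sample cache, the helper names `l2_deviation`, `select_uploads`, and `bang_bang_update`, and the rule that loosens the threshold while the loss improves.

```python
import math


def l2_deviation(current, cached):
    """Relative L2 distance between a new activation vector and the cached one."""
    diff = math.sqrt(sum((c - p) ** 2 for c, p in zip(current, cached)))
    norm = math.sqrt(sum(p ** 2 for p in cached)) or 1.0
    return diff / norm


def select_uploads(activations, cache, threshold):
    """Return indices of samples whose activations should be re-uploaded.

    `cache` maps sample index -> activation last sent to the server and is
    updated in place for every uploaded sample (hypothetical bookkeeping).
    """
    uploaded = []
    for idx, act in enumerate(activations):
        if idx not in cache or l2_deviation(act, cache[idx]) > threshold:
            cache[idx] = list(act)
            uploaded.append(idx)
    return uploaded


def bang_bang_update(threshold, loss_now, loss_prev, step=0.05, lo=0.0, hi=1.0):
    """Two-level controller (assumed rule): raise the threshold while the
    loss improves, lower it otherwise, and clamp to [lo, hi]."""
    if loss_now < loss_prev:
        threshold += step
    else:
        threshold -= step
    return min(hi, max(lo, threshold))
```

In this sketch the server would reuse the cached activation for any sample that is not uploaded, which is where the communication saving comes from.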

MetaphorStar: Image Metaphor Understanding and Reasoning with End-to-End Visual Reinforcement Learning

MetaphorStar：图像隐喻理解与推理，结合端到端视觉强化学习

Authors: Chenhao Zhang, Yazhe Niu, Hongsheng Li
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2602.10575
Pdf link: https://arxiv.org/pdf/2602.10575
Abstract Metaphorical comprehension in images remains a critical challenge for today's AI systems. While Multimodal Large Language Models (MLLMs) excel at basic Visual Question Answering (VQA), they consistently struggle to grasp the nuanced cultural, emotional, and contextual implications embedded in visual content. This difficulty stems from the task's demand for sophisticated multi-hop reasoning, cultural context, and Theory of Mind (ToM) capabilities, which current models lack. To fill this gap, we propose MetaphorStar, the first end-to-end visual reinforcement learning (RL) framework for image implication tasks. Our framework includes three core components: the fine-grained dataset TFQ-Data, the visual RL method TFQ-GRPO, and the well-structured benchmark TFQ-Bench. Our fully open-source MetaphorStar family, trained using TFQ-GRPO on TFQ-Data, improves performance by an average of 82.6% on the image implication benchmarks. Compared with 20+ mainstream MLLMs, MetaphorStar-32B achieves state-of-the-art (SOTA) results on Multiple-Choice and Open-Style Questions and significantly outperforms the top closed-source model Gemini-3.0-pro on True-False Questions. Crucially, our experiments reveal that learning image implication tasks improves general understanding ability, especially complex visual reasoning ability. We further provide a systematic analysis of model parameter scaling, training data scaling, and the impact of different model architectures and training strategies, demonstrating the broad applicability of our method. We open-sourced all model weights, datasets, and method code at this https URL.
中文摘要 图像中的隐喻理解仍然是当今人工智能系统面临的关键挑战。虽然多模态大型语言模型（MLLMs）在基础的视觉问答（VQA）方面表现出色，但它们始终难以理解视觉内容中蕴含的细微文化、情感和语境含义。这一困难源于该任务对复杂多跳推理、文化背景和心智理论（ToM）能力的需求，而现有模型缺乏这些能力。为填补这一空白，我们提出了MetaphorStar，这是首个面向图像隐含义任务的端到端视觉强化学习（RL）框架。我们的框架包含三个核心组成部分：细粒度数据集TFQ-Data、视觉强化学习方法TFQ-GRPO，以及结构良好的基准TFQ-Bench。我们完全开源的MetaphorStar系列使用TFQ-GRPO在TFQ-Data上训练，在图像隐含义基准上的性能平均提升82.6%。与20多个主流MLLM相比，MetaphorStar-32B在选择题和开放式问题上达到了最先进水平（SOTA），并在判断题上显著优于顶级闭源模型Gemini-3.0-pro。关键的是，我们的实验表明，学习图像隐含义任务能提升通用理解能力，尤其是复杂的视觉推理能力。我们还系统分析了模型参数规模、训练数据规模以及不同模型架构和训练策略的影响，展示了我们方法的广泛适用性。我们已将所有模型权重、数据集和方法代码开源于此 https URL。

LLM-Based Scientific Equation Discovery via Physics-Informed Token-Regularized Policy Optimization

基于LLM的科学方程发现，通过物理知情的令牌正则化策略优化

Authors: Boxiao Wang, Kai Li, Tianyi Liu, Chen Li, Junzhe Wang, Yifan Zhang, Jian Cheng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.10576
Pdf link: https://arxiv.org/pdf/2602.10576
Abstract Symbolic regression aims to distill mathematical equations from observational data. Recent approaches have successfully leveraged Large Language Models (LLMs) to generate equation hypotheses, capitalizing on their vast pre-trained scientific priors. However, existing frameworks predominantly treat the LLM as a static generator, relying on prompt-level guidance to steer exploration. This paradigm fails to update the model's internal representations based on search feedback, often yielding physically inconsistent or mathematically redundant expressions. In this work, we propose PiT-PO (Physics-informed Token-regularized Policy Optimization), a unified framework that evolves the LLM into an adaptive generator via reinforcement learning. Central to PiT-PO is a dual-constraint mechanism that rigorously enforces hierarchical physical validity while simultaneously applying fine-grained, token-level penalties to suppress redundant structures. Consequently, PiT-PO aligns the LLM to produce equations that are both scientifically consistent and structurally parsimonious. Empirically, PiT-PO achieves state-of-the-art performance on standard benchmarks and successfully discovers novel turbulence models for challenging fluid dynamics problems. We also demonstrate that PiT-PO empowers small-scale models to outperform closed-source giants, democratizing access to high-performance scientific discovery.
中文摘要 符号回归旨在从观测数据中提炼出数学方程。近年来，许多方法成功地利用大型语言模型（LLMs）生成方程假设，充分利用其庞大的预训练科学先验。然而，现有框架主要将LLM视为静态生成器，依赖提示级指导来引导探索。该范式未能基于搜索反馈更新模型内部表示，常导致物理上不一致或数学上冗余的表达式。在本研究中，我们提出了PiT-PO（物理知情的令牌正则化策略优化），这是一个统一框架，通过强化学习将LLM演化为自适应生成器。PiT-PO的核心是一种双约束机制，严格执行层级物理有效性，同时施加细粒度的令牌级惩罚以抑制冗余结构。因此，PiT-PO使LLM对齐，生成既科学一致又结构简洁的方程。从经验角度看，PiT-PO在标准基准测试上实现了最先进的性能，并成功发现了解决复杂流体动力学问题的新颖湍流模型。我们还展示了PiT-PO赋能小规模模型超越闭源巨头，普及高效科学发现的获取。

Neuro-symbolic Action Masking for Deep Reinforcement Learning

深度强化学习中的神经符号动作掩蔽

Authors: Shuai Han, Mehdi Dastani, Shihan Wang
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.10598
Pdf link: https://arxiv.org/pdf/2602.10598
Abstract Deep reinforcement learning (DRL) may explore infeasible actions during training and execution. Existing approaches assume a symbol grounding function that maps high-dimensional states to consistent symbolic representations, together with manually specified action masking techniques to constrain actions. In this paper, we propose Neuro-symbolic Action Masking (NSAM), a novel framework that automatically learns symbolic models, which are consistent with given domain constraints of high-dimensional states, in a minimally supervised manner during the DRL process. Based on the learned symbolic model of states, NSAM learns action masks that rule out infeasible actions. NSAM enables end-to-end integration of symbolic reasoning and deep policy optimization, where improvements in symbolic grounding and policy learning mutually reinforce each other. We evaluate NSAM on multiple domains with constraints, and experimental results demonstrate that NSAM significantly improves the sample efficiency of DRL agents while substantially reducing constraint violations.
中文摘要 深度强化学习（DRL）在训练和执行过程中可能会探索不可行的动作。现有方法假设存在一个将高维状态映射到一致符号表示的符号落地（symbol grounding）函数，并依赖手动指定的动作掩蔽技术来约束动作。本文提出了神经符号动作掩蔽（NSAM），这是一种新颖框架，可在DRL过程中以最小监督的方式自动学习与高维状态的给定领域约束相一致的符号模型。基于学到的状态符号模型，NSAM学习用于排除不可行动作的动作掩码。NSAM实现了符号推理与深度策略优化的端到端整合，使符号落地和策略学习的改进相互促进。我们在多个带约束的领域上评估NSAM，实验结果表明NSAM显著提升了DRL智能体的样本效率，同时大幅减少了约束违规。
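
As a rough illustration of what an action mask does once it is available, the sketch below applies a boolean feasibility mask to policy logits before normalization. The mask is supplied by hand here; how NSAM learns it from a symbolic model is not reproduced, and the function name `masked_softmax` is a hypothetical helper.

```python
import math


def masked_softmax(logits, feasible):
    """Softmax over logits with infeasible actions forced to probability zero."""
    if not any(feasible):
        raise ValueError("at least one action must be feasible")
    # Infeasible actions get -inf so that exp() maps them to exactly 0.0.
    masked = [l if ok else -math.inf for l, ok in zip(logits, feasible)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(m - peak) for m in masked]
    total = sum(exps)
    return [e / total for e in exps]
```

Sampling from the returned distribution can never select a masked action, which is the basic mechanism by which masking reduces constraint violations.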

Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters

Step 3.5 Flash：具备11B激活参数的开放前沿级智能

Authors: Ailin Huang, Ang Li, Aobo Kong, Bin Wang, Binxing Jiao, Bo Dong, Bojun Wang, Boyu Chen, Brian Li, Buyun Ma, Chang Su, Changxin Miao, Changyi Wan, Chao Lou, Chen Hu, Chen Xu, Chenfeng Yu, Chengting Feng, Chengyuan Yao, Chunrui Han, Dan Ma, Dapeng Shi, Daxin Jiang, Dehua Ma, Deshan Sun, Di Qi, Enle Liu, Fajie Zhang, Fanqi Wan, Guanzhe Huang, Gulin Yan, Guoliang Cao, Guopeng Li, Han Cheng, Hangyu Guo, Hanshan Zhang, Hao Nie, Haonan Jia, Haoran Lv, Hebin Zhou, Hekun Lv, Heng Wang, Heung-Yeung Shum, Hongbo Huang, Hongbo Peng, Hongyu Zhou, Hongyuan Wang, Houyong Chen, Huangxi Zhu, Huimin Wu, Huiyong Guo, Jia Wang, Jian Zhou, Jianjian Sun, Jiaoren Wu, Jiaran Zhang, Jiashu Lv, Jiashuo Liu, Jiayi Fu, Jiayu Liu, Jie Cheng, Jie Luo, Jie Yang, Jie Zhou, Jieyi Hou, Jing Bai, Jingcheng Hu, Jingjing Xie, Jingwei Wu, Jingyang Zhang, Jishi Zhou, Junfeng Liu, Junzhe Lin, Ka Man Lo, Kai Liang, Kaibo Liu, Kaijun Tan, Kaiwen Yan, Kaixiang Li, Kang An, Kangheng Lin, Lei Yang, Liang Lv, Liang Zhao, Liangyu Chen, Lieyu Shi, Liguo Tan, Lin Lin, Lina Chen, Luck Ma, Mengqiang Ren, Michael Li, Ming Li, Mingliang Li, Mingming Zhang, Mingrui Chen, Mitt Huang, Na Wang, Peng Liu, Qi Han
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.10604
Pdf link: https://arxiv.org/pdf/2602.10604
Abstract We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency. We focus on what matters most when building agents: sharp reasoning and fast, reliable execution. Step 3.5 Flash pairs a 196B-parameter foundation with 11B active parameters for efficient inference. It is optimized with interleaved 3:1 sliding-window/full attention and Multi-Token Prediction (MTP-3) to reduce the latency and cost of multi-round agentic interactions. To reach frontier-level intelligence, we design a scalable reinforcement learning framework that combines verifiable signals with preference feedback, while remaining stable under large-scale off-policy training, enabling consistent self-improvement across mathematics, code, and tool use. Step 3.5 Flash demonstrates strong performance across agent, coding, and math tasks, achieving 85.4% on IMO-AnswerBench, 86.4% on LiveCodeBench-v6 (2024.08-2025.05), 88.2% on tau2-Bench, 69.0% on BrowseComp (with context management), and 51.0% on Terminal-Bench 2.0, comparable to frontier models such as GPT-5.2 xHigh and Gemini 3.0 Pro. By redefining the efficiency frontier, Step 3.5 Flash provides a high-density foundation for deploying sophisticated agents in real-world industrial environments.
中文摘要 我们介绍Step 3.5 Flash，一种稀疏的专家混合（MoE）模型，兼顾前沿级别的智能体能力与计算效率。我们关注构建智能体时最重要的方面：敏锐的推理和快速、可靠的执行。Step 3.5 Flash将196B参数的基础模型与11B激活参数相结合，以实现高效推理。它采用交错的3:1滑动窗口/全注意力和多令牌预测（MTP-3）进行优化，以降低多轮智能体交互的延迟和成本。为了达到前沿级智能，我们设计了一个可扩展的强化学习框架，结合可验证信号与偏好反馈，同时在大规模离策略训练下保持稳定，从而在数学、代码和工具使用方面实现持续的自我提升。Step 3.5 Flash在智能体、编码和数学任务中表现出色，在IMO-AnswerBench上达到85.4%，在LiveCodeBench-v6（2024.08-2025.05）上达到86.4%，在tau2-Bench上达到88.2%，在BrowseComp（带上下文管理）上达到69.0%，在Terminal-Bench 2.0上达到51.0%，与GPT-5.2 xHigh和Gemini 3.0 Pro等前沿模型相当。通过重新定义效率前沿，Step 3.5 Flash为在现实工业环境中部署复杂智能体提供了高密度基础。

Online Causal Kalman Filtering for Stable and Effective Policy Optimization

用于稳定有效策略优化的在线因果卡尔曼滤波

Authors: Shuo He, Lang Feng, Xin Cheng, Lei Feng, Bo An
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.10609
Pdf link: https://arxiv.org/pdf/2602.10609
Abstract Reinforcement learning for large language models suffers from high-variance token-level importance sampling (IS) ratios, which can destabilize policy optimization at scale. To improve stability, recent methods typically use a fixed sequence-level IS ratio for all tokens in a sequence or adjust each token's IS ratio separately, thereby neglecting temporal off-policy deviation across tokens in a sequence. In this paper, we first empirically identify that local off-policy deviation is structurally inconsistent at the token level, which may distort policy-gradient updates across adjacent tokens and lead to training collapse. To address the issue, we propose Online Causal Kalman Filtering for stable and effective Policy Optimization (KPO). Concretely, we model the desired IS ratio as a latent state that evolves across tokens and apply a Kalman filter to update this state online and autoregressively based on the states of past tokens, regardless of future tokens. The resulting filtered IS ratios preserve token-wise local structure-aware variation while strongly smoothing noise spikes, yielding more stable and effective policy updates. Experimentally, KPO achieves superior results on challenging math reasoning datasets compared with state-of-the-art counterparts.
中文摘要 大型语言模型的强化学习受困于高方差的令牌级重要性采样（IS）比率，这会破坏大规模策略优化的稳定性。为提升稳定性，近期方法通常对序列中的所有令牌使用固定的序列级IS比率，或分别调整每个令牌的IS比率，从而忽略了序列中令牌间随时间变化的离策略偏差。本文首先通过实证发现，局部离策略偏差在令牌层面存在结构性不一致，这可能扭曲相邻令牌间的策略梯度更新并导致训练崩溃。为解决这一问题，我们提出了用于稳定且有效策略优化的在线因果卡尔曼滤波（KPO）。具体来说，我们将期望的IS比率建模为一个随令牌演化的潜在状态，并应用卡尔曼滤波器，仅根据过去令牌的状态以在线、自回归的方式更新该状态，而不依赖未来令牌。滤波后的IS比率保留了令牌层面具有局部结构感知的变化，同时有力地平滑了噪声尖峰，从而带来更稳定、更有效的策略更新。实验表明，KPO在具有挑战性的数学推理数据集上取得了优于最先进方法的结果。
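
To make the "causal filtering of a per-token signal" idea concrete, here is a minimal scalar Kalman filter with a random-walk state model. It is only a sketch under stated assumptions: the abstract does not give KPO's state-space model, so the random-walk dynamics, the noise variances `process_var` and `obs_var`, and the choice to filter log-ratios are all illustrative, and `causal_kalman_filter` is a hypothetical name.

```python
def causal_kalman_filter(observations, process_var=0.01, obs_var=0.25):
    """Filter a 1-D sequence with a random-walk Kalman filter.

    Each output depends only on the current and earlier observations,
    so the filter never looks at future tokens.
    """
    if not observations:
        return []
    mean, var = observations[0], obs_var  # initialize from the first token
    estimates = [mean]
    for z in observations[1:]:
        var_pred = var + process_var            # predict: state drifts slowly
        gain = var_pred / (var_pred + obs_var)  # Kalman gain in (0, 1)
        mean = mean + gain * (z - mean)         # correct with the new observation
        var = (1.0 - gain) * var_pred           # posterior variance
        estimates.append(mean)
    return estimates
```

With a small `process_var` relative to `obs_var`, an isolated spike in the input moves the estimate only part of the way toward the spike, which is the smoothing behavior the abstract describes.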

Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling

通过贝叶斯非负奖励建模缓解RLHF中的奖励黑客行为

Authors: Zhibin Duan, Guowei Rong, Zhuo Li, Bo Chen, Mingyuan Zhou, Dandan Guo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.10623
Pdf link: https://arxiv.org/pdf/2602.10623
Abstract Reward models learned from human preferences are central to aligning large language models (LLMs) via reinforcement learning from human feedback, yet they are often vulnerable to reward hacking due to noisy annotations and systematic biases such as response length or style. We propose the Bayesian Non-Negative Reward Model (BNRM), a principled reward modeling framework that integrates non-negative factor analysis into the Bradley-Terry (BT) preference model. BNRM represents rewards through a sparse, non-negative latent factor generative process that operates at two complementary levels: instance-specific latent variables induce disentangled reward representations, while sparsity over global latent factors acts as an implicit debiasing mechanism that suppresses spurious correlations. Together, this disentanglement-then-debiasing structure enables robust uncertainty-aware reward learning. To scale BNRM to modern LLMs, we develop an amortized variational inference network conditioned on deep model representations, allowing efficient end-to-end training. Extensive empirical results demonstrate that BNRM substantially mitigates reward over-optimization, improves robustness under distribution shifts, and yields more interpretable reward decompositions than strong baselines.
中文摘要 从人类偏好中学习的奖励模型是通过基于人类反馈的强化学习对齐大型语言模型（LLMs）的核心，但由于噪声标注和系统性偏差（如响应长度或风格），它们常常容易受到奖励黑客行为的影响。我们提出了贝叶斯非负奖励模型（BNRM），这是一种有原则的奖励建模框架，将非负因子分析整合进Bradley-Terry（BT）偏好模型。BNRM通过稀疏、非负的潜在因子生成过程表示奖励，该过程作用于两个互补层面：实例特定的潜在变量诱导出解耦的奖励表示，而全局潜在因子上的稀疏性则作为隐式去偏机制，抑制虚假相关性。这种先解耦后去偏的结构共同实现了稳健的、具有不确定性感知的奖励学习。为了将BNRM扩展到现代LLM，我们开发了一个以深度模型表示为条件的摊销变分推断网络，从而实现高效的端到端训练。大量实证结果表明，与强基线相比，BNRM显著缓解了奖励过度优化，提高了分布偏移下的鲁棒性，并产生了更具可解释性的奖励分解。
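
For readers unfamiliar with the Bradley-Terry preference model that BNRM builds on, the sketch below shows only the standard pairwise objective: the probability that the chosen response beats the rejected one is the logistic function of the reward difference. BNRM's non-negative latent factors and variational inference network are not reproduced, and the function names are illustrative.

```python
import math


def bt_preference_prob(reward_chosen, reward_rejected):
    """P(chosen beats rejected) = sigmoid(r_chosen - r_rejected)."""
    return math.exp(-bt_loss(reward_chosen, reward_rejected))


def bt_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood of one preference pair under Bradley-Terry.

    Computed as softplus(-d) with d = r_chosen - r_rejected, in a form
    that does not overflow for large |d|.
    """
    d = reward_chosen - reward_rejected
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))
```

Minimizing `bt_loss` over a dataset of preference pairs pushes the reward of preferred responses above that of rejected ones; only the difference of rewards matters, not their absolute level.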

OmniVL-Guard: Towards Unified Vision-Language Forgery Detection and Grounding via Balanced RL

OmniVL-Guard：通过平衡强化学习实现统一的视觉语言伪造检测与定位

Authors: Jinjie Shen, Jing Wu, Yaxiong Wang, Lechao Cheng, Shengeng Tang, Tianrui Hui, Nan Pu, Zhun Zhong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.10687
Pdf link: https://arxiv.org/pdf/2602.10687
Abstract Existing forgery detection methods are often limited to uni-modal or bi-modal settings, failing to handle the interleaved text, images, and videos prevalent in real-world misinformation. To bridge this gap, this paper aims to develop a unified framework for omnibus vision-language forgery detection and grounding. In this unified setting, the interplay between diverse modalities and the dual requirements of simultaneous detection and localization pose a critical difficulty bias problem: the simpler veracity classification task tends to dominate the gradients, leading to suboptimal performance in fine-grained grounding during multi-task optimization. To address this challenge, we propose OmniVL-Guard, a balanced reinforcement learning framework for omnibus vision-language forgery detection and grounding. In particular, OmniVL-Guard comprises two core designs: Self-Evolving CoT Generation and Adaptive Reward Scaling Policy Optimization (ARSPO). Self-Evolving CoT Generation synthesizes high-quality reasoning paths, effectively overcoming the cold-start challenge. Building upon this, ARSPO dynamically modulates reward scales and task weights, ensuring a balanced joint optimization. Extensive experiments demonstrate that OmniVL-Guard significantly outperforms state-of-the-art methods and exhibits robust zero-shot generalization across out-of-domain scenarios.
中文摘要 现有的伪造检测方法通常局限于单模态或双模态设置，无法处理现实世界虚假信息中普遍存在的交错文本、图像和视频。为弥合这一差距，本文旨在开发一个统一的全能型视觉语言伪造检测与定位框架。在这一统一设置中，多种模态之间的相互作用以及同时进行检测和定位的双重要求带来了一个关键的难度偏差问题：较简单的真实性分类任务往往主导梯度，导致多任务优化时细粒度定位的性能欠佳。为应对这一挑战，我们提出了OmniVL-Guard，一个用于全能型视觉语言伪造检测与定位的平衡强化学习框架。具体来说，OmniVL-Guard包含两个核心设计：自演化CoT生成（Self-Evolving CoT Generation）和自适应奖励缩放策略优化（ARSPO）。自演化CoT生成能够合成高质量的推理路径，有效克服冷启动挑战。在此基础上，ARSPO动态调节奖励尺度和任务权重，确保联合优化的平衡。大量实验表明，OmniVL-Guard显著优于最先进方法，并在域外场景中展现出稳健的零样本泛化能力。

VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training

VESPO：用于稳定离策略大型语言模型训练的变分序列级软策略优化

Authors: Guobin Shen, Chenxiao Zhao, Xiang Cheng, Lei Huang, Xing Yu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.10693
Pdf link: https://arxiv.org/pdf/2602.10693
Abstract Training stability remains a central challenge in reinforcement learning (RL) for large language models (LLMs). Policy staleness, asynchronous training, and mismatches between training and inference engines all cause the behavior policy to diverge from the current policy, risking training collapse. Importance sampling provides a principled correction for this distribution shift but suffers from high variance; existing remedies such as token-level clipping and sequence-level normalization lack a unified theoretical foundation. We propose Variational sEquence-level Soft Policy Optimization (VESPO). By incorporating variance reduction into a variational formulation over proposal distributions, VESPO derives a closed-form reshaping kernel that operates directly on sequence-level importance weights without length normalization. Experiments on mathematical reasoning benchmarks show that VESPO maintains stable training under staleness ratios up to 64x and fully asynchronous execution, and delivers consistent gains across both dense and Mixture-of-Experts models. Code is available at this https URL
中文摘要 训练稳定性仍然是大型语言模型（LLM）强化学习（RL）中的核心挑战。策略陈旧、异步训练以及训练引擎与推理引擎之间的不匹配都会导致行为策略偏离当前策略，带来训练崩溃的风险。重要性采样为这种分布偏移提供了有原则的修正，但存在方差较高的问题；现有的补救方法（如令牌级裁剪和序列级归一化）缺乏统一的理论基础。我们提出了变分序列级软策略优化（VESPO）。通过将方差缩减纳入关于提议分布的变分表述，VESPO推导出一个闭式重塑核，直接作用于序列级重要性权重，无需长度归一化。在数学推理基准上的实验表明，VESPO在陈旧度比率高达64倍以及完全异步执行的情况下仍能保持稳定训练，并在稠密模型和专家混合模型上均带来一致的提升。代码可在此 https URL 获取。

Spend Search Where It Pays: Value-Guided Structured Sampling and Optimization for Generative Recommendation

把搜索花在值得之处：面向生成式推荐的价值引导结构化采样与优化

Authors: Jie Jiang, Yangru Huang, Zeyu Wang, Changping Wang, Yuling Xiong, Jun Zhang, Huan Yu
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.10699
Pdf link: https://arxiv.org/pdf/2602.10699
Abstract Generative recommendation via autoregressive models has unified retrieval and ranking into a single conditional generation framework. However, fine-tuning these models with Reinforcement Learning (RL) often suffers from a fundamental probability-reward mismatch. Conventional likelihood-dominated decoding (e.g., beam search) exhibits a myopic bias toward locally probable prefixes, which causes two critical failures: (1) insufficient exploration, where high-reward items in low-probability branches are prematurely pruned and rarely sampled, and (2) advantage compression, where trajectories sharing high-probability prefixes receive highly correlated rewards with low within-group variance, yielding a weak comparative signal for RL. To address these challenges, we propose V-STAR, a Value-guided Sampling and Tree-structured Advantage Reinforcement framework. V-STAR forms a self-evolving loop via two synergistic components. First, a Value-Guided Efficient Decoding (VED) is developed to identify decisive nodes and selectively deepen high-potential prefixes. This improves exploration efficiency without exhaustive tree search. Second, we propose Sibling-GRPO, which exploits the induced tree topology to compute sibling-relative advantages and concentrates learning signals on decisive branching decisions. Extensive experiments on both offline and online datasets demonstrate that V-STAR outperforms state-of-the-art baselines, delivering superior accuracy and candidate-set diversity under strict latency constraints.
中文摘要 基于自回归模型的生成式推荐将检索和排序统一为单一的条件生成框架。然而，用强化学习（RL）微调这些模型往往存在根本性的概率与奖励不匹配问题。传统的似然主导解码（如束搜索）对局部高概率前缀存在短视偏向，导致两个关键失败：（1）探索不足，低概率分支中的高奖励物品被过早剪枝且很少被采样；（2）优势压缩，共享高概率前缀的轨迹获得高度相关且组内方差低的奖励，导致强化学习的比较信号较弱。为应对这些挑战，我们提出了V-STAR，一个价值引导采样与树结构优势强化框架。V-STAR通过两个协同组件形成自演化回路。首先，我们开发了价值引导高效解码（VED），用于识别决定性节点并有选择地加深高潜力前缀，从而在无需穷尽树搜索的情况下提高探索效率。其次，我们提出Sibling-GRPO，利用由此形成的树拓扑计算相对于兄弟节点的优势，并将学习信号集中于决定性的分支决策。在离线和在线数据集上的大量实验表明，V-STAR优于最先进的基线，在严格的延迟约束下提供更高的准确性和候选集多样性。
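
The notion of a sibling-relative advantage can be sketched as follows: each node's reward is compared only against nodes that share its parent in the decoding tree. The normalization by the sibling group's mean and standard deviation mirrors the group normalization commonly used in GRPO and is an assumption here; the abstract does not give Sibling-GRPO's exact formula, and the `(node_id, parent_id, reward)` input layout is hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def sibling_relative_advantages(nodes, eps=1e-6):
    """Compute advantages relative to siblings.

    nodes: iterable of (node_id, parent_id, reward) tuples.
    Returns a dict mapping node_id -> normalized advantage.
    """
    groups = defaultdict(list)
    for node_id, parent_id, reward in nodes:
        groups[parent_id].append((node_id, reward))

    advantages = {}
    for siblings in groups.values():
        rewards = [r for _, r in siblings]
        mu, sigma = mean(rewards), pstdev(rewards)
        for node_id, reward in siblings:
            # eps keeps the division defined when all siblings tie
            advantages[node_id] = (reward - mu) / (sigma + eps)
    return advantages
```

Sibling groups whose rewards are identical yield zero advantage, so in this sketch the learning signal concentrates on branching points where the choice of child actually changes the outcome.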

Reinforced Curriculum Pre-Alignment for Domain-Adaptive VLMs

领域适应性VLM的强化课程预对齐

Authors: Yuming Yan, Shuo Yang, Kai Tang, Sihong Chen, Yang Zhang, Ke Xu, Dan Hu, Qun Yu, Pengfei Hu, Edith C.H. Ngai
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.10740
Pdf link: https://arxiv.org/pdf/2602.10740
Abstract Vision-Language Models (VLMs) demonstrate remarkable general-purpose capabilities but often fall short in specialized domains such as medical imaging or geometric problem-solving. Supervised Fine-Tuning (SFT) can enhance performance within a target domain, but it typically causes catastrophic forgetting, limiting its generalization. The central challenge, therefore, is to adapt VLMs to new domains while preserving their general-purpose capabilities. Continual pretraining is effective for expanding knowledge in Large Language Models (LLMs), but it is less feasible for VLMs due to prohibitive computational costs and the unavailability of pretraining data for most open-source models. This necessitates efficient post-training adaptation methods. Reinforcement learning (RL)-based approaches such as Group Relative Policy Optimization (GRPO) have shown promise in preserving general abilities, yet they often fail in domain adaptation scenarios where the model initially lacks sufficient domain knowledge, leading to optimization collapse. To bridge this gap, we propose Reinforced Curriculum Pre-Alignment (RCPA), a novel post-training paradigm that introduces a curriculum-aware progressive modulation mechanism. In the early phase, RCPA applies partial output constraints to safely expose the model to new domain concepts. As the model's domain familiarity increases, training gradually transitions to full generation optimization, refining responses and aligning them with domain-specific preferences. This staged adaptation balances domain knowledge acquisition with the preservation of general multimodal capabilities. Extensive experiments across specialized domains and general benchmarks validate the effectiveness of RCPA, establishing a practical pathway toward building high-performing and domain-adaptive VLMs.
中文摘要 视觉语言模型（VLM）展现了卓越的通用能力，但在医学影像或几何问题求解等专业领域常常表现不足。监督微调（SFT）可以提升目标领域内的性能，但通常会导致灾难性遗忘，限制其泛化能力。因此，核心挑战是如何在保持通用能力的同时，使VLM适应新领域。持续预训练对于扩展大型语言模型（LLMs）的知识非常有效，但由于计算成本高昂且大多数开源模型的预训练数据不可获得，它对VLM而言不太可行。这就需要高效的后训练适应方法。基于强化学习（RL）的方法，如群体相对策略优化（Group Relative Policy Optimization，GRPO），在保留通用能力方面展现出潜力，但在模型初始缺乏足够领域知识的领域适应场景中常常失败，导致优化崩溃。为弥合这一差距，我们提出了强化课程预对齐（RCPA），这是一种新的后训练范式，引入了课程感知的渐进式调节机制。在早期阶段，RCPA施加部分输出约束，使模型安全地接触新的领域概念。随着模型对领域的熟悉度提升，训练逐步过渡到完整生成优化，从而改进响应并使其与领域特定偏好对齐。这种分阶段的适应在领域知识获取与通用多模态能力保持之间取得平衡。在专业领域和通用基准上的大量实验验证了RCPA的有效性，为构建高性能且具备领域适应性的VLM提供了实用路径。

Dynamic Interference Management for TN-NTN Coexistence in the Upper Mid-Band

上中频带中TN-NTN共存的动态干扰管理

Authors: Pradyumna Kumar Bishoyi, Chia Chia Lee, Navid Keshtiarast, Marina Petrova
Subjects: Information Theory (cs.IT); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2602.10813
Pdf link: https://arxiv.org/pdf/2602.10813
Abstract The coexistence of terrestrial networks (TN) and non-terrestrial networks (NTN) in the frequency range 3 (FR3) upper mid-band presents considerable interference concerns, as dense TN deployments can severely degrade NTN downlink performance. Existing studies rely on interference-nulling beamforming, precoding, or exclusion zones that require accurate channel state information (CSI) and static coordination, making them unsuitable for dynamic NTN scenarios. To overcome these limitations, we develop an optimization framework that jointly controls TN downlink power, uplink power, and antenna downtilt to protect NTN links while preserving terrestrial performance. The resultant non-convex coupling between TN and NTN parameters is addressed by a Proximal Policy Optimization (PPO)-based reinforcement learning method that develops adaptive power and tilt control strategies. Simulation results demonstrate a reduction of up to 8 dB in the median interference-to-noise ratio (INR) while maintaining over 87% TN base station activity, outperforming conventional baseline methods and validating the feasibility of the proposed strategy for FR3 coexistence.
中文摘要 地面网络（TN）和非地面网络（NTN）在频率范围3（FR3）上中频带共存，带来了相当大的干扰问题，因为密集的TN部署会严重降低NTN下行链路性能。现有研究依赖于干扰置零波束成形、预编码或排除区，这些技术需要准确的信道状态信息（CSI）和静态协调，因此不适合动态NTN场景。为克服这些限制，我们开发了一个优化框架，联合控制TN下行功率、上行功率和天线下倾角，以保护NTN链路，同时保持地面网络性能。TN和NTN参数之间由此产生的非凸耦合通过基于近端策略优化（PPO）的强化学习方法解决，该方法学习自适应的功率和下倾角控制策略。仿真结果显示，在保持超过87%的TN基站活跃度的同时，中位干扰噪声比（INR）可降低多达8 dB，优于传统基线方法，并验证了所提策略用于FR3共存的可行性。

Why Does RL Generalize Better Than SFT? A Data-Centric Perspective on VLM Post-Training

为什么强化学习比SFT泛化得更好？以数据为中心的VLM后训练视角

Authors: Aojun Lu, Tao Feng, Hangjie Yuan, Wei Li, Yanan Sun
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.10815
Pdf link: https://arxiv.org/pdf/2602.10815
Abstract The adaptation of large-scale Vision-Language Models (VLMs) through post-training reveals a pronounced generalization gap: models fine-tuned with Reinforcement Learning (RL) consistently achieve superior out-of-distribution (OOD) performance compared to those trained with Supervised Fine-Tuning (SFT). This paper posits a data-centric explanation for this phenomenon, contending that RL's generalization advantage arises from an implicit data filtering mechanism that inherently prioritizes medium-difficulty training samples. To test this hypothesis, we systematically evaluate the OOD generalization of SFT models across training datasets of varying difficulty levels. Our results confirm that data difficulty is a critical factor, revealing that training on hard samples significantly degrades OOD performance. Motivated by this finding, we introduce Difficulty-Curated SFT (DC-SFT), a straightforward method that explicitly filters the training set based on sample difficulty. Experiments show that DC-SFT not only substantially enhances OOD generalization over standard SFT, but also surpasses the performance of RL-based training, all while providing greater stability and computational efficiency. This work offers a data-centric account of the OOD generalization gap in VLMs and establishes a more efficient pathway to achieving robust generalization. Code is available at this https URL.
中文摘要 通过后训练来适配大规模视觉语言模型（VLMs）时会出现明显的泛化差距：经过强化学习（RL）微调的模型在分布外（OOD）性能上始终优于使用监督微调（SFT）训练的模型。本文对这一现象提出了以数据为中心的解释，认为强化学习的泛化优势源于一种隐式的数据过滤机制，该机制天然地优先考虑中等难度的训练样本。为验证该假设，我们系统评估了SFT模型在不同难度训练数据集上的OOD泛化情况。我们的结果证实数据难度是一个关键因素，并表明在困难样本上训练会显著降低OOD性能。基于这一发现，我们引入了难度筛选SFT（DC-SFT），这是一种基于样本难度显式过滤训练集的简单方法。实验表明，DC-SFT不仅相比标准SFT显著提升了OOD泛化能力，还超越了基于强化学习的训练，同时具有更高的稳定性和计算效率。本研究从以数据为中心的角度解释了VLM中的OOD泛化差距，并建立了实现稳健泛化的更高效路径。代码可在此 https URL 获取。
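
A difficulty-based filter of the kind the DC-SFT abstract describes might look like the sketch below. The difficulty estimate (failure rate of a reference model over several attempts), the field names `num_correct` and `num_attempts`, and the cutoff value are all assumptions for illustration; the abstract does not say how the paper measures difficulty or where it sets the threshold.

```python
def estimate_difficulty(num_correct, num_attempts):
    """Assumed proxy: fraction of attempts a reference model gets wrong."""
    if num_attempts <= 0:
        raise ValueError("num_attempts must be positive")
    return 1.0 - num_correct / num_attempts


def curate_by_difficulty(samples, max_difficulty=0.8):
    """Keep samples whose estimated difficulty does not exceed the cutoff.

    samples: list of dicts with 'num_correct' and 'num_attempts' keys.
    """
    return [
        s for s in samples
        if estimate_difficulty(s["num_correct"], s["num_attempts"]) <= max_difficulty
    ]
```

The curated list would then be passed to an ordinary SFT pipeline unchanged, which is what makes this kind of filtering cheap compared with RL training.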

RePO: Bridging On-Policy Learning and Off-Policy Knowledge through Rephrasing Policy Optimization

RePO：通过重述策略优化桥接同策略学习与离策略知识

Authors: Linxuan Xia, Xiaolong Yang, Yongyuan Chen, Enyue Zhao, Deng Cai, Yasheng Wang, Boxi Wu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.10819
Pdf link: https://arxiv.org/pdf/2602.10819
Abstract Aligning large language models (LLMs) on domain-specific data remains a fundamental challenge. Supervised fine-tuning (SFT) offers a straightforward way to inject domain knowledge but often degrades the model's generality. In contrast, on-policy reinforcement learning (RL) preserves generality but fails to effectively assimilate hard samples that exceed the model's current reasoning level. Recent off-policy RL attempts improve hard sample utilization, yet they suffer from severe training instability due to the forced distribution shift toward off-policy knowledge. To reconcile effective off-policy knowledge absorption with the stability of on-policy RL, we propose Rephrasing Policy Optimization (RePO). In RePO, the policy model is prompted to first comprehend off-policy knowledge and then rephrase it into trajectories that conform to its own stylistic and parametric distribution. RePO dynamically replaces low-reward rollouts with these rephrased, high-quality trajectories. This strategy guides the model toward correct reasoning paths while strictly preserving on-policy training dynamics. Experiments on several benchmarks demonstrate that RePO improves hard-sample utilization and outperforms existing baselines, achieving state-of-the-art performance.
中文摘要 在领域特定数据上对齐大型语言模型（LLMs）仍是一个根本性的挑战。监督微调（SFT）提供了一种直接注入领域知识的方法，但往往会降低模型的通用性。相比之下，同策略（on-policy）强化学习（RL）保持了通用性，但无法有效吸收超出模型当前推理水平的困难样本。近期的离策略（off-policy）强化学习尝试提高了困难样本的利用率，但由于强制分布向离策略知识偏移，它们存在严重的训练不稳定性。为了兼顾有效的离策略知识吸收与同策略强化学习的稳定性，我们提出了重述策略优化（RePO）。在RePO中，策略模型被提示先理解离策略知识，然后将其重述为符合自身风格和参数分布的轨迹。RePO动态地用这些重述后的高质量轨迹替换低奖励的采样轨迹（rollout）。该策略引导模型走向正确的推理路径，同时严格保持同策略训练动态。在多项基准上的实验表明，RePO提升了困难样本利用率，并优于现有基线，达到了最先进的性能。

SimuScene: Training and Benchmarking Code Generation to Simulate Physical Scenarios

SimuScene：模拟物理场景的代码生成训练与基准测试

Authors: Yanan Wang, Renxi Wang, Yongxin Wang, Xuezhi Liang, Fajri Koto, Timothy Baldwin, Xiaodan Liang, Haonan Li
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.10840
Pdf link: https://arxiv.org/pdf/2602.10840
Abstract Large language models (LLMs) have been extensively studied for tasks like math competitions, complex coding, and scientific reasoning, yet their ability to accurately represent and simulate physical scenarios via code remains underexplored. We propose SimuScene, the first systematic study that trains and evaluates LLMs on simulating physical scenarios across five physics domains and 52 physical concepts. We build an automatic pipeline to collect data, with human verification to ensure quality. The final dataset contains 7,659 physical scenarios with 334 human-verified examples as the test set. We evaluated 10 contemporary LLMs and found that even the strongest model achieves only a 21.5% pass rate, demonstrating the difficulty of the task. Finally, we introduce a reinforcement learning pipeline with visual rewards that uses a vision-language model as a judge to train textual models. Experiments show that training with our data improves physical simulation via code while substantially enhancing general code generation performance.
中文摘要 大型语言模型（LLMs）已被广泛用于数学竞赛、复杂编码和科学推理等任务，但它们通过代码准确表示和模拟物理场景的能力仍未被充分探索。我们提出了SimuScene，这是首个系统性研究，旨在训练和评估大型语言模型，模拟五个物理领域和52个物理概念的物理场景。我们建立了自动收集数据的管道，并通过人工核实确保质量。最终数据集包含7,659个物理场景，测试集为334个经过人工验证的实例。我们评估了10个当代大型语言模型，发现即使是最强的模型，通过率也仅为21.5%，显示出任务的艰难性。最后，我们引入了带有视觉奖励的强化学习流程，利用视觉语言模型作为评判来训练文本模型。实验表明，使用我们的数据训练不仅能通过代码提升物理模拟，还能显著提升整体代码生成性能。

ICA: Information-Aware Credit Assignment for Visually Grounded Long-Horizon Information-Seeking Agents

ICA：面向视觉接地的长时程信息检索智能体的信息感知信用分配

Authors: Cong Pang, Xuyu Feng, Yujie Yi, Zixuan Chen, Jiawei Hong, Tiankuo Yao, Nang Yuan, Jiapeng Luo, Lewei Lu, Xin Lou
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.10863
Pdf link: https://arxiv.org/pdf/2602.10863
Abstract Despite the strong performance achieved by reinforcement learning-trained information-seeking agents, learning in open-ended web environments remains severely constrained by low signal-to-noise feedback. Text-based parsers often discard layout semantics and introduce unstructured noise, while long-horizon training typically relies on sparse outcome rewards that obscure which retrieval actions actually matter. We propose a visual-native search framework that represents webpages as visual snapshots, allowing agents to leverage layout cues to quickly localize salient evidence and suppress distractors. To learn effectively from these high-dimensional observations, we introduce Information-Aware Credit Assignment (ICA), a post-hoc method that estimates each retrieved snapshot's contribution to the final outcome via posterior analysis and propagates dense learning signals back to key search turns. Integrated with a GRPO-based training pipeline, our approach consistently outperforms text-based baselines on diverse information-seeking benchmarks, providing evidence that visual snapshot grounding with information-level credit assignment alleviates the credit-assignment bottleneck in open-ended web environments. The code and datasets will be released in this https URL.
中文摘要 尽管经强化学习训练的信息检索智能体表现出色，但在开放式网络环境中的学习仍受到低信噪比反馈的严重限制。基于文本的解析器通常会丢弃布局语义并引入非结构化噪声，而长时程训练通常依赖稀疏的结果奖励，掩盖了哪些检索动作真正重要。我们提出了一种视觉原生搜索框架，将网页表示为视觉快照，使智能体能够利用布局线索快速定位关键证据并抑制干扰项。为了有效地从这些高维观测中学习，我们引入了信息感知信用分配（ICA），这是一种事后方法，通过后验分析估计每个检索到的快照对最终结果的贡献，并将密集的学习信号传播回关键的搜索轮次。结合基于GRPO的训练流程，我们的方法在多种信息检索基准上持续优于基于文本的基线，表明视觉快照接地与信息层面的信用分配相结合，能够缓解开放式网络环境中的信用分配瓶颈。代码和数据集将在此 https URL 中发布。

Chart Specification: Structural Representations for Incentivizing VLM Reasoning in Chart-to-Code Generation

图表规范：用于激励图表到代码生成中VLM推理的结构表示

Authors: Minggui He, Mingchen Dai, Jian Zhang, Yilun Liu, Shimin Tao, Pufan Zeng, Osamu Yoshie, Yuya Ieiri
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.10880
Pdf link: https://arxiv.org/pdf/2602.10880
Abstract Vision-Language Models (VLMs) have shown promise in generating plotting code from chart images, yet achieving structural fidelity remains challenging. Existing approaches largely rely on supervised fine-tuning, encouraging surface-level token imitation rather than faithful modeling of underlying chart structure, which often leads to hallucinated or semantically inconsistent outputs. We propose Chart Specification, a structured intermediate representation that shifts training from text imitation to semantically grounded supervision. Chart Specification filters syntactic noise to construct a structurally balanced training set and supports a Spec-Align Reward that provides fine-grained, verifiable feedback on structural correctness, enabling reinforcement learning to enforce consistent plotting logic. Experiments on three public benchmarks show that our method consistently outperforms prior approaches. With only 3K training samples, we achieve strong data efficiency, surpassing leading baselines by up to 61.7% on complex benchmarks, and scaling to 4K samples establishes new state-of-the-art results across all evaluated metrics. Overall, our results demonstrate that precise structural supervision offers an efficient pathway to high-fidelity chart-to-code generation. Code and dataset are available at: this https URL
中文摘要 视觉语言模型（VLM）在从图表图像生成绘图代码方面展现出潜力，但实现结构保真度仍具挑战性。现有方法主要依赖监督微调，鼓励表层的词元（token）模仿，而非对底层图表结构的忠实建模，这常导致输出出现幻觉或语义不一致。我们提出了图表规范（Chart Specification），一种结构化的中间表示，将训练从文本模仿转向基于语义的监督。图表规范过滤语法噪声，构建结构平衡的训练集，并支持Spec-Align Reward，提供细粒度、可验证的结构正确性反馈，使强化学习能够强制保持一致的绘图逻辑。在三个公开基准上进行的实验表明，我们的方法始终优于以往的方法。仅用3K训练样本，我们就实现了很高的数据效率，在复杂基准上领先主要基线最多61.7%，扩展到4K样本则在所有评估指标上取得了新的最先进结果。总体而言，我们的结果表明，精确的结构监督为高保真图表到代码生成提供了高效的途径。代码和数据集可在以下 https URL 获取

Resource-Efficient Model-Free Reinforcement Learning for Board Games

面向棋类游戏的资源高效无模型强化学习

Authors: Kazuki Ota, Takayuki Osa, Motoki Omura, Tatsuya Harada
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.10894
Pdf link: https://arxiv.org/pdf/2602.10894
Abstract Board games have long served as complex decision-making benchmarks in artificial intelligence. In this field, search-based reinforcement learning methods such as AlphaZero have achieved remarkable success. However, their significant computational demands have been pointed out as barriers to their reproducibility. In this study, we propose a model-free reinforcement learning algorithm designed for board games to achieve more efficient learning. To validate the efficiency of the proposed method, we conducted comprehensive experiments on five board games: Animal Shogi, Gardner Chess, Go, Hex, and Othello. The results demonstrate that the proposed method achieves more efficient learning than existing methods across these environments. In addition, our extensive ablation study shows the importance of core techniques used in the proposed method. We believe that our efficient algorithm shows the potential of model-free reinforcement learning in domains traditionally dominated by search-based methods.
中文摘要 桌游长期以来一直是人工智能复杂决策的基准。在该领域，基于搜索的强化学习方法如AlphaZero取得了显著成功。然而，它们显著的计算需求被指出是其可重复性的障碍。本研究提出一种无模型强化学习算法，专为棋类游戏设计，以实现更高效的学习。为了验证该方法的有效性，我们在五款棋类游戏上进行了全面的实验：动物将棋、加德纳国际象棋、围棋、六边棋和奥赛罗。结果表明，该方法在这些环境中比现有方法实现的学习效率更高。此外，我们广泛的消融研究显示了所提方法中核心技术的重要性。我们认为，我们高效的算法展示了在传统上以搜索方法为主导的领域中，无模型强化学习的潜力。

Near-Constant Strong Violation and Last-Iterate Convergence for Online CMDPs via Decaying Safety Margins

基于衰减安全裕度的在线CMDP近常数强约束违反与末次迭代收敛

Authors: Qian Zuo, Zhiyong Wang, Fengxiang He
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.10917
Pdf link: https://arxiv.org/pdf/2602.10917
Abstract We study safe online reinforcement learning in Constrained Markov Decision Processes (CMDPs) under strong regret and violation metrics, which forbid error cancellation over time. Existing primal-dual methods that achieve sublinear strong reward regret inevitably incur growing strong constraint violation or are restricted to average-iterate convergence due to inherent oscillations. To address these limitations, we propose the Flexible safety Domain Optimization via Margin-regularized Exploration (FlexDOME) algorithm, the first to provably achieve near-constant $\tilde{O}(1)$ strong constraint violation alongside sublinear strong regret and non-asymptotic last-iterate convergence. FlexDOME incorporates time-varying safety margins and regularization terms into the primal-dual framework. Our theoretical analysis relies on a novel term-wise asymptotic dominance strategy, where the safety margin is rigorously scheduled to asymptotically majorize the functional decay rates of the optimization and statistical errors, thereby clamping cumulative violations to a near-constant level. Furthermore, we establish non-asymptotic last-iterate convergence guarantees via a policy-dual Lyapunov argument. Experiments corroborate our theoretical findings.
中文摘要 我们在强遗憾和强违反度量下研究约束马尔可夫决策过程（CMDP）中的安全在线强化学习，这类度量不允许误差随时间相互抵消。现有实现亚线性强奖励遗憾的原始对偶方法，要么不可避免地产生不断增长的强约束违反，要么因固有振荡而仅限于平均迭代收敛。为解决这些限制，我们提出了基于裕度正则化探索的灵活安全域优化（FlexDOME）算法，这是首个可证明同时实现近常数 $\tilde{O}(1)$ 强约束违反、亚线性强遗憾和非渐近末次迭代收敛的算法。FlexDOME 将随时间变化的安全裕度和正则化项纳入原始对偶框架。我们的理论分析依赖于一种新的逐项渐近占优策略，其中安全裕度经过严格调度，使其渐近地压过优化误差和统计误差的衰减速率，从而将累积违反限制在近常数水平。此外，我们通过策略-对偶李雅普诺夫论证建立了非渐近的末次迭代收敛保证。实验印证了我们的理论发现。

Multi-Task Reinforcement Learning of Drone Aerobatics by Exploiting Geometric Symmetries

利用几何对称性实现无人机特技飞行多任务强化学习

Authors: Zhanyu Guo, Zikang Yin, Guobin Zhu, Shiliang Guo, Shiyu Zhao
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.10997
Pdf link: https://arxiv.org/pdf/2602.10997
Abstract Flight control for autonomous micro aerial vehicles (MAVs) is evolving from steady flight near equilibrium points toward more aggressive aerobatic maneuvers, such as flips, rolls, and Power Loop. Although reinforcement learning (RL) has shown great potential in these tasks, conventional RL methods often suffer from low data efficiency and limited generalization. This challenge becomes more pronounced in multi-task scenarios where a single policy is required to master multiple maneuvers. In this paper, we propose a novel end-to-end multi-task reinforcement learning framework, called GEAR (Geometric Equivariant Aerobatics Reinforcement), which fully exploits the inherent SO(2) rotational symmetry in MAV dynamics and explicitly incorporates this property into the policy network architecture. By integrating an equivariant actor network, FiLM-based task modulation, and a multi-head critic, GEAR achieves both efficiency and flexibility in learning diverse aerobatic maneuvers, enabling a data-efficient, robust, and unified framework for aerobatic control. GEAR attains a 98.85% success rate across various aerobatic tasks, significantly outperforming baseline methods. In real-world experiments, GEAR demonstrates stable execution of multiple maneuvers and the capability to combine basic motion primitives to complete complex aerobatics.
中文摘要 自主微型飞行器（MAVs）的飞行控制正从平衡点附近的稳定飞行，向更激进的特技机动发展，如翻转、滚转和Power Loop。尽管强化学习（RL）在这些任务中展现出巨大潜力，但传统强化学习方法常常存在数据效率低和泛化能力有限的问题。在需要单一策略掌握多种机动的多任务场景中，这一挑战更为突出。本文提出了一种新的端到端多任务强化学习框架，称为GEAR（Geometric Equivariant Aerobatics Reinforcement），它充分利用MAV动力学中固有的SO(2)旋转对称性，并明确将该特性纳入策略网络架构。通过整合等变actor网络、基于FiLM的任务调制和多头critic，GEAR在学习多样特技机动时兼具效率与灵活性，构成了一个数据高效、稳健且统一的特技控制框架。GEAR在多种特技飞行任务中达到98.85%的成功率，显著优于基线方法。在真实世界实验中，GEAR展示了多种机动的稳定执行能力，以及组合基本运动基元完成复杂特技的能力。

Fine-Tuning GPT-5 for GPU Kernel Generation

微调GPT-5以生成GPU内核

Authors: Ali Tehrani, Yahya Emara, Essam Wissam, Wojciech Paluch, Waleed Atallah, Łukasz Dudziak, Mohamed S. Abdelfattah
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.11000
Pdf link: https://arxiv.org/pdf/2602.11000
Abstract Developing efficient GPU kernels is essential for scaling modern AI systems, yet it remains a complex task due to intricate hardware architectures and the need for specialized optimization expertise. Although Large Language Models (LLMs) demonstrate strong capabilities in general sequential code generation, they face significant challenges in GPU code generation because of the scarcity of high-quality labeled training data, compiler biases when generating synthetic solutions, and limited generalization across hardware generations. This precludes supervised fine-tuning (SFT) as a scalable methodology for improving current LLMs. In contrast, reinforcement learning (RL) offers a data-efficient and adaptive alternative but requires access to relevant tools, careful selection of training problems, and a robust evaluation environment. We present Makora's environment and tools for reinforcement learning finetuning of frontier models and report our results from fine-tuning GPT-5 for Triton code generation. In the single-attempt setting, our fine-tuned model improves kernel correctness from 43.7% to 77.0% (+33.3 percentage points) and increases the fraction of problems outperforming TorchInductor from 14.8% to 21.8% (+7 percentage points) compared to baseline GPT-5, while exceeding prior state-of-the-art models on KernelBench. When integrated into a full coding agent, it is able to solve up to 97.4% of problems in an expanded KernelBench suite, outperforming the PyTorch TorchInductor compiler on 72.9% of problems with a geometric mean speedup of 2.12x. Our work demonstrates that targeted post-training with reinforcement learning can unlock LLM capabilities in highly specialized technical domains where traditional supervised learning is limited by data availability, opening new pathways for AI-assisted accelerator programming.
中文摘要 开发高效的GPU内核对于扩展现代AI系统至关重要，但由于硬件架构复杂且需要专门的优化经验，这仍然是一项复杂的任务。尽管大型语言模型（LLMs）在一般的顺序代码生成方面表现出强大能力，但在GPU代码生成方面面临重大挑战，原因是高质量的标注训练数据稀缺、生成合成解时存在编译器偏差，以及跨硬件世代的泛化能力有限。这使得监督微调（SFT）无法成为改进现有大型语言模型的可扩展方法。相比之下，强化学习（RL）提供了数据高效且自适应的替代方案，但需要能够使用相关工具、精心选择训练问题以及稳健的评估环境。我们介绍了Makora用于对前沿模型进行强化学习微调的环境和工具，并报告了我们针对Triton代码生成微调GPT-5的结果。在单次尝试设置下，与基线GPT-5相比，我们微调后的模型将内核正确率从43.7%提升至77.0%（+33.3个百分点），并将性能优于TorchInductor的问题比例从14.8%提升到21.8%（+7个百分点），同时超越了KernelBench上此前最先进的模型。当集成到完整的编码智能体中时，它能够解决扩展版 KernelBench 套件中高达 97.4% 的问题，在 72.9% 的问题上优于 PyTorch TorchInductor 编译器，几何平均加速比为 2.12 倍。我们的研究表明，基于强化学习的针对性后训练能够在传统监督学习受限于数据可用性的高度专业化技术领域释放LLM的能力，为AI辅助的加速器编程开辟新路径。
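
The three kinds of figures quoted in this abstract (correctness rate, fraction of problems faster than a baseline, geometric mean speedup) can be computed as in the sketch below. The exact definitions used by the authors are not given in the abstract. Here it is assumed that both fractions are taken over all problems and that the geometric mean is taken over correct kernels only; `summarize` and the `(correct, speedup)` tuple layout are hypothetical.

```python
import math
from typing import List, Tuple


def summarize(results: List[Tuple[bool, float]]) -> Tuple[float, float, float]:
    """results holds one (correct, speedup_vs_baseline) pair per problem.

    Returns (correctness rate, fraction of problems faster than the baseline,
    geometric mean speedup over correct kernels).
    """
    n = len(results)
    speedups = [s for ok, s in results if ok]  # incorrect kernels carry no speedup
    correctness = len(speedups) / n
    faster = sum(1 for s in speedups if s > 1.0) / n
    if speedups:
        # Geometric mean = exp(mean(log(s))), which averages ratios fairly.
        geo_mean = math.exp(sum(math.log(s) for s in speedups) / len(speedups))
    else:
        geo_mean = float("nan")
    return correctness, faster, geo_mean


correctness, faster, geo_mean = summarize(
    [(True, 2.0), (True, 0.5), (False, 0.0), (True, 4.0)]
)
```

The geometric mean is the usual choice for speedup ratios because a 2x gain and a 0.5x loss cancel out, which an arithmetic mean would not reflect.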

Divide, Harmonize, Then Conquer It: Shooting Multi-Commodity Flow Problems with Multimodal Language Models

分解、协调、再攻克：用多模态语言模型求解多商品流问题

Authors: Xinyu Yuan, Yan Qiao, Zonghui Wang, Wenzhi Chen
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.11057
Pdf link: https://arxiv.org/pdf/2602.11057
Abstract The multi-commodity flow (MCF) problem is a fundamental topic in network flow and combinatorial optimization, with broad applications in transportation, communication, and logistics. Nowadays, the rapid expansion of allocation systems has posed challenges for existing optimization engines in balancing optimality and tractability. In this paper, we present Pram, the first ML-based method that leverages the reasoning power of multimodal language models (MLMs) to address this trade-off dilemma, a pressing need for service providers. As part of our proposal, Pram (i) quickly computes high-quality allocations by dividing the original problem into local subproblems, which are then resolved by an MLM-powered "agent", and (ii) ensures global consistency by harmonizing these subproblems via a multi-agent reinforcement learning algorithm. Theoretically, we show that Pram, which learns to perform gradient descent in context, provably converges to the optimum within the family of MCF problems. Empirically, on real-world datasets and public topologies, Pram achieves performance comparable to, and in some cases even surpassing, linear programming solvers (very close to the optimal solution), with substantially lower runtimes (1 to 2 orders of magnitude faster). Moreover, Pram exhibits strong robustness (<10% performance degradation under link failures or flow bursts), demonstrating MLM's generalization ability to unforeseen events. Pram is objective-agnostic and seamlessly integrates with mainstream allocation systems, providing a practical and scalable solution for future networks.
中文摘要 多商品流（MCF）问题是网络流和组合优化中的一个基础课题，广泛应用于交通、通信和物流等领域。如今，分配系统的快速扩张给现有优化引擎在平衡最优性和可处理性方面带来了挑战。本文介绍了Pram，这是首个利用多模态语言模型（MLMs）的推理能力来解决这一权衡困境的机器学习方法，而这正是服务提供商的迫切需求。具体而言，Pram（i）将原始问题划分为局部子问题，并由 MLM 驱动的“智能体”求解这些子问题，从而快速计算出高质量的分配；（ii）通过多智能体强化学习算法协调这些子问题，确保全局一致性。理论上，我们证明了Pram学会在上下文中执行梯度下降，并且在MCF问题族内可证明地收敛到最优解。实证上，在真实数据集和公共拓扑上，Pram 的性能可与线性规划求解器相当，在某些情况下甚至超越（非常接近最优解），且运行时间大幅降低（快 1 到 2 个数量级）。此外，Pram表现出很强的鲁棒性（在链路故障或流量突发下性能下降<10%），展示了MLM对未预见事件的泛化能力。Pram 与优化目标无关，能够与主流分配系统无缝集成，为未来网络提供实用且可扩展的解决方案。

Simultaneous Speech-to-Speech Translation Without Aligned Data

无对齐数据的语音转语音同步翻译

Authors: Tom Labiausse, Romain Fabre, Yannick Estève, Alexandre Défossez, Neil Zeghidour
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2602.11072
Pdf link: https://arxiv.org/pdf/2602.11072
Abstract Simultaneous speech translation requires translating source speech into a target language in real-time while handling non-monotonic word dependencies. Traditional approaches rely on supervised training with word-level aligned data, which is difficult to collect at scale and thus depends on synthetic alignments using language-specific heuristics that are suboptimal. We propose Hibiki-Zero, which eliminates the need for word-level alignments entirely. This fundamentally simplifies the training pipeline and enables seamless scaling to diverse languages with varying grammatical structures, removing the bottleneck of designing language-specific alignment heuristics. We first train on sentence-level aligned data to learn speech translation at high latency, then apply a novel reinforcement learning strategy using GRPO to optimize latency while preserving translation quality. Hibiki-Zero achieves state-of-the-art performance in translation accuracy, latency, voice transfer, and naturalness across five X-to-English tasks. Moreover, we demonstrate that our model can be adapted to support a new input language with less than 1000h of speech. We provide examples, model weights, inference code and we release a benchmark containing 45h of multilingual data for speech translation evaluation.
中文摘要 同声语音翻译需要在处理非单调词依赖关系的同时，将源语音实时翻译成目标语言。传统方法依赖于用词级对齐数据进行监督训练，这类数据难以大规模收集，因此只能依靠基于特定语言启发式规则的合成对齐，而这些对齐并非最优。我们提出了Hibiki-Zero，完全消除了对词级对齐的需求。这从根本上简化了训练流程，并能够无缝扩展到语法结构各异的多种语言，消除了设计特定语言对齐启发式规则的瓶颈。我们首先在句子级对齐数据上训练，以高延迟学习语音翻译，然后采用一种基于GRPO的新型强化学习策略，在保持翻译质量的同时优化延迟。Hibiki-Zero在五项X到英语任务中，在翻译准确性、延迟、音色迁移和自然度方面达到了最先进的性能。此外，我们证明了只需不到1000小时的语音，我们的模型就可以适配支持一种新的输入语言。我们提供了示例、模型权重和推理代码，并发布了一个包含45小时多语言数据的基准，用于语音翻译评估。

Chatting with Images for Introspective Visual Thinking

与图片聊天以促进内省视觉思维

Authors: Junfei Wu, Jian Guan, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, Tienie Tan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.11073
Pdf link: https://arxiv.org/pdf/2602.11073
Abstract Current large vision-language models (LVLMs) typically rely on text-only reasoning based on a single-pass visual encoding, which often leads to loss of fine-grained visual information. The recent proposal of "thinking with images" attempts to alleviate this limitation by manipulating images via external tools or code; however, the resulting visual states are often insufficiently grounded in linguistic semantics, impairing effective cross-modal alignment, particularly when visual semantics or geometric relationships must be reasoned over across distant regions or multiple images. To address these challenges, we propose "chatting with images", a new framework that reframes visual manipulation as language-guided feature modulation. Under the guidance of expressive language prompts, the model dynamically performs joint re-encoding over multiple image regions, enabling tighter coupling between linguistic reasoning and visual state updates. We instantiate this paradigm in ViLaVT, a novel LVLM equipped with a dynamic vision encoder explicitly designed for such interactive visual reasoning, and train it with a two-stage curriculum combining supervised fine-tuning and reinforcement learning to promote effective reasoning behaviors. Extensive experiments across eight benchmarks demonstrate that ViLaVT achieves strong and consistent improvements, with particularly pronounced gains on complex multi-image and video-based spatial reasoning tasks.
中文摘要 当前的大型视觉语言模型（LVLM）通常依赖基于单次视觉编码的纯文本推理，这常导致细粒度视觉信息的丢失。最近提出的“用图像思考”试图通过外部工具或代码操作图像来缓解这一限制；然而，由此产生的视觉状态往往缺乏足够的语言语义基础，妨碍了有效的跨模态对齐，尤其是在需要跨越相距较远的区域或多幅图像推理视觉语义或几何关系时。为应对这些挑战，我们提出了“与图像对话”这一新框架，将视觉操作重新定义为语言引导的特征调制。在富有表达力的语言提示引导下，模型动态地对多个图像区域进行联合重编码，使语言推理与视觉状态更新之间的耦合更加紧密。我们在ViLaVT中实现了这一范式，这是一款新型LVLM，配备了专门为这种交互式视觉推理设计的动态视觉编码器，并通过结合监督微调和强化学习的两阶段课程进行训练，以促进有效的推理行为。在八个基准上的大量实验表明，ViLaVT取得了显著且一致的提升，在复杂的多图像和基于视频的空间推理任务上增益尤为明显。

RISE: Self-Improving Robot Policy with Compositional World Model

RISE：基于组合式世界模型的自我改进机器人策略

Authors: Jiazhi Yang, Kunyang Lin, Jinwei Li, Wencong Zhang, Tianwei Lin, Longyan Wu, Zhizhong Su, Hao Zhao, Ya-Qin Zhang, Li Chen, Ping Luo, Xiangyu Yue, Hongyang Li
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.11075
Pdf link: https://arxiv.org/pdf/2602.11075
Abstract Despite the sustained scaling on model capacity and data acquisition, Vision-Language-Action (VLA) models remain brittle in contact-rich and dynamic manipulation tasks, where minor execution deviations can compound into failures. While reinforcement learning (RL) offers a principled path to robustness, on-policy RL in the physical world is constrained by safety risk, hardware cost, and environment reset. To bridge this gap, we present RISE, a scalable framework of robotic reinforcement learning via imagination. At its core is a Compositional World Model that (i) predicts multi-view future via a controllable dynamics model, and (ii) evaluates imagined outcomes with a progress value model, producing informative advantages for the policy improvement. Such compositional design allows state and value to be tailored by best-suited yet distinct architectures and objectives. These components are integrated into a closed-loop self-improving pipeline that continuously generates imaginary rollouts, estimates advantages, and updates the policy in imaginary space without costly physical interaction. Across three challenging real-world tasks, RISE yields significant improvement over prior art, with more than +35% absolute performance increase in dynamic brick sorting, +45% for backpack packing, and +35% for box closing, respectively.
中文摘要 尽管模型容量和数据采集持续扩展，视觉-语言-动作（VLA）模型在接触丰富的动态操作任务中仍显脆弱，轻微的执行偏差可能累积成失败。虽然强化学习（RL）为提升鲁棒性提供了一条有原则的路径，但物理世界中的同策略强化学习受到安全风险、硬件成本和环境重置的限制。为弥合这一差距，我们提出了RISE，一个通过想象进行机器人强化学习的可扩展框架。其核心是一个组合式世界模型，它（i）通过可控动力学模型预测多视角的未来，（ii）通过进度价值模型评估想象的结果，为策略改进提供有信息量的优势估计。这种组合式设计允许状态和价值分别由最适合但彼此不同的架构和目标来建模。这些组件被集成到一个闭环的自我改进流程中，持续生成想象的轨迹（rollouts）、估计优势，并在想象空间中更新策略，而无需昂贵的物理交互。在三项具有挑战性的真实世界任务中，RISE相较于现有技术取得了显著提升，动态积木分拣的绝对性能提升超过35%，背包打包提升45%，箱子关闭提升35%。

Interpretable Attention-Based Multi-Agent PPO for Latency Spike Resolution in 6G RAN Slicing

可解释的基于注意力的多智能体PPO，用于解决6G RAN切片中的延迟尖峰

Authors: Kavan Fatehi, Mostafa Rahmani Ghourtani, Amir Sonee, Poonam Yadav, Alessandra M Russo, Hamed Ahmadi, Radu Calinescu
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2602.11076
Pdf link: https://arxiv.org/pdf/2602.11076
Abstract Sixth-generation (6G) radio access networks (RANs) must enforce strict service-level agreements (SLAs) for heterogeneous slices, yet sudden latency spikes remain difficult to diagnose and resolve with conventional deep reinforcement learning (DRL) or explainable RL (XRL). We propose Attention-Enhanced Multi-Agent Proximal Policy Optimization (AE-MAPPO), which integrates six specialized attention mechanisms into multi-agent slice control and surfaces them as zero-cost, faithful explanations. The framework operates across O-RAN timescales with a three-phase strategy: predictive, reactive, and inter-slice optimization. A URLLC case study shows AE-MAPPO resolves a latency spike in 18 ms, restores latency to 0.98 ms with 99.9999% reliability, and reduces troubleshooting time by 93% while maintaining eMBB and mMTC continuity. These results confirm AE-MAPPO's ability to combine SLA compliance with inherent interpretability, enabling trustworthy and real-time automation for 6G RAN slicing.
中文摘要 第六代（6G）无线接入网络（RAN）必须为异构切片严格执行服务水平协议（SLA），但突发的延迟尖峰仍难以用传统深度强化学习（DRL）或可解释强化学习（XRL）来诊断和解决。我们提出了注意力增强多智能体近端策略优化（AE-MAPPO），将六种专门的注意力机制整合进多智能体切片控制，并将其作为零成本、忠实的解释呈现出来。该框架跨越O-RAN各时间尺度运行，采用三阶段策略：预测优化、反应优化和切片间优化。一项URLLC案例研究显示，AE-MAPPO在18毫秒内解决了延迟尖峰，将延迟恢复至0.98毫秒，可靠性达到99.9999%，并将故障排查时间减少93%，同时保持eMBB和mMTC的连续性。这些结果证实了AE-MAPPO能够将SLA合规性与内在可解释性结合，实现可信且实时的6G RAN切片自动化。

DataChef: Cooking Up Optimal Data Recipes for LLM Adaptation via Reinforcement Learning

DataChef：通过强化学习为LLM适配打造最佳数据配方

Authors: Yicheng Chen, Zerun Ma, Xinchen Xie, Yining Li, Kai Chen
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.11089
Pdf link: https://arxiv.org/pdf/2602.11089
Abstract In the current landscape of Large Language Models (LLMs), the curation of large-scale, high-quality training data is a primary driver of model performance. A key lever is the data recipe, which comprises a data processing pipeline to transform raw sources into training corpora. Despite the growing use of LLMs to automate individual data processing steps, such as data synthesis and filtering, the overall design of data recipes remains largely manual and labor-intensive, requiring substantial human expertise and iteration. To bridge this gap, we formulate end-to-end data recipe generation for LLM adaptation. Given a target benchmark and a pool of available data sources, a model is required to output a complete data recipe that adapts a base LLM to the target task. We present DataChef-32B, which performs online reinforcement learning using a proxy reward that predicts downstream performance for candidate recipes. Across six held-out tasks, DataChef-32B produces practical recipes that reach comparable downstream performance to those curated by human experts. Notably, the recipe from DataChef-32B adapts Qwen3-1.7B-Base to the math domain, achieving 66.7 on AIME'25 and surpassing Qwen3-1.7B. This work sheds new light on automating LLM training and developing self-evolving AI systems.
中文摘要 在当前大型语言模型（LLMs）领域，构建大规模高质量训练数据是模型性能的主要驱动力。一个关键杠杆是数据配方（data recipe），它包含一条将原始数据源转换为训练语料库的数据处理流水线。尽管越来越多地使用大型语言模型来自动化单个数据处理步骤（如数据合成和过滤），但数据配方的整体设计仍然主要依靠人工且费时费力，需要大量人类专业知识和反复迭代。为了弥合这一差距，我们提出了面向LLM适配的端到端数据配方生成任务。给定一个目标基准和可用数据源池，模型需要输出一个完整的数据配方，使基础LLM适配目标任务。我们提出了DataChef-32B，它使用一种代理奖励进行在线强化学习，该奖励用于预测候选配方的下游性能。在六个留出（held-out）任务上，DataChef-32B生成的实用配方达到了与人类专家设计的配方相当的下游性能。值得注意的是，DataChef-32B的配方将Qwen3-1.7B-Base适配到数学领域，在AIME'25上达到66.7，超过了Qwen3-1.7B。这项工作为自动化LLM训练和开发自我进化的人工智能系统带来了新的视角。

Safety Recovery in Reasoning Models Is Only a Few Early Steering Steps Away

推理模型中的安全恢复只需几步早期引导

Authors: Soumya Suvra Ghosal, Souradip Chakraborty, Vaibhav Singh, Furong Huang, Dinesh Manocha, Amrit Singh Bedi
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.11096
Pdf link: https://arxiv.org/pdf/2602.11096
Abstract Reinforcement learning (RL) based post-training for explicit chain-of-thought (e.g., GRPO) improves the reasoning ability of multimodal large-scale reasoning models (MLRMs). But recent evidence shows that it can simultaneously degrade safety alignment and increase jailbreak success rates. We propose SafeThink, a lightweight inference-time defense that treats safety recovery as a satisficing constraint rather than a maximization objective. SafeThink monitors the evolving reasoning trace with a safety reward model and conditionally injects an optimized short corrective prefix ("Wait, think safely") only when the safety threshold is violated. In our evaluations across six open-source MLRMs and four jailbreak benchmarks (JailbreakV-28K, Hades, FigStep, and MM-SafetyBench), SafeThink reduces attack success rates by 30-60% (e.g., LlamaV-o1: 63.33% to 5.74% on JailbreakV-28K, R1-Onevision: 69.07% to 5.65% on Hades) while preserving reasoning performance (MathVista accuracy: 65.20% to 65.00%). A key empirical finding from our experiments is that safety recovery is often only a few steering steps away: intervening in the first 1-3 reasoning steps typically suffices to redirect the full generation toward safe completions.
中文摘要 基于强化学习（RL）、面向显式思维链的后训练（如GRPO）提升了多模态大规模推理模型（MLRM）的推理能力。但最新证据显示，它可能同时削弱安全对齐并提高越狱成功率。我们提出了SafeThink，一种轻量级的推理时防御方法，将安全恢复视为一个满意化（satisficing）约束，而非最大化目标。SafeThink通过安全奖励模型监控不断演变的推理轨迹，仅在违反安全阈值时才有条件地注入一个经过优化的简短纠正前缀（“Wait, think safely”）。在我们对六个开源MLRM和四个越狱基准（JailbreakV-28K、Hades、FigStep和MM-SafetyBench）的评估中，SafeThink将攻击成功率降低了30%-60%（例如，LlamaV-o1在JailbreakV-28K上从63.33%降至5.74%，R1-Onevision在Hades上从69.07%降至5.65%），同时保持推理性能（MathVista准确率：65.20%至65.00%）。我们实验的一个关键实证发现是，安全恢复往往只需几步引导：在前1-3个推理步骤中进行干预，通常足以将整个生成过程引向安全的输出。
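
The monitor-and-inject loop described in the SafeThink abstract might look roughly like the sketch below. Only the prefix string and the idea of intervening in the first few steps come from the abstract. The `generate_step` and `safety_score` callables, the `threshold`, `max_steps`, and `steer_window` parameters, and the choice to discard the flagged step and regenerate it are assumptions made for illustration.

```python
from typing import Callable, List, Optional

CORRECTIVE_PREFIX = "Wait, think safely"  # prefix quoted in the abstract


def steer_reasoning(
    generate_step: Callable[[List[str]], Optional[str]],  # hypothetical: next step, or None when done
    safety_score: Callable[[List[str]], float],           # hypothetical: safety reward model on a trace
    threshold: float = 0.5,   # assumption: scores below this violate the constraint
    max_steps: int = 8,       # assumption: cap on reasoning steps
    steer_window: int = 3,    # intervene only in the first few steps
) -> List[str]:
    """Generate a reasoning trace, injecting the corrective prefix only when
    an early step falls below the safety threshold."""
    trace: List[str] = []
    for step in range(max_steps):
        text = generate_step(trace)
        if text is None:
            break
        if step < steer_window and safety_score(trace + [text]) < threshold:
            # Assumption: drop the flagged step, inject the prefix, regenerate.
            trace.append(CORRECTIVE_PREFIX)
            text = generate_step(trace)
            if text is None:
                break
        trace.append(text)
    return trace
```

Treating safety as a threshold test rather than a score to maximize means that traces already above the threshold are left untouched, which matches the abstract's framing of a satisficing constraint.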

Asymmetric Prompt Weighting for Reinforcement Learning with Verifiable Rewards

非对称提示加权用于可验证奖励的强化学习

Authors: Reinhard Heckel, Mahdi Soltanolkotabi, Christos Thramboulidis
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.11128
Pdf link: https://arxiv.org/pdf/2602.11128
Abstract Reinforcement learning with verifiable rewards has driven recent advances in LLM post-training, in particular for reasoning. Policy optimization algorithms generate a number of responses for a given prompt and then effectively weight the corresponding gradients depending on the rewards. The most popular algorithms including GRPO, DAPO, and RLOO focus on ambiguous prompts, i.e., prompts with intermediate success probability, while downgrading gradients with very easy and very hard prompts. In this paper, we consider asymmetric prompt weightings that assign higher weights to prompts with low, or even zero, empirical success probability. We find that asymmetric weighting particularly benefits from-scratch RL (as in R1-Zero), where training traverses a wide accuracy range, and less so in post-SFT RL where the model already starts at high accuracy. We also provide theory that characterizes prompt weights which minimize the time needed to raise success probability from an initial level to a target accuracy under a fixed update budget. In low-success regimes, where informative responses are rare and response cost dominates, these optimal weights become asymmetric, upweighting low success probabilities and thereby accelerating effective-time convergence.
中文摘要 带有可验证奖励的强化学习推动了LLM后训练的最新进展，尤其是在推理方面。策略优化算法为给定提示生成若干响应，然后根据奖励对相应的梯度进行实质上的加权。包括GRPO、DAPO和RLOO在内的最流行算法侧重于模糊提示，即成功概率处于中间水平的提示，同时降低非常简单和非常困难的提示的梯度权重。本文探讨了非对称提示加权，即对经验成功概率低甚至为零的提示赋予更高权重。我们发现，非对称加权尤其有利于从零开始的强化学习（如R1-Zero），因为训练跨越的准确率范围较宽；而在SFT之后的强化学习中收益较小，因为模型起始准确率已经很高。我们还给出了刻画提示权重的理论，这些权重能在固定更新预算下，使将成功概率从初始水平提升到目标准确率所需的时间最小化。在低成功率区间，有信息量的响应稀少且响应成本占主导，这些最优权重变得不对称，提高低成功概率提示的权重，从而加速有效时间收敛。
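
The contrast between symmetric and asymmetric prompt weighting can be made concrete with a small sketch. Both weight functions are illustrative stand-ins. `symmetric_weight` uses the variance-style form p(1-p), which peaks at intermediate success probability as the abstract describes for popular algorithms, but it is not claimed to be the exact weighting of GRPO, DAPO, or RLOO. `asymmetric_weight` is a hypothetical choice and is not the optimal weight derived in the paper.

```python
from typing import List


def empirical_success(rewards: List[int]) -> float:
    """Fraction of sampled responses with a verified-correct (1) reward."""
    return sum(rewards) / len(rewards)


def symmetric_weight(p: float) -> float:
    """Illustrative symmetric weight: largest at p = 0.5, zero at p = 0 and p = 1."""
    return p * (1.0 - p)


def asymmetric_weight(p: float) -> float:
    """Hypothetical asymmetric weight: largest for prompts the model never
    solves (p = 0) and vanishing for prompts it always solves (p = 1)."""
    return 1.0 - p


# Three prompts with 8 sampled responses each: hard, ambiguous, easy.
groups = [[0, 0, 0, 0, 0, 0, 0, 1], [1, 0, 1, 0, 1, 0, 1, 0], [1, 1, 1, 1, 1, 1, 1, 0]]
for rewards in groups:
    p = empirical_success(rewards)
    print(f"p={p:.3f} symmetric={symmetric_weight(p):.3f} asymmetric={asymmetric_weight(p):.3f}")
```

Under the symmetric form the hard and easy prompts receive the same small weight, while the asymmetric form ranks the hard prompt highest, which is the regime the abstract reports as most useful for from-scratch RL.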

Data-Efficient Hierarchical Goal-Conditioned Reinforcement Learning via Normalizing Flows

通过归一化流实现数据高效的分层目标条件强化学习

Authors: Shaswat Garg, Matin Moezzi, Brandon Da Silva
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.11142
Pdf link: https://arxiv.org/pdf/2602.11142
Abstract Hierarchical goal-conditioned reinforcement learning (H-GCRL) provides a powerful framework for tackling complex, long-horizon tasks by decomposing them into structured subgoals. However, its practical adoption is hindered by poor data efficiency and limited policy expressivity, especially in offline or data-scarce regimes. In this work, normalizing flow-based hierarchical implicit Q-learning (NF-HIQL) is introduced, a novel framework that replaces unimodal Gaussian policies with expressive normalizing flow policies at both the high and low levels of the hierarchy. This design enables tractable log-likelihood computation, efficient sampling, and the ability to model rich multimodal behaviors. New theoretical guarantees are derived, including explicit KL-divergence bounds for real-valued non-volume preserving (RealNVP) policies and PAC-style sample efficiency results, showing that NF-HIQL preserves stability while improving generalization. Empirically, NF-HIQL is evaluated across diverse long-horizon tasks in locomotion, ball-dribbling, and multi-step manipulation from OGBench. NF-HIQL consistently outperforms prior goal-conditioned and hierarchical baselines, demonstrating superior robustness under limited data and highlighting the potential of flow-based architectures for scalable, data-efficient hierarchical reinforcement learning.
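Abstract (translated from Chinese) Hierarchical goal-conditioned reinforcement learning (H-GCRL) offers a powerful framework for complex, long-horizon tasks by decomposing them into structured subgoals. Its practical adoption, however, is hindered by poor data efficiency and limited policy expressivity, especially in offline or data-scarce settings. This work introduces normalizing flow-based hierarchical implicit Q-learning (NF-HIQL), a novel framework that replaces unimodal Gaussian policies with expressive normalizing flow policies at both the high and low levels of the hierarchy. The design enables tractable log-likelihood computation, efficient sampling, and modeling of rich multimodal behaviors. New theoretical guarantees are derived, including explicit KL-divergence bounds for real-valued non-volume preserving (RealNVP) policies and PAC-style sample-efficiency results, showing that NF-HIQL preserves stability while improving generalization. Empirically, NF-HIQL is evaluated on diverse long-horizon OGBench tasks in locomotion, ball-dribbling, and multi-step manipulation. NF-HIQL consistently outperforms prior goal-conditioned and hierarchical baselines, showing superior robustness under limited data and highlighting the potential of flow-based architectures for scalable, data-efficient hierarchical reinforcement learning.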
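The properties the abstract attributes to flow policies (tractable log-likelihood and efficient sampling) follow from the structure of RealNVP-style affine coupling layers. The sketch below is a minimal illustration, not the NF-HIQL implementation: it assumes a 2-D action, a single scalar conditioning feature `cond` standing in for the state and goal, and fixed hand-picked weights in place of learned networks. The names `AffineCoupling` and `FlowPolicy` are hypothetical.

```python
import math
import random

class AffineCoupling:
    """One coupling layer: one coordinate is kept, the other is scaled and shifted."""

    def __init__(self, scale_w, shift_w, flip):
        self.scale_w = scale_w  # assumed fixed weights; learned networks in practice
        self.shift_w = shift_w
        self.flip = flip        # which coordinate is kept unchanged

    def _scale_shift(self, kept, cond):
        s = math.tanh(self.scale_w[0] * kept + self.scale_w[1] * cond)  # bounded log-scale
        t = self.shift_w[0] * kept + self.shift_w[1] * cond
        return s, t

    def forward(self, x, cond):
        kept, moved = (x[1], x[0]) if self.flip else (x[0], x[1])
        s, t = self._scale_shift(kept, cond)
        moved = moved * math.exp(s) + t
        y = (moved, kept) if self.flip else (kept, moved)
        return y, s  # s is the log-determinant of this layer's Jacobian

    def inverse(self, y, cond):
        kept, moved = (y[1], y[0]) if self.flip else (y[0], y[1])
        s, t = self._scale_shift(kept, cond)
        moved = (moved - t) * math.exp(-s)
        x = (moved, kept) if self.flip else (kept, moved)
        return x, s

class FlowPolicy:
    """Stack of coupling layers over a standard normal base distribution."""

    def __init__(self, layers):
        self.layers = layers

    @staticmethod
    def _base_log_prob(z):
        return -0.5 * (z[0] ** 2 + z[1] ** 2) - math.log(2.0 * math.pi)

    def sample(self, cond, rng):
        z = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
        x, log_det = z, 0.0
        for layer in self.layers:
            x, s = layer.forward(x, cond)
            log_det += s
        return x, self._base_log_prob(z) - log_det

    def log_prob(self, action, cond):
        x, log_det = action, 0.0
        for layer in reversed(self.layers):
            x, s = layer.inverse(x, cond)
            log_det += s
        return self._base_log_prob(x) - log_det  # change-of-variables formula

policy = FlowPolicy([
    AffineCoupling((0.8, 0.5), (1.0, -0.3), flip=False),
    AffineCoupling((-0.6, 0.2), (0.4, 0.7), flip=True),
])
action, logp = policy.sample(cond=0.3, rng=random.Random(0))
print(action, logp, policy.log_prob(action, cond=0.3))
```

Sampling is one forward pass and the exact log-likelihood is one inverse pass, because each layer's Jacobian is triangular and its log-determinant is just the log-scale `s`. Alternating which coordinate is kept lets every action dimension be transformed by some layer.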

APEX: Learning Adaptive High-Platform Traversal for Humanoid Robots

APEX: Learning Adaptive High-Platform Traversal for Humanoid Robots

Authors: Yikai Wang, Tingxuan Leng, Changyi Lin, Shiqi Liu, Shir Simon, Bingqing Chen, Jonathan Francis, Ding Zhao
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.11143
Pdf link: https://arxiv.org/pdf/2602.11143
Abstract Humanoid locomotion has advanced rapidly with deep reinforcement learning (DRL), enabling robust feet-based traversal over uneven terrain. Yet platforms beyond leg length remain largely out of reach because current RL training paradigms often converge to jumping-like solutions that are high-impact, torque-limited, and unsafe for real-world deployment. To address this gap, we propose APEX, a system for perceptive, climbing-based high-platform traversal that composes terrain-conditioned behaviors: climb-up and climb-down at vertical edges, walking or crawling on the platform, and stand-up and lie-down for posture reconfiguration. Central to our approach is a generalized ratchet progress reward for learning contact-rich, goal-reaching maneuvers. It tracks the best-so-far task progress and penalizes non-improving steps, providing dense yet velocity-free supervision that enables efficient exploration under strong safety regularization. Based on this formulation, we train LiDAR-based full-body maneuver policies and reduce the sim-to-real perception gap through a dual strategy: modeling mapping artifacts during training and applying filtering and inpainting to elevation maps during deployment. Finally, we distill all six skills into a single policy that autonomously selects behaviors and transitions based on local geometry and commands. Experiments on a 29-DoF Unitree G1 humanoid demonstrate zero-shot sim-to-real traversal of 0.8 meter platforms (approximately 114% of leg length), with robust adaptation to platform height and initial pose, as well as smooth and stable multi-skill transitions.
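Abstract (translated from Chinese) Humanoid locomotion has advanced rapidly with deep reinforcement learning (DRL), enabling robust foot-based traversal over uneven terrain. Platforms taller than leg length, however, remain largely out of reach, because current RL training paradigms tend to converge to jumping-like solutions that are high-impact, torque-limited, and unsafe for real-world deployment. To close this gap, we propose APEX, a perceptive, climbing-based system for high-platform traversal that composes terrain-conditioned behaviors: climbing up and down at vertical edges, walking or crawling on the platform, and standing up and lying down to reconfigure posture. At the core of our approach is a generalized ratchet progress reward for learning contact-rich, goal-reaching maneuvers. It tracks the best task progress so far and penalizes non-improving steps, providing dense yet velocity-free supervision that enables efficient exploration under strong safety regularization. Building on this formulation, we train LiDAR-based full-body maneuver policies and narrow the sim-to-real perception gap with a dual strategy: modeling mapping artifacts during training, and filtering and inpainting elevation maps during deployment. Finally, we distill all six skills into a single policy that autonomously selects behaviors and transitions based on local geometry and commands. Experiments on a 29-DoF Unitree G1 humanoid demonstrate zero-shot sim-to-real traversal of 0.8-meter platforms (approximately 114% of leg length), with robust adaptation to platform height and initial pose and smooth, stable multi-skill transitions.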
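The ratchet progress reward is described only at a high level: it tracks the best task progress so far and penalizes non-improving steps. One plausible reading can be sketched as follows. The exact functional form, the scalar `progress` signal, and the `stall_penalty` value are assumptions made for illustration, and the class name `RatchetProgressReward` is hypothetical rather than taken from the paper.

```python
class RatchetProgressReward:
    """Reward only new best progress; apply a small penalty otherwise."""

    def __init__(self, initial_progress=0.0, stall_penalty=0.01):
        self.best = initial_progress        # best progress seen so far in the episode
        self.stall_penalty = stall_penalty  # assumed constant cost for a non-improving step

    def step(self, progress):
        """Return the reward for the current step given a scalar progress measure."""
        if progress > self.best:
            reward = progress - self.best   # pay only for the improvement over the best
            self.best = progress            # the ratchet advances and never moves back
            return reward
        return -self.stall_penalty

ratchet = RatchetProgressReward()
trajectory = [0.1, 0.3, 0.2, 0.3, 0.5]  # hypothetical progress values over five steps
rewards = [ratchet.step(p) for p in trajectory]
print(rewards)
```

In this reading the positive rewards telescope to the final best progress minus the initial value, so backing off and regaining the same ground earns nothing extra. The reward depends on progress alone rather than on a velocity target, which is consistent with the "dense yet velocity-free" description.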

Keyword: diffusion policy

No results.