Arxiv Papers of Today

Generated: 2026-01-28 16:37:07 (UTC+8); arXiv release time: 2026-01-28 20:00 EST (2026-01-29 09:00 UTC+8)

37 relevant papers today

Keyword: reinforcement learning

Variational Quantum Circuit-Based Reinforcement Learning for Dynamic Portfolio Optimization

基于变分量子电路的强化学习用于动态投资组合优化

Authors: Vincent Gurgul, Ying Chen, Stefan Lessmann
Subjects: Machine Learning (cs.LG); Computational Finance (q-fin.CP); Portfolio Management (q-fin.PM); Quantum Physics (quant-ph)
Arxiv link: https://arxiv.org/abs/2601.18811
Pdf link: https://arxiv.org/pdf/2601.18811
Abstract This paper presents a Quantum Reinforcement Learning (QRL) solution to the dynamic portfolio optimization problem based on Variational Quantum Circuits. The implemented QRL approaches are quantum analogues of the classical neural-network-based Deep Deterministic Policy Gradient and Deep Q-Network algorithms. Through an empirical evaluation on real-world financial data, we show that our quantum agents achieve risk-adjusted performance comparable to, and in some cases exceeding, that of classical Deep RL models with several orders of magnitude more parameters. In addition to improved parameter efficiency, quantum agents exhibit reduced variability across market regimes, indicating robust behaviour under changing conditions. However, while quantum circuit execution is inherently fast at the hardware level, practical deployment on cloud-based quantum systems introduces substantial latency, making end-to-end runtime currently dominated by infrastructural overhead and limiting practical applicability. Taken together, our results suggest that QRL is theoretically competitive with state-of-the-art classical reinforcement learning and may become practically advantageous as deployment overheads diminish. This positions QRL as a promising paradigm for dynamic decision-making in complex, high-dimensional, and non-stationary environments such as financial markets. The complete codebase is released as open source at: this https URL
中文摘要 本文提出了一种基于变分量子电路的量子强化学习（QRL）方案，用于求解动态投资组合优化问题。所实现的QRL方法是经典的基于神经网络的深度确定性策略梯度和深度Q网络算法的量子类比。通过对真实金融数据的实证评估，我们表明，量子智能体的风险调整后表现可与参数量高出数个数量级的经典深度强化学习模型相当，在某些情况下甚至更优。除了参数效率的提升外，量子智能体在不同市场状态下的波动性更低，表明其在变化条件下表现稳健。然而，尽管量子电路在硬件层面的执行本身很快，但在基于云的量子系统上实际部署会带来显著延迟，使得端到端运行时间目前主要由基础设施开销主导，限制了实际适用性。综合来看，我们的结果表明QRL在理论上可与最先进的经典强化学习竞争，并且随着部署开销的降低可能具备实际优势。这使QRL成为金融市场等复杂、高维且非平稳环境中动态决策的有前景范式。完整的代码库已开源发布于：此 https URL

Differential Voting: Loss Functions For Axiomatically Diverse Aggregation of Heterogeneous Preferences

差异投票：异质偏好公理性多样性聚合的损失函数

Authors: Zhiyu An, Duaa Nakshbandi, Wan Du
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.18824
Pdf link: https://arxiv.org/pdf/2601.18824
Abstract Reinforcement learning from human feedback (RLHF) implicitly aggregates heterogeneous human preferences into a single utility function, even though the underlying utilities of the participants are in practice diverse. Hence, RLHF can be viewed as a form of voting, where the aggregation mechanism is defined by the loss function. Although Arrow's Impossibility Theorem suggests that different mechanisms satisfy different sets of desirable axioms, most existing methods rely on a single aggregation principle, typically the Bradley-Terry-Luce (BTL) model, which corresponds to Borda count voting. This restricts the axiomatic properties of the learned reward and obscures the normative assumptions embedded in optimization. In this work, we introduce Differential Voting, a unifying framework that constructs instance-wise, differentiable loss functions whose population-level optima provably correspond to distinct classical voting rules. We develop differentiable surrogates for majority-based aggregation (BTL), Copeland, and Kemeny rules, and formally analyze their calibration properties, gradient fields, and limiting behavior as smoothing parameters vanish. For each loss, we establish consistency with the corresponding social choice rule and characterize the axioms it satisfies or violates. Our analysis shows how design choices in loss geometry, such as margin sensitivity and boundary concentration, directly translate into normative aggregation behavior. Differential Voting makes preference aggregation an explicit and controllable design choice in RLHF, enabling principled trade-offs between axiomatic guarantees and optimization stability. Code to reproduce our experiments is open-sourced.
中文摘要 基于人类反馈的强化学习（RLHF）隐式地将异质的人类偏好聚合为单一效用函数，尽管参与者的底层效用实际上各不相同。因此，RLHF可以被视为一种投票形式，其聚合机制由损失函数定义。尽管阿罗不可能定理表明不同机制满足不同的理想公理集合，但大多数现有方法依赖单一聚合原理，通常是布拉德利-特里-卢斯（BTL）模型，对应博尔达计数投票。这限制了所学奖励的公理性质，并掩盖了优化中嵌入的规范性假设。在本研究中，我们提出差分投票（Differential Voting），这一统一框架构建了实例级、可微的损失函数，其总体层面的最优解可证明对应于不同的经典投票规则。我们为基于多数的聚合（BTL）、Copeland和Kemeny规则开发了可微替代损失，并形式化分析了它们的校准性质、梯度场以及平滑参数趋于零时的极限行为。对于每一种损失，我们证明其与相应社会选择规则的一致性，并刻画其满足或违反的公理。我们的分析展示了损失几何中的设计选择（如间隔敏感性和边界集中度）如何直接转化为规范性聚合行为。差分投票使偏好聚合成为RLHF中显式且可控的设计选择，从而能够在公理保证与优化稳定性之间进行有原则的权衡。用于复现我们实验的代码已开源。

Analysis of Control Bellman Residual Minimization for Markov Decision Problem

马尔可夫决策问题的控制贝尔曼残差最小化分析

Authors: Donghwan Lee, Hyukjun Yang
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2601.18840
Pdf link: https://arxiv.org/pdf/2601.18840
Abstract Markov decision problems are most commonly solved via dynamic programming. Another approach is Bellman residual minimization, which directly minimizes the squared Bellman residual objective function. However, compared to dynamic programming, this approach has received relatively less attention, mainly because it is often less efficient in practice and can be more difficult to extend to model-free settings such as reinforcement learning. Nonetheless, Bellman residual minimization has several advantages that make it worth investigating, such as more stable convergence with function approximation for value functions. While Bellman residual methods for policy evaluation have been widely studied, methods for policy optimization (control tasks) have been scarcely explored. In this paper, we establish foundational results for the control Bellman residual minimization for policy optimization.
中文摘要 马尔可夫决策问题最常通过动态规划求解。另一种方法是贝尔曼残差最小化，它直接最小化平方贝尔曼残差目标函数。然而，与动态规划相比，这种方法受到的关注相对较少，主要因为它在实际操作中往往效率较低，且更难推广到强化学习等无模型设置。尽管如此，贝尔曼残差最小化具有若干值得研究的优点，例如在对价值函数使用函数近似时收敛更稳定。虽然用于策略评估的贝尔曼残差方法已被广泛研究，但用于策略优化（控制任务）的方法则鲜有探索。本文为面向策略优化的控制贝尔曼残差最小化建立了基础性结果。
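
Illustration (not from the paper): a minimal sketch of a squared control Bellman residual on a tiny tabular MDP. The two-state MDP, the 1/2 scaling, and the names `control_bellman_residual`, `P`, `R`, `Q_star`, and `Q_zero` are assumptions made for this example; the paper's exact objective may differ.

```python
def control_bellman_residual(Q, P, R, gamma):
    """Half the sum of squared control Bellman residuals over all (s, a).

    Q[s][a]     : action-value estimate
    P[s][a][s2] : probability of moving from s to s2 under action a
    R[s][a]     : expected immediate reward
    """
    total = 0.0
    for s in range(len(Q)):
        for a in range(len(Q[s])):
            # The max over next actions is what makes this the control
            # (policy optimization) residual rather than the evaluation one.
            expected_next = sum(p * max(Q[s2]) for s2, p in enumerate(P[s][a]))
            residual = Q[s][a] - (R[s][a] + gamma * expected_next)
            total += residual ** 2
    return 0.5 * total


# Hypothetical deterministic MDP with 2 states and 2 actions.
P = [[[1.0, 0.0], [0.0, 1.0]],   # state 0: action 0 stays, action 1 moves to state 1
     [[0.0, 1.0], [1.0, 0.0]]]   # state 1: action 0 stays, action 1 moves to state 0
R = [[0.0, 1.0], [2.0, 0.0]]
Q_star = [[1.5, 3.0], [4.0, 1.5]]  # optimal Q for gamma = 0.5, worked out by hand
Q_zero = [[0.0, 0.0], [0.0, 0.0]]

print(control_bellman_residual(Q_star, P, R, 0.5))  # residual vanishes at the optimum
print(control_bellman_residual(Q_zero, P, R, 0.5))
```

The residual is zero exactly at a fixed point of the Bellman optimality operator, which is why minimizing it is a candidate route to policy optimization.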

Vector-Valued Distributional Reinforcement Learning Policy Evaluation: A Hilbert Space Embedding Approach

向量值分布强化学习策略评估：希尔伯特空间嵌入方法

Authors: Mehrdad Mohammadi, Qi Zheng, Ruoqing Zhu
Subjects: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2601.18952
Pdf link: https://arxiv.org/pdf/2601.18952
Abstract We propose an (offline) multi-dimensional distributional reinforcement learning framework (KE-DRL) that leverages Hilbert space mappings to estimate the kernel mean embedding of the multi-dimensional value distribution under a proposed target policy. In our setting, the state-action variables are multi-dimensional and continuous. By mapping probability measures into a reproducing kernel Hilbert space via kernel mean embeddings, our method replaces Wasserstein metrics with an integral probability metric. This enables efficient estimation in multi-dimensional state-action spaces and reward settings, where direct computation of Wasserstein distances is computationally challenging. Theoretically, we establish contraction properties of the distributional Bellman operator under our proposed metric involving the Matern family of kernels and provide uniform convergence guarantees. Simulations and empirical results demonstrate robust off-policy evaluation and recovery of the kernel mean embedding under mild assumptions, namely, Lipschitz continuity and boundedness of the kernels, highlighting the potential of embedding-based approaches in complex real-world decision-making scenarios and risk evaluation.
中文摘要 我们提出了一种（离线）多维分布强化学习框架（KE-DRL），利用希尔伯特空间映射来估计在所提出的目标策略下多维价值分布的核均值嵌入。在我们的设置中，状态-动作变量是多维且连续的。通过核均值嵌入将概率测度映射到再生核希尔伯特空间，我们的方法用积分概率度量替代了Wasserstein度量。这使得在多维状态-动作空间和奖励设置中能够进行高效估计，而在这些设置中直接计算Wasserstein距离在计算上具有挑战性。理论上，我们在所提出的、涉及Matern核族的度量下建立了分布贝尔曼算子的压缩性质，并给出了一致收敛保证。模拟和实证结果表明，在较弱的假设（即核的Lipschitz连续性和有界性）下，该方法能够实现稳健的离策略评估并恢复核均值嵌入，凸显了基于嵌入的方法在复杂现实决策场景和风险评估中的潜力。

Save the Good Prefix: Precise Error Penalization via Process-Supervised RL to Enhance LLM Reasoning

保留正确前缀：通过过程监督强化学习实现精确的错误惩罚以增强LLM推理能力

Authors: Haolin Liu, Dian Yu, Sidi Lu, Yujun Zhou, Rui Liu, Zhenwen Liang, Haitao Mi, Chen-Yu Wei, Dong Yu
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.18984
Pdf link: https://arxiv.org/pdf/2601.18984
Abstract Reinforcement learning (RL) has emerged as a powerful framework for improving the reasoning capabilities of large language models (LLMs). However, most existing RL approaches rely on sparse outcome rewards, which fail to credit correct intermediate steps in partially successful solutions. Process reward models (PRMs) offer fine-grained step-level supervision, but their scores are often noisy and difficult to evaluate. As a result, recent PRM benchmarks focus on a more objective capability: detecting the first incorrect step in a reasoning path. However, this evaluation target is misaligned with how PRMs are typically used in RL, where their step-wise scores are treated as raw rewards to maximize. To bridge this gap, we propose Verifiable Prefix Policy Optimization (VPPO), which uses PRMs only to localize the first error during RL. Given an incorrect rollout, VPPO partitions the trajectory into a verified correct prefix and an erroneous suffix based on the first error, rewarding the former while applying targeted penalties only after the detected mistake. This design yields stable, interpretable learning signals and improves credit assignment. Across multiple reasoning benchmarks, VPPO consistently outperforms sparse-reward RL and prior PRM-guided baselines on both Pass@1 and Pass@K.
中文摘要 强化学习（RL）已成为提升大型语言模型（LLM）推理能力的强大框架。然而，大多数现有的强化学习方法依赖稀疏的结果奖励，无法对部分成功解答中正确的中间步骤给予肯定。过程奖励模型（PRM）提供细粒度的步骤级监督，但其评分往往噪声较大且难以评估。因此，近期的PRM基准测试关注一种更客观的能力：检测推理路径中的第一个错误步骤。然而，这一评估目标与PRM在强化学习中的通常用法并不一致，在后者中其逐步得分被当作需要最大化的原始奖励。为弥合这一差距，我们提出了可验证前缀策略优化（VPPO），它在强化学习过程中仅使用PRM来定位第一个错误。对于一次错误的rollout，VPPO根据第一个错误将轨迹划分为经过验证的正确前缀和错误后缀，奖励前者，并仅在检测到的错误之后施加有针对性的惩罚。这种设计产生了稳定、可解释的学习信号，并改善了信用分配。在多个推理基准测试中，VPPO在Pass@1和Pass@K上均稳定优于稀疏奖励RL和以往的PRM引导基线。
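
Illustration (not from the paper): a minimal sketch of the prefix/suffix split the abstract describes, assuming a PRM has already returned the index of the first erroneous step. The function name `prefix_suffix_rewards`, the reward values of +1 and -1, and the choice to start the suffix at the erroneous step itself are all assumptions; the abstract does not give VPPO's actual reward shaping.

```python
from typing import List, Optional


def prefix_suffix_rewards(
    num_steps: int,
    first_error: Optional[int],
    prefix_reward: float = 1.0,    # hypothetical value
    suffix_penalty: float = -1.0,  # hypothetical value
) -> List[float]:
    """Per-step rewards for one rollout split at the first detected error.

    Steps before `first_error` form the verified prefix and are rewarded;
    the erroneous step and everything after it are penalized.
    `first_error=None` means no error was located.
    """
    if first_error is None:
        return [prefix_reward] * num_steps
    if not 0 <= first_error < num_steps:
        raise ValueError("first_error must index a step of the rollout")
    return [prefix_reward] * first_error + [suffix_penalty] * (num_steps - first_error)


# A 5-step rollout whose first mistake is at step index 2.
print(prefix_suffix_rewards(5, 2))
```

Because the PRM is only asked where the first error is, its noisy per-step scores never enter the reward directly.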

A Unifying View of Coverage in Linear Off-Policy Evaluation

线性离策略评估中覆盖性的统一视角

Authors: Philip Amortila, Audrey Huang, Akshay Krishnamurthy, Nan Jiang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2601.19030
Pdf link: https://arxiv.org/pdf/2601.19030
Abstract Off-policy evaluation (OPE) is a fundamental task in reinforcement learning (RL). In the classic setting of linear OPE, finite-sample guarantees often take the form $$ \textrm{Evaluation error} \le \textrm{poly}(C^\pi, d, 1/n,\log(1/\delta)), $$ where $d$ is the dimension of the features and $C^\pi$ is a coverage parameter that characterizes the degree to which the visited features lie in the span of the data distribution. While such guarantees are well-understood for several popular algorithms under stronger assumptions (e.g. Bellman completeness), the understanding is lacking and fragmented in the minimal setting where only the target value function is linearly realizable in the features. Despite recent interest in tight characterizations of the statistical rate in this setting, the right notion of coverage remains unclear, and candidate definitions from prior analyses have undesirable properties and are starkly disconnected from more standard definitions in the literature. We provide a novel finite-sample analysis of a canonical algorithm for this setting, LSTDQ. Inspired by an instrumental-variable view, we develop error bounds that depend on a novel coverage parameter, the feature-dynamics coverage, which can be interpreted as linear coverage in an induced dynamical system for feature evolution. With further assumptions, such as Bellman-completeness, our definition successfully recovers the coverage parameters specialized to those settings, finally yielding a unified understanding for coverage in linear OPE.
中文摘要 离策略评估（OPE）是强化学习（RL）中的一项基本任务。在线性OPE的经典设置中，有限样本保证通常形如 $$ \textrm{Evaluation error} \le \textrm{poly}(C^\pi, d, 1/n,\log(1/\delta)), $$ 其中$d$是特征的维数，$C^\pi$是一个覆盖参数，用来刻画所访问的特征落在数据分布张成空间内的程度。虽然在更强的假设（如贝尔曼完备性）下，若干流行算法的此类保证已被充分理解，但在仅要求目标价值函数关于特征线性可实现的最小设置下，相关理解仍然不足且零散。尽管近来人们对刻画该设置下的紧统计速率很感兴趣，但恰当的覆盖概念仍不明确，且先前分析中的候选定义具有不良性质，并与文献中更标准的定义明显脱节。我们对该设置下的典型算法LSTDQ给出了一种新的有限样本分析。受工具变量视角的启发，我们推导了依赖于一种新覆盖参数（特征动力学覆盖）的误差界，该参数可解释为特征演化所诱导的动力系统中的线性覆盖。在进一步的假设（如贝尔曼完备性）下，我们的定义成功恢复了专用于这些设置的覆盖参数，最终形成了对线性OPE中覆盖的统一理解。
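
Illustration (not from the paper): a minimal pure-Python sketch of the textbook LSTDQ estimator that the abstract analyzes. It solves A θ = b with A = Σ φ(s,a)(φ(s,a) − γ φ(s',π(s')))ᵀ and b = Σ φ(s,a) r. The helper `solve`, the one-hot features, and the two-sample dataset are assumptions chosen so the answer can be checked by hand.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def lstdq(transitions, gamma, d):
    """LSTDQ weights from (phi, reward, phi_next) samples.

    phi      : features of the visited (s, a)
    phi_next : features of (s', pi(s')) under the target policy pi
    """
    A = [[0.0] * d for _ in range(d)]
    b = [0.0] * d
    for phi, reward, phi_next in transitions:
        for i in range(d):
            b[i] += phi[i] * reward
            for j in range(d):
                A[i][j] += phi[i] * (phi[j] - gamma * phi_next[j])
    return solve(A, b)


# Hypothetical chain: state 0 -> state 1 -> state 1, reward 1 at every step,
# one action per state, one-hot features, gamma = 0.5, so Q = 2 in both states.
transitions = [([1.0, 0.0], 1.0, [0.0, 1.0]),
               ([0.0, 1.0], 1.0, [0.0, 1.0])]
theta = lstdq(transitions, 0.5, 2)
print(theta)
```

The sketch assumes A is invertible; when the data do not cover the features the target policy visits, A becomes ill-conditioned, which is the kind of failure a coverage parameter is meant to quantify.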

Optimizing Conversational Quality in Spoken Dialogue Systems with Reinforcement Learning from AI Feedback

通过AI反馈强化学习优化口语对话系统中的会话质量

Authors: Siddhant Arora, Jinchuan Tian, Jiatong Shi, Hayato Futami, Yosuke Kashiwagi, Emiru Tsunoo, Shinji Watanabe
Subjects: Computation and Language (cs.CL); Sound (cs.SD)
Arxiv link: https://arxiv.org/abs/2601.19063
Pdf link: https://arxiv.org/pdf/2601.19063
Abstract Reinforcement learning from human or AI feedback (RLHF/RLAIF) for speech-in/speech-out dialogue systems (SDS) remains underexplored, with prior work largely limited to single semantic rewards applied at the utterance level. Such setups overlook the multi-dimensional and multi-modal nature of conversational quality, which encompasses semantic coherence, audio naturalness, speaker consistency, emotion alignment, and turn-taking behavior. Moreover, they are fundamentally mismatched with duplex spoken dialogue systems that generate responses incrementally, where agents must make decisions based on partial utterances. We address these limitations with the first multi-reward RLAIF framework for SDS, combining semantic, audio-quality, and emotion-consistency rewards. To align utterance-level preferences with incremental, blockwise decoding in duplex models, we apply turn-level preference sampling and aggregate per-block log-probabilities within a single DPO objective. We present the first systematic study of preference learning for improving SDS quality in both multi-turn Chain-of-Thought and blockwise duplex models, and release a multi-reward DPO dataset to support reproducible research. Experiments show that single-reward RLAIF selectively improves its targeted metric, while joint multi-reward training yields consistent gains across semantic quality and audio naturalness. These results highlight the importance of holistic, multi-reward alignment for practical conversational SDS.
中文摘要 来自人类或人工智能反馈的强化学习（RLHF/RLAIF）在语音输入/语音输出对话系统（SDS）中仍然缺乏探索，以往的工作主要限于在话语层面应用单一语义奖励。这种设置忽视了会话质量的多维和多模态性质，包括语义连贯性、音频自然性、说话者一致性、情绪对齐和轮流行为。此外，它们与双工口头对话系统根本不匹配，后者是逐步生成反应的，后者主体必须基于部分话语做出决策。我们通过首个多奖励RLAIF框架解决了这些局限性，结合了语义、音频质量和情感一致性奖励。为了将话语层偏好与双工模型中的增量分块解码对齐，我们在单一DPO目标内应用回合级偏好抽样和每块对数概率的聚合。我们提出了首个系统性研究，旨在提升多回合思维链和分块双工模型中的偏好学习以提升SDS质量，并发布了多奖励DPO数据集以支持可重复性研究。实验显示，单奖励RLAIF能够选择性地提升其目标指标，而联合多奖励训练则在语义质量和音频自然性方面持续获得提升。这些结果凸显了整体、多奖励对齐对实用会话SDS的重要性。

m2sv: A Scalable Benchmark for Map-to-Street-View Spatial Reasoning

m2sv：地图到街景空间推理的可扩展基准测试

Authors: Yosub Shin, Michael Buriek, Igor Molybog
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.19099
Pdf link: https://arxiv.org/pdf/2601.19099
Abstract Vision-language models (VLMs) achieve strong performance on many multimodal benchmarks but remain brittle on spatial reasoning tasks that require aligning abstract overhead representations with egocentric views. We introduce m2sv, a scalable benchmark for map-to-street-view spatial reasoning that asks models to infer camera viewing direction by aligning a north-up overhead map with a Street View image captured at the same real-world intersection. We release m2sv-20k, a geographically diverse benchmark with controlled ambiguity, along with m2sv-sft-11k, a curated set of structured reasoning traces for supervised fine-tuning. Despite strong performance on existing multimodal benchmarks, the best evaluated VLM achieves only 65.2% accuracy on m2sv, far below the human baseline of 95%. While supervised fine-tuning and reinforcement learning yield consistent gains, cross-benchmark evaluations reveal limited transfer. Beyond aggregate accuracy, we systematically analyze difficulty in map-to-street-view reasoning using both structural signals and human effort, and conduct an extensive failure analysis of adapted open models. Our findings highlight persistent gaps in geometric alignment, evidence aggregation, and reasoning consistency, motivating future work on grounded spatial reasoning across viewpoints.
中文摘要 视觉语言模型（VLM）在许多多模态基准测试中表现优异，但在需要将抽象的俯视表示与第一人称视角对齐的空间推理任务上仍然脆弱。我们引入了m2sv，这是一个可扩展的地图到街景空间推理基准，要求模型通过将上北朝向的俯视地图与在同一真实路口拍摄的街景图像对齐来推断相机的朝向。我们发布了m2sv-20k，一个地理分布多样且歧义受控的基准，以及m2sv-sft-11k，一套经过筛选整理的结构化推理轨迹，用于监督微调。尽管在现有多模态基准测试上表现优异，表现最好的受测VLM在m2sv上仅达到65.2%的准确率，远低于95%的人类基线。虽然监督微调和强化学习能带来稳定的提升，但跨基准评估显示迁移有限。除了总体准确率外，我们还利用结构信号和人工投入系统地分析了地图到街景推理的难度，并对经过适配的开放模型进行了广泛的失败分析。我们的发现凸显了几何对齐、证据聚合和推理一致性方面持续存在的差距，推动未来开展跨视角的有依据空间推理研究。

Reward Engineering for Reinforcement Learning in Software Tasks

软件任务中强化学习的奖励工程

Authors: Md Rayhanul Masud, Azmine Toushik Wasi, Salman Rahman, Md Rizwan Parvez
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2601.19100
Pdf link: https://arxiv.org/pdf/2601.19100
Abstract Reinforcement learning is increasingly used for code-centric tasks. These tasks include code generation, summarization, understanding, repair, testing, and optimization. This trend is growing faster with large language models and autonomous agents. A key challenge is how to design reward signals that make sense for software. In many RL problems, the reward is a clear number. In software, this is often not possible. The goal is rarely a single numeric objective. Instead, rewards are usually proxies. Common proxies check if the code compiles, passes tests, or satisfies quality metrics. Many reward designs have been proposed for code-related tasks. However, the work is scattered across areas and papers. There is no single survey that brings these approaches together and shows the full landscape of reward design for RL in software. In this survey, we provide the first systematic and comprehensive review of reward engineering for RL in software tasks. We focus on existing methods and techniques. We structure the literature along three complementary dimensions, summarizing the reward-design choices within each. We conclude with challenges and recommendations in the reward design space for SE tasks.
中文摘要 强化学习越来越多地被用于以代码为中心的任务。这些任务包括代码生成、摘要、理解、修复、测试和优化。随着大型语言模型和自主智能体的出现，这一趋势正在加速发展。一个关键挑战是如何设计对软件有意义的奖励信号。在许多强化学习问题中，奖励是一个明确的数值。在软件领域，这通常是不可能的。目标很少是单一的数值目标。相反，奖励通常是代理指标。常见的代理指标检查代码是否编译成功、通过测试或满足质量指标。针对代码相关任务，已有许多奖励设计被提出。然而，这些工作分散在各个领域和论文中。目前还没有一篇综述能将这些方法汇集在一起，展示软件领域强化学习奖励设计的全貌。在本综述中，我们首次对软件任务中强化学习的奖励工程进行了系统且全面的回顾。我们专注于现有的方法和技术。我们沿三个互补的维度组织文献，总结每个维度中的奖励设计选择。最后，我们给出了软件工程（SE）任务奖励设计领域的挑战和建议。

Glance and Focus Reinforcement for Pan-cancer Screening

用于泛癌筛查的“扫视与聚焦”强化学习

Authors: Linshan Wu, Jiaxin Zhuang, Hao Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.19103
Pdf link: https://arxiv.org/pdf/2601.19103
Abstract Pan-cancer screening in large-scale CT scans remains challenging for existing AI methods, primarily due to the difficulty of localizing diverse types of tiny lesions in large CT volumes. The extreme foreground-background imbalance significantly hinders models from focusing on diseased regions, while redundant focus on healthy regions not only decreases the efficiency but also increases false positives. Inspired by radiologists' glance and focus diagnostic strategy, we introduce GF-Screen, a Glance and Focus reinforcement learning framework for pan-cancer screening. GF-Screen employs a Glance model to localize the diseased regions and a Focus model to precisely segment the lesions, where segmentation results of the Focus model are leveraged to reward the Glance model via Reinforcement Learning (RL). Specifically, the Glance model crops a group of sub-volumes from the entire CT volume and learns to select the sub-volumes with lesions for the Focus model to segment. Given that the selecting operation is non-differentiable for segmentation training, we propose to employ the segmentation results to reward the Glance model. To optimize the Glance model, we introduce a novel group relative learning paradigm, which employs group relative comparison to prioritize high-advantage predictions and discard low-advantage predictions within sub-volume groups, not only improving efficiency but also reducing false positives. In this way, for the first time, we effectively extend cutting-edge RL techniques to tackle the specific challenges in pan-cancer screening. Extensive experiments on 16 internal and 7 external datasets across 9 lesion types demonstrated the effectiveness of GF-Screen. Notably, GF-Screen leads the public validation leaderboard of MICCAI FLARE25 pan-cancer challenge, surpassing the FLARE24 champion solution by a large margin (+25.6% DSC and +28.2% NSD).
中文摘要 对现有的人工智能方法而言，大规模CT扫描中的泛癌筛查仍然具有挑战性，主要原因是难以在大体积CT中定位多种类型的微小病灶。极端的前景-背景不平衡严重妨碍模型聚焦于病变区域，而对健康区域的冗余关注不仅降低效率，还会增加假阳性。受放射科医生“扫视与聚焦”诊断策略的启发，我们提出了GF-Screen，一种用于泛癌筛查的“扫视与聚焦”（Glance and Focus）强化学习框架。GF-Screen采用Glance模型定位病变区域，并采用Focus模型精确分割病灶，其中Focus模型的分割结果通过强化学习（RL）用于奖励Glance模型。具体来说，Glance模型从整个CT体积中裁剪出一组子体积，并学习选择含有病灶的子体积交由Focus模型进行分割。鉴于选择操作对分割训练而言不可微，我们提出利用分割结果来奖励Glance模型。为优化Glance模型，我们引入了一种新的组相对学习范式，利用组内相对比较，在子体积组中优先保留高优势预测并舍弃低优势预测，不仅提高了效率，还减少了假阳性。通过这种方式，我们首次有效地将前沿的强化学习技术拓展到泛癌筛查的特定挑战上。在16个内部数据集和7个外部数据集、涵盖9种病灶类型的大量实验证明了GF-Screen的有效性。值得注意的是，GF-Screen在MICCAI FLARE25泛癌挑战赛的公开验证排行榜上位居第一，大幅超越FLARE24冠军方案（DSC +25.6%，NSD +28.2%）。
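
Illustration (not from the paper): a minimal sketch of a group-relative comparison in the GRPO style, which standardizes rewards within one group and keeps only members with positive advantage. The reward values, the function names, and the "advantage > 0" keep rule are assumptions; the abstract does not specify GF-Screen's exact criterion.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Standardize rewards within one (non-empty) group: (r - mean) / std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All members scored the same, so none is preferred over another.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def keep_high_advantage(rewards: List[float]) -> List[int]:
    """Indices of group members whose advantage is strictly positive."""
    advantages = group_relative_advantages(rewards)
    return [i for i, a in enumerate(advantages) if a > 0]


# Hypothetical segmentation-based rewards for four sub-volumes of one CT scan.
rewards = [0.0, 0.2, 0.8, 1.0]
print(keep_high_advantage(rewards))
```

Comparing within a group means the selection depends only on relative reward, so a scan where every sub-volume scores poorly still yields a usable ranking.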

Exploring Weaknesses in Function Call Models via Reinforcement Learning: An Adversarial Data Augmentation Approach

通过强化学习探索函数调用模型的弱点：一种对抗性数据增强方法

Authors: Weiran Guo, Bing Bo, Shaoxiang Wu, Jingsheng Yang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.19122
Pdf link: https://arxiv.org/pdf/2601.19122
Abstract Function call capabilities have become crucial for Large Language Models (LLMs), enabling them to interact more effectively with external tools and APIs. Existing methods for improving the function call capabilities of LLMs rely on data obtained either through manual annotation or automated generation by models, and use this data to finetune the LLMs. However, these methods often lack targeted design and are constrained by fixed patterns and data distributions, which limits their effectiveness in enhancing the generalization and robustness of function call LLMs. To address this limitation, we propose a novel adversarial data augmentation method that employs reinforcement learning to systematically identify and target the weaknesses of function call LLMs. Our training framework introduces a query model trained with reinforcement learning (RL) to generate adversarial queries that are specifically designed to challenge function call (FC) models. This approach adopts a zero sum game formulation, where the query model and the FC model engage in iterative alternating training. Overall, our method advances the development of more robust FC models and provides a systematic way to identify and correct weaknesses in the ability of LLMs to interact with external tools.
中文摘要 函数调用能力对大型语言模型（LLM）来说变得至关重要，使其能够更有效地与外部工具和API交互。现有提升LLM函数调用能力的方法依赖于通过人工注释或模型自动生成获得的数据，并利用这些数据对LLM进行微调。然而，这些方法通常缺乏针对性设计，且受限于固定的模式和数据分布，限制了它们提升函数调用LLM泛化性和鲁棒性的有效性。为解决这一局限，我们提出了一种新型对抗性数据增强方法，利用强化学习系统地识别并针对函数调用LLM的弱点。我们的训练框架引入了一个通过强化学习（RL）训练的查询模型，生成专门设计用来挑战函数调用（FC）模型的对抗性查询。该方法采用零和博弈形式，查询模型与FC模型进行迭代交替训练。总体而言，我们的方法推动了更稳健的FC模型的发展，并为LLMs在与外部工具交互能力上的弱点提供了系统化的方法。

Towards Pixel-Level VLM Perception via Simple Points Prediction

通过简单点预测实现像素级VLM感知

Authors: Tianhui Song, Haoyu Lu, Hao Yang, Lin Sui, Haoning Wu, Zaida Zhou, Zhiqi Huang, Yiping Bao, Y.Charles, Xinyu Zhou, Limin Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.19228
Pdf link: https://arxiv.org/pdf/2601.19228
Abstract We present SimpleSeg, a strikingly simple yet highly effective approach to endow Multimodal Large Language Models (MLLMs) with native pixel-level perception. Our method reframes segmentation as a simple sequence generation problem: the model directly predicts sequences of points (textual coordinates) delineating object boundaries, entirely within its language space. To achieve high fidelity, we introduce a two-stage SF$\to$RL training pipeline, where Reinforcement Learning with an IoU-based reward refines the point sequences to accurately match ground-truth contours. We find that the standard MLLM architecture possesses a strong, inherent capacity for low-level perception that can be unlocked without any specialized architecture. On segmentation benchmarks, SimpleSeg achieves performance that is comparable to, and often surpasses, methods relying on complex, task-specific designs. This work lays out that precise spatial understanding can emerge from simple point prediction, challenging the prevailing need for auxiliary components and paving the way for more unified and capable VLMs. Homepage: this https URL
中文摘要 我们提出SimpleSeg，这是一种极其简单却非常有效的方法，旨在赋予多模态大型语言模型（MLLM）原生的像素级感知能力。我们的方法将分割重新表述为一个简单的序列生成问题：模型完全在其语言空间内直接预测勾勒对象边界的点序列（文本坐标）。为了实现高保真度，我们引入了两阶段SF$\to$RL训练流程，其中带有基于IoU奖励的强化学习对点序列进行精炼，使其准确匹配真实轮廓。我们发现，标准MLLM架构具有强大且固有的低层感知能力，无需任何专门架构即可释放。在分割基准测试中，SimpleSeg的性能可与依赖复杂任务特定设计的方法相当，并且常常超越它们。本研究表明，精确的空间理解可以从简单的点预测中涌现，挑战了对辅助组件的普遍需求，并为更统一、更强大的VLM铺平了道路。主页：此 https URL

Structure-based RNA Design by Step-wise Optimization of Latent Diffusion Model

通过逐步优化潜在扩散模型实现结构基RNA设计

Authors: Qi Si, Xuyang Liu, Penglei Wang, Xin Guo, Yuan Qi, Yuan Cheng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.19232
Pdf link: https://arxiv.org/pdf/2601.19232
Abstract RNA inverse folding, designing sequences to form specific 3D structures, is critical for therapeutics, gene regulation, and synthetic biology. Current methods, focused on sequence recovery, struggle to address structural objectives like secondary structure consistency (SS), minimum free energy (MFE), and local distance difference test (LDDT), leading to suboptimal structural accuracy. To tackle this, we propose a reinforcement learning (RL) framework integrated with a latent diffusion model (LDM). Drawing inspiration from the success of diffusion models in RNA inverse folding, which adeptly model complex sequence-structure interactions, we develop an LDM incorporating pre-trained RNA-FM embeddings from a large-scale RNA model. These embeddings capture co-evolutionary patterns, markedly improving sequence recovery accuracy. However, existing approaches, including diffusion-based methods, cannot effectively handle non-differentiable structural objectives. By contrast, RL excels in this task by using policy-driven reward optimization to navigate complex, non-gradient-based objectives, offering a significant advantage over traditional methods. In summary, we propose the Step-wise Optimization of Latent Diffusion Model (SOLD), a novel RL framework that optimizes single-step noise without sampling the full diffusion trajectory, achieving efficient refinement of multiple structural objectives. Experimental results demonstrate SOLD surpasses its LDM baseline and state-of-the-art methods across all metrics, establishing a robust framework for RNA inverse folding with profound implications for biotechnological and therapeutic applications.
中文摘要 RNA逆折叠设计序列以形成特定的三维结构，对治疗、基因调控和合成生物学至关重要。目前专注于序列恢复的方法难以满足诸如二级结构一致性（SS）、最小自由能（MFE）和局部距离差检验（LDDT）等结构目标，导致结构精度不理想。为此，我们提出了一个集成于潜在扩散模型（LDM）的强化学习（RL）框架。我们借鉴RNA逆折叠中扩散模型的成功经验，这些模型能巧妙模拟复杂的序列-结构相互作用，我们开发了一个结合大规模RNA模型预训练RNA-FM嵌入的LDM。这些嵌入捕捉了共进化模式，显著提升了序列恢复的准确性。然而，现有方法，包括基于扩散的方法，无法有效处理不可微结构目标。相比之下，强化学习通过策略驱动的奖励优化来应对复杂且非梯度的目标，在这一任务中表现出色，这比传统方法具有显著优势。总之，我们提出了潜在扩散模型的分步优化（SOLD），这是一种新型强化学习框架，可在不采样完整扩散轨迹的情况下优化单步噪声，实现多个结构目标的高效细化。实验结果表明，SOLD 在所有指标上超越了其 LDM 基线和最先进方法，建立了坚实的 RNA 逆折叠框架，对生物技术和治疗应用具有深远影响。

iFAN Ecosystem: A Unified AI, Digital Twin, Cyber-Physical Security, and Robotics Environment for Advanced Nuclear Simulation and Operations

iFAN生态系统：面向先进核能模拟与运行的统一人工智能、数字孪生、网络物理安全与机器人环境

Authors: Youndo Do, Chad Meece, Marc Zebrowitz, Spencer Banks, Myeongjun Choi, Xiaoxu Diao, Kai Tan, Michael Doran, Jason Reed, Fan Zhang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.19234
Pdf link: https://arxiv.org/pdf/2601.19234
Abstract As nuclear facilities experience digital transformation and advanced reactor development, AI integration, cyber-physical security, and other emerging technologies such as autonomous robot operations are increasingly developed. However, evaluation and deployment is challenged by the lack of dedicated virtual testbeds. The Immersive Framework for Advanced Nuclear (iFAN) ecosystem is developed, a comprehensive digital twin framework with a realistic 3D environment with physics-based simulations. The iFAN ecosystem serves as a high-fidelity virtual testbed for plant operation, cybersecurity, physical security, and robotic operation, as it provides real-time data exchange for pre-deployment verification. Core features include virtual reality, reinforcement learning, radiation simulation, and cyber-physical security. In addition, the paper investigates various applications through potential operational scenarios. The iFAN ecosystem provides a versatile and secure architecture for validating the next generation of autonomous and cyber-resilient nuclear operations.
中文摘要 随着核设施经历数字化转型和先进反应堆开发，人工智能集成、网络物理安全以及自主机器人操作等其他新兴技术正在不断发展。然而，由于缺乏专门的虚拟测试平台，评估和部署面临挑战。为此，本文开发了先进核能沉浸式框架（iFAN）生态系统，这是一个具备逼真3D环境和基于物理的仿真的综合数字孪生框架。iFAN生态系统提供实时数据交换以支持部署前验证，可作为面向电厂运行、网络安全、实体安全和机器人操作的高保真虚拟测试平台。核心功能包括虚拟现实、强化学习、辐射模拟和网络物理安全。此外，论文还通过潜在的运行场景探讨了多种应用。iFAN生态系统为验证下一代自主且具备网络韧性的核能运行提供了一个多功能且安全的架构。

Group Distributionally Robust Optimization-Driven Reinforcement Learning for LLM Reasoning

群体分布鲁棒优化驱动的强化学习用于大型语言模型推理

Authors: Kishan Panaganti, Zhenwen Liang, Wenhao Yu, Haitao Mi, Dong Yu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.19280
Pdf link: https://arxiv.org/pdf/2601.19280
Abstract Recent progress in Large Language Model (LLM) reasoning is increasingly driven by the refinement of post-training loss functions and alignment strategies. However, standard Reinforcement Learning (RL) paradigms like Group Relative Policy Optimization (GRPO) remain constrained by static uniformity: uniform prompt sampling and a fixed number of rollouts per prompt. For heterogeneous, heavy-tailed reasoning data, this creates structural inefficiencies that waste compute on already-solved patterns while under-training the long tail of hard problems. To address this, we propose Multi-Adversary Group Distributionally Robust Optimization (GDRO), an optimization-first framework that moves beyond uniform reasoning models by dynamically adapting the training distribution. We introduce an Online Difficulty Classifier that partitions prompts into dynamic pass@k difficulty groups. We then propose two independent GDRO games for post-training: (1) Prompt-GDRO, which employs an EMA-debiased multiplicative-weights bandit sampler to target the intensive difficulty margin and upweight persistently hard groups without frequency bias; and (2) Rollout-GDRO, which uses a shadow-price controller to reallocate rollouts across groups, maximizing gradient variance reduction on hard tasks under a fixed mean budget (compute-neutral). We provide no-regret guarantees for both controllers and additionally a variance-proxy analysis motivating a square-root optimal rollout allocation for Rollout-GDRO. We validate our framework on the DAPO 14.1k dataset using Qwen3-Base models. Prompt-GDRO and Rollout-GDRO achieve average relative gains of +10.6% and +10.1%, respectively, in pass@8 accuracy across 1.7B, 4B, and 8B scales compared to the GRPO baseline. Qualitative analysis shows an emergent curriculum: the adversaries shift resources to the evolving reasoning frontier, enhancing the reasoning model's performance.
中文摘要 大型语言模型（LLM）推理的最新进展越来越多地由后训练损失函数和对齐策略的改进所推动。然而，组相对策略优化（GRPO）等标准强化学习（RL）范式仍受制于静态均匀性：提示均匀采样，且每个提示的rollout数量固定。对于异构、重尾的推理数据，这会造成结构性低效：把计算浪费在已解决的模式上，同时对困难问题的长尾训练不足。为此，我们提出多对手组分布鲁棒优化（GDRO），这是一个优化优先的框架，通过动态调整训练分布来超越均匀式推理模型。我们引入在线难度分类器，将提示划分为动态的pass@k难度组。随后，我们为后训练提出两个相互独立的GDRO博弈：（1）Prompt-GDRO，采用EMA去偏的乘法权重bandit采样器，针对强度型难度边际，在不引入频率偏差的情况下提高持续困难组的权重；（2）Rollout-GDRO，利用影子价格控制器在各组之间重新分配rollout，在固定平均预算（计算量不变）下最大化困难任务上的梯度方差缩减。我们为两个控制器提供无悔（no-regret）保证，并给出方差代理分析，为Rollout-GDRO的平方根最优rollout分配提供依据。我们使用Qwen3-Base模型在DAPO 14.1k数据集上验证了该框架。与GRPO基线相比，Prompt-GDRO和Rollout-GDRO在1.7B、4B和8B规模上的pass@8准确率分别取得+10.6%和+10.1%的平均相对提升。定性分析显示出一种涌现的课程：对手将资源转向不断演进的推理前沿，从而提升推理模型的性能。
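
As a rough illustration of the two controllers named in the abstract above, the sketch below shows a generic multiplicative-weights update over difficulty groups and a rollout split proportional to the square root of a per-group variance proxy under a fixed mean budget. It is not the authors' implementation: the function names, the step size `eta`, the loss definition (1 minus pass rate), and the example numbers are all assumptions, and the EMA debiasing and shadow-price controller are omitted.

```python
import math

def mw_update(weights, group_losses, eta=0.5):
    """One multiplicative-weights step: groups with higher loss gain sampling mass."""
    scaled = [w * math.exp(eta * loss) for w, loss in zip(weights, group_losses)]
    total = sum(scaled)
    return [s / total for s in scaled]

def sqrt_allocation(variances, mean_budget):
    """Split rollouts proportionally to sqrt(variance), keeping the mean per group fixed."""
    roots = [math.sqrt(v) for v in variances]
    total = sum(roots)
    n_groups = len(variances)
    return [mean_budget * n_groups * r / total for r in roots]

# Hypothetical difficulty groups (easy, medium, hard); loss = 1 - pass rate (assumed).
weights = [1 / 3, 1 / 3, 1 / 3]
losses = [0.1, 0.5, 0.9]
for _ in range(5):
    weights = mw_update(weights, losses)

# Hypothetical variance proxies per group; mean budget of 8 rollouts per group.
alloc = sqrt_allocation([0.01, 0.25, 0.09], mean_budget=8)
```

The square-root rule follows from a standard argument: minimizing the sum of variance_g / n_g subject to a fixed total of n_g gives n_g proportional to sqrt(variance_g). Whether this matches the paper's exact allocation is an assumption. In practice the fractional allocations would also need rounding to integers.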

Output Feedback Stabilization of Linear Systems via Policy Gradient Methods

通过策略梯度法实现线性系统的输出反馈稳定

Authors: Ankang Zhang, Ming Chi, Xiaoling Wang, Lintao Ye
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2601.19284
Pdf link: https://arxiv.org/pdf/2601.19284
Abstract Stabilizing a dynamical system is a fundamental problem that serves as a cornerstone for many complex tasks in the field of control systems. The problem becomes challenging when the system model is unknown. Among the Reinforcement Learning (RL) algorithms that have been successfully applied to problems involving unknown linear dynamical systems, the policy gradient (PG) method stands out for its ease of implementation and its ability to solve the problem in a model-free manner. However, most existing works on PG methods for unknown linear dynamical systems assume full-state feedback. In this paper, we take a step towards model-free learning for partially observable linear dynamical systems with output feedback and focus on the fundamental stabilization problem of the system. We propose an algorithmic framework that extends PG methods to this problem, a setting in which global convergence guarantees are not available. We show that by leveraging a zeroth-order PG update based on system trajectories and its convergence to stationary points, the proposed algorithms return a stabilizing output feedback policy for discrete-time linear dynamical systems. We also explicitly characterize the sample complexity of our algorithm and verify the effectiveness of the algorithm using numerical examples.
中文摘要 稳定动力系统是一个基本问题，是控制系统领域许多复杂任务的基石。当系统模型未知时，该问题变得具有挑战性。在已成功用于求解未知线性动力系统相关问题的强化学习（RL）算法中，策略梯度（PG）方法因易于实现且能以无模型方式求解而脱颖而出。然而，现有关于未知线性动力系统PG方法的工作大多假设全状态反馈。本文朝着带输出反馈的部分可观测线性动力系统的无模型学习迈出一步，并聚焦于系统的基本镇定问题。我们提出一个算法框架，将PG方法拓展到这一缺乏全局收敛保证的问题上。我们证明，利用基于系统轨迹的零阶PG更新及其向驻点的收敛性，所提算法能够为离散时间线性动力系统返回一个镇定的输出反馈策略。我们还明确刻画了算法的样本复杂度，并通过数值算例验证了算法的有效性。
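
To make the idea of a zeroth-order policy gradient update concrete, here is a minimal sketch on a scalar toy plant with static output feedback u = -k * y. Everything in it is an assumption chosen for illustration (the plant constants `A`, `B`, `C`, the quadratic cost, the smoothing radius, and the normalized step size); it is not the algorithm or the sample-complexity setting analyzed in the paper. The gain is updated using only rollout costs, never the model.

```python
import random

A, B, C = 1.2, 1.0, 0.5   # hypothetical scalar plant: x' = A x + B u, y = C x

def rollout_cost(k, horizon=20, x0=1.0):
    """Finite-horizon quadratic cost of the output-feedback law u = -k * y."""
    x, cost = x0, 0.0
    for _ in range(horizon):
        u = -k * C * x
        cost += x * x + u * u
        x = A * x + B * u
    return cost

def zeroth_order_gradient(k, radius, rng):
    """Two-point estimate built only from rollout costs (no model gradients)."""
    direction = rng.choice([-1.0, 1.0])   # in 1-D a random direction is just a sign
    delta = rollout_cost(k + radius * direction) - rollout_cost(k - radius * direction)
    return delta / (2.0 * radius) * direction

rng = random.Random(0)
k = 0.0                                   # open loop is unstable because |A| > 1
for _ in range(200):
    g = zeroth_order_gradient(k, radius=0.05, rng=rng)
    k -= 0.2 * g / (1.0 + abs(g))         # normalized step keeps early updates bounded

closed_loop_pole = A - B * k * C          # stabilizing if its magnitude is below 1
```

The normalized step is a simplification for this toy case: starting from an unstable gain the raw cost gradient is very large, so an unnormalized fixed step could overshoot the stabilizing region.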

Innovator-VL: A Multimodal Large Language Model for Scientific Discovery

创新者-VL：一个用于科学发现的多模态大型语言模型

Authors: Zichen Wen, Boxue Yang, Shuang Chen, Yaojie Zhang, Yuhang Han, Junlong Ke, Cong Wang, Yicheng Fu, Jiawang Zhao, Jiangchao Yao, Xi Fang, Zhen Wang, Henxing Cai, Lin Yao, Zhifeng Gao, Yanhui Hong, Nang Yuan, Yixuan Li, Guojiang Zhao, Haoyi Tao, Nan Wang, Han Lyu, Guolin Ke, Ning Liao, Xiaoxing Wang, Kai Chen, Zhiyu Li, Feiyu Xiong, Sihan Hu, Kun Chen, Yanfeng Wang, Weinan E, Linfeng Zhang, Linfeng Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.19325
Pdf link: https://arxiv.org/pdf/2601.19325
Abstract We present Innovator-VL, a scientific multimodal large language model designed to advance understanding and reasoning across diverse scientific domains while maintaining excellent performance on general vision tasks. Contrary to the trend of relying on massive domain-specific pretraining and opaque pipelines, our work demonstrates that principled training design and transparent methodology can yield strong scientific intelligence with substantially reduced data requirements. (i) First, we provide a fully transparent, end-to-end reproducible training pipeline, covering data collection, cleaning, preprocessing, supervised fine-tuning, reinforcement learning, and evaluation, along with detailed optimization recipes. This facilitates systematic extension by the community. (ii) Second, Innovator-VL exhibits remarkable data efficiency, achieving competitive performance on various scientific tasks using fewer than five million curated samples without large-scale pretraining. These results highlight that effective reasoning can be achieved through principled data selection rather than indiscriminate scaling. (iii) Third, Innovator-VL demonstrates strong generalization, achieving competitive performance on general vision, multimodal reasoning, and scientific benchmarks. This indicates that scientific alignment can be integrated into a unified model without compromising general-purpose capabilities. Our practices suggest that efficient, reproducible, and high-performing scientific multimodal models can be built even without large-scale data, providing a practical foundation for future research.
中文摘要 我们提出Innovator-VL，一个面向科学的多模态大型语言模型，旨在提升跨多个科学领域的理解与推理能力，同时在通用视觉任务上保持优异表现。与依赖大规模领域专属预训练和不透明流水线的趋势相反，我们的工作表明，有原则的训练设计和透明的方法论能够在显著降低数据需求的情况下获得强大的科学智能。（i）首先，我们提供完全透明、端到端可复现的训练流程，涵盖数据收集、清洗、预处理、监督微调、强化学习和评估，并附有详细的优化方案。这便于社区进行系统性扩展。（ii）其次，Innovator-VL展现出卓越的数据效率，在使用不到五百万条精选样本且未进行大规模预训练的情况下，在多种科学任务上取得有竞争力的性能。这些结果表明，有效的推理可以通过有原则的数据选择而非无差别的规模扩张来实现。（iii）第三，Innovator-VL展现出强大的泛化能力，在通用视觉、多模态推理和科学基准上均取得有竞争力的表现。这表明科学对齐可以融入统一模型，而不损害通用能力。我们的实践表明，即使没有大规模数据，也能构建高效、可复现且高性能的科学多模态模型，为未来研究提供实用基础。

From Observations to Events: Event-Aware World Model for Reinforcement Learning

从观察到事件：事件感知世界模型用于强化学习

Authors: Zhao-Han Peng, Shaohui Li, Zhi Li, Shulan Ruan, Yu Liu, You He
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.19336
Pdf link: https://arxiv.org/pdf/2601.19336
Abstract While model-based reinforcement learning (MBRL) improves sample efficiency by learning world models from raw observations, existing methods struggle to generalize across structurally similar scenes and remain vulnerable to spurious variations such as textures or color shifts. From a cognitive science perspective, humans segment continuous sensory streams into discrete events and rely on these key events for decision-making. Motivated by this principle, we propose the Event-Aware World Model (EAWM), a general framework that learns event-aware representations to streamline policy learning without requiring handcrafted labels. EAWM employs an automated event generator to derive events from raw observations and introduces a Generic Event Segmentor (GES) to identify event boundaries, which mark the start and end time of event segments. Through event prediction, the representation space is shaped to capture meaningful spatio-temporal transitions. Beyond this, we present a unified formulation of seemingly distinct world model architectures and show the broad applicability of our methods. Experiments on Atari 100K, Craftax 1M, DeepMind Control 500K, and DMC-GB2 500K demonstrate that EAWM consistently boosts the performance of strong MBRL baselines by 10%-45%, setting new state-of-the-art results across benchmarks. Our code is released at this https URL.
中文摘要 虽然基于模型的强化学习（MBRL）通过从原始观测中学习世界模型提高了样本效率，但现有方法难以在结构相似的场景中泛化，且容易受到纹理或色移等杂乱变化的影响。从认知科学的角度来看，人类将连续的感官流分割成离散事件，并依赖这些关键事件来做决策。基于这一原则，我们提出了事件感知世界模型（EAWM），这是一个通用框架，学习事件感知表征以简化策略学习，而无需手工制作标签。EAWM采用自动化事件生成器从原始观测中推导事件，并引入通用事件分段器（GES）用于识别事件边界，标记事件段的起始和结束时间。通过事件预测，表征空间被塑造以捕捉有意义的时空转变。除此之外，我们还提出了看似截然不同的世界模型架构的统一表述，并展示了我们方法的广泛适用性。在Atari 100K、Craftax 1M和DeepMind Control 500K、DMC-GB2 500K上的实验表明，EAWM能持续将强MBRL基准的性能提升10%-45%，在各基准测试中创造了新的尖端成果。我们的代码以这个 https URL 发布。

CHEHAB RL: Learning to Optimize Fully Homomorphic Encryption Computations

CHEHAB RL：学习优化完全同态加密计算

Authors: Bilel Sefsaf, Abderraouf Dandani, Abdessamed Seddiki, Arab Mohammed, Eduardo Chielle, Michail Maniatakos, Riyadh Baghdadi
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.19367
Pdf link: https://arxiv.org/pdf/2601.19367
Abstract Fully Homomorphic Encryption (FHE) enables computations directly on encrypted data, but its high computational cost remains a significant barrier. Writing efficient FHE code is a complex task requiring cryptographic expertise, and finding the optimal sequence of program transformations is often intractable. In this paper, we propose CHEHAB RL, a novel framework that leverages deep reinforcement learning (RL) to automate FHE code optimization. Instead of relying on predefined heuristics or combinatorial search, our method trains an RL agent to learn an effective policy for applying a sequence of rewriting rules to automatically vectorize scalar FHE code while reducing instruction latency and noise growth. The proposed approach supports the optimization of both structured and unstructured code. To train the agent, we synthesize a diverse dataset of computations using a large language model (LLM). We integrate our proposed approach into the CHEHAB FHE compiler and evaluate it on a suite of benchmarks, comparing its performance against Coyote, a state-of-the-art vectorizing FHE compiler. The results show that our approach generates code that is $5.3\times$ faster in execution, accumulates $2.54\times$ less noise, while the compilation process itself is $27.9\times$ faster than Coyote (geometric means).
中文摘要 全同态加密（FHE）允许直接对加密数据进行计算，但其高计算成本仍是一大障碍。编写高效的FHE代码是一项复杂的任务，需要密码学专业知识，而寻找最优程序转换顺序往往难以解决。本文提出了CHEHAB RL，一种利用深度强化学习（RL）自动化FHE代码优化的新框架。我们不依赖预设的启发式或组合搜索，而是训练强化学习代理学习有效的策略，通过一系列重写规则自动向量化标量FHE代码，同时降低指令延迟和噪声增长。该方法支持结构化和非结构化代码的优化。为了训练代理，我们使用大型语言模型（LLM）综合了多样化的计算数据集。我们将提出的方法集成到 CHEHAB FHE 编译器中，并在一组基准测试中评估其性能，与最先进的矢量化 FHE 编译器 Coyote 进行比较。结果显示，我们的方法生成的代码执行速度快5.3倍，噪声累积减少2.54倍，而编译过程本身比Coyote（几何平均值）快27.9倍。

Task-Centric Policy Optimization from Misaligned Motion Priors

从错位运动先验中实现任务中心策略优化

Authors: Ziang Zheng, Kai Feng, Yi Nie, Shentao Qin
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.19411
Pdf link: https://arxiv.org/pdf/2601.19411
Abstract Humanoid control often leverages motion priors from human demonstrations to encourage natural behaviors. However, such demonstrations are frequently suboptimal or misaligned with robotic tasks due to embodiment differences, retargeting errors, and task-irrelevant variations, causing naïve imitation to degrade task performance. Conversely, task-only reinforcement learning admits many task-optimal solutions, often resulting in unnatural or unstable motions. This exposes a fundamental limitation of linear reward mixing in adversarial imitation learning. We propose Task-Centric Motion Priors (TCMP), a task-priority adversarial imitation framework that treats imitation as a conditional regularizer rather than a co-equal objective. TCMP maximizes task improvement while incorporating imitation signals only when they are compatible with task progress, yielding an adaptive, geometry-aware update that preserves task-feasible descent and suppresses harmful imitation under misalignment. We provide theoretical analysis of gradient conflict and task-priority stationary points, and validate our claims through humanoid control experiments demonstrating robust task performance with consistent motion style under noisy demonstrations.
中文摘要 人形机器人控制常利用来自人类演示的运动先验来鼓励自然的行为。然而，由于本体差异、重定向误差和与任务无关的变化，这类演示往往是次优的，或与机器人任务不一致，导致朴素的模仿降低任务性能。反之，仅面向任务的强化学习存在许多任务最优解，常导致不自然或不稳定的运动。这暴露了对抗模仿学习中线性奖励混合的一个根本局限。我们提出任务中心运动先验（Task-Centric Motion Priors，TCMP），一种任务优先的对抗模仿框架，将模仿视为条件正则项而非同等地位的目标。TCMP最大化任务改进，同时仅在模仿信号与任务进展相容时才纳入这些信号，从而得到一种自适应、几何感知的更新，既保持任务可行的下降方向，又在失配情况下抑制有害的模仿。我们给出了梯度冲突和任务优先驻点的理论分析，并通过人形机器人控制实验验证了我们的论断，实验表明在含噪演示下仍能获得稳健的任务性能和一致的运动风格。

OSIRIS: Bridging Analog Circuit Design and Machine Learning with Scalable Dataset Generation

OSIRIS：将模拟电路设计与机器学习与可扩展数据集生成相结合

Authors: Giuseppe Chiari, Michele Piccoli, Davide Zoni
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.19439
Pdf link: https://arxiv.org/pdf/2601.19439
Abstract The automation of analog integrated circuit (IC) design remains a longstanding challenge, primarily due to the intricate interdependencies among physical layout, parasitic effects, and circuit-level performance. These interactions impose complex constraints that are difficult to accurately capture and optimize using conventional design methodologies. Although recent advances in machine learning (ML) have shown promise in automating specific stages of the analog design flow, the development of holistic, end-to-end frameworks that integrate these stages and iteratively refine layouts using post-layout, parasitic-aware performance feedback is still in its early stages. Furthermore, progress in this direction is hindered by the limited availability of open, high-quality datasets tailored to the analog domain, restricting both the benchmarking and the generalizability of ML-based techniques. To address these limitations, we present OSIRIS, a scalable dataset generation pipeline for analog IC design. OSIRIS systematically explores the design space of analog circuits while producing comprehensive performance metrics and metadata, thereby enabling ML-driven research in electronic design automation (EDA). In addition, we release a dataset consisting of 87,100 circuit variations generated with OSIRIS, accompanied by a reinforcement learning (RL)-based baseline method that exploits OSIRIS for analog design optimization.
中文摘要 模拟集成电路（IC）设计的自动化仍是一个长期挑战，主要由于物理布局、寄生效应和电路级性能之间的复杂相互依赖。这些交互施加了复杂的约束，难以用传统设计方法准确捕捉和优化。尽管机器学习（ML）的最新进展显示出自动化模拟设计流程特定阶段的前景，但开发整合这些阶段并利用布局后、基于寄生性能反馈迭代优化布局的整体端到端框架仍处于早期阶段。此外，面向这一方向的进展受限于针对模拟领域的开放高质量数据集有限，限制了基于机器学习技术的基准测试和通用性。为解决这些局限性，我们介绍了OSIRIS，一种适用于模拟集成电路设计的可扩展数据集生成流程。OSIRIS 系统地探索模拟电路的设计空间，同时生成全面的性能指标和元数据，从而推动机器学习驱动的电子设计自动化（EDA）研究。此外，我们还发布了包含87,100条OSIRIS生成电路变体的数据集，并配有基于强化学习（RL）的基线方法，利用OSIRIS进行模拟设计优化。

APC-RL: Exceeding Data-Driven Behavior Priors with Adaptive Policy Composition

APC-RL：超越数据驱动行为先验，采用自适应策略组合

Authors: Finn Rietz, Pedro Zuidberg dos Martires, Johannes Andreas Stork
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.19452
Pdf link: https://arxiv.org/pdf/2601.19452
Abstract Incorporating demonstration data into reinforcement learning (RL) can greatly accelerate learning, but existing approaches often assume demonstrations are optimal and fully aligned with the target task. In practice, demonstrations are frequently sparse, suboptimal, or misaligned, which can degrade performance when these demonstrations are integrated into RL. We propose Adaptive Policy Composition (APC), a hierarchical model that adaptively composes multiple data-driven Normalizing Flow (NF) priors. Instead of enforcing strict adherence to the priors, APC estimates each prior's applicability to the target task while leveraging them for exploration. Moreover, APC either refines useful priors, or sidesteps misaligned ones when necessary to optimize downstream reward. Across diverse benchmarks, APC accelerates learning when demonstrations are aligned, remains robust under severe misalignment, and leverages suboptimal demonstrations to bootstrap exploration while avoiding performance degradation caused by overly strict adherence to suboptimal demonstrations.
中文摘要 将演示数据纳入强化学习（RL）可以大幅加快学习速度，但现有方法通常假设演示是最优的，且与目标任务完全对齐。实际上，演示往往稀疏、次优或错位，将其集成到强化学习中可能会降低性能。我们提出自适应策略组合（APC），这是一种层级模型，能够自适应地组合多个数据驱动的归一化流（Normalizing Flow，NF）先验。APC不强制严格遵循先验，而是估计每个先验对目标任务的适用性，同时利用它们进行探索。此外，APC会精炼有用的先验，或在必要时绕开错位的先验，以优化下游奖励。在多样化的基准测试中，APC在演示对齐时加速学习，在严重错位下保持稳健，并利用次优演示来引导探索，同时避免因过于严格地遵循次优演示而导致的性能下降。

Reinforcement Learning Goal-Reaching Control with Guaranteed Lyapunov-Like Stabilizer for Mobile Robots

面向移动机器人的强化学习目标到达控制：带有保证的类李雅普诺夫稳定器

Authors: Mehdi Heydari Shahna, Seyed Adel Alizadeh Kolagar, Jouni Mattila
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.19499
Pdf link: https://arxiv.org/pdf/2601.19499
Abstract Reinforcement learning (RL) can be highly effective at learning goal-reaching policies, but it typically does not provide formal guarantees that the goal will always be reached. A common approach to provide formal goal-reaching guarantees is to introduce a shielding mechanism that restricts the agent to actions that satisfy predefined safety constraints. The main challenge here is integrating this mechanism with RL so that learning and exploration remain effective without becoming overly conservative. Hence, this paper proposes an RL-based control framework that provides formal goal-reaching guarantees for wheeled mobile robots operating in unstructured environments. We first design a real-time RL policy with a set of 15 carefully defined reward terms. These rewards encourage the robot to reach both static and dynamic goals while generating sufficiently smooth command signals that comply with predefined safety specifications, which is critical in practice. Second, a Lyapunov-like stabilizer layer is integrated into the benchmark RL framework as a policy supervisor to formally strengthen the goal-reaching control while preserving meaningful exploration of the state action space. The proposed framework is suitable for real-time deployment in challenging environments, as it provides a formal guarantee of convergence to the intended goal states and compensates for uncertainties by generating real-time control signals based on the current state, while respecting real-world motion constraints. The experimental results show that the proposed Lyapunov-like stabilizer consistently improves the benchmark RL policies, boosting the goal-reaching rate from 84.6% to 99.0%, sharply reducing failures, and improving efficiency.
中文摘要 强化学习（RL）在学习目标到达策略方面可以非常有效，但通常无法给出目标总能到达的形式化保证。提供形式化目标到达保证的常见做法是引入屏蔽（shielding）机制，限制智能体只能执行满足预定义安全约束的动作。这里的主要挑战在于如何将该机制与强化学习结合，使学习和探索保持有效而不至于过于保守。为此，本文提出一种基于强化学习的控制框架，为在非结构化环境中运行的轮式移动机器人提供形式化的目标到达保证。我们首先设计了一个实时强化学习策略，包含15个精心定义的奖励项。这些奖励鼓励机器人到达静态和动态目标，同时生成足够平滑且符合预定义安全规范的指令信号，这在实际应用中至关重要。其次，在基准强化学习框架中集成了一个类李雅普诺夫稳定器层，作为策略监督器，在形式上加强目标到达控制，同时保留对状态-动作空间的有意义探索。所提框架适合在具有挑战性的环境中实时部署，因为它提供了收敛到期望目标状态的形式化保证，并根据当前状态生成实时控制信号来补偿不确定性，同时遵守现实世界的运动约束。实验结果表明，所提出的类李雅普诺夫稳定器持续改进基准强化学习策略，将目标到达率从84.6%提升至99.0%，大幅减少失败并提高效率。
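
The policy-supervisor idea described above can be sketched generically: accept the learned policy's action when it makes a Lyapunov-like function decrease fast enough, and otherwise fall back to a simple stabilizing controller. The sketch below is an assumption-laden toy (a 2-D point robot with velocity control, squared distance to the goal as the function V, a proportional fallback law, and a random stand-in for the RL policy), not the stabilizer designed in the paper.

```python
import math
import random

GOAL = (0.0, 0.0)   # hypothetical goal position
DT = 0.1            # hypothetical control period

def lyapunov(state):
    """Squared distance to the goal, used as a Lyapunov-like function V."""
    return (state[0] - GOAL[0]) ** 2 + (state[1] - GOAL[1]) ** 2

def step(state, velocity):
    """Single-integrator kinematics: position moves by velocity * DT."""
    return (state[0] + DT * velocity[0], state[1] + DT * velocity[1])

def fallback_action(state, gain=1.0):
    """Proportional controller that always drives the robot toward the goal."""
    return (-gain * (state[0] - GOAL[0]), -gain * (state[1] - GOAL[1]))

def supervise(state, proposed, decay=0.05):
    """Keep the proposed action only if V shrinks by at least `decay` this step."""
    if lyapunov(step(state, proposed)) <= (1.0 - decay) * lyapunov(state):
        return proposed
    return fallback_action(state)

rng = random.Random(1)
state = (2.0, 1.0)
for _ in range(200):
    proposed = (rng.uniform(-1, 1), rng.uniform(-1, 1))  # stand-in for an RL policy
    state = step(state, supervise(state, proposed))
distance = math.sqrt(lyapunov(state))
```

Because the fallback law contracts V by a factor of 0.81 per step, every supervised step satisfies the decrease condition, so V shrinks geometrically regardless of what the stand-in policy proposes. The decrease threshold trades off how much of the policy's exploration is preserved against how quickly convergence is enforced.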

Bridging Information Asymmetry: A Hierarchical Framework for Deterministic Blind Face Restoration

弥合信息不对称：确定性盲脸修复的层级框架

Authors: Zhengjian Yao, Jiakui Hu, Kaiwen Li, Hangzhou He, Xinliang Zhang, Shuang Zeng, Lei Zhu, Yanye Lu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.19506
Pdf link: https://arxiv.org/pdf/2601.19506
Abstract Blind face restoration remains a persistent challenge due to the inherent ill-posedness of reconstructing holistic structures from severely constrained observations. Current generative approaches, while capable of synthesizing realistic textures, often suffer from information asymmetry, the intrinsic disparity between the information-sparse low-quality inputs and the information-dense high-quality outputs. This imbalance leads to a one-to-many mapping, where insufficient constraints result in stochastic uncertainty and hallucinatory artifacts. To bridge this gap, we present Pref-Restore, a hierarchical framework that integrates discrete semantic logic with continuous texture generation to achieve deterministic, preference-aligned restoration. Our methodology fundamentally addresses this information disparity through two complementary strategies: (1) Augmenting Input Density: We employ an auto-regressive integrator to reformulate textual instructions into dense latent queries, injecting high-level semantic stability to constrain the degraded signals; (2) Pruning Output Distribution: We pioneer the integration of on-policy reinforcement learning directly into the diffusion restoration loop. By transforming human preferences into differentiable constraints, we explicitly penalize stochastic deviations, thereby sharpening the posterior distribution toward the desired high-fidelity outcomes. Extensive experiments demonstrate that Pref-Restore achieves state-of-the-art performance across synthetic and real-world benchmarks. Furthermore, empirical analysis confirms that our preference-aligned strategy significantly reduces solution entropy, establishing a robust pathway toward reliable and deterministic blind restoration.
中文摘要 由于从严重受限的观测中重建整体结构本身是不适定的，盲人脸修复一直是一个难题。当前的生成式方法虽然能够合成逼真的纹理，但常受信息不对称的困扰，即信息稀疏的低质量输入与信息密集的高质量输出之间的内在差距。这种不平衡导致一对多映射，约束不足会带来随机不确定性和幻觉伪影。为弥合这一差距，我们提出Pref-Restore，这是一个层级框架，将离散语义逻辑与连续纹理生成相结合，实现确定性的、与偏好对齐的修复。我们的方法通过两种互补策略从根本上解决这一信息差距：（1）增强输入密度：采用自回归整合器将文本指令重构为稠密的潜在查询，注入高层语义稳定性以约束退化信号；（2）修剪输出分布：我们率先将同策略（on-policy）强化学习直接集成到扩散修复循环中。通过将人类偏好转化为可微约束，我们显式惩罚随机偏差，从而使后验分布向期望的高保真结果收紧。大量实验表明，Pref-Restore在合成和真实世界基准上均达到最先进的性能。此外，实证分析证实，我们的偏好对齐策略显著降低了解的熵，为可靠且确定性的盲修复建立了稳健的路径。

LLM-Enhanced Reinforcement Learning for Long-Term User Satisfaction in Interactive Recommendation

LLM增强强化学习，实现交互式推荐中的长期用户满意度

Authors: Chongjun Xia, Yanchun Peng, Xianzhi Wang
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.19585
Pdf link: https://arxiv.org/pdf/2601.19585
Abstract Interactive recommender systems can dynamically adapt to user feedback, but often suffer from content homogeneity and filter bubble effects due to overfitting short-term user preferences. While recent efforts aim to improve content diversity, they predominantly operate in static or one-shot settings, neglecting the long-term evolution of user interests. Reinforcement learning provides a principled framework for optimizing long-term user satisfaction by modeling sequential decision-making processes. However, its application in recommendation is hindered by sparse, long-tailed user-item interactions and limited semantic planning capabilities. In this work, we propose LLM-Enhanced Reinforcement Learning (LERL), a novel hierarchical recommendation framework that integrates the semantic planning power of LLM with the fine-grained adaptability of RL. LERL consists of a high-level LLM-based planner that selects semantically diverse content categories, and a low-level RL policy that recommends personalized items within the selected semantic space. This hierarchical design narrows the action space, enhances planning efficiency, and mitigates overexposure to redundant content. Extensive experiments on real-world datasets demonstrate that LERL significantly improves long-term user satisfaction when compared with state-of-the-art baselines. The implementation of LERL is available at this https URL.
中文摘要 交互式推荐系统可以动态适应用户反馈，但由于过拟合用户的短期偏好，常常出现内容同质化和信息茧房（filter bubble）效应。近期工作虽致力于提升内容多样性，但大多在静态或单次（one-shot）设定下进行，忽视了用户兴趣的长期演变。强化学习通过建模序列决策过程，为优化长期用户满意度提供了有原则的框架。然而，它在推荐中的应用受到稀疏、长尾的用户-物品交互以及有限的语义规划能力的制约。本文提出LLM增强强化学习（LERL），一种新的层级推荐框架，将LLM的语义规划能力与RL的细粒度适应性相结合。LERL由一个基于LLM的高层规划器和一个低层RL策略组成：前者选择语义上多样化的内容类别，后者在所选语义空间内推荐个性化物品。这种层级设计缩小了动作空间，提高了规划效率，并缓解了冗余内容的过度曝光。在真实世界数据集上的大量实验表明，与最先进的基线相比，LERL显著提升了长期用户满意度。LERL的实现可在此 https URL 获取。

Safe Exploration via Policy Priors

基于策略先验的安全探索

Authors: Manuel Wendl, Yarden As, Manish Prajapat, Anton Pollak, Stelian Coros, Andreas Krause
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.19612
Pdf link: https://arxiv.org/pdf/2601.19612
Abstract Safe exploration is a key requirement for reinforcement learning (RL) agents to learn and adapt online, beyond controlled (e.g., simulated) environments. In this work, we tackle this challenge by utilizing suboptimal yet conservative policies (e.g., obtained from offline data or simulators) as priors. Our approach, SOOPER, uses probabilistic dynamics models to optimistically explore, yet pessimistically fall back to the conservative policy prior if needed. We prove that SOOPER guarantees safety throughout learning, and establish convergence to an optimal policy by bounding its cumulative regret. Extensive experiments on key safe RL benchmarks and real-world hardware demonstrate that SOOPER is scalable and outperforms the state of the art, and they validate our theoretical guarantees in practice.
中文摘要 安全探索是强化学习（RL）智能体走出受控环境（如仿真环境）、在线学习和适应的关键要求。本文利用次优但保守的策略（例如从离线数据或模拟器中获得）作为先验来应对这一挑战。我们的方法SOOPER使用概率动力学模型进行乐观探索，并在必要时悲观地回退到保守的策略先验。我们证明SOOPER在整个学习过程中保证安全，并通过界定其累积遗憾证明其收敛到最优策略。在主要的安全强化学习基准和真实硬件上的大量实验表明，SOOPER具有可扩展性，性能优于现有最先进方法，并在实践中验证了我们的理论保证。

R^3: Replay, Reflection, and Ranking Rewards for LLM Reinforcement Learning

R^3：用于大型语言模型强化学习的重放、反思与排序奖励

Authors: Zhizheng Jiang, Kang Zhao, Weikai Xu, Xinkui Lin, Wei Liu, Jian Luan, Shuo Shang, Peng Han
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.19620
Pdf link: https://arxiv.org/pdf/2601.19620
Abstract Large reasoning models (LRMs) aim to solve diverse and complex problems through structured reasoning. Recent advances in group-based policy optimization methods have shown promise in enabling stable advantage estimation without reliance on process-level annotations. However, these methods rely on advantage gaps induced by high-quality samples within the same batch, which makes the training process fragile and inefficient when intra-group advantages collapse under challenging tasks. To address these problems, we propose a reinforcement learning mechanism named R^3 that works along three directions: (1) a cross-context Replay strategy that maintains the intra-group advantage by recalling valuable examples from historical trajectories of the same query, (2) an in-context self-Reflection mechanism enabling models to refine outputs by leveraging past failures, and (3) a structural entropy Ranking reward, which assigns relative rewards to truncated or failed samples by ranking responses based on token-level entropy patterns, capturing both local exploration and global stability. We implement our method on Deepseek-R1-Distill-Qwen-1.5B and train it on the DeepscaleR-40k in the math domain. Experiments demonstrate that our method achieves SoTA performance on several math benchmarks, with significant improvements over the base models and fewer reasoning tokens. Code and model will be released.
中文摘要 大型推理模型（LRM）旨在通过结构化推理解决多样且复杂的问题。基于组的策略优化方法的最新进展显示出无需依赖过程级标注即可实现稳定优势估计的潜力。然而，这些方法依赖同一批次内高质量样本带来的优势差距，当组内优势在困难任务下坍缩时，训练过程会变得脆弱且低效。为解决这些问题，我们提出一种名为R^3的强化学习机制，它从三个方向展开：（1）跨上下文重放（Replay）策略，通过从同一查询的历史轨迹中召回有价值的样本来维持组内优势；（2）上下文内自我反思（Reflection）机制，使模型能够利用过去的失败来改进输出；（3）结构熵排序奖励（Ranking reward），根据词元级熵模式对响应进行排序，为被截断或失败的样本分配相对奖励，同时刻画局部探索和全局稳定性。我们在Deepseek-R1-Distill-Qwen-1.5B上实现了该方法，并在数学领域的DeepscaleR-40k上进行训练。实验表明，我们的方法在多个数学基准上达到SoTA性能，相比基础模型有显著提升，且使用的推理词元更少。代码和模型将会发布。

Tracking Drift: Variation-Aware Entropy Scheduling for Non-Stationary Reinforcement Learning

跟踪漂移：非平稳强化学习中的变异感知熵调度

Authors: Tongxi Wang, Zhuoyang Xia, Xinran Chen, Shan Liu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.19624
Pdf link: https://arxiv.org/pdf/2601.19624
Abstract Real-world reinforcement learning often faces environment drift, but most existing methods rely on static entropy coefficients/target entropy, causing over-exploration during stable periods and under-exploration after drift (thus slow recovery), and leaving unanswered the principled question of how exploration intensity should scale with drift magnitude. We prove that entropy scheduling under non-stationarity can be reduced to a one-dimensional, round-by-round trade-off between faster tracking of the optimal solution after drift and avoiding gratuitous randomness when the environment is stable, so that exploration strength can be driven by measurable online drift signals. Building on this, we propose AES (Adaptive Entropy Scheduling), which adaptively adjusts the entropy coefficient/temperature online using observable drift proxies during training, requiring almost no structural changes and incurring minimal overhead. Across 4 algorithm variants, 12 tasks, and 4 drift modes, AES significantly reduces the fraction of performance degradation caused by drift and accelerates recovery after abrupt changes.
中文摘要 现实世界的强化学习常面临环境漂移，但大多数现有方法依赖静态的熵系数/目标熵，导致稳定期过度探索、漂移后探索不足（因而恢复缓慢），也未回答探索强度应如何随漂移幅度变化这一原则性问题。我们证明，非平稳条件下的熵调度可以归结为一个一维的、逐轮的权衡：漂移后更快地跟踪最优解，与环境稳定时避免不必要的随机性之间的权衡，因此探索强度可以由可测量的在线漂移信号驱动。在此基础上，我们提出AES（自适应熵调度），在训练过程中利用可观测的漂移代理在线自适应地调整熵系数/温度，几乎无需结构改动，开销极小。在4种算法变体、12个任务和4种漂移模式上，AES显著降低了漂移导致的性能下降比例，并加快了突变后的恢复速度。
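
One simple way to picture drift-driven entropy scheduling is to compute an online drift proxy and map it to an entropy coefficient that stays low while the environment is stable and rises after a change. The sketch below is only an illustration under stated assumptions: the proxy (absolute change in windowed mean reward), the linear mapping, and the constants `alpha_min`, `alpha_max`, and `gain` are hypothetical and are not taken from the paper's AES rule.

```python
def drift_proxy(rewards, window=10):
    """Absolute change in mean reward between the two most recent windows."""
    if len(rewards) < 2 * window:
        return 0.0                      # not enough history to compare yet
    recent = rewards[-window:]
    past = rewards[-2 * window:-window]
    return abs(sum(recent) / window - sum(past) / window)

def entropy_coefficient(drift, alpha_min=0.01, alpha_max=0.5, gain=0.2):
    """Low exploration when stable, more exploration as the drift signal grows."""
    return min(alpha_max, alpha_min + gain * drift)

stable_history = [1.0] * 20                   # rewards unchanged: no drift signal
drifted_history = [1.0] * 10 + [0.0] * 10     # abrupt drop: large drift signal

alpha_stable = entropy_coefficient(drift_proxy(stable_history))
alpha_drifted = entropy_coefficient(drift_proxy(drifted_history))
```

In an entropy-regularized algorithm the returned coefficient would replace the fixed temperature in the policy loss. Other proxies (for example, a jump in value-function error) could be substituted without changing the structure.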

Video-KTR: Reinforcing Video Reasoning via Key Token Attribution

Video-KTR：通过关键词元归因强化视频推理

Authors: Ziyue Wang, Sheng Jin, Zhongrong Zuo, Jiawei Wu, Han Qiu, Qi She, Hao Zhang, Xudong Jiang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.19686
Pdf link: https://arxiv.org/pdf/2601.19686
Abstract Reinforcement learning (RL) has shown strong potential for enhancing reasoning in multimodal large language models, yet existing video reasoning methods often rely on coarse sequence-level rewards or single-factor token selection, neglecting fine-grained links among visual inputs, temporal dynamics, and linguistic outputs, limiting both accuracy and interpretability. We propose Video-KTR, a modality-aware policy shaping framework that performs selective, token-level RL by combining three attribution signals: (1) visual-aware tokens identified via counterfactual masking to reveal perceptual dependence; (2) temporal-aware tokens detected through frame shuffling to expose temporal sensitivity; and (3) high-entropy tokens signaling predictive uncertainty. By reinforcing only these key tokens, Video-KTR focuses learning on semantically informative, modality-sensitive content while filtering out low-value tokens. Across five challenging benchmarks, Video-KTR achieves state-of-the-art or highly competitive results, including 42.7% on Video-Holmes (surpassing GPT-4o), with consistent gains on both reasoning and general video understanding tasks. Ablation studies verify the complementary roles of the attribution signals and the robustness of targeted token-level updates. Overall, Video-KTR improves accuracy and interpretability, offering a simple, drop-in extension to RL for complex video reasoning. Our code and models are available at this https URL.
中文摘要 强化学习（RL）在增强多模态大型语言模型的推理能力方面展现出强大潜力，但现有的视频推理方法常依赖粗粒度的序列级奖励或单因素的词元选择，忽视了视觉输入、时间动态和语言输出之间的细粒度联系，限制了准确性和可解释性。我们提出Video-KTR，一种模态感知的策略塑造框架，通过结合三种归因信号执行选择性的词元级强化学习：（1）通过反事实掩码识别的视觉感知词元，用以揭示感知依赖；（2）通过帧打乱检测的时间感知词元，用以暴露时间敏感性；（3）表示预测不确定性的高熵词元。通过仅强化这些关键词元，Video-KTR将学习集中于语义信息丰富、对模态敏感的内容，同时过滤掉低价值词元。在五个具有挑战性的基准上，Video-KTR取得了最先进或极具竞争力的结果，其中在Video-Holmes上达到42.7%（超过GPT-4o），并在推理和通用视频理解任务上均有一致的提升。消融研究验证了各归因信号的互补作用以及针对性词元级更新的鲁棒性。总体而言，Video-KTR提高了准确性和可解释性，为复杂视频推理的强化学习提供了一种简单、即插即用的扩展。我们的代码和模型可在此 https URL 获取。
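
Of the three attribution signals listed above, the high-entropy one is easy to illustrate in isolation: compute the entropy of each token's predictive distribution and keep only the most uncertain tokens for the policy update. The sketch below assumes a hypothetical `keep_fraction` and toy distributions; it does not cover the visual (counterfactual masking) or temporal (frame shuffling) signals, which require model forward passes, and it is not the paper's selection rule.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one token's predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def key_token_mask(distributions, keep_fraction=0.5):
    """Mark the highest-entropy tokens; only these would receive RL updates."""
    entropies = [token_entropy(d) for d in distributions]
    k = max(1, round(keep_fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(len(entropies))]

def masked_advantages(advantages, mask):
    """Zero out the advantage of tokens that were not selected."""
    return [a if m else 0.0 for a, m in zip(advantages, mask)]

# Toy per-token distributions over a 4-word vocabulary (assumed values).
dists = [
    [0.97, 0.01, 0.01, 0.01],   # confident token
    [0.25, 0.25, 0.25, 0.25],   # maximally uncertain token
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.30, 0.20, 0.10],
]
mask = key_token_mask(dists, keep_fraction=0.5)
shaped = masked_advantages([1.0, 1.0, 1.0, 1.0], mask)
```

A combined scheme would presumably take the union (or another combination) of the masks from all three signals before applying the token-level update. How the paper combines them is not stated in the abstract.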

AlignCoder: Aligning Retrieval with Target Intent for Repository-Level Code Completion

AlignCoder：将检索与目标意图对齐，实现仓库级代码补全

Authors: Tianyue Jiang, Yanli Wang, Yanlin Wang, Daya Guo, Ensheng Shi, Yuchi Ma, Jiachi Chen, Zibin Zheng
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.19697
Pdf link: https://arxiv.org/pdf/2601.19697
Abstract Repository-level code completion remains a challenging task for existing code large language models (code LLMs) due to their limited understanding of repository-specific context and domain knowledge. While retrieval-augmented generation (RAG) approaches have shown promise by retrieving relevant code snippets as cross-file context, they suffer from two fundamental problems: misalignment between the query and the target code in the retrieval process, and the inability of existing retrieval methods to effectively utilize the inference information. To address these challenges, we propose AlignCoder, a repository-level code completion framework that introduces a query enhancement mechanism and a reinforcement learning based retriever training method. Our approach generates multiple candidate completions to construct an enhanced query that bridges the semantic gap between the initial query and the target code. Additionally, we employ reinforcement learning to train an AlignRetriever that learns to leverage inference information in the enhanced query for more accurate retrieval. We evaluate AlignCoder on two widely-used benchmarks (CrossCodeEval and RepoEval) across five backbone code LLMs, demonstrating an 18.1% improvement in EM score compared to baselines on the CrossCodeEval benchmark. The results show that our framework achieves superior performance and exhibits high generalizability across various code LLMs and programming languages.
中文摘要 由于现有代码大语言模型（code LLMs）对仓库特定上下文和领域知识的理解有限，仓库级代码补全仍是一项具有挑战性的任务。虽然检索增强生成（RAG）方法通过检索相关代码片段作为跨文件上下文展现出潜力，但它们存在两个根本性问题：检索过程中查询与目标代码之间未对齐，以及现有检索方法无法有效利用推理信息。为应对这些挑战，我们提出了 AlignCoder，一个仓库级代码补全框架，引入了查询增强机制和基于强化学习的检索器训练方法。我们的方法生成多个候选补全来构建增强查询，弥合初始查询与目标代码之间的语义差距。此外，我们采用强化学习训练 AlignRetriever，使其学会利用增强查询中的推理信息实现更准确的检索。我们在两个广泛使用的基准（CrossCodeEval 和 RepoEval）上使用五个骨干代码大语言模型评估了 AlignCoder，在 CrossCodeEval 基准上 EM 分数较基线提升了 18.1%。结果表明，我们的框架性能优越，并在各种代码大语言模型和编程语言上展现出很强的泛化能力。
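
The query-enhancement idea in the AlignCoder abstract (append sampled candidate completions to the unfinished code, then retrieve with the longer query) can be illustrated with a toy retriever. This is a minimal sketch, not the authors' implementation: the token-overlap (Jaccard) score stands in for the learned AlignRetriever, and all function names, snippets, and candidate completions below are hypothetical.

```python
import re


def tokenize(code: str) -> set[str]:
    """Split code into a set of lowercase identifier-like tokens."""
    return {tok.lower() for tok in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code)}


def build_enhanced_query(unfinished_code: str, candidates: list[str]) -> str:
    """Append sampled candidate completions to the unfinished code."""
    return "\n".join([unfinished_code, *candidates])


def jaccard(a: set[str], b: set[str]) -> float:
    """Token-set overlap; an assumed stand-in for a learned retriever score."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(query: str, snippets: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k repository snippets most similar to the query."""
    q = tokenize(query)
    ranked = sorted(snippets, key=lambda s: jaccard(q, tokenize(s)), reverse=True)
    return ranked[:top_k]


# Hypothetical repository snippets and model outputs, for illustration only.
snippets = [
    "def save_report(report, path): write_file(path, render(report))",
    "def load_config(path): return parse_yaml(read_file(path))",
]
unfinished = "cfg = "
candidates = ["load_config(settings_path)", "parse_yaml(read_file(settings_path))"]

enhanced_query = build_enhanced_query(unfinished, candidates)
enhanced_hit = retrieve(enhanced_query, snippets)
print(enhanced_hit[0])
```

In this toy setup the bare query `cfg = ` shares no token with either snippet, so its ranking is arbitrary, while the enhanced query overlaps with the `load_config` snippet through the identifiers contributed by the candidates.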

Scalable Exploration for High-Dimensional Continuous Control via Value-Guided Flow

通过价值引导流实现高维连续控制的可扩展探索

Authors: Yunyue Wei, Chenhui Zuo, Yanan Sui
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.19707
Pdf link: https://arxiv.org/pdf/2601.19707
Abstract Controlling high-dimensional systems in biological and robotic applications is challenging due to expansive state-action spaces, where effective exploration is critical. Commonly used exploration strategies in reinforcement learning are largely undirected with sharp degradation as action dimensionality grows. Many existing methods resort to dimensionality reduction, which constrains policy expressiveness and forfeits system flexibility. We introduce Q-guided Flow Exploration (Qflex), a scalable reinforcement learning method that conducts exploration directly in the native high-dimensional action space. During training, Qflex traverses actions from a learnable source distribution along a probability flow induced by the learned value function, aligning exploration with task-relevant gradients rather than isotropic noise. Our proposed method substantially outperforms representative online reinforcement learning baselines across diverse high-dimensional continuous-control benchmarks. Qflex also successfully controls a full-body human musculoskeletal model to perform agile, complex movements, demonstrating superior scalability and sample efficiency in very high-dimensional settings. Our results indicate that value-guided flows offer a principled and practical route to exploration at scale.
中文摘要 在生物和机器人应用中控制高维系统具有挑战性，因为状态-动作空间庞大，有效探索至关重要。强化学习中常用的探索策略大多是无方向的，随着动作维度的增长，其性能会急剧下降。许多现有方法诉诸降维，这限制了策略的表达能力，并牺牲了系统的灵活性。我们提出了 Q 引导流探索（Q-guided Flow Exploration，Qflex），这是一种可扩展的强化学习方法，直接在原生高维动作空间中进行探索。在训练过程中，Qflex 让动作从可学习的源分布出发，沿由所学价值函数诱导的概率流移动，使探索与任务相关的梯度对齐，而非依赖各向同性噪声。我们提出的方法在多种高维连续控制基准上显著优于代表性的在线强化学习基线。Qflex 还成功控制了全身人体肌肉骨骼模型完成灵活复杂的动作，在极高维环境中展现出卓越的可扩展性和样本效率。我们的结果表明，价值引导流为大规模探索提供了一条有原则且实用的路径。
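
The contrast the Qflex abstract draws between value-guided exploration and isotropic noise can be seen in a toy example. The sketch below is only an illustration under strong assumptions: plain Euler steps along the gradient of a hand-written quadratic critic stand in for the paper's probability flow, the Gaussian source is fixed rather than learnable, and every name here is hypothetical.

```python
import random


def q_value(action, target):
    """Toy stand-in for a learned critic: higher is better, peak at `target`."""
    return -sum((a - t) ** 2 for a, t in zip(action, target))


def q_gradient(action, target):
    """Analytic gradient of the toy critic with respect to the action."""
    return [-2.0 * (a - t) for a, t in zip(action, target)]


def sample_source(dim, rng, std=1.0):
    """Draw an action from a fixed Gaussian source (assumed; learnable in the paper)."""
    return [rng.gauss(0.0, std) for _ in range(dim)]


def flow_action(action, target, steps=10, step_size=0.1):
    """Move a source sample along the critic's gradient with Euler steps."""
    for _ in range(steps):
        grad = q_gradient(action, target)
        action = [a + step_size * g for a, g in zip(action, grad)]
    return action


rng = random.Random(0)
dim = 50                       # a moderately high-dimensional action space
target = [0.5] * dim           # hypothetical optimum of the toy critic
start = sample_source(dim, rng)
guided = flow_action(start, target)

print(q_value(start, target), q_value(guided, target))
```

Each Euler step shrinks the distance to the toy optimum by a constant factor in every coordinate, so the guided action always scores higher than its starting sample, whereas adding isotropic noise to the same sample carries no such guarantee.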

Improving Policy Exploitation in Online Reinforcement Learning with Instant Retrospect Action

通过即时回顾动作改善在线强化学习中的策略利用

Authors: Gong Gao, Weidong Zhao, Xianhui Liu, Ning Jia
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.19720
Pdf link: https://arxiv.org/pdf/2601.19720
Abstract Existing value-based online reinforcement learning (RL) algorithms suffer from slow policy exploitation due to ineffective exploration and delayed policy updates. To address these challenges, we propose an algorithm called Instant Retrospect Action (IRA). Specifically, we propose Q-Representation Discrepancy Evolution (RDE) to facilitate Q-network representation learning, enabling discriminative representations for neighboring state-action pairs. In addition, we impose policy constraints explicitly through Greedy Action Guidance (GAG). This is achieved by backtracking historical actions, which effectively enhances the policy update process. Our proposed method relies on providing the learning algorithm with accurate $k$-nearest-neighbor action value estimates and on learning a fast-adaptable policy through policy constraints. We further propose the Instant Policy Update (IPU) mechanism, which enhances policy exploitation by systematically increasing the frequency of policy updates. We also find that the early-stage training conservatism of the IRA method can alleviate the overestimation bias problem in value-based RL. Experimental results show that IRA can significantly improve the learning efficiency and final performance of online RL algorithms on eight MuJoCo continuous control tasks.
中文摘要 现有基于价值的在线强化学习（RL）算法因探索低效和策略更新延迟而策略利用缓慢。为应对这些挑战，我们提出了一种称为即时回顾动作（Instant Retrospect Action，IRA）的算法。具体来说，我们提出了 Q 表征差异演化（RDE）以促进 Q 网络表征学习，使邻近的状态-动作对获得有区分度的表征。此外，我们通过贪婪动作引导（GAG）显式地施加策略约束。这通过回溯历史动作实现，有效增强了策略更新过程。我们提出的方法依赖于为学习算法提供准确的 $k$ 近邻动作值估计，并通过策略约束学习一个可快速适应的策略。我们还提出了即时策略更新（IPU）机制，通过系统性地提高策略更新频率来增强策略利用。我们还发现，IRA 方法在训练早期的保守性可以缓解基于价值的强化学习中的高估偏差问题。实验结果显示，IRA 能显著提升在线强化学习算法在八个 MuJoCo 连续控制任务上的学习效率和最终性能。
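
The IRA abstract mentions $k$-nearest-neighbor action value estimates and guidance obtained by backtracking historical actions. One way such a lookup could work is sketched below. This is an assumption-laden illustration rather than the paper's procedure: the Euclidean distance, the flat list used as a history buffer, and the names `k_nearest` and `greedy_neighbor` are all hypothetical.

```python
import math


def k_nearest(query, buffer, k):
    """Return the k buffered (action, q) pairs closest to `query`."""
    return sorted(buffer, key=lambda pair: math.dist(query, pair[0]))[:k]


def greedy_neighbor(query, buffer, k):
    """Among the k nearest past actions, pick the one with the highest Q."""
    return max(k_nearest(query, buffer, k), key=lambda pair: pair[1])


# Hypothetical history of (action, estimated Q-value) pairs.
history = [
    ([0.0, 0.0], 1.0),
    ([0.1, 0.0], 2.5),
    ([0.0, 0.2], 1.8),
    ([0.9, 0.9], 9.0),  # high value, but far from the current action
]
current_action = [0.05, 0.05]

guide_action, guide_q = greedy_neighbor(current_action, history, k=3)
print(guide_action, guide_q)
```

Restricting the greedy choice to the neighborhood of the current action keeps the guidance target close to what the policy already does, which is one plausible reading of using the lookup as a policy constraint; the far-away high-value action is ignored for `k=3`.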

Reimagining Social Robots as Recommender Systems: Foundations, Framework, and Applications

将社交机器人重新构想为推荐系统：基础、框架与应用

Authors: Jin Huang, Fethiye Irmak Doğan, Hatice Gunes
Subjects: Robotics (cs.RO); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2601.19761
Pdf link: https://arxiv.org/pdf/2601.19761
Abstract Personalization in social robots refers to the ability of the robot to meet the needs and/or preferences of an individual user. Existing approaches typically rely on large language models (LLMs) to generate context-aware responses based on user metadata and historical interactions or on adaptive methods such as reinforcement learning (RL) to learn from users' immediate reactions in real time. However, these approaches fall short of comprehensively capturing user preferences (including long-term, short-term, and fine-grained aspects) and of using them to rank and select actions, proactively personalize interactions, and ensure ethically responsible adaptations. To address these limitations, we propose drawing on recommender systems (RSs), which specialize in modeling user preferences and providing personalized recommendations. To ensure the integration of RS techniques is well-grounded and seamless throughout the social robot pipeline, we (i) align the paradigms underlying social robots and RSs, (ii) identify key techniques that can enhance personalization in social robots, and (iii) design them as modular, plug-and-play components. This work not only establishes a framework for integrating RS techniques into social robots but also opens a pathway for deep collaboration between the RS and HRI communities, accelerating innovation in both fields.
中文摘要 社交机器人中的个性化是指机器人满足个体用户需求和/或偏好的能力。现有方法通常依赖大语言模型（LLM）基于用户元数据和历史交互生成上下文感知的响应，或利用强化学习（RL）等自适应方法实时从用户的即时反应中学习。然而，这些方法既未能全面捕捉用户偏好（包括长期、短期和细粒度方面），也未能利用这些偏好来排序和选择动作、主动个性化交互以及确保符合伦理责任的适应。为解决这些局限，我们建议借鉴推荐系统（RS），后者专长于建模用户偏好并提供个性化推荐。为了确保 RS 技术在整个社交机器人流程中的整合有据可依且无缝衔接，我们（i）对齐社交机器人和 RS 背后的范式，（ii）识别能够提升社交机器人个性化的关键技术，（iii）将其设计为模块化、即插即用的组件。这项工作不仅建立了将 RS 技术整合到社交机器人中的框架，也为 RS 与 HRI 社区之间的深度合作开辟了道路，加速两个领域的创新。

Reimagining Peer Review Process Through Multi-Agent Mechanism Design

通过多智能体机制设计重新构想同行评审流程

Authors: Ahmad Farooq, Kamran Iqbal
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2601.19778
Pdf link: https://arxiv.org/pdf/2601.19778
Abstract The software engineering research community faces a systemic crisis: peer review is failing under growing submissions, misaligned incentives, and reviewer fatigue. Community surveys reveal that researchers perceive the process as "broken." This position paper argues that these dysfunctions are mechanism design failures amenable to computational solutions. We propose modeling the research community as a stochastic multi-agent system and applying multi-agent reinforcement learning to design incentive-compatible protocols. We outline three interventions: a credit-based submission economy, MARL-optimized reviewer assignment, and hybrid verification of review consistency. We present threat models, equity considerations, and phased pilot metrics. This vision charts a research agenda toward sustainable peer review.
中文摘要 软件工程研究界正面临系统性危机：在投稿量不断增长、激励机制错位和审稿人疲劳的压力下，同行评审正在失灵。社区调查显示，研究人员认为这一流程已经“崩坏”。本立场论文认为，这些功能失调属于机制设计上的失败，可以用计算方法加以解决。我们提出将研究社区建模为随机多智能体系统，并应用多智能体强化学习来设计激励相容的协议。我们概述了三项干预措施：基于积分的投稿经济、MARL 优化的审稿人分配，以及对评审一致性的混合验证。我们给出了威胁模型、公平性考量和分阶段的试点指标。这一愿景为实现可持续的同行评审规划了研究议程。

Unsupervised Learning of Efficient Exploration: Pre-training Adaptive Policies via Self-Imposed Goals

高效探索的无监督学习：通过自设目标预训练自适应策略

Authors: Octavio Pappalardo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.19810
Pdf link: https://arxiv.org/pdf/2601.19810
Abstract Unsupervised pre-training can equip reinforcement learning agents with prior knowledge and accelerate learning in downstream tasks. A promising direction, grounded in human development, investigates agents that learn by setting and pursuing their own goals. The core challenge lies in how to effectively generate, select, and learn from such goals. Our focus is on broad distributions of downstream tasks where solving every task zero-shot is infeasible. Such settings naturally arise when the target tasks lie outside of the pre-training distribution or when their identities are unknown to the agent. In this work, we (i) optimize for efficient multi-episode exploration and adaptation within a meta-learning framework, and (ii) guide the training curriculum with evolving estimates of the agent's post-adaptation performance. We present ULEE, an unsupervised meta-learning method that combines an in-context learner with an adversarial goal-generation strategy that maintains training at the frontier of the agent's capabilities. On XLand-MiniGrid benchmarks, ULEE pre-training yields improved exploration and adaptation abilities that generalize to novel objectives, environment dynamics, and map structures. The resulting policy attains improved zero-shot and few-shot performance, and provides a strong initialization for longer fine-tuning processes. It outperforms learning from scratch, DIAYN pre-training, and alternative curricula.
中文摘要 无监督预训练可以为强化学习智能体提供先验知识，并加速下游任务的学习。一个以人类发展为依据的有前景的方向，是研究通过设定并追求自身目标来学习的智能体。核心挑战在于如何有效地生成、选择这些目标并从中学习。我们关注的是下游任务分布宽广、无法零样本解决每个任务的情形。当目标任务位于预训练分布之外，或任务身份对智能体未知时，自然会出现此类设置。在本研究中，我们（i）在元学习框架内针对高效的多回合探索与适应进行优化，（ii）利用对智能体适应后性能的动态估计来引导训练课程。我们提出 ULEE，一种无监督元学习方法，它将上下文学习器与对抗式目标生成策略相结合，使训练始终处于智能体能力的前沿。在 XLand-MiniGrid 基准上，ULEE 预训练提升了探索和适应能力，并可泛化到新的目标、环境动态和地图结构。所得策略的零样本和少样本性能均有提升，并为更长的微调过程提供了良好的初始化。它优于从零开始学习、DIAYN 预训练以及其他课程方案。

A Latent Space Framework for Modeling Transient Engine Emissions Using Joint Embedding Predictive Architectures

一个利用联合嵌入预测架构建模瞬态发动机排放的潜在空间框架

Authors: Ganesh Sundaram, Tobias Gehra, Jonas Ulmen, Mirjan Heubaum, Daniel Görges, Michael Günthner
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2601.19822
Pdf link: https://arxiv.org/pdf/2601.19822
Abstract Accurately modeling and controlling vehicle exhaust emissions during transient events, such as rapid acceleration, is critical for meeting environmental regulations and optimizing powertrains. Conventional data-driven methods, such as Multilayer Perceptrons (MLPs) and Long Short-Term Memory (LSTM) networks, improve upon phenomenological models but often struggle with the complex nonlinear dynamics of emission formation. These monolithic architectures are sensitive to dataset variability and typically require deep, computationally expensive structures to perform well, limiting their practical utility. This paper introduces a novel approach that overcomes these limitations by modeling emission dynamics within a structured latent space. Leveraging a Joint Embedding Predictive Architecture (JEPA), the proposed framework learns from a rich dataset that combines real-world Portable Emission Measurement System (PEMS) data with high-frequency hardware-in-the-loop measurements. The model abstracts away irrelevant noise, encoding only the key factors governing emission behavior into a compact, robust representation. This results in superior data efficiency and predictive accuracy across diverse transient regimes, significantly outperforming high-performing LSTM baselines in generalization. To ensure suitability for real-world deployment, the JEPA framework is structured to support pruning and post-training quantization. This strategy drastically reduces the computational footprint, minimizing inference time and memory demand with negligible accuracy loss. The result is a highly efficient model ideal for on-board implementation of advanced strategies, such as model predictive control or model-based reinforcement learning, in conventional and hybrid powertrains. These findings offer a clear pathway toward more robust emission control systems for next-generation vehicles.
中文摘要 在瞬态事件（如急加速）中准确建模和控制车辆尾气排放，对于满足环保法规和优化动力总成至关重要。传统的数据驱动方法，如多层感知器（MLP）和长短期记忆（LSTM）网络，相比现象学模型有所改进，但常常难以应对排放形成过程中复杂的非线性动力学。这些单体式架构对数据集的变异性很敏感，通常需要较深且计算代价高的结构才能取得良好效果，限制了其实用性。本文提出了一种新方法，通过在结构化潜在空间内建模排放动力学来克服这些局限。所提框架利用联合嵌入预测架构（JEPA），从一个结合了真实道路便携式排放测量系统（PEMS）数据与高频硬件在环测量的丰富数据集中学习。该模型滤除无关噪声，仅将决定排放行为的关键因素编码为紧凑且稳健的表示。这使其在多种瞬态工况下具有更优的数据效率和预测精度，在泛化方面显著优于高性能的 LSTM 基线。为确保适合实际部署，JEPA 框架在结构上支持剪枝和训练后量化。该策略大幅降低了计算开销，在精度损失可忽略的情况下最大限度地减少推理时间和内存需求。最终得到一个高效的模型，非常适合在传统和混合动力总成中车载实现先进策略，如模型预测控制或基于模型的强化学习。这些发现为下一代车辆实现更稳健的排放控制系统指明了清晰的路径。

Self-Distillation Enables Continual Learning

自我蒸馏促进持续学习

Authors: Idan Shenfeld, Mehul Damani, Jonas Hübotter, Pulkit Agrawal
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.19897
Pdf link: https://arxiv.org/pdf/2601.19897
Abstract Continual learning, enabling models to acquire new skills and knowledge without degrading existing capabilities, remains a fundamental challenge for foundation models. While on-policy reinforcement learning can reduce forgetting, it requires explicit reward functions that are often unavailable. Learning from expert demonstrations, the primary alternative, is dominated by supervised fine-tuning (SFT), which is inherently off-policy. We introduce Self-Distillation Fine-Tuning (SDFT), a simple method that enables on-policy learning directly from demonstrations. SDFT leverages in-context learning by using a demonstration-conditioned model as its own teacher, generating on-policy training signals that preserve prior capabilities while acquiring new skills. Across skill learning and knowledge acquisition tasks, SDFT consistently outperforms SFT, achieving higher new-task accuracy while substantially reducing catastrophic forgetting. In sequential learning experiments, SDFT enables a single model to accumulate multiple skills over time without performance regression, establishing on-policy distillation as a practical path to continual learning from demonstrations.
中文摘要 持续学习，即让模型在不削弱现有能力的前提下获得新技能和知识，仍然是基础模型面临的根本挑战。虽然同策略（on-policy）强化学习可以减少遗忘，但它需要显式的奖励函数，而这些函数往往不可得。从专家演示中学习是主要的替代方案，目前以监督微调（SFT）为主，而 SFT 本质上是异策略（off-policy）的。我们提出了自蒸馏微调（SDFT），这是一种简单的方法，能够直接从演示中进行同策略学习。SDFT 利用上下文学习，将以演示为条件的模型作为自身的教师，生成同策略训练信号，在获得新技能的同时保留已有能力。在技能学习和知识获取任务中，SDFT 始终优于 SFT，在取得更高新任务准确率的同时显著减少了灾难性遗忘。在顺序学习实验中，SDFT 使单一模型能够随时间积累多种技能而不出现性能退化，确立了同策略蒸馏作为从演示中持续学习的实用路径。
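
The SDFT abstract describes one model acting as its own teacher once a demonstration is placed in its context. A distillation signal of that kind is typically a divergence between the two next-token distributions, which the toy sketch below computes. The abstract does not state which divergence or direction SDFT uses, so the choice of KL(student || teacher) is an assumption, and the probabilities are made up for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two categorical distributions over the same tokens."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p if p[t] > 0.0)


# Hypothetical next-token probabilities from one model under two contexts.
student = {"A": 0.5, "B": 0.3, "C": 0.2}  # prompt only
teacher = {"A": 0.1, "B": 0.8, "C": 0.1}  # prompt plus an expert demonstration

# Assumed training signal: how far the prompt-only model is from its
# demonstration-conditioned self on this token position.
loss = kl_divergence(student, teacher)
print(loss)
```

The divergence is zero only when the prompt-only distribution already matches the demonstration-conditioned one, so minimizing it pulls the model toward what it would do if it had seen the demonstration.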

Keyword: diffusion policy

There is no result