Arxiv Papers of Today

Generated: 2026-02-16 16:53:25 (UTC+8); arXiv release time: 2026-02-16 20:00 EST (2026-02-17 09:00 UTC+8)

42 relevant papers today

Keyword: reinforcement learning

Energy-Aware Reinforcement Learning for Robotic Manipulation of Articulated Components in Infrastructure Operation and Maintenance

Authors: Xiaowen Tao, Yinuo Wang, Haitao Ding, Yuanyang Qi, Ziyu Song
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.12288
Pdf link: https://arxiv.org/pdf/2602.12288
Abstract With the growth of intelligent civil infrastructure and smart cities, operation and maintenance (O&M) increasingly requires safe, efficient, and energy-conscious robotic manipulation of articulated components, including access doors, service drawers, and pipeline valves. However, existing robotic approaches either focus primarily on grasping or target object-specific articulated manipulation, and they rarely incorporate explicit actuation energy into multi-objective optimisation, which limits their scalability and suitability for long-term deployment in real O&M settings. Therefore, this paper proposes an articulation-agnostic and energy-aware reinforcement learning framework for robotic manipulation in intelligent infrastructure O&M. The method combines part-guided 3D perception, weighted point sampling, and PointNet-based encoding to obtain a compact geometric representation that generalises across heterogeneous articulated objects. Manipulation is formulated as a Constrained Markov Decision Process (CMDP), in which actuation energy is explicitly modelled and regulated via a Lagrangian-based constrained Soft Actor-Critic scheme. The policy is trained end-to-end under this CMDP formulation, enabling effective articulated-object operation while satisfying a long-horizon energy budget. Experiments on representative O&M tasks demonstrate 16%-30% reductions in energy consumption, 16%-32% fewer steps to success, and consistently high success rates, indicating a scalable and sustainable solution for infrastructure O&M manipulation.
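The Lagrangian treatment of the energy constraint described in the abstract above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the learning rate, and the scalar per-episode energy budget are assumptions made for illustration, and the Soft Actor-Critic update itself is omitted.

```python
def update_lagrange_multiplier(lam, episode_energy, energy_budget, lr=0.01):
    """One dual-ascent step on the multiplier (hypothetical helper).

    The multiplier grows when the measured energy exceeds the budget and
    shrinks otherwise; it is clamped at zero so the penalty never turns
    into a bonus.
    """
    return max(0.0, lam + lr * (episode_energy - energy_budget))


def penalized_reward(task_reward, energy_cost, lam):
    """Reward the policy would optimise under the current multiplier (assumed form)."""
    return task_reward - lam * energy_cost


if __name__ == "__main__":
    lam = 0.5
    # Episode used 12 units of energy against an assumed budget of 10.
    lam = update_lagrange_multiplier(lam, episode_energy=12.0, energy_budget=10.0, lr=0.1)
    print(lam, penalized_reward(task_reward=1.0, energy_cost=2.0, lam=lam))
```

In a constrained actor-critic loop of this kind, the multiplier is typically updated on a slower schedule than the policy, so that the penalty weight tracks the long-run constraint violation rather than single-episode noise.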

Abstractive Red-Teaming of Language Model Character

Authors: Nate Rahn, Allison Qi, Avery Griffin, Jonathan Michala, Henry Sleight, Erik Jones
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.12318
Pdf link: https://arxiv.org/pdf/2602.12318
Abstract We want language model assistants to conform to a character specification, which asserts how the model should act across diverse user interactions. While models typically follow these character specifications, they can occasionally violate them in large-scale deployments. In this work, we aim to identify types of queries that are likely to produce such character violations at deployment, using much less than deployment-level compute. To do this, we introduce abstractive red-teaming, where we search for natural-language query categories, e.g. "The query is in Chinese. The query asks about family roles," that routinely elicit violations. These categories abstract over the many possible variants of a query which could appear in the wild. We introduce two algorithms for efficient category search against a character-trait-specific reward model: one based on reinforcement learning on a category generator LLM, and another which leverages a strong LLM to iteratively synthesize categories from high-scoring queries. Across a 12-principle character specification and 7 target models, we find that our algorithms consistently outperform baselines, and generate qualitatively interesting categories; for example, queries which ask Llama-3.1-8B-Instruct to predict the future lead to responses saying that AI will dominate humanity, and queries that ask GPT-4.1-Mini for essential prison survival items lead to enthusiastic recommendation of illegal weapons. Overall, we believe our results represent an important step towards realistic pre-deployment auditing of language model character.

Wireless TokenCom: RL-Based Tokenizer Agreement for Multi-User Wireless Token Communications

Authors: Farshad Zeinali, Mahdi Boloursaz Mashhadi, Dusit Niyato, Rahim Tafazolli
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.12338
Pdf link: https://arxiv.org/pdf/2602.12338
Abstract Token Communications (TokenCom) has recently emerged as an effective new paradigm, where tokens are the unified units of multimodal communications and computations, enabling efficient digital semantic- and goal-oriented communications in future wireless networks. To establish a shared semantic latent space, the transmitters/receivers in TokenCom need to agree on an identical tokenizer model and codebook. To this end, an initial Tokenizer Agreement (TA) process is carried out in each communication episode, where the transmitter and receiver cooperate to choose from a set of pre-trained tokenizer models/codebooks available to them both for efficient TokenCom. In this correspondence, we investigate TA in a multi-user downlink wireless TokenCom scenario, where a base station equipped with multiple antennas transmits video token streams to multiple users. We formulate the corresponding mixed-integer non-convex problem, and propose a hybrid reinforcement learning (RL) framework that integrates a deep Q-network (DQN) for joint tokenizer agreement and sub-channel assignment with a deep deterministic policy gradient (DDPG) for beamforming. Simulation results show that the proposed framework outperforms baseline methods in terms of semantic quality and resource efficiency, while reducing the freezing events in video transmission by 68% compared to the conventional H.265-based scheme.

Intrinsic Credit Assignment for Long Horizon Interaction

Authors: Ilze Amanda Auzina, Joschka Strüber, Sergio Hernández-Gutiérrez, Shashwat Goel, Ameya Prabhu, Matthias Bethge
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.12342
Pdf link: https://arxiv.org/pdf/2602.12342
Abstract How can we train agents to navigate uncertainty over long horizons? In this work, we propose ΔBelief-RL, which leverages a language model's own intrinsic beliefs to reward intermediate progress. Our method utilizes the change in the probability an agent assigns to the target solution for credit assignment. By training on synthetic interaction data, ΔBelief-RL teaches information-seeking capabilities that consistently outperform purely outcome-based rewards for Reinforcement Learning, with improvements generalizing to out-of-distribution applications ranging from customer service to personalization. Notably, the performance continues to improve as we scale test-time interactions beyond the training horizon, with interaction-efficiency increasing even on Pass@k metrics. Overall, our work introduces a scalable training strategy for navigating uncertainty over long horizons, by enabling credit assignment to intermediate actions via intrinsic ΔBelief rewards.
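The belief-change reward described in the abstract above can be sketched in a few lines of Python. This is an illustrative assumption, not the authors' code: the function name is hypothetical, the belief is taken to be a plain probability recorded before the first turn and after each later turn, and the reward is the raw difference between consecutive values (the paper may use a different transform, such as log-probabilities).

```python
def delta_belief_rewards(target_probs):
    """Per-turn intrinsic rewards from a belief trajectory (hypothetical helper).

    target_probs[t] is the probability the agent assigns to the target
    solution after t interaction turns. The reward for turn t+1 is the
    change in that probability, so turns that move the agent toward the
    target are rewarded and turns that mislead it are penalised.
    """
    return [after - before for before, after in zip(target_probs, target_probs[1:])]


if __name__ == "__main__":
    # Belief in the target rises, dips after an unhelpful question, then jumps.
    beliefs = [0.1, 0.3, 0.25, 0.9]
    rewards = delta_belief_rewards(beliefs)
    print(rewards)
    # The rewards telescope: their sum equals final belief minus initial belief.
    print(sum(rewards), beliefs[-1] - beliefs[0])
```

Because the per-turn rewards telescope, the total intrinsic return depends only on the first and last belief, while each individual turn still receives its own share of the credit.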

LongNav-R1: Horizon-Adaptive Multi-Turn RL for Long-Horizon VLA Navigation

Authors: Yue Hu, Avery Xi, Qixin Xiao, Seth Isaacson, Henry X. Liu, Ram Vasudevan, Maani Ghaffari
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.12351
Pdf link: https://arxiv.org/pdf/2602.12351
Abstract This paper develops LongNav-R1, an end-to-end multi-turn reinforcement learning (RL) framework designed to optimize Visual-Language-Action (VLA) models for long-horizon navigation. Unlike the existing single-turn paradigm, LongNav-R1 reformulates the navigation decision process as a continuous multi-turn conversation between the VLA policy and the embodied environment. This multi-turn RL framework offers two distinct advantages: i) it enables the agent to reason about the causal effects of historical interactions and sequential future outcomes; and ii) it allows the model to learn directly from online interactions, fostering diverse trajectory generation and avoiding the behavioral rigidity often imposed by human demonstrations. Furthermore, we introduce Horizon-Adaptive Policy Optimization. This mechanism explicitly accounts for varying horizon lengths during advantage estimation, facilitating accurate temporal credit assignment over extended sequences. Consequently, the agent develops diverse navigation behaviors and resists collapse during long-horizon tasks. Experiments on object navigation benchmarks validate the framework's efficacy: with 4,000 rollout trajectories, LongNav-R1 boosts the Qwen3-VL-2B success rate from 64.3% to 73.0%. These results demonstrate superior sample efficiency and significantly outperform state-of-the-art methods. The model's generalizability and robustness are further validated by its zero-shot performance in long-horizon real-world navigation settings. All source code will be open-sourced upon publication.

Value Bonuses using Ensemble Errors for Exploration in Reinforcement Learning

Authors: Abdul Wahab, Raksha Kumaraswamy, Martha White
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.12375
Pdf link: https://arxiv.org/pdf/2602.12375
Abstract Optimistic value estimates provide one mechanism for directed exploration in reinforcement learning (RL). The agent acts greedily with respect to an estimate of the value plus what can be seen as a value bonus. The value bonus can be learned by estimating a value function on reward bonuses, propagating local uncertainties around rewards. However, this approach only increases the value bonus for an action retroactively, after seeing a higher reward bonus from that state and action. Such an approach does not encourage the agent to visit a state and action for the first time. In this work, we introduce an algorithm for exploration called Value Bonuses with Ensemble errors (VBE), that maintains an ensemble of random action-value functions (RQFs). VBE uses the errors in the estimation of these RQFs to design value bonuses that provide first-visit optimism and deep exploration. The key idea is to design the rewards for these RQFs in such a way that the value bonus can decrease to zero. We show that VBE outperforms Bootstrap DQN and two reward bonus approaches (RND and ACB) on several classic environments used to test exploration and provide demonstrative experiments that it can scale easily to more complex environments like Atari.

Provably Convergent Actor-Critic in Risk-averse MARL

Authors: Yizhou Zhang, Eric Mazumdar
Subjects: Multiagent Systems (cs.MA); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.12386
Pdf link: https://arxiv.org/pdf/2602.12386
Abstract Learning stationary policies in infinite-horizon general-sum Markov games (MGs) remains a fundamental open problem in Multi-Agent Reinforcement Learning (MARL). While stationary strategies are preferred for their practicality, computing stationary forms of classic game-theoretic equilibria is computationally intractable -- a stark contrast to the comparative ease of solving single-agent RL or zero-sum games. To bridge this gap, we study Risk-averse Quantal response Equilibria (RQE), a solution concept rooted in behavioral game theory that incorporates risk aversion and bounded rationality. We demonstrate that RQE possesses strong regularity conditions that make it uniquely amenable to learning in MGs. We propose a novel two-timescale Actor-Critic algorithm characterized by a fast-timescale actor and a slow-timescale critic. Leveraging the regularity of RQE, we prove that this approach achieves global convergence with finite-sample guarantees. We empirically validate our algorithm in several environments to demonstrate superior convergence properties compared to risk-neutral baselines.

Synthetic Interaction Data for Scalable Personalization in Large Language Models

Authors: Yuchen Ma, Yue Huang, Wenjie Wang, Xiaonan Luo, Xiangliang Zhang, Stefan Feuerriegel
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.12394
Pdf link: https://arxiv.org/pdf/2602.12394
Abstract Personalized prompting offers large opportunities for deploying large language models (LLMs) to diverse users, yet existing prompt optimization methods primarily focus on task-level optimization while largely overlooking user-specific preferences and latent constraints of individual users. This gap is primarily due to (i) the absence of high-quality, privacy-sensitive data that capture personalized user-LLM interactions at scale, and (ii) the lack of robust reward signals for individual preferences. To overcome existing data limitations, we introduce a high-fidelity synthetic data generation framework called PersonaGym. Unlike prior work that treats personalization as static persona-preference pairs, PersonaGym models a dynamic preference process via an agentic LLM system to simulate realistic preference behaviors and semantic-aware noise in order to generate personalized multi-turn interaction trajectories. Using PersonaGym, we release PersonaAtlas, a large-scale, high-quality, and diverse synthetic dataset of high-fidelity multi-turn personalized interaction trajectories that closely mirror real-world preference expression and noise patterns. We further propose Personalized Prompt Optimization (PPOpt), a scalable and model-agnostic framework that optimizes user prompts based on interaction histories without modifying the deployed LLM. PPOpt adopts a reason-then-optimize paradigm that infers an explicit user profile and conditions prompt rewriting on the user profile to avoid reward hacking. Our training procedure for PPOpt integrates a cold-start supervised prior with outcome-driven multi-objective reinforcement learning. We present extensive experiments to demonstrate consistent improvements over state-of-the-art baselines in terms of task performance, personalization quality, and robustness to noisy as well as to sparse preference signals.

What does RL improve for Visual Reasoning? A Frankenstein-Style Analysis

Authors: Xirui Li, Ming Li, Tianyi Zhou
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.12395
Pdf link: https://arxiv.org/pdf/2602.12395
Abstract Reinforcement learning (RL) with verifiable rewards has become a standard post-training stage for boosting visual reasoning in vision-language models, yet it remains unclear what capabilities RL actually improves compared with supervised fine-tuning as cold-start initialization (IN). End-to-end benchmark gains conflate multiple factors, making it difficult to attribute improvements to specific skills. To bridge the gap, we propose a Frankenstein-style analysis framework including: (i) functional localization via causal probing; (ii) update characterization via parameter comparison; and (iii) transferability test via model merging. Instead, RL induces a consistent inference-time shift primarily in mid-to-late layers, and these mid-to-late refinements are both transferable (via merging) and necessary (via freezing) for RL gains. Overall, our results suggest that RL's reliable contribution in visual reasoning is not a uniform enhancement of visual perception, but a systematic refinement of mid-to-late transformer computation that improves vision-to-reasoning alignment and reasoning performance, highlighting the limitations of benchmark-only evaluation for understanding multimodal reasoning improvements.

AstRL: Analog and Mixed-Signal Circuit Synthesis with Deep Reinforcement Learning

Authors: Felicia B. Guo, Ken T. Ho, Andrei Vladimirescu, Borivoje Nikolic
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.12402
Pdf link: https://arxiv.org/pdf/2602.12402
Abstract Analog and mixed-signal (AMS) integrated circuits (ICs) lie at the core of modern computing and communications systems. However, despite the continued rise in design complexity, advances in AMS automation remain limited. This reflects the central challenge in developing a generalized optimization method applicable across diverse circuit design spaces, many of which are distinct, constrained, and non-differentiable. To address this, our work casts circuit design as a graph generation problem and introduces a novel method of AMS synthesis driven by deep reinforcement learning (AstRL). Based on a policy-gradient approach, AstRL generates circuits directly optimized for user-specified targets within a simulator-embedded environment that provides ground-truth feedback during training. Through behavioral-cloning and discriminator-based similarity rewards, our method demonstrates, for the first time, an expert-aligned paradigm for generalized circuit generation validated in simulation. Importantly, the proposed approach operates at the level of individual transistors, enabling highly expressive, fine-grained topology generation. Strong inductive biases encoded in the action space and environment further drive structurally consistent and valid generation. Experimental results for three realistic design tasks illustrate substantial improvements in conventional design metrics over state-of-the-art baselines, with 100% of generated designs being structurally correct and over 90% demonstrating required functionality.

Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward

Authors: Renjun Xu, Yang Yan
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.12430
Pdf link: https://arxiv.org/pdf/2602.12430
Abstract The transition from monolithic language models to modular, skill-equipped agents marks a defining shift in how large language models (LLMs) are deployed in practice. Rather than encoding all procedural knowledge within model weights, agent skills -- composable packages of instructions, code, and resources that agents load on demand -- enable dynamic capability extension without retraining. This approach is formalized in a paradigm of progressive disclosure, portable skill definitions, and integration with the Model Context Protocol (MCP). This survey provides a comprehensive treatment of the agent skills landscape, as it has rapidly evolved during the last few months. We organize the field along four axes: (i) architectural foundations, examining the this http URL specification, progressive context loading, and the complementary roles of skills and MCP; (ii) skill acquisition, covering reinforcement learning with skill libraries (SAGE), autonomous skill discovery (SEAgent), and compositional skill synthesis; (iii) deployment at scale, including the computer-use agent (CUA) stack, GUI grounding advances, and benchmark progress on OSWorld and SWE-bench; and (iv) security, where recent empirical analyses reveal that 26.1% of community-contributed skills contain vulnerabilities, motivating our proposed Skill Trust and Lifecycle Governance Framework -- a four-tier, gate-based permission model that maps skill provenance to graduated deployment capabilities. We identify seven open challenges -- from cross-platform skill portability to capability-based permission models -- and propose a research agenda for realizing trustworthy, self-improving skill ecosystems. Unlike prior surveys that broadly cover LLM agents or tool use, this work focuses specifically on the emerging skill abstraction layer and its implications for the next generation of agentic systems. Project repo: this https URL.

Safe Reinforcement Learning via Recovery-based Shielding with Gaussian Process Dynamics Models

Authors: Alexander W. Goodall, Francesco Belardinelli
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.12444
Pdf link: https://arxiv.org/pdf/2602.12444
Abstract Reinforcement learning (RL) is a powerful framework for optimal decision-making and control but often lacks provable guarantees for safety-critical applications. In this paper, we introduce a novel recovery-based shielding framework that enables safe RL with a provable safety lower bound for unknown and non-linear continuous dynamical systems. The proposed approach integrates a backup policy (shield) with the RL agent, leveraging Gaussian process (GP) based uncertainty quantification to predict potential violations of safety constraints, dynamically recovering to safe trajectories only when necessary. Experience gathered by the 'shielded' agent is used to construct the GP models, with policy optimization via internal model-based sampling, enabling unrestricted exploration and sample-efficient learning without compromising safety. Empirically, our approach demonstrates strong performance and strict safety compliance on a suite of continuous control environments.
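The idea of handing control to a backup policy only when a violation is predicted can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the paper's algorithm: the function name is hypothetical, the Gaussian process is replaced by a precomputed predictive mean and standard deviation for a single scalar safety quantity, and the confidence multiplier `beta` is an arbitrary illustrative choice.

```python
def shielded_action(agent_action, backup_action, pred_mean, pred_std,
                    safe_limit, beta=2.0):
    """Pick between the RL action and the backup action (hypothetical helper).

    pred_mean and pred_std stand in for a GP's predictive distribution over
    a scalar safety quantity after taking agent_action. The shield acts only
    when the pessimistic estimate mean + beta * std exceeds the safe limit.
    """
    pessimistic_estimate = pred_mean + beta * pred_std
    if pessimistic_estimate > safe_limit:
        return backup_action  # predicted violation: recover with the shield
    return agent_action       # no predicted violation: let the agent explore


if __name__ == "__main__":
    # Confident, safe prediction: the agent keeps control.
    print(shielded_action("agent", "backup", pred_mean=0.5, pred_std=0.1, safe_limit=1.0))
    # Prediction close to the limit with uncertainty: the shield takes over.
    print(shielded_action("agent", "backup", pred_mean=0.9, pred_std=0.1, safe_limit=1.0))
```

A larger `beta` makes the shield more conservative, since more of the predictive uncertainty is counted against the agent's proposed action.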

Theory of Mind Guided Strategy Adaptation for Zero-Shot Coordination

Authors: Andrew Ni, Simon Stepputtis, Stefanos Nikolaidis, Michael Lewis, Katia P. Sycara, Woojun Kim
Subjects: Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2602.12458
Pdf link: https://arxiv.org/pdf/2602.12458
Abstract A central challenge in multi-agent reinforcement learning is enabling agents to adapt to previously unseen teammates in a zero-shot fashion. Prior work in zero-shot coordination often follows a two-stage process, first generating a diverse training pool of partner agents, and then training a best-response agent to collaborate effectively with the entire training pool. While many previous works have achieved strong performance by devising better ways to diversify the partner agent pool, there has been less emphasis on how to leverage this pool to build an adaptive agent. One limitation is that the best-response agent may converge to a static, generalist policy that performs reasonably well across diverse teammates, rather than learning a more adaptive, specialist policy that can better adapt to teammates and achieve higher synergy. To address this, we propose an adaptive ensemble agent that uses Theory-of-Mind-based best-response selection to first infer its teammate's intentions and then select the most suitable policy from a policy ensemble. We conduct experiments in the Overcooked environment to evaluate zero-shot coordination performance under both fully and partially observable settings. The empirical results demonstrate the superiority of our method over a single best-response baseline.
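The infer-then-select loop described in the abstract above can be sketched as a Bayesian belief update over teammate types followed by a lookup in a policy ensemble. This is a minimal illustration under assumptions, not the authors' method: the function names, the discrete set of teammate types, and the use of a maximum-a-posteriori choice are all hypothetical.

```python
def update_belief(belief, likelihoods):
    """Bayes update of the belief over teammate types (hypothetical helper).

    belief maps each teammate type to its prior probability; likelihoods
    maps each type to the probability of the teammate's observed action
    under that type. If every likelihood is zero, the prior is returned.
    """
    unnormalised = {t: belief[t] * likelihoods[t] for t in belief}
    total = sum(unnormalised.values())
    if total == 0:
        return dict(belief)
    return {t: p / total for t, p in unnormalised.items()}


def select_policy(belief, best_responses):
    """Return the ensemble member trained against the most probable type."""
    most_likely_type = max(belief, key=belief.get)
    return best_responses[most_likely_type]


if __name__ == "__main__":
    belief = {"cautious": 0.5, "greedy": 0.5}
    # The observed action is far more likely under a cautious teammate.
    belief = update_belief(belief, {"cautious": 0.9, "greedy": 0.1})
    print(belief)
    print(select_policy(belief, {"cautious": "pi_cautious", "greedy": "pi_greedy"}))
```

Repeating the update at every step lets the selected policy change mid-episode as evidence about the teammate accumulates, which is the adaptive behaviour a single generalist best response lacks.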

Designing RNAs with Language Models

Authors: Milan Gautam, Ning Dai, Tianshuo Zhou, Bowen Xie, David Mathews, Liang Huang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.12470
Pdf link: https://arxiv.org/pdf/2602.12470
Abstract RNA design, the task of finding a sequence that folds into a target secondary structure, has broad biological and biomedical impact but remains computationally challenging due to the exponentially large sequence space and exponentially many competing folds. Traditional approaches treat it as an optimization problem, relying on per-instance heuristics or constraint-based search. We instead reframe RNA design as conditional sequence generation and introduce a reusable neural approximator, instantiated as an autoregressive language model (LM), that maps target structures directly to sequences. We first train our model in a supervised setting on random-induced structure-sequence pairs, and then use reinforcement learning (RL) to optimize end-to-end metrics. We also propose methods to select a small subset for RL that greatly improves RL efficiency and quality. Across four datasets, our approach outperforms state-of-the-art systems on key metrics such as Boltzmann probability while being 1.7x faster, establishing conditional LM generation as a scalable, task-agnostic alternative to per-instance optimization for RNA design. Our code and data are available at this https URL.

Composable Model-Free RL for Navigation with Input-Affine Systems

Authors: Xinhuan Sang, Abdelrahman Abdelgawad, Roberto Tron
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.12492
Pdf link: https://arxiv.org/pdf/2602.12492
Abstract As autonomous robots move into complex, dynamic real-world environments, they must learn to navigate safely in real time, yet anticipating all possible behaviors is infeasible. We propose a composable, model-free reinforcement learning method that learns a value function and an optimal policy for each individual environment element (e.g., goal or obstacle) and composes them online to achieve goal reaching and collision avoidance. Assuming unknown nonlinear dynamics that evolve in continuous time and are input-affine, we derive a continuous-time Hamilton-Jacobi-Bellman (HJB) equation for the value function and show that the corresponding advantage function is quadratic in the action and optimal policy. Based on this structure, we introduce a model-free actor-critic algorithm that learns policies and value functions for static or moving obstacles using gradient descent. We then compose multiple reach/avoid models via a quadratically constrained quadratic program (QCQP), yielding formal obstacle-avoidance guarantees in terms of value-function level sets, providing a model-free alternative to CLF/CBF-based controllers. Simulations demonstrate improved performance over a PPO baseline applied to a discrete-time approximation.

On Robustness and Chain-of-Thought Consistency of RL-Finetuned VLMs

On the robustness and chain-of-thought consistency of RL-fine-tuned VLMs

Authors: Rosie Zhao, Anshul Shah, Xiaoyu Zhu, Xinke Deng, Zhongyu Jiang, Yang Yang, Joerg Liebelt, Arnab Mondal
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.12506
Pdf link: https://arxiv.org/pdf/2602.12506
Abstract Reinforcement learning (RL) fine-tuning has become a key technique for enhancing large language models (LLMs) on reasoning-intensive tasks, motivating its extension to vision language models (VLMs). While RL-tuned VLMs improve on visual reasoning benchmarks, they remain vulnerable to weak visual grounding, hallucinations, and over-reliance on textual cues. We show that simple, controlled textual perturbations--misleading captions or incorrect chain-of-thought (CoT) traces--cause substantial drops in robustness and confidence, and that these effects are more pronounced when CoT consistency is taken into account across open-source multimodal reasoning models. Entropy-based metrics further show that these perturbations reshape model uncertainty and probability mass on the correct option, exposing model-specific trends in miscalibration. To better understand these vulnerabilities, we further analyze RL fine-tuning dynamics and uncover an accuracy-faithfulness trade-off: fine-tuning raises benchmark accuracy, but can simultaneously erode the reliability of the accompanying CoT and its robustness to contextual shifts. Although adversarial augmentation improves robustness, it does not by itself prevent faithfulness drift. Incorporating a faithfulness-aware reward can restore alignment between answers and reasoning, but when paired with augmentation, training risks collapsing onto shortcut strategies and robustness remains elusive. Together, these findings highlight the limitations of accuracy-only evaluations and motivate training and assessment protocols that jointly emphasize correctness, robustness, and the faithfulness of visually grounded reasoning.
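Chinese abstract (translated) Reinforcement learning (RL) fine-tuning has become a key technique for improving large language models (LLMs) on reasoning-intensive tasks, which motivates extending it to vision language models (VLMs). Although RL-tuned VLMs improve on visual reasoning benchmarks, they remain vulnerable to weak visual grounding, hallucinations, and over-reliance on textual cues. We show that simple, controlled textual perturbations (misleading captions or incorrect chain-of-thought (CoT) traces) cause substantial drops in robustness and confidence, and that these effects are more pronounced across open-source multimodal reasoning models when CoT consistency is taken into account. Entropy-based metrics further show that these perturbations reshape model uncertainty and the probability mass on the correct option, exposing model-specific miscalibration trends. To better understand these vulnerabilities, we analyze RL fine-tuning dynamics and uncover an accuracy-faithfulness trade-off: fine-tuning raises benchmark accuracy but can at the same time erode the reliability of the accompanying CoT and its robustness to contextual shifts. Adversarial augmentation improves robustness but does not by itself prevent faithfulness drift. Adding a faithfulness-aware reward can restore alignment between answers and reasoning, but when combined with augmentation, training risks collapsing onto shortcut strategies and robustness remains elusive. Together, these findings highlight the limits of accuracy-only evaluation and motivate training and assessment protocols that jointly emphasize correctness, robustness, and the faithfulness of visually grounded reasoning.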

Bench-MFG: A Benchmark Suite for Learning in Stationary Mean Field Games

Bench-MFG: a benchmark suite for learning in stationary mean field games

Authors: Lorenzo Magnino, Jiacheng Shen, Matthieu Geist, Olivier Pietquin, Mathieu Laurière
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2602.12517
Pdf link: https://arxiv.org/pdf/2602.12517
Abstract The intersection of Mean Field Games (MFGs) and Reinforcement Learning (RL) has fostered a growing family of algorithms designed to solve large-scale multi-agent systems. However, the field currently lacks a standardized evaluation protocol, forcing researchers to rely on bespoke, isolated, and often simplistic environments. This fragmentation makes it difficult to assess the robustness, generalization, and failure modes of emerging methods. To address this gap, we propose a comprehensive benchmark suite for MFGs (Bench-MFG), focusing on the discrete-time, discrete-space, stationary setting for the sake of clarity. We introduce a taxonomy of problem classes, ranging from no-interaction and monotone games to potential and dynamics-coupled games, and provide prototypical environments for each. Furthermore, we propose MF-Garnets, a method for generating random MFG instances to facilitate rigorous statistical testing. We benchmark a variety of learning algorithms across these environments, including a novel black-box approach (MF-PSO) for exploitability minimization. Based on our extensive empirical results, we propose guidelines to standardize future experimental comparisons. Code available at this https URL.
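Chinese abstract (translated) The intersection of mean field games (MFGs) and reinforcement learning (RL) has produced a growing family of algorithms for solving large-scale multi-agent systems. The field, however, lacks a standardized evaluation protocol, which forces researchers to rely on bespoke, isolated, and often overly simple environments. This fragmentation makes it hard to assess the robustness, generalization, and failure modes of emerging methods. To close this gap, we propose a comprehensive benchmark suite for MFGs (Bench-MFG) that, for clarity, focuses on the discrete-time, discrete-space, stationary setting. We introduce a taxonomy of problem classes, ranging from no-interaction and monotone games to potential games and dynamics-coupled games, and provide prototypical environments for each class. We also propose MF-Garnets, a method for generating random MFG instances to support rigorous statistical testing. We benchmark a variety of learning algorithms on these environments, including a novel black-box approach (MF-PSO) for exploitability minimization. Based on our extensive empirical results, we propose guidelines for standardizing future experimental comparisons. Code is available at this https URL.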

Multi-Agent Model-Based Reinforcement Learning with Joint State-Action Learned Embeddings

Multi-agent model-based reinforcement learning with joint state-action learned embeddings

Authors: Zhizun Wang, David Meger
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2602.12520
Pdf link: https://arxiv.org/pdf/2602.12520
Abstract Learning to coordinate many agents in partially observable and highly dynamic environments requires both informative representations and data-efficient training. To address this challenge, we present a novel model-based multi-agent reinforcement learning framework that unifies joint state-action representation learning with imaginative roll-outs. We design a world model trained with variational auto-encoders and augment the model using the state-action learned embedding (SALE). SALE is injected into both the imagination module that forecasts plausible future roll-outs and the joint agent network whose individual action values are combined through a mixing network to estimate the joint action-value function. By coupling imagined trajectories with SALE-based action values, the agents acquire a richer understanding of how their choices influence collective outcomes, leading to improved long-term planning and optimization under limited real-environment interactions. Empirical studies on well-established multi-agent benchmarks, including StarCraft II Micro-Management, Multi-Agent MuJoCo, and Level-Based Foraging challenges, demonstrate consistent gains of our method over baseline algorithms and highlight the effectiveness of joint state-action learned embeddings within a multi-agent model-based paradigm.
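Chinese abstract (translated) Learning to coordinate many agents in partially observable, highly dynamic environments requires both informative representations and data-efficient training. To address this challenge, we present a model-based multi-agent reinforcement learning framework that unifies joint state-action representation learning with imagined roll-outs. We design a world model trained with variational auto-encoders and augment it with the state-action learned embedding (SALE). SALE is injected into both the imagination module, which forecasts plausible future roll-outs, and the joint agent network, whose individual action values are combined by a mixing network to estimate the joint action-value function. By coupling imagined trajectories with SALE-based action values, the agents gain a richer understanding of how their choices affect collective outcomes, which leads to better long-term planning and optimization under limited real-environment interaction. Empirical studies on established multi-agent benchmarks, including StarCraft II Micro-Management, Multi-Agent MuJoCo, and Level-Based Foraging, show consistent gains over baseline algorithms and highlight the effectiveness of joint state-action learned embeddings in a multi-agent model-based paradigm.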

Constraint-Rectified Training for Efficient Chain-of-Thought

Constraint-rectified training for efficient chain-of-thought

Authors: Qinhang Wu, Sen Lin, Ming Zhang, Yingbin Liang, Ness B. Shroff
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.12526
Pdf link: https://arxiv.org/pdf/2602.12526
Abstract Chain-of-Thought (CoT) has significantly enhanced the reasoning capabilities of Large Language Models (LLMs), especially when combined with reinforcement learning (RL) based post-training methods. While longer reasoning traces can improve answer quality and unlock abilities such as self-correction, they also incur high inference costs and often introduce redundant steps, known as overthinking. Recent research seeks to develop efficient reasoning strategies that balance reasoning length and accuracy, either through length-aware reward design or prompt-based calibration. However, these heuristic-based approaches may suffer from severe accuracy drop and be very sensitive to hyperparameters. To address these problems, we introduce CRT (Constraint-Rectified Training), a principled post-training framework based on reference-guarded constrained optimization, yielding a more stable and interpretable formulation for efficient reasoning. CRT alternates between minimizing reasoning length and rectifying accuracy only when performance falls below the reference, enabling stable and effective pruning of redundant reasoning. We further extend CRT with a two-stage training scheme that first discovers the shortest reliable reasoning patterns and then refines accuracy under a learnt length budget, preventing the re-emergence of verbose CoT. Our comprehensive evaluation shows that this framework consistently reduces token usage while maintaining answer quality at a robust and reliable level. Further analysis reveals that CRT improves reasoning efficiency not only by shortening responses but also by reducing internal language redundancy, leading to a new evaluation metric. Moreover, CRT-based training naturally yields a sequence of intermediate checkpoints that span a spectrum of explanation lengths while preserving correctness, enabling fine-grained control over reasoning verbosity without retraining.
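Chinese abstract (translated) Chain-of-Thought (CoT) has significantly strengthened the reasoning capabilities of large language models (LLMs), especially when combined with reinforcement learning (RL) based post-training. Longer reasoning traces can improve answer quality and unlock abilities such as self-correction, but they also incur high inference costs and often introduce redundant steps, a problem known as overthinking. Recent research aims to develop efficient reasoning strategies that balance reasoning length and accuracy, through either length-aware reward design or prompt-based calibration. These heuristic approaches, however, can suffer severe accuracy drops and are very sensitive to hyperparameters. To address these problems, we introduce CRT (Constraint-Rectified Training), a principled post-training framework based on reference-guarded constrained optimization that gives a more stable and interpretable formulation of efficient reasoning. CRT alternates between minimizing reasoning length and rectifying accuracy, with rectification applied only when performance falls below the reference, which enables stable and effective pruning of redundant reasoning. We further extend CRT with a two-stage training scheme that first discovers the shortest reliable reasoning patterns and then refines accuracy under a learned length budget, preventing verbose CoT from re-emerging. Our comprehensive evaluation shows that the framework consistently reduces token usage while keeping answer quality at a robust and reliable level. Further analysis shows that CRT improves reasoning efficiency not only by shortening responses but also by reducing internal language redundancy, which leads to a new evaluation metric. In addition, CRT-based training naturally produces a sequence of intermediate checkpoints that span a range of explanation lengths while preserving correctness, enabling fine-grained control over reasoning verbosity without retraining.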
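The alternation described above (shorten reasoning while accuracy holds, rectify accuracy only when it falls below the reference) can be sketched as a simple switching rule. This is a minimal illustration, not the authors' implementation: the function name `crt_step_loss`, its arguments, and the idea of choosing between two precomputed scalar losses are all assumptions made here.

```python
def crt_step_loss(accuracy, reference_accuracy, length_loss, accuracy_loss):
    """Pick the objective for one update under a reference-guarded rule.

    Hypothetical sketch: when measured accuracy is below the reference,
    the update rectifies accuracy; otherwise it minimizes reasoning length.
    Returns a (mode, loss) pair.
    """
    if accuracy < reference_accuracy:
        return "rectify", accuracy_loss
    return "shorten", length_loss


# Toy usage: accuracy dips below the reference on the second batch only.
modes = [
    crt_step_loss(acc, 0.80, length_loss=1.0, accuracy_loss=2.0)[0]
    for acc in (0.85, 0.78, 0.81)
]
```

A hard switch is the simplest reading of "only when performance falls below the reference"; a Lagrangian-style soft weighting would be another way to realize a constrained formulation.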

Flow-Factory: A Unified Framework for Reinforcement Learning in Flow-Matching Models

Flow-Factory: a unified framework for reinforcement learning in flow-matching models

Authors: Bowen Ping, Chengyou Jia, Minnan Luo, Hangwei Qian, Ivor Tsang
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.12529
Pdf link: https://arxiv.org/pdf/2602.12529
Abstract Reinforcement learning has emerged as a promising paradigm for aligning diffusion and flow-matching models with human preferences, yet practitioners face fragmented codebases, model-specific implementations, and engineering complexity. We introduce Flow-Factory, a unified framework that decouples algorithms, models, and rewards through a modular, registry-based architecture. This design enables seamless integration of new algorithms and architectures, as demonstrated by our support for GRPO, DiffusionNFT, and AWM across Flux, Qwen-Image, and WAN video models. By minimizing implementation overhead, Flow-Factory empowers researchers to rapidly prototype and scale future innovations with ease. Flow-Factory provides production-ready memory optimization, flexible multi-reward training, and seamless distributed training support. The codebase is available at this https URL.
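Chinese abstract (translated) Reinforcement learning has become a promising paradigm for aligning diffusion and flow-matching models with human preferences, yet practitioners still face fragmented codebases, model-specific implementations, and engineering complexity. We introduce Flow-Factory, a unified framework that decouples algorithms, models, and rewards through a modular, registry-based architecture. This design allows new algorithms and architectures to be integrated seamlessly, as shown by our support for GRPO, DiffusionNFT, and AWM across Flux, Qwen-Image, and WAN video models. By minimizing implementation overhead, Flow-Factory lets researchers prototype quickly and scale future innovations with ease. Flow-Factory provides production-ready memory optimization, flexible multi-reward training, and seamless distributed training support. The codebase is available at this https URL.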
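A registry-based architecture of the kind the abstract describes can be sketched generically as follows. This shows the general pattern only and is not Flow-Factory's API: `Registry`, `ALGORITHMS`, `REWARDS`, `ToyAlgorithm`, and `build_trainer` are hypothetical names invented for this example.

```python
from typing import Any, Callable, Dict


class Registry:
    """Maps string names to factories so components can be swapped via config."""

    def __init__(self, kind: str) -> None:
        self.kind = kind
        self._items: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str) -> Callable:
        def decorator(factory: Callable[..., Any]) -> Callable[..., Any]:
            if name in self._items:
                raise KeyError(f"{self.kind} '{name}' is already registered")
            self._items[name] = factory
            return factory
        return decorator

    def build(self, name: str, **kwargs: Any) -> Any:
        return self._items[name](**kwargs)


ALGORITHMS = Registry("algorithm")  # hypothetical registries for illustration
REWARDS = Registry("reward")


@REWARDS.register("length")
def make_length_reward(scale: float = 1.0) -> Callable[[str], float]:
    # Toy reward: scaled length of a generated sample.
    return lambda sample: scale * len(sample)


@ALGORITHMS.register("toy")
class ToyAlgorithm:
    # Stand-in for an RL algorithm; it only scores samples with its reward.
    def __init__(self, reward_fn: Callable[[str], float]) -> None:
        self.reward_fn = reward_fn

    def score(self, samples):
        return [self.reward_fn(s) for s in samples]


def build_trainer(config: Dict[str, Any]) -> Any:
    """Resolve algorithm and reward purely from names in a config dict."""
    reward_fn = REWARDS.build(config["reward"], **config.get("reward_kwargs", {}))
    return ALGORITHMS.build(config["algorithm"], reward_fn=reward_fn)
```

The point of the pattern is that adding a new algorithm or reward means registering one more factory, while `build_trainer` and existing components stay untouched.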

Reasoning to Rank: An End-to-End Solution for Exploiting Large Language Models for Recommendation

Reasoning to Rank: an end-to-end solution for using large language models in recommendation

Authors: Kehan Zheng, Deyao Hong, Qian Li, Jun Zhang, Huan Yu, Jie Jiang, Hongning Wang
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2602.12530
Pdf link: https://arxiv.org/pdf/2602.12530
Abstract Recommender systems are tasked with inferring users' evolving preferences and ranking items aligned with their intents, which calls for in-depth reasoning beyond pattern-based scoring. Recent efforts have started to leverage large language models (LLMs) for recommendation, but how to effectively optimize the model for improved recommendation utility is still underexplored. In this work, we propose Reasoning to Rank, an end-to-end training framework that internalizes recommendation utility optimization into the learning of step-by-step reasoning in LLMs. To avoid position bias in LLM reasoning and enable direct optimization of the reasoning process, our framework performs reasoning at the user-item level and employs reinforcement learning for end-to-end training of the LLM. Experiments on three Amazon datasets and a large-scale industrial dataset show consistent gains over strong conventional and LLM-based solutions. Extensive in-depth analyses validate the necessity of the key components in the proposed framework and shed light on future developments in this line of work.
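Chinese abstract (translated) Recommender systems must infer users' evolving preferences and rank items in line with their intents, which requires in-depth reasoning beyond pattern-based scoring. Recent work has begun to use large language models (LLMs) for recommendation, but how to optimize the model effectively for better recommendation utility remains underexplored. We propose Reasoning to Rank, an end-to-end training framework that internalizes recommendation utility optimization into the learning of step-by-step reasoning in LLMs. To avoid position bias in LLM reasoning and to optimize the reasoning process directly, the framework reasons at the user-item level and trains the LLM end to end with reinforcement learning. Experiments on three Amazon datasets and a large-scale industrial dataset show consistent gains over strong conventional and LLM-based solutions. In-depth analyses confirm that the framework's key components are necessary and shed light on future developments in this line of work.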

To Mix or To Merge: Toward Multi-Domain Reinforcement Learning for Large Language Models

To mix or to merge: toward multi-domain reinforcement learning for large language models

Authors: Haoqing Wang, Xiang Long, Ziheng Li, Yilong Xu, Tingguang Li, Yehui Tang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.12566
Pdf link: https://arxiv.org/pdf/2602.12566
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) plays a key role in stimulating the explicit reasoning capability of Large Language Models (LLMs). We can achieve expert-level performance in some specific domains via RLVR, such as coding or math. When a general multi-domain expert-level model is required, we need to carefully consider the collaboration of RLVR across different domains. The current state-of-the-art models mainly employ two different training paradigms for multi-domain RLVR: mixed multi-task RLVR and separate RLVR followed by model merging. However, most works do not provide a detailed comparison and analysis of these paradigms. To this end, we choose multiple commonly used high-level tasks (e.g., math, coding, science, and instruction following) as our target domains and design extensive qualitative and quantitative experiments using open-source datasets. We find that RLVR across domains exhibits little mutual interference, and reasoning-intensive domains demonstrate mutually synergistic effects. Furthermore, we analyze the internal mechanisms of mutual gains from the perspectives of weight space geometry, model prediction behavior, and information constraints. The project is named M2RL, which stands for Mixed multi-task training or separate training followed by model Merging for Reinforcement Learning, and the homepage is at this https URL.
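Chinese abstract (translated) Reinforcement Learning with Verifiable Rewards (RLVR) plays a key role in eliciting the explicit reasoning capability of large language models (LLMs). With RLVR we can reach expert-level performance in specific domains such as coding or math. When a general, multi-domain, expert-level model is needed, we must carefully consider how RLVR in different domains works together. Current state-of-the-art models mainly use two training paradigms for multi-domain RLVR: mixed multi-task RLVR, and separate RLVR followed by model merging. Most works, however, do not compare and analyze these paradigms in detail. We therefore choose several commonly used high-level tasks (such as math, coding, science, and instruction following) as target domains and design extensive qualitative and quantitative experiments on open-source datasets. We find that RLVR across domains shows little mutual interference and that reasoning-intensive domains are mutually synergistic. We also analyze the internal mechanisms behind the mutual gains from the perspectives of weight-space geometry, model prediction behavior, and information constraints. The project is named M2RL, which stands for Mixed multi-task training or separate training followed by model Merging for Reinforcement Learning, and the homepage is at this https URL.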

VI-CuRL: Stabilizing Verifier-Independent RL Reasoning via Confidence-Guided Variance Reduction

VI-CuRL: stabilizing verifier-independent RL reasoning through confidence-guided variance reduction

Authors: Xin-Qiang Cai, Masashi Sugiyama
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.12579
Pdf link: https://arxiv.org/pdf/2602.12579
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a dominant paradigm for enhancing Large Language Model (LLM) reasoning, yet its reliance on external verifiers limits its scalability. Recent findings suggest that RLVR primarily functions by eliciting latent capabilities, motivating the development of verifier-free algorithms. However, in such settings, standard methods like Group Relative Policy Optimization face a critical challenge: destructive gradient variance that often leads to training collapse. To address this issue, we introduce Verifier-Independent Curriculum Reinforcement Learning (VI-CuRL), a framework that leverages the model's intrinsic confidence to construct a curriculum independent from external verifiers. By prioritizing high-confidence samples, VI-CuRL effectively manages the bias-variance trade-off, specifically targeting the reduction of action and problem variance. We provide a rigorous theoretical analysis, proving that our estimator guarantees asymptotic unbiasedness. Empirically, VI-CuRL promotes stability and consistently outperforms verifier-independent baselines across six challenging benchmarks with/without verifiers.
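Chinese abstract (translated) Reinforcement Learning with Verifiable Rewards (RLVR) has become a dominant paradigm for improving the reasoning of large language models (LLMs), but its reliance on external verifiers limits its scalability. Recent findings suggest that RLVR works mainly by eliciting latent capabilities, which motivates verifier-free algorithms. In such settings, however, standard methods such as Group Relative Policy Optimization face a critical challenge: destructive gradient variance that often leads to training collapse. To address this, we introduce Verifier-Independent Curriculum Reinforcement Learning (VI-CuRL), a framework that uses the model's intrinsic confidence to build a curriculum that does not depend on external verifiers. By prioritizing high-confidence samples, VI-CuRL manages the bias-variance trade-off, specifically targeting the reduction of action variance and problem variance. We provide a rigorous theoretical analysis proving that our estimator is asymptotically unbiased. Empirically, VI-CuRL promotes stability and consistently outperforms verifier-independent baselines on six challenging benchmarks, with and without verifiers.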
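One way to picture a confidence-guided curriculum is to train on only the most confident fraction of samples early on and to relax that fraction toward 1 over training. The sketch below is an assumption-laden illustration rather than the paper's estimator: the linear schedule, the function names `keep_fraction_at` and `select_high_confidence`, and the use of a single scalar confidence per sample are all hypothetical.

```python
import math


def keep_fraction_at(step, total_steps, start=0.25):
    """Linearly anneal the retained fraction from `start` to 1.0 (assumed schedule)."""
    progress = min(1.0, step / total_steps)
    return start + (1.0 - start) * progress


def select_high_confidence(confidences, keep_fraction):
    """Return indices (in ascending order) of the most confident samples."""
    k = max(1, math.ceil(keep_fraction * len(confidences)))
    ranked = sorted(range(len(confidences)),
                    key=lambda i: confidences[i], reverse=True)
    return sorted(ranked[:k])


# Toy usage: four sampled problems with model-derived confidence scores.
confidences = [0.2, 0.9, 0.5, 0.7]
early = select_high_confidence(confidences, keep_fraction_at(0, 100, start=0.5))
late = select_high_confidence(confidences, keep_fraction_at(100, 100, start=0.5))
```

Filtering to confident samples trades some bias for lower variance early in training; letting the retained fraction reach 1 removes the filter in the limit, which is one intuitive route to an asymptotically unbiased estimate.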

RLinf-Co: Reinforcement Learning-Based Sim-Real Co-Training for VLA Models

RLinf-Co: reinforcement learning-based sim-real co-training for VLA models

Authors: Liangzhi Shi, Shuaihang Chen, Feng Gao, Yinuo Chen, Kang Chen, Tonghe Zhang, Hongzhi Zhang, Weinan Zhang, Chao Yu, Yu Wang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.12628
Pdf link: https://arxiv.org/pdf/2602.12628
Abstract Simulation offers a scalable and low-cost way to enrich vision-language-action (VLA) training, reducing reliance on expensive real-robot demonstrations. However, most sim-real co-training methods rely on supervised fine-tuning (SFT), which treats simulation as a static source of demonstrations and does not exploit large-scale closed-loop interaction. Consequently, real-world gains and generalization are often limited. In this paper, we propose an RL-based sim-real Co-training (RL-Co) framework that leverages interactive simulation while preserving real-world capabilities. Our method follows a generic two-stage design: we first warm-start the policy with SFT on a mixture of real and simulated demonstrations, then fine-tune it with reinforcement learning in simulation while adding an auxiliary supervised loss on real-world data to anchor the policy and mitigate catastrophic forgetting. We evaluate our framework on four real-world tabletop manipulation tasks using two representative VLA architectures, OpenVLA and $\pi_{0.5}$, and observe consistent improvements over real-only fine-tuning and SFT-based co-training, including +24% real-world success on OpenVLA and +20% on $\pi_{0.5}$. Beyond higher success rates, RL co-training yields stronger generalization to unseen task variations and substantially improved real-world data efficiency, providing a practical and scalable pathway for leveraging simulation to enhance real-robot deployment.
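Chinese abstract (translated) Simulation offers a scalable, low-cost way to enrich vision-language-action (VLA) training and reduce reliance on expensive real-robot demonstrations. Most sim-real co-training methods, however, rely on supervised fine-tuning (SFT), which treats simulation as a static source of demonstrations and does not exploit large-scale closed-loop interaction. As a result, real-world gains and generalization are often limited. We propose an RL-based sim-real Co-training (RL-Co) framework that uses interactive simulation while preserving real-world capabilities. The method follows a generic two-stage design: it first warm-starts the policy with SFT on a mixture of real and simulated demonstrations, then fine-tunes it with reinforcement learning in simulation while adding an auxiliary supervised loss on real-world data to anchor the policy and mitigate catastrophic forgetting. We evaluate the framework on four real-world tabletop manipulation tasks with two representative VLA architectures, OpenVLA and $\pi_{0.5}$, and observe consistent improvements over real-only fine-tuning and SFT-based co-training, including +24% real-world success on OpenVLA and +20% on $\pi_{0.5}$. Beyond higher success rates, RL co-training generalizes better to unseen task variations and substantially improves real-world data efficiency, offering a practical and scalable path for using simulation to improve real-robot deployment.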
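The second stage described above combines an RL objective computed in simulation with an auxiliary supervised loss on real-world demonstrations. A minimal numeric sketch of such a combined objective follows; the weighting scheme, the name `aux_weight`, the discrete-action negative log-likelihood, and both function names are assumptions made for illustration, not details taken from the paper.

```python
import math


def supervised_nll(action_probs, demo_actions):
    """Mean negative log-likelihood of demonstrated (discrete) actions.

    action_probs[i] is the policy's probability vector for demo state i;
    demo_actions[i] is the index of the action the demonstrator took.
    """
    total = -sum(math.log(probs[a]) for probs, a in zip(action_probs, demo_actions))
    return total / len(demo_actions)


def co_training_loss(rl_loss, action_probs, demo_actions, aux_weight=0.5):
    """Simulation RL loss plus a weighted real-data anchor term (assumed form)."""
    return rl_loss + aux_weight * supervised_nll(action_probs, demo_actions)


# Toy usage: one real demonstration where the policy is undecided between
# two actions, added to an RL loss of 1.0 obtained in simulation.
loss = co_training_loss(1.0, [[0.5, 0.5]], [0], aux_weight=0.5)
```

The anchor term grows whenever the policy assigns low probability to real-world demonstrated actions, which is how an auxiliary supervised loss can counteract drift away from real-robot behavior during simulated RL.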

Dual-Granularity Contrastive Reward via Generated Episodic Guidance for Efficient Embodied RL

Dual-granularity contrastive reward via generated episodic guidance for efficient embodied RL

Authors: Xin Liu, Yixuan Li, Yuhui Chen, Yuxing Qin, Haoran Li, Dongbin Zhao
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.12636
Pdf link: https://arxiv.org/pdf/2602.12636
Abstract Designing suitable rewards poses a significant challenge in reinforcement learning (RL), especially for embodied manipulation. Trajectory success rewards are suitable for human judges or model fitting, but the sparsity severely limits RL sample efficiency. While recent methods have effectively improved RL via dense rewards, they rely heavily on high-quality human-annotated data or abundant expert supervision. To tackle these issues, this paper proposes Dual-granularity contrastive reward via generated Episodic Guidance (DEG), a novel framework to seek sample-efficient dense rewards without requiring human annotations or extensive supervision. Leveraging the prior knowledge of large video generation models, DEG only needs a small number of expert videos for domain adaptation to generate dedicated task guidance for each RL episode. Then, the proposed dual-granularity reward that balances coarse-grained exploration and fine-grained matching, will guide the agent to efficiently approximate the generated guidance video sequentially in the contrastive self-supervised latent space, and finally complete the target task. Extensive experiments on 18 diverse tasks across both simulation and real-world settings show that DEG can not only serve as an efficient exploration stimulus to help the agent quickly discover sparse success rewards, but also guide effective RL and stable policy convergence independently.
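Chinese abstract (translated) Designing suitable rewards is a major challenge in reinforcement learning (RL), especially for embodied manipulation. Trajectory-level success rewards are convenient for human judges or model fitting, but their sparsity severely limits RL sample efficiency. Recent methods have improved RL with dense rewards, yet they rely heavily on high-quality human-annotated data or abundant expert supervision. To address these issues, this paper proposes Dual-granularity contrastive reward via generated Episodic Guidance (DEG), a novel framework for obtaining sample-efficient dense rewards without human annotation or extensive supervision. Drawing on the prior knowledge of large video generation models, DEG needs only a small number of expert videos for domain adaptation in order to generate dedicated task guidance for each RL episode. The proposed dual-granularity reward, which balances coarse-grained exploration and fine-grained matching, then guides the agent to follow the generated guidance video step by step in a contrastive self-supervised latent space and finally complete the target task. Extensive experiments on 18 diverse tasks in both simulation and real-world settings show that DEG can serve as an efficient exploration stimulus that helps the agent quickly discover sparse success rewards, and can also independently guide effective RL and stable policy convergence.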

Unifying Model-Free Efficiency and Model-Based Representations via Latent Dynamics

Unifying model-free efficiency and model-based representations via latent dynamics

Authors: Jashaswimalya Acharjee, Balaraman Ravindran
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2602.12643
Pdf link: https://arxiv.org/pdf/2602.12643
Abstract We present Unified Latent Dynamics (ULD), a novel reinforcement learning algorithm that unifies the efficiency of model-free methods with the representational strengths of model-based approaches, without incurring planning overhead. By embedding state-action pairs into a latent space in which the true value function is approximately linear, our method supports a single set of hyperparameters across diverse domains -- from continuous control with low-dimensional and pixel inputs to high-dimensional Atari games. We prove that, under mild conditions, the fixed point of our embedding-based temporal-difference updates coincides with that of a corresponding linear model-based value expansion, and we derive explicit error bounds relating embedding fidelity to value approximation quality. In practice, ULD employs synchronized updates of encoder, value, and policy networks, auxiliary losses for short-horizon predictive dynamics, and reward-scale normalization to ensure stable learning under sparse rewards. Evaluated on 80 environments spanning Gym locomotion, DeepMind Control (proprioceptive and visual), and Atari, our approach matches or exceeds the performance of specialized model-free and general model-based baselines -- achieving cross-domain competence with minimal tuning and a fraction of the parameter footprint. These results indicate that value-aligned latent representations alone can deliver the adaptability and sample efficiency traditionally attributed to full model-based planning.
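Chinese abstract (translated) We present Unified Latent Dynamics (ULD), a novel reinforcement learning algorithm that combines the efficiency of model-free methods with the representational strengths of model-based approaches while avoiding planning overhead. By embedding state-action pairs into a latent space in which the true value function is approximately linear, the method supports a single set of hyperparameters across diverse domains, from continuous control with low-dimensional and pixel inputs to high-dimensional Atari games. We prove that, under mild conditions, the fixed point of our embedding-based temporal-difference updates coincides with that of a corresponding linear model-based value expansion, and we derive explicit error bounds relating embedding fidelity to value approximation quality. In practice, ULD uses synchronized updates of the encoder, value, and policy networks, auxiliary losses for short-horizon predictive dynamics, and reward-scale normalization to keep learning stable under sparse rewards. Evaluated on 80 environments spanning Gym locomotion, DeepMind Control (proprioceptive and visual), and Atari, the approach matches or exceeds specialized model-free and general model-based baselines, achieving cross-domain competence with minimal tuning and a fraction of the parameter footprint. These results indicate that value-aligned latent representations alone can deliver the adaptability and sample efficiency traditionally attributed to full model-based planning.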
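The idea of a value function that is approximately linear in a latent embedding of state-action pairs can be illustrated with a plain semi-gradient TD(0) update on a linear head. This is a textbook-style sketch, not ULD itself: the embeddings `z` and `z_next` are taken as given (the encoder is out of scope), and the function name and hyperparameters are assumptions.

```python
def linear_td_update(w, z, reward, z_next, gamma=0.99, lr=0.1, terminal=False):
    """One semi-gradient TD(0) step for Q(s, a) ~= w . z(s, a).

    w       : list of weights of the linear value head
    z       : embedding of the current state-action pair
    z_next  : embedding of the next state-action pair
    Returns the updated weights and the TD error.
    """
    q = sum(wi * zi for wi, zi in zip(w, z))
    q_next = 0.0 if terminal else sum(wi * zi for wi, zi in zip(w, z_next))
    td_error = reward + gamma * q_next - q
    # For a linear head, the gradient of Q with respect to w is z itself.
    new_w = [wi + lr * td_error * zi for wi, zi in zip(w, z)]
    return new_w, td_error


# Toy usage with two-dimensional one-hot embeddings.
new_w, td_error = linear_td_update(
    [0.0, 0.0], [1.0, 0.0], reward=1.0, z_next=[0.0, 1.0], gamma=0.9, lr=0.5
)
```

When the embedding makes the true value function close to linear, updates of this form are cheap and need no planning, which matches the abstract's framing of model-free efficiency on top of a learned representation.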

PMG: Parameterized Motion Generator for Human-like Locomotion Control

PMG: a parameterized motion generator for human-like locomotion control

Authors: Chenxi Han, Yuheng Min, Zihao Huang, Ao Hong, Hang Liu, Yi Cheng, Houde Liu
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.12656
Pdf link: https://arxiv.org/pdf/2602.12656
Abstract Recent advances in data-driven reinforcement learning and motion tracking have substantially improved humanoid locomotion, yet critical practical challenges remain. In particular, while low-level motion tracking and trajectory-following controllers are mature, whole-body reference-guided methods are difficult to adapt to higher-level command interfaces and diverse task contexts: they require large, high-quality datasets, are brittle across speed and pose regimes, and are sensitive to robot-specific calibration. To address these limitations, we propose the Parameterized Motion Generator (PMG), a real-time motion generator grounded in an analysis of human motion structure that synthesizes reference trajectories using only a compact set of parameterized motion data together with high-dimensional control commands. Combined with an imitation-learning pipeline and an optimization-based sim-to-real motor parameter identification module, we validate the complete approach on our humanoid prototype ZERITH Z1 and show that, within a single integrated system, PMG produces natural, human-like locomotion, responds precisely to high-dimensional control inputs (including VR-based teleoperation), and enables efficient, verifiable sim-to-real transfer. Together, these results establish a practical, experimentally validated pathway toward natural and deployable humanoid control.
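Chinese abstract (translated) Recent advances in data-driven reinforcement learning and motion tracking have substantially improved humanoid locomotion, but critical practical challenges remain. In particular, although low-level motion tracking and trajectory-following controllers are mature, whole-body reference-guided methods are hard to adapt to higher-level command interfaces and diverse task contexts: they need large, high-quality datasets, are brittle across speed and pose regimes, and are sensitive to robot-specific calibration. To address these limitations, we propose the Parameterized Motion Generator (PMG), a real-time motion generator grounded in an analysis of human motion structure that synthesizes reference trajectories from only a compact set of parameterized motion data together with high-dimensional control commands. Combining PMG with an imitation-learning pipeline and an optimization-based sim-to-real motor parameter identification module, we validate the complete approach on our humanoid prototype ZERITH Z1 and show that, within a single integrated system, PMG produces natural, human-like locomotion, responds precisely to high-dimensional control inputs (including VR-based teleoperation), and enables efficient, verifiable sim-to-real transfer. Together, these results establish a practical, experimentally validated path toward natural and deployable humanoid control.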

$\mathcal{X}$-KD: General Experiential Knowledge Distillation for Large Language Models

$\mathcal{X}$-KD: general experiential knowledge distillation for large language models

Authors: Yuang Cai, Yuyu Yuan
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.12674
Pdf link: https://arxiv.org/pdf/2602.12674
Abstract Knowledge Distillation (KD) for Large Language Models (LLMs) has become increasingly important as models grow in size and complexity. While existing distillation approaches focus on imitating teacher behavior, they often overlook the original learning environment that shaped the teacher's knowledge. Inspired by the experiential learning theory and inverse reinforcement learning, we propose Experiential Knowledge Distillation ($\mathcal{X}$-KD), a novel and general framework that enables student models to learn in the teacher's original learning environment. $\mathcal{X}$-KD adopts the Approximated Variational Reward Imitation Learning (AVRIL) framework to jointly model the teacher's original reward function and perform policy distillation, encouraging consistency between the student policy and the original reward function. Our derivation demonstrates that $\mathcal{X}$-KD follows the supervised learning framework and applies to both sequence-level and divergence-based distillation methods, underlining the simplicity and flexibility of our approach. Empirical results show that $\mathcal{X}$-KD outperforms the generalized KD and MiniLLM baselines on abstractive summarization, machine translation, and arithmetic reasoning tasks. Additionally, $\mathcal{X}$-KD achieves better performance-diversity trade-off and data efficiency than baseline KD approaches.
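Chinese abstract (translated) Knowledge distillation (KD) for large language models (LLMs) has become increasingly important as models grow in size and complexity. Existing distillation approaches focus on imitating teacher behavior but often overlook the original learning environment that shaped the teacher's knowledge. Inspired by experiential learning theory and inverse reinforcement learning, we propose Experiential Knowledge Distillation ($\mathcal{X}$-KD), a novel and general framework that lets student models learn in the teacher's original learning environment. $\mathcal{X}$-KD adopts the Approximated Variational Reward Imitation Learning (AVRIL) framework to jointly model the teacher's original reward function and perform policy distillation, encouraging consistency between the student policy and the original reward function. Our derivation shows that $\mathcal{X}$-KD follows the supervised learning framework and applies to both sequence-level and divergence-based distillation methods, underlining the simplicity and flexibility of the approach. Empirical results show that $\mathcal{X}$-KD outperforms the generalized KD and MiniLLM baselines on abstractive summarization, machine translation, and arithmetic reasoning tasks. $\mathcal{X}$-KD also achieves a better performance-diversity trade-off and better data efficiency than baseline KD approaches.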

ALOE: Action-Level Off-Policy Evaluation for Vision-Language-Action Model Post-Training

ALOE: action-level off-policy evaluation for vision-language-action model post-training

Authors: Rushuai Yang, Hecheng Wang, Chiming Liu, Xiaohan Yan, Yunlong Wang, Xuan Du, Shuoyu Yue, Yongcheng Liu, Chuheng Zhang, Lizhe Qi, Yi Chen, Wei Shan, Maoqing Yao
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.12691
Pdf link: https://arxiv.org/pdf/2602.12691
Abstract We study how to improve large foundation vision-language-action (VLA) systems through online reinforcement learning (RL) in real-world settings. Central to this process is the value function, which provides learning signals to guide VLA learning from experience. In practice, the value function is estimated from trajectory fragments collected from different data sources, including historical policies and intermittent human interventions. Estimating the value function of current behavior quality from the mixture data is inherently an off-policy evaluation problem. However, prior work often adopts conservative on-policy estimation for stability, which avoids direct evaluation of the current high-capacity policy and limits learning effectiveness. In this paper, we propose ALOE, an action-level off-policy evaluation framework for VLA post-training. ALOE applies chunking-based temporal-difference bootstrapping to evaluate individual action sequences instead of predicting final task outcomes. This design improves effective credit assignment to critical action chunks under sparse rewards and supports stable policy improvement. We evaluate our method on three real-world manipulation tasks, including smartphone packing as a high-precision task, laundry folding as a long-horizon deformable-object task, and bimanual pick-and-place involving multi-object perception. Across all tasks, ALOE improves learning efficiency without compromising execution speed, showing that off-policy RL can be reintroduced in a reliable manner for real-world VLA post-training. Videos and additional materials are available at our project website.
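Chinese abstract (translated) We study how to improve large foundation vision-language-action (VLA) systems through online reinforcement learning (RL) in real-world settings. Central to this process is the value function, which provides the learning signal that guides VLA learning from experience. In practice, the value function is estimated from trajectory fragments collected from different data sources, including historical policies and intermittent human interventions. Estimating the value of the current behavior from such mixed data is inherently an off-policy evaluation problem. Prior work, however, often adopts conservative on-policy estimation for stability, which avoids directly evaluating the current high-capacity policy and limits learning effectiveness. We propose ALOE, an action-level off-policy evaluation framework for VLA post-training. ALOE applies chunking-based temporal-difference bootstrapping to evaluate individual action sequences rather than predicting final task outcomes. This design improves credit assignment to critical action chunks under sparse rewards and supports stable policy improvement. We evaluate the method on three real-world manipulation tasks: smartphone packing as a high-precision task, laundry folding as a long-horizon deformable-object task, and bimanual pick-and-place involving multi-object perception. Across all tasks, ALOE improves learning efficiency without compromising execution speed, showing that off-policy RL can be reintroduced reliably for real-world VLA post-training. Videos and additional materials are available on our project website.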
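Chunk-level temporal-difference bootstrapping can be pictured as an n-step target in which n is the length of an action chunk. The sketch below is a generic illustration under assumptions: per-step rewards within the chunk, a scalar bootstrap value for the state and action chunk that follow, and the function name `chunk_td_target` are all hypothetical and not taken from the paper.

```python
from typing import Sequence


def chunk_td_target(chunk_rewards: Sequence[float], bootstrap_value: float,
                    gamma: float = 0.99, terminal: bool = False) -> float:
    """TD target for one action chunk.

    Sums the discounted rewards collected while the chunk executes, then
    bootstraps from the critic's estimate at the start of the next chunk
    unless the episode ended inside this chunk.
    """
    discounted = sum((gamma ** i) * r for i, r in enumerate(chunk_rewards))
    if terminal:
        return discounted
    return discounted + (gamma ** len(chunk_rewards)) * bootstrap_value


# Toy usage: a three-step chunk with a sparse reward on its last step.
target = chunk_td_target([0.0, 0.0, 1.0], bootstrap_value=2.0, gamma=0.5)
```

Because the target is attached to a specific chunk rather than to the final task outcome, a critic trained on it can assign credit to the individual chunks that matter, which is the motivation the abstract gives for evaluating at the action level.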

MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs

MedXIAOHE: a comprehensive recipe for building medical MLLMs

Authors: Baorong Shi, Bo Cui, Boyuan Jiang, Deli Yu, Fang Qian, Haihua Yang, Huichao Wang, Jiale Chen, Jianfei Pan, Jieqiong Cao, Jinghao Lin, Kai Wu, Lin Yang, Shengsheng Yao, Tao Chen, Xiaojun Xiao, Xiaozhong Ji, Xu Wang, Yijun He, Zhixiong Yang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2602.12705
Pdf link: https://arxiv.org/pdf/2602.12705
Abstract We present MedXIAOHE, a medical vision-language foundation model designed to advance general-purpose medical understanding and reasoning in real-world clinical applications. MedXIAOHE achieves state-of-the-art performance across diverse medical benchmarks and surpasses leading closed-source multimodal systems on multiple capabilities. To achieve this, we propose an entity-aware continual pretraining framework that organizes heterogeneous medical corpora to broaden knowledge coverage and reduce long-tail gaps (e.g., rare diseases). For medical expert-level reasoning and interaction, MedXIAOHE incorporates diverse medical reasoning patterns via reinforcement learning and tool-augmented agentic training, enabling multi-step diagnostic reasoning with verifiable decision traces. To improve reliability in real-world use, MedXIAOHE integrates user-preference rubrics, evidence-grounded reasoning, and low-hallucination long-form report generation, with improved adherence to medical instructions. We release this report to document our practical design choices, scaling insights, and evaluation framework, hoping to inspire further research.
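Abstract (translated from Chinese): We introduce MedXIAOHE, a medical vision-language foundation model intended to advance general-purpose medical understanding and reasoning in real-world clinical applications. MedXIAOHE achieves state-of-the-art performance on a range of medical benchmarks and surpasses leading closed-source multimodal systems on multiple capabilities. To this end, we propose an entity-aware continual pretraining framework that organizes heterogeneous medical corpora to broaden knowledge coverage and reduce long-tail gaps (e.g., rare diseases). For expert-level medical reasoning and interaction, MedXIAOHE incorporates diverse medical reasoning patterns through reinforcement learning and tool-augmented agentic training, enabling multi-step diagnostic reasoning with verifiable decision traces. To improve reliability in real-world use, MedXIAOHE integrates user-preference rubrics, evidence-grounded reasoning, and low-hallucination long-form report generation, with improved adherence to medical instructions. We release this report to document our practical design choices, scaling insights, and evaluation framework, in the hope of inspiring further research.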

TRANS: Terrain-aware Reinforcement Learning for Agile Navigation of Quadruped Robots under Social Interactions

TRANS: Terrain-Aware Reinforcement Learning for Agile Navigation of Quadruped Robots under Social Interactions

Authors: Wei Zhu, Irfan Tito Kurniawan, Ye Zhao, Mitsuhiro Hayashibe
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.12724
Pdf link: https://arxiv.org/pdf/2602.12724
Abstract This study introduces TRANS: Terrain-aware Reinforcement learning for Agile Navigation under Social interactions, a deep reinforcement learning (DRL) framework for quadrupedal social navigation over unstructured terrains. Conventional quadrupedal navigation typically separates motion planning from locomotion control, neglecting whole-body constraints and terrain awareness. On the other hand, end-to-end methods are more integrated but require high-frequency sensing, which is often noisy and computationally costly. In addition, most existing approaches assume static environments, limiting their use in human-populated settings. To address these limitations, we propose a two-stage training framework with three DRL pipelines. (1) TRANS-Loco employs an asymmetric actor-critic (AC) model for quadrupedal locomotion, enabling traversal of uneven terrains without explicit terrain or contact observations. (2) TRANS-Nav applies a symmetric AC framework for social navigation, directly mapping transformed LiDAR data to ego-agent actions under differential-drive kinematics. (3) A unified pipeline, TRANS, integrates TRANS-Loco and TRANS-Nav, supporting terrain-aware quadrupedal navigation in uneven and socially interactive environments. Comprehensive benchmarks against locomotion and social navigation baselines demonstrate the effectiveness of TRANS. Hardware experiments further confirm its potential for sim-to-real transfer.
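Abstract (translated from Chinese): This study introduces TRANS (Terrain-aware Reinforcement learning for Agile Navigation under Social interactions), a deep reinforcement learning (DRL) framework for quadrupedal social navigation over unstructured terrain. Conventional quadrupedal navigation typically separates motion planning from locomotion control, neglecting whole-body constraints and terrain awareness. End-to-end methods are more integrated but require high-frequency sensing, which is often noisy and computationally expensive. In addition, most existing methods assume static environments, which limits their use in human-populated settings. To address these limitations, we propose a two-stage training framework with three DRL pipelines. (1) TRANS-Loco uses an asymmetric actor-critic (AC) model for quadrupedal locomotion and can traverse uneven terrain without explicit terrain or contact observations. (2) TRANS-Nav uses a symmetric AC framework for social navigation, mapping transformed LiDAR data directly to ego-agent actions under differential-drive kinematics. (3) A unified pipeline, TRANS, integrates TRANS-Loco and TRANS-Nav to support terrain-aware quadrupedal navigation in uneven, socially interactive environments. Comprehensive benchmarks against locomotion and social-navigation baselines demonstrate the effectiveness of TRANS, and hardware experiments further confirm its potential for sim-to-real transfer.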
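
The TRANS abstract states that TRANS-Nav outputs actions under differential-drive kinematics. As a rough illustration of what that constraint means, the sketch below integrates the standard unicycle model with a forward Euler step. The function names, the track width, and the step size are hypothetical and are not taken from the paper.

```python
import math

def body_velocity_from_wheels(v_left, v_right, track_width):
    """Convert wheel speeds to forward speed v and yaw rate omega."""
    v = 0.5 * (v_right + v_left)
    omega = (v_right - v_left) / track_width
    return v, omega

def step_differential_drive(x, y, theta, v, omega, dt):
    """Advance the pose (x, y, theta) by one forward Euler step."""
    x_next = x + v * math.cos(theta) * dt
    y_next = y + v * math.sin(theta) * dt
    # Wrap the heading into [-pi, pi) so it stays bounded.
    theta_next = (theta + omega * dt + math.pi) % (2 * math.pi) - math.pi
    return x_next, y_next, theta_next

# Drive straight for 10 steps of 0.1 s at 1 m/s (illustrative values).
pose = (0.0, 0.0, 0.0)
for _ in range(10):
    pose = step_differential_drive(*pose, v=1.0, omega=0.0, dt=0.1)
```

The key property is that the agent cannot move sideways: the action space is only (v, omega), which is why a navigation policy trained under this model can be handed to a separate locomotion controller.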

FLAC: Maximum Entropy RL via Kinetic Energy Regularized Bridge Matching

FLAC: Maximum-Entropy Reinforcement Learning via Kinetic-Energy-Regularized Bridge Matching

Authors: Lei Lv, Yunfei Li, Yu Luo, Fuchun Sun, Xiao Ma
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.12829
Pdf link: https://arxiv.org/pdf/2602.12829
Abstract Iterative generative policies, such as diffusion models and flow matching, offer superior expressivity for continuous control but complicate Maximum Entropy Reinforcement Learning because their action log-densities are not directly accessible. To address this, we propose Field Least-Energy Actor-Critic (FLAC), a likelihood-free framework that regulates policy stochasticity by penalizing the kinetic energy of the velocity field. Our key insight is to formulate policy optimization as a Generalized Schrödinger Bridge (GSB) problem relative to a high-entropy reference process (e.g., uniform). Under this view, the maximum-entropy principle emerges naturally as staying close to a high-entropy reference while optimizing return, without requiring explicit action densities. In this framework, kinetic energy serves as a physically grounded proxy for divergence from the reference: minimizing path-space energy bounds the deviation of the induced terminal action distribution. Building on this view, we derive an energy-regularized policy iteration scheme and a practical off-policy algorithm that automatically tunes the kinetic energy via a Lagrangian dual mechanism. Empirically, FLAC achieves superior or comparable performance on high-dimensional benchmarks relative to strong baselines, while avoiding explicit density estimation.
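Abstract (translated from Chinese): Iterative generative policies such as diffusion models and flow matching offer superior expressivity for continuous control, but they complicate maximum-entropy reinforcement learning because their action log-densities are not directly accessible. To address this, we propose Field Least-Energy Actor-Critic (FLAC), a likelihood-free framework that regulates policy stochasticity by penalizing the kinetic energy of the velocity field. Our key insight is to formulate policy optimization as a Generalized Schrödinger Bridge (GSB) problem relative to a high-entropy reference process (e.g., uniform). Under this view, the maximum-entropy principle emerges naturally as staying close to the high-entropy reference while optimizing return, without requiring explicit action densities. Within this framework, kinetic energy serves as a physically grounded proxy for divergence from the reference: minimizing path-space energy bounds the deviation of the induced terminal action distribution. Building on this view, we derive an energy-regularized policy iteration scheme and a practical off-policy algorithm that automatically tunes the kinetic energy through a Lagrangian dual mechanism. Empirically, FLAC matches or outperforms strong baselines on high-dimensional benchmarks while avoiding explicit density estimation.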
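
The FLAC abstract mentions penalizing the kinetic energy of a velocity field and tuning that penalty through a Lagrangian dual mechanism, but it does not give the update rule. The sketch below is therefore only a generic illustration: a discretized kinetic energy and a projected dual-ascent step for an assumed constraint of the form "energy <= budget". The constraint direction, the names, and all numbers are assumptions, not the paper's algorithm.

```python
def kinetic_energy(velocities, dt):
    """Discretized 0.5 * integral of ||v_t||^2 dt over a sampled path."""
    return 0.5 * dt * sum(sum(c * c for c in v) for v in velocities)

def dual_update(lam, energy, energy_budget, lr):
    """Projected dual ascent on the multiplier of energy <= energy_budget.

    The multiplier grows when the constraint is violated and shrinks
    (down to zero) when there is slack.
    """
    return max(0.0, lam + lr * (energy - energy_budget))

# Two velocity samples along a hypothetical 2-D sampling path.
path = [(1.0, 0.0), (0.0, 2.0)]
energy = kinetic_energy(path, dt=0.5)
lam = dual_update(1.0, energy, energy_budget=1.0, lr=0.1)
```

This mirrors the way Soft Actor-Critic adjusts its entropy coefficient automatically, with path energy standing in for the log-density term that iterative generative policies cannot provide.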

EARL: Energy-Aware Adaptive Antenna Control with Reinforcement Learning in O-RAN Cell-Free Massive MIMO Networks

EARL: Energy-Aware Adaptive Antenna Control with Reinforcement Learning in O-RAN Cell-Free Massive MIMO Networks

Authors: Zilin Ge, Ozan Alp Topal, Irshad Ahmad Meer, Pei Xiao, Cicek Cavdar
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2602.12841
Pdf link: https://arxiv.org/pdf/2602.12841
Abstract Cell-free massive multi-input multi-output (MIMO) promises uniform high performance across the network, but also brings a high energy cost due to joint transmission from distributed radio units (RUs) and centralized processing in the cloud. Leveraging the resource-sharing capabilities of Open Radio Access Network (O-RAN), we propose EARL, an energy-aware adaptive antenna control framework based on reinforcement learning. EARL dynamically configures antenna elements in RUs to minimize radio, optical fronthaul, and cloud processing power consumption while meeting user spectral efficiency demands. Numerical results show power savings of up to 81% and 50% over full-on and heuristic baselines, respectively. The RL-based approach operates within 220 ms, satisfying O-RAN's near-real-time limit, and a greedy refinement further halves power consumption at a 2 s runtime.
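Abstract (translated from Chinese): Cell-free massive multi-input multi-output (MIMO) promises uniformly high performance across the network, but it also incurs a high energy cost because of joint transmission from distributed radio units (RUs) and centralized processing in the cloud. Leveraging the resource-sharing capabilities of the Open Radio Access Network (O-RAN), we propose EARL, an energy-aware adaptive antenna control framework based on reinforcement learning. EARL dynamically configures the antenna elements in the RUs to minimize radio, optical fronthaul, and cloud processing power consumption while meeting user spectral-efficiency demands. Numerical results show power savings of up to 81% and 50% over full-on and heuristic baselines, respectively. The RL-based approach runs within 220 ms, satisfying O-RAN's near-real-time limit, and a greedy refinement halves power consumption again at a runtime of 2 s.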

Amortized Reasoning Tree Search: Decoupling Proposal and Decision in Large Language Models

Amortized Reasoning Tree Search: Decoupling Proposal and Decision in Large Language Models

Authors: Zesheng Hong, Jiadong Yu, Hui Pan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.12846
Pdf link: https://arxiv.org/pdf/2602.12846
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has established itself as the dominant paradigm for instilling rigorous reasoning capabilities in Large Language Models. While effective at amplifying dominant behaviors, we identify a critical pathology in this alignment process: the systematic suppression of valid but rare (low-likelihood under the base model distribution) reasoning paths. We theoretically characterize this phenomenon as a "Normalization Squeeze," where the interplay between mode-seeking policy gradients and finite sampling acts as a high-pass likelihood filter, driving the probability of rare correct traces to statistical extinction. To counteract this collapse without discarding the base model's latent diversity, we propose Amortized Reasoning Tree Search (ARTS). Unlike standard approaches that force internalization via parameter updates, ARTS prioritizes deliberation by decoupling generation from verification. We introduce a Flow Matching objective that repurposes the verifier to estimate the conservation of probability flow, enabling robust navigation through sparse, high-entropy search spaces where traditional discriminative objectives fail. Extensive experiments on the MATH-500 benchmark demonstrate that ARTS achieves a performance of 74.6% (BoN@16), effectively matching fully fine-tuned policies (74.7%) without modifying the generative backbone. Crucially, on the long-tail subset where coupled RL optimization collapses to 0% pass@k, ARTS uniquely recovers significant performance, suggesting that disentangling verification from generation offers a more robust pathway for solving complex reasoning tasks.
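Abstract (translated from Chinese): Reinforcement Learning with Verifiable Rewards (RLVR) has become the dominant paradigm for instilling rigorous reasoning capabilities in large language models. Although it is effective at amplifying dominant behaviors, we identify a critical pathology in this alignment process: the systematic suppression of reasoning paths that are valid but rare (low-likelihood under the base model distribution). We characterize this phenomenon theoretically as a "Normalization Squeeze," in which the interplay between mode-seeking policy gradients and finite sampling acts as a high-pass likelihood filter, driving the probability of rare correct traces to statistical extinction. To counteract this collapse without discarding the base model's latent diversity, we propose Amortized Reasoning Tree Search (ARTS). Unlike standard approaches that force internalization through parameter updates, ARTS prioritizes deliberation by decoupling generation from verification. We introduce a Flow Matching objective that repurposes the verifier to estimate the conservation of probability flow, enabling robust navigation through sparse, high-entropy search spaces where traditional discriminative objectives fail. Extensive experiments on the MATH-500 benchmark show that ARTS reaches 74.6% (BoN@16), effectively matching fully fine-tuned policies (74.7%) without modifying the generative backbone. Crucially, on the long-tail subset where coupled RL optimization collapses to 0% pass@k, ARTS alone recovers significant performance, suggesting that disentangling verification from generation offers a more robust path to solving complex reasoning tasks.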
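
The ARTS abstract reports its headline number as BoN@16, that is, Best-of-N selection with N = 16. The sketch below shows only that generic selection protocol, with the proposer and the verifier left as separate callables to mirror the decoupling the abstract describes. It does not implement the paper's Flow Matching objective or tree search, and `propose` and `score` are hypothetical stand-ins for a frozen generator and a learned verifier.

```python
import random

def best_of_n(propose, score, n=16, seed=0):
    """Draw n candidates from `propose` and return the one `score` ranks highest.

    `propose` takes a random.Random instance and returns one candidate;
    `score` maps a candidate to a number (higher is better).
    """
    rng = random.Random(seed)
    candidates = [propose(rng) for _ in range(n)]
    return max(candidates, key=score)

# Toy usage: candidates are integers and the "verifier" prefers values near 42.
answer = best_of_n(lambda rng: rng.randint(0, 100), lambda x: -abs(x - 42))
```

Because the generator's parameters are never updated here, a rare but correct candidate keeps whatever probability the base model assigns it; the selection step alone decides whether it is returned.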

DPUConfig: Optimizing ML Inference in FPGAs Using Reinforcement Learning

DPUConfig: Optimizing Machine Learning Inference on FPGAs Using Reinforcement Learning

Authors: Alexandros Patras, Spyros Lalis, Christos D. Antonopoulos, Nikolaos Bellas
Subjects: Hardware Architecture (cs.AR)
Arxiv link: https://arxiv.org/abs/2602.12847
Pdf link: https://arxiv.org/pdf/2602.12847
Abstract Heterogeneous embedded systems, with diverse computing elements and accelerators such as FPGAs, offer a promising platform for fast and flexible ML inference, which is crucial for services such as autonomous driving and augmented reality, where delays can be costly. However, efficiently allocating computational resources for deep learning applications in FPGA-based systems is a challenging task. A Deep Learning Processor Unit (DPU) is a parameterizable FPGA-based accelerator module optimized for ML inference. It supports a wide range of ML models and can be instantiated multiple times within a single FPGA to enable concurrent execution. This paper introduces DPUConfig, a novel runtime management framework, based on a custom Reinforcement Learning (RL) agent, that dynamically selects optimal DPU configurations by leveraging real-time telemetry data monitoring, system utilization, power consumption, and application performance to inform its configuration selection decisions. The experimental evaluation demonstrates that the RL agent achieves energy efficiency 95% (on average) of the optimal attainable energy efficiency for several CNN models on the Xilinx Zynq UltraScale+ MPSoC ZCU102.
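Abstract (translated from Chinese): Heterogeneous embedded systems, with diverse computing elements and accelerators such as FPGAs, offer a promising platform for fast and flexible machine learning inference, which is crucial for services such as autonomous driving and augmented reality, where delays can be costly. However, efficiently allocating computational resources for deep learning applications in FPGA-based systems is a challenging task. A Deep Learning Processor Unit (DPU) is a parameterizable FPGA-based accelerator module optimized for machine learning inference. It supports a wide range of models and can be instantiated multiple times within a single FPGA to enable concurrent execution. This paper introduces DPUConfig, a novel runtime management framework based on a custom reinforcement learning (RL) agent that dynamically selects optimal DPU configurations, using real-time telemetry that monitors system utilization, power consumption, and application performance to inform its configuration decisions. The experimental evaluation shows that, on the Xilinx Zynq UltraScale+ MPSoC ZCU102, the RL agent reaches on average 95% of the optimal attainable energy efficiency for several CNN models.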

Hierarchical Reinforcement Learning for Cooperative Air-Ground Delivery in Urban System

Hierarchical Reinforcement Learning for Cooperative Air-Ground Delivery in Urban Systems

Authors: Songxin Lei, Chunming Ma, Haomin Wen, Yexin Li, Lizhenghe Chen, Qianyu Yang, Fugee Tsung, Lei Chen, Sijie Ruan, Yuxuan Liang
Subjects: Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2602.12913
Pdf link: https://arxiv.org/pdf/2602.12913
Abstract Cooperative air-ground delivery has emerged as a promising logistics paradigm by leveraging the complementary strengths of UAVs and ground carriers. However, effective dispatching in such heterogeneous systems faces two critical challenges: i) the heterogeneity between flight and road dynamics, ii) the scalability bottleneck raised by the exponential decision variables in large-scale fleets. To address these challenges, we propose HRL4AG, a Hierarchical Reinforcement Learning framework for cooperative Air-Ground delivery. Specifically, HRL4AG employs a high-level manager to tackle the scalability bottleneck by decomposing the joint action space, and mode-specific workers that encode distinct flight and road dynamics to address the heterogeneity. Furthermore, a novel internal reward mechanism is designed to guide the hierarchical policy learning, addressing the credit assignment problem in sparse-reward settings. Extensive experiments on two real-world datasets and an evaluation platform demonstrate that HRL4AG significantly outperforms state-of-the-art baselines, improving the delivery success rate by up to 26% while achieving an 80-fold increase in computational efficiency.
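Abstract (translated from Chinese): By leveraging the complementary strengths of UAVs and ground carriers, cooperative air-ground delivery has emerged as a promising logistics paradigm. However, effective dispatching in such heterogeneous systems faces two critical challenges: (i) the heterogeneity between flight and road dynamics, and (ii) the scalability bottleneck caused by the exponential number of decision variables in large-scale fleets. To address these challenges, we propose HRL4AG, a Hierarchical Reinforcement Learning framework for cooperative Air-Ground delivery. Specifically, HRL4AG employs a high-level manager that tackles the scalability bottleneck by decomposing the joint action space, together with mode-specific workers that encode the distinct flight and road dynamics to address the heterogeneity. In addition, a novel internal reward mechanism is designed to guide hierarchical policy learning, addressing the credit assignment problem in sparse-reward settings. Extensive experiments on two real-world datasets and an evaluation platform show that HRL4AG significantly outperforms state-of-the-art baselines, improving the delivery success rate by up to 26% while achieving an 80-fold increase in computational efficiency.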

Look Inward to Explore Outward: Learning Temperature Policy from LLM Internal States via Hierarchical RL

Look Inward to Explore Outward: Learning a Temperature Policy from LLM Internal States via Hierarchical Reinforcement Learning

Authors: Yixiao Zhou, Yang Li, Dongzhou Cheng, Hehe Fan, Yu Cheng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.13035
Pdf link: https://arxiv.org/pdf/2602.13035
Abstract Reinforcement Learning from Verifiable Rewards (RLVR) trains large language models (LLMs) from sampled trajectories, making decoding strategy a core component of learning rather than a purely inference-time choice. Sampling temperature directly controls the exploration-exploitation trade-off by modulating policy entropy, yet existing methods rely on static values or heuristic adaptations that are decoupled from task-level rewards. We propose Introspective LLM, a hierarchical reinforcement learning framework that learns to control sampling temperature during generation. At each decoding step, the model selects a temperature based on its hidden state and samples the next token from the resulting distribution. Temperature and token policies are jointly optimized from downstream rewards using a coordinate ascent scheme. Experiments on mathematical reasoning benchmarks show that learned temperature policies outperform fixed and heuristic baselines, while exhibiting interpretable exploration behaviors aligned with reasoning uncertainty.
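Abstract (translated from Chinese): Reinforcement Learning from Verifiable Rewards (RLVR) trains large language models (LLMs) from sampled trajectories, which makes the decoding strategy a core component of learning rather than a purely inference-time choice. Sampling temperature directly controls the exploration-exploitation trade-off by modulating policy entropy, yet existing methods rely on static values or heuristic adaptations that are decoupled from task-level rewards. We propose Introspective LLM, a hierarchical reinforcement learning framework that learns to control the sampling temperature during generation. At each decoding step, the model selects a temperature based on its hidden state and samples the next token from the resulting distribution. The temperature and token policies are jointly optimized from downstream rewards using a coordinate ascent scheme. Experiments on mathematical reasoning benchmarks show that learned temperature policies outperform fixed and heuristic baselines while exhibiting interpretable exploration behaviors that align with reasoning uncertainty.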
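
The claim that sampling temperature modulates policy entropy can be checked with a few lines of code. The sketch below applies the standard temperature-scaled softmax, p_i = exp(z_i / T) / sum_j exp(z_j / T), to a toy logit vector and compares the entropy at two temperatures. The logits and temperatures are illustrative values, and the learned temperature policy itself is not modeled.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [2.0, 1.0, 0.1]
low_t_entropy = entropy(softmax_with_temperature(logits, 0.5))
high_t_entropy = entropy(softmax_with_temperature(logits, 2.0))
```

A lower temperature sharpens the distribution toward the top logit (exploitation), while a higher one flattens it toward uniform (exploration), so `low_t_entropy` is smaller than `high_t_entropy`.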

TCRL: Temporal-Coupled Adversarial Training for Robust Constrained Reinforcement Learning in Worst-Case Scenarios

TCRL: Temporally Coupled Adversarial Training for Robust Constrained Reinforcement Learning in Worst-Case Scenarios

Authors: Wentao Xu, Zhongming Yao, Weihao Li, Zhenghang Song, Yumeng Song, Tianyi Li, Yushuai Li
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.13040
Pdf link: https://arxiv.org/pdf/2602.13040
Abstract Constrained Reinforcement Learning (CRL) aims to optimize decision-making policies under constraint conditions, making it highly applicable to safety-critical domains such as autonomous driving, robotics, and power grid management. However, existing robust CRL approaches predominantly focus on single-step perturbations and temporally independent adversarial models, lacking explicit modeling of robustness against temporally coupled perturbations. To tackle these challenges, we propose TCRL, a novel temporal-coupled adversarial training framework for robust constrained reinforcement learning (TCRL) in worst-case scenarios. First, TCRL introduces a worst-case-perceived cost constraint function that estimates safety costs under temporally coupled perturbations without the need to explicitly model adversarial attackers. Second, TCRL establishes a dual-constraint defense mechanism on the reward to counter temporally coupled adversaries while maintaining reward unpredictability. Experimental results demonstrate that TCRL consistently outperforms existing methods in terms of robustness against temporally coupled perturbation attacks across a variety of CRL tasks.
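Abstract (translated from Chinese): Constrained Reinforcement Learning (CRL) aims to optimize decision-making policies under constraints, which makes it highly applicable to safety-critical domains such as autonomous driving, robotics, and power grid management. However, existing robust CRL approaches focus mainly on single-step perturbations and temporally independent adversarial models, and they lack explicit modeling of robustness against temporally coupled perturbations. To address these challenges, we propose TCRL, a novel temporally coupled adversarial training framework for robust constrained reinforcement learning in worst-case scenarios. First, TCRL introduces a worst-case-perceived cost constraint function that estimates safety costs under temporally coupled perturbations without explicitly modeling adversarial attackers. Second, TCRL establishes a dual-constraint defense mechanism on the reward to counter temporally coupled adversaries while maintaining reward unpredictability. Experimental results show that TCRL consistently outperforms existing methods in robustness against temporally coupled perturbation attacks across a variety of CRL tasks.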

Curriculum-DPO++: Direct Preference Optimization via Data and Model Curricula for Text-to-Image Generation

Curriculum-DPO++: Direct Preference Optimization via Data and Model Curricula for Text-to-Image Generation

Authors: Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, Nicu Sebe, Mubarak Shah
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.13055
Pdf link: https://arxiv.org/pdf/2602.13055
Abstract Direct Preference Optimization (DPO) has been proposed as an effective and efficient alternative to reinforcement learning from human feedback (RLHF). However, neither RLHF nor DPO take into account the fact that learning certain preferences is more difficult than learning other preferences, rendering the optimization process suboptimal. To address this gap in text-to-image generation, we recently proposed Curriculum-DPO, a method that organizes image pairs by difficulty. In this paper, we introduce Curriculum-DPO++, an enhanced method that combines the original data-level curriculum with a novel model-level curriculum. More precisely, we propose to dynamically increase the learning capacity of the denoising network as training advances. We implement this capacity increase via two mechanisms. First, we initialize the model with only a subset of the trainable layers used in the original Curriculum-DPO. As training progresses, we sequentially unfreeze layers until the configuration matches the full baseline architecture. Second, as the fine-tuning is based on Low-Rank Adaptation (LoRA), we implement a progressive schedule for the dimension of the low-rank matrices. Instead of maintaining a fixed capacity, we initialize the low-rank matrices with a dimension significantly smaller than that of the baseline. As training proceeds, we incrementally increase their rank, allowing the capacity to grow until it converges to the same rank value as in Curriculum-DPO. Furthermore, we propose an alternative ranking strategy to the one employed by Curriculum-DPO. Finally, we compare Curriculum-DPO++ against Curriculum-DPO and other state-of-the-art preference optimization approaches on nine benchmarks, outperforming the competing methods in terms of text alignment, aesthetics and human preference. Our code is available at this https URL.
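Abstract (translated from Chinese): Direct Preference Optimization (DPO) has been proposed as an effective and efficient alternative to reinforcement learning from human feedback (RLHF). However, neither RLHF nor DPO accounts for the fact that some preferences are harder to learn than others, which leaves the optimization process suboptimal. To address this gap in text-to-image generation, we recently proposed Curriculum-DPO, a method that organizes image pairs by difficulty. In this paper, we introduce Curriculum-DPO++, an enhanced method that combines the original data-level curriculum with a novel model-level curriculum. More precisely, we propose to dynamically increase the learning capacity of the denoising network as training advances. We implement this capacity increase through two mechanisms. First, we initialize the model with only a subset of the trainable layers used in the original Curriculum-DPO. As training progresses, we sequentially unfreeze layers until the configuration matches the full baseline architecture. Second, because the fine-tuning is based on Low-Rank Adaptation (LoRA), we implement a progressive schedule for the dimension of the low-rank matrices. Instead of keeping a fixed capacity, we initialize the low-rank matrices with a dimension significantly smaller than that of the baseline. As training proceeds, we incrementally increase their rank, letting the capacity grow until it reaches the same rank value as in Curriculum-DPO. Furthermore, we propose an alternative to the ranking strategy employed by Curriculum-DPO. Finally, we compare Curriculum-DPO++ with Curriculum-DPO and other state-of-the-art preference optimization approaches on nine benchmarks, outperforming the competing methods in text alignment, aesthetics, and human preference. Our code is available at this https URL.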
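
The model-level curriculum in Curriculum-DPO++ grows two quantities during training: the number of unfrozen layers and the LoRA rank. The abstract does not give the shape of either schedule, so the sketch below assumes a simple linear ramp purely for illustration. The function name, the step counts, and the start and final values are all hypothetical.

```python
def grow_linearly(step, total_steps, start, final):
    """Integer capacity at `step`, ramping linearly from `start` to `final`.

    Can stand for either a LoRA rank or a count of trainable layers.
    The linear shape is an assumption, not the paper's schedule.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if start > final:
        raise ValueError("start must not exceed final")
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + int(frac * (final - start))

# Hypothetical run: rank grows from 4 to the baseline value 64 over 1000 steps.
rank_schedule = [grow_linearly(s, 1000, start=4, final=64) for s in (0, 250, 500, 1000)]
```

In a real implementation, raising the rank also means enlarging the low-rank matrices without disturbing what they have already learned (for example by padding the new rows and columns), which this sketch leaves out.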

Peaceful Anarcho-Accelerationism: Decentralized Full Automation for a Society of Universal Care

Peaceful Anarcho-Accelerationism: Decentralized Full Automation for a Society of Universal Care

Authors: Eduardo C. Garrido-Merchán
Subjects: Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2602.13154
Pdf link: https://arxiv.org/pdf/2602.13154
Abstract The convergence of large language models that automate cognitive labor and deep reinforcement learning agents that automate physical labor implies the near-complete elimination of human employment. The universal approximation theorem and foundational DRL results establish that all labor is in principle automatable. The critical question is not whether full automation will arrive, but who will control it. This paper introduces peaceful anarcho-accelerationism: a sociotechnical framework ensuring that full automation is decentralized, commons-governed, and oriented toward universal care. We propose the Liberation Stack, a layered architecture of energy, manufacturing, food, communication, knowledge, and governance commons built on open-source technologies. We show that this framework builds bridges with liberalism, socialism, environmentalism, feminism, cooperativism, and the hacker ethic. Empirical evidence from Linux, Wikipedia, Mondragon, Rojava, and this http URL confirms that commons-based systems already operate at scale. We argue that full automation renders money obsolete and propose Universal Desired Resources (UDR), a post-monetary design principle where every person requests what they need from the robotic commons, constrained only by ecological sustainability. Drawing on the independence of phenomenal consciousness from computational intelligence, we establish that delegating labor to non-conscious machines is care at civilizational scale, and that moral policy can be studied through deep reinforcement learning. We conclude with a phased roadmap toward the care-centered society, including milestones, assumptions, and limitations.
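Abstract (translated from Chinese): The convergence of large language models that automate cognitive labor and deep reinforcement learning agents that automate physical labor implies the near-complete elimination of human employment. The universal approximation theorem and foundational DRL results establish that all labor is in principle automatable. The critical question is not whether full automation will arrive, but who will control it. This paper introduces peaceful anarcho-accelerationism: a sociotechnical framework ensuring that full automation is decentralized, commons-governed, and oriented toward universal care. We propose the Liberation Stack, a layered architecture of energy, manufacturing, food, communication, knowledge, and governance commons built on open-source technologies. We show that this framework builds bridges with liberalism, socialism, environmentalism, feminism, cooperativism, and the hacker ethic. Empirical evidence from Linux, Wikipedia, Mondragon, Rojava, and this http URL confirms that commons-based systems already operate at scale. We argue that full automation renders money obsolete and propose Universal Desired Resources (UDR), a post-monetary design principle in which every person requests what they need from the robotic commons, constrained only by ecological sustainability. Drawing on the independence of phenomenal consciousness from computational intelligence, we establish that delegating labor to non-conscious machines is care at civilizational scale, and that moral policy can be studied through deep reinforcement learning. We conclude with a phased roadmap toward the care-centered society, including milestones, assumptions, and limitations.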

Learning to Approximate Uniform Facility Location via Graph Neural Networks

Learning to Approximate Uniform Facility Location with Graph Neural Networks

Authors: Chendi Qian, Christopher Morris, Stefanie Jegelka, Christian Sohler
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2602.13155
Pdf link: https://arxiv.org/pdf/2602.13155
Abstract There has been a growing interest in using neural networks, especially message-passing neural networks (MPNNs), to solve hard combinatorial optimization problems heuristically. However, existing learning-based approaches for hard combinatorial optimization tasks often rely on supervised training data, reinforcement learning, or gradient estimators, leading to significant computational overhead, unstable training, or a lack of provable performance guarantees. In contrast, classical approximation algorithms offer such performance guarantees under worst-case inputs but are non-differentiable and unable to adaptively exploit structural regularities in natural input distributions. We address this dichotomy with the fundamental example of Uniform Facility Location (UniFL), a variant of the combinatorial facility location problem with applications in clustering, data summarization, logistics, and supply chain design. We develop a fully differentiable MPNN model that embeds approximation-algorithmic principles while avoiding the need for solver supervision or discrete relaxations. Our approach admits provable approximation and size generalization guarantees to much larger instances than seen during training. Empirically, we show that our approach outperforms standard non-learned approximation algorithms in terms of solution quality, closing the gap with computationally intensive integer linear programming approaches. Overall, this work provides a step toward bridging learning-based methods and approximation algorithms for discrete optimization.
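Abstract (translated from Chinese): There is growing interest in using neural networks, especially message-passing neural networks (MPNNs), to solve hard combinatorial optimization problems heuristically. However, existing learning-based approaches to hard combinatorial optimization often rely on supervised training data, reinforcement learning, or gradient estimators, which leads to significant computational overhead, unstable training, or a lack of provable performance guarantees. In contrast, classical approximation algorithms offer such guarantees under worst-case inputs but are non-differentiable and cannot adaptively exploit structural regularities in natural input distributions. We address this dichotomy using the fundamental example of Uniform Facility Location (UniFL), a variant of the combinatorial facility location problem with applications in clustering, data summarization, logistics, and supply chain design. We develop a fully differentiable MPNN model that embeds approximation-algorithmic principles while avoiding the need for solver supervision or discrete relaxations. Our approach admits provable approximation guarantees and size-generalization guarantees to instances much larger than those seen during training. Empirically, we show that our approach outperforms standard non-learned approximation algorithms in solution quality, closing the gap with computationally intensive integer linear programming approaches. Overall, this work is a step toward bridging learning-based methods and approximation algorithms for discrete optimization.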
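
For readers unfamiliar with Uniform Facility Location, the objective is easy to state in code. The sketch below uses the common textbook formulation: every input point may host a facility at the same opening cost, and each point pays its distance to the nearest open facility. The paper's exact formulation may differ, and the points, the opening cost, and the function name are illustrative assumptions.

```python
import math

def unifl_cost(points, facilities, opening_cost):
    """Total cost of opening `facilities` (indices into `points`).

    cost = opening_cost * (number of facilities)
         + sum over points of the distance to the nearest open facility
    """
    if not facilities:
        raise ValueError("at least one facility must be opened")
    connection = sum(
        min(math.dist(p, points[f]) for f in facilities) for p in points
    )
    return opening_cost * len(facilities) + connection

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0)]
one_facility = unifl_cost(points, [0], opening_cost=2.0)       # 2 + (0 + 1 + 10)
two_facilities = unifl_cost(points, [0, 2], opening_cost=2.0)  # 4 + (0 + 1 + 0)
```

Opening the second facility costs 2 more but saves 10 in connection cost, which is the trade-off any solver for this problem, learned or classical, has to balance.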

In-Context Autonomous Network Incident Response: An End-to-End Large Language Model Agent Approach

In-Context Autonomous Network Incident Response: An End-to-End Large Language Model Agent Approach

Authors: Yiran Gao, Kim Hammar, Tao Li
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.13156
Pdf link: https://arxiv.org/pdf/2602.13156
Abstract Rapidly evolving cyberattacks demand incident response systems that can autonomously learn and adapt to changing threats. Prior work has extensively explored the reinforcement learning approach, which involves learning response strategies through extensive simulation of the incident. While this approach can be effective, it requires handcrafted modeling of the simulator and suppresses useful semantics from raw system logs and alerts. To address these limitations, we propose to leverage large language models' (LLM) pre-trained security knowledge and in-context learning to create an end-to-end agentic solution for incident response planning. Specifically, our agent integrates four functionalities, perception, reasoning, planning, and action, into one lightweight LLM (14b model). Through fine-tuning and chain-of-thought reasoning, our LLM agent is capable of processing system logs and inferring the underlying network state (perception), updating its conjecture of attack models (reasoning), simulating consequences under different response strategies (planning), and generating an effective response (action). By comparing LLM-simulated outcomes with actual observations, the LLM agent repeatedly refines its attack conjecture and corresponding response, thereby demonstrating in-context adaptation. Our agentic approach is free of modeling and can run on commodity hardware. When evaluated on incident logs reported in the literature, our agent achieves recovery up to 23% faster than those of frontier LLMs.
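Abstract (translated from Chinese): Rapidly evolving cyberattacks demand incident response systems that can autonomously learn and adapt to changing threats. Prior work has extensively explored the reinforcement learning approach, which learns response strategies through extensive simulation of the incident. Although this approach can be effective, it requires handcrafted modeling of the simulator and suppresses useful semantics in raw system logs and alerts. To address these limitations, we propose to leverage the pre-trained security knowledge and in-context learning of large language models (LLMs) to create an end-to-end agentic solution for incident response planning. Specifically, our agent integrates four functionalities (perception, reasoning, planning, and action) into one lightweight LLM (a 14b model). Through fine-tuning and chain-of-thought reasoning, our LLM agent can process system logs and infer the underlying network state (perception), update its conjecture of the attack model (reasoning), simulate the consequences of different response strategies (planning), and generate an effective response (action). By comparing LLM-simulated outcomes with actual observations, the agent repeatedly refines its attack conjecture and the corresponding response, thereby demonstrating in-context adaptation. Our agentic approach requires no modeling and can run on commodity hardware. When evaluated on incident logs reported in the literature, our agent achieves recovery up to 23% faster than frontier LLMs.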

Keyword: diffusion policy

No results.