Arxiv Papers of Today

Generated: 2026-03-02 16:50:04 (UTC+8); Arxiv release time: 2026-03-02 20:00 EST (2026-03-03 09:00 UTC+8)

36 related papers today

Keyword: reinforcement learning

Pacing Opinion Polarization via Graph Reinforcement Learning

Authors: Mingkai Liao
Subjects: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.23390
Pdf link: https://arxiv.org/pdf/2602.23390
Abstract Opinion polarization in online social networks poses serious risks to social cohesion and democratic processes. Recent studies formulate polarization moderation as algorithmic intervention problems under opinion dynamics models, especially the Friedkin-Johnsen (FJ) model. However, most existing methods are tailored to specific linear settings and rely on closed-form steady-state analysis, limiting scalability, flexibility, and applicability to cost-aware, nonlinear, or topology-altering interventions. We propose PACIFIER, a graph reinforcement learning framework for sequential polarization moderation via network interventions. PACIFIER reformulates the canonical ModerateInternal (MI) and ModerateExpressed (ME) problems as sequential decision-making tasks, enabling adaptive intervention policies without repeated steady-state recomputation. The framework is objective-agnostic and extends naturally to FJ-consistent settings, including budget-aware interventions, continuous internal opinions, biased-assimilation dynamics, and node removal. Extensive experiments on real-world networks demonstrate strong performance and scalability across diverse moderation scenarios.
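
As a rough illustration of the Friedkin-Johnsen dynamics that the abstract above builds on, the sketch below iterates the standard FJ update on a tiny weighted graph and compares polarization before and after an intervention on one internal opinion. It is not PACIFIER itself. The function names (`fj_equilibrium`, `polarization`), the two-node graph, and the polarization measure (sum of squared deviations from the mean) are assumptions chosen for illustration.

```python
def fj_equilibrium(weights, internal, iters=500):
    """Iterate the Friedkin-Johnsen update until (approximate) convergence.

    weights[i][j] is the symmetric edge weight between nodes i and j;
    internal[i] is node i's fixed internal opinion.
    """
    n = len(internal)
    expressed = list(internal)
    for _ in range(iters):
        # Each node averages its internal opinion with its neighbors' expressed ones.
        expressed = [
            (internal[i] + sum(weights[i][j] * expressed[j] for j in range(n)))
            / (1.0 + sum(weights[i]))
            for i in range(n)
        ]
    return expressed


def polarization(opinions):
    """Sum of squared deviations from the mean opinion (an assumed measure)."""
    mean = sum(opinions) / len(opinions)
    return sum((z - mean) ** 2 for z in opinions)


# Hypothetical two-node network with opposite internal opinions.
W = [[0.0, 1.0], [1.0, 0.0]]
before = fj_equilibrium(W, [0.0, 1.0])
# An assumed intervention in the spirit of moderating internal opinions:
# pull node 1's internal opinion toward neutral.
after = fj_equilibrium(W, [0.0, 0.5])
```

In this toy case the intervention lowers the polarization of the equilibrium opinions. A learned policy would instead choose which node to moderate at each step.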

Learning to Generate Secure Code via Token-Level Rewards

Authors: Jiazheng Quan, Xiaodong Li, Bin Wang, Guo An, Like Liu, Degen Huang, Lin Liu, Chengbin Hou
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2602.23407
Pdf link: https://arxiv.org/pdf/2602.23407
Abstract Large language models (LLMs) have demonstrated strong capabilities in code generation, yet they remain prone to producing security vulnerabilities. Existing approaches commonly suffer from two key limitations: the scarcity of high-quality security data and coarse-grained reinforcement learning reward signals. To address these challenges, we propose Vul2Safe, a new secure code generation framework that leverages LLM self-reflection to construct high-confidence repair pairs from real-world vulnerabilities, and further generates diverse implicit prompts to build the PrimeVul+ dataset. Meanwhile, we introduce SRCode, a novel training framework that pioneers the use of token-level rewards in reinforcement learning for code security, which enables the model to continuously attend to and reinforce critical fine-grained security patterns during training. Compared with traditional instance-level reward schemes, our approach allows for more precise optimization of local security implementations. Extensive experiments show that PrimeVul+ and SRCode substantially reduce security vulnerabilities in generated code while improving overall code quality across multiple benchmarks.
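
To make the contrast between instance-level and token-level rewards concrete, here is a minimal sketch of a REINFORCE-style loss under both schemes. It is not SRCode's actual objective, which the abstract does not specify. The function names, the log-probabilities, and the per-token reward labels are all hypothetical.

```python
def instance_level_loss(log_probs, reward):
    """REINFORCE-style loss where every token shares one sequence reward."""
    return -reward * sum(log_probs)


def token_level_loss(log_probs, token_rewards):
    """REINFORCE-style loss where each token carries its own reward."""
    if len(log_probs) != len(token_rewards):
        raise ValueError("need one reward per token")
    return -sum(r * lp for r, lp in zip(token_rewards, log_probs))


# Hypothetical log-probabilities for a 4-token completion.
log_probs = [-0.1, -0.2, -1.5, -0.3]
# Assumed labels: only the third token touches a security-critical pattern.
token_rewards = [0.0, 0.0, 1.0, 0.0]

coarse = instance_level_loss(log_probs, reward=1.0)
fine = token_level_loss(log_probs, token_rewards)
```

With the instance-level reward every token is reinforced equally, whereas the token-level variant concentrates the learning signal on the token that was flagged.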

Truncated Step-Level Sampling with Process Rewards for Retrieval-Augmented Reasoning

Authors: Chris Samarinas, Haw-Shiuan Chang, Hamed Zamani
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2602.23440
Pdf link: https://arxiv.org/pdf/2602.23440
Abstract Training large language models to reason with search engines via reinforcement learning is hindered by a fundamental credit assignment problem: existing methods such as Search-R1 provide only a sparse outcome reward after an entire multi-step trajectory, making it infeasible to attribute success or failure to individual reasoning and retrieval decisions. Process-reward methods like StepSearch alleviate this by introducing step-level supervision, but rely on heuristic rewards such as TF-IDF overlap with gold documents, and still sample k complete trajectories per example, retaining high gradient variance. We propose SLATE, a framework built on two complementary ideas: (1) truncated step-level sampling, which generates k trajectories that share a common prefix and differ only at the next step, and (2) dense LLM-as-judge rewards, which replace heuristic scoring with a capable LLM evaluator that assesses the quality of each reasoning step, search query, and answer, providing richer and more reliable supervision. We theoretically prove that under the same dense reward structure, truncated sampling reduces the variance of advantage estimates by up to a factor of T compared to full-trajectory sampling for T-step trajectories, yielding lower-variance, better-targeted policy gradients. Experiments on seven QA benchmarks confirm that SLATE consistently outperforms both sparse-reward and process-reward baselines, with the largest gains on harder multi-hop tasks and smaller models.
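
The idea of truncated step-level sampling can be sketched as follows: k candidate next steps share one prefix, each receives a judge score, and advantages are computed relative to the group. The mean baseline used here is an assumption borrowed from group-relative methods and is not necessarily SLATE's estimator. The scores are made up.

```python
from statistics import mean


def step_advantages(step_rewards):
    """Center the rewards of k candidate next steps that share one prefix."""
    baseline = mean(step_rewards)
    return [r - baseline for r in step_rewards]


# Hypothetical judge scores for k = 4 candidate next steps after a shared prefix.
judge_scores = [0.2, 0.9, 0.5, 0.4]
advantages = step_advantages(judge_scores)
# Index of the candidate step with the largest advantage.
best = max(range(len(advantages)), key=advantages.__getitem__)
```

Because the candidates differ only at the next step, the advantage attributes credit to that single decision rather than to a whole trajectory.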

Human Supervision as an Information Bottleneck: A Unified Theory of Error Floors in Human-Guided Learning

Authors: Alejandro Rodriguez Dominguez
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.23446
Pdf link: https://arxiv.org/pdf/2602.23446
Abstract Large language models are trained primarily on human-generated data and feedback, yet they exhibit persistent errors arising from annotation noise, subjective preferences, and the limited expressive bandwidth of natural language. We argue that these limitations reflect structural properties of the supervision channel rather than model scale or optimization. We develop a unified theory showing that whenever the human supervision channel is not sufficient for a latent evaluation target, it acts as an information-reducing channel that induces a strictly positive excess-risk floor for any learner dominated by it. We formalize this Human-Bounded Intelligence limit and show that across six complementary frameworks (operator theory, PAC-Bayes, information theory, causal inference, category theory, and game-theoretic analyses of reinforcement learning from human feedback), non-sufficiency yields strictly positive lower bounds arising from the same structural decomposition into annotation noise, preference distortion, and semantic compression. The theory explains why scaling alone cannot eliminate persistent human-aligned errors and characterizes conditions under which auxiliary non-human signals (e.g., retrieval, program execution, tools) increase effective supervision capacity and collapse the floor by restoring information about the latent target. Experiments on real preference data, synthetic known-target tasks, and externally verifiable benchmarks confirm the predicted structural signatures: human-only supervision exhibits a persistent floor, while sufficiently informative auxiliary channels strictly reduce or eliminate excess error.

Component Centric Placement Using Deep Reinforcement Learning

Authors: Kart Leong Lim
Subjects: Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.23540
Pdf link: https://arxiv.org/pdf/2602.23540
Abstract Automated placement of components on printed circuit boards (PCBs) is a critical stage in placement layout design. While reinforcement learning (RL) has been successfully applied to system-on-chip IP block placement and chiplet arrangement in complex packages, PCB component placement presents unique challenges due to several factors: variation in component sizes, single- and double-sided boards, wirelength constraints, board constraints, and non-overlapping placement requirements. In this work, we adopt a component-centric layout for automating PCB component placement using RL: first, the main component is fixed at the center, while passive components are placed in proximity to the pins of the main component. Free space around the main component is discretized, drastically reducing the search space while still covering all feasible placements; second, we leverage prior knowledge that each passive's position must be near its corresponding voltage source. This allows us to design a reward function that avoids wasted exploration of infeasible or irrelevant search space. Using the component-centric layout, we implemented different methods including Deep Q-Network, Actor-Critic algorithm and Simulated Annealing. Evaluation on over nine real-world PCBs of varying complexity shows that our best proposed method approaches human-like placements in terms of wirelength and feasibility.
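
The component-centric layout can be illustrated with a small sketch: the main component occupies the center of a grid, the remaining cells form the discrete action space, and a reward favors cells near a target pin. The grid size, the square footprint, the Manhattan-distance reward, and all names are assumptions, not the paper's actual design.

```python
def free_cells(grid_size, main_half_width):
    """Enumerate grid cells outside the square main component at the center."""
    center = grid_size // 2
    cells = []
    for x in range(grid_size):
        for y in range(grid_size):
            inside = (abs(x - center) <= main_half_width
                      and abs(y - center) <= main_half_width)
            if not inside:
                cells.append((x, y))
    return cells


def placement_reward(cell, pin, occupied):
    """Negative Manhattan distance to the target pin; overlap is ruled out."""
    if cell in occupied:
        return float("-inf")
    return -(abs(cell[0] - pin[0]) + abs(cell[1] - pin[1]))


cells = free_cells(grid_size=7, main_half_width=1)
pin = (4, 3)  # hypothetical pin on the right edge of the main component
best_cell = max(cells, key=lambda c: placement_reward(c, pin, occupied=set()))
```

Restricting actions to `cells` is what shrinks the search space. An RL agent would pick among these cells one passive component at a time instead of taking the greedy maximum shown here.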

Construct, Merge, Solve & Adapt with Reinforcement Learning for the min-max Multiple Traveling Salesman Problem

Authors: Guillem Rodríguez-Corominas, Maria J. Blesa, Christian Blum
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.23579
Pdf link: https://arxiv.org/pdf/2602.23579
Abstract The Multiple Traveling Salesman Problem (mTSP) extends the Traveling Salesman Problem to m tours that start and end at a common depot and jointly visit all customers exactly once. In the min-max variant, the objective is to minimize the longest tour, reflecting workload balance. We propose a hybrid approach, Construct, Merge, Solve & Adapt with Reinforcement Learning (RL-CMSA), for the symmetric single-depot min-max mTSP. The method iteratively constructs diverse solutions using probabilistic clustering guided by learned pairwise q-values, merges routes into a compact pool, solves a restricted set-covering MILP, and refines solutions via inter-route remove, shift, and swap moves. The q-values are updated by reinforcing city-pair co-occurrences in high-quality solutions, while the pool is adapted through ageing and pruning. This combination of exact optimization and reinforcement-guided construction balances exploration and exploitation. Computational results on random and TSPLIB instances show that RL-CMSA consistently finds (near-)best solutions and outperforms a state-of-the-art hybrid genetic algorithm under comparable time limits, especially as instance size and the number of salesmen increase.
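
The min-max objective described above is easy to state in code: evaluate every closed tour from the depot and take the longest. The sketch also includes a simple pairwise q-value update that rewards city pairs sharing a tour. That update rule, its step size, and the toy coordinates are assumptions for illustration, not the RL-CMSA update.

```python
import math


def tour_length(depot, stops):
    """Length of a closed tour depot -> stops -> depot (Euclidean)."""
    path = [depot, *stops, depot]
    return sum(math.dist(a, b) for a, b in zip(path, path[1:]))


def min_max_objective(depot, tours):
    """Min-max mTSP objective: the length of the longest tour."""
    return max(tour_length(depot, t) for t in tours)


def reinforce_pairs(q, tours, step=0.1):
    """Nudge q-values of city pairs that share a tour toward 1 (assumed rule)."""
    for tour in tours:
        for i, a in enumerate(tour):
            for b in tour[i + 1:]:
                key = (min(a, b), max(a, b))
                q[key] = q.get(key, 0.0) + step * (1.0 - q.get(key, 0.0))
    return q


depot = (0.0, 0.0)
# Two salesmen: a balanced split versus one salesman doing all the work.
balanced = [[(3.0, 0.0), (3.0, 4.0)], [(-3.0, 0.0), (-3.0, 4.0)]]
unbalanced = [[(3.0, 0.0), (3.0, 4.0), (-3.0, 4.0), (-3.0, 0.0)], []]
q_values = reinforce_pairs({}, balanced)
```

The balanced split scores 12 and the unbalanced one 20, so the min-max objective prefers the solution with the more even workload.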

Learning to Reflect and Correct: Towards Better Decoding Trajectories for Large-Scale Generative Recommendation

Authors: Haibo Xing, Hao Deng, Lingyu Mu, Jinxin Hu, Yu Zhang, Xiaoyi Zeng, Jing Zhang
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2602.23639
Pdf link: https://arxiv.org/pdf/2602.23639
Abstract Generative Recommendation (GR) has become a promising paradigm for large-scale recommendation systems. However, existing GR models typically perform single-pass decoding without explicit refinement, causing early deviations to accumulate and ultimately degrade recommendation quality. To tackle this problem, we propose GRC, which is, to our knowledge, the first structured reflection-correction framework for GR that extends standard decoding into a Generation-Reflection-Correction (GRC) process. Concretely, GRC introduces a supervised reflection-correction template that decomposes the decoding process into initial draft generation, multi-granular reflection, and reflection-guided correction, thereby enabling structured reflection and correction in the semantic token space. To further explore the enlarged refinement space introduced by the GRC process, we optimize the entire GRC trajectory with GRPO-based reinforcement learning, under a carefully designed reward function with token-level and trajectory-level signals. For efficient online serving, we propose an Entropy-Guided Reflection Scheduling (EGRS) strategy that dynamically allocates more correction budget to high-uncertainty decoding trajectories during beam search. Extensive experiments on real-world datasets show that GRC consistently outperforms six state-of-the-art baselines by up to 15.74%, and online A/B tests demonstrate its substantial practical value in large-scale industrial recommendation, delivering a 1.79% lift in advertising revenue with only modest latency overhead.
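
One way to picture entropy-guided scheduling is to rank beams by the Shannon entropy of their next-token distribution and spend a fixed correction budget on the most uncertain ones. This is a guess at the general shape of EGRS, not its published rule. The distributions, the budget, and the function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def schedule_reflection(beam_distributions, budget):
    """Return indices of the `budget` beams with the highest entropy."""
    ranked = sorted(
        range(len(beam_distributions)),
        key=lambda i: entropy(beam_distributions[i]),
        reverse=True,
    )
    return sorted(ranked[:budget])


# Hypothetical next-token distributions for three beams.
beams = [
    [0.97, 0.01, 0.01, 0.01],  # confident
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
    [0.60, 0.20, 0.10, 0.10],  # in between
]
selected = schedule_reflection(beams, budget=2)
```

The confident beam is skipped, which is how such a scheduler would keep the latency overhead of reflection modest.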

The Auton Agentic AI Framework

Authors: Sheng Cao, Zhao Chang, Chang Li, Hannan Li, Liyao Fu, Ji Tang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.23720
Pdf link: https://arxiv.org/pdf/2602.23720
Abstract The field of Artificial Intelligence is undergoing a transition from Generative AI -- probabilistic generation of text and images -- to Agentic AI, in which autonomous systems execute actions within external environments on behalf of users. This transition exposes a fundamental architectural mismatch: Large Language Models (LLMs) produce stochastic, unstructured outputs, whereas the backend infrastructure they must control -- databases, APIs, cloud services -- requires deterministic, schema-conformant inputs. The present paper describes the Auton Agentic AI Framework, a principled architecture for standardizing the creation, execution, and governance of autonomous agent systems. The framework is organized around a strict separation between the Cognitive Blueprint, a declarative, language-agnostic specification of agent identity and capabilities, and the Runtime Engine, the platform-specific execution substrate that instantiates and runs the agent. This separation enables cross-language portability, formal auditability, and modular tool integration via the Model Context Protocol (MCP). The paper formalizes the agent execution model as an augmented Partially Observable Markov Decision Process (POMDP) with a latent reasoning space, introduces a hierarchical memory consolidation architecture inspired by biological episodic memory systems, defines a constraint manifold formalism for safety enforcement via policy projection rather than post-hoc filtering, presents a three-level self-evolution framework spanning in-context adaptation through reinforcement learning, and describes runtime optimizations -- including parallel graph execution, speculative inference, and dynamic context pruning -- that reduce end-to-end latency for multi-step agent workflows.
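
The difference between policy projection and post-hoc filtering can be hinted at with a small sketch: instead of sampling an action and rejecting it afterwards, the action distribution is restricted to the allowed set and renormalized before sampling. The abstract does not define the constraint manifold formalism, so masking plus renormalization is only one simple stand-in, and the action names are invented.

```python
def project_policy(action_probs, allowed):
    """Zero out disallowed actions and renormalize the remaining mass."""
    masked = {a: p for a, p in action_probs.items() if a in allowed}
    total = sum(masked.values())
    if total == 0.0:
        raise ValueError("no allowed action has positive probability")
    return {a: p / total for a, p in masked.items()}


# Hypothetical action distribution proposed by an agent.
proposed = {"read_row": 0.5, "drop_table": 0.3, "send_email": 0.2}
safe = project_policy(proposed, allowed={"read_row", "send_email"})
```

After projection the unsafe action has zero probability by construction, so no sampled action ever needs to be filtered out.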

Bridging Dynamics Gaps via Diffusion Schrödinger Bridge for Cross-Domain Reinforcement Learning

Authors: Hanping Zhang, Yuhong Guo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.23737
Pdf link: https://arxiv.org/pdf/2602.23737
Abstract Cross-domain reinforcement learning (RL) aims to learn transferable policies under dynamics shifts between source and target domains. A key challenge lies in the lack of target-domain environment interaction and reward supervision, which prevents direct policy learning. To address this challenge, we propose Bridging Dynamics Gaps for Cross-Domain Reinforcement Learning (BDGxRL), a novel framework that leverages Diffusion Schrödinger Bridge (DSB) to align source transitions with target-domain dynamics encoded in offline demonstrations. Moreover, we introduce a reward modulation mechanism that estimates rewards based on state transitions, applying to DSB-aligned samples to ensure consistency between rewards and target-domain dynamics. BDGxRL performs target-oriented policy learning entirely within the source domain, without access to the target environment or its rewards. Experiments on MuJoCo cross-domain benchmarks demonstrate that BDGxRL outperforms state-of-the-art baselines and shows strong adaptability under transition dynamics shifts.

MAGE: Multi-scale Autoregressive Generation for Offline Reinforcement Learning

Authors: Chenxing Lin, Xinhui Gao, Haipeng Zhang, Xinran Li, Haitao Wang, Songzhu Mei, Chenglu Wen, Weiquan Liu, Siqi Shen, Cheng Wang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.23770
Pdf link: https://arxiv.org/pdf/2602.23770
Abstract Generative models have gained significant traction in offline reinforcement learning (RL) due to their ability to model complex trajectory distributions. However, existing generation-based approaches still struggle with long-horizon tasks characterized by sparse rewards. Some hierarchical generation methods have been developed to mitigate this issue by decomposing the original problem into shorter-horizon subproblems using one policy and generating detailed actions with another. While effective, these methods often overlook the multi-scale temporal structure inherent in trajectories, resulting in suboptimal performance. To overcome these limitations, we propose MAGE, a Multi-scale Autoregressive GEneration-based offline RL method. MAGE incorporates a condition-guided multi-scale autoencoder to learn hierarchical trajectory representations, along with a multi-scale transformer that autoregressively generates trajectory representations from coarse to fine temporal scales. MAGE effectively captures temporal dependencies of trajectories at multiple resolutions. Additionally, a condition-guided decoder is employed to exert precise control over short-term behaviors. Extensive experiments on five offline RL benchmarks against fifteen baseline algorithms show that MAGE successfully integrates multi-scale trajectory modeling with conditional guidance, generating coherent and controllable trajectories in long-horizon sparse-reward settings.

EMO-R3: Reflective Reinforcement Learning for Emotional Reasoning in Multimodal Large Language Models

Authors: Yiyang Fang, Wenke Huang, Pei Fu, Yihao Yang, Kehua Su, Zhenbo Luo, Jian Luan, Mang Ye
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.23802
Pdf link: https://arxiv.org/pdf/2602.23802
Abstract Multimodal Large Language Models (MLLMs) have shown remarkable progress in visual reasoning and understanding tasks but still struggle to capture the complexity and subjectivity of human emotions. Existing approaches based on supervised fine-tuning often suffer from limited generalization and poor interpretability, while reinforcement learning methods such as Group Relative Policy Optimization fail to align with the intrinsic characteristics of emotional cognition. To address these challenges, we propose Reflective Reinforcement Learning for Emotional Reasoning (EMO-R3), a framework designed to enhance the emotional reasoning ability of MLLMs. Specifically, we introduce Structured Emotional Thinking to guide the model to perform step-by-step emotional reasoning in a structured and interpretable manner, and design a Reflective Emotional Reward that enables the model to re-evaluate its reasoning based on visual-text consistency and emotional coherence. Extensive experiments demonstrate that EMO-R3 significantly improves both the interpretability and emotional intelligence of MLLMs, achieving superior performance across multiple visual emotional understanding benchmarks.

Actor-Critic Pretraining for Proximal Policy Optimization

Authors: Andreas Kernbach, Amr Elsheikh, Nicolas Grupp, René Nagel, Marco F. Huber
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.23804
Pdf link: https://arxiv.org/pdf/2602.23804
Abstract Reinforcement learning (RL) actor-critic algorithms enable autonomous learning but often require a large number of environment interactions, which limits their applicability in robotics. Leveraging expert data can reduce the number of required environment interactions. A common approach is actor pretraining, where the actor network is initialized via behavioral cloning on expert demonstrations and subsequently fine-tuned with RL. In contrast, the initialization of the critic network has received little attention, despite its central role in policy optimization. This paper proposes a pretraining approach for actor-critic algorithms like Proximal Policy Optimization (PPO) that uses expert demonstrations to initialize both networks. The actor is pretrained via behavioral cloning, while the critic is pretrained using returns obtained from rollouts of the pretrained policy. The approach is evaluated on 15 simulated robotic manipulation and locomotion tasks. Experimental results show that actor-critic pretraining improves sample efficiency by 86.1% on average compared to no pretraining and by 30.9% to actor-only pretraining.
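
The critic pretraining step described above relies on returns from rollouts of the pretrained policy. A minimal sketch of how such regression targets could be built is shown below. The rollout format, the discount factor, and the function names are assumptions, and the actual network fitting is omitted.

```python
def discounted_returns(rewards, gamma=0.99):
    """Monte Carlo return G_t = r_t + gamma * G_{t+1} for every time step."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def critic_targets(rollouts, gamma=0.99):
    """Pair each visited state with its return as a regression target."""
    targets = []
    for states, rewards in rollouts:
        targets.extend(zip(states, discounted_returns(rewards, gamma)))
    return targets


# Hypothetical rollout of the behavior-cloned policy: (states, rewards).
rollout = (["s0", "s1", "s2"], [0.0, 0.0, 1.0])
targets = critic_targets([rollout], gamma=0.5)
```

A critic fitted to these (state, return) pairs starts RL fine-tuning with value estimates that already match the cloned actor, rather than a random initialization.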

See, Act, Adapt: Active Perception for Unsupervised Cross-Domain Visual Adaptation via Personalized VLM-Guided Agent

Authors: Tianci Tang, Tielong Cai, Hongwei Wang, Gaoang Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.23806
Pdf link: https://arxiv.org/pdf/2602.23806
Abstract Pre-trained perception models excel in generic image domains but degrade significantly in novel environments like indoor scenes. The conventional remedy is fine-tuning on downstream data, which incurs catastrophic forgetting of prior knowledge and demands costly, scene-specific annotations. We propose a paradigm shift through Sea$^2$ (See, Act, Adapt): rather than adapting the perception modules themselves, we adapt how they are deployed through an intelligent pose-control agent. Sea$^2$ keeps all perception modules frozen, requiring no downstream labels during training, and uses only scalar perceptual feedback to navigate the agent toward informative viewpoints. Specifically, we transform a vision-language model (VLM) into a low-level pose controller through a two-stage training pipeline: first fine-tuning it on rule-based exploration trajectories that systematically probe indoor scenes, and then refining the policy via unsupervised reinforcement learning that constructs rewards from the perception module's outputs and confidence. Unlike prior active perception methods that couple exploration with specific models or collect data for retraining them, Sea$^2$ directly leverages off-the-shelf perception models for various tasks without the need for retraining. We conducted experiments on three visual perception tasks (visual grounding, segmentation, and 3D box estimation), with performance improvements of 13.54%, 15.92%, and 27.68%, respectively, on the ReplicaCAD dataset.

Beyond State-Wise Mirror Descent: Offline Policy Optimization with Parameteric Policies

Authors: Xiang Li, Nan Jiang, Yuheng Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.23811
Pdf link: https://arxiv.org/pdf/2602.23811
Abstract We investigate the theoretical aspects of offline reinforcement learning (RL) under general function approximation. While prior works (e.g., Xie et al., 2021) have established the theoretical foundations of learning a good policy from offline data via pessimism, existing algorithms that are computationally tractable (often in an oracle-efficient sense), such as PSPI, only apply to finite and small action spaces. Moreover, these algorithms rely on state-wise mirror descent and require actors to be implicitly induced from the critic functions, failing to accommodate standalone policy parameterization which is ubiquitous in practice. In this work, we address these limitations and extend the theoretical guarantees to parameterized policy classes over large or continuous action spaces. When extending mirror descent to parameterized policies, we identify contextual coupling as the core difficulty, and show how connecting mirror descent to natural policy gradient leads to novel analyses, guarantees, and algorithmic insights, including a surprising unification between offline RL and imitation learning.
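
For readers unfamiliar with state-wise mirror descent, the sketch below shows the standard KL-regularized update at a single state, where the new policy is proportional to the old policy times exp(eta * Q). It covers only the tabular, per-state baseline that the paper moves beyond. The Q-values, step size, and iteration count are made up.

```python
import math


def mirror_descent_step(policy, q_values, eta):
    """One KL mirror-descent step at one state: pi' proportional to pi * exp(eta * Q)."""
    weights = [p * math.exp(eta * q) for p, q in zip(policy, q_values)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical state with three actions, starting from a uniform policy.
policy = [1 / 3, 1 / 3, 1 / 3]
q_values = [1.0, 0.0, -1.0]
for _ in range(10):
    policy = mirror_descent_step(policy, q_values, eta=0.5)
```

Each state is updated independently here, which is exactly what stops working once a single parameter vector is shared across states, the setting the abstract targets.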

Learning to maintain safety through expert demonstrations in settings with unknown constraints: A Q-learning perspective

Authors: George Papadopoulos, George A. Vouros
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.23816
Pdf link: https://arxiv.org/pdf/2602.23816
Abstract Given a set of trajectories demonstrating the safe execution of a task in a constrained MDP with observable rewards but with unknown constraints and non-observable costs, we aim to find a policy that maximizes the likelihood of the demonstrated trajectories, balancing being conservative against significantly increasing the likelihood of high-reward trajectories that may contain unsafe steps. With these objectives, we aim to learn a policy that maximizes the probability of the most promising trajectories with respect to the demonstrations. In so doing, we formulate the "promise" of individual state-action pairs in terms of Q values, which depend on task-specific rewards as well as on the assessment of states' safety, mixing expectations in terms of rewards and safety. This entails a safe Q-learning perspective of the inverse learning problem under constraints: the devised Safe Q Inverse Constrained Reinforcement Learning (SafeQIL) algorithm is compared to state-of-the-art inverse constraint reinforcement learning algorithms on a set of challenging benchmark tasks, showing its merits.

RUMAD: Reinforcement-Unifying Multi-Agent Debate

Authors: Chao Wang, Han Lin, Huaze Tang, Huijing Lin, Wenbo Ding
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.23864
Pdf link: https://arxiv.org/pdf/2602.23864
Abstract Multi-agent debate (MAD) systems leverage collective intelligence to enhance reasoning capabilities, yet existing approaches struggle to simultaneously optimize accuracy, consensus formation, and computational efficiency. Static topology methods lack adaptability to task complexity variations, while external LLM-based coordination risks introducing privileged knowledge that compromises debate neutrality. This work presents RUMAD (Reinforcement-Unifying Multi-Agent Debate), a novel framework that formulates dynamic communication topology control in MAD as a reinforcement learning (RL) problem. RUMAD employs a content-agnostic observation scheme that captures high-level debate dynamics while avoiding access to raw agent reasoning content. RUMAD uses a multi-objective reward to model solution quality, cohesion, and efficiency. A PPO-trained controller dynamically adjusts edge weights in the communication graph, while a dual-threshold mechanism enables fine-grained control over both agent activation and information visibility. Experimental evaluation across MMLU, GSM8K, and GPQA benchmarks demonstrates that RUMAD achieves substantial efficiency gains, reducing token costs by over 80%, while still improving reasoning accuracy compared to a single LLM and multiple MAD baselines. Notably, RUMAD trained exclusively on MMLU exhibits robust zero-shot generalization to out-of-domain (OOD) tasks, indicating that the learned communication strategies capture task-independent principles of effective multi-agent coordination. These results establish RUMAD as an efficient and robust approach for deploying multi-agent reasoning applications under practical resource constraints.
中文摘要 多智能体辩论（MAD）系统利用集体智能提升推理能力，但现有方法在同时优化准确性、共识形成和计算效率方面仍面临困难。静态拓扑方法缺乏对任务复杂度变化的适应性，而基于大型语言模型的外部协调则可能引入特权知识，从而破坏辩论中立性。本研究提出了RUMAD（强化统一多智能体辩论），这是一个新颖框架，将MAD中的动态通信拓扑控制表述为强化学习（RL）问题。RUMAD 采用内容无关的观察方案，捕捉高层次辩论动态，避免访问原始的代理推理内容。RUMAD 采用多目标奖励来建模解决方案的质量、一致性和效率。经过 PPO 训练的控制器动态调整通信图中的边缘权重，而双阈值机制则实现对代理激活和信息可视化的细致控制。跨 MMLU、GSM8K 和 GPQA 基准测试的实验评估表明，RUMAD 实现了显著的效率提升，令牌成本降低了 80% 以上，同时与单一大型语言模型模型和多个 MAD 基线相比，推理准确性依然提升。值得注意的是，专门基于MMLU训练的RUMAD对域外（OOD）任务表现出稳健的零射点推广，表明所学的通信策略体现了与任务无关的有效多智能体协调原则。这些结果确立了 RUMAD 作为一种高效且稳健的方法，用于部署多智能体推理应用，且具备实用资源约束。

SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale

SWE-rebench V2：大规模语言无关软件工程任务收集

Authors: Ibragim Badertdinov, Maksim Nekrashevich, Anton Shevtsov, Alexander Golubev
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.23866
Pdf link: https://arxiv.org/pdf/2602.23866
Abstract Software engineering agents (SWE) are improving rapidly, with recent gains largely driven by reinforcement learning (RL). However, RL training is constrained by the scarcity of large-scale task collections with reproducible execution environments and reliable test suites. Although a growing number of benchmarks have emerged, datasets suitable for training remain limited in scale and diversity or often target a limited set of high-resource language ecosystems. We introduce SWE-rebench V2, a language-agnostic automated pipeline for harvesting executable real-world SWE tasks and constructing RL training environments at scale. The pipeline synthesizes repository-specific installation and test procedures via an interactive setup agent, and filters unsound instances using an ensemble of LLM judges, validated against human-verified SWE-bench annotations. Using this pipeline, we construct a dataset of 32,000+ tasks spanning 20 languages and 3,600+ repositories, with pre-built images for reproducible execution. To further scale training data, we additionally release 120,000+ tasks with installation instructions, fail-to-pass tests and rich metadata, where the problem statement is generated based on the original pull request description. We validate the collected instances through a diagnostic study that covers a subset of tasks in five programming languages across seven popular models, and provide instance-level metadata that flags common confounders such as overly restrictive tests and underspecified descriptions. We release the datasets, the collection and execution code, and associated artifacts to enable large-scale training of SWE agents across diverse languages and repositories.
中文摘要 软件工程代理（SWE）正在快速提升，近期的进步主要由强化学习（RL）推动。然而，强化学习训练受限于缺乏具有可重复执行环境和可靠测试套件的大规模任务集合。尽管越来越多的基准出现，但适合训练的数据集在规模和多样性上仍然有限，或者通常只针对有限的高资源语言生态系统。我们推出了SWE-rebench V2，这是一条语言无关的自动化流水线，用于收集可执行的真实世界软件工程任务并大规模构建强化学习训练环境。该流水线通过交互式设置代理综合仓库特定的安装和测试程序，并通过一组经过人工验证的软件工程师（SWE）注释验证的LLM评判员组合过滤不健全实例。利用该流程，我们构建了一个涵盖20种语言和3600+个仓库的32,000+任务数据集，并配有预建的图像以实现可重复执行。为了进一步扩展训练数据，我们还发布了120,000+任务，包含安装说明、失败通过测试和丰富的元数据，问题陈述基于原始拉取请求描述生成。我们通过诊断研究验证收集的实例，该研究涵盖了五种编程语言、七个流行模型中的部分任务，并提供了实例级元数据，标记常见的混杂因素，如过于限制的测试和描述不够具体。我们发布数据集、收集和执行代码及相关工件，以支持跨多种语言和仓库大规模训练软件工程代理。
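The abstract above mentions "fail-to-pass tests" as part of each task. As a rough illustration of that notion (not the SWE-rebench V2 pipeline itself), the sketch below selects tests that fail before a patch and pass after it. The function name, the dictionary layout, and the choice to treat tests that are missing before the patch as failing are all assumptions made for this example.

```python
def fail_to_pass(before: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """Return names of tests that fail before the patch and pass after it.

    `before` and `after` map test names to pass/fail outcomes (hypothetical layout).
    Assumption: a test absent from `before` (for example, one added by the patch)
    is treated as failing before the patch.
    """
    return sorted(
        name
        for name, passed in after.items()
        if passed and not before.get(name, False)
    )


if __name__ == "__main__":
    before = {"test_a": False, "test_b": True}
    after = {"test_a": True, "test_b": True, "test_new": True, "test_c": False}
    print(fail_to_pass(before, after))
```

A task whose fail-to-pass list is empty gives no signal about whether a candidate patch fixed anything, which is one reason such instances are typically filtered out.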

Hybrid Offline-Online Reinforcement Learning for Sensorless, High-Precision Force Regulation in Surgical Robotic Grasping

用于无传感器、高精度力调控的混合离线-在线强化学习，用于外科机器人抓取

Authors: Edoardo Fazzari, Omar Mohamed, Khalfan Hableel, Hamdan Alhadhrami, Cesare Stefanini
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.23870
Pdf link: https://arxiv.org/pdf/2602.23870
Abstract Precise grasp force regulation in tendon-driven surgical instruments is fundamentally limited by nonlinear coupling between motor dynamics, transmission compliance, friction, and distal mechanics. Existing solutions typically rely on distal force sensing or analytical compensation, increasing hardware complexity or degrading performance under dynamic motion. We present a sensorless control framework that combines physics-consistent modeling and hybrid reinforcement learning to achieve high-precision distal force regulation in a proximally actuated surgical end-effector. We develop a first-principles digital twin of the da Vinci Xi grasping mechanism that captures coupled electrical, transmission, and jaw dynamics within a unified differential-algebraic formulation. To safely learn control policies in this stiff and highly nonlinear system, we introduce a three-stage pipeline: (i) a receding-horizon CMA-ES oracle that generates dynamically feasible expert trajectories, (ii) fully offline policy learning via Implicit Q-Learning to ensure stable initialization without unsafe exploration, and (iii) online refinement using TD3 for adaptation to on-policy dynamics. The resulting policy directly maps proximal measurements to motor voltages and requires no distal sensing. In simulation, the controller maintains grasp force within 1% of the desired reference during multi-harmonic jaw motion. Hardware experiments demonstrate average force errors below 4% across diverse trajectories, validating sim-to-real transfer. The learned policy contains approximately 71k parameters and executes at kHz rates, enabling real-time deployment. These results demonstrate that high-fidelity modeling combined with structured offline-online RL can recover precise distal force behavior without additional sensing, offering a scalable and mechanically compatible solution for surgical robotic manipulation.
中文摘要 肌腱驱动手术器械中的精确握力调节受运动动力学、传输顺应性、摩擦力学和远端力学之间的非线性耦合限制。现有解决方案通常依赖远端力感应或分析补偿，增加了硬件复杂度或在动态运动下性能下降。我们提出了一个无传感器控制框架，结合物理一致性建模和混合强化学习，实现近端驱动外科末端执行器中高精度的远端力调节。我们开发了达芬奇习握机制的第一原理数字孪生，能够在统一的微分代数表述中捕捉耦合的电学、传输和颚力学。为了安全地学习该僵硬且高度非线性的控制策略，我们引入了三阶段流水线：（i）一个退缩视界的CMA-ES预言机，生成动态可行的专家轨迹，（ii）通过隐式Q学习实现完全离线策略学习，确保初始化稳定且不安全探索，（iii）利用TD3在线细化以适应策略内动态。最终的政策直接将近端测量映射到电机电压，无需远端感应。在仿真中，控制器在多谐波颚运动时，保持抓握力在目标参考值的1%以内。硬件实验显示，在不同轨迹中平均力误差低于4%，验证了模拟到实际的传输。该策略包含约71k参数，执行速率为kH，支持实时部署。这些结果表明，高保真建模结合结构化离线在线强化学习，可以在无需额外感测的情况下恢复精确的远端力行为，为手术机器人作提供了可扩展且机械兼容的解决方案。
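Stage (ii) of the pipeline above relies on Implicit Q-Learning, which fits its value function with an expectile regression loss. The sketch below shows only that loss, applied to a list of differences Q(s, a) - V(s). The function name and the default tau = 0.7 are illustrative assumptions, and nothing here reflects the authors' actual training code.

```python
def expectile_loss(diffs: list[float], tau: float = 0.7) -> float:
    """Mean asymmetric squared loss |tau - 1(u < 0)| * u^2 over u in diffs.

    Each u stands for Q(s, a) - V(s). With tau > 0.5, positive differences are
    penalized more than negative ones, which pushes V toward an upper expectile
    of Q over dataset actions. tau = 0.7 is an assumed example value.
    """
    if not diffs:
        raise ValueError("diffs must be non-empty")
    total = 0.0
    for u in diffs:
        weight = tau if u >= 0 else 1.0 - tau
        total += weight * u * u
    return total / len(diffs)


if __name__ == "__main__":
    print(expectile_loss([1.0, -1.0], tau=0.7))
```

With tau = 0.5 the loss reduces to half the ordinary mean squared error, which makes the asymmetry introduced by larger tau easy to see.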

TSC: Topology-Conditioned Stackelberg Coordination for Multi-Agent Reinforcement Learning in Interactive Driving

TSC：拓扑条件斯塔克伯格协调，用于交互式驾驶中的多智能体强化学习

Authors: Xiaotong Zhang, Gang Xiong, Yuanjing Wang, Siyu Teng, Alois Knoll, Long Chen
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.23896
Pdf link: https://arxiv.org/pdf/2602.23896
Abstract Safe and efficient autonomous driving in dense traffic is fundamentally a decentralized multi-agent coordination problem, where interactions at conflict points such as merging and weaving must be resolved reliably under partial observability. With only local and incomplete cues, interaction patterns can change rapidly, often causing unstable behaviors such as oscillatory yielding or unsafe commitments. Existing multi-agent reinforcement learning (MARL) approaches either adopt synchronous decision-making, which exacerbates non-stationarity, or depend on centralized sequencing mechanisms that scale poorly as traffic density increases. To address these limitations, we propose Topology-conditioned Stackelberg Coordination (TSC), a learning framework for decentralized interactive driving under communication-free execution, which extracts a time-varying directed priority graph from braid-inspired weaving relations between trajectories, thereby defining local leader-follower dependencies without constructing a global order of play. Conditioned on this graph, TSC endogenously factorizes dense interactions into graph-local Stackelberg subgames and, under centralized training and decentralized execution (CTDE), learns a sequential coordination policy that anticipates leaders via action prediction and trains followers through action-conditioned value learning to approximate local best responses, improving training stability and safety in dense traffic. Experiments across four dense traffic scenarios show that TSC achieves superior performance over representative MARL baselines across key metrics, most notably reducing collisions while maintaining competitive traffic efficiency and control smoothness.
中文摘要 在密集交通中安全高效的自动驾驶本质上是一个去中心化的多智能体协调问题，冲突点如并线和交织等交互必须在部分可观测性下可靠地解决。仅有局部且不完整的线索时，相互作用模式可能迅速变化，常导致不稳定行为，如振荡性让步或不安全的承诺。现有的多智能体强化学习（MARL）方法要么采用同步决策，这会加剧非平稳性，要么依赖于随着流量密度增加而扩展性较差的集中式序列机制。为解决这些局限性，我们提出了拓扑条件斯塔克尔伯格协调（TSC），这是一种去中心化交互式驾驶的学习框架，基于无通信执行，从轨迹间编织关系中提取时间变化的有向优先级图，从而定义局部领导者-跟随者依赖关系，而无需构建全局的游戏顺序。基于该图，TSC内生将密集交互分解为图局部Stackelberg子博弈，并在集中训练与去中心化执行（CTDE）下学习顺序协调策略，通过行动预测预测领导者，并通过动作条件值学习训练追随者近似局部最佳反应，提升密集交通中的训练稳定性和安全。在四个密集交通场景中的实验表明，TSC在关键指标上优于代表性的MARL基线，尤其是在减少碰撞的同时保持了竞争性交通效率和控制平稳性。

Learning to Build: Autonomous Robotic Assembly of Stable Structures Without Predefined Plans

学习建造：无预设计划的自主机器人组装稳定结构

Authors: Jingwen Wang, Johannes Kirschner, Paul Rolland, Luis Salamanca, Stefana Parascho
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.23934
Pdf link: https://arxiv.org/pdf/2602.23934
Abstract This paper presents a novel autonomous robotic assembly framework for constructing stable structures without relying on predefined architectural blueprints. Instead of following fixed plans, construction tasks are defined through targets and obstacles, allowing the system to adapt more flexibly to environmental uncertainty and variations during the building process. A reinforcement learning (RL) policy, trained using deep Q-learning with successor features, serves as the decision-making component. As a proof of concept, we evaluate the approach on a benchmark of 15 2D robotic assembly tasks of discrete block construction. Experiments using a real-world closed-loop robotic setup demonstrate the feasibility of the method and its ability to handle construction noise. The results suggest that our framework offers a promising direction for more adaptable and robust robotic construction in real-world environments.
中文摘要 本文提出了一种新型自主机器人组装框架，用于构建稳定结构，无需依赖预设的架构蓝图。建筑任务不再遵循固定的计划，而是通过目标和障碍物来定义，使系统能够更灵活地适应建筑过程中的环境不确定性和变化。强化学习（RL）策略通过深度Q学习及其后续特性训练，作为决策组件。作为概念验证，我们基于15个离散模块构建的二维机器人组装任务基准进行评估。使用现实世界闭环机器人装置的实验展示了该方法的可行性及其处理建筑噪音的能力。结果表明，我们的框架为在现实环境中实现更具适应性和稳健性的机器人构建提供了有前景的方向。

Green or Fast? Learning to Balance Cold Starts and Idle Carbon in Serverless Computing

绿色还是快速？学习在无服务器计算中平衡冷启动和闲置碳

Authors: Bowen Sun, Christos D. Antonopoulos, Evgenia Smirni, Bin Ren, Nikolaos Bellas, Spyros Lalis
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Performance (cs.PF)
Arxiv link: https://arxiv.org/abs/2602.23935
Pdf link: https://arxiv.org/pdf/2602.23935
Abstract Serverless computing simplifies cloud deployment but introduces new challenges in managing service latency and carbon emissions. Reducing cold-start latency requires retaining warm function instances, while minimizing carbon emissions favors reclaiming idle resources. This balance is further complicated by time-varying grid carbon intensity and varying workload patterns, under which static keep-alive policies are inefficient. We present LACE-RL, a latency-aware and carbon-efficient management framework that formulates serverless pod retention as a sequential decision problem. LACE-RL uses deep reinforcement learning to dynamically tune keep-alive durations, jointly modeling cold-start probability, function-specific latency costs, and real-time carbon intensity. Using the Huawei Public Cloud Trace, we show that LACE-RL reduces cold starts by 51.69% and idle keep-alive carbon emissions by 77.08% compared to Huawei's static policy, while achieving better latency-carbon trade-offs than state-of-the-art heuristic and single-objective baselines, approaching Oracle performance.
中文摘要 无服务器计算简化了云部署，但也带来了管理服务延迟和碳排放的新挑战。降低冷启动延迟需要保留热功能实例，同时减少碳排放有利于回收闲置资源。这种平衡因电网碳强度和工作量模式的变化而更加复杂，在这些条件下，静态的“保持生命”政策效率低下。我们提出了LACE-RL，一种感知延迟且碳效率高的管理框架，将无服务器舱保留问题定为顺序决策问题。LACE-RL利用深度强化学习动态调优保持生命持续时间，联合建模冷启动概率、函数特定延迟成本和实时碳强度。利用华为公有云追踪，我们证明LACE-RL相比华为静态政策，冷启动减少了51.69%，空闲保持活性碳排放减少了77.08%，同时在延迟-碳排放权衡上优于最先进的启发式和单目标基线，接近甲骨文的性能。

Thinking with Images as Continuous Actions: Numerical Visual Chain-of-Thought

以图像作为连续行动思考：数值视觉思维链

Authors: Kesen Zhao, Beier Zhu, Junbao Zhou, Xingyu Zhu, Zhongqi Yue, Hanwang Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.23959
Pdf link: https://arxiv.org/pdf/2602.23959
Abstract Recent multimodal large language models (MLLMs) increasingly rely on visual chain-of-thought to perform region-grounded reasoning over images. However, existing approaches ground regions via either textified coordinates, which cause modality mismatch and semantic fragmentation, or fixed-granularity patches, which both limit precise region selection and often require non-trivial architectural changes. In this paper, we propose Numerical Visual Chain-of-Thought (NV-CoT), a framework that enables MLLMs to reason over images using continuous numerical coordinates. NV-CoT expands the MLLM action space from discrete vocabulary tokens to a continuous Euclidean space, allowing models to directly generate bounding-box coordinates as actions with only minimal architectural modification. The framework supports both supervised fine-tuning and reinforcement learning. In particular, we replace categorical token policies with a Gaussian (or Laplace) policy over coordinates and introduce stochasticity via reparameterized sampling, making NV-CoT fully compatible with GRPO-style policy optimization. Extensive experiments on three benchmarks against eight representative visual reasoning baselines demonstrate that NV-CoT significantly improves localization precision and final answer accuracy, while also accelerating training convergence, validating the effectiveness of continuous-action visual reasoning in MLLMs. The code is available in this https URL.
中文摘要 近年来，多模态大型语言模型（MLLM）越来越依赖视觉思维链来对图像进行区域基础推理。然而，现有方法要么通过文本化的坐标导致模态不匹配和语义碎片化，要么通过固定粒度补丁来覆盖区域，这既限制了精确的区域选择，也常常需要非平凡的架构变更。本文提出了数值视觉思维链（NV-CoT）框架，使MLLM能够利用连续数值坐标对图像进行推理。NV-CoT 将 MLLM 动作空间从离散词汇标记扩展为连续的欧几里得空间，允许模型以最小的架构修改直接生成边界框坐标作为动作。该框架支持监督式微调和强化学习。特别地，我们将类别令牌策略替换为高斯（或拉普拉斯）坐标策略，并通过重参数抽样引入随机性，使NV-CoT完全兼容GRPO风格策略优化。针对三个基准测试、八个具代表性的视觉推理基线的广泛实验表明，NV-CoT显著提升定位精度和最终答案准确性，同时加速训练收敛，验证了连续动作视觉推理在MLLM中的有效性。代码可以在这个 https URL 中获得。
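The abstract above describes a Gaussian policy over bounding-box coordinates with reparameterized sampling. A minimal sketch of that general idea follows, assuming a diagonal Gaussian with one mean and one log standard deviation per coordinate. The function names, the four-coordinate box layout, and the example values are hypothetical and are not taken from NV-CoT.

```python
import math
import random


def gaussian_log_prob(x: list[float], mu: list[float], log_sigma: list[float]) -> float:
    """Log density of x under a diagonal Gaussian N(mu, exp(log_sigma)^2)."""
    total = 0.0
    for xi, m, ls in zip(x, mu, log_sigma):
        z = (xi - m) / math.exp(ls)
        total += -0.5 * z * z - ls - 0.5 * math.log(2.0 * math.pi)
    return total


def sample_box(mu: list[float], log_sigma: list[float], rng: random.Random):
    """Reparameterized sample x = mu + sigma * eps with eps ~ N(0, 1).

    Returns the sampled coordinates and their log probability, the two
    quantities a policy-gradient style update would need.
    """
    coords = [m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mu, log_sigma)]
    return coords, gaussian_log_prob(coords, mu, log_sigma)


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed layout: normalized (x1, y1, x2, y2) box coordinates.
    box, logp = sample_box([0.1, 0.2, 0.6, 0.7], [-2.0] * 4, rng)
    print(box, logp)
```

Writing the sample as a deterministic function of the mean, the scale, and independent noise is what keeps the coordinates differentiable with respect to the policy outputs in an autodiff framework; this plain-Python version only shows the arithmetic.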

Pessimistic Auxiliary Policy for Offline Reinforcement Learning

悲观的离线强化学习辅助政策

Authors: Fan Zhang, Baoru Huang, Xin Zhang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.23974
Pdf link: https://arxiv.org/pdf/2602.23974
Abstract Offline reinforcement learning aims to learn an agent from pre-collected datasets, avoiding unsafe and inefficient real-time interaction. However, inevitable access to out-of-distribution actions during the learning process introduces approximation errors, causing error accumulation and considerable overestimation. In this paper, we construct a new pessimistic auxiliary policy for sampling reliable actions. Specifically, we develop a pessimistic auxiliary strategy by maximizing the lower confidence bound of the Q-function. The pessimistic auxiliary strategy exhibits a relatively high value and low uncertainty in the vicinity of the learned policy, preventing the learned policy from sampling high-value actions with potentially high errors during the learning process. The smaller approximation error introduced by actions sampled from the pessimistic auxiliary strategy alleviates error accumulation. Extensive experiments on offline reinforcement learning benchmarks reveal that utilizing the pessimistic auxiliary strategy can effectively improve the efficacy of other offline RL approaches.
中文摘要 离线强化学习旨在从预先收集的数据集中学习代理，避免不安全和低效的实时交互。然而，在学习过程中不可避免地访问分布外的动作会引入近似误差，导致错误累积和相当高估。本文构建了一种新的悲观辅助策略，用于采样可靠动作。具体来说，我们通过最大化Q函数的下置信度界限来发展悲观辅助策略。悲观辅助策略在学习策略附近表现出相对较高的价值和低不确定性，避免了学习策略采样高价值动作，且在学习过程中可能出现高错误。从悲观辅助策略中采样作用引入的近似误差较少，从而减少误差累积。大量离线强化学习基准测试显示，利用悲观辅助策略可以有效提升其他离线强化学习方法的效果。
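The abstract above builds its auxiliary policy by maximizing a lower confidence bound (LCB) of the Q-function. One common way to form such a bound, assumed here purely for illustration, is the mean minus a multiple of the standard deviation across an ensemble of Q estimates. The ensemble, the discrete action set, and the coefficient `beta` are assumptions of this sketch; the paper may estimate uncertainty differently.

```python
import statistics


def lower_confidence_bound(q_estimates: list[float], beta: float = 1.0) -> float:
    """Mean of the ensemble's Q estimates minus beta times their spread."""
    return statistics.fmean(q_estimates) - beta * statistics.pstdev(q_estimates)


def pessimistic_action(q_ensemble: dict[str, list[float]], beta: float = 1.0) -> str:
    """Pick the action whose Q lower confidence bound is largest.

    `q_ensemble` maps each candidate action to the Q values predicted for it
    by the (hypothetical) ensemble members.
    """
    return max(q_ensemble, key=lambda a: lower_confidence_bound(q_ensemble[a], beta))


if __name__ == "__main__":
    q_ensemble = {"a": [10.0, 0.0, 5.0], "b": [4.0, 4.0, 4.0]}
    print(pessimistic_action(q_ensemble, beta=1.0))
    print(pessimistic_action(q_ensemble, beta=0.0))
```

Action "a" has the higher mean but the ensemble disagrees about it, so any sufficiently large `beta` shifts the choice to the lower-variance action "b"; with `beta = 0` the rule falls back to the plain mean.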

Foundation World Models for Agents that Learn, Verify, and Adapt Reliably Beyond Static Environments

Foundation World 模型，用于能够在静态环境之外可靠地学习、验证和适应的智能体

Authors: Florent Delgrange
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.23997
Pdf link: https://arxiv.org/pdf/2602.23997
Abstract The next generation of autonomous agents must not only learn efficiently but also act reliably and adapt their behavior in open worlds. Standard approaches typically assume fixed tasks and environments with little or no novelty, which limits world models' ability to support agents that must evolve their policies as conditions change. This paper outlines a vision for foundation world models: persistent, compositional representations that unify reinforcement learning, reactive/program synthesis, and abstraction mechanisms. We propose an agenda built around four components: (i) learnable reward models from specifications to support optimization with clear objectives; (ii) adaptive formal verification integrated throughout learning; (iii) online abstraction calibration to quantify the reliability of the model's predictions; and (iv) test-time synthesis and world-model generation guided by verifiers. Together, these components enable agents to synthesize verifiable programs, derive new policies from a small number of interactions, and maintain correctness while adapting to novelty. The resulting framework positions foundation world models as a substrate for learning, reasoning, and adaptation, laying the groundwork for agents that not only act well but can explain and justify the behavior they adopt.
中文摘要 下一代自主智能体不仅要高效学习，还要在开放世界中可靠地行动并调整行为。标准方法通常假设固定任务和环境，几乎没有新颖性，这限制了世界模型支持必须随着条件变化演进策略的代理的能力。本文概述了基础世界模型的愿景：持久的组合表征，统一强化学习、反应式/程序综合和抽象机制。我们提出一个围绕四个组成部分构建的议程：（i）通过规范提供可学习的奖励模型，以支持具有明确目标的优化;（ii）贯穿学习的自适应形式验证;（iii）在线抽象校准以量化模型预测的可靠性;以及（iv）由验证者指导的测试时间综合和世界模型生成。这些组件共同使智能体能够综合可验证程序，从少量交互中推导出新策略，并在适应新颖性时保持正确性。最终的框架将基础世界模型定位为学习、推理和适应的基础，为主体不仅表现良好，还能解释和证明其行为的行为奠定基础。

Curriculum Reinforcement Learning for Quadrotor Racing with Random Obstacles

随机障碍四旋翼竞速课程强化学习

Authors: Fangyu Sun, Fanxing Li, Yu Hu, Linzuo Zhang, Yueqian Liu, Wenxian Yu, Danping Zou
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.24030
Pdf link: https://arxiv.org/pdf/2602.24030
Abstract Autonomous drone racing has attracted increasing interest as a research topic for exploring the limits of agile flight. However, existing studies primarily focus on obstacle-free racetracks, while the perception and dynamic challenges introduced by obstacles remain underexplored, often resulting in low success rates and limited robustness in real-world flight. To this end, we propose a novel vision-based curriculum reinforcement learning framework for training a robust controller capable of addressing unseen obstacles in drone racing. We combine multi-stage curriculum learning, domain randomization, and a multi-scene updating strategy to address the conflicting challenges of obstacle avoidance and gate traversal. Our end-to-end control policy is implemented as a single network, allowing high-speed flight of quadrotors in environments with variable obstacles. Both hardware-in-the-loop and real-world experiments demonstrate that our method achieves faster lap times and higher success rates than existing approaches, effectively advancing drone racing in obstacle-rich environments. The video and code are available at: this https URL.
中文摘要 自主无人机竞速作为探索敏捷飞行极限的研究课题，正受到越来越多的关注。然而，现有研究主要聚焦于无障碍赛道，障碍物带来的感知和动态挑战仍未被充分探讨，导致成功率低且现实飞行中的稳健性有限。为此，我们提出了一种新的基于愿景的课程强化学习框架，用于培训能够应对无人机竞速中未见障碍的强大控制者。我们结合多阶段的 cu rriculum 学习、领域随机化和多场景更新策略，解决障碍避让与门穿越的冲突挑战。我们的端到端控制策略作为单一网络实现，允许在障碍物多变的环境中高速飞行四旋翼。硬件在环和实际实验都表明，我们的方法比现有方法更快的圈速和更高的成功率，有效推动了障碍物密集环境中的无人机竞速。视频和代码可在以下网址获取：https URL。

Task Complexity Matters: An Empirical Study of Reasoning in LLMs for Sentiment Analysis

任务复杂性重要：情感分析中大型语言模型推理的实证研究

Authors: Donghao Huang, Zhaoxia Wang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.24060
Pdf link: https://arxiv.org/pdf/2602.24060
Abstract Large language models (LLMs) with reasoning capabilities have fueled a compelling narrative that reasoning universally improves performance across language tasks. We test this claim through a comprehensive evaluation of 504 configurations across seven model families--including adaptive, conditional, and reinforcement learning-based reasoning architectures--on sentiment analysis datasets of varying granularity (binary, five-class, and 27-class emotion). Our findings reveal that reasoning effectiveness is strongly task-dependent, challenging prevailing assumptions: (1) Reasoning shows task-complexity dependence--binary classification degrades up to -19.9 F1 percentage points (pp), while 27-class emotion recognition gains up to +16.0pp; (2) Distilled reasoning variants underperform base models by 3-18 pp on simpler tasks, though few-shot prompting enables partial recovery; (3) Few-shot learning improves over zero-shot in most cases regardless of model type, with gains varying by architecture and task complexity; (4) Pareto frontier analysis shows base models dominate efficiency-performance trade-offs, with reasoning justified only for complex emotion recognition despite 2.1x-54x computational overhead. We complement these quantitative findings with qualitative error analysis revealing that reasoning degrades simpler tasks through systematic over-deliberation, offering mechanistic insight beyond the high-level overthinking hypothesis.
中文摘要 具备推理能力的大型语言模型（LLM）推动了一个有力的观点：推理普遍提升了各语言任务的表现。我们通过对涵盖七个模型家族的504种配置——包括自适应、条件和强化学习推理架构——在不同粒度（二元、五类和27类情感）的情感分析数据集上，全面评估了这一主张。我们的发现表明，推理效果高度依赖任务，挑战了现有假设：（1）推理显示任务复杂度依赖性——二元分类可降至-19.9 F1百分点（pp），而27类情绪识别提升可达+16.0pp;（2）简化推理变体在较简单任务中表现不如基础模型3至18pp，尽管少样本提示支持部分恢复;（3）无论模型类型如何，少数样本学习在大多数情况下相较零样本学习都有改善，且其提升因架构和任务复杂度而异;（4）帕累托前沿分析显示，基础模型在效率与性能权衡中占主导地位，尽管计算开销为2.1倍至54倍，但仅在复杂情感识别时推理合理。我们用定性误差分析补充这些定量发现，揭示推理通过系统性过度思考降低简单任务，提供了超越高层次过度思考假说的机制性洞见。
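Finding (4) in the abstract above rests on a Pareto frontier over efficiency and performance. The sketch below shows how such a frontier can be computed for (cost, score) pairs, where lower cost and higher score are better. The configuration names and numbers are made-up placeholders, not results from the paper.

```python
def pareto_frontier(configs: list[tuple[str, float, float]]) -> list[str]:
    """Return names of configurations that no other configuration dominates.

    Each entry is (name, cost, score). Configuration X dominates Y when X has
    cost <= Y's and score >= Y's, with at least one of the two strictly better.
    """
    frontier = []
    for name, cost, score in configs:
        dominated = any(
            c2 <= cost and s2 >= score and (c2 < cost or s2 > score)
            for _, c2, s2 in configs
        )
        if not dominated:
            frontier.append(name)
    return frontier


if __name__ == "__main__":
    # Placeholder values: relative compute cost and an F1-like score.
    configs = [("A", 1.0, 80.0), ("B", 54.0, 79.0), ("C", 2.1, 90.0)]
    print(pareto_frontier(configs))
```

In this toy data, B costs more than A and scores lower, so it drops off the frontier, while C stays on it because its extra cost buys a higher score.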

Adaptive Correlation-Weighted Intrinsic Rewards for Reinforcement Learning

适应性相关加权内在奖励用于强化学习

Authors: Viet Bac Nguyen, Phuong Thai Nguyen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.24081
Pdf link: https://arxiv.org/pdf/2602.24081
Abstract We propose ACWI (Adaptive Correlation Weighted Intrinsic), an adaptive intrinsic reward scaling framework designed to dynamically balance intrinsic and extrinsic rewards for improved exploration in sparse reward reinforcement learning. Unlike conventional approaches that rely on manually tuned scalar coefficients, which often result in unstable or suboptimal performance across tasks, ACWI learns a state dependent scaling coefficient online. Specifically, ACWI introduces a lightweight Beta Network that predicts the intrinsic reward weight directly from the agent state through an encoder based architecture. The scaling mechanism is optimized using a correlation based objective that encourages alignment between the weighted intrinsic rewards and discounted future extrinsic returns. This formulation enables task adaptive exploration incentives while preserving computational efficiency and training stability. We evaluate ACWI on a suite of sparse reward environments in MiniGrid. Experimental results demonstrate that ACWI consistently improves sample efficiency and learning stability compared to fixed intrinsic reward baselines, achieving superior performance with minimal computational overhead.
中文摘要 我们提出了ACWI（自适应相关加权内在）框架，一种自适应内在奖励尺度框架，旨在动态平衡内在和外在奖励，以改善稀疏奖励强化学习的探索。与依赖手动调优标量系数的传统方法不同，后者常导致任务间性能不稳定或次优，ACWI在线学习状态依赖的标量系数。具体来说，ACWI引入了一个轻量级Beta网络，通过基于编码器的架构直接从代理状态预测内在奖励权重。该尺度机制通过基于相关性的目标进行优化，鼓励加权内在奖励与贴现后的未来外部回报之间实现对齐。该表述实现任务自适应探索激励，同时保持计算效率和训练稳定性。我们在MiniGrid中的一组稀疏奖励环境中评估了ACWI。实验结果表明，与固定的内在奖励基线相比，ACWI能够持续提升样本效率和学习稳定性，以最小的计算开销实现更优异的性能。
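The abstract above optimizes its scaling mechanism with a correlation-based objective between weighted intrinsic rewards and discounted future extrinsic returns. A minimal sketch of one such objective follows, using the negative Pearson correlation over a single trajectory. The exact loss in ACWI is not given in the abstract, so the function names, the use of Pearson correlation, and the per-step weights `betas` are assumptions.

```python
import math


def discounted_returns(rewards: list[float], gamma: float) -> list[float]:
    """Discounted return G_t = r_t + gamma * G_{t+1} for every step t."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation; returns 0.0 when either input has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0.0 or vy == 0.0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def correlation_loss(betas, intrinsic, extrinsic, gamma: float = 0.99) -> float:
    """Negative correlation between beta_t * r_int_t and discounted extrinsic returns."""
    weighted = [b * r for b, r in zip(betas, intrinsic)]
    return -pearson(weighted, discounted_returns(extrinsic, gamma))


if __name__ == "__main__":
    print(correlation_loss([1.0, 1.0, 1.0], [0.25, 0.5, 1.0], [0.0, 0.0, 1.0], gamma=0.5))
```

Minimizing this quantity with respect to the weights would favor scaling up intrinsic rewards at steps that tend to precede extrinsic return, which is the kind of alignment the abstract describes.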

Bi-level RL-Heuristic Optimization for Real-world Winter Road Maintenance

用于现实世界冬季道路维护的双级强化学习启发式优化

Authors: Yue Xie, Zizhen Xu, William Beazley, Fumiya Iida
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.24097
Pdf link: https://arxiv.org/pdf/2602.24097
Abstract Winter road maintenance is critical for ensuring public safety and reducing environmental impacts, yet existing methods struggle to manage large-scale routing problems effectively and mostly rely on human decision-making. This study presents a novel, scalable bi-level optimization framework, validated on real operational data on UK strategic road networks (M25, M6, A1), including interconnected local road networks in surrounding areas for vehicle traversing, as part of the highway operator's efforts to solve existing planning challenges. At the upper level, a reinforcement learning (RL) agent strategically partitions the road network into manageable clusters and optimally allocates resources from multiple depots. At the lower level, a multi-objective vehicle routing problem (VRP) is solved within each cluster, minimizing the maximum vehicle travel time and total carbon emissions. Unlike existing approaches, our method handles large-scale, real-world networks efficiently, explicitly incorporating vehicle-specific constraints, depot capacities, and road segment requirements. Results demonstrate significant improvements, including balanced workloads, reduced maximum travel times below the targeted two-hour threshold, lower emissions, and substantial cost savings. This study illustrates how advanced AI-driven bi-level optimization can directly enhance operational decision-making in real-world transportation and logistics.
中文摘要 冬季道路维护对于保障公共安全和减少环境影响至关重要，但现有方法难以有效应对大规模路线问题，且主要依赖人为决策。本研究提出了一种新颖、可扩展的双级优化框架，基于英国战略道路网络（M25、M6、A1）的真实运营数据验证，包括周边地区相互连接的本地道路网络，用于车辆通行，作为公路运营商解决现有规划挑战努力的一部分。在上层，强化学习（RL）智能体会有策略地将道路网络划分为可管理的集群，并最优地分配多个仓库的资源。在较低层面，每个集群内解决多目标车辆路由问题（VRP），以最小化车辆最大行驶时间和总碳排放。与现有方法不同，我们的方法高效处理大规模的真实网络，明确纳入车辆特定约束、车库容量和道路段要求。结果显示出显著改进，包括工作量平衡、将最大旅行时间缩短至目标两小时以下、排放降低以及大幅节省成本。本研究展示了先进的人工智能驱动双级优化如何直接提升现实世界运输和物流中的运营决策。

Recycling Failures: Salvaging Exploration in RLVR via Fine-Grained Off-Policy Guidance

回收失败：通过细致的非政策指导挽救RLVR中的探索

Authors: Yanwei Ren, Haotian Zhang, Likang Xiao, Xikai Zhang, Jiaxing Huang, Jiayan Qiu, Baosheng Yu, Quan Chen, Liu Liu
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.24110
Pdf link: https://arxiv.org/pdf/2602.24110
Abstract Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the complex reasoning capabilities of Large Reasoning Models. However, standard outcome-based supervision suffers from a critical limitation: it penalizes trajectories that are largely correct but fail due to several missteps as heavily as completely erroneous ones. This coarse feedback signal causes the model to discard valuable, largely correct rollouts, leading to a degradation in rollout diversity that prematurely narrows the exploration space. Although Process Reward Models have demonstrated efficacy in providing reliable step-wise verification for test-time scaling, naively integrating these signals into RLVR as dense rewards proves this http URL methods attempt to introduce off-policy guided whole-trajectory replacement that is often outside the policy model's distribution, but still fail to utilize the largely correct rollouts generated by the model itself and thus do not effectively mitigate the narrowing of the exploration space. To address these issues, we propose SCOPE (Step-wise Correction for On-Policy Exploration), a novel framework that utilizes Process Reward Models to pinpoint the first erroneous step in suboptimal rollouts and applies fine-grained, step-wise off-policy rectification. By applying precise refinement to partially correct rollouts, our method effectively salvages partially correct trajectories and increases the diversity score by 13.5%, thereby sustaining a broad exploration space. Extensive experiments demonstrate that our approach establishes new state-of-the-art results, achieving an average accuracy of 46.6% on math reasoning and exhibiting robust generalization with 53.4% accuracy on out-of-distribution reasoning tasks.
中文摘要 可验证奖励强化学习（RLVR）已成为增强大型推理模型复杂推理能力的强大范式。然而，标准的基于结果的监督存在一个关键限制，即惩罚那些大体正确但因多次严重失误甚至完全错误而失败的轨迹。这种粗糙反馈信号导致模型丢弃有价值且大致正确的推展，导致推展多样性退化，过早缩小探索空间。过程奖励模型已证明在提供可靠的分阶段测试时间尺度验证方面有效，天真地将这些信号集成到RLVR，因为密集奖励证明了这种http URL方法试图引入非策略引导的整体轨迹替换，这通常超出策略模型的分布，但仍未能充分利用模型本身生成的大致正确的展开，因此无法有效缓解探索空间的缩小。为解决这些问题，我们提出了SCOPE（按步骤调整以探索策略），这是一个新颖框架，利用流程奖励模型精准定位次优部署的第一个错误步骤，并应用细粒度的分步骤非策略纠正。通过对部分正确推广进行精确优化，我们的方法有效挽救了部分正确轨迹，并将多样性得分提升了13.5%，从而维持了广泛的探索空间。大量实验表明，我们的方法建立了新的最先进结果，数学推理平均准确率为46.6%，在分布外推理任务中展现出强健的泛化能力，准确率为53.4%。
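The abstract above says a Process Reward Model pinpoints the first erroneous step in a rollout so that the correct part can be kept. The sketch below illustrates that idea with a simple thresholding rule on per-step scores. The threshold, the score scale, and the function names are assumptions; SCOPE's actual localization and rectification procedure is not specified in the abstract.

```python
from typing import Optional


def first_error_index(step_scores: list[float], threshold: float = 0.5) -> Optional[int]:
    """Index of the first step scored below `threshold`, or None if all steps pass.

    `step_scores` stands in for per-step scores from a process reward model;
    the 0.5 cutoff is an assumed example value.
    """
    for i, score in enumerate(step_scores):
        if score < threshold:
            return i
    return None


def salvage_prefix(steps: list[str], step_scores: list[float], threshold: float = 0.5) -> list[str]:
    """Keep the on-policy steps that precede the first flagged step."""
    idx = first_error_index(step_scores, threshold)
    return list(steps) if idx is None else list(steps[:idx])


if __name__ == "__main__":
    steps = ["set up equation", "expand terms", "wrong sign flip", "final answer"]
    scores = [0.9, 0.8, 0.2, 0.4]
    print(first_error_index(scores))
    print(salvage_prefix(steps, scores))
```

The retained prefix is the model's own output, so only the continuation from the flagged step onward would need to come from an external corrector.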

Planning from Observation and Interaction

从观察与互动出发的规划

Authors: Tyler Han, Siyang Shen, Rohan Baijal, Harine Ravichandiran, Bat Nemekhbold, Kevin Huang, Sanghun Jung, Byron Boots
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.24121
Pdf link: https://arxiv.org/pdf/2602.24121
Abstract Observational learning requires an agent to learn to perform a task by referencing only observations of the performed task. This work investigates the equivalent setting in real-world robot learning where access to hand-designed rewards and demonstrator actions is not assumed. To address this data-constrained setting, this work presents a planning-based Inverse Reinforcement Learning (IRL) algorithm for world modeling from observation and interaction alone. Experiments conducted entirely in the real world demonstrate that this paradigm is effective for learning image-based manipulation tasks from scratch in under an hour, without assuming prior knowledge, pre-training, or data of any kind beyond task observations. Moreover, this work demonstrates that the learned world model representation is capable of online transfer learning in the real world from scratch. In comparison to existing approaches, including IRL, RL, and Behavior Cloning (BC), which have more restrictive assumptions, the proposed approach demonstrates significantly greater sample efficiency and success rates, enabling a practical path forward for online world modeling and planning from observation and interaction. Videos and more at: this https URL.
中文摘要 观察学习要求智能体通过仅引用对已执行任务的观察来学习执行任务。本研究探讨了现实机器人学习中不假设可获得手工设计奖励和示范动作的等效环境。为应对这种数据受限环境，本研究提出了一种基于规划的逆强化学习（IRL）算法，仅通过观察和交互进行世界建模。完全在现实中进行的实验表明，这种范式能够在不到一小时内从零开始学习基于图像的作任务，无需假设已有先验知识、预训练或除任务观察外的任何数据。此外，这项工作表明，已学习世界模型的表示能够从零开始在现实世界中进行在线迁移学习。与现有方法（包括IRL、RL和行为克隆（BC）等更严格的假设相比，该方法展现了显著更高的样本效率和成功率，为从观察和交互出发的在线世界建模和规划提供了实用的路径。视频及更多内容请访问：https URL。

Learning Flexible Job Shop Scheduling under Limited Buffers and Material Kitting Constraints

在有限缓冲和物料套箱限制下学习灵活的作业车间排班

Authors: Shishun Zhang, Juzhan Xu, Yidan Fan, Chenyang Zhu, Ruizhen Hu, Yongjun Wang, Kai Xu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.24180
Pdf link: https://arxiv.org/pdf/2602.24180
Abstract The Flexible Job Shop Scheduling Problem (FJSP) originates from real production lines, while some practical constraints are often ignored or idealized in current FJSP studies, among which the limited buffer problem has a particular impact on production efficiency. To this end, we study an extended problem that is closer to practical scenarios--the Flexible Job Shop Scheduling Problem with Limited Buffers and Material Kitting. In recent years, deep reinforcement learning (DRL) has demonstrated considerable potential in scheduling tasks. However, its capacity for state modeling remains limited when handling complex dependencies and long-term constraints. To address this, we leverage a heterogeneous graph network within the DRL framework to model the global state. By constructing efficient message passing among machines, operations, and buffers, the network focuses on avoiding decisions that may cause frequent pallet changes during long-sequence scheduling, thereby helping improve buffer utilization and overall decision quality. Experimental results on both synthetic and real production line datasets show that the proposed method outperforms traditional heuristics and advanced DRL methods in terms of makespan and pallet changes, and also achieves a good balance between solution quality and computational cost. Furthermore, a supplementary video is provided to showcase a simulation system that effectively visualizes the progression of the production line.
中文摘要 柔性作业车间调度问题（FJSP）源于真实生产线，但当前的FJSP研究常常忽略或理想化一些实际约束，其中有限缓冲区问题对生产效率的影响尤为突出。为此，我们研究了一个更接近实际场景的扩展问题：带有限缓冲区和物料齐套的柔性作业车间调度问题。近年来，深度强化学习（DRL）在调度任务中展现出可观的潜力。然而，在处理复杂依赖和长期约束时，其状态建模能力仍然有限。为此，我们在DRL框架内利用异构图网络来建模全局状态。通过在机器、工序和缓冲区之间构建高效的消息传递，网络着重避免在长序列调度中可能导致频繁更换托盘的决策，从而有助于提升缓冲区利用率和整体决策质量。在合成数据集和真实生产线数据集上的实验结果表明，该方法在完工时间（makespan）和托盘更换次数方面优于传统启发式方法和先进的DRL方法，并在解的质量与计算成本之间取得了良好平衡。此外，我们还提供了补充视频，展示一个能够有效可视化生产线运行过程的仿真系统。

Multi-Objective Reinforcement Learning for Large-Scale Tote Allocation in Human-Robot Collaborative Fulfillment Centers

人机协作履约中心大规模周转箱分配的多目标强化学习

Authors: Sikata Sengupta, Guangyi Liu, Omer Gottesman, Joseph W Durham, Michael Kearns, Aaron Roth, Michael Caldara
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.24182
Pdf link: https://arxiv.org/pdf/2602.24182
Abstract Optimizing the consolidation process in container-based fulfillment centers requires trading off competing objectives such as processing speed, resource usage, and space utilization while adhering to a range of real-world operational constraints. This process involves moving items between containers via a combination of human and robotic workstations to free up space for inbound inventory and increase container utilization. We formulate this problem as a large-scale Multi-Objective Reinforcement Learning (MORL) task with high-dimensional state spaces and dynamic system behavior. Our method builds on recent theoretical advances in solving constrained RL problems via best-response and no-regret dynamics in zero-sum games, enabling principled minimax policy learning. Policy evaluation on realistic warehouse simulations shows that our approach effectively trades off objectives, and we empirically observe that it learns a single policy that simultaneously satisfies all constraints, even if this is not theoretically guaranteed. We further introduce a theoretical framework to handle the problem of error cancellation, where time-averaged solutions display oscillatory behavior. This method returns a single iterate whose Lagrangian value is close to the minimax value of the game. These results demonstrate the promise of MORL in solving complex, high-impact decision-making problems in large-scale industrial systems.
中文摘要 优化基于容器的履约中心中的合并流程，需要在处理速度、资源使用和空间利用率等相互竞争的目标之间权衡，同时遵守一系列现实运营约束。该流程通过人工工作站与机器人工作站的组合在容器之间转移物品，以便为入库库存腾出空间并提高容器利用率。我们将该问题表述为一个具有高维状态空间和动态系统行为的大规模多目标强化学习（MORL）任务。我们的方法建立在近期的理论进展之上，即通过零和博弈中的最佳响应和无遗憾动态来求解带约束的强化学习问题，从而实现有原则的极小极大策略学习。在逼真的仓库仿真上的策略评估表明，我们的方法能够有效地权衡各个目标；并且我们在实验中观察到，它学到了一个同时满足所有约束的单一策略，尽管这在理论上并无保证。我们进一步引入了一个理论框架来处理误差抵消问题，即时间平均解表现出振荡行为的情形。该方法返回单个迭代点，其拉格朗日值接近博弈的极小极大值。这些结果展示了MORL在解决大规模工业系统中复杂且高影响力的决策问题方面的前景。
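
Illustrative sketch (not from the paper): the abstract refers to solving constrained RL through best-response and no-regret dynamics in a zero-sum game. The toy below assumes a one-step problem with three actions, a reward vector, a cost vector, and a cost budget; the function name `primal_dual_average` and all numbers are hypothetical. The primal player best-responds to the current Lagrange multiplier, the dual player takes a projected gradient step, and the time-averaged policy is returned. Each individual iterate is a pure action that either violates the budget or leaves slack, so only the average is near-feasible, which is one way to picture the oscillation the abstract mentions.

```python
def primal_dual_average(rewards, costs, budget, steps=2000, lr=0.05):
    """Toy Lagrangian dynamics for: maximize reward subject to cost <= budget.

    Lagrangian: L(pi, lam) = r.pi - lam * (c.pi - budget), with lam >= 0.
    All names and values here are assumptions made for illustration only.
    """
    lam = 0.0
    counts = [0] * len(rewards)
    for _ in range(steps):
        # Primal player: best response to the current multiplier (a pure action).
        scores = [r - lam * c for r, c in zip(rewards, costs)]
        action = max(range(len(scores)), key=scores.__getitem__)
        counts[action] += 1
        # Dual player: projected gradient step that raises lam when the
        # chosen action exceeds the budget and lowers it otherwise.
        lam = max(0.0, lam + lr * (costs[action] - budget))
    avg_policy = [n / steps for n in counts]
    return avg_policy, lam


rewards = [1.0, 0.5, 0.0]
costs = [1.0, 0.4, 0.0]
avg_policy, final_lam = primal_dual_average(rewards, costs, budget=0.5)
avg_cost = sum(p * c for p, c in zip(avg_policy, costs))
```

In this toy the averaged policy mixes the first two actions so that `avg_cost` sits close to the budget, while no single iterate does.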

Enhancing Spatial Understanding in Image Generation via Reward Modeling

通过奖励建模增强图像生成中的空间理解

Authors: Zhenyu Tang, Chaoran Feng, Yufan Deng, Jie Wu, Xiaojie Li, Rui Wang, Yunpeng Chen, Daquan Zhou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.24233
Pdf link: https://arxiv.org/pdf/2602.24233
Abstract Recent progress in text-to-image generation has greatly advanced visual fidelity and creativity, but it has also imposed higher demands on prompt complexity, particularly in encoding intricate spatial relationships. In such cases, achieving satisfactory results often requires multiple sampling attempts. To address this challenge, we introduce a novel method that strengthens the spatial understanding of current image generation models. We first construct the SpatialReward-Dataset with over 80k preference pairs. Building on this dataset, we build SpatialScore, a reward model designed to evaluate the accuracy of spatial relationships in text-to-image generation, achieving performance that even surpasses leading proprietary models on spatial evaluation. We further demonstrate that this reward model effectively enables online reinforcement learning for complex spatial generation. Extensive experiments across multiple benchmarks show that our specialized reward model yields significant and consistent gains in spatial understanding for image generation.
中文摘要 文本到图像生成的最新进展极大提升了视觉保真度和创造力，但也对提示词的复杂度提出了更高要求，尤其是在表达复杂的空间关系时。在这种情况下，要获得满意的结果通常需要多次采样尝试。为应对这一挑战，我们提出了一种新方法，用于增强当前图像生成模型的空间理解能力。我们首先构建了包含超过8万个偏好对的SpatialReward-Dataset。基于该数据集，我们构建了SpatialScore，这是一个用于评估文本到图像生成中空间关系准确性的奖励模型，其在空间评估上的性能甚至超过了领先的专有模型。我们进一步证明，该奖励模型能够有效支持面向复杂空间生成的在线强化学习。在多个基准上的大量实验表明，我们的专用奖励模型为图像生成的空间理解带来了显著且一致的提升。
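
Illustrative sketch (not from the paper): the abstract says the reward model is built from preference pairs but does not state the training objective. A common choice for pairwise preference data is the Bradley-Terry loss, shown here purely as an assumption; the function name `bradley_terry_loss` and the scalar scores are hypothetical stand-ins for a reward model's outputs on the preferred and rejected images.

```python
import math


def bradley_terry_loss(score_preferred, score_rejected):
    """Pairwise loss -log(sigmoid(s_w - s_l)) for one preference pair.

    Assumed objective for illustration; the paper's actual loss is not
    given in the abstract.
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss shrinks as the preferred image is scored further above the rejected one.
loss_correct_order = bradley_terry_loss(2.0, 0.0)
loss_wrong_order = bradley_terry_loss(0.0, 2.0)
```

With equal scores the loss is log 2, and it falls toward zero as the margin in favor of the preferred sample grows.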

SafeGen-LLM: Enhancing Safety Generalization in Task Planning for Robotic Systems

SafeGen-LLM：增强机器人系统任务规划中的安全泛化

Authors: Jialiang Fan, Weizhe Xu, Mengyu Liu, Oleg Sokolsky, Insup Lee, Fangxin Kong
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.24235
Pdf link: https://arxiv.org/pdf/2602.24235
Abstract Safety-critical task planning in robotic systems remains challenging: classical planners suffer from poor scalability, Reinforcement Learning (RL)-based methods generalize poorly, and base Large Language Models (LLMs) cannot guarantee safety. To address this gap, we propose safety-generalizable large language models, named SafeGen-LLM. SafeGen-LLM can not only enhance the safety satisfaction of task plans but also generalize well to novel safety properties in various domains. We first construct a multi-domain Planning Domain Definition Language 3 (PDDL3) benchmark with explicit safety constraints. Then, we introduce a two-stage post-training framework: Supervised Fine-Tuning (SFT) on a constraint-compliant planning dataset to learn planning syntax and semantics, and Group Relative Policy Optimization (GRPO) guided by fine-grained reward machines derived from formal verification to enforce safety alignment and by curriculum learning to better handle complex tasks. Extensive experiments show that SafeGen-LLM achieves strong safety generalization and outperforms frontier proprietary baselines across multi-domain planning tasks and multiple input formats (e.g., PDDLs and natural language).
中文摘要 机器人系统中的安全关键任务规划仍然充满挑战：经典规划器可扩展性差，基于强化学习（RL）的方法泛化能力差，而基础大型语言模型（LLM）无法保证安全。为弥补这一空白，我们提出了具备安全泛化能力的大型语言模型，命名为SafeGen-LLM。SafeGen-LLM不仅能提高任务计划对安全约束的满足程度，还能很好地泛化到多个领域中的新安全属性。我们首先构建了一个带有显式安全约束的多领域规划领域定义语言3（PDDL3）基准。随后，我们引入了一个两阶段的后训练框架：先在符合约束的规划数据集上进行监督微调（SFT），以学习规划的语法和语义；再进行群体相对策略优化（GRPO），由基于形式化验证得到的细粒度奖励机（reward machine）引导以强化安全对齐，并借助课程学习更好地处理复杂任务。大量实验表明，SafeGen-LLM实现了很强的安全泛化能力，并在多领域规划任务和多种输入格式（如PDDL和自然语言）上优于前沿的专有基线。
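
Illustrative sketch (not from the paper): the abstract names Group Relative Policy Optimization (GRPO) as the second training stage. The core of GRPO is scoring each sampled response relative to the other responses drawn for the same prompt. The helper below shows only that normalization step; the function name, the use of the population standard deviation, and the example rewards are assumptions, and the paper's reward machines are not modeled here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt.

    advantage_i = (r_i - mean(group)) / (std(group) + eps)
    Population std is an assumption; implementations differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four plans sampled for one planning problem,
# e.g. 1.0 if a plan passes a safety check and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Plans scoring above the group mean receive positive advantages and the rest negative ones; when every sample in a group gets the same reward, all advantages are zero and the group contributes no gradient signal.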

CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation

CUDA Agent：用于高性能 CUDA 内核生成的大规模 Agentic RL

Authors: Weinan Dai, Hanlin Wu, Qiying Yu, Huan-ang Gao, Jiahao Li, Chengquan Jiang, Weiqiang Lou, Yufan Song, Hongli Yu, Jiaze Chen, Wei-Ying Ma, Ya-Qin Zhang, Jingjing Liu, Mingxuan Wang, Xin Liu, Hao Zhou
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.24286
Pdf link: https://arxiv.org/pdf/2602.24286
Abstract GPU kernel optimization is fundamental to modern deep learning but remains a highly specialized task requiring deep hardware expertise. Despite strong performance in general programming, large language models (LLMs) remain uncompetitive with compiler-based systems such as this http URL for CUDA kernel generation. Existing CUDA code generation approaches either rely on training-free refinement or fine-tune models within fixed multi-turn execution-feedback loops, but both paradigms fail to fundamentally improve the model's intrinsic CUDA optimization ability, resulting in limited performance gains. We present CUDA Agent, a large-scale agentic reinforcement learning system that develops CUDA kernel expertise through three components: a scalable data synthesis pipeline, a skill-augmented CUDA development environment with automated verification and profiling to provide reliable reward signals, and reinforcement learning algorithmic techniques enabling stable training. CUDA Agent achieves state-of-the-art results on KernelBench, delivering 100%, 100%, and 92% faster rate over this http URL on KernelBench Level-1, Level-2, and Level-3 splits, outperforming the strongest proprietary models such as Claude Opus 4.5 and Gemini 3 Pro by about 40% on the hardest Level-3 setting.
中文摘要 GPU内核优化是现代深度学习的基础，但仍是一项高度专业化的任务，需要深厚的硬件专业知识。尽管大型语言模型（LLM）在通用编程中表现优异，但在CUDA内核生成方面仍无法与 this http URL 等基于编译器的系统竞争。现有的CUDA代码生成方法要么依赖免训练的迭代优化，要么在固定的多轮执行反馈循环内微调模型，但这两种范式都未能从根本上提升模型内在的CUDA优化能力，导致性能提升有限。我们提出CUDA Agent，一个大规模的智能体强化学习系统，通过三个组成部分培养CUDA内核专业能力：可扩展的数据合成流水线；具备自动验证和性能剖析、可提供可靠奖励信号的技能增强型CUDA开发环境；以及实现稳定训练的强化学习算法技术。CUDA Agent在KernelBench上取得了最先进的结果，在KernelBench的Level-1、Level-2和Level-3划分上，相对于 this http URL 的更快比例（faster rate）分别达到100%、100%和92%，并在最难的Level-3设置上比Claude Opus 4.5和Gemini 3 Pro等最强的专有模型高出约40%。

DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science

DARE-bench：评估数据科学中大型语言模型的建模与指令忠实度

Authors: Fan Shu, Yite Wang, Ruofan Wu, Boyi Liu, Zhewei Yao, Yuxiong He, Feng Yan
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.24288
Pdf link: https://arxiv.org/pdf/2602.24288
Abstract The fast-growing demands in using Large Language Models (LLMs) to tackle complex multi-step data science tasks create an emergent need for accurate benchmarking. There are two major gaps in existing benchmarks: (i) the lack of standardized, process-aware evaluation that captures instruction adherence and process fidelity, and (ii) the scarcity of accurately labeled training data. To bridge these gaps, we introduce DARE-bench, a benchmark designed for machine learning modeling and data science instruction following. Unlike many existing benchmarks that rely on human- or model-based judges, all tasks in DARE-bench have verifiable ground truth, ensuring objective and reproducible evaluation. To cover a broad range of tasks and support agentic tools, DARE-bench consists of 6,300 Kaggle-derived tasks and provides both large-scale training data and evaluation sets. Extensive evaluations show that even highly capable models such as gpt-o4-mini struggle to achieve good performance, especially in machine learning modeling tasks. Using DARE-bench training tasks for fine-tuning can substantially improve model performance. For example, supervised fine-tuning boosts Qwen3-32B's accuracy by 1.83x and reinforcement learning boosts Qwen3-4B's accuracy by more than 8x. These significant improvements verify the importance of DARE-bench both as an accurate evaluation benchmark and critical training data.
中文摘要 使用大型语言模型（LLM）处理复杂多步数据科学任务的需求迅速增长，由此产生了对准确基准评测的迫切需求。现有基准存在两个主要缺口：（i）缺乏能够刻画指令遵循程度和流程忠实度的标准化、流程感知的评估；（ii）准确标注的训练数据稀缺。为弥合这些差距，我们推出了DARE-bench，这是一个面向机器学习建模和数据科学指令遵循的基准。与许多依赖人工或模型评判的现有基准不同，DARE-bench中的所有任务都有可验证的标准答案（ground truth），确保评估客观且可复现。为了覆盖广泛的任务并支持智能体工具，DARE-bench包含6,300个源自Kaggle的任务，并同时提供大规模训练数据和评估集。大量评估表明，即使是gpt-o4-mini这样能力很强的模型也难以取得良好表现，在机器学习建模任务中尤其如此。使用DARE-bench的训练任务进行微调可以大幅提升模型性能。例如，监督微调将Qwen3-32B的准确率提升至1.83倍，强化学习将Qwen3-4B的准确率提升至8倍以上。这些显著的提升验证了DARE-bench作为准确评估基准和关键训练数据的重要性。

Keyword: diffusion policy

There are no results