Arxiv Papers of Today

Generated: 2026-02-24 16:54:58 (UTC+8); arXiv release time: 2026-02-24 20:00 EST (2026-02-25 09:00 UTC+8)

58 relevant papers today

Keyword: reinforcement learning

FineRef: Fine-Grained Error Reflection and Correction for Long-Form Generation with Citations

Authors: Yixing Peng, Licheng Zhang, Shancheng Fang, Yi Liu, Peijian Gu, Quan Wang
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.18437
Pdf link: https://arxiv.org/pdf/2602.18437
Abstract Generating with citations is crucial for trustworthy Large Language Models (LLMs), yet even advanced LLMs often produce mismatched or irrelevant citations. Existing methods over-optimize citation fidelity while overlooking relevance to the user query, which degrades answer quality and robustness in real-world settings with noisy or irrelevant retrieved content. Moreover, the prevailing single-pass paradigm struggles to deliver optimal answers in long-form generation that requires multiple citations. To address these limitations, we propose FineRef, a framework based on Fine-grained error Reflection, which explicitly teaches the model to self-identify and correct two key citation errors, mismatch and irrelevance, on a per-citation basis. FineRef follows a two-stage training strategy. The first stage instills an "attempt-reflect-correct" behavioral pattern via supervised fine-tuning, using fine-grained and controllable reflection data constructed by specialized lightweight models. An online self-reflective bootstrapping strategy is designed to improve generalization by iteratively enriching training data with verified, self-improving examples. To further enhance the self-reflection and correction capability, the second stage applies process-level reinforcement learning with a multi-dimensional reward scheme that promotes reflection accuracy, answer quality, and correction gain. Experiments on the ALCE benchmark demonstrate that FineRef significantly improves both citation performance and answer accuracy. Our 7B model outperforms GPT-4 by up to 18% in Citation F1 and 4% in EM Recall, while also surpassing the state-of-the-art model across key evaluation metrics. FineRef also exhibits strong generalization and robustness in domain transfer settings and noisy retrieval scenarios.

Learning to Remember: End-to-End Training of Memory Agents for Long-Context Reasoning

Authors: Kehao Zhang, Shangtong Gui, Sheng Yang, Wei Chen, Yang Feng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.18493
Pdf link: https://arxiv.org/pdf/2602.18493
Abstract Long-context LLMs and Retrieval-Augmented Generation (RAG) systems process information passively, deferring state tracking, contradiction resolution, and evidence aggregation to query time, which becomes brittle under ultra-long streams with frequent updates. We propose the Unified Memory Agent (UMA), an end-to-end reinforcement learning framework that unifies memory operations and question answering within a single policy. UMA maintains a dual memory representation: a compact core summary for global context and a structured Memory Bank that supports explicit CRUD (create, update, delete, reorganize) over key-value entries, enabling proactive consolidation during streaming. To evaluate long-horizon memory behavior, we introduce Ledger-QA, a diagnostic benchmark for continuous state tracking where answers are latent values derived from accumulated updates rather than local span retrieval. Across 13 datasets spanning Ledger-QA, Test-Time Learning, and Accurate Retrieval, UMA substantially outperforms long-context and RAG baselines on dynamic reasoning and learning tasks while remaining competitive on standard retrieval benchmarks, underscoring the importance of learned, end-to-end memory management.
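
The Memory Bank idea in the UMA abstract (explicit create, update, delete, and reorganize operations over key-value entries, applied while the stream is read) can be illustrated with a small sketch. Everything below is a hypothetical stand-in: the class name, method signatures, and the ledger-style stream are assumptions for illustration, and the operations are hard-coded here, whereas in UMA they are chosen by a learned policy.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryBank:
    """Hypothetical key-value store with explicit memory operations."""
    entries: dict = field(default_factory=dict)

    def create(self, key, value):
        if key in self.entries:
            raise KeyError(f"{key!r} already exists; use update()")
        self.entries[key] = value

    def update(self, key, value):
        if key not in self.entries:
            raise KeyError(f"{key!r} not found; use create()")
        self.entries[key] = value

    def delete(self, key):
        self.entries.pop(key, None)

    def reorganize(self, old_keys, new_key, merge):
        """Fold several entries into one, e.g. to consolidate related facts."""
        values = [self.entries.pop(k) for k in old_keys if k in self.entries]
        self.entries[new_key] = merge(values)


def consume(bank, stream):
    """Apply each (account, delta) event as it arrives, not at query time."""
    for account, delta in stream:
        if account in bank.entries:
            bank.update(account, bank.entries[account] + delta)
        else:
            bank.create(account, delta)


bank = MemoryBank()
consume(bank, [("alice", 50), ("bob", 20), ("alice", -15), ("bob", 5)])
alice_balance = bank.entries["alice"]   # accumulated value, not a text span
bank.reorganize(["alice", "bob"], "household", merge=sum)
household_total = bank.entries["household"]
```

The balance is never stated in any single event; it only exists as the result of accumulated updates, which is the kind of latent answer the abstract attributes to Ledger-QA.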

Deep Reinforcement Learning for Optimizing Energy Consumption in Smart Grid Systems

Authors: Abeer Alsheikhi, Amirfarhad Farhadi, Azadeh Zamanifar
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2602.18531
Pdf link: https://arxiv.org/pdf/2602.18531
Abstract The energy management problem in the context of smart grids is inherently complex due to the interdependencies among diverse system components. Although Reinforcement Learning (RL) has been proposed for solving Optimal Power Flow (OPF) problems, the requirement for iterative interaction with an environment often necessitates computationally expensive simulators, leading to significant sample inefficiency. In this study, these challenges are addressed through the use of Physics-Informed Neural Networks (PINNs), which can replace conventional and costly smart grid simulators. The RL policy learning process is enhanced so that convergence can be achieved in a fraction of the time required by the original environment. The PINN-based surrogate is compared with other benchmark data-driven surrogate models. By incorporating knowledge of the underlying physical laws, the results show that the PINN surrogate is the only approach considered in this context that can obtain a strong RL policy even without access to samples from the true simulator. The results demonstrate that using PINN surrogates can accelerate training by 50% compared to RL training without a surrogate. This approach enables the rapid generation of performance scores similar to those produced by the original simulator.

1D-Bench: A Benchmark for Iterative UI Code Generation with Visual Feedback in Real-World

Authors: Qiao Xu, Yipeng Yu, Chengxiao Feng, Xu Liu
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.18548
Pdf link: https://arxiv.org/pdf/2602.18548
Abstract Design-to-code translates high-fidelity UI designs into executable front-end implementations, but progress remains hard to compare due to inconsistent datasets, toolchains, and evaluation protocols. We introduce 1D-Bench, a benchmark grounded in real e-commerce workflows, where each instance provides a reference rendering and an exported intermediate representation that may contain extraction errors. 1D is short for one day, representing the efficient completion of design-to-code tasks in less than one day. Models take both as input, using the intermediate representation as structural cues while being evaluated against the reference rendering, which tests robustness to intermediate representation defects rather than literal adherence. 1D-Bench requires generating an executable React codebase under a fixed toolchain with an explicit component hierarchy, and defines a multi-round setting in which models iteratively apply component-level edits using execution feedback. Experiments on commercial and open-weight multimodal models show that iterative editing generally improves final performance by increasing rendering success and often improving visual similarity. We further conduct a pilot study on post-training with synthetic repair trajectories and reinforcement learning based editing, and observe limited and unstable gains that may stem from sparse terminal rewards and high-variance file-level updates.

Hierarchical Reward Design from Language: Enhancing Alignment of Agent Behavior with Human Specifications

Authors: Zhiqin Qian, Ryan Diaz, Sangwon Seo, Vaibhav Unhelkar
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.18582
Pdf link: https://arxiv.org/pdf/2602.18582
Abstract When training artificial intelligence (AI) to perform tasks, humans often care not only about whether a task is completed but also how it is performed. As AI agents tackle increasingly complex tasks, aligning their behavior with human-provided specifications becomes critical for responsible AI deployment. Reward design provides a direct channel for such alignment by translating human expectations into reward functions that guide reinforcement learning (RL). However, existing methods are often too limited to capture nuanced human preferences that arise in long-horizon tasks. Hence, we introduce Hierarchical Reward Design from Language (HRDL): a problem formulation that extends classical reward design to encode richer behavioral specifications for hierarchical RL agents. We further propose Language to Hierarchical Rewards (L2HR) as a solution to HRDL. Experiments show that AI agents trained with rewards designed via L2HR not only complete tasks effectively but also better adhere to human specifications. Together, HRDL and L2HR advance the research on human-aligned AI agents.

DP-RFT: Learning to Generate Synthetic Text via Differentially Private Reinforcement Fine-Tuning

Authors: Fangyuan Xu, Sihao Chen, Zinan Lin, Taiwei Shi, Sydney Graham, Pei Zhou, Mengting Wan, Alex Stein, Virginia Estellers, Charles Chen, Morris Sharp, Richard Speyer, Tadas Baltrusaitis, Jennifer Neville, Eunsol Choi, Longqi Yang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.18633
Pdf link: https://arxiv.org/pdf/2602.18633
Abstract Differentially private (DP) synthetic data generation plays a pivotal role in developing large language models (LLMs) on private data, where data owners cannot provide eyes-on access to individual examples. Generating DP synthetic data typically involves a difficult trade-off. On the one hand, DP finetuning methods train an LLM as a synthetic data generator with formal privacy guarantees, yet they still require the raw content of private examples for model training. On the other hand, methods that avoid direct exposure to private data are bounded by an off-the-shelf, un-finetuned model, whose outputs often lack domain fidelity. Can we train an LLM to generate high-quality synthetic text without eyes-on access to individual private examples? In this work, we introduce Differentially Private Reinforcement Fine-Tuning (DP-RFT), an online reinforcement learning algorithm for synthetic data generation with LLMs. DP-RFT leverages DP-protected nearest-neighbor votes from an eyes-off private corpus as a reward signal for on-policy synthetic samples generated by an LLM. The LLM iteratively learns to generate synthetic data to maximize the expected DP votes through Proximal Policy Optimization (PPO). We evaluate DP-RFT for long-form and domain-specific synthetic data generation, such as news articles, meeting transcripts, and medical article abstracts. Our experiments show that DP-RFT closes the gap between private evolution and DP finetuning methods in terms of the fidelity and downstream utility of the generated synthetic data, while respecting the private data boundary.
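
The reward signal described in the DP-RFT abstract (noisy nearest-neighbor votes from a private corpus, cast for synthetic samples) can be sketched roughly as follows. This is an illustration under assumptions, not the authors' implementation: the embeddings are toy 2-D points, the distance is squared Euclidean, the Gaussian noise scale `sigma` is a free parameter that is not calibrated to any privacy budget here, and the PPO update that would consume these rewards is omitted.

```python
import random


def noisy_nn_votes(private_emb, synthetic_emb, sigma, rng):
    """Each private embedding votes for its nearest synthetic sample;
    Gaussian noise is added to every vote count before it is released."""
    votes = [0.0] * len(synthetic_emb)
    for p in private_emb:
        # Squared Euclidean distance from this private point to each candidate.
        dists = [sum((a - b) ** 2 for a, b in zip(p, s)) for s in synthetic_emb]
        votes[dists.index(min(dists))] += 1.0
    return [v + rng.gauss(0.0, sigma) for v in votes]


rng = random.Random(0)
private = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0)]   # never shown to the generator
synthetic = [(0.0, 0.1), (4.0, 4.0), (-9.0, 9.0)]            # on-policy samples

clean_votes = noisy_nn_votes(private, synthetic, sigma=0.0, rng=rng)
rewards = noisy_nn_votes(private, synthetic, sigma=1.0, rng=rng)
```

Only the noisy vote vector leaves the private side, so the generator is rewarded for producing samples that sit close to private data without ever reading an individual private example.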

Adaptive Time Series Reasoning via Segment Selection

Authors: Shvat Messica, Jiawen Zhang, Kevin Li, Theodoros Tsiligkaridis, Marinka Zitnik
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.18645
Pdf link: https://arxiv.org/pdf/2602.18645
Abstract Time series reasoning tasks often start with a natural language question and require targeted analysis of a time series. Evidence may span the full series or appear in a few short intervals, so the model must decide what to inspect. Most existing approaches encode the entire time series into a fixed representation before inference, regardless of whether or not the entire sequence is relevant. We introduce ARTIST, which formulates time-series reasoning as a sequential decision problem. ARTIST interleaves reasoning with adaptive temporal segment selection. It adopts a controller-reasoner architecture and uses reinforcement learning to train the controller role to select informative segments and the reasoner role to generate segment-conditioned reasoning traces and final answers. During inference, the model actively acquires task-relevant information instead of relying on a static summary of the full sequence. We use a novel hierarchical policy optimization approach for post-training that allows the model to excel in both segment selection and question-answering behavior. We evaluate ARTIST on six time-series reasoning benchmarks and compare it with large language models, vision-language models, and prior time-series reasoning systems. ARTIST improves average accuracy by 6.46 absolute percentage points over the strongest baseline. The largest gains appear on rare event localization and multi-segment reasoning tasks. Supervised fine-tuning improves performance, and reinforcement learning provides additional gains by optimizing question-adaptive segment selection. These results show that selective data use drives effective time-series reasoning.

Toward AI Autonomous Navigation for Mechanical Thrombectomy using Hierarchical Modular Multi-agent Reinforcement Learning (HM-MARL)

Authors: Harry Robertshaw, Nikola Fischer, Lennart Karstensen, Benjamin Jackson, Xingyu Chen, S.M.Hadi Sadati, Christos Bergeles, Alejandro Granados, Thomas C Booth
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.18663
Pdf link: https://arxiv.org/pdf/2602.18663
Abstract Mechanical thrombectomy (MT) is typically the optimal treatment for acute ischemic stroke involving large vessel occlusions, but access is limited due to geographic and logistical barriers. Reinforcement learning (RL) shows promise in autonomous endovascular navigation, but generalization across 'long' navigation tasks remains challenging. We propose a Hierarchical Modular Multi-Agent Reinforcement Learning (HM-MARL) framework for autonomous two-device navigation in vitro, enabling efficient and generalizable navigation. HM-MARL was developed to autonomously navigate a guide catheter and guidewire from the femoral artery to the internal carotid artery (ICA). A modular multi-agent approach was used to decompose the complex navigation task into specialized subtasks, each trained using Soft Actor-Critic RL. The framework was validated in both in silico and in vitro testbeds to assess generalization and real-world feasibility. In silico, a single-vasculature model achieved 92-100% success rates on individual anatomies, while a multi-vasculature model achieved 56-80% across multiple patient anatomies. In vitro, both HM-MARL models successfully navigated 100% of trials from the femoral artery to the right common carotid artery and 80% to the right ICA but failed on the left-side vessel superhuman challenge due to the anatomy and catheter type used in navigation. This study presents the first demonstration of in vitro autonomous navigation in MT vasculature. While HM-MARL enables generalization across anatomies, the simulation-to-real transition introduces challenges. Future work will refine RL strategies using world models and validate performance on unseen in vitro data, advancing autonomous MT towards clinical translation.

In-Context Planning with Latent Temporal Abstractions

Authors: Baiting Luo, Yunuo Zhang, Nathaniel S. Keplinger, Samir Gupta, Abhishek Dubey, Ayan Mukhopadhyay
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.18694
Pdf link: https://arxiv.org/pdf/2602.18694
Abstract Planning-based reinforcement learning for continuous control is bottlenecked by two practical issues: planning at primitive time scales leads to prohibitive branching and long horizons, while real environments are frequently partially observable and exhibit regime shifts that invalidate stationary, fully observed dynamics assumptions. We introduce I-TAP (In-Context Latent Temporal-Abstraction Planner), an offline RL framework that unifies in-context adaptation with online planning in a learned discrete temporal-abstraction space. From offline trajectories, I-TAP learns an observation-conditioned residual-quantization VAE that compresses each observation-macro-action segment into a coarse-to-fine stack of discrete residual tokens, and a temporal Transformer that autoregressively predicts these token stacks from a short recent history. The resulting sequence model acts simultaneously as a context-conditioned prior over abstract actions and a latent dynamics model. At test time, I-TAP performs Monte Carlo Tree Search directly in token space, using short histories for implicit adaptation without gradient update, and decodes selected token stacks into executable actions. Across deterministic MuJoCo, stochastic MuJoCo with per-episode latent dynamics regimes, and high-dimensional Adroit manipulation, including partially observable variants, I-TAP consistently matches or outperforms strong model-free and model-based offline baselines, demonstrating efficient and robust in-context planning under stochastic dynamics and partial observability.

LMFPPO-UBP: Local Mean Field Proximal Policy Optimization with Unbalanced Punishment for Spatial Public Goods Games

Authors: Jinshuo Yang, Zhaoqilin Yang, Wenjie Zhou, Xin Wang, Youliang Tian
Subjects: Computer Science and Game Theory (cs.GT)
Arxiv link: https://arxiv.org/abs/2602.18696
Pdf link: https://arxiv.org/pdf/2602.18696
Abstract Spatial public goods games are characterized by high-dimensional state spaces and localized externalities, which pose significant challenges for achieving stable and widespread cooperation. Traditional approaches often struggle to effectively capture neighborhood-level strategic interactions and dynamically align individual incentives with collective welfare. To resolve this issue, this paper introduces a novel intelligent decision-making framework called Local Mean-Field Proximal Policy Optimization with Unbalanced Punishment (LMFPPO-UBP). The conventional mean field concept is reformulated as a socio-statistical sensor embedded directly into the policy gradient space of deep reinforcement learning, allowing agents to adapt their strategies based on mesoscale neighborhood dynamics. Additionally, an unbalanced punishment mechanism is integrated to penalize defectors proportionally to the local density of cooperators, thereby reshaping the payoff structures without imposing direct costs on cooperative agents. Experimental results demonstrate that the LMFPPO-UBP promotes rapid and stable global cooperation even under low enhancement factors, consistently outperforming baseline methods such as Q-learning and Fermi update rules. Statistical analyses further validate the framework's effectiveness in lowering the cooperation threshold and achieving better coordinated outcomes.
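
To make the unbalanced punishment idea in the LMFPPO-UBP abstract concrete, the sketch below uses the standard public goods game payoff (each cooperator contributes 1, the pool is multiplied by an enhancement factor and split equally) and adds a fine on defectors that grows with the local fraction of cooperators. The linear fine `beta * density` and all parameter values are assumptions made for illustration; the abstract states only that the penalty is proportional to local cooperator density and that cooperators bear no direct cost.

```python
def pgg_payoff(is_cooperator, n_cooperators, group_size, r, beta):
    """Payoff of one agent in a single public goods group.

    n_cooperators counts every cooperator in the group, including the
    focal agent when it cooperates. r is the enhancement factor and
    beta the (hypothetical) punishment strength.
    """
    share = r * n_cooperators / group_size
    if is_cooperator:
        return share - 1.0            # cooperators pay their contribution only
    density = n_cooperators / group_size
    return share - beta * density     # defectors are fined; cooperators are not


G, R, BETA = 5, 3.0, 1.0              # low enhancement factor: R < G
others = 3                            # cooperators among the other group members

coop = pgg_payoff(True, others + 1, G, R, BETA)
defect_plain = pgg_payoff(False, others, G, R, 0.0)
defect_punished = pgg_payoff(False, others, G, R, BETA)
```

With these numbers, defection pays more than cooperation when there is no punishment, and less once the density-scaled fine is applied, which is the incentive shift the abstract describes.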

Task-Aware Exploration via a Predictive Bisimulation Metric

Authors: Dayang Liang, Ruihan Liu, Lipeng Wan, Yunlong Liu, Bo An
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.18724
Pdf link: https://arxiv.org/pdf/2602.18724
Abstract Accelerating exploration in visual reinforcement learning under sparse rewards remains challenging due to the substantial task-irrelevant variations. Despite advances in intrinsic exploration, many methods either assume access to low-dimensional states or lack task-aware exploration strategies, thereby rendering them fragile in visual domains. To bridge this gap, we present TEB, a Task-aware Exploration approach that tightly couples task-relevant representations with exploration through a predictive Bisimulation metric. Specifically, TEB leverages the metric not only to learn behaviorally grounded task representations but also to measure behaviorally intrinsic novelty over the learned latent space. To realize this, we first theoretically mitigate the representation collapse of degenerate bisimulation metrics under sparse rewards by internally introducing a simple but effective predicted reward differential. Building on this robust metric, we design potential-based exploration bonuses, which measure the relative novelty of adjacent observations over the latent space. Extensive experiments on MetaWorld and Maze2D show that TEB achieves superior exploration ability and outperforms recent baselines.
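
The potential-based exploration bonus mentioned in the TEB abstract can be illustrated with the classical shaping form, bonus = gamma * Phi(z') - Phi(z), computed over adjacent latent states. In this sketch the potential Phi is a simple novelty score, the Euclidean distance to the nearest previously visited latent point. That choice is an assumption standing in for the paper's predictive bisimulation metric, and the function names are hypothetical.

```python
import math


def novelty(z, visited):
    """Stand-in potential: distance from z to the closest visited latent."""
    if not visited:
        return 0.0
    return min(math.dist(z, v) for v in visited)


def exploration_bonus(z, z_next, visited, gamma=0.99):
    """Potential-based bonus comparing the novelty of adjacent observations."""
    return gamma * novelty(z_next, visited) - novelty(z, visited)


gamma = 0.99
visited = [(0.0, 0.0), (1.0, 0.0)]
trajectory = [(0.0, 0.0), (1.0, 1.0), (3.0, 1.0)]

bonuses = [
    exploration_bonus(z, z_next, visited, gamma)
    for z, z_next in zip(trajectory, trajectory[1:])
]
# Discounted bonuses telescope to gamma^T * Phi(z_T) - Phi(z_0).
discounted_sum = sum(gamma ** t * b for t, b in enumerate(bonuses))
telescoped = gamma ** 2 * novelty(trajectory[-1], visited) - novelty(trajectory[0], visited)
```

The telescoping identity is the reason potential-based bonuses are attractive: for a fixed potential, the total bonus depends only on the start and end of the trajectory, so it steers exploration without rewarding loops.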

HONEST-CAV: Hierarchical Optimization of Network Signals and Trajectories for Connected and Automated Vehicles with Multi-Agent Reinforcement Learning

Authors: Ziyan Zhang, Changxin Wan, Peng Hao, Kanok Boriboonsomsin, Matthew J. Barth, Yongkang Liu, Seyhan Ucar, Guoyuan Wu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2602.18740
Pdf link: https://arxiv.org/pdf/2602.18740
Abstract This study presents a hierarchical, network-level traffic flow control framework for mixed traffic consisting of Human-driven Vehicles (HVs) and Connected and Automated Vehicles (CAVs). The framework jointly optimizes vehicle-level eco-driving behaviors and intersection-level traffic signal control to enhance overall network efficiency and decrease energy consumption. A decentralized Multi-Agent Reinforcement Learning (MARL) approach based on a Value Decomposition Network (VDN) manages cycle-based traffic signal control (TSC) at intersections, while an innovative Signal Phase and Timing (SPaT) prediction method integrates a Machine Learning-based Trajectory Planning Algorithm (MLTPA) to guide CAVs in executing Eco-Approach and Departure (EAD) maneuvers. The framework is evaluated across varying CAV proportions and powertrain types to assess its effects on mobility and energy performance. Experiments conducted in a 4×4 real-world network demonstrate that the MARL-based TSC method outperforms the baseline model (i.e., the Webster method) in speed, fuel consumption, and idling time. In addition, with MLTPA, HONEST-CAV further benefits the traffic system in energy consumption and idling time. With a 60% CAV proportion, vehicle average speed, fuel consumption, and idling time can be improved/saved by 7.67%, 10.23%, and 45.83%, respectively, compared with the baseline. Furthermore, discussions on CAV proportions and powertrain types are conducted to quantify the performance of the proposed method under the impact of automation and electrification.
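
The value decomposition idea behind VDN, which the HONEST-CAV abstract uses for signal control, is that the joint action value is modeled as a sum of per-agent values, so each intersection can act greedily on its own values and still maximize the joint value. The sketch below checks that property by brute force. The Q-values and the two-action "keep phase / switch phase" setup are invented for illustration and are not taken from the paper.

```python
from itertools import product

# Hypothetical per-intersection Q-values for actions (keep phase, switch phase).
q_values = [
    [1.0, 3.0],   # intersection 0
    [2.5, 0.5],   # intersection 1
    [0.0, 0.2],   # intersection 2
]


def q_total(q_values, joint_action):
    """VDN-style joint value: the sum of each agent's own Q-value."""
    return sum(q[a] for q, a in zip(q_values, joint_action))


# Decentralized execution: every agent picks its own best action.
greedy = tuple(max(range(len(q)), key=q.__getitem__) for q in q_values)

# Centralized check: search all joint actions for the best total.
all_joint = product(*(range(len(q)) for q in q_values))
best = max(all_joint, key=lambda ja: q_total(q_values, ja))
```

Because the total is additive, the decentralized greedy choice and the exhaustive joint search agree, which is what lets training use a shared team signal while execution stays local to each intersection.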

TAG: Thinking with Action Unit Grounding for Facial Expression Recognition

Authors: Haobo Lin, Tianyi Bai, Jiajun Zhang, Xuanhao Chang, Sheng Lu, Fangming Gu, Zengjie Hu, Wentao Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.18763
Pdf link: https://arxiv.org/pdf/2602.18763
Abstract Facial Expression Recognition (FER) is a fine-grained visual understanding task where reliable predictions require reasoning over localized and meaningful facial cues. Recent vision-language models (VLMs) enable natural language explanations for FER, but their reasoning is often ungrounded, producing fluent yet unverifiable rationales that are weakly tied to visual evidence and prone to hallucination, leading to poor robustness across different datasets. We propose TAG (Thinking with Action Unit Grounding), a vision-language framework that explicitly constrains multimodal reasoning to be supported by facial Action Units (AUs). TAG requires intermediate reasoning steps to be grounded in AU-related facial regions, yielding predictions accompanied by verifiable visual evidence. The model is trained via supervised fine-tuning on AU-grounded reasoning traces followed by reinforcement learning with an AU-aware reward that aligns predicted regions with external AU detectors. Evaluated on RAF-DB, FERPlus, and AffectNet, TAG consistently outperforms strong open-source and closed-source VLM baselines while simultaneously improving visual faithfulness. Ablation and preference studies further show that AU-grounded rewards stabilize reasoning and mitigate hallucination, demonstrating the importance of structured grounded intermediate representations for trustworthy multimodal reasoning in FER. The code will be available at this https URL.

Carbon-aware decentralized dynamic task offloading in MIMO-MEC networks via multi-agent reinforcement learning

Authors: Mubshra Zulfiqar, Muhammad Ayzed Mirza, Basit Qureshi
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.18797
Pdf link: https://arxiv.org/pdf/2602.18797
Abstract Massive Internet of Things (IoT) microservices require integrating renewable energy harvesting into mobile edge computing (MEC) for sustainable eScience infrastructures. Spatiotemporal mismatches between stochastic task arrivals and intermittent green energy, along with complex inter-user interference in multi-antenna (MIMO) uplinks, complicate real-time resource management. Traditional centralized optimization and off-policy reinforcement learning struggle with scalability and signaling overhead in dense networks. This paper proposes CADDTO-PPO, a carbon-aware decentralized dynamic task offloading framework based on multi-agent proximal policy optimization. The multi-user MIMO-MEC system is modeled as a Decentralized Partially Observable Markov Decision Process (DEC-POMDP) to jointly minimize carbon emissions, buffer latency, and energy wastage. A scalable architecture utilizes decentralized execution with parameter sharing (DEPS), which enables autonomous IoT agents to make fine-grained power control and offloading decisions based solely on local observations. Additionally, a carbon-first reward structure adaptively prioritizes green time slots for data transmission to decouple system throughput from grid-dependent carbon footprints. Finally, experimental results demonstrate that CADDTO-PPO outperforms deep deterministic policy gradient (DDPG) and Lyapunov-based baselines. The framework achieves the lowest carbon intensity and maintains near-zero packet overflow rates under extreme traffic loads. Architectural profiling indicates that the framework has constant $O(1)$ inference complexity and is theoretically lightweight enough for future-generation sustainable IoT deployments.
中文摘要 大规模物联网微服务需要将可再生能源采集整合进移动边缘计算（MEC）中，以实现可持续的电子科学基础设施。随机任务到达与间歇性绿色能源之间的时空不匹配，加上多天线（MIMO）上行链路中的复杂用户间干扰，使得实时资源管理更加复杂。传统的集中式优化和离策略（off-policy）强化学习在密集网络中面临扩展性和信令开销的挑战。本文提出了CADDTO-PPO，这是一种基于多智能体近端策略优化的碳感知去中心化动态任务卸载框架。多用户MIMO-MEC系统被建模为去中心化部分可观测马尔可夫决策过程（DEC-POMDP），以联合最小化碳排放、缓冲延迟和能源浪费。可扩展架构采用参数共享的去中心化执行（DEPS），使自主物联网智能体能够仅基于本地观测做出细粒度的功率控制和卸载决策。此外，碳优先的奖励结构可自适应地优先在绿色时隙进行数据传输，以将系统吞吐量与依赖电网的碳足迹脱钩。最后，实验结果表明CADDTO-PPO优于深度确定性策略梯度（DDPG）和基于李雅普诺夫的基线。该框架实现了最低的碳强度，并在极端流量负载下保持近乎零的数据包溢出率。架构性能分析表明，该框架具有恒定的 $O(1)$ 推理复杂度，在理论上具备轻量化可行性，适用于下一代可持续物联网部署。

Issues with Measuring Task Complexity via Random Policies in Robotic Tasks

机器人任务中通过随机策略测量任务复杂性的问题

Authors: Reabetswe M. Nkhumise, Mohamed S. Talamali, Aditya Gilra
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.18856
Pdf link: https://arxiv.org/pdf/2602.18856
Abstract Reinforcement learning (RL) has enabled major advances in fields such as robotics and natural language processing. A key challenge in RL is measuring task complexity, which is essential for creating meaningful benchmarks and designing effective curricula. While there are numerous well-established metrics for assessing task complexity in tabular settings, relatively few exist in non-tabular domains. These include (i) statistical analysis of the performance of random policies via Random Weight Guessing (RWG), and (ii) the information-theoretic metrics Policy Information Capacity (PIC) and Policy-Optimal Information Capacity (POIC), which rely on RWG. In this paper, we evaluate these methods using progressively difficult robotic manipulation setups with known relative complexity, under both dense and sparse reward formulations. Our empirical results reveal that measuring complexity is still nuanced. Specifically, under the same reward formulation, PIC suggests that a two-link robotic arm setup is easier than a single-link setup, which contradicts the robotic control and empirical RL perspective whereby the two-link setup is inherently more complex. Likewise, for the same setup, POIC estimates that tasks with sparse rewards are easier than those with dense rewards. Thus, we show that both PIC and POIC contradict typical understanding and empirical results from RL. These findings highlight the need to move beyond RWG-based metrics towards better metrics that can more reliably capture task complexity in non-tabular RL, with our task framework as a starting point.
中文摘要 强化学习（RL）推动了机器人学和自然语言处理等领域的重大进展。强化学习的一个关键挑战是衡量任务复杂度，这对于构建有意义的基准和设计有效的课程至关重要。虽然在表格型设定中已有许多成熟的任务复杂度评估指标，但在非表格型领域中此类指标相对较少。这些指标包括（i）通过随机权重猜测（RWG）对随机策略的性能进行统计分析，以及（ii）依赖RWG的信息论指标：策略信息容量（PIC）和策略最优信息容量（POIC）。本文使用难度逐步递增、相对复杂度已知的机器人操作设置，在稠密奖励和稀疏奖励两种形式下评估了这些方法。我们的实证结果表明，复杂度的衡量仍有许多微妙之处。具体来说，在相同的奖励形式下，PIC认为双连杆机械臂设置比单连杆设置更容易，这与机器人控制和强化学习实证的观点相矛盾，后者认为双连杆设置本质上更复杂。同样，对于同一设置，POIC估计稀疏奖励任务比稠密奖励任务更容易。因此，我们表明PIC和POIC都与强化学习的通常理解和实证结果相矛盾。这些发现凸显了需要超越基于RWG的指标，以我们的任务框架为起点，寻找能够更可靠地刻画非表格型强化学习中任务复杂度的更优指标。
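As a rough illustration of the Random Weight Guessing idea mentioned in the abstract above, the sketch below samples random policy weights, evaluates each policy over a few episodes, and collects the mean returns whose distribution RWG-style analyses inspect. The toy 1-D task, the linear policy, and all names (`run_episode`, `random_weight_guessing`) are assumptions made for illustration; they are not the paper's robotic setups or code.

```python
import random
import statistics

def run_episode(weights, rng, horizon=20):
    """Toy 1-D task (hypothetical): drive state x toward 0 with a linear policy."""
    x = rng.uniform(-1.0, 1.0)
    total = 0.0
    for _ in range(horizon):
        # Linear policy with a clipped action in [-1, 1].
        action = max(-1.0, min(1.0, weights[0] * x + weights[1]))
        x = x + 0.5 * action
        total += -abs(x)  # dense reward: negative distance to the target
    return total

def random_weight_guessing(n_policies=200, n_episodes=5, seed=0):
    """Sample random weights (no training) and record each policy's mean return."""
    rng = random.Random(seed)
    mean_returns = []
    for _ in range(n_policies):
        weights = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
        returns = [run_episode(weights, rng) for _ in range(n_episodes)]
        mean_returns.append(statistics.mean(returns))
    return mean_returns

scores = random_weight_guessing()
print("best:", max(scores), "median:", statistics.median(scores))
```

The spread of `scores` (for example, how often a random policy lands near the best observed return) is the kind of statistic such analyses summarize; the abstract's point is that metrics built on this procedure can rank tasks in counterintuitive ways.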

VariBASed: Variational Bayes-Adaptive Sequential Monte-Carlo Planning for Deep Reinforcement Learning

VariBASed：面向深度强化学习的变分贝叶斯自适应序贯蒙特卡洛规划

Authors: Joery A. de Vries, Jinke He, Yaniv Oren, Pascal R. van der Vaart, Mathijs M. de Weerdt, Matthijs T. J. Spaan
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.18857
Pdf link: https://arxiv.org/pdf/2602.18857
Abstract Optimally trading off exploration and exploitation is the holy grail of reinforcement learning, as it promises maximal data efficiency for solving any task. Bayes-optimal agents achieve this, but obtaining the belief state and performing planning are both typically intractable. Although deep learning methods can greatly help in scaling this computation, existing methods are still costly to train. To accelerate this, this paper proposes a variational framework for learning and planning in Bayes-adaptive Markov decision processes that coalesces variational belief learning, sequential Monte-Carlo planning, and meta-reinforcement learning. In a single-GPU setup, our new method VariBASeD exhibits favorable scaling to larger planning budgets, improving sample- and runtime-efficiency over prior methods.
中文摘要 探索与利用的最佳权衡是强化学习的圣杯，因为它承诺为解决任何任务提供最大的数据效率。贝叶斯最优智能体能实现这一点，但获得信念状态和执行计划通常都难以处理。尽管深度学习方法能极大地帮助扩展计算，但现有方法的训练成本仍然很高。为加速这一进程，本文提出了一种贝叶斯自适应马尔可夫决策过程中学习与规划的变分框架，融合变分信念学习、顺序蒙特卡洛规划和元强化学习。在单GPU配置下，我们的新方法VariBASeD在更大规划预算下表现出良好的扩展性，提升了采样和运行效率，优于以往方法。

Gait Asymmetry from Unilateral Weakness and Improvement With Ankle Assistance: a Reinforcement Learning based Simulation Study

单侧肌无力导致的步态不对称及踝关节辅助带来的改善：基于强化学习的仿真研究

Authors: Yifei Yuan, Ghaith Androwis, Xianlian Zhou
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.18862
Pdf link: https://arxiv.org/pdf/2602.18862
Abstract Unilateral muscle weakness often leads to asymmetric gait, disrupting interlimb coordination and stance timing. This study presents a reinforcement learning (RL) based musculoskeletal simulation framework to (1) quantify how progressive unilateral muscle weakness affects gait symmetry and (2) evaluate whether ankle exoskeleton assistance can improve gait symmetry under impaired conditions. The overarching goal is to establish a simulation- and learning-based workflow that supports early controller development prior to patient experiments. Asymmetric gait was induced by reducing right-leg muscle strength to 75%, 50%, and 25% of baseline. Gait asymmetry was quantified using toe-off timing, peak contact forces, and joint-level symmetry metrics. Increasing weakness produced progressively larger temporal and kinematic asymmetry, most pronounced at the ankle. Ankle range of motion symmetry degraded from near-symmetric behavior at 100% strength (symmetry index, SI = +6.4%; correlation r=0.974) to severe asymmetry at 25% strength (SI = -47.1%, r=0.889), accompanied by a load shift toward the unimpaired limb. At 50% strength, ankle exoskeleton assistance improved kinematic symmetry relative to the unassisted impaired condition, reducing the magnitude of ankle SI from 25.8% to 18.5% and increasing ankle correlation from r=0.948 to 0.966, although peak loading remained biased toward the unimpaired side. Overall, this framework supports controlled evaluation of impairment severity and assistive strategies, and provides a basis for future validation in human experiments.
中文摘要 单侧肌肉无力常导致步态不对称，破坏肢体间协调和支撑相时序。本研究提出了基于强化学习（RL）的肌肉骨骼仿真框架，旨在（1）量化渐进性单侧肌肉无力如何影响步态对称性，（2）评估踝关节外骨骼辅助能否在受损条件下改善步态对称性。总体目标是建立基于仿真和学习的工作流程，支持患者实验前的早期控制器开发。通过将右腿肌肉力量降至基线的75%、50%和25%来诱发不对称步态。步态不对称通过足趾离地时刻、峰值接触力和关节级对称性指标进行量化。无力程度加重导致时间和运动学上的不对称逐渐加剧，在踝关节处最为明显。踝关节活动范围的对称性从100%力量时的近乎对称（对称指数SI = +6.4%；相关系数r=0.974）降至25%力量时的严重不对称（SI = -47.1%，r=0.889），并伴有负荷向未受损肢体转移。在50%力量下，与无辅助的受损状态相比，踝关节外骨骼辅助改善了运动学对称性，踝关节SI的幅值从25.8%降至18.5%，踝关节相关系数从r=0.948提升至0.966，但峰值负荷仍偏向未受损侧。总体而言，该框架支持对损伤严重程度和辅助策略的受控评估，并为未来在人体实验中的验证提供了基础。
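The abstract reports a symmetry index (SI) and a correlation r but does not state their formulas. The sketch below uses one commonly used SI definition (the left-right difference normalized by the left-right mean) and a plain Pearson correlation; both the definition and the sign convention are assumptions and may differ from what the paper uses. The function names and sample numbers are hypothetical.

```python
import math

def symmetry_index(right, left):
    """Percent difference between limbs, normalized by their mean (assumed definition)."""
    mean = 0.5 * (right + left)
    return 100.0 * (right - left) / mean

def pearson_r(xs, ys):
    """Pearson correlation between two equally long trajectories."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up ankle range-of-motion values in degrees, for illustration only.
si = symmetry_index(right=20.0, left=30.0)

# Made-up joint-angle samples over one gait cycle for each leg.
right_angles = [0.0, 5.0, 10.0, 6.0, 1.0]
left_angles = [0.0, 7.0, 14.0, 8.0, 1.0]
r = pearson_r(right_angles, left_angles)
print(f"SI = {si:+.1f}%, r = {r:.3f}")
```

Under this convention a negative SI means the right (here, weakened) side moves through a smaller range than the left, which matches the direction of the values quoted in the abstract, though that reading is itself an assumption.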

TPRU: Advancing Temporal and Procedural Understanding in Large Multimodal Models

TPRU：推进大型多模态模型中的时间和过程理解

Authors: Zhenkun Gao, Xuhong Wang, Xin Tan, Yuan Xie
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.18884
Pdf link: https://arxiv.org/pdf/2602.18884
Abstract Multimodal Large Language Models (MLLMs), particularly smaller, deployable variants, exhibit a critical deficiency in understanding temporal and procedural visual data, a bottleneck hindering their application in real-world embodied AI. This gap is largely caused by a systemic failure in training paradigms, which lack large-scale, procedurally coherent data. To address this problem, we introduce TPRU, a large-scale dataset sourced from diverse embodied scenarios such as robotic manipulation and GUI navigation. TPRU is systematically designed to cultivate temporal reasoning through three complementary tasks: Temporal Reordering, Next-Frame Prediction, and Previous-Frame Review. A key feature is the inclusion of challenging negative samples, compelling models to transition from passive observation to active, cross-modal validation. We leverage TPRU with a reinforcement learning (RL) fine-tuning methodology, specifically targeting the enhancement of resource-efficient models. Experiments show our approach yields dramatic gains: on our manually curated TPRU-Test, the accuracy of TPRU-7B soars from 50.33% to 75.70%, a state-of-the-art result that significantly outperforms vastly larger baselines, including GPT-4o. Crucially, these capabilities generalize effectively, demonstrating substantial improvements on established benchmarks. The codebase is available at this https URL.
中文摘要 多模态大型语言模型（MLLM），尤其是较小且可部署的变体，在理解时间和过程性视觉数据方面存在严重缺陷，这一瓶颈阻碍了其在现实世界具身人工智能中的应用。这一差距主要源于训练范式的系统性缺失，即缺乏大规模且过程连贯的数据。为解决这一问题，我们引入了TPRU，这是一个大规模数据集，来源于机器人操作和图形界面导航等多样的具身场景。TPRU经过系统性设计，通过三个互补任务培养时间推理能力：时间重排序、下一帧预测和前一帧回顾。一个关键特点是包含具有挑战性的负样本，促使模型从被动观察转向主动的跨模态验证。我们将TPRU与强化学习（RL）微调方法结合使用，特别针对资源高效模型的提升。实验显示，我们的方法取得了显著提升：在我们人工整理的TPRU-Test上，TPRU-7B的准确率从50.33%跃升至75.70%，这一最先进的结果显著优于包括GPT-4o在内的规模大得多的基线。关键的是，这些能力能够有效泛化，在已有基准上展现出显著提升。代码库可在此 https URL 访问。

DeepInterestGR: Mining Deep Multi-Interest Using Multi-Modal LLMs for Generative Recommendation

DeepInterestGR：利用多模态大型语言模型挖掘深度多兴趣生成推荐

Authors: Yangchen Zeng
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2602.18907
Pdf link: https://arxiv.org/pdf/2602.18907
Abstract Recent generative recommendation frameworks have demonstrated remarkable scaling potential by reformulating item prediction as autoregressive Semantic ID (SID) generation. However, existing methods primarily rely on shallow behavioral signals, encoding items solely through surface-level textual features such as titles and descriptions. This reliance results in a critical Shallow Interest problem: the model fails to capture the latent, semantically rich interests underlying user interactions, limiting both personalization depth and recommendation interpretability. DeepInterestGR introduces three key innovations: (1) Multi-LLM Interest Mining (MLIM): We leverage multiple frontier LLMs along with their multi-modal variants to extract deep textual and visual interest representations through Chain-of-Thought prompting. (2) Reward-Labeled Deep Interest (RLDI): We employ a lightweight binary classifier to assign reward labels to mined interests, enabling effective supervision signals for reinforcement learning. (3) Interest-Enhanced Item Discretization (IEID): The curated deep interests are encoded into semantic embeddings and quantized into SID tokens via RQ-VAE. We adopt a two-stage training pipeline: supervised fine-tuning aligns the generative model with deep interest signals and collaborative filtering patterns, followed by reinforcement learning with GRPO optimized by our Interest-Aware Reward. Experiments on three Amazon Review benchmarks demonstrate that DeepInterestGR consistently outperforms state-of-the-art baselines across HR@K and NDCG@K metrics.
中文摘要 近期生成式推荐框架通过将物品预测重新表述为自回归语义ID（SID）生成，展现出显著的扩展潜力。然而，现有方法主要依赖浅层行为信号，仅通过标题和描述等表层文本特征来编码物品。这种依赖导致了一个关键的浅层兴趣问题：模型未能捕捉用户交互背后潜在且语义丰富的兴趣，限制了个性化深度和推荐的可解释性。DeepInterestGR引入了三项关键创新：（1）多LLM兴趣挖掘（MLIM）：我们利用多种前沿LLM及其多模态变体，通过思维链（Chain-of-Thought）提示提取深度文本和视觉兴趣表示。（2）奖励标注深度兴趣（RLDI）：我们使用轻量级二分类器为挖掘出的兴趣分配奖励标签，为强化学习提供有效的监督信号。（3）兴趣增强物品离散化（IEID）：筛选后的深度兴趣被编码为语义嵌入，并通过RQ-VAE量化为SID词元（token）。我们采用两阶段训练流程：监督微调将生成模型与深度兴趣信号和协同过滤模式对齐，随后使用以我们的兴趣感知奖励为优化目标的GRPO进行强化学习。在三个Amazon Review基准上的实验表明，DeepInterestGR在HR@K和NDCG@K指标上始终优于最先进的基线。
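The abstract says interest embeddings are quantized into SID tokens via RQ-VAE. The sketch below shows only the residual quantization step in its generic form: at each level, pick the nearest codeword, emit its index as a token, and pass the remainder to the next level. The codebooks here are tiny hand-written lists; in an actual RQ-VAE they are learned jointly with an encoder and decoder, and nothing below is taken from the paper's implementation.

```python
def nearest(codebook, vec):
    """Index of the codeword with the smallest squared Euclidean distance to vec."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )

def residual_quantize(vec, codebooks):
    """Return one token per level plus the final unexplained residual."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        # Subtract the chosen codeword; the next level quantizes what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens, residual

# Hypothetical two-level codebooks over 2-D embeddings.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],   # coarse level
    [[0.0, 0.0], [0.2, -0.1]],  # fine level
]
tokens, residual = residual_quantize([1.2, 0.9], codebooks)
print(tokens, residual)
```

The resulting token sequence plays the role of a semantic ID: items whose embeddings share a coarse codeword share a token prefix, which is what makes autoregressive generation over such IDs meaningful.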

IDSelect: A RL-Based Cost-Aware Selection Agent for Video-based Multi-Modal Person Recognition

IDSelect：基于强化学习的成本感知选择代理，用于基于视频的多模态人物识别

Authors: Yuyang Ji, Yixuan Shen, Kien Nguyen, Lifeng Zhou, Feng Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.18990
Pdf link: https://arxiv.org/pdf/2602.18990
Abstract Video-based person recognition achieves robust identification by integrating face, body, and gait. However, current systems waste computational resources by processing all modalities with fixed heavyweight ensembles regardless of input complexity. To address these limitations, we propose IDSelect, a reinforcement learning-based cost-aware selector that chooses one pre-trained model per modality per-sequence to optimize the accuracy-efficiency trade-off. Our key insight is that an input-conditioned selector can discover complementary model choices that surpass fixed ensembles while using substantially fewer resources. IDSelect trains a lightweight agent end-to-end using actor-critic reinforcement learning with budget-aware optimization. The reward balances recognition accuracy with computational cost, while entropy regularization prevents premature convergence. At inference, the policy selects the most probable model per modality and fuses modality-specific similarities for the final score. Extensive experiments on challenging video-based datasets demonstrate IDSelect's superior efficiency: on CCVID, it achieves 95.9% Rank-1 accuracy with 92.4% less computation than strong baselines while improving accuracy by 1.8%; on MEVID, it reduces computation by 41.3% while maintaining competitive performance.
中文摘要 基于视频的人识别通过整合面部、身体和步态实现了稳健的识别。然而，当前系统通过处理所有模态而浪费计算资源，无论输入复杂度如何，都采用固定的重量级集合。为解决这些局限性，我们提出了IDSelect，一种基于强化学习的成本感知选择器，每个序列为每个模态选择一个预训练模型，以优化准确性与效率的权衡。我们的关键见解是，输入条件选择器能够发现超越固定集合的互补模型选择，同时使用更少的资源。IDSelect 通过 actor-critic 强化学习和预算感知优化，端到端训练一个轻量级代理。奖励在识别准确性与计算成本之间取得平衡，而熵正则化则防止过早收敛。在推断时，该策略为每个模态选择最可能的模型，并将模态特有的相似性融合为最终得分。在具有挑战性的基于视频数据集的广泛实验中展示了IDSelect的卓越效率：在CCVID上，IDSelect实现了95.9%的Rank-1准确率，计算量比强基线少92.4%，准确率提升了1.8%;在MEVID上，它在保持竞争力性能的同时，计算量减少了41.3%。
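To make the accuracy-versus-cost trade-off described above more concrete, here is a minimal sketch of three pieces the abstract names: a reward that balances correctness against computational cost, an entropy term of the kind used to discourage premature convergence, and greedy per-modality selection at inference. The reward form, the `cost_weight` parameter, and all function names are assumptions for illustration, not IDSelect's actual formulation.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    """Shannon entropy in nats; typically added as a bonus during training."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def cost_aware_reward(correct, cost, cost_weight=0.5):
    """Assumed reward shape: 1 for a correct match minus a weighted compute cost."""
    return (1.0 if correct else 0.0) - cost_weight * cost

def select_models(logits_by_modality):
    """At inference, pick the most probable model index for each modality."""
    return {
        modality: max(range(len(logits)), key=logits.__getitem__)
        for modality, logits in logits_by_modality.items()
    }

# Hypothetical policy outputs for one video sequence (two candidate models each).
choice = select_models({"face": [0.1, 2.0], "body": [1.5, 0.2], "gait": [0.0, 0.0]})
print(choice, cost_aware_reward(correct=True, cost=0.4))
```

A larger `cost_weight` pushes the policy toward cheaper models; the entropy term keeps the per-modality distributions from collapsing early in training.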

MagicAgent: Towards Generalized Agent Planning

MagicAgent：迈向通用代理规划

Authors: Xuhui Ren, Shaokang Dong, Chen Yang, Qing Gao, Yunbin Zhao, Yongsheng Liu, Xinwei Geng, Xiang Li, Demei Yan, Yanqing Li, Chenhao Huang, Dingwei Zhu, Junjie Ye, Boxuan Yue, Yingnan Fu, Mengzhe Lv, Zezeng Feng, Boshen Zhou, Bocheng Wang, Xuanjing Huang, Yu-Gang Jiang, Tao Gui, Qi Zhang, Yunke Zhang
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2602.19000
Pdf link: https://arxiv.org/pdf/2602.19000
Abstract The evolution of Large Language Models (LLMs) from passive text processors to autonomous agents has established planning as a core component of modern intelligence. However, achieving generalized planning remains elusive, hindered not only by the scarcity of high-quality interaction data but also by inherent conflicts across heterogeneous planning tasks. These challenges result in models that excel at isolated tasks yet struggle to generalize, while existing multi-task training attempts suffer from gradient interference. In this paper, we present MagicAgent, a series of foundation models specifically designed for generalized agent planning. We introduce a lightweight and scalable synthetic data framework that generates high-quality trajectories across diverse planning tasks, including hierarchical task decomposition, tool-augmented planning, multi-constraint scheduling, procedural logic orchestration, and long-horizon tool execution. To mitigate training conflicts, we propose a two-stage training paradigm comprising supervised fine-tuning followed by multi-objective reinforcement learning over both static datasets and dynamic environments. Empirical results demonstrate that MagicAgent-32B and MagicAgent-30B-A3B deliver superior performance, achieving accuracies of 75.1% on Worfbench, 55.9% on NaturalPlan, 57.5% on $\tau^2$-Bench, 86.9% on BFCL-v3, and 81.2% on ACEBench, as well as strong results on our in-house MagicEval benchmarks. These results substantially outperform existing sub-100B models and even surpass leading closed-source models.
中文摘要 大型语言模型（LLMs）从被动文本处理器向自主智能体的发展，使规划成为现代智能的核心组成部分。然而，实现通用规划仍然困难重重，不仅受制于高质量交互数据的稀缺，还受制于异构规划任务之间固有的冲突。这些挑战导致模型在孤立任务中表现出色，但难以泛化，而现有的多任务训练尝试则存在梯度干扰。本文介绍了MagicAgent，一系列专门为通用智能体规划设计的基础模型。我们引入了一个轻量级且可扩展的合成数据框架，能够在多样化的规划任务中生成高质量的轨迹，包括层级任务分解、工具增强规划、多约束调度、过程逻辑编排和长程工具执行。为减少训练冲突，我们提出了一个两阶段训练范式，包括监督微调，随后在静态数据集和动态环境上进行多目标强化学习。实证结果显示，MagicAgent-32B和MagicAgent-30B-A3B表现优异，在Worfbench上达到75.1%的准确率，在NaturalPlan上为55.9%，在$\tau^2$-Bench上为57.5%，在BFCL-v3上为86.9%，在ACEBench上为81.2%，并且在我们内部的MagicEval基准测试中取得了强劲的成绩。这些结果远超现有100B以下的模型，甚至超过领先的闭源模型。

Learning to Detect Language Model Training Data via Active Reconstruction

通过主动重建学习检测语言模型训练数据

Authors: Junjie Oscar Yin, John X. Morris, Vitaly Shmatikov, Sewon Min, Hannaneh Hajishirzi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.19020
Pdf link: https://arxiv.org/pdf/2602.19020
Abstract Detecting LLM training data is generally framed as a membership inference attack (MIA) problem. However, conventional MIAs operate passively on fixed model weights, using log-likelihoods or text generations. In this work, we introduce Active Data Reconstruction Attack (ADRA), a family of MIAs that actively induces a model to reconstruct a given text through training. We hypothesize that training data are more reconstructible than non-members, and the difference in their reconstructibility can be exploited for membership inference. Motivated by findings that reinforcement learning (RL) sharpens behaviors already encoded in weights, we leverage on-policy RL to actively elicit data reconstruction by finetuning a policy initialized from the target model. To effectively use RL for MIA, we design reconstruction metrics and contrastive rewards. The resulting algorithms, ADRA and its adaptive variant ADRA+, improve both reconstruction and detection given a pool of candidate data. Experiments show that our methods consistently outperform existing MIAs in detecting pre-training, post-training, and distillation data, with an average improvement of 10.7% over the previous runner-up. In particular, ADRA+ improves over Min-K%++ by 18.8% on BookMIA for pre-training detection and by 7.6% on AIME for post-training detection.
中文摘要 LLM训练数据的检测通常被表述为成员推理攻击（MIA）问题。然而，传统的MIA在固定的模型权重上被动地运作，使用对数似然或文本生成。在本研究中，我们提出了主动数据重建攻击（ADRA），这是一类通过训练主动诱导模型重建给定文本的MIA。我们假设训练数据比非成员数据更具可重建性，且二者可重建性的差异可用于成员推理。已有研究发现强化学习（RL）会强化已编码在权重中的行为，受此启发，我们利用同策略（on-policy）强化学习，通过微调由目标模型初始化的策略来主动引发数据重建。为了有效地将强化学习用于MIA，我们设计了重建指标和对比奖励。由此得到的算法ADRA及其自适应变体ADRA+，在给定候选数据池的情况下同时提升了重建和检测能力。实验显示，我们的方法在检测预训练、后训练和蒸馏数据方面持续优于现有MIA，相比此前的次优方法平均提升10.7%。特别是，ADRA+在BookMIA的预训练检测中比Min-K%++提升了18.8%，在AIME的后训练检测中提升了7.6%。

Human-to-Robot Interaction: Learning from Video Demonstration for Robot Imitation

人与机器人交互：通过视频演示学习机器人模仿

Authors: Thanh Nguyen Canh, Thanh-Tuan Tran, Haolan Zhang, Ziyan Gao, Nak Young Chong, Xiem HoangVan
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.19184
Pdf link: https://arxiv.org/pdf/2602.19184
Abstract Learning from Demonstration (LfD) offers a promising paradigm for robot skill acquisition. Recent approaches attempt to extract manipulation commands directly from video demonstrations, yet face two critical challenges: (1) general video captioning models prioritize global scene features over task-relevant objects, producing descriptions unsuitable for precise robotic execution, and (2) end-to-end architectures coupling visual understanding with policy learning require extensive paired datasets and struggle to generalize across objects and scenarios. To address these limitations, we propose a novel "Human-to-Robot" imitation learning pipeline that enables robots to acquire manipulation skills directly from unstructured video demonstrations, inspired by the human ability to learn by watching and imitating. Our key innovation is a modular framework that decouples the learning process into two distinct stages: (1) Video Understanding, which combines Temporal Shift Modules (TSM) with Vision-Language Models (VLMs) to extract actions and identify interacted objects, and (2) Robot Imitation, which employs TD3-based deep reinforcement learning to execute the demonstrated manipulations. We validated our approach in PyBullet simulation environments with a UR5e manipulator and in a real-world experiment with a UF850 manipulator across four fundamental actions: reach, pick, move, and put. For video understanding, our method achieves 89.97% action classification accuracy and BLEU-4 scores of 0.351 on standard objects and 0.265 on novel objects, representing improvements of 76.4% and 128.4% over the best baseline, respectively. For robot manipulation, our framework achieves an average success rate of 87.5% across all actions, with 100% success on reaching tasks and up to 90% on complex pick-and-place operations. The project website is available at this https URL.
中文摘要 从演示中学习（LfD）为机器人技能习得提供了一种有前景的范式。近期方法试图直接从视频演示中提取操作指令，但面临两个关键挑战：（1）通用视频描述模型优先考虑全局场景特征而非任务相关对象，产生不适合精确机器人执行的描述；（2）将视觉理解与策略学习耦合的端到端架构需要大量配对数据集，难以跨对象和场景泛化。为解决这些局限性，我们提出了一种新型“人到机器人”模仿学习流程，使机器人能够直接从非结构化视频演示中获得操作技能，灵感来自人类通过观看和模仿来学习的能力。我们的关键创新是一个模块化框架，将学习过程解耦为两个不同阶段：（1）视频理解，结合时间移位模块（TSM）与视觉语言模型（VLM）以提取动作并识别交互对象；（2）机器人模仿，采用基于TD3的深度强化学习来执行所演示的操作。我们在PyBullet仿真环境中使用UR5e机械臂，并在真实实验中使用UF850机械臂验证了该方法，涵盖四个基本动作：伸手、拾取、移动和放置。在视频理解方面，我们的方法达到了89.97%的动作分类准确率，BLEU-4分数在标准对象上为0.351，在新对象上为0.265，分别比最佳基线提升了76.4%和128.4%。在机器人操作方面，我们的框架在所有动作上的平均成功率为87.5%，其中伸手任务的成功率为100%，复杂的拾取与放置操作成功率最高达90%。项目网站可通过此 https URL 访问。

Adaptive Problem Generation via Symbolic Representations

通过符号表示实现自适应问题生成

Authors: Teresa Yeo, Myeongho Jeon, Dulaj Weerakoon, Rui Qiao, Alok Prakash, Armando Solar-Lezama, Archan Misra
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.19187
Pdf link: https://arxiv.org/pdf/2602.19187
Abstract We present a method for generating training data for reinforcement learning with verifiable rewards to improve small open-weights language models on mathematical tasks. Existing data generation approaches rely on open-loop pipelines and fixed modifications that do not adapt to the model's capabilities. Furthermore, they typically operate directly on word problems, limiting control over problem structure. To address this, we perform modifications in a symbolic problem space, representing each problem as a set of symbolic variables and constraints (e.g., via algebraic frameworks such as SymPy or SMT formulations). This representation enables precise control over problem structure, automatic generation of ground-truth solutions, and decouples mathematical reasoning from linguistic realization. We also show that this results in more diverse generations. To adapt the problem difficulty to the model, we introduce a closed-loop framework that learns modification strategies through prompt optimization in symbolic space. Experimental results demonstrate that both adaptive problem generation and symbolic representation modifications contribute to improving the model's math solving ability.
中文摘要 我们提出了一种生成强化学习训练数据的方法，并带有可验证的奖励，以改进数学任务中的小型开放权重语言模型。现有的数据生成方法依赖于开环流水线和固定修改，这些修改无法适应模型的能力。此外，它们通常直接处理文字题，限制了对问题结构的控制。为此，我们在符号问题空间中进行修改，将每个问题表示为一组符号变量和约束（例如通过SymPy或SMT等代数框架）。这种表示方式使得对问题结构的精确控制、自动生成真实的解决方案，并将数学推理与语言实现脱钩。我们还表明，这导致了更多多样化的世代。为了适应模型，我们引入了一个闭环框架，通过符号空间中的提示优化学习修改策略。实验结果表明，自适应问题生成和符号表示修改都有助于提升模型的数学解决能力。
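The idea of editing problems in a symbolic space, rather than editing word problems directly, can be sketched with plain Python. Below, a problem is a small record of symbolic parameters; the ground-truth answer and the natural-language rendering are both derived from that record, so changing the parameters never leaves the answer out of sync with the text. The `LinearProblem` class, the rendering template, and the threshold-based `adapt_level` rule are hypothetical; the abstract mentions SymPy or SMT formulations and prompt optimization, neither of which is used here.

```python
from dataclasses import dataclass
from fractions import Fraction

@dataclass
class LinearProblem:
    """Symbolic form of a*x + b = c (illustrative only)."""
    a: int
    b: int
    c: int

    def solve(self):
        # Ground truth is computed from the symbols, not parsed from text.
        return Fraction(self.c - self.b, self.a)

    def render(self):
        # Linguistic realization is a separate step from the mathematics.
        return (f"A number is multiplied by {self.a}, then {self.b} is added, "
                f"giving {self.c}. What is the number?")

def make_problem(a, b, x):
    """Build a problem with a chosen integer solution x."""
    return LinearProblem(a=a, b=b, c=a * x + b)

def adapt_level(level, solve_rate, target=0.5):
    """Toy closed-loop rule: go harder if the model solves too many, else easier."""
    return level + 1 if solve_rate > target else max(1, level - 1)

p = make_problem(a=3, b=4, x=5)
print(p.render(), "->", p.solve())
```

A difficulty level could then map to larger coefficients or additional constraints; the point of the sketch is only that the answer and the wording are both generated from the same symbolic record.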

How to Allocate, How to Learn? Dynamic Rollout Allocation and Advantage Modulation for Policy Optimization

如何分配，如何学习？面向策略优化的动态Rollout分配与优势调制

Authors: Yangyi Fang, Jiaye Lin, Xiaoliang Fu, Cong Qin, Haolin Shi, Chaowen Hu, Lu Pan, Ke Zeng, Xunliang Cai
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.19208
Pdf link: https://arxiv.org/pdf/2602.19208
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for Large Language Model (LLM) reasoning, yet current methods face key challenges in resource allocation and policy optimization dynamics: (i) uniform rollout allocation ignores gradient variance heterogeneity across problems, and (ii) the softmax policy structure causes gradient attenuation for high-confidence correct actions, while excessive gradient updates may destabilize training. Therefore, we propose DynaMO, a theoretically grounded dual-pronged optimization framework. At the sequence level, we prove that uniform allocation is suboptimal and derive variance-minimizing allocation from first principles, establishing Bernoulli variance as a computable proxy for gradient informativeness. At the token level, we develop gradient-aware advantage modulation grounded in theoretical analysis of gradient magnitude bounds. Our framework compensates for gradient attenuation of high-confidence correct actions while utilizing entropy changes as computable indicators to stabilize excessive update magnitudes. Extensive experiments conducted on a diverse range of mathematical reasoning benchmarks demonstrate consistent improvements over strong RLVR baselines. Our implementation is available at this https URL.
中文摘要 带可验证奖励的强化学习（RLVR）已被证明对大型语言模型（LLM）推理有效，但现有方法在资源分配和策略优化动态方面面临关键挑战：（i）均匀的rollout分配忽略了不同问题间梯度方差的异质性，（ii）softmax策略结构导致高置信度正确动作出现梯度衰减，而过大的梯度更新可能破坏训练稳定性。因此，我们提出了DynaMO，一个有理论依据的双管齐下的优化框架。在序列层面，我们证明均匀分配是次优的，并从第一性原理推导出方差最小化分配，确立伯努利方差作为梯度信息量的可计算代理。在词元（token）层面，我们基于梯度幅度界限的理论分析开发了梯度感知的优势调制。我们的框架在补偿高置信度正确动作梯度衰减的同时，利用熵变化作为可计算指标来抑制过大的更新幅度。在多种数学推理基准上进行的大量实验显示，该方法相对于强RLVR基线取得了一致的提升。我们的实现可在此 https URL 获取。
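The abstract states that uniform rollout allocation is suboptimal and that Bernoulli variance serves as a proxy for how informative a problem is. The sketch below illustrates that intuition with a textbook Neyman-style rule: give each problem a share of the rollout budget proportional to the standard deviation sqrt(p(1-p)) of its pass/fail outcome, where p is an estimated success rate. This is a swapped-in classical rule chosen for illustration; the allocation actually derived in the paper may differ, and `allocate_rollouts` and its parameters are hypothetical.

```python
import math

def allocate_rollouts(success_rates, budget, min_per_problem=1):
    """Split an integer rollout budget in proportion to Bernoulli std sqrt(p(1-p))."""
    n = len(success_rates)
    spare = budget - min_per_problem * n
    if spare < 0:
        raise ValueError("budget too small for the per-problem minimum")
    stds = [math.sqrt(p * (1.0 - p)) for p in success_rates]
    total = sum(stds)
    if total == 0.0:
        shares = [spare / n] * n  # nothing is informative: fall back to uniform
    else:
        shares = [spare * s / total for s in stds]
    alloc = [min_per_problem + math.floor(s) for s in shares]
    # Hand out the rounding leftovers to the largest fractional parts.
    leftover = budget - sum(alloc)
    order = sorted(range(n), key=lambda i: shares[i] - math.floor(shares[i]),
                   reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc

# Problems that are always solved (p=1) or never solved (p=0) get only the minimum.
alloc = allocate_rollouts([0.5, 0.0, 0.9, 1.0], budget=16)
print(alloc)
```

Problems with p near 0.5 receive the most rollouts because their outcomes vary the most, while problems the policy always or never solves contribute no variance and keep only the minimum.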

Characterizing MARL for Energy Control: A Multi-KPI Benchmark on the CityLearn Environment

MARL用于能源控制的特性描述：CityLearn环境的多关键绩效指标基准

Authors: Aymen Khouja, Imen Jendoubi, Oumayma Mahjoub, Oussama Mahfoudhi, Claude Formanek, Siddarth Singh, Ruan De Kock
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2602.19223
Pdf link: https://arxiv.org/pdf/2602.19223
Abstract The optimization of urban energy systems is crucial for the advancement of sustainable and resilient smart cities, which are becoming increasingly complex with multiple decision-making units. To address scalability and coordination concerns, Multi-Agent Reinforcement Learning (MARL) is a promising solution. This paper addresses the imperative need for comprehensive and reliable benchmarking of MARL algorithms on energy management tasks. CityLearn is used as a case study environment because it realistically simulates urban energy systems, incorporates multiple storage systems, and utilizes renewable energy sources. By doing so, our work sets a new standard for evaluation, conducting a comparative study across multiple key performance indicators (KPIs). This approach illuminates the key strengths and weaknesses of various algorithms, moving beyond traditional KPI averaging which often masks critical insights. Our experiments utilize widely accepted baselines such as Proximal Policy Optimization (PPO) and Soft Actor Critic (SAC), and encompass diverse training schemes including Decentralized Training with Decentralized Execution (DTDE) and Centralized Training with Decentralized Execution (CTDE) approaches and different neural network architectures. Our work also proposes novel KPIs that tackle real world implementation challenges such as individual building contribution and battery storage lifetime. Our findings show that DTDE consistently outperforms CTDE in both average and worst-case performance. Additionally, temporal dependency learning improved control on memory dependent KPIs such as ramping and battery usage, contributing to more sustainable battery operation. Results also reveal robustness to agent or resource removal, highlighting both the resilience and decentralizability of the learned policies.
中文摘要 城市能源系统的优化对于可持续且具韧性的智慧城市的发展至关重要，而智慧城市正变得越来越复杂，需多重决策单元。为了解决可扩展性和协调性问题，多智能体强化学习（MARL）是一个有前景的解决方案。本文讨论了对MARL算法在能源管理任务中进行全面且可靠的基准测试的迫切需求。CityLearn 被用作案例研究环境，因为它真实模拟了城市能源系统，集成了多种储能系统，并利用了可再生能源。通过这样做，我们的工作树立了评估的新标准，开展了跨多个关键绩效指标（KPI）的比较研究。这种方法揭示了各种算法的关键优势和劣势，超越了传统KPI平均法，因为传统指标往往掩盖了关键洞察。我们的实验采用了广泛认可的基线，如近端策略优化（PPO）和软演员批评（SAC），涵盖了多种训练方案，包括去中心化执行训练（DTDE）、集中式训练结合去中心化执行（CTDE）方法以及不同的神经网络架构。我们的工作还提出了新的关键绩效指标（KPI），以应对实际实施挑战，如单个建筑的贡献和电池储能寿命。我们的研究结果显示，DTDE在平均和最坏情况下的表现上始终优于CTDE。此外，时间依赖学习改善了对内存依赖KPI的控制，如加速和电池使用，有助于实现更可持续的电池运行。结果还显示了代理或资源移除的鲁棒性，凸显了所学策略的韧性和去中心化性。
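The abstract argues that averaging KPIs can hide important differences and mentions ramping as a memory-dependent KPI. The sketch below computes a simple ramping measure (the sum of absolute step-to-step changes in net electricity consumption) per building and next to the district average. This definition is an assumption for illustration; CityLearn's own KPI is typically normalized against a baseline, and the building profiles here are made up.

```python
def ramping(net_consumption):
    """Sum of absolute changes between consecutive time steps."""
    return sum(abs(b - a) for a, b in zip(net_consumption, net_consumption[1:]))

def kpi_report(profiles):
    """Per-building ramping plus the average that a single-number summary would show."""
    per_building = {name: ramping(series) for name, series in profiles.items()}
    average = sum(per_building.values()) / len(per_building)
    return per_building, average

# Hypothetical hourly net consumption (kWh) for two buildings.
profiles = {
    "building_A": [1.0, 1.0, 1.0, 1.0],  # perfectly flat
    "building_B": [0.0, 4.0, 0.0, 4.0],  # strongly oscillating
}
per_building, average = kpi_report(profiles)
print(per_building, average)
```

Here the average of 6.0 describes neither building, which is the kind of masking that per-KPI, per-building reporting is meant to expose.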

Robust Exploration in Directed Controller Synthesis via Reinforcement Learning with Soft Mixture-of-Experts

通过软专家混合强化学习实现定向控制器合成中的鲁棒探索

Authors: Toshihide Ubukata, Zhiyao Wang, Enhong Mu, Jialong Li, Kenji Tei
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.19244
Pdf link: https://arxiv.org/pdf/2602.19244
Abstract On-the-fly Directed Controller Synthesis (OTF-DCS) mitigates state-space explosion by incrementally exploring the system and relies critically on an exploration policy to guide search efficiently. Recent reinforcement learning (RL) approaches learn such policies and achieve promising zero-shot generalization from small training instances to larger unseen ones. However, a fundamental limitation is anisotropic generalization, where an RL policy exhibits strong performance only in a specific region of the domain-parameter space while remaining fragile elsewhere due to training stochasticity and trajectory-dependent bias. To address this, we propose a Soft Mixture-of-Experts framework that combines multiple RL experts via a prior-confidence gating mechanism and treats these anisotropic behaviors as complementary specializations. The evaluation on the Air Traffic benchmark shows that Soft-MoE substantially expands the solvable parameter space and improves robustness compared to any single expert.
中文摘要 即时定向控制器合成（OTF-DCS）通过逐步探索系统来缓解状态空间爆炸，并关键依赖探索策略以高效指导搜索。近期的强化学习（RL）方法学习此类策略，并实现从小型训练实例到更大未见实例的有希望的零样本推广。然而，一个根本的限制是各向异性推广，即强化学习策略仅在域参数空间的特定区域表现出强劲表现，而在其他区域因训练随机性和轨迹依赖偏差而仍然脆弱。为此，我们提出了一个软性专家混合框架，通过先验置信门控机制将多位强化学习专家结合起来，并将这些各向异性行为视为互补的专长。对空中交通基准的评估显示，Soft-MoE显著扩展了可解参数空间，并提升了与任何单一专家相比的鲁棒性。
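A soft mixture of experts, in its generic form, blends the outputs of several policies with gate weights instead of picking a single expert. The sketch below turns per-expert confidence values into softmax gate weights and mixes the experts' action scores. How the paper's prior-confidence gating actually computes those confidences is not described in the abstract, so the `prior_confidence` inputs, the temperature, and the function names are all assumptions.

```python
import math

def softmax(values, temperature=1.0):
    m = max(values)
    exps = [math.exp((v - m) / temperature) for v in values]
    z = sum(exps)
    return [e / z for e in exps]

def soft_moe_scores(expert_scores, prior_confidence, temperature=1.0):
    """Gate-weighted mix of each expert's scores over the same action set."""
    gates = softmax(prior_confidence, temperature)
    n_actions = len(expert_scores[0])
    return [
        sum(g * scores[a] for g, scores in zip(gates, expert_scores))
        for a in range(n_actions)
    ]

# Two hypothetical experts that disagree about which of two actions to expand.
expert_scores = [[1.0, 0.0], [0.0, 1.0]]
balanced = soft_moe_scores(expert_scores, prior_confidence=[0.0, 0.0])
skewed = soft_moe_scores(expert_scores, prior_confidence=[5.0, 0.0])
print(balanced, skewed)
```

With equal confidence the mix is a plain average; as one expert's confidence grows, the mixture approaches that expert's preference, so experts that are strong in different parameter regions can complement each other.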

DGPO: RL-Steered Graph Diffusion for Neural Architecture Generation

DGPO：用于神经结构生成的强化学习引导图扩散

Authors: Aleksei Liuliakov, Luca Hermes, Barbara Hammer
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/2602.19261
Pdf link: https://arxiv.org/pdf/2602.19261
Abstract Reinforcement learning fine-tuning has proven effective for steering generative diffusion models toward desired properties in image and molecular domains. Graph diffusion models have similarly been applied to combinatorial structure generation, including neural architecture search (NAS). However, neural architectures are directed acyclic graphs (DAGs) where edge direction encodes functional semantics such as data flow, information that existing graph diffusion methods, designed for undirected structures, discard. We propose Directed Graph Policy Optimization (DGPO), which extends reinforcement learning fine-tuning of discrete graph diffusion models to DAGs via topological node ordering and positional encoding. Validated on NAS-Bench-101 and NAS-Bench-201, DGPO matches the benchmark optimum on all three NAS-Bench-201 tasks (91.61%, 73.49%, 46.77%). The central finding is that the model learns transferable structural priors: pretrained on only 7% of the search space, it generates near-oracle architectures after fine-tuning, within 0.32 percentage points of the full-data model and extrapolating 7.3 percentage points beyond its training ceiling. Bidirectional control experiments confirm genuine reward-driven steering, with inverse optimization reaching near random-chance accuracy (9.5%). These results demonstrate that reinforcement learning-steered discrete diffusion, once extended to handle directionality, provides a controllable generative framework for directed combinatorial structures.
中文摘要 强化学习的微调已被证明能有效引导生成扩散模型在图像和分子领域达到预期特性。图扩散模型同样被应用于组合结构生成，包括神经结构搜索（NAS）。然而，神经架构是有向无环图（DAGs），其中边方向编码功能语义，如数据流信息，而现有为无向结构设计的图扩散方法会舍弃这些信息。我们提出了有向图策略优化（DGPO），通过拓扑节点排序和位置编码，将离散图扩散模型的强化学习微调扩展到DAGs。在NAS-Bench-101和NAS-Bench-201上验证后，DGPO在三个NAS-Bench-201任务上均达到基准最优（91.61%，73.49%，46.77%）。核心发现是该模型学习可转移的结构先验：仅对搜索空间的7%进行预训练，经过微调后生成近乎预言机的架构，距离完整数据模型不到0.32个百分点，并且外推超出训练上限7.3个百分点。双向控制实验证实了真正的奖励驱动引导，逆向优化的准确率接近随机率（9.5%）。这些结果表明，一旦扩展到处理方向性，强化学习引导离散扩散，就能为有向组合结构提供可控的生成框架。
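The abstract says DGPO handles edge direction through topological node ordering and positional encoding. The sketch below shows only the generic first ingredient: Kahn's algorithm produces an ordering in which every edge points from an earlier node to a later one, and each node's rank in that ordering can then serve as a position index. How DGPO turns ranks into encodings is not stated in the abstract, so the `positions` mapping is an assumed, simplest-possible stand-in.

```python
from collections import deque

def topological_order(num_nodes, edges):
    """Kahn's algorithm: repeatedly emit nodes with no remaining incoming edges."""
    indegree = [0] * num_nodes
    successors = [[] for _ in range(num_nodes)]
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1
    queue = deque(i for i in range(num_nodes) if indegree[i] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    if len(order) != num_nodes:
        raise ValueError("graph has a cycle, so it is not a DAG")
    return order

# Hypothetical 4-node cell: node 2 is the input, node 3 the output.
edges = [(2, 0), (0, 1), (2, 1), (1, 3)]
order = topological_order(4, edges)
positions = {node: rank for rank, node in enumerate(order)}  # assumed position index
print(order, positions)
```

Because every edge goes from a lower rank to a higher one, a model that sees these ranks can tell an edge's source from its target even if its graph representation is otherwise symmetric.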

ALPACA: A Reinforcement Learning Environment for Medication Repurposing and Treatment Optimization in Alzheimer's Disease

ALPACA：阿尔茨海默病药物再利用与治疗优化的强化学习环境

Authors: Nolan Brady, Tom Yeh
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.19298
Pdf link: https://arxiv.org/pdf/2602.19298
Abstract Evaluating personalized, sequential treatment strategies for Alzheimer's disease (AD) using clinical trials is often impractical due to long disease horizons and substantial inter-patient heterogeneity. To address these constraints, we present the Alzheimer's Learning Platform for Adaptive Care Agents (ALPACA), an open-source, Gym-compatible reinforcement learning (RL) environment for systematically exploring personalized treatment strategies using existing therapies. ALPACA is powered by the Continuous Action-conditioned State Transitions (CAST) model trained on longitudinal trajectories from the Alzheimer's Disease Neuroimaging Initiative (ADNI), enabling medication-conditioned simulation of disease progression under alternative treatment decisions. We show that CAST autoregressively generates realistic medication-conditioned trajectories and that RL policies trained in ALPACA outperform no-treatment and behavior-cloned clinician baselines on memory-related outcomes. Interpretability analyses further indicated that the learned policies relied on clinically meaningful patient features when selecting actions. Overall, ALPACA provides a reusable in silico testbed for studying individualized sequential treatment decision-making for AD.
中文摘要 由于疾病周期长且患者间异质性较大，利用临床试验评估个性化、序贯的阿尔茨海默病（AD）治疗策略往往不切实际。为解决这些限制，我们推出了阿尔茨海默病适应性护理智能体学习平台（ALPACA），这是一个开源、兼容Gym接口的强化学习（RL）环境，用于系统性探索利用现有疗法的个性化治疗策略。ALPACA由连续动作条件状态转移（CAST）模型驱动，该模型基于阿尔茨海默病神经影像倡议（ADNI）的纵向轨迹训练，能够在不同治疗决策下对疾病进展进行药物条件模拟。我们表明，CAST能以自回归方式生成逼真的药物条件轨迹，且在ALPACA中训练的强化学习策略在记忆相关结局上优于无治疗基线和行为克隆的临床医生基线。可解释性分析进一步表明，所学策略在选择动作时依赖于具有临床意义的患者特征。总体而言，ALPACA为研究AD的个体化序贯治疗决策提供了可重复使用的计算机模拟（in silico）测试平台。

TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics

TOPReward：将词元概率作为机器人领域隐藏的零样本奖励

Authors: Shirui Chen, Cole Harrison, Ying-Chun Lee, Angela Jin Yang, Zhongzheng Ren, Lillian J. Ratliff, Jiafei Duan, Dieter Fox, Ranjay Krishna
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.19313
Pdf link: https://arxiv.org/pdf/2602.19313
Abstract While Vision-Language-Action (VLA) models have seen rapid progress in pretraining, their advancement in Reinforcement Learning (RL) remains hampered by low sample efficiency and sparse rewards in real-world settings. Developing generalizable process reward models is essential for providing the fine-grained feedback necessary to bridge this gap, yet existing temporal value functions often fail to generalize beyond their training domains. We introduce TOPReward, a novel, probabilistically grounded temporal value function that leverages the latent world knowledge of pretrained video Vision-Language Models (VLMs) to estimate robotic task progress. Unlike prior methods that prompt VLMs to directly output progress values, which are prone to numerical misrepresentation, TOPReward extracts task progress directly from the VLM's internal token logits. In zero-shot evaluations across 130+ distinct real-world tasks and multiple robot platforms (e.g., Franka, YAM, SO-100/101), TOPReward achieves 0.947 mean Value-Order Correlation (VOC) on Qwen3-VL, dramatically outperforming the state-of-the-art GVL baseline which achieves near-zero correlation on the same open-source model. We further demonstrate that TOPReward serves as a versatile tool for downstream applications, including success detection and reward-aligned behavior cloning.
中文摘要 尽管视觉-语言-动作（VLA）模型在预训练方面取得了快速进展，但其在强化学习（RL）方面的进展仍受限于现实环境中样本效率低下和奖励稀疏。开发可泛化的过程奖励模型对于提供弥合这一差距所需的细粒度反馈至关重要，然而现有的时序价值函数往往无法泛化到训练领域之外。我们提出TOPReward，一种新颖的、具有概率基础的时序价值函数，利用预训练视频视觉语言模型（VLM）的潜在世界知识来估计机器人任务的进度。以往方法提示VLM直接输出进度值，容易出现数值表示错误；与之不同，TOPReward直接从VLM内部的词元logits中提取任务进度。在130多个不同的现实世界任务和多个机器人平台（如Franka、YAM、SO-100/101）的零样本评估中，TOPReward在Qwen3-VL上实现了0.947的平均价值-顺序相关性（VOC），远远超过了在同一开源模型上相关性接近零的最先进GVL基线。我们还进一步证明，TOPReward可作为下游应用的多功能工具，包括成功检测和奖励对齐的行为克隆。
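
The abstract reports a mean Value-Order Correlation (VOC) but does not define it. A minimal sketch, assuming VOC is a rank correlation between per-frame predicted progress and the chronological frame order (that definition is an assumption here, as are all function names):

```python
def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    indexed = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(indexed):
        j = i
        while j + 1 < len(indexed) and values[indexed[j + 1]] == values[indexed[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[indexed[k]] = mean_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def value_order_correlation(predicted_progress):
    """Spearman-style correlation between predicted progress and frame order."""
    if len(predicted_progress) < 2:
        return 0.0
    time_order = list(range(len(predicted_progress)))
    return pearson(ranks(predicted_progress), ranks(time_order))
```

Under this reading, a value function whose estimates rise monotonically over a successful episode scores close to 1, while one that is unrelated to time scores near 0.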

Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering

学习在个性化问答中多步检索个人语境的推理

Authors: Maryam Amirizaniani, Alireza Salemi, Hamed Zamani
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2602.19317
Pdf link: https://arxiv.org/pdf/2602.19317
Abstract Personalization in Question Answering (QA) requires answers that are both accurate and aligned with users' background, preferences, and historical context. Existing state-of-the-art methods primarily rely on retrieval-augmented generation (RAG) solutions that construct personal context by retrieving relevant items from the user's profile. Existing methods use the user's query directly to retrieve personal documents, and such strategies often lead to surface-level personalization. We propose PR2 (Personalized Retrieval-Augmented Reasoning), a reinforcement learning framework that integrates reasoning and retrieval from personal context for personalization. PR2 learns adaptive retrieval-reasoning policies, determining when to retrieve, what evidence to retrieve from user profiles, and how to incorporate it into intermediate reasoning steps. By optimizing multi-turn reasoning trajectories under a personalized reward function, the framework reinforces reasoning paths that better align with user-specific preferences and contextual signals reflected by the reward model. Extensive experiments on the LaMP-QA benchmark using three LLMs show that PR2 consistently outperforms strong baselines, achieving an average relative improvement of 8.8%-12% in personalized QA.
中文摘要 问答（QA）中的个性化要求回答既准确又符合用户背景、偏好和历史背景。现有的先进方法主要依赖于检索增强生成（RAG）解决方案，通过从用户档案中检索相关项目来构建个人语境。现有方法直接利用用户查询来检索个人文档，这类策略往往导致表面个性化。我们提出了PR2（个性化检索增强推理），这是一种强化学习框架，整合推理和从个人语境中检索以实现个性化。PR2 学习自适应检索推理策略，决定何时检索、从用户配置文件检索哪些证据，以及如何将其纳入中间推理步骤。通过在个性化奖励函数下优化多回合推理轨迹，该框架强化了更符合用户特定偏好和奖励模型所反映情境信号的推理路径。使用三款大型语言模型（LLM）在LaMP-QA基准测试上的广泛实验显示，PR2始终优于强基线，个性化QA的平均相对提升为8.8%-12%。

Soft Sequence Policy Optimization: Bridging GMPO and SAPO

软序列策略优化：连接GMPO与SAPO

Authors: Svetlana Glazyrina, Maksim Kryzhanovskiy, Roman Ischenko
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.19327
Pdf link: https://arxiv.org/pdf/2602.19327
Abstract A significant portion of recent research on Large Language Model (LLM) alignment focuses on developing new policy optimization methods based on Group Relative Policy Optimization (GRPO). Two prominent directions have emerged: (i) a shift toward sequence-level importance sampling weights that better align with the sequence-level rewards used in many tasks, and (ii) alternatives to PPO-style clipping that aim to avoid the associated loss of training signal and entropy collapse. Recent work, such as Soft Adaptive Policy Optimization (SAPO), reformulates the Scopic objective within the GRPO framework and achieves both sequence coherence and token adaptivity. Geometric-Mean Policy Optimization (GMPO) leverages token-wise ratio clipping within sequence importance sampling weights. Building on these ideas, this work proposes a new objective that promotes effective policy exploration while maintaining training stability. Specifically, we introduce Soft Sequence Policy Optimization, an off-policy reinforcement learning objective that incorporates soft gating functions over token-level probability ratios within sequence-level importance weights.
中文摘要 近期关于大型语言模型（LLM）对齐的研究中，很大一部分集中在基于群相对策略优化（Group Relative Policy Optimization，GRPO）的新策略优化方法开发。出现了两个显著方向：（一）转向序列层级重要性抽样权重，以更好地符合许多任务中使用的序列级奖励;（二）旨在避免训练信号丢失和熵坍缩的PPO式剪裁替代方案。近期工作如软自适应策略优化（SAPO）在GRPO框架内重新表述了Scopic目标，实现了序列一致性和令牌适应性。几何均值策略优化（GMPO）利用序列重要性抽样权重内的按标记比率裁剪。基于这些理念，本研究提出了一个新目标，旨在促进有效的政策探索，同时保持培训的稳定性。具体来说，我们介绍了软序列策略优化，这是一种非策略强化学习目标，在序列级重要性权重内，结合了对代币级概率比的软门控函数。

LLMs Can Learn to Reason Via Off-Policy RL

LLM可以通过离策略强化学习学会推理

Authors: Daniel Ritter, Owen Oertell, Bradley Guo, Jonathan Chang, Kianté Brantley, Wen Sun
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.19362
Pdf link: https://arxiv.org/pdf/2602.19362
Abstract Reinforcement learning (RL) approaches for Large Language Models (LLMs) frequently use on-policy algorithms, such as PPO or GRPO. However, policy lag from distributed training architectures and differences between the training and inference policies break this assumption, making the data off-policy by design. To rectify this, prior work has focused on making this off-policy data appear more on-policy, either via importance sampling (IS), or by more closely aligning the training and inference policies by explicitly modifying the inference engine. In this work, we embrace off-policyness and propose a novel off-policy RL algorithm that does not require these modifications: Optimal Advantage-based Policy Optimization with Lagged Inference policy (OAPL). We show that OAPL outperforms GRPO with importance sampling on competition math benchmarks, and can match the performance of a publicly available coding model, DeepCoder, on LiveCodeBench, while using 3x fewer generations during training. We further empirically demonstrate that models trained via OAPL have improved test time scaling under the Pass@k metric. OAPL allows for efficient, effective post-training even with lags of more than 400 gradient steps between the training and inference policies, 100x more off-policy than prior approaches.
中文摘要 大型语言模型（LLM）的强化学习（RL）方法通常使用在策略（on-policy）算法，如PPO或GRPO。然而，分布式训练架构带来的策略滞后以及训练策略与推理策略之间的差异打破了这一假设，使数据在设计上就是离策略（off-policy）的。为此，先前的工作重点是让这些离策略数据看起来更接近在策略数据，要么通过重要性采样（IS），要么通过显式修改推理引擎使训练策略与推理策略更加一致。在本研究中，我们接受离策略性，并提出了一种无需这些修改的新颖离策略强化学习算法：基于最优优势的滞后推理策略优化（OAPL）。我们证明，OAPL在竞赛数学基准上优于采用重要性采样的GRPO，并且能在LiveCodeBench上与公开的编码模型DeepCoder性能相当，同时训练期间使用的生成次数减少到三分之一。我们进一步通过实验证明，经OAPL训练的模型在Pass@k指标下具有更好的测试时扩展性。即使训练策略与推理策略之间的滞后超过400个梯度步，OAPL仍能实现高效且有效的后训练，其离策略程度是以往方法的100倍。
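
The abstract evaluates test-time scaling with Pass@k but does not say how it is computed. The sketch below uses the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. Whether the paper uses exactly this estimator is an assumption.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples is correct.

    n: total samples drawn per problem
    c: number of those samples that are correct
    k: budget of samples considered at test time (1 <= k <= n)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n = 4 samples and c = 2 correct, `pass_at_k(4, 2, 2)` is 1 - 1/6, since only one of the six possible pairs contains no correct sample.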

Stable Deep Reinforcement Learning via Isotropic Gaussian Representations

通过各向同性高斯表示的稳定深度强化学习

Authors: Ali Saheb, Johan Obando-Ceron, Aaron Courville, Pouya Bashivan, Pablo Samuel Castro
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.19373
Pdf link: https://arxiv.org/pdf/2602.19373
Abstract Deep reinforcement learning systems often suffer from unstable training dynamics due to non-stationarity, where learning objectives and data distributions evolve over time. We show that under non-stationary targets, isotropic Gaussian embeddings are provably advantageous. In particular, they induce stable tracking of time-varying targets for linear readouts, achieve maximal entropy under a fixed variance budget, and encourage a balanced use of all representational dimensions--all of which enable agents to be more adaptive and stable. Building on this insight, we propose the use of Sketched Isotropic Gaussian Regularization for shaping representations toward an isotropic Gaussian distribution during training. We demonstrate empirically, over a variety of domains, that this simple and computationally inexpensive method improves performance under non-stationarity while reducing representation collapse, neuron dormancy, and training instability.
中文摘要 深度强化学习系统常因非平稳性而导致训练动态不稳定，学习目标和数据分布随时间演变。我们证明在非平稳靶下，各向同性高斯嵌入是可证明的优势。特别是，它们能够稳定跟踪线性读出时变目标，在固定方差预算下实现最大熵，并鼓励所有表示维度的平衡使用——这些都使智能体能够更具适应性和稳定性。基于这一见解，我们提出在训练过程中使用草图各向同性高斯正则化来塑造表示，朝向各向同性高斯分布。我们通过实证在多个领域证明，这一简单且计算成本低的方法在非平稳性下提升性能，同时减少表征崩溃、神经元休眠和训练不稳定性。
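
The claim that isotropic Gaussians "achieve maximal entropy under a fixed variance budget" can be checked numerically for the diagonal-covariance case. This is a small sketch, not the paper's proof: it uses the standard differential entropy of a Gaussian with independent dimensions, and the two variance vectors are made-up examples with the same total variance.

```python
import math


def gaussian_entropy(variances):
    """Differential entropy (in nats) of a Gaussian with diagonal covariance."""
    return 0.5 * sum(math.log(2 * math.pi * math.e * v) for v in variances)


# Two hypothetical 4-dimensional embeddings with the same total variance (4.0).
isotropic = [1.0, 1.0, 1.0, 1.0]
anisotropic = [2.5, 1.0, 0.4, 0.1]

entropy_isotropic = gaussian_entropy(isotropic)
entropy_anisotropic = gaussian_entropy(anisotropic)
```

The entropy depends on the sum of log-variances, so by the AM-GM inequality it is largest when the fixed budget is spread evenly; here the anisotropic case is lower because the product of its variances is 0.1 rather than 1.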

IR$^3$: Contrastive Inverse Reinforcement Learning for Interpretable Detection and Mitigation of Reward Hacking

IR$^3$：对比逆强化学习用于可解释的奖励黑客检测与缓解

Authors: Mohammad Beigi, Ming Jin, Junshan Zhang, Jiaxin Zhang, Qifan Wang, Lifu Huang
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.19416
Pdf link: https://arxiv.org/pdf/2602.19416
Abstract Reinforcement Learning from Human Feedback (RLHF) enables powerful LLM alignment but can introduce reward hacking - models exploit spurious correlations in proxy rewards without genuine alignment. Compounding this, the objectives internalized during RLHF remain opaque, making hacking behaviors difficult to detect or correct. We introduce IR3 (Interpretable Reward Reconstruction and Rectification), a framework that reverse-engineers, interprets, and surgically repairs the implicit objectives driving RLHF-tuned models. We propose Contrastive Inverse Reinforcement Learning (C-IRL), which reconstructs the implicit reward function by contrasting paired responses from post-alignment and baseline policies to explain behavioral shifts during RLHF. We then decompose the reconstructed reward via sparse autoencoders into interpretable features, enabling identification of hacking signatures through contribution analysis. Finally, we propose mitigation strategies - clean reward optimization, adversarial shaping, constrained optimization, and feature-guided distillation - that target problematic features while preserving beneficial alignment. Experiments across multiple reward model configurations show that IR3 achieves 0.89 correlation with ground-truth rewards, identifies hacking features with over 90% precision, and significantly reduces hacking behaviors while maintaining capabilities within 3% of the original model.
中文摘要 人类反馈强化学习（RLHF）实现了强大的大型语言模型对齐，但也可能引入奖励黑客——模型利用代理奖励中的虚假相关性，而没有真正的对齐。更糟的是，RLHF期间内化的目标保持不透明，使得黑客行为难以被检测或纠正。我们引入了IR3（可解释性奖励重建与整流），这是一个逆向工程、解释并外科修复驱动RLHF调优模型的隐性目标的框架。我们提出了对比逆强化学习（C-IRL），通过对比后对齐策略和基线策略的配对反应，重建隐性奖励函数，以解释RLHF期间的行为转变。然后，我们通过稀疏自编码器将重建后的奖励分解为可解释的特征，从而通过贡献分析识别黑客签名。最后，我们提出了缓解策略——干净的奖励优化、对抗性塑造、受限优化和特征引导的提炼——既针对问题特征，又保持有利的对齐。在多种奖励模型配置中的实验表明，IR3与真实奖励的相关性达到0.89，识别黑客特征的准确率超过90%，并且在保持原始模型3%以内能力的同时，显著减少了黑客行为。

RAmmStein: Regime Adaptation in Mean-reverting Markets with Stein Thresholds -- Optimal Impulse Control in Concentrated AMMs

RAmmStein：采用Stein阈值的均值回归市场状态适应，集中流动性AMM中的最优脉冲控制

Authors: Pranay Anchuri
Subjects: Machine Learning (cs.LG); Trading and Market Microstructure (q-fin.TR)
Arxiv link: https://arxiv.org/abs/2602.19419
Pdf link: https://arxiv.org/pdf/2602.19419
Abstract Concentrated liquidity provision in decentralized exchanges presents a fundamental Impulse Control problem. Liquidity Providers (LPs) face a non-trivial trade-off between maximizing fee accrual through tight price-range concentration and minimizing the friction costs of rebalancing, including gas fees and swap slippage. Existing methods typically employ heuristic or threshold strategies that fail to account for market dynamics. This paper formulates liquidity management as an optimal control problem and derives the corresponding Hamilton-Jacobi-Bellman quasi-variational inequality (HJB-QVI). We present an approximate solution RAmmStein, a Deep Reinforcement Learning method that incorporates the mean-reversion speed (theta) of an Ornstein-Uhlenbeck process among other features as input to the model. We demonstrate that the agent learns to separate the state space into regions of action and inaction. We evaluate the framework using high-frequency 1Hz Coinbase trade data comprising over 6.8M trades. Experimental results show that RAmmStein achieves a superior net ROI of 0.72% compared to both passive and aggressive strategies. Notably, the agent reduces rebalancing frequency by 67% compared to a greedy rebalancing strategy while maintaining 88% active time. Our results demonstrate that regime-aware laziness can significantly improve capital efficiency by preserving the returns that would otherwise be eroded by the operational costs.
中文摘要 去中心化交易所中的集中流动性提供是一个根本性的脉冲控制问题。流动性提供者（LP）面临着一个并不简单的权衡：一方面通过收窄价格区间来最大化手续费收入，另一方面要最小化再平衡的摩擦成本，包括gas费和兑换滑点。现有方法通常采用启发式或阈值策略，未能考虑市场动态。本文将流动性管理表述为最优控制问题，并推导出相应的Hamilton-Jacobi-Bellman拟变分不等式（HJB-QVI）。我们提出了一个近似解RAmmStein，这是一种深度强化学习方法，将Ornstein-Uhlenbeck过程的均值回归速度（theta）等特征作为模型输入。我们证明了智能体学会将状态空间划分为行动区域和不行动区域。我们利用包含超过680万笔交易的高频1Hz Coinbase交易数据评估该框架。实验结果显示，RAmmStein取得了0.72%的净投资回报率，优于被动策略和激进策略。值得注意的是，与贪婪的再平衡策略相比，该智能体将再平衡频率降低了67%，同时保持了88%的活跃时间。我们的结果表明，具有市场状态感知的“惰性”能够保留原本会被运营成本侵蚀的回报，从而显著提升资本效率。
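
The abstract feeds the mean-reversion speed (theta) of an Ornstein-Uhlenbeck process to the agent but does not describe how theta is obtained. One textbook option, shown here purely as an assumption, is to regress each observation on the previous one: for an OU process sampled every dt seconds the AR(1) slope equals exp(-theta * dt), so theta = -ln(slope) / dt. The function name and the synthetic price path are hypothetical.

```python
import math


def estimate_ou_theta(series, dt):
    """Estimate OU mean-reversion speed from an AR(1) least-squares slope."""
    x_prev, x_next = series[:-1], series[1:]
    n = len(x_prev)
    mean_prev = sum(x_prev) / n
    mean_next = sum(x_next) / n
    cov = sum((a - mean_prev) * (b - mean_next) for a, b in zip(x_prev, x_next))
    var = sum((a - mean_prev) ** 2 for a in x_prev)
    slope = cov / var
    if not 0 < slope < 1:
        raise ValueError("series does not look mean-reverting at this sampling interval")
    return -math.log(slope) / dt


# Synthetic, noise-free path decaying toward a long-run mean of 100 at 1 Hz.
true_theta, long_run_mean, dt = 0.5, 100.0, 1.0
path = [long_run_mean + 5.0 * math.exp(-true_theta * dt * t) for t in range(50)]
estimated_theta = estimate_ou_theta(path, dt)
```

On real trade data the estimate would be noisy and window-dependent, which is one reason a learned policy might treat it as just one input feature among others.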

A Reinforcement Learning-based Transmission Expansion Framework Considering Strategic Bidding in Electricity Markets

基于强化学习的输电扩展框架，考虑电力市场中的战略竞标

Authors: Tomonari Kanazawa, Hikaru Hoshino, Eiko Furutani
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2602.19421
Pdf link: https://arxiv.org/pdf/2602.19421
Abstract Transmission expansion planning in electricity markets is tightly coupled with the strategic bidding behaviors of generation companies. This paper proposes a Reinforcement Learning (RL)-based co-optimization framework that simultaneously learns transmission investment decisions and generator bidding strategies within a unified training process. Based on a multiagent RL framework for market simulation, the proposed method newly introduces a design policy layer that jointly optimizes continuous/discrete transmission expansion decisions together with strategic bidding policies. Through iterative interaction between market clearing and investment design, the framework effectively captures their mutual influence and achieves consistent co-optimization of expansion and bidding decisions. Case studies on the IEEE 30-bus system are provided for proof-of-concept validation of the proposed co-optimization framework.
中文摘要 电力市场的输电扩展规划与发电公司的战略性竞标行为紧密相连。本文提出了一种基于强化学习（RL）的协同优化框架，能够在统一的培训过程中同时学习输电投资决策和发电机竞标策略。基于多智能体强化学习框架进行市场模拟，所提方法新引入了设计策略层，能够联合优化连续/离散传输扩展决策与战略竞价策略。通过市场清算与投资设计之间的迭代互动，该框架有效捕捉了双方的相互影响，实现了扩张和竞价决策的持续协同优化。提供IEEE 30总线系统的案例研究，用于对所提协优化框架的概念验证。

Sizing of Battery Considering Renewable Energy Bidding Strategy with Reinforcement Learning

电池规模评估：结合强化学习考虑可再生能源招标策略

Authors: Taiyo Mantani, Hikaru Hoshino, Tomonari Kanazawa, Eiko Furutani
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2602.19428
Pdf link: https://arxiv.org/pdf/2602.19428
Abstract This paper proposes a novel computationally efficient algorithm for optimal sizing of Battery Energy Storage Systems (BESS) considering renewable energy bidding strategies. Unlike existing two-stage methods, our algorithm enables the co-optimization of both by updating the BESS size during the training of the bidding policy, leveraging an extended reinforcement learning (RL) framework inspired by advancements in embodied cognition. By integrating the Deep Recurrent Q-Network (DRQN) with a distributed RL framework, the proposed algorithm effectively manages uncertainties in renewable generation and market prices while enabling parallel computation for efficiently handling long-term data.
中文摘要 本文提出了一种新颖的计算高效算法，用于考虑可再生能源竞标策略，用于实现电池储能系统（BESS）的最优尺寸设计。与现有的两阶段方法不同，我们的算法通过在竞标策略训练过程中更新BESS大小，利用受具身认知进步启发的扩展强化学习（RL）框架，实现两者的协同优化。通过将深度循环Q网络（DRQN）与分布式强化学习框架集成，所提算法有效管理可再生能源发电和市场价格的不确定性，同时实现并行计算以高效处理长期数据。

UrbanAlign: Post-hoc Semantic Calibration for VLM-Human Preference Alignment

UrbanAlign：VLM-人类偏好对齐的事后语义校准

Authors: Yecheng Zhang, Rong Zhao, Zhizhou Sha, Yong Li, Lei Wang, Ce Hou, Wen Ji, Hao Huang, Yunshan Wan, Jian Yu, Junhao Xia, Yuru Zhang, Chunlei Shi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.19442
Pdf link: https://arxiv.org/pdf/2602.19442
Abstract Aligning vision-language model (VLM) outputs with human preferences in domain-specific tasks typically requires fine-tuning or reinforcement learning, both of which demand labelled data and GPU compute. We show that for subjective perception tasks, this alignment can be achieved without any model training: VLMs are already strong concept extractors but poor decision calibrators, and the gap can be closed externally. We propose a training-free post-hoc concept-bottleneck pipeline consisting of three tightly coupled stages: concept mining, multi-agent structured scoring, and geometric calibration, unified by an end-to-end dimension optimization loop. Interpretable evaluation dimensions are mined from a handful of human annotations; an Observer-Debater-Judge chain extracts robust continuous concept scores from a frozen VLM; and locally-weighted ridge regression on a hybrid visual-semantic manifold calibrates these scores against human ratings. Applied to urban perception as UrbanAlign, the framework achieves 72.2% accuracy ($\kappa=0.45$) on Place Pulse 2.0 across six categories, outperforming the best supervised baseline by +15.1 pp and uncalibrated VLM scoring by +16.3 pp, with full dimension-level interpretability and zero model-weight modification.
中文摘要 在特定领域任务中，将视觉语言模型（VLM）输出与人类偏好对齐通常需要微调或强化学习，这两者都需要标记数据和GPU计算。我们证明，对于主观感知任务，这种对齐可以在无需任何模型训练的情况下实现：VLM本身就是强的概念提取器，但决策校准器较差，且这一差距可以在外部弥合。我们提出了一个无需培训的事后概念瓶颈流程，由三个紧密耦合的阶段组成：概念挖掘、多智能体结构化评分和几何校准，并通过端到端的维度优化循环进行统一。可解释的评估维度是从少数人工注释中挖掘出来的;观察者-辩论者-评判链从冻结的VLM中提取稳健的连续概念分数;而局部加权脊回归则在混合视觉-语义流形上校准这些分数与人类评分。作为UrbanAlign应用于城市感知，该框架在Place Pulse 2.0的六个类别中实现了72.2%的准确率（$\kappa=0.45$），比最佳监督基线高出+15.1 pp，比未校准VLM评分高+16.3 pp，实现了全维度可解释性，且模型权重零修改。

SenTSR-Bench: Thinking with Injected Knowledge for Time-Series Reasoning

SenTSR-Bench：用注入知识思考时间序列推理

Authors: Zelin He, Boran Han, Xiyuan Zhang, Shuai Zhang, Haotian Lin, Qi Zhu, Haoyang Fang, Danielle C. Maddix, Abdul Fatir Ansari, Akash Chandrayan, Abhinav Pradhan, Bernie Wang, Matthew Reimherr
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2602.19455
Pdf link: https://arxiv.org/pdf/2602.19455
Abstract Time-series diagnostic reasoning is essential for many applications, yet existing solutions face a persistent gap: general reasoning large language models (GRLMs) possess strong reasoning skills but lack the domain-specific knowledge to understand complex time-series patterns. Conversely, fine-tuned time-series LLMs (TSLMs) understand these patterns but lack the capacity to generalize reasoning for more complicated questions. To bridge this gap, we propose a hybrid knowledge-injection framework that injects TSLM-generated insights directly into GRLM's reasoning trace, thereby achieving strong time-series reasoning with in-domain knowledge. As collecting data for knowledge injection fine-tuning is costly, we further leverage a reinforcement learning-based approach with verifiable rewards (RLVR) to elicit knowledge-rich traces without human supervision, then transfer such an in-domain thinking trace into GRLM for efficient knowledge injection. We further release SenTSR-Bench, a multivariate time-series-based diagnostic reasoning benchmark collected from real-world industrial operations. Across SenTSR-Bench and other public datasets, our method consistently surpasses TSLMs by 9.1%-26.1% and GRLMs by 7.9%-22.4%, delivering robust, context-aware time-series diagnostic insights.
中文摘要 时间序列诊断推理在许多应用中至关重要，但现有解决方案仍面临一个持续的空白：通用推理大型语言模型（GRLMs）具备强大的推理能力，但缺乏理解复杂时间序列模式的领域特定知识。相反，精细调优的时间序列LLM（TSLMs）理解这些模式，但缺乏对更复杂问题进行推理推广的能力。为弥合这一差距，我们提出了一种混合知识-注入框架，将TSLM生成的洞见直接注入GRLM的推理轨迹，从而实现强有力的时间序列推理与领域内知识。由于收集知识注入微调数据成本高昂，我们进一步采用基于强化学习的可验证奖励（RLVR）方法，在无人工监督下引发丰富的知识痕迹，然后将此类领域内思维轨迹转移到GRLM，实现高效的知识注入。我们还发布了SenTSR-Bench，这是一个基于多变量时间序列的诊断推理基准，数据来自真实工业运营。在SenTSR-Bench及其他公开数据集中，我们的方法持续领先TSLMs9.1%-26.1%，高于GRLMs7.9%-22.4%，提供稳健且具上下文感知的时间序列诊断洞察。

How to Train Your Deep Research Agent? Prompt, Reward, and Policy Optimization in Search-R1

如何训练你的深度研究智能体？Search-R1中的提示、奖励与策略优化

Authors: Yinuo Xu, Shuo Lu, Jianjie Cheng, Meng Wang, Qianlong Xie, Xingxing Wang, Ran He, Jian Liang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.19526
Pdf link: https://arxiv.org/pdf/2602.19526
Abstract Deep Research agents tackle knowledge-intensive tasks through multi-round retrieval and decision-oriented generation. While reinforcement learning (RL) has been shown to improve performance in this paradigm, its contributions remain underexplored. To fully understand the role of RL, we conduct a systematic study along three decoupled dimensions: prompt template, reward function, and policy optimization. Our study reveals that: 1) the Fast Thinking template yields greater stability and better performance than the Slow Thinking template used in prior work; 2) the F1-based reward underperforms the EM due to training collapse driven by answer avoidance; this can be mitigated by incorporating action-level penalties, ultimately surpassing EM; 3) REINFORCE outperforms PPO while requiring fewer search actions, whereas GRPO shows the poorest stability among policy optimization methods. Building on these insights, we then introduce Search-R1++, a strong baseline that improves the performance of Search-R1 from 0.403 to 0.442 (Qwen2.5-7B) and 0.289 to 0.331 (Qwen2.5-3B). We hope that our findings can pave the way for more principled and reliable RL training strategies in Deep Research systems.
中文摘要 深度研究代理通过多轮检索和决策导向生成来处理知识密集型任务。虽然强化学习（RL）已被证明能提升该范式中的性能，但其贡献尚未被充分充分探讨。为了全面理解强化学习的作用，我们从三个解耦维度进行系统研究：提示模板、奖励函数和策略优化。我们的研究显示：1）快速思维模板比之前工作中使用的慢思考模板提供了更高的稳定性和更好的性能;2）基于F1的奖励因应对规避导致训练崩溃而表现不佳;通过加入行动级惩罚来缓解，最终超过EM;3）REINFORCE在需要更少搜索作的情况下优于PPO，而GRPO在策略优化方法中稳定性最差。基于这些见解，我们介绍了Search-R1++，这是一个强有力的基线，将Search-R1的性能从0.403提升到0.442（Qwen2.5-7B），从0.289提升到0.331（Qwen2.5-3B）。我们希望我们的发现能为深度研究系统中更原则性和可靠的强化学习训练策略铺平道路。
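
The abstract contrasts an exact-match (EM) reward with an F1-based reward. A minimal sketch of the two, assuming simple lowercase whitespace tokenization (the paper's normalization and its action-level penalties are not specified in the abstract and are not modeled here):

```python
from collections import Counter


def normalize(text):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return text.lower().split()


def exact_match_reward(prediction, gold):
    """Return 1.0 only when the normalized answers are identical."""
    return 1.0 if normalize(prediction) == normalize(gold) else 0.0


def f1_reward(prediction, gold):
    """Token-level F1 between the predicted and gold answers."""
    pred_tokens, gold_tokens = normalize(prediction), normalize(gold)
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

F1 gives partial credit where EM gives none: for the prediction "the city of Paris" against the gold answer "Paris", EM is 0.0 while F1 is 0.4. That graded signal is what distinguishes the two rewards compared in the study.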

Cost-Aware Diffusion Active Search

成本感知扩散主动搜索

Authors: Arundhati Banerjee, Jeff Schneider
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.19538
Pdf link: https://arxiv.org/pdf/2602.19538
Abstract Active search for recovering objects of interest through online, adaptive decision making with autonomous agents requires trading off exploration of unknown environments with exploitation of prior observations in the search space. Prior work has proposed information gain and Thompson sampling based myopic, greedy approaches for agents to actively decide query or search locations when the number of targets is unknown. Decision making algorithms in such partially observable environments have also shown that agents capable of lookahead over a finite horizon outperform myopic policies for active search. Unfortunately, lookahead algorithms typically rely on building a computationally expensive search tree that is simulated and updated based on the agent's observations and a model of the environment dynamics. Instead, in this work, we leverage the sequence modeling abilities of diffusion models to sample lookahead action sequences that balance the exploration-exploitation trade-off for active search without building an exhaustive search tree. We identify the optimism bias in prior diffusion based reinforcement learning approaches when applied to the active search setting and propose mitigating solutions for efficient cost-aware decision making with both single and multi-agent teams. Our proposed algorithm outperforms standard baselines in offline reinforcement learning in terms of full recovery rate and is computationally more efficient than tree search in cost-aware active decision making.
中文摘要 通过自主智能体在线自适应决策进行主动搜索，需要在探索未知环境与利用搜索空间中先前观测数据之间进行权衡。此前的研究提出了基于汤普森采样的信息获取和近视贪婪的方法，使智能体在未知目标数量时主动决定查询或搜索位置。在此类部分可观测环境中的决策算法还表明，能够在有限视野内提前看的代理，在主动搜索中优于近视策略。不幸的是，前瞻算法通常依赖于构建一个计算量高的搜索树，并根据智能体的观察和环境动态模型进行模拟和更新。相反，在本研究中，我们利用扩散模型的序列建模能力，在主动搜索中平衡探索与利用权衡，同时不构建穷尽搜索树，对前瞻动作序列进行采样。我们将基于先前扩散的强化学习方法应用于主动搜索环境时识别出乐观偏差，并提出了针对单代理和多智能体团队高效成本决策的缓解解决方案。我们提出的算法在离线强化学习中，在完全恢复率方面优于标准基线，并且在成本感知的主动决策中比树搜索计算效率更高。

Advantage-based Temporal Attack in Reinforcement Learning

基于优势的时序攻击在强化学习中

Authors: Shenghong He
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.19582
Pdf link: https://arxiv.org/pdf/2602.19582
Abstract Extensive research demonstrates that Deep Reinforcement Learning (DRL) models are susceptible to adversarially constructed inputs (i.e., adversarial examples), which can mislead the agent to take suboptimal or unsafe actions. Recent methods improve attack effectiveness by leveraging future rewards to guide adversarial perturbation generation over sequential time steps (i.e., reward-based attacks). However, these methods are unable to capture dependencies between different time steps in the perturbation generation process, resulting in a weak temporal correlation between the current perturbation and previous perturbations. In this paper, we propose a novel method called Advantage-based Adversarial Transformer (AAT), which can generate adversarial examples with stronger temporal correlations (i.e., time-correlated adversarial examples) to improve the attack performance. AAT employs a multi-scale causal self-attention (MSCSA) mechanism to dynamically capture dependencies between historical information from different time periods and the current state, thus enhancing the correlation between the current perturbation and the previous perturbation. Moreover, AAT introduces a weighted advantage mechanism, which quantifies the effectiveness of a perturbation in a given state and guides the generation process toward high-performance adversarial examples by sampling high-advantage regions. Extensive experiments demonstrate that the performance of AAT matches or surpasses mainstream adversarial attack baselines on Atari, DeepMind Control Suite and Google football tasks.
中文摘要 大量研究表明，深度强化学习（DRL）模型容易受到对抗性构建的输入（即对抗性示例）的影响，这可能误导智能体采取次优或不安全的行为。最新方法通过利用未来奖励引导对抗性扰动生成（即基于奖励的攻击），提高攻击效果。然而，这些方法无法捕捉扰动生成过程中不同时间步之间的依赖关系，导致当前扰动与之前的时间相关性较弱。本文提出了一种名为优势基对抗变换器（AAT）的新方法，可以生成具有更强时间相关性的对抗实例（即时间相关的对抗实例），以提升攻击性能。AAT采用多尺度因果自注意（MSCSA）机制，动态捕捉不同时间段的历史信息与当前状态之间的依赖关系，从而增强当前扰动与之前扰动之间的相关性。此外，AAT引入了加权优势机制，量化给定状态下扰动的有效性，并通过采样高优势区域引导生成过程，朝向高性能对抗实例。大量实验表明，AAT在Atari、DeepMind Control Suite和Google足球任务中，其表现与主流对抗攻击基准匹配甚至超过。

CACTO-BIC: Scalable Actor-Critic Learning via Biased Sampling and GPU-Accelerated Trajectory Optimization

CACTO-BIC：通过偏向采样和GPU加速轨迹优化实现可扩展的演员-批评者学习

Authors: Elisa Alboni, Pietro Noah Crestaz, Elias Fontanari, Andrea Del Prete
Subjects: Robotics (cs.RO); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2602.19699
Pdf link: https://arxiv.org/pdf/2602.19699
Abstract Trajectory Optimization (TO) and Reinforcement Learning (RL) offer complementary strengths for solving optimal control problems. TO efficiently computes locally optimal solutions but can struggle with non-convexity, while RL is more robust to non-convexity at the cost of significantly higher computational demands. CACTO (Continuous Actor-Critic with Trajectory Optimization) was introduced to combine these advantages by learning a warm-start policy that guides the TO solver towards low-cost trajectories. However, scalability remains a key limitation, as increasing system complexity significantly raises the computational cost of TO. This work introduces CACTO-BIC to address these challenges. CACTO-BIC improves data efficiency by biasing initial-state sampling leveraging a property of the value function associated with locally optimal policies; moreover, it reduces computation time by exploiting GPU acceleration. Empirical evaluations show improved sample efficiency and faster computation compared to CACTO. Comparisons with PPO demonstrate that our approach can achieve similar solutions in less time. Finally, experiments on the AlienGO quadruped robot demonstrate that CACTO-BIC can scale to high-dimensional systems and is suitable for real-time applications.
中文摘要 轨迹优化（TO）和强化学习（RL）为解决最优控制问题提供了互补优势。TO 高效计算局部最优解，但在非凸性方面会遇到困难，而 RL 对非凸性更为鲁棒，但计算需求显著增加。引入了CACTO（带轨迹优化的连续演员-批判者）结合这些优势，学习一个热启动策略，引导TO求解器朝向低成本轨迹发展。然而，可扩展性仍是一个关键限制，因为系统复杂度的增加显著提高了 TO 的计算成本。本研究引入了CACTO-BIC技术以应对这些挑战。CACTO-BIC通过利用与局部最优策略相关的价值函数特性，对初始状态抽样进行偏置，从而提升数据效率;此外，它通过利用GPU加速来减少计算时间。实证评估显示，相较于CACTO，样本效率和计算速度有所提升。与PPO的比较表明，我们的方法能够在更短时间内实现类似解决方案。最后，AlienGO四足机器人的实验表明CACTO-BIC可扩展到高维系统，适合实时应用。

TextShield-R1: Reinforced Reasoning for Tampered Text Detection

TextShield-R1：被篡改文本检测的强化推理

Authors: Chenfan Qu, Yiwu Zhong, Jian Liu, Xuekang Zhu, Bohan Yu, Lianwen Jin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.19828
Pdf link: https://arxiv.org/pdf/2602.19828
Abstract The growing prevalence of tampered images poses serious security threats, highlighting the urgent need for reliable detection methods. Multimodal large language models (MLLMs) demonstrate strong potential in analyzing tampered images and generating interpretations. However, they still struggle with identifying micro-level artifacts, exhibit low accuracy in localizing tampered text regions, and heavily rely on expensive annotations for forgery interpretation. To this end, we introduce TextShield-R1, the first reinforcement learning based MLLM solution for tampered text detection and reasoning. Specifically, our approach introduces Forensic Continual Pre-training, an easy-to-hard curriculum that well prepares the MLLM for tampered text detection by harnessing the large-scale cheap data from natural image forensic and OCR tasks. During fine-tuning, we perform Group Relative Policy Optimization with novel reward functions to reduce annotation dependency and improve reasoning capabilities. At inference time, we enhance localization accuracy via OCR Rectification, a method that leverages the MLLM's strong text recognition abilities to refine its predictions. Furthermore, to support rigorous evaluation, we introduce the Text Forensics Reasoning (TFR) benchmark, comprising over 45k real and tampered images across 16 languages, 10 tampering techniques, and diverse domains. Rich reasoning-style annotations are included, allowing for comprehensive assessment. Our TFR benchmark simultaneously addresses seven major limitations of existing benchmarks and enables robust evaluation under cross-style, cross-method, and cross-language conditions. Extensive experiments demonstrate that TextShield-R1 significantly advances the state of the art in interpretable tampered text detection.
中文摘要 篡改图像日益普遍，构成严重的安全威胁，凸显了对可靠检测方法的紧迫需求。多模态大型语言模型（MLLM）在分析被篡改图像和生成解释方面展现了强大潜力。然而，它们在识别微观伪影方面仍然存在困难，在定位被篡改文本区域时准确性较低，且严重依赖昂贵的标注来进行伪造解读。为此，我们提出了TextShield-R1，这是首个基于强化学习的MLLM解决方案，用于篡改文本检测和推理。具体来说，我们的方法引入了取证持续预训练（Forensic Continual Pre-training），这是一套由易到难的课程，通过利用自然图像取证和OCR任务中的大规模廉价数据，为MLLM做好篡改文本检测的准备。在微调过程中，我们采用新颖的奖励函数进行组相对策略优化（Group Relative Policy Optimization），以减少对标注的依赖并提升推理能力。在推理阶段，我们通过OCR校正（OCR Rectification）提升定位准确性，该方法利用MLLM强大的文本识别能力来优化预测。此外，为了支持严谨评估，我们引入了文本取证推理（TFR）基准，包含超过4.5万张真实和篡改图像，涵盖16种语言、10种篡改技术和多个领域。其中包含丰富的推理式标注，便于全面评估。我们的TFR基准同时解决了现有基准的七大局限性，并支持在跨风格、跨方法和跨语言条件下的稳健评估。大量实验表明，TextShield-R1显著推进了可解释篡改文本检测的最新水平。
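
Illustrative note (not from the paper): the abstract above mentions Group Relative Policy Optimization. As a minimal sketch of the group-relative part only, the snippet below normalizes the rewards of a group of sampled responses by the group mean and standard deviation. The function name `group_relative_advantages`, the `eps` term, and the use of the population standard deviation are assumptions for illustration; the paper's reward functions are not reproduced here, and real implementations may differ in these details.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    rewards: scalar rewards for responses sampled from the same prompt.
    eps: small constant (an assumption) that avoids division by zero
         when every response in the group receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; a sketch-level choice
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group of four responses scored 1 (correct) or 0 (incorrect).
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(advantages)
```

Responses that beat the group average get a positive advantage and the rest a negative one, so no separate value network is needed to form the baseline.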

Meta-Learning and Meta-Reinforcement Learning - Tracing the Path towards DeepMind's Adaptive Agent

元学习与元强化学习——追踪通往DeepMind自适应代理的路径

Authors: Björn Hoppmann, Christoph Scholz
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.19837
Pdf link: https://arxiv.org/pdf/2602.19837
Abstract Humans are highly effective at utilizing prior knowledge to adapt to novel tasks, a capability that standard machine learning models struggle to replicate due to their reliance on task-specific training. Meta-learning overcomes this limitation by allowing models to acquire transferable knowledge from various tasks, enabling rapid adaptation to new challenges with minimal data. This survey provides a rigorous, task-based formalization of meta-learning and meta-reinforcement learning and uses that paradigm to chronicle the landmark algorithms that paved the way for DeepMind's Adaptive Agent, consolidating the essential concepts needed to understand the Adaptive Agent and other generalist approaches.
中文摘要 人类非常擅长利用先验知识适应新任务，而标准机器学习模型因依赖任务训练而难以复制这一点。元学习克服了这一局限，允许模型从各种任务中获取可迁移的知识，从而以极少的数据快速适应新挑战。本综述为元学习和元强化学习提供了严谨的基于任务的形式化，并利用这一范式记录了为DeepMind自适应代理铺平道路的里程碑式算法，巩固了理解自适应智能体及其他通用方法所需的核心概念。

DSDR: Dual-Scale Diversity Regularization for Exploration in LLM Reasoning

DSDR：用于探索大型语言模型推理的双尺度多样性正则化

Authors: Zhongwei Wan, Yun Shen, Zhihao Dou, Donghao Zhou, Yu Zhang, Xin Wang, Hui Shen, Jing Xiong, Chaofan Tao, Zixuan Zhong, Peizhou Huang, Mi Zhang
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.19895
Pdf link: https://arxiv.org/pdf/2602.19895
Abstract Reinforcement learning with verifiers (RLVR) is a central paradigm for improving large language model (LLM) reasoning, yet existing methods often suffer from limited exploration. Policies tend to collapse onto a few reasoning patterns and prematurely stop deep exploration, while conventional entropy regularization introduces only local stochasticity and fails to induce meaningful path-level diversity, leading to weak and unstable learning signals in group-based policy optimization. We propose DSDR, a Dual-Scale Diversity Regularization reinforcement learning framework that decomposes diversity in LLM reasoning into global and coupling components. Globally, DSDR promotes diversity among correct reasoning trajectories to explore distinct solution modes. Locally, it applies a length-invariant, token-level entropy regularization restricted to correct trajectories, preventing entropy collapse within each mode while preserving correctness. The two scales are coupled through a global-to-local allocation mechanism that emphasizes local regularization for more distinctive correct trajectories. We provide theoretical support showing that DSDR preserves optimal correctness under bounded regularization, sustains informative learning signals in group-based optimization, and yields a principled global-to-local coupling rule. Experiments on multiple reasoning benchmarks demonstrate consistent improvements in accuracy and pass@k, highlighting the importance of dual-scale diversity for deep exploration in RLVR. Code is available at this https URL.
中文摘要 带验证器的强化学习（RLVR）是提升大型语言模型（LLM）推理的核心范式，但现有方法往往探索有限。策略往往坍缩到少数推理模式上，过早终止深度探索，而传统的熵正则化仅引入局部随机性，未能诱导有意义的路径级多样性，导致基于组的策略优化中学习信号薄弱且不稳定。我们提出了DSDR，一种双尺度多样性正则化强化学习框架，将LLM推理中的多样性分解为全局和耦合成分。在全局上，DSDR促进正确推理轨迹之间的多样性，以探索不同的解法模式。在局部上，它采用长度不变的词元级熵正则化，且仅限于正确的轨迹，防止每个模式内的熵坍缩，同时保持正确性。这两个尺度通过全局到局部的分配机制耦合，对更具独特性的正确轨迹加强局部正则化。我们提供了理论支持，证明DSDR在有界正则化下保持最优正确性，在基于组的优化中维持有信息量的学习信号，并给出一个有原则的全局到局部耦合规则。在多个推理基准上的实验显示准确率和pass@k持续提升，凸显了双尺度多样性对RLVR深度探索的重要性。代码可在此 https URL 访问。

Uncertainty-Aware Rank-One MIMO Q Network Framework for Accelerated Offline Reinforcement Learning

不确定性感知的秩一MIMO Q网络框架，用于加速离线强化学习

Authors: Thanh Nguyen, Tung Luu, Tri Ton, Sungwoong Kim, Chang D. Yoo
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.19917
Pdf link: https://arxiv.org/pdf/2602.19917
Abstract Offline reinforcement learning (RL) has garnered significant interest due to its safe and easily scalable paradigm. However, training under this paradigm presents its own challenge: the extrapolation error stemming from out-of-distribution (OOD) data. Existing methodologies have endeavored to address this issue through means like penalizing OOD Q-values or imposing similarity constraints on the learned policy and the behavior policy. Nonetheless, these approaches are often beset by limitations such as being overly conservative in utilizing OOD data, imprecise OOD data characterization, and significant computational overhead. To address these challenges, this paper introduces an Uncertainty-Aware Rank-One Multi-Input Multi-Output (MIMO) Q Network framework. The framework aims to enhance Offline Reinforcement Learning by fully leveraging the potential of OOD data while still ensuring efficiency in the learning process. Specifically, the framework quantifies data uncertainty and harnesses it in the training losses, aiming to train a policy that maximizes the lower confidence bound of the corresponding Q-function. Furthermore, a Rank-One MIMO architecture is introduced to model the uncertainty-aware Q-function, offering the same ability for uncertainty quantification as an ensemble of networks but with a cost nearly equivalent to that of a single network. Consequently, this framework strikes a harmonious balance between precision, speed, and memory efficiency, culminating in improved overall performance. Extensive experimentation on the D4RL benchmark demonstrates that the framework attains state-of-the-art performance while remaining computationally efficient. By incorporating the concept of uncertainty quantification, our framework offers a promising avenue to alleviate extrapolation errors and enhance the efficiency of offline RL.
中文摘要 离线强化学习（RL）因其安全且易于扩展的范式而备受关注。然而，在这种范式下训练也面临挑战：源自分布外（OOD）数据的外推误差。现有方法试图通过惩罚OOD Q值或对学习策略和行为策略施加相似性约束等方式来解决这一问题。然而，这些方法常常存在局限性，如在使用OOD数据时过于保守、OOD数据特征描述不精确以及计算开销较大。为应对这些挑战，本文引入了一个不确定性感知的秩一多输入多输出（MIMO）Q网络框架。该框架旨在通过充分发挥OOD数据的潜力，提升离线强化学习，同时确保学习过程的高效性。具体来说，该框架量化数据不确定性，并将其纳入训练损失，旨在训练一种最大化对应Q函数置信下界的策略。此外，引入了Rank-One MIMO架构，用于建模不确定性感知的Q函数，提供与网络集成相同的不确定性量化能力，但成本几乎等同于单一网络。因此，该框架在精度、速度和内存效率之间取得了和谐的平衡，最终提升了整体性能。在D4RL基准上的大量实验表明，该框架在保持计算效率的同时，能够达到最先进的性能。通过引入不确定性量化的概念，我们的框架为减少外推误差和提升离线强化学习效率提供了有前景的途径。
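
Illustrative note (not from the paper): the abstract above describes training a policy that maximizes a lower confidence bound of the Q-function. The sketch below shows one common way such a bound can be formed, as the mean of several Q estimates minus a multiple of their spread. The names `lcb_q` and `pick_action`, the coefficient `beta`, and the toy numbers are assumptions for illustration; the estimates here are a plain list, not the Rank-One MIMO architecture.

```python
import statistics

def lcb_q(q_estimates, beta=1.0):
    """Lower confidence bound: mean minus beta times the population std."""
    mean = statistics.fmean(q_estimates)
    std = statistics.pstdev(q_estimates)
    return mean - beta * std

def pick_action(q_by_action, beta=1.0):
    """Return the action whose lower confidence bound is largest."""
    return max(q_by_action, key=lambda a: lcb_q(q_by_action[a], beta))

# Hypothetical Q estimates from four heads for two candidate actions.
q_by_action = {
    "a0": [1.0, 1.0, 1.0, 1.0],    # estimates agree
    "a1": [3.0, -1.0, 3.0, -1.0],  # same mean, strong disagreement
}
print(pick_action(q_by_action))
```

Both actions have the same mean estimate, but the disagreement on `a1` lowers its bound, so the uncertain action is penalized rather than excluded outright.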

Janus-Q: End-to-End Event-Driven Trading via Hierarchical-Gated Reward Modeling

Janus-Q：通过层级门控奖励建模实现端到端事件驱动交易

Authors: Xiang Li, Zikai Wei, Yiyan Qi, Wanyun Zhou, Xiang Liu, Penglei Sun, Yongqi Zhang, Xiaowen Chu
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.19919
Pdf link: https://arxiv.org/pdf/2602.19919
Abstract Financial market movements are often driven by discrete financial events conveyed through news, whose impacts are heterogeneous, abrupt, and difficult to capture under purely numerical prediction objectives. These limitations have motivated growing interest in using textual information as the primary source of trading signals in learning-based systems. Two key challenges hinder existing approaches: (1) the absence of large-scale, event-centric datasets that jointly model news semantics and statistically grounded market reactions, and (2) the misalignment between language model reasoning and financially valid trading behavior under dynamic market conditions. To address these challenges, we propose Janus-Q, an end-to-end event-driven trading framework that elevates financial news events from auxiliary signals to primary decision units. Janus-Q unifies event-centric data construction and model optimization under a two-stage paradigm. Stage I focuses on event-centric data construction, building a large-scale financial news event dataset comprising 62,400 articles annotated with 10 fine-grained event types, associated stocks, sentiment labels, and event-driven cumulative abnormal return (CAR). Stage II performs decision-oriented fine-tuning, combining supervised learning with reinforcement learning guided by a Hierarchical Gated Reward Model (HGRM), which explicitly captures trade-offs among multiple trading objectives. Extensive experiments demonstrate that Janus-Q achieves more consistent, interpretable, and profitable trading decisions than market indices and LLM baselines, improving the Sharpe Ratio by up to 102.0% while increasing direction accuracy by over 17.5% compared to the strongest competing strategies.
中文摘要 金融市场的波动通常由新闻传递的离散金融事件驱动，这些事件的影响异质化、突兀，且在纯数值预测目标下难以捕捉。这些局限性促使人们越来越关注将文本信息作为基于学习的系统中交易信号的主要来源。现有方法面临两个关键挑战：（1）缺乏大规模、以事件为中心的数据集，这些数据集能够共同建模新闻语义和统计基础的市场反应;（2）语言模型推理与动态市场条件下财务有效交易行为之间的不匹配。为应对这些挑战，我们提出了Janus-Q，一种端到端事件驱动的交易框架，将金融新闻事件从辅助信号提升为主要决策单元。Janus-Q 将事件中心的数据构建和模型优化统一在两阶段范式下。第一阶段聚焦于事件中心数据构建，构建一个包含62,400篇文章的大型金融新闻事件数据集，注释了10种细粒度事件类型、相关股票、情绪标签以及事件驱动的累计异常回报（CAR）。第二阶段进行决策导向的微调，结合监督学习与由层级门控奖励模型（HGRM）指导的强化学习，明确捕捉多个交易目标之间的权衡。大量实验表明，Janus-Q比市场指数和大型语言模型基线实现的交易决策更为一致、可解释且盈利，使夏普比率提升了高达102.0%，同时使方向准确率提升超过17.5%，相较于最强的竞争策略。
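
Illustrative note (not from the paper): the abstract above refers to cumulative abnormal return (CAR) and the Sharpe Ratio. The sketch below uses the textbook definitions: CAR as the sum of actual minus expected returns over an event window, and the Sharpe ratio as mean excess return divided by its sample standard deviation. The function names, the unannualized form, and the sample numbers are assumptions for illustration; the paper's exact estimation procedure is not given in the abstract.

```python
import statistics

def cumulative_abnormal_return(actual_returns, expected_returns):
    """Sum of (actual - expected) returns over an event window."""
    if len(actual_returns) != len(expected_returns):
        raise ValueError("return series must have the same length")
    return sum(a - e for a, e in zip(actual_returns, expected_returns))

def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return divided by its sample standard deviation."""
    excess = [r - risk_free for r in returns]
    std = statistics.stdev(excess)
    if std == 0:
        raise ValueError("Sharpe ratio is undefined for zero volatility")
    return statistics.fmean(excess) / std

# Hypothetical three-day event window for one stock.
car = cumulative_abnormal_return([0.02, -0.01, 0.03], [0.01, 0.00, 0.01])
print(round(car, 4))
print(round(sharpe_ratio([0.01, 0.03]), 4))
```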

Sparse Masked Attention Policies for Reliable Generalization

用于可靠泛化的稀疏掩码注意力策略

Authors: Caroline Horsch, Laurens Engwegen, Max Weltevrede, Matthijs T. J. Spaan, Wendelin Böhmer
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.19956
Pdf link: https://arxiv.org/pdf/2602.19956
Abstract In reinforcement learning, abstraction methods that remove unnecessary information from the observation are commonly used to learn policies which generalize better to unseen tasks. However, these methods often overlook a crucial weakness: the function which extracts the reduced-information representation has unknown generalization ability in unseen observations. In this paper, we address this problem by presenting an information removal method which more reliably generalizes to new states. We accomplish this by using a learned masking function which operates on, and is integrated with, the attention weights within an attention-based policy network. We demonstrate that our method significantly improves policy generalization to unseen tasks in the Procgen benchmark compared to standard PPO and masking approaches.
中文摘要 在强化学习中，通常使用去除观察中不必要信息的抽象方法来学习更能推广到未见任务的策略。然而，这些方法常常忽视一个关键弱点：提取约简信息表示的函数在未见观测中具有未知的泛化能力。本文通过提出一种更可靠地推广到新态的信息去除方法来解决这个问题。我们通过使用一种学习得来的掩蔽函数实现这一点，该函数作用于并与基于注意力的政策网络中的注意力权重整合。我们证明，与标准PPO和掩蔽方法相比，我们的方法显著提升了Procgen基准对未见任务的策略泛化。

RL-RIG: A Generative Spatial Reasoner via Intrinsic Reflection

RL-RIG：通过内在反射实现生成空间推理器

Authors: Tianyu Wang, Zhiyuan Ma, Qian Wang, Xinyi Zhang, Xinwei Long, Bowen Zhou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.19974
Pdf link: https://arxiv.org/pdf/2602.19974
Abstract Recent advancements in image generation have achieved impressive results in producing high-quality images. However, existing image generation models still generally struggle with a spatial reasoning dilemma, lacking the ability to accurately capture fine-grained spatial relationships from the prompt and correctly generate scenes with structural integrity. To mitigate this dilemma, we propose RL-RIG, a Reinforcement Learning framework for Reflection-based Image Generation. Our architecture comprises four primary components: Diffuser, Checker, Actor, and Inverse Diffuser, following a Generate-Reflect-Edit paradigm to spark the Chain of Thought reasoning ability in image generation for addressing the dilemma. To equip the model with better intuition over generation trajectories, we further develop Reflection-GRPO to train the VLM Actor for edit prompts and the Image Editor for better image quality under a given prompt, respectively. Unlike traditional approaches that solely produce visually stunning yet structurally unreasonable content, our evaluation metrics prioritize spatial accuracy, utilizing Scene Graph IoU and employing a VLM-as-a-Judge strategy to assess the spatial consistency of generated images on LAION-SG dataset. Experimental results show that RL-RIG outperforms existing state-of-the-art open-source models by up to 11% in terms of controllable and precise spatial reasoning in image generation.
中文摘要 近年来图像生成的进步在生成高质量图像方面取得了显著成果。然而，现有的图像生成模型通常仍面临空间推理困境，缺乏准确捕捉提示中细粒度空间关系和正确生成结构完整性场景的能力。为缓解这一困境，我们提出了RL-RIG，一种基于反射的图像生成强化学习框架。我们的架构由四个主要组件组成：扩散器、检查器、演员和反向扩散器，遵循生成-反射-编辑范式，激发思维链推理能力，用于图像生成以解决这一难题。为了增强模型对生成轨迹的直觉，我们进一步开发了Reflection-GRPO，分别训练VLM演员用于编辑提示和图像编辑器，以提升特定提示下的图像质量。与传统方法仅制作视觉惊艳但结构不合理内容不同，我们的评估指标优先考虑空间准确性，利用场景图 IoU，并采用 VLM 作为评判策略评估 LAION-SG 数据集生成图像的空间一致性。实验结果显示，RL-RIG在图像生成的可控和精确空间推理方面，比现有最先进的开源模型高出多达11%。
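
Illustrative note (not from the paper): the abstract above names Scene Graph IoU as an evaluation metric without defining it. One plausible reading, sketched below, treats each scene graph as a set of (subject, relation, object) triples and takes intersection over union. The function name `scene_graph_iou`, the triple representation, and the example scene are assumptions for illustration; the paper's actual metric may match nodes and edges differently.

```python
def scene_graph_iou(predicted, reference):
    """Intersection over union of two sets of relation triples."""
    pred, ref = set(predicted), set(reference)
    union = pred | ref
    if not union:
        return 1.0  # two empty graphs are treated as identical
    return len(pred & ref) / len(union)

# Hypothetical graphs: the generated image puts the lamp on the wrong side.
reference = [("cat", "on", "sofa"), ("lamp", "left of", "sofa")]
predicted = [("cat", "on", "sofa"), ("lamp", "right of", "sofa")]
print(scene_graph_iou(predicted, reference))
```

Under this reading a single wrong spatial relation costs twice: it is missing from the intersection and adds an extra triple to the union.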

A Secure and Private Distributed Bayesian Federated Learning Design

一种安全且私密的分布式贝叶斯联合学习设计

Authors: Nuocheng Yang, Sihua Wang, Zhaohui Yang, Mingzhe Chen, Changchuan Yin, Kaibin Huang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.20003
Pdf link: https://arxiv.org/pdf/2602.20003
Abstract Distributed Federated Learning (DFL) enables decentralized model training across large-scale systems without a central parameter server. However, DFL faces three critical challenges: privacy leakage from honest-but-curious neighbors, slow convergence due to the lack of central coordination, and vulnerability to Byzantine adversaries aiming to degrade model accuracy. To address these issues, we propose a novel DFL framework that integrates Byzantine robustness, privacy preservation, and convergence acceleration. Within this framework, each device trains a local model using a Bayesian approach and independently selects an optimal subset of neighbors for posterior exchange. We formulate this neighbor selection as an optimization problem to minimize the global loss function under security and privacy constraints. Solving this problem is challenging because devices only possess partial network information, and the complex coupling between topology, security, and convergence remains unclear. To bridge this gap, we first analytically characterize the trade-offs between dynamic connectivity, Byzantine detection, privacy levels, and convergence speed. Leveraging these insights, we develop a fully distributed Graph Neural Network (GNN)-based Reinforcement Learning (RL) algorithm. This approach enables devices to make autonomous connection decisions based on local observations. Simulation results demonstrate that our method achieves superior robustness and efficiency with significantly lower overhead compared to traditional security and privacy schemes.
中文摘要 分布式联合学习（DFL）实现了跨大型系统的去中心化模型训练，无需中央参数服务器。然而，DFL面临三大关键挑战：来自诚实但好奇的邻居带来的隐私泄露、缺乏中央协调导致趋同缓慢，以及易受到拜占庭式对手的脆弱性，这些敌人试图降低模型的准确性。为解决这些问题，我们提出了一个新颖的DFL框架，整合了拜占庭鲁棒性、隐私保护和融合加速。在此框架下，每个设备使用贝叶斯方法训练局部模型，并独立选择一个最优邻居子集进行后验交换。我们将邻居选择作为优化问题表述，以在安全和隐私约束下最小化全局损失函数。解决这一问题具有挑战性，因为设备仅拥有部分网络信息，且拓扑、安全性与融合之间的复杂耦合尚不明确。为弥合这一差距，我们首先分析了动态连接性、拜占庭检测、隐私级别和融合速度之间的权衡。基于这些洞见，我们开发了一种基于全分布式图神经网络（GNN）的强化学习（RL）算法。这种方法使设备能够基于本地观察做出自主连接决策。模拟结果表明，我们的方法在与传统安全和隐私方案相比，实现了更优越的鲁棒性和效率，且开销显著降低。

noDice: Inference for Discrete Probabilistic Programs with Nondeterminism and Conditioning

noDice：具有非确定性和条件处理的离散概率程序的推断

Authors: Tobias Gürtler, Benjamin Lucien Kaminski
Subjects: Logic in Computer Science (cs.LO)
Arxiv link: https://arxiv.org/abs/2602.20049
Pdf link: https://arxiv.org/pdf/2602.20049
Abstract Probabilistic programming languages (PPLs) are an expressive and intuitive means of representing complex probability distributions. In that realm, languages like Dice target an important class of probabilistic programs: those whose probability distributions are discrete. Discrete distributions are common in many fields, including text analysis, network verification, artificial intelligence, and graph analysis. Another important feature in the world of probabilistic modeling are nondeterministic choices as found in Markov Decision Processes (MDPs) which play a major role in reinforcement learning. Modern PPLs usually lack support for nondeterminism. We address this gap with the introduction of noDice, which extends the discrete probabilistic inference engine Dice. noDice performs inference on loop-free programs by constructing an MDP so that the distributions modeled by the program correspond to schedulers in the MDP. Furthermore, decision diagrams are used as an intermediate step to exploit the program structure and drastically reduce the state space of the MDP.
中文摘要 概率编程语言（PPL）是一种表达性强且直观的复杂概率分布方式。在这方面，像Dice这样的语言针对一类重要的概率程序：那些概率分布是离散的。离散分布在许多领域都很常见，包括文本分析、网络验证、人工智能和图分析。概率建模领域的另一个重要特征是非确定性选择，如马尔可夫决策过程（MDPs）中，在强化学习中起着重要作用。现代PPL通常缺乏对非确定性的支持。我们通过引入noDice来弥补这一空白，它扩展了离散概率推断引擎Dice。noDice 通过构造一个 MDP，使程序建模的分布对应于 MDP 中的调度器，从而对无循环程序进行推断。此外，决策图作为利用程序结构并大幅缩小MDP状态空间的中间步骤。

Descent-Guided Policy Gradient for Scalable Cooperative Multi-Agent Learning

可扩展合作多代理学习的下降引导策略梯度

Authors: Shan Yang, Yang Liu
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.20078
Pdf link: https://arxiv.org/pdf/2602.20078
Abstract Scaling cooperative multi-agent reinforcement learning (MARL) is fundamentally limited by cross-agent noise: when agents share a common reward, the actions of all $N$ agents jointly determine each agent's learning signal, so cross-agent noise grows with $N$. In the policy gradient setting, per-agent gradient estimate variance scales as $\Theta(N)$, yielding sample complexity $\mathcal{O}(N/\epsilon)$. We observe that many domains -- cloud computing, transportation, power systems -- have differentiable analytical models that prescribe efficient system states. In this work, we propose Descent-Guided Policy Gradient (DG-PG), a framework that constructs noise-free per-agent guidance gradients from these analytical models, decoupling each agent's gradient from the actions of all others. We prove that DG-PG reduces gradient variance from $\Theta(N)$ to $\mathcal{O}(1)$, preserves the equilibria of the cooperative game, and achieves agent-independent sample complexity $\mathcal{O}(1/\epsilon)$. On a heterogeneous cloud scheduling task with up to 200 agents, DG-PG converges within 10 episodes at every tested scale -- from $N=5$ to $N=200$ -- directly confirming the predicted scale-invariant complexity, while MAPPO and IPPO fail to converge under identical architectures.
中文摘要 扩展合作多智能体强化学习（MARL）从根本上受限于跨智能体噪声：当智能体共享共同奖励时，所有 $N$ 个智能体的动作共同决定每个智能体的学习信号，因此跨智能体噪声随 $N$ 增长。在策略梯度设置下，每个智能体的梯度估计方差按 $\Theta(N)$ 增长，从而得到样本复杂度 $\mathcal{O}(N/\epsilon)$。我们观察到许多领域（云计算、交通、电力系统）都拥有可微分的解析模型，可以给出高效的系统状态。在本研究中，我们提出了下降引导策略梯度（DG-PG）框架，该框架从这些解析模型中构建无噪声的逐智能体引导梯度，将每个智能体的梯度与其他所有智能体的动作解耦。我们证明DG-PG将梯度方差从 $\Theta(N)$ 降低到 $\mathcal{O}(1)$，保持合作博弈的均衡，并实现与智能体数量无关的样本复杂度 $\mathcal{O}(1/\epsilon)$。在拥有多达200个智能体的异构云调度任务中，DG-PG在每个测试规模（从 $N=5$ 到 $N=200$）下都能在10个回合内收敛，直接验证了预测的规模不变复杂度，而MAPPO和IPPO在相同架构下未能收敛。
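
Illustrative note (not from the paper): the abstract above argues that a shared reward makes each agent's learning signal noisier as $N$ grows. The toy Monte Carlo check below illustrates only that intuition: if the shared reward is a sum of $N$ independent contributions, its variance grows linearly in $N$. The function name, the choice of contributions drawn uniformly from {-1, +1}, the trial count, and the seed are assumptions for illustration; this is not the paper's gradient-variance derivation and not DG-PG itself.

```python
import random
import statistics

def shared_reward_variance(num_agents, num_trials=2000, seed=0):
    """Empirical variance of a reward summed over independent agents.

    Each agent contributes -1 or +1 with equal probability, so the
    exact variance of the shared reward equals num_agents.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    rewards = [
        sum(rng.choice((-1.0, 1.0)) for _ in range(num_agents))
        for _ in range(num_trials)
    ]
    return statistics.pvariance(rewards)

for n in (5, 50, 200):
    print(n, round(shared_reward_variance(n), 2))
```

The printed variances should land close to 5, 50, and 200, which is the linear growth a per-agent guidance signal would aim to avoid.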

Adaptive Underwater Acoustic Communications with Limited Feedback: An AoI-Aware Hierarchical Bandit Approach

有限反馈的自适应水下声学通信：一种AoI感知的分层老虎机方法

Authors: Fabio Busacca, Andrea Panebianco, Yin Sun
Subjects: Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2602.20105
Pdf link: https://arxiv.org/pdf/2602.20105
Abstract Underwater Acoustic (UWA) networks are vital for remote sensing and ocean exploration but face inherent challenges such as limited bandwidth, long propagation delays, and highly dynamic channels. These constraints hinder real-time communication and degrade overall system performance. To address these challenges, this paper proposes a bilevel Multi-Armed Bandit (MAB) framework. At the fast inner level, a Contextual Delayed MAB (CD-MAB) jointly optimizes adaptive modulation and transmission power based on both channel state feedback and its Age of Information (AoI), thereby maximizing throughput. At the slower outer level, a Feedback Scheduling MAB dynamically adjusts the channel-state feedback interval according to throughput dynamics: stable throughput allows longer update intervals, while throughput drops trigger more frequent updates. This adaptive mechanism reduces feedback overhead and enhances responsiveness to varying network conditions. The proposed bilevel framework is computationally efficient and well-suited to resource-constrained UWA networks. Simulation results using the DESERT Underwater Network Simulator demonstrate throughput gains of up to 20.61% and energy savings of up to 36.60% compared with Deep Reinforcement Learning (DRL) baselines reported in the existing literature.
中文摘要 水下声学（UWA）网络对于遥感和海洋探索至关重要，但面临带宽有限、传播延迟长以及信道高度动态等固有挑战。这些限制阻碍了实时通信并降低整体系统性能。为应对这些挑战，本文提出了一个双层多臂老虎机（MAB）框架。在快速的内层，上下文延迟多臂老虎机（CD-MAB）基于信道状态反馈及其信息年龄（AoI）联合优化自适应调制和发射功率，从而最大化吞吐量。在较慢的外层，反馈调度MAB根据吞吐量动态调整信道状态反馈间隔：吞吐量稳定时允许更长的更新间隔，而吞吐量下降则触发更频繁的更新。这种自适应机制降低了反馈开销，增强了对变化网络状况的响应能力。所提出的双层框架计算效率高，非常适合资源受限的UWA网络。使用DESERT水下网络模拟器得到的仿真结果显示，与现有文献中报道的深度强化学习（DRL）基线相比，吞吐量提升高达20.61%，节能高达36.60%。

ReSyn: Autonomously Scaling Synthetic Environments for Reasoning Models

ReSyn：为推理模型自主扩展合成环境

Authors: Andre He, Nathaniel Weir, Kaj Bostrom, Allen Nie, Darion Cassel, Sam Bayless, Huzefa Rangwala
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.20117
Pdf link: https://arxiv.org/pdf/2602.20117
Abstract Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising approach for training reasoning language models (RLMs) by leveraging supervision from verifiers. Although verifier implementation is easier than solution annotation for many tasks, existing synthetic data generation methods remain largely solution-centric, while verifier-based methods rely on a few hand-crafted procedural environments. In this work, we scale RLVR by introducing ReSyn, a pipeline that generates diverse reasoning environments equipped with instance generators and verifiers, covering tasks such as constraint satisfaction, algorithmic puzzles, and spatial reasoning. A Qwen2.5-7B-Instruct model trained with RL on ReSyn data achieves consistent gains across reasoning benchmarks and out-of-domain math benchmarks, including a 27% relative improvement on the challenging BBEH benchmark. Ablations show that verifier-based supervision and increased task diversity both contribute significantly, providing empirical evidence that generating reasoning environments at scale can enhance reasoning abilities in RLMs.
中文摘要 带可验证奖励的强化学习（RLVR）已成为一种有前景的方法，通过利用验证器的监督来训练推理语言模型（RLM）。尽管对许多任务而言，实现验证器比标注解答更容易，但现有的合成数据生成方法仍然主要以解答为中心，而基于验证器的方法则依赖少数手工构建的程序化环境。在本研究中，我们通过引入ReSyn来扩展RLVR，这是一条生成多样推理环境的流水线，这些环境配备实例生成器和验证器，涵盖约束满足、算法谜题和空间推理等任务。在ReSyn数据上用RL训练的Qwen2.5-7B-Instruct模型在推理基准和域外数学基准上取得了一致的提升，包括在具有挑战性的BBEH基准上27%的相对提升。消融实验显示，基于验证器的监督和更高的任务多样性均有显著贡献，这提供了实证证据，表明大规模生成推理环境能提升RLM的推理能力。

LAD: Learning Advantage Distribution for Reasoning

LAD：推理中的学习优势分布

Authors: Wendi Li, Sharon Li
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.20132
Pdf link: https://arxiv.org/pdf/2602.20132
Abstract Current reinforcement learning objectives for large-model reasoning primarily focus on maximizing expected rewards. This paradigm can lead to overfitting to dominant reward signals, while neglecting alternative yet valid reasoning trajectories, thereby limiting diversity and exploration. To address this issue, we introduce Learning Advantage Distributions (LAD), a distribution-matching framework that replaces advantage maximization with learning the advantage-induced distribution. By establishing the equivalence between the optimal policy update and an advantage-based target distribution, we derive a practical LAD objective formulated as minimizing an $f$-divergence between the policy-induced and advantage-induced distributions. This yields a gradient update that increases likelihood for high-advantage responses while suppressing over-confident probability growth, preventing collapse without requiring auxiliary entropy regularization. LAD incurs no extra training cost compared to GRPO and scales naturally to LLM post-training. In a controlled bandit setting, LAD faithfully recovers the multimodal advantage distribution, validating the theoretical formulation. Experiments on math and code reasoning tasks across several LLM backbones show that LAD reliably improves both accuracy and generative diversity.
中文摘要 当前大模型推理的强化学习目标主要聚焦于最大化期望奖励。这种范式可能导致对主导奖励信号的过拟合，同时忽视其他同样有效的推理轨迹，从而限制多样性和探索。为解决这个问题，我们引入了学习优势分布（LAD），这是一种分布匹配框架，用学习优势诱导的分布取代优势最大化。通过确立最优策略更新与基于优势的目标分布之间的等价性，我们推导出一个实用的LAD目标，即最小化策略诱导分布与优势诱导分布之间的 $f$-散度。由此得到的梯度更新会提高高优势回答的似然，同时抑制过度自信的概率增长，无需辅助熵正则化即可防止坍缩。与GRPO相比，LAD不增加额外的训练成本，并可自然扩展到LLM后训练。在受控的老虎机环境中，LAD忠实地恢复了多模态优势分布，验证了理论表述。在多个LLM骨干上的数学和代码推理任务实验表明，LAD能够可靠地提高准确率和生成多样性。

Keyword: diffusion policy

AdaWorldPolicy: World-Model-Driven Diffusion Policy with Online Adaptive Learning for Robotic Manipulation

AdaWorldPolicy：基于在线自适应学习的世界模型驱动扩散策略，用于机器人操作

Authors: Ge Yuan, Qiyuan Qiao, Jing Zhang, Dong Xu
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.20057
Pdf link: https://arxiv.org/pdf/2602.20057
Abstract Effective robotic manipulation requires policies that can anticipate physical outcomes and adapt to real-world environments. In this work, we introduce a unified framework, World-Model-Driven Diffusion Policy with Online Adaptive Learning (AdaWorldPolicy) to enhance robotic manipulation under dynamic conditions with minimal human involvement. Our core insight is that world models provide strong supervision signals, enabling online adaptive learning in dynamic environments, which can be complemented by force-torque feedback to mitigate dynamic force shifts. Our AdaWorldPolicy integrates a world model, an action expert, and a force predictor, all implemented as interconnected Flow Matching Diffusion Transformers (DiT). They are interconnected via the multi-modal self-attention layers, enabling deep feature exchange for joint learning while preserving their distinct modularity characteristics. We further propose a novel Online Adaptive Learning (AdaOL) strategy that dynamically switches between an Action Generation mode and a Future Imagination mode to drive reactive updates across all three modules. This creates a powerful closed-loop mechanism that adapts to both visual and physical domain shifts with minimal overhead. Across a suite of simulated and real-robot benchmarks, our AdaWorldPolicy achieves state-of-the-art performance, with dynamical adaptive capacity to out-of-distribution scenarios.
中文摘要 有效的机器人操作需要能够预见物理结果并适应现实环境的策略。在本研究中，我们引入了一个统一框架，即世界模型驱动的在线自适应学习扩散策略（AdaWorldPolicy），旨在以极少的人工干预增强机器人在动态条件下的操作能力。我们的核心见解是，世界模型提供了强有力的监督信号，使在线自适应学习能够在动态环境中实现，同时还可以通过力-力矩反馈来缓解动态力的变化。我们的AdaWorldPolicy集成了世界模型、动作专家和力预测器，三者均实现为相互连接的流匹配扩散Transformer（DiT）。它们通过多模态自注意力层相互连接，实现深度特征交换以进行联合学习，同时保持各自的模块化特性。我们还提出了一种新颖的在线自适应学习（AdaOL）策略，该策略在动作生成模式和未来想象模式之间动态切换，以驱动三个模块的反应式更新。这构成了一个强大的闭环机制，能够以最小的开销适应视觉和物理领域的偏移。在一系列仿真和真实机器人基准测试中，我们的AdaWorldPolicy实现了最先进的性能，并具备对分布外场景的动态自适应能力。

Keyword: reinforcement learning

FineRef: Fine-Grained Error Reflection and Correction for Long-Form Generation with Citations

FineRef：长格式生成的细粒度错误反思与纠正，含引用

Learning to Remember: End-to-End Training of Memory Agents for Long-Context Reasoning

学习记忆：记忆代理的端到端训练以实现长上下文推理

Deep Reinforcement Learning for Optimizing Energy Consumption in Smart Grid Systems

深度强化学习优化智能电网系统中的能耗

1D-Bench: A Benchmark for Iterative UI Code Generation with Visual Feedback in Real-World

1D-Bench：现实世界中带有视觉反馈的迭代UI代码生成基准

Hierarchical Reward Design from Language: Enhancing Alignment of Agent Behavior with Human Specifications

语言中的层级奖励设计：增强智能体行为与人类规范的对齐

DP-RFT: Learning to Generate Synthetic Text via Differentially Private Reinforcement Fine-Tuning

DP-RFT：通过差分私有强化微调学习生成合成文本

Adaptive Time Series Reasoning via Segment Selection

通过段选择实现自适应时间序列推理

Toward AI Autonomous Navigation for Mechanical Thrombectomy using Hierarchical Modular Multi-agent Reinforcement Learning (HM-MARL)

迈向基于分层模块化多智能体强化学习（HM-MARL）的机械血栓切除术AI自主导航

In-Context Planning with Latent Temporal Abstractions

含潜在时间抽象的上下文规划

LMFPPO-UBP: Local Mean Field Proximal Policy Optimization with Unbalanced Punishment for Spatial Public Goods Games

LMFPPO-UBP：空间公共物品博弈的局部均值场近端策略优化，带有不平衡惩罚

Task-Aware Exploration via a Predictive Bisimulation Metric

通过预测双模拟指标实现任务感知探索

HONEST-CAV: Hierarchical Optimization of Network Signals and Trajectories for Connected and Automated Vehicles with Multi-Agent Reinforcement Learning

HONEST-CAV：利用多智能体强化学习，实现联网和自动化车辆网络信号和轨迹的分层优化

TAG: Thinking with Action Unit Grounding for Facial Expression Recognition

TAG：基于动作单元定位的思考，用于面部表情识别

Carbon-aware decentralized dynamic task offloading in MIMO-MEC networks via multi-agent reinforcement learning

通过多智能体强化学习实现MIMO-MEC网络中的碳感知去中心化动态任务卸载

Issues with Measuring Task Complexity via Random Policies in Robotic Tasks

机器人任务中通过随机策略测量任务复杂性的问题

VariBASed: Variational Bayes-Adaptive Sequential Monte-Carlo Planning for Deep Reinforcement Learning

VariBASed：变分贝叶斯自适应序列蒙特卡洛规划中的深度强化学习

Gait Asymmetry from Unilateral Weakness and Improvement With Ankle Assistance: a Reinforcement Learning based Simulation Study

单侧无力与踝关节辅助改善带来的步态不对称：基于强化学习的模拟研究

TPRU: Advancing Temporal and Procedural Understanding in Large Multimodal Models

TPRU：推进大型多模态模型中的时间和过程理解

DeepInterestGR: Mining Deep Multi-Interest Using Multi-Modal LLMs for Generative Recommendation

DeepInterestGR：利用多模态大型语言模型挖掘深度多兴趣生成推荐

IDSelect: A RL-Based Cost-Aware Selection Agent for Video-based Multi-Modal Person Recognition

IDSelect：基于强化学习的成本感知选择代理，用于基于视频的多模态人物识别

MagicAgent: Towards Generalized Agent Planning

MagicAgent：迈向通用代理规划

Learning to Detect Language Model Training Data via Active Reconstruction

通过主动重建学习检测语言模型训练数据

Human-to-Robot Interaction: Learning from Video Demonstration for Robot Imitation

人与机器人交互：通过视频演示学习机器人模仿

Adaptive Problem Generation via Symbolic Representations

通过符号表示实现自适应问题生成

How to Allocate, How to Learn? Dynamic Rollout Allocation and Advantage Modulation for Policy Optimization

如何分配，如何学习？动态推广分配与优势调制以优化策略

Characterizing MARL for Energy Control: A Multi-KPI Benchmark on the CityLearn Environment

MARL用于能源控制的特性描述：CityLearn环境的多关键绩效指标基准

Robust Exploration in Directed Controller Synthesis via Reinforcement Learning with Soft Mixture-of-Experts

通过软专家混合强化学习实现定向控制器合成中的稳健探索

DGPO: RL-Steered Graph Diffusion for Neural Architecture Generation

DGPO：用于神经结构生成的强化学习引导图扩散

ALPACA: A Reinforcement Learning Environment for Medication Repurposing and Treatment Optimization in Alzheimer's Disease

ALPACA：阿尔茨海默病药物再利用与治疗优化的强化学习环境

TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics

TOPReward：词元概率作为机器人领域隐藏的零样本奖励

Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering

学习在个性化问答中多步检索个人语境的推理

Soft Sequence Policy Optimization: Bridging GMPO and SAPO

软序列策略优化：连接GMPO与SAPO

LLMs Can Learn to Reason Via Off-Policy RL

LLMs可以通过非策略强化学习推理

Stable Deep Reinforcement Learning via Isotropic Gaussian Representations

通过各向同性高斯表示的稳定深度强化学习

IR$^3$: Contrastive Inverse Reinforcement Learning for Interpretable Detection and Mitigation of Reward Hacking

IR$^3$：对比逆强化学习用于可解释的奖励黑客检测与缓解

RAmmStein: Regime Adaptation in Mean-reverting Markets with Stein Thresholds -- Optimal Impulse Control in Concentrated AMMs

RAmmStein：均值回归市场中的体制适应——集中AMMs中的最优冲动控制

A Reinforcement Learning-based Transmission Expansion Framework Considering Strategic Bidding in Electricity Markets

基于强化学习的输电扩展框架，考虑电力市场中的战略竞标

Sizing of Battery Considering Renewable Energy Bidding Strategy with Reinforcement Learning

电池规模评估：结合强化学习考虑可再生能源招标策略

UrbanAlign: Post-hoc Semantic Calibration for VLM-Human Preference Alignment

UrbanAlign：VLM-人类偏好对齐的事后语义校准

SenTSR-Bench: Thinking with Injected Knowledge for Time-Series Reasoning

SenTSR-Bench：利用注入知识进行思考的时间序列推理