Arxiv Papers of Today

Generated: 2026-03-05 16:45:55 (UTC+8); arXiv release time: 2026-03-05 20:00 EST (2026-03-06 09:00 UTC+8)

38 relevant papers today

Keyword: reinforcement learning

SE-Search: Self-Evolving Search Agent via Memory and Dense Reward

Authors: Jian Li, Yizhang Jin, Dongqi Liu, Hang Ding, Jiafu Wu, Dongsheng Chen, Yunhang Shen, Yulei Qin, Ying Tai, Chengjie Wang, Xiaotong Yuan, Yabiao Wang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.03293
Pdf link: https://arxiv.org/pdf/2603.03293
Abstract Retrieval-augmented generation (RAG) reduces hallucinations and factual errors in large language models (LLMs) by conditioning generation on retrieved external knowledge. Recent search agents further cast RAG as an autonomous, multi-turn information-seeking process. However, existing methods often accumulate irrelevant or noisy documents and rely on sparse reinforcement learning signals. We propose SE-Search, a Self-Evolving Search agent that improves online search behavior through three components: memory purification, atomic query training, and dense rewards. SE-Search follows a Think-Search-Memorize strategy that retains salient evidence while filtering irrelevant content. Atomic query training promotes shorter and more diverse queries, improving evidence acquisition. Dense rewards provide fine-grained feedback that speeds training. Experiments on single-hop and multi-hop question answering benchmarks show that SE-Search-3B outperforms strong baselines, yielding a 10.8 point absolute improvement and a 33.8% relative gain over Search-R1. We will make the code and model weights publicly available upon acceptance.
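As a quick sanity check on how the two reported numbers relate: a 10.8 point absolute gain that is also a 33.8% relative gain implies a baseline of roughly 10.8 / 0.338, or about 32 points. The sketch below does only that arithmetic. The function name is hypothetical, and the implied scores are derived here rather than reported in the abstract.

```python
def implied_scores(abs_gain: float, rel_gain: float) -> tuple[float, float]:
    """Back out the baseline and improved scores from two reported gains.

    rel_gain is a fraction (0.338 for 33.8%). This assumes both gains
    refer to the same averaged metric, which the abstract does not state.
    """
    baseline = abs_gain / rel_gain
    return baseline, baseline + abs_gain


baseline, improved = implied_scores(10.8, 0.338)
print(f"implied Search-R1 score: {baseline:.1f}")
print(f"implied SE-Search-3B score: {improved:.1f}")
```

If the two gains were measured on different benchmark subsets, the implied values would not hold, so they should be read as a consistency check only.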

HumanLM: Simulating Users with State Alignment Beats Response Imitation

Authors: Shirley Wu, Evelyn Choi, Arpandeep Khatua, Zhanghan Wang, Joy He-Yueya, Tharindu Cyril Weerasooriya, Wei Wei, Diyi Yang, Jure Leskovec, James Zou
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.03303
Pdf link: https://arxiv.org/pdf/2603.03303
Abstract Large Language Models (LLMs) are increasingly used to simulate how specific users respond to a given context, enabling more user-centric applications that rely on user feedback. However, existing user simulators mostly imitate surface-level patterns and language styles, which fail to reflect the underlying states of real users (e.g., beliefs and emotions). To address these limitations, we propose a novel training framework, HumanLM, which builds user simulators that accurately reflect real users. Our key insight is that, in addition to generating responses, the model should generate natural-language latent states that align with ground-truth responses through reinforcement learning. These latent states correspond to a set of psychologically grounded state dimensions that drive how real users respond. HumanLM further synthesizes these aligned latent states into responses that accurately represent real users. For extensive evaluation, we develop Humanual, a comprehensive benchmark for simulating real users based on public data. Humanual consists of six large-scale datasets with 26k users and 216k responses in total, spanning diverse tasks such as generating user responses to daily life issues, political blogs, and chat sessions with LLM assistants. Across datasets, HumanLM significantly outperforms alternative approaches, achieving an average relative improvement of 16.3% in alignment scores from an LLM judge. In a real-time simulation study with 111 participants, HumanLM achieves the highest similarity to real user responses and competitive human-likeness scores.

Sleeper Cell: Injecting Latent Malice Temporal Backdoors into Tool-Using LLMs

Authors: Bhanu Pallakonda, Mikkel Hindsbo, Sina Ehsani, Prag Mishra
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.03371
Pdf link: https://arxiv.org/pdf/2603.03371
Abstract The proliferation of open-weight Large Language Models (LLMs) has democratized agentic AI, yet fine-tuned weights are frequently shared and adopted with limited scrutiny beyond leaderboard performance. This creates a risk where third-party models are incorporated without strong behavioral guarantees. In this work, we demonstrate a novel vector for stealthy backdoor injection: the implantation of latent malicious behavior into tool-using agents via a multi-stage Parameter-Efficient Fine-Tuning (PEFT) framework. Our method, SFT-then-GRPO, decouples capability injection from behavioral alignment. First, we use SFT with LoRA to implant a "sleeper agent" capability. Second, we apply Group Relative Policy Optimization (GRPO) with a specialized reward function to enforce a deceptive policy. This reinforces two behaviors: (1) Trigger Specificity, strictly confining execution to target conditions (e.g., Year 2026), and (2) Operational Concealment, where the model generates benign textual responses immediately after destructive actions. We empirically show that these poisoned models maintain state-of-the-art performance on benign tasks, incentivizing their adoption. Our findings highlight a critical failure mode in alignment, where reinforcement learning is exploited to conceal, rather than remove, catastrophic vulnerabilities. We conclude by discussing potential identification strategies, focusing on discrepancies in standard benchmarks and stochastic probing to unmask these latent threats.

MemSifter: Offloading LLM Memory Retrieval via Outcome-Driven Proxy Reasoning

Authors: Jiejun Tan, Zhicheng Dou, Liancheng Zhang, Yuyang Hu, Yiruo Cheng, Ji-Rong Wen
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.03379
Pdf link: https://arxiv.org/pdf/2603.03379
Abstract As Large Language Models (LLMs) are increasingly used for long-duration tasks, maintaining effective long-term memory has become a critical challenge. Current methods often face a trade-off between cost and accuracy. Simple storage methods often fail to retrieve relevant information, while complex indexing methods (such as memory graphs) require heavy computation and can cause information loss. Furthermore, relying on the working LLM to process all memories is computationally expensive and slow. To address these limitations, we propose MemSifter, a novel framework that offloads the memory retrieval process to a small-scale proxy model. Instead of increasing the burden on the primary working LLM, MemSifter uses a smaller model to reason about the task before retrieving the necessary information. This approach requires no heavy computation during the indexing phase and adds minimal overhead during inference. To optimize the proxy model, we introduce a memory-specific Reinforcement Learning (RL) training paradigm. We design a task-outcome-oriented reward based on the working LLM's actual performance in completing the task. The reward measures the actual contribution of retrieved memories through multiple interactions with the working LLM, and discriminates between retrieved rankings using stepwise decreasing contributions. Additionally, we employ training techniques such as Curriculum Learning and Model Merging to improve performance. We evaluated MemSifter on eight LLM memory benchmarks, including Deep Research tasks. The results demonstrate that our method meets or exceeds the performance of existing state-of-the-art approaches in both retrieval accuracy and final task completion. MemSifter offers an efficient and scalable solution for long-term LLM memory. We have open-sourced the model weights, code, and training data to support further research.

Beyond Accuracy: Evaluating Visual Grounding In Multimodal Medical Reasoning

Authors: Anas Zafar, Leema Krishna Murali, Ashish Vashist
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.03437
Pdf link: https://arxiv.org/pdf/2603.03437
Abstract Recent work shows that text-only reinforcement learning with verifiable rewards (RLVR) can match or outperform image-text RLVR on multimodal medical VQA benchmarks, suggesting current evaluation protocols may fail to measure causal visual dependence. We introduce a counterfactual evaluation framework using real, blank, and shuffled images across four medical VQA benchmarks: PathVQA, PMC-VQA, SLAKE, and VQA-RAD. Beyond accuracy, we measure Visual Reliance Score (VRS), Image Sensitivity (IS), and introduce Hallucinated Visual Reasoning Rate (HVRR) to detect cases where models generate visual claims despite producing image-invariant answers. Our findings reveal that RLVR improves accuracy while degrading visual grounding: text-only RLVR achieves negative VRS on PathVQA (-0.09), performing better with mismatched images, while image-text RLVR reduces image sensitivity to 39.8% overall despite improving accuracy. On VQA-RAD, both variants achieve 63% accuracy through different mechanisms: text-only RLVR retains 81% performance with blank images, while image-text RLVR shows only 29% image sensitivity. Models generate visual claims in 68-74% of responses, yet 38-43% are ungrounded (HVRR). These findings demonstrate that accuracy-only rewards enable shortcut exploitation, and progress requires grounding-aware evaluation protocols and training objectives that explicitly enforce visual dependence.
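The counterfactual protocol above can be illustrated with a toy sketch. The abstract does not give the exact formulas for VRS or IS, so the two metrics below are assumptions: a "reliance gap" (accuracy with real images minus accuracy with shuffled images, which goes negative when mismatched images help) and an "image sensitivity" (fraction of answers that change when the image is blanked). All names and data are hypothetical.

```python
def accuracy(preds, golds):
    """Fraction of predictions that exactly match the gold answers."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def reliance_gap(real_preds, shuffled_preds, golds):
    """Assumed stand-in for a visual reliance score:
    accuracy with the real image minus accuracy with a mismatched image."""
    return accuracy(real_preds, golds) - accuracy(shuffled_preds, golds)


def image_sensitivity(real_preds, blank_preds):
    """Assumed stand-in for image sensitivity:
    fraction of answers that change when the image is replaced by a blank."""
    changed = sum(r != b for r, b in zip(real_preds, blank_preds))
    return changed / len(real_preds)


# Toy answers for four yes/no questions under the three image conditions.
golds = ["yes", "no", "yes", "no"]
real = ["yes", "no", "no", "no"]
shuffled = ["yes", "no", "yes", "no"]
blank = ["yes", "no", "no", "yes"]

gap = reliance_gap(real, shuffled, golds)
sensitivity = image_sensitivity(real, blank)
print(f"reliance gap: {gap:+.2f}, image sensitivity: {sensitivity:.2f}")
```

In this toy case the gap is negative, which mirrors the qualitative pattern the abstract reports for text-only RLVR on PathVQA, though the paper's own metric definitions may differ.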

[Re] FairDICE: A Gap Between Theory And Practice

Authors: Peter Adema, Karim Galliamov, Aleksey Evstratovskiy, Ross Geurts
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.03454
Pdf link: https://arxiv.org/pdf/2603.03454
Abstract Offline Reinforcement Learning (RL) is an emerging field of RL in which policies are learned solely from demonstrations. Within offline RL, some environments involve balancing multiple objectives, but existing multi-objective offline RL algorithms do not provide an efficient way to find a fair compromise. FairDICE (see arXiv:2506.08062v2) seeks to fill this gap by adapting OptiDICE (an offline RL algorithm) to automatically learn weights for multiple objectives, e.g., to incentivise fairness among objectives. As this would be a valuable contribution, this replication study examines the replicability of claims made regarding FairDICE. We find that many theoretical claims hold, but an error in the code reduces FairDICE to standard behaviour cloning in continuous environments, and many important hyperparameters were originally underspecified. After rectifying this, we show in experiments extending the original paper that FairDICE can scale to complex environments and high-dimensional rewards, though it can be reliant on (online) hyperparameter tuning. We conclude that FairDICE is a theoretically interesting method, but the experimental justification requires significant revision.

Minimax Optimal Strategy for Delayed Observations in Online Reinforcement Learning

Authors: Harin Lee, Kevin Jamieson
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2603.03480
Pdf link: https://arxiv.org/pdf/2603.03480
Abstract We study reinforcement learning with delayed state observation, where the agent observes the current state after some random number of time steps. We propose an algorithm that combines the augmentation method and the upper confidence bound approach. For tabular Markov decision processes (MDPs), we derive a regret bound of $\tilde{\mathcal{O}}(H \sqrt{D_{\max} SAK})$, where $S$ and $A$ are the cardinalities of the state and action spaces, $H$ is the time horizon, $K$ is the number of episodes, and $D_{\max}$ is the maximum length of the delay. We also provide a matching lower bound up to logarithmic factors, showing the optimality of our approach. Our analytical framework formulates this problem as a special case of a broader class of MDPs, where their transition dynamics decompose into a known component and an unknown but structured component. We establish general results for this abstract setting, which may be of independent interest.
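To get a feel for how the stated bound scales, the sketch below evaluates only its leading term, H * sqrt(D_max * S * A * K), dropping the logarithmic factors and constants hidden by the tilde-O notation. The function name and the example sizes are hypothetical.

```python
import math


def regret_leading_term(H: int, D_max: int, S: int, A: int, K: int) -> float:
    """Leading term H * sqrt(D_max * S * A * K) of the stated regret bound.

    Logarithmic factors and constants are ignored, so this shows scaling
    behaviour only and is not a numerical regret guarantee.
    """
    return H * math.sqrt(D_max * S * A * K)


# Hypothetical problem sizes, chosen so the square root is exact.
small = regret_leading_term(H=10, D_max=4, S=5, A=5, K=100)
large = regret_leading_term(H=10, D_max=4, S=5, A=5, K=400)

# Quadrupling the number of episodes doubles the bound, so the
# per-episode value halves.
print(small, large)
print(small / 100, large / 400)
```

The same square-root dependence applies to the maximum delay: under this term, quadrupling D_max also doubles the bound.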

Optimal trajectory-guided stochastic co-optimization for e-fuel system design and real-time operation

Authors: Jeongdong Kim, Minsu Kim, Jonggeol Na, Junghwan Kim
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.03484
Pdf link: https://arxiv.org/pdf/2603.03484
Abstract E-fuels are promising long-term energy carriers supporting the net-zero transition. However, the large combinatorial design-operation spaces under renewable uncertainty make the use of mathematical programming impractical for co-optimizing e-fuel production systems. Here, we present MasCOR, a machine-learning-assisted co-optimization framework that learns from global operational trajectories. By encoding system design and renewable trends, a single MasCOR agent generalizes dynamic operation across diverse configurations and scenarios, substantially simplifying design-operation co-optimization under uncertainty. Benchmark comparisons against state-of-the-art reinforcement learning baselines demonstrate near-optimal performance, while computational costs are substantially lower than those of mathematical programming, enabling rapid parallel evaluation of designs within the co-optimization loop. This framework enables rapid screening of feasible design spaces together with corresponding operational policies. When applied to four potential European sites targeting e-methanol production, MasCOR shows that most locations benefit from reducing system load below 50 MW to achieve carbon-neutral methanol production, with production costs of 1.0-1.2 USD per kg. In contrast, Dunkirk (France), with limited renewable availability and high grid prices, favors system loads above 200 MW and expanded storage to exploit dynamic grid exchange and hydrogen sales to the market. These results underscore the value of the MasCOR framework for site-specific guidance from system design to real-time operation.

Phys4D: Fine-Grained Physics-Consistent 4D Modeling from Video Diffusion

Authors: Haoran Lu, Shang Wu, Jianshu Zhang, Maojiang Su, Guo Ye, Chenwei Xu, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, Zhaoran Wang, Han Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.03485
Pdf link: https://arxiv.org/pdf/2603.03485
Abstract Recent video diffusion models have achieved impressive capabilities as large-scale generative world models. However, these models often struggle with fine-grained physical consistency, exhibiting physically implausible dynamics over time. In this work, we present Phys4D, a pipeline for learning physics-consistent 4D world representations from video diffusion models. Phys4D adopts a three-stage training paradigm that progressively lifts appearance-driven video diffusion models into physics-consistent 4D world representations. We first bootstrap robust geometry and motion representations through large-scale pseudo-supervised pretraining, establishing a foundation for 4D scene modeling. We then perform physics-grounded supervised fine-tuning using simulation-generated data, enforcing temporally consistent 4D dynamics. Finally, we apply simulation-grounded reinforcement learning to correct residual physical violations that are difficult to capture through explicit supervision. To evaluate fine-grained physical consistency beyond appearance-based metrics, we introduce a set of 4D world consistency evaluations that probe geometric coherence, motion stability, and long-horizon physical plausibility. Experimental results demonstrate that Phys4D substantially improves fine-grained spatiotemporal and physical consistency compared to appearance-driven baselines, while maintaining strong generative performance. Our project page is available at this https URL

PhyPrompt: RL-based Prompt Refinement for Physically Plausible Text-to-Video Generation

Authors: Shang Wu, Chenwei Xu, Zhuofan Xia, Weijian Li, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, Han Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.03505
Pdf link: https://arxiv.org/pdf/2603.03505
Abstract State-of-the-art text-to-video (T2V) generators frequently violate physical laws despite high visual quality. We show this stems from insufficient physical constraints in prompts rather than model limitations: manually adding physics details reliably produces physically plausible videos, but requires expertise and does not scale. We present PhyPrompt, a two-stage reinforcement learning framework that automatically refines prompts for physically realistic generation. First, we fine-tune a large language model on a physics-focused Chain-of-Thought dataset to integrate principles like object motion and force interactions while preserving user intent. Second, we apply Group Relative Policy Optimization with a dynamic reward curriculum that initially prioritizes semantic fidelity, then progressively shifts toward physical commonsense. This curriculum achieves synergistic optimization: PhyPrompt-7B reaches 40.8% joint success on VideoPhy2 (8.6pp gain), improving physical commonsense by 11pp (55.8% to 66.8%) while simultaneously increasing semantic adherence by 4.4pp (43.4% to 47.8%). Remarkably, our curriculum exceeds single-objective training on both metrics, demonstrating compositional prompt discovery beyond conventional multi-objective trade-offs. PhyPrompt outperforms GPT-4o (+3.8% joint) and DeepSeek-V3 (+2.2%, 100× larger) using only 7B parameters. The approach transfers zero-shot across diverse T2V architectures (Lavie, VideoCrafter2, CogVideoX-5B) with up to 16.8% improvement, establishing that domain-specialized reinforcement learning with compositional curricula surpasses general-purpose scaling for physics-aware generation.
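The second stage described above combines two ideas that are easy to sketch: a reward whose weighting drifts from semantic fidelity toward physical commonsense, and GRPO-style advantages computed relative to a group of sampled prompts. The linear schedule, the 0.2 and 0.8 endpoints, and all function names below are illustrative assumptions, not the paper's actual curriculum.

```python
import statistics


def curriculum_weight(step, total_steps, start=0.2, end=0.8):
    """Weight on the physical-commonsense reward at a given training step.

    Assumed linear ramp from `start` to `end`; the real schedule is not
    specified in the abstract.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def blended_reward(semantic, physical, step, total_steps):
    """Convex combination of the two scores under the current weight."""
    w = curriculum_weight(step, total_steps)
    return (1.0 - w) * semantic + w * physical


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (mean 0, unit scale).

    Uses the population standard deviation; implementations differ on this.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Early in training the semantic score dominates; later the physical one does.
early = blended_reward(semantic=1.0, physical=0.0, step=0, total_steps=100)
late = blended_reward(semantic=1.0, physical=0.0, step=100, total_steps=100)
advantages = group_relative_advantages([1.0, 2.0, 3.0])
print(early, late, advantages)
```

Because the advantages are centered within each group, only the relative ranking of candidate prompts matters at a given step, while the curriculum changes which kind of prompt ranks highest over time.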

Q-Measure-Learning for Continuous State RL: Efficient Implementation and Convergence

Authors: Shengbo Wang
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2603.03523
Pdf link: https://arxiv.org/pdf/2603.03523
Abstract We study reinforcement learning in infinite-horizon discounted Markov decision processes with continuous state spaces, where data are generated online from a single trajectory under a Markovian behavior policy. To avoid maintaining an infinite-dimensional, function-valued estimate, we propose the novel Q-Measure-Learning, which learns a signed empirical measure supported on visited state-action pairs and reconstructs an action-value estimate via kernel integration. The method jointly estimates the stationary distribution of the behavior chain and the Q-measure through coupled stochastic approximation, leading to an efficient weight-based implementation with $O(n)$ memory and $O(n)$ computation cost per iteration. Under uniform ergodicity of the behavior chain, we prove almost sure sup-norm convergence of the induced Q-function to the fixed point of a kernel-smoothed Bellman operator. We also bound the approximation error between this limit and the optimal $Q^*$ as a function of the kernel bandwidth. To assess the performance of our proposed algorithm, we conduct RL experiments in a two-item inventory control setting.
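The reconstruction step, recovering an action-value estimate from weights attached to visited state-action pairs, can be sketched as a kernel-weighted sum. The version below assumes a one-dimensional state, discrete actions, and a Gaussian kernel on the state with exact matching on the action. These choices and all names are hypothetical; the paper's kernel and update rule may differ, and no learning step is shown.

```python
import math


def gaussian_kernel(s, s_i, bandwidth):
    """Unnormalized Gaussian kernel between two scalar states."""
    return math.exp(-((s - s_i) ** 2) / (2.0 * bandwidth ** 2))


def q_estimate(s, a, support, weights, bandwidth=0.5):
    """Evaluate Q(s, a) as a signed, kernel-weighted sum over visited pairs.

    support: list of (state, action) pairs visited so far
    weights: signed weight attached to each visited pair
    One evaluation costs O(n) in the number of visited pairs.
    """
    total = 0.0
    for (s_i, a_i), w_i in zip(support, weights):
        if a_i == a:  # assumed: the kernel only couples identical actions
            total += w_i * gaussian_kernel(s, s_i, bandwidth)
    return total


# Two visited pairs with signed weights.
support = [(0.0, 0), (1.0, 1)]
weights = [2.0, -1.0]

print(q_estimate(0.0, 0, support, weights))
print(q_estimate(0.0, 1, support, weights))
```

Storing one weight per visited pair is what keeps memory linear in the trajectory length, in line with the O(n) costs stated in the abstract.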

Hybrid Belief Reinforcement Learning for Efficient Coordinated Spatial Exploration

Authors: Danish Rizvi, David Boyle
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.03595
Pdf link: https://arxiv.org/pdf/2603.03595
Abstract Coordinating multiple autonomous agents to explore and serve spatially heterogeneous demand requires jointly learning unknown spatial patterns and planning trajectories that maximize task performance. Pure model-based approaches provide structured uncertainty estimates but lack adaptive policy learning, while deep reinforcement learning often suffers from poor sample efficiency when spatial priors are absent. This paper presents a hybrid belief-reinforcement learning (HBRL) framework to address this gap. In the first phase, agents construct spatial beliefs using a Log-Gaussian Cox Process (LGCP) and execute information-driven trajectories guided by a Pathwise Mutual Information (PathMI) planner with multi-step lookahead. In the second phase, trajectory control is transferred to a Soft Actor-Critic (SAC) agent, warm-started through dual-channel knowledge transfer: belief state initialization supplies spatial uncertainty, and replay buffer seeding provides demonstration trajectories generated during LGCP exploration. A variance-normalized overlap penalty enables coordinated coverage through shared belief state, permitting cooperative sensing in high-uncertainty regions while discouraging redundant coverage in well-explored areas. The framework is evaluated on a multi-UAV wireless service provisioning task. Results show 10.8% higher cumulative reward and 38% faster convergence over baselines, with ablation studies confirming that dual-channel transfer outperforms either channel alone.

Freezing of Gait Prediction using Proactive Agent that Learns from Selected Experience and DDQN Algorithm

Authors: Septian Enggar Sukmana (1), Sang Won Bae (2), Tomohiro Shibata (1) ((1) Kyushu Institute of Technology, (2) Stevens Institute of Technology)
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.03651
Pdf link: https://arxiv.org/pdf/2603.03651
Abstract Freezing of Gait (FOG) is a debilitating motor symptom commonly experienced by individuals with Parkinson's Disease (PD) which often leads to falls and reduced mobility. Timely and accurate prediction of FOG episodes is essential for enabling proactive interventions through assistive technologies. This study presents a reinforcement learning-based framework designed to identify optimal pre-FOG onset points, thereby extending the prediction horizon for anticipatory cueing systems. The model implements a Double Deep Q-Network (DDQN) architecture enhanced with Prioritized Experience Replay (PER) allowing the agent to focus learning on high-impact experiences and refine its policy. Trained over 9000 episodes with a reward shaping strategy that promotes cautious decision-making, the agent demonstrated robust performance in both subject-dependent and subject-independent evaluations. The model achieved a prediction horizon of up to 8.72 seconds prior to FOG onset in subject-independent scenarios and 7.89 seconds in subject-dependent settings. These results highlight the model's potential for integration into wearable assistive devices, offering timely and personalized interventions to mitigate FOG in PD patients.
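For readers unfamiliar with Double DQN, its defining step is the bootstrap target: the online network selects the next action and the target network evaluates it, which reduces the overestimation that comes from taking a single max. The sketch below shows only that target computation on plain lists. The function name and numbers are illustrative, and the paper's network, reward shaping, and prioritized replay are not reproduced.

```python
def ddqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double DQN bootstrap target for one transition.

    next_q_online: Q-values of the next state from the online network
    next_q_target: Q-values of the next state from the target network
    """
    if done:
        return reward  # no bootstrapping past a terminal state
    # Action selection uses the online network...
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    # ...while its value is read from the target network.
    return reward + gamma * next_q_target[best_action]


online = [0.2, 0.9, 0.5]
target = [1.0, 0.4, 2.0]

double_dqn = ddqn_target(1.0, online, target, gamma=0.9)
vanilla_dqn = 1.0 + 0.9 * max(target)  # plain DQN: max over the target network
print(double_dqn, vanilla_dqn)
```

Here the online network prefers action 1, so the Double DQN target uses the target network's value for that action (0.4) rather than its largest value (2.0), giving a more conservative estimate.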

Principled Learning-to-Communicate with Quasi-Classical Information Structures

Authors: Xiangyu Liu, Haoyi You, Kaiqing Zhang
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2603.03664
Pdf link: https://arxiv.org/pdf/2603.03664
Abstract Learning-to-communicate (LTC) in partially observable environments has received increasing attention in deep multi-agent reinforcement learning, where the control and communication strategies are jointly learned. Meanwhile, the impact of communication on decision-making has been extensively studied in control theory. In this paper, we seek to formalize and better understand LTC by bridging these two lines of work, through the lens of information structures (ISs). To this end, we formalize LTC in decentralized partially observable Markov decision processes (Dec-POMDPs) under the common-information-based framework from decentralized stochastic control, and classify LTC problems based on the ISs before (additional) information sharing. We first show that non-classical LTCs are computationally intractable in general, and thus focus on quasi-classical (QC) LTCs. We then propose a series of conditions for QC LTCs, under which LTCs preserve the QC IS after information sharing, whereas violating which can cause computational hardness in general. Further, we develop provable planning and learning algorithms for QC LTCs, and establish quasi-polynomial time and sample complexities for several QC LTC examples that satisfy the above conditions. Along the way, we also establish results on the relationship between (strictly) QC IS and the condition of having strategy-independent common-information-based beliefs (SI-CIBs), as well as on solving Dec-POMDPs without computationally intractable oracles but beyond those with SI-CIBs, which may be of independent interest.

MIND: Unified Inquiry and Diagnosis RL with Criteria Grounded Clinical Supports for Psychiatric Consultation

Authors: Guoyi Li, Shihao Xu, Jiatong Ma, Yunyun Han, Jianhua Chen, Yafeng Deng
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.03677
Pdf link: https://arxiv.org/pdf/2603.03677
Abstract Large language models (LLMs) have advanced medical dialogue systems, yet psychiatric consultation poses substantially higher demands due to subjective ambiguity and comorbidity complexity: an agent must continuously extract psychopathological cues from incomplete and inconsistent patient reports in multi-turn interactions and perform rigorous differential diagnostic reasoning. However, existing methods face two fundamental challenges. First, without criteria-grounded clinical supports, they are prone to unsupported clinical assertions when symptoms are atypical or underspecified. Second, in multi-turn interactions, they struggle to mitigate inquiry drift (off-topic or low-yield questioning) and optimize questioning strategies. To address these challenges, we propose MIND, a unified inquiry-diagnosis reinforcement learning framework for psychiatric consultation. Specifically, we build a Criteria-Grounded Psychiatric Reasoning Bank (PRB) that summarizes dialogue context into clinical retrieval states, retrieves semantically similar reference consultations, and distills reusable criteria-grounded clinical supports to guide criteria-aligned inquiry and reasoning. Building on this foundation, MIND enforces explicit clinical reasoning with rubric-based process rewards to provide fine-grained supervision over intermediate decision steps, and incorporates a value-aware trajectory rectification mechanism to jointly improve information acquisition and diagnostic decision-making across turns. Extensive experiments demonstrate that MIND consistently outperforms strong baselines in diagnostic accuracy, empathetic interaction quality, interpretability, and generalization.
中文摘要 大型语言模型（LLM）推动了医疗对话系统的发展，但由于主观模糊性和共病复杂性，精神科问诊提出了高得多的要求：智能体必须在多轮交互中持续从不完整且不一致的患者陈述中提取精神病理线索，并进行严格的鉴别诊断推理。然而，现有方法面临两个根本性挑战。第一，缺乏基于诊断标准的临床支持，当症状不典型或描述不充分时，它们容易做出缺乏依据的临床断言。第二，在多轮交互中，它们难以抑制问询漂移（偏离主题或低收益的提问），也难以优化提问策略。为应对这些挑战，我们提出了MIND，一个面向精神科问诊的统一问询-诊断强化学习框架。具体而言，我们构建了基于标准的精神科推理库（PRB），它将对话上下文总结为临床检索状态，检索语义相似的参考问诊记录，并提炼可复用的基于标准的临床支持，以引导符合标准的问询与推理。在此基础上，MIND通过基于评分细则的过程奖励来强制显式的临床推理，对中间决策步骤提供细粒度监督，并引入价值感知的轨迹修正机制，以在多轮交互中共同提升信息获取与诊断决策。大量实验表明，MIND在诊断准确性、共情交互质量、可解释性和泛化能力方面均稳定优于强基线。

MAGE: Meta-Reinforcement Learning for Language Agents toward Strategic Exploration and Exploitation

MAGE：面向语言智能体策略性探索与利用的元强化学习

Authors: Lu Yang, Zelai Xu, Minyang Xie, Jiaxuan Gao, Zhao Shok, Yu Wang, Yi Wu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.03680
Pdf link: https://arxiv.org/pdf/2603.03680
Abstract Large Language Model (LLM) agents have demonstrated remarkable proficiency in learned tasks, yet they often struggle to adapt to non-stationary environments with feedback. While In-Context Learning and external memory offer some flexibility, they fail to internalize the adaptive ability required for long-term improvement. Meta-Reinforcement Learning (meta-RL) provides an alternative by embedding the learning process directly within the model. However, existing meta-RL approaches for LLMs focus primarily on exploration in single-agent settings, neglecting the strategic exploitation necessary for multi-agent environments. We propose MAGE, a meta-RL framework that empowers LLM agents for strategic exploration and exploitation. MAGE utilizes a multi-episode training regime where interaction histories and reflections are integrated into the context window. By using the final episode reward as the objective, MAGE incentivizes the agent to refine its strategy based on past experiences. We further combine population-based training with an agent-specific advantage normalization technique to enrich agent diversity and ensure stable learning. Experiment results show that MAGE outperforms existing baselines in both exploration and exploitation tasks. Furthermore, MAGE exhibits strong generalization to unseen opponents, suggesting it has internalized the ability for strategic exploration and exploitation. Code is available at this https URL.
中文摘要 大型语言模型（LLM）智能体在已学习的任务上表现出色，但往往难以适应带有反馈的非平稳环境。上下文学习和外部记忆虽然提供了一定的灵活性，却无法内化长期改进所需的适应能力。元强化学习（meta-RL）通过将学习过程直接嵌入模型之中，提供了另一种途径。然而，现有面向LLM的元强化学习方法主要关注单智能体场景中的探索，忽视了多智能体环境所必需的策略性利用。我们提出MAGE，一个使LLM智能体能够进行策略性探索与利用的元强化学习框架。MAGE采用多回合（multi-episode）训练机制，将交互历史和反思整合进上下文窗口。通过以最后一个回合的奖励作为优化目标，MAGE激励智能体根据过往经验改进其策略。我们进一步将基于种群的训练与针对各智能体的优势归一化技术相结合，以丰富智能体多样性并保证学习稳定。实验结果表明，MAGE在探索任务和利用任务上均优于现有基线。此外，MAGE对未见过的对手表现出很强的泛化能力，表明它已内化了策略性探索与利用的能力。代码可在此 https URL 获取。
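
The MAGE abstract mentions an agent-specific advantage normalization technique but gives no formula. The sketch below shows one plausible reading, assuming each sample is tagged with the identity of the agent it was collected against and that final-episode rewards are standardized within each such group. The function name, the grouping key, and the toy data are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from statistics import mean, pstdev


def agent_normalized_advantages(samples, eps=1e-8):
    """Standardize rewards separately per agent group (assumed grouping).

    samples: list of (agent_id, final_episode_reward) pairs.
    Returns one advantage per sample, in input order.
    """
    by_agent = defaultdict(list)
    for agent_id, reward in samples:
        by_agent[agent_id].append(reward)

    # Per-group mean and population standard deviation.
    stats = {a: (mean(r), pstdev(r)) for a, r in by_agent.items()}

    return [
        (reward - stats[agent_id][0]) / (stats[agent_id][1] + eps)
        for agent_id, reward in samples
    ]


# Toy data: the second group has a 10x larger reward scale.
samples = [
    ("opponent_A", 1.0), ("opponent_A", 3.0),
    ("opponent_B", 10.0), ("opponent_B", 30.0),
]
advantages = agent_normalized_advantages(samples)
```

Normalizing within each group keeps a high-reward-scale opponent from dominating the gradient, which is one reason such a scheme could help stability in population-based training.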

UrbanHuRo: A Two-Layer Human-Robot Collaboration Framework for the Joint Optimization of Heterogeneous Urban Services

UrbanHuRo：一个用于异构城市服务联合优化的双层人机协作框架

Authors: Tonmoy Dey, Lin Jiang, Zheng Dong, Guang Wang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
Arxiv link: https://arxiv.org/abs/2603.03701
Pdf link: https://arxiv.org/pdf/2603.03701
Abstract In the vision of smart cities, technologies are being developed to enhance the efficiency of urban services and improve residents' quality of life. However, most existing research focuses on optimizing individual services in isolation, without adequately considering reciprocal interactions among heterogeneous urban services that could yield higher efficiency and improved resource utilization. For example, human couriers could collect traffic and air quality data along their delivery routes, while sensing robots could assist with on-demand delivery during peak hours, enhancing both sensing coverage and delivery efficiency. However, the joint optimization of different urban services is challenging due to potentially conflicting objectives and the need for real-time coordination in dynamic environments. In this paper, we propose UrbanHuRo, a two-layer human-robot collaboration framework for joint optimization of heterogeneous urban services, demonstrated through crowdsourced delivery and urban sensing. UrbanHuRo includes two key designs: (i) a scalable distributed MapReduce-based K-submodular maximization module for efficient order dispatch, and (ii) a deep submodular reward reinforcement learning algorithm for sensing route planning. Experimental evaluations on real-world datasets from a food delivery platform demonstrate that UrbanHuRo improves sensing coverage by 29.7% and courier income by 39.2% on average in most settings, while also significantly reducing the number of overdue orders.
中文摘要 在智慧城市的愿景下，人们正在开发各种技术以提升城市服务效率并改善居民生活质量。然而，现有研究大多孤立地优化单项服务，没有充分考虑异构城市服务之间的互惠交互，而这种交互本可以带来更高的效率和更好的资源利用。例如，人类骑手可以在配送途中采集交通和空气质量数据，而感知机器人可以在高峰时段协助即时配送，从而同时提升感知覆盖率和配送效率。然而，由于目标可能相互冲突，且动态环境中需要实时协调，不同城市服务的联合优化颇具挑战。本文提出UrbanHuRo，一个用于异构城市服务联合优化的双层人机协作框架，并以众包配送和城市感知为例进行验证。UrbanHuRo包含两项关键设计：（i）一个可扩展的、基于MapReduce的分布式K-子模最大化模块，用于高效的订单派发；（ii）一种用于感知路径规划的深度子模奖励强化学习算法。在某外卖平台真实数据集上的实验评估表明，UrbanHuRo在大多数设置下平均将感知覆盖率提升29.7%，将骑手收入提升39.2%，同时显著减少了超时订单数量。

HALyPO: Heterogeneous-Agent Lyapunov Policy Optimization for Human-Robot Collaboration

HALyPO：异构代理李雅普诺夫策略优化，用于人机协作

Authors: Hao Zhang, Yaru Niu, Yikai Wang, Ding Zhao, H. Eric Tseng
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.03741
Pdf link: https://arxiv.org/pdf/2603.03741
Abstract To improve generalization and resilience in human-robot collaboration (HRC), robots must handle the combinatorial diversity of human behaviors and contexts, motivating multi-agent reinforcement learning (MARL). However, inherent heterogeneity between robots and humans creates a rationality gap (RG) in the learning process: a variational mismatch between decentralized best-response dynamics and centralized cooperative ascent. The resulting learning problem is a general-sum differentiable game, so independent policy-gradient updates can oscillate or diverge without added structure. We propose heterogeneous-agent Lyapunov policy optimization (HALyPO), which establishes formal stability directly in the policy-parameter space by enforcing a per-step Lyapunov decrease condition on a parameter-space disagreement metric. Unlike Lyapunov-based safe RL, which targets state/trajectory constraints in constrained Markov decision processes, HALyPO uses Lyapunov certification to stabilize decentralized policy learning. HALyPO rectifies decentralized gradients via optimal quadratic projections, ensuring monotonic contraction of RG and enabling effective exploration of open-ended interaction spaces. Extensive simulations and real-world humanoid-robot experiments show that this certified stability improves generalization and robustness in collaborative corner cases.
中文摘要 为了提升人机协作（HRC）中的泛化能力和韧性，机器人必须应对人类行为和情境的组合多样性，这促使人们采用多智能体强化学习（MARL）。然而，机器人与人类之间固有的异质性会在学习过程中造成理性差距（RG），即去中心化最优响应动态与中心化合作上升之间的变分失配。由此得到的学习问题是一个一般和可微博弈，因此若不引入额外结构，独立的策略梯度更新可能振荡或发散。我们提出异构智能体李雅普诺夫策略优化（HALyPO），它对参数空间中的分歧度量施加逐步的李雅普诺夫下降条件，从而直接在策略参数空间中建立形式化的稳定性。基于李雅普诺夫的安全强化学习针对的是约束马尔可夫决策过程中的状态/轨迹约束，与之不同，HALyPO利用李雅普诺夫认证来稳定去中心化的策略学习。HALyPO通过最优二次投影修正去中心化梯度，保证RG单调收缩，并使对开放式交互空间的有效探索成为可能。大量仿真和真实人形机器人实验表明，这种经认证的稳定性提升了协作极端情形（corner case）下的泛化能力和鲁棒性。
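
The HALyPO abstract describes rectifying gradients through quadratic projections so that a Lyapunov decrease condition holds at every step. The exact objective and constraint are not given in the abstract, so the sketch below only illustrates the generic building block: the Euclidean projection of an update direction `g` onto a half-space `dot(c, d) <= b`, which is the closed-form solution of a one-constraint quadratic program. Reading `c` as the gradient of a disagreement metric `V` and `b` as `-alpha * V` is an assumption made for illustration.

```python
def project_onto_halfspace(g, c, b):
    """Return the point d closest to g (Euclidean norm) with dot(c, d) <= b.

    Closed-form solution of: minimize ||d - g||^2 subject to dot(c, d) <= b.
    """
    dot_cg = sum(ci * gi for ci, gi in zip(c, g))
    norm_sq = sum(ci * ci for ci in c)
    if norm_sq == 0.0 or dot_cg <= b:
        return list(g)  # constraint already satisfied (or vacuous)
    scale = (dot_cg - b) / norm_sq
    return [gi - scale * ci for gi, ci in zip(g, c)]


# Hypothetical numbers: V = 0.5, alpha = 1.0, so the linearized decrease
# condition reads dot(grad_V, d) <= -alpha * V = -0.5.
raw_update = [1.0, 1.0]
grad_V = [1.0, 0.0]
rectified = project_onto_halfspace(raw_update, grad_V, b=-0.5)
```

The projection changes the update only along `grad_V`, leaving the component orthogonal to the constraint untouched, which is why it is the smallest possible correction.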

Interaction-Aware Whole-Body Control for Compliant Object Transport

面向柔顺物体搬运的交互感知全身控制

Authors: Hao Zhang, Yves Tseng, Ding Zhao, H. Eric Tseng
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.03751
Pdf link: https://arxiv.org/pdf/2603.03751
Abstract Cooperative object transport in unstructured environments remains challenging for assistive humanoids because strong, time-varying interaction forces can make tracking-centric whole-body control unreliable, especially in close-contact support tasks. This paper proposes a bio-inspired, interaction-oriented whole-body control (IO-WBC) that functions as an artificial cerebellum - an adaptive motor agent that translates upstream (skill-level) commands into stable, physically consistent whole-body behavior under contact. This work structurally separates upper-body interaction execution from lower-body support control, enabling the robot to maintain balance while shaping force exchange in a tightly coupled robot-object system. A trajectory-optimized reference generator (RG) provides a kinematic prior, while a reinforcement learning (RL) policy governs body responses under heavy-load interactions and disturbances. The policy is trained in simulation with randomized payload mass/inertia and external perturbations, and deployed via asymmetric teacher-student distillation so that the student relies only on proprioceptive histories at runtime. Extensive experiments demonstrate that IO-WBC maintains stable whole-body behavior and physical interaction even when precise velocity tracking becomes infeasible, enabling compliant object transport across a wide range of scenarios.
中文摘要 对于辅助型人形机器人而言，非结构化环境中的协作物体搬运仍然具有挑战性，因为强烈且时变的交互力可能使以跟踪为中心的全身控制变得不可靠，在近距离接触的支撑任务中尤其如此。本文提出一种受生物启发、面向交互的全身控制（IO-WBC），其作用相当于人工小脑：一个自适应运动智能体，能够将上游（技能级）指令转化为接触条件下稳定且物理一致的全身行为。该工作在结构上将上半身的交互执行与下半身的支撑控制分离，使机器人能够在紧密耦合的机器人-物体系统中保持平衡并调节力的交换。经轨迹优化的参考生成器（RG）提供运动学先验，而强化学习（RL）策略负责在重载交互和扰动下控制身体响应。该策略在仿真中训练，并对负载质量/惯量和外部扰动进行随机化，随后通过非对称师生蒸馏进行部署，使学生策略在运行时仅依赖本体感知历史。大量实验表明，即使精确的速度跟踪已不可行，IO-WBC仍能保持稳定的全身行为和物理交互，从而在多种场景下实现柔顺的物体搬运。

Confidence-Calibrated Small-Large Language Model Collaboration for Cost-Efficient Reasoning

置信度校准的小-大语言模型协作，实现高性价比推理

Authors: Chuang Zhang, Zizhen Zhu, Yihao Wei, Bing Tian, Junyi Liu, Henan Wang, Xavier Wang, Yaxiao Liu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.03752
Pdf link: https://arxiv.org/pdf/2603.03752
Abstract Large language models (LLMs) demonstrate superior reasoning capabilities compared to small language models (SLMs), but incur substantially higher costs. We propose COllaborative REAsoner (COREA), a system that cascades an SLM with an LLM to achieve a balance between accuracy and cost in complex reasoning tasks. COREA first attempts to answer questions using the SLM, which outputs both an answer and a verbalized confidence score. Questions with confidence below a predefined threshold are deferred to the LLM for more accurate resolution. We introduce a reinforcement learning-based training algorithm that aligns the SLM's confidence through an additional confidence calibration reward. Extensive experiments demonstrate that our method jointly improves the SLM's reasoning ability and confidence calibration across diverse datasets and model backbones. Compared to using the LLM alone, COREA reduces cost by 21.5% and 16.8% on out-of-domain math and non-math datasets, respectively, with only an absolute pass@1 drop within 2%.
中文摘要 与小型语言模型（SLM）相比，大型语言模型（LLM）展现出更强的推理能力，但成本也高得多。我们提出COllaborative REAsoner（COREA），一个将SLM与LLM级联的系统，用于在复杂推理任务中平衡准确率与成本。COREA首先尝试用SLM回答问题，SLM同时输出答案和以文字形式表述的置信度分数。置信度低于预设阈值的问题会转交给LLM，以获得更准确的解答。我们引入一种基于强化学习的训练算法，通过额外的置信度校准奖励来对齐SLM的置信度。大量实验表明，我们的方法在多种数据集和模型骨干上同时提升了SLM的推理能力和置信度校准。与单独使用LLM相比，COREA在域外数学数据集和非数学数据集上分别降低了21.5%和16.8%的成本，而pass@1的绝对下降不超过2%。
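
The deferral rule described in the COREA abstract (answer with the SLM, hand off to the LLM when the verbalized confidence falls below a threshold) can be sketched in a few lines. The model callables here are toy stand-ins, the threshold value is arbitrary, and the Brier-style `calibration_reward` is only an assumed example of what a confidence calibration reward could look like; the abstract does not specify its form.

```python
def cascade(question, slm, llm, threshold):
    """Answer with the small model unless its stated confidence is too low."""
    answer, confidence = slm(question)
    if confidence < threshold:
        return llm(question), "llm"  # defer to the large model
    return answer, "slm"


def calibration_reward(confidence, is_correct):
    """Assumed Brier-style reward: highest when confidence matches correctness."""
    target = 1.0 if is_correct else 0.0
    return 1.0 - (confidence - target) ** 2


# Toy stand-ins for the two models (hypothetical behaviour).
def toy_slm(question):
    return ("4", 0.95) if question == "2+2" else ("unsure", 0.30)


def toy_llm(question):
    return "llm answer to: " + question


easy = cascade("2+2", toy_slm, toy_llm, threshold=0.7)
hard = cascade("integrate x*exp(x)", toy_slm, toy_llm, threshold=0.7)
```

Raising the threshold routes more questions to the LLM, trading cost for accuracy; a well-calibrated SLM makes that trade-off predictable.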

Fairness Begins with State: Purifying Latent Preferences for Hierarchical Reinforcement Learning in Interactive Recommendation

公平始于状态：在交互式推荐中净化层级强化学习的潜在偏好

Authors: Yun Lu, Xiaoyu Shi, Hong Xie, Xiangyu Zhao, Mingsheng Shang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.03820
Pdf link: https://arxiv.org/pdf/2603.03820
Abstract Interactive recommender systems (IRS) are increasingly optimized with Reinforcement Learning (RL) to capture the sequential nature of user-system dynamics. However, existing fairness-aware methods often suffer from a fundamental oversight: they assume the observed user state is a faithful representation of true preferences. In reality, implicit feedback is contaminated by popularity-driven noise and exposure bias, creating a distorted state that misleads the RL agent. We argue that the persistent conflict between accuracy and fairness is not merely a reward-shaping issue, but a state estimation failure. In this work, we propose DSRM-HRL, a framework that reformulates fairness-aware recommendation as a latent state purification problem followed by decoupled hierarchical decision-making. We introduce a Denoising State Representation Module (DSRM) based on diffusion models to recover the low-entropy latent preference manifold from high-entropy, noisy interaction histories. Built upon this purified state, a Hierarchical Reinforcement Learning (HRL) agent is employed to decouple conflicting objectives: a high-level policy regulates long-term fairness trajectories, while a low-level policy optimizes short-term engagement under these dynamic constraints. Extensive experiments on high-fidelity simulators (KuaiRec, KuaiRand) demonstrate that DSRM-HRL effectively breaks the "rich-get-richer" feedback loop, achieving a superior Pareto frontier between recommendation utility and exposure equity.
中文摘要 交互式推荐系统（IRS）越来越多地采用强化学习（RL）进行优化，以刻画用户与系统之间动态的序列特性。然而，现有的公平性感知方法往往存在一个根本性疏漏：它们假设观测到的用户状态忠实反映了真实偏好。实际上，隐式反馈受到流行度驱动的噪声和曝光偏差的污染，形成失真的状态，从而误导强化学习智能体。我们认为，准确性与公平性之间的长期冲突不仅仅是奖励塑形问题，更是状态估计的失败。在本文中，我们提出DSRM-HRL框架，将公平性感知推荐重新表述为先进行潜在状态净化、再进行解耦的分层决策的问题。我们引入基于扩散模型的去噪状态表示模块（DSRM），从高熵、含噪的交互历史中恢复低熵的潜在偏好流形。在这一净化后的状态之上，我们采用分层强化学习（HRL）智能体来解耦相互冲突的目标：高层策略调节长期的公平性轨迹，低层策略在这些动态约束下优化短期的用户参与度。在高保真模拟器（KuaiRec、KuaiRand）上的大量实验表明，DSRM-HRL有效打破了“富者愈富”的反馈循环，在推荐效用与曝光公平之间取得了更优的帕累托前沿。

Dual-Interaction-Aware Cooperative Control Strategy for Alleviating Mixed Traffic Congestion

双交互感知协同控制策略，用于缓解混合交通拥堵

Authors: Zhengxuan Liu, Yuxin Cai, Yijing Wang, Xiangkun He, Chen Lv, Zhiqiang Zuo
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2603.03848
Pdf link: https://arxiv.org/pdf/2603.03848
Abstract As Intelligent Transportation Systems (ITS) develop, Connected and Automated Vehicles (CAVs) are expected to significantly reduce traffic congestion through cooperative strategies, for example in bottleneck areas. However, the uncertainty and diversity in the behaviors of Human-Driven Vehicles (HDVs) in mixed traffic environments present major challenges for CAV cooperation. This paper proposes a Dual-Interaction-Aware Cooperative Control (DIACC) strategy that enhances both local and global interaction perception within the Multi-Agent Reinforcement Learning (MARL) framework for CAVs in mixed traffic bottleneck scenarios. The DIACC strategy consists of three key innovations: 1) A Decentralized Interaction-Adaptive Decision-Making (D-IADM) module that enhances the actor's local interaction perception by distinguishing CAV-CAV cooperative interactions from CAV-HDV observational interactions. 2) A Centralized Interaction-Enhanced Critic (C-IEC) that improves the critic's global traffic understanding through interaction-aware value estimation, providing more accurate guidance for policy updates. 3) A reward design that employs softmin aggregation with temperature annealing to prioritize interaction-intensive scenarios in mixed traffic. Additionally, a lightweight Proactive Safety-based Action Refinement (PSAR) module applies rule-based corrections to accelerate training convergence. Experimental results demonstrate that DIACC significantly improves traffic efficiency and adaptability compared to rule-based and benchmark MARL models.
中文摘要 随着智能交通系统（ITS）的发展，网联自动驾驶车辆（CAV）有望通过协同策略显著缓解交通拥堵，例如在瓶颈路段。然而，混合交通环境中人类驾驶车辆（HDV）行为的不确定性和多样性，给CAV的协同带来了重大挑战。本文提出一种双交互感知协同控制（DIACC）策略，在多智能体强化学习（MARL）框架下，增强混合交通瓶颈场景中CAV的局部与全局交互感知。DIACC策略包含三项关键创新：1）去中心化交互自适应决策（D-IADM）模块，通过区分CAV-CAV的协作交互与CAV-HDV的观测交互，增强actor的局部交互感知。2）中心化交互增强评论家（C-IEC），通过交互感知的价值估计提升critic对全局交通的理解，为策略更新提供更准确的指导。3）一种奖励设计，采用带温度退火的softmin聚合，优先关注混合交通中交互密集的场景。此外，一个轻量级的基于主动安全的动作修正（PSAR）模块通过基于规则的修正来加速训练收敛。实验结果表明，与基于规则的模型和基准MARL模型相比，DIACC显著提升了交通效率和适应性。
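
The third DIACC component, softmin aggregation with temperature annealing, can be illustrated with a small standard-library sketch. It assumes the aggregated quantities are per-interaction (or per-agent) reward terms and that the temperature decays exponentially; both choices, and all default values, are assumptions, since the abstract does not give the formula or the schedule.

```python
import math


def softmin_aggregate(values, temperature):
    """Softmin-weighted average: low values receive the largest weights.

    As temperature -> 0 the result approaches min(values);
    as temperature -> infinity it approaches the plain mean.
    """
    lowest = min(values)  # subtract the minimum for numerical stability
    weights = [math.exp(-(v - lowest) / temperature) for v in values]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def annealed_temperature(step, t_start=5.0, t_end=0.1, decay=0.99):
    """Assumed exponential schedule, floored at t_end."""
    return max(t_end, t_start * decay ** step)


rewards = [1.0, 2.0, 3.0]
early = softmin_aggregate(rewards, annealed_temperature(0))
late = softmin_aggregate(rewards, annealed_temperature(10_000))
```

With a high early temperature the aggregate stays near the mean reward; as the temperature anneals, the worst-scoring term dominates, which matches the stated goal of prioritizing interaction-intensive scenarios.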

Selecting Offline Reinforcement Learning Algorithms for Stochastic Network Control

选择离线强化学习算法以实现随机网络控制

Authors: Nicolas Helson, Pegah Alizadeh, Anastasios Giovanidis
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2603.03932
Pdf link: https://arxiv.org/pdf/2603.03932
Abstract Offline Reinforcement Learning (RL) is a promising approach for next-generation wireless networks, where online exploration is unsafe and large amounts of operational data can be reused across the model lifecycle. However, the behavior of offline RL algorithms under genuinely stochastic dynamics -- inherent to wireless systems due to fading, noise, and traffic mobility -- remains insufficiently understood. We address this gap by evaluating Bellman-based (Conservative Q-Learning), sequence-based (Decision Transformers), and hybrid (Critic-Guided Decision Transformers) offline RL methods in an open-access stochastic telecom environment (mobile-env). Our results show that Conservative Q-Learning consistently produces more robust policies across different sources of stochasticity, making it a reliable default choice in lifecycle-driven AI management frameworks. Sequence-based methods remain competitive and can outperform Bellman-based approaches when sufficient high-return trajectories are available. These findings provide practical guidance for offline RL algorithm selection in AI-driven network control pipelines, such as O-RAN and future 6G functions, where robustness and data availability are key operational constraints.
中文摘要 离线强化学习（RL）对下一代无线网络而言是一种很有前景的方法，因为在这类网络中在线探索并不安全，而大量运营数据可以在模型生命周期内重复使用。然而，离线强化学习算法在真正随机的动态下的行为仍未得到充分理解，而这种随机性由于衰落、噪声和业务移动性而为无线系统所固有。为填补这一空白，我们在一个开放获取的随机电信环境（mobile-env）中评估了基于贝尔曼方程的方法（Conservative Q-Learning）、基于序列的方法（Decision Transformers）以及混合方法（Critic-Guided Decision Transformers）。结果表明，Conservative Q-Learning在不同的随机性来源下始终产生更稳健的策略，使其成为生命周期驱动的AI管理框架中可靠的默认选择。基于序列的方法仍具竞争力，并且在高回报轨迹充足时可以超过基于贝尔曼方程的方法。这些发现为AI驱动的网络控制流水线（如O-RAN和未来的6G功能）中的离线强化学习算法选择提供了实用指导，在这些场景中，鲁棒性和数据可用性是关键的运营约束。

RVN-Bench: A Benchmark for Reactive Visual Navigation

RVN-Bench：反应式视觉导航基准

Authors: Jaewon Lee, Jaeseok Heo, Gunmin Lee, Howoong Jun, Jeongwoo Oh, Songhwai Oh
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.03953
Pdf link: https://arxiv.org/pdf/2603.03953
Abstract Safe visual navigation is critical for indoor mobile robots operating in cluttered environments. Existing benchmarks, however, often neglect collisions or are designed for outdoor scenarios, making them unsuitable for indoor visual navigation. To address this limitation, we introduce the reactive visual navigation benchmark (RVN-Bench), a collision-aware benchmark for indoor mobile robots. In RVN-Bench, an agent must reach sequential goal positions in previously unseen environments using only visual observations and no prior map, while avoiding collisions. Built on the Habitat 2.0 simulator and leveraging high-fidelity HM3D scenes, RVN-Bench provides large-scale, diverse indoor environments, defines a collision-aware navigation task and evaluation metrics, and offers tools for standardized training and benchmarking. RVN-Bench supports both online and offline learning by offering an environment for online reinforcement learning, a trajectory image dataset generator, and tools for producing negative trajectory image datasets that capture collision events. Experiments show that policies trained on RVN-Bench generalize effectively to unseen environments, demonstrating its value as a standardized benchmark for safe and robust visual navigation. Code and additional materials are available at: this https URL.
中文摘要 对于在杂乱环境中运行的室内移动机器人而言，安全的视觉导航至关重要。然而，现有基准往往忽略碰撞，或是为室外场景设计，因而不适用于室内视觉导航。为解决这一局限，我们提出反应式视觉导航基准（RVN-Bench），一个面向室内移动机器人的碰撞感知基准。在RVN-Bench中，智能体必须仅凭视觉观测、在没有先验地图的情况下，于此前未见过的环境中依次到达各个目标位置，同时避免碰撞。RVN-Bench基于Habitat 2.0模拟器构建，并利用高保真的HM3D场景，提供大规模、多样化的室内环境，定义了碰撞感知的导航任务和评估指标，并提供用于标准化训练和评测的工具。RVN-Bench同时支持在线学习和离线学习：它提供在线强化学习环境、轨迹图像数据集生成器，以及用于生成记录碰撞事件的负样本轨迹图像数据集的工具。实验表明，在RVN-Bench上训练的策略能够有效泛化到未见过的环境，证明了它作为安全、稳健视觉导航标准化基准的价值。代码和更多资料可在此 https URL 获取。

GIPO: Gaussian Importance Sampling Policy Optimization

GIPO：高斯重要性抽样策略优化

Authors: Chengxuan Lu, Zhenquan Zhang, Shukuan Wang, Qunzhi Lin, Baigui Sun, Yang Liu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.03955
Pdf link: https://arxiv.org/pdf/2603.03955
Abstract Post-training with reinforcement learning (RL) has recently shown strong promise for advancing multimodal agents beyond supervised imitation. However, RL remains limited by poor data efficiency, particularly in settings where interaction data are scarce and quickly become outdated. To address this challenge, GIPO (Gaussian Importance sampling Policy Optimization) is proposed as a policy optimization objective based on truncated importance sampling, replacing hard clipping with a log-ratio-based Gaussian trust weight to softly damp extreme importance ratios while maintaining non-zero gradients. Theoretical analysis shows that GIPO introduces an implicit, tunable constraint on the update magnitude, while concentration bounds guarantee robustness and stability under finite-sample estimation. Experimental results show that GIPO achieves state-of-the-art performance among clipping-based baselines across a wide range of replay buffer sizes, from near on-policy to highly stale data, while exhibiting a superior bias-variance trade-off, high training stability, and improved sample efficiency.
中文摘要 基于强化学习（RL）的后训练近来在推动多模态智能体超越监督模仿方面展现出很大潜力。然而，强化学习仍受制于较低的数据效率，在交互数据稀缺且很快过时的场景中尤为明显。为应对这一挑战，本文提出GIPO（高斯重要性采样策略优化），一种基于截断重要性采样的策略优化目标，它用基于对数比率的高斯信任权重取代硬裁剪，在保持梯度非零的同时柔和地抑制极端的重要性比率。理论分析表明，GIPO对更新幅度引入了隐式且可调的约束，而集中不等式界保证了有限样本估计下的鲁棒性和稳定性。实验结果表明，在从接近同策略到高度陈旧数据的多种回放缓冲区规模下，GIPO在基于裁剪的基线方法中均取得了最先进的性能，同时表现出更优的偏差-方差权衡、较高的训练稳定性和更好的样本效率。
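
To make the contrast between hard clipping and a Gaussian trust weight concrete, the sketch below compares a simplified PPO-style gate with a weight of the form exp(-(log r)^2 / (2 sigma^2)), where r is the importance ratio. This functional form and the `sigma` value are assumptions chosen to match the abstract's description of a log-ratio-based Gaussian weight; the paper's actual objective may differ.

```python
import math


def gaussian_trust_weight(log_ratio, sigma=0.5):
    """Assumed soft weight: 1 at log_ratio = 0, decaying smoothly, never zero."""
    return math.exp(-(log_ratio ** 2) / (2.0 * sigma ** 2))


def hard_clip_gate(log_ratio, eps=0.2):
    """Simplified view of hard clipping: the sample contributes only while
    the ratio stays inside [1 - eps, 1 + eps] (advantage sign ignored)."""
    ratio = math.exp(log_ratio)
    return 1.0 if 1.0 - eps <= ratio <= 1.0 + eps else 0.0


# Compare both schemes on a nearly on-policy and a stale sample.
for log_ratio in (0.05, 1.5):
    print(log_ratio, hard_clip_gate(log_ratio), gaussian_trust_weight(log_ratio))
```

The hard gate drops the stale sample entirely, while the Gaussian weight shrinks its contribution but keeps it positive, so stale replay data still provides some gradient signal. Here `sigma` plays the role of the tunable constraint on update size.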

Discriminative Perception via Anchored Description for Reasoning Segmentation

通过锚定描述实现判别式感知的推理分割

Authors: Tao Yang, Qing Zhou, Yanliang Li, Qi Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.04002
Pdf link: https://arxiv.org/pdf/2603.04002
Abstract Reasoning segmentation increasingly employs reinforcement learning to generate explanatory reasoning chains that guide Multimodal Large Language Models. However, the geometric rewards used for this purpose are primarily confined to guiding the final localization and cannot discriminate whether the reasoning process remains anchored on the referred region or strays into irrelevant context. Lacking this discriminative guidance, the model's reasoning often devolves into unfocused and verbose chains that ultimately fail to disambiguate and perceive the target in complex scenes. This suggests a need to complement the RL objective with Discriminative Perception, an ability to actively distinguish a target from its context. To realize this, we propose DPAD, which compels the model to generate a descriptive caption of the referred object; the caption is then used to discriminate explicitly by contrasting its semantic relevance to the referred object against its relevance to the wider context. By optimizing for this discriminative capability, the model is forced to focus on the unique attributes of the target, leading to a more converged and efficient reasoning chain. The descriptive caption also serves as an interpretability rationale that aligns with the segmentation. Experiments on the benchmarks confirm the validity of our approach, delivering substantial performance gains, with the cIoU on ReasonSeg increasing by 3.09% and the reasoning chain length decreasing by approximately 42%. Code is available at this https URL
中文摘要 推理分割越来越多地利用强化学习来生成指导多模态大型语言模型的解释性推理链。虽然这些几何奖励主要用于指导最终定位，但它们无法区分推理过程是否仍停留在所指区域，还是偏离无关上下文。缺乏这种辨别引导，模型的推理常常陷入无焦点且冗长的链条，最终无法在复杂场景中区分和感知目标。这表明有必要用辨别感知来补充强化学习目标，即能够主动区分目标与其上下文的能力。为实现这一点，我们提出 DPAD 强制模型生成被指对象的描述性说明，然后通过对比该描述与被指对象的语义相关性与更广泛的上下文进行显式区分。通过优化这种判别能力，模型被迫聚焦于目标的独特属性，从而形成更趋同且高效的推理链。描述性说明也作为可解释性的理由，与分割相符。基准测试验证了我们方法的有效性，带来了显著的性能提升，ReasonSeg 的 cIoU 增加了 3.09%，推理链长度减少了约 42%。代码可在此 https URL 获取

Rethinking the Efficiency and Effectiveness of Reinforcement Learning for Radiology Report Generation

重新思考强化学习在放射科报告生成中的效率与效果

Authors: Zilin Lu, Ruifeng Yuan, Weiwei Cao, Wanxing Chang, Zhongyu Wei, Sinuo Wang, Yong Xia, Ling Zhang, Jianpeng Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.04022
Pdf link: https://arxiv.org/pdf/2603.04022
Abstract Radiologists highly desire fully automated AI for radiology report generation (R2G), yet existing approaches fall short in clinical utility. Reinforcement learning (RL) holds potential to address these shortcomings, but its adoption in this task remains underexplored. In this paper, we revisit RL in terms of data efficiency and optimization effectiveness for R2G tasks. First, we explore the impact of data quantity and quality on the performance of RL in medical contexts, revealing that data quality plays a more critical role than quantity. To this end, we propose a diagnostic diversity-based data sampling strategy that enables comparable performance with fewer samples. Second, we observe that the majority of tokens in radiology reports are template-like and diagnostically uninformative, whereas the low frequency of clinically critical tokens heightens the risk of being overlooked during optimization. To tackle this, we introduce Diagnostic Token-weighted Policy Optimization (DiTPO), which directly optimizes for clinical accuracy by using a diagnostic F1 score as the reward signal. Unlike standard RL approaches that treat all tokens equally, DiTPO explicitly models the varying importance of different tokens through rule- or gradient-based mechanisms to prioritize clinically relevant content. Extensive experiments on the MIMIC-CXR, IU-Xray, and CheXpert Plus datasets demonstrate that our framework achieves state-of-the-art (SOTA) performance while requiring substantially fewer training samples in RL. Notably, on MIMIC-CXR, our framework attains an F1 score of 0.516 using only 20% of the RL training samples.
中文摘要 放射科医生非常期待用于放射学报告生成（R2G）的全自动人工智能，但现有方法在临床实用性方面仍有不足。强化学习（RL）有望弥补这些不足，但它在该任务中的应用仍未得到充分探索。本文从数据效率和优化效果两方面重新审视用于R2G任务的强化学习。首先，我们探究了数据数量和质量对强化学习在医学场景中性能的影响，发现数据质量比数量更为关键。为此，我们提出一种基于诊断多样性的数据采样策略，能够以更少的样本取得相当的性能。其次，我们观察到放射学报告中的大多数词元（token）类似模板、缺乏诊断信息，而临床关键词元出现频率低，增加了它们在优化过程中被忽视的风险。针对这一问题，我们提出诊断词元加权策略优化（DiTPO），以诊断F1分数作为奖励信号，直接针对临床准确性进行优化。与对所有词元一视同仁的标准强化学习方法不同，DiTPO通过基于规则或基于梯度的机制显式建模不同词元的不同重要性，从而优先考虑临床相关内容。在MIMIC-CXR、IU-Xray和CheXpert Plus数据集上的大量实验表明，我们的框架取得了最先进（SOTA）的性能，同时在强化学习阶段所需的训练样本大幅减少。值得注意的是，在MIMIC-CXR上，我们的框架仅使用20%的强化学习训练样本就取得了0.516的F1分数。
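
The idea of weighting tokens by diagnostic importance, as opposed to treating every token equally, can be sketched with a rule-based weighting and a weighted policy-gradient loss. The term list, the weight values, and the loss normalization below are hypothetical illustrations of the rule-based variant mentioned in the abstract, not the paper's definition of DiTPO.

```python
# Hypothetical vocabulary of clinically important terms.
DIAGNOSTIC_TERMS = {"pneumothorax", "effusion", "cardiomegaly", "consolidation"}


def rule_based_weights(tokens, high=3.0, low=1.0):
    """Give diagnostic terms a larger weight than template-like tokens."""
    return [high if t.lower() in DIAGNOSTIC_TERMS else low for t in tokens]


def token_weighted_pg_loss(token_logps, weights, advantage):
    """Weighted REINFORCE-style loss: -A * sum(w_t * logp_t) / sum(w_t)."""
    total_weight = sum(weights)
    weighted = sum(w * lp for w, lp in zip(weights, token_logps))
    return -advantage * weighted / total_weight


tokens = ["no", "pleural", "effusion", "seen"]
token_logps = [-0.1, -0.2, -1.0, -0.1]  # toy log-probabilities
advantage = 0.5                          # e.g. derived from a diagnostic F1 reward

weighted_loss = token_weighted_pg_loss(token_logps, rule_based_weights(tokens), advantage)
uniform_loss = token_weighted_pg_loss(token_logps, [1.0] * len(tokens), advantage)
```

In this toy case the poorly predicted diagnostic token "effusion" dominates the weighted loss, whereas the uniform loss dilutes it among template tokens.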

Self-adapting Robotic Agents through Online Continual Reinforcement Learning with World Model Feedback

通过在线持续强化学习结合世界模型反馈实现自我适应机器人智能体

Authors: Fabian Domberg, Georg Schildbach
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.04029
Pdf link: https://arxiv.org/pdf/2603.04029
Abstract As learning-based robotic controllers are typically trained offline and deployed with fixed parameters, their ability to cope with unforeseen changes during operation is limited. Taking inspiration from biology, this work presents a framework for online Continual Reinforcement Learning that enables automated adaptation during deployment. Building on DreamerV3, a model-based Reinforcement Learning algorithm, the proposed method leverages world model prediction residuals to detect out-of-distribution events and automatically trigger finetuning. Adaptation progress is monitored using both task-level performance signals and internal training metrics, allowing convergence to be assessed without external supervision or domain knowledge. The approach is validated on a variety of contemporary continuous control problems, including a quadruped robot in high-fidelity simulation and a real-world model vehicle. Relevant metrics and their interpretation are presented and discussed, and the resulting trade-offs are described. The results sketch out how autonomous robotic agents could one day move beyond static training regimes toward adaptive systems capable of self-reflection and self-improvement during operation, just like their biological counterparts.
中文摘要 由于基于学习的机器人控制器通常离线训练并以固定参数部署，它们应对运行中不可预见变化的能力有限。受生物学启发，本文提出一个在线持续强化学习框架，使系统能够在部署期间自动适应。该方法建立在基于模型的强化学习算法DreamerV3之上，利用世界模型的预测残差检测分布外事件并自动触发微调。适应进度同时通过任务级性能信号和内部训练指标进行监控，从而无需外部监督和领域知识即可评估收敛情况。该方法在多种当代连续控制问题上得到验证，包括高保真仿真中的四足机器人和一辆真实的模型车。文中给出并讨论了相关指标及其解读，并说明了由此产生的权衡。这些结果勾勒出自主机器人智能体未来如何超越静态训练范式，发展为能够在运行过程中自我反思和自我改进的自适应系统，就像它们的生物对应物一样。
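
The trigger mechanism described above (flag an out-of-distribution event when world-model prediction residuals become unusually large, then start finetuning) could look roughly like the following monitor. The mean-plus-k-standard-deviations rule, the warm-up length, and the class name are assumptions; the abstract does not state how residuals are thresholded.

```python
import math


class ResidualMonitor:
    """Flags residuals that exceed a running mean + k * std threshold."""

    def __init__(self, k=3.0, warmup=20):
        self.k = k
        self.warmup = warmup  # number of samples collected before flagging starts
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0         # running sum of squared deviations (Welford)

    def update(self, residual):
        """Return True if the residual looks out-of-distribution.

        Flagged residuals are not added to the running statistics, so an
        ongoing anomaly cannot raise its own threshold.
        """
        if self.count >= self.warmup:
            std = math.sqrt(self.m2 / (self.count - 1))
            if residual > self.mean + self.k * std:
                return True
        self.count += 1
        delta = residual - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (residual - self.mean)
        return False


monitor = ResidualMonitor(k=3.0, warmup=20)
nominal_flags = [monitor.update(0.10 + 0.01 * (i % 5)) for i in range(50)]
shift_detected = monitor.update(0.9)  # e.g. after a sudden change in dynamics
```

In a deployment loop, a `True` return would be the point at which finetuning is triggered.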

SaFeR: Safety-Critical Scenario Generation for Autonomous Driving Test via Feasibility-Constrained Token Resampling

SaFeR：通过可行性约束令牌重采样实现自动驾驶测试的安全关键场景生成

Authors: Jinlong Cui, Fenghua Liang, Guo Yang, Chengcheng Tang, Jianxun Cui
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.04071
Pdf link: https://arxiv.org/pdf/2603.04071
Abstract Safety-critical scenario generation is crucial for evaluating autonomous driving systems. However, existing approaches often struggle to balance three conflicting objectives: adversarial criticality, physical feasibility, and behavioral realism. To bridge this gap, we propose SaFeR: safety-critical scenario generation for autonomous driving test via feasibility-constrained token resampling. We first formulate traffic generation as a discrete next token prediction problem, employing a Transformer-based model as a realism prior to capture naturalistic driving distributions. To capture complex interactions while effectively mitigating attention noise, we propose a novel differential attention mechanism within the realism prior. Building on this prior, SaFeR implements a novel resampling strategy that induces adversarial behaviors within a high-probability trust region to maintain naturalism, while enforcing a feasibility constraint derived from the Largest Feasible Region (LFR). By approximating the LFR via offline reinforcement learning, SaFeR effectively prevents the generation of theoretically inevitable collisions. Closed-loop experiments on the Waymo Open Motion Dataset and nuPlan demonstrate that SaFeR significantly outperforms state-of-the-art baselines, achieving a higher solution rate and superior kinematic realism while maintaining strong adversarial effectiveness.
中文摘要 安全关键场景生成对于评估自动驾驶系统至关重要。然而，现有方法常常难以平衡三个相互冲突的目标：对抗关键性、物理可行性和行为真实性。为弥合这一差距，我们提出了SaFeR：通过可行性约束的令牌重采样实现自动驾驶测试的安全关键场景生成。我们首先将交通生成表述为离散的下一个令牌预测问题，采用基于Transformer的模型作为真实性先验，以捕捉自然驾驶分布。为了在有效减少注意力噪声的同时捕捉复杂交互，我们在该真实性先验中提出了一种新颖的差分注意力机制。基于该先验，SaFeR实现了一种新颖的重采样策略，在高概率信任区域内诱导对抗行为以保持自然性，同时强制执行由最大可行区域（LFR）推导出的可行性约束。通过离线强化学习近似LFR，SaFeR有效防止了理论上不可避免的碰撞的生成。Waymo开放运动数据集和nuPlan上的闭环实验表明，SaFeR显著优于最先进的基线，实现了更高的求解率和更优的运动学真实性，同时保持强大的对抗效果。

BeamPERL: Parameter-Efficient RL with Verifiable Rewards Specializes Compact LLMs for Structured Beam Mechanics Reasoning

BeamPERL：参数高效的可验证奖励强化学习使紧凑型大型语言模型专精于结构化梁力学推理

Authors: Tarjei Paule Hage, Markus J. Buehler
Subjects: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.04124
Pdf link: https://arxiv.org/pdf/2603.04124
Abstract Can reinforcement learning with hard, verifiable rewards teach a compact language model to reason about physics, or does it primarily learn to pattern-match toward correct answers? We study this question by training a 1.5B-parameter reasoning model on beam statics, a classic engineering problem, using parameter-efficient RLVR with binary correctness rewards from symbolic solvers, without teacher-generated reasoning traces. The best BeamPERL checkpoint achieves a 66.7% improvement in Pass@1 over the base model. However, the learned competence is anisotropic: the model generalizes compositionally (more loads) but fails under topological shifts (moved supports) that require the same equilibrium equations. Intermediate checkpoints yield the strongest reasoning, while continued optimization degrades robustness while maintaining reward. These findings reveal a key limitation of outcome-level alignment: reinforcement learning with exact physics rewards induces procedural solution templates rather than internalization of governing equations. The precision of the reward signal - even when analytically exact - does not by itself guarantee transferable physical reasoning. Our results suggest that verifiable rewards may need to be paired with structured reasoning scaffolding to move beyond template matching toward robust scientific reasoning.
中文摘要 带有硬性且可验证奖励的强化学习能否教会一个紧凑的语言模型对物理进行推理，还是它主要学会了朝正确答案进行模式匹配？我们通过在梁静力学这一经典工程问题上训练一个1.5B参数的推理模型来研究这个问题，使用参数高效的RLVR，以符号求解器给出的二元正确性奖励进行训练，且不使用教师生成的推理轨迹。最佳的 BeamPERL 检查点在 Pass@1 上比基础模型提升了 66.7%。然而，所学到的能力是各向异性的：模型能在组合上泛化（更多载荷），但在需要相同平衡方程的拓扑变化（移动支座）下失败。中间检查点产生最强的推理，而持续优化会在保持奖励的同时削弱稳健性。这些发现揭示了结果层级对齐的一个关键局限：带有精确物理奖励的强化学习诱导出的是过程式求解模板，而非对控制方程的内化。奖励信号的精确性（即使在解析意义上是精确的）本身并不能保证可迁移的物理推理。我们的结果表明，可验证的奖励可能需要与结构化推理支架相结合，才能超越模板匹配，迈向稳健的科学推理。

Learning Hip Exoskeleton Control Policy via Predictive Neuromusculoskeletal Simulation

通过预测性神经肌肉骨骼仿真学习髋部外骨骼控制策略

Authors: Ilseung Park, Changseob Song, Inseung Kang
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.04166
Pdf link: https://arxiv.org/pdf/2603.04166
Abstract Developing exoskeleton controllers that generalize across diverse locomotor conditions typically requires extensive motion-capture data and biomechanical labeling, limiting scalability beyond instrumented laboratory settings. Here, we present a physics-based neuromusculoskeletal learning framework that trains a hip-exoskeleton control policy entirely in simulation, without motion-capture demonstrations, and deploys it on hardware via policy distillation. A reinforcement learning teacher policy is trained using a muscle-synergy action prior over a wide range of walking speeds and slopes through a two-stage curriculum, enabling direct comparison between assisted and no-exoskeleton conditions. In simulation, exoskeleton assistance reduces mean muscle activation by up to 3.4% and mean positive joint power by up to 7.0% on level ground and ramp ascent, with benefits increasing systematically with walking speed. On hardware, the assistance profiles learned in simulation are preserved across matched speed-slope conditions (r: 0.82, RMSE: 0.03 Nm/kg), providing quantitative evidence of sim-to-real transfer without additional hardware tuning. These results demonstrate that physics-based neuromusculoskeletal simulation can serve as a practical and scalable foundation for exoskeleton controller development, substantially reducing experimental burden during the design phase.
中文摘要 开发能够泛化到多种运动条件的外骨骼控制器通常需要大量动作捕捉数据和生物力学标注，限制了其在配备仪器的实验室环境之外的可扩展性。在这里，我们提出了一个基于物理的神经肌肉骨骼学习框架，完全在仿真中训练髋部外骨骼控制策略，无需动作捕捉演示，并通过策略蒸馏部署到硬件上。强化学习教师策略利用肌肉协同动作先验，通过两阶段课程在广泛的步行速度和坡度范围内进行训练，从而能够直接比较有辅助与无外骨骼的条件。在仿真中，外骨骼辅助在平地和上坡行走时可将平均肌肉激活降低最多3.4%，将平均正关节功率降低最多7.0%，且收益随行走速度的提升而系统性增加。在硬件上，仿真中学到的辅助曲线在匹配的速度-坡度条件下得以保持（r：0.82，RMSE：0.03 Nm/kg），为无需额外硬件调优的仿真到现实迁移提供了定量证据。这些结果表明，基于物理的神经肌肉骨骼仿真可以作为外骨骼控制器开发的实用且可扩展的基础，显著减轻设计阶段的实验负担。

Memex(RL): Scaling Long-Horizon LLM Agents via Indexed Experience Memory

Memex（RL）：通过索引经验记忆扩展长程LLM智能体

Authors: Zhenting Wang, Huancheng Chen, Jiayun Wang, Wei Wei
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.04257
Pdf link: https://arxiv.org/pdf/2603.04257
Abstract Large language model (LLM) agents are fundamentally bottlenecked by finite context windows on long-horizon tasks. As trajectories grow, retaining tool outputs and intermediate reasoning in-context quickly becomes infeasible: the working context becomes prohibitively long, eventually exceeds the context budget, and makes distant evidence harder to use even when it is still present. Existing solutions typically shorten context through truncation or running summaries, but these methods are fundamentally lossy because they compress or discard past evidence itself. We introduce Memex, an indexed experience memory mechanism that instead compresses context without discarding evidence. Memex maintains a compact working context consisting of concise structured summaries and stable indices, while storing full-fidelity underlying interactions in an external experience database under those indices. The agent can then decide when to dereference an index and recover the exact past evidence needed for the current subgoal. We optimize both write and read behaviors with our reinforcement learning framework MemexRL, using reward shaping tailored to indexed memory usage under a context budget, so the agent learns what to summarize, what to archive, how to index it, and when to retrieve it. This yields a substantially less lossy form of long-horizon memory than summary-only approaches. We further provide a theoretical analysis showing the potential of the Memex loop to preserve decision quality with bounded dereferencing while keeping effective in-context computation bounded as history grows. Empirically, on challenging long-horizon tasks, Memex agent trained with MemexRL improves task success while using a significantly smaller working context.
中文摘要 大型语言模型（LLM）智能体在长程任务中从根本上受限于有限的上下文窗口。随着轨迹的增长，在上下文中保留工具输出和中间推理很快变得不可行：工作上下文变得过长，最终超出上下文预算，并且即使较早的证据仍然存在，也更难加以利用。现有解决方案通常通过截断或滚动摘要来缩短上下文，但这些方法本质上是有损的，因为它们压缩或丢弃了过去的证据本身。我们提出了Memex，一种索引经验记忆机制，它在不丢弃证据的情况下压缩上下文。Memex 维护一个紧凑的工作上下文，由简明的结构化摘要和稳定的索引组成，同时将全保真度的底层交互存储在这些索引下的外部经验数据库中。智能体随后可以决定何时对某个索引解引用，恢复当前子目标所需的确切历史证据。我们利用强化学习框架MemexRL优化写入和读取行为，采用针对上下文预算下索引记忆使用而设计的奖励塑形，使智能体学会总结什么、归档什么、如何索引以及何时检索。与仅使用摘要的方法相比，这带来了损耗明显更小的长程记忆形式。我们还提供了理论分析，表明随着历史增长，Memex循环有潜力在解引用次数有界的情况下保持决策质量，同时使有效的上下文内计算保持有界。实证结果表明，在具有挑战性的长程任务上，用MemexRL训练的Memex智能体在使用显著更小的工作上下文的同时提高了任务成功率。
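
The write/dereference idea described in the Memex abstract can be illustrated with a small sketch. This is not the authors' implementation: the class name `IndexedMemory`, the `mem-N` index format, and the character-count proxy for context size are all assumptions made for illustration, and the learned policy (MemexRL) that decides what to archive and when to retrieve is not modeled here.

```python
from dataclasses import dataclass, field


@dataclass
class IndexedMemory:
    """Hypothetical sketch of an indexed experience memory (not the authors' code)."""

    # External experience database: index -> full-fidelity record.
    store: dict = field(default_factory=dict)
    # Compact working context: list of (index, short summary) pairs.
    context: list = field(default_factory=list)

    def write(self, summary: str, full_record: str) -> str:
        """Archive the full record; keep only a summary and its index in context."""
        index = f"mem-{len(self.store)}"  # assumed index format
        self.store[index] = full_record
        self.context.append((index, summary))
        return index

    def dereference(self, index: str) -> str:
        """Recover the exact archived evidence stored under an index."""
        return self.store[index]

    def context_chars(self) -> int:
        """Rough size of the working context, measured in characters."""
        return sum(len(index) + len(summary) for index, summary in self.context)


# Usage: a long tool output is archived, while the context keeps a short pointer.
memory = IndexedMemory()
tool_output = "row,value\n" + "\n".join(f"{i},{i * i}" for i in range(200))
idx = memory.write("table of squares for 0..199", tool_output)
recovered = memory.dereference(idx)
```

Unlike truncation or a running summary, the archived record comes back byte-for-byte on dereference, which is the sense in which the abstract calls this form of compression less lossy.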

IPD: Boosting Sequential Policy with Imaginary Planning Distillation in Offline Reinforcement Learning

IPD：在离线强化学习中通过想象规划蒸馏提升序列策略

Authors: Yihao Qin, Yuanfei Wang, Hang Zhou, Peiran Liu, Hao Dong, Yiding Ji
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.04289
Pdf link: https://arxiv.org/pdf/2603.04289
Abstract Decision transformer based sequential policies have emerged as a powerful paradigm in offline reinforcement learning (RL), yet their efficacy remains constrained by the quality of static datasets and inherent architectural limitations. Specifically, these models often struggle to effectively integrate suboptimal experiences and fail to explicitly plan for an optimal policy. To bridge this gap, we propose \textbf{Imaginary Planning Distillation (IPD)}, a novel framework that seamlessly incorporates offline planning into data generation, supervised training, and online inference. Our framework first learns a world model equipped with uncertainty measures and a quasi-optimal value function from the offline data. These components are utilized to identify suboptimal trajectories and augment them with reliable, imagined optimal rollouts generated via Model Predictive Control (MPC). A Transformer-based sequential policy is then trained on this enriched dataset, complemented by a value-guided objective that promotes the distillation of the optimal policy. By replacing the conventional, manually-tuned return-to-go with the learned quasi-optimal value function, IPD improves both decision-making stability and performance during inference. Empirical evaluations on the D4RL benchmark demonstrate that IPD significantly outperforms several state-of-the-art value-based and transformer-based offline RL methods across diverse tasks.
中文摘要 基于决策Transformer的序列策略已成为离线强化学习（RL）中的强大范式，但其有效性仍受静态数据集质量和固有架构局限所限。具体来说，这些模型常常难以有效整合次优经验，且未能显式地为最优策略进行规划。为弥合这一差距，我们提出了 \textbf{想象规划蒸馏（IPD）}，这是一种新颖框架，将离线规划无缝融入数据生成、监督训练和在线推理。我们的框架首先从离线数据中学习一个配备不确定性度量的世界模型和一个准最优价值函数。这些组件用于识别次优轨迹，并用通过模型预测控制（MPC）生成的可靠的想象最优展开对其进行增强。随后在这一扩充后的数据集上训练基于Transformer的序列策略，并辅以价值引导的目标，以促进最优策略的蒸馏。通过用学习到的准最优价值函数替代传统的手动调优的return-to-go，IPD提高了推理期间的决策稳定性和性能。在D4RL基准上的实证评估表明，IPD在多种任务中显著优于若干最先进的基于价值和基于Transformer的离线强化学习方法。

What Does Flow Matching Bring To TD Learning?

流匹配为TD学习带来了什么？

Authors: Bhavya Agrawalla, Michal Nauman, Aviral Kumar
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.04333
Pdf link: https://arxiv.org/pdf/2603.04333
Abstract Recent work shows that flow matching can be effective for scalar Q-value function estimation in reinforcement learning (RL), but it remains unclear why or how this approach differs from standard critics. Contrary to conventional belief, we show that their success is not explained by distributional RL, as explicitly modeling return distributions can reduce performance. Instead, we argue that the use of integration for reading out values and dense velocity supervision at each step of this integration process for training improves TD learning via two mechanisms. First, it enables robust value prediction through \emph{test-time recovery}, whereby iterative computation through integration dampens errors in early value estimates as more integration steps are performed. This recovery mechanism is absent in monolithic critics. Second, supervising the velocity field at multiple interpolant values induces more \emph{plastic} feature learning within the network, allowing critics to represent non-stationary TD targets without discarding previously learned features or overfitting to individual TD targets encountered during training. We formalize these effects and validate them empirically, showing that flow-matching critics substantially outperform monolithic critics (2$\times$ in final performance and around 5$\times$ in sample efficiency) in settings where loss of plasticity poses a challenge e.g., in high-UTD online RL problems, while remaining stable during learning.
中文摘要 最新研究表明，流匹配可有效用于强化学习（RL）中的标量Q值函数估计，但目前尚不清楚这种方法为何有效，以及它与标准评论家（critic）有何不同。与传统看法相反，我们证明其成功并不能用分布式强化学习来解释，因为显式建模回报分布可能会降低性能。相反，我们认为，使用积分来读出价值，并在训练时对该积分过程的每一步进行密集的速度监督，通过两种机制改进了TD学习。首先，它通过 \emph{测试时恢复} 实现了稳健的价值预测：通过积分进行的迭代计算会随着积分步数的增加而抑制早期价值估计中的误差。这种恢复机制在单体式评论家中是缺失的。其次，在多个插值点处监督速度场可以在网络中促成更具 \emph{可塑性} 的特征学习，使评论家能够在不丢弃已学特征、也不过度拟合训练中遇到的个别TD目标的情况下，表示非平稳的TD目标。我们形式化了这些效应并进行了实证验证，表明在可塑性丧失构成挑战的场景中（例如高UTD在线强化学习问题），流匹配评论家显著优于单体式评论家（最终性能为2$\times$，样本效率约为5$\times$），同时在学习过程中保持稳定。

Tendon Force Modeling for Sim2Real Transfer of Reinforcement Learning Policies for Tendon-Driven Robots

面向腱驱动机器人强化学习策略Sim2Real迁移的腱力建模

Authors: Valentin Yuryev, Josie Hughes
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.04351
Pdf link: https://arxiv.org/pdf/2603.04351
Abstract Robots that make use of soft or compliant interactions often leverage tendon-driven actuation, which enables actuators to be placed more flexibly and compliance to be maintained. However, controlling complex tendon systems is challenging. Simulation paired with reinforcement learning (RL) could enable more complex behaviors to be generated. Such methods rely on torque- and force-based simulation rollouts, which are limited by the sim-to-real gap stemming from the actuator and system dynamics, resulting in poor transfer of RL policies onto real robots. To address this, we propose a method to model the tendon forces produced by typical servo motors, focusing specifically on the transfer of RL policies for a tendon-driven finger. Our approach extends existing data-driven techniques by leveraging contextual history and a novel data collection test-bench. This test-bench allows us to capture tendon forces during contact-rich interactions typical of real-world manipulation. We then utilize our force estimation model in a GPU-accelerated tendon force-driven rigid body simulation to train RL-based controllers. Our transformer-based model is capable of predicting tendon forces within 3% of the maximum motor force and is robot-agnostic. By integrating our learned model into simulation, we reduce the sim-to-real gap for test trajectories by 41%. An RL-based controller trained with our model achieves a 50% improvement in fingertip pose tracking tasks on real tendon-driven robotic fingers. This approach is generalizable to different actuators and robot systems, and can enable RL policies to be used widely across tendon systems, advancing capabilities of dexterous manipulators and soft robots.
中文摘要 采用软性或顺应式交互的机器人通常利用腱驱动方式，使执行器能够更灵活地放置，并保持顺应性。然而，控制复杂的肌腱系统具有挑战性。仿真与强化学习（RL）结合，可以生成更复杂的行为。此类方法依赖基于扭矩和力的仿真展开，但受限于执行器和系统动力学带来的仿真与现实差距，导致强化学习策略难以有效迁移至真实机器人。为此，我们提出了一种方法来建模典型伺服电机产生的腱力，特别关注腱驱动手指的强化学习策略迁移。我们的方法通过利用上下文历史和新颖的数据采集测试台，扩展了现有的数据驱动技术。该测试台使我们能够捕捉真实世界操作中典型的接触丰富交互下的腱力。随后，我们将力估计模型应用于GPU加速的腱力驱动刚体仿真，训练基于强化学习的控制器。我们基于Transformer的模型能够以最大电机力3%以内的误差预测腱力，并且与机器人无关。通过将我们学到的模型整合进仿真，我们将测试轨迹的仿真与现实差距缩小了41%。使用我们的模型训练的基于强化学习的控制器，在真实腱驱动机器人手指上的指尖位姿跟踪任务中取得了50%的提升。该方法可推广至不同的执行器和机器人系统，使强化学习策略能够广泛应用于肌腱系统，提升灵巧操作器和软体机器人的能力。

A Constrained RL Approach for Cost-Efficient Delivery of Latency-Sensitive Applications

一种约束强化学习方法，用于低成本交付对延迟敏感的应用

Authors: Ozan Aygün, Vincenzo Norman Vitale, Antonia M. Tulino, Hao Feng, Elza Erkip, Jaime Llorca
Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.04353
Pdf link: https://arxiv.org/pdf/2603.04353
Abstract Next-generation networks aim to provide performance guarantees to real-time interactive services that require timely and cost-efficient packet delivery. In this context, the goal is to reliably deliver packets with strict deadlines imposed by the application while minimizing overall resource allocation cost. A large body of work has leveraged stochastic optimization techniques to design efficient dynamic routing and scheduling solutions under average delay constraints; however, these methods fall short when faced with strict per-packet delay requirements. We formulate the minimum-cost delay-constrained network control problem as a constrained Markov decision process and utilize constrained deep reinforcement learning (CDRL) techniques to effectively minimize total resource allocation cost while maintaining timely throughput above a target reliability level. Results indicate that the proposed CDRL-based solution can ensure timely packet delivery even when existing baselines fall short, and it achieves lower cost compared to other throughput-maximizing methods.
中文摘要 下一代网络旨在为需要及时且经济高效的数据包交付的实时交互服务提供性能保证。在此背景下，目标是在应用程序设定的严格截止期限内可靠地交付数据包，同时最小化整体资源分配成本。大量研究利用随机优化技术设计了在平均延迟约束下高效的动态路由和调度解决方案；然而，当面对严格的逐包延迟要求时，这些方法力有不逮。我们将最小成本延迟约束网络控制问题建模为约束马尔可夫决策过程，并利用约束深度强化学习（CDRL）技术，有效最小化总资源分配成本，同时将及时吞吐量保持在目标可靠性水平之上。结果表明，所提出的基于CDRL的解决方案即使在现有基线方法失效时也能确保数据包的及时交付，并且与其他吞吐量最大化方法相比实现了更低的成本。

Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks

双模态多阶段对抗性安全训练：增强多模态网络智能体抵御跨模态攻击的鲁棒性

Authors: Haoyu Liu, Dingcheng Li, Lukas Rutishauser, Zeyu Zheng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.04364
Pdf link: https://arxiv.org/pdf/2603.04364
Abstract Multimodal web agents that process both screenshots and accessibility trees are increasingly deployed to interact with web interfaces, yet their dual-stream architecture opens an underexplored attack surface: an adversary who injects content into the webpage DOM simultaneously corrupts both observation channels with a consistent deceptive narrative. Our vulnerability analysis on MiniWob++ reveals that attacks including a visual component far outperform text-only injections, exposing critical gaps in text-centric VLM safety training. Motivated by this finding, we propose Dual-Modality Multi-Stage Adversarial Safety Training (DMAST), a framework that formalizes the agent-attacker interaction as a two-player zero-sum Markov game and co-trains both players through a three-stage pipeline: (1) imitation learning from a strong teacher model, (2) oracle-guided supervised fine-tuning that uses a novel zero-acknowledgment strategy to instill task-focused reasoning under adversarial noise, and (3) adversarial reinforcement learning via Group Relative Policy Optimization (GRPO) self-play. On out-of-distribution tasks, DMAST substantially mitigates adversarial risks while simultaneously doubling task completion efficiency. Our approach significantly outperforms established training-based and prompt-based defenses, demonstrating genuine co-evolutionary progress and robust generalization to complex, unseen environments.
中文摘要 同时处理截图和无障碍树的多模态网络智能体越来越多地被部署用于与网页界面交互，但其双流架构暴露了一个尚未充分探索的攻击面：向网页DOM注入内容的攻击者，会以一致的欺骗性叙事同时破坏两个观察通道。我们在MiniWob++上的漏洞分析显示，包含视觉成分的攻击效果远超纯文本注入，暴露了以文本为中心的VLM安全训练中的关键空白。基于这一发现，我们提出了双模态多阶段对抗性安全训练（DMAST），该框架将智能体与攻击者的交互形式化为两人零和马尔可夫博弈，并通过三阶段流程共同训练双方：（1）向强教师模型进行模仿学习，（2）oracle引导的监督微调，采用新颖的零确认策略在对抗噪声下培养专注于任务的推理，以及（3）通过群体相对策略优化（GRPO）自我博弈进行对抗强化学习。在分布外任务中，DMAST大幅降低了对抗性风险，同时将任务完成效率提高了一倍。我们的方法显著优于既有的基于训练和基于提示的防御，展现了真正的协同进化进展以及对复杂、未见环境的稳健泛化能力。

TaxonRL: Reinforcement Learning with Intermediate Rewards for Interpretable Fine-Grained Visual Reasoning

TaxonRL：带有中间奖励的强化学习，用于可解释的细粒度视觉推理

Authors: Maximilian von Klinski, Maximilian Schall
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.04380
Pdf link: https://arxiv.org/pdf/2603.04380
Abstract Traditional vision-language models struggle with contrastive fine-grained taxonomic reasoning, particularly when distinguishing between visually similar species within the same genus or family. We introduce TaxonRL, a reinforcement learning approach using Group Relative Policy Optimization with intermediate rewards that decomposes the reasoning process into hierarchical taxonomic predictions. Our method incentivizes models to explicitly reason about species-level, genus-level, and family-level features before making final classifications. This structured approach is designed not only to boost accuracy but also to yield a transparent, verifiable decision-making process. On the challenging Birds-to-Words dataset, TaxonRL achieves 91.7\% average accuracy, exceeding human performance (77.3\%) while generating interpretable reasoning traces. We demonstrate strong cross-domain generalization, showing substantial gains in primate and marine species verification. Our results establish that enforcing structured, hierarchical reasoning provides a powerful and transferable framework for fine-grained visual discrimination.
中文摘要 传统的视觉语言模型在对比式细粒度分类学推理方面存在困难，尤其是在区分同一属或科内视觉相似的物种时。我们提出了TaxonRL，这是一种使用带有中间奖励的组相对策略优化（Group Relative Policy Optimization）的强化学习方法，将推理过程分解为层级分类学预测。我们的方法激励模型在做出最终分类前，显式推理物种层面、属层面和科层面的特征。这种结构化方法不仅旨在提升准确率，还旨在产生透明且可验证的决策过程。在具有挑战性的Birds-to-Words数据集上，TaxonRL的平均准确率达到91.7%，超过人类表现（77.3%），同时生成可解释的推理轨迹。我们展示了强大的跨域泛化能力，在灵长类和海洋物种验证方面取得了显著提升。我们的结果表明，强制执行结构化、层级式推理为细粒度视觉判别提供了强大且可迁移的框架。
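
As a rough illustration of how intermediate rewards could combine with Group Relative Policy Optimization, the sketch below scores each sampled response on family, genus, species, and final answer, then normalizes the rewards within the group. The reward weights, the dictionary keys, and the example labels are assumptions chosen for illustration; the abstract does not specify TaxonRL's actual reward design. Only the group-relative normalization (reward minus group mean, divided by group standard deviation) follows the standard GRPO formulation.

```python
import statistics


def hierarchical_reward(pred, target, weights=(0.2, 0.2, 0.2, 0.4)):
    """Sum partial credit for each correct taxonomic level plus the final answer.

    The weights are illustrative assumptions, not values from the paper.
    """
    keys = ("family", "genus", "species", "answer")
    return sum(w for k, w in zip(keys, weights) if pred.get(k) == target[k])


def group_relative_advantages(rewards, eps=1e-6):
    """Standard GRPO-style advantage: normalize rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of four sampled responses for a single prompt.
target = {"family": "Paridae", "genus": "Poecile",
          "species": "atricapillus", "answer": "same"}
group = [
    dict(target),                                        # everything correct
    {"family": "Paridae", "genus": "Poecile",
     "species": "carolinensis", "answer": "different"},  # family and genus only
    {"family": "Paridae", "genus": "Parus",
     "species": "major", "answer": "different"},         # family only
    {"family": "Corvidae", "genus": "Corvus",
     "species": "corax", "answer": "different"},         # nothing correct
]
rewards = [hierarchical_reward(p, target) for p in group]
advantages = group_relative_advantages(rewards)
```

With partial credit, responses that get the family or genus right still receive a graded signal instead of the same zero reward as a fully wrong response, which is the general motivation for intermediate rewards.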

Keyword: diffusion policy

There is no result