Arxiv Papers of Today

Generated: 2026-03-24 16:59:30 (UTC+8); Arxiv release time: 2026-03-24 20:00 EDT (2026-03-25 08:00 UTC+8)

74 related papers today

Keyword: reinforcement learning

Fast-Slow Thinking RM: Efficient Integration of Scalar and Generative Reward Models

Authors: Jiayun Wu, Peixu Hou, Shan Qu, Peng Zhang, Ning Gu, Tun Lu
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.20212
Pdf link: https://arxiv.org/pdf/2603.20212
Abstract Reward models (RMs) are critical for aligning Large Language Models via Reinforcement Learning from Human Feedback (RLHF). While Generative Reward Models (GRMs) achieve superior accuracy through chain-of-thought (CoT) reasoning, they incur substantial computational costs. Conversely, Scalar Reward Models (SRMs) offer efficiency but suffer from limited performance and adaptability in complex scenarios. We introduce Fast-Slow Thinking Reward Models (F/S-RM), a hybrid RM architecture inspired by Dual Process Theory. It trains a single model to integrate two distinct reward paradigms: first-token prediction as a scalar score (fast thinking) and CoT-based judgment (slow thinking), regulated by a dual-confidence activation mechanism that determines when to activate slow thinking. F/S-RM achieves a 1.2% relative performance improvement over state-of-the-art models while reducing token consumption by 20.8%. Code and data will be publicly available.
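
A minimal sketch of the fast/slow routing idea in the abstract above, not the authors' implementation. It assumes the fast path reads a probability distribution over numeric score tokens at the first generated position, and that "dual confidence" means two checks (the top probability and its margin over the runner-up). Both checks, the thresholds, and the `slow_judge` callback are hypothetical.

```python
from typing import Callable, Dict, Tuple


def fast_score(first_token_probs: Dict[str, float]) -> Tuple[float, float, float]:
    """Return (expected score, top probability, margin over the runner-up)."""
    total = sum(first_token_probs.values())
    probs = {tok: p / total for tok, p in first_token_probs.items()}
    # Score tokens are assumed to be numeric strings such as "1".."5".
    expected = sum(float(tok) * p for tok, p in probs.items())
    ranked = sorted(probs.values(), reverse=True)
    top = ranked[0]
    margin = top - (ranked[1] if len(ranked) > 1 else 0.0)
    return expected, top, margin


def route(
    first_token_probs: Dict[str, float],
    slow_judge: Callable[[], float],
    min_top: float = 0.6,      # hypothetical threshold
    min_margin: float = 0.3,   # hypothetical threshold
) -> Tuple[float, str]:
    """Use the scalar score when both confidence checks pass, else run the CoT judge."""
    expected, top, margin = fast_score(first_token_probs)
    if top >= min_top and margin >= min_margin:
        return expected, "fast"
    # Low confidence: pay for the slower chain-of-thought judgment.
    return slow_judge(), "slow"
```

Calling `route({"1": 0.05, "5": 0.95}, slow_judge)` stays on the fast path, while an evenly split distribution such as `{"2": 0.5, "4": 0.5}` falls through to `slow_judge`.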

Beyond Scalar Rewards: Distributional Reinforcement Learning with Preordered Objectives for Safe and Reliable Autonomous Driving

Authors: Ahmed Abouelazm, Jonas Michel, Daniel Bogdoll, Philip Schörner, J. Marius Zöllner
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.20230
Pdf link: https://arxiv.org/pdf/2603.20230
Abstract Autonomous driving involves multiple, often conflicting objectives such as safety, efficiency, and comfort. In reinforcement learning (RL), these objectives are typically combined through weighted summation, which collapses their relative priorities and often yields policies that violate safety-critical constraints. To overcome this limitation, we introduce the Preordered Multi-Objective MDP (Pr-MOMDP), which augments standard MOMDPs with a preorder over reward components. This structure enables reasoning about actions with respect to a hierarchy of objectives rather than a scalar signal. To make this structure actionable, we extend distributional RL with a novel pairwise comparison metric, Quantile Dominance (QD), that evaluates action return distributions without reducing them to a single statistic. Building on QD, we propose an algorithm for extracting optimal subsets, the subsets of actions that remain non-dominated under each objective, which allows precedence information to shape both decision-making and training targets. Our framework is instantiated with Implicit Quantile Networks (IQN), establishing a concrete implementation while preserving compatibility with a broad class of distributional RL methods. Experiments in CARLA show improved success rates, fewer collisions and off-road events, and statistically more robust policies than IQN and ensemble-IQN baselines. By ensuring that policies respect the reward preorder, our work advances safer, more reliable autonomous driving systems.
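
A rough sketch of priority-ordered action filtering on quantile estimates, under stated assumptions. The abstract does not define Quantile Dominance, so the code uses a stand-in: the fraction of quantile levels at which one action's return quantile is at least the other's. It also simplifies the preorder to a strict priority list. The function names and the example data are hypothetical.

```python
from typing import Dict, List, Sequence


def quantile_dominance(q_a: Sequence[float], q_b: Sequence[float]) -> float:
    """Assumed stand-in for QD: fraction of quantile levels where a >= b."""
    if len(q_a) != len(q_b):
        raise ValueError("quantile vectors must have the same length")
    return sum(x >= y for x, y in zip(q_a, q_b)) / len(q_a)


def dominates(q_a: Sequence[float], q_b: Sequence[float], threshold: float = 1.0) -> bool:
    """a dominates b if it is at least as good on `threshold` of the levels and differs."""
    return quantile_dominance(q_a, q_b) >= threshold and list(q_a) != list(q_b)


def filter_actions(
    quantiles: Dict[str, Dict[str, List[float]]],  # quantiles[objective][action]
    priority: List[str],                           # highest-priority objective first
) -> List[str]:
    """Keep, objective by objective, the actions no surviving action dominates."""
    survivors = list(next(iter(quantiles.values())).keys())
    for objective in priority:
        q = quantiles[objective]
        survivors = [
            a for a in survivors
            if not any(dominates(q[b], q[a]) for b in survivors if b != a)
        ]
    return survivors


example = {
    "safety": {
        "brake": [0.0, 0.5, 1.0],
        "swerve": [0.0, 0.5, 1.0],
        "accelerate": [-2.0, -1.0, 0.0],
    },
    "progress": {
        "brake": [0.0, 0.1, 0.2],
        "swerve": [0.3, 0.4, 0.5],
        "accelerate": [1.0, 2.0, 3.0],
    },
}
chosen = filter_actions(example, ["safety", "progress"])
```

In this toy example `accelerate` is removed at the safety level even though it has the best progress returns, which is the behavior a weighted sum cannot guarantee.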

Emergency Lane-Change Simulation: A Behavioral Guidance Approach for Risky Scenario Generation

Authors: Chen Xiong, Cheng Wang, Yuhang Liu, Zirui Wu, Ye Tian
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.20234
Pdf link: https://arxiv.org/pdf/2603.20234
Abstract In contemporary autonomous driving testing, virtual simulation has become an important approach due to its efficiency and cost-effectiveness. However, existing methods usually rely on reinforcement learning to generate risky scenarios, making it difficult to efficiently learn realistic emergency behaviors. To address this issue, we propose a behavior-guided method for generating high-risk lane-change scenarios. First, a behavior learning module based on an optimized sequence generative adversarial network is developed to learn emergency lane-change behaviors from an extracted dataset. This design alleviates the limitations of existing datasets and improves learning from relatively few samples. Then, the opposing vehicle is modeled as an agent, and the road environment together with surrounding vehicles is incorporated into the operating environment. Based on the Recursive Proximal Policy Optimization strategy, the generated trajectories are used to guide the vehicle toward dangerous behaviors for more effective risk scenario exploration. Finally, the reference trajectory is combined with model predictive control as physical constraints to continuously optimize the strategy and ensure physical authenticity. Experimental results show that the proposed method can effectively learn high-risk trajectory behaviors from limited data and generate high-risk collision scenarios more efficiently than traditional methods such as grid search and manual design.

Joint Trajectory, RIS, and Computation Offloading Optimization via Decentralized Model-Based PPO in Urban Multi-UAV Mobile Edge Computing

Authors: Liangshun Wu, Jianbo Du, Junsuo Qu
Subjects: Systems and Control (eess.SY); Information Theory (cs.IT)
Arxiv link: https://arxiv.org/abs/2603.20238
Pdf link: https://arxiv.org/pdf/2603.20238
Abstract Efficient computation offloading in multi-UAV edge networks becomes particularly challenging in dense urban areas, where line-of-sight (LoS) links are frequently blocked and user demand varies rapidly. Reconfigurable intelligent surfaces (RISs) can mitigate blockage by creating controllable reflected links, but realizing their potential requires tightly coupled decisions on UAV trajectories, offloading schedules, and RIS phase configurations. This joint optimization is hard to solve in practice because multiple UAVs must coordinate under limited information exchange, and purely model-free multi-agent reinforcement learning (MARL) often learns too slowly in highly dynamic environments. To address these challenges, we propose a decentralized model-based MARL framework. Each UAV optimizes mobility and offloading using observations from several hop neighbors, and submits an RIS phase proposal that is aggregated by a lightweight RIS controller. To boost sample efficiency and stability, agents learn local dynamics models and perform short horizon branched rollouts for proximal policy optimization (PPO) updates. Simulations show near centralized performance with improved throughput and energy efficiency at scale.

JCAS-MARL: Joint Communication and Sensing UAV Networks via Resource-Constrained Multi-Agent Reinforcement Learning

Authors: Islam Guven, Mehmet Parlak
Subjects: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2603.20265
Pdf link: https://arxiv.org/pdf/2603.20265
Abstract Multi-UAV networks are increasingly deployed for large-scale inspection and monitoring missions, where operational performance depends on the coordination of sensing reliability, communication quality, and energy constraints. In particular, the rapid increase in overflowing waste bins and illegal dumping sites has created a need for efficient detection of waste hotspots. In this work, we introduce JCAS-MARL, a resource-aware multi-agent reinforcement learning (MARL) framework for joint communication and sensing (JCAS)-enabled UAV networks. Within this framework, multiple UAVs operate in a shared environment where each agent jointly controls its trajectory and the resource allocation of an OFDM waveform used simultaneously for sensing and communication. Battery consumption, charging behavior, and associated CO2 emissions are incorporated into the system state to model realistic operational constraints. Information sharing occurs over a dynamic communication graph determined by UAV positions and wireless channel conditions. Waste hotspot detection requires consensus among multiple UAVs to improve reliability. Using this environment, we investigate how MARL policies exploit the sensing-communication-energy trade-off in JCAS-enabled UAV networks. Simulation results demonstrate that adaptive pilot-density control learned by the agents can outperform static configurations, particularly in scenarios where sensing accuracy and communication connectivity vary across the environment.

Learning Communication Between Heterogeneous Agents in Multi-Agent Reinforcement Learning for Autonomous Cyber Defence

Authors: Alex Popa, Adrian Taylor, Ranwa Al Mallah
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2603.20279
Pdf link: https://arxiv.org/pdf/2603.20279
Abstract Reinforcement learning techniques are being explored as solutions to the threat of cyber attacks on enterprise networks. Recent research in the field of AI in cyber security has investigated the ability of homogeneous multi-agent reinforcement learning agents, capable of inter-agent communication, to respond to cyber attacks. This paper advances the study of learned communication in multi-agent systems by examining heterogeneous agent capabilities within a simulated network environment. To this end, we leverage CommFormer, a publicly available state-of-the-art communication algorithm, to train and evaluate agents within the Cyber Operations Research Gym (CybORG). Our results show that CommFormer agents with heterogeneous capabilities can outperform other algorithms deployed in the CybORG environment, converging to an optimal policy up to four times faster while improving standard error by up to 38%. The agents implemented in this project provide an additional avenue for exploration in the field of AI for cyber security, enabling further research involving realistic networks.

MARLIN: Multi-Agent Reinforcement Learning for Incremental DAG Discovery

Authors: Dong Li, Zhengzhang Chen, Xujiang Zhao, Linlin Yu, Zhong Chen, Yi He, Haifeng Chen, Chen Zhao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.20295
Pdf link: https://arxiv.org/pdf/2603.20295
Abstract Uncovering causal structures from observational data is crucial for understanding complex systems and making informed decisions. While reinforcement learning (RL) has shown promise in identifying these structures in the form of a directed acyclic graph (DAG), existing methods often lack efficiency, making them unsuitable for online applications. In this paper, we propose MARLIN, an efficient multi-agent RL-based approach for incremental DAG learning. MARLIN uses a DAG generation policy that maps a continuous real-valued space to the DAG space as an intra-batch strategy, then incorporates two RL agents, one state-specific and one state-invariant, to uncover causal relationships, and integrates these agents into an incremental learning framework. Furthermore, the framework leverages a factored action space to enhance parallelization efficiency. Extensive experiments on synthetic and real datasets demonstrate that MARLIN outperforms state-of-the-art methods in terms of both efficiency and effectiveness.
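
One common way to map a real-valued vector to a DAG, shown as an illustration only. The abstract does not specify MARLIN's mapping; this sketch assumes the first `d` entries are node scores that induce an ordering and the remaining entries are edge logits, with every kept edge oriented from the lower-scored node to the higher-scored one so that no cycle can form. `vector_to_dag` and `is_acyclic` are hypothetical names.

```python
from itertools import combinations
from typing import List, Sequence


def vector_to_dag(z: Sequence[float], d: int) -> List[List[int]]:
    """Map a vector of length d + d*(d-1)/2 to a d x d DAG adjacency matrix."""
    n_pairs = d * (d - 1) // 2
    if len(z) != d + n_pairs:
        raise ValueError("expected d + d*(d-1)/2 entries")
    scores, logits = z[:d], z[d:]
    adj = [[0] * d for _ in range(d)]
    for k, (i, j) in enumerate(combinations(range(d), 2)):
        if logits[k] > 0:
            # Orient from lower to higher score; ties are broken by node index.
            src, dst = (i, j) if (scores[i], i) < (scores[j], j) else (j, i)
            adj[src][dst] = 1
    return adj


def is_acyclic(adj: List[List[int]]) -> bool:
    """Kahn's algorithm: the graph is a DAG iff every node can be removed."""
    d = len(adj)
    indegree = [sum(adj[i][j] for i in range(d)) for j in range(d)]
    ready = [j for j in range(d) if indegree[j] == 0]
    removed = 0
    while ready:
        u = ready.pop()
        removed += 1
        for v in range(d):
            if adj[u][v]:
                indegree[v] -= 1
                if indegree[v] == 0:
                    ready.append(v)
    return removed == d


# Three nodes: scores (0.3, -1.0, 2.0), edge logits for pairs (0,1), (0,2), (1,2).
graph = vector_to_dag([0.3, -1.0, 2.0, 1.0, -0.5, 0.7], d=3)
```

Because the orientation follows a total order on the nodes, any real vector yields a valid DAG, which lets a policy act in an unconstrained continuous space.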

Bounded Coupled AI Learning Dynamics in Tri-Hierarchical Drone Swarms

Authors: Oleksii Bychkov
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2603.20333
Pdf link: https://arxiv.org/pdf/2603.20333
Abstract Modern autonomous multi-agent systems combine heterogeneous learning mechanisms operating at different timescales. An open question remains: can one formally guarantee that coupled dynamics of such mechanisms stay within the admissible operational regime? This paper studies a tri-hierarchical swarm learning system where three mechanisms act simultaneously: (1) local Hebbian online learning at individual agent level (fast timescale, 10-100 ms); (2) multi-agent reinforcement learning (MARL) for tactical group coordination (medium timescale, 1-10 s); (3) meta-learning (MAML) for strategic adaptation (slow timescale, 10-100 s). Four results are established. The Bounded Total Error Theorem shows that under contractual constraints on learning rates, Lipschitz continuity of inter-level mappings, and weight stabilization, total suboptimality admits a component-wise upper bound uniform in time. The Bounded Representation Drift Theorem gives a worst-case estimate of how Hebbian updates affect coordination-level embeddings during one MARL cycle. The Meta-Level Compatibility Theorem provides sufficient conditions under which strategic adaptation preserves lower-level invariants. The Non-Accumulation Theorem proves that error does not grow unboundedly over time.

Leum-VL Technical Report

Authors: Yuxuan He, Chaiming Huang, Yifan Wu, Hongjun Wang, Chenkui Shen, Jifan Zhang, Long Li
Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.20354
Pdf link: https://arxiv.org/pdf/2603.20354
Abstract A short video succeeds not simply because of what it shows, but because of how it schedules attention -- yet current multimodal models lack the structural grammar to parse or produce this organization. Existing models can describe scenes, answer event-centric questions, and read on-screen text, but they are far less reliable at identifying timeline-grounded units such as hooks, cut rationales, shot-induced tension, and platform-facing packaging cues. We propose SV6D (Structured Video in Six Dimensions), inspired by professional storyboard practice in film and television production, a representation framework that decomposes internet-native video into six complementary structural dimensions -- subject, aesthetics, camera language, editing, narrative, and dissemination -- with each label tied to physically observable evidence on the timeline. We formalize a unified optimization objective over SV6D that combines Hungarian-matched temporal alignment, dimension-wise semantic label distance, and quality regularization. Building on this framework, we present Leum-VL-8B, an 8B video-language model that realizes the SV6D objective through an expert-driven post-training pipeline, further refined through verifiable reinforcement learning on perception-oriented tasks. Leum-VL-8B achieves 70.8 on VideoMME (w/o subtitles), 70.0 on MVBench, and 61.6 on MotionBench, while remaining competitive on general multimodal evaluations such as MMBench-EN. We also construct FeedBench, a benchmark for structure-sensitive short-video understanding. Our results indicate that the missing layer in video AI is not pixel generation but structural representation: grounded on the timeline, linked to visible evidence, and directly consumable by downstream workflows such as editing, retrieval, recommendation, and generation control, including text-heavy internet video formats with overlays and image-text layouts.
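
A small sketch of matching predicted timeline segments to reference segments before scoring them, in the spirit of the "Hungarian-matched temporal alignment" mentioned above. It is not the paper's objective: the cost (one minus temporal IoU, plus a label-mismatch penalty) and the `label_weight` parameter are assumptions, and an exhaustive search over permutations replaces the Hungarian algorithm, which finds the same optimum but scales to larger sets.

```python
from itertools import permutations
from typing import List, Sequence, Tuple

Segment = Tuple[float, float, str]  # (start_sec, end_sec, label)


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def pair_cost(pred: Segment, gold: Segment, label_weight: float = 1.0) -> float:
    """Assumed cost: temporal misalignment plus a penalty for a wrong label."""
    return (1.0 - temporal_iou(pred, gold)) + label_weight * (pred[2] != gold[2])


def match(preds: Sequence[Segment], golds: Sequence[Segment]) -> Tuple[float, List[int]]:
    """Minimum-cost one-to-one matching; result[i] is the gold index for preds[i]."""
    if len(preds) != len(golds):
        raise ValueError("this sketch assumes equally many predictions and references")
    best_cost, best_perm = float("inf"), ()
    for perm in permutations(range(len(golds))):
        cost = sum(pair_cost(p, golds[g]) for p, g in zip(preds, perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_cost, list(best_perm)


preds = [(5.0, 9.0, "cut"), (0.0, 3.0, "hook")]
golds = [(0.0, 3.0, "hook"), (5.0, 10.0, "cut")]
total_cost, assignment = match(preds, golds)
```

Matching first makes the loss independent of the order in which the model emits segments.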

CAMA: Exploring Collusive Adversarial Attacks in c-MARL

Authors: Men Niu, Xinxin Fan, Quanliang Jing, Shaoye Luo, Yunfeng Lu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.20390
Pdf link: https://arxiv.org/pdf/2603.20390
Abstract Cooperative multi-agent reinforcement learning (c-MARL) has been widely deployed in real-world applications such as social robots, embodied intelligence, and UAV swarms. Nevertheless, many adversarial attacks still threaten c-MARL systems. At present, studies mainly focus on single-adversary perturbation attacks and white-box adversarial attacks that manipulate agents' internal observations or actions. To address these limitations, in this paper we study collusive adversarial attacks by strategically organizing a set of malicious agents into three collusive attack modes: Collective Malicious Agents, Disguised Malicious Agents, and Spied Malicious Agents. Three novelties are involved: i) three collusive adversarial attacks are proposed for the first time, and a unified framework, CAMA, for policy-level collusive attacks is designed; ii) the attack effectiveness is theoretically analyzed from the perspectives of disruptiveness, stealthiness, and attack cost; and iii) the three collusive adversarial attacks are technically realized through fusion of agents' observation information and attack-trigger control. Finally, multi-facet experiments on four SMAC II maps are performed, and the experimental results show that the three collusive attacks have an additive adversarial synergy, strengthening the attack outcome while maintaining high stealthiness and stability over long horizons. Our work fills the gap for collusive adversarial learning in c-MARL.

SymCircuit: Bayesian Structure Inference for Tractable Probabilistic Circuits via Entropy-Regularized Reinforcement Learning

Authors: Y. Sungtaek Ju
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2603.20392
Pdf link: https://arxiv.org/pdf/2603.20392
Abstract Probabilistic circuit (PC) structure learning is hampered by greedy algorithms that make irreversible, locally optimal decisions. We propose SymCircuit, which replaces greedy search with a learned generative policy trained via entropy-regularized reinforcement learning. Instantiating the RL-as-inference framework in the PC domain, we show the optimal policy is a tempered Bayesian posterior, recovering the exact posterior when the regularization temperature is set inversely proportional to the dataset size. The policy is implemented as SymFormer, a grammar-constrained autoregressive Transformer with tree-relative self-attention that guarantees valid circuits at every generation step. We introduce option-level REINFORCE, restricting gradient updates to structural decisions rather than all tokens, yielding an SNR (signal to noise ratio) improvement and >10 times sample efficiency gain on the NLTCS dataset. A three-layer uncertainty decomposition (structural via model averaging, parametric via the delta method, leaf via conjugate Dirichlet-Categorical propagation) is grounded in the multilinear polynomial structure of PC outputs. On NLTCS, SymCircuit closes 93% of the gap to LearnSPN; preliminary results on Plants (69 variables) suggest scalability.
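
The tempered-posterior claim in the abstract above can be illustrated with the standard RL-as-inference calculation. This is a sketch under assumptions the abstract does not state: the regularizer is written as a KL penalty toward a structure prior $p(S)$ (which reduces to an entropy bonus when the prior is uniform), and the reward is taken to be the average log-likelihood of the $N$ data points.

```latex
\begin{align*}
% Assumed objective: expected reward minus a KL penalty with temperature \tau.
\pi^\star &= \arg\max_{\pi}\; \mathbb{E}_{S\sim\pi}\big[R(S)\big]
             - \tau\,\mathrm{KL}\big(\pi \,\|\, p\big)
  &&\Longrightarrow\quad
  \pi^\star(S) \propto p(S)\,\exp\!\big(R(S)/\tau\big), \\
% Assumed reward: average log-likelihood over the dataset \mathcal{D} of size N.
R(S) &= \tfrac{1}{N}\,\log p(\mathcal{D}\mid S)
  &&\Longrightarrow\quad
  \pi^\star(S) \propto p(S)\,p(\mathcal{D}\mid S)^{1/(N\tau)}, \\
\tau &= \tfrac{1}{N}
  &&\Longrightarrow\quad
  \pi^\star(S) \propto p(S)\,p(\mathcal{D}\mid S) \propto p(S\mid\mathcal{D}).
\end{align*}
```

Under these assumptions, any other temperature gives a posterior with the likelihood raised to the power $1/(N\tau)$, and $\tau = 1/N$ is the value at which the exponent equals one.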

Reinforcement Learning from Multi-Source Imperfect Preferences: Best-of-Both-Regimes Regret

Authors: Ming Shi, Yingbin Liang, Ness B. Shroff, Ananthram Swami
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.20453
Pdf link: https://arxiv.org/pdf/2603.20453
Abstract Reinforcement learning from human feedback (RLHF) replaces hard-to-specify rewards with pairwise trajectory preferences, yet regret-oriented theory often assumes that preference labels are generated consistently from a single ground-truth objective. In practical RLHF systems, however, feedback is typically multi-source (annotators, experts, reward models, heuristics) and can exhibit systematic, persistent mismatches due to subjectivity, expertise variation, and annotation/modeling artifacts. We study episodic RL from multi-source imperfect preferences through a cumulative imperfection budget: for each source, the total deviation of its preference probabilities from an ideal oracle is at most $\omega$ over $K$ episodes. We propose a unified algorithm with regret $\tilde{O}(\sqrt{K/M}+\omega)$, which exhibits a best-of-both-regimes behavior: it achieves $M$-dependent statistical gains when imperfection is small (where $M$ is the number of sources), while remaining robust with unavoidable additive dependence on $\omega$ when imperfection is large. We complement this with a lower bound $\tilde{\Omega}(\max\{\sqrt{K/M},\omega\})$, which captures the best possible improvement with respect to $M$ and the unavoidable dependence on $\omega$, and a counterexample showing that naïvely treating imperfect feedback as oracle-consistent can incur regret as large as $\tilde{\Omega}(\min\{\omega\sqrt{K},K\})$. Technically, our approach involves imperfection-adaptive weighted comparison learning, value-targeted transition estimation to control hidden feedback-induced distribution shift, and sub-importance sampling to keep the weighted objectives analyzable, yielding regret guarantees that quantify when multi-source feedback provably improves RLHF and how cumulative imperfection fundamentally limits it.
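
To make the three rates in the abstract above concrete, the snippet below evaluates them for one setting. It drops all constants and logarithmic factors hidden by the tilde notation, so the numbers only indicate how the rates scale; the function names and the chosen values of $K$, $M$, and $\omega$ are illustrative.

```python
import math


def robust_upper(K: int, M: int, omega: float) -> float:
    """Rate of the proposed algorithm's regret: sqrt(K/M) + omega."""
    return math.sqrt(K / M) + omega


def lower_bound(K: int, M: int, omega: float) -> float:
    """Rate of the lower bound: max(sqrt(K/M), omega)."""
    return max(math.sqrt(K / M), omega)


def naive_regret(K: int, omega: float) -> float:
    """Rate of the counterexample for oracle-consistent treatment: min(omega*sqrt(K), K)."""
    return min(omega * math.sqrt(K), K)


K, M, omega = 10_000, 4, 20.0
upper = robust_upper(K, M, omega)   # sqrt(2500) + 20
lower = lower_bound(K, M, omega)    # max(50, 20)
naive = naive_regret(K, omega)      # min(20 * 100, 10000)
```

Since $a + b \le 2\max\{a, b\}$ for nonnegative $a$ and $b$, the upper and lower rates always agree within a factor of two, whereas the naive rate multiplies $\omega$ by $\sqrt{K}$.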

Fluid Antenna Networks Beyond Beamforming: An AI-Native Control Paradigm for 6G

Authors: Ian F. Akyildiz, Tuğçe Bilen
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2603.20484
Pdf link: https://arxiv.org/pdf/2603.20484
Abstract Fluid Antenna Systems (FAS) introduce a new degree of freedom for wireless networks by enabling the physical antenna position to adapt dynamically to changing radio conditions. While existing studies primarily emphasize physical-layer gains, their broader implications for network operation remain largely unexplored. Once antennas become reconfigurable entities, antenna positioning naturally becomes part of the network control problem rather than a standalone optimization task. This article presents an AI-native perspective on fluid antenna networks for future 6G systems. Instead of treating antenna repositioning as an isolated operation, we consider a closed-loop control architecture in which antenna adaptation is jointly managed with conventional radio resource management (RRM) functions. Within this framework, real-time network observations are translated into coordinated antenna and resource configuration decisions that respond to user mobility, traffic demand, and evolving interference conditions. To address the complexity of multi-cell environments, we explore a multi-agent reinforcement learning (MARL) approach that enables distributed and adaptive control across base stations. Illustrative results show that intelligent antenna adaptation yields consistent performance gains, particularly at the cell edge, while also reducing inter-cell interference. These findings suggest that the true potential of fluid antenna systems lies not only in reconfigurable hardware, but in intelligent network control architectures that can effectively exploit this additional spatial degree of freedom.

Grounded Chess Reasoning in Language Models via Master Distillation

Authors: Zhenwei Tang, Qianfeng Wen, Seth Grief-Albert, Yahya Elgabra, Blair Yang, Honghua Dong, Ashton Anderson
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.20510
Pdf link: https://arxiv.org/pdf/2603.20510
Abstract Language models often lack grounded reasoning capabilities in specialized domains where training data is scarce but bespoke systems excel. We introduce a general framework for distilling expert system reasoning into natural language chain-of-thought explanations, enabling compact models to acquire domain expertise and the ability to generate faithful, grounded explanations. Rather than distilling only final outputs, we capture the full reasoning process, transforming opaque expert computations into transparent, step-by-step explanations. We demonstrate this approach in chess, a canonical reasoning domain where language models continue to underperform. Our 4B parameter model, C1, advances from a near-zero baseline to 48.1% accuracy, outperforming all open-source models and most frontier proprietary systems. Notably, C1 surpasses its distillation teacher and generates solutions in two orders of magnitude fewer tokens than baselines. Unlike prior neural chess approaches that predict only best moves, C1 generates explainable solutions revealing strategic reasoning. Our pipeline combines supervised fine-tuning and reinforcement learning with theme-balanced data sampling for comprehensive tactical coverage. Master Distillation demonstrates how to inject expert-level knowledge into compact models for under-optimized domains, offering a recipe for unlocking RLVR where LLMs lack sufficient base capabilities.

Delightful Distributed Policy Gradient

Authors: Ian Osband
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2603.20521
Pdf link: https://arxiv.org/pdf/2603.20521
Abstract Distributed reinforcement learning trains on data from stale, buggy, or mismatched actors, producing actions with high surprisal (negative log-probability) under the learner's policy. The core difficulty is not surprising data per se, but negative learning from surprising data. High-surprisal failures can dominate the update direction despite carrying little useful signal, while high-surprisal successes reveal opportunities the current policy would otherwise miss. The Delightful Policy Gradient (DG) separates these cases by gating each update with delight, the product of advantage and surprisal, suppressing rare failures and amplifying rare successes without behavior probabilities. Under contaminated sampling, the cosine similarity between the standard policy gradient and the true gradient collapses, while DG's grows as the policy improves. No sign-blind reweighting, including exact importance sampling, can reproduce this effect. On MNIST with simulated staleness, DG without off-policy correction outperforms importance-weighted PG with exact behavior probabilities. On a transformer sequence task with staleness, actor bugs, reward corruption, and rare discovery, DG achieves roughly $10\times$ lower error. When all four frictions act simultaneously, its compute advantage is order-of-magnitude and grows with task complexity.
中文摘要 分布式强化学习使用来自陈旧、有缺陷或不匹配的执行者（actor）的数据进行训练，这些数据中的动作在学习者策略下具有较高的惊异度（负对数概率）。核心难点并不在于数据本身令人意外，而在于从令人意外的数据中进行负向学习。高惊异度的失败可能主导更新方向，尽管其携带的有用信号很少；而高惊异度的成功则揭示了当前策略原本会错失的机会。Delightful Policy Gradient（DG）用“愉悦度”（delight，即优势与惊异度的乘积）对每次更新进行门控，从而区分这两种情况：在不需要行为概率的情况下抑制罕见的失败、放大罕见的成功。在受污染的采样下，标准策略梯度与真实梯度之间的余弦相似度会崩溃，而DG的余弦相似度则随着策略的改进而增长。任何不区分符号的重加权方法，包括精确的重要性采样，都无法重现这一效果。在模拟陈旧性的MNIST上，未做离策略校正的DG优于使用精确行为概率的重要性加权PG。在一个包含陈旧性、actor缺陷、奖励损坏和罕见发现等情形的Transformer序列任务上，DG的误差降低了约10倍。当这四种摩擦同时作用时，其计算优势达到数量级，并随任务复杂度的增加而增长。
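The gating idea in the abstract above can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract defines delight as the product of advantage and surprisal but does not give the gate's functional form, so the sigmoid gate, the `temperature` parameter, and the function name `delight_gated_weights` are hypothetical choices made here, not the paper's method.

```python
import numpy as np


def delight_gated_weights(advantages, log_probs, temperature=1.0):
    """Per-sample policy-gradient weights gated by delight.

    delight = advantage * surprisal, where surprisal = -log pi(a|s)
    under the learner's policy. ASSUMPTION: the gate is a sigmoid of
    delight / temperature; the abstract does not specify its form.
    """
    advantages = np.asarray(advantages, dtype=float)
    surprisal = -np.asarray(log_probs, dtype=float)
    delight = advantages * surprisal
    # Gate tends to 1 for surprising successes, to 0 for surprising failures,
    # and equals 0.5 when the action is unsurprising (surprisal near 0).
    gate = 1.0 / (1.0 + np.exp(-delight / temperature))
    # Each weight would multiply grad log pi(a|s) in the update.
    return gate * advantages


if __name__ == "__main__":
    # Two equally rare actions (probability 0.01), one success and one failure.
    weights = delight_gated_weights([1.0, -1.0], [np.log(0.01), np.log(0.01)])
    print(weights)
```

Note that the weights use only the learner's own log-probabilities, which matches the abstract's claim that no behavior probabilities are needed; the sign of the advantage decides whether surprisal amplifies or suppresses the sample.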

Current state of the multi-agent multi-view experimental and digital twin rendezvous (MMEDR-Autonomous) framework

多智能体多视角实验与数字孪生交会（MMEDR-Autonomous）框架的现状

Authors: Logan Banker, Michael Wozniak, Mohanad Alameer, Smriti Nandan Paul, David Meisinger, Grant Baer, Trevor Hunting, Ryan Dunham, Jay Kamdar
Subjects: Robotics (cs.RO); Space Physics (physics.space-ph)
Arxiv link: https://arxiv.org/abs/2603.20575
Pdf link: https://arxiv.org/pdf/2603.20575
Abstract As near-Earth resident space objects proliferate, there is an increasing demand for reliable technologies in applications of on-orbit servicing, debris removal, and orbit modification. Rendezvous and docking are critical mission phases for such applications and can benefit from greater autonomy to reduce operational complexity and human workload. Machine learning-based methods can be integrated within the guidance, navigation, and control (GNC) architecture to design a robust rendezvous and docking framework. In this work, the Multi-Agent Multi-View Experimental and Digital Twin Rendezvous (MMEDR-Autonomous) is introduced as a unified framework comprising a learning-based optical navigation network, a reinforcement learning-based guidance approach under ongoing development, and a hardware-in-the-loop testbed. Navigation employs a lightweight monocular pose estimation network with multi-scale feature fusion, trained on realistic image augmentations to mitigate domain shift. The guidance component is examined with emphasis on learning stability, reward design, and systematic hyperparameter tuning under mission-relevant constraints. Prior Control Barrier Function results for Clohessy-Wiltshire dynamics are reviewed as a basis for enforcing safety and operational constraints and for guiding future nonlinear controller design within the MMEDR-Autonomous framework. The MMEDR-Autonomous framework is currently progressing toward integrated experimental validation in multi-agent rendezvous scenarios.
中文摘要 随着近地驻留空间物体的激增，在轨服务、碎片清除和轨道修正等应用对可靠技术的需求日益增长。交会与对接是此类应用的关键任务阶段，提高自主性有助于降低操作复杂性和人员工作负荷。基于机器学习的方法可以集成到制导、导航与控制（GNC）架构中，用于设计稳健的交会与对接框架。本文介绍了多智能体多视角实验与数字孪生交会（MMEDR-Autonomous）这一统一框架，它包括基于学习的光学导航网络、仍在开发中的基于强化学习的制导方法以及硬件在环测试平台。导航部分采用带多尺度特征融合的轻量级单目位姿估计网络，并在逼真的图像增强数据上训练，以减轻域偏移。制导部分的考察重点是学习稳定性、奖励设计以及任务相关约束下的系统性超参数调优。本文还回顾了此前针对Clohessy-Wiltshire动力学的控制障碍函数（Control Barrier Function）结果，作为施加安全与操作约束的基础，并为MMEDR-Autonomous框架内未来的非线性控制器设计提供指导。MMEDR-Autonomous框架目前正朝着多智能体交会场景下的集成实验验证推进。

Towards Practical World Model-based Reinforcement Learning for Vision-Language-Action Models

迈向实用的基于世界模型的视觉-语言-行动模型强化学习

Authors: Zhilong Zhang, Haoxiang Ren, Yihao Sun, Yifei Sheng, Haonan Wang, Haoxin Lin, Zhichao Wu, Pierre-Luc Bacon, Yang Yu
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.20607
Pdf link: https://arxiv.org/pdf/2603.20607
Abstract Vision-Language-Action (VLA) models show strong generalization for robotic control, but finetuning them with reinforcement learning (RL) is constrained by the high cost and safety risks of real-world interaction. Training VLA models in interactive world models avoids these issues but introduces several challenges, including pixel-level world modeling, multi-view consistency, and compounding errors under sparse rewards. Building on recent advances across large multimodal models and model-based RL, we propose VLA-MBPO, a practical framework to tackle these problems in VLA finetuning. Our approach has three key design choices: (i) adapting unified multimodal models (UMMs) for data-efficient world modeling; (ii) an interleaved view decoding mechanism to enforce multi-view consistency; and (iii) chunk-level branched rollout to mitigate error compounding. Theoretical analysis and experiments across simulation and real-world tasks demonstrate that VLA-MBPO significantly improves policy performance and sample efficiency, underscoring its robustness and scalability for real-world robotic deployment.
中文摘要 视觉-语言-行动（VLA）模型在机器人控制方面表现出很强的泛化能力，但用强化学习（RL）对其进行微调受限于真实世界交互的高成本和安全风险。在交互式世界模型中训练VLA模型可以避免这些问题，但也带来了若干挑战，包括像素级世界建模、多视角一致性以及稀疏奖励下的误差累积。基于大型多模态模型和基于模型的强化学习的最新进展，我们提出了VLA-MBPO，这是一个用于解决VLA微调中上述问题的实用框架。我们的方法有三个关键设计：（i）改造统一多模态模型（UMM）以实现数据高效的世界建模；（ii）采用交错视角解码机制以保证多视角一致性；（iii）采用块（chunk）级分支推演（branched rollout）以缓解误差累积。理论分析以及仿真和真实世界任务上的实验表明，VLA-MBPO显著提升了策略性能和样本效率，凸显了其在真实机器人部署中的鲁棒性和可扩展性。

Speedup Patch: Learning a Plug-and-Play Policy to Accelerate Embodied Manipulation

加速补丁：学习即插即用策略以加速具身操控

Authors: Zhichao Wu, Junyin Ye, Zhilong Zhang, Yihao Sun, Haoxin Lin, Jiaheng Luo, Haoxiang Ren, Lei Yuan, Yang Yu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.20658
Pdf link: https://arxiv.org/pdf/2603.20658
Abstract While current embodied policies exhibit remarkable manipulation skills, their execution remains unsatisfactorily slow as they inherit the tardy pacing of human demonstrations. Existing acceleration methods typically require policy retraining or costly online interactions, limiting their scalability for large-scale foundation models. In this paper, we propose Speedup Patch (SuP), a lightweight, policy-agnostic framework that enables plug-and-play acceleration using solely offline data. SuP introduces an external scheduler that adaptively downsamples action chunks provided by embodied policies to eliminate redundancies. Specifically, we formalize the optimization of our scheduler as a Constrained Markov Decision Process (CMDP) aimed at maximizing efficiency without compromising task performance. Since direct success evaluation is infeasible in offline settings, SuP introduces World Model based state deviation as a surrogate metric to enforce safety constraints. By leveraging a learned world model as a virtual evaluator to predict counterfactual trajectories, the scheduler can be optimized via offline reinforcement learning. Empirical results on simulation benchmarks (Libero, Bigym) and real-world tasks validate that SuP achieves an overall 1.8x execution speedup for diverse policies while maintaining their original success rates.
中文摘要 尽管当前的具身策略展现出出色的操作技能，但其执行速度仍然慢得令人不满意，因为它们继承了人类演示的迟缓节奏。现有的加速方法通常需要重新训练策略或进行代价高昂的在线交互，限制了它们在大规模基础模型上的可扩展性。本文提出了Speedup Patch（SuP），一种轻量级、与策略无关的框架，仅使用离线数据即可实现即插即用的加速。SuP引入了一个外部调度器，对具身策略输出的动作块进行自适应下采样，以消除冗余。具体来说，我们将调度器的优化形式化为约束马尔可夫决策过程（CMDP），目标是在不损害任务性能的前提下最大化效率。由于在离线设置中无法直接评估任务是否成功，SuP引入了基于世界模型的状态偏差作为替代指标，用于施加安全约束。借助学习到的世界模型作为虚拟评估器来预测反事实轨迹，调度器可以通过离线强化学习进行优化。在仿真基准（Libero、Bigym）和真实世界任务上的实证结果表明，SuP在保持各策略原有成功率的同时，总体上实现了1.8倍的执行加速。

AI-Driven Multi-Agent Simulation of Stratified Polyamory Systems: A Computational Framework for Optimizing Social Reproductive Efficiency

AI驱动的多智能体分层多元恋系统模拟：优化社会生殖效率的计算框架

Authors: Yicai Xing
Subjects: Artificial Intelligence (cs.AI); General Economics (econ.GN)
Arxiv link: https://arxiv.org/abs/2603.20678
Pdf link: https://arxiv.org/pdf/2603.20678
Abstract Contemporary societies face a severe crisis of demographic reproduction. Global fertility rates continue to decline precipitously, with East Asian nations exhibiting the most dramatic trends -- China's total fertility rate (TFR) fell to approximately 1.0 in 2023, while South Korea's dropped below 0.72. Simultaneously, the institution of marriage is undergoing structural disintegration: educated women rationally reject unions lacking both emotional fulfillment and economic security, while a growing proportion of men at the lower end of the socioeconomic spectrum experience chronic sexual deprivation, anxiety, and learned helplessness. This paper proposes a computational framework for modeling and evaluating a Stratified Polyamory System (SPS) using techniques from agent-based modeling (ABM), multi-agent reinforcement learning (MARL), and large language model (LLM)-empowered social simulation. The SPS permits individuals to maintain a limited number of legally recognized secondary partners in addition to one primary spouse, combined with socialized child-rearing and inheritance reform. We formalize the A/B/C stratification as heterogeneous agent types in a multi-agent system and model the matching process as a MARL problem amenable to Proximal Policy Optimization (PPO). The mating network is analyzed using graph neural network (GNN) representations. Drawing on evolutionary psychology, behavioral ecology, social stratification theory, computational social science, algorithmic fairness, and institutional economics, we argue that SPS can improve aggregate social welfare in the Pareto sense. Preliminary computational results demonstrate the framework's viability in addressing the dual crisis of female motherhood penalties and male sexlessness, while offering a non-violent mechanism for wealth dispersion analogous to the historical Chinese Grace Decree (Tui'en Ling).
中文摘要 当代社会正面临严重的人口再生产危机。全球生育率持续急剧下降，其中东亚国家的趋势最为显著：中国的总和生育率（TFR）在2023年降至约1.0，韩国则跌破0.72。与此同时，婚姻制度正在经历结构性瓦解：受过教育的女性理性地拒绝既缺乏情感满足又缺乏经济保障的结合，而处于社会经济阶梯底端的男性中，越来越多的人经历着长期的性匮乏、焦虑和习得性无助。本文提出了一个计算框架，利用基于智能体的建模（ABM）、多智能体强化学习（MARL）和大型语言模型（LLM）赋能的社会模拟技术，对分层多元恋系统（SPS）进行建模和评估。SPS允许个人在一名主要配偶之外，维持数量有限、受法律承认的次级伴侣，并辅以社会化育儿和继承制度改革。我们将A/B/C分层形式化为多智能体系统中的异构智能体类型，并将匹配过程建模为可用近端策略优化（PPO）求解的MARL问题。配对网络使用图神经网络（GNN）表示进行分析。借鉴进化心理学、行为生态学、社会分层理论、计算社会科学、算法公平性和制度经济学，我们认为SPS能够在帕累托意义上改善总体社会福利。初步计算结果表明，该框架在应对女性母职惩罚与男性无性生活这一双重危机方面具有可行性，同时提供了一种类似于中国历史上推恩令（Tui'en Ling）的非暴力财富分散机制。

Clinical Cognition Alignment for Gastrointestinal Diagnosis with Multimodal LLMs

多模态大型语言模型用于胃肠道诊断的临床认知对齐

Authors: Huan Zheng, Yucheng Zhou, Tianyi Yan, Dubing Chen, Hongbo Lu, Wenlong Liao, Tao He, Pai Peng, Jianbing Shen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.20698
Pdf link: https://arxiv.org/pdf/2603.20698
Abstract Multimodal Large Language Models (MLLMs) have demonstrated remarkable potential in medical image analysis. However, their application in gastrointestinal endoscopy is currently hindered by two critical limitations: the misalignment between general model reasoning and standardized clinical cognitive pathways, and the lack of causal association between visual features and diagnostic outcomes. In this paper, we propose a novel Clinical-Cognitive-Aligned (CogAlign) framework to address these challenges. First, we endow the model with rigorous clinical analytical capabilities by constructing the hierarchical clinical cognition dataset and employing Supervised Fine-Tuning (SFT). Unlike conventional approaches, this strategy internalizes the hierarchical diagnostic logic of experts, ranging from anatomical localization and morphological evaluation to microvascular analysis, directly into the model. Second, to eliminate visual bias, we provide a theoretical analysis demonstrating that standard supervised tuning inevitably converges to spurious background correlations. Guided by this insight, we propose a counterfactual-driven reinforcement learning strategy to enforce causal rectification. By generating counterfactual normal samples via lesion masking and optimizing through clinical-cognition-centric rewards, we constrain the model to strictly ground its diagnosis in causal lesion features. Extensive experiments demonstrate that our approach achieves State-of-the-Art (SoTA) performance across multiple benchmarks, significantly enhancing diagnostic accuracy in complex clinical scenarios. All source code and datasets will be made publicly available.
中文摘要 多模态大型语言模型（MLLM）在医学图像分析中展现出了显著的潜力。然而，它们在胃肠道内镜检查中的应用目前受到两个关键局限的阻碍：通用模型的推理与标准化临床认知路径之间不一致，以及视觉特征与诊断结果之间缺乏因果关联。本文提出了一种新的临床认知对齐（CogAlign）框架来应对这些挑战。首先，我们构建了层级化临床认知数据集并采用监督微调（SFT），赋予模型严谨的临床分析能力。与传统方法不同，该策略将专家的层级诊断逻辑（从解剖定位、形态评估到微血管分析）直接内化到模型中。其次，为消除视觉偏差，我们给出了理论分析，证明标准的监督微调不可避免地会收敛到虚假的背景相关性。基于这一认识，我们提出了一种反事实驱动的强化学习策略，以强制进行因果校正。通过病灶掩蔽生成反事实的正常样本，并利用以临床认知为中心的奖励进行优化，我们约束模型严格依据具有因果性的病灶特征作出诊断。大量实验表明，我们的方法在多个基准上取得了最先进（SoTA）的性能，显著提升了复杂临床场景下的诊断准确性。所有源代码和数据集都将公开。

Decoupling Numerical and Structural Parameters: An Empirical Study on Adaptive Genetic Algorithms via Deep Reinforcement Learning for the Large-Scale TSP

数值与结构参数的解耦：一项通过深度强化学习实现自适应遗传算法的实证研究，适用于大规模TSP

Authors: Hongyu Wang, Yuhan Jing, Yibing Shi, Enjin Zhou, Haotian Zhang, Jialong Shi
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.20702
Pdf link: https://arxiv.org/pdf/2603.20702
Abstract Proper parameter configuration is a prerequisite for the success of Evolutionary Algorithms (EAs). While various adaptive strategies have been proposed, it remains an open question whether all control dimensions contribute equally to algorithmic scalability. To investigate this, we categorize control variables into numerical parameters (e.g., crossover and mutation rates) and structural parameters (e.g., population size and operator switching), hypothesizing that they play distinct roles. This paper presents an empirical study utilizing a dual-level Deep Reinforcement Learning (DRL) framework to decouple and analyze the impact of these two dimensions on the Traveling Salesman Problem (TSP). We employ a Recurrent PPO agent to dynamically regulate these parameters, treating the DRL model as a probe to reveal evolutionary dynamics. Experimental results confirm the effectiveness of this approach: the learned policies outperform static baselines, reducing the optimality gap by approximately 45% on the largest tested instance (rl5915). Building on this validated framework, our ablation analysis reveals a fundamental insight: while numerical tuning offers local refinement, structural plasticity is the decisive factor in preventing stagnation and facilitating escape from local optima. These findings suggest that future automated algorithm design should prioritize dynamic structural reconfiguration over fine-grained probability adjustment. To facilitate reproducibility, the source code is available at this https URL
中文摘要 合理的参数配置是进化算法（EA）取得成功的前提。尽管已有各种自适应策略被提出，但所有控制维度是否对算法可扩展性有同等贡献，仍是一个悬而未决的问题。为研究这一点，我们将控制变量分为数值参数（如交叉率和变异率）和结构参数（如种群规模和算子切换），并假设它们发挥着不同的作用。本文开展了一项实证研究，利用双层深度强化学习（DRL）框架来解耦并分析这两个维度对旅行商问题（TSP）的影响。我们使用Recurrent PPO智能体动态调节这些参数，并将DRL模型视为揭示进化动态的探针。实验结果证实了该方法的有效性：学到的策略优于静态基线，在最大的测试实例（rl5915）上将最优性差距缩小了约45%。在这一经过验证的框架基础上，我们的消融分析揭示了一个基本结论：数值调优只提供局部精细化，而结构可塑性才是防止停滞、促进跳出局部最优的决定性因素。这些发现表明，未来的自动化算法设计应优先考虑动态结构重构，而非细粒度的概率调整。为便于复现，源代码可在此 https URL 获取

RLVR Training of LLMs Does Not Improve Thinking Ability for General QA: Evaluation Method and a Simple Solution

对大型语言模型的RLVR训练并不能提升通用问答的思维能力：评估方法与简单解决方案

Authors: Kaiyuan Li, Jing-Cheng Pang, Yang Yu
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.20799
Pdf link: https://arxiv.org/pdf/2603.20799
Abstract Reinforcement learning from verifiable rewards (RLVR) stimulates the thinking processes of large language models (LLMs), substantially enhancing their reasoning abilities on verifiable tasks. It is often assumed that similar gains should transfer to general question answering (GQA), but this assumption has not been thoroughly validated. To assess whether RLVR automatically improves LLM performance on GQA, we propose a Cross-Generation evaluation framework that measures the quality of intermediate reasoning by feeding the generated thinking context into LLMs of varying capabilities. Our evaluation leads to a discouraging finding: the efficacy of the thinking process on GQA tasks is markedly lower than on verifiable tasks, suggesting that explicit training on GQA remains necessary in addition to training on verifiable tasks. We further observe that direct RL training on GQA is less effective than RLVR. Our hypothesis is that, whereas verifiable tasks demand robust logical chains to obtain high rewards, GQA tasks often admit shortcuts to high rewards without cultivating high-quality thinking. To avoid possible shortcuts, we introduce a simple method, Separated Thinking And Response Training (START), which first trains only the thinking process, using rewards defined on the final answer. We show that START improves both the quality of thinking and the final answer across several GQA benchmarks and RL algorithms.
中文摘要 基于可验证奖励的强化学习（RLVR）能够激发大型语言模型（LLM）的思考过程，显著增强其在可验证任务上的推理能力。人们常常假设类似的收益会迁移到通用问答（GQA）上，但这一假设尚未得到充分验证。为评估RLVR是否会自动提升LLM在GQA上的表现，我们提出了Cross-Generation评估框架，通过把生成的思考内容输入到能力各异的LLM中来衡量中间推理的质量。我们的评估得出了一个令人沮丧的发现：思考过程在GQA任务上的效力明显低于在可验证任务上的效力，这表明除了在可验证任务上训练之外，仍有必要在GQA上进行显式训练。我们还观察到，直接在GQA上进行RL训练的效果不如RLVR。我们的假设是：可验证任务需要稳健的逻辑链才能获得高奖励，而GQA任务往往存在无需培养高质量思考就能获得高奖励的捷径。为避免可能的捷径，我们提出了一种简单的方法，即分离式思考与回答训练（START），它首先只训练思考过程，并使用基于最终答案定义的奖励。我们证明，START在多个GQA基准和多种RL算法上同时提升了思考质量和最终答案质量。

EruDiff: Refactoring Knowledge in Diffusion Models for Advanced Text-to-Image Synthesis

EruDiff：在扩散模型中重构知识以实现高级文本转图像合成

Authors: Xiefan Guo, Xinzhu Ma, Haoxiang Ma, Zihao Zhou, Di Huang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.20828
Pdf link: https://arxiv.org/pdf/2603.20828
Abstract Text-to-image diffusion models have achieved remarkable fidelity in synthesizing images from explicit text prompts, yet exhibit a critical deficiency in processing implicit prompts that require deep-level world knowledge, ranging from natural sciences to cultural commonsense, resulting in counter-factual synthesis. This paper traces the root of this limitation to a fundamental dislocation of the underlying knowledge structures, manifesting as a chaotic organization of implicit prompts compared to their explicit counterparts. In this paper, we propose EruDiff, which aims to refactor the knowledge within diffusion models. Specifically, we develop the Diffusion Knowledge Distribution Matching (DK-DM) to register the knowledge distribution of intractable implicit prompts with that of well-defined explicit anchors. Furthermore, to rectify the inherent biases in explicit prompt rendering, we employ the Negative-Only Reinforcement Learning (NO-RL) strategy for fine-grained correction. Rigorous empirical evaluations demonstrate that our method significantly enhances the performance of leading diffusion models, including FLUX and Qwen-Image, across both the scientific knowledge benchmark (i.e., Science-T2I) and the world knowledge benchmark (i.e., WISE), underscoring the effectiveness and generalizability. Our code is available at this https URL.
中文摘要 文本到图像扩散模型在根据显式文本提示合成图像方面已达到出色的保真度，但在处理需要深层世界知识（从自然科学到文化常识）的隐式提示时存在严重缺陷，导致合成结果与事实相悖。本文将这一局限的根源追溯到底层知识结构的根本性错位，其表现是隐式提示相较于对应的显式提示在组织上混乱无序。本文提出了EruDiff，旨在重构扩散模型内部的知识。具体来说，我们设计了扩散知识分布匹配（DK-DM），将难以处理的隐式提示的知识分布与定义明确的显式锚点的知识分布对齐。此外，为了纠正显式提示渲染中固有的偏差，我们采用仅负向强化学习（NO-RL）策略进行细粒度校正。严格的实证评估表明，我们的方法显著提升了包括FLUX和Qwen-Image在内的领先扩散模型在科学知识基准（即Science-T2I）和世界知识基准（即WISE）上的性能，凸显了其有效性和泛化能力。我们的代码可在此 https URL 访问。

Deep Adaptive Rate Allocation in Volatile Heterogeneous Wireless Networks

高波动异构无线网络中的深度自适应速率分配

Authors: Gregorio Maglione, Veselin Rakocevic, Markus Amend, Touraj Soleymani
Subjects: Information Theory (cs.IT); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.20926
Pdf link: https://arxiv.org/pdf/2603.20926
Abstract Modern multi-access 5G+ networks provide mobile terminals with additional capacity, improving network stability and performance. However, in highly mobile environments such as vehicular networks, supporting multi-access connectivity remains challenging. The rapid fluctuations of wireless link quality often outpace the responsiveness of existing multipath schedulers and transport-layer protocols. This paper addresses this challenge by integrating Transformer-based path state forecasting with a new multipath splitting scheduler called Deep Adaptive Rate Allocation (DARA). The proposed scheduler employs a deep reinforcement learning engine to dynamically compute optimal congestion window fractions on available paths, determining data allocation among them. A six-component normalised reward function with weight-mediated conflict resolution drives a DQN policy that eliminates the observation-reaction lag inherent in reactive schedulers. Performance evaluation uses a Mininet-based Multipath Datagram Congestion Control Protocol testbed with traces from mobile users in vehicular environments. Experimental results demonstrate that DARA achieves better file transfer time reductions compared to learning-based schedulers under moderate-volatility traces. For buffered video streaming, resolution improvements are maintained across all tested conditions. Under controlled burst scenarios with sub-second buffer constraints, DARA achieves substantial rebuffering improvements whilst state-of-the-art schedulers exhibit near-continuous stalling.
中文摘要 现代多接入5G+网络为移动终端提供了额外的容量，提升了网络的稳定性和性能。然而，在车载网络等高移动性环境中，支持多接入连接仍然具有挑战性。无线链路质量的快速波动往往快于现有多路径调度器和传输层协议的响应速度。为应对这一挑战，本文将基于Transformer的路径状态预测与一种名为深度自适应速率分配（DARA）的新型多路径分流调度器相结合。所提出的调度器采用深度强化学习引擎，动态计算各可用路径上的最优拥塞窗口比例，从而决定数据在路径间的分配。一个由六个分量组成、通过权重调解冲突的归一化奖励函数驱动DQN策略，消除了反应式调度器固有的观测-反应滞后。性能评估使用基于Mininet的多路径数据报拥塞控制协议（Multipath Datagram Congestion Control Protocol）测试平台，并采用车载环境中移动用户的轨迹（trace）数据。实验结果表明，在中等波动性的轨迹下，与基于学习的调度器相比，DARA在缩短文件传输时间方面表现更好。对于带缓冲的视频流，分辨率的提升在所有测试条件下都得以保持。在缓冲不足一秒的受控突发场景下，DARA大幅改善了重缓冲情况，而最先进的调度器则出现近乎持续的卡顿。

Cyber Deception for Mission Surveillance via Hypergame-Theoretic Deep Reinforcement Learning

通过超博弈理论深度强化学习实现任务监视的网络欺骗

Authors: Zelin Wan, Jin-Hee Cho, Mu Zhu, Ahmed H. Anwar, Charles Kamhoua, Munindar P. Singh
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2603.20981
Pdf link: https://arxiv.org/pdf/2603.20981
Abstract Unmanned Aerial Vehicles (UAVs) are valuable for mission-critical systems like surveillance, rescue, or delivery. Not surprisingly, such systems attract cyberattacks, including Denial-of-Service (DoS) attacks to overwhelm the resources of mission drones (MDs). How can we defend UAV mission systems against DoS attacks? We adopt cyber deception as a defense strategy, in which honey drones (HDs) are proposed to bait and divert attacks. The attack and deceptive defense hinge upon radio signal strength: The attacker selects victim MDs based on their signals, and HDs attract the attacker from afar by emitting stronger signals, despite this reducing battery life. We formulate an optimization problem for the attacker and defender to identify their respective strategies for maximizing mission performance while minimizing energy consumption. To address this problem, we propose a novel approach, called HT-DRL. HT-DRL identifies optimal solutions without a long learning convergence time by taking the solutions of hypergame theory into the neural network of deep reinforcement learning. This achieves a systematic way to intelligently deceive attackers. We analyze the performance of diverse defense mechanisms under different attack strategies. Further, the HT-DRL-based HD approach outperforms existing non-HD counterparts up to two times better in mission performance while incurring low energy consumption.
中文摘要 无人机（UAV）对于监视、救援或投递等任务关键型系统很有价值。毫不意外，这类系统会招致网络攻击，包括旨在耗尽任务无人机（MD）资源的拒绝服务（DoS）攻击。我们如何保护无人机任务系统免受DoS攻击？我们采用网络欺骗作为防御策略，提出用蜜罐无人机（HD）来引诱并转移攻击。攻击与欺骗式防御都取决于无线电信号强度：攻击者根据信号选择受害MD，而HD通过发射更强的信号从远处吸引攻击者，尽管这会缩短电池续航。我们为攻击方和防御方建立了一个优化问题，以确定各自在最大化任务性能的同时最小化能耗的策略。为解决这一问题，我们提出了一种新方法，称为HT-DRL。HT-DRL将超博弈理论的解引入深度强化学习的神经网络，无需漫长的学习收敛时间即可找到最优解，从而提供了一种系统化的方式来智能地欺骗攻击者。我们分析了多种防御机制在不同攻击策略下的性能。此外，基于HT-DRL的HD方法优于现有的非HD方法，任务性能最高可达后者的两倍，同时能耗较低。

The Intelligent Disobedience Game: Formulating Disobedience in Stackelberg Games and Markov Decision Processes

智能不服从博弈：在斯塔克伯格博弈和马尔可夫决策过程中表述不服从

Authors: Benedikt Hornig, Reuth Mirsky
Subjects: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.20994
Pdf link: https://arxiv.org/pdf/2603.20994
Abstract In shared autonomy, a critical tension arises when an automated assistant must choose between obeying a human's instruction and deliberately overriding it to prevent harm. This safety-critical behavior is known as intelligent disobedience. To formalize this dynamic, this paper introduces the Intelligent Disobedience Game (IDG), a sequential game-theoretic framework based on Stackelberg games that models the interaction between a human leader and an assistive follower operating under asymmetric information. It characterizes optimal strategies for both agents across multi-step scenarios, identifying strategic phenomena such as "safety traps," where the system indefinitely avoids harm but fails to achieve the human's goal. The IDG provides a needed mathematical foundation that enables both the algorithmic development of agents that can learn safe non-compliance and the empirical study of how humans perceive and trust disobedient AI. The paper further translates the IDG into a shared control Multi-Agent Markov Decision Process representation, forming a compact computational testbed for training reinforcement learning agents.
中文摘要 在共享自主中，当自动化助手必须在服从人类指令和故意覆盖指令以防止伤害之间做出选择时，就会产生一种关键的张力。这种对安全至关重要的行为被称为智能不服从。为形式化这一动态，本文引入了智能不服从博弈（IDG），这是一个基于斯塔克伯格博弈的顺序博弈论框架，模拟了人类领导者与辅助跟随者在非对称信息下操作的相互作用。它描述了两位智能体在多步场景中的最优策略，识别了诸如“安全陷阱”等战略现象，即系统无限期避免伤害但未能实现人类目标。IDG提供了必要的数学基础，既支持算法开发能够学习安全不服从的智能体，也支持对人类如何感知和信任不服从人工智能的实证研究。论文进一步将IDG转化为共享控制的多智能体马尔可夫决策过程表示，形成了一个紧凑的计算测试平台，用于训练强化学习智能体。

OrbitStream: Training-Free Adaptive 360-degree Video Streaming via Semantic Potential Fields

OrbitStream：通过语义势场实现无训练自适应360度视频流

Authors: Aizierjiang Aiersilan, Zhangfei Yang
Subjects: Networking and Internet Architecture (cs.NI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Robotics (cs.RO); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2603.20999
Pdf link: https://arxiv.org/pdf/2603.20999
Abstract Adaptive 360° video streaming for teleoperation faces dual challenges: viewport prediction under uncertain gaze patterns and bitrate adaptation over volatile wireless channels. While data-driven and Deep Reinforcement Learning (DRL) methods achieve high Quality of Experience (QoE), their "black-box" nature and reliance on training data can limit deployment in safety-critical systems. To address this, we propose OrbitStream, a training-free framework that combines semantic scene understanding with robust control theory. We formulate viewport prediction as a Gravitational Viewport Prediction (GVP) problem, where semantic objects generate potential fields that attract user gaze. Furthermore, we employ a Saturation-Based Proportional-Derivative (PD) Controller for buffer regulation. On object-rich teleoperation traces, OrbitStream achieves a 94.7% zero-shot viewport prediction accuracy without user-specific profiling, approaching trajectory-extrapolation baselines (~98.5%). Across 3,600 Monte Carlo simulations on diverse network traces, OrbitStream yields a mean QoE of 2.71. It ranks second among 12 evaluated algorithms, close to the top-performing BOLA-E (2.80) while outperforming FastMPC (1.84). The system exhibits an average decision latency of 1.01 ms with minimal rebuffering events. By providing competitive QoE with interpretability and zero training overhead, OrbitStream demonstrates that physics-based control, combined with semantic modeling, offers a practical solution for 360° streaming in teleoperation.
中文摘要 面向遥操作的自适应360°视频流面临双重挑战：在注视模式不确定的情况下进行视口预测，以及在波动剧烈的无线信道上进行码率自适应。虽然数据驱动方法和深度强化学习（DRL）方法能够实现较高的体验质量（QoE），但其“黑盒”特性和对训练数据的依赖可能限制它们在安全关键系统中的部署。为此，我们提出了OrbitStream，一个将语义场景理解与鲁棒控制理论相结合的免训练框架。我们将视口预测表述为引力视口预测（GVP）问题，其中语义对象产生吸引用户注视的势场。此外，我们采用基于饱和的比例-微分（PD）控制器进行缓冲区调节。在物体丰富的遥操作轨迹上，OrbitStream无需针对用户进行画像即可达到94.7%的零样本视口预测准确率，接近轨迹外推基线（约98.5%）。在基于多样化网络轨迹的3,600次蒙特卡洛仿真中，OrbitStream的平均QoE为2.71，在12个被评估的算法中排名第二，接近表现最好的BOLA-E（2.80），并优于FastMPC（1.84）。该系统的平均决策延迟为1.01毫秒，重缓冲事件极少。通过在具备可解释性且无训练开销的同时提供有竞争力的QoE，OrbitStream表明，基于物理的控制与语义建模相结合，可以为遥操作中的360°流媒体传输提供切实可行的解决方案。
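For readers unfamiliar with the buffer-regulation component mentioned in the abstract above, a saturated proportional-derivative controller can be sketched as follows. This is a generic textbook form, not OrbitStream's implementation: the gains `kp` and `kd`, the saturation limit `u_max`, the target buffer level, and the interpretation of the output as a bitrate adjustment signal are all assumptions introduced for illustration.

```python
def saturated_pd(buffer_s, prev_buffer_s, target_s, dt, kp=0.5, kd=0.1, u_max=1.0):
    """Bounded PD control signal for playback-buffer regulation.

    buffer_s / prev_buffer_s: current and previous buffer levels (seconds).
    target_s: desired buffer level (seconds).
    dt: time between the two buffer samples (seconds).
    All gains and limits are hypothetical values chosen for this sketch.
    """
    # Proportional term: positive when the buffer is above its target.
    error = buffer_s - target_s
    # Derivative term: with a constant target, the error changes exactly
    # as fast as the buffer level does.
    d_error = (buffer_s - prev_buffer_s) / dt
    u = kp * error + kd * d_error
    # Saturation keeps the command bounded regardless of the error size.
    return max(-u_max, min(u_max, u))


if __name__ == "__main__":
    # Buffer dropped from 4 s to 3 s with a 4 s target: negative command.
    print(saturated_pd(3.0, 4.0, 4.0, dt=1.0))
```

Under this sketch, a positive output would suggest headroom to raise the bitrate and a negative output would suggest lowering it; the clamp prevents a large transient error from producing an extreme quality switch.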

DSL-R1: From SQL to DSL for Training Retrieval Agents across Structured and Unstructured Data with Reinforcement Learning

DSL-R1：从SQL到DSL，用于通过强化学习训练跨结构化和非结构化数据的检索代理

Authors: Yunhai Hu, Junwei Zhou, Yumo Cao, Yitao Long, Yiwei Xu, Qiyi Jiang, Weiyao Wang, Xiaoyu Cao, Zhen Sun, Yiran Zou, Nan Du
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.21018
Pdf link: https://arxiv.org/pdf/2603.21018
Abstract Effective retrieval in complex domains requires bridging the gap between structured metadata and unstructured content. Existing systems typically isolate these capabilities, relying on either symbolic filtering or vector similarity, failing to capture their interplay. In this work, we propose DSL-R1, a unified framework that synergizes logical reasoning with semantic matching via a novel Domain-Specific Language (DSL). By embedding vector primitives within SQL-style operators, our approach leverages the complementary strengths of symbolic precision and semantic coverage. We further introduce a reinforcement learning mechanism where rule-based execution feedback and retrieval quality rewards jointly optimize the DSL generation, balancing structural correctness and semantic alignment. Evaluations on a large-scale industrial email benchmark demonstrate that DSL-R1 achieves a +12.3% improvement in Hit@1/3, consistently outperforming decoupled baselines and establishing a robust paradigm for hybrid retrieval.
中文摘要 在复杂领域中进行有效检索，需要弥合结构化元数据与非结构化内容之间的鸿沟。现有系统通常将这两种能力割裂开来，要么依赖符号过滤，要么依赖向量相似度，无法捕捉二者之间的相互作用。在本工作中，我们提出了DSL-R1，这是一个统一框架，通过一种新的领域特定语言（DSL）将逻辑推理与语义匹配协同起来。通过在SQL风格的运算符中嵌入向量原语，我们的方法利用了符号精确性与语义覆盖面的互补优势。我们还引入了一种强化学习机制，由基于规则的执行反馈和检索质量奖励共同优化DSL的生成，在结构正确性与语义对齐之间取得平衡。在一个大规模工业电子邮件基准上的评估表明，DSL-R1在Hit@1/3上取得了+12.3%的提升，始终优于解耦式基线，为混合检索建立了一个稳健的范式。

Knowledge Boundary Discovery for Large Language Models

大型语言模型的知识边界发现

Authors: Ziquan Wang, Zhongqi Lu
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.21022
Pdf link: https://arxiv.org/pdf/2603.21022
Abstract We propose Knowledge Boundary Discovery (KBD), a reinforcement learning based framework to explore the knowledge boundaries of the Large Language Models (LLMs). We define the knowledge boundary by automatically generating two types of questions: (i) those the LLM can confidently answer (within-knowledge boundary) and (ii) those it cannot (beyond-knowledge boundary). Iteratively exploring and exploiting the LLM's responses to find its knowledge boundaries is challenging because of the hallucination phenomenon. To find the knowledge boundaries of an LLM, the agent interacts with the LLM under the modeling of exploring a partially observable environment. The agent generates a progressive question as the action, adopts an entropy reduction as the reward, receives the LLM's response as the observation and updates its belief states. We demonstrate that the KBD detects knowledge boundaries of LLMs by automatically finding a set of non-trivial answerable and unanswerable questions. We validate the KBD by comparing its generated knowledge boundaries with manually crafted LLM benchmark datasets. Experiments show that our KBD-generated question set is comparable to the human-generated datasets. Our approach paves a new way to evaluate LLMs.
中文摘要 我们提出了知识边界发现（KBD），这是一个基于强化学习的框架，用于探索大型语言模型（LLM）的知识边界。我们通过自动生成两类问题来定义知识边界：（i）LLM能够自信回答的问题（知识边界之内）和（ii）它无法回答的问题（知识边界之外）。由于幻觉现象的存在，通过迭代地探索和利用LLM的回答来寻找其知识边界颇具挑战。为了找到LLM的知识边界，我们将该过程建模为对部分可观测环境的探索，由智能体与LLM进行交互。智能体生成一个递进式问题作为动作，以熵的减少量作为奖励，接收LLM的回答作为观测，并更新其信念状态。我们证明，KBD能够自动找到一组非平凡的可回答和不可回答问题，从而检测出LLM的知识边界。我们将KBD生成的知识边界与人工构建的LLM基准数据集进行比较，以此验证KBD。实验表明，KBD生成的问题集与人工生成的数据集相当。我们的方法为评估LLM开辟了一条新途径。
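The abstract above says the agent uses entropy reduction as its reward while updating a belief state. A minimal sketch of such a reward is shown below, assuming the belief is a discrete probability distribution (for example over "answerable" versus "unanswerable"); the abstract does not state how the belief is represented or which entropy is used, so the Shannon entropy in bits and the function names here are illustrative assumptions.

```python
import math


def entropy(belief):
    """Shannon entropy (in bits) of a discrete belief distribution."""
    return -sum(p * math.log2(p) for p in belief if p > 0)


def entropy_reduction_reward(belief_before, belief_after):
    """Reward = how much uncertainty the latest observation removed.

    ASSUMPTION: beliefs are lists of probabilities that each sum to 1.
    The reward is positive when the posterior is sharper than the prior
    and negative when the observation left the agent more uncertain.
    """
    return entropy(belief_before) - entropy(belief_after)


if __name__ == "__main__":
    # A maximally uncertain two-way belief that becomes fairly confident.
    print(entropy_reduction_reward([0.5, 0.5], [0.9, 0.1]))
```

With this shape of reward, a question whose answer leaves the belief unchanged earns nothing, which is one plausible way to steer an agent toward informative questions.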

DRL-driven Online Optimization for Joint Traffic Reshaping and Channel Reconfiguration in RIS-assisted Semantic NOMA Communications

基于DRL驱动的在线优化，用于RIS辅助语义NOMA通信中的联合流量重塑和信道重配置

Authors: Songhan Zhao, Shimin Gong, Bo Gu, Zehui Xiong, Ping Wang, Kaibin Huang
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2603.21093
Pdf link: https://arxiv.org/pdf/2603.21093
Abstract This paper explores a reconfigurable intelligent surface (RIS)-assisted and semantic-aware wireless network, where multiple semantic users (SUs) transmit semantic information to an access point (AP) using the non-orthogonal multiple access (NOMA) method. The RIS reconfigures channel conditions, while semantic extraction reshapes traffic demands, providing enhanced control flexibility for NOMA transmissions. To enable efficient long-term resource allocation, we propose a deferrable semantic extraction scheme that can distribute the semantic extraction tasks across multiple time slots. We formulate a long-term energy efficiency maximization problem by jointly optimizing the RIS's passive beamforming, the SUs' semantic extraction, and the NOMA decoding order. Note that this problem involves multiple and coupled control variables, which can incur significant computational overhead in time-varying network environments. To support low-complexity online optimization, a deep reinforcement learning (DRL)-driven online optimization framework is developed. Specifically, the DRL module facilitates the adaptive selection and optimization of the most suitable option from traffic reshaping, channel reconfiguration, or NOMA decoding order assignment based on the dynamic network status. Numerical results demonstrate that the deferrable semantic extraction scheme significantly improves the long-term energy efficiency. Meanwhile, the DRL-driven online optimization framework effectively reduces the running time while maintaining superior learning performance compared to state-of-the-art methods.
中文摘要 本文探讨了一种可重构智能表面（RIS）辅助且语义感知的无线网络，其中多个语义用户（SU）通过非正交多重访问（NOMA）方法向接入点（AP）传输语义信息。RIS重新配置信道条件，而语义提取则重塑流量需求，为NOMA传输提供更强的控制灵活性。为了实现高效的长期资源分配，我们提出了一种可递延的语义提取方案，可以将语义提取任务分配到多个时隙。我们通过联合优化RIS的被动束束成形、超音束的语义提取和NOMA译码顺序，提出了长期能效最大化问题。注意，该问题涉及多个且耦合的控制变量，在时间变化的网络环境中可能产生显著的计算开销。为支持低复杂度在线优化，开发了一个深度强化学习（DRL）驱动的在线优化框架。具体来说，DRL模块便于根据动态网络状态，自适应选择和优化流量重塑、信道重配置或NOMA解码顺序分配中最合适的选项。数值结果表明，可延期语义提取方案显著提升了长期能效。与此同时，基于日程学习的在线优化框架有效地缩短了运行时间，同时保持了优于最先进方法的学习性能。

Learning to Optimize Joint Source and RIS-assisted Channel Encoding for Multi-User Semantic Communication Systems

学习优化多用户语义通信系统的联合源和RIS辅助信道编码

Authors: Haidong Wang, Songhan Zhao, Bo Gu, Shimin Gong, Hongyang Du, Ping Wang
Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.21097
Pdf link: https://arxiv.org/pdf/2603.21097
Abstract In this paper, we explore a joint source and reconfigurable intelligent surface (RIS)-assisted channel encoding (JSRE) framework for multi-user semantic communications, where a deep neural network (DNN) extracts semantic features for all users and the RIS provides channel orthogonality, enabling a unified semantic encoding-decoding design. We aim to maximize the overall energy efficiency of semantic communications across all users by jointly optimizing the user scheduling, the RIS's phase shifts, and the semantic compression ratio. Although this joint optimization problem can be addressed using conventional deep reinforcement learning (DRL) methods, evaluating semantic similarity typically relies on extensive real environment interactions, which can incur heavy computational overhead during training. To address this challenge, we propose a truncated DRL (T-DRL) framework, where a DNN-based semantic similarity estimator is developed to rapidly estimate the similarity score. Moreover, the user scheduling strategy is tightly coupled with the semantic model configuration. To exploit this relationship, we further propose a semantic model caching mechanism that stores and reuses fine-tuned semantic models corresponding to different scheduling decisions. A Transformer-based actor network is employed within the DRL framework to dynamically generate action space conditioned on the current caching state. This avoids redundant retraining and further accelerates the convergence of the learning process. Numerical results demonstrate that the proposed JSRE framework significantly improves the system energy efficiency compared with the baseline methods. By training fewer semantic models, the proposed T-DRL framework significantly enhances the learning efficiency.
中文摘要 本文探讨了一种联合源和可重构智能表面（RIS）辅助信道编码（JSRE）框架，用于多用户语义通信，其中深度神经网络（DNN）为所有用户提取语义特征，RIS提供信道正交性，实现统一的语义编码-解码设计。我们旨在通过共同优化用户调度、RIS相位偏移和语义压缩比，最大化所有用户语义通信的整体能源效率。虽然该联合优化问题可以通过传统的深度强化学习（DRL）方法解决，但评估语义相似性通常依赖于广泛的真实环境交互，这在训练过程中可能带来较大的计算开销。为应对这一挑战，我们提出了一个截断DRL（T-DRL）框架，其中开发了基于DNN的语义相似度估计器，以快速估计相似度分数。此外，用户调度策略与语义模型配置紧密耦合。为了利用这种关系，我们进一步提出了一种语义模型缓存机制，能够存储并重用对应不同调度决策的微调语义模型。基于Transformer的演员网络在DRL框架内被采用，以动态生成基于当前缓存状态的动作空间。这避免了重复的再培训，并进一步加速了学习过程的融合。数值结果表明，所提出的JSRE框架相比基线方法显著提升了系统的能效。通过训练更少的语义模型，所提出的T-DRL框架显著提升了学习效率。

VisFly-Lab: Unified Differentiable Framework for First-Order Reinforcement Learning of Quadrotor Control

VisFly-Lab：一阶强化学习四旋翼控制的统一可微框架

Authors: Fanxing Li, Fangyu Sun, Tianbao Zhang, Shuyu Wu, Dexin Zuo, Yufei Yan, Wenxian Yu, Danping Zou
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.21123
Pdf link: https://arxiv.org/pdf/2603.21123
Abstract First-order reinforcement learning with differentiable simulation is promising for quadrotor control, but practical progress remains fragmented across task-specific settings. To support more systematic development and evaluation, we present a unified differentiable framework for multi-task quadrotor control. The framework is wrapped, extensible, and equipped with deployment-oriented dynamics, providing a common interface across four representative tasks: hovering, tracking, landing, and racing. We also present the suite of first-order learning algorithms, where we identify two practical bottlenecks of standard first-order training: limited state coverage caused by horizon initialization and gradient bias caused by partially non-differentiable rewards. To address these issues, we propose Amended Backpropagation Through Time (ABPT), which combines differentiable rollout optimization, a value-based auxiliary objective, and visited-state initialization to improve training robustness. Experimental results show that ABPT yields the clearest gains in tasks with partially non-differentiable rewards, while remaining competitive in fully differentiable settings. We further provide proof-of-concept real-world deployments showing initial transferability of policies learned in the proposed framework beyond simulation.
中文摘要 带有可微仿真的第一阶强化学习在四旋翼控制方面前景看好，但实际进展在任务特定环境中仍零散。为了支持更系统的开发和评估，我们提出了一个统一的可微分多任务四旋翼控制框架。该框架被包裹、可扩展，并配备了部署导向的动态功能，提供了四大代表性任务的通用界面：悬停、跟踪、着陆和竞速。我们还介绍了一套一阶学习算法，指出标准一阶训练的两个实际瓶颈：视野初始化导致的状态覆盖有限和部分不可微分奖励引起的梯度偏置。为解决这些问题，我们提出了修正反向传播（ABPT），结合了可微的展开优化、基于价值的辅助目标和访问状态初始化，以提升训练的鲁棒性。实验结果显示，ABPT在部分不可微分奖励的任务中获得最明显的收益，同时在完全可微化的环境中保持竞争力。我们还提供了概念验证的真实部署，展示了在拟议框架中学到的策略的初始可迁移性，超越模拟。

Anatomical Prior-Driven Framework for Autonomous Robotic Cardiac Ultrasound Standard View Acquisition

自主机器人心脏超声标准视图采集的解剖先验驱动框架

Authors: Zhiyan Cao, Zhengxi Wu, Yiwei Wang, Pei-Hsuan Lin, Li Zhang, Zhen Xie, Huan Zhao, Han Ding
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.21134
Pdf link: https://arxiv.org/pdf/2603.21134
Abstract Cardiac ultrasound diagnosis is critical for cardiovascular disease assessment, but acquiring standard views remains highly operator-dependent. Existing medical segmentation models often yield anatomically inconsistent results in images with poor textural differentiation between distinct feature classes, while autonomous probe adjustment methods either rely on simplistic heuristic rules or black-box learning. To address these issues, our study proposed an anatomical prior (AP)-driven framework integrating cardiac structure segmentation and autonomous probe adjustment for standard view acquisition. A YOLO-based multi-class segmentation model augmented by a spatial-relation graph (SRG) module is designed to embed AP into the feature pyramid. Quantifiable anatomical features of standard views are extracted. Their priors are fitted to Gaussian distributions to construct probabilistic APs. The probe adjustment process of robotic ultrasound scanning is formalized as a reinforcement learning (RL) problem, with the RL state built from real-time anatomical features and the reward reflecting the AP matching. Experiments validate the efficacy of the framework. The SRG-YOLOv11s improves mAP50 by 11.3% and mIoU by 6.8% on the Special Case dataset, while the RL agent achieves a 92.5% success rate in simulation and 86.7% in phantom experiments.
中文摘要 心脏超声诊断对心血管疾病评估至关重要，但获取标准视图仍高度依赖操作员。现有的医学分割模型常常在不同特征类别间纹理区分较差的图像中产生解剖学不一致的结果，而自主探针调整方法要么依赖简单的启发式规则，要么依赖黑箱学习。为解决这些问题，本研究提出了一种结合心脏结构分段和自主探针调整的解剖先验（AP）驱动框架，用于标准视野获取。基于YOLO的多类分割模型，辅以空间关系图（SRG）模块，旨在将AP嵌入特征金字塔中。提取标准视图的可量化解剖特征。它们的先验被拟合到高斯分布中，以构造概率APs。机器人超声扫描的探针调整过程被形式化为强化学习（RL）问题，强化学习状态由实时解剖特征构建，奖励则反映AP匹配。实验验证了该框架的有效性。SRG-YOLOv11s在特殊案例数据集中提升了mAP50 11.3%，mIoU提升了6.8%，而强化学习代理在模拟中成功率为92.5%，在幻影实验中为86.7%。
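
The idea of fitting anatomical features to Gaussian priors and rewarding the agent for matching them can be sketched as follows. The abstract does not give the reward formula, so the Gaussian log-density used here is an assumption, and the feature name `chamber_width_mm` and the sample values are purely hypothetical.

```python
import math
from statistics import mean, stdev

def fit_gaussian(samples):
    """Fit a 1-D Gaussian prior (mean, sample standard deviation) to one feature."""
    return mean(samples), stdev(samples)

def ap_matching_reward(features, priors):
    """Sum of Gaussian log-densities of the observed features (assumed reward form)."""
    total = 0.0
    for name, value in features.items():
        mu, sigma = priors[name]
        z = (value - mu) / sigma
        total += -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))
    return total

# Hypothetical measurements of one feature taken from standard-view images.
priors = {"chamber_width_mm": fit_gaussian([9.0, 10.0, 11.0])}
near = ap_matching_reward({"chamber_width_mm": 10.0}, priors)
far = ap_matching_reward({"chamber_width_mm": 13.0}, priors)
```

With this choice, a probe pose whose measured features sit close to the prior means scores higher than one whose features are several standard deviations away.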

Incentivizing Generative Zero-Shot Learning via Outcome-Reward Reinforcement Learning with Visual Cues

通过带视觉提示的结果奖励强化学习激励生成式零样本学习

Authors: Wenjin Hou, Xiaoxiao Sun, Hehe Fan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.21138
Pdf link: https://arxiv.org/pdf/2603.21138
Abstract Recent advances in zero-shot learning (ZSL) have demonstrated the potential of generative models. Typically, generative ZSL synthesizes visual features conditioned on semantic prototypes to model the data distribution of unseen classes, followed by training a classifier on the synthesized data. However, the synthesized features often remain task-agnostic, leading to degraded performance. Moreover, inferring a faithful distribution from semantic prototypes alone is insufficient for classes that are semantically similar but visually distinct. To address these issues and advance ZSL, we propose RLVC, an outcome-reward reinforcement learning (RL) framework with visual cues for generative ZSL. At its core, RL empowers the generative model to self-evolve, implicitly enhancing its generation capability. In particular, RLVC updates the generative model using an outcome-based reward, encouraging the synthesis of task-relevant features. Furthermore, we introduce class-wise visual cues that (i) align synthesized features with visual prototypes and (ii) stabilize the RL training updates. For the training process, we present a novel cold-start strategy. Comprehensive experiments and analyses on three prevalent ZSL benchmarks demonstrate that RLVC achieves state-of-the-art results with a 4.7% gain.
中文摘要 零样本学习（ZSL）的最新进展展示了生成模型的潜力。通常，生成式 ZSL 会基于语义原型综合视觉特征，以建模未见类的数据分布，然后对合成数据进行分类器训练。然而，合成的特征往往与任务无关，导致性能下降。此外，仅凭语义原型推断忠实分布对于语义相似但视觉上不同的类来说是不够的。为了解决这些问题并推动ZSL，我们提出了RLVC，一种带有视觉提示的生成性ZSL结果-奖励强化学习RL框架。强化学习的核心是赋予生成模型自我演化的能力，隐含地增强其生成能力。特别是，RLVC通过基于结果的奖励更新生成模型，鼓励综合与任务相关的特征。此外，我们引入了按类别分类的视觉提示，（i）将合成特征与视觉原型对齐，（ii）稳定强化学习训练的更新。在培训过程中，我们提出了一种新的冷启动策略。对三项主流ZSL基准的综合实验和分析表明，RLVC实现了4.7%的先进性能。

Revisiting Tree Search for LLMs: Gumbel and Sequential Halving for Budget-Scalable Reasoning

重访大型语言模型的树状搜索：Gumbel和顺序减半法以实现预算可扩展推理

Authors: Leonid Ugadiarov, Yuri Kuratov, Aleksandr Panov, Alexey Skrynnik
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.21162
Pdf link: https://arxiv.org/pdf/2603.21162
Abstract Neural tree search is a powerful decision-making algorithm widely used in complex domains such as game playing and model-based reinforcement learning. Recent work has applied AlphaZero-style tree search to enhance the reasoning capabilities of Large Language Models (LLMs) during inference, but we find that this approach suffers from a scaling failure: on GSM8K and Game24, accuracy drops as the search budget increases. In this paper, we present ReSCALE, an adaptation of Gumbel AlphaZero MCTS that replaces Dirichlet noise and PUCT selection with Gumbel sampling and Sequential Halving, restoring monotonic scaling without changes to the model or its training. ReSCALE reaches 58.4% on GSM8K and 85.3% on Game24 at budgets where the baseline degrades. Ablations confirm that Sequential Halving is the primary driver of the improvement.
中文摘要 神经树搜索是一种强大的决策算法，广泛应用于游戏和基于模型的强化学习等复杂领域。近期研究应用了AlphaZero风格树搜索来增强大型语言模型（LLM）推理能力，但我们发现该方法存在缩放性失败的问题：在GSM8K和Game24上，随着搜索预算增加，准确率下降。本文介绍了ReSCALE，这是一种基于Gumbel AlphaZero MCTS的改良，用Gumbel采样和顺序减半替代了Dirichlet噪声和PUCT选择，恢复单调标度，且不改变模型或训练。在GSM8K上ReSCALE达到58.4%，在Game24上达到85.3%，而在这些预算下基准会下降。消融结果证实，顺序减半是改善的主要驱动力。
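
The two ingredients named in the abstract, Gumbel sampling and Sequential Halving, can be sketched at a single decision node as below. This is a simplified illustration rather than ReSCALE itself: survivors are ranked by mean evaluated value only, and the logits, action values, and budget are hypothetical.

```python
import math
import random

def gumbel_top_k(logits, k, rng):
    """Sample k distinct actions without replacement via the Gumbel-top-k trick."""
    perturbed = []
    for action, logit in enumerate(logits):
        u = max(rng.random(), 1e-12)          # keep u in (0, 1) so both logs are defined
        gumbel = -math.log(-math.log(u))
        perturbed.append((logit + gumbel, action))
    perturbed.sort(reverse=True)
    return [action for _, action in perturbed[:k]]

def sequential_halving(candidates, evaluate, budget):
    """Split the budget over about log2(n) rounds, keeping the better half each round."""
    survivors = list(candidates)
    rounds = max(1, math.ceil(math.log2(len(survivors))))
    totals = {a: 0.0 for a in survivors}
    counts = {a: 0 for a in survivors}
    for _ in range(rounds):
        if len(survivors) == 1:
            break
        per_arm = max(1, budget // (rounds * len(survivors)))
        for a in survivors:
            for _ in range(per_arm):
                totals[a] += evaluate(a)
                counts[a] += 1
        survivors.sort(key=lambda a: totals[a] / counts[a], reverse=True)
        survivors = survivors[: max(1, math.ceil(len(survivors) / 2))]
    return survivors[0]

rng = random.Random(0)
logits = [0.0, 1.0, 2.0, 3.0]        # hypothetical policy logits for 4 actions
true_value = [0.1, 0.9, 0.3, 0.5]    # hypothetical (noise-free) action values
candidates = gumbel_top_k(logits, k=4, rng=rng)
best = sequential_halving(candidates, lambda a: true_value[a], budget=16)
```

Because the evaluation budget is divided evenly across rounds and concentrated on fewer survivors each time, a larger budget gives more evaluations to the remaining candidates instead of being spread thinly over all of them.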

Rethinking Plasticity in Deep Reinforcement Learning

重新思考深度强化学习中的可塑性

Authors: Zhiqiang He
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.21173
Pdf link: https://arxiv.org/pdf/2603.21173
Abstract This paper investigates the fundamental mechanisms driving plasticity loss in deep reinforcement learning (RL), a critical challenge where neural networks lose their ability to adapt to non-stationary environments. While existing research often relies on descriptive metrics like dormant neurons or effective rank, these summaries fail to explain the underlying optimization dynamics. We propose the Optimization-Centric Plasticity (OCP) hypothesis, which posits that plasticity loss arises because optimal points from previous tasks become poor local optima for new tasks, trapping parameters during task transitions and hindering subsequent learning. We theoretically establish the equivalence between neuron dormancy and zero-gradient states, demonstrating that the absence of gradient signals is the primary driver of dormancy. Our experiments reveal that plasticity loss is highly task-specific; notably, networks with high dormancy rates in one task can achieve performance parity with randomly initialized networks when switched to a significantly different task, suggesting that the network's capacity remains intact but is inhibited by the specific optimization landscape. Furthermore, our hypothesis elucidates why parameter constraints mitigate plasticity loss by preventing deep entrenchment in local optima. Validated across diverse non-stationary scenarios, our findings provide a rigorous optimization-based framework for understanding and restoring network plasticity in complex RL domains.
中文摘要 本文探讨了驱动深度强化学习（RL）中可塑性丧失的基本机制，这是神经网络失去适应非固定环境能力的关键挑战。虽然现有研究常依赖于休眠神经元或有效秩等描述性指标，但这些总结未能解释潜在的优化动态。我们提出了以优化为中心的可塑性（OCP）假说，认为可塑性损失是因为之前任务的最优点成为新任务的局部最优，导致任务过渡时的参数被困住，阻碍后续学习。我们理论上建立了神经元休眠与零梯度状态的等效性，证明梯度信号的缺失是休眠的主要驱动因素。我们的实验显示，可塑性丧失高度依赖任务;值得注意的是，某一任务休眠率高的网络，在切换到显著不同的任务时，可以与随机初始化网络实现性能均衡，表明网络容量保持不变，但受限于特定的优化环境。此外，我们的假设阐明了为何参数约束通过防止局部最优的深度固化来减轻塑性损失。我们的发现在多种非平稳场景中得到了验证，提供了一个基于优化的严谨框架，用于理解和恢复复杂强化学习领域的网络可塑性。

Reward Sharpness-Aware Fine-Tuning for Diffusion Models

为扩散模型奖励锐度感知的微调

Authors: Kwanyoung Kim, Byeongsu Sim
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.21175
Pdf link: https://arxiv.org/pdf/2603.21175
Abstract Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models with human preferences, inspiring the development of reward-centric diffusion reinforcement learning (RDRL) to achieve similar alignment and controllability. While diffusion models can generate high-quality outputs, RDRL remains susceptible to reward hacking, where the reward score increases without corresponding improvements in perceptual quality. We demonstrate that this vulnerability arises from the non-robustness of reward model gradients, particularly when the reward landscape with respect to the input image is sharp. To mitigate this issue, we introduce methods that exploit gradients from a robustified reward model without requiring its retraining. Specifically, we employ gradients from a flattened reward model, obtained through parameter perturbations of the diffusion model and perturbations of its generated samples. Empirically, each method independently alleviates reward hacking and improves robustness, while their joint use amplifies these benefits. Our resulting framework, RSA-FT (Reward Sharpness-Aware Fine-Tuning), is simple, broadly compatible, and consistently enhances the reliability of RDRL.
中文摘要 来自人类反馈的强化学习（RLHF）已被证明能有效使大型语言模型与人类偏好对齐，激励了以奖励为中心的扩散强化学习（RDRL）的发展，以实现类似的对齐和可控性。虽然扩散模型可以产生高质量输出，但RDRL仍易受奖励黑客攻击的影响，即奖励分数上升，而感知质量却没有相应提升。我们证明了这种脆弱性源于奖励模型梯度的不稳健性，尤其是在奖励景观相对于输入图像的锐利时。为缓解这一问题，我们引入了利用稳健奖励模型梯度的方法，而无需重新训练。具体来说，我们采用了来自平坦奖励模型的梯度，该模型通过扩散模型及其生成样本的参数扰动获得。从经验来看，每种方法独立减轻了奖励黑客行为并提升了鲁棒性，而它们的联合使用则放大了这些益处。我们最终形成的框架RSA-FT（奖励锐利感知微调）简单、广泛兼容，并持续提升RDRL的可靠性。

Prompt replay: speeding up GRPO with on-policy reuse of high-signal prompts

提示回放：通过策略重用高信号提示加快GRPO

Authors: Andrei Baroian, Rutger Berger
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.21177
Pdf link: https://arxiv.org/pdf/2603.21177
Abstract Reinforcement learning with verifiable rewards (RLVR) plays a crucial role in expanding the capacities of LLM reasoning, but GRPO-style training is dominated by expensive rollouts and wastes compute on unusable prompts. We propose Prompt Replay, an overhead-free online data selection method for GRPO that reuses prompts only (not trajectories), to preserve on-policy optimization. After each step, we insert prompts with medium difficulty into a buffer, and prioritize prompts closer to a pass rate of 0.5 (half answers correct, half wrong) to maximize the advantage, and thus the learning signal. Training batches are formed by mixing reused prompts with fresh samples, with cooldown steps and max reuse times controlling aggressiveness vs. risk of overfitting. Across multiple model families (Llama-3.2-3B, Qwen3-8B) and training datasets (Dolci, Polaris), evaluated using average accuracy on six standard math benchmarks, Prompt Replay reduces zero-variance prompts, increases mean absolute advantage and shows faster initial accuracy gains. Yet, it plateaus and converges with the baseline, as an overly aggressive configuration was used. The method is most efficient when the rollouts are the primary bottleneck and the dataset is difficult for the model. We additionally observe that Qwen2.5-Math can exhibit spurious-reward effects that invalidate ablations, raising a warning signal against using it as a sole testbed for GRPO method research.
中文摘要 带有可验证奖励的强化学习（RLVR）在扩展LLM推理能力方面起着关键作用，但GRPO式训练主要依赖昂贵的推广和浪费计算于无用提示。我们提出了提示重放（Prompt Replay），这是一种无开销的在线数据选择方法，仅重用提示（不重用轨迹），以保持策略优化。每一步结束后，我们将中等难度的提示插入缓冲区，优先考虑通过率接近0.5（一半答对一半错）的提示，以最大化优势，从而学习信号。训练批次通过将重复使用的提示与新样本混合形成，冷却步骤和最大重用时间控制了激进度与过度拟合风险。在多个模型家族（Llama-3.2-3B，Qwen3-8B）和训练数据集（Dolci、Polaris）中，通过六个标准数学基准的平均准确率进行评估，提示重放减少了零方差提示，提高了平均绝对优势，并显示出更快的初始准确率提升。然而，由于使用了过于激进的配置，它最终趋于平稳并与基线收敛。当推广是主要瓶颈且数据集对模型来说难以使用时，该方法效率最高。我们还观察到Qwen2.5-Math可能表现出虚假奖励效应，使消融失效，这也引发了将其作为GRPO方法研究唯一试验平台的警示信号。
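
The buffer logic described in the abstract (keep medium-difficulty prompts, prefer pass rates near 0.5, respect a cooldown and a reuse cap, mix with fresh prompts) might look roughly like the sketch below. The class and function names, the 0.2 to 0.8 difficulty band, and all default values are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    prompt: str
    pass_rate: float        # fraction of correct rollouts at the last visit
    last_used_step: int
    reuse_count: int = 0

class PromptReplayBuffer:
    def __init__(self, low=0.2, high=0.8, cooldown=2, max_reuse=3):
        self.low, self.high = low, high
        self.cooldown, self.max_reuse = cooldown, max_reuse
        self.entries = {}

    def add(self, prompt, pass_rate, step):
        """Keep only medium-difficulty prompts; evict ones that became too easy or hard."""
        if self.low <= pass_rate <= self.high:
            old = self.entries.get(prompt)
            reuse = old.reuse_count if old else 0
            self.entries[prompt] = Entry(prompt, pass_rate, step, reuse)
        else:
            self.entries.pop(prompt, None)

    def sample(self, n, step):
        """Return up to n eligible prompts, closest to a 0.5 pass rate first."""
        eligible = [e for e in self.entries.values()
                    if step - e.last_used_step >= self.cooldown
                    and e.reuse_count < self.max_reuse]
        eligible.sort(key=lambda e: abs(e.pass_rate - 0.5))
        chosen = eligible[:n]
        for e in chosen:
            e.reuse_count += 1
            e.last_used_step = step
        return [e.prompt for e in chosen]

def build_batch(buffer, fresh_prompts, batch_size, replay_fraction, step):
    """Mix replayed prompts with fresh ones; only prompts are reused, never rollouts."""
    replayed = buffer.sample(int(batch_size * replay_fraction), step)
    return replayed + fresh_prompts[: batch_size - len(replayed)]
```

For binary rewards with pass rate p, the reward variance within a group is p(1 - p), which peaks at p = 0.5; that is why the sketch sorts by distance from 0.5. New rollouts are still generated for every replayed prompt, so the update stays on-policy.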

DeepXplain: XAI-Guided Autonomous Defense Against Multi-Stage APT Campaigns

DeepXplain：XAI引导的多阶段APT战役自主防御

Authors: Trung V. Phan, Thomas Bauschert
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.21296
Pdf link: https://arxiv.org/pdf/2603.21296
Abstract Advanced Persistent Threats (APTs) are stealthy, multi-stage attacks that require adaptive and timely defense. While deep reinforcement learning (DRL) enables autonomous cyber defense, its decisions are often opaque and difficult to trust in operational environments. This paper presents DeepXplain, an explainable DRL framework for stage-aware APT defense. Building on our prior DeepStage model, DeepXplain integrates provenance-based graph learning, temporal stage estimation, and a unified XAI pipeline that provides structural, temporal, and policy-level explanations. Unlike post-hoc methods, explanation signals are incorporated directly into policy optimization through evidence alignment and confidence-aware reward shaping. To the best of our knowledge, DeepXplain is the first framework to integrate explanation signals into reinforcement learning for APT defense. Experiments in a realistic enterprise testbed show improvements in stage-weighted F1-score (0.887 to 0.915) and success rate (84.7% to 89.6%), along with higher explanation confidence (0.86), improved fidelity (0.79), and more compact explanations (0.31). These results demonstrate enhanced effectiveness and trustworthiness of autonomous cyber defense.
中文摘要 高级持续威胁（APT）是一种隐蔽的多阶段攻击，需要适应性和及时防御。虽然深度强化学习（DRL）实现了自主网络防御，但其决策往往不透明且在作战环境中难以信任。本文介绍了DeepXplain，一个可解释的阶段感知APT防御DRL框架。基于我们之前的DeepStage模型，DeepXplain集成了基于来源的图学习、时间阶段估计以及统一的XAI流程，提供结构性、时间性和策略层面的解释。与事后方法不同，解释信号通过证据对齐和信心感知的奖励塑造直接融入政策优化中。据我们所知，DeepXplain是首个将解释信号整合进APT防御强化学习的框架。在现实企业测试平台中的实验显示，阶段加权F1分数（0.887至0.915）和成功率（84.7%至89.6%）均有所提升，解释置信度提升（0.86）、更准确度（0.79）和解释更紧凑（0.31）。这些结果显示自主网络防御的效能和可靠性得到了提升。

RoboAlign: Learning Test-Time Reasoning for Language-Action Alignment in Vision-Language-Action Models

RoboAlign：学习视觉-语言-行动模型中语言-动作对齐的测试时间推理

Authors: Dongyoung Kim, Sumin Park, Woomin Song, Seungku Kim, Taeyoung Kim, Huiwon Jang, Jinwoo Shin, Jaehyung Kim, Younggyo Seo
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.21341
Pdf link: https://arxiv.org/pdf/2603.21341
Abstract Improving embodied reasoning in multimodal-large-language models (MLLMs) is essential for building vision-language-action models (VLAs) on top of them to readily translate multimodal understanding into low-level actions. Accordingly, recent work has explored enhancing embodied reasoning in MLLMs through supervision of vision-question-answering type. However, these approaches have been reported to result in unstable VLA performance, often yielding only marginal or even negative gains. In this paper, we propose a more systematic MLLM training framework RoboAlign that reliably improves VLA performance. Our key idea is to sample action tokens via zero-shot natural language reasoning and refine this reasoning using reinforcement learning (RL) to improve action accuracy. As a result, RoboAlign bridges the modality gap between language and low-level actions in MLLMs, and facilitates knowledge transfer from MLLM to VLA. To validate the effectiveness of RoboAlign, we train VLAs by adding a diffusion-based action head on top of an MLLM backbone and evaluate them on major robotics benchmarks. Remarkably, by performing RL-based alignment after SFT using less than 1% of the data, RoboAlign achieves performance improvements of 17.5%, 18.9%, and 106.6% over SFT baselines on LIBERO, CALVIN, and real-world environments, respectively.
中文摘要 提升多模态-大语言模型（MLLM）中的具身推理对于在此基础上构建视觉-语言-行动模型（VLA）至关重要，以便能够轻松将多模态理解转化为低层次动作。因此，近期研究探讨通过监督视觉-问答型MLM来增强具身推理能力。然而，据报道这些方法会导致VLA性能不稳定，往往仅带来边际甚至负的提升。本文提出了一种更系统化的MLLM训练框架RoboAlign，能够可靠地提升VLA性能。我们的核心想法是通过零样本自然语言推理采样动作标记，并通过强化学习（RL）对这种推理进行优化，以提高动作准确率。因此，RoboAlign弥合了MLLM中语言与低层动作之间的模态差距，促进MLLM向VLA的知识转移。为了验证RoboAlign的有效性，我们通过在MLLM骨干上加装基于扩散的动作头来训练VLA，并在主要机器人基准测试中进行评估。值得注意的是，通过在SFT后使用少于1%的数据进行基于强化学习的对齐，RoboAlign在LIBERO、CALVIN和现实环境中分别比SFT基线提升了17.5%、18.9%和106.6%的性能。

A transformer architecture alteration to incentivise externalised reasoning

一项激励外部推理的变换器架构变更

Authors: Elizabeth Pavlova, Mariia Koroliuk, Karthik Viswanathan, Cameron Tice, Edward James Young, Puria Radmard
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.21376
Pdf link: https://arxiv.org/pdf/2603.21376
Abstract We propose a new architectural change, and post-training pipeline, for making LLMs more verbose reasoners by teaching a model to truncate forward passes early. We augment an existing transformer architecture with an early-exit mechanism at intermediate layers and train the model to exit at shallower layers when the next token can be predicted without deep computation. After a calibration stage, we incentivise the model to exit as early as possible while maintaining task performance using reinforcement learning. We provide preliminary results to this effect for small reasoning models, showing that they learn to adaptively reduce computations across tokens. We predict that, applied at the right scale, our approach can minimise the amount of excess computation that reasoning models have at their disposal to perform non-myopic planning using their internal activations, reserving this only for difficult-to-predict tokens.
中文摘要 我们提出一种新的架构变革和培训后流程，通过教授一个模型来提前截断前传，使大型语言模型更具冗长的推理能力。我们在现有的变换器架构基础上加入了中间层的提前退出机制，并训练模型在较浅层退出，当下一个令牌可以无需深度计算即可预测。校准阶段后，我们通过强化学习激励模型尽早退出，同时保持任务性能。我们提供了小型推理模型的初步结果，表明它们能够自适应地减少跨代币的计算。我们预测，在适当尺度下应用，我们的方法可以最大限度地减少推理模型在利用内部激活进行非近视规划时所产生的多余计算，仅用于难以预测的标记。

PivotRL: High Accuracy Agentic Post-Training at Low Compute Cost

PivotRL：低计算成本实现高精度代理后训练

Authors: Junkeun Yi, Damon Mosk-Aoyama, Baihe Huang, Ritu Gala, Charles Wang, Sugam Dipak Devare, Khushi Bhardwaj, Abhibha Gupta, Oleksii Kuchaiev, Jiantao Jiao, Jian Zhang, Venkat Srinivasan
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.21383
Pdf link: https://arxiv.org/pdf/2603.21383
Abstract Post-training for long-horizon agentic tasks has a tension between compute efficiency and generalization. While supervised fine-tuning (SFT) is compute efficient, it often suffers from out-of-domain (OOD) degradation. Conversely, end-to-end reinforcement learning (E2E RL) preserves OOD capabilities, but incurs high compute costs due to many turns of on-policy rollout. We introduce PivotRL, a novel framework that operates on existing SFT trajectories to combine the compute efficiency of SFT with the OOD accuracy of E2E RL. PivotRL relies on two key mechanisms: first, it executes local, on-policy rollouts and filters for pivots: informative intermediate turns where sampled actions exhibit high variance in outcomes; second, it utilizes rewards for functional-equivalent actions rather than demanding strict string matching with the SFT data demonstration. We theoretically show that these mechanisms incentivize strong learning signals with high natural gradient norm, while maximally preserving policy probability ordering on actions unrelated to training tasks. In comparison to standard SFT on identical data, we demonstrate that PivotRL achieves +4.17% higher in-domain accuracy on average across four agentic domains, and +10.04% higher OOD accuracy in non-agentic tasks. Notably, on agentic coding tasks, PivotRL achieves competitive accuracy with E2E RL with 4x fewer rollout turns. PivotRL is adopted by NVIDIA's Nemotron-3-Super-120B-A12B, acting as the workhorse in production-scale agentic post-training.
中文摘要 长视野代理任务的后期训练在计算效率与泛化之间存在张力。虽然监督微调（SFT）计算效率高，但常常存在域外（OOD）退化的问题。相反，端到端强化学习（E2E RL）保留了面向对象的能力，但由于多次策略启动，计算成本较高。我们介绍了PivotRL，这是一个基于现有SFT轨迹运行的新框架，将SFT的计算效率与E2E RL的OOD精度相结合。PivotRL依赖两个关键机制：首先，执行本地的政策内部署和枢轴的过滤：信息性中间回合，其中抽样动作结果差异较大;其次，它利用功能等效动作的奖励，而非严格的字符串匹配，并用SFT数据演示。我们理论上证明，这些机制激励具有高自然梯度范数的强学习信号，同时最大限度地保留与训练任务无关动作的政策概率排序。与标准SFT在相同数据上相比，我们证明PivotRL在四个代理领域平均实现了+4.17%的域内准确率，在非代理任务中实现了+10.04%的离域准确率。值得注意的是，在代理编码任务中，PivotRL以4倍的滚动回合实现了与端对端强化语言的竞争精度。PivotRL被英伟达的Nemotron-3-Super-120B-A12B采用，成为量产规模代理后训练的主力。
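
The pivot-filtering step in the abstract (keep intermediate turns where locally sampled actions show high variance in outcomes) can be sketched as follows. The data layout, the function name, and the variance threshold are hypothetical; the paper's actual selection criterion may differ.

```python
from statistics import pvariance

def select_pivots(turn_outcomes, min_variance=0.1):
    """Return the turn indices whose sampled-action outcomes vary enough to be informative.

    turn_outcomes maps a turn index to the list of outcome rewards obtained by
    sampling several on-policy actions at that turn (assumed layout).
    """
    return [turn for turn, rewards in sorted(turn_outcomes.items())
            if len(rewards) > 1 and pvariance(rewards) >= min_variance]

# Hypothetical outcomes for three turns of one SFT trajectory, four samples each.
outcomes = {
    0: [1, 1, 1, 1],   # every sampled action succeeds: no learning signal
    1: [1, 0, 1, 0],   # mixed outcomes: a candidate pivot
    2: [0, 0, 0, 0],   # every sampled action fails: no learning signal
}
pivots = select_pivots(outcomes)
```

Turns where every sampled action leads to the same outcome carry no relative signal between actions, so restricting rollouts to high-variance turns is one way to spend fewer rollout turns than full end-to-end RL.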

Dynasto: Validity-Aware Dynamic-Static Parameter Optimization for Autonomous Driving Testing

Dynasto：自动驾驶测试的有效性感知动态静态参数优化

Authors: Dmytro Humeniuk, Mohammad Hamdaqa, Houssem Ben Braiek, Amel Bennaceur, Foutse Khomh
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2603.21427
Pdf link: https://arxiv.org/pdf/2603.21427
Abstract Extensive simulation-based testing is important for assuring the safety of autonomous driving systems (ADS). However, generating safety-critical traffic scenarios remains challenging because failures often arise from rare, complex interactions with surrounding vehicles. Existing automatic scenario-generation approaches frequently fail to distinguish genuine ADS faults from collisions caused by implausible or invalid adversarial behaviors, and they typically optimize either scenario initialization or agent behavior in isolation. We propose Dynasto, a two-step testing approach that jointly optimizes initial scenario parameters and dynamic adversarial behaviors to uncover realistic safety-critical failures. First, we train an adversarial agent using reinforcement learning (RL) with temporal-logic-based validity criteria and a safe-distance model inspired by ISO 34502 to promote behaviorally plausible failures. Second, a genetic algorithm (GA) searches over initial conditions while replaying the adversary's failure-inducing behaviors to reveal additional failures that the RL agent alone does not uncover. Finally, a graph-based clustering pipeline groups failures into representative modes based on semantic event sequences. Our evaluation experiments in HighwayEnv across two ADS controllers show that Dynasto finds 60%-70% more valid failures than an RL-only adversary under the same evaluation budget. With clustering, we obtain about 12 interpretable failure modes per system under test, revealing valid failures driven by weaknesses in ego-controller behavior. These results indicate that coordinated dynamic-static optimization with explicit validity constraints is effective for exposing safety-relevant failures in ADS testing.
中文摘要 广泛的基于仿真的测试对于确保自动驾驶系统（ADS）的安全至关重要。然而，生成安全关键交通场景依然具有挑战性，因为故障往往源于与周围车辆的罕见且复杂的交互。现有的自动场景生成方法常常无法区分真正的ADS故障与由不合理或无效的对抗行为引起的碰撞，它们通常单独优化场景初始化或代理行为。我们提出了Dynasto，这是一种两步测试方法，联合优化初始场景参数和动态对抗行为，以揭示现实的安全关键故障。首先，我们使用基于时序逻辑的有效性标准和受ISO 34502启发的安全距离模型，使用强化学习（RL）训练对抗性代理，以促进行为上合理的失败。其次，遗传算法（GA）在重放对手导致失败的行为的同时，搜索初始条件，揭示强化学习代理单独无法发现的额外失败。最后，基于图的聚类流水线将失败按语义事件序列的代表性模式分组。我们在HighwayEnv中对两个ADS控制器的评估实验显示，在相同评估预算下，Dynasto发现的有效失效率比仅支持强化环境的对手高出60%-70%。通过聚类，我们得到每个被测系统约12种可解释的失败模式，揭示由自我控制行为弱点驱动的有效失败。这些结果表明，配合明确效度约束的动态静态协调优化，在ADS测试中有效暴露安全相关故障。

KG-Hopper: Empowering Compact Open LLMs with Knowledge Graph Reasoning via Reinforcement Learning

KG-Hopper：通过强化学习赋能紧凑的开放大型语言模型，实现知识图谱推理

Authors: Shuai Wang, Yinan Yu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.21440
Pdf link: https://arxiv.org/pdf/2603.21440
Abstract Large Language Models (LLMs) demonstrate impressive natural language capabilities but often struggle with knowledge-intensive reasoning tasks. Knowledge Base Question Answering (KBQA), which leverages structured Knowledge Graphs (KGs), exemplifies this challenge due to the need for accurate multi-hop reasoning. Existing approaches typically perform sequential reasoning steps guided by predefined pipelines, restricting flexibility and causing error cascades due to isolated reasoning at each step. To address these limitations, we propose KG-Hopper, a novel Reinforcement Learning (RL) framework that empowers compact open LLMs with the ability to perform integrated multi-hop KG reasoning within a single inference round. Rather than reasoning step-by-step, we train a Reasoning LLM that embeds the entire KG traversal and decision process into a unified "thinking" stage, enabling global reasoning over cross-step dependencies and dynamic path exploration with backtracking. Experimental results on eight KG reasoning benchmarks show that KG-Hopper, based on a 7B-parameter LLM, consistently outperforms larger multi-step systems (up to 70B) and achieves competitive performance with proprietary models such as GPT-3.5-Turbo and GPT-4o-mini, while remaining compact, open, and data-efficient. The code is publicly available at: this https URL.
中文摘要 大型语言模型（LLM）展现了令人印象深刻的自然语言能力，但在知识密集型推理任务中常常遇到困难。知识库问答（KBQA）利用结构化知识图谱（KGs）体现了这一挑战，因为需要准确的多跳推理。现有方法通常执行顺序推理步骤，由预设的流水线引导，限制了灵活性，并因每步推理孤立导致错误连锁反应。为解决这些局限，我们提出了KG-Hopper，一种新型强化学习（RL）框架，使紧凑的开放大型语言模型能够在单一推理轮内进行集成多跳KG推理。我们不再一步步推理，而是训练一个推理大型语言模型，将整个基因重的穿越和决策过程嵌入到统一的“思考”阶段，实现跨步依赖的全局推理和动态路径探索，并伴随回溯。八个 KG 推理基准测试的实验结果显示，基于 7B 参数 LLM 的 KG-Hopper 持续优于更大型多步系统（最高可达 70B），并能与 GPT-3.5-Turbo 和 GPT-4o-mini 等专有模型竞争，同时保持紧凑、开放和数据高效。该代码公开于：https URL。

DRTriton: Large-Scale Synthetic Data Reinforcement Learning for Triton Kernel Generation

DRTriton：用于Triton内核生成的大规模合成数据强化学习

Authors: Siqi Guo, Ming Lin, Tianbao Yang
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.21465
Pdf link: https://arxiv.org/pdf/2603.21465
Abstract Developing efficient CUDA kernels is a fundamental yet challenging task in the generative AI industry. Recent research leverages Large Language Models (LLMs) to automatically convert PyTorch reference implementations to CUDA kernels, significantly reducing engineering effort. However, state-of-the-art LLMs, such as GPT-5.2 and Claude-Sonnet-4.5, still struggle with this specific task. To address this challenge, we propose DRTriton, a scalable learning framework for training LLMs to convert PyTorch code into highly optimized Triton kernels, which are then compiled to CUDA kernels at runtime. DRTriton consists of three key components: (i) a data synthesis algorithm, CSP-DAG, that guarantees full coverage and unbiased uniform sampling over the operator space with controlled difficulty; (ii) a curriculum reinforcement learning scheme with decoupled rewards that efficiently optimizes conversion success rate and inference speed simultaneously; and (iii) a test-time search algorithm that further improves the inference speed of the generated Triton kernels. Notably, despite being trained exclusively on synthetic data, DRTriton generalizes effectively to real-world CUDA kernels that are challenging even for human experts. Experimental results show that DRTriton-7B achieves speedup on 92% of KernelBench Level 2, compared to 23% for GPT-5.2 and 19% for Claude-Sonnet-4.5.
中文摘要 开发高效的CUDA内核是生成式人工智能行业中既基础又充满挑战的任务。近期研究利用大型语言模型（LLMs）自动将PyTorch引用实现转换为CUDA内核，显著减少了工程工作量。最先进的大型语言模型，如GPT-5.2和Claude-Sonnet-4.5，在这项任务上仍然存在困难。为应对这一挑战，我们提出了DRTriton，一个可扩展的学习框架，用于训练LLM将PyTorch代码转换为高度优化的Triton内核，并在运行时编译为CUDA内核。DRTriton由三个关键组成部分：（i）数据合成算法CSP-DAG，保证在操作符空间内实现全覆盖和无偏均匀采样，且难度受控;（ii）采用解耦奖励的课程强化学习，能够高效地优化转换成功率和推理速度;以及（iii）一种测试时搜索算法，进一步提升生成的Triton核的推理速度。值得注意的是，尽管DRTriton完全基于合成数据训练，但它能够有效地推广到现实世界的CUDA内核，这对人类专家来说也颇具挑战性。实验结果显示，DRTriton-7B在KernelBench Level 2的提升率为92%，而GPT-5.2为23%，Claude-Sonnet-4.5为19%。

Learning Can Converge Stably to the Wrong Belief under Latent Reliability

学习可能会在潜在可靠度下稳定地收敛到错误的信念

Authors: Zhipeng Zhang, Zhenjie Yao, Kai Li, Lei Yang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.21491
Pdf link: https://arxiv.org/pdf/2603.21491
Abstract Learning systems are typically optimized by minimizing loss or maximizing reward, assuming that improvements in these signals reflect progress toward the true objective. However, when feedback reliability is unobservable, this assumption can fail, and learning algorithms may converge stably to incorrect solutions. This failure arises because single-step feedback does not reveal whether an experience is informative or persistently biased. When information is aggregated over learning trajectories, however, systematic differences between reliable and unreliable regimes can emerge. We propose a Monitor-Trust-Regulator (MTR) framework that infers reliability from learning dynamics and modulates updates through a slow-timescale trust variable. Across reinforcement learning and supervised learning settings, standard algorithms exhibit stable optimization behavior while learning incorrect solutions under latent unreliability, whereas trust-modulated systems reduce bias accumulation and improve recovery. These results suggest that learning dynamics are not only optimization traces but also a source of information about feedback reliability.
中文摘要 学习系统通常通过最小化损失或最大化奖励来优化，假设这些信号的改进反映了朝着真正目标的进展。然而，当反馈可靠性不可观测时，这一假设可能失效，学习算法可能会稳定地收敛到错误的解。这种失败的原因是单步反馈无法揭示体验是信息丰富还是持续偏见。然而，当信息在学习轨迹上汇总时，可靠与不可靠体系之间的系统性差异可能显现出来。我们提出了一个监测-信任-监管者（MTR）框架，通过学习动态推断可靠性，并通过慢时间尺度的信任变量调节更新。在强化学习和监督学习环境中，标准算法在潜在不可靠性下学习错误解时表现出稳定的优化行为，而信任调制系统则减少偏见积累并改善恢复率。这些结果表明学习动态不仅是优化痕迹，也是反馈可靠性信息的来源。

VIGIL: Part-Grounded Structured Reasoning for Generalizable Deepfake Detection

VIGIL：基于部位的结构化推理用于可泛化的深度伪造检测

Authors: Xinghan Li, Junhao Xu, Jingjing Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.21526
Pdf link: https://arxiv.org/pdf/2603.21526
Abstract Multimodal large language models (MLLMs) offer a promising path toward interpretable deepfake detection by generating textual explanations. However, the reasoning process of current MLLM-based methods combines evidence generation and manipulation localization into a unified step. This combination blurs the boundary between faithful observations and hallucinated explanations, leading to unreliable conclusions. Building on this, we present VIGIL, a part-centric structured forensic framework inspired by expert forensic practice through a plan-then-examine pipeline: the model first plans which facial parts warrant inspection based on global visual cues, then examines each part with independently sourced forensic evidence. A stage-gated injection mechanism delivers part-level forensic evidence only during examination, ensuring that part selection remains driven by the model's own perception rather than biased by external signals. We further propose a progressive three-stage training paradigm whose reinforcement learning stage employs part-aware rewards to enforce anatomical validity and evidence-conclusion coherence. To enable rigorous generalizability evaluation, we construct OmniFake, a hierarchical 5-level benchmark where the model, trained on only three foundational generators, is progressively tested up to in-the-wild social-media data. Extensive experiments on OmniFake and cross-dataset evaluations demonstrate that VIGIL consistently outperforms both expert detectors and concurrent MLLM-based methods across all generalizability levels.
中文摘要 多模态大型语言模型（MLLM）通过生成文本解释，为可解释的深度伪造检测提供了一条有前景的道路。然而，当前基于MLLM的方法的推理过程将证据生成和操作定位结合为一个统一步骤。这种组合模糊了忠实观察与幻觉解释之间的界限，导致结论不可靠。基于此，我们提出了VIGIL，这是一个部分以中心的结构化法医框架，灵感来自专家法医实践，通过规划后检查流程：模型先规划面部部件，基于全球视觉线索进行检查，然后用独立来源的法医证据检查每个部分。阶段门控注入机制仅在检查时提供部分层级的法医证据，确保零件选择始终由模型自身感知驱动，而不受外部信号偏见影响。我们还提出了一种渐进式三阶段训练范式，其强化学习阶段采用部分意识奖励来强化解剖学的有效性和证据——结论的连贯性。为实现严谨的泛化性评估，我们构建了OmniFake，这是一个层级五级基准测试，模型仅用三个基础生成器训练，逐步测试至真实的社交媒体数据。OmniFake和跨数据集评估的广泛实验表明，VIGIL在所有泛化层级上始终优于专家检测器和基于MLLM的并行方法。

What Do World Models Learn in RL? Probing Latent Representations in Learned Environment Simulators

世界模型在强化学习中学到了什么？探测学习到的环境模拟器中的潜在表征

Authors: Xinyu Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.21546
Pdf link: https://arxiv.org/pdf/2603.21546
Abstract World models learn to simulate environment dynamics from experience, enabling sample-efficient reinforcement learning. But what do these models actually represent internally? We apply interpretability techniques (including linear and nonlinear probing, causal interventions, and attention analysis) to two architecturally distinct world models: IRIS (discrete token transformer) and DIAMOND (continuous diffusion UNet), trained on Atari Breakout and Pong. Using linear probes, we find that both models develop linearly decodable representations of game state variables (object positions, scores), with MLP probes yielding only marginally higher R^2, confirming that these representations are approximately linear. Causal interventions that shift hidden states along probe-derived directions produce correlated changes in model predictions, providing evidence that representations are functionally used rather than merely correlated. Analysis of IRIS attention heads reveals spatial specialization: specific heads attend preferentially to tokens overlapping with game objects. Multi-baseline token ablation experiments consistently identify object-containing tokens as disproportionately important. Our findings provide interpretability evidence that learned world models develop structured, approximately linear internal representations of environment state across two games and two architectures.
中文摘要 世界模型通过经验学习模拟环境动力学，实现样本高效的强化学习。但这些模型在内部实际上代表了什么？我们将可解释性技术——包括线性和非线性探测、因果干预和注意力分析——应用于两种架构上截然不同的世界模型：IRIS（离散令牌变换器）和DIAMOND（连续扩散UNet），它们分别基于Atari Breakout和Pong训练。利用线性探针，我们发现两种模型都能发展出游戏状态变量（对象位置、分数）的线性可解码表示，MLP探针仅带来略高的R^2，证实这些表示近似线性。因果干预——沿探针导出方向移动隐藏状态——产生模型预测的相关变化，提供了表示是功能性使用而非仅仅相关性的证据。对IRIS注意力中心的分析揭示了空间专化性：特定中心优先关注与游戏对象重叠的标记。多基线标记消融实验一致认为含对象标记的重要性过高。我们的发现提供了可解释性的证据，表明学习到的世界模型会在两个游戏和两种架构中发展出结构化、近乎线性的环境内部状态表示。
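The linear-probing step described in the abstract above can be illustrated with a minimal sketch: fit an ordinary least-squares probe from a hidden feature to a game-state variable and report R^2. The single-feature setup, the toy data, and the helper names `fit_linear_probe` and `r_squared` are hypothetical; the paper probes high-dimensional IRIS and DIAMOND activations, which are not reproduced here.

```python
from statistics import fmean

def fit_linear_probe(xs, ys):
    """Fit y ~ w * x + b by ordinary least squares (single feature)."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    w = cov / var
    b = my - w * mx
    return w, b

def r_squared(xs, ys, w, b):
    """Coefficient of determination of the probe's predictions."""
    my = fmean(ys)
    ss_res = sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Toy data (hypothetical): one hidden feature and a paddle x-position
# that is an exact linear function of it, so the probe fits perfectly.
hidden = [0.0, 1.0, 2.0, 3.0, 4.0]
paddle_x = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = fit_linear_probe(hidden, paddle_x)
score = r_squared(hidden, paddle_x, w, b)
```

An R^2 close to 1 for a linear probe, with little gain from a nonlinear probe, is the kind of evidence the abstract cites for approximately linear representations.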

Counterfactual Credit Policy Optimization for Multi-Agent Collaboration

多智能体协作的反事实信用策略优化

Authors: Zhongyi Li, Wan Tian, Yikun Ban, Jinju Chen, Huiming Zhang, Yang Liu, Fuzhen Zhuang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.21563
Pdf link: https://arxiv.org/pdf/2603.21563
Abstract Collaborative multi-agent large language models (LLMs) can solve complex reasoning tasks by decomposing roles and aggregating diverse hypotheses. Yet, reinforcement learning (RL) for such systems is often undermined by credit assignment: a shared global reward obscures individual contributions, inflating update variance and encouraging free-riding. We introduce Counterfactual Credit Policy Optimization (CCPO), a framework that assigns agent-specific learning signals by estimating each agent's marginal contribution through counterfactual trajectories. CCPO builds dynamic counterfactual baselines that simulate outcomes with an agent's contribution removed, yielding role-sensitive advantages for policy optimization. To further improve stability under heterogeneous tasks and data distributions, we propose a global-history-aware normalization scheme that calibrates advantages using global rollout statistics. We evaluate CCPO on two collaboration topologies: a sequential Think-Reason dyad and multi-agent voting. Across mathematical and logical reasoning benchmarks, CCPO mitigates free-riding and outperforms strong multi-agent RL baselines, yielding finer-grained and more effective credit assignment for collaborative LLM training. Our code is available at this https URL.
中文摘要 协作多智能体大型语言模型（LLMs）可以通过分解角色和聚合不同假设来解决复杂的推理任务。然而，这类系统的强化学习（RL）常常被学分分配所削弱：共享的全局奖励掩盖了个人贡献，膨胀了更新差异，并鼓励搭便车行为。我们介绍了反事实信用政策优化（CCPO），这是一个通过估算每个代理通过反事实轨迹的边际贡献来分配特定智能体学习信号的框架。CCPO构建动态反事实基线，模拟结果，去除代理贡献，为策略优化带来角色敏感优势。为了进一步提升在异构任务和数据分布下的稳定性，我们提出了一种全局历史感知的规范化方案，利用全局推广统计来校准优势。我们基于两种协作拓扑来评估CCPO：顺序的思考-理性二元和多代理投票。在数学和逻辑推理基准上，CCPO减少了搭便车现象，并优于强的多智能体强化学习基线，为协作式LLM训练提供了更细致、更有效的学分分配。我们的代码可在此 https URL 访问。
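The idea of estimating an agent's marginal contribution by removing it can be sketched for the multi-agent voting topology. This is an illustration under stated assumptions, not CCPO itself: "removing a contribution" is modeled here as simply dropping a vote, ties are scored as failures, and the names `team_reward` and `counterfactual_credits` are hypothetical.

```python
from collections import Counter

def team_reward(votes, target):
    """1.0 if the target answer wins a strict plurality of votes, else 0.0."""
    if not votes:
        return 0.0
    counts = Counter(votes)
    top = max(counts.values())
    winners = [answer for answer, count in counts.items() if count == top]
    return 1.0 if winners == [target] else 0.0

def counterfactual_credits(votes, target):
    """Per-agent credit: team reward minus the reward without that agent's vote."""
    full = team_reward(votes, target)
    credits = []
    for i in range(len(votes)):
        without_i = votes[:i] + votes[i + 1:]
        credits.append(full - team_reward(without_i, target))
    return credits

# Two agents answer correctly; the third votes for a wrong answer.
votes = ["42", "42", "7"]
shared = team_reward(votes, "42")               # every agent would see 1.0
credits = counterfactual_credits(votes, "42")   # [1.0, 1.0, 0.0]
```

Under the shared reward all three agents receive the same signal, while the counterfactual credit gives nothing to the agent whose vote did not help, which is the free-riding problem the abstract describes.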

Adaptive Robust Estimator for Multi-Agent Reinforcement Learning

多智能体强化学习的自适应稳健估计器

Authors: Zhongyi Li, Wan Tian, Jingyu Chen, Kangyao Huang, Huiming Zhang, Hui Yang, Tao Ren, Jinyang Jiang, Yijie Peng, Yikun Ban, Fuzhen Zhuang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.21574
Pdf link: https://arxiv.org/pdf/2603.21574
Abstract Multi-agent collaboration has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models, yet it suffers from interaction-level ambiguity that blurs generation, critique, and revision, making credit assignment across agents difficult. Moreover, policy optimization in this setting is vulnerable to heavy-tailed and noisy rewards, which can bias advantage estimation and trigger unstable or even divergent training. To address both issues, we propose a robust multi-agent reinforcement learning framework for collaborative reasoning, consisting of two components: Dual-Agent Answer-Critique-Rewrite (DACR) and an Adaptive Robust Estimator (ARE). DACR decomposes reasoning into a structured three-stage pipeline: answer, critique, and rewrite, while enabling explicit attribution of each agent's marginal contribution to its partner's performance. ARE provides robust estimation of batch experience means during multi-agent policy optimization. Across mathematical reasoning and embodied intelligence benchmarks, even under noisy rewards, our method consistently outperforms the baseline in both homogeneous and heterogeneous settings. These results indicate stronger robustness to reward noise and more stable training dynamics, effectively preventing optimization failures caused by noisy reward signals.
中文摘要 多智能体协作已成为提升大型语言模型推理能力的强大范式，但其存在交互层面的模糊性，模糊了生成、批评和修订，使得跨智能体的署名分配变得困难。此外，在这种环境下，策略优化容易受到重尾和噪声奖励的影响，这可能偏向优势估计，并引发不稳定甚至发散的训练。为解决这两个问题，我们提出了一个稳健的多智能体强化学习协作推理框架，由两个组成部分组成：双智能体答案-批判-重写（DACR）和自适应稳健估计器（ARE）。DACR将推理分解为结构化的三阶段流程：回答、批评和重写，同时允许明确归因于每个代理对其合作伙伴绩效的边际贡献。ARE 在多智能体策略优化过程中，提供了批量体验平均值的稳健估计。在数学推理和具身智能基准中，即使在噪声奖励下，我们的方法在同质和异质环境中都始终优于基线。这些结果表明其对噪声的奖励更强鲁棒性和更稳定的训练动态，有效防止了因噪声奖励信号导致的优化失败。
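Why a robust estimate of the batch mean matters for advantage estimation can be shown with a small sketch. The abstract does not specify the ARE estimator, so a trimmed mean is swapped in here purely as a stand-in; `trimmed_mean`, `trim_count`, and the reward values are hypothetical.

```python
from statistics import fmean

def trimmed_mean(values, trim_count=1):
    """Mean after dropping the trim_count smallest and largest values."""
    ordered = sorted(values)
    if trim_count > 0 and len(ordered) > 2 * trim_count:
        ordered = ordered[trim_count:len(ordered) - trim_count]
    return fmean(ordered)

# A batch of rewards with one heavy-tailed outlier.
rewards = [1.0, 0.9, 1.1, 1.0, 50.0]

plain_baseline = fmean(rewards)           # dragged upward by the outlier
robust_baseline = trimmed_mean(rewards)   # stays near the typical reward

# Advantages computed against each baseline.
plain_adv = [r - plain_baseline for r in rewards]
robust_adv = [r - robust_baseline for r in rewards]
```

With the plain mean, the four ordinary samples all receive large negative advantages because of a single outlier; with the trimmed baseline they stay near zero.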

Spatio-Temporal Attention Enhanced Multi-Agent DRL for UAV-Assisted Wireless Networks with Limited Communications

时空注意力增强的多智能体深度强化学习，用于通信受限的无人机辅助无线网络

Authors: Che Chen, Lanhua Li, Shimin Gong, Yu Zhao, Yuming Fang, Dusit Niyato
Subjects: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2603.21594
Pdf link: https://arxiv.org/pdf/2603.21594
Abstract In this paper, we employ multiple UAVs to accelerate data transmissions from ground users (GUs) to a remote base station (BS) via the UAVs' relay communications. The UAVs' intermittent information exchanges typically result in delays in acquiring the complete system state and hinder their effective collaboration. To maximize the overall throughput, we first propose a delay-tolerant multi-agent deep reinforcement learning (MADRL) algorithm that integrates a delay-penalized reward to encourage information sharing among UAVs, while jointly optimizing the UAVs' trajectory planning, network formation, and transmission control strategies. Additionally, considering information loss due to unreliable channel conditions, we further propose a spatio-temporal attention based prediction approach to recover the lost information and enhance each UAV's awareness of the network state. These two designs are envisioned to enhance the network capacity in UAV-assisted wireless networks with limited communications. The simulation results reveal that our new approach achieves over 50% reduction in information delay and a 75% throughput gain compared to the conventional MADRL. Interestingly, it is shown that improving the UAVs' information sharing will not sacrifice the network capacity. Instead, it significantly improves the learning performance and throughput simultaneously. It is also effective in reducing the need for UAVs' information exchange and thus fostering practical deployment of MADRL in UAV-assisted wireless networks.
中文摘要 本文中，我们利用多架无人机通过无人机的中继通信，加速地面用户（GU）到远程基站（BS）的数据传输。无人机间歇性的信息交流通常导致获取完整系统状态的延迟，阻碍其有效协作。为了最大化整体吞吐量，我们首先提出了一种容忍延迟的多智能体深度强化学习（MADRL）算法，该算法集成了延迟惩罚奖励以促进无人机间的信息共享，同时共同优化无人机的轨迹规划、网络形成和传输控制策略。此外，考虑到信道不可靠导致的信息丢失，我们进一步提出了基于时空注意力的预测方法，以恢复丢失的信息并增强无人机对网络状态的感知。这两种设计旨在增强无人机辅助无线网络中通信能力的提升。模拟结果显示，我们的新方法相比传统MADRL实现了超过50%的信息延迟减少和75%的吞吐量提升。有趣的是，研究表明，提升无人机的信息共享能力并不会牺牲网络容量。相反，它显著提升了学习性能和吞吐量。它还有效减少了无人机信息交换的需求，从而促进了MADRL在无人机辅助无线网络中的实际部署。

Proximal Policy Optimization in Path Space: A Schrödinger Bridge Perspective

路径空间中的近端策略优化：薛定谔桥视角

Authors: Yuehu Gong, Zeyuan Wang, Yulin Chen, Yanwei Fu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.21621
Pdf link: https://arxiv.org/pdf/2603.21621
Abstract On-policy reinforcement learning with generative policies is promising but remains underexplored. A central challenge is that proximal policy optimization (PPO) is traditionally formulated in terms of action-space probability ratios, whereas diffusion- and flow-based policies are more naturally represented as trajectory-level generative processes. In this work, we propose GSB-PPO, a path-space formulation of generative PPO inspired by the Generalized Schrödinger Bridge (GSB). Our framework lifts PPO-style proximal updates from terminal actions to full generation trajectories, yielding a unified view of on-policy optimization for generative policies. Within this framework, we develop two concrete objectives: a clipping-based objective, GSB-PPO-Clip, and a penalty-based objective, GSB-PPO-Penalty. Experimental results show that while both objectives are compatible with on-policy training, the penalty formulation consistently delivers better stability and performance than the clipping counterpart. Overall, our results highlight path-space proximal regularization as an effective principle for training generative policies with PPO.
中文摘要 基于生成策略的策略强化学习前景看好，但仍未被充分探索。一个核心挑战是，近端策略优化（PPO）传统上是以行动空间概率比的形式表述，而基于扩散和流的策略则更自然地表现为轨迹级生成过程。在本研究中，我们提出了GSB-PPO，这是一种受广义薛定谔桥（GSB）启发的生成PPO路径空间表述。我们的框架将PPO式的近端更新从终端动作提升到全代生成轨迹，提供生成策略的统一优化视图。在此框架下，我们制定了两个具体目标：基于剪裁的目标GSB-PPO-Clip，以及基于惩罚的目标GSB-PPO-惩罚。实验结果表明，虽然这两个目标都与政策训练兼容，但惩罚表述始终比剪波模式提供更好的稳定性和性能。总体而言，我们的结果强调路径空间近端正则化作为使用PPO训练生成策略的有效原理。
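For reference, the two classical action-space PPO surrogates that the clipping-based and penalty-based objectives are named after can be written per sample as below. This sketch shows only the standard action-space forms; it is not the path-space GSB-PPO formulation, and the function names, `eps`, `beta`, and the single-sample KL estimate are illustrative assumptions.

```python
import math

def ppo_clip_term(logp_new, logp_old, advantage, eps=0.2):
    """Clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

def ppo_penalty_term(logp_new, logp_old, advantage, beta=1.0):
    """Penalized surrogate: r * A - beta * KL(old || new).

    The KL term is a single-sample estimate, log p_old - log p_new, for an
    action drawn from the old policy; it can be negative for one sample
    even though the expected KL is non-negative.
    """
    ratio = math.exp(logp_new - logp_old)
    kl_sample = logp_old - logp_new
    return ratio * advantage - beta * kl_sample

# The new policy doubles the probability of an action with positive advantage.
logp_old, logp_new = math.log(0.3), math.log(0.6)
clip_value = ppo_clip_term(logp_new, logp_old, advantage=1.0)
penalty_value = ppo_penalty_term(logp_new, logp_old, advantage=1.0)
```

The clipped form caps the incentive once the ratio leaves the trust interval, while the penalty form keeps the full ratio and discourages drift through the KL term.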

TAMTRL: Teacher-Aligned Reward Reshaping for Multi-Turn Reinforcement Learning in Long-Context Compression

TAMTRL：长上下文压缩中多回合强化学习的教师对齐奖励重塑

Authors: Li Wang, Yandong Wang, Xin Yu, Kui Zhang, Tianhao Peng, Wenjun Wu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.21663
Pdf link: https://arxiv.org/pdf/2603.21663
Abstract The rapid progress of large language models (LLMs) has led to remarkable performance gains across a wide range of tasks. However, when handling long documents that exceed the model's context window limit, the entire context cannot be processed in a single pass, making chunk-wise processing necessary. This requires multiple turns to read different chunks and update memory. However, supervision is typically provided only by the final outcome, which makes it difficult to evaluate the quality of memory updates at each turn in the multi-turn training setting. This introduces a temporal credit assignment challenge. Existing approaches, such as LLM-as-a-judge or process reward models, incur substantial computational overhead and suffer from estimation noise. To better address the credit assignment problem in multi-turn memory training, we propose Teacher-Aligned Reward Reshaping for Multi-Turn Reinforcement Learning (TAMTRL). TAMTRL leverages relevant documents as teacher signals by aligning them with each turn of model input and assigns rewards through normalized probabilities in a self-supervised manner. This provides fine-grained learning signals for each memory update and improves long-context processing. Experiments with multiple models of varying scales across seven long-context benchmarks show that TAMTRL consistently outperforms strong baselines, demonstrating its effectiveness. Our code is available at this https URL.
中文摘要 大型语言模型（LLM）的快速发展带来了在各种任务中显著的性能提升。然而，当处理超过模型上下文窗口限制的长文档时，无法一次性处理整个上下文，因此需要按分块处理。这需要多次读取不同区块并更新内存。然而，监督通常仅由最终结果提供，这使得在多回合训练环境中评估每回合内存更新质量变得困难。这引入了时间信用分配的挑战。现有方法，如LLM即评判或过程奖励模型，会产生大量计算开销，并存在估计噪声。为了更好地解决多回合记忆训练中的学分分配问题，我们提出了教师对齐奖励重塑法（TAMTRL）用于多回合强化学习。TAMTRL通过将相关文档与模型输入的每回合对齐，作为教师信号，并通过归一化概率自我监督地分配奖励。这为每次内存更新提供细粒度的学习信号，并提升了长上下文处理能力。在七个长上下文基准中对多个不同尺度模型的实验表明，TAMTRL始终优于强基线，证明了其有效性。我们的代码可在此 https URL 访问。

PPGL-Swarm: Integrated Multimodal Risk Stratification and Hereditary Syndrome Detection in Pheochromocytoma and Paraganglioma

PPGL-Swarm：综合多模态风险分层与遗传综合征检测在嗜铬细胞瘤和副神经节瘤中

Authors: Zelin Liu, Xiangfu Yu, Jie Huang, Ge Wang, Yizhe Yuan, Zhenyu Yi, Jing Xie, Haotian Jiang, Lichi Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.21700
Pdf link: https://arxiv.org/pdf/2603.21700
Abstract Pheochromocytomas and paragangliomas (PPGLs) are rare neuroendocrine tumors, of which 15-25% develop metastatic disease with 5-year survival rates reported as low as 34%. PPGL may indicate hereditary syndromes requiring stricter, syndrome-specific treatment and surveillance, but clinicians often fail to recognize these associations in routine care. Clinical practice uses the GAPP score for PPGL grading, but several limitations remain for PPGL diagnosis: (1) GAPP scoring imposes a high workload on clinicians because it requires the manual evaluation of six independent components; (2) key components such as cellularity and Ki-67 are often evaluated with subjective criteria; (3) several clinically relevant metastatic risk factors are not captured by GAPP, such as SDHB mutations, which have been associated with reported metastatic rates of 35-75%. Agent-driven diagnostic systems appear promising, but most lack traceable reasoning for decision-making and do not incorporate domain-specific knowledge such as PPGL genotype information. To address these limitations, we present PPGL-Swarm, an agentic PPGL diagnostic system that generates a comprehensive report, including automated GAPP scoring (with quantified cellularity and Ki-67), genotype risk alerts, and a multimodal report with integrated evidence. The system provides an auditable reasoning trail by decomposing diagnosis into micro-tasks, each assigned to a specialized agent. The gene and table agents use knowledge enhancement to better interpret genotype and laboratory findings, and during training we use reinforcement learning to refine tool selection and task assignment.
中文摘要 嗜铬细胞瘤和副神经节瘤（PPGLs）是罕见的神经内分泌肿瘤，其中15-25%发生转移性疾病，5年生存率低至34%。PPGL可能表明遗传综合征需要更严格、针对特定综合征的治疗和监测，但临床医生常常未能在常规护理中识别这些关联。临床实践使用GAPP评分进行PPGL评分，但PPGL诊断仍有若干局限性：（1）GAPP评分对临床医生来说工作量很大，因为它需要对六个独立组成部分进行人工评估;（2）关键组成部分如细胞性和Ki-67常被主观评估;（3）若干临床相关的转移风险因子未被GAPP捕捉，如SDHB突变，这些突变与报告的转移率为35-75%相关。主体驱动的诊断系统看起来很有前景，但大多数缺乏可追溯的决策推理，且不包含如PPGL基因型信息等领域特定知识。为解决这些局限性，我们推出了PPGL-Swarm，这是一种能动的PPGL诊断系统，能够生成全面的报告，包括自动GAPP评分（含定量细胞浓度和Ki-67）、基因型风险警报以及多模态报告，并集成证据。该系统通过将诊断分解为微任务，每个任务分配给一个专业代理，提供可审计的推理轨迹。基因和表格代理利用知识增强更好地解释基因型和实验室发现，培训期间我们通过强化学习优化工具选择和任务分配。

EvoIdeator: Evolving Scientific Ideas through Checklist-Grounded Reinforcement Learning

EvoIdeator：通过以清单为基础的强化学习演进科学思想

Authors: Andreas Sauter, Yuyue Zhao, Jacopo Urbani, Wenxiang Hu, Zaiqiao Meng, Lun Zhou, Xiaohui Yan, Yougang Lyu
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.21728
Pdf link: https://arxiv.org/pdf/2603.21728
Abstract Scientific idea generation is a cornerstone of autonomous knowledge discovery, yet the iterative evolution required to transform initial concepts into high-quality research proposals remains a formidable challenge for Large Language Models (LLMs). Existing Reinforcement Learning (RL) paradigms often rely on rubric-based scalar rewards that provide global quality scores but lack actionable granularity. Conversely, language-based refinement methods are typically confined to inference-time prompting, targeting models that are not explicitly optimized to internalize such critiques. To bridge this gap, we propose EvoIdeator, a framework that facilitates the evolution of scientific ideas by aligning the RL training objective with checklist-grounded feedback. EvoIdeator leverages a structured judge model to generate two synergistic signals: (1) lexicographic rewards for multi-dimensional optimization, and (2) fine-grained language feedback that offers span-level critiques regarding grounding, feasibility, and methodological rigor. By integrating these signals into the RL loop, we condition the policy to systematically utilize precise feedback during both optimization and inference. Extensive experiments demonstrate that EvoIdeator, built on Qwen3-4B, significantly outperforms much larger frontier models across key scientific metrics. Crucially, the learned policy exhibits strong generalization to diverse external feedback sources without further fine-tuning, offering a scalable and rigorous path toward self-refining autonomous ideation.
中文摘要 科学思想生成是自主知识发现的基石，但将初始概念转化为高质量研究提案所需的迭代演进，仍是大型语言模型（LLMs）面临的巨大挑战。现有的强化学习（RL）范式通常依赖基于评分标准的标量奖励，这些奖励提供全局质量评分，但缺乏可操作的细粒度。相反，基于语言的精炼方法通常局限于推理时提示，所针对的模型并未被明确优化以内化此类批评。为弥合这一差距，我们提出了EvoIdeator框架，通过将强化学习训练目标与基于清单的反馈对齐，促进科学思想的演进。EvoIdeator利用结构化评判模型生成两个协同信号：（1）用于多维优化的字典序奖励，以及（2）细粒度语言反馈，针对依据充分性、可行性和方法论严谨性提供片段级批评。通过将这些信号整合进强化学习循环，我们使策略能够在优化和推理过程中系统地利用精确反馈。大量实验表明，基于Qwen3-4B构建的EvoIdeator在关键科学指标上显著优于规模大得多的前沿模型。关键是，学到的策略无需进一步微调即可很好地泛化到多样化的外部反馈来源，为自我完善的自主构思提供了可扩展且严谨的路径。
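A lexicographic reward compares candidates dimension by dimension in priority order, so a lower-priority score only matters when all higher-priority scores tie. The sketch below illustrates that ordering with Python tuple comparison. The dimension names are taken from the critique dimensions listed in the abstract, but their priority order, the integer scores, and the `lexicographic_key` helper are hypothetical.

```python
def lexicographic_key(scores, priority):
    """Turn a dict of per-dimension scores into a tuple ordered by priority."""
    return tuple(scores[name] for name in priority)

# Hypothetical priority order: earlier dimensions dominate later ones.
priority = ["grounding", "feasibility", "rigor"]

idea_a = {"grounding": 3, "feasibility": 1, "rigor": 5}
idea_b = {"grounding": 3, "feasibility": 2, "rigor": 0}

# Python compares tuples lexicographically: (3, 2, 0) > (3, 1, 5)
# because the first difference, at "feasibility", favors idea_b.
best = max([idea_a, idea_b], key=lambda s: lexicographic_key(s, priority))

# A plain sum would prefer idea_a instead (9 versus 5).
best_by_sum = max([idea_a, idea_b], key=lambda s: sum(s.values()))
```

The contrast with the summed score shows why a lexicographic signal cannot be traded off by over-performing on a low-priority dimension.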

CellFluxRL: Biologically-Constrained Virtual Cell Modeling via Reinforcement Learning

CellFluxRL：通过强化学习实现生物约束虚拟细胞建模

Authors: Dongxia Wu, Shiye Su, Yuhui Zhang, Elaine Sui, Emma Lundberg, Emily B. Fox, Serena Yeung-Levy
Subjects: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Arxiv link: https://arxiv.org/abs/2603.21743
Pdf link: https://arxiv.org/pdf/2603.21743
Abstract Building virtual cells with generative models to simulate cellular behavior in silico is emerging as a promising paradigm for accelerating drug discovery. However, prior image-based generative approaches can produce implausible cell images that violate basic physical and biological constraints. To address this, we propose to post-train virtual cell models with reinforcement learning (RL), leveraging biologically meaningful evaluators as reward functions. We design seven rewards spanning three categories (biological function, structural validity, and morphological correctness) and optimize the state-of-the-art CellFlux model to yield CellFluxRL. CellFluxRL consistently improves over CellFlux across all rewards, with further performance boosts from test-time scaling. Overall, our results present a virtual cell modeling framework that enforces physically-based constraints through RL, advancing beyond "visually realistic" generations towards "biologically meaningful" ones.
中文摘要 构建带有生成模型的虚拟细胞以模拟计算机模拟细胞行为，正成为加速药物发现的有前景范式。然而，以往基于图像的生成方法可能产生不合理的细胞图像，违反基本的物理和生物限制。为此，我们提议通过强化学习（RL）对虚拟细胞模型进行后训练，利用具有生物学意义的评估器作为奖励函数。我们设计了七项奖励，涵盖生物功能、结构效度和形态正确性三类，并优化了最先进的CellFlux模型，生成CellFluxRL。CellFluxRL 在所有奖励项目上都持续优于 CellFlux，并通过测试时间缩放进一步提升性能。总体而言，我们的结果提出了一个虚拟细胞建模框架，通过强化学习强制执行基于物理的约束，推动“视觉真实”世代向“生物学意义”的世代迈进。

Image-Conditioned Adaptive Parameter Tuning for Visual Odometry Frontends

视觉里程计前端的图像调节自适应参数调优

Authors: Simone Nascivera, Leonard Bauersfeld, Jeff Delaune, Davide Scaramuzza
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.21785
Pdf link: https://arxiv.org/pdf/2603.21785
Abstract Resource-constrained autonomous robots rely on sparse direct and semi-direct visual-(inertial)-odometry (VO) pipelines, as they provide a favorable tradeoff between accuracy, robustness, and computational cost. However, the performance of most systems depends critically on hand-tuned hyperparameters governing feature detection, tracking, and outlier rejection. These parameters are typically fixed during deployment, even though their optimal values vary with scene characteristics such as texture density, illumination, motion blur, and sensor noise, leading to brittle performance in real-world environments. We propose the first image-conditioned reinforcement learning framework for online tuning of VO frontend parameters, effectively embedding the expert into the system. Our key idea is to formulate the frontend configuration as a sequential decision-making problem and learn a policy that directly maps visual input to feature detection and tracking parameters. The policy uses a lightweight texture-aware CNN encoder and a privileged critic during training. Unlike prior RL-based approaches that rely solely on internal VO statistics, our method observes the image content and proactively adapts parameters before tracking degrades. Experiments on TartanAirV2 and TUM RGB-D show 3x longer feature tracks and 3x lower computational cost, despite training entirely in simulation.
中文摘要 资源受限的自主机器人依赖稀疏的直接和半直接视觉（惯性）里程计（VO）流水线，因为它们在准确性、鲁棒性和计算成本之间提供了有利的权衡。然而，大多数系统的性能关键依赖于手工调优的超参数，这些参数控制特征检测、跟踪和异常值抑制。这些参数通常在部署时固定，尽管其最佳值会随场景特性变化，如纹理密度、光照、运动模糊和传感器噪声，导致在现实环境中表现脆弱。我们提出了首个用于在线调优VO前端参数的图像条件强化学习框架，有效地将专家嵌入系统中。我们的核心思想是将前端配置构建为顺序决策问题，并学习一个策略，将视觉输入直接映射到特征检测和跟踪参数。该政策在培训期间使用轻量级纹理感知的CNN编码器和特权批评者。与以往仅依赖内部VO统计的基于强化学习的方法不同，我们的方法会先观察图像内容，并在跟踪性能下降前主动调整参数。TartanAirV2和TUM RGB-D的实验显示，尽管训练完全在模拟中，功能轨迹长度是3倍长，计算成本降低3倍。

Agentic Personas for Adaptive Scientific Explanations with Knowledge Graphs

基于知识图谱的自适应科学解释的智能体角色

Authors: Susana Nunes, Tiago Guerreiro, Catia Pesquita
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2603.21846
Pdf link: https://arxiv.org/pdf/2603.21846
Abstract AI explanation methods often assume a static user model, producing non-adaptive explanations regardless of expert goals, reasoning strategies, or decision contexts. Knowledge graph-based explanations, despite their capacity for grounded, path-based reasoning, inherit this limitation. In complex domains such as scientific discovery, this assumption fails to capture the diversity of cognitive strategies and epistemic stances among experts, preventing explanations that foster deeper understanding and informed decision-making. However, the scarcity of human experts limits the use of direct human feedback to produce adaptive explanations. We present a reinforcement learning approach for scientific explanation generation that incorporates agentic personas, structured representations of expert reasoning strategies, that guide the explanation agent towards specific epistemic preferences. In an evaluation of knowledge graph-based explanations for drug discovery, we tested two personas that capture distinct epistemic stances derived from expert feedback. Results show that persona-driven explanations match state-of-the-art predictive performance while persona preferences closely align with those of their corresponding experts. Adaptive explanations were consistently preferred over non-adaptive baselines (n = 22), and persona-based training reduces feedback requirements by two orders of magnitude. These findings demonstrate how agentic personas enable scalable adaptive explainability for AI systems in complex and high-stakes domains.
中文摘要 AI解释方法通常假设用户模型静态，无论专家目标、推理策略或决策上下文如何，都会产生非自适应的解释。基于知识图的解释，尽管具备基于路径的基础推理能力，但它们继承了这一局限性。在科学发现等复杂领域，这一假设未能反映专家认知策略和认知立场的多样性，阻碍了促进更深入理解和明智决策的解释。然而，人类专家的稀缺限制了直接的人类反馈来产生适应性解释。我们提出了一种科学解释生成的强化学习方法，结合了智能人格（agentic persona），即专家推理策略的结构化表征，引导解释智能体朝向特定的认知偏好。在对基于知识图的药物发现解释的评估中，我们测试了两种人格，它们捕捉了基于专家反馈的不同认知立场。结果显示，人格驱动的解释与最先进的预测表现相匹配，而人格偏好与相应专家的偏好高度一致。自适应解释始终优于非自适应基线（n = 22），基于人格的训练将反馈需求减少了两个数量级。这些发现展示了代理人物角色如何为复杂且高风险领域的人工智能系统实现可扩展的自适应解释性。

P^2O: Joint Policy and Prompt Optimization

P^2O：联合策略与提示优化

Authors: Xinyu Lu, Kaiqi Zhang, Jinglin Yang, Boxi Cao, Yaojie Lu, Hongyu Lin, Min He, Xianpei Han, Le Sun
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.21877
Pdf link: https://arxiv.org/pdf/2603.21877
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). However, vanilla RLVR suffers from inefficient exploration, particularly when confronting "hard samples" that yield near-zero success rates. In such scenarios, the reliance on sparse outcome rewards typically results in zero-advantage estimates, effectively starving the model of supervision signals despite the high informational value of these instances. To address this, we propose P^2O, a novel framework that synergizes Prompt Optimization with Policy Optimization. P^2O identifies hard samples during training iterations and leverages the GeneticPareto (GEPA) prompt optimization algorithm to evolve prompt templates that guide the model toward discovering successful trajectories. Crucially, unlike traditional prompt engineering methods that rely on input augmentation, P^2O distills the reasoning gains induced by these optimized prompts directly into the model parameters. This mechanism provides denser positive supervision signals for hard samples and accelerates convergence. Extensive experiments demonstrate that P^2O not only achieves superior performance on in-distribution datasets but also exhibits strong generalization, yielding substantial improvements on out-of-distribution benchmarks (+4.7% avg.).
中文摘要 带可验证奖励的强化学习（RLVR）已成为增强大型语言模型（LLM）推理能力的强大范式。然而，原版RLVR存在低效探索问题，尤其是在面对几乎没有成功率的“硬样本”时。在这种情况下，依赖稀疏的结果奖励通常导致零优势估计，尽管这些实例信息价值很高，但实际上使监督信号模型陷入困境。为此，我们提出了P^2O，一种将提示优化与策略优化协同的新框架。P^2O 在训练迭代中识别硬样本，并利用遗传帕累托（GEPA）提示优化算法进化提示模板，指导模型发现成功的轨迹。关键是，与依赖输入增强的传统提示工程方法不同，P^2O将这些优化提示带来的推理收益直接提炼到模型参数中。该机制为硬样品提供更密集的正向监督信号，并加速收敛。大量实验表明，P^2O不仅在分布内数据集上表现优异，还表现出强烈的泛化能力，在分布外基准测试上有显著提升（平均+4.7%）。
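The zero-advantage problem on hard samples can be made concrete with a small sketch. It assumes a group-normalized advantage (each rollout's reward minus the group mean, divided by the group standard deviation), which is one common choice in RLVR pipelines; the abstract does not state which estimator P^2O uses, and `group_advantages` and `eps` are hypothetical names.

```python
from statistics import fmean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of rollouts."""
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hard sample: every rollout fails, so all binary rewards are 0.
hard = group_advantages([0.0, 0.0, 0.0, 0.0])

# Once at least one rollout succeeds, the group carries a learning signal.
mixed = group_advantages([0.0, 0.0, 0.0, 1.0])
```

For the hard sample every advantage is exactly zero, so the policy gradient from that prompt vanishes. Producing even one successful trajectory, which the abstract attributes to the evolved prompt templates, restores a non-zero signal.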

Deep Reinforcement Learning and The Tale of Two Temporal Difference Errors

深度强化学习与两种时序差分误差的故事

Authors: Juan Sebastian Rojas, Chi-Guhn Lee
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.21921
Pdf link: https://arxiv.org/pdf/2603.21921
Abstract The temporal difference (TD) error was first formalized in Sutton (1988), where it was first characterized as the difference between temporally successive predictions, and later, in that same work, formulated as the difference between a bootstrapped target and a prediction. Since then, these two interpretations of the TD error have been used interchangeably in the literature, with the latter eventually being adopted as the standard critic loss in deep reinforcement learning (RL) architectures. In this work, we show that these two interpretations of the TD error are not always equivalent. In particular, we show that increasingly-nonlinear deep RL architectures can cause these interpretations of the TD error to yield increasingly different numerical values. Then, building on this insight, we show how choosing one interpretation of the TD error over the other can affect the performance of deep RL algorithms that utilize the TD error to compute other quantities, such as with deep differential (i.e., average-reward) RL methods. All in all, our results show that the default interpretation of the TD error as the difference between a bootstrapped target and a prediction does not always hold in deep RL settings.
中文摘要 时序差分（TD）误差最早在Sutton（1988）中被形式化，最初被描述为时间上相邻的预测之间的差值，随后在同一工作中又被表述为自举目标与预测之间的差值。此后，文献中这两种TD误差的解释被交替使用，后者最终被采纳为深度强化学习（RL）架构中的标准评论家（critic）损失。在本研究中，我们证明了这两种TD误差的解释并不总是等价的。特别地，我们展示了非线性程度越来越高的深度强化学习架构会使这两种解释给出差异越来越大的数值。接着，基于这一见解，我们展示了在两种解释之间的选择如何影响利用TD误差计算其他量的深度强化学习算法的性能，例如深度差分（即平均奖励）强化学习方法。总的来说，我们的结果表明，将TD误差解释为自举目标与预测之差的默认做法，在深度强化学习环境中并不总是成立。

Demystifying Reinforcement Learning for Long-Horizon Tool-Using Agents: A Comprehensive Recipe

揭开长期工具使用智能体的强化学习神秘面纱：全面配方

Authors: Xixi Wu, Qianguo Sun, Ruiyang Zhang, Chao Song, Junlong Wu, Yiyan Qi, Hong Cheng
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.21972
Pdf link: https://arxiv.org/pdf/2603.21972
Abstract Reinforcement Learning (RL) is essential for evolving Large Language Models (LLMs) into autonomous agents capable of long-horizon planning, yet a practical recipe for scaling RL in complex, multi-turn environments remains elusive. This paper presents a systematic empirical study using TravelPlanner, a challenging testbed requiring tool orchestration to satisfy multifaceted constraints. We decompose the agentic RL design space along 5 axes: reward shaping, model scaling, data composition, algorithm selection, and environmental stability. Our controlled experiments yield 7 key takeaways, e.g., (1) reward and algorithm choices are scale-dependent as smaller models benefit from staged rewards and enhanced exploration, whereas larger models converge efficiently with simpler dense rewards, (2) ~ 1K training samples with a balanced difficulty mixture mark a sweet spot for both in-domain and out-of-domain performance, and (3) environmental stability is critical to prevent policy degradation. Based on our distilled recipe, our RL-trained models achieve state-of-the-art performance on TravelPlanner, significantly outperforming leading LLMs.
中文摘要 强化学习（RL）对于将大型语言模型（LLMs）发展为能够进行长时域规划的自主智能体至关重要，但在复杂的多轮环境中扩展强化学习的实用方案仍然缺失。本文基于TravelPlanner开展了一项系统性实证研究，这是一个需要工具编排以满足多方面约束的挑战性测试平台。我们沿五个维度分解了智能体强化学习的设计空间：奖励塑造、模型规模、数据组成、算法选择和环境稳定性。我们的对照实验得出7个关键结论，例如：（1）奖励和算法的选择依赖于模型规模，较小的模型受益于分阶段奖励和增强的探索，而较大的模型使用更简单的稠密奖励即可高效收敛；（2）约1K个难度均衡混合的训练样本是域内和域外性能的最佳平衡点；（3）环境稳定性对防止策略退化至关重要。基于我们提炼出的方案，我们的强化学习训练模型在TravelPlanner上实现了最先进的性能，显著优于主流大型语言模型。

TREX: Trajectory Explanations for Multi-Objective Reinforcement Learning

TREX：多目标强化学习的轨迹解释

Authors: Dilina Rajapakse, Juan C. Rosero, Ivana Dusparic
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.21988
Pdf link: https://arxiv.org/pdf/2603.21988
Abstract Reinforcement Learning (RL) has demonstrated its ability to solve complex decision-making problems in a variety of domains by optimizing reward signals obtained through interaction with an environment. However, many real-world scenarios involve multiple, potentially conflicting objectives that cannot be easily represented by a single scalar reward. Multi-Objective Reinforcement Learning (MORL) addresses this limitation by enabling agents to optimize several objectives simultaneously, explicitly reasoning about trade-offs between them. However, the "black box" nature of RL models makes the decision process behind the chosen objective trade-offs unclear. Current Explainable Reinforcement Learning (XRL) methods are typically designed for single scalar rewards and do not account for explanations with respect to distinct objectives or user preferences. To address this gap, in this paper we propose TREX, a Trajectory-based Explainability framework that explains Multi-Objective Reinforcement Learning policies through trajectory attribution. TREX generates trajectories directly from the learned expert policy across different user preferences and clusters them into semantically meaningful temporal segments. We quantify the influence of these behavioural segments on the Pareto trade-off by training complementary policies that exclude specific clusters and measuring the resulting relative deviation in observed rewards and actions compared to the original expert policy. Experiments on multi-objective MuJoCo environments (HalfCheetah, Ant, and Swimmer) demonstrate the framework's ability to isolate and quantify specific behavioural patterns.
中文摘要 强化学习（RL）已证明其能够通过优化与环境交互获得的奖励信号，解决多个领域的复杂决策问题。然而，许多现实场景涉及多个可能相互冲突的目标，这些目标无法用单一标量奖励轻易表示。多目标强化学习（MORL）通过使智能体能够同时优化多个目标，明确推理它们之间的权衡，从而解决了这一局限性。然而，强化学习模型的“黑箱”特性使得所选目标权衡背后的决策过程变得不清晰。当前的可解释强化学习（XRL）方法通常为单一标量奖励设计，未考虑针对不同目标或用户偏好的解释。为弥补这一空白，本文提出了TREX，这是一种基于轨迹的可解释性框架，通过轨迹归因来解释多目标强化学习策略。TREX 直接从学到的专家策略在不同用户偏好下生成轨迹，并将其聚类为语义上有意义的时间片段。我们通过训练排除特定聚类的互补策略，并测量与原始专家策略相比观察到的奖励和动作的相对偏差，来量化这些行为片段对帕累托权衡的影响。在多目标MuJoCo环境（HalfCheetah、Ant和Swimmer）上的实验展示了该框架分离和量化特定行为模式的能力。

MEVIUS2: Practical Open-Source Quadruped Robot with Sheet Metal Welding and Multimodal Perception

MEVIUS2：具备钣金焊接和多模态感知的实用开源四足机器人

Authors: Kento Kawaharazuka, Keita Yoneda, Shintaro Inoue, Temma Suzuki, Jun Oda, Kei Okada
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.22031
Pdf link: https://arxiv.org/pdf/2603.22031
Abstract Various quadruped robots have been developed to date, and thanks to reinforcement learning, they are now capable of traversing diverse types of rough terrain. In parallel, there is a growing trend of releasing these robot designs as open-source, enabling researchers to freely build and modify robots themselves. However, most existing open-source quadruped robots have been designed with 3D printing in mind, resulting in structurally fragile systems that do not scale well in size, leading to the construction of relatively small robots. Although a few open-source quadruped robots constructed with metal components exist, they still tend to be small in size and lack multimodal sensors for perception, making them less practical. In this study, we developed MEVIUS2, an open-source quadruped robot with a size comparable to Boston Dynamics' Spot, whose structural components can all be ordered through e-commerce services. By leveraging sheet metal welding and metal machining, we achieved a large, highly durable body structure while reducing the number of individual parts. Furthermore, by integrating sensors such as LiDARs and a high dynamic range camera, the robot is capable of detailed perception of its surroundings, making it more practical than previous open-source quadruped robots. We experimentally validated that MEVIUS2 can traverse various types of rough terrain and demonstrated its environmental perception capabilities. All hardware, software, and training environments can be obtained from Supplementary Materials or this https URL.
中文摘要 迄今为止，已经开发出多种四足机器人，得益于强化学习，它们现在能够穿越各种崎岖地形。与此同时，将这些机器人设计开源发布的趋势日益增长，使研究人员能够自由地自行构建和修改机器人。然而，大多数现有的开源四足机器人在设计时以3D打印为前提，导致结构脆弱且难以放大尺寸，因此只能建造相对较小的机器人。虽然存在少数采用金属部件的开源四足机器人，但它们仍然体积较小，且缺乏用于感知的多模态传感器，因此实用性较差。在这项研究中，我们开发了MEVIUS2，一款开源的四足机器人，其尺寸可与波士顿动力的Spot相当，其结构部件均可通过电子商务服务订购。通过利用钣金焊接和金属加工技术，我们实现了大型且高度耐用的机身结构，同时减少了单个零件的数量。此外，通过集成激光雷达（LiDAR）和高动态范围摄像头等传感器，该机器人能够对周围环境进行细致感知，使其比以往的开源四足机器人更为实用。我们通过实验验证了MEVIUS2能够穿越各种崎岖地形，并展示了其环境感知能力。所有硬件、软件和训练环境均可从补充材料或此 https URL 获取。

A Context Engineering Framework for Improving Enterprise AI Agents based on Digital-Twin MDP

基于数字孪生MDP改进企业AI代理的上下文工程框架

Authors: Xi Yang, Aurelie Lozano, Naoki Abe, Bhavya, Saurabh Jha, Noah Zheutlin, Rohan R. Arora, Yu Deng, Daby M. Sow
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.22083
Pdf link: https://arxiv.org/pdf/2603.22083
Abstract Despite rapid progress in AI agents for enterprise automation and decision-making, their real-world deployment and further performance gains remain constrained by limited data quality and quantity, complex real-world reasoning demands, difficulties with self-play, and the lack of reliable feedback signals. To address these challenges, we propose a lightweight, model-agnostic framework for improving LLM-based enterprise agents via offline reinforcement learning (RL). The proposed Context Engineering via DT-MDP (DT-MDP-CE) framework comprises three key components: (1) a Digital-Twin Markov Decision Process (DT-MDP), which abstracts the agent's reasoning behavior as a finite MDP; (2) a robust contrastive inverse RL method, which, armed with the DT-MDP, efficiently estimates a well-founded reward function and induces policies from mixed-quality offline trajectories; and (3) RL-guided context engineering, which uses the policy obtained from the integrated process of (1) and (2) to improve the agent's decision-making behavior. As a case study, we apply the framework to a representative task in the enterprise-oriented domain of IT automation. Extensive experimental results demonstrate consistent and significant improvements over baseline agents across a wide range of evaluation settings, suggesting that the framework can generalize to other agents sharing similar characteristics in enterprise environments.
中文摘要 尽管AI代理在企业自动化和决策领域取得了快速进展，但其实际部署和进一步的性能提升仍受限于有限的数据质量和数量、复杂的现实推理需求、自我博弈的困难以及缺乏可靠的反馈信号。为应对这些挑战，我们提出了一个轻量级、模型无关的框架，通过离线强化学习（RL）改进基于LLM的企业代理。所提出的基于DT-MDP的上下文工程（DT-MDP-CE）框架包含三个关键组成部分：（1）数字孪生马尔可夫决策过程（DT-MDP），将智能体的推理行为抽象为有限的MDP；（2）一种稳健的对比逆强化学习方法，借助DT-MDP高效估计有充分依据的奖励函数，并从质量参差不齐的离线轨迹中导出策略；以及（3）强化学习引导的上下文工程，利用从（1）和（2）的集成过程中获得的策略来改善智能体的决策行为。作为案例研究，我们将该框架应用于面向企业的IT自动化领域中的一个代表性任务。大量实验结果显示，在各种评估设置下，该框架相较基线代理取得了一致且显著的改进，表明它可以推广到企业环境中具有相似特征的其他代理。

On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation

关于 LLM 推理 RLVR 更新方向：识别与利用

Authors: Kexin Huang, Haoming Meng, Junkang Wu, Jinda Lu, Chiyu Ma, Ziqian Chen, Xue Wang, Bolin Ding, Jiancan Wu, Xiang Wang, Xiangnan He, Guoyin Wang, Jingren Zhou
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.22117
Pdf link: https://arxiv.org/pdf/2603.22117
Abstract Reinforcement learning with verifiable rewards (RLVR) has substantially improved the reasoning capabilities of large language models. While existing analyses identify that RLVR-induced changes are sparse, they primarily focus on the magnitude of these updates, largely overlooking their direction. In this work, we argue that the direction of updates is a more critical lens for understanding RLVR's effects, which can be captured by the signed, token-level log probability difference $\Delta\log p$ between the base and final RLVR models. Through statistical analysis and token-replacement interventions, we demonstrate that $\Delta\log p$ more effectively identifies sparse, yet reasoning-critical updates than magnitude-based metrics (e.g., divergence or entropy). Building on this insight, we propose two practical applications: (1) a test-time extrapolation method that amplifies the policy along the learned $\Delta\log p$ direction to improve reasoning accuracy without further training; (2) a training-time reweighting method that focuses learning on low-probability (corresponding to higher $\Delta\log p$) tokens, which improves reasoning performance across models and benchmarks. Our work establishes the direction of change as a key principle for analyzing and improving RLVR.
中文摘要 带有可验证奖励的强化学习（RLVR）显著提升了大型语言模型的推理能力。虽然现有分析指出RLVR引起的变化较为稀疏，但它们主要关注这些更新的大小，而大多忽视了它们的方向。在本研究中，我们认为更新方向是理解RLVR效应的一个更关键视角，这可以通过基础模型与最终RLVR模型之间带符号的词元级对数概率差$\Delta\log p$来捕捉。通过统计分析和词元替换干预，我们证明$\Delta\log p$比基于大小的指标（如散度或熵）更有效地识别稀疏但对推理至关重要的更新。基于这一见解，我们提出了两个实际应用：（1）一种测试时外推方法，沿学到的$\Delta\log p$方向放大策略，无需进一步训练即可提高推理准确性；（2）一种训练时重加权方法，将学习集中于低概率（对应较高的$\Delta\log p$）词元，从而提升跨模型和基准的推理性能。我们的工作确立了变化方向作为分析和改进RLVR的关键原则。
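The two quantities named in this abstract, the signed token-level difference $\Delta\log p$ and test-time extrapolation along it, can be sketched on a toy three-token vocabulary. The extrapolation rule below (add `alpha` times $\Delta\log p$ to the RLVR log-probabilities, then renormalize) is an assumed form chosen for illustration; the abstract does not give the exact formula, and the function names and logits are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def delta_log_p(base_logp, rlvr_logp):
    """Signed per-token change: log p_rlvr - log p_base."""
    return [r - b for b, r in zip(base_logp, rlvr_logp)]


def extrapolate(base_logp, rlvr_logp, alpha):
    """Assumed rule: push the RLVR distribution further along delta log p."""
    shifted = [r + alpha * (r - b) for b, r in zip(base_logp, rlvr_logp)]
    return log_softmax(shifted)


# Toy next-token distributions; RLVR raised the logit of token 1.
base = log_softmax([2.0, 1.0, 0.0])
rlvr = log_softmax([2.0, 2.0, 0.0])

delta = delta_log_p(base, rlvr)
extra = extrapolate(base, rlvr, alpha=1.0)

print([round(d, 3) for d in delta])            # positive only for token 1
print([round(math.exp(x), 3) for x in extra])  # token 1 amplified further
```

The sign of each entry in `delta` tells which tokens RLVR promoted or suppressed, information that an unsigned magnitude metric would discard.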

Closed-Loop Verbal Reinforcement Learning for Task-Level Robotic Planning

任务级机器人规划的闭环语言强化学习

Authors: Dmitrii Plotnikov, Iaroslav Kolomiets, Dmitrii Maliukov, Dmitrij Kosenkov, Daniia Zinniatullina, Artem Trandofilov, Georgii Gazaryan, Kirill Bogatikov, Timofei Kozlov, Igor Duchinskii, Mikhail Konenkov, Miguel Altamirano Cabrera, Dzmitry Tsetserukou
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.22169
Pdf link: https://arxiv.org/pdf/2603.22169
Abstract We propose a new Verbal Reinforcement Learning (VRL) framework for interpretable task-level planning in mobile robotic systems operating under execution uncertainty. The framework follows a closed-loop architecture that enables iterative policy improvement through interaction with the physical environment. In our framework, executable Behavior Trees are repeatedly refined by a Large Language Model actor using structured natural-language feedback produced by a Vision-Language Model critic that observes the physical robot and execution traces. Unlike conventional reinforcement learning, policy updates in VRL occur directly at the symbolic planning level, without gradient-based optimization. This enables transparent reasoning, explicit causal feedback, and human-interpretable policy evolution. We validate the proposed framework on a real mobile robot performing a multi-stage manipulation and navigation task under execution uncertainty. Experimental results show that the framework supports explainable policy improvements, closed-loop adaptation to execution failures, and reliable deployment on physical robotic systems.
中文摘要 我们提出了一种新的语言强化学习（VRL）框架，用于在执行不确定性下运行的移动机器人系统中进行可解释的任务级规划。该框架采用闭环架构，通过与物理环境的交互实现迭代式策略改进。在我们的框架中，可执行的行为树由大型语言模型执行者（actor）反复优化，其依据是视觉语言模型评判者（critic）在观察物理机器人和执行轨迹后产生的结构化自然语言反馈。与传统强化学习不同，VRL中的策略更新直接发生在符号规划层面，不使用基于梯度的优化。这使得透明的推理、明确的因果反馈和人类可理解的策略演化成为可能。我们在一台真实的移动机器人上验证了该框架，该机器人在执行不确定性条件下完成多阶段操作和导航任务。实验结果显示，该框架支持可解释的策略改进、对执行失败的闭环适应，以及在物理机器人系统上的可靠部署。

Cross-Modal Reinforcement Learning for Navigation with Degraded Depth Measurements

基于降级深度测量的跨模态强化学习导航

Authors: Omkar Sawant, Luca Zanatta, Grzegorz Malczyk, Kostas Alexis
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.22182
Pdf link: https://arxiv.org/pdf/2603.22182
Abstract This paper presents a cross-modal learning framework that exploits complementary information from depth and grayscale images for robust navigation. We introduce a Cross-Modal Wasserstein Autoencoder that learns shared latent representations by enforcing cross-modal consistency, enabling the system to infer depth-relevant features from grayscale observations when depth measurements are corrupted. The learned representations are integrated with a Reinforcement Learning-based policy for collision-free navigation in unstructured environments when depth sensors experience degradation due to adverse conditions such as poor lighting or reflective surfaces. Simulation and real-world experiments demonstrate that our approach maintains robust performance under significant depth degradation and successfully transfers to real environments.
中文摘要 本文提出了一个跨模态学习框架，利用深度和灰度图像的互补信息实现稳健导航。我们引入了一种跨模态Wasserstein自动编码器，通过强制执行跨模态一致性来学习共享的潜在表示，使系统能够在深度测量数据被破坏时，从灰度观测中推断出与深度相关的特征。这些学习到的表征与基于强化学习的策略相结合，在深度传感器因光照不足或反光面等不利条件而退化时，实现无碰撞导航。仿真和实际实验表明，我们的方法在显著深度降解下仍能保持稳健性能，并成功迁移到真实环境中。

Seeing is Improving: Visual Feedback for Iterative Text Layout Refinement

看到就是进步：视觉反馈用于迭代文本布局优化

Authors: Junrong Guo, Shancheng Fang, Yadong Qu, Hongtao Xie
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.22187
Pdf link: https://arxiv.org/pdf/2603.22187
Abstract Recent advances in Multimodal Large Language Models (MLLMs) have enabled automated generation of structured layouts from natural language descriptions. Existing methods typically follow a code-only paradigm that generates code to represent layouts, which are then rendered by graphic engines to produce final images. However, they are blind to the rendered visual outcome, making it difficult to guarantee readability and aesthetics. In this paper, we identify visual feedback as a critical factor in layout generation and propose Visual Feedback Layout Model (VFLM), a self-improving framework that leverages visual feedback for iterative refinement. VFLM is capable of performing adaptive reflective generation, which leverages visual information to reflect on previous issues and iteratively generates outputs until satisfactory quality is achieved. This capability is obtained through reinforcement learning with a visually grounded reward model that incorporates OCR accuracy. By rewarding only the final generated outcome, we can effectively stimulate the model's iterative and reflective generative capabilities. Experiments across multiple benchmarks show that VFLM consistently outperforms advanced MLLMs, existing layout models, and code-only baselines, establishing visual feedback as critical for design-oriented MLLMs. Our code and data are available at this https URL.
中文摘要 多模态大型语言模型（MLLM）的最新进展使得根据自然语言描述自动生成结构化布局成为可能。现有方法通常遵循纯代码范式，生成代码表示布局，然后由图形引擎渲染以生成最终图像。然而，这些方法看不到渲染后的视觉结果，因此难以保证可读性和美观性。本文指出视觉反馈是布局生成的关键因素，并提出了视觉反馈布局模型（VFLM），这是一个利用视觉反馈进行迭代细化的自我改进框架。VFLM能够执行自适应反思式生成，利用视觉信息反思此前的问题，并迭代生成输出，直到达到满意的质量。这一能力通过强化学习获得，所用的视觉接地奖励模型纳入了OCR准确率。通过仅奖励最终生成的结果，我们可以有效激发模型的迭代和反思生成能力。多个基准测试的实验表明，VFLM始终优于先进的MLLM、现有布局模型和纯代码基线，确立了视觉反馈对面向设计的MLLM至关重要。我们的代码和数据可在此 https URL 访问。

Make Tracking Easy: Neural Motion Retargeting for Humanoid Whole-body Control

让追踪变得简单：神经运动重定向用于人形全身控制

Authors: Qingrui Zhao, Kaiyue Yang, Xiyu Wang, Shiqi Zhao, Yi Lu, Xinfang Zhang, Wei Yin, Qiu Shen, Xiao-Xiao Long, Xun Cao
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.22201
Pdf link: https://arxiv.org/pdf/2603.22201
Abstract Humanoid robots require diverse motor skills to integrate into complex environments, but bridging the kinematic and dynamic embodiment gap from human data remains a major bottleneck. We demonstrate through Hessian analysis that traditional optimization-based retargeting is inherently non-convex and prone to local optima, leading to physical artifacts like joint jumps and self-penetration. To address this, we reformulate the retargeting problem as learning a data distribution rather than solving for optimal solutions, and propose NMR, a Neural Motion Retargeting framework that transforms static geometric mapping into a dynamics-aware learned process. We first propose Clustered-Expert Physics Refinement (CEPR), a hierarchical data pipeline that leverages VAE-based motion clustering to group heterogeneous movements into latent motifs. This strategy significantly reduces the computational overhead of massively parallel reinforcement learning experts, which project and repair noisy human demonstrations onto the robot's feasible motion manifold. The resulting high-fidelity data supervises a non-autoregressive CNN-Transformer architecture that reasons over global temporal context to suppress reconstruction noise and bypass geometric traps. Experiments on the Unitree G1 humanoid across diverse dynamic tasks (e.g., martial arts, dancing) show that NMR eliminates joint jumps and significantly reduces self-collisions compared to state-of-the-art baselines. Furthermore, NMR-generated references accelerate the convergence of downstream whole-body control policies, establishing a scalable path for bridging the human-robot embodiment gap.
中文摘要 人形机器人需要多样化的运动技能才能融入复杂环境，但弥合从人类数据到机器人之间的运动学和动力学具身差距仍是主要瓶颈。通过Hessian分析，我们证明了传统的基于优化的重定向本质上是非凸的，容易陷入局部最优，导致关节跳变和自穿透等物理伪影。为此，我们将重定向问题重新表述为学习数据分布，而非求解最优解，并提出了NMR，一种神经运动重定向框架，将静态几何映射转化为动力学感知的学习过程。我们首先提出了聚类专家物理细化（CEPR），这是一种分层数据流水线，利用基于VAE的运动聚类将异质运动归类为潜在基元。这一策略显著降低了大规模并行强化学习专家的计算开销，这些专家将带噪声的人类演示投影并修复到机器人可行运动的流形上。由此产生的高保真数据用于监督一个非自回归的CNN-Transformer架构，该架构在全局时间上下文上进行推理，以抑制重建噪声并绕过几何陷阱。在Unitree G1人形机器人上进行的多种动态任务（如武术、舞蹈）实验显示，与最先进的基线相比，NMR消除了关节跳变，并显著减少了自碰撞。此外，NMR生成的参考动作加速了下游全身控制策略的收敛，为弥合人机具身差距建立了可扩展的路径。

SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation

空间奖励：文本到图像生成中可验证的空间奖励建模，实现细粒度空间一致性

Authors: Sashuai Zhou, Qiang Zhou, Junpeng Ma, Yue Cao, Ruofan Hu, Ziang Zhang, Xiaoda Yang, Zhibin Wang, Jun Song, Cheng Yu, Bo Zheng, Zhou Zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.22228
Pdf link: https://arxiv.org/pdf/2603.22228
Abstract Recent advances in text-to-image (T2I) generation via reinforcement learning (RL) have benefited from reward models that assess semantic alignment and visual quality. However, most existing reward models pay limited attention to fine-grained spatial relationships, often producing images that appear plausible overall yet contain inaccuracies in object positioning. In this work, we present SpatialReward, a verifiable reward model explicitly designed to evaluate spatial layouts in generated images. SpatialReward adopts a multi-stage pipeline: a Prompt Decomposer extracts entities, attributes, and spatial metadata from free-form prompts; expert detectors provide accurate visual grounding of object positions and attributes; and a vision-language model applies chain-of-thought reasoning over grounded observations to assess complex spatial relations that are challenging for rule-based methods. To more comprehensively evaluate spatial relationships in generated images, we introduce SpatRelBench, a benchmark covering object attributes, orientation, inter-object relations, and rendered text placement. Experiments on Stable Diffusion and FLUX show that incorporating SpatialReward into RL training consistently improves spatial consistency and overall generation quality, with results aligned more closely to human judgments. These findings indicate that verifiable reward models hold considerable potential for enabling more accurate and controllable optimization in text-to-image generation models.
中文摘要 基于强化学习（RL）的文本到图像（T2I）生成的最新进展，受益于评估语义对齐和视觉质量的奖励模型。然而，大多数现有的奖励模型对细粒度空间关系关注有限，常常产生整体看似可信但物体定位不准确的图像。在本研究中，我们提出了SpatialReward，这是一个可验证的奖励模型，专门设计用于评估生成图像中的空间布局。SpatialReward采用多阶段流水线：Prompt Decomposer从自由形式提示中提取实体、属性和空间元数据；专家检测器对物体位置和属性提供准确的视觉定位；视觉语言模型则在这些定位后的观测结果上应用思维链推理，以评估基于规则的方法难以处理的复杂空间关系。为了更全面地评估生成图像中的空间关系，我们引入了SpatRelBench，这是一个涵盖对象属性、方向、对象间关系和渲染文本位置的基准测试。在Stable Diffusion和FLUX上的实验表明，将SpatialReward纳入强化学习训练，能够持续提升空间一致性和整体生成质量，结果更贴近人类判断。这些发现表明，可验证的奖励模型在实现文本到图像生成模型中更精准和可控的优化方面具有巨大潜力。

DexDrummer: In-Hand, Contact-Rich, and Long-Horizon Dexterous Robot Drumming

DexDrummer：手内、接触丰富且长时域的灵巧机器人击鼓

Authors: Hung-Chieh Fang, Amber Xie, Jennifer Grannen, Kenneth Llontop, Dorsa Sadigh
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.22263
Pdf link: https://arxiv.org/pdf/2603.22263
Abstract Performing in-hand, contact-rich, and long-horizon dexterous manipulation remains an unsolved challenge in robotics. Prior hand dexterity works have considered each of these three challenges in isolation, yet do not combine these skills into a single, complex task. To further test the capabilities of dexterity, we propose drumming as a testbed for dexterous manipulation. Drumming naturally integrates all three challenges: it involves in-hand control for stabilizing and adjusting the drumstick with the fingers, contact-rich interaction through repeated striking of the drum surface, and long-horizon coordination when switching between drums and sustaining rhythmic play. We present DexDrummer, a hierarchical object-centric bimanual drumming policy trained in simulation with sim-to-real transfer. The framework reduces the exploration difficulty of pure reinforcement learning by combining trajectory planning with residual RL corrections for fast transitions between drums. A dexterous manipulation policy handles contact-rich dynamics, guided by rewards that explicitly model both finger-stick and stick-drum interactions. In simulation, we show our policy can play two styles of music: multi-drum, bimanual songs and challenging, technical exercises that require increased dexterity. Across simulated bimanual tasks, our dexterous, reactive policy outperforms a fixed grasp policy in F1 score by 1.87x on easy songs and 1.22x on hard songs. In real-world tasks, we show song performance across a multi-drum setup. DexDrummer is able to play our training song and its extended version with an F1 score of 1.0.
中文摘要 在机器人领域，进行手内、接触丰富且长时域的灵巧操作仍是一个未解决的挑战。以往的灵巧手研究都是单独考虑这三种挑战，并未将这些技能合并到一个复杂的任务中。为了进一步测试灵巧操作的能力，我们提出以击鼓作为灵巧操作的测试平台。击鼓自然融合了这三大挑战：它涉及用手指稳定和调整鼓棒的手内控制，通过反复敲击鼓面实现的接触丰富的交互，以及在不同鼓之间切换并维持节奏演奏时的长时域协调。我们提出DexDrummer，一种分层的、以物体为中心的双手击鼓策略，在仿真中训练并通过仿真到现实迁移进行部署。该框架通过将轨迹规划与残差强化学习修正相结合，降低了纯强化学习的探索难度，以实现鼓之间的快速过渡。灵巧操作策略处理接触丰富的动力学，并由显式建模手指与鼓棒、鼓棒与鼓之间交互的奖励引导。在仿真中，我们展示了我们的策略可以演奏两种风格的音乐：多鼓、双手的歌曲，以及需要更高灵巧度的具有挑战性的技术性练习。在仿真的双手任务中，我们灵巧且反应式的策略在F1分数上优于固定抓取策略，在简单歌曲上为其1.87倍，在困难歌曲上为其1.22倍。在真实世界任务中，我们展示了在多鼓设置下的歌曲演奏表现。DexDrummer能够演奏我们的训练曲及其扩展版，F1分数为1.0。

TiCo: Time-Controllable Training for Spoken Dialogue Models

TiCo：语音对话模型的时间可控训练

Authors: Kai-Wei Chang, Wei-Chih Chen, En-Pei Hu, Hung-yi Lee, James Glass
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2603.22267
Pdf link: https://arxiv.org/pdf/2603.22267
Abstract We propose TiCo, a simple post-training method for enabling spoken dialogue models (SDMs) to follow time-constrained instructions and generate responses with controllable duration. This capability is valuable for real-world spoken language systems such as voice assistants and interactive agents, where controlling response duration can improve interaction quality. However, despite their strong ability to generate natural spoken responses, existing models lack time awareness and struggle to follow duration-related instructions (e.g., "Please generate a response lasting about 15 seconds"). Through an empirical evaluation of both open-source and commercial SDMs, we show that they frequently fail to satisfy such time-control requirements. TiCo addresses this limitation by enabling models to estimate elapsed speaking time during generation through Spoken Time Markers (STM) (e.g., <10.6 seconds>). These markers help the model maintain awareness of time and adjust the remaining content to meet the target duration. TiCo is simple and efficient: it requires only a small amount of data and no additional question-answer pairs, relying instead on self-generation and reinforcement learning. Experimental results show that TiCo significantly improves adherence to duration constraints while preserving response quality.
中文摘要 我们提出了TiCo，这是一种简单的后训练方法，使语音对话模型（SDM）能够遵循带时间约束的指令并生成时长可控的响应。这一能力对于语音助手和交互式智能体等现实世界的口语系统非常有价值，因为控制响应时长可以提升交互质量。然而，尽管现有模型在生成自然口语回应方面能力很强，但缺乏时间感知能力，难以遵循与时长相关的指令（例如，“请生成一个持续约15秒的响应”）。通过对开源和商业SDM的实证评估，我们发现它们经常无法满足这类时间控制要求。TiCo通过语音时间标记（STM）（例如<10.6 seconds>）使模型能够在生成过程中估算已经过的说话时间，从而解决这一限制。这些标记帮助模型保持时间感知，并调整剩余内容以达到目标时长。TiCo 简单高效：只需少量数据，无需额外的问答对，而是依靠自我生成和强化学习。实验结果表明，TiCo显著提高了对时长约束的遵循程度，同时保持了响应质量。

Decoupling Exploration and Policy Optimization: Uncertainty Guided Tree Search for Hard Exploration

解耦探索与策略优化：不确定性引导树搜索以进行硬探索

Authors: Zakaria Mhammedi, James Cohan
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.22273
Pdf link: https://arxiv.org/pdf/2603.22273
Abstract The process of discovery requires active exploration -- the act of collecting new and informative data. However, efficient autonomous exploration remains a major unsolved problem. The dominant paradigm addresses this challenge by using Reinforcement Learning (RL) to train agents with intrinsic motivation, maximizing a composite objective of extrinsic and intrinsic rewards. We suggest that this approach incurs unnecessary overhead: while policy optimization is necessary for precise task execution, employing such machinery solely to expand state coverage may be inefficient. In this paper, we propose a new paradigm that explicitly separates exploration from exploitation and bypasses RL during the exploration phase. Our method uses a tree-search strategy inspired by the Go-With-The-Winner algorithm, paired with a measure of epistemic uncertainty to systematically drive exploration. By removing the overhead of policy optimization, our approach explores an order of magnitude more efficiently than standard intrinsic motivation baselines on hard Atari benchmarks. Further, we demonstrate that the discovered trajectories can be distilled into deployable policies using existing supervised backward learning algorithms, achieving state-of-the-art scores by a wide margin on Montezuma's Revenge, Pitfall!, and Venture without relying on domain-specific knowledge. Finally, we demonstrate the generality of our framework in high-dimensional continuous action spaces by solving the MuJoCo Adroit dexterous manipulation and AntMaze tasks in a sparse-reward setting, directly from image observations and without expert demonstrations or offline datasets. To the best of our knowledge, this has not been achieved before.
中文摘要 发现过程需要主动探索，即收集新的、有信息量的数据。然而，高效的自主探索仍是一个重大的未解决问题。主流范式通过使用强化学习（RL）来训练具有内在动机的智能体，最大化外在与内在奖励的复合目标。我们认为这种方法会产生不必要的开销：虽然策略优化对于精确执行任务是必要的，但仅仅为了扩大状态覆盖而使用此类机制可能效率低下。本文提出了一种新范式，明确区分探索与利用，并在探索阶段绕过强化学习。我们的方法采用了受Go-With-The-Winner算法启发的树搜索策略，结合认知不确定性的度量，系统地推动探索。通过消除策略优化的开销，我们的方法在困难的Atari基准上比标准内在动机基线的探索效率高出一个数量级。此外，我们证明了发现的轨迹可以利用现有的监督式反向学习算法蒸馏成可部署的策略，在Montezuma's Revenge、Pitfall!和Venture上以大幅优势获得最先进的分数，而无需依赖领域特定知识。最后，我们在稀疏奖励设置下，直接从图像观测出发，且不使用专家演示或离线数据集，解决了MuJoCo Adroit灵巧操作和AntMaze任务，展示了我们的框架在高维连续动作空间中的通用性。据我们所知，这在此前从未实现过。
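The idea of exploring without policy optimization can be illustrated with a toy population search in the spirit of go-with-the-winners: particles take random steps, then the particles sitting in the least-visited states are cloned over the rest. This is a minimal sketch, not the authors' algorithm. The integer-line environment, the visit-count uncertainty proxy, and every name and parameter below are assumptions made for illustration.

```python
import random
from collections import Counter


def explore(n_particles=8, rounds=20, steps_per_round=5, seed=0):
    """Toy uncertainty-guided population search on an integer line."""
    rng = random.Random(seed)
    counts = Counter()             # visit counts per state
    particles = [0] * n_particles  # every particle starts at state 0
    for _ in range(rounds):
        # Each particle takes a few uniformly random steps (no learned policy).
        for i in range(n_particles):
            for _ in range(steps_per_round):
                particles[i] += rng.choice((-1, 1))
                counts[particles[i]] += 1
        # Uncertainty proxy (assumption): rarely visited states rank first.
        ranked = sorted(particles, key=lambda s: counts[s])
        winners = ranked[: max(1, n_particles // 2)]
        # Clone winners over the whole population, keeping its size fixed.
        particles = [winners[i % len(winners)] for i in range(n_particles)]
    return particles, counts


particles, counts = explore()
print(len(counts), "distinct states visited")
```

Because selection only copies states, the exploration phase needs no gradient updates; a deployable policy would be distilled from the discovered trajectories afterwards, as the abstract describes.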

Keyword: diffusion policy

Dreaming the Unseen: World Model-regularized Diffusion Policy for Out-of-Distribution Robustness

梦见未见：世界模型正则化的扩散策略，实现分布外鲁棒性

Authors: Ziou Hu, Xiangtong Yao, Yuan Meng, Zhenshan Bing, Alois Knoll
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.21017
Pdf link: https://arxiv.org/pdf/2603.21017
Abstract Diffusion policies excel at visuomotor control but often fail catastrophically under severe out-of-distribution (OOD) disturbances, such as unexpected object displacements or visual corruptions. To address this vulnerability, we introduce the Dream Diffusion Policy (DDP), a framework that deeply integrates a diffusion world model into the policy's training objective via a shared 3D visual encoder. This co-optimization endows the policy with robust state-prediction capabilities. When encountering sudden OOD anomalies during inference, DDP detects the real-imagination discrepancy and actively abandons the corrupted visual stream. Instead, it relies on its internal "imagination" (autoregressively forecasted latent dynamics) to safely bypass the disruption, generating imagined trajectories before smoothly realigning with physical reality. Extensive evaluations demonstrate DDP's exceptional resilience. Notably, DDP achieves a 73.8% OOD success rate on MetaWorld (vs. 23.9% without predictive imagination) and an 83.3% success rate under severe real-world spatial shifts (vs. 3.3% without predictive imagination). Furthermore, as a stress test, DDP maintains a 76.7% real-world success rate even when relying entirely on open-loop imagination post-initialization.
中文摘要 扩散策略在视觉运动控制方面表现优异，但在严重的分布外（OOD）干扰（如意外的物体位移或视觉损坏）下常常发生灾难性失效。为解决这一脆弱性，我们引入了梦境扩散策略（Dream Diffusion Policy，DDP），这是一个通过共享3D视觉编码器将扩散世界模型深度整合到策略训练目标中的框架。这种协同优化赋予了策略强大的状态预测能力。当推理过程中遇到突发的OOD异常时，DDP会检测到现实与想象之间的差异，并主动放弃被破坏的视觉流，转而依靠自身的“想象”（自回归预测的潜在动态）安全绕过干扰，先生成想象轨迹，再平滑地与物理现实重新对齐。广泛的评估显示了DDP卓越的韧性。值得注意的是，DDP在MetaWorld上的OOD成功率为73.8%（无预测想象时为23.9%），在严重的真实世界空间偏移下成功率为83.3%（无预测想象时为3.3%）。此外，作为压力测试，DDP即使在初始化后完全依赖开环想象，真实世界成功率仍为76.7%。
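The discrepancy-gated switch described in this abstract can be sketched as a simple rule: compare the observed latent with the world model's predicted latent and fall back to the prediction when the gap is too large. This is a simplified illustration under stated assumptions. The Euclidean distance, the fixed threshold, and the function name `choose_latents` are hypothetical, and the predictions are given as inputs rather than rolled out autoregressively.

```python
def choose_latents(observed, predicted, threshold):
    """For each step, return (latent used, source label).

    Assumption: a Euclidean gap above `threshold` marks the observation
    as corrupted, so the predicted ("imagined") latent is used instead.
    """
    chosen = []
    for obs, pred in zip(observed, predicted):
        gap = sum((o - p) ** 2 for o, p in zip(obs, pred)) ** 0.5
        if gap > threshold:
            chosen.append((pred, "imagined"))
        else:
            chosen.append((obs, "observed"))
    return chosen


# The third observation is heavily corrupted relative to the prediction.
observed = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
predicted = [[0.0, 0.1], [0.1, 0.1], [0.2, 0.1]]

sources = [src for _, src in choose_latents(observed, predicted, threshold=1.0)]
print(sources)  # ['observed', 'observed', 'imagined']
```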