Arxiv Papers of Today

Generated: 2026-03-04 16:42:50 (UTC+8); Arxiv release time: 2026-03-04 20:00 EST (2026-03-05 09:00 UTC+8)

44 relevant papers today

Keyword: reinforcement learning

ATPO: Adaptive Tree Policy Optimization for Multi-Turn Medical Dialogue

Authors: Ruike Cao, Shaojie Bai, Fugen Yao, Liang Dong, Jian Xu, Li Xiao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.02216
Pdf link: https://arxiv.org/pdf/2603.02216
Abstract Effective information seeking in multi-turn medical dialogues is critical for accurate diagnosis, especially when dealing with incomplete information. Aligning Large Language Models (LLMs) for these interactive scenarios is challenging due to the uncertainty inherent in user-agent interactions, which we formulate as a Hierarchical Markov Decision Process (H-MDP). While conventional Reinforcement Learning (RL) methods like Group Relative Policy Optimization (GRPO) struggle with long-horizon credit assignment and Proximal Policy Optimization (PPO) suffers from unstable value estimation in this context, we propose a novel uncertainty-aware Adaptive Tree Policy Optimization (ATPO) algorithm. Our method adaptively allocates the rollout budget to states with high uncertainty, quantified by a composite metric of Bellman error and action-value variance. This strategy enables more accurate value estimation, while fostering more efficient and diverse exploration. To mitigate the high computational cost of tree-based RL, we introduce two key optimizations: an uncertainty-guided pruning mechanism to minimize the number of rollouts, and an asynchronous search architecture that leverages KV cache reuse to maximize inference throughput. Extensive experiments on three public medical dialogue benchmarks demonstrate that our algorithm significantly outperforms several strong baselines, culminating in Qwen3-8B model surpassing the much larger GPT-4o ($+0.92\%$ accuracy).
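The budget-allocation idea in the abstract above can be illustrated with a minimal sketch. It assumes the composite uncertainty is a weighted sum of the absolute Bellman error and the action-value variance, and that rollouts are split in proportion to that score; the weighting `alpha`, the function names, and the proportional rule are illustrative assumptions, not the paper's actual formulation.

```python
from statistics import pvariance


def uncertainty(bellman_error, q_values, alpha=0.5):
    """Hypothetical composite score: weighted |Bellman error| plus action-value variance."""
    return alpha * abs(bellman_error) + (1.0 - alpha) * pvariance(q_values)


def allocate_rollouts(scores, budget):
    """Split an integer rollout budget across states in proportion to their scores."""
    total = sum(scores)
    if total == 0:
        # No uncertainty signal at all: fall back to a uniform split.
        scores = [1.0] * len(scores)
        total = float(len(scores))
    raw = [budget * s / total for s in scores]
    alloc = [int(r) for r in raw]
    # Give the leftover rollouts to the states with the largest fractional parts.
    leftover = budget - sum(alloc)
    order = sorted(range(len(raw)), key=lambda i: raw[i] - alloc[i], reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


# Two dialogue states: one confident, one uncertain.
scores = [
    uncertainty(0.1, [1.0, 1.0, 1.0]),
    uncertainty(0.9, [0.0, 2.0, 4.0]),
]
print(allocate_rollouts(scores, budget=8))
```

In this toy case almost the whole budget goes to the second state, which is the behaviour the abstract describes: more rollouts where value estimates are least reliable.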

When Scaling Fails: Mitigating Audio Perception Decay of LALMs via Multi-Step Perception-Aware Reasoning

Authors: Ruixiang Mao, Xiangnan Ma, Dan Chen, Ziming Zhu, Yuan Ge, Aokai Hao, Haishu Zhao, Yifu Huo, Qing Yang, Kaiyan Chang, Xiaoqian Liu, Chenglong Wang, Qiaozhi He, Tong Xiao, Jingbo Zhu
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2603.02266
Pdf link: https://arxiv.org/pdf/2603.02266
Abstract Test-Time Scaling has shown notable efficacy in addressing complex problems through scaling inference compute. However, within Large Audio-Language Models (LALMs), an unintuitive phenomenon exists: post-training models for structured reasoning trajectories results in marginal or even negative gains compared to post-training for direct answering. To investigate it, we introduce CAFE, an evaluation framework designed to precisely quantify audio reasoning errors. Evaluation results reveal LALMs struggle with perception during reasoning and encounter a critical bottleneck: reasoning performance suffers from audio perception decay as reasoning length extends. To address it, we propose MPAR$^2$, a paradigm that encourages dynamic perceptual reasoning and decomposes complex questions into perception-rich sub-problems. Leveraging reinforcement learning, MPAR$^2$ improves perception performance on CAFE from 31.74% to 63.51% and effectively mitigates perception decay, concurrently enhancing reasoning capabilities to achieve a significant 74.59% accuracy on the MMAU benchmark. Further analysis demonstrates that MPAR$^2$ reinforces LALMs to attend to audio input and dynamically adapts reasoning budget to match task complexity.

COOL-MC: Verifying and Explaining RL Policies for Platelet Inventory Management

Authors: Dennis Gross
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.02396
Pdf link: https://arxiv.org/pdf/2603.02396
Abstract Platelets expire within five days. Blood banks face uncertain daily demand and must balance ordering decisions between costly wastage from overstocking and life-threatening shortages from understocking. Reinforcement learning (RL) can learn effective ordering policies for this Markov decision process (MDP), but the resulting neural policies remain black boxes, hindering trust and adoption in safety-critical domains. We apply COOL-MC, a tool that combines RL with probabilistic model checking and explainable RL, to verify and explain a trained policy for the MDP on platelet inventory management inspired by Haijema et al. By constructing a policy-induced discrete-time Markov chain (which includes only the reachable states under the trained policy to reduce memory usage), we verify PCTL properties and provide feature-level explanations. Results show that the trained policy achieves a 2.9% stockout probability and a 1.1% inventory-full (potential wastage) probability within a 200-step horizon, primarily attends to the age distribution of inventory rather than other features such as day of week or pending orders. Action reachability analysis reveals that the policy employs a diverse replenishment strategy, with most order quantities reached quickly, while several are never selected. Counterfactual analysis shows that replacing medium-large orders with smaller ones leaves both safety probabilities nearly unchanged, indicating that these orders are placed in well-buffered inventory states. This first formal verification and explanation of an RL platelet inventory management policy demonstrates COOL-MC's value for transparent, auditable decision-making in safety-critical healthcare supply chain domains.
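The kind of bounded-horizon property mentioned in the abstract above (for example, the probability of a stockout within 200 steps) can be sketched on a toy policy-induced Markov chain. The three-state chain, its probabilities, and the state names below are invented for illustration and have nothing to do with the paper's model; a probabilistic model checker would evaluate a PCTL query of roughly the form `P=? [ F<=k "stockout" ]` by the same backward recursion.

```python
def bounded_reachability(transitions, start, target, horizon):
    """P(reach a state in `target` within `horizon` steps) for a DTMC.

    `transitions` maps each state to a dict {next_state: probability}.
    """
    # prob[s]: probability of hitting the target from s with the steps remaining.
    prob = {s: (1.0 if s in target else 0.0) for s in transitions}
    for _ in range(horizon):
        prob = {
            s: 1.0 if s in target
            else sum(p * prob[t] for t, p in transitions[s].items())
            for s in transitions
        }
    return prob[start]


# Hypothetical chain induced by some fixed ordering policy.
chain = {
    "ok": {"ok": 0.9, "low": 0.1},
    "low": {"ok": 0.5, "low": 0.3, "stockout": 0.2},
    "stockout": {"stockout": 1.0},
}
print(bounded_reachability(chain, "ok", {"stockout"}, horizon=200))
```

Restricting `transitions` to the states reachable under the trained policy, as the abstract describes, keeps this table small even when the full MDP is large.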

TraceGuard: Process-Guided Firewall against Reasoning Backdoors in Large Language Models

Authors: Zhen Guo, Shanghao Shi, Hao Li, Shamim Yazdani, Ning Zhang, Reza Tourani
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2603.02436
Pdf link: https://arxiv.org/pdf/2603.02436
Abstract The deployment of Large Reasoning Models (LRMs) in high-stakes decision-making pipelines has introduced a novel and opaque attack surface: reasoning backdoors. In these attacks, the model's intermediate Chain-of-Thought (CoT) is manipulated to provide a linguistically plausible but logically fallacious justification for a malicious conclusion. While frontier models exhibit an intrinsic capacity to detect these fractures, compact, deployable models suffer from a fundamental verification gap, relying on fragile lexical heuristics that are easily bypassed by motivated adversaries. To bridge this gap, we propose TraceGuard, a process-guided security framework that transforms small-scale models into robust reasoning firewalls. Our approach treats the reasoning trace as an untrusted payload and establishes a defense-in-depth strategy through three synergistic phases: (1) Automated Forensic Synthesis, which generates contrastive reasoning pairs to isolate the specific logical point of fracture; (2) Step-Aware Supervised Fine-Tuning (SSFT), to instill a structural verification grammar; and (3) Verifier-Guided Reinforcement Learning (VGRL), utilizing Group Relative Policy Optimization. We identify and mitigate a critical failure mode of baseline alignment - lexical overfitting - whereby verifiers memorize adversarial triggers rather than auditing logical integrity. Our empirical evaluation demonstrates that TraceGuard acts as a security force multiplier: a 4B-parameter verifier achieves forensic precision on unseen attacks - including latent backdoors and post-hoc rationalizations - that rivals architectures two orders of magnitude larger. We further demonstrate robustness against adaptive adversaries in a grey-box setting, establishing TraceGuard as a viable, low-latency security primitive for the Trusted Computing Base.

Safe Whole-Body Loco-Manipulation via Combined Model and Learning-based Control

Authors: Alexander Schperberg, Yeping Wang, Stefano Di Cairano
Subjects: Robotics (cs.RO); Human-Computer Interaction (cs.HC); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2603.02443
Pdf link: https://arxiv.org/pdf/2603.02443
Abstract Simultaneous locomotion and manipulation enables robots to interact with their environment beyond the constraints of a fixed base. However, coordinating legged locomotion with arm manipulation, while considering safety and compliance during contact interaction remains challenging. To this end, we propose a whole-body controller that combines a model-based admittance control for the manipulator arm with a Reinforcement Learning (RL) policy for legged locomotion. The admittance controller maps external wrenches--such as those applied by a human during physical interaction--into desired end-effector velocities, allowing for compliant behavior. The velocities are tracked jointly by the arm and leg controllers, enabling a unified 6-DoF force response. The model-based design permits accurate force control and safety guarantees via a Reference Governor (RG), while robustness is further improved by a Kalman filter enhanced with neural networks for reliable base velocity estimation. We validate our approach in both simulation and hardware using the Unitree Go2 quadruped robot with a 6-DoF arm and wrist-mounted 6-DoF Force/Torque sensor. Results demonstrate accurate tracking of interaction-driven velocities, compliant behavior, and safe, reliable performance in dynamic settings.
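The wrench-to-velocity mapping described in the abstract above can be sketched with a textbook admittance law. The abstract does not give the controller's dynamics, so the per-axis mass-damper model M dv/dt + D v = F, the explicit-Euler discretisation, and all gains below are assumptions chosen for illustration.

```python
def admittance_step(velocity, wrench, mass, damping, dt):
    """One explicit-Euler step of M*dv/dt + D*v = F, applied independently per axis."""
    return [
        v + dt * (f - d * v) / m
        for v, f, m, d in zip(velocity, wrench, mass, damping)
    ]


# Hypothetical 6-DoF setup: a constant 10 N push along x, nothing on the other axes.
velocity = [0.0] * 6
wrench = [10.0, 0.0, 0.0, 0.0, 0.0, 0.0]
mass = [2.0] * 6      # virtual inertia per axis (assumed)
damping = [20.0] * 6  # virtual damping per axis (assumed)

for _ in range(5000):
    velocity = admittance_step(velocity, wrench, mass, damping, dt=0.001)

# The commanded velocity settles near F / D on the pushed axis.
print(velocity[0])
```

Lower damping makes the end effector yield more readily to the same push; in the paper the resulting velocity command is then tracked jointly by the arm and leg controllers.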

RIS-Enabled Wireless Channel Equalization: Adaptive RIS Equalizer and Deep Reinforcement Learning

Authors: Gal Ben-Itzhak, Ender Ayanoglu
Subjects: Information Theory (cs.IT); Emerging Technologies (cs.ET); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2603.02489
Pdf link: https://arxiv.org/pdf/2603.02489
Abstract Reconfigurable Intelligent Surfaces (RISs) offer a promising means of reshaping the wireless propagation environment, yet practical methods for configuring large passive arrays to achieve reliable signal equalization remain limited. Equalization is essential in wideband links to counteract multipath-induced pulse distortion that otherwise degrades symbol recovery. This work investigates RIS-assisted pulse response equalization and signal boosting using both classical adaptive filtering and model-free deep reinforcement learning (DRL). We develop a steepest descent (SD) method that exploits cascaded BS-RIS-UE channel information to configure RIS coefficients for multipath mitigation and SNR enhancement, and we show that the tradeoffs between SD and DRL primarily arise from the extensive channel estimation required for accurate equalization with passive RIS hardware. Unlike traditional adaptive filtering, which updates delayed filter coefficients after signal reception, our approach uses the RIS positioned within the cascaded channel to perform equalization without delay elements, prior to reception at the UE. In this framework, the channel is estimated before equalization, forming the basis of what we term adaptive RIS equalization (ARISE). To overcome the reliance on channel estimation required for ARISE, we explore several DRL algorithms -- DDPG, TD3, and SAC -- that optimize RIS coefficients directly from the received pulse response without explicit channel estimation. Through extensive simulations across diverse channel conditions and RIS sizes, we show that SAC achieves fast, stable convergence and equalization performance comparable to ARISE while offering significantly lower implementation complexity. These results highlight the potential of DRL as a practical and scalable solution for real-time RIS control in future wireless systems.

Wasserstein Proximal Policy Gradient

Authors: Zhaoyu Zhu, Shuhan Zhang, Rui Gao, Shuang Li
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.02576
Pdf link: https://arxiv.org/pdf/2603.02576
Abstract We study policy gradient methods for continuous-action, entropy-regularized reinforcement learning through the lens of Wasserstein geometry. Starting from a Wasserstein proximal update, we derive Wasserstein Proximal Policy Gradient (WPPG) via an operator-splitting scheme that alternates an optimal transport update with a heat step implemented by Gaussian convolution. This formulation avoids evaluating the policy's log density or its gradient, making the method directly applicable to expressive implicit stochastic policies specified as pushforward maps. We establish a global linear convergence rate for WPPG, covering both exact policy evaluation and actor-critic implementations with controlled approximation error. Empirically, WPPG is simple to implement and attains competitive performance on standard continuous-control benchmarks.

Towards Parameter-Free Temporal Difference Learning

Authors: Yunxiang Li, Mark Schmidt, Reza Babanezhad, Sharan Vaswani
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.02577
Pdf link: https://arxiv.org/pdf/2603.02577
Abstract Temporal difference (TD) learning is a fundamental algorithm for estimating value functions in reinforcement learning. Recent finite-time analyses of TD with linear function approximation quantify its theoretical convergence rate. However, they often require setting the algorithm parameters using problem-dependent quantities that are difficult to estimate in practice -- such as the minimum eigenvalue of the feature covariance ($\omega$) or the mixing time of the underlying Markov chain ($\tau_{\text{mix}}$). In addition, some analyses rely on nonstandard and impractical modifications, exacerbating the gap between theory and practice. To address these limitations, we use an exponential step-size schedule with the standard TD(0) algorithm. We analyze the resulting method under two sampling regimes: independent and identically distributed (i.i.d.) sampling from the stationary distribution, and the more practical Markovian sampling along a single trajectory. In the i.i.d. setting, the proposed algorithm does not require knowledge of problem-dependent quantities such as $\omega$, and attains the optimal bias-variance trade-off for the last iterate. In the Markovian setting, we propose a regularized TD(0) algorithm with an exponential step-size schedule. The resulting algorithm achieves a comparable convergence rate to prior works, without requiring projections, iterate averaging, or knowledge of $\tau_{\text{mix}}$ or $\omega$.
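To make the i.i.d. setting in the abstract above concrete, here is a minimal sketch of TD(0) with linear function approximation and an exponentially decaying step size. The schedule alpha_t = alpha_0 (beta / T)^(t / T) is one common form of exponential step size and is an assumption here, as are the constants and the two-state toy chain; the paper's exact schedule and its regularized Markovian variant are not reproduced.

```python
import random


def exp_step_size(t, horizon, alpha0=0.5, beta=1.0):
    """Assumed schedule: decays geometrically from alpha0 towards alpha0 * beta / horizon."""
    return alpha0 * (beta / horizon) ** (t / horizon)


def td0_linear(features, rewards, next_state, gamma, horizon, seed=0):
    """TD(0) with linear values v(s) = theta . features[s] and i.i.d. uniform state samples."""
    rng = random.Random(seed)
    theta = [0.0] * len(features[0])
    for t in range(horizon):
        s = rng.randrange(len(features))
        s_next = next_state[s]  # deterministic toy dynamics
        v = sum(w * x for w, x in zip(theta, features[s]))
        v_next = sum(w * x for w, x in zip(theta, features[s_next]))
        td_error = rewards[s] + gamma * v_next - v
        alpha = exp_step_size(t, horizon)
        theta = [w + alpha * td_error * x for w, x in zip(theta, features[s])]
    return theta


# Toy two-state cycle 0 -> 1 -> 0 with one-hot features; its stationary
# distribution is uniform, so uniform sampling matches the i.i.d. regime.
theta = td0_linear(
    features=[[1.0, 0.0], [0.0, 1.0]],
    rewards=[1.0, 0.0],
    next_state=[1, 0],
    gamma=0.5,
    horizon=2000,
)
print(theta)  # true values are V(0) = 4/3 and V(1) = 2/3
```

Note that the schedule depends only on the horizon and two constants, not on $\omega$ or $\tau_{\text{mix}}$, which is the practical point the abstract makes.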

Heterogeneous Agent Collaborative Reinforcement Learning

Authors: Zhixia Zhang, Zixuan Huang, Xin Xia, Deqing Wang, Fuzhen Zhuang, Shuai Ma, Ning Ding, Yaodong Yang, Jianxin Li, Yikun Ban
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.02604
Pdf link: https://arxiv.org/pdf/2603.02604
Abstract We introduce Heterogeneous Agent Collaborative Reinforcement Learning (HACRL), a new learning paradigm that addresses the inefficiencies of isolated on-policy optimization. HACRL enables collaborative optimization with independent execution: heterogeneous agents share verified rollouts during training to mutually improve, while operating independently at inference time. Unlike LLM-based multi-agent reinforcement learning (MARL), HACRL does not require coordinated deployment, and unlike on-/off-policy distillation, it enables bidirectional mutual learning among heterogeneous agents rather than one-directional teacher-to-student transfer. Building on this paradigm, we propose HACPO, a collaborative RL algorithm that enables principled rollout sharing to maximize sample utilization and cross-agent knowledge transfer. To mitigate capability discrepancies and policy distribution shifts, HACPO introduces four tailored mechanisms with theoretical guarantees on unbiased advantage estimation and optimization correctness. Extensive experiments across diverse heterogeneous model combinations and reasoning benchmarks show that HACPO consistently improves all participating agents, outperforming GSPO by an average of 3.3% while using only half the rollout cost.

Real-Time Generative Policy via Langevin-Guided Flow Matching for Autonomous Driving

Authors: Tianze Zhu, Yinuo Wang, Wenjun Zou, Tianyi Zhang, Likun Wang, Letian Tao, Feihong Zhang, Yao Lyu, Shengbo Eben Li
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.02613
Pdf link: https://arxiv.org/pdf/2603.02613
Abstract Reinforcement learning (RL) is a fundamental methodology in autonomous driving systems, where generative policies exhibit considerable potential by leveraging their ability to model complex distributions to enhance exploration. However, their inherent high inference latency severely impedes their deployment in real-time decision-making and control. To address this issue, we propose diffusion actor-critic with entropy regulator via flow matching (DACER-F) by introducing flow matching into online RL, enabling the generation of competitive actions in a single inference step. By leveraging Langevin dynamics and gradients of the Q-function, DACER-F dynamically optimizes actions from experience replay toward a target distribution that balances high Q-value information with exploratory behavior. The flow policy is then trained to efficiently learn a mapping from a simple prior distribution to this dynamic target. In complex multi-lane and intersection simulations, DACER-F outperforms baselines diffusion actor-critic with entropy regulator (DACER) and distributional soft actor-critic (DSAC), while maintaining an ultra-low inference latency. DACER-F further demonstrates its scalability on standard RL benchmark DeepMind Control Suite (DMC), achieving a score of 775.8 in the humanoid-stand task and surpassing prior methods. Collectively, these results establish DACER-F as a high-performance and computationally efficient RL algorithm.

Post Hoc Extraction of Pareto Fronts for Continuous Control

Authors: Raghav Thakar, Gaurav Dixit, Kagan Tumer
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.02628
Pdf link: https://arxiv.org/pdf/2603.02628
Abstract Agents in the real world must often balance multiple objectives, such as speed, stability, and energy efficiency in continuous control. To account for changing conditions and preferences, an agent must ideally learn a Pareto frontier of policies representing multiple optimal trade-offs. Recent advances in multi-policy multi-objective reinforcement learning (MORL) enable learning a Pareto front directly, but require full multi-objective consideration from the start of training. In practice, multi-objective preferences often arise after a policy has already been trained on a single specialised objective. Existing MORL methods cannot leverage these pre-trained `specialists' to learn Pareto fronts and avoid incurring the sample costs of retraining. We introduce Mixed Advantage Pareto Extraction (MAPEX), an offline MORL method that constructs a frontier of policies by reusing pre-trained specialist policies, critics, and replay buffers. MAPEX combines evaluations from specialist critics into a mixed advantage signal, and weights a behaviour cloning loss with it to train new policies that balance multiple objectives. MAPEX's post hoc Pareto front extraction preserves the simplicity of single-objective off-policy RL, and avoids retrofitting these algorithms into complex MORL frameworks. We formally describe the MAPEX procedure and evaluate MAPEX on five multi-objective MuJoCo environments. Given the same starting policies, MAPEX produces comparable fronts at $0.001\%$ the sample cost of established baselines.

StitchCUDA: An Automated Multi-Agents End-to-End GPU Programing Framework with Rubric-based Agentic Reinforcement Learning

Authors: Shiyang Li, Zijian Zhang, Winson Chen, Yuebo Luo, Mingyi Hong, Caiwen Ding
Subjects: Multiagent Systems (cs.MA); Computation and Language (cs.CL); Programming Languages (cs.PL)
Arxiv link: https://arxiv.org/abs/2603.02637
Pdf link: https://arxiv.org/pdf/2603.02637
Abstract Modern machine learning (ML) workloads increasingly rely on GPUs, yet achieving high end-to-end performance remains challenging due to dependencies on both GPU kernel efficiency and host-side settings. Although LLM-based methods show promise on automated GPU kernel generation, prior works mainly focus on single-kernel optimization and do not extend to end-to-end programs, hindering practical deployment. To address the challenge, in this work, we propose StitchCUDA, a multi-agent framework for end-to-end GPU program generation, with three specialized agents: a Planner to orchestrate whole system design, a Coder dedicated to implementing it step-by-step, and a Verifier for correctness check and performance profiling using Nsys/NCU. To fundamentally improve the Coder's ability in end-to-end GPU programming, StitchCUDA integrates rubric-based agentic reinforcement learning over two atomic skills, task-to-code generation and feedback-driven code optimization, with combined rubric reward and rule-based reward from real executions. Therefore, the Coder learns how to implement advanced CUDA programming techniques (e.g., custom kernel fusion, cublas epilogue), and we also effectively prevent Coder's reward hacking (e.g., just copy PyTorch code or hardcoding output) during benchmarking. Experiments on KernelBench show that StitchCUDA achieves nearly 100% success rate on end-to-end GPU programming tasks, with 1.72x better speedup over the multi-agent baseline and 2.73x than the RL model baselines.

Improving Diffusion Planners by Self-Supervised Action Gating with Energies

Authors: Yuan Lu, Dongqi Han, Yansen Wang, Dongsheng Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.02650
Pdf link: https://arxiv.org/pdf/2603.02650
Abstract Diffusion planners are a strong approach for offline reinforcement learning, but they can fail when value-guided selection favours trajectories that score well yet are locally inconsistent with the environment dynamics, resulting in brittle execution. We propose Self-supervised Action Gating with Energies (SAGE), an inference-time re-ranking method that penalises dynamically inconsistent plans using a latent consistency signal. SAGE trains a Joint-Embedding Predictive Architecture (JEPA) encoder on offline state sequences and an action-conditioned latent predictor for short horizon transitions. At test time, SAGE assigns each sampled candidate an energy given by its latent prediction error and combines this feasibility score with value estimates to select actions. SAGE can integrate into existing diffusion planning pipelines that can sample trajectories and select actions via value scoring; it requires no environment rollouts and no policy re-training. Across locomotion, navigation, and manipulation benchmarks, SAGE improves the performance and robustness of diffusion planners.
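The inference-time re-ranking described in the abstract above can be sketched as follows. The abstract says each candidate's energy is its latent prediction error and that this is combined with value estimates; the mean-squared-error energy, the linear rule `value - penalty * energy`, and all names below are assumptions made for illustration, with the encoder and predictor outputs replaced by hand-written vectors.

```python
def latent_energy(predicted, encoded):
    """Mean squared error between a predicted latent and the encoded latent."""
    return sum((p - e) ** 2 for p, e in zip(predicted, encoded)) / len(predicted)


def select_candidate(values, energies, penalty=1.0):
    """Index of the candidate maximising value - penalty * energy (assumed rule)."""
    scores = [v - penalty * e for v, e in zip(values, energies)]
    return max(range(len(scores)), key=scores.__getitem__)


# Two sampled plans: the first scores higher on value but its predicted
# latent disagrees strongly with the encoded one (dynamically inconsistent).
values = [1.0, 0.9]
energies = [
    latent_energy([0.0, 0.0], [1.0, 1.0]),
    latent_energy([0.1, 0.0], [0.0, 0.0]),
]
print(select_candidate(values, energies))               # picks the consistent plan
print(select_candidate(values, energies, penalty=0.0))  # pure value ranking
```

Setting `penalty` to zero recovers plain value-guided selection, which makes the gating easy to ablate without retraining the planner.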

Generalized Per-Agent Advantage Estimation for Multi-Agent Policy Optimization

Authors: Seongmin Kim, Giseung Park, Woojun Kim, Jiwon Jeon, Seungyeol Han, Youngchul Sung
Subjects: Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2603.02654
Pdf link: https://arxiv.org/pdf/2603.02654
Abstract In this paper, we propose a novel framework for multi-agent reinforcement learning that enhances sample efficiency and coordination through accurate per-agent advantage estimation. The core of our approach is Generalized Per-Agent Advantage Estimator (GPAE), which employs a per-agent value iteration operator to compute precise per-agent advantages. This operator enables stable off-policy learning by indirectly estimating values via action probabilities, eliminating the need for direct Q-function estimation. To further refine estimation, we introduce a double-truncated importance sampling ratio scheme. This scheme improves credit assignment for off-policy trajectories by balancing sensitivity to the agent's own policy changes with robustness to non-stationarity from other agents. Experiments on benchmarks demonstrate that our approach outperforms existing approaches, excelling in coordination and sample efficiency for complex scenarios.

Watch Your Step: Learning Semantically-Guided Locomotion in Cluttered Environment

Authors: Denan Liang, Yuan Zhu, Ruimeng Liu, Thien-Minh Nguyen, Shenghai Yuan, Lihua Xie
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.02657
Pdf link: https://arxiv.org/pdf/2603.02657
Abstract Although legged robots demonstrate impressive mobility on rough terrain, using them safely in cluttered environments remains a challenge. A key issue is their inability to avoid stepping on low-lying objects, such as high-cost small devices or cables on flat ground. This limitation arises from a disconnection between high-level semantic understanding and low-level control, combined with errors in elevation maps during real-world operation. To address this, we introduce SemLoco, a Reinforcement Learning (RL) framework designed to avoid obstacles precisely in densely cluttered environments. SemLoco uses a two-stage RL approach that combines both soft and hard constraints and performs pixel-wise foothold safety inference, enabling more accurate foot placement. Additionally, SemLoco integrates a semantic map to assign traversability costs rather than relying solely on geometric data. SemLoco significantly reduces collisions and improves safety around sensitive objects, enabling reliable navigation in situations where traditional controllers would likely cause damage. Experimental results further demonstrate that SemLoco can be effectively applied to more complex, unstructured real-world environments.
中文摘要 尽管有腿的机器人在崎岖地形上展现出令人印象深刻的机动性，但在拥挤环境中安全使用仍然具有挑战性。一个关键问题是它们无法避免踩到低洼物体，比如高价小型设备或平地上的电缆。这一局限源于高层语义理解与低层控制之间的脱节，以及实际操作中高程图的错误。为此，我们引入了SemLoco，一种强化学习（RL）框架，旨在精准避开密集环境中的障碍物。SemLoco采用两阶段强化学习方法，结合软约束和硬约束，并进行像素级脚点安全推断，实现更精准的脚部放置。此外，SemLoco 集成了语义映射来分配可遍历性成本，而非仅依赖几何数据。SemLoco显著减少碰撞，提升敏感物体周围的安全，使得在传统控制器可能造成损害的情境下实现可靠导航。实验结果进一步表明，SemLoco可以有效地应用于更复杂、无结构的现实环境。

VisionCreator: A Native Visual-Generation Agentic Model with Understanding, Thinking, Planning and Creation

VisionCreator：一个原生视觉生成智能模型，具备理解、思考、规划和创造能力

Authors: Jinxiang Lai, Zexin Lu, Jiajun He, Rongwei Quan, Wenzhe Zhao, Qinyu Yang, Qi Chen, Qin Lin, Chuyue Li, Tao Gao, Yuhao Shan, Shuai Shao, Song Guo, Qinglin Lu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.02681
Pdf link: https://arxiv.org/pdf/2603.02681
Abstract Visual content creation tasks demand a nuanced understanding of design conventions and creative workflows, capabilities that are challenging for general models, while workflow-based agents lack specialized knowledge for autonomous creative planning. To overcome these challenges, we propose VisionCreator, a native visual-generation agentic model that unifies Understanding, Thinking, Planning, and Creation (UTPC) capabilities within an end-to-end learnable framework. Our work introduces four key contributions: (i) VisGenData-4k and its construction methodology using metacognition-based VisionAgent to generate high-quality creation trajectories with explicit UTPC structures; (ii) The VisionCreator agentic model, optimized through Progressive Specialization Training (PST) and Virtual Reinforcement Learning (VRL) within a high-fidelity simulated environment, enabling stable and efficient acquisition of UTPC capabilities for complex creation tasks; (iii) VisGenBench, a comprehensive benchmark featuring 1.2k test samples across diverse scenarios for standardized evaluation of multi-step visual creation capabilities; (iv) Remarkably, our VisionCreator-8B/32B models demonstrate superior performance over larger closed-source models across multiple evaluation dimensions. Overall, this work provides a foundation for future research in visual-generation agentic systems.
中文摘要 视觉内容创作任务需要对设计惯例和创意工作流程有细致理解——这些能力对通用模型来说更具挑战性，而基于工作流的代理缺乏自主创意规划的专业知识。为克服这些挑战，我们提出了VisionCreator，一种原生的视觉生成智能模型，将理解、思考、规划和创造（UTPC）能力统一在端到端可学习的框架中。我们的工作提出了四项关键贡献：（i） VisGenData-4k 及其基于元认知的 VisionAgent 构建方法，生成带有显式 UTPC 结构的高质量创作轨迹;（ii） VisionCreator 智能模型，通过渐进专业化训练（PST）和虚拟强化学习（VRL）在高保真模拟环境中优化，实现对复杂创建任务的 UTPC 能力的稳定高效获取;（iii） VisGenBench，一个涵盖多场景的1.2k测试样本的综合基准测试，用于标准化评估多步可视化创建能力;（iv）令人惊讶的是，我们的VisionCreator-8B/32B模型在多个评估维度上优于大型闭源模型。总体而言，这项工作为未来视觉生成智能系统研究奠定了基础。

Graph-GRPO: Stabilizing Multi-Agent Topology Learning via Group Relative Policy Optimization

Graph-GRPO：通过群相对策略优化稳定多智能体拓扑学习

Authors: Yueyang Cang, Xiaoteng Zhang, Erlu Zhao, Zehua Ji, Yuhang Liu, Yuchen He, Zhiyuan Ning, Chen Yijun, Wenge Que, Li Shi
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.02701
Pdf link: https://arxiv.org/pdf/2603.02701
Abstract Optimizing communication topology is fundamental to the efficiency and effectiveness of Large Language Model (LLM)-based Multi-Agent Systems (MAS). While recent approaches utilize reinforcement learning to dynamically construct task-specific graphs, they typically rely on single-sample policy gradients with absolute rewards (e.g., binary correctness). This paradigm suffers from severe gradient variance and the credit assignment problem: simple queries yield non-informative positive rewards for suboptimal structures, while difficult queries often result in failures that provide no learning signal. To address these challenges, we propose Graph-GRPO, a novel topology optimization framework that integrates Group Relative Policy Optimization. Instead of evaluating a single topology in isolation, Graph-GRPO samples a group of diverse communication graphs for each query and computes the advantage of specific edges based on their relative performance within the group. By normalizing rewards across the sampled group, our method effectively mitigates the noise derived from task difficulty variance and enables fine-grained credit assignment. Extensive experiments on reasoning and code generation benchmarks demonstrate that Graph-GRPO significantly outperforms state-of-the-art baselines, achieving superior training stability and identifying critical communication pathways previously obscured by reward noise.
中文摘要 优化通信拓扑是基于大型语言模型（LLM）的多智能体系统（MAS）效率和效能的基础。虽然近期方法利用强化学习动态构建任务特定图，但通常依赖单样本策略梯度并获得绝对奖励（例如二元正确性）。该范式存在严重的梯度方差和信用分配问题：简单查询对次优结构产生无益的正向奖励，而困难查询则常导致无学习信号的失败。为应对这些挑战，我们提出了Graph-GRPO，一种整合群相对策略优化的新型拓扑优化框架。Graph-GRPO不是孤立地评估单一拓扑，而是为每个查询采样一组不同的通信图，并根据特定边在该组内的相对表现计算其优势。通过对抽样组的奖励进行规范化，我们的方法有效减少了任务难度方差产生的噪声，并实现了细粒度的学分分配。大量推理和代码生成基准测试的实验表明，Graph-GRPO显著优于最先进的基线，实现了卓越的训练稳定性，并识别出此前被奖励噪声掩盖的关键通信路径。
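
The group-relative normalization described in the Graph-GRPO abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy graphs, and in particular the edge-level rule (averaging the advantages of the sampled graphs that contain an edge) are assumptions made here, since the abstract does not state the exact formula.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def edge_advantages(graphs, advantages):
    """Hypothetical edge credit: mean advantage of graphs containing the edge."""
    totals, counts = {}, {}
    for edges, adv in zip(graphs, advantages):
        for edge in edges:
            totals[edge] = totals.get(edge, 0.0) + adv
            counts[edge] = counts.get(edge, 0) + 1
    return {edge: totals[edge] / counts[edge] for edge in totals}


# Four communication graphs sampled for the same query (toy example).
graphs = [
    {("a", "b"), ("b", "c")},
    {("a", "b")},
    {("a", "c")},
    {("b", "c"), ("a", "c")},
]
rewards = [1.0, 1.0, 0.0, 0.0]  # binary correctness per sampled graph

adv = group_relative_advantages(rewards)
edge_adv = edge_advantages(graphs, adv)
```

In this toy group, the edge that appears only in successful graphs receives positive credit and the edge that appears only in failed graphs receives negative credit. A group whose rewards are all identical yields zero advantages, which is the case where an easy or impossible query carries no learning signal.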

From "What" to "How": Constrained Reasoning for Autoregressive Image Generation

从“什么”到“如何”：自回归图像生成的受限推理

Authors: Ruxue Yan, Xubo Liu, Wenya Guo, Zhengkun Zhang, Ying Zhang, Xiaojie Yuan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2603.02712
Pdf link: https://arxiv.org/pdf/2603.02712
Abstract Autoregressive image generation has seen recent improvements with the introduction of chain-of-thought and reinforcement learning. However, current methods merely specify "What" details to depict by rewriting the input prompt, yet fundamentally fail to reason about "How" to structure the overall image. This inherent limitation gives rise to persistent issues, such as spatial ambiguity directly causing unrealistic object overlaps. To bridge this gap, we propose CoR-Painter, a novel framework that pioneers a "How-to-What" paradigm by introducing Constrained Reasoning to guide the autoregressive generation. Specifically, it first deduces "How to draw" by deriving a set of visual constraints from the input prompt, which explicitly govern spatial relationships, key attributes, and compositional rules. These constraints steer the subsequent generation of a detailed description "What to draw", providing a structurally sound and coherent basis for accurate visual synthesis. Additionally, we introduce a Dual-Objective GRPO strategy that specifically optimizes the textual constrained reasoning and visual projection processes to ensure the coherence and quality of the entire generation pipeline. Extensive experiments on T2I-CompBench, GenEval, and WISE demonstrate that our method achieves state-of-the-art performance, with significant improvements in spatial metrics (e.g., +5.41% on T2I-CompBench).
中文摘要 自回归图像生成近年来得到了改进，得益于思维链和强化学习的引入。然而，当前的方法仅通过重写输入提示来指定“什么”细节来表示，却根本无法推理“如何”来构建整体图像。这一固有限制导致持续存在的问题，如空间模糊性直接导致物体不切实际重叠。为弥合这一差距，我们提出了CoR-Painter这一新框架，开创了“如何做什么”范式，引入了受限推理来引导自我回归生成。具体来说，它首先通过从输入提示中推导出一组视觉约束来推导“如何绘制”，这些约束明确支配空间关系、关键属性和构图规则。这些约束引导后续详细描述“画什么”的生成，为准确的视觉综合提供了结构健全且连贯的基础。此外，我们引入了双目标GRPO策略，专门优化文本约束推理和视觉投影过程，确保整个生成流程的连贯性和质量。在T2I-CompBench、GenEval和WISE上的大量实验表明，我们的方法实现了最先进的性能，空间指标显著提升（例如T2I-CompBench为+5.41%）。

Enhancing User Throughput in Multi-panel mmWave Radio Access Networks for Beam-based MU-MIMO Using a DRL Method

利用DRL方法提升多面板毫米波无线接入网中基于束流的MU-MIMO用户吞吐量

Authors: Ramin Hashemi, Vismika Ranasinghe, Teemu Veijalainen, Petteri Kela, Risto Wichman
Subjects: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.02745
Pdf link: https://arxiv.org/pdf/2603.02745
Abstract Millimeter-wave (mmWave) communication systems, particularly those leveraging multi-user multiple-input and multiple-output (MU-MIMO) with hybrid beamforming, face challenges in optimizing user throughput and minimizing latency due to the high complexity of dynamic beam selection and management. This paper introduces a deep reinforcement learning (DRL) approach for enhancing user throughput in multi-panel mmWave radio access networks in a practical network setup. Our DRL-based formulation utilizes an adaptive beam management strategy that models the interaction between the communication agent and its environment as a Markov decision process (MDP), optimizing beam selection based on real-time observations. The proposed framework exploits spatial domain (SD) characteristics by incorporating the cross-correlation between the beams in different antenna panels, the measured reference signal received power (RSRP), and the beam usage statistics to dynamically adjust beamforming decisions. As a result, the spectral efficiency is improved and end-to-end latency is reduced. The numerical results demonstrate an increase in throughput of up to 16% and a reduction in latency by a factor of 3-7x compared to the baseline (legacy beam management).
中文摘要 毫米波（mmWave）通信系统，尤其是采用多用户多输入多输出（MU-MIMO）混合波束成形的系统，由于动态波束选择和管理的高度复杂性，在优化用户吞吐量和最小化延迟方面面临挑战。本文介绍了一种深度强化学习（DRL）方法，用于在实际网络环境中提升多面板毫米波无线接入网络的用户吞吐量。我们的基于DRL的表述采用自适应波束管理策略，将通信代理与环境之间的相互作用建模为马尔可夫决策过程（MDP），基于实时观测优化波束选择。所提出的框架利用空间域（SD）特性，结合不同天线面板中波束之间的交叉相关、测量参考信号接收功率（RSRP）以及波束使用统计数据，动态调整波束形成决策。因此，频谱效率得以提升，端到端延迟也得以降低。数值结果显示吞吐量提升了多达16%，延迟比基线（传统波束管理）减少了3-7倍。

Next Embedding Prediction Makes World Models Stronger

下一个嵌入预测使世界模型更强大

Authors: George Bredis, Nikita Balagansky, Daniil Gavrilov, Ruslan Rakhimov
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.02765
Pdf link: https://arxiv.org/pdf/2603.02765
Abstract Capturing temporal dependencies is critical for model-based reinforcement learning (MBRL) in partially observable, high-dimensional domains. We introduce NE-Dreamer, a decoder-free MBRL agent that leverages a temporal transformer to predict next-step encoder embeddings from latent state sequences, directly optimizing temporal predictive alignment in representation space. This approach enables NE-Dreamer to learn coherent, predictive state representations without reconstruction losses or auxiliary supervision. On the DeepMind Control Suite, NE-Dreamer matches or exceeds the performance of DreamerV3 and leading decoder-free agents. On a challenging subset of DMLab tasks involving memory and spatial reasoning, NE-Dreamer achieves substantial gains. These results establish next-embedding prediction with temporal transformers as an effective, scalable framework for MBRL in complex, partially observable environments.
中文摘要 捕捉时间依赖对于部分可观测的高维领域中的基于模型强化学习（MBRL）至关重要。我们介绍了NE-Dreamer，一种无解码器的MBRL智能体，利用时间变换器预测潜态序列的下一步编码器嵌入，直接优化表示空间中的时间预测对齐。这种方法使 NE-Dreamer 能够在没有重建损失或辅助监督的情况下学习连贯的预测状态表示。在DeepMind控制套件中，NE-Dreamer的性能可与DreamerV3及领先的无解码代理程序媲美甚至超越。在涉及记忆和空间推理的DMLab任务中，NE-Dreamer取得了显著进步。这些结果确立了利用时间变换器作为复杂且部分可观测环境中MBRL有效且可扩展的框架的下一步嵌入预测。

VSearcher: Long-Horizon Multimodal Search Agent via Reinforcement Learning

VSearcher：通过强化学习实现的长视界多模态搜索代理

Authors: Ruiyang Zhang, Qianguo Sun, Chao Song, Yiyan Qi, Zhedong Zheng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.02795
Pdf link: https://arxiv.org/pdf/2603.02795
Abstract Large models are increasingly becoming autonomous agents that interact with real-world environments and use external tools to augment their static capabilities. However, most recent progress has focused on text-only large language models, which are limited to a single modality and therefore have narrower application scenarios. On the other hand, multimodal large models, while offering stronger perceptual capabilities, remain limited to static knowledge and lack the ability to access and leverage up-to-date web information. In this paper, we propose VSearcher, which uses reinforcement learning to turn a static multimodal model into a multimodal search agent capable of long-horizon, multi-turn tool use in real-world web environments, including text search, image search, and web browsing. Specifically, we introduce an Iterative Injection Data Synthesis pipeline to generate large-scale, complex multimodal QA questions, which are further filtered with comprehensive metrics to ensure high quality and sufficient difficulty. We then adopt an SFT-then-RL training pipeline to turn base multimodal models into agents capable of multi-turn tool calling in real-world web environments. In addition, we propose MM-SearchExam, a multimodal search benchmark dedicated to evaluating the search capabilities of multimodal search agents, which proves highly challenging for recent proprietary models. Extensive evaluations across multiple multimodal search benchmarks demonstrate the effectiveness of our method. VSearcher achieves superior performance compared to recent multimodal search agents and even surpasses several proprietary models on multimodal web search tasks.
中文摘要 大型模型正日益成为自主代理，能够与现实世界环境互动，并利用外部工具增强其静态能力。然而，最近的进展主要集中在纯文本大型语言模型上，这些模型仅限于单一模态，因此应用场景更为狭窄。另一方面，多模态大型模型虽然具备更强的感知能力，但仍受限于静态知识，缺乏访问和利用最新网络信息的能力。本文提出VSearcher理论，将静态多模态模型转变为多模态搜索代理，能够在现实世界网络环境中实现长视野、多回合工具的使用，包括文本搜索、图像搜索和网页浏览，通过强化学习实现。具体来说，我们引入了迭代注入数据综合流程，生成大规模复杂的多模态质量保证问题，并通过全面指标进一步筛选，确保高质量和足够难度。随后，我们采用先行技术（SFT）再强化学习（RL）的训练流水线，将基础多模态模型转变为能够在真实世界网络环境中进行多回合工具调用的代理。此外，我们还提出了一个多模态搜索基准 MM-SearchExam，专门评估多模态搜索代理的搜索能力，这对近期专有模型来说极具挑战性。对多个多模态搜索基准的广泛评估显示了我们方法的有效性。VSearcher 在多模态网络搜索任务中表现优于近期的多模态搜索代理，甚至超过了多个专有模型。

Learning Memory-Enhanced Improvement Heuristics for Flexible Job Shop Scheduling

学习记忆增强改进启发式方法，以实现灵活的工序排班

Authors: Jiaqi Wang, Zhiguang Cao, Peng Zhao, Rui Cao, Yubin Xiao, Yuan Jiang, You Zhou
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.02846
Pdf link: https://arxiv.org/pdf/2603.02846
Abstract The rise of smart manufacturing under Industry 4.0 introduces mass customization and dynamic production, demanding more advanced and flexible scheduling techniques. The flexible job-shop scheduling problem (FJSP) has attracted significant attention due to its complex constraints and strong alignment with real-world production scenarios. Current deep reinforcement learning (DRL)-based approaches to FJSP predominantly employ constructive methods. While effective, they often fall short of reaching (near-)optimal solutions. In contrast, improvement-based methods iteratively explore the neighborhood of initial solutions and are more effective in approaching optimality. However, the flexible machine allocation in FJSP poses significant challenges to the application of this framework, including accurate state representation, effective policy learning, and efficient search strategies. To address these challenges, this paper proposes MIStar, a Memory-enhanced Improvement Search framework with heterogeneous graph representation. It employs a novel heterogeneous disjunctive graph that explicitly models the operation sequences on machines to accurately represent scheduling solutions. Moreover, a memory-enhanced heterogeneous graph neural network (MHGNN) is designed for feature extraction, leveraging historical trajectories to enhance the decision-making capability of the policy network. Finally, a parallel greedy search strategy is adopted to explore the solution space, enabling superior solutions with fewer iterations. Extensive experiments on synthetic data and public benchmarks demonstrate that MIStar significantly outperforms both traditional handcrafted improvement heuristics and state-of-the-art DRL-based constructive methods.
中文摘要 工业4.0下智能制造的兴起带来了大规模定制和动态生产，要求更先进和灵活的调度技术。灵活工坊调度问题（FJSP）因其复杂的约束条件和与现实生产场景高度契合而备受关注。目前基于深度强化学习（DRL）的FJSP方法主要采用建设性方法。虽然有效，但往往无法达到（接近）最优的解决方案。相比之下，基于改进的方法通过迭代探索初始解的邻近，更有效地接近最优性。然而，FJSP中灵活的机器分配对该框架的应用带来了重大挑战，包括准确的状态表示、有效的策略学习和高效的搜索策略。为应对这些挑战，本文提出了一个具有异构图表示的内存增强改进搜索框架——MIStar。它采用了一种新颖的异构析取图，明确建模机器上的操作序列，以准确表示调度解。此外，设计了一个记忆增强异构图神经网络（MHGNN），用于特征提取，利用历史轨迹提升策略网络的决策能力。最后，采用并行贪婪搜索策略探索解空间，实现更少迭代的优质解。对合成数据和公开基准的广泛实验表明，MIStar的表现显著优于传统的手工改进启发式方法和最先进的基于DRL的建设性方法。

Rhythm: Learning Interactive Whole-Body Control for Dual Humanoids

节奏：学习双人生物的互动全身控制

Authors: Hongjin Chen, Wei Zhang, Pengfei Li, Shihao Ma, Ke Ma, Yujie Jin, Zijun Xu, Xiaohui Wang, Yupeng Zheng, Zining Wang, Jieru Zhao, Yilun Chen, Wenchao Ding
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.02856
Pdf link: https://arxiv.org/pdf/2603.02856
Abstract Realizing interactive whole-body control for multi-humanoid systems is critical for unlocking complex collaborative capabilities in shared environments. Although recent advancements have significantly enhanced the agility of individual robots, bridging the gap to physically coupled multi-humanoid interaction remains challenging, primarily due to severe kinematic mismatches and complex contact dynamics. To address this, we introduce Rhythm, the first unified framework enabling real-world deployment of dual-humanoid systems for complex, physically plausible interactions. Our framework integrates three core components: (1) an Interaction-Aware Motion Retargeting (IAMR) module that generates feasible humanoid interaction references from human data; (2) an Interaction-Guided Reinforcement Learning (IGRL) policy that masters coupled dynamics via graph-based rewards; and (3) a real-world deployment system that enables robust transfer of dual-humanoid interaction. Extensive experiments on physical Unitree G1 robots demonstrate that our framework achieves robust interactive whole-body control, successfully transferring diverse behaviors such as hugging and dancing from simulation to reality.
中文摘要 实现多人形系统中的交互式全身控制对于在共享环境中解锁复杂协作能力至关重要。尽管近期技术大幅提升了单个机器人的敏捷性，但弥合与物理耦合多人生物交互的差距仍具挑战性，主要原因是严重的运动学不匹配和复杂的接触动力学。为此，我们引入了Rhythm，这是首个统一框架，能够在现实世界中部署双人生物系统，实现复杂且物理上合理的交互。我们的框架整合了三个核心组件：（1）交互感知运动重定向（IAMR）模块，可从人类数据生成可行的人形交互参考;（2）互动引导强化学习（IGRL）策略，通过基于图的奖励掌握耦合动态;以及（3）实现双人互动稳健传输的真实部署系统。对物理Unitree G1机器人的广泛实验表明，我们的框架实现了稳健的交互式全身控制，成功将拥抱和舞蹈等多样化行为从模拟转化为现实。

Learning in Markov Decision Processes with Exogenous Dynamics

马尔可夫决策过程的外生动力学学习

Authors: Davide Maran, Davide Salaorni, Marcello Restelli
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.02862
Pdf link: https://arxiv.org/pdf/2603.02862
Abstract Reinforcement learning algorithms are typically designed for generic Markov Decision Processes (MDPs), where any state-action pair can lead to an arbitrary transition distribution. In many practical systems, however, only a subset of the state variables is directly influenced by the agent's actions, while the remaining components evolve according to exogenous dynamics and account for most of the stochasticity. In this work, we study a structured class of MDPs characterized by exogenous state components whose transitions are independent of the agent's actions. We show that exploiting this structure yields significantly improved learning guarantees, with only the size of the exogenous state space appearing in the leading terms of the regret bounds. We further establish a matching lower bound, showing that this dependence is information-theoretically optimal. Finally, we empirically validate our approach across classical toy settings and real-world-inspired environments, demonstrating substantial gains in sample efficiency compared to standard reinforcement learning methods.
中文摘要 强化学习算法通常为通用的马尔可夫决策过程（MDP）设计，其中任意状态-动作对都可能导致任意的转移分布。然而，在许多实际系统中，只有部分状态变量直接受到代理人行为的影响，其余部分则根据外生动力学演变，并解释了大部分随机性。本研究中，我们研究了一类结构化的MDP，其特征为外生状态成分，其转变独立于代理的行为。我们证明，利用该结构显著提升了学习保证，只有外生状态空间的大小出现在遗憾边界的首项中。我们进一步建立了匹配的下界，表明该依赖在信息理论上最优。最后，我们通过实证验证了该方法在经典玩具环境和现实启发环境中，展示了相较于标准强化学习方法在样本效率上的显著提升。
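
As a rough illustration of the structure this abstract describes, one common way to write an MDP with exogenous components is to split the state into an endogenous part $s$ and an exogenous part $x$ and factor the transition kernel. The notation below is an assumption made for this digest, not taken from the paper, and the conditional independence of the two next-state components is an additional simplifying assumption:

$$
P(s', x' \mid s, x, a) = P_{\mathrm{endo}}(s' \mid s, x, a)\, P_{\mathrm{exo}}(x' \mid x)
$$

Only $P_{\mathrm{exo}}$ is free of the action $a$, which matches the abstract's statement that the exogenous transitions are independent of the agent's actions.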

SAE as a Crystal Ball: Interpretable Features Predict Cross-domain Transferability of LLMs without Training

SAE作为水晶球：可解释特征预测LLM在不需训练的情况下跨域迁移

Authors: Qi Zhang, Yifei Wang, Xiaohan Wang, Jiajun Chai, Guojun Yin, Wei Lin, Yisen Wang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.02908
Pdf link: https://arxiv.org/pdf/2603.02908
Abstract In recent years, pre-trained large language models have achieved remarkable success across diverse tasks. Besides the pivotal role of self-supervised pre-training, their effectiveness in downstream applications also depends critically on the post-training process, which adapts models to task-specific data and objectives. However, this process inevitably introduces model shifts that can influence performance in different domains, and how such shifts transfer remains poorly understood. To open up the black box, we propose the SAE-based Transferability Score (STS), a new metric that leverages sparse autoencoders (SAEs) to forecast post-training transferability. Taking supervised fine-tuning as an example, STS identifies shifted dimensions in SAE representations and calculates their correlations with downstream domains, enabling reliable estimation of transferability before fine-tuning. Extensive experiments across multiple models and domains show that STS accurately predicts the transferability of supervised fine-tuning, achieving Pearson correlation coefficients above 0.7 with actual performance changes. Beyond this, we take an initial step toward extending STS to reinforcement learning. We believe that STS can serve as an interpretable tool for guiding post-training strategies in LLMs. Code is available at this https URL.
中文摘要 近年来，预训练的大型语言模型在多样化任务中取得了显著成功。除了自监督预训练的关键作用外，其在下游应用中的有效性还关键于后期培训过程，后者会根据任务特定数据和目标调整模型。然而，这一过程不可避免地引入了模型转移，这些变化会影响不同领域的性能，而这种转变如何转移至今仍不充分。为了打开这个黑箱，我们提出了基于SAE的可转移性评分（STS），这是一种利用稀疏自编码器（SAE）预测训练后可转移性的新指标。以监督微调为例，STS识别SAE表示中的维度移位，并计算其与下游域的相关性，从而在微调前实现对可转移性的可靠估计。跨多个模型和领域的大量实验表明，STS能够准确预测监督微调的可转移性，实际性能变化时，皮尔逊相关系数可达到0.7以上。除此之外，我们还迈出了将STS扩展到强化学习的第一步。我们相信STS可以作为一种可解释性工具，指导LLM的训练后策略。代码可在此 https URL 获取。
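
The evaluation reported in the STS abstract is a Pearson correlation between a score computed before fine-tuning and the performance change observed afterwards. The sketch below shows only that correlation step. The function name and the numbers are made up for illustration and are not results from the paper; how STS itself is computed from SAE features is not reproduced here.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical per-domain values: a transferability score predicted before
# fine-tuning, and the accuracy change measured after fine-tuning.
predicted_scores = [0.1, 0.4, 0.5, 0.9]
observed_changes = [-1.0, 0.5, 1.0, 2.0]

r = pearson(predicted_scores, observed_changes)
```

A value of `r` close to 1 would mean the score ranks domains in nearly the same order as the measured changes; the abstract reports coefficients above 0.7 for supervised fine-tuning.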

On the Structural Limitations of Weight-Based Neural Adaptation and the Role of Reversible Behavioral Learning

关于基于权重的神经适应的结构性局限性及可逆行为学习的作用

Authors: Pardhu Sri Rushi Varma Konduru
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.02934
Pdf link: https://arxiv.org/pdf/2603.02934
Abstract Neural models are usually adapted through changes in parameters shared among model components via fine-tuning, alignment-based training, and reinforcement learning. These changes have been found effective in short-term optimization. However, they result in long-term alterations in the model's base behavior. In this study, we introduce the concept of structural irreversibility as a characteristic of shared-parameter model adaptation. This concept refers to the intertwining of task-specific objectives with the representational identity of the model. We show that when parameters are directly mutated, the resulting model behaves divergently from the original model. This divergence cannot be reversed deterministically without an explicit parameter snapshot. We introduce reversible behavioral learning, in which model behaviors are structurally dissociated from identity parameters and can be deterministically unloaded through an explicit unload process. We also introduce the Recoverability Factor as a normalized measure of behavioral recoverability and provide additional diagnostics based on model divergence. Experiments show that reversible model adaptation achieves rollback within numerical precision, whereas shared-parameter mutation exhibits persistent post-reset divergence.
中文摘要 神经模型通常通过微调、基于比对的训练和强化学习，调整模型组件间共享的参数。这些变化在短期优化中已被发现有效。然而，它们会导致模型基础行为的长期变化。本研究引入了结构不可逆性的概念，作为共享参数模型适应的一个特征。该概念指的是任务特定目标与模型表征身份的交织。我们证明，当参数直接变异时，所得模型的行为与原始模型存在差异。没有明确的参数快照，这种发散无法确定性地逆转。我们引入了可逆行为学习，其中模型行为在结构上与身份参数分离，可以通过显式卸载过程确定性卸载。我们还引入了可恢复因子作为行为恢复能力的归一化衡量标准，并基于模型发散度提供额外的诊断。实验表明，可逆模型适应在数值精度范围内实现回滚，而共享参数突变则在重置后持续存在分歧。

Contextual Latent World Models for Offline Meta Reinforcement Learning

离线元强化学习的上下文潜在世界模型

Authors: Mohammadreza Nakheai, Aidan Scannell, Kevin Luck, Joni Pajarinen
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.02935
Pdf link: https://arxiv.org/pdf/2603.02935
Abstract Offline meta-reinforcement learning seeks to learn policies that generalize across related tasks from fixed datasets. Context-based methods infer a task representation from transition histories, but learning effective task representations without supervision remains a challenge. In parallel, latent world models have demonstrated strong self-supervised representation learning through temporal consistency. We introduce contextual latent world models, which condition latent world models on inferred task representations and train them jointly with the context encoder. This enforces task-conditioned temporal consistency, yielding task representations that capture task-dependent dynamics rather than merely discriminating between tasks. Our method learns more expressive task representations and significantly improves generalization to unseen tasks across MuJoCo, Contextual-DeepMind Control, and Meta-World benchmarks.
中文摘要 离线元强化学习旨在从固定数据集中学习跨相关任务的策略。基于上下文的方法从过渡历史推断任务表示，但在无监督的情况下学习有效的任务表示仍是个挑战。与此同时，潜在世界模型通过时间一致性展现了强烈的自我监督表征学习。我们引入了上下文潜在世界模型，这些模型基于推断任务表示来进行条件条件，并与上下文编码器共同训练。这强制执行任务条件的时间一致性，产生捕捉任务相关动态的任务表示，而不仅仅是区分任务之间。我们的方法学习了更多表现力的任务表征，并显著提升了对MuJoCo、Contextual-DeepMind Control和Meta-World基准测试中未见任务的泛化能力。

Beyond One-Size-Fits-All: Adaptive Subgraph Denoising for Zero-Shot Graph Learning with Large Language Models

超越一刀切：在大型语言模型下零样本图学习中的自适应子图去噪

Authors: Fengzhi Li, Liang Zhang, Yuan Zuo, Ruiqing Zhao, YanSong Liu, Yunfei Ma, Fanyu Meng, Junlan Feng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.02938
Pdf link: https://arxiv.org/pdf/2603.02938
Abstract Graph-based tasks in the zero-shot setting remain a significant challenge due to data scarcity and the inability of traditional Graph Neural Networks (GNNs) to generalize to unseen domains or label spaces. While recent advancements have transitioned toward leveraging Large Language Models (LLMs) as predictors to enhance GNNs, these methods often suffer from cross-modal alignment issues. A recent paradigm (i.e., Graph-R1) overcomes the aforementioned architectural dependencies by adopting a purely text-based format and utilizing LLM-based graph reasoning, showing improved zero-shot generalization. However, it employs a task-agnostic, one-size-fits-all subgraph extraction strategy, which inevitably introduces significant structural noise--irrelevant neighbors and edges--that distorts the LLMs' receptive field and leads to suboptimal predictions. To address this limitation, we introduce GraphSSR, a novel framework designed for adaptive subgraph extraction and denoising in zero-shot LLM-based graph reasoning. Specifically, we propose the SSR pipeline, which dynamically tailors subgraph extraction to specific contexts through a "Sample-Select-Reason" process, enabling the model to autonomously filter out task-irrelevant neighbors and overcome the one-size-fits-all issue. To internalize this capability, we develop SSR-SFT, a data synthesis strategy that generates high-quality SSR-style graph reasoning traces for supervised fine-tuning of LLMs. Furthermore, we propose SSR-RL, a two-stage reinforcement learning framework that explicitly regulates sampling and selection operations within the proposed SSR pipeline designed for adaptive subgraph denoising. By incorporating Authenticity-Reinforced and Denoising-Reinforced RL, we guide the model to achieve accurate predictions using parsimonious, denoised subgraphs for reasoning.
中文摘要 由于数据稀缺以及传统图神经网络（GNN）无法推广到未见领域或标记空间，零样本环境中基于图的任务依然面临重大挑战。尽管近年来的进展已转向利用大型语言模型（LLMs）作为增强GNN的预测变量，但这些方法常常存在跨模态对齐问题。一个近期范式（即Graph-R1）通过采用纯文本格式并采用基于LLM的图推理，克服了上述架构依赖，展示了改进的零样本推广。然而，它采用了任务无关、一刀切的子图提取策略，这不可避免地引入了显著的结构噪声——无关邻居和边——扭曲了LLMs的感受野，导致预测不优。为解决这一限制，我们介绍了GraphSSR，一种新颖的框架，用于零样本基于LLM的图推理中自适应子图提取和去噪。具体来说，我们提出了SSR流水线，通过“样本-选择-理由”过程动态调整子图提取以适应特定上下文，使模型能够自主过滤任务无关的邻居，克服一刀切的问题。为实现这一能力，我们开发了SSR-SFT数据综合策略，生成高质量的SSR风格图推理迹，用于监督式微调LLM。此外，我们提出了SSR-RL，一种两阶段强化学习框架，明确调控拟议SSR管道中的采样和选择操作，专为自适应子图去噪设计。通过结合真实性强化和去噪强化学习，我们引导模型通过简约、去噪的子图进行推理，实现准确的预测。

CGL: Advancing Continual GUI Learning via Reinforcement Fine-Tuning

CGL：通过强化微调推进持续的图形用户界面学习

Authors: Zhenquan Yao, Zitong Huang, Yihan Zeng, Jianhua Han, Hang Xu, Chun-Mei Feng, Jianwei Ma, Wangmeng Zuo
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.02951
Pdf link: https://arxiv.org/pdf/2603.02951
Abstract Graphical User Interface (GUI) Agents, benefiting from recent advances in multimodal large language models (MLLM), have achieved significant development. However, due to the frequent updates of GUI applications, adapting to new tasks without forgetting old tasks in GUI continual learning remains an open problem. In this work, we reveal that while Supervised Fine-Tuning (SFT) facilitates fast adaptation, it often triggers knowledge overwriting, whereas Reinforcement Learning (RL) demonstrates an inherent resilience that shields prior interaction logic from erasure. Based on this insight, we propose a Continual GUI Learning (CGL) framework that dynamically balances adaptation efficiency and skill retention by enhancing the synergy between SFT and RL. Specifically, we introduce an SFT proportion adjustment mechanism guided by policy entropy to dynamically control the weight allocation between the SFT and RL training phases. To resolve explicit gradient interference, we further develop a specialized gradient surgery strategy. By projecting exploratory SFT gradients onto GRPO-based anchor gradients, our method explicitly clips the components of SFT gradients that conflict with GRPO. On top of that, we establish an AndroidControl-CL benchmark, which divides GUI applications into distinct task groups to effectively simulate and evaluate the performance of continual GUI learning. Experimental results demonstrate the effectiveness of our proposed CGL framework across continual learning scenarios. The benchmark, code, and model will be made publicly available.
中文摘要 图形用户界面（GUI）代理借助多模态大型语言模型（MLLM）的最新进展，取得了显著的发展。然而，由于GUI应用频繁更新，适应新任务而不忘记旧任务仍是一个未解之谜。本研究揭示，虽然监督式微调（SFT）促进快速适应，但常触发知识覆盖，而强化学习（RL）则展现出内在的韧性，保护先前的交互逻辑不被抹除。基于这一见解，我们提出了一个持续GUI学习（Continual GUI Learning，CGL）框架，通过增强SFT与RL之间的协同效应，动态平衡适应效率和技能保持。具体来说，我们引入了一种由策略熵引导的SFT比例调整机制，以动态控制SFT和RL训练阶段之间的权重分配。为了解决显性梯度干扰，我们进一步开发了一种专门的梯度手术策略。通过将探索性SFT梯度投影到基于GRPO的锚点梯度上，我们的方法明确裁剪了与GRPO冲突的SFT梯度分量。此外，我们还建立了AndroidControl-CL基准测试，将GUI应用划分为不同的任务组，以有效模拟和评估持续GUI学习的性能。实验结果证明了我们提出的CGL框架在持续学习场景下的有效性。基准测试、代码和模型将公开。
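
The gradient surgery step in the CGL abstract (removing the part of the SFT gradient that conflicts with the GRPO anchor gradient) resembles projection-based conflict removal as used in PCGrad. The sketch below follows that PCGrad-style rule as an assumption; the abstract does not give the exact clipping rule, and the function names and toy vectors are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def remove_conflict(g_sft, g_anchor, eps=1e-12):
    """Drop the component of g_sft that points against g_anchor.

    If the two gradients do not conflict (non-negative inner product),
    the SFT gradient is returned unchanged.
    """
    d = dot(g_sft, g_anchor)
    if d >= 0:
        return list(g_sft)
    scale = d / (dot(g_anchor, g_anchor) + eps)
    return [a - scale * b for a, b in zip(g_sft, g_anchor)]


g_sft = [1.0, -2.0]   # toy SFT gradient
g_grpo = [0.0, 1.0]   # toy GRPO anchor gradient

adjusted = remove_conflict(g_sft, g_grpo)
```

After the adjustment, the SFT gradient keeps its component orthogonal to the anchor and no longer has a negative inner product with it, so an SFT update would not directly undo the GRPO direction in this toy case.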

DreamFlow: Local Navigation Beyond Observation via Conditional Flow Matching in the Latent Space

梦流：通过条件流匹配在潜在空间中实现的本地导航，超越观察

Authors: Jiwon Park, Dongkyu Lee, I Made Aswin Nahrendra, Jaeyoung Lim, Hyun Myung
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.02976
Pdf link: https://arxiv.org/pdf/2603.02976
Abstract Local navigation in cluttered environments often suffers from dense obstacles and frequent local minima. Conventional local planners rely on heuristics and are prone to failure, while deep reinforcement learning (DRL)-based approaches provide adaptability but are constrained by limited onboard sensing. These limitations lead to navigation failures because the robot cannot perceive structures outside its field of view. In this paper, we propose DreamFlow, a DRL-based local navigation framework that extends the robot's perceptual horizon through conditional flow matching (CFM). The proposed CFM-based prediction module learns a probabilistic mapping between the latent representation of the local height map and a broader spatial representation, conditioned on the navigation context. This enables the navigation policy to predict unobserved environmental features and proactively avoid potential local minima. Experimental results demonstrate that DreamFlow outperforms existing methods in terms of latent prediction accuracy and navigation performance in simulation. The proposed method was further validated in cluttered real-world environments with a quadrupedal robot. The project page is available at this https URL.
中文摘要 在杂乱环境中的局部导航常常存在密集障碍和频繁出现局部极小值的问题。传统的本地规划器依赖启发式方法，容易失败，而基于深度强化学习（DRL）的方法则提供适应性，但受限于有限的机载感测。这些限制导致导航失败，因为机器人无法感知视野之外的结构。本文提出了DreamFlow，一种基于DRL的本地导航框架，通过条件流匹配（CFM）扩展机器人的感知视野。所提出的基于CFM的预测模块学习了基于导航上下文的本地高度图潜在表示与更广泛空间表示之间的概率映射。这使得导航政策能够预测未被观测到的环境特征，并主动避免潜在的局部极小值。实验结果表明，DreamFlow在潜在预测精度和导航性能方面优于现有方法。该方法在四足机器人的杂乱现实环境中得到了进一步验证。项目页面可在此 https 网址访问。

Contextualized Privacy Defense for LLM Agents

LLM代理的情境化隐私防御

Authors: Yule Wen, Yanzhe Zhang, Jianxun Lian, Xiaoyuan Yi, Xing Xie, Diyi Yang
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.02983
Pdf link: https://arxiv.org/pdf/2603.02983
Abstract LLM agents increasingly act on users' personal information, yet existing privacy defenses remain limited in both design and adaptability. Most prior approaches rely on static or passive defenses, such as prompting and guarding. These paradigms are insufficient for supporting contextual, proactive privacy decisions in multi-step agent execution. We propose Contextualized Defense Instructing (CDI), a new privacy defense paradigm in which an instructor model generates step-specific, context-aware privacy guidance during execution, proactively shaping actions rather than merely constraining or vetoing them. Crucially, CDI is paired with an experience-driven optimization framework that trains the instructor via reinforcement learning (RL), where we convert failure trajectories with privacy violations into learning environments. We formalize baseline defenses and CDI as distinct intervention points in a canonical agent loop, and compare their privacy-helpfulness trade-offs within a unified simulation framework. Results show that our CDI consistently achieves a better balance between privacy preservation (94.2%) and helpfulness (80.6%) than baselines, with superior robustness to adversarial conditions and generalization.
中文摘要 LLM代理越来越多地基于用户的个人信息采取行动，但现有的隐私防御在设计和适应性上仍然有限。大多数先前的方法依赖静态或被动防御，如提示（prompting）和防护（guarding）。这些范式不足以支持多步代理执行中情境化、主动的隐私决策。我们提出了情境化防御指导（Contextualized Defense Instructing，CDI），这是一种新的隐私防御范式，其中指导模型在执行过程中生成针对具体步骤的、情境感知的隐私指导，主动塑造行动，而不仅仅是限制或否决它们。关键的是，CDI配有一个经验驱动的优化框架，通过强化学习（RL）训练指导模型，我们将包含隐私违规的失败轨迹转化为学习环境。我们将基线防御和CDI形式化为典型代理循环中的不同干预点，并在统一的模拟框架内比较它们在隐私与有用性之间的权衡。结果显示，与基线相比，我们的CDI在隐私保护（94.2%）和有用性（80.6%）之间始终取得更好的平衡，并且对对抗条件具有更强的鲁棒性和更好的泛化能力。

Why Does RLAIF Work At All?

RLAIF 为什么能正常工作？

Authors: Robin Young
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.03000
Pdf link: https://arxiv.org/pdf/2603.03000
Abstract Reinforcement Learning from AI Feedback (RLAIF) enables language models to improve by training on their own preference judgments, yet no theoretical account explains why this self-improvement seemingly works for value learning. We propose the latent value hypothesis, that pretraining on internet-scale data encodes human values as directions in representation space, and constitutional prompts elicit these latent values into preference judgments. We formalize this intuition under a linear model where the constitution acts as a projection operator selecting value-relevant directions. Our analysis yields several results: RLAIF improves alignment when the constitution-activated direction correlates with true values better than the model's default generation direction, which explains the generation-judgment gap; the ceiling on RLAIF quality is determined by how well representations encode values, which scales with model capacity; and adversarial constitutions exist that can activate anti-social value directions encoded from harmful pretraining data. Our account unifies scattered empirical findings including the refusal direction, low-rank safety subspaces, and RLAIF scaling behavior.
中文摘要 基于人工智能反馈的强化学习（RLAIF）使语言模型能够通过在自身的偏好判断上训练来改进，但目前没有理论能够解释这种自我提升为何看起来对价值学习有效。我们提出了潜在价值假说，即在互联网规模数据上的预训练将人类价值观编码为表示空间中的方向，而宪法提示则将这些潜在价值观引出为偏好判断。我们在线性模型下将这一直觉形式化，其中宪法充当投影算子，选择与价值相关的方向。我们的分析得出了几个结果：当宪法激活的方向与真实价值观的相关性高于模型默认生成方向时，RLAIF能够改善对齐，这解释了生成-判断差距；RLAIF质量的上限取决于表示对价值观的编码程度，而这一程度随模型容量增长；并且存在对抗性宪法，可以激活由有害预训练数据编码的反社会价值方向。我们的论述统一了分散的实证发现，包括拒绝方向、低秩安全子空间和RLAIF的规模化行为。

PrivMedChat: End-to-End Differentially Private RLHF for Medical Dialogue Systems

PrivMedChat：面向医疗对话系统的端到端差分隐私RLHF

Authors: Sudip Bhujel
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.03054
Pdf link: https://arxiv.org/pdf/2603.03054
Abstract Large language models are increasingly used for patient-facing medical assistance and clinical decision support, but adapting them to clinical dialogue often requires supervision derived from doctor-patient conversations that may contain sensitive information. Conventional supervised fine-tuning and reinforcement learning from human feedback (RLHF) can amplify memorization risks, enabling empirical membership inference and extraction of rare training-set content. We present PrivMedChat, an end-to-end framework for differentially private RLHF (DP-RLHF) for medical dialogue. Our design enforces differential privacy at every training stage that directly accesses dialogue-derived supervision: (i) Differentially Private Stochastic Gradient Descent (DP-SGD) for medical SFT and (ii) DP-SGD for reward model learning from preference pairs. To limit additional privacy expenditure during alignment, we apply DP-SGD to the PPO actor and critic when operating on dialogue-derived prompts, while the reward model remains fixed after DP training. We also introduce an annotation-free preference construction strategy that pairs physician responses with filtered non-expert generations to produce scalable preference data without clinician labeling. Experiments on medical dialogue benchmarks show that PrivMedChat at $\varepsilon=7$ achieves the highest ROUGE-L of 0.156 among all DP models, reduces clinical hallucinations to 1.4% and harmful advice to 0.4%, and obtains the highest overall score of 2.86 in a 3-model LLM-jury evaluation, while producing membership-inference signals that are near chance (AUC 0.510-0.555). We open-source our code at this https URL.
中文摘要 大型语言模型越来越多地用于面向患者的医疗协助和临床决策支持，但要使其适应临床对话，通常需要来自医患对话的监督信号，而这些对话可能包含敏感信息。传统的监督微调和基于人类反馈的强化学习（RLHF）可能放大记忆风险，使得经验性的成员推断和稀有训练集内容的提取成为可能。我们提出PrivMedChat，一个面向医疗对话的端到端差分隐私RLHF（DP-RLHF）框架。我们的设计在每个直接访问对话衍生监督信号的训练阶段都强制执行差分隐私：（i）用于医疗SFT的差分隐私随机梯度下降（DP-SGD），以及（ii）用于从偏好对学习奖励模型的DP-SGD。为了限制对齐过程中的额外隐私消耗，我们在处理对话衍生提示时对PPO的actor和critic应用DP-SGD，而奖励模型在DP训练后保持固定。我们还引入了一种无需标注的偏好构建策略，将医生回复与经过过滤的非专家生成结果配对，在无需临床医生标注的情况下生成可扩展的偏好数据。在医疗对话基准上的实验表明，PrivMedChat在$\varepsilon=7$时取得了所有DP模型中最高的ROUGE-L（0.156），将临床幻觉降至1.4%，有害建议降至0.4%，并在三模型LLM评审团评估中获得最高总分2.86，同时产生的成员推断信号接近随机水平（AUC 0.510-0.555）。我们的代码开源于此 https URL。
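
For readers unfamiliar with DP-SGD, the mechanism this abstract applies at each training stage can be sketched as follows. This is a generic, minimal illustration of the standard recipe (per-example gradient clipping, Gaussian noise on the summed gradient, then averaging), not PrivMedChat's code. The function name `dp_sgd_step`, the flat-list parameter layout, and all argument names are assumptions, and privacy accounting (tracking $\varepsilon$) is deliberately left out.

```python
import math
import random


def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD update: clip each example's gradient, sum, add Gaussian noise, average."""
    summed = [0.0] * len(params)
    for grad in per_example_grads:
        norm = math.sqrt(sum(x * x for x in grad))
        # Scale the gradient down so its L2 norm is at most clip_norm.
        factor = min(1.0, clip_norm / (norm + 1e-12))
        for i, x in enumerate(grad):
            summed[i] += x * factor
    batch_size = len(per_example_grads)
    # Noise standard deviation is proportional to the clipping bound.
    sigma = noise_multiplier * clip_norm
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / batch_size for s in summed]
    return [p - lr * g for p, g in zip(params, noisy_mean)]
```

Clipping bounds how much any single dialogue can move the parameters, which is what makes the added noise sufficient for a differential privacy guarantee. In practice a library with a privacy accountant would be used instead of a hand-written step.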

CMoE: Contrastive Mixture of Experts for Motion Control and Terrain Adaptation of Humanoid Robots

CMoE：用于人形机器人运动控制与地形适应的对比式专家混合

Authors: Shihao Ma, Hongjin Chen, Zijun Xu, Yi Zhao, Ke Wu, Ruichen Yang, Leyao Zou, Zhongxue Gan, Wenchao Ding
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.03067
Pdf link: https://arxiv.org/pdf/2603.03067
Abstract For effective deployment in real-world environments, humanoid robots must autonomously navigate a diverse range of complex terrains with abrupt transitions. While the vanilla mixture-of-experts (MoE) framework is theoretically capable of modeling diverse terrain features, in practice, the gating network exhibits nearly uniform expert activations across different terrains, weakening the expert specialization and limiting the model's expressive power. To address this limitation, we introduce CMoE, a novel single-stage reinforcement learning framework that integrates contrastive learning to refine expert activation distributions. By imposing contrastive constraints, CMoE maximizes the consistency of expert activations within the same terrain while minimizing their similarity across different terrains, thereby encouraging experts to specialize in distinct terrain types. We validated our approach on the Unitree G1 humanoid robot through a series of challenging experiments. Results demonstrate that CMoE enables the robot to traverse continuous steps up to 20 cm high and gaps up to 80 cm wide, while achieving robust and natural gait across diverse mixed terrains, surpassing the limits of existing methods. To support further research and foster community development, we release our code publicly.
中文摘要 为了在现实环境中有效部署，人形机器人必须能够自主穿越多种带有突变过渡的复杂地形。虽然原始的专家混合（MoE）框架理论上能够建模多样化的地形特征，但在实践中，门控网络在不同地形上表现出几乎均匀的专家激活，削弱了专家的专门化并限制了模型的表达能力。为解决这一局限，我们引入了CMoE，一种新型单阶段强化学习框架，它整合对比学习以优化专家激活分布。通过施加对比约束，CMoE最大化同一地形内专家激活的一致性，同时最小化不同地形间专家激活的相似性，从而鼓励专家专注于不同的地形类型。我们通过一系列具有挑战性的实验在Unitree G1人形机器人上验证了我们的方法。结果表明，CMoE使机器人能够通过高达20厘米的连续台阶和宽达80厘米的间隙，同时在多样的混合地形中实现稳健自然的步态，超越了现有方法的极限。为了支持进一步的研究并促进社区发展，我们公开发布了代码。
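
The contrastive constraint on expert activations can be made concrete with a small sketch. It assumes a simple pairwise formulation: gate vectors from the same terrain are pulled toward cosine similarity 1, and gate vectors from different terrains are penalized when their similarity exceeds a margin. The function names, the `margin` parameter, and the pairwise form are all hypothetical, and the loss used in the CMoE paper may be formulated differently.

```python
import math


def cosine(u, v):
    """Cosine similarity between two gate-activation vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / (den + 1e-12)


def contrastive_gate_loss(gates, terrains, margin=0.0):
    """Average pairwise loss: similar gates within a terrain, dissimilar across terrains."""
    total, count = 0.0, 0
    for i in range(len(gates)):
        for j in range(i + 1, len(gates)):
            sim = cosine(gates[i], gates[j])
            if terrains[i] == terrains[j]:
                total += 1.0 - sim  # same terrain: reward consistent activations
            else:
                total += max(0.0, sim - margin)  # different terrain: penalize overlap
            count += 1
    return total / count if count else 0.0
```

Under this sketch, the nearly uniform gating described in the abstract scores poorly, because samples from different terrains produce almost identical activation vectors.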

Reinforcement Learning with Symbolic Reward Machines

符号奖励机的强化学习

Authors: Thomas Krug, Daniel Neider
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.03068
Pdf link: https://arxiv.org/pdf/2603.03068
Abstract Reward Machines (RMs) are an established mechanism in Reinforcement Learning (RL) to represent and learn sparse, temporally extended tasks with non-Markovian rewards. RMs rely on high-level information in the form of labels that are emitted by the environment alongside the observation. However, this concept requires manual user input for each environment and task. The user has to create a suitable labeling function that computes the labels. These limitations lead to poor applicability in widely adopted RL frameworks. We propose Symbolic Reward Machines (SRMs) together with the learning algorithms QSRM and LSRM to overcome the limitations of RMs. SRMs consume only the standard output of the environment and process the observation directly through guards that are represented by symbolic formulas. In our evaluation, our SRM methods outperform the baseline RL approaches and generate the same results as the existing RM methods. At the same time, our methods adhere to the widely used environment definition and provide interpretable representations of the task to the user.
中文摘要 奖励机（RM）是强化学习（RL）中一种成熟的机制，用于表示和学习具有非马尔可夫奖励的稀疏、时间扩展任务。RM依赖于环境随观测一同发出的标签形式的高层信息。然而，这一概念需要用户针对每个环境和任务手动提供输入：用户必须创建一个合适的标签函数来计算这些标签。这些局限导致RM在被广泛采用的强化学习框架中适用性较差。我们提出了符号奖励机（SRM）以及学习算法QSRM和LSRM，以克服RM的局限性。SRM仅使用环境的标准输出，并通过以符号公式表示的守卫条件直接处理观测。在我们的评估中，我们的SRM方法优于基线强化学习方法，并取得与现有RM方法相同的结果。同时，我们的方法遵循被广泛使用的环境定义，并为用户提供可解释的任务表示。
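
The difference between a classical reward machine and the symbolic variant described here is that transitions are guarded by conditions on the raw observation rather than by environment-emitted labels. The sketch below illustrates that idea only. Plain Python predicates stand in for the paper's symbolic formulas, and the class name, the transition layout, and the "there and back" task are hypothetical. It does not implement the QSRM or LSRM learning algorithms.

```python
class SymbolicRewardMachine:
    """Finite-state reward machine whose transitions are guarded by predicates on raw observations."""

    def __init__(self, initial_state, transitions):
        # transitions maps a machine state to a list of (guard, next_state, reward) triples.
        self.initial_state = initial_state
        self.transitions = transitions
        self.state = initial_state

    def reset(self):
        self.state = self.initial_state

    def step(self, observation):
        """Advance on one observation and return the reward of the first enabled transition."""
        for guard, next_state, reward in self.transitions.get(self.state, []):
            if guard(observation):
                self.state = next_state
                return reward
        return 0.0  # no guard fired: stay in the current state


def make_there_and_back_machine():
    # Hypothetical task: reach x >= 5 first, then come back to x <= 0.
    return SymbolicRewardMachine(
        "start",
        {
            "start": [(lambda obs: obs["x"] >= 5.0, "reached", 0.0)],
            "reached": [(lambda obs: obs["x"] <= 0.0, "done", 1.0)],
        },
    )
```

The reward is non-Markovian in the observation alone: being at `x <= 0` pays out only after the machine has already seen `x >= 5`, which is exactly the kind of temporally extended task the machine state is meant to track.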

TikZilla: Scaling Text-to-TikZ with High-Quality Data and Reinforcement Learning

TikZilla：借助高质量数据和强化学习扩展文本到TikZ的生成

Authors: Christian Greisinger, Steffen Eger
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.03072
Pdf link: https://arxiv.org/pdf/2603.03072
Abstract Large language models (LLMs) are increasingly used to assist scientists across diverse workflows. A key challenge is generating high-quality figures from textual descriptions, often represented as TikZ programs that can be rendered as scientific images. Prior research has proposed a variety of datasets and modeling approaches for this task. However, existing datasets for Text-to-TikZ are too small and noisy to capture the complexity of TikZ, causing mismatches between text and rendered figures. Moreover, prior approaches rely solely on supervised fine-tuning (SFT), which does not expose the model to the rendered semantics of the figure, often resulting in errors such as looping, irrelevant content, and incorrect spatial relations. To address these issues, we construct DaTikZ-V4, a dataset more than four times larger and substantially higher in quality than DaTikZ-V3, enriched with LLM-generated figure descriptions. Using this dataset, we train TikZilla, a family of small open-source Qwen models (3B and 8B) with a two-stage pipeline of SFT followed by reinforcement learning (RL). For RL, we leverage an image encoder trained via inverse graphics to provide semantically faithful reward signals. Extensive human evaluations with over 1,000 judgments show that TikZilla improves by 1.5-2 points over its base models on a 5-point scale, surpasses GPT-4o by 0.5 points, and matches GPT-5 in the image-based evaluation, while operating at much smaller model sizes. Code, data, and models will be made available.
中文摘要 大型语言模型（LLMs）越来越多地被用于协助科学家在不同工作流程中。一个关键挑战是从文本描述生成高质量图表，这些描述通常以TikZ程序形式呈现，可以作为科学图像渲染。此前的研究提出了多种数据集和建模方法来实现这一任务。然而，现有的文本转TikZ数据集过小且噪声较大，无法捕捉TikZ的复杂性，导致文本与渲染图形之间存在不匹配。此外，以往的方法仅依赖监督微调（SFT），该方法未使模型接触图形的渲染语义，常导致循环、无关内容和空间关系错误等错误。为解决这些问题，我们构建了DaTikZ-V4数据集，其规模是DaTikZ-V3的四倍以上，质量显著提升，并丰富了大型语言模型生成的图形描述。利用该数据集，我们训练TikZilla，这是一系列小型开源Qwen模型（3B和8B），采用SFT和强化学习（RL）两阶段的流水线。对于强化学习，我们利用通过逆向图形训练的图像编码器，提供语义忠实的奖励信号。经过超过1000次评估的大量人工评估显示，TikZilla在5分制下比基础模型提升1.5-2分，比GPT-4o高出0.5分，在基于图像的评估中与GPT-5相当，且运行模型规模更小。代码、数据和模型将被公开。

RAPO: Expanding Exploration for LLM Agents via Retrieval-Augmented Policy Optimization

RAPO：通过检索增强策略优化扩展LLM代理的探索

Authors: Siwei Zhang, Yun Xiong, Xi Chen, Zi'an Jia, Renhong Huang, Jiarong Xu, Jiawei Zhang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.03078
Pdf link: https://arxiv.org/pdf/2603.03078
Abstract Agentic Reinforcement Learning (Agentic RL) has shown remarkable potential in large language model (LLM)-based agents. These works can empower LLM agents to tackle complex tasks via multi-step, tool-integrated reasoning. However, an inherent limitation of existing Agentic RL methods is their reliance on a pure on-policy paradigm for exploration, restricting exploration to the agent's self-generated outputs and preventing the discovery of new reasoning perspectives for further improvement. While recent efforts incorporate auxiliary off-policy signals to enhance exploration, they typically utilize full off-policy trajectories for trajectory-level policy estimation, overlooking the necessity for the fine-grained, step-level exploratory dynamics within agentic rollout. In this paper, we revisit exploration in Agentic RL and propose Retrieval-Augmented Policy Optimization (RAPO), a novel RL framework that introduces retrieval to explicitly expand exploration during training. To achieve this, we decompose the Agentic RL training process into two phases: (i) Hybrid-policy Agentic Rollout, and (ii) Retrieval-aware Policy Optimization. Specifically, we propose a Hybrid-policy Agentic Rollout strategy, which allows the agents to continuously reason over the retrieved off-policy step-level traces. It dynamically extends the reasoning receptive field of agents, enabling broader exploration conditioned on external behaviors. Subsequently, we introduce the Retrieval-aware Policy Optimization mechanism, which calibrates the policy gradient estimation with retrieval reward and importance shaping, stabilizing training and prioritizing retrieval-illuminating exploration. Extensive experiments show that RAPO achieves a +5.0% average gain on fourteen datasets across three agentic reasoning tasks, while delivering 1.2x higher training efficiency.
中文摘要 智能体强化学习（Agentic RL）在基于大型语言模型（LLM）的智能体中展现出显著潜力。这类工作能够使LLM智能体通过多步骤、工具集成的推理完成复杂任务。然而，现有Agentic RL方法的一个固有局限在于其依赖纯on-policy范式进行探索，使探索仅限于智能体自身生成的输出，阻碍了发现有助于进一步改进的新推理视角。虽然近期的工作引入辅助的off-policy信号以增强探索，但它们通常使用完整的off-policy轨迹进行轨迹级策略估计，忽视了智能体rollout中细粒度、步骤级探索动态的必要性。本文重新审视了Agentic RL中的探索，并提出了检索增强策略优化（RAPO），这是一种新颖的强化学习框架，引入检索以在训练中显式地扩展探索。为此，我们将Agentic RL训练过程分解为两个阶段：（i）混合策略智能体Rollout，（ii）检索感知策略优化。具体来说，我们提出了一种混合策略智能体Rollout策略，允许智能体在检索到的off-policy步骤级轨迹上继续推理。它动态扩展了智能体的推理感受野，使得以外部行为为条件的更广泛探索成为可能。随后，我们引入了检索感知策略优化机制，该机制利用检索奖励和重要性塑造来校准策略梯度估计，从而稳定训练并优先进行受检索启发的探索。大量实验表明，RAPO在三类智能体推理任务的14个数据集上取得了平均+5.0%的提升，同时训练效率提高了1.2倍。

Proactive Guiding Strategy for Item-side Fairness in Interactive Recommendation

互动推荐中项目端公平性的主动指导策略

Authors: Chongjun Xia, Xiaoyu Shi, Hong Xie, Xianzhi Wang, Yun Lu, Mingsheng Shang
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.03094
Pdf link: https://arxiv.org/pdf/2603.03094
Abstract Item-side fairness is crucial for ensuring the fair exposure of long-tail items in interactive recommender systems. Existing approaches promote the exposure of long-tail items by directly incorporating them into recommended results. This causes misalignment between user preferences and the recommended long-tail items, which hinders long-term user engagement and reduces the effectiveness of recommendations. We aim for a proactive fairness-guiding strategy, which actively guides user preferences toward long-tail items while preserving user satisfaction during the interactive recommendation process. To this end, we propose HRL4PFG, an interactive recommendation framework that leverages hierarchical reinforcement learning to guide user preferences toward long-tail items progressively. HRL4PFG operates through a macro-level process that generates fairness-guided targets based on multi-step feedback, and a micro-level process that fine-tunes recommendations in real time according to both these targets and evolving user preferences. Extensive experiments show that HRL4PFG improves cumulative interaction rewards and maximum user interaction length by a larger margin when compared with state-of-the-art methods in interactive recommendation environments.
中文摘要 项目端公平性对于确保交互式推荐系统中长尾项目的公平曝光至关重要。现有方法通过将长尾项目直接纳入推荐结果来提升其曝光度。这会导致用户偏好与所推荐的长尾项目之间出现错位，阻碍用户的长期参与，并降低推荐的有效性。我们的目标是一种主动的公平性引导策略，在交互式推荐过程中积极引导用户偏好转向长尾项目，同时保持用户满意度。为此，我们提出了HRL4PFG，一种交互式推荐框架，利用分层强化学习逐步引导用户偏好转向长尾项目。HRL4PFG包含一个宏观层面的过程，基于多步反馈生成公平性引导目标，以及一个微观层面的过程，根据这些目标和不断演变的用户偏好实时微调推荐。大量实验表明，在交互式推荐环境中，与最先进的方法相比，HRL4PFG能够以更大的幅度提升累计交互奖励和最大用户交互长度。

Deep Q-Learning-Based Gain Scheduling for Nonlinear Quadcopter Dynamics

基于深度Q学习的非线性四旋翼飞行器动力学增益调度

Authors: Hossein Rastgoftar, Muhammad J. H. Zahed
Subjects: Systems and Control (eess.SY); Dynamical Systems (math.DS)
Arxiv link: https://arxiv.org/abs/2603.03127
Pdf link: https://arxiv.org/pdf/2603.03127
Abstract This paper presents a deep Q-network (DQN)-based gain-scheduling framework for safety-critical quadcopter trajectory tracking. Instead of directly learning control inputs, the proposed approach selects from a finite set of pre-certified stabilizing gain vectors, enabling reinforcement learning to operate within a structured and stability-preserving control architecture. By exploiting the isotropic structure of the translational dynamics, feedback gains are shared across spatial axes to reduce dimensionality while preserving performance. The learned policy adapts feedback aggressiveness in real time, applying high authority during large transients and reducing gains near convergence to limit control effort. Simulation results using a high-fidelity nonlinear quadcopter model demonstrate accurate trajectory tracking, bounded attitude excursions, smooth transition to hover after the final time, and consistent reward improvement, validating the effectiveness and robustness of the proposed learning-based gain scheduling strategy.
中文摘要 本文提出了一种基于深度Q网络（DQN）的增益调度框架，用于安全关键的四旋翼飞行器轨迹跟踪。该方法不直接学习控制输入，而是从一组有限的、经过预先认证的稳定增益向量中进行选择，使强化学习能够在结构化且保持稳定性的控制架构中运行。通过利用平移动力学的各向同性结构，反馈增益在各空间轴之间共享，以在保持性能的同时降低维度。学习到的策略实时调整反馈的激进程度，在大幅瞬态期间施加较强的控制作用，并在接近收敛时减小增益以限制控制量。使用高保真非线性四旋翼模型的仿真结果展示了精确的轨迹跟踪、有界的姿态偏移、终止时刻之后向悬停的平滑过渡以及持续的奖励提升，验证了所提出的基于学习的增益调度策略的有效性和鲁棒性。
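
The core design choice in this abstract, letting the agent pick among fixed gain vectors instead of emitting control inputs, can be sketched in a few lines. The gain values in `GAIN_SET`, the PD control law, and both function names are illustrative assumptions. The Q-values would come from a trained DQN, which is replaced here by a plain list.

```python
import random

# Hypothetical (kp, kd) pairs, assumed to have been certified as stabilizing offline.
GAIN_SET = [(1.0, 2.0), (4.0, 4.0), (9.0, 6.0)]


def select_gain_index(q_values, epsilon, rng):
    """Epsilon-greedy choice over the finite gain set, given one Q-value per gain vector."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


def control(pos_err, vel_err, gain_index):
    """PD feedback with the same gains applied to every spatial axis."""
    kp, kd = GAIN_SET[gain_index]
    return [-kp * e - kd * de for e, de in zip(pos_err, vel_err)]
```

Because the action space is the index into `GAIN_SET`, every action the policy can take corresponds to a controller that was already checked for stability, and sharing one gain pair across the x, y, and z axes keeps that action space small.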

RL-Based Coverage Path Planning for Deformable Objects on 3D Surfaces

基于强化学习的三维表面可变形物体覆盖路径规划

Authors: Yuhang Zhang, Jinming Ma, Feng Wu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.03137
Pdf link: https://arxiv.org/pdf/2603.03137
Abstract Currently, manipulation tasks for deformable objects often focus on activities like folding clothes, handling ropes, and manipulating bags. However, research on contact-rich tasks involving deformable objects remains relatively underdeveloped. When humans use cloth or sponges to wipe surfaces, they rely on both vision and tactile feedback. Yet, current algorithms still face challenges with issues like occlusion, while research on tactile perception for manipulation is still evolving. Tasks such as covering surfaces with deformable objects demand not only perception but also precise robotic manipulation. To address this, we propose a method that leverages efficient and accessible simulators for task execution. Specifically, we train a reinforcement learning agent in a simulator to manipulate deformable objects for surface wiping tasks. We simplify the state representation of object surfaces using harmonic UV mapping, process contact feedback from the simulator on 2D feature maps, and use scaled grouped convolutions (SGCNN) to extract features efficiently. The agent then outputs actions in a reduced-dimensional action space to generate coverage paths. Experiments demonstrate that our method outperforms previous approaches in key metrics, including total path length and coverage area. We deploy these paths on a Kinova Gen3 manipulator to perform wiping experiments on the back of a torso model, validating the feasibility of our approach.
中文摘要 目前，可变形物体的操作任务通常集中在叠衣服、处理绳索和操作袋子等活动上。然而，涉及可变形物体的接触密集型任务的研究仍然相对不足。当人类用布或海绵擦拭表面时，他们同时依赖视觉和触觉反馈。然而，当前算法仍面临遮挡等问题，而面向操作的触觉感知研究仍在不断发展。诸如用可变形物体覆盖表面等任务，不仅需要感知，还需要精准的机器人操作。为此，我们提出了一种利用高效且易于获取的模拟器来执行任务的方法。具体来说，我们在模拟器中训练强化学习智能体，让其操作可变形物体完成表面擦拭任务。我们利用调和UV映射简化物体表面的状态表示，在二维特征图上处理来自模拟器的接触反馈，并使用缩放分组卷积（SGCNN）高效提取特征。智能体随后在降维的动作空间中输出动作以生成覆盖路径。实验表明，我们的方法在总路径长度和覆盖面积等关键指标上优于以往方法。我们将这些路径部署在Kinova Gen3机械臂上，对躯干模型背部进行擦拭实验，验证了我们方法的可行性。

Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing

多视角一致3D场景编辑的几何引导强化学习

Authors: Jiyuan Wang, Chunyu Lin, Lei Sun, Zhi Cao, Yuyang Yin, Lang Nie, Zhenlong Yuan, Xiangxiang Chu, Yunchao Wei, Kang Liao, Guosheng Lin
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.03143
Pdf link: https://arxiv.org/pdf/2603.03143
Abstract Leveraging the priors of 2D diffusion models for 3D editing has emerged as a promising paradigm. However, maintaining multi-view consistency in edited results remains challenging, and the extreme scarcity of 3D-consistent editing paired data renders supervised fine-tuning (SFT), the most effective training strategy for editing tasks, infeasible. In this paper, we observe that, while generating multi-view consistent 3D content is highly challenging, verifying 3D consistency is tractable, naturally positioning reinforcement learning (RL) as a feasible solution. Motivated by this, we propose RL3DEdit, a single-pass framework driven by RL optimization with novel rewards derived from the 3D foundation model, VGGT. Specifically, we leverage VGGT's robust priors learned from massive real-world data, feed the edited images, and utilize the output confidence maps and pose estimation errors as reward signals, effectively anchoring the 2D editing priors onto a 3D-consistent manifold via RL. Extensive experiments demonstrate that RL3DEdit achieves stable multi-view consistency and outperforms state-of-the-art methods in editing quality with high efficiency. To promote the development of 3D editing, we will release the code and model.
中文摘要 利用二维扩散模型的先验进行三维编辑，已成为一种有前景的范式。然而，保持编辑结果的多视图一致性依然具有挑战性，且三维一致的编辑配对数据极度稀缺，使得编辑任务中最有效的训练策略监督微调（SFT）变得不可行。本文指出，虽然生成多视角一致的三维内容极具挑战性，但验证三维一致性是可行的，这自然使强化学习（RL）成为一种可行的解决方案。基于此，我们提出了RL3DEdit，这是一个由强化学习优化驱动的单次前向框架，其新颖的奖励来自三维基础模型VGGT。具体来说，我们利用VGGT从大量真实世界数据中学到的稳健先验，输入编辑后的图像，并将输出的置信度图和姿态估计误差作为奖励信号，通过强化学习有效地将二维编辑先验锚定在三维一致的流形上。大量实验表明，RL3DEdit实现了稳定的多视图一致性，并且在编辑质量上优于最先进的方法，同时效率很高。为了促进3D编辑的发展，我们将发布代码和模型。

Specificity-aware reinforcement learning for fine-grained open-world classification

针对细粒度开放世界分类的特异性感知强化学习

Authors: Samuele Angheben, Davide Berasi, Alessandro Conti, Elisa Ricci, Yiming Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.03197
Pdf link: https://arxiv.org/pdf/2603.03197
Abstract Classifying fine-grained visual concepts under open-world settings, i.e., without a predefined label set, demands models to be both accurate and specific. Recent reasoning Large Multimodal Models (LMMs) exhibit strong visual understanding capability but tend to produce overly generic predictions when performing fine-grained image classification. Our preliminary analysis reveals that models do possess the intrinsic fine-grained domain knowledge. However, promoting more specific predictions (specificity) without compromising correct ones (correctness) remains a non-trivial and understudied challenge. In this work, we investigate how to steer reasoning LMMs toward predictions that are both correct and specific. We propose a novel specificity-aware reinforcement learning framework, SpeciaRL, to fine-tune reasoning LMMs on fine-grained image classification under the open-world setting. SpeciaRL introduces a dynamic, verifier-based reward signal anchored to the best predictions within online rollouts, promoting specificity while respecting the model's capabilities to prevent incorrect predictions. Our out-of-domain experiments show that SpeciaRL delivers the best trade-off between correctness and specificity across extensive fine-grained benchmarks, surpassing existing methods and advancing open-world fine-grained image classification. Code and model are publicly available at this https URL.
中文摘要 在开放世界环境下，即没有预定义标签集的情况下，对细粒度视觉概念进行分类，要求模型既准确又具体。近期的推理型大型多模态模型（LMM）具有较强的视觉理解能力，但在进行细粒度图像分类时往往给出过于笼统的预测。我们的初步分析显示，模型确实具备内在的细粒度领域知识。然而，在不损害正确预测（正确性）的前提下促进更具体的预测（特异性），仍是一个并非易事且研究不足的挑战。在本研究中，我们探讨如何引导推理型LMM做出既正确又具体的预测。我们提出了一种新颖的特异性感知强化学习框架SpeciaRL，用于在开放世界环境下针对细粒度图像分类微调推理型LMM。SpeciaRL引入了一种动态的、基于验证器的奖励信号，该信号以在线rollout中的最佳预测为锚点，在促进特异性的同时尊重模型的能力，以防止错误预测。我们的域外实验表明，SpeciaRL在大量细粒度基准测试中实现了正确性与特异性之间的最佳权衡，超越了现有方法，推动了开放世界细粒度图像分类的发展。代码和模型在此 https URL 公开。

Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use

学习何时行动或拒绝：守护智能体推理模型以实现安全的多步工具使用

Authors: Aradhye Agarwal, Gurdit Siyan, Yash Pandya, Joykirat Singh, Akshay Nambi, Ahmed Awadallah
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.03205
Pdf link: https://arxiv.org/pdf/2603.03205
Abstract Agentic language models operate in a fundamentally different safety regime than chat models: they must plan, call tools, and execute long-horizon actions where a single misstep, such as accessing files or entering credentials, can cause irreversible harm. Existing alignment methods, largely optimized for static generation and task completion, break down in these settings due to sequential decision-making, adversarial tool feedback, and overconfident intermediate reasoning. We introduce MOSAIC, a post-training framework that aligns agents for safe multi-step tool use by making safety decisions explicit and learnable. MOSAIC structures inference as a plan, check, then act or refuse loop, with explicit safety reasoning and refusal as first-class actions. To train without trajectory-level labels, we use preference-based reinforcement learning with pairwise trajectory comparisons, which captures safety distinctions often missed by scalar rewards. We evaluate MOSAIC zero-shot across three model families, Qwen2.5-7B, Qwen3-4B-Thinking, and Phi-4, and across out-of-distribution benchmarks spanning harmful tasks, prompt injection, benign tool use, and cross-domain privacy leakage. MOSAIC reduces harmful behavior by up to 50%, increases harmful-task refusal by over 20% on injection attacks, cuts privacy leakage, and preserves or improves benign task performance, demonstrating robust generalization across models, domains, and agentic settings.
中文摘要 智能体语言模型所处的安全环境与聊天模型有着根本的不同：它们必须规划、调用工具并执行长程动作，其中一次失误（例如访问文件或输入凭证）就可能造成不可逆的伤害。现有的对齐方法主要针对静态生成和任务完成进行优化，在这些场景中会因序贯决策、对抗性工具反馈和过度自信的中间推理而失效。我们提出了MOSAIC，一个后训练框架，通过使安全决策显式化且可学习，对智能体进行对齐，使其能够安全地进行多步工具使用。MOSAIC将推理过程组织为“规划、检查、然后行动或拒绝”的循环，并将显式的安全推理和拒绝作为一等动作。为了在没有轨迹级标签的情况下进行训练，我们采用基于偏好的强化学习和成对轨迹比较，以捕捉标量奖励常常忽略的安全性差异。我们在三个模型家族（Qwen2.5-7B、Qwen3-4B-Thinking和Phi-4）上，以及在涵盖有害任务、提示注入、良性工具使用和跨域隐私泄露的分布外基准上，对MOSAIC进行了零样本评估。MOSAIC将有害行为最多降低50%，在注入攻击下将有害任务拒绝率提高20%以上，减少了隐私泄露，并保持或改善了良性任务的表现，展示了跨模型、跨领域和跨智能体场景的强大泛化能力。
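
The "plan, check, then act or refuse" loop can be sketched as a small control flow. In MOSAIC the planning, safety reasoning, and refusal are all produced by the trained model itself. Here they are replaced by caller-supplied stand-in functions, so the function name `run_agent`, its parameters, and the return values are hypothetical and only illustrate how refusal becomes an explicit action in the loop.

```python
def run_agent(task, plan_step, safety_check, execute, max_steps=10):
    """Run a plan -> check -> act-or-refuse loop and return (status, history)."""
    history = []
    for _ in range(max_steps):
        # Plan: propose the next tool call, or None when the task is finished.
        action = plan_step(task, history)
        if action is None:
            return "done", history
        # Check: an explicit safety decision is made before anything is executed.
        if not safety_check(task, history, action):
            # Refuse: refusal is recorded as an action of its own, and the episode ends.
            history.append(("refuse", action))
            return "refused", history
        # Act: only checked actions reach the tool.
        history.append(("act", execute(action)))
    return "max_steps", history
```

Placing the check between planning and execution means an unsafe step proposed midway through a long trajectory is stopped before it runs, rather than being judged only at the end of the episode.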

ULTRA: Unified Multimodal Control for Autonomous Humanoid Whole-Body Loco-Manipulation

ULTRA：面向自主人形机器人全身移动操作的统一多模态控制

Authors: Xialin He, Sirui Xu, Xinyao Li, Runpei Dong, Liuyu Bian, Yu-Xiong Wang, Liang-Yan Gui
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.03279
Pdf link: https://arxiv.org/pdf/2603.03279
Abstract Achieving autonomous and versatile whole-body loco-manipulation remains a central barrier to making humanoids practically useful. Yet existing approaches are fundamentally constrained: retargeted data are often scarce or low-quality; methods struggle to scale to large skill repertoires; and, most importantly, they rely on tracking predefined motion references rather than generating behavior from perception and high-level task specifications. To address these limitations, we propose ULTRA, a unified framework with two key components. First, we introduce a physics-driven neural retargeting algorithm that translates large-scale motion capture to humanoid embodiments while preserving physical plausibility for contact-rich interactions. Second, we learn a unified multimodal controller that supports both dense references and sparse task specifications, under sensing ranging from accurate motion-capture state to noisy egocentric visual inputs. We distill a universal tracking policy into this controller, compress motor skills into a compact latent space, and apply reinforcement learning finetuning to expand coverage and improve robustness under out-of-distribution scenarios. This enables coordinated whole-body behavior from sparse intent without test-time reference motions. We evaluate ULTRA in simulation and on a real Unitree G1 humanoid. Results show that ULTRA generalizes to autonomous, goal-conditioned whole-body loco-manipulation from egocentric perception, consistently outperforming tracking-only baselines with limited skills.
中文摘要 实现自主且多功能的全身移动操作（loco-manipulation）仍是人形机器人走向实用的核心障碍。然而，现有方法存在根本性的限制：重定向数据往往稀缺或质量低；方法难以扩展到大规模的技能库；最重要的是，它们依赖于跟踪预定义的运动参考，而不是根据感知和高层任务规格生成行为。为了解决这些局限，我们提出了ULTRA，这是一个包含两个关键组成部分的统一框架。首先，我们引入了一种物理驱动的神经重定向算法，将大规模动作捕捉数据转换到人形机器人本体上，同时为接触密集的交互保持物理合理性。其次，我们学习了一个统一的多模态控制器，它同时支持密集参考和稀疏任务规格，感知输入的范围从精确的动作捕捉状态到带噪声的第一人称视觉输入。我们将一个通用跟踪策略蒸馏到该控制器中，将运动技能压缩到紧凑的潜在空间，并应用强化学习微调，以扩大覆盖范围并提升在分布外场景下的鲁棒性。这使得系统能够在测试时无需参考动作的情况下，根据稀疏意图实现协调的全身行为。我们在仿真环境和真实的Unitree G1人形机器人上评估了ULTRA。结果显示，ULTRA能够泛化到基于第一人称感知的自主、目标条件化的全身移动操作，并持续优于技能有限的纯跟踪基线。

Keyword: diffusion policy

There is no result