Arxiv Papers of Today

Generated: 2026-04-01 17:08:34 (UTC+8); arXiv release time: 2026-04-01 20:00 EDT (2026-04-02 08:00 UTC+8)

29 related papers today

Keyword: reinforcement learning

Mitigating Temporal Blindness in Kubernetes Autoscaling: An Attention-Double-LSTM Framework

Authors: Faraz Shaikh, Gianluca Reali, Mauro Femminella
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.28790
Pdf link: https://arxiv.org/pdf/2603.28790
Abstract In the emerging landscape of edge computing, the stochastic and bursty nature of serverless workloads presents a critical challenge for autonomous resource orchestration. Traditional reactive controllers, such as the Kubernetes Horizontal Pod Autoscaler (HPA), suffer from inherent reaction latency, leading to Service Level Objective (SLO) violations during traffic spikes and resource flapping during ramp-downs. While Deep Reinforcement Learning (DRL) offers a pathway toward proactive management, standard agents suffer from temporal blindness, an inability to effectively capture long-term dependencies in non-Markovian edge environments. To bridge this gap, we propose a novel stability-aware autoscaling framework unifying workload forecasting and control via an Attention-Enhanced Double-Stacked LSTM architecture integrated within a Proximal Policy Optimization (PPO) agent. Unlike shallow recurrent models, our approach employs a deep temporal attention mechanism to selectively weight historical states, effectively filtering high-frequency noise while retaining critical precursors of demand shifts. We validate the framework on a heterogeneous cluster using real-world Azure Functions traces. Comparative analysis against industry-standard HPA, stateless Double DQN, and a single-layer LSTM ablation demonstrates that our approach reduces 90th percentile latency by approximately 29% while simultaneously decreasing replica churn by 39%, relative to the single-layer LSTM baseline. These results confirm that mitigating temporal blindness through deep attentive memory is a prerequisite for reliable, low-jitter autoscaling in production edge environments.
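For context on the reactive baseline named in the abstract: the Kubernetes documentation gives the HPA rule as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), with no change when the ratio is within a tolerance (0.1 by default). The following is a minimal Python sketch of that rule only. The function name and the example numbers are illustrative assumptions, and stabilization windows and the paper's own PPO agent are not modeled.

```python
import math


def hpa_desired_replicas(current_replicas, current_metric, target_metric, tolerance=0.1):
    """Reactive HPA-style rule: scale in proportion to the observed metric ratio."""
    ratio = current_metric / target_metric
    # Inside the tolerance band the controller leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# Hypothetical example: 4 replicas at 90% CPU against a 60% target.
print(hpa_desired_replicas(4, 90, 60))
```

Because the rule acts only on a metric that has already risen, replicas are added after the spike is observed, which is the reaction latency the abstract describes.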

Robust Multi-Agent Reinforcement Learning for Small UAS Separation Assurance under GPS Degradation and Spoofing

Authors: Alex Zongo, Filippos Fotiadis, Ufuk Topcu, Peng Wei
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2603.28900
Pdf link: https://arxiv.org/pdf/2603.28900
Abstract We address robust separation assurance for small Unmanned Aircraft Systems (sUAS) under GPS degradation and spoofing via Multi-Agent Reinforcement Learning (MARL). In cooperative surveillance, each aircraft (or agent) broadcasts its GPS-derived position; when such position broadcasts are corrupted, the entire observed air traffic state becomes unreliable. We cast this state observation corruption as a zero-sum game between the agents and an adversary: with probability R, the adversary perturbs the observed state to maximally degrade each agent's safety performance. We derive a closed-form expression for this adversarial perturbation, bypassing adversarial training entirely and enabling linear-time evaluation in the state dimension. We show that this expression approximates the true worst-case adversarial perturbation with second-order accuracy. We further bound the safety performance gap between clean and corrupted observations, showing that it degrades at most linearly with the corruption probability under Kullback-Leibler regularization. Finally, we integrate the closed-form adversarial policy into a MARL policy gradient algorithm to obtain a robust counter-policy for the agents. In a high-density sUAS simulation, we observe near-zero collision rates under corruption levels up to 35%, outperforming a baseline policy trained without adversarial perturbations.

Optimistic Online LQR via Intrinsic Rewards

Authors: Marcell Bartos, Bruce D. Lee, Lenart Treven, Andreas Krause, Florian Dörfler, Melanie N. Zeilinger
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2603.28938
Pdf link: https://arxiv.org/pdf/2603.28938
Abstract Optimism in the face of uncertainty is a popular approach to balance exploration and exploitation in reinforcement learning. Here, we consider the online linear quadratic regulator (LQR) problem, i.e., to learn the LQR corresponding to an unknown linear dynamical system by adapting the control policy online based on closed-loop data collected during operation. In this work, we propose Intrinsic Rewards LQR (IR-LQR), an optimistic online LQR algorithm that applies the idea of intrinsic rewards originating from reinforcement learning and the concept of variance regularization to promote uncertainty-driven exploration. IR-LQR retains the structure of a standard LQR synthesis problem by only modifying the cost function, resulting in an intuitively pleasing, simple, computationally cheap, and efficient algorithm. This is in contrast to existing optimistic online LQR formulations that rely on more complicated iterative search algorithms or solve computationally demanding optimization problems. We show that IR-LQR achieves the optimal worst-case regret rate of $\sqrt{T}$, and compare it to various state-of-the-art online LQR algorithms via numerical experiments carried out on an aircraft pitch angle control and an unmanned aerial vehicle example.
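The "standard LQR synthesis problem" that the abstract says IR-LQR retains can be illustrated with a scalar example. This is a minimal sketch for a known system x' = a·x + b·u with stage cost q·x² + r·u², solved by iterating the discrete-time Riccati equation. The function name and the example values are assumptions, and this is not the authors' algorithm, which additionally has to estimate the unknown dynamics online.

```python
def scalar_lqr(a, b, q, r, iterations=200):
    """Return (P, K) for x' = a*x + b*u with cost q*x**2 + r*u**2 and u = -K*x."""
    p = q  # start the value-function coefficient at the state cost
    for _ in range(iterations):
        # Scalar discrete-time Riccati recursion.
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    k = a * b * p / (r + b * b * p)
    return p, k


# Hypothetical example system: an integrator with unit costs.
p, k = scalar_lqr(a=1.0, b=1.0, q=1.0, r=1.0)
print(p, k)
```

An approach that changes only the cost would adjust q and r before calling the same solver. The abstract does not give the specific modification IR-LQR uses, so none is shown here.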

A Pontryagin Method of Model-based Reinforcement Learning via Hamiltonian Actor-Critic

Authors: Chengyang Gu, Yuxin Pan, Hui Xiong, Yize Chen
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.28971
Pdf link: https://arxiv.org/pdf/2603.28971
Abstract Model-based reinforcement learning (MBRL) improves sample efficiency by leveraging learned dynamics models for policy optimization. However, the effectiveness of methods such as actor-critic is often limited by compounding model errors, which degrade long-horizon value estimation. Existing approaches, such as Model-Based Value Expansion (MVE), partially mitigate this issue through multi-step rollouts, but remain sensitive to rollout horizon selection and residual model bias. Motivated by the Pontryagin Maximum Principle (PMP), we propose Hamiltonian Actor-Critic (HAC), a model-based approach that eliminates explicit value function learning by directly optimizing a Hamiltonian defined over the learned dynamics and reward for deterministic systems. By avoiding value approximation, HAC reduces sensitivity to model errors while admitting convergence guarantees. Extensive experiments on continuous control benchmarks, in both online and offline RL settings, demonstrate that HAC outperforms model-free and MVE-based baselines in control performance, convergence speed, and robustness to distributional shift, including out-of-distribution (OOD) scenarios. In offline settings with limited data, HAC matches or exceeds state-of-the-art methods, highlighting its strong sample efficiency.

Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning

Authors: Bilgehan Sel, Xuanli He, Alwin Peng, Ming Jin, Jerry Wei
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.29038
Pdf link: https://arxiv.org/pdf/2603.29038
Abstract Fine-tuning APIs offered by major AI providers create new attack surfaces where adversaries can bypass safety measures through targeted fine-tuning. We introduce Trojan-Speak, an adversarial fine-tuning method that bypasses Anthropic's Constitutional Classifiers. Our approach uses curriculum learning combined with GRPO-based hybrid reinforcement learning to teach models a communication protocol that evades LLM-based content classification. Crucially, while prior adversarial fine-tuning approaches report more than 25% capability degradation on reasoning benchmarks, Trojan-Speak incurs less than 5% degradation while achieving 99+% classifier evasion for models with 14B+ parameters. We demonstrate that fine-tuned models can provide detailed responses to expert-level CBRN (Chemical, Biological, Radiological, and Nuclear) queries from Anthropic's Constitutional Classifiers bug-bounty program. Our findings reveal that LLM-based content classifiers alone are insufficient for preventing dangerous information disclosure when adversaries have fine-tuning access, and we show that activation-level probes can substantially improve robustness to such attacks.

Realistic Market Impact Modeling for Reinforcement Learning Trading Environments

Authors: Lucas Riera Abbade, Anna Helena Reali Costa
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Arxiv link: https://arxiv.org/abs/2603.29086
Pdf link: https://arxiv.org/pdf/2603.29086
Abstract Reinforcement learning (RL) has shown promise for trading, yet most open-source backtesting environments assume negligible or fixed transaction costs, causing agents to learn trading behaviors that fail under realistic execution. We introduce three Gymnasium-compatible trading environments -- MACE (Market-Adjusted Cost Execution) stock trading, margin trading, and portfolio optimization -- that integrate nonlinear market impact models grounded in the Almgren-Chriss framework and the empirically validated square-root impact law. Each environment provides pluggable cost models, permanent impact tracking with exponential decay, and comprehensive trade-level logging. We evaluate five DRL algorithms (A2C, PPO, DDPG, SAC, TD3) on the NASDAQ-100, comparing a fixed 10 bps baseline against the AC model with Optuna-tuned hyperparameters. Our results show that (i) the cost model materially changes both absolute performance and the relative ranking of algorithms across all three environments; (ii) the AC model produces dramatically different trading behavior, e.g., daily costs dropping from $200k to $8k with turnover falling from 19% to 1%; (iii) hyperparameter optimization is essential for constraining pathological trading, with costs dropping up to 82%; and (iv) algorithm-cost model interactions are strongly environment-specific, e.g., DDPG's OOS Sharpe jumps from -2.1 to 0.3 under AC in margin trading while SAC's drops from -0.5 to -1.2. We release the full suite as an open-source extension to FinRL-Meta.
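To make the contrast between a fixed-cost baseline and a nonlinear impact model concrete, here is a minimal Python sketch. It compares a flat fee in basis points with the square-root impact law in its common textbook form, relative impact ≈ Y · σ · sqrt(Q / V). The function names, the coefficient `y`, and all example numbers are assumptions. The environments' actual cost models, including permanent impact with exponential decay, are not reproduced here.

```python
import math


def fixed_cost(notional, bps=10.0):
    """Flat transaction cost: a constant number of basis points of traded notional."""
    return notional * bps / 1e4


def sqrt_impact_cost(notional, shares, daily_volume, daily_volatility, y=1.0):
    """Square-root law: relative impact grows with sqrt(participation in daily volume)."""
    relative_impact = y * daily_volatility * math.sqrt(shares / daily_volume)
    return notional * relative_impact


# Hypothetical order: $1M notional, 10k shares, 1M shares daily volume, 2% daily volatility.
print(fixed_cost(1_000_000))
print(sqrt_impact_cost(1_000_000, 10_000, 1_000_000, 0.02))
```

Under the flat fee, cost per dollar traded is the same at any order size. Under the square-root law it rises with participation, which is consistent with the abstract's report that agents trade far less under the AC model.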

MemRerank: Preference Memory for Personalized Product Reranking

Authors: Zhiyuan Peng, Xuyang Wu, Huaixiao Tou, Yi Fang, Yi Gong
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.29247
Pdf link: https://arxiv.org/pdf/2603.29247
Abstract LLM-based shopping agents increasingly rely on long purchase histories and multi-turn interactions for personalization, yet naively appending raw history to prompts is often ineffective due to noise, length, and relevance mismatch. We propose MemRerank, a preference memory framework that distills user purchase history into concise, query-independent signals for personalized product reranking. To study this problem, we build an end-to-end benchmark and evaluation framework centered on an LLM-based 1-in-5 selection task, which measures both memory quality and downstream reranking utility. We further train the memory extractor with reinforcement learning (RL), using downstream reranking performance as supervision. Experiments with two LLM-based rerankers show that MemRerank consistently outperforms no-memory, raw-history, and off-the-shelf memory baselines, yielding up to +10.61 absolute points in 1-in-5 accuracy. These results suggest that explicit preference memory is a practical and effective building block for personalization in agentic e-commerce systems.

Downsides of Smartness Across Edge-Cloud Continuum in Modern Industry

Authors: Akhil Gupta Chigullapally, Sharvan Vittala, Razin Farhan Hussian, Mohsen Amini Salehi
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2603.29289
Pdf link: https://arxiv.org/pdf/2603.29289
Abstract The fast pace of modern AI is rapidly transforming traditional industrial systems into vast, intelligent and potentially unmanned autonomous operational environments driven by AI-based solutions. These solutions leverage various forms of machine learning, reinforcement learning, and generative AI. The introduction of such smart capabilities has pushed the envelope in multiple industrial domains, enabling predictive maintenance, optimized performance, and streamlined workflows. These solutions are often deployed across the Industrial Internet of Things (IIoT) and supported by the Edge-Fog-Cloud computing continuum to enable urgent (i.e., real-time or near real-time) decision-making. Despite the current trend of aggressively adopting these smart industrial solutions to increase profit, quality, and efficiency, large-scale integration and deployment also bring serious hazards that if ignored can undermine the benefits of smart industries. These hazards include unforeseen interoperability side-effects and heightened vulnerability to cyber threats, particularly in environments operating with a plethora of heterogeneous IIoT systems. The goal of this study is to shed light on the potential consequences of industrial smartness, with a particular focus on security implications, including vulnerabilities, side effects, and cyber threats. We distinguish software-level downsides stemming from both traditional AI solutions and generative AI from those originating in the infrastructure layer, namely IIoT and the Edge-Cloud continuum. At each level, we investigate potential vulnerabilities, cyber threats, and unintended side effects. As industries continue to become smarter, understanding and addressing these downsides will be crucial to ensure secure and sustainable development of smart industrial systems.

Scaling Whole-Body Human Musculoskeletal Behavior Emulation for Specificity and Diversity

Authors: Yunyue Wei, Chenhui Zuo, Shanning Zhuang, Haixin Gong, Yaming Liu, Yanan Sui
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.29332
Pdf link: https://arxiv.org/pdf/2603.29332
Abstract The embodied learning of human motor control requires whole-body neuro-actuated musculoskeletal dynamics, while the internal muscle-driven processes underlying movement remain inaccessible to direct measurement. Computational modeling offers an alternative, but inverse dynamics methods struggled to resolve redundant control from observed kinematics in the high-dimensional, over-actuated system. Forward imitation approaches based on deep reinforcement learning exhibited inadequate tracking performance due to the curse of dimensionality in both control and reward design. Here we introduce a large-scale parallel musculoskeletal computation framework for biomechanically grounded whole-body motion reproduction. By integrating large-scale parallel GPU simulation with adversarial reward aggregation and value-guided flow exploration, the MS-Emulator framework overcomes key optimization bottlenecks in high-dimensional reinforcement learning for musculoskeletal control, which accurately reproduces a broad repertoire of motions in a whole-body human musculoskeletal system actuated by approximately 700 muscles. It achieved high joint angle accuracy and body position alignment for highly dynamic tasks such as dance, cartwheel, and backflip. The framework was also used to explore the musculoskeletal control solution space, identifying distinct musculoskeletal control policies that converge to nearly identical external kinematic and mechanical measurements. This work establishes a tractable computational route to analyzing the specificity and diversity underlying human embodied control of movement.

AP-DRL: A Synergistic Algorithm-Hardware Framework for Automatic Task Partitioning of Deep Reinforcement Learning on Versal ACAP

Authors: Enlai Li, Zhe Lin, Sharad Sinha, Wei Zhang
Subjects: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.29369
Pdf link: https://arxiv.org/pdf/2603.29369
Abstract Deep reinforcement learning has demonstrated remarkable success across various domains. However, the tight coupling between training and inference processes makes accelerating DRL training an essential challenge for DRL optimization. Two key issues hinder efficient DRL training: (1) the significant variation in computational intensity across different DRL algorithms and even among operations within the same algorithm complicates hardware platform selection, while (2) DRL's wide dynamic range could lead to substantial reward errors with conventional FP16+FP32 mixed-precision quantization. While existing work has primarily focused on accelerating DRL for specific computing units or optimizing inference-stage quantization, we propose AP-DRL to address the above challenges. AP-DRL is an automatic task partitioning framework that harnesses the heterogeneous architecture of AMD Versal ACAP (integrating CPUs, FPGAs, and AI Engines) to accelerate DRL training through intelligent hardware-aware optimization. Our approach begins with bottleneck analysis of CPU, FPGA, and AIE performance across diverse DRL workloads, informing the design principles for AP-DRL's inter-component task partitioning and quantization optimization. The framework then addresses the challenge of platform selection through design space exploration-based profiling and ILP-based partitioning models that match operations to optimal computing units based on their computational characteristics. For the quantization challenge, AP-DRL employs a hardware-aware algorithm coordinating FP32 (CPU), FP16 (FPGA/DSP), and BF16 (AI Engine) operations by leveraging Versal ACAP's native support for these precision formats. Comprehensive experiments indicate that AP-DRL can achieve speedup of up to 4.17$\times$ over programmable logic and up to 3.82$\times$ over AI Engine baselines while maintaining training convergence.

Multi-AUV Cooperative Target Tracking Based on Supervised Diffusion-Aided Multi-Agent Reinforcement Learning

Authors: Jiaao Ma, Chuan Lin, Guangjie Han, Shengchao Zhu, Zhenyu Wang, Chen An
Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.29426
Pdf link: https://arxiv.org/pdf/2603.29426
Abstract In recent years, advances in underwater networking and multi-agent reinforcement learning (MARL) have significantly expanded multi-autonomous underwater vehicle (AUV) applications in marine exploration and target tracking. However, current MARL-driven cooperative tracking faces three critical challenges: 1) non-stationarity in decentralized coordination, where local policy updates destabilize teammates' observation spaces, preventing convergence; 2) sparse-reward exploration inefficiency from limited underwater visibility and constrained sensor ranges, causing high-variance learning; and 3) water disturbance fragility combined with handcrafted reward dependency that degrades real-world robustness under unmodeled hydrodynamic conditions. To address these challenges, this paper proposes a hierarchical MARL architecture comprising four layers: global training scheduling, multi-agent coordination, local decision-making, and real-time execution. This architecture optimizes task allocation and inter-AUV coordination through hierarchical decomposition. Building on this foundation, we propose the Supervised Diffusion-Aided MARL (SDA-MARL) algorithm featuring three innovations: 1) a dual-decision architecture with segregated experience pools mitigating nonstationarity through structured experience replay; 2) a supervised learning mechanism guiding the diffusion model's reverse denoising process to generate high-fidelity training samples that accelerate convergence; and 3) disturbance-robust policy learning incorporating behavioral cloning loss to guide the Deep Deterministic Policy Gradient network update using high-quality replay actions, eliminating handcrafted reward dependency. The tracking algorithm based on SDA-MARL proposed in this paper achieves superior precision compared to state-of-the-art methods in comprehensive underwater simulations.

Calibrated Confidence Expression for Radiology Report Generation

Authors: David Bani-Harouni, Chantal Pellegrini, Julian Lüers, Su Hwan Kim, Markus Baalmann, Benedikt Wiestler, Rickmer Braren, Nassir Navab, Matthias Keicher
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.29492
Pdf link: https://arxiv.org/pdf/2603.29492
Abstract Safe deployment of Large Vision-Language Models (LVLMs) in radiology report generation requires not only accurate predictions but also clinically interpretable indicators of when outputs should be thoroughly reviewed, enabling selective radiologist verification and reducing the risk of hallucinated findings influencing clinical decisions. One intuitive approach to this is verbalized confidence, where the model explicitly states its certainty. However, current state-of-the-art language models are often overconfident, and research on calibration in multimodal settings such as radiology report generation is limited. To address this gap, we introduce ConRad (Confidence Calibration for Radiology Reports), a reinforcement learning framework for fine-tuning medical LVLMs to produce calibrated verbalized confidence estimates alongside radiology reports. We study two settings: a single report-level confidence score and a sentence-level variant assigning a confidence to each claim. Both are trained using the GRPO algorithm with reward functions based on the logarithmic scoring rule, which incentivizes truthful self-assessment by penalizing miscalibration and guarantees optimal calibration under reward maximization. Experimentally, ConRad substantially improves calibration and outperforms competing methods. In a clinical evaluation we show that ConRad's report level scores are well aligned with clinicians' judgment. By highlighting full reports or low-confidence statements for targeted review, ConRad can support safer clinical integration of AI-assistance for report generation.
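The calibration claim rests on the logarithmic scoring rule being a proper scoring rule: the expected reward is highest when the stated confidence equals the true probability of being correct. The following is a minimal Python sketch of that property. The function names, the clipping constant `eps`, and the grid search are illustrative assumptions, not ConRad's actual reward implementation.

```python
import math


def log_score_reward(confidence, correct, eps=1e-6):
    """Logarithmic scoring rule: log(p) if the claim is correct, log(1 - p) otherwise."""
    p = min(max(confidence, eps), 1.0 - eps)  # clip to keep the logarithm finite
    return math.log(p) if correct else math.log(1.0 - p)


def expected_reward(confidence, true_accuracy):
    """Expected reward when the claim is correct with probability true_accuracy."""
    return (true_accuracy * log_score_reward(confidence, True)
            + (1.0 - true_accuracy) * log_score_reward(confidence, False))


def best_confidence(true_accuracy):
    """Grid-search the stated confidence that maximizes the expected reward."""
    grid = [i / 100 for i in range(1, 100)]
    return max(grid, key=lambda c: expected_reward(c, true_accuracy))


# Hypothetical model that is right 70% of the time.
print(best_confidence(0.7))
```

Stating more or less confidence than the true accuracy lowers the expected reward. This is the sense in which the rule penalizes miscalibration.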

MemFactory: Unified Inference & Training Framework for Agent Memory

Authors: Ziliang Guo, Ziheng Li, Zhiyu Li
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.29493
Pdf link: https://arxiv.org/pdf/2603.29493
Abstract Memory-augmented Large Language Models (LLMs) are essential for developing capable, long-term AI agents. Recently, applying Reinforcement Learning (RL) to optimize memory operations, such as extraction, updating, and retrieval, has emerged as a highly promising research direction. However, existing implementations remain highly fragmented and task-specific, lacking a unified infrastructure to streamline the integration, training, and evaluation of these complex pipelines. To address this gap, we present MemFactory, the first unified, highly modular training and inference framework specifically designed for memory-augmented agents. Inspired by the success of unified fine-tuning frameworks like LLaMA-Factory, MemFactory abstracts the memory lifecycle into atomic, plug-and-play components, enabling researchers to seamlessly construct custom memory agents via a "Lego-like" architecture. Furthermore, the framework natively integrates Group Relative Policy Optimization (GRPO) to fine-tune internal memory management policies driven by multi-dimensional environmental rewards. MemFactory provides out-of-the-box support for recent cutting-edge paradigms, including Memory-R1, RMM, and MemAgent. We empirically validate MemFactory on the open-source MemAgent architecture using its publicly available training and evaluation data. Across both in-domain and out-of-distribution evaluation sets, MemFactory consistently improves performance over the corresponding base models, with relative gains of up to 14.8%. By providing a standardized, extensible, and easy-to-use infrastructure, MemFactory significantly lowers the barrier to entry, paving the way for future innovations in memory-driven AI agents.
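The core of GRPO is its group-relative advantage. Several responses are sampled for the same prompt, and each response's reward is normalized by the group's mean and standard deviation, so no learned value function is needed. The following is a minimal Python sketch of that normalization step only. The function name, the use of the population standard deviation, and `eps` are assumptions, and nothing here reflects MemFactory's actual API.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample receives the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled memory-update trajectories.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When all samples in a group score the same, every advantage is zero and that group contributes no gradient signal.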

Learning to Generate Formally Verifiable Step-by-Step Logic Reasoning via Structured Formal Intermediaries

Authors: Luoxin Chen, Yichi Zhou, Huishuai Zhang
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.29500
Pdf link: https://arxiv.org/pdf/2603.29500
Abstract Large language models (LLMs) have recently demonstrated impressive performance on complex, multi-step reasoning tasks, especially when post-trained with outcome-rewarded reinforcement learning (Guo et al., 2025). However, it has been observed that outcome rewards often overlook flawed intermediate steps, leading to unreliable reasoning steps even when final answers are correct. To address this unreliable reasoning, we propose PRoSFI (Process Reward over Structured Formal Intermediates), a novel reward method that enhances reasoning reliability without compromising accuracy. Instead of generating formal proofs directly, which is rarely feasible for a modest-sized (7B) model, the model outputs structured intermediate steps aligned with its natural language reasoning. Each step is then verified by a formal prover. Only fully validated reasoning chains receive high rewards. The integration of formal verification guides the model towards generating step-by-step machine-checkable proofs, thereby yielding more credible final answers. PRoSFI offers a simple and effective approach to training trustworthy reasoning models.
中文摘要 大型语言模型（LLMs）最近在复杂的多步推理任务中表现出色，尤其是在经过结果奖励强化学习的后训练之后（Guo 等人，2025）。然而，已有观察表明，结果奖励常常忽视有缺陷的中间步骤，导致即使最终答案正确，推理步骤也不可靠。为解决这种不可靠的推理，我们提出了PRoSFI（Process Reward over Structured Formal Intermediates，结构化形式中间体上的过程奖励），这是一种新颖的奖励方法，能在不牺牲准确性的情况下提升推理可靠性。模型并不直接生成形式证明（这对中等规模的7B模型而言很少可行），而是输出与其自然语言推理相对应的结构化中间步骤。每一步随后由形式证明器进行验证。只有经过完全验证的推理链才能获得高奖励。形式验证的引入引导模型生成可由机器逐步检查的证明，从而得到更可信的最终答案。PRoSFI提供了一种简单有效的方法来训练可信的推理模型。
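The reward rule described in this abstract (a high reward only when every intermediate step passes verification) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `chain_reward` and `toy_verifier`, the reward values `full` and `partial`, and the toy arithmetic checker that stands in for a formal prover are all assumptions made here.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical structured step: (a, b, claimed_sum). A real system would
# emit statements in a prover's input language instead.
Step = Tuple[int, int, int]


def toy_verifier(step: Step) -> bool:
    """Stand-in for a formal prover: checks a single arithmetic step."""
    a, b, claimed = step
    return a + b == claimed


def chain_reward(
    steps: Sequence[Step],
    verify_step: Callable[[Step], bool],
    answer_correct: bool,
    full: float = 1.0,      # assumed value for a fully verified chain
    partial: float = 0.1,   # assumed value for a correct but unverified answer
) -> float:
    """Give the high reward only if the answer is right AND every step verifies."""
    fully_verified = len(steps) > 0 and all(verify_step(s) for s in steps)
    if answer_correct and fully_verified:
        return full
    if answer_correct:
        return partial
    return 0.0


good_chain = [(2, 3, 5), (5, 4, 9)]    # every step checks out
flawed_chain = [(2, 3, 6), (6, 3, 9)]  # right final value, wrong first step

print(chain_reward(good_chain, toy_verifier, answer_correct=True))
print(chain_reward(flawed_chain, toy_verifier, answer_correct=True))
```

The flawed chain reaches the correct final value through an invalid step, which is the case an outcome-only reward cannot distinguish from the good chain.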

Target-Aligned Reinforcement Learning

目标对齐强化学习

Authors: Leonard S. Pleiss, James Harrison, Maximilian Schiffer
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.29501
Pdf link: https://arxiv.org/pdf/2603.29501
Abstract Many reinforcement learning algorithms rely on target networks - lagged copies of the online network - to stabilize training. While effective, this mechanism introduces a fundamental stability-recency tradeoff: slower target updates improve stability but reduce the recency of learning signals, hindering convergence speed. We propose Target-Aligned Reinforcement Learning (TARL), a framework that emphasizes transitions for which the target and online network estimates are highly aligned. By focusing updates on well-aligned targets, TARL mitigates the adverse effects of stale target estimates while retaining the stabilizing benefits of target networks. We provide a theoretical analysis demonstrating that target alignment correction accelerates convergence, and empirically demonstrate consistent improvements over standard reinforcement learning algorithms across various benchmark environments.
中文摘要 许多强化学习算法依赖目标网络（在线网络的滞后副本）来稳定训练。该机制虽然有效，但引入了一个根本性的稳定性与时效性权衡：较慢的目标更新提升稳定性，但降低学习信号的时效性，从而拖慢收敛速度。我们提出了目标对齐强化学习（TARL）框架，强调目标网络与在线网络估计高度对齐的转移样本。通过将更新集中于对齐良好的目标，TARL减轻了陈旧目标估计的负面影响，同时保留了目标网络的稳定化优势。我们提供了理论分析，证明目标对齐校正可以加速收敛，并通过实验表明，在多种基准环境中相较标准强化学习算法均有一致的改进。
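One plausible reading of "emphasizing transitions whose target and online estimates are aligned" is a per-transition weight that shrinks as the two estimates diverge. The sketch below illustrates that idea only. The abstract does not specify TARL's weighting rule, so the exponential form, the `temperature` parameter, and both function names are assumptions.

```python
import math
from typing import List, Sequence


def alignment_weights(
    online_estimates: Sequence[float],
    target_estimates: Sequence[float],
    temperature: float = 1.0,  # assumed knob: smaller values emphasize alignment more
) -> List[float]:
    """Normalized weights that decay with the online/target estimate gap."""
    gaps = [abs(o - t) for o, t in zip(online_estimates, target_estimates)]
    raw = [math.exp(-g / temperature) for g in gaps]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_td_loss(
    q_values: Sequence[float],
    td_targets: Sequence[float],
    weights: Sequence[float],
) -> float:
    """Squared TD error, with each transition scaled by its alignment weight."""
    return sum(w * (q - y) ** 2 for q, y, w in zip(q_values, td_targets, weights))


# Three transitions with increasingly stale target estimates.
online = [1.0, 2.0, 3.0]
target = [1.0, 2.5, 5.0]
w = alignment_weights(online, target)
print(w)
print(weighted_td_loss(q_values=[0.8, 1.9, 2.0], td_targets=[1.0, 2.4, 4.5], weights=w))
```

Under this weighting the transition with the largest online/target gap contributes least to the loss, which is the qualitative behavior the abstract describes.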

Learning Diagnostic Reasoning for Decision Support in Toxicology

在毒理学中学习诊断推理以支持决策

Authors: Nico Oberländer, David Bani-Harouni, Tobias Zellner, Nassir Navab, Florian Eyer, Matthias Keicher
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.29608
Pdf link: https://arxiv.org/pdf/2603.29608
Abstract Acute poly-substance intoxication requires rapid, life-saving decisions under substantial uncertainty, as clinicians must rely on incomplete ingestion details and nonspecific symptoms. Effective diagnostic reasoning in this chaotic environment requires fusing unstructured, non-medical narratives (e.g. paramedic scene descriptions and unreliable patient self-reports or known histories), with structured medical data like vital signs. While Large Language Models (LLMs) show potential for processing such heterogeneous inputs, they struggle in this setting, often underperforming simple baselines that rely solely on patient histories. To address this, we present DeToxR (Decision-support for Toxicology with Reasoning), the first adaptation of Reinforcement Learning (RL) to emergency toxicology. We design a robust data-fusion engine for multi-label prediction across 14 substance classes based on an LLM finetuned with Group Relative Policy Optimization (GRPO). We optimize the model's reasoning directly using a clinical performance reward. By formulating a multi-label agreement metric as the reward signal, the model is explicitly penalized for missing co-ingested substances and hallucinating absent poisons. Our model significantly outperforms its unadapted base LLM counterpart and supervised baselines. Furthermore, in a clinical validation study, the model indicates a clinical advantage by outperforming an expert toxicologist in identifying the correct poisons (Micro-F1: 0.644 vs. 0.473). These results demonstrate the potential of RL-aligned LLMs to synthesize unstructured pre-clinical narratives and structured medical data for decision support in high-stakes environments.
中文摘要 急性多物质中毒需要在高度不确定的情况下做出迅速且挽救生命的决策，因为临床医生必须依赖不完整的摄入细节和非特异性症状。在这种混乱环境中进行有效的诊断推理，需要将非结构化的非医学叙述（如急救人员的现场描述、不可靠的患者自述或已知病史）与生命体征等结构化医学数据相融合。虽然大型语言模型（LLMs）在处理此类异质输入方面显示出潜力，但它们在这一场景中表现不佳，常常不及仅依赖患者病史的简单基线。为此，我们提出了DeToxR（Decision-support for Toxicology with Reasoning，带推理的毒理学决策支持），这是强化学习（RL）在急诊毒理学中的首次应用。我们基于经过群体相对策略优化（GRPO）微调的大型语言模型，设计了一个稳健的数据融合引擎，用于对14个物质类别进行多标签预测。我们使用临床性能奖励直接优化模型的推理。通过将多标签一致性指标作为奖励信号，模型会因漏掉共同摄入的物质和臆造不存在的毒物而受到明确惩罚。我们的模型显著优于未经适配的基础LLM和监督基线。此外，在一项临床验证研究中，该模型在识别正确毒物方面优于一位毒理学专家（Micro-F1：0.644对0.473），显示出临床优势。这些结果展示了经强化学习对齐的大型语言模型在高风险环境中综合非结构化临床前叙述和结构化医疗数据以支持决策的潜力。
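A multi-label agreement reward of the kind mentioned in this abstract can be illustrated with a per-case F1 score between the predicted and the true substance sets. This is a sketch under assumptions: the abstract does not say which agreement metric is used as the reward, and the function name and the substance labels below are hypothetical.

```python
from typing import Iterable


def set_f1_reward(predicted: Iterable[str], actual: Iterable[str]) -> float:
    """F1 between predicted and true label sets for one case.

    Missed substances (false negatives) and hallucinated substances
    (false positives) both lower the score.
    """
    predicted_set, actual_set = set(predicted), set(actual)
    if not predicted_set and not actual_set:
        return 1.0  # nothing to find and nothing predicted
    true_pos = len(predicted_set & actual_set)
    false_pos = len(predicted_set - actual_set)
    false_neg = len(actual_set - predicted_set)
    return 2 * true_pos / (2 * true_pos + false_pos + false_neg)


truth = {"opioid", "ethanol"}  # hypothetical substance classes
print(set_f1_reward({"opioid", "ethanol"}, truth))             # exact match
print(set_f1_reward({"opioid"}, truth))                        # missed co-ingestion
print(set_f1_reward({"opioid", "ethanol", "cocaine"}, truth))  # hallucinated class
```

Both kinds of error reduce the reward, so a policy trained against it is pushed to report every co-ingested substance without padding the prediction.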

ASI-Evolve: AI Accelerates AI

ASI-Evolve：人工智能加速人工智能

Authors: Weixian Xu, Tiantian Mi, Yixiu Liu, Yang Nan, Zhimeng Zhou, Lyumanshan Ye, Lin Zhang, Yu Qiao, Pengfei Liu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.29640
Pdf link: https://arxiv.org/pdf/2603.29640
Abstract Can AI accelerate the development of AI itself? While recent agentic systems have shown strong performance on well-scoped tasks with rapid feedback, it remains unclear whether they can tackle the costly, long-horizon, and weakly supervised research loops that drive real AI progress. We present ASI-Evolve, an agentic framework for AI-for-AI research that closes this loop through a learn-design-experiment-analyze cycle. ASI-Evolve augments standard evolutionary agents with two key components: a cognition base that injects accumulated human priors into each round of exploration, and a dedicated analyzer that distills complex experimental outcomes into reusable insights for future iterations. To our knowledge, ASI-Evolve is the first unified framework to demonstrate AI-driven discovery across three central components of AI development: data, architectures, and learning algorithms. In neural architecture design, it discovered 105 SOTA linear attention architectures, with the best discovered model surpassing DeltaNet by +0.97 points, nearly 3x the gain of recent human-designed improvements. In pretraining data curation, the evolved pipeline improves average benchmark performance by +3.96 points, with gains exceeding 18 points on MMLU. In reinforcement learning algorithm design, discovered algorithms outperform GRPO by up to +12.5 points on AMC32, +11.67 points on AIME24, and +5.04 points on OlympiadBench. We further provide initial evidence that this AI-for-AI paradigm can transfer beyond the AI stack through experiments in mathematics and biomedicine. Together, these results suggest that ASI-Evolve represents a promising step toward enabling AI to accelerate AI across the foundational stages of development, offering early evidence for the feasibility of closed-loop AI research.
中文摘要 人工智能能否加速人工智能本身的发展？尽管近期的代理系统在明确且反馈快速的任务中表现出强劲表现，但它们是否能够应对推动真实人工智能进步的昂贵、长远且监督薄弱的研究循环，仍不清楚。我们介绍ASI-Evolve，一个用于人工智能对人工智能研究的代理框架，通过学习-设计-实验-分析的循环闭合了这一循环。ASI-Evolve 在标准进化代理基础上增加了两个关键组成部分：一个认知基础，将积累的人类先验注入每轮探索中;另一个专门分析器，将复杂实验结果提炼为可重复使用的洞见，供未来迭代使用。据我们所知，ASI-Evolve 是首个统一框架，展示 AI 驱动的发现，涵盖 AI 开发的三个核心组成部分：数据、架构和学习算法。在神经架构设计中，它发现了105种SOTA线性注意力架构，其中发现最好的模型比DeltaNet高出+0.97个百分点，几乎是近期人工设计改进的3倍。在预训练数据管理中，进化流水线平均基准性能提升+3.96分，MMLU提升超过18分。在强化学习算法设计中，发现的算法在AMC32上比GRPO高出最多+12.5分，AIME24上高出+11.67分，OlympiadBench上高出+5.04分。我们还提供了初步证据，表明这种人工智能对人工智能的范式可以通过数学和生物医学的实验超越人工智能堆栈。综合来看，这些结果表明ASI-Evolve是推动人工智能加速人工智能在基础发展阶段的有力一步，为闭环AI研究的可行性提供了早期证据。

6GAgentGym: Tool Use, Data Synthesis, and Agentic Learning for Network Management

6GAgentGym：工具使用、数据综合与网络管理中的代理学习

Authors: Jiao Chen, Jianhua Tang, Xiaotong Yang, Zuohong Lv
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.29656
Pdf link: https://arxiv.org/pdf/2603.29656
Abstract Autonomous 6G network management requires agents that can execute tools, observe the resulting state changes, and adapt their decisions accordingly. Existing benchmarks based on static questions or scripted episode replay, however, do not support such closed-loop interaction, limiting agents to passive evaluation without the ability to learn from environmental feedback. This paper presents 6GAgentGym to provide closed-loop capability. The framework provides an interactive environment with 42 typed tools whose effect classification distinguishes read-only observation from state-mutating configuration, backed by a learned Experiment Model calibrated on NS-3 simulation data. 6G-Forge bootstraps closed-loop training trajectories from NS-3 seeds via iterative Self-Instruct generation with execution verification against the Experiment Model. Supervised fine-tuning on the resulting corpus followed by reinforcement learning with online closed-loop interaction enables an 8B open-source model to achieve comparable overall success rate to GPT-5 on the accompanying 6GAgentBench, with stronger performance on long-horizon tasks. Together, these components provide a viable path toward autonomous, closed-loop network management.
中文摘要 自主6G网络管理需要能够执行工具、观察由此产生的状态变化并相应调整决策的智能体。然而，基于静态问题或脚本化回合重放的现有基准不支持这种闭环交互，使智能体只能被动接受评估，无法从环境反馈中学习。本文提出6GAgentGym以提供闭环能力。该框架提供了一个交互式环境，包含42个类型化工具，其效果分类区分了只读观测与改变状态的配置，并由一个基于NS-3仿真数据校准的学习型实验模型（Experiment Model）提供支撑。6G-Forge通过迭代式Self-Instruct生成并以实验模型进行执行验证，从NS-3种子出发引导生成闭环训练轨迹。先在所得语料库上进行监督微调，再通过在线闭环交互进行强化学习，使一个8B开源模型在配套的6GAgentBench上取得与GPT-5相当的整体成功率，并在长程任务上表现更强。这些组件共同为实现自主闭环网络管理提供了一条可行路径。

Reinforced Reasoning for End-to-End Retrosynthetic Planning

强化端到端逆合成规划的推理

Authors: Chenyang Zuo, Siqi Fan, Yizhen Luo, Zaiqing Nie
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.29723
Pdf link: https://arxiv.org/pdf/2603.29723
Abstract Retrosynthetic planning is a fundamental task in organic chemistry, yet remains challenging due to its combinatorial complexity. To address this, conventional approaches typically rely on hybrid frameworks that combine single-step predictions with external search heuristics, inevitably fracturing the logical coherence between local molecular transformations and global planning objectives. To bridge this gap and embed sophisticated strategic foresight directly into the model's chemical reasoning, we introduce ReTriP, an end-to-end generative framework that reformulates retrosynthesis as a direct Chain-of-Thought reasoning task. We establish a path-coherent molecular representation and employ a progressive training curriculum that transitions from reasoning distillation to reinforcement learning with verifiable rewards, effectively aligning stepwise generation with practical route utility. Empirical evaluation on RetroBench demonstrates that ReTriP achieves state-of-the-art performance, exhibiting superior robustness in long-horizon planning compared to hybrid baselines.
中文摘要 逆合成规划是有机化学中的一项基础任务，但由于其组合复杂性，仍具有挑战性。为解决这一问题，传统方法通常依赖将单步预测与外部搜索启发式相结合的混合框架，不可避免地割裂了局部分子转化与全局规划目标之间的逻辑一致性。为了弥合这一差距，并将复杂的战略前瞻直接嵌入模型的化学推理中，我们提出了ReTriP，一个端到端生成框架，将逆合成重新表述为直接的思维链推理任务。我们建立了路径连贯的分子表征，并采用渐进式训练课程，从推理蒸馏过渡到带可验证奖励的强化学习，有效地使逐步生成与实际路线效用保持一致。在RetroBench上的实证评估表明，ReTriP达到了最先进的性能，在长程规划中表现出优于混合基线的鲁棒性。

Friends, Foes, and First Authors: A Game Theory Model of How Power Plays Rewrite Academic Co-Authorship Networks

朋友、敌人与第一作者：权力游戏如何重写学术合著网络的博弈论模型

Authors: Amit Bengal, Teddy Lazebnik
Subjects: Social and Information Networks (cs.SI)
Arxiv link: https://arxiv.org/abs/2603.29834
Pdf link: https://arxiv.org/pdf/2603.29834
Abstract Scientific research increasingly depends on multi-author collaboration, yet the systems used to allocate authorship credit remain vulnerable to conflict, strategic behavior, and project breakdown. Although prior work has shown that authors may rationally issue ultimatums over authorship order within a single manuscript, much less is known about how such behavior unfolds over repeated collaborations embedded in evolving academic networks. In this study, we develop a repeated, networked game-theoretic model of co-authorship in which researchers form collaborations over time, accumulate reputation through an evolving friendship network, and, in a subset of cases, learn strategic behavior through deep reinforcement learning. Using large-scale agent-based simulations, we compare myopic and forward-looking authors across mixed populations. We find that strategic agents do not raise fewer ultimatums than greedy agents, but instead learn to avoid insisting after rejection, thereby eliminating destructive manuscript termination. As strategic prevalence increases, paper destruction falls from 0.120 to 0.000 per paper, completion rates rise from 0.853 to 0.970, and average completed papers per agent increase from 15.2 to 16.9. Strategic agents also obtain a substantial utility advantage, reaching 30.8% when rare, while overall inequality remains stable. These results suggest that reputational feedback and long-term incentives can make academic collaboration more resilient, offering a computational testbed for designing fairer and more productive authorship policies.
中文摘要 科学研究越来越依赖多作者合作，但用于分配作者署名的制度仍易受冲突、策略性行为和项目破裂的影响。尽管先前研究表明，作者可能会理性地就单篇稿件的作者顺序发出最后通牒，但对于这种行为在嵌入于不断演化的学术网络的重复合作中如何展开，人们所知甚少。本研究构建了一个重复的、网络化的合著博弈论模型，其中研究人员随时间建立合作关系，通过不断演化的友谊网络积累声誉，并在部分情形下通过深度强化学习习得策略性行为。利用大规模基于智能体的模拟，我们在混合群体中比较了短视型作者和前瞻型作者。我们发现，策略型智能体提出的最后通牒并不比贪婪型智能体少，而是学会了在被拒绝后不再坚持，从而消除了破坏性的稿件终止。随着策略型智能体占比的增加，论文销毁率从每篇论文0.120降至0.000，完成率从0.853升至0.970，每个智能体平均完成的论文数从15.2增加到16.9。策略型智能体还获得了显著的效用优势，在其数量稀少时可达30.8%，而整体不平等程度保持稳定。这些结果表明，声誉反馈和长期激励可以使学术合作更具韧性，并为设计更公平、更高产的署名政策提供了一个计算试验平台。

VectorGym: A Multitask Benchmark for SVG Code Generation, Sketching, and Editing

VectorGym：SVG代码生成、草图与编辑的多任务基准测试

Authors: Juan Rodriguez, Haotian Zhang, Abhay Puri, Tianyang Zhang, Rishav Pramanik, Meng Lin, Xiaoqing Xie, Marco Terral, Darsh Kaushik, Aly Shariff, Perouz Taslakian, Spandana Gella, Sai Rajeswar, David Vazquez, Christopher Pal, Marco Pedersoli
Subjects: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.29852
Pdf link: https://arxiv.org/pdf/2603.29852
Abstract We introduce VectorGym, a comprehensive benchmark suite for Scalable Vector Graphics (SVG) that spans generation from text and sketches, complex editing, and visual understanding. VectorGym addresses the lack of realistic, challenging benchmarks aligned with professional design workflows. Our benchmark comprises four tasks with expert human-authored annotations: the novel Sketch2SVG task (VG-Sketch); a new SVG editing dataset (VG-Edit) featuring complex, multi-step edits with higher-order primitives; Text2SVG generation (VG-Text); and SVG captioning (VG-Cap). Unlike prior benchmarks that rely on synthetic edits, VectorGym provides gold-standard human annotations that require semantic understanding and design intent. We also propose a multi-task reinforcement learning approach that jointly optimizes across all four tasks using rendering-based rewards. Our method, built on GRPO with curriculum learning, trains a Qwen3-VL 8B model that achieves state-of-the-art performance among open-source models, surpassing much larger models including Qwen3-VL 235B and matching GPT-4o. We also introduce a VLM-as-a-Judge metric for SVG generation, validated through human correlation studies. Our evaluation of frontier VLMs reveals significant performance gaps, positioning VectorGym as a rigorous framework for advancing visual code generation. VectorGym is publicly available on this http URL.
中文摘要 我们提出VectorGym，一套面向可缩放矢量图形（SVG）的综合基准测试套件，涵盖从文本和草图生成、复杂编辑以及视觉理解。VectorGym弥补了缺乏与专业设计工作流程相契合的真实且具有挑战性的基准这一不足。我们的基准包含四个带有专家人工标注的任务：新颖的Sketch2SVG任务（VG-Sketch）；一个新的SVG编辑数据集（VG-Edit），包含使用高阶图元的复杂多步编辑；Text2SVG生成（VG-Text）；以及SVG描述生成（VG-Cap）。与以往依赖合成编辑的基准不同，VectorGym提供了需要语义理解和设计意图的金标准人工标注。我们还提出了一种多任务强化学习方法，利用基于渲染的奖励对全部四个任务进行联合优化。我们的方法基于GRPO和课程学习，训练出的Qwen3-VL 8B模型在开源模型中达到最先进的性能，超过了包括Qwen3-VL 235B在内的大得多的模型，并与GPT-4o持平。我们还引入了一个用于SVG生成的VLM-as-a-Judge指标，并通过与人类评分的相关性研究加以验证。我们对前沿VLM的评估揭示了显著的性能差距，使VectorGym成为推动视觉代码生成发展的严谨框架。VectorGym 在此 http 网址上公开发布。

An Output Feedback Q-learning Algorithm for Optimal Control of Nonlinear Systems with Koopman Linear Embedding

一种用于库普曼线性嵌入非线性系统最优控制的输出反馈Q-学习算法

Authors: Victor G. Lopez, Malte Heinrich, Matthias A. Müller
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2603.29858
Pdf link: https://arxiv.org/pdf/2603.29858
Abstract In the reinforcement learning literature, strong theoretical guarantees have been obtained for algorithms applicable to LTI systems. However, in the nonlinear case only weaker results have been obtained for algorithms that mostly rely on the use of function approximation strategies like, for example, neural networks. In this paper, we study the applicability of a known output-feedback Q-learning algorithm to the class of nonlinear systems that admit a Koopman linear embedding. This algorithm uses only input-output data, and no knowledge of either the system model or the Koopman lifting functions is required. Moreover, no function approximation techniques are used, and the same theoretical guarantees as for LTI systems are preserved. Furthermore, we analyze the performance of the algorithm when the Koopman linear embedding is only an approximation of the real nonlinear system. A simulation example verifies the applicability of this method.
中文摘要 在强化学习文献中，已有针对LTI系统的算法获得了强有力的理论保证。然而，在非线性情况下，对于主要依赖函数近似策略（如神经网络）的算法，只有较弱的结果。本文研究了已知输出反馈Q学习算法在允许库普曼线性嵌入的非线性系统类别中的适用性。该算法仅使用输入输出数据，无需了解系统模型或库普曼提升函数。此外，不使用函数近似技术，且保持与LTI系统相同的理论保证。此外，我们分析了当库普曼线性嵌入仅为实非线性系统的近似时算法的性能。一个模拟示例验证了该方法的适用性。

ShapE-GRPO: Shapley-Enhanced Reward Allocation for Multi-Candidate LLM Training

ShapE-GRPO：用于多候选LLM训练的Shapley增强奖励分配

Authors: Rui Ai, Yu Pan, David Simchi-Levi, Chonghuan Wang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.29871
Pdf link: https://arxiv.org/pdf/2603.29871
Abstract In user-agent interaction scenarios such as recommendation, brainstorming, and code suggestion, Large Language Models (LLMs) often generate sets of candidate recommendations where the objective is to maximize the collective utility of the entire set rather than individual candidates independently. However, existing reinforcement learning post-training paradigms, such as Group Relative Policy Optimization (GRPO), typically assign the same set-level scalar reward to every candidate in the set. This leads to noisy training signals where poor candidates free-ride on the high reward produced by a single strong peer, resulting in suboptimal exploration. To address this, we propose Shapley-Enhanced GRPO (ShapE-GRPO). By leveraging the permutation-invariant nature of set-level utility, we derive a Shapley-enhanced formulation from cooperative game theory to decompose set-level rewards into granular, candidate-specific signals. We show that our formulation preserves the fundamental axioms of the Shapley value while remaining computationally efficient with polynomial-time complexity. Empirically, ShapE-GRPO consistently outperforms standard GRPO across diverse datasets with accelerated convergence during training.
中文摘要 在推荐、头脑风暴和代码建议等用户-智能体交互场景中，大型语言模型（LLMs）通常生成一组候选建议，其目标是最大化整个集合的总体效用，而非独立地优化单个候选。然而，现有的强化学习后训练范式，如群体相对策略优化（Group Relative Policy Optimization，GRPO），通常会为集合中的每个候选分配相同的集合级标量奖励。这导致训练信号带有噪声：较差的候选会搭便车，分享由单个较强同伴带来的高奖励，从而造成次优的探索。为此，我们提出了Shapley增强GRPO（ShapE-GRPO）。通过利用集合级效用的置换不变性质，我们从合作博弈论中推导出一种Shapley增强的表述，将集合级奖励分解为细粒度的、针对各候选的信号。我们证明该表述保持了Shapley值的基本公理，同时具有多项式时间复杂度，计算高效。实验上，ShapE-GRPO在多种数据集上持续优于标准GRPO，且训练过程中收敛更快。
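To make the free-riding problem concrete, the sketch below decomposes a set-level reward into per-candidate credits using the textbook Shapley value. It is not the paper's formulation: it enumerates every ordering, so it is factorial in the set size, whereas the abstract claims a polynomial-time method. The `best_of_set` utility and the example scores are assumptions chosen for illustration.

```python
from itertools import permutations
from math import factorial
from typing import Callable, List, Sequence


def shapley_values(
    candidates: Sequence[float],
    set_utility: Callable[[List[float]], float],
) -> List[float]:
    """Exact Shapley value per candidate: its marginal gain averaged over all orderings."""
    n = len(candidates)
    values = [0.0] * n
    for order in permutations(range(n)):
        chosen: List[float] = []
        previous = set_utility(chosen)
        for idx in order:
            chosen.append(candidates[idx])
            current = set_utility(chosen)
            values[idx] += current - previous
            previous = current
    return [v / factorial(n) for v in values]


def best_of_set(scores: List[float]) -> float:
    """Assumed set utility: the user benefits from the single best candidate."""
    return max(scores, default=0.0)


scores = [0.9, 0.2, 0.1]  # hypothetical per-candidate quality
credits = shapley_values(scores, best_of_set)
print(credits)       # per-candidate credit
print(sum(credits))  # adds up to the set-level reward
```

With a shared set-level reward all three candidates would receive 0.9. Here the credits still sum to 0.9 (the efficiency axiom), but most of it goes to the strong candidate and the two weak ones receive little.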

UniRank: End-to-End Domain-Specific Reranking of Hybrid Text-Image Candidates

UniRank：面向特定领域的端到端重新排序混合文本-图像候选对象

Authors: Yupei Yang, Lin Yang, Wanxi Deng, Lin Qu, Shikui Tu, Lei Xu
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.29897
Pdf link: https://arxiv.org/pdf/2603.29897
Abstract Reranking is a critical component in many information retrieval pipelines. Despite remarkable progress in text-only settings, multimodal reranking remains challenging, particularly when the candidate set contains hybrid text and image items. A key difficulty is the modality gap: a text reranker is intrinsically closer to text candidates than to image candidates, leading to biased and suboptimal cross-modal ranking. Vision-language models (VLMs) mitigate this gap through strong cross-modal alignment and have recently been adopted to build multimodal rerankers. However, most VLM-based rerankers encode all candidates as images, and treating text as images introduces substantial computational overhead. Meanwhile, existing open-source multimodal rerankers are typically trained on general-domain data and often underperform in domain-specific scenarios. To address these limitations, we propose UniRank, a VLM-based reranking framework that natively scores and orders hybrid text-image candidates without any modality conversion. Building on this hybrid scoring interface, UniRank provides an end-to-end domain adaptation pipeline that includes: (1) an instruction-tuning stage that learns calibrated cross-modal relevance scoring by mapping label-token likelihoods to a unified scalar score; and (2) a hard-negative-driven preference alignment stage that constructs in-domain pairwise preferences and performs query-level policy optimization through reinforcement learning from human feedback (RLHF). Extensive experiments on scientific literature retrieval and design patent search demonstrate that UniRank consistently outperforms state-of-the-art baselines, improving Recall@1 by 8.9% and 7.3%, respectively.
中文摘要 重新排序是许多信息检索流程中的关键组成部分。尽管在纯文本设置方面取得了显著进展，多模态重新排序仍然具有挑战性，尤其是当候选集包含文本和图像混合项目时。一个关键难题是模态差距：文本重新排序器本质上更接近文本候选对象而非图像候选，导致跨模态排名偏向且次优。视觉语言模型（VLMs）通过强有力的跨模态对齐来弥补这一差距，最近已被用于构建多模态重新排序器。然而，大多数基于VLM的重新排序器将所有候选数据编码为图像，且将文本视为图像会带来较大的计算开销。与此同时，现有的开源多模态重排序器通常基于通用域数据训练，且在特定领域场景中表现常不佳。为解决这些局限性，我们提出了UniRank，一种基于VLM的重新排序框架，能够原生对混合文本-图像候选对象进行评分和排序，无需任何模态转换。基于这一混合评分接口，UniRank 提供了一个端到端的领域适配流程，包括：（1）一个指令调优阶段，通过将标签标记似然映射到统一的标量评分，学习校准后的跨模态相关性评分;以及（2）硬性负驱动的偏好对齐阶段，构建域内成对偏好，并通过人类反馈强化学习（RLHF）进行查询级策略优化。科学文献检索和设计专利检索的广泛实验表明，UniRank 持续优于最先进的基线，分别提升Recall@1 8.9%和7.3%。

GreenFLag: A Green Agentic Approach for Energy-Efficient Federated Learning

GreenFLag：一种面向高能效联邦学习的绿色智能体方法

Authors: Theodora Panagea, Nikolaos Koursioumpas, Lina Magoula, Ramin Khalili
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2603.29933
Pdf link: https://arxiv.org/pdf/2603.29933
Abstract In the progression toward a new generation of mobile networks, there is a clear focus on integrating distributed intelligence across the system to drive performance, autonomy, and real-time adaptability. Federated learning (FL) stands out as a key emerging technique, enabling on-device model training while preserving data locality. However, its operation introduces substantial energy and resource demands. Energy needs are mostly met by grid power sources, while FL resource orchestration strategies remain limited. This work introduces GreenFLag, an agentic resource orchestration framework designed to minimize the grid energy consumed to complete FL workflows, guarantee FL model performance, and reduce grid power reliance by incorporating renewable sources into the system. GreenFLag leverages a Soft Actor-Critic reinforcement learning approach to jointly optimize computational and communication resources, while accounting for communication contention and the dynamic availability of renewable energy. Evaluations using a real-world open dataset from Copernicus demonstrate that GreenFLag reduces grid energy consumption by 94.8% on average compared to three state-of-the-art baselines, while primarily relying on green power.
中文摘要 在迈向新一代移动网络的过程中，业界明确聚焦于在整个系统中整合分布式智能，以提升性能、自主性和实时适应性。联邦学习（FL）作为一项关键的新兴技术脱颖而出，它支持在设备端训练模型，同时保持数据本地性。然而，其运行带来了巨大的能源和资源需求。目前能源需求主要由电网电源满足，而FL资源编排策略仍然有限。本研究提出了GreenFLag，一种智能体式资源编排框架，旨在最大限度地减少完成FL工作流程所消耗的电网电力，保证FL模型性能，并通过将可再生能源纳入系统来减少对电网电力的依赖。GreenFLag采用Soft Actor-Critic强化学习方法，联合优化计算和通信资源，同时考虑通信争用和可再生能源的动态可用性。使用来自Copernicus（哥白尼）的真实开放数据集进行的评估表明，与三个最先进的基线相比，GreenFLag在主要依赖绿色能源的同时，平均将电网能源消耗降低了94.8%。

Phyelds: A Pythonic Framework for Aggregate Computing

Phyelds：聚合计算的Python框架

Authors: Gianluca Aguzzi, Davide Domini, Nicolas Farabegoli, Mirko Viroli
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Arxiv link: https://arxiv.org/abs/2603.29999
Pdf link: https://arxiv.org/pdf/2603.29999
Abstract Aggregate programming is a field-based coordination paradigm with over a decade of exploration and successful applications across domains including sensor networks, robotics, and IoT, with implementations in various programming languages, such as Protelis, ScaFi (Scala), and FCPP (C++). A recent research direction integrates machine learning with aggregate computing, aiming to support large-scale distributed learning and provide new abstractions for implementing learning algorithms. However, existing implementations do not target data science practitioners, who predominantly work in Python--the de facto language for data science and machine learning, with a rich and mature ecosystem. Python also offers advantages for other use cases, such as education and robotics (e.g., via ROS). To address this gap, we present Phyelds, a Python library for aggregate programming. Phyelds offers a fully featured yet lightweight implementation of the field calculus model of computation, featuring a Pythonic API and an architecture designed for seamless integration with Python's machine learning ecosystem. We describe the design and implementation of Phyelds and illustrate its versatility across domains, from well-known aggregate computing patterns to federated learning coordination and integration with a widely used multi-agent reinforcement learning simulator.
中文摘要 聚合编程是一种基于场的协调范式，经过十余年的探索，已在传感器网络、机器人和物联网等多个领域得到成功应用，并有多种编程语言实现，如Protelis、ScaFi（Scala）和FCPP（C++）。近期的一个研究方向将机器学习与聚合计算相结合，旨在支持大规模分布式学习，并为实现学习算法提供新的抽象。然而，现有的实现并未面向数据科学从业者，他们主要使用Python，即数据科学和机器学习领域事实上的标准语言，拥有丰富成熟的生态系统。Python在其他应用场景中也具有优势，如教育和机器人技术（例如通过ROS）。为了弥补这一空白，我们提出了Phyelds，一个用于聚合编程的Python库。Phyelds提供了场演算计算模型的一个功能齐全而又轻量的实现，具有Python风格（Pythonic）的API，其架构旨在与Python的机器学习生态系统无缝集成。我们介绍了Phyelds的设计与实现，并展示了其在不同领域的通用性，从知名的聚合计算模式，到联邦学习协调，再到与一个广泛使用的多智能体强化学习模拟器的集成。

Hybrid Framework for Robotic Manipulation: Integrating Reinforcement Learning and Large Language Models

机器人操作混合框架：强化学习与大型语言模型的整合

Authors: Md Saad, Sajjad Hussain, Mohd Suhaib
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.30022
Pdf link: https://arxiv.org/pdf/2603.30022
Abstract This paper introduces a new hybrid framework that combines Reinforcement Learning (RL) and Large Language Models (LLMs) to improve robotic manipulation tasks. By utilizing RL for accurate low-level control and LLMs for high level task planning and understanding of natural language, the proposed framework effectively connects low-level execution with high-level reasoning in robotic systems. This integration allows robots to understand and carry out complex, human-like instructions while adapting to changing environments in real time. The framework is tested in a PyBullet-based simulation environment using the Franka Emika Panda robotic arm, with various manipulation scenarios as benchmarks. The results show a 33.5% decrease in task completion time and enhancements of 18.1% and 36.4% in accuracy and adaptability, respectively, when compared to systems that use only RL. These results underscore the potential of LLM-enhanced robotic systems for practical applications, making them more efficient, adaptable, and capable of interacting with humans. Future research will aim to explore sim-to-real transfer, scalability, and multi-robot systems to further broaden the framework's applicability.
中文摘要 本文介绍了一个结合强化学习（RL）和大型语言模型（LLMs）以改进机器人操作任务的新混合框架。通过利用强化学习进行精确的低层控制，利用大型语言模型进行高层次任务规划和自然语言理解，所提出的框架有效地将机器人系统中的低层执行与高层推理连接起来。这种集成使机器人能够理解并执行复杂、类人指令，同时实时适应不断变化的环境。该框架在基于PyBullet的仿真环境中测试，使用Franka Emika Panda机械臂，并以多种操作场景作为基准测试。结果显示，与仅使用强化学习的系统相比，任务完成时间减少了33.5%，准确性和适应性分别提升了18.1%和36.4%。这些结果凸显了LLM增强机器人系统的实际应用潜力，使其更高效、更具适应性，并具备与人类交互的能力。未来的研究将致力于探索模拟到现实的传输、可扩展性和多机器人系统，以进一步拓宽该框架的适用范围。

Keyword: diffusion policy

Enhancing Policy Learning with World-Action Model

利用世界-动作模型增强策略学习

Authors: Yuci Han, Alper Yilmaz
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.28955
Pdf link: https://arxiv.org/pdf/2603.28955
Abstract This paper presents the World-Action Model (WAM), an action-regularized world model that jointly reasons over future visual observations and the actions that drive state transitions. Unlike conventional world models trained solely via image prediction, WAM incorporates an inverse dynamics objective into DreamerV2 that predicts actions from latent state transitions, encouraging the learned representations to capture action-relevant structure critical for downstream control. We evaluate WAM on enhancing policy learning across eight manipulation tasks from the CALVIN benchmark. We first pretrain a diffusion policy via behavioral cloning on world model latents, then refine it with model-based PPO inside the frozen world model. Without modifying the policy architecture or training procedure, WAM improves average behavioral cloning success from 59.4% to 71.2% over DreamerV2 and DiWA baselines. After PPO fine-tuning, WAM achieves 92.8% average success versus 79.8% for the baseline, with two tasks reaching 100%, using 8.7x fewer training steps.
中文摘要 本文提出了世界-动作模型（World-Action Model，WAM），这是一个动作正则化的世界模型，能够对未来的视觉观测和驱动状态转移的动作进行联合推理。与仅通过图像预测训练的传统世界模型不同，WAM在DreamerV2中引入了逆动力学目标，从潜在状态转移中预测动作，促使学习到的表征捕捉对下游控制至关重要的动作相关结构。我们在CALVIN基准的八个操作任务上评估了WAM对策略学习的提升效果。我们首先在世界模型的潜在表征上通过行为克隆预训练扩散策略，然后在冻结的世界模型内用基于模型的PPO对其进行精调。在不修改策略架构或训练流程的情况下，相较于DreamerV2和DiWA基线，WAM将行为克隆的平均成功率从59.4%提升至71.2%。经过PPO微调后，WAM的平均成功率达到92.8%，而基线为79.8%，其中两项任务达到100%，且所用训练步数仅为基线的约8.7分之一。

CLaD: Planning with Grounded Foresight via Cross-Modal Latent Dynamics

CLaD：通过跨模态潜在动力学实现基于基础的前瞻性规划

Authors: Andrew Jeong, Jaemin Kim, Sebin Lee, Sung-Eui Yoon
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.29409
Pdf link: https://arxiv.org/pdf/2603.29409
Abstract Robotic manipulation involves kinematic and semantic transitions that are inherently coupled via underlying actions. However, existing approaches plan within either semantic or latent space without explicitly aligning these cross-modal transitions. To address this, we propose CLaD, a framework that models how proprioceptive and semantic states jointly evolve under actions through asymmetric cross-attention that allows kinematic transitions to query semantic ones. CLaD predicts grounded latent foresights via self-supervised objectives with EMA target encoders and auxiliary reconstruction losses, preventing representation collapse while anchoring predictions to observable states. Predicted foresights are modulated with observations to condition a diffusion policy for action generation. On the LIBERO-LONG benchmark, CLaD achieves a 94.7% success rate, competitive with large VLAs while using significantly fewer parameters.
中文摘要 机器人操作涉及运动学转移和语义转移，二者本质上通过底层动作相互耦合。然而，现有方法只在语义空间或潜在空间之一内进行规划，并未显式对齐这些跨模态转移。为此，我们提出了CLaD框架，该框架通过非对称交叉注意力（允许运动学转移去查询语义转移）来建模本体感觉状态和语义状态在动作作用下如何共同演化。CLaD通过带有EMA目标编码器和辅助重建损失的自监督目标来预测落地的潜在前瞻，在防止表征崩溃的同时将预测锚定于可观测状态。预测出的前瞻与观测一起进行调制，用作扩散策略生成动作的条件。在LIBERO-LONG基准上，CLaD取得了94.7%的成功率，以显著更少的参数达到与大型VLA相当的水平。