Arxiv Papers of Today

Generated at: 2026-04-15 17:23:10 (UTC+8); Arxiv release time: 2026-04-15 20:00 EDT (2026-04-16 08:00 UTC+8)

Total relevant papers today: 31

Keyword: reinforcement learning

Self-Monitoring Benefits from Structural Integration: Lessons from Metacognition in Continuous-Time Multi-Timescale Agents

Authors: Ying Xie
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.11914
Pdf link: https://arxiv.org/pdf/2604.11914
Abstract Self-monitoring capabilities -- metacognition, self-prediction, and subjective duration -- are often proposed as useful additions to reinforcement learning agents. But do they actually help? We investigate this question in a continuous-time multi-timescale agent operating in predator-prey survival environments of varying complexity, including a 2D partially observable variant. We first show that three self-monitoring modules, implemented as auxiliary-loss add-ons to a multi-timescale cortical hierarchy, provide no statistically significant benefit across 20 random seeds, 1D and 2D predator-prey environments with standard and non-stationary variants, and training horizons up to 50,000 steps. Diagnosing the failure, we find the modules collapse to near-constant outputs (confidence std < 0.006, attention allocation std < 0.011) and the subjective duration mechanism shifts the discount factor by less than 0.03%. Policy sensitivity analysis confirms the agent's decisions are unaffected by module outputs in this design. We then show that structurally integrating the module outputs -- using confidence to gate exploration, surprise to trigger workspace broadcasts, and self-model predictions as policy input -- produces a medium-large improvement over the add-on approach (Cohen's d = 0.62, p = 0.06, paired) in a non-stationary environment. Component-wise ablations reveal that the TSM-to-policy pathway contributes most of this gain. However, structural integration does not significantly outperform a baseline with no self-monitoring (d = 0.15, p = 0.67), and a parameter-matched control without modules performs comparably, so the benefit may lie in recovering from the trend-level harm of ignored modules rather than in self-monitoring content. The architectural implication is that self-monitoring should sit on the decision pathway, not beside it.
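
Illustrative sketch (not from the paper): one way "using confidence to gate exploration" could look, assuming an epsilon-greedy policy and a linear confidence-to-epsilon rule. Both choices are assumptions, since the abstract does not specify the gating function, and the names gated_epsilon and select_action are hypothetical.

```python
import random

def gated_epsilon(confidence, eps_min=0.01, eps_max=0.5):
    """Map a confidence in [0, 1] to an exploration rate (hypothetical linear rule)."""
    c = min(max(confidence, 0.0), 1.0)  # clamp to [0, 1]
    return eps_max - c * (eps_max - eps_min)

def select_action(q_values, confidence, rng):
    """Epsilon-greedy choice whose epsilon depends on the agent's own confidence."""
    if rng.random() < gated_epsilon(confidence):
        return rng.randrange(len(q_values))  # explore: uniform random action
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

rng = random.Random(0)
action = select_action([0.1, 0.9, 0.3], confidence=0.8, rng=rng)
```

Under a rule like this, a near-constant confidence signal (as the abstract reports for the add-on design) would leave the exploration rate almost unchanged, so the module output would only matter once it varies.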

Offline-Online Reinforcement Learning for Linear Mixture MDPs

Authors: Zhongjun Zhang, Sean R. Sinclair
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2604.11994
Pdf link: https://arxiv.org/pdf/2604.11994
Abstract We study offline-online reinforcement learning in linear mixture Markov decision processes (MDPs) under environment shift. In the offline phase, data are collected by an unknown behavior policy and may come from a mismatched environment, while in the online phase the learner interacts with the target environment. We propose an algorithm that adaptively leverages offline data. When the offline data are informative, either due to sufficient coverage or small environment shift, the algorithm provably improves over purely online learning. When the offline data are uninformative, it safely ignores them and matches the online-only performance. We establish regret upper bounds that explicitly characterize when offline data are beneficial, together with nearly matching lower bounds. Numerical experiments further corroborate our theoretical findings.

Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision

Authors: Yinghui He, Simran Kaur, Adithya Bhaskar, Yongjin Yang, Jiarui Liu, Narutatsu Ri, Liam Fowl, Abhishek Panigrahi, Danqi Chen, Sanjeev Arora
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.12002
Pdf link: https://arxiv.org/pdf/2604.12002
Abstract Current post-training methods in verifiable settings fall into two categories. Reinforcement learning (RLVR) relies on binary rewards, which are broadly applicable and powerful, but provide only sparse supervision during training. Distillation provides dense token-level supervision, typically obtained from an external teacher or using high-quality demonstrations. Collecting such supervision can be costly or unavailable. We propose Self-Distillation Zero (SD-Zero), a method that is substantially more training sample-efficient than RL and does not require an external teacher or high-quality demonstrations. SD-Zero trains a single model to play two roles: a Generator, which produces an initial response, and a Reviser, which conditions on that response and its binary reward to produce an improved response. We then perform on-policy self-distillation to distill the reviser into the generator, using the reviser's token distributions conditioned on the generator's response and its reward as supervision. In effect, SD-Zero trains the model to transform binary rewards into dense token-level self-supervision. On math and code reasoning benchmarks with Qwen3-4B-Instruct and Olmo-3-7B-Instruct, SD-Zero improves performance by at least 10% over the base models and outperforms strong baselines, including Rejection Fine-Tuning (RFT), GRPO, and Self-Distillation Fine-Tuning (SDFT), under the same question set and training sample budget. Extensive ablation studies show two novel characteristics of our proposed algorithm: (a) token-level self-localization, where the reviser can identify the key tokens that need to be revised in the generator's response based on reward, and (b) iterative self-evolution, where the improving ability to revise answers can be distilled back into generation performance with regular teacher synchronization.
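
Illustrative sketch (not from the paper): token-level distillation from a reviser to a generator can be pictured as a per-token divergence between two next-token distributions. The forward KL direction, the toy three-word vocabulary, and the names token_kl and distill_loss are assumptions made for illustration; the abstract does not state the exact objective.

```python
import math

def token_kl(teacher, student, eps=1e-12):
    """KL(teacher || student) for one token position over a shared vocabulary."""
    return sum(
        p * math.log((p + eps) / (q + eps))
        for p, q in zip(teacher, student)
        if p > 0
    )

def distill_loss(teacher_dists, student_dists):
    """Average the per-token KL over a response; also return the per-token terms."""
    per_token = [token_kl(t, s) for t, s in zip(teacher_dists, student_dists)]
    return sum(per_token) / len(per_token), per_token

# Toy distributions: the two models agree on token 0 and disagree on token 1.
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
student = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]
loss, per_token = distill_loss(teacher, student)
```

The per-token terms give a dense signal: positions where the two distributions agree contribute nothing, while positions where they differ dominate the loss.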

Think Through Uncertainty: Improving Long-Form Generation Factuality via Reasoning Calibration

Authors: Xin Liu, Lu Wang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.12046
Pdf link: https://arxiv.org/pdf/2604.12046
Abstract Large language models (LLMs) often hallucinate in long-form generation. Existing approaches mainly improve factuality through post-hoc revision or reinforcement learning (RL) with correctness-based rewards, but they do not teach the model to estimate which parts of its generation are reliable. As a result, models may still state incorrect claims confidently in their responses. Recent advances in reasoning have significantly improved LLM performance, and have been leveraged to estimate confidence by incorporating calibration into RL objectives. However, existing approaches remain limited to a single scalar confidence for the entire response, which is insufficient for long-form generation where uncertainty varies across individual claims. To mitigate this problem, we propose CURE, a framework that improves long-form factuality by teaching LLMs to reason about uncertainty at the claim level. We first introduce a Claim-Aware Reasoning Protocol, which structures outputs into atomic claims paired with explicit confidence estimates. We then develop a multi-stage training pipeline that aligns model confidence with claims' correctness and then optimizes on factuality. The resulting calibrated confidence further enables selective prediction, allowing the model to abstain from uncertain claims at inference time. Experiments on four long-form factuality benchmarks show that CURE consistently improves factual accuracy over competitive supervised and RL baselines, while maintaining factual recall. In particular, it improves claim-level accuracy by up to 39.9% on Biography generation. These gains are accompanied by improved calibration, as reflected by a 16.0% increase in AUROC on FactBench.
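
Illustrative sketch (not from the paper): claim-level selective prediction can be pictured as filtering atomic claims by their stated confidence. The Claim structure, the threshold value, and the placeholder claim texts are assumptions; the abstract does not describe the actual output schema or abstention rule.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    confidence: float  # model-reported confidence in [0, 1]

def select_claims(claims, threshold=0.5):
    """Keep claims at or above the threshold; abstain from the rest."""
    kept = [c for c in claims if c.confidence >= threshold]
    abstained = [c for c in claims if c.confidence < threshold]
    return kept, abstained

# Placeholder claims standing in for an atomic-claim decomposition.
claims = [
    Claim("claim A", 0.92),
    Claim("claim B", 0.31),
    Claim("claim C", 0.58),
]
kept, abstained = select_claims(claims, threshold=0.5)
```

A single scalar confidence for the whole response could not support this kind of filtering, which is the limitation the abstract points to.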

Robust Optimization for Mitigating Reward Hacking with Correlated Proxies

Authors: Zixuan Liu, Xiaolin Sun, Zizhan Zheng
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.12086
Pdf link: https://arxiv.org/pdf/2604.12086
Abstract Designing robust reinforcement learning (RL) agents in the presence of imperfect reward signals remains a core challenge. In practice, agents are often trained with proxy rewards that only approximate the true objective, leaving them vulnerable to reward hacking, where high proxy returns arise from unintended or exploitative behaviors. Recent work formalizes this issue using r-correlation between proxy and true rewards, but existing methods like occupancy-regularized policy optimization (ORPO) optimize against a fixed proxy and do not provide strong guarantees against broader classes of correlated proxies. In this work, we formulate reward hacking as a robust policy optimization problem over the space of all r-correlated proxy rewards. We derive a tractable max-min formulation, where the agent maximizes performance under the worst-case proxy consistent with the correlation constraint. We further show that when the reward is a linear function of known features, our approach can be adapted to incorporate this prior knowledge, yielding both improved policies and interpretable worst-case rewards. Experiments across several environments show that our algorithms consistently outperform ORPO in worst-case returns, and offer improved robustness and stability across different levels of proxy-true reward correlation. These results show that our approach provides both robustness and transparency in settings where reward design is inherently uncertain. The code is available at this https URL.
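
Illustrative sketch (not from the paper): the max-min idea can be shown on a toy problem where the set of admissible proxy rewards is replaced by a small finite list. This finite set, the return values, and the name maximin_policy are assumptions; the paper optimizes over all r-correlated proxies, which this sketch does not attempt.

```python
def maximin_policy(returns):
    """Pick the policy whose worst-case return over candidate proxies is largest.

    returns maps a policy name to its returns under each candidate proxy reward.
    """
    worst = {name: min(vals) for name, vals in returns.items()}
    best = max(worst, key=worst.get)
    return best, worst[best]

# Toy returns of two hypothetical policies under three candidate proxies.
returns = {
    "greedy": [10.0, -5.0, 2.0],
    "cautious": [4.0, 3.0, 3.5],
}
best, value = maximin_policy(returns)
```

Here the "greedy" policy has the highest return under one proxy but the lowest worst case, so the max-min criterion selects "cautious".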

PubSwap: Public-Data Off-Policy Coordination for Federated RLVR

Authors: Anupam Nayak, Baris Askin, Muhammed Ustaomeroglu, Carlee Joe-Wong, Gauri Joshi
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.12160
Pdf link: https://arxiv.org/pdf/2604.12160
Abstract Reasoning post-training with reinforcement learning from verifiable rewards (RLVR) is typically studied in centralized settings, yet many realistic applications involve decentralized private data distributed across organizations. Federated training is a natural solution, but scaling RLVR in this regime is challenging: full-model synchronization is expensive, and performing many local steps can cause severe client drift under heterogeneous data. We propose a federated RLVR framework that combines LoRA-based local adaptation with public-data-based off-policy steps to improve both communication efficiency and cross-client coordination. In particular, a small shared public dataset is used to periodically exchange and reuse response-level training signals across organizations, providing a lightweight anchor toward a more globally aligned objective without exposing private data. Our method selectively replaces locally incorrect responses with globally correct ones during public-data steps, thereby keeping training closer to the local policy while still benefiting from cross-client coordination. Across mathematical and medical reasoning benchmarks and models, our method consistently improves over standard baselines. Our results highlight a simple and effective recipe for federated reasoning post-training: combining low-rank communication with limited public-data coordination.
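
Illustrative sketch (not from the paper): the selective replacement step on a public prompt might look like the following, where locally incorrect responses are swapped for correct ones shared by other clients. The data layout and the name swap_incorrect are assumptions, and the fallback of keeping a local response when no shared one is left is a design choice of this sketch only.

```python
def swap_incorrect(local, shared_correct):
    """Replace locally incorrect responses with shared correct ones.

    local: list of (response, is_correct) pairs sampled by this client.
    shared_correct: correct responses to the same public prompt from other clients.
    Returns (response, source) pairs, where source is "local" or "shared".
    """
    pool = iter(shared_correct)
    batch = []
    for response, is_correct in local:
        if is_correct:
            batch.append((response, "local"))
            continue
        replacement = next(pool, None)
        if replacement is None:
            batch.append((response, "local"))  # nothing left to swap in
        else:
            batch.append((replacement, "shared"))
    return batch

local = [("r1", True), ("r2", False), ("r3", False)]
batch = swap_incorrect(local, shared_correct=["g1"])
```

Only the failed samples are replaced, so most of the batch still comes from the local policy, which matches the abstract's stated goal of staying close to on-policy training.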

Nucleus-Image: Sparse MoE for Image Generation

Authors: Chandan Akiti, Ajay Modukuri, Murali Nandan Nagarapu, Gunavardhan Akiti, Haozhe Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.12163
Pdf link: https://arxiv.org/pdf/2604.12163
Abstract We present Nucleus-Image, a text-to-image generation model that establishes a new Pareto frontier in quality-versus-efficiency by matching or exceeding leading models on GenEval, DPG-Bench, and OneIG-Bench while activating only approximately 2B parameters per forward pass. Nucleus-Image employs a sparse mixture-of-experts (MoE) diffusion transformer architecture with Expert-Choice Routing that scales total model capacity to 17B parameters across 64 routed experts per layer. We adopt a streamlined architecture optimized for inference efficiency by excluding text tokens from the transformer backbone entirely and using joint attention that enables text KV sharing across timesteps. To improve routing stability when using timestep modulation, we introduce a decoupled routing design that separates timestep-aware expert assignment from timestep-conditioned expert computation. We construct a large-scale training corpus of 1.5B high-quality training pairs spanning 700M unique images through multi-stage filtering, deduplication, aesthetic tiering, and caption curation. Training follows a progressive resolution curriculum (256 to 512 to 1024) with multi-aspect-ratio bucketing at every stage, coupled with progressive sparsification of the expert capacity factor. We adopt the Muon optimizer and share our parameter grouping recipe tailored for diffusion models with timestep modulation. Nucleus-Image demonstrates that sparse MoE scaling is a highly effective path to high-quality image generation, reaching the performance of models with significantly larger active parameter budgets at a fraction of the inference cost. These results are achieved without post-training optimization of any kind: no reinforcement learning, no direct preference optimization, and no human preference tuning. We release the training recipe, making Nucleus-Image the first fully open-source MoE diffusion model at this quality.
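
Illustrative sketch (not from the paper): in expert-choice routing, each expert selects its highest-scoring tokens up to a fixed capacity, instead of each token selecting experts. The toy affinity scores, the capacity of 2, and the name expert_choice_route are assumptions; the model's decoupled, timestep-aware routing is not reproduced here.

```python
def expert_choice_route(scores, capacity):
    """Let every expert pick its top-`capacity` tokens by affinity score.

    scores[t][e] is the affinity of token t for expert e.
    Returns, for each expert, the sorted indices of the tokens it processes.
    """
    num_tokens = len(scores)
    num_experts = len(scores[0])
    assignment = []
    for e in range(num_experts):
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][e], reverse=True)
        assignment.append(sorted(ranked[:capacity]))
    return assignment

# Four tokens, two experts.
scores = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.7],
    [0.1, 0.3],
]
assignment = expert_choice_route(scores, capacity=2)
```

Every expert receives exactly `capacity` tokens, so load is balanced by construction, while an individual token may be picked by several experts (token 2 here) or by none (token 3).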

Hybrid Adaptive Tuning for Tiered Memory Systems

Authors: Xi Wang, Jie Liu, Shuangyan Yang, Jongryool Kim, Pengfei Su, Dong Li
Subjects: Operating Systems (cs.OS)
Arxiv link: https://arxiv.org/abs/2604.12165
Pdf link: https://arxiv.org/pdf/2604.12165
Abstract Memory tiering provides a cost-effective solution to increase memory capacity, utilization, and even bandwidth. Memory tiering relies on system software for memory profiling, detection of frequently accessed pages, and page migration. Such a system software often comes with system parameters. The configurations of those parameters impact application performance. We comprehensively classify system parameters, and characterize the sensitivity of application performance to them using representative memory tiering solutions. Furthermore, we introduce a lightweight and user-friendly framework PTMT, which automates tuning of parameters at runtime for various memory tiering solutions. We identify major challenges for online tuning of memory tiering. PTMT uses a hybrid "offline + online" tuning method: while the offline phase builds a performance database for online queries and reduces runtime overhead, the online phase uses reinforcement learning (customized to memory tiering) to tune. PTMT improves performance by 30%, 26%, 21%, and 14%, on four memory tiering solutions (TPP, UPM, Colloid, and AutoNUMA), compared to using the default configurations. PTMT outperforms the state-of-the-art by 32% on average.
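
Illustrative sketch (not from the paper): a hybrid "offline + online" tuner can be pictured as a value table that is warm-started from offline measurements and then refined online. The epsilon-greedy bandit used here is a deliberately simple stand-in for the customized reinforcement learning the abstract mentions, and the configuration names, numbers, and the function tune are all hypothetical.

```python
import random

def tune(offline_db, measure, steps, eps=0.1, lr=0.5, seed=0):
    """Warm-start value estimates from offline data, then refine them online.

    offline_db: config name -> performance estimated offline.
    measure: callable returning the performance observed online for a config.
    """
    rng = random.Random(seed)
    values = dict(offline_db)  # offline phase: initial estimates
    configs = list(values)
    for _ in range(steps):
        if rng.random() < eps:
            cfg = rng.choice(configs)  # occasional exploration
        else:
            cfg = max(configs, key=values.get)  # use the current best estimate
        values[cfg] += lr * (measure(cfg) - values[cfg])  # online correction
    return max(configs, key=values.get), values

# The offline data favor "a", but the live workload (a stand-in dict) favors "b".
offline_db = {"a": 1.0, "b": 0.8}
observed = {"a": 0.5, "b": 0.9}
best, values = tune(offline_db, observed.get, steps=10, eps=0.0)
```

The offline table avoids starting from scratch at runtime, and the online updates correct it when the running workload differs from what was profiled.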

MolMem: Memory-Augmented Agentic Reinforcement Learning for Sample-Efficient Molecular Optimization

Authors: Ziqing Wang, Yibo Wen, Abhishek Pandy, Han Liu, Kaize Ding
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.12237
Pdf link: https://arxiv.org/pdf/2604.12237
Abstract In drug discovery, molecular optimization aims to iteratively refine a lead compound to improve molecular properties while preserving structural similarity to the original molecule. However, each oracle evaluation is expensive, making sample efficiency a key challenge for existing methods under a limited oracle budget. Trial-and-error approaches require many oracle calls, while methods that leverage external knowledge tend to reuse familiar templates and struggle on challenging objectives. A key missing piece is long-term memory that can ground decisions and provide reusable insights for future optimizations. To address this, we present MolMem (Molecular optimization with Memory), a multi-turn agentic reinforcement learning (RL) framework with a dual-memory system. Specifically, MolMem uses Static Exemplar Memory to retrieve relevant exemplars for cold-start grounding, and Evolving Skill Memory to distill successful trajectories into reusable strategies. Built on this memory-augmented formulation, we train the policy with dense step-wise rewards, turning costly rollouts into long-term knowledge that improves future optimization. Extensive experiments show that MolMem achieves 90% success on single-property tasks (1.5x over the best baseline) and 52% on multi-property tasks using only 500 oracle calls. Our code is available at this https URL.

ARGen: Affect-Reinforced Generative Augmentation towards Vision-based Dynamic Emotion Perception

Authors: Huanzhen Wang, Ziheng Zhou, Jiaqi Song, Li He, Yunshi Lan, Yan Wang, Wenqiang Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.12255
Pdf link: https://arxiv.org/pdf/2604.12255
Abstract Dynamic facial expression recognition in the wild remains challenging due to data scarcity and long-tail distributions, which hinder models from effectively learning the temporal dynamics of scarce emotions. To address these limitations, we propose ARGen, an Affect-Reinforced Generative Augmentation Framework that enables data-adaptive dynamic expression generation for robust emotion perception. ARGen operates in two stages: Affective Semantic Injection (ASI) and Adaptive Reinforcement Diffusion (ARD). The ASI stage establishes affective knowledge alignment through facial Action Units and employs a retrieval-augmented prompt generation strategy to synthesize consistent and fine-grained affective descriptions via large-scale visual-language models, thereby injecting interpretable emotional priors into the generation process. The ARD stage integrates text-conditioned image-to-video diffusion with reinforcement learning, introducing inter-frame conditional guidance and a multi-objective reward function to jointly optimize expression naturalness, facial integrity, and generative efficiency. Extensive experiments on both generation and recognition tasks verify that ARGen substantially enhances synthesis fidelity and improves recognition performance, establishing an interpretable and generalizable generative augmentation paradigm for vision-based affective computing.

WebAgentGuard: A Reasoning-Driven Guard Model for Detecting Prompt Injection Attacks in Web Agents

Authors: Yulin Chen, Tri Cao, Haoran Li, Yue Liu, Yibo Li, Yufei He, Le Minh Khoi, Yangqiu Song, Shuicheng Yan, Bryan Hooi
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2604.12284
Pdf link: https://arxiv.org/pdf/2604.12284
Abstract Web agents powered by vision-language models (VLMs) enable autonomous interaction with web environments by perceiving and acting on both visual and textual webpage content to accomplish user-specified tasks. However, they are highly vulnerable to prompt injection attacks, where adversarial instructions embedded in HTML or rendered screenshots can manipulate agent behavior and lead to harmful outcomes such as information leakage. Existing defenses, including system prompt defenses and direct fine-tuning of agents, have shown limited effectiveness. To address this issue, we propose a defense framework in which a web agent operates in parallel with a dedicated guard agent, decoupling prompt injection detection from the agent's own reasoning. Building on this framework, we introduce WebAgentGuard, a reasoning-driven, multimodal guard model for prompt injection detection. We construct a synthetic multimodal dataset using GPT-5 spanning 164 topics and 230 visual and UI design styles, and train the model via reasoning-intensive supervised fine-tuning followed by reinforcement learning. Experiments across multiple benchmarks show that WebAgentGuard consistently outperforms strong baselines while preserving agent utility, without introducing additional latency.

Labeled TrustSet Guided: Batch Active Learning with Reinforcement Learning

Authors: Guofeng Cui, Yang Liu, Pichao Wang, Hankai Hsu, Xiaohang Sun, Xiang Hao, Zhu Liu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.12303
Pdf link: https://arxiv.org/pdf/2604.12303
Abstract Batch active learning (BAL) is a crucial technique for reducing labeling costs and improving data efficiency in training large-scale deep learning models. Traditional BAL methods often rely on metrics like Mahalanobis Distance to balance uncertainty and diversity when selecting data for annotation. However, these methods predominantly focus on the distribution of unlabeled data and fail to leverage feedback from labeled data or the model's performance. To address these limitations, we introduce TrustSet, a novel approach that selects the most informative data from the labeled dataset, ensuring a balanced class distribution to mitigate the long-tail problem. Unlike CoreSet, which focuses on maintaining the overall data distribution, TrustSet optimizes the model's performance by pruning redundant data and using label information to refine the selection process. To extend the benefits of TrustSet to the unlabeled pool, we propose a reinforcement learning (RL)-based sampling policy that approximates the selection of high-quality TrustSet candidates from the unlabeled data. Combining TrustSet and RL, we introduce the Batch Reinforcement Active Learning with TrustSet (BRAL-T) framework. BRAL-T achieves state-of-the-art results across 10 image classification benchmarks and 2 active fine-tuning tasks, demonstrating its effectiveness and efficiency in various domains.

Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

Authors: NVIDIA: Aakshita Chandiramani, Aaron Blakeman, Abdullahi Olaoye, Abhibha Gupta, Abhilash Somasamudramath, Abhinav Khattar, Adeola Adesoba, Adi Renduchintala, Adil Asif, Aditya Agrawal, Aditya Vavre, Ahmad Kiswani, Aishwarya Padmakumar, Ajay Hotchandani, Akanksha Shukla, Akhiad Bercovich, Aleksander Ficek, Aleksandr Shaposhnikov, Alex Gronskiy, Alex Kondratenko, Alex Neefus, Alex Steiner, Alex Yang, Alexander Bukharin, Alexander Young, Ali Hatamizadeh, Ali Taghibakhshi, Alina Galiautdinova, Alisa Liu, Alok Kumar, Ameya Sunil Mahabaleshwarkar, Amir Klein, Amit Zuker, Amnon Geifman, Anahita Bhiwandiwalla, Ananth Subramaniam, Andrew Tao, Anjaney Shrivastava, Anjulie Agrusa, Ankur Srivastava, Ankur Verma, Ann Guan, Anna Shors, Annamalai Chockalingam, Anubhav Mandarwal, Aparnaa Ramani, Arham Mehta, Arti Jain, Arun Venkatesan, Asha Anoosheh, Ashwath Aithal, Ashwin Poojary, Asif Ahamed, Asit Mishra, Asli Sabanci Demiroz, Asma Kuriparambil Thekkumpate, Atefeh Sohrabizadeh, Avinash Kaur, Ayush Dattagupta, Barath Subramaniam Anandan, Bardiya Sadeghi, Barnaby Simkin, Ben Lanir, Benedikt Schifferer, Benjamin Chislett, Besmira Nushi, Bilal Kartal, Bill Thiede, Bita Darvish Rouhani, Bobby Chen, Boris Ginsburg, Brandon Norick, Branislav Kisacanin, Brian Yu, Bryan Catanzaro, Buvaneswari Mani, Carlo del Mundo, Chankyu Lee, Chanran Kim, Chantal Hwang, Chao Ni, Charles Wang, Charlie Truong, Cheng-Ping Hsieh, Chenhan Yu, Chenjie Luo, Cherie Wang, Chetan Mungekar, Chintan Patel, Chris Alexiuk, Chris Holguin, Chris Wing, Christian Munley, Christopher Parisien, Chuck Desai, Chunyang Sheng, Collin Neale, Cyril Meurillon, Dakshi Kumar
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.12374
Pdf link: https://arxiv.org/pdf/2604.12374
Abstract We describe the pre-training, post-training, and quantization of Nemotron 3 Super, a 120 billion (active 12 billion) parameter hybrid Mamba-Attention Mixture-of-Experts model. Nemotron 3 Super is the first model in the Nemotron 3 family to 1) be pre-trained in NVFP4, 2) leverage LatentMoE, a new Mixture-of-Experts architecture that optimizes for both accuracy per FLOP and accuracy per parameter, and 3) include MTP layers for inference acceleration through native speculative decoding. We pre-trained Nemotron 3 Super on 25 trillion tokens followed by post-training using supervised fine tuning (SFT) and reinforcement learning (RL). The final model supports up to 1M context length and achieves comparable accuracy on common benchmarks, while also achieving up to 2.2x and 7.5x higher inference throughput compared to GPT-OSS-120B and Qwen3.5-122B, respectively. Nemotron 3 Super datasets, along with the base, post-trained, and quantized checkpoints, are open-sourced on HuggingFace.

ReasonXL: Shifting LLM Reasoning Language Without Sacrificing Performance

Authors: Daniil Gurgurov, Tom Röhr, Sebastian von Rohrscheidt, Josef van Genabith, Alexander Löser, Simon Ostermann
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.12378
Pdf link: https://arxiv.org/pdf/2604.12378
Abstract Despite advances in multilingual capabilities, most large language models (LLMs) remain English-centric in their training and, crucially, in their production of reasoning traces. Even when tasked with non-English problems, these models predominantly reason in English, creating a fundamental mismatch for non-English usage scenarios. We address this disparity directly with three contributions. (i) We introduce ReasonXL, the first large-scale parallel corpus of cross-domain reasoning traces spanning five European languages (English, German, French, Italian, and Spanish), with over two million aligned samples per language, each comprising prompts, reasoning traces, and final outputs, enabling direct supervision of language-specific reasoning. (ii) Using ReasonXL, we demonstrate that LLMs can be adapted to reason entirely in a desired target language, using a simple two-stage pipeline of supervised fine-tuning (SFT) followed by reinforcement learning with verifiable rewards (RLVR). The resulting models match or exceed baseline performance, with minimal loss in general knowledge and broadly preserved cross-lingual transfer. (iii) We conduct an extensive representational analysis of the adaptation and find a clear functional division across model depth: early layers contain an activation bottleneck that causally determines language identity, while upper layers concentrate the weight and activation changes driven by adaptation. We further find that RLVR achieves greater behavioral divergence from the base model with smaller parameter updates than SFT, suggesting a more efficient representational rerouting despite much smaller weight updates.
中文摘要 尽管多语言能力有所进步，大多数大型语言模型（LLMs）在训练和推理痕迹生成上仍以英语为中心。即使被要求处理非英语问题，这些模型也主要用英语推理，导致非英语用法场景存在根本性不匹配。我们直接通过三项观点来应对这一差异。（i）我们引入了ReasonXL，这是首个跨领域推理痕迹的大规模平行语料库，涵盖五种欧洲语言（英语、德语、法语、意大利语和西班牙语），每种语言拥有超过两百万个对齐样本，每个样本包含提示、推理痕迹和最终输出，实现对语言特定推理的直接监督。（ii）利用ReasonXL，我们展示了LLM可以完全适应目标语言的推理，采用简单的两阶段流水线：监督微调（SFT）和可验证奖励的强化学习（RLVR）。最终模型的性能与基线相当甚至超过，常识损失极小，且跨语言迁移性大致保持。（iii）我们对适应进行了广泛的表征分析，发现模型深度中存在明确的功能划分：早期层存在因果决定语言身份的激活瓶颈，而上层则集中由适应驱动的权重和激活变化。我们还发现，RLVR在参数更新较小的情况下，与基础模型的行为偏差比SFT更大，表明尽管权重更新更小，但表示重路由更为高效。

Traffic-Aware Domain Partitioning and Load-Balanced Inter-Domain Routing for LEO Satellite Networks

LEO卫星网络的流量感知域划分与负载均衡域间路由

Authors: Chen Zhou, Jiangtao Luo, Yongyi Ran
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2604.12382
Pdf link: https://arxiv.org/pdf/2604.12382
Abstract Low Earth Orbit (LEO) satellite networks provide global coverage and low latency, yet high node mobility, uneven traffic distribution, and stochastic link failures pose severe challenges for inter-domain routing. Existing approaches either neglect graph-structured topology or lack dynamic awareness of real-time link states, struggling to balance load distribution and routing reliability. This paper proposes DTAR, a traffic-aware deep reinforcement learning approach for inter-domain routing in LEO satellite networks. A multi-objective NSGA-II algorithm first generates an offline domain partition maximizing intra-domain traffic ratio and minimizing load imbalance. A Graph Attention Network dynamically encodes inter-domain link traffic intensity, load distribution, and fault status, upon which an action-masked PPO agent learns routing decisions online. Simulations on a 288-satellite Walker constellation against multiple baselines demonstrate that DTAR significantly reduces link load imbalance and end-to-end delay, while improving routing success rate and reducing packet loss rate across normal, traffic surge, and fault scenarios.
中文摘要 低地球轨道（LEO）卫星网络提供全球覆盖和低延迟，但高节点移动性、流量分布不均和随机链路故障对域间路由构成严峻挑战。现有方法要么忽视图结构拓扑，要么缺乏对实时链路状态的动态感知，难以平衡负载分布和路由可靠性。本文提出了DTAR，一种用于LEO卫星网络域间路由的流量感知深度强化学习方法。多目标NSGA-II算法首先生成离线域分区，最大化域内流量比并最小化负载不平衡。图关注网络动态编码域间链路流量强度、负载分布和故障状态，带动作掩蔽的PPO代理在线学习路由决策。对288颗卫星的Walker星座进行多基线模拟显示，DTAR显著减少链路负载不平衡和端到端延迟，同时提升路由成功率，降低正常、流量浪涌和故障场景下的丢包率。
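
Illustrative note (not from the paper): the abstract mentions an action-masked PPO agent. A minimal sketch of the masking step alone, assuming invalid next hops (for example, failed links) are excluded by setting their logits to negative infinity before the softmax. The function name and the toy numbers are hypothetical.

```python
import math

def masked_softmax(logits, valid):
    """Softmax over logits, with probability zero for masked-out actions."""
    if not any(valid):
        raise ValueError("at least one action must be valid")
    # Invalid actions get -inf so they contribute exp(-inf) = 0.
    masked = [l if ok else float("-inf") for l, ok in zip(logits, valid)]
    top = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(m - top) for m in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: three candidate inter-domain links, the second one is faulty.
probs = masked_softmax([1.0, 3.0, 1.0], [True, False, True])
```

Masking at the logit level means the policy can never sample a faulty link, however high its raw score is.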

From Kinematics to Dynamics: Learning to Refine Hybrid Plans for Physically Feasible Execution

从运动学到动力学：学习优化混合计划以实现物理可行执行

Authors: Lidor Erez, Shahaf S. Shperberg, Ayal Taitler
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.12474
Pdf link: https://arxiv.org/pdf/2604.12474
Abstract In many robotic tasks, agents must traverse a sequence of spatial regions to complete a mission. Such problems are inherently mixed discrete-continuous: a high-level action sequence and a physically feasible continuous trajectory. The resulting trajectory and action sequence must also satisfy problem constraints such as deadlines, time windows, and velocity or acceleration limits. While hybrid temporal planners attempt to address this challenge, they typically model motion using linear (first-order) dynamics, which cannot guarantee that the resulting plan respects the robot's true physical constraints. Consequently, even when the high-level action sequence is fixed, producing a dynamically feasible trajectory becomes a bi-level optimization problem. We address this problem via reinforcement learning in continuous space. We define a Markov Decision Process that explicitly incorporates analytical second-order constraints and use it to refine first-order plans generated by a hybrid planner. Our results show that this approach can reliably recover physical feasibility and effectively bridge the gap between a planner's initial first-order trajectory and the dynamics required for real execution.
中文摘要 在许多机器人任务中，特工必须穿越一系列空间区域以完成任务。此类问题本质上是离散与连续混合的：一个高层次的作用序列和一个物理上可行的连续轨迹。最终的轨迹和动作序列还必须满足问题约束，如截止时间、时间窗口以及速度或加速度限制。虽然混合时间规划器试图解决这一挑战，但它们通常使用线性（一阶）动力学来建模运动，无法保证最终计划能满足机器人的真实物理约束。因此，即使高层动作序列是固定的，生成动态可行轨迹也成为一个二层优化问题。我们通过连续空间中的强化学习来解决这个问题。我们定义了一个马尔可夫决策过程，明确包含分析性的二阶约束，并用它来优化由混合规划器生成的一阶计划。我们的结果表明，这种方法能够可靠地恢复物理可行性，并有效地弥合规划者初始一阶轨迹与实际执行所需的动态之间的差距。
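
Illustrative note (not from the paper): the gap between first-order and second-order feasibility can be shown with finite differences. The sketch below assumes a 1D plan sampled at a fixed time step; the function name and limits are hypothetical, and this is only a feasibility check, not the paper's MDP.

```python
def violates_limits(positions, dt, v_max, a_max):
    """Return True if finite-difference velocity or acceleration exceeds the limits."""
    vel = [(b - a) / dt for a, b in zip(positions, positions[1:])]
    acc = [(b - a) / dt for a, b in zip(vel, vel[1:])]
    return any(abs(v) > v_max for v in vel) or any(abs(a) > a_max for a in acc)

# A plan that respects the speed limit but jumps from rest to full speed in one step.
plan = [0.0, 0.0, 1.0, 2.0]
first_order_ok = not violates_limits(plan, dt=1.0, v_max=1.0, a_max=float("inf"))
second_order_ok = not violates_limits(plan, dt=1.0, v_max=1.0, a_max=0.5)
```

A plan can pass the velocity check and still fail once acceleration is bounded, which is the kind of infeasibility a refinement stage has to repair.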

KG-Reasoner: A Reinforced Model for End-to-End Multi-Hop Knowledge Graph Reasoning

KG-推理器：端到端多跳知识图谱推理的强化模型

Authors: Shuai Wang, Yinan Yu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.12487
Pdf link: https://arxiv.org/pdf/2604.12487
Abstract Large Language Models (LLMs) exhibit strong abilities in natural language understanding and generation, yet they struggle with knowledge-intensive reasoning. Structured Knowledge Graphs (KGs) provide an effective form of external knowledge representation and have been widely used to enhance performance in classical Knowledge Base Question Answering (KBQA) tasks. However, performing precise multi-hop reasoning over KGs for complex queries remains highly challenging. Most existing approaches decompose the reasoning process into a sequence of isolated steps executed through a fixed pipeline. While effective to some extent, such designs constrain reasoning flexibility and fragment the overall decision process, often leading to incoherence and the loss of critical intermediate information from earlier steps. In this paper, we introduce KG-Reasoner, an end-to-end framework that integrates multi-step reasoning into a unified "thinking" phase of a Reasoning LLM. Through Reinforcement Learning (RL), the LLM is trained to internalize the KG traversal process, enabling it to dynamically explore reasoning paths, and perform backtracking when necessary. Experiments on eight multi-hop and knowledge-intensive reasoning benchmarks demonstrate that KG-Reasoner achieves competitive or superior performance compared to the state-of-the-art methods. Codes are available at the repository: this https URL.
中文摘要 大型语言模型（LLMs）在自然语言理解和生成方面表现出强大的能力，但在知识密集型推理方面却存在困难。结构化知识图谱（KGs）提供了一种有效的外部知识表示形式，已被广泛用于提升经典知识库问答（KBQA）任务的性能。然而，针对复杂查询进行精确的多跳推理仍然极具挑战性。大多数现有方法将推理过程分解为通过固定流水线执行的一系列孤立步骤。虽然在一定程度上有效，但此类设计限制了推理灵活性，并使整体决策过程支离破碎，常常导致前一步中不连贯且关键中间信息丢失。本文介绍了KG-Reasoner，一个端到端框架，将多步推理整合进推理大型语言模型的统一“思考”阶段。通过强化学习（RL），LLM被训练为内化KG遍历过程，使其能够动态探索推理路径，并在必要时进行回溯。在八个多跳和知识密集型推理基准测试上的实验表明，与最先进的方法相比，KG-Reasoner 在竞争或更优于其他方法的表现上。代码可在仓库获取：https URL。
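
Illustrative note (not from the paper): KG-Reasoner trains an LLM to internalize traversal, so the explicit search below is only a reference point for what "exploring paths and backtracking" means on a graph. The toy graph, the function name, and the hop budget are assumptions.

```python
def find_path(kg, start, goal, max_hops):
    """Depth-first search over (relation, tail) edges; backtracks on dead ends."""
    def dfs(node, path, visited):
        if node == goal:
            return path
        if len(path) == max_hops:
            return None  # hop budget exhausted: backtrack
        for relation, tail in kg.get(node, []):
            if tail in visited:
                continue
            found = dfs(tail, path + [(node, relation, tail)], visited | {tail})
            if found is not None:
                return found
        return None  # no outgoing edge worked: backtrack to the caller
    return dfs(start, [], {start})

toy_kg = {
    "Marie Curie": [("field", "Physics"), ("born_in", "Warsaw")],
    "Warsaw": [("capital_of", "Poland")],
}
path = find_path(toy_kg, "Marie Curie", "Poland", max_hops=2)
```

The search first tries the "field" edge, hits a dead end, backtracks, and then succeeds through "born_in".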

Safety Training Modulates Harmful Misalignment Under On-Policy RL, But Direction Depends on Environment Design

安全培训在政策强化学习下调节有害的错位，但方向取决于环境设计

Authors: Leon Eshuijs, Shihan Wang, Antske Fokkens
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2604.12500
Pdf link: https://arxiv.org/pdf/2604.12500
Abstract Specification gaming under Reinforcement Learning (RL) is known to cause LLMs to develop sycophantic, manipulative, or deceptive behavior, yet the conditions under which this occurs remain unclear. We train 11 instruction-tuned LLMs (0.5B--14B) with on-policy RL across 3 environments and find that model size acts as a safety buffer in some environments but enables greater harmful exploitation in others. Controlled ablations trace this reversal to environment-specific features such as role framing and implicit gameability cues. We further show that most safety benchmarks do not predict RL-induced misalignment, except in the case of Sycophancy scores when the exploit relies on inferring the user's preference. Finally, we find that on-policy RL preserves a safety buffer inherent in the model's own generation distribution, one that is bypassed during off-policy settings.
中文摘要 强化学习（RL）下的规格博弈已知会导致大型语言模型发展出谄媚、操控或欺骗行为，但发生这种情况的具体条件尚不清楚。我们在3个环境中用策略式强化学习训练了11个指令调优LLM（0.5B-14B），发现模型大小在某些环境中起到安全缓冲作用，但在其他环境中却可能带来更大的有害利用。受控消融将这种逆转追溯到环境特有特征，如角色框架和隐性可游戏性线索。我们还进一步表明，大多数安全基准测试并不能预测强化学习引起的错位，除非在谄媚评分中，当利用依赖于推断用户偏好时。最后，我们发现策略中强化学习保留了模型自身生成分布中固有的安全缓冲区，而在非策略设置时该缓冲区被绕过。

A Heterogeneous Dual-Network Framework for Emergency Delivery UAVs: Communication Assurance and Path Planning Coordination

紧急投递无人机的异构双网络框架：通信保障与路径规划协调

Authors: Ping Huang, Bin Duo, Ziedor Godfred, Liuwei Huo, Jin Ning, Xiaojun Yuan, Jun Li
Subjects: Information Theory (cs.IT)
Arxiv link: https://arxiv.org/abs/2604.12501
Pdf link: https://arxiv.org/pdf/2604.12501
Abstract Natural disasters often damage ground infrastructure, making unmanned aerial vehicles (UAVs) essential for emergency supply delivery. Yet safe operation in complex post-disaster environments requires reliable command-and-control (C2) links; link instability can cause loss of control, delay rescue, and trigger severe secondary harm. To provide continuous three-dimensional (3D) C2 coverage during dynamic missions, we propose a Heterogeneous Dual-Network Framework (HDNF) for safe and reliable emergency delivery. HDNF tightly couples an Emergency Communication Support Network (ECSN), formed by hovering UAV base stations, with a Delivery Path Network (DPN), formed by fast-moving delivery UAVs. The ECSN dynamically safeguards mission-critical flight corridors, while the DPN aligns trajectories with reliable coverage regions. We formulate a joint optimization problem over task assignment, 3D UAV-BS deployment, and DPN path planning to maximize end-to-end C2 reliability while minimizing UAV flight energy consumption and base-station deployment cost. To solve this computationally intractable NP-hard problem, we develop a layered strategy with three components: (i) a multi-layer C2 service model that overcomes 2D-metric limitations and aligns UAV-BS deployment with mission-critical 3D phases; (ii) a 3D coverage-aware multi-agent reinforcement learning algorithm that addresses the high-dimensional search space and improves both training efficiency and topology resilience; and (iii) a 3D communication-aware A* planner that jointly optimizes C2 quality and flight energy, mitigating trajectory--coverage mismatch and improving routing safety. Extensive simulations show that HDNF markedly improves C2 reliability, eliminates outages in critical phases, and sustains high task success rates while reducing hardware deployment cost.
中文摘要 自然灾害常常破坏地面基础设施，使得无人机（UAV）成为紧急物资配送的必需品。然而，在复杂的灾后环境中安全运行需要可靠的指挥控制（C2）链路;链路不稳定可能导致失控、延迟救援，并引发严重的次级伤害。为了在动态任务中提供连续的三维（3D）C2覆盖，我们提出了一种异构双网络框架（HDNF）以实现安全可靠的紧急投递。HDNF紧密连接由悬停无人机基站组成的紧急通信支持网络（ECSN）与由快速运输无人机组成的投递路径网络（DPN）。ECSN动态保护关键任务飞行走廊，而DPN则将轨迹与可靠的覆盖区域对齐。我们提出了任务分配、3D 无人机-BS部署和DPN路径规划的联合优化问题，以最大化端到端C2可靠性，同时最小化无人机飞行能耗和基站部署成本。为解决这一计算上难以解决的NP难题，我们开发了一套分层策略，包含三个部分：（i）克服二维度量限制的多层C2服务模型，使无人机-BS部署与关键三维阶段保持一致;（ii）一种三维覆盖感知的多智能体强化学习算法，能够处理高维搜索空间，同时提升训练效率和拓扑韧性;以及（iii）一个三维通信感知的A*规划器，能够共同优化C2质量和飞行能量，减轻轨迹覆盖不匹配并提升航线安全。大量模拟表明，HDNF显著提升了C2的可靠性，消除了关键阶段的中断，并在降低硬件部署成本的同时保持了高任务成功率。

Whole-Body Mobile Manipulation using Offline Reinforcement Learning on Sub-optimal Controllers

在次优控制器上进行全体移动操作，采用离线强化学习

Authors: Snehal Jauhri, Vignesh Prasad, Georgia Chalvatzaki
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.12509
Pdf link: https://arxiv.org/pdf/2604.12509
Abstract Mobile Manipulation (MoMa) of articulated objects, such as opening doors, drawers, and cupboards, demands simultaneous, whole-body coordination between a robot's base and arms. Classical whole-body controllers (WBCs) can solve such problems via hierarchical optimization, but require extensive hand-tuned optimization and remain brittle. Learning-based methods, on the other hand, show strong generalization capabilities but typically rely on expensive whole-body teleoperation data or heavy reward engineering. We observe that even a sub-optimal WBC is a powerful structural prior: it can be used to collect data in a constrained, task-relevant region of the state-action space, and its behavior can still be improved upon using offline reinforcement learning. Building on this, we propose WHOLE-MoMa, a two-stage pipeline that first generates diverse demonstrations by randomizing a lightweight WBC, and then applies offline RL to identify and stitch together improved behaviors via a reward signal. To support the expressive action-chunked diffusion policies needed for complex coordination tasks, we extend offline implicit Q-learning with Q-chunking for chunk-level critic evaluation and advantage-weighted policy extraction. On three tasks of increasing difficulty using a TIAGo++ mobile manipulator in simulation, WHOLE-MoMa significantly outperforms WBC, behavior cloning, and several offline RL baselines. Policies transfer directly to the real robot without finetuning, achieving 80% success in bimanual drawer manipulation and 68% in simultaneous cupboard opening and object placement, all without any teleoperated or real-world training data.
中文摘要 移动操作（MoMa）对可活动物体（如开门、抽屉和橱柜）要求机器人底座与手臂同时进行全身协调。经典的全体控制器（WBC）可以通过层级优化解决此类问题，但需要大量手工优化，且仍较脆弱。而基于学习的方法则展现出强大的泛化能力，但通常依赖昂贵的全身远程操作数据或大量奖励工程。我们观察到，即使是次优的白细胞，也是一个强大的结构先验：它可以用于收集状态-行动空间中受限且与任务相关的区域的数据，其行为仍可通过离线强化学习得到改进。基于此，我们提出了WHOLE-MoMA的两阶段流程，首先通过随机化一个轻量级白细胞生成多样化的演示，然后应用离线强化学习，通过奖励信号识别并拼接出改进行为。为支持复杂协调任务所需的表达性动作块扩散策略，我们将离线隐式Q学习与Q块化扩展，用于区块级批评评估和优势加权策略提取。在使用TIAGo++移动操作器进行模拟的三个难度递增任务中，WHOLE-MoMa显著优于白细胞计数、行为克隆和多个离线强化学习基线。策略直接传递给真实机器人，无需微调，双手操作抽屉成功率达80%，同时开柜和放置物品成功率达68%，且无任何远程操作或真实训练数据。
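
Illustrative note (not from the paper): the abstract builds on implicit Q-learning and advantage-weighted policy extraction. The sketch below shows the two standard scalar ingredients for a single sample, without the paper's Q-chunking extension. The default values of `tau`, `beta`, and `max_weight` are hypothetical.

```python
import math

def expectile_loss(diff, tau=0.7):
    """Asymmetric squared loss from implicit Q-learning; diff = Q(s, a) - V(s)."""
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff ** 2

def advantage_weight(q, v, beta=3.0, max_weight=100.0):
    """Exponentiated advantage that weights dataset actions; clipped for stability."""
    return min(math.exp(beta * (q - v)), max_weight)
```

With `tau` above 0.5, underestimating the value of good dataset actions costs more than overestimating it, which pushes V toward the better actions in the data. This is what allows improvement over a sub-optimal data-collecting controller.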

SOAR: Self-Correction for Optimal Alignment and Refinement in Diffusion Models

SOAR：扩散模型中自我校正以实现最佳对齐与精炼

Authors: You Qin, Linqing Wang, Hao Fei, Roger Zimmermann, Liefeng Bo, Qinglin Lu, Chunyu Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.12617
Pdf link: https://arxiv.org/pdf/2604.12617
Abstract The post-training pipeline for diffusion models currently has two stages: supervised fine-tuning (SFT) on curated data and reinforcement learning (RL) with reward models. A fundamental gap separates them. SFT optimizes the denoiser only on ground-truth states sampled from the forward noising process; once inference deviates from these ideal states, subsequent denoising relies on out-of-distribution generalization rather than learned correction, exhibiting the same exposure bias that afflicts autoregressive models, but accumulated along the denoising trajectory instead of the token sequence. RL can in principle address this mismatch, yet its terminal reward signal is sparse, suffers from credit-assignment difficulty, and risks reward hacking. We propose SOAR (Self-Correction for Optimal Alignment and Refinement), a bias-correction post-training method that fills this gap. Starting from a real sample, SOAR performs a single stop-gradient rollout with the current model, re-noises the resulting off-trajectory state, and supervises the model to steer back toward the original clean target. The method is on-policy, reward-free, and provides dense per-timestep supervision with no credit-assignment problem. On SD3.5-Medium, SOAR improves GenEval from 0.70 to 0.78 and OCR from 0.64 to 0.67 over SFT, while simultaneously raising all model-based preference scores. In controlled reward-specific experiments, SOAR surpasses Flow-GRPO in final metric value on both aesthetic and text-image alignment tasks, despite having no access to a reward model. Since SOAR's base loss subsumes the standard SFT objective, it can directly replace SFT as a stronger first post-training stage after pretraining, while remaining fully compatible with subsequent RL alignment.
中文摘要 扩散模型的训练后流程目前分为两个阶段：基于精选数据的监督微调（SFT）和基于奖励模型的强化学习（RL）。两者之间存在根本性的差距。SFT仅在前向噪声过程采样的地真态上优化去噪器;一旦推断偏离这些理想状态，后续去噪依赖于分布外推广而非学习性修正，表现出与自回归模型相同的暴露偏置，但沿去噪轨迹而非标记序列积累。强化学习原则上可以解决这种不匹配问题，但其终端奖励信号稀疏，信用分配困难，且存在被奖励黑客攻击的风险。我们提出了SOAR（自我纠正以实现最优对齐与精炼），这是一种训练后偏倚纠正方法，填补了这一空白。从真实样本出发，SOAR对当前模型进行单次停止梯度滚动，重新噪声产生的偏离轨迹状态，并监督模型转向原始干净目标。该方法符合政策，无需奖励，且提供密集的每步监督，且无信用分配问题。在SD3.5-Medium上，SOAR将GenEval从0.70提升到0.78，OCR从0.64提升到0.67，同时提升所有基于模型的偏好分数。在受控的奖励特定实验中，尽管没有奖励模型，SOAR在美学和文本-图像对齐任务中均超过Flow-GRPO。由于SOAR的基底损耗涵盖了标准SFT目标，它可以直接取代SFT，成为预训练后更强的第一个训练后阶段，同时保持与后续强化学习对齐的完全兼容。

KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance

KnowRL：通过强化学习在有限知识指导下提升LLM推理能力

Authors: Linhao Yu, Tianmeng Yang, Siyu Ding, Renren Jin, Naibin Gu, Xiangzhao Hao, Shuaiyi Nie, Deyi Xiong, Weichong Yin, Yu Sun, Hua Wu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.12627
Pdf link: https://arxiv.org/pdf/2604.12627
Abstract RLVR improves reasoning in large language models, but its effectiveness is often limited by severe reward sparsity on hard problems. Recent hint-based RL methods mitigate sparsity by injecting partial solutions or abstract templates, yet they typically scale guidance by adding more tokens, which introduce redundancy, inconsistency, and extra training overhead. We propose KnowRL (Knowledge-Guided Reinforcement Learning), an RL training framework that treats hint design as a minimal-sufficient guidance problem. During RL training, KnowRL decomposes guidance into atomic knowledge points (KPs) and uses Constrained Subset Search (CSS) to construct compact, interaction-aware subsets for training. We further identify a pruning interaction paradox -- removing one KP may help while removing multiple such KPs can hurt -- and explicitly optimize for robust subset curation under this dependency structure. We train KnowRL-Nemotron-1.5B from OpenMath-Nemotron-1.5B. Across eight reasoning benchmarks at the 1.5B scale, KnowRL-Nemotron-1.5B consistently outperforms strong RL and hinting baselines. Without KP hints at inference, KnowRL-Nemotron-1.5B reaches 70.08 average accuracy, already surpassing Nemotron-1.5B by +9.63 points; with selected KPs, performance improves to 74.16, establishing a new state of the art at this scale. The model, curated training data, and code are publicly available at this https URL.
中文摘要 RLVR提升了大型语言模型中的推理能力，但其有效性常常受限于困难问题上的严重奖励稀疏。最新的基于提示的强化学习方法通过注入部分解或抽象模板来缓解稀疏性，但它们通常通过增加更多代币来扩展指导，这带来了冗余、不一致和额外的训练开销。我们提出了 \textbf{KnowRL}（知识引导强化学习），这是一个将提示设计视为最小充分引导问题的强化学习训练框架。在强化学习训练中，KnowRL将引导分解为原子知识点（KP），并使用受限子集搜索（CSS）构建紧凑且具交互感知的子集用于训练。我们还进一步识别了一个剪枝交互悖论——移除一个KP可能有帮助，而移除多个此类KP则可能有害——并明确优化了在该依赖结构下稳健子集的管理。我们用OpenMath-Nemotron-1.5B训练KnowRL-Nemotron-1.5B。在八个1.5亿尺度的推理基准中，KnowRL-Nemotron-1.5B始终优于强强强化逻辑和暗示基线。没有KP的推断线索，KnowRL-Nemotron-1.5B的平均准确率达到70.08，已比Nemotron-1.5B高出+9.63点;通过选定的KPs，性能提升至74.16，确立了该级别的新技术水平。模型、策划的训练数据和代码均在此 https URL 公开。

Contextual Multi-Task Reinforcement Learning for Autonomous Reef Monitoring

自主珊瑚礁监测的上下文多任务强化学习

Authors: Melvin Laux, Yi-Ling Liu, Rina Alo, Sören Töpper, Mariela De Lucas Alvarez, Frank Kirchner, Rebecca Adam
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.12645
Pdf link: https://arxiv.org/pdf/2604.12645
Abstract Although autonomous underwater vehicles promise the capability of marine ecosystem monitoring, their deployment is fundamentally limited by the difficulty of controlling vehicles under highly uncertain and non-stationary underwater dynamics. To address these challenges, we employ a data-driven reinforcement learning approach to compensate for unknown dynamics and task [...]. Single-task reinforcement learning has a tendency to overfit the training environment, thus limiting the long-term usefulness of the learnt policy. Hence, we propose to use a contextual multi-task reinforcement learning paradigm instead, allowing us to learn controllers that can be reused for various tasks, e.g., detecting oysters in one reef and detecting corals in another. We evaluate whether contextual multi-task reinforcement learning can efficiently learn robust and generalisable control policies for autonomous underwater reef monitoring. We train a single context-dependent policy that is able to solve multiple related monitoring tasks in a simulated reef environment in HoloOcean. In our experiments, we empirically evaluate the contextual policies regarding sample-efficiency, zero-shot generalisation to unseen tasks, and robustness to varying water currents. By utilising multi-task reinforcement learning, we aim to improve the training effectiveness, as well as the reusability of learnt policies to take a step towards more sustainable procedures in autonomous reef monitoring.
中文摘要 尽管自主水下载具有望实现海洋生态系统监测，但其部署在高度不确定且非静止的水下动力学下难以控制载具，实质上受限。为应对这些挑战，我们采用数据驱动强化学习方法来补偿未知动态，任务中这种 http URL 单任务强化学习容易过度拟合训练环境，从而限制了所学策略的长期效用。因此，我们提出采用情境式多任务强化学习范式，使我们能够学习可重复用于各种任务的控制器，例如在一个珊瑚礁检测牡蛎和在另一个珊瑚礁检测珊瑚。我们评估了情境多任务强化学习是否能够高效学习稳健且可推广的自主水下珊瑚礁监测控制策略。我们训练一个单一的上下文相关策略，能够在HoloOcean模拟珊瑚礁环境中解决多个相关监测任务。在我们的实验中，我们实证评估了关于样本效率、零样本泛化对未见任务以及对不同水流的鲁棒性等情境政策。通过多任务强化学习，我们旨在提升培训效果，以及学习策略的可重复使用性，从而迈出自主珊瑚礁监测更可持续的步骤。

PromptEcho: Annotation-Free Reward from Vision-Language Models for Text-to-Image Reinforcement Learning

PromptEcho：视觉语言模型提供的无注释奖励，用于文本到图像强化学习

Authors: Jinlong Liu, Wanggui He, Peng Zhang, Mushui Liu, Hao Jiang, Pipei Huang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.12652
Pdf link: https://arxiv.org/pdf/2604.12652
Abstract Reinforcement learning (RL) can improve the prompt following capability of text-to-image (T2I) models, yet obtaining high-quality reward signals remains challenging: CLIP Score is too coarse-grained, while VLM-based reward models (e.g., RewardDance) require costly human-annotated preference data and additional fine-tuning. We propose PromptEcho, a reward construction method that requires no annotation and no reward model training. Given a generated image and a guiding query, PromptEcho computes the token-level cross-entropy loss of a frozen VLM with the original prompt as the label, directly extracting the image-text alignment knowledge encoded during VLM pretraining. The reward is deterministic, computationally efficient, and improves automatically as stronger open-source VLMs become available. For evaluation, we develop DenseAlignBench, a benchmark of concept-rich dense captions for rigorously testing prompt following capability. Experimental results on two state-of-the-art T2I models (Z-Image and QwenImage-2512) demonstrate that PromptEcho achieves substantial improvements on DenseAlignBench (+26.8pp / +16.2pp net win rate), along with consistent gains on GenEval, DPG-Bench, and TIIFBench without any task-specific training. Ablation studies confirm that PromptEcho comprehensively outperforms inference-based scoring with the same VLM, and that reward quality scales with VLM size. We will open-source the trained models and the DenseAlignBench.
中文摘要 强化学习（RL）可以提升文本转图像（T2I）模型的提示跟随能力，但获得高质量的奖励信号仍然具有挑战性：CLIP Score过于粗粒度，而基于VLM的奖励模型（如RewardDance）则需要昂贵的人工注释偏好数据和额外的微调。我们提出了PromptEcho，一种需要\emph{no}注释和\emph{no}奖励模型训练的奖励构建方法。给定生成的图像和引导查询，PromptEcho 计算以原始提示为标签的冻结 VLM 的令牌级交叉熵损失，直接提取 VLM 预训练中编码的图像-文本对齐知识。奖励是确定性的，计算效率高，并且随着更强大的开源VLM的出现而自动提升。为了评估，我们开发了DenseAlignBench，这是一个概念丰富且密集字幕的基准测试，用于严格测试提示跟随功能。在两个最先进的T2I模型（Z-Image和QwenImage-2512）上的实验结果表明，PromptEcho在DenseAlignBench上实现了显著提升（+26.8pp / +16.2pp净胜率），同时在GenEval、DPG-Bench和TIIFBench上也持续提升，无需任何任务特定训练。消融研究证实，PromptEcho在相同VLM下全面优于基于推理的评分，且奖励质量随VLM大小而成。我们将开源训练好的模型和DenseAlignBench。
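
Illustrative note (not from the paper): a reward built from token-level cross-entropy can be sketched as below. It assumes we already have the probability a frozen VLM assigns to each prompt token given the generated image. The function name, the sign convention, and the averaging over tokens are assumptions, since the abstract does not specify them.

```python
import math

def echo_reward(token_probs):
    """Negative mean cross-entropy of the prompt tokens; higher means better alignment."""
    if not token_probs:
        raise ValueError("prompt must contain at least one token")
    cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return -cross_entropy

# Hypothetical per-token probabilities for the same prompt under two generated images.
reward_matching = echo_reward([0.9, 0.8, 0.9])
reward_mismatched = echo_reward([0.9, 0.1, 0.9])
```

A single prompt token that the image does not support (probability 0.1 here) is enough to lower the reward, which makes the signal finer-grained than one global similarity score.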

Safe reinforcement learning with online filtering for fatigue-predictive human-robot task planning and allocation in production

安全强化学习与在线过滤，用于生产中的疲劳预测人机任务规划与分配

Authors: Jintao Xue, Xiao Li, Nianmin Zhang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.12667
Pdf link: https://arxiv.org/pdf/2604.12667
Abstract Human-robot collaborative manufacturing, a core aspect of Industry 5.0, emphasizes ergonomics to enhance worker well-being. This paper addresses the dynamic human-robot task planning and allocation (HRTPA) problem, which involves determining when to perform tasks and who should execute them to maximize efficiency while ensuring workers' physical fatigue remains within safe limits. The inclusion of fatigue constraints, combined with production dynamics, significantly increases the complexity of the HRTPA problem. Traditional fatigue-recovery models in HRTPA often rely on static, predefined hyperparameters. However, in practice, human fatigue sensitivity varies daily due to factors such as changed work conditions and insufficient sleep. To better capture this uncertainty, we treat fatigue-related parameters as inaccurate and estimate them online based on observed fatigue progression during production. To address these challenges, we propose PF-CD3Q, a safe reinforcement learning (safe RL) approach that integrates the particle filter with constrained dueling double deep Q-learning for real-time fatigue-predictive HRTPA. Specifically, we first develop PF-based estimators to track human fatigue and update fatigue model parameters in real-time. These estimators are then integrated into CD3Q by making task-level fatigue predictions during decision-making and excluding tasks that exceed fatigue limits, thereby constraining the action space and formulating the problem as a constrained Markov decision process (CMDP).
中文摘要 人机协作制造是工业5.0的核心，强调人体工学以提升员工福祉。本文探讨了动态人机任务规划与分配（HRTPA）问题，该问题涉及确定何时执行任务以及由谁执行任务以最大化效率，同时确保工人的身体疲劳保持在安全范围内。疲劳约束的加入，结合生产动力学，显著增加了HRTPA问题的复杂性。HRTPA中的传统疲劳-恢复模型通常依赖静态、预定义的超参数。然而，实际上，由于工作条件的变化和睡眠不足等因素，人类的疲劳敏感性每天都在变化。为了更好地捕捉这种不确定性，我们将疲劳相关参数视为不准确，并基于生产过程中观察到的疲劳进展进行在线估计。为应对这些挑战，我们提出了PF-CD3Q，一种安全强化学习（safe RL）方法，将粒子滤波器与受限双重深度Q学习相结合，实现实时疲劳预测HRTPA。具体来说，我们首先开发基于PF的估计器，用于实时追踪人类疲劳并更新疲劳模型参数。这些估计器随后通过在决策过程中进行任务级疲劳预测，排除超出疲劳极限的任务，从而限制行动空间，并将问题表述为受限马尔可夫决策过程（CMDP）。
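
Illustrative note (not from the paper): the two ideas in the abstract, estimating fatigue parameters online and filtering out tasks predicted to exceed the fatigue limit, can be sketched as follows. The fatigue model `F + lam * load * (1 - F)`, the noise level, and all names are assumptions. Only the importance-weighting step of a particle filter is shown; resampling is omitted.

```python
import math

def reweight(particles, weights, fatigue_before, fatigue_after, noise_std=0.02):
    """One importance-weighting step; each particle is a candidate fatigue rate."""
    new = []
    for lam, w in zip(particles, weights):
        predicted = fatigue_before + lam * (1.0 - fatigue_before)  # assumed model
        err = fatigue_after - predicted
        new.append(w * math.exp(-0.5 * (err / noise_std) ** 2))
    total = sum(new)
    return [w / total for w in new]

def allowed_tasks(task_loads, fatigue, lam_estimate, limit):
    """Keep only tasks whose predicted post-task fatigue stays within the limit."""
    return [name for name, load in task_loads.items()
            if fatigue + lam_estimate * load * (1.0 - fatigue) <= limit]

particles = [0.05, 0.10, 0.20]
weights = reweight(particles, [1 / 3] * 3, fatigue_before=0.0, fatigue_after=0.10)
lam_hat = sum(l * w for l, w in zip(particles, weights))
tasks = allowed_tasks({"light": 1.0, "heavy": 5.0}, fatigue=0.5,
                      lam_estimate=lam_hat, limit=0.7)
```

The observation concentrates the weight on the particle that explains it best, and the resulting estimate is what restricts the action set offered to the Q-learning agent.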

Teaching LLMs Human-Like Editing of Inappropriate Argumentation via Reinforcement Learning

通过强化学习教授LLM类人编辑不当论证

Authors: Timon Ziegenbein, Maja Stahl, Henning Wachsmuth
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.12770
Pdf link: https://arxiv.org/pdf/2604.12770
Abstract Editing human-written text has become a standard use case of large language models (LLMs), for example, to make one's arguments more appropriate for a discussion. Comparing human to LLM-generated edits, however, we observe a mismatch in editing strategies: While LLMs often perform multiple scattered edits and tend to change meaning notably, humans rather encapsulate dependent changes in self-contained, meaning-preserving edits. In this paper, we present a reinforcement learning approach that teaches LLMs human-like editing to improve the appropriateness of arguments. Our approach produces self-contained sentence-level edit suggestions that can be accepted or rejected independently. We train the approach using group relative policy optimization with a multi-component reward function that jointly optimizes edit-level semantic similarity, fluency, and pattern conformity as well as argument-level appropriateness. In automatic and human evaluation, it outperforms competitive baselines and the state of the art in human-like editing, with multi-round editing achieving appropriateness close to full rewriting.
中文摘要 编辑人类文本已成为大型语言模型（LLM）的标准用例，例如，用以使论点更适合讨论。然而，比较人类与大型语言模型生成的编辑时，我们观察到编辑策略存在不匹配：虽然大型语言模型通常执行多次零散编辑并倾向于显著改变意义，而人类则更倾向于封装依赖性变化，形成自包含、保持意义的编辑。本文提出了一种强化学习方法，教授LLM类人编辑，以提升论证的适当性。我们的方法产生了自成体系的句子级编辑建议，可以独立接受或拒绝。我们通过组相对策略优化和多元奖励函数训练该方法，共同优化编辑层的语义相似度、流畅度和模式一致性以及论证层面的适当性。在自动和人工评估中，它优于竞争对手基线和类人编辑的先进技术，多轮编辑几乎实现了完全重写的合适性。
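
Illustrative note (not from the paper): group relative policy optimization scores each sampled output relative to the other samples for the same input. The sketch below combines four reward components with equal weights (an assumption; the paper's weighting is not given in the abstract) and then normalizes within the group. All names and numbers are hypothetical.

```python
import statistics

def combined_reward(components, weights):
    """Weighted sum of reward components (all assumed to lie in [0, 1])."""
    return sum(weights[name] * value for name, value in components.items())

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

weights = {"similarity": 0.25, "fluency": 0.25, "conformity": 0.25, "appropriateness": 0.25}
group = [
    {"similarity": 0.9, "fluency": 0.9, "conformity": 0.8, "appropriateness": 0.6},
    {"similarity": 0.4, "fluency": 0.9, "conformity": 0.3, "appropriateness": 0.8},
    {"similarity": 0.9, "fluency": 0.8, "conformity": 0.9, "appropriateness": 0.9},
]
rewards = [combined_reward(c, weights) for c in group]
advantages = group_relative_advantages(rewards)
```

The second candidate gains appropriateness but drifts in meaning and pattern conformity, so it ends up with the lowest advantage in its group.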

FastGrasp: Learning-based Whole-body Control method for Fast Dexterous Grasping with Mobile Manipulators

快速抓握：基于学习的全身控制方法，利用移动操作器实现快速灵巧抓握

Authors: Heng Tao, Yiming Zhong, Zemin Yang, Yuexin Ma
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.12879
Pdf link: https://arxiv.org/pdf/2604.12879
Abstract Fast grasping is critical for mobile robots in logistics, manufacturing, and service applications. Existing methods face fundamental challenges in impact stabilization under high-speed motion, real-time whole-body coordination, and generalization across diverse objects and scenarios, limited by fixed bases, simple grippers, or slow tactile response capabilities. We propose FastGrasp, a learning-based framework that integrates grasp guidance, whole-body control, and tactile feedback for mobile fast grasping. Our two-stage reinforcement learning strategy first generates diverse grasp candidates via a conditional variational autoencoder conditioned on object point clouds, then executes coordinated movements of mobile base, arm, and hand guided by optimal grasp selection. Tactile sensing enables real-time grasp adjustments to handle impact effects and object variations. Extensive experiments demonstrate superior grasping performance in both simulation and real-world scenarios, achieving robust manipulation across diverse object geometries through effective sim-to-real transfer.
中文摘要 快速抓取对于物流、制造和服务应用中的移动机器人至关重要。现有方法在高速运动下的冲击稳定、全身实时协调以及跨越不同物体和场景的泛化方面面临根本性挑战，受限于固定底座、简单抓握器或迟钝的触觉响应能力。我们提出了 \textbf{FastGrasp}，这是一个基于学习的框架，集成了抓取引导、全身控制和触觉反馈，用于移动快速抓取。我们的两阶段强化学习策略首先通过条件变分自编码器生成多样化的抓取候选对象，基于对象点云，然后根据最优抓握选择执行移动基座、手臂和手部的协调动作。触觉感应支持实时调整抓握，以处理撞击效应和物体变化。大量实验证明了在模拟和现实场景中都表现出优异的抓取性能，通过有效的模拟到现实传输实现了在不同物体几何形态上的稳健操作。

Tree Learning: A Multi-Skill Continual Learning Framework for Humanoid Robots

树状学习：人形机器人的多技能持续学习框架

Authors: Yifei Yan, Linqi Ye
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.12909
Pdf link: https://arxiv.org/pdf/2604.12909
Abstract As reinforcement learning for humanoid robots evolves from single-task to multi-skill paradigms, efficiently expanding new skills while avoiding catastrophic forgetting has become a key challenge in embodied intelligence. Existing approaches either rely on complex topology adjustments in Mixture-of-Experts (MoE) models or require training extremely large-scale models, making lightweight deployment difficult. To address this, we propose Tree Learning, a multi-skill continual learning framework for humanoid robots. The framework adopts a root-branch hierarchical parameter inheritance mechanism, providing motion priors for branch skills through parameter reuse to fundamentally prevent catastrophic forgetting. A multi-modal feedforward adaptation mechanism combining phase modulation and interpolation is designed to support both periodic and aperiodic motions. A task-level reward shaping strategy is also proposed to accelerate skill convergence. Unity-based simulation experiments show that, in contrast to simultaneous multi-task training, Tree Learning achieves higher rewards across various representative locomotion skills while maintaining a 100% skill retention rate, enabling seamless multi-skill switching and real-time interactive control. We further validate the performance and generalization capability of Tree Learning on two distinct Unity-simulated tasks: a Super Mario-inspired interactive scenario and autonomous navigation in a classical Chinese garden environment.
中文摘要 随着人形机器人强化学习从单一任务向多技能范式发展，高效扩展新技能同时避免灾难性遗忘已成为具身智能中的关键挑战。现有方法要么依赖专家混合（MoE）模型中的复杂拓扑调整，要么需要训练极大规模模型，这使得轻量化部署变得困难。为此，我们提出了树学习（Tree Learning），一种面向人形机器人的多技能持续学习框架。该框架采用根分支层级参数继承机制，通过参数重用为分支技能提供运动先验，从根本上防止灾难性遗忘。结合相位调制和插值的多模态前馈适应机制设计用于支持周期性和非周期性运动。还提出了一种任务级奖励塑造策略，以加速技能融合。基于Unity的模拟实验表明，与同时多任务训练不同，树学习在多种代表性移动技能方面获得更高奖励，同时保持100%技能保持率，实现无缝多技能切换和实时交互控制。我们还进一步验证了树学习在两个不同Unity模拟任务上的表现和泛化能力：受超级马里奥启发的互动场景和经典中国园林环境中的自主导航。
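
Illustrative note (not from the paper): one simple reading of root-branch parameter inheritance is that a new skill starts from a copy of its parent's parameters and is trained separately, so earlier skills are never overwritten. The class, its method, and the dictionary-based parameters are hypothetical stand-ins for network weights.

```python
import copy

class SkillTree:
    """Stores one parameter set per skill; branches inherit from a parent."""

    def __init__(self, root_params):
        self.skills = {"root": root_params}

    def add_branch(self, name, parent="root"):
        """Create a new skill initialized from a deep copy of its parent."""
        self.skills[name] = copy.deepcopy(self.skills[parent])
        return self.skills[name]

tree = SkillTree({"w": [0.1, 0.2]})
jump = tree.add_branch("jump")
jump["w"][0] += 1.0  # stand-in for training the branch only
```

Because the branch owns its own copy, updating it leaves the root untouched. Under this reading, that separation is what a 100% skill retention rate would correspond to.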

E2E-Fly: An Integrated Training-to-Deployment System for End-to-End Quadrotor Autonomy

E2E-Fly：一个集成的训练到部署系统，实现端到端四旋翼自主性

Authors: Fangyu Sun, Fanxing Li, Linzuo Zhang, Yu Hu, Renbiao Jin, Shuyu Wu, Wenxian Yu, Danping Zou
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.12916
Pdf link: https://arxiv.org/pdf/2604.12916
Abstract Training and transferring learning-based policies for quadrotors from simulation to reality remains challenging due to inefficient visual rendering, physical modeling inaccuracies, unmodeled sensor discrepancies, and the absence of a unified platform integrating differentiable physics learning into end-to-end training. While recent work has demonstrated various end-to-end quadrotor control tasks, few systems provide a systematic, zero-shot transfer pipeline, hindering reproducibility and real-world deployment. To bridge this gap, we introduce E2E-Fly, an integrated framework featuring an agile quadrotor platform coupled with a full-stack training, validation, and deployment workflow. The training framework incorporates a high-performance simulator with support for differentiable physics learning and reinforcement learning, alongside structured reward design tailored to common quadrotor tasks. We further introduce a two-stage validation strategy using sim-to-sim transfer and hardware-in-the-loop testing, and deploy policies onto two physical quadrotor platforms via a dedicated low-level control interface and a comprehensive sim-to-real alignment methodology, encompassing system identification, domain randomization, latency compensation, and noise modeling. To the best of our knowledge, this is the first work to systematically unify differentiable physical learning with training, validation, and real-world deployment for quadrotors. Finally, we demonstrate the effectiveness of our framework for training six end-to-end control tasks and deploy them in the real world.
中文摘要 由于视觉渲染效率低下、物理建模不准确、存在未建模的传感器差异，以及缺乏将可微物理学习整合到端到端训练中的统一平台，训练基于学习的四旋翼策略并将其从仿真迁移到现实仍然充满挑战。尽管近期研究展示了多种端到端四旋翼控制任务，但很少有系统提供系统化的零样本迁移流程，这阻碍了可复现性和实际部署。为弥合这一差距，我们推出了E2E-Fly，一个集成框架，将敏捷的四旋翼平台与全栈训练、验证和部署工作流程相结合。该训练框架包含一个支持可微物理学习和强化学习的高性能仿真器，以及为常见四旋翼任务量身定制的结构化奖励设计。我们进一步引入了两阶段验证策略，采用仿真到仿真迁移和硬件在环测试，并通过专用的底层控制接口和全面的仿真到现实对齐方法（涵盖系统辨识、域随机化、延迟补偿和噪声建模），将策略部署到两个物理四旋翼平台上。据我们所知，这是首个系统性地将可微物理学习与四旋翼的训练、验证及真实世界部署统一起来的工作。最后，我们展示了该框架在训练六个端到端控制任务上的有效性，并将这些策略部署到现实世界。
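
The E2E-Fly abstract lists domain randomization and latency compensation among its sim-to-real alignment steps without giving details. The following is a minimal illustrative sketch of those two ideas only, not the authors' implementation: `QuadParams`, `sample_params`, `DelayedActuator`, the nominal values, and the 10% spread are all hypothetical names and numbers chosen for the example.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class QuadParams:
    """Per-episode physical parameters (hypothetical fields for illustration)."""
    mass_kg: float
    thrust_coeff: float
    action_delay_steps: int


def sample_params(rng, nominal_mass=1.0, nominal_thrust=1.0,
                  spread=0.1, max_delay=3):
    """Domain randomization: perturb nominal values at the start of an episode."""
    return QuadParams(
        mass_kg=nominal_mass * rng.uniform(1 - spread, 1 + spread),
        thrust_coeff=nominal_thrust * rng.uniform(1 - spread, 1 + spread),
        action_delay_steps=rng.randint(0, max_delay),
    )


class DelayedActuator:
    """Simulated actuation latency: an action takes effect `delay_steps` later."""

    def __init__(self, delay_steps, idle_action=0.0):
        # Pre-fill the queue so the first `delay_steps` outputs are idle actions.
        self._queue = deque([idle_action] * delay_steps)

    def step(self, action):
        self._queue.append(action)
        return self._queue.popleft()


if __name__ == "__main__":
    rng = random.Random(0)
    params = sample_params(rng)
    actuator = DelayedActuator(params.action_delay_steps)
    applied = [actuator.step(a) for a in (0.1, 0.2, 0.3, 0.4)]
    print(params, applied)
```

Training against a randomized delay like this is one common way to make a policy tolerant of real actuation latency; whether E2E-Fly randomizes the delay or compensates for a measured one is not stated in the abstract.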

Graph-based Hierarchical Deep Reinforcement Learning for Deliverable Block Propagation with Optimal Hybrid Cost in Web 3.0

面向Web 3.0中混合成本最优的可交付区块传播的基于图的分层深度强化学习

Authors: Shi Chen, Jinbo Wen, Jiawen Kang, Tenghui Huang, Maomao Zhang, Tao Zhang, Dong In Kim
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2604.12920
Pdf link: https://arxiv.org/pdf/2604.12920
Abstract Web 3.0 is envisioned as a decentralized paradigm, where blockchain serves as a core technology for transparent and tamper-proof data management. Among various blockchain architectures, consortium blockchains have emerged as the preferred platform for enterprise-grade Web 3.0. For consortium blockchains, newly generated blocks are generally propagated to all consensus nodes for validation through the gossip protocol. However, gossip-based propagation may introduce substantial message redundancy and tail latency. Moreover, the consensus nodes exhibit heterogeneous availability patterns, and existing block propagation schemes often overlook such temporal constraints. Therefore, the joint optimization of propagation timeliness and delivery coverage remains an open problem. In this paper, we propose a deliverable block propagation optimization framework for consortium blockchain-enabled Web 3.0. We first propose a delivery-aware timeliness metric called Age of Validated Block (AoVB), which excludes block receptions occurring outside the availability window of each consensus node, thereby measuring only actionable synchronization latency. This metric is unified with the block arrival rate into a hybrid cost objective that balances timeliness against delivery. To solve this complex optimization problem, we propose a Graph-based Hierarchical Deep Reinforcement Learning (GHDRL) method, which comprises a graph isomorphism network-based assignment module and a graph attention network-based propagation module. The two modules are optimized jointly under a two-stage training strategy. Numerical results show that GHDRL consistently outperforms all compared schemes across network scales from 50 to 500 peers, achieving up to 19.2% lower hybrid cost than the best-performing neural baseline. Moreover, the model generalizes from 100-peer training instances to 500-peer deployments without retraining.
中文摘要 Web 3.0被设想为一种去中心化范式，其中区块链是实现透明且防篡改数据管理的核心技术。在各种区块链架构中，联盟区块链已成为企业级Web 3.0的首选平台。在联盟区块链中，新生成的区块通常通过gossip协议传播到所有共识节点进行验证。然而，基于gossip的传播可能会引入大量消息冗余和尾部延迟。此外，共识节点表现出异构的可用性模式，而现有的区块传播方案常常忽视这类时间约束。因此，传播时效性和交付覆盖率的联合优化仍是一个悬而未决的问题。本文针对联盟区块链赋能的Web 3.0提出了一个可交付区块传播优化框架。我们首先提出了一种称为已验证区块年龄（AoVB）的交付感知时效性指标，该指标排除了发生在每个共识节点可用性窗口之外的区块接收，从而仅测量可操作的同步延迟。该指标与区块到达率统一为一个混合成本目标，在时效性与交付之间取得平衡。为解决这一复杂优化问题，我们提出了一种基于图的分层深度强化学习（GHDRL）方法，该方法包括基于图同构网络的分配模块和基于图注意力网络的传播模块。这两个模块在两阶段训练策略下联合优化。数值结果显示，GHDRL在50至500个对等节点的网络规模下始终优于所有对比方案，混合成本比表现最佳的神经网络基线最多低19.2%。此外，该模型可从100个对等节点的训练实例泛化到500个对等节点的部署，而无需重新训练。
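
The abstract describes AoVB as counting only block receptions that fall inside a node's availability window, and combining it with the arrival rate into a hybrid cost. The exact formulas are not given, so the sketch below is an assumption made for illustration: the `Reception` type, the mean-age definition, the `weight` parameter, and the linear combination are all hypothetical and may differ from the paper.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Reception:
    """One consensus node's view of a block (hypothetical structure)."""
    recv_time: Optional[float]  # None if the block never arrived
    window_start: float         # start of the node's availability window
    window_end: float           # end of the node's availability window


def is_delivered(r: Reception) -> bool:
    """A reception counts only if it lands inside the availability window."""
    return r.recv_time is not None and r.window_start <= r.recv_time <= r.window_end


def hybrid_cost(gen_time: float, receptions: Sequence[Reception],
                weight: float = 0.5) -> float:
    """Assumed form: weight * mean age of delivered blocks
    + (1 - weight) * fraction of nodes that missed the block."""
    delivered = [r for r in receptions if is_delivered(r)]
    delivery_rate = len(delivered) / len(receptions) if receptions else 0.0
    # With no delivered receptions the age term is set to 0 by convention here.
    mean_age = (sum(r.recv_time - gen_time for r in delivered) / len(delivered)
                if delivered else 0.0)
    return weight * mean_age + (1 - weight) * (1 - delivery_rate)


if __name__ == "__main__":
    nodes = [
        Reception(recv_time=2.0, window_start=0.0, window_end=5.0),   # counted
        Reception(recv_time=6.0, window_start=0.0, window_end=5.0),   # too late
        Reception(recv_time=None, window_start=0.0, window_end=5.0),  # never arrived
    ]
    print(hybrid_cost(gen_time=0.0, receptions=nodes))
```

In this toy form, a late reception is treated the same as a missing one, which is the point of a delivery-aware metric: latency is only measured where the node could act on the block.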

Cycle-Consistent Search: Question Reconstructability as a Proxy Reward for Search Agent Training

循环一致性搜索：以问题可重建性作为搜索智能体训练的代理奖励

Authors: Sohyun An (1 and 2), Shuibenyang Yuan (1), Hayeon Lee (1), Cho-Jui Hsieh (2), Alexander Min (1) ((1) Meta Superintelligence Labs, (2) UCLA)
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.12967
Pdf link: https://arxiv.org/pdf/2604.12967
Abstract Reinforcement Learning (RL) has shown strong potential for optimizing search agents in complex information retrieval tasks. However, existing approaches predominantly rely on gold supervision, such as ground-truth answers, which is difficult to scale. To address this limitation, we propose Cycle-Consistent Search (CCS), a gold-supervision-free framework for training search agents, inspired by cycle-consistency techniques from unsupervised machine translation and image-to-image translation. Our key hypothesis is that an optimal search trajectory, unlike insufficient or irrelevant ones, serves as a lossless encoding of the question's intent. Consequently, a high-quality trajectory should preserve the information required to accurately reconstruct the original question, thereby inducing a reward signal for policy optimization. However, naive cycle-consistency objectives are vulnerable to information leakage, as reconstruction may rely on superficial lexical cues rather than the underlying search process. To reduce this effect, we apply information bottlenecks, including exclusion of the final response and named entity recognition (NER) masking of search queries. These constraints force reconstruction to rely on retrieved observations together with the structural scaffold, ensuring that the resulting reward signal reflects informational adequacy rather than linguistic redundancy. Experiments on question-answering benchmarks show that CCS achieves performance comparable to supervised baselines while outperforming prior methods that do not rely on gold supervision. These results suggest that CCS provides a scalable training paradigm for training search agents in settings where gold supervision is unavailable.
中文摘要 强化学习（RL）在优化复杂信息检索任务中的搜索智能体方面展现出强大潜力。然而，现有方法主要依赖黄金监督（如真实答案），难以规模化。为解决这一限制，我们提出了循环一致性搜索（CCS），这是一个无需黄金监督的搜索智能体训练框架，灵感来源于无监督机器翻译和图像到图像转换中的循环一致性技术。我们的关键假设是，与信息不足或不相关的轨迹不同，最优的搜索轨迹可以作为问题意图的无损编码。因此，高质量的轨迹应保留准确重建原始问题所需的信息，从而为策略优化提供奖励信号。然而，朴素的循环一致性目标容易受到信息泄露的影响，因为重建可能依赖于表面的词汇线索而非底层的搜索过程。为减少这一影响，我们采用信息瓶颈，包括排除最终回复以及对搜索查询进行命名实体识别（NER）掩码。这些约束迫使重建依赖于检索到的观测结果和结构骨架，确保所得奖励信号反映的是信息充分性，而非语言上的冗余。在问答基准上的实验显示，CCS的性能与有监督基线相当，同时优于此前不依赖黄金监督的方法。这些结果表明，CCS为在缺乏黄金监督的场景中训练搜索智能体提供了可扩展的训练范式。
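
The reward described in the CCS abstract can be sketched as follows: drop the final response, mask named entities in the search queries, reconstruct the question from what remains, and score the reconstruction against the original. This is a sketch under stated assumptions, not the paper's implementation: the `Trajectory` type, the caller-supplied entity list (standing in for an NER model), the `reconstruct` callable (standing in for a reconstruction model), and token-level F1 as the similarity score are all hypothetical choices.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Trajectory:
    steps: List[Tuple[str, str]]  # (search query, retrieved observation)
    final_response: str           # deliberately never shown to the reconstructor


def mask_entities(text: str, entities: Sequence[str], token: str = "[MASK]") -> str:
    """Replace each listed entity string; longest first to avoid partial masks."""
    for ent in sorted(entities, key=len, reverse=True):
        text = re.sub(re.escape(ent), token, text, flags=re.IGNORECASE)
    return text


def build_context(traj: Trajectory, entities: Sequence[str]) -> str:
    """Information bottleneck: masked queries plus raw observations only."""
    lines = []
    for i, (query, obs) in enumerate(traj.steps, 1):
        lines.append(f"query {i}: {mask_entities(query, entities)}")
        lines.append(f"observation {i}: {obs}")
    return "\n".join(lines)


def token_f1(pred: str, ref: str) -> float:
    """Token-overlap F1, used here as a stand-in similarity measure."""
    p, r = re.findall(r"\w+", pred.lower()), re.findall(r"\w+", ref.lower())
    if not p or not r:
        return 0.0
    common = sum(min(p.count(t), r.count(t)) for t in set(p))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(r)
    return 2 * precision * recall / (precision + recall)


def cycle_reward(question: str, traj: Trajectory, entities: Sequence[str],
                 reconstruct: Callable[[str], str]) -> float:
    """Reward = similarity between the reconstructed and original question."""
    return token_f1(reconstruct(build_context(traj, entities)), question)
```

Because `build_context` never reads `final_response` and masks entities only in queries, a reconstructor can score well only if the retrieved observations themselves carry the information needed to recover the question.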

Keyword: diffusion policy

There is no result