Generated: 2026-06-12 19:44:05 (UTC+8); arXiv release time: 2026-06-12 20:00 EDT (2026-06-13 08:00 UTC+8)
30 related papers today
Keyword: reinforcement learning
ReCal: Reward Calibration for RL-based LLM Routing
ReCal:基于强化学习的大型语言模型路由奖励校准
- Authors: Qihang Yu, Hanwen Tong, Zhengqi Zhang, Bo Zheng, Feng Wei, Shengyu Zhang, Zemin Liu, Fei Wu
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.12479
- Pdf link: https://arxiv.org/pdf/2606.12479
- Abstract
Large language model (LLM) routing has emerged as an effective paradigm for leveraging the complementary strengths of multiple LLMs through dynamic model and reasoning-strategy selection. Recent reinforcement learning (RL)-based routing methods further improve routing quality by optimizing routing policies from interaction feedback. However, they still struggle to provide informative and comparable learning signals under heterogeneous tasks with varying difficulty. In practice, multiple objectives (e.g., correctness, format behavior) are aggregated into a single scalar reward, leading to ambiguous credit assignment and conflicting optimization signals. Moreover, reward signals exhibit significant variability across instances, where some instances produce higher or more variable rewards, introducing optimization bias that favors trivial samples over informative ones. To address these issues, we propose ReCal, a Reward Calibration framework for RL-based LLM routing. We first introduce a hierarchical reward decomposition mechanism with component-wise advantage estimation. We further propose a distribution-aware optimization strategy that calibrates optimization variability through variance-aware reweighting and per-dataset normalization. Experiments on seven datasets demonstrate that ReCal consistently improves routing performance and training stability over baselines. Code is available at this https URL.
- 中文摘要
大型语言模型(LLM)路由通过动态选择模型和推理策略,已成为利用多个LLM互补优势的有效范式。近期基于强化学习(RL)的路由方法根据交互反馈优化路由策略,进一步提升了路由质量。然而,在难度各异的异构任务下,这些方法仍难以提供有信息量且可比的学习信号。实践中,多个目标(如正确性、格式行为)被聚合为单一标量奖励,导致信用分配模糊、优化信号相互冲突。此外,奖励信号在不同实例间差异显著,部分实例产生的奖励更高或波动更大,由此引入的优化偏差使训练偏向平凡样本而非信息量丰富的样本。为解决这些问题,我们提出了 ReCal,一个面向基于强化学习的 LLM 路由的奖励校准(Reward Calibration)框架。我们首先引入带有逐分量优势估计的分层奖励分解机制,并进一步提出一种分布感知的优化策略,通过方差感知重加权和逐数据集归一化来校准优化过程中的波动。在七个数据集上的实验表明,ReCal 相较基线方法持续提升了路由性能和训练稳定性。代码可在此 https URL 访问。
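The per-dataset normalization mentioned in the ReCal abstract can be illustrated with a minimal sketch, assuming a plain z-score computed within each dataset. The function name `normalize_per_dataset`, the `(dataset, reward)` input format, and the `eps` term are illustrative assumptions; the abstract does not specify ReCal's actual formula.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize_per_dataset(samples, eps=1e-8):
    """Z-score each reward against the other rewards from the same dataset.

    samples: list of (dataset_name, reward) pairs (hypothetical format).
    Returns the normalized rewards in the same order as the input.
    """
    by_dataset = defaultdict(list)
    for name, reward in samples:
        by_dataset[name].append(reward)

    # Mean and population standard deviation per dataset.
    stats = {name: (mean(rs), pstdev(rs)) for name, rs in by_dataset.items()}

    # eps avoids division by zero when all rewards in a dataset are equal.
    return [(reward - stats[name][0]) / (stats[name][1] + eps)
            for name, reward in samples]
```

After this step, an easy dataset with rewards near 1.0 and a hard dataset with rewards near 0.2 both yield signals centered at zero, so neither dominates the update purely because of its reward scale.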
Boosting Direct Preference Optimization with Penalization
通过惩罚提升直接偏好优化
- Authors: Pengwei Sun
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.12505
- Pdf link: https://arxiv.org/pdf/2606.12505
- Abstract
Offline preference optimization has become a practical substitute for reinforcement learning from human feedback, but pairwise objectives such as Direct Preference Optimization (DPO) and its variants use only the chosen and rejected responses stored in a static dataset. This leaves a useful signal unused: the response that the reference model itself would generate for the same prompt. We propose Direct Preference Optimization with Penalization (DPOP), a simple extension of DPO that augments the base preference loss with a gated penalty on reference-greedy responses. DPOP activates this penalty only when the current policy still assigns a lower likelihood to the preferred response than to the rejected response. On AlpacaEval 2.0, DPOP improves length-controlled win rate over DPO, SimPO, and AlphaDPO on both Llama-3-8b-it and Gemma-2-9b-it, achieving relative gains of 5.3% and 4.4% over baselines on the two models, respectively. Ablations further show that a SimNPO-style length-normalized penalty is stronger than NPO and token-level unlikelihood in this setting.
- 中文摘要
离线偏好优化已成为基于人类反馈的强化学习的实用替代方案,但直接偏好优化(DPO)及其变体等成对目标仅使用静态数据集中存储的被选中和被拒绝的响应。这使一个有用的信号未被利用:参考模型自身针对同一提示词会生成的响应。我们提出带惩罚的直接偏好优化(DPOP),它是DPO的简单扩展,在基础偏好损失之上增加了一项针对参考模型贪婪解码响应的门控惩罚。仅当当前策略赋予优选响应的似然仍低于被拒绝响应时,DPOP才激活该惩罚。在AlpacaEval 2.0上,DPOP在Llama-3-8b-it和Gemma-2-9b-it上的长度控制胜率均高于DPO、SimPO和AlphaDPO,在两个模型上相对基线分别取得5.3%和4.4%的相对提升。消融实验进一步表明,在该设置下,SimNPO式的长度归一化惩罚比NPO和词元级非似然(unlikelihood)惩罚更有效。
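The gating rule described in the DPOP abstract can be sketched as follows. The DPO term uses the standard form, the negative log-sigmoid of the scaled log-ratio margin. The exact form of the penalty on the reference-greedy response is not given in the abstract, so it is passed in here as a precomputed number; `gated_preference_loss`, `penalty`, and `lam` are hypothetical names and not the paper's API.

```python
import math


def neg_log_sigmoid(x):
    """Numerically stable -log(sigmoid(x)), i.e. softplus(-x)."""
    return math.log1p(math.exp(-abs(x))) + max(-x, 0.0)


def gated_preference_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected,
                          penalty, beta=0.1, lam=1.0):
    """DPO loss plus a penalty that is active only when the gate is open.

    All *_chosen / *_rejected arguments are sequence log-likelihoods.
    `penalty` stands in for the (unspecified) penalty on the
    reference-greedy response; `lam` is an assumed weighting factor.
    """
    # Standard DPO margin: policy-vs-reference log-ratio, chosen minus rejected.
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    dpo = neg_log_sigmoid(beta * margin)

    # Gate opens only while the policy still ranks the rejected response higher.
    gate = 1.0 if pol_chosen < pol_rejected else 0.0
    return dpo + lam * gate * penalty
```

Once the policy ranks the chosen response above the rejected one, the gate closes and the loss reduces to plain DPO.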
Foresight: Iterative Reasoning About Clues that Matter for Navigation
Foresight:对导航关键线索的迭代推理
- Authors: Arthur Zhang, Carl Qi, Donne Su, Xiangyun Meng, Amy Zhang, Joydeep Biswas
- Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.12550
- Pdf link: https://arxiv.org/pdf/2606.12550
- Abstract
Open-world mapless navigation from sparse language instructions requires resolving underspecified goals and inferring which environmental cues are relevant for reaching the goal. For instance, reaching an out-of-view destination may require interpreting ramps, signs, or detours that reveal where to go or which route to take. Prior works are limited by their reliance on known navigation factors and closed-set factor categories, or identify cues before motion planning and miss plan-dependent cues. We argue that pretrained Vision-Language Models (VLMs) can discover novel instruction-relevant cues, but require adaptation to focus on which cues matter and how they should influence motion planning. We realize these ideas in Foresight, a test-time framework in which a finetuned VLM alternates between proposing image-space motion plans and critiquing them using the language goal and visual context. Subsequent plans are conditioned on prior critiques, enabling iterative motion refinement before execution. To align plan critiques and refinements with open-set behavior preferences, we learn a reward model from human feedback and use it to post-train the VLM with reinforcement learning in the plan-critique loop. In offline evaluations and 6 real-world environments, Foresight improves average task success by 37% and reduces interventions per mission by 52% relative to state-of-the-art test-time reasoning and foundation-model baselines, while running in real-time on a Jetson AGX Orin. We will release code, data, and training details to support future work on test-time reasoning for robot motion refinement. Additional videos at: this https URL
- 中文摘要
根据稀疏的语言指令进行开放世界无地图导航,需要消解描述不充分的目标,并推断哪些环境线索与到达目标相关。例如,前往视野之外的目的地时,可能需要解读坡道、标志或绕行路线,它们会提示该去哪里或该走哪条路线。以往工作要么依赖已知的导航因素和封闭集合的因素类别,要么在运动规划之前就识别线索,从而遗漏依赖于规划的线索。我们认为,预训练的视觉语言模型(VLM)能够发现新的、与指令相关的线索,但需要经过适配,才能关注哪些线索重要以及它们应如何影响运动规划。我们在Foresight中实现了这些想法:这是一个测试时框架,其中经过微调的VLM交替提出图像空间中的运动规划,并结合语言目标和视觉上下文对其进行批评。后续规划以先前的批评为条件,从而在执行前对运动进行迭代式改进。为了使规划的批评与改进符合开放集合的行为偏好,我们从人类反馈中学习奖励模型,并用它在“规划-批评”循环中通过强化学习对VLM进行后训练。在离线评估和6个真实环境中,相较于最先进的测试时推理和基础模型基线,Foresight将平均任务成功率提升37%,将每次任务的干预次数减少52%,同时可在Jetson AGX Orin上实时运行。我们将发布代码、数据和训练细节,以支持未来关于机器人运动改进的测试时推理的研究。更多视频请访问:此 https URL
Keep Policy Gradient in Charge: Sibling-Guided Credit Distillation for Long-Horizon Tool-Use Agents
让策略梯度保持主导:面向长程工具使用智能体的同组轨迹引导信用蒸馏
- Authors: Tianyu Ding, Jianhong Xin, Juan Pablo De la Cruz Weinstein
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2606.12634
- Pdf link: https://arxiv.org/pdf/2606.12634
- Abstract
Long-horizon tool-use reinforcement learning can learn from outcome verification, but its trajectory-level advantage is broadcast across many reasoning, API, and answer tokens. Self-distillation promises a denser signal by reusing a policy's own rollouts or a privileged teacher. We show, however, that direct token-level self-distillation can silently destroy tool use: it rehearses teacher behavior without knowing which actions the verifier rewards, so useful skills and harmful shortcuts are amplified together. We introduce Sibling-Guided Credit Distillation (SGCD), which uses distillation for credit assignment rather than as a competing actor loss. Dynamic sampling produces mixed successful and failed sibling rollouts; an external LLM summarizes their contrast into a training-only stepwise credit reference; dense teacher/student divergence drives credit reassignment; and bounded detached credit weights reshape GRPO token advantages. The deployed student sees no external LLM, sibling evidence, or oracle. Across AppWorld and $\tau^3$-airline, SGCD improves over matched GRPO comparators: AppWorld TGC $42.9 \to 45.6$ on test_normal and $24.7 \to 27.0$ on test_challenge, and $\tau^3$-airline pass@1 $0.583 \to 0.602$.
- 中文摘要
长程工具使用强化学习可以从结果验证中学习,但其轨迹级优势会被广播到大量推理、API和答案词元上。自蒸馏通过复用策略自身的轨迹(rollout)或一个拥有特权信息的教师,有望提供更稠密的信号。然而我们发现,直接的词元级自蒸馏可能悄无声息地破坏工具使用能力:它在不知道验证器奖励哪些动作的情况下复现教师行为,因此有用的技能和有害的捷径会被一同放大。我们提出同组轨迹引导的信用蒸馏(Sibling-Guided Credit Distillation, SGCD),将蒸馏用于信用分配,而不是作为与之竞争的actor损失。动态采样产生成功与失败混合的同组轨迹(sibling rollouts);外部LLM将它们的差异总结为仅用于训练的逐步信用参考;稠密的教师/学生分布差异驱动信用重新分配;有界且截断梯度(detached)的信用权重则重塑GRPO的词元优势。部署后的学生模型看不到外部LLM、同组轨迹证据或任何oracle。在AppWorld和$\tau^3$-airline上,SGCD均优于配置匹配的GRPO对照:AppWorld TGC在test_normal上为$42.9 \to 45.6$,在test_challenge上为$24.7 \to 27.0$;$\tau^3$-airline的pass@1为$0.583 \to 0.602$。
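The phrase "bounded detached credit weights reshape GRPO token advantages" can be illustrated with a small sketch. The group-relative advantage follows the usual GRPO recipe (reward minus group mean, divided by group standard deviation; population standard deviation is an assumption here). The clipping bounds `w_min` and `w_max` and both function names are hypothetical, and how SGCD derives the weights from teacher/student divergence is not reproduced.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """One trajectory-level advantage per rollout in a sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def reweight_token_advantages(advantage, credit_weights, w_min=0.5, w_max=1.5):
    """Spread one trajectory advantage over tokens using bounded weights.

    credit_weights: one number per token, treated as constants (no gradient
    would flow through them in a real training loop).
    """
    clipped = [min(max(w, w_min), w_max) for w in credit_weights]
    return [advantage * w for w in clipped]
```

Because the weights are bounded and only rescale the advantage, the sign of each token's update is still decided by the verifier outcome, which matches the abstract's point that the policy gradient stays in charge.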
Individual Control Barrier Functions-Guided Diffusion Model for Safe Offline Multi-Agent Reinforcement Learning
个体控制屏障函数引导的扩散模型,用于安全离线多智能体强化学习
- Authors: Qingyun Guo, Junyi Shi, Jianuo Huang, Tianyu Shi
- Subjects: Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
- Arxiv link: https://arxiv.org/abs/2606.12640
- Pdf link: https://arxiv.org/pdf/2606.12640
- Abstract
Offline reinforcement learning allows control policies to be learned directly from data without online interaction, making it suitable for safety-critical tasks. Recent studies have applied diffusion models to offline reinforcement learning to leverage their strong capacity for modeling complex data distributions. However, existing approaches primarily focus on single-agent settings, leaving the safety challenges in multi-agent environments largely unexplored. In this work, we propose a safe offline multi-agent reinforcement learning algorithm that embeds neural individual control barrier functions into the diffusion model to enhance safety during trajectory generation, with control policies recovered through inverse dynamics. We evaluate our algorithm across diverse benchmarks, demonstrating substantial safety improvements while maintaining competitive rewards.
- 中文摘要
离线强化学习允许直接从数据中学习控制策略,无需在线交互,非常适合安全关键任务。近期研究将扩散模型应用于离线强化学习,以利用其在建模复杂数据分布方面的强大能力。然而,现有方法主要聚焦于单智能体环境,导致多智能体环境的安全挑战大多未被充分探索。本研究提出一种安全的离线多智能体强化学习算法,将神经个体控制屏障函数嵌入扩散模型中,以增强轨迹生成时的安全性,控制策略通过逆动力学恢复。我们在多个基准测试中评估算法,展示了显著的安全改进,同时保持了竞争性的奖励。
Stubborn: A Streamlined and Unified Reinforcement Learning Framework for Robust Motion Tracking and Fall Recovery for Humanoids
Stubborn:一个精简统一的强化学习框架,用于人形机器人的稳健运动跟踪和跌倒恢复
- Authors: Xiao Ren, Yuhui Yang, Zongbiao Weng, Zhijie Liu, He Kong
- Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.12814
- Pdf link: https://arxiv.org/pdf/2606.12814
- Abstract
Recent reinforcement learning approaches have shown great promise in improving humanoid motion tracking performance and achieving fall recovery under disturbances. However, most existing works treat motion tracking and fall recovery as different tasks and require multi-stage training with specialized recovery rewards and/or separate recovery policies. Moreover, existing reinforcement learning-based methods often terminate training episodes immediately after severe tracking failures, limiting recovery-oriented exploration in unstable or fallen states. To address the above issues, we propose Stubborn, a streamlined and unified reinforcement learning framework to achieve robust humanoid motion tracking and fall recovery. Specifically, Stubborn uses an asymmetric Actor-Critic architecture and consists of three major components. First, a yaw-aligned tracking representation is adopted to reduce sensitivity to global drift and heading disturbances while preserving gravity-related balance information. Second, we introduce a Bernoulli-based probabilistic termination mechanism that enables the policy to encourage exploration of fall-recovery behaviors under varying failure modes. Third, we propose a probabilistic termination and tracking-error-driven strategy that dynamically reshapes the sampling distribution based on tracking performance, increasing the training efficiency for difficult motion segments and unstable states. Extensive comparisons with SOTA methods and ablation studies show that Stubborn achieved competitive performance, and the proposed probabilistic termination mechanism and adaptive sampling strategy contributed to the performance and robustness gains. For real-world demonstrations, please refer to this https URL.
- 中文摘要
近期的强化学习方法在提升人形机器人运动跟踪性能和实现扰动下的跌倒恢复方面展现出巨大潜力。然而,大多数现有工作将运动跟踪和跌倒恢复视为不同的任务,需要多阶段训练,并配以专门的恢复奖励和/或单独的恢复策略。此外,现有基于强化学习的方法往往在出现严重跟踪失败后立即终止训练回合,限制了在不稳定或跌倒状态下面向恢复的探索。为解决上述问题,我们提出了Stubborn,一个精简且统一的强化学习框架,用于实现稳健的人形机器人运动跟踪和跌倒恢复。具体而言,Stubborn采用非对称的Actor-Critic架构,并包含三个主要组件。第一,采用偏航对齐的跟踪表示,在保留与重力相关的平衡信息的同时,降低对全局漂移和航向扰动的敏感性。第二,我们引入基于伯努利分布的概率终止机制,促使策略在不同失败模式下探索跌倒恢复行为。第三,我们提出一种由概率终止和跟踪误差驱动的策略,根据跟踪性能动态重塑采样分布,提高对困难运动片段和不稳定状态的训练效率。与SOTA方法的广泛比较和消融研究表明,Stubborn取得了有竞争力的性能,且所提出的概率终止机制和自适应采样策略对性能和鲁棒性的提升均有贡献。真实世界演示请参阅此 https URL。
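The Bernoulli-based probabilistic termination in the Stubborn abstract can be sketched as a single decision function: when tracking has failed, the episode ends only with some probability, so the policy sometimes keeps acting from a fallen state. The threshold test, the parameter names, and the fixed `p_terminate` are illustrative assumptions; the paper's actual failure criterion and probability schedule are not given in the abstract.

```python
import random


def should_terminate(tracking_error, threshold, p_terminate, rng):
    """Decide whether to end the episode at this step.

    A hard termination rule would return True whenever the error exceeds
    the threshold. Here a Bernoulli(p_terminate) draw decides instead, so
    some failed episodes continue and expose recovery states to the policy.
    """
    if tracking_error <= threshold:
        return False  # tracking is fine, never terminate early
    return rng.random() < p_terminate
```

Usage: create `rng = random.Random(seed)` once per environment and call `should_terminate(err, 0.5, 0.3, rng)` each step; with `p_terminate=1.0` this reduces to the usual hard termination.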
Topical Phase Transitions in Artificial Intelligence Research: Large-Scale Evidence and an Early-Warning Signature for Emerging Topics
人工智能研究中的主题相变:大规模证据与新兴主题的预警特征
- Authors: Rasul Khanbayov, Hasan Kurban
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.12828
- Pdf link: https://arxiv.org/pdf/2606.12828
- Abstract
Do research topics in artificial intelligence grow gradually, or do they advance through abrupt, detectable jumps? Analyzing 80,814 accepted main-track papers from five premier AI conferences (ACL, CVPR, ICLR, ICML, NeurIPS) spanning 2017 to 2025, we show major AI topics advance through topical phase transitions: remaining marginal for years, then surging across venues within one to three years. Large language models became the dominant cross-venue topic by 2025, diffusion models rose with comparable abruptness, and language-model methods crossed into computer vision via vision-language models, whereas reinforcement learning compounded smoothly, distinguishing genuine phase transitions from ordinary growth. This structure is our primary contribution: a large-scale, cross-venue characterization of how AI research reorganizes. We then ask whether a transition leaves a detectable footprint before it peaks. We define an early-warning signature, four publication-dynamics criteria frozen on 2017-2021 data, and evaluate it out of sample on 2023-2025 transitions, obtaining a precision of 27% and recall of 63% against a 13.5% base rate. Applied to 2025 data, the signature flags reasoning and test-time compute, agentic AI, multimodal LLMs, retrieval-augmented generation, and world models as topics to monitor over 2026-2028. The source code is also publicly available on GitHub at this https URL.
- 中文摘要
人工智能的研究主题是逐渐增长的,还是通过突然的、可检测的跃迁而发展?通过分析2017年至2025年间五个顶级人工智能会议(ACL、CVPR、ICLR、ICML、NeurIPS)共80,814篇被接收的主会论文,我们发现主要的人工智能主题是通过主题相变推进的:先是多年处于边缘,随后在一到三年内在各会议中激增。到2025年,大型语言模型已成为占主导地位的跨会议主题,扩散模型以相近的突然程度崛起,语言模型方法通过视觉语言模型进入计算机视觉领域,而强化学习则是平稳地累积增长,这将真正的相变与普通增长区分开来。这一结构是我们的主要贡献:对人工智能研究如何重组的大规模、跨会议刻画。随后我们探究一次相变在达到峰值之前是否会留下可检测的痕迹。我们定义了一种预警特征,即在2017-2021年数据上固定下来的四条发表动态标准,并在2023-2025年的相变上进行样本外评估,在基准率为13.5%的情况下获得27%的精确率和63%的召回率。应用于2025年数据时,该特征将推理与测试时计算、智能体AI、多模态LLM、检索增强生成以及世界模型标记为2026-2028年间需要关注的主题。源代码也已在GitHub上公开,见此 https URL。
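To put the reported early-warning numbers in context, the short sketch below compares the precision with the base rate and combines precision and recall into an F1 score. Only the three inputs (27%, 63%, 13.5%) come from the abstract; the lift and F1 values are derived here for illustration and are not reported by the authors.

```python
def lift_over_base_rate(precision, base_rate):
    """How many times more often a flagged topic transitions than a random one."""
    return precision / base_rate


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


precision, recall, base_rate = 0.27, 0.63, 0.135
print(lift_over_base_rate(precision, base_rate))  # about 2x the base rate
print(f1_score(precision, recall))                # about 0.378
```

A flagged topic is therefore roughly twice as likely to undergo a transition as a topic picked at random, even though most flagged topics (73%) do not.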
Fantastic Scientific Agents and How to Build Them: AgentBuild for Rietveld Refinement
出色的科学智能体及其构建方法:用于Rietveld精修的AgentBuild
- Authors: Woong Shin, Craig A. Bridges, Marshall T. McDonnell, Rafael Ferreira da Silva
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.12834
- Pdf link: https://arxiv.org/pdf/2606.12834
- Abstract
As scientific workflows shift from deterministic executables to LLM-based agents, the development practices on offer, such as fine-tuning, reinforcement learning, and prompt-and-go, bury the scientist's judgment. We propose treating agent construction as a workflow stage and introduce AgentBuild, which builds a scientific agent from a contract the scientist authors. The contract is a version-controlled rubric, a difficulty-graded curriculum, and a curated external knowledge base. A rubric-driven judge gates a meta-optimizer coding agent that edits the agent within a declared boundary, so the build compiles the agent, not the scientist's judgment. We instantiate this for Rietveld refinement of X-ray diffraction data through GSAS-II behind MCP and A2A, where a blank-harness construction run progresses through a lithium lanthanum zirconium oxide (LLZO) signal-to-noise ladder, reaches the 4 hour scan as a frontier case, and exposes the workflow-scope limits that remain. The same rubric that rewards credible fits also scores trajectory scope, making the frontier a contract failure rather than a pattern-fitting failure. As base models evolve, re-running AgentBuild is a re-tune, not a rebuild, and the scientist's authored contract remains the durable asset.
- 中文摘要
随着科学工作流从确定性可执行程序转向基于LLM的智能体,现有的开发实践(如微调、强化学习和“提示即用”)掩盖了科学家的判断。我们提议将智能体构建视为工作流的一个阶段,并提出AgentBuild,它根据科学家撰写的契约来构建科学智能体。该契约包括一份受版本控制的评分标准、一套按难度分级的课程,以及一个精心整理的外部知识库。由评分标准驱动的评判器对一个元优化器编码智能体进行把关,后者在声明的边界内编辑智能体,因此构建过程编译的是智能体,而不是科学家的判断。我们针对X射线衍射数据的Rietveld精修实例化了这一方法,通过置于MCP和A2A之后的GSAS-II实现;其中一次从空白框架(blank harness)开始的构建沿着锂镧锆氧(LLZO)的信噪比阶梯推进,以4小时扫描作为前沿案例,并暴露出仍然存在的工作流范围限制。奖励可信拟合的同一评分标准也对轨迹范围打分,使前沿案例成为契约层面的失败,而非模式拟合的失败。随着基础模型的演进,重新运行AgentBuild是一次重新调优而非重建,科学家撰写的契约始终是持久的资产。
Direct Preference Optimization for Chatbot Fine-Tuning: An Empirical Study
聊天机器人微调的直接偏好优化:一项实证研究
- Authors: Yvonne Qiu, Dezhi Yu, ShuoJia Fu
- Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2606.12881
- Pdf link: https://arxiv.org/pdf/2606.12881
- Abstract
We present an approach to fine-tuning large language models using Direct Preference Optimization (DPO), a reinforcement learning technique. Our experimental results demonstrate that DPO simplifies the training pipeline, improves computational efficiency, and achieves competitive performance. The evaluation using BLEU, ROUGE, and cosine similarity metrics indicates effective learning and convergence, though further investigation is needed to address observed training instability.
- 中文摘要
我们提出了一种利用直接偏好优化(DPO)强化学习技术对大型语言模型进行微调的方法。我们的实验结果表明,DPO简化了训练流程,提高了计算效率,并实现了竞争性能。使用BLEU、ROUGE和余弦相似度指标的评估表明有效学习和收敛,但仍需进一步研究以解决观察到的训练不稳定性。
Bridging Modal Isolation in Interleaved Thinking: Supervising Modality Transitions via Stepwise Reinforcement
在交错思维中弥合模态孤立:通过逐步强化监督模态转换
- Authors: Tingyu Li, Le Zhou, Siyuan Li, Yujun Wu, Xinglong Xu, Jingxuan Wei, Conghui He, Cheng Tan
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.12886
- Pdf link: https://arxiv.org/pdf/2606.12886
- Abstract
Interleaved thinking, where a unified multimodal model alternates between textual reasoning and visual generation, has shown promise on spatial and physical tasks. However, in complex long-chain scenarios, we identify a fundamental failure mode: generated images diverge from the textual context while subsequent text ignores the visual evidence, causing the two modalities to alternate without genuinely informing each other. We term this Modal Isolation and attribute it to compounding information loss at modality boundaries. We decompose each reasoning cycle into atomic operations and define modality transition loss, quantifying cross-modal hallucination (text-to-image) and visual utilization deficit (image-to-text) at each boundary. We propose MoTiF (Modality Transition Fidelity), a two-stage training framework that directly optimizes these transitions: Reflective SFT trains the model to detect and recover from erroneous visual outputs; Flow-GRPO improves image generation fidelity via reinforcement learning. All training signals in MoTiF derive from transition-level fidelity rather than end-task accuracy. Across four visual puzzle benchmarks, this transition-level supervision substantially improves both cross-modal coherence and final task accuracy. The results demonstrate that effective interleaved reasoning requires explicit structural supervision at modality boundaries, not merely scaling or end-task optimization.
- 中文摘要
交错思维是指统一的多模态模型在文本推理和视觉生成之间交替进行,它在空间和物理任务上已展现出潜力。然而,在复杂的长链场景中,我们发现了一种根本性的失败模式:生成的图像偏离文本上下文,而后续文本又忽略视觉证据,导致两种模态只是交替出现,却没有真正为彼此提供信息。我们称之为模态孤立(Modal Isolation),并将其归因于模态边界处不断累积的信息损失。我们将每个推理周期分解为原子操作,并定义模态转换损失,在每个边界处量化跨模态幻觉(文本到图像)和视觉利用不足(图像到文本)。我们提出了MoTiF(Modality Transition Fidelity),一个直接优化这些转换的两阶段训练框架:Reflective SFT训练模型检测错误的视觉输出并从中恢复;Flow-GRPO通过强化学习提升图像生成的保真度。MoTiF中的所有训练信号均来自转换级保真度,而非最终任务准确率。在四个视觉谜题基准上,这种转换级监督显著提升了跨模态一致性和最终任务准确率。结果表明,有效的交错推理需要在模态边界处进行显式的结构化监督,而不仅仅是扩大规模或针对最终任务进行优化。
Learning to Adapt: Representation-Based Reinforcement Learning for Multi-Task Skill Transfer
学习适应:基于表征的强化学习用于多任务技能转移
- Authors: Aryan Naveen, Haitong Ma, Haldun Balim, Na Li
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2606.12890
- Pdf link: https://arxiv.org/pdf/2606.12890
- Abstract
Reinforcement learning has achieved remarkable success in learning complex control policies, yet its applicability remains limited due to sample inefficiency and poor generalization across tasks. In this work, we propose RepMT-SAC, a framework for multi-task RL that enables efficient knowledge sharing and robust transfer to new tasks. RepMT-SAC uses spectral MDP decomposition to capture transferable dynamics, structuring the value function into a task-agnostic core with a minimal task-specific adjustment. This design allows for strong zero-shot performance on in-distribution tasks and rapid few-shot adaptation to out-of-distribution tasks. We evaluate RepMT-SAC on quadcopter trajectory-following tasks across in-distribution and out-of-distribution contexts, demonstrating that it outperforms baselines by up to 30%.
- 中文摘要
强化学习在学习复杂控制策略方面取得了显著成功,但由于样本效率低、跨任务泛化能力差,其适用性仍然有限。在本研究中,我们提出了RepMT-SAC,一种多任务强化学习框架,能够实现高效的知识共享和向新任务的稳健迁移。RepMT-SAC利用谱MDP分解来捕捉可迁移的动力学,将价值函数构造为一个任务无关的核心加上最小限度的任务特定调整。这一设计使其在分布内任务上具有很强的零样本性能,并能通过少样本快速适应分布外任务。我们在分布内和分布外情境下的四旋翼轨迹跟踪任务上评估了RepMT-SAC,结果表明其性能比基线高出最多30%。
PolicyGuard: Towards Test-time and Step-level Adversary Defense for Reinforcement Learning Agent
PolicyGuard:面向强化学习智能体的测试时、步级对抗防御
- Authors: Junfeng Guo, Heng Huang
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
- Arxiv link: https://arxiv.org/abs/2606.12896
- Pdf link: https://arxiv.org/pdf/2606.12896
- Abstract
While real-world applications of reinforcement learning (RL) are becoming increasingly popular, the security of RL systems deserves more attention and exploration. In particular, recent work has revealed that RL agents are vulnerable to backdoor attacks, where a victim agent behaves normally under standard conditions but executes malicious actions when a specific trigger is activated. Existing backdoor defenses for RL either require access to the agent's internal parameters, operate only at the model or trajectory level, or are limited to specific attack types. To ensure the security of RL agents, we propose PolicyGuard, a test-time, step-level backdoor defense that leverages Gaussian Process (GP) posterior variance and adapts pseudo trajectories to enable uncertainty computation for individual time steps. In addition, we provide theoretical foundations to explain the efficacy of GP posterior variance. Extensive experiments across seven RL games demonstrate that PolicyGuard achieves state-of-the-art detection performance in most cases, with an average AUROC of 0.856 for perturbation-based attacks and 0.859 for adversary-agent attacks.
- 中文摘要
尽管强化学习(RL)在现实世界中的应用日益普及,但强化学习系统的安全性值得更多关注和探索。特别是,近期研究表明强化学习智能体易受后门攻击:受害智能体在标准条件下表现正常,但在特定触发器被激活时执行恶意动作。现有的强化学习后门防御要么需要访问智能体的内部参数,要么仅在模型或轨迹层面起作用,要么仅限于特定的攻击类型。为确保强化学习智能体的安全,我们提出了PolicyGuard,一种测试时、步级的后门防御,它利用高斯过程(GP)后验方差,并通过适配伪轨迹来实现针对单个时间步的不确定性计算。此外,我们还给出了理论基础,以解释GP后验方差为何有效。在七个强化学习游戏上的大量实验表明,PolicyGuard在大多数情况下取得了最先进的检测性能,对基于扰动的攻击平均AUROC为0.856,对敌对智能体攻击平均AUROC为0.859。
SENTINEL: Failure-Driven Reinforcement Learning for Training Tool-Using Language Model Agents
SENTINEL:用于训练工具使用语言模型智能体的失败驱动强化学习
- Authors: Ziyi Wang, Yuxuan Lu, Yimeng Zhang, Qun Liu, Chen Luo, Jiri Gesi, Hanqing Lu, Yisi Sang, Manling Li, Jing Huang, Dakuo Wang
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2606.12908
- Pdf link: https://arxiv.org/pdf/2606.12908
- Abstract
Language model agents are increasingly effective in solving realistic tasks through multi-turn tool use. However, training reliable tool-using agents remains challenging in practice. While reinforcement learning provides an on-policy paradigm for improving agents from their own environment interactions, its effectiveness depends heavily on the training task distribution. When tasks are fixed before training, the task distribution can become increasingly mismatched with the policy's evolving capabilities, causing many rollouts to be spent on uninformative tasks. We propose SENTINEL, a failure-driven reinforcement learning framework that turns the Solver's rollout failures into targeted training tasks. SENTINEL follows a Controller-Proposer-Solver loop: the Controller analyzes failed trajectories and summarizes recurring error patterns, the Proposer generates executable tasks that stress these weaknesses, and the Solver is trained on the targeted tasks. On Tau2-Bench Retail with Qwen3-4B-Thinking-2507, SENTINEL improves Pass^1 from 66.4 to 74.9 and outperforms RL on general synthetic tasks across Pass^k metrics. These results demonstrate that model failures provide an effective and scalable source of targeted training signal for improving tool-using language model agents.
- 中文摘要
语言模型智能体通过多轮工具使用来解决现实任务的能力越来越强。然而,在实践中训练可靠的工具使用智能体仍然具有挑战性。强化学习提供了一种同策略(on-policy)范式,使智能体能够从自身与环境的交互中得到改进,但其有效性在很大程度上取决于训练任务的分布。当任务在训练前就已固定时,任务分布可能与策略不断演进的能力越来越不匹配,导致大量轨迹(rollout)被耗费在信息量不足的任务上。我们提出了SENTINEL,一种失败驱动的强化学习框架,它将Solver的轨迹失败转化为有针对性的训练任务。SENTINEL遵循Controller-Proposer-Solver循环:Controller分析失败轨迹并总结反复出现的错误模式,Proposer生成针对这些弱点的可执行任务,Solver则在这些针对性任务上进行训练。在Tau2-Bench Retail上使用Qwen3-4B-Thinking-2507时,SENTINEL将Pass^1从66.4提升至74.9,并在各项Pass^k指标上优于在通用合成任务上进行的强化学习。这些结果表明,模型的失败为改进工具使用语言模型智能体提供了一种有效且可扩展的针对性训练信号来源。
Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning
揭开隐藏状态递归的神秘面纱:基于同策略强化学习的可切换潜在推理
- Authors: Jiayu Yang, Chao Chen, Shengen Wu, Yinhong Liu, Yuxuan Fan, Lujundong Li, Songning Lai, Chengwei Qin, Zhijiang Guo
- Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2606.13106
- Pdf link: https://arxiv.org/pdf/2606.13106
- Abstract
Latent chain-of-thought compresses reasoning by replacing visible reasoning traces with continuous hidden-state recurrence, but existing formulations are difficult to optimize with standard on-policy reinforcement learning (RL) and hard to interpret causally. Our key insight is that a single pair of explicit boundary tokens can address both issues at once: discrete entry and exit anchors make the latent block compatible with standard on-policy RL, and the same anchors offer a natural foothold for mechanistic analysis. Motivated by this, we propose SWITCH, a switchable latent reasoning framework. The model emits one boundary token to enter latent mode and another to exit. Because the boundaries are ordinary discrete tokens, the GRPO policy ratio is well-defined at every decision point. The same anchors also expose the latent steps to direct probing and causal intervention. We train the model with a visible-to-latent curriculum and a Switch-GRPO objective that propagates gradients through recurrent latent computation. SWITCH consistently outperforms prior hidden-state-recurrence latent reasoning approaches at similar scale. Mechanistic analysis through the boundary tokens further reveals three findings: (i) the entry token is a sharply localised, learned switching policy rather than a stylistic artefact; (ii) the latent step it opens performs problem-specific, causally important computation rather than acting as an inert placeholder; and (iii) that computation is concentrated at a single hidden-state transition on entry. Together, these results show that hidden-state-recurrence latent reasoning is both RL-trainable and open to direct mechanistic analysis, including of how on-policy RL itself improves the model from the inside.
- 中文摘要
潜在思维链用连续的隐藏状态递归取代可见的推理轨迹,从而压缩推理过程,但现有方案难以用标准的同策略(on-policy)强化学习(RL)进行优化,也难以进行因果解释。我们的关键见解是,一对显式的边界词元可以同时解决这两个问题:离散的进入和退出锚点使潜在块与标准同策略强化学习兼容,而这些锚点也为机制分析提供了天然的切入点。基于此,我们提出了SWITCH,一种可切换的潜在推理框架。模型输出一个边界词元以进入潜在模式,输出另一个边界词元以退出。由于边界是普通的离散词元,GRPO的策略比率在每个决策点都有良好定义。这些锚点还使潜在步骤可以被直接探测和进行因果干预。我们采用从可见到潜在的课程以及Switch-GRPO目标来训练模型,该目标让梯度穿过递归的潜在计算进行传播。在相近规模下,SWITCH始终优于以往基于隐藏状态递归的潜在推理方法。通过边界词元进行的机制分析进一步揭示了三点发现:(i) 进入词元是一种高度局部化的、习得的切换策略,而非风格上的假象;(ii) 它所开启的潜在步骤执行的是针对具体问题、在因果上重要的计算,而不是惰性的占位符;(iii) 这种计算集中在进入时的单次隐藏状态转移上。综合来看,这些结果表明,基于隐藏状态递归的潜在推理既可以用强化学习训练,也可以接受直接的机制分析,包括分析同策略强化学习本身如何从内部改进模型。
Select and Improve: Understanding the Mechanics of Post-Training for Reasoning
选择与改进:理解面向推理的后训练机制
- Authors: Akshay Krishnamurthy, Audrey Huang, Nived Rajaraman
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.13125
- Pdf link: https://arxiv.org/pdf/2606.13125
- Abstract
Reinforcement learning has rapidly emerged as a key component in the training of reasoning and coding models, yet it remains poorly understood from a mechanistic perspective. We study how and through what underlying processes capabilities are acquired or enhanced via reinforcement learning post-training. Our analysis, based on controlled math reasoning experiments with Qwen-2.5-1.5B, reveals two core mechanisms: strategy selection and strategy improvement. Our results highlight the role of SFT data and reinforcement learning data in activating these mechanisms, in particular showing how supervising the model on diverse reasoning strategies can enable strategy selection and how increasing difficulty in reinforcement learning data can enable strategy improvement. Taken together, our results provide mechanistic insight into RL training and suggest practical interventions to continue scaling reasoning capabilities.
- 中文摘要
强化学习已迅速成为推理和编码模型训练的关键组成部分,但从机制角度来看,人们对它仍知之甚少。我们研究在强化学习后训练中,能力是如何以及通过哪些底层过程被获得或增强的。基于在Qwen-2.5-1.5B上进行的受控数学推理实验,我们的分析揭示了两个核心机制:策略选择和策略改进。我们的结果突出了SFT数据和强化学习数据在激活这些机制中的作用,特别是表明用多样的推理策略对模型进行监督可以促成策略选择,而提高强化学习数据的难度可以促成策略改进。综合来看,我们的结果为强化学习训练提供了机制层面的见解,并提出了可用于持续扩展推理能力的实用干预措施。
Redesigning Regularization for Effective Policy Smoothing
重新设计正则化以实现有效策略平滑
- Authors: Taisuke Kobayashi, Naoto Yamanaka
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2606.13169
- Pdf link: https://arxiv.org/pdf/2606.13169
- Abstract
This paper proposes a novel regularization design to effectively smooth policy functions in reinforcement learning. While regularization that enhances "global" Lipschitz continuity was initially considered, it has been limited to "local" Lipschitz continuity due to a tradeoff between smoothness and expressiveness. However, it has become apparent that the original implementation is cumbersome and does not provide sufficient smoothing, leading to a preference for simpler implementations. This stems from a discrepancy between theory and implementation, and a more appropriate implementation can be expected to facilitate smoothing. Therefore, this paper identifies three reasons why the original implementation does not function adequately and provides remedies for them. This modified regularization performs well across multiple tasks and algorithms, successfully achieving smooth motion while improving control performance. Furthermore, by applying it to sim-to-real reinforcement learning for a quadruped robot, it is demonstrated that smooth motion provides robustness against sudden changes in target velocity commands.
- 中文摘要
本文提出了一种新颖的正则化设计,以有效平滑强化学习中的策略函数。虽然最初考虑的是增强“全局”利普希茨连续性的正则化,但由于平滑性和表达能力之间的权衡,这类正则化后来被限制为“局部”利普希茨连续性。然而,人们逐渐发现原始实现较为繁琐,且平滑效果不足,因而更倾向于采用更简单的实现。这源于理论与实现之间的差异,更合适的实现有望促进平滑。因此,本文指出了原始实现未能充分发挥作用的三个原因,并给出了相应的补救措施。修改后的正则化在多种任务和算法上表现良好,在提升控制性能的同时成功实现了平滑运动。此外,将其应用于四足机器人的仿真到现实(sim-to-real)强化学习后,结果表明平滑运动能够对目标速度指令的突变提供鲁棒性。
Mental-R1: Aligning LLM Reasoning for Mental Health Assessment
心理-R1:将LLM推理与心理健康评估对齐
- Authors: Xin Wang, Boyan Gao, Yibo Yang, David A. Clifton
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.13176
- Pdf link: https://arxiv.org/pdf/2606.13176
- Abstract
Mental health problems such as anxiety, depression, and suicide remain urgent global challenges, where timely and accurate assessment is critical for effective intervention. Recently, large language models have been explored for mental health assessment. However, existing general-purpose post-training methods do not align with the cognitive processes of human assessment, which may lead to unreliable reasoning outcomes. To bridge this gap, we propose Cognitive Relative Policy Optimization (CRPO), a reinforcement learning framework tailored for the mental health domain. CRPO extends group relative policy optimization by integrating stage-dependent uncertainty modeling into the policy optimization process. Specifically, we introduce a stage-wise entropy regularization mechanism that encourages broad exploration in early reasoning phases and progressively enforces confident decision-making in later stages, mimicking the human cognitive shift from uncertainty to certainty. In addition, inspired by cognitive appraisal theory, we formalize cognitive reasoning stages, thereby guiding theory-grounded interpretable inference. Experiments on 8 mental health datasets show that CRPO achieves an average improvement of 10.4 percentage points in weighted F1-score over the best reinforcement learning baseline. Furthermore, the CRPO-trained model Mental-R1 demonstrates clear advantages compared with existing large language models on reasoning-intensive cases, suggesting that CRPO enhances reasoning capabilities for mental health assessment.
- 中文摘要
焦虑、抑郁和自杀等心理健康问题依然是全球紧迫挑战,及时且准确的评估对于有效干预至关重要。近年来,大型语言模型被用于心理健康评估。然而,现有的通用训练后方法与人类评估的认知过程不一致,可能导致推理结果不可靠。为弥合这一差距,我们提出了认知相对政策优化(CRPO),这是一个专为心理健康领域量身定制的强化学习框架。CRPO通过将阶段相关不确定性建模整合进策略优化过程,扩展了群体相对策略优化。具体来说,我们引入了一种阶段熵正则化机制,鼓励在早期推理阶段广泛探索,并在后期阶段逐步强化自信决策,模拟人类认知从不确定性到确定性的认知转变。此外,受认知评估理论启发,我们形式化了认知推理阶段,从而指导基于理论的可解释推断。对8个心理健康数据集的实验显示,CRPO在加权F1分数上比最佳强化学习基线平均提升10.4个百分点。此外,CRPO训练的模型Mental-R1在推理密集型案例中明显优于现有大型语言模型,表明CRPO增强了心理健康评估的推理能力。
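The stage-wise entropy regularization described in the Mental-R1 abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the linear schedule, the function names (`stage_coefficient`, `entropy_bonus`), and the coefficient values are all assumptions. The abstract only states that early reasoning stages are encouraged to explore and later stages are pushed toward confident decisions.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def stage_coefficient(stage, num_stages, beta_start=0.02, beta_end=-0.01):
    """Hypothetical linear schedule for the entropy coefficient.

    A positive value rewards entropy (exploration) in early stages;
    a negative value penalizes it (confidence) in late stages.
    """
    if num_stages < 2:
        return beta_start
    frac = stage / (num_stages - 1)
    return beta_start + frac * (beta_end - beta_start)


def entropy_bonus(stage_token_probs):
    """Sum over stages of coefficient * mean token entropy.

    stage_token_probs: one entry per reasoning stage, each a non-empty
    list of per-token probability distributions.
    """
    num_stages = len(stage_token_probs)
    total = 0.0
    for stage, dists in enumerate(stage_token_probs):
        mean_h = sum(entropy(d) for d in dists) / len(dists)
        total += stage_coefficient(stage, num_stages) * mean_h
    return total
```

Under this assumed schedule, a trajectory that is uncertain early and confident late receives a positive bonus, while the reverse ordering receives a negative one.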
Reasoning for Mobile User Experience with Multimodal LLMs: Task, Benchmark, and Approach
多模态大型语言模型的移动用户体验推理:任务、基准与方法
- Authors: Ruichao Mao, Zhou Fang, Teng Guo, Hao Yang, Yaping Li, Shaohua Peng, Maji Huang, Xiaoyu Lin, Shuoyang Liu, Xuepeng Li, Yuyu Zhang, Hai Rao
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.13192
- Pdf link: https://arxiv.org/pdf/2606.13192
- Abstract
User experience (UX) centered on usability, perceived consistency, and functional clarity is fundamental to real-world user interfaces (UI). Applications of multimodal large language models (MLLMs) to user interfaces, such as visual element grounding, graphical user interface (GUI) agents, and design-to-code generation, are evolving rapidly. However, research on evaluating UX from UI screenshots is still immature. To address this, we propose UXBench, a novel multimodal benchmark consisting of 2,000 VQA data samples designed to assess MLLMs' ability to perform UI-based reasoning. UXBench includes 8 tasks based on real-world UI screenshots that require fine-grained diagnosis of UX issues across layout relationships, visual hierarchy, and content consistency. Our extensive evaluation of mainstream MLLMs shows that they remain fundamentally limited in their capacity for UI-based reasoning. The results underscore the need for further advancements in this area. To bridge this gap, we propose UI-UX, an MLLM based on the Qwen3-VL-4B-Thinking foundation model and enhanced via reinforcement learning with two key innovations: a reward routing mechanism that dynamically balances perceptual understanding and logical reasoning during inference, and an asymmetric transition reward that suppresses redundant or insufficient reasoning steps. Experiments demonstrate that UI-UX achieves state-of-the-art (SOTA) performance on UXBench, attaining an accuracy of 0.7963 -- surpassing Claude-4.5-Sonnet's 0.6550 -- while exhibiting strong generalization across diverse UI tasks and maintaining low inference latency.
- 中文摘要
以可用性、感知一致性和功能清晰度为核心的用户体验(UX)是现实世界用户界面(UI)的基础。多模态大型语言模型(MLLM)在用户界面领域的应用正在迅速发展,例如视觉元素定位、图形用户界面(GUI)代理以及设计到代码生成。然而,基于界面截图评估用户体验的研究工作仍然不成熟。为此,我们提出了UXBench,这是一个由2000个VQA数据样本组成的新型多模态基准测试,旨在评估MLLM执行基于界面推理的能力。UXBench包含8个基于真实界面截图的任务,要求对布局关系、视觉层级和内容一致性等用户体验问题进行细致诊断。我们对主流MLLM的广泛评估显示,它们在基于UI的推理能力方面仍然存在根本限制。结果强调了该领域进一步进步的必要性。为弥合这一差距,我们提出了UI-UX,这是一种基于 Qwen3-VL-4B-Thinking 基础模型的MLLM,并通过强化学习增强了两项关键创新:一种在推理过程中动态平衡感知理解与逻辑推理的奖励路由机制,以及一种抑制冗余或推理不足步骤的非对称过渡奖励。实验表明,UI-UX在UXBench上实现了最先进的(SOTA)性能,准确率达到0.7963(超过了Claude-4.5-Sonnet的0.6550),同时在多种UI任务中展现出强烈的泛化能力,并保持低推理延迟。
Understanding helpfulness and harmless tension in reward models
理解奖励模型中的帮助性和无害紧张
- Authors: Eshaan Tanwar, Pepa Atanasova
- Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2606.13209
- Pdf link: https://arxiv.org/pdf/2606.13209
- Abstract
Reward models are a key component of reinforcement learning from human feedback (RLHF), aligning language models toward both helpful and harmless behaviour. However, the internal mechanisms underlying these objectives and their conflicts remain poorly understood. We study alignment tension in reward models trained under helpfulness-only, harmlessness-only, and mixed-objective settings. We find that mixed-objective models often underperform single-objective models, indicating interference between objectives. Using activation-based methods, we identify neurons associated with each objective and study their functional roles via targeted ablations. We find that these neurons causally support their corresponding objectives while often negatively affecting the opposing one. We find that a substantial proportion of neurons are shared between helpfulness and harmlessness, and that these shared neurons exert a disproportionate influence on model behaviour, contributing to alignment tension. Additionally, our results provide insights and mechanistic interpretation into how alignment objectives are represented in reward models and why multi-objective alignment remains challenging, motivating future work on disentangled and controllable alignment methods.
- 中文摘要
奖励模型是人类反馈强化学习(RLHF)的关键组成部分,使语言模型既有帮助又无害的行为。然而,这些目标背后的内部机制及其冲突仍然了解不足。我们研究了在仅帮助性、仅无害性和混合目标条件下训练的奖励模型中的对齐张力。我们发现混合目标模型常常表现不如单目标模型,表明目标之间存在干扰。通过基于激活的方法,我们识别与每个目标相关的神经元,并通过靶向消融研究其功能角色。我们发现这些神经元在因果上支持其对应目标,同时常常对对方目标产生负面影响。我们发现,有相当比例的神经元在有益性和无害性之间存在,这些共享神经元对模型行为产生了不成比例的影响,从而加剧了对齐张力。此外,我们的结果为奖励模型中对齐目标的表示方式提供了见解和机制解释,以及为何多目标对齐仍具挑战性,激励未来对脱离和可控比对方法的研究。
From Verdict to Process: Agentic Reinforcement Learning for Multi-Stage Fact Verification
从裁决到过程:多阶段事实验证的能动强化学习
- Authors: Rongxin Yang, Shenghong He, Siyuan Zhu, Chao Yu
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.13262
- Pdf link: https://arxiv.org/pdf/2606.13262
- Abstract
Recent approaches combining Large Language Models (LLMs) with retrieval-augmented reasoning have shown promise for automated fact verification. To process complex claims, these verification pipelines typically execute multi-stage workflows that coordinate tightly coupled modules, including claim decomposition, evidence gathering, and verdict prediction. However, existing methods optimize individual stages in isolation or rely on fixed heuristics, which limits adaptive coordination among stages and can lead to suboptimal outcomes. In this work, we propose ProFact, an agentic reinforcement learning framework for end-to-end optimization of multi-stage fact verification trajectories. ProFact trains a unified policy to coordinate claim decomposition, evidence seeking, answer generation, and verdict prediction. To address the sparse and delayed supervision provided by final veracity labels, ProFact introduces process-aware rewards that provide stage-level learning signals throughout the verification process. Empirical evaluation shows that ProFact consistently outperforms strong baselines in both verification performance and inference efficiency. These results highlight the effectiveness of process-aware trajectory optimization for multi-stage fact verification.
- 中文摘要
近期将大型语言模型(LLM)与检索增强推理结合的方法,显示出自动化事实验证的潜力。为了处理复杂的索赔,这些验证流程通常执行多阶段工作流程,协调紧密耦合的模块,包括索赔分析、证据收集和裁决预测。然而,现有方法要么单独优化各个阶段,要么依赖固定启发式,这限制了各阶段之间的自适应协调,可能导致次优结果。本研究提出ProFact,一种用于多阶段事实验证轨迹端到端优化的智能体强化学习框架。ProFact训练统一策略,协调索赔分解、证据寻求、答案生成和裁决预测。为解决最终真实性标签提供的稀疏和延迟监督问题,ProFact引入了过程感知奖励,在整个验证过程中提供阶段级学习信号。实证评估表明,ProFact在验证性能和推断效率方面始终优于强基线。这些结果凸显了过程感知轨迹优化在多阶段事实验证中的有效性。
ReFree: Towards Realistic Co-Speech Video Generation via Reward-Free RL and Multilevel Speech Guidance
ReFree:通过无奖励的强化学习和多级语音指导,迈向真实的共话视频生成
- Authors: Salaheldin Mohamed, M. Hamza Mughal, Rishabh Dabral, Christian Theobalt
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2606.13304
- Pdf link: https://arxiv.org/pdf/2606.13304
- Abstract
Speech-driven talking character animation seeks to generate life-like portrait videos that convey natural conversation behavior, aligning facial motion with spoken audio. Although recent advances in video generation have substantially improved realism in video-based animation, achieving both accurate lip articulation and expressive behavior remains challenging. Existing approaches typically trade off precise phoneme-to-lip synchronization against dynamic facial expressions and head motion, yielding animations that are either accurate yet rigid, or expressive but poorly synchronized. We address this challenge by proposing ReFree-S2V, a flow-matching speech-to-portrait animation framework that builds upon a pretrained video generation model to achieve fine-grained speech articulation and high-level expressive cues in speech-driven portrait animation. This model introduces a multi-level speech representation capturing phonetic and prosodic information at both local and global granularities. These representations are selectively injected into transformer blocks via learnable level selectors, enabling both accurate lip synchronization and natural expressive motion. To achieve natural head movements, we further introduce a novel reward-free reinforcement learning scheme into flow-matching training to discourage perceptually implausible motion without relying on handcrafted synchronization metrics or reward models, or the high cost of human preference annotation. Extensive experiments demonstrate that ReFree-S2V achieves state-of-the-art performance, significantly outperforming existing methods in both quantitative lip-sync accuracy and qualitative human evaluations of naturalness and expressivity.
- 中文摘要
以语音为驱动的对话角色动画旨在生成逼真的肖像视频,传达自然的对话行为,将面部动作与语音音频对齐。尽管视频生成技术的最新进展显著提升了基于视频的动画的真实性,但实现准确的嘴唇表达和表现行为仍然具有挑战性。现有方法通常在精确的音素与嘴唇同步与动态面部表情和头部动作之间进行权衡,导致动画要么准确但僵硬,要么表情丰富但同步不佳。我们通过提出ReFree-S2V来应对这一挑战,这是一个基于预训练视频生成模型的流程匹配语音到肖像动画框架,实现细粒度的语音表达和高层次的表现力提示,应用于语音驱动的肖像动画。该模型引入了多层次的语音表示,能够捕捉局部和全局粒度上的语音和韵律信息。这些表示通过可学习的电平选择器选择性地注入变压器模块,实现了准确的唇同步和自然的表现运动。为了实现自然的头部运动,我们进一步引入一种新颖的无奖励强化学习方案,在不依赖手工同步指标、奖励模型或高成本的人类偏好标注的情况下,抑制感知上不合理的动作。大量实验表明,ReFree-S2V在定量口型同步准确性和人类自然性和表现力的定性评估上,均远超现有方法。
ReSum: Synergizing LLM Reasoning and Summarization with Reinforcement Learning
ReSum:将LLM推理与总结与强化学习结合起来
- Authors: Xucong Wang, Ziyu Ma, Yong Wang, Shidong Yang, Hailang Huang, Renda Li, Pengkun Wang, Xiangxiang Chu
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.13316
- Pdf link: https://arxiv.org/pdf/2606.13316
- Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) is a central technique for improving long-horizon reasoning in Large Language Models (LLMs). However, existing RLVR methods often encourage unnecessarily long reasoning rollouts, which can degrade reasoning coherence and exhaust the available context budget. Existing approaches to long-context organization often depend on external mechanisms to organize rollouts, rather than enabling the model to manage its own reasoning trajectory. To address this limitation, we propose ReSum, a novel RLVR framework that enables LLMs to compress and organize their reasoning trajectories through self-summarization. Our pilot studies show that self-summarization stabilizes generation by lowering token-level entropy, and that introducing a "summarization" phrase can substantially mitigate errors propagated from an incorrect rollout prefix. Motivated by these findings, ReSum adopts a summarization-aware adaptive rollout mechanism that contrastively evaluates whether self-summarization benefits the ongoing reasoning process. Specifically, when the model spontaneously triggers self-summarization, ReSum masks the summarization phrase to create a contrastive branch; for non-summarization positions, it instead randomly injects the phrase to create a matched branch. We further design a summarization-aware advantage to enable finer-grained comparison between contrastive rollout trajectories. Extensive experiments show that ReSum improves performance by an average of 4% while reducing rollout length by 18.6%.
- 中文摘要
带可验证奖励的强化学习(RLVR)是提升大型语言模型(LLM)中长期推理能力的核心技术。然而,现有的RLVR方法常常鼓励不必要的冗长推理 rollout,这可能降低推理的连贯性并耗尽可用的上下文预算。现有的长上下文组织方法通常依赖外部机制来组织 rollout,而非使模型能够自行管理推理轨迹。为解决这一限制,我们提出了ReSum,一种新颖的RLVR框架,使LLM能够通过自我总结压缩和组织其推理轨迹。我们的试点研究表明,自总结通过降低令牌层面熵来稳定生成,引入“总结”短语可以显著减少因错误的 rollout 前缀传播的错误。受这些发现启发,ReSum采用了一种总结感知的自适应 rollout 机制,对比地评估自我总结是否有利于当前的推理过程。具体来说,当模型自发触发自我总结时,ReSum会掩盖总结短语以创建对比分支;对于非总结位置,它会随机注入短语以生成匹配分支。我们还设计了一种具备总结感知的优势,以便对对比性 rollout 轨迹进行更细粒度的比较。大量实验表明,ReSum 平均提升性能 4%,同时将 rollout 长度缩短 18.6%。
From Passive Generation to Investigation: A Proactive Scientific Peer Review Agent
从被动生成到研究:主动科学同行评审的推动者
- Authors: Haishuo Fang, Yue Feng, Iryna Gurevych
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2606.13349
- Pdf link: https://arxiv.org/pdf/2606.13349
- Abstract
Large language models (LLMs) have shown promise in automating scientific peer review. However, existing approaches often struggle to generate in-depth reviews supported by concrete evidence. We argue that a key limitation is the lack of flexibility to proactively investigate suspicious parts of a paper based on accumulated evidence, as human reviewers do. In this paper, we explore how to enable an LLM-based review agent to perform such proactive investigation. We find that this can be naturally formulated as a Markov Decision Process (MDP), and propose ProReviewer, a scientific peer review agent that proactively reviews a paper guided by a maintained, structured review log. The structured review log serves as a workspace for the agent to track evidence and intermediate findings collected during review. Experiments show that ProReviewer with an 8B backbone, trained by supervised fine-tuning and optimized by reinforcement learning, achieves the highest average score across five quality dimensions, outperforming prompt-based methods with much larger frontier LLMs by up to 39% and the strongest fine-tuned baseline by 16% relatively. It also attains the highest win rates against baselines in human evaluation.
- 中文摘要
大型语言模型(LLMs)在自动化科学同行评审方面展现出潜力。然而,现有方法常常难以产生有确凿证据支持的深入评审。我们认为一个关键局限是缺乏灵活性,无法像人类审稿人那样,基于累积的证据主动调查论文中的可疑部分。本文探讨如何使基于LLM的评审代理能够进行此类主动调查。我们发现这可以自然地被表述为马尔可夫决策过程(MDP),并提出了ProReviewer,一种科学同行评审代理,它在维护的结构化评审日志指导下主动审阅论文。结构化评审日志作为一个工作空间,供代理跟踪审查过程中收集的证据和中间发现。实验显示,配备8B骨干的ProReviewer通过监督微调训练并通过强化学习优化,在五个质量维度中获得最高平均分,比使用规模大得多的前沿大型语言模型的提示型方法高出多达39%,比最强的微调基线相对高出16%。它在人类评估中对各基线也取得了最高的胜率。
IterCAD: An Iterative Multimodal Agent for Visually-Grounded CAD Generation and Editing
IterCAD:一款用于视觉基础CAD生成与编辑的迭代多模态代理
- Authors: Tao Hu, Jiaxin Ai, Licheng Wen, Xueheng Li, Shu Zou, Siqi Li, Nianchen Deng, Xinyu Cai, Hongbin Zhou, Pinlong Cai, Daocheng Fu, Yu Yang, Hairong Zhang, Botian Shi, Xuemeng Yang
- Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2606.13368
- Pdf link: https://arxiv.org/pdf/2606.13368
- Abstract
Computer-Aided Design is pivotal in modern manufacturing, yet existing automated methods predominantly rely on open-loop, one-shot generation, creating a mismatch with iterative real-world practices. In this paper, we present IterCAD, a unified multimodal agent framework for closed-loop, interactive CAD generation and editing. We formulate the task as a multi-turn interaction between a multimodal agent and an executable CAD sandbox, covering three tasks: Drawing-to-Code, Text-to-Code, and Interactive Editing. To support this, we develop a data synthesis pipeline incorporating advanced industrial manufacturing features to generate standard-compliant multi-view engineering drawings, complex code-editing tasks, and high-fidelity interaction trajectories. We optimize the agent via progressive SFT followed by geometry-aware reinforcement learning with viable-prefix masking to enhance code executability and geometric fidelity. Finally, we introduce the IterCAD-Bench evaluation suite and propose the Chamfer Distance Tolerance-Recall (CD-TR) curve alongside its AUC-TR metric, establishing a survivor-bias-free standard that unifies code validity and geometric precision. Extensive experiments demonstrate that IterCAD achieves highly competitive performance across multiple benchmarks, significantly outperforming existing approaches in both code executability and geometric precision, while exhibiting superior capabilities in closed-loop iterative refinement.
- 中文摘要
计算机辅助设计在现代制造业中至关重要,但现有自动化方法主要依赖开环一次性生成,这与现实世界的迭代实践存在不匹配。本文介绍了IterCAD,一个统一的多模态代理框架,用于闭环交互式CAD生成与编辑。我们将该任务表述为多模态代理与可执行CAD沙箱之间的多回合交互,涵盖三个任务:绘图代码、文本转代码和交互式编辑。为此,我们开发了一套数据综合流程,融合先进的工业制造特性,生成符合标准的多视图工程图纸、复杂的代码编辑任务以及高精度交互轨迹。我们通过渐进式SFT优化代理,随后采用具备可行前缀掩蔽的几何感知强化学习,以提升代码可执行性和几何真实度。最后,我们介绍了IterCAD-Bench评估套件,并提出了倒角距离容差-召回曲线(CD-TR)及其AUC-TR指标,建立了一个无幸存偏差的标准,统一了代码有效性和几何精度。大量实验表明,IterCAD在多个基准测试中实现了极具竞争力的性能,在代码可执行性和几何精度方面均远超现有方法,同时在闭环迭代优化方面表现出更优越的能力。
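To make the Chamfer Distance Tolerance-Recall idea from the IterCAD abstract above more concrete, here is a hedged sketch. The abstract does not define the curve, so everything here is an assumption: the unsquared, averaged Chamfer variant, the convention that a sample whose code fails to execute is recorded as `None` and counted as a miss at every tolerance (one plausible reading of "survivor-bias-free"), and the trapezoidal area as a stand-in for AUC-TR.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two non-empty 3-D point sets.

    One common variant: mean nearest-neighbour distance in each
    direction, summed. Other variants use squared distances.
    """
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


def tolerance_recall(distances, tolerances):
    """Fraction of ALL samples whose distance is within each tolerance.

    distances: one Chamfer distance per sample, or None when the
    generated code did not execute (assumed to count as a miss).
    """
    n = len(distances)
    return [
        sum(1 for d in distances if d is not None and d <= tol) / n
        for tol in tolerances
    ]


def auc(xs, ys):
    """Trapezoidal area under a curve sampled at increasing xs."""
    return sum(
        (xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2.0
        for i in range(len(xs) - 1)
    )
```

Dividing by the total number of samples, rather than by the number of valid ones, is the design choice that keeps failed generations from inflating the score.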
Reinforcement Learning for Neural Model Editing
神经模型编辑的强化学习
- Authors: Shaivi Malik
- Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2606.13461
- Pdf link: https://arxiv.org/pdf/2606.13461
- Abstract
Editing pretrained neural networks requires specialized algorithms tailored to specific objectives. Designing such algorithms is often time-consuming and demands significant effort. We present an exploratory framework that formulates neural model editing as a reinforcement learning problem, where agents modify models using reward feedback. We introduce two environments: MaskWorld, where agents scale weights multiplicatively, and ShiftWorld, where agents apply additive weight updates. The reward function combines a utility-preservation objective with a task-specific editing objective, enabling agents to learn targeted modifications while maintaining overall model performance. We evaluate the framework on bias mitigation in text classification and machine unlearning in image classification, both of which traditionally rely on specialized algorithms. Our results show that the learned policies reduce forget set accuracy to nearly 0% while preserving over 90% retain set accuracy on the unlearning task. In the bias mitigation setting, the learned policies improve bias-related performance by more than 5% while maintaining general classification utility. Our findings show that neural model editing can be cast as a reinforcement learning problem, allowing editing policies to be learned from reward feedback rather than manually engineered for each task.
- 中文摘要
编辑预训练神经网络需要针对特定目标量身定制的专门算法。设计此类算法通常耗时且投入大量精力。我们提出了一个探索性框架,将神经模型编辑表述为强化学习问题,代理通过奖励反馈修改模型。我们引入了两个环境:MaskWorld,代理通过乘法缩放权重,以及ShiftWorld,代理应用加法权重更新。奖励函数结合了效用保持目标和任务特定编辑目标,使智能体能够学习针对性修改,同时保持整体模型性能。我们评估了文本分类中的偏见缓解框架和图像分类中的机器学习去学习,这两者传统上依赖专业算法。我们的结果显示,学习策略将遗忘设置的准确率降至接近0%,同时在去学习任务中保持超过90%的设置准确率。在偏见缓解环境中,所学策略在保持一般分类效用的同时提升了5%以上的偏见相关性能。我们的发现表明,神经模型编辑可以被归结为强化学习问题,使编辑策略可以从奖励反馈中学习,而非为每个任务手动设计。
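The two action types and the combined reward described in the model-editing abstract above can be sketched in a few lines. This is an illustration only: the function names, the flat list-of-floats weight representation, and the convex combination with weight `alpha` are assumptions, since the abstract does not say how the two objectives are weighted.

```python
def mask_edit(weights, scales):
    """MaskWorld-style action: scale each weight multiplicatively."""
    return [w * s for w, s in zip(weights, scales)]


def shift_edit(weights, deltas):
    """ShiftWorld-style action: apply an additive update to each weight."""
    return [w + d for w, d in zip(weights, deltas)]


def edit_reward(utility_score, editing_score, alpha=0.5):
    """Hypothetical reward: convex mix of utility preservation and editing.

    For unlearning, utility_score could be retain-set accuracy and
    editing_score could be 1 - forget-set accuracy (both in [0, 1]).
    """
    return alpha * utility_score + (1.0 - alpha) * editing_score
```

For example, a policy that keeps 90% retain-set accuracy while driving forget-set accuracy to 0% would score `edit_reward(0.9, 1.0)`, which is about 0.95 under this assumed weighting.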
Multi-Agent Reinforcement Learning from Delayed Marketplace Feedback for Objective-Weight Adaptation in Three-Sided Dispatch
多智能体强化学习:基于延迟市场反馈的目标权重适应三方调度
- Authors: Haochen Wu, Yi Hou, Shiguang Xie
- Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
- Arxiv link: https://arxiv.org/abs/2606.13604
- Pdf link: https://arxiv.org/pdf/2606.13604
- Abstract
Dispatch in three-sided marketplaces provides a natural setting for reinforcement learning from world feedback: decisions are evaluated by delayed operational outcomes such as delivery speed, courier utilization, and merchant congestion. We present a deployed reinforcement learning system at DoorDash that adapts dispatch objective weights in a large-scale food-delivery marketplace using delayed signals. Rather than replacing the combinatorial assignment optimizer, a store-level policy learned from logged marketplace data selects a discrete multiplier that shifts the dispatch optimizer's tradeoff between delivery quality and batching efficiency. This interface enables offline policy learning under noisy, delayed, and coupled feedback while preserving production feasibility constraints and operational safeguards. We train a shared value function using centralized offline data and decentralized store-level execution, with Double Q-learning targets and a conservative regularizer to reduce out-of-distribution value overestimation. In a production switchback experiment, the offline-trained policy increases batching and reduces courier-side time costs without degrading customer-facing delivery quality. Results illustrate how world feedback from a live economic and logistics system can be used to safely adapt decision policies online.
- 中文摘要
三边市场中的调度为从全球反馈中强化学习提供了自然环境:决策通过延迟的运营结果评估,如配送速度、快递利用率和商户拥堵。我们在DoorDash展示了一套部署的强化学习系统,利用延迟信号调整大型食品配送市场中的派遣目标权重。与其取代组合分配优化器,不如通过从市场数据学习的商店级策略选择离散乘数,调整派送质量与批处理效率之间的权衡。该接口允许在噪声、延迟和耦合反馈下进行离线策略学习,同时保持生产可行性约束和运营保障。我们通过中心化离线数据和去中心化的存储级执行训练共享价值函数,采用双Q学习目标和保守正则化器,以减少分布外价值的高估。在生产回头实验中,离线训练策略提高了批量处理并降低快递端时间成本,同时不影响面向客户的交付质量。结果展示了如何利用来自实时经济和物流系统的全球反馈,安全地调整在线决策政策。
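The dispatch abstract above mentions Double Q-learning targets and a conservative regularizer. A minimal sketch of both follows. The Double Q-learning target is the standard one; the penalty shown is a CQL-style log-sum-exp term, which is an assumption, because the abstract does not specify which conservative regularizer the deployed system uses. Function names and the list-based Q-value representation are illustrative.

```python
import math


def double_q_target(reward, gamma, q_online_next, q_target_next, done=False):
    """Double Q-learning target.

    The online network picks the next action; the target network
    evaluates it, which reduces maximization bias.
    """
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best_action]


def conservative_penalty(q_values, logged_action):
    """CQL-style penalty (assumed form): logsumexp(Q) - Q(logged action).

    It is non-negative and grows when actions absent from the logged
    data receive high Q-values.
    """
    m = max(q_values)
    lse = m + math.log(sum(math.exp(q - m) for q in q_values))
    return lse - q_values[logged_action]
```

In an offline setting the training loss would typically add this penalty, times a tunable weight, to the squared error between the predicted Q-value and the target.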
Beyond Runtime Enforcement: Shield Synthesis as Defensibility Analysis for Adversarial Networks
超越运行时执行:盾牌合成作为对抗网络防御性分析
- Authors: Achraf Hsain, Sultan Almuhammadi
- Subjects: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
- Arxiv link: https://arxiv.org/abs/2606.13621
- Pdf link: https://arxiv.org/pdf/2606.13621
- Abstract
Shielded reinforcement learning is typically presented as a runtime safety mechanism that compiles temporal-logic specifications into automata restricting an agent's actions. We argue this is the wrong product. The same automata-theoretic machinery -- specification compilation, product game construction, attractor computation, and winning-region extraction -- is better read as a design-time analytical instrument whose outputs are structural insights about a system rather than runtime constraints on a deployed agent. We instantiate this through a constrained two-player safety game for network defense. The two specifications are enforced asymmetrically: the defender specification defines the unsafe region of the game, whereas the attacker specification restricts the adversary's legal actions during attractor computation. Solving the game yields a defensibility verdict -- a formal certificate that a topology-specification pair is or is not defensible -- with the associated winning region and shield. Beyond the binary verdict, we derive topology-level metrics from the attractor structure and combine them with post-convergence behavior from shield-constrained adversarial multi-agent reinforcement learning. Together these form a defensibility fingerprint capturing both a network's formal safety properties and its operational behavior under adaptive play. A what-if analysis shows that formal defensibility and operational effectiveness capture distinct aspects of security: small architectural changes can produce large shifts in operational outcomes while leaving formal safety margins nearly unchanged. Shield synthesis is thus most valuable not as a deployment mechanism for safe agents, but as a framework for answering architectural questions about whether, where, and how a system can be defended. The defensibility verdict is the output, not the safe policy.
- 中文摘要
屏蔽强化学习通常被呈现为运行时安全机制,将时间逻辑规范编译成自动机,限制代理的操作。我们认为这是错误的产品。同样的自动机理论机制——规范编译、产品博弈构建、吸引子计算和胜区提取——更适合作为设计时分析工具,其输出是对系统的结构洞察,而非对已部署代理的运行时约束。我们通过一个受限的双人安全游戏实现网络防御。这两种规范的执行方式是不对称的:防御者规范定义了博弈的不安全区域,而攻击者规范则限制了对手在吸引子计算过程中的法律行为。解开博弈后,会得到可辩护性判决——即形式证明拓扑-规格对是否可辩护——并附带对应的获胜区域和盾牌。除了二元判决,我们还从吸引子结构中推导出拓扑层级度量,并将其与屏蔽约束的对抗性多智能体强化学习中的收敛后行为结合起来。这些元素共同构成防御指纹,捕捉网络的形式安全属性及其在自适应游戏下的操作行为。假设分析显示,形式防御性和作战效能涵盖了安全的不同方面:小幅架构变更能带来作战结果的巨大变化,而正式安全边际几乎保持不变。因此,Shield合成最有价值的不是作为安全代理的部署机制,而是作为回答系统是否、在哪里以及如何防御架构问题的框架。可辩护性判决是输出,而非安全政策。
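The attractor computation and winning-region extraction named in the shield-synthesis abstract above follow a standard fixed-point pattern for two-player safety games, sketched below. The graph encoding, the function names, and the treatment of dead-end vertices are assumptions. The sketch also assumes the attacker's outgoing edges already contain only the moves its specification allows, and it omits the product construction with specification automata.

```python
def attacker_attractor(vertices, edges, owner, unsafe):
    """States from which the attacker can force a visit to `unsafe`.

    edges: dict mapping a vertex to its list of successors.
    owner: dict mapping a vertex to "attacker" or "defender".
    """
    attractor = set(unsafe)
    changed = True
    while changed:
        changed = False
        for v in vertices:
            if v in attractor:
                continue
            succs = edges.get(v, [])
            if not succs:
                continue  # assumption: a dead end outside `unsafe` is safe
            if owner[v] == "attacker":
                pulled = any(s in attractor for s in succs)
            else:
                pulled = all(s in attractor for s in succs)
            if pulled:
                attractor.add(v)
                changed = True
    return attractor


def defensibility_verdict(vertices, edges, owner, unsafe, initial):
    """Return (defensible?, defender winning region)."""
    winning = set(vertices) - attacker_attractor(vertices, edges, owner, unsafe)
    return initial in winning, winning


def shield(winning, edges, owner):
    """Defender moves that stay inside the winning region."""
    return {
        v: [s for s in edges.get(v, []) if s in winning]
        for v in winning
        if owner[v] == "defender"
    }
```

The verdict is simply whether the initial state lies outside the attacker's attractor, which matches the abstract's framing of defensibility as the primary output and the shield as a by-product.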
Improving Robotic Generalist Policies via Flow Reversal Steering
通过流程逆转引导改进机器人通用策略
- Authors: Andy Tang, William Chen, Andrew Wagenmaker, Chelsea Finn, Sergey Levine
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2606.13675
- Pdf link: https://arxiv.org/pdf/2606.13675
- Abstract
Generalist policies can learn a wide range of skills from diverse robot datasets. In order to solve or improve on challenging new tasks, we need a way to infer and invoke the appropriate actions from the policy's rich behavioral prior, especially when directly commanding the policy fails. We focus on flow matching generalists and propose Flow Reversal Steering (FRS): a method that takes suboptimal but "reasonable" actions, finds their latent noises by passing them through the flow policy in reverse, and maps them to nearby generalist action modes. We evaluate FRS across many simulated and real-world manipulation settings. First, FRS can turn coarse semantic guidance from humans or vision-language models (VLMs) into corresponding good robot actions, improving zero-shot control. These gains can be distilled with behavioral cloning by training an auxiliary policy to output noises that the generalist maps to good actions -- showing up to 95% absolute task success rate boosts in under a minute of training. Finally, FRS enables policy improvement by bootstrapping reinforcement learning with semantic knowledge, improving on several tasks that standard RL fails to improve on.
- 中文摘要
通才策略可以从多样化的机器人数据集中学习到广泛的技能。为了解决或改进具有挑战性的新任务,我们需要一种方法,能够从策略丰富的行为先验中推断并调用适当的动作,尤其是在直接命令策略失败时。我们专注于流匹配通才策略,并提出了流逆转引导(FRS):一种将次优但“合理”的动作反向通过流策略来找到其潜在噪声,并将其映射到附近的通才动作模式的方法。我们在多种模拟和现实操作环境中评估FRS。首先,FRS可以将人类或视觉语言模型(VLM)的粗略语义引导转化为相应的良好机器人动作,从而提升零样本控制。通过训练辅助策略,使其输出能被通才策略映射为良好动作的噪声,可以用行为克隆蒸馏这些收益,在不到一分钟的训练内,任务成功率的绝对提升高达95%。最后,FRS通过用语义知识引导强化学习,实现策略改进,改进了标准强化学习未能改进的多个任务。
Mana: Dexterous Manipulation of Articulated Tools
魔力:灵活操控关节工具
- Authors: Zhao-Heng Yin, Guanya Shi, Pieter Abbeel, C. Karen Liu
- Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2606.13677
- Pdf link: https://arxiv.org/pdf/2606.13677
- Abstract
Articulated tool manipulation remains a major challenge in dexterous robotics due to the need to coordinate internal degrees of freedom and contact-rich interactions. While prior work has largely focused on rigid objects, articulated tool use remains underexplored because of its physical complexity and the difficulty of learning functional grasping and manipulation policies. We present Mana (Manipulation Animator), a general sim-to-real framework that reinterprets dexterous manipulation as an animation problem. Inspired by computer animation, Mana employs a coarse-to-fine pipeline that transforms procedurally-generated grasp keyframes into manipulation trajectories through motion planning and reinforcement learning. The data generation process is largely automatic, requiring only a few mouse clicks to specify functional affordances (<1 minute per tool). Across four articulated tools spanning different scales and joint types, Mana achieves zero-shot sim-to-real transfer for both grasping and in-hand manipulation, demonstrating a scalable approach to dexterous articulated tool use.
- 中文摘要
铰接式工具操作在灵巧机器人中仍是一大挑战,因为需要协调内部自由度和接触丰富的交互。虽然此前的研究主要聚焦于刚性物体,但铰接式工具的使用因其物理复杂性以及学习功能性抓握和操作策略的困难,仍然未被充分探索。我们介绍Mana(Manipulation Animator),这是一个通用的模拟到现实框架,将灵巧操作重新诠释为动画问题。受计算机动画启发,Mana采用了从粗到细的流程,通过运动规划和强化学习将程序生成的抓握关键帧转化为操作轨迹。数据生成过程大多自动化,只需几次鼠标点击即可指定功能性可供性(每件工具不到1分钟)。在四种跨越不同尺度和关节类型的铰接式工具上,Mana在抓握和手内操作两方面都实现了零样本模拟到现实的迁移,展示了一种可扩展的灵巧铰接式工具使用方法。
Keyword: diffusion policy
Action-Effect Memory Pretraining for Robot Manipulation
机器人操作的动作效应记忆预训练
- Authors: Yijing Zhou, Qiwei Liang, Sitong Zhuang, Jiaxi Li, Xianpeng Wang, Boyang Cai, Yunyang Mo, Renjing Xu
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2606.12499
- Pdf link: https://arxiv.org/pdf/2606.12499
- Abstract
We present AEM, an Action-Effect Memory pretraining framework for robot manipulation that learns compact temporal representations from vision-action history. Unlike prior robot representation pretraining methods that mainly focus on single-frame visual encoding, AEM targets the temporal nature of manipulation, where the current observation alone is often insufficient under partial observability. AEM models manipulation as an action-driven interaction process by interleaving visual and action features and applying masked modeling to recover missing content from incomplete histories, thereby learning action-conditioned state evolution. The Mamba-encoded output of the final vision token is used as a compact history representation, serving as the global context for decoding and downstream control. This design preserves a single-vector temporal bottleneck while keeping inference efficient. We evaluate AEM with Diffusion Policy and Flow Policy. AEM consistently improves manipulation performance in both simulation and real-world settings, outperforming baselines across clean scenes, cluttered and random scenes, and non-Markovian tasks. Ablation studies further show that history-aware pretraining surpasses single-frame pretraining and direct frame stacking, while reducing inference latency and computational cost.
- 中文摘要
我们介绍了AEM,一种动作-效应记忆预训练框架,用于机器人操作,能够从视觉-动作历史中学习紧凑的时间表征。与以往主要关注单帧视觉编码的机器人表示预训练方法不同,AEM 针对的是操作的时间性,即仅当前观测在部分可观测性下往往不足。AEM将操作建模为一种动作驱动的交互过程,通过交错视觉特征和动作特征,并应用掩蔽建模来恢复不完整的历史中缺失的内容,从而学习动作条件状态演化。最终视觉令牌的Mamba编码输出被用作紧凑的历史表示,作为解码和下游控制的全局上下文。这种设计保留了单向量的时间瓶颈,同时保持推理效率。我们使用 Diffusion Policy 和 Flow Policy 评估AEM。AEM 在模拟和现实环境中持续提升操作性能,在干净场景、杂乱和随机场景以及非马尔可夫任务中都优于基线。消融研究进一步表明,历史感知预训练优于单帧预训练和直接帧堆叠,同时降低推理延迟和计算成本。