Arxiv Papers of Today

Generated: 2026-06-10 19:40:58 (UTC+8); arXiv announcement time: 2026-06-10 20:00 EDT (2026-06-11 08:00 UTC+8)

41 relevant papers today

Keyword: reinforcement learning

Self-EmoQ: Plutchik-Guided Value-based Planning to Drive Streaming Emotional TTS

Authors: Yue Zhao, Hongyan Li, Yong Chen, Luo Ji
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.09837
Pdf link: https://arxiv.org/pdf/2606.09837
Abstract Emotional interaction is increasingly crucial for conversational AI, yet current systems lack a self-emotion determination mechanism to drive streaming text-to-speech (TTS) synthesis. We propose an emotion-planning framework that determines the emotion before text generation, grounding the downstream emotional TTS in a streaming manner. The framework is implemented as a plug-and-play LLM module, initialized from pretrained LLMs and trained by reinforcement learning (RL) with emotions as the actions. A hybrid reward combines imitation signals with theory-driven scoring based on Plutchik's wheel of emotions. In experiments on DailyDialog, EmoryNLP, IEMOCAP, and MELD, our method outperforms prompting and finetuning baselines on both emotion determination and response quality. Finally, we implement a complete streaming pipeline for real-time deployment, with the speech quality confirming the framework's emotional alignment, contextual coherence, and expressive fluency. Code, cases, and demos are available at this https URL.
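Sketch (not from the paper): the hybrid reward described in the abstract could look roughly like the following. It assumes that the theory-driven score is the closeness of two primary emotions on Plutchik's wheel, that the imitation signal is an exact match against a reference label, and that the two are mixed with a hypothetical weight `alpha`. The paper's actual scoring rule may differ.

```python
# Plutchik's eight primary emotions in wheel order; opposites sit 4 steps apart.
WHEEL = ["joy", "trust", "fear", "surprise",
         "sadness", "disgust", "anger", "anticipation"]


def wheel_distance(a: str, b: str) -> int:
    """Number of wheel steps between two primary emotions (0 to 4)."""
    diff = abs(WHEEL.index(a) - WHEEL.index(b))
    return min(diff, len(WHEEL) - diff)


def theory_score(predicted: str, reference: str) -> float:
    """Assumed theory-driven score: 1.0 for identical, 0.0 for opposite emotions."""
    return 1.0 - wheel_distance(predicted, reference) / 4.0


def hybrid_reward(predicted: str, reference: str, alpha: float = 0.5) -> float:
    """Mix an exact-match imitation signal with the wheel-based score.

    `alpha` is a hypothetical mixing weight, not a value from the paper.
    """
    imitation = 1.0 if predicted == reference else 0.0
    return alpha * imitation + (1.0 - alpha) * theory_score(predicted, reference)
```

Under these assumptions a near miss such as predicting "trust" for a "joy" reference still earns partial reward, while the opposite emotion "sadness" earns none.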

TD-Grokking: Learning from Zero-Reward Problems by Training-Time Decomposition

Authors: Ningyuan Xi, Hao Xu, Hongsheng Xin, Ning Miao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.09883
Pdf link: https://arxiv.org/pdf/2606.09883
Abstract Large language models (LLMs) have made remarkable progress in reasoning tasks, largely driven by post-training paradigms, especially reinforcement learning with verifiable rewards (RLVR). However, a critical bottleneck persists: RLVR fails on highly challenging zero-reward problems, where all sampled reasoning trajectories yield uniformly failed outcomes, providing no optimization signal to drive model improvement. Prior efforts to address this limitation, such as dense process supervision, partial reward assignment, or prefix-guided exploration, suffer from inherent task constraints or do not fully equip the policy model with the capabilities necessary to solve the original intractable problems. To address this, we propose TD-Grokking, a training-time decomposition framework for zero-reward problems. It recursively decomposes intractable root problems into self-contained, verifiable subproblems, forming hierarchical trees where solvable leaves provide non-zero rewards. Evaluations on mathematical and medical tasks show that TD-Grokking outperforms vanilla GRPO as well as all baseline approaches. Together with detailed analysis, these results confirm that training-time decomposition effectively converts zero-reward examples into usable training signals, enabling consistent performance gains. Our code and datasets are available at this https URL.
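Sketch (not from the paper): the idea of turning a zero-reward root problem into trainable subproblems can be illustrated with a small tree walk. The `Problem` class, the stored reward lists, and the example tree are all hypothetical. In the paper the decomposition is produced during training, whereas here it is given by hand.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Problem:
    name: str
    rewards: List[float]  # verifier rewards of the sampled rollouts
    children: List["Problem"] = field(default_factory=list)


def has_signal(problem: Problem) -> bool:
    """A rollout group carries a learning signal only if some reward is non-zero."""
    return any(r > 0 for r in problem.rewards)


def trainable_nodes(problem: Problem) -> List[Problem]:
    """Return nodes with non-zero reward, descending only into zero-reward nodes."""
    if has_signal(problem):
        return [problem]
    found: List[Problem] = []
    for child in problem.children:
        found.extend(trainable_nodes(child))
    return found


# Hand-built example: the root and subproblem "b" are zero-reward.
tree = Problem("root", [0, 0, 0, 0], [
    Problem("a", [0, 1, 0, 0]),
    Problem("b", [0, 0, 0, 0], [
        Problem("b1", [1, 1, 0, 0]),
        Problem("b2", [0, 0, 0, 0]),
    ]),
])
usable = [node.name for node in trainable_nodes(tree)]
```

In this toy tree the root yields no signal, but the leaves "a" and "b1" do, which is the sense in which solvable leaves provide non-zero rewards.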

Failure Modes of Deep Multi-Agent RL in Asynchronous Pricing: Reproducible Triggers, Trace Diagnostics, and a Partial Fix

Authors: Shree Murthy, Rohan Pandey
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Econometrics (econ.EM)
Arxiv link: https://arxiv.org/abs/2606.09884
Pdf link: https://arxiv.org/pdf/2606.09884
Abstract We study two reproducible failure modes of deep multi-agent reinforcement learning in continuous-time pricing markets: (i) tacit cartel formation between competing DDPG agents, and (ii) actor-critic instability at high event rates. We instantiate both inside a single CT-MARL benchmark (Poisson-clocked price updates, observation latency $\delta$, interior-optimum logit demand), show that synchronous DDPG agents reliably trigger Failure Mode 1 with collusion index $\Delta = 0.69 \pm 0.11$, and quantify a partial microstructure fix: asynchrony alone cuts collusion by 48% and adding latency drives it to a minimum of $\Delta = 0.28$. The fix has clearly documented costs: it is partial ($\Delta$ remains supra-Bertrand), it is non-monotone in $\delta$, and it does not survive Failure Mode 2, which emerges as DDPG critic divergence at $\lambda = 5$ and corrupts the phase-diagram cell at $(\lambda = 5, \delta = 1)$. We accompany the scalar collusion index with trajectory-level trace diagnostics that expose the within-episode signalling collapse and the post-shock non-recovery.
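Sketch (not from the paper): the abstract does not define the collusion index, so the code below assumes the normalization that is common in the algorithmic-collusion literature, where the index is 0 at the competitive (Bertrand-Nash) profit and 1 at the joint-monopoly profit. The profit values passed in are hypothetical. The second helper only reproduces arithmetic on the numbers quoted in the abstract.

```python
def collusion_index(avg_profit: float, nash_profit: float,
                    monopoly_profit: float) -> float:
    """Assumed definition: normalized profit gain over the competitive benchmark."""
    return (avg_profit - nash_profit) / (monopoly_profit - nash_profit)


def relative_reduction(before: float, after: float) -> float:
    """Fractional drop of the index from `before` to `after`."""
    return (before - after) / before


# If the 48% cut is read as relative to the synchronous index of 0.69:
async_only = 0.69 * (1 - 0.48)               # about 0.36
# Drop from the synchronous index to the reported minimum of 0.28:
with_latency = relative_reduction(0.69, 0.28)  # about 0.59
```

Read this way, the minimum of 0.28 corresponds to a reduction of roughly 59% relative to the synchronous case, while still remaining above the competitive value of 0.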

SocraticPO: Policy Optimization via Interactive Guidance

Authors: Zirui Liu, Jie Ouyang, Qi Liu, Xianquan Wang, Jiayu Liu, Tingyue Pan, Qingchuan Li, Jing Sha, Zhenya Huang, Shijin Wang, Enhong Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.09887
Pdf link: https://arxiv.org/pdf/2606.09887
Abstract Reinforcement learning (RL) for large language models usually supervises reasoning with scalar outcome rewards, such as binary correctness. Such rewards provide an optimization direction but rarely explain how a model should revise its mistaken reasoning, which can encourage shortcut learning and brittle policies. We propose SocraticPO (Socratic Policy Optimization), a policy-optimization framework that augments RL rollouts with Socratic-style natural-language guidance. During rollout, the student first answers independently; if the answer is incorrect, a teacher diagnoses the attempt and provides concise corrective guidance, after which the student continues under the expanded context. Crucially, this guidance is paired with reward decay: correct answers obtained after teacher intervention only receive decayed rewards, preventing the policy from treating teacher help as a free path to reward. Since SocraticPO only modifies the rollout process while leaving the standard expected-reward objective intact, it can be plugged into existing policy-gradient backends such as Reinforce++. Moreover, because the teacher provides only text-level guidance, SocraticPO can leverage stronger black-box teacher models without requiring access to logits or distribution matching. On undergraduate-level scientific reasoning benchmarks from SciKnowEval, SocraticPO improves over strong RL and self-distillation baselines. Ablations show that both targeted guidance and reward decay are necessary, with reward decay mitigating reliance on assisted correction.
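Sketch (not from the paper): the rollout with teacher guidance and reward decay could be organized as below. The callables `student`, `teacher`, and `check`, the hint limit, and the geometric decay `decay ** hints_used` are all assumptions made for illustration. The abstract only states that rewards after teacher intervention are decayed.

```python
from typing import Callable, Tuple


def socratic_rollout(student: Callable[[str], str],
                     teacher: Callable[[str, str], str],
                     check: Callable[[str], bool],
                     question: str,
                     max_hints: int = 2,
                     decay: float = 0.5) -> Tuple[str, float]:
    """Return (final answer, reward); reward shrinks with each hint used."""
    context = question
    answer = ""
    for hints_used in range(max_hints + 1):
        answer = student(context)
        if check(answer):
            # Unassisted success earns 1.0; each hint multiplies by `decay`.
            return answer, decay ** hints_used
        if hints_used < max_hints:
            # The teacher sees the failed attempt and extends the context.
            context = context + "\n" + teacher(context, answer)
    return answer, 0.0
```

With `decay=0.5`, a student who needs one hint earns 0.5 instead of 1.0, so the policy has an incentive to answer correctly without help.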

When RL Fails after SFT: Rejuvenating Model Plasticity for Robust SFT-to-RL Handoff

Authors: Runze Liu, Jiashun Liu, Xu Wan, Yuqian Fu, Ling Pan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.09932
Pdf link: https://arxiv.org/pdf/2606.09932
Abstract Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has become a standard pipeline for Large Language Model (LLM) post-training. SFT is expected to provide a useful behavioral prior for RL to further enhance model capabilities. However, checkpoints with excessive SFT often show limited improvement during RL. We attribute this failure to the loss of model plasticity: the reduced ability of an SFT-initialized policy to be effectively reshaped by subsequent RL. To better understand this phenomenon, we conduct detailed analysis from multiple perspectives, including parameter changes, output spaces, and RL optimization dynamics. Our results show that models from excessive SFT tend to produce over-confident token distributions and exhibit sharp parameter landscapes, which make them harder to optimize in the RL stage. To enable a more robust SFT-to-RL handoff, we propose Rejuvenation, a simple yet effective method that restores plasticity while preserving useful SFT-acquired priors. Rejuvenation leverages base-anchored model fusion to reduce excessive SFT-induced drift, together with targeted neuron reset to mitigate model rigidity. Experimental results on both math reasoning tasks and agentic tasks demonstrate that our approach consistently improves RL performance on over-trained SFT models, while also enhancing generalization to out-of-distribution tasks.
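Sketch (not from the paper): one plausible reading of "base-anchored model fusion" plus "targeted neuron reset" is linear interpolation toward the base weights followed by copying selected neurons back from the base model. Both the interpolation form and the way neurons are selected are assumptions here; the weight `lam` and the `neuron_ids` list are hypothetical.

```python
from typing import List, Sequence

Matrix = List[List[float]]  # one row of incoming weights per neuron


def fuse(base: Matrix, sft: Matrix, lam: float) -> Matrix:
    """Interpolate every weight: base + lam * (sft - base)."""
    return [[b + lam * (s - b) for b, s in zip(base_row, sft_row)]
            for base_row, sft_row in zip(base, sft)]


def reset_neurons(weights: Matrix, base: Matrix,
                  neuron_ids: Sequence[int]) -> Matrix:
    """Copy the rows listed in `neuron_ids` back from the base model."""
    chosen = set(neuron_ids)
    return [list(base[i]) if i in chosen else list(row)
            for i, row in enumerate(weights)]


base = [[0.0, 0.0], [0.0, 0.0]]
sft = [[2.0, 4.0], [6.0, 8.0]]
fused = fuse(base, sft, lam=0.5)            # halfway between base and SFT
rejuvenated = reset_neurons(fused, base, [1])  # neuron 1 returns to base
```

Setting `lam=1.0` with an empty `neuron_ids` list recovers the SFT checkpoint unchanged, which makes the two knobs easy to ablate separately.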

Uncertainty-Aware Motion Planning for Autonomous Driving in Mixed Traffic Environment

Authors: Ming Cheng, Hao Chen, Ziyi Yang, Ziluowen Luo, Senzhang Wang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.09958
Pdf link: https://arxiv.org/pdf/2606.09958
Abstract In mixed-traffic environments where autonomous and human-driven vehicles may co-exist, motion planning for autonomous vehicles requires anticipating the future behaviors of surrounding human drivers. Existing reinforcement learning-based methods generally incorporate the predicted human intents directly into the observation to enable proactive planning. However, human intent is inherently uncertain due to behavioral diversity, perception noise, and partial observability. Treating predicted intents as deterministic states can result in unsafe decisions for autonomous vehicles. To address this problem, we propose Uncertainty-Aware Motion Planning (UAMP), which incorporates uncertainty in human intent prediction for AV decision-making. Specifically, UAMP first introduces a proximity-aware uncertainty estimator to quantify the interaction-conditioned intent uncertainty and constructs an uncertainty-guided joint intent distribution over surrounding human-driven vehicles. Within this uncertainty set, UAMP further introduces Uncertainty-Calibrated Value Learning (UCVL) to correct value function learning biases arising from directly incorporating uncertain human intent predictions into the observation. Extensive experiments in various mixed-traffic scenarios show that UAMP significantly improves safety and driving comfort, while maintaining traffic efficiency compared with existing approaches. The code is released at this https URL.

3SPO: State-Score-Supervised Policy Optimization for LLM Agents

Authors: Yu Han, Kailing Li, Yang Jiao, Yulin Dai, Yuqian Fu, Linhai Zhuo, Tianwen Qian
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.09961
Pdf link: https://arxiv.org/pdf/2606.09961
Abstract Training large language models (LLMs) as autonomous agents via reinforcement learning (RL) has enabled frontier models to achieve superhuman performance in long-horizon tasks. However, existing RL algorithms operate at the trajectory level, performing policy optimization only after collecting complete episode rollouts. This coarse-grained approach faces fundamental challenges in multi-turn agent settings where rewards are sparse, delayed, and credit assignment across individual steps is critical. In this work, we propose State-Score-Supervised Policy Optimization (3SPO), a novel RL algorithm that performs post-step policy optimization with dynamic state score supervision. At each step, 3SPO computes the state score based on historical success rates, supervising step-wise credit assignment, adaptive rollout and post-step policy optimization without requiring value function estimation or additional auxiliary models. Theoretically, under a per-state bandit abstraction, we show that the proposed score-supervised allocation mechanism achieves logarithmic allocation regret and provide sample-complexity guarantees for action identification, score distinguishability, and filtering stability. Experiments on ALFWorld and WebShop with Qwen2.5-1.5B/7B-Instruct show that 3SPO consistently outperforms GRPO by $+22.6\%$ on ALFWorld and $+15.6$ points on WebShop, while using comparable resources to achieve $2.4\times$ more state exploration and $1.8\times$ faster convergence. Code is available at this https URL.
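Sketch (not from the paper): a state score "based on historical success rates" can be kept with two counters per state. The class below is a hypothetical illustration; the prior for unseen states and the use of a score difference as step-wise credit are assumptions, not the paper's definitions.

```python
from collections import defaultdict
from typing import Hashable, Iterable


class StateScores:
    """Tracks, per state, how often episodes passing through it succeeded."""

    def __init__(self) -> None:
        self.visits = defaultdict(int)
        self.successes = defaultdict(int)

    def update(self, trajectory_states: Iterable[Hashable], success: bool) -> None:
        # Count each state once per episode, even if it was revisited.
        for state in set(trajectory_states):
            self.visits[state] += 1
            self.successes[state] += int(success)

    def score(self, state: Hashable, prior: float = 0.5) -> float:
        n = self.visits.get(state, 0)
        return self.successes.get(state, 0) / n if n else prior

    def step_credit(self, state: Hashable, next_state: Hashable) -> float:
        """Assumed credit: how much the step changed the estimated success rate."""
        return self.score(next_state) - self.score(state)


scores = StateScores()
scores.update(["s0", "s1"], success=True)
scores.update(["s0", "s2"], success=False)
```

No value network is involved: the score is a running empirical frequency, which matches the abstract's claim that no value function estimation or auxiliary model is needed.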

Discovering Interpretable Multi-Parameter Control Policies for Evolutionary Algorithms Using Deep Reinforcement Learning

Authors: Tai Nguyen, Phong Le, Carola Doerr, Nguyen Dang
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/2606.10129
Pdf link: https://arxiv.org/pdf/2606.10129
Abstract While deep Reinforcement Learning (deep-RL) has been increasingly applied to parameter control in evolutionary algorithms, rigorous theoretical analysis of parameter control remains largely restricted to single-parameter settings, owing to the difficulty of deriving effective, interpretable multi-parameter policies amenable to formal study. We demonstrate how deep-RL can be leveraged to overcome this barrier, using the (1+($\lambda$,$\lambda$))-genetic algorithm optimizing OneMax, one of the few problems where a super-constant speedup of dynamic control has been formally proven, as a representative case study. We first show that standard approaches struggle to converge in this multi-parameter setting, and introduce algorithm-agnostic enhancements targeting action-space decomposition, reward shifting, and long-horizon discounting. With these in place, we compare common deep-RL methods and find that Double Deep Q-Networks uniquely avoid the policy collapse observed in Proximal Policy Optimization, yielding trajectories suitable for downstream analysis. Crucially, we move beyond the "black-box" nature of neural networks by distilling the learned behaviors into a transparent, symbolic control policy. The resulting policy not only offers interpretability for future theoretical analysis but also yields exceptional performance, consistently outperforming existing baselines across a wide range of problem sizes.
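Sketch (not from the paper): for readers unfamiliar with the case study, the following is a compact version of the (1+($\lambda$,$\lambda$)) genetic algorithm on OneMax with a fixed $\lambda$ and the usual coupling of mutation rate $\lambda/n$ and crossover bias $1/\lambda$. It contains no learned controller. The paper's contribution is a policy that sets these parameters dynamically, which this static baseline does not attempt.

```python
import random
from typing import List


def onemax(x: List[int]) -> int:
    """Fitness: number of one-bits."""
    return sum(x)


def one_plus_lambda_lambda(n: int, lam: int, rng: random.Random,
                           max_iters: int = 100_000) -> List[int]:
    """Static-parameter (1+(lambda,lambda)) GA; returns the final parent."""
    x = [rng.randint(0, 1) for _ in range(n)]
    p, c = lam / n, 1.0 / lam  # standard coupling of the three parameters
    for _ in range(max_iters):
        if onemax(x) == n:
            break
        # Mutation phase: all lam mutants flip the same number of bits.
        ell = sum(rng.random() < p for _ in range(n))
        mutants = []
        for _ in range(lam):
            y = x[:]
            for i in rng.sample(range(n), ell):
                y[i] = 1 - y[i]
            mutants.append(y)
        best_mutant = max(mutants, key=onemax)
        # Crossover phase: take each bit from the mutant with probability c.
        offspring = [[m if rng.random() < c else b
                      for b, m in zip(x, best_mutant)]
                     for _ in range(lam)]
        y = max(offspring, key=onemax)
        if onemax(y) >= onemax(x):  # elitist replacement
            x = y
    return x


result = one_plus_lambda_lambda(n=20, lam=4, rng=random.Random(0))
```

A controller in the spirit of the abstract would replace the fixed `lam`, `p`, and `c` with values chosen per iteration from the current fitness.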

Dropout-GRPO: Variational Stochasticity for Continuous Latent Reasoning

Authors: Wooil Jung
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.10184
Pdf link: https://arxiv.org/pdf/2606.10184
Abstract Group Relative Policy Optimization (GRPO) relies on the diversity of $K$ rollouts within each group; otherwise, the group-mean advantage $A^{(k)} = r^{(k)} - \mu_r$ collapses to zero. This presents a structural challenge for latent-reasoning models like Coconut, which feed continuous hidden states recurrently in place of discrete chain-of-thought tokens. Because the latent phase is inherently deterministic given the parameters and prompt, multiple rollouts produce identical trajectories, stalling GRPO's progress. Consequently, applying group-relative reinforcement learning to continuous latent reasoning has proven difficult. To address this, we propose sourcing the necessary stochasticity through structured dropout. By applying a single Bernoulli mask held constant across all latent recurrence steps for a given rollout, we generate essential trajectory variance. This shared mask effectively treats each rollout as a posterior sample from a variational distribution over parameters, allowing GRPO to optimize the expected reward of a Bayesian model-average policy. We provide both theoretical justification for this method -- including unbiasedness, variance reduction, and the well-definedness of the latent gradient -- and empirical validation. On GSM8K, dropout-GRPO improves a Coconut baseline from $27.29\%$ to $29.01\%$ pass@1, demonstrating the viability of GRPO learning for latent-reasoning models. Our work positions this as a practical, theoretically grounded approach for post-training latent-reasoning LLMs.
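Sketch (not from the paper): the collapse of the group-mean advantage $A^{(k)} = r^{(k)} - \mu_r$ and the effect of a per-rollout dropout mask can be seen on a toy recurrence. The tanh recurrence, the weights, and the masks below are hypothetical stand-ins for a latent-reasoning model. Only the advantage formula and the idea of one mask held fixed across all recurrence steps come from the abstract.

```python
import math
from typing import List, Optional


def group_advantages(rewards: List[float]) -> List[float]:
    """A^(k) = r^(k) - mean(r); all zeros when every rollout scores the same."""
    mu = sum(rewards) / len(rewards)
    return [r - mu for r in rewards]


def latent_rollout(h0: List[float], weights: List[List[float]], steps: int,
                   mask: Optional[List[float]] = None) -> List[float]:
    """Toy latent recurrence; the same mask is applied at every step."""
    h = list(h0)
    for _ in range(steps):
        if mask is not None:
            h = [hi * mi for hi, mi in zip(h, mask)]
        h = [math.tanh(sum(w * hi for w, hi in zip(row, h))) for row in weights]
    return h


W = [[0.5, -0.3], [0.2, 0.8]]
h0 = [1.0, -1.0]
# Without dropout, repeated rollouts are identical.
plain_a = latent_rollout(h0, W, steps=3)
plain_b = latent_rollout(h0, W, steps=3)
# Two rollouts with different fixed masks (keep probability 0.5, rescaled by 2).
masked_a = latent_rollout(h0, W, steps=3, mask=[2.0, 0.0])
masked_b = latent_rollout(h0, W, steps=3, mask=[0.0, 2.0])
```

Identical trajectories imply identical rewards and therefore zero advantages, while distinct masks give the group the variation that GRPO needs.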

SHAPO: Sharpness-Aware Policy Optimization for Safe Exploration

Authors: Kaustubh Mani, Yann Pequignot, Vincent Mai, Liam Paull
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.10228
Pdf link: https://arxiv.org/pdf/2606.10228
Abstract Safe exploration is a prerequisite for deploying reinforcement learning (RL) agents in safety-critical domains. In this paper, we approach safe exploration through the lens of epistemic uncertainty, where the actor's sensitivity to parameter perturbations serves as a practical proxy for regions of high uncertainty. We propose Sharpness-Aware Policy Optimization (SHAPO), a sharpness-aware policy update rule that evaluates gradients at perturbed parameters, making policy updates pessimistic with respect to the actor's epistemic uncertainty. Analytically we show that this adjustment implicitly reweighs policy gradients, amplifying the influence of rare unsafe actions while tempering contributions from already safe ones, thereby biasing learning toward conservative behavior in under-explored regions. Across several continuous-control tasks, our method consistently improves both safety and task performance over existing baselines, significantly expanding their Pareto frontiers.
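Sketch (not from the paper): "evaluating gradients at perturbed parameters" is the core of generic sharpness-aware minimization, shown below on a toy quadratic loss. This is the standard two-step recipe for loss minimization, not SHAPO's exact update rule, which the abstract does not spell out. The radius `rho` and the toy loss are hypothetical.

```python
import math
from typing import Callable, List

Vector = List[float]


def sharpness_aware_gradient(grad_fn: Callable[[Vector], Vector],
                             theta: Vector, rho: float = 0.05) -> Vector:
    """Gradient taken at theta + rho * g / ||g|| instead of at theta."""
    g = grad_fn(theta)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return g  # already at a stationary point, nothing to perturb
    perturbed = [t + rho * gi / norm for t, gi in zip(theta, g)]
    return grad_fn(perturbed)


def toy_grad(theta: Vector) -> Vector:
    """Gradient of the toy loss f(theta) = sum(theta_i ** 2)."""
    return [2.0 * t for t in theta]


g_sam = sharpness_aware_gradient(toy_grad, [3.0, 4.0], rho=0.05)
```

For the toy loss the perturbed point is (3.03, 4.04), so the returned gradient is slightly larger than the plain gradient (6, 8), reflecting the worst case in a small neighborhood.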

Locomotion analysis of a quadruped interacting with the lunar granular surface

Authors: Yash J Vyas
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.10273
Pdf link: https://arxiv.org/pdf/2606.10273
Abstract Deploying legged robots in extra-terrestrial environments includes many challenges due to complex terrain interactions, energy, and thermal constraints. For effective mechanical design of a lunar exploration quadrupedal robot, careful consideration of motor torques, energy expenditure, and cost of transport is required. The lunar surface is composed of granular regolith, which impacts the locomotion of legged robots and their performance. Locomotion algorithms trained with rigid contact assumptions are also ineffective when applied to environments with soft contacts, such as granular surfaces, which can result in instability and poor tracking. In this report, the physical modelling of the granular lunar surface-robot foot contacts is applied to a simulation environment with locomotion trained using Reinforcement Learning. A comparison is conducted between the policy trained on rigid contact and soft contact environments, analysing the gait and locomotion performance metrics. The analysis demonstrates that soft contacts simulating regolith surfaces pose additional challenges for Reinforcement Learning based training, result in a qualitatively different gait, and increase the overall energy expenditure.
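Sketch (not from the paper): cost of transport, one of the metrics named in the abstract, is commonly defined as energy divided by weight times distance. The code assumes that standard dimensionless definition. The robot mass, energy, and distance in the example are hypothetical numbers, not results from the report.

```python
LUNAR_GRAVITY = 1.62   # m/s^2, approximate surface gravity of the Moon
EARTH_GRAVITY = 9.81   # m/s^2


def cost_of_transport(energy_j: float, mass_kg: float, distance_m: float,
                      gravity: float = LUNAR_GRAVITY) -> float:
    """Dimensionless cost of transport: E / (m * g * d)."""
    return energy_j / (mass_kg * gravity * distance_m)


# Hypothetical run: a 50 kg robot spends 500 J to walk 10 m.
cot_moon = cost_of_transport(500.0, 50.0, 10.0)
cot_earth = cost_of_transport(500.0, 50.0, 10.0, gravity=EARTH_GRAVITY)
```

Because the weight term is smaller on the Moon, the same energy and distance give a higher cost of transport there, so values should only be compared under the same gravity.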

MARCH: Model-Assisted Reinforcement Learning for the Perceptive Control of Humanoids over Sparse Footholds

Authors: Codrin Crismariu, Ryan K. Cosner
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.10288
Pdf link: https://arxiv.org/pdf/2606.10288
Abstract Perceptive bipedal locomotion over sparse terrain remains a difficult challenge: model-based methods are precise but brittle to uncertainty, while model-free methods are robust but struggle to discover the precise, constrained motions required for safety-critical locomotion where small errors can cause catastrophic failures. We propose a model-assisted reinforcement learning (RL) framework that combines both perspectives in three steps: (1) generate a safe reference trajectory using simplified models; (2) train a privileged teacher policy guided by a control Lyapunov function (CLF) reward built around the safe reference trajectory; and (3) distill the teacher into a vision-based student policy. We show that this model-assistance procedure produces physically grounded locomotion, improving sample efficiency, reducing the need for a complex learning curriculum, and achieving smoother locomotion behavior alongside stepping stone performance comparable to model-free baselines. We validate our approach in simulation and demonstrate successful deployment on a Unitree G1 humanoid robot navigating sparse footholds with lateral constraints.
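Sketch (not from the paper): a reward built around a control Lyapunov function (CLF) can penalize violations of the decrease condition $\dot{V} \le -\lambda V$, where $V$ measures the tracking error to the reference trajectory. The quadratic $V$ with diagonal weights, the finite-difference estimate of $\dot{V}$, and the rate `lam` are all assumptions here. The paper's actual reward shaping is not given in the abstract.

```python
from typing import Sequence


def lyapunov(error: Sequence[float], weights: Sequence[float]) -> float:
    """Quadratic CLF candidate V(e) = sum_i w_i * e_i^2."""
    return sum(w * e * e for w, e in zip(weights, error))


def clf_reward(error: Sequence[float], next_error: Sequence[float],
               weights: Sequence[float], dt: float, lam: float = 1.0) -> float:
    """Zero when V decays at least at rate lam, negative otherwise."""
    v = lyapunov(error, weights)
    v_next = lyapunov(next_error, weights)
    v_dot = (v_next - v) / dt          # finite-difference estimate
    violation = v_dot + lam * v        # CLF condition: violation <= 0
    return -violation if violation > 0 else 0.0


converging = clf_reward([1.0, 0.0], [0.5, 0.0], [1.0, 1.0], dt=0.1)
stalled = clf_reward([1.0, 0.0], [1.0, 0.0], [1.0, 1.0], dt=0.1)
```

A step that shrinks the tracking error fast enough incurs no penalty, while a stalled error is penalized in proportion to how far it misses the required decay.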

SARM2: Multi-Task Stage Aware Reward Modeling for Self Improving Robotic Manipulation

Authors: Qianzhong Chen, Hau Zheng, Justin Yu, Suning Huang, Jiankai Sun, Ken Goldberg, Chuan Wen, Pieter Abbeel, Yide Shentu, Philipp Wu, Mac Schwager
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.10305
Pdf link: https://arxiv.org/pdf/2606.10305
Abstract Fine-tuning vision-language-action (VLA) policies for long-horizon manipulation still relies heavily on behavior cloning, which requires costly high-quality demonstrations and keeps policies near the demonstration distribution. Reward models can reduce this dependence by reweighting demonstrations and providing dense supervision for on-robot reinforcement learning (RL), but they must be dense, accurate, and general. Existing methods fall short: task-specific stage-aware models are accurate but require per-task annotations, while general vision-language-model (VLM) reward models are broadly applicable but too coarse for fine-grained long-horizon progress. We introduce RM, a multi-task stage-aware reward model that combines an action-primitive-based stage estimator with a multi-gate Mixture-of-Experts (MMoE) value head to produce dense per-step rewards across manipulation tasks. Building on RM, we further propose SPIRAL (Self-Policy Improvement via Reward-Aligned Learning), an on-policy reward-guided framework that improves VLA policies from cheap autonomous rollouts. On a 10-task benchmark, RM reduces value-estimation MSE by 80% over the strongest baselines; when used in SPIRAL, it improves task success from around 50% to near-perfect performance on Folding Shorts (58% to 100%) and Cleaning Whiteboard (50% to 90%), showing that high-quality dense rewards are key to a stable robot data flywheel. Project website: this https URL.

Reasoning or Memorization? Direction-Aware Diversity Exploration in LLM Reinforcement Learning

Authors: Jiangnan Xia, Yucheng Shi, Yu Yang, Kishan Panaganti, Zhenwen Liang, Ninghao Liu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.10346
Pdf link: https://arxiv.org/pdf/2606.10346
Abstract Reinforcement learning has become a key paradigm for eliciting reasoning abilities in large language models, where exploration is crucial for discovering effective solution trajectories. Existing exploration methods typically encourage diversity in semantic or gradient spaces, without distinguishing what drives this diversity. A trajectory may appear novel because it follows a new reasoning process, or because it varies memorized patterns and shortcuts. Rewarding both cases equally may steer exploration toward memorization rather than genuine reasoning improvement. In this paper, we propose DiRL, a Direction-Aware Reinforcement Learning framework that anchors exploration to an internal reasoning-memorization direction of the policy. Specifically, DiRL extracts this direction from model representations, constructs direction-weighted gradient features to characterize rollout updates, and shapes rewards to amplify reasoning-aligned exploration while suppressing memorization-aligned variations. DiRL integrates seamlessly into standard Group Relative Policy Optimization (GRPO). Extensive experiments on mathematical and general reasoning benchmarks demonstrate the effectiveness of DiRL, showing significant improvements over various existing exploration methods.

ReflectiChain: Epistemic Grounding in LLM-Driven World Models for Supply Chain Resilience

Authors: Jia Luo
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.10359
Pdf link: https://arxiv.org/pdf/2606.10359
Abstract AI agents in supply chains face a fundamental epistemic gap: large language models (LLMs) interpret policies but lack physical grounding, while reinforcement learning (RL) optimizes flows but is semantically blind to unstructured constraints. We introduce REFLECTICHAIN, bridging this gap through a Generative Supply Chain World Model (SC-WM) - encoding heterogeneous supply networks into a 6-dim graph-latent space with physical conservation - and Double-Loop Learning that separates epistemic uncertainty (KL-trust-region-bounded policy adaptation) from aleatoric uncertainty (stochastic latent rollouts). On Semi-Sim, a 10-node semiconductor benchmark with SIR risk propagation, 6 perturbation types, and 10 policy constraint templates, REFLECTICHAIN improves Rationale Consistency Score by 33.0% (p < 0.0001, d = 2.78), maintains 82.3% operability under adversarial shocks, and exhibits anti-fragile behavior (+40.2% gain under moderate pressure). We identify three operational epistemic mechanisms - uncertainty separation, knowledge-boundary detection, and empirical Bayesian policy updating - and discuss five limitation categories.

Belief-Space Control for Personalized Cancer Treatment via Active Inference

Authors: Deniz Sargun, H. Bugra Tulay, C. Emre Koksal
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.10376
Pdf link: https://arxiv.org/pdf/2606.10376
Abstract Cancer treatment is at its core a sequential decision-making problem with partial observability, latent patient heterogeneity, and explicit constraints on the budget for medical measurements. Unlike standard Reinforcement Learning (RL) approaches that control state trajectories, cancer treatments permanently modify patients' transition dynamics, changing how states evolve over time. We model cancer treatment as a belief-space planning problem using active inference, deriving an expected free-energy objective that unifies goal-directed control and information acquisition under measurement budgets without. We implement this framework using real clinical cancer data from the AACR Project GENIE Biopharma Collaborative dataset. Results on clinical data demonstrate simultaneous patient categorization and high treatment efficacy under real measurement and treatment constraints.
中文摘要 癌症治疗本质上是一个序列决策问题，具有部分可观察性、潜在患者异质性以及对医疗测量预算的明确约束。与控制状态轨迹的标准强化学习（RL）方法不同，癌症治疗永久性地改变患者的过渡动态，改变状态随时间演变的方式。我们利用主动推断将癌症治疗建模为信念空间规划问题，推导出一个预期的自由能目标，统一目标导向控制与信息获取，在测量预算下实现。我们使用来自AACR Project GENIE生物制药协作数据集的真实临床癌症数据来实现该框架。临床数据结果显示，患者在真实测量和治疗约束下同时存在患者分类和高治疗效果。

Mitigating Bias in Low-SNR Financial Reinforcement Learning via Quantum Representations

通过量子表征缓解低信噪比金融强化学习中的偏置

Authors: Zeyu Liu, Xuanzhi Feng, Sing Kwong Lai, Yuanchen Gao, Xiaoyi Pang, Hualei Zhang, Jingcai Guo, Jie Zhang, Song Guo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.10448
Pdf link: https://arxiv.org/pdf/2606.10448
Abstract The financial market is a typical low signal-to-noise ratio (SNR) setting, which often destabilizes off-policy maximum-entropy methods like Soft Actor-Critic (SAC). Specifically, noisy state representations may produce unreliable Q-value estimates, and bootstrapping amplifies these errors, forming a failure mode we call the "Financial Entropy Trap". In this paper, we propose FPQC-SAC, an efficient and plug-and-play SAC variant that places a compact and bounded Parameterized Quantum Circuit (PQC) before the actor and critic networks to constrain feature propagation at the representation level, rather than filtering raw inputs or regularizing Q-values after bootstrapping. Notably, FPQC-SAC reduces the impact of extreme market fluctuations on Bellman target estimation, while trainable quantum entanglement preserves flexible cross-asset interactions. Empirical evaluations on real-world portfolio management tasks demonstrate that FPQC-SAC substantially enhances out-of-sample stability and cumulative returns, achieving a 66.89% relative gain in cumulative return over standard unconstrained SAC and outperforming the best continuous-control deep reinforcement learning baseline by approximately 27%. Open-source code is available at this https URL.
中文摘要 金融市场是一个典型的低信噪比（SNR）环境，这常常使非政策最大熵方法如软行为者-批评者（SAC）不稳定。具体来说，噪声状态表示可能产生不可靠的Q值估计，自助法会放大这些误差，形成我们称之为“金融熵陷阱”的失败模式。本文提出了FPQC-SAC，一种高效且即插即用的SAC变体，在actor和critic网络前放置紧凑且有界的参数化量子电路（PQC），以在表示层面限制特征传播，而非在引导后对原始输入进行过滤或正则化Q值。值得注意的是，FPQC-SAC减少了极端市场波动对贝尔曼目标估计的影响，而可训练量子纠缠则保持了灵活的跨资产交互。对现实投资组合管理任务的实证评估表明，FPQC-SAC通过在相对66.89%的累计回报上相较标准无约束SAC提升66.89%，显著提升了样本外稳定性和累计回报，并且比最佳连续控制深度强化学习基线高出约27%。开源代码可在此 https URL 访问。

GuideWalk: Learning Unified Autonomous Navigation and Locomotion for Humanoid Robots across Versatile Terrains

GuideWalk：学习跨多样地形的类人机器人统一自主导航与移动

Authors: Haoxuan Han, Chen Chen, Linao Gong, Xin Yang, Hao Hu, Junhong Guo, Zhicheng He, Yao Su, Fenghua He
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.10449
Pdf link: https://arxiv.org/pdf/2606.10449
Abstract Humanoid robots have achieved strong locomotion capabilities, but reliable navigation on versatile terrains remains challenging because obstacle avoidance must be coordinated with dynamically feasible motion. In this work, we present GuideWalk, a unified end-to-end framework that integrates traversability-aware navigation guidance with a terrain-adaptive locomotion teacher for humanoid navigation. Specifically, we introduce a navigation module that provides explicit velocity guidance, decoupling obstacle avoidance from terrain conditions to enable robust planning across diverse environments. We propose a composite teacher distillation scheme, where goal-directed commands and dynamically consistent actions are aggregated and distilled into a single policy. To further improve robustness, the distilled policy is refined with reinforcement learning and an auxiliary behavior cloning objective, which promotes exploration while preserving desirable teacher behaviors. Experiments demonstrate that GuideWalk achieves stable and effective navigation while maintaining stable humanoid locomotion.
中文摘要 类人机器人已具备强大的移动能力，但在多样化地形上实现可靠导航仍具挑战，因为障碍物避让必须与动态可行的运动协调。本研究中，我们介绍了GuideWalk，一个统一的端到端框架，集成了可通行性感知的导航指导与地形自适应的行走教师，用于类人生物导航。具体来说，我们引入了一个导航模块，提供明确的速度引导，将障碍物避让与地形条件脱钩，从而实现跨越多样环境的稳健规划。我们提出了一种综合教师提炼方案，将目标导向命令和动态一致的动作聚合并提炼成单一策略。为了进一步提升稳健性，精炼策略通过强化学习和辅助行为克隆目标进行了细化，促进探索，同时保持理想的教师行为。实验表明，GuideWalk在保持人形移动稳定的同时，实现了稳定有效的导航。

HIPIF: Hierarchical Planning and Information Folding for Long-Horizon LLM Agent Learning

HIPIF：长视野LLM代理学习的分层规划与信息折叠

Authors: Juncheng Diao, Zhicong Lu, Peiguang Li, Yongwei Zhou, Changyuan Tian, Qingbin Li, Rongxiang Weng, Jingang Wang, Xunliang Cai
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.10507
Pdf link: https://arxiv.org/pdf/2606.10507
Abstract While Large Language Models (LLMs) have demonstrated strong capabilities as autonomous agents across a wide range of tasks, their performance often degrades in multi-turn long-horizon agentic tasks. Existing methods have made progress through fine-grained credit assignment to alleviate long-horizon sparse rewards and hierarchical reinforcement learning to decompose tasks and reduce long-term dependency. However, these methods still do not directly address long-context interference, in which continuously growing histories weaken the agent's ability to track the global task state and impair subsequent reasoning and decision-making. Inspired by the way humans handle complex tasks through subgoal decomposition and completed progress summarization, we propose Hierarchical Planning and Information Folding (HIPIF) for long-horizon LLM agent learning. HIPIF trains the agent end-to-end to organize long-horizon execution around explicit subgoals while folding completed subgoal histories to reduce long-context interference. Furthermore, to stabilize subgoal-based planning and execution, HIPIF combines hierarchical reflection and subgoal-oriented process rewards to guide subgoal generation, transition, and execution, without relying on costly auxiliary models or task-specific expert trajectories. Extensive experiments on three publicly available agentic benchmarks demonstrate the validity of our method.
中文摘要 虽然大型语言模型（LLMs）在广泛的任务中展现出作为自主代理的强大能力，但在多回合的长期代理任务中，其性能常常下降。现有方法通过细粒度的学分分配以缓解长期稀疏奖励和层级强化学习，逐步拆解任务并减少长期依赖，取得了进展。然而，这些方法仍无法直接解决长上下文干扰问题，即持续增长的历史削弱了智能体追踪全局任务状态的能力，并影响后续的推理和决策。受人类通过子目标分解和完成进度总结处理复杂任务方式的启发，我们提出了用于长期LLM代理学习的分层规划与信息折叠（HIPIF）。HIPIF 从端到端训练代理围绕显式子目标组织长视野执行，同时折叠已完成的子目标历史以减少长上下文干扰。此外，为了稳定基于子目标的规划与执行，HIPIF结合了层级反思和以子目标为导向的过程奖励，指导子目标的生成、过渡和执行，而无需依赖昂贵的辅助模型或任务特定的专家路径。对三个公开的代理基准测试进行了大量实验，证明了我们方法的有效性。

Representation-Aware Advantage Estimation: Your Reward Model Provides More Than A Scalar Output

表征感知优势估计：你的奖励模型提供的不仅仅是标量输出

Authors: Guozheng Li, Xiyan Fu, Yiwen Guo
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.10528
Pdf link: https://arxiv.org/pdf/2606.10528
Abstract Current reinforcement learning from human feedback (RLHF) methods primarily rely on scalar rewards from a trained reward model (RM). While effective, scalar rewards are often noisy and fail to capture fine-grained preference differences, whereas RM hidden states encode richer semantic and preference information. We introduce representation-aware advantage estimation, which leverages RM hidden states and models them as auxiliary signals for better advantage estimation. Specifically, we propose Graph-based Advantage Estimation (GraphAE), which treats each sampled group as a graph in which nodes correspond to responses and edges capture their similarity in the RM hidden space. Advantages are then computed via graph propagation, enabling each sample to incorporate contextual information from its neighbors. GraphAE is lightweight and can be seamlessly integrated into existing group-based RL algorithms. We apply GraphAE to GRPO, GSPO and RLOO, and conduct extensive experiments on different models and benchmarks. Empirical results show consistent improvements across three benchmarks, with gains of up to +6.3 on Arena-Hard-v0.1, +8.27 on AlpacaEval 2.0, and +0.22 on MT-Bench. These results demonstrate that leveraging RM representations leads to more sample-efficient and robust RLHF.
中文摘要 当前的人类反馈强化学习（RLHF）方法主要依赖训练有素的奖励模型（RM）中的标量奖励。虽然标量奖励有效，但通常噪声较大，无法捕捉细粒度偏好差异，而 RM 隐藏状态则编码更丰富的语义和偏好信息。我们介绍了表征感知优势估计，利用RM隐藏状态并将其建模为辅助信号，以实现更好的优势估计。具体来说，我们提出了基于图的优势估计（GraphAE），将每个采样群视为图，节点对应响应，边在RM隐藏空间中捕捉其相似性。然后通过图传播计算优势，使每个样本能够整合邻近样本的上下文信息。GraphAE 轻量化，可以无缝集成到现有的基于群体的强化学习算法中。我们将GraphAE应用于GRPO、GSPO和RLOO，并对不同模型和基准进行大量实验。实证结果显示，三个基准测试均有持续提升，Arena-Hard-v0.1 提升高达 + 6.3，AlpacaEval 2.0 提升 + 8.27，MT-Bench 提升 + 0.22。这些结果表明，利用RM表示能带来更高效的RLHF。
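The graph-propagation idea in the GraphAE abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify the edge weighting or propagation rule, so the cosine similarity, the softmax-style temperature, the mixing coefficient `alpha`, and the function names are all assumptions made here for illustration.

```python
import math

def group_advantages(rewards):
    """Group-normalized scalar advantages (a GRPO-style baseline)."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var) or 1.0  # avoid division by zero for constant rewards
    return [(r - mean) / std for r in rewards]

def cosine(u, v):
    """Cosine similarity between two hidden-state vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def propagate(advantages, hidden, alpha=0.5, temperature=0.1):
    """One propagation step (hypothetical rule): mix each advantage with a
    similarity-weighted average of the other responses' advantages."""
    n = len(advantages)
    out = []
    for i in range(n):
        weights = [
            math.exp(cosine(hidden[i], hidden[j]) / temperature) if j != i else 0.0
            for j in range(n)
        ]
        total = sum(weights)
        neighbor = (
            sum(w * a for w, a in zip(weights, advantages)) / total
            if total else advantages[i]
        )
        out.append((1 - alpha) * advantages[i] + alpha * neighbor)
    return out

# Toy group of four responses: scalar rewards plus 2-d stand-ins for RM hidden states.
rewards = [1.0, 0.0, 1.0, 0.0]
hidden = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
adv = group_advantages(rewards)
smoothed = propagate(adv, hidden)
```

In this toy group, response 0 sits next to the low-reward response 1 in hidden space, so its advantage is pulled down relative to the purely scalar estimate.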

Dmsh: A Multi-Agent Reinforcement Learning Framework for All-Quad Mesh Generation

Dmsh：用于全四元网格生成的多智能体强化学习框架

Authors: Anirudh Kalyan, Cosmin Anitescu, Xiaoying Zhuang, Timon Rabczuk, Somdatta Goswami, Sundararajan Natarajan
Subjects: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.10601
Pdf link: https://arxiv.org/pdf/2606.10601
Abstract Generating high-quality meshes for arbitrary geometries remains a fundamental bottleneck in computational engineering, often demanding heuristic tuning and semi-manual workflows. In this paper, we introduce Dmsh, the first fully automated reinforcement learning pipeline that unifies geometric decomposition and quadrilateral mesh generation within a single learning-based framework. Dmsh decomposes the problem through three coordinated agents handling topology simplification, geometric regularization, and mesh generation. The meshing process is formulated as a Markov Decision Process and solved using a parametric Soft Actor-Critic architecture with decoupled critics, enabling efficient exploration of a hybrid discrete-continuous action space. A curriculum learning strategy ensures scalability from simple domains to highly complex geometries, suppressing seed variance. By design, the recursive decomposition enables parallel meshing of subregions, yielding globally conforming all-quadrilateral meshes without post hoc correction. Across a wide range of benchmarks, Dmsh consistently outperforms existing methods in automation, robustness, and mesh quality, establishing a new paradigm for learning-based mesh generation.
中文摘要 为任意几何体生成高质量网格仍然是计算工程中的根本瓶颈，常常需要启发式调优和半手动工作流程。本文介绍了Dmsh，这是首个全自动化强化学习流水线，将几何分解和四边形网格生成统一在单一基于学习的框架内。Dmsh通过三个协调的代理来分解该问题，分别处理拓扑简化、几何正则化和网格生成。网格过程被表述为马尔可夫决策过程，并通过参数化软演员-批判者架构（带有解耦批评者）求解，从而高效探索混合离散-连续动作空间。课程学习策略确保从简单领域扩展到高度复杂几何，抑制种子的变异性。按设计，递归分解允许子区域的并行网格，生成全局符合全四边形网格且无需事后修正。在众多基准测试中，Dmsh在自动化、稳健性和网格质量方面持续优于现有方法，为基于学习的网格生成树立了新范式。

Geometry-Aware Reinforcement Learning for 2D Irregular Nesting

二维不规则嵌套的几何感知强化学习

Authors: Auguste Lehuger, Guillaume Henon-Just
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.10611
Pdf link: https://arxiv.org/pdf/2606.10611
Abstract Traditional heuristic solvers for the 2D irregular nesting problem share a fundamental limitation: they are blind to polygon geometry, relying on guided brute-force to navigate the continuous placement space with minimal geometrical guidance. In this paper, we argue that Reinforcement Learning is uniquely positioned to overcome this bottleneck. By pairing an optimization policy with a geometry-aware neural encoder, an agent can automatically discover rich geometric priors directly from data, utilizing these learned intuitions to strategically guide exploration. To realize this, we introduce the Polygons Transformer (PoT), a novel architecture that encodes 2D continuous vector geometries while allowing cross-polygons attention. We couple this novel architecture with a Combinatorial Optimization Reinforcement Learning (CORL) training framework to find optimal solutions. To support this paradigm, we release an open-source training dataset derived from complex geographic contours alongside a dedicated evaluation benchmark. Our empirical validation demonstrates that our trained agent achieves area utilization performance highly competitive with Sparrow, the state-of-the-art heuristic solver, proving that reinforcement learning can successfully discover and exploit geometric awareness for precise spatial tasks.
中文摘要 传统的启发式解法解决二维不规则嵌套问题有一个根本性局限：它们对多边形几何结构视而不见，依赖引导暴力破解以最小的几何指导来导航连续放置空间。本文论证强化学习具有独特的优势来克服这一瓶颈。通过将优化策略与几何感知神经编码器配对，智能体可以直接从数据中自动发现丰富的几何先验，利用这些学到的直觉进行战略性引导探索。为实现这一点，我们引入了多边形变换器（Polygons Transformer，简称PoT），这是一种新颖的架构，能够编码二维连续向量几何，同时允许交叉多边形的关注。我们将这种新颖架构与组合优化强化学习（CORL）训练框架结合，以寻找最优解。为支持这一范式，我们发布了一个源自复杂地理轮廓的开源训练数据集，并配备了专门的评估基准。我们的实证验证表明，我们训练有素的智能体在区域利用方面的性能与最先进的启发式求解器Sparrow竞争，证明强化学习能够成功发现并利用几何感知来执行精确的空间任务。

Fast and Highly Expressive Policy Learning for Offline Reinforcement Learning via Bootstrapped Flow Q-Learning

通过引导流Q-Learning实现的快速且高度表达性的离线强化学习策略学习

Authors: Thanh Nguyen, Tri Ton, Hongbin Choe, Tung M. Luu, Chang D. Yoo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.10613
Pdf link: https://arxiv.org/pdf/2606.10613
Abstract Diffusion-based Q-learning has emerged as a powerful paradigm for offline reinforcement learning, but its reliance on multi-step denoising makes both training and inference computationally expensive and brittle. Recent efforts to accelerate diffusion Q-learning toward single-step action generation typically introduce auxiliary networks, policy distillation, or multi-phase training, which frequently compromise simplicity, stability, or performance. To address these limitations, we introduce Bootstrapped Flow Q-Learning (BFQ), a novel framework that enables accurate single-step action generation during both training and inference, without auxiliary networks or distillation procedures. BFQ adopts a divide-and-conquer view of the displacement vector along the flow path: it begins by learning short-range displacements that can be accurately estimated from the Flow Matching marginal velocity, and bootstraps these components to directly learn a noise-to-action mapping in a single step. This formulation eliminates multi-step denoising, resulting in a learning procedure that is substantially faster, simpler, and more robust. Extensive D4RL evaluations show that BFQ improves performance while significantly reducing computational cost compared to multi-step diffusion baselines, demonstrating that single-step action generation suffices for high-performance offline Reinforcement Learning.
中文摘要 基于扩散的Q学习已成为离线强化学习的强大范式，但其对多步去噪的依赖使得训练和推理在计算上成本高且脆弱。近年来，加速扩散Q学习实现单步动作生成的努力通常引入辅助网络、策略提炼或多阶段训练，这些往往会牺牲简易性、稳定性或性能。为解决这些局限，我们引入了引导流Q-学习（BFQ），这是一种新颖框架，能够在训练和推断过程中实现准确的单步动作生成，无需辅助网络或蒸馏过程。BFQ采用分而治之的视角，对流路上的位移矢量进行分析：它首先学习可从流动匹配边际速度准确估计的短程位移，然后通过引导这些组成部分，直接在一步内学习噪声到作用的映射。这种表述消除了多步去噪，使学习过程更快、更简单、更稳健。大量D4RL评估表明，BFQ在提升性能的同时显著降低计算成本，相较于多步扩散基线，证明单步动作生成足以实现高性能离线强化学习。

How Does Reasoning Flow? Tracing Attention-Induced Information Flow for Targeted RL in LLMs

推理是如何流动的？追踪大型语言模型中针对目标强化学习的注意力诱导信息流

Authors: Zhichen Dong, Yang Li, Yuhan Sun, Weixun Wang, Yijia Luo, Zinian Peng, Taiheng Ye, Chao Yang, Wenbo Su, Yu Cheng, Bo Zheng, Junchi Yan
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.10646
Pdf link: https://arxiv.org/pdf/2606.10646
Abstract Token-level credit assignment remains a key obstacle for reinforcement learning (RL) in large language models (LLMs), where RL recipes typically treat all tokens equally, failing to distinguish decisive reasoning steps from routine formatting or fluent filler. Recent attempts leverage model-internal signals to assign finer-grained credit, but these are often point-wise heuristics that ignore the global structure of information propagation. We propose FlowTracer, an RL framework that traces answer-targeted reasoning flow on an attention-induced directed acyclic graph, in which nodes correspond to tokens and edge capacities come from aggregated attention weights, and derives token credit from this global structure. The edge capacities are reweighted to retain only the influence that can reach the answer region, while enforcing local flow conservation so intermediate tokens neither lose nor gain effective mass due to path length or irrelevant branches. On this graph, FlowTracer extracts an information-flow backbone connecting the question to the answer and scores tokens by flow throughput, revealing high-impact hubs and aggregation checkpoints that mediate long-range dependencies. These derived importances are used to shape token-level rewards, enabling learning signals to focus precisely on the tokens that route information toward (or away from) correct answers and delivering consistent performance gains across a range of reasoning tasks.
中文摘要 代币级学分分配仍是大型语言模型（LLM）强化学习（RL）的主要障碍，因为强化学习的配方通常一视同仁地对待所有代币，未能区分决定性的推理步骤与常规格式化或流利填充。近期尝试利用模型内部信号赋予更细粒度的功劳，但这些通常是基于点的启发式方法，忽视了信息传播的全局结构。我们提出了FlowTracer，这是一个强化学习框架，在一个由注意力诱导的有向无环图上追踪答案定向推理流，其中节点对应代币，边容量来自聚合的注意力权重，并从该全局结构中推导出代币信用。边缘容量会重新加权，只保留能够到达答案区域的影响，同时强制执行局部流动守恒，使中间标记既不会因路径长度或无关分支而失去也不会增加有效质量。在该图上，FlowTracer提取连接问题与答案的信息流骨干，并按流量吞吐量对代币进行评分，揭示高影响力的枢纽和聚合检查点，这些节点调节了长期依赖关系。这些衍生重要性被用来塑造代币级奖励，使学习信号能够精准聚焦于将信息引导到正确答案（或偏离）的代币上，并在各种推理任务中实现持续的性能提升。
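A rough sketch of the flow computation described in the FlowTracer abstract, under stated assumptions: edges run from earlier to later tokens, capacities are given as a plain dictionary, reweighting multiplies each capacity by the downstream token's ability to reach the answer, and conservation is enforced by normalizing each token's outgoing weights. The function name `token_throughput` and these specific rules are hypothetical and are not taken from the paper.

```python
def token_throughput(n, capacities, question, answer):
    """Score tokens by how much question-to-answer flow passes through them.

    n          -- number of tokens, indexed 0..n-1 in generation order
    capacities -- {(src, dst): weight} with src < dst (a DAG over token positions)
    question   -- token indices where unit mass is injected
    answer     -- token indices that absorb flow
    """
    answer = set(answer)

    # Backward pass: how strongly each token can reach the answer region.
    reach = [0.0] * n
    for v in range(n - 1, -1, -1):
        if v in answer:
            reach[v] = 1.0
        else:
            reach[v] = sum(w * reach[d] for (s, d), w in capacities.items() if s == v)

    # Forward pass: push mass along reweighted, normalized edges so every
    # intermediate token forwards exactly what it receives.
    flow = [0.0] * n
    for q in question:
        flow[q] += 1.0 / len(question)
    for v in range(n):
        if v in answer or flow[v] == 0.0:
            continue
        out = {
            d: w * reach[d]
            for (s, d), w in capacities.items()
            if s == v and w * reach[d] > 0
        }
        total = sum(out.values())
        for d, w in out.items():
            flow[d] += flow[v] * w / total
    return flow

# Token 0 is the question, token 3 the answer; token 2 is a dead-end branch.
caps = {(0, 1): 0.5, (0, 2): 0.5, (1, 3): 1.0}
flow = token_throughput(4, caps, question=[0], answer=[3])
```

Here the dead-end token 2 receives no flow even though it gets half of token 0's raw attention mass, which is the kind of pruning of irrelevant branches the abstract describes.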

Vector Map as Language: Toward Unified Remote Sensing Vector Mapping

矢量图作为语言：迈向统一遥感矢量映射

Authors: Yinglong Yan, Yunkai Yang, Haoyi Wang, Wei Fu, Linshan Wu, Honghu Pan, Shaobo Xia, Shanghang Zhang, Hao Chen, Leyuan Fang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.10701
Pdf link: https://arxiv.org/pdf/2606.10701
Abstract Remote sensing vector mapping aims to generate structured maps of geospatial entities, such as buildings, roads, and water bodies, from remote sensing imagery. In practice, vector maps usually contain multiple category layers and heterogeneous entity structures, requiring a unified model for diverse mapping needs. However, existing methods typically represent vector objects as polygons or graphs, making them suitable only for specific categories: polygons poorly capture topological relations, while graphs often blur instance boundaries. We observe that language, as a natural medium for human communication, offers a flexible and expressive representation that can accommodate heterogeneous map elements, including geometry, semantics, and topology. Motivated by this insight, we propose Vector Map as Language (VecLang), a unified paradigm that reformulates multiclass vector mapping as structured text generation. VecLang encodes the common elements of different geospatial entities into a GeoJSON-like vector language, enabling cross-category modeling within a shared textual format. To generate this language reliably, we design a progressive vision-language mapping framework that first localizes vectorization units and then generates structured map elements. We further introduce Hierarchical Vector Language Optimization, which uses reinforcement learning to improve syntax validity, content fidelity, and map executability. We also build VecMap-Bench with 54K images and 800K instances, supporting training and evaluation across standard and generalization settings. Extensive experiments demonstrate that VecLang handles both single-class and multiclass vector mapping while achieving strong cross-dataset and open-vocabulary generalization. The model and dataset are publicly available at this https URL.
中文摘要 遥感矢量映射旨在从遥感图像生成地理空间实体（如建筑物、道路和水体）的结构化地图。实际上，向量映射通常包含多个类别层和异构实体结构，因此需要一个统一的模型以满足多样化的映射需求。然而，现有方法通常将矢量对象表示为多边形或图，因此仅适用于特定类别：多边形难以很好地捕捉拓扑关系，而图则常常模糊实例边界。我们观察到，作为人类交流的自然媒介，语言提供了一种灵活且富有表现力的表示方式，能够容纳包括几何、语义和拓扑在内的异质映射元素。基于这一见解，我们提出了向量映射即语言（VecLang）这一统一范式，将多类向量映射重新表述为结构化文本生成。VecLang 将不同地理空间实体的共同元素编码成类似 GeoJSON 的向量语言，实现在共享文本格式内的跨类别建模。为了可靠生成该语言，我们设计了一个渐进视觉语言映射框架，先定位向量化单元，然后生成结构化映射元素。我们进一步介绍了层级向量语言优化，利用强化学习提升语法有效性、内容忠实度和映射可执行性。我们还构建了包含54K图像和80万实例的VecMap-Bench，支持标准和泛化设置下的培训与评估。大量实验表明，VecLang既能处理单类向量映射，也能实现强大的跨数据集和开放词汇泛化。模型和数据集在此 https URL 公开发布。

Event-Driven Reinforcement Learning Enables Long-Horizon Control in Semiconductor Fabrication

事件驱动强化学习实现半导体制造中的长视野控制

Authors: Yavar Yeganeh, Mahsa Shekari, Nicla Frigerio, Daniele Pagano, Andrea Matta
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2606.10705
Pdf link: https://arxiv.org/pdf/2606.10705
Abstract Reinforcement learning promises to optimize sequential decisions in large-scale systems. Semiconductor manufacturing systems are stochastic and highly constrained environments where heterogeneous wafers traverse hundreds of processing steps across extensive equipment networks. These characteristics yield complex, high-dimensional decision problems with delayed feedback and long-horizon requirements, complicating production planning and control. We propose a deep reinforcement learning framework for multi-objective policy optimization at this scale. Specifically, we formulate control as a centralized-agent problem, where a core policy coordinates system-wide decisions, while system evolution is represented as an interconnected temporal process driven by discrete events. Accordingly, we develop a tailored event-driven temporal-difference formulation that remains general and can be integrated with various policy optimization methods under relevant training settings. We investigate several core model-free algorithms incorporated into this framework and evaluate their effectiveness using high-fidelity simulations of diverse, industry-real operating scenarios. Across extensive validation experiments, agents trained in both offline and online settings show significant and consistent gains in throughput and utilization. We further evaluate performance and generalization across training phases, clarifying the relative strengths of alternative reinforcement learning formulations and algorithms. Overall, the results support the scalability, generality, and transferability of the proposed framework for controlling event-driven complex adaptive systems.
中文摘要 强化学习有望优化大规模系统中的顺序决策。半导体制造系统属于随机且高度受限的环境，异构晶圆在广泛的设备网络中经过数百个工艺步骤。这些特性导致复杂且高维度的决策问题，反馈延迟且视野较长，增加了生产计划和控制的复杂性。我们提出了一个深度强化学习框架，用于该规模的多目标策略优化。具体来说，我们将控制表述为一个集中化的代理问题，核心策略协调系统范围的决策，而系统演化则被表示为由离散事件驱动的相互关联的时间过程。因此，我们开发了一种定制化的事件驱动时间差分表述，保持通用性，并可在相关训练条件下与多种策略优化方法集成。我们研究了该框架中包含的若干核心无模型算法，并通过高保真模拟多样的行业真实操作场景评估其有效性。在广泛的验证实验中，线下和在线环境中训练的代理在吞吐量和利用率上均有显著且持续的提升。我们进一步评估了各训练阶段的表现和泛化，明确了替代强化学习表述和算法的相对优势。总体而言，结果支持该框架在控制事件驱动复杂自适应系统的可扩展性、通用性和可转移性。
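For readers unfamiliar with temporal-difference learning over irregularly spaced events, the sketch below shows the textbook semi-Markov TD(0) update, in which the discount depends on the time elapsed between decision events. It is offered only as background: the paper's own event-driven formulation is not given in the abstract, and the state names, the tabular value store, and the parameter values here are illustrative assumptions.

```python
from collections import defaultdict

def event_td_update(values, state, reward, elapsed, next_state, gamma=0.99, lr=0.1):
    """Semi-Markov TD(0): discount the next value by gamma ** elapsed,
    where `elapsed` is the time between two consecutive decision events
    and `reward` is the reward accumulated over that interval."""
    discount = gamma ** elapsed
    target = reward + discount * values[next_state]
    values[state] += lr * (target - values[state])
    return values[state]

# Tabular values keyed by hypothetical tool states.
values = defaultdict(float)
values["busy"] = 10.0
new_value = event_td_update(
    values, "idle", reward=1.0, elapsed=2.0, next_state="busy", gamma=0.9, lr=0.5
)
```

Because the discount is tied to elapsed time rather than to a fixed step count, a long gap between events shrinks the bootstrapped term more than a short one does.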

MODIP: Efficient Model-Based Optimization for Diffusion Policies

MODIP：基于模型的高效扩散策略优化

Authors: Zakariae El Asri, Philippe Gratias-Quiquandon, Nicolas Thome, Olivier Sigaud
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.10825
Pdf link: https://arxiv.org/pdf/2606.10825
Abstract Diffusion policies (DPs) have emerged as expressive policy representations for robot learning, often used with imitation learning methods such as behavioral cloning (BC). However, while their success has largely been confined to BC, direct reinforcement learning (RL) fine-tuning remains challenging because actions are generated through a multi-step denoising process. In this work, we propose MODIP, a framework for the offline-to-online fine-tuning of DPs. Rather than directly applying RL to the DPs, MODIP leverages a world model (WM) to guide policy adaptation and keeps the simplicity and stability of BC. We utilize model predictive control (MPC) to generate high-quality trajectories within the WM, and use them as supervised targets for fine-tuning the DP. To make MPC planning efficient, MODIP uses a terminal state value instead of a policy-dependent state-action value, reducing inference time. Additionally, MODIP trains critics with policy-independent TD targets, reducing training time. Experiments on D4RL (MuJoCo, Kitchen) and RoboMimic tasks show that MODIP improves diffusion policies beyond BC, and is competitive with or outperforms diffusion policy RL fine-tuning methods and strong model-based baselines such as TD-MPC2.
中文摘要 扩散策略（DPs）已成为机器人学习的表达性策略表示，常与模仿学习方法如行为克隆（BC）一起使用。然而，虽然它们的成功主要局限于BC，但直接强化学习（RL）的微调仍然具有挑战性，因为动作是通过多步去噪过程生成的。在本研究中，我们提出了MODIP，一个用于离线到在线的DP微调框架。MODIP不直接将强化学习应用于DP，而是利用世界模型（WM）指导政策调整，同时保持BC的简洁性和稳定性。我们利用模型预测控制（MPC）在WM内生成高质量轨迹，并将其作为监督目标，用于微调DP。为了使MPC规划高效，MODIP使用终端状态值而非依赖策略的状态-动作值，从而缩短推理时间。此外，MODIP通过政策无关的TD目标培训批评者，缩短培训时间。D4RL（MuJoCo、Kitchen）和机器人模拟任务的实验表明，MODIP能提升扩散策略超越BC，且在与扩散策略RL微调方法及强模型基线（如TD-MPC2）竞争或优于。

GUIDE: Goal-Initialized Directional Understanding for End-to-End Visual Navigation

GUIDE：端到端视觉导航的目标初始化方向理解

Authors: Liang Wang, Jin Jin, KanZhong Yao, YiBin Wu, Fangqiang Ding, Jin Wang, Jun Wu, Zhe Sun, Qiuguo Zhu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.10832
Pdf link: https://arxiv.org/pdf/2606.10832
Abstract Learning-based visual navigation for legged robots typically relies on continuous goal updates from hierarchical state estimation to provide a persistent directional reference. This reliance incurs additional sensory and computational overhead and deviates from fully end-to-end mobile autonomy. Furthermore, under partial observability, policies are prone to learn myopic behaviors, easily becoming trapped in dead ends and complex structural layouts. To address these limitations, we investigate a goal-initialized navigation setting, where the target is provided only once at the beginning of an episode, requiring the robot to operate based on intrinsic spatial memory without subsequent goal updates from external modules. In this work, we propose GUIDE, a fully end-to-end reinforcement learning framework designed to cultivate internal directional awareness. Specifically, GUIDE incorporates a spatial anchor predictor that leverages multi-frequency proprioceptive history to extract egomotion representations, thereby maintaining a persistent long-horizon spatial context for navigation. Concurrently, it utilizes raw depth streams to perceive local environmental geometry. We evaluate the proposed framework across both simulation and real-world scenarios on a quadruped robot. Experiments show that GUIDE learns reliable egomotion and directional awareness, enabling a fully end-to-end deployed policy to safely navigate through dense clutter and structured mazes without subsequent goal guidance or prior maps.
中文摘要 基于学习的腿型机器人视觉导航通常依赖于层级状态估计的持续目标更新，以提供持久的方向参考。这种依赖会带来额外的感官和计算开销，并且偏离了完全端到端的移动自主。此外，在部分可观测性下，策略容易学习短视行为，容易陷入死胡同和复杂的结构布局。为解决这些限制，我们研究了一种目标初始化导航设置，即每集开始时只给目标一次，要求机器人基于内在空间记忆操作，无需外部模块后续更新目标。在本研究中，我们提出了GUIDE，一种完全端到端的强化学习框架，旨在培养内在方向意识。具体来说，GUIDE集成了一种空间锚点预测器，利用多频率本体感觉历史提取自我运动表征，从而保持持续的长视野空间上下文以实现导航。同时，它利用原始的深度流来感知局部环境几何。我们在四足机器人的模拟和现实场景中评估了该框架。实验表明，GUIDE能够学习可靠的自我运动和方向意识，使得一个完全端到端的部署策略能够安全地穿越密集的杂乱和结构化迷宫，无需后续目标指引或预先地图。

Pushing the Limits of LLM Tool Calling via Experiential Knowledge Integration and Activation

通过体验式知识集成与激活，突破LLM工具调用的极限

Authors: Yupu Hao, Zhuoran Jin, Huanxuan Liao, Kang Liu, Jun Zhao
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.10875
Pdf link: https://arxiv.org/pdf/2606.10875
Abstract Large language models (LLMs) rely on tool use to act as autonomous agents, yet often fail in multi-step execution due to insufficient tool-related knowledge and ineffective knowledge activation. Therefore, we present a systematic study on how knowledge influences tool-use performance, covering the stages of knowledge acquisition, activation, and internalization. In the knowledge acquisition stage, we acquire and evaluate various forms of experiential knowledge, and our analysis shows that simple instance-level knowledge can already provide strong and reliable gains, while abstract intent-level knowledge offers limited benefits. At inference time, to activate knowledge, we find that prompting the LLM to expand the depth of reasoning yields diminishing returns, whereas expanding the width of reasoning by parallel sampling with aggregation more effectively activates latent experiential knowledge. At training time, for knowledge internalization, post-training with knowledge-augmented data further improves performance, with reinforcement learning outperforming supervised fine-tuning. Based on these insights, we propose Knowledge-Augmented Tool Execution (KATE), a framework that integrates experiential knowledge with reasoning-width-expanded inference and knowledge-aware training. Experiments on BFCL-V3 and AppWorld demonstrate consistent and substantial improvements over strong baselines across model scales. Our code is available at this https URL.
中文摘要 大型语言模型（LLMs）依赖工具作为自主代理，但由于工具相关知识不足和知识激活效果不佳，常常在多步骤执行中失败。因此，我们提出了一项系统性研究，探讨知识如何影响工具使用表现，涵盖知识获取、激活和内化的各个阶段。在知识获取阶段，我们获取并评估各种形式的体验式知识，分析显示，简单的实例级知识已经能带来强大且可靠的收益，而抽象的意图级知识则带来有限的益处。在推理阶段，为了激活知识，我们发现促使大型语言模型扩展推理深度会带来收益递减，而通过并行抽样与聚合扩展推理宽度则更有效地激活潜在的体验知识。在训练阶段，对于知识内化，使用知识增强数据进行训练后进一步提升表现，强化学习优于监督微调。基于这些见解，我们提出了知识增强工具执行（KATE），这是一个知识增强工具执行框架，将体验式知识与推理宽度扩展的推理和知识感知训练相结合。BFCL-V3和AppWorld上的实验显示，在不同模型尺度上，相较于强基线，持续且显著地改善。我们的代码可在此 https 网址获取。

AllDayNav: Lifelong Navigation via Real-World Reinforcement Learning

AllDayNav：通过现实世界强化学习实现终身导航

Authors: Hang Yin, Yinan Liang, Jiazhao Zhang, Jiahang Liu, Minghan Li, Zhizheng Zhang, He Wang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.10927
Pdf link: https://arxiv.org/pdf/2606.10927
Abstract Lifelong embodied navigation in dynamic environments requires robots to form persistent scene understanding from fragmentary observations, which remains difficult for existing methods that rely on explicit maps or scene graphs and struggle to generalize beyond structured settings. We propose AllDayNav, a lifelong self-learning navigation framework that implicitly encodes scene dynamics into the billion-scale parameters of a large model via reinforcement learning, powered by a self-evolving multimodal memory that maintains and updates visual keyframes, semantic descriptions, and temporal context while autonomously generating open-vocabulary instructions, image goals, and structured rewards. Experiments in both synthetic and real-world environments across cross-room, cross-episode, and cross-task scenarios show that AllDayNav achieves success rates approaching 100% and consistently surpasses strong map-based, VLM, and RL baselines in path efficiency and robustness, demonstrating implicit, memory-driven reinforcement learning as a scalable alternative to explicit mapping for reliable lifelong navigation.
中文摘要 动态环境中的终身具象导航要求机器人从零散的观测中形成持续的场景理解，而这对于依赖显式地图或场景图且难以超越结构化环境的现有方法来说仍然困难。我们提出了AllDayNav，一种终身自学导航框架，通过强化学习隐式将场景动态编码到大型模型的十亿尺度参数中，依靠自我演化的多模态记忆，维护和更新视觉关键帧、语义描述及时间上下文，同时自主生成开放词汇指令、图像目标和结构化奖励。在合成和现实环境中跨房间、跨集和跨任务场景的实验显示，AllDayNav 的成功率接近 100%，且在路径效率和鲁棒性方面持续超越强地图、VLM 和 RL 基线，展示了隐式、内存驱动强化学习作为显式地图的可扩展替代方案，实现可靠的终身导航。

Beyond Uniform Token-Level Trust Region in LLM Reinforcement Learning

超越统一令牌级信任区域，LLM强化学习

Authors: Renjie Mao, Xiangxin Zhou, Lvfang Tao, Yixin Ding, Yu Shi, Yongguang Lin, Yuheng Wu, Honglin Zhu, Qian Qiu, Wenxi Zhu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.10968
Pdf link: https://arxiv.org/pdf/2606.10968
Abstract Reinforcement learning with verifiable rewards (RLVR) has become standard for improving LLM reasoning. However, existing PPO-style trust-region mechanisms remain position-agnostic by enforcing uniform thresholds across all tokens independently. This pointwise treatment conflicts with autoregressive generation in two critical ways. First, uniform thresholds ignore autoregressive asymmetry. Early-stage deviations produce compounding sequence-level drift, causing static thresholds to under-regulate early divergence and excessively constrain late-stage exploration. Second, evaluating token-level divergence in isolation overlooks cumulative prefix drift, granting the same divergence allowance regardless of how far the conditioning history has already deviated from the rollout policy. To address this limitation, we propose CPPO (Cumulative Prefix-divergence Policy Optimization), a token-level masking rule that aligns updates with a finite-horizon policy-improvement bound via two coupled mechanisms. First, a position-weighted threshold imposes stricter limits at early positions whose effects persist longer, relaxing constraints for late-stage tokens. Second, a cumulative prefix budget tracks historical deviations, dynamically restricting further token-level deviation to prevent compounding errors along the prefix. Empirically, CPPO enhances training stability and significantly improves reasoning accuracy across various model scales.
中文摘要 带有可验证奖励的强化学习（RLVR）已成为提升大型语言模型推理能力的标准方法。然而，现有的PPO式信任区域机制对所有词元（token）独立施加统一阈值，因而与位置无关。这种逐点处理在两个关键方面与自回归生成相冲突。首先，统一阈值忽略了自回归的不对称性：早期偏差会产生不断累积的序列级漂移，使静态阈值对早期偏离约束不足，却过度限制了后期探索。其次，孤立地评估词元级散度忽略了累积的前缀漂移，无论条件历史已经偏离采样（rollout）策略多远，都给予相同的散度容许量。为解决这一局限，我们提出了CPPO（累积前缀散度策略优化），这是一条词元级掩码规则，通过两个耦合机制使更新与有限时域的策略改进界保持一致。首先，位置加权阈值对影响持续更久的早期位置施加更严格的限制，同时放宽对后期词元的约束。其次，累积前缀预算跟踪历史偏差，动态限制进一步的词元级偏离，以防止误差沿前缀不断叠加。实验表明，CPPO增强了训练稳定性，并在不同模型规模下显著提升了推理准确率。
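
As a rough illustration of the two mechanisms named in the CPPO abstract, the sketch below combines a position-weighted threshold with a cumulative prefix budget. The abstract gives no formulas, so every detail here is an assumption: the linear schedule from `eps_early` to `eps_late`, the use of a generic non-negative per-token divergence, and the way the remaining budget caps the allowance. The function name `cppo_style_mask` is hypothetical and not taken from the paper.

```python
def cppo_style_mask(divergences, eps_early, eps_late, prefix_budget):
    """Return a keep/mask flag per token (True = token contributes to the update).

    divergences   : non-negative per-token divergence between the current and
                    rollout policy (assumed to be given)
    eps_early     : threshold at the first position (assumed stricter)
    eps_late      : threshold at the last position (assumed looser)
    prefix_budget : total divergence the prefix may accumulate (assumption)
    """
    num_tokens = len(divergences)
    keep = []
    spent = 0.0  # divergence already accumulated by earlier tokens
    for t, d in enumerate(divergences):
        # Assumed linear schedule: strict early, relaxed late.
        frac = t / (num_tokens - 1) if num_tokens > 1 else 0.0
        threshold = eps_early + (eps_late - eps_early) * frac
        # The remaining prefix budget further caps the allowance.
        allowance = min(threshold, max(prefix_budget - spent, 0.0))
        keep.append(d <= allowance)
        # The prefix has drifted by d whether or not the token is masked.
        spent += d
    return keep


flags = cppo_style_mask([0.05, 0.05, 0.3, 0.1], eps_early=0.1, eps_late=0.4,
                        prefix_budget=0.35)
```

In this toy call the third token exceeds what is left of the budget, and the fourth is masked because the prefix has already used the whole budget, even though its own divergence is below the position threshold.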

Bellman-Taylor Score Decoding for Markov Decision Processes with State-Dependent Feasible Action Sets

带有状态依赖可行行动集的马尔可夫决策过程的贝尔曼-泰勒评分解码

Authors: Yi Chen, Rushuai Yang, Qiang Chen, Dongyan (Lucy) Huo
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.10979
Pdf link: https://arxiv.org/pdf/2606.10979
Abstract Many Markov decision processes (MDPs) in operations research have feasible actions that are state dependent and defined implicitly by various operational constraints. These features make it difficult to use standard deep reinforcement learning (DRL) algorithms, whose action interfaces typically assume either a fixed finite action catalog or a simple Euclidean space. Motivated by a Taylor expansion of the optimal action-value function, we propose Bellman-Taylor score decoding, a framework that moves policy learning to a Euclidean score space while enforcing feasibility through an action decoder. The induced latent-score MDP can then be optimized by standard DRL algorithms without differentiating through the decoder. We provide a performance guarantee showing that the optimality gap of this approach decomposes into a structural approximation error and an algorithmic learning error. Lastly, we apply this framework to a queueing network control problem, where the policy essentially learns a state-dependent index-based dispatching rule. Numerical experiments show near-optimal performance in small instances and considerable improvements over benchmarks in larger systems.
中文摘要 运筹学中的许多马尔可夫决策过程（MDP）具有可行的动作，这些动作依赖于状态，并由各种操作约束隐含定义。这些特性使得使用标准的深度强化学习（DRL）算法变得困难，因为其动作接口通常假设固定的有限动作目录或简单的欧几里得空间。受最优动作值函数泰勒展开的启发，我们提出了贝尔曼-泰勒分数解码框架，该框架将策略学习移动到欧几里得评分空间，同时通过动作解码器强制执行可行性。诱导的潜在分数MDP随后可以通过标准DRL算法进行优化，而无需通过解码器进行区分。我们提供了性能保证，表明该方法的最优性差距分解为结构近似误差和算法学习误差。最后，我们将该框架应用于排队网络控制问题，策略本质上学习了基于状态的基于索引的调度规则。数值实验显示，在小实例中性能接近最佳，且在大型系统中相较基准测试有显著提升。

Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models

Flow-DPPO：流匹配模型的离度近端策略优化

Authors: Bowen Ping, Xiangxin Zhou, Penghui Qi, Minnan Luo, Liefeng Bo, Tianyu Pang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.11025
Pdf link: https://arxiv.org/pdf/2606.11025
Abstract Recent work has demonstrated that online reinforcement learning (RL) can substantially improve the quality and alignment of flow matching models for image and video generation. Methods such as Flow-GRPO and CPS cast the denoising process as a Markov Decision Process and apply PPO-style ratio clipping to enforce a trust region. However, we argue that ratio clipping is structurally ill-suited for flow models: the probability ratio between new and old policies is a noisy, single-sample estimate of the true policy divergence, leading to over-constraining in some regions of the trajectory and under-constraining in others. We propose Flow-DPPO (Flow Divergence Proximal Policy Optimization), which replaces ratio clipping with a divergence proximal constraint. A key observation is that the per-step policy in flow models is Gaussian, enabling exact and cheap computation of the KL divergence between old and new policies. Flow-DPPO employs an asymmetric divergence mask that blocks gradient updates only when they simultaneously move away from the trusted region and violate the divergence threshold. Experiments show that Flow-DPPO achieves higher rewards with better KL-proximal efficiency, alleviates catastrophic forgetting, promotes balanced multi-objective optimization, and enables stable multi-epoch training where ratio clipping degrades. Code and models are available at this https URL.
中文摘要 近期研究表明，在线强化学习（RL）能够显著提升图像和视频生成中流匹配模型的质量和对齐性。如Flow-GRPO和CPS等方法将去噪过程定义为马尔可夫决策过程，并应用PPO式比率剪裁以强制信任区域。然而，我们认为比率裁剪在结构上不适合流模型：新旧政策之间的概率比是一个噪声的单样本真实政策背离估计，导致轨迹的某些区域过度约束，而在其他区域约束不足。我们提出了Flow-DPPO（流发散近端策略优化），用发散近端约束替代了比率裁剪。一个关键观察是，流量模型中的每步策略是高斯的，这使得旧策略与新策略之间的KL散度能够精确且廉价地计算。Flow-DPPO采用非对称发散掩码，仅在梯度更新同时偏离受信任区域并违反发散阈值时阻断。实验表明，Flow-DPPO以更好的KL近端效率实现更高奖励，缓解灾难性遗忘，促进多目标均衡优化，并实现多跨时期稳定训练，避免比例裁剪下降。代码和模型可在该 https URL 访问。
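
The Flow-DPPO abstract rests on the observation that a per-step Gaussian policy admits a closed-form KL divergence. A minimal sketch of that computation and of one possible asymmetric mask follows. It assumes both policies share the same isotropic standard deviation `sigma`, in which case the KL reduces to the squared distance between the means divided by 2σ². The rule for "moving away from the trusted region" (the sign of the advantage matching the sign of the log-ratio) is an assumption made for illustration, and the function names are hypothetical.

```python
def gaussian_kl(mu_old, mu_new, sigma):
    """KL(N(mu_old, sigma^2 I) || N(mu_new, sigma^2 I)) for a shared isotropic std."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(mu_old, mu_new))
    return squared_distance / (2.0 * sigma ** 2)


def keep_gradient(kl, log_ratio, advantage, kl_threshold):
    """Asymmetric mask: drop the update only if it BOTH pushes the policy
    further from the old one AND the divergence is already past the threshold."""
    # Assumed notion of "moving away": the gradient step would push the
    # probability ratio further from 1 in the direction it has already moved.
    moving_away = (advantage > 0 and log_ratio > 0) or \
                  (advantage < 0 and log_ratio < 0)
    return not (moving_away and kl > kl_threshold)


kl = gaussian_kl([0.0, 0.0], [1.0, 1.0], sigma=1.0)
blocked = not keep_gradient(kl, log_ratio=0.5, advantage=1.0, kl_threshold=0.1)
```

Unlike a single-sample probability ratio, the KL value here depends only on the two means and the shared variance, which is what makes it exact and cheap in this setting.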

LLM-Mediated Demand Response Coordination in Smart Microgrids

智能微电网中的大型语言模型介导需求响应协调

Authors: J. de Curtò, I. de Zarzà
Subjects: Multiagent Systems (cs.MA); Computer Science and Game Theory (cs.GT); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2606.11050
Pdf link: https://arxiv.org/pdf/2606.11050
Abstract Effective demand response in smart microgrids requires prosumers to cooperate voluntarily under strategic self-interest, a coordination problem structurally equivalent to a repeated Prisoner's Dilemma on a social network. This paper presents a multi-agent simulation in which a Large Language Model (LLM) Influence Compiler issues structured demand-response directives to a population of heterogeneous prosumer agents, each governed by a hybrid decision architecture combining game-theoretic base probability (derived from payoff history, neighbour imitation, and exploitation memory) with LLM narrative evaluation of incoming coordination signals. The hybrid architecture resolves a key methodological challenge: LLMs aligned via Reinforcement Learning from Human Feedback (RLHF) exhibit strong cooperation bias when used as direct decision-makers, producing flat dynamics regardless of grid conditions. By separating strategic reasoning from grounded narrative evaluation, the model generates realistic prosumer behaviour across six personality archetypes, with baseline cooperation near 50% and clear differentiation under influence. Compiled structured directives achieve 33.3% demand-curtailment cooperation versus 27.0% for unstructured messaging and 28.0% for a no-intervention baseline ($\Delta_\mathrm{comp} = +0.063$), with the advantage preserved across both grounded and idealized agent substrates ($\Delta = +0.083$) and across all resistance levels ($R = 0.1$ to $0.7$). Hub-targeted dissemination via high-centrality network nodes outperforms peripheral or random targeting, confirming that grid topology provides mechanistic amplification independent of message content. These results suggest that structured LLM compilation, grounded agent reasoning, and network-aware targeting are complementary design principles for scalable, interpretable demand-response coordination in smart-city energy systems.
中文摘要 智能微电网中的有效需求响应需要专业消费者在战略自利基础下自愿合作，这一协调问题在结构上相当于社交网络上反复出现的囚徒困境。本文提出了一种多智能体模拟，其中大型语言模型（LLM）影响编译器向一群异构的准专业代理发布结构化需求响应指令，每个代理由混合决策架构管理，结合了博弈论基础概率（源自收益历史、邻居模仿和利用记忆）与对输入协调信号的叙事评估。这种混合架构解决了一个关键方法学难题：通过人类反馈强化学习（RLHF）对齐的LLMs在直接决策者时表现出强烈的合作偏差，无论网格条件如何，都会产生平坦的动态。通过将战略推理与扎实叙事评估分离，模型在六种人格原型中生成了真实的准消费者行为，基线合作率接近50%，且在影响下明显区分。汇编结构化指令实现了33.3%的需求-限制合作，而非结构化消息为27.0%，无干预基线为28.0%（$\Delta_\mathrm{comp} = +0.063$），且优势在接地和理想化代理基底（$\Delta = +0.083$）以及所有阻力水平（$R = 0.1$至$0.7$）中均保持。通过高中心性网络节点进行枢纽定向传播，其性能优于外围或随机定向，证实了网格拓扑提供了独立于消息内容的机制性放大。这些结果表明，结构化LLM编译、基准代理推理和网络感知定向是智能城市能源系统中可扩展、可解释的需求-响应协调设计原则的互补原则。

Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning

强化学习中流策略的测试时间梯度指导

Authors: Zhiyuan Zhou, Andy Peng, Charles Xu, Qiyang Li, Tobias Springenberg, Kevin Frans, Sergey Levine
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.11087
Pdf link: https://arxiv.org/pdf/2606.11087
Abstract Expressive continuous control policies, such as diffusion and flow models, form the backbone of recent advances in scaling imitation learning for simulated and real robot control. While they are known to scale stably in the supervised imitation learning setting, incorporating them into reinforcement learning (RL) pipelines for policy improvement has proven more difficult. It often requires specialized training objectives or backpropagating through denoising processes, which cause well-known issues with stability and affect scalability. In this paper we study the question of whether simple policy improvement schemes at test time alone, leaving stable supervised policy training intact, can be a competitive alternative which sidesteps these issues. To this end, we propose QGF (Q-Guided Flow), an RL algorithm that performs policy optimization entirely at test time. QGF works by pre-training both a reference flow policy (via a standard behavioral cloning objective) and a value function critic and, at test time, using the value gradient to guide the reference policy to generate higher-value actions without any additional policy learning. Empirically, QGF outperforms prior test-time RL methods on single-task and goal-conditioned offline RL benchmarks with high-dimensional action spaces, and is competitive with state-of-the-art training-time algorithms while being much cheaper to run. Moreover, it exhibits favorable scaling with model size by avoiding the instability of actor-critic training, offering a practical and effective alternative RL algorithm with expressive policies.
中文摘要 表现性连续控制策略，如扩散和流动模型，构成了模拟和真实机器人控制中模拟学习扩展的最新进展的基础。虽然它们在监督模仿学习环境中能够稳定扩展，但将其纳入强化学习（RL）政策改进流程中却更加困难。它通常需要专门的训练目标或通过去噪过程进行反向传播，这会导致已知的稳定性问题并影响可扩展性。本文研究了仅在测试阶段实施简单政策改进方案，保持稳定监督政策培训，是否能成为规避这些问题的竞争替代方案。为此，我们提出了QGF（Q引导流）算法，这是一种在测试时完全执行策略优化的强化学习算法。QGF的工作原理是通过预训练参考流策略（通过标准行为克隆目标）和价值函数批判者，并在测试时利用值梯度引导参考策略生成更高价值的动作，而无需额外策略学习。从经验角度看，QGF在单任务和目标条件离线强化学习基准测试中优于以往测试时的强化学习方法，且具有高维动作空间，且能与最先进的训练时间算法竞争，同时运行成本也大幅降低。此外，它通过避免actor-critic训练的不稳定性，表现出良好的模型规模扩展性，提供了一种具有表达策略的实用且有效的强化学习替代算法。
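
The test-time guidance idea in the QGF abstract can be pictured as adding a value-gradient term to the velocity field while integrating the flow. The sketch below is a one-dimensional toy under stated assumptions, not the authors' algorithm: the Euler integrator, the additive form of the guidance, the `guidance_scale` parameter, and the toy velocity field and critic are all hypothetical.

```python
def guided_flow_sample(a0, velocity, q_grad, guidance_scale, num_steps=10):
    """Euler-integrate a flow from noise a0, nudged by the critic's action gradient.

    velocity(a, t) : reference flow policy's velocity field (pre-trained, frozen)
    q_grad(a)      : gradient of the critic Q with respect to the action
    """
    a = a0
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        # No policy parameters are updated; only the sample is steered.
        a = a + dt * (velocity(a, t) + guidance_scale * q_grad(a))
    return a


# Toy setup (assumed): the reference flow transports samples to action 0.0,
# while the critic Q(a) = -(a - 1)^2 prefers actions near 1.0.
def toy_velocity(a, t):
    return (0.0 - a) / (1.0 - t)


def toy_q(a):
    return -(a - 1.0) ** 2


def toy_q_grad(a):
    return -2.0 * (a - 1.0)


unguided = guided_flow_sample(-1.0, toy_velocity, toy_q_grad, guidance_scale=0.0)
guided = guided_flow_sample(-1.0, toy_velocity, toy_q_grad, guidance_scale=0.5)
```

In this toy the guided sample ends slightly closer to the critic's preferred action than the unguided one, while the reference velocity field itself is left untouched.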

RoboNaldo: Accurate, Stable and Powerful Humanoid Soccer Shooting via Motion-Guided Curriculum Reinforcement Learning

RoboNaldo：通过动作引导课程强化学习实现的精准、稳定且强力的人形足球射门

Authors: Yichao Zhong, Yidan Lu, Yuhang Lu, Tianyang Tang, Haoguang Mai, Yixuan Pan, Tianyu Li, Li Chen, Jingbo Wang, Zhongyu Li, Peng Lu, Hongyang Li
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.11092
Pdf link: https://arxiv.org/pdf/2606.11092
Abstract Elite humanoid soccer shooting requires whole-body stability, high-impulse whole-body interactions, and accuracy to targets. Motion tracking-driven reinforcement learning (RL) provides stability in whole-body movement coordination, but a fixed reference makes it hard to adapt to varied ball positions and strike timings; in contrast, task reward-driven RL struggles to explore and discover valid kicks from scratch. We therefore introduce RoboNaldo, a three-stage motion-guided curriculum RL framework for high-impulse humanoid interaction. A single human-kick reference is used as a scaffold, and optimization progressively shifts towards shooting performance. The curriculum first learns a stable whole-body kicking prior, then adapts the kick to free-kick settings where the ball is stationary at random positions, and finally extends it to moving-ball shooting through a locomotion-command and kick-trigger interface. A high-level heuristic planner controls this interface during training, while alternative high-level controllers can drive the same low-level policy at inference. In simulation, RoboNaldo achieves a free-kick shot error 48.6% lower and a shot velocity 2.96x that of prior-work baselines. In the real world, on a Unitree G1 with onboard perception, RoboNaldo attains 0.73 m and 0.86 m average target shooting error from 3 m away in the free-kick and moving-ball cases, respectively. The post-contact ball velocity reaches 13.10 m/s, which is 59-71% of reported professional open-play shot speed. Project page: this https URL.
中文摘要 高水平的人形机器人足球射门需要全身稳定性、高冲量的全身交互以及对目标的精准命中。运动跟踪驱动的强化学习（RL）为全身运动协调提供了稳定性，但固定的参考动作难以适应不同的球位置和击球时机；相比之下，任务奖励驱动的强化学习则难以从零开始探索并发现有效的踢球动作。因此，我们提出RoboNaldo，一个面向高冲量人形机器人交互的三阶段动作引导课程强化学习框架。单一的人类踢球参考被用作支架，并逐步将优化重心转向射门性能。课程首先学习稳定的全身踢球先验，然后将踢球适配到球静止于随机位置的任意球场景，最后通过运动指令与踢球触发接口扩展到移动球射门。训练期间由高层启发式规划器控制该接口，而在推理时其他高层控制器也可驱动同一低层策略。在仿真中，RoboNaldo的任意球射门误差比已有工作基线低48.6%，射门速度达到其2.96倍。在真实世界中，在搭载机载感知的Unitree G1上，RoboNaldo在任意球和移动球情形下从3米外射门的平均目标误差分别为0.73米和0.86米。触球后的球速达到13.10米/秒，为已报道的职业比赛运动战射门速度的59-71%。项目页面：此 https URL。

TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning

TRACE：高效代理强化学习的统一推广预算分配框架

Authors: Heming Zou, Qi Wang, Yun Qu, Yuhang Jiang, Lizhou Cai, Yixiu Mao, Ru Peng, Xin Xu, Weijie Liu, Kai Yang, Saiyong Yang, Xiangyang Ji
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.11119
Pdf link: https://arxiv.org/pdf/2606.11119
Abstract Reinforcement learning with verifiable rewards (RLVR) is a promising approach for enhancing reasoning and agentic behavior in large language models. However, rollout-intensive policy optimization is often limited by insufficient reward contrast, arising when overly simple or complex prompts generate low-variance feedback and when outcome-only rewards assign the same terminal assessment to every decision in a multi-turn rollout. Past efforts have focused on allocating available rollout resources to promising prompts, yet they only leverage sample informativeness at the prompt level and neglect variation in prefix-level informativeness across turns within the same rollout. This work targets multi-turn agentic RL by modeling each ReAct-style thought-action-observation turn as a semantically distinct node, allowing budget allocation to extend from prompt roots to turn-level prefixes with further continuations, which naturally forms tree-structured rollouts. We introduce Tree Rollout Allocation for Contrastive Exploration (TRACE), a unified rollout allocation framework that enhances reward contrast within a fixed sampling budget. Technically, TRACE allocates rollout budget to both prompt roots and intermediate prefixes that are most likely to yield mixed terminal rewards. A shared generalizable predictor estimates conditional success probability at these anchors from prefix histories to guide this allocation. The resulting adaptive tree structure enriches outcome-only feedback and amplifies the policy-update signal. Empirically, TRACE achieves competitive performance and efficiency gains on typical agentic benchmarks, e.g., improving Qwen3-14B Multi-Hop QA average accuracy by 2.8 points over competitive baselines at equal sampling cost.
中文摘要 带有可验证奖励的强化学习（RLVR）是一种有前景的方法，用于增强大型语言模型中的推理和代理行为。然而，部署密集型策略优化常常受限于奖励对比不足，这通常发生在过于简单或复杂的提示产生低方差反馈，以及仅结果奖励对多回合推送中每个决策都赋予相同的终端评估时。过去的努力主要集中在将可用资源分配给有前景的提示，但它们只在提示层面利用样本信息量，忽视了同一推出中不同回合前缀级信息量的差异。本研究通过将每个ReAct风格的思维-行动-观察回合建模为语义上独立的节点，针对多回合代理强化学习，允许预算分配从提示根延伸到回合级前缀及后续，自然形成树状结构的展开。我们介绍了对比探索树的展开分配（TRACE），这是一个统一的推广分配框架，在固定抽样预算内增强奖励对比度。技术上，TRACE将推广预算分配给提示词根和中间前缀，这些前缀最有可能产生混合终端奖励。共享的可推广预测变量通过前缀历史估计这些锚点的条件成功概率，以指导分配。由此产生的自适应树结构丰富了仅基于结果的反馈，并放大了策略更新信号。从经验上看，TRACE在典型代理基准测试中实现了竞争性能和效率提升，例如在相同采样成本下，Qwen3-14B多跳QA平均准确率比竞争基线提升2.8个百分点。
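
The TRACE abstract describes sending rollout budget to anchors (prompt roots or turn-level prefixes) that are most likely to yield mixed terminal rewards. One simple way to make that concrete, assuming binary rewards, is to score each anchor by the Bernoulli variance p(1 - p) of its predicted success probability and split a fixed budget in proportion. The scoring rule, the largest-remainder rounding, and the name `allocate_rollouts` are illustrative assumptions rather than the paper's procedure.

```python
def allocate_rollouts(success_probs, budget):
    """Split an integer rollout budget across anchors by expected reward contrast.

    success_probs : predicted success probability for each anchor
    budget        : total number of rollouts available
    """
    # p * (1 - p) peaks at p = 0.5, where outcomes are most likely to be mixed.
    scores = [p * (1.0 - p) for p in success_probs]
    total = sum(scores)
    if total == 0.0:
        # Every anchor is predicted to be all-success or all-failure:
        # fall back to a uniform split.
        scores = [1.0] * len(success_probs)
        total = float(len(scores))
    raw = [budget * s / total for s in scores]
    alloc = [int(r) for r in raw]
    # Hand out the rounding leftovers to the largest fractional parts.
    leftover = budget - sum(alloc)
    order = sorted(range(len(raw)), key=lambda i: raw[i] - alloc[i], reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


plan = allocate_rollouts([0.0, 0.5, 1.0, 0.9], budget=8)
```

Anchors predicted to always fail or always succeed receive nothing in this toy, because additional samples from them would add no contrast to the policy-update signal.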

Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models

全双工语音模型中的多面交互性对齐

Authors: Atsumoto Ohashi, Neil Zeghidour, Alexandre Défossez, Eugene Kharitonov
Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2606.11167
Pdf link: https://arxiv.org/pdf/2606.11167
Abstract Full-duplex spoken dialogue models can listen and speak simultaneously, making them a promising architecture for natural conversation. However, current models are trained solely with supervised learning through token-level likelihood maximization, which does not directly optimize interaction-level behaviors, causing interactivity issues such as excessive silence and ill-timed turn-taking. Recent work has applied reinforcement learning (RL) to improve interactivity, but existing methods address only a limited set of interactive behaviors in their rewards. In this work, we propose a post-training alignment method that comprehensively improves the interactivity of full-duplex spoken dialogue models through RL. We address the four canonical axes of interactivity: pause handling, turn-taking, backchanneling, and user interruption. For each axis, we extract short audio segments from human conversation corpora and optimize the model with axis-specific reward functions. An extra LLM-based reward for response quality prevents semantic degradation. We apply our method to two open-source models, Moshi and PersonaPlex, demonstrating consistent improvements in interactivity on both offline evaluation with pre-recorded audio and real-time multi-turn dialogue evaluation.
中文摘要 全双工口语对话模型可以同时听和说，使其成为自然对话的有前景的架构。然而，目前的模型仅通过代币级似然最大化进行监督学习训练，这并未直接优化交互层级行为，导致过度安静和转弯时机不佳等交互性问题。近期研究应用强化学习（RL）来提升交互性，但现有方法仅涵盖有限的交互行为。本研究提出一种训练后对齐方法，通过强化学习全面提升全双工口语对话模型的交互性。我们讨论了交互性的四个典型轴：暂停处理、转弯、反向通道和用户中断。对于每个轴，我们从人类对话语料库中提取短音频片段，并用轴特定的奖励函数优化模型。基于LLM的额外奖励可防止语义退化。我们将该方法应用于两个开源模型Moshi和PersonaPlex，在离线预录音频评估和实时多回合对话评估中均持续提升互动性。

ARM: An AutoRegressive Large Multimodal Model with Unified Discrete Representations

ARM：具有统一离散表示的自回归大型多模模型

Authors: Junke Wang, Xiao Wang, Jiacheng Pan, Xuefeng Hu, Feng Li, Jingxiang Sun, Chaorui Deng, Zilong Chen, Yunpeng Chen, Kaibin Tian, Matthew Gwilliam, Hao Chen, Danhui Guan, Kun Xu, Weilin Huang, Zuxuan Wu, Haoqi Fan, Yu-Gang Jiang, Zhenheng Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.11188
Pdf link: https://arxiv.org/pdf/2606.11188
Abstract This paper introduces ARM, a discrete representation-based AutoRegressive Model that unifies image understanding, generation, and editing within a next-token prediction framework. ARM is built on three efforts: first, we train a discrete semantic visual tokenizer that maps images into compact token sequences. Our tokenizer is supervised with multiple objectives that jointly promote semantic discriminability, language alignment and faithful reconstruction, thereby supporting diverse tasks in a shared latent space. With this, we train a 7B autoregressive model over large-scale text and image token sequences, seamlessly developing vision-language perception and generation capabilities. Finally, to further improve preference-aligned behavior for text-to-image generation and instruction-guided editing, ARM applies reinforcement learning (RL) to optimize task-level objectives such as visual quality, instruction adherence, and edit consistency. Surprisingly, the results show that RL not only substantially improves performance on the target tasks (e.g., raising WISE overall from 0.50 to 0.56, GEdit-Bench-EN G_O from 5.75 to 6.68), but also induces cross-task synergy between text-to-image generation and editing. Collectively, these findings highlight autoregressive modeling, when paired with strong representations and preference optimization, as a scalable foundation for multimodal intelligence. Code: this https URL.
中文摘要 本文介绍了ARM，一种基于离散表示的自回归模型，将图像理解、生成和编辑统一在下一标记预测框架内。ARM建立在三项努力之上：首先，我们训练了一个离散语义可视化分词器，将图像映射为紧凑的令牌序列。我们的分词器由多个目标监督，共同促进语义区分、语言对齐和忠实重建，从而支持共享潜在空间中的多样任务。通过该模型，我们训练一个7B自回归模型，覆盖大规模文本和图像令牌序列，无缝发展视觉语言感知和生成能力。最后，为了进一步改善文本到图像生成和指令引导编辑的偏好对齐行为，ARM应用强化学习（RL）来优化任务级目标，如视觉质量、指令遵循性和编辑一致性。令人惊讶的是，结果显示，强化学习不仅显著提升了目标任务的性能（例如，整体WISE从0.50提升到0.56，GEdit-Bench-EN G_O从5.75提升到6.68），还促进了文本转图像生成与编辑之间的跨任务协同。综合来看，这些发现凸显了自回归建模结合强表示和偏好优化时，作为多模态智能的可扩展基础。代码：这个 https URL。

Keyword: diffusion policy

GHOST: Hierarchical Sub-Goal Policies for Generalizing Robot Manipulation

GHOST：用于推广机器人操作的层级子目标策略

Authors: Sriram Krishna, Ben Eisner, Haotian Zhan, Ying Yuan, Haoyu Zhen, Chuang Gan, Shubham Tulsiani, David Held
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.10025
Pdf link: https://arxiv.org/pdf/2606.10025
Abstract We present GHOST, a framework for learning visuomotor manipulation policies that generalize beyond the training distribution. GHOST factorizes control into (i) a high-level policy that predicts the next sub-goal as a distribution over 3D end-effector poses from multi-view RGB-D observations, and (ii) a low-level goal-conditioned controller that executes embodiment-specific actions. To condition image-based policies on 3D goals, we introduce a simple spatial interface that projects predicted goals into the image plane and represents them as end-effector heatmaps. Across a suite of manipulation tasks, this hierarchical factorization consistently improves performance and robustness compared to a flat Diffusion Policy. Further, we show that this hierarchical interface also makes it easy to incorporate human demonstrations without relying on (noisy) action retargeting. As sub-goals are largely embodiment-agnostic, we train the high-level policy on human video to specify how learned skills should be applied and composed, while keeping the low-level policy trained purely on robot data. This hierarchy enables adaptation to novel objects and task variations using a small number of human demonstrations.
中文摘要 我们提出了GHOST框架，用于学习超越训练分布的视觉运动操作策略。GHOST 将控制分解为：（i）一个高级策略，通过多视角 RGB-D 观测预测下一个子目标，作为三维端执行器姿态分布的策略;以及（ii）一个执行具象特定动作的低级目标条件控制器。为了以3D目标为基础的基于图像的策略，我们引入了一个简单的空间接口，将预测目标投射到图像平面，并以末端效应器热图表示。在一系列操作任务中，这种层级分解相较于扁平扩散策略，持续提升性能和鲁棒性。此外，我们展示了这种层级界面也使得在不依赖（噪声）动作重定向的情况下，轻松地融入人工演示。由于子目标大多与身体无关，我们训练高层次的人类视频策略，明确如何应用和组合所学技能，而低层策略则完全基于机器人数据进行训练。这种层级结构使得通过少量人工演示，适应新颖的对象和任务变化。

MODIP: Efficient Model-Based Optimization for Diffusion Policies

MODIP：基于模型的高效扩散策略优化

Authors: Zakariae El Asri, Philippe Gratias-Quiquandon, Nicolas Thome, Olivier Sigaud
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.10825
Pdf link: https://arxiv.org/pdf/2606.10825
Abstract Diffusion policies (DPs) have emerged as expressive policy representations for robot learning, often used with imitation learning methods such as behavioral cloning (BC). However, while their success has largely been confined to BC, direct reinforcement learning (RL) fine-tuning remains challenging because actions are generated through a multi-step denoising process. In this work, we propose MODIP, a framework for the offline-to-online fine-tuning of DPs. Rather than directly applying RL to the DPs, MODIP leverages a world model (WM) to guide policy adaptation and keeps the simplicity and stability of BC. We utilize model predictive control (MPC) to generate high-quality trajectories within the WM, and use them as supervised targets for fine-tuning the DP. To make MPC planning efficient, MODIP uses a terminal state value instead of a policy-dependent state-action value, reducing inference time. Additionally, MODIP trains critics with policy-independent TD targets, reducing training time. Experiments on D4RL (MuJoCo, Kitchen) and RoboMimic tasks show that MODIP improves diffusion policies beyond BC, and is competitive with or outperforms diffusion policy RL fine-tuning methods and strong model-based baselines such as TD-MPC2.
中文摘要 扩散策略（DPs）已成为机器人学习的表达性策略表示，常与模仿学习方法如行为克隆（BC）一起使用。然而，虽然它们的成功主要局限于BC，但直接强化学习（RL）的微调仍然具有挑战性，因为动作是通过多步去噪过程生成的。在本研究中，我们提出了MODIP，一个用于离线到在线的DP微调框架。MODIP不直接将强化学习应用于DP，而是利用世界模型（WM）指导政策调整，同时保持BC的简洁性和稳定性。我们利用模型预测控制（MPC）在WM内生成高质量轨迹，并将其作为监督目标，用于微调DP。为了使MPC规划高效，MODIP使用终端状态值而非依赖策略的状态-动作值，从而缩短推理时间。此外，MODIP通过政策无关的TD目标培训批评者，缩短培训时间。D4RL（MuJoCo、Kitchen）和机器人模拟任务的实验表明，MODIP能提升扩散策略超越BC，且在与扩散策略RL微调方法及强模型基线（如TD-MPC2）竞争或优于。