Arxiv Papers of Today

Generated: 2026-05-14 18:21:03 (UTC+8); arXiv release time: 2026-05-14 20:00 EDT (2026-05-15 08:00 UTC+8)

Total relevant papers today: 45

Keyword: reinforcement learning

SP-GCRL: Influence Maximization on Incomplete Social Graphs

Authors: Haohua Niu, Yuxuan Yang, Lingfeng Zhang, Hao Li, Jiao Liang, Zongfu Luo, Luca Rossi
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.12513
Pdf link: https://arxiv.org/pdf/2605.12513
Abstract Influence maximization (IM) in real platforms is challenged by incomplete, noisy social graphs and non-stationary diffusion dynamics. We propose SP-GCRL, a social-propagation-aware graph contrastive reinforcement learning framework that learns end-to-end seed selection under partial observability. We first introduce a social-propagation-aware nonlinear diffusion function to model reinforcement/diminishing effects and probability drift under repeated exposure; we then construct dual structural views and perform contrastive learning to obtain node representations robust to missing edges and weak ties, while replacing expensive strategy metrics with a GAT-based regression surrogate to improve efficiency and scalability; finally, we use DDQN to learn an end-to-end seed selection policy on top of these representations. Experiments on multiple real-world networks show that SP-GCRL achieves significant gains over heuristic and learning-based baselines across budgets and topologies, while maintaining strong large-scale scalability.
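The abstract names DDQN as the learner for seed selection. As a reminder of what distinguishes Double DQN from plain DQN, here is a minimal sketch of the bootstrap target for a single transition. The function name, the toy Q-values, and the reading of actions as candidate seed nodes are illustrative assumptions, not details taken from the paper.

```python
def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double DQN bootstrap target for one transition.

    next_q_online / next_q_target hold the Q-values of the next state,
    one entry per candidate action (assumed here: one per candidate seed node).
    """
    if done:
        return reward
    # The online network selects the greedy action ...
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    # ... and the target network evaluates it, which reduces overestimation bias.
    return reward + gamma * next_q_target[best_action]


# Toy values (hypothetical): three candidate seed nodes.
target = double_dqn_target(1.0, [0.2, 0.9, 0.5], [0.3, 0.4, 0.8], gamma=0.9)
```

Plain DQN would instead bootstrap from the maximum of the target network's own values (0.8 in this toy case), which is the source of the overestimation that the decoupled selection avoids.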

Correct Answers from Sound Reasoning: Verifiable Process Supervision for Language Models

Authors: Kyuyoung Kim, Kevin Wang, Yunfei Xie, Peiyang Xu, Peiyao Sheng, Chen Wei, Zhangyang Wang, Jinwoo Shin, Pramod Viswanath, Sewoong Oh
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.12519
Pdf link: https://arxiv.org/pdf/2605.12519
Abstract Training language models to produce both correct answers and sound reasoning remains an open challenge. Reinforcement learning with verifiable rewards typically optimizes only final outcomes, which can lead to a failure mode where task accuracy improves while reasoning becomes less accurate, less complete, or even internally inconsistent. We propose verifiable process supervision (VPS), a post-training framework for verifiable domains that jointly optimizes prediction accuracy and reasoning quality. We first apply supervised fine-tuning to induce a structured reasoning format, enabling syntactic extraction of intermediate claims that are evaluated against ground-truth signals to form process-level rewards. To address the heterogeneous difficulty of reasoning subtasks, we introduce adaptive reward weighting that prioritizes components with the largest remaining errors, creating an implicit curriculum. We evaluate VPS on chess, a controlled testbed where reasoning steps can be deterministically verified against engine signals. While accuracy-only RL improves move accuracy, it sharply degrades reasoning quality, increasing win-rate error by up to 112% and reducing internal consistency by up to 69%. In contrast, VPS preserves accuracy while significantly improving reasoning quality, reducing win-rate error by up to 30% and restoring consistency to near saturation. At matched accuracy, judge evaluation also prefers the process-supervised models. A reasoning-space analysis further shows that, without a structured prior, accuracy-only RL converges to budget-dependent shortcuts rather than sound multi-step reasoning. These results show that VPS enables language models to reason both accurately and reliably in verifiable domains.
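The adaptive reward weighting described in the abstract prioritizes the reasoning components with the largest remaining error. One simple way to realize that idea is to make each weight proportional to the component's current error. The sketch below assumes this proportional rule and uses hypothetical component names; the paper's actual weighting scheme may differ.

```python
def adaptive_weights(errors, floor=1e-8):
    """Weights proportional to each component's remaining error (assumed rule)."""
    total = sum(errors.values())
    if total <= floor:
        # Nothing left to fix: fall back to uniform weights.
        return {name: 1.0 / len(errors) for name in errors}
    return {name: err / total for name, err in errors.items()}


def process_reward(component_rewards, weights):
    """Weighted sum of per-component process rewards."""
    return sum(weights[name] * component_rewards[name] for name in component_rewards)


# Hypothetical running error estimates for three reasoning components.
errors = {"win_rate": 0.30, "best_move": 0.10, "consistency": 0.10}
weights = adaptive_weights(errors)
reward = process_reward({"win_rate": 0.5, "best_move": 1.0, "consistency": 1.0}, weights)
```

As a component's error shrinks during training, its weight shrinks too, so the emphasis moves to whatever is still hardest. This is one way to read the "implicit curriculum" the abstract mentions.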

DelAC: A Multi-agent Reinforcement Learning of Team-Symmetric Stochastic Games

Authors: Duan-Shin Lee, Yu-Hsiu Hung
Subjects: Multiagent Systems (cs.MA); Computer Science and Game Theory (cs.GT)
Arxiv link: https://arxiv.org/abs/2605.12555
Pdf link: https://arxiv.org/pdf/2605.12555
Abstract In this paper we study team-symmetric games with $m\ge 2$ teams. Players within a team have symmetric identity and have a common payoff function. We show that team-symmetric games always have a team-symmetric Nash equilibrium. We develop and solve a linear complementarity problem of team-symmetric Nash equilibria. We propose an actor-critic based multi-agent reinforcement learning algorithm for team-symmetric games. Through simulations, we show that this multi-agent reinforcement learning algorithm performs much better than many existing algorithms.

Learning When to Act: Communication-Efficient Reinforcement Learning via Run-Time Assurance

Authors: Adam Haroon, Erick J. Rodríguez-Seda, Cody Fleming, Tristan Schuler
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.12561
Pdf link: https://arxiv.org/pdf/2605.12561
Abstract Safe reinforcement learning (RL) typically asks $\textit{what}$ an agent should do. We ask $\textit{when}$ it needs to act, and show that a single policy can jointly learn control inputs and communication-efficient timing decisions under a pointwise Lyapunov safety shield. We focus on stabilization around a known equilibrium, where CARE-based LQR backups, Lyapunov certificates, and classical Lyapunov-STC are well defined, enabling clean comparison against analytical baselines. A run-time assurance (RTA) layer overrides the policy via a one-step-ahead Lyapunov prediction and a precomputed LQR backup, providing a strictly stronger guarantee than constrained MDP methods that enforce safety only in expectation. On an inverted pendulum, cart--pole, and planar quadrotor, the learned policy achieves $1.91\times$, $1.45\times$, and $3.51\times$ higher mean inter-sample interval (MSI) than a Lyapunov-triggered baseline; a fixed LQR controller at the same average rate is unstable on all three plants, showing that adaptive timing, not a lower average rate, makes sparsity safe. A CARE-derived Lyapunov reward transfers across environments without redesign, with a single weight $w_c$ controlling the stability--communication tradeoff; ablations confirm the RTA shield is essential, with its removal reducing MSI by $1.27$--$1.84\times$ and degrading state norms. A preference-conditioned extension recovers the full tradeoff frontier from one model at $\tfrac{2}{11}$ of training compute, and SAC experiments show the results are algorithm-agnostic across discrete and continuous domains. A 12-state 3D quadrotor case study extends the framework to higher-dimensional systems where classical STC is intractable, and robustness to $\pm30\%$ mass variation and disturbances shows graceful degradation, with the RTA absorbing what the learned policy cannot.
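The run-time assurance layer in the abstract checks a one-step-ahead Lyapunov prediction and falls back to a precomputed backup controller when the check fails. The sketch below shows that override pattern on a scalar linear plant. The plant parameters, the quadratic Lyapunov function, the hand-picked backup gain (standing in for a CARE-based LQR gain), and the non-increase acceptance test are all illustrative assumptions.

```python
A, B = 1.2, 1.0      # hypothetical unstable scalar plant: x_next = A*x + B*u
K_BACKUP = 0.9       # assumed stabilizing backup gain: |A - B*K_BACKUP| = 0.3 < 1


def lyapunov(x):
    """Quadratic Lyapunov candidate V(x) = x^2."""
    return x * x


def shielded_action(x, u_policy):
    """Return (action, overridden) after a one-step-ahead Lyapunov check."""
    x_pred = A * x + B * u_policy
    if lyapunov(x_pred) <= lyapunov(x):
        return u_policy, False          # policy action keeps V from increasing
    return -K_BACKUP * x, True          # otherwise apply the backup controller


safe = shielded_action(1.0, -1.0)       # prediction x = 0.2, accepted
unsafe = shielded_action(1.0, 0.0)      # prediction x = 1.2, overridden
```

Because the check runs at every step on the actual state, the guarantee is pointwise, in contrast to constrained-MDP formulations that bound safety violations only in expectation.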

Driving Intents Amplify Planning-Oriented Reinforcement Learning

Authors: Hengtong Lu, Victor Shea-Jay Huang, Chengmin Yang, Pengfei Jing, Jifeng Dai, Yan Xie, Benjin Zhu
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.12625
Pdf link: https://arxiv.org/pdf/2605.12625
Abstract Continuous-action policies trained on a single demonstrated trajectory per scene suffer from mode collapse: samples cluster around the demonstrated maneuver and the policy cannot represent semantically distinct alternatives. Under preference-based evaluation, this caps best-of-N performance -- even oracle selection cannot recover what the sampling distribution does not contain. We introduce DIAL, a two-stage Driving-Intent-Amplified reinforcement Learning framework for preference-aligned continuous-action driving policies. In the first stage, DIAL conditions the flow-matching action head on a discrete intent label with classifier-free guidance (CFG), which expands the sampling distribution along distinct maneuver modes and breaks single-demonstration mode collapse. In the second stage, DIAL carries this expanded distribution into preference RL through multi-intent GRPO, which spans all intent classes within every preference group and prevents fine-tuning from re-collapsing around the currently preferred mode. Instantiated for end-to-end driving with eight rule-derived intents and evaluated on WOD-E2E: competitive Vision-to-Action (VA) and Vision-Language-Action (VLA) Supervised Finetuning (SFT) baselines plateau below the human-driven demonstration at best-of-128, with the strongest prior (RAP) capping at Rater Feedback Score (RFS) 8.5 even with best-of-64; intent-CFG sampling lifts this ceiling to RFS 9.14 at best-of-128, surpassing both the prior best (RAP 8.5) and the human-driven demonstration (8.13) for the first time; and multi-intent GRPO improves held-out RFS from 7.681 to 8.211, while every single-intent baseline peaks lower and degrades by training end. These results suggest that the bottleneck of preference RL on continuous-action policies trained from demonstrations is not only how to update the policy, but to expand and preserve the sampling distribution being optimized.
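The first stage of DIAL relies on classifier-free guidance (CFG). The standard CFG combination extrapolates from the unconditional prediction toward the conditional one. A minimal sketch follows, assuming the flow-matching head outputs a velocity vector and that the same head can be queried with and without the intent label; the function name and toy vectors are hypothetical.

```python
def cfg_velocity(v_uncond, v_cond, guidance_scale):
    """Standard classifier-free guidance: v = v_uncond + s * (v_cond - v_uncond)."""
    return [u + guidance_scale * (c - u) for u, c in zip(v_uncond, v_cond)]


# Toy 2-D velocities (hypothetical): the intent label pushes the first coordinate.
v_unconditional = [0.0, 1.0]
v_intent = [1.0, 1.0]

guided = cfg_velocity(v_unconditional, v_intent, guidance_scale=2.0)
```

A scale of 0 recovers the unconditional prediction and a scale of 1 the conditional one. Scales above 1 exaggerate the intent-specific direction, which is the mechanism that can spread samples across distinct maneuver modes.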

Training LLMs with Reinforcement Learning for Intent-Aware Personalized Question Answering

Authors: Maryam Amirizaniani, Benjamin Charles Germain Lee, Jevin West, Nicholas Weber
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.12645
Pdf link: https://arxiv.org/pdf/2605.12645
Abstract Effective personalized question answering (PQA) in language models requires grounding responses in the user's underlying intent, where intent refers to the implicit "why" behind a query beyond its explicit wording. However, existing approaches to intent-aware personalization rely on multi-turn conversational context or rich user profiles, and do not explicitly model user intent during the reasoning process. This limits their effectiveness in single-turn settings, where the user's latent goal must be inferred from minimal input and integrated into the thinking and reasoning process. To bridge this gap, we propose IAP (Intent-Aware Personalization), a reinforcement learning framework that trains models to infer implicit user intent directly from a single-turn question and incorporate it into thinking steps through a tag-based schema for generating personalized, intent-grounded answers. By optimizing intent-aware answer trajectories under a personalized reward function, IAP reinforces generation paths that make implicit user intent explicit and produce responses that better align with the user's underlying goal. Through experiments on the LaMP-QA benchmark across six models, IAP consistently outperforms all baselines, achieving an average macro-score gain of around 7.5% over the strongest competitor, demonstrating that modeling implicit user intent within the training objective is a promising direction for PQA.

Plan Before You Trade: Inference-Time Optimization for RL Trading Agents

Authors: Eun Go, Rohan Deb, Arindam Banerjee
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2605.12653
Pdf link: https://arxiv.org/pdf/2605.12653
Abstract Reinforcement learning agents for portfolio management are typically trained and deployed as static policies, with no mechanism for using price forecasts at inference time. We propose $\text{FPILOT}$ (Financial Plugin Inference-time Learning for Optimal Trading), a plugin inference-time optimization framework inspired by Model Predictive Control (MPC). Our key structural insight is that future prices mostly do not depend on one agent's portfolio allocation, so a suitable predictive model can produce a multi-step price trajectory without iterative action-conditioned rollouts as in typical reinforcement learning. At each decision step, we use the forecaster's predicted price trajectory to construct an allocation-based imagined return objective, and optimize the policy at inference-time before executing one step of the trade. Our framework is compatible with any pre-trained agent and adapts the policy to the forecaster's predictions without any retraining. Evaluated across five policy learning algorithms on the TradeMaster DJ30 benchmark, $\text{FPILOT}$ produces consistent improvements in total return and return-based risk-adjusted metrics (Sharpe, Sortino, Calmar), with stochastic policies benefiting more than deterministic ones. Further, using synthetic forecasts at calibrated quality levels, we show that gains consistently improve with forecaster quality, suggesting that our performance will improve based on advances in financial forecasting.
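The key structural point in the abstract is that a forecast price trajectory can be scored against an allocation directly, with no action-conditioned rollouts. The sketch below illustrates that with a simplified stand-in: it evaluates an "imagined return" for a few fixed candidate allocations and keeps the best one. The paper optimizes the policy itself at inference time; the candidate search, the function names, the rebalance-every-step assumption, and the toy forecast are all assumptions made for illustration.

```python
def imagined_return(weights, price_path):
    """Compounded return of a fixed allocation along a forecast price path.

    Assumes the portfolio is rebalanced to `weights` at every step.
    """
    growth = 1.0
    for prev, nxt in zip(price_path, price_path[1:]):
        growth *= sum(w * n / p for w, n, p in zip(weights, nxt, prev))
    return growth - 1.0


def best_allocation(candidates, price_path):
    """Pick the candidate allocation with the highest imagined return."""
    return max(candidates, key=lambda w: imagined_return(w, price_path))


# Hypothetical two-asset forecast over two future steps.
forecast = [[100.0, 50.0], [110.0, 50.0], [121.0, 45.0]]
candidates = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
chosen = best_allocation(candidates, forecast)
```

In an MPC-style loop only the first step of the chosen plan would be executed before the forecast is refreshed and the search repeated.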

Macro-Action Based Multi-Agent Instruction Following through Value Cancellation

Authors: Wo Wei Lin, Ethan Rathbun, Enrico Marchesini, Xiang Zhi Tan
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2605.12655
Pdf link: https://arxiv.org/pdf/2605.12655
Abstract Multi-agent reinforcement learning (MARL) in real-world use cases may need to adapt to external natural language instructions that interrupt ongoing behavior and conflict with long-horizon objectives. However, conditioning rewards on instructions introduces a fundamental failure mode as Bellman updates couple value estimates across instruction contexts, leading to inconsistent values when instructions interrupt macro-actions. We propose Macro-Action Value Correction for Instruction Compliance (MAVIC), which corrects Bellman backups at instruction boundaries by correcting the incoming instruction objective and restoring the continuation value under the current objective. Unlike reward shaping, MAVIC modifies the bootstrapping target itself, enabling consistent value estimation under stochastic instruction switching within a unified policy. We provide theoretical analysis and an actor-critic implementation, and show that MAVIC achieves high instruction compliance while preserving base task performance in increasingly complex cooperative multi-agent environments.

ODRPO: Ordinal Decompositions of Discrete Rewards for Robust Policy Optimization

Authors: Nirmal Patel, Fei Wang, Inderjit Dhillon
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.12667
Pdf link: https://arxiv.org/pdf/2605.12667
Abstract The alignment of Large Language Models (LLMs) utilizes Reinforcement Learning from AI Feedback (RLAIF) for non-verifiable domains such as long-form question answering and open-ended instruction following. These domains often rely on LLM-based auto-raters to provide granular, multi-tier discrete rewards (e.g., 1-10 rubrics) that are inherently stochastic due to prompt sensitivity and sampling randomness. We empirically verify that this auto-rater stochasticity can propagate and corrupt standard advantage estimators like GRPO and MaxRL, as noisy reward samples can skew normalization statistics and degrade the global learning signal. Empirically, sampling more rewards and taking a majority vote may reduce the noise and improve performance, but this approach is computationally expensive. To address this bottleneck, we introduce Ordinal Decomposition for Robust Policy Optimization (ODRPO), a framework that structurally isolates evaluation noise by decomposing discrete rewards into a sequence of ordinal binary indicators. By independently computing and accumulating advantages across these progressively challenging success thresholds, ODRPO prevents outlier evaluations from corrupting the global update while establishing an implicit, variance-aware learning curriculum. Empirically, ODRPO achieves robust performance on Qwen2.5-7B and Qwen3-4B models, outperforming baselines with relative improvements of up to 14.8% on FACTS-grounding-v2 and 7.5% on Alpaca-Evals. Critically, these gains are achieved with negligible training-time overhead, as ODRPO requires no additional compute per step compared to standard estimators. Supported by theoretical analysis confirming its optimization stability, ODRPO provides a scalable and robust framework for aligning models within the noisy, discrete evaluation landscape of modern RLAIF.
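The decomposition described in the abstract turns each discrete reward into binary indicators of the form "reward is at least k" and accumulates advantages across thresholds. The sketch below is one plausible reading: each threshold gets a group-normalized advantage (mean-centered, divided by the group standard deviation), and thresholds on which the whole group agrees are skipped. The normalization choice, the skip rule, and the function name are assumptions; the paper's estimator may differ.

```python
from statistics import mean, pstdev


def ordinal_advantages(rewards, levels):
    """Sum of per-threshold, group-normalized advantages (assumed estimator)."""
    advantages = [0.0] * len(rewards)
    for k in levels[1:]:  # "reward >= lowest level" is always true, so skip it
        indicators = [1.0 if r >= k else 0.0 for r in rewards]
        mu, sigma = mean(indicators), pstdev(indicators)
        if sigma == 0.0:
            continue  # the whole group agrees at this threshold: no signal
        for i, b in enumerate(indicators):
            advantages[i] += (b - mu) / sigma
    return advantages


# Hypothetical group of four sampled responses rated on a 1-3 rubric.
adv = ordinal_advantages([1, 2, 3, 3], levels=[1, 2, 3])
```

Since every indicator is 0 or 1, a single extreme rating can shift a threshold's statistics by at most one sample's worth, which limits how far one noisy rating can skew the group normalization.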

3D RL-DWA: A Hybrid Reinforcement Learning and Dynamic Window Approach for Goal-Directed Local Navigation in Multi-DoF Robots

Authors: Chiara Castellani, Enrico Turco, Domenico Prattichizzo
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.12689
Pdf link: https://arxiv.org/pdf/2605.12689
Abstract In this paper, we present a novel hybrid approach that combines Reinforcement Learning (RL) with the Dynamic Window Approach (DWA) for adaptive 3D local navigation of high-degree-of-freedom robotic systems. Our method leverages sparse point cloud data to dynamically adjust both the motion and the shape of a deformable microrobot, enabling the system to navigate toward a goal in complex, constrained environments while maximizing the occupied volume. We evaluate our framework in a simulated vascular network. Experimental results, based on 1080 trials, indicate that integrating RL with a DWA-based local planner significantly enhances both deformation and navigation capabilities compared to pure RL and model-based methods. In particular, the proposed autonomous controller consistently achieves high deformation and near-perfect path completion during training and maintains robust performance in unseen scenarios. These findings highlight the potential of hybrid planning strategies for efficient and adaptive 3D navigation under sparse sensory conditions.
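For readers unfamiliar with the Dynamic Window Approach, its core loop restricts candidate velocities to those reachable within one control period, discards candidates that would collide, and scores the rest. The sketch below is a deliberately reduced one-dimensional version with a goal-distance score only. The paper works in 3D with a deformable robot and an RL component, none of which is modeled here, and all names and parameters are hypothetical.

```python
def dynamic_window(v, v_max, accel, dt):
    """Velocities reachable within one control period under acceleration limits."""
    return max(0.0, v - accel * dt), min(v_max, v + accel * dt)


def choose_velocity(pos, v, goal, obstacle, v_max=1.0, accel=0.5, dt=1.0, samples=5):
    """Pick the admissible sampled velocity whose next position is closest to the goal."""
    lo, hi = dynamic_window(v, v_max, accel, dt)
    best, best_score = lo, float("-inf")  # default: slowest reachable velocity
    for i in range(samples):
        cand = lo + (hi - lo) * i / (samples - 1)
        nxt = pos + cand * dt
        if nxt >= obstacle:
            continue  # inadmissible: the robot would reach the obstacle
        score = -abs(goal - nxt)
        if score > best_score:
            best, best_score = cand, score
    return best


# Robot at 0.0 moving at 0.5, goal at 2.0, obstacle at 0.9 along the same line.
v_cmd = choose_velocity(pos=0.0, v=0.5, goal=2.0, obstacle=0.9)
```

A full planner would add clearance and velocity terms to the score and simulate short trajectories rather than a single step.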

CoT-Guard: Small Models for Strong Monitoring

Authors: Nirav Diwan, Han Wang, Berkcan Kapusuzoglu, Ramin Moradi, Supriyo Chakraborty, Giri Iyengar, Sambit Sahu, Huan Zhang, Gang Wang
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.12746
Pdf link: https://arxiv.org/pdf/2605.12746
Abstract Monitoring the chain-of-thought (CoT) of reasoning models is a promising approach for detecting covert misbehavior (i.e., hidden objectives) in code generation tasks. While large models (GPT-5, Gemini-3-Flash) can serve as effective CoT monitors, they are expensive to deploy due to the lengthy reasoning traces and high API cost, emphasizing the need for smaller, cheaper alternatives. Nevertheless, we find that current small models (4B--8B) struggle to detect hidden objectives despite access to the CoT, frequently misattributing them as part of the user query. To address this, we propose a post-training pipeline combining supervised fine-tuning (SFT) and reinforcement learning (RL), where SFT narrows the gap for in-domain tasks by distilling detection behavior from stronger monitors, and RL on hard and subtly crafted hidden objectives helps the model generalize to out-of-domain monitoring tasks. To validate this generalization, we evaluate under a realistic threat model motivated by practical supply-chain attacks, where the adversary is a third-party LLM router injecting hidden objectives into code-generation requests through either prompt manipulation or code manipulation attacks. To push beyond objectives that large monitors already saturate, we also introduce four new tasks that are challenging even for strong monitors. Finally, we introduce CoT-Guard, a 4B-parameter monitor that demonstrates superior generalization performance under both prompt and code manipulation attacks, achieving a G-mean^2 (i.e., TNR x TPR) of 75% and outperforming GPT-5.4 (56%), GPT-5-mini (41%), and Qwen3-32B (54%), while closing the gap to Gemini-3-Flash (83%). These results demonstrate that CoT-Guard provides a practical and cost-effective user-side defense, substantially improving hidden-objective detection while avoiding the deployment cost of large monitors.
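The headline metric, G-mean^2, is defined in the abstract as TNR x TPR. A short sketch of the computation from confusion-matrix counts follows; the counts are made up for illustration and do not come from the paper.

```python
def g_mean_squared(tp, fn, tn, fp):
    """G-mean^2 = TPR * TNR, computed from confusion-matrix counts."""
    tpr = tp / (tp + fn)   # share of hidden-objective cases that were flagged
    tnr = tn / (tn + fp)   # share of benign cases that were left alone
    return tpr * tnr


# Hypothetical monitor: flags 90 of 100 attacks, clears 80 of 100 benign traces.
score = g_mean_squared(tp=90, fn=10, tn=80, fp=20)
```

Because the two rates are multiplied, a monitor that flags everything (TNR = 0) or nothing (TPR = 0) scores zero, so the metric rewards balanced detection rather than a high rate on one class.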

Simulating Students or Sycophantic Problem Solving? On Misconception Faithfulness of LLM Simulators

Authors: Heejin Do, Shashank Sonkar, Mrinmaya Sachan
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.12748
Pdf link: https://arxiv.org/pdf/2605.12748
Abstract Large language models (LLMs) can fluently generate student-like responses, making them attractive as simulated students for training and evaluating AI tutors and human educators. Yet such simulators are typically evaluated by output similarity to real students, not by whether they behave like students with coherent misconceptions during interaction. We introduce a controlled framework for evaluating misconception faithfulness: whether a simulator maintains a misconception-driven belief state and updates selectively when feedback addresses the underlying misconception. Central to our framework is a misconception-contrastive feedback protocol that compares targeted feedback against two controls: misaligned feedback (targeting a different but plausible misconception) and generic feedback (only identifying that the answer is wrong). We propose Selective Flip Score (SFS), which quantifies how much more often a simulator flips its answer under targeted feedback than under contrastive controls. Across seven LLMs (4B-120B), multiple datasets, and prompting strategies, simulators exhibit near-zero SFS, correcting their answers at similarly high rates regardless of feedback relevance. Further analyses reveal a sycophantic failure mode: models behave less like students with misconceptions and more like problem-solvers who treat any corrective signal as a cue to abandon the simulated belief and re-solve from internal knowledge. To address this, we develop a post-training pipeline spanning supervised fine-tuning (SFT), preference optimization, and reinforcement learning (RL) with an SFS-aligned reward; SFT yields notable gains up to +0.56, and SFS-aligned RL provides more consistent improvements than preference optimization. Our results establish misconception faithfulness as a challenging yet trainable property, motivating a shift from static output matching toward interactive, belief-aware student modeling.
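The abstract describes the Selective Flip Score only informally, as how much more often a simulator flips under targeted feedback than under the controls. One plausible formalization, assumed here purely for illustration, is the targeted flip rate minus the average flip rate of the two control conditions. The function names and the toy outcomes are hypothetical.

```python
def flip_rate(flipped):
    """Fraction of trials in which the simulator changed its answer."""
    return sum(flipped) / len(flipped)


def selective_flip_score(targeted, misaligned, generic):
    """Targeted flip rate minus the mean control flip rate (assumed definition)."""
    control = (flip_rate(misaligned) + flip_rate(generic)) / 2
    return flip_rate(targeted) - control


# Sycophantic simulator: flips no matter what the feedback says.
sycophantic = selective_flip_score([1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1])

# Faithful simulator: flips only when the feedback targets its misconception.
faithful = selective_flip_score([1, 1, 1, 0], [0, 0, 0, 0], [0, 1, 0, 0])
```

Under this reading a score near zero means feedback relevance has no effect on the simulator's behavior, which matches the failure mode the abstract reports.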

Adaptive Smooth Tchebycheff Attention for Multi-Objective Policy Optimization

Authors: Alejandro Murillo-Gonzalez, Mahmoud Ali, Lantao Liu
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2605.12771
Pdf link: https://arxiv.org/pdf/2605.12771
Abstract Multi-objective reinforcement learning in robotic domains requires balancing complex, non-convex trade-offs between conflicting objectives. While linear scalarization methods provide stability, they are theoretically incapable of recovering solutions within non-convex regions of the Pareto front. Conversely, static non-linear scalarizations (e.g., Tchebycheff) can theoretically access these regions but often suffer from severe gradient variance and optimization instability in deep RL. In this work, we propose an Adaptive Smooth Tchebycheff framework that resolves this tension by dynamically modulating the curvature of the optimization landscape. We introduce a novel conflict-driven controller that regulates the optimization smoothness based on real-time gradient interference. This allows the agent to anneal toward precise, non-convex scalarization when objectives align, while elastically reverting to stable, smooth approximations when destructive gradient conflicts emerge. We validate our approach on a challenging robotic stealth visual search task -- a proxy for monitoring of protected/fragile ecosystems -- where an agent must balance search, exposure/interference minimization and exploration speed. Extensive ablations confirm that our conflict-aware adaptation enables the robust discovery of Pareto-optimal policies in non-convex regions inaccessible to linear baselines and unstable for static non-linear methods.
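The contrast between the hard and smooth Tchebycheff scalarizations can be made concrete. The usual smooth form replaces the max with a scaled log-sum-exp whose smoothing parameter controls how closely it tracks the hard max. The sketch below implements both for a minimization setting; the function names and toy values are hypothetical, and the paper's conflict-driven controller for adapting the smoothing parameter is not modeled.

```python
import math


def tchebycheff(losses, weights, ideal):
    """Hard Tchebycheff scalarization: max_i w_i * (f_i - z_i)."""
    return max(w * (f - z) for f, w, z in zip(losses, weights, ideal))


def smooth_tchebycheff(losses, weights, ideal, mu):
    """Smooth variant: mu * log(sum_i exp(w_i * (f_i - z_i) / mu))."""
    terms = [w * (f - z) / mu for f, w, z in zip(losses, weights, ideal)]
    m = max(terms)  # subtract the max before exponentiating for numerical stability
    return mu * (m + math.log(sum(math.exp(t - m) for t in terms)))


losses, weights, ideal = [1.0, 3.0], [0.5, 0.5], [0.0, 0.0]
hard = tchebycheff(losses, weights, ideal)
sharp = smooth_tchebycheff(losses, weights, ideal, mu=0.01)   # close to the hard max
soft = smooth_tchebycheff(losses, weights, ideal, mu=1.0)     # smoother, looser bound
```

The smooth value always lies between the hard max and the hard max plus mu times log(n) for n objectives, so shrinking mu trades gradient smoothness for fidelity to the exact scalarization.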

Quantifying Potential Observation Missingness in Inverse Reinforcement Learning

Authors: Leo Benac, Abhishek Sharma, Alihan Huyuk, Finale Doshi-Velez
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.12831
Pdf link: https://arxiv.org/pdf/2605.12831
Abstract Inverse reinforcement learning (IRL), which infers reward functions from demonstrations, is a valuable tool for modeling and understanding decision-making behavior. Many variants of IRL have been developed to capture complexities of human decision-making, such as subjective beliefs, imperfect planning, and dynamic goals. However, an often-overlooked issue in real-world behavioral datasets is that the recorded data may be missing observations that were available to the original decision-maker. In use-inspired settings such as healthcare, this can make expert actions appear suboptimal, even when they were near-optimal given the information available at the time. As a result, the rewards learned by standard IRL may be misleading. In this paper, we identify the minimal perturbations to the recorded observations needed for the expert's actions to appear optimal. We develop a practical algorithm for this problem and demonstrate its utility for quantifying the possible extent of missing observations in behavioral datasets through extensive experiments on synthetic navigation tasks, a cancer treatment simulator, and ICU treatment data.

Revisiting DAgger in the Era of LLM-Agents

在LLM智能体时代重新审视DAgger

Authors: Changhao Li, Rushi Qiang, Jiawei Huang, Chenxiao Gao, Chao Zhang, Niao He, Bo Dai
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.12913
Pdf link: https://arxiv.org/pdf/2605.12913
Abstract Long-horizon LM agents learn from multi-turn interaction, where a single early mistake can alter the subsequent state distribution and derail the whole trajectory. Existing recipes fall short in complementary ways: supervised fine-tuning provides dense teacher supervision but suffers from covariate shift because it is trained on off-policy teacher trajectories; while reinforcement learning with verifiable rewards avoids this off-policy mismatch by learning from on-policy rollouts but with only sparse outcome feedback. We address this dilemma by revisiting Dataset Aggregation (DAgger) for multi-turn LM agents: the algorithm collects trajectories through a turn-level interpolation of student and teacher policies, and the student is then trained on these trajectories using supervised labels provided by the teacher. By directly interacting with environments, we expose the model to realistic states likely to be encountered during deployment, thereby effectively mitigating covariate shift. Besides, since the student is learned by mimicking the teacher's behavior, it receives rich feedback during learning. To demonstrate DAgger enjoys the benefits of both worlds, we tested the algorithm to train a software-engineering agent with 4B- and 8B-scale student models. On SWE-bench Verified, our DAgger-style training improves over the strongest post-training baseline by +3.9 points at 4B and +3.6 points at 8B. The resulting 4B agent reaches 27.3%, outperforming representative published 8B SWE-agent systems, while the 8B agent achieves 29.8%, surpassing SWE-Gym-32B and coming within 5 points of stronger 32B-scale agents. Together with consistent gains on the held-out SWE-Gym split, these results suggest the effectiveness of DAgger for modern long-horizon LM agents.
中文摘要 长时程LM智能体从多轮交互中学习，一次早期失误就可能改变后续状态分布，使整条轨迹偏离正轨。现有方案各有不足且互为补充：监督微调提供了密集的教师监督，但由于在离策略的教师轨迹上训练，存在协变量偏移；而带有可验证奖励的强化学习通过从同策略rollout中学习避免了这种离策略错配，但只有稀疏的结果反馈。我们通过重新审视面向多轮LM智能体的数据集聚合（DAgger）来解决这一困境：该算法通过学生策略与教师策略的轮次级插值收集轨迹，然后用教师提供的监督标签在这些轨迹上训练学生。通过直接与环境交互，我们使模型接触到部署过程中可能遇到的真实状态，从而有效缓解协变量偏移。此外，由于学生通过模仿教师的行为来学习，它在学习过程中会获得丰富的反馈。为了证明DAgger兼具两者的优势，我们用该算法训练了一个软件工程智能体，学生模型规模为4B和8B。在SWE-bench Verified上，我们的DAgger式训练相比最强的后训练基线在4B上提升了3.9个百分点，在8B上提升了3.6个百分点。最终的4B智能体达到27.3%，优于具有代表性的已发表8B SWE-agent系统，而8B智能体达到29.8%，超过SWE-Gym-32B，与更强的32B规模智能体的差距在5个百分点以内。加上在留出的SWE-Gym划分上的一致提升，这些结果表明了DAgger对现代长时程LM智能体的有效性。
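
The turn-level interpolation described in this entry can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: `student`, `teacher` and `env_step` are hypothetical callables, and `beta` (the probability of executing the teacher's action at a given turn) is an assumed parameterization of the mixture.

```python
import random

def collect_dagger_trajectory(student, teacher, env_step, init_state,
                              beta, horizon, rng):
    """Roll out a per-turn mixture of teacher and student actions.

    Every visited state is labeled with the teacher's action, regardless
    of which policy actually acted, so the student is later supervised on
    states it is likely to reach itself.
    """
    state, dataset = init_state, []
    for _ in range(horizon):
        label = teacher(state)             # dense supervision signal
        dataset.append((state, label))
        # Turn-level interpolation: teacher acts with probability beta.
        action = label if rng.random() < beta else student(state)
        state = env_step(state, action)
    return dataset

# Toy chain environment: the state is an integer, actions shift it.
rng = random.Random(0)
data = collect_dagger_trajectory(
    student=lambda s: -1, teacher=lambda s: 1,
    env_step=lambda s, a: s + a, init_state=0,
    beta=0.5, horizon=5, rng=rng)
print(data)
```

With `beta=1.0` this reduces to collecting teacher trajectories (the off-policy SFT setting), and with `beta=0.0` the student acts alone while still receiving teacher labels.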

Reinforced Collaboration in Multi-Agent Flow Networks

多智能体流网络中的强化协作

Authors: Zheng Wang, Yuang Liu, Yangkai Ding
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.12943
Pdf link: https://arxiv.org/pdf/2605.12943
Abstract Multi-agent systems provide a powerful way to extend large language models (LLMs) by decomposing a complex task into specialized subtasks handled by different agents. However, their performance is often hindered by error propagation, arising from suboptimal workflow design or inaccurate agent outputs, which can propagate through the agent collaboration process and degrade final results. To address the challenges, we present MANGO (Multi-Agent Network Gradient Optimization), a data-driven framework that organizes and refines agent collaboration via a flow network constructed from past successful workflows. MANGO integrates reinforcement learning and textual gradients to jointly optimize workflow paths and agent behaviors, while a skipping mechanism prevents redundant updates to well-optimized agents for improving efficiency. Extensive experiments on seven benchmarks show that MANGO achieves up to 12.8% performance improvement over state-of-the-art baselines, enhances efficiency by 47.4%, and generalizes effectively to unseen domains. Our code and datasets are publicly available at this https URL.
中文摘要 多智能体系统通过将复杂任务分解为由不同智能体处理的专门子任务，提供了一种扩展大型语言模型（LLMs）的强大方式。然而，其性能常常受到错误传播的制约：错误源于欠佳的工作流设计或不准确的智能体输出，会在智能体协作过程中传播并降低最终结果的质量。为应对这些挑战，我们提出了MANGO（Multi-Agent Network Gradient Optimization，多智能体网络梯度优化），这是一个数据驱动框架，通过由过去成功工作流构建的流网络来组织和优化智能体协作。MANGO结合强化学习和文本梯度，联合优化工作流路径和智能体行为，同时通过跳过机制避免对已充分优化的智能体进行冗余更新，从而提高效率。在七个基准上的大量实验表明，MANGO相比最先进的基线性能提升最高达12.8%，效率提升47.4%，并能有效泛化到未见过的领域。我们的代码和数据集在此 https URL 公开。

A Persistence-Aware Framework for Age Violation Control in Wireless Status Update Systems

无线状态更新系统中用于年龄违规控制的持久性感知框架

Authors: Haoyuan Pan, Chen Chen, Shiyong Zhou, Kun Chen, Tse-Tin Chan
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2605.13002
Pdf link: https://arxiv.org/pdf/2605.13002
Abstract Timely and reliable status updates are essential for emerging QoS-sensitive wireless applications. Common age of information (AoI)-based metrics, such as average AoI and age violation rate (AVR), characterize time-averaged freshness or violation frequency but do not explicitly capture the temporal persistence of consecutive age violations, which can be critical in safety-sensitive wireless applications. We develop a persistence-aware reliability framework based on the consecutive age violation rate (C-AVR) vector, whose components quantify AoI threshold violations over consecutive time windows of different lengths. Through flexible weighting schemes, the proposed framework unifies reliability objectives ranging from average persistence to tail-sensitive performance. Optimizing weighted C-AVR objectives is challenging because consecutive violations are temporally correlated, leading to sparse learning signals. To address this issue, we develop a distributional reinforcement learning approach based on a quantile regression dueling double deep Q-network (QR-D3QN). By modeling a quantile-based return distribution rather than only a scalar expected return, QR-D3QN provides richer value-estimation signals for rare but prolonged violation sequences under stochastic packet arrivals, unreliable channels, and transmission cost constraints. Simulation results show that QR-D3QN consistently outperforms expectation-based baselines across a wide range of weighting schemes and system settings, with particularly significant gains under tail-sensitive persistence objectives. Component-wise analysis further shows that distributional value learning substantially improves reliability across multiple persistence scales, especially for long consecutive violation sequences. Overall, our results establish the proposed C-AVR framework as an effective foundation for persistence-aware reliability evaluation.
中文摘要 及时且可靠的状态更新对于新兴的QoS敏感无线应用至关重要。基于信息年龄（AoI）的常见指标，如平均AoI和年龄违规率（AVR），刻画的是时间平均的新鲜度或违规频率，但并不能显式反映连续年龄违规的时间持久性，而这在安全敏感的无线应用中可能至关重要。我们基于连续年龄违规率（C-AVR）向量开发了一个持久性感知的可靠性框架，该向量的各分量量化了在不同长度的连续时间窗口上的AoI阈值违规。通过灵活的加权方案，所提出的框架统一了从平均持久性到尾部敏感性能的各类可靠性目标。优化加权C-AVR目标具有挑战性，因为连续违规在时间上相关，导致学习信号稀疏。为解决这一问题，我们开发了一种基于分位数回归对决双深度Q网络（QR-D3QN）的值分布强化学习方法。通过建模基于分位数的回报分布，而非仅仅建模标量期望回报，QR-D3QN在随机数据包到达、不可靠信道和传输成本约束下，为罕见但持续时间较长的违规序列提供了更丰富的价值估计信号。仿真结果表明，QR-D3QN在多种加权方案和系统设置下始终优于基于期望的基线，在尾部敏感的持久性目标下提升尤为显著。逐分量分析进一步表明，值分布学习在多个持久性尺度上显著提升了可靠性，尤其是对于较长的连续违规序列。总体而言，我们的结果确立了所提出的C-AVR框架作为持久性感知可靠性评估的有效基础。
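
To make the C-AVR vector concrete, here is one plausible reading, written as a sketch: component k is taken to be the fraction of length-k sliding windows in which the AoI exceeds the threshold in every slot. That definition is an assumption (the entry only says the components cover consecutive windows of different lengths), and `c_avr_vector` and `weighted_c_avr` are hypothetical names.

```python
def c_avr_vector(aoi, threshold, max_len):
    """Empirical consecutive age-violation rates for window lengths 1..max_len.

    Assumed definition: component k is the share of length-k sliding
    windows whose AoI samples all exceed `threshold`.
    """
    violated = [a > threshold for a in aoi]
    rates = []
    for k in range(1, max_len + 1):
        windows = len(aoi) - k + 1
        if windows <= 0:
            rates.append(0.0)          # trace too short for this window
            continue
        hits = sum(all(violated[t:t + k]) for t in range(windows))
        rates.append(hits / windows)
    return rates

def weighted_c_avr(rates, weights):
    """Scalar objective: weighted sum of the C-AVR components."""
    return sum(w * r for w, r in zip(weights, rates))

trace = [1, 5, 6, 7, 2, 6]             # toy AoI samples per slot
rates = c_avr_vector(trace, threshold=4, max_len=3)
print(rates, weighted_c_avr(rates, [0.2, 0.3, 0.5]))
```

Putting more weight on the later components emphasizes long violation runs, which is one way to express the tail-sensitive objectives mentioned above.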

JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning

JEDI：在线基于模型的强化学习的联合嵌入扩散世界模型

Authors: Jing Yu Lim, Rushi Shah, Zarif Ikram, Samson Yu, Haozhe Ma, Tze-Yun Leong, Dianbo Liu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.13013
Pdf link: https://arxiv.org/pdf/2605.13013
Abstract Diffusion world models have recently become competitive for online model-based reinforcement learning, but current approaches expose a tension: pixel diffusion is effective but computationally expensive, while the latest latent diffusion approach improves efficiency yet performs subpar. The latter also relies on separately trained latents rather than the end-to-end world-model objectives that have driven much of modern MBRL progress. In particular, JEPA-style predictive representation learning has emerged as an especially promising direction for world modeling and MBRL. Concurrently, diffusion-style objectives have gained traction across multiple domains, with iterative refinement as a promising approach for multimodal and stochastic targets. Taken together, these trends motivate Joint Embedding DIffusion (JEDI), the first online end-to-end latent diffusion world model. JEDI learns its latent space directly from the diffusion denoising loss with a JEPA framework, using denoising to learn and predict future latents rather than relying on reconstruction and pretrained models. We provide a theoretical motivation showing that conventional JEPA objectives induce a predictive information bottleneck, and that conditional diffusion denoising admits a closely related predictive-compression decomposition. Empirically, JEDI is competitive on Atari100k and outperforms the baseline with separately trained latents where directly comparable. Relative to the pixel diffusion baseline, JEDI uses 43% less VRAM, achieves over 3$\times$ faster world-model sampling, and trains 2.5$\times$ faster. JEDI also exhibits a markedly different task-level performance profile from the pixel baseline, suggesting that end-to-end predictive latents change more than compute alone.
中文摘要 扩散世界模型近来在在线基于模型的强化学习中已具备竞争力，但当前方法暴露出一种矛盾：像素扩散有效但计算成本高，而最新的潜空间扩散方法提升了效率，性能却欠佳。后者还依赖于单独训练的潜变量，而非推动了现代MBRL大部分进展的端到端世界模型目标。特别是，JEPA式预测表示学习已成为世界建模和MBRL中一个特别有前景的方向。与此同时，扩散式目标在多个领域受到关注，迭代细化被视为处理多模态和随机目标的有前景的方法。综合来看，这些趋势催生了联合嵌入扩散（Joint Embedding DIffusion，JEDI），这是首个在线端到端潜空间扩散世界模型。JEDI在JEPA框架下直接从扩散去噪损失中学习其潜空间，利用去噪来学习和预测未来潜变量，而非依赖重建和预训练模型。我们给出了理论动机，表明传统JEPA目标会导出一个预测信息瓶颈，而条件扩散去噪具有与之密切相关的预测-压缩分解。实证上，JEDI在Atari100k上具有竞争力，并且在可直接比较的情形下优于采用单独训练潜变量的基线。相较于像素扩散基线，JEDI的显存占用减少43%，世界模型采样速度快3倍以上，训练速度快2.5倍。JEDI还表现出与像素基线明显不同的任务级性能特征，表明端到端预测潜变量带来的改变不只是计算量。

Bridging Domain Gaps with Target-Aligned Generation for Offline Reinforcement Learning

通过目标对齐生成弥合领域差距，实现离线强化学习

Authors: Minung Kim, Jeongmo Kim, Gwanwoo Choi, Seungyul Han
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.13054
Pdf link: https://arxiv.org/pdf/2605.13054
Abstract Cross-domain offline reinforcement learning aims to adapt a policy from a source domain to a target domain using only pre-collected datasets, where environment dynamics may differ. A key challenge is to leverage source data while reducing distributional mismatch, particularly when the target dataset is extremely limited. To address this, we propose Target-aligned Coverage Expansion (TCE), a framework that decides how source data should be used, either by directly incorporating target-near transitions or by expanding state coverage through target-aligned generation, guided by theoretical analysis. TCE builds on a dual score-based generative model to synthesize target-consistent transitions over an expanded state region. Extensive experiments across diverse cross-domain environments show that TCE consistently outperforms state-of-the-art cross-domain offline RL baselines.
中文摘要 跨域离线强化学习旨在仅使用预先收集的数据集，将策略从源域适配到目标域，而两个域的环境动态可能不同。一个关键挑战是如何在利用源数据的同时减少分布不匹配，尤其是在目标数据集极其有限的情况下。为此，我们提出了目标对齐覆盖扩展（TCE）框架，它在理论分析的指导下决定源数据的使用方式：要么直接纳入接近目标域的转移，要么通过目标对齐生成来扩展状态覆盖。TCE基于双重基于分数的生成模型，在扩展后的状态区域上合成与目标域一致的转移。在多种跨域环境中的大量实验表明，TCE始终优于最先进的跨域离线强化学习基线。

What to Ignore, What to React: Visually Robust RL Fine-Tuning of VLA Models

忽略什么，反应什么：视觉上稳健的强化学习VLA模型微调

Authors: Yuanfang Peng, Jingjing Fu, Chuheng Zhang, Li Zhao, Jiang Bian, Mingyu Liu, Ling Zhang, Jun Zhang, Rui Wang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.13105
Pdf link: https://arxiv.org/pdf/2605.13105
Abstract Reinforcement learning (RL) fine-tuning has shown promise for Vision-Language-Action (VLA) models in robotic manipulation, but deployment-time visual shifts pose practical challenges. A key difficulty is that standard task rewards supervise task success, but offer limited guidance on whether a visual change is task-irrelevant or changes the behavior required for manipulation. We propose PAIR-VLA (Paired Action Invariance & Sensitivity for Visually Robust VLA), an RL fine-tuning framework to address this difficulty by adding two auxiliary objectives over paired visual variants during PPO optimization: an invariance term that reduces the discrepancy between action distributions for a task-preserving pair (e.g., different distractors), and a sensitivity objective that encourages separable action distributions for a task-altering pair (e.g., target object in a different pose). Together, these objectives turn visual variants from mere observation diversity into behavior-level guidance on policy responses during RL fine-tuning. We evaluate on ManiSkill3 across two representative VLA architectures, OpenVLA and $\pi_{0.5}$, under diverse out-of-distribution visual shifts including unseen distractors, texture changes, target object pose variation, viewpoint shifts, and lighting changes. Our method consistently improves over standard PPO, achieving average improvements of 16.62% on $\pi_{0.5}$ and 9.10% on OpenVLA. Notably, ablations further show generalization across visual shifts: invariance guidance learned from distractor and texture variants transfers to target-pose and lighting shifts, while adding sensitivity guidance on target-pose variants further improves robustness to nuisance shifts, highlighting the broader transferability of behavior-level RL guidance.
中文摘要 强化学习（RL）微调已在机器人操作中的视觉-语言-行动（VLA）模型上展现出潜力，但部署时的视觉偏移带来了实际挑战。一个关键难点在于，标准任务奖励只监督任务是否成功，而对于某个视觉变化是与任务无关、还是改变了操作所需的行为，只能提供有限的指导。我们提出了PAIR-VLA（Paired Action Invariance & Sensitivity for Visually Robust VLA），这是一种强化学习微调框架，通过在PPO优化过程中对成对的视觉变体增加两个辅助目标来解决这一难题：一个不变性项，用于减小任务保持型配对（例如不同的干扰物）之间动作分布的差异；以及一个敏感性目标，鼓励任务改变型配对（例如目标物体处于不同姿态）的动作分布可分离。这些目标共同将视觉变体从单纯的观测多样性转变为强化学习微调期间对策略响应的行为层面指导。我们在ManiSkill3上评估了两种具有代表性的VLA架构OpenVLA和$\pi_{0.5}$，涵盖多种分布外视觉偏移，包括未见过的干扰物、纹理变化、目标物体姿态变化、视角偏移和光照变化。我们的方法相较标准PPO取得了一致的提升，在$\pi_{0.5}$上平均提升16.62%，在OpenVLA上平均提升9.10%。值得注意的是，消融实验进一步展示了跨视觉偏移的泛化：从干扰物和纹理变体中学到的不变性指导可迁移到目标姿态和光照偏移，而在目标姿态变体上增加敏感性指导则进一步提升了对无关偏移的鲁棒性，凸显了行为层面强化学习指导更广泛的可迁移性。

ERPPO: Entropy Regularization-based Proximal Policy Optimization

ERPPO：基于熵正则化的近端策略优化

Authors: Changha Lee, Gyusang Cho
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.13131
Pdf link: https://arxiv.org/pdf/2605.13131
Abstract Multi-Agent Proximal Policy Optimization (MAPPO) is a variant of the Proximal Policy Optimization (PPO) algorithm, specifically tailored for multi-agent reinforcement learning (MARL). MAPPO optimizes cooperative multi-agent settings by employing a centralized critic with decentralized actors. However, in multi-dimensional environments, MAPPO cannot extract an optimal policy because agent observations are non-stationary. To overcome this problem, we introduce a novel approach, Entropy Regularization-based Proximal Policy Optimization (ERPPO). For policy optimization, we first define the object detection ambiguity under a multi-dimensional observation environment. A Distributional Spatiotemporal Ambiguity (DSA) learner is trained to estimate object detection uncertainty under non-stationary constraints. Then, we enhance PPO with a novel entropy regularization term. This regularization dynamically adjusts the policy update by applying a stronger (L1) regularization under high-ambiguity observations to encourage significant exploratory actions and a weaker (L2) regularization under low-ambiguity observations to stabilize the proximal policy optimization. This approach is designed to enhance the probability of successful object localization in time-critical operations by reducing detection failures and optimizing the search policy. Experiments on a testbed with AirSim-based maritime search scenarios show that the proposed ERPPO improves accuracy. Our proposed method yields a higher gradient than MAPPO. Qualitative results confirm ERPPO's effectiveness in suppressing false detections under visually uncertain conditions.
中文摘要 多智能体近端策略优化（MAPPO）是近端策略优化（PPO）算法的一个变体，专为多智能体强化学习（MARL）设计。MAPPO通过采用集中式评论家（critic）和去中心化执行者（actor）来优化合作式多智能体场景。然而，在多维环境下，由于智能体观测非平稳，MAPPO无法提取最优策略。为克服这一问题，我们引入了一种新方法：基于熵正则化的近端策略优化（ERPPO）。在策略优化方面，我们首先定义了多维观测环境下的目标检测歧义性。我们训练分布时空歧义性（DSA）学习器，用于估计非平稳约束下的目标检测不确定性。然后，我们用一个新的熵正则化项增强PPO。该正则化动态调整策略更新：在高歧义性观测中施加较强的（L1）正则化以鼓励显著的探索性动作，在低歧义性观测中施加较弱的（L2）正则化以稳定近端策略优化。该方法旨在通过减少检测失败和优化搜索策略，提高时间关键型任务中成功定位目标的概率。在基于AirSim的海上搜索场景测试平台上的实验表明，所提出的ERPPO提高了准确率。我们提出的方法相比MAPPO获得了更高的梯度。定性结果证实了ERPPO在视觉不确定条件下抑制误检的有效性。

Finding the Weakest Link: Adversarial Attack against Multi-Agent Communications

寻找最薄弱环节：针对多智能体通信的对抗性攻击

Authors: Maxwell Standen, Junae Kim, Claudia Szabo
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2605.13170
Pdf link: https://arxiv.org/pdf/2605.13170
Abstract Multi-agent systems rely on communication for information sharing and action coordination, which exposes a vulnerability to attacks. We investigate single-victim communication perturbation attacks against Multi-Agent Reinforcement Learning-trained systems and propose methods that use gradient information from the Jacobian to identify which messages, agents, and timesteps are most susceptible to attack and have the greatest impact on the system. We enhance these methods with two proposed adversarial loss functions that trade off attack success for attack impact and also create more effective perturbations. We empirically demonstrate the effectiveness of our methods against two different multi-agent communication methods in navigation, PredatorPrey, and TrafficJunction environments. Our results show that our novel message selection method achieves a similar or greater impact than random message selection across almost all tested scenarios. Our victim selection, message selection, tempo, and loss functions improve attack effectiveness in half of the thirty scenarios we tested.
中文摘要 多智能体系统依赖通信来共享信息和协调行动，这也带来了易受攻击的弱点。我们研究针对经多智能体强化学习训练的系统的单受害者通信扰动攻击，并提出利用雅可比矩阵的梯度信息来识别哪些消息、智能体和时间步最易受攻击且对系统影响最大的方法。我们用两种新提出的对抗损失函数来增强这些方法，这些损失函数在攻击成功率与攻击影响之间进行权衡，同时也能产生更有效的扰动。我们在导航、PredatorPrey和TrafficJunction环境中，针对两种不同的多智能体通信方法，通过实验证明了我们方法的有效性。结果表明，我们新提出的消息选择方法在几乎所有测试场景中都取得了与随机消息选择相当或更大的影响。我们的受害者选择、消息选择、攻击时机（tempo）和损失函数在所测试的三十个场景中的一半提升了攻击效果。

Switching Successor Measures for Hierarchical Zero-shot Reinforcement Learning

面向层级零样本强化学习的切换后继测度

Authors: Stefan Stojanovic, Alexandre Proutiere
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.13207
Pdf link: https://arxiv.org/pdf/2605.13207
Abstract Hierarchical reinforcement learning can improve generalization by decomposing long-horizon decision-making into simpler subproblems. However, existing approaches often rely on restrictive design choices, such as fixed temporal abstractions or goal-conditioned objectives, which largely confine them to goal-reaching tasks and limit their applicability to general reward functions. In this paper, we introduce switching successor measures, an extension of successor measures that enables hierarchical control in zero-shot reinforcement learning without additional supervision, fixed horizons, or manually designed subgoals. We show that switching successor measures arise naturally from classical successor measures while preserving their underlying structure. Building on this result, we propose FB $\pi$-Switch, an algorithm that extracts both a high-level subgoal-selection policy and a low-level control policy directly from forward-backward (FB) representations, allowing hierarchical behavior to emerge from a single learned representation. Experiments on both goal-conditioned and general reward-based tasks show that FB $\pi$-Switch improves over non-hierarchical baselines and matches state-of-the-art hierarchical methods in goal-conditioned settings. These results demonstrate that structured successor representations provide a flexible foundation for hierarchical zero-shot reinforcement learning beyond goal-reaching tasks. Our project website is available at: this https URL.
中文摘要 层级强化学习可以通过将长时程决策分解为更简单的子问题来提升泛化能力。然而，现有方法往往依赖于限制性的设计选择，如固定的时间抽象或以目标为条件的目标函数，这在很大程度上将其局限于目标到达任务，并限制了其对一般奖励函数的适用性。本文提出了切换后继测度，它是后继测度的一种扩展，使得零样本强化学习无需额外监督、固定时域或人工设计的子目标即可实现层级控制。我们表明，切换后继测度可以自然地从经典后继测度导出，同时保持其底层结构。基于这一结果，我们提出了FB $\pi$-Switch算法，该算法直接从前向-后向（FB）表示中提取高层子目标选择策略和低层控制策略，使层级行为能够从单一的学习表示中涌现。在目标条件任务和基于一般奖励的任务上的实验表明，FB $\pi$-Switch优于非层级基线，并且在目标条件设置下与最先进的层级方法相当。这些结果表明，结构化后继表示为超越目标到达任务的层级零样本强化学习提供了灵活的基础。我们的项目网站见：此 https URL。

GAGPO: Generalized Advantage Grouped Policy Optimization

GAGPO：广义优势分组策略优化

Authors: Siyuan Zhu, Chao Yu, Rongxin Yang, Zongkai Liu, Jinjun Hu, Qiwen Chen, Yibo Zhang
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.13217
Pdf link: https://arxiv.org/pdf/2605.13217
Abstract Reinforcement learning has become a powerful paradigm for post-training large language model agents, yet credit assignment in multi-turn environments remains a challenge. Agents often receive sparse, trajectory-level rewards only at the end of an episode, making it difficult to determine which intermediate actions contributed to success or failure. As a result, propagating delayed outcomes back to individual decision steps without relying on costly auxiliary value models remains an open problem. We propose Generalized Advantage Grouped Policy Optimization (GAGPO), a critic-free reinforcement learning method for precise, step-aligned temporal credit assignment. GAGPO constructs a non-parametric grouped value proxy from sampled rollouts and uses it to compute TD/GAE-style temporal advantages, recursively propagating outcome supervision backward through time. Combined with group-wise advantage normalization and an action-level importance ratio, GAGPO extracts stable, localized optimization signals directly from multi-turn trajectories. Experiments on ALFWorld and WebShop show that GAGPO outperforms strong reinforcement learning baselines. Further analyses demonstrate faster early-stage learning, improved interaction efficiency, and smoother optimization dynamics, suggesting that GAGPO offers a simple yet effective framework for multi-turn agentic reinforcement learning.
中文摘要 强化学习已成为对大型语言模型智能体进行后训练的强大范式，但多轮环境中的信用分配仍是一个挑战。智能体通常只在回合结束时获得稀疏的轨迹级奖励，难以判断哪些中间动作促成了成功或失败。因此，如何在不依赖昂贵的辅助价值模型的情况下，将延迟的结果回传到各个决策步骤，仍是一个悬而未决的问题。我们提出了广义优势分组策略优化（GAGPO），这是一种无评论家（critic-free）的强化学习方法，用于精确的、按步对齐的时间信用分配。GAGPO从采样的rollout中构建一个非参数的分组价值代理，并用它计算TD/GAE风格的时间优势，将结果监督沿时间递归地向后传播。结合分组优势归一化和动作级重要性比率，GAGPO直接从多轮轨迹中提取稳定、局部化的优化信号。在ALFWorld和WebShop上的实验表明，GAGPO优于强大的强化学习基线。进一步的分析显示出更快的早期学习、更高的交互效率和更平滑的优化动态，表明GAGPO为多轮智能体强化学习提供了一个简单而有效的框架。
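
The TD/GAE-style backward propagation mentioned in this entry follows the standard generalized advantage estimation recursion, sketched below. How GAGPO builds its grouped value proxy is not specified here, so `values` is simply an assumed input standing in for whatever per-step proxy is used, and `gae_advantages` is a hypothetical name.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Standard GAE recursion over one finished episode.

    delta_t = r_t + gamma * V_{t+1} - V_t   (V after the last step is 0)
    A_t     = delta_t + gamma * lam * A_{t+1}
    """
    advantages = [0.0] * len(rewards)
    next_value, next_adv = 0.0, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        next_adv = delta + gamma * lam * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    return advantages

# Sparse outcome reward at the final turn only; constant value proxy.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.5, 0.5]
print(gae_advantages(rewards, values, gamma=1.0, lam=1.0))
print(gae_advantages(rewards, values, gamma=1.0, lam=0.0))
```

With `lam=1.0` the terminal outcome reaches every earlier step, while `lam=0.0` keeps only the one-step TD error, which is the trade-off that makes the credit assignment step-aligned rather than trajectory-level.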

An Agentic AI Framework with Large Language Models and Chain-of-Thought for UAV-Assisted Logistics Scheduling with Mobile Edge Computing

面向结合移动边缘计算的无人机辅助物流调度的大型语言模型与思维链智能体AI框架

Authors: Hanwen Zhang, Dusit Niyato, Wei Zhang, Xin Lou, Malcolm Yoke Hean Low
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.13221
Pdf link: https://arxiv.org/pdf/2605.13221
Abstract In cloud manufacturing, unmanned aerial vehicles (UAVs) can support both product collection and mobile edge computing (MEC). This joint operation forms a hybrid scheduling problem, where physical logistics decisions are coupled with computational task scheduling. In this paper, UAVs collect finished products from manufacturing stations and transport them back to a central depot. Meanwhile, computational tasks generated by industrial sensor devices at these stations are processed locally, at UAVs, or offloaded via UAVs to the cloud. This coupling makes the problem challenging. A UAV can provide MEC services only during its service window at a station, so routing decisions directly determine when UAV-assisted offloading is available. Routing decisions also affect the UAV energy budget and the availability of onboard computing and communication resources for computational task execution under task deadline constraints. To address this, we propose an agentic-AI-assisted optimization framework with two components. First, we develop an agentic AI that combines large language models, retrieval-augmented generation, and chain-of-thought reasoning to translate user input into an interpretable mathematical formulation for the hybrid scheduling problem. Second, we design a hierarchical deep reinforcement learning approach based on proximal policy optimization (PPO), where the upper layer learns UAV routing and the lower layer optimizes per-slot task execution and resource allocation. Simulation results show that the proposed framework yields more consistent formulations, while the hierarchical PPO achieves full product collection in 99.6% of the last 500 episodes and maintains a 100% deadline satisfaction rate, with more stable performance than the advantage actor-critic approach.
中文摘要 在云制造中，无人机（UAV）既可以支持产品收集，也可以支持移动边缘计算（MEC）。这种联合运行构成了一个混合调度问题，其中物理物流决策与计算任务调度相互耦合。本文中，无人机从制造站点收集成品并运回中央仓库。与此同时，这些站点的工业传感器设备产生的计算任务可在本地处理、在无人机上处理，或经由无人机卸载到云端。这种耦合使问题具有挑战性。无人机只能在其位于站点的服务窗口内提供MEC服务，因此路径决策直接决定了何时可以进行无人机辅助卸载。路径决策还会影响无人机的能量预算，以及在任务截止期限约束下执行计算任务时机载计算和通信资源的可用性。为此，我们提出了一个由两部分组成的智能体AI辅助优化框架。首先，我们开发了一个结合大型语言模型、检索增强生成和思维链推理的智能体AI，将用户输入转化为该混合调度问题的可解释数学表述。其次，我们设计了一种基于近端策略优化（PPO）的分层深度强化学习方法，其中上层学习无人机路径规划，下层优化每个时隙的任务执行和资源分配。仿真结果表明，所提框架生成的问题表述更加一致，而分层PPO在最后500个回合中有99.6%实现了全部产品收集，并保持100%的截止期限满足率，性能比优势演员-评论家（advantage actor-critic）方法更稳定。

Teacher-Guided Policy Optimization for LLM Distillation

面向LLM蒸馏的教师引导策略优化

Authors: Xinyu Liu, Kechen Jiao, Chunyang Xiao, Runsong Zhao, Junhao Ruan, Bei Li, Jiahao Liu, Qifan Wang, Xin Chen, Jingang Wang, Tong Xiao, JingBo Zhu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.13230
Pdf link: https://arxiv.org/pdf/2605.13230
Abstract The convergence of reinforcement learning and imitation learning has positioned Reverse KL (RKL) as a promising paradigm for on-policy LLM distillation, aiming to unify exploration with teacher supervision. However, we identify a critical limitation: when the student and teacher distributions diverge significantly, standard RKL often fails to yield meaningful improvement due to uninformative negative feedback. To address this inefficiency, we propose Teacher-Guided Policy Optimization (TGPO), an on-policy algorithm that incorporates dense directional guidance by leveraging teacher predictions conditioned on the student's rollout. Because TGPO remains on-policy, the algorithm integrates seamlessly with existing RLVR frameworks without requiring additional data annotation. Experiments on complex reasoning benchmarks demonstrate that TGPO significantly outperforms standard baselines and is robust to different teachers.
中文摘要 强化学习与模仿学习的融合使反向KL（RKL）成为同策略LLM蒸馏的一种有前景的范式，旨在将探索与教师监督统一起来。然而，我们发现了一个关键局限：当学生与教师的分布差异显著时，标准RKL常常因负反馈缺乏信息量而无法带来有意义的改进。为解决这一低效问题，我们提出了教师引导策略优化（TGPO），这是一种同策略算法，它利用以学生rollout为条件的教师预测来引入密集的方向性指导。由于TGPO保持同策略，该算法可与现有RLVR框架无缝集成，无需额外的数据标注。在复杂推理基准上的实验表明，TGPO显著优于标准基线，并且对不同教师具有鲁棒性。

Submodular Multi-Agent Policy Learning for Online Distributed Task Allocation in Open Multi-Agent Systems

用于开放多智能体系统中在线分布式任务分配的次模多智能体策略学习

Authors: Jing Liu, Yangyang Yang, Luca Ballotta, Fangfei Li, Yang Tang, Ruggero Carli
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2605.13269
Pdf link: https://arxiv.org/pdf/2605.13269
Abstract This paper studies multi-agent reinforcement learning with submodular team utilities for online distributed task allocation. In this setting, each agent selects one action from a local categorical policy, so feasible joint actions form a partition matroid over agent-action pairs. Classical multilinear extensions use independent Bernoulli sampling and therefore do not match the categorical policies executed by decentralized agents. To address this mismatch, we introduce the Partition Multilinear Extension (PME), a continuous relaxation whose value equals the expected team utility under factorized categorical policies. We prove that submodular difference rewards provide unbiased PME marginal-gradient information and yield a stagewise score-function policy-gradient estimator. Based on this connection, we propose SubMAPG, a centralized-training decentralized-execution policy-gradient framework with masked categorical policies and submodular difference-reward training signals. For the associated PME marginal-space projected stochastic-gradient dynamics, we prove a stagewise 1/2-approximation guarantee and sublinear dynamic regret in slowly varying environments, measured by the path length of the optimal PME marginals. To handle open systems with time-varying agents and targets, we instantiate SubMAPG with graph neural network policies. Experiments on multi-robot coverage and multi-target tracking show that SubMAPG outperforms local greedy and shared-reward baselines and is competitive with centralized myopic greedy strategies.
中文摘要 本文研究具有次模团队效用的多智能体强化学习，用于在线分布式任务分配。在该设置下，每个智能体从局部类别策略中选择一个动作，因此可行的联合动作构成智能体-动作对上的划分拟阵。经典多线性扩展使用独立伯努利采样，因此与去中心化智能体所执行的类别策略不匹配。为解决这一不匹配，我们引入了划分多线性扩展（PME），这是一种连续松弛，其值等于因子化类别策略下的期望团队效用。我们证明了次模差分奖励可提供无偏的PME边际梯度信息，并导出一个分阶段的得分函数策略梯度估计器。基于这一联系，我们提出了SubMAPG，一个采用掩码类别策略和次模差分奖励训练信号的集中训练、去中心化执行的策略梯度框架。对于相应的PME边际空间投影随机梯度动力学，我们证明了分阶段的1/2近似保证，以及在缓慢变化环境（以最优PME边际的路径长度度量）中的次线性动态遗憾。为了处理智能体和目标随时间变化的开放系统，我们用图神经网络策略实例化SubMAPG。在多机器人覆盖和多目标跟踪上的实验表明，SubMAPG优于局部贪心和共享奖励基线，并可与集中式短视贪心策略相媲美。
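
The entry describes the PME as a relaxation whose value equals the expected team utility under factorized categorical policies. For tiny problems that expectation can be computed by brute-force enumeration, as in the sketch below. The function names and the coverage utility are hypothetical illustrations; the paper's estimator and training loop are not reproduced here.

```python
from itertools import product

def partition_multilinear_extension(policies, utility):
    """Exact expected utility when agent i samples one action from policies[i].

    policies: list of per-agent probability vectors (categorical policies).
    utility:  function mapping a joint action tuple to a real number.
    Enumeration is exponential in the number of agents (toy use only).
    """
    value = 0.0
    for joint in product(*(range(len(p)) for p in policies)):
        prob = 1.0
        for p, a in zip(policies, joint):
            prob *= p[a]               # factorized (independent) policies
        value += prob * utility(joint)
    return value

def coverage(joint_action):
    """Toy submodular team utility: number of distinct targets chosen."""
    return float(len(set(joint_action)))

# Agent 0 always picks target 0; agent 1 is uniform over targets 0 and 1.
print(partition_multilinear_extension([[1.0, 0.0], [0.5, 0.5]], coverage))
```

Unlike independent Bernoulli sampling over agent-action pairs, each agent here always selects exactly one action, which matches the partition-matroid constraint described above.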

D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models

D-VLA：一种面向视觉-语言-行动模型的高并发分布式异步强化学习框架

Authors: Yucheng Guo, Yongjian Guo, Zhong Guan, Wen Huang, Haoran Sun, Haodong Yue, Xiaolong Xiang, Shuai Di, Zhen Sun, Luqiao Wang, Junwu Xiong, Yicheng Gong
Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.13276
Pdf link: https://arxiv.org/pdf/2605.13276
Abstract The rapid evolution of Embodied AI has enabled Vision-Language-Action (VLA) models to excel in multimodal perception and task execution. However, applying Reinforcement Learning (RL) to these massive models in large-scale distributed environments faces severe systemic bottlenecks, primarily due to the resource conflict between high-fidelity physical simulation and the intensive VRAM/bandwidth demands of deep learning. This conflict often leaves overall throughput constrained by execution-phase inefficiencies. To address these challenges, we propose D-VLA, a high-concurrency, low-latency distributed RL framework for large-scale embodied foundation models. D-VLA introduces "Plane Decoupling," physically isolating high-frequency training data from low-frequency weight control to eliminate interference between simulation and optimization. We further design a four-thread asynchronous "Swimlane" pipeline, enabling full parallel overlap of sampling, inference, gradient computation, and parameter distribution. Additionally, a dual-pool VRAM management model and topology-aware replication resolve memory fragmentation and optimize communication efficiency. Experiments on benchmarks like LIBERO show that D-VLA significantly outperforms mainstream RL frameworks in throughput and sampling efficiency for billion-parameter VLA models. In trillion-parameter scalability tests, our framework maintains exceptional stability and linear speedup, providing a robust system for high-performance general-purpose embodied agents.
中文摘要 具身人工智能的快速发展使视觉-语言-行动（VLA）模型在多模态感知和任务执行方面表现出色。然而，在大规模分布式环境中将强化学习（RL）应用于这些庞大模型时，面临严重的系统性瓶颈，主要源于高保真物理仿真与深度学习对显存/带宽的密集需求之间的资源冲突。这种冲突常常使整体吞吐量受限于执行阶段的低效。为应对这些挑战，我们提出了D-VLA，一个面向大规模具身基础模型的高并发、低延迟分布式强化学习框架。D-VLA引入了“平面解耦”（Plane Decoupling），将高频训练数据与低频权重控制在物理上隔离，消除仿真与优化之间的干扰。我们还设计了一个四线程异步“泳道”（Swimlane）流水线，实现采样、推理、梯度计算和参数分发的完全并行重叠。此外，双池显存管理模型和拓扑感知复制解决了内存碎片问题并优化了通信效率。在LIBERO等基准上的实验表明，对于十亿参数级VLA模型，D-VLA在吞吐量和采样效率上显著优于主流强化学习框架。在万亿参数级的可扩展性测试中，我们的框架保持了出色的稳定性和线性加速，为高性能通用具身智能体提供了稳健的系统。

GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models

GRIP-VLM：高效视觉语言模型中的群体相对重要性剪枝

Authors: Mingzhe Huang, Weijun Wang, Xin Ding, Liang Mi, Hao Wen, Yuanchun Li, Lichen Pang, Shansong Yang, Yunxin Liu, Ting Cao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.13375
Pdf link: https://arxiv.org/pdf/2605.13375
Abstract In Vision-Language Models (VLMs), processing a massive number of visual tokens incurs prohibitive computational overhead. While recent training-aware pruning methods attempt to selectively discard redundant tokens, they largely rely on continuous-gradient relaxations. However, visual token pruning is inherently a discrete, non-convex combinatorial problem; consequently, these continuous approximations frequently trap the optimization in sub-optimal local minima, especially under aggressive compression budgets. To overcome this fundamental bottleneck, we propose GRIP-VLM, a Group-Relative Importance Pruning framework driven by Reinforcement Learning. Rather than relying on smooth-gradient assumptions, GRIP-VLM formulates pruning as a Markov Decision Process, employing a Group Relative Policy Optimization (GRPO) paradigm anchored by supervised warm-up to directly explore the discrete selection space. Integrated with a budget-aware scorer, our lightweight agent dynamically evaluates per-token importance and adapts to arbitrary compression ratios without retraining. Extensive experiments across diverse multimodal benchmarks demonstrate that GRIP-VLM consistently outperforms heuristic and supervised-learning baselines, achieving a superior Pareto frontier and delivering up to a 15% inference speedup at equal accuracy.
中文摘要 在视觉语言模型（VLM）中，处理大量视觉词元（token）会产生巨大的计算开销。虽然最新的训练感知剪枝方法试图选择性地丢弃冗余词元，但它们主要依赖于连续梯度松弛。然而，视觉词元剪枝本质上是一个离散的、非凸的组合问题；因此，这些连续近似常常使优化陷入次优的局部极小值，尤其是在激进的压缩预算下。为克服这一根本瓶颈，我们提出了GRIP-VLM，一种由强化学习驱动的群体相对重要性剪枝框架。GRIP-VLM不依赖平滑梯度假设，而是将剪枝表述为马尔可夫决策过程，采用以监督预热为锚点的群体相对策略优化（GRPO）范式，直接探索离散选择空间。结合预算感知评分器，我们的轻量级智能体动态评估每个词元的重要性，并能适应任意压缩比而无需重新训练。在多种多模态基准上的大量实验表明，GRIP-VLM始终优于启发式和监督学习基线，实现了更优的帕累托前沿，并在同等准确率下带来最高15%的推理加速。
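
The group-relative step named in the abstract can be illustrated with a minimal sketch. It assumes the common GRPO-style normalization, in which each sampled candidate is scored against the mean and standard deviation of its own group. The reward values and the function name `group_relative_advantages` are hypothetical and not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled candidate relative to the other samples in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four token-keep masks sampled for the same input.
rewards = [0.62, 0.70, 0.55, 0.81]
advantages = group_relative_advantages(rewards)

# Index of the mask that the policy update would favor most.
best = max(range(len(rewards)), key=lambda i: advantages[i])
print(best, [round(a, 2) for a in advantages])
```

Because the baseline is computed from the group itself, no learned value function is needed, which is one reason this style of update suits discrete selection problems.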

Trajectory-Level Data Augmentation for Offline Reinforcement Learning

轨迹级数据增强用于离线强化学习

Authors: Tobias Schmähling, Matthias Burkhardt, Tobias Windisch
Subjects: Machine Learning (cs.LG); Robotics (cs.RO); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2605.13401
Pdf link: https://arxiv.org/pdf/2605.13401
Abstract We propose a data augmentation method for offline reinforcement learning, motivated by active positioning problems. Particularly, our approach enables the training of off-policy models from a limited number of suboptimal trajectories. We introduce a trajectory-based augmentation technique that exploits task structure and the geometric relationship between rewards, value functions, and mathematical properties of logging policies. During data collection, our augmentation supports suboptimal logging policies, leading to higher data quality and improved offline reinforcement learning performance. We provide theoretical justification for these strategies and validate them empirically across positioning tasks of varying dimensionality and under partial observability.
中文摘要 我们提出了一种受主动定位问题启发的离线强化学习数据增强方法。具体而言，我们的方法能够从数量有限的次优轨迹中训练离策略（off-policy）模型。我们引入了一种基于轨迹的增强技术，利用任务结构以及奖励、价值函数与数据采集策略（logging policy）数学性质之间的几何关系。在数据收集过程中，我们的增强方法支持次优的数据采集策略，从而提高数据质量并提升离线强化学习性能。我们为这些策略提供了理论依据，并在不同维度和部分可观测条件下的定位任务中进行了实证验证。

Q-Flow: Stable and Expressive Reinforcement Learning with Flow-Based Policy

Q-Flow：基于流策略的稳定且富有表达力的强化学习

Authors: JaeHyeok Doo, Byeongguk Jeon, Seonghyeon Ye, Kimin Lee, Minjoon Seo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.13435
Pdf link: https://arxiv.org/pdf/2605.13435
Abstract There is growing interest in utilizing flow-based models as decision-making policies in reinforcement learning due to their high expressive capacity. However, effectively leveraging this expressivity for value maximization remains challenging, as naive gradient-based optimization requires backpropagating through numerical solvers and often leads to instability. Existing approaches typically address this issue by restricting the expressive capacity of flow-based policies, resulting in a trade-off between optimization stability and representational flexibility. To resolve this, we introduce Q-Flow, a framework that leverages the deterministic nature of flow dynamics to explicitly propagate terminal trajectory value to intermediate latent states along the policy-induced flow. This formulation enables stable policy optimization using intermediate value gradients without unrolling the numerical solver, effectively bridging the gap between stability and expressivity. We evaluate Q-Flow in the offline learning setting on the challenging OGBench suite, where it consistently outperforms state-of-the-art baselines by an average of 10.6 percentage points, while also enabling stable online adaptation within the same framework.
中文摘要 由于基于流的模型具有很强的表达能力，将其用作强化学习中决策策略的研究兴趣日益增长。然而，有效利用这种表达能力来实现价值最大化仍然具有挑战性，因为朴素的基于梯度的优化需要通过数值求解器进行反向传播，且常常导致不稳定。现有方法通常通过限制基于流的策略的表达能力来解决这一问题，从而导致优化稳定性与表示灵活性之间的权衡。为解决这一问题，我们提出了Q-Flow，该框架利用流动力学的确定性，沿策略诱导的流将终端轨迹价值显式地传播到中间潜在状态。这一形式使得无需展开数值求解器即可利用中间价值梯度进行稳定的策略优化，有效弥合了稳定性与表达能力之间的差距。我们在具有挑战性的OGBench套件上以离线学习设定评估了Q-Flow，它始终优于最先进的基线，平均领先10.6个百分点，同时还能在同一框架内实现稳定的在线适应。

PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning

PDCR：面向视觉-语言推理的感知分解置信度奖励

Authors: Hee Suk Yoon, Eunseop Yoon, Ji Woo Hong, SooHwan Eom, Gwanhyeong Koo, Mark Hasegawa-Johnson, Qi Dai, Chong Luo, Chang D. Yoo
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.13467
Pdf link: https://arxiv.org/pdf/2605.13467
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) traditionally relies on a sparse, outcome-based signal. Recent work shows that providing a fine-grained, model-intrinsic signal (rewarding the confidence growth in the ground-truth answer) effectively improves language reasoning training by providing step-level guidance without costly external models. While effective for unimodal text, we find that naively applying this global reward to vision-language (V-L) reasoning is a suboptimal strategy, as the task is a heterogeneous mix of sparse visual perception and dense textual reasoning. This global normalization creates mixture-induced signal degradation, where the training signal for visual steps is statistically distorted by the predominant textual steps. We propose Perception-Decomposed Confidence Reward (PDCR), a framework that solves this by aligning the reward structure with the task's heterogeneous nature. PDCR first performs an unsupervised skill decomposition, introducing a model-internal Visual Dependence Score to quantify visual reliance and applying a clustering algorithm to separate perception and reasoning steps. Based on this, PDCR computes a decomposed advantage by normalizing confidence gains within each skill cluster. This intra-cluster normalization provides a stable, correctly-scaled signal for both perception and reasoning. We demonstrate that PDCR outperforms the naive, global-reward formulation and sparse-reward baselines on key V-L reasoning benchmarks.
中文摘要 带可验证奖励的强化学习（RLVR）传统上依赖于稀疏的、基于结果的信号。最新研究表明，提供细粒度的模型内在信号（奖励模型对真实答案置信度的增长）可以在无需昂贵外部模型的情况下提供步骤级指导，从而有效改进语言推理训练。虽然这种方法对单模态文本有效，但我们发现，直接将这种全局奖励应用于视觉-语言（V-L）推理并非最优策略，因为该任务是稀疏视觉感知与密集文本推理的异质混合。这种全局归一化会造成混合引起的信号退化，即视觉步骤的训练信号被占主导的文本步骤在统计上扭曲。我们提出了感知分解置信度奖励（PDCR）框架，通过使奖励结构与任务的异质性相匹配来解决这一问题。PDCR首先进行无监督技能分解，引入模型内部的视觉依赖分数（Visual Dependence Score）来量化视觉依赖程度，并应用聚类算法将感知步骤与推理步骤分离。在此基础上，PDCR在每个技能簇内对置信度增益进行归一化，从而计算分解后的优势。这种簇内归一化为感知和推理都提供了稳定且尺度正确的信号。我们证明，PDCR在关键的V-L推理基准上优于朴素的全局奖励方案和稀疏奖励基线。
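
The contrast between global and intra-cluster normalization described above can be sketched as follows. This is only an illustration: the confidence gains are made-up numbers, the cluster labels are supplied by hand (the abstract derives them from a Visual Dependence Score and a clustering algorithm, which are not reproduced here), and the function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def global_advantages(gains, eps=1e-8):
    """Normalize all steps together, regardless of step type."""
    mu, sigma = mean(gains), pstdev(gains)
    return [(g - mu) / (sigma + eps) for g in gains]


def decomposed_advantages(gains, clusters, eps=1e-8):
    """Normalize each step only against steps in the same skill cluster."""
    groups = defaultdict(list)
    for gain, label in zip(gains, clusters):
        groups[label].append(gain)
    stats = {label: (mean(v), pstdev(v)) for label, v in groups.items()}
    return [(g - stats[c][0]) / (stats[c][1] + eps) for g, c in zip(gains, clusters)]


# Hypothetical per-step confidence gains: two small perception steps,
# followed by three larger reasoning steps.
gains = [0.02, 0.01, 0.30, 0.10, 0.20]
clusters = ["perception", "perception", "reasoning", "reasoning", "reasoning"]

glob = global_advantages(gains)
dec = decomposed_advantages(gains, clusters)
print([round(a, 2) for a in glob])
print([round(a, 2) for a in dec])
```

In this toy input, the better of the two perception steps receives a negative advantage under global normalization because reasoning steps dominate the statistics, whereas within its own cluster it receives a positive one.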

Sustainable Graph Analytics Workload Scheduling with Evolutionary Reinforcement Learning in Edge-Cloud Systems

边缘云系统中基于进化强化学习的可持续图分析工作负载调度

Authors: P. Ramicetty, H. Moore, S. Qi, A. Islam, M. Ghose, D. Milojicic, C. Bash, S. Pasricha
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2605.13489
Pdf link: https://arxiv.org/pdf/2605.13489
Abstract Graph analytics powers modern intelligent systems such as smart cities, cyber-physical infrastructure, IoT security, and large-scale social networks. As these workloads scale in complexity, their execution in heterogeneous edge-cloud environments results in higher energy use and carbon emission footprint. To address this challenge, we propose MERSEM, a multi-objective evolutionary reinforcement learning framework for sustainable edge-cloud system management. MERSEM integrates evolutionary search with reinforcement learning (RL) to solve the problem of graph workload allocation and scheduling. The evolutionary component explores diverse global solutions, while the RL agent refines decisions through adaptive local optimization. The framework is designed to jointly minimize service-level agreement (SLA) violations and carbon emissions by considering dynamic carbon intensity, resource heterogeneity, and workload characteristics. Experimental results demonstrate that MERSEM outperforms the state-of-the-art with up to 45% SLA violation reductions and up to 12% carbon emission reductions.
中文摘要 图分析驱动着现代智能系统，如智慧城市、信息物理基础设施、物联网安全和大规模社交网络。随着这些工作负载的复杂度不断增长，它们在异构边缘云环境中的执行会导致更高的能耗和碳排放足迹。为应对这一挑战，我们提出了MERSEM，一个用于可持续边缘云系统管理的多目标进化强化学习框架。MERSEM将进化搜索与强化学习（RL）相结合，解决图工作负载的分配与调度问题。进化部分探索多样化的全局解，而强化学习智能体则通过自适应局部优化来细化决策。该框架综合考虑动态碳强度、资源异构性和工作负载特性，旨在同时最小化服务水平协议（SLA）违规和碳排放。实验结果表明，MERSEM优于现有最先进方法，SLA违规最多减少45%，碳排放最多减少12%。

MARLIN: Multi-Agent Game-Theoretic Reinforcement Learning for Sustainable LLM Inference in Cloud Datacenters

MARLIN：多智能体博弈论强化学习，用于云数据中心可持续的大型语言模型推理

Authors: H. Moore, S. Qi, D. Milojicic, C. Bash, S. Pasricha
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.13496
Pdf link: https://arxiv.org/pdf/2605.13496
Abstract Large Language Models (LLMs) have become increasingly prevalent in cloud-based platforms, propelled by the introduction of AI-based consumer and enterprise services. LLM inference requests in particular account for up to 90% of total LLM lifecycle energy use, dwarfing training energy costs. The rising volume of LLM inference requests is increasing environmental footprints, particularly carbon emissions and water consumption. To improve sustainability for LLM inference serving in cloud datacenter environments, we propose a novel multi-agent game-theoretic reinforcement learning framework called MARLIN to co-optimize time-to-first token (TTFT), carbon emissions, water usage, and energy costs associated with LLM inference. MARLIN demonstrates a reduction of at least 18% in TTFT, 33% in carbon emissions, 43% in water usage, and 11% in energy costs compared to state-of-the-art LLM inference management frameworks.
中文摘要 在基于人工智能的消费级和企业级服务的推动下，大型语言模型（LLM）在云平台上日益普及。其中，LLM推理请求占LLM生命周期总能耗的比例高达90%，远超训练能耗。不断增长的LLM推理请求量正在加剧环境足迹，尤其是碳排放和用水量。为了提升云数据中心环境中LLM推理服务的可持续性，我们提出了一种名为MARLIN的新型多智能体博弈论强化学习框架，用于协同优化与LLM推理相关的首词元时间（TTFT）、碳排放、用水量和能源成本。与最先进的LLM推理管理框架相比，MARLIN使TTFT至少降低18%，碳排放降低33%，用水量降低43%，能源成本降低11%。

Scaling Retrieval-Augmented Reasoning with Parallel Search and Explicit Merging

通过并行搜索与显式合并扩展检索增强推理

Authors: Jiabei Liu, Wenyu Mao, Junfei Tan, Chunxu Shen, Lingling Yi, Jiancan Wu, Xiang Wang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.13534
Pdf link: https://arxiv.org/pdf/2605.13534
Abstract Deep search agents have proven effective in enhancing LLMs by retrieving external knowledge during multi-step reasoning. However, existing methods often generate a single query for retrieval at each reasoning step, limiting information coverage and introducing high noise. This may result in low signal-to-noise ratios (SNR) during search, degrading reasoning accuracy and leading to unnecessary reasoning steps. In this paper, we introduce MultiSearch, an RL-based framework that addresses these limitations through multi-query retrieval and explicit merging of retrieved information. At each reasoning step, MultiSearch generates queries from multiple perspectives and retrieves external information in parallel, expanding the scope of relevant information and mitigating the reliance on any single retrieval result. Then, the agent consolidates and refines retrieved information at the merging process, improving the SNR and ensuring more accurate reasoning. Additionally, we propose a reinforcement learning framework with a multi-process reward design to optimize agents for both multi-query retrieval and information consolidation. Extensive experiments on seven benchmarks demonstrate that MultiSearch outperforms baseline methods, enhancing the SNR of retrieval and improving reasoning performance in question-answering tasks.
中文摘要 深度搜索代理已被证明在多步推理中通过获取外部知识来增强大型语言模型（LLM）的效果。然而，现有方法通常在每个推理步骤生成单一查询，限制了信息覆盖范围并引入高噪声。这可能导致搜索过程中信噪比（SNR）降低，降低推理准确性，并导致不必要的推理步骤。本文介绍了MultiSearch，一种基于强化学习的框架，通过多查询检索和显式合并检索信息来解决这些局限性。在每个推理步骤，MultiSearch 从多个角度生成查询并并行检索外部信息，扩大相关信息的范围，减少对单一检索结果的依赖。然后，智能体在合并过程中整合和完善检索到的信息，提升信噪比，确保推理更准确。此外，我们提出了一个强化学习框架，采用多过程奖励设计，以优化代理在多查询检索和信息整合方面。七个基准测试的广泛实验表明，MultiSearch优于基线方法，提升了检索的信噪比，并提升了问答任务中的推理能力。

HLS-Seek: QoR-Aware Code Generation for High-Level Synthesis via Proxy Comparative Reward Reinforcement Learning

HLS-Seek：通过代理比较奖励强化学习实现高层综合的QoR感知代码生成

Authors: Qingyun Zou, Feng Yu, Hongshi Tan, Yao Chen, Bingsheng He, WengFai Wong
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.13536
Pdf link: https://arxiv.org/pdf/2605.13536
Abstract High-Level Synthesis (HLS) compiles algorithmic C/C++ descriptions into hardware, with Quality of Results (QoR), namely latency and resource utilization, critically governed by pragma configurations and code structure. Existing LLM-based HLS approaches train for functional correctness but ignore QoR entirely. We observe that reinforcement learning (RL) for HLS does not require absolute synthesis results, only relative comparisons between candidates. Based on this insight, we propose HLS-Seek, a QoR-aware NL-to-HLS framework that replaces expensive synthesis-in-the-loop RL with a comparative proxy reward model achieving 99.53% Pareto-dominance accuracy. To prevent reward hacking, we introduce uncertainty-aware Monte Carlo (MC) dropout switching that selectively invokes real Vitis HLS synthesis for low-confidence candidates and updates the proxy online, creating a self-improving reward system. HLS-Seek achieves 81.5% syntax correctness pass@1 and 81.4% Func@5 on HLS-eval with only 7B parameters, surpassing GPT-5.1 and other frontier models while achieving 8.5$\times$ faster training than real-reward RL. On QoR evaluation, HLS-Seek achieves the lowest latency on 16/30 kernels and Pareto-dominates HLS-specific baselines on 9 kernels.
中文摘要 高层次综合（HLS）将算法级的C/C++描述编译为硬件，其结果质量（QoR），即延迟和资源利用率，在很大程度上取决于pragma配置和代码结构。现有基于大型语言模型的HLS方法以功能正确性为训练目标，却完全忽视了QoR。我们观察到，面向HLS的强化学习（RL）并不需要绝对的综合结果，只需要候选方案之间的相对比较。基于这一见解，我们提出了HLS-Seek，一个QoR感知的自然语言到HLS（NL-to-HLS）框架，用比较式代理奖励模型取代昂贵的综合在环（synthesis-in-the-loop）强化学习，该代理模型达到了99.53%的帕累托支配判定准确率。为防止奖励黑客行为，我们引入了不确定性感知的蒙特卡洛（MC）dropout切换机制，对低置信度候选方案选择性地调用真实的Vitis HLS综合，并在线更新代理模型，从而形成自我改进的奖励系统。HLS-Seek仅用7B参数，就在HLS-eval上取得了81.5%的语法正确率pass@1和81.4%的Func@5，超过GPT-5.1及其他前沿模型，同时训练速度比使用真实奖励的强化学习快8.5倍。在QoR评估中，HLS-Seek在30个内核中的16个上实现了最低延迟，并在9个内核上帕累托支配HLS专用基线。
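
The uncertainty-based switching idea can be sketched in a few lines. This is a simplification under stated assumptions: the proxy is treated as returning a scalar score per stochastic forward pass (the abstract describes a comparative proxy), disagreement is measured by the standard deviation of those scores, and `select_reward`, `run_real_synthesis`, and the threshold value are all hypothetical names and settings.

```python
from statistics import mean, pstdev


def select_reward(proxy_samples, run_real_synthesis, threshold=0.1):
    """Trust the proxy when its MC-dropout samples agree; otherwise fall back.

    proxy_samples: scores from repeated stochastic forward passes of the proxy.
    run_real_synthesis: zero-argument callable standing in for the expensive tool.
    """
    spread = pstdev(proxy_samples)
    if spread > threshold:
        # Low confidence: pay for the real measurement.
        return run_real_synthesis(), "real"
    # High confidence: use the cheap proxy estimate.
    return mean(proxy_samples), "proxy"


# Stand-in for a real synthesis run; a fixed value keeps the example runnable.
confident = select_reward([0.80, 0.82, 0.79], run_real_synthesis=lambda: 0.5)
uncertain = select_reward([0.20, 0.90, 0.55], run_real_synthesis=lambda: 0.5)
print(confident[1], uncertain[1])
```

In a full system the results returned by the real tool would also be stored as new training data for the proxy, which is the online-update loop the abstract refers to.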

Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment

温度调节与倾斜通向SLOP：通过推理时对齐缓解奖励黑客

Authors: Ye Wang, Jing Liu, Toshiaki Koike-Akino
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.13537
Pdf link: https://arxiv.org/pdf/2605.13537
Abstract Inference-time alignment techniques offer a lightweight alternative or complement to costly reinforcement learning, while enabling continual adaptation as alignment objectives and reward targets evolve. Existing theoretical analyses justify these methods as approximations to sampling from distributions optimally tilted toward a given reward model. We extend these techniques by introducing reference-model temperature adjustment, which leads to further generalization of inference-time alignment to ensembles of generative reward models combined as a sharpened logarithmic opinion pool (SLOP). To mitigate reward hacking, we propose an algorithm for calibrating SLOP weight parameters and experimentally demonstrate that it improves robustness while preserving alignment performance.
中文摘要 推理时对齐技术为高成本的强化学习提供了轻量级的替代或补充方案，同时能够随着对齐目标和奖励目标的演变而持续适应。现有理论分析将这些方法解释为对某种分布的近似采样，该分布以最优方式向给定奖励模型倾斜。我们通过引入参考模型温度调节来扩展这些技术，从而将推理时对齐进一步推广到多个生成式奖励模型的集成，这些模型被组合为锐化对数意见池（SLOP）。为缓解奖励黑客行为，我们提出了一种校准SLOP权重参数的算法，并通过实验证明它在保持对齐性能的同时提升了鲁棒性。
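
For readers unfamiliar with logarithmic opinion pools, the sketch below shows the generic construction over a small finite support: the pooled distribution is proportional to the product of the component distributions raised to their weights. How the paper defines "sharpened", how it tempers the reference model, and how it calibrates weights are not stated in the abstract. Treating the first weight as a tempered-reference exponent and using weights that sum to more than one are assumptions made only for illustration.

```python
import math


def log_opinion_pool(dists, weights):
    """Return p(x) proportional to the product of dists[i][x] ** weights[i].

    All distributions must share the same support and be strictly positive.
    """
    support = list(dists[0])
    log_scores = {
        x: sum(w * math.log(d[x]) for d, w in zip(dists, weights)) for x in support
    }
    peak = max(log_scores.values())  # subtract the max for numerical stability
    unnorm = {x: math.exp(s - peak) for x, s in log_scores.items()}
    total = sum(unnorm.values())
    return {x: v / total for x, v in unnorm.items()}


# Hypothetical reference model and two hypothetical reward-model distributions.
reference = {"a": 0.5, "b": 0.3, "c": 0.2}
reward_1 = {"a": 0.2, "b": 0.6, "c": 0.2}
reward_2 = {"a": 0.3, "b": 0.5, "c": 0.2}

pooled = log_opinion_pool([reference, reward_1, reward_2], [1.0, 0.5, 0.5])
sharpened = log_opinion_pool([reference, reward_1, reward_2], [1.0, 2.0, 2.0])
print(round(pooled["b"], 3), round(sharpened["b"], 3))
```

Raising the reward-model weights concentrates more mass on the outcome those models prefer, which is also why poorly calibrated weights can amplify reward-model errors.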

Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation

基于对比近端策略优化的自监督同策略强化学习

Authors: Asim Osman, Sasha Abramowitz, Mark Bergh, Ulrich Armel Mbou Sob, Ruan John de Kock, Omayma Mahjoub, Oussama Hidaoui, Noah De Nicola, Arnol Manuel Fokam, Felix Chalumeau, Daniel Rajaonarivonivelomanantsoa, Siddarth Singh, Refiloe Shabe, Juan Claude Formanek, Simon Verster Du Toit, Arnu Pretorius
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.13554
Pdf link: https://arxiv.org/pdf/2605.13554
Abstract Contrastive reinforcement learning (CRL) learns goal-conditioned Q-values through a contrastive objective over state-action and goal representations, removing the need for hand-crafted reward functions. Despite impressive success in achieving viable self-supervised learning in RL, all existing CRL algorithms rely on off-policy optimisation and are mostly constrained to continuous action spaces, with little research invested in discrete environments. This leaves CRL disconnected from widely used and effective, modern on-policy training pipelines adopted across both single-agent and multi-agent RL in continuous and discrete environments. To establish a first connection, we introduce Contrastive Proximal Policy Optimisation (CPPO). CPPO is an on-policy contrastive RL algorithm that derives policy advantages directly from contrastive Q-values and optimises them via the standard PPO objective, without requiring a reward function or a replay buffer. We evaluate CPPO across continuous and discrete, single-agent and cooperative multi-agent tasks. Whilst the existence of an on-policy approach is inherently useful, we observe that CPPO not only significantly outperforms the previous CRL baselines in 14 out of 18 tasks, but also matches or exceeds PPO's performance, which uses hand-crafted dense rewards, in 12 out of the 18 tasks tested.
中文摘要 对比强化学习（CRL）通过在状态-动作表示与目标表示上的对比目标来学习以目标为条件的Q值，从而无需手工设计奖励函数。尽管CRL在强化学习中实现可行的自监督学习方面取得了令人瞩目的成功，但所有现有的CRL算法都依赖离策略（off-policy）优化，且大多局限于连续动作空间，针对离散环境的研究很少。这使得CRL与广泛使用且行之有效的现代同策略（on-policy）训练流水线脱节，而这类流水线已在连续和离散环境下的单智能体与多智能体强化学习中得到普遍采用。为建立二者之间的首个联系，我们提出了对比近端策略优化（CPPO）。CPPO是一种同策略对比强化学习算法，直接从对比Q值中推导策略优势，并通过标准PPO目标进行优化，无需奖励函数或回放缓冲区。我们在连续与离散、单智能体与合作式多智能体任务上评估了CPPO。同策略方法的存在本身就有价值，此外我们还观察到，CPPO不仅在18个任务中的14个上显著优于此前的CRL基线，而且在18个测试任务中的12个上达到或超过了使用手工设计稠密奖励的PPO的性能。

Achieving $ε^{-2}$ Sample Complexity for Single-Loop Actor-Critic under Minimal Assumptions

在最小假设下实现单循环Actor-Critic的 $ε^{-2}$ 样本复杂度

Authors: Ishaq Hamza, Zaiwei Chen
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2605.13639
Pdf link: https://arxiv.org/pdf/2605.13639
Abstract In this paper, we establish last-iterate convergence rates for off-policy actor--critic methods in reinforcement learning. In particular, under a single-loop, single-timescale implementation and a broad class of policy updates, including approximate policy iteration and natural policy gradient methods, we prove the first $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity guarantee for finding an $\epsilon$-optimal policy under minimal assumptions, namely, the existence of a policy that induces an irreducible Markov chain. This stands in stark contrast to the existing literature, where an $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity is achieved only through nested-loop updates and/or under strong, algorithm-dependent assumptions on the policies, such as uniform mixing and uniform exploration. Technically, to address the challenges posed by the coupled update equations arising from the single-loop implementation, as well as the potentially unbounded iterates induced by off-policy learning, our analysis is based on a coupled Lyapunov drift framework. Specifically, we establish a geometric convergence rate for the actor and an $\tilde{\mathcal{O}}(1/T)$ convergence rate for the critic, and combine the two Lyapunov drift inequalities through a cross-domination property. We believe this analytical framework is of independent interest and may be applicable to other coupled iterative algorithms with unbounded
中文摘要 本文建立了强化学习中离策略（off-policy）actor-critic方法的末次迭代收敛速率。具体而言，在单循环、单时间尺度的实现以及包括近似策略迭代和自然策略梯度方法在内的一大类策略更新下，我们在最小假设（即存在一个能诱导不可约马尔可夫链的策略）下，首次证明了寻找$\epsilon$最优策略的$\tilde{\mathcal{O}}(\epsilon^{-2})$样本复杂度保证。这与现有文献形成鲜明对比：现有文献中的$\tilde{\mathcal{O}}(\epsilon^{-2})$样本复杂度只能通过嵌套循环更新和/或在对策略施加较强且依赖于算法的假设（如一致混合和一致探索）下才能实现。在技术上，为应对单循环实现所产生的耦合更新方程以及离策略学习可能导致的无界迭代所带来的挑战，我们的分析基于耦合李雅普诺夫漂移框架。具体来说，我们为actor建立了几何收敛速率，为critic建立了$\tilde{\mathcal{O}}(1/T)$收敛速率，并通过交叉支配性质将两个李雅普诺夫漂移不等式结合起来。我们认为该分析框架具有独立的价值，可能适用于其他具有无界的耦合迭代算法

Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization

通过奖励去相关策略优化实现多目标与混合奖励的强化学习

Authors: Yang Bai, Kaiyuan Liu, Ziyuan Zhuang, Jiahong Zhou, Rongxiang Weng, Xin Chen, Jingang Wang, Xunliang Cai
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.13641
Pdf link: https://arxiv.org/pdf/2605.13641
Abstract Complex reinforcement learning environments frequently employ multi-task and mixed-reward formulations. In these settings, heterogeneous reward distributions and correlated reward dimensions often destabilize the construction of scalar advantages. To address these challenges, we propose Reward-Decorrelated Policy Optimization (RDPO), a reward-processing method designed to explicitly target both failure modes. RDPO first utilizes Magnitude-Aware Quantile normalization to stabilize prompt-level advantage allocation across binary, fractional, and continuous rewards. It then applies Mahalanobis whitening within each active reward subspace to mitigate correlation redundancy prior to aggregation. When applied during the post-training of LongCat-Flash, RDPO enhances instruction following, writing quality, and robustness to hard prompts while remaining broadly competitive on reasoning and coding evaluations.
中文摘要 复杂的强化学习环境经常采用多任务和混合奖励的设定。在这些设定下，异质的奖励分布和相互关联的奖励维度常常使标量优势的构建变得不稳定。为应对这些挑战，我们提出了奖励去相关策略优化（RDPO），这是一种专门针对上述两种失效模式的奖励处理方法。RDPO首先利用幅度感知分位数归一化，稳定二值、分数和连续奖励之间的提示级优势分配；随后在每个活跃的奖励子空间内应用马氏白化（Mahalanobis whitening），在聚合之前减轻相关性冗余。在LongCat-Flash的后训练阶段应用时，RDPO提升了指令遵循能力、写作质量和对困难提示的鲁棒性，同时在推理和代码评测上总体保持竞争力。

Robot Squid Game: Quadrupedal Locomotion for Traversing Narrow Tunnels

机器人鱿鱼游戏：面向狭窄隧道穿越的四足运动

Authors: Amir Hossain Raj, Dibyendu Das, Xuesu Xiao
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.13665
Pdf link: https://arxiv.org/pdf/2605.13665
Abstract Quadruped robots demonstrate exceptional potential for navigating complex terrain in critical applications such as search and rescue missions and infrastructure inspection. However, autonomous traversal of confined 3D environments, including tunnels, caves, and collapsed structures, remains a significant challenge. Existing methods often struggle with rigid gait patterns, limited adaptability to diverse geometries, and reliance on oversimplified environmental assumptions. This paper introduces a Reinforcement Learning (RL) framework that combines procedural environment generation with policy distillation to enable robust locomotion across various tunnel configurations. Our approach leverages a teacher-student training paradigm in which specialized expert policies, trained on procedurally generated tunnel geometries, transfer their knowledge to a unified student policy. This strategy eliminates the need for complex reward shaping in end-to-end RL training, simplifying the process by breaking down complicated tasks into smaller, more manageable components that are easier for the robot to learn. By synthesizing diverse tunnel structures during training and distilling navigation strategies into a generalizable policy, our method achieves consistent traversal across complex spatial constraints where conventional approaches fail. We demonstrate through both simulation and real-world experiments that our method enables quadruped robots to successfully traverse challenging confined tunnel environments.
中文摘要 四足机器人在搜救任务和基础设施巡检等关键应用中展现出在复杂地形中行进的卓越潜力。然而，自主穿越受限三维环境（包括隧道、洞穴和坍塌结构）仍是重大挑战。现有方法常受困于僵化的步态模式、对多样几何形状的适应性有限，以及对过度简化的环境假设的依赖。本文提出了一个强化学习（RL）框架，将程序化环境生成与策略蒸馏相结合，以实现在多种隧道构型下的稳健运动。我们的方法采用教师-学生训练范式，在程序化生成的隧道几何上训练的专用专家策略将其知识迁移到统一的学生策略中。该策略消除了端到端强化学习训练中对复杂奖励塑形的需求，通过将复杂任务分解为更小、更易管理且机器人更容易学习的组件来简化训练过程。通过在训练中合成多样的隧道结构，并将导航策略蒸馏为可泛化的策略，我们的方法能够在传统方法失效的复杂空间约束下实现稳定的穿越。我们通过仿真和真实世界实验证明，我们的方法使四足机器人能够成功穿越具有挑战性的狭窄隧道环境。

SceneGraphVLM: Dynamic Scene Graph Generation from Video with Vision-Language Models

SceneGraphVLM：基于视觉语言模型的视频动态场景图生成

Authors: Vladislav Makarov, Mark Gizetdinov, Dmitry Yudin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.13667
Pdf link: https://arxiv.org/pdf/2605.13667
Abstract Scene graph generation provides a compact structured representation for visual perception, but accurate and fast graph prediction from images and videos remains challenging. Recent VLM-based methods can generate scene graphs end-to-end as structured text, yet often produce long outputs with irrelevant objects and relations. We present SceneGraphVLM, a compact method for image and video scene graph generation with small visual language models. SceneGraphVLM serializes graphs in a token-efficient TOON format and trains the model in two stages: supervised fine-tuning followed by reinforcement learning with hallucination-aware rewards that balance relation coverage and precision while penalizing unsupported objects and relations. For videos, the model can optionally condition each frame on the previously generated graph, providing lightweight short-term context without tracking or post-processing. We evaluate SceneGraphVLM on PSG, PVSG, and Action Genome. With compact VLMs and vLLM-accelerated decoding, SceneGraphVLM achieves a strong quality-speed trade-off, improves precision-oriented SGG metrics while preserving reasonable recall, and generates complete scene graphs with approximately one-second latency. Code and implementation details are available at: this https URL.
中文摘要 场景图生成为视觉感知提供了紧凑的结构化表示，但从图像和视频中准确而快速地预测场景图仍然具有挑战性。最新的基于VLM的方法可以端到端地以结构化文本形式生成场景图，但往往会产生包含无关对象和关系的冗长输出。我们提出了SceneGraphVLM，这是一种利用小型视觉语言模型进行图像和视频场景图生成的紧凑方法。SceneGraphVLM 以词元高效的 TOON 格式序列化场景图，并分两个阶段训练模型：先进行监督微调，再进行带有幻觉感知奖励的强化学习，这些奖励在关系覆盖率与精确率之间取得平衡，同时惩罚缺乏依据的对象和关系。对于视频，模型可以选择性地让每一帧以先前生成的图为条件，从而在无需跟踪或后处理的情况下提供轻量级的短期上下文。我们在PSG、PVSG和Action Genome上评估了SceneGraphVLM。借助紧凑的VLM和vLLM加速解码，SceneGraphVLM实现了出色的质量与速度权衡，在保持合理召回率的同时提升了以精确率为导向的SGG指标，并能以约一秒的延迟生成完整的场景图。代码和实现详情见：this https URL。

Tight Sample Complexity Bounds for Entropic Best Policy Identification

熵风险最佳策略识别的紧样本复杂度界

Authors: Amer Essakine, Claire Vernade
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2605.13717
Pdf link: https://arxiv.org/pdf/2605.13717
Abstract We study best-policy identification for finite-horizon risk-sensitive reinforcement learning under the entropic risk measure. Recent work established a constant gap in the exponential horizon dependence between lower and upper bounds on the number of samples required to identify an approximately optimal policy. Precisely, known lower bounds scale in $\Omega(e^{|\beta| H})$ where $H$ is the horizon of the MDP, while the state-of-the-art upper bound achieves at best $O(e^{2|\beta| H})$ (arXiv:2506.00286v2) using a generative model. We show that this extra exponential factor can be traced to overly loose concentration control for exponential utilities. To close this open gap, we revisit the analysis of this problem through a forward-model based algorithm building on KL-based exploration bonuses that we adapt to the entropic criterion. The improvement we get is due to two main novel technical innovations. We leverage the smoothness properties of the exponential utility to derive sharper concentration bounds, and we propose a new stopping rule that exploits further this tightness to obtain a sample complexity that matches the lower bound.
中文摘要 我们研究了熵风险度量下有限时域风险敏感强化学习的最佳策略识别问题。近期工作表明，在识别近似最优策略所需样本数的下界与上界之间，指数形式的时域依赖上存在一个常数倍的差距。确切地说，已知的下界量级为$\Omega(e^{|\beta| H})$，其中$H$是MDP的时域长度，而使用生成模型的最先进上界至多只能达到$O(e^{2|\beta| H})$（arXiv:2506.00286v2）。我们证明，这一额外的指数因子可以追溯到对指数效用的集中性控制过于宽松。为弥合这一悬而未决的差距，我们通过一种基于前向模型的算法重新分析该问题，该算法建立在基于KL的探索奖励项之上，并被我们调整以适配熵风险准则。我们取得的改进归功于两项主要的技术创新：我们利用指数效用的光滑性推导出更紧的集中界，并提出一种新的停止规则，进一步利用这种紧致性，得到与下界相匹配的样本复杂度。
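
For context, the entropic risk measure referred to in this abstract is usually defined as follows for a random return $X$ and risk parameter $\beta \neq 0$. This is the textbook definition, not a formula quoted from the paper, and the normalization of returns to $[0, H]$ below is an assumption made only for illustration.

$$
\rho_\beta(X) = \frac{1}{\beta} \log \mathbb{E}\left[ e^{\beta X} \right],
\qquad
\rho_\beta(X) = \mathbb{E}[X] + \frac{\beta}{2}\,\mathrm{Var}(X) + O(\beta^2) \quad \text{as } \beta \to 0 .
$$

If returns lie in $[0, H]$, the exponential utility $e^{\beta X}$ takes values between $1$ and $e^{\beta H}$, so its range grows like $e^{|\beta| H}$. This gives some intuition for why sample complexity bounds under this criterion carry exponential dependence on $|\beta| H$.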

Keyword: diffusion policy

BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning

BlockVLA：通过块扩散微调加速自回归VLA

Authors: Ruiheng Wang, Shuanghao Bai, Haoran Zhang, Badong Chen, Xiangyu Xu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.13382
Pdf link: https://arxiv.org/pdf/2605.13382
Abstract While autoregressive (AR) Vision-Language-Action (VLA) models have demonstrated formidable reasoning capabilities in robotic tasks, their sequential decoding process often incurs high inference latency and may amplify error accumulation during long-horizon execution. Discrete Diffusion Language Models (dLLMs) provide a promising alternative through parallel token refinement, but their practical deployment in robotics remains limited by repeated denoising function evaluations (NFEs) and the difficulty of directly applying standard KV caching to bidirectional iterative decoding. To bridge these paradigms, we propose BlockVLA, a framework that adapts pretrained AR backbones into an efficient discrete diffusion policy through a block diffusion paradigm. BlockVLA maintains autoregressive dependencies at the block level while enabling parallel denoising within each block, thereby combining global causal coherence with local parallel generation. This design enables prefix KV-cache reuse across completed blocks, reduces the effective cost of iterative denoising, and provides a smoother transition from AR pretraining to diffusion-based policy fine-tuning. We conduct extensive evaluations on the LIBERO and SimplerEnv benchmarks. Experimental results demonstrate that our BlockVLA achieves a 3.3$\times$ inference acceleration over standard discrete diffusion baselines. Furthermore, our model exhibits superior training efficiency, with success rates converging substantially faster than baselines, a gain that is particularly pronounced in complex, long-horizon tasks, where BlockVLA achieves significant performance gains in the early stages of training. This work establishes Block Diffusion as a robust bridge between large-scale pretrained AR models and efficient, high-frequency real-time robotic control.
中文摘要 虽然自回归（AR）视觉-语言-动作（VLA）模型在机器人任务中展现出强大的推理能力，但其顺序解码过程往往带来较高的推理延迟，并可能在长时域执行中加剧误差累积。离散扩散语言模型（dLLM）通过并行词元细化提供了有前景的替代方案，但其在机器人领域的实际部署仍受限于反复的去噪函数评估（NFE），以及难以将标准KV缓存直接应用于双向迭代解码。为了弥合这两种范式，我们提出了BlockVLA，这是一个通过块扩散范式将预训练AR骨干网络改造为高效离散扩散策略的框架。BlockVLA在块级别保持自回归依赖，同时允许每个块内并行去噪，从而将全局因果一致性与局部并行生成结合起来。该设计使已完成的块之间可以复用前缀KV缓存，降低迭代去噪的实际成本，并实现从AR预训练到基于扩散的策略微调的更平滑过渡。我们在LIBERO和SimplerEnv基准上进行了广泛评估。实验结果表明，BlockVLA相比标准离散扩散基线实现了3.3$\times$的推理加速。此外，我们的模型展现出更高的训练效率，成功率的收敛速度明显快于基线，这一优势在复杂的长时域任务中尤为明显，BlockVLA在训练早期就取得了显著的性能提升。这项工作确立了块扩散作为大规模预训练AR模型与高效、高频实时机器人控制之间的稳健桥梁。

CUBic: Coordinated Unified Bimanual Perception and Control Framework

CUBic：协调统一双手感知与控制框架

Authors: Xingyu Wang, Pengxiang Ding, Jingkai Xu, Donglin Wang, Zhaoxin Fan
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.13452
Pdf link: https://arxiv.org/pdf/2605.13452
Abstract Recent advances in visuomotor policy learning have enabled robots to perform control directly from visual inputs. Yet, extending such end-to-end learning from single-arm to bimanual manipulation remains challenging due to the need for both independent perception and coordinated interaction between arms. Existing methods typically favor one side -- either decoupling the two arms to avoid interference or enforcing strong cross-arm coupling for coordination -- thus lacking a unified treatment. We propose CUBic, a Coordinated and Unified framework for Bimanual perception and control that reformulates bimanual coordination as a unified perceptual modeling problem. CUBic learns a shared tokenized representation bridging perception and control, where independence and coordination emerge intrinsically from structure rather than from hand-crafted coupling. Our approach integrates three components: unidirectional perception aggregation, bidirectional perception coordination through two codebooks with shared mapping, and a unified perception-to-control diffusion policy. Extensive experiments on the RoboTwin benchmark show that CUBic consistently surpasses standard baselines, achieving marked improvements in coordination accuracy and task success rates over state-of-the-art visuomotor baselines.
中文摘要 视觉运动策略学习的最新进展使机器人能够直接根据视觉输入进行控制。然而，将这种端到端学习从单臂操作扩展到双手操作仍然具有挑战性，因为它既需要各臂的独立感知，又需要双臂之间的协调交互。现有方法通常偏向其中一方：要么将两臂解耦以避免干扰，要么强制施加较强的跨臂耦合以实现协调，因此缺乏统一的处理方式。我们提出了CUBic，一个协调统一的双手感知与控制框架，将双手协调重新表述为统一的感知建模问题。CUBic学习一种连接感知与控制的共享词元化表示，其中独立性与协调性由结构内在地涌现，而非来自手工设计的耦合。我们的方法整合了三个组成部分：单向感知聚合、通过两个具有共享映射的码本实现的双向感知协调，以及统一的感知到控制扩散策略。在RoboTwin基准上的大量实验表明，CUBic始终超越标准基线，在协调精度和任务成功率上相比最先进的视觉运动基线取得了明显提升。