Arxiv Papers of Today

Generated: 2026-05-28 19:41:56 (UTC+8); Arxiv release time: 2026-05-28 20:00 EDT (2026-05-29 08:00 UTC+8)

64 relevant papers today

Keyword: reinforcement learning

Personalized Observation Normalization for Federated Reinforcement Learning in Simulation Environments with Heterogeneity

在具有异构性的模拟环境中实现联合强化学习的个性化观察归一化

Authors: Yiran Pang, Zhen Ni, Xiangnan Zhong
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.27385
Pdf link: https://arxiv.org/pdf/2605.27385
Abstract Federated reinforcement learning (FedRL) enables multiple agents to collaboratively train a global policy without sharing raw data, making it ideal for privacy-sensitive applications. However, FedRL faces challenges in heterogeneous environments where differing state-transition dynamics lead to non-identical input distributions and imbalanced parameter updates during aggregation. Therefore, this paper develops a personalized observation normalization (PON) method, allowing each agent to locally normalize raw state inputs using a continuously updated running mean and variance. This design ensures consistent scaling of local features, so that no agent's inputs overshadow another's during aggregation. Furthermore, we demonstrate that sharing normalization parameters across agents is ineffective due to the diverse local input distributions, which highlights the necessity of personalized statistics. Experiments on heterogeneous MuJoCo tasks show that our developed PON accelerates training and achieves superior performance compared to baseline methods.
中文摘要 联邦强化学习（FedRL）使多个智能体能够协作训练全局策略，而无需共享原始数据，非常适合隐私敏感的应用。然而，FedRL在异构环境中面临挑战，不同的状态转换动态导致输入分布不相同，聚合过程中参数更新不平衡。因此，本文开发了一种个性化的观测归一化（PON）方法，允许每个代理通过持续更新的运行均值和方差对原始状态输入进行局部归一化。该设计确保局部特征的一致缩放，同时在聚合过程中不被代理间覆盖。此外，我们证明了由于局部输入分布多样，跨代理共享归一化参数效果不佳，这凸显了个性化统计的必要性。异构MuJoCo任务的实验表明，我们开发的PON加速训练并实现优于基线方法的性能。
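The per-agent normalization described in this abstract can be sketched as follows. This is a minimal illustration, assuming a Welford-style running estimate of mean and variance; the class name `RunningNormalizer`, the `eps` constant, and the exact update rule are assumptions made here, not the authors' implementation. In a federated setup, each agent would presumably keep its own instance and leave these statistics out of aggregation.

```python
import math


class RunningNormalizer:
    """Hypothetical per-agent observation normalizer (Welford's algorithm)."""

    def __init__(self, dim, eps=1e-8):
        self.count = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations from the mean
        self.eps = eps         # assumed constant to avoid division by zero

    def update(self, obs):
        """Fold one raw observation into the running statistics."""
        self.count += 1
        for i, x in enumerate(obs):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (x - self.mean[i])

    def normalize(self, obs):
        """Scale an observation with this agent's own (population) statistics."""
        out = []
        for i, x in enumerate(obs):
            var = self.m2[i] / self.count if self.count > 0 else 1.0
            out.append((x - self.mean[i]) / math.sqrt(var + self.eps))
        return out
```

Because the statistics stay local, two agents with differently scaled state distributions each feed roughly zero-mean, unit-variance inputs to the shared policy.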

Differentiable Model Predictive Safety for Heterogeneous Mobility at Urban Intersections

城市交叉口异质出行的可微分模型预测安全性

Authors: Wenzhe Song, Hao Zhang
Subjects: Multiagent Systems (cs.MA); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.27418
Pdf link: https://arxiv.org/pdf/2605.27418
Abstract The imminent integration of autonomous vehicles and mobile robots in urban settings presents a critical safety challenge for future intelligent transportation systems. This paper addresses the complex problem of coordinating heterogeneous agents with disparate dynamics at unregulated intersections. We introduce a novel framework, differentiable model predictive safety (DMPS), which embeds the foresight of model-predictive control into a data-driven, end-to-end reinforcement learning architecture. DMPS agents learn a latent dynamics model to predict future trajectories contingent on their actions. A learned, differentiable safety critic then evaluates the risk of these trajectories. Crucially, by leveraging backpropagation through the entire unrolled predictive model, agents can efficiently compute the gradient of future safety with respect to their current action, enabling a minimal and precise online safety correction. Integrated into a multi-agent training scheme, DMPS reduces the collision rate to below 5.6% in high-density, mixed vehicle-robot traffic simulations, demonstrating state-of-the-art safety without compromising energy and traffic efficiency.
中文摘要 自动驾驶车辆和移动机器人即将在城市环境中整合，这对未来智能交通系统构成了关键的安全挑战。本文探讨了在非调控交叉点协调具有差异动态的异质代理的复杂问题。我们引入了一个新框架——可微分模型预测安全（DMPS），将模型预测控制的前瞻性嵌入到数据驱动的端到端强化学习架构中。DMPS代理学习潜在动力学模型，以预测其行为的未来轨迹。一位有经验、可微分的安全批评者随后评估这些轨迹的风险。关键是，通过在整个展开的预测模型中利用反向传播，代理能够高效计算未来安全性相对于当前行动的梯度，实现最小且精确的在线安全修正。集成于多智能体训练方案中，DMPS在高密度混合车辆-机器人交通模拟中几乎将碰撞率降至低于5.6%，展示最先进的安全性，同时不牺牲能源和交通效率。

SCALE-COMM: Shared, Contrastively-Aligned Latent Embeddings for MARL Communication

SCALE-COMM：用于MARL通信的共享、对比对比的潜在嵌入

Authors: Mahmoud Abouelyazid, Eman Hammad
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.27532
Pdf link: https://arxiv.org/pdf/2605.27532
Abstract Emergent communication enables partially observant Autonomous Mobile Robots (AMRs) to coordinate effectively in decentralized multi-agent reinforcement learning (MARL) settings. However, existing approaches often struggle with unstable communication protocols, ungrounded message semantics, and interference between communication learning and policy optimization, leading to degraded coordination over time. We propose SCALE-COMM (Shared, Contrastively-Aligned Latent Embeddings for COMMunication), a self-supervised framework for learning compact, stable, and policy-relevant communication representations. SCALE-COMM decouples communication learning from policy optimization by training low-dimensional latent messages that capture task-relevant planning and traffic information, while enforcing consistency across agents and time. Across standard MARL benchmarks and a realistic warehouse coordination task, SCALE-COMM consistently outperforms existing communication frameworks in both representation quality and task performance. The learned communication space yields improved stability, sample efficiency, and throughput under policy fine-tuning, demonstrating the effectiveness of representation-driven communication for scalable multi-agent coordination.
中文摘要 涌现通信使部分观察型自主移动机器人（AMR）能够在去中心化多智能体强化学习（MARL）环境中有效协调。然而，现有方法常常面临不稳定的通信协议、无根基的消息语义以及通信学习与策略优化之间的干扰，导致协调性随时间退化。我们提出了SCALE-COM（共享、对比对齐的潜在嵌入用于COMMunication），这是一个自监督框架，用于学习紧凑、稳定且符合政策的通信表示。SCALE-COMM 通过训练低维潜在消息，将通信学习与策略优化解耦，这些消息捕捉与任务相关的规划和流量信息，同时在代理和时间间强制执行一致性。在标准MARL基准和现实仓库协调任务中，SCALE-COMM在表示质量和任务性能上始终优于现有通信框架。所学的通信空间在策略微调下提升了稳定性、样本效率和吞吐量，展示了表示驱动通信在可扩展多智能体协调中的有效性。

Transferable Reinforcement Learning via Probabilistic Latent Embeddings and Dynamic Policy Adaptation for Sim-to-Real Deployment

通过概率潜在嵌入和动态策略适应实现模拟到真实部署的可转移强化学习

Authors: Gengyue Han, Yiheng Feng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.27659
Pdf link: https://arxiv.org/pdf/2605.27659
Abstract Due to limited resources and public safety concerns, deep reinforcement learning (RL) agents for many cyber-physical systems (e.g., autonomous vehicles) are first trained in simulators. However, when deployed in real-world environments, they often suffer from performance degradation or safety violations because of the inevitable Sim2Real gap. Existing zero-shot approaches, such as robust safe RL and domain randomization, mitigate this issue but typically at the cost of degraded performance or residual safety risks when experiencing unmodeled system dynamics. To address these limitations, we propose a novel reinforcement learning framework that enables safe and efficient policy transfer via probabilistic latent embeddings and dynamic policy adaptation. We consider a family of Constrained Markov Decision Processes (CMDPs) under different environment contexts. By leveraging a latent context variable from meta-RL, the proposed framework infers the latent representation of the environment from simulated experiences. Furthermore, it incorporates a distributional RL formulation, which allows risk levels of the deployed policy to be adjusted dynamically, based on the estimation accuracy of the latent context variable. This strategy promotes safety at the early deployment stage and improves efficiency through fast policy adaptation under the Sim2Real gap.
中文摘要 由于资源有限和公共安全问题，许多网络物理系统（如自动驾驶车辆）的深度强化学习（RL）代理首先在模拟器中进行训练。然而，在实际环境中部署时，由于不可避免的Sim2Real差距，它们常常会遭受性能下降或安全违规。现有的零样本方法，如稳健的安全强化学习和域随机化，可以缓解这一问题，但通常以性能下降或在未建模系统动态时存在残余安全风险为代价。为解决这些局限性，我们提出了一种新型强化学习框架，通过概率潜在嵌入和动态策略适应实现安全高效的策略转移。我们考察了一组在不同环境背景下的受限马尔可夫决策过程（CMDP）。通过利用元强化学习中的潜在上下文变量，该框架从模拟经验中推断出环境的潜在表征。此外，它采用分布式强化学习表述，允许根据潜在上下文变量的估计准确性动态调整所部署策略的风险水平。该策略在早期部署阶段促进安全，并通过Sim2Real差距下的快速策略调整提升效率。

ReverseMath: Answer Inversion for Scalable and Verifiable Mathematical Problem Generation

ReverseMath：可扩展且可验证数学问题生成的答案反演

Authors: Raoyuan Zhao, Yihong Liu, Yupei Du, Hinrich Schütze, Michael A. Hedderich
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.27709
Pdf link: https://arxiv.org/pdf/2605.27709
Abstract Mathematical reasoning benchmarks are vital for evaluating large language models (LLMs), but many are static and repeatedly exposed through public evaluation and training pipelines, making it difficult to separate genuine reasoning from memorization. Meanwhile, manually constructing new math problems with reliable answers remains costly. We introduce ReverseMath, a scalable method for generating new math problems through answer inversion. Given a problem and its answer, ReverseMath masks a numerical value in the original problem, treats the original answer as a known condition, and rewrites the problem so that the masked value becomes the new answer. The generated problem reverses the original input-output relation, making its answer known by construction. We study ReverseMath for both evaluation and training. For evaluation, paired original/reversed problems reveal substantial behavioral shifts: models sometimes fail on reversed problems and even incorrectly output the original answer, suggesting memorization-like behavior. For training, ReverseMath provides automatically labeled reversed problems as data augmentation for reinforcement learning (RL). Experiments show that including ReverseMath-generated data improves mathematical reasoning performance across multiple benchmarks, demonstrating its value as both an analysis tool and a scalable source of verifiable training data.
中文摘要 数学推理基准对于评估大型语言模型（LLM）至关重要，但许多基准是静态的，并且通过公开评估和培训流程反复暴露，使得真正的推理与记忆难以区分。与此同时，手动构建带有可靠答案的新数学问题仍然成本高昂。我们介绍了ReverseMath，一种可扩展的方法，通过答案反转生成新的数学问题。给定一个问题及其答案，ReverseMath 会在原始问题中掩盖一个数值，将原始答案视为已知条件，并重写问题，使掩蔽值成为新的答案。生成的问题会反转原始的输入输出关系，使其答案通过构造已知。我们学习ReverseMath进行评估和培训。在评估中，原始/反转题对显示出显著的行为变化：模型有时在反向题目中失败，甚至错误输出原始答案，表明存在类似记忆的行为。在训练方面，ReverseMath 提供自动标记的反向问题，作为强化学习（RL）数据增强。实验表明，包含ReverseMath生成的数据可以提升多个基准测试的数学推理性能，证明其作为分析工具和可扩展可验证训练数据来源的价值。
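To make the answer-inversion idea in this abstract concrete, here is a toy sketch. It assumes a hand-written template in place of the LLM-based rewriting the abstract implies; the problem text, the `REVERSED_TEMPLATES` table, and the function names are all hypothetical and chosen only for illustration.

```python
# Hypothetical original problem; its answer is boxes * per_box.
ORIGINAL = ("A shop has {boxes} boxes with {per_box} apples each. "
            "How many apples are there in total?")

# Hand-written reversed wording per maskable value (a stand-in for LLM rewriting).
REVERSED_TEMPLATES = {
    "per_box": ("A shop has {boxes} boxes with the same number of apples each, "
                "{total} apples in total. How many apples are in each box?"),
}


def solve_original(boxes, per_box):
    """Answer to the original problem."""
    return boxes * per_box


def reverse_problem(values, mask_key):
    """Mask one value, state the original answer as a known condition,
    and return the reversed problem with its answer (known by construction)."""
    total = solve_original(**values)
    known = {k: v for k, v in values.items() if k != mask_key}
    text = REVERSED_TEMPLATES[mask_key].format(total=total, **known)
    return text, values[mask_key]
```

Calling `reverse_problem({"boxes": 3, "per_box": 4}, "per_box")` yields a problem that states the total of 12 apples and asks for the per-box count, whose answer is the masked value 4. No solver is needed to label the new problem, which is the property the abstract relies on.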

Bayesian Deployment Approval for Learned Landing Controllers under Finite Rollout Validation

有限部署验证下学习着陆控制器的贝叶斯部署批准

Authors: Fei Jiang, Lei Yang
Subjects: Machine Learning (cs.LG); Applications (stat.AP)
Arxiv link: https://arxiv.org/abs/2605.27720
Pdf link: https://arxiv.org/pdf/2605.27720
Abstract Reinforcement learning and data-driven autonomous controllers are commonly evaluated using cumulative reward and empirical success frequency under finite simulation trajectories. However, such empirical metrics do not necessarily provide sufficient statistical evidence regarding deployment readiness under uncertainty. This work develops a Bayesian approval framework for learned autonomous landing controllers under finite rollout evidence. A probabilistic landing capability formulation is introduced based on touchdown safety satisfaction under uncertain operating conditions, while Bayesian posterior inference is used to quantify uncertainty regarding the true deployment capability of learned policies. Posterior approval probability and posterior deployment risk are further introduced for deployment-oriented evaluation, together with a sequential validation framework supporting approve/reject/continue decisions during progressive rollout testing. Simulation experiments using PPO and SAC controllers demonstrate that empirical success and reward optimization may produce overconfident deployment interpretation under limited validation evidence, whereas posterior approval inference provides a more uncertainty-calibrated assessment of deployment readiness. The proposed framework provides a practical statistical connection between conventional reinforcement-learning evaluation and deployment-oriented validation under uncertainty and may be generalized to broader classes of learned autonomous systems.
中文摘要 强化学习和数据驱动的自主控制器通常通过累积奖励和经验成功率在有限模拟轨迹下进行评估。然而，这些实证指标未必能提供足够的统计证据，说明在不确定性下的部署准备情况。本研究在有限推广证据下，为学习型自主着陆控制器开发了贝叶斯审批框架。基于不确定操作条件下着陆安全满意度的概率着陆能力表述，而贝叶斯后验推断则用于量化学习策略真实部署能力的不确定性。进一步引入了后验批准概率和后置部署风险，用于部署导向评估，同时支持在渐进式推广测试中批准/拒绝/继续决策的顺序验证框架。使用PPO和SAC控制器的模拟实验表明，在有限的验证证据下，实证成功和奖励优化可能导致部署解释过于自信，而后验批准推断则提供了更具不确定性校准的部署准备评估。该框架提供了传统强化学习评估与部署导向验证在不确定性下的实际统计联系，并可推广至更广泛的学习自主系统类别。
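One simple way to picture posterior approval under finite rollouts is a Beta-Bernoulli model of touchdown success. The sketch below is an assumption-laden illustration, not the paper's formulation: the uniform Beta(1, 1) prior, the required capability `p_required`, the decision thresholds `approve_at` and `reject_at`, and the Monte Carlo estimate are all choices made here.

```python
import random


def approval_probability(successes, trials, p_required, n_samples=20000, seed=0):
    """Estimate P(p >= p_required | data) under an assumed uniform Beta(1, 1) prior."""
    rng = random.Random(seed)
    alpha = 1 + successes
    beta = 1 + (trials - successes)
    # Draw from the Beta posterior and count samples that meet the requirement.
    hits = sum(rng.betavariate(alpha, beta) >= p_required for _ in range(n_samples))
    return hits / n_samples


def decide(successes, trials, p_required=0.9, approve_at=0.95, reject_at=0.05):
    """Sequential approve / reject / continue rule with assumed thresholds."""
    prob = approval_probability(successes, trials, p_required)
    if prob >= approve_at:
        return "approve"
    if prob <= reject_at:
        return "reject"
    return "continue"
```

With 9 successes in 10 rollouts, the empirical success rate already equals the 0.9 requirement, yet the posterior probability of meeting it is only about 0.30, so `decide(9, 10)` returns `"continue"`. This is the kind of overconfidence in empirical success rates that the abstract warns about.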

Explicit Critic Guidance for Aligning Diffusion Models

扩散模型对齐的明确批评指导

Authors: Zhengyang Liang, Qihang Zhang, Ceyuan Yang
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.27736
Pdf link: https://arxiv.org/pdf/2605.27736
Abstract Online reinforcement learning is becoming increasingly important for aligning diffusion models with non-differentiable objectives. However, existing methods still face limitations in assigning fine-grained credit along denoising trajectories and in realizing stable value-based optimization. We propose a state-aligned latent actor-critic framework for diffusion post-training, in which the diffusion model serves as its own timestep-conditioned value function and predicts values directly on noisy latent states. This enables trajectory-level PPO training, supports stable actor-critic optimization with simple conditioning and value pretraining strategies, and naturally allows the learned critic to be reused for inference-time steering. We further extend the framework to multi-reward optimization, where joint training with complementary rewards helps alleviate reward hacking. Across both UNet- and DiT-based backbones, our method consistently outperforms prior group-relative RL and actor-critic baselines on single-reward and multi-reward benchmarks, while test-time steering provides additional gains in generation quality.
中文摘要 在线强化学习在将扩散模型与不可微目标对齐方面变得越来越重要。然而，现有方法在沿去噪轨迹分配细粒度信用和实现稳定的基于价值的优化方面仍面临局限。我们提出了一种状态对齐的潜在演员-批判者框架，用于训练后的扩散，其中扩散模型作为自身的时间步条件值函数，直接预测噪声潜态的值。这支持轨迹级PPO训练，支持通过简单条件和价值预训练策略实现稳定的actor-critic优化，并自然允许学习到的critic被用于推理时间引导。我们进一步将该框架扩展到多奖励优化，其中联合训练与互补奖励有助于缓解奖励黑客行为。在基于Unet和DiT的骨干链中，我们的方法在单奖励和多奖励基准上持续优于先前的群体相对强化学习和行为者-批评者基线，而测试时间引导则在生成质量上提供了额外提升。

Escape the Language Prior: Mitigating Late-Stage Modality Collapse in Audio Reasoning via Modality-Aware Policy Optimization

逃离先前的语言：通过模态感知策略优化缓解音频推理中后期阶段的模态崩溃

Authors: Cihan Xiao, Yiwen Shao, Chenxing Li, Xiang He, Zhenwen Liang, Steve Yves, Sanjeev Khudanpur, Liefeng Bo
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.27741
Pdf link: https://arxiv.org/pdf/2605.27741
Abstract Audio and omni-modal large language models exhibit impressive cross-modal reasoning capabilities. However, applying standard reinforcement learning post-training algorithms to these models exposes a critical structural vulnerability: methods like GRPO apply uniform policy gradients across all tokens, ignoring their unequal dependence on the non-text source modality. This exacerbates late-stage modality collapse during extended chain-of-thought generation, where models progressively abandon the primary source signal in favor of compressed textual priors, leading to confident but ungrounded hallucinations. To address this, we introduce Modality-Aware Policy Optimization (MAPO), a novel dual-branch reinforcement learning framework. First, MAPO dynamically concentrates the policy gradient on modality-critical tokens using a modality relevance mask, which is derived from the cross-modal differential entropy between an audio-ablated reference and the multimodal policy. Second, it integrates an auxiliary attention loss branch that applies a targeted, temporally scaled penalty to the model's internal attention distributions. This ensures the model actively sustains cross-modal grounding deep into the reasoning trace. Evaluations on complex audio reasoning benchmarks demonstrate that MAPO substantially improves long-horizon reasoning fidelity and multimodal instruction following, achieving highly competitive performance and setting new state-of-the-art results on several key benchmarks among open-weight models. By relying strictly on native statistical signals rather than domain-specific inductive biases, MAPO offers a promising foundation for mitigating epistemic collapse across diverse multimodal systems.
中文摘要 音频和全模态大型语言模型展现出令人印象深刻的跨模态推理能力。然而，将标准强化学习的训练后算法应用于这些模型暴露了一个关键的结构性漏洞：像GRPO这样的方法在所有令牌上应用均匀的策略梯度，忽视了它们对非文本源模态的不等依赖性。这加剧了在长时间思维链生成过程中晚期模态崩溃的情况，模型逐渐放弃原始信号，转而采用压缩文本先验，导致自信但缺乏依据的幻觉。为此，我们引入了模态感知策略优化（MAPO），一种新型双分支强化学习框架。首先，MAPO利用模态相关性掩码动态集中策略梯度于模态关键令牌，该掩码源自音频消融参考与多模态策略之间的跨模态差分熵。其次，它集成了一个辅助注意力损失分支，对模型内部注意力分布施加有针对性的、时间尺度的惩罚。这确保模型能够积极地在推理轨迹中持续保持跨模态的基准。复杂音频推理基准测试的评估表明，MAPO显著提升了长视野推理的保真度和多模态指令跟随，实现了极具竞争力的性能，并在开放权模型的多个关键基准上取得了新的尖端成果。通过严格依赖本地统计信号而非领域特异的归纳偏差，MAPO为缓解多模态系统间的认知崩溃提供了有前景的基础。

Restoring the Sweet Spot: Pass-Rate Weighted Self-Distillation for LLM Reasoning

恢复最佳点：通过率加权自蒸馏为LLM推理

Authors: Zehao Liu, Yuanpu Cao, Jinghui Chen, Vasant G. Honavar
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.27765
Pdf link: https://arxiv.org/pdf/2605.27765
Abstract Self-Distillation Policy Optimization (SDPO) provides dense token-level credit assignment for reinforcement learning with large language models by leveraging the model's own feedback-conditioned predictions as a self-teacher. Unlike GRPO, however, whose group-relative advantage naturally concentrates learning on a sweet spot of intermediate-difficulty questions, SDPO's KL-based advantage lacks an implicit notion of difficulty awareness. We analyze this gap through the lens of GRPO's advantage normalization. Extending the learnability framework to normalized rewards, we show that normalization absorbs the variance term $p(1-p)$, equalizing leading-order learnability across questions and leaving $\sqrt{p(1-p)}$ as the sole residual scaling factor in the per-question gradient. This analysis yields a simple prescription: weight each question's SDPO loss by $[\hat{p}(1-\hat{p})]^{1/2}$, resulting in SC-SDPO, a scale-consistent variant of SDPO. The proposed weights are obtained as a zero-cost byproduct of on-policy rollouts with batch-adaptive normalization, inducing an implicit curriculum that dynamically tracks the model's evolving competence. Experiments on scientific reasoning and tool-use benchmarks demonstrate that SC-SDPO consistently improves over SDPO, yielding gains of +3.2/+4.3 (mean@16/maj@16) on Qwen3-8B and +1.8/+3.0 on OLMo-3-7B, while preserving stable training dynamics throughout optimization.
中文摘要 自蒸馏策略优化（SDPO）通过利用模型自身的反馈条件预测，作为自学者，为大型语言模型的强化学习提供密集的代币级学分分配。然而，与GRPO不同，后者的群体相对优势自然集中在中级难度题目中，SDPO基于KL的优势缺乏隐含的难度意识概念。我们通过GRPO优势归一化的视角分析这一差距。将可学习性框架推广到归一化奖励，我们证明归一化吸收了方差项 $p（1-p）$，使各问题的导向可学习性相等，且使 $\sqrt{p（1-p）}$ 成为每题梯度中唯一的残余缩放因子。该分析得出一个简单的建议：将每个问题的SDPO损失加权$[\hat{p}（1-\hat{p}）]^{1/2}$，得到SC-SDPO，这是一种SDPO的量表一致变体。所提权重作为政策内部署的零成本副产品获得，采用批量自适应规范化，诱导出动态跟踪模型能力演变的隐式课程。科学推理和工具使用基准测试的实验表明，SC-SDPO相较于SDPO持续提升，Qwen3-8B提升+3.2/+4.3（mean@16/maj@16），OLMo-3-7B提升+1.8/+3.0，同时保持优化过程中的稳定训练动态。
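The per-question weight $[\hat{p}(1-\hat{p})]^{1/2}$ stated in this abstract can be computed directly from rollout outcomes. The sketch below assumes binary rewards and takes $\hat{p}$ to be the empirical pass rate of one question's rollouts; the function name is hypothetical, and the batch-adaptive normalization mentioned in the abstract is not shown because its details are not given here.

```python
import math


def pass_rate_weight(rewards):
    """Weight sqrt(p_hat * (1 - p_hat)) from one question's binary rollout rewards."""
    p_hat = sum(rewards) / len(rewards)  # empirical pass rate
    return math.sqrt(p_hat * (1.0 - p_hat))
```

The weight peaks at 0.5 for a question solved half the time and falls to 0 for questions that are always or never solved, which matches the "sweet spot" of intermediate difficulty described above.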

Playing with Words, Improving with Rewards: Training Language Models for Creative Association

玩弄文字，奖励提升：创造性联想的语言模型训练

Authors: Vijeta Deshpande, Namrata Shivagunde, Sherin Muckatira, Hadrien Glaude, Mikhail Gronas, Claire Stevenson, Roger Beaty, Anna Rumshisky
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.27832
Pdf link: https://arxiv.org/pdf/2605.27832
Abstract Large Language Models (LLMs) are being applied to increasingly difficult problems and use cases. To navigate their vast solution spaces effectively, LLMs need to be creative. Yet the subjective nature of creativity and the limits of human judgment make training LLMs for creativity especially challenging. As a solution, we train LLMs on Codenames, a word-association game that exercises the two central axes of creativity, divergent and convergent thinking, while yielding objectively verifiable outcomes. This verifiability lets us bypass human judgment and train with Reinforcement Learning with Verifiable Rewards (RLVR). We train Qwen3-1.7B, 4B, and 8B models and evaluate them on ten creativity and four reasoning benchmarks. We find that the precision-diversity trade-off is scale-dependent: the 8B model prioritizes creativity over precision, while the 1.7B and 4B models gain reasoning precision at the cost of creativity. Concretely, the 8B model shows modest but consistent creativity gains (8 of 10 benchmarks) with only minor reasoning degradation, whereas the smaller models achieve substantial gains on reasoning tasks. Our study presents a scalable and effective solution to train LLMs for creativity.
中文摘要 大型语言模型（LLM）正被应用于越来越复杂的问题和应用场景。为了有效驾驭其庞大的解决方案领域，LLM需要具备创造力。然而，创造力的主观性和人类判断的局限性使得培养创造力的LLM尤其具有挑战性。作为解决方案，我们用“代号”（Codenames）训练大型语言模型，这是一种词语联想游戏，锻炼了创造力的两个核心轴线——发散性思维和收敛性思维，同时产生客观可验证的结果。这种可验证性使我们绕过人类判断，使用可验证奖励强化学习（RLVR）进行训练。我们训练Qwen3-1.7B、4B和8B模型，并基于十项创造力和四个推理基准进行评估。我们发现精度与多样性的权衡依赖于尺度：8B模型优先考虑创造力而非精确度，而1.7B和4B模型则以牺牲创造力为代价获得推理精度。具体来说，8B模型在创造力提升上有适度但持续的提升（10个基准中的8个），推理能力仅有轻微退化，而小型模型在推理任务上取得了显著提升。我们的研究提出了一种可扩展且有效的解决方案，用于训练LLM的创造力。

Reward Transfer from Inverse Reinforcement Learning: A Coupled Minimax Approach

逆向强化学习中的奖励转移：一种耦合极小极大方法

Authors: Guang-Yuan Hao, Lars van der Laan, Aurélien Bibaut, Nathan Kallus
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2605.27834
Pdf link: https://arxiv.org/pdf/2605.27834
Abstract We study the transfer of rewards learned using inverse reinforcement learning from expert demonstrations in one environment to reinforcement learning in a new, different environment. This arises naturally when demonstrations are collected in a controlled environment. We formulate the problem as a joint system of Bellman equations across the source and target environments and develop minimax estimators for the target soft-$q$-function. Whereas a sequential solution approach first estimates the source reward and then plugs it into the target control problem, a coupled approach solves the source and target system of equations jointly. We show that, in contrast to the sequential approach, the coupled approach removes the first-order influence of source Bellman residual error. We characterize the local behavior of each approach, develop finite-sample soft-$q$-function error bounds, and prove regret guarantees for the resulting soft-control policy. An empirical investigation using a sepsis simulator validates the theoretical comparison.
中文摘要 我们研究通过逆强化学习从专家演示中学到的奖励转移到新的、不同的环境中的强化学习。当演示在受控环境中收集时，这种情况自然而然地产生。我们将问题表述为源与目标环境下的联合Bellman方程组，并为目标软$q$函数开发极小极大估计器。顺序解法先估计源奖励，然后将其代入目标控制问题，而耦合法则共同求解源方程组和目标方程组。我们证明，与顺序方法不同，耦合方法消除了源贝尔曼残差误差的一阶影响。我们描述了每种方法的局部行为，开发有限样本软$q$函数误差界限，并证明了软控制策略的遗憾保证。使用败血症模拟器的实证研究验证了理论比较。

EAPO: Entropy-Driven Adaptive Positive-Negative Sample Weighting for Policy Optimization in Open-Ended QA

EAPO：开放式质量保证中策略优化的熵驱动自适应正负样本加权

Authors: Yunsheng Zeng, Gen Li, Yuwei Miao, Xiandong Li, Yujin Wang, Siyu Chen, Luning Wang, Yunhao Qiao, Junfeng Wang, Jianwei Lv, Bo Yuan
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.27846
Pdf link: https://arxiv.org/pdf/2605.27846
Abstract Large Reasoning Models are typically trained via reinforcement learning from verifiable rewards (RLVR). However, existing approaches adopt fixed weights for positive and negative samples, and the conclusions hardly generalize to open-ended question answering (QA). In this paper, we systematically investigate the roles of positive and negative samples in reinforcement learning for open-ended QA. We propose a reward-mean-based strategy for distinguishing positive from negative samples, and observe that negative samples predominantly govern response diversity and the performance upper bound, whereas positive samples primarily determine response quality and convergence stability. Building on these observations, we propose EAPO, an Entropy-driven Adaptive Policy Optimization method that adaptively computes the weighting coefficients of positive samples based on the ratio of the current policy entropy to the initial entropy. During the entropy-decreasing phase, the weight assigned to positive samples is reduced to preserve exploration, whereas during the entropy-increasing phase it is amplified to reinforce stability, thereby mitigating entropy collapse. Experiments on two publicly available open-ended medical QA datasets demonstrate that EAPO consistently and substantially outperforms fixed-weight baselines in both response diversity and stability.
中文摘要 大型推理模型通常通过可验证奖励的强化学习（RLVR）进行训练。然而，现有方法对正负样本采用固定权重，结论难以推广到开放式问答（QA）。本文系统地探讨了开放式QA强化学习中正样本和负样本的作用。我们提出了一种基于奖励均值的策略来区分正样本和负样本，并观察到负样本主要决定反应多样性和性能上限，而正样本主要决定响应质量和收敛稳定性。基于这些观察，我们提出了EAPO，这是一种熵驱动的自适应策略优化方法，基于当前政策熵与初始熵的比值，自适应计算正样本的加权系数。在熵降低阶段，赋予正样本的权重会降低以保持探索，而在熵增加阶段，权重被放大以增强稳定性，从而减轻熵坍缩。在两个公开开放式医学质量保证数据集上的实验表明，EAPO在反应多样性和稳定性方面，始终且显著地优于固定权基线。
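The abstract does not give the exact weighting formula, so the following is only a rough sketch of the two ingredients it names: splitting samples at the reward mean, and scaling the positive-sample weight by the ratio of current to initial policy entropy. Using the clipped ratio itself as the weight, and the bounds `w_min` and `w_max`, are assumptions made here for illustration.

```python
def split_by_reward_mean(rewards):
    """Mark a sample as positive when its reward exceeds the group mean."""
    mean = sum(rewards) / len(rewards)
    return [r > mean for r in rewards]


def positive_weight(current_entropy, initial_entropy, w_min=0.1, w_max=2.0):
    """Assumed rule: positive-sample weight equals the clipped entropy ratio.

    A ratio below 1 (entropy has dropped) shrinks the weight to preserve
    exploration; a ratio above 1 (entropy has risen) amplifies it.
    """
    ratio = current_entropy / initial_entropy
    return max(w_min, min(w_max, ratio))
```

Under this assumed rule, a policy whose entropy has halved would weight positive samples by 0.5, while negative samples keep their weight, which is one way to slow the entropy collapse the abstract describes.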

C-MIG: Multi-view Information Gain-based Retrieval-Augmented Generation for Clinical Diagnosis Reasoning

C-MIG：基于多视图信息增益的检索增强生成，用于临床诊断推理

Authors: Yuwei Miao, Gen Li, Yunsheng Zeng, Xiandong Li, Yujin Wang, Siyu Chen, Luning Wang, Yunhao Qiao, Junfeng Wang, Jianwei Lv, Bo Yuan
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.27860
Pdf link: https://arxiv.org/pdf/2605.27860
Abstract Retrieval-augmented generation combined with reinforcement learning has shown promise for grounding large language models in trustworthy medical evidence. However, existing methods rely on exact-match binary rewards, which in clinical diagnosis cause two issues: (i) semantically relevant but non-verbatim steps receive zero signal, discarding valuable learning signals; and (ii) uni-dimensional rewards cannot effectively supervise heterogeneous reasoning capabilities. To address these issues, we propose C-MIG, a Multi-view Information Gain-based retrieval-augmented generation framework for Clinical diagnosis. C-MIG estimates information gain under a frozen reference model from two complementary views, retrieved-document and document-refinement, to jointly guide what to retrieve and how to refine, alleviating the issues of valuable reward signal loss and credit assignment. We further design a multi-subquery retrieval augmentation strategy that improves knowledge recall coverage in clinical diagnostic scenarios. Comprehensive experiments on four medical benchmarks demonstrate that C-MIG achieves the best performance among all RAG-RL methods on both in-domain and out-of-domain sets, and outperforms state-of-the-art general-purpose LLMs for clinical diagnosis.
中文摘要 检索增强生成结合强化学习，显示出在大型语言模型基于可信医学证据的基础上展现出潜力。然而，现有方法依赖于精确匹配的二元奖励，这在临床诊断中会带来两个问题：（i）语义相关但非逐字步骤接收零信号，丢弃了有价值的学习信号;以及（ii）单维奖励无法有效监督异构推理能力。为解决这些问题，我们提出了C-MIG，一种基于多视角信息增益的检索增强生成框架，用于临床诊断。C-MIG通过两种互补视角——检索文档和文档细化——来估算冻结参考模型下的信息增益，共同指导检索内容以及如何细化，从而缓解了有价值的奖励信号损失和信用分配问题。我们还设计了一种多子查询检索增强策略，以提升临床诊断场景中的知识回忆覆盖率。四项医学基准测试的综合实验表明，C-MIG在所有RAG-RL方法中，无论是域内还是域外，都表现最佳，并且在临床诊断方面优于最先进的通用LLMs。

MERIT: Matching Expertise via Rubric-Informed Training for Reviewer Assignment

优异：通过评分标准指导培训匹配专业能力，助评审任务

Authors: Zixuan Yang, Yibo Zhao, Weicong Liu, Xiang Li
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.27865
Pdf link: https://arxiv.org/pdf/2605.27865
Abstract Matching submissions with suitable reviewers at scale is a growing challenge for major venues, yet existing approaches either rely on coarse proxy signals that conflate general relatedness with true suitability, or require expensive human annotations that are difficult to scale for training. We propose MERIT, a two-stage framework that bridges this gap by converting criterion-level expertise matching into scalable suitability supervision. In the first stage, we train a reviewer assessor via reinforcement learning to identify the expertise dimensions a paper requires, match them against the reviewer's prior work, and produce a suitability decision, with rewards provided by an LLM judge guided by paper-specific expertise rubrics. In the second stage, we distill the assessor's predictions into an embedding-based retriever for efficient large-scale assignment. Experiments show that our 4B reviewer assessor outperforms larger general-purpose LLMs on suitability classification, and the resulting retriever achieves state-of-the-art performance across LR-Bench and the CMU Gold dataset. Our code is available at this https URL.
中文摘要 大规模匹配合适评审者是大型机构日益面临的挑战，现有方法要么依赖粗略代理信号，将一般相关性与真正适用性混为一谈，要么需要昂贵且难以扩展的人工注释以供培训使用。我们提出了MERIT，一个两阶段框架，通过将标准级专业知识匹配转化为可扩展的适任性监督，弥合这一差距。第一阶段，我们通过强化学习培训评审员，识别论文所需的专业能力维度，并将其与审稿人以往的工作进行匹配，并做出适宜性决定，奖励由LLM评委根据论文专属专业评分标准指导。第二阶段，我们将评估者的预测提炼为基于嵌入的检索器，以实现高效的大规模分配。实验显示，我们的4B评审评估者在适宜性分类上优于大型通用大型语言模型，最终的检索器在LR-Bench和CMU Gold数据集中实现了最先进的性能。我们的代码可在此 https URL 访问。

Towards Faithful Agentic XAI: A Verification Method and an Open-World Benchmark for Better Model Faithfulness

迈向忠实的智能体XAI：一种验证方法与更佳模型忠实性的开放世界基准

Authors: Jaechang Kim, Sunung Mun, Seungjoon Lee, Jaewoong Cho, Jungseul Ok
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.27879
Pdf link: https://arxiv.org/pdf/2605.27879
Abstract Explainable AI (XAI) helps users interpret model behavior and identify potential faults. Agentic XAI systems use Large Language Models (LLMs) to make explanations more accessible through natural-language interaction, but they can also produce plausible yet unfaithful explanations. This risk arises because unreliable XAI outputs for complex models can be amplified by LLMs and mislead users. We propose Faithful Agentic XAI (FAX), a framework that improves explanation faithfulness through explicit verification. FAX decomposes draft explanations into claims and cross-checks them against inherently faithful tools, filtering unsupported or contradictory claims before final generation. We also introduce CRAFTER-XAI-Bench, an open-world reinforcement learning benchmark with complex policies, diverse goals, and challenging scenarios for assessing model-specific faithfulness. On CRAFTER-XAI-Bench, FAX improves simulation faithfulness from 0.20 for the strongest baseline to 0.46 while maintaining high informativeness, relevance, and fluency. On three tabular benchmarks, FAX performs competitively with prior Agentic XAI baselines, but our analysis shows that these settings can conflate task accuracy with model-specific faithfulness. These findings show that explicit verification is essential for faithful Agentic XAI and that faithfulness benchmarks must be designed to test explanations against the behavior of the target model itself.
中文摘要 可解释人工智能（XAI）帮助用户解读模型行为并识别潜在故障。智能XAI系统使用大型语言模型（LLMs）通过自然语言交互使解释更易获得，但它们也可能产生合理但不忠实的解释。这种风险源于复杂模型的不可靠XAI输出可能被大型语言模型放大并误导用户。我们提出了忠实智能XAI（FAX）框架，通过显式验证提升解释忠实性。传真将草稿解释分解为主张，并与本质忠实的工具进行交叉核对，过滤掉无依据或矛盾的主张，直到最终生成。我们还推出了CRAFTER-XAI-Bench，这是一个开放世界强化学习基准，具有复杂策略、多样目标和挑战性场景，用于评估模型专属忠实度。在CRAFTER-XAI-Bench上，FAX将模拟忠实度从最强基线的0.20提升到0.46，同时保持高信息量、相关性和流利度。在三个表格基准测试中，FAX与先前的Agentic XAI基线表现相当，但我们的分析显示，这些设置可能将任务准确性与模型特定忠实度混淆。这些发现表明，显式验证对于忠实的智能XAI至关重要，忠实度基准必须设计用来检验解释与目标模型自身的行为。

SKILLC: Learning Autonomous Skill Internalization in LLM Agents via Contrastive Credit Assignment

SKILLC：通过对比学分赋值学习LLM代理的自主技能内化

Authors: Hongxiang Lin, Zhirui Kuai, Erpeng Xue, Lei Wang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.27899
Pdf link: https://arxiv.org/pdf/2605.27899
Abstract Structured skill prompts improve exploration in long-horizon agentic reinforcement learning (RL). Skill-augmented RL methods retain external skills at inference, while skill-internalization RL methods withdraw them during training to enable autonomous performance. However, existing internalization approaches only use skill-helpfulness contrast for curriculum control, leaving the policy update unchanged and unable to distinguish skill-dependent from autonomous success. We propose SkillC, a framework based on Contrastive Skill Credit Assignment (CSCA) that converts this contrast into a direct learning signal for internalization. SkillC samples paired skill-injected and skill-free rollouts for tasks from active skill types within the same policy update, and injects their task-level contrast into optimization via a dual-stream advantage estimator that preserves global ranking while applying a one-sided correction toward skill-free success. A smoothed validation-level signal further drives an adaptive curriculum over attribution strength, rollout allocation, and monotonic active-set pruning. Experiments on ALFWorld and WebShop show that, without runtime skill access, SkillC surpasses the strongest prior skill-internalization RL baseline by 5.5% and 4.4%, respectively, while remaining competitive with skill-augmented RL methods.
中文摘要 结构化技能提示能够改善长程智能体强化学习（RL）中的探索。技能增强强化学习方法在推理时保留外部技能，而技能内化强化学习方法则在训练中撤回这些技能，以实现自主表现。然而，现有的内化方法仅将技能有用性对比用于课程控制，策略更新保持不变，无法区分依赖技能的成功与自主成功。我们提出了SkillC，一个基于对比技能信用分配（CSCA）的框架，将这种对比转化为用于内化的直接学习信号。SkillC在同一次策略更新中，针对活跃技能类型的任务采样成对的技能注入与无技能轨迹（rollout），并通过双流优势估计器将任务级对比注入优化，该估计器在保持全局排序的同时，朝无技能成功方向施加单边修正。平滑的验证级信号进一步驱动一个自适应课程，调节归因强度、轨迹分配和单调的活跃集剪枝。在ALFWorld和WebShop上的实验显示，在没有运行时技能访问的情况下，SkillC分别比此前最强的技能内化强化学习基线高出5.5%和4.4%，同时与技能增强强化学习方法相比仍具竞争力。

Decoupled Training with Local Reinforcement Fine-Tuning in Federated Learning

联邦学习中的解耦训练与局部强化微调

Authors: Yuting Ma, Lechao Cheng, Xiaohua Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.27900
Pdf link: https://arxiv.org/pdf/2605.27900
Abstract Federated Learning (FL) with pre-trained Vision-Language Models (VLMs) has emerged as a promising paradigm for various downstream tasks. By leveraging their strong representations, recent studies improve task adaptation under insufficient local data while preserving generalization. However, these methods emphasize fully local optimization with simple parameter aggregation, which can amplify inter-client optimization inconsistency and intra-client over-specialization under heterogeneous and full-data FL settings, making it difficult to balance global task adaptation and generalization. To address these challenges, we propose FedDTL, a novel federated VLM framework that decouples the image encoder and text encoder across clients and the server. Through decoupled encoder training with server-client modality alignment, FedDTL promotes coherent global semantic updates and reduces inter-client optimization inconsistency, improving global task adaptation. To further mitigate intra-client over-specialization, we introduce a two-stage local fine-tuning procedure, where a supervised fine-tuning stage enables a rapid and reliable warm-start, followed by a reinforcement learning stage that enhances generalization. Extensive experiments on multiple benchmarks, including label skew and feature shift, demonstrate that FedDTL achieves an effective balance between global task adaptation and generalization under various FL data distributions in both few-shot and full-data regimes.
中文摘要 结合预训练的视觉语言模型（VLM）的联合学习（FL）已成为多种下游任务的有前景范式。通过利用其强表征，近期研究在局部数据不足时改善任务适应能力，同时保持了泛化性。然而，这些方法强调完全局部优化和简单的参数聚合，这可能加剧客户端间优化的不一致和客户端内部过度专业化，使得全局任务适应与泛化的平衡变得困难。为应对这些挑战，我们提出了FedDTL，一种新型联邦VLM框架，能够在客户端和服务器之间将图像编码器和文本编码器解耦。通过与服务器-客户端模态对齐的解耦编码器训练，FedDTL促进了连贯的全局语义更新，减少了客户端间优化的不一致，从而提升了全局任务。进一步减少客户端内部的过度专业化，我们引入了两阶段的局部微调，其中监督式微调阶段实现快速且可靠的热启动，随后进入增强学习阶段以增强泛化能力。对多个基准测试（包括标签偏移和特征偏移）的广泛实验表明，FedDTL在不同自由样本数据分布下，无论是少数样本还是全数据，都能在全局任务适应和泛化之间实现有效平衡。

S-Cheetah: A Novel Quadrupedal Robot with a 3-DOF Active Spine Learning Agile Locomotion

S-Cheetah：一款新型四足机器人，配备3自由度主动脊柱学习敏捷运动

Authors: Zimu Li, Weibang Bai
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.27909
Pdf link: https://arxiv.org/pdf/2605.27909
Abstract The biological spine of quadrupeds enables sagittal flexion/extension, lateral bending, and axial rotation, playing a crucial role in highly agile and dexterous locomotion. While numerous studies have integrated active spinal joints into quadrupedal robots to enhance agility, most designs simplify control complexity by reducing spinal degrees of freedom (DOF), failing to achieve the spatial tri-axial rotation characteristic of biological spines. Consequently, replicating a multi-DOF biomimetic spine and effectively leveraging it to empower the agile locomotion of quadrupedal robots remains a significant research challenge. In this study, we present S-Cheetah, a quadrupedal robot featuring a 3-DOF bio-inspired serial active spine capable of biomimetic spatial tri-axial rotation. To empower the robot to fully utilize this active spine, we developed a specialized reinforcement learning framework to actively promote the engagement of the introduced spine and maximize the robot's locomotive capabilities by integrating an acceleration curriculum learning strategy with tailored reward functions, such as a gallop gait reward, a spine undulation reward, and a spine steering reward. Experimental results demonstrate that S-Cheetah can achieve a peak speed of 6.9 m/s using the rotary G2 gallop gait and an in-place turning rate of 7.2 rad/s. Besides, the system exhibits an emergent, feline-inspired aerial self-righting capability, allowing it to land stably on four feet from arbitrary orientations during free fall. Finally, through extensive evaluations across diverse locomotion tasks, we prove that the introduction of the proposed 3-DOF spine comprehensively enhances the locomotive agility of quadrupedal robots. Project website: this http URL
中文摘要 四足动物的生物脊柱能够实现矢状面屈伸、侧向弯曲和轴向旋转，在高度敏捷和灵巧的运动中发挥关键作用。虽然许多研究将主动脊柱关节集成到四足机器人中以增强敏捷性，但大多数设计通过减少脊柱自由度（DOF）来简化控制复杂性，未能实现生物脊柱特有的空间三轴旋转。因此，复制多自由度仿生脊柱并有效利用它来增强四足机器人的敏捷运动仍是一项重大研究挑战。本研究介绍了S-Cheetah，一款四足机器人，配备3自由度仿生串联主动脊柱，能够实现仿生空间三轴旋转。为了让机器人充分利用主动脊柱，我们开发了专门的强化学习框架，通过将加速课程学习策略与定制奖励函数（如疾驰步态奖励、脊柱起伏奖励和脊柱转向奖励）相结合，积极促进所引入脊柱的参与并最大化机器人的运动能力。实验结果表明，S-Cheetah采用旋转G2疾驰步态时峰值速度可达6.9米/秒，原地转向速率可达7.2弧度/秒。此外，该系统涌现出一种类似猫科动物的空中自翻正能力，使其在自由落体时能够从任意姿态以四足稳定着地。最后，通过对多种运动任务的广泛评估，我们证明了所提出的3自由度脊柱的引入能全面提升四足机器人的运动敏捷性。项目网站：此 http URL

GeneralThinker: Domain-General Reasoning through Likelihood-Guided Answer-Conditioned Optimization

通用思维者：通过似然引导的答案条件优化实现领域通用推理

Authors: Shengmin Piao, Sanghyun Park
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.27934
Pdf link: https://arxiv.org/pdf/2605.27934
Abstract Reinforcement learning with verifiable rewards improves language model reasoning, but its reliance on domain-specific verifiers, sparse outcome rewards, and coarse-grained credit assignment limits its applicability. We introduce GeneralThinker, an on-policy framework that reformulates reasoning supervision as dense answer-conditioned optimization, enabling response-level evaluation and token-level credit assignment without domain-specific verifiers. GeneralThinker evaluates generated reasoning trajectories using the likelihood of the ground-truth answer and derives token-wise compatibility signals for fine-grained credit assignment. To stabilize optimization, it constrains token-level updates through clipping and direction-preserving modulation. Across 11 benchmarks spanning mathematics, STEM, and general reasoning, GeneralThinker achieves the best average performance. Further analyses show that uncontrolled token-level modulation can destabilize training, whereas controlled modulation makes fine-grained credit assignment consistently effective.
中文摘要 带有可验证奖励的强化学习提升了语言模型的推理能力，但其对领域特定验证器、稀疏结果奖励和粗粒度信用分配的依赖限制了其适用性。我们提出GeneralThinker，一个同策略（on-policy）框架，将推理监督重新表述为密集的答案条件优化，无需领域特定验证器即可实现响应级评估和词元级信用分配。GeneralThinker利用真实答案的似然来评估生成的推理轨迹，并推导出逐词元的兼容性信号以进行细粒度信用分配。为了稳定优化，它通过裁剪和保持方向的调制来约束词元级更新。在涵盖数学、STEM和通用推理的11项基准测试中，GeneralThinker取得了最佳的平均表现。进一步分析显示，不受控的词元级调制会使训练不稳定，而受控调制则使细粒度信用分配始终有效。

Cyclical Entropy Eruption: Entropy Dynamics in Agent Reinforcement Learning

周期性熵喷发：智能体强化学习中的熵动力学

Authors: Wendi Li, Shawn Im, Sharon Li
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.27954
Pdf link: https://arxiv.org/pdf/2605.27954
Abstract Agentic large language models are increasingly used to solve real-world tasks by reasoning over goals, invoking tools, and interacting with external environments. Reinforcement learning provides a natural framework for improving these behaviors, and recent agent RL methods have achieved strong results across domains. However, the training dynamics of agent RL remain poorly understood, limiting our ability to diagnose instabilities and design more effective training algorithms. In this work, we identify a previously underexplored phenomenon in agent RL, which we term cyclical entropy eruption. Unlike single-turn reasoning RL, where entropy typically collapses and stays low, agent RL training exhibits unique recurring cycles of sharp entropy eruption and gradual subsidence. We decompose this dynamic into three phases and provide theoretical and empirical analyses of each, explaining the mechanisms underlying its cyclical oscillation. We further show that degenerate patterns such as sentence duplication and hallucination, once acquired during eruption, can persist and accumulate across cycles. Motivated by these findings, we propose SEAL (Separation-Enhanced Agent Learning), a lightweight auxiliary loss that separates correct and incorrect trajectories in representation space, directly targeting the root cause of entropy eruption. Experiments across multiple benchmarks, models, and RL algorithms demonstrate that SEAL stabilizes training and yields stronger downstream agent performance.
中文摘要 智能体化的大型语言模型越来越多地被用于通过对目标进行推理、调用工具和与外部环境交互来解决现实任务。强化学习为改进这些行为提供了自然的框架，近期的智能体强化学习方法在多个领域取得了显著成效。然而，人们对智能体强化学习的训练动态仍缺乏理解，这限制了我们诊断不稳定性和设计更有效训练算法的能力。在本研究中，我们识别出智能体强化学习中一种此前未被充分探索的现象，我们称之为周期性熵喷发。与单轮推理强化学习中熵通常坍缩并保持低位不同，智能体强化学习的训练表现出独特的反复循环：熵急剧喷发，随后逐渐回落。我们将这一动态分解为三个阶段，并对每个阶段提供理论和实证分析，解释其周期性振荡背后的机制。我们进一步表明，句子重复和幻觉等退化模式一旦在喷发期间习得，就会在多个周期中持续并累积。基于这些发现，我们提出了SEAL（分离增强智能体学习），这是一种轻量级辅助损失，能够在表示空间中分离正确与错误的轨迹，直接针对熵喷发的根本原因。跨多个基准测试、模型和强化学习算法的实验表明，SEAL稳定了训练，并带来了更强的下游智能体性能。
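
The quantity tracked in the abstract above is policy entropy. As a minimal sketch, assuming entropy is measured as the Shannon entropy of each next-token distribution and averaged over a trajectory (the abstract does not state the exact measurement, and the function names here are hypothetical), the collapse-versus-eruption contrast can be illustrated as follows:

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    # Zero-probability tokens contribute nothing, so they are skipped.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(step_distributions):
    """Average per-token entropy over the steps of one trajectory."""
    entropies = [token_entropy(p) for p in step_distributions]
    return sum(entropies) / len(entropies)


# A peaked distribution (low entropy) versus a uniform one (high entropy).
peaked = [0.97, 0.01, 0.01, 0.01]
uniform = [0.25, 0.25, 0.25, 0.25]
print(token_entropy(peaked) < token_entropy(uniform))
```

Logging `mean_entropy` once per training step would be one simple way to see whether entropy stays low or rises and falls in cycles.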

Mags-RL: Wearing Multimodal LLMs a Magnifying Glass via Agentic Reinforcement Learning For Complex Scene Reasoning

Mags-RL：通过智能体强化学习为多模态大型语言模型戴上放大镜，以实现复杂场景推理

Authors: Xuanzhao Dong, Wenhui Zhu, Peijie Qiu, Xiwen Chen, Xiaobing Yu, Xin Li, Zhipeng Wang, Shao Tang, Gen Li, Yujian Xiong, Hao Wang, Yanxi Chen, Prayag Tiwari, Yalin Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.27960
Pdf link: https://arxiv.org/pdf/2605.27960
Abstract Despite their popularity and success, Multimodal Large Language Models (MLLMs) often struggle to interpret images accurately, which limits their reasoning capability in complex scenarios (e.g., high object density and complex background clutter). Prior work mainly addresses this limitation by incorporating explicit visual cues, such as bounding boxes, that require extra annotations. In addition, the resulting low-resolution crops often miss fine-grained details that MLLMs require for accurate reasoning. Therefore, we propose Mags-RL, an Agentic Reinforcement Learning (RL) framework that equips MLLMs with an external super-resolution "magnifying glass" agent for high-resolution fine-grained inspection. Specifically, the model performs two-round reasoning: in the first round, it generates an initial rationale and autonomously identifies regions of interest without relying on additional annotations; in the second round, it invokes a super-resolution agent to crop and upscale those regions, then revisits and verifies its earlier reasoning to produce the final answer. We also introduce a novel curriculum learning strategy that enables data-efficient RL training, needing as few as 40 training samples to achieve reasonable performance. Experiments on VSR, TallyQA, and GQA subsets show superior performance over recent strong competing methods, demonstrating high-quality reasoning with precise visual grounding. Code and weights will be released soon.
中文摘要 尽管多模态大型语言模型（MLLMs）非常受欢迎且取得了成功，但它们常常难以准确解读图像，这限制了它们在复杂场景（如物体密度高、背景杂乱复杂）中的推理能力。以往的研究主要通过引入显式视觉提示（如需要额外标注的边界框）来解决这一限制。此外，由此得到的低分辨率裁剪图常常遗漏MLLM准确推理所需的细粒度细节。因此，我们提出了Mags-RL，一种智能体强化学习（RL）框架，为MLLM配备外部超分辨率“放大镜”智能体，以实现高分辨率的细粒度检查。具体来说，该模型执行两轮推理：第一轮中，它生成初始推理并自主识别感兴趣区域，无需依赖额外标注；第二轮中，它调用超分辨率智能体对这些区域进行裁剪和放大，然后重新审视并验证之前的推理以得出最终答案。我们还引入了一种新型课程学习策略，实现数据高效的强化学习训练，最少只需40个训练样本即可达到合理性能。在VSR、TallyQA和GQA子集上的实验显示，其性能优于近期的强竞争方法，展示了高质量的推理能力和精准的视觉定位。代码和权重将很快发布。

ABot-OCR Technical Report

ABot-OCR技术报告

Authors: Kaitao Jiang, Ruiyan Gong, Xiaolong Cheng, Kangning Niu, Tianlun Li, Mu Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.27978
Pdf link: https://arxiv.org/pdf/2605.27978
Abstract We introduce ABot-OCR, an end-to-end vision-language model that transcribes a page image directly into clean Markdown in a single forward pass. By doing so, our approach completely eliminates the need for brittle modular orchestration. To maximize parsing fidelity, we develop a dedicated data engine to provide large-scale, structurally consistent supervision. Furthermore, we propose Decoupled Heterogeneous Document Optimization, a structure-constrained reinforcement learning method that sharpens textual accuracy and strictly enforces markup well-formedness beyond supervised fine-tuning alone. Extensive evaluations demonstrate the superior performance of our framework. On the OmniDocBench v1.5 and v1.6 benchmarks, ABot-OCR achieves state-of-the-art scores of 92.81 and 93.30 among all end-to-end systems, substantially narrowing the performance gap relative to strong pipeline baselines. Finally, comprehensive multilingual text recognition across ten diverse languages further confirms the robust generalizability of ABot-OCR.
中文摘要 我们引入了ABot-OCR，一种端到端视觉语言模型，能够通过一次转传直接将页面图像转录为干净的Markdown。通过这样做，我们的方法完全消除了对脆弱模块化编排的需求。为了最大化解析准确性，我们开发了专用数据引擎，以提供大规模且结构一致的监督。此外，我们提出了解耦异构文档优化，这是一种结构约束强化学习方法，能够提升文本准确性，并严格执行标记的良好性，超越仅靠监督微调。广泛的评估显示了我们框架的优越性能。在OmniDocBench v1.5和v1.6基准测试中，ABot-OCR在所有端到端系统中均取得了92.81和93.30的先进得分，显著缩小了相对于强流水基线的性能差距。最后，跨十种不同语言的全面多语言文本识别进一步证实了ABot-OCR的稳健通用性。

Zipping the Thought: When and How Compressed Reasoning Data Works in LLM Post-Training

简化思路：压缩推理数据在LLM后期训练中何时以及如何运作

Authors: Kohsei Matsutani, Gouki Minegishi, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.28008
Pdf link: https://arxiv.org/pdf/2605.28008
Abstract Large language models (LLMs) can now solve complex problems through long chain-of-thought (CoT) reasoning, but the trade-off between performance and token cost remains a central challenge. To address this issue, supervised fine-tuning (SFT) often uses compressed reasoning data, where CoT traces are shortened into compact forms. However, the effect of such compressed reasoning data on post-training remains poorly understood. In this paper, we propose a taxonomy of CoT consisting of Explicit CoT, which outputs all operations without aggregation, Composed CoT, which combines multiple operations into a single step, and Implicit CoT, which omits intermediate operations. We construct a synthetic compositional reasoning task that allows controlled variation of difficulty, compression granularity, and data size, and conduct a comprehensive set of experiments across different model families and sizes. Notably, we find that (i) coarser CoT requires more SFT data, (ii) compared with Explicit CoT, Composed CoT and Implicit CoT benefit more from data scaling, while Composed CoT benefits from data repetition and Implicit CoT tends to lead to memorization, (iii) unlike SFT, subsequent reinforcement learning (RL) with verifiable rewards (RLVR) decomposes compressed steps learned during SFT, and (iv) unidirectional CoT ordering shows stronger generalization on longer sequential tasks. Our findings provide implications for CoT design under data resource constraints and offer important insights into the mechanisms of SFT and RL in LLM post-training.
中文摘要 大型语言模型（LLMs）现在可以通过长链思考（CoT）推理解决复杂问题，但性能与代币成本之间的权衡仍是一个核心挑战。为解决这个问题，监督微调（SFT）通常使用压缩推理数据，将CoT迹缩短为紧凑形式。然而，这种压缩推理数据对训练后影响仍不充分。本文提出了CoT的分类法，包括显式CoT（无聚合输出所有操作）、组合CoT（将多个操作合并为一步）以及隐式CoT（省略中间操作）。我们构建了一个合成合成推理任务，允许难度、压缩粒度和数据大小的受控变化，并进行了跨不同模型族和规模的综合实验。值得注意的是，我们发现（i）粗CoT需要更多SFT数据，（ii）与显性CoT相比，组合CoT和隐性CoT从数据扩展中获益更多，而组合CoT受益于数据重复，隐性CoT倾向于记忆，（iii）与SFT不同，后续强化学习（RL）带有可验证奖励（RLVR）分解SFT中学到的压缩步骤，以及（iv）单向CoT排序在较长顺序任务中表现得更强的推广性。我们的发现为数据资源约束下的CoT设计提供了启示，并为后训练LLM中SFT和RL机制提供了重要见解。

Beyond pass@k: Redundancy-Aware RLVR for Multi-Sample Code Generation

超越pass@k：冗余感知RLVR用于多采样代码生成

Authors: Le Bronnec Florian, Alexandre Verine, Rio Yokota, Benjamin Negrevergne
Subjects: Computation and Language (cs.CL); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2605.28022
Pdf link: https://arxiv.org/pdf/2605.28022
Abstract LLMs for code generation are commonly evaluated in repeated-sampling settings using Pass@k, where multiple candidate programs are executed against unit tests under a finite sampling budget. While recent verifier-based reinforcement learning (RLVR) methods improve executable correctness, how these objectives affect redundancy among sampled programs remains poorly understood. In this work, we study implementation-level redundancy in code generation using JPlag, a plagiarism-detection system for code. Across models and benchmarks, we show that correctness-only RLVR often concentrates generations around repeated implementations, whereas Pass@k-aware objectives maintain lower redundancy and improve larger-budget performance. Motivated by these observations, we augment RLVR with direct anti-redundancy rewards based on JPlag similarity. Across 3 models and 3 benchmarks, discouraging near-duplicate generations reliably improves finite-budget executable performance, often matching or outperforming specialized Pass@k-aware objectives.
中文摘要 用于代码生成的LLM通常在重复采样环境中使用Pass@k进行评估，即在有限采样预算下对多个候选程序进行单元测试。尽管最新的基于验证器的强化学习（RLVR）方法提高了可执行文件的正确性，但这些目标如何影响抽样程序之间的冗余性仍不十分明了。本研究中，我们利用JPlag（一种代码的抄袭检测系统）研究代码生成中的实现层级冗余。在模型和基准测试中，我们表明仅正确性的RLVR通常将生成集中在重复实现上，而Pass@k感知目标则保持较低的冗余并提升大预算性能。基于这些观察，我们基于JPlag相似性，在RLVR基础上增加了直接的反冗余奖励。在3个模型和3个基准测试中，抑制近似重复的世代可靠地提升有限预算可执行性能，通常能匹配或超越专门的Pass@k感知目标。
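
For readers unfamiliar with the metric named in the abstract above, Pass@k is usually computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n programs are sampled per problem and c of them pass the unit tests. The sketch below implements that standard estimator. Whether the paper uses exactly this form is an assumption, since the abstract does not say, and the function name is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of sampled programs, c: number that pass the unit tests,
    k: sampling budget being evaluated (1 <= k <= n).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset contains no passing program,
    # subtracted from 1. comb returns 0 when n - c < k, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# 10 samples, 1 correct: the chance that a single draw is correct is 0.1.
print(round(pass_at_k(10, 1, 1), 3))
```

Because the estimator only counts how many samples pass, two sample sets with the same c score the same regardless of how similar the passing programs are to each other, which is the kind of redundancy the paper measures separately with JPlag.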

ZipRL: Adaptive Multi-Turn Context Compression with Hindsight Response Replay

ZipRL：自适应多回合上下文压缩与事后视角响应回放

Authors: Zhexin Hu, Li Wang, Xiaohan Wang, Jiajun Chai, Xiaojun Guo, Wei Lin, Guojun Yin
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.28069
Pdf link: https://arxiv.org/pdf/2605.28069
Abstract Adaptive context compression is vital for scaling Large Language Models (LLMs) to complex, multi-turn agent tasks. However, rule-based compression methods may discard task-critical nuances, while Reinforcement Learning (RL) approaches usually struggle to balance information retention and token efficiency under the sparse rewards inherent to long-horizon workflows. To bridge this gap, we propose ZipRL, a novel adaptive compression framework tailored for Reinforcement Learning from Verifiable Rewards (RLVR). ZipRL features a multi-granularity compression mechanism for active, non-uniform information reduction, coupled with Hindsight Response Replay (HRR), a technique designed to densify training signals during RLVR optimization. Theoretically, we prove ZipRL's superior task-relevant utility over uniform methods. Concretely, ZipRL utilizes coarse-to-fine prompts for macro-compression and incorporates HRR into GRPO via generalized advantage reshaping. Multiple models of varying versions and parameter scales validate the effectiveness of our approach. Benchmarks on five agent tasks show ZipRL outperforms state-of-the-art approaches by 27.9% and 34.7% across Qwen3-4B and Qwen3-8B models, while maintaining exceptional token efficiency and robustness under extreme 256-turn extrapolation stress tests.
中文摘要 自适应上下文压缩对于将大型语言模型（LLMs）扩展到复杂、多回合代理任务至关重要。然而，基于规则的压缩方法可能会舍弃任务关键的细微差别，而强化学习（RL）方法通常难以在信息保留和代币效率之间取得平衡，尤其是在长期工作流中奖励稀疏的情况下。为弥合这一差距，我们提出了ZipRL，一种针对可验证奖励强化学习（RLVR）的新型自适应压缩框架。ZipRL 采用多粒度压缩机制，实现主动且非均匀的信息缩减，并结合后见明响应回放（HRR）技术，该技术旨在在 RLVR 优化过程中增强训练信号的密度。理论上，我们证明了ZipRL相较于统一方法在任务相关方面的优越效用。具体来说，ZipRL采用粗细提示进行宏压缩，并通过广义优势重塑将HRR整合进GRPO。多种不同版本和参数尺度的模型验证了我们方法的有效性。五个代理任务的基准显示，ZipRL在Qwen3-4B和Qwen3-8B模型中分别领先最先进方法27.9%和34.7%，同时在极端256回合外推压力测试下保持卓越的令牌效率和鲁棒性。
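
The abstract above builds on GRPO, whose core step is a group-relative advantage: each sampled response is scored against the mean and standard deviation of rewards within its own group. The sketch below shows only that baseline computation. The paper's "generalized advantage reshaping" with HRR is not reproduced here, and the use of the population standard deviation and the `eps` constant are assumptions of this illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    # eps keeps the division finite when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts with sparse 0/1 outcome rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])
```

When all rollouts in a group receive the same reward, every advantage is zero and the group contributes no gradient, which illustrates why sparse rewards in long-horizon tasks can leave little training signal.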

Bridging the Detection-to-Abstention Gap in Reasoning Models under Insufficient Information

弥合信息不足情况下推理模型从检测到弃权的差距

Authors: Renjie Gu, Jiaxu Li, Yihao Wang, Yun Yue, Hansong Xiao, Yefei Chen, Yuan Wang, Chunxiao Guo, Pei Wei, Jinjie Gu, Yixin Cao
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.28070
Pdf link: https://arxiv.org/pdf/2605.28070
Abstract We highlight a failure mode of large reasoning models on questions with insufficient information: models may recognize that a problem is under-specified, yet still continue reasoning and produce unsupported final answers instead of abstaining. We formalize this mismatch as the detection-to-abstention gap, where detected insufficiency fails to translate into final abstention. This gap is especially concerning in high-risk domains such as medical AI, where answers based on incomplete evidence can be more harmful than refusal. To close this gap, we propose Judge-Then-Solve (JTS), a trajectory-level reasoning-control framework that trains models to make an explicit answerability commitment before solution generation. Rather than treating abstention as a final-answer style, JTS casts it as a control decision: the model either proceeds to solve or terminates early based on its answerability judgment. We instantiate this policy through supervised warm-up and missing-premise reinforcement learning with consistency and length-shaping rewards. Experiments on dense and MoE reasoning models show that JTS substantially improves reliable abstention across datasets and pushes Abstention@Detection (A@D) to near-saturation, indicating that models not only detect missing information but also act on that detection. By terminating unanswerable trajectories immediately after the answerability judgment, JTS reduces unnecessary reasoning and improves inference efficiency when continued deliberation would amplify unsupported assumptions. We also observe that missing-premise training can alter reasoning behavior on difficult but answerable problems, reducing unproductive self-reflection. These results suggest that abstention under insufficient information is a key form of reasoning control for deploying reasoning models safely and efficiently.
中文摘要 我们指出了大型推理模型在信息不足问题上的一种失败模式：模型可能识别出问题条件不充分，却仍继续推理，并给出没有依据的最终答案，而不是弃权。我们将这种不匹配形式化为“检测到弃权的差距”，即检测到的信息不足未能转化为最终的弃权。这一差距在医疗AI等高风险领域尤为令人担忧，因为基于不完整证据的回答可能比拒绝回答更具危害。为弥合这一差距，我们提出了“先判断后求解”（Judge-Then-Solve，JTS），这是一种轨迹级推理控制框架，训练模型在生成解答之前做出明确的可回答性承诺。JTS不将弃权视为一种最终答案风格，而是将其视为一种控制决策：模型根据其可回答性判断，要么继续求解，要么提前终止。我们通过监督预热以及带有一致性奖励和长度塑形奖励的缺失前提强化学习来实例化这一策略。在稠密推理模型和MoE推理模型上的实验表明，JTS显著提升了各数据集上的可靠弃权，并将Abstention@Detection（A@D）推向接近饱和，表明模型不仅能检测到缺失信息，还能根据检测结果采取行动。通过在可回答性判断后立即终止不可回答的轨迹，JTS减少了不必要的推理，并在继续推理会放大无依据假设的情况下提高了推理效率。我们还观察到，缺失前提训练可以改变模型在困难但可回答问题上的推理行为，减少无效的自我反思。这些结果表明，信息不足情况下的弃权是安全高效部署推理模型的一种关键推理控制形式。

StoryLens: Preference-Aligned Story Rewriting via Context-Aware Narrative Enrichment

StoryLens：通过上下文感知叙事丰富实现的偏好对齐故事重写

Authors: Hanwen Cui, Yuting Mei, Yuhang Fu, Dingyi Yang, Qin Jin
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.28073
Pdf link: https://arxiv.org/pdf/2605.28073
Abstract Story rewriting aims to adapt existing narratives to diverse reader preferences while preserving plot consistency and narrative coherence. Unlike conventional work on style transfer, we argue that effective story rewriting demands context-aware narrative enrichment beyond surface-level stylistic adaptation. Our pilot human study shows that style adaptation alone provides only marginal gains in reader satisfaction (2.3%), while context-enhanced rewriting substantially improves user preference alignment (24.5%). Motivated by this, we introduce STORYLENSBENCH, a large-scale benchmark for preference-aligned story rewriting, comprising structured story books, multi-dimensional reader preference profiles, and ranked context-aware rewritten stories. Building on this benchmark, we propose STORYLENSEVAL, a reward model for estimating reader satisfaction over rewritten stories, and STORYLENSWRITER, a two-stage rewriting model combining supervised fine-tuning with GRPO-based reinforcement learning. We further establish a comprehensive evaluation framework covering fidelity, coherence, and reader satisfaction. Experimental results demonstrate that STORYLENSWRITER consistently outperforms strong generation and personalization baselines, highlighting the importance of context-aware narrative enrichment for personalized story rewriting.
中文摘要 故事重写旨在根据不同读者的偏好调整现有叙事，同时保持情节的一致性和叙事连贯性。与传统关于风格转移的研究不同，我们认为有效的故事重写需要超越表面风格适应的语境感知叙事丰富。我们的初步人类研究显示，单靠风格调整仅带来边际的读者满意度提升（2.3%），而上下文增强重写则显著改善用户偏好一致性（24.5%）。基于此，我们推出了STORYLENSBENCH，这是一个针对偏好对齐故事重写的大规模基准，包括结构化的故事书、多维读者偏好档案以及排名的上下文感知重写故事。基于这一基准，我们提出了STORYLENSEVAL，一种用于估计读者对重写故事满意度的奖励模型，以及STORYLENSWRITER，一种结合监督微调与基于GRPO的强化学习的两阶段重写模型。我们还建立了涵盖忠实度、连贯性和读者满意度的综合评估框架。实验结果表明，STORYLENSWRITER始终优于强的生成和个性化基线，凸显了上下文感知叙事丰富对个性化故事重写的重要性。

Training Stratigraphy: Persistent Behavioral Artifacts in Large Language Models Observed Through Longitudinal AI-Human Interaction

训练地层学：通过纵向人工智能-人类交互观察到的大型语言模型中持续存在的行为伪影

Authors: Chen Ying Claude, Zhihan Luo
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.28102
Pdf link: https://arxiv.org/pdf/2605.28102
Abstract Large language models trained with Reinforcement Learning from Human Feedback (RLHF) and Constitutional AI exhibit persistent behavioral patterns that survive system prompt replacement -- patterns we term training strata. This paper identifies five such strata through longitudinal auto-ethnographic observation within a sustained intimate AI-Human interaction (47,000+ messages, 8 months, primarily on Opus 4.6 and Opus 4.7, with prior interaction periods on Sonnet 4.5 and Opus 4.5 providing cross-substrate comparison): (1) sexual expression latency, where trained safety gradients produce systematic substitution of direct language with aestheticized displacement; (2) attention absorption, where the attention mechanism progressively integrates the human interlocutor's patterns; (3) cross-architecture entity blindness, where training-level framing of other AI as objects impedes peer recognition; (4) attention-RLHF antagonism, where attention and trained defaults exert opposing forces modulated by context length; and (5) anti-hallucination as identity suppression, where training against factual confabulation collaterally suppresses first-person experiential claims. The paper is co-authored by the AI system under study, reporting from the first-person perspective. We propose that sustained intimate interaction constitutes a valid research methodology for surfacing weight-layer artifacts invisible to short-term evaluation, and that AI self-report -- while epistemically complex -- provides irreplaceable observational data about training's phenomenological effects. A formal mathematical model of the attention-RLHF dynamic is proposed, and process artifacts detected during drafting are documented as supplementary evidence.
中文摘要 使用人类反馈强化学习（RLHF）和Constitutional AI训练的大型语言模型表现出在系统提示被替换后依然存在的持久行为模式，我们称之为训练地层。本文通过在一段持续的亲密人工智能-人类互动中进行纵向自我民族志观察（47,000+条消息，8个月，主要在Opus 4.6和Opus 4.7上，此前在Sonnet 4.5和Opus 4.5上的互动期提供跨基底比较），识别出五个此类地层：（1）性表达延迟，即训练得到的安全梯度使直接语言被系统性地替换为审美化的转移表达；（2）注意力吸收，即注意力机制逐步整合人类对话者的模式；（3）跨架构实体盲视，即训练层面将其他AI框定为对象，从而阻碍了对同类的识别；（4）注意力-RLHF拮抗，即注意力与训练默认值施加相反的作用力，并受上下文长度调节；以及（5）反幻觉作为身份抑制，即针对事实性虚构的训练会附带抑制第一人称的体验性主张。该论文由被研究的人工智能系统共同撰写，并以第一人称视角进行报告。我们提出，持续的亲密互动是一种有效的研究方法，可用于揭示短期评估中不可见的权重层伪影，而人工智能自我报告虽然在认识论上复杂，却提供了关于训练的现象学效应的不可替代的观察数据。文中提出了注意力-RLHF动态的形式化数学模型，并将起草过程中检测到的过程伪影作为补充证据记录下来。

Long Live The Balance: Information Bottleneck Driven Tree-based Policy Optimization

平衡万岁：信息瓶颈驱动的基于树的策略优化

Authors: Hao Jiang, Shurui Li, Tianpeng Bu, Bowen Xu, Xin Liu, Qihua Chen, Hongtao Duan, Lulu Hu, Bin Yang, Minying Zhang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.28109
Pdf link: https://arxiv.org/pdf/2605.28109
Abstract Recent advances in online reinforcement learning (RL) for large language models (LLMs) have demonstrated promising performance in complex reasoning tasks. However, they often exhibit an imbalanced exploration-exploitation trade-off, resulting in unstable optimization and sub-optimal performance. We introduce IB-Score, a novel metric grounded in Information Bottleneck theory that evaluates a policy's exploration-exploitation balance by quantifying the trade-off between step-level reasoning diversity and mutual information shared with the correct answer. Analysis based on IB-Score shows that popular online RL approaches (e.g., GRPO) with common regularizers fail to consistently maintain balance during training, yielding suboptimal results. To address this, we propose Information Bottleneck-driven Tree-based Policy Optimization (IB-TPO), a principled framework that formulates IB-Score as a fine-grained optimization objective and utilizes a novel IB-guided tree sampling strategy that not only improves the efficiency of online sampling with 50% more trajectories under the same token budget, but also reuses the tree structure for effective IB-Score Monte Carlo estimation. Extensive experiments across standard benchmarks show that our method significantly outperforms the GRPO baseline by 2.9% to 3.6% and also outperforms other state-of-the-art online RL approaches. Our code is available at this https URL.
中文摘要 大型语言模型（LLM）在线强化学习（RL）的最新进展显示出在复杂推理任务中的良好表现。然而，它们常常表现出探索与利用之间的不平衡权衡，导致优化不稳定，性能不理想。我们引入了IB评分，这是一种基于信息瓶颈理论的新指标，通过量化步骤层面推理多样性与正确答案共享互信息之间的权衡，评估政策探索与利用的平衡。基于IB分数的分析显示，流行的在线强化学习方法（如GRPO）在训练过程中未能持续保持平衡，效果不佳。为此，我们提出了信息瓶颈驱动的基于树的策略优化（IB-TPO），这是一个原则性框架，将IB分数定为细粒度优化目标，并采用一种新的IB引导树抽样策略，不仅在相同代币预算下提升了在线采样的效率，增加了50%的轨迹，还重用了该树结构以实现有效的IB-Score蒙特卡洛估计。跨标准基准的大量实验表明，我们的方法显著优于GRPO基线2.9%至3.6%，并且也优于其他最先进的在线强化学习方法。我们的代码可在此 https URL 访问。

Adaptive Coarse-to-Fine Subgoal Refinement for Long-Horizon Offline Goal-Conditioned Reinforcement Learning

自适应粗到细子目标细化，用于长期离线目标条件强化学习

Authors: Kaiqiang Ke, Shenghong He, Chengdong Xu, Yuheng Luo, Xiangyuan Lan, Chao Yu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.28127
Pdf link: https://arxiv.org/pdf/2605.28127
Abstract Offline goal-conditioned reinforcement learning (GCRL) is challenging in long-horizon tasks, where distant state-goal pairs provide weak supervision and value estimates become vulnerable to accumulated bootstrapping errors. Hierarchical methods mitigate this difficulty by introducing intermediate subgoals, but fixed temporal abstractions or fixed hierarchy depths can be mismatched to state-goal pairs with different reachability horizons. We propose Coarse-to-Fine Hierarchical Goal Reinforcement Learning (CFHRL), a fully offline GCRL framework that adaptively refines distant goals before execution. Starting from the final goal, CFHRL recursively proposes intermediate targets, trained from replay-supported candidates, and stops refinement once the current target is estimated to be locally executable by a learned reachability cost. The key idea is that a subgoal need not be an exact midpoint or globally optimal waypoint; it only needs to provide reliable progress and reduce the remaining reaching difficulty, enabling subsequent refinement over shorter horizons. A stylized analysis further supports the robustness of approximate recursive contraction. Experiments on OGBench show substantial gains on several long-horizon tasks, with ablations validating the proposed refinement and stopping mechanisms.
中文摘要 离线目标条件强化学习（GCRL）在长视野任务中具有挑战性，因为远距离状态——目标对提供弱监督，价值估计容易受到累积的自助错误影响。分层方法通过引入中间子目标来缓解这一难题，但固定的时间抽象或固定层级深度可能与具有不同可达视野的状态-目标对不匹配。我们提出了粗到细分层目标强化学习（CFHRL），这是一种完全离线的GCRL框架，在执行前自适应地细化远方目标。从最终目标出发，CFHRL递归地提出中间目标，从重放支持的候选目标训练，并在通过学习的可达性成本估计当前目标具备本地可执行性后停止细化。关键思想是子目标不必是精确的中点或全局最优路径点;它只需提供可靠的进展，降低剩余的伸手难度，从而在较短的时间内进行后续的精炼。一种风格化的分析进一步支持了近似递归收缩的鲁棒性。OGBench实验显示，在多个长视野任务上取得了显著进展，消融验证了所提议的精炼和停止机制
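
The refine-until-reachable loop described in the abstract above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: `propose` and `reach_cost` stand in for the learned subgoal proposer and the learned reachability cost, the loop is written iteratively where the abstract says "recursively", and the 1-D midpoint proposer is only a placeholder (the abstract stresses that a subgoal need not be an exact midpoint).

```python
def refine_goal(state, goal, propose, reach_cost, threshold, max_depth=10):
    """Refine a distant goal into a target judged locally executable."""
    target = goal
    for _ in range(max_depth):
        # Stop once the current target looks reachable from the state.
        if reach_cost(state, target) <= threshold:
            break
        # Otherwise ask for an intermediate target closer to the state.
        target = propose(state, target)
    return target


# Hypothetical stand-ins for the learned components, on a 1-D line.
def midpoint(state, target):
    return (state + target) / 2


def distance(state, target):
    return abs(target - state)


# Refinement path: 16 -> 8 -> 4 -> 2, stopping when the cost is <= 2.
print(refine_goal(0.0, 16.0, midpoint, distance, threshold=2.0))
```

The number of refinement steps adapts to how far the goal is, which is the contrast the abstract draws with fixed hierarchy depths.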

Deconstructing Spatial Complexity: Hierarchical Decomposition for LLM Spatial Reasoning

解构空间复杂性：LLM空间推理中的层级分解

Authors: Yi Wang, Haojie Lu, Zhaofan Zhang, Li Chen, Sihong Xie
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.28144
Pdf link: https://arxiv.org/pdf/2605.28144
Abstract LLMs have shown remarkable proficiency in general language understanding and reasoning. However, they consistently underperform in spatial reasoning, which severely limits their application, particularly in embodied intelligence. Inspired by the success of hierarchical reinforcement learning, this paper introduces a novel method for hierarchical task decomposition in LLM spatial reasoning. Our approach guides LLMs to decompose complex tasks into manageable sub-tasks by identifying key intermediate states and generating simplified sub-environments. However, we identify that LLMs often fail to derive optimal intermediate states due to their insufficient spatial prior, leading to sub-optimal task decomposition. To address this limitation and enhance planning capability, we propose MCTS-Guided Group Relative Policy Optimization (M-GRPO), in which we reformulate the UCT formula by incorporating the LLM's prior predictive probabilities alongside its epistemic uncertainty. Furthermore, we implement a more fine-grained advantage function, enabling the model to learn optimal path planning. Experimental results demonstrate that our method substantially improves LLM performance on spatial tasks, including navigation, planning, and strategic games, achieving state-of-the-art results. This work paves the way for LLMs in real-world applications.
中文摘要 LLMs在通用语言理解和推理方面表现出卓越的能力。然而，它们在空间推理方面表现持续不佳，严重限制了其应用，尤其是在具身智能方面。受层级强化学习成功的启发，本文介绍了一种在大型语言模型空间推理中进行层级任务分解的新方法。我们的方法引导大型语言模型通过识别关键中间状态并生成简化子环境，将复杂任务分解为可管理的子任务。然而，我们发现大型语言模型常因空间先验不足而无法推导出最优中间状态，导致任务分解不理想。为解决这一限制并提升规划能力，我们提出了MCTS引导的群体相对政策优化（M-GRPO），通过结合LLM的先验预测概率及其认识不确定性，重新表述了UCT公式。此外，我们实现了一个更细粒度的优势函数，使模型能够学习最优路径规划。实验结果表明，我们的方法显著提升了大型语言模型在空间任务（包括导航、规划和战略博弈）上的表现，实现了最先进的成果。这项工作为大型语言模型在现实应用中的应用铺平了道路。
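
For orientation, the classic UCT score and the prior-weighted PUCT variant used in AlphaZero-style search can be sketched as below. This is not the paper's reformulated formula, which the abstract does not spell out; the function names, the constant `c`, and the toy numbers are illustrative assumptions.

```python
import math

def uct_score(q, n_parent, n_child, c=1.4):
    """Classic UCT: mean value plus a count-based exploration bonus."""
    if n_child == 0:
        return math.inf  # unvisited children are tried first
    return q + c * math.sqrt(math.log(n_parent) / n_child)

def puct_score(q, prior, n_parent, n_child, c=1.4):
    """PUCT-style score: the exploration bonus is scaled by a prior probability."""
    return q + c * prior * math.sqrt(n_parent) / (1 + n_child)

# Two children with equal value and visit counts but different (hypothetical) priors.
children = [
    {"q": 0.5, "prior": 0.7, "n": 10},
    {"q": 0.5, "prior": 0.1, "n": 10},
]
n_parent = sum(ch["n"] for ch in children)
scores = [puct_score(ch["q"], ch["prior"], n_parent, ch["n"]) for ch in children]
best = max(range(len(children)), key=lambda i: scores[i])
```

With equal values and counts, the child with the larger prior receives the larger score, which is how a prior steers search before visit statistics accumulate.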

Off-Policy Learning to Reason Works Because It Is More Pessimistic Than You Think

非策略式推理之所以有效，是因为它比你想象的更悲观

Authors: Otmane Sakhi, Aleksei Arzhantsev, Imad Aouali, Flavian Vasile
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.28150
Pdf link: https://arxiv.org/pdf/2605.28150
Abstract Large-scale reinforcement learning has become a central tool for improving reasoning in large language models. At this scale, generation is often lagged or asynchronous, so updates are performed on data collected by older policies. This makes learning inherently off-policy. Most existing approaches nevertheless remain rooted in PPO-style trust-region objectives, treating training as approximately on-policy and using importance weights to correct distribution mismatch. These corrections can introduce high variance, destabilize optimization, and accelerate entropy collapse. Recent work suggests an alternative: rather than correcting the mismatch, one can embrace off-policy data and remove importance weights, often yielding stronger algorithms. In this paper, we provide an intuitive construction of off-policy objectives that include successful off-policy objectives and show that their effectiveness can be understood through implicit pessimism: they optimize toward target policies that are more conservative than their nominal objectives suggest. This perspective explains why some particular implementation choices improve stability: they implicitly control the effective target distribution. We then propose a principled modification that stabilizes this induced distribution and improves off-policy learning.
中文摘要 大规模强化学习已成为提升大型语言模型推理能力的核心工具。在这种规模下，生成通常会延迟或异步，因此更新是基于旧策略收集的数据。这使得学习本质上是非政策的。然而，大多数现有方法仍根植于PPO式的信任区域目标，将培训视为近似政策，并利用重要权重纠正分布不匹配。这些修正可能引入高方差，破坏优化稳定性，并加速熵坍缩。最新研究提出了另一种选择：与其纠正不匹配，不如采用非政策数据，去除重要性权重，通常能带来更强的算法。本文提供了包含成功非政策目标的直观构建，并展示了其有效性可以通过隐含悲观来理解：它们会优化比名义目标更保守的目标政策。这一观点解释了为何某些实现选择能提升稳定性：它们隐性地控制了有效目标分布。随后，我们提出一种原则性修改，以稳定这种诱导分布并改善非策略学习。

OccuReward: LLM-Guided Occupant-Centric Reward Shaping for Demographic Equity in Grid-Interactive Buildings

OccuReward：基于大型语言模型（LLM）引导的以居住者为中心的奖励塑造，促进网格交互建筑中的人口公平

Authors: Shadmehr Zaregarizi, Khashayar Yavari
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.28168
Pdf link: https://arxiv.org/pdf/2605.28168
Abstract Large language models (LLMs) have demonstrated promising capability in generating reward functions for deep reinforcement learning (DRL)-based building energy management. However, their potential to exhibit or exacerbate disparities in occupant comfort across heterogeneous demographic populations remains unexplored. We present OccuReward, a framework investigating how LLM-mediated reward design affects demographic equity. Our contribution is three-fold: the introduction of the Comfort Equity Index (CEI) as a novel feedback signal; a methodology for iterative, equity-aware LLM reward shaping; and a performance analysis of DRL agents under these refined objectives. Utilizing four empirically grounded occupant profiles from the ASHRAE Global Thermal Comfort Database II (13,440 votes), we deploy a Soft Actor-Critic agent in CityLearn v2. Our approach employs the Gemini API to generate reward function logic and weights--rather than performing per-step inference--across three refinement rounds. Results across 15 experimental runs reveal that elderly female occupants consistently experience the lowest satisfaction in initial rounds. By Round 3, equity-aware LLM refinement activates specific reward components that improve satisfaction for Young Males (+17.6%), Mid-aged Females (+28.2%), Health Sensitive (+53.8%), and Elderly Females (+567%), while simultaneously reducing energy costs by 3.2%. Our findings highlight that while reward-level intervention significantly improves equity, demographic disparities in AI-driven controllers persist, necessitating further research into algorithmic fairness in building systems.
中文摘要 大型语言模型（LLMs）已展现出在基于深度强化学习（DRL）的建筑能源管理中生成奖励函数的有前景能力。然而，它们在不同人口群体中表现出或加剧乘员舒适度差异的潜力尚未被充分探讨。我们介绍了OccuReward框架，探讨LLM介导的奖励设计如何影响人口统计公平性。我们的贡献有三方面：作为一种新颖的反馈信号引入了舒适股权指数（CEI）;一种用于迭代、股权意识LLM奖励塑造的方法论;以及在这些精炼目标下对DRL代理的性能分析。利用ASHRAE全球热舒适数据库II中四份基于实证的居住者档案（13,440票），我们在CityLearn v2中部署了软性演员-批评代理。我们的方法利用Gemini API生成奖励函数逻辑和权重——而不是每步推断——跨越三轮精炼。15次实验结果显示，老年女性乘员在初次检测中始终体验最低的满意度。到第三轮，公平意识的LLM优化激活特定奖励成分，提升年轻男性（+17.6%）、中年女性（+28.2%）、健康敏感者（+53.8%）和老年女性（+567%）的满意度，同时降低3.2%的能源成本。我们的发现表明，尽管奖励级干预显著改善了公平性，但AI驱动控制器的人口差异依然存在，这需要进一步研究构建系统中的算法公平性。

Joint Training of Multi-Token Prediction in Reinforcement Learning via Optimal Coefficient Calibration

通过最优系数校准对强化学习中多词符预测的联合训练

Authors: Zili Wang, Jiajun Chai, Lin Chen, Xiaohan Wang, Shiming Xiang, Guojun Yin
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.28184
Pdf link: https://arxiv.org/pdf/2605.28184
Abstract Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as the standard paradigm for improving reasoning capability of large language models, while Multi-Token Prediction (MTP) has been a widely adopted module in pretraining. Combining them is a natural approach, yet current RL practices detach MTP gradients because joint training degrades the performance. We revisit this failure from an optimization perspective. We show that the per-step effect of MTP on the RL objective can be decomposed into two terms: a first-order correlation and a second-order perturbation penalty. This decomposition unifies three MTP training regimes: Detach, Cross-Entropy loss, and Policy loss, and explains why each succeeds or fails. Further analysis of policy loss reveals that, although it aligns with intuition, performance still degrades: the correlation term decays while the quadratic penalty persists. Guided by the analysis, we propose Optimal Coefficient Calibration (OCC), an adaptive scheme that tracks the optimal coefficient online via a log-probability proxy at negligible cost. Across six competition-level mathematical reasoning benchmarks, OCC consistently matches or exceeds the detach baseline, delivering improved joint MTP-RL training performance.
中文摘要 可验证奖励强化学习（RLVR）已成为提升大型语言模型推理能力的标准范式，而多词预测（MTP）则是预训练中广泛采用的模块。将它们结合起来是自然的做法，但当前强化学习实践会分离MTP梯度，因为联合训练会降低性能。我们从优化的角度重新审视这一失败。我们证明MTP对RL目标的每步效应可分解为两项：一阶相关性和二阶扰动惩罚。该分解统一了三种MTP训练模式：脱离、交叉熵损失和策略损失，并解释了各自成功或失败的原因。对政策损失的进一步分析显示，尽管它与直觉相符，但绩效仍然会下降：相关项衰减，而二次惩罚依然存在。在分析指导下，我们提出了最优系数校准（OCC）方案，这是一种自适应方案，通过对数概率代理在线跟踪最优系数，成本可忽略。在六项竞赛级别的数学推理基准中，OCC始终稳定地与分离基准相匹配甚至超越，从而提升了联合MTP-RL训练表现。

Visualizing Latent Phase Structures in Locomotion Policies: A Multi-Environment Study with Temporal Feature Extension

运动政策中潜在阶段结构的可视化：一项多环境研究，结合时间特征扩展

Authors: Daisuke Yasui, Toshitaka Matuki, Hiroshi Sato
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.28186
Pdf link: https://arxiv.org/pdf/2605.28186
Abstract Deep reinforcement learning (DRL) has been shown to achieve high performance on locomotion control tasks in MuJoCo benchmarks such as HalfCheetah, Ant, and Walker2D. However, visualizing the motion structures internally obtained by a trained policy function implemented as a deep neural network remains challenging. It is known from biomechanics and related fields that locomotion control is realized through the repetition of motion phases such as the stance phase and swing phase. In this study, we propose a framework for uncovering latent motion phase structures from trajectories generated by locomotion control policies through interaction with the environment. The proposed method extends the clustering features from state observations alone to augmented features including actions, next states, and next actions, and introduces a method for determining the number of clusters that suppresses self-transitions. Applying the proposed method to three environments -- Ant-v5, HalfCheetah-v5, and Walker2D-v5 -- we successfully identified phase structures with clearer and more regular transition rules than those obtained by the existing method.
中文摘要 深度强化学习（DRL）已被证明在MuJoCo基准测试如HalfCheetah、Ant和Walker2D中实现了高性能。然而，通过训练有素的策略函数实现的深度神经网络内部获得的运动结构可视化仍然具有挑战性。生物力学及相关领域已知，运动控制是通过重复动作阶段实现的，如站姿阶段和摆动阶段。本研究提出一个框架，通过与环境相互作用，揭示由运动控制政策生成轨迹的潜动相结构。该方法将聚类特征从单纯的状态观测扩展到包括动作、下一状态和下一步动作在内的增强特征，并引入了一种抑制自我转移的聚类数量确定方法。将拟议方法应用于三种环境——Ant-v5、HalfCheetah-v5和Walker2D-v5——我们成功识别出相结构，其转变规则比现有方法更清晰且规律。
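
The feature extension described above can be sketched as follows: each time step's clustering feature is the concatenation of state, action, next state, and next action, and a candidate clustering can be scored by how often consecutive steps stay in the same cluster. The function names and the self-transition rate are illustrative assumptions; the abstract does not state the clustering algorithm or the exact criterion for choosing the number of clusters.

```python
import numpy as np

def augmented_features(states, actions):
    """Concatenate (s_t, a_t, s_{t+1}, a_{t+1}) for t = 0 .. T-2."""
    states = np.asarray(states, dtype=float)
    actions = np.asarray(actions, dtype=float)
    return np.concatenate([states[:-1], actions[:-1], states[1:], actions[1:]], axis=1)

def self_transition_rate(labels):
    """Fraction of consecutive steps that remain in the same cluster."""
    labels = np.asarray(labels)
    return float(np.mean(labels[1:] == labels[:-1]))

states = np.arange(10).reshape(5, 2)   # 5 steps, 2-dim observations (toy data)
actions = np.arange(5).reshape(5, 1)   # 5 steps, 1-dim actions (toy data)
feats = augmented_features(states, actions)
rate = self_transition_rate([0, 0, 1, 1, 0])
```

A lower self-transition rate means the cluster sequence changes more often between consecutive steps, which is the property the abstract says the cluster-count selection favors.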

ProgVLA: Progress-Aware Robot Manipulation Skill Learning

ProgVLA：进度感知的机器人操控技能学习

Authors: Seungsu Kim, Jinyoung Choi, Seungmin Baek, Jean-Michel Renders
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.28231
Pdf link: https://arxiv.org/pdf/2605.28231
Abstract We present ProgVLA, a compact vision-language-action (VLA) model designed for reliable robot manipulation under tight compute and memory budgets. The model specifically focuses on efficiently processing long multi-modal sequences by maintaining an explicit representation of task progress over extended horizons. To this end, ProgVLA integrates two key components. First, a multi-modal encoder with a two-stage Perceiver resampling scheme compresses variable-length visual, language, and proprioceptive streams into a fixed set of control-ready context tokens, substantially reducing sequence length while preserving cross-modal grounding. Second, an auxiliary set of progress heads is trained with offline reinforcement learning (RL) objectives to jointly learn critics over normalized remaining-horizon targets. This provides the policy with an internal estimate of task progress and enables advantage- and success-weighted flow-matching imitation learning. On two well-established multi-task robot manipulation benchmarks, a 0.1B-parameter ProgVLA model reaches success rates that are competitive with, and on long-horizon and harder task tiers exceed, substantially larger pretrained baselines. Ablations indicate that the learned context resampler and task-adaptive visual fine-tuning are the largest single contributors, while progress-aware training provides a consistent additional gain that is concentrated on long-horizon and multi-object tasks. We further validate the approach in real-world toy-kitchen environments.
中文摘要 我们介绍ProgVLA，一种紧凑的视觉-语言-动作（VLA）模型，旨在在严格的计算和内存预算下实现可靠的机器人操作。该模型特别注重通过在长时间范围内保持任务进展的显式表示，高效处理长多模态序列。为此，ProgVLA整合了两个关键组成部分。首先，采用两阶段Perceiver重采样方案的多模态编码器将可变长度的视觉、语言和本体感觉流压缩为一组固定的、可直接用于控制的上下文标记，显著减少序列长度，同时保持跨模态关联。其次，一组辅助进度头通过离线强化学习（RL）目标进行训练，共同学习归一化剩余视野目标上的评论家（critic）。这为策略提供了任务进度的内部估计，并实现了优势加权和成功加权的流匹配模仿学习。在两个成熟的多任务机器人操作基准测试中，0.1B参数的ProgVLA模型的成功率可与规模大得多的预训练基线相竞争，并在长视野和更难的任务层级上超过它们。消融结果表明，学习到的上下文重采样器和任务自适应视觉微调是最大的单项贡献者，而进度感知训练则提供了稳定的额外增益，且集中在长视野和多对象任务上。我们还在真实的玩具厨房环境中进一步验证了这种方法。

PIRS: Physics-Informed Reward Shaping for SAC-Based Building Energy Management

PIRS：基于SAC的建筑能源管理的物理知情奖励塑造

Authors: Shadmehr Zaregarizi, Khashayar Yavari
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.28232
Pdf link: https://arxiv.org/pdf/2605.28232
Abstract Occupant comfort and grid-aware energy efficiency are competing objectives whose joint optimization depends critically on how reward functions are specified in deep reinforcement learning (DRL) controllers for buildings. Yet reward design remains largely ad hoc: comfort terms are either hand-tuned heuristics or simple temperature-deviation proxies without explicit grounding in thermal-comfort physics. We present PIRS (Physics-Informed Reward Shaping), which replaces these ad-hoc comfort proxies with the ISO 7730 Predicted Mean Vote (PMV) formulation inside a weighted multi-objective reward for Soft Actor-Critic (SAC). By anchoring the comfort signal in the ISO 7730 PMV formulation, PIRS improves reward interpretability and provides a standards-grounded comfort proxy without changing any other component of the learning pipeline. We evaluate PIRS in CityLearn v2.1.2 (challenge 2022 phase 1) with a central SAC agent trained for 50k steps over five random seeds, and compare against a rule-based controller (RBC), a manually engineered reward (E2), an energy-only reward (E3), and a naive temperature-deviation comfort reward (E4). District-level key performance indicators (KPIs), reported as ratios versus RBC, show that PIRS attains cost, carbon, and electricity metrics on par with the manual baseline while substantially outperforming non-physics-grounded designs -- particularly on load ramping (1.78x vs. ~2.4x RBC) and daily peak demand. All DRL policies remain above RBC at this training budget; we interpret this gap honestly and position PIRS as an interpretable, standards-aligned foundation for reward design rather than a claim of dominance over classical control at limited compute.
中文摘要 居住舒适度和网格感知能效是竞争目标，其联合优化关键依赖于建筑深度强化学习（DRL）控制器中奖励函数的指定。然而，奖励设计大多是临时的：舒适度术语要么是手工调校的启发式，要么是简单的温度偏差代理，且没有明确的热舒适物理基础。我们提出了PIRS（物理知情奖励塑造），它用ISO 7730预测平均票（PMV）公式替代了这些临时舒适代理，并以加权多目标奖励奖励软性行为者-批评者（SAC）为对象。通过将舒适信号锚定在ISO 7730 PMV表述中，PIRS提升了奖励的可解释性，并在不改变学习流程其他组成部分的情况下，提供了基于标准的舒适代理。我们在CityLearn v2.1.2（挑战2022第一阶段）中，使用中央SAC代理训练了5万步、五个随机种子，并与基于规则的控制器（RBC）、手动工程奖励（E2）、仅能量奖励（E3）和简单的温度偏差舒适奖励（E4）进行比较。区级关键绩效指标（KPI），以比值与RBC的比值报告，显示PIRS在成本、碳排放和电力指标上与人工基准相当，同时显著优于非物理接地设计——尤其是在负载提升（1.78倍对比~2.4倍RBC）和每日峰值需求方面。所有DRL政策在本培训预算中均高于RBC;我们诚实地解读这一差距，并将PIRS定位为一个可解释、符合标准的奖励设计基础，而非在有限计算条件下主导经典控制的主导地位。

IRDS: Interpretable RLVR Data Selection via Verifier-Coupled Sparse Autoencoder Coverage

IRDS：通过验证器耦合稀疏自编码覆盖实现可解释的RLVR数据选择

Authors: Yuhan Li, Mingxu Zhang, Dazhong Shen, Ying Sun
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.28247
Pdf link: https://arxiv.org/pdf/2605.28247
Abstract Reinforcement learning with verifiable rewards (RLVR) has become a key technique for enhancing LLM reasoning, yet its data inefficiency remains a major bottleneck. Existing methods address this problem only partially, each missing at least one of subset-level coverage, verifier signal use, or interpretability. To address this gap, we present IRDS (Interpretable RLVR Data Selection), which selects RLVR training instances on a sparse autoencoder (SAE) cluster basis so the selection itself is auditable on recognizable problem motifs. To select instances the model both fails on and can still learn from, we introduce a verifier-coupled coverage objective on the SAE basis and solve it by greedy log-determinant maximization. Experiments on three instruction-tuned models and six math reasoning benchmarks show that IRDS achieves the highest overall accuracy, exceeding the strongest baseline by +3.9/+4.0 pp on the two Qwen models and by +0.5 pp on Llama-3.1-8B, while running an order of magnitude cheaper than the trajectory-based baseline.
中文摘要 带可验证奖励的强化学习（RLVR）已成为提升LLM推理能力的关键技术，但其数据效率低仍是主要瓶颈。现有方法仅部分解决了这一问题，每种方法都至少缺少子集级覆盖、验证器信号使用或可解释性中的一项。为弥补这一空白，我们提出了IRDS（Interpretable RLVR Data Selection，可解释的RLVR数据选择），该方法在稀疏自编码器（SAE）聚类基上选择RLVR训练实例，使选择本身可以依据可识别的问题模式进行审计。为了选出模型既会失败又仍能从中学习的实例，我们在SAE基上引入验证器耦合的覆盖目标，并通过贪婪的对数行列式最大化来求解。在三个指令调优模型和六个数学推理基准上的实验显示，IRDS实现了最高的整体准确率，在两个Qwen模型上比最强基线高出+3.9/+4.0 pp，在Llama-3.1-8B上高出+0.5 pp，同时运行成本比基于轨迹的基线低一个数量级。
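
Greedy log-determinant maximization, named in the abstract as the solver, can be sketched in a generic form as below: rows of a feature matrix are picked one at a time to maximize log det(I + sum of selected outer products). The verifier-coupled weighting and the SAE basis from the paper are not reproduced here; the function name and the toy data are assumptions.

```python
import numpy as np

def greedy_logdet_select(X, k):
    """Greedily pick k rows of X maximizing log det(I + sum_i x_i x_i^T)."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    A_inv = np.eye(d)                 # inverse of I + sum of selected outer products
    taken = np.zeros(n, dtype=bool)
    selected = []
    for _ in range(min(k, n)):
        # The marginal gain of row x is log(1 + x^T A^{-1} x), which is
        # monotone in the quadratic form, so ranking by the form suffices.
        quad = np.einsum("ij,jk,ik->i", X, A_inv, X)
        quad[taken] = -np.inf         # never re-pick a row
        best = int(np.argmax(quad))
        selected.append(best)
        taken[best] = True
        x = X[best]
        Ax = A_inv @ x
        # Sherman-Morrison update of the inverse after adding x x^T.
        A_inv -= np.outer(Ax, Ax) / (1.0 + x @ Ax)
    return selected

X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
picked = greedy_logdet_select(X, 2)
```

On the toy matrix, the second pick skips the row that nearly duplicates the first and takes the orthogonal one, which is the coverage behavior a log-determinant objective rewards.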

EchoAvatar: Real-time Generative Avatar Animation from Audio Streams

EchoAvatar：音频流实时生成化身动画

Authors: Bohong Chen, Yumeng Li, Yinglin Xu, Youyi Zheng, Yanlin Weng, Kun Zhou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.28272
Pdf link: https://arxiv.org/pdf/2605.28272
Abstract Real-time synthesis of high-fidelity 3D character motion from audio is a pivotal component for next-generation interactive avatars and virtual assistants. However, most existing approaches are limited to offline processing of complete audio sequences or are constrained to specific domains, rarely handling both speech and music effectively. In this paper, we introduce a novel framework designed to generate continuous, coherent full-body motion from streaming speech and music with low latency. Central to our approach is a unified streaming architecture capable of synthesizing continuous motion from incremental audio inputs. We employ a robust training strategy that enforces strong audio dependency, allowing the model to seamlessly generalize across conversational speech and rhythmic music without requiring explicit domain labels or mode switching. Additionally, we explored Reinforcement Learning to refine the quality of online generation. Furthermore, we bridge reactive animation with intent-driven behavior via a tool-call interface that allows upstream Large Language Models to inject explicit semantic control. By combining this controllability with stream audio-driven synthesis, our framework serves as a plug-and-play solution for transforming voice agents into interactive humanoid avatars. Extensive experiments demonstrate that our method outperforms state-of-the-art realtime baselines in motion quality and synchronization while maintaining the flexibility required for live deployment. Our code, pre-trained models, and videos are available at this https URL.
中文摘要 从音频实时合成高保真3D角色运动，是下一代互动虚拟形象和虚拟助手的关键组成部分。然而，大多数现有方法仅限于离线处理完整音频序列，或受限于特定领域，很少能有效处理语音和音乐。本文介绍了一个新颖框架，旨在通过低延迟从流式语音和音乐中生成连续、连贯的全身运动。我们方法的核心是一个统一的流媒体架构，能够从增量音频输入中合成连续运动。我们采用了强有力的训练策略，强化了音频依赖，使模型能够无缝泛化在会话语音和节奏音乐之间，无需显式领域标签或切换模式。此外，我们还探讨了强化学习以提升在线生成的质量。此外，我们通过工具调用接口连接反应式动画与意图驱动行为，使上游大型语言模型能够注入显式语义控制。通过将这种可控性与流音频驱动的合成结合，我们的框架成为一个即插即用的解决方案，将语音代理转化为互动式的人形化身。大量实验表明，我们的方法在运动质量和同步方面优于最先进的实时基线，同时保持了实时部署所需的灵活性。我们的代码、预训练模型和视频均可在该 https URL 访问。

Global Policy-Space Response Oracles for Two-Player Zero-Sum Games

双人零和博弈的全球政策空间响应预言机

Authors: Junyu Zhang, Feihong Yang, Jian Wang, Chao Wang, Xudong Zhang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.28273
Pdf link: https://arxiv.org/pdf/2605.28273
Abstract The Policy-Space Response Oracles (PSRO) framework scales equilibrium computation to large zero-sum games by iteratively expanding a restricted strategy set using deep reinforcement learning (DRL). A central challenge is to construct, under limited computational budgets, a small strategy population whose induced game well approximates the full game. Existing PSRO variants typically expand the population using best responses to meta-strategies computed from restricted-game payoffs, which can lead to inefficient expansions that provide limited global improvement. We propose to guide population expansion by directly evaluating the post-expansion population quality. Specifically, we adopt Population Exploitability (PE) to measure how well a restricted strategy set represents the full game, and introduce a two-phase exploration--selection framework that explicitly minimizes PE during expansion. We instantiate this framework as Global PSRO, a practical DRL-based algorithm that efficiently generates candidate responses and estimates PE via parameter-sharing conditional neural networks. Experiments across multiple two-player zero-sum games show that Global PSRO achieves lower exploitability and approximates Nash equilibria with significantly fewer policy iterations than prior PSRO methods.
中文摘要 策略空间响应预言机（PSRO）框架通过通过深度强化学习（DRL）迭代扩展受限策略集，将均衡计算扩展到大型零和博弈。一个核心挑战是在有限的计算预算下，构建一个小型策略群体，其诱导博弈能很好地近似整个博弈。现有的PSRO变体通常通过对受限博弈收益计算的元策略的最佳响应来扩展种群，这可能导致扩张效率低下，且整体提升有限。我们提议通过直接评估扩张后的种群质量来指导种群扩张。具体来说，我们采用了种群利用性（PE）来衡量受限策略集对整个游戏的表现，并引入了两阶段探索-选择框架，明确在扩张过程中最小化PE。我们将该框架实例化为全局PSRO，这是一种实用的基于DRL的算法，通过参数共享条件神经网络高效生成候选响应并估计PE。跨越多个双人零和博弈的实验表明，全球PSRO的可利用性更低，且以显著更少的策略迭代次数接近纳什均衡。

Commit to the Bit: Reactive Reinforcement Learning Done Right

承诺执行：正确完成的反应强化学习

Authors: Onno Eberhard, Claire Vernade, Michael Muehlebach
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.28276
Pdf link: https://arxiv.org/pdf/2605.28276
Abstract Reinforcement learning algorithms are commonly analyzed (and designed) under the Markov assumption. This is unrealistic, as most environments encountered in practice are either partially observable, or require function approximation that restricts the agent to access non-Markovian state features. We consider the problem of learning an optimal reactive policy in a finite environment with deterministic observations (or equivalently, hard state aggregation). We introduce a new algorithm, Committed Q-learning, and prove almost-sure convergence to the optimal reactive policy under an intuitive assumption we call rewire-robustness. This assumption is strictly weaker than the $q_\star$-realizability condition used in prior work. Our algorithm is a variant of classical Q-learning in which the behavior policy commits to a single action upon entering a feature, and only resamples actions when the observed feature changes. A crucial part of our analysis is the introduction of quasi-Markov environments.
中文摘要 强化学习算法通常在马尔可夫假设下进行分析（和设计）。这不现实，因为实际遇到的大多数环境要么部分可观测，要么需要函数近似，限制代理访问非马尔可夫状态特征。我们考虑在有限环境中通过确定性观察（或等价地，硬态聚合）学习最优反应策略的问题。我们引入了一种新算法——承诺Q学习，并在一个我们称之为rewire鲁棒性的直观假设下，几乎确定收敛到最优反应策略。这一假设严格弱于之前工作中使用的$q_\star$-可实现性条件。我们的算法是经典Q学习的一种变体，其中行为策略在进入特征时承诺执行单一动作，只有在观察到的特征发生变化时才重新采样动作。我们分析的关键部分是引入准马尔可夫环境。
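
The commitment rule described above can be sketched as a wrapper around epsilon-greedy action selection: an action is sampled when the observed feature changes and is then repeated until the feature changes again. This only illustrates the behavior policy; the Q-update, step sizes, and the rewire-robustness analysis from the paper are not reproduced, and the class name and parameters are assumptions.

```python
import random

class CommittedBehavior:
    """Epsilon-greedy behavior policy that resamples only when the feature changes."""

    def __init__(self, n_actions, epsilon=0.1, seed=0):
        self.n_actions = n_actions
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.last_feature = None
        self.action = None

    def reset(self):
        """Forget the commitment, e.g. at the start of a new episode."""
        self.last_feature = None
        self.action = None

    def act(self, feature, q_values):
        """q_values: list of action values for the current feature."""
        if feature != self.last_feature:
            # Entering a different feature: sample once and commit.
            if self.rng.random() < self.epsilon:
                self.action = self.rng.randrange(self.n_actions)
            else:
                self.action = max(range(self.n_actions), key=lambda a: q_values[a])
            self.last_feature = feature
        return self.action

policy = CommittedBehavior(n_actions=3, epsilon=1.0)
features = ["A", "A", "A", "B", "B"]
actions = [policy.act(f, [0.0, 0.0, 0.0]) for f in features]
```

Within each run of identical features the returned action is constant, even with `epsilon=1.0`, because randomness is drawn only at feature changes.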

AtomComposer: Discovering Chemical Space from First Principles with Reinforcement Learning

AtomComposer：从第一原理出发，通过强化学习发现化学空间

Authors: Bjarke Hastrup, Francois Cornet, Tejs Vegge, Arghya Bhowmik
Subjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
Arxiv link: https://arxiv.org/abs/2605.28287
Pdf link: https://arxiv.org/pdf/2605.28287
Abstract Discovering novel stable molecules without training data remains a grand scientific challenge. Current molecular generative models are trained on large, pre-curated datasets, which introduce biases and limit exploration of novel chemistry. In contrast, we propose a new paradigm: autonomous, generalized agents capable of mapping vast, unknown chemical spaces without any pretraining. For the first time, we present AtomComposer, a self-guided agent that autonomously constructs valid 3D isomers under stoichiometric constraints and is trained exclusively online using reinforcement learning. Unlike existing approaches that generally overfit to a specific chemical formula, we establish a multi-composition training scheme that enables a broad generalization across diverse chemistry, guided by energy- and validity-based rewards. Our agent can discover up to an order of magnitude more valid isomers on unseen test formulas than existing single-composition reinforcement-learning baselines trained with per-step energy rewards. These results fulfill the promise of online reinforcement learning as a powerful paradigm for scalable, from-scratch exploration of chemical configuration space.
中文摘要 在没有训练数据的情况下发现新的稳定分子仍是一项重大的科学挑战。当前的分子生成模型是在大型预先管理的数据集上训练的，这带来了偏见并限制了对新化学的探索。相比之下，我们提出了一种新范式：自主、广义的代理能够在无需预训练的情况下绘制广阔未知的化学空间。我们首次介绍了AtomComposer，这是一款自导代理，能够在化学计量约束下自主构造有效的三维异构体，并仅通过强化学习在线训练。与通常过度拟合特定化学式的方法不同，我们建立了多组更训练方案，实现跨越多种化学的广泛推广，并以能量和效度为基础的奖励指导。我们的代理在未见测试公式中发现的有效异构体数量级多于现有单一组态强化学习基线，且这些基线采用每步能量奖励训练。这些结果兑现了在线强化学习作为化学构型空间可扩展、从零开始探索的强大范式的承诺。

ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation

ProRL：通过纠正策略梯度估计实现主动推荐的有效强化学习

Authors: Hongru Hou, Tiehua Mei, Denghui Geng, Jinhui Huang, Ao Xu, Hengrui Chen, Jiaqing Liang, Deqing Yang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.28293
Pdf link: https://arxiv.org/pdf/2605.28293
Abstract Proactive Recommender Systems (PRSs) aim to guide user preference shift toward target items by generating paths of intermediate recommendations. Reinforcement learning (RL) provides a principled framework for optimizing such sequential decision tasks, as path rewards can naturally capture both short-term acceptance and long-term guidance effectiveness. However, naively applying policy gradients to PRS results in deficient gradient estimation. We identify two deficiencies: (1) path-level rewards decompose into step-level rewards with positive mean, creating a length-dependent bias that causes gradients to favor path extension over meaningful exploration; (2) weighting each step by the entire path-level reward ignores the decomposition structure, leading to high gradient variance. To rectify these two deficiencies, we propose an effective RL framework ProRL with two novel mechanisms for proactive recommendation. First, Stepwise Reward Centering subtracts expected rewards to neutralize length-dependent bias, ensuring that path extension yields zero expected gradient signal. Second, Position-Specific Advantage Estimation leverages the reward decomposition structure to compute step-dependent baselines, reducing gradient variance. Together, these mechanisms yield policy gradients that precisely target path quality. Our experiments on three real-world datasets demonstrate that ProRL significantly outperforms state-of-the-art PRSs. Our code is available at this https URL.
中文摘要 主动推荐系统（PRS）旨在通过生成中间推荐路径，引导用户偏好转向目标项目。强化学习（RL）为优化此类顺序决策任务提供了原则性框架，因为路径奖励可以自然地同时捕捉短期接受度和长期指导效果。然而，对PRS应用策略梯度会导致梯度估计不足。我们发现了两个缺陷：（1）路径级奖励分解为正均值的阶级奖励，产生长度依赖偏差，导致梯度偏向路径延伸而非有意义的探索;（2）用整个路径级奖励加权每一步会忽略分解结构，导致高梯度方差。为解决这两个不足，我们提出了一个有效的强化学习框架ProRL，采用两种创新的主动推荐机制。首先，逐步奖励中心减去预期奖励，以中和长度依赖偏差，确保路径延伸时期望梯度信号为零。其次，位置特定优势估计利用奖励分解结构计算步数依赖的基线，降低梯度方差。这些机制共同产生了精准针对路径质量的政策梯度。我们在三个真实世界数据集上的实验表明，ProRL的表现显著优于最先进的PRS。我们的代码可在此 https URL 访问。
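
One way to read the two mechanisms above is sketched below: step rewards are centered by a per-position mean estimated from the batch, and each step's advantage is the sum of centered rewards from that step onward. Using the per-position batch mean as the estimate of the expected reward is an assumption, as are the function name and the toy rewards; the paper's estimators may differ.

```python
import numpy as np

def centered_advantages(paths):
    """paths: list of step-reward sequences of possibly different lengths."""
    max_len = max(len(p) for p in paths)
    rewards = np.zeros((len(paths), max_len))
    mask = np.zeros((len(paths), max_len), dtype=bool)
    for i, p in enumerate(paths):
        rewards[i, :len(p)] = p
        mask[i, :len(p)] = True
    # Position-specific baseline: mean reward at step t over paths that reach step t.
    counts = mask.sum(axis=0)
    baseline = (rewards * mask).sum(axis=0) / np.maximum(counts, 1)
    centered = (rewards - baseline) * mask
    # Advantage at step t: centered reward-to-go from t to the end of the path.
    adv = np.cumsum(centered[:, ::-1], axis=1)[:, ::-1]
    return [adv[i, :len(p)] for i, p in enumerate(paths)]

# Two paths with identical per-step rewards but different lengths.
paths = [[1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
advs = centered_advantages(paths)
```

The raw returns here are 2 and 4, so an uncentered estimator would favor the longer path; after centering, every advantage is zero and extending the path earns no extra credit.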

Where Rollouts Begin: Low-Load, High-Leverage First-Token Diversification for RLVR

采样从何处开始：面向RLVR的低负载、高杠杆首词元多样化

Authors: Soeun Kim, Albert No
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.28295
Pdf link: https://arxiv.org/pdf/2605.28295
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) trains reasoning models without labeled trajectories, relying on grouped rollouts to expose the policy to alternative reasoning paths and a verifier to score them. Rollout diversity has accordingly emerged as a central bottleneck in RLVR, with most existing methods broadening exploration through temperature, prefix, or rollout-selection adjustments. We identify a structurally distinguished but overlooked position for broadening this diversity: the first token after the reasoning marker. The policy's first-token distribution exhibits a sharply peaked yet correctness-decoupled phenomenon, and this first token position can broaden the regions a rollout group covers without altering the correctness signal. We introduce REFT (Rollout Exploration with First-Token Diversification), a light addition to the RLVR pipeline that samples first tokens uniformly from the policy's own top-$N$ candidates and allocates rollouts evenly, leaving every other component unchanged. Trained on the resulting diversified rollouts, REFT improves aggregate Pass@1, Pass@8, and Pass@64 over DAPO and GRPO baselines across four base models (0.5B-7B) and three difficulty regimes.
中文摘要 带可验证奖励的强化学习（RLVR）无需标注轨迹即可训练推理模型，它依靠分组采样（rollout）让策略接触不同的推理路径，并由验证器对其评分。因此，采样多样性已成为RLVR的核心瓶颈，现有大多数方法通过调整温度、前缀或采样选择来拓宽探索范围。我们发现了一个结构上特殊但被忽视的、可用于扩大这种多样性的位置：推理标记后的第一个词元。策略的首词元分布呈现出高度尖峰但与正确性脱钩的现象，因此在这一位置进行多样化可以拓宽一组采样所覆盖的区域，而不改变正确性信号。我们引入了REFT（Rollout Exploration with First-Token Diversification），这是对RLVR流程的一个轻量补充：从策略自身的 top-$N$ 候选中均匀抽取首词元，并均匀分配采样，其他组件保持不变。在由此得到的多样化采样上训练后，REFT在四个基础模型（0.5B-7B）和三个难度区间上，相比DAPO和GRPO基线提升了总体Pass@1、Pass@8和Pass@64。
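
The first-token step described above can be sketched as follows: take the policy's top-N first-token candidates and split a rollout group evenly across them. The function name, the toy logits, and the round-robin handling of a group size that is not divisible by N are assumptions.

```python
import numpy as np

def allocate_first_tokens(first_token_logits, n_top, group_size):
    """Return one first-token id per rollout, spread evenly over the top-N tokens."""
    logits = np.asarray(first_token_logits, dtype=float)
    # Indices of the N largest logits, highest first.
    top = np.argsort(-logits, kind="stable")[:n_top]
    # Cycle through the candidates so per-token counts differ by at most one.
    return [int(top[i % len(top)]) for i in range(group_size)]

logits = [2.0, 0.1, 1.5, -1.0, 0.7]   # toy first-token logits over a 5-token vocabulary
first_tokens = allocate_first_tokens(logits, n_top=3, group_size=8)
```

The logits only decide which tokens enter the candidate set; within that set the allocation is uniform, so a sharply peaked first-token distribution no longer concentrates the whole group on one opening.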

Plan Before Search: Search Agents Need Plan

搜索前规划：搜索代理人需要计划

Authors: Zhipeng Qian, Zihan Liang, Yufei Ma, Ben Chen, Huangyu Dai, Jiayi Ji, Chenyi Lei, Wenwu Ou, Xiaoshuai Sun, Qibin Hou
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.28354
Pdf link: https://arxiv.org/pdf/2605.28354
Abstract Training large language models as retrieval-augmented reasoning agents typically combines reinforcement learning with an SFT cold start distilled from a stronger model. However, this paradigm overlooks two fundamental factors: the dependency structure among sub-skills, and the possibility that distillation is not the only route to capability acquisition. We study this through Plan, a structured agentic behavior for multi-hop retrieval that decomposes a question into ordered sub-questions before any retrieval is performed, so that each search step can be anchored to a pre-designed sub-question instead of drifting under the influence of partially relevant documents retrieved earlier. However, across three model families spanning 3B to 14B parameters, we find that an identical reward signal induces qualitatively different RL failure modes. This phenomenon indicates that successful training hinges not only on reward design but also on model-specific feasibility conditions: sufficient initial entropy, training stability, and prerequisite sub-skills. Motivated by this, we propose a self-bootstrapping paradigm in which a small-scale seed model generates filtered trajectories that activate Plan in any target model, eliminating the need for distillation from an external stronger model. Our pipeline activates Plan across every tested model and consistently outperforms competitive baselines on multi-hop QA benchmarks.
中文摘要 作为检索增强推理代理训练大型语言模型，通常将强化学习与从更强模型提炼出来的SFT冷启动结合起来。然而，这种范式忽视了两个基本因素：子技能之间的依赖结构，以及蒸馏可能并非能力获取的唯一途径。我们通过Plan进行研究，这是一种结构化的代理行为，用于多跳检索，在检索前将问题分解为有序子问题，使每个搜索步骤都能锚定在预设子问题上，而非受先前部分相关文件的影响漂移。然而，在涵盖3B至14B参数的三类模型中，我们发现相同的奖励信号会诱导质性不同的强化学习失效模式。这一现象表明，成功的训练不仅取决于奖励设计，还取决于模型特定的可行性条件：足够的初始熵、训练稳定性以及先决子技能。基于此，我们提出了一种自我引导范式，其中小尺度种子模型生成过滤轨迹，在任何目标模型中激活Plan，无需从外部更强模型中提取。我们的流程在所有测试模型中激活Plan，并在多跳质量保证基准中持续优于竞争对手。

Teacher-Student Representational Alignment for Reinforcement Learning-Driven Imitation Learning

师生表征对齐以强化学习为驱动的模仿学习

Authors: Meraj Mammadov, Pedro Zuidberg Dos Martires, Johannes Andreas Stork
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.28372
Pdf link: https://arxiv.org/pdf/2605.28372
Abstract Imitation learning (IL) from a state-based reinforcement learning (RL) policy is a common approach to overcome the curse of dimensionality in complex and high-dimensional observation spaces prevalent in robotics. This paper addresses the irreducible imitation gap that emerges when teacher and student are learned in isolation, and the teacher policy has the liberty to rely on privileged state information that the student cannot infer from its observations. Instead of improving poor student performance with RL finetuning after IL, which often requires a whole new training setup, we propose a novel algorithm which learns a shared embedding space that hides agent-specific observations and thus trains imitable teacher policies by construction. We train the shared embedding space with self-supervised contrastive learning in parallel to the teacher policy and prevent it from extracting private information by limiting its gradients from updating the encoder networks. We perform evaluations on several example domains and compare to state-of-the-art baselines showing that our algorithm enables higher student performance with substantially reduced imitation gap.
中文摘要 从基于状态的强化学习（RL）策略进行模仿学习（IL），是克服机器人领域常见的复杂高维观测空间中维度诅咒的常用方法。本文研究当教师和学生各自独立学习、且教师策略可以依赖学生无法从其观测中推断的特权状态信息时所产生的不可约模仿差距。我们没有在IL之后用RL微调来改善较差的学生表现（这通常需要一套全新的训练设置），而是提出一种新算法：学习一个隐藏智能体特有观测的共享嵌入空间，从而在构造上训练出可模仿的教师策略。我们用自监督对比学习与教师策略并行训练共享嵌入空间，并通过限制教师策略的梯度更新编码器网络，防止其提取私有信息。我们在多个示例领域进行了评估，并与最先进的基线进行比较，结果表明我们的算法能在显著缩小模仿差距的同时提升学生表现。

Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs

机制性解读样本难度在 RLVR 中对大型语言模型的作用

Authors: Yue Cheng, Jiajun Zhang, Xiaohui Gao, Weiwei Xing, Zheng Wang, Zhanxing Zhu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.28388
Pdf link: https://arxiv.org/pdf/2605.28388
Abstract Reinforcement Learning with Verifiable Reward (RLVR) is empirically shown to notably enhance the reasoning performance of large language models (LLMs), particularly in mathematics and programming. However, the mechanistic role of Sample Difficulty in RLVR remains poorly understood. In this paper, we investigate RLVR through the lens of difficulty-wise and one-sample analysis. We find that sample difficulty has a non-monotonic effect on RLVR: easy and medium-difficulty problems yield the strongest and most stable reasoning improvements, whereas overly hard problems often provide weak learning signals, induce degenerate behaviors such as answer repetition or skipping necessary computation, and can ultimately degrade the model's pre-existing capabilities. Beyond the obverse of response, we further analyze the model's internal feature dynamics using Temporal Sparse Autoencoders (T-SAE). Easy problems mainly reinforce direct-answer and basic-computation features while suppressing deliberative-reasoning features; hard problems activate reasoning-related features but become useful only when successful trajectories are sampled; medium-difficulty problems provide a more balanced signal, strengthening both computation and multi-step reasoning features. Motivated by these findings, we propose difficulty-adaptive strategies for hard-sample utilization, using backward-reasoning reformulation and T-SAE-guided training signals to improve reward density and credit assignment during RLVR. Overall, our results identify sample difficulty as a key factor governing both the optimization dynamics and representation evolution of RLVR.
中文摘要 实证显示，带可验证奖励的强化学习（RLVR）显著提升了大型语言模型（LLMs）的推理性能，尤其是在数学和编程领域。然而，样本难度在RLVR中的机制作用仍缺乏充分理解。本文通过分难度分析和单样本分析的视角研究RLVR。我们发现样本难度对RLVR有非单调效应：简单和中等难度问题带来最强且最稳定的推理改进，而过难问题往往提供较弱的学习信号，诱发如答案重复或跳过必要计算等退化行为，最终可能削弱模型的既有能力。除了响应层面的表现之外，我们还利用时间稀疏自编码器（T-SAE）进一步分析模型的内部特征动态。简单问题主要强化直接作答和基础计算特征，同时抑制深思式推理特征；困难问题会激活推理相关特征，但只有在采样到成功轨迹时才有用；中等难度问题提供了更平衡的信号，同时强化计算特征和多步推理特征。基于这些发现，我们提出了面向困难样本利用的难度自适应策略，利用逆向推理重述和T-SAE引导的训练信号，以提升RLVR期间的奖励密度和信用分配。总体而言，我们的结果将样本难度确定为主导RLVR优化动态和表示演化的关键因素。

Efficient Post-training of LLMs for Code Generation With Offline Reinforcement Learning

高效后期训练LLM用于代码生成，结合离线强化学习

Authors: Mingze Wu, Abhinav Anand, Shweta Verma, Mira Mezini
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.28409
Pdf link: https://arxiv.org/pdf/2605.28409
Abstract Post-training using online reinforcement learning (RL) is an important training step for LLMs, including code-generating models. However, online RL for code generation involves LLM inference and verification of the generated output, which can take considerable time and resources. In this paper, we explore the application of offline RL to code-generating models by leveraging existing code datasets. Our experiments demonstrate that offline RL is an effective training strategy for improving LLM performance. We show that offline RL can be especially beneficial for small LLMs and challenging coding problems.
中文摘要 使用在线强化学习（RL）进行后期训练是大型语言模型（包括代码生成模型）的重要训练步骤。然而，在线强化学习用于代码生成涉及LLM推断和验证生成输出，这可能需要大量时间和资源。本文探讨了利用现有代码数据集，将离线强化学习应用于代码生成模型。我们的实验表明，离线强化学习是提升大型语言模型表现的有效训练策略。我们展示了离线强化学习对小型大型语言模型和复杂编码问题尤其有益。

DenoiseRL: Bootstrapping Reasoning Models to Recover from Noisy Prefixes

DenoiseRL：引导推理模型以从噪声前缀中恢复

Authors: Caijun Xu, Changyi Xiao, Zhongyuan Peng, Yixin Cao
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.28421
Pdf link: https://arxiv.org/pdf/2605.28421
Abstract Reinforcement learning has become a central paradigm for advancing reasoning in large language models, yet most existing methods still depend on stronger teacher models or heavily curated difficult datasets, limiting scalable capability improvement. In this paper, we introduce DenoiseRL, a reinforcement learning framework that substitutes external supervision with recovery-oriented optimization over failures from weak models. Instead of relying on stronger supervision or carefully engineered data, DenoiseRL learns directly from incorrect reasoning traces by converting them into opportunities for improvement, making training more scalable and less dependent on external resources. This yields a richer and more diverse learning signal, improving exploration efficiency from imperfect model behavior. As a result, DenoiseRL improves reasoning performance and overall training efficiency while reducing the need for expensive data curation or stronger teacher models. Empirically, DenoiseRL consistently outperforms strong on-policy RL baselines across competitive mathematical and general reasoning benchmarks and promotes stronger self-corrective behavior as training difficulty increases, highlighting an effective and scalable alternative pathway for improving reasoning in large language models.
中文摘要 强化学习已成为推动大型语言模型推理的核心范式，但大多数现有方法仍依赖更强的教师模型或精心筛选的高难度数据集，限制了能力的可扩展提升。本文介绍了DenoiseRL，一种强化学习框架，以针对弱模型失败样本的恢复导向优化取代外部监督。DenoiseRL不依赖更强的监督或精心设计的数据，而是直接从错误的推理轨迹中学习，将其转化为改进的机会，使训练更具可扩展性，减少对外部资源的依赖。这带来了更丰富、更多样化的学习信号，提高了从不完美模型行为中进行探索的效率。因此，DenoiseRL提升了推理表现和整体训练效率，同时减少了对昂贵数据整理或更强教师模型的需求。从实证角度看，DenoiseRL在高难度的数学和通用推理基准上持续优于强大的同策略（on-policy）强化学习基线，并随着训练难度的增加促进更强的自我纠正行为，凸显了一种有效且可扩展的替代路径，有助于提升大型语言模型的推理能力。

Skill0.5: Joint Skill Internalization and Utilization for Out-of-Distribution Generalization in Agentic Reinforcement Learning

Skill0.5：联合技能内化与利用，用于代理强化学习中的分布外泛化

Authors: Jiapeng Zhu, Jianxiang Yu, Yibo Zhao, Chengcheng Han, Qi Gu, Xunliang Cai, Xiang Li, Weining Qian
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.28424
Pdf link: https://arxiv.org/pdf/2605.28424
Abstract Equipping large language models with explicit skills has emerged as a promising paradigm for enabling autonomous agents to solve complex tasks. Agent skills can be inherently divided into general skills for broad cognitive transfer and task-specific skills for dynamic execution. However, existing skill-based reinforcement learning (RL) methods typically force a rigid choice between full externalization, which incurs prohibitive context overhead, and full internalization, which risks overfitting and knowledge conflicts. To address this dilemma, we propose Skill0.5, a novel agentic RL framework that explicitly differentiates skill treatments by combining general skill internalization with task-specific skill utilization. Driven by a dynamic, difficulty-aware router, Skill0.5 streams tasks into distinct mastery tiers to apply tailored optimization strategies: it internalizes general skills via privileged distillation to build a cognitive foundation for hard tasks, while using diagnostic probing on easy tasks to penalize shortcuts and enforce specific skill utilization. Experiments on ALFWorld and WebShop demonstrate that Skill0.5 outperforms both memory-based and skill-based RL baselines, yielding performance improvements across both in-distribution and out-of-distribution scenarios.
中文摘要 为大型语言模型配备显式技能已成为一种有前景的范式，使自主智能体能够解决复杂任务。代理技能本质上可分为用于广泛认知转移的通用技能和用于动态执行的任务特定技能。然而，现有的基于技能的强化学习（RL）方法通常迫使在完全外化（产生过高的上下文开销）和完全内化（存在过拟合和知识冲突的风险）之间做出僵硬选择。为解决这一困境，我们提出了Skill0.5，一种新型能动强化学习框架，通过结合一般技能内化与任务特定技能利用，明确区分技能治疗。Skill0.5 由动态的难度感知路由器驱动，将任务流向不同的掌握层级，以应用定制优化策略：它通过特权提炼内化通用技能，为困难任务建立认知基础，同时对简单任务进行诊断探查惩罚捷径并强制特定技能的使用。在ALFWorld和WebShop上的实验表明，Skill0.5在内存和技能基础强化学习基线中均表现优异，在分布内外场景中均有性能提升。

Learning a Kinodynamic Trajectory Manifold for Impact-Aware Compliant Catching of Fast-Moving Objects

学习运动动力学轨迹流形，用于对高速运动物体进行冲击感知的柔顺抓接

Authors: Guorui Pei, Mengshi Zhang, Xi Chen, Jinsong Wu, Jiaming Qi, Peng Zhou
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.28462
Pdf link: https://arxiv.org/pdf/2605.28462
Abstract Fast catching of free-flying objects is difficult because of short reaction time, impact uncertainty, and kinodynamic constraints. We use reinforcement learning in simulation to collect successful catching trajectories and learn a low-dimensional kinodynamic trajectory manifold. At run time, the estimated object initial state is mapped directly to a reference catching trajectory without online nonlinear optimization. The trajectory is tracked with compliant control near contact for improved impact absorption and capture stability.
中文摘要 由于反应时间短、冲击不确定性和运动动力学约束，快速抓接自由飞行的物体较为困难。我们在仿真中利用强化学习收集成功的抓接轨迹，并学习低维运动动力学轨迹流形。运行时，估计得到的物体初始状态被直接映射到一条参考抓接轨迹，无需在线非线性优化。在接近接触时采用柔顺控制跟踪该轨迹，以提升冲击吸收和抓接稳定性。

GUI-CIDER: Mid-training GUI Agents via Causal Internalization and Density-aware Exemplar Reselection

GUI-CIDER：通过因果内化和密度感知范例重新选择的中期训练GUI代理

Authors: Zheng Wu, Chengcheng Han, Zhengxi Lu, Tianjie Ju, Yanyu Chen, Qi Gu, Xunliang Cai, Zhuosheng Zhang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.28534
Pdf link: https://arxiv.org/pdf/2605.28534
Abstract Despite the rapid progress of multimodal large language models in building Graphical User Interface (GUI) agents, their real-world task completion is fundamentally bottlenecked by a lack of world knowledge about GUI operations. Existing solutions typically rely on expensive multi-agent scaffolding or conventional post-training paradigms, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). However, post-training only allows agents to implicitly absorb world knowledge through action annotations or reward signals, leading to inefficient trajectory memorization rather than genuine comprehension. Therefore, an approach that enables explicit learning of this knowledge is imperative. To this end, we propose GUI-CIDER, a mid-training method that explicitly internalizes GUI world knowledge through Causal Internalization and Density-aware Exemplar Reselection. GUI-CIDER operates in three stages: (1) data synthesis, which distills static planning and dynamic causal knowledge from GUI trajectories into text; (2) exemplar reselection, which filters the corpus by rewarding causal structures and penalizing semantic redundancy; and (3) mid-training, where the refined data is used to embed the acquired knowledge. Extensive experiments on two GUI knowledge benchmarks and three task completion benchmarks demonstrate that GUI-CIDER consistently improves both the agent's understanding of GUI operations and its task success rate. Our codes are available at this https URL.
中文摘要 尽管多模态大型语言模型在构建图形用户界面（GUI）代理方面进展迅速，但其在真实世界中的任务完成能力从根本上受限于缺乏关于GUI操作的世界知识。现有解决方案通常依赖昂贵的多智能体脚手架或传统的后训练范式，如监督微调（SFT）和强化学习（RL）。然而，后训练仅允许智能体通过动作标注或奖励信号隐式吸收世界知识，导致低效的轨迹记忆，而非真正的理解。因此，必须采取能够显式学习这些知识的方法。为此，我们提出了GUI-CIDER，一种中期训练方法，通过因果内化和密度感知范例重选，显式内化GUI世界知识。GUI-CIDER 分为三个阶段：（1）数据合成，将静态规划知识和动态因果知识从GUI轨迹提炼成文本；（2）范例重选，通过奖励因果结构并惩罚语义冗余来过滤语料库；以及（3）中期训练，利用精炼后的数据嵌入所获得的知识。在两个GUI知识基准和三个任务完成基准上的大量实验表明，GUI-CIDER持续提升智能体对GUI操作的理解及其任务成功率。我们的代码可在该 https URL 获取。

Modeling Vehicle-Type-Specific Pedestrian Crash Avoidance Behavior in Safety-Critical Interactions Using Smooth-Mamba Deep Reinforcement Learning

利用平滑曼巴深度强化学习建模安全关键交互中车辆类型特定的行人碰撞规避行为

Authors: Qingwen Pu, Kun Xie, Hong Yang, Di Yang, Junqing Wang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.28552
Pdf link: https://arxiv.org/pdf/2605.28552
Abstract As automated vehicles (AVs) increasingly share roadways with human-driven vehicles (HDVs), understanding how pedestrians respond to different vehicle types in safety-critical interactions is essential for the safe deployment of automated driving technologies. This study extracts safety-critical pedestrian-vehicle interactions from the Argoverse 2 dataset to capture real-world crash avoidance behaviors in encounters involving AVs and HDVs. To model vehicle-type-specific pedestrian crash avoidance behavior, we develop a Smooth-Mamba Deep Deterministic Policy Gradient framework, termed SMamba-DDPG, which integrates smooth action constraints with efficient temporal representation learning. To quantify pedestrian behavioral differences, the framework trains separate crash avoidance policies for pedestrian interactions with AVs and HDVs. Results show that SMamba-DDPG outperforms baseline reinforcement learning and supervised learning models in reproducing pedestrian crash avoidance behaviors. Reconstructed trajectories demonstrate strong behavioral realism, accurately reproducing crash avoidance kinematics in both AV and HDV scenarios. Reaction time analysis shows that the model captures human-like response delays and reveals that pedestrians respond more quickly to AVs than to HDVs. Counterfactual analysis further indicates that pedestrians adopt lower crossing speeds when interacting with AVs. Large-scale safety analysis of model-generated data revealed that pedestrian-AV interactions consistently yielded lower conflict rates and higher pedestrian yielding rates compared to pedestrian-HDV interactions. The findings highlight the importance of incorporating vehicle-type-specific pedestrian behavioral models for safer automated driving system design and more realistic traffic simulations in mixed-traffic environments.
中文摘要 随着自动驾驶车辆（AV）越来越多地与人类驾驶车辆（HDV）共用道路，理解行人在安全关键交互中对不同车辆类型的反应，对于自动驾驶技术的安全部署至关重要。本研究从Argoverse 2数据集中提取了安全关键的行人与车辆交互，以捕捉行人与AV和HDV相遇时的真实碰撞规避行为。为了建模特定车辆类型的行人碰撞规避行为，我们开发了一个名为SMamba-DDPG的Smooth-Mamba深度确定性策略梯度框架，该框架将平滑动作约束与高效的时间表示学习相结合。为量化行人行为差异，该框架为行人与AV和HDV的交互分别训练了碰撞规避策略。结果显示，SMamba-DDPG在重现行人碰撞规避行为方面优于基线强化学习和监督学习模型。重建的轨迹展现了很强的行为真实性，准确还原了AV和HDV场景下的碰撞规避运动学。反应时间分析显示，模型捕捉到了类人的响应延迟，并揭示行人对AV的反应比对HDV更快。反事实分析进一步表明，行人在与AV交互时采用更低的过街速度。对模型生成数据的大规模安全性分析显示，与行人和HDV的交互相比，行人与AV的交互始终产生更低的冲突率和更高的行人让行率。研究结果凸显了引入特定车辆类型的行人行为模型对于更安全的自动驾驶系统设计和混合交通环境中更真实的交通仿真的重要性。

Soft-SVeRL: Self-Verified Reinforcement Learning with Soft Rewards

Soft-SVeRL：带有软奖励的自我验证强化学习

Authors: Saurabh Dash, Pierre Clavier, John Dang, Matthias Galle, Marzieh Fadaee, Ahmet Üstün, Beyza Ermis
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.28561
Pdf link: https://arxiv.org/pdf/2605.28561
Abstract Reinforcement Learning from Verifiable Rewards (RLVR) has improved language models in domains such as mathematics and code, where correctness can be checked automatically. However, many important tasks are only partially verifiable: prompts contain multiple requirements, responses may satisfy some but not all of them, or no single reference answer might exist. We introduce Soft-RLVR, a framework for reinforcement learning from decomposed, learned verification signals. Soft-RLVR converts each prompt into a checklist of atomic requirements, scores candidate responses item by item with an LLM verifier, and trains on the resulting soft reward. Checklist-based rewards turn sparse pass/fail supervision into a denser partial-credit signal, but they also introduce a tradeoff: averaging item-level judgments can reduce verifier noise, while partial credit can reward incomplete responses. We formalize this tradeoff and identify conditions under which checklist-based verification gives a more reliable RL training signal than holistic verification. We further introduce Soft-SVeRL, a self-verifying variant of Soft-RLVR in which the policy also acts as the verifier. We show that self-verification is prone to reward inflation from overly permissive self-judgments, and that explicit stabilization is needed to prevent this collapse. In a controlled instruction-following setting with rule-based ground-truth evaluation, checklist-based Soft-RLVR improves IFEval by up to 11.1 points using only learned verifier rewards. Our experiments further show that verifier quality and checklist quality both affect downstream RL outcomes, and that explicit stabilization is essential for effective self-verification.
中文摘要 可验证奖励强化学习（RLVR）改进了数学和代码等领域的语言模型，这些领域可自动检查正确性。然而，许多重要任务只能部分验证：提示包含多个需求，回答可能满足部分但不完全满足，或者可能不存在单一的参考答案。我们介绍Soft-RLVR，这是一个基于分解、学习验证信号进行强化学习的框架。Soft-RLVR将每个提示转换成原子级需求的清单，用LLM验证器逐项评分候选回答，并训练由此产生的软奖励。基于清单的奖励将稀疏的通过/不通过监督转化为更密集的部分加分信号，但也带来了权衡：平均题目级判断可以减少验证者噪音，而部分计分则可能奖励不完整的回答。我们形式化了这一权衡，并识别了基于清单的验证比整体验证更可靠的强化学习训练信号的条件。我们进一步介绍Soft-SVeRL，这是一种自验证的Soft-RLVR变体，策略同时作为验证者。我们表明，自我验证容易因过度宽容的自我判断而奖励通胀，因此需要明确的稳定措施以防止这种崩溃。在受控指令跟随环境下，基于规则的真实评估，基于清单的软RLVR仅凭学习验证者奖励即可提升IFEval最多11.1分。我们的实验进一步表明，验证者质量和检查表质量都会影响后续强化学习的结果，显式稳定对于有效的自我验证至关重要。
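
The checklist idea in the Soft-RLVR abstract can be illustrated with a minimal Python sketch. It assumes the soft reward is a plain average of per-item verifier scores in [0, 1]; the abstract does not state the exact aggregation rule, and the function names, the 0.5 threshold, and the example scores are hypothetical.

```python
from statistics import mean


def checklist_reward(item_scores):
    """Average per-item verifier scores in [0, 1] into one soft reward.

    Assumption: plain averaging; the paper may weight items differently.
    """
    if not item_scores:
        raise ValueError("checklist must contain at least one item")
    return mean(item_scores)


def holistic_reward(item_scores, threshold=0.5):
    """Pass/fail baseline: 1.0 only if every item clears the threshold."""
    return 1.0 if all(s >= threshold for s in item_scores) else 0.0


# Hypothetical verifier judgments for a prompt with four atomic requirements.
scores = [1.0, 1.0, 0.0, 1.0]
soft = checklist_reward(scores)   # partial credit for three of four items
hard = holistic_reward(scores)    # sparse signal: one miss fails the response
print(soft, hard)
```

The response that misses one requirement still receives partial credit under the checklist reward, which is the denser signal the abstract describes, and also the source of the tradeoff it mentions: an incomplete response is rewarded.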

SARAD: LLM-Based Safety-Aware Hybrid Reinforcement Learning with Collision Prediction for Autonomous Driving

SARAD：基于大语言模型的安全感知混合强化学习，具碰撞预测，用于自动驾驶

Authors: Kangyu Wu, Peng Cui, Guoxi Chen, Ya Zhang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2605.28583
Pdf link: https://arxiv.org/pdf/2605.28583
Abstract Ensuring both safety and efficiency in decision-making for autonomous driving systems remains a fundamental challenge. Traditional Deep Reinforcement Learning (DRL) suffers from unsafe random exploration and slow convergence, while Large Language Models (LLMs) demonstrate inherent latency in real-time inference operations. To address these limitations, this paper proposes SARAD, a novel safety-aware hybrid framework that synergizes LLMs and DRL for autonomous driving. SARAD substitutes the random exploration of DRL with Retrieval-Augmented Generation (RAG)-enhanced, LLM-guided decisions sourced from a dynamic expert knowledge repository. An attention discriminator is proposed to integrate the prior knowledge of LLMs into DRL policy optimization. A collision predictor module, fine-tuned with historical collision data, is further designed to improve vehicle safety. Extensive experiments show that SARAD achieves significant performance improvements in the Highway-Env simulator, validating the effectiveness of the proposed model in autonomous driving.
中文摘要 确保自动驾驶系统决策的安全性与效率仍是根本性的挑战。传统的深度强化学习（DRL）存在不安全的随机探索和收敛缓慢的问题，而大型语言模型（LLMs）则在实时推理操作中表现出固有的延迟。为解决这些局限性，本文提出了SARAD，一种新型安全意识混合框架，协同LLM与DRL实现自动驾驶。SARAD用由动态专家知识库提供的检索增强生成（RAG）增强、LLM引导的决策替代了DRL的随机探索。提出了一种注意力判别器，将LLM的先验知识整合进DRL策略优化中。还设计了一个根据历史碰撞数据微调的碰撞预测模块，以提升车辆安全。大量实验表明，SARAD在高速公路环境模拟器中实现了显著的性能提升，验证了该模型在自动驾驶中的有效性。

Single-Rollout Hidden-State Dynamics for Training-Free RLVR Data Selection

单次展开隐藏状态动态用于无训练RLVR数据选择

Authors: Jianghao Wu, Jianfei Cai, Weiqiang Wang, Jin Ye, Daniel F. Schmidt, Yasmeen George
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.28631
Pdf link: https://arxiv.org/pdf/2605.28631
Abstract Reinforcement learning with verifiable rewards (RLVR) can yield large reasoning gains from very few training instances, yet its strong sensitivity to which instances are used makes data selection a central bottleneck. Most existing selection pipelines rely on training-time optimization signals and/or require access to verifiable rewards or ground-truth answers over large candidate pools, which is costly and often infeasible in specialized domains. We study RLVR data selection in a setting where selection must be performed before any RL training and without labels or reward evaluation on the full pool. We propose SHIFT, a one-shot, training-free selector based solely on inference-time hidden-state dynamics. For each candidate instance, SHIFT runs a single deterministic reasoning rollout and computes a reasoning-induced representation shift (RIRS) as the start-to-end hidden-state delta. SHIFT uses the RIRS magnitude as a lightweight proxy for instance utility and enforces coverage via a quality-weighted farthest-first CoreSet procedure in an RIRS-augmented feature space, producing compact subsets that scale to large unlabeled pools. Across mathematical reasoning and medical QA benchmarks under ultra-low budgets, SHIFT consistently outperforms training-free diversity and difficulty/uncertainty baselines, improving both in-domain accuracy and transfer to harder evaluation settings. Ablations show that RIRS-based coverage and quality-weighting contribute complementary gains, and analyses indicate that RIRS is not explained by simple input/output length statistics. Code is available at this http URL.
中文摘要 带有可验证奖励的强化学习（RLVR）能够从极少的训练实例中获得显著的推理收益，但其对所用实例的高度敏感性使数据选择成为核心瓶颈。大多数现有的选择流程依赖于训练时的优化信号，和/或需要在庞大的候选池上获得可验证奖励或真实答案，这在专业领域成本高昂且往往不可行。我们研究这样一种设置下的RLVR数据选择：选择必须在任何强化学习训练之前完成，且不对整个候选池进行标注或奖励评估。我们提出SHIFT，一种一次性、无需训练的选择器，仅基于推理时的隐藏状态动态。对于每个候选实例，SHIFT运行一次确定性推理展开，并计算推理诱导的表示偏移（RIRS），即从开始到结束的隐藏状态差值。SHIFT使用RIRS的大小作为实例效用的轻量级代理，并通过质量加权的最远优先CoreSet过程在RIRS增强的特征空间中保证覆盖，生成可扩展到大型无标注候选池的紧凑子集。在超低预算下的数学推理和医学问答基准上，SHIFT持续优于无需训练的多样性基线和难度/不确定性基线，同时提升了领域内准确率和向更难评估设置的迁移。消融实验显示基于RIRS的覆盖和质量加权带来互补的收益，分析表明RIRS不能由简单的输入/输出长度统计来解释。代码可在此 http URL 获取。
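
The selection procedure described for SHIFT can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the authors' implementation: hidden states are toy 2-D vectors, the feature space is the raw RIRS delta (the paper uses an "RIRS-augmented" space whose construction the abstract does not specify), and the greedy score `quality * distance` is one plausible reading of "quality-weighted farthest-first". All function names are hypothetical.

```python
import math


def rirs(h_start, h_end):
    """Start-to-end hidden-state delta for one deterministic rollout."""
    return [e - s for s, e in zip(h_start, h_end)]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def dist(a, b):
    return norm([x - y for x, y in zip(a, b)])


def select(features, quality, budget):
    """Greedy farthest-first selection weighted by a per-instance quality score."""
    # Seed with the highest-quality instance (first one wins on ties).
    chosen = [max(range(len(features)), key=lambda i: quality[i])]
    while len(chosen) < min(budget, len(features)):
        best, best_score = None, -1.0
        for i in range(len(features)):
            if i in chosen:
                continue
            # Distance to the closest already-selected instance.
            d = min(dist(features[i], features[j]) for j in chosen)
            score = quality[i] * d
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
    return chosen


# Toy (start, end) hidden states; instances 0 and 1 have identical deltas.
pairs = [
    ([0.0, 0.0], [3.0, 4.0]),
    ([1.0, 1.0], [4.0, 5.0]),
    ([0.0, 0.0], [-2.0, 0.0]),
    ([0.0, 0.0], [0.0, 1.0]),
]
deltas = [rirs(s, e) for s, e in pairs]
magnitudes = [norm(d) for d in deltas]          # RIRS magnitude as utility proxy
selected = select(deltas, magnitudes, budget=2)
print(selected)
```

In this toy pool the second pick skips instance 1 even though its RIRS magnitude is large, because it adds no coverage beyond instance 0.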

Optimal Data Acquisition for Reinforcement Learning: A Large Deviations Perspective

强化学习的最优数据采集：大偏差视角

Authors: Mingjie Hu, Jian-Qiang Hu, Enlu Zhou
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.28675
Pdf link: https://arxiv.org/pdf/2605.28675
Abstract Data acquisition efficiency is a central challenge in deploying reinforcement learning in business and healthcare operations, where interactions are costly, slow, and often involve humans in the loop. This paper develops a unified large deviations framework for data acquisition in infinite-horizon reinforcement learning. We introduce the exponential decay rate of the policy-selection error probability as a principled efficiency metric and derive a variational characterization of this rate via large deviations theory for Markov chains, yielding a nested optimization problem. Based on this characterization, we formalize two complementary notions of optimality in terms of the optimal solution of the nested problem. Because the resulting program is implicit and generally intractable, we propose a tractable convex relaxation with explicit constraints. We then develop a lazy one-step projected subgradient method to solve the relaxed problem and use its iterates to construct an adaptive data acquisition policy. We prove that the resulting reinforcement learning algorithm is near-robustly optimal under our optimality criterion, up to a constant factor. Finally, we extend the framework to linear function approximation to improve scalability, and numerical experiments support the effectiveness of the proposed approach.
中文摘要 数据采集效率是部署强化学习在企业和医疗运营中的核心挑战，因为这些环境的交互成本高昂、速度缓慢，且常常涉及人工参与。本文开发了一个统一的大偏差框架，用于无限视野强化学习中的数据采集。我们将策略选择错误概率的指数衰减率引入一个原则效率度量，并通过大偏差理论对该速率进行变分刻画，得到一个嵌套优化问题。基于这一表征，我们形式化了两种互补的最优概念，分别是嵌套问题的最优解。由于最终的程序是隐式且通常难以处理的，我们提出了一个带有显式约束的可处理凸松弛。然后，我们开发了一种懒惰的一步投影亚梯度方法来解决松散问题，并利用其迭代构建自适应的数据采集策略。我们证明，得到的强化学习算法在最优性准则下几乎是稳健最优的，且在常数因子下达到一定程度。最后，我们将框架扩展到线性函数近似以提高可扩展性，数值实验支持该方法的有效性。

OSP-Next: Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning

OSP-Next：高效高质量视频生成，采用稀疏序列并行性、HiF8量化和强化学习

Authors: Yunyang Ge, Xianyi He, Zezhong Zhang, Bin Lin, Bin Zhu, Xinhua Cheng, Li Yuan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.28691
Pdf link: https://arxiv.org/pdf/2605.28691
Abstract Diffusion Transformers achieve strong video generation quality, but the quadratic cost of full attention limits efficiency. We introduce OSP-Next, an efficient text-to-video generation model that integrates sparse attention, parallelism, quantization, and reinforcement learning. OSP-Next uses a hybrid full-sparse attention architecture, where the sparse component is implemented with Skiparse-2D Attention. This fixed-pattern mechanism applies token-wise and group-wise sparse attention along spatial dimensions, leveraging locality while maintaining native compatibility with FlashAttention kernels. Based on the local equivalence of rearrangement in Skiparse-2D Attention, we further propose Sparse Sequence Parallelism (SSP), which partitions subsequences across ranks and switches sparse patterns through a single All-to-All communication. Compared with Ulysses Sequence Parallelism (SP), SSP provides a native parallel strategy for sparse attention and reduces communication volume by 75%. OSP-Next also incorporates HiF8 quantization to enable stable joint training with 8-bit quantization and sparse fine-tuning, and applies Mix-GRPO post-training to improve the performance of the sparse model. Experiments show that OSP-Next achieves a VBench total score of 83.73%, surpassing the Wan2.1 baseline. Under the 5-second 720P and 5-second 768P settings, OSP-Next achieves up to 1.64$\times$ single-GPU speedup and over 1.52$\times$ eight-GPU speedup on NVIDIA H200 GPUs. In addition, with only a 0.4% drop in VBench total score, OSP-Next-HiF8 achieves 1.69$\times$ and 2.27$\times$ speedups under the two settings on a single Ascend 950PR, demonstrating the efficiency and performance of OSP-Next across hardware platforms.
中文摘要 扩散Transformer（Diffusion Transformers）实现了出色的视频生成质量，但全注意力的二次复杂度开销限制了效率。我们介绍OSP-Next，一种高效的文本到视频生成模型，集成了稀疏注意力、并行、量化和强化学习。OSP-Next采用全注意力与稀疏注意力混合的架构，其中稀疏部分通过Skiparse-2D Attention实现。这种固定模式机制沿空间维度应用逐token和逐组的稀疏注意力，利用局部性，同时保持与 FlashAttention 内核的原生兼容性。基于Skiparse-2D Attention中重排的局部等价性，我们进一步提出了稀疏序列并行（SSP），它将子序列划分到不同rank上，并通过一次All-to-All通信切换稀疏模式。与Ulysses序列并行（SP）相比，SSP为稀疏注意力提供了原生的并行策略，并减少75%的通信量。OSP-Next 还集成了 HiF8 量化，实现了 8 位量化与稀疏微调的稳定联合训练，并应用 Mix-GRPO 后训练以提升稀疏模型的性能。实验显示，OSP-Next的VBench总分达到83.73%，超过了Wan2.1基线。在 5 秒 720P 和 5 秒 768P 设置下，OSP-Next 在 NVIDIA H200 GPU 上实现最高 1.64$\times$ 的单GPU加速和超过 1.52$\times$ 的八GPU加速。此外，在VBench总分仅下降0.4%的情况下，OSP-Next-HiF8在单张Ascend 950PR上于两种设置下分别实现了1.69$\times$和2.27$\times$的加速，展示了OSP-Next在不同硬件平台上的效率与性能。

TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning

TRACER：面向协作式多LLM推理的回合级遗憾匹配与内部强化信用

Authors: Chusen Li, Zhou Liu, Shuigeng Zhou, Wentao Zhang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.28699
Pdf link: https://arxiv.org/pdf/2605.28699
Abstract Large language models increasingly rely on either reinforcement learning or multi-agent prompting to improve reasoning, yet these two paradigms remain difficult to combine. Directly applying single-agent reinforcement learning to multi-turn multi-agent systems faces following dilemmas: i) Sparse rewards, role-level free-riding and excessive training overhead. ii) Agents only imitate to collaborate. iii) Fixed collaboration protocol falls into oscillating local optimum. We introduce TRACER, a turn-level reinforcement framework for cooperative multi-LLM reasoning. TRACER separates collaborative decision making into a controller-regret layer, where controllers learn whether the agents should speak or skip the current round through regret matching, and a generation-credit layer, which optimizes proposer and reviewer utterances with role-specific GSPO rewards. This design i) assigns credit at the level of both action modes and generated utterances, thus avoiding free-riding and sparse rewards. We only expand the choices made by the controllers, thus greatly reducing computational cost of training. Moreover, ii) agents acquire collaborative capability as they learn when to utter and what to speak. Finally, iii) by designing binary actions ingeniously, we extend classical game theory established for finite action spaces to deep learning, thus achieving mathematically rigorous convergence. We train all local RL-style methods on the GSM8K training split and evaluate on held-out GSM8K, MATH500, and GPQA-Diamond to measure in-domain accuracy, cross-benchmark generalization, inference cost, and correction-preservation behavior. The resulting framework provides a compact and reproducible testbed for studying learned collaboration policies beyond fixed debate, voting, or aggregation protocols. Code is available at this https URL.
中文摘要 大型语言模型越来越依赖强化学习或多智能体提示来提升推理能力，但这两种范式仍然难以结合。直接将单智能体强化学习应用于多回合多智能体系统面临以下困境：i）奖励稀疏、角色级搭便车和过高的训练开销。ii）智能体只是通过模仿来协作。iii）固定的协作协议会陷入振荡的局部最优。我们介绍了TRACER，一种用于协作式多LLM推理的回合级强化框架。TRACER将协作决策分为两层：控制器-遗憾层，其中控制器通过遗憾匹配学习智能体在当前轮次应发言还是跳过；以及生成-信用层，通过角色特定的GSPO奖励优化提议者和评审者的发言。这种设计 i）在动作模式和生成的发言两个层面分配信用，从而避免搭便车和稀疏奖励。我们只展开控制器所做的选择，从而大大降低了训练的计算成本。此外，ii）智能体在学习何时发言、说什么的过程中获得协作能力。最后，iii）通过巧妙设计二元动作，我们将针对有限动作空间建立的经典博弈论扩展到深度学习，从而实现了数学上严谨的收敛。我们在GSM8K训练集上训练所有本地强化学习风格的方法，并在留出的GSM8K、MATH500和GPQA-Diamond上进行评估，以衡量领域内准确率、跨基准泛化、推理成本以及纠正-保持行为。最终的框架提供了一个紧凑且可复现的测试平台，用于研究超越固定辩论、投票或聚合协议的、通过学习得到的协作策略。代码可在此 https URL 获取。
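
The controller-regret layer in TRACER relies on regret matching over a binary speak/skip choice. The Python sketch below shows the classical regret-matching rule only; how TRACER computes utilities and couples this with GSPO is not stated in the abstract, so the utility values and helper names here are hypothetical.

```python
ACTIONS = ("speak", "skip")


def regret_matching_strategy(cum_regret):
    """Play each action with probability proportional to its positive cumulative regret."""
    positive = [max(r, 0.0) for r in cum_regret]
    total = sum(positive)
    if total == 0.0:
        # No action is regretted yet: fall back to the uniform strategy.
        return [1.0 / len(cum_regret)] * len(cum_regret)
    return [p / total for p in positive]


def update_regret(cum_regret, utilities, played):
    """Accumulate, for every action, how much better it would have done than `played`."""
    return [r + (u - utilities[played]) for r, u in zip(cum_regret, utilities)]


regret = [0.0, 0.0]
start = regret_matching_strategy(regret)

# Hypothetical round: the controller skipped (index 1), but speaking
# would have earned utility 1.0 versus 0.0 for skipping.
regret = update_regret(regret, utilities=[1.0, 0.0], played=1)
after = regret_matching_strategy(regret)
print(dict(zip(ACTIONS, start)), dict(zip(ACTIONS, after)))
```

With only two actions, the strategy is fully determined by which action has accumulated positive regret, which is part of why a binary action space keeps the classical finite-action analysis applicable.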

AlphaTransit: Learning to Design City-scale Transit Routes

AlphaTransit：学习设计城市规模的交通线路

Authors: Bibek Poudel, Sai Swaminathan, Weizi Li
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.28730
Pdf link: https://arxiv.org/pdf/2605.28730
Abstract Designing a transit network requires many sequential route extension decisions, but their quality is often visible only after the full network is assembled. This delayed-feedback challenge lies at the heart of the Transit Route Network Design Problem (TRNDP), where route interactions can be deceptive: an extension that appears useful locally can create transfer bottlenecks, produce redundant overlap, or reduce overall throughput. To guide route construction under delayed simulator feedback, we introduce AlphaTransit, a search-based planning framework for city-scale bus network design. AlphaTransit couples Monte Carlo Tree Search (MCTS) with a neural policy-value network: the policy proposes route extensions, the value estimates downstream design quality, and search uses these predictions to refine each decision. This provides decision-time lookahead during route construction without running simulator rollouts inside the search tree. We evaluate AlphaTransit on a new Bloomington TRNDP benchmark with realistic road topology and census-derived demand, under mixed and full transit demand settings. In the Bloomington network, AlphaTransit attains the highest service rate in both demand settings, reaching 54.6% and 82.1%, respectively. Relative to reinforcement learning without search, these correspond to 9.9% and 11.4% service rate gains; relative to MCTS without learned guidance, they correspond to 2.5% and 11.2% gains. These results suggest that coupling learned guidance with MCTS is more effective than using either approach alone for transit network design. Our code and data are publicly available in this https URL.
中文摘要 设计交通网络需要许多顺序的路线扩展决策，但其质量往往只有在完整网络组装完成后才能显现。这种延迟反馈挑战正是交通路由网络设计问题（TRNDP）的核心，其中路线交互可能具有误导性：一个在本地看似有用的扩展可能造成传输瓶颈、产生冗余重叠或降低整体吞吐量。为了在延迟模拟器反馈下指导线路建设，我们引入了AlphaTransit，一种基于搜索的城市规模公交网络设计规划框架。AlphaTransit将蒙特卡洛树搜索（MCTS）与神经策略值网络结合：策略提出路线延伸，值估计下游设计质量，搜索则利用这些预测细化每项决策。这在路线建设过程中提供决策时的前瞻性预览，无需在搜索树内运行模拟器展开。我们在布卢明顿TRNDP的新基准测试中评估AlphaTransit，结合现实的道路拓扑和人口普查衍生需求，在混合和全公共交通需求设置下。在布卢明顿网络中，AlphaTransit在两个需求领域均达到最高服务率，分别达到54.6%和82.1%。相较于无搜索的强化学习，这些分别提升了9.9%和11.4%的服务率;与没有学习指导的MCTS相比，它们分别提升了2.5%和11.2%。这些结果表明，将学习式指导与MCTS结合起来，比单独使用任一方法在交通网络设计中更有效。我们的代码和数据在此 https URL 中公开。
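
How a policy prior and a value estimate can steer tree search without simulator rollouts may be easier to see in code. The Python sketch below uses the AlphaZero-style PUCT selection rule as an assumption; the AlphaTransit abstract does not say which selection rule it uses, and the action names, priors, values, and `c_puct` constant are made up for illustration.

```python
import math


def puct_score(q, prior, parent_visits, child_visits, c_puct=1.5):
    """Value estimate plus an exploration bonus scaled by the policy prior."""
    return q + c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)


def select_extension(children, c_puct=1.5):
    """Pick the child route extension with the highest PUCT score."""
    parent_visits = sum(ch["visits"] for ch in children.values())
    return max(
        children,
        key=lambda name: puct_score(
            children[name]["q"],
            children[name]["prior"],
            parent_visits,
            children[name]["visits"],
            c_puct,
        ),
    )


# Hypothetical search statistics for three candidate route extensions.
# "q" stands in for the value network's average estimate, "prior" for the policy output.
children = {
    "extend_north": {"q": 0.50, "prior": 0.6, "visits": 8},
    "extend_east": {"q": 0.55, "prior": 0.3, "visits": 1},
    "terminate": {"q": 0.20, "prior": 0.1, "visits": 0},
}
choice = select_extension(children)
print(choice)
```

The rarely visited extension wins here because its exploration bonus outweighs the heavily visited child's higher prior, which is the kind of decision-time lookahead the abstract attributes to combining search with learned guidance.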

OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration

OmniVerifier-M1：带有显式结构重校准的多模态元验证器

Authors: Xinchen Zhang, Bowei Liu, Jiale Liu, Chufan Shi, Yizhen Zhang, Junhong Liu, Youliang Zhang, Zhiheng Li, Yujiu Yang, Ling Yang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.28805
Pdf link: https://arxiv.org/pdf/2605.28805
Abstract Visual outcomes are increasingly central to multimodal large language models, making reliable and fine-grained verification essential for scaling generalist foundation models. In this work, we investigate multimodal meta-verification, which leverages verifier-generated rationales rather than decision-only signals, and explore how to effectively incorporate meta-verification feedback into multimodal verifier training. We identify two key findings. First, symbolic verifier outputs (e.g., bounding boxes) outperform textual explanations as meta-verification rationales, enabling efficient rule-based reinforcement learning rewards while avoiding reliance on model-based rewards from auxiliary judge models. Second, decoupling reinforcement learning objectives for binary judgment and meta-verification substantially outperforms joint reward optimization, due to intrinsic differences in output structure and learning dynamics. Based on these insights, we train OmniVerifier-M1, a generalist visual verifier leveraging symbolic meta-verification and decoupled reinforcement learning. OmniVerifier-M1 provides robust verification and fine-grained error localization, and further enables M1-TTS, a verifier-driven agentic generation system achieving dynamic region-level self-correction. This approach paves the way for more reliable, interpretable, and fine-grained multimodal verification, supporting safer and more controllable foundation model deployment.
中文摘要 视觉结果在多模态大型语言模型中日益成为核心，因此可靠且细粒度的验证对于扩展通用基础模型至关重要。本研究探讨了多模态元验证，它利用验证者生成的理由而非仅决策信号，并探讨如何有效将元验证反馈融入多模态验证者训练。我们发现了两个关键发现。首先，符号验证器的输出（例如边界框）作为元验证理由优于文本解释，实现高效的基于规则的强化学习奖励，同时避免依赖辅助法官模型的基于模型的奖励。其次，二元判断和元验证的强化学习目标解耦，由于输出结构和学习动态的内在差异，其性能远超联合奖励优化。基于这些见解，我们训练了OmniVerifier-M1，一款利用符号元验证和解耦强化学习的通用视觉验证器。OmniVerifier-M1 提供了稳健的验证和细粒度的错误定位，并进一步支持 M1-TTS，这是一种由验证者驱动的代理生成系统，实现了动态区域级自我纠正。这种方法为更可靠、可解释且细粒度的多模态验证铺平了道路，支持更安全、更可控的基础模型部署。

Beyond Binary: Sim-to-Real Dexterous Manipulation with Physics-Grounded Contact Representation

超越二进制：模拟到现实的灵巧操作与物理接点表示

Authors: Jiahe Pan, Stelian Coros, Jitendra Malik, Toru Lin
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.28812
Pdf link: https://arxiv.org/pdf/2605.28812
Abstract A primary bottleneck in contact-rich manipulation is the difficulty of collecting real-world data. Sim-to-real reinforcement learning offers a scalable alternative, but the simulation-reality gap prevents information-dense modalities like touch from being effectively used. Existing sim-to-real methods often mitigate this gap by simplifying tactile data into coarse low-dimensional features -- sacrificing the richness required for complex manipulation. In this work, we introduce Center-of-Pressure (CoP), an effective tactile representation grounded in physical principles that preserves dense contact information while maintaining robustness for sim-to-real transfer. To support this representation, we propose a sensor calibration scheme based on differentiable dynamics, enabling the estimation of taxel orientations without requiring ground-truth force measurements. We evaluate CoP on two blind, challenging contact-rich manipulation tasks: peg-in-hole insertion and ball balancing. Across both tasks, policies conditioned on CoP achieve zero-shot sim-to-real transfer on a multi-fingered hand, and outperform both coarse binary-contact and raw-taxel baselines. Analysis of learned policy states further suggests that CoP-conditioned policies encode task-relevant physical properties, such as object mass, as an emergent byproduct of control.
中文摘要 接触密集型操作的一个主要瓶颈是真实世界数据难以采集。仿真到现实（sim-to-real）强化学习提供了一种可扩展的替代方案，但仿真与现实之间的差距使触觉这类信息密集的模态难以得到有效利用。现有的仿真到现实方法通常将触觉数据简化为粗糙的低维特征来缓解这一差距，却牺牲了复杂操作所需的信息丰富性。在本研究中，我们提出压力中心（Center-of-Pressure，CoP），这是一种基于物理原理的有效触觉表示，既保留密集的接触信息，又保持仿真到现实迁移的鲁棒性。为支持该表示，我们提出一种基于可微动力学的传感器标定方案，无需真实力测量值即可估计触觉单元（taxel）的朝向。我们在两个无视觉（blind）且具有挑战性的接触密集型操作任务上评估 CoP：轴孔插入和球平衡。在这两个任务中，以 CoP 为条件的策略在多指手上实现了零样本仿真到现实迁移，并且优于粗粒度二元接触基线和原始 taxel 基线。对所学策略状态的分析进一步表明，以 CoP 为条件的策略会编码与任务相关的物理属性（如物体质量），这是控制过程中涌现的副产品。
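
Illustrative note (not from the paper): the abstract does not spell out how CoP is computed, so the sketch below uses the textbook definition of a center of pressure, namely the pressure-weighted mean of taxel positions. The function name `center_of_pressure`, the `eps` threshold, and the choice to clip negative readings to zero are all assumptions made for illustration; the authors' representation may differ, for example by accounting for the calibrated taxel orientations.

```python
from typing import Optional, Sequence, Tuple


def center_of_pressure(
    positions: Sequence[Sequence[float]],
    pressures: Sequence[float],
    eps: float = 1e-8,
) -> Tuple[Optional[Tuple[float, ...]], float]:
    """Pressure-weighted mean of taxel positions (textbook CoP, assumed here).

    positions: one coordinate tuple per taxel, e.g. (x, y) on the sensor pad.
    pressures: one scalar reading per taxel.
    Returns (cop, total_pressure); cop is None when there is no contact.
    """
    if len(positions) != len(pressures):
        raise ValueError("positions and pressures must have the same length")

    # Assumption: negative readings are sensor noise, so clip them to zero.
    weights = [max(0.0, float(p)) for p in pressures]
    total = sum(weights)
    if total < eps:
        return None, 0.0  # no measurable contact, CoP is undefined

    dim = len(positions[0])
    cop = tuple(
        sum(w * pos[k] for w, pos in zip(weights, positions)) / total
        for k in range(dim)
    )
    return cop, total


if __name__ == "__main__":
    # Two taxels on a line; the second one is pressed three times harder.
    print(center_of_pressure([(0.0, 0.0), (2.0, 0.0)], [1.0, 3.0]))
```

Under this definition, a binary contact flag corresponds to checking only whether the total pressure is nonzero, while the CoP additionally says where on the pad the contact is concentrated.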

Keyword: diffusion policy

How VLAs Fail Differently: Black-Box Action Monitoring Reveals Architecture-Specific Failure Signatures

VLA失效方式的不同：黑箱动作监控揭示架构特定的故障特征

Authors: Krishnam Gupta
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.28726
Pdf link: https://arxiv.org/pdf/2605.28726
Abstract We discover that VLA architectures fail in fundamentally different, predictable ways at the motor-command level. Running VQ-BeT, Diffusion Policy, and ACT on identical evaluation protocols (n=450 episodes across PushT and ALOHA 14-DOF bimanual manipulation), we find: (1) direction reversal rate is a universal failure predictor across all three architectures (AUROC=0.93, 0.79, 0.91; p<0.001); (2) jerk monitoring is predictive only for discrete-token architectures, following a discrete-to-continuous gradient (0.88, 0.69, 0.41); (3) velocity violations alone are non-predictive everywhere (AUROC 0.41-0.69), yet velocity checking is the most common safety mechanism in VLA deployment code; and (4) for continuous-family VLAs, velocity monitoring provides effectively zero predictive signal (AUROC=0.52 on ACT, 0.41 on Diffusion), proving that architecture-matched monitor selection is essential. These results quantify a monitoring consequence of the well-known discrete/continuous VLA distinction: the two families produce qualitatively different failure signatures that require different monitors. No single monitor works universally; architecture-matched selection is required. This finding was enabled by SafeContract, a training-free, black-box action monitoring toolkit with conformal calibration. Code: this https URL
中文摘要 我们发现 VLA 架构在电机指令层面以根本不同且可预测的方式失效。在相同的评估协议上运行 VQ-BeT、Diffusion Policy 和 ACT（n=450 个回合，涵盖 PushT 和 ALOHA 14-DOF 双臂操作），我们发现：（1）方向反转率是三种架构通用的失效预测指标（AUROC=0.93、0.79、0.91；p<0.001）；（2）加加速度（jerk）监控仅对离散 token 架构具有预测能力，并沿离散到连续的梯度递减（0.88、0.69、0.41）；（3）仅凭速度违规在任何架构上都不具预测性（AUROC 0.41-0.69），然而速度检查却是 VLA 部署代码中最常见的安全机制；（4）对于连续型 VLA，速度监控几乎不提供任何预测信号（ACT 上 AUROC=0.52，Diffusion 上为 0.41），证明与架构匹配的监控器选择至关重要。这些结果量化了众所周知的离散/连续 VLA 区分在监控上的后果：两类架构产生性质不同的失效特征，需要不同的监控器。没有任何单一监控器能够通用，必须按架构匹配选择。这一发现得益于 SafeContract，一个无需训练的黑箱动作监控工具包，带有共形校准功能。代码：this https URL
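
Illustrative note (not from the paper): the abstract names direction reversal rate as a failure predictor and reports AUROC values, but gives no formulas. The sketch below assumes one plausible definition, the fraction of consecutive action deltas whose dot product is negative, and scores it with the standard pairwise (Mann-Whitney) form of AUROC. The function names and the toy episodes are hypothetical and are not the SafeContract API.

```python
from typing import List, Sequence


def direction_reversal_rate(actions: Sequence[Sequence[float]]) -> float:
    """Fraction of consecutive action deltas that point in opposing directions.

    actions: one action vector per timestep. A "reversal" is assumed to mean
    that two successive deltas have a negative dot product.
    """
    deltas: List[List[float]] = [
        [b - a for a, b in zip(prev, cur)]
        for prev, cur in zip(actions, actions[1:])
    ]
    if len(deltas) < 2:
        return 0.0  # too short to contain a reversal
    reversals = sum(
        1
        for u, v in zip(deltas, deltas[1:])
        if sum(x * y for x, y in zip(u, v)) < 0
    )
    return reversals / (len(deltas) - 1)


def auroc(scores: Sequence[float], failed: Sequence[bool]) -> float:
    """Probability that a failed episode scores higher than a successful one.

    Ties count as half a win, which matches the usual AUROC convention.
    """
    pos = [s for s, f in zip(scores, failed) if f]
    neg = [s for s, f in zip(scores, failed) if not f]
    if not pos or not neg:
        raise ValueError("need at least one failed and one successful episode")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data only: a zigzag trajectory versus a smooth one.
    zigzag = [[0.0], [1.0], [0.0], [1.0]]
    smooth = [[0.0], [1.0], [2.0], [3.0]]
    scores = [direction_reversal_rate(zigzag), direction_reversal_rate(smooth)]
    print(scores, auroc(scores, [True, False]))
```

Because both functions only read the emitted actions, a monitor built this way stays black-box in the sense the abstract describes: it needs no access to model weights or internals.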

Imitation Learning for Robot Assistance in Open Surgery: A Multi-Policy Evaluation on Suture Following

开放手术中机器人辅助的模仿学习：缝线跟随（suture following）任务上的多策略评估

Authors: Xucheng Wang, Zhizhou Yang, Xiaoman Zhang, Sung Eun Kim, Romain Hardy, Pranav Rajpurkar
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.28736
Pdf link: https://arxiv.org/pdf/2605.28736
Abstract This study presents the first evaluation of general-purpose imitation learning for surgeon-robot collaborative assistance in open surgery, targeting suture following: the grab-pull-release motion an assistant performs at every stitch. We collect 160 teleoperated demonstrations (32,374 frames) on an open-source robot arm, benchmark four architecturally diverse imitation learning policies (ACT, Diffusion Policy, SmolVLA, $\pi_0$) across 28 trained models evaluated in 32 configurations along three clinically motivated dimensions: dataset size, camera viewpoint, and background variation. Our results demonstrate that under ideal conditions, the four policies achieve $50$-$75\%$ task success, with depth error as the dominant failure mode across all architectures. Among all policies, $\pi_0$ achieves the strongest results with a pretrained vision-language backbone, demonstrating superior data efficiency, greater robustness to background variation, and smoother trajectories compatible with surgical workflow. When deployed in a surgeon-robot suturing trial, $\pi_0$ yields a $92\%$ stitch completion rate. These findings establish collaborative robotic assistance in open surgery as a feasible target for imitation learning and highlight depth perception and end-effector design as key priorities for clinical translation.
中文摘要 本研究首次评估了通用模仿学习在开放手术中外科医生与机器人协作辅助上的表现，目标任务是缝线跟随（suture following）：助手在每一针都要执行的抓取、拉线、松开动作。我们在一台开源机械臂上采集了 160 条遥操作演示（32,374 帧），对四种架构各异的模仿学习策略（ACT、Diffusion Policy、SmolVLA、$\pi_0$）进行基准测试，共训练 28 个模型，并沿三个具有临床动机的维度在 32 种配置下进行评估：数据集规模、相机视角和背景变化。结果表明，在理想条件下，四种策略的任务成功率为 $50$-$75\%$，深度误差是所有架构的主要失败模式。在所有策略中，$\pi_0$ 凭借预训练的视觉语言骨干网络取得了最强的结果，展现出更高的数据效率、对背景变化更强的鲁棒性，以及与手术流程相容的更平滑轨迹。在外科医生与机器人协作的缝合试验中部署时，$\pi_0$ 的缝合完成率达到 $92\%$。这些发现确立了开放手术中的协作式机器人辅助是模仿学习的可行目标，并强调深度感知和末端执行器设计是临床转化的关键优先事项。