Arxiv Papers of Today

Generated: 2026-06-18 19:52:58 (UTC+8); arXiv release time: 2026-06-18 20:00 EDT (2026-06-19 08:00 UTC+8)

30 relevant papers today

Keyword: reinforcement learning

Breaking the Solver Bottleneck: Training Task Generators at the Learnable Frontier

打破求解器瓶颈：在可学习前沿训练任务生成器

Authors: Lorenz Wolf, Connor Watts, Roger Creus Castanyer, Geoffrey Bradway, Maxwill Lin, Augustine N. Mavor-Parker, Matthew Daborn-Sargent
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.18284
Pdf link: https://arxiv.org/pdf/2606.18284
Abstract The limiting resource for training agents via reinforcement learning (RL) is increasingly frontier task supply: valid, solvable tasks just difficult enough to train the current model. As reasoning and agentic models improve, fixed task distributions saturate, while naive synthetic generation yields tasks that are trivial, impossible, or ill-posed. Training a task generator with RL to optimize validity and learnability can address this bottleneck, but direct optimization requires repeated solver rollouts per candidate. For software-engineering (SWE) tasks, a single rollout can take tens of minutes; solver-in-the-loop generator training is intractable. We introduce PROPEL, a solver-amortized framework for training task generators at the targeted solve rate. PROPEL trains a lightweight activation probe on a one-time labeled corpus of generated tasks and solver outcomes. The probe predicts target-solver pass rate from a frozen generator reference model and serves as a proxy for solve rate during generator optimization, reducing generator evaluation to a single forward pass. Across math, code, and software-engineering at multiple model scales, PROPEL shifts generation toward the targeted solve rate: for coding, tasks generated at the learnable frontier increase from $10.1\% \rightarrow 20.0\%$ for a Qwen2.5-3B-Instruct solver and from $5.3\% \rightarrow 12.6\%$ for a Qwen2.5-7B-Instruct solver. For SWE, PROPEL increases the share of generations at the targeted solve rate from $9.8\% \rightarrow 19.6\%$ for Qwen3.5-27B on repositories not seen during training of probe and generator.
中文摘要 通过强化学习（RL）训练智能体时，限制性资源日益变为前沿任务的供应：即有效、可解且难度刚好足以训练当前模型的任务。随着推理和智能体模型的进步，固定任务分布趋于饱和，而朴素的合成生成则产生过于简单、无法求解或定义不当的任务。用强化学习训练任务生成器以优化有效性和可学习性可以缓解这一瓶颈，但直接优化需要对每个候选任务重复进行求解器rollout。对于软件工程（SWE）任务，单次rollout可能需要数十分钟；求解器在环的生成器训练在计算上不可行。我们提出PROPEL，这是一个求解器摊销框架，用于在目标求解率下训练任务生成器。PROPEL在一个一次性标注的语料库（由生成任务和求解器结果组成）上训练轻量级激活探针。该探针基于冻结的生成器参考模型预测目标求解器的通过率，并在生成器优化过程中作为求解率的代理，将生成器评估简化为单次前向传递。在数学、代码和软件工程任务以及多种模型规模上，PROPEL使生成结果向目标求解率偏移：对于编码任务，处于可学习前沿的生成任务占比在Qwen2.5-3B-Instruct求解器上从$10.1\%$提升至$20.0\%$，在Qwen2.5-7B-Instruct求解器上从$5.3\%$提升至$12.6\%$。对于SWE，在探针和生成器训练时均未见过的仓库上，PROPEL将Qwen3.5-27B处于目标求解率的生成任务占比从$9.8\%$提升至$19.6\%$。
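To make the frontier metric in the PROPEL entry above concrete, here is a minimal sketch of how the share of tasks at a targeted solve rate could be computed from solver outcomes. The band limits (`low`, `high`) and both function names are illustrative assumptions; the abstract does not state how the learnable frontier is delimited, and this is not the authors' implementation.

```python
def estimate_pass_rate(outcomes):
    """Empirical pass rate from a list of 0/1 solver rollout outcomes."""
    if not outcomes:
        raise ValueError("need at least one rollout outcome")
    return sum(outcomes) / len(outcomes)


def frontier_share(pass_rates, low=0.25, high=0.75):
    """Fraction of tasks whose pass rate lies in [low, high].

    The band is a hypothetical stand-in for a "targeted solve rate";
    the digest entry does not specify the actual thresholds.
    """
    if not pass_rates:
        return 0.0
    hits = sum(1 for p in pass_rates if low <= p <= high)
    return hits / len(pass_rates)


# Four generated tasks, each attempted four times by a solver (made-up data).
rollouts = [
    [1, 1, 1, 1],  # trivial: always solved
    [0, 0, 0, 0],  # impossible or ill-posed: never solved
    [1, 0, 1, 0],  # mid-difficulty
    [0, 0, 1, 0],  # hard but solvable
]
rates = [estimate_pass_rate(r) for r in rollouts]
share = frontier_share(rates)
print(rates, share)
```

In the abstract's setting the per-task rates come from a probe's prediction rather than from repeated rollouts; the aggregation step would look the same either way.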

Self-CTRL: Self-Consistency Training with Reinforcement Learning

Self-CTRL：基于强化学习的自我一致性训练

Authors: Itamar Pres, Laura Ruis, Melat Ghebreselassie, Belinda Z. Li, Jacob Andreas
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.18327
Pdf link: https://arxiv.org/pdf/2606.18327
Abstract Language models (LMs) that faithfully describe their own behavior can more easily be audited, understood, and trusted by users. This paper describes Self-Consistency Training with Reinforcement Learning (Self-CTRL), a method that optimizes for consistency between an LM's self-explanations and behavior on related inputs by updating explanations to better predict behavior or updating behavior to better match explanations. We apply our method in two domains. First, we study a formal probabilistic reasoning task in which LMs must learn to imitate a family of biased samplers and are evaluated on their ability to report the associated biases. We find that consistency training improves the correlation between self-reported and behaviorally-measured latent biases from $R^2=0.24$ to $R^2=0.64$ on a set of held-out distributions, matching the generalization of direct ground-truth supervision. Second, we study a constitutional AI domain in which LMs must describe when they will refuse or comply with user requests. Here, Self-CTRL produces rules that faithfully describe the model's behavior on held-out requests, improving the refusal predictions of a third-party auditor model from $36\%$ to $92\%$. In the other direction, behavior updates improve alignment, reducing HarmBench failure rate from $15.0\%$ to $0.5\%$ without substantially increasing refusal on harmless prompts. By aligning explanations and behavior, our work provides a general recipe for training AI models to be safer, more transparent, and more controllable.
中文摘要 能够忠实描述自身行为的语言模型（LM）更容易被用户审计、理解和信任。本文介绍基于强化学习的自我一致性训练（Self-CTRL），该方法通过更新解释以更好地预测行为，或更新行为以更好地匹配解释，来优化LM的自我解释与其在相关输入上的行为之间的一致性。我们将该方法应用于两个领域。首先，我们研究一项形式化的概率推理任务，其中LM必须学会模仿一族有偏采样器，并根据其报告相应偏差的能力接受评估。我们发现，在一组留出分布上，一致性训练将自我报告的潜在偏差与行为测量的潜在偏差之间的相关性从$R^2=0.24$提升到$R^2=0.64$，与直接使用真实标签监督的泛化效果相当。其次，我们研究一个宪法AI领域，其中LM必须描述自己何时会拒绝或遵从用户请求。在这里，Self-CTRL生成的规则能够忠实描述模型在留出请求上的行为，将第三方审计模型的拒绝预测从$36\%$提升到$92\%$。反过来，行为更新改善了对齐，将HarmBench失败率从$15.0\%$降至$0.5\%$，且没有显著增加对无害提示的拒绝。通过对齐解释与行为，我们的工作为训练更安全、更透明、更可控的AI模型提供了一套通用方案。

Recover, Discover, Plan: Learning Skills and Concepts from Robot Failures

恢复、发现、计划：从机器人故障中学习技能和概念

Authors: Bowen Li, Mayank Mishra, Y. Isabel Liu, Stone Tao, Nishanth Kumar, Alexander G. Gray, Ruwan Wickramarachchi, Jonathan Francis, Sebastian Scherer, Tom Silver
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.18328
Pdf link: https://arxiv.org/pdf/2606.18328
Abstract Intelligent robots should not only recover from failures, but also acquire the abstract knowledge needed to avoid them in the future. While reinforcement learning (RL) can learn reactive recovery behaviors, training a separate policy for every distinct failure mode is highly inefficient. We introduce Recovery-Driven Synthesis of Relational Concepts (ReSYNC), the first approach that progressively discovers and refines state abstractions (relational predicates) from failure-recovery experience to support abstract planning. Unlike purely reactive methods, ReSYNC jointly learns skills and concepts through an incremental dual-learning process. In the skill-learning phase, the robot uses RL to learn to recover from failures seen in training tasks. In the concept-learning phase, the robot discovers new relational predicates and refines its abstract planning model to explain and generalize the learned recovery behaviors. This interaction enables ReSYNC to convert local recoveries seen during training into global failure avoidance at test time. Across four simulated domains, we show that ReSYNC's ability to continually expand and refine its abstraction library allows it to solve long-horizon, previously unseen problems, outperforming strong baselines by over 50%. Additionally, we demonstrate sim-to-real transfer of ReSYNC, where it performs real-world non-prehensile manipulation skills and generalizes to unseen scenarios through abstract planning. Overall, ReSYNC represents a significant step toward robots that autonomously acquire abstractions for scalable, failure-aware planning in the physical world.
中文摘要 智能机器人不仅应能从故障中恢复，还应获得避免未来故障的抽象知识。虽然强化学习（RL）可以学习反应式恢复行为，但为每种不同的失败模式训练独立策略效率极低。我们引入了关系概念的恢复驱动综合（ReSYNC），这是首个逐步发现和完善失效-恢复经验中的状态抽象（关系谓词）以支持抽象规划的方法。与纯反应式方法不同，ReSYNC通过增量双重学习过程共同学习技能和概念。在技能学习阶段，机器人利用强化学习从训练任务中出现的失败中恢复。在概念学习阶段，机器人发现新的关系谓词，并完善其抽象规划模型，以解释和推广所学的恢复行为。这种交互使ReSYNC能够将训练中看到的本地恢复转换为测试时的全局故障避免。在四个模拟领域中，我们展示了ReSYNC持续扩展和完善其抽象库的能力，使其能够解决长期且此前未曾见过的问题，表现优于强基线超过50%。此外，我们还展示了ReSYNC的模拟到现实转移，它执行现实世界的非抓握操作技能，并通过抽象规划推广到看不见的场景。总体而言，ReSYNC代表了向自动化抽象、实现物理世界中可扩展、故障感知规划的机器人迈出的重要一步。

Reasoning as Intersection: Consensus-Frame Alignment for Visual Focus in Video-MLLMs

推理即交集：面向Video-MLLMs视觉聚焦的共识帧对齐

Authors: Chengwen Liu, Zhe Huang, Jisheng Dang, Hong Peng, Qi Tian, Tat-Seng Chua
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.18441
Pdf link: https://arxiv.org/pdf/2606.18441
Abstract Reinforcement learning has improved the reasoning ability of large language models, but applying outcome-only rewards to video multimodal large language models (Video-MLLMs) provides limited guidance on which visual evidence should support the answer. Inspired by multisensory integration, where consistent cues can enhance the salience and reliability of perceptual estimates, we introduce Consensus Frame GRPO (CF-GRPO), a temporal-annotation-free process-level reward framework for evidence-aware video reasoning. CF-GRPO constructs a consensus frame prior from intrinsic video cues, including temporal coverage, scene-transition cues, and query-conditioned visual relevance. It then computes a model-side frame-use score from visual and response representations and optimizes their agreement through the Consensus Frame Reward (CFR). With salience-aware sparse aggregation and distribution sharpening, CFR provides a high-contrast reward signal without requiring human temporal annotations. Experiments show that VideoCFR achieves competitive performance across complex video reasoning benchmarks and improves several metrics over representative Video-MLLM and RL baselines, while the consensus prior provides an interpretable view of the evidence frames emphasized during training. The implementation is available at this https URL.
中文摘要 强化学习提升了大型语言模型的推理能力，但将仅基于结果的奖励应用于视频多模态大型语言模型（Video-MLLMs）时，对于应由哪些视觉证据支持答案所能提供的指导有限。多感官整合中，一致的线索可以增强感知估计的显著性和可靠性；受此启发，我们提出共识帧GRPO（CF-GRPO），这是一个无需时间标注的过程级奖励框架，用于证据感知的视频推理。CF-GRPO利用视频的内在线索构建共识帧先验，包括时间覆盖、场景转换线索和以查询为条件的视觉相关性。然后，它根据视觉表征和回答表征计算模型侧的帧使用得分，并通过共识帧奖励（CFR）优化二者的一致性。借助显著性感知的稀疏聚合和分布锐化，CFR无需人工时间标注即可提供高对比度的奖励信号。实验表明，VideoCFR在复杂的视频推理基准上取得了有竞争力的性能，并在若干指标上优于具有代表性的Video-MLLM和强化学习基线，而共识先验则为训练中被强调的证据帧提供了可解释的视图。实现可在此 https URL 获取。

Structured Representation Learning with Locally Linear Embeddings and Adaptive Feature Fusion

结构化表示学习，采用局部线性嵌入和自适应特征融合

Authors: Somjit Nath, Jackson J Cone, Derek Nowrouzezahrai, Samira Ebrahimi Kahou
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.18469
Pdf link: https://arxiv.org/pdf/2606.18469
Abstract Neuroscientific research has revealed that the brain encodes complex behaviors by leveraging structured, low-dimensional manifolds and dynamically fusing multiple sources of information through adaptive gating mechanisms. Inspired by these principles, we propose a novel reinforcement learning (RL) framework that encourages the disentanglement of dynamics-specific and reward-specific features, drawing direct parallels to how neural circuits separate and integrate information for efficient decision-making. Our approach leverages locally linear embeddings (LLEs) to capture the intrinsic, locally linear structure inherent in many environments, mirroring the local smoothness observed in neural population activity, while concurrently deriving reward-specific features through the standard RL objective. An attention mechanism, analogous to cortical gating, adaptively fuses these complementary representations on a per-state basis. Experimental results on benchmark tasks demonstrate that our method, grounded in neuroscientific principles, improves learning efficiency and overall performance compared to conventional RL approaches, highlighting the benefits of explicitly modeling local state structures and adaptive feature selection as observed in biological systems.
中文摘要 神经科学研究表明，大脑通过利用结构化的低维流形，并通过自适应门控机制动态融合多个信息源，编码复杂行为。受这些原则启发，我们提出了一种新型强化学习（RL）框架，鼓励动力学特异性与奖励特异性特征的解缠，直接类比神经回路如何分离和整合信息以实现高效决策。我们的方法利用局部线性嵌入（LLEs）捕捉许多环境中固有的局部线性结构，反映神经群体活动中观察到的局部平滑性，同时通过标准强化学习目标推导出奖励特定特征。一种类似于皮层门控的注意力机制，在每个状态基础上自适应地融合这些互补的表征。基准任务的实验结果表明，基于神经科学原理的方法相比传统强化学习方法提升了学习效率和整体表现，突出了在生物系统中明确建模局部状态结构和适应性特征选择的益处。

N(CO)$^2$: Neural Combinatorial Optimization with Chance Constraints to Solve Stochastic Orienteering

N（CO）$^2$：带偶然约束的神经组合优化以解决随机定向

Authors: Anas Saeed, Marcos Abel Zuzuárregui, Stefano Carpin
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.18514
Pdf link: https://arxiv.org/pdf/2606.18514
Abstract Neural combinatorial optimization (NCO) offers a promising alternative to traditional heuristic-based methods for solving complex graph optimization problems by proposing to learn heuristics through data. This class of problems frequently arises in automation, as it can be used to model a variety of applications. While NCO has been extensively studied for deterministic combinatorial optimization problems, there are only a few works that aim to solve stochastic combinatorial optimization problems. In this work, we present N(CO)$^2$: Neural Combinatorial Optimization with Chance cOnstraints to solve the Stochastic Orienteering Problem (SOP) without the use of hand-crafted heuristics. By integrating a reinforcement learning (RL) framework, the model optimizes path selection under uncertainty, effectively balancing exploration and exploitation. Empirical results demonstrate that our method generalizes well across diverse SOP instances, achieving competitive performance compared to the state-of-the-art mixed-integer linear program (MILP) for the task. The proposed approach reduces human effort in heuristic design while enabling adaptive and efficient decision-making in uncertain environments.
中文摘要 神经组合优化（NCO）通过提出通过数据学习启发式方法，为解决复杂图优化问题提供了一种有前景的基于启发式方法的替代方案。这类问题在自动化中经常出现，因为它可以用来建模各种应用。虽然NCO在确定性组合优化问题中被广泛研究，但真正致力于解决随机组合优化问题的著作很少。在本研究中，我们提出了N（CO）$^2$：带偶然约束的神经组合优化，用于在不使用手工启发式的情况下解决随机定向问题（SOP）。通过集成强化学习（RL）框架，该模型在不确定性下优化路径选择，有效平衡探索与利用。实证结果表明，我们的方法在多种SOP实例中具有良好的推广性，在该任务中与最先进的混合整数线性规划（MILP）相比，实现了竞争性能。该方法在启发式设计中减少了人力劳动，同时在不确定环境中实现了适应性和高效的决策。

Sparsity Curse: Understanding RLVR Model Parameter Space from Model Merging

稀疏性诅咒：通过模型合并理解RLVR模型参数空间

Authors: Chenrui Wu, Zexi Li, Jiajun Bu, Jiangchuan Liu, Haishuai Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.18521
Pdf link: https://arxiv.org/pdf/2606.18521
Abstract Reinforcement Learning with Verifiable Reward (RLVR) has emerged as a powerful post-training paradigm that surpasses Supervised Fine-Tuning (SFT) in eliciting reasoning intelligence and resisting catastrophic forgetting. Recent studies further reveal that RLVR induces highly sparse and off-principal parameter updates compared to SFT. This naturally raises the question: does such sparsity make RLVR models more amenable to model merging? If so, model merging would offer a scalable, training-free path to aggregate diverse reasoning capabilities from independently trained RLVR models. Surprisingly, we find the opposite, uncovering a sparsity curse: the sparse RLVR updates are spread farther apart in parameter space, forming near-orthogonal shortcuts that make aggregation inherently fragile. This is likely rooted in the stochasticity of RL optimization and the diversity of emergent reasoning patterns. Unlike SFT models that converge to shared, flat basins and merge naturally, RLVR models suffer severe degradation under standard merging methods. Through systematic empirical analysis of the update geometry, we characterize the mechanisms behind this failure and propose Sensitivity-aware Resolving Merging (SAR-Merging), a merging recipe tailored for the unique structure of RLVR parameter spaces. SAR-Merging resolves conflicts in overlapping update regions via Fisher Information-based sensitivity arbitration, followed by magnitude-aware sparsification and rescaling to preserve fragile reasoning pathways. Experiments on mathematical and coding benchmarks demonstrate that SAR-Merging substantially outperforms existing merging methods on RLVR models, enabling both single-task enhancement and multi-capability fusion.
中文摘要 带可验证奖励的强化学习（RLVR）已成为一种强大的后训练范式，在激发推理智能和抵抗灾难性遗忘方面超越了监督微调（SFT）。最新研究进一步显示，与SFT相比，RLVR产生的参数更新高度稀疏且偏离主方向。这自然引出了一个问题：这种稀疏性是否使RLVR模型更适合模型合并？如果是这样，模型合并将提供一条可扩展、无需训练的路径，汇聚独立训练的RLVR模型中多样的推理能力。令人惊讶的是，我们发现情况恰恰相反，并揭示了稀疏性诅咒：稀疏的RLVR更新在参数空间中相距更远，形成近乎正交的捷径，使聚合本质上变得脆弱。这很可能源于强化学习优化的随机性和涌现推理模式的多样性。SFT模型会收敛到共享的平坦盆地并自然合并，而RLVR模型在标准合并方法下会严重退化。通过对更新几何的系统实证分析，我们刻画了这一失败背后的机制，并提出敏感度感知消解合并（SAR-Merging），这是一种针对RLVR参数空间独特结构量身定制的合并方案。SAR-Merging通过基于Fisher信息的敏感度仲裁解决重叠更新区域的冲突，随后进行幅度感知的稀疏化和重新缩放，以保护脆弱的推理路径。数学和编码基准上的实验表明，SAR-Merging在RLVR模型上显著优于现有合并方法，既支持单任务增强，也支持多能力融合。
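The "near-orthogonal" geometry described in the Sparsity Curse entry above can be illustrated with two standard diagnostics on parameter updates: cosine similarity and overlap of the nonzero supports. This is only a sketch on toy vectors. The function names and numbers are hypothetical, and it does not reproduce SAR-Merging or the paper's analysis.

```python
import math


def cosine(u, v):
    """Cosine similarity between two update vectors (0.0 if either is all-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def support_overlap(u, v, tol=0.0):
    """Jaccard overlap of the index sets where each update is nonzero."""
    support_u = {i for i, a in enumerate(u) if abs(a) > tol}
    support_v = {i for i, b in enumerate(v) if abs(b) > tol}
    union = support_u | support_v
    return len(support_u & support_v) / len(union) if union else 0.0


# Toy updates (fine-tuned weights minus base weights) for two hypothetical models.
delta_a = [0.0, 0.1, 0.0, 0.0]   # sparse update touching coordinate 1
delta_b = [0.0, 0.0, 0.0, -0.3]  # sparse update touching coordinate 3

print(cosine(delta_a, delta_b), support_overlap(delta_a, delta_b))
```

Two sparse updates with disjoint supports are exactly orthogonal, which is the limiting case of the geometry the abstract reports.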

Benchmarking Action Spaces in Reinforcement Learning for Vision-based Robotic Manipulation

基于视觉的机器人操作强化学习中的行动空间基准测试

Authors: Seyed Alireza Azimi, Homayoon Farrahi, Abhishek Naik, Colin Bellinger, A. Rupam Mahmood
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.18594
Pdf link: https://arxiv.org/pdf/2606.18594
Abstract In real-world reinforcement learning (RL), the choice of action space can play a key role in shaping motion smoothness, safety, and overall task performance. In this study, we evaluate pose increment, pose velocity, joint position increment, and joint velocity across two vision-based manipulation tasks: object picking and pushing. We train policies in simulation and deploy them to the real world using sim-to-real transfer. We find that action-space representation indeed significantly affects sim-to-real performance. In particular, we find that the joint velocity action space is best for the vision-based picking and pushing tasks in terms of smoothness and final task performance. We also provide practical guidance for RL practitioners in choosing action spaces for both simulation and real-world experiments.
中文摘要 在真实世界的强化学习（RL）中，动作空间的选择在塑造运动平滑性、安全性以及整体任务表现方面起着关键作用。本研究在两个基于视觉的操作任务（物体拾取和推动）上评估位姿增量、位姿速度、关节位置增量和关节速度四种动作空间。我们在仿真中训练策略，并通过仿真到现实迁移将其部署到真实世界。我们发现，动作空间表示确实显著影响仿真到现实的性能。特别地，就平滑性和最终任务表现而言，关节速度动作空间在基于视觉的拾取和推动任务中表现最佳。我们还为强化学习从业者提供实用指导，帮助他们为仿真和真实世界实验选择动作空间。

PragReST: Self-Reinforcing Counterfactual Reasoning for Pragmatic Language Understanding

PragReST：语用语言理解的自我强化反事实推理

Authors: Jihyung Park, Minchao Huang, Leqi Liu, Elias Stengel-Eskin
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.18624
Pdf link: https://arxiv.org/pdf/2606.18624
Abstract Natural language understanding often depends on meanings that are implied rather than explicitly stated, requiring pragmatic reasoning. Despite strong performance on math and logical reasoning, large language models (LLMs) still struggle with making pragmatic inferences, often choosing literal interpretations. To improve LLM pragmatic reasoning, we introduce PragReST, a self-supervised framework that constructs pragmatic QA data, generates counterfactual reasoning traces, and trains models to internalize them through supervised fine-tuning and reinforcement learning, without human-labeled training data or distillation from a stronger teacher. Across four pragmatic benchmarks (PragMega, Ludwig, MetoQA, and AltPrag), PragReST improves over backbone models, task-specific pragmatic tuning baselines, and non-counterfactual variants of the same pipeline. On accuracy-based benchmarks, PragReST improves over the instruct backbone by 5.37 and 5.50% (absolute) for Qwen3-8B and Qwen3-14B, respectively. Our error analysis and ablations underscore the importance of counterfactual reasoning: PragReST primarily reduces errors caused by failures to contrast observed utterances with plausible alternatives, and removing counterfactual reasoning substantially reduces performance. Moreover, our training preserves out-of-domain performance on general-knowledge and mathematical reasoning benchmarks.
中文摘要 自然语言理解往往依赖于隐含而非明确陈述的意义，因此需要语用推理。尽管大型语言模型（LLMs）在数学和逻辑推理方面表现出色，但它们仍然难以做出语用推断，常常选择字面解释。为了提升LLM的语用推理能力，我们提出PragReST，这是一个自监督框架，它构建语用问答数据，生成反事实推理轨迹，并通过监督微调和强化学习训练模型内化这些轨迹，无需人工标注的训练数据，也无需从更强的教师模型蒸馏。在四个语用基准（PragMega、Ludwig、MetoQA和AltPrag）上，PragReST优于骨干模型、针对特定任务的语用调优基线以及同一流水线的非反事实变体。在基于准确率的基准上，PragReST相对于instruct骨干模型在Qwen3-8B和Qwen3-14B上分别提升了5.37%和5.50%（绝对值）。我们的错误分析和消融实验强调了反事实推理的重要性：PragReST主要减少因未能将观察到的话语与合理替代表达进行对比而导致的错误，而去除反事实推理会显著降低性能。此外，我们的训练保持了模型在通用知识和数学推理基准上的域外性能。

SRL: Combining SLIP Model and Reinforcement Learning for Agile Robotic Jumping

SRL：结合SLIP模型与强化学习实现敏捷机器人跳跃

Authors: Xiaowen Hu, Linqi Ye, Yudi Zhu, Chenyue Shao, Rankun Li, Qingdu Li, Yan Peng
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.18625
Pdf link: https://arxiv.org/pdf/2606.18625
Abstract Robotic jumping is pivotal in applications such as search and rescue and logistics, where crossing obstacles and enhancing mobility efficiency are critical. The Spring-Loaded Inverted Pendulum (SLIP) model leverages simplified spring-mass dynamics that naturally encode biologically plausible hopping motions, yet its performance degrades on irregular terrain due to idealized assumptions regarding contact and joint dynamics. Meanwhile, Reinforcement Learning (RL) can adapt to diverse and complex environments but often requires extensive data from unguided exploration. The complementary strengths of SLIP's physically grounded baseline and RL's adaptive capabilities motivate a hybrid framework that overcomes these individual limitations. We therefore propose Spring-loaded Reinforcement Learning (SRL), which integrates SLIP-based feedforward control signals with RL-driven real-time feedback, enabling continuous optimization of robotic jumping. Experimental results demonstrate that SRL can achieve more stable jumps with much less training time than the baseline method, maintaining an average position tracking error below 0.1 m and velocity tracking errors within +/-3% of the target values. Through bipedal and quadrupedal simulations of ground and stair jumping, as well as sim-to-sim and sim-to-real validations, SRL exhibits robust adaptability to various task requirements and environmental complexities, underscoring its potential for real-world deployment.
中文摘要 机器人跳跃在搜救和物流等应用中至关重要，在这些应用中跨越障碍和提升移动效率十分关键。弹簧负载倒立摆（SLIP）模型利用简化的弹簧-质量动力学，自然地编码了符合生物学规律的跳跃运动，但由于对接触和关节动力学作了理想化假设，其在不规则地形上性能下降。与此同时，强化学习（RL）能够适应多样且复杂的环境，但通常需要来自无引导探索的大量数据。SLIP基于物理的基线与RL的自适应能力优势互补，这促使我们采用一个克服各自局限的混合框架。因此，我们提出弹簧负载强化学习（SRL），将基于SLIP的前馈控制信号与强化学习驱动的实时反馈相结合，实现机器人跳跃的持续优化。实验结果表明，与基线方法相比，SRL能够以少得多的训练时间实现更稳定的跳跃，平均位置跟踪误差保持在0.1米以下，速度跟踪误差保持在目标值的+/-3%以内。通过双足和四足机器人在平地与楼梯跳跃上的仿真，以及sim-to-sim和sim-to-real验证，SRL展现出对各种任务需求和环境复杂性的稳健适应能力，凸显了其在真实世界部署的潜力。

Generating Natural and Expressive Robot Gestures through Iterative Reinforcement Learning with Human Feedback using LLMs

通过利用大型语言模型（LLM）进行迭代强化学习与人类反馈，生成自然且富有表现力的机器人手势

Authors: Chris Lee, Flora Salim, Benjamin Tag, Francisco Cruz
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.18747
Pdf link: https://arxiv.org/pdf/2606.18747
Abstract Expressive gestures are essential for natural and effective communication, complementing speech when verbal cues alone are insufficient (e.g., pointing). For social robots such as the humanoid Pepper, producing natural and expressive movements is critical for improving human-robot interaction (HRI) and long-term acceptance. However, generating gestures remains challenging due to reliance on expert-authored animations, resulting in rigid behaviors that are impractical for dynamic and diverse environments. Alternatively, machine learning approaches often struggle to capture perceived naturalness, becoming increasingly challenging with more degrees of freedom. Consequently, producing expressive robot gestures requires a system that can adapt to the environment while adhering to social norms and physical constraints. Recent advances in large language models (LLMs) enable dynamic code generation, offering new opportunities for runtime gesture synthesis from natural language. In this paper, we integrate ChatGPT into the humanoid robot Pepper to generate co-speech gestures aligned with conversational output. While this baseline enables flexible gesture generation, the resulting motions are often perceived as stiff and unnatural. To address this limitation, we introduce an iterative reinforcement learning with human feedback (RLHF) system that finetunes gesture generation based on user evaluations, leveraging an iterative user study to compare Pepper's generated gestures. Our results show that RLHF improved the LLM's co-speech generative capabilities, producing more expressive, relevant and fluid movements.
中文摘要 表达性手势对于自然且有效的交流至关重要，当单靠口头线索不足时（例如指点），它们可以补充言语。对于像人形机器人Pepper这样的社交机器人来说，产生自然且富有表现力的动作对于提升人机交互（HRI）和长期接受度至关重要。然而，由于依赖专家制作的动画，生成手势依然具有挑战性，导致僵化的行为在动态多样的环境中不切实际。另一方面，机器学习方法常常难以捕捉被感知的自然性，随着自由度增加，挑战性也随之增加。因此，制作富有表现力的机器人手势需要一个能够适应环境、同时遵守社会规范和物理约束的系统。大型语言模型（LLM）的最新进展使动态代码生成成为可能，为从自然语言进行运行时手势综合提供了新机遇。本文将ChatGPT集成到类人机器人Pepper中，生成与对话输出相匹配的共言手势。虽然这种基线允许灵活生成手势，但产生的动作常被认为僵硬且不自然。为解决这一限制，我们引入了一种基于用户评价的迭代强化学习（RLHF）系统，利用迭代用户研究对比Pepper生成的手势，从而微调手势生成。我们的结果显示，RLHF提升了LLM的共言生成能力，产生了更具表现力、相关性和流畅性的动作。

R2D-RL: A RoboCup 2D Soccer Environment for Multi-Agent Reinforcement Learning

R2D-RL：一个用于多智能体强化学习的RoboCup 2D足球环境

Authors: Haobin Qin, Baofeng Zhang, Hidehisa Akiyama, Keisuke Fujii
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.18786
Pdf link: https://arxiv.org/pdf/2606.18786
Abstract Robot soccer is a challenging testbed for multi-agent reinforcement learning because it combines partial observability, cooperative and adversarial interaction, sparse rewards, and long-horizon tactical behavior. RoboCup 2D Soccer Simulation (RCSS2D) provides a mature robot-soccer platform, but its competition-oriented server-client architecture is difficult to use directly with modern Python-based MARL workflows. We introduce R2D-RL, a reinforcement learning environment that connects RCSS2D and HELIOS-based player clients to a Python MARL interface through shared-memory communication and cycle-level synchronization. R2D-RL supports full-field and scenario-based training with configurable opponents, Base discrete and Hybrid parameterized action spaces, action masks, expected possession value (EPV)-based reward shaping, and parallel execution. We provide front-goal scenarios and an 11-vs-11 full-field benchmark, together with baseline results.
中文摘要 机器人足球是多智能体强化学习的一个具有挑战性的试验平台，因为它结合了部分可观测性、合作与对抗交互、稀疏奖励以及长时程的战术行为。RoboCup 2D足球仿真（RCSS2D）提供了一个成熟的机器人足球平台，但其面向竞赛的服务器-客户端架构难以直接用于现代基于Python的MARL工作流程。我们提出R2D-RL，这是一个强化学习环境，通过共享内存通信和周期级同步，将RCSS2D和基于HELIOS的球员客户端连接到Python MARL接口。R2D-RL支持全场训练和基于场景的训练，提供可配置的对手、Base离散动作空间和Hybrid参数化动作空间、动作掩码、基于预期控球价值（EPV）的奖励塑造以及并行执行。我们提供门前场景和11对11全场基准，以及基线结果。

Learning from Own Solutions: Self-Conditioned Credit Assignment for Reinforcement Learning with Verifiable Rewards

从自身解答中学习：面向可验证奖励强化学习的自条件信用分配

Authors: Yingyu Shan, Yuhang Guo, Zihao Cheng, Zeming Liu, Xiangrong Zhu, Xinyi Wang, Jiashu Yao, Wei Lin, Hongru Wang, Heyan Huang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.18810
Pdf link: https://arxiv.org/pdf/2606.18810
Abstract Reinforcement learning with verifiable rewards (RLVR) has driven substantial progress in training LLMs for reasoning tasks, but representative methods such as GRPO assign uniform credit across all tokens, wasting gradient on routine tokens while under-crediting pivotal reasoning steps. Existing token-level credit assignment methods require resources beyond the model's own rollouts. GRPO variants rely on process reward models or ground-truth answers. Knowledge distillation assigns credit through per-token divergence but requires external teachers (On-Policy Distillation) or privileged information (On-Policy Self Distillation). These dependencies limit applicability in the pure RLVR setting. We observe that conditioning the model on its own verified trajectories induces a measurable per-token KL divergence between the original and conditioned distributions, and prove that distilling from a self-teacher constructed from verified trajectories leads to infeasible weighted-average solutions when multiple verified trajectories exist. We propose SC-GRPO (Self-Conditioned GRPO), which uses this per-token KL divergence as a multiplicative weight on GRPO gradients. Across five benchmarks spanning math, code, and agentic tasks, SC-GRPO consistently outperforms GRPO by 8.1% and DAPO by 5.9%, with stronger OOD performance. Moreover, SC-GRPO achieves higher performance than OPD.
中文摘要 带可验证奖励的强化学习（RLVR）推动了大型语言模型（LLMs）推理任务训练的显著进展，但GRPO等代表性方法对所有token分配相同的信用，把梯度浪费在常规token上，同时低估了关键推理步骤的贡献。现有的token级信用分配方法需要模型自身rollout之外的资源。GRPO的变体依赖过程奖励模型或真实答案。知识蒸馏通过逐token散度分配信用，但需要外部教师（On-Policy Distillation）或特权信息（On-Policy Self Distillation）。这些依赖限制了它们在纯RLVR设置中的适用性。我们观察到，以模型自身已验证的轨迹为条件，会在原始分布与条件分布之间产生可测量的逐token KL散度；我们还证明，当存在多条已验证轨迹时，从由已验证轨迹构建的自教师进行蒸馏会导致不可行的加权平均解。我们提出SC-GRPO（Self-Conditioned GRPO），它将上述KL散度用作GRPO梯度的乘性权重。在涵盖数学、代码和智能体任务的五个基准上，SC-GRPO始终优于GRPO（高8.1%）和DAPO（高5.9%），并具有更强的OOD性能。此外，SC-GRPO的性能高于OPD。
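The SC-GRPO entry above describes per-token KL divergence used as a multiplicative weight on GRPO gradients. The sketch below shows one way such a weighting could be wired together on toy distributions. The KL direction, the mean-normalization of weights, and all function names are assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two categorical distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def group_advantages(rewards):
    """Group-relative advantage as in GRPO: (reward - group mean) / group std."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    if std == 0:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


def token_weights(original, conditioned):
    """Per-token weights from KL(conditioned || original), normalized to mean 1.

    Both the KL direction and the normalization are assumptions.
    """
    kls = [kl_divergence(c, o) for o, c in zip(original, conditioned)]
    mean_kl = sum(kls) / len(kls)
    if mean_kl == 0:
        return [1.0] * len(kls)  # fall back to uniform credit
    return [k / mean_kl for k in kls]


# One two-token rollout over a two-word vocabulary (made-up probabilities).
original = [[0.5, 0.5], [0.9, 0.1]]     # next-token distributions without conditioning
conditioned = [[0.5, 0.5], [0.1, 0.9]]  # distributions after conditioning on a verified solution

weights = token_weights(original, conditioned)
advantages = group_advantages([1.0, 0.0])  # two rollouts in the group
per_token_credit = [w * advantages[0] for w in weights]
print(weights, per_token_credit)
```

In this toy case the first token's distribution is unchanged by conditioning and receives zero weight, while the second token, where the distributions disagree, receives all of the credit.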

Reinforcement Learning Foundation Models Should Already Be A Thing

强化学习基础模型本应已经存在

Authors: Abdelrahman Zighem, Jill-Jênn Vie
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.18812
Pdf link: https://arxiv.org/pdf/2606.18812
Abstract Foundation models for language and vision are powered by internet-scale data, while structured domains (tabular prediction, time-series forecasting, graph learning, reinforcement learning) are not. The substitute is synthetic data, which shifts the burden from collection to prior design. Such priors already exist for many structured tasks: TabPFN and its successors solve tabular classification with a transformer pretrained on a synthetic Bayesian prior. We make two points. First, reinforcement learning is the conspicuous gap: sampling a synthetic MDP is as feasible as sampling a synthetic tabular dataset, yet no in-context RL work treats prior design as a primary objective. Second, MDPs admit a fixed-size sufficient statistic, independent of the episodes observed and tabular in shape, which makes them directly amenable to the attention-based architectures used for tabular foundation models, with a policy head replacing the supervised target. Together these define the agenda for an RL foundation model. As a proof of concept, we train one model entirely on synthetic MDPs and show that, with no task-specific tuning, it solves held-out tabular benchmarks in context, both online and offline: online, in far fewer episodes than UCB-VI and tabular Q-learning, and offline, competitively with VI-LCB.
中文摘要 语言和视觉的基础模型由互联网规模的数据驱动，而结构化领域（表格预测、时间序列预测、图学习、强化学习）则不然。替代方案是合成数据，这把负担从数据收集转移到先验设计。许多结构化任务已有此类先验：TabPFN及其后继模型使用在合成贝叶斯先验上预训练的Transformer解决表格分类。我们提出两点。第一，强化学习是明显的空白：采样一个合成MDP与采样一个合成表格数据集同样可行，但尚无上下文强化学习工作把先验设计作为首要目标。第二，MDP存在固定大小的充分统计量，它与观测到的回合数无关且呈表格形状，这使其可以直接使用表格基础模型所采用的基于注意力的架构，只需用策略头取代监督目标。这两点共同定义了强化学习基础模型的研究议程。作为概念验证，我们完全在合成MDP上训练了一个模型，并表明它无需针对特定任务调优，即可在上下文中解决留出的表格基准，在线和离线均可：在线时所需回合数远少于UCB-VI和表格Q-learning，离线时与VI-LCB表现相当。
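The entry above states that tabular MDPs admit a fixed-size sufficient statistic that does not grow with the number of episodes. One familiar statistic with that property is a table of transition counts and reward sums, sketched below. Whether this is the statistic the paper uses is an assumption, and the function names and toy episodes are hypothetical.

```python
def empty_stats(n_states, n_actions):
    """Allocate count and reward tables whose size depends only on |S| and |A|."""
    return {
        "counts": [[[0] * n_states for _ in range(n_actions)] for _ in range(n_states)],
        "reward_sum": [[0.0] * n_actions for _ in range(n_states)],
    }


def update(stats, s, a, r, s_next):
    """Fold one transition (s, a, r, s_next) into the tables in place."""
    stats["counts"][s][a][s_next] += 1
    stats["reward_sum"][s][a] += r


def shape(stats):
    """Dimensions of the count table: (|S|, |A|, |S|)."""
    counts = stats["counts"]
    return (len(counts), len(counts[0]), len(counts[0][0]))


def transition_estimate(stats, s, a):
    """Empirical next-state distribution for (s, a), or None if never visited."""
    row = stats["counts"][s][a]
    total = sum(row)
    if total == 0:
        return None
    return [n / total for n in row]


stats = empty_stats(n_states=3, n_actions=2)
episodes = [  # made-up transitions: (state, action, reward, next_state)
    [(0, 1, 0.0, 1), (1, 0, 1.0, 2)],
    [(0, 1, 0.0, 2), (2, 0, 0.5, 2)],
    [(0, 1, 0.0, 1)],
]
for episode in episodes:
    for s, a, r, s_next in episode:
        update(stats, s, a, r, s_next)

print(shape(stats), transition_estimate(stats, 0, 1))
```

However many episodes are folded in, `shape(stats)` stays `(3, 2, 3)`, which is what makes a table like this a natural fixed-size input for an attention-based model.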

Maturing Markov Decision Processes: Decision Making under Increasing Information and Shrinking Action Sets

成熟的马尔可夫决策过程：信息量增加和行动集缩小下的决策

Authors: Jiaxi Liu, Aiping Yang, Yuhang Yang, Shuqi Zhang, Zewei Dong, Jiangming Yang, Xuebin Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.18820
Pdf link: https://arxiv.org/pdf/2606.18820
Abstract Sequential decision problems often exhibit an asymmetric evolution of information and decision flexibility: as a decision cycle unfolds, the agent receives richer information while feasible actions expire due to operational cutoffs, commitments, or resource constraints. Standard MDP formulations typically flatten this structure into stage-dependent state descriptions and action masks, thereby obscuring the nested information--action asymmetry that determines which decisions are urgent and which can be deferred. We introduce Maturing Markov Decision Processes (MMDPs), a formulation built around this information--action asymmetry. We characterize one of its key consequences through an expiring-action priority principle, which identifies the actions that must be resolved before the next stage. Motivated by this structure, we develop a structure-aware reinforcement learning framework with stage-aware policy design, expiring-action abstraction, and search-augmented learning with distillation. Experiments on a controlled multi-supplier replenishment problem, simplified cash-management environments of increasing complexity, and a production-scale simulator show that explicitly modeling this asymmetry improves learning efficiency and becomes increasingly valuable as decision problems scale.
中文摘要 顺序决策问题通常表现出信息和决策灵活性的不对称演变：随着决策周期展开，代理接收到更丰富的信息，而可行的行动因操作中断、承诺或资源限制而终止。标准MDP表述通常将该结构扁平化为阶段相关的状态描述和动作掩码，从而掩盖了决定哪些决策紧急、哪些可以推迟的嵌套信息-动作不对称性。我们介绍了成熟马尔可夫决策过程（MMDPs），这一表述基于信息-动作不对称性。我们通过“过期行动优先级原则”来界定其关键后果之一，该原则明确了在下一阶段前必须解决的行动。基于这一结构，我们开发了一个结构感知强化学习框架，包含阶段感知策略设计、过期动作抽象和带有蒸馏的搜索增强学习。对受控多供应商补货问题、日益复杂化的简化现金管理环境以及生产规模模拟器的实验表明，明确建模这种不对称性能提升学习效率，并且随着决策问题的扩大，价值日益增强。

Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

超越奖励工程：长上下文强化学习的数据配方

Authors: Xiaoyue Xu, Sikui Zhang, Xiaorong Wang, Xu Han, Chaojun Xiao
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.18831
Pdf link: https://arxiv.org/pdf/2606.18831
Abstract Long-context reasoning is an essential capability for large language models, particularly when they are deployed as autonomous agents that must reason over lengthy trajectories. Reinforcement learning (RL) has recently emerged as a dominant paradigm for improving this ability, yet existing work largely focuses on reward engineering while diverse training data remains scarce. We revisit this problem from a data-centric perspective and show that a simple yet effective data recipe alone, paired with a minimal outcome-based GRPO setup, suffices to substantially improve long-context reasoning. Our recipe targets three complementary task families -- retrieval, multi-evidence synthesis, and reasoning -- for which we construct and curate eight datasets totaling ~14K examples. Experiments on three models (Qwen3-4B/8B/30B-A3B) yield average gains of +7.2/+3.2/+6.4 points across seven long-context benchmarks, surpassing prior RL training sets. We further demonstrate that these gains transfer to agentic tasks, where continuing RL training on an agent-tuned model with our data recipe improves GAIA by +4.8 and BrowseComp by +7.0 points. We will release our datasets to facilitate future research.
中文摘要 长上下文推理是大型语言模型的一项关键能力，尤其是当它们被部署为需要在冗长轨迹上推理的自主智能体时。强化学习（RL）近来已成为提升这一能力的主流范式，但现有工作大多聚焦于奖励工程，而多样化的训练数据仍然稀缺。我们从以数据为中心的视角重新审视该问题，并表明仅凭一个简单而有效的数据配方，配合极简的基于结果的GRPO设置，就足以显著提升长上下文推理能力。我们的配方面向三类互补的任务（检索、多证据综合和推理），并为此构建和整理了八个数据集，总计约1.4万个样本。在三个模型（Qwen3-4B/8B/30B-A3B）上的实验显示，在七个长上下文基准上平均提升+7.2/+3.2/+6.4分，超过了此前的RL训练集。我们还进一步证明这些提升可以迁移到智能体任务：在经过智能体调优的模型上使用我们的数据配方继续进行RL训练，可使GAIA提升+4.8分，BrowseComp提升+7.0分。我们将发布数据集以促进未来研究。
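
The abstract above pairs its data recipe with a "minimal outcome-based GRPO setup". As a rough illustration of what outcome-based, group-relative advantages look like, here is a minimal sketch. The function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions made for this example, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's outcome reward against its own sampling group.

    Assumption: population standard deviation plus a small eps, so a group
    whose rollouts all receive the same reward yields zero advantages.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one prompt, scored 1.0 when the final answer is correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct rollouts receive a positive advantage and incorrect ones a negative advantage of equal size, and a group with identical rewards contributes no gradient signal.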

Generative-Model Predictive Planning for Navigation in Partially Observable Environments

部分可观测环境中导航的生成模型预测规划

Authors: Thomas Quilter, Yifan Zhu, Guorui Quan, Mingfei Sun, Samuel Kaski
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.18888
Pdf link: https://arxiv.org/pdf/2606.18888
Abstract Navigation in partially observable environments presents a significant challenge for autonomous agents, requiring effective decision-making with limited sensory information in unknown environments. Belief-based methods, particularly those using neural networks to approximate the belief space, often fail to capture the inherent multimodality of belief spaces, especially in high-dimensional cases with perceptual aliasing. While generative models present a compelling alternative, they typically require substantial data or expert demonstrations and lack explicit mechanisms for long-term planning. In this paper, we introduce BeliefDiffusion, a novel framework that combines the benefits of both generation and planning. BeliefDiffusion leverages diffusion models to explicitly characterize multimodal belief distributions and utilizes Model Predictive Control (MPC) to simultaneously plan ahead. It consists of two steps: (1) imagining plausible environment configurations based on the observation history, and (2) planning efficient navigation strategies across the aggregated configurations. Through extensive experiments in synthetic map environments, we demonstrate that BeliefDiffusion significantly outperforms both model-free reinforcement learning baselines and other generative approaches in navigation success rate and path efficiency. Our results validate that explicitly incorporating multimodal belief representations into planning enables more robust navigation in partially observable settings.
中文摘要 在部分可观测环境中的导航对自主智能体来说是重大挑战，需要在有限感官信息下做出有效决策，且环境不明。基于信念的方法，尤其是利用神经网络近似信念空间的方法，往往无法捕捉信念空间固有的多模态性，尤其是在具有感知混叠的高维情况下。虽然生成模型提供了有力的替代方案，但通常需要大量数据或专家演示，且缺乏明确的长期规划机制。本文介绍了信念扩散（BeliefDiffusion），这是一个结合生成和规划优势的新框架。信念扩散利用扩散模型明确刻画多模态信念分布，并利用模型预测控制（MPC）同时进行前瞻性规划。它包括两个步骤：（1）基于观测历史设想合理的环境配置，（2）规划在汇总配置中的高效导航策略。通过在合成地图环境中的大量实验，我们证明了信念扩散在导航成功率和路径效率方面显著优于无模型强化学习基线及其他生成方法。我们的结果验证了，在规划中明确纳入多模态信念表示，能够在部分可观测环境中实现更稳健的导航。

REVES: REvision and VErification--Augmented Training for Test-Time Scaling

REVES：面向测试时扩展的修订与验证增强训练

Authors: Yuanxin Liu, Ruida Zhou, Xinyan Zhao, Amr Sharaf, Hongzhou Lin, Arijit Biswas, Mohammad Ghavamzadeh, Zhaoran Wang, Mingyi Hong
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.18910
Pdf link: https://arxiv.org/pdf/2606.18910
Abstract Test-time scaling via sequential revision has emerged as a powerful paradigm for enhancing Large Language Model (LLM) reasoning. However, standard post-training methods primarily optimize single-shot objectives, creating a fundamental misalignment with multi-step inference dynamics. While recent work treats this as multi-turn reinforcement learning (RL), conventional approaches optimize over the multi-step trajectories directly, failing to further exploit the high-quality mistakes in intermediate steps that the model can learn from by correcting them. We propose a two-stage iterative framework that alternates between online data/prompt augmentation and policy optimization. By converting the intermediate steps ("near-miss" answers) in the successful recovery trajectories into decoupled revision and verification prompts, our approach concentrates training on both effective answer transformation and error identification. This approach enables efficient off-policy data generation and reduces the computational overhead of long-horizon sampling compared to standard multi-turn RL. On LiveCodeBench, using publicly available test cases as feedback, we observe gains of +6.5 points over the RL baseline and +4.0 points over standard multi-turn training. Beyond coding, our approach matches the previously reported SOTA result on circle packing while using the smallest base model (4B) and far fewer rollouts than the much larger evolutionary search systems. Math results under ground-truth verification further confirm improved correction ability. It also generalizes to out-of-distribution constraint-satisfaction puzzles such as n_queens and mini_sudoku, where correctness is defined entirely by problem constraints. Code is available at this https URL.
中文摘要 通过顺序修订实现的测试时扩展已成为增强大型语言模型（LLM）推理能力的有力范式。然而，标准的后训练方法主要优化单次作答目标，与多步推理的动态存在根本性错位。尽管近期工作将其视为多轮强化学习（RL），但传统方法直接在多步轨迹上进行优化，未能进一步利用中间步骤中的高质量错误，而模型本可以通过纠正这些错误来学习。我们提出一个两阶段迭代框架，在在线数据/提示增强与策略优化之间交替进行。通过把成功恢复轨迹中的中间步骤（“差一点”的答案）转换为解耦的修订提示和验证提示，我们的方法将训练集中在有效的答案转换和错误识别上。与标准多轮RL相比，这种方法支持高效的离策略数据生成，并降低了长时程采样的计算开销。在LiveCodeBench上，以公开可用的测试用例作为反馈，我们观察到相比RL基线提升+6.5分，相比标准多轮训练提升+4.0分。除代码任务外，我们的方法在圆填充（circle packing）上追平了此前报告的SOTA结果，而所用基础模型最小（4B），rollout次数也远少于规模大得多的进化搜索系统。在真值验证下的数学结果进一步证实了纠错能力的提升。它还能泛化到分布外的约束满足谜题，如 n_queens 和 mini_sudoku，其正确性完全由问题约束定义。代码可在此 https URL 获取。

Object-Centric Residual RL for Zero-Shot Sim-to-Real VLA Enhancement

以物体为中心的残差强化学习：零样本仿真到现实的VLA增强

Authors: Kinam Kim, Namiko Saito, Heecheol Kim, Katsushi Ikeuchi, Jaegul Choo, Yasuyuki Matsushita
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.18953
Pdf link: https://arxiv.org/pdf/2606.18953
Abstract Vision-Language-Action (VLA) models can generalize across diverse manipulation tasks, but their imitation-learning-based policies remain brittle in precise physical interactions due to compounding execution errors. Can a reinforcement learning policy trained purely in simulation improve the robustness of real-world VLAs zero-shot? Residual RL, which learns a corrective policy on top of a frozen VLA, offers a natural framework, but existing approaches face a fundamental sim-to-real dilemma: privileged-state methods require lossy distillation for deployment; image-based methods suffer from the visual domain gap; and real-world RL is costly and unsafe. We propose an object-centric residual RL framework that refines VLA actions using object poses, enabling a compact observation space that transfers consistently between simulation and reality. To align the two domains, we additionally replay the same teleoperation demonstrations in simulation to train a sim counterpart of the real-world VLA. The residual RL policy is trained only in simulation with pose noise injection and dropout, and transfers zero-shot to the real robot. Across five manipulation tasks on a real Franka Research 3 (FR3) robot, our method improves the success rate from 42% to 76% zero-shot, and the improved rollouts can be further reused to retrain the base VLA for self-improvement without additional teleoperation. Project page: this https URL
中文摘要 视觉-语言-动作（VLA）模型可以泛化到多种操作任务，但其基于模仿学习的策略在精确的物理交互中会因执行误差累积而仍然脆弱。纯粹在仿真中训练的强化学习策略，能否以零样本方式提升真实世界VLA的鲁棒性？残差强化学习在冻结的VLA之上学习一个纠正策略，提供了一个自然的框架，但现有方法面临根本性的仿真到现实困境：特权状态方法在部署时需要有损蒸馏；基于图像的方法受视觉域差距困扰；而真实世界中的强化学习成本高昂且不安全。我们提出一个以物体为中心的残差强化学习框架，利用物体位姿来修正VLA动作，从而得到一个紧凑的观测空间，可在仿真与现实之间一致地迁移。为了对齐这两个域，我们还在仿真中重放相同的遥操作演示，以训练真实世界VLA在仿真中的对应模型。残差RL策略仅在仿真中训练，并采用位姿噪声注入和dropout，然后零样本迁移到真实机器人上。在真实Franka Research 3（FR3）机器人上的五个操作任务中，我们的方法以零样本方式将成功率从42%提升到76%，改进后的rollout还可进一步复用于重新训练基础VLA，在无需额外遥操作的情况下实现自我改进。项目页面：此 https URL

GraphPO: Graph-based Policy Optimization for Reasoning Models

GraphPO：基于图的推理模型策略优化

Authors: Yuliang Zhan, Xinyu Tang, Jian Li, Dandan Zheng, Weilong Chai, Jingdong Chen, Jun Zhou, Ge Wu, Wenyue Tang, Hao Sun
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.18954
Pdf link: https://arxiv.org/pdf/2606.18954
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for enhancing the capability of large reasoning models. RLVR typically samples responses independently and optimizes the policy using rewards from final answers. This paradigm has two limitations. First, independently sampled responses often contain similar intermediate reasoning steps, causing redundant exploration and wasted computation. Second, sparse final-answer rewards make it hard to identify useful steps. Tree-based methods partly address this problem by sharing prefixes and comparing branches from the same prefix to provide fine-grained signals. However, tree branches are still expanded independently. When different branches reach similar reasoning states, they cannot share information and repeat similar exploration. Moreover, tree-based methods ignore such dispersion and only perform local comparisons within separate branches, which can lead to higher variance in advantage estimation. To address this challenge, we propose GraphPO (Graph-based Policy Optimization), a novel RL framework that represents rollouts as a directed acyclic graph, with reasoning steps as edges and semantic states summarized from the reasoning paths as nodes. GraphPO merges semantically equivalent reasoning paths into equivalence classes, allowing them to share suffixes and reallocating budget away from redundant expansions to diverse exploration. Furthermore, we assign efficiency advantages to incoming edges and correctness advantages to outgoing edges, thereby improving inference efficiency while deriving process supervision from outcome. Theory shows that GraphPO reduces advantage-estimation variance and enhances reasoning efficiency. Experiments on three LLMs across reasoning and agentic search benchmarks show that GraphPO consistently outperforms chain- and tree-based baselines with the same token budgets or response budgets.
中文摘要 带可验证奖励的强化学习（RLVR）已成为提升大型推理模型能力的标准范式。RLVR通常独立采样多个响应，并利用来自最终答案的奖励优化策略。这种范式有两个局限。首先，独立采样的响应往往包含相似的中间推理步骤，造成重复探索和计算浪费。其次，最终答案奖励稀疏，难以识别哪些步骤有用。基于树的方法通过共享前缀并比较同一前缀下的分支来提供细粒度信号，部分缓解了这一问题。然而，树的分支仍是独立扩展的。当不同分支到达相似的推理状态时，它们无法共享信息，只能重复相似的探索。此外，基于树的方法忽略了这种分散性，只在各自独立的分支内做局部比较，这可能导致优势估计的方差更高。为应对这一挑战，我们提出GraphPO（基于图的策略优化），这是一种新的强化学习框架，它将rollout表示为有向无环图，以推理步骤为边，以从推理路径中概括出的语义状态为节点。GraphPO把语义等价的推理路径合并为等价类，使它们能够共享后缀，并把预算从冗余扩展重新分配给多样化探索。此外，我们为入边分配效率优势，为出边分配正确性优势，从而在从结果中导出过程监督的同时提升推理效率。理论分析表明，GraphPO降低了优势估计的方差并提高了推理效率。在三个LLM上针对推理和智能体搜索基准的实验表明，在相同的词元（token）预算或响应预算下，GraphPO始终优于基于链和基于树的基线。

EfficientRollout: System-Aware Self-Speculative Decoding for RL Rollouts

EfficientRollout：面向强化学习rollout的系统感知自推测解码

Authors: Minseo Kim, Minjae Lee, Seunghyuk Oh, Kevin Galim, Donghoon Kim, Coleman Hooper, Harman Singh, Amir Gholami, Hyung Il Koo, Wonjun Kang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.18967
Pdf link: https://arxiv.org/pdf/2606.18967
Abstract Reinforcement learning (RL) has become a representative post-training paradigm for LLMs, enabling strong reasoning and agentic capabilities. However, rollout generation remains a dominant latency bottleneck because autoregressive sampling decodes responses sequentially and a small number of long-tailed generations often determine completion time. Speculative decoding (SD) offers a natural way to address this bottleneck, as it is a well-established technique for serving fixed LLMs that reduces latency by rapidly drafting tokens and accepting them through parallel verification while preserving the target-model distribution. However, its practical speedups do not directly carry over to RL rollouts: (i) the evolving target policy makes any fixed drafter increasingly mismatched with the policy's output distribution; and (ii) active batch sizes shrink throughout rollout decoding, shifting decoding from compute-bound to memory-bound regimes where parallel verification can exploit underutilized compute. Therefore, accelerating RL rollouts requires both a drafter that remains effective under long, high-temperature generations from an evolving policy and system-aware use of SD that avoids compute-bound regimes. We present EfficientRollout, a system-aware self-SD framework designed to address this gap for RL rollouts. EfficientRollout induces a quantized drafter from the target model (i.e. self-speculative decoding), keeping it coupled to the evolving policy without separate drafter pretraining or online adaptation. It further coordinates a system-aware SD toggle policy with acceptance-aware draft-length adaptation, enabling speculation only in beneficial regimes while matching the drafting budget to evolving drafter quality. EfficientRollout reduces rollout and end-to-end latency by up to 19.6% and 12.7%, respectively, over an accelerated AR rollout baseline, while preserving final model quality.
中文摘要 强化学习（RL）已成为大型语言模型（LLM）具有代表性的后训练范式，带来了强大的推理和智能体能力。然而，rollout生成仍是主要的延迟瓶颈，因为自回归采样按顺序解码响应，而少数长尾生成往往决定了完成时间。推测解码（SD）为解决这一瓶颈提供了自然的途径：它是一种成熟的、用于服务固定LLM的技术，通过快速起草词元（token）并经并行验证后接受来降低延迟，同时保持目标模型的分布。然而，其实际加速效果并不能直接迁移到RL rollout：（i）不断演化的目标策略使任何固定的草稿模型（drafter）与策略的输出分布越来越不匹配；（ii）在rollout解码过程中，活跃批大小不断缩小，使解码从计算受限转向内存受限，而在内存受限状态下并行验证可以利用未被充分利用的算力。因此，加速RL rollout既需要一个在演化策略的长序列、高温度生成下仍然有效的草稿模型，也需要以系统感知的方式使用SD，以避开计算受限的状态。我们提出EfficientRollout，一个面向RL rollout、旨在弥补这一空白的系统感知自推测解码框架。EfficientRollout从目标模型中导出一个量化的草稿模型（即自推测解码），使其与不断演化的策略保持耦合，无需单独的草稿模型预训练或在线适配。它还将系统感知的SD开关策略与感知接受率的草稿长度自适应相协调，只在有利的状态下启用推测，同时使起草预算与不断变化的草稿模型质量相匹配。相比经过加速的自回归（AR）rollout基线，EfficientRollout将rollout延迟和端到端延迟分别最多降低19.6%和12.7%，同时保持最终模型质量。
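
The abstract above notes that speculative decoding accepts drafted tokens "while preserving the target-model distribution". The sketch below illustrates the standard accept/resample rule for a single drafted token and checks that property exactly on a toy vocabulary. It is a generic illustration of speculative sampling, not EfficientRollout's implementation; the function names are hypothetical, and the drafter is assumed to give nonzero probability to every token it may propose.

```python
def accept_probability(p, q, token):
    """Chance of keeping a drafted token: min(1, p/q) for target p, drafter q."""
    return min(1.0, p[token] / q[token])


def residual_distribution(p, q):
    """Distribution to resample from after a rejection: normalized max(0, p - q)."""
    gaps = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(gaps)
    return [g / total for g in gaps]


def output_distribution(p, q):
    """Exact distribution of the emitted token after one draft-and-verify step."""
    accepted = [qi * accept_probability(p, q, i) for i, qi in enumerate(q)]
    reject_mass = 1.0 - sum(accepted)
    if reject_mass <= 1e-12:  # drafter already matches the target
        return accepted
    resampled = residual_distribution(p, q)
    return [a + reject_mass * r for a, r in zip(accepted, resampled)]


target = [0.6, 0.3, 0.1]   # toy target-policy distribution
drafter = [0.3, 0.5, 0.2]  # toy (mismatched) drafter distribution
emitted = output_distribution(target, drafter)
```

Here the emitted distribution equals the target even though the drafter is badly mismatched; the mismatch only lowers the acceptance rate (0.7 in this toy case), which is why drafter quality matters for speed rather than for correctness.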

Spotlight: Synergizing Seed Exploration and Spot GPUs for DiT RL Post-Training

Spotlight：协同种子探索与竞价（spot）GPU的DiT强化学习后训练

Authors: Ruiqi Lai, Dakai An, Wei Gao, Ju Huang, Siran Yang, Jiamang Wang, Lin Qu, Dmitrii Ustiugov, Wei Wang
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.19004
Pdf link: https://arxiv.org/pdf/2606.19004
Abstract Reinforcement learning (RL) post-training of Diffusion Transformers (DiTs) is prohibitively expensive, requiring thousands of high-end GPUs. Existing works explore two directions to reduce cost: seed exploration improves training convergence by selecting high-contrast samples, yet adds compute to the critical path; spot GPUs offer 69-77% lower cost, yet sit idle during training because DiT rollouts finish nearly simultaneously, which prevents LLM-style pipelining of rollout with training. Spot preemptions further break Sequence Parallelism (SP) groups, fragmenting GPU topology. We present Spotlight, the first system that harvests spot GPUs for DiT RL post-training. Spotlight rests on two key insights we devise: (1) we show that exploration can tolerate stale model weights because exploration that uses the model weights from the previous iteration preserves the relative ranking of random seeds, allowing exploration to run on idle spot GPUs during training; (2) SP reconfiguration can reuse on-node state, reducing group recovery from minutes to sub-second launches. Built on these insights, Spotlight introduces three techniques: a bandit-based exploration planner that maximizes reward variance within the training time budget, elastic sequence parallelism that reconfigures SP groups on the fly via persistent schedulers and intra-node weight copying, and a preemption-aware pull-based request scheduler that balances load and commits in-flight state upon preemption. We implement Spotlight on the open-source RL platform ROLL and evaluate it on Qwen-Image post-training. Spotlight reaches the same target validation score $4\times$ faster than baselines, reducing total cost by $1.4$-$6.4\times$ while achieving superior image quality on DeepSeek-OCR and Geneval datasets with resolution $512\times512$ and $1280\times1280$.
中文摘要 扩散Transformer（DiT）的强化学习（RL）后训练成本极高，需要数千块高端GPU。现有工作探索了两个降低成本的方向：种子探索通过选择高对比度样本来改善训练收敛，但会给关键路径增加计算量；竞价（spot）GPU的成本低69%至77%，但在训练期间处于闲置状态，因为DiT的rollout几乎同时完成，无法像LLM那样将rollout与训练流水线化。spot实例被抢占还会破坏序列并行（SP）组，使GPU拓扑碎片化。我们提出Spotlight，这是首个将spot GPU用于DiT RL后训练的系统。Spotlight基于我们提出的两个关键洞见：（1）探索可以容忍陈旧的模型权重，因为使用上一轮迭代权重进行的探索能够保持随机种子的相对排名，从而使探索可以在训练期间运行于空闲的spot GPU上；（2）SP重配置可以复用节点上的状态，将组恢复时间从数分钟缩短到亚秒级启动。基于这些洞见，Spotlight引入了三项技术：基于多臂老虎机（bandit）的探索规划器，在训练时间预算内最大化奖励方差；弹性序列并行，通过常驻调度器和节点内权重复制即时重新配置SP组；以及感知抢占的拉取式请求调度器，它在平衡负载的同时于被抢占时提交进行中的状态。我们在开源强化学习平台ROLL上实现了Spotlight，并在Qwen-Image后训练上进行了评估。Spotlight达到相同目标验证分数的速度比基线快 $4\times$，总成本降低 $1.4$-$6.4\times$，同时在分辨率为 $512\times512$ 和 $1280\times1280$ 的DeepSeek-OCR和Geneval数据集上取得了更优的图像质量。

ProductConsistency: Improving Product Identity Preservation in Instruction-Based Image Editing via SFT and RL

ProductConsistency：通过SFT和RL改进基于指令的图像编辑中的产品身份保护

Authors: Mukund Khanna, Raj Singh Yadav, Kunal Singh
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.19103
Pdf link: https://arxiv.org/pdf/2606.19103
Abstract Recent advances in instruction-based image editing have enabled models to perform complex visual edits from natural language instructions. However, in product-centric scenarios where preserving product features, branding, and textual elements is critical, current open- and closed-source models often struggle to maintain this fine-grained object identity. This issue is further compounded by the lack of datasets for instruction-based product image editing with text fidelity constraints, leaving it largely treated as an implicit capability of instruction-based image editing models. In this work, we introduce the ProductConsistency dataset, which is designed to improve product-centric image editing. Our approach includes a supervised fine-tuning (SFT) dataset of 87k samples for product editing, a reinforcement learning (RL) dataset with 869 unique product images, and a new benchmark dataset, the ProductConsistency Benchmark, to allow rigorous and standardized evaluation of editing models. To guide RL training, we propose a Cyclic Consistency reward that enforces semantic preservation of product identity by using caption similarity between the original product description and captions generated from the edited image. We fine-tune both Qwen-Image-Edit-2511 and Flux.1-Kontext-dev using our dataset and demonstrate consistent improvements over baseline models in OCR and perceptual metrics as well as MLLM-based evaluations, indicating stronger product consistency, text rendering, and overall visual quality, with the Qwen-Image-Edit-2511 model achieving a 5x reduction in the character error rate. The code and pipeline are available at this https URL
中文摘要 基于指令的图像编辑的最新进展使模型能够根据自然语言指令执行复杂的视觉编辑。然而，在以产品为中心的场景中，保留产品特性、品牌和文本元素至关重要，当前的开源和闭源模型常常难以维持这种细粒度的对象身份。这一问题因缺乏基于指令的产品图像编辑数据集且文本忠实度约束而更加严重，使得它在很大程度上被视为基于指令的图像编辑模型的隐含能力。在本研究中，我们介绍了ProductConsistency数据集，旨在提升以产品为中心的图像编辑。我们的方法包括一个包含8.7万样本的监督微调（SFT）产品编辑数据集、一个包含869张独特产品图像的强化学习（RL）数据集，以及一个新的基准数据集——产品一致性基准，以实现对编辑模型的严谨和标准化评估。为了指导强化学习训练，我们提出了一种循环一致性奖励，通过使用原始产品描述与编辑图像生成的标题相似性，强制产品身份的语义保持。我们利用我们的数据集微调了Qwen-Image-Edit-2511和Flux.1-Kontext-dev，并在OCR和感知指标以及基于MLLM的评估中均有持续优异，显示出更强的产品一致性、文本渲染和整体视觉质量;Qwen-Image-Edit-2511模型实现了字符错误率的5倍降低。代码和流水线可在此 https URL 获取

Pareto Q-Learning with Reward Machines

带奖励机的帕累托Q学习

Authors: Arnaud Lequen, Clément Legrand-Lixon, Léo Saulières
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.19134
Pdf link: https://arxiv.org/pdf/2606.19134
Abstract We present Pareto Q-Learning with Reward Machines (PQLRM), a multi-objective reinforcement learning algorithm for tasks whose reward structure is specified by a set of reward machines (RMs). PQLRM combines Pareto Q-Learning (PQL), which maintains sets of vector-valued Q-estimates to approximate the Pareto front, with enhancements from Q-Learning with Reward Machines (QRM), which exploits the factored automaton structure of the reward signal. This yields a multi-policy algorithm that remains sample-efficient under non-Markovian, RM-encoded rewards. Experimental trials show that PQLRM converges faster than a naive PQL baseline applied to the cross-product MDP and can synthesize Pareto-optimal policies that QRM cannot.
中文摘要 我们提出了带奖励机的帕累托Q学习（PQLRM），这是一种多目标强化学习算法，适用于奖励结构由一组奖励机（RM）指定的任务。PQLRM将帕累托Q学习（PQL）与带奖励机的Q学习（QRM）的改进相结合：前者维护向量值Q估计的集合以近似帕累托前沿，后者利用奖励信号的分解式自动机结构。由此得到一个多策略算法，在非马尔可夫的、由RM编码的奖励下仍保持样本效率。实验表明，PQLRM的收敛速度快于应用在乘积MDP（cross-product MDP）上的朴素PQL基线，并且能够合成QRM无法得到的帕累托最优策略。
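
The abstract above describes PQL as maintaining "sets of vector-valued Q-estimates to approximate the Pareto front". The core set operation behind that idea is a non-dominated filter, sketched below. This is a generic illustration assuming every objective is maximized; the function names are hypothetical and the code is not the PQLRM update itself.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(vectors):
    """Keep only the value vectors that no other vector dominates."""
    # Drop exact duplicates first while preserving the original order.
    unique = list(dict.fromkeys(tuple(v) for v in vectors))
    return [v for v in unique if not any(dominates(u, v) for u in unique)]


# Toy two-objective Q-value vectors for a single state-action pair.
q_vectors = [(1.0, 5.0), (3.0, 3.0), (2.0, 2.0), (5.0, 1.0), (3.0, 3.0)]
front = pareto_front(q_vectors)
```

In this toy set, `(2.0, 2.0)` is dropped because `(3.0, 3.0)` beats it on both objectives, while the remaining three vectors represent different trade-offs and are all kept.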

The Reward Was in Your Data All Along: Correcting Flow Matching with Discriminator-Guided RL

奖励一直都在你的数据中：用判别器引导的强化学习纠正流匹配

Authors: Nicolas Beltran-Velez, Felix Friedrich, Zhang Xiaofeng, Reyhane Askari-Hemmat, Xiaochuang Han, Adriana Romero-Soriano, Michal Drozdzal
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.19162
Pdf link: https://arxiv.org/pdf/2606.19162
Abstract Score- and flow-matching models often rely on preference-based reinforcement learning for two purposes: aligning with subjective preferences and, surprisingly, recovering properties such as visual realism and coherent object structure that matching-based training is intended to learn from the data itself. We argue that this reflects a structural mismatch. Matching losses measure $\ell_2$ regression error on the velocity or score field under training-time marginals, a proxy poorly aligned with the visual and semantic properties that determine sample quality at inference. Given a reward aligned with these properties, RL sidesteps the mismatch by evaluating the model on its own samples and following the reward landscape directly. The challenge is to obtain such a reward without relying on human preferences, which are expensive and conflate data realism with annotator inclinations. We propose Discriminator-Guided RL (DRL). DRL trains a discriminator to separate data from base-model samples in a pretrained representation space and uses its logit as the reward in KL-regularized RL. The pretrained space restricts the discriminator to perceptually meaningful directions, and the logit estimates the log-likelihood ratio between data and model, which is the optimal reward for targeting the data distribution. Across SiT, JiT, REPA, and RAE, DRL reduces guidance-free FID (e.g., $9.38 \to 2.62$ on SiT) and semantic-space FD (e.g., $88.2 \to 19.3$ on DINOv3 for SiT), with consistent gains across all backbones, and improves human-preference rewards without training on them. It also yields a better Pareto frontier between preference reward and image fidelity under subsequent preference-based post-training, increasing alignment while reducing low-level artifacts such as oversaturation and excessive brightness.
中文摘要 分数匹配和流匹配模型通常出于两个目的依赖基于偏好的强化学习：一是与主观偏好对齐，二是（令人意外地）恢复视觉真实感和连贯的物体结构等属性，而这些属性本应由基于匹配的训练从数据本身学到。我们认为这反映了一种结构性错配。匹配损失衡量的是训练时边缘分布下速度场或分数场上的 $\ell_2$ 回归误差，这一代理指标与决定推理时样本质量的视觉和语义属性对齐不佳。给定一个与这些属性对齐的奖励，强化学习通过在模型自身的样本上评估模型并直接沿奖励地形优化，绕开了这种错配。难点在于如何在不依赖人类偏好的情况下获得这样的奖励，因为人类偏好成本高昂，且会把数据真实性与标注者倾向混为一谈。我们提出判别器引导的强化学习（DRL）。DRL在预训练的表示空间中训练一个判别器来区分真实数据与基础模型样本，并将其logit用作KL正则化强化学习中的奖励。预训练空间把判别器限制在感知上有意义的方向上，而logit估计的是数据与模型之间的对数似然比，这正是以数据分布为目标时的最优奖励。在SiT、JiT、REPA和RAE上，DRL降低了无引导FID（例如SiT上 $9.38 \to 2.62$）和语义空间FD（例如SiT在DINOv3上 $88.2 \to 19.3$），在所有骨干网络上均有一致提升，并且在未针对人类偏好奖励训练的情况下提升了这些奖励。在随后的基于偏好的后训练中，它还带来了偏好奖励与图像保真度之间更优的帕累托前沿，在提高对齐度的同时减少过饱和、过亮等低层伪影。
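
The abstract above states that the discriminator logit estimates the log-likelihood ratio between data and model, and that this is the optimal reward for targeting the data distribution. A short worked derivation of why that holds is sketched below. It assumes an ideal discriminator trained with balanced binary cross-entropy directly over samples $x$, and a KL penalty of weight $\beta$ taken against the base model $p_\theta$; both are simplifying assumptions for illustration, not details confirmed by the abstract.

```latex
% Assumptions: ideal (Bayes-optimal) discriminator with balanced classes,
% and the KL regularizer is measured against the base model p_\theta.
\begin{align*}
D^*(x) &= \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{data}}(x) + p_\theta(x)}, \\
r(x)   &= \log\frac{D^*(x)}{1 - D^*(x)}
        = \log\frac{p_{\mathrm{data}}(x)}{p_\theta(x)}, \\
\pi^*  &= \arg\max_{\pi}\; \mathbb{E}_{x\sim\pi}\big[r(x)\big]
          - \beta\,\mathrm{KL}\big(\pi \,\|\, p_\theta\big), \\
\pi^*(x) &\propto p_\theta(x)\,\exp\!\big(r(x)/\beta\big)
          = p_\theta(x)^{\,1 - 1/\beta}\; p_{\mathrm{data}}(x)^{\,1/\beta}.
\end{align*}
% With \beta = 1 the optimum is exactly \pi^* = p_{\mathrm{data}}.
```

Under these assumptions, setting $\beta = 1$ makes the KL-regularized optimum coincide with the data distribution, which is one way to read the abstract's claim about the logit being the optimal reward.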

Forecasting what Matters: Decision-Focused RL for Controlled EV Charging with Unknown Departure Times

预测重要因素：决策导向的强化学习，用于受控电动汽车充电，且出发时间未知

Authors: Giuseppe Gabriele, Fabio Pavirani, Seyed Soroush Karimi Madahi, Chris Develder
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.19199
Pdf link: https://arxiv.org/pdf/2606.19199
Abstract The recent growth of EV adoption poses challenges for power systems, including increased peak demand and potential grid instability. Smart control of EV charging -- e.g., based on reinforcement learning (RL) -- can alleviate these issues by learning temporal and contextual patterns from historical data. Yet, in real-world scenarios, key features, such as departure time, often are unavailable. This, in turn, makes it harder for an RL agent to learn and execute an effective charging policy. To mitigate this uncertainty, a trained forecaster can approximate the unknown features from available data. However, since these forecasting models are typically trained for accuracy (rather than their impact on a downstream agent's decision quality), their errors may propagate and hinder the overall performance of a controller that is using the forecasts. To avoid this, we propose a decision-focused RL (DF-RL) framework in which the forecaster is trained end-to-end, i.e., with feedback from the charging policy actions taken by the RL agent. Such joint training of both the forecaster and controller ultimately results in higher-quality actions: our proposed DF-RL method yields superior charging decisions compared to other baselines, achieving up to a 14% improvement in total reward and a 55% reduction of unsupplied energy (i.e., charging that failed to happen because the EV already left), relative to the RL method without departure time forecasting.
中文摘要 电动汽车普及率的近期增长给电力系统带来了挑战，包括峰值需求增加和潜在的电网不稳定。电动汽车充电的智能控制（例如基于强化学习（RL））可以通过从历史数据中学习时间和上下文模式来缓解这些问题。然而，在现实场景中，出发时间等关键特征往往不可获得。这又使强化学习智能体更难学习并执行有效的充电策略。为了减轻这种不确定性，可以用训练好的预测器根据可用数据近似未知特征。然而，由于这些预测模型通常以准确率为目标训练（而非以其对下游智能体决策质量的影响为目标），其误差可能会传播，并损害使用这些预测的控制器的整体性能。为避免这种情况，我们提出一个决策导向的RL（DF-RL）框架，其中预测器以端到端方式训练，即利用强化学习智能体所采取的充电策略动作的反馈。预测器与控制器的这种联合训练最终带来更高质量的动作：相比其他基线，我们提出的DF-RL方法做出了更优的充电决策，相对于不带出发时间预测的RL方法，总奖励最多提升14%，未供给电量（即因电动汽车已离开而未能完成的充电）减少55%。

STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability

STARE：惊异度引导的词元级优势重加权，用于策略熵稳定

Authors: Haipeng Luo, Qingfeng Sun, Songli Wu, Can Xu, Wenfeng Deng, Han Hu, Yansong Tang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.19236
Pdf link: https://arxiv.org/pdf/2606.19236
Abstract Reinforcement Learning with Verifiable Rewards algorithms like GRPO have emerged as the dominant post-training paradigm for complex reasoning in LLMs, yet commonly suffer from policy entropy collapse during training. We conduct a first-order gradient analysis of token-level entropy dynamics under GRPO and identify a token-level credit assignment mismatch: the per-token entropy variation decomposes into the product of the trajectory-level advantage and an entropy sensitivity function over the next-token distribution, yielding an advantage-surprisal four-quadrant structure and a near-criticality property. Motivated by this, we propose STARE (Surprisal-guided Token-level Advantage Reweighting for policy Entropy stability), which identifies entropy-critical token subsets via batch-internal surprisal quantiles, selectively reweights their effective advantages, and incorporates a target-entropy closed-loop gate for stable entropy regulation. Across model scales from 1.5B to 32B and three task families (Short CoT, Long CoT, and Multi-Turn Tool Use), STARE sustains stable RL training over thousands of steps while maintaining policy entropy within the target band. On AIME24 and AIME25, STARE outperforms DAPO and other competitive baselines by 4%-8% in average accuracy, with reflection tokens and response length growing in tandem, indicating sustained exploration-exploitation balance that further unlocks RL training. Code is available at this https URL.
中文摘要 GRPO等带可验证奖励的强化学习算法已成为大型语言模型复杂推理的主流后训练范式，但在训练过程中普遍存在策略熵坍缩的问题。我们对GRPO下词元（token）级熵动态进行了一阶梯度分析，并发现了词元级的信用分配错配：每个词元的熵变化可分解为轨迹级优势与下一词元分布上的熵敏感度函数的乘积，由此得到优势-惊异度四象限结构和近临界性质。受此启发，我们提出STARE（Surprisal-guided Token-level Advantage Reweighting for policy Entropy stability），它通过批内惊异度分位数识别对熵至关重要的词元子集，有选择地对其有效优势重新加权，并引入目标熵闭环门控以实现稳定的熵调节。在1.5B到32B的模型规模和三类任务（短CoT、长CoT和多轮工具使用）上，STARE在数千步内保持强化学习训练稳定，同时将策略熵维持在目标区间内。在AIME24和AIME25上，STARE的平均准确率比DAPO及其他有竞争力的基线高出4%-8%，反思词元与响应长度同步增长，表明其持续保持探索与利用的平衡，进一步解锁了强化学习训练。代码可在此 https URL 获取。
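
The abstract above says STARE picks out token subsets "via batch-internal surprisal quantiles" and selectively reweights their advantages. The sketch below shows only the general shape of such a step: compute per-token surprisal, take a quantile over the batch, and rescale the advantage on tokens above it. The threshold quantile, the scale factor, which tokens get reweighted, and all function names are hypothetical choices for illustration; the paper's actual rule (including its four-quadrant structure and entropy gate) is not reproduced here.

```python
import math


def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, with q in [0, 1]."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def reweight_advantages(token_probs, advantage, high_q=0.8, scale=0.5):
    """Rescale the trajectory-level advantage on high-surprisal tokens.

    Hypothetical rule: tokens whose surprisal (-log p) exceeds the batch's
    high_q quantile get advantage * scale; all others keep the advantage.
    """
    surprisal = [-math.log(p) for p in token_probs]
    threshold = quantile(surprisal, high_q)
    return [advantage * (scale if s > threshold else 1.0) for s in surprisal]


# Probabilities the policy assigned to the sampled tokens in a toy batch.
probs = [0.9, 0.8, 0.05, 0.7, 0.6]
weighted = reweight_advantages(probs, advantage=1.0)
```

In this toy batch only the token sampled with probability 0.05 lies above the 0.8 surprisal quantile, so it is the only one whose advantage is rescaled.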

Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation

重新思考奖励监督：评分标准条件化的自蒸馏

Authors: Siyi Gu, Jialin Chen, Sophia Zhou, Arman Cohan, Rex Ying
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.19327
Pdf link: https://arxiv.org/pdf/2606.19327
Abstract Post-training of reasoning language models is commonly driven by supervised distillation and reinforcement learning with verifiable rewards. Distillation often relies on chain-of-thought annotations that are expensive to obtain and may themselves be noisy, incomplete, or partially incorrect; even when the final solution is correct, an imperfect rationale can interfere with learning. Reinforcement learning with verified rewards, on the other hand, typically compresses evaluative feedback into a scalar signal, obscuring which aspects of a response should be improved. We propose Rubric-Conditioned Self-Distillation, a framework that incorporates rubrics as structured, fine-grained feedback for on-policy self-distillation. Our method conditions the teacher model on criterion-level rubrics and uses it to provide token-level guidance on the student's own sampled trajectories. This design avoids treating a single reference rationale as the sole supervision target. Instead, rubrics specify what a strong response should satisfy, enabling more fine-grained credit assignment over the reasoning process than scalar reward optimization. We instantiate this framework with a two-stage pipeline that first learns to generate task-specific rubrics and then trains a rubric-guided reasoner. We evaluate on a diverse suite of science reasoning benchmarks, and the results show that rubric-conditioned self-distillation effectively converts rubric-level criteria into token-level guidance over the reasoning process, surpassing GRPO by 1.0 points and OPSD by 0.9 points on average.
中文摘要 推理语言模型的后训练通常由监督蒸馏和带可验证奖励的强化学习驱动。蒸馏往往依赖思维链标注，这些标注获取成本高，且本身可能含噪、不完整或部分错误；即使最终解答正确，不完美的推理过程也可能干扰学习。另一方面，带可验证奖励的强化学习通常把评价反馈压缩为一个标量信号，掩盖了响应中究竟哪些方面需要改进。我们提出评分标准条件化的自蒸馏（Rubric-Conditioned Self-Distillation），这是一个把评分标准（rubric）作为结构化、细粒度反馈用于在策略（on-policy）自蒸馏的框架。我们的方法让教师模型以准则级的评分标准为条件，并用它在学生自己采样的轨迹上提供词元（token）级指导。这一设计避免了把单一的参考推理过程当作唯一的监督目标。相反，评分标准规定了一个好的响应应当满足什么，从而能够在推理过程上实现比标量奖励优化更细粒度的信用分配。我们用一个两阶段流程来实例化该框架：先学习生成任务特定的评分标准，再训练一个由评分标准引导的推理模型。我们在一组多样的科学推理基准上进行评估，结果表明，评分标准条件化的自蒸馏能有效地把评分标准层面的准则转化为推理过程上的词元级指导，平均比GRPO高1.0分，比OPSD高0.9分。

Learning User Simulators with Turing Rewards

用图灵奖励学习用户模拟器

Authors: Yingshan Susan Wang, Cedegao E. Zhang, Linlu Qiu, Zexue He, Pengyuan Li, Alex Pentland, Roger P. Levy, Yoon Kim
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.19336
Pdf link: https://arxiv.org/pdf/2606.19336
Abstract Learning to simulate human users in interactive settings could advance the training of agent assistants, evaluation of personalization systems, research in the social sciences, and more. Existing approaches generally do so by training a large language model (LLM) to match a single ground truth response, either by maximizing the log probability or by using a similarity reward. We instead propose Turing-RL: a Turing-Test-based reinforcement learning approach for training user simulator models. Turing-RL uses a discriminative Turing reward with an LLM judge to score how indistinguishable a generated response is from the real user's given the user's history. With this reward, the user simulator LLM learns to produce responses indistinguishable from what the user could have said. Across two different domains (conversational chat and Reddit forum discussion), we find that Turing-RL consistently outperforms baseline methods on both LLM and human evaluation metrics. Our study suggests that optimizing for indistinguishability, rather than response matching, is effective for learning user simulators.
中文摘要 学习在交互环境中模拟人类用户，有望推动智能体助手的训练、个性化系统的评估、社会科学研究等方面的进展。现有方法通常训练大型语言模型（LLM）去匹配单一的真实回复，方式包括最大化对数概率或使用相似度奖励。我们则提出 Turing-RL：一种基于图灵测试的强化学习方法，用于训练用户模拟器模型。Turing-RL 使用带 LLM 评判器的判别式图灵奖励，在给定用户历史的条件下，对生成的回复与真实用户回复的不可区分程度进行打分；借助这一奖励，用户模拟器 LLM 学会生成与用户本可能说出的话难以区分的回复。在两个不同领域（对话聊天和 Reddit 论坛讨论）中，我们发现 Turing-RL 在 LLM 评估和人工评估指标上均始终优于基线方法。我们的研究表明，优化不可区分性而非回复匹配，对于学习用户模拟器是有效的。
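
One way to picture a discriminative Turing reward is as a pairwise test: a judge sees the user's history plus two candidate responses in random order and guesses which one the real user wrote. The sketch below is a minimal version of that idea. The pairwise setup, the binary reward, the function `turing_reward`, and the stand-in `shorter_is_real_judge` (a placeholder where an LLM judge would go) are all assumptions, not details taken from the abstract.

```python
import random

def turing_reward(history, real, generated, judge, rng):
    """Return 1.0 if the judge mistakes the generated response for the real one.

    `judge(history, a, b)` is a hypothetical callable returning 0 or 1,
    the index of the response it believes the real user wrote.
    """
    candidates = [("real", real), ("generated", generated)]
    rng.shuffle(candidates)  # hide which slot holds the real response
    picked = judge(history, candidates[0][1], candidates[1][1])
    return 1.0 if candidates[picked][0] == "generated" else 0.0

def shorter_is_real_judge(history, a, b):
    """Toy stand-in for an LLM judge: guesses the shorter response is real."""
    return 0 if len(a) <= len(b) else 1

rng = random.Random(0)
history = ["hi", "any plans this weekend?"]
fooled = turing_reward(history, "probably just hiking, you?", "hiking!",
                       shorter_is_real_judge, rng)
caught = turing_reward(history, "hiking!",
                       "I intend to go on a hike this weekend.",
                       shorter_is_real_judge, rng)
```

The simulator is rewarded for fooling the judge rather than for matching the single ground-truth response, which is the contrast the abstract draws with log-probability and similarity objectives.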

Native Active Perception as Reasoning for Omni-Modal Understanding

原生主动感知即推理：面向全模态理解

Authors: Zhenghao Xing, Ruiyang Xu, Yuxuan Wang, Jinzheng He, Ziyang Ma, Qize Yang, Yunfei Chu, Jin Xu, Junyang Lin, Chi-Wing Fu, Pheng-Ann Heng
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Sound (cs.SD)
Arxiv link: https://arxiv.org/abs/2606.19341
Pdf link: https://arxiv.org/pdf/2606.19341
Abstract Passive models for long video understanding typically rely on a "watch-it-all" paradigm, processing frames uniformly regardless of query difficulty, causing computational cost to grow with video duration. Although interactive frameworks have emerged, they often rely on global pre-scanning, and their context cost still scales with video length. We propose OmniAgent, the first native omni-modal agent that formulates video understanding as a POMDP-based iterative Observation-Thought-Action cycle. OmniAgent executes on-demand actions to selectively distill audio-visual cues into a persistent textual memory, effectively decoupling reasoning complexity from raw video duration. To operationalize this, we introduce (1) Agentic Supervised Fine-Tuning to bootstrap native active perception via best-of-N trajectory synthesis with dual-stage quality control, and (2) Agentic Reinforcement Learning with TAURA (Turn-aware Adaptive Uncertainty Rescaled Advantage), which leverages turn-level entropy to steer credit assignment toward pivotal discovery turns. Crucially, OmniAgent exhibits positive test-time scaling, where performance improves as the number of reasoning turns increases, validating the efficacy of active perception. Empirical results across ten benchmarks (e.g., VideoMME, LVBench) demonstrate that OmniAgent achieves state-of-the-art performance among open-source models. Notably, on LVBench, our 7B agent outperforms the 10$\times$ larger Qwen2.5-VL-72B (50.5% vs. 47.3%).
中文摘要 用于长视频理解的被动模型通常依赖“全部观看”范式，无论查询难度如何都对帧进行统一处理，导致计算成本随视频时长增长。尽管交互式框架已经出现，但它们通常依赖全局预扫描，其上下文成本仍随视频长度增长。我们提出 OmniAgent，这是首个原生全模态智能体，它将视频理解表述为基于 POMDP 的迭代“观察-思考-行动”循环。OmniAgent 按需执行动作，有选择地将视听线索提炼到持久的文本记忆中，从而有效地将推理复杂度与原始视频时长解耦。为实现这一点，我们引入：（1）智能体式监督微调，通过带双阶段质量控制的 best-of-N 轨迹合成来引导原生主动感知；（2）采用 TAURA（Turn-aware Adaptive Uncertainty Rescaled Advantage，回合感知的自适应不确定性重缩放优势）的智能体式强化学习，利用回合级熵将信用分配引向关键的发现回合。关键的是，OmniAgent 表现出正向的测试时扩展特性，即性能随推理回合数的增加而提升，验证了主动感知的有效性。在十个基准（如 VideoMME、LVBench）上的实证结果表明，OmniAgent 在开源模型中达到了最先进的性能。值得注意的是，在 LVBench 上，我们的 7B 智能体优于规模大 10 倍的 Qwen2.5-VL-72B（50.5% 对 47.3%）。
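
The abstract says TAURA uses turn-level entropy to steer credit assignment toward pivotal turns but does not give the formula. The sketch below shows one simple way such a rescaling could work: a trajectory-level advantage is redistributed across turns in proportion to each turn's entropy, with weights normalized to average 1. The function `entropy_rescaled_advantages` and this particular weighting are hypothetical and should not be read as the paper's definition.

```python
def entropy_rescaled_advantages(advantage, turn_entropies):
    """Spread one trajectory-level advantage over turns, weighted by entropy.

    Weights are n * h_i / sum(h), so they average to 1 and the mean per-turn
    advantage equals the original trajectory-level advantage.
    This weighting is an illustrative assumption, not TAURA itself.
    """
    n = len(turn_entropies)
    total = sum(turn_entropies)
    if total == 0:
        # No uncertainty signal: fall back to uniform credit.
        return [advantage] * n
    return [advantage * n * h / total for h in turn_entropies]

# Toy example: the second turn is more uncertain, so it gets more credit.
per_turn = entropy_rescaled_advantages(2.0, [0.5, 1.5])
uniform = entropy_rescaled_advantages(2.0, [0.0, 0.0, 0.0])
```

Under this assumed scheme, high-entropy turns (where the agent was least certain which action to take) receive a larger share of the update, while the overall scale of the advantage is preserved.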

Keyword: diffusion policy

No results found.