Arxiv Papers of Today

Generated at: 2026-06-08 20:33:07 (UTC+8); arXiv release time: 2026-06-08 20:00 EDT (2026-06-09 08:00 UTC+8)

28 relevant papers today

Keyword: reinforcement learning

MacArena: Benchmarking Computer Use Agents on an Online macOS Environment

Authors: Victor Muryn, Maksym Shamrai, Sofiia Mazepa, Yehor Khodysko
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2606.06560
Pdf link: https://arxiv.org/pdf/2606.06560
Abstract Computer-use agents (CUAs) operate graphical user interfaces (GUIs) through vision and control primitives, and their capabilities have advanced rapidly, driven in part by standardized online evaluation benchmarks such as OSWorld, which serve both as evaluation tools and as training environments for reinforcement learning. However, macOS remains underserved in this landscape: the only existing benchmark, macOSWorld, covers a narrow slice of first-party applications with simpler tasks, and runs on x86 virtual machines incompatible with Apple Silicon. We introduce MacArena, a benchmark of 421 manually verified tasks spanning 50 applications that combines a curated port of OSWorld tasks, content sourced from macOSWorld, and 49 new macOS-native tasks, all running on Apple's native Virtualization framework on Apple Silicon. We argue that macOS presents distinct GUI challenges beyond what Linux-based benchmarks capture, and our evaluation supports this claim: strong model performance on existing benchmarks can reflect familiarity with task distributions rather than genuine cross-platform GUI competence. Notably, model rankings invert between ported and macOS-native tasks, with a leading model trailing by over 26% on the MacArena subset, suggesting that macOS poses a genuinely harder environment for current GUI agents.

Improving Cross-Lingual Factual Recall via Consistency-Driven Reinforcement Learning

Authors: Jonathan von Rad, Louis Arts, George Burgess, Eleftheria Kolokytha, Harry O'Donnell, Ektor Oikonomidis Doumpas, Eduardo Sanchez, Yao Lu, Pontus Stenetorp
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.06586
Pdf link: https://arxiv.org/pdf/2606.06586
Abstract Large language models (LLMs) trained predominantly on English data encode substantial world knowledge, yet often fail to express it reliably in other languages, a phenomenon known as cross-lingual factual inconsistency. To study and address this, we introduce PolyFact, a large-scale parallel multilingual factual QA dataset containing 100K Wikidata-grounded facts across 12 typologically diverse languages. Using PolyFact, we compare light continual pretraining (CPT), supervised fine-tuning (SFT), and reinforcement learning via Group Relative Policy Optimization (GRPO) for improving cross-lingual factual recall in Qwen-2.5-7B and OLMo-2-1124-7B. We find that GRPO consistently outperforms SFT, improving both cross-lingual consistency and generalization to unseen languages, while CPT on parallel data yields limited additional gains. Mechanistic analyses further show that GRPO reorganizes multilingual routing by reducing language specialization in MLP layers and attention heads, thereby promoting more shared cross-lingual representations. We release our code, models, and dataset.
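As a rough illustration of the group-relative step that gives GRPO its name, the sketch below standardizes rewards within one group of answers sampled for the same question. The function name `group_relative_advantages` and the reward values are hypothetical and not taken from the paper, and the clipped policy update that would consume these advantages is omitted.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples drawn for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation of the group
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards: 1.0 if a sampled answer matches the gold fact, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Answers that beat their own group's average get a positive advantage, so no separate value network is needed; a group whose rewards are all equal contributes zero advantage.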

Uncertainty-Aware LLM-Guided Policy Shaping for Sparse-Reward Reinforcement Learning

Authors: Ujjwal Bhatta, Utsabi Dangol, Sumaly Bajracharya, Rodrigue Rizk, KC Santosh
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.06673
Pdf link: https://arxiv.org/pdf/2606.06673
Abstract Sparse rewards and heterogeneous task sequences remain persistent challenges in Reinforcement Learning (RL), often resulting in slow convergence, weak generalization, and inefficient exploration. We propose Uncertainty-Aware LLM-Guided Policy Shaping (ULPS), a novel framework that integrates a calibrated Large Language Model (LLM) into the RL training loop to provide structured, uncertainty-modulated behavioral guidance. ULPS employs an A*-based oracle to synthesize optimal symbolic trajectories, which are used to fine-tune a BERT-based language model. During training, this model supplies action suggestions whose influence is conditioned on epistemic uncertainty estimated via Monte Carlo (MC) dropout. An entropy-based blending mechanism adaptively balances LLM guidance and the learned policy (via Proximal Policy Optimization, PPO), allowing the agent to prioritize reliable priors while preserving adaptability. We evaluate ULPS on the MiniGridUnlockPickup benchmark and observe consistent improvements in success rate, reward efficiency, and sample complexity over unguided, uncalibrated, and standard RL baselines. ULPS achieves more than 9% improvement in execution accuracy after fine-tuning, requires fewer environment interactions, and yields higher reward AUC. Our results demonstrate that integrating symbolic A* trajectories, pretrained language priors, and uncertainty-aware control offers a principled and effective approach to multi-task reinforcement learning in sparse-reward domains, with potential extensibility to partially observable and multi-agent settings.
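The abstract does not give the blending formula, so the following is only a sketch of one plausible entropy-based rule. It averages several stochastic (dropout) predictions from the guide model, uses the normalized entropy of that average as a simple stand-in for the uncertainty estimate, and down-weights the guide as that entropy rises. All function names, the weighting rule, and the numbers are assumptions.

```python
import math

def mean_distribution(samples):
    """Average action distributions from several stochastic (dropout) passes."""
    n = len(samples)
    return [sum(s[a] for s in samples) / n for a in range(len(samples[0]))]

def normalized_entropy(probs):
    """Shannon entropy divided by log(K), so the result lies in [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))

def blend(policy_probs, mc_samples):
    """Mix the guide's suggestion with the policy; trust the guide less when uncertain."""
    guide = mean_distribution(mc_samples)
    weight = 1.0 - normalized_entropy(guide)  # assumed rule, not the paper's formula
    mixed = [weight * g + (1.0 - weight) * p for g, p in zip(guide, policy_probs)]
    return weight, mixed

# Hypothetical dropout passes over three actions, and a uniform learned policy.
mc_samples = [[0.90, 0.05, 0.05], [0.80, 0.10, 0.10], [0.85, 0.10, 0.05]]
policy_probs = [1 / 3, 1 / 3, 1 / 3]
weight, mixed = blend(policy_probs, mc_samples)
print(weight, mixed)
```

Because the mixture is a convex combination of two distributions, it remains a valid distribution, and a maximally uncertain guide leaves the learned policy unchanged.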

What Do People Actually Want From AI? Mapping Preference Plurality

Authors: Julia Sepúlveda Coelho, Scott A. Hale
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2606.06674
Pdf link: https://arxiv.org/pdf/2606.06674
Abstract Large Language Models (LLMs) are often fine-tuned through Reinforcement Learning from Human Feedback (RLHF) to align with people's preferences and values. However, this method has known limitations: it aggregates conflicting preferences, often relies on unrepresentative samples, and uses only binary comparisons. Analysing 1,500 open-ended responses from the PRISM dataset across 75 countries, we examine what people actually want from AI systems and reveal concrete failures of current methods. We find that different people want different things: most values are requested by fewer than a quarter of respondents, with truthfulness the sole exception at 49%. Furthermore, the same words hide divergent meanings: when people describe what they mean by "truthfulness", they reveal distinct, potentially incompatible, epistemological bases, as some ask for sourced claims, some for expert opinions, and some even ask for unpopular views. Certain capabilities, namely how human-like a model behaves, and some features, like AI guardrails, are outright controversial, with some desiring them and others rejecting them. We additionally find that people often use contextual distinctions (what AI should do "by default" versus "if requested") that binary comparisons cannot capture. These findings expose fundamental problems in current alignment practices. When 49% request truthfulness but define it differently, this is unlikely to be captured by a single reward model. The persistence of high hallucination rates in well-funded models, despite users' clear demands for accuracy, suggests that current methods fail to identify actual preferences. This paper sheds light on the situated, contested, imperfect signals that are currently being flattened into universal preference models, a practice others have characterised as epistemic violence.

Performance Variation in Deep Reinforcement Learning

Authors: Haruto Tanaka, A. Rupam Mahmood
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.06746
Pdf link: https://arxiv.org/pdf/2606.06746
Abstract Deep reinforcement learning (RL) algorithms often suffer from low run-to-run robustness, manifesting as significant performance variation across independent runs of identically configured agents. Although this issue poses a spectrum of challenges across research and practice, relatively few studies develop methods to evaluate it; RL research instead often reports uncertainty in the estimated mean performance. In this paper, we outline the limitations of conventional uncertainty and variation estimates, particularly their misalignment with purpose and the risk of underreporting. We then propose an alternative percentile-based statistic and visualization method, min-max IPR and run-wise percentile highlighting, respectively. These percentile-based tools are easy to interpret and rely on standard properties of sample percentiles, providing rich information about run-to-run performance variation. We demonstrate this through three case studies. First, we show that LayerNorm and penultimate-layer normalizations narrow performance variation in PPO, whereas the variation is mostly unchanged in SAC. Second, we compare PPO, SAC, TD-MPC, and TD-MPC2, and show TD-MPC exhibits the least variation while being the most data efficient among the four. Finally, in a comparison of DQN and Rainbow on five Atari environments, we show that both algorithms exhibit similar levels of performance variation.
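The abstract does not define min-max IPR, so the sketch below shows only a plain interpercentile range over per-run final returns, next to the standard error of the mean that is often reported instead. The data and function names are hypothetical.

```python
import statistics

def percentile(values, q):
    """Linear-interpolation percentile of a sample, with q in [0, 100]."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def interpercentile_range(values, lower=10, upper=90):
    """Width of the band between two sample percentiles."""
    return percentile(values, upper) - percentile(values, lower)

def standard_error(values):
    """Standard error of the mean: uncertainty about the average, not run spread."""
    return statistics.stdev(values) / len(values) ** 0.5

# Hypothetical final returns from ten runs of one identically configured agent.
returns = [210.0, 195.0, 40.0, 205.0, 220.0, 35.0, 215.0, 200.0, 190.0, 225.0]
print("10-90 interpercentile range:", interpercentile_range(returns))
print("standard error of the mean:", standard_error(returns))
```

The standard error shrinks as more runs are added even when individual runs stay just as erratic, whereas a percentile range keeps describing the spread a single new run can be expected to fall in.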

Learning All-Terrain Locomotion for a Planetary Rover with Actively Articulated Suspension

Authors: Arthur Bouton, Tristan D. Hasseler, Michael Paton, Travis Brown, Jacob Levy, William Reid, Joshua Martin, Hari Nayar
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2606.06790
Pdf link: https://arxiv.org/pdf/2606.06790
Abstract This paper presents ERNEST, a four-wheeled planetary rover concept equipped with a two-degree-of-freedom Active Gimbal Suspension that combines yaw and roll actuation to enable wheel reconfiguration, steering, and active load redistribution. A single neural network controller, trained to track a desired path across challenging terrain, fully unlocks the capabilities of this actuated suspension system for autonomous obstacle negotiation. A reinforcement learning framework is developed using the high-fidelity DARTS simulation engine, which combines rigid-contact dynamics and Bekker-Wong terramechanics, enabling the emergence of locomotion strategies adapted to loose-soil conditions. To obtain a single unified controller across heterogeneous terrains, a policy consolidation strategy merges the experience of terrain-specialized agents into one neural network, eliminating the need for explicit terrain classification and controller switching. The resulting controller operates on a combination of proprioceptive and exteroceptive feedback, including sparse stereo-derived terrain elevation, chassis attitude, joint states, and force-torque measurements. Zero-shot transfer to the physical rover is achieved through domain randomization, sensor noise injection, and model-to-real system identification. Experimental results demonstrate autonomous traversal of rock fields, a bump trap, a wheel-high step, sand ripples, and sandy slopes. On a 20° sandy slope, the learned controller reduces the cost of transport by 37% on dry sand despite the additional actuation, and achieves superior performance on wet sand where the passive suspension becomes completely immobilized.

Exploring Reinforcement Learning for Fluid Transitions Between Clinical Mental Healthcare and Everyday Wellness Support

Authors: Tony Wang, Qian Yang
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.06800
Pdf link: https://arxiv.org/pdf/2606.06800
Abstract Mental health struggles wax and wane, yet clinical and wellness interventions typically operate separately, causing frequent breakdowns at care transitions. We explore reinforcement learning (RL) as a means to build digital health systems that deliver clinical and wellness interventions proactively, as part of a coherent care journey. We ask: what complexities does designing such a system involve? We built a contextual bandit that dynamically selects journaling prompts from clinical and wellness repertoires to optimize for an overarching health goal (sustained journaling) and deployed it in a four-week exploratory study (N=38). We found that, first, many benefits of RL-optimized intervention sequences appeared only after interventions ended, raising the question: Should systems that offer coherent clinical-wellness care journeys include stepping-back periods? If so, when and how? Second, participants most engaged with RL-generated interventions deepened their engagement over time, while those most engaged with a constant intervention tended to burn out and drop out later. This raises the question: When should a system blending clinical and wellness interventions reduce intensity to prevent burnout, and when should it sustain intensity to maximize treatment gains?
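The abstract does not say which bandit algorithm the study used. As a generic illustration of a contextual bandit choosing among prompts, here is a minimal tabular epsilon-greedy sketch; the class name, prompt identifiers, context label, and reward definition are all hypothetical.

```python
import random
from collections import defaultdict

class EpsilonGreedyContextualBandit:
    """Tabular contextual bandit: one running mean reward per (context, arm)."""

    def __init__(self, arms, epsilon=0.1, seed=0):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)
        self.means = defaultdict(float)

    def select(self, context):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)  # explore a random prompt
        # Exploit: pick the prompt with the best estimate for this context.
        return max(self.arms, key=lambda arm: self.means[(context, arm)])

    def update(self, context, arm, reward):
        key = (context, arm)
        self.counts[key] += 1
        # Incremental update of the running mean reward.
        self.means[key] += (reward - self.means[key]) / self.counts[key]

arms = ["clinical:thought_record", "wellness:gratitude_list"]  # hypothetical prompt ids
bandit = EpsilonGreedyContextualBandit(arms, epsilon=0.1)
context = "low_mood"  # hypothetical context bucket
arm = bandit.select(context)
bandit.update(context, arm, reward=1.0)  # e.g. 1.0 if the participant journaled that day
```

A reward tied to whether the participant journals matches the stated goal of sustained journaling, but a per-step bandit like this one cannot see benefits that appear only after interventions end, which is the first complication the study reports.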

VideoSEG-O3: A Multi-turn Reinforcement Learning Framework for Reasoning Video Object Segmentation

Authors: Ming Dai, Sen Yang, Boqiang Duan, Boyuan Tong, Jiedong Zhuang, Wankou Yang, Jingdong Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.06819
Pdf link: https://arxiv.org/pdf/2606.06819
Abstract Reasoning Video Object Segmentation (RVOS) demands a sophisticated integration of temporal dynamics, spatial details, and linguistic reasoning to achieve precise pixel-level localization. Existing methods are limited to reasoning over fixed initial inputs and lack the capacity to actively acquire further visual evidence, which is often essential for resolving complex references in long or intricate videos. To address this, we propose VideoSEG-O3, the first multi-turn reinforcement learning framework for RVOS that emulates the human "coarse-to-fine" cognitive process. It employs a multi-turn temporal-spatial chain-of-thought to capture fine-grained details by iteratively pinpointing critical intervals and keyframes. Additionally, to enable the policy to perceive segmentation quality beyond mere text probability of [SEG] during the RL stage, we introduce SEG-aware logit calibration, which integrates pixel-wise segmentation feedback directly into the token-level logits. Furthermore, we design a decoupled thinking trace to hierarchically decompose the reasoning process into temporal, spatial, and linguistic dimensions, and construct VTS-CoT, a specialized cold-start dataset featuring comprehensive reasoning trajectories. The code and models will be released.

SCALE: Scalable Cross-Attention Learning with Extrapolation for Agentic Workflow Scheduling

Authors: Zhifei Xu, Jierui Lan, Zixuan Liang, Aiji Liang, Jinxi He
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.06820
Pdf link: https://arxiv.org/pdf/2606.06820
Abstract Agentic Large Language Model (LLM) systems decompose complex tasks into workflow Directed Acyclic Graphs (DAGs) whose primitives must be scheduled on heterogeneous clusters. Existing deep reinforcement learning (DRL) schedulers are tied to a fixed cluster size and require retraining whenever the number of servers changes. We propose SCALE (Scalable Cross-Attention Learning with Extrapolation), a DRL scheduler that generalizes to unseen cluster scales without fine-tuning. SCALE employs a cross-attention pointer network where task features query against server features, so the architecture accepts any number of servers by construction. We observe, however, that a permutation-invariant architecture alone does not guarantee good performance at new scales: the attention feature undergoes distribution shift as the server count grows. To counter this, we introduce Structured Representation Regularization (SRR): a decorrelation loss combined with a KL penalty toward the standard normal, which keeps feature statistics stable regardless of input size. Trained on 16 nodes and tested directly on 32 and 48 nodes, SCALE reduces average response time by 8.9% at N=48 relative to the same architecture without SRR, confirming that explicit regularization is necessary to close the scale-generalization gap.
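The abstract describes SRR only as a decorrelation loss plus a KL penalty toward the standard normal. The sketch below shows one common way to write those two terms for a batch of feature vectors; the exact form used in the paper, the function names, and the weights `w_decor` and `w_kl` are assumptions.

```python
import math

def column_stats(features):
    """Per-dimension mean and (biased) variance over a batch of feature vectors."""
    n, d = len(features), len(features[0])
    means = [sum(row[j] for row in features) / n for j in range(d)]
    variances = [sum((row[j] - means[j]) ** 2 for row in features) / n for j in range(d)]
    return means, variances

def decorrelation_loss(features, eps=1e-8):
    """Sum of squared off-diagonal entries of the feature correlation matrix."""
    n, d = len(features), len(features[0])
    means, variances = column_stats(features)
    loss = 0.0
    for i in range(d):
        for j in range(d):
            if i != j:
                cov = sum((r[i] - means[i]) * (r[j] - means[j]) for r in features) / n
                loss += (cov / math.sqrt((variances[i] + eps) * (variances[j] + eps))) ** 2
    return loss

def kl_to_standard_normal(features, eps=1e-8):
    """KL(N(mean, var) || N(0, 1)) per dimension, summed over dimensions."""
    means, variances = column_stats(features)
    return sum(0.5 * (m * m + v - math.log(v + eps) - 1.0) for m, v in zip(means, variances))

def srr_penalty(features, w_decor=1.0, w_kl=1.0):
    """Weighted sum of the two regularizers (weights are hypothetical)."""
    return w_decor * decorrelation_loss(features) + w_kl * kl_to_standard_normal(features)

correlated = [[1.0, 1.0], [-1.0, -1.0]]
uncorrelated = [[1.0, -1.0], [-1.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
print(srr_penalty(correlated), srr_penalty(uncorrelated))
```

Both terms depend only on batch statistics of the features, not on how many servers produced them, which is consistent with the stated goal of keeping feature statistics stable as the input size changes.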

Progress-SQL: Improving Reinforcement Learning for Text-to-SQL via Progressive Rewards

Authors: Shihao Zhang, Xiaoman Wang, Yuan Liu, Yunshi Lan, Weining Qian
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.06825
Pdf link: https://arxiv.org/pdf/2606.06825
Abstract Reinforcement learning has recently shown promise in improving large language models for Text-to-SQL generation, yet existing methods typically optimize one-shot rewards defined over a single SQL state. Such rewards provide limited guidance for iterative SQL correction and are insufficient to capture the improvement of multi-turn SQL refinement. In this paper, we propose Progress-SQL, a multi-turn reinforcement learning framework with progressive rewards for Text-to-SQL. Our approach introduces an Oracle-guided Diagnostic Tree (ODT), which abstracts SQL queries into clause-level structural profiles and produces diagnostic feedback for next-turn refinement. To provide dense and robust reward signals, we combine ODT-based structural alignment with lexical alignment and define a progressive reward that measures the improvement from the initial SQL to the final SQL. We further incorporate a progression latency reward that favors earlier correctness and an execution status reward that encourages recovery from invalid SQL. Experiments on BIRD, Spider, and Spider robustness variants demonstrate that our method consistently improves Text-to-SQL performance across both primary and robustness evaluations.
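The three reward terms named in the abstract can be sketched as follows. The formulas, weights, and per-turn inputs are assumptions made for illustration, and the alignment scores are taken as given rather than computed from an ODT.

```python
def progressive_reward(scores):
    """Improvement in alignment score from the first SQL draft to the last."""
    return scores[-1] - scores[0]

def latency_reward(correct_flags):
    """Higher when the first correct SQL appears in an earlier turn; 0 if never correct."""
    for turn, ok in enumerate(correct_flags):
        if ok:
            return 1.0 - turn / len(correct_flags)
    return 0.0

def execution_status_reward(executable_flags):
    """Bonus for ending on executable SQL after at least one invalid attempt."""
    recovered = executable_flags[-1] and not all(executable_flags)
    return 1.0 if recovered else 0.0

def total_reward(scores, correct_flags, executable_flags, weights=(1.0, 0.5, 0.5)):
    """Weighted sum of the three terms (weights are hypothetical)."""
    return (weights[0] * progressive_reward(scores)
            + weights[1] * latency_reward(correct_flags)
            + weights[2] * execution_status_reward(executable_flags))

# Hypothetical three-turn refinement: invalid draft, partial fix, correct query.
scores = [0.4, 0.7, 1.0]
correct = [False, False, True]
executable = [False, True, True]
print(total_reward(scores, correct, executable))
```

Unlike a one-shot reward on the final query alone, the first term is signed, so a refinement that makes the query worse than the initial draft is penalized.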

AdaGRPO: A Capability-Aware Adaptive Enhancement for Flow-based GRPO

Authors: Jiazi Bu, Pengyang Ling, Yujie Zhou, Yibin Wang, Yuhang Zang, Tianyi Wei, Xiaohang Zhan, Jiaqi Wang, Tong Wu, Xingang Pan, Dahua Lin
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.06828
Pdf link: https://arxiv.org/pdf/2606.06828
Abstract Group Relative Policy Optimization (GRPO) has demonstrated remarkable success in aligning text-to-image (T2I) flow models with human preferences. However, we have identified that the learning loop of current flow-based GRPO is fundamentally decoupled from the learner's current capability, suffering from critical blind spots at both prompt selection and advantage estimation: (i) Existing methods sample prompts randomly, overlooking the substantial impact of data selection on reinforcement learning (RL) efficacy, a factor proven crucial in GRPO for large language models; (ii) They evaluate sample quality relying solely on intra-group statistics, lacking a global perspective to accurately measure true policy improvement. To address these issues, we propose Adaptive GRPO (AdaGRPO), a novel capability-aware RL algorithm tailored for flow models. Specifically, AdaGRPO consists of two principal components: (i) Online Curriculum Filtering Strategy: Dynamically tracks the model's proficiency and adaptively selects prompts that best match its current learning boundary; (ii) Cross-Level Advantage Fusion: Synergistically integrates fine-grained intra-group advantages with macro-level global advantages, providing a comprehensive and unbiased policy evaluation. As a lightweight, plug-and-play module, AdaGRPO can be seamlessly integrated with existing frameworks such as Flow-GRPO, DanceGRPO, and Flow-CPS. Extensive experiments demonstrate that AdaGRPO consistently drives performance gains while significantly stabilizing GRPO training for flow models.
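A minimal sketch of the two components, under stated assumptions: rewards are taken to lie in [0, 1], the global baseline is a running mean and standard deviation supplied by the caller, and the mixing weight `alpha` and the filter thresholds are hypothetical. The paper's actual fusion and filtering rules are not given in the abstract.

```python
import statistics

def standardize(values, mean, std, eps=1e-8):
    """Shift and scale values by a given baseline."""
    return [(v - mean) / (std + eps) for v in values]

def fused_advantages(rewards, global_mean, global_std, alpha=0.5):
    """Mix the within-group advantage with one measured against a global baseline."""
    local = standardize(rewards, statistics.fmean(rewards), statistics.pstdev(rewards))
    overall = standardize(rewards, global_mean, global_std)
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(local, overall)]

def keep_prompt(rewards, low=0.2, high=0.8):
    """Curriculum filter: keep prompts that are neither too hard nor too easy right now."""
    return low <= statistics.fmean(rewards) <= high

# Hypothetical group in which every sample scores well above the global baseline.
rewards = [0.9, 0.9]
print(fused_advantages(rewards, global_mean=0.5, global_std=0.2))
print(keep_prompt(rewards))
```

With intra-group statistics alone, this uniformly good group would receive zero advantage; the global term is what lets it register as an improvement over the baseline.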

T-GMP: Terrain-conditioned Generative Motion Priors for Versatile and Natural Humanoid Locomotion

Authors: Junhong Guo, Hao Hu, Chen Chen, Haoxuan Han, Linao Gong, Xin Yang, Zhicheng He, Yao Su, Fenghua He
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.06944
Pdf link: https://arxiv.org/pdf/2606.06944
Abstract Achieving both anthropomorphic naturalness and robust terrain traversal remains a fundamental challenge in humanoid locomotion. Existing Reinforcement Learning (RL) approaches typically rely on fixed motion priors, limiting their adaptability to varying environments. We propose Terrain-conditioned Generative Motion Priors (T-GMP), a module that captures a terrain-conditioned latent motion manifold from a few expert state-terrain demonstrations using a Conditional Variational Autoencoder (CVAE). The learned priors enable smooth style transitions, facilitating a unified policy that adapts to terrain variations. We integrate T-GMP into an adversarial learning pipeline with our proposed Foothold Penalty, where a discriminator dynamically modulates naturalness constraints conditioned on local terrain features, guiding the generation of versatile and human-like motions. Experimental results demonstrate that our method outperforms existing baselines in traversal success rate and motion smoothness, while preserving biomimetically natural and physically coordinated motions.

GenPO++: Generative Policy Optimization with Jacobian-free Likelihood Ratios

Authors: Ke Hu, Shutong Ding, Panxin Tao, Jingya Wang, Ye Shi
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.06967
Pdf link: https://arxiv.org/pdf/2606.06967
Abstract Generative policies provide expressive and multimodal action distributions, making them attractive for reinforcement learning (RL) in complex continuous-control tasks. Among them, flow-based policies are especially appealing because they generate actions through deterministic transport maps. However, applying such generative policies to likelihood-based on-policy learning remains limited by the difficulty of evaluating the probability of executed actions. Existing flow RL methods either replace the true action-density ratio with approximate surrogates, which can introduce biased updates, or recover exact likelihoods through dummy-action augmentation, which enlarges the policy space and increases computation. In this work, we propose GenPO++, a reversible generative policy optimization framework that uses history states as auxiliary memory in a high-order reversible ODE solver, yielding exact inversion without changing the original action dimension. The resulting generative policy map has a log-determinant determined only by fixed solver coefficients, enabling exact and Jacobian-free likelihood-ratio computation. This design preserves the expressiveness of generative flow policies while avoiding both action ratio bias and dummy-action overhead. We evaluate GenPO++ on large-scale simulated control, fine-tuning, and real-world robotic manipulation tasks, where it achieves competitive or superior performance over state-of-the-art on-policy RL methods, while improving training stability and computational efficiency.

Exploring Agentic Tool-Calling Decisions via Uncertainty-Aligned Reinforcement Learning

Authors: Yijin Zhou, Linqian Zeng, Xiaoya Lu, Wenyuan Xie, Dongrui Liu, Junchi Yan, Jing Shao
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.06976
Pdf link: https://arxiv.org/pdf/2606.06976
Abstract Large language model (LLM)-based agents often make suboptimal tool-use decisions, including unsupported tool invocation and hallucinated direct responses, which may accumulate errors throughout multi-step interactions. Existing approaches mainly improve these behaviors through inference-time correction or coarse-grained reward signals based on decision outcomes and structured checklists, leaving the uncertainty characteristics of agent decisions underexplored. We observe that decision-oriented reinforcement learning tends to weaken the uncertainty separation between correct and incorrect actions, resulting in overconfident mistakes and weaker exploration signals. Therefore, we propose TRUST, which incorporates uncertainty quantification into reward design as a repulsive force for maintaining uncertainty separation, and labels lightweight key-turn annotations for unified post-training of multi-turn trajectories. Experimental results across diverse tool-use benchmarks show that TRUST consistently enhances both decision quality and agent performance while maintaining more reliable uncertainty estimates during optimization.

Teaching the Way, Not the Answer: Privileged Tutoring Distillation for Multimodal Policy Optimization

Authors: Shizhe Xiang, Ke An, Wenlong Yu, Yue Liu, Jian Luan, Pei Fu, Qilong Wang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.07000
Pdf link: https://arxiv.org/pdf/2606.07000
Abstract Recent post-training methods, particularly Reinforcement Learning with Verifiable Rewards (RLVR), have significantly enhanced the reasoning ability of Large Vision-Language Models (LVLMs). However, the sparse nature of verifiable rewards provides little token-level supervision for failed rollouts, often leading to inefficient exploration in complex multimodal reasoning tasks. Although policy distillation can offer dense guidance, external-teacher-based methods introduce substantial computational overhead, while answer-conditioned tuning methods may expose answer-level information and induce shortcut-like generation behavior. To address these limitations, we propose PTD-PO, a Privileged Tutoring Distillation Policy Optimization framework for RLVR that provides dense guidance without exposing the answer to the student policy. Specifically, PTD-PO constructs structured privileged hints from spatial attention guidance and intermediate textual reasoning steps, and uses them through in-context learning to produce step-wise token-distribution supervision. The student is still optimized under the original answer-free context, and its failed rollouts are aligned with the hint-augmented reference model at the token-distribution level. To further stabilize distillation under the distribution shift between guided and unguided contexts, we introduce a Top-K Jensen-Shannon divergence objective that focuses alignment on informative token probabilities while reducing memory overhead. Experiments on LVLMs ranging from 2B to 8B parameters show that PTD-PO consistently outperforms RLVR and distillation baselines, mitigates entropy collapse, and improves complex multimodal reasoning performance.
中文摘要 最新的训练后方法，尤其是可验证奖励强化学习（RLVR），显著提升了大型视觉语言模型（LVLM）的推理能力。然而，可验证奖励的稀疏性几乎没有为失败的推广提供代币层级监督，常导致复杂的多模态推理任务中探索效率低下。虽然策略提炼可以提供密集的指导，但基于教师的外部方法会带来较大的计算开销，而答案条件调优方法则可能暴露答案层级信息并诱导类似捷径的生成行为。为解决这些局限性，我们提出了PTD-PO，这是一种RLVR特权辅导提炼政策优化框架，提供密集指导，同时不向学生政策暴露答案。具体来说，PTD-PO通过空间注意力引导和中间文本推理步骤构建结构化特权提示，并通过上下文学习实现分步骤的代币分配监督。学生在原始无答案上下文下仍然得到优化，其失败的推广也与提示增强参考模型在代币分布层面对齐。为了进一步稳定在引导与非引导上下文分布变化下的蒸馏，我们引入了Top-K Jensen-Shannon发散目标，将对齐重点放在信息性代币概率上，同时降低内存开销。对2B至8B参数的LVLMs实验表明，PTD-PO持续优于RLVR和蒸馏基线，减轻熵塌缩，并提升复杂多模推理性能。
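
The Top-K Jensen-Shannon objective mentioned in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not say how the K tokens are chosen or whether probabilities are renormalized, so selecting the reference model's top-K tokens and renormalizing over them are assumptions made here, and the function names are hypothetical.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last (vocabulary) axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def topk_js_divergence(student_logits, teacher_logits, k=3, eps=1e-12):
    """Per-position JS divergence restricted to the teacher's top-k tokens.

    Assumption: the support is the hint-augmented reference (teacher) top-k,
    and both distributions are renormalized over that support.
    """
    p = softmax(student_logits)
    q = softmax(teacher_logits)
    # Indices of the k most probable teacher tokens at each position.
    idx = np.argsort(-q, axis=-1)[..., :k]
    p_k = np.take_along_axis(p, idx, axis=-1)
    q_k = np.take_along_axis(q, idx, axis=-1)
    p_k = p_k / (p_k.sum(axis=-1, keepdims=True) + eps)
    q_k = q_k / (q_k.sum(axis=-1, keepdims=True) + eps)
    m = 0.5 * (p_k + q_k)
    kl_pm = (p_k * (np.log(p_k + eps) - np.log(m + eps))).sum(axis=-1)
    kl_qm = (q_k * (np.log(q_k + eps) - np.log(m + eps))).sum(axis=-1)
    return 0.5 * (kl_pm + kl_qm)
```

Only K probabilities per position are kept, which is one plausible reading of how such an objective reduces memory relative to a full-vocabulary divergence.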

StainFlow: Entity-Stain Tracking and Evidence Linking for Process Rewards in GUI Agents

StainFlow：GUI代理中实体污渍追踪与过程奖励的证据链接

Authors: Haojie Hao, Longkun Hao, Yihang Lou, Yan Bai, Zhenyang Li, Zhichao Yang, Dongshuo Huang, Hongyu Lin, Lanqing Hong, Jiakai Wang, Xianglong Liu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.07027
Pdf link: https://arxiv.org/pdf/2606.07027
Abstract Reinforcement Learning (RL) has become a promising approach for improving GUI Agents in long-horizon, stochastic digital environments, but trajectory-level success feedback is too sparse to provide reliable credit assignment for intermediate exploration steps. To mitigate this issue, recent studies introduce Process Reward Models (PRMs), which provide finer-grained training feedback through global milestone verification or local step-level evaluation. However, these methods still suffer from two level-specific limitations: global milestone decomposition is subjective and singular, making it difficult to accommodate the multiple valid execution paths in real GUI tasks, while fixed local judging windows may miss long-range key evidence or dilute the decision signal with irrelevant frames. Inspired by stain-tracing mechanisms in network flow analysis, we propose StainFlow, an entity-stain-flow process reward model for GUI Agents. To reduce the subjectivity of global partitioning, we introduce the Global Entity Stain Tracking module, which extracts visually verifiable task entities and tracks how their stain concentrations and states evolve along the trajectory, allowing task phases to be objectively separated by changes in the entity evidence flow. To improve the accuracy of local verification, we introduce the Local Stain Evidence Linking module. Centered on the triggering entities of each candidate key node, it retrieves relevant steps based on their stain concentrations and state changes, and dynamically constructs high-density evidence windows for verifying true key nodes. Extensive experiments on AndroidWorld and OGRBench show that StainFlow relatively improves online RL success by 3.2% and trajectory completion judgment accuracy by 1.8%.
中文摘要 强化学习（RL）已成为改进长视野、随机数字环境中图形界面代理的有前景方法，但轨迹级成功反馈过于稀少，无法为中间探索步骤提供可靠的信用分配。为缓解这一问题，近期研究引入了过程奖励模型（PRM），通过全局里程碑验证或局部步骤级评估提供更细致的训练反馈。然而，这些方法仍存在两个层级特有的局限：全局里程碑分解主观且单一，难以在真实GUI任务中兼容多条有效执行路径;而固定的局部判断窗口可能遗漏远距离关键证据或用无关帧稀释决策信号。受网络流分析中染色追踪机制的启发，我们提出了StainFlow，这是一种面向图形界面代理的实体染色流过程奖励模型。为减少全局划分的主观性，我们引入了全局实体污渍追踪模块，该模块提取可直观验证的任务实体，并跟踪其染色浓度和状态沿轨迹的变化，使任务阶段能够通过实体证据流的变化客观区分。为了提高局部验证的准确性，我们引入了局部染色证据链接模块。它以每个候选密钥节点的触发实体为中心，基于其染色浓度和状态变化检索相关步骤，并动态构建高密度证据窗口以验证真实密钥节点。在 AndroidWorld 和 OGRBench 上的大量实验显示，StainFlow 相对提升在线强化学习成功率 3.2%，轨迹完成判断准确率提升 1.8%。

SlimSearcher: Training Efficiency-Aware Web Agents via Adaptive Reward Gating

SlimSearcher：通过自适应奖励门控训练高效感知的网络代理

Authors: Zequn Xie, Junjie Wang, Dan Yang, Jie Feng, Yue Shen, Jian Wang, Jinjie Gu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.07074
Pdf link: https://arxiv.org/pdf/2606.07074
Abstract Deep research agents have demonstrated remarkable capabilities in complex information-seeking tasks, yet this power comes at a steep computational cost. Driven by accuracy-focused training paradigms, current models adopt brute-force strategies characterized by blind tool dependency and performative reasoning, generating long, redundant trajectories that are far from necessary for resolving these tasks and that lead to wasteful tool calls and excessive token consumption. To overcome this efficiency trap, we propose SlimSearcher, a principled framework that pushes the Pareto frontier between accuracy and computational cost across both Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). In the SFT stage, SlimSearcher employs Pareto-efficient filtration to distill trajectories that are both successful and economical, guiding the model toward inherently efficiency-aware search behaviors. During RL, we introduce Adaptive Reward Gating, a dynamic reward-shaping mechanism that evaluates relative tool and token efficiency within a sampled cohort. By cascading these adaptive efficiency metrics with a strict correctness gate, our approach effectively avoids the brevity bias associated with absolute penalties and mitigates reward hacking. Extensive experiments on long-horizon benchmarks, including GAIA, BrowseComp, and XBenchDeepSearch, demonstrate that SlimSearcher reduces average tool-call rounds by 17%-58% while maintaining or improving accuracy.
中文摘要 深度研究代理在复杂的信息寻求任务中展现出卓越的能力，但这种能力代价巨大。当前模型采用以准确性为导向的训练范式驱动，采用暴力破解策略，表现为盲目依赖工具和执行性推理，产生冗长且冗余的路径，而这些轨迹远非解决这些任务的必要条件，导致工具调用浪费和代币过度消耗。为克服这一效率陷阱，我们提出了SlimSearcher，这是一个原则性框架，推动了监督微调（SFT）和强化学习（RL）中准确性和计算成本之间的帕累托边界。在SFT阶段，SlimSearcher采用帕累托高效过滤，提取出既成功又经济的轨迹，引导模型走向固有的效率感知搜索行为。在强化学习期间，我们介绍了自适应奖励门控机制，这是一种动态的奖励塑造机制，用于评估抽样队列中工具和代币的相对效率。通过将这些自适应效率指标与严格的正确性门联，我们的方法有效避免了绝对惩罚相关的简短偏差，并减轻了奖励黑客行为。在长期基准测试（包括GAIA、BrowseComp和XBenchDeepSearch）上的广泛实验表明，SlimSearcher在保持或提升准确性的同时，平均工具调用回合数减少了17%-58%。
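
A correctness gate cascaded with cohort-relative efficiency, as the abstract describes it, might look roughly like the sketch below. The reward values, the `bonus` size, the comparison against the mean of correct rollouts, and the field names are all illustrative assumptions rather than the paper's actual formulation.

```python
def gated_rewards(cohort, bonus=0.2):
    """Score a cohort of rollouts sampled for the same task.

    cohort: list of dicts with keys 'correct' (bool), 'tool_calls' (int),
    and 'tokens' (int). All names and constants here are hypothetical.
    """
    correct = [r for r in cohort if r["correct"]]
    if correct:
        mean_tools = sum(r["tool_calls"] for r in correct) / len(correct)
        mean_tokens = sum(r["tokens"] for r in correct) / len(correct)
    rewards = []
    for r in cohort:
        if not r["correct"]:
            # Correctness gate: a wrong answer earns no efficiency credit,
            # so being short can never outscore being right.
            rewards.append(0.0)
            continue
        reward = 1.0
        # Relative (not absolute) efficiency: compare to the cohort itself.
        if r["tool_calls"] < mean_tools:
            reward += bonus
        if r["tokens"] < mean_tokens:
            reward += bonus
        rewards.append(reward)
    return rewards
```

Because the efficiency terms are measured against other correct rollouts on the same task, a hard task that needs many tool calls is not penalized the way it would be under a fixed per-call cost.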

On the Geometry of On-Policy Distillation

论政策内提炼的几何结构

Authors: Zhennan Shen, Yanshu Li, Qingyu Yin, Chak Tou Leong, Zhilin Wang, Yanxu Chen, Rongduo Han, Sunbowen Lee, Yi R. Fung
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.07082
Pdf link: https://arxiv.org/pdf/2606.07082
Abstract On-policy distillation (OPD) is increasingly used to improve large language model reasoning, but its training dynamics remain poorly understood. We characterize the trajectory of OPD updates in parameter space and compare it with supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR). A suite of parameter-space diagnostics consistently places OPD in a relaxed off-principal regime: compared with SFT, its updates affect fewer weights and avoid principal directions more strongly, while compared with RLVR, they remain less tightly constrained. Beyond this static localization, OPD exhibits subspace locking: its cumulative updates rapidly enter a narrow low-dimensional channel. Constraining training to the update subspace formed early in training preserves OPD performance but substantially degrades SFT, indicating that the locked subspace is functionally sufficient for OPD. Control experiments further show that sparsifying the update tokens and shifting rollout generation off-policy preserve the rank dynamics, whereas mixing the OPD objective with RLVR changes them. Overall, these results suggest that OPD is not merely an intermediate point between SFT and RLVR, but induces its own update geometry in parameter space.
中文摘要 策略上提纯（OPD）越来越多地被用于提升大型语言模型推理，但其训练动态仍不充分。我们描述了参数空间中OPD更新的轨迹，并将其与监督式微调（SFT）和带可验证奖励的强化学习（RLVR）进行比较。一组参数空间诊断始终将 OPD 置于一个宽松的离主状态：与 SFT 相比，其更新影响的权重较少，且更强烈地避免主方向，而与 RLVR 相比，OPD 的约束依然较宽松。除了这种静态定位，OPD还表现出亚空间锁定：其累积更新迅速进入一个狭窄的低维通道。将训练限制在训练早期形成的更新子空间可以保留OPD性能，但会显著降低SFT，表明锁定的亚空间在功能上足以满足OPD。对照实验进一步表明，稀疏化更新代币和将推出生成策略移开可保持排名动态，而将OPD目标与RLVR混合则改变排名动态。总体而言，这些结果表明 OPD 不仅仅是 SFT 和 RLVR 之间的中间点，而是在参数空间中引入自身的更新几何。
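
The subspace-constrained training experiment in the abstract can be pictured with a small NumPy sketch: take the dominant directions of an early cumulative weight update and project later updates onto them. How the paper actually builds the subspace is not stated in the abstract, so the use of left singular vectors, the `rank` parameter, and the function names are assumptions.

```python
import numpy as np

def early_subspace(delta_w_early, rank):
    # Orthonormal basis for the top-`rank` left singular directions of the
    # cumulative update observed early in training (assumed construction).
    u, _, _ = np.linalg.svd(delta_w_early, full_matrices=False)
    return u[:, :rank]

def project_update(update, basis):
    # Orthogonal projection of a later update onto the locked subspace:
    # P = B B^T, applied without forming the full projector.
    return basis @ (basis.T @ update)
```

If performance survives when every update is passed through `project_update`, the locked subspace is functionally sufficient in the sense the abstract uses.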

Predictive Style Matching: Natural and Robust Humanoid Locomotion

预测风格匹配：自然且稳健的人形运动

Authors: Simeon Nedelchev, Ekaterina Chaikovskaia, Egor Davydenko, Eduard Zaliaev, Roman Gorbachev
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.07083
Pdf link: https://arxiv.org/pdf/2606.07083
Abstract Reinforcement learning has become the prevailing approach to humanoid locomotion control: policies transfer reliably from simulation to hardware and recover gracefully from disturbances. Motion quality, however, still lags behind: task-only rewards often converge to stiff, asymmetric gaits, while motion imitation methods improve appearance but become more sensitive to external disturbances because reference signals can oppose the transient poses needed to regain balance. We propose Predictive Style Matching, in which an offline predictor maps the robot's lower-body state history and velocity commands to interpretable upper-body joint and gait targets that shape the rewards during training. Because the targets are state-conditioned rather than time-indexed and the predictor is used only at training time, the deployed controller inherits the proprioceptive interface and inference cost of a task-only RL baseline. On the Unitree G1, in both simulation and hardware, PSM reduces upper-body style error by roughly an order of magnitude over task-only RL while preserving its fall-recovery rate, whereas the motion-imitation baseline attains the lowest style error but fails to recover from disturbances about five times as often.
中文摘要 强化学习已成为类人机动控制的主流方法：策略能够可靠地从模拟转移到硬件，并能优雅地从干扰中恢复。然而，运动质量仍然落后：仅任务奖励常趋于僵硬、不对称的步态，而运动模拟方法虽然外观改善，但对外部干扰更敏感，因为参考信号会阻挡恢复平衡所需的瞬态姿势。我们提出了预测风格匹配，即离线预测器将机器人下半身状态历史和速度指令映射到可解读的上半身关节和步态目标，从而塑造训练中的奖励。由于目标是状态条件而非时间索引，且预测器仅在训练时使用，部署的控制器继承了仅任务强化学习基线的本体感受接口和推理成本。在Unitree G1上，无论是仿真还是硬件，PSM都能将上半身风格误差比仅任务型的强化学习降低大约一个数量级，同时保持其坠落恢复率，而运动模拟基线的样式误差最低，但无法从干扰中恢复的频率大约是它的五倍。

From Correctness to Utility: Gain-Based Prefix Evaluation for LLM Reasoning

从正确性到效用：基于增益的前缀评估用于LLM推理

Authors: Yuhang Zhou, Yixin Cao, Guangnan Ye
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.07190
Pdf link: https://arxiv.org/pdf/2606.07190
Abstract Reasoning prefixes shape the future trajectory of LLM problem solving, yet existing process reward models usually evaluate them through local step correctness. We argue that correctness is a useful but indirect proxy for the effect we ultimately care about: whether a prefix increases the probability of successful completion. We define this effect as prefix gain, the solve-rate improvement induced by conditioning a group of lightweight student models on a prefix, and use it to train a Prefix Utility Model (PUM) with a simple pairwise ranking objective. PUM learns outcome-grounded prefix utility and can score both complete trajectories and partial reasoning prefixes. Across Best-of-$N$ selection, beam search, and reinforcement learning on mathematical reasoning, PUM provides a strong prefix-level supervision signal, especially when candidate pools are large, search budgets increase, or rule-based rewards are sparse. We release all data, models, and code at this https URL.
中文摘要 推理前缀塑造了LLM问题解决的未来轨迹，但现有的过程奖励模型通常通过局部步骤正确性来评估它们。我们认为正确性是一个有用但间接的代理指标，反映了我们最终关心的效果：前缀是否提高了成功完成的概率。我们将此效应定义为前缀增益，即通过对轻量级学生模型组对前缀进行条件所引起的求解率提升，并用它训练一个带有简单两两排序目标的前缀效用模型（PUM）。PUM学习基于结果的前缀效用，并能对完整轨迹和部分推理前缀进行评分。在$N美元最佳选择、波束搜索和基于数学推理的强化学习中，PUM提供了强有力的前缀级监督信号，尤其是在候选人池庞大、搜索预算增加或基于规则的奖励稀少时。我们会在这个 https URL 上发布所有数据、模型和代码。
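
Prefix gain, as defined in the abstract, is a difference of solve rates, and the training signal is a pairwise ranking over prefixes. The sketch below assumes binary rollout outcomes and a logistic (Bradley-Terry style) ranking loss; the abstract only says "simple pairwise ranking objective", so the specific loss and the function names are assumptions.

```python
import math

def prefix_gain(solved_with_prefix, solved_without_prefix):
    """Solve-rate improvement from conditioning student rollouts on a prefix.

    Each argument is a list of 0/1 outcomes from student-model rollouts.
    """
    rate_with = sum(solved_with_prefix) / len(solved_with_prefix)
    rate_without = sum(solved_without_prefix) / len(solved_without_prefix)
    return rate_with - rate_without

def pairwise_ranking_loss(score_hi, score_lo):
    # -log(sigmoid(score_hi - score_lo)): small when the prefix with the
    # higher measured gain also receives the higher model score.
    return math.log1p(math.exp(-(score_hi - score_lo)))
```

A negative gain is possible under this definition: a prefix that is locally correct but steers the students toward a dead end would score below the empty prefix.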

Shield-Loco: Shielding Locomotion Policies with Predictive Safety Filtering

盾-机车：带预测安全过滤的屏蔽运动政策

Authors: Aditya Shirwatkar, Sebastian Sanokowski, Shishir Kolathaya, Aaron Johnson, Majid Khadiv
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.07193
Pdf link: https://arxiv.org/pdf/2606.07193
Abstract Reinforcement learning (RL) policies enable dynamic legged locomotion but lack mechanisms to avoid violations of safety constraints that are absent during training. Large-scale offline safe learning is impractical for covering all edge cases. Existing safety frameworks either rely on reduced-order models that cannot reason about whole-body behaviors or require conservative recovery controllers that degrade task performance. We propose a predictive safety filter that post-hoc filters the nominal contact locations fed to the RL policy. When a collision is predicted, a sampling-based optimizer asynchronously searches for safer contact sequences using a full-physics model, while a learned value function bootstraps long-horizon returns. Our three algorithmic components (geometric projection of sampled contacts, momentum-augmented updates, and replica-exchange) make the optimization tractable in a discontinuous contact landscape. We validate the filter on a quadruped robot in dense, cluttered environments, both in simulation and in the real world, showing substantial reductions in safety violations with minimal deviation from the nominal input.
中文摘要 强化学习（RL）策略支持动态腿部运动，但缺乏避免训练中不存在的安全约束违反机制。大规模离线安全学习无法涵盖所有边缘情况。现有的安全框架要么依赖无法推理全身行为的降序模型，要么需要保守的恢复控制器，从而降低任务性能。我们提出一种预测安全过滤器，事后过滤输入到强化学习策略的标称接触位置。当预测碰撞时，基于采样的优化器异步地利用全物理模型寻找更安全的接触序列，而学习的价值函数引导则是长视界返回。我们的三个算法组件（采样接触的几何投影、动量增强更新和复制交换）使优化在不连续接触环境中易于操作。我们在密集、杂乱的环境中验证了滤波器，在模拟和现实世界中，显示出安全违规的显著减少，且偏差最小。

Learning Multi-Agent Communication Protocol: Study on Information Entropy Efficiency in MARL

学习多智能体通信协议：MARL中信息熵效率的研究

Authors: Xinren Zhang, Zixin Zhong, Jiadong Yu
Subjects: Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2606.07200
Pdf link: https://arxiv.org/pdf/2606.07200
Abstract Multi-Agent Systems (MAS) have emerged as a fundamental paradigm for distributed problem-solving, where autonomous agents collaborate to achieve complex objectives. Within this framework, Multi-Agent Reinforcement Learning (MARL) with communication has demonstrated remarkable success in cooperative tasks. However, existing approaches predominantly pursue performance gains through increasingly complex architectures and expanding communication overhead, lacking principled metrics to evaluate the efficiency of information exchange. In this paper, we focus on enabling agents to learn efficient multi-agent communication protocols that balance performance and information compactness. We propose the Information Entropy Efficiency Index (IEI), a novel metric that quantifies the ratio between message entropy and task performance in learned communication protocols. A lower IEI indicates more compact and efficient message representations. By incorporating IEI into training loss functions, we encourage agents to develop communication protocols that achieve high performance with improved communication efficiency. Extensive experiments across diverse MARL algorithms demonstrate that our approach achieves equivalent or superior task performance compared to baseline methods while improving communication efficiency. These findings challenge the prevailing assumption that performance improvements require complex architectures or increased communication overhead and highlight the potential of improving both task success and communication efficiency to enable scalable MAS.
中文摘要 多智能体系统（MAS）已成为分布式问题解决的基本范式，自主智能体协作以实现复杂目标。在此框架下，多智能体强化学习（MARL）结合通信在协作任务中取得了显著成功。然而，现有方法主要通过日益复杂的架构和不断扩大的通信开销来追求性能提升，缺乏用于评估信息交换效率的原则性指标。本文重点介绍让代理学习高效多代理通信协议，平衡性能与信息紧凑性。我们提出了信息熵效率指数（IEI），这是一个新颖的指标，用于量化学习通信协议中消息熵与任务表现的比值。IEI越低，表示消息表达更紧凑、更高效。通过将IEI纳入训练损耗函数，我们鼓励代理开发能够实现高性能且通信效率更高的通信协议。在多种MARL算法上的广泛实验表明，我们的方法在与基线方法相比实现了同等甚至更优的任务表现，同时提升了通信效率。这些发现挑战了普遍假设，即性能提升需要复杂架构或增加通信开销，并凸显了提升任务成功率和通信效率以实现可扩展MAS的潜力。
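
The abstract describes IEI as the ratio between message entropy and task performance. A minimal sketch of that ratio follows, assuming discrete messages, Shannon entropy in bits estimated from empirical frequencies, and a positive scalar performance score; the paper's exact estimator and normalization are not given in the abstract.

```python
import math
from collections import Counter

def message_entropy(messages):
    # Empirical Shannon entropy (bits) of a sequence of discrete messages.
    counts = Counter(messages)
    n = len(messages)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def iei(messages, task_performance):
    # Entropy spent per unit of task performance; lower means more compact
    # communication for the same result. Assumes performance > 0.
    if task_performance <= 0:
        raise ValueError("task_performance must be positive")
    return message_entropy(messages) / task_performance
```

For example, two protocols with the same success rate of 0.5 would differ in IEI only through how many bits their messages carry.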

KIT's Submission to Cross-Lingual Voice Cloning in IWSLT 2026

KIT 在 IWSLT 2026 中提交跨语言语音克隆

Authors: Seymanur Akti, Alexander Waibel
Subjects: Computation and Language (cs.CL); Sound (cs.SD)
Arxiv link: https://arxiv.org/abs/2606.07240
Pdf link: https://arxiv.org/pdf/2606.07240
Abstract Cross-lingual voice cloning aims to generate speech in a target language while preserving speaker identity from a source-language reference. This task is central to speech translation and is the focus of the IWSLT 2026 Cross-Lingual Voice Cloning track. A key challenge is maintaining intelligibility and naturalness in the presence of accent variation and domain-specific vocabulary. We build on a multilingual text-to-speech model, FishAudio-S2-Pro, and introduce language tag prompting to improve language control and reduce accent leakage. We further apply reinforcement learning (RL) fine-tuning for task adaptation and observe improvements in intelligibility. Finally, we propose a reference-conditioned lexical matching method that improves pronunciation of domain-specific terms when lexical overlap is present. Results show that language prompting provides the largest gains, while lexical matching yields consistent improvements on matched subsets.
中文摘要 跨语言语音克隆旨在生成目标语言的语音，同时保持说话者身份免受源语言引用的影响。这项任务是语音翻译的核心，也是IWSLT 2026跨语言语音克隆课程的重点。一个关键挑战是在口音变化和领域特定词汇的存在下保持清晰和自然。我们基于多语言文本转语音模型 FishAudio-S2-Pro，并引入语言标签提示，以改善语言控制并减少口音泄漏。我们进一步应用强化学习（RL）微调以适应任务，观察到可理解性的提升。最后，我们提出了一种参照条件词汇匹配方法，在词汇重叠存在时改善领域特定词的发音。结果显示，语言提示带来最大的提升，而词汇匹配则在匹配子集上带来持续的改进。

Self-evolving LLM agents with in-distribution Optimization

具有分布式优化的自我演化LLM代理

Authors: Yudi Zhang, Meng Fang, Zhenfang Chen, Mykola Pechenizkiy
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.07367
Pdf link: https://arxiv.org/pdf/2606.07367
Abstract Large Language Models (LLMs) have recently emerged as powerful controllers for interactive agents in complex environments, yet training them to perform reliable long-horizon decision making remains a fundamental challenge. A key difficulty lies in credit assignment: agents often receive delayed rewards only at the end of episodes. In this paper, we propose Q-Evolve, a self-evolving framework for LLM agents that unifies automatic process-reward labeling and policy learning within a principled in-distribution reinforcement learning paradigm. In each evolving iteration, our method learns an in-distribution critic from a hybrid off-policy dataset that combines expert demonstrations with agent-generated trajectories, stabilizing Bellman backups in sparse-reward settings via a weighted Implicit Q-Learning objective. The learned value function is then used to derive step-wise process rewards through advantage estimation, enabling dense and reliable supervision without environment backtracking or human annotation. Leveraging these signals, we perform behavior-proximal policy optimization that evolves the agent over the data used for process reward labeling, allowing iterative self-improvement without exacerbating distribution shift. We evaluate our method on AlfWorld, WebShop, and ScienceWorld, showing Q-Evolve outperforms strong baselines in sample efficiency, robustness, and overall task performance. Our results demonstrate that stable agent self-evolution is achievable through the co-evolution of process-level supervision and policy, both grounded within a shared in-distribution learning loop.
中文摘要 大型语言模型（LLMs）最近成为复杂环境中交互代理的强大控制器，但训练它们执行可靠长期决策仍是一个根本挑战。一个关键难点在于署名分配：经纪人通常只有在剧集结束时才会收到延迟奖励。本文提出了Q-Evolve，一种面向LLM代理的自我进化框架，将自动过程-奖励标记和策略学习统一在原则性的分布内强化学习范式中。在每次迭代中，我们的方法从一个混合的非策略数据集中学习分布内批评者，该数据集结合专家演示与代理生成轨迹，通过加权隐性Q学习目标稳定Bellman备份在稀疏奖励设置中。随后，学习到的价值函数通过优势估计推导分步骤的过程奖励，实现密集且可靠的监督，无需环境回溯或人工注释。利用这些信号，我们进行行为近端策略优化，使代理在用于过程奖励标记的数据上进化，允许迭代自我改进而不加剧分布转移。我们在AlfWorld、WebShop和ScienceWorld上评估了我们的方法，显示Q-Evolve在样本效率、鲁棒性和整体任务表现方面优于强基线。我们的结果表明，稳定的代理自我进化可以通过过程层级监督和政策的共进化实现，这两者都建立在共享的分布式内学习循环中。
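
Two ingredients named in the abstract can be sketched in a few lines: the expectile regression loss that standard Implicit Q-Learning uses to fit a value function, and step-wise process rewards taken as advantages Q(s, a) - V(s). The paper's additional weighting scheme is not described in the abstract and is omitted; the `tau` default and the function names are assumptions.

```python
def expectile_loss(q_value, v_value, tau=0.7):
    # Asymmetric squared error from standard IQL: with tau > 0.5, cases
    # where V underestimates Q are penalized more, pushing V toward an
    # upper expectile of Q over in-distribution actions.
    diff = q_value - v_value
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff ** 2

def process_rewards(q_values, v_values):
    # One advantage per step of a trajectory, used as a dense process reward.
    return [q - v for q, v in zip(q_values, v_values)]
```

Since both Q and V are fit only on actions present in the dataset, the resulting step rewards need no environment backtracking, which matches the claim in the abstract.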

Rapid co-design of Buoyancy-assisted robots for Challenging Locomotion using Gaussian Evolutionary Specialists

利用高斯进化专家快速设计浮力辅助机器人以挑战运动

Authors: Ankit Sinha, Nitish Sontakke, Dennis Hong, Yusuke Tanaka, Sehoon Ha
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.07424
Pdf link: https://arxiv.org/pdf/2606.07424
Abstract Designing high-performance legged robots requires jointly optimizing morphology and control. Model-free Reinforcement Learning (RL) offers an alternative to model-predictive control for developing robust controllers without explicitly specifying robot dynamics. Thus, we have seen the use of RL to train controllers and evaluate designs for robot morphology optimization. While RL has shown success in locomotion, using it in the co-design inner loop is expensive due to repeated policy training. Universal policies conditioned on morphology offer a promising alternative, but suffer from behavioral diversity collapse, converging to a single strategy that performs sub-optimally across designs. On the other hand, end-to-end Mixture-of-Experts (MoE) architectures fail due to a collapse in their representations. We propose Gaussian Evolutionary Specialists (GES), a framework that decouples design-space partitioning from policy learning to capture diverse behaviors explicitly. GES assigns specialist policies to evolving Gaussian regions and iteratively refines them via training, probing, and territory expansion. The resulting specialists are integrated into a design sampling loop, replacing costly re-training with direct evaluation. When tested on the Buoyancy-Assisted Light Legged Unit (BALLU), GES discovers designs with 5-25% higher performance than naive universal policies. On hardware, a GES-optimized design overcomes a 24 cm tall obstacle, a 3x improvement over the baseline BALLU design. Moreover, GES curtails design optimization time by 37%.
中文摘要 设计高性能腿部机器人需要共同优化形态和控制。无模型强化学习（RL）为开发鲁棒控制器提供了一种替代模型预测控制的替代方案，而无需明确指定机器人动力学。因此，我们已经看到强化学习被用于训练控制器和评估机器人形态优化的设计。虽然强化学习已证明在内轮移动中取得了成功，但在共设计的内环中使用它由于反复进行策略培训而成本高昂。以形态为条件的通用策略提供了有前景的替代方案，但它们存在行为多样性崩溃的问题，最终汇聚到单一策略，而该策略在设计中表现不佳。另一方面，端到端专家混合（MoE）架构因表示崩溃而失败。我们提出了高斯进化专家（GES）框架，该框架将设计空间划分与策略学习解耦，以显式捕捉多样化行为。GES为不断演变的高斯区域分配专业政策，并通过训练、探查和领土扩展不断完善。由此产生的专家被整合进设计采样循环，用直接评估取代昂贵的再培训。在浮力辅助轻腿装置（BALLU）上测试时，GES发现设计性能比简单通用政策高出5%至25%。在硬件上，GES优化的设计克服了24厘米高的障碍——比基础BALLU设计提升了3倍。此外，GES还能将设计优化时间缩短37%。

Modelling Opinion Dynamics at Scale with Deep MARL

利用深度MARL大规模建模意见动态

Authors: Lukas Seier, Brandon Kaplowitz, Sebastian Towers, Richard Bailey, Jakob Foerster
Subjects: Multiagent Systems (cs.MA); Computer Science and Game Theory (cs.GT); Social and Information Networks (cs.SI)
Arxiv link: https://arxiv.org/abs/2606.07487
Pdf link: https://arxiv.org/pdf/2606.07487
Abstract Modelling opinion dynamics typically relies on hand-crafted local interaction rules to study emergent macroscopic phenomena such as consensus and polarisation. In contrast, multi-agent reinforcement learning (MARL) enables agents to learn such behaviours directly by optimising simple rewards. To explore the potential of MARL for opinion dynamics, we introduce a GPU-accelerated consensus and truth-finding game that scales to populations of up to 1000 agents, comparable to many real-world social sub-networks. To prevent unrealistic conventions, we extend other-play to general-sum social interactions. We next validate our model on a subset of the Bluesky network by recovering agent importance structures from graph topology alone via a learned attention layer, finding that highly conforming populations most closely match human data. In large social media networks such high levels of conformity significantly reduce collective accuracy and promote dishonest agents that lie to fit in. By contrast, small, dynamic hunter-gatherer networks are less affected; here, conformity can even improve collective agreement. This suggests a mismatch between evolved human conformity heuristics and modern social media environments as a potential contributor to misinformation.
中文摘要 意见动态建模通常依赖手工制定的局部交互规则来研究新出现的宏观现象，如共识和极化。相比之下，多智能体强化学习（MARL）通过优化简单奖励，使智能体能够直接学习此类行为。为了探索MARL在意见动态中的潜力，我们引入了一款GPU加速共识和真相探索游戏，可扩展至多达1000名代理的人口，可与许多现实世界的社交子网络相媲美。为了防止不切实际的约定，我们将他者游戏扩展到一般和的社会互动。接着，我们通过学习注意力层仅从图拓扑恢复代理重要性结构，验证了Bluesky网络的一个子集，发现高度符合的人群与人类数据最为匹配。在大型社交媒体网络中，如此高的从众性显著降低了集体准确性，并助长了虚伪的代理人为了融入而撒谎。相比之下，小型、动态的狩猎采集网络受影响较小;在这里，顺从甚至可以改善集体协议。这表明进化出的人类一致性启发式与现代社交媒体环境之间存在不匹配，可能成为错误信息的助长因素。

Affordance-Based Hierarchical Reinforcement Learning for Quadruped Pedipulation

基于可有性的分层强化学习用于四足踏步

Authors: Tuba Girgin, Jose Castelblanco, Gabriel Rodriguez, Emre Girgin, Cagri Kilic
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.07506
Pdf link: https://arxiv.org/pdf/2606.07506
Abstract The object manipulation capabilities of quadruped robots are an open research challenge. While previous studies have focused on low-level policy learning, task execution still relies on expert-designed high-level trajectories. Autonomous selection of both an affordable interaction point on the target object and an affordable robot base pose removes the need for pre-designed trajectories. This study proposes a three-level hierarchical reinforcement learning (RL) framework that utilizes pose affordances to guide the navigation policy, while the navigation policy drives the locomotion policy. In addition, the pedipulation policy is guided by interaction-point affordances, enabling object-centric pose alignment of the quadruped robot and effective end-effector manipulation planning. We train the proposed framework in the IsaacSim ecosystem and evaluate it in both simulation and real-world settings. We investigate the effectiveness of pose affordance across multiple scenarios in simulation, while various object interaction tasks are validated in a real-world setting, forming an object-interaction dataset. The results show that the proposed framework can autonomously identify candidate poses based on their affordance and successfully execute object manipulation tasks in the real world without human guidance.
中文摘要 四足机器人的物体操作能力是一个开放的研究挑战。虽然以往研究侧重于低层次政策学习，但任务执行仍依赖专家设计的高层次轨迹。自主选择目标物体上经济适用的交互点和经济实惠的机器人基座姿态，无需预先设计的轨迹。本研究提出了一个三级层级强化学习（RL）框架，利用姿态可供性来指导导航策略，而导航策略则驱动移动策略。此外，步步策略由交互点可供性指导，实现四足机器人的以对象为中心姿态对齐和有效的末端执行器操作规划。我们将拟议框架在IsaacSim生态系统中训练，并在模拟和现实环境中进行评估。我们研究了姿态可向性在模拟中多场景下的有效性，同时在现实环境中验证各种对象交互任务，形成对象-交互数据集。结果显示，所提框架能够根据候选姿势的可适用性自主识别，并在无需人工指导的情况下成功执行现实中的对象操作任务。

Keyword: diffusion policy

Beyond Waypoints: A Trajectory-Centric Waypointing Paradigm for Vision-Language Navigation

超越航点：一种以轨迹为中心的视觉语言导航航路标范式

Authors: Haoxiang Shi, Xiang Deng, Haoyu Zhang, Qiaohui Chu, Yaowei Wang, Liqiang Nie
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.07244
Pdf link: https://arxiv.org/pdf/2606.07244
Abstract Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to follow natural-language instructions while navigating in real-world-like environments. Most VLN-CE approaches adopt a three-stage framework: a waypoint predictor proposes navigable waypoints, and a navigator selects the best waypoint, with a low-level controller executing the movement to it. However, this decoupled paradigm often leads to unreachable waypoints or inconsistencies between planning and control. In this work, instead of predicting isolated waypoints, we introduce a novel paradigm called Trajectory Waypoint, which grounds each candidate waypoint in an executable trajectory. To realize this, we design a Trajectory Waypoint Predictor formulated as a TSDF-guided diffusion policy, which steers trajectory generation away from obstacles, inherently ensuring the reachability of the predicted waypoints. We further propose a trajectory-enhanced navigator that injects the associated trajectory as additional information for planning, enabling strict consistency between high-level semantic decisions and low-level execution. Extensive experiments on the VLN-CE benchmark show that our Trajectory Waypoint paradigm achieves superior performance over the baselines.
中文摘要 连续环境中的视觉语言导航（VLN-CE）要求代理在类似真实世界的环境中遵循自然语言指令进行导航。大多数VLN-CE方法采用三阶段框架：航点预测器提出可导航航路点，导航员选择最佳航点，低级控制器执行移动。然而，这种脱钩的范式常常导致无法到达的航点，或规划与控制之间存在不一致。在本研究中，我们不再预测孤立的航点，而是引入了一种新范式，称为轨迹航点，将每个候选航点建立在可执行的轨迹基础上。为此，我们设计了一个轨迹航点预测器，采用TSDF引导扩散策略，引导轨迹生成远离障碍物，本质上确保预测路径点的可达性。我们还提出了一种轨迹增强导航器，将相关轨迹作为额外信息用于规划，实现高层语义决策与低层执行之间的严格一致性。VLN-CE基准测试的大量实验表明，我们的轨迹航点范式在基线上表现优越。