Arxiv Papers of Today

Generated: 2026-06-24 18:52:28 (UTC+8); arXiv release time: 2026-06-24 20:00 EDT (2026-06-25 08:00 UTC+8)

27 relevant papers today

Keyword: reinforcement learning

EXPO-SQL: Execution-based Clause-level Policy Optimization for Text-to-SQL

Authors: Jaehoon Lee, CheolWon Na, Suyoung Bae, Jin-Seop Lee, Jihyung Lee, YunSeok Choi, Jee-Hyong Lee
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2606.23693
Pdf link: https://arxiv.org/pdf/2606.23693
Abstract Text-to-SQL enables users to query databases using natural language by generating executable SQL queries. Recent methods have increasingly adopted Large Language Model-based reinforcement learning (RL) to leverage execution feedback for training. However, existing RL methods assign uniform query-level rewards to all clauses in a SQL query, treating correct and incorrect clauses equally. This coarse-grained reward design leads to insufficient learning signals for correct SQL generation. To address this issue, we propose EXPO-SQL (EXecution-based clause-level Policy Optimization for Text-to-SQL), which provides fine-grained supervision through clause-level rewards. To assign clause-level rewards, our method identifies erroneous clauses by analyzing execution results, including error messages and clause-wise incremental execution. Experiments on widely used Text-to-SQL benchmarks demonstrate that EXPO-SQL significantly outperforms existing supervised fine-tuning, prompting, and RL-based methods through fine-grained clause-level learning. Our code is available at https://github.com/jhn25/EXPO-SQL.

Enforcing Human-like Kinematics in Dexterous Piano Playing via Adversarial Posture Regularization

Authors: Bin Qiu, Yanming Shao, Guanyu Cai, Yao Mu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.23848
Pdf link: https://arxiv.org/pdf/2606.23848
Abstract Reinforcement learning can train bimanual dexterous hands to play piano in physics simulation with high note accuracy, but for high-DoF dexterous hands, relying solely on task rewards or IK inversion often leads to unnatural postures and joint overextension. We propose Adversarial Posture Regularization (APR). It avoids expensive, song-aligned expert demonstration data and instead uses a small amount of casual human playing data. By matching the policy's posture distribution to the human prior through an adversarial objective, APR encourages more human-like hand shapes. In addition, we collect and release unstructured hand motion data of piano playing captured with a consumer-grade Meta Quest 3, and retarget the key motion information to the Shadow Hand. Finally, we achieve significantly better performance than prior methods on all three human-likeness metrics (cPSI, BSE, and FAC) as well as in visual quality. Project repository: this https URL.

KLip-PPO: A per-sample KL perspective on PPO-Clip

Authors: Riccardo Colletti, Robin Holzinger
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.23932
Pdf link: https://arxiv.org/pdf/2606.23932
Abstract Proximal Policy Optimization (PPO) is the standard policy-gradient algorithm for on-policy reinforcement learning. The literature presents it in two forms, a clipped surrogate that bounds the importance ratio between successive policies and a Kullback-Leibler penalty between them. These forms are treated as separate algorithms with their own gradients, their own hyperparameters, and their own reference implementations, and a sizeable body of empirical work compares them. We show that the gradient of the clipped surrogate is reproduced exactly by a Kullback-Leibler surrogate whose coefficient varies per sample, with closed-form dependence on the importance ratio and the advantage. The identity holds at every minibatch step and across the entire inner loop, and on five MuJoCo continuous-control benchmarks the two losses produce indistinguishable training curves. The reformulation exposes a structural feature of the clipped surrogate that the min notation hides. PPO-Clip's implicit per-sample penalty is a step function at the boundary of the trust region, and the shape of this coefficient is the natural design axis for generalising the algorithm. We sketch the resulting follow-up directions in the discussion.
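
For readers unfamiliar with the clipped surrogate named in the abstract above, the following is a minimal sketch of the standard PPO-Clip per-sample objective as it appears in the general PPO literature. It is not the paper's code and does not reproduce the paper's KL identity. It only checks numerically that the derivative with respect to the importance ratio is either the advantage or zero, which is the step-like behaviour at the trust-region boundary that the abstract alludes to. The function names, the default eps=0.2, and the finite-difference check are illustrative assumptions.

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Standard PPO-Clip per-sample surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)


def grad_wrt_ratio(ratio, advantage, eps=0.2, h=1e-6):
    """Central finite-difference estimate of d(objective)/d(ratio) (illustrative helper)."""
    up = ppo_clip_objective(ratio + h, advantage, eps)
    down = ppo_clip_objective(ratio - h, advantage, eps)
    return (up - down) / (2.0 * h)


# Inside the trust region the gradient equals the advantage.
inside = grad_wrt_ratio(1.0, 1.0)
# Ratio above 1+eps with a positive advantage: the sample contributes no gradient.
outside_pos = grad_wrt_ratio(1.5, 1.0)
# Ratio below 1-eps with a negative advantage: again no gradient.
outside_neg = grad_wrt_ratio(0.5, -1.0)
# Ratio above 1+eps with a negative advantage is NOT clipped by the min.
unclipped_neg = grad_wrt_ratio(1.5, -1.0)
```

The last case shows why the min matters: clipping only removes the gradient when moving the ratio further would increase the objective.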

Offline Reinforcement Learning for Warehouse SLAM Throughput Control

Authors: Tina Dongxu Li, Mouhacine Benosman, Rajat Kumar, Kevin Tan, Ken Meszaros, Trevor Dardik
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.23978
Pdf link: https://arxiv.org/pdf/2606.23978
Abstract We present an offline reinforcement learning (RL) framework for optimizing SLAM throughput control in a warehouse fulfillment environment. SLAM (Scan/Label/Apply/Manifest) throughput directly influences system congestion and operational efficiency. Our RL-based control approach dynamically recommends SLAM throughput settings that adaptively balance throughput maximization with downstream stability through intelligent adjustment of throttling behavior. We include a history-informed state representation, action space abstraction for delayed-impact control, and a reward function that captures both upstream and downstream operational metrics. Our approach is algorithm-agnostic, enabling integration of multiple offline RL methods under a unified architecture. We instantiate our framework with three state-of-the-art offline RL algorithms and train the models offline using de-identified historical operational logs from a large-scale warehouse. Policy performance is evaluated using a comprehensive multi-method strategy. This strategy combines model-free approaches, namely immediate reward estimation via regression models and long-horizon Fitted Q Evaluation (FQE), with model-based Deep Koopman dynamics evaluation. Empirical results reveal that the CQL policy consistently outperforms alternatives, improving system health by 22.97% and reducing average throttling duration by 3.18%. These findings demonstrate the potential of offline RL for safe and scalable warehouse throughput control optimization.

Learning to Trigger: Reinforcement Learning at the Large Hadron Collider

Authors: Zixin Ding, Shaghayegh Emam, Giovanna Salvi, Cecilia Tosciri, Abhijith Gandrakota, Jennifer Ngadiuba, Nhan Tran, Christian Herwig, David W. Miller, Yuxin Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); High Energy Physics - Experiment (hep-ex)
Arxiv link: https://arxiv.org/abs/2606.23993
Pdf link: https://arxiv.org/pdf/2606.23993
Abstract High-throughput scientific facilities such as the Large Hadron Collider depend on real-time event filtering (triggering) under tight constraints on bandwidth, latency, and storage. In practice, trigger menus are largely static and hand-tuned and can become suboptimal as detector conditions, pileup, and background composition drift over time. We cast online threshold tuning as a sequential decision-making problem: a reinforcement learning agent ingests streaming summaries of recent rates and signal-sensitive features and updates trigger thresholds to maximize signal efficiency while tracking a target background rate within a tolerance band. We adapt Group-Filtered Policy Optimization (GFPO) to streaming control and introduce two variants (GFPO-F, GFPO-FR) that enforce background rate feasibility during training. On a benchmark that emulates realistic collider operation, we study two representative triggers: a total transverse energy (H_T) trigger sensitive to pileup variation, and an anomaly-detection (AD) trigger based on reconstruction loss for rare or non-standard signatures. On Monte Carlo streams, our agent increases the fraction of in-tolerance time intervals by 48% (H_T) and 28% (AD), with a cumulative gain of up to 2% in signal efficiency on those in-tolerance intervals. Transferring from simulation to real collision data (CMS Run 283408), the same agent, without fine-tuning, achieves a 56% (H_T) and 28% (AD) in-tolerance improvement over baselines, with further signal-efficiency gain on both triggers. To our knowledge, this is the first demonstration of RL-based trigger control on real Large Hadron Collider collision data. Code is available at this https URL_LHC.

Safe and Generalizable Hierarchical Multi-Agent RL via Constraint Manifold Control

Authors: Zihao Guo, Jianing Zhao, Ling Li, Hao Liang, Giuseppe Loianno, Yali Du
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.24010
Pdf link: https://arxiv.org/pdf/2606.24010
Abstract Multi-agent systems are widely used in safety-critical applications that require coordinated behavior under strict safety constraints. Existing approaches face a fundamental trade-off: learning-based methods achieve strong empirical performance but lack theoretical safety guarantees, while control-theoretic methods enforce safety but often lead to overly conservative and inefficient behaviors. We propose a hierarchical multi-agent reinforcement learning framework that enforces hard safety constraints under mild assumptions at low level via a constraint manifold, while enabling effective coordination through high-level policy learning. Our approach provides theoretical safety guarantees in the multi-agent setting and yields stationary learning dynamics, thereby enabling stable and efficient training. Empirically, our method achieves competitive performance while maintaining nearly perfect safety rates, and generalizes effectively to varying numbers of agents and obstacles.

Reinforcement Learning Towards Broadly and Persistently Beneficial Models

Authors: Akshay V. Jagadeesh, Rahul K. Arora, Khaled Saab, Ali Malik, Mikhail Trofimov, Foivos Tsimpourlas, Johannes Heidecke, Karan Singhal
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.24014
Pdf link: https://arxiv.org/pdf/2606.24014
Abstract As AI systems are deployed across increasingly diverse and high-stakes settings, model alignment must generalize beyond the tasks and domains seen during training. This is especially important for reinforcement learning (RL), which can introduce unexpected misalignment through reward hacking, deception, or other unintended strategies. We study whether RL on beneficial behavior, instantiated in realistic domains, can produce broad and persistent alignment generalization beyond the training distribution. We construct a dataset of realistic situations designed to measure and train beneficial traits, such as truthfulness, fairness, risk awareness, and corrigibility, spanning varied domains, including health, science, and education. We then train models with RL on this dataset and evaluate them on more than 50 independent benchmarks of alignment and beneficial behavior. Compared to a compute-matched baseline, beneficial trait RL improves performance on over 80% of these out-of-distribution benchmarks. We observe substantial out-of-distribution alignment transfer: a beneficial-behavior RL intervention entirely limited to one domain, health, produces broad improvements on non-health alignment evaluations, including reduced reward hacking, deception, and general misalignment. Finally, we study alignment persistence: whether behavior remains robustly aligned under attempts to steer models towards misalignment. Models trained with beneficial trait RL show improved persistence, including greater resistance to adversarial prompting and harmful finetuning; further work is required to isolate the sources of these effects. These results suggest that RL to reinforce beneficial behavior in realistic domains can produce models that are more robustly aligned with human flourishing.

TurboMPC: Fast, Scalable, and Differentiable Model Predictive Control on the GPU

Authors: Gabriel Bravo-Palacios, Jianghan Zhang, Zachary Pestrikov, Brian Plancher, Thomas Lew
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2606.24039
Pdf link: https://arxiv.org/pdf/2606.24039
Abstract Robotics increasingly relies on GPUs for parallel simulation, large-scale learning, and neural-network inference. For model predictive control (MPC) to scale with this paradigm, solvers must run efficiently on this hardware while remaining fast, differentiable, and compatible with expressive MPC formulations used in robotics. We present TurboMPC, a differentiable MPC solver that runs entirely on the GPU and supports state and control inequality constraints, implicit integrators, cross-time-coupled costs, and slack variables. TurboMPC combines sequential quadratic programming (SQP), an alternating direction method of multipliers (ADMM) inner solver, implicit differentiation, and a co-designed JAX-CUDA implementation for efficiency and ease of use. In simulation, we validate TurboMPC on constrained planning, humanoid imitation learning, and reinforcement learning with neural-network cost function tasks, achieving up to 15× and 58× speedups over state-of-the-art CPU and GPU differentiable solvers, respectively. We deploy TurboMPC on a full-scale car for minimum-time racing and find that batched, GPU-accelerated tuning of MPC parameters via Bayesian optimization yields significantly faster driving than a hand-tuned baseline. TurboMPC also scales to planning horizons of over 8000 knot points while maintaining control of the vehicle. We open-source TurboMPC at: this https URL

Breaking the Filter Bubble: A Semantic Pareto-DQN Framework for Multi-Objective Recommendation

Authors: Cláudio Lúcio Do Val Lopes, Lucca Machado da Silva, André de Oliveira Brandão
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.24042
Pdf link: https://arxiv.org/pdf/2606.24042
Abstract Recommender systems often induce filter bubbles and semantic homogenization by monolithically optimizing for immediate user engagement. Standard single-objective models, including traditional Deep Q-Networks, are ill-equipped to navigate the trade-offs between platform retention and critical societal values like information diversity and provider fairness. To address these limitations, we introduce a multi-objective reinforcement learning framework that formalizes recommendation as a semantic multi-objective Markov decision process. By integrating high-fidelity semantic embeddings with a Pareto-DQN agent, our architecture treats engagement, diversity, and fairness as distinct, non-aggregable reward signals, avoiding the pitfalls of static reward scalarization. Empirical evaluations on the MovieLens small dataset show that our hypervolume-based action selection disrupts the feedback loops responsible for semantic collapse. By sustaining high state-trajectory variance, the Pareto-DQN effectively maps the Pareto frontier, achieving gains in auxiliary societal objectives with only marginal impacts on engagement. This work provides a path toward intrinsically aligned, responsible recommender systems.
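
As background for the Pareto-frontier language in the abstract above, here is a minimal sketch of Pareto dominance over reward vectors. It is a generic textbook construction, not the paper's hypervolume-based action selection. The three-component tuples standing for (engagement, diversity, fairness), the sample values, and the helper names are all hypothetical.

```python
def dominates(a, b):
    """True if reward vector a Pareto-dominates b: no worse everywhere, better somewhere."""
    no_worse = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (engagement, diversity, fairness) estimates for four candidate actions.
candidates = [(1, 2, 3), (2, 2, 3), (0, 5, 0), (1, 1, 1)]
front = pareto_front(candidates)
```

Here (2, 2, 3) and (0, 5, 0) survive: neither is better than the other on every objective, which is the kind of trade-off a single scalarized reward would hide.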

An LMM for Precisely Grounding Elements in Documents

Authors: Yijian Lu, Chuangxin Zhao, Kai Sun, Lei Hou, Juanzi Li, Ji Qi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.24118
Pdf link: https://arxiv.org/pdf/2606.24118
Abstract Visual grounding in documents is a crucial ability for Large Multimodal Models (LMMs) in areas such as document understanding, deep research, and document error detection. However, existing approaches exhibit poor grounding precision in text-rich document images, often failing to accurately locate the critical document elements needed for reliable reasoning. To address this gap, we introduce PreciseDoc, an LMM specifically designed for precise element grounding that can be further optimized for Document VQA tasks. Specifically, to enhance the basic localization capability, we construct challenging training data using two pipelines capable of mass-producing high-quality documents with paired metadata of fine-grained coordinates, including synthetic hand-filled documents with camera effects. The model develops more real-world functions beyond straightforward localization of single text, such as locating personal information in CVs. Furthermore, we introduce a training paradigm for visually grounded reasoning in which grounding and reasoning are supervised jointly with reinforcement learning to improve the contribution of the grounded evidence. A comprehensive evaluation on various benchmarks demonstrates the advantage of the proposed data and methods in document spatial grounding and document understanding.

Holistic Data Scheduler for LLM Pre-training via Multi-Objective Reinforcement Learning

Authors: Chenhao Dang, Jing Ma, Mingjie Liao
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2606.24133
Pdf link: https://arxiv.org/pdf/2606.24133
Abstract The composition of training data, governed by the diversity of sources and their mixing strategy, is a cornerstone of Large Language Model (LLM) pre-training. Online Data Mixing (ODM), the technique of adaptively adjusting data mixtures during training, has emerged as a promising direction to improve efficiency. However, existing methods are constrained by their reliance on a singular optimization perspective, which fundamentally overlooks the need for complex LLM pre-training to consider the dynamic data composition from multiple dimensions. To overcome this limitation, we introduce the Holistic Data Scheduler (HDS), a novel online data mixing framework. HDS formulates the data scheduling challenge as a reinforcement learning problem in a continuous control space and leverages the Soft Actor-Critic (SAC) algorithm for its stability and sample efficiency in exploring the high-dimensional policy space. At the core of HDS lies a novel multi-objective, holistic reward function that integrates three critical perspectives: a data-driven reward for quality, a loss-driven reward capturing inter-domain influence, and a model-driven reward based on weight norms. To validate our design and determine its optimal configuration, we conducted systematic experiments on LLMs of various sizes. On The Pile benchmark, HDS reaches the final validation perplexity of the next best method with 44% fewer training iterations. Furthermore, it achieves a 7.2% improvement on the MMLU 0-shot task along with consistent gains on other benchmarks, showcasing its ability to enhance both training efficiency and final model capability.

AsyncOPD: How Stale Can On-Policy Distillation Be?

Authors: Wonjun Kang, Kevin Galim, Seunghyuk Oh, Minjun Kang, Sanghyun Park, Donghoon Kim, Minjae Lee, Minseo Kim, Rishabh Tiwari, Yuchen Zeng, Hyung Il Koo, Kangwook Lee
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.24143
Pdf link: https://arxiv.org/pdf/2606.24143
Abstract On-policy distillation (OPD) trains a student on its own rollouts guided by teacher feedback and is becoming increasingly important for large language model (LLM) post-training. Like reinforcement learning (RL), however, OPD faces an on-policy systems bottleneck, as rollouts can dominate training time for reasoning workloads. Asynchronous training pipelines can alleviate this bottleneck by decoupling rollout generation from learner updates, but doing so introduces stale-policy data. While prior work has studied stale data in asynchronous RL, its effects in OPD remain underexplored. We present the first systematic study of staleness in asynchronous OPD, focusing on a practical setting where teacher feedback is implemented through local KL losses and full-vocabulary teacher logits are too expensive to store or transfer, necessitating finite teacher-score caches. We first show that KL direction changes the stale-data problem: teacher-weighted forward KL is more robust to stale rollouts, whereas student-weighted reverse KL is vulnerable. Second, for this vulnerable reverse-KL case, we study whether methods designed to stabilize asynchronous RL can mitigate OPD staleness. In our experiments, they do not improve over a simpler OPD-specific surrogate: recomputing the reverse-KL signal under the current student at learner time. Third, we analyze how finite teacher-score caches create a bias-variance tradeoff for sparse and sampled reverse-KL OPD estimators. This motivates multi-sample Monte Carlo (MC), which preserves MC correctability while reducing one-sample variance. Finally, we present and open-source AsyncOPD, a fully asynchronous OPD training pipeline built from these estimator choices. Experiments show that AsyncOPD improves training throughput by 1.6× to 3.8× over strict synchronous training while reaching comparable accuracy.
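
The abstract above distinguishes teacher-weighted forward KL from student-weighted reverse KL. As a reminder of what the two directions mean, here is a minimal sketch over a toy three-token vocabulary. It uses the standard definition of KL divergence and is not the AsyncOPD estimator. The probability values and variable names are illustrative assumptions.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical next-token distributions over a three-token vocabulary.
teacher = [0.7, 0.2, 0.1]
student = [0.4, 0.4, 0.2]

# Forward KL: the expectation is taken under the teacher distribution.
forward_kl = kl(teacher, student)
# Reverse KL: the expectation is taken under the student distribution.
reverse_kl = kl(student, teacher)
```

Because reverse KL is weighted by the student, estimating it from rollouts depends on which student policy generated them, which is one way to read the abstract's point that this direction is the one sensitive to stale data.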

An Introduction to Causal Reinforcement Learning

Authors: Elias Bareinboim, Junzhe Zhang, Sanghack Lee
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.24160
Pdf link: https://arxiv.org/pdf/2606.24160
Abstract Causal inference provides a set of principles and tools that allow one to combine data and knowledge about an environment to reason with questions of counterfactual nature, i.e., what would have happened had reality been different, even when no data of this unrealized reality is currently available. Reinforcement learning provides methods to learn a policy that optimizes a specific measure (e.g., reward, regret) when the agent is deployed in an environment and pursues an exploratory, trial-and-error approach. These two disciplines have evolved independently and with virtually no interaction between them. We note that they operate over different aspects of the same building block, counterfactual relations, which makes them umbilically connected. Based on these observations, novel learning opportunities arise when this connection is explicitly acknowledged and mathematized. To realize this potential, we note that any environment where the RL agent is deployed can be decomposed as a collection of autonomous mechanisms with different causal invariances, parsimoniously modeled as a structural causal model; any standard RL setting implicitly encodes such a model. This formalization allows us to put under a unifying treatment different modes of learning, including online, off-policy, and causal calculus learning, which appear unrelated in the literature. However, these modalities are not exhaustive: we introduce several natural and pervasive classes of learning settings that entail novel dimensions of analysis. Specifically, we introduce and discuss through causal lenses generalized policy learning, where to intervene, imitation learning, and counterfactual learning. These tasks lead to a broader view of counterfactual learning and suggest great potential for studying causal inference and reinforcement learning side by side, which we call causal reinforcement learning (CRL).

SkyChain Intelligence: A Blockchain-Secured Multi-Agent DRL Framework for Low-Altitude Embodied Artificial Intelligence

Authors: Haoxiang Luo, Tianqi Jiang, Ruichen Zhang, Yinqiu Liu, Gang Sun, Hongfang Yu, Abbas Jamalipour, Dong In Kim
Subjects: Networking and Internet Architecture (cs.NI); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2606.24193
Pdf link: https://arxiv.org/pdf/2606.24193
Abstract With the rapid development of the Low-Altitude Economy (LAE) ecosystem, Low-Altitude Embodied Artificial Intelligence (LAEAI) agents have become the core carriers of autonomous aerial services, thereby enabling dynamic Low-altitude Computility Networks (LACNets) for distributed computing resource sharing. However, resource-constrained LAEAI agents in decentralized LACNets face a fundamental trilemma of autonomy, security, and efficiency. Existing solutions primarily focus on either optimizing computational performance or enhancing security in isolation, failing to address the inherent trade-offs among trust, performance, and overhead in untrusted dynamic environments with malicious agents. To tackle this challenge, this paper proposes SkyChain Intelligence, a holistic framework that synergistically integrates agentic AI, consortium blockchain, and Multi-Agent Deep Reinforcement Learning (MADRL). We design a lightweight blockchain-based decentralized trust management system with a dynamic reputation mechanism and develop a hybrid-action-space MADDPG algorithm that embeds on-chain reputation scores into the reward function to jointly optimize offloading decisions, resource allocation, and drone 3D trajectories. Extensive simulations demonstrate that our framework outperforms state-of-the-art baselines in task completion latency and energy consumption, while achieving a 94.1% task completion rate in the baseline scenario and stable convergence within 300 training episodes. This work provides a viable path for building secure, autonomous, and efficient machine-to-machine computing ecosystems in the low-altitude domain.

Transformer-Based Language Models Across Domain Verticals: Architectures, Applications and Critical Assessment

Authors: Guruprakash J, Krithika L.B
Subjects: Computation and Language (cs.CL); Emerging Technologies (cs.ET)
Arxiv link: https://arxiv.org/abs/2606.24331
Pdf link: https://arxiv.org/pdf/2606.24331
Abstract Transformer-based language models have become the default substrate for natural language processing and the pace of new releases has made it hard for practitioners to separate durable ideas from the noise of incremental announcements. This review works at two levels. At the level of mechanism, we organise the main transformer families into a working taxonomy, covering encoder-only, decoder-only, encoder-decoder, long-context, permutation-based, and generator-discriminator variants. We then extend the discussion to post-2023 developments that changed the picture in practice: instruction tuning, reinforcement learning from human feedback, direct preference optimisation, mixture-of-experts scaling, retrieval augmentation and the current flagship model families from OpenAI, Anthropic, Google, Meta, Mistral and DeepSeek. At the level of use, we survey deployments across healthcare, finance, legal, education, customer service, creative writing and scientific work. Based on this we link each to the specific capabilities that make a transformer the appropriate tool. The contribution of this paper is a critical assessment that is based on the survey. We compare architectures on four axes that matter to deployment decisions, we quantify the trade-off between parameter count and energy cost. We also discuss how alignment methods, data provenance and benchmark saturation change what it means to call a model "state of the art". The final section lists the research questions that we think deserve more attention.
中文摘要 基于Transformer的语言模型已成为自然语言处理的默认基础，新版本发布的速度使得实践者难以将持久的创意与渐进式公告的噪音区分开来。这篇评论有两个层面。在机制层面，我们将主要变换器家族组织为工作分类法，涵盖仅编码器、仅解码器、编码器-解码器、长上下文、基于排列和生成元-判别器变体。随后，我们将讨论扩展到2023年后改变实际情况的发展：指令调优、基于人类反馈的强化学习、直接偏好优化、专家混合扩展、检索增强，以及OpenAI、Anthropic、Google、Meta、Mistral和DeepSeek等当前旗舰模型家族。在使用层面，我们调查了医疗、金融、法律、教育、客户服务、创意写作和科学工作的部署情况。基于此，我们将每个变压器与使其成为合适工具的具体能力相关联。本文的贡献是基于调查的批判性评估。我们在四个关键的部署决策轴上比较架构，量化参数数量与能耗之间的权衡。我们还讨论了比对方法、数据来源和基准饱和度如何改变对模型“最先进”的定义。最后一节列出了我们认为值得更多关注的研究问题。

Managing Task Execution for Unknown Workloads in Batteryless IoT: A Hardware-Agnostic Evaluation

无电池物联网中未知工作负载的任务执行管理：硬件无关性评估

Authors: Samer Nasser, Henrique Duarte Moura, Ritesh Kumar Singh, Maarten Weyn, Jeroen Famaey
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.24340
Pdf link: https://arxiv.org/pdf/2606.24340
Abstract In recent years, the Internet of Things (IoT) paradigm has been shifting toward batteryless, energy-harvesting architectures. Sustaining reliable operation in these systems requires intelligent management of highly volatile stored energy. As edge applications grow in complexity, traditional energy-aware schedulers struggle with unpredictable workloads due to their reliance on static execution thresholds or pre-measured, hardware-specific task profiles. To overcome this, we propose two novel, hardware-agnostic dynamic scheduling strategies treating applications as a "black box," requiring no prior energy information: a model-free Reinforcement Learning (RL) agent and an on-the-fly Approximated Prediction (AP) method. We evaluate these methods against an adaptive task rate approach (AsTAR) and optimized static thresholds using a custom-built, physically accurate simulation framework driven by real-world solar data and dynamic LoRa transmission profiles. Rather than claiming universal superiority, our analysis exposes the distinct operational trade-offs of each method: the AP approach delivers lightweight, near-oracle task throughput; the RL agent provides tunable survival-execution balancing; and AsTAR excels at execution pacing across long energy gaps. Finally, we demonstrate that while these advanced strategies provide critical resilience for severely constrained systems with small capacitors, devices with larger energy buffers can efficiently rely on simpler, less computationally expensive static policies.
中文摘要 近年来，物联网（IoT）模式正向无电池、能收集架构转变。维持这些系统的可靠运行需要智能管理高度不稳定的储存能量。随着边缘应用的复杂性增加，传统的能耗调度器因依赖静态执行阈值或预先测量的硬件特定任务配置文件而面临不可预测的工作负载挑战。为克服这一问题，我们提出了两种新型、硬件无关的动态调度策略，将应用视为“黑匣子”，无需任何先验能量信息：一个无模型的强化学习（RL）代理和一种即时近似预测（AP）方法。我们结合自适应任务率方法（AsTAR）评估这些方法，并利用基于真实太阳数据和动态LoRa传输曲线的定制物理精确模拟框架优化静态阈值。我们的分析并不声称普遍优越，而是揭示了每种方法的独特操作权衡：AP方法提供轻量级、近乎预言机般的任务吞吐量;强化学习代理提供可调的生存-执行平衡;AsTAR则擅长在长能量间隙的执行节奏控制。最后，我们证明，虽然这些先进策略为受限较小电容器的系统提供了关键的韧性，但能量缓冲较大的设备可以高效依赖更简单、计算成本更低的静态策略。
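
To make the contrast between a static threshold and on-the-fly prediction concrete, here is a rough sketch of a scheduler that learns a task's energy cost from observed capacitor voltage drops. It is not the paper's AP method: the moving-average estimator, the safety margin, and the names `usable_energy` and `OnlineTaskEstimator` are assumptions. The only physics used is the capacitor energy formula E = ½CV².

```python
def usable_energy(capacitance_f: float, voltage: float, v_min: float) -> float:
    """Energy (J) available above the brown-out voltage v_min, from E = 0.5 * C * V^2."""
    return max(0.0, 0.5 * capacitance_f * (voltage ** 2 - v_min ** 2))


class OnlineTaskEstimator:
    """Estimates a black-box task's energy cost from voltage drops (hypothetical design)."""

    def __init__(self, alpha: float = 0.3, margin: float = 1.2):
        self.alpha = alpha      # smoothing factor of the moving average
        self.margin = margin    # safety factor applied to the estimate
        self.estimate = None    # no prior energy information is assumed

    def update(self, capacitance_f: float, v_before: float, v_after: float) -> None:
        """Record the energy one task execution actually consumed."""
        spent = 0.5 * capacitance_f * (v_before ** 2 - v_after ** 2)
        if self.estimate is None:
            self.estimate = spent
        else:
            self.estimate = (1 - self.alpha) * self.estimate + self.alpha * spent

    def should_run(self, capacitance_f: float, voltage: float, v_min: float) -> bool:
        """Run only if the stored energy covers the estimated cost plus margin."""
        if self.estimate is None:
            return True  # first execution is an exploratory probe
        return usable_energy(capacitance_f, voltage, v_min) >= self.margin * self.estimate


if __name__ == "__main__":
    est = OnlineTaskEstimator()
    est.update(capacitance_f=1e-3, v_before=3.0, v_after=2.5)
    print(est.should_run(1e-3, 3.0, 1.8), est.should_run(1e-3, 2.2, 1.8))
```

A static policy would instead compare `voltage` against one fixed threshold, which is cheaper but cannot adapt when the workload's energy cost changes.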

Accelerating Disaggregated RL for Visual Generative LLMs with Diffusion-Based Parallelism and Trainer-Assisted Generation

加速基于扩散的并行性和训练器辅助生成的视觉生成大型语言模型的分解强化学习

Authors: Sijie Wang, Zhengyu Qing, Zhiqiang Tan, Yiming Yin, Yeqing Zhang, Yaoyuan Wang, Qiang Wang, Xiaowen Chu, Shaohuai Shi
Subjects: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI); Performance (cs.PF)
Arxiv link: https://arxiv.org/abs/2606.24369
Pdf link: https://arxiv.org/pdf/2606.24369
Abstract Reinforcement learning (RL) has become a dominant post-training paradigm, driving the emergence of high-performance RL systems such as veRL for autoregressive large language models (LLMs). In parallel, diffusion-oriented RL algorithms, e.g., DanceGRPO and FlowGRPO, have rapidly expanded the scope of RL from language reasoning to diffusion-based visual and flow-based generation. However, efficient RL systems for diffusion generative LLMs remain underexplored. Existing implementations, e.g., veRL-Omni, still rely on colocated execution, which simplifies synchronization but couples rollout and training resources, limits heterogeneous deployment, and constrains independent scaling. To this end, we introduce DigenRL, a disaggregated RL framework for diffusion-based generative LLMs that supports flexible resource allocation, accommodates heterogeneous GPUs, and facilitates efficient task scheduling. To maximally reduce the execution bubbles in the disaggregated architecture, we propose: 1) a generation-axis pipeline (GAP) and time-step parallelism (TSP) in the diffusion architecture to enable finer-grained pipelining between rollout and training; 2) an elastic trainer-assisted generation (TAG) approach to enable the trainer GPU resources to dynamically assist in executing rollout generations; and 3) a tightly one-step constrained asynchronous strategy to further utilize the tail bubble in the pipeline. Extensive experiments are conducted on three hardware testbeds with 16-32 GPUs using HunyuanVideo-13B, Wan2.1-14B, FLUX.1-12B, and QwenImage-20B generative models. Experimental results show that DigenRL achieves 1.56-2.10x throughput improvements over state-of-the-art diffusion RL systems, veRL-Omni and GenRL.
中文摘要 强化学习（RL）已成为主导的后训练范式，推动了高性能强化学习系统的出现，如面向自回归大型语言模型（LLM）的veRL。与此同时，面向扩散的强化学习算法，如DanceGRPO和FlowGRPO，迅速将强化学习的范围从语言推理扩展到基于扩散的视觉生成和基于流的生成。然而，面向扩散生成式LLM的高效强化学习系统仍未被充分探索。现有实现，如veRL-Omni，仍然依赖共置执行，这简化了同步，但耦合了rollout与训练资源，限制了异构部署，并制约了独立扩展。为此，我们提出了DigenRL，一个用于基于扩散的生成式LLM的分离式强化学习框架，支持灵活的资源分配，兼容异构GPU，并促进高效的任务调度。为了最大限度地减少分离式架构中的执行气泡，我们提出：1）扩散架构中的生成轴流水线（GAP）和时间步并行（TSP），以实现rollout与训练之间更细粒度的流水线;2）弹性训练器辅助生成（TAG）方法，使训练器GPU资源能够动态协助执行rollout生成;3）严格限制为单步的异步策略，进一步利用流水线中的尾部气泡。我们在三个配备16-32个GPU的硬件测试平台上进行了大量实验，使用了HunyuanVideo-13B、Wan2.1-14B、FLUX.1-12B和QwenImage-20B生成模型。实验结果显示，DigenRL相比最先进的扩散强化学习系统veRL-Omni和GenRL，吞吐量提升了1.56-2.10倍。

BRAVR: An AP-Assisted Online DRL Mechanism for Interactive VR Bitrate Adaptation over Wi-Fi

BRAVR：一种AP辅助在线DRL机制，用于Wi-Fi上的互动VR码率适配

Authors: Miguel Casasnovas, Francesc Wilhelmi, Boris Bellalta
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2606.24389
Pdf link: https://arxiv.org/pdf/2606.24389
Abstract Interactive virtual reality (VR) streaming over Wi-Fi requires stringent latency and reliability guarantees, which become increasingly difficult to achieve under dynamic channel conditions and shared medium contention. These challenges make real-time bitrate adaptation a critical yet fundamentally difficult control problem, particularly under limited visibility of the underlying network conditions. This paper formulates VR bitrate adaptation as a network-aware, online decision-making problem and proposes BRAVR, a decentralized deep reinforcement learning (DRL) mechanism designed to optimize visual quality while maintaining streaming performance and promoting airtime fairness in multi-user scenarios. BRAVR integrates application-layer observations with lightweight wireless network statistics collected at the Wi-Fi access point (AP) serving the VR client, enabling more informed bitrate adaptation decisions. We implement BRAVR in a real VR streaming system and evaluate it on a physical Wi-Fi testbed against a strong heuristic baseline and an ablated BRAVR variant without AP assistance. Experimental results show that BRAVR consistently achieves its design objectives, delivering robust quality of service (QoS) and preventing sustained airtime overutilization. It also outperforms its ablated counterpart, highlighting the benefits of incorporating network-level information into the bitrate adaptation control loop. Overall, these results demonstrate the effectiveness of AP-assisted online learning for decentralized interactive VR streaming over commodity Wi-Fi and provide practical insights into bitrate adaptation in shared wireless environments.
中文摘要 通过Wi-Fi进行交互式虚拟现实（VR）流媒体传输需要严格的延迟和可靠性保证，而在动态频道条件和共享媒介争用下，实现这一点变得越来越困难。这些挑战使得实时比特率适应成为一个关键但根本上极为困难的控制问题，尤其是在对底层网络条件可见性有限的情况下。本文将VR码率适配提出为一种网络感知的在线决策问题，并提出了BRAVR，一种去中心化的深度强化学习（DRL）机制，旨在优化视觉质量，同时保持流媒体性能，并在多用户场景中促进播出时间的公平性。BRAVR将应用层观测数据与在为VR客户端服务的Wi-Fi接入点（AP）收集的轻量级无线网络统计数据集成，使比特率适应决策更加明智。我们在真实的VR流媒体系统中实现BRAVR，并在物理Wi-Fi测试平台上，结合强启发式基线和无AP辅助的消融BRAVR变体进行评估。实验结果显示，BRAVR始终实现其设计目标，提供强有力的服务质量（QoS），防止持续的空中时间过度使用。它还优于其消蚀版本，凸显了将网络级信息纳入比特率调适控制环路的优势。总体而言，这些结果展示了AP辅助在线学习在去中心化互动VR流媒体中相较于普通Wi-Fi的有效性，并为共享无线环境中的码率适应提供了实用见解。

NoContactNoWorries: Estimating Contact through Vision and Proprioception for In-Hand Dexterous Manipulation

NoContactNoWorries：通过视觉和本体感觉估计接触，实现手部灵巧操作

Authors: Soham Patil, Avirup Das, Sourabh Bhosale, Spandan Roy
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.24450
Pdf link: https://arxiv.org/pdf/2606.24450
Abstract Perceiving physical contact is fundamental to dexterous manipulation. While robots often rely on dedicated hardware tactile sensors, humans exhibit a remarkable ability to infer contact by integrating visual information with an innate sense of their body's pose and movement. Inspired by this embodied perceptual skill, we investigate whether a robot can learn to infer contact from vision, an approach that also offers a scalable alternative to tactile hardware specifically for binary contact estimation, which faces practical challenges in cost, fragility, and integration. We present NoContactNoWorries, a transformer-based multimodal framework that fuses RGB-D vision with the robot's proprioception to infer binary contact states as a pseudo-tactile signal for hand-object interactions. We validate by training a single contact prediction model on multiple objects and show that the inferred contact signal supports downstream reinforcement learning agents for in-hand object reorientation, generalizing to novel objects. Experiments in both simulation and on a real-world robot validate our approach, highlighting the feasibility of inferring contact from vision and proprioception. Project Page: this https URL
中文摘要 感知身体接触是灵巧操作的基础。机器人通常依赖专用硬件触觉传感器，而人类则展现出通过将视觉信息与身体姿态和运动的天生感知相结合来推断接触的非凡能力。受这一具身感知技能的启发，我们研究机器人是否能从视觉中学会推断接触，这种方法也为二元接触估计提供了可扩展的替代，因为触觉硬件面临成本、脆弱性和整合性等实际挑战。我们介绍NoContactNoWorries，一种基于变压器的多模态框架，将RGB-D视觉与机器人本体感觉融合，推断二元接触状态，作为手与物体交互的伪触觉信号。我们通过训练单个接触预测模型对多个对象进行验证，并证明推断的接触信号支持下游强化学习代理进行手中对象重新定向，并推广到新对象。无论是在模拟中还是在现实世界机器人上的实验，都验证了我们的方法，凸显了通过视觉和本体感觉推断接触的可行性。项目页面：此 https URL

video-SALMONN-R$^3$: Learning to ReWatch, ReAsk, and ReAnswer for Efficient Video Understanding

视频-SALMONN-R$^3$：学习重看、再问和再答以实现高效视频理解

Authors: Yixuan Li, Guangzhi Sun, Yudong Yang, Wei Li, Zejun MA, Chao Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Sound (cs.SD)
Arxiv link: https://arxiv.org/abs/2606.24477
Pdf link: https://arxiv.org/pdf/2606.24477
Abstract Video large language models (LLMs) are often constrained by computation and memory budgets, leading them to use reduced frame rates and spatial resolutions, which may cause them to miss critical information for question answering (QA). A practical and efficient solution is a two-stage paradigm: first perform coarse video understanding to localize relevant segments, and then re-watch these segments at higher temporal or spatial fidelity. In this paper, we present video-SALMONN-R$^3$, the first end-to-end video-LLM that enables re-watch through reinforcement learning without relying on chain-of-thought (CoT) cold-start. This design removes the need for costly CoT data annotations and avoids CoT-based supervised fine-tuning (SFT), which can otherwise degrade the pretrained video understanding abilities. To address the mismatch between the reasoning-first behavior induced by re-watch and the answer-first tendency of pretrained video-LLMs, we propose a re-answer strategy, in which the model first produces a direct answer in the first watch and then refines it after re-watching. Finally, to improve question adherence during re-watching, we propose a re-ask mechanism that re-injects the query when revisiting localized segments. Experimental results show that video-SALMONN-R$^3$ consistently outperforms both the base model and the QA-SFT baseline, while surpassing prior re-watch-based approaches with significantly lower computational cost. Code, models, and data will be publicly released upon acceptance.
中文摘要 视频大型语言模型（LLMs）通常受计算和内存预算限制，导致它们使用较低的帧率和空间分辨率，这可能导致它们在问答（QA）时遗漏关键信息。一个实用且高效的解决方案是两阶段范式：首先进行粗略视频理解以定位相关片段，然后以更高的时间或空间保真度重看这些片段。本文介绍了视频-SALMONN-R$^3$，这是首个端到端视频大型语言模型，能够通过强化学习实现重看，而无需依赖思维链（CoT）冷启动。这种设计消除了昂贵的CoT数据注释需求，避免了基于CoT的监督微调（SFT），否则可能会削弱预训练视频的理解能力。为了解决重看引发的推理优先行为与预训练视频大型语言模型的先回答倾向之间的不匹配，我们提出了一种重答策略，即模型在首次观看时先给出直接答案，然后在重看后进行细化。最后，为了提高重看时的问题依从性，我们提出了一种重新提问机制，在重访局部片段时重新注入查询。实验结果显示，视频-SALMONN-R$^3$持续优于基础模型和QA-SFT基线，同时以显著更低的计算成本优于以往基于重看的方法。代码、模型和数据将在接受后公开发布。

Reinforcement Learning for Computer-Use Agents with Autonomous Evaluation

计算机智能体的强化学习，具备自主评估

Authors: Marta Sumyk, Oleksandr Kosovan
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2606.24515
Pdf link: https://arxiv.org/pdf/2606.24515
Abstract Computer-Use Agents (CUAs) execute high-level user goals by perceiving and acting directly within graphical user interfaces. However, reinforcement learning for CUAs remains difficult because open-ended desktop environments rarely provide scalable, machine-readable reward signals: task success is often visually grounded and hard to specify with handcrafted reward functions or dense manual labels. We propose an RL fine-tuning framework that uses autonomous vision-language evaluation as a scalable supervision signal for GUI agents. Given a final screenshot and the original instruction, a Vision-Language Model judges task completion and provides terminal feedback without task-specific heuristics or manual labels during policy optimization. Because autonomous evaluators are imperfect, we model their feedback as a noisy binary reward channel and derive a noise-corrected reward estimator for Proximal Policy Optimization. Experiments across macOSWorld, Windows Agent Arena, and OSWorld show that corrected evaluator rewards outperform both zero-shot baselines and raw evaluator rewards, improving success rates by an average of 12.6 percentage points over zero-shot performance and 5.1 points over raw evaluator fine-tuning. These results suggest that autonomous evaluation can serve as a practical reward signal for RL in GUI environments when evaluator noise is explicitly modeled and corrected.
中文摘要 计算机使用智能体（CUA）通过在图形用户界面中直接感知和操作来执行高层用户目标。然而，CUA的强化学习仍然困难，因为开放式桌面环境很少提供可扩展、机器可读的奖励信号：任务成功往往以视觉为依据，难以用手工设计的奖励函数或密集的人工标签来明确规定。我们提出了一个强化学习微调框架，利用自主视觉语言评估作为GUI智能体的可扩展监督信号。给定最终截图和原始指令，视觉语言模型判断任务是否完成，并在策略优化过程中提供终局反馈，无需任务特定的启发式规则或人工标签。由于自主评估器并不完美，我们将其反馈建模为带噪声的二元奖励通道，并推导出用于近端策略优化的噪声校正奖励估计器。在macOSWorld、Windows Agent Arena和OSWorld上的实验显示，校正后的评估器奖励优于零样本基线和原始评估器奖励，成功率平均比零样本性能提升12.6个百分点，比原始评估器微调提升5.1个百分点。这些结果表明，当评估器噪声被明确建模和校正时，自主评估可以作为GUI环境中强化学习的实用奖励信号。
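
The abstract does not give the form of its noise-corrected estimator, so the sketch below shows the standard unbiased correction for a binary label passed through a noisy channel, which may differ from the paper's derivation. It assumes the judge's false-positive rate `fp_rate` and false-negative rate `fn_rate` are known (for example, estimated on a small labeled set); both names and both functions are illustrative.

```python
def corrected_reward(observed: int, fp_rate: float, fn_rate: float) -> float:
    """Unbiased estimate of the true binary reward from a noisy judge verdict.

    fp_rate = P(judge says 1 | true reward 0)
    fn_rate = P(judge says 0 | true reward 1)
    """
    denom = 1.0 - fp_rate - fn_rate
    if denom <= 0:
        raise ValueError("judge must beat chance: fp_rate + fn_rate < 1")
    return (observed - fp_rate) / denom


def expected_corrected(true_reward: int, fp_rate: float, fn_rate: float) -> float:
    """Expectation of corrected_reward over the judge's noise (checks unbiasedness)."""
    p_one = 1.0 - fn_rate if true_reward == 1 else fp_rate
    return (p_one * corrected_reward(1, fp_rate, fn_rate)
            + (1.0 - p_one) * corrected_reward(0, fp_rate, fn_rate))


if __name__ == "__main__":
    print(corrected_reward(1, 0.1, 0.2), corrected_reward(0, 0.1, 0.2))
    print(expected_corrected(1, 0.1, 0.2), expected_corrected(0, 0.1, 0.2))
```

The corrected values fall outside [0, 1] for individual samples, but their expectation equals the true reward, so an advantage estimate built from them is unbiased at the cost of higher variance.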

PointVG-R: Internalizing Geometric Reasoning in MLLMs for Precise Pointing Localization via Visual Chain of Thought

PointVG-R：通过视觉思维链实现精确指向定位的MLLM中的几何推理

Authors: Ling Li, Bowen Liu, Zinuo Zhan, Jianhui Zhong, Ziyu Zhu, Bingcai Wei, Kenglun Chang, Zhidong Deng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.24539
Pdf link: https://arxiv.org/pdf/2606.24539
Abstract Pointing-based visual grounding requires models to precisely locate target objects by deciphering complex spatial relationships between the visual scene and pointing gestures. Traditional methods typically encode input images into static feature representations and perform reasoning primarily within the linguistic domain, often overlooking the rich perceptual cues and explicit spatial geometry inherent in images. In this study, we aim to mitigate the cognitive vulnerability of models in interpreting gestural spatial relations by proposing PointVG-R, a reasoning-guided Multi-modal Large Language Model (MLLM). PointVG-R introduces geometric-aware reasoning for pointing-based grounding, enabling the model to think with images through the strategic integration of Reinforcement Learning (RL) and cold-start data. Specifically, we design a novel geometric reasoning pipeline that simulates the iterative cognitive process humans employ when interpreting pointing gestures. Furthermore, we construct EgoPoint-CoT, a high-quality visual Chain-of-Thought (CoT) dataset featuring detailed reasoning trajectories to guide the model via Supervised Fine-Tuning (SFT) and RL. To address the varying quality of learning signals encountered during training, we further propose an Adaptive Importance Weighting strategy based on Group Variance, which dynamically adjusts reward signals to optimize the learning process. Experimental results demonstrate that PointVG-R achieves SOTA performance, outperforming the baseline by $\textbf{15.86}$ points in mIoU. Extensive ablation studies further validate the efficacy of our proposed modules. Code: this https URL.
中文摘要 基于指向的视觉基础要求模型通过解读视觉场景与指向手势之间的复杂空间关系，精确定位目标物体。传统方法通常将输入图像编码为静态特征表示，主要在语言领域进行推理，常忽略图像中丰富的感知线索和显式空间几何。本研究旨在通过提出PointVG-R——一种以推理为导向的多模态大型语言模型（MLLM）——来缓解模型在解释手势空间关系时的认知脆弱性。PointVG-R 引入了基于指向的几何感知推理，使模型能够通过强化学习（RL）和冷启动数据的战略整合，通过图像进行思考。具体来说，我们设计了一种新的几何推理流程，模拟人类在解读指向手势时所采用的迭代认知过程。此外，我们构建了EgoPoint-CoT，一个高质量的视觉思维链（CoT）数据集，通过监督式微调（SFT）和强化学习（RL）详细推理轨迹来指导模型。为了应对训练过程中遇到的学习信号质量差异，我们进一步提出了基于组方差的自适应重要性加权策略，动态调整奖励信号以优化学习过程。实验结果显示，PointVG-R实现了SOTA性能，在mIoU上比基线高出$\textbf{15.86}$点。广泛的消融研究进一步验证了我们提出模块的有效性。代码：这个 https URL。
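
The abstract names an Adaptive Importance Weighting strategy based on Group Variance but gives no formula. One plausible reading, offered only as a sketch, is to scale group-normalized advantages by how much the rewards within a sampled group disagree, so that groups with near-identical rewards contribute little signal. The weighting rule, the `max_var` normalizer (0.25 is the largest variance of rewards in [0, 1]), and the function name are all assumptions.

```python
from statistics import mean, pvariance


def weighted_group_advantages(rewards, max_var: float = 0.25, eps: float = 1e-8):
    """Group-normalized advantages scaled by a variance-based weight (hypothetical rule).

    rewards: rewards of several responses sampled for the same prompt.
    """
    mu = mean(rewards)
    var = pvariance(rewards, mu)          # population variance within the group
    std = var ** 0.5
    weight = min(var / max_var, 1.0)      # low-variance groups are down-weighted
    return [weight * (r - mu) / (std + eps) for r in rewards]


if __name__ == "__main__":
    print(weighted_group_advantages([1.0, 0.0, 1.0, 0.0]))   # informative group
    print(weighted_group_advantages([1.0, 1.0, 1.0, 1.0]))   # uninformative group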

ASALT: Adaptive State Alignment for Lateral Transfer in Multi-agent Reinforcement Learning

ASALT：多智能体强化学习中横向转移的自适应状态对齐

Authors: Anurag Akula, Satheesh K. Perepu, Abhishek Sarkar, Kaushik Dey
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2606.24601
Pdf link: https://arxiv.org/pdf/2606.24601
Abstract Multi-agent reinforcement learning (MARL) addresses the problem of training multiple agents that pursue collaborative, competitive, or mixed objectives. Prior work has investigated transfer learning between source and target domains in MARL; however, the majority of existing approaches impose the constraint that the dimensionalities of the observation space and the global state space must be identical across domains. In this paper, we introduce a method that explicitly accommodates mismatched state-space dimensionalities between source and target domains. The proposed approach, ASALT, incorporates both observation-level and state-level adapters that map the target-domain observations and global states into a shared embedding space, thereby enabling more effective transfer of knowledge across both actors and critics. These adapters can generate embeddings that support efficient strategy transfer across heterogeneous domains. Experimental results on multiple configurations in standard benchmark environments demonstrate that ASALT surpasses existing baselines in terms of sample efficiency and global return in cooperative settings, but its effectiveness depends on the degree of mismatch between source and target domains. Furthermore, our findings indicate that ASALT mitigates negative transfer, which frequently constitutes a major obstacle when transferring policies between domains with differing observation and action spaces.
中文摘要 多智能体强化学习（MARL）解决了训练多个智能体追求协作、竞争或混合目标的问题。此前的研究曾探讨MARL中源域与目标域之间的迁移学习;然而，大多数现有方法都要求观测空间和全局状态空间的维数必须在域间相同。本文介绍了一种方法，明确适应源域与目标域之间状态空间维数不匹配的情况。所提方法ASALT结合了观测层和状态层的适配器，将目标域观测值和全局状态映射到共享嵌入空间，从而实现了行为者和批评者之间的知识传递。这些适配器可以生成嵌入，支持跨异构域的高效策略传输。在标准基准环境中，多种配置的实验结果表明，ASALT在合作环境中的样本效率和全局回报方面优于现有基线，但其有效性取决于源域与目标域之间的不匹配程度。此外，我们的发现表明，ASALT能够缓解负向转移，这在不同观察和行动空间的领域之间转移策略时常是主要障碍。
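
The core idea of mapping observations of different dimensionality into a shared embedding space can be illustrated with one small adapter per domain. This is only a shape-level sketch: the adapters here are untrained random linear maps, whereas the paper's adapters are presumably learned, and the class name `LinearAdapter` and the dimensions are hypothetical.

```python
import random


class LinearAdapter:
    """Maps a domain-specific vector into a shared embedding space (untrained sketch)."""

    def __init__(self, in_dim: int, embed_dim: int, seed: int = 0):
        rng = random.Random(seed)
        scale = in_dim ** -0.5
        # One weight row per embedding coordinate.
        self.weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
                        for _ in range(embed_dim)]

    def __call__(self, x):
        if len(x) != len(self.weights[0]):
            raise ValueError("input dimension does not match this adapter")
        return [sum(w * v for w, v in zip(row, x)) for row in self.weights]


if __name__ == "__main__":
    source_adapter = LinearAdapter(in_dim=5, embed_dim=4)   # source-domain observations
    target_adapter = LinearAdapter(in_dim=8, embed_dim=4)   # target-domain observations
    z_src = source_adapter([0.1] * 5)
    z_tgt = target_adapter([0.1] * 8)
    print(len(z_src), len(z_tgt))  # both embeddings have the shared dimension
```

Because both domains end in the same embedding dimension, a policy or critic network that consumes the embedding can be reused across domains even though the raw observation sizes differ.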

ViTexQA: A Multi-Frame Temporal Perception Dataset for Video Text Question Answering

ViTexQA：多帧时间感知数据集，用于视频文本问答

Authors: Zhentao Guo, Chen Duan, Tongkun Guan, Zining Wang, Kai Zhou, Pengfei Yan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2606.24602
Pdf link: https://arxiv.org/pdf/2606.24602
Abstract Despite remarkable progress in multimodal understanding, current MLLMs still exhibit limitations in video text understanding, particularly when semantics emerge through the integration of temporally distributed textual cues across multiple frames. This perception challenge fundamentally differs from static image text understanding, yet existing datasets fail to capture: the vast majority of questions remain answerable from single frames, inadequately reflecting real-world video text comprehension demands. To address this, we present ViTexQA, a large-scale video-text QA dataset, and FrameThinker for robust multi-frame temporal reasoning. We build ViTexQA via a quality-controlled Chain-of-Thought (CoT) annotation pipeline boosted with temporal constraints; all its QA pairs demand cross-frame text fusion to solve, enforcing true temporal reliance. FrameThinker adopts two-stage training for explicit temporal modeling: CoT-Guided Supervised Fine-Tuning (SFT) generates frame-aware reasoning chains, followed by Temporally-grounded Reinforcement Learning (RL) optimized with multi-frame coherence rewards. Evaluations show our method outperforms SOTA baselines on ViTexQA, lifting ROUGE-L by 6.3%.
中文摘要 尽管多模态理解取得了显著进展，当前的MLLM在视频文本理解方面仍存在局限性，尤其是在通过多帧时间分布文本线索整合而出现语义时。这种感知挑战与静态图像文本理解本质上不同，但现有数据集未能捕捉到这些信息：绝大多数问题仍可从单帧中回答，未能充分反映现实视频文本理解需求。为此，我们介绍了ViTexQA（一个大规模的视频-文本质量保证数据集）和FrameThinker，用于稳健的多帧时间推理。我们通过质量控制的Chain-of-Thought（CoT）注释流水线构建ViTexQA，并加以时间约束;所有质量保证对都需要跨帧文本融合来解决，从而强制执行真正的时间依赖。FrameThinker 采用显式时间建模的两阶段训练：CoT 引导的监督微调（SFT）生成具备框架感知的推理链，随后是经过多帧一致性奖励优化的时序基础强化学习（RL）。评估显示，我们的方法在ViTexQA上优于SOTA基线，ROUGE-L提升了6.3%。

Themis: An explainable AI-enabled framework for Reinforcement Learning with Human Feedback

Themis：一个可解释的人工智能框架，用于基于人类反馈的强化学习

Authors: Andreas Chouliaras, Luke Connolly, Dimitris Chatzpoulos
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2606.24622
Pdf link: https://arxiv.org/pdf/2606.24622
Abstract Training safe Reinforcement Learning (RL) systems is inherently challenging, with no guarantee of avoiding unwanted behaviors. The most effective defenses against this are (i) transparency through explainability and (ii) alignment via human feedback. While both show promising results, no publicly available framework currently combines them. To address this, we introduce Themis, an XAI-enabled testing and evaluation framework for Reinforcement Learning from Human Feedback. Themis supports over 200 widely used environments and is easily configurable for experiments in RL, transparency, and alignment. Our results show that Themis can train reward models that match or outperform the environment's true reward signal using human preferences. We also provide a cloud-based platform for collecting human feedback and managing experiments. It is user-friendly, auto-scalable, and supports large participant groups across multiple experiments without extra development overhead. Tests show Themis can support one thousand users in back-to-back experiments on a modest commercial machine.
中文摘要 培训安全的强化学习（RL）系统本质上具有挑战性，且无法保证避免不良行为。对此最有效的防御是：（i）通过可解释性实现透明，（ii）通过人工反馈实现一致性。虽然两者都显示出有希望的结果，但目前没有公开的框架将它们结合起来。为此，我们介绍了Themis，这是一个基于XAI的测试和评估框架，用于从人类反馈中强化学习。Themis 支持超过 200 个广泛使用的环境，并且易于配置用于强化学习、透明度和对齐等实验。我们的结果表明，Themis能够利用人类偏好训练出匹配或超越环境真实奖励信号的奖励模型。我们还提供了一个基于云的平台，用于收集人类反馈和管理实验。它用户友好，可自动扩展，支持跨多个实验的大型参与者群体，无需额外开发开销。测试显示，Themis 可以在一台较小的商用机器上支持一千名用户连续进行实验。
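
Reward models trained from human preferences are commonly fit with a Bradley-Terry objective: the model should score the preferred trajectory segment higher than the rejected one. The abstract does not state which loss Themis uses, so the sketch below shows only this common choice, with a hypothetical function name.

```python
import math


def preference_loss(r_preferred: float, r_rejected: float) -> float:
    """Negative log-likelihood of the human's choice under a Bradley-Terry model.

    Equals -log(sigmoid(r_preferred - r_rejected)), computed in a numerically stable way.
    """
    diff = r_preferred - r_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


if __name__ == "__main__":
    print(preference_loss(2.0, 0.0))  # small loss: model agrees with the human
    print(preference_loss(0.0, 2.0))  # large loss: model disagrees
```

Minimizing this loss over many labeled pairs pushes the predicted reward of preferred segments above that of rejected ones, which is how pairwise feedback becomes a scalar reward signal.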

CineCap: Structured Reasoning with Spatio-Temporal Anchors for Cinematographic Video Captioning

CineCap：电影视频字幕的结构化推理与时空锚点

Authors: Xinyu Mao, Yuhui Zeng, Xiaokun Liu, Wenyu Qin, Meng Wang, Xin Tao, Pengfei Wan, Xiaohan Xing, Max Meng
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.24636
Pdf link: https://arxiv.org/pdf/2606.24636
Abstract Cinematographic captioning aims to describe how a video is filmed using professional film-language concepts such as camera movement, shot size, depth of field, composition, and shooting angle. This capability is important for fine-grained video understanding and controllable movie-quality video generation, yet remains underexplored in existing multimodal large language models. Unlike question-answering-based evaluation of cinematic understanding, cinematographic captioning requires a unified open-form description over multiple cinematographic dimensions. This task is challenging for two main reasons: the model must infer professional cinematographic concepts from subtle visual evidence, and it must generate captions that are both comprehensive and accurate. Accordingly, we propose CineCap, a framework that combines structured reasoning with spatio-temporal anchors and reinforcement learning with comprehensiveness, accuracy, and gated coverage rewards. The former grounds professional cinematographic descriptions in explicit visual evidence and organizes them into compact atomic reasoning for supervised fine-tuning, while the latter improves the balance between descriptive completeness and factual correctness. In addition, we construct CineCap Bench, a benchmark of 472 manually annotated video-caption pairs for systematic evaluation. Extensive experiments show that CineCap consistently outperforms strong proprietary and open-source baselines, establishing a new state of the art for cinematographic captioning. The code, model checkpoint, and benchmark are publicly available in this https URL.
中文摘要 电影字幕旨在通过专业的电影语言概念，如摄像机运动、镜头大小、景深、构图和拍摄角度，描述视频的拍摄方式。这一能力对于细粒度视频理解和可控的电影质量视频生成非常重要，但在现有多模态大型语言模型中仍未被充分开发。与基于问答的电影理解评估不同，电影字幕需要在多个电影维度上统一的开放形式描述。这项任务之所以具有挑战性，主要有两个原因：模型必须从细微的视觉证据中推断专业的电影概念，并且必须生成既全面又准确的字幕。因此，我们提出了CineCap，一种结合结构化推理、时空锚点和强化学习的框架，兼具全面性、准确性和门槛覆盖奖励。前者将专业电影描述建立在明确的视觉证据之上，并将其组织成紧凑的原子推理以供监督微调，而后者则提升了描述完整性与事实准确性之间的平衡。此外，我们还构建了CineCap Bench，这是一个包含472对手动注释视频-字幕对的基准测试，用于系统评估。大量实验表明，CineCap 持续优于强有力的专有和开源基线，奠定了电影字幕的新技术水平。代码、模型检查点和基准测试均在此 https URL 中公开。

LaGO: Latent Action Guidance for Online Reinforcement Learning

LaGO：在线强化学习的潜在行动指导

Authors: Kuan-Yen Liu, Ren-Jyun Huang, Ti-Rong Wu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2606.24669
Pdf link: https://arxiv.org/pdf/2606.24669
Abstract Large language models (LLMs) have shown strong potential for planning and sequential decision-making, but prior work often relies on using them as direct controllers, which requires precise action generation and can be unreliable in practice. This paper proposes Latent Action Guidance for Online Reinforcement Learning (LaGO), a framework that uses a pretrained LLM as a latent action prior to softly guide online policy optimization, rather than treating the LLM as an explicit planner or controller. Experiments on both a discrete-control benchmark, CLEVR-Robot, and a continuous-control benchmark, Meta-World, demonstrate that LaGO consistently improves both reward and success rate over Vanilla PPO. In particular, LaGO increases the average success rate from 15.1% to 27.2% on CLEVR-Robot and from 2.7% to 15.2% on Meta-World. Our analysis further shows that stronger pretrained LLMs provide more effective guidance, suggesting that LLM knowledge can improve planning and online decision-making.
中文摘要 大型语言模型（LLMs）在规划和顺序决策方面展现出强大潜力，但以往工作常依赖于将其作为直接控制器，这需要精确的动作生成，且在实际操作中可能不可靠。本文提出了在线强化学习的潜在行动指导（LaGO），该框架将预训练的大型语言模型作为潜在动作，先软性引导在线策略优化，而非将LLM视为显式的规划者或控制者。在离散对照基准CLEVR-Robot和连续对照基准Meta-World上的实验表明，LaGO在奖励和成功率上均优于原版PPO。特别是，LaGO将CLEVR-Robot的平均成功率从15.1%提升到27.2%，Meta-World平台从2.7%提升到15.2%。我们的分析进一步表明，更强的预训练LLM提供了更有效的指导，表明LLM知识能够提升规划和在线决策能力。
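
"Softly guiding" a policy with a prior, rather than letting the prior act directly, is often implemented as a KL penalty added to the policy loss. The sketch below illustrates that generic pattern over a discrete action distribution. It is not LaGO's actual objective, which the abstract describes as operating with a latent action prior; the functions, the `beta` coefficient, and the example distributions are assumptions.

```python
import math


def kl_divergence(p, q, eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def guided_objective(policy_loss: float, policy_probs, prior_probs,
                     beta: float = 0.1) -> float:
    """Policy loss plus a soft penalty for drifting away from the prior.

    beta = 0 recovers the unguided loss; larger beta follows the prior more closely.
    """
    return policy_loss + beta * kl_divergence(policy_probs, prior_probs)


if __name__ == "__main__":
    prior = [0.7, 0.2, 0.1]        # hypothetical action prior derived from an LLM
    policy = [0.4, 0.4, 0.2]       # current policy's action distribution
    print(guided_objective(1.0, policy, prior))
```

Because the prior enters only through a penalty, the policy can still override it when environment reward disagrees, which is what distinguishes guidance from using the LLM as a controller.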

Keyword: diffusion policy

There is no result