Arxiv Papers of Today

Generated: 2026-07-02 18:43:23 (UTC+8); arXiv release time: 2026-07-02 20:00 EDT (2026-07-03 08:00 UTC+8)

42 relevant papers today

Keyword: reinforcement learning

Trajectory Learning with Graph Representations for Social Robot Navigation

Authors: Berke Kartal, Burcu Kilic, Yigit Yildirim, Emre Ugur
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2607.00028
Pdf link: https://arxiv.org/pdf/2607.00028
Abstract Autonomous mobile robots are expected to exhibit socially compliant navigation for minimizing pedestrian disturbance. While capturing social interactions and incorporating pedestrian motion estimations into decision-making are beneficial for compliance, prior methods fail to address both spatial and temporal characteristics present in real-world data. Reinforcement Learning offers high capability, but it requires hand-crafted reward functions that reduce social behavior to static criteria, limiting its ability to reproduce patterns that exist in real pedestrian behavior. Imitation Learning offers direct training from real-world data but lacks modeling of social interactions and suffers from error accumulation. To this end, we propose an imitation learning framework that leverages spatiotemporal dynamics for socially compliant navigation. To represent social context based on interactions, we introduce a graph-based auxiliary network that encodes crowd states by attending to pedestrians. In addition, we present a navigation module that captures temporal dynamics and mitigates error accumulations by incorporating encoded state predictions and employing a trajectory-level learning objective. Our framework outperforms established data-driven baselines on simulation and a real-world dataset across diverse social metrics.

Learning Dexterous Manipulation Using Contact Wrench Guidance From Human Demonstration

Authors: Xinghao Zhu, Zixi Liu, Shalin Jain, Chenran Li, Milad Noori, Huihua Zhao, John Welsh, Michael Andres Lin, Wei Liu, Tingwu Wang, Xingye Da, Zhengyi Luo, Vishal Kulkarni, Naema Bhatti, Yuke Zhu, Linxi Fan, Bowen Wen, Danfei Xu, Soha Pouya, Yan Chang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2607.00033
Pdf link: https://arxiv.org/pdf/2607.00033
Abstract Dexterous robot manipulation can benefit from the abundance of human demonstrations, but transferring such demonstrations to robot policies remains challenging. We present Contact Wrench Guidance from Human Demonstration in Robotic Dexterous Manipulation (CHORD), a framework for long-horizon manipulation of rigid and articulated objects with reinforcement learning. The key idea is object-centric contact wrench space guidance: we represent human and robot motions by the forces and torques they can induce on the object, enabling similarity to be measured by the induced instantaneous motions. This guidance makes reinforcement learning more scalable for contact-rich dexterous manipulation. We further introduce a large-scale simulation benchmark with 4,739 bimanual dexterous manipulation tasks, constructed from motion-capture datasets and reconstructed in-house videos. Evaluated on 1,831 benchmark tasks, CHORD achieves an average success rate of 82.12%, demonstrating strong scalability. CHORD also generalizes to whole-body manipulation from hand-only and third-person demonstrations, achieving a 90.77% success rate, and the learned policies transfer to the real world in both open-loop and closed-loop settings.

Bayesian updates from coalgebraic determinisation

Authors: Manuel Baltieri, Nathaniel Virgo
Subjects: Logic in Computer Science (cs.LO); Formal Languages and Automata Theory (cs.FL); Systems and Control (eess.SY); Probability (math.PR)
Arxiv link: https://arxiv.org/abs/2607.00034
Pdf link: https://arxiv.org/pdf/2607.00034
Abstract The powerset construction is the classical determinisation procedure for nondeterministic finite automata. In the coalgebraic setting, this construction has been generalised to structured coalgebras, which are coalgebras equipped with extra data. For stochastic Moore machines over the distribution monad, a type of structured coalgebra, the determinisation construction induces a semantics assigning to each finite input word a distribution on the current output. This semantics is appropriate when only the current output matters, but it is too coarse for settings in which intermediate observations must also be taken into account, as is typical for agents solving POMDPs in control theory and reinforcement learning. In these contexts, agents need to condition on all realised observations, not just the final one, so as to better plan for the future. This has been addressed from a category-theoretic perspective through a procedure called "unifilarisation", which (in our context) takes a stochastic Mealy machine and produces a machine whose states are priors over the original state space and whose transitions are given by Bayesian filtering. Here we show that unifilarisation is an instance of coalgebraic determinisation. We work with Mealy machines over monads equipped with extra structure generalising the notion of the support of a distribution. We show that in this setting, unifilarisation arises from the general determinisation procedure. We then compare the resulting final coalgebra semantics with the Moore-style one. Instead of assigning only a distribution on current outputs to each finite input word, it yields causal stochastic behaviours, that is, families mapping input words to distributions on output words compatible with the "causality" constraint that outputs cannot depend on future inputs.
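The Bayesian filtering transition mentioned in the abstract can be illustrated with a small sketch. This is not the paper's categorical construction; it is the ordinary finite-state update, and the dictionary encoding of the kernel, the function name `bayes_filter_step`, and the two-state example machine are all assumptions made for illustration.

```python
from collections import defaultdict


def bayes_filter_step(belief, kernel, action, observation):
    """One Bayesian filtering update for a finite stochastic Mealy machine.

    belief: dict state -> probability (the current prior over states).
    kernel: dict (state, action) -> dict (observation, next_state) -> probability.
    Returns the posterior over next states given the realised observation.
    """
    unnormalised = defaultdict(float)
    for state, p_state in belief.items():
        for (obs, next_state), p in kernel[(state, action)].items():
            if obs == observation:
                unnormalised[next_state] += p_state * p
    evidence = sum(unnormalised.values())
    if evidence == 0.0:
        # Conditioning is undefined outside the support of the predicted outputs.
        raise ValueError("observation has zero probability under the current belief")
    return {s: w / evidence for s, w in unnormalised.items()}


# Hypothetical two-state machine with a single input symbol "a".
KERNEL = {
    ("s0", "a"): {("x", "s0"): 0.5, ("y", "s1"): 0.5},
    ("s1", "a"): {("x", "s1"): 0.9, ("y", "s0"): 0.1},
}
prior = {"s0": 0.5, "s1": 0.5}
posterior = bayes_filter_step(prior, KERNEL, "a", "x")
print(posterior)
```

Iterating this step over a sequence of inputs and observations gives a deterministic machine whose states are beliefs, which is the informal picture behind the unifilarised machine described above.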

Active Sensing for RIS-Aided Tracking and Power Control: A Hybrid Neuroevolution and Supervised Learning Approach

Authors: George Stamatelis, Hui Chen, Henk Wymeersch, George C. Alexandropoulos
Subjects: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2607.00056
Pdf link: https://arxiv.org/pdf/2607.00056
Abstract This paper studies energy-efficient tracking of power-limited mobile users with the assistance of a Reconfigurable Intelligent Surface (RIS). Since localization pilot transmissions dominate the energy budget of power-constrained devices, we introduce a low-overhead feedback link from the Base Station (BS) to the user to enable dynamic uplink power control. To navigate the discrete and decentralized nature of this active sensing problem, we propose a novel Dual-Agent (DA) deep learning framework that jointly optimizes the discrete RIS phase profiles and the UE's transmit power in real time. Specifically, our approach employs a hybrid training methodology integrating the neuroevolution paradigm with supervised learning, effectively overcoming the non-differentiability of discrete phase responses from the RIS unit elements and the strict information bottleneck of single-bit feedback messages for pilot power control. The proposed DA active sensing framework can be applied with both single- and multi-antenna BSs, the latter with only minor modifications in the structure of one NN: an additional output branch with appropriate structure is included for the latter case to select a valid digital combiner from a finite set. Extensive numerical simulations demonstrate that the proposed scheme achieves highly accurate and robust tracking across diverse target motion models, outperforming extended Kalman and particle filters, as well as machine learning-based trackers. Furthermore, in static localization, it is shown to significantly outperform traditional fingerprinting schemes, deep reinforcement learning baselines, and standard backpropagation-based estimators.

Learning Expert Strategy for Autonomous Robotic Endovascular Intervention via Decoupled Procedural Execution

Authors: Yanxi Chen, Tianliang Yao, Shaolong Tang, Jiyuan Zhao, Hengyu Hu, Zhaoxing Li, Antonio J. Sánchez Egea, Peng Qi
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2607.00066
Pdf link: https://arxiv.org/pdf/2607.00066
Abstract Endovascular interventions are high-stakes procedures requiring precise device operation within complex and tortuous vascular anatomies. Autonomous endovascular navigation has the potential to standardize procedural quality and reduce the performance variability inherent in manual operation. Although Reinforcement Learning (RL) approaches have demonstrated promise in enabling autonomy in endovascular intervention, they often struggle with explicit constraint satisfaction and safety guarantees. To address these challenges, a learning-based expert strategy is introduced, enhancing procedural consistency in autonomous endovascular intervention by explicitly decoupling high-level strategic decision-making from low-level procedural execution. The proposed framework replicates the expert clinical decision-making process: a strategic RL policy generates global navigation intents, which are subsequently refined through an expert-informed execution module. This module ensures that robot movements strictly adhere to expert operational norms, real-time kinematic limits, and vessel safety constraints. Experimental evaluation across high-fidelity 3D simulations and a real-world robotic platform demonstrates that the proposed framework not only outperforms baseline policies but also effectively replicates expert-level proficiency. The framework achieves a high navigation success rate (> 96%) and a 29.3% reduction in operational steps, which translates to enhanced operative efficiency and minimized device-vessel interaction. Furthermore, a 13% reduction in trajectory variance indicates superior procedural standardization, aligning autonomous behavior with established clinical norms. These results underscore its potential to enhance the predictability, safety, and consistency of robotic endovascular interventions.

RareDxR1: Autonomous Medical Reasoning for Rare Disease Diagnosis Beyond Human Annotation

Authors: Deyang Jiang, Haoran Wu, Ziyi Wang, Yiming Rong, Yunlong Zhao, Ye Jin, Bo Xu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2607.00147
Pdf link: https://arxiv.org/pdf/2607.00147
Abstract Rare disease differential diagnosis is a critical yet arduous clinical task, requiring physicians to identify precise phenotypes from complex, unstructured patient symptoms and execute intricate reasoning within a vast search space. However, existing AI approaches typically rely on pipeline-based phenotype extraction or retrieval-augmented generation, which suffer from critical information loss due to predefined ontologies, retrieval bottlenecks, and a lack of diagnostic logic. To address these challenges, we introduce RareDxR1, an end-to-end reasoning-centric large language model designed for open-domain rare disease diagnosis directly from unstructured clinical notes. We design a progressive end-to-end training framework by synergizing knowledge internalization with autonomous evolutionary learning, thereby bypassing reliance on structured phenotypes and closed-set decision-making. To overcome the limitations of RAG and phenotype restriction, we enabled the deep internalization of fragmented rare-disease knowledge directly into the model's parameters. Moreover, to bridge the gap between model generation and expert reasoning, we propose Reflection-Enhanced Reasoning Sampling (RERS), a strategy that synthesizes expert-level diagnostic trajectories by learning from failures without human annotation. Additionally, we propose a dual-level curriculum reinforcement learning approach for gradually mastering rare disease diagnosis. Experimental results demonstrate that RareDxR1 achieves state-of-the-art accuracy across different benchmarks, marking a significant breakthrough in open-domain rare disease diagnosis. Our code and dataset will be publicly available.

A Contextual-Bandit Oversight Game with Two-Sided Informational Asymmetry

Authors: Yunjin Tong
Subjects: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Arxiv link: https://arxiv.org/abs/2607.00155
Pdf link: https://arxiv.org/pdf/2607.00155
Abstract We study runtime human oversight of an AI agent when private information runs in both directions: the human privately knows her reward function, while the AI privately knows the quality of the action it proposes. This is the kind of asymmetry that arises naturally when an autonomous robot or software agent has inspected a situation its human supervisor cannot directly assess. Building on Cooperative Inverse Reinforcement Learning (CIRL) and the Oversight Game, we introduce a contextual-bandit team game with two-sided asymmetric information and a play/ask/trust/oversee interface. The bandit structure removes physical state transitions and thereby yields exact one-shot characterizations that would remain conjectural in the full POMDP setting, though the common belief remains a dynamically controlled state across rounds. We give two one-shot characterizations, a team optimum and a behaviorally natural myopic rule, whose gap is a slab of avoidable harm: a region in which the AI privately knows the proposed action is harmful and shutdown would help, yet a myopic human, trusting her prior, declines to oversee. We show this gap is the price of non-credible oversight communication, and give a partial analysis of how it resolves dynamically over repeated rounds through passive learning and active signaling with a one-period-lagged oversight response.

Distributed Multi Robot Lunar Cargo Transportation via Phase Decomposed Reinforcement Learning

Authors: Ashutosh Mishra, Elian Neppel, Shreya Santra, Antoine Jonquières, Muhammad Athallah Naufal, Kentaro Uno, Kazuya Yoshida
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2607.00160
Pdf link: https://arxiv.org/pdf/2607.00160
Abstract Modular reconfigurable robotic systems provide a scalable solution for cooperative surface operations in future lunar missions. However, cooperative cargo transportation remains challenging due to morphology-dependent topology changes, strong payload-induced coupling, long-horizon decision making, and safety constraints. This paper proposes a phase-decomposed reinforcement learning framework for cooperative cargo transport with distributed robotic units. The task is decomposed into lifting, transportation, and placement, each optimized with a dedicated joint-state policy capturing inter-agent coupling. Centralized training promotes stable convergence, while deployment uses onboard proprioception for control and OptiTrack motion capture for ground-truth evaluation and post-processed metrics. A deterministic phase controller expressed in Markov state representation regulates transitions between stages, and a failure-sensitive synchronization mechanism ensures coordinated progression and safety-aware halting during real-world execution. The framework is evaluated in simulation and through controlled field experiments at a JAXA space exploration test facility. Results demonstrate reliable cooperative transport across all stages in both simulation and hardware experiments.
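The staged structure in the abstract (lifting, transportation, placement, with failure-sensitive halting) can be pictured as a small deterministic state machine. The sketch below is an assumption-laden illustration, not the authors' controller: the `Phase` names beyond the three stages, the per-robot `robots_done` and `robots_failed` flags, and the rule that any single failure halts the team are all hypothetical.

```python
from enum import Enum


class Phase(Enum):
    LIFTING = "lifting"
    TRANSPORTATION = "transportation"
    PLACEMENT = "placement"
    DONE = "done"
    HALTED = "halted"


# Fixed ordering of the three task phases named in the abstract.
NEXT_PHASE = {
    Phase.LIFTING: Phase.TRANSPORTATION,
    Phase.TRANSPORTATION: Phase.PLACEMENT,
    Phase.PLACEMENT: Phase.DONE,
}


def step_phase(phase, robots_done, robots_failed):
    """Deterministic transition that depends only on the current phase and flags.

    robots_done / robots_failed: one boolean per robot for the current phase.
    """
    if phase in (Phase.DONE, Phase.HALTED):
        return phase              # terminal phases absorb
    if any(robots_failed):
        return Phase.HALTED       # any failure stops the whole team
    if all(robots_done):
        return NEXT_PHASE[phase]  # advance only when every robot is ready
    return phase                  # otherwise keep running the current policy


phase = step_phase(Phase.LIFTING, [True, True], [False, False])
print(phase)
```

Because the next phase depends only on the current phase and the current flags, the controller is Markov in the sense used in the abstract; each non-terminal phase would select the corresponding learned policy.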

Verifiable Rewards for Calibrated Probabilistic Forecasting

Authors: Sadanand Singh, Allam Reddy, Manan Chopra
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2607.00164
Pdf link: https://arxiv.org/pdf/2607.00164
Abstract Reinforcement learning with verifiable rewards can in principle train calibrated probabilistic forecasters, since a proper scoring rule such as the Brier score is computed from outcomes alone and is minimized in expectation by the true probability. In practice it degrades calibration, and existing remedies address epistemic uncertainty, where a model's confidence accompanies a verifiably correct or incorrect answer. We study aleatoric forecasting, where the forecast itself is the output and the label is one stochastic outcome, taking NFL in-game win probability as a testbed with the betting market as a reference. Rewarding the realized per-play outcome fails, because the single outcome is a noisy target and the policy gradient corrupts the chain of thought. We introduce a verifiable, label-free reward, a state-conditioned empirical win rate estimated from past outcomes, that removes the label noise, and we keep the gradient off the reasoning, by direct prediction or a gradient mask, so it cannot be corrupted. Trained with this reward alone, without human labels or supervised fine-tuning, a 7B model reaches the calibration of the betting market by direct prediction and is better calibrated than a zero-shot frontier model. That frontier model and a tabular estimator reach the same Brier score as this model, identifying the market's small remaining edge as live in-game information beyond their shared inputs. Masking the gradient, rather than dropping the chain of thought, preserves reasoning from which the forecast follows, which ordinary chain-of-thought training corrupts.
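The claim that the Brier score "is minimized in expectation by the true probability" can be checked numerically. For a binary event with true probability p and forecast q, the expected Brier score is p(1 - q)^2 + (1 - p)q^2. The sketch below grid-searches q; the function names and the value 0.7 are illustrative assumptions and are not taken from the paper.

```python
def expected_brier(forecast, true_prob):
    """Expected Brier score of a forecast when the event occurs with true_prob."""
    return true_prob * (1.0 - forecast) ** 2 + (1.0 - true_prob) * forecast ** 2


def best_forecast(true_prob, grid_size=101):
    """Grid-search the forecast in [0, 1] with the lowest expected Brier score."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return min(grid, key=lambda q: expected_brier(q, true_prob))


true_prob = 0.7
print(best_forecast(true_prob))            # the minimiser sits at the true probability
print(expected_brier(0.7, true_prob))      # honest forecast
print(expected_brier(1.0, true_prob))      # overconfident forecast scores worse
```

A single realised outcome is either 0 or 1, so scoring against it gives a noisy estimate of this expectation, which is the difficulty the abstract attributes to rewarding the per-play outcome.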

Play Like Champions: Counterfactual Feedback Generation in Latent Space

Authors: Andrzej Białecki, Adam Mastalerz, Han Zhou
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2607.00190
Pdf link: https://arxiv.org/pdf/2607.00190
Abstract Recent advances in reinforcement learning have produced superhuman agents across a wide range of competitive games. As a byproduct, researchers have begun studying how these agents play, extracting behavioral representations, analyzing decision structure, and modeling the latent geometry of expert performance. However, this growing body of work has overwhelmingly focused on defeating human players rather than providing feedback, leaving a critical gap in creating model solutions to improve human players. Unlike chess and Go, where AI has become integral to player training, real-time strategy (RTS) games lack principled frameworks for translating expert knowledge into actionable feedback. We introduce Latent Maps of Performance, a framework for counterfactual path generation. We focus on StarCraft II data to model player improvement as an algorithmic recourse within a learned representation space. As inspiration for our work, we have looked at the championship model used in sports science. We trained a Guided Variational Autoencoder model on 23,305 professional tournament replays, enabling counterfactual traversal between losing and winning gameplay profiles. To fulfill our goal, we have devised and verified four traversal strategies on out-of-distribution (OOD) data randomly sampled from a dataset of amateur replays, namely linear interpolation, iterative optimal transport, density-regularized gradient ascent, and neural flow matching, each designed to generate multi-step improvement trajectories that remain grounded in observed expert behavior while moving a player's profile toward winning configurations. Feedback is extracted at multiple granularities to support players at different stages of improvement. Finally, we conclude that there is a trade-off between the path-finding methods we employ and hope that future research will focus on developing model solutions for human improvement.
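Of the four traversal strategies listed in the abstract, linear interpolation is the simplest to picture. The sketch below is only an illustration of that idea: it assumes latent codes are plain lists of floats, and the vectors `z_losing` and `z_winning` are made-up values rather than outputs of the paper's autoencoder.

```python
def linear_path(z_start, z_goal, num_steps):
    """Return num_steps + 1 points evenly spaced from z_start to z_goal."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    if len(z_start) != len(z_goal):
        raise ValueError("latent vectors must have the same dimension")
    path = []
    for k in range(num_steps + 1):
        t = k / num_steps  # interpolation weight, 0 at the start and 1 at the goal
        path.append([(1.0 - t) * a + t * b for a, b in zip(z_start, z_goal)])
    return path


# Hypothetical 2-D latent codes for a losing and a winning gameplay profile.
z_losing = [0.0, 2.0]
z_winning = [4.0, 0.0]
path = linear_path(z_losing, z_winning, num_steps=4)
for point in path:
    print(point)
```

A straight line ignores where observed expert data actually lies in the latent space, which is presumably one reason the abstract also lists density-aware and transport-based alternatives.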

SLIM-RL: Risk-Budgeted Random-Masking RL for Diffusion LLMs Without Trajectory Slicing

Authors: Ruikang Zhao, Zhenting Wang, Han Gao, Ligong Han
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2607.00208
Pdf link: https://arxiv.org/pdf/2607.00208
Abstract Reinforcement learning for diffusion large language models (dLLMs) has largely moved to trajectory-aware methods. The current state of the art, TraceRL, holds that random masking is mismatched with the model's inference trajectory, and it reconstructs that trajectory during training by slicing each rollout into up to K/s trajectory-aligned training samples, a cost that grows with the block size K. We show that this mismatch can be mitigated without reconstructing the trajectory. Our method, SLIM-RL, bounds the commit risk of each rollout step with a tau-budget decoder, reducing aggregate commit risk in the training data. During optimization, SLIM-RL trains on these risk-controlled rollouts with a trace-free random-masking objective that adapts variance-reduction tools, combining sequence-level importance sampling, deterministic quadrature over masking levels under a mean-preserving, monotonically decreasing per-block mask schedule that we introduce. On SDAR-4B, SLIM-RL matches TraceRL's best MATH500 accuracy on only 0.46x its training samples at block size 16, improving over TraceRL by 6.32% on MATH500 and 11.05% on GSM8K under matched dynamic sampling. At block size 4, the 4B SLIM-RL surpasses the larger LLaDA-8B and Dream-7B dLLMs on math, exceeding LLaDA-8B by 10.76% on MATH500 while staying below the autoregressive Qwen2.5-7B. On code, it improves over TraceRL by 4.20% on MBPP and 3.65% on HumanEval. The tau-budget decoder transfers training-free across LLaDA, Dream, and SDAR. The source code is available at this https URL .

Learning Generalizable Skill Policy with Data-Efficient Unsupervised RL

Authors: Jongchan Park, Seungjun Oh, Seungho Baek, Yusung Kim
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2607.00392
Pdf link: https://arxiv.org/pdf/2607.00392
Abstract Unsupervised Reinforcement Learning (URL) aims to pre-train scalable, skill-conditioned policies without extrinsic rewards, serving as a foundation for downstream control tasks. Despite recent progress, we argue that current off-policy URL methods are limited by two critical, overlooked bottlenecks: (1) non-stationary skill semantics and (2) brittle generalization. To address these challenges, we propose GenDa (Generalizable Data-efficient Agent), a unified framework for robust unsupervised reinforcement learning. First, we introduce a skill relabeling mechanism to mitigate non-stationarity and significantly improve data efficiency for pre-training. Second, we propose a Complementary Information Bottleneck (CIB), encouraging the learned skill policy to focus on ego-centric features and become robust to distribution shifts for downstream tasks. Through various experiments, we demonstrate that GenDa significantly enhances the scalability of URL with superior generalizability and data efficiency. Our code and videos are available at this https URL.

Personalization as Inverse Planning: Learning Latent Design Intents for Agentic Slide Generation via Structural Denoising

Authors: Tianci Liu, Zihan Dong, Linjun Zhang, Haoyu Wang, Jing Gao, Emre Kiciman, Ranveer Chandra, Wei-Ting Chen
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2607.00407
Pdf link: https://arxiv.org/pdf/2607.00407
Abstract Slide design requires personalizing both deck themes and page layouts. Yet, current AI agent-based methods struggle with fine-grained, page-level design. Solely relying on prespecified templates or user verbose instructions, they fail to capture latent design intents, leaving Page-level Slide Personalization (PSP) unresolved. To close this gap, this work formulates PSP as an inverse planning problem. We propose to learn a design intent without assuming any knowledge of the specific executing tools (e.g., PowerPoint, Beamer) being used. However, relinquishing control over these tools makes the problem intractable to optimize end-to-end. To overcome this, we propose SPIRE, a principled framework to solve PSP approximately. By intentionally corrupting the visual structures of clean slides, SPIRE creates a verifiable task to denoise the corruption, whereby two agents learn to collaboratively refine executable designs via reinforcement learning (RL). We present a proof that structural denoising is a consistent surrogate for PSP, and that the multi-agent formulation strictly reduces policy gradient variance in RL. Extensive experiments demonstrate the superiority of SPIRE.

Selective Test-Time Debiasing for CLIP via Reward Gating

Authors: Jaeho Han, Jisoo Yang, Hyeondong Woo, Mingyu Jeon, Sunjae Yoon, Junyeong Kim
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2607.00423
Pdf link: https://arxiv.org/pdf/2607.00423
Abstract Vision language models (VLMs) demonstrate strong zero-shot performance, but often perpetuate social stereotypes in person-centric queries, yielding skewed demographic distributions. Current debiasing methods apply uniform bias corrections across all input queries regardless of their bias sensitivity, creating a fundamental fairness-utility trade-off. Strong debiasing distorts semantically meaningful information in bias-insensitive queries, while weak debiasing fails to mitigate stereotypes in bias-sensitive ones. This one-size-fits-all approach hampers simultaneously achieving high utility on bias-insensitive queries and fairness on bias-sensitive queries. We introduce Reward-Gated Test-Time Adaptation (RG-TTA), a reinforcement learning-based test-time adaptation framework that selectively applies debiasing based on input sensitivity. RG-TTA adaptively triggers fairness regularization based on the bias sensitivity of each input during test-time policy adaptation, while focusing exclusively on optimizing cross-modal alignment for bias-insensitive inputs. Experiments on fairness benchmarks (e.g., FairFace, UTKFace) demonstrate substantial bias reduction while simultaneously improving zero-shot utility, resolving the trade-off of uniform debiasing.

Learning Gait-Aware Quadruped Locomotion with Temporal Logic Specifications

学习带有时间逻辑规范的步态感知四足行走

Authors: Merve Atasever, Cagan Bakirci, Alfredo Reina Corona, Keyan Azbijari, Jyotirmoy V. Deshmukh
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2607.00442
Pdf link: https://arxiv.org/pdf/2607.00442
Abstract Reinforcement learning (RL) for quadruped locomotion commonly depends on fixed, hand-crafted, and Markovian reward functions that limit the interpretability of learned policies and offer no explicit control over gait behaviors. We introduce a framework where distinct gaits are specified using parameterized constraints expressed in Signal Temporal Logic (STL). These include safety bounds, gait synchronization constraints, command tracking, and actuation bounds. From these specifications, we develop a reward shaping mechanism that provides learning agents a dense, continuous reward landscape that encodes desired behavior. We define parametric STL templates for three speed regimes (walking-trot, trot, bound), calibrate their parameters from reference rollouts, and compute rewards using smooth approximations of STL robustness over the rollouts. The generated rewards can be used to provide shaped gradients compatible with Proximal Policy Optimization (PPO). We instantiate the approach on Google's Barkour quadruped robot in MuJoCo XLA (MJX). We use parallelization within the simulator to improve training speeds and use domain randomization to robustify learned policies. We show that compared to a baseline of hand-crafted rewards, the STL-shaped rewards yield tighter velocity tracking and more stable training. Videos can be found on our project website: this https URL.
中文摘要 四足行走的强化学习（RL）通常依赖固定的、手工设计的马尔可夫奖励函数，这类函数既限制了所学策略的可解释性，也无法对步态行为进行显式控制。我们提出了一个框架，用信号时序逻辑（STL）表达的参数化约束来指定不同的步态。这些约束包括安全界限、步态同步约束、指令跟踪和驱动界限。基于这些规范，我们开发了一种奖励塑形机制，为学习智能体提供稠密、连续且编码了期望行为的奖励景观。我们为三种速度区间（行走小跑 walking-trot、小跑 trot、跳跃 bound）定义了参数化STL模板，根据参考 rollout 校准其参数，并利用 rollout 上STL鲁棒度的平滑近似来计算奖励。生成的奖励可用于提供与近端策略优化（PPO）兼容的塑形梯度。我们在 MuJoCo XLA（MJX）中对谷歌的Barkour四足机器人实现了该方法。我们利用模拟器内的并行化来提升训练速度，并使用域随机化来增强所学策略的鲁棒性。结果表明，与手工设计奖励的基线相比，STL塑形奖励带来了更紧密的速度跟踪和更稳定的训练。视频可在我们的项目网站上找到：this https URL。
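
The abstract's reward computation, "smooth approximations of STL robustness", can be illustrated with a small sketch. This is not the paper's implementation: it assumes a single "always" constraint of the form x_t <= bound, and it uses a log-sum-exp soft minimum, which is one common smoothing choice that the abstract does not confirm. The names `soft_min`, `always_below_robustness`, and `beta` are hypothetical.

```python
import math


def soft_min(values, beta=10.0):
    """Smooth lower bound on min(values), via a shifted log-sum-exp."""
    m = min(values)  # shift by the true minimum for numerical stability
    return m - math.log(sum(math.exp(-beta * (v - m)) for v in values)) / beta


def always_below_robustness(signal, bound, beta=10.0):
    """Smoothed robustness of 'always (x_t <= bound)' over one rollout.

    The exact robustness is min_t (bound - x_t): positive when the
    constraint holds at every step, negative when it is violated.
    """
    margins = [bound - x for x in signal]
    return soft_min(margins, beta)
```

A larger `beta` tracks the exact minimum more closely. In this form the soft minimum never exceeds the exact minimum, so the smoothed robustness is a conservative estimate that stays differentiable in every time step.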

Gauging, Measuring, and Controlling Critic Complexity in Actor-Critic Reinforcement Learning

Actor-Critic强化学习中评论家复杂度的评估、测量与控制

Authors: Konstantin Garbers
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2607.00452
Pdf link: https://arxiv.org/pdf/2607.00452
Abstract Actor-critic methods depend on learned critics, but critic quality is often evaluated only indirectly through return, temporal-difference error, or value loss. Critic complexity is introduced as an additional diagnostic and intervention dimension for actor-critic reinforcement learning. The analysis uses spectral effective-rank entropy, a rank-like summary of the singular-value distributions of critic weight matrices, to assess critic model complexity. Across TD3 and PPO experiments, critic complexity is tracked together with return and Monte Carlo value-estimation bias. The results show that critic complexity is measurable throughout training and is systematically associated with training behavior, while also making clear that the relationship is heterogeneous across algorithms, tasks, and hyperparameters. A direct complexity-control intervention is then evaluated by adding a spectral-entropy penalty to the critic loss. This intervention reliably changes the targeted spectral quantity, demonstrating that critic complexity can be controlled rather than only observed. Return effects are treated as task-dependent evidence rather than as a general performance claim, because overall complexity-control results vary.
中文摘要 Actor-Critic方法依赖于学习得到的评论家（critic），但评论家的质量通常只能通过回报、时序差分误差或价值损失间接评估。本文将评论家复杂度作为Actor-Critic强化学习的一个额外诊断与干预维度。分析使用谱有效秩熵（对评论家权重矩阵奇异值分布的一种类秩概括）来评估评论家模型的复杂度。在TD3和PPO实验中，评论家复杂度与回报和蒙特卡洛价值估计偏差一同被跟踪。结果表明，评论家复杂度在整个训练过程中都是可测量的，并且与训练行为存在系统性关联，同时也清楚地表明这种关系在不同算法、任务和超参数之间是异质的。随后，通过在评论家损失中加入谱熵惩罚项，评估了一种直接的复杂度控制干预。该干预能够可靠地改变目标谱量，说明评论家复杂度不仅可以被观察，还可以被控制。由于整体的复杂度控制结果不一，对回报的影响被视为依赖于任务的证据，而非一般性的性能结论。
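
The "spectral effective-rank entropy" named in the abstract can be illustrated with the standard effective-rank construction: normalize the singular values into a probability distribution, take its Shannon entropy, and exponentiate. The paper's exact definition may differ, so this is a sketch under that assumption. It takes precomputed, non-negative singular values as input, and the function names are hypothetical.

```python
import math


def spectral_entropy(singular_values, eps=1e-12):
    """Shannon entropy (in nats) of the normalized singular-value distribution."""
    total = sum(singular_values)
    if total <= eps:
        return 0.0
    # Zero singular values contribute nothing to the entropy, so skip them.
    probs = [s / total for s in singular_values if s > eps]
    return -sum(p * math.log(p) for p in probs)


def effective_rank(singular_values):
    """exp(entropy): equals k when exactly k singular values are equal and nonzero."""
    return math.exp(spectral_entropy(singular_values))
```

For a weight matrix, the singular values would come from an SVD routine. A penalty of the kind the abstract describes could then add a multiple of `spectral_entropy` to the critic loss, though the sign and weighting are not given in the abstract.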

VLM-AR3L: Vision-Language Models for Absolute and Relative Rewards in Reinforcement Learning

VLM-AR3L：强化学习中绝对与相对奖励的视觉语言模型

Authors: Kuan-Chen Chen, Winston Chen, Wei-Fang Sun, Min-Chun Hu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2607.00483
Pdf link: https://arxiv.org/pdf/2607.00483
Abstract Designing effective reward functions remains a major challenge in reinforcement learning (RL), particularly in open-ended environments where task goals are abstract and difficult to quantify. In this work, we present VLM-AR3L, a framework that leverages Vision-Language Models (VLMs) to provide both absolute and relative rewards for RL. VLM-AR3L interprets an agent's visual observations in the context of a natural language task goal, and learns both absolute and relative rewards from VLM-generated preference labels. The absolute reward model predicts scalar evaluations for individual states, while the relative reward model compares consecutive observations to infer progress or regression toward the task goal. Their integration combines the stability of state-based evaluation with the robustness of comparative supervision. We evaluate VLM-AR3L across benchmarks spanning classic control, manipulation, and open-world embodied tasks, with a particular focus on Minecraft given its visual complexity and long-horizon decision-making requirements. Experimental results show that VLM-AR3L consistently outperforms prior VLM-based reward learning methods.
中文摘要 设计有效的奖励函数仍然是强化学习（RL）中的一大挑战，尤其是在任务目标抽象且难以量化的开放式环境中。在本研究中，我们提出了VLM-AR3L框架，该框架利用视觉语言模型（VLMs）为强化学习同时提供绝对奖励和相对奖励。VLM-AR3L 在自然语言任务目标的语境下解读智能体的视觉观测，并从 VLM 生成的偏好标签中学习绝对和相对奖励。绝对奖励模型对单个状态预测标量评估，而相对奖励模型则比较连续的观测，以推断智能体是在向任务目标推进还是倒退。二者的结合兼具基于状态评估的稳定性与比较式监督的稳健性。我们在涵盖经典控制、操作和开放世界具身任务的基准上评估VLM-AR3L，并特别关注Minecraft，因为它具有视觉复杂性和长时程决策需求。实验结果显示，VLM-AR3L始终优于以往基于VLM的奖励学习方法。

Efficient Multilingual Reasoning Transfer via Progressive Code-Switching

通过渐进式语码转换实现高效的多语言推理迁移

Authors: Zhijun Wang, Junxiao Liu, Hao Zhou, Hao-Ran Wei, Baosong Yang, Shujian Huang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2607.00485
Pdf link: https://arxiv.org/pdf/2607.00485
Abstract Large reasoning models (LRMs) have achieved strong reasoning capabilities in English, yet their performance degrades significantly when required to reason in other languages. A natural solution is to transfer the model's English reasoning ability to target languages. However, existing transfer approaches typically rely on distilled target-language reasoning traces from stronger LRMs or online supervision from external judge models, which are costly and difficult to scale. In this paper, we propose PCS (Progressive Code-Switching), a more efficient transfer framework that requires only lightweight translation without any stronger model for distillation or judging. PCS first constructs code-switched reasoning traces by translating a subset of English reasoning steps into the target language, and uses them to initialize the model's code-switching ability via supervised fine-tuning. It then applies reinforcement learning with a step-level language consistency curriculum, progressively raising the target-language ratio until the model reasons entirely in the target language. This progressive design provides a smooth transfer path that avoids the instability and performance degradation commonly observed when directly enforcing target-language reasoning. Experiments on multiple benchmarks and five typologically diverse languages show that PCS substantially narrows the performance gap between target-language and English reasoning, yielding more language-consistent reasoning while maintaining competitive accuracy.
中文摘要 大型推理模型（LRM）在英语中已具备强大的推理能力，但在需要用其他语言推理时，其表现会显著下降。一个自然的解决方案是将模型的英语推理能力迁移到目标语言。然而，现有的迁移方法通常依赖于从更强的LRM蒸馏得到的目标语言推理轨迹，或来自外部评判模型的在线监督，这些做法成本高昂且难以扩展。本文提出了PCS（Progressive Code-Switching，渐进式语码转换），这是一种更高效的迁移框架，只需轻量级翻译，无需任何更强的模型进行蒸馏或评判。PCS首先将一部分英语推理步骤翻译成目标语言，从而构建语码转换的推理轨迹，并通过监督微调用这些轨迹初始化模型的语码转换能力。随后，它采用带有步骤级语言一致性课程的强化学习，逐步提高目标语言的比例，直到模型完全以目标语言进行推理。这种渐进式设计提供了平滑的迁移路径，避免了直接强制目标语言推理时常见的不稳定和性能下降。在多个基准和五种类型学上各异的语言上的实验表明，PCS显著缩小了目标语言推理与英语推理之间的性能差距，使推理的语言一致性更高，同时保持有竞争力的准确率。

PAPA: Online Personalized Active Preference Alignment

PAPA：在线个性化主动偏好对齐

Authors: Anindya Sarkar, Nasik Muhammad Nafi, Isaac Lyngaas, Muralikrishnan Gopalakrishnan Meena, Yevgeniy Vorobeychik
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2607.00486
Pdf link: https://arxiv.org/pdf/2607.00486
Abstract Diffusion models are highly effective at modeling complex data distributions, including images and text. However, in applications like personalized recommender systems, the objective often shifts to modeling specific regions of the distribution that maximize user preferences, which are initially unknown but gradually uncovered through interactive feedback. This can naturally be framed as a reinforcement learning problem, where the goal is to fine-tune a diffusion model to maximize a reward function based on preferences. However, the main challenge lies in learning a parameterized reward model, which typically requires large-scale preference data that is often not available in practice. In this work, we introduce Personalized Active Preference Alignment (PAPA), a novel method that bypasses the requirement for a parameterized reward model by directly optimizing the diffusion model using real-time user feedback. PAPA enables feedback-efficient preference alignment, drawing inspiration from the variational inference framework. We demonstrate PAPA's effectiveness through extensive experiments and ablation studies across diverse class-conditioned and fine-grained alignment tasks. Additionally, based on theoretical insights, we propose an enhanced fine-tuning strategy, referred to as EPAPA, that requires less computational budget and accelerates the fine-tuning process, further boosting PAPA's suitability for real-world deployment. Our code is made publicly available at this https URL.
中文摘要 扩散模型在建模复杂数据分布（包括图像和文本）方面非常有效。然而，在个性化推荐系统等应用中，目标往往转向对分布中使用户偏好最大化的特定区域进行建模，而这些偏好最初未知，需要通过交互式反馈逐渐发现。这自然可以表述为一个强化学习问题，其目标是微调扩散模型，以最大化基于偏好的奖励函数。然而，主要挑战在于学习参数化的奖励模型，这通常需要大规模的偏好数据，而这在实际中往往难以获得。在本研究中，我们提出了个性化主动偏好对齐（PAPA），这是一种新方法，利用实时用户反馈直接优化扩散模型，从而绕过了对参数化奖励模型的需求。PAPA借鉴变分推断框架，实现了反馈高效的偏好对齐。我们通过在多种类别条件和细粒度对齐任务上的大量实验和消融研究，展示了PAPA的有效性。此外，基于理论洞见，我们提出了一种增强的微调策略，称为EPAPA，它所需的计算预算更少，并能加快微调过程，进一步提升了PAPA在实际部署中的适用性。我们的代码已在此 https URL 公开。

Active-GRPO: Adaptive Imitation and Self-Improving Reasoning for Molecular Optimization

Active-GRPO：面向分子优化的自适应模仿与自我改进推理

Authors: Xuefeng Liu, Mingxuan Cao, Qinan Huang, Thomas Brettin, Rick Stevens, Le Cong
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2607.00531
Pdf link: https://arxiv.org/pdf/2607.00531
Abstract Scientific reasoning is an increasingly important capability of large language models, yet improving the robustness and efficiency of training such reasoning remains a key open challenge. We study this problem in instruction-based molecular optimization, where answer-only supervised fine-tuning (SFT) collapses multi-step reasoning and reinforcement learning with verifiable rewards (RLVR) suffers from sparse feedback. Reference-guided Policy Optimization mitigates both by anchoring policy updates to dataset-provided references, but its effectiveness is tightly coupled to reference quality: weak or misaligned references impose a performance ceiling. To overcome this ceiling, we propose active reasoning, a paradigm in which the policy actively decides, on a per-instance basis, when to imitate a reference and when to reinforce its own discoveries, while continuously upgrading what it imitates. We instantiate this paradigm as Active Group Relative Policy Optimization (Active-GRPO), realized through two coupled mechanisms: active imitate-reinforce and active referencing. The former performs imitation learning when the reference still outperforms the policy's own candidates, and shifts to self-improvement via reinforcement learning once the policy has generated molecules that surpass the reference. The latter continuously upgrades the reference itself by replacing it with the best policy-generated candidate discovered so far, progressively raising the imitation target and ensuring that reference guidance remains informative rather than restrictive throughout training. Across TOMG-Bench MOLOPT, Active-GRPO improves average SRxSim from 0.0959 for GRPO and 0.1665 for RePO to 0.1773 under matched three-seed evaluation, with statistically significant gains on LogP, MR, and QED.
中文摘要 科学推理是大型语言模型日益重要的一项能力，但提升此类推理训练的稳健性和效率仍是一个关键的开放挑战。我们在基于指令的分子优化中研究该问题，其中仅含答案的监督微调（SFT）会使多步推理坍缩，而带有可验证奖励的强化学习（RLVR）则受困于稀疏反馈。参考引导策略优化（Reference-guided Policy Optimization）通过将策略更新锚定到数据集提供的参考来缓解这两个问题，但其有效性与参考质量紧密耦合：质量弱或未对齐的参考会造成性能上限。为突破这一上限，我们提出主动推理这一范式：策略针对每个实例主动决定何时模仿参考、何时强化自身的发现，同时不断升级其模仿的对象。我们将这一范式实例化为主动组相对策略优化（Active Group Relative Policy Optimization，Active-GRPO），通过两种相互耦合的机制实现：主动模仿-强化和主动参考。前者在参考仍优于策略自身的候选时进行模仿学习，一旦策略生成了超过参考的分子，则转向通过强化学习进行自我改进。后者不断升级参考本身，用迄今发现的最佳策略生成候选替换它，逐步抬高模仿目标，并确保参考引导在整个训练过程中始终提供信息而非构成限制。在TOMG-Bench MOLOPT上，在匹配的三随机种子评估下，Active-GRPO将平均SRxSim从GRPO的0.0959和RePO的0.1665提升至0.1773，并在LogP、MR和QED上取得统计显著的提升。
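
The two coupled mechanisms in the abstract (imitate while the reference still wins, reinforce and upgrade the reference once a candidate surpasses it) reduce to a per-instance decision rule. The sketch below shows only that rule, not the GRPO update itself. The function name `active_step`, the `score` callback, and the choice to imitate on a tie are assumptions for illustration.

```python
def active_step(reference, candidates, score):
    """Choose the training mode for one instance and return the updated reference.

    reference:  current imitation target
    candidates: policy-generated samples for the same instance
    score:      callable mapping a sample to a scalar quality (hypothetical)
    """
    best = max(candidates, key=score)
    if score(best) > score(reference):
        # A candidate surpasses the reference: switch to reinforcement
        # and promote that candidate to be the new reference.
        return "reinforce", best
    # The reference still outperforms (or ties) every candidate: imitate it.
    return "imitate", reference
```

Because the returned reference is never worse than the previous one under `score`, the imitation target can only rise over training, which matches the abstract's description of a progressively raised target.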

Flow-Map GRPO: Reinforcement Learning for Few-Step Flow-Map Generators via Anchored Stochastic Composition

流图GRPO：通过锚定随机组合实现的少步流图生成器的强化学习

Authors: Zhiqi Li, Wen Zhang, Bo Zhu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2607.00535
Pdf link: https://arxiv.org/pdf/2607.00535
Abstract Few-step flow-map generators, such as consistency models and MeanFlow, accelerate sampling by directly learning long-range transport maps between noise and data. However, these models are typically deterministic, which makes them difficult to optimize with reinforcement learning (RL) post-training methods that require stochastic trajectories and well-defined likelihood ratios. Existing SDE-based stochasticization techniques are designed for velocity-based samplers with infinitesimal or finely discretized transitions, and therefore do not directly apply to long-range flow maps. In this work, we propose Flow-Map GRPO, an online RL post-training framework for deterministic few-step flow-map generators. The key component is Anchored Stochastic Flow Map Composition (ASFMC), a path-preserving stochasticization mechanism that introduces randomness through anchor-based conditional resampling while preserving the original marginal probability path of the deterministic flow map. We derive GRPO objectives for both single-time and two-time flow-map parameterizations. Experiments on few-step FLUX-based text-to-image generators, including MeanFlow and sCM, show that Flow-Map GRPO improves pretrained deterministic flow-map models across reward-based, perceptual, and task-level evaluation metrics. Our results demonstrate that deterministic few-step flow-map generators can be effectively aligned with RL post-training without modifying their original model parameterization or retraining them as native stochastic models.
中文摘要 少步流图生成器，如一致性模型和MeanFlow，通过直接学习噪声与数据之间的长程传输映射来加速采样。然而，这些模型通常是确定性的，因此难以用需要随机轨迹和定义良好的似然比的强化学习（RL）后训练方法进行优化。现有基于SDE的随机化技术是为转移步长为无穷小或精细离散化的基于速度的采样器设计的，因此不能直接用于长程流图。在本研究中，我们提出了Flow-Map GRPO，一种面向确定性少步流图生成器的在线RL后训练框架。其关键组成部分是锚定随机流图复合（ASFMC），这是一种保持路径的随机化机制，通过基于锚点的条件重采样引入随机性，同时保持确定性流图原有的边缘概率路径。我们为单时间和双时间两种流图参数化推导了GRPO目标。在基于FLUX的少步文本到图像生成器（包括MeanFlow和sCM）上的实验表明，Flow-Map GRPO在基于奖励的、感知的和任务级的评估指标上均改进了预训练的确定性流图模型。我们的结果表明，确定性少步流图生成器可以通过RL后训练得到有效对齐，而无需修改其原始模型参数化，也无需将其重新训练为原生随机模型。

Loss Smoothing for Stable Adaptation Under Distribution Shift

分布偏移下稳定适应的损失平滑

Authors: Darshan Patil, Ekaterina Lobacheva, Razvan Pascanu, Sarath Chandar
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2607.00634
Pdf link: https://arxiv.org/pdf/2607.00634
Abstract In settings such as fine-tuning and reinforcement learning, neural networks are often adapted under distribution shift. Standard adaptation methods typically optimize the target objective directly, inducing an abrupt change from the source training objective. This abrupt transition can distort learned representations, including features that may still be useful for the new task. We investigate whether a more gradual transition can improve adaptation. We propose loss smoothing, a simple approach that interpolates between the source and target training objectives at the start of adaptation. This smooth transition helps to preserve useful features from the source distribution while still enabling the model to specialize to the target distribution. Across controlled supervised shifts, pretrained vision adaptation, offline-to-online and online reinforcement learning, and language model fine-tuning, we find that loss smoothing consistently improves performance, suggesting that smoother objective transitions are a broadly useful tool for model adaptation.
中文摘要 在微调和强化学习等场景中，神经网络通常需要在分布偏移下进行适应。标准的适应方法通常直接优化目标任务的训练目标，从而造成相对于源训练目标的突变。这种突然的转变可能会扭曲已学到的表征，包括那些可能仍对新任务有用的特征。我们研究更渐进的过渡是否能改善适应。我们提出了损失平滑，这是一种在适应开始时在源训练目标和目标训练目标之间进行插值的简单方法。这种平滑过渡有助于保留来自源分布的有用特征，同时仍使模型能够专门适应目标分布。在受控的监督式分布偏移、预训练视觉模型适应、离线到在线及在线强化学习，以及语言模型微调等场景中，我们发现损失平滑都能持续提升性能，表明更平滑的目标过渡是一种在模型适应中广泛有用的工具。
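
Loss smoothing as described in the abstract interpolates between the source and target objectives at the start of adaptation. A minimal sketch follows, assuming a linear ramp over a fixed number of steps. The abstract does not specify the schedule, so the linear form, `warmup_steps`, and both function names are hypothetical.

```python
def smoothing_weight(step, warmup_steps):
    """Weight on the target loss: ramps linearly from 0 to 1 over warmup_steps."""
    if warmup_steps <= 0:
        return 1.0  # no smoothing: optimize the target objective directly
    return min(1.0, max(0.0, step / warmup_steps))


def smoothed_loss(source_loss, target_loss, step, warmup_steps):
    """Convex combination of the source and target losses at a given step."""
    alpha = smoothing_weight(step, warmup_steps)
    return (1.0 - alpha) * source_loss + alpha * target_loss
```

At step 0 the model is still trained on the source objective alone. After `warmup_steps` the combination reduces to standard adaptation on the target objective.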

Learning-based control of a single-DOF Aero system

单自由度Aero系统的基于学习的控制

Authors: Gabriel da Silva Lima, Wallace Moreira Bessa
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2607.00640
Pdf link: https://arxiv.org/pdf/2607.00640
Abstract This paper presents a learning-based control framework that integrates feedback linearization with reinforcement learning for the adaptive control of nonlinear mechatronic systems. The control law is derived using Lyapunov stability analysis, ensuring closed-loop stability in the presence of modeling uncertainties and external disturbances. Feedback linearization serves as the main control framework, while a reinforcement learning component estimates and compensates for unmodeled dynamics and disturbances online. The learning module is based on the REINFORCE-with-baseline algorithm, which improves learning efficiency by reducing the variance of policy-gradient estimates and enabling stable policy updates during adaptation. The proposed controller is evaluated on a single-degree-of-freedom rotor-based AERO system. Results from simulations demonstrate accurate trajectory tracking, fast adaptation, and strong robustness against parameter variations and external disturbances. Overall, the proposed approach combines the analytical guarantees of Lyapunov-based control with the adaptability of reinforcement learning, providing an effective solution for controlling nonlinear mechatronic systems.
中文摘要 本文提出了一种基于学习的控制框架，将反馈线性化与强化学习相结合，用于非线性机电系统的自适应控制。控制律通过李雅普诺夫稳定性分析推导得到，确保在存在建模不确定性和外部扰动时的闭环稳定性。反馈线性化作为主要控制框架，而强化学习组件则在线估计并补偿未建模动态和扰动。该学习模块基于REINFORCE-with-baseline算法，通过降低策略梯度估计的方差，并在适应过程中实现稳定的策略更新，提高了学习效率。所提出的控制器在一个基于旋翼的单自由度AERO系统上进行了评估。仿真结果显示其轨迹跟踪准确、适应速度快，并且对参数变化和外部扰动具有很强的鲁棒性。总体而言，所提方法结合了基于李雅普诺夫的控制的解析保证与强化学习的适应性，为非线性机电系统的控制提供了有效的解决方案。
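
The variance reduction that the abstract attributes to REINFORCE-with-baseline comes from weighting each log-probability gradient by the return minus a baseline, rather than by the raw return. The sketch below computes only those weights. It assumes per-step baseline values are supplied by some estimator, and the function names and the default discount factor are illustrative, not taken from the paper.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


def advantages(rewards, baseline_values, gamma=0.99):
    """Policy-gradient weights G_t - b(s_t) used by REINFORCE with a baseline."""
    returns = discounted_returns(rewards, gamma)
    return [g - b for g, b in zip(returns, baseline_values)]
```

A baseline that depends only on the state leaves the expected gradient unchanged, and it shrinks the weights toward zero when its estimates are close to the true returns.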

Coachable agents for interactive gameplay

用于交互式游戏的可指导智能体

Authors: Roberto Capobianco (1), Harm van Seijen (2), Nolan D. Bard (2), Neil Burch (2), Fatima Davelouis (2), Josh Davidson (2), Alisa Devlic (1), Yunshu Du (2), Ishan Durugkar (2), Siddhant Gangapurwala (2), Daniel Hernandez (2), G. Zacharias Holland (2), Sahil Jain (2), Kenta Kawamoto (3), Raksha Kumaraswamy (2), Patrick MacAlpine (2), Dustin R. Morrill (2), Declan Oller (2), Francesco Riccio (1), Akanksha Saran (2), Craig Sherstan (3), Kaushik Subramanian (1), Thomas J. Walsh (2), Samuel Barrett (2), Kizza N. Frisbee (2), Mady Govil (2), Johannes Günther (2), Varun R. Kompella (2), James A. MacGlashan (2), Maxwell Svetlik (2), Michael D. Thomure (2), Jaden B. Travnik (2), Kevin Waugh (2), Elahe Aghapour (2), Florian Fuchs (1), Andreanne Lemay (2), Shruti Mishra (1), Takuma Seno (3), Peter Stone (2), Michael Spranger (3), Peter R. Wurman (2) ((1) Sony AI, Zurich, Switzerland, (2) Sony AI, North America, various locations, (3) Sony AI, Tokyo, Japan)
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2607.00642
Pdf link: https://arxiv.org/pdf/2607.00642
Abstract Reinforcement learning has proven to be a valuable tool in the creation of advanced AI and robotic systems, contributing to everything from game playing to robotics to foundation models. Through trial-and-error, these AI systems typically learn one, near-optimal behavior to solve their tasks. However, there are many use cases in which one would like to assert some level of control, preferably in real time, over how the task is solved. We refer to these modifications of a core task as styles. We combine universal value function approximators (UVFAs) with carefully selected training scenarios, learning algorithms, and data augmentation to create a framework for coaching agents that exhibit styles in complex domains. We demonstrate the framework's application in the AAA video games Horizon Forbidden West and Gran Turismo, and in an open-source humanoid test domain. Despite the different nature of the domains -- car racing, stylized game combat, and humanoid walking -- each agent shows strong coherence to the style requests while still satisfying the main task in its domain. Importantly, the techniques outlined in this paper allow an end user to choose the final behavior at run time, giving them flexible control over the final executed performance.
中文摘要 强化学习已被证明是构建先进人工智能和机器人系统的宝贵工具，其贡献遍及游戏对战、机器人技术乃至基础模型。通过反复试错，这些人工智能系统通常只学习一种接近最优的行为来完成任务。然而，在许多用例中，人们希望对任务的完成方式施加一定程度的控制，而且最好是实时控制。我们将对核心任务的这些修改称为风格（style）。我们将通用价值函数近似器（UVFA）与精心挑选的训练场景、学习算法和数据增强相结合，构建了一个框架，用于指导智能体在复杂领域中展现不同风格。我们展示了该框架在3A级电子游戏Horizon Forbidden West和Gran Turismo中的应用，以及在一个开源人形机器人测试领域中的应用。尽管这些领域性质各异（赛车、风格化的游戏战斗和人形行走），每个智能体都能高度遵循风格要求，同时仍能完成其领域内的主要任务。重要的是，本文介绍的技术允许终端用户在运行时选择最终行为，从而灵活地控制最终执行的表现。

M2Note: Continual Evolution of Vision Language Models via Mistake Notebook Learning

M2Note：通过错题本学习实现视觉语言模型的持续演进

Authors: Haiwen Li, Jing Tang, Rui Chen, Lei Sun, Xiangxiang Chu
Subjects: Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2607.00685
Pdf link: https://arxiv.org/pdf/2607.00685
Abstract Vision Language Models (VLMs) have demonstrated remarkable capabilities in multimodal reasoning tasks, yet they still suffer from recurring failures, such as skipping key visual checks, misapplying domain rules, and hallucinating unsupported concepts. Most existing solutions rely on supervised fine-tuning (SFT) and reinforcement learning (RL), which are expensive to iterate and can be brittle under distribution shift. To this end, we propose Multimodal Mistake Notebook Learning (M2Note), a training-free continual evolution framework that externalizes learning into an editable memory. M2Note transforms failed trajectories into compact subject-guidance notes: the subject summarizes the underlying domain and concept, while the guidance provides actionable verification steps that can be reused in future inference. At test time, M2Note retrieves relevant notes via multimodal retrieval-augmented generation (RAG) and appends them to the model context, steering reasoning away from previously observed pitfalls. To stabilize continual evolution, we adopt batch-level post-verification with rollback, which commits notebook edits only if they improve performance on the same batch, reducing noisy updates and preventing regressions. M2Note supports both self-evolving, where the same VLM acts as solver and supervisor, and cross-model evolving, where a stronger supervisor guides a weaker solver, enabling capability transfer without weight updates. Experiments on six multimodal reasoning benchmarks show consistent improvements across domains and backbones, while achieving strong cost and sample efficiency and remaining complementary to Chain-of-Thought (CoT) prompting.
中文摘要 视觉语言模型（VLMs）在多模态推理任务中展现出了卓越的能力，但仍然存在反复出现的失败，例如跳过关键的视觉检查、误用领域规则，以及幻觉出没有依据的概念。大多数现有解决方案依赖监督微调（SFT）和强化学习（RL），这些方法迭代成本高，并且在分布偏移下可能很脆弱。为此，我们提出了多模态错题本学习（M2Note），这是一种无需训练的持续演进框架，将学习外化为可编辑的记忆。M2Note将失败的轨迹转化为紧凑的“主题-指导”笔记：主题概括所涉及的领域和概念，指导则给出可操作的验证步骤，可在未来的推理中复用。测试时，M2Note通过多模态检索增强生成（RAG）检索相关笔记并将其附加到模型上下文中，引导推理避开此前观察到的陷阱。为了稳定持续演进，我们采用带回滚的批次级事后验证，只有当笔记本的编辑在同一批次上提升了性能时才予以提交，从而减少带噪声的更新并防止性能回退。M2Note既支持自演进（同一VLM同时充当求解器和监督者），也支持跨模型演进（由更强的监督者指导较弱的求解器），从而无需更新权重即可实现能力迁移。在六个多模态推理基准上的实验显示，该方法在不同领域和骨干模型上均有一致的提升，同时具有很高的成本效率和样本效率，并且与思维链（Chain-of-Thought，CoT）提示互补。
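
The "batch-level post-verification with rollback" step in the abstract amounts to a guarded commit: proposed notebook edits are kept only if they improve performance on the batch that produced them. The sketch below assumes the notebook is a plain list of notes and that an `evaluate` callback returns a score for a notebook on a batch. Both are hypothetical stand-ins, since the abstract does not describe the actual data structures.

```python
import copy


def commit_if_improved(notebook, proposed_notes, batch, evaluate):
    """Apply proposed notes only if they raise the score on the same batch.

    Returns (notebook_to_keep, committed). The input notebook is never
    mutated, so a rejected edit is rolled back by keeping the original.
    """
    baseline = evaluate(notebook, batch)
    candidate = copy.deepcopy(notebook)
    candidate.extend(proposed_notes)
    if evaluate(candidate, batch) > baseline:
        return candidate, True
    return notebook, False
```

Requiring a strict improvement means edits that leave the batch score unchanged are also discarded, which keeps the notebook from accumulating notes that have not shown a benefit.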

Task-Relevant Representation Decoupling for Visual Reinforcement Learning Generalization

任务相关表征解耦用于视觉强化学习泛化

Authors: Jinwen Wang, Youfang Lin, Xiaobo Hu, Qian Xu, Shuo Wang, Zhuo Chen, Kai Lv
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2607.00796
Pdf link: https://arxiv.org/pdf/2607.00796
Abstract Visual Reinforcement Learning (VRL) has achieved considerable success in solving control tasks. However, generalizing learned policies to new environments remains a major challenge, as agents often overfit to task-irrelevant features in the training environment. To solve this problem, we introduce the concept of decoupling observations into task-relevant and task-irrelevant representations. Building on this idea, we propose a self-supervised Task-Relevant Representation Decoupling (T2RD) algorithm for VRL. This algorithm consists of three components: task-relevant representation consistency, cross-reconstruction, and cross-dynamic prediction. The first two components achieve the decoupling of content and style features, but the resulting content representations are not necessarily task-relevant. To further refine task-relevant features from content representations, we design the third component that introduces dynamic prediction. T2RD achieves State-Of-The-Art (SOTA) generalization performance and sample efficiency in the DeepMind Control Suite and Robotic Manipulation tasks.
中文摘要 视觉强化学习（VRL）在解决控制任务方面取得了相当大的成功。然而，将学到的策略泛化到新环境仍是一大挑战，因为智能体常常过拟合于训练环境中与任务无关的特征。为解决此问题，我们引入了将观测解耦为任务相关表征和任务无关表征的概念。基于这一思想，我们提出了一种用于VRL的自监督任务相关表征解耦（T2RD）算法。该算法由三个部分组成：任务相关表征一致性、交叉重建和交叉动态预测。前两个部分实现了内容特征与风格特征的解耦，但由此得到的内容表征不一定与任务相关。为了从内容表征中进一步提炼任务相关特征，我们设计了引入动态预测的第三个部分。T2RD在DeepMind Control Suite和机器人操作任务中取得了最先进的（SOTA）泛化性能和样本效率。

Local Motion Matters: A Deconstruct-Recompose Paradigm for Reinforcement Learning Pre-training from Videos

局部运动的重要性：基于视频进行强化学习预训练的解构-重组范式

Authors: Jinwen Wang, Youfang Lin, Xiaobo Hu, Shuo Wang, Kai Lv
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2607.00808
Pdf link: https://arxiv.org/pdf/2607.00808
Abstract Pre-training on large-scale videos to improve reinforcement learning efficiency is promising yet remains challenging. Existing methods typically treat the agent as an indivisible entity, modeling motion patterns globally. Such global modeling is tightly coupled with the morphology, hindering transfer across domains. In contrast, despite the vast disparity in global motions, the local components exhibit similar motion patterns across different agents. Building on this insight, we propose a novel Deconstruct-Recompose Paradigm (DRP) for learning transferable local motion representations. Specifically, in the Deconstruct phase, we identify multiple local points and track their frame-wise motions, defining each as an Atomic Action. We introduce a Dual-Attention Encoder (DAE) to learn local motion representations from these Atomic Actions, capturing their spatiotemporal relationships. In the Recompose phase, we compose local motion representations with a learnable Motion Aggregation Token [MAT] via latent dynamics model learning. Additionally, an adapter bridges local motion and downstream action-specific dynamics to accelerate policy learning. Extensive experiments demonstrate that our method effectively transfers to diverse robotic control and manipulation tasks, significantly improving sample efficiency and performance.
中文摘要 利用大规模视频进行预训练以提高强化学习效率，前景可观但仍具挑战性。现有方法通常将智能体视为不可分割的整体，从全局上建模运动模式。这种全局建模与形态紧密耦合，阻碍了跨域迁移。相比之下，尽管全局运动差异巨大，不同智能体的局部部件却表现出相似的运动模式。基于这一洞察，我们提出了一种新的解构-重组范式（Deconstruct-Recompose Paradigm，DRP），用于学习可迁移的局部运动表征。具体来说，在解构阶段，我们识别多个局部点并跟踪它们的逐帧运动，将每一个定义为一个原子动作（Atomic Action）。我们引入了双注意力编码器（DAE），从这些原子动作中学习局部运动表征，并捕捉它们的时空关系。在重组阶段，我们通过潜在动力学模型学习，用一个可学习的运动聚合令牌 [MAT] 来组合局部运动表征。此外，一个适配器将局部运动与下游特定于动作的动力学连接起来，以加速策略学习。大量实验表明，我们的方法能够有效迁移到多种机器人控制和操作任务，显著提升了样本效率和性能。

From Pixels to Temporal Correlations: Learning Informative Representations for Reinforcement Learning Pre-training

从像素到时间相关性：为强化学习预训练学习信息丰富的表征

Authors: Jinwen Wang, Youfang Lin, Xiaobo Hu, Siyu Yang, Sheng Han, Shuo Wang, Kai Lv
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2607.00811
Pdf link: https://arxiv.org/pdf/2607.00811
Abstract Unsupervised pre-training on large-scale datasets has demonstrated significant potential for improving the sample efficiency and performance of Reinforcement Learning (RL). Given the large-scale action-free internet videos, existing methods utilize single-step transition prediction and image reconstruction to learn representations. However, these methods prefer to preserve large-proportion stationary information in the pixel space, neglecting small but crucial information. To preserve enough information in the representation, it is essential to pay equal attention to each element in videos. Specifically, we propose a temporal correlation space to distinguish each element. For implementation, we introduce the Multi-scale Temporal Contrastive Learning (MTCL) method to model multi-scale temporal correlations separately. This approach can balance the attention of different elements and yield more informative representations, effectively supporting policy learning in various downstream tasks. Experimental results demonstrate that our method improves sample efficiency and asymptotic performance across various downstream tasks.
中文摘要 在大规模数据集上进行无监督预训练，已显示出提升强化学习（RL）样本效率和性能的显著潜力。面对大规模的无动作标注互联网视频，现有方法利用单步转移预测和图像重建来学习表征。然而，这些方法倾向于保留像素空间中占比很大的静止信息，而忽略了占比小却关键的信息。为了在表征中保留足够的信息，必须对视频中的每个元素给予同等关注。具体来说，我们提出一个时间相关性空间来区分各个元素。在实现上，我们引入多尺度时间对比学习（MTCL）方法，分别建模多尺度的时间相关性。这种方法可以平衡对不同元素的关注，产生信息更丰富的表征，有效支持各种下游任务中的策略学习。实验结果表明，我们的方法在多种下游任务中提升了样本效率和渐近性能。

EFlow: Learning Evidence Flow for Long-Video Reasoning with Adaptive Reflection

EFlow：学习证据流以实现带自适应反思的长视频推理

Authors: Wenhao Zhang, Kuanwei Lin, Xuyi Yang, Wei Gao, Ge Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2607.00867
Pdf link: https://arxiv.org/pdf/2607.00867
Abstract Long-video reasoning is fundamentally constrained by how models acquire and utilize visual evidence. Existing tool-augmented video frameworks often interleave temporal grounding and answer reasoning within a single trajectory, causing early semantic hypotheses to bias evidence localization. We term this failure mode premature semantic commitment, where biased grounding retrieves incomplete evidence and incomplete evidence further reinforces incorrect reasoning. To address this issue, we propose EFlow, an evidence-first video reasoning framework built upon Qwen3-VL. EFlow explicitly separates temporal grounding and logical reasoning through CoT for Temporal Grounding and CoT for Reasoning, enabling the model to retrieve relevant evidence before answer inference. In addition, EFlow introduces a confidence-aware reflection mechanism that re-evaluates the full video when retrieved evidence is potentially insufficient. We further construct dedicated trajectory datasets and train EFlow through supervised fine-tuning, reinforcement learning, and reinforcement fine-tuning. Extensive experiments across five video understanding benchmarks demonstrate that EFlow consistently improves long-video reasoning performance.
中文摘要 长视频推理从根本上受限于模型获取和利用视觉证据的方式。现有的工具增强视频框架常常将时间定位和答案推理交织在同一条轨迹内，导致早期的语义假设使证据定位产生偏差。我们将这种失败模式称为过早语义承诺：有偏的定位检索到不完整的证据，而不完整的证据又进一步强化了错误的推理。为解决这一问题，我们提出了EFlow，一个基于Qwen3-VL构建的证据优先的视频推理框架。EFlow通过用于时间定位的CoT和用于推理的CoT，将时间定位与逻辑推理显式分离，使模型能够在推断答案之前先检索相关证据。此外，EFlow引入了一种置信度感知的反思机制，当检索到的证据可能不足时，会重新评估完整视频。我们进一步构建了专用的轨迹数据集，并通过监督微调、强化学习和强化微调来训练EFlow。在五个视频理解基准上的大量实验表明，EFlow能持续提升长视频推理性能。

Graph-Native Reinforcement Learning Enables Traceable Scientific Hypothesis Generation through Conceptual Recombination

图原生强化学习通过概念重组实现可追溯的科学假说生成

Authors: Subhadeep Pal, Shashwat Sourav, Tirthankar Ghosal, Markus J. Buehler
Subjects: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2607.00924
Pdf link: https://arxiv.org/pdf/2607.00924
Abstract Accelerating materials discovery requires AI systems that can generate scientifically valid hypotheses through multi-step, domain-grounded reasoning. Standard large language models often produce fluent but weakly traceable responses to open-ended materials design problems, making it difficult to determine whether final answers are supported by coherent intermediate reasoning. We develop Graph-PRefLexOR, a family of graph-native reasoning models fine-tuned with Group Relative Policy Optimization (GRPO) to organize reasoning into explicit phases for mechanism exploration, graph construction, pattern extraction, and hypothesis synthesis. This design links neural language generation with symbolic relational structure, enabling causal connections to be constructed, inspected, and reused. On 100 open-ended questions from materials science and mechanics literature, Graph-PRefLexOR achieves 40-65% improvements over corresponding base models, with the largest gains in reasoning traceability. Embedding analyses show broader semantic exploration and approximately 2-3 times greater semantic diversity than baselines. Semantic backtracking and layer-wise hidden-state analyses further show stronger alignment between structured reasoning and final answers. Finally, test-time graph expansion reveals that additional compute primarily increases long-range conceptual recombination within a bounded semantic space, rather than simply expanding semantic coverage. These results establish graph-native reinforcement learning as a pathway toward interpretable AI systems for scientific hypothesis generation in materials design and other scientific applications.
中文摘要 加速材料发现需要能够通过多步骤、基于领域推理生成科学有效假设的人工智能系统。标准大型语言模型常常能对开放式材料设计问题产生流畅但可追溯性较弱的回答，这使得判断最终答案是否由连贯的中间推理支持变得困难。我们开发了Graph-PRefLexOR系列图原生推理模型，经过群相对策略优化（GRPO）微调，将推理组织为显式阶段，用于机制探索、图构建、模式提取和假设综合。该设计将神经语言生成与符号关系结构连接起来，使因果关系得以构建、检查和重用。在材料科学和力学文献中的100个开放式问题中，Graph-PRefLexOR相比相应基础模型实现了40%-65%的改进，推理可追溯性提升最大。嵌入分析显示语义探索更广泛，语义多样性约为基线的2-3倍。语义回溯和层级隐藏状态分析进一步显示结构化推理与最终答案之间的更紧密对应。最后，测试时间图展开表明，额外的计算主要增加有界语义空间内的长程概念重组，而不仅仅是扩展语义覆盖。这些结果确立了图原生强化学习作为迈向可解释人工智能系统在材料设计及其他科学应用中科学假设生成的路径。

Human-Machine Collaboration on Generative Meta-Learning: Model and Algorithm

生成元学习的人机协作：模型与算法

Authors: Midhun Parakkal Unni, Samuel Kaski
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2607.00926
Pdf link: https://arxiv.org/pdf/2607.00926
Abstract Generalizing machine learning models to environments that differ from their training distribution remains a critical hurdle, particularly when data from the target domain is entirely or partially unavailable. We propose Generative Meta-Learning with Human Feedback (GMHF), a novel framework that bridges this domain gap by leveraging expert intuition to guide data synthesis. Grounded in a theoretical analysis of generalization error, we derive bounds demonstrating that aligning the distribution of generated data with human beliefs regarding the target physics significantly mitigates risk. GMHF operationalizes this insight by employing a Conditional Neural ODE (cNODE) as a generative digital twin, coupled with a Reinforcement Learning (RL) agent. The agent iteratively refines the latent physical parameters of the generated trajectories based on feedback, effectively steering the meta-learner toward the unobserved target distribution. Empirical validation on a nonlinear Duffing oscillator shows that GMHF substantially reduces deployment loss as expert reliability increases, and that the divergence between generated and target data falls under reliable feedback, directly corroborating the divergence-minimisation mechanism predicted by our theory. Further experiments on a non-dynamical probabilistic model confirm that the framework extends beyond ODE-governed systems, establishing human-AI collaboration as a rigorous catalyst for robust generalisation under distribution shift.
中文摘要 将机器学习模型推广到与训练分布不同的环境仍然是关键难题，尤其是在目标领域数据完全或部分不可得时。我们提出了带有人类反馈的生成元学习（GMHF），这是一种新颖框架，利用专家直觉来引导数据合成，从而弥合这一领域差距。基于泛化误差的理论分析，我们推导出界限，证明将生成数据的分布与人类对目标物理的信念对齐，可显著降低风险。GMHF通过使用条件神经常微分方程（cNODE）作为生成式数字孪生，并结合强化学习（RL）智能体，将这一洞见付诸实现。智能体会根据反馈迭代细化生成轨迹的潜在物理参数，有效地引导元学习器朝向未观测到的目标分布。在非线性Duffing振荡器上的实证验证表明，随着专家可靠性的提升，GMHF显著降低部署损失，且在反馈可靠时生成数据与目标数据之间的散度下降，直接印证了我们理论预测的散度最小化机制。在非动力学概率模型上的进一步实验证实，该框架可推广到常微分方程控制的系统之外，确立了人机协作作为分布偏移下稳健泛化的严谨催化剂。

DRL-Based Joint Beamforming and Surface Shape Optimization for Flexible Intelligent Metasurface-Aided ISAC Systems

面向柔性智能超表面辅助ISAC系统的基于DRL的联合波束成形与表面形状优化

Authors: Maoyuan Wang, Qian Zhang, Jiancheng An, Xuejun Cheng, Zheng Dong, Deqiang Wang
Subjects: Information Theory (cs.IT)
Arxiv link: https://arxiv.org/abs/2607.00951
Pdf link: https://arxiv.org/pdf/2607.00951
Abstract Integrated sensing and communication (ISAC) unifies high-precision sensing and wireless data transmission. In this paper, we investigate the design of ISAC systems enabled by a flexible intelligent metasurface (FIM) and aim to minimize the Cramér-Rao bound (CRB) under quality of service (QoS) constraints using deep reinforcement learning (DRL). Specifically, we formulate the joint design of the beamforming matrix and the FIM's surface shape to reduce the CRB subject to transmit power, QoS, and FIM surface shape constraints. However, the non-convex formulation makes the optimization problem difficult to solve. To tackle this issue, we develop a deep deterministic policy gradient (DDPG) actor-critic DRL scheme for the joint design, guided by a constraint-aware reward to progressively improve sensing performance. Numerical results demonstrate that jointly optimizing the beamforming matrix and the FIM's surface shape substantially decreases the CRB while ensuring communication quality compared with existing rigid arrays.
中文摘要 通信感知一体化（ISAC）统一了高精度感知与无线数据传输。本文研究由柔性智能超表面（FIM）支持的ISAC系统设计，旨在利用深度强化学习（DRL）在服务质量（QoS）约束下最小化Cramér-Rao界（CRB）。具体来说，我们构建了波束成形矩阵与FIM表面形状的联合设计问题，在发射功率、QoS和FIM表面形状约束下降低CRB。然而，非凸的问题形式使该优化问题难以求解。为解决这一问题，我们为联合设计开发了一种深度确定性策略梯度（DDPG）actor-critic DRL方案，并以约束感知奖励为引导，逐步提升感知性能。数值结果表明，与现有刚性阵列相比，联合优化波束成形矩阵与FIM表面形状可在保证通信质量的同时显著降低CRB。

AMBUSH: Collaborative Capture in Complex Environments with Neural Acceleration

AMBUSH：复杂环境中的协同捕获与神经加速

Authors: Junfeng Chen, YinHang Luo, Xinyi Wang, Junrui Li, Meng Guo
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2607.01029
Pdf link: https://arxiv.org/pdf/2607.01029
Abstract Collaborative capture of dynamic targets is common in nature as an essential strategy for weaker species against the strong. Similar concepts have been shown to be useful for numerous robotic applications, such as security and surveillance, search and rescue. However, most existing works focus on analytical and geometric solutions or end-to-end reinforcement learning methods, which are largely constrained to obstacle-free environments or scenarios with sparse, regularly distributed obstacles. This work tackles the problem from a unique perspective: the renowned strategy of "ambush" alone would suffice for multiple slower pursuers to capture one faster evader with different levels of intelligence efficiently in complex environments. A parameterized strategy of ambush (including discrete and continuous parameters) is designed first, which takes into account the topological properties of the workspace, the truncated line-of-sight visibility, the relative speed ratio and the limited capture range. Then, a Hybrid Monte Carlo Tree Search (H-MCTS) algorithm is proposed to optimize the associated parameters through long-term planning, enabling the identification of highly promising parameters for future capture. Lastly, the neural acceleration is trained offline to learn the ranking of different choices of parameters across various environments, and to directly predict scores, replacing the rollout process in H-MCTS. The neural acceleration is adopted during online H-MCTS to accelerate the planning procedure while guaranteeing the planning quality. Its efficiency and effectiveness are validated in extensive simulations and hardware experiments, against evaders with different capabilities and intelligence levels, including two-times higher velocity and human-controlled behavior.
中文摘要 动态目标的协同捕获在自然界中很常见，是弱势物种对抗强势物种的重要策略。类似的概念已被证明在许多机器人应用中非常有用，如安防与监控、搜救。然而，大多数现有研究侧重于解析和几何解法或端到端强化学习方法，这些方法大多局限于无障碍环境或障碍稀疏且规则分布的场景。这项工作从独特的视角解决了这个问题：仅凭著名的“伏击”策略，就足以让多个较慢的追捕者在复杂环境中高效捕获一个速度更快、智能水平各异的逃逸者。首先设计了一种参数化伏击策略（包括离散和连续参数），该策略考虑了工作空间的拓扑性质、截断的视线可见性、相对速度比以及有限的捕获范围。随后，提出了一种混合蒙特卡洛树搜索（H-MCTS）算法，通过长期规划优化相关参数，从而识别出对未来捕获极具潜力的参数。最后，离线训练神经加速模块，学习不同环境中各种参数选择的排序，并直接预测得分，以取代H-MCTS中的rollout过程。在线H-MCTS采用该神经加速模块，在保证规划质量的同时加快规划过程。其效率和有效性在大量仿真和硬件实验中得到了验证，对手为能力和智能水平各异的逃逸者，包括速度高达两倍的逃逸者和由人类操控的逃逸者。

AutoRestTest at the SBFT 2026 Tool Competition

2026年SBFT工具竞赛中的AutoRestTest

Authors: Tyler Stennett, Myeongsoo Kim, Saurabh Sinha, Alessandro Orso
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2607.01063
Pdf link: https://arxiv.org/pdf/2607.01063
Abstract Large input spaces and complex inter-operation dependencies make black-box REST API testing challenging. AutoRestTest combines a Semantic Property Dependency Graph, multi-agent reinforcement learning, and large language models to intelligently explore large API input spaces. In the SBFT 2026 REST League, AutoRestTest ranked first in all three evaluation categories -- fault detection, overall efficiency, and overall effectiveness -- on 11 APIs (317 operations, approximately 29 per API), averaging 67.09 unique server errors and 17.27 successfully processed operations per API under a one-hour testing budget.
中文摘要 庞大的输入空间和复杂的操作间依赖使得黑盒REST API测试具有挑战性。AutoRestTest 结合了语义属性依赖图、多智能体强化学习和大型语言模型，能够智能探索大型 API 输入空间。在SBFT 2026 REST League中，AutoRestTest在11个API（共317个操作，每个API约29个）上，于故障检测、整体效率和整体有效性三个评估类别中均排名第一，在一小时测试预算下平均每个API发现67.09个唯一服务器错误、成功处理17.27个操作。

Can Agents Generalize to the Open World? Unveiling the Fragility of Static Training in Tool Use

代理能推广到开放世界吗？揭示静态训练在工具使用中的脆弱性

Authors: Song-Lin Lv, Weiming Wu, Rui Zhu, Zi-Jian Cheng, Lan-Zhe Guo
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2607.01084
Pdf link: https://arxiv.org/pdf/2607.01084
Abstract While Large Language Model (LLM) agents demonstrate proficiency in static benchmarks, their deployment in real-world scenarios is hindered by the dynamic nature of user queries, tool sets, and interaction dynamics. To address this generalization gap, we formalize OpenAgent (Tool-Use Agent in Open-World), a problem setting characterized by distributional shifts across query, action, observation, and domain dimensions. To systematically diagnose its impact, we construct a controlled sandbox environment where we define fine-grained environmental shifts across a four-tier hierarchy, Perception, Interaction, Reasoning, and Internalization, and conduct a comprehensive series of experiments. Our analysis yields a series of key insights, demonstrating that agents trained via both Supervised Fine-Tuning (SFT) and Reinforcement Learning suffer from varying degrees of performance degradation when confronting open environmental shifts. Building on these insights, we propose Perturbation-Augmented Fine-Tuning, a disturbance-based intervention strategy for SFT that lays the foundation for enhancing agent robustness and utility in realistic environments. Our code will be released at: https://github.com/LAMDA-NeSy/OpenAgent.
中文摘要 虽然大型语言模型（LLM）代理在静态基准测试上表现出色，但其在实际场景中的部署受限于用户查询、工具集和交互过程的动态特性。为弥补这一泛化差距，我们形式化了OpenAgent（开放世界工具使用代理），这一问题设定以查询、动作、观察和领域四个维度上的分布偏移为特征。为了系统地诊断其影响，我们构建了一个受控的沙盒环境，在感知、交互、推理和内化这一四层级结构上定义了细粒度的环境偏移，并进行了一系列全面的实验。我们的分析得出一系列关键见解，表明通过监督微调（SFT）和强化学习训练的代理在面对开放环境偏移时，都会出现不同程度的性能下降。基于这些见解，我们提出了扰动增强微调，这是一种面向SFT的基于扰动的干预策略，为增强代理在现实环境中的稳健性和实用性奠定了基础。我们的代码将发布于：https://github.com/LAMDA-NeSy/OpenAgent。

Next-Generation Agentic Reinforcement Learning Systems Enable Self-Evolving Agents

下一代智能体强化学习系统使智能体能够自我演化

Authors: Ran Yan, Wei Fu, Jiale Li, Shusheng Xu, Zhiyu Mei, Jiaxuan Gao, Jiarui Zhang, Xujie Shen, Hao Dai, Chuyi He, Zhen Pu, Jun Mei, Zhiyao Lin, Haitao Wang, Zhiqiang Ding, Jiawei Zhang, Huaijie Wang, Ruida Xu, Youhe Jiang, Yi Wu, Tongkai Yang, Binhang Yuan
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2607.01120
Pdf link: https://arxiv.org/pdf/2607.01120
Abstract LLM agents are rapidly being deployed in production, including coding assistants, customer-support chatbots, and scientific research assistants, yet they remain fundamentally static in enterprise deployment. The LLM weights, system prompts, tool repertoires, and in-context harnesses are frozen at deployment time, and any improvement requires a manual loop of human-curated data collection, offline fine-tuning, modification of the agentic paradigm, and re-deployment. Recent work on self-evolving agents, such as OpenClaw for individual users, indicates that the next leap in agent capability will come from agents that continually learn from their own experience. In this paper, we argue that this vision for self-evolving agent deployment is being held back for enterprise-level large-scale agentic service not by reinforcement learning (RL) algorithms but by agentic online RL systems. Specifically, current agentic RL systems and the surrounding observability software stack are inadequate along three essential aspects: (i) there is no standardized agent trajectory data protocol capable of carrying RL learning signals at step granularity across heterogeneous agent paradigms; (ii) there is no enterprise-grade comprehensive data proxy that converts real workloads into governed learning substrates; and (iii) there is no unified agent evolution control plane that automatically decides, based on trajectory statistics, when to update policy weights or evolve the in-context harness. The next generation of agentic RL systems must be co-designed around these three pillars, and we sketch concrete architectures, case studies, and counter-arguments. We instantiate one branch through AReaL2.0, reorganizing existing RL infrastructure into an agent-oriented online RL loop for policy weight updates from deployed workloads.
中文摘要 LLM智能体正在迅速部署到生产环境中，包括编码助手、客户支持聊天机器人和科学研究助理，但在企业部署中它们基本上保持静态。LLM权重、系统提示、工具库和上下文harness在部署时被冻结，任何改进都需要一个人工循环，包括人工整理的数据收集、离线微调、智能体范式的修改和重新部署。最近关于自我演化智能体的研究，如面向个人用户的OpenClaw，表明智能体能力的下一次飞跃将来自不断从自身经验中学习的智能体。本文认为，对于企业级大规模智能体服务而言，阻碍自我演化智能体部署这一愿景的不是强化学习（RL）算法，而是智能体在线强化学习系统。具体来说，当前的智能体强化学习系统及其周边的可观测性软件栈在三个关键方面存在不足：（i）没有标准化的智能体轨迹数据协议，能够跨异构智能体范式以步级粒度传递强化学习信号；（ii）没有企业级的综合数据代理，能够将真实工作负载转换为受治理的学习基底；（iii）没有统一的智能体演化控制平面，能够根据轨迹统计自动决定何时更新策略权重或演化上下文harness。下一代智能体强化学习系统必须围绕这三大支柱协同设计，我们勾勒了具体架构、案例研究和反方论点。我们通过AReaL2.0实例化了其中一个分支，将现有的RL基础设施重组为面向智能体的在线RL循环，用于根据已部署的工作负载更新策略权重。

QuasiMoTTo: Quasi-Monte Carlo Test-Time Scaling

QuasiMoTTo：准蒙特卡洛测试时扩展

Authors: Michael Y. Li, Anthony Zhan, Kanishk Gandhi, Noah D. Goodman, Emily B. Fox
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2607.01179
Pdf link: https://arxiv.org/pdf/2607.01179
Abstract Scaling inference compute, by generating many parallel attempts per problem, is a costly but reliable lever for improving language model capabilities. By default these attempts are generated independently, wasting inference compute on redundant solutions. This waste seems unavoidable. After all, independence is what makes parallel sampling trivial to scale. However, this tradeoff is not fundamental: there is a rich design space of samplers that generate correlated but exact samples entirely in parallel. We explore this design space as an avenue for improving sample efficiency in scaling inference compute and reinforcement learning (RL). Concretely, we introduce QuasiMoTTo, which uses correlated samples as a drop-in replacement for i.i.d. samples. To generate these samples, QuasiMoTTo uses a reparameterization of autoregressive sampling as inverse-CDF sampling and draws the underlying uniforms with quasi-Monte Carlo (QMC); because QMC spreads the uniforms out more evenly than i.i.d., the resulting samples cover the output space with far less redundancy. Even though the batch is correlated, each sample is marginally distributed according to the language model, so we can use the batch for policy-gradient training. Our empirical analysis focuses on understanding how efficiently QuasiMoTTo can turn compute into performance. To evaluate correlated samplers, whose dependence breaks standard pass@k estimators, we first develop an unbiased bootstrap estimator. Across four reasoning benchmarks, QuasiMoTTo matches i.i.d. pass@k accuracy with 25-47% fewer samples. Strikingly, QuasiMoTTo often saturates an upper bound on pass@k that holds for any marginal-preserving sampler. We also apply QuasiMoTTo to policy-gradient RL (GRPO) where it matches i.i.d. performance with 50% fewer training steps. These gains come from higher coverage, which yields a stronger learning signal per batch.
中文摘要 通过为每个问题生成多个并行尝试来扩展推理计算，是提升语言模型能力的一种昂贵但可靠的手段。默认情况下，这些尝试是独立生成的，会把推理计算浪费在冗余解上。这种浪费似乎无法避免。毕竟，正是独立性使得并行采样易于扩展。然而，这种权衡并非根本性的：存在一个丰富的采样器设计空间，能够完全并行地生成相关但精确的样本。我们探索这一设计空间，将其作为在扩展推理计算和强化学习（RL）中提升样本效率的途径。具体来说，我们引入了QuasiMoTTo，它使用相关样本作为i.i.d.样本的直接替代。为了生成这些样本，QuasiMoTTo 将自回归采样重新参数化为逆CDF采样，并用准蒙特卡洛（QMC）抽取底层的均匀随机数；由于QMC比i.i.d.更均匀地散布这些均匀随机数，所得样本覆盖输出空间时冗余大大降低。尽管批次内的样本是相关的，但每个样本的边缘分布都服从语言模型，因此我们可以用该批次进行策略梯度训练。我们的实证分析重点是理解QuasiMoTTo将计算转化为性能的效率。相关采样器的依赖性会破坏标准的pass@k估计量，为了评估这类采样器，我们首先开发了一个无偏自助（bootstrap）估计器。在四个推理基准上，QuasiMoTTo以少25%-47%的样本达到了i.i.d.采样的pass@k准确率。值得注意的是，QuasiMoTTo常常达到pass@k的一个上界，该上界对任何保持边缘分布的采样器都成立。我们还将QuasiMoTTo应用于策略梯度强化学习（GRPO），其以少50%的训练步数达到了i.i.d.采样的性能。这些收益来自更高的覆盖率，从而使每个批次产生更强的学习信号。
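
Illustrative sketch (not from the paper): the QuasiMoTTo abstract describes sampling by inverse CDF with low-discrepancy uniforms in place of i.i.d. ones. The toy below shows that idea for a single categorical draw only. The function names, the three-outcome distribution, and the choice of a randomly shifted lattice as the low-discrepancy point set are all assumptions made for illustration; the abstract does not say which QMC construction the authors use or how it is extended across autoregressive steps.

```python
import bisect
import itertools
import random


def inverse_cdf_sample(probs, u):
    """Map a uniform u in [0, 1) to the category whose CDF interval contains it."""
    cdf = list(itertools.accumulate(probs))
    # bisect_right returns the first index whose cumulative value exceeds u.
    k = bisect.bisect_right(cdf, u)
    return min(k, len(probs) - 1)  # guard against floating-point round-off


def shifted_lattice_uniforms(n, rng):
    """Return n correlated uniforms: an evenly spaced grid plus one shared shift.

    Each point on its own is uniform on [0, 1), so every sample keeps the
    target marginal, but jointly the points are spread out evenly.
    """
    shift = rng.random()
    return [(i / n + shift) % 1.0 for i in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    probs = [0.5, 0.3, 0.2]  # hypothetical next-token distribution
    n = 10

    iid = [inverse_cdf_sample(probs, rng.random()) for _ in range(n)]
    qmc = [inverse_cdf_sample(probs, u) for u in shifted_lattice_uniforms(n, rng)]

    print("i.i.d. counts :", [iid.count(k) for k in range(len(probs))])
    print("lattice counts:", [qmc.count(k) for k in range(len(probs))])
```

With the lattice, the per-category counts in a batch stay close to `n * probs[k]`, whereas i.i.d. draws fluctuate more from batch to batch; this is the reduced-redundancy effect the abstract refers to, shown here in one dimension only.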

Perceive-to-Reason: Decoupling Perception and Reasoning for Fine-Grained Visual Reasoning

感知到推理：将感知与推理分离，实现细粒度的视觉推理

Authors: Hongxing Li, Xiufeng Huang, Dingming Li, Wenjing Jiang, Zixuan Wang, Haolei Xu, Hanrong Zhang, Haiwen Hong, Longtao Huang, Hui Xue, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2607.01191
Pdf link: https://arxiv.org/pdf/2607.01191
Abstract Fine-grained visual reasoning remains challenging for vision-language models, especially when small but critical visual cues are buried in high-resolution images. Existing approaches rely on repeated cropping or test-time visual search to introduce local evidence, but they typically do not explicitly distinguish perception from reasoning. In this paper, we propose Perceive-to-Reason (P2R), a unified framework that formulates fine-grained visual reasoning as a two-stage process: the model first localizes question-relevant evidence as a Perceiver, and then answers the question as a Reasoner based on the annotated image and cropped regions. To better align training with this decoupled formulation, we further introduce Perception-Reasoning Alternating GRPO (PRA-GRPO), a role-aware reinforcement learning strategy that alternates between perception-focused and reasoning-focused updates using only final-answer supervision. Built on top of Qwen3-VL-Instruct-2B/4B/8B, P2R consistently improves performance across model scales. In particular, P2R-4B achieves 93.2% on V-Star, 81.9% on HR-Bench-4K, and 80.5% on HR-Bench-8K, substantially outperforming its corresponding backbone. Further experiments show that the benefits of P2R extend beyond high-resolution benchmarks to broader multimodal reasoning tasks. These results suggest that explicitly decoupling perception from reasoning provides an effective framework for fine-grained visual reasoning.
中文摘要 对于视觉语言模型来说，细粒度的视觉推理依然具有挑战性，尤其是当微小但关键的视觉线索埋藏在高分辨率图像中时。现有方法依赖反复裁剪或测试时的视觉搜索来引入局部证据，但通常不会明确区分感知与推理。本文提出了Perceive-to-Reason（P2R）这一统一框架，将细粒度视觉推理表述为两阶段过程：模型首先作为感知者定位与问题相关的证据，然后作为推理者根据标注后的图像和裁剪区域回答问题。为了更好地使训练与这种解耦形式对齐，我们进一步引入了感知-推理交替GRPO（PRA-GRPO），这是一种角色感知的强化学习策略，仅使用最终答案监督，在侧重感知和侧重推理的更新之间交替进行。P2R 基于 Qwen3-VL-Instruct-2B/4B/8B 构建，在各模型规模上均稳定提升性能。特别是，P2R-4B在V-Star上达到93.2%，在HR-Bench-4K上达到81.9%，在HR-Bench-8K上达到80.5%，大幅超过其对应的骨干模型。进一步的实验表明，P2R的益处不仅限于高分辨率基准，还延伸到更广泛的多模态推理任务。这些结果表明，明确将感知与推理解耦，为细粒度视觉推理提供了有效的框架。

Quantum vs. Classical Machine Learning: A Unified Empirical Comparison

量子与经典机器学习：统一的经验比较

Authors: Chuanming Yu, Jiaming Liu, Zihao Ge, Xiongfei Wu, Lulu Zhu, Pengzhan Zhao, Jianjun Zhao
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2607.01197
Pdf link: https://arxiv.org/pdf/2607.01197
Abstract Quantum computing has emerged as a promising computational paradigm for machine learning (ML), with the potential to offer computational advantages over classical approaches. At this stage, the evidence supporting the performance and advantages of quantum machine learning (QML) models relative to classical models is [...]. To address this gap, this paper presents an empirical study on the performance of QML models and their classical counterparts. We compare seven model pairs spanning supervised learning and reinforcement learning. Our results indicate that the evaluated quantum machine learning models do not yet surpass the classical baselines in overall prediction performance, policy stability, or training time. Nevertheless, QML remains a promising approach for filtering noise and controlling false positives. Our research findings summarize the challenges facing quantum machine learning across hardware environments, training efficiency, and convergence stability, providing a foundation for research into the robustness and parameter optimization of QML. This work is publicly available at this https URL.
中文摘要 量子计算已成为机器学习（ML）的一种有前景的计算范式，有望相较经典方法提供计算优势。目前，支持量子机器学习（QML）模型相较于经典模型的性能和优势的证据[...]。为弥补这一空白，本文对QML模型及其经典对应模型的性能进行了实证研究。我们比较了涵盖监督学习和强化学习的七对模型。我们的结果表明，所评估的量子机器学习模型在整体预测性能、策略稳定性或训练时间方面尚未超越经典基线。尽管如此，QML在滤除噪声和控制误报方面仍是一种有前景的方法。我们的研究结果总结了量子机器学习在硬件环境、训练效率和收敛稳定性方面面临的挑战，为QML的鲁棒性和参数优化研究奠定了基础。本工作可在此 https URL 公开获取。

Language-Critique Imitation Learning from Suboptimal Demonstrations

语言批判模仿：从次优演示中学习

Authors: Chih-Han Yang, Dai-Jie Wu, Yun-Ping Huang, Ping-Chun Hsieh, Kenneth Marino, Shao-Hua Sun
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2607.01225
Pdf link: https://arxiv.org/pdf/2607.01225
Abstract Prior work on imitation learning from suboptimal demonstrations typically relies on compressed supervision signals such as confidence estimates, discriminator scores, or importance weights. These scalar signals are inherently limited, as they cannot explicitly express intermediate reasoning about task progress, failure modes, or corrective actions. We propose a language-critique framework for imitation learning from suboptimal demonstrations that instead leverages natural language as a structured supervision signal, avoiding the collapse of expressive feedback into scalars. Our method first constructs language labels from demonstrations that explicitly describe current progress, identify suboptimal behaviors, and provide fine-grained corrective guidance. We then introduce a language-critique loss that directly trains policies using these structured signals without reducing them to scalars, and instantiate it for both behavior cloning and diffusion policies, yielding LC-BC and LC-DP. We further provide a theoretical result showing that the proposed objective upper-bounds the expert performance gap under standard assumptions. Empirically, we evaluate on diverse continuous control tasks spanning navigation, manipulation, and gameplay, where our methods consistently outperform strong imitation learning and offline reinforcement learning baselines. These results demonstrate that language can serve as a powerful and structured form of supervision for learning robust policies from suboptimal data.
中文摘要 先前关于从次优演示中进行模仿学习的工作通常依赖压缩的监督信号，如置信度估计、判别器分数或重要性权重。这些标量信号本身存在局限，因为它们无法明确表达关于任务进展、失败模式或纠正动作的中间推理。我们提出了一种语言批判框架，用于从次优演示中进行模仿学习，该框架转而利用自然语言作为结构化的监督信号，避免将富有表达力的反馈压缩为标量。我们的方法首先从演示中构建语言标签，明确描述当前进展，识别次优行为，并提供细粒度的纠正指导。随后引入一种语言批判损失，直接用这些结构化信号训练策略，而不将其简化为标量，并在行为克隆和扩散策略上分别实例化，得到LC-BC和LC-DP。我们还提供了一个理论结果，表明在标准假设下，所提出的目标函数是与专家性能差距的上界。在实证方面，我们在涵盖导航、操作和游戏的多种连续控制任务上进行评估，我们的方法始终优于强大的模仿学习和离线强化学习基线。这些结果表明，语言可以作为一种强大且结构化的监督形式，用于从次优数据中学习稳健的策略。

Is One Layer Enough? Training A Single Transformer Layer Can Match Full-Parameter RL Training

一层够吗？训练单个Transformer层即可匹敌全参数强化学习训练

Authors: Zijian Zhang, Rizhen Hu, Athanasios Glentis, Dawei Li, Chung-Yiu Yau, Hongzhou Lin, Mingyi Hong
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2607.01232
Pdf link: https://arxiv.org/pdf/2607.01232
Abstract Reinforcement learning (RL) has become a central component of post-training large language models (LLMs), yet little is understood about how RL adaptation is distributed across transformer layers. Existing approaches typically update all model parameters uniformly, implicitly assuming that every layer contributes similarly to the gains obtained during RL post-training. In this work, we challenge this assumption through a systematic layer-wise study of RL training. Surprisingly, we find that training a single transformer layer can recover most of the gains achieved by full-parameter RL training, and in some cases even surpass it. To quantify this phenomenon, we introduce the quantity layer contribution, which measures the fraction of full RL improvement recovered by training a layer in isolation. Across seven models spanning two model families (Qwen3, Qwen2.5), three RL algorithms (GRPO, GiGPO, Dr. GRPO), and multiple task domains including mathematical reasoning, code generation, and agentic decision-making, we observe a remarkably stable pattern: RL gains are highly concentrated in a small subset of, and in many cases even a single, transformer layers. More strikingly, the same structural pattern consistently emerges: high-contribution layers concentrate in the middle of the transformer stack, while layers near the input and output ends contribute substantially less. The resulting layer rankings remain strongly correlated across datasets, tasks, model families, and RL algorithms.
中文摘要 强化学习（RL）已成为大型语言模型（LLM）后训练的核心组成部分，但关于强化学习的适应如何分布在各Transformer层之间，人们仍知之甚少。现有方法通常统一更新所有模型参数，隐含假设每一层对强化学习后训练所获收益的贡献相近。在本研究中，我们通过对强化学习训练进行系统的逐层研究，挑战了这一假设。令人惊讶的是，我们发现仅训练单个Transformer层就能恢复全参数强化学习训练所获得的大部分收益，有时甚至超过全参数训练。为量化这一现象，我们引入了“层贡献度”这一量，用于衡量单独训练某一层所恢复的全量强化学习提升的比例。在涵盖两个模型族（Qwen3、Qwen2.5）的七个模型、三种强化学习算法（GRPO、GiGPO、Dr. GRPO）以及包括数学推理、代码生成和智能体决策在内的多个任务领域中，我们观察到一个极为稳定的模式：强化学习的收益高度集中在一小部分Transformer层中，在许多情况下甚至集中在单个层中。更引人注目的是，相同的结构模式始终出现：高贡献层集中在Transformer堆栈的中部，而靠近输入端和输出端的层贡献明显较小。所得的层排名在数据集、任务、模型族和强化学习算法之间保持高度相关。
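
Illustrative note (not from the paper): the abstract defines layer contribution only in words, as the fraction of the full RL improvement recovered by training one layer in isolation. One plausible way to write that down is sketched here, assuming a scalar evaluation score $S$ such as task accuracy. The symbols $S_{\text{base}}$, $S_{\text{full}}$, $S_{\ell}$ and $C_{\ell}$ are our own notation, and the authors' exact definition may differ.

```latex
% S_base : score of the model before RL post-training
% S_full : score after full-parameter RL training
% S_ell  : score after RL training of layer ell only, all other layers frozen
\[
  C_{\ell} \;=\; \frac{S_{\ell} - S_{\text{base}}}{S_{\text{full}} - S_{\text{base}}},
  \qquad S_{\text{full}} \neq S_{\text{base}} .
\]
```

Under this reading, $C_{\ell} = 1$ means that training layer $\ell$ alone recovers the entire full-parameter gain, and $C_{\ell} > 1$ corresponds to the cases the abstract mentions where single-layer training surpasses full-parameter training.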

Keyword: diffusion policy

FAR: Failure-Aware Retry for Test-Time Recovery and Continual Policy Improvement

FAR：用于测试时恢复和持续策略改进的失败感知重试

Authors: Haoran Hao, Shahram Najam Syed, Jeffrey Ichnowski, Jeff Schneider
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2607.01111
Pdf link: https://arxiv.org/pdf/2607.01111
Abstract Robot policies inevitably encounter failures when deployed in real environments. Naive retries often repeat the same mistakes, while many existing recovery methods rely on human intervention. In this paper, we propose Failure-Aware Retry (FAR), a framework that enables robots to learn from previous failures at test time, adapt their behavior accordingly, and eventually complete the task autonomously. FAR combines Failure-Contrastive Preference Adaptation, which constructs preference learning data from failures to steer the policy away from previously unsuccessful behaviors, with lightweight action perturbations during retries to encourage local exploration. We further incorporate successful recovery trajectories into a training loop for continual policy improvement. Experiments in both simulation and real-world manipulation tasks show that FAR substantially improves success rates and robustness, with average gains of 17.6% over the standard diffusion policy in simulation and 11.7% in the real world. In addition, FAR significantly improves data efficiency under both reset and timestep budgets during continual policy improvement by exploiting informative failure cases.
中文摘要 机器人策略在真实环境中部署时不可避免地会遇到失败。简单的重试常常重复同样的错误，而许多现有的恢复方法依赖于人工干预。本文提出了失败感知重试（FAR）框架，使机器人能够在测试时从之前的失败中学习，相应调整行为，最终自主完成任务。FAR 结合了失败对比偏好适应（Failure-Contrastive Preference Adaptation）与重试时的轻量级动作扰动：前者从失败中构建偏好学习数据，引导策略远离先前失败的行为；后者用于鼓励局部探索。我们进一步将成功的恢复轨迹纳入训练循环，以持续改进策略。仿真和真实世界操作任务的实验显示，FAR显著提升了成功率和鲁棒性，相较标准扩散策略，在仿真中平均提升17.6%，在真实世界中平均提升11.7%。此外，在持续策略改进期间，FAR通过利用信息丰富的失败案例，在重置预算和时间步预算下均显著提升了数据效率。