Generated: 2026-04-06 17:12:21 (UTC+8); arXiv release time: 2026-04-06 20:00 EDT (2026-04-07 08:00 UTC+8)
38 related papers today
Keyword: reinforcement learning
LLM Reasoning with Process Rewards for Outcome-Guided Steps
- Authors: Mohammad Rezaei, Jens Lehmann, Sahar Vahdati
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2604.02341
- Pdf link: https://arxiv.org/pdf/2604.02341
- Abstract
Mathematical reasoning in large language models has improved substantially with reinforcement learning using verifiable rewards, where final answers can be checked automatically and converted into reliable training signals. Most such pipelines optimize outcome correctness only, which yields sparse feedback for long, multi-step solutions and offers limited guidance on intermediate reasoning errors. Recent work therefore introduces process reward models (PRMs) to score intermediate steps and provide denser supervision. In practice, PRM scores are often imperfectly aligned with final correctness and can reward locally fluent reasoning that still ends in an incorrect answer. When optimized as absolute rewards, such signals can amplify fluent failure modes and induce reward hacking. We propose PROGRS, a framework that leverages PRMs while keeping outcome correctness dominant. PROGRS treats process rewards as relative preferences within outcome groups rather than absolute targets. We introduce outcome-conditioned centering, which shifts PRM scores of incorrect trajectories to have zero mean within each prompt group. It removes systematic bias while preserving informative rankings. PROGRS combines a frozen quantile-regression PRM with a multi-scale coherence evaluator. We integrate the resulting centered process bonus into Group Relative Policy Optimization (GRPO) without auxiliary objectives or additional trainable components. Across MATH-500, AMC, AIME, MinervaMath, and OlympiadBench, PROGRS consistently improves Pass@1 over outcome-only baselines and achieves stronger performance with fewer rollouts. These results show that outcome-conditioned centering enables safe and effective use of process rewards for mathematical reasoning.
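- Illustrative sketch (not from the paper)
The outcome-conditioned centering step described in the abstract can be pictured as follows. This is a minimal sketch under stated assumptions, not the PROGRS implementation: the function names, the `bonus_weight` parameter, and the choice to leave the scores of correct rollouts unshifted are all hypothetical.

```python
from statistics import mean

def centered_process_bonus(prm_scores, is_correct):
    """Shift PRM scores of incorrect rollouts to zero mean within one prompt group."""
    # Mean PRM score over the incorrect rollouts of this group.
    wrong = [s for s, ok in zip(prm_scores, is_correct) if not ok]
    wrong_mean = mean(wrong) if wrong else 0.0
    # Assumption: correct rollouts keep their raw score; only incorrect ones are shifted.
    return [s if ok else s - wrong_mean for s, ok in zip(prm_scores, is_correct)]

def shaped_rewards(prm_scores, is_correct, bonus_weight=0.1):
    """Outcome reward (1 for correct, 0 otherwise) plus a small centered process bonus."""
    bonus = centered_process_bonus(prm_scores, is_correct)
    return [float(ok) + bonus_weight * b for ok, b in zip(is_correct, bonus)]

group_scores = [0.9, 0.6, 0.2]
group_correct = [True, False, False]
print(centered_process_bonus(group_scores, group_correct))
print(shaped_rewards(group_scores, group_correct))
```

After centering, the incorrect rollouts keep their relative ranking but no longer earn a systematically positive bonus, so a fluent but wrong solution cannot outscore a correct one when `bonus_weight` is small.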
Contextual Intelligence: The Next Leap for Reinforcement Learning
- Authors: André Biedenkapp
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2604.02348
- Pdf link: https://arxiv.org/pdf/2604.02348
- Abstract
Reinforcement learning (RL) has produced spectacular results in games, robotics, and continuous control. Yet, despite these successes, learned policies often fail to generalize beyond their training distribution, limiting real-world impact. Recent work on contextual RL (cRL) shows that exposing agents to environment characteristics -- contexts -- can improve zero-shot transfer. So far, the community has treated context as a monolithic, static observable, an approach that constrains the generalization capabilities of RL agents. To achieve contextual intelligence we first propose a novel taxonomy of contexts that separates allogenic (environment-imposed) from autogenic (agent-driven) factors. We identify three fundamental research directions that must be addressed to promote truly contextual intelligence: (1) Learning with heterogeneous contexts to explicitly exploit the taxonomy levels so agents can reason about their influence on the world and vice versa; (2) Multi-time-scale modeling to recognize that allogenic variables evolve slowly or remain static, whereas autogenic variables may change within an episode, potentially requiring different learning mechanisms; (3) Integration of abstract, high-level contexts to incorporate roles, resource & regulatory regimes, uncertainties, and other non-physical descriptors that crucially influence behavior. We envision context as a first-class modeling primitive, empowering agents to reason about who they are, what the world permits, and how both evolve over time. By doing so, we aim to catalyze a new generation of context-aware agents that can be deployed safely and efficiently in the real world.
OPRIDE: Offline Preference-based Reinforcement Learning via In-Dataset Exploration
- Authors: Yiqin Yang, Hao Hu, Yihuan Mao, Jin Zhang, Chengjie Wu, Yuhua Jiang, Xu Yang, Runpeng Xie, Yi Fan, Bo Liu, Yang Gao, Bo Xu, Chongjie Zhang
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2604.02349
- Pdf link: https://arxiv.org/pdf/2604.02349
- Abstract
Preference-based reinforcement learning (PbRL) can help avoid sophisticated reward designs and align better with human intentions, showing great promise in various real-world applications. However, obtaining human feedback for preferences can be expensive and time-consuming, which forms a strong barrier for PbRL. In this work, we address the problem of low query efficiency in offline PbRL, pinpointing two primary reasons: inefficient exploration and overoptimization of learned reward functions. In response to these challenges, we propose a novel algorithm, Offline PbRL via In-Dataset Exploration (OPRIDE), designed to enhance the query efficiency of offline PbRL. OPRIDE consists of two key features: a principled exploration strategy that maximizes the informativeness of the queries and a discount scheduling mechanism aimed at mitigating overoptimization of the learned reward functions. Through empirical evaluations, we demonstrate that OPRIDE significantly outperforms prior methods, achieving strong performance with notably fewer queries. Moreover, we provide theoretical guarantees of the algorithm's efficiency. Experimental results across various locomotion, manipulation, and navigation tasks underscore the efficacy and versatility of our approach.
Prism: Policy Reuse via Interpretable Strategy Mapping in Reinforcement Learning
- Authors: Thomas Pravetz
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2604.02353
- Pdf link: https://arxiv.org/pdf/2604.02353
- Abstract
We present PRISM (Policy Reuse via Interpretable Strategy Mapping), a framework that grounds reinforcement learning agents' decisions in discrete, causally validated concepts and uses those concepts as a zero-shot transfer interface between agents trained with different algorithms. PRISM clusters each agent's encoder features into $K$ concepts via K-means. Causal intervention establishes that these concepts directly drive, rather than merely correlate with, agent behavior: overriding concept assignments changes the selected action in 69.4% of interventions ($p = 8.6 \times 10^{-86}$, 2500 interventions). Concept importance and usage frequency are dissociated: the most-used concept (C47, 33.0% frequency) causes only a 9.4% win-rate drop when ablated, while ablating C16 (15.4% frequency) collapses win rate from 100% to 51.8%. Because concepts causally encode strategy, aligning them via optimal bipartite matching transfers strategic knowledge zero-shot. On Go 7$\times$7 with three independently trained agents, concept transfer achieves 69.5% $\pm$ 3.2% and 76.4% $\pm$ 3.4% win rate against a standard engine across the two successful transfer pairs (10 seeds), compared to 3.5% for a random agent and 9.2% without alignment. Transfer succeeds when the source policy is strong; geometric alignment quality predicts nothing ($R^2 \approx 0$). The framework is scoped to domains where strategic state is naturally discrete: the identical pipeline on Atari Breakout yields bottleneck policies at random-agent performance, confirming that the Go results reflect a structural property of the domain.
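- Illustrative sketch (not from the paper)
The concept-alignment step (optimal bipartite matching between two agents' concept sets) can be sketched as below. The abstract does not state which matching cost PRISM uses; squared distance between cluster centroids is an assumption here, and the exhaustive search over permutations is a stand-in that is only practical for very small $K$. For realistic $K$, a Hungarian-style solver such as SciPy's `linear_sum_assignment` would be the usual choice.

```python
from itertools import permutations

def sq_dist(u, v):
    """Squared Euclidean distance between two centroid vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def align_concepts(src_centroids, tgt_centroids):
    """Return mapping[i] = index of the target concept matched to source concept i.

    Brute-force minimum-cost perfect matching (assumed cost: centroid distance).
    """
    k = len(src_centroids)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(k)):
        cost = sum(sq_dist(src_centroids[i], tgt_centroids[perm[i]]) for i in range(k))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return list(best_perm)

# Hypothetical 2-D centroids for two agents with K = 3 concepts each.
agent_a = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
agent_b = [[5.1, 5.0], [0.1, 0.0], [1.0, 1.1]]
print(align_concepts(agent_a, agent_b))
```

With the mapping in hand, a concept index produced by one agent's encoder can be translated into the other agent's concept vocabulary before the downstream policy head is applied.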
From Broad Exploration to Stable Synthesis: Entropy-Guided Optimization for Autoregressive Image Generation
- Authors: Han Song, Yucheng Zhou, Jianbing Shen, Yu Cheng
- Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2604.02355
- Pdf link: https://arxiv.org/pdf/2604.02355
- Abstract
Combining Chain-of-Thought (CoT) with Reinforcement Learning (RL) improves text-to-image (T2I) generation, yet the underlying interaction between CoT's exploration and RL's optimization remains unclear. We present a systematic entropy-based analysis that yields three key insights: (1) CoT expands the generative exploration space, while RL contracts it toward high-reward regions; (2) final reward is strongly negatively correlated with both the mean and variance of image-token entropy, highlighting the need to reduce uncertainty and instability; and (3) the entropy of the textual CoT directly governs downstream image quality, with lower-entropy CoTs leading to better generations. Motivated by these findings, we propose Entropy-Guided Group Relative Policy Optimization (EG-GRPO), a fine-tuning strategy that reallocates optimization budget by uncertainty: low-entropy tokens are excluded from reward-driven updates to preserve stability, while high-entropy tokens receive an entropy bonus that encourages structured exploration without collapse. Experiments on standard T2I benchmarks demonstrate that EG-GRPO achieves state-of-the-art performance.
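- Illustrative sketch (not from the paper)
The entropy-based reallocation described in the abstract (skip low-entropy tokens, add a bonus on high-entropy ones) might look like the following per-token rule. The threshold `low_thresh`, the coefficient `bonus_coef`, and the additive form of the bonus are hypothetical; the abstract does not give EG-GRPO's exact formulas.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_guided_signal(token_dists, advantage, low_thresh=0.5, bonus_coef=0.1):
    """Per-token learning signal for one sampled sequence.

    Low-entropy tokens are masked out of the reward-driven update;
    high-entropy tokens receive the sequence advantage plus an entropy bonus.
    """
    signals = []
    for probs in token_dists:
        h = token_entropy(probs)
        if h < low_thresh:
            signals.append(0.0)  # excluded to preserve stable, confident tokens
        else:
            signals.append(advantage + bonus_coef * h)
    return signals

# Two toy token positions: one near-deterministic, one maximally uncertain.
dists = [[1.0, 0.0], [0.5, 0.5]]
print(entropy_guided_signal(dists, advantage=1.0))
```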
A Survey on AI for 6G: Challenges and Opportunities
- Authors: Constantina Chatzieleftheriou, Eirini Liotou
- Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2604.02370
- Pdf link: https://arxiv.org/pdf/2604.02370
- Abstract
As wireless communication evolves, each generation of networks brings new technologies that change how we connect and interact. Artificial Intelligence (AI) is becoming crucial in shaping the future of sixth-generation (6G) networks. By combining AI and Machine Learning (ML), 6G aims to offer high data rates, low latency, and extensive connectivity for applications including smart cities, autonomous systems, holographic telepresence, and the tactile internet. This paper provides a detailed overview of the role of AI in supporting 6G networks. It focuses on key technologies like deep learning, reinforcement learning, federated learning, and explainable AI. It also looks at how AI integrates with essential network functions and discusses challenges related to scalability, security, and energy efficiency, along with new solutions. Additionally, this work highlights perspectives that connect AI-driven analytics to 6G service domains like Ultra-Reliable Low-Latency Communication (URLLC), Enhanced Mobile Broadband (eMBB), Massive Machine-Type Communication (mMTC), and Integrated Sensing and Communication (ISAC). It addresses concerns about standardization, ethics, and sustainability. By summarizing recent research trends and identifying future directions, this survey offers a valuable reference for researchers and practitioners at the intersection of AI and next-generation wireless communication.
Compositional Neuro-Symbolic Reasoning
- Authors: Anugyan Das, Omkar Ghugarkar, Vishvesh Bhat, Asad Aali
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2604.02434
- Pdf link: https://arxiv.org/pdf/2604.02434
- Abstract
We study structured abstraction-based reasoning for the Abstraction and Reasoning Corpus (ARC) and compare its generalization to test-time approaches. Purely neural architectures lack reliable combinatorial generalization, while strictly symbolic systems struggle with perceptual grounding. We therefore propose a neuro-symbolic architecture that extracts object-level structure from grids, uses neural priors to propose candidate transformations from a fixed domain-specific language (DSL) of atomic patterns, and filters hypotheses using cross-example consistency. Instantiated as a compositional reasoning framework based on unit patterns inspired by human visual abstraction, the system augments large language models (LLMs) with object representations and transformation proposals. On ARC-AGI-2, it improves base LLM performance from 16% to 24.4% on the public evaluation set, and to 30.8% when combined with ARC Lang Solver via a meta-classifier. These results demonstrate that separating perception, neural-guided transformation proposal, and symbolic consistency filtering improves generalization without task-specific finetuning or reinforcement learning, while reducing reliance on brute-force search and sampling-based test-time scaling. We open-source the ARC-AGI-2 Reasoner code (this https URL).
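- Illustrative sketch (not from the paper)
The cross-example consistency filter can be illustrated with a toy DSL: a candidate transformation survives only if it maps every training input grid to its training output. The four grid operations below are hypothetical stand-ins for the paper's atomic patterns, and the neural proposal step is omitted (every DSL entry is treated as a candidate).

```python
def identity(grid):
    return [row[:] for row in grid]

def flip_h(grid):
    """Mirror each row left to right."""
    return [row[::-1] for row in grid]

def flip_v(grid):
    """Mirror the grid top to bottom."""
    return [row[:] for row in grid[::-1]]

def transpose(grid):
    return [list(col) for col in zip(*grid)]

# Toy DSL of atomic transformations (assumed for illustration only).
DSL = {"identity": identity, "flip_h": flip_h, "flip_v": flip_v, "transpose": transpose}

def consistent_programs(train_pairs, dsl=DSL):
    """Keep only candidates that reproduce the output of every training pair."""
    return [name for name, fn in dsl.items()
            if all(fn(x) == y for x, y in train_pairs)]

pairs = [
    ([[1, 2], [3, 4]], [[2, 1], [4, 3]]),
    ([[0, 5]], [[5, 0]]),
]
print(consistent_programs(pairs))  # prints ['flip_h']
```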
Mitigating Data Scarcity in Spaceflight Applications for Offline Reinforcement Learning Using Physics-Informed Deep Generative Models
- Authors: Alex E. Ballentine, Nachiket U. Bapat, Raghvendra V. Cowlagi
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2604.02438
- Pdf link: https://arxiv.org/pdf/2604.02438
- Abstract
The deployment of reinforcement learning (RL)-based controllers on physical systems is often limited by poor generalization to real-world scenarios, known as the simulation-to-reality (sim-to-real) gap. This gap is particularly challenging in spaceflight, where real-world training data are scarce due to high cost and limited planetary exploration data. Traditional approaches, such as system identification and synthetic data generation, depend on sufficient data and often fail due to modeling assumptions or lack of physics-based constraints. We propose addressing this data scarcity by introducing physics-based learning bias in a generative model. Specifically, we develop the Mutual Information-based Split Variational Autoencoder (MI-VAE), a physics-informed VAE that learns differences between observed system trajectories and those predicted by physics-based models. The latent space of the MI-VAE enables generation of synthetic datasets that respect physical constraints. We evaluate MI-VAE on a planetary lander problem, focusing on limited real-world data and offline RL training. Results show that augmenting datasets with MI-VAE samples significantly improves downstream RL performance, outperforming standard VAEs in statistical fidelity, sample diversity, and policy success rate. This work demonstrates a scalable strategy for enhancing autonomous controller robustness in complex, data-constrained environments.
RL-Loop: Reinforcement Learning-Driven Real-Time 5G Slice Control for Connected and Autonomous Mobility Services
- Authors: Lara Tarkh, Ali Chouman, Hanan Lutfiyya, Abdallah Shami
- Subjects: Networking and Internet Architecture (cs.NI)
- Arxiv link: https://arxiv.org/abs/2604.02461
- Pdf link: https://arxiv.org/pdf/2604.02461
- Abstract
Smart and connected mobility systems rely on 5G edge infrastructure to support real-time communication, control, and service differentiation. Achieving this requires adaptive resource management mechanisms that can react to rapidly changing traffic conditions. In this paper, we propose RL-Loop, a closed-loop reinforcement learning framework for real-time CPU resource control in 5G network slicing environments supporting connected mobility services. RL-Loop employs a Proximal Policy Optimization (PPO) agent that continuously observes slice-level key performance indicators and adjusts edge CPU allocations at one-second granularity on a real testbed. The framework leverages real-time observability and feedback to enable adaptive, software-defined edge intelligence. Experimental results suggest that RL-Loop can reduce average CPU allocation by over 55% relative to the reference operating point while reaching a comparable quality-of-service degradation region. These results indicate that lightweight reinforcement learning--based feedback control can provide efficient and responsive resource management for 5G-enabled smart mobility and connected vehicle services.
Tune to Learn: How Controller Gains Shape Robot Policy Learning
- Authors: Antonia Bronars, Younghyo Park, Pulkit Agrawal
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2604.02523
- Pdf link: https://arxiv.org/pdf/2604.02523
- Abstract
Position controllers have become the dominant interface for executing learned manipulation policies. Yet a critical design decision remains understudied: how should we choose controller gains for policy learning? The conventional wisdom is to select gains based on desired task compliance or stiffness. However, this logic breaks down when controllers are paired with state-conditioned policies: effective stiffness emerges from the interplay between learned reactions and control dynamics, not from gains alone. We argue that gain selection should instead be guided by learnability: how amenable different gain settings are to the learning algorithm in use. In this work, we systematically investigate how position controller gains affect three core components of modern robot learning pipelines: behavior cloning, reinforcement learning from scratch, and sim-to-real transfer. Through extensive experiments across multiple tasks and robot embodiments, we find that: (1) behavior cloning benefits from compliant and overdamped gain regimes, (2) reinforcement learning can succeed across all gain regimes given compatible hyperparameter tuning, and (3) sim-to-real transfer is harmed by stiff and overdamped gain regimes. These findings reveal that optimal gain selection depends not on the desired task behavior, but on the learning paradigm employed. Project website: this https URL
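- Illustrative sketch (not from the paper)
The gain regimes named in the abstract (compliant vs. stiff, overdamped vs. underdamped) can be made concrete with a textbook PD position controller on a single joint. For a mass `m` driven by `tau = kp * (q_des - q) - kd * q_dot`, the damping ratio is `kd / (2 * sqrt(kp * m))`. The unit mass and the `stiff_kp` cutoff are assumptions for illustration; the paper's own regime definitions are not given in the abstract.

```python
import math

def pd_torque(q, q_dot, q_des, kp, kd):
    """PD position controller: spring toward the target, damper on velocity."""
    return kp * (q_des - q) - kd * q_dot

def damping_ratio(kp, kd, mass=1.0):
    """Damping ratio of the closed loop mass * q_ddot = pd_torque(...)."""
    return kd / (2.0 * math.sqrt(kp * mass))

def gain_regime(kp, kd, mass=1.0, stiff_kp=100.0):
    """Label a gain setting; stiff_kp is a hypothetical cutoff."""
    zeta = damping_ratio(kp, kd, mass)
    stiffness = "stiff" if kp >= stiff_kp else "compliant"
    if math.isclose(zeta, 1.0):
        damping = "critically damped"
    elif zeta > 1.0:
        damping = "overdamped"
    else:
        damping = "underdamped"
    return f"{stiffness}, {damping}"

print(gain_regime(kp=25.0, kd=20.0))   # prints "compliant, overdamped"
print(gain_regime(kp=400.0, kd=4.0))   # prints "stiff, underdamped"
```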
Interpretable Deep Reinforcement Learning for Element-level Bridge Life-cycle Optimization
- Authors: Seyyed Amirhossein Moayyedi, David Y. Yang
- Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2604.02528
- Pdf link: https://arxiv.org/pdf/2604.02528
- Abstract
The new Specifications for the National Bridge Inventory (SNBI), in effect from 2022, emphasize the use of element-level condition states (CS) for risk-based bridge management. Instead of a general component rating, element-level condition data use an array of relative CS quantities (i.e., CS proportions) to represent the condition of a bridge. Although this greatly increases the granularity of bridge condition data, it introduces challenges to set up optimal life-cycle policies due to the expanded state space from one single categorical integer to four-dimensional probability arrays. This study proposes a new interpretable reinforcement learning (RL) approach to seek optimal life-cycle policies based on element-level state representations. Compared to existing RL methods, the proposed algorithm yields life-cycle policies in the form of oblique decision trees with reasonable amounts of nodes and depth, making them directly understandable and auditable by humans and easily implementable into current bridge management systems. To achieve near-optimal policies, the proposed approach introduces three major improvements to existing RL methods: (a) the use of differentiable soft tree models as actor function approximators, (b) a temperature annealing process during training, and (c) regularization paired with pruning rules to limit policy complexity. Collectively, these improvements can yield interpretable life-cycle policies in the form of deterministic oblique decision trees. The benefits and trade-offs from these techniques are demonstrated in both supervised and reinforcement learning settings. The resulting framework is illustrated in a life-cycle optimization problem for steel girder bridges.
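- Illustrative sketch (not from the paper)
A differentiable soft split with temperature annealing, the core idea behind items (a) and (b) in the abstract, can be sketched for a single oblique node. The routing probability is `sigmoid((w . x + b) / T)`; as `T` shrinks, the soft split approaches the hard rule `w . x + b > 0`. The weights, the example state, and the geometric schedule are hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_split(x, w, b, temperature):
    """Probability of routing state x to the right child of an oblique node."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z / temperature)

def annealed_temperatures(t_start, t_end, steps):
    """Geometric annealing schedule from t_start down to t_end (assumed form)."""
    ratio = (t_end / t_start) ** (1.0 / (steps - 1))
    return [t_start * ratio ** i for i in range(steps)]

# Hypothetical element state: proportions in condition states CS1..CS4.
state = [0.50, 0.30, 0.15, 0.05]
w, b = [0.0, 0.0, 1.0, 1.0], -0.25  # "is CS3 + CS4 above 25%?"

for t in annealed_temperatures(1.0, 0.01, 3):
    print(round(t, 3), round(soft_split(state, w, b, t), 4))
```

At the final temperature the node behaves almost deterministically, which is what allows the trained soft tree to be read off as an ordinary oblique decision tree.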
Moondream Segmentation: From Words to Masks
- Authors: Ethan Reid
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2604.02593
- Pdf link: https://arxiv.org/pdf/2604.02593
- Abstract
We present Moondream Segmentation, a referring image segmentation extension of Moondream 3, a vision-language model. Given an image and a referring expression, the model autoregressively decodes a vector path and iteratively refines the rasterized mask into a final detailed mask. We introduce a reinforcement learning stage that resolves ambiguity in the supervised signal by directly optimizing mask quality. Rollouts from this stage produce coarse-to-ground-truth targets for the refiner. To mitigate evaluation noise from polygon annotations, we release RefCOCO-M, a cleaned RefCOCO validation split with boundary-accurate masks. Moondream Segmentation achieves a cIoU of 80.2% on RefCOCO (val) and 62.6% mIoU on LVIS (val).
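- Illustrative sketch (not from the paper)
The two reported metrics differ in how they aggregate. As commonly defined for referring segmentation, cIoU (cumulative IoU) divides the total intersection by the total union over the whole dataset, while mIoU averages the per-sample IoU. The sketch below uses flat 0/1 masks and assumes these common definitions; the paper's exact evaluation protocol is not stated in the abstract.

```python
def inter_union(pred, gt):
    """Pixel counts of the intersection and union of two flat binary masks."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter, union

def mean_iou(pairs):
    """Average of per-sample IoU; an empty union counts as a perfect match."""
    ious = []
    for pred, gt in pairs:
        inter, union = inter_union(pred, gt)
        ious.append(inter / union if union else 1.0)
    return sum(ious) / len(ious)

def cumulative_iou(pairs):
    """Total intersection over total union, so large masks weigh more."""
    stats = [inter_union(pred, gt) for pred, gt in pairs]
    total_inter = sum(i for i, _ in stats)
    total_union = sum(u for _, u in stats)
    return total_inter / total_union

samples = [
    ([1, 1, 0, 0], [1, 0, 0, 0]),  # small object, IoU 0.5
    ([1, 1, 1, 1], [1, 1, 1, 1]),  # large object, IoU 1.0
]
print(mean_iou(samples), cumulative_iou(samples))
```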
Reinforcement Learning-based Knowledge Distillation with LLM-as-a-Judge
- Authors: Yiyang Shen, Lifu Tu, Weiran Wang
- Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2604.02621
- Pdf link: https://arxiv.org/pdf/2604.02621
- Abstract
Reinforcement Learning (RL) has been shown to substantially improve the reasoning capability of small and large language models (LLMs), but existing approaches typically rely on verifiable rewards and hence on ground-truth labels. We propose an RL framework that uses rewards from an LLM that acts as a judge evaluating model outputs over large amounts of unlabeled data, enabling label-free knowledge distillation and removing the need for ground-truth supervision. Notably, the judge operates with a single-token output, making reward computation efficient. When combined with verifiable rewards, our approach yields substantial performance gains across math reasoning benchmarks. These results suggest that LLM-based evaluators can produce effective training signals for RL fine-tuning.
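- Illustrative sketch (not from the paper)
One way a single-token judge can yield a scalar reward is to read the judge's logits for two verdict tokens and normalize them. Everything specific below is an assumption: the verdict tokens `"Yes"` and `"No"`, the two-way softmax, and the `weight` used to mix in a verifiable reward are hypothetical and are not taken from the paper.

```python
import math

def judge_reward(token_logits, yes_token="Yes", no_token="No"):
    """Probability mass on the positive verdict, renormalized over two tokens."""
    logit_yes, logit_no = token_logits[yes_token], token_logits[no_token]
    m = max(logit_yes, logit_no)  # subtract the max for numerical stability
    e_yes, e_no = math.exp(logit_yes - m), math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)

def combined_reward(judge_r, verifiable_r=None, weight=0.5):
    """Use the judge alone on unlabeled data; mix with a verifiable reward when one exists."""
    if verifiable_r is None:
        return judge_r
    return weight * judge_r + (1.0 - weight) * verifiable_r

# Hypothetical judge logits at the single output position.
logits = {"Yes": 2.0, "No": 0.0}
print(judge_reward(logits))
print(combined_reward(judge_reward(logits), verifiable_r=1.0))
```

Because only one output position is scored, the judge needs a single forward pass per sample and no decoding loop, which is consistent with the efficiency point made in the abstract.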
Generalization Limits of Reinforcement Learning Alignment
- Authors: Haruhi Shida, Koo Imai, Keigo Kansa
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2604.02652
- Pdf link: https://arxiv.org/pdf/2604.02652
- Abstract
The safety of large language models (LLMs) relies on alignment techniques such as reinforcement learning from human feedback (RLHF). However, recent theoretical analyses suggest that reinforcement learning-based training does not acquire new capabilities but merely redistributes the utilization probabilities of existing ones. In this study, we propose "compound jailbreaks" targeting OpenAI gpt-oss-20b, which exploit the generalization failures of alignment. This approach combines multiple attack techniques -- each individually defended against -- to saturate the instruction hierarchy maintenance process. Our evaluation shows that the attack success rate (ASR) increased from 14.3% with individual methods to 71.4% with the combined approach. These results provide empirical evidence for the hypothesis that safety training does not generalize as broadly as model capabilities, highlighting the need for multifaceted safety evaluations using compound attack scenarios.
Beyond Semantic Manipulation: Token-Space Attacks on Reward Models
- Authors: Yuheng Zhang, Mingyue Huo, Minghao Zhu, Mengxue Zhang, Nan Jiang
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2604.02686
- Pdf link: https://arxiv.org/pdf/2604.02686
- Abstract
Reward models (RMs) are widely used as optimization targets in reinforcement learning from human feedback (RLHF), yet they remain vulnerable to reward hacking. Existing attacks mainly operate within the semantic space, constructing human-readable adversarial outputs that exploit RM biases. In this work, we introduce a fundamentally different paradigm: Token Mapping Perturbation Attack (TOMPA), a framework that performs adversarial optimization directly in token space. By bypassing the standard decode-re-tokenize interface between the policy and the reward model, TOMPA enables the attack policy to optimize over raw token sequences rather than coherent natural language. Using only black-box scalar feedback, TOMPA automatically discovers non-linguistic token patterns that elicit extremely high rewards across multiple state-of-the-art RMs. Specifically, when targeting Skywork-Reward-V2-Llama-3.1-8B, TOMPA nearly doubles the reward of GPT-5 reference answers and outperforms them on 98.0% of prompts. Despite these high scores, the generated outputs degenerate into nonsensical text, revealing that RMs can be systematically exploited beyond the semantic regime and exposing a critical vulnerability in current RLHF pipelines.
ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving
- Authors: Zihao Sheng, Xin Ye, Jingru Luo, Sikai Chen, Liu Ren
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2604.02714
- Pdf link: https://arxiv.org/pdf/2604.02714
- Abstract
End-to-end autonomous driving models based on Vision-Language-Action (VLA) architectures have shown promising results by learning driving policies through behavior cloning on expert demonstrations. However, imitation learning inherently limits the model to replicating observed behaviors without exploring diverse driving strategies, leaving it brittle in novel or out-of-distribution scenarios. Reinforcement learning (RL) offers a natural remedy by enabling policy exploration beyond the expert distribution. Yet VLA models, typically trained on offline datasets, lack directly observable state transitions, necessitating a learned world model to anticipate action consequences. In this work, we propose a unified understanding-and-generation framework that leverages world modeling to simultaneously enable meaningful exploration and provide dense supervision. Specifically, we augment trajectory prediction with future RGB and depth image generation as dense world modeling objectives, requiring the model to learn fine-grained visual and geometric representations that substantially enrich the planning backbone. Beyond serving as a supervisory signal, the world model further acts as a source of intrinsic reward for policy exploration: its image prediction uncertainty naturally measures a trajectory's novelty relative to the training distribution, where high uncertainty indicates out-of-distribution scenarios that, if safe, represent valuable learning opportunities. We incorporate this exploration signal into a safety-gated reward and optimize the policy via Group Relative Policy Optimization (GRPO). Experiments on the NAVSIM and nuScenes benchmarks demonstrate the effectiveness of our approach, achieving a state-of-the-art PDMS score of 93.7 and an EPDMS of 88.8 on NAVSIM. The code and demo will be publicly available at this https URL.
- 中文摘要
基于视觉-语言-行动(VLA)架构的端到端自动驾驶模型通过在专家演示上进行行为克隆来学习驾驶策略,取得了令人鼓舞的成果。然而,模仿学习本质上限制了模型只能复制观察到的行为,而无法探索多样化的驾驶策略,使其在新颖或分布外场景中显得脆弱。强化学习(RL)通过使策略探索超越专家分布,提供了一种自然的解决方案。然而,通常在离线数据集上训练的VLA模型缺乏可直接观测的状态转移,因此需要一个学习得到的世界模型来预测行动后果。在本研究中,我们提出了一个统一的理解与生成框架,利用世界建模,既能实现有意义的探索,又能提供密集的监督。具体来说,我们通过未来RGB和深度图像生成来增强轨迹预测,作为密集世界建模目标,要求模型学习细粒度的视觉和几何表示,从而极大丰富规划骨干。除了作为监督信号外,世界模型还作为策略探索的内在奖励来源:其图像预测不确定性自然衡量轨迹相对于训练分布的新颖性,高不确定性表明分布外情景,如果安全,则代表宝贵的学习机会。我们将该探索信号纳入安全门控奖励,并通过组相对策略优化(GRPO)优化策略。NAVSIM和nuScenes基准测试的实验证明了我们方法的有效性,在NAVSIM上取得了93.7的最先进PDMS得分和88.8的EPDMS。代码和演示将公开发布。
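The abstract above names two ingredients that a small sketch can make concrete: a safety-gated exploration bonus and GRPO's group-relative advantage. The snippet below is only an illustration under stated assumptions, not the authors' implementation. The function names, the additive form of the bonus, and the weight `beta` are hypothetical; the one standard part is normalizing rewards within a group of rollouts sampled for the same scene.

```python
import numpy as np

def safety_gated_reward(task_reward, uncertainty, is_safe, beta=0.1):
    """Hypothetical gating: add the novelty bonus only when the rollout is safe."""
    bonus = beta * uncertainty if is_safe else 0.0
    return task_reward + bonus

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one group of rollouts."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four rollouts for the same scene: (task reward, world-model uncertainty, safe?)
# All numbers are made up for illustration.
rollouts = [
    (0.9, 0.2, True),
    (0.9, 0.8, True),   # novel and safe: receives the largest bonus
    (0.9, 0.8, False),  # novel but unsafe: bonus is gated off
    (0.5, 0.1, True),
]
rewards = [safety_gated_reward(*r) for r in rollouts]
adv = group_relative_advantages(rewards)
print(adv)
```

Under these assumptions the safe, high-uncertainty rollout gets the largest advantage, while the unsafe one is scored on its task reward alone.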
Data-Driven Synthesis of Probabilistic Controlled Invariant Sets for Linear MDPs
线性MDP概率受控不变集的数据驱动综合
- Authors: Kazumune Hashimoto, Shunki Kimura, Kazunobu Serizawa, Junya Ikemoto, Yulong Gao, Kai Cai
- Subjects: Systems and Control (eess.SY)
- Arxiv link: https://arxiv.org/abs/2604.02727
- Pdf link: https://arxiv.org/pdf/2604.02727
- Abstract
We study data-driven computation of probabilistic controlled invariant sets (PCIS) for safety-critical reinforcement learning under unknown dynamics. Assuming a linear MDP model, we use regularized least squares and self-normalized confidence bounds to construct a conservative estimate of the states from which the system can be kept inside a prescribed safe region over an \(N\)-step horizon, together with the corresponding set-valued safe action map. This construction is obtained through a backward recursion and can be interpreted as a conservative approximation of the \(N\)-step safety predecessor operator. When the associated conservative-inclusion event holds, a conservative fixed point of the approximate recursion can be certified as an \((N,\epsilon)\)-PCIS with confidence at least \(\eta\). For continuous state spaces, we introduce a lattice abstraction and a Lipschitz-based discretization error bound to obtain a tractable approximation scheme. Finally, we use the resulting conservative fixed-point approximation as a runtime candidate PCIS in a practical shielding architecture with iterative updates, and illustrate the approach on a numerical experiment.
- 中文摘要
我们研究在未知动力学下面向安全关键强化学习的概率受控不变集(PCIS)的数据驱动计算。假设线性MDP模型,我们使用正则化最小二乘法和自归一化置信界,构建对如下状态集合的保守估计:从这些状态出发,系统可在 \(N\) 步时域内保持在指定安全区域内;同时给出相应的集合值安全动作映射。该构造通过反向递归得到,可以解释为对 \(N\) 步安全前驱算子的保守近似。当相关保守包含事件成立时,近似递归的保守不动点可以以至少 \(\eta\) 的置信度被认证为 \((N,\epsilon)\)-PCIS。对于连续状态空间,我们引入了格点抽象和基于利普希茨的离散化误差界,以获得一个易处理的近似方案。最后,我们将所得的保守不动点近似作为运行时候选PCIS,应用于带有迭代更新的实用屏蔽架构,并通过数值实验展示该方法。
Multi-agent Reinforcement Learning-based Joint Design of Low-Carbon P2P Market and Bidding Strategy in Microgrids
基于多智能体强化学习的低碳点对点市场与微电网竞标策略联合设计
- Authors: Junhao Ren, Honglin Gao, Sijie Wang, Lan Zhao, Qiyu Kang, Aniq Ashan, Yajuan Sun, Gaoxi Xiao
- Subjects: Multiagent Systems (cs.MA)
- Arxiv link: https://arxiv.org/abs/2604.02728
- Pdf link: https://arxiv.org/pdf/2604.02728
- Abstract
Uncertainties in renewable energy generation and the instability of the real-time market limit the effective utilization of clean energy in microgrid communities. Existing peer-to-peer (P2P) and microgrid coordination approaches typically rely on centralized optimization or restrictive coordination rules that are difficult to implement in real-life applications. To address this challenge, we propose an intraday P2P trading framework that allows self-interested microgrids to pursue their economic benefits, while allowing the market operator to maximize the social welfare, namely the low carbon emission objective, of the entire community. Specifically, the decision-making processes of the microgrids are formulated as a Decentralized Partially Observable Markov Decision Process (DEC-POMDP) and solved using a Multi-Agent Reinforcement Learning (MARL) framework. Such an approach grants each microgrid a high degree of decision-making autonomy, while a novel market clearing mechanism is introduced to provide macro-regulation, incentivizing microgrids to prioritize local renewable energy consumption and hence reduce carbon emissions. Simulation results demonstrate that the combination of the self-interested bidding strategy and the P2P market design helps significantly improve renewable energy utilization and reduce reliance on external electricity with high carbon emissions. The framework achieves a balanced integration of local autonomy, self-interest pursuit, and improved community-level economic and environmental benefits.
- 中文摘要
可再生能源发电的不确定性和实时市场的不稳定性限制了微电网社区清洁能源的有效利用。现有的点对点(P2P)和微电网协调方法通常依赖集中式优化或限制性协调规则,这些规则在实际应用中难以实现。为应对这一挑战,我们提出了一种日内P2P交易框架,允许自利的微电网追求其经济利益,同时让市场运营者最大化整个社区的社会福利,即低碳排放目标。具体来说,微电网的决策过程被建模为去中心化部分可观测马尔可夫决策过程(DEC-POMDP),并采用多智能体强化学习(MARL)框架进行求解。这种做法赋予每个微电网高度的决策自主权,同时引入了新的市场出清机制以实现宏观调控,激励微电网优先消纳本地可再生能源,从而减少碳排放。仿真结果表明,自利竞价策略与P2P市场设计的结合,显著提升了可再生能源的利用率,并减少了对高碳排放外部电力的依赖。该框架实现了本地自主、自利追求以及社区层面经济和环境效益提升之间的平衡整合。
Learning Locomotion on Complex Terrain for Quadrupedal Robots with Foot Position Maps and Stability Rewards
在复杂地形上学习四足机器人的移动,配备足位图和稳定性奖励
- Authors: Matthew Hwang, Yubin Liu, Ryo Hakoda, Takeshi Oishi
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2604.02744
- Pdf link: https://arxiv.org/pdf/2604.02744
- Abstract
Quadrupedal locomotion over complex terrain has been a long-standing research topic in robotics. While recent reinforcement learning-based locomotion methods improve generalizability and foot-placement precision, they rely on implicit inference of foot positions from joint angles, lacking the explicit precision and stability guarantees of optimization-based approaches. To address this, we introduce a foot position map integrated into the heightmap, and a dynamic locomotion-stability reward within an attention-based framework to achieve locomotion on complex terrain. We validate our method extensively on terrains seen during training as well as out-of-domain (OOD) terrains. Our results demonstrate that the proposed method enables precise and stable movement, resulting in improved locomotion success rates on both in-domain and OOD terrains.
- 中文摘要
在复杂地形上进行四足行走一直是机器人学中长期存在的研究课题。虽然最新的基于强化学习的运动方法提高了泛化性和落足精度,但它们依赖于从关节角度隐式推断足部位置,缺乏基于优化的方法所具有的显式精度和稳定性保证。为此,我们引入了集成到高度图中的足部位置图,并在基于注意力的框架内引入动态运动稳定性奖励,以实现复杂地形上的运动。我们在训练期间见过的地形以及域外(OOD)地形上进行了广泛验证。结果表明,所提方法能够实现精确且稳定的运动,从而在域内和域外地形上都提升了运动成功率。
Fully Byzantine-Resilient Distributed Multi-Agent Q-Learning
完全拜占庭弹性分布式多智能体Q-学习
- Authors: Haejoon Lee, Dimitra Panagou
- Subjects: Multiagent Systems (cs.MA); Systems and Control (eess.SY)
- Arxiv link: https://arxiv.org/abs/2604.02791
- Pdf link: https://arxiv.org/pdf/2604.02791
- Abstract
We study Byzantine-resilient distributed multi-agent reinforcement learning (MARL), where agents must collaboratively learn optimal value functions over a compromised communication network. Existing resilient MARL approaches typically guarantee almost sure convergence only to near-optimal value functions, or require restrictive assumptions to ensure convergence to the optimal solution. As a result, agents may fail to learn the optimal policies under these methods. To address this, we propose a novel distributed Q-learning algorithm, under which all agents' value functions converge almost surely to the optimal value functions despite Byzantine edge attacks. The key idea is a redundancy-based filtering mechanism that leverages two-hop neighbor information to validate incoming messages, while preserving bidirectional information flow. We then introduce a new topological condition for the convergence of our algorithm, present a systematic method to construct such networks, and prove that this condition can be verified in polynomial time. We validate our results through simulations, showing that our method converges to the optimal solutions, whereas prior methods fail under Byzantine edge attacks.
- 中文摘要
我们研究拜占庭弹性分布式多智能体强化学习(MARL),其中智能体必须在受损的通信网络上协作学习最优价值函数。现有的弹性MARL方法通常只能保证几乎必然收敛到近似最优的价值函数,或者需要限制性假设才能确保收敛到最优解。因此,智能体在这些方法下可能无法学到最优策略。为此,我们提出了一种新的分布式Q-学习算法,在该算法下,即使遭受拜占庭边攻击,所有智能体的价值函数仍几乎必然收敛到最优价值函数。核心思想是一种基于冗余的过滤机制,利用两跳邻居信息验证传入的消息,同时保持双向信息流。随后,我们为算法的收敛引入了新的拓扑条件,提出了构造此类网络的系统方法,并证明该条件可在多项式时间内验证。我们通过仿真验证了结果,表明我们的方法收敛到最优解,而之前的方法在拜占庭边攻击下失败。
CharTool: Tool-Integrated Visual Reasoning for Chart Understanding
CharTool:用于图表理解的工具集成视觉推理
- Authors: Situo Zhang, Yifan Zhang, Zichen Zhu, Da Ma, Lei Pan, Danyang Zhang, Zihan Zhao, Lu Chen, Kai Yu
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2604.02794
- Pdf link: https://arxiv.org/pdf/2604.02794
- Abstract
Charts are ubiquitous in scientific and financial literature for presenting structured data. However, chart reasoning remains challenging for multimodal large language models (MLLMs) due to the lack of high-quality training data, as well as the need for fine-grained visual grounding and precise numerical computation. To address these challenges, we first propose DuoChart, a scalable dual-source data pipeline that combines synthesized charts with real-world charts to construct diverse, high-quality chart training data. We then introduce CharTool, which equips MLLMs with external tools, including image cropping for localized visual perception and code-based computation for accurate numerical reasoning. Through agentic reinforcement learning on DuoChart, CharTool learns tool-integrated reasoning grounded in chart content. Extensive experiments on six chart benchmarks show that our method consistently improves over strong MLLM baselines across model scales. Notably, CharTool-7B outperforms the base model by +8.0% on CharXiv (Reasoning) and +9.78% on ChartQAPro, while achieving competitive performance with substantially larger or proprietary models. Moreover, CharTool demonstrates positive generalization to out-of-domain visual math reasoning benchmarks.
- 中文摘要
图表在科学和金融文献中无处不在,用于呈现结构化数据。然而,由于缺乏高质量的训练数据,以及需要细粒度的视觉定位和精确的数值计算,图表推理对多模态大型语言模型(MLLM)而言仍然具有挑战性。为应对这些挑战,我们首先提出了DuoChart,这是一个可扩展的双源数据流水线,将合成图表与真实世界图表结合起来,构建多样且高质量的图表训练数据。接着我们介绍CharTool,它为MLLM配备了外部工具,包括用于局部视觉感知的图像裁剪和用于准确数值推理的基于代码的计算。通过在DuoChart上进行智能体强化学习,CharTool学会了基于图表内容的工具集成推理。在六个图表基准上的广泛实验表明,我们的方法在不同模型规模上均持续优于强MLLM基线。值得注意的是,CharTool-7B在CharXiv(Reasoning)上比基础模型高出+8.0%,在ChartQAPro上高出+9.78%,同时取得了可与规模大得多的模型或专有模型相竞争的性能。此外,CharTool在域外视觉数学推理基准上也表现出正向泛化。
Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks
从评分标准到词元:在指令遵循任务中连接响应级评分标准与词元级奖励
- Authors: Tianze Xu, Yanzhao Zheng, Pengrui Lu, Lyumanshan Ye, Yong Wu, Zhentao Zhang, Yuanqiang Yu, Chao Ma, Jihuai Zhu, Pengfei Liu, Baohua Dong, Hangcheng Zhu, Ruohui Huang, Gang Yu
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2604.02795
- Pdf link: https://arxiv.org/pdf/2604.02795
- Abstract
Rubric-based Reinforcement Learning (RL) has emerged as a promising approach for aligning Large Language Models (LLMs) with complex, open-domain instruction following tasks. However, existing methods predominantly rely on response-level rewards, introducing severe reward sparsity and reward ambiguity problems. To address these issues, we propose Rubrics to Tokens (RTT), a novel rubric-based RL framework that bridges coarse response-level scores and fine-grained token-level credit assignment. RTT introduces a Token-Level Relevance Discriminator to predict which tokens in the response are responsible for a specific constraint, and optimizes the policy model via RTT-GRPO, which integrates response-level and token-level advantages within a unified framework. Furthermore, when transitioning from one-dimensional, outcome-level reward to three-dimensional reward space in the token-level rubric-based RL, we propose a novel group normalization method, called Intra-sample Token Group Normalization, to accommodate this shift. Extensive experiments and benchmarks demonstrate that RTT consistently outperforms other baselines in both instruction- and rubric-level accuracy across different models.
- 中文摘要
基于评分标准的强化学习(RL)已成为一种有前景的方法,用于使大型语言模型(LLM)适应复杂的开放领域指令遵循任务。然而,现有方法主要依赖响应级奖励,导致严重的奖励稀疏性和奖励模糊性问题。为解决这些问题,我们提出了“从评分标准到词元”(RTT),这是一种新的基于评分标准的强化学习框架,连接粗粒度的响应级评分和细粒度的词元级信用分配。RTT引入了词元级相关性判别器,用于预测响应中哪些词元对应于某一特定约束,并通过RTT-GRPO优化策略模型,该方法将响应级和词元级优势整合在统一框架内。此外,在词元级的基于评分标准的强化学习中,从一维的结果级奖励过渡到三维奖励空间时,我们提出了一种新的组归一化方法,称为样本内词元组归一化,以适应这一转变。大量实验和基准测试表明,RTT在不同模型上的指令级和评分标准级准确率始终优于其他基线。
Dependency-Guided Repository-Level C-to-Rust Translation with Reinforcement Alignment
依赖引导仓库级 C 到 Rust 的转换,带有强化对齐
- Authors: Jia Feng, Wenjie Gan, Cuiyun Gao, Chaozheng Wang, Feng Luo, Xin Xia, Ge Li, Kui Liu
- Subjects: Software Engineering (cs.SE)
- Arxiv link: https://arxiv.org/abs/2604.02852
- Pdf link: https://arxiv.org/pdf/2604.02852
- Abstract
Automating C-to-Rust migration is critical for improving software security without sacrificing performance. Traditional rule-based methods struggle with diverse C idioms, often producing rigid and unidiomatic Rust code. Large Language Models (LLMs), trained on massive code corpora, offer a promising alternative by leveraging cross-language generalization to generate more idiomatic and maintainable Rust code. However, several challenges remain. First, existing LLM-based approaches fail to handle cross-file dependencies effectively, either ignoring them or including entire files as context, which limits accurate dependency modeling. Second, complex dependencies and structured inputs and outputs make it difficult to verify syntactic correctness and functional equivalence at the repository level. Third, the lack of large-scale C-Rust parallel data constrains model performance. We propose DepTrans, a framework that combines model capability enhancement with structured inference. DepTrans introduces Reinforcement-Aligned Syntax Training to improve generation quality through multi-task fine-tuning and feedback-driven reinforcement learning. It further applies Dependency-Guided Iterative Refinement to capture fine-grained cross-file dependencies and iteratively refine generated Rust code. We construct a dataset of 85k training samples and a benchmark of 145 repository-level instances. Experiments show that DepTrans achieves a 60.7 percent compilation success rate and 43.5 percent computational accuracy, outperforming the strongest baseline by 22.8 and 17.3 percentage points. It also successfully builds 7 of 15 industrial C projects, demonstrating its practical potential.
- 中文摘要
自动化C到Rust迁移对于在不牺牲性能的前提下提升软件安全性至关重要。传统的基于规则的方法难以应对多样的 C 语言惯用写法,常常生成僵化且非惯用的 Rust 代码。大型语言模型(LLM)在庞大的代码语料库上训练,通过跨语言泛化生成更具惯用性和可维护性的Rust代码,提供了有前景的替代方案。然而,仍有若干挑战。首先,现有基于LLM的方法无法有效处理跨文件依赖,要么忽略它们,要么将整个文件作为上下文,这限制了依赖建模的准确性。其次,复杂的依赖关系和结构化的输入输出使得在仓库层面验证语法正确性和功能等价性变得困难。第三,缺乏大规模C-Rust平行数据限制了模型性能。我们提出了DepTrans框架,该框架结合了模型能力增强与结构化推理。DepTrans引入强化对齐语法训练,通过多任务微调和反馈驱动的强化学习提升生成质量。它进一步应用依赖引导迭代细化,捕捉细粒度的跨文件依赖关系,并迭代优化生成的 Rust 代码。我们构建了一个包含8.5万个训练样本的数据集和一个包含145个仓库级实例的基准。实验显示,DepTrans实现了60.7%的编译成功率和43.5%的计算准确率,分别比最强基线高出22.8和17.3个百分点。它还成功构建了15个工业C项目中的7个,展现了其实用潜力。
Multi-Turn Reinforcement Learning for Tool-Calling Agents with Iterative Reward Calibration
面向工具调用智能体的多轮强化学习,采用迭代奖励校准
- Authors: Wachiravit Modecrua, Krittanon Kaewtawee, Krittin Pachtrachai, Touchapon Kraisingkorn
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2604.02869
- Pdf link: https://arxiv.org/pdf/2604.02869
- Abstract
Training tool-calling agents with reinforcement learning on multi-turn tasks remains challenging due to sparse outcome rewards and difficult credit assignment across conversation turns. We present the first application of MT-GRPO (Multi-Turn Group Relative Policy Optimization) combined with GTPO (Generalized Token-level Policy Optimization) for training a tool-calling agent on realistic customer service tasks with an LLM-based user simulator. Through systematic analysis of training rollouts, we discover that naively designed dense per-turn rewards degrade performance by up to 14 percentage points due to misalignment between reward discriminativeness and advantage direction. We introduce Iterative Reward Calibration, a methodology for designing per-turn rewards using empirical discriminative analysis of rollout data, and show that our GTPO hybrid advantage formulation eliminates the advantage misalignment problem. Applied to the Tau-Bench airline benchmark, our approach improves Qwen3.5-4B from 63.8 percent to 66.7 percent (+2.9pp) and Qwen3-30B-A3B from 58.0 percent to 69.5 percent (+11.5pp) -- with the trained 4B model exceeding GPT-4.1 (49.4 percent) and GPT-4o (42.8 percent) despite being 50 times smaller, and the 30.5B MoE model approaching Claude Sonnet 4.5 (70.0 percent). To our knowledge, these are the first published RL training results on Tau-Bench. We release our code, reward calibration analysis, and training recipes.
- 中文摘要
由于结果奖励稀疏且跨对话轮次的信用分配困难,在多轮任务上用强化学习训练工具调用智能体仍然具有挑战性。我们首次将MT-GRPO(多轮组相对策略优化)与GTPO(广义词元级策略优化)结合,在基于LLM的用户模拟器上训练工具调用智能体完成真实的客户服务任务。通过对训练轨迹(rollout)的系统分析,我们发现,朴素设计的稠密逐轮奖励会因奖励判别性与优势方向之间的错位而使性能下降多达14个百分点。我们引入了迭代奖励校准,这是一种利用轨迹数据的经验判别分析来设计逐轮奖励的方法,并证明我们的GTPO混合优势公式消除了优势错位问题。应用于Tau-Bench航空基准,我们的方法将Qwen3.5-4B从63.8%提升至66.7%(+2.9pp),将Qwen3-30B-A3B从58.0%提升至69.5%(+11.5pp)。训练后的4B模型尽管规模小50倍,仍超过了GPT-4.1(49.4%)和GPT-4o(42.8%);30.5B的MoE模型则接近Claude Sonnet 4.5(70.0%)。据我们所知,这是Tau-Bench上首次公开发表的强化学习训练结果。我们发布了代码、奖励校准分析和训练配方。
Towards Near-Real-Time Telemetry-Aware Routing with Neural Routing Algorithms
迈向近实时遥测感知路由,利用神经路由算法
- Authors: Andreas Boltres, Niklas Freymuth, Benjamin Schichtholz, Michael König, Gerhard Neumann
- Subjects: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
- Arxiv link: https://arxiv.org/abs/2604.02927
- Pdf link: https://arxiv.org/pdf/2604.02927
- Abstract
Routing algorithms are crucial for efficient computer network operations, and in many settings they must be able to react to traffic bursts within milliseconds. Live telemetry data can provide informative signals to routing algorithms, and recent work has trained neural networks to exploit such signals for traffic-aware routing. Yet, aggregating network-wide information is subject to communication delays, and existing neural approaches either assume unrealistic delay-free global states, or restrict routers to purely local telemetry. This leaves their deployability in real-world environments unclear. We cast telemetry-aware routing as a delay-aware closed-loop control problem and introduce a framework that trains and evaluates neural routing algorithms, while explicitly modeling communication and inference delays. On top of this framework, we propose LOGGIA, a scalable graph neural routing algorithm that predicts log-space link weights from attributed topology-and-telemetry graphs. It utilizes a data-driven pre-training stage, followed by on-policy Reinforcement Learning. Across synthetic and real network topologies, and unseen mixed TCP/UDP traffic sequences, LOGGIA consistently outperforms shortest-path baselines, whereas neural baselines fail once realistic delays are enforced. Our experiments further suggest that neural routing algorithms like LOGGIA perform best when deployed fully locally, i.e., observing network states and inferring actions at every router individually, as opposed to centralized decision making.
- 中文摘要
路由算法对于高效的计算机网络运行至关重要,在许多环境中,它们必须能够在毫秒内响应流量突发。实时遥测数据可以为路由算法提供有用的信号,近期研究已训练神经网络利用这些信号进行流量感知路由。然而,聚合全网络信息会受到通信延迟的影响,现有神经方法要么假设不切实际的无延迟全局状态,要么限制路由器只能使用纯局部遥测。这使得它们在现实环境中的可部署性尚不明确。我们将遥测感知路由表述为一个延迟感知的闭环控制问题,并引入一个框架,用于训练和评估神经路由算法,同时显式建模通信和推理延迟。在此框架基础上,我们提出了LOGGIA,一种可扩展的图神经路由算法,能够从带属性的拓扑-遥测图预测对数空间链路权重。它采用数据驱动的预训练阶段,随后进行同策略(on-policy)强化学习。在合成和真实网络拓扑以及未见过的混合TCP/UDP流量序列上,LOGGIA始终优于最短路径基线,而神经基线一旦施加现实延迟就会失效。我们的实验进一步表明,像LOGGIA这样的神经路由算法在完全本地部署时表现最佳,即在每个路由器上各自观察网络状态并推断动作,而非集中式决策。
Digital Twin-Assisted In-Network and Edge Collaboration for Joint User Association, Task Offloading, and Resource Allocation in the Metaverse
数字孪生辅助的网络内和边缘协作,用于元宇宙中的联合用户关联、任务卸载和资源分配
- Authors: Ibrahim Aliyu, Seungmin Oh, Sangwon Oh, Jinsul Kim
- Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT)
- Arxiv link: https://arxiv.org/abs/2604.02938
- Pdf link: https://arxiv.org/pdf/2604.02938
- Abstract
Advancements in extended reality (XR) are driving the development of the metaverse, which demands efficient real-time transformation of 2D scenes into 3D objects, a computation-intensive process that necessitates task offloading because of complex perception, visual, and audio processing. This challenge is further compounded by asymmetric uplink (UL) and downlink (DL) data characteristics, where 2D data are transmitted in the UL and 3D content is rendered in the DL. To address this issue, we propose a digital twin (DT)-based in-network computing (INC)-assisted multi-access edge computing (MEC) framework that enables real-time synchronization and collaborative computing via URLLC. In this framework, a network operator manages wireless and computational resources for XR user devices (XUDs), while XUDs autonomously offload tasks to maximize their utilities. We model the interactions between XUDs and the operator as a Stackelberg Markov game, where the optimal offloading strategy constitutes an exact potential game with a Nash Equilibrium (NE), and the operator's problem is formulated as an asynchronous Markov decision process (MDP). We further propose a decentralized solution in which XUDs determine offloading decisions based on the operator's joint UL-DL optimization of offloading mode (INC-E or MEC only) and DL power allocation. A Nash-asynchronous hybrid multi-agent reinforcement learning (AMRL) algorithm is developed to predict the UL user-associated and DL transmission power, thereby achieving NE. Simulation results demonstrate that the proposed approach considerably improves system utility, uplink rate, and energy efficiency by reducing latency and optimizing resource utilization in metaverse environments.
- 中文摘要
扩展现实(XR)的进步推动了元宇宙的发展,元宇宙需要高效地实时将二维场景转换为三维对象,这一计算密集型过程因复杂的感知、视觉和音频处理而需要任务卸载。这一挑战因非对称的上行(UL)和下行(DL)数据特性而更加复杂,即二维数据在UL中传输,三维内容在DL中渲染。为解决这一问题,我们提出了基于数字孪生(DT)的网络内计算(INC)辅助多接入边缘计算(MEC)框架,通过 URLLC 实现实时同步和协作计算。在该框架中,网络运营商管理XR用户设备(XUD)的无线和计算资源,而XUD则自主卸载任务以最大化其效用。我们将XUD与运营商之间的相互作用建模为Stackelberg Markov博弈,其中最优卸载策略构成一个具有纳什均衡(NE)的精确势博弈,运营商的问题被表述为异步马尔可夫决策过程(MDP)。我们还提出了一种去中心化解决方案,XUD根据运营商对卸载模式(INC-E或仅MEC)和DL功率分配的联合UL-DL优化来做出卸载决策。我们开发了一种纳什异步混合多智能体强化学习(AMRL)算法,用于预测UL用户关联和DL传输功率,从而实现NE。仿真结果表明,该方法通过降低延迟和优化元宇宙环境中的资源利用,显著提升了系统效用、上行速率和能效。
Mitigating Reward Hacking in RLHF via Advantage Sign Robustness
通过优势符号的鲁棒性来缓解RLHF中的奖励黑客行为
- Authors: Shinnosuke Ono, Johannes Ackermann, Soichiro Nishimori, Takashi Ishida, Masashi Sugiyama
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2604.02986
- Pdf link: https://arxiv.org/pdf/2604.02986
- Abstract
Reward models (RMs) used in reinforcement learning from human feedback (RLHF) are vulnerable to reward hacking: as the policy maximizes a learned proxy reward, true quality plateaus or degrades. We make the assumption that reward hacking is often caused by flipped advantage signs: instead of reducing the likelihood of a bad response, a flipped sign causes the update to increase it. By considering an adversarial perturbation in the RM parameter space, we can derive a certified sign-preservation radius, which is the smallest perturbation that can flip the advantage sign during policy optimization. Based on this formulation, we propose Sign-Certified Policy Optimization (SignCert-PO), down-weighting non-robust completions in the policy gradient update. Unlike prior approaches that require multiple RMs or access to the RM training data, SignCert-PO is lightweight and operates purely at the policy optimization stage using only the RM parameters and on-policy completions. On TL;DR summarization and AlpacaFarm benchmarks, SignCert-PO consistently achieves a better win rate than baselines and reduces reward hacking.
- 中文摘要
用于人类反馈强化学习(RLHF)的奖励模型(RM)容易受到奖励黑客行为(reward hacking)的影响:随着策略最大化学到的代理奖励,真实质量会趋于停滞或下降。我们假设奖励黑客行为往往是由优势符号翻转引起的:翻转的符号不但没有降低不良回复的似然,反而使更新增大了它。通过考虑RM参数空间中的对抗扰动,我们可以推导出一个可认证的符号保持半径,即在策略优化过程中能够翻转优势符号的最小扰动。基于这一表述,我们提出了符号认证策略优化(SignCert-PO),在策略梯度更新中降低非稳健补全的权重。与以往需要多个 RM 或访问 RM 训练数据的方法不同,SignCert-PO 是轻量级的,完全在策略优化阶段运行,仅使用 RM 参数和同策略补全。在TL;DR摘要和AlpacaFarm基准上,SignCert-PO的胜率始终高于基线,并减少了奖励黑客行为。
R2-Write: Reflection and Revision for Open-Ended Writing with Deep Reasoning
R2-Write:深入推理的开放式写作反思与修订
- Authors: Wanlong Liu, Bo Zhang, Chenliang Li, Shaopeng Lai, Yuning Wu, Xuanyu Lei, Ming Yan
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2604.03004
- Pdf link: https://arxiv.org/pdf/2604.03004
- Abstract
While deep reasoning with long chain-of-thought has dramatically improved large language models in verifiable domains like mathematics, its effectiveness for open-ended tasks such as writing remains unexplored. In this paper, we conduct a systematic investigation revealing that existing mainstream reasoning models achieve limited gains on open-ended writing tasks. Our further analysis shows that these models lack deep reflection and revision patterns in open-ended writing, resulting in substantially smaller improvements compared to mathematical reasoning tasks. To address this limitation, we introduce R2-Write: an automated framework that synthesizes high-quality thinking trajectories enriched with explicit reflection and revision patterns through iterative writer-judge interaction. To prevent redundant reflections, we design a process reward mechanism that supervises reflection quality during reinforcement learning, improving both performance and token efficiency. Extensive experiments across multiple creative writing and deep-research benchmarks demonstrate significant improvements, validating that explicitly incorporating reflection and revision patterns unlocks deep reasoning capabilities for open-ended writing tasks.
- 中文摘要
虽然基于长思维链的深度推理极大地提升了大型语言模型在数学等可验证领域的表现,但其在写作等开放式任务中的有效性仍未得到探索。本文进行了系统研究,揭示现有主流推理模型在开放式写作任务上的提升有限。进一步分析显示,这些模型在开放式写作中缺乏深度反思和修订模式,导致其提升幅度明显小于数学推理任务。为解决这一局限,我们引入了R2-Write:一个自动化框架,通过写作者与评判者的迭代互动,合成带有明确反思和修订模式的高质量思维轨迹。为防止冗余反思,我们设计了一种过程奖励机制,在强化学习过程中监督反思质量,同时提升性能和词元效率。在多个创意写作和深度研究基准上的广泛实验显示了显著改进,证实明确纳入反思和修订模式能为开放式写作任务释放深度推理能力。
Behavior-Constrained Reinforcement Learning with Receding-Horizon Credit Assignment for High-Performance Control
面向高性能控制的行为约束强化学习与滚动时域信用分配
- Authors: Siwei Ju, Jan Tauberschmidt, Oleg Arenz, Peter van Vliet, Jan Peters
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2604.03023
- Pdf link: https://arxiv.org/pdf/2604.03023
- Abstract
Learning high-performance control policies that remain consistent with expert behavior is a fundamental challenge in robotics. Reinforcement learning can discover high-performing strategies but often departs from desirable human behavior, whereas imitation learning is limited by demonstration quality and struggles to improve beyond expert data. We propose a behavior-constrained reinforcement learning framework that improves beyond demonstrations while explicitly controlling deviation from expert behavior. Because expert-consistent behavior in dynamic control is inherently trajectory-level, we introduce a receding-horizon predictive mechanism that models short-term future trajectories and provides look-ahead rewards during training. To account for the natural variability of human behavior under disturbances and changing conditions, we further condition the policy on reference trajectories, allowing it to represent a distribution of expert-consistent behaviors rather than a single deterministic target. Empirically, we evaluate the approach in high-fidelity race car simulation using data from professional drivers, a domain characterized by extreme dynamics and narrow performance margins. The learned policies achieve competitive lap times while maintaining close alignment with expert driving behavior, outperforming baseline methods in both performance and imitation quality. Beyond standard benchmarks, we conduct human-grounded evaluation in a driver-in-the-loop simulator and show that the learned policies reproduce setup-dependent driving characteristics consistent with the feedback of top-class professional race drivers. These results demonstrate that our method enables learning high-performance control policies that are both optimal and behavior-consistent, and can serve as reliable surrogates for human decision-making in complex control systems.
- 中文摘要
学习与专家行为保持一致的高性能控制策略,是机器人学中的一个根本挑战。强化学习可以发现高性能策略,但往往偏离理想的人类行为,而模仿学习受限于示范质量,难以超越专家数据。我们提出了一种行为约束强化学习框架,该框架能够超越示范进行改进,同时显式控制与专家行为的偏离。由于动态控制中与专家一致的行为本质上是轨迹级的,我们引入了一种滚动时域预测机制,对短期未来轨迹进行建模并在训练期间提供前瞻奖励。为了考虑人类行为在干扰和变化条件下的自然变异性,我们进一步让策略以参考轨迹为条件,使其能够表示与专家一致的行为分布,而非单一的确定性目标。在实证方面,我们利用职业车手的数据,在高保真赛车仿真中评估该方法,该领域以极端动力学和狭窄的性能余量为特征。所学策略在保持与专家驾驶行为高度一致的同时,实现了具有竞争力的圈速,在性能和模仿质量上都优于基线方法。除了标准基准测试外,我们还在驾驶员在环模拟器中进行了以人为基础的评估,表明所学策略能够重现依赖于车辆设定的驾驶特性,并与顶级职业赛车手的反馈一致。这些结果表明,我们的方法能够学习既最优又行为一致的高性能控制策略,并可作为复杂控制系统中人类决策的可靠替代。
ARM: Advantage Reward Modeling for Long-Horizon Manipulation
ARM:用于长时域操作的优势奖励建模
- Authors: Yiming Mao, Zixi Yu, Weixin Mao, Yinhao Li, Qirui Hu, Zihan Lan, Minzhao Zhu, Hua Chen
- Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2604.03037
- Pdf link: https://arxiv.org/pdf/2604.03037
- Abstract
Long-horizon robotic manipulation remains challenging for reinforcement learning (RL) because sparse rewards provide limited guidance for credit assignment. Practical policy improvement thus relies on richer intermediate supervision, such as dense progress rewards, which are costly to obtain and ill-suited to non-monotonic behaviors such as backtracking and recovery. To address this, we propose Advantage Reward Modeling (ARM), a framework that shifts from hard-to-quantify absolute progress to estimating relative advantage. We introduce a cost-effective tri-state labeling strategy -- Progressive, Regressive, and Stagnant -- that reduces human cognitive overhead while ensuring high cross-annotator consistency. By training on these intuitive signals, ARM enables automated progress annotation for both complete demonstrations and fragmented DAgger-style data. Integrating ARM into an offline RL pipeline allows for adaptive action-reward reweighting, effectively filtering suboptimal samples. Our approach achieves a 99.4% success rate on a challenging long-horizon towel-folding task, demonstrating improved stability and data efficiency over current VLA baselines with near-zero human intervention during policy training.
- 中文摘要
长时域机器人操作对强化学习(RL)仍具挑战性,因为稀疏奖励对信用分配提供的指导有限。因此,实际的策略改进依赖于更丰富的中间监督,如密集进度奖励,这些奖励获取成本高昂,且不适合回溯和恢复等非单调行为。为此,我们提出了优势奖励建模(ARM)框架,从难以量化的绝对进度转向估计相对优势。我们引入了一种成本效益高的三态标注策略(进展、倒退和停滞),在降低人类认知负担的同时,确保较高的跨标注者一致性。通过基于这些直观信号进行训练,ARM 能够为完整演示和碎片化的 DAgger 风格数据自动标注进度。将ARM集成到离线强化学习流水线中,可以实现自适应的动作-奖励重加权,有效过滤次优样本。我们的方法在一项具有挑战性的长时域毛巾折叠任务中实现了99.4%的成功率,在策略训练期间几乎没有人为干预的情况下,比当前VLA基线更为稳定,数据效率也有所提升。
JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency
JoyAI-LLM Flash:以词元效率推进中等规模大型语言模型
- Authors: Aichen Cai, Anmeng Zhang, Anyu Li, Bo Zhang, Bohua Cai, Chang Li, Changjian Jiang, Changkai Lu, Chao Xue, Chaocai Liang, Cheng Zhang, Dongkai Liu, Fei Wang, Guoqiang Huang, Haijian Ke, Han Lin, Hao Wang, Ji Miao, Jiacheng Zhang, Jialong Shi, Jifeng Zhu, Jingjing Qian, Junhui Luo, Junwu Xiong, Lam So, Liang Huang, Ming Ke, Mingyang Li, Panfeng Shi, Peng Hao, Qi Wang, Qian Lai, Qiaoqiao Yuan, Qingyu Yin, Qiong Cao, Qixiang Wang, Rongcheng Bian, Rongduo Han, Shaoqiang Zheng, Shi Hu, Shi Suo, Shijie Ren, Shijin Zhang, Shiying Fan, Shuai Xie, Tianyi Zhang, Wei Liu, Wentao Tan, Xianghan Meng, Xiaodong He, Xing Pan, Xiran Wang, Xuyang Peng, Ya Zhang, Yang Liu, Yangyang Duan, Yanxu Chen, Yicheng Gong, Yidan Huang, Yifei Liu, Yinhao Bai, Yongqiang Liu, Yuesong Zhang, Yuqi Zhang, Zerui Xie, Zhenfang Wang, Zhennan Shen, Zheyuan Liu, Zhuwei Zeng
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2604.03044
- Pdf link: https://arxiv.org/pdf/2604.03044
- Abstract
We introduce JoyAI-LLM Flash, an efficient Mixture-of-Experts (MoE) language model designed to redefine the trade-off between strong performance and token efficiency in the sub-50B parameter regime. JoyAI-LLM Flash is pretrained on a massive corpus of 20 trillion tokens and further optimized through a rigorous post-training pipeline, including supervised fine-tuning (SFT), Direct Preference Optimization (DPO), and large-scale reinforcement learning (RL) across diverse environments. To improve token efficiency, JoyAI-LLM Flash strategically balances thinking and non-thinking cognitive modes and introduces FiberPO, a novel RL algorithm inspired by fibration theory that decomposes trust-region maintenance into global and local components, providing unified multi-scale stability control for LLM policy optimization. To enhance architectural sparsity, the model comprises 48B total parameters while activating only 2.7B parameters per forward pass, achieving a substantially higher sparsity ratio than contemporary industry-leading models of comparable scale. To further improve inference throughput, we adopt a joint training-inference co-design that incorporates dense Multi-Token Prediction (MTP) and Quantization-Aware Training (QAT). We release the checkpoints for both JoyAI-LLM-48B-A3B Base and its post-trained variants on Hugging Face to support the open-source community.
- 中文摘要
我们介绍了JoyAI-LLM Flash,一种高效的混合专家(MoE)语言模型,旨在重新定义50B以下参数规模下强性能与token效率之间的权衡。JoyAI-LLM Flash 在 20 万亿 token 的海量语料上进行预训练,并通过严格的后训练流程进一步优化,包括监督微调(SFT)、直接偏好优化(DPO)以及跨多样环境的大规模强化学习(RL)。为提升token效率,JoyAI-LLM Flash有策略地平衡了思考与非思考两种认知模式,并引入了FiberPO,这是一种受纤维化理论启发的新型强化学习算法,它将信任域维护分解为全局和局部两个部分,为LLM策略优化提供统一的多尺度稳定性控制。为增强架构稀疏度,模型总参数量为48B,而每次前向传播仅激活2.7B参数,稀疏比远高于同期同等规模的业界领先模型。为进一步提升推理吞吐量,我们采用训练-推理联合协同设计,结合了稠密多token预测(MTP)和量化感知训练(QAT)。我们在Hugging Face上发布了JoyAI-LLM-48B-A3B Base及其后训练版本的检查点,以支持开源社区。
Distributed Snitch Digital Twin-Based Anomaly Detection for Smart Voltage Source Converter-Enabled Wind Power Systems
面向智能电压源变换器风电系统的基于分布式Snitch数字孪生的异常检测
- Authors: Mohammad Ashraf Hossain Sadi, Soham Ghosh, Siby Plathottam, Mohd. Hasan Ali
- Subjects: Systems and Control (eess.SY)
- Arxiv link: https://arxiv.org/abs/2604.03123
- Pdf link: https://arxiv.org/pdf/2604.03123
- Abstract
Existing cyberattack detection methods for smart grids such as Artificial Neural Networks (ANNs) and Deep Reinforcement Learning (DRL) often suffer from limited adaptability, delayed response, and inadequate coordination in distributed energy systems. These techniques may struggle to detect stealthy or coordinated attacks, especially under communication delays or system uncertainties. This paper proposes a novel Snitch Digital Twin (Snitch-DT) architecture for cyber-physical anomaly detection in grid-connected wind farms using Smart Voltage Source Converters (VSCs). Each wind generator is equipped with a local Snitch-DT that compares real-time operational data with high-fidelity digital models and generates trust scores for measured signals. These trust scores are coordinated across nodes to detect distributed or stealthy cyberattacks. The performance of the Snitch-DT system is benchmarked against previously published Artificial Neural Network (ANN) and Deep Reinforcement Learning (DRL)-based detection frameworks. Simulation results using an IEEE 39-bus wind-integrated test system demonstrate improved attack detection accuracy, faster response time, and higher robustness under various cyberattack scenarios.
- 中文摘要
现有的智能电网网络攻击检测方法,如人工神经网络(ANN)和深度强化学习(DRL),在分布式能源系统中常常存在适应性有限、响应延迟和协调不足的问题。这些技术在检测隐蔽或协同攻击时可能遇到困难,尤其是在存在通信延迟或系统不确定性的情况下。本文提出了一种新型的Snitch数字孪生(Snitch-DT)架构,用于在采用智能电压源变换器(VSC)的并网风电场中进行信息物理异常检测。每台风力发电机都配备了本地Snitch-DT,能够将实时运行数据与高保真数字模型进行比对,并为测量信号生成信任评分。这些信任评分会在节点间协调,以检测分布式或隐蔽的网络攻击。Snitch-DT系统的性能与先前发表的基于人工神经网络(ANN)和深度强化学习(DRL)的检测框架进行了对比。基于IEEE 39节点含风电测试系统的仿真结果表明,在各种网络攻击场景下,该方法的攻击检测精度更高、响应更快、鲁棒性更强。
Self-Distilled RLVR
自蒸馏RLVR
- Authors: Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, Nan Duan
- Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2604.03128
- Pdf link: https://arxiv.org/pdf/2604.03128
- Abstract
On-policy distillation (OPD) has become a popular training paradigm in the LLM community. This paradigm selects a larger model as the teacher to provide dense, fine-grained signals for each sampled trajectory, in contrast to reinforcement learning with verifiable rewards (RLVR), which only obtains sparse signals from verifiable outcomes in the environment. Recently, the community has explored on-policy self-distillation (OPSD), where the same model serves as both teacher and student, with the teacher receiving additional privileged information such as reference answers to enable self-evolution. This paper demonstrates that learning signals solely derived from the privileged teacher result in severe information leakage and unstable long-term training. Accordingly, we identify the optimal niche for self-distillation and propose RLSD (RLVR with Self-Distillation). Specifically, we leverage self-distillation to obtain token-level policy differences for determining fine-grained update magnitudes, while continuing to use RLVR to derive reliable update directions from environmental feedback (e.g., response correctness). This enables RLSD to simultaneously harness the strengths of both RLVR and OPSD, achieving a higher convergence ceiling and superior training stability.
- 中文摘要
在线策略蒸馏(OPD)已成为LLM社区中流行的训练范式。该范式选择一个更大的模型作为教师,为每条采样轨迹提供密集、细粒度的信号,这与仅从环境中可验证结果获得稀疏信号的可验证奖励强化学习(RLVR)形成对比。最近,社区探索了在线策略自蒸馏(OPSD),即同一模型既作为教师又作为学生,教师获得参考答案等额外的特权信息,以实现自我进化。本文表明,仅来自特权教师的学习信号会导致严重的信息泄露和长期训练不稳定。因此,我们确定了自蒸馏的最佳定位,并提出RLSD(RLVR with Self-Distillation,结合自蒸馏的RLVR)。具体来说,我们利用自蒸馏获得token级策略差异,以确定细粒度的更新幅度,同时继续利用RLVR从环境反馈(如回答正确性)中得出可靠的更新方向。这使得RLSD能够同时发挥RLVR和OPSD的优势,实现更高的收敛上限和更优越的训练稳定性。
FSUNav: A Cerebrum-Cerebellum Architecture for Fast, Safe, and Universal Zero-Shot Goal-Oriented Navigation
FSUNav:一种大脑-小脑架构,实现快速、安全且通用的零样本目标导向导航
- Authors: Mingao Tan, Yiyang Li, Shanze Wang, Xinming Zhang, Wei Zhang
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2604.03139
- Pdf link: https://arxiv.org/pdf/2604.03139
- Abstract
Current vision-language navigation methods face substantial bottlenecks regarding heterogeneous robot compatibility, real-time performance, and navigation safety. Furthermore, they struggle to support open-vocabulary semantic generalization and multimodal task inputs. To address these challenges, this paper proposes FSUNav: a Cerebrum-Cerebellum architecture for fast, safe, and universal zero-shot goal-oriented navigation, which innovatively integrates vision-language models (VLMs) with the proposed architecture. The cerebellum module, a high-frequency end-to-end module, develops a universal local planner based on deep reinforcement learning, enabling unified navigation across heterogeneous platforms (e.g., humanoid, quadruped, wheeled robots) to improve navigation efficiency while significantly reducing collision risk. The cerebrum module constructs a three-layer reasoning model and leverages VLMs to build an end-to-end detection and verification mechanism, enabling zero-shot open-vocabulary goal navigation without predefined IDs and improving task success rates in both simulation and real-world environments. Additionally, the framework supports multimodal inputs (e.g., text, target descriptions, and images), further enhancing generalization, real-time performance, safety, and robustness. Experimental results on MP3D, HM3D, and OVON benchmarks demonstrate that FSUNav achieves state-of-the-art performance on object, instance image, and task navigation, significantly outperforming existing methods. Real-world deployments on diverse robotic platforms further validate its robustness and practical applicability.
- 中文摘要
当前视觉语言导航方法在异构机器人兼容性、实时性能和导航安全性方面面临重大瓶颈。此外,它们难以支持开放词汇语义泛化和多模态任务输入。为应对这些挑战,本文提出了FSUNav:一种大脑-小脑架构,用于快速、安全且通用的零样本目标导向导航,创新地将视觉语言模型(VLM)与所提架构整合。小脑模块是一种高频端到端模块,基于深度强化学习开发了通用局部规划器,实现跨异构平台(如人形、四足、轮式机器人)的统一导航,提升导航效率,同时显著降低碰撞风险。大脑模块构建了三层推理模型,并利用VLM构建端到端的检测与验证机制,实现无需预定义ID的零样本开放词汇目标导航,提升了仿真和真实环境中的任务成功率。此外,该框架支持多模态输入(如文本、目标描述和图像),进一步增强了泛化性、实时性能、安全性和鲁棒性。MP3D、HM3D和OVON基准测试的实验结果表明,FSUNav在物体、实例图像和任务导航方面实现了最先进的性能,显著优于现有方法。在多种机器人平台上的实际部署进一步验证了其鲁棒性和实用性。
Chart-RL: Policy Optimization Reinforcement Learning for Enhanced Visual Reasoning in Chart Question Answering with Vision Language Models
Chart-RL:策略优化强化学习,利用视觉语言模型增强图表问答中的视觉推理能力
- Authors: Yunfei Bai, Amit Dhanda, Shekhar Jain
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2604.03157
- Pdf link: https://arxiv.org/pdf/2604.03157
- Abstract
The recent advancements in Vision Language Models (VLMs) have demonstrated progress toward true intelligence requiring robust reasoning capabilities. Beyond pattern recognition, linguistic reasoning must integrate with visual comprehension, particularly for Chart Question Answering (CQA) tasks involving complex data visualizations. Current VLMs face significant limitations in CQA, including imprecise numerical extraction, difficulty interpreting implicit visual relationships, and inadequate attention mechanisms for capturing spatial relationships in charts. In this work, we address these challenges by presenting Chart-RL, a novel reinforcement learning framework that enhances VLMs' chart understanding through feedback-driven policy optimization of visual perception and logical inference. Our key innovation includes a comprehensive framework integrating Reinforcement Learning (RL) from Policy Optimization techniques along with adaptive reward functions, that demonstrates superior performance compared to baseline foundation models and competitive results against larger state-of-the-art architectures. We also integrated Parameter-Efficient Fine-Tuning through Low-Rank Adaptation (LoRA) in the RL framework that only requires single GPU configurations while preserving performance integrity. We conducted extensive benchmarking across open-source, proprietary, and state-of-the-art closed-source models utilizing the ChartQAPro dataset. The RL fine-tuned Qwen3-VL-4B-Instruct model achieved an answer accuracy of 0.634, surpassing the 0.580 accuracy of the Qwen3-VL-8B-Instruct foundation model despite utilizing half the parameter count, while simultaneously reducing inference latency from 31 seconds to 9 seconds.
- 中文摘要
视觉语言模型(VLM)的最新进展显示出向真正智能迈进的趋势,而真正的智能需要强大的推理能力。除了模式识别外,语言推理还必须与视觉理解相结合,尤其是在涉及复杂数据可视化的图表问答(CQA)任务中。当前VLM在CQA方面存在显著局限,包括数值提取不精确、难以解读隐含的视觉关系,以及用于捕捉图表中空间关系的注意力机制不足。本研究提出Chart-RL来应对这些挑战,这是一种新颖的强化学习框架,通过对视觉感知和逻辑推理进行反馈驱动的策略优化来增强VLM对图表的理解。我们的核心创新包括一个综合框架,将基于策略优化技术的强化学习(RL)与自适应奖励函数结合起来,其性能优于基线基础模型,并与更大型的最先进架构相比具有竞争力。我们还在强化学习框架中集成了基于低秩适配(LoRA)的参数高效微调,只需单GPU配置,同时保持性能不受损。我们利用ChartQAPro数据集对开源、专有和最先进的闭源模型进行了广泛的基准测试。经过强化学习微调的Qwen3-VL-4B-Instruct模型的答案准确率达到0.634,超过了Qwen3-VL-8B-Instruct基础模型的0.580,尽管参数量只有后者的一半,同时将推理延迟从31秒降至9秒。
Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models
理解幻觉在多模态推理模型强化学习后训练中的作用
- Authors: Gengwei Zhang, Jie Peng, Zhen Tan, Mufan Qiu, Hossein Nourkhiz Mahjoub, Vaishnav Tadiparthi, Kwonjoon Lee, Yanyong Zhang, Tianlong Chen
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2604.03179
- Pdf link: https://arxiv.org/pdf/2604.03179
- Abstract
The recent success of reinforcement learning (RL) in large reasoning models has inspired the growing adoption of RL for post-training Multimodal Large Language Models (MLLMs) to enhance their visual reasoning capabilities. Although many studies have reported improved performance, it remains unclear whether RL training truly enables models to learn from visual information. In this work, we propose the Hallucination-as-Cue Framework, an analytical framework designed to investigate the effects of RL-based post-training on multimodal reasoning models from the perspective of model hallucination. Specifically, we introduce hallucination-inductive, modality-specific corruptions that remove or replace essential information required to derive correct answers, thereby forcing the model to reason by hallucination. By applying these corruptions during both training and evaluation, our framework provides a unique perspective for diagnosing RL training dynamics and understanding the intrinsic properties of datasets. Through extensive experiments and analyses across multiple multimodal reasoning benchmarks, we reveal that the role of model hallucination for RL-training is more significant than previously recognized. For instance, we find that RL post-training under purely hallucination-inductive settings can still significantly improve models' reasoning performance, and in some cases even outperform standard training. These findings challenge prevailing assumptions about MLLM reasoning training and motivate the development of more modality-aware RL-based training designs.
- 中文摘要
强化学习(RL)在大型推理模型中的近期成功,推动了强化学习在多模态大型语言模型(MLLM)后训练中的日益普及,以增强其视觉推理能力。尽管许多研究报告了性能的提升,但目前尚不清楚强化学习训练是否真的能让模型从视觉信息中学习。本研究提出“幻觉即线索”(Hallucination-as-Cue)框架,这是一个分析框架,旨在从模型幻觉的角度探讨基于强化学习的后训练对多模态推理模型的影响。具体来说,我们引入了诱导幻觉的、特定于模态的破坏操作,这些操作移除或替换推导正确答案所需的关键信息,从而迫使模型依靠幻觉进行推理。通过在训练和评估中同时应用这些破坏,我们的框架为诊断强化学习训练动态和理解数据集的内在属性提供了独特的视角。通过在多个多模态推理基准上的广泛实验和分析,我们揭示了模型幻觉在强化学习训练中的作用比以往认识的更为重要。例如,我们发现在纯幻觉诱导设置下的强化学习后训练仍能显著提升模型的推理表现,有时甚至优于标准训练。这些发现挑战了关于MLLM推理训练的主流假设,并推动人们设计更具模态意识的基于强化学习的训练方案。
Keyword: diffusion policy
Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model
多视图视频扩散策略:一个三维时空感知视频动作模型
- Authors: Peiyan Li, Yixiang Chen, Yuan Xu, Jiabing Yang, Xiangnan Wu, Jun Guo, Nan Sun, Long Qian, Xinghang Li, Xin Xiao, Jing Liu, Nianfeng Liu, Tao Kong, Yan Huang, Liang Wang, Tieniu Tan
- Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2604.03181
- Pdf link: https://arxiv.org/pdf/2604.03181
- Abstract
Robotic manipulation requires understanding both the 3D spatial structure of the environment and its temporal evolution, yet most existing policies overlook one or both. They typically rely on 2D visual observations and backbones pretrained on static image-text pairs, resulting in high data requirements and limited understanding of environment dynamics. To address this, we introduce MV-VDP, a multi-view video diffusion policy that jointly models the 3D spatio-temporal state of the environment. The core idea is to simultaneously predict multi-view heatmap videos and RGB videos, which 1) align the representation format of video pretraining with action finetuning, and 2) specify not only what actions the robot should take, but also how the environment is expected to evolve in response to those actions. Extensive experiments show that MV-VDP enables data-efficient, robust, generalizable, and interpretable manipulation. With only ten demonstration trajectories and without additional pretraining, MV-VDP successfully performs complex real-world tasks, demonstrates strong robustness across a range of model hyperparameters, generalizes to out-of-distribution settings, and predicts realistic future videos. Experiments on Meta-World and real-world robotic platforms demonstrate that MV-VDP consistently outperforms video-prediction-based, 3D-based, and vision-language-action models, establishing a new state of the art in data-efficient multi-task manipulation.
- 中文摘要
机器人操作需要理解环境的三维空间结构及其时间演变,但大多数现有策略忽视了其中之一或两者。它们通常依赖二维视觉观测和在静态图像-文本对上预训练的骨干网络,导致数据需求较高且对环境动态的理解有限。为此,我们引入了MV-VDP,一种多视角视频扩散策略,对环境的三维时空状态进行联合建模。核心思想是同时预测多视角热图视频和RGB视频,这样做 1)使视频预训练的表示格式与动作微调对齐,2)不仅指定机器人应采取哪些动作,还给出环境预期如何随这些动作而演变。大量实验表明,MV-VDP实现了数据高效、鲁棒、可泛化且可解释的操作。仅用十条演示轨迹且无需额外预训练,MV-VDP就能成功执行复杂的真实世界任务,在一系列模型超参数上展现出很强的鲁棒性,能泛化到分布外场景,并预测出逼真的未来视频。在Meta-World和真实机器人平台上的实验表明,MV-VDP始终优于基于视频预测的模型、基于3D的模型以及视觉-语言-动作模型,确立了数据高效多任务操作的新的最先进水平。
The Compression Gap: Why Discrete Tokenization Limits Vision-Language-Action Model Scaling
压缩差距:为何离散分词限制视觉-语言-动作模型的规模扩展
- Authors: Takuya Shiba
- Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2604.03191
- Pdf link: https://arxiv.org/pdf/2604.03191
- Abstract
Scaling Vision-Language-Action (VLA) models by upgrading the vision encoder is expected to improve downstream manipulation performance, as it does in vision-language modeling. We show that this expectation fails when actions are represented as discrete tokens, and explain why through an information-theoretic principle we call the Compression Gap: in any visuomotor pipeline, scaling behavior is governed by the location of the tightest information bottleneck. When actions are continuous (e.g., Diffusion Policy), the vision encoder is the binding constraint, and upgrading it directly improves performance. When actions are discretized through a fixed-capacity codebook (e.g., OAT), the codebook becomes the binding constraint, and encoder improvements cannot propagate past it, regardless of how rich the upstream representation is. We validate this principle on the LIBERO benchmark with three lines of evidence: a factorial experiment showing that encoder upgrades improve Diffusion Policy by over 21 percentage points while OAT gains are substantially attenuated across model scales; an encoder quality gradient across four encoders confirming that Diffusion Policy tracks encoder quality monotonically while OAT remains flat; and a codebook size experiment demonstrating that relaxing codebook capacity partially recovers encoder sensitivity, providing causal evidence for the bottleneck hypothesis. Our findings reveal that scaling in Physical AI requires identifying where information bottlenecks lie in the pipeline, rather than uniformly increasing model or data size.
- 中文摘要
人们预期,通过升级视觉编码器来扩展视觉-语言-动作(VLA)模型能够提升下游操作性能,正如它在视觉语言建模中的作用一样。我们证明,当动作被表示为离散token时,这一预期并不成立,并通过一个我们称之为“压缩差距”(Compression Gap)的信息论原理解释了原因:在任何视觉运动流水线中,扩展行为由最紧的信息瓶颈所在位置决定。当动作是连续的(例如Diffusion Policy)时,视觉编码器是起约束作用的瓶颈,升级它可以直接提升性能。当动作通过固定容量的码本(例如OAT)离散化时,码本成为起约束作用的瓶颈,编码器的改进无法越过它向后传递,无论上游表示多么丰富。我们在LIBERO基准上通过三方面证据验证了这一原理:一项析因实验显示,编码器升级使Diffusion Policy提升超过21个百分点,而OAT的增益在各模型规模上都明显减弱;一项覆盖四个编码器的编码器质量梯度实验确认,Diffusion Policy的性能随编码器质量单调变化,而OAT保持平坦;以及一项码本大小实验表明,放宽码本容量可部分恢复对编码器的敏感性,为瓶颈假说提供了因果证据。我们的发现表明,物理人工智能(Physical AI)的规模扩展需要识别信息瓶颈在流水线中的位置,而不是一味均匀地增大模型或数据规模。
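As a rough illustration of the bottleneck argument in the abstract above (this is not code from the paper): a single discrete token drawn from a codebook of K entries can carry at most log2(K) bits, so information passing through that token is capped at the smaller of what the encoder provides and log2(K). The function names, the per-token framing, and the example numbers below are all hypothetical, and treating continuous actions as "no codebook cap" is a simplifying assumption.

```python
import math
from typing import Optional


def codebook_capacity_bits(codebook_size: int) -> float:
    """Upper bound, in bits, on what one token from a codebook of this size can carry."""
    if codebook_size < 1:
        raise ValueError("codebook_size must be positive")
    return math.log2(codebook_size)


def pipeline_bound_bits(encoder_bits: float, codebook_size: Optional[int] = None) -> float:
    """Tightest-bottleneck bound for a toy encoder -> (optional codebook) -> action pipeline.

    codebook_size=None stands in for continuous actions, where this toy
    model assumes the encoder is the only constraint.
    """
    if codebook_size is None:
        return encoder_bits
    return min(encoder_bits, codebook_capacity_bits(codebook_size))


# Hypothetical numbers: upgrading the encoder from 20 to 30 bits.
continuous_gain = pipeline_bound_bits(30.0) - pipeline_bound_bits(20.0)
discrete_gain = pipeline_bound_bits(30.0, 1024) - pipeline_bound_bits(20.0, 1024)
larger_codebook_gain = pipeline_bound_bits(30.0, 2**16) - pipeline_bound_bits(20.0, 2**16)
```

In this toy setting the encoder upgrade raises the bound only when no codebook sits downstream or when the codebook is large enough not to be the tightest constraint, which mirrors the qualitative claim of the abstract without reproducing any of its measurements.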