Generated: 2026-01-15 16:33:52 (UTC+8); arXiv release time: 2026-01-15 20:00 EST (2026-01-16 09:00 UTC+8)
20 relevant papers today
Keyword: reinforcement learning
Reading or Reasoning? Format Decoupled Reinforcement Learning for Document OCR
阅读还是推理?文档OCR格式解耦强化学习
- Authors: Yufeng Zhong, Lei Chen, Zhixiong Zeng, Xuanle Zhao, Deyang Jiang, Liming Zheng, Jing Huang, Haibo Qiu, Peng Shi, Siqi Yang, Lin Ma
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2601.08834
- Pdf link: https://arxiv.org/pdf/2601.08834
- Abstract
Reading text from images or scanned documents via OCR models has been a longstanding focus of researchers. Intuitively, text reading is perceived as a straightforward perceptual task, and existing work primarily focuses on constructing enriched data engineering to enhance SFT capabilities. In this work, we observe that even advanced OCR models exhibit significantly higher entropy in formatted text (e.g., formulas and tables) compared to plain text, often by an order of magnitude. These statistical patterns reveal that advanced OCR models struggle with high output uncertainty when dealing with format-sensitive documents, suggesting that reasoning over diverse reading pathways may improve OCR performance. To address this, we propose format decoupled reinforcement learning (FD-RL), which leverages high-entropy patterns for targeted optimization. Our approach employs an entropy-based data filtration strategy to identify format-intensive instances, and adopts format decoupled rewards tailored to different format types, enabling format-level validation rather than token-level memorization. FD-RL achieves an average score of 90.41 on OmniDocBench, setting a new record for end-to-end models on this highly popular benchmark. More importantly, we conduct comprehensive ablation studies over data, training, filtering, and rewarding strategies, thoroughly validating their effectiveness.
- 中文摘要
通过OCR模型从图像或扫描文档中读取文本一直是研究人员长期关注的重点。直观上,文本阅读被视为一项简单的感知任务,现有工作主要集中于构建丰富的数据工程以增强SFT能力。在这项研究中,我们观察到即使是先进的OCR模型,在格式化文本(例如公式、表格等)上的熵也显著高于纯文本,通常高出一个数量级。这些统计模式表明,先进的OCR模型在处理格式敏感文档时存在输出不确定性较高的问题,这提示对多样化阅读路径进行推理可能提升OCR性能。为此,我们提出了格式解耦强化学习(FD-RL),利用高熵模式进行有针对性的优化。我们的方法采用基于熵的数据过滤策略来识别格式密集型实例,并采用针对不同格式类型量身定制的格式解耦奖励,实现格式层面的验证,而非token层面的记忆。FD-RL在OmniDocBench上取得了90.41的平均得分,创下了端到端模型在这一广受欢迎的基准上的新纪录。更重要的是,我们对数据、训练、过滤和奖励策略进行了全面的消融研究,充分验证了它们的有效性。
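The entropy-based filtering idea in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: the function names, the use of mean per-token Shannon entropy as the score, and the fixed threshold are all assumptions made for illustration.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(token_dists):
    """Average per-token entropy over one generated sample."""
    return sum(token_entropy(d) for d in token_dists) / len(token_dists)

def filter_high_entropy(samples, threshold):
    """Return ids of samples whose mean entropy is at least `threshold`.

    The threshold rule is a hypothetical stand-in for the paper's filter.
    """
    return [sid for sid, dists in samples.items()
            if mean_entropy(dists) >= threshold]

# Toy data: near-deterministic "plain text" vs. uncertain "table" output.
samples = {
    "plain_text": [[0.99, 0.01], [0.98, 0.02]],
    "table": [[0.5, 0.5], [0.6, 0.4]],
}
selected = filter_high_entropy(samples, threshold=0.3)
print(selected)
```

In this toy setup only the high-uncertainty sample passes the filter, which mirrors the stated goal of concentrating RL training on format-intensive instances.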
TranslateGemma Technical Report
TranslateGemma 技术报告
- Authors: Mara Finkelstein, Isaac Caswell, Tobias Domhan, Jan-Thorsten Peter, Juraj Juraska, Parker Riley, Daniel Deutsch, Cole Dilanni, Colin Cherry, Eleftheria Briakou, Elizabeth Nielsen, Jiaming Luo, Kat Black, Ryan Mullins, Sweta Agrawal, Wenda Xu, Erin Kats, Stephane Jaskiewicz, Markus Freitag, David Vilar
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2601.09012
- Pdf link: https://arxiv.org/pdf/2601.09012
- Abstract
We present TranslateGemma, a suite of open machine translation models based on the Gemma 3 foundation models. To enhance the inherent multilingual capabilities of Gemma 3 for the translation task, we employ a two-stage fine-tuning process. First, supervised fine-tuning is performed using a rich mixture of high-quality large-scale synthetic parallel data generated via state-of-the-art models and human-translated parallel data. This is followed by a reinforcement learning phase, where we optimize translation quality using an ensemble of reward models, including MetricX-QE and AutoMQM, targeting translation quality. We demonstrate the effectiveness of TranslateGemma with human evaluation on the WMT25 test set across 10 language pairs and with automatic evaluation on the WMT24++ benchmark across 55 language pairs. Automatic metrics show consistent and substantial gains over the baseline Gemma 3 models across all sizes. Notably, smaller TranslateGemma models often achieve performance comparable to larger baseline models, offering improved efficiency. We also show that TranslateGemma models retain strong multimodal capabilities, with enhanced performance on the Vistra image translation benchmark. The release of the open TranslateGemma models aims to provide the research community with powerful and adaptable tools for machine translation.
- 中文摘要
我们介绍TranslateGemma,一套基于Gemma 3基础模型的开放机器翻译模型。为了增强Gemma 3在翻译任务上固有的多语言能力,我们采用了两阶段的微调过程。首先,使用由最先进模型生成的高质量大规模合成平行数据与人工翻译平行数据的丰富混合进行监督式微调。随后进入强化学习阶段,我们利用包括 MetricX-QE 和 AutoMQM 在内的一组奖励模型来优化翻译质量。我们通过在WMT25测试集上对10个语言对进行人工评估,以及在WMT24++基准上对55个语言对进行自动评估,展示了TranslateGemma的有效性。自动指标显示,各种规模的TranslateGemma模型相比基线Gemma 3模型均有持续且显著的提升。值得注意的是,较小的TranslateGemma模型通常能达到与较大基线模型相当的性能,从而提升效率。我们还表明,TranslateGemma模型保留了强大的多模态能力,并在Vistra图像翻译基准上表现更佳。发布开放的TranslateGemma模型旨在为研究社区提供强大且适应性强的机器翻译工具。
SRT: Accelerating Reinforcement Learning via Speculative Rollout with Tree-Structured Cache
SRT:通过带树结构缓存的推测式 rollout 加速强化学习
- Authors: Chi-Chih Chang, Siqi Zhu, Zhichen Zeng, Haibin Lin, Jiaxuan You, Mohamed S. Abdelfattah, Ziheng Jiang, Xuehai Qian
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2601.09083
- Pdf link: https://arxiv.org/pdf/2601.09083
- Abstract
We present Speculative Rollout with Tree-Structured Cache (SRT), a simple, model-free approach to accelerate on-policy reinforcement learning (RL) for language models without sacrificing distributional correctness. SRT exploits the empirical similarity of rollouts for the same prompt across training steps by storing previously generated continuations in a per-prompt tree-structured cache. During generation, the current policy uses this tree as the draft model for performing speculative decoding. To keep the cache fresh and improve draft model quality, SRT updates trees online from ongoing rollouts and proactively performs run-ahead generation during idle GPU bubbles. Integrated into standard RL pipelines (e.g., PPO, GRPO, and DAPO) and multi-turn settings, SRT consistently reduces generation and step latency and lowers per-token inference cost, achieving up to 2.08x wall-clock time speedup during rollout.
- 中文摘要
我们提出带树结构缓存的推测式 rollout(SRT),这是一种简单、无模型的方法,可在不牺牲分布正确性的前提下加速语言模型的同策略(on-policy)强化学习(RL)。SRT利用同一提示词在不同训练步骤中 rollout 的经验相似性,将之前生成的续写存储在按提示词划分的树结构缓存中。在生成过程中,当前策略将该树用作草稿模型来执行推测解码。为了保持缓存新鲜并提升草稿模型质量,SRT 会根据正在进行的 rollout 在线更新树,并在 GPU 空闲气泡期间主动进行提前生成。SRT 可集成到标准强化学习流水线(例如 PPO、GRPO 和 DAPO)及多轮设置中,持续降低生成延迟和单步延迟,并降低每个 token 的推理成本,在 rollout 阶段实现最高 2.08 倍的墙钟时间加速。
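The per-prompt tree cache described above can be sketched as a token trie that proposes draft continuations from earlier rollouts. This is a simplified illustration, not the SRT implementation: `TokenTrie`, `accept_prefix`, and the toy `policy_next` function are hypothetical names, and the greedy match-or-stop check stands in for the rejection-sampling verification that real speculative decoding uses to preserve the sampling distribution.

```python
class TokenTrie:
    """Prefix tree over token ids from earlier rollouts of one prompt."""

    def __init__(self):
        self.children = {}  # token id -> TokenTrie
        self.count = 0      # number of cached rollouts passing through this node

    def insert(self, tokens):
        node = self
        for t in tokens:
            node = node.children.setdefault(t, TokenTrie())
            node.count += 1

    def draft(self, prefix, k):
        """Propose up to k tokens after `prefix`, following the most frequent branch."""
        node = self
        for t in prefix:
            node = node.children.get(t)
            if node is None:
                return []  # prefix never seen: nothing to draft
        out = []
        while node.children and len(out) < k:
            t, node = max(node.children.items(), key=lambda kv: kv[1].count)
            out.append(t)
        return out

def accept_prefix(draft, policy_next, prefix):
    """Keep draft tokens while they match the current policy's own next token."""
    accepted = []
    for t in draft:
        if policy_next(prefix + accepted) != t:
            break
        accepted.append(t)
    return accepted

cache = TokenTrie()
for rollout in ([1, 2, 3, 4], [1, 2, 5], [1, 2, 3, 6]):
    cache.insert(rollout)

# Hypothetical current policy: a lookup table from context to next token.
table = {(1, 2): 3, (1, 2, 3): 6}
policy_next = lambda ctx: table.get(tuple(ctx))

proposed = cache.draft([1, 2], k=3)
accepted = accept_prefix(proposed, policy_next, [1, 2])
print(proposed, accepted)
```

The cached rollouts agree with the current policy on the first drafted token only, so one token is accepted and decoding would resume from there; inserting the finished rollout back into the trie corresponds to the online cache update the abstract mentions.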
SkinFlow: Efficient Information Transmission for Open Dermatological Diagnosis via Dynamic Visual Encoding and Staged RL
SkinFlow:通过动态视觉编码和分阶段强化学习实现开放式皮肤诊断的高效信息传输
- Authors: Lijun Liu, Linwei Chen, Zhishou Zhang, Meng Tian, Hengfu Cui, Ruiyang Li, Zhaocheng Liu, Qiang Ju, Qianxi Li, Hong-Yu Zhou
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2601.09136
- Pdf link: https://arxiv.org/pdf/2601.09136
- Abstract
General-purpose Large Vision-Language Models (LVLMs), despite their massive scale, often falter in dermatology due to "diffuse attention" - the inability to disentangle subtle pathological lesions from background noise. In this paper, we challenge the assumption that parameter scaling is the only path to medical precision. We introduce SkinFlow, a framework that treats diagnosis as an optimization of visual information transmission efficiency. Our approach utilizes a Virtual-Width Dynamic Vision Encoder (DVE) to "unfold" complex pathological manifolds without physical parameter expansion, coupled with a two-stage Reinforcement Learning strategy. This strategy sequentially aligns explicit medical descriptions (Stage I) and reconstructs implicit diagnostic textures (Stage II) within a constrained semantic space. Furthermore, we propose a clinically grounded evaluation protocol that prioritizes diagnostic safety and hierarchical relevance over rigid label matching. Empirical results are compelling: our 7B model establishes a new state-of-the-art on the Fitzpatrick17k benchmark, achieving a +12.06% gain in Top-1 accuracy and a +28.57% boost in Top-6 accuracy over the massive general-purpose models (e.g., Qwen3VL-235B and GPT-5.2). These findings demonstrate that optimizing geometric capacity and information flow yields superior diagnostic reasoning compared to raw parameter scaling.
- 中文摘要
通用大型视觉语言模型(LVLM)尽管规模庞大,但在皮肤科任务中常因“弥漫性注意力”(即无法将细微的病理病变与背景噪声区分开来)而表现不佳。本文挑战了“扩大参数规模是实现医疗精度的唯一途径”这一假设。我们提出SkinFlow,这是一个将诊断视为视觉信息传输效率优化问题的框架。我们的方法采用虚拟宽度动态视觉编码器(DVE),在不进行物理参数扩展的情况下“展开”复杂的病理流形,并结合两阶段强化学习策略。该策略在受限语义空间内依次对齐显式医学描述(第一阶段)并重建隐式诊断纹理(第二阶段)。此外,我们提出了一种有临床依据的评估方案,优先考虑诊断安全性和层级相关性,而非僵化的标签匹配。实证结果令人信服:我们的7B模型在Fitzpatrick17k基准上达到了新的最先进水平,相比庞大的通用模型(如Qwen3VL-235B和GPT-5.2),Top-1准确率提升12.06%,Top-6准确率提升28.57%。这些发现表明,与单纯扩大参数规模相比,优化几何容量和信息流能带来更优的诊断推理。
UserLM-R1: Modeling Human Reasoning in User Language Models with Multi-Reward Reinforcement Learning
UserLM-R1:利用多奖励强化学习建模用户语言模型中的人类推理
- Authors: Feng Zhang, Shijia Li, Chunmao Zhang, Zhanyu Ma, Jun Xu, Jiuchong Gao, Jinghua Hao, Renqing He, Jingwen Xu, Han Liu
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2601.09215
- Pdf link: https://arxiv.org/pdf/2601.09215
- Abstract
User simulators serve as the critical interactive environment for agent post-training, and an ideal user simulator generalizes across domains and proactively engages in negotiation by challenging or bargaining. However, current methods exhibit two issues. They rely on static and context-unaware profiles, necessitating extensive manual redesign for new scenarios, thus limiting generalizability. Moreover, they neglect human strategic thinking, leading to vulnerability to agent manipulation. To address these issues, we propose UserLM-R1, a novel user language model with reasoning capability. Specifically, we first construct comprehensive user profiles with both static roles and dynamic scenario-specific goals for adaptation to diverse scenarios. Then, we propose a goal-driven decision-making policy to generate high-quality rationales before producing responses, and further refine the reasoning and improve strategic capabilities with supervised fine-tuning and multi-reward reinforcement learning. Extensive experimental results demonstrate that UserLM-R1 outperforms competitive baselines, particularly on the more challenging adversarial set.
- 中文摘要
用户模拟器是智能体后训练的关键交互环境,理想的用户模拟器应能跨领域泛化,并通过质疑或讨价还价主动参与谈判。然而,当前的方法存在两个问题。它们依赖静态且不感知上下文的用户画像,需要针对新场景进行大量手动重新设计,从而限制了泛化能力。此外,它们忽视了人类的策略性思维,导致容易受到智能体的操纵。为解决这些问题,我们提出了UserLM-R1,一种具有推理能力的新型用户语言模型。具体来说,我们首先构建同时包含静态角色和动态场景特定目标的全面用户画像,以适应多样化场景。随后,我们提出一种目标驱动的决策策略,在生成回复之前先生成高质量的推理依据,并通过监督微调和多奖励强化学习进一步完善推理、提升策略能力。大量实验结果表明,UserLM-R1优于有竞争力的基线,在更具挑战性的对抗集上尤为明显。
GIFT: Unlocking Global Optimality in Post-Training via Finite-Temperature Gibbs Initialization
GIFT:通过有限温度吉布斯初始化解锁训练后全局最优性
- Authors: Zhengyang Zhao, Lu Ma, Yizhen Jiang, Xiaochen Ma, Zimo Meng, Chengyu Shen, Lexiang Tang, Haoze Sun, Peng Pei, Wentao Zhang
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2601.09233
- Pdf link: https://arxiv.org/pdf/2601.09233
- Abstract
The prevailing post-training paradigm for Large Reasoning Models (LRMs)--Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL)--suffers from an intrinsic optimization mismatch: the rigid supervision inherent in SFT induces distributional collapse, thereby exhausting the exploration space necessary for subsequent RL. In this paper, we reformulate SFT within a unified post-training framework and propose Gibbs Initialization with Finite Temperature (GIFT). We characterize standard SFT as a degenerate zero-temperature limit that suppresses base priors. Conversely, GIFT incorporates supervision as a finite-temperature energy potential, establishing a distributional bridge that ensures objective consistency throughout the post-training pipeline. Our experiments demonstrate that GIFT significantly outperforms standard SFT and other competitive baselines when utilized for RL initialization, providing a mathematically principled pathway toward achieving global optimality in post-training. Our code is available at this https URL.
- 中文摘要
目前主流的大型推理模型(LRM)后训练范式(先进行监督式微调(SFT),再进行强化学习(RL))存在内在的优化不匹配:SFT固有的刚性监督会导致分布坍缩,从而耗尽后续RL所需的探索空间。本文在统一的后训练框架内重新表述SFT,并提出了有限温度吉布斯初始化(GIFT)。我们将标准SFT刻画为一种退化的零温度极限,它会抑制基础先验(base priors)。相反,GIFT将监督信号作为有限温度下的能量势引入,建立起一座分布桥梁,确保整个后训练流程中的目标一致性。我们的实验表明,在用于RL初始化时,GIFT显著优于标准SFT及其他有竞争力的基线,为实现后训练的全局最优提供了一条有数学依据的路径。我们的代码可在此 https URL 访问。
Reward Learning through Ranking Mean Squared Error
通过排序均方误差进行奖励学习
- Authors: Chaitanya Kharyal, Calarina Muslimani, Matthew E. Taylor
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2601.09236
- Pdf link: https://arxiv.org/pdf/2601.09236
- Abstract
Reward design remains a significant bottleneck in applying reinforcement learning (RL) to real-world problems. A popular alternative is reward learning, where reward functions are inferred from human feedback rather than manually specified. Recent work has proposed learning reward functions from human feedback in the form of ratings, rather than traditional binary preferences, enabling richer and potentially less cognitively demanding supervision. Building on this paradigm, we introduce a new rating-based RL method, Ranked Return Regression for RL (R4). At its core, R4 employs a novel ranking mean squared error (rMSE) loss, which treats teacher-provided ratings as ordinal targets. Our approach learns from a dataset of trajectory-rating pairs, where each trajectory is labeled with a discrete rating (e.g., "bad," "neutral," "good"). At each training step, we sample a set of trajectories, predict their returns, and rank them using a differentiable sorting operator (soft ranks). We then optimize a mean squared error loss between the resulting soft ranks and the teacher's ratings. Unlike prior rating-based approaches, R4 offers formal guarantees: its solution set is provably minimal and complete under mild assumptions. Empirically, using simulated human feedback, we demonstrate that R4 consistently matches or outperforms existing rating and preference-based RL methods on robotic locomotion benchmarks from OpenAI Gym and the DeepMind Control Suite, while requiring significantly less feedback.
- 中文摘要
奖励设计仍然是强化学习(RL)应用于现实问题的重要瓶颈。一种流行的替代方案是奖励学习,即奖励函数是从人类反馈中推断的,而不是手动指定。近期研究提出通过人类反馈(评分)学习奖励函数,而非传统的二元偏好,从而实现更丰富且认知负担更低的监督。基于这一范式,我们引入了一种基于评分的新型强化学习方法——强化学习排名回报回归(R4)。R4的核心采用了一种新颖的排名均方误差(rMSE)损失,将教师提供的评分视为序数目标。我们的方法从轨迹-评级对数据集中学习,每个轨迹都被标记为离散评级(例如“坏”、“中性”、“好”)。在每个训练步骤中,我们采样一组轨迹,预测其回报,并使用可微排序算符(软秩)对它们进行排序。然后,我们优化了软排名与教师评分之间的均方误差损失。与以往基于评级的方法不同,R4提供了形式保证:其解集在轻度假设下可证明为最小且完备。通过实证,我们通过模拟人类反馈证明,R4在机器人运动基准测试中持续匹配甚至优于基于评分和偏好的强化学习方法,且所需的反馈显著减少。
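The ranking-MSE idea above (predict returns, soft-rank them, regress the ranks onto the ratings) can be sketched in plain Python. Everything here is an assumption made for illustration: the pairwise-sigmoid soft rank is a common differentiable relaxation but not necessarily the operator used in R4, and rescaling both ranks and ratings to [0, 1] is a guess at how the two scales are made comparable.

```python
import math

def soft_ranks(values, tau=0.1):
    """Smooth ascending ranks in [1, n]: 1 + sum of sigmoid((v_i - v_j) / tau)."""
    n = len(values)
    ranks = []
    for i in range(n):
        r = 1.0
        for j in range(n):
            if j != i:
                r += 1.0 / (1.0 + math.exp(-(values[i] - values[j]) / tau))
        ranks.append(r)
    return ranks

def ranking_mse(pred_returns, ratings, num_classes, tau=0.1):
    """MSE between normalized soft ranks and normalized ordinal ratings.

    Assumes at least two trajectories and ratings in {0, ..., num_classes - 1}.
    """
    n = len(pred_returns)
    ranks = [(r - 1.0) / (n - 1) for r in soft_ranks(pred_returns, tau)]
    targets = [c / (num_classes - 1) for c in ratings]
    return sum((r - t) ** 2 for r, t in zip(ranks, targets)) / n

ratings = [0, 1, 2]  # e.g. "bad", "neutral", "good"
loss_consistent = ranking_mse([0.1, 0.5, 0.9], ratings, num_classes=3)
loss_reversed = ranking_mse([0.9, 0.5, 0.1], ratings, num_classes=3)
print(loss_consistent, loss_reversed)
```

Predicted returns that order the trajectories the same way as the ratings give a loss near zero, while the reversed ordering is penalized heavily. In an actual training loop the same computation would be written in an autodiff framework so that gradients flow through the soft ranks into the reward model.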
Efficient Paths and Dense Rewards: Probabilistic Flow Reasoning for Large Language Models
高效路径与密集奖励:大型语言模型的概率流推理
- Authors: Yan Liu, Feng Zhang, Zhanyu Ma, Jun Xu, Jiuchong Gao, Jinghua Hao, Renqing He, Han Liu, Yangdong Deng
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2601.09260
- Pdf link: https://arxiv.org/pdf/2601.09260
- Abstract
High-quality chain-of-thought has demonstrated strong potential for unlocking the reasoning capabilities of large language models. However, current paradigms typically treat the reasoning process as an indivisible sequence, lacking an intrinsic mechanism to quantify step-wise information gain. This granularity gap manifests in two limitations: inference inefficiency from redundant exploration without explicit guidance, and optimization difficulty due to sparse outcome supervision or costly external verifiers. In this work, we propose CoT-Flow, a framework that reconceptualizes discrete reasoning steps as a continuous probabilistic flow, quantifying the contribution of each step toward the ground-truth answer. Built on this formulation, CoT-Flow enables two complementary methodologies: flow-guided decoding, which employs a greedy flow-based decoding strategy to extract information-efficient reasoning paths, and flow-based reinforcement learning, which constructs a verifier-free dense reward function. Experiments on challenging benchmarks demonstrate that CoT-Flow achieves a superior balance between inference efficiency and reasoning performance.
- 中文摘要
高质量的思维链已展现出释放大型语言模型推理能力的强大潜力。然而,当前范式通常将推理过程视为不可分割的序列,缺乏内在机制来量化逐步信息获取。这种细度差距表现为两个局限:无明确指导的冗余探索导致推理效率低下,以及由于结果监督稀疏或外部验证器成本高昂导致的优化难度。在本研究中,我们提出了CoT-Flow,这一框架将离散推理步骤重新构想为连续的概率流,量化每一步对真实答案的贡献。基于该表述,CoT-Flow支持两种互补方法:流引导解码,采用贪婪的基于流的解码策略提取信息效率最高的推理路径;以及基于流的强化学习,构建无验证器的密集奖励函数。对高难度基准的实验表明,CoT-Flow在推理效率与推理性能之间实现了更优的平衡。
Learning to Trust Experience: A Monitor-Trust-Regulator Framework for Learning under Unobservable Feedback Reliability
学习信任经验:一个在不可观察反馈可靠性下学习的监控-信任-监管框架
- Authors: Zhipeng Zhang, Zhenjie Yao, Kai Li, Lei Yang
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2601.09261
- Pdf link: https://arxiv.org/pdf/2601.09261
- Abstract
Learning under unobservable feedback reliability poses a distinct challenge beyond optimization robustness: a system must decide whether to learn from an experience, not only how to learn stably. We study this setting as Epistemic Identifiability under Unobservable Reliability (EIUR), where each experience has a latent credibility, reliable and unreliable feedback can be locally indistinguishable, and data are generated in a closed loop by the learner's own evolving beliefs and actions. In EIUR, standard robust learning can converge stably yet form high-confidence, systematically wrong beliefs. We propose metacognitive regulation as a practical response: a second, introspective control loop that infers experience credibility from endogenous evidence in the learner's internal dynamics. We formalize this as a modular Monitor-Trust-Regulator (MTR) decomposition and instantiate it with self-diagnosis, which maintains a slowly varying experience-trust variable that softly modulates learning updates, without exogenous reliability labels or an explicit corruption model. Empirically, in the EIUR regimes studied here, self-diagnosis is associated with improved epistemic identifiability. In reinforcement learning, it enables calibrated skepticism and recovery under systematically corrupted rewards. In supervised learning, it exposes a critical dissociation: performance recovery does not imply epistemic recovery. Accuracy can rebound while internal belief dynamics remain locked-in by early misleading data, a failure detectable only through introspective diagnostics. Together, MTR and self-diagnosis provide an organizing abstraction and a concrete design template for intrinsic reliability assessment in autonomous learning under unobservable reliability.
- 中文摘要
在不可观察的反馈可靠性下学习,除了优化鲁棒性之外,还面临着独特的挑战:系统必须决定是否从经验中学习,而不仅仅是如何稳定学习。我们将此环境称为不可观察可靠性下的认知可识别性(EIUR),其中每个经验具有潜在可信度,可靠与不可靠反馈在局部难以区分,数据由学习者自身不断演变的信念和行为在闭环中生成。在EIUR中,标准的稳健学习可以稳定收敛,同时形成高度自信、系统性错误的信念。我们提出元认知调节作为一种实用的回应:第二种内省控制循环,通过学习者内部动态中的内生证据推断经验的可信度。我们将此形式化为模块化的监控-信任-调节器(MTR)分解,并结合自我诊断实现,自诊断保持一个缓慢变化的体验-信任变量,温和地调节学习更新,没有外生可靠性标签或显式损坏模型。从经验来看,在本研究的EIUR体系中,自我诊断与认识可识别性有所提升。在强化学习中,它能够在系统性腐败的奖励下进行校准的怀疑和恢复。在监督学习中,它暴露出一种关键的分离:表现恢复并不意味着认识论的恢复。准确率可以回升,而内部信念动态仍被早期误导性数据锁定,这种失败只能通过内省诊断发现。MTR与自我诊断共同为自主学习在不可观察可靠性下的内在可靠性评估提供了组织抽象和具体设计模板。
RISER: Orchestrating Latent Reasoning Skills for Adaptive Activation Steering
RISER:为适应激活引导调控潜在推理技能
- Authors: Wencheng Ye, Liang Peng, Xiaoyang Yuan, Yi Bin, Pengpeng Zeng, Hengyu Jin, Heng Tao Shen
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2601.09269
- Pdf link: https://arxiv.org/pdf/2601.09269
- Abstract
Recent work on domain-specific reasoning with large language models (LLMs) often relies on training-intensive approaches that require parameter updates. While activation steering has emerged as a parameter efficient alternative, existing methods apply static, manual interventions that fail to adapt to the dynamic nature of complex reasoning. To address this limitation, we propose RISER (Router-based Intervention for Steerable Enhancement of Reasoning), a plug-and-play intervention framework that adaptively steers LLM reasoning in activation space. RISER constructs a library of reusable reasoning vectors and employs a lightweight Router to dynamically compose them for each input. The Router is optimized via reinforcement learning under task-level rewards, activating latent cognitive primitives in an emergent and compositional manner. Across seven diverse benchmarks, RISER yields 3.4-6.5% average zero-shot accuracy improvements over the base model while surpassing CoT-style reasoning with 2-3x higher token efficiency and robust accuracy gains. Further analysis shows that RISER autonomously combines multiple vectors into interpretable, precise control strategies, pointing toward more controllable and efficient LLM reasoning.
- 中文摘要
近期关于大型语言模型(LLM)领域特定推理的研究,通常依赖于需要参数更新的训练密集型方法。虽然激活引导(activation steering)已成为一种参数高效的替代方案,但现有方法采用静态的手动干预,无法适应复杂推理的动态特性。为解决这一局限,我们提出了RISER(Router-based Intervention for Steerable Enhancement of Reasoning),这是一种即插即用的干预框架,能够在激活空间中自适应地引导LLM推理。RISER构建了一个可复用的推理向量库,并利用轻量级路由器(Router)为每个输入动态组合这些向量。路由器在任务级奖励下通过强化学习进行优化,以涌现和组合的方式激活潜在的认知原语。在七个多样化基准测试中,RISER相比基础模型取得了3.4%至6.5%的平均零样本准确率提升,同时以2至3倍的token效率和稳健的准确率提升超越了CoT式推理。进一步分析显示,RISER能够自主地将多个向量组合成可解释、精确的控制策略,为更可控、更高效的LLM推理指明了方向。
Enhancing Spatial Reasoning in Large Language Models for Metal-Organic Frameworks Structure Prediction
增强金属有机框架结构预测中大型语言模型中的空间推理能力
- Authors: Mianzhi Pan, JianFei Li, Peishuo Liu, Botian Wang, Yawen Ouyang, Yiming Rong, Hao Zhou, Jianbing Zhang
- Subjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
- Arxiv link: https://arxiv.org/abs/2601.09285
- Pdf link: https://arxiv.org/pdf/2601.09285
- Abstract
Metal-organic frameworks (MOFs) are porous crystalline materials with broad applications such as carbon capture and drug delivery, yet accurately predicting their 3D structures remains a significant challenge. While Large Language Models (LLMs) have shown promise in generating crystals, their application to MOFs is hindered by MOFs' high atomic complexity. Inspired by the success of block-wise paradigms in deep generative models, we pioneer the use of LLMs in this domain by introducing MOF-LLM, the first LLM framework specifically adapted for block-level MOF structure prediction. To effectively harness LLMs for this modular assembly task, our training paradigm integrates spatial-aware continual pre-training (CPT), structural supervised fine-tuning (SFT), and matching-driven reinforcement learning (RL). By incorporating explicit spatial priors and optimizing structural stability via Soft Adaptive Policy Optimization (SAPO), our approach substantially enhances the spatial reasoning capability of a Qwen-3 8B model for accurate MOF structure prediction. Comprehensive experiments demonstrate that MOF-LLM outperforms state-of-the-art denoising-based and LLM-based methods while exhibiting superior sampling efficiency.
- 中文摘要
金属有机框架(MOFs)是多孔晶体材料,在碳捕获和药物递送等方面有广泛应用,但准确预测其三维结构仍是重大挑战。虽然大型语言模型(LLMs)在生成晶体方面展现出潜力,但MOFs较高的原子复杂性阻碍了LLMs在该领域的应用。受深度生成模型中分块范式成功的启发,我们提出了MOF-LLM,率先将LLMs用于这一领域;这是首个专门适配块级MOF结构预测的LLM框架。为了有效利用LLMs完成这一模块化组装任务,我们的训练范式集成了空间感知持续预训练(CPT)、结构监督微调(SFT)和匹配驱动强化学习(RL)。通过纳入显式空间先验并借助软自适应策略优化(SAPO)优化结构稳定性,我们的方法大幅提升了Qwen-3 8B模型的空间推理能力,从而实现准确的MOF结构预测。综合实验表明,MOF-LLM优于最先进的基于去噪和基于LLM的方法,同时展现出更高的采样效率。
Policy-Based Reinforcement Learning with Action Masking for Dynamic Job Shop Scheduling under Uncertainty: Handling Random Arrivals and Machine Failures
采用动作掩码的基于策略的强化学习用于不确定性下的动态作业车间调度:处理随机到达和机器故障
- Authors: Sofiene Lassoued, Stefan Lier, Andreas Schwung
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2601.09293
- Pdf link: https://arxiv.org/pdf/2601.09293
- Abstract
We present a novel framework for solving Dynamic Job Shop Scheduling Problems under uncertainty, addressing the challenges introduced by stochastic job arrivals and unexpected machine breakdowns. Our approach follows a model-based paradigm, using Coloured Timed Petri Nets to represent the scheduling environment, and Maskable Proximal Policy Optimization to enable dynamic decision-making while restricting the agent to feasible actions at each decision point. To simulate realistic industrial conditions, dynamic job arrivals are modeled using a Gamma distribution, which captures complex temporal patterns such as bursts, clustering, and fluctuating workloads. Machine failures are modeled using a Weibull distribution to represent age-dependent degradation and wear-out dynamics. These stochastic models enable the framework to reflect real-world manufacturing scenarios better. In addition, we study two action-masking strategies: a non-gradient approach that overrides the probabilities of invalid actions, and a gradient-based approach that assigns negative gradients to invalid actions within the policy network. We conduct extensive experiments on dynamic JSSP benchmarks, demonstrating that our method consistently outperforms traditional heuristic and rule-based approaches in terms of makespan minimization. The results highlight the strength of combining interpretable Petri-net-based models with adaptive reinforcement learning policies, yielding a resilient, scalable, and explainable framework for real-time scheduling in dynamic and uncertain manufacturing environments.
- 中文摘要
我们提出了一个新框架,用于求解不确定性下的动态作业车间调度问题,以应对随机作业到达和意外机器故障带来的挑战。我们的方法采用基于模型的范式,使用着色时间Petri网表示调度环境,并采用可掩码的近端策略优化(Maskable Proximal Policy Optimization)实现动态决策,同时在每个决策点将智能体限制在可行动作之内。为了模拟真实的工业条件,动态作业到达采用伽马分布建模,该分布能够刻画突发、聚集和工作负载波动等复杂时间模式。机器故障采用威布尔分布建模,以表示与役龄相关的退化和磨损过程。这些随机模型使该框架能够更好地反映真实制造场景。此外,我们研究了两种动作掩码策略:一种是覆盖无效动作概率的非梯度方法,另一种是在策略网络内为无效动作分配负梯度的基于梯度的方法。我们在动态JSSP基准上进行了大量实验,证明我们的方法在最小化最大完工时间(makespan)方面始终优于传统的启发式和基于规则的方法。结果凸显了将可解释的基于Petri网的模型与自适应强化学习策略相结合的优势,为动态且不确定的制造环境中的实时调度提供了一个具有韧性、可扩展且可解释的框架。
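The non-gradient masking strategy mentioned above (overriding the probabilities of invalid actions) is commonly realized by setting the logits of invalid actions to negative infinity before the softmax. The sketch below shows that generic technique, not the authors' code; `masked_softmax` and the toy inputs are hypothetical.

```python
import math

def masked_softmax(logits, valid):
    """Softmax over logits where invalid actions receive probability exactly 0."""
    if not any(valid):
        raise ValueError("at least one action must be valid")
    masked = [l if ok else float("-inf") for l, ok in zip(logits, valid)]
    m = max(masked)                        # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Three candidate operations; the second is infeasible (e.g. its machine is down).
probs = masked_softmax([1.0, 2.0, 3.0], [True, False, True])
print(probs)
```

Because the masked action has zero probability, the agent can never sample it, and the remaining probability mass is renormalized over the feasible actions at that decision point.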
Monte-Carlo Tree Search with Neural Network Guidance for Lane-Free Autonomous Driving
蒙特卡洛树搜索与神经网络指导,实现无车道自动驾驶
- Authors: Ioannis Peridis, Dimitrios Troullinos, Georgios Chalkiadakis, Pantelis Giankoulidis, Ioannis Papamichail, Markos Papageorgiou
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2601.09353
- Pdf link: https://arxiv.org/pdf/2601.09353
- Abstract
Lane-free traffic environments allow vehicles to better harness the lateral capacity of the road without being restricted to lane-keeping, thereby increasing the traffic flow rates. As such, we have a distinct and more challenging setting for autonomous driving. In this work, we consider a Monte-Carlo Tree Search (MCTS) planning approach for single-agent autonomous driving in lane-free traffic, where the associated Markov Decision Process we formulate is influenced by existing approaches tied to reinforcement learning frameworks. In addition, MCTS is equipped with a pre-trained neural network (NN) that guides the selection phase. This procedure incorporates the predictive capabilities of NNs for a more informed tree search process under computational constraints. In our experimental evaluation, we consider metrics that address both safety (through collision rates) and efficacy (through measured speed). Then, we examine: (a) the influence of isotropic state information for vehicles in a lane-free environment, resulting in nudging behaviour--vehicles' policy reacts due to the presence of faster trailing ones, (b) the acceleration of performance for the NN-guided variant of MCTS, and (c) the trade-off between computational resources and solution quality.
- 中文摘要
无车道交通环境使车辆无需受限于车道保持,从而能够更好地利用道路的横向容量,提高交通流量。因此,这是一个独特且更具挑战性的自动驾驶场景。在本研究中,我们考虑了一种蒙特卡洛树搜索(MCTS)规划方法,用于无车道交通中的单智能体自动驾驶,其中我们所构建的马尔可夫决策过程借鉴了现有与强化学习框架相关的方法。此外,MCTS配备了预训练神经网络(NN),用于指导选择阶段。该过程结合了神经网络的预测能力,在计算约束下实现信息更充分的树搜索。在实验评估中,我们考虑了同时衡量安全性(通过碰撞率)和效能(通过实测速度)的指标。随后,我们考察:(a) 无车道环境中各向同性状态信息对车辆的影响,它会引发“轻推”(nudging)行为,即车辆策略会因后方存在更快的车辆而做出反应;(b) MCTS的NN引导变体带来的性能加速;(c) 计算资源与解的质量之间的权衡。
GeoRA: Geometry-Aware Low-Rank Adaptation for RLVR
GeoRA:面向 RLVR 的几何感知低秩适配
- Authors: Jiaying Zhang, Lei Shi, Jiguo Li, Jun Xu, Jiuchong Gao, Jinghua Hao, Renqing He
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2601.09361
- Pdf link: https://arxiv.org/pdf/2601.09361
- Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) is crucial for advancing large-scale reasoning models. However, existing parameter-efficient methods, such as PiSSA and MiLoRA, are designed for Supervised Fine-Tuning (SFT) and do not account for the distinct optimization dynamics and geometric structures of RLVR. Applying these methods directly leads to spectral collapse and optimization instability, which severely limit model performance. Meanwhile, alternative approaches that leverage update sparsity encounter significant efficiency bottlenecks on modern hardware due to unstructured computations. To address these challenges, we propose GeoRA (Geometry-Aware Low-Rank Adaptation), which exploits the anisotropic and compressible nature of RL update subspaces. GeoRA initializes adapters by extracting principal directions via Singular Value Decomposition (SVD) within a geometrically constrained subspace while freezing the residual components. This method preserves the pre-trained geometric structure and enables efficient GPU computation through dense operators. Experiments on Qwen and Llama demonstrate that GeoRA mitigates optimization bottlenecks caused by geometric misalignment. It consistently outperforms established low-rank baselines on key mathematical benchmarks, achieving state-of-the-art (SOTA) results. Moreover, GeoRA shows superior generalization and resilience to catastrophic forgetting in out-of-domain tasks.
- 中文摘要
带可验证奖励的强化学习(RLVR)对于推进大规模推理模型至关重要。然而,现有的参数高效方法(如PiSSA和MiLoRA)是为监督式微调(SFT)设计的,未考虑RLVR独特的优化动力学和几何结构。直接应用这些方法会导致谱坍缩和优化不稳定,严重限制模型性能。与此同时,利用更新稀疏性的替代方法由于计算是非结构化的,在现代硬件上面临显著的效率瓶颈。为应对这些挑战,我们提出了GeoRA(几何感知低秩适配),它利用了强化学习更新子空间的各向异性和可压缩性。GeoRA在几何约束子空间内通过奇异值分解(SVD)提取主方向来初始化适配器,同时冻结残差分量。该方法保留了预训练的几何结构,并通过稠密算子实现高效的GPU计算。在Qwen和Llama上的实验表明,GeoRA能够缓解几何错位带来的优化瓶颈。它在关键数学基准上始终优于已有的低秩基线,取得了最先进的(SOTA)结果。此外,GeoRA在域外任务中表现出更好的泛化能力和对灾难性遗忘的抵抗力。
Semi-Contention-Free Access in IoT NOMA Networks: A Reinforcement Learning Framework
物联网NOMA网络中的半无竞争接入:强化学习框架
- Authors: Abhishek Kumar, José-Ramón Vidal, Jorge Martinez-Bauset, Frank Y. Li
- Subjects: Systems and Control (eess.SY)
- Arxiv link: https://arxiv.org/abs/2601.09422
- Pdf link: https://arxiv.org/pdf/2601.09422
- Abstract
The unprecedented surge of massive Internet of things (mIoT) traffic in beyond fifth generation (B5G) communication systems calls for transformative approaches for multiple access and data transmission. While classical model-based tools have been proven to be powerful and precise, an imminent trend for resource management in B5G networks is promoting solutions towards data-driven design. Considering an IoT network with devices spread in clusters covered by a base station, we present in this paper a novel model-free multiple access and data transmission framework empowered by reinforcement learning, designed for power-domain non-orthogonal multiple access networks to facilitate uplink traffic of small data packets. The framework supports two access modes referred to as contention-based and semi-contention-free, with its core component being a policy gradient algorithm executed at the base station. The base station performs access control and optimal radio resource allocation by periodically broadcasting two control parameters to each cluster of devices that considerably reduce data detection failures with a minimum computation requirement on devices. Numerical results, in terms of system and cluster throughput, throughput fairness, access delay, and energy consumption, demonstrate the efficiency and scalability of the framework as network size and traffic load vary.
- 中文摘要
在超第五代(B5G)通信系统中,大规模物联网(mIoT)流量空前激增,这要求在多址接入和数据传输方面采用变革性方法。虽然经典的基于模型的工具已被证明强大且精确,但B5G网络资源管理的一个迫近趋势是推动解决方案走向数据驱动设计。考虑一个设备分布在由基站覆盖的多个簇中的物联网网络,本文提出了一种由强化学习赋能的新型无模型多址接入和数据传输框架,该框架专为功率域非正交多址接入网络设计,以支持小数据包的上行流量。该框架支持两种接入模式,即基于竞争的模式和半无竞争模式,其核心组件是在基站执行的策略梯度算法。基站通过定期向每个设备簇广播两个控制参数来执行接入控制和最优无线资源分配,这能大幅减少数据检测失败,同时对设备的计算需求极低。以系统和簇吞吐量、吞吐量公平性、接入时延和能耗衡量的数值结果表明,该框架在网络规模和流量负载变化时具有高效率和可扩展性。
Draw it like Euclid: Teaching transformer models to generate CAD profiles using ruler and compass construction steps
像欧几里得那样作图:教Transformer模型用尺规作图步骤生成CAD轮廓
- Authors: Siyi Li, Joseph G. Lambourne, Longfei Zhang, Pradeep Kumar Jayaraman, Karl. D.D. Willis
- Subjects: Machine Learning (cs.LG); Graphics (cs.GR)
- Arxiv link: https://arxiv.org/abs/2601.09428
- Pdf link: https://arxiv.org/pdf/2601.09428
- Abstract
We introduce a new method of generating Computer Aided Design (CAD) profiles via a sequence of simple geometric constructions including curve offsetting, rotations and intersections. These sequences start with geometry provided by a designer and build up the points and curves of the final profile step by step. We demonstrate that adding construction steps between the designer's input geometry and the final profile improves generation quality in a similar way to the introduction of a chain of thought in language models. Similar to the constraints in a parametric CAD model, the construction sequences reduce the degrees of freedom in the modeled shape to a small set of parameter values which can be adjusted by the designer, allowing parametric editing with the constructed geometry evaluated to floating point precision. In addition we show that applying reinforcement learning to the construction sequences gives further improvements over a wide range of metrics, including some which were not explicitly optimized.
- 中文摘要
我们引入了一种通过一系列简单的几何构造(包括曲线偏移、旋转和求交)生成计算机辅助设计(CAD)轮廓的新方法。这些序列从设计师提供的几何形状开始,逐步构建最终轮廓的点和曲线。我们证明,在设计师的输入几何与最终轮廓之间添加构造步骤,可以提升生成质量,类似于在语言模型中引入思维链。类似于参数化CAD模型中的约束,构造序列将建模形状的自由度减少为一小组可由设计师调整的参数值,从而允许参数化编辑,并以浮点精度计算所构造的几何体。此外,我们展示了将强化学习应用于构造序列后,在一系列广泛的指标上都取得了进一步的改进,其中包括一些未被显式优化的指标。
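To make "ruler and compass construction steps" concrete, the sketch below implements two of the primitive operations the abstract names, rotation and intersection, and chains them in the classic equilateral-triangle construction. The function names and the tuple-based point type are assumptions for illustration; the paper's actual construction vocabulary and data format are not given in the abstract.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def rotate(p: Point, center: Point, angle_rad: float) -> Point:
    """Rotate point p about center by angle_rad (counter-clockwise)."""
    dx, dy = p[0] - center[0], p[1] - center[1]
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return (center[0] + c * dx - s * dy, center[1] + s * dx + c * dy)


def intersect_circles(c0: Point, r0: float,
                      c1: Point, r1: float) -> List[Point]:
    """Intersection points of two circles (empty list if they do not meet).

    Tangent circles return the same point twice.
    """
    d = math.dist(c0, c1)
    if d == 0 or d > r0 + r1 or d < abs(r0 - r1):
        return []
    # Distance from c0 to the chord joining the two intersections.
    a = (r0 * r0 - r1 * r1 + d * d) / (2 * d)
    # Half-length of that chord.
    h = math.sqrt(max(r0 * r0 - a * a, 0.0))
    ux, uy = (c1[0] - c0[0]) / d, (c1[1] - c0[1]) / d
    mx, my = c0[0] + a * ux, c0[1] + a * uy
    return [(mx + h * uy, my - h * ux), (mx - h * uy, my + h * ux)]


# Designer input: a segment AB. Construction: two compass arcs of
# radius |AB| give the apex of an equilateral triangle on AB.
A, B = (0.0, 0.0), (1.0, 0.0)
radius = math.dist(A, B)
apex = max(intersect_circles(A, radius, B, radius), key=lambda p: p[1])
print(apex)
# The same apex is reached by rotating B about A by 60 degrees.
print(rotate(B, A, math.pi / 3))
```

Because every later point is computed from the input segment, editing `A` or `B` re-evaluates the whole shape, which mirrors the parametric-editing property the abstract describes.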
Dialogue Telemetry: Turn-Level Instrumentation for Autonomous Information Gathering
对话遥测:面向自主信息收集的轮次级测量工具
- Authors: Dimitris Panagopoulos, Adolfo Perrusquia, Weisi Guo
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2601.09570
- Pdf link: https://arxiv.org/pdf/2601.09570
- Abstract
Autonomous systems conducting schema-grounded information-gathering dialogues face an instrumentation gap, lacking turn-level observables for monitoring acquisition efficiency and detecting when questioning becomes unproductive. We introduce Dialogue Telemetry (DT), a measurement framework that produces two model-agnostic signals after each question-answer exchange: (i) a Progress Estimator (PE) quantifying residual information potential per category (with a bits-based variant), and (ii) a Stalling Index (SI) detecting an observable failure signature characterized by repeated category probing with semantically similar, low-marginal-gain responses. SI flags this pattern without requiring causal diagnosis, supporting monitoring in settings where attributing degradation to specific causes may be impractical. We validate DT in controlled search-and-rescue (SAR)-inspired interviews using large language model (LLM)-based simulations, distinguishing efficient from stalled dialogue traces and illustrating downstream utility by integrating DT signals into a reinforcement learning (RL) policy. Across these settings, DT provides interpretable turn-level instrumentation that improves policy performance when stalling carries operational costs.
- 中文摘要
进行基于模式(schema)的信息收集对话的自主系统面临测量手段上的缺口,缺乏用于监控信息获取效率和检测提问何时变得无效的轮次级可观测量。我们引入了对话遥测(DT),这是一种测量框架,在每次问答交换后产生两种与模型无关的信号:(i)进展估计器(PE),量化每个类别的剩余信息潜力(含基于比特的变体),以及(ii)停滞指数(SI),用于检测一种可观测的失效特征,其表现为对同一类别的重复探询,且得到的回答语义相似、边际增益低。SI在无需因果诊断的情况下标记该模式,支持在难以将性能退化归因于特定原因的环境中进行监测。我们在受搜救(SAR)启发的受控访谈中,利用基于大型语言模型(LLM)的模拟验证了DT,区分高效与停滞的对话轨迹,并通过将DT信号整合进强化学习(RL)策略来展示其下游效用。在这些设置下,DT提供了可解释的轮次级测量手段,在停滞带来运营成本时可提升策略性能。
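The abstract names the two signals but not their formulas. The sketch below shows one plausible reading: a bits-based progress estimate as the Shannon entropy of a belief over a category's candidate values, and a stalling index as the fraction of consecutive turns that probe the same category, get a similar answer, and add little information. The function names, the word-overlap similarity (a stand-in for semantic similarity), and both thresholds are assumptions, not the paper's definitions.

```python
import math
from typing import Dict, List, Tuple


def residual_bits(belief: Dict[str, float]) -> float:
    """Shannon entropy (bits) of a belief over one category's values."""
    total = sum(belief.values())
    if total <= 0:
        return 0.0
    return -sum((p / total) * math.log2(p / total)
                for p in belief.values() if p > 0)


def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap of word sets; a crude proxy for semantic similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def stalling_index(turns: List[Tuple[str, str, float]],
                   sim_min: float = 0.6,
                   gain_max: float = 0.05) -> float:
    """Fraction of consecutive turn pairs that look stalled.

    Each turn is (category probed, answer text, information gain in bits).
    """
    if len(turns) < 2:
        return 0.0
    stalled = 0
    for (c0, a0, _), (c1, a1, g1) in zip(turns, turns[1:]):
        if c1 == c0 and word_overlap(a0, a1) >= sim_min and g1 <= gain_max:
            stalled += 1
    return stalled / (len(turns) - 1)


trace = [
    ("location", "near the river", 1.0),
    ("location", "near the river bank", 0.02),
    ("location", "near the river bank", 0.0),
    ("injuries", "two people hurt", 0.9),
]
print(residual_bits({"river": 0.5, "forest": 0.5}))
print(stalling_index(trace))
```

Neither function inspects the dialogue model itself, which is one way to read the abstract's claim that the signals are model-agnostic and need no causal diagnosis.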
DPWriter: Reinforcement Learning with Diverse Planning Branching for Creative Writing
DPWriter:面向创意写作的多样化规划分支强化学习
- Authors: Qian Cao, Yahui Liu, Wei Bi, Yi Zhao, Ruihua Song, Xiting Wang, Ruiming Tang, Guorui Zhou, Han Li
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2601.09609
- Pdf link: https://arxiv.org/pdf/2601.09609
- Abstract
Reinforcement learning (RL)-based enhancement of large language models (LLMs) often leads to reduced output diversity, undermining their utility in open-ended tasks like creative writing. Current methods lack explicit mechanisms for guiding diverse exploration and instead prioritize optimization efficiency and performance over diversity. This paper proposes an RL framework structured around a semi-structured long Chain-of-Thought (CoT), in which the generation process is decomposed into explicitly planned intermediate steps. We introduce a Diverse Planning Branching method that strategically introduces divergence at the planning phase based on diversity variation, alongside a group-aware diversity reward to encourage distinct trajectories. Experimental results on creative writing benchmarks demonstrate that our approach significantly improves output diversity without compromising generation quality, consistently outperforming existing baselines.
- 中文摘要
基于强化学习(RL)的大型语言模型(LLMs)增强常常导致输出多样性降低,削弱其在创意写作等开放式任务中的实用性。当前方法缺乏引导多样化探索的显式机制,而是将优化效率和性能置于多样性之上。本文提出了一个围绕半结构化长思维链(CoT)构建的强化学习框架,其中生成过程被分解为显式规划的中间步骤。我们引入了一种多样化规划分支方法,根据多样性变化在规划阶段有策略地引入分歧,并配合组感知的多样性奖励来鼓励彼此不同的轨迹。创意写作基准上的实验结果表明,我们的方法在不损害生成质量的前提下显著提升了输出多样性,并始终优于现有基线。
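The abstract does not define the group-aware diversity reward. One simple way to picture such a reward is to score each sample in a rollout group by how far it is, on average, from the other samples in the same group. The sketch below does this with Jaccard distance over word sets; the function names and the choice of distance are assumptions for illustration, not the DPWriter formulation.

```python
from typing import List, Set


def word_set(text: str) -> Set[str]:
    """Lower-cased set of whitespace-separated words."""
    return set(text.lower().split())


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """1 minus the Jaccard overlap; 0.0 for two empty sets."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def group_diversity_rewards(samples: List[str]) -> List[float]:
    """Mean distance of each sample to the rest of its group."""
    sets = [word_set(s) for s in samples]
    n = len(sets)
    if n < 2:
        return [0.0] * n
    return [
        sum(jaccard_distance(sets[i], sets[j]) for j in range(n) if j != i)
        / (n - 1)
        for i in range(n)
    ]


group = ["a storm at sea", "a storm at sea", "letters from the moon"]
print(group_diversity_rewards(group))
```

In this toy group the two duplicated plans share a lower reward than the distinct one, so a policy trained on such a signal is pushed away from collapsing onto a single trajectory. In practice this term would be combined with a quality reward, which the sketch leaves out.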
Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning
面向推理的协作式多智能体测试时强化学习
- Authors: Zhiyuan Hu, Yunhai Hu, Juncheng Liu, Shuyue Stella Li, Yucheng Wang, Zhen Xu, See-Kiong Ng, Anh Tuan Luu, Xinxing Xu, Bryan Hooi, Cynthia Breazeal, Hae Won Park
- Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2601.09667
- Pdf link: https://arxiv.org/pdf/2601.09667
- Abstract
Multi-agent systems have evolved into practical LLM-driven collaborators for many applications, gaining robustness from diversity and cross-checking. However, multi-agent RL (MARL) training is resource-intensive and unstable: co-adapting teammates induce non-stationarity, and rewards are often sparse and high-variance. Therefore, we introduce Multi-Agent Test-Time Reinforcement Learning (MATTRL), a framework that injects structured textual experience into multi-agent deliberation at inference time. MATTRL forms a multi-expert team of specialists for multi-turn discussions, retrieves and integrates test-time experiences, and reaches consensus for final decision-making. We also study credit assignment for constructing a turn-level experience pool, then reinjecting it into the dialogue. Across challenging benchmarks in medicine, math, and education, MATTRL improves accuracy by an average of 3.67% over a multi-agent baseline, and by 8.67% over comparable single-agent baselines. Ablation studies examine different credit-assignment schemes and provide a detailed comparison of how they affect training outcomes. MATTRL offers a stable, effective and efficient path to distribution-shift-robust multi-agent reasoning without tuning.
- 中文摘要
多智能体系统已发展成为许多应用中实用的大型语言模型驱动协作者,通过多样性和交叉验证获得了稳健性。然而,多智能体强化学习(MARL)训练资源密集且不稳定:共同适应的队友会导致非平稳性,奖励通常稀疏且方差大。因此,我们引入了多智能体测试时强化学习(MATTRL),这是一个在推理时将结构化文本经验注入多智能体审议过程的框架。MATTRL组建由多名专家组成的团队进行多轮讨论,检索并整合测试时经验,并达成共识以做出最终决策。我们还研究了用于构建轮次级经验池的信用分配,并将经验池重新注入对话中。在医学、数学和教育等具有挑战性的基准测试中,MATTRL的准确率平均比多智能体基线高3.67%,比可比的单智能体基线高8.67%。消融研究考察了不同的信用分配方案,并详细比较了它们如何影响训练结果。MATTRL提供了一条稳定、有效且高效的路径,无需调优即可实现对分布偏移稳健的多智能体推理。
STEP3-VL-10B Technical Report
STEP3-VL-10B 技术报告
- Authors: Ailin Huang, Chengyuan Yao, Chunrui Han, Fanqi Wan, Hangyu Guo, Haoran Lv, Hongyu Zhou, Jia Wang, Jian Zhou, Jianjian Sun, Jingcheng Hu, Kangheng Lin, Liang Zhao, Mitt Huang, Song Yuan, Wenwen Qu, Xiangfeng Wang, Yanlin Lai, Yingxiu Zhao, Yinmin Zhang, Yukang Shi, Yuyang Chen, Zejia Weng, Ziyang Meng, Ang Li, Aobo Kong, Bo Dong, Changyi Wan, David Wang, Di Qi, Dingming Li, En Yu, Guopeng Li, Haiquan Yin, Han Zhou, Hanshan Zhang, Haolong Yan, Hebin Zhou, Hongbo Peng, Jiaran Zhang, Jiashu Lv, Jiayi Fu, Jie Cheng, Jie Zhou, Jisheng Yin, Jingjing Xie, Jingwei Wu, Jun Zhang, Junfeng Liu, Kaijun Tan, Kaiwen Yan, Liangyu Chen, Lina Chen, Mingliang Li, Qian Zhao, Quan Sun, Shaoliang Pang, Shengjie Fan, Shijie Shang, Siyuan Zhang, Tianhao You, Wei Ji, Wuxun Xie, Xiaobo Yang, Xiaojie Hou, Xiaoran Jiao, Xiaoxiao Ren, Xiangwen Kong, Xin Huang, Xin Wu, Xing Chen, Xinran Wang, Xuelin Zhang, Yana Wei, Yang Li, Yanming Xu, Yeqing Shen, Yuang Peng, Yue Peng, Yu Zhou, Yusheng Li, Yuxiang Yang, Yuyang Zhang, Zhe Xie, Zhewei Huang, Zhenyi Lu, Zhimin Fan, Zihui Cheng, Daxin Jiang, Qi Han, Xiangyu Zhang, Yibo Zhu, Zheng Ge
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2601.09668
- Pdf link: https://arxiv.org/pdf/2601.09668
- Abstract
We present STEP3-VL-10B, a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. STEP3-VL-10B is realized through two strategic shifts: first, a unified, fully unfrozen pre-training strategy on 1.2T multimodal tokens that integrates a language-aligned Perception Encoder with a Qwen3-8B decoder to establish intrinsic vision-language synergy; and second, a scaled post-training pipeline featuring over 1k iterations of reinforcement learning. Crucially, we implement Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute, allocating resources to scalable perceptual reasoning that explores and synthesizes diverse visual hypotheses. Consequently, despite its compact 10B footprint, STEP3-VL-10B rivals or surpasses models 10×-20× larger (e.g., GLM-4.6V-106B, Qwen3-VL-235B) and top-tier proprietary flagships like Gemini 2.5 Pro and Seed-1.5-VL. Delivering best-in-class performance, it records 92.2% on MMBench and 80.11% on MMMU, while excelling in complex reasoning with 94.43% on AIME2025 and 75.95% on MathVision. We release the full model suite to provide the community with a powerful, efficient, and reproducible baseline.
- 中文摘要
我们介绍STEP3-VL-10B,一种轻量级开源基础模型,旨在重新定义紧凑效率与前沿级多模态智能之间的权衡。STEP3-VL-10B 通过两个战略转变实现:首先,采用统一、完全解冻的预训练策略,基于1.2T多模态token,将语言对齐的感知编码器(Perception Encoder)与Qwen3-8B解码器整合,建立内在的视觉-语言协同;其次,采用规模化的后训练流程,包含超过1000次强化学习迭代。关键的是,我们实现了并行协调推理(PaCoRe)以扩展测试时计算,将资源分配给可扩展的感知推理,探索并综合多样的视觉假设。因此,尽管参数规模仅为10B,STEP3-VL-10B仍能媲美或超越规模大10×-20×的模型(如GLM-4.6V-106B、Qwen3-VL-235B)以及Gemini 2.5 Pro和Seed-1.5-VL等顶级专有旗舰模型。它提供同级最佳的性能,在MMBench上取得92.2%,在MMMU上取得80.11%,同时在复杂推理方面表现出色,在AIME2025上取得94.43%,在MathVision上取得75.95%。我们发布完整的模型套件,为社区提供一个强大、高效且可复现的基线。
Keyword: diffusion policy
There is no result