Arxiv Papers of Today

Generated: 2025-12-18 16:32:07 (UTC+8); arXiv release time: 2025-12-18 20:00 EST (2025-12-19 09:00 UTC+8)

21 relevant papers today

Keyword: reinforcement learning

SEMO: A Socio-Evolutionary Adaptive Optimization Framework for Dynamic Social Network Tie Management

Authors: Mohammad Zare
Subjects: Social and Information Networks (cs.SI)
Arxiv link: https://arxiv.org/abs/2512.14703
Pdf link: https://arxiv.org/pdf/2512.14703
Abstract We propose a novel computational framework that models human social decision-making under uncertainty as an integrated Multi-Armed Bandit (MAB) and Markov Decision Process (MDP) optimization problem, in which agents adaptively balance the exploration of new social ties and the exploitation of existing relationships to maximize a socio-evolutionary fitness. The framework combines reinforcement learning, Bayesian belief updating, and agent-based simulation on a dynamic social graph, allowing each agent to use bandit-based Upper-Confidence-Bound (UCB) strategies for tie formation within an MDP of long-term social planning. We define a formal socio-evolutionary fitness function that captures both individual payoffs (e.g. shared information or support) and network-level benefits, and we derive update rules incorporating cognitive constraints and bounded rationality. Our Social-UCB algorithm, presented in full pseudocode, provably yields logarithmic regret and ensures stable exploitation via UCB-style bounds. In simulation experiments, Social-UCB consistently achieves higher cumulative social fitness and more efficient network connectivity than baseline heuristics. We include detailed descriptions of envisioned figures and tables (e.g. network evolution plots, model comparisons) to illustrate key phenomena. This integrated model bridges gaps in the literature by unifying exploration-exploitation dynamics, network evolution, and social learning, offering a rigorous new tool for studying adaptive human social behavior.
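
The bandit side of the abstract above can be illustrated with textbook UCB1. This is a minimal sketch, not the paper's Social-UCB: the function names, the Bernoulli payoff model, and the exploration constant c are assumptions made only for illustration.

```python
import math
import random


def ucb1_select(counts, means, t, c=2.0):
    """Return the index of the candidate tie with the highest UCB1 score."""
    # Try every candidate once before trusting any estimate.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    scores = [m + math.sqrt(c * math.log(t) / n) for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


def simulate(true_payoffs, rounds=2000, seed=0):
    """Run UCB1 against Bernoulli payoffs (hypothetical tie-quality values)."""
    rng = random.Random(seed)
    k = len(true_payoffs)
    counts, means = [0] * k, [0.0] * k
    for t in range(1, rounds + 1):
        i = ucb1_select(counts, means, t)
        reward = 1.0 if rng.random() < true_payoffs[i] else 0.0
        counts[i] += 1
        means[i] += (reward - means[i]) / counts[i]  # incremental mean update
    return counts, means
```

Calling simulate([0.2, 0.5, 0.8]) should concentrate most pulls on the last candidate while still sampling the others occasionally, which is the exploration-exploitation balance the abstract refers to.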

A Bayesian latent class reinforcement learning framework to capture adaptive, feedback-driven travel behaviour

Authors: Georges Sfeir, Stephane Hess, Thomas O. Hancock, Filipe Rodrigues, Jamal Amani Rad, Michiel Bliemer, Matthew Beck, Fayyaz Khan
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2512.14713
Pdf link: https://arxiv.org/pdf/2512.14713
Abstract Many travel decisions involve a degree of experience formation, where individuals learn their preferences over time. At the same time, there is extensive scope for heterogeneity across individual travellers, both in their underlying preferences and in how these evolve. The present paper puts forward a Latent Class Reinforcement Learning (LCRL) model that allows analysts to capture both of these phenomena. We apply the model to a driving simulator dataset and estimate the parameters through Variational Bayes. We identify three distinct classes of individuals that differ markedly in how they adapt their preferences: the first displays context-dependent preferences with context-specific exploitative tendencies; the second follows a persistent exploitative strategy regardless of context; and the third engages in an exploratory strategy combined with context-specific preferences.

Quantum Decision Transformers (QDT): Synergistic Entanglement and Interference for Offline Reinforcement Learning

Authors: Abraham Itzhak Weinberg
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.14726
Pdf link: https://arxiv.org/pdf/2512.14726
Abstract Offline reinforcement learning enables policy learning from pre-collected datasets without environment interaction, but existing Decision Transformer (DT) architectures struggle with long-horizon credit assignment and complex state-action dependencies. We introduce the Quantum Decision Transformer (QDT), a novel architecture incorporating quantum-inspired computational mechanisms to address these challenges. Our approach integrates two core components: Quantum-Inspired Attention with entanglement operations that capture non-local feature correlations, and Quantum Feedforward Networks with multi-path processing and learnable interference for adaptive computation. Through comprehensive experiments on continuous control tasks, we demonstrate over 2,000% performance improvement compared to standard DTs, with superior generalization across varying data qualities. Critically, our ablation studies reveal strong synergistic effects between quantum-inspired components: neither alone achieves competitive performance, yet their combination produces dramatic improvements far exceeding individual contributions. This synergy demonstrates that effective quantum-inspired architecture design requires holistic co-design of interdependent mechanisms rather than modular component adoption. Our analysis identifies three key computational advantages: enhanced credit assignment through non-local correlations, implicit ensemble behavior via parallel processing, and adaptive resource allocation through learnable interference. These findings establish quantum-inspired design principles as a promising direction for advancing transformer architectures in sequential decision-making, with implications extending beyond reinforcement learning to neural architecture design more broadly.

Entropy-Reservoir Bregman Projection: An Information-Geometric Unification of Model Collapse

Authors: Jingwei Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.14879
Pdf link: https://arxiv.org/pdf/2512.14879
Abstract Self-referential learning -- training a model on data it generated itself -- promises boundless scalability but chronically suffers from model collapse: language models degenerate into repetitive text, GANs drop modes, and reinforcement-learning policies over-exploit. Although practitioners employ ad hoc fixes such as real-data mixing, entropy bonuses, knowledge distillation, or retrieval-augmented generation, a single principle that explains both the failure mode and the success of these fixes has remained elusive. We present Entropy-Reservoir Bregman Projection (ERBP), an information-geometric framework that unifies these phenomena. We model the closed loop as a stochastic Bregman projection sequence in distribution space. Without external coupling, finite-sample noise forces the system to project onto an ever-shrinking empirical support, causing exponential entropy decay and eventual collapse. Introducing an Entropy Reservoir -- a high-entropy distribution mixed into each projection -- injects a controllable entropy flux that provably stabilises the dynamics. Our theory yields (i) a necessary condition for collapse, (ii) a sufficient condition that guarantees a non-trivial entropy floor, and (iii) closed-form rates that depend only on sample size and the strong-convexity/Lipschitz constants of the Bregman generator. Experiments on large-language-model self-training, Soft Actor-Critic in reinforcement learning, and GAN optimisation validate our predictions and show that disparate stabilisation heuristics correspond to specific reservoir choices and coupling coefficients. ERBP thus transforms a collection of folk remedies into a single, quantitative design rule: monitor and budget your entropy flux.
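
The collapse-and-reservoir mechanism described in the abstract above can be seen in a toy categorical model. The following is a rough sketch under stated assumptions, not the ERBP construction itself: the model is refit by plain maximum likelihood on its own samples, the reservoir is taken to be the uniform distribution, and lam is a hypothetical mixing weight.

```python
import math
import random


def entropy(p):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(x * math.log(x) for x in p if x > 0)


def self_train(k=50, n=20, generations=500, lam=0.0, seed=0):
    """Repeatedly refit a categorical model on its own samples.

    lam is the (assumed) weight of a uniform reservoir mixed into each refit.
    Returns the entropy of the final model.
    """
    rng = random.Random(seed)
    p = [1.0 / k] * k
    for _ in range(generations):
        draws = rng.choices(range(k), weights=p, k=n)
        q = [0.0] * k
        for d in draws:
            q[d] += 1.0 / n  # empirical distribution of the finite sample
        p = [(1 - lam) * qi + lam / k for qi in q]
    return entropy(p)
```

With lam = 0 the empirical support can only shrink, so the entropy drifts toward zero. With lam > 0 every category keeps probability at least lam / k, which bounds the entropy away from zero in this toy setting.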

Puzzle Curriculum GRPO for Vision-Centric Reasoning

Authors: Ahmadreza Jeddi, Hakki Can Karaimer, Hue Nguyen, Zhongling Wang, Ke Zhao, Javad Rajabi, Ran Zhang, Raghav Goyal, Babak Taati, Radek Grzeszczuk
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.14944
Pdf link: https://arxiv.org/pdf/2512.14944
Abstract Recent reinforcement learning (RL) approaches like outcome-supervised GRPO have advanced chain-of-thought reasoning in Vision Language Models (VLMs), yet key issues linger: (i) reliance on costly and noisy hand-curated annotations or external verifiers; (ii) flat and sparse reward schemes in GRPO; and (iii) logical inconsistency between a chain's reasoning and its final answer. We present Puzzle Curriculum GRPO (PC-GRPO), a supervision-free recipe for RL with Verifiable Rewards (RLVR) that strengthens visual reasoning in VLMs without annotations or external verifiers. PC-GRPO replaces labels with three self-supervised puzzle environments: PatchFit, Rotation (with binary rewards) and Jigsaw (with graded partial credit mitigating reward sparsity). To counter flat rewards and vanishing group-relative advantages, we introduce a difficulty-aware curriculum that dynamically weights samples and peaks at medium difficulty. We further monitor Reasoning-Answer Consistency (RAC) during post-training: mirroring reports for vanilla GRPO in LLMs, RAC typically rises early then degrades; our curriculum delays this decline, and consistency-enforcing reward schemes further boost RAC. RAC correlates with downstream accuracy. Across diverse benchmarks and on Qwen-7B and Qwen-3B backbones, PC-GRPO improves reasoning quality, training stability, and end-task accuracy, offering a practical path to scalable, verifiable, and interpretable RL post-training for VLMs.
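
To make the "vanishing group-relative advantages" point in the abstract above concrete, here is a small sketch. The advantage function follows the usual GRPO recipe of standardizing rewards within a sampled group. The difficulty weight 4p(1-p) is a hypothetical stand-in chosen only because it peaks at medium difficulty; it is not claimed to be the paper's weighting.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: rewards standardized within one sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def difficulty_weight(rewards):
    """Hypothetical sample weight that peaks when about half the group succeeds."""
    p = mean(rewards)  # group success rate, assuming rewards lie in [0, 1]
    return 4.0 * p * (1.0 - p)
```

When every rollout in a group gets the same binary reward, all advantages are zero and the sample contributes no gradient. A weight of this shape also downweights such puzzles, which are either too easy or too hard.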

Adaptive Partitioning and Learning for Stochastic Control of Diffusion Processes

Authors: Hanqing Jin, Renyuan Xu, Yanzhao Yang
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Portfolio Management (q-fin.PM)
Arxiv link: https://arxiv.org/abs/2512.14991
Pdf link: https://arxiv.org/pdf/2512.14991
Abstract We study reinforcement learning for controlled diffusion processes with unbounded continuous state spaces, bounded continuous actions, and polynomially growing rewards: settings that arise naturally in finance, economics, and operations research. To overcome the challenges of continuous and high-dimensional domains, we introduce a model-based algorithm that adaptively partitions the joint state-action space. The algorithm maintains estimators of drift, volatility, and rewards within each partition, refining the discretization whenever estimation bias exceeds statistical confidence. This adaptive scheme balances exploration and approximation, enabling efficient learning in unbounded domains. Our analysis establishes regret bounds that depend on the problem horizon, state dimension, reward growth order, and a newly defined notion of zooming dimension tailored to unbounded diffusion processes. The bounds recover existing results for bounded settings as a special case, while extending theoretical guarantees to a broader class of diffusion-type problems. Finally, we validate the effectiveness of our approach through numerical experiments, including applications to high-dimensional problems such as multi-asset mean-variance portfolio selection.

Spectral Representation-based Reinforcement Learning

Authors: Chenxiao Gao, Haotian Sun, Na Li, Dale Schuurmans, Bo Dai
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.15036
Pdf link: https://arxiv.org/pdf/2512.15036
Abstract In real-world applications with large state and action spaces, reinforcement learning (RL) typically employs function approximations to represent core components like the policies, value functions, and dynamics models. Although powerful approximations such as neural networks offer great expressiveness, they often present theoretical ambiguities, suffer from optimization instability and exploration difficulty, and incur substantial computational costs in practice. In this paper, we introduce the perspective of spectral representations as a solution to address these difficulties in RL. Stemming from the spectral decomposition of the transition operator, this framework yields an effective abstraction of the system dynamics for subsequent policy optimization while also providing a clear theoretical characterization. We reveal how to construct spectral representations for transition operators that possess latent variable structures or energy-based structures, which implies different learning methods to extract spectral representations from data. Notably, each of these learning methods realizes an effective RL algorithm under this framework. We also provably extend this spectral view to partially observable MDPs. Finally, we validate these algorithms on over 20 challenging tasks from the DeepMind Control Suite, where they achieve performances comparable or superior to current state-of-the-art model-free and model-based baselines.
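
Why a spectral factorization of the transition operator helps policy optimization can be seen in one line of algebra. The sketch below assumes the common low-rank form, with a feature map \phi and a measure-valued map \mu. These symbols are introduced here for illustration and may differ from the paper's notation.

```latex
\begin{align*}
% Assumed factorization of the transition kernel
P(s' \mid s, a) &= \langle \phi(s, a), \mu(s') \rangle \\
% Substituting into the Bellman equation for a policy \pi
Q^{\pi}(s, a) &= r(s, a) + \gamma \int P(s' \mid s, a)\, V^{\pi}(s')\, ds' \\
              &= r(s, a) + \gamma \Big\langle \phi(s, a),\ \int \mu(s')\, V^{\pi}(s')\, ds' \Big\rangle
\end{align*}
```

Under this assumption the expected next-state value is linear in \phi(s, a) for every policy, so learning \phi from data reduces value estimation to fitting a weight vector.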

Beyond Fast and Slow: Cognitive-Inspired Elastic Reasoning for Large Language Models

Authors: Jinwu Hu, Dongjin Yang, Langyu Bian, Zhiquan Wen, Yufeng Wang, Yaofo Chen, Bin Xiao, Yuanqing Li, Mingkui Tan
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.15089
Pdf link: https://arxiv.org/pdf/2512.15089
Abstract Large language models (LLMs) have demonstrated impressive performance across various language tasks. However, existing LLM reasoning strategies mainly rely on the LLM itself with fast or slow mode (like o1 thinking) and thus struggle to balance reasoning efficiency and accuracy across queries of varying difficulties. In this paper, we propose Cognitive-Inspired Elastic Reasoning (CogER), a framework inspired by human hierarchical reasoning that dynamically selects the most suitable reasoning strategy for each query. Specifically, CogER first assesses the complexity of incoming queries and assigns them to one of several predefined levels, each corresponding to a tailored processing strategy, thereby addressing the challenge of unobservable query difficulty. To achieve automatic strategy selection, we model the process as a Markov Decision Process and train a CogER-Agent using reinforcement learning. The agent is guided by a reward function that balances solution quality and computational cost, ensuring resource-efficient reasoning. Moreover, for queries requiring external tools, we introduce Cognitive Tool-Assisted Reasoning, which enables the LLM to autonomously invoke external tools within its chain-of-thought. Extensive experiments demonstrate that CogER outperforms state-of-the-art Test-Time scaling methods, achieving at least a 13% relative improvement in average exact match on In-Domain tasks and an 8% relative gain on Out-of-Domain tasks.

Automatic Reward Shaping from Multi-Objective Human Heuristics

Authors: Yuqing Xie, Jiayu Chen, Wenhao Tang, Ya Zhang, Chao Yu, Yu Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.15120
Pdf link: https://arxiv.org/pdf/2512.15120
Abstract Designing effective reward functions remains a central challenge in reinforcement learning, especially in multi-objective environments. In this work, we propose Multi-Objective Reward Shaping with Exploration (MORSE), a general framework that automatically combines multiple human-designed heuristic rewards into a unified reward function. MORSE formulates the shaping process as a bi-level optimization problem: the inner loop trains a policy to maximize the current shaped reward, while the outer loop updates the reward function to optimize task performance. To encourage exploration in the reward space and avoid suboptimal local minima, MORSE introduces stochasticity into the shaping process, injecting noise guided by task performance and the prediction error of a fixed, randomly initialized neural network. Experimental results in MuJoCo and Isaac Sim environments show that MORSE effectively balances multiple objectives across various robotic tasks, achieving task performance comparable to those obtained with manually tuned reward functions.

Beyond Majority Voting: Towards Fine-grained and More Reliable Reward Signal for Test-Time Reinforcement Learning

Authors: Weiqin Wang, Yile Wang, Kehao Chen, Hui Huang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.15146
Pdf link: https://arxiv.org/pdf/2512.15146
Abstract Test-time reinforcement learning mitigates the reliance on annotated data by using majority voting results as pseudo-labels, emerging as a complementary direction to reinforcement learning with verifiable rewards (RLVR) for improving the reasoning ability of large language models (LLMs). However, this voting strategy often induces confirmation bias and suffers from sparse rewards, limiting the overall performance. In this work, we propose subgroup-specific step-wise confidence-weighted pseudo-label estimation (SCOPE), a framework integrating model confidence and dynamic subgroup partitioning to address these issues. Specifically, SCOPE integrates the proposed step-wise confidence into pseudo-label deduction, prioritizing high-quality reasoning paths over simple frequency count. Furthermore, it dynamically partitions the candidate output pool into independent subgroups by balancing reasoning quality against exploration diversity. By deriving local consensus via repeat sampling for each subgroup, SCOPE provides diverse supervision targets to encourage broader exploration. We conduct experiments across various models and benchmarks; the results show that SCOPE consistently outperforms recent baselines. Notably, SCOPE achieves relative improvements of 13.1% on the challenging AIME 2025 and 8.1% on AMC. The code is released at this https URL.
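
The contrast between frequency-based and confidence-weighted pseudo-labels in the abstract above can be sketched in a few lines. This is an illustration only: averaging the per-step confidences is an assumed aggregation rule, the data layout is hypothetical, and the subgroup partitioning step is omitted.

```python
from collections import Counter, defaultdict


def majority_vote(answers):
    """Plain frequency count: the most common final answer wins."""
    return Counter(answers).most_common(1)[0][0]


def confidence_weighted_vote(candidates):
    """candidates: list of (answer, [per-step confidences in (0, 1]]).

    Each reasoning path votes with the mean of its step confidences
    (an assumed aggregation, chosen only for illustration).
    """
    totals = defaultdict(float)
    for answer, step_conf in candidates:
        totals[answer] += sum(step_conf) / len(step_conf)
    return max(totals, key=totals.get)
```

For example, two low-confidence paths that agree on one answer win a majority vote, but a single high-confidence path can outweigh them under the weighted rule.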

EagleVision: A Dual-Stage Framework with BEV-grounding-based Chain-of-Thought for Spatial Intelligence

Authors: Jiaxu Wan, Xu Wang, Mengwei Xie, Hang Zhang, Mu Xu, Yang Han, Hong Zhang, Ding Yuan, Yifan Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.15160
Pdf link: https://arxiv.org/pdf/2512.15160
Abstract Recent spatial intelligence approaches typically attach 3D cues to 2D reasoning pipelines or couple MLLMs with black-box reconstruction modules, leading to weak spatial consistency, limited viewpoint diversity, and evidence chains that cannot be traced back to supporting views. Frameworks for "thinking with images" (e.g., ChatGPT-o3 and DeepEyes) show that stepwise multimodal reasoning can emerge by interleaving hypothesis formation with active acquisition of visual evidence, but they do not address three key challenges in spatial Chain-of-Thought (CoT): building global space perception under strict token budgets, explicitly associating 3D hypotheses with video frames for verification, and designing spatially grounded rewards for reinforcement learning. To address these issues, we present EagleVision, a dual-stage framework for progressive spatial cognition through macro perception and micro verification. In the macro perception stage, EagleVision employs a semantics-perspective-fusion determinantal point process (SPF-DPP) to select a compact set of geometry- and semantics-aware keyframes from long videos under a fixed token budget. In the micro verification stage, we formalize spatial CoT as BEV-grounded pose querying: the agent iteratively predicts poses on a BEV plane, retrieves the nearest real frames, and is trained purely by reinforcement learning with a spatial grounding reward that scores the consistency between predicted poses and observed views. On VSI-Bench, EagleVision achieves state-of-the-art performance among open-source vision-language models, demonstrating strong and generalizable spatial understanding.

Well Begun, Half Done: Reinforcement Learning with Prefix Optimization for LLM Reasoning

Authors: Yiliu Sun, Zicheng Zhao, Yang Wei, Yanfang Zhang, Chen Gong
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.15274
Pdf link: https://arxiv.org/pdf/2512.15274
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) significantly enhances the reasoning capability of Large Language Models (LLMs). Current RLVR approaches typically conduct training across all generated tokens, but neglect to explore which tokens (e.g., prefix tokens) actually contribute to reasoning. This uniform training strategy spends substantial effort on optimizing low-return tokens, which in turn impedes the potential improvement from high-return tokens and reduces overall training effectiveness. To address this issue, we propose a novel RLVR approach called Progressive Prefix-token Policy Optimization (PPPO), which highlights the significance of the prefix segment of generated outputs. Specifically, inspired by the well-established human thinking theory of Path Dependence, where early-stage thoughts substantially constrain subsequent thinking trajectory, we identify an analogous phenomenon in LLM reasoning termed Beginning Lock-in Effect (BLE). PPPO leverages this finding by focusing its optimization objective on the prefix reasoning process of LLMs. This targeted optimization strategy can positively influence subsequent reasoning processes, and ultimately improve final results. To improve the learning effectiveness of LLMs on how to start reasoning with high quality, PPPO introduces two training strategies: (a) Progressive Prefix Retention, which shapes a progressive learning process by increasing the proportion of retained prefix tokens during training; (b) Continuation Accumulated Reward, which mitigates reward bias by sampling multiple continuations for one prefix token sequence, and accumulating their scores as the reward signal. Extensive experimental results on various reasoning tasks demonstrate that our proposed PPPO outperforms representative RLVR methods, with the accuracy improvements of 18.02% on only 26.17% training tokens.
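
The two training strategies named in the abstract above can be sketched as follows. Both functions are hypothetical: the linear schedule, its percentage bounds, and the callback signatures are assumptions for illustration, since the abstract gives no such details.

```python
def retained_prefix_len(total_len, step, total_steps, start_pct=10, end_pct=50):
    """Hypothetical linear schedule: keep a growing share of prefix tokens."""
    pct = start_pct + (end_pct - start_pct) * step // max(1, total_steps - 1)
    return max(1, total_len * pct // 100)


def continuation_accumulated_reward(prefix, sample_continuation, score, k=4):
    """Sum the scores of k sampled continuations of one prefix.

    sample_continuation(prefix) returns a list of tokens and
    score(sequence) returns a float; both are caller-supplied stand-ins
    for a language model and a verifier.
    """
    return sum(score(prefix + sample_continuation(prefix)) for _ in range(k))
```

Scoring several continuations per prefix means the reward reflects how good the prefix is on average, rather than how one particular completion happened to turn out.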

Graph Contextual Reinforcement Learning for Efficient Directed Controller Synthesis

Authors: Toshihide Ubukata, Enhong Mu, Takuto Yamauchi, Mingyue Zhang, Jialong Li, Kenji Tei
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.15295
Pdf link: https://arxiv.org/pdf/2512.15295
Abstract Controller synthesis is a formal method approach for automatically generating Labeled Transition System (LTS) controllers that satisfy specified properties. The efficiency of the synthesis process, however, is critically dependent on exploration policies. These policies often rely on fixed rules or strategies learned through reinforcement learning (RL) that consider only a limited set of current features. To address this limitation, this paper introduces GCRL, an approach that enhances RL-based methods by integrating Graph Neural Networks (GNNs). GCRL encodes the history of LTS exploration into a graph structure, allowing it to capture a broader, non-current-based context. In a comparative experiment against state-of-the-art methods, GCRL exhibited superior learning efficiency and generalization across four out of five benchmark domains, except one particular domain characterized by high symmetry and strictly local interactions.

EUBRL: Epistemic Uncertainty Directed Bayesian Reinforcement Learning

Authors: Jianfei Ma, Wee Sun Lee
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.15405
Pdf link: https://arxiv.org/pdf/2512.15405
Abstract At the boundary between the known and the unknown, an agent inevitably confronts the dilemma of whether to explore or to exploit. Epistemic uncertainty reflects such boundaries, representing systematic uncertainty due to limited knowledge. In this paper, we propose a Bayesian reinforcement learning (RL) algorithm, $\texttt{EUBRL}$, which leverages epistemic guidance to achieve principled exploration. This guidance adaptively reduces per-step regret arising from estimation errors. We establish nearly minimax-optimal regret and sample complexity guarantees for a class of sufficiently expressive priors in infinite-horizon discounted MDPs. Empirically, we evaluate $\texttt{EUBRL}$ on tasks characterized by sparse rewards, long horizons, and stochasticity. Results demonstrate that $\texttt{EUBRL}$ achieves superior sample efficiency, scalability, and consistency.

Can AI Generate more Comprehensive Test Scenarios? Review on Automated Driving Systems Test Scenario Generation Methods

Authors: Ji Zhou (1), Yongqi Zhao (1), Yixian Hu, Hexuan Li, Zhengguo Gu (1), Nan Xu (2), Arno Eichberger (1) ((1) Institute of Automotive Engineering, Graz University of Technology, Graz, Austria, (2) National Key Laboratory of Automotive Chassis Integration and Bionics, Jilin university)
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2512.15422
Pdf link: https://arxiv.org/pdf/2512.15422
Abstract Ensuring the safety and reliability of Automated Driving Systems (ADS) remains a critical challenge, as traditional verification methods such as large-scale on-road testing are prohibitively costly and [...]. To address this, scenario-based testing has emerged as a scalable and efficient alternative, yet existing surveys provide only partial coverage of recent methodological and technological [...]. This review systematically analyzes 31 primary studies and 10 surveys identified through a comprehensive search spanning 2015-2025; however, the in-depth methodological synthesis and comparative evaluation focus primarily on recent frameworks (2023-2025), reflecting the surge of Artificial Intelligence (AI)-assisted and multimodal approaches in this [...]. [...] approaches rely on expert knowledge, ontologies, and naturalistic driving or accident data, while recent developments leverage generative models, including large language models, generative adversarial networks, diffusion models, and reinforcement learning frameworks, to synthesize diverse and safety-critical [...]. [...] synthesis identifies three persistent gaps: the absence of standardized evaluation metrics, limited integration of ethical and human factors, and insufficient coverage of multimodal and Operational Design Domain (ODD)-specific [...]. To address these challenges, this review contributes a refined taxonomy that incorporates multimodal extensions, an ethical and safety checklist for responsible scenario design, and an ODD coverage map with a scenario-difficulty schema to enable transparent [...]. [...], these contributions provide methodological clarity for researchers and practical guidance for industry, supporting reproducible evaluation and accelerating the safe deployment of higher-level ADS.
中文摘要 确保自动驾驶系统（ADS）的安全性和可靠性仍是一个关键挑战，因为大规模道路测试等传统验证方法成本过高且[...]。为此，基于场景的测试已成为一种可扩展且高效的替代方案，但现有综述对近期方法与技术[...]的覆盖并不全面。本综述系统分析了通过检索2015~2025年文献确定的31项主要研究和10篇综述；不过，深入的方法论综合与比较评估主要聚焦于最新框架（2023~2025），反映出人工智能（AI）辅助和多模态方法在该[...]的兴起。[...]方法依赖专家知识、本体以及自然驾驶或事故数据，而近期工作则利用生成模型（包括大型语言模型、生成对抗网络、扩散模型和强化学习框架）来合成多样且安全关键的[...]。[...]综合分析指出了三个长期存在的空白：缺乏标准化评估指标、对伦理与人因的整合有限，以及对多模态和特定运行设计域（ODD）[...]的覆盖不足。为应对这些挑战，本综述提出了包含多模态扩展的精炼分类法、面向负责任场景设计的伦理与安全检查清单，以及带有场景难度分级的ODD覆盖图，以实现透明的[...]。[...]，这些贡献为研究人员提供了清晰的方法论，为业界提供了实用指导，有助于可复现的评估并加速高等级ADS的安全部署。

FM-EAC: Feature Model-based Enhanced Actor-Critic for Multi-Task Control in Dynamic Environments

FM-EAC：基于特征模型的增强型演员-评论家方法，用于动态环境中的多任务控制

Authors: Quanxi Zhou, Wencan Mao, Manabu Tsukada, John C.S. Lui, Yusheng Ji
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.15430
Pdf link: https://arxiv.org/pdf/2512.15430
Abstract Model-based reinforcement learning (MBRL) and model-free reinforcement learning (MFRL) evolve along distinct paths but converge in the design of Dyna-Q [1]. However, modern RL methods still struggle with effective transferability across tasks and scenarios. Motivated by this limitation, we propose a generalized algorithm, Feature Model-Based Enhanced Actor-Critic (FM-EAC), that integrates planning, acting, and learning for multi-task control in dynamic environments. FM-EAC combines the strengths of MBRL and MFRL and improves generalizability through the use of novel feature-based models and an enhanced actor-critic framework. Simulations in both urban and agricultural applications demonstrate that FM-EAC consistently outperforms many state-of-the-art MBRL and MFRL methods. More importantly, different sub-networks can be customized within FM-EAC according to user-specific requirements.
中文摘要 基于模型的强化学习（MBRL）和无模型强化学习（MFRL）沿着不同路径发展，但在Dyna-Q的设计中汇合[1]。然而，现代强化学习方法在跨任务和跨场景的有效迁移方面仍存在困难。针对这一局限，我们提出了一种通用算法，即基于特征模型的增强演员-评论家（FM-EAC），该算法整合了规划、行动和学习，用于动态环境中的多任务控制。FM-EAC结合了MBRL和MFRL的优势，并通过采用新颖的基于特征的模型和增强的演员-评论家框架提升了泛化能力。在城市和农业应用中的仿真表明，FM-EAC始终优于许多最先进的MBRL和MFRL方法。更重要的是，可以根据用户的具体需求在FM-EAC中定制不同的子网络。

Double Horizon Model-Based Policy Optimization

双视界基于模型的策略优化

Authors: Akihiro Kubo, Paavo Parmas, Shin Ishii
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.15439
Pdf link: https://arxiv.org/pdf/2512.15439
Abstract Model-based reinforcement learning (MBRL) reduces the cost of real-environment sampling by generating synthetic trajectories (called rollouts) from a learned dynamics model. However, choosing the length of the rollouts poses two dilemmas: (1) Longer rollouts better preserve on-policy training but amplify model bias, indicating the need for an intermediate horizon to mitigate distribution shift (i.e., the gap between on-policy and past off-policy samples). (2) Moreover, a longer model rollout may reduce value estimation bias but raise the variance of policy gradients due to backpropagation through multiple steps, implying another intermediate horizon for stable gradient estimates. However, these two optimal horizons may differ. To resolve this conflict, we propose Double Horizon Model-Based Policy Optimization (DHMBPO), which divides the rollout procedure into a long "distribution rollout" (DR) and a short "training rollout" (TR). The DR generates on-policy state samples for mitigating distribution shift. In contrast, the short TR leverages differentiable transitions to offer accurate value gradient estimation with stable gradient updates, thereby requiring fewer updates and reducing overall runtime. We demonstrate that the double-horizon approach effectively balances distribution shift, model bias, and gradient instability, and surpasses existing MBRL methods on continuous-control benchmarks in terms of both sample efficiency and runtime.
中文摘要 基于模型的强化学习（MBRL）通过从学习到的动力学模型生成合成轨迹（称为推演，rollout）来降低真实环境采样的成本。然而，推演长度的选择存在两个难题：（1）更长的推演能更好地保持同策略（on-policy）训练，但会放大模型偏差，这表明需要一个中间视界来缓解分布偏移（即同策略样本与过去的异策略样本之间的差距）。（2）此外，更长的模型推演可能降低价值估计偏差，但由于需要经过多步反向传播，会增大策略梯度的方差，这意味着为获得稳定的梯度估计还需要另一个中间视界。然而，这两个最优视界可能并不相同。为解决这一冲突，我们提出了双视界基于模型的策略优化（DHMBPO），将推演过程分为较长的“分布推演”（DR）和较短的“训练推演”（TR）。DR生成同策略的状态样本以缓解分布偏移。相比之下，较短的TR利用可微的状态转移，提供准确的价值梯度估计和稳定的梯度更新，从而减少所需的更新次数并缩短整体运行时间。我们证明了双视界方法能有效平衡分布偏移、模型偏差和梯度不稳定性，并在连续控制基准上的样本效率和运行时间两方面均超越现有MBRL方法。
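The split between a long distribution rollout and a short training rollout described in this abstract can be pictured with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the scalar linear `model_step`, the quadratic `reward`, the fixed `critic`, the linear policy `action = -gain * state`, and the use of a central finite difference in place of backpropagation through differentiable transitions.

```python
import numpy as np

GAMMA = 0.99


def model_step(state, action):
    """Hypothetical learned dynamics model (a fixed linear map in this toy)."""
    return 0.9 * state + 0.5 * action


def reward(state, action):
    """Hypothetical quadratic cost on state and action, written as a reward."""
    return -(state ** 2 + 0.1 * action ** 2)


def critic(state):
    """Hypothetical fixed value estimate used to bootstrap at the end of the TR."""
    return -state ** 2


def distribution_rollout(gain, buffer_states, horizon):
    """Long rollout (DR): only collects the states visited under the current policy."""
    visited, state = [buffer_states], buffer_states
    for _ in range(horizon):
        state = model_step(state, -gain * state)
        visited.append(state)
    return np.concatenate(visited)


def training_objective(gain, start_states, horizon):
    """Short rollout (TR): discounted model rewards plus a bootstrapped terminal value."""
    state, total = start_states, np.zeros_like(start_states)
    for t in range(horizon):
        action = -gain * state
        total = total + GAMMA ** t * reward(state, action)
        state = model_step(state, action)
    return float(np.mean(total + GAMMA ** horizon * critic(state)))


def train(buffer_states, dr_horizon=10, tr_horizon=3, steps=30, lr=0.2, eps=1e-4):
    gain = 0.0
    for _ in range(steps):
        # DR states are treated as fixed inputs; no gradient flows through them.
        starts = distribution_rollout(gain, buffer_states, dr_horizon)
        # Central finite difference stands in for backpropagation through the TR.
        grad = (training_objective(gain + eps, starts, tr_horizon)
                - training_objective(gain - eps, starts, tr_horizon)) / (2 * eps)
        gain += lr * grad  # gradient ascent on the short-horizon objective
    return gain


rng = np.random.default_rng(0)
buffer_states = rng.normal(size=256)  # stand-in for past off-policy samples
learned_gain = train(buffer_states)
```

In this sketch the DR horizon only controls which states the policy is trained on, while the TR horizon controls how many model steps the gradient estimate passes through, so the two can be tuned separately.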

Autoregressive Language Models are Secretly Energy-Based Models: Insights into the Lookahead Capabilities of Next-Token Prediction

自回归语言模型暗中即是基于能量的模型：洞察下一词元预测的前瞻能力

Authors: Mathieu Blondel, Michael E. Sander, Germain Vivier-Ardisson, Tianlin Liu, Vincent Roulet
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2512.15605
Pdf link: https://arxiv.org/pdf/2512.15605
Abstract Autoregressive models (ARMs) currently constitute the dominant paradigm for large language models (LLMs). Energy-based models (EBMs) represent another class of models, which have historically been less prevalent in LLM development, yet naturally characterize the optimal policy in post-training alignment. In this paper, we provide a unified view of these two model classes. Taking the chain rule of probability as a starting point, we establish an explicit bijection between ARMs and EBMs in function space, which we show to correspond to a special case of the soft Bellman equation in maximum entropy reinforcement learning. Building upon this bijection, we derive the equivalence between supervised learning of ARMs and EBMs. Furthermore, we analyze the distillation of EBMs into ARMs by providing theoretical error bounds. Our results provide insights into the ability of ARMs to plan ahead, despite being based on the next-token prediction paradigm.
中文摘要 自回归模型（ARM）目前是大型语言模型（LLM）的主导范式。基于能量的模型（EBM）是另一类模型，历史上在LLM开发中较少使用，但能自然地刻画后训练对齐中的最优策略。本文为这两类模型提供了统一的视角。以概率的链式法则为起点，我们在函数空间中建立了ARM与EBM之间的显式双射，并证明它对应于最大熵强化学习中软贝尔曼方程的一个特例。基于这一双射，我们推导出ARM与EBM的监督学习之间的等价性。此外，我们通过给出理论误差界，分析了将EBM蒸馏为ARM的过程。我们的结果为ARM尽管基于下一词元预测范式却仍具备提前规划的能力提供了洞见。
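The link between sequence-level energies, the chain rule of probability, and a soft (log-sum-exp) Bellman recursion can be checked numerically on a tiny example. The sketch below is an illustration under assumptions, not the paper's construction: it uses a hypothetical table of random sequence scores over a 3-token vocabulary and fixed length 4, defines the soft value of a prefix as the log-sum-exp of scores over its completions, and reads next-token conditionals off differences of soft values.

```python
import itertools
import math

import numpy as np


def soft_values(score, vocab_size, length):
    """Map each prefix to the log-sum-exp of scores over all of its completions."""
    values = {}
    # Full sequences: the soft value is just the sequence score.
    for seq in itertools.product(range(vocab_size), repeat=length):
        values[seq] = score[seq]
    # Shorter prefixes, longest first: V(prefix) = logsumexp_a V(prefix + (a,)).
    for t in range(length - 1, -1, -1):
        for prefix in itertools.product(range(vocab_size), repeat=t):
            children = np.array([values[prefix + (a,)] for a in range(vocab_size)])
            m = children.max()
            values[prefix] = m + math.log(np.exp(children - m).sum())
    return values


def next_token_probs(values, prefix, vocab_size):
    """Autoregressive conditional p(a | prefix) = exp(V(prefix + (a,)) - V(prefix))."""
    return np.array(
        [math.exp(values[prefix + (a,)] - values[prefix]) for a in range(vocab_size)]
    )


rng = np.random.default_rng(0)
VOCAB, LENGTH = 3, 4
# Hypothetical sequence-level scores (negative energies), one per full sequence.
score = {seq: float(rng.normal()) for seq in itertools.product(range(VOCAB), repeat=LENGTH)}
values = soft_values(score, VOCAB, LENGTH)
log_z = values[()]  # soft value of the empty prefix equals the log-partition function

max_gap = 0.0
for seq in score:
    # Chain rule: sum the log-conditionals along the sequence.
    log_p_arm = sum(
        math.log(next_token_probs(values, seq[:t], VOCAB)[seq[t]]) for t in range(LENGTH)
    )
    log_p_ebm = score[seq] - log_z  # globally normalized energy-based probability
    max_gap = max(max_gap, abs(log_p_arm - log_p_ebm))
```

The product of conditionals telescopes to `exp(score - log_z)`, so `max_gap` is zero up to floating-point error: each next-token distribution already accounts for every possible continuation, which is the sense in which next-token prediction can encode lookahead.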

Stepwise Think-Critique: A Unified Framework for Robust and Interpretable LLM Reasoning

逐步思考批判：一个统一框架，用于稳健且可理解的大型语言模型推理

Authors: Jiaqi Xu, Cuiling Lan, Xuejin Chen, Yan LU
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.15662
Pdf link: https://arxiv.org/pdf/2512.15662
Abstract Human beings solve complex problems through critical thinking, where reasoning and evaluation are intertwined to converge toward correct solutions. However, most existing large language models (LLMs) decouple reasoning from verification: they either generate reasoning without explicit self-checking or rely on external verifiers to detect errors post hoc. The former lacks immediate feedback, while the latter increases system complexity and hinders synchronized learning. Motivated by human critical thinking, we propose Stepwise Think-Critique (STC), a unified framework that interleaves reasoning and self-critique at each step within a single model. STC is trained with a hybrid reinforcement learning objective combining reasoning rewards and critique-consistency rewards to jointly optimize reasoning quality and self-evaluation. Experiments on mathematical reasoning benchmarks show that STC demonstrates strong critical-thinking capabilities and produces more interpretable reasoning traces, representing a step toward LLMs with built-in critical thinking.
中文摘要 人类通过批判性思维解决复杂问题，推理与评估交织在一起，最终朝向正确的解决方案。然而，大多数现有大型语言模型（LLM）将推理与验证分离：它们要么生成推理而不显式自我检查，要么依赖外部验证器事后检测错误。前者缺乏即时反馈，而后者则增加了系统复杂度并阻碍同步学习。在人类批判性思维的激励下，我们提出了逐步思考批判（STC），这是一个统一框架，在单一模型的每个步骤中交织推理和自我批评。STC采用混合强化学习目标训练，结合推理奖励和批判一致性奖励，共同优化推理质量和自我评估。数学推理基准测试的实验显示，STC展现了强大的批判性思维能力，并产生了更具可解释性的推理痕迹，代表着向具备内置批判性思维的LLM迈出了一步。

Can LLMs Guide Their Own Exploration? Gradient-Guided Reinforcement Learning for LLM Reasoning

大型语言模型能引导自己的探索吗？LLM推理中的梯度引导强化学习

Authors: Zhenwen Liang, Sidi Lu, Wenhao Yu, Kishan Panaganti, Yujun Zhou, Haitao Mi, Dong Yu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.15687
Pdf link: https://arxiv.org/pdf/2512.15687
Abstract Reinforcement learning has become essential for strengthening the reasoning abilities of large language models, yet current exploration mechanisms remain fundamentally misaligned with how these models actually learn. Entropy bonuses and external semantic comparators encourage surface-level variation but offer no guarantee that sampled trajectories differ in the update directions that shape optimization. We propose G2RL, a gradient-guided reinforcement learning framework in which exploration is driven not by external heuristics but by the model's own first-order update geometry. For each response, G2RL constructs a sequence-level feature from the model's final-layer sensitivity, obtainable at negligible cost from a standard forward pass, and measures how each trajectory would reshape the policy by comparing these features within a sampled group. Trajectories that introduce novel gradient directions receive a bounded multiplicative reward scaler, while redundant or off-manifold updates are deemphasized, yielding a self-referential exploration signal that is naturally aligned with PPO-style stability and KL control. Across math and general reasoning benchmarks (MATH500, AMC, AIME24, AIME25, GPQA, MMLUpro) on Qwen3 base 1.7B and 4B models, G2RL consistently improves pass@1, maj@16, and pass@k over entropy-based GRPO and external embedding methods. Analyzing the induced geometry, we find that G2RL expands exploration into substantially more orthogonal and often opposing gradient directions while maintaining semantic coherence, revealing that a policy's own update space provides a far more faithful and effective basis for guiding exploration in large language model reinforcement learning.
中文摘要 强化学习已成为增强大型语言模型推理能力的关键手段，但当前的探索机制与这些模型的实际学习方式存在根本性的不一致。熵奖励和外部语义比较器鼓励的是表面层次的变化，却无法保证采样轨迹在决定优化的更新方向上有所不同。我们提出了G2RL，一种梯度引导的强化学习框架，其中探索不由外部启发式驱动，而由模型自身的一阶更新几何驱动。对于每个响应，G2RL根据模型最后一层的敏感度构建序列级特征（可从一次标准前向传播中以可忽略的成本获得），并通过在采样组内比较这些特征来衡量每条轨迹将如何重塑策略。引入新梯度方向的轨迹会获得一个有界的乘性奖励缩放因子，而冗余或偏离流形的更新则被弱化，从而产生一种与PPO式稳定性和KL控制自然契合的自指探索信号。在数学和通用推理基准（MATH500、AMC、AIME24、AIME25、GPQA、MMLUpro）上，基于Qwen3 base 1.7B和4B模型，G2RL在pass@1、maj@16和pass@k上持续优于基于熵的GRPO和外部嵌入方法。通过分析所诱导的几何结构，我们发现G2RL在保持语义连贯性的同时，将探索扩展到明显更加正交且常常方向相反的梯度方向，表明策略自身的更新空间为引导大型语言模型强化学习中的探索提供了更忠实、更有效的基础。
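The idea of comparing per-response features within a sampled group and applying a bounded multiplicative reward scaler can be sketched as follows. This is a hypothetical illustration, not G2RL's actual formula: the function name `novelty_scalers`, the cosine-based novelty score, the mean-centering, and the `strength` and `bound` parameters are all assumptions, and the feature vectors are made-up stand-ins for the final-layer sensitivity features (assumed nonzero).

```python
import numpy as np


def novelty_scalers(features, strength=0.5, bound=0.2):
    """Return one bounded multiplicative scaler per response in a sampled group.

    Hypothetical rule: novelty is one minus the mean cosine similarity to the
    other responses; scalers are centered on 1 and clipped to [1 - bound, 1 + bound].
    """
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    cosine = unit @ unit.T
    group_size = features.shape[0]
    # Mean similarity to the other group members (exclude the diagonal self-term).
    mean_similarity = (cosine.sum(axis=1) - np.diag(cosine)) / (group_size - 1)
    novelty = 1.0 - mean_similarity
    centered = novelty - novelty.mean()
    return np.clip(1.0 + strength * centered, 1.0 - bound, 1.0 + bound)


# Made-up group of four responses: three point in nearly the same direction,
# the fourth is orthogonal to the first two.
features = np.array(
    [
        [1.0, 0.0, 0.0],
        [1.0, 0.0, 0.0],
        [0.9, 0.1, 0.0],
        [0.0, 1.0, 0.0],
    ]
)
rewards = np.array([1.0, 1.0, 1.0, 1.0])
scalers = novelty_scalers(features)
shaped_rewards = rewards * scalers
```

In this toy group the orthogonal fourth response receives the largest scaler (clipped at the upper bound), while the three near-duplicates are scaled slightly below 1. The clipping keeps the shaped reward within a fixed factor of the original.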

Keyword: diffusion policy

ISS Policy: Scalable Diffusion Policy with Implicit Scene Supervision

ISS策略：带隐式场景监督的可扩展扩散策略

Authors: Wenlong Xia, Jinhao Zhang, Ce Zhang, Yaojia Wang, Youmin Gong, Jie Mei
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.15020
Pdf link: https://arxiv.org/pdf/2512.15020
Abstract Vision-based imitation learning has enabled impressive robotic manipulation skills, but its reliance on object appearance while ignoring the underlying 3D scene structure leads to low training efficiency and poor generalization. To address these challenges, we introduce Implicit Scene Supervision (ISS) Policy, a 3D visuomotor DiT-based diffusion policy that predicts sequences of continuous actions from point cloud observations. We extend DiT with a novel implicit scene supervision module that encourages the model to produce outputs consistent with the scene's geometric evolution, thereby improving the performance and robustness of the policy. Notably, ISS Policy achieves state-of-the-art performance on both single-arm manipulation tasks (MetaWorld) and dexterous hand manipulation (Adroit). In real-world experiments, it also demonstrates strong generalization and robustness. Additional ablation studies show that our method scales effectively with both data and parameters. Code and videos will be released.
中文摘要 基于视觉的模仿学习使机器人获得了令人印象深刻的操作技能，但其依赖物体外观而忽视底层三维场景结构，导致训练效率低且泛化能力差。为应对这些挑战，我们提出了隐式场景监督（ISS）策略，这是一种基于DiT的三维视觉运动扩散策略，能够根据点云观测预测连续动作序列。我们通过一个新颖的隐式场景监督模块扩展了DiT，鼓励模型产生与场景几何演变一致的输出，从而提升策略的性能和鲁棒性。值得注意的是，ISS Policy在单臂操作任务（MetaWorld）和灵巧手操作任务（Adroit）上均达到了最先进的性能。在真实世界实验中，它也展现出很强的泛化能力和鲁棒性。额外的消融研究表明，我们的方法能够随数据量和参数量有效扩展。代码和视频将会发布。