Generated: 2025-12-09 16:35:29 (UTC+8); arXiv release time: 2025-12-09 20:00 EST (2025-12-10 09:00 UTC+8)
65 relevant papers today
Keyword: reinforcement learning
Video Models Start to Solve Chess, Maze, Sudoku, Mental Rotation, and Raven' Matrices
视频模型开始求解国际象棋、迷宫、数独、心理旋转和瑞文矩阵
- Authors: Hokin Deng
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2512.05969
- Pdf link: https://arxiv.org/pdf/2512.05969
- Abstract
We show that video generation models can now reason. On tasks such as chess, maze, Sudoku, mental rotation, and Raven's Matrices, leading models such as Sora-2 achieve success rates of sixty percent. We establish a robust experimental paradigm centered on the "Task Pair" design. We build a code framework, with 39 models available already, that supports this paradigm and allows for easy scaling - users can add models and tasks efficiently. We show our automated evaluation strongly correlates with human judgment, and therefore this paradigm is highly scalable. We see an opportunity, given the availability of our paradigm, to do reinforcement learning for improving reasoning in video models. You can check out all of our raw results and our VMEvalKit codebase.
- 中文摘要
我们展示了视频生成模型现在可以推理。在国际象棋、迷宫、数独、心理旋转和瑞文矩阵等任务中,Sora-2等领先模型的成功率达到了60%。我们建立了以“任务对”设计为核心的稳健实验范式。我们构建了一个已有39个模型的代码框架,支持这一范式并允许轻松扩展:用户可以高效地添加模型和任务。我们证明了自动化评估与人类判断高度相关,因此该范式具有高度可扩展性。鉴于我们范式的可用性,我们看到了通过强化学习提升视频模型推理能力的机会。你可以查看我们所有的原始结果以及我们的 VMEvalKit 代码库。
FishDetector-R1: Unified MLLM-Based Framework with Reinforcement Fine-Tuning for Weakly Supervised Fish Detection, Segmentation, and Counting
FishDetector-R1:基于MLLM的统一框架,通过强化微调实现弱监督鱼类检测、分割与计数
- Authors: Yi Liu, Jingyu Song, Vedanth Kallakuri, Katherine A. Skinner
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Robotics (cs.RO); Image and Video Processing (eess.IV)
- Arxiv link: https://arxiv.org/abs/2512.05996
- Pdf link: https://arxiv.org/pdf/2512.05996
- Abstract
Analyzing underwater fish imagery is critical for ecological monitoring but remains difficult due to visual degradation and costly annotations. We introduce FishDetector-R1, a unified MLLM-based framework for fish detection, segmentation, and counting under weak supervision. On the DeepFish dataset, our framework achieves substantial gains over baselines, improving AP by 20% and mIoU by 10%, while reducing MAE by 30% and GAME by 35%. These improvements stem from two key components: a novel detect-to-count prompt that enforces spatially consistent detections and counts, and Reinforcement Learning from Verifiable Reward (RLVR) with a complementary scalable paradigm leveraging sparse point labels. Ablation studies further validate the effectiveness of this reward design. Moreover, the improvement generalizes well to other underwater datasets, confirming strong cross-domain robustness. Overall, FishDetector-R1 provides a reliable and scalable solution for accurate marine visual understanding via weak supervision. The project page for FishDetector-R1 is this https URL.
- 中文摘要
分析水下鱼类影像对于生态监测至关重要,但由于视觉退化和标注成本高昂,仍然困难重重。我们介绍了FishDetector-R1,一个基于MLLM的统一框架,用于在弱监督下进行鱼类检测、分割和计数。在DeepFish数据集上,我们的框架相较基线实现了显著提升,AP提升了20%,mIoU提升了10%,同时MAE降低了30%,GAME降低了35%。这些改进源于两个关键组成部分:一种新的“先检测后计数”(detect-to-count)提示,用于强制检测结果与计数在空间上保持一致;以及基于可验证奖励的强化学习(RLVR),并辅以利用稀疏点标签的可扩展范式。消融研究进一步验证了这种奖励设计的有效性。此外,这一改进也很好地推广到其他水下数据集,证实了强大的跨域鲁棒性。总体而言,FishDetector-R1 提供了一个可靠且可扩展的解决方案,通过弱监督实现准确的海洋视觉理解。FishDetector-R1 的项目页面是这个 https URL。
Reinforcement Learning Integrated Agentic RAG for Software Test Cases Authoring
强化学习集成代理RAG用于软件测试用例创作
- Authors: Mohanakrishnan Hariharan
- Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2512.06060
- Pdf link: https://arxiv.org/pdf/2512.06060
- Abstract
This paper introduces a framework that integrates reinforcement learning (RL) with autonomous agents to enable continuous improvement in the automated process of software test cases authoring from business requirement documents within Quality Engineering (QE) workflows. Conventional systems employing Large Language Models (LLMs) generate test cases from static knowledge bases, which fundamentally limits their capacity to enhance performance over time. Our proposed Reinforcement Infused Agentic RAG (Retrieve, Augment, Generate) framework overcomes this limitation by employing AI agents that learn from QE feedback, assessments, and defect discovery outcomes to automatically improve their test case generation strategies. The system combines specialized agents with a hybrid vector-graph knowledge base that stores and retrieves software testing knowledge. Through advanced RL algorithms, specifically Proximal Policy Optimization (PPO) and Deep Q-Networks (DQN), these agents optimize their behavior based on QE-reported test effectiveness, defect detection rates, and workflow metrics. As QEs execute AI-generated test cases and provide feedback, the system learns from this expert guidance to improve future iterations. Experimental validation on enterprise Apple projects yielded substantive improvements: a 2.4% increase in test generation accuracy (from 94.8% to 97.2%), and a 10.8% improvement in defect detection rates. The framework establishes a continuous knowledge refinement loop driven by QE expertise, resulting in progressively superior test case quality that enhances, rather than replaces, human testing capabilities.
- 中文摘要
本文介绍了一个框架,将强化学习(RL)与自主智能体集成,以实现质量工程(QE)工作流程中从业务需求文档自动编写软件测试用例这一过程的持续改进。采用大型语言模型(LLM)的传统系统从静态知识库生成测试用例,这从根本上限制了其随时间提升性能的能力。我们提出的强化学习注入的智能体式RAG(检索、增强、生成)框架让AI智能体从QE反馈、评估和缺陷发现结果中学习,自动改进测试用例生成策略,从而克服了这一局限。该系统将专用智能体与混合向量-图知识库结合起来,用于存储和检索软件测试知识。通过先进的强化学习算法,特别是近端策略优化(PPO)和深度Q网络(DQN),这些智能体基于QE报告的测试有效性、缺陷检测率和工作流指标优化自身行为。随着QE人员执行AI生成的测试用例并提供反馈,系统从这些专家指导中学习,以改进后续迭代。在Apple企业项目上的实验验证带来了实质性改进:测试生成准确率提升了2.4%(从94.8%提升到97.2%),缺陷检测率提升了10.8%。该框架建立了一个由QE专业知识驱动的持续知识精炼循环,使测试用例质量逐步提高,增强而非取代人类测试能力。
JaxWildfire: A GPU-Accelerated Wildfire Simulator for Reinforcement Learning
JaxWildfire:一款用于强化学习的GPU加速野火模拟器
- Authors: Ufuk Çakır, Victor-Alexandru Darvariu, Bruno Lacerda, Nick Hawes
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2512.06102
- Pdf link: https://arxiv.org/pdf/2512.06102
- Abstract
Artificial intelligence methods are increasingly being explored for managing wildfires and other natural hazards. In particular, reinforcement learning (RL) is a promising path towards improving outcomes in such uncertain decision-making scenarios and moving beyond reactive strategies. However, training RL agents requires many environment interactions, and the speed of existing wildfire simulators is a severely limiting factor. We introduce JaxWildfire, a simulator underpinned by a principled probabilistic fire spread model based on cellular automata. It is implemented in JAX and enables vectorized simulations using vmap, allowing high throughput of simulations on GPUs. We demonstrate that JaxWildfire achieves 6-35x speedup over existing software and enables gradient-based optimization of simulator parameters. Furthermore, we show that JaxWildfire can be used to train RL agents to learn wildfire suppression policies. Our work is an important step towards enabling the advancement of RL techniques for managing natural hazards.
- 中文摘要
人工智能方法正越来越多地被探索用于管理野火和其他自然灾害。特别是,强化学习(RL)是在此类不确定决策场景中改善结果、超越被动应对策略的一条有希望的路径。然而,训练强化学习智能体需要大量环境交互,而现有野火模拟器的速度是一个严重的限制因素。我们介绍 JaxWildfire,一个以基于元胞自动机的、有原则的概率火灾蔓延模型为基础的模拟器。它以 JAX 实现,并利用 vmap 实现向量化仿真,从而在 GPU 上实现高吞吐量的仿真。我们证明 JaxWildfire 相比现有软件实现了6-35倍的加速,并支持基于梯度的模拟器参数优化。此外,我们展示了 JaxWildfire 可用于训练强化学习智能体学习野火抑制策略。我们的工作是推动强化学习技术在自然灾害管理方面进步的重要一步。
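To make the idea of a probabilistic cellular-automaton fire spread model concrete, here is a minimal sketch in plain Python. It is not the JaxWildfire model or its API: the three cell states, the four-neighbour rule, the single `p_spread` parameter, and the function names are all assumptions chosen for illustration. In a JAX implementation, a step function of this kind would be written with array operations so that `vmap` can batch many simulations at once.

```python
import random

UNBURNT, BURNING, BURNT = 0, 1, 2  # hypothetical cell states


def step(grid, p_spread, rng):
    """Advance the fire by one step on a 2D grid of cell states."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != BURNING:
                continue
            new[r][c] = BURNT  # a burning cell burns out after one step
            # Each unburnt 4-neighbour ignites independently with p_spread.
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == UNBURNT:
                    if rng.random() < p_spread:
                        new[nr][nc] = BURNING
    return new


def simulate(size, ignition, p_spread, steps, seed=0):
    """Run the automaton from a single ignition point and return the grid."""
    rng = random.Random(seed)
    grid = [[UNBURNT] * size for _ in range(size)]
    grid[ignition[0]][ignition[1]] = BURNING
    for _ in range(steps):
        grid = step(grid, p_spread, rng)
    return grid
```

With `p_spread=1.0` the fire reaches every neighbour of a burning cell each step, and with `p_spread=0.0` only the ignition cell burns, which gives two easy sanity checks for the rule.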
Average-reward reinforcement learning in semi-Markov decision processes via relative value iteration
通过相对价值迭代进行半马尔可夫决策过程中的平均奖励强化学习
- Authors: Huizhen Yu, Yi Wan, Richard S. Sutton
- Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
- Arxiv link: https://arxiv.org/abs/2512.06218
- Pdf link: https://arxiv.org/pdf/2512.06218
- Abstract
This paper applies the authors' recent results on asynchronous stochastic approximation (SA) in the Borkar-Meyn framework to reinforcement learning in average-reward semi-Markov decision processes (SMDPs). We establish the convergence of an asynchronous SA analogue of Schweitzer's classical relative value iteration algorithm, RVI Q-learning, for finite-space, weakly communicating SMDPs. In particular, we show that the algorithm converges almost surely to a compact, connected subset of solutions to the average-reward optimality equation, with convergence to a unique, sample path-dependent solution under additional stepsize and asynchrony conditions. Moreover, to make full use of the SA framework, we introduce new monotonicity conditions for estimating the optimal reward rate in RVI Q-learning. These conditions substantially expand the previously considered algorithmic framework and are addressed through novel arguments in the stability and convergence analysis of RVI Q-learning.
- 中文摘要
本文将作者在Borkar-Meyn框架下异步随机近似(SA)的最新成果应用于平均奖励半马尔可夫决策过程(SMDPs)中的强化学习。我们建立了对施韦策经典相对值迭代算法 RVI Q-learning(有限空间弱通信 SMDP)异步 SA 类比的收敛性。特别地,我们证明该算法几乎必然收敛到平均-奖励最优方程解的紧致连通子集,并且在额外的步长和异步条件下收敛到唯一的、依赖样本路径的解。此外,为了充分利用SA框架,我们引入了新的单调性条件,用于估算RVI Q学习中的最佳奖励率。这些条件大幅扩展了之前考虑的算法框架,并通过RVI Q学习的稳定性与收敛分析中的新论证得到解决。
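As a rough illustration of the relative value iteration idea behind RVI Q-learning, the sketch below runs the update on a tiny deterministic MDP. It is a simplification, not the algorithm analyzed in the paper: it uses an ordinary MDP rather than an SMDP (so there are no holding times), synchronous sweeps over a known model rather than asynchronous sampled updates, and the reference function f(Q) is assumed to be the value of one fixed state-action pair. The two-state `MODEL` is a made-up example.

```python
def rvi_q_iteration(model, alpha=1.0, sweeps=20, ref=(0, 0)):
    """Synchronous RVI-style Q updates on a deterministic MDP.

    model[(s, a)] = (reward, next_state). Returns (Q, reward-rate estimate).
    Assumes every action is available in every state.
    """
    actions = sorted({a for _, a in model})
    q = {sa: 0.0 for sa in model}
    for _ in range(sweeps):
        f = q[ref]  # f(Q): value of a fixed reference pair, the reward-rate estimate
        new_q = {}
        for (s, a), (r, s_next) in model.items():
            best_next = max(q[(s_next, b)] for b in actions)
            # Subtracting f keeps Q bounded in the undiscounted setting.
            new_q[(s, a)] = q[(s, a)] + alpha * (r - f + best_next - q[(s, a)])
        q = new_q
    return q, q[ref]


# Hypothetical MDP: staying in state 1 earns 2 per step, staying in state 0
# earns 1, and moving between the states earns 0.
MODEL = {
    (0, 0): (1.0, 0),
    (0, 1): (0.0, 1),
    (1, 0): (2.0, 1),
    (1, 1): (0.0, 0),
}
```

On this example the best long-run behaviour is to move to state 1 and stay there, so the reward-rate estimate should settle at 2; working the sweeps by hand with `alpha=1.0` reaches that fixed point after four sweeps.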
AI Application in Anti-Money Laundering for Sustainable and Transparent Financial Systems
人工智能在反洗钱中的应用,实现可持续且透明的金融体系
- Authors: Chuanhao Nie, Yunbo Liu, Chao Wang
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2512.06240
- Pdf link: https://arxiv.org/pdf/2512.06240
- Abstract
Money laundering and financial fraud remain major threats to global financial stability, costing trillions annually and challenging regulatory oversight. This paper reviews how artificial intelligence (AI) applications can modernize Anti-Money Laundering (AML) workflows by improving detection accuracy, lowering false-positive rates, and reducing the operational burden of manual investigations, thereby supporting more sustainable development. It further highlights future research directions including federated learning for privacy-preserving collaboration, fairness-aware and interpretable AI, reinforcement learning for adaptive defenses, and human-in-the-loop visualization systems to ensure that next-generation AML architectures remain transparent, accountable, and robust. In the final part, the paper proposes an AI-driven KYC application that integrates graph-based retrieval-augmented generation (RAG Graph) with generative models to enhance efficiency, transparency, and decision support in KYC processes related to money-laundering detection. Experimental results show that the RAG-Graph architecture delivers high faithfulness and strong answer relevancy across diverse evaluation settings, thereby enhancing the efficiency and transparency of KYC CDD/EDD workflows and contributing to more sustainable, resource-optimized compliance practices.
- 中文摘要
洗钱和金融欺诈依然是全球金融稳定的主要威胁,每年造成数万亿美元的损失,并对监管监督构成挑战。本文回顾了人工智能(AI)应用如何通过提高检测准确率、降低误报率和减少人工调查的运营负担,实现反洗钱(AML)工作流程的现代化,从而支持更可持续的发展。本文进一步指出了未来的研究方向,包括用于隐私保护协作的联邦学习、公平感知且可解释的人工智能、用于自适应防御的强化学习,以及人在回路的可视化系统,以确保下一代反洗钱架构保持透明、可问责和稳健。在最后部分,论文提出了一种AI驱动的KYC应用,将基于图的检索增强生成(RAG Graph)与生成模型集成,以提升洗钱检测相关KYC流程的效率、透明度和决策支持。实验结果显示,RAG-Graph架构在多种评估设置中具有高忠实度和强答案相关性,从而提升了KYC CDD/EDD工作流程的效率和透明度,并促进了更可持续、资源优化的合规实践。
Auto-exploration for online reinforcement learning
在线强化学习的自动探索
- Authors: Caleb Ju, Guanghui Lan
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
- Arxiv link: https://arxiv.org/abs/2512.06244
- Pdf link: https://arxiv.org/pdf/2512.06244
- Abstract
The exploration-exploitation dilemma in reinforcement learning (RL) is a fundamental challenge to efficient RL algorithms. Existing algorithms for finite state and action discounted RL problems address this by assuming sufficient exploration over both state and action spaces. However, this yields non-implementable algorithms and sub-optimal performance. To resolve these limitations, we introduce a new class of methods with auto-exploration, or methods that automatically explore both state and action spaces in a parameter-free way, i.e., without a priori knowledge of problem-dependent parameters. We present two variants: one for the tabular setting and one for linear function approximation. Under algorithm-independent assumptions on the existence of an exploring optimal policy, both methods attain $O(\epsilon^{-2})$ sample complexity to solve to $\epsilon$ error. Crucially, these complexities are novel since they are void of algorithm-dependent parameters seen in prior works, which may be arbitrarily large. The methods are also simple to implement because they are parameter-free and do not directly estimate the unknown parameters. These feats are achieved by new algorithmic innovations for RL, including a dynamic mixing time, a discounted state distribution for sampling, a simple robust gradient estimator, and a recent advantage gap function to certify convergence.
- 中文摘要
强化学习(RL)中的探索与利用困境是高效强化学习算法面临的根本挑战。现有的有限状态和动作折现强化学习问题算法通过假设对状态空间和动作空间的充分探索来解决这个问题。然而,这会导致算法无法实现,性能也不理想。为解决这些限制,我们引入了一类新的具有自动探索功能的方法,即以无参数的方式自动探索状态空间和动作空间,无需先验了解问题相关的参数。我们提出了两种变体:一种用于表格型设定,另一种用于线性函数近似。在与算法无关的、关于存在具有探索性的最优策略的假设下,两种方法都达到$O(\epsilon^{-2})$的样本复杂度,以求解到$\epsilon$误差。关键是,这些复杂度结果是新的,因为它们不含以往研究中出现的算法依赖参数,而这些参数可能任意大。这些方法也易于实现,因为它们无参数,不直接估计未知参数。这些成果得益于强化学习的新算法创新,包括动态混合时间、用于采样的折现状态分布、简单的稳健梯度估计器,以及用于验证收敛的近期提出的优势差距函数。
Learning When to Switch: Adaptive Policy Selection via Reinforcement Learning
学习何时切换:通过强化学习实现自适应策略选择
- Authors: Chris Tava
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2512.06250
- Pdf link: https://arxiv.org/pdf/2512.06250
- Abstract
Autonomous agents often require multiple strategies to solve complex tasks, but determining when to switch between strategies remains challenging. This research introduces a reinforcement learning technique to learn switching thresholds between two orthogonal navigation policies. Using maze navigation as a case study, this work demonstrates how an agent can dynamically transition between systematic exploration (coverage) and goal-directed pathfinding (convergence) to improve task performance. Unlike fixed-threshold approaches, the agent uses Q-learning to adapt switching behavior based on coverage percentage and distance to goal, requiring only minimal domain knowledge: maze dimensions and target location. The agent does not require prior knowledge of wall positions, optimal threshold values, or hand-crafted heuristics; instead, it discovers effective switching strategies dynamically during each run. The agent discretizes its state space into coverage and distance buckets, then adapts which coverage threshold (20-60%) to apply based on observed progress signals. Experiments across 240 test configurations (4 maze sizes from 16×16 to 128×128 × 10 unique mazes × 6 agent variants) demonstrate that adaptive threshold learning outperforms both single-strategy agents and fixed 40% threshold baselines. Results show 23-55% improvements in completion time, 83% reduction in runtime variance, and 71% improvement in worst-case scenarios. The learned switching behavior generalizes within each size class to unseen wall configurations. Performance gains scale with problem complexity: 23% improvement for 16×16 mazes, 34% for 32×32, and 55% for 64×64, demonstrating that as the space of possible maze structures grows, the value of adaptive policy selection over fixed heuristics increases proportionally.
- 中文摘要
自主智能体通常需要多种策略来解决复杂任务,但决定何时切换策略仍然具有挑战性。本研究引入了一种强化学习技术,用于学习两种正交导航策略之间的切换阈值。以迷宫导航为案例,本研究展示了智能体如何在系统探索(覆盖)与目标导向路径寻找(收敛)之间动态转换,以提升任务表现。与固定阈值方法不同,智能体利用Q学习根据覆盖率和目标距离调整切换行为,只需极少的领域知识:迷宫尺寸和目标位置。该智能体无需事先了解墙体位置、最优阈值或手工设计的启发式;相反,它在每次运行中动态发现有效的切换策略。智能体将其状态空间离散化为覆盖率桶和距离桶,然后根据观察到的进展信号调整所采用的覆盖阈值(20-60%)。在240种测试配置(4种迷宫大小,从16×16到128×128,× 10个不同迷宫 × 6个智能体变体)中的实验表明,自适应阈值学习优于单一策略智能体和固定40%阈值基线。结果显示完成时间改善23-55%,运行时间方差减少83%,最坏情景改善71%。学习到的切换行为在每个尺寸类别内可以推广到未见过的墙体配置。性能提升随问题复杂度而增长:16×16迷宫提升23%,32×32提升34%,64×64提升55%,表明随着可能的迷宫结构空间的扩大,自适应策略选择相较固定启发式的价值成比例提升。
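The state discretization and threshold selection described in this abstract can be sketched as a small tabular Q-learning setup. Everything below is an illustrative assumption rather than the author's code: the five candidate thresholds simply span the stated 20-60% range, the bucket count of 5 is arbitrary, and the names `bucket`, `encode_state`, `q_update`, and `should_switch` are hypothetical.

```python
THRESHOLDS = [20, 30, 40, 50, 60]  # candidate coverage thresholds in percent (assumed grid)


def bucket(value, max_value, n_buckets):
    """Map a value in [0, max_value] to an integer bucket in [0, n_buckets - 1]."""
    idx = int(n_buckets * value / max_value)
    return min(max(idx, 0), n_buckets - 1)


def encode_state(coverage_pct, dist_to_goal, max_dist, n_buckets=5):
    """Discretize (coverage, distance) into a pair of bucket indices."""
    return (bucket(coverage_pct, 100.0, n_buckets),
            bucket(dist_to_goal, max_dist, n_buckets))


def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Standard tabular Q-learning update; actions index into THRESHOLDS."""
    best_next = max(q.get((next_state, a), 0.0) for a in range(len(THRESHOLDS)))
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q[(state, action)]


def should_switch(coverage_pct, threshold_idx):
    """Switch from exploration to goal-directed search once coverage passes the threshold."""
    return coverage_pct >= THRESHOLDS[threshold_idx]
```

The experiment count in the abstract follows from the stated factors: 4 maze sizes × 10 mazes × 6 agent variants = 240 configurations.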
Learning Without Time-Based Embodiment Resets in Soft-Actor Critic
在 Soft Actor-Critic 中无需基于时间的具身重置的学习
- Authors: Homayoon Farrahi, A. Rupam Mahmood
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2512.06252
- Pdf link: https://arxiv.org/pdf/2512.06252
- Abstract
When creating new reinforcement learning tasks, practitioners often accelerate the learning process by incorporating into the task several accessory components, such as breaking the environment interaction into independent episodes and frequently resetting the environment. Although they can enable the learning of complex intelligent behaviors, such task accessories can result in unnatural task setups and hinder long-term performance in the real world. In this work, we explore the challenges of learning without episode terminations and robot embodiment resets using the Soft Actor-Critic (SAC) algorithm. To learn without terminations, we present a continuing version of the SAC algorithm and show that, with simple modifications to the reward functions of existing tasks, continuing SAC can perform as well as or better than episodic SAC while reducing the sensitivity of performance to the value of the discount rate $\gamma$. On a modified Gym Reacher task, we investigate possible explanations for the failure of continuing SAC when learning without embodiment resets. Our results suggest that embodiment resets help with exploration of the state space in the SAC algorithm, and removing embodiment resets can lead to poor exploration of the state space and failure of or significantly slower learning. Finally, on additional simulated tasks and a real-robot vision task, we show that increasing the entropy of the policy when performance trends worse or remains static is an effective intervention for recovering the performance lost due to not using embodiment resets.
- 中文摘要
在创建新的强化学习任务时,实践者通常会在任务中加入若干辅助组件来加快学习进程,比如将环境交互拆分为相互独立的回合,并频繁重置环境。虽然这些任务辅助组件能够促成复杂智能行为的学习,但它们可能导致不自然的任务设置,并妨碍现实世界中的长期表现。本研究探讨了使用软演员-评论家(SAC)算法在没有回合终止和机器人具身重置的情况下学习所面临的挑战。为了在无终止的情况下学习,我们提出了持续式(continuing)版本的SAC算法,并证明通过对现有任务的奖励函数进行简单修改,持续式SAC的表现可以与回合式SAC相当甚至更好,同时降低性能对折扣率$\gamma$取值的敏感性。在修改版的Gym Reacher任务中,我们探讨了持续式SAC在无具身重置条件下学习失败的可能原因。我们的结果表明,具身重置有助于SAC算法对状态空间的探索,而移除具身重置可能导致状态空间探索不足,以及学习失败或显著变慢。最后,在额外的模拟任务和一个真实机器人视觉任务中,我们证明当性能趋于下降或保持不变时,提高策略的熵是一种有效的干预手段,可以恢复因未使用具身重置而损失的性能。
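The intervention described at the end of this abstract, raising policy entropy when performance declines or stalls, could be sketched as a rule on SAC's entropy temperature. This is a hypothetical illustration, not the authors' procedure: the windowed comparison, the `boost` factor, the cap, and the function name `adjust_temperature` are all assumptions, and the paper may adjust a different quantity (for example a target entropy).

```python
def adjust_temperature(returns, temperature, window=10, tol=0.0,
                       boost=1.5, max_temperature=10.0):
    """Raise the entropy temperature when recent returns are flat or worse.

    returns: list of per-interval returns, oldest first.
    Compares the mean of the latest `window` returns with the mean of the
    `window` returns before it.
    """
    if len(returns) < 2 * window:
        return temperature  # not enough history to judge a trend
    recent = sum(returns[-window:]) / window
    previous = sum(returns[-2 * window:-window]) / window
    if recent <= previous + tol:
        # Performance is static or declining: encourage more exploration.
        return min(temperature * boost, max_temperature)
    return temperature
```

A larger temperature weights the entropy term more heavily in the SAC objective, which pushes the policy toward more random actions and hence broader state-space coverage.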
Nanbeige4-3B Technical Report: Exploring the Frontier of Small Language Models
Nanbeige4-3B技术报告:探索小型语言模型的前沿
- Authors: Chen Yang, Guangyue Peng, Jiaying Zhu, Ran Le, Ruixiang Feng, Tao Zhang, Wei Ruan, Xiaoqi Liu, Xiaoxue Cheng, Xiyun Xu, Yang Song, Yanzipeng Gao, Yiming Jia, Yun Xing, Yuntao Wen, Zekai Wang, Zhenwei An, Zhicong Sun, Zongchao Chen
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2512.06266
- Pdf link: https://arxiv.org/pdf/2512.06266
- Abstract
We present Nanbeige4-3B, a family of small-scale but high-performing language models. Pretrained on 23T high-quality tokens and finetuned on over 30 million diverse instructions, these models extend the boundary of the scaling law for small language models. In pre-training, we design a Fine-Grained Warmup-Stable-Decay (FG-WSD) training scheduler, which progressively refines data mixtures across stages to boost model performance. In post-training, to improve the quality of the SFT data, we design a joint mechanism that integrates deliberative generation refinement and chain-of-thought reconstruction, yielding substantial gains on complex tasks. Following SFT, we employ our flagship reasoning model to distill Nanbeige4-3B through our proposed Dual Preference Distillation (DPD) method, which leads to further performance gains. Finally, we apply a multi-stage reinforcement learning phase, leveraging verifiable rewards and preference modeling to strengthen both reasoning and human alignment. Extensive evaluations show that Nanbeige4-3B not only significantly outperforms models of comparable parameter scale but also rivals much larger models across a wide range of benchmarks. The model checkpoints are available at this https URL.
- 中文摘要
我们介绍Nanbeige4-3B,一个规模小但性能高的语言模型系列。该模型在23T高质量词元上预训练,并在超过3000万条多样化指令上微调,拓展了小型语言模型缩放律的边界。在预训练阶段,我们设计了细粒度预热-稳定-衰减(FG-WSD)训练调度器,逐阶段优化数据混合,以提升模型性能。在后训练阶段,为了提升SFT数据质量,我们设计了一个联合机制,整合了审慎生成细化和思维链重建,在复杂任务上取得了显著提升。在SFT之后,我们使用旗舰推理模型,通过所提出的双重偏好蒸馏(DPD)方法对Nanbeige4-3B进行蒸馏,进一步提升性能。最后,我们进行了多阶段强化学习,利用可验证奖励和偏好建模,增强推理能力和人类对齐能力。大量评估显示,Nanbeige4-3B不仅显著优于同等参数规模的模型,还能在多种基准测试中与规模大得多的模型相抗衡。模型检查点可在此 https URL 访问。
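For readers unfamiliar with the Warmup-Stable-Decay shape that FG-WSD builds on, the following is a minimal sketch of a generic WSD learning-rate schedule. It covers only the learning-rate curve; the fine-grained, stage-wise refinement of data mixtures that the abstract describes is not modeled here. The linear warmup, linear decay, and all parameter names are assumptions, not details taken from the report.

```python
def wsd_lr(step, peak_lr, warmup_steps, stable_steps, decay_steps, min_lr=0.0):
    """Generic Warmup-Stable-Decay learning rate for a 0-indexed training step."""
    if step < warmup_steps:
        # Linear warmup, reaching peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:
        # Long plateau at the peak learning rate.
        return peak_lr
    # Linear decay from peak_lr to min_lr, then hold at min_lr.
    t = min(step - warmup_steps - stable_steps, decay_steps)
    return peak_lr + (min_lr - peak_lr) * t / decay_steps
```

For example, with `warmup_steps=10`, `stable_steps=100`, and `decay_steps=10`, the rate ramps up over steps 0-9, stays at `peak_lr` through step 109, and falls linearly to `min_lr` by step 120.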
A Hybrid Physics-Based and Reinforcement Learning Framework for Electric Vehicle Charging Time Prediction
结合物理模型与强化学习的混合框架,用于电动汽车充电时间预测
- Authors: Praharshitha Aryasomayajula, Ting Bai, Andreas A. Malikopoulos
- Subjects: Systems and Control (eess.SY)
- Arxiv link: https://arxiv.org/abs/2512.06287
- Pdf link: https://arxiv.org/pdf/2512.06287
- Abstract
In this paper, we develop a hybrid prediction framework for accurate electric vehicle (EV) charging time estimation, a capability that is critical for trip planning, user satisfaction, and efficient operation of charging infrastructure. We combine a physics-based analytical model with a reinforcement learning (RL) approach. The analytical component captures the nonlinear constant-current/constant-voltage (CC-CV) charging dynamics and explicitly models state-of-health (SoH)-dependent capacity and power fade, providing a reliable baseline when historical data are limited. Building on this foundation, we introduce an RL component that progressively refines charging-time predictions as operational data accumulate, enabling improved long-term adaptation. Both models incorporate SoH degradation to maintain predictive accuracy over the battery lifetime. We evaluate the framework using 5,000 simulated charging sessions calibrated to manufacturer specifications and publicly available EV charging datasets. Our results show that the analytical model achieves $R^{2}=98.5\%$ and $\mathrm{MAPE}=2.1\%$, while the RL model further improves performance to $R^{2}=99.2\%$ and $\mathrm{MAPE}=1.6\%$, corresponding to a 23% accuracy gain and 35% improved robustness to battery aging.
- 中文摘要
本文开发了一个混合预测框架,用于准确估算电动汽车(EV)充电时间,这一能力对于行程规划、用户满意度和充电基础设施高效运行至关重要。我们将基于物理的解析模型与强化学习(RL)方法相结合。解析组件刻画了非线性恒流/恒压(CC-CV)充电动态,并显式建模了与健康状态(SoH)相关的容量和功率衰减,在历史数据有限时提供可靠的基线。在此基础上,我们引入了强化学习组件,随着运营数据的积累逐步优化充电时间预测,从而实现更优的长期适应性。两个模型都考虑了SoH退化,以在电池整个寿命期内保持预测准确性。我们使用5,000次模拟充电会话对该框架进行评估,这些会话依据制造商规格和公开的电动汽车充电数据集校准。结果显示,解析模型实现了$R^{2}=98.5\%$ 和$\mathrm{MAPE}=2.1\%$,而强化学习模型进一步提升性能至$R^{2}=99.2\%$ 和$\mathrm{MAPE}=1.6\%$,对应准确率提升23%,对电池老化的鲁棒性提升35%。
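A physics-based CC-CV charging-time estimate of the kind this abstract mentions can be sketched with a deliberately simple two-phase model. This is an assumed toy model, not the paper's analytical component: the constant-power CC phase, the exponential approach to full charge in the CV phase, the `soc_switch` and `tau_hours` parameters, and the use of SoH as a plain capacity multiplier (with no power fade) are all simplifications introduced for illustration.

```python
import math


def charging_time_hours(soc_start, soc_target, capacity_kwh, soh, power_kw,
                        soc_switch=0.8, tau_hours=0.5):
    """Estimate charging time with a two-phase CC-CV approximation.

    CC phase: constant power until soc_switch.
    CV phase: remaining gap to full charge decays as exp(-t / tau_hours).
    """
    if not 0.0 <= soc_start <= soc_target < 1.0:
        raise ValueError("need 0 <= soc_start <= soc_target < 1")
    usable_kwh = capacity_kwh * soh  # capacity fade: aged packs hold less energy
    cc_end = min(soc_target, soc_switch)
    t_cc = max(cc_end - soc_start, 0.0) * usable_kwh / power_kw
    t_cv = 0.0
    if soc_target > soc_switch:
        cv_start = max(soc_start, soc_switch)
        # Solve 1 - soc_target = (1 - cv_start) * exp(-t / tau) for t.
        t_cv = tau_hours * math.log((1.0 - cv_start) / (1.0 - soc_target))
    return t_cc + t_cv
```

For instance, charging a hypothetical 60 kWh pack at 30 kW from 20% to 80% gives 0.6 × 60 / 30 = 1.2 hours at full health, and 1.08 hours at SoH 0.9 because less energy is needed to reach the same state of charge.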
ReCAD: Reinforcement Learning Enhanced Parametric CAD Model Generation with Vision-Language Models
ReCAD:强化学习增强型参数CAD模型生成,结合视觉语言模型
- Authors: Jiahao Li, Yusheng Luo, Yunzhong Lou, Xiangdong Zhou
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2512.06328
- Pdf link: https://arxiv.org/pdf/2512.06328
- Abstract
We present ReCAD, a reinforcement learning (RL) framework that bootstraps pretrained large models (PLMs) to generate precise parametric computer-aided design (CAD) models from multimodal inputs by leveraging their inherent generative capabilities. With just access to simple functional interfaces (e.g., point coordinates), our approach enables the emergence of complex CAD operations (e.g., pattern replication and mirror). This stands in contrast to previous methods, which typically rely on knowledge injected through supervised fine-tuning (SFT), offer limited support for editability, and fail to exploit the strong generative priors of PLMs. Specifically, the ReCAD framework begins by fine-tuning vision-language models (VLMs) to equip them with basic CAD model generation capabilities, where we rewrite CAD scripts into parameterized code that is leveraged to generate accurate textual descriptions for supervision. Then, we propose a novel RL strategy that incorporates parameterized code as guidance to enhance the model's reasoning on challenging questions. Furthermore, we employ a hierarchical primitive learning process to progressively teach structured and compositional skills under a unified reward function that ensures both geometric accuracy and semantic fidelity. ReCAD sets a new state-of-the-art in both text-to-CAD and image-to-CAD tasks, significantly improving geometric accuracy across in-distribution and out-of-distribution settings. In the image-to-CAD task, for instance, it reduces the mean Chamfer Distance from 73.47 to 29.61 (in-distribution) and from 272.06 to 80.23 (out-of-distribution), outperforming existing baselines by a substantial margin.
- 中文摘要
我们介绍ReCAD,这是一种强化学习(RL)框架,它引导预训练大型模型(PLM),利用其内在的生成能力,从多模态输入生成精确的参数化计算机辅助设计(CAD)模型。仅通过访问简单的功能接口(如点坐标),我们的方法就能使复杂的CAD操作(如阵列复制和镜像)得以涌现。这与以往方法形成鲜明对比,后者通常依赖于通过监督微调(SFT)注入的知识,对可编辑性支持有限,且未能充分利用PLM的强大生成先验。具体来说,ReCAD框架首先对视觉语言模型(VLM)进行微调,使其具备基本的CAD模型生成能力,其中我们将CAD脚本重写为参数化代码,并利用这些代码生成准确的文本描述用于监督。随后,我们提出了一种新颖的强化学习策略,将参数化代码作为指导,以增强模型在困难问题上的推理能力。此外,我们采用分层基元学习过程,在统一奖励函数下逐步教授结构化和组合技能,该奖励函数同时保证几何准确性和语义保真度。ReCAD在文本到CAD和图像到CAD任务中都达到了新的最先进水平,显著提升了分布内和分布外设置下的几何精度。例如,在图像到CAD任务中,它将平均倒角距离(Chamfer Distance)从73.47降至29.61(分布内),从272.06降至80.23(分布外),大幅优于现有基线。
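The Chamfer Distance reported in this abstract compares two point clouds by nearest-neighbour distances. Below is a minimal sketch of one common variant (symmetric, squared Euclidean, averaged per point). Which variant and scaling the ReCAD evaluation uses is not stated in the abstract, so the absolute numbers above should not be expected to match this function's output.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty point sets.

    Uses squared Euclidean distance to the nearest neighbour, averaged over
    each set, and sums the two directions (an assumed, common variant).
    """
    def mean_nearest(src, dst):
        # For each source point, squared distance to its closest point in dst.
        return sum(min(math.dist(p, q) ** 2 for q in dst) for p in src) / len(src)

    return mean_nearest(points_a, points_b) + mean_nearest(points_b, points_a)
```

Identical point sets give a distance of zero, and the brute-force nearest-neighbour search here is quadratic in the number of points, so it is only suitable for small examples.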
LLM-Upgraded Graph Reinforcement Learning for Carbon-Aware Job Scheduling in Smart Manufacturing
智能制造中碳感知作业调度的大型语言模型升级图强化学习
- Authors: Zhiying Yang, Fang Liu, Wei Zhang, Xin Lou, Malcolm Yoke Hean Low, Boon Ping Gan
- Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2512.06351
- Pdf link: https://arxiv.org/pdf/2512.06351
- Abstract
This paper presents Luca, a large language model (LLM)-upgraded graph reinforcement learning framework for carbon-aware flexible job shop scheduling. Luca addresses the challenges of dynamic and sustainable scheduling in smart manufacturing systems by integrating a graph neural network and an LLM, guided by a carefully designed in-house prompting strategy, to produce a fused embedding that captures both structural characteristics and contextual semantics of the latest scheduling state. This expressive embedding is then processed by a deep reinforcement learning policy network, which generates real-time scheduling decisions optimized for both makespan and carbon emission objectives. To support sustainability goals, Luca incorporates a dual-objective reward function that encourages both energy efficiency and scheduling timeliness. Experimental results on both synthetic and public datasets demonstrate that Luca consistently outperforms comparison algorithms. For instance, on the synthetic dataset, it achieves an average of 4.1% and up to 12.2% lower makespan compared to the best-performing comparison algorithm while maintaining the same emission level. On public datasets, additional gains are observed for both makespan and emission. These results demonstrate that Luca is effective and practical for carbon-aware scheduling in smart manufacturing.
- 中文摘要
本文介绍了Luca,一个由大型语言模型(LLM)升级的图强化学习框架,用于碳感知柔性作业车间调度。Luca 通过整合图神经网络和大型语言模型,在精心设计的内部提示策略指导下,解决智能制造系统中动态且可持续调度的挑战,生成融合嵌入,既捕捉最新调度状态的结构特征,也体现上下文语义。这种富有表达力的嵌入随后由深度强化学习策略网络处理,生成同时针对完工时间(makespan)和碳排放目标优化的实时调度决策。为支持可持续发展目标,Luca 采用双目标奖励函数,同时鼓励能源效率和调度及时性。在合成数据集和公开数据集上的实验结果表明,Luca 始终优于对比算法。例如,在合成数据集上,与表现最好的对比算法相比,其完工时间平均降低4.1%,最多降低12.2%,同时保持相同的排放水平。在公开数据集上,完工时间和排放量均有进一步改善。这些结果表明,Luca 在智能制造的碳感知调度中有效且实用。
VG-Refiner: Towards Tool-Refined Referring Grounded Reasoning via Agentic Reinforcement Learning
VG-Refiner:通过代理强化学习实现工具精细的指称基础推理
- Authors: Yuji Wang, Wenlong Liu, Jingxuan Niu, Haoji Zhang, Yansong Tang
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2512.06373
- Pdf link: https://arxiv.org/pdf/2512.06373
- Abstract
Tool-integrated visual reasoning (TiVR) has demonstrated great potential in enhancing multimodal problem-solving. However, existing TiVR paradigms mainly focus on integrating various visual tools through reinforcement learning, while neglecting to design effective response mechanisms for handling unreliable or erroneous tool outputs. This limitation is particularly pronounced in referring and grounding tasks, where inaccurate detection tool predictions often mislead TiVR models into generating hallucinated reasoning. To address this issue, we propose the VG-Refiner, the first framework aiming at the tool-refined referring grounded reasoning. Technically, we introduce a two-stage think-rethink mechanism that enables the model to explicitly analyze and respond to tool feedback, along with a refinement reward that encourages effective correction in response to poor tool results. In addition, we propose two new metrics and establish fair evaluation protocols to systematically measure the refinement ability of current models. We adopt a small amount of task-specific data to enhance the refinement capability of VG-Refiner, achieving a significant improvement in accuracy and correction ability on referring and reasoning grounding benchmarks while preserving the general capabilities of the pretrained model.
- 中文摘要
工具集成视觉推理(TiVR)在增强多模态问题解决方面展现出巨大潜力。然而,现有的TiVR范式主要侧重于通过强化学习整合各种视觉工具,而忽视了设计有效的响应机制来处理不可靠或错误的工具输出。这种局限在指称和接地任务中尤为明显,因为不准确的检测工具预测常常误导TiVR模型,使其产生幻觉推理。为解决这个问题,我们提出了VG-Refiner,这是首个针对工具精炼、指称基础推理的框架。技术上,我们引入了两阶段的思考-再思考机制,使模型能够明确分析并响应工具反馈,同时提供改进奖励,鼓励在工具结果不佳时进行有效纠正。此外,我们还提出了两个新指标,并建立公平的评估方案,以系统地衡量当前模型的精炼能力。我们采用少量任务特定数据以增强VG-Refiner的精炼能力,在保持预训练模型的一般能力的同时,显著提升了引用和基准推理的准确性和校正能力。
RLAX: Large-Scale, Distributed Reinforcement Learning for Large Language Models on TPUs
RLAX:TPU大型语言模型的大规模分布式强化学习
- Authors: Runlong Zhou, Lefan Zhang, Shang-Chen Wu, Kelvin Zou, Hanzhi Zhou, Ke Ye, Yihao Feng, Dong Yin, Alex Guillen Garcia, Dmytro Babych, Rohit Chatterjee, Matthew Hopkins, Xiang Kong, Chang Lan, Lezhi Li, Yiping Ma, Daniele Molinari, Senyu Tong, Yanchao Sun, Thomas Voice, Jianyu Wang, Chong Wang, Simon Wang, Floris Weers, Yechen Xu, Guolin Yin, Muyang Yu, Yi Zhang, Zheng Zhou, Danyang Zhuo, Ruoming Pang, Cheng Leong
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2512.06392
- Pdf link: https://arxiv.org/pdf/2512.06392
- Abstract
Reinforcement learning (RL) has emerged as the de-facto paradigm for improving the reasoning capabilities of large language models (LLMs). We have developed RLAX, a scalable RL framework on TPUs. RLAX employs a parameter-server architecture. A master trainer periodically pushes updated model weights to the parameter server while a fleet of inference workers pulls the latest weights and generates new rollouts. We introduce a suite of system techniques to enable scalable and preemptible RL for a diverse set of state-of-the-art RL algorithms. To accelerate convergence and improve model quality, we have devised new dataset curation and alignment techniques. Large-scale evaluations show that RLAX improves QwQ-32B's pass@8 accuracy by 12.8% in just 12 hours 48 minutes on 1024 v5p TPUs, while remaining robust to preemptions during training.
- 中文摘要
强化学习(RL)已成为提升大型语言模型(LLM)推理能力的事实标准范式。我们开发了RLAX,一个面向TPU的可扩展强化学习框架。RLAX 采用参数服务器架构。主训练器定期向参数服务器推送更新后的模型权重,同时一组推理工作节点拉取最新权重并生成新的 rollout。我们引入了一套系统技术,为多种最先进的强化学习算法实现可扩展且可抢占的强化学习。为了加速收敛并提升模型质量,我们设计了新的数据集整理和对齐技术。大规模评估显示,RLAX在1024个v5p TPU上仅用12小时48分钟,就将QwQ-32B的pass@8准确率提升了12.8%,同时对训练期间的抢占保持鲁棒。
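The push/pull pattern this abstract describes (a trainer pushing weights, inference workers pulling the latest version before generating rollouts) can be illustrated with a small in-process sketch. This is not RLAX code: the `ParameterServer` class, the version counter, the dictionary "weights", and the placeholder rollout records are all hypothetical, and a real system would move weights across hosts rather than between threads.

```python
import threading


class ParameterServer:
    """Holds the latest weights together with a monotonically increasing version."""

    def __init__(self, weights):
        self._lock = threading.Lock()
        self._weights = dict(weights)
        self._version = 0

    def push(self, weights):
        """Called by the trainer after an update; returns the new version."""
        with self._lock:
            self._weights = dict(weights)
            self._version += 1
            return self._version

    def pull(self):
        """Called by inference workers; returns (version, copy of weights)."""
        with self._lock:
            return self._version, dict(self._weights)


def inference_worker(server, prompts):
    """Generate one placeholder rollout per prompt, tagged with the weight version used."""
    rollouts = []
    for prompt in prompts:
        version, weights = server.pull()  # always sample with the freshest weights
        rollouts.append({"prompt": prompt,
                         "weights_version": version,
                         "scale": weights["scale"]})
    return rollouts
```

Tagging each rollout with the weight version it was generated under lets a trainer measure how stale a sample is, which matters when rollouts and updates run asynchronously.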
Why Goal-Conditioned Reinforcement Learning Works: Relation to Dual Control
为什么目标条件强化学习有效:与双重控制的关系
- Authors: Nathan P. Lawrence, Ali Mesbah
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2512.06471
- Pdf link: https://arxiv.org/pdf/2512.06471
- Abstract
Goal-conditioned reinforcement learning (RL) concerns the problem of training an agent to maximize the probability of reaching target goal states. This paper presents an analysis of the goal-conditioned setting based on optimal control. In particular, we derive an optimality gap between more classical, often quadratic, objectives and the goal-conditioned reward, elucidating the success of goal-conditioned RL and why classical "dense" rewards can falter. We then consider the partially observed Markov decision setting and connect state estimation to our probabilistic reward, further making the goal-conditioned reward well suited to dual control problems. The advantages of goal-conditioned policies are validated on nonlinear and uncertain environments using both RL and predictive control techniques.
- 中文摘要
目标条件强化学习(RL)关注如何训练智能体以最大化到达目标状态的概率。本文基于最优控制对目标条件设定进行了分析。特别是,我们推导了更经典的(通常为二次型的)目标与目标条件奖励之间的最优性差距,阐明了目标条件强化学习为何成功,以及经典的“稠密”奖励为何可能失效。随后,我们考虑部分可观测的马尔可夫决策设定,并将状态估计与我们的概率奖励联系起来,进一步说明目标条件奖励非常适合对偶控制问题。我们使用强化学习和预测控制技术,在非线性和不确定环境中验证了目标条件策略的优势。
Entropy-Controlled Intrinsic Motivation Reinforcement Learning for Quadruped Robot Locomotion in Complex Terrains
熵控制的内在动机强化学习:复杂地形中四足机器人运动
- Authors: Wanru Gong, Xinyi Zheng, Xiaopeng Yang, Xiaoqing Zhu
- Subjects:
Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2512.06486
- Pdf link: https://arxiv.org/pdf/2512.06486
- Abstract
Learning is the basis of both biological and artificial systems when it comes to mimicking intelligent behaviors. Starting from classical PPO (Proximal Policy Optimization), a series of deep reinforcement learning algorithms has been widely used to train locomotion policies for quadrupedal robots because of their stability and sample efficiency. However, across these variants, policies trained in experiments and simulations often converge prematurely, leading to suboptimal locomotion and reduced task performance. In this paper, we therefore introduce Entropy-Controlled Intrinsic Motivation (ECIM), an entropy-based reinforcement learning algorithm that, in contrast with the PPO series, reduces premature convergence by combining intrinsic motivation with adaptive exploration. To allow comparison with other baselines, we evaluate it in Isaac Gym across six widely used terrain categories: upward slopes, downward slopes, uneven rough terrain, ascending stairs, descending stairs, and flat ground. Compared with the baselines, ECIM consistently performs better: task rewards increase by 4-12%, peak body pitch oscillation is reduced by 23-29%, joint acceleration decreases by 20-32%, and joint torque consumption declines by 11-20%. Overall, by combining entropy control and intrinsic motivation control, ECIM achieves more stable quadrupedal locomotion across different terrains while reducing energetic cost, making it a practical choice for complex robotic control tasks.
- 中文摘要
学习是生物系统和人工系统模仿智能行为的基础。自经典的近端策略优化(PPO)以来,一系列深度强化学习算法因其稳定性和样本效率,被广泛用于训练四足机器人的运动策略。然而,在这些变体中,实验和仿真往往过早收敛,导致运动表现次优、任务性能下降。因此,本文提出熵控制内在动机(ECIM),这是一种与PPO系列不同的基于熵的强化学习算法,通过将内在动机与自适应探索相结合来减少过早收敛。为了便于与其他基线对比,我们在Isaac Gym中的六类常用地形上进行实验:上坡、下坡、崎岖不平地形、上行楼梯、下行楼梯和平地。与基线相比,我们的方法始终取得更好的表现:任务奖励提高4-12%,机身俯仰振荡峰值降低23-29%,关节加速度降低20-32%,关节力矩消耗降低11-20%。总体而言,ECIM结合熵控制与内在动机控制,在不同地形上实现了更稳定的四足运动,同时降低了能量消耗,是复杂机器人控制任务的实用选择。
Beyond Token-level Supervision: Unlocking the Potential of Decoding-based Regression via Reinforcement Learning
超越词元级监督:通过强化学习释放基于解码的回归的潜力
- Authors: Ming Chen, Sheng Tang, Rong-Xi Tan, Ziniu Li, Jiacheng Chen, Ke Xue, Chao Qian
- Subjects:
Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2512.06533
- Pdf link: https://arxiv.org/pdf/2512.06533
- Abstract
Decoding-based regression, which reformulates regression as a sequence generation task, has emerged as a promising paradigm of applying large language models for numerical prediction. However, its progress is hindered by the misalignment between discrete token-level objectives (e.g., cross-entropy) and continuous numerical values. Existing approaches relying on token-level constraints often fail to capture the global magnitude of the target value, limiting their precision and generalization. In this paper, we propose to unlock the potential of decoding-based regression via Reinforcement Learning (RL). We formulate the generation process as a Markov Decision Process, utilizing sequence-level rewards to enforce global numerical coherence. Extensive experiments on tabular regression and code metric regression demonstrate that our method (specifically with ReMax and GRPO) consistently outperforms both state-of-the-art token-level baselines and traditional regression heads, showing the superiority of introducing sequence-level signals. Our analysis further reveals that RL significantly enhances sampling efficiency and predictive precision, establishing decoding-based regression as a robust and accurate paradigm for general-purpose numerical prediction.
- 中文摘要
基于解码的回归将回归重新表述为序列生成任务,已成为应用大型语言模型进行数值预测的一种有前景的范式。然而,其进展受限于离散的词元级目标(如交叉熵)与连续数值之间的不匹配。依赖词元级约束的现有方法常常无法捕捉目标值的整体量级,限制了其精度和泛化能力。本文提出通过强化学习(RL)释放基于解码的回归的潜力。我们将生成过程表述为马尔可夫决策过程,利用序列级奖励来保证全局数值一致性。在表格回归和代码度量回归上的大量实验表明,我们的方法(特别是结合ReMax和GRPO时)始终优于最先进的词元级基线和传统回归头,显示了引入序列级信号的优势。我们的分析进一步表明,强化学习显著提升了采样效率和预测精度,确立了基于解码的回归作为通用数值预测的稳健且准确的范式。
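The mismatch between token-level objectives and numeric error that this abstract describes can be seen in a small sketch. The two functions below are illustrative assumptions, not the paper's losses or rewards: `token_mismatches` is a crude stand-in for a token-level objective (it counts differing digit positions), and `sequence_reward` uses negative absolute error as one plausible sequence-level reward.

```python
def token_mismatches(predicted, target):
    """Count digit positions that differ (a stand-in for a token-level loss)."""
    width = max(len(predicted), len(target))
    p, t = predicted.rjust(width, "0"), target.rjust(width, "0")
    return sum(a != b for a, b in zip(p, t))


def sequence_reward(predicted, target_value):
    """Assumed sequence-level reward: negative absolute error of the decoded number."""
    return -abs(float(predicted) - target_value)


target = "200"
for guess in ("199", "900"):
    print(guess, token_mismatches(guess, target), sequence_reward(guess, 200.0))
```

For the target 200, the guess 199 differs in all three digit positions but is off by only 1, while 900 differs in a single position but is off by 700. A digit-wise view therefore ranks the two guesses in the opposite order from the numeric error, which is the kind of gap a sequence-level signal is meant to close.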
A-3PO: Accelerating Asynchronous LLM Training with Staleness-aware Proximal Policy Approximation
A-3PO:利用陈旧度感知的近端策略近似加速异步LLM训练
- Authors: Xiaocan Li, Shiliang Wu, Zheng Shen
- Subjects:
Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
- Arxiv link: https://arxiv.org/abs/2512.06547
- Pdf link: https://arxiv.org/pdf/2512.06547
- Abstract
Decoupled loss has been a successful reinforcement learning (RL) algorithm for dealing with high data staleness in the asynchronous RL setting. It improves the learning stability of coupled-loss algorithms (e.g., PPO, GRPO) by introducing a proximal policy that decouples the off-policy correction (importance weight) from the control of policy updates (trust region). However, the proximal policy requires an extra forward pass through the network at each training step, creating a computational bottleneck for large language models. We observe that since the proximal policy only serves as a trust region anchor between the behavior and target policies, we can approximate it through simple interpolation without explicit computation. We call this approach A-3PO (APproximated Proximal Policy Optimization). A-3PO eliminates this overhead, reducing training time by 18% while maintaining comparable performance. Code and an off-the-shelf example are available at: this https URL
- 中文摘要
解耦损失是一种成功的强化学习(RL)算法,用于应对异步强化学习设定下的高数据陈旧度。它通过引入近端策略,将离策略修正(重要性权重)与对策略更新的控制(信任域)解耦,从而提升耦合损失类算法(如PPO、GRPO)的学习稳定性。然而,近端策略要求在每个训练步骤中对网络进行一次额外的前向传播,这给大型语言模型带来了计算瓶颈。我们观察到,由于近端策略仅作为行为策略与目标策略之间的信任域锚点,我们可以通过简单的插值来近似它,而无需显式计算。我们称这种方法为A-3PO(APproximated Proximal Policy Optimization)。A-3PO消除了这一开销,将训练时间减少了18%,同时保持了相当的性能。代码和现成示例可在以下网址获取:此 https URL
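A per-sample sketch may help make the idea in this abstract concrete. Everything specific here is an assumption: the abstract only says the proximal policy is approximated "through simple interpolation", so the log-space interpolation, the `alpha` coefficient, and the function names are hypothetical, and the surrogate follows the commonly used decoupled PPO form (importance weight from behavior to proximal policy, clipped ratio from proximal to target policy) rather than the paper's exact loss.

```python
import math


def interpolated_prox_logp(logp_behavior, logp_target, alpha):
    """Assumed approximation: interpolate log-probabilities instead of running
    an extra forward pass. alpha=0 gives the behavior policy, alpha=1 the target."""
    return (1.0 - alpha) * logp_behavior + alpha * logp_target


def decoupled_surrogate(logp_target, logp_behavior, advantage,
                        alpha=0.5, clip_eps=0.2):
    """Per-sample decoupled clipped objective (to be maximized).
    In a real training loop logp_prox would be treated as a constant
    (no gradient); plain floats are used here for clarity."""
    logp_prox = interpolated_prox_logp(logp_behavior, logp_target, alpha)
    importance_weight = math.exp(logp_prox - logp_behavior)  # off-policy correction
    ratio = math.exp(logp_target - logp_prox)                # trust-region ratio
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return importance_weight * min(ratio * advantage, clipped * advantage)


# With alpha=0 the proximal policy equals the behavior policy, so the
# expression reduces to the usual coupled PPO clipped objective.
value = decoupled_surrogate(math.log(0.5), math.log(0.25), advantage=1.0, alpha=0.0)
```

In this sketch the only inputs are log-probabilities that an asynchronous trainer already has (from the rollout and from the current forward pass), which is the source of the saved computation the abstract refers to.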
Learning Agile Striker Skills for Humanoid Soccer Robots from Noisy Sensory Input
从嘈杂的感官输入中学习人形足球机器人的敏捷前锋技能
- Authors: Zifan Xu, Myoungkyu Seo, Dongmyeong Lee, Hao Fu, Jiaheng Hu, Jiaxun Cui, Yuqian Jiang, Zhihan Wang, Anastasiia Brund, Joydeep Biswas, Peter Stone
- Subjects:
Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2512.06571
- Pdf link: https://arxiv.org/pdf/2512.06571
- Abstract
Learning fast and robust ball-kicking skills is a critical capability for humanoid soccer robots, yet it remains a challenging problem due to the need for rapid leg swings, postural stability on a single support foot, and robustness under noisy sensory input and external perturbations (e.g., opponents). This paper presents a reinforcement learning (RL)-based system that enables humanoid robots to execute robust continual ball-kicking with adaptability to different ball-goal configurations. The system extends a typical teacher-student training framework -- in which a "teacher" policy is trained with ground truth state information and the "student" learns to mimic it with noisy, imperfect sensing -- by including four training stages: (1) long-distance ball chasing (teacher); (2) directional kicking (teacher); (3) teacher policy distillation (student); and (4) student adaptation and refinement (student). Key design elements -- including tailored reward functions, realistic noise modeling, and online constrained RL for adaptation and refinement -- are critical for closing the sim-to-real gap and sustaining performance under perceptual uncertainty. Extensive evaluations in both simulation and on a real robot demonstrate strong kicking accuracy and goal-scoring success across diverse ball-goal configurations. Ablation studies further highlight the necessity of the constrained RL, noise modeling, and the adaptation stage. This work presents a system for learning robust continual humanoid ball-kicking under imperfect perception, establishing a benchmark task for visuomotor skill learning in humanoid whole-body control.
- 中文摘要
学习快速且稳健的踢球技能是人形足球机器人的一项关键能力,但由于需要快速摆腿、在单脚支撑下保持姿态稳定,并在含噪传感输入和外部扰动(如对手)下保持稳健,这仍是一个具有挑战性的问题。本文提出了一种基于强化学习(RL)的系统,使人形机器人能够执行稳健的连续踢球,并适应不同的球与球门配置。该系统扩展了典型的教师-学生训练框架(其中“教师”策略利用真实状态信息进行训练,“学生”策略则在含噪、不完美的感知下学习模仿它),包含四个训练阶段:(1)远距离追球(教师);(2)定向踢球(教师);(3)教师策略蒸馏(学生);以及(4)学生适应与精调(学生)。关键设计要素(包括定制的奖励函数、逼真的噪声建模,以及用于适应与精调的在线约束强化学习)对于缩小仿真到现实的差距、在感知不确定性下保持性能至关重要。在仿真和真实机器人上的大量评估表明,该系统在多种球与球门配置下都具有很高的踢球准确度和进球成功率。消融研究进一步凸显了约束强化学习、噪声建模和适应阶段的必要性。本研究提出了一套在不完美感知下学习稳健连续人形踢球的系统,为人形全身控制中的视觉运动技能学习确立了一项基准任务。
MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding
MedGRPO:多任务强化学习用于异质医学视频理解
- Authors: Yuhao Su, Anwesa Choudhuri, Zhongpai Gao, Benjamin Planche, Van Nguyen Nguyen, Meng Zheng, Yuhan Shen, Arun Innanje, Terrence Chen, Ehsan Elhamifar, Ziyan Wu
- Subjects:
Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2512.06581
- Pdf link: https://arxiv.org/pdf/2512.06581
- Abstract
Large vision-language models struggle with medical video understanding, where spatial precision, temporal reasoning, and clinical semantics are critical. To address this, we first introduce MedVidBench, a large-scale benchmark of 531,850 video-instruction pairs across 8 medical sources spanning video, segment, and frame-level tasks, curated through a rigorous quality assurance pipeline with expert-guided prompting and dual-model validation. While supervised fine-tuning on MedVidBench yields noticeable gains, standard Reinforcement Learning (RL) fails due to imbalanced reward scales across datasets, which destabilizes optimization and leads to training collapse. To overcome this, we introduce MedGRPO, a novel RL framework for balanced multi-dataset training with two key innovations: (1) cross-dataset reward normalization that maps each dataset's median performance to a common reward value, ensuring fair optimization regardless of difficulty, and (2) a medical LLM judge that evaluates caption quality on five clinical dimensions through comparative similarity scoring. Supervised fine-tuning Qwen2.5-VL-7B on MedVidBench substantially outperforms GPT-4.1 and Gemini-2.5-Flash across all tasks, demonstrating MedVidBench's efficacy, while our MedGRPO framework further improves upon the SFT baseline across grounding and captioning tasks. Our work establishes a foundational benchmark and robust training methodology for advancing vision-language models in medical domains. Our project website is available at this https URL.
- 中文摘要
大型视觉语言模型在医学视频理解方面表现不佳,而在这一任务中,空间精度、时间推理和临床语义至关重要。为此,我们首先提出MedVidBench,这是一个包含531,850个视频-指令对的大规模基准,涵盖8个医学数据来源,覆盖视频级、片段级和帧级任务,并通过严格的质量保证流程构建,该流程结合了专家引导的提示和双模型验证。虽然在MedVidBench上进行监督微调带来了明显提升,但标准强化学习(RL)因各数据集间奖励尺度不平衡而失败,这会使优化不稳定并导致训练崩溃。为克服这一问题,我们提出MedGRPO,一种用于平衡多数据集训练的新型强化学习框架,具有两项关键创新:(1)跨数据集奖励归一化,将每个数据集的中位表现映射到一个共同的奖励值,确保无论难度如何都能公平优化;(2)医学LLM评判器,通过比较相似度评分,在五个临床维度上评估描述质量。在MedVidBench上对Qwen2.5-VL-7B进行监督微调后,其在所有任务上显著优于GPT-4.1和Gemini-2.5-Flash,证明了MedVidBench的有效性,而我们的MedGRPO框架在定位和描述任务上进一步超越了SFT基线。我们的工作为推进医学领域的视觉语言模型确立了基础性基准和稳健的训练方法。我们的项目网站可访问此 https URL。
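The cross-dataset reward normalization mentioned in this abstract can be sketched as follows. The abstract only states that each dataset's median performance is mapped to a common reward value; the additive shift, the target value of 0.5, and the names `normalize_rewards` and `target` are assumptions made for illustration, and the paper may use a different mapping.

```python
from statistics import median


def normalize_rewards(scores_by_dataset, target=0.5):
    """Shift each dataset's raw scores so that its median lands on `target`.
    An additive shift is assumed here; it preserves within-dataset ordering."""
    normalized = {}
    for name, scores in scores_by_dataset.items():
        shift = target - median(scores)
        normalized[name] = [s + shift for s in scores]
    return normalized


raw = {
    "easy_dataset": [0.8, 0.9, 1.0],   # high raw scores
    "hard_dataset": [0.1, 0.2, 0.3],   # low raw scores
}
normalized = normalize_rewards(raw)
```

After the shift, a median-quality response earns the same reward on either dataset, so an easy dataset with uniformly high raw scores no longer dominates the policy update.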
A New Trajectory-Oriented Approach to Enhancing Comprehensive Crowd Navigation Performance
一种以轨迹为导向的新方法,提升全面人群导航性能
- Authors: Xinyu Zhou, Songhao Piao, Chao Gao, Liguo Chen
- Subjects:
Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2512.06608
- Pdf link: https://arxiv.org/pdf/2512.06608
- Abstract
Crowd navigation has garnered considerable research interest in recent years, especially with the proliferating application of deep reinforcement learning (DRL) techniques. Many studies, however, do not sufficiently analyze the relative priorities among evaluation metrics, which compromises the fair assessment of methods with divergent objectives. Furthermore, trajectory-continuity metrics, specifically those requiring $C^2$ smoothness, are rarely incorporated. Current DRL approaches generally prioritize efficiency and proximal comfort, often neglecting trajectory optimization or addressing it only through simplistic, unvalidated smoothness reward. Nevertheless, effective trajectory optimization is essential to ensure naturalness, enhance comfort, and maximize the energy efficiency of any navigation system. To address these gaps, this paper proposes a unified framework that enables the fair and transparent assessment of navigation methods by examining the prioritization and joint evaluation of multiple optimization objectives. We further propose a novel reward-shaping strategy that explicitly emphasizes trajectory-curvature optimization. The resulting trajectory quality and adaptability are significantly enhanced across multi-scale scenarios. Through extensive 2D and 3D experiments, we demonstrate that the proposed method achieves superior performance compared to state-of-the-art approaches.
- 中文摘要
近年来,人群导航受到了大量研究关注,尤其是随着深度强化学习(DRL)技术的广泛应用。然而,许多研究未能充分分析评估指标之间的相对优先级,这影响了对目标不同的方法进行公平评估。此外,轨迹连续性指标,特别是要求$C^2$平滑性的指标,很少被纳入。当前的DRL方法通常优先考虑效率和近距离舒适度,常常忽视轨迹优化,或仅通过简单且未经验证的平滑性奖励来处理。然而,有效的轨迹优化对于确保自然性、提升舒适度并最大化导航系统的能源效率至关重要。为弥补这些空白,本文提出了一个统一框架,通过考察多个优化目标的优先级排序和联合评估,实现对导航方法的公平、透明评估。我们还提出了一种新颖的奖励塑造策略,明确强调轨迹曲率优化。由此得到的轨迹质量和适应性在多尺度场景中显著提升。通过大量二维和三维实验,我们证明该方法相较于最先进的方法实现了更优的性能。
MIND-V: Hierarchical Video Generation for Long-Horizon Robotic Manipulation with RL-based Physical Alignment
MIND-V:面向长时程机器人操作的分层视频生成与基于强化学习的物理对齐
- Authors: Ruicheng Zhang, Mingyang Zhang, Jun Zhou, Zhangrui Guo, Xiaofan Liu, Zunnan Xu, Zhizhou Zhong, Puxin Yan, Haocheng Luo, Xiu Li
- Subjects:
Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2512.06628
- Pdf link: https://arxiv.org/pdf/2512.06628
- Abstract
Embodied imitation learning is constrained by the scarcity of diverse, long-horizon robotic manipulation data. Existing video generation models for this domain are limited to synthesizing short clips of simple actions and often rely on manually defined trajectories. To this end, we introduce MIND-V, a hierarchical framework designed to synthesize physically plausible and logically coherent videos of long-horizon robotic manipulation. Inspired by cognitive science, MIND-V bridges high-level reasoning with pixel-level synthesis through three core components: a Semantic Reasoning Hub (SRH) that leverages a pre-trained vision-language model for task planning; a Behavioral Semantic Bridge (BSB) that translates abstract instructions into domain-invariant representations; and a Motor Video Generator (MVG) for conditional video rendering. MIND-V employs Staged Visual Future Rollouts, a test-time optimization strategy to enhance long-horizon robustness. To align the generated videos with physical laws, we introduce a GRPO reinforcement learning post-training phase guided by a novel Physical Foresight Coherence (PFC) reward. PFC leverages the V-JEPA world model to enforce physical plausibility by aligning the predicted and actual dynamic evolutions in the feature space. MIND-V demonstrates state-of-the-art performance in long-horizon robotic manipulation video generation, establishing a scalable and controllable paradigm for embodied data synthesis.
- 中文摘要
具身模仿学习受限于多样化、长时程机器人操作数据的稀缺。该领域的现有视频生成模型仅限于合成简单动作的短片段,且通常依赖手动定义的轨迹。为此,我们提出MIND-V,这是一个分层框架,旨在合成物理上合理且逻辑连贯的长时程机器人操作视频。MIND-V受认知科学启发,通过三个核心组件连接高层推理与像素级合成:语义推理中心(SRH),利用预训练的视觉语言模型进行任务规划;行为语义桥(BSB),将抽象指令转换为域不变表示;以及用于条件视频渲染的运动视频生成器(MVG)。MIND-V采用分阶段视觉未来推演(Staged Visual Future Rollouts),这是一种测试时优化策略,用于增强长时程稳健性。为了使生成的视频符合物理规律,我们引入了GRPO强化学习后训练阶段,并由新颖的物理前瞻一致性(PFC)奖励引导。PFC利用V-JEPA世界模型,通过在特征空间中对齐预测的与实际的动态演变来保证物理合理性。MIND-V在长时程机器人操作视频生成方面展示了最先进的性能,为具身数据合成建立了可扩展且可控的范式。
Analyzing Collision Rates in Large-Scale Mixed Traffic Control via Multi-Agent Reinforcement Learning
通过多智能体强化学习分析大规模混合交通控制中的碰撞率
- Authors: Muyang Fan
- Subjects:
Multiagent Systems (cs.MA)
- Arxiv link: https://arxiv.org/abs/2512.06645
- Pdf link: https://arxiv.org/pdf/2512.06645
- Abstract
Vehicle collisions remain a major challenge in large-scale mixed traffic systems, especially when human-driven vehicles (HVs) and robotic vehicles (RVs) interact under dynamic and uncertain conditions. Although Multi-Agent Reinforcement Learning (MARL) offers promising capabilities for traffic signal control, ensuring safety in such environments remains difficult. As a direct indicator of traffic risk, the collision rate must be well understood and incorporated into traffic control design. This study investigates the primary factors influencing collision rates in a MARL-governed Mixed Traffic Control (MTC) network. We examine three dimensions: total vehicle count, signalized versus unsignalized intersection configurations, and turning-movement strategies. Through controlled simulation experiments, we evaluate how each factor affects collision likelihood. The results show that collision rates are sensitive to traffic density, the level of signal coordination, and turning-control design. These findings provide practical insights for improving the safety and robustness of MARL-based mixed traffic control systems, supporting the development of intelligent transportation systems in which both efficiency and safety are jointly optimized.
- 中文摘要
车辆碰撞仍然是大规模混合交通系统中的重大挑战,尤其是在人类驾驶车辆(HV)和机器人车辆(RV)在动态且不确定的条件下交互时。尽管多智能体强化学习(MARL)在交通信号控制方面展现出有前景的能力,但在此类环境中确保安全仍然困难。作为交通风险的直接指标,碰撞率必须被充分理解并纳入交通控制设计中。本研究探讨了在由MARL管控的混合交通控制(MTC)网络中影响碰撞率的主要因素。我们考察三个维度:车辆总数、有信号与无信号的交叉口配置,以及转向策略。通过受控仿真实验,我们评估每个因素如何影响碰撞概率。结果显示,碰撞率对交通密度、信号协调水平和转向控制设计敏感。这些发现为提升基于MARL的混合交通控制系统的安全性和稳健性提供了实用见解,支持效率与安全共同优化的智能交通系统的发展。
LightSearcher: Efficient DeepSearch via Experiential Memory
LightSearcher:通过经验记忆实现高效的深度搜索
- Authors: Hengzhi Lan, Yue Yu, Li Qian, Li Peng, Jie Wu, Wei Liu, Jian Luan, Ting Bai
- Subjects:
Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2512.06653
- Pdf link: https://arxiv.org/pdf/2512.06653
- Abstract
DeepSearch paradigms have become a core enabler for deep reasoning models, allowing them to invoke external search tools to access up-to-date, domain-specific knowledge beyond parametric boundaries, thereby enhancing the depth and factual reliability of reasoning. Building upon this foundation, recent advances in reinforcement learning (RL) have further empowered models to autonomously and strategically control search tool usage, optimizing when and how to query external knowledge sources. Yet, these RL-driven DeepSearch systems often reveal a see-saw trade-off between accuracy and efficiency: frequent tool invocations can improve factual correctness but lead to unnecessary computational overhead and diminished efficiency. To address this challenge, we propose LightSearcher, an efficient RL framework that incorporates textual experiential memory by learning contrastive reasoning trajectories to generate interpretable summaries of successful reasoning patterns. In addition, it employs an adaptive reward shaping mechanism that penalizes redundant tool calls only in correct-answer scenarios. This design effectively balances the inherent accuracy-efficiency trade-off in DeepSearch paradigms. Experiments on four multi-hop QA benchmarks show that LightSearcher maintains accuracy comparable to SOTA baseline ReSearch, while reducing search tool invocations by 39.6%, inference time by 48.6%, and token consumption by 21.2%, demonstrating its superior efficiency.
- 中文摘要
深度搜索范式已成为深度推理模型的核心推动力,使其能够调用外部搜索工具,访问超越参数界限的最新领域知识,从而增强推理的深度和事实可靠性。在此基础上,强化学习(RL)的最新进展进一步赋能模型自主且有策略地控制搜索工具的使用,优化查询外部知识源的时机和方式。然而,这些基于强化学习的深度搜索系统常常在准确性与效率之间呈现跷跷板式的权衡:频繁调用工具虽能提升事实正确性,但会导致不必要的计算开销和效率下降。为应对这一挑战,我们提出了LightSearcher,一种高效的强化学习框架,通过学习对比性推理轨迹来引入文本形式的经验记忆,生成成功推理模式的可解释摘要。此外,它采用自适应奖励塑造机制,仅在答案正确的情形下惩罚冗余工具调用。该设计有效平衡了DeepSearch范式中固有的准确性与效率权衡。在四个多跳QA基准测试上的实验显示,LightSearcher的准确性与SOTA基线ReSearch相当,同时将搜索工具调用次数减少39.6%,推理时间减少48.6%,token消耗减少21.2%,显示出其优越的效率。
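The reward-shaping rule in this abstract (penalize redundant tool calls only when the answer is correct) can be written as a short sketch. The concrete form is an assumption: the function name `shaped_reward`, the `call_budget` threshold that defines "redundant", and the linear `penalty` are all hypothetical and are not taken from the paper.

```python
def shaped_reward(is_correct, num_tool_calls, call_budget=2, penalty=0.1):
    """Assumed shaping rule: wrong answers get 0 regardless of tool usage;
    correct answers get 1 minus a penalty per call beyond `call_budget`."""
    if not is_correct:
        # No efficiency penalty here, so the policy is not pushed to
        # search less while it is still getting answers wrong.
        return 0.0
    redundant_calls = max(0, num_tool_calls - call_budget)
    return max(0.0, 1.0 - penalty * redundant_calls)


efficient = shaped_reward(True, num_tool_calls=2)
wasteful = shaped_reward(True, num_tool_calls=5)
wrong = shaped_reward(False, num_tool_calls=9)
```

Gating the penalty on correctness keeps accuracy as the primary signal: efficiency only differentiates between trajectories that already reach the right answer.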
RunawayEvil: Jailbreaking the Image-to-Video Generative Models
RunawayEvil:越狱图像到视频生成模型
- Authors: Songping Wang, Rufan Qian, Yueming Lyu, Qinglong Liu, Linzhuang Zou, Jie Qin, Songhua Liu, Caifeng Shan
- Subjects:
Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2512.06674
- Pdf link: https://arxiv.org/pdf/2512.06674
- Abstract
Image-to-Video (I2V) generation synthesizes dynamic visual content from image and text inputs, providing significant creative control. However, the security of such multimodal systems, particularly their vulnerability to jailbreak attacks, remains critically underexplored. To bridge this gap, we propose RunawayEvil, the first multimodal jailbreak framework for I2V models with dynamic evolutionary capability. Built on a "Strategy-Tactic-Action" paradigm, our framework exhibits self-amplifying attack through three core components: (1) Strategy-Aware Command Unit that enables the attack to self-evolve its strategies through reinforcement learning-driven strategy customization and LLM-based strategy exploration; (2) Multimodal Tactical Planning Unit that generates coordinated text jailbreak instructions and image tampering guidelines based on the selected strategies; (3) Tactical Action Unit that executes and evaluates the multimodal coordinated attacks. This self-evolving architecture allows the framework to continuously adapt and intensify its attack strategies without human intervention. Extensive experiments demonstrate RunawayEvil achieves state-of-the-art attack success rates on commercial I2V models, such as Open-Sora 2.0 and CogVideoX. Specifically, RunawayEvil outperforms existing methods by 58.5 to 79 percent on COCO2017. This work provides a critical tool for vulnerability analysis of I2V models, thereby laying a foundation for more robust video generation systems.
- 中文摘要
图像到视频(I2V)生成从图像和文本输入中合成动态视觉内容,提供显著的创意控制。然而,这类多模态系统的安全性,尤其是其对越狱攻击的脆弱性,仍然严重缺乏探索。为弥合这一空白,我们提出了RunawayEvil,首个面向I2V模型、具备动态进化能力的多模态越狱框架。基于“战略-战术-行动”范式,我们的框架通过三个核心组件实现自我放大的攻击:(1)战略感知指挥单元,使攻击能够通过强化学习驱动的策略定制和基于LLM的策略探索实现策略的自我演化;(2)多模态战术规划单元,基于所选策略生成协同的文本越狱指令和图像篡改指南;(3)战术行动单元,执行并评估多模态协同攻击。这种自我演化的架构使该框架能够在无需人工干预的情况下持续适应和强化攻击策略。大量实验表明,RunawayEvil在商业I2V模型(如Open-Sora 2.0和CogVideoX)上实现了最先进的攻击成功率。具体来说,RunawayEvil在COCO2017上比现有方法高出58.5%至79%。这项工作为I2V模型的脆弱性分析提供了关键工具,从而为更稳健的视频生成系统奠定了基础。
The Role of Entropy in Visual Grounding: Analysis and Optimization
熵在视觉定位中的作用:分析与优化
- Authors: Shuo Li, Jiajun Sun, Zhihao Zhang, Xiaoran Fan, Senjie Jin, Hui Li, Yuming Yang, Junjie Ye, Lixing Shen, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang
- Subjects:
Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2512.06726
- Pdf link: https://arxiv.org/pdf/2512.06726
- Abstract
Recent advances in fine-tuning multimodal large language models (MLLMs) using reinforcement learning have achieved remarkable progress, particularly with the introduction of various entropy control techniques. However, the role and characteristics of entropy in perception-oriented tasks like visual grounding, as well as effective strategies for controlling it, remain largely unexplored. To address this issue, we focus on the visual grounding task and analyze the role and characteristics of entropy in comparison to reasoning tasks. Building on these findings, we introduce ECVGPO (Entropy Control Visual Grounding Policy Optimization), an interpretable algorithm designed for effective entropy regulation. Through entropy control, the trade-off between exploration and exploitation is better balanced. Experiments show that ECVGPO achieves broad improvements across various benchmarks and models.
- 中文摘要
近年来,利用强化学习微调多模态大型语言模型(MLLMs)取得了显著进展,尤其是随着各种熵控制技术的引入。然而,熵在视觉定位等感知导向任务中的作用和特性,以及控制熵的有效策略,仍然基本未被探索。为解决这个问题,我们聚焦视觉定位任务,并与推理任务对比,分析熵的作用和特性。基于这些发现,我们提出ECVGPO(熵控制视觉定位策略优化),这是一种可解释的算法,旨在有效调节熵。通过熵控制,探索与利用之间的权衡得到了更好的平衡。实验显示,ECVGPO在多个基准和模型上实现了广泛的改进。
PrivLLMSwarm: Privacy-Preserving LLM-Driven UAV Swarms for Secure IoT Surveillance
PrivLLMSwarm:保护隐私的LLM驱动无人机群,实现物联网安全监控
- Authors: Jifar Wakuma Ayana, Huang Qiming
- Subjects:
Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2512.06747
- Pdf link: https://arxiv.org/pdf/2512.06747
- Abstract
Large Language Models (LLMs) are emerging as powerful enablers for autonomous reasoning and natural-language coordination in unmanned aerial vehicle (UAV) swarms operating within Internet of Things (IoT) environments. However, existing LLM-driven UAV systems process sensitive operational data in plaintext, exposing them to privacy and security risks. This work introduces PrivLLMSwarm, a privacy-preserving framework that performs secure LLM inference for UAV swarm coordination through Secure Multi-Party Computation (MPC). The framework incorporates MPC-optimized transformer components with efficient approximations of nonlinear activations, enabling practical encrypted inference on resource-constrained aerial platforms. A fine-tuned GPT-based command generator, enhanced through reinforcement learning in simulation, provides reliable instructions while maintaining confidentiality. Experimental evaluation in urban-scale simulations demonstrates that PrivLLMSwarm achieves high semantic accuracy, low encrypted inference latency, and robust formation control under privacy constraints. Comparative analysis shows PrivLLMSwarm offers a superior privacy-utility balance compared to differential privacy, federated learning, and plaintext baselines. To support reproducibility, the full implementation including source code, MPC components, and a synthetic dataset is publicly available. PrivLLMSwarm establishes a practical foundation for secure, LLM-enabled UAV swarms in privacy-sensitive IoT applications including smart-city monitoring and emergency response.
- 中文摘要
大型语言模型(LLMs)正成为在物联网(IoT)环境中运行的无人机(UAV)集群实现自主推理和自然语言协调的强大推动力。然而,现有的LLM驱动无人机系统以明文处理敏感操作数据,使其面临隐私和安全风险。本研究提出PrivLLMSwarm,这是一个隐私保护框架,通过安全多方计算(MPC)为无人机集群协调执行安全的LLM推理。该框架集成了经MPC优化的Transformer组件,并对非线性激活进行高效近似,使得在资源受限的空中平台上进行实用的加密推理成为可能。经过微调的基于GPT的指令生成器在仿真中通过强化学习得到增强,在保持机密性的同时提供可靠的指令。城市规模仿真中的实验评估表明,PrivLLMSwarm在隐私约束下实现了高语义准确性、低加密推理延迟和稳健的编队控制。对比分析显示,与差分隐私、联邦学习和明文基线相比,PrivLLMSwarm提供了更优的隐私与效用平衡。为了支持可复现性,包括源代码、MPC组件和合成数据集在内的完整实现均已公开。PrivLLMSwarm为隐私敏感的物联网应用(包括智慧城市监控和应急响应)中安全的、由LLM支持的无人机集群奠定了实用基础。
Decouple to Generalize: Context-First Self-Evolving Learning for Data-Scarce Vision-Language Reasoning
解耦以泛化:情境优先的自我进化学习,用于数据稀缺的视觉语言推理
- Authors: Tingyu Li, Zheng Sun, Jingxuan Wei, Siyuan Li, Conghui He, Lijun Wu, Cheng Tan
- Subjects:
Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2512.06835
- Pdf link: https://arxiv.org/pdf/2512.06835
- Abstract
Recent vision-language models (VLMs) achieve remarkable reasoning through reinforcement learning (RL), which provides a feasible solution for realizing continuous self-evolving large vision-language models (LVLMs) in the era of experience. However, RL for VLMs requires abundant high-quality multimodal data, especially challenging in specialized domains like chemistry, earth sciences, and multimodal mathematics. Existing strategies such as synthetic data and self-rewarding mechanisms suffer from limited distributions and alignment difficulties, ultimately causing reward hacking: models exploit high-reward patterns, collapsing policy entropy and destabilizing training. We propose DoGe (Decouple to Generalize), a dual-decoupling framework that guides models to first learn from context rather than problem solving by refocusing on the problem context scenarios overlooked by synthetic data methods. By decoupling learning process into dual components (Thinker and Solver), we reasonably quantify the reward signals of this process and propose a two-stage RL post-training approach from freely exploring context to practically solving tasks. Second, to increase the diversity of training data, DoGe constructs an evolving curriculum learning pipeline: an expanded native domain knowledge corpus and an iteratively evolving seed problems pool. Experiments show that our method consistently outperforms the baseline across various benchmarks, providing a scalable pathway for realizing self-evolving LVLMs.
- 中文摘要
近期的视觉语言模型(VLMs)通过强化学习(RL)实现了卓越的推理能力,这为在经验时代实现持续自我演化的大型视觉语言模型(LVLM)提供了可行的解决方案。然而,VLM的强化学习需要大量高质量的多模态数据,这在化学、地球科学和多模态数学等专业领域尤其具有挑战性。合成数据和自我奖励机制等现有策略存在分布有限和对齐困难的问题,最终导致奖励投机(reward hacking):模型利用高奖励模式,使策略熵坍缩并破坏训练稳定性。我们提出了DoGe(Decouple to Generalize,解耦以泛化),一种双重解耦框架,通过重新聚焦被合成数据方法忽视的问题情境,引导模型首先从情境中学习,而非直接解题。通过将学习过程解耦为两个组成部分(思考者和求解者),我们合理地量化了该过程的奖励信号,并提出了一种从自由探索情境到实际解决任务的两阶段强化学习后训练方法。其次,为了增加训练数据的多样性,DoGe构建了一个不断演进的课程学习流程:扩展的原生领域知识语料库和迭代演进的种子问题池。实验显示,我们的方法在多个基准测试中始终优于基线,为实现自我演化的LVLM提供了可扩展的路径。
JT-DA: Enhancing Data Analysis with Tool-Integrated Table Reasoning Large Language Models
JT-DA:利用工具集成的表格推理大型语言模型增强数据分析
- Authors: Ce Chi, Xing Wang, Zhendong Wang, Xiaofan Liu, Ce Li, Zhiyan Song, Chen Zhao, Kexin Yang, Boshen Shi, Jingjing Yang, Chao Deng, Junlan Feng
- Subjects:
Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2512.06859
- Pdf link: https://arxiv.org/pdf/2512.06859
- Abstract
In this work, we present JT-DA-8B (JiuTian Data Analyst 8B), a specialized large language model designed for complex table reasoning tasks across diverse real-world scenarios. To address the lack of high-quality supervision in tabular reasoning scenarios, we construct a comprehensive and diverse training corpus with 34 well-defined table reasoning tasks, by aggregating 29 public table QA datasets and 3 million tables. An automatic pipeline is proposed to generate realistic multi-step analytical tasks involving reasoning patterns. The model is trained upon open-source JT-Coder-8B model, an 8B-parameter decoder-only foundation model trained from scratch. In the training stage, we leverage LLM-based scoring and workflow-aligned filtering to distill high-quality, table-centric data. Both supervised fine-tuning (SFT) and Reinforcement learning (RL) are adopted to optimize our model. Afterwards, a four-stage table reasoning workflow is proposed, including table preprocessing, table sensing, tool-integrated reasoning, and prompt engineering, to improve model interpretability and execution accuracy. Experimental results show that JT-DA-8B achieves strong performance in various table reasoning tasks, demonstrating the effectiveness of data-centric generation and workflow-driven optimization.
- 中文摘要
本研究介绍了JT-DA-8B(九天数据分析师8B),这是一个专为多种现实场景中的复杂表格推理任务设计的大型语言模型。为解决表格推理场景中缺乏高质量监督的问题,我们通过汇总29个公开表格问答数据集和300万张表格,构建了一个全面且多样化的训练语料库,包含34个定义清晰的表格推理任务。我们提出了一个自动化流程,用于生成涉及推理模式的真实多步分析任务。该模型基于开源的JT-Coder-8B模型训练,后者是一个从零开始训练的80亿参数仅解码器基础模型。在训练阶段,我们利用基于LLM的评分和与工作流对齐的过滤,提炼出高质量、以表格为中心的数据。我们同时采用监督微调(SFT)和强化学习(RL)来优化模型。随后,我们提出了一个四阶段的表格推理工作流,包括表格预处理、表格感知、工具集成推理和提示工程,以提升模型的可解释性和执行准确性。实验结果显示,JT-DA-8B在多种表格推理任务中表现出色,证明了以数据为中心的生成和工作流驱动优化的有效性。
An Analysis of Large Language Models for Simulating User Responses in Surveys
模拟调查用户回答的大型语言模型分析
- Authors: Ziyun Yu, Yiru Zhou, Chen Zhao, Hongyi Wen
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2512.06874
- Pdf link: https://arxiv.org/pdf/2512.06874
- Abstract
Using Large Language Models (LLMs) to simulate user opinions has received growing attention. Yet LLMs, especially those trained with reinforcement learning from human feedback (RLHF), are known to exhibit biases toward dominant viewpoints, raising concerns about their ability to represent users from diverse demographic and cultural backgrounds. In this work, we examine the extent to which LLMs can simulate human responses to cross-domain survey questions through direct prompting and chain-of-thought prompting. We further propose a claim diversification method, CLAIMSIM, which elicits viewpoints from LLM parametric knowledge as contextual input. Experiments on the survey question answering task indicate that, while CLAIMSIM produces more diverse responses, both approaches struggle to accurately simulate users. Further analysis reveals two key limitations: (1) LLMs tend to maintain fixed viewpoints across varying demographic features and generate single-perspective claims; and (2) when presented with conflicting claims, LLMs struggle to reason over nuanced differences among demographic features, limiting their ability to adapt responses to specific user profiles.
- 中文摘要
利用大型语言模型(LLM)模拟用户意见正受到越来越多的关注。然而,LLM(尤其是通过基于人类反馈的强化学习(RLHF)训练的模型)已知会偏向主流观点,这引发了人们对其能否代表来自不同人口和文化背景用户的担忧。本研究考察LLM通过直接提示和思维链提示,能在多大程度上模拟人类对跨领域调查问题的回答。我们还提出了一种主张多样化方法CLAIMSIM,从LLM的参数化知识中引出不同观点,并将其作为上下文输入。调查问答任务上的实验表明,虽然CLAIMSIM能产生更多样化的回答,但两种方法都难以准确模拟用户。进一步分析揭示了两个关键局限:(1)LLM倾向于在不同人口特征下保持固定观点,并生成单一视角的主张;(2)当面对相互矛盾的主张时,LLM难以就人口特征间的细微差别进行推理,限制了其根据特定用户画像调整回答的能力。
Energy-Efficient Navigation for Surface Vehicles in Vortical Flow Fields
涡旋流场中水面航行器的节能导航
- Authors: Rushiraj Gadhvi, Sandeep Manjanna
- Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2512.06912
- Pdf link: https://arxiv.org/pdf/2512.06912
- Abstract
For centuries, khalasi have skillfully harnessed ocean currents to navigate vast waters with minimal effort. Emulating this intuition in autonomous systems remains a significant challenge, particularly for Autonomous Surface Vehicles tasked with long-duration missions under strict energy budgets. In this work, we present a learning-based approach for energy-efficient surface vehicle navigation in vortical flow fields, where partial observability often undermines traditional path-planning methods. We present an end-to-end reinforcement learning framework based on Soft Actor-Critic that learns flow-aware navigation policies using only local velocity measurements. Through extensive evaluation across diverse and dynamically rich scenarios, our method demonstrates substantial energy savings and robust generalization to previously unseen flow conditions, offering a promising path toward long-term autonomy in ocean environments. The navigation paths generated by our proposed approach show an improvement in energy conservation of 30 to 50 percent compared with existing state-of-the-art techniques.
- 中文摘要
几个世纪以来,卡拉西人(khalasi)巧妙利用洋流,以极小的代价航行于广阔水域。在自主系统中模拟这种直觉仍是重大挑战,尤其是对于在严格能量预算下执行长航时任务的自主水面航行器(Autonomous Surface Vehicles)而言。本研究提出了一种基于学习的方法,用于涡旋流场中水面航行器的节能导航;在这类流场中,部分可观测性常常削弱传统路径规划方法的效果。我们提出了一个基于 Soft Actor-Critic 的端到端强化学习框架,仅利用局部流速测量来学习流场感知的导航策略。通过在多样且动态丰富的场景中进行广泛评估,我们的方法展示了显著的节能效果,并能稳健地泛化到此前未见过的流动条件,为海洋环境中的长期自主运行提供了一条有前景的途径。与现有最先进技术相比,所提方法生成的导航路径可节省30%至50%的能量。
Know your Trajectory -- Trustworthy Reinforcement Learning deployment through Importance-Based Trajectory Analysis
了解你的轨迹——通过基于重要性轨迹分析实现可信的强化学习部署
- Authors: Clifford F, Devika Jay, Abhishek Sarkar, Satheesh K Perepu, Santhosh G S, Kaushik Dey, Balaraman Ravindran
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2512.06917
- Pdf link: https://arxiv.org/pdf/2512.06917
- Abstract
As Reinforcement Learning (RL) agents are increasingly deployed in real-world applications, ensuring their behavior is transparent and trustworthy is paramount. A key component of trust is explainability, yet much of the work in Explainable RL (XRL) focuses on local, single-step decisions. This paper addresses the critical need for explaining an agent's long-term behavior through trajectory-level analysis. We introduce a novel framework that ranks entire trajectories by defining and aggregating a new state-importance metric. This metric combines the classic Q-value difference with a "radical term" that captures the agent's affinity to reach its goal, providing a more nuanced measure of state criticality. We demonstrate that our method successfully identifies optimal trajectories from a heterogeneous collection of agent experiences. Furthermore, by generating counterfactual rollouts from critical states within these trajectories, we show that the agent's chosen path is robustly superior to alternatives, thereby providing a powerful "Why this, and not that?" explanation. Our experiments in standard OpenAI Gym environments validate that our proposed importance metric is more effective at identifying optimal behaviors compared to classic approaches, offering a significant step towards trustworthy autonomous systems.
- 中文摘要
随着强化学习(RL)智能体越来越多地部署于现实应用,确保其行为透明且值得信赖至关重要。信任的一个关键组成部分是可解释性,但可解释强化学习(XRL)中的大部分工作都聚焦于局部的单步决策。本文针对通过轨迹层面的分析来解释智能体长期行为这一关键需求。我们引入了一个新框架,通过定义并聚合一种新的状态重要性指标,对整条轨迹进行排序。该指标将经典的Q值差异与一个“激进项”(radical term)结合起来,后者刻画智能体到达目标的倾向,从而提供更细致的状态关键性度量。我们证明了该方法能够从异质的智能体经验集合中成功识别出最优轨迹。此外,通过从这些轨迹中的关键状态生成反事实推演(rollout),我们表明智能体所选路径稳健地优于其他备选路径,从而提供了有力的“为什么是这个,而不是那个?”式解释。我们在标准 OpenAI Gym 环境中的实验验证了所提出的重要性指标比经典方法更能有效识别最优行为,向可信自主系统迈出了重要一步。
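The trajectory-ranking idea in this abstract can be illustrated with a minimal sketch. It assumes the "classic Q-value difference" means the gap between the best and worst action value in a state, which is one common form in the explainable-RL literature. The abstract does not define the "radical term", so it is left out here. Aggregating by the mean, and the names `state_importance` and `rank_trajectories`, are illustrative assumptions rather than the authors' method.

```python
import numpy as np

def state_importance(q_values):
    """Classic importance: gap between the best and worst action value."""
    q = np.asarray(q_values, dtype=float)
    return float(q.max() - q.min())

def rank_trajectories(trajectories):
    """Rank trajectories by mean state importance (highest first).

    Each trajectory is a list of per-state Q-value vectors.
    Mean aggregation is an assumption made for this sketch.
    """
    scores = [float(np.mean([state_importance(q) for q in traj]))
              for traj in trajectories]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores

# Toy data (hypothetical): two short trajectories with two actions per state.
traj_a = [[1.0, 0.9], [0.5, 0.4]]   # actions barely differ in value
traj_b = [[2.0, 0.0], [1.0, 0.5]]   # action choice matters a lot
order, scores = rank_trajectories([traj_a, traj_b])
```

In this toy example `traj_b` ranks first because its states have larger action-value gaps, meaning a wrong action there would cost more.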
Parent-Guided Semantic Reward Model (PGSRM): Embedding-Based Reward Functions for Reinforcement Learning of Transformer Language Models
父模型引导的语义奖励模型(PGSRM):用于Transformer语言模型强化学习的基于嵌入的奖励函数
- Authors: Alexandr Plashchinsky
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2512.06920
- Pdf link: https://arxiv.org/pdf/2512.06920
- Abstract
We introduce the Parent-Guided Semantic Reward Model (PGSRM), a lightweight reward framework for reinforcement learning (RL) of transformer language models. PGSRM replaces binary correctness signals, human preference data, and trained reward models with a simple signal: cosine similarity between a parent model's reference output embedding and a child model's generated output for the same input. This yields a dense, semantically meaningful reward with no human annotation or additional model training. We apply PGSRM on five language tasks and find that it produces smoother reward improvement and more stable PPO dynamics than a binary reward baseline, suggesting that embedding-based semantic rewards are a practical alternative to RLHF-style reward modeling for parent-guided alignment in smaller transformer models.
- 中文摘要
我们提出了父模型引导的语义奖励模型(PGSRM),这是一个用于Transformer语言模型强化学习(RL)的轻量级奖励框架。PGSRM用一个简单的信号替代了二元正确性信号、人类偏好数据和训练得到的奖励模型:对于同一输入,父模型参考输出的嵌入与子模型生成输出之间的余弦相似度。这样得到的奖励稠密且具有语义意义,无需人工标注或额外的模型训练。我们将PGSRM应用于五个语言任务,发现与二元奖励基线相比,它带来更平滑的奖励提升和更稳定的PPO训练动态,表明在较小的Transformer模型中进行父模型引导的对齐时,基于嵌入的语义奖励是RLHF式奖励建模的一种实用替代方案。
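The reward signal described in this abstract can be sketched in a few lines. The sketch assumes both the parent's reference output and the child's generated output have already been mapped to fixed-length embedding vectors; the abstract does not name the embedding model. The function name `semantic_reward` and the example vectors are hypothetical.

```python
import numpy as np

def semantic_reward(parent_emb, child_emb, eps=1e-8):
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    p = np.asarray(parent_emb, dtype=float)
    c = np.asarray(child_emb, dtype=float)
    denom = np.linalg.norm(p) * np.linalg.norm(c)
    return float(p @ c / max(denom, eps))  # eps guards against zero vectors

# Hypothetical embeddings: one child output close to the parent, one far away.
parent = np.array([0.2, 0.7, 0.1])
close = np.array([0.25, 0.65, 0.05])
far = np.array([-0.6, 0.1, 0.9])

r_close = semantic_reward(parent, close)
r_far = semantic_reward(parent, far)
```

Unlike a binary correct/incorrect signal, this value changes smoothly as the child output moves toward the parent reference, which is the sense in which the abstract calls the reward dense.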
Neuro-Vesicles: Neuromodulation Should Be a Dynamical System, Not a Tensor Decoration
神经囊泡:神经调控应是动态系统,而非张量装饰
- Authors: Zilin Li, Weiwei Xu, Vicki Kane
- Subjects: Neural and Evolutionary Computing (cs.NE)
- Arxiv link: https://arxiv.org/abs/2512.06966
- Pdf link: https://arxiv.org/pdf/2512.06966
- Abstract
We introduce Neuro-Vesicles, a framework that augments conventional neural networks with a missing computational layer: a dynamical population of mobile, discrete vesicles that live alongside the network rather than inside its tensors. Each vesicle is a self contained object v = (c, kappa, l, tau, s) carrying a vector payload, type label, location on the graph G = (V, E), remaining lifetime, and optional internal state. Vesicles are emitted in response to activity, errors, or meta signals; migrate along learned transition kernels; probabilistically dock at nodes; locally modify activations, parameters, learning rules, or external memory through content dependent release operators; and finally decay or are absorbed. This event based interaction layer reshapes neuromodulation. Instead of applying the same conditioning tensors on every forward pass, modulation emerges from the stochastic evolution of a vesicle population that can accumulate, disperse, trigger cascades, carve transient pathways, and write structured traces into topological memory. Dense, short lived vesicles approximate familiar tensor mechanisms such as FiLM, hypernetworks, or attention. Sparse, long lived vesicles resemble a small set of mobile agents that intervene only at rare but decisive moments. We give a complete mathematical specification of the framework, including emission, migration, docking, release, decay, and their coupling to learning; a continuous density relaxation that yields differentiable reaction diffusion dynamics on the graph; and a reinforcement learning view where vesicle control is treated as a policy optimized for downstream performance. We also outline how the same formalism extends to spiking networks and neuromorphic hardware such as the Darwin3 chip, enabling programmable neuromodulation on large scale brain inspired computers.
- 中文摘要
我们提出了神经囊泡(Neuro-Vesicles),这一框架为传统神经网络补上了一个缺失的计算层:一个由可移动的离散囊泡组成的动态群体,它们存在于网络旁侧,而非其张量内部。每个囊泡是一个自包含的对象 v = (c, kappa, l, tau, s),携带向量载荷、类型标签、在图 G = (V, E) 上的位置、剩余寿命和可选的内部状态。囊泡响应活动、误差或元信号而被发射;沿学习得到的转移核迁移;以一定概率停靠在节点上;通过依赖内容的释放算子局部修改激活、参数、学习规则或外部存储;最终衰减或被吸收。这一基于事件的交互层重塑了神经调制。调制不再是在每次前向传播中施加相同的条件张量,而是源自囊泡群体的随机演化,该群体可以聚集、扩散、触发级联、开辟瞬态通路,并将结构化的痕迹写入拓扑记忆。稠密且寿命短的囊泡近似于人们熟悉的张量机制,如FiLM、超网络或注意力。稀疏且寿命长的囊泡则类似一小组只在罕见但关键的时刻介入的移动智能体。我们给出了框架的完整数学规范,包括发射、迁移、停靠、释放、衰减及其与学习的耦合;一种连续密度松弛,可在图上得到可微的反应扩散动力学;以及一种强化学习视角,将囊泡控制视为针对下游性能优化的策略。我们还概述了同一形式体系如何推广到脉冲网络和神经形态硬件(如 Darwin3 芯片),从而在大规模类脑计算机上实现可编程的神经调制。
LLM-Driven Composite Neural Architecture Search for Multi-Source RL State Encoding
面向多源强化学习状态编码的LLM驱动复合神经架构搜索
- Authors: Yu Yu, Qian Xie, Nairen Cao, Li Jin
- Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
- Arxiv link: https://arxiv.org/abs/2512.06982
- Pdf link: https://arxiv.org/pdf/2512.06982
- Abstract
Designing state encoders for reinforcement learning (RL) with multiple information sources -- such as sensor measurements, time-series signals, image observations, and textual instructions -- remains underexplored and often requires manual design. We formalize this challenge as a problem of composite neural architecture search (NAS), where multiple source-specific modules and a fusion module are jointly optimized. Existing NAS methods overlook useful side information from the intermediate outputs of these modules -- such as their representation quality -- limiting sample efficiency in multi-source RL settings. To address this, we propose an LLM-driven NAS pipeline that leverages language-model priors and intermediate-output signals to guide sample-efficient search for high-performing composite state encoders. On a mixed-autonomy traffic control task, our approach discovers higher-performing architectures with fewer candidate evaluations than traditional NAS baselines and the LLM-based GENIUS framework.
- 中文摘要
为具有多种信息源(如传感器测量、时间序列信号、图像观测和文本指令)的强化学习(RL)设计状态编码器,这一问题仍缺乏充分探索,且通常需要手工设计。我们将这一挑战形式化为复合神经架构搜索(NAS)问题,其中多个针对特定信息源的模块与一个融合模块被联合优化。现有的NAS方法忽视了这些模块中间输出所提供的有用辅助信息(例如其表示质量),限制了多源强化学习场景下的样本效率。为此,我们提出了一种LLM驱动的NAS流水线,利用语言模型先验和中间输出信号,引导对高性能复合状态编码器进行样本高效的搜索。在一个混合自主交通控制任务上,与传统NAS基线和基于LLM的GENIUS框架相比,我们的方法以更少的候选评估发现了性能更高的架构。
Surrogate compliance modeling enables reinforcement learned locomotion gaits for soft robots
替代柔顺性建模使软体机器人能够通过强化学习获得运动步态
- Authors: Jue Wang, Mingsong Jiang, Luis A.Ramirez, Bilige Yang, Mujun Zhang, Esteban Figueroa, Wenzhong Yan, Rebecca Kramer-Bottiglio
- Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
- Arxiv link: https://arxiv.org/abs/2512.07114
- Pdf link: https://arxiv.org/pdf/2512.07114
- Abstract
Adaptive morphogenetic robots adapt their morphology and control policies to meet changing tasks and environmental conditions. Many such systems leverage soft components, which enable shape morphing but also introduce simulation and control challenges. Soft-body simulators remain limited in accuracy and computational tractability, while rigid-body simulators cannot capture soft-material dynamics. Here, we present a surrogate compliance modeling approach: rather than explicitly modeling soft-body physics, we introduce indirect variables representing soft-material deformation within a rigid-body simulator. We validate this approach using our amphibious robotic turtle, a quadruped with soft morphing limbs designed for multi-environment locomotion. By capturing deformation effects as changes in effective limb length and limb center of mass, and by applying reinforcement learning with extensive randomization of these indirect variables, we achieve reliable policy learning entirely in a rigid-body simulation. The resulting gaits transfer directly to hardware, demonstrating high-fidelity sim-to-real performance on hard, flat substrates and robust, though lower-fidelity, transfer on rheologically complex terrains. The learned closed-loop gaits exhibit unprecedented terrestrial maneuverability and achieve an order-of-magnitude reduction in cost of transport compared to open-loop baselines. Field experiments with the robot further demonstrate stable, multi-gait locomotion across diverse natural terrains, including gravel, grass, and mud.
- 中文摘要
自适应形态发生机器人会根据不断变化的任务和环境条件调整自身形态和控制策略。许多此类系统采用软体部件,这些部件使形状变形成为可能,但也带来了仿真和控制上的挑战。软体仿真器在精度和计算可行性方面仍然有限,而刚体仿真器又无法刻画软材料的动力学。为此,我们提出了一种替代柔顺性建模方法:不显式建模软体物理,而是在刚体仿真器中引入表示软材料变形的间接变量。我们用自己的两栖机器龟验证了这一方法,它是一种四足机器人,具有可变形的软体肢体,专为多环境运动而设计。通过将变形效应刻画为有效肢体长度和肢体质心的变化,并在对这些间接变量进行大范围随机化的条件下应用强化学习,我们完全在刚体仿真中实现了可靠的策略学习。所得步态可直接迁移到硬件上,在坚硬平坦的基底上表现出高保真的仿真到现实(sim-to-real)性能,在流变特性复杂的地形上则实现了稳健但保真度较低的迁移。学到的闭环步态展现出前所未有的陆地机动性,与开环基线相比,运输代价(cost of transport)降低了一个数量级。机器人的野外实验进一步展示了其在砾石、草地和泥地等多种自然地形上稳定的多步态运动能力。
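The randomization of indirect variables mentioned in this abstract can be pictured with a small sketch. It assumes a uniform scaling of a nominal effective limb length and centre-of-mass offset at the start of each training episode. The nominal values, the spread, and the names `LimbSurrogate` and `sample_limb` are all hypothetical; the abstract does not state the distributions the authors used.

```python
import random
from dataclasses import dataclass

@dataclass
class LimbSurrogate:
    effective_length: float  # metres (hypothetical unit)
    com_offset: float        # metres from the joint along the limb

def sample_limb(nominal_length, nominal_com, spread, rng):
    """Draw one randomized limb by scaling nominal values within +/- spread."""
    length_scale = rng.uniform(1.0 - spread, 1.0 + spread)
    com_scale = rng.uniform(1.0 - spread, 1.0 + spread)
    return LimbSurrogate(nominal_length * length_scale,
                         nominal_com * com_scale)

# One draw per limb of a quadruped, e.g. at the start of an episode.
rng = random.Random(0)
limbs = [sample_limb(0.10, 0.05, 0.2, rng) for _ in range(4)]
```

Each sampled limb would then parameterize the rigid-body model for that episode, so the policy sees a range of apparent deformations rather than one fixed geometry.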
TrajMoE: Scene-Adaptive Trajectory Planning with Mixture of Experts and Reinforcement Learning
TrajMoE:结合混合专家(MoE)与强化学习的场景自适应轨迹规划
- Authors: Zebin Xing, Pengxuan Yang, Linbo Wang, Yichen Zhang, Yiming Hu, Yupeng Zheng, Junli Wang, Yinfeng Gao, Guang Li, Kun Ma, Long Chen, Zhongpu Xia, Qichao Zhang, Hangjun Ye, Dongbin Zhao
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2512.07135
- Pdf link: https://arxiv.org/pdf/2512.07135
- Abstract
Current autonomous driving systems often favor end-to-end frameworks, which take sensor inputs like images and learn to map them into trajectory space via neural networks. Previous work has demonstrated that models can achieve better planning performance when provided with a prior distribution of possible trajectories. However, these approaches often overlook two critical aspects: 1) The appropriate trajectory prior can vary significantly across different driving scenarios. 2) Their trajectory evaluation mechanism lacks policy-driven refinement, remaining constrained by the limitations of one-stage supervised training. To address these issues, we explore improvements in two key areas. For problem 1, we employ MoE to apply different trajectory priors tailored to different scenarios. For problem 2, we utilize Reinforcement Learning to fine-tune the trajectory scoring mechanism. Additionally, we integrate models with different perception backbones to enhance perceptual features. Our integrated model achieved a score of 51.08 on the navsim ICCV benchmark, securing third place.
- 中文摘要
当前的自动驾驶系统通常偏好端到端框架,这类框架以图像等传感器输入为起点,通过神经网络学习将其映射到轨迹空间。以往研究表明,当为模型提供可能轨迹的先验分布时,模型能够取得更好的规划性能。然而,这些方法常常忽视两个关键方面:1)合适的轨迹先验在不同驾驶场景中可能有显著差异。2)其轨迹评估机制缺乏策略驱动的细化,仍受限于单阶段监督训练的局限性。为解决这些问题,我们在两个关键方面探索了改进。针对问题1,我们利用MoE为不同场景应用各自适配的轨迹先验。针对问题2,我们利用强化学习微调轨迹评分机制。此外,我们集成了采用不同感知骨干网络的模型,以增强感知特征。我们的集成模型在 navsim ICCV 基准上取得了51.08分,获得第三名。
Think-Reflect-Revise: A Policy-Guided Reflective Framework for Safety Alignment in Large Vision Language Models
思考-反思-修订:一个基于政策引导的大型视觉语言模型安全对齐反思框架
- Authors: Fenghua Weng, Chaochao Lu, Xia Hu, Wenqi Shao, Wenjie Wang
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2512.07141
- Pdf link: https://arxiv.org/pdf/2512.07141
- Abstract
As multimodal reasoning improves the overall capabilities of Large Vision Language Models (LVLMs), recent studies have begun to explore safety-oriented reasoning, aiming to enhance safety awareness by analyzing potential safety risks during the reasoning process before generating the final response. Although such approaches improve safety awareness and interpretability, this single-pass think-then-answer paradigm remains vulnerable to contextual or visual jailbreak attacks. This reveals a critical flaw: single-pass reasoning may overlook explicit harmful content in its own output. Our key insight is to exploit this wasted signal through reflection, which can effectively leverage the malicious content revealed in the first-pass reasoning to enable genuine self-correction and prevent unsafe generations. Motivated by this, we propose Think-Reflect-Revise (TRR), a three-stage training framework designed to enhance the safety alignment of LVLMs through policy-guided self-reflection. We first build a Reflective Safety Reasoning (ReSafe) dataset with 5,000 examples that follow a think-reflect-revise process. We then fine-tune the target model using the ReSafe dataset to initialize reflective behavior, and finally reinforce policy-guided reflection through reinforcement learning. Experimental results show that TRR substantially improves the safety performance of LVLMs across both safety-awareness benchmarks and jailbreak attack evaluations, increasing the overall safe response rate from 42.8% to 87.7% on Qwen2.5-VL-7B, while preserving stable performance on general benchmarks such as MMMU and MMStar. The project page is available at this https URL.
- 中文摘要
随着多模态推理提升了大型视觉语言模型(LVLM)的整体能力,近期研究开始探索以安全为导向的推理,旨在通过分析推理过程中潜在的安全风险来提升安全意识,然后生成最终回应。尽管此类方法提高了安全意识和可解释性,但这种单一思考后回答的范式仍易受到上下文或视觉越狱攻击的威胁。这暴露了一个关键缺陷:单一通过推理可能忽视其输出中明显有害的内容。我们的关键见解是通过反思来利用这些浪费信号,从而有效利用首轮推理中揭示的恶意内容,实现真正的自我纠正,防止不安全的生成。基于此,我们提出了“思考-反思-修订”(TRR),这是一个三阶段培训框架,旨在通过政策引导的自我反思提升LVLM的安全一致性。我们首先构建了一个包含5000个示例的反思安全推理(ReSafe)数据集,这些示例遵循思考-反思-修正的过程。然后,我们利用ReSafe数据集微调目标模型,初始化反思行为,最终通过强化学习强化策略引导反射。实验结果显示,TRR在安全意识基准和越狱攻击评估中显著提升了LVLMS的安全性能,使Qwen2.5-VL-7B的整体安全响应率从42.8%提升至87.7%,同时在MMMU和MMStar等通用基准测试上保持稳定性能。项目页面可在此 https 网址访问。
Less is More: Non-uniform Road Segments are Efficient for Bus Arrival Prediction
少即是多:非均匀的道路段对公交到达预测更有效
- Authors: Zhen Huang, Jiaxin Deng, Jiayu Xu, Junbiao Pang, Haitao Yu
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2512.07200
- Pdf link: https://arxiv.org/pdf/2512.07200
- Abstract
In bus arrival time prediction, the process of organizing road infrastructure network data into homogeneous entities is known as segmentation. Segmenting a road network is widely recognized as the first and most critical step in developing an arrival time prediction system, particularly for auto-regressive-based approaches. Traditional methods typically employ a uniform segmentation strategy, which fails to account for varying physical constraints along roads, such as road conditions, intersections, and points of interest, thereby limiting prediction efficiency. In this paper, we propose a Reinforcement Learning (RL)-based approach to efficiently and adaptively learn non-uniform road segments for arrival time prediction. Our method decouples the prediction process into two stages: 1) Non-uniform road segments are extracted based on their impact scores using the proposed RL framework; and 2) A linear prediction model is applied to the selected segments to make predictions. This method ensures optimal segment selection while maintaining computational efficiency, offering a significant improvement over traditional uniform approaches. Furthermore, our experimental results suggest that the linear approach can even achieve better performance than more complex methods. Extensive experiments demonstrate the superiority of the proposed method, which not only enhances efficiency but also improves learning performance on large-scale benchmarks. The dataset and the code are publicly accessible at: this https URL.
- 中文摘要
在公交到达时间预测中,将道路基础设施网络数据组织成同质实体的过程称为分段。道路网络分段被广泛认为是开发到达时间预测系统的第一步和最关键的步骤,尤其是在基于自回归的方法中。传统方法通常采用统一的分段策略,未能考虑道路沿线不同的物理约束,如路况、路口和兴趣点,从而限制了预测效率。本文提出了一种基于强化学习(RL)的方法,用于高效且自适应地学习非均匀道路段以预测到达时间。我们的方法将预测过程分为两个阶段:1)利用所提出的强化学习框架,根据影响评分提取非均匀道路段;2)对选定的片段应用线性预测模型进行预测。该方法确保了最佳段的选择,同时保持计算效率,显著优于传统均匀方法。此外,我们的实验结果表明,线性方法甚至能比更复杂的方法获得更好的性能。大量实验证明了该方法的优越性,不仅提高了效率,还提升了大规模基准测试中的学习表现。该数据集和代码可公开访问:https URL。
MMRPT: MultiModal Reinforcement Pre-Training via Masked Vision-Dependent Reasoning
MMRPT:基于掩码视觉依赖推理的多模态强化预训练
- Authors: Xuhui Zheng, Kang An, Ziliang Wang, Yuhang Wang, Faqiang Qian, Yichao Wu
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2512.07203
- Pdf link: https://arxiv.org/pdf/2512.07203
- Abstract
Multimodal pre-training remains constrained by the descriptive bias of image-caption pairs, leading models to favor surface linguistic cues over grounded visual understanding. We introduce MMRPT, a masked multimodal reinforcement pre-training framework that strengthens visual reasoning in MLLMs. We are the first to incorporate reinforcement learning directly into the pre-training of large vision-language models, enabling learning signals that reward visual grounding rather than caption imitation. MMRPT constructs masked multimodal data by estimating sentence-level visual dependency via attention over visual tokens and masking highly vision-dependent segments; the model reconstructs these spans through vision-grounded reasoning guided by a semantic-visual reward. Experiments show consistent zero-shot gains across diverse benchmarks and substantially improved robustness under supervised fine-tuning, demonstrating that reinforcement-driven masked reasoning provides a more reliable and generalizable pre-training objective for multimodal models.
- 中文摘要
多模态预训练仍受制于图像-说明对的描述偏见,导致模型倾向于依赖表面语言线索而非扎实的视觉理解。我们介绍MMRPT,一种掩蔽多模态强化预训练框架,强化MLLM中的视觉推理能力。我们率先将强化学习直接纳入大型视觉语言模型的预训练中,能够学习能够奖励视觉基础而非字幕模仿的信号。MMRPT通过对视觉标记的注意力和对高度依赖视觉的片段进行掩蔽,估计句子层面的视觉依赖性,构建了掩蔽多模态数据;该模型通过以视觉为基础的推理,在语义-视觉奖励的引导下重建这些跨度。实验显示,在不同基准测试中零样本获得一致的增益,并在监督微调下鲁棒性显著提升,表明强化驱动的掩蔽推理为多模态模型提供了更可靠且可推广的预训练目标。
Towards Robust Protective Perturbation against DeepFake Face Swapping
迈向针对DeepFake换脸的鲁棒保护性扰动
- Authors: Hengyang Yao, Lin Li, Ke Sun, Jianing Qiu, Huiping Chen
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2512.07228
- Pdf link: https://arxiv.org/pdf/2512.07228
- Abstract
DeepFake face swapping enables highly realistic identity forgeries, posing serious privacy and security risks. A common defence embeds invisible perturbations into images, but these are fragile and often destroyed by basic transformations such as compression or resizing. In this paper, we first conduct a systematic analysis of 30 transformations across six categories and show that protection robustness is highly sensitive to the choice of training transformations, making the standard Expectation over Transformation (EOT) with uniform sampling fundamentally suboptimal. Motivated by this, we propose Expectation Over Learned distribution of Transformation (EOLT), a framework that treats the transformation distribution as a learnable component rather than a fixed design choice. Specifically, EOLT employs a policy network that learns to automatically prioritize critical transformations and adaptively generate instance-specific perturbations via reinforcement learning, enabling explicit modeling of defensive bottlenecks while maintaining broad transferability. Extensive experiments demonstrate that our method achieves substantial improvements over state-of-the-art approaches, with 26% higher average robustness and up to 30% gains on challenging transformation categories.
- 中文摘要
DeepFake面部互换能够实现高度逼真的身份伪造,带来严重的隐私和安全风险。一种常见的防御是在图像中嵌入看不见的扰动,但这些扰动脆弱,常常被压缩或调整大小等基本变换破坏。本文首先对六个类别的30个转换进行了系统分析,显示保护鲁棒性对训练转换的选择极为敏感,使得标准的期望转化(EOT)采用均匀抽样根本性地不优。基于此,我们提出了“学习分布的期望转换”(EOLT),该框架将转化分布视为可学习的组成部分,而非固定的设计选择。具体来说,EOLT采用了策略网络,能够自动优先排序关键变换,并通过强化学习自适应地生成实例特定的扰动,实现防御瓶颈的显式建模,同时保持广泛的可转移性。大量实验表明,我们的方法相较于最先进方法实现了显著提升,平均鲁棒性提高了26%,在具有挑战性的变换类别上提升了最多30%。
SINRL: Socially Integrated Navigation with Reinforcement Learning using Spiking Neural Networks
SINRL:基于脉冲神经网络与强化学习的社会融合导航
- Authors: Florian Tretter, Daniel Flögel, Alexandru Vasilache, Max Grobbel, Jürgen Becker, Sören Hohmann
- Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
- Arxiv link: https://arxiv.org/abs/2512.07266
- Pdf link: https://arxiv.org/pdf/2512.07266
- Abstract
Integrating autonomous mobile robots into human environments requires human-like decision-making and energy-efficient, event-based computation. Despite progress, neuromorphic methods are rarely applied to Deep Reinforcement Learning (DRL) navigation approaches due to unstable training. We address this gap with a hybrid socially integrated DRL actor-critic approach that combines Spiking Neural Networks (SNNs) in the actor with Artificial Neural Networks (ANNs) in the critic and a neuromorphic feature extractor to capture temporal crowd dynamics and human-robot interactions. Our approach enhances social navigation performance and reduces estimated energy consumption by approximately 1.69 orders of magnitude.
- 中文摘要
将自主移动机器人融入人类环境,需要类人的决策能力和节能的基于事件的计算。尽管已有进展,但由于训练不稳定,神经形态方法很少被应用于深度强化学习(DRL)导航方法。我们通过一种混合的社会融合DRL演员-评论家(actor-critic)方法来弥补这一空白:演员网络采用脉冲神经网络(SNN),评论家网络采用人工神经网络(ANN),并配以神经形态特征提取器,以捕捉人群的时序动态和人机交互。我们的方法提升了社交导航性能,并将估计能耗降低了约1.69个数量级。
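For readers who prefer a plain ratio, a reduction of 1.69 orders of magnitude corresponds to a multiplicative factor of 10^1.69, or roughly 49-fold. The snippet below only performs this unit conversion; the variable names are illustrative and no figure beyond the abstract's 1.69 is taken from the paper.

```python
# Convert "orders of magnitude" into a multiplicative factor.
orders = 1.69
factor = 10 ** orders            # how many times lower the energy estimate is
remaining_fraction = 1 / factor  # share of the baseline energy that remains

print(f"about {factor:.0f}x lower, i.e. {remaining_fraction:.1%} of the baseline")
```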
RVLF: A Reinforcing Vision-Language Framework for Gloss-Free Sign Language Translation
RVLF:无注释手语翻译的强化视觉语言框架
- Authors: Zhi Rao, Yucheng Zhou, Benjia Zhou, Yiqing Huang, Sergio Escalera, Jun Wan
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2512.07273
- Pdf link: https://arxiv.org/pdf/2512.07273
- Abstract
Gloss-free sign language translation (SLT) is hindered by two key challenges: inadequate sign representation that fails to capture nuanced visual cues, and sentence-level semantic misalignment in current LLM-based methods, which limits translation quality. To address these issues, we propose a three-stage reinforcing vision-language framework (RVLF). We build a large vision-language model (LVLM) specifically designed for sign language, and then combine it with reinforcement learning (RL) to adaptively enhance translation performance. First, for a sufficient representation of sign language, RVLF introduces an effective semantic representation learning mechanism that fuses skeleton-based motion cues with semantically rich visual features extracted via DINOv2, followed by instruction tuning to obtain a strong SLT-SFT baseline. Then, to reduce sentence-level semantic misalignment, we introduce a GRPO-based optimization strategy that fine-tunes the SLT-SFT model with a reward function combining translation fidelity (BLEU) and sentence completeness (ROUGE), yielding the optimized model termed SLT-GRPO. Our conceptually simple framework yields substantial gains under the gloss-free SLT setting without pre-training on any external large-scale sign language datasets, improving BLEU-4 scores by +5.1, +1.11, +1.4, and +1.61 on the CSL-Daily, PHOENIX-2014T, How2Sign, and OpenASL datasets, respectively. To the best of our knowledge, this is the first work to incorporate GRPO into SLT. Extensive experiments and ablation studies validate the effectiveness of GRPO-based optimization in enhancing both translation quality and semantic consistency.
- 中文摘要
无注释手语翻译(SLT)面临两个主要挑战:手势表现不足未能捕捉细腻的视觉线索,以及当前基于LLM的方法中的句子级语义错位,这限制了翻译质量。为解决这些问题,我们提出了一个三阶段的reinforcing vision-language framework(RVLF)。我们构建了一个专门为手语设计的大型视觉语言模型(LVLM),然后将其与强化学习(RL)结合,以自适应地提升翻译表现。首先,为了实现手语的充分表示,RVLF引入了一种有效的语义表征学习机制,将基于骨架的运动线索与通过DINOv2提取的语义丰富视觉特征融合,随后通过指令调整以获得强的SLT-SFT基线。随后,为了改善句子层面的语义错位,我们引入了基于GRPO的优化策略,通过结合翻译忠实度(BLEU)和句子完整性(ROUGE)的奖励函数对SLT-SFT模型进行微调,得到称为SLT-GRPO的优化模型。我们概念简单,在无注释SLT设置下无需对任何外部大规模手语数据集进行预训练即可获得显著提升,分别在CSL-Daily、PHOENIX-2014T、How2Sign和OpenASL数据集上提升BLEU-4分数+5.1、+1.11、+1.4和+1.61。据我们所知,这是首次将GRPO纳入SLT的研究。大量实验和消融研究验证了基于GRPO的优化在提升翻译质量和语义一致性方面的有效性。
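The GRPO step in this abstract can be illustrated with a minimal sketch of its two ingredients: a scalar reward that mixes BLEU and ROUGE, and the group-relative advantage that GRPO computes by normalizing rewards within a group of outputs sampled for the same input. The equal weights, the function names, and the metric values below are hypothetical placeholders, not results or settings from the paper. The BLEU and ROUGE scores are assumed to be computed elsewhere.

```python
import numpy as np

def combined_reward(bleu, rouge, w_bleu=0.5, w_rouge=0.5):
    """Weighted mix of two precomputed metric scores (weights are assumptions)."""
    return w_bleu * bleu + w_rouge * rouge

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Placeholder (BLEU, ROUGE) pairs for four candidate translations of one video.
scores = [(0.30, 0.50), (0.10, 0.20), (0.45, 0.60), (0.25, 0.40)]
rewards = [combined_reward(b, r) for b, r in scores]
adv = group_relative_advantages(rewards)
```

Candidates that beat their group's average receive a positive advantage and are reinforced, while the others are pushed down, so no separate value network is needed.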
PrivORL: Differentially Private Synthetic Dataset for Offline Reinforcement Learning
PrivORL:用于离线强化学习的差分私有合成数据集
- Authors: Chen Gong, Zheng Liu, Kecen Li, Tianhao Wang
- Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2512.07342
- Pdf link: https://arxiv.org/pdf/2512.07342
- Abstract
Recently, offline reinforcement learning (RL) has become a popular RL paradigm. In offline RL, data providers share pre-collected datasets -- either as individual transitions or sequences of transitions forming trajectories -- to enable the training of RL models (also called agents) without direct interaction with the environments. Offline RL saves interactions with environments compared to traditional RL, and has been effective in critical areas, such as navigation tasks. Meanwhile, concerns about privacy leakage from offline RL datasets have emerged. To safeguard private information in offline RL datasets, we propose the first differential privacy (DP) offline dataset synthesis method, PrivORL, which leverages a diffusion model and diffusion transformer to synthesize transitions and trajectories, respectively, under DP. The synthetic dataset can then be securely released for downstream analysis and research. PrivORL adopts the popular approach of pre-training a synthesizer on public datasets, and then fine-tuning on sensitive datasets using DP Stochastic Gradient Descent (DP-SGD). Additionally, PrivORL introduces curiosity-driven pre-training, which uses feedback from the curiosity module to diversify the synthetic dataset and thus can generate diverse synthetic transitions and trajectories that closely resemble the sensitive dataset. Extensive experiments on five sensitive offline RL datasets show that our method achieves better utility and fidelity in both DP transition and trajectory synthesis compared to baselines. The replication package is available at the GitHub repository.
- 中文摘要
近年来,离线强化学习(RL)已成为一种流行的强化学习范式。在离线强化学习中,数据提供者共享预先收集的数据集——无论是单个转移还是形成轨迹的过渡序列——以便无需与环境直接交互即可训练强化学习模型(也称为代理)。离线强化学习相比传统强化学习节省了与环境的交互,并在关键领域如导航任务中取得了有效效果。与此同时,关于离线强化学习数据集隐私泄露的担忧也随之浮现。为了保护离线强化学习数据集中的隐私信息,我们提出了首个差分隐私(DP)离线数据集合成方法PrivORL,利用扩散模型和扩散变换器分别在DP下合成转移和轨迹。合成数据集随后可以安全地发布,供后续分析和研究使用。PrivORL采用了流行的方法:先在公开数据集上预训练合成器,然后在敏感数据集上使用DP随机梯度下降法(DP-SGD)进行微调。此外,PrivORL引入了好奇心驱动的预训练,利用好奇心模块的反馈来丰富合成数据集,从而生成与敏感数据集极为相似的多样化合成转变和轨迹。对五个敏感的离线强化学习数据集进行的大量实验表明,我们的方法在DP转换和轨迹合成方面均优于基线,实现了更高的实用性和忠实度。复制包可在 GitHub 仓库获取。
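The DP-SGD fine-tuning step mentioned in this abstract follows a well-known recipe: clip each per-example gradient to a fixed norm, sum the clipped gradients, add Gaussian noise scaled to the clipping norm, and average. The sketch below shows only that gradient computation on toy data. The function name and hyperparameter values are hypothetical, and privacy accounting is omitted.

```python
import numpy as np

def dp_sgd_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Return a privatized average gradient for one batch."""
    g = np.asarray(per_example_grads, dtype=float)      # shape (batch, dim)
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    # Scale down any gradient whose L2 norm exceeds clip_norm.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = g * scale
    # Gaussian noise with standard deviation noise_multiplier * clip_norm.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=g.shape[1])
    return (clipped.sum(axis=0) + noise) / g.shape[0]

rng = np.random.default_rng(0)
grads = np.array([[3.0, 4.0],    # norm 5.0, clipped to norm 1.0
                  [0.3, 0.4]])   # norm 0.5, left unchanged
noiseless = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
noisy = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
```

With the noise switched off, the result is simply the average of the clipped gradients. Clipping bounds how much any single record can move the model, and the noise then masks that bounded contribution.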
Multi-Rigid-Body Approximation of Human Hands with Application to Digital Twin
多刚体人手近似及数字孪生应用
- Authors: Bin Zhao, Yiwen Lu, Haohua Zhu, Xiao Li, Sheng Yi
- Subjects: Robotics (cs.RO); Graphics (cs.GR)
- Arxiv link: https://arxiv.org/abs/2512.07359
- Pdf link: https://arxiv.org/pdf/2512.07359
- Abstract
Human hand simulation plays a critical role in digital twin applications, requiring models that balance anatomical fidelity with computational efficiency. We present a complete pipeline for constructing multi-rigid-body approximations of human hands that preserve realistic appearance while enabling real-time physics simulation. Starting from optical motion capture of a specific human hand, we construct a personalized MANO (hand Model with Articulated and Non-rigid defOrmations) model and convert it to a URDF (Unified Robot Description Format) representation with anatomically consistent joint axes. The key technical challenge is projecting MANO's unconstrained SO(3) joint rotations onto the kinematically constrained joints of the rigid-body model. We derive closed-form solutions for single degree-of-freedom joints and introduce a Baker-Campbell-Hausdorff (BCH)-corrected iterative method for two degree-of-freedom joints that properly handles the non-commutativity of rotations. We validate our approach through digital twin experiments where reinforcement learning policies control the multi-rigid-body hand to replay captured human demonstrations. Quantitative evaluation shows sub-centimeter reconstruction error and successful grasp execution across diverse manipulation tasks.
- 中文摘要
人手仿真在数字孪生应用中起着关键作用,需要模型在解剖学保真度与计算效率之间取得平衡。我们提出了一套完整的流程,用于构建人手的多刚体近似模型,既保持逼真的外观,又支持实时物理仿真。从对特定人手的光学动作捕捉出发,我们构建个性化的MANO(Multi-Abstracted hand model with Neural Operations)模型,并将其转换为关节轴符合解剖学的URDF(统一机器人描述格式)表示。关键技术挑战是将MANO的无约束SO(3)关节旋转投影到刚体模型中受运动学约束的关节上。我们推导了单自由度关节的闭式解,并针对两自由度关节引入了一种经Baker-Campbell-Hausdorff(BCH)修正的迭代方法,以正确处理旋转的非交换性。我们通过数字孪生实验验证了该方法:由强化学习策略控制多刚体手,复现采集到的人类演示。定量评估显示,重建误差达到亚厘米级,并且在多种操作任务中都能成功完成抓握。
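As a hedged illustration of the single-degree-of-freedom case mentioned in the abstract above, the sketch below projects an arbitrary rotation matrix onto a hinge joint with a fixed unit axis. It uses a standard least-squares (Frobenius-norm) projection, which is an assumption made here for illustration; the abstract does not state the paper's actual closed-form solution. The function names `rotation_about_axis` and `project_to_hinge` are hypothetical.

```python
import math


def rotation_about_axis(axis, angle):
    """Rodrigues' formula: 3x3 rotation matrix (nested lists) about a unit axis."""
    x, y, z = axis
    c, s = math.cos(angle), math.sin(angle)
    t = 1.0 - c
    return [
        [c + t * x * x, t * x * y - s * z, t * x * z + s * y],
        [t * x * y + s * z, c + t * y * y, t * y * z - s * x],
        [t * x * z - s * y, t * y * z + s * x, c + t * z * z],
    ]


def project_to_hinge(R, axis):
    """Hinge angle theta that maximizes trace(R^T Rot(axis, theta)).

    Maximizing this trace is equivalent to minimizing the Frobenius distance
    between R and a rotation about `axis` (assumed objective, not the paper's).
    """
    # Vector form of the antisymmetric part R - R^T.
    v = (R[2][1] - R[1][2], R[0][2] - R[2][0], R[1][0] - R[0][1])
    sin_term = sum(a * b for a, b in zip(axis, v))
    trace = R[0][0] + R[1][1] + R[2][2]
    # Quadratic form axis^T R axis.
    a_r_a = sum(axis[i] * R[i][j] * axis[j] for i in range(3) for j in range(3))
    return math.atan2(sin_term, trace - a_r_a)


if __name__ == "__main__":
    z_axis = (0.0, 0.0, 1.0)
    R = rotation_about_axis(z_axis, 0.3)
    print(project_to_hinge(R, z_axis))  # recovers the hinge angle, about 0.3
```

A rotation that is already about the hinge axis is recovered exactly, while a rotation about a perpendicular axis projects to an angle of zero.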
Training Language Models to Use Prolog as a Tool
训练语言模型以使用 Prolog 作为工具
- Authors: Niklas Mellgren, Peter Schneider-Kamp, Lukas Galke Poech
- Subjects:
Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2512.07407
- Pdf link: https://arxiv.org/pdf/2512.07407
- Abstract
Ensuring reliable tool use is critical for safe agentic AI systems. Language models frequently produce unreliable reasoning with plausible but incorrect solutions that are difficult to verify. To address this, we investigate fine-tuning models to use Prolog as an external tool for verifiable computation. Using Group Relative Policy Optimization (GRPO), we fine-tune Qwen2.5-3B-Instruct on a cleaned GSM8K-Prolog-Prover dataset while varying (i) prompt structure, (ii) reward composition (execution, syntax, semantics, structure), and (iii) inference protocol: single-shot, best-of-N, and two agentic modes where Prolog is invoked internally or independently. Our reinforcement learning approach outperforms supervised fine-tuning, with our 3B model achieving zero-shot MMLU performance comparable to 7B few-shot results. Our findings reveal that: 1) joint tuning of prompt, reward, and inference shapes program syntax and logic; 2) best-of-N with external Prolog verification maximizes accuracy on GSM8K; 3) agentic inference with internal repair yields superior zero-shot generalization on MMLU-Stem and MMLU-Pro. These results demonstrate that grounding model reasoning in formal verification systems substantially improves reliability and auditability for safety-critical applications. The source code for reproducing our experiments is available under this https URL
- 中文摘要
确保可靠的工具使用对于安全的智能体AI系统至关重要。语言模型经常产生不可靠的推理,给出看似合理但实际错误、且难以验证的解答。为此,我们研究如何微调模型,使其将Prolog用作可验证计算的外部工具。我们使用组相对策略优化(GRPO),在清理后的GSM8K-Prolog-Prover数据集上微调Qwen2.5-3B-Instruct,同时改变(i)提示结构,(ii)奖励组成(执行、语法、语义、结构),以及(iii)推理协议:单次生成、best-of-N,以及两种智能体模式(Prolog在内部调用或独立调用)。我们的强化学习方法优于监督微调,3B模型的零样本MMLU性能可与7B模型的少样本结果相媲美。我们的发现表明:1)提示、奖励和推理协议的联合调优会影响程序的语法和逻辑;2)结合外部Prolog验证的best-of-N在GSM8K上取得最高准确率;3)带内部修复的智能体推理在MMLU-Stem和MMLU-Pro上实现了更优的零样本泛化。这些结果表明,将模型推理建立在形式化验证系统之上,可显著提升安全关键应用的可靠性和可审计性。复现实验的源代码可在此 https URL 获取
Adaptive Tuning of Parameterized Traffic Controllers via Multi-Agent Reinforcement Learning
通过多智能体强化学习对参数化交通控制器进行自适应调优
- Authors: Giray Önür, Azita Dabiri, Bart De Schutter
- Subjects:
Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2512.07417
- Pdf link: https://arxiv.org/pdf/2512.07417
- Abstract
Effective traffic control is essential for mitigating congestion in transportation networks. Conventional traffic management strategies, including route guidance, ramp metering, and traffic signal control, often rely on state feedback controllers, used for their simplicity and reactivity; however, they lack the adaptability required to cope with complex and time-varying traffic dynamics. This paper proposes a multi-agent reinforcement learning framework in which each agent adaptively tunes the parameters of a state feedback traffic controller, combining the reactivity of state feedback controllers with the adaptability of reinforcement learning. By tuning parameters at a lower frequency rather than directly determining control actions at a high frequency, the reinforcement learning agents achieve improved training efficiency while maintaining adaptability to varying traffic conditions. The multi-agent structure further enhances system robustness, as local controllers can operate independently in the event of partial failures. The proposed framework is evaluated on a simulated multi-class transportation network under varying traffic conditions. Results show that the proposed multi-agent framework outperforms the no control and fixed-parameter state feedback control cases, while performing on par with the single-agent RL-based adaptive state feedback control, with a much better resilience to partial failures.
- 中文摘要
有效的交通控制对于缓解交通网络中的拥堵至关重要。传统的交通管理策略,包括路径诱导、匝道控制和交通信号控制,通常依赖状态反馈控制器,因其简单且响应迅速而被采用;然而,它们缺乏应对复杂时变交通动态所需的适应性。本文提出了一种多智能体强化学习框架,其中每个智能体自适应地调整状态反馈交通控制器的参数,从而结合状态反馈控制器的响应性与强化学习的适应性。强化学习智能体以较低频率调整参数,而非以高频率直接决定控制动作,因此在保持对不同交通状况适应性的同时提高了训练效率。多智能体结构进一步增强了系统的鲁棒性,因为在发生部分故障时,本地控制器仍可独立运行。该框架在模拟的多类别交通网络上、在不同交通条件下进行了评估。结果显示,所提出的多智能体框架优于无控制和固定参数状态反馈控制两种情形,与单智能体的基于强化学习的自适应状态反馈控制表现相当,同时对部分故障的韧性明显更强。
Revolutionizing Mixed Precision Quantization: Towards Training-free Automatic Proxy Discovery via Large Language Models
革新混合精度量化:迈向由大型语言模型实现的免训练自动代理指标发现
- Authors: Haidong Kang, Jun Du, Lihong Lin
- Subjects:
Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2512.07419
- Pdf link: https://arxiv.org/pdf/2512.07419
- Abstract
Mixed-Precision Quantization (MPQ) liberates the Deep Neural Networks (DNNs) from the Out-Of-Memory (OOM) bottleneck, which garnered increasing research attention. However, conventional methods either searched from costly differentiable optimization, which is neither efficient nor flexible, or learned a quantized DNN from the proxy (i.e., HAWQ) manually designed by human experts, which is labor-intensive and requires huge expert knowledge. Can we design a proxy without involving any human experts and training? In this paper, we provide an affirmative answer by proposing a novel Large Language Models (LLMs)-driven Training-free Automatic Proxy (dubbed TAP) discovery framework, which reforms the design paradigm of MPQ by utilizing LLMs to find superior TAP tailored for MPQ, automatically. In addition, to bridge the gap between black-box LLMs and the tough MPQ task, we ingeniously propose simple Direct Policy Optimization (DPO) based reinforcement learning to enhance LLMs' reasoning by optimizing prompts, which can construct a positive feedback loop between the LLM and the MPQ task, enabling LLMs to generate better TAP in the next evolution. Extensive experiments on mainstream benchmarks demonstrate that TAP achieves state-of-the-art performance. Finally, we truly believe that our TAP will significantly contribute to the MPQ community by providing a new perspective on LLM-driven design algorithms.
- 中文摘要
混合精度量化(MPQ)将深度神经网络(DNN)从内存不足(OOM)瓶颈中解放出来,因而受到越来越多的研究关注。然而,传统方法要么依赖代价高昂的可微优化进行搜索,既不高效也不灵活;要么借助由人类专家手工设计的代理指标(即HAWQ)来学习量化DNN,既耗费人力,又需要大量专家知识。我们能否在不依赖任何人类专家、也不进行训练的情况下设计代理指标?本文给出了肯定的答案:我们提出了一种新颖的、由大型语言模型(LLM)驱动的免训练自动代理指标(称为TAP)发现框架,利用LLM自动寻找适用于MPQ的更优TAP,从而革新了MPQ的设计范式。此外,为了弥合黑盒LLM与困难的MPQ任务之间的差距,我们提出了一种基于直接策略优化(DPO)的简单强化学习方法,通过优化提示来增强LLM的推理能力,从而在LLM与MPQ任务之间形成正反馈循环,使LLM在下一轮演化中生成更好的TAP。在主流基准上的大量实验表明,TAP达到了最先进的性能。最后,我们相信TAP将为LLM驱动的设计算法提供新的视角,从而对MPQ社区做出重要贡献。
KAN-Dreamer: Benchmarking Kolmogorov-Arnold Networks as Function Approximators in World Models
KAN-Dreamer:将Kolmogorov-Arnold网络作为世界模型中的函数逼近器进行基准测试
- Authors: Chenwei Shi, Xueyu Luan
- Subjects:
Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2512.07437
- Pdf link: https://arxiv.org/pdf/2512.07437
- Abstract
DreamerV3 is a state-of-the-art online model-based reinforcement learning (MBRL) algorithm known for remarkable sample efficiency. Concurrently, Kolmogorov-Arnold Networks (KANs) have emerged as a promising alternative to Multi-Layer Perceptrons (MLPs), offering superior parameter efficiency and interpretability. To mitigate KANs' computational overhead, variants like FastKAN leverage Radial Basis Functions (RBFs) to accelerate inference. In this work, we investigate integrating KAN architectures into the DreamerV3 framework. We introduce KAN-Dreamer, replacing specific MLP and convolutional components of DreamerV3 with KAN and FastKAN layers. To ensure efficiency within the JAX-based World Model, we implement a tailored, fully vectorized version with simplified grid management. We structure our investigation into three subsystems: Visual Perception, Latent Prediction, and Behavior Learning. Empirical evaluations on the DeepMind Control Suite (walker_walk) analyze sample efficiency, training time, and asymptotic performance. Experimental results demonstrate that utilizing our adapted FastKAN as a drop-in replacement for the Reward and Continue predictors yields performance on par with the original MLP-based architecture, maintaining parity in both sample efficiency and training speed. This report serves as a preliminary study for future developments in KAN-based world models.
- 中文摘要
DreamerV3 是一种最先进的在线基于模型的强化学习(MBRL)算法,以出色的样本效率著称。与此同时,Kolmogorov-Arnold 网络(KAN)已成为多层感知器(MLP)的一种有前景的替代方案,具有更高的参数效率和可解释性。为了降低 KAN 的计算开销,FastKAN 等变体利用径向基函数(RBF)来加速推理。本研究探讨如何将 KAN 架构集成到 DreamerV3 框架中。我们提出了 KAN-Dreamer,用 KAN 和 FastKAN 层替换 DreamerV3 中特定的 MLP 和卷积组件。为了保证在基于 JAX 的世界模型中的效率,我们实现了一个定制的、完全向量化的版本,并简化了网格管理。我们将研究分为三个子系统:视觉感知、潜在预测和行为学习。在 DeepMind Control Suite(walker_walk)上的实证评估分析了样本效率、训练时间和渐近性能。实验结果表明,用我们改造的 FastKAN 直接替换奖励(Reward)和继续(Continue)预测器,性能与原始基于 MLP 的架构相当,在样本效率和训练速度上均保持持平。本报告可作为基于 KAN 的世界模型未来发展的初步研究。
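To make the RBF idea in the abstract above concrete, here is a minimal sketch of a FastKAN-style layer in plain Python: each scalar input is expanded on a fixed grid of Gaussian radial basis functions, and each output is a learned linear combination of those features. This is an assumption-laden illustration, not the authors' JAX implementation; the names `rbf_basis` and `rbf_kan_layer`, the grid, and the width are hypothetical, and details such as normalization or a residual base branch are omitted.

```python
import math


def rbf_basis(x, centers, width):
    """Gaussian RBF features of a scalar input on a fixed grid of centers."""
    return [math.exp(-(((x - c) / width) ** 2)) for c in centers]


def rbf_kan_layer(inputs, weights, centers, width):
    """Apply one RBF-based KAN-style layer.

    weights[o][i][k] is the coefficient linking basis function k of input i
    to output o, so every input-output edge carries its own learnable
    univariate function (a weighted sum of RBFs).
    """
    features = [rbf_basis(x, centers, width) for x in inputs]
    outputs = []
    for w_out in weights:
        total = 0.0
        for w_edge, feats in zip(w_out, features):
            total += sum(w * f for w, f in zip(w_edge, feats))
        outputs.append(total)
    return outputs


if __name__ == "__main__":
    centers = [-1.0, 0.0, 1.0]  # hypothetical 3-point grid
    # One input, one output; only the basis centered at 0 is active.
    weights = [[[0.0, 1.0, 0.0]]]
    print(rbf_kan_layer([0.0], weights, centers, width=1.0))  # [1.0]
```

Because the Gaussian features are cheap closed-form expressions, the whole layer reduces to an elementwise exponential followed by a matrix product, which is what makes this family of variants easy to vectorize.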
From Show Programmes to Data: Designing a Workflow to Make Performing Arts Ephemera Accessible Through Language Models
从演出节目到数据:设计工作流程,通过语言模型使表演艺术零碎资料变得可访问
- Authors: Clarisse Bardiot, Pierre-Carl Langlais, Bernard Jacquemin, Jacob Hart, Antonios Lagarias, Nicolas Foucault, Aurélie Lemaître-Legargeant, Jeanne Fras
- Subjects:
Information Retrieval (cs.IR)
- Arxiv link: https://arxiv.org/abs/2512.07452
- Pdf link: https://arxiv.org/pdf/2512.07452
- Abstract
Many heritage institutions hold extensive collections of theatre programmes, which remain largely underused due to their complex layouts and lack of structured metadata. In this paper, we present a workflow for transforming such documents into structured data using a combination of multimodal large language models (LLMs), an ontology-based reasoning model, and a custom extension of the Linked Art framework. We show how vision-language models can accurately parse and transcribe born-digital and digitised programmes, achieving over 98% of correct extraction. To overcome the challenges of semantic annotation, we train a reasoning model (POntAvignon) using reinforcement learning with both formal and semantic rewards. This approach enables automated RDF triple generation and supports alignment with existing knowledge graphs. Through a case study based on the Festival d'Avignon corpus, we demonstrate the potential for large-scale, ontology-driven analysis of performing arts data. Our results open new possibilities for interoperable, explainable, and sustainable computational theatre historiography.
- 中文摘要
许多文化遗产机构收藏了大量戏剧节目单,但由于版式复杂且缺乏结构化元数据,这些节目单大多未被充分利用。本文提出了一种工作流,结合多模态大型语言模型(LLM)、基于本体的推理模型以及Linked Art框架的自定义扩展,将此类文档转换为结构化数据。我们展示了视觉语言模型如何准确解析和转录原生数字及数字化的节目单,正确提取率超过98%。为了克服语义标注的挑战,我们使用同时包含形式奖励和语义奖励的强化学习训练了一个推理模型(POntAvignon)。该方法实现了RDF三元组的自动生成,并支持与现有知识图谱对齐。通过基于阿维尼翁戏剧节(Festival d'Avignon)语料库的案例研究,我们展示了对表演艺术数据进行大规模、本体驱动分析的潜力。我们的研究结果为可互操作、可解释且可持续的计算戏剧史学开辟了新的可能。
Gait-Adaptive Perceptive Humanoid Locomotion with Real-Time Under-Base Terrain Reconstruction
基于实时基座下方地形重建的步态自适应感知式人形机器人运动
- Authors: Haolin Song, Hongbo Zhu, Tao Yu, Yan Liu, Mingqi Yuan, Wengang Zhou, Hua Chen, Houqiang Li
- Subjects:
Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2512.07464
- Pdf link: https://arxiv.org/pdf/2512.07464
- Abstract
For full-size humanoid robots, even with recent advances in reinforcement learning-based control, achieving reliable locomotion on complex terrains, such as long staircases, remains challenging. In such settings, limited perception, ambiguous terrain cues, and insufficient adaptation of gait timing can cause even a single misplaced or mistimed step to result in rapid loss of balance. We introduce a perceptive locomotion framework that merges terrain sensing, gait regulation, and whole-body control into a single reinforcement learning policy. A downward-facing depth camera mounted under the base observes the support region around the feet, and a compact U-Net reconstructs a dense egocentric height map from each frame in real time, operating at the same frequency as the control loop. The perceptual height map, together with proprioceptive observations, is processed by a unified policy that produces joint commands and a global stepping-phase signal, allowing gait timing and whole-body posture to be adapted jointly to the commanded motion and local terrain geometry. We further adopt a single-stage successive teacher-student training scheme for efficient policy learning and knowledge transfer. Experiments conducted on a 31-DoF, 1.65 m humanoid robot demonstrate robust locomotion in both simulation and real-world settings, including forward and backward stair ascent and descent, as well as crossing a 46 cm gap. Project Page: this https URL
- 中文摘要
对于全尺寸人形机器人而言,尽管基于强化学习的控制近年来取得了进展,在长楼梯等复杂地形上实现可靠的行走仍然具有挑战性。在这类场景中,感知受限、地形线索模糊以及步态时序适应不足,都可能使哪怕一步落脚位置或时机的失误导致迅速失去平衡。我们提出了一个感知式运动框架,将地形感知、步态调节和全身控制整合到单一的强化学习策略中。安装在基座下方的下视深度相机观察双脚周围的支撑区域,一个紧凑的U-Net从每一帧实时重建稠密的以自身为中心的高度图,运行频率与控制回路相同。感知高度图与本体感觉观测一起由统一的策略处理,输出关节指令和全局步态相位信号,使步态时序和全身姿态能够根据指令运动和局部地形几何共同调整。我们还采用单阶段连续式的师生训练方案,以实现高效的策略学习和知识迁移。在一台31自由度、身高1.65米的人形机器人上进行的实验展示了其在仿真和真实环境中的稳健运动,包括正向和倒退上下楼梯,以及跨越46厘米的间隙。项目页面:此 https URL
Enhancing Agentic RL with Progressive Reward Shaping and Value-based Sampling Policy Optimization
通过渐进式奖励塑造和基于价值的采样策略优化增强智能体强化学习
- Authors: Zhuoran Zhuang, Ye Chen, Jianghao Su, Chao Luo, Luhui Liu, Xia Zeng
- Subjects:
Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2512.07478
- Pdf link: https://arxiv.org/pdf/2512.07478
- Abstract
Large Language Models (LLMs) empowered with Tool-Integrated Reasoning (TIR) can iteratively plan, call external tools, and integrate returned information to solve complex, long-horizon reasoning tasks. Agentic Reinforcement Learning (Agentic RL) optimizes such models over full tool-interaction trajectories, but two key challenges hinder effectiveness: (1) Sparse, non-instructive rewards, such as binary 0-1 verifiable signals, provide limited guidance for intermediate steps and slow convergence; (2) Gradient degradation in Group Relative Policy Optimization (GRPO), where identical rewards within a rollout group yield zero advantage, reducing sample efficiency and destabilizing training. To address these challenges, we propose two complementary techniques: Progressive Reward Shaping (PRS) and Value-based Sampling Policy Optimization (VSPO). PRS is a curriculum-inspired reward design that introduces dense, stage-wise feedback - encouraging models to first master parseable and properly formatted tool calls, then optimize for factual correctness and answer quality. We instantiate PRS for short-form QA (with a length-aware BLEU to fairly score concise answers) and long-form QA (with LLM-as-a-Judge scoring to prevent reward hacking). VSPO is an enhanced GRPO variant that replaces low-value samples with prompts selected by a task-value metric balancing difficulty and uncertainty, and applies value-smoothing clipping to stabilize gradient updates. Experiments on multiple short-form and long-form QA benchmarks show that PRS consistently outperforms traditional binary rewards, and VSPO achieves superior stability, faster convergence, and higher final performance compared to PPO, GRPO, CISPO, and SFT-only baselines. Together, PRS and VSPO yield LLM-based TIR agents that generalize better across domains.
- 中文摘要
具备工具集成推理(TIR)能力的大型语言模型(LLM)能够迭代地进行规划、调用外部工具并整合返回的信息,以解决复杂的长程推理任务。智能体强化学习(Agentic Reinforcement Learning,简称Agentic RL)在完整的工具交互轨迹上优化此类模型,但有两个关键挑战制约了其效果:(1)稀疏且缺乏指导性的奖励,如二元0-1可验证信号,对中间步骤的指导有限,收敛缓慢;(2)组相对策略优化(GRPO)中的梯度退化,即同一rollout组内奖励完全相同时优势为零,从而降低样本效率并使训练不稳定。为应对这些挑战,我们提出了两种互补的技术:渐进式奖励塑造(PRS)和基于价值的采样策略优化(VSPO)。PRS是一种受课程学习启发的奖励设计,引入密集的分阶段反馈,鼓励模型先掌握可解析且格式正确的工具调用,再优化事实正确性和答案质量。我们针对短答案问答(使用长度感知的BLEU,以公平地为简洁答案评分)和长答案问答(使用LLM-as-a-Judge评分,以防止奖励作弊)分别实例化了PRS。VSPO是一种增强的GRPO变体,它用按任务价值指标(兼顾难度和不确定性)选出的提示替换低价值样本,并应用价值平滑裁剪来稳定梯度更新。在多个短答案和长答案问答基准上的实验表明,PRS始终优于传统的二元奖励,VSPO相比PPO、GRPO、CISPO和仅SFT的基线具有更好的稳定性、更快的收敛速度和更高的最终性能。PRS与VSPO相结合,得到了跨领域泛化能力更强的基于LLM的TIR智能体。
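The gradient-degradation issue described in the abstract above can be seen directly from how group-relative advantages are computed. The sketch below is a minimal illustration of the usual GRPO-style normalization (reward minus group mean, divided by group standard deviation); it is not the VSPO method, and the function name `group_relative_advantages`, the `eps` constant, and the choice of population standard deviation are assumptions, since implementations differ on these details.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one rollout group (GRPO-style).

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption here
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # All rollouts solved the task: every advantage is zero, so this
    # prompt contributes no policy-gradient signal.
    print(group_relative_advantages([1, 1, 1, 1]))  # [0.0, 0.0, 0.0, 0.0]

    # Mixed outcomes: successes get positive and failures negative advantages.
    print(group_relative_advantages([1, 0, 1, 0]))
```

With binary 0-1 rewards, groups that are entirely correct or entirely wrong are common for very easy or very hard prompts, which is why such prompts waste rollouts unless they are filtered or replaced.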
How Do LLMs Fail In Agentic Scenarios? A Qualitative Analysis of Success and Failure Scenarios of Various LLMs in Agentic Simulations
大型语言模型在智能体场景中如何失败?对多种大型语言模型在智能体模拟中成功与失败场景的定性分析
- Authors: JV Roig
- Subjects:
Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
- Arxiv link: https://arxiv.org/abs/2512.07497
- Pdf link: https://arxiv.org/pdf/2512.07497
- Abstract
We investigate how large language models (LLMs) fail when operating as autonomous agents with tool-use capabilities. Using the Kamiwaza Agentic Merit Index (KAMI) v0.1 benchmark, we analyze 900 execution traces from three representative models - Granite 4 Small, Llama 4 Maverick, and DeepSeek V3.1 - across filesystem, text extraction, CSV analysis, and SQL scenarios. Rather than focusing on aggregate scores, we perform fine-grained, per-trial behavioral analysis to surface the strategies that enable successful multi-step tool execution and the recurrent failure modes that undermine reliability. Our findings show that model scale alone does not predict agentic robustness: Llama 4 Maverick (400B) performs only marginally better than Granite 4 Small (32B) in some uncertainty-driven tasks, while DeepSeek V3.1's superior reliability derives primarily from post-training reinforcement learning rather than architecture or size. Across models, we identify four recurring failure archetypes: premature action without grounding, over-helpfulness that substitutes missing entities, vulnerability to distractor-induced context pollution, and fragile execution under load. These patterns highlight the need for agentic evaluation methods that emphasize interactive grounding, recovery behavior, and environment-aware adaptation, suggesting that reliable enterprise deployment requires not just stronger models but deliberate training and design choices that reinforce verification, constraint discovery, and adherence to source-of-truth data.
- 中文摘要
我们研究大型语言模型(LLM)在作为具备工具使用能力的自主智能体运行时是如何失败的。利用 Kamiwaza Agentic Merit Index(KAMI)v0.1 基准,我们分析了三个代表性模型(Granite 4 Small、Llama 4 Maverick 和 DeepSeek V3.1)的 900 条执行轨迹,涵盖文件系统、文本提取、CSV 分析和 SQL 场景。我们不关注汇总得分,而是进行细粒度的逐次试验行为分析,揭示使多步骤工具执行得以成功的策略,以及削弱可靠性的反复出现的失败模式。我们的发现表明,仅凭模型规模无法预测智能体的鲁棒性:Llama 4 Maverick(400B)在某些由不确定性驱动的任务中仅略优于 Granite 4 Small(32B),而 DeepSeek V3.1 更高的可靠性主要来自后训练阶段的强化学习,而非架构或规模。在各个模型中,我们识别出四种反复出现的失败原型:缺乏事实依据的过早行动、以替代缺失实体为表现的过度帮助、易受干扰项引发的上下文污染影响,以及负载下的脆弱执行。这些模式凸显了智能体评估方法需要强调交互式事实依据获取(grounding)、恢复行为和环境感知适应,也表明可靠的企业部署不仅需要更强的模型,还需要有意识的训练和设计选择,以强化验证、约束发现以及对事实来源(source-of-truth)数据的遵循。
Model-Based Reinforcement Learning Under Confounding
混杂条件下的基于模型的强化学习
- Authors: Nishanth Venkatesh, Andreas A. Malikopoulos
- Subjects:
Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2512.07528
- Pdf link: https://arxiv.org/pdf/2512.07528
- Abstract
We investigate model-based reinforcement learning in contextual Markov decision processes (C-MDPs) in which the context is unobserved and induces confounding in the offline dataset. In such settings, conventional model-learning methods are fundamentally inconsistent, as the transition and reward mechanisms generated under a behavioral policy do not correspond to the interventional quantities required for evaluating a state-based policy. To address this issue, we adapt a proximal off-policy evaluation approach that identifies the confounded reward expectation using only observable state-action-reward trajectories under mild invertibility conditions on proxy variables. When combined with a behavior-averaged transition model, this construction yields a surrogate MDP whose Bellman operator is well defined and consistent for state-based policies, and which integrates seamlessly with the maximum causal entropy (MaxCausalEnt) model-learning framework. The proposed formulation enables principled model learning and planning in confounded environments where contextual information is unobserved, unavailable, or impractical to collect.
- 中文摘要
我们研究上下文马尔可夫决策过程(C-MDP)中的基于模型的强化学习,其中上下文不可观测,并在离线数据集中引入混杂。在此类设定下,传统的模型学习方法从根本上是不一致的,因为在行为策略下产生的转移和奖励机制并不对应于评估基于状态的策略所需的干预量。为解决这一问题,我们改造了一种近端(proximal)离策略评估方法,在代理变量满足温和的可逆性条件下,仅利用可观测的状态-动作-奖励轨迹即可识别受混杂影响的奖励期望。将其与行为平均的转移模型相结合,可以得到一个替代MDP,其Bellman算子对于基于状态的策略是良定义且一致的,并且能够与最大因果熵(MaxCausalEnt)模型学习框架无缝集成。所提出的形式化方法使得在上下文信息不可观测、不可获得或难以收集的混杂环境中,能够进行有原则的模型学习和规划。
ReLaX: Reasoning with Latent Exploration for Large Reasoning Models
ReLaX:大型推理模型中的潜在探索推理
- Authors: Shimin Zhang, Xianwei Chen, Yufan Shen, Ziyuan Ye, Jibin Wu
- Subjects:
Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2512.07558
- Pdf link: https://arxiv.org/pdf/2512.07558
- Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated remarkable potential in enhancing the reasoning capability of Large Reasoning Models (LRMs). However, RLVR often leads to entropy collapse, resulting in premature policy convergence and performance saturation. While manipulating token-level entropy has proven effective for promoting policy exploration, we argue that the latent dynamics underlying token generation encode a far richer computational structure for steering policy optimization toward a more effective exploration-exploitation tradeoff. To enable tractable analysis and intervention of the latent dynamics of LRMs, we leverage Koopman operator theory to obtain a linearized representation of their hidden-state dynamics. This enables us to introduce Dynamic Spectral Dispersion (DSD), a new metric to quantify the heterogeneity of the model's latent dynamics, serving as a direct indicator of policy exploration. Building upon these foundations, we propose Reasoning with Latent eXploration (ReLaX), a paradigm that explicitly incorporates latent dynamics to regulate exploration and exploitation during policy optimization. Comprehensive experiments across a wide range of multimodal and text-only reasoning benchmarks show that ReLaX significantly mitigates premature convergence and consistently achieves state-of-the-art performance.
- 中文摘要
带可验证奖励的强化学习(RLVR)最近在提升大型推理模型(LRM)的推理能力方面展现出显著潜力。然而,RLVR常常导致熵坍缩,造成策略过早收敛和性能饱和。虽然操控词元(token)级熵已被证明能有效促进策略探索,但我们认为,词元生成背后的潜在动态编码了丰富得多的计算结构,可用于引导策略优化实现更有效的探索-利用权衡。为了能够对LRM的潜在动态进行可处理的分析和干预,我们利用Koopman算子理论获得其隐状态动态的线性化表示。这使我们能够引入动态谱离散度(Dynamic Spectral Dispersion,DSD),这一新指标用于量化模型潜在动态的异质性,可作为策略探索的直接指标。在此基础上,我们提出了潜在探索推理(Reasoning with Latent eXploration,ReLaX)范式,显式地利用潜在动态来调节策略优化过程中的探索与利用。在大量多模态和纯文本推理基准上的综合实验表明,ReLaX显著缓解了过早收敛,并始终取得最先进的性能。
Understanding Individual Decision-Making in Multi-Agent Reinforcement Learning: A Dynamical Systems Approach
理解多智能体强化学习中的个体决策:一种动态系统方法
- Authors: James Rudd-Jones, María Pérez-Ortiz, Mirco Musolesi
- Subjects:
Multiagent Systems (cs.MA)
- Arxiv link: https://arxiv.org/abs/2512.07588
- Pdf link: https://arxiv.org/pdf/2512.07588
- Abstract
Analysing learning behaviour in Multi-Agent Reinforcement Learning (MARL) environments is challenging, in particular with respect to individual decision-making. Practitioners frequently tend to study or compare MARL algorithms from a qualitative perspective largely due to the inherent stochasticity in practical algorithms arising from random dithering exploration strategies, environment transition noise, and stochastic gradient updates to name a few. Traditional analytical approaches, such as replicator dynamics, often rely on mean-field approximations to remove stochastic effects, but this simplification, whilst able to provide general overall trends, might lead to dissonance between analytical predictions and actual realisations of individual trajectories. In this paper, we propose a novel perspective on MARL systems by modelling them as coupled stochastic dynamical systems, capturing both agent interactions and environmental characteristics. Leveraging tools from dynamical systems theory, we analyse the stability and sensitivity of agent behaviour at individual level, which are key dimensions for their practical deployments, for example, in presence of strict safety requirements. This framework allows us, for the first time, to rigorously study MARL dynamics taking into consideration their inherent stochasticity, providing a deeper understanding of system behaviour and practical insights for the design and control of multi-agent learning processes.
- 中文摘要
在多智能体强化学习(MARL)环境中分析学习行为具有挑战性,尤其是在个体决策方面。实践者往往从定性视角研究或比较MARL算法,这主要源于实际算法中固有的随机性,例如随机抖动探索策略、环境转移噪声和随机梯度更新等。传统的分析方法(如复制者动力学)通常依赖平均场近似来消除随机效应,但这种简化虽然能够给出总体趋势,却可能导致分析预测与个体轨迹的实际实现之间不一致。本文通过将MARL系统建模为耦合随机动力系统,提出了一种新颖的视角,同时刻画智能体之间的交互和环境特性。借助动力系统理论的工具,我们在个体层面分析了智能体行为的稳定性和敏感性,这些是其实际部署(例如存在严格安全要求时)的关键维度。该框架首次使我们能够在考虑其固有随机性的前提下严格研究MARL动态,从而更深入地理解系统行为,并为多智能体学习过程的设计和控制提供实用见解。
Comparative Analysis and Parametric Tuning of PPO, GRPO, and DAPO for LLM Reasoning Enhancement
PPO、GRPO和DAPO在LLM推理增强中的比较分析与参数调优
- Authors: Yongsheng Lian
- Subjects:
Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2512.07611
- Pdf link: https://arxiv.org/pdf/2512.07611
- Abstract
This study presents a systematic comparison of three Reinforcement Learning (RL) algorithms (PPO, GRPO, and DAPO) for improving complex reasoning in large language models (LLMs). Our main contribution is a controlled transfer-learning evaluation: models are first fine-tuned on the specialized Countdown Game and then assessed on a suite of general-purpose reasoning benchmarks. Across all tasks, RL-trained models outperform their corresponding base models, although the degree of improvement differs by benchmark. Our parametric analysis offers practical guidance for RL-based LLM training. Increasing the group size in GRPO and DAPO leads to more stable training dynamics and higher accuracy, while the impact of the KL-penalty coefficient is non-monotonic. Additionally, we find that the Dynamic Sampling (DS) component in DAPO does not improve performance; in fact, the best overall results are achieved with DAPO when DS is disabled.
- 中文摘要
本研究系统比较了三种强化学习(RL)算法(PPO、GRPO和DAPO),用于提升大型语言模型(LLM)中复杂推理能力。我们的主要贡献是受控迁移学习评估:模型先在专门的倒计时游戏上微调,然后在一套通用推理基准测试上进行评估。在所有任务中,强化学习训练的模型表现优于对应的基础模型,尽管不同基准测试的提升程度有所不同。我们的参数分析为基于强化学习的大型语言模型训练提供了实用指导。增加GRPO和DAPO的组规模可带来更稳定的训练动态和更高的准确性,而KL惩罚系数的影响则非单调。此外,我们发现DAPO中的动态采样(DS)组件并未提升性能;事实上,禁用 DS 时,DAPO 能实现最佳整体效果。
The Agent Capability Problem: Predicting Solvability Through Information-Theoretic Bounds
智能体能力问题:通过信息理论界限预测可解性
- Authors: Shahar Lutati
- Subjects:
Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Information Theory (cs.IT); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2512.07631
- Pdf link: https://arxiv.org/pdf/2512.07631
- Abstract
When should an autonomous agent commit resources to a task? We introduce the Agent Capability Problem (ACP), a framework for predicting whether an agent can solve a problem under resource constraints. Rather than relying on empirical heuristics, ACP frames problem-solving as information acquisition: an agent requires $I_{\text{total}}$ bits to identify a solution and gains $I_{\text{step}}$ bits per action at cost $C_{\text{step}}$, yielding an effective cost $C_{\text{eff}} = (I_{\text{total}}/I_{\text{step}})\, C_{\text{step}}$ that predicts resource requirements before search. We prove that $C_{\text{eff}}$ lower-bounds expected cost and provide tight probabilistic upper bounds. Experimental validation shows that ACP predictions closely track actual agent performance, consistently bounding search effort while improving efficiency over greedy and random strategies. The framework generalizes across LLM-based and agentic workflows, linking principles from active learning, Bayesian optimization, and reinforcement learning through a unified information-theoretic lens.
- 中文摘要
自主智能体应在何时为一项任务投入资源?我们提出智能体能力问题(ACP),这是一个用于预测智能体能否在资源约束下解决问题的框架。ACP不依赖经验性启发式方法,而是将问题求解表述为信息获取:智能体需要 $I_{\text{total}}$ 比特来确定一个解,每次行动以成本 $C_{\text{step}}$ 获得 $I_{\text{step}}$ 比特,由此得到有效成本 $C_{\text{eff}} = (I_{\text{total}}/I_{\text{step}})\, C_{\text{step}}$,可在搜索之前预测资源需求。我们证明 $C_{\text{eff}}$ 是期望成本的下界,并给出了紧的概率上界。实验验证表明,ACP的预测与智能体的实际表现高度吻合,能够一致地界定搜索开销,同时在效率上优于贪婪和随机策略。该框架可推广到基于LLM的工作流和智能体工作流,通过统一的信息论视角将主动学习、贝叶斯优化和强化学习的原理联系起来。
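The effective-cost expression in the abstract above (total information needed, divided by information gained per action, times cost per action) is easy to work through numerically. The sketch below is only a worked example under assumed numbers: the function name `effective_cost`, the 1024-candidate search space, and the per-step costs are hypothetical and do not come from the paper.

```python
import math


def effective_cost(total_bits, bits_per_step, cost_per_step):
    """Effective cost: (bits needed / bits gained per action) * cost per action."""
    if bits_per_step <= 0:
        raise ValueError("bits_per_step must be positive")
    return (total_bits / bits_per_step) * cost_per_step


if __name__ == "__main__":
    # Assumed setting: pick one solution out of 1024 equally likely
    # candidates, which requires log2(1024) = 10 bits.
    total_bits = math.log2(1024)

    # A perfectly informative yes/no query yields 1 bit per action.
    print(effective_cost(total_bits, bits_per_step=1.0, cost_per_step=2.0))  # 20.0

    # A noisier action that yields only 0.5 bits doubles the estimate.
    print(effective_cost(total_bits, bits_per_step=0.5, cost_per_step=2.0))  # 40.0
```

Since the abstract states that this quantity lower-bounds the expected cost, an agent could compare it against its budget before starting a search and decline tasks whose estimate already exceeds the budget.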
SpatialDreamer: Incentivizing Spatial Reasoning via Active Mental Imagery
SpatialDreamer:通过主动心理意象激励空间推理
- Authors: Meng Cao, Xingyu Li, Xue Liu, Ian Reid, Xiaodan Liang
- Subjects:
Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2512.07733
- Pdf link: https://arxiv.org/pdf/2512.07733
- Abstract
Despite advancements in Multi-modal Large Language Models (MLLMs) for scene understanding, their performance on complex spatial reasoning tasks requiring mental simulation remains significantly limited. Current methods often rely on passive observation of spatial data, failing to internalize an active mental imagery process. To bridge this gap, we propose SpatialDreamer, a reinforcement learning framework that enables spatial reasoning through a closed-loop process of active exploration, visual imagination via a world model, and evidence-grounded reasoning. To address the lack of fine-grained reward supervision in long-horizon reasoning tasks, we propose Geometric Policy Optimization (GeoPO), which introduces tree-structured sampling and step-level reward estimation with geometric consistency constraints. Extensive experiments demonstrate that SpatialDreamer delivers highly competitive results across multiple challenging benchmarks, signifying a critical advancement in human-like active spatial mental simulation for MLLMs.
- 中文摘要
尽管多模态大型语言模型(MLLM)在场景理解方面取得了进展,但它们在需要心理模拟的复杂空间推理任务上的表现仍然十分有限。现有方法往往依赖对空间数据的被动观察,未能内化主动的心理意象过程。为了弥合这一差距,我们提出了SpatialDreamer,这是一个强化学习框架,通过由主动探索、基于世界模型的视觉想象和以证据为依据的推理构成的闭环过程来实现空间推理。为解决长程推理任务中缺乏细粒度奖励监督的问题,我们提出了几何策略优化(GeoPO),引入树结构采样以及带有几何一致性约束的步级奖励估计。大量实验表明,SpatialDreamer在多个具有挑战性的基准上取得了极具竞争力的结果,标志着MLLM在类人主动空间心理模拟方面的关键进展。
DiffusionDriveV2: Reinforcement Learning-Constrained Truncated Diffusion Modeling in End-to-End Autonomous Driving
DiffusionDriveV2:端到端自动驾驶中的强化学习约束截断扩散建模
- Authors: Jialv Zou, Shaoyu Chen, Bencheng Liao, Zhiyu Zheng, Yuehao Song, Lefei Zhang, Qian Zhang, Wenyu Liu, Xinggang Wang
- Subjects:
Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2512.07745
- Pdf link: https://arxiv.org/pdf/2512.07745
- Abstract
Generative diffusion models for end-to-end autonomous driving often suffer from mode collapse, tending to generate conservative and homogeneous behaviors. While DiffusionDrive employs predefined anchors representing different driving intentions to partition the action space and generate diverse trajectories, its reliance on imitation learning lacks sufficient constraints, resulting in a dilemma between diversity and consistent high quality. In this work, we propose DiffusionDriveV2, which leverages reinforcement learning to both constrain low-quality modes and explore for superior trajectories. This significantly enhances the overall output quality while preserving the inherent multimodality of its core Gaussian Mixture Model. First, we use scale-adaptive multiplicative noise, ideal for trajectory planning, to promote broad exploration. Second, we employ intra-anchor GRPO to manage advantage estimation among samples generated from a single anchor, and inter-anchor truncated GRPO to incorporate a global perspective across different anchors, preventing improper advantage comparisons between distinct intentions (e.g., turning vs. going straight), which can lead to further mode collapse. DiffusionDriveV2 achieves 91.2 PDMS on the NAVSIM v1 dataset and 85.5 EPDMS on the NAVSIM v2 dataset in closed-loop evaluation with an aligned ResNet-34 backbone, setting a new record. Further experiments validate that our approach resolves the dilemma between diversity and consistent high quality for truncated diffusion models, achieving the best trade-off. Code and model will be available at this https URL
- 中文摘要
面向端到端自动驾驶的生成式扩散模型常常出现模式崩溃,倾向于生成保守且同质化的行为。虽然DiffusionDrive使用代表不同驾驶意图的预定义锚点来划分动作空间并生成多样化的轨迹,但它依赖模仿学习,缺乏足够的约束,导致在多样性与稳定的高质量之间陷入两难。在本研究中,我们提出了DiffusionDriveV2,利用强化学习既约束低质量模式,又探索更优的轨迹。这显著提升了整体输出质量,同时保留了其核心高斯混合模型固有的多模态特性。首先,我们使用适合轨迹规划的尺度自适应乘性噪声来促进广泛探索。其次,我们采用锚内GRPO来处理由同一锚点生成的样本之间的优势估计,并采用锚间截断GRPO来引入跨不同锚点的全局视角,防止在不同意图(如转弯与直行)之间进行不恰当的优势比较,因为这种比较可能导致进一步的模式崩溃。在使用对齐的ResNet-34骨干网络的闭环评估中,DiffusionDriveV2在NAVSIM v1数据集上取得91.2 PDMS,在NAVSIM v2数据集上取得85.5 EPDMS,创下新纪录。进一步的实验验证了我们的方法解决了截断扩散模型在多样性与稳定的高质量之间的两难,实现了最佳权衡。代码和模型将在此 https URL 提供
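The abstract above does not spell out how its scale-adaptive multiplicative noise is constructed, so the following is only one plausible reading, sketched for intuition: a multiplicative perturbation scales waypoints by a random factor, so the displacement grows with distance from the origin and the trajectory shape stays smooth, whereas additive noise jitters every waypoint independently. The function names, the per-axis scale factors, and the sample trajectory are all hypothetical.

```python
import random


def perturb_multiplicative(trajectory, sigma, rng):
    """Scale a whole trajectory by one random factor per axis.

    Displacement is proportional to each waypoint's coordinate, so nearby
    waypoints move little and distant ones move more (assumed scheme).
    """
    sx = 1.0 + rng.gauss(0.0, sigma)
    sy = 1.0 + rng.gauss(0.0, sigma)
    return [(x * sx, y * sy) for x, y in trajectory]


def perturb_additive(trajectory, sigma, rng):
    """Baseline for comparison: independent Gaussian jitter per waypoint."""
    return [(x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma))
            for x, y in trajectory]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical ego-centric waypoints (meters), starting at the vehicle.
    waypoints = [(0.0, 0.0), (1.0, 0.5), (10.0, 5.0)]
    print(perturb_multiplicative(waypoints, sigma=0.1, rng=rng))
    print(perturb_additive(waypoints, sigma=0.1, rng=rng))
```

Under this reading, the starting waypoint at the origin is never moved and the perturbation magnitude adapts to the spatial scale of the plan, which is one way to explore without producing jagged trajectories.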
RL-MTJail: Reinforcement Learning for Automated Black-Box Multi-Turn Jailbreaking of Large Language Models
RL-MTJail:用于大型语言模型自动化黑盒多轮越狱的强化学习
- Authors: Xiqiao Xiong, Ouxiang Li, Zhuo Liu, Moxin Li, Wentao Shi, Fuli Feng, Xiangnan He
- Subjects:
Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2512.07761
- Pdf link: https://arxiv.org/pdf/2512.07761
- Abstract
Large language models are vulnerable to jailbreak attacks, threatening their safe deployment in real-world applications. This paper studies black-box multi-turn jailbreaks, aiming to train attacker LLMs to elicit harmful content from black-box models through a sequence of prompt-output interactions. Existing approaches typically rely on single-turn optimization, which is insufficient for learning long-term attack strategies. To bridge this gap, we formulate the problem as a multi-turn reinforcement learning task, directly optimizing the harmfulness of the final-turn output as the outcome reward. To mitigate sparse supervision and promote long-term attack strategies, we propose two heuristic process rewards: (1) controlling the harmfulness of intermediate outputs to prevent triggering the black-box model's rejection mechanisms, and (2) maintaining the semantic relevance of intermediate outputs to avoid drifting into irrelevant content. Experimental results on multiple benchmarks show consistently improved attack success rates across multiple models, highlighting the effectiveness of our approach. The code is available at this https URL. Warning: This paper contains examples of harmful content.
- 中文摘要
大型语言模型容易受到越狱攻击,这威胁到其在现实应用中的安全部署。本文研究黑盒多轮越狱,旨在训练攻击者LLM,通过一系列提示-输出交互,从黑盒模型中诱导出有害内容。现有方法通常依赖单轮优化,不足以学习长期攻击策略。为弥合这一差距,我们将该问题形式化为多轮强化学习任务,直接以最终轮输出的有害性作为结果奖励进行优化。为缓解监督信号稀疏的问题并促进长期攻击策略,我们提出了两种启发式过程奖励:(1)控制中间输出的有害性,防止触发黑盒模型的拒绝机制;(2)保持中间输出的语义相关性,避免漂移到无关内容。在多个基准上的实验结果显示,该方法在多个模型上的攻击成功率均稳定提升,凸显了其有效性。代码可在此 https URL 获取。警告:本文包含有害内容示例。
On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models
关于预训练、中期训练和强化学习在推理语言模型中的相互作用
- Authors: Charlie Zhang, Graham Neubig, Xiang Yue
- Subjects:
Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2512.07783
- Pdf link: https://arxiv.org/pdf/2512.07783
- Abstract
Recent reinforcement learning (RL) techniques have yielded impressive reasoning improvements in language models, yet it remains unclear whether post-training truly extends a model's reasoning ability beyond what it acquires during pre-training. A central challenge is the lack of control in modern training pipelines: large-scale pre-training corpora are opaque, mid-training is often underexamined, and RL objectives interact with unknown prior knowledge in complex ways. To resolve this ambiguity, we develop a fully controlled experimental framework that isolates the causal contributions of pre-training, mid-training, and RL-based post-training. Our approach employs synthetic reasoning tasks with explicit atomic operations, parseable step-by-step reasoning traces, and systematic manipulation of training distributions. We evaluate models along two axes: extrapolative generalization to more complex compositions and contextual generalization across surface contexts. Using this framework, we reconcile competing views on RL's effectiveness. We show that: 1) RL produces true capability gains (pass@128) only when pre-training leaves sufficient headroom and when RL data target the model's edge of competence, tasks at the boundary that are difficult but not yet out of reach. 2) Contextual generalization requires minimal yet sufficient pre-training exposure, after which RL can reliably transfer. 3) Mid-training significantly enhances performance under fixed compute compared with RL only, demonstrating its central but underexplored role in training pipelines. 4) Process-level rewards reduce reward hacking and improve reasoning fidelity. Together, these results clarify the interplay between pre-training, mid-training, and RL, offering a foundation for understanding and improving reasoning LM training strategies.
- 中文摘要
近期的强化学习(RL)技术为语言模型带来了显著的推理能力提升,但尚不清楚后训练是否真正将模型的推理能力扩展到预训练所获得的能力之外。一个核心挑战是现代训练流程缺乏可控性:大规模预训练语料库不透明,中期训练往往缺乏充分考察,而强化学习目标又以复杂的方式与未知的先验知识相互作用。为消除这种不确定性,我们构建了一个完全受控的实验框架,用以分离预训练、中期训练和基于强化学习的后训练各自的因果贡献。我们的方法采用合成推理任务,这些任务具有显式的原子操作、可解析的逐步推理轨迹,并可对训练分布进行系统化操控。我们从两个维度评估模型:面向更复杂组合的外推泛化,以及跨表层语境的上下文泛化。利用这一框架,我们调和了关于强化学习有效性的不同观点。我们证明:1)只有当预训练留有足够的提升空间,且强化学习数据针对模型的能力边缘(即处于边界上、困难但尚未超出能力范围的任务)时,强化学习才能带来真正的能力提升(pass@128)。2)上下文泛化需要最少但足够的预训练暴露,在此之后强化学习才能可靠地迁移。3)在固定计算量下,中期训练相比仅使用强化学习能显著提升性能,表明其在训练流程中处于核心地位却尚未得到充分探索。4)过程级奖励减少了奖励投机(reward hacking),并提高了推理的忠实度。这些结果共同阐明了预训练、中期训练和强化学习之间的相互作用,为理解和改进推理语言模型的训练策略奠定了基础。
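The abstract reports capability gains at pass@128 but does not define the metric. A common way to compute pass@k from n sampled solutions of which c are correct is the unbiased estimator 1 - C(n-c, k) / C(n, k), widely used in code-generation evaluation. Whether this paper uses exactly this estimator is an assumption; the sketch below only illustrates the metric.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c correct ones.

    Returns the probability that at least one of k samples, drawn without
    replacement from the n generated samples, is correct.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts: 256 samples per task, 3 of them correct.
    print(pass_at_k(n=256, c=3, k=1))
    print(pass_at_k(n=256, c=3, k=128))
```

A large k such as 128 measures whether a correct solution exists anywhere in the model's sampling distribution, which is why it is often read as a capability measure and not only as a measure of sampling reliability.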
An Adaptive Multi-Layered Honeynet Architecture for Threat Behavior Analysis via Deep Learning
基于深度学习的自适应多层蜜网架构用于威胁行为分析
- Authors: Lukas Johannes Möller
- Subjects:
Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2512.07827
- Pdf link: https://arxiv.org/pdf/2512.07827
- Abstract
The escalating sophistication and variety of cyber threats have rendered static honeypots inadequate, necessitating adaptive, intelligence-driven deception. In this work, ADLAH is introduced: an Adaptive Deep Learning Anomaly Detection Honeynet designed to maximize high-fidelity threat intelligence while minimizing cost through autonomous orchestration of infrastructure. The principal contribution is offered as an end-to-end architectural blueprint and vision for an AI-driven deception platform. Feasibility is evidenced by a functional prototype of the central decision mechanism, in which a reinforcement learning (RL) agent determines, in real time, when sessions should be escalated from low-interaction sensor nodes to dynamically provisioned, high-interaction honeypots. Because sufficient live data were unavailable, field-scale validation is not claimed; instead, design trade-offs and limitations are detailed, and a rigorous roadmap toward empirical evaluation at scale is provided. Beyond selective escalation and anomaly detection, the architecture pursues automated extraction, clustering, and versioning of bot attack chains, a core capability motivated by the empirical observation that exposed services are dominated by automated traffic. Together, these elements delineate a practical path toward cost-efficient capture of high-value adversary behavior, systematic bot versioning, and the production of actionable threat intelligence.
- 中文摘要
网络威胁日益复杂和多样化,使静态蜜罐难以胜任,需要采用自适应、情报驱动的欺骗手段。本研究介绍了ADLAH:一种自适应深度学习异常检测蜜网,旨在通过自主编排基础设施,在最大化高保真威胁情报的同时将成本降至最低。其主要贡献是一份面向AI驱动欺骗平台的端到端架构蓝图和愿景。可行性由核心决策机制的功能原型加以证明:其中强化学习(RL)智能体实时判定何时应将会话从低交互传感器节点升级到动态配置的高交互蜜罐。由于缺乏足够的实时数据,本文不声称已完成实地规模的验证;而是详细说明了设计权衡和局限,并提供了面向大规模实证评估的严谨路线图。除了选择性升级和异常检测之外,该架构还致力于对机器人(bot)攻击链进行自动提取、聚类和版本管理,这一核心能力的动机来自如下实证观察:暴露在外的服务主要面对的是自动化流量。这些要素共同勾勒出一条切实可行的路径,以低成本捕获高价值的对手行为、系统化地进行机器人版本管理,并产出可操作的威胁情报。
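The abstract describes the central decision mechanism only at a high level: an RL agent decides when a session should be escalated from a low-interaction sensor to a high-interaction honeypot. The following is a minimal tabular Q-learning sketch of such a decision, not the ADLAH prototype. The state labels, the two actions, the reward values, and the hyperparameters are all hypothetical.

```python
from collections import defaultdict

# Hypothetical action set: keep the session on the sensor, or escalate it.
ACTIONS = ("stay", "escalate")


class EscalationAgent:
    """Tabular Q-learning over coarse, hand-labelled session states."""

    def __init__(self, alpha=0.5, gamma=0.9):
        self.alpha = alpha            # learning rate
        self.gamma = gamma            # discount factor
        self.q = defaultdict(float)   # (state, action) -> estimated value

    def best_action(self, state):
        """Greedy action; ties resolve to the first entry in ACTIONS."""
        return max(ACTIONS, key=lambda action: self.q[(state, action)])

    def update(self, state, action, reward, next_state=None, done=True):
        """One Q-learning step toward reward + gamma * max_a Q(next_state, a)."""
        target = reward
        if not done:
            target += self.gamma * max(self.q[(next_state, a)] for a in ACTIONS)
        key = (state, action)
        self.q[key] += self.alpha * (target - self.q[key])


if __name__ == "__main__":
    agent = EscalationAgent()
    # Assumed rewards: escalating an anomalous session yields useful
    # intelligence (+1); escalating routine scanning wastes resources (-1).
    agent.update("anomalous", "escalate", reward=1.0)
    agent.update("routine_scan", "escalate", reward=-1.0)
    print(agent.best_action("anomalous"), agent.best_action("routine_scan"))
```

In a real deployment the state would presumably come from the anomaly detector, and the reward would have to trade intelligence value against provisioning cost; both are left abstract here.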
Keyword: diffusion policy
Delay-Aware Diffusion Policy: Bridging the Observation-Execution Gap in Dynamic Tasks
延迟感知扩散策略:弥合动态任务中的观察与执行差距
- Authors: Aileen Liao, Dong-Ki Kim, Max Olan Smith, Ali-akbar Agha-mohammadi, Shayegan Omidshafiei
- Subjects:
Robotics (cs.RO); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2512.07697
- Pdf link: https://arxiv.org/pdf/2512.07697
- Abstract
As a robot senses and selects actions, the world keeps changing. This inference delay creates a gap of tens to hundreds of milliseconds between the observed state and the state at execution. In this work, we take the natural generalization from zero delay to measured delay during training and inference. We introduce Delay-Aware Diffusion Policy (DA-DP), a framework for explicitly incorporating inference delays into policy learning. DA-DP corrects zero-delay trajectories to their delay-compensated counterparts, and augments the policy with delay conditioning. We empirically validate DA-DP on a variety of tasks, robots, and delays and find its success rate more robust to delay than delay-unaware methods. DA-DP is architecture agnostic and transfers beyond diffusion policies, offering a general pattern for delay-aware imitation learning. More broadly, DA-DP encourages evaluation protocols that report performance as a function of measured latency, not just task difficulty.
- 中文摘要
当机器人感知并选择动作时,世界仍在不断变化。这种推理延迟会在观察到的状态与执行时的状态之间造成数十到数百毫秒的差距。在本研究中,我们在训练和推理中将零延迟设定自然推广到实测延迟。我们提出了延迟感知扩散策略(DA-DP),这是一个将推理延迟显式纳入策略学习的框架。DA-DP将零延迟轨迹修正为对应的延迟补偿轨迹,并为策略增加延迟条件输入。我们在多种任务、机器人和延迟条件下对DA-DP进行了实证验证,发现其成功率对延迟的鲁棒性优于不感知延迟的方法。DA-DP与架构无关,并可迁移到扩散策略之外,为延迟感知的模仿学习提供了一种通用模式。更广泛地说,DA-DP鼓励评估协议将性能报告为实测延迟的函数,而不仅仅是任务难度的函数。
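The abstract says DA-DP corrects zero-delay trajectories to delay-compensated ones and conditions the policy on the delay, without spelling out the procedure. One simple way to picture this, offered as an assumption and not as the authors' method, is to relabel demonstration data: pair the observation at step t with the action chunk that starts at step t + d, where d is the measured delay in control steps, and pass d to the policy as an extra input. The function names and the round-up conversion from milliseconds to steps are illustrative choices.

```python
import math


def delay_to_steps(delay_ms, control_hz):
    """Convert a measured latency in milliseconds to whole control steps (rounded up)."""
    return math.ceil(delay_ms * control_hz / 1000.0)


def delay_compensated_sample(observations, actions, t, delay_steps, horizon):
    """Build one training sample that accounts for inference delay.

    The policy sees the observation from step t, but its first action will
    only be executed delay_steps later, so the target chunk starts there.
    Returns (observation, delay_steps, action_chunk); delay_steps doubles as
    the conditioning input.
    """
    start = t + delay_steps
    end = start + horizon
    if t < 0 or delay_steps < 0 or end > len(actions):
        raise ValueError("requested chunk falls outside the demonstration")
    return observations[t], delay_steps, actions[start:end]


if __name__ == "__main__":
    # Hypothetical demonstration: 10 steps recorded at 30 Hz, 100 ms latency.
    observations = [f"obs_{i}" for i in range(10)]
    actions = list(range(10))
    d = delay_to_steps(delay_ms=100, control_hz=30)
    print(delay_compensated_sample(observations, actions, t=2, delay_steps=d, horizon=4))
```

Setting `delay_steps=0` recovers the ordinary zero-delay pairing, which matches the abstract's framing of the method as a generalization from zero delay to measured delay.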