Arxiv Papers of Today

Generated at: 2025-12-17 16:33:56 (UTC+8); Arxiv release time: 2025-12-17 20:00 EST (2025-12-18 09:00 UTC+8)

There are 22 relevant papers today

Keyword: reinforcement learning

AI-Powered Annotation Pipelines for Stabilizing Large Language Models: A Human-AI Synergy Approach

Authors: Gangesh Pathak, Prasanna Kumar
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.13714
Pdf link: https://arxiv.org/pdf/2512.13714
Abstract LLM implementations are failing in highly regulated industries owing to instability issues, inconsistent reasoning, hallucinations and performance variability, especially in workflows. These reliability issues restrict the safe use of LLMs in areas that need factual precision and consistent behavior (Aiyappa et al., 2023). Current stabilization methods, such as reinforcement learning from human feedback (RLHF) and supervised fine-tuning, offer quantifiable improvements but are expensive and depend on intensive human annotation, so they do not scale sustainably (Dong et al., 2023; Retzlaff et al., 2024). This paper presents an AI-based annotation pipeline that systematically identifies, labels, and fixes instability patterns in LLM output. Our human-AI synergy method combines automated weak supervision and confidence-based annotation models with targeted human validation to guarantee the reliability and ethical integrity of feedback information (Cabitza et al., 2023; Jiang et al., 2023). Our framework introduces stability-specific annotation categories (semantic consistency, factual correctness, and logical coherence), allowing continuous calibration of models and enhancement of their robustness through feedback loops (Honovich et al., 2021; Nan et al., 2021).

Meta Hierarchical Reinforcement Learning for Scalable Resource Management in O-RAN

Authors: Fatemeh Lotfi, Fatemeh Afghah
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2512.13715
Pdf link: https://arxiv.org/pdf/2512.13715
Abstract The increasing complexity of modern applications demands wireless networks capable of real time adaptability and efficient resource management. The Open Radio Access Network (O-RAN) architecture, with its RAN Intelligent Controller (RIC) modules, has emerged as a pivotal solution for dynamic resource management and network slicing. While artificial intelligence (AI) driven methods have shown promise, most approaches struggle to maintain performance under unpredictable and highly dynamic conditions. This paper proposes an adaptive Meta Hierarchical Reinforcement Learning (Meta-HRL) framework, inspired by Model Agnostic Meta Learning (MAML), to jointly optimize resource allocation and network slicing in O-RAN. The framework integrates hierarchical control with meta learning to enable both global and local adaptation: the high-level controller allocates resources across slices, while low level agents perform intra slice scheduling. The adaptive meta-update mechanism weights tasks by temporal difference error variance, improving stability and prioritizing complex network scenarios. Theoretical analysis establishes sublinear convergence and regret guarantees for the two-level learning process. Simulation results demonstrate a 19.8% improvement in network management efficiency compared with baseline RL and meta-RL approaches, along with faster adaptation and higher QoS satisfaction across eMBB, URLLC, and mMTC slices. Additional ablation and scalability studies confirm the method's robustness, achieving up to 40% faster adaptation and consistent fairness, latency, and throughput performance as network scale increases.
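
The variance-based task weighting mentioned in the Meta-HRL abstract can be illustrated with a minimal sketch. The abstract does not give the exact rule, so this assumes weights are simply proportional to each task's TD-error variance (higher variance, higher priority); the names `task_weights` and `weighted_meta_gradient` and the sample numbers are hypothetical.

```python
from statistics import pvariance


def task_weights(td_errors_per_task, eps=1e-8):
    """Turn per-task TD-error variances into weights that sum to 1.

    Assumption: weight is proportional to variance, so tasks with
    noisier value estimates contribute more to the meta-update.
    """
    variances = [pvariance(errs) for errs in td_errors_per_task]
    total = sum(variances) + eps * len(variances)
    return [(v + eps) / total for v in variances]


def weighted_meta_gradient(task_grads, weights):
    """Weighted sum of per-task gradient vectors (plain lists of floats)."""
    dim = len(task_grads[0])
    return [sum(w * g[i] for w, g in zip(weights, task_grads)) for i in range(dim)]


# Hypothetical TD errors collected on three network-slicing tasks.
td_errors = [[0.1, 0.1, 0.1], [0.0, 1.0, -1.0], [0.5, -0.5, 0.0]]
weights = task_weights(td_errors)

# Hypothetical per-task gradients for a two-parameter meta-policy.
grads = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
meta_grad = weighted_meta_gradient(grads, weights)
```

The second task has the largest TD-error variance, so under this assumption it dominates the combined meta-gradient.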

Time-Constrained Recommendations: Reinforcement Learning Strategies for E-Commerce

Authors: Sayak Chakrabarty, Souradip Pal
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.13726
Pdf link: https://arxiv.org/pdf/2512.13726
Abstract Unlike traditional recommendation tasks, finite user time budgets introduce a critical resource constraint, requiring the recommender system to balance item relevance and evaluation cost. For example, in a mobile shopping interface, users interact with recommendations by scrolling, where each scroll triggers a list of items called a slate. Users incur an evaluation cost: the time spent assessing item features before deciding to click. Highly relevant items with higher evaluation costs may not fit within the user's time budget, affecting engagement. In this position paper, our objective is to evaluate reinforcement learning algorithms that learn patterns in user preferences and time budgets simultaneously, crafting recommendations with higher engagement potential under resource constraints. Our experiments explore the use of reinforcement learning to recommend items to users on Alibaba's Personalized Re-ranking dataset, which supports slate optimization in e-commerce contexts. Our contributions include (i) a unified formulation of time-constrained slate recommendation modeled as Markov Decision Processes (MDPs) with budget-aware utilities; (ii) a simulation framework to study policy behavior on re-ranking data; and (iii) empirical evidence that on-policy and off-policy control can outperform traditional contextual bandit-based methods under tight time budgets.
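
To make the relevance-versus-cost trade-off in the time-constrained recommendation abstract concrete, here is a toy sketch. It assumes a deliberately simple user model that is not taken from the paper: the user scans the slate top-down, pays each item's evaluation cost, and stops as soon as the next item no longer fits in the remaining budget. The function name and all numbers are hypothetical.

```python
def simulate_slate(slate, time_budget):
    """Return (engagement, remaining_budget) for one slate.

    slate: list of (relevance, eval_cost) pairs in display order.
    Assumption: engagement is the summed relevance of the items the
    user manages to evaluate before the time budget runs out.
    """
    remaining = time_budget
    engagement = 0.0
    for relevance, cost in slate:
        if cost > remaining:
            break  # the user gives up before finishing this item
        remaining -= cost
        engagement += relevance
    return engagement, remaining


# Same three items, two orderings, same budget of 3 time units.
relevance_first = [(0.9, 5.0), (0.4, 1.0), (0.3, 1.0)]
budget_aware = [(0.4, 1.0), (0.3, 1.0), (0.9, 5.0)]

eng_a, _ = simulate_slate(relevance_first, time_budget=3.0)
eng_b, left_b = simulate_slate(budget_aware, time_budget=3.0)
```

Ranking purely by relevance puts the costly item first and the user engages with nothing, while the budget-aware ordering still collects the two cheaper items.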

RAST-MoE-RL: A Regime-Aware Spatio-Temporal MoE Framework for Deep Reinforcement Learning in Ride-Hailing

Authors: Yuhan Tang, Kangxin Cui, Jung Ho Park, Yibo Zhao, Xuan Jiang, Haoze He, Dingyi Zhuang, Shenhao Wang, Jiangbo Yu, Haris Koutsopoulos, Jinhua Zhao
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.13727
Pdf link: https://arxiv.org/pdf/2512.13727
Abstract Ride-hailing platforms face the challenge of balancing passenger waiting times with overall system efficiency under highly uncertain supply-demand conditions. Adaptive delayed matching creates a trade-off between matching and pickup delays by deciding whether to assign drivers immediately or batch requests. Since outcomes accumulate over long horizons with stochastic dynamics, reinforcement learning (RL) is a suitable framework. However, existing approaches often oversimplify traffic dynamics or use shallow encoders that miss complex spatiotemporal patterns. We introduce the Regime-Aware Spatio-Temporal Mixture-of-Experts (RAST-MoE), which formalizes adaptive delayed matching as a regime-aware MDP equipped with a self-attention MoE encoder. Unlike monolithic networks, our experts specialize automatically, improving representation capacity while maintaining computational efficiency. A physics-informed congestion surrogate preserves realistic density-speed feedback, enabling millions of efficient rollouts, while an adaptive reward scheme guards against pathological strategies. With only 12M parameters, our framework outperforms strong baselines. On real-world Uber trajectory data (San Francisco), it improves total reward by over 13%, reducing average matching and pickup delays by 10% and 15% respectively. It demonstrates robustness across unseen demand regimes and stable training. These findings highlight the potential of MoE-enhanced RL for large-scale decision-making with complex spatiotemporal dynamics.

Explainable reinforcement learning from human feedback to improve alignment

Authors: Shicheng Liu, Siyuan Xu, Wenjie Qiu, Hangfan Zhang, Minghui Zhu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.13837
Pdf link: https://arxiv.org/pdf/2512.13837
Abstract A common and effective strategy for humans to improve an unsatisfactory outcome in daily life is to find a cause of this outcome and correct the cause. In this paper, we investigate whether this human improvement strategy can be applied to improving reinforcement learning from human feedback (RLHF) for alignment of language models (LMs). In particular, it is observed in the literature that LMs tuned by RLHF can still output unsatisfactory responses. This paper proposes a method to improve the unsatisfactory responses by correcting their causes. Our method has two parts. The first part proposes a post-hoc explanation method to explain why an unsatisfactory response is generated to a prompt by identifying the training data that lead to this response. We formulate this problem as a constrained combinatorial optimization problem where the objective is to find a set of training data closest to this prompt-response pair in a feature representation space, and the constraint is that the prompt-response pair can be decomposed as a convex combination of this set of training data in the feature space. We propose an efficient iterative data selection algorithm to solve this problem. The second part proposes an unlearning method that improves unsatisfactory responses to some prompts by unlearning the training data that lead to these unsatisfactory responses and, meanwhile, does not significantly degrade satisfactory responses to other prompts. Experimental results demonstrate that our algorithm can improve RLHF.

Adaptive digital twins for predictive decision-making: Online Bayesian learning of transition dynamics

Authors: Eugenio Varetti, Matteo Torzoni, Marco Tezzele, Andrea Manzoni
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Arxiv link: https://arxiv.org/abs/2512.13919
Pdf link: https://arxiv.org/pdf/2512.13919
Abstract This work shows how adaptivity can enhance value realization of digital twins in civil engineering. We focus on adapting the state transition models within digital twins represented through probabilistic graphical models. The bi-directional interaction between the physical and virtual domains is modeled using dynamic Bayesian networks. By treating state transition probabilities as random variables endowed with conjugate priors, we enable hierarchical online learning of transition dynamics from a state to another through effortless Bayesian updates. We provide the mathematical framework to account for a larger class of distributions with respect to the current literature. To compute dynamic policies with precision updates we solve parametric Markov decision processes through reinforcement learning. The proposed adaptive digital twin framework enjoys enhanced personalization, increased robustness, and improved cost-effectiveness. We assess our approach on a case study involving structural health monitoring and maintenance planning of a railway bridge.
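
The conjugate-prior idea in the digital twin abstract can be sketched with the textbook Dirichlet-categorical case: each row of the transition matrix gets a Dirichlet prior, and observing a transition just increments one pseudo-count. This is only the standard special case, not the paper's hierarchical or more general formulation; the class name and the observed transitions are hypothetical.

```python
class DirichletTransitionModel:
    """Online Bayesian estimate of a discrete transition matrix."""

    def __init__(self, n_states, prior_count=1.0):
        # alpha[s][s2] is the Dirichlet pseudo-count for transition s -> s2.
        self.alpha = [[prior_count] * n_states for _ in range(n_states)]

    def update(self, s, s_next):
        """Conjugate update: one observed transition adds one count."""
        self.alpha[s][s_next] += 1.0

    def posterior_mean(self, s):
        """Expected transition probabilities out of state s."""
        total = sum(self.alpha[s])
        return [a / total for a in self.alpha[s]]


# Hypothetical observations of a 3-state degradation process.
model = DirichletTransitionModel(n_states=3)
for s, s_next in [(0, 0), (0, 1), (0, 1), (1, 2)]:
    model.update(s, s_next)

probs_from_0 = model.posterior_mean(0)
```

Because the update is a single addition per observation, the posterior can be refreshed every time the physical asset reports a new state.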

Sample-Efficient Robot Skill Learning for Construction Tasks: Benchmarking Hierarchical Reinforcement Learning and Vision-Language-Action VLA Model

Authors: Zhaofeng Hu, Hongrui Yu, Vaidhyanathan Chandramouli, Ci-Jyun Liang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.14031
Pdf link: https://arxiv.org/pdf/2512.14031
Abstract This study evaluates two leading approaches for teaching construction robots new skills to understand their applicability for construction automation: a Vision-Language-Action (VLA) model and Reinforcement Learning (RL) methods. The goal is to understand both task performance and the practical effort needed to deploy each approach on real jobs. The authors developed two teleoperation interfaces to control the robots and collect the demonstrations needed, both of which proved effective for training robots for long-horizon and dexterous tasks. In addition, the authors conduct a three-stage evaluation. First, the authors compare a Multi-Layer Perceptron (MLP) policy with a Deep Q-network (DQN) imitation model to identify the stronger RL baseline, focusing on model performance, generalization, and a pick-up experiment. Second, three different VLA models are trained in two different scenarios and compared with each other. Third, the authors benchmark the selected RL baseline against the VLA model using computational and sample-efficiency measures and then a robot experiment on a multi-stage panel installation task that includes transport and installation. The VLA model demonstrates strong generalization and few-shot capability, achieving 60% and 100% success in the pickup phase. In comparison, DQN can be made robust but needs additional noise during tuning, which increases the workload. Overall, the findings indicate that VLA offers practical advantages for changing tasks by reducing programming effort and enabling useful performance with minimal data, while DQN provides a viable baseline when sufficient tuning effort is acceptable.

OmniDrive-R1: Reinforcement-driven Interleaved Multi-modal Chain-of-Thought for Trustworthy Vision-Language Autonomous Driving

Authors: Zhenguo Zhang, Haohan Zhen, Yishen Wang, Le Xu, Tianchen Deng, Xuefeng Chen, Qu Chen, Bo Zhang, Wuxiong Huang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.14044
Pdf link: https://arxiv.org/pdf/2512.14044
Abstract The deployment of Vision-Language Models (VLMs) in safety-critical domains like autonomous driving (AD) is critically hindered by reliability failures, most notably object hallucination. This failure stems from their reliance on ungrounded, text-based Chain-of-Thought (CoT) reasoning. While existing multi-modal CoT approaches attempt mitigation, they suffer from two fundamental flaws: (1) decoupled perception and reasoning stages that prevent end-to-end joint optimization, and (2) reliance on expensive, dense localization labels. We introduce OmniDrive-R1, an end-to-end VLM framework designed for autonomous driving, which unifies perception and reasoning through an interleaved Multi-modal Chain-of-Thought (iMCoT) mechanism. Our core innovation is a reinforcement-driven visual grounding capability, enabling the model to autonomously direct its attention and "zoom in" on critical regions for fine-grained analysis. This capability is enabled by our pure two-stage reinforcement learning training pipeline and Clip-GRPO algorithm. Crucially, Clip-GRPO introduces an annotation-free, process-based grounding reward. This reward not only eliminates the need for dense labels but also circumvents the instability of external tool calls by enforcing real-time cross-modal consistency between the visual focus and the textual reasoning. Extensive experiments on DriveLMM-o1 demonstrate our model's significant improvements. Compared to the baseline Qwen2.5VL-7B, OmniDrive-R1 improves the overall reasoning score from 51.77% to 80.35%, and the final answer accuracy from 37.81% to 73.62%.

Context Representation via Action-Free Transformer encoder-decoder for Meta Reinforcement Learning

Authors: Amir M. Soufi Enayati, Homayoun Honari, Homayoun Najjaran
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.14057
Pdf link: https://arxiv.org/pdf/2512.14057
Abstract Reinforcement learning (RL) enables robots to operate in uncertain environments, but standard approaches often struggle with poor generalization to unseen tasks. Context-adaptive meta reinforcement learning methods address these limitations by conditioning on the task representation, yet they mostly rely on complete action information in the experience, making task inference tightly coupled to a specific policy. This paper introduces Context Representation via Action Free Transformer encoder decoder (CRAFT), a belief model that infers task representations solely from sequences of states and rewards. By removing the dependence on actions, CRAFT decouples task inference from policy optimization, supports modular training, and leverages amortized variational inference for scalable belief updates. Built on a transformer encoder decoder with rotary positional embeddings, the model captures long range temporal dependencies and robustly encodes both parametric and non-parametric task variations. Experiments on the MetaWorld ML-10 robotic manipulation benchmark show that CRAFT achieves faster adaptation, improved generalization, and more effective exploration compared to context adaptive meta-RL baselines. These findings highlight the potential of action-free inference as a foundation for scalable RL in robotic control.

RADAR: Accelerating Large Language Model Inference With RL-Based Dynamic Draft Trees

Authors: Junjie Ma, Jinlong Li
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.14069
Pdf link: https://arxiv.org/pdf/2512.14069
Abstract Inference with modern Large Language Models (LLMs) is expensive and slow, and speculative sampling has emerged as an effective solution to this problem. However, the number of calls to the draft model for generating candidate tokens in speculative sampling is a preset hyperparameter, which lacks flexibility. To generate and utilize the candidate tokens more effectively, we propose RADAR, a novel speculative sampling method with RL-based dynamic draft trees. RADAR formulates the draft tree generation process as a Markov Decision Process (MDP) and employs offline reinforcement learning to train a prediction model, which enables real-time decisions on the calls to the draft model, reducing redundant computations and further accelerating inference. Evaluations across three LLMs and four tasks show that RADAR achieves a speedup of 3.17x-4.82x over the auto-regressive decoding baseline. The code is available at this https URL.

A First-Order Logic-Based Alternative to Reward Models in RLHF

Authors: Chunjin Jian, Xinhua Zhu
Subjects: Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Arxiv link: https://arxiv.org/abs/2512.14100
Pdf link: https://arxiv.org/pdf/2512.14100
Abstract Reinforcement Learning from Human Feedback (RLHF) plays a crucial role in aligning large language models (LLMs) with human values and preferences. However, the quality and stability of the trained reward model largely determine the final alignment performance. Existing approaches such as Proximal Policy Optimization (PPO) rely heavily on reward models to guide LLMs toward human-aligned behaviors. In this work, we propose a logic-similarity-based reward mechanism as an alternative to conventional reward modeling. Instead of relying on heuristic reward estimation, our method leverages formal logical consistency to steer model alignment with human preferences. Since real-world questions can be interpreted from multiple perspectives, to ensure that logic-based reinforcement learning does not cause model collapse, we introduce S-GRPO, a supervised variant of the GRPO framework. S-GRPO incorporates an additional supervised component and jointly optimizes the generation term, KL-divergence regularization, and label-based objective during training. Experimental results demonstrate that S-GRPO consistently outperforms standard supervised fine-tuning (SFT) in both performance and robustness. Furthermore, it extends existing preference-learning frameworks such as GRPO and DPO, offering a more flexible and task-adaptive approach to alignment training. Our code is available at this https URL.

Incentivizing Tool-augmented Thinking with Images for Medical Image Analysis

Authors: Yankai Jiang, Yujie Zhang, Peng Zhang, Yichen Li, Jintai Chen, Xiaoming Shi, Shihui Zhen
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.14157
Pdf link: https://arxiv.org/pdf/2512.14157
Abstract Recent reasoning based medical MLLMs have made progress in generating step by step textual reasoning chains. However, they still struggle with complex tasks that necessitate dynamic and iterative focusing on fine-grained visual regions to achieve precise grounding and diagnosis. We introduce Ophiuchus, a versatile, tool-augmented framework that equips an MLLM to (i) decide when additional visual evidence is needed, (ii) determine where to probe and ground within the medical image, and (iii) seamlessly weave the relevant sub-image content back into an interleaved, multimodal chain of thought. In contrast to prior approaches limited by the performance ceiling of specialized tools, Ophiuchus integrates the model's inherent grounding and perception capabilities with external tools, thereby fostering higher-level reasoning. The core of our method is a three-stage training strategy: cold-start training with tool-integrated reasoning data to achieve basic tool selection and adaptation for inspecting key regions; self-reflection fine-tuning to strengthen reflective reasoning and encourage revisiting tool outputs; and Agentic Tool Reinforcement Learning to directly optimize task-specific rewards and emulate expert-like diagnostic behavior. Extensive experiments show that Ophiuchus consistently outperforms both closed-source and open-source SOTA methods across diverse medical benchmarks, including VQA, detection, and reasoning-based segmentation. Our approach illuminates a path toward medical AI agents that can genuinely "think with images" through tool-integrated reasoning. Datasets, codes, and trained models will be released publicly.

Understanding and Improving Hyperbolic Deep Reinforcement Learning

Authors: Timo Klein, Thomas Lang, Andrii Shkabrii, Alexander Sturm, Kevin Sidak, Lukas Miklautz, Claudia Plant, Yllka Velaj, Sebastian Tschiatschek
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.14202
Pdf link: https://arxiv.org/pdf/2512.14202
Abstract The performance of reinforcement learning (RL) agents depends critically on the quality of the underlying feature representations. Hyperbolic feature spaces are well-suited for this purpose, as they naturally capture hierarchical and relational structure often present in complex RL environments. However, leveraging these spaces commonly faces optimization challenges due to the nonstationarity of RL. In this work, we identify key factors that determine the success and failure of training hyperbolic deep RL agents. By analyzing the gradients of core operations in the Poincaré Ball and Hyperboloid models of hyperbolic geometry, we show that large-norm embeddings destabilize gradient-based training, leading to trust-region violations in proximal policy optimization (PPO). Based on these insights, we introduce Hyper++, a new hyperbolic PPO agent that consists of three components: (i) stable critic training through a categorical value loss instead of regression; (ii) feature regularization guaranteeing bounded norms while avoiding the curse of dimensionality from clipping; and (iii) using a more optimization-friendly formulation of hyperbolic network layers. In experiments on ProcGen, we show that Hyper++ guarantees stable learning, outperforms prior hyperbolic agents, and reduces wall-clock time by approximately 30%. On Atari-5 with Double DQN, Hyper++ strongly outperforms Euclidean and hyperbolic baselines. We release our code at this https URL.
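
For readers unfamiliar with training a critic by classification, here is a minimal sketch of one common way to build a categorical value loss. It uses two-hot target encoding over fixed bins with a cross-entropy loss; this is a stand-in chosen for illustration, and the abstract does not say which categorical loss Hyper++ actually uses. The bin layout and function names are hypothetical.

```python
import math


def two_hot(value, bins):
    """Spread a scalar return over the two nearest bin centers.

    bins must be sorted ascending; values outside the range are clamped.
    The encoding is chosen so that sum(p * b) recovers the value.
    """
    value = min(max(value, bins[0]), bins[-1])
    target = [0.0] * len(bins)
    for i in range(len(bins) - 1):
        lo, hi = bins[i], bins[i + 1]
        if lo <= value <= hi:
            w_hi = (value - lo) / (hi - lo)
            target[i] += 1.0 - w_hi
            target[i + 1] += w_hi
            break
    return target


def cross_entropy(logits, target):
    """Cross-entropy between softmax(logits) and a soft target."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(t * (x - log_z) for x, t in zip(logits, target))


bins = [-1.0, 0.0, 1.0]
target = two_hot(0.25, bins)                   # mass on bins 0.0 and 1.0
loss = cross_entropy([0.0, 0.0, 0.0], target)  # uniform prediction
decoded = sum(t * b for t, b in zip(target, bins))
```

The critic then predicts logits over the bins instead of a single scalar, which keeps the loss bounded in scale regardless of the magnitude of the returns.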

GLM-TTS Technical Report

Authors: Jiayan Cui, Zhihan Yang, Naihan Li, Jiankun Tian, Xingyu Ma, Yi Zhang, Guangyu Chen, Runxuan Yang, Yuqing Cheng, Yizhi Zhou, Guochen Yu, Xiaotao Gu, Jie Tang
Subjects: Sound (cs.SD)
Arxiv link: https://arxiv.org/abs/2512.14291
Pdf link: https://arxiv.org/pdf/2512.14291
Abstract This work proposes GLM-TTS, a production-level TTS system designed for efficiency, controllability, and high-fidelity speech generation. GLM-TTS follows a two-stage architecture, consisting of a text-to-token autoregressive model and a token-to-waveform diffusion model. With only 100k hours of training data, GLM-TTS achieves state-of-the-art performance on multiple open-source benchmarks. To meet production requirements, GLM-TTS improves speech quality through an optimized speech tokenizer with fundamental frequency constraints and a GRPO-based multi-reward reinforcement learning framework that jointly optimizes pronunciation, speaker similarity, and expressive prosody. In parallel, the system enables efficient and controllable deployment via parameter-efficient LoRA-based voice customization and a hybrid phoneme-text input scheme that provides precise pronunciation control. Our code is available at this https URL. Real-time speech synthesis demos are provided via this http URL (this http URL), the Zhipu Qingyan app/web (this http URL).

A Threshold-Triggered Deep Q-Network-Based Framework for Self-Healing in Autonomic Software-Defined IIoT-Edge Networks

Authors: Agrippina Mwangi (Utrecht University, The Netherlands), León Navarro-Hilfiker (Ørsted, USA), Lukasz Brewka (Ørsted, Denmark), Mikkel Gryning (Ørsted, Denmark), Elena Fumagalli (Utrecht University, The Netherlands), Madeleine Gibescu (Utrecht University, The Netherlands)
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Performance (cs.PF); High Energy Physics - Experiment (hep-ex)
Arxiv link: https://arxiv.org/abs/2512.14297
Pdf link: https://arxiv.org/pdf/2512.14297
Abstract Stochastic disruptions such as flash events arising from benign traffic bursts and switch thermal fluctuations are major contributors to intermittent service degradation in software-defined industrial networks. These events violate IEC 61850-derived quality-of-service requirements and user-defined service-level agreements, hindering the reliable and timely delivery of control, monitoring, and best-effort traffic in IEC 61400-25-compliant wind power plants. Failure to maintain these requirements often results in delayed or lost control signals, reduced operational efficiency, and increased risk of wind turbine generator downtime. To address these challenges, this study proposes a threshold-triggered Deep Q-Network self-healing agent that autonomically detects, analyzes, and mitigates network disruptions while adapting routing behavior and resource allocation in real time. The proposed agent was trained, validated, and tested on an emulated tri-clustered switch network deployed in a cloud-based proof-of-concept testbed. Simulation results show that the proposed agent improves disruption recovery performance by 53.84% compared to a baseline shortest-path and load-balanced routing approach and outperforms state-of-the-art methods, including the Adaptive Network-based Fuzzy Inference System by 13.1% and the Deep Q-Network and traffic prediction-based routing optimization method by 21.5%, in a super-spine leaf data-plane architecture. Additionally, the agent maintains switch thermal stability by proactively initiating external rack cooling when required. These findings highlight the potential of deep reinforcement learning in building resilience in software-defined industrial networks deployed in mission-critical, time-sensitive application scenarios.
中文摘要 随机扰动（例如由良性流量突发引起的瞬时拥塞事件（flash events）以及交换机热波动）是软件定义工业网络中间歇性服务降级的主要原因。这些事件违反了源自 IEC 61850 的服务质量要求和用户定义的服务等级协议，阻碍了符合 IEC 61400-25 的风电场中控制、监控和尽力而为流量的可靠、及时交付。未能满足这些要求常导致控制信号延迟或丢失、运行效率下降，以及风力发电机组停机风险增加。为应对这些挑战，本研究提出了一种阈值触发的深度Q网络自愈智能体，能够自主检测、分析并缓解网络中断，同时实时调整路由行为和资源分配。该智能体在部署于云端概念验证测试平台的仿真三集群交换机网络上进行了训练、验证和测试。仿真结果显示，在超级脊叶（super-spine leaf）数据平面架构下，所提智能体相比基线最短路径与负载均衡路由方法，中断恢复性能提升了53.84%，并优于现有先进方法，其中比自适应网络模糊推理系统高出13.1%，比基于深度Q网络和流量预测的路由优化方法高出21.5%。此外，该智能体在需要时主动启动外部机架冷却，以保持交换机的热稳定性。这些发现凸显了深度强化学习在为部署于关键任务、时间敏感应用场景的软件定义工业网络构建韧性方面的潜力。

Multi-Agent Medical Decision Consensus Matrix System: An Intelligent Collaborative Framework for Oncology MDT Consultations

多智能体医疗决策共识矩阵系统：一个用于肿瘤学MDT会诊的智能协作框架

Authors: Xudong Han, Xianglun Gao, Xiaoyi Qu, Zhenyu Yu
Subjects: Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2512.14321
Pdf link: https://arxiv.org/pdf/2512.14321
Abstract Multidisciplinary team (MDT) consultations are the gold standard for cancer care decision-making, yet current practice lacks structured mechanisms for quantifying consensus and ensuring decision traceability. We introduce a Multi-Agent Medical Decision Consensus Matrix System that deploys seven specialized large language model agents, including an oncologist, a radiologist, a nurse, a psychologist, a patient advocate, a nutritionist and a rehabilitation therapist, to simulate realistic MDT workflows. The framework incorporates a mathematically grounded consensus matrix that uses Kendall's coefficient of concordance to objectively assess agreement. To further enhance treatment recommendation quality and consensus efficiency, the system integrates reinforcement learning methods, including Q-Learning, PPO and DQN. Evaluation across five medical benchmarks (MedQA, PubMedQA, DDXPlus, MedBullets and SymCat) shows substantial gains over existing approaches, achieving an average accuracy of 87.5% compared with 83.8% for the strongest baseline, a consensus achievement rate of 89.3% and a mean Kendall's W of 0.823. Expert reviewers rated the clinical appropriateness of system outputs at 8.9/10. The system guarantees full evidence traceability through mandatory citations of clinical guidelines and peer-reviewed literature, following GRADE principles. This work advances medical AI by providing structured consensus measurement, role-specialized multi-agent collaboration and evidence-based explainability to improve the quality and efficiency of clinical decision-making.
中文摘要 多学科团队（MDT）会诊是癌症诊疗决策的金标准，但当前实践缺乏量化共识和确保决策可追溯性的结构化机制。我们提出了多智能体医疗决策共识矩阵系统，部署了七个专业化的大语言模型智能体，包括肿瘤科医生、放射科医生、护士、心理学家、患者权益倡导者、营养师和康复治疗师，以模拟真实的MDT工作流程。该框架包含一个具有数学基础的共识矩阵，利用肯德尔和谐系数（Kendall's coefficient of concordance）客观评估一致性。为进一步提升治疗建议质量和共识效率，系统集成了强化学习方法，包括Q-Learning、PPO和DQN。在五个医学基准（MedQA、PubMedQA、DDXPlus、MedBullets和SymCat）上的评估显示，该系统较现有方法有显著提升：平均准确率为87.5%，而最强基线为83.8%；共识达成率为89.3%，平均Kendall's W为0.823。专家评审对系统输出的临床适宜性评分为8.9/10。该系统遵循GRADE原则，通过强制引用临床指南和同行评审文献，确保完整的证据可追溯性。这项工作通过提供结构化的共识度量、角色专业化的多智能体协作以及基于证据的可解释性，推动了医疗人工智能的发展，从而提升临床决策的质量和效率。
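
Note: the abstract above measures agreement with Kendall's coefficient of concordance (Kendall's W). As a minimal sketch of the standard textbook formula, assuming m raters each give a complete, tie-free ranking of the same n items, W = 12 S / (m^2 (n^3 - n)), where S is the sum of squared deviations of the per-item rank sums from their mean. The function name `kendalls_w` and the toy rankings are illustrative assumptions, not taken from the paper, and the paper's consensus matrix may be built differently.

```python
from typing import Sequence


def kendalls_w(rankings: Sequence[Sequence[int]]) -> float:
    """Kendall's W for m raters ranking n items (ranks 1..n, no ties)."""
    m = len(rankings)
    n = len(rankings[0]) if m else 0
    if m < 2 or n < 2:
        raise ValueError("need at least two raters and two items")
    for row in rankings:
        # Each rater must supply a permutation of 1..n (no ties, no gaps).
        if sorted(row) != list(range(1, n + 1)):
            raise ValueError("each ranking must be a permutation of 1..n")
    # Sum of the ranks each item received across all raters.
    rank_sums = [sum(row[i] for row in rankings) for i in range(n)]
    mean_rank_sum = m * (n + 1) / 2
    # S: squared deviation of the rank sums from their mean.
    s = sum((r - mean_rank_sum) ** 2 for r in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical example: three raters in perfect agreement on three options.
print(kendalls_w([[1, 2, 3], [1, 2, 3], [1, 2, 3]]))  # 1.0
```

W ranges from 0 (no agreement) to 1 (identical rankings); tied ranks would need the usual tie correction, which this sketch omits.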

A data-physics hybrid generative model for patient-specific post-stroke motor rehabilitation using wearable sensor data

一种利用可穿戴传感器数据、面向患者个体化中风后运动康复的数据-物理混合生成模型

Authors: Yanning Dai, Chenyu Tang, Ruizhi Zhang, Wenyu Yang, Yilan Zhang, Yuhui Wang, Junliang Chen, Xuhang Chen, Ruimou Xie, Yangyue Cao, Qiaoying Li, Jin Cao, Tao Li, Hubin Zhao, Yu Pan, Arokia Nathan, Xin Gao, Peter Smielewski, Shuo Gao
Subjects: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.14329
Pdf link: https://arxiv.org/pdf/2512.14329
Abstract Dynamic prediction of locomotor capacity after stroke is crucial for tailoring rehabilitation, yet current assessments provide only static impairment scores and do not indicate whether patients can safely perform specific tasks such as slope walking or stair climbing. Here, we develop a data-physics hybrid generative framework that reconstructs an individual stroke survivor's neuromuscular control from a single 20 m level-ground walking trial and predicts task-conditioned locomotion across rehabilitation scenarios. The system combines wearable-sensor kinematics, a proportional-derivative physics controller, a population Healthy Motion Atlas, and goal-conditioned deep reinforcement learning with behaviour cloning and generative adversarial imitation learning to generate physically plausible, patient-specific gait simulations for slopes and stairs. In 11 stroke survivors, the personalized controllers preserved idiosyncratic gait patterns while improving joint-angle and endpoint fidelity by 4.73% and 12.10%, respectively, and reducing training time to 25.56% relative to a physics-only baseline. In a multicentre pilot involving 21 inpatients, clinicians who used our locomotion predictions to guide task selection and difficulty obtained larger gains in Fugl-Meyer lower-extremity scores over 28 days of standard rehabilitation than control clinicians (mean change 6.0 versus 3.7 points). These findings indicate that our generative, task-predictive framework can augment clinical decision-making in post-stroke gait rehabilitation and provide a template for dynamically personalized motor recovery strategies.
中文摘要 中风后运动能力的动态预测对于量身定制康复方案至关重要，但目前的评估仅提供静态的损伤评分，无法说明患者是否能安全完成斜坡行走或爬楼梯等特定任务。为此，我们开发了一个数据-物理混合生成框架，仅凭一次20米平地行走试验即可重建中风幸存者个体的神经肌肉控制，并预测各种康复场景下以任务为条件的运动表现。该系统结合了可穿戴传感器运动学、比例-微分（PD）物理控制器、群体健康运动图谱（Healthy Motion Atlas），以及带有行为克隆和生成对抗模仿学习的目标条件深度强化学习，为斜坡和楼梯场景生成物理上合理的患者特异性步态仿真。在11名中风幸存者中，个性化控制器保留了个体特有的步态模式，同时关节角度保真度和末端点保真度分别提升了4.73%和12.10%，训练时间缩短至仅基于物理的基线的25.56%。在一项涉及21名住院患者的多中心试点研究中，使用我们的运动预测来指导任务选择和难度设置的临床医生，在28天标准康复期间取得的Fugl-Meyer下肢评分提升大于对照组临床医生（平均变化6.0分对3.7分）。这些发现表明，我们的生成式任务预测框架能够辅助中风后步态康复的临床决策，并为动态个性化的运动恢复策略提供模板。
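
Note: the abstract above mentions a proportional-derivative (PD) physics controller. As a minimal sketch of the generic PD control law for a single joint, assuming a torque of the form kp * (q_target - q) + kd * (q_dot_target - q_dot): the function name `pd_torque`, the gains, and the numbers below are illustrative assumptions, not the paper's controller or parameters.

```python
def pd_torque(
    q: float,
    q_dot: float,
    q_target: float,
    kp: float,
    kd: float,
    q_dot_target: float = 0.0,
) -> float:
    """Generic PD law: torque from joint angle error and joint velocity error."""
    position_error = q_target - q
    velocity_error = q_dot_target - q_dot
    return kp * position_error + kd * velocity_error


# Hypothetical joint at rest at 0.0 rad, asked to move to 1.0 rad.
print(pd_torque(q=0.0, q_dot=0.0, q_target=1.0, kp=10.0, kd=1.0))  # 10.0
```

The proportional term pulls the joint toward the target angle and the derivative term damps its velocity; in a physics simulator this torque would be applied at every control step.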

Context-Picker: Dynamic context selection using multi-stage reinforcement learning

上下文选择器：利用多阶段强化学习进行动态上下文选择

Authors: Siyuan Zhu, Chengdong Xu, Kaiqiang Ke, Chao Yu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.14465
Pdf link: https://arxiv.org/pdf/2512.14465
Abstract In long-context question answering (LCQA), determining the optimal amount of context for a given query is a significant challenge. Including too few passages may omit critical information, while including too many can introduce noise and reduce the quality of the answer. Traditional approaches, such as fixed Top-K retrieval and single-stage reranking, face the dilemma of selecting the right number of passages. This problem is particularly pronounced for factoid questions, which often require only a few specific pieces of evidence. To address this issue, we introduce Context-Picker, a reasoning-aware framework that shifts the paradigm from similarity-based ranking to minimal sufficient subset selection. Context-Picker treats context selection as a decision-making process optimized via a human-inspired, two-stage reinforcement learning schedule: a recall-oriented stage that prioritizes the coverage of reasoning chains, followed by a precision-oriented stage that aggressively prunes redundancy to distill a compact evidence set. To resolve reward sparsity, we propose an offline evidence distillation pipeline that mines "minimal sufficient sets" via a Leave-One-Out (LOO) procedure, providing dense, task-aligned supervision. Experiments on five long-context and multi-hop QA benchmarks demonstrate that Context-Picker significantly outperforms strong RAG baselines, achieving superior answer accuracy with comparable or reduced context lengths. Ablation studies indicate that the coarse-to-fine optimization schedule, the redundancy-aware reward shaping, and the rationale-guided format all contribute substantially to these gains.
中文摘要 在长上下文问答（LCQA）中，确定给定查询的最佳上下文量是一个重大挑战。包含的段落过少可能遗漏关键信息，而过多则可能引入噪声并降低答案质量。传统方法，如固定的 Top-K 检索和单阶段重排序，面临如何选择合适段落数量的难题。这个问题在事实型问题中尤为明显，因为这类问题通常只需少数具体证据。为解决这个问题，我们提出了 Context-Picker，这是一个推理感知框架，将范式从基于相似度的排序转向最小充分子集选择。Context-Picker 将上下文选择视为一个决策过程，并通过受人类启发的两阶段强化学习计划进行优化：先是以召回为导向的阶段，优先覆盖推理链；随后是以精确率为导向的阶段，积极修剪冗余，提炼出紧凑的证据集。为解决奖励稀疏问题，我们提出了一种离线证据蒸馏流程，通过留一法（Leave-One-Out，LOO）程序挖掘“最小充分集”，提供密集且与任务对齐的监督。在五个长上下文和多跳问答（QA）基准上的实验表明，Context-Picker 显著优于强 RAG 基线，在相当或更短的上下文长度下实现了更高的答案准确率。消融研究表明，由粗到细的优化计划、冗余感知的奖励塑造以及理据引导的格式都对这些收益有显著贡献。
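
Note: the abstract above mines "minimal sufficient sets" with a Leave-One-Out (LOO) procedure. As a minimal sketch of one way such a procedure could work, assuming a caller-supplied sufficiency check: each passage is tentatively removed, and the removal is kept only if the remaining set still passes the check. The names `loo_minimal_set` and `is_sufficient`, and the keyword-based toy check, are hypothetical; the paper's actual pipeline and sufficiency criterion are not described in the abstract.

```python
from typing import Callable, List, Sequence


def loo_minimal_set(
    passages: Sequence[str],
    is_sufficient: Callable[[List[str]], bool],
) -> List[str]:
    """Greedy leave-one-out pruning of a passage set."""
    kept = list(passages)
    if not is_sufficient(kept):
        raise ValueError("the full passage set is not sufficient")
    i = 0
    while i < len(kept):
        candidate = kept[:i] + kept[i + 1:]  # leave passage i out
        if is_sufficient(candidate):
            kept = candidate  # passage i was redundant; drop it
        else:
            i += 1  # passage i is needed; keep it and move on
    return kept


# Toy sufficiency check (assumption): the evidence must mention both keywords.
def mentions_all_keywords(subset: List[str]) -> bool:
    text = " ".join(subset).lower()
    return all(keyword in text for keyword in ("capital", "europe"))


passages = [
    "Paris is the capital of France.",
    "France is in Europe.",
    "Bananas are yellow.",
]
result = loo_minimal_set(passages, mentions_all_keywords)
print(result)  # ['Paris is the capital of France.', 'France is in Europe.']
```

If the check is monotone (adding passages never makes a sufficient set insufficient), no single passage can be removed from the result without breaking sufficiency. The result is not guaranteed to be the smallest possible sufficient set.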

RecGPT-V2 Technical Report

RecGPT-V2 技术报告

Authors: Chao Yi, Dian Chen, Gaoyang Guo, Jiakai Tang, Jian Wu, Jing Yu, Mao Zhang, Wen Chen, Wenjun Yang, Yujie Luo, Yuning Jiang, Zhujin Gao, Bo Zheng, Binbin Cao, Changfa Wu, Dixuan Wang, Han Wu, Haoyi Hu, Kewei Zhu, Lang Tian, Lin Yang, Qiqi Huang, Siqi Yang, Wenbo Su, Xiaoxiao He, Xin Tong, Xu Chen, Xunke Xi, Xiaowei Huang, Yaxuan Wu, Yeqiu Yang, Yi Hu, Yujin Yuan, Yuliang Yan, Zile Zhou
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.14503
Pdf link: https://arxiv.org/pdf/2512.14503
Abstract Large language models (LLMs) have demonstrated remarkable potential in transforming recommender systems from implicit behavioral pattern matching to explicit intent reasoning. While RecGPT-V1 successfully pioneered this paradigm by integrating LLM-based reasoning into user interest mining and item tag prediction, it suffers from four fundamental limitations: (1) computational inefficiency and cognitive redundancy across multiple reasoning routes; (2) insufficient explanation diversity in fixed-template generation; (3) limited generalization under supervised learning paradigms; and (4) simplistic outcome-focused evaluation that fails to match human standards. To address these challenges, we present RecGPT-V2 with four key innovations. First, a Hierarchical Multi-Agent System restructures intent reasoning through coordinated collaboration, eliminating cognitive duplication while enabling diverse intent coverage. Combined with Hybrid Representation Inference that compresses user-behavior contexts, our framework reduces GPU consumption by 60% and improves exclusive recall from 9.39% to 10.99%. Second, a Meta-Prompting framework dynamically generates contextually adaptive prompts, improving explanation diversity by +7.3%. Third, constrained reinforcement learning mitigates multi-reward conflicts, achieving +24.1% improvement in tag prediction and +13.0% in explanation acceptance. Fourth, an Agent-as-a-Judge framework decomposes assessment into multi-step reasoning, improving human preference alignment. Online A/B tests on Taobao demonstrate significant improvements: +2.98% CTR, +3.71% IPV, +2.19% TV, and +11.46% NER. RecGPT-V2 establishes both the technical feasibility and commercial viability of deploying LLM-powered intent reasoning at scale, bridging the gap between cognitive exploration and industrial utility.
中文摘要 大型语言模型（LLMs）在将推荐系统从隐式行为模式匹配转变为显式意图推理方面展现出显著潜力。虽然RecGPT-V1通过将基于LLM的推理整合到用户兴趣挖掘和商品标签预测中，成功开创了这一范式，但它存在四个根本性局限：（1）多条推理路径上的计算效率低下和认知冗余；（2）固定模板生成导致解释多样性不足；（3）监督学习范式下泛化能力有限；以及（4）过于简单、只关注结果的评估方式，未能符合人类标准。为应对这些挑战，我们提出了具有四项关键创新的RecGPT-V2。首先，层级多智能体系统通过协调协作重构意图推理，消除认知重复，同时实现多样化的意图覆盖。结合压缩用户行为上下文的混合表示推理，我们的框架将GPU消耗降低60%，并将独占召回率从9.39%提升至10.99%。其次，元提示（Meta-Prompting）框架动态生成上下文自适应的提示，使解释多样性提升7.3%。第三，约束强化学习缓解了多奖励冲突，使标签预测提升24.1%，解释接受度提升13.0%。第四，Agent-as-a-Judge（智能体即评判者）框架将评估分解为多步推理，提升了与人类偏好的对齐程度。在淘宝上进行的在线A/B测试显示出显著改善：CTR +2.98%，IPV +3.71%，TV +2.19%，NER +11.46%。RecGPT-V2 确立了大规模部署由LLM驱动的意图推理的技术可行性和商业可行性，弥合了认知探索与工业实用性之间的鸿沟。

Model-Based Reinforcement Learning in Discrete-Action Non-Markovian Reward Decision Processes

离散行动非马尔可夫奖励决策过程中的基于模型的强化学习

Authors: Alessandro Trapasso, Luca Iocchi, Fabio Patrizi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.14617
Pdf link: https://arxiv.org/pdf/2512.14617
Abstract Many practical decision-making problems involve tasks whose success depends on the entire system history, rather than on achieving a state with desired properties. Markovian Reinforcement Learning (RL) approaches are not suitable for such tasks, while RL with non-Markovian reward decision processes (NMRDPs) enables agents to tackle temporal-dependency tasks. This approach has long been known to lack formal guarantees on both (near-)optimality and sample efficiency. We contribute to solving both issues with QR-MAX, a novel model-based algorithm for discrete NMRDPs that factorizes Markovian transition learning from non-Markovian reward handling via reward machines. To the best of our knowledge, this is the first model-based RL algorithm for discrete-action NMRDPs that exploits this factorization to obtain PAC convergence to $\varepsilon$-optimal policies with polynomial sample complexity. We then extend QR-MAX to continuous state spaces with Bucket-QR-MAX, a SimHash-based discretiser that preserves the same factorized structure and achieves fast and stable learning without manual gridding or function approximation. We experimentally compare our method with modern state-of-the-art model-based RL approaches on environments of increasing complexity, showing a significant improvement in sample efficiency and increased robustness in finding optimal policies.
中文摘要 在许多实际决策问题中，任务的成功取决于整个系统历史，而非到达某个具有期望属性的状态。马尔可夫强化学习（RL）方法不适合此类任务，而基于非马尔可夫奖励决策过程（NMRDPs）的RL则使智能体能够处理具有时间依赖性的任务。长期以来，这种方法一直缺乏关于（近）最优性和样本效率的形式化保证。我们提出QR-MAX来解决这两个问题，这是一种面向离散NMRDP的新型基于模型的算法，它借助奖励机（reward machines）将马尔可夫转移学习与非马尔可夫奖励处理分解开来。据我们所知，这是首个面向离散动作NMRDP的基于模型的强化学习算法，它利用该分解以多项式样本复杂度实现向$\varepsilon$-最优策略的PAC收敛。随后，我们通过 Bucket-QR-MAX 将 QR-MAX 扩展到连续状态空间，这是一种基于 SimHash 的离散化器，保持相同的分解结构，无需手动划分网格或函数近似即可实现快速、稳定的学习。我们在复杂度逐步提升的环境中，通过实验将我们的方法与现代最先进的基于模型的强化学习方法进行比较，结果显示样本效率显著提升，且在寻找最优策略方面更加稳健。
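
Note: the abstract above describes a SimHash-based discretiser for continuous state spaces. As a minimal sketch of the generic SimHash idea, assuming random Gaussian hyperplanes: each state is projected onto k random directions and the signs of the projections form a k-bit bucket key. The class name `SimHashDiscretizer` and its parameters are illustrative assumptions; the abstract does not say how Bucket-QR-MAX configures or uses its hash.

```python
import random
from typing import List, Sequence, Tuple


class SimHashDiscretizer:
    """Maps a continuous state vector to a k-bit bucket via random hyperplanes."""

    def __init__(self, state_dim: int, num_bits: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        # One random Gaussian hyperplane normal per output bit.
        self.planes: List[List[float]] = [
            [rng.gauss(0.0, 1.0) for _ in range(state_dim)]
            for _ in range(num_bits)
        ]

    def bucket(self, state: Sequence[float]) -> Tuple[int, ...]:
        if len(state) != len(self.planes[0]):
            raise ValueError("state has the wrong dimension")
        # Bit j is 1 when the state lies on the non-negative side of plane j.
        return tuple(
            1 if sum(w * x for w, x in zip(plane, state)) >= 0.0 else 0
            for plane in self.planes
        )


# Hypothetical usage: the bucket tuple can serve as a key in tabular statistics.
discretizer = SimHashDiscretizer(state_dim=3, num_bits=8, seed=0)
key = discretizer.bucket([0.3, -1.2, 0.5])
print(len(key))  # 8
```

States separated by a small angle tend to share a bucket, so a tabular learner can index counts and value estimates by the bucket key. Plain SimHash depends only on direction, so positive rescaling of a state does not change its bucket.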

CRISP: Contact-Guided Real2Sim from Monocular Video with Planar Scene Primitives

CRISP：基于平面场景基元、由接触引导的单目视频 Real2Sim

Authors: Zihan Wang, Jiashun Wang, Jeff Tan, Yiwen Zhao, Jessica Hodgins, Shubham Tulsiani, Deva Ramanan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.14696
Pdf link: https://arxiv.org/pdf/2512.14696
Abstract We introduce CRISP, a method that recovers simulatable human motion and scene geometry from monocular video. Prior work on joint human-scene reconstruction relies on data-driven priors and joint optimization with no physics in the loop, or recovers noisy geometry with artifacts that cause motion tracking policies with scene interactions to fail. In contrast, our key insight is to recover convex, clean, and simulation-ready geometry by fitting planar primitives to a point cloud reconstruction of the scene, via a simple clustering pipeline over depth, normals, and flow. To reconstruct scene geometry that might be occluded during interactions, we make use of human-scene contact modeling (e.g., we use human posture to reconstruct the occluded seat of a chair). Finally, we ensure that human and scene reconstructions are physically-plausible by using them to drive a humanoid controller via reinforcement learning. Our approach reduces motion tracking failure rates from 55.2% to 6.9% on human-centric video benchmarks (EMDB, PROX), while delivering a 43% faster RL simulation throughput. We further validate it on in-the-wild videos including casually-captured videos, Internet videos, and even Sora-generated videos. This demonstrates CRISP's ability to generate physically-valid human motion and interaction environments at scale, greatly advancing real-to-sim applications for robotics and AR/VR.
中文摘要 我们提出了CRISP，这是一种从单目视频中恢复可仿真的人体运动和场景几何的方法。此前关于人与场景联合重建的工作要么依赖数据驱动的先验和联合优化，优化回路中不包含物理；要么恢复出带有伪影的噪声几何，导致涉及场景交互的运动跟踪策略失败。相比之下，我们的关键见解是：通过对深度、法线和光流进行简单的聚类流水线处理，将平面基元拟合到场景的点云重建上，从而恢复凸的、干净的、可直接用于仿真的几何体。为了重建交互过程中可能被遮挡的场景几何，我们利用人与场景的接触建模（例如，利用人体姿态重建椅子被遮挡的座面）。最后，我们用重建结果通过强化学习驱动人形控制器，以确保人体和场景的重建在物理上合理。我们的方法在以人为中心的视频基准（EMDB、PROX）上将运动跟踪失败率从55.2%降至6.9%，同时使强化学习仿真吞吐量提升43%。我们进一步在真实场景视频上验证了该方法，包括随手拍摄的视频、网络视频，甚至Sora生成的视频。这展示了CRISP大规模生成物理上有效的人体运动和交互环境的能力，极大推动了面向机器人和AR/VR的真实到仿真（real-to-sim）应用。

TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs

TimeLens：重新思考基于多模态大语言模型的视频时间定位

Authors: Jun Zhang, Teng Wang, Yuying Ge, Yixiao Ge, Xinhao Li, Ying Shan, Limin Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Arxiv link: https://arxiv.org/abs/2512.14698
Pdf link: https://arxiv.org/pdf/2512.14698
Abstract This paper does not introduce a novel method but instead establishes a straightforward, incremental, yet essential baseline for video temporal grounding (VTG), a core capability in video understanding. While multimodal large language models (MLLMs) excel at various video understanding tasks, the recipes for optimizing them for VTG remain under-explored. In this paper, we present TimeLens, a systematic investigation into building MLLMs with strong VTG ability, along two primary dimensions: data quality and algorithmic design. We first expose critical quality issues in existing VTG benchmarks and introduce TimeLens-Bench, comprising meticulously re-annotated versions of three popular benchmarks with strict quality criteria. Our analysis reveals dramatic model re-rankings compared to legacy benchmarks, confirming the unreliability of prior evaluation standards. We also address noisy training data through an automated re-annotation pipeline, yielding TimeLens-100K, a large-scale, high-quality training dataset. Building on our data foundation, we conduct in-depth explorations of algorithmic design principles, yielding a series of meaningful insights and effective yet efficient practices. These include interleaved textual encoding for time representation, a thinking-free reinforcement learning with verifiable rewards (RLVR) approach as the training paradigm, and carefully designed recipes for RLVR training. These efforts culminate in the TimeLens models, a family of MLLMs that achieve state-of-the-art VTG performance among open-source models and even surpass proprietary models such as GPT-5 and Gemini-2.5-Flash. All code, data, and models will be released to facilitate future research.
中文摘要 本文并未提出新方法，而是为视频时间定位（VTG）建立了一个直接、渐进但至关重要的基线，VTG是视频理解的一项核心能力。虽然多模态大型语言模型（MLLMs）在各种视频理解任务中表现出色，但针对VTG对其进行优化的方案仍未得到充分探索。本文提出了TimeLens，这是一项系统性研究，旨在构建具有强大VTG能力的MLLMs，主要围绕两个维度展开：数据质量和算法设计。我们首先揭示了现有VTG基准中的关键质量问题，并引入TimeLens-Bench，它由三个常用基准按严格质量标准细致重新标注的版本组成。我们的分析显示，与旧有基准相比，模型排名发生了显著变化，证实了以往评估标准的不可靠性。我们还通过自动重新标注流程处理训练数据中的噪声，得到了TimeLens-100K，一个大规模、高质量的训练数据集。基于这一数据基础，我们深入探讨了算法设计原则，得出一系列有意义的见解以及有效且高效的实践。这些包括用于时间表示的交错文本编码、以无需思考过程的可验证奖励强化学习（RLVR）方法作为训练范式，以及精心设计的RLVR训练方案。这些努力最终产生了TimeLens模型，这是一个MLLM家族，在开源模型中拥有最先进的VTG性能，甚至超越了GPT-5和Gemini-2.5-Flash等专有模型。所有代码、数据和模型都将公开，以促进未来的研究。

Keyword: diffusion policy

There is no result