Arxiv Papers of Today

Generated: 2025-11-21 16:31:35 (UTC+8); arXiv release time: 2025-11-21 20:00 EST (2025-11-22 09:00 UTC+8)

32 relevant papers today

Keyword: reinforcement learning

Integrated 4D/5D Digital-Twin Framework for Cost Estimation and Probabilistic Schedule Control: A Texas Mid-Rise Case Study

Authors: Atena Khoshkonesh, Mohsen Mohammadagha, Navid Ebrahimi
Subjects: Computational Engineering, Finance, and Science (cs.CE)
Arxiv link: https://arxiv.org/abs/2511.15711
Pdf link: https://arxiv.org/pdf/2511.15711
Abstract Persistent cost and schedule overruns in U.S. building projects expose limitations of conventional, document-based estimating and deterministic Critical Path Method (CPM) scheduling, which remain inflexible under uncertainty and lag behind dynamic field conditions. This study presents an integrated 4D/5D digital-twin framework unifying Building Information Modeling (BIM), natural language processing (NLP), reality capture, computer vision, Bayesian risk modeling, and deep reinforcement learning (DRL) for construction cost and schedule control. The system automates project-control functions by: (a) mapping contract documents to standardized cost items using transformer-based NLP (0.883 weighted F1 score); (b) aligning photogrammetry and LiDAR data with BIM to compute earned value; (c) deriving real-time activity completion from site imagery (0.891 micro accuracy); (d) updating probabilistic CPM forecasts via Bayesian inference and Monte Carlo simulation; (e) using DRL for adaptive resource allocation (75% adoption rate); and (f) providing a 4D/5D decision sandbox for predictive analysis. A Texas mid-rise case study demonstrates localized cost adjustment using the RSMeans City Cost Index and Bureau of Labor Statistics wage data. Results show a 43% reduction in estimating labor, a 6% overtime reduction (91 hours), and project completion matching the P50 probabilistic forecast of 128 days, confirming improved estimation accuracy and responsiveness.
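Step (d) of the abstract, a probabilistic CPM forecast via Monte Carlo simulation, can be illustrated with a minimal sketch. The activity network, the triangular duration ranges, and the function names below are hypothetical and are not taken from the paper; the Bayesian updating of the duration distributions is omitted.

```python
import random
import statistics

# Hypothetical activity network: name -> (predecessors, (min, mode, max) days).
# All values are illustrative assumptions, not data from the paper.
ACTIVITIES = {
    "foundation": ([], (18, 20, 26)),
    "structure": (["foundation"], (55, 60, 75)),
    "envelope": (["structure"], (25, 30, 40)),
    "mep_rough_in": (["structure"], (28, 32, 45)),
    "finishes": (["envelope", "mep_rough_in"], (14, 16, 22)),
}

def simulate_once(rng):
    """Sample one duration per activity and return the project finish time."""
    finish = {}
    # The dict above is listed in topological order, so predecessors come first.
    for name, (preds, (lo, mode, hi)) in ACTIVITIES.items():
        start = max((finish[p] for p in preds), default=0.0)
        finish[name] = start + rng.triangular(lo, hi, mode)
    return max(finish.values())

def forecast(n_runs=5000, seed=0):
    """Return the (P50, P80) completion times in days over n_runs simulations."""
    rng = random.Random(seed)
    totals = sorted(simulate_once(rng) for _ in range(n_runs))
    p50 = statistics.median(totals)
    p80 = totals[int(0.8 * n_runs) - 1]
    return p50, p80

if __name__ == "__main__":
    p50, p80 = forecast()
    print(f"P50 = {p50:.1f} days, P80 = {p80:.1f} days")
```

A P50 forecast is the duration that half of the simulated schedules finish within, which is the sense in which the abstract reports completion matching the P50 forecast of 128 days.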

MACIE: Multi-Agent Causal Intelligence Explainer for Collective Behavior Understanding

Authors: Abraham Itzhak Weinberg
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.15716
Pdf link: https://arxiv.org/pdf/2511.15716
Abstract As Multi-Agent Reinforcement Learning (MARL) systems are used in safety-critical applications, understanding why agents make decisions and how they achieve collective behavior is crucial. Existing explainable AI methods struggle in multi-agent settings: they fail to attribute collective outcomes to individuals, quantify emergent behaviors, or capture complex interactions. We present MACIE (Multi-Agent Causal Intelligence Explainer), a framework combining structural causal models, interventional counterfactuals, and Shapley values to provide comprehensive explanations. MACIE addresses three questions: first, each agent's causal contribution, using interventional attribution scores; second, system-level emergent intelligence, through synergy metrics separating collective effects from individual contributions; third, actionable explanations, using natural language narratives synthesizing causal insights. We evaluate MACIE across four MARL scenarios spanning cooperative, competitive, and mixed-motive settings. Results show accurate outcome attribution (mean phi_i = 5.07, standard deviation less than 0.05), detection of positive emergence in cooperative tasks (synergy index up to 0.461), and efficient computation (0.79 seconds per dataset on CPU). MACIE uniquely combines causal rigor, emergence quantification, and multi-agent support while remaining practical for real-time use. This represents a step toward interpretable, trustworthy, and accountable multi-agent AI.
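To make the Shapley-value ingredient concrete, here is a minimal sketch of exact Shapley attribution for a three-agent team. The team-return function, the agent names, and the synergy ratio are hypothetical illustrations; the paper's interventional attribution scores and its synergy index may be defined differently.

```python
from itertools import permutations

def shapley_values(agents, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    phi = {a: 0.0 for a in agents}
    orderings = list(permutations(agents))
    for order in orderings:
        coalition = frozenset()
        for a in order:
            with_a = coalition | {a}
            phi[a] += value(with_a) - value(coalition)
            coalition = with_a
    return {a: total / len(orderings) for a, total in phi.items()}

def team_return(coalition):
    """Hypothetical return: 2 per agent, plus a bonus of 3 when a1 and a2 cooperate."""
    base = 2.0 * len(coalition)
    bonus = 3.0 if {"a1", "a2"} <= coalition else 0.0
    return base + bonus

agents = ["a1", "a2", "a3"]
phi = shapley_values(agents, team_return)

# Illustrative synergy ratio (an assumption, not the paper's metric): the share
# of the full-team return that is not explained by agents acting alone.
full = team_return(frozenset(agents))
solo_sum = sum(team_return(frozenset({a})) for a in agents)
synergy = (full - solo_sum) / full
```

The attributions sum to the full-team return, and the cooperation bonus is split evenly between the two agents that produce it. Exact enumeration costs O(n!) orderings, so larger teams would need sampling-based approximations.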

Extending Test-Time Scaling: A 3D Perspective with Context, Batch, and Turn

Authors: Chao Yu (1), Qixin Tan (1), Jiaxuan Gao (1), Shi Yu (1), Hong Lu (1), Xinting Yang (1), Zelai Xu (1), Yu Wang (1), Yi Wu (1), Eugene Vinitsky (2) ((1) Tsinghua University, (2) New York University)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.15738
Pdf link: https://arxiv.org/pdf/2511.15738
Abstract Reasoning reinforcement learning (RL) has recently revealed a new scaling effect: test-time scaling. Thinking models such as R1 and o1 improve their reasoning accuracy at test time as the length of the reasoning context increases. However, compared with training-time scaling, test-time scaling is fundamentally limited by the limited context length of base models, which remains orders of magnitude smaller than the amount of tokens consumed during training. We revisit test-time enhancement techniques through the lens of scaling effect and introduce a unified framework of multi-dimensional test-time scaling to extend the capacity of test-time reasoning. Beyond conventional context-length scaling, we consider two additional dimensions: batch scaling, where accuracy improves with parallel sampling, and turn scaling, where iterative self-refinement enhances reasoning quality. Building on this perspective, we propose 3D test-time scaling, which integrates context, batch, and turn scaling. We show that: (1) each dimension demonstrates a test-time scaling effect, but with a bounded capacity; (2) combining all three dimensions substantially improves the reasoning performance of challenging testbeds, including IOI, IMO, and CPHO, and further benefits from human preference feedback; and (3) the human-in-the-loop framework naturally extends to a more open-ended domain, i.e., embodied learning, which enables the design of humanoid control behaviors.
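The three dimensions can be pictured with a small sketch: context scaling happens inside each generation call, batch scaling samples several candidates in parallel, and turn scaling refines each candidate iteratively. The `generate` and `refine` callables and the majority-vote aggregation are assumptions made for illustration; the abstract does not say how the paper combines the dimensions.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer among the candidates."""
    return Counter(answers).most_common(1)[0][0]

def three_d_scaling(generate, refine, prompt, batch=4, turns=2):
    """Combine batch and turn scaling around a generation call.

    generate(prompt) -> answer and refine(prompt, answer) -> answer are
    hypothetical callables wrapping a reasoning model; context scaling is
    whatever reasoning length the model uses inside those calls.
    """
    candidates = [generate(prompt) for _ in range(batch)]           # batch scaling
    for _ in range(turns):                                          # turn scaling
        candidates = [refine(prompt, c) for c in candidates]
    return majority_vote(candidates)

# Usage with deterministic stand-ins for a model:
_samples = iter(["41", "42", "42", "40"])
answer = three_d_scaling(
    generate=lambda prompt: next(_samples),
    refine=lambda prompt, a: "42" if a == "41" else a,
    prompt="What is 6 * 7?",
)
```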

Thinking, Faithful and Stable: Mitigating Hallucinations in LLMs

Authors: Chelsea Zou, Yiheng Yao, Basant Khalil
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.15921
Pdf link: https://arxiv.org/pdf/2511.15921
Abstract This project develops a self-correcting framework for large language models (LLMs) that detects and mitigates hallucinations during multi-step reasoning. Rather than relying solely on final answer correctness, our approach leverages fine-grained uncertainty signals, 1) self-assessed confidence alignment and 2) token-level entropy spikes, to detect unreliable and unfaithful reasoning in real time. We design a composite reward function that penalizes unjustified high confidence and entropy spikes, while encouraging stable and accurate reasoning trajectories. These signals guide a reinforcement learning (RL) policy that makes the model more introspective and shapes the model's generation behavior through confidence-aware reward feedback, improving not just outcome correctness but also the coherence and faithfulness of its intermediate reasoning steps. Experiments show that our method improves both final answer accuracy and reasoning calibration, with ablations validating the individual contribution of each signal.
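A composite reward of the kind described could look like the following sketch. The weights, the spike threshold, and the exact penalty forms are hypothetical choices for illustration; the abstract does not give the paper's formula.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def composite_reward(correct, confidence, step_entropies,
                     spike_threshold=1.0, w_conf=1.0, w_spike=0.5):
    """Outcome reward minus penalties for overconfidence and entropy spikes.

    correct: whether the final answer is right.
    confidence: the model's self-assessed confidence in [0, 1].
    step_entropies: per-token entropies along the reasoning trajectory.
    The threshold and weights are assumed values, not the paper's.
    """
    outcome = 1.0 if correct else 0.0
    # Penalize confidence that exceeds what the outcome justifies.
    overconfidence = max(0.0, confidence - outcome)
    # Count jumps in entropy between consecutive tokens above the threshold.
    spikes = sum(
        1 for prev, cur in zip(step_entropies, step_entropies[1:])
        if cur - prev > spike_threshold
    )
    spike_rate = spikes / max(1, len(step_entropies) - 1)
    return outcome - w_conf * overconfidence - w_spike * spike_rate
```

Under these assumptions a correct, stable trajectory keeps the full outcome reward, while a wrong answer stated with 0.9 confidence and one entropy spike in two transitions scores 0 - 0.9 - 0.25 = -1.15.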

KRAL: Knowledge and Reasoning Augmented Learning for LLM-assisted Clinical Antimicrobial Therapy

Authors: Zhe Li, Yehan Qiu, Yujie Chen, Xiang Zhou
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.15974
Pdf link: https://arxiv.org/pdf/2511.15974
Abstract Clinical antimicrobial therapy requires the dynamic integration of pathogen profiles, host factors, pharmacological properties of antimicrobials, and the severity of [...] complexity imposes fundamental limitations on the applicability of Large Language Models (LLMs) in high-stakes clinical decision-making, including knowledge gaps, data privacy concerns, high deployment costs, and limited reasoning capabilities. To address these challenges, we propose KRAL (Knowledge and Reasoning Augmented Learning), a low-cost, scalable, privacy-preserving paradigm that leverages teacher-model reasoning to automatically distill knowledge and reasoning trajectories via answer-to-question reverse generation, employs heuristic learning for semi-supervised data augmentation (reducing manual annotation requirements by approximately 80%), and utilizes agentic reinforcement learning to jointly enhance medical knowledge and reasoning while optimizing computational and memory efficiency. A hierarchical evaluation employing diverse teacher-model proxies reduces assessment costs, while modular interface design facilitates seamless system updates. Experimental results demonstrate that KRAL significantly outperforms traditional Retrieval-Augmented Generation (RAG) and Supervised Fine-Tuning (SFT) methods. It improves knowledge question-answering capability (Accuracy@1 on the external open-source benchmark MEDQA increased by 1.8% vs. SFT and 3.6% vs. RAG) and reasoning capability (Pass@1 on the external benchmark PUMCH Antimicrobial increased by 27% vs. SFT and 27.2% vs. RAG), achieved at ~20% of SFT's long-term training costs. This establishes KRAL as an effective solution for enhancing local LLMs' clinical diagnostic capabilities, enabling low-cost, high-safety deployment in complex medical decision support.

HGCN2SP: Hierarchical Graph Convolutional Network for Two-Stage Stochastic Programming

Authors: Yang Wu, Yifan Zhang, Zhenxing Liang, Jian Cheng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.16027
Pdf link: https://arxiv.org/pdf/2511.16027
Abstract Two-stage Stochastic Programming (2SP) is a standard framework for modeling decision-making problems under uncertainty. While numerous methods exist, solving such problems with many scenarios remains challenging. Selecting representative scenarios is a practical method for accelerating solutions. However, current approaches typically rely on clustering or Monte Carlo sampling, failing to integrate scenario information deeply and overlooking the significant impact of the scenario order on solving time. To address these issues, we develop HGCN2SP, a novel model with a hierarchical graph designed for 2SP problems, encoding each scenario and modeling their relationships hierarchically. The model is trained in a reinforcement learning paradigm to utilize the feedback of the solver. The policy network is equipped with a hierarchical graph convolutional network for feature encoding and an attention-based decoder for scenario selection in proper order. Evaluation of two classic 2SP problems demonstrates that HGCN2SP provides high-quality decisions in a short computational time. Furthermore, HGCN2SP exhibits remarkable generalization capabilities in handling large-scale instances, even with a substantial number of variables or scenarios that were unseen during the training phase.

Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning

Authors: Peng Xia, Kaide Zeng, Jiaqi Liu, Can Qin, Fang Wu, Yiyang Zhou, Caiming Xiong, Huaxiu Yao
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.16043
Pdf link: https://arxiv.org/pdf/2511.16043
Abstract Large Language Model (LLM) Agents, often trained with Reinforcement Learning (RL), are constrained by a dependency on human-curated data, limiting scalability and tethering AI to human knowledge. Existing self-evolution frameworks offer an alternative but are typically restricted by the model's inherent capabilities and single-round interactions, hindering the development of complex curricula involving tool use or dynamic reasoning. We introduce Agent0, a fully autonomous framework that evolves high-performing agents without external data through multi-step co-evolution and seamless tool integration. Agent0 establishes a symbiotic competition between two agents initialized from the same base LLM: a curriculum agent that proposes increasingly challenging frontier tasks, and an executor agent that learns to solve them. We integrate external tools to enhance the executor's problem-solving capacity; this improvement, in turn, pressures the curriculum agent to construct more complex, tool-aware tasks. Through this iterative process, Agent0 establishes a self-reinforcing cycle that continuously produces high-quality curricula. Empirically, Agent0 substantially boosts reasoning capabilities, improving the Qwen3-8B-Base model by 18% on mathematical reasoning and 24% on general reasoning benchmarks. Code is available at this https URL.

Bellman Memory Units: A neuromorphic framework for synaptic reinforcement learning with an evolving network topology

Authors: Shreyan Banerjee, Aasifa Rounak, Vikram Pakrashi
Subjects: Systems and Control (eess.SY); Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/2511.16066
Pdf link: https://arxiv.org/pdf/2511.16066
Abstract Application of neuromorphic edge devices for control is limited by the constraints on gradient-free online learning and scalability of the hardware across control problems. This paper introduces a synaptic Q-learning algorithm for the control of the classical Cartpole, where the Bellman equations are incorporated at the synaptic level. This formulation enables the iterative evolution of the network topology, represented as a directed graph, throughout the training process. This is followed by a similar approach called neuromorphic Bellman Memory Units (BMU(s)), which are implemented with the Neural Engineering Framework on Intel's Loihi neuromorphic chip. Topology evolution, in conjunction with mixed-signal computation, leverages the optimization of the number of neurons and synapses that could be used to design spike-based reinforcement learning accelerators. The proposed architecture can potentially reduce resource utilization on board, aiding the manufacturing of compact application-specific neuromorphic ICs. Moreover, the on-chip learning introduced in this work and implemented on a neuromorphic chip can enable adaptation to unseen control scenarios.
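For readers unfamiliar with the Bellman update that the paper moves to the synaptic level, here is a minimal sketch of the standard tabular Q-learning rule. This is the textbook, gradient-free update, not the paper's synaptic or spiking formulation; the state names, learning rate, and discount factor are hypothetical.

```python
def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """Apply one Q-learning step: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).

    q_table maps each state to a list of action values. The update needs no
    gradients, only the stored values and the observed transition.
    """
    td_target = reward + gamma * max(q_table[next_state])
    td_error = td_target - q_table[state][action]
    q_table[state][action] += alpha * td_error
    return td_error

# Usage on a toy two-state table (values are illustrative):
q = {"s0": [0.0, 0.0], "s1": [1.0, 2.0]}
err = q_update(q, "s0", 1, reward=1.0, next_state="s1", alpha=0.5, gamma=0.9)
```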

A Mathematical Framework for Custom Reward Functions in Job Application Evaluation using Reinforcement Learning

Authors: Shreyansh Jain, Madhav Singhvi, Shreya Rahul Jain, Pranav S, Dishaa Lokesh, Naren Chittibabu, Akash Anandhan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2511.16073
Pdf link: https://arxiv.org/pdf/2511.16073
Abstract Conventional Applicant Tracking Systems (ATS) tend to be inflexible keyword matchers and can deny gifted candidates a role because of a few minor semantic mismatches. This article describes a new two-step process for building a more refined resume evaluation model based on a small language model (<600M parameters) that is fine-tuned using GRPO on a custom reward function. First, Supervised Fine-Tuning (SFT) was used to build a solid baseline model. Second, this SFT model was further optimized with Reinforcement Learning (RL) through GRPO, guided by a new multi-component reward function that can holistically assess candidates beyond simple keyword matching. We show that applying RL raises a critical reward-hacking problem: initial experiments with aggressive penalties produced faulty, excessively negative model behaviors. We overcame this challenge by iteratively refining the reward function and the training hyperparameters into a stable "gentle polishing process". Our resulting GRPO-polished model demonstrates significant real-world efficacy, achieving a final accuracy of 91% on unseen test data. The model shows a strong ability to correctly identify qualified candidates (recall of 0.85 for the 'SELECTED' class) while also showing exceptional precision (1.0), confirming its reliability. These results indicate that a properly executed two-step fine-tuning procedure can effectively refine a small language model to conduct fine-grained, human-like candidate scoring, overcoming the drawbacks of both traditional ATS and naive RL usage.
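The group-relative normalization at the core of GRPO can be sketched in a few lines: each sampled completion for the same prompt is scored, and its advantage is its reward standardized against the group. The reward values below are hypothetical, and the paper's multi-component reward function is not reproduced here.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of completions for the same prompt.

    Returns (r - mean) / (std + eps) for each reward; eps avoids division by
    zero when every completion in the group receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Usage: four completions for one resume, scored by a hypothetical reward.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because advantages are relative to the group, scaling every penalty up does not by itself change the ranking within a group, but heavily skewed penalties can still push the policy toward uniformly negative outputs, which is one plausible reading of the reward-hacking behavior the abstract reports.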

A Hybrid Proactive And Predictive Framework For Edge Cloud Resource Management

Authors: Hrikshesh Kumar, Anika Garg, Anshul Gupta, Yashika Agarwal
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.16075
Pdf link: https://arxiv.org/pdf/2511.16075
Abstract Conventional cloud-edge workload resource management is too reactive. Relying on static thresholds means either overspending on more resources than needed or suffering reduced performance for lack of them. We therefore work on a proactive solution: a framework that stops reacting to problems and starts anticipating them. We design a hybrid architecture combining two powerful tools: a CNN-LSTM model for time-series forecasting and an orchestrator based on multi-agent Deep Reinforcement Learning (DRL). The novelty lies in how we combine them: we embed the predictive forecast from the CNN-LSTM directly into the DRL agent's state space. This makes the AI manager smarter, because it sees the expected future and can make better long-term decisions about where to run tasks, finding the sweet spot between saving money, keeping the system healthy, and keeping applications fast for users. Rather than lurching from one problem to another, it finds a smooth path forward. Our tests show that our system easily beats the older methods. It handles tough problems well, such as making complex decisions and juggling multiple goals at once (low cost, speed, and reliability).
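The central idea, embedding the forecast in the agent's state, amounts to concatenating predicted load with the current observations. In the sketch below a moving average stands in for the CNN-LSTM forecaster, and the metric names are hypothetical; both are assumptions made only to keep the example self-contained.

```python
def moving_average_forecast(history, horizon=3, window=4):
    """Placeholder forecaster: repeat the mean of the last `window` samples.

    This stands in for the paper's CNN-LSTM model, which is not reproduced here.
    """
    recent = history[-window:]
    level = sum(recent) / len(recent)
    return [level] * horizon

def build_state(current_metrics, forecast):
    """Concatenate current observations with the forecast horizon.

    The DRL agent then conditions its placement decisions on both the present
    load and the predicted load.
    """
    return list(current_metrics) + list(forecast)

# Usage: current CPU utilization and queue length, plus a two-step forecast.
cpu_history = [0.25, 0.5, 0.75, 0.5]
state = build_state([0.5, 12], moving_average_forecast(cpu_history, horizon=2))
```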

VideoSeg-R1: Reasoning Video Object Segmentation via Reinforcement Learning

Authors: Zishan Xu, Yifu Guo, Yuquan Lu, Fengyu Yang, Junxin Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.16077
Pdf link: https://arxiv.org/pdf/2511.16077
Abstract Traditional video reasoning segmentation methods rely on supervised fine-tuning, which limits generalization to out-of-distribution scenarios and lacks explicit reasoning. To address this, we propose VideoSeg-R1, the first framework to introduce reinforcement learning into video reasoning segmentation. It adopts a decoupled architecture that formulates the task as joint referring image segmentation and video mask propagation. It comprises three stages: (1) a hierarchical text-guided frame sampler to emulate human attention; (2) a reasoning model that produces spatial cues along with explicit reasoning chains; and (3) a segmentation-propagation stage using SAM2 and XMem. A task difficulty-aware mechanism adaptively controls reasoning length for better efficiency and accuracy. Extensive evaluations on multiple benchmarks demonstrate that VideoSeg-R1 achieves state-of-the-art performance in complex video reasoning and segmentation tasks. The code will be publicly available at this https URL.

SkyRL-Agent: Efficient RL Training for Multi-turn LLM Agent

Authors: Shiyi Cao, Dacheng Li, Fangzhou Zhao, Shuo Yuan, Sumanth R. Hegde, Connor Chen, Charlie Ruan, Tyler Griggs, Shu Liu, Eric Tang, Richard Liaw, Philipp Moritz, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.16108
Pdf link: https://arxiv.org/pdf/2511.16108
Abstract We introduce SkyRL-Agent, a framework for efficient, multi-turn, long-horizon agent training and evaluation. It provides efficient asynchronous dispatching, lightweight tool integration, and flexible backend interoperability, enabling seamless use with existing RL frameworks such as SkyRL-train, VeRL, and Tinker. Using SkyRL-Agent, we train SA-SWE-32B, a software engineering agent trained from Qwen3-32B (24.4% Pass@1) purely with reinforcement learning. We introduce two key components: an optimized asynchronous pipeline dispatcher that achieves a 1.55x speedup over naive asynchronous batching, and a tool-enhanced training recipe leveraging an AST-based search tool to facilitate code navigation, boost rollout Pass@K, and improve training efficiency. Together, these optimizations enable SA-SWE-32B to reach 39.4% Pass@1 on SWE-Bench Verified with more than 2x cost reduction compared to prior models reaching similar performance. Despite being trained solely on SWE tasks, SA-SWE-32B generalizes effectively to other agentic tasks, including Terminal-Bench, BrowseComp-Plus, and WebArena. We further demonstrate SkyRL-Agent's extensibility through case studies on deep research, computer use, and memory agents, each trained using a different training backend.

An Image Is Worth Ten Thousand Words: Verbose-Text Induction Attacks on VLMs

Authors: Zhi Luo, Zenghui Yuan, Wenqi Wei, Daizong Liu, Pan Zhou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.16163
Pdf link: https://arxiv.org/pdf/2511.16163
Abstract With the remarkable success of Vision-Language Models (VLMs) on multimodal tasks, concerns regarding their deployment efficiency have become increasingly prominent. In particular, the number of tokens consumed during the generation process has emerged as a key evaluation [...] studies have shown that specific inputs can induce VLMs to generate lengthy outputs with low information density, which significantly increases energy consumption, latency, and token costs. However, existing methods simply delay the occurrence of the EOS token to implicitly prolong output, and fail to directly maximize the output token length as an explicit optimization objective, lacking stability and [...] address these limitations, this paper proposes a novel verbose-text induction attack (VTIA) to inject imperceptible adversarial perturbations into benign images via a two-stage framework, which identifies the most malicious prompt embeddings for optimizing and maximizing the output token of the perturbed [...], we first perform adversarial prompt search, employing reinforcement learning strategies to automatically identify adversarial prompts capable of inducing the LLM component within VLMs to produce verbose outputs. We then conduct vision-aligned perturbation optimization to craft adversarial examples on input images, maximizing the similarity between the perturbed image's visual embeddings and those of the adversarial prompt, thereby constructing malicious images that trigger verbose text generation. Comprehensive experiments on four popular VLMs demonstrate that our method achieves significant advantages in terms of effectiveness, efficiency, and generalization capability.

Pass@k Metric for RLVR: A Diagnostic Tool of Exploration, But Not an Objective

Authors: Yang Yu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.16231
Pdf link: https://arxiv.org/pdf/2511.16231
Abstract The ability of Large Language Models (LLMs) to perform complex, multi-step reasoning is a central focus of modern AI research. To evaluate and enhance this capability, the pass@k metric, which measures the probability of obtaining at least one correct solution in k independent samples, has received significant attention. Its intuitive appeal has led to its adoption not only as an evaluation standard but also as a direct optimization objective in reinforcement learning. In this paper, we analyze the pass@k objective, derive its gradient, and demonstrate that it is fundamentally a per-example positive reweighting of the simpler pass@1 objective. Our analysis reveals that the pass@k objective provides a vanishing learning signal in regimes where exploration is most critical. We further analyze the dynamics of "exploration collapse", showing that as the policy concentrates probability mass, the gap between pass@k and pass@1 diminishes. We conclude that while pass@k is a useful diagnostic tool, it may be an unsuitable direct objective for optimization. Instead, mechanisms explicitly encouraging efficient exploration could offer a more effective path forward for reinforcement learning in reasoning tasks.
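The reweighting claim can be checked with elementary calculus under one simplifying assumption: for a given example, each of the k samples succeeds independently with probability p. Then pass@k = 1 - (1 - p)^k, and its derivative with respect to p is k(1 - p)^(k-1), a non-negative factor multiplying the pass@1 gradient. The function names below are illustrative, and this sketch is not the paper's full derivation.

```python
def pass_at_k(p, k):
    """Probability of at least one success in k independent samples."""
    return 1.0 - (1.0 - p) ** k

def pass_at_k_weight(p, k):
    """d pass@k / d p = k * (1 - p)^(k - 1): the per-example weight on the pass@1 gradient."""
    return k * (1.0 - p) ** (k - 1)

if __name__ == "__main__":
    k = 8
    for p in (0.01, 0.1, 0.5, 0.9, 0.99):
        gap = pass_at_k(p, k) - p  # difference between pass@k and pass@1
        print(f"p={p:<5} pass@{k}={pass_at_k(p, k):.4f} "
              f"weight={pass_at_k_weight(p, k):.4f} gap={gap:.4f}")
```

As p approaches 1, which is what happens when the policy concentrates its probability mass on one solution, both the weight and the gap between pass@k and pass@1 shrink toward zero, consistent with the "exploration collapse" dynamics the abstract describes.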

Revisiting Fairness-aware Interactive Recommendation: Item Lifecycle as a Control Knob

Authors: Yun Lu, Xiaoyu Shi, Hong Xie, Chongjun Xia, Zhenhui Gong, Mingsheng Shang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.16248
Pdf link: https://arxiv.org/pdf/2511.16248
Abstract This paper revisits fairness-aware interactive recommendation (e.g., TikTok, KuaiShou) by introducing a novel control knob, i.e., the lifecycle of items. We make threefold contributions. First, we conduct a comprehensive empirical analysis and uncover that item lifecycles in short-video platforms follow a compressed three-phase pattern, i.e., rapid growth, transient stability, and sharp decay, which significantly deviates from the classical four-stage model (introduction, growth, maturity, decline). Second, we introduce LHRL, a lifecycle-aware hierarchical reinforcement learning framework that dynamically harmonizes fairness and accuracy by leveraging phase-specific exposure dynamics. LHRL consists of two key components: (1) PhaseFormer, a lightweight encoder combining STL decomposition and attention mechanisms for robust phase detection; (2) a two-level HRL agent, where the high-level policy imposes phase-aware fairness constraints, and the low-level policy optimizes immediate user engagement. This decoupled optimization allows for effective reconciliation between long-term equity and short-term utility. Third, experiments on multiple real-world interactive recommendation datasets demonstrate that LHRL significantly improves both fairness and user engagement. Furthermore, the integration of lifecycle-aware rewards into existing RL-based models consistently yields performance gains, highlighting the generalizability and practical value of our approach.
中文摘要 本文通过引入一个新的控制旋钮，即物品的生命周期，重新审视公平感知的交互式推荐（例如TikTok、快手）。我们做出三方面的贡献。首先，我们进行了全面的实证分析，发现短视频平台的物品生命周期呈现压缩的三阶段模式，即快速增长、短暂稳定和急剧衰减，这与经典的四阶段模型（引入、增长、成熟、衰退）有显著不同。其次，我们提出LHRL，一种生命周期感知的分层强化学习框架，通过利用各阶段特有的曝光动态，动态协调公平性和准确性。LHRL由两个关键部分组成：（1）PhaseFormer，一种结合STL分解和注意力机制的轻量级编码器，用于稳健的阶段检测；（2）两级HRL智能体，其中高层策略施加阶段感知的公平性约束，低层策略优化即时用户参与度。这种解耦优化能够有效协调长期公平与短期效用。第三，在多个真实世界交互式推荐数据集上的实验表明，LHRL显著提升了公平性和用户参与度。此外，将生命周期感知的奖励整合进现有基于强化学习的模型，能持续带来性能提升，凸显了我们方法的通用性和实用价值。

Optimizing Operation Recipes with Reinforcement Learning for Safe and Interpretable Control of Chemical Processes

通过强化学习优化操作配方，实现化学过程的安全且可解释的控制

Authors: Dean Brandner, Sergio Lucia
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2511.16297
Pdf link: https://arxiv.org/pdf/2511.16297
Abstract Optimal operation of chemical processes is vital for energy, resource, and cost savings in chemical engineering. The problem of optimal operation can be tackled with reinforcement learning, but traditional reinforcement learning methods face challenges due to hard constraints related to quality and safety that must be strictly satisfied, and the large amount of required training data. Chemical processes often cannot provide sufficient experimental data, and while detailed dynamic models can be an alternative, their complexity makes it computationally intractable to generate the needed data. Optimal control methods, such as model predictive control, also struggle with the complexity of the underlying dynamic models. Consequently, many chemical processes rely on manually defined operation recipes combined with simple linear controllers, leading to suboptimal performance and limited flexibility. In this work, we propose a novel approach that leverages expert knowledge embedded in operation recipes. By using reinforcement learning to optimize the parameters of these recipes and their underlying linear controllers, we achieve an optimized operation recipe. This method requires significantly less data, handles constraints more effectively, and is more interpretable than traditional reinforcement learning methods due to the structured nature of the recipes. We demonstrate the potential of our approach through simulation results of an industrial batch polymerization reactor, showing that it can approach the performance of optimal controllers while addressing the limitations of existing methods.
中文摘要 化学过程的最优运行对于化学工程中的能源、资源和成本节约至关重要。最优运行问题可以通过强化学习解决，但传统强化学习方法面临两方面挑战：与质量和安全相关、必须严格满足的硬约束，以及所需的大量训练数据。化学过程往往无法提供足够的实验数据，虽然详细的动态模型可以作为替代方案，但其复杂性使得生成所需数据在计算上难以实现。最优控制方法，如模型预测控制，也难以应对底层动态模型的复杂性。因此，许多化学过程依赖于手动定义的操作配方和简单的线性控制器，导致性能欠佳且灵活性有限。在本研究中，我们提出了一种新颖的方法，利用操作配方中蕴含的专家知识。通过强化学习优化这些配方及其底层线性控制器的参数，我们得到优化的操作配方。该方法所需数据显著减少，能更有效地处理约束，且由于配方的结构化特性，比传统强化学习方法更具可解释性。我们通过工业间歇聚合反应器的仿真结果展示了该方法的潜力，表明它能够接近最优控制器的性能，同时解决现有方法的局限性。

Safe and Optimal Variable Impedance Control via Certified Reinforcement Learning

通过认证强化学习实现安全且最优的可变阻抗控制

Authors: Shreyas Kumar, Ravi Prakash
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.16330
Pdf link: https://arxiv.org/pdf/2511.16330
Abstract Reinforcement learning (RL) offers a powerful approach for robots to learn complex, collaborative skills by combining Dynamic Movement Primitives (DMPs) for motion and Variable Impedance Control (VIC) for compliant interaction. However, this model-free paradigm often risks instability and unsafe exploration due to the time-varying nature of impedance gains. This work introduces Certified Gaussian Manifold Sampling (C-GMS), a novel trajectory-centric RL framework that learns combined DMP and VIC policies while guaranteeing Lyapunov stability and actuator feasibility by construction. Our approach reframes policy exploration as sampling from a mathematically defined manifold of stable gain schedules. This ensures every policy rollout is guaranteed to be stable and physically realizable, thereby eliminating the need for reward penalties or post-hoc validation. Furthermore, we provide a theoretical guarantee that our approach ensures bounded tracking error even in the presence of bounded model errors and deployment-time uncertainties. We demonstrate the effectiveness of C-GMS in simulation and verify its efficacy on a real robot, paving the way for reliable autonomous interaction in complex environments.
中文摘要 强化学习（RL）通过结合用于运动的动态运动原语（DMP）和用于柔顺交互的可变阻抗控制（VIC），为机器人学习复杂协作技能提供了一种强大的方法。然而，由于阻抗增益随时间变化，这种无模型范式常常存在不稳定和不安全探索的风险。本研究引入了认证高斯流形采样（C-GMS），这是一种新型的以轨迹为中心的强化学习框架，能够学习组合的DMP和VIC策略，同时通过构造保证李雅普诺夫稳定性和执行器可行性。我们的方法将策略探索重新表述为从数学定义的稳定增益调度流形中采样。这确保每次策略rollout都是稳定且物理可实现的，从而无需奖励惩罚或事后验证。此外，我们给出理论保证：即使存在有界的模型误差和部署时的不确定性，我们的方法也能确保跟踪误差有界。我们在仿真中展示了C-GMS的有效性，并在真实机器人上验证了其效果，为复杂环境中可靠的自主交互铺平了道路。

Incorporating Self-Rewriting into Large Language Model Reasoning Reinforcement

将自我重写融入大型语言模型推理强化

Authors: Jiashu Yao, Heyan Huang, Shuang Zeng, Chuwei Luo, WangJie You, Jie Tang, Qingsong Liu, Yuhang Guo, Yangyang Kang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.16331
Pdf link: https://arxiv.org/pdf/2511.16331
Abstract Through reinforcement learning (RL) with outcome correctness rewards, large reasoning models (LRMs) with scaled inference computation have demonstrated substantial success on complex reasoning tasks. However, the one-sided reward, focused solely on final correctness, limits its ability to provide detailed supervision over internal reasoning process. This deficiency leads to suboptimal internal reasoning quality, manifesting as issues like over-thinking, under-thinking, redundant-thinking, and disordered-thinking. Inspired by the recent progress in LRM self-rewarding, we introduce self-rewriting framework, where a model rewrites its own reasoning texts, and subsequently learns from the rewritten reasoning to improve the internal thought process quality. For algorithm design, we propose a selective rewriting approach wherein only "simple" samples, defined by the model's consistent correctness, are rewritten, thereby preserving all original reward signals of GRPO. For practical implementation, we compile rewriting and vanilla generation within one single batch, maintaining the scalability of the RL algorithm and introducing only ~10% overhead. Extensive experiments on diverse tasks with different model sizes validate the effectiveness of self-rewriting. In terms of the accuracy-length tradeoff, the self-rewriting approach achieves improved accuracy (+0.6) with substantially shorter reasoning (-46%) even without explicit instructions in rewriting prompts to reduce reasoning length, outperforming existing strong baselines. In terms of internal reasoning quality, self-rewriting achieves significantly higher scores (+7.2) under the LLM-as-a-judge metric, successfully mitigating internal reasoning flaws.
中文摘要 通过带有结果正确性奖励的强化学习（RL），采用扩展推理计算的大型推理模型（LRM）在复杂推理任务中取得了显著成功。然而，仅关注最终正确性的单一奖励，限制了其对内部推理过程进行细致监督的能力。这种缺陷导致内部推理质量欠佳，表现为过度思考、思考不足、冗余思考和思维混乱等问题。受LRM自我奖励近期进展的启发，我们引入了自我重写框架：模型重写自己的推理文本，随后从重写后的推理中学习，以提升内部思维过程的质量。在算法设计上，我们提出一种选择性重写方法，仅重写由模型一致正确性所定义的“简单”样本，从而保留GRPO的所有原始奖励信号。在实际实现中，我们将重写和普通生成整合在同一个批次中，保持强化学习算法的可扩展性，且仅引入约10%的开销。在不同模型规模、多种任务上的大量实验验证了自我重写的有效性。在准确率与长度的权衡方面，即使重写提示中没有明确要求缩短推理长度，自我重写方法也能在推理长度大幅缩短（-46%）的同时提升准确率（+0.6），优于现有的强基线。在内部推理质量方面，自我重写在LLM-as-a-judge指标下得分显著更高（+7.2），成功缓解了内部推理缺陷。
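
The selective-rewriting idea in the abstract above can be illustrated with a small sketch. It assumes that a "simple" sample is a prompt whose sampled responses are all correct, and that advantages follow the usual group-normalised form associated with GRPO. Both assumptions, and every name below, are illustrative rather than taken from the paper. Under these assumptions an all-correct group has zero advantage everywhere, so selecting only those groups for rewriting leaves the outcome-reward signal of the other groups untouched.

```python
from statistics import mean, pstdev
from typing import Dict, List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-normalised advantages: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def select_for_rewriting(groups: Dict[str, List[float]]) -> List[str]:
    """Return prompts whose sampled responses were all correct (reward == 1)."""
    return [prompt for prompt, rs in groups.items() if rs and all(r == 1 for r in rs)]


if __name__ == "__main__":
    # Hypothetical 0/1 correctness rewards for three prompts, four rollouts each.
    groups = {
        "easy": [1, 1, 1, 1],
        "mixed": [1, 0, 1, 0],
        "hard": [0, 0, 0, 0],
    }
    for prompt, rewards in groups.items():
        print(prompt, group_advantages(rewards))
    print("rewrite candidates:", select_for_rewriting(groups))
```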

OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe

OpenMMReasoner：以开放且通用的方案推动多模态推理的前沿

Authors: Kaichen Zhang, Keming Wu, Zuhao Yang, Kairui Hu, Bin Wang, Ziwei Liu, Xingxuan Li, Lidong Bing
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.16334
Pdf link: https://arxiv.org/pdf/2511.16334
Abstract Recent advancements in large reasoning models have fueled growing interest in extending such capabilities to multimodal domains. However, despite notable progress in visual reasoning, the lack of transparent and reproducible data curation and training strategies remains a major barrier to scalable research. In this work, we introduce OpenMMReasoner, a fully transparent two-stage recipe for multimodal reasoning spanning supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct an 874K-sample cold-start dataset with rigorous step-by-step validation, providing a strong foundation for reasoning capabilities. The subsequent RL stage leverages a 74K-sample dataset across diverse domains to further sharpen and stabilize these abilities, resulting in a more robust and efficient learning process. Extensive evaluations demonstrate that our training recipe not only surpasses strong baselines but also highlights the critical role of data quality and training design in shaping multimodal reasoning performance. Notably, our method achieves a 11.6% improvement over the Qwen2.5-VL-7B-Instruct baseline across nine multimodal reasoning benchmarks, establishing a solid empirical foundation for future large-scale multimodal reasoning research. We open-sourced all our codes, pipeline, and data at this https URL.
中文摘要 大型推理模型的最新进展激发了人们将此类能力扩展到多模态领域的日益浓厚的兴趣。然而，尽管视觉推理取得了显著进展，缺乏透明且可复现的数据整理和训练策略仍是可扩展研究的主要障碍。在本研究中，我们介绍了OpenMMReasoner，这是一种完全透明的两阶段多模态推理方案，涵盖监督微调（SFT）和强化学习（RL）。在SFT阶段，我们构建了一个874K样本的冷启动数据集，经过严格的逐步验证，为推理能力奠定坚实基础。接下来的强化学习阶段利用一个涵盖不同领域的74K样本数据集，进一步提升和稳定这些能力，从而实现更稳健高效的学习过程。广泛的评估表明，我们的训练方案不仅超越了强基线，还凸显了数据质量和训练设计在塑造多模态推理表现中的关键作用。值得注意的是，我们的方法在九个多模态推理基准测试中较Qwen2.5-VL-7B-Instruct基线提升了11.6%，为未来大规模多模态推理研究奠定了坚实的实证基础。我们在 this https URL 开源了所有代码、流水线和数据。

Flow-Aided Flight Through Dynamic Clutters From Point To Motion

流辅助的动态杂乱环境穿越飞行：从点到运动

Authors: Bowen Xu, Zexuan Yan, Minghao Lu, Xiyu Fan, Yi Luo, Youshen Lin, Zhiqiang Chen, Yeke Chen, Qiyuan Qiao, Peng Lu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.16372
Pdf link: https://arxiv.org/pdf/2511.16372
Abstract Challenges in traversing dynamic clutters lie mainly in the efficient perception of the environmental dynamics and the generation of evasive behaviors considering obstacle movement. Previous solutions have made progress in explicitly modeling the dynamic obstacle motion for avoidance, but this key dependency of decision-making is time-consuming and unreliable in highly dynamic scenarios with occlusions. On the contrary, without introducing object detection, tracking, and prediction, we empower the reinforcement learning (RL) with single LiDAR sensing to realize an autonomous flight system directly from point to motion. For exteroception, a depth sensing distance map achieving fixed-shape, low-resolution, and detail-safe is encoded from raw point clouds, and an environment change sensing point flow is adopted as motion features extracted from multi-frame observations. These two are integrated into a lightweight and easy-to-learn representation of complex dynamic environments. For action generation, the behavior of avoiding dynamic threats in advance is implicitly driven by the proposed change-aware sensing representation, where the policy optimization is indicated by the relative motion modulated distance field. With the deployment-friendly sensing simulation and dynamics model-free acceleration control, the proposed system shows a superior success rate and adaptability to alternatives, and the policy derived from the simulator can drive a real-world quadrotor with safe maneuvers.
中文摘要 穿越动态杂乱环境的挑战主要在于对环境动态的高效感知，以及考虑障碍物运动的规避行为生成。以往的解决方案在显式建模动态障碍物运动以实现避障方面取得了进展，但在存在遮挡的高度动态场景中，决策所依赖的这一关键环节既耗时又不可靠。相反，在不引入目标检测、跟踪和预测的情况下，我们借助单一激光雷达传感为强化学习（RL）赋能，实现直接从点到运动的自主飞行系统。在外部感知方面，从原始点云编码得到形状固定、低分辨率且保留安全细节的深度感知距离图，并采用感知环境变化的点流作为从多帧观测中提取的运动特征。这两者被整合成一个轻量级且易于学习的复杂动态环境表示。在动作生成方面，提前规避动态威胁的行为由所提出的变化感知表示隐式驱动，其中策略优化由相对运动调制的距离场引导。凭借便于部署的传感仿真和无需动力学模型的加速度控制，所提系统相比其他替代方案展现出更高的成功率和适应性，并且在仿真器中得到的策略能够驱动真实世界的四旋翼进行安全机动。

LAOF: Robust Latent Action Learning with Optical Flow Constraints

LAOF：具有光流约束的稳健潜在动作学习

Authors: Xizhou Bu, Jiexi Lyu, Fulei Sun, Ruichen Yang, Zhiqiang Ma, Wei Li
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.16407
Pdf link: https://arxiv.org/pdf/2511.16407
Abstract Learning latent actions from large-scale videos is crucial for the pre-training of scalable embodied foundation models, yet existing methods often struggle with action-irrelevant distractors. Although incorporating action supervision can alleviate these distractions, its effectiveness is restricted by the scarcity of available action labels. Optical flow represents pixel-level motion between consecutive frames, naturally suppressing background elements and emphasizing moving objects. Motivated by this, we propose robust Latent Action learning with Optical Flow constraints, called LAOF, a pseudo-supervised framework that leverages the agent's optical flow as an action-driven signal to learn latent action representations robust to distractors. Experimental results show that the latent representations learned by LAOF outperform existing methods on downstream imitation learning and reinforcement learning tasks. This superior performance arises from optical flow constraints, which substantially stabilize training and improve the quality of latent representations under extremely label-scarce conditions, while remaining effective as the proportion of action labels increases to 10 percent. Importantly, even without action supervision, LAOF matches or surpasses action-supervised methods trained with 1 percent of action labels.
中文摘要 从大规模视频中学习潜在动作对于可扩展的具身基础模型的预训练至关重要，但现有方法常常难以应对与动作无关的干扰因素。虽然加入动作监督可以减轻这些干扰，但其效果受限于可用动作标签的稀缺。光流表示连续帧之间的像素级运动，天然地抑制背景元素并突出运动物体。基于此，我们提出了带有光流约束的稳健潜在动作学习方法，称为LAOF，这是一种伪监督框架，利用智能体的光流作为动作驱动信号，学习对干扰因素鲁棒的潜在动作表示。实验结果表明，LAOF学到的潜在表示在下游模仿学习和强化学习任务中优于现有方法。这种优越性能源于光流约束，它在标签极度稀缺的条件下显著稳定训练并提升潜在表示的质量，同时在动作标签比例增加到10%时依然有效。重要的是，即使没有动作监督，LAOF也能达到甚至超过使用1%动作标签训练的动作监督方法。

A Comparison Between Decision Transformers and Traditional Offline Reinforcement Learning Algorithms

决策变换器与传统离线强化学习算法的比较

Authors: Ali Murtaza Caunhye, Asad Jeewa
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.16475
Pdf link: https://arxiv.org/pdf/2511.16475
Abstract The field of Offline Reinforcement Learning (RL) aims to derive effective policies from pre-collected datasets without active environment interaction. While traditional offline RL algorithms like Conservative Q-Learning (CQL) and Implicit Q-Learning (IQL) have shown promise, they often face challenges in balancing exploration and exploitation, especially in environments with varying reward densities. The recently proposed Decision Transformer (DT) approach, which reframes offline RL as a sequence modelling problem, has demonstrated impressive results across various benchmarks. This paper presents a comparative study evaluating the performance of DT against traditional offline RL algorithms in dense and sparse reward settings for the ANT continuous control environment. Our research investigates how these algorithms perform when faced with different reward structures, examining their ability to learn effective policies and generalize across varying levels of feedback. Through empirical analysis in the ANT environment, we found that DTs showed less sensitivity to varying reward density compared to other methods and particularly excelled with medium-expert datasets in sparse reward scenarios. In contrast, traditional value-based methods like IQL showed improved performance in dense reward settings with high-quality data, while CQL offered balanced performance across different data qualities. Additionally, DTs exhibited lower variance in performance but required significantly more computational resources compared to traditional approaches. These findings suggest that sequence modelling approaches may be more suitable for scenarios with uncertain reward structures or mixed-quality data, while value-based methods remain competitive in settings with dense rewards and high-quality demonstrations.
中文摘要 离线强化学习（RL）领域旨在从预先收集的数据集中推导出有效的策略，而无需主动与环境交互。虽然传统的离线强化学习算法如保守Q学习（CQL）和隐式Q学习（IQL）展现出潜力，但它们常常难以平衡探索与利用，尤其是在奖励密度不同的环境中。最近提出的Decision Transformer（DT，决策变换器）方法将离线强化学习重新表述为序列建模问题，在多个基准测试中取得了令人印象深刻的成果。本文在ANT连续控制环境的密集奖励和稀疏奖励设置下，对DT与传统离线强化学习算法的性能进行了比较研究。我们的研究探讨了这些算法在面对不同奖励结构时的表现，考察它们学习有效策略以及在不同反馈水平下泛化的能力。通过在ANT环境中的实证分析，我们发现DT对奖励密度变化的敏感性低于其他方法，尤其在稀疏奖励场景下的medium-expert数据集上表现出色。相比之下，传统的基于价值的方法如IQL在拥有高质量数据的密集奖励设置中表现更好，而CQL在不同数据质量下表现均衡。此外，DT的性能方差较小，但相比传统方法需要显著更多的计算资源。这些发现表明，序列建模方法可能更适合奖励结构不确定或数据质量参差不齐的场景，而基于价值的方法在奖励密集且演示质量高的设置中依然具有竞争力。

Limitations of Scalarisation in MORL: A Comparative Study in Discrete Environments

MORL中标量化的局限性：离散环境下的比较研究

Authors: Muhammad Sa'ood Shah, Asad Jeewa
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.16476
Pdf link: https://arxiv.org/pdf/2511.16476
Abstract Scalarisation functions are widely employed in MORL algorithms to enable intelligent decision-making. However, these functions often struggle to approximate the Pareto front accurately, rendering them unideal in complex, uncertain environments. This study examines selected Multi-Objective Reinforcement Learning (MORL) algorithms across MORL environments with discrete action and observation spaces. We aim to investigate further the limitations associated with scalarisation approaches for decision-making in multi-objective settings. Specifically, we use an outer-loop multi-policy methodology to assess the performance of a seminal single-policy MORL algorithm, MO Q-Learning implemented with linear scalarisation and Chebyshev scalarisation functions. In addition, we explore a pioneering inner-loop multi-policy algorithm, Pareto Q-Learning, which offers a more robust alternative. Our findings reveal that the performance of the scalarisation functions is highly dependent on the environment and the shape of the Pareto front. These functions often fail to retain the solutions uncovered during learning and favour finding solutions in certain regions of the solution space. Moreover, finding the appropriate weight configurations to sample the entire Pareto front is complex, limiting their applicability in uncertain settings. In contrast, inner-loop multi-policy algorithms may provide a more sustainable and generalizable approach and potentially facilitate intelligent decision-making in dynamic and uncertain environments.
中文摘要 标量化函数被广泛应用于MORL算法中，以实现智能决策。然而，这些函数常常难以准确近似帕累托前缘，使它们在复杂且不确定的环境中不理想。本研究考察了在具有离散动作和观察空间的MORL环境中选定的多目标强化学习（MORL）算法。我们旨在进一步探讨多目标环境中标量化方法在决策中的局限性。具体来说，我们采用外环多策略方法评估开创性单策略MORL算法MO Q-Learning的性能，该算法通过线性标量化和切比雪夫标量化函数实现。此外，我们还探索了一种开创性的内环多策略算法——帕累托Q学习，它提供了更稳健的替代方案。我们的发现表明，标量化函数的性能高度依赖于环境和帕累托前缘的形状。这些函数常常无法保留学习过程中发现的解，反而更倾向于在解空间的某些区域寻找解。此外，寻找合适的权重配置以采样整个帕累托前缘也很复杂，限制了其在不确定环境中的适用性。相比之下，内环多策略算法可能提供更可持续和可推广的方法，并有可能促进在动态和不确定环境中的智能决策。
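
One reason scalarisation can favour certain regions of the solution space is the well-known fact that a linear weighted sum never selects a Pareto-optimal point lying in a non-convex part of the front, whatever the weights. The sketch below illustrates this on a hypothetical three-point front for a two-objective maximisation problem and contrasts it with weighted Chebyshev scalarisation relative to a utopian reference point. The points, reference point, weight grid and function names are all made up for illustration and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def linear_score(q: Sequence[float], w: Sequence[float]) -> float:
    """Weighted sum of objective values (to be maximised)."""
    return sum(wi * qi for wi, qi in zip(w, q))


def chebyshev_distance(q: Sequence[float], w: Sequence[float], utopia: Sequence[float]) -> float:
    """Weighted Chebyshev distance to the utopian point (to be minimised)."""
    return max(wi * abs(zi - qi) for wi, qi, zi in zip(w, q, utopia))


def pick_linear(points: List[Point], w: Sequence[float]) -> Point:
    return max(points, key=lambda q: linear_score(q, w))


def pick_chebyshev(points: List[Point], w: Sequence[float], utopia: Sequence[float]) -> Point:
    return min(points, key=lambda q: chebyshev_distance(q, w, utopia))


# Hypothetical front: (0.4, 0.4) is non-dominated but lies below the line
# joining the two extreme points, i.e. in a non-convex region of the front.
FRONT: List[Point] = [(1.0, 0.0), (0.4, 0.4), (0.0, 1.0)]
UTOPIA: Point = (1.0, 1.0)

weights = [(i / 100, 1 - i / 100) for i in range(101)]
linear_picks = {pick_linear(FRONT, w) for w in weights}
chebyshev_picks = {pick_chebyshev(FRONT, w, UTOPIA) for w in weights}

if __name__ == "__main__":
    print("linear:", sorted(linear_picks))
    print("chebyshev:", sorted(chebyshev_picks))
```

Across the whole weight sweep the linear sum only ever returns the two extreme points, while the Chebyshev variant also reaches the middle point. This is a toy illustration of the weight-sensitivity issue, not a reproduction of the study's experiments.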

Large Language Model-Based Reward Design for Deep Reinforcement Learning-Driven Autonomous Cyber Defense

基于大型语言模型的奖励设计，用于深度强化学习驱动的自主网络防御

Authors: Sayak Mukherjee, Samrat Chatterjee, Emilie Purvine, Ted Fujimoto, Tegan Emerson
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2511.16483
Pdf link: https://arxiv.org/pdf/2511.16483
Abstract Designing rewards for autonomous cyber attack and defense learning agents in a complex, dynamic environment is a challenging task for subject matter experts. We propose a large language model (LLM)-based reward design approach to generate autonomous cyber defense policies in a deep reinforcement learning (DRL)-driven experimental simulation environment. Multiple attack and defense agent personas were crafted, reflecting heterogeneity in agent actions, to generate LLM-guided reward designs where the LLM was first provided with contextual cyber simulation environment information. These reward structures were then utilized within a DRL-driven attack-defense simulation environment to learn an ensemble of cyber defense policies. Our results suggest that LLM-guided reward designs can lead to effective defense strategies against diverse adversarial behaviors.
中文摘要 在复杂且动态的环境中为自主网络攻击和防御学习智能体设计奖励，对领域专家来说是一项具有挑战性的任务。我们提出一种基于大型语言模型（LLM）的奖励设计方法，用于在深度强化学习（DRL）驱动的实验仿真环境中生成自主网络防御策略。我们构建了多个攻击和防御智能体角色，以反映智能体动作的异质性，并在首先向LLM提供网络仿真环境的上下文信息之后，生成LLM引导的奖励设计。这些奖励结构随后被用于DRL驱动的攻防仿真环境中，以学习一组网络防御策略的集成。我们的结果表明，LLM引导的奖励设计能够产生针对多种对抗行为的有效防御策略。

Green Resilience of Cyber-Physical Systems: Doctoral Dissertation

网络物理系统的绿色韧性：博士论文

Authors: Diaeddin Rimawi
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.16593
Pdf link: https://arxiv.org/pdf/2511.16593
Abstract Cyber-physical systems (CPS) combine computational and physical components. Online Collaborative AI System (OL-CAIS) is a type of CPS that learns online in collaboration with humans to achieve a common goal, which makes it vulnerable to disruptive events that degrade performance. Decision-makers must therefore restore performance while limiting energy impact, creating a trade-off between resilience and greenness. This research addresses how to balance these two properties in OL-CAIS. It aims to model resilience for automatic state detection, develop agent-based policies that optimize the greenness-resilience trade-off, and understand catastrophic forgetting to maintain performance consistency. We model OL-CAIS behavior through three operational states: steady, disruptive, and final. To support recovery during disruptions, we introduce the GResilience framework, which provides recovery strategies through multi-objective optimization (one-agent), game-theoretic decision-making (two-agent), and reinforcement learning (RL-agent). We also design a measurement framework to quantify resilience and greenness. Empirical evaluation uses real and simulated experiments with a collaborative robot learning object classification from human demonstrations. Results show that the resilience model captures performance transitions during disruptions, and that GResilience policies improve green recovery by shortening recovery time, stabilizing performance, and reducing human dependency. RL-agent policies achieve the strongest results, although with a marginal increase in CO2 emissions. We also observe catastrophic forgetting after repeated disruptions, while our policies help maintain steadiness. A comparison with containerized execution shows that containerization cuts CO2 emissions by half. Overall, this research provides models, metrics, and policies that ensure the green recovery of OL-CAIS.
中文摘要 网络物理系统（CPS）结合了计算和物理组件。在线协作人工智能系统（OL-CAIS）是一类与人类协作在线学习以实现共同目标的CPS，因此容易受到降低性能的干扰事件的影响。决策者必须在限制能源影响的同时恢复性能，这就需要在韧性和绿色性之间做出权衡。本研究探讨如何在OL-CAIS中平衡这两种特性。其目标是为自动状态检测建立韧性模型，开发优化绿色性与韧性权衡的基于智能体的策略，并理解灾难性遗忘以维持性能一致性。我们通过三种运行状态来建模OL-CAIS行为：稳定、受扰和最终。为支持受扰期间的恢复，我们引入了GResilience框架，该框架通过多目标优化（单智能体）、博弈论决策（双智能体）和强化学习（RL智能体）提供恢复策略。我们还设计了一个度量框架，用于量化韧性和绿色性。实证评估采用真实和仿真实验，其中协作机器人通过人类演示学习物体分类。结果显示，韧性模型捕捉到了受扰期间的性能变化，而GResilience策略通过缩短恢复时间、稳定性能并减少对人的依赖，改善了绿色恢复。RL智能体策略取得了最好的效果，尽管二氧化碳排放略有增加。我们还观察到在反复受扰后出现灾难性遗忘，而我们的策略有助于维持稳定。与容器化执行的对比显示，容器化可将二氧化碳排放减少一半。总体而言，本研究提供了确保OL-CAIS绿色恢复的模型、指标和策略。

Bridging VLMs and Embodied Intelligence with Deliberate Practice Policy Optimization

以刻意练习策略优化连接VLM与具身智能

Authors: Yi Zhang, Che Liu, Xiancong Ren, Hanchu Ni, Yingji Zhang, Shuai Zhang, Zeyuan Ding, Jiayu Hu, Haozhe Shan, Junbo Qi, Yan Bai, Dengjie Li, Jiachen Luo, Yidong Wang, Yong Dai, Zenglin Xu, Bin Shen, Qifan Wang, Jian Tang, Xiaozhu Ju
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.16602
Pdf link: https://arxiv.org/pdf/2511.16602
Abstract Developing a universal and versatile embodied intelligence system presents two primary challenges: the critical embodied data bottleneck, where real-world data is scarce and expensive, and the algorithmic inefficiency of existing methods, which are resource-prohibitive. To address these limitations, we introduce Deliberate Practice Policy Optimization (DPPO), a metacognitive "Metaloop" training framework that dynamically alternates between supervised fine-tuning (competence expansion) and reinforcement learning (skill refinement). This enables automatic weakness identification and targeted resource allocation, specifically designed to maximize learning efficiency from sparse, finite data. Theoretically, DPPO can be formalised as a unified preference-learning framework. Empirically, training a vision-language embodied model with DPPO, referred to as Pelican-VL 1.0, yields a 20.3% performance improvement over the base model and surpasses open-source models at the 100B-parameter scale by 10.6%. We are open-sourcing both the models and code, providing the first systematic framework that alleviates the data and resource bottleneck and enables the community to build versatile embodied agents efficiently.
中文摘要 开发通用且多功能的具身智能系统面临两个主要挑战：一是关键的具身数据瓶颈，即真实世界数据稀缺且昂贵；二是现有方法的算法效率低下，资源消耗过高。为解决这些局限性，我们引入了刻意练习策略优化（DPPO），这是一种元认知“Metaloop”训练框架，动态交替进行监督微调（能力扩展）和强化学习（技能精炼）。这使得自动识别弱点和有针对性的资源分配成为可能，其设计专门用于最大化从稀疏、有限数据中学习的效率。理论上，DPPO可以被形式化为一个统一的偏好学习框架。在实验上，使用DPPO训练的视觉语言具身模型（称为Pelican-VL 1.0）相比基础模型性能提升20.3%，并比100B参数规模的开源模型高出10.6%。我们将开源模型和代码，提供首个缓解数据和资源瓶颈的系统化框架，使社区能够高效构建多功能具身智能体。

Stabilizing Policy Gradient Methods via Reward Profiling

通过奖励画像稳定策略梯度方法

Authors: Shihab Ahmed, El Houcine Bergou, Aritra Dutta, Yue Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2511.16629
Pdf link: https://arxiv.org/pdf/2511.16629
Abstract Policy gradient methods, which have been extensively studied in the last decade, offer an effective and efficient framework for reinforcement learning problems. However, their performances can often be unsatisfactory, suffering from unreliable reward improvements and slow convergence, due to high variance in gradient estimations. In this paper, we propose a universal reward profiling framework that can be seamlessly integrated with any policy gradient algorithm, where we selectively update the policy based on high-confidence performance estimations. We theoretically justify that our technique will not slow down the convergence of the baseline policy gradient methods, but with high probability, will result in stable and monotonic improvements of their performance. Empirically, on eight continuous-control benchmarks (Box2D and MuJoCo/PyBullet), our profiling yields up to 1.5x faster convergence to near-optimal returns, up to 1.75x reduction in return variance on some setups. Our profiling approach offers a general, theoretically grounded path to more reliable and efficient policy learning in complex environments.
中文摘要 策略梯度方法在过去十年中得到了广泛研究，为强化学习问题提供了一个有效且高效的框架。然而，由于梯度估计的高方差，它们的表现常常不能令人满意，存在奖励提升不可靠和收敛缓慢的问题。本文提出了一种通用的奖励画像框架，可无缝集成到任何策略梯度算法中，基于高置信度的性能估计有选择地更新策略。我们从理论上证明，我们的技术不会减慢基线策略梯度方法的收敛速度，并且会以高概率带来稳定且单调的性能提升。在实验上，在八个连续控制基准（Box2D和MuJoCo/PyBullet）上，我们的画像方法使收敛到接近最优回报的速度最高提升1.5倍，并在某些设置下使回报方差最多降低1.75倍。我们的画像方法为在复杂环境中实现更可靠、更高效的策略学习提供了一条通用且有理论依据的途径。
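
The abstract above describes updating the policy selectively, based on high-confidence performance estimates. The sketch below shows one hypothetical way such a gate could look and is not the paper's algorithm. A candidate update is kept only if a normal-approximation lower confidence bound on its mean return exceeds the mean return of the current policy. The acceptance rule, the z value, the toy one-parameter objective and all names are assumptions made purely for illustration.

```python
import math
import random
from statistics import mean, stdev
from typing import List


def lower_confidence_bound(returns: List[float], z: float = 1.96) -> float:
    """Mean minus z standard errors (normal approximation)."""
    return mean(returns) - z * stdev(returns) / math.sqrt(len(returns))


def accept_update(current_returns: List[float], candidate_returns: List[float], z: float = 1.96) -> bool:
    """Hypothetical gate: keep the candidate only if it is confidently better."""
    return lower_confidence_bound(candidate_returns, z) > mean(current_returns)


def rollout_returns(theta: float, rng: random.Random, n: int = 16) -> List[float]:
    """Toy noisy returns for a one-parameter policy; the optimum is at theta = 3."""
    return [-(theta - 3.0) ** 2 + rng.gauss(0.0, 0.5) for _ in range(n)]


def train(steps: int = 50, lr: float = 0.1, seed: int = 0) -> float:
    rng = random.Random(seed)
    theta = 0.0
    current = rollout_returns(theta, rng)
    for _ in range(steps):
        # Deliberately high-variance gradient estimate of the toy objective.
        noisy_grad = -2.0 * (theta - 3.0) + rng.gauss(0.0, 4.0)
        candidate_theta = theta + lr * noisy_grad
        candidate = rollout_returns(candidate_theta, rng)
        if accept_update(current, candidate):
            theta, current = candidate_theta, candidate
    return theta


if __name__ == "__main__":
    print("final theta:", train())
```

The gate costs extra evaluation rollouts per step. The trade-off it illustrates is that noisy gradient steps which do not measurably improve the return are discarded instead of being applied.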

Dexterity from Smart Lenses: Multi-Fingered Robot Manipulation with In-the-Wild Human Demonstrations

来自智能眼镜的灵巧性：利用真实环境中的人类演示实现多指机器人操作

Authors: Irmak Guzey, Haozhi Qi, Julen Urain, Changhao Wang, Jessica Yin, Krishna Bodduluri, Mike Lambeta, Lerrel Pinto, Akshara Rai, Jitendra Malik, Tingfan Wu, Akash Sharma, Homanga Bharadhwaj
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.16661
Pdf link: https://arxiv.org/pdf/2511.16661
Abstract Learning multi-fingered robot policies from humans performing daily tasks in natural environments has long been a grand goal in the robotics community. Achieving this would mark significant progress toward generalizable robot manipulation in human environments, as it would reduce the reliance on labor-intensive robot data collection. Despite substantial efforts, progress toward this goal has been bottle-necked by the embodiment gap between humans and robots, as well as by difficulties in extracting relevant contextual and motion cues that enable learning of autonomous policies from in-the-wild human videos. We claim that with simple yet sufficiently powerful hardware for obtaining human data and our proposed framework AINA, we are now one significant step closer to achieving this dream. AINA enables learning multi-fingered policies from data collected by anyone, anywhere, and in any environment using Aria Gen 2 glasses. These glasses are lightweight and portable, feature a high-resolution RGB camera, provide accurate on-board 3D head and hand poses, and offer a wide stereo view that can be leveraged for depth estimation of the scene. This setup enables the learning of 3D point-based policies for multi-fingered hands that are robust to background changes and can be deployed directly without requiring any robot data (including online corrections, reinforcement learning, or simulation). We compare our framework against prior human-to-robot policy learning approaches, ablate our design choices, and demonstrate results across nine everyday manipulation tasks. Robot rollouts are best viewed on our website: this https URL.
中文摘要 从人类在自然环境中执行日常任务的过程中学习多指机器人策略，一直是机器人学界的宏伟目标。实现这一目标将标志着朝人类环境中可泛化的机器人操作迈出重要一步，因为这将减少对劳动密集型机器人数据采集的依赖。尽管付出了大量努力，这一目标的进展仍受制于人与机器人之间的具身差距，以及难以从真实环境的人类视频中提取能够支持自主策略学习的相关上下文和运动线索。我们认为，凭借简单但足够强大的人类数据采集硬件，以及我们提出的AINA框架，我们现在离实现这一梦想又近了重要的一步。AINA支持利用任何人在任何地点、任何环境中使用Aria Gen 2眼镜采集的数据来学习多指策略。这些眼镜轻便便携，配备高分辨率RGB摄像头，提供精确的机载3D头部和手部姿态，并提供可用于场景深度估计的宽视场立体视图。该配置支持学习面向多指手的基于3D点的策略，这些策略对背景变化具有鲁棒性，且可直接部署，无需任何机器人数据（包括在线修正、强化学习或仿真）。我们将我们的框架与以往的人到机器人策略学习方法进行比较，对我们的设计选择进行消融，并展示了九项日常操作任务的结果。机器人rollout最好在我们的网站上观看：this https URL。

Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter

驯服长尾：利用自适应草稿模型实现高效的推理强化学习训练

Authors: Qinghao Hu, Shang Yang, Junxian Guo, Xiaozhe Yao, Yujun Lin, Yuxian Gu, Han Cai, Chuang Gan, Ana Klimovic, Song Han
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2511.16665
Pdf link: https://arxiv.org/pdf/2511.16665
Abstract The emergence of Large Language Models (LLMs) with strong reasoning capabilities marks a significant milestone, unlocking new frontiers in complex problem-solving. However, training these reasoning models, typically using Reinforcement Learning (RL), encounters critical efficiency bottlenecks: response generation during RL training exhibits a persistent long-tail distribution, where a few very long responses dominate execution time, wasting resources and inflating costs. To address this, we propose TLT, a system that accelerates reasoning RL training losslessly by integrating adaptive speculative decoding. Applying speculative decoding in RL is challenging due to the dynamic workloads, evolving target model, and draft model training overhead. TLT overcomes these obstacles with two synergistic components: (1) Adaptive Drafter, a lightweight draft model trained continuously on idle GPUs during long-tail generation to maintain alignment with the target model at no extra cost; and (2) Adaptive Rollout Engine, which maintains a memory-efficient pool of pre-captured CUDAGraphs and adaptively selects suitable SD strategies for each input batch. Evaluations demonstrate that TLT achieves over 1.7x end-to-end RL training speedup over state-of-the-art systems, preserves the model accuracy, and yields a high-quality draft model as a free byproduct suitable for efficient deployment. Code is released at this https URL.
中文摘要 具备强大推理能力的大型语言模型（LLM）的出现标志着一个重要里程碑，开启了复杂问题求解的新前沿。然而，这些推理模型通常使用强化学习（RL）进行训练，而训练过程会遇到关键的效率瓶颈：RL训练中的响应生成呈现持续的长尾分布，少数非常长的响应主导了执行时间，浪费资源并推高成本。为此，我们提出了TLT，一个通过集成自适应推测解码来无损加速推理RL训练的系统。由于工作负载动态变化、目标模型不断演变以及草稿模型的训练开销，在RL中应用推测解码具有挑战性。TLT通过两个协同组件克服了这些障碍：（1）自适应草稿器（Adaptive Drafter），一种轻量级草稿模型，在长尾生成期间利用空闲GPU持续训练，以零额外成本保持与目标模型的对齐；（2）自适应Rollout引擎（Adaptive Rollout Engine），它维护一个内存高效的预捕获CUDAGraph池，并为每个输入批次自适应地选择合适的推测解码（SD）策略。评估表明，TLT相比最先进的系统实现了超过1.7倍的端到端RL训练加速，保持了模型精度，并作为免费副产品得到一个适合高效部署的高质量草稿模型。代码发布于 this https URL。

SceneDesigner: Controllable Multi-Object Image Generation with 9-DoF Pose Manipulation

SceneDesigner：支持9自由度姿态操控的可控多物体图像生成

Authors: Zhenyuan Qin, Xincheng Shuai, Henghui Ding
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.16666
Pdf link: https://arxiv.org/pdf/2511.16666
Abstract Controllable image generation has attracted increasing attention in recent years, enabling users to manipulate visual content such as identity and style. However, achieving simultaneous control over the 9D poses (location, size, and orientation) of multiple objects remains an open challenge. Despite recent progress, existing methods often suffer from limited controllability and degraded quality, falling short of comprehensive multi-object 9D pose control. To address these limitations, we propose SceneDesigner, a method for accurate and flexible multi-object 9-DoF pose manipulation. SceneDesigner incorporates a branched network into the pre-trained base model and leverages a new representation, the CNOCS map, which encodes 9D pose information from the camera view. This representation has a strong geometric interpretation, leading to more efficient and stable training. To support training, we construct a new dataset, ObjectPose9D, which aggregates images from diverse sources along with 9D pose annotations. To further address data imbalance issues, particularly performance degradation on low-frequency poses, we introduce a two-stage training strategy with reinforcement learning, where the second stage fine-tunes the model using a reward-based objective on rebalanced data. At inference time, we propose Disentangled Object Sampling, a technique that mitigates insufficient object generation and concept confusion in complex multi-object scenes. Moreover, by integrating user-specific personalization weights, SceneDesigner enables customized pose control for reference subjects. Extensive qualitative and quantitative experiments demonstrate that SceneDesigner significantly outperforms existing approaches in both controllability and quality. Code is publicly available at this https URL.
中文摘要 近年来，可控图像生成越来越受到关注，使用户能够操控身份、风格等视觉内容。然而，同时控制多个物体的9D姿态（位置、大小和朝向）仍是一个开放的挑战。尽管近期取得了进展，现有方法常常存在可控性有限和质量下降的问题，未能实现全面的多对象9D姿态控制。为解决这些限制，我们提出了SceneDesigner，一种用于精确且灵活的多对象9自由度（9-DoF）姿态操控的方法。SceneDesigner 在预训练的基础模型中加入了分支网络，并利用一种新的表示方式 CNOCS 图（CNOCS map），该表示从相机视角编码 9D 姿态信息。该表示具有良好的几何可解释性，使训练更高效、更稳定。为支持训练，我们构建了一个新数据集ObjectPose9D，汇总来自不同来源的图像及其9D姿态标注。为进一步解决数据不平衡问题，特别是低频姿态上的性能下降，我们引入了结合强化学习的两阶段训练策略，其中第二阶段在重新平衡的数据上使用基于奖励的目标对模型进行微调。在推理阶段，我们提出了解耦对象采样（Disentangled Object Sampling），这是一种在复杂多对象场景中缓解对象生成不足和概念混淆的技术。此外，通过集成用户特定的个性化权重，SceneDesigner 能够对参考主体进行定制化的姿态控制。大量定性和定量实验表明，SceneDesigner在可控性和质量方面均显著优于现有方法。代码在此 https URL 公开发布。

Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO

视频即答案：用 Joint-GRPO 预测并生成下一个视频事件

Authors: Junhao Cheng, Liang Hou, Xin Tao, Jing Liao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.16669
Pdf link: https://arxiv.org/pdf/2511.16669
Abstract While language models have become impactful in many real-world applications, video generation remains largely confined to entertainment. Motivated by video's inherent capacity to demonstrate physical-world information that is difficult to convey through language alone (e.g., imagine teaching someone to tie a tie using only text), we identify an underutilized opportunity to extend video as a new answer modality for Next-Event Prediction (NEP), formalized as Video-Next-Event Prediction (VNEP). While the established NEP task takes a video with a procedural or predictive question as input to predict the next event in text, VNEP requires dynamic video responses. This shift from telling to showing unlocks more intuitive and customized answers for procedural learning and creative exploration. However, this task remains challenging for existing models, as it demands an understanding of multimodal input, instruction-conditioned reasoning, and the generation of video with visual and semantic consistency. To address this, we introduce VANS, a model that leverages reinforcement learning to align a Vision-Language Model (VLM) with a Video Diffusion Model (VDM) for VNEP. The core of VANS is our proposed Joint-GRPO, which orchestrates the VLM and VDM to function as a unit. Driven by a shared reward on their respective outputs, it optimizes the VLM to produce captions that are both accurate and easy to visualize, while guiding the VDM to generate videos that are faithful to these captions and the input visual context. To enable this learning, we craft VANS-Data-100K, a dedicated dataset for the VNEP task. Experiments on procedural and predictive benchmarks demonstrate that VANS achieves state-of-the-art performance in both video event prediction and visualization. Code is released at this https URL.
中文摘要 虽然语言模型在许多现实应用中已产生重要影响，但视频生成仍主要局限于娱乐领域。视频本身能够展示难以仅靠语言传达的物理世界信息（例如，想象仅用文字教人系领带），受此启发，我们发现了一个未被充分利用的机会：将视频扩展为下一事件预测（NEP）的新答案模态，并将其形式化为视频下一事件预测（VNEP）。既有的NEP任务以带有程序性或预测性问题的视频作为输入，以文本形式预测下一个事件，而VNEP则要求给出动态的视频回答。这种从讲述到展示的转变，为过程学习和创造性探索带来了更直观、更个性化的答案。然而，这项任务对现有模型来说仍然具有挑战性，因为它需要理解多模态输入、进行指令条件推理，并生成具有视觉和语义一致性的视频。为此，我们引入了VANS，这是一种利用强化学习将视觉语言模型（VLM）与视频扩散模型（VDM）对齐以完成VNEP的模型。VANS的核心是我们提出的Joint-GRPO，它协调VLM和VDM作为一个整体运作。在针对二者各自输出的共享奖励驱动下，它优化VLM生成既准确又易于可视化的描述，同时引导VDM生成忠实于这些描述和输入视觉上下文的视频。为了实现这种学习，我们专门为VNEP任务构建了VANS-Data-100K数据集。在程序性和预测性基准上的实验表明，VANS在视频事件预测和可视化方面均达到了最先进的性能。代码发布于此 https URL。
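
The abstract names Joint-GRPO but does not spell out its objective. As background only, the sketch below shows the group-relative advantage step of standard GRPO, in which each sampled response is scored against the other responses drawn for the same prompt. It is not the authors' Joint-GRPO: the shared VLM/VDM reward is not modeled, and the function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt.

    rewards: list of scalar rewards, one per sampled response.
    eps: small constant (an assumption here) that avoids division by zero
         when every response in the group gets the same reward.
    Returns one advantage per response: (reward - group mean) / group std.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four hypothetical responses to one prompt, scored by some reward function.
    print(group_relative_advantages([1.0, 0.0, 0.5, 0.5]))
```

Responses that score above the group mean receive positive advantages and are reinforced, while those below the mean are discouraged, so no separate learned value model is needed for the baseline.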

Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation

边思考边生成：文本推理贯穿视觉生成

Authors: Ziyu Guo, Renrui Zhang, Hongyu Li, Manyuan Zhang, Xinyan Chen, Sifan Wang, Yan Feng, Peng Pei, Pheng-Ann Heng
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.16671
Pdf link: https://arxiv.org/pdf/2511.16671
Abstract Recent advances in visual generation have increasingly explored the integration of reasoning capabilities. They incorporate textual reasoning, i.e., thinking, either before (as pre-planning) or after (as post-refinement) the generation process, yet they lack on-the-fly multimodal interaction during the generation itself. In this preliminary study, we introduce Thinking-while-Generating (TwiG), the first interleaved framework that enables co-evolving textual reasoning throughout the visual generation process. As visual content is progressively generated, textual reasoning is interleaved to both guide upcoming local regions and reflect on previously synthesized ones. This dynamic interplay produces more context-aware and semantically rich visual outputs. To unveil the potential of this framework, we investigate three candidate strategies: zero-shot prompting, supervised fine-tuning (SFT) on our curated TwiG-50K dataset, and reinforcement learning (RL) via a customized TwiG-GRPO strategy, each offering unique insights into the dynamics of interleaved reasoning. We hope this work inspires further research into interleaving textual reasoning for enhanced visual generation. Code will be released at: this https URL.
中文摘要 视觉生成领域的最新进展越来越多地探索推理能力的整合。这些方法在生成过程之前（作为预规划）或之后（作为后细化）引入文本推理（即思考），但在生成过程本身中缺乏即时的多模态交互。在这项初步研究中，我们提出了“边思考边生成”（TwiG），这是首个交错式框架，能够让文本推理在整个视觉生成过程中协同演化。随着视觉内容逐步生成，文本推理被交错插入，既引导即将生成的局部区域，也反思先前已合成的区域。这种动态互动产生了更具上下文感知、语义更丰富的视觉输出。为揭示该框架的潜力，我们探讨了三种候选策略：零样本提示、在我们整理的TwiG-50K数据集上的监督微调（SFT），以及通过定制的TwiG-GRPO策略进行强化学习（RL），每种策略都为交错推理的动态提供了独特的见解。我们希望这项工作能激发更多关于交错文本推理以增强视觉生成的研究。代码将发布于：此 https URL。

Keyword: diffusion policy

No results.