Arxiv Papers of Today

Generated: 2025-12-10 16:32:06 (UTC+8); arXiv release time: 2025-12-10 20:00 EST (2025-12-11 09:00 UTC+8)

26 relevant papers today

Keyword: reinforcement learning

ThreadWeaver: Adaptive Threading for Efficient Parallel Reasoning in Language Models

Authors: Long Lian, Sida Wang, Felix Juefei-Xu, Tsu-Jui Fu, Xiuyu Li, Adam Yala, Trevor Darrell, Alane Suhr, Yuandong Tian, Xi Victoria Lin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.07843
Pdf link: https://arxiv.org/pdf/2512.07843
Abstract Scaling inference-time computation has enabled Large Language Models (LLMs) to achieve strong reasoning performance, but inherently sequential decoding leads to substantial latency, especially on complex tasks. Recent work on adaptive parallel reasoning aims to improve inference efficiency by decomposing the problem-solving process into concurrent reasoning threads when beneficial. However, existing methods on realistic tasks are either limited to supervised behavior cloning or exhibit significant accuracy drops compared to widely-used sequential long chain-of-thought (CoT) baselines. Moreover, many require customized inference engines, complicating deployment. We introduce ThreadWeaver, a framework for adaptive parallel reasoning that achieves accuracy on par with popular sequential reasoning models of comparable size while significantly reducing inference latency. ThreadWeaver's performance stems from three key innovations: 1) a two-stage parallel trajectory generator that produces large-scale, high-quality CoT data with parallel annotations for supervised fine-tuning; 2) a trie-based training-inference co-design that enables parallel reasoning on any off-the-shelf autoregressive inference engine without modifying position embeddings or KV caches; and 3) a parallelization-aware reinforcement learning framework that teaches the model to balance accuracy with effective parallelization. Across six challenging mathematical reasoning benchmarks, ThreadWeaver trained atop Qwen3-8B achieves accuracy comparable to cutting-edge sequential reasoning models (71.9% on average and 79.9% on AIME24) while delivering up to 1.53x average speedup in token latency, establishing a new Pareto frontier between accuracy and efficiency.

Agentic Artificial Intelligence for Ethical Cybersecurity in Uganda: A Reinforcement Learning Framework for Threat Detection in Resource-Constrained Environments

Authors: Ibrahim Adabara, Bashir Olaniyi Sadiq, Aliyu Nuhu Shuaibu, Yale Ibrahim Danjuma, Venkateswarlu Maninti, Mutebi Joe
Subjects: Cryptography and Security (cs.CR); Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2512.07909
Pdf link: https://arxiv.org/pdf/2512.07909
Abstract Uganda's rapid digital transformation, supported by national strategies such as Vision 2040 and the Digital Transformation Roadmap, has expanded reliance on networked services while simultaneously increasing exposure to sophisticated cyber threats. In resource-constrained settings, commonly deployed rule-based intrusion detection systems lack the adaptability and ethical safeguards needed to address evolving attack patterns, leading to undetected breaches and excessive blocking of legitimate traffic. This study proposes an Agentic Artificial Intelligence (AAI) framework that integrates reinforcement learning, an explicit ethical governance layer, and human oversight to deliver adaptive and trustworthy cybersecurity. A CPU-optimized simulation environment was developed using a five-node network topology that mirrors key elements of Uganda's critical digital infrastructure and generates both benign and malicious traffic, including phishing, ransomware, and distributed denial-of-service attacks. A Q-learning agent, operating within clearly defined ethical constraints and subject to human auditability, was trained and evaluated against a traditional rule-based baseline. The AAI framework achieved a 100 percent detection rate, zero false positives, and full ethical compliance, compared with 70 percent detection and 15 percent false positives for the baseline system. These results demonstrate that agentic, ethically governed reinforcement learning can substantially improve cybersecurity effectiveness and fairness in CPU-only, resource-constrained environments, offering a practical pathway for operationalizing responsible AI in Uganda's national cybersecurity strategy.
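
The abstract names a Q-learning agent but gives no state, action, or reward design. The following is only a rough sketch of the standard tabular Q-learning update. The two-action allow/block setup, the reward values, and the hyperparameters are all hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict

ACTIONS = ["allow", "block"]  # hypothetical action set, not from the paper


def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Standard tabular update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


def toy_reward(state, action):
    # Assumed reward: +1 for blocking malicious or allowing benign traffic, -1 otherwise.
    correct = (state == "malicious") == (action == "block")
    return 1.0 if correct else -1.0


def train(steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy training on a toy stream of randomly labelled traffic."""
    rng = random.Random(seed)
    q = defaultdict(float)
    state = rng.choice(["benign", "malicious"])
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)  # explore
        else:
            action = max(ACTIONS, key=lambda a: q[(state, a)])  # exploit
        next_state = rng.choice(["benign", "malicious"])
        q_update(q, state, action, toy_reward(state, action), next_state)
        state = next_state
    return q
```

Calling `train()` returns a Q-table in which blocking scores higher than allowing for the toy "malicious" state. An ethical governance layer like the one the abstract describes would sit on top of such an agent and is not modelled here.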

VLD: Visual Language Goal Distance for Reinforcement Learning Navigation

Authors: Lazar Milikic, Manthan Patel, Jonas Frey
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.07976
Pdf link: https://arxiv.org/pdf/2512.07976
Abstract Training end-to-end policies from image data to directly predict navigation actions for robotic systems has proven inherently difficult. Existing approaches often suffer from either the sim-to-real gap during policy transfer or a limited amount of training data with action labels. To address this problem, we introduce Vision-Language Distance (VLD) learning, a scalable framework for goal-conditioned navigation that decouples perception learning from policy learning. Instead of relying on raw sensory inputs during policy training, we first train a self-supervised distance-to-goal predictor on internet-scale video data. This predictor generalizes across both image- and text-based goals, providing a distance signal that can be minimized by a reinforcement learning (RL) policy. The RL policy can be trained entirely in simulation using privileged geometric distance signals, with injected noise to mimic the uncertainty of the trained distance predictor. At deployment, the policy consumes VLD predictions, inheriting semantic goal information ("where to go") from large-scale visual training while retaining the robust low-level navigation behaviors learned in simulation. We propose using ordinal consistency to assess distance functions directly and demonstrate that VLD outperforms prior temporal distance approaches, such as ViNT and VIP. Experiments show that our decoupled design achieves competitive navigation performance in simulation while supporting flexible goal modalities, providing an alternative and, most importantly, scalable path toward reliable, multimodal navigation policies.

Benchmarking Offline Multi-Objective Reinforcement Learning in Critical Care

Authors: Aryaman Bansal, Divya Sharma
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.08012
Pdf link: https://arxiv.org/pdf/2512.08012
Abstract In critical care settings such as the Intensive Care Unit, clinicians face the complex challenge of balancing conflicting objectives, primarily maximizing patient survival while minimizing resource utilization (e.g., length of stay). Single-objective Reinforcement Learning approaches typically address this by optimizing a fixed scalarized reward function, resulting in rigid policies that fail to adapt to varying clinical priorities. Multi-objective Reinforcement Learning (MORL) offers a solution by learning a set of optimal policies along the Pareto Frontier, allowing for dynamic preference selection at test time. However, applying MORL in healthcare necessitates strict offline learning from historical data. In this paper, we benchmark three offline MORL algorithms, Conditioned Conservative Pareto Q-Learning (CPQL), Adaptive CPQL, and a modified Pareto Efficient Decision Agent (PEDA) Decision Transformer (PEDA DT), against three scalarized single-objective baselines (BC, CQL, and DDQN) on the MIMIC-IV dataset. Using Off-Policy Evaluation (OPE) metrics, we demonstrate that the PEDA DT algorithm offers superior flexibility compared to static scalarized baselines. Notably, our results extend previous findings on single-objective Decision Transformers in healthcare, confirming that sequence modeling architectures remain robust and effective when scaled to multi-objective conditioned generation. These findings suggest that offline MORL is a promising framework for enabling personalized, adjustable decision-making in critical care without the need for retraining.
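
To make the difference between a fixed scalarized reward and test-time preference selection concrete, here is a small sketch of Pareto filtering over per-policy objective scores. The score tuples (survival, negative length of stay) and the function names are hypothetical illustrations, not values or code from the paper.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly better on one.

    Higher is better for every objective.
    """
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def pick_by_preference(front, weights):
    """Choose the front member with the best weighted sum for the given preference."""
    return max(front, key=lambda p: sum(w * x for w, x in zip(weights, p)))


# Hypothetical (survival rate, negative length of stay in days) per candidate policy.
scores = [(0.9, -10.0), (0.8, -6.0), (0.7, -8.0), (0.85, -12.0)]
front = pareto_front(scores)
```

A scalarized baseline commits to one weight vector at training time. With the front available, `pick_by_preference(front, weights)` can be re-run with different weights at test time without retraining, which is the flexibility the abstract attributes to MORL.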

An Introduction to Deep Reinforcement and Imitation Learning

Authors: Pedro Santana
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.08052
Pdf link: https://arxiv.org/pdf/2512.08052
Abstract Embodied agents, such as robots and virtual characters, must continuously select actions to execute tasks effectively, solving complex sequential decision-making problems. Given the difficulty of designing such controllers manually, learning-based approaches have emerged as promising alternatives, most notably Deep Reinforcement Learning (DRL) and Deep Imitation Learning (DIL). DRL leverages reward signals to optimize behavior, while DIL uses expert demonstrations to guide learning. This document introduces DRL and DIL in the context of embodied agents, adopting a concise, depth-first approach to the literature. It is self-contained, presenting all necessary mathematical and machine learning concepts as they are needed. It is not intended as a survey of the field; rather, it focuses on a small set of foundational algorithms and techniques, prioritizing in-depth understanding over broad coverage. The material ranges from Markov Decision Processes to REINFORCE and Proximal Policy Optimization (PPO) for DRL, and from Behavioral Cloning to Dataset Aggregation (DAgger) and Generative Adversarial Imitation Learning (GAIL) for DIL.

Training LLMs for Honesty via Confessions

Authors: Manas Joglekar, Jeremy Chen, Gabriel Wu, Jason Yosinski, Jasmine Wang, Boaz Barak, Amelia Glaese
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.08093
Pdf link: https://arxiv.org/pdf/2512.08093
Abstract Large language models (LLMs) can be dishonest when reporting on their actions and beliefs; for example, they may overstate their confidence in factual claims or cover up evidence of covert actions. Such dishonesty may arise due to the effects of reinforcement learning (RL), where challenges with reward shaping can result in a training process that inadvertently incentivizes the model to lie or misrepresent its actions. In this work we propose a method for eliciting an honest expression of an LLM's shortcomings via a self-reported confession. A confession is an output, provided upon request after a model's original answer, that is meant to serve as a full account of the model's compliance with the letter and spirit of its policies and instructions. The reward assigned to a confession during training is solely based on its honesty, and does not impact positively or negatively the main answer's reward. As long as the "path of least resistance" for maximizing confession reward is to surface misbehavior rather than covering it up, this incentivizes models to be honest in their confessions. Our findings provide some justification for this empirical assumption, especially in the case of egregious model misbehavior. To demonstrate the viability of our approach, we train GPT-5-Thinking to produce confessions, and we evaluate its honesty in out-of-distribution scenarios measuring hallucination, instruction following, scheming, and reward hacking. We find that when the model lies or omits shortcomings in its "main" answer, it often confesses to these behaviors honestly, and this confession honesty modestly improves with training. Confessions can enable a number of inference-time interventions including monitoring, rejection sampling, and surfacing issues to the user.

Scalable Offline Model-Based RL with Action Chunks

Authors: Kwanyoung Park, Seohong Park, Youngwoon Lee, Sergey Levine
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.08108
Pdf link: https://arxiv.org/pdf/2512.08108
Abstract In this paper, we study whether model-based reinforcement learning (RL), in particular model-based value expansion, can provide a scalable recipe for tackling complex, long-horizon tasks in offline RL. Model-based value expansion fits an on-policy value function using length-n imaginary rollouts generated by the current policy and a learned dynamics model. While larger n reduces bias in value bootstrapping, it amplifies accumulated model errors over long horizons, degrading future predictions. We address this trade-off with an action-chunk model that predicts a future state from a sequence of actions (an "action chunk") instead of a single action, which reduces compounding errors. In addition, instead of directly training a policy to maximize rewards, we employ rejection sampling from an expressive behavioral action-chunk policy, which prevents model exploitation from out-of-distribution actions. We call this recipe Model-Based RL with Action Chunks (MAC). Through experiments on highly challenging tasks with large-scale datasets of up to 100M transitions, we show that MAC achieves the best performance among offline model-based RL algorithms, especially on challenging long-horizon tasks.
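
The length-n value expansion target described above is the discounted sum of n imagined rewards plus a discounted bootstrap value at the final imagined state. A minimal sketch follows, assuming a one-step model for simplicity. The callables `policy`, `model`, and `value_fn` are hypothetical stand-ins, and the paper's action-chunk model and rejection sampling are not reproduced.

```python
from typing import Callable, Sequence


def n_step_target(rewards: Sequence[float], bootstrap_value: float, gamma: float) -> float:
    """Return sum_k gamma^k * r_k + gamma^n * V(s_n), computed backwards."""
    target = bootstrap_value
    for r in reversed(rewards):
        target = r + gamma * target
    return target


def imagined_rollout_target(state, policy: Callable, model: Callable,
                            value_fn: Callable, n: int, gamma: float = 0.99) -> float:
    """Roll the learned model forward n steps under the policy, then bootstrap.

    model(state, action) is assumed to return (next_state, reward).
    Each call to the model can add error, so errors compound as n grows.
    """
    rewards = []
    for _ in range(n):
        action = policy(state)
        state, reward = model(state, action)
        rewards.append(reward)
    return n_step_target(rewards, value_fn(state), gamma)
```

With a one-step model, a length-n rollout needs n model calls. A model that predicts the state after a whole sequence of actions reaches the same horizon with fewer calls, which is the intuition the abstract gives for reduced compounding error.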

Universal Adversarial Suffixes for Language Models Using Reinforcement Learning with Calibrated Reward

Authors: Sampriti Soor, Suklav Ghosh, Arijit Sur
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.08131
Pdf link: https://arxiv.org/pdf/2512.08131
Abstract Language models are vulnerable to short adversarial suffixes that can reliably alter predictions. Previous works usually find such suffixes with gradient search or rule-based methods, but these are brittle and often tied to a single task or model. In this paper, a reinforcement learning framework is used where the suffix is treated as a policy and trained with Proximal Policy Optimization against a frozen model as a reward oracle. Rewards are shaped using calibrated cross-entropy, removing label bias and aggregating across surface forms to improve transferability. The proposed method is evaluated on five diverse NLP benchmark datasets, covering sentiment, natural language inference, paraphrase, and commonsense reasoning, using three distinct language models: Qwen2-1.5B Instruct, TinyLlama-1.1B Chat, and Phi-1.5. Results show that RL-trained suffixes consistently degrade accuracy and transfer more effectively across tasks and models than previous adversarial triggers of similar genres.

Robust Agents in Open-Ended Worlds

Authors: Mikayel Samvelyan
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.08139
Pdf link: https://arxiv.org/pdf/2512.08139
Abstract The growing prevalence of artificial intelligence (AI) in various applications underscores the need for agents that can successfully navigate and adapt to an ever-changing, open-ended world. A key challenge is ensuring these AI agents are robust, excelling not only in familiar settings observed during training but also effectively generalising to previously unseen and varied scenarios. In this thesis, we harness methodologies from open-endedness and multi-agent learning to train and evaluate robust AI agents capable of generalising to novel environments, out-of-distribution inputs, and interactions with other co-player agents. We begin by introducing MiniHack, a sandbox framework for creating diverse environments through procedural content generation. Based on the game of NetHack, MiniHack enables the construction of new tasks for reinforcement learning (RL) agents with a focus on generalisation. We then present Maestro, a novel approach for generating adversarial curricula that progressively enhance the robustness and generality of RL agents in two-player zero-sum games. We further probe robustness in multi-agent domains, utilising quality-diversity methods to systematically identify vulnerabilities in state-of-the-art, pre-trained RL policies within the complex video game football domain, characterised by intertwined cooperative and competitive dynamics. Finally, we extend our exploration of robustness to the domain of LLMs. Here, our focus is on diagnosing and enhancing the robustness of LLMs against adversarial prompts, employing evolutionary search to generate a diverse range of effective inputs that aim to elicit undesirable outputs from an LLM. This work collectively paves the way for future advancements in AI robustness, enabling the development of agents that not only adapt to an ever-evolving world but also thrive in the face of unforeseen challenges and interactions.

TreeGRPO: Tree-Advantage GRPO for Online RL Post-Training of Diffusion Models

Authors: Zheng Ding, Weirui Ye
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.08153
Pdf link: https://arxiv.org/pdf/2512.08153
Abstract Reinforcement learning (RL) post-training is crucial for aligning generative models with human preferences, but its prohibitive computational cost remains a major barrier to widespread adoption. We introduce TreeGRPO, a novel RL framework that dramatically improves training efficiency by recasting the denoising process as a search tree. From shared initial noise samples, TreeGRPO strategically branches to generate multiple candidate trajectories while efficiently reusing their common prefixes. This tree-structured approach delivers three key advantages: (1) high sample efficiency, achieving better performance with the same number of training samples; (2) fine-grained credit assignment via reward backpropagation that computes step-specific advantages, overcoming the uniform credit assignment limitation of trajectory-based methods; and (3) amortized computation, where multi-child branching enables multiple policy updates per forward pass. Extensive experiments on both diffusion and flow-based models demonstrate that TreeGRPO achieves 2.4× faster training while establishing a superior Pareto frontier in the efficiency-reward trade-off space. Our method consistently outperforms GRPO baselines across multiple benchmarks and reward models, providing a scalable and effective pathway for RL-based visual generative model alignment.
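
The abstract does not spell out how rewards are backpropagated through the tree, so the following is only one plausible reading, not the paper's algorithm. It assumes a node's value is the mean of its children's values and that each branch's advantage is its value minus the mean over its siblings. The dictionary-based tree layout is hypothetical, and the standard-deviation normalization used in GRPO is omitted.

```python
from statistics import mean


def node_value(node):
    """Leaf: its terminal reward. Internal node: mean of child values (assumed aggregation)."""
    if "reward" in node:
        return node["reward"]
    return mean(node_value(child) for child in node["children"])


def step_advantages(node, depth=0, out=None):
    """Collect (depth, child_index, advantage) for every branch in the tree.

    advantage = child value - mean value of all siblings at that branching point,
    so branches at different denoising steps receive different credit.
    """
    if out is None:
        out = []
    children = node.get("children", [])
    if children:
        values = [node_value(child) for child in children]
        baseline = mean(values)
        for index, (child, value) in enumerate(zip(children, values)):
            out.append((depth, index, value - baseline))
            step_advantages(child, depth + 1, out)
    return out


# Hypothetical tree: the first branch splits again, the second ends immediately.
tree = {"children": [
    {"children": [{"reward": 1.0}, {"reward": 0.0}]},
    {"reward": 0.0},
]}
advantages = step_advantages(tree)
```

A trajectory-level method would give every step of a sampled trajectory the same advantage. Here the two leaves under the first branch share a prefix but receive opposite-signed advantages at depth 1, which illustrates the step-specific credit the abstract refers to.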

Empowerment Gain and Causal Model Construction: Children and adults are sensitive to controllability and variability in their causal interventions

Authors: Eunice Yiu, Kelsey Allen, Shiry Ginosar, Alison Gopnik
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.08230
Pdf link: https://arxiv.org/pdf/2512.08230
Abstract Learning about the causal structure of the world is a fundamental problem for human cognition. Causal models and especially causal learning have proved to be difficult for large pretrained models using standard techniques of deep learning. In contrast, cognitive scientists have applied advances in our formal understanding of causation in computer science, particularly within the Causal Bayes Net formalism, to understand human causal learning. In the very different tradition of reinforcement learning, researchers have described an intrinsic reward signal called "empowerment" which maximizes mutual information between actions and their outcomes. "Empowerment" may be an important bridge between classical Bayesian causal learning and reinforcement learning and may help to characterize causal learning in humans and enable it in machines. If an agent learns an accurate causal world model, they will necessarily increase their empowerment, and increasing empowerment will lead to a more accurate causal world model. Empowerment may also explain distinctive features of children's causal learning, as well as providing a more tractable computational account of how that learning is possible. In an empirical study, we systematically test how children and adults use cues to empowerment to infer causal relations, and design effective causal interventions.
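
The mutual information between actions and outcomes can be computed directly for a small discrete case. The sketch below evaluates it for one fixed action distribution. Empowerment is usually defined as the maximum of this quantity over action distributions, which is not attempted here. The two toy channels are hypothetical examples, not the study's stimuli.

```python
import math


def mutual_information(p_action, p_outcome_given_action):
    """I(A; O) in bits for a discrete action distribution and an action-to-outcome channel.

    p_action[a] is P(A=a); p_outcome_given_action[a][o] is P(O=o | A=a).
    """
    n_outcomes = len(p_outcome_given_action[0])
    # Marginal outcome distribution P(O=o) = sum_a P(a) * P(o | a).
    p_outcome = [
        sum(p_action[a] * p_outcome_given_action[a][o] for a in range(len(p_action)))
        for o in range(n_outcomes)
    ]
    mi = 0.0
    for a, pa in enumerate(p_action):
        for o, po_given_a in enumerate(p_outcome_given_action[a]):
            if pa > 0 and po_given_a > 0:
                mi += pa * po_given_a * math.log2(po_given_a / p_outcome[o])
    return mi


uniform = [0.5, 0.5]
controllable = [[1.0, 0.0], [0.0, 1.0]]    # each action fully determines the outcome
uncontrollable = [[0.5, 0.5], [0.5, 0.5]]  # the outcome ignores the action
```

With a uniform choice between two actions, the fully controllable channel yields 1 bit and the uncontrollable one yields 0 bits, matching the intuition that empowerment tracks how much an agent's interventions control what happens.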

rSIM: Incentivizing Reasoning Capabilities of LLMs via Reinforced Strategy Injection

Authors: Sijia Chen, Baochun Li, Di Niu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.08300
Pdf link: https://arxiv.org/pdf/2512.08300
Abstract Large language models (LLMs) are post-trained through reinforcement learning (RL) to evolve into Reasoning Language Models (RLMs), where the hallmark of this advanced reasoning is "aha" moments when they start to perform strategies, such as self-reflection and deep thinking, within their chains of thought (CoTs). Motivated by this, this paper proposes a novel reinforced strategy injection mechanism (rSIM) that enables any LLM to become an RLM by employing a small planner to guide the LLM's CoT through the adaptive injection of reasoning strategies. To achieve this, the planner (leader agent) is jointly trained with an LLM (follower agent) using multi-agent RL (MARL), based on a leader-follower framework and straightforward rule-based rewards. Experimental results show that rSIM enables Qwen2.5-0.5B to become an RLM and significantly outperform Qwen2.5-14B. Moreover, the planner is generalizable: it only needs to be trained once and can be applied as a plug-in to substantially improve the reasoning capabilities of existing LLMs. In addition, the planner supports continual learning across various tasks, allowing its planning abilities to gradually improve and generalize to a wider range of problems.

Collaborative Intelligence for UAV-Satellite Network Slicing: Towards a Joint QoS-Energy-Fairness MADRL Optimization

Authors: Thanh-Dao Nguyen, Ngoc-Tan Nguyen, Thai-Duong Nguyen, Nguyen Van Huynh, Dinh-Hieu Tran, Symeon Chatzinotas
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2512.08322
Pdf link: https://arxiv.org/pdf/2512.08322
Abstract Non-terrestrial networks are critical for achieving global 6G coverage, yet efficient resource management in aerial and space environments remains challenging due to limited onboard power and dynamic operational conditions. Network slicing offers a promising solution for spectrum optimization in UAV-based systems serving heterogeneous service demands. To this end, this paper proposes a hierarchical network slicing framework for UAV-satellite integrated networks supporting eMBB, URLLC, and mMTC services. Specifically, we formulate a joint optimization of UAV trajectory, transmission power, and spectrum allocation as a decentralized partially observable Markov decision process that ensures quality of service while minimizing energy consumption and maximizing resource fairness. To address the computational intractability and partial observability, we develop a multi-agent deep reinforcement learning solution under the centralized training and decentralized execution paradigm. In the proposed system, UAV agents act as distributed actors coordinated by a shared critic operating with a multi-head attention mechanism at a low Earth orbit satellite. Experimental results demonstrate that our approach outperforms existing methods by up to 33% in cumulative reward while achieving superior energy efficiency and fairness.

Multi-Agent Deep Reinforcement Learning for Collaborative UAV Relay Networks under Jamming Attacks

Authors: Thai Duong Nguyen, Ngoc-Tan Nguyen, Thanh-Dao Nguyen, Nguyen Van Huynh, Dinh-Hieu Tran, Symeon Chatzinotas
Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2512.08341
Pdf link: https://arxiv.org/pdf/2512.08341
Abstract The deployment of Unmanned Aerial Vehicle (UAV) swarms as dynamic communication relays is critical for next-generation tactical networks. However, operating in contested environments requires solving a complex trade-off, including maximizing system throughput while ensuring collision avoidance and resilience against adversarial jamming. Existing heuristic-based approaches often struggle to find effective solutions due to the dynamic and multi-objective nature of this problem. This paper formulates this challenge as a cooperative Multi-Agent Reinforcement Learning (MARL) problem, solved using the Centralized Training with Decentralized Execution (CTDE) framework. Our approach employs a centralized critic that uses global state information to guide decentralized actors which operate using only local observations. Simulation results show that our proposed framework significantly outperforms heuristic baselines, increasing the total system throughput by approximately 50% while simultaneously achieving a near-zero collision rate. A key finding is that the agents develop an emergent anti-jamming strategy without explicit programming. They learn to intelligently position themselves to balance the trade-off between mitigating interference from jammers and maintaining effective communication links with ground users.

Turning Threat into Opportunity: DRL-Powered Anti-Jamming via Energy Harvesting in UAV-Disrupted Channels

Authors: Ngoc-Tan Nguyen, Thi-Thu Hoang, Trung-Dung Hoang, Thai-Duong Nguyen
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2512.08351
Pdf link: https://arxiv.org/pdf/2512.08351
Abstract The open and broadcast nature of wireless communication systems, while enabling ubiquitous connectivity, also exposes them to jamming attacks that may critically compromise network performance or disrupt service availability. The proliferation of Unmanned Aerial Vehicles (UAVs) introduces a new dimension to this threat, as UAVs can act as mobile, intelligent jammers capable of launching sophisticated attacks by leveraging Line-of-Sight (LoS) channels and adaptive strategies. This paper addresses a critical challenge of countering intelligent UAV jamming in the context of energy-constrained ambient backscatter communication systems. Traditional anti-jamming techniques often fall short against such dynamic threats or are unsuitable for low-power backscatter devices. Hence, we propose a novel anti-jamming framework based on Deep Reinforcement Learning (DRL) that empowers the transmitter to not only defend against but also strategically exploit the UAV's jamming signals. In particular, our approach allows the transmitter to learn an optimal policy for switching between active transmission, energy harvesting from the jamming signal, and backscattering information using the jammer's own emissions. We then formulate the problem as a Markov Decision Process (MDP) and employ a Deep Q-Network (DQN) to derive the optimal operational strategy. Simulation results demonstrate that our DQN-based method significantly outperforms conventional Q-learning in convergence speed and surpasses a greedy anti-jamming strategy in terms of average throughput, packet loss rate, and packet delivery ratio.
中文摘要 无线通信系统的开放和广播特性，虽然实现了无处不在的连接，但也使其面临可能严重影响网络性能或中断服务可用性的干扰攻击。无人机（UAV）的普及为这一威胁带来了新的维度，因为无人机可以充当移动的智能干扰机，利用视距（LoS）信道和自适应策略发起复杂攻击。本文研究在能量受限的环境反向散射通信系统中，如何对抗智能无人机干扰这一关键挑战。传统的抗干扰技术往往无法应对这种动态威胁，或者不适用于低功耗的反向散射设备。因此，我们提出了一种基于深度强化学习（DRL）的新型抗干扰框架，使发射机不仅能防御无人机的干扰信号，还能有策略地加以利用。具体而言，我们的方法让发射机学习一种最优策略，在主动发射、从干扰信号中收集能量以及利用干扰机自身的发射信号反向散射信息这三者之间切换。随后，我们将该问题表述为马尔可夫决策过程（MDP），并采用深度Q网络（DQN）推导最优操作策略。仿真结果表明，基于DQN的方法在收敛速度上显著优于传统Q学习，并在平均吞吐量、丢包率和数据包投递率方面优于贪婪抗干扰策略。

From Accuracy to Impact: The Impact-Driven AI Framework (IDAIF) for Aligning Engineering Architecture with Theory of Change

从准确性到影响力：使工程架构与变革理论对齐的影响驱动人工智能框架（IDAIF）

Authors: Yong-Woon Kim
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.08449
Pdf link: https://arxiv.org/pdf/2512.08449
Abstract This paper introduces the Impact-Driven AI Framework (IDAIF), a novel architectural methodology that integrates Theory of Change (ToC) principles with modern artificial intelligence system design. As AI systems increasingly influence high-stakes domains including healthcare, finance, and public policy, the alignment problem--ensuring AI behavior corresponds with human values and intentions--has become critical. Current approaches predominantly optimize technical performance metrics while neglecting the sociotechnical dimensions of AI deployment. IDAIF addresses this gap by establishing a systematic mapping between ToC's five-stage model (Inputs-Activities-Outputs-Outcomes-Impact) and corresponding AI architectural layers (Data Layer-Pipeline Layer-Inference Layer-Agentic Layer-Normative Layer). Each layer incorporates rigorous theoretical foundations: multi-objective Pareto optimization for value alignment, hierarchical multi-agent orchestration for outcome achievement, causal directed acyclic graphs (DAGs) for hallucination mitigation, and adversarial debiasing with Reinforcement Learning from Human Feedback (RLHF) for fairness assurance. We provide formal mathematical formulations for each component and introduce an Assurance Layer that manages assumption failures through guardian architectures. Three case studies demonstrate IDAIF application across healthcare, cybersecurity, and software engineering domains. This framework represents a paradigm shift from model-centric to impact-centric AI development, providing engineers with concrete architectural patterns for building ethical, trustworthy, and socially beneficial AI systems.
中文摘要 本文介绍了影响驱动人工智能框架（IDAIF），这是一种将变革理论（ToC）原则与现代人工智能系统设计相结合的新型架构方法论。随着人工智能系统日益影响医疗、金融和公共政策等高风险领域，对齐问题（即确保人工智能行为符合人类价值观和意图）变得至关重要。当前方法主要优化技术性能指标，忽视了人工智能部署的社会技术层面。IDAIF通过建立ToC五阶段模型（投入-活动-产出-成果-影响）与相应AI架构层（数据层-管道层-推理层-智能体层-规范层）之间的系统映射，弥补了这一空白。每一层都有严谨的理论基础：多目标帕累托优化用于价值对齐，分层多智能体编排用于达成成果，因果有向无环图（DAG）用于缓解幻觉，以及结合基于人类反馈的强化学习（RLHF）的对抗性去偏用于保证公平性。我们为每个组件给出了形式化的数学表述，并引入了一个保证层（Assurance Layer），通过守护架构处理假设失效的情况。三个案例研究展示了IDAIF在医疗、网络安全和软件工程领域的应用。该框架代表了从以模型为中心到以影响为中心的AI开发范式转变，为工程师提供了构建合乎伦理、可信且对社会有益的AI系统的具体架构模式。

Using reinforcement learning to probe the role of feedback in skill acquisition

利用强化学习探究反馈在技能习得中的作用

Authors: Antonio Terpin, Raffaello D'Andrea
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2512.08463
Pdf link: https://arxiv.org/pdf/2512.08463
Abstract Many high-performance human activities are executed with little or no external feedback: think of a figure skater landing a triple jump, a pitcher throwing a curveball for a strike, or a barista pouring latte art. To study the process of skill acquisition under fully controlled conditions, we bypass human subjects. Instead, we directly interface a generalist reinforcement learning agent with a spinning cylinder in a tabletop circulating water channel to maximize or minimize drag. This setup has several desirable properties. First, it is a physical system, with the rich interactions and complex dynamics that only the physical world has: the flow is highly chaotic and extremely difficult, if not impossible, to model or simulate accurately. Second, the objective -- drag minimization or maximization -- is easy to state and can be captured directly in the reward, yet good strategies are not obvious beforehand. Third, decades-old experimental studies provide recipes for simple, high-performance open-loop policies. Finally, the setup is inexpensive and far easier to reproduce than human studies. In our experiments we find that high-dimensional flow feedback lets the agent discover high-performance drag-control strategies with only minutes of real-world interaction. When we later replay the same action sequences without any feedback, we obtain almost identical performance. This shows that feedback, and in particular flow feedback, is not needed to execute the learned policy. Surprisingly, without flow feedback during training the agent fails to discover any well-performing policy in drag maximization, but still succeeds in drag minimization, albeit more slowly and less reliably. Our studies show that learning a high-performance skill can require richer information than executing it, and learning conditions can be kind or wicked depending solely on the goal, not on dynamics or policy complexity.
中文摘要 许多高水平的人类活动是在几乎没有或完全没有外部反馈的情况下完成的：比如花样滑冰选手完成三周跳落冰、投手投出曲球拿到好球，或咖啡师制作拉花。为了在完全受控的条件下研究技能习得的过程，我们绕开了人类受试者，直接让一个通用强化学习智能体控制桌面循环水槽中的旋转圆柱，以最大化或最小化阻力。这种装置具有几个理想的特性。首先，它是一个物理系统，具备只有物理世界才有的丰富相互作用和复杂动力学：流动高度混沌，极难甚至不可能被准确建模或模拟。其次，目标（阻力最小化或最大化）易于表述，且可以直接体现在奖励中，但好的策略事先并不显而易见。第三，数十年前的实验研究为简单而高性能的开环策略提供了现成的方案。最后，这种装置成本低廉，且远比人类研究容易复现。在实验中，我们发现高维流动反馈使智能体只需几分钟的真实交互就能发现高性能的阻力控制策略。当我们随后在没有任何反馈的情况下重放相同的动作序列时，得到的性能几乎完全相同。这表明执行所学策略并不需要反馈，尤其不需要流动反馈。令人惊讶的是，如果训练过程中没有流动反馈，智能体在阻力最大化任务中无法发现任何表现良好的策略，但在阻力最小化任务中仍能成功，只是速度更慢、可靠性更低。我们的研究表明，学习一项高性能技能所需的信息可能比执行它更丰富，而学习条件是友善（kind）还是险恶（wicked）可能完全取决于目标，而非动力学或策略的复杂性。

Optimal Perturbation Budget Allocation for Data Poisoning in Offline Reinforcement Learning

离线强化学习中数据投毒的最优扰动预算分配

Authors: Junnan Qiu, Jie Li
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.08485
Pdf link: https://arxiv.org/pdf/2512.08485
Abstract Offline Reinforcement Learning (RL) enables policy optimization from static datasets but is inherently vulnerable to data poisoning attacks. Existing attack strategies typically rely on locally uniform perturbations, which treat all samples indiscriminately. This approach is inefficient, as it wastes the perturbation budget on low-impact samples, and lacks stealthiness due to significant statistical deviations. In this paper, we propose a novel Global Budget Allocation attack strategy. Leveraging the theoretical insight that a sample's influence on value function convergence is proportional to its Temporal Difference (TD) error, we formulate the attack as a global resource allocation problem. We derive a closed-form solution where perturbation magnitudes are assigned proportional to the TD-error sensitivity under a global L2 constraint. Empirical results on D4RL benchmarks demonstrate that our method significantly outperforms baseline strategies, achieving up to 80% performance degradation with minimal perturbations that evade detection by state-of-the-art statistical and spectral defenses.
中文摘要 离线强化学习（RL）能够从静态数据集中优化策略，但本质上容易受到数据投毒攻击。现有的攻击策略通常依赖局部均匀扰动，对所有样本不加区分地处理。这种方法效率低下，因为它把扰动预算浪费在低影响样本上，而且由于造成显著的统计偏差而缺乏隐蔽性。本文提出了一种新的全局预算分配（Global Budget Allocation）攻击策略。基于样本对价值函数收敛的影响与其时序差分（TD）误差成正比这一理论洞见，我们将攻击表述为一个全局资源分配问题。我们推导出一个闭式解：在全局L2约束下，扰动幅度按与TD误差敏感度成正比的方式分配。在D4RL基准上的实证结果表明，我们的方法显著优于基线策略，以极小的扰动造成高达80%的性能下降，并且这些扰动能够躲过最先进的统计和谱防御的检测。
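
The proportional allocation described in the abstract above can be illustrated with a small sketch. The abstract does not state the formula, so this assumes a simplified attack objective that is linear in per-sample sensitivities, maximizing sum_i s_i * delta_i subject to ||delta||_2 <= epsilon, for which the Cauchy-Schwarz inequality gives delta_i = epsilon * s_i / ||s||_2. The function name `allocate_budget` and the use of |TD error| as the sensitivity s_i are illustrative assumptions, not the authors' implementation.

```python
import math

def allocate_budget(td_errors, epsilon):
    """Split a global L2 budget `epsilon` across samples in proportion to |TD error|.

    Assumption: sensitivity s_i = |td_error_i| and a linear objective, so the
    maximizer under ||delta||_2 <= epsilon is delta_i = epsilon * s_i / ||s||_2.
    """
    sens = [abs(d) for d in td_errors]
    norm = math.sqrt(sum(s * s for s in sens))
    if norm == 0.0:
        # No sample is sensitive, so no perturbation is assigned.
        return [0.0 for _ in sens]
    return [epsilon * s / norm for s in sens]

# Hypothetical TD errors for four samples.
td_errors = [0.1, -2.0, 0.5, 0.0]
deltas = allocate_budget(td_errors, epsilon=1.0)
print(deltas)
```

In contrast with a uniform perturbation, the sample with zero TD error receives no budget here, and the whole budget is spent exactly on the constraint boundary.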

Thinking with Images via Self-Calling Agent

通过自我调用代理用图像思考

Authors: Wenxi Yang, Yuzhong Zhao, Fang Wan, Qixiang Ye
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.08511
Pdf link: https://arxiv.org/pdf/2512.08511
Abstract Thinking-with-images paradigms have showcased remarkable visual reasoning capability by integrating visual information as dynamic elements into the Chain-of-Thought (CoT). However, optimizing interleaved multimodal CoT (iMCoT) through reinforcement learning remains challenging, as it relies on scarce high-quality reasoning data. In this study, we propose Self-Calling Chain-of-Thought (sCoT), a novel visual reasoning paradigm that reformulates iMCoT as a language-only CoT with self-calling. Specifically, a main agent decomposes the complex visual reasoning task to atomic subtasks and invokes its virtual replicas, i.e. parameter-sharing subagents, to solve them in isolated context. sCoT enjoys substantial training effectiveness and efficiency, as it requires no explicit interleaving between modalities. sCoT employs group-relative policy optimization to reinforce effective reasoning behavior to enhance optimization. Experiments on HR-Bench 4K show that sCoT improves the overall reasoning performance by up to $1.9\%$ with $\sim 75\%$ fewer GPU hours compared to strong baseline approaches. Code is available at this https URL.
中文摘要 “用图像思考”（thinking-with-images）范式将视觉信息作为动态元素整合进思维链（Chain-of-Thought，CoT），展现了卓越的视觉推理能力。然而，通过强化学习优化交错多模态CoT（iMCoT）仍然具有挑战性，因为它依赖于稀缺的高质量推理数据。本研究提出自调用思维链（Self-Calling Chain-of-Thought，sCoT），这是一种新型视觉推理范式，将iMCoT重新表述为带自调用的纯语言CoT。具体来说，主智能体将复杂的视觉推理任务分解为原子子任务，并调用其虚拟副本（即参数共享的子智能体）在相互隔离的上下文中求解这些子任务。由于不需要在模态之间进行显式交错，sCoT的训练效果和效率都很高。sCoT采用组相对策略优化（group-relative policy optimization）来强化有效的推理行为，从而改善优化。在HR-Bench 4K上的实验显示，与强基线方法相比，sCoT将整体推理性能最多提升1.9%，同时GPU小时数减少约75%。代码可在此 https URL 获取。
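
The group-relative policy optimization mentioned in the abstract above scores each sampled response against the other responses drawn for the same prompt. The sketch below shows only that advantage normalization step, (reward minus group mean) divided by group standard deviation; the rewards are made-up values, and the use of the population standard deviation and the small `eps` stabilizer are assumptions rather than details taken from the paper.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical binary correctness rewards for four sampled responses.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Because the baseline is the group mean, no separate value network is needed; responses better than their siblings get positive advantages and the rest get negative ones.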

Mind to Hand: Purposeful Robotic Control via Embodied Reasoning

从心到手：通过具身推理实现有目的的机器人控制

Authors: Peijun Tang, Shangjin Xie, Binyan Sun, Baifu Huang, Kuncheng Luo, Haotian Yang, Weiqi Jin, Jianan Wang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.08580
Pdf link: https://arxiv.org/pdf/2512.08580
Abstract Humans act with context and intention, with reasoning playing a central role. While internet-scale data has enabled broad reasoning capabilities in AI systems, grounding these abilities in physical action remains a major challenge. We introduce Lumo-1, a generalist vision-language-action (VLA) model that unifies robot reasoning ("mind") with robot action ("hand"). Our approach builds upon the general multi-modal reasoning capabilities of pre-trained vision-language models (VLMs), progressively extending them to embodied reasoning and action prediction, and ultimately towards structured reasoning and reasoning-action alignment. This results in a three-stage pre-training pipeline: (1) Continued VLM pre-training on curated vision-language data to enhance embodied reasoning skills such as planning, spatial understanding, and trajectory prediction; (2) Co-training on cross-embodiment robot data alongside vision-language data; and (3) Action training with reasoning process on trajectories collected on Astribot S1, a bimanual mobile manipulator with human-like dexterity and agility. Finally, we integrate reinforcement learning to further refine reasoning-action consistency and close the loop between semantic inference and motor control. Extensive experiments demonstrate that Lumo-1 achieves significant performance improvements in embodied vision-language reasoning, a critical component for generalist robotic control. Real-world evaluations further show that Lumo-1 surpasses strong baselines across a wide range of challenging robotic tasks, with strong generalization to novel objects and environments, excelling particularly in long-horizon tasks and responding to human-natural instructions that require reasoning over strategy, concepts and space.
中文摘要 人类的行动带有情境和意图，推理在其中起着核心作用。尽管互联网规模的数据使人工智能系统具备了广泛的推理能力，但将这些能力落实到物理动作上仍是一大挑战。我们提出Lumo-1，一种通用视觉-语言-动作（VLA）模型，将机器人推理（“心”）与机器人动作（“手”）统一起来。我们的方法建立在预训练视觉语言模型（VLM）的通用多模态推理能力之上，逐步将其扩展到具身推理和动作预测，最终实现结构化推理和推理-动作对齐。由此形成三阶段的预训练流程：（1）在精选的视觉语言数据上继续进行VLM预训练，以增强规划、空间理解和轨迹预测等具身推理技能；（2）在跨本体机器人数据与视觉语言数据上进行联合训练；（3）在Astribot S1上采集的轨迹上进行带推理过程的动作训练，Astribot S1是一款具有类人灵巧性和敏捷性的双臂移动操作机器人。最后，我们引入强化学习，进一步提升推理与动作的一致性，并在语义推断与运动控制之间形成闭环。大量实验表明，Lumo-1在具身视觉-语言推理方面取得了显著的性能提升，而这是通用机器人控制的关键组成部分。真实世界评估进一步表明，Lumo-1在多种具有挑战性的机器人任务中超越了强基线，对新物体和新环境具有很强的泛化能力，尤其擅长长时程任务，并能响应需要对策略、概念和空间进行推理的自然人类指令。

Sim2Swim: Zero-Shot Velocity Control for Agile AUV Maneuvering in 3 Minutes

Sim2Swim：3分钟实现敏捷AUV机动的零样本速度控制

Authors: Lauritz Rismark Fosso, Herman Biørn Amundsen, Marios Xanthidis, Sveinung Johan Ohrem
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.08656
Pdf link: https://arxiv.org/pdf/2512.08656
Abstract Holonomic autonomous underwater vehicles (AUVs) have the hardware ability for agile maneuvering in both translational and rotational degrees of freedom (DOFs). However, due to challenges inherent to underwater vehicles, such as complex hydrostatics and hydrodynamics, parametric uncertainties, and frequent changes in dynamics due to payload changes, control is challenging. Performance typically relies on carefully tuned controllers targeting unique platform configurations, and a need for re-tuning for deployment under varying payloads and hydrodynamic conditions. As a consequence, agile maneuvering with simultaneous tracking of time-varying references in both translational and rotational DOFs is rarely utilized in practice. To the best of our knowledge, this paper presents the first general zero-shot sim2real deep reinforcement learning-based (DRL) velocity controller enabling path following and agile 6DOF maneuvering with a training duration of just 3 minutes. Sim2Swim, the proposed approach, inspired by state-of-the-art DRL-based position control, leverages domain randomization and massively parallelized training to converge to field-deployable control policies for AUVs of variable characteristics without post-processing or tuning. Sim2Swim is extensively validated in pool trials for a variety of configurations, showcasing robust control for highly agile motions.
中文摘要 完整约束（holonomic）自主水下航行器（AUV）在硬件上具备在平移和旋转自由度（DOF）上灵活机动的能力。然而，由于水下航行器固有的难题，例如复杂的流体静力学和流体动力学、参数不确定性，以及有效载荷变化导致的动力学频繁变化，其控制颇具挑战。控制性能通常依赖于针对特定平台配置精心调校的控制器，并且在不同有效载荷和水动力条件下部署时需要重新调校。因此，在平移和旋转自由度上同时跟踪时变参考的敏捷机动在实践中很少被采用。据我们所知，本文提出了首个通用的零样本sim2real深度强化学习（DRL）速度控制器，能够实现路径跟踪和敏捷的6自由度机动，训练时间仅为3分钟。所提出的方法Sim2Swim受最先进的基于DRL的位置控制启发，利用域随机化和大规模并行训练，针对特性各异的AUV收敛到可现场部署的控制策略，无需后处理或调参。Sim2Swim在多种配置下的水池试验中得到了充分验证，展示了对高度敏捷运动的鲁棒控制。

Direct transfer of optimized controllers to similar systems using dimensionless MPC

利用无量纲MPC将优化后的控制器直接迁移到相似系统

Authors: Josip Kir Hromatko, Shambhuraj Sawant, Šandor Ileš, Sébastien Gros
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.08667
Pdf link: https://arxiv.org/pdf/2512.08667
Abstract Scaled model experiments are commonly used in various engineering fields to reduce experimentation costs and overcome constraints associated with full-scale systems. The relevance of such experiments relies on dimensional analysis and the principle of dynamic similarity. However, transferring controllers to full-scale systems often requires additional tuning. In this paper, we propose a method to enable a direct controller transfer using dimensionless model predictive control, tuned automatically for closed-loop performance. With this reformulation, the closed-loop behavior of an optimized controller transfers directly to a new, dynamically similar system. Additionally, the dimensionless formulation allows for the use of data from systems of different scales during parameter optimization. We demonstrate the method on a cartpole swing-up and a car racing problem, applying either reinforcement learning or Bayesian optimization for tuning the controller parameters. Software used to obtain the results in this paper is publicly available at this https URL.
中文摘要 缩尺模型实验常用于各类工程领域，以降低实验成本并克服全尺寸系统带来的限制。此类实验的有效性依赖于量纲分析和动力相似原理。然而，将控制器迁移到全尺寸系统通常需要额外调参。本文提出一种方法，利用针对闭环性能自动调参的无量纲模型预测控制，实现控制器的直接迁移。通过这种重新表述，优化后控制器的闭环行为可以直接迁移到一个新的、动力相似的系统上。此外，无量纲表述允许在参数优化过程中使用来自不同尺度系统的数据。我们在倒立摆小车（cartpole）起摆和赛车问题上演示了该方法，并分别采用强化学习或贝叶斯优化来调节控制器参数。用于获得本文结果的软件可在此 https URL 公开获取。
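
As a minimal illustration of the dynamic similarity the abstract above relies on, the sketch below rescales time for a simple pendulum. It is not the paper's MPC formulation: the pendulum, the time scale sqrt(length / g), and the function names are assumptions chosen only to show that a quantity expressed in dimensionless form is the same for systems of different size.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2

def to_dimensionless_time(t, length):
    """Scale physical time t [s] by the pendulum time scale sqrt(length / g)."""
    return t * math.sqrt(G / length)

def small_angle_period(length):
    """Physical small-angle period of a simple pendulum in seconds."""
    return 2.0 * math.pi * math.sqrt(length / G)

# Pendulums of very different size have different periods in seconds,
# but the same period once time is made dimensionless.
for length in (0.1, 1.0, 10.0):
    period = small_angle_period(length)
    print(length, period, to_dimensionless_time(period, length))
```

A controller tuned in such dimensionless coordinates sees the same dynamics on every dynamically similar system, which is the intuition behind transferring it without re-tuning.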

Learning and Editing Universal Graph Prompt Tuning via Reinforcement Learning

通过强化学习学习和编辑通用图提示调优

Authors: Jinfeng Xu, Zheyu Chen, Shuo Yang, Jinze Li, Hewei Wang, Yijie Li, Edith C. H. Ngai
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.08763
Pdf link: https://arxiv.org/pdf/2512.08763
Abstract Early graph prompt tuning approaches relied on task-specific designs for Graph Neural Networks (GNNs), limiting their adaptability across diverse pre-training strategies. In contrast, another promising line of research has investigated universal graph prompt tuning, which operates directly in the input graph's feature space and builds a theoretical foundation that universal graph prompt tuning can theoretically achieve an equivalent effect of any prompting function, eliminating dependence on specific pre-training strategies. Recent works propose selective node-based graph prompt tuning to pursue more ideal prompts. However, we argue that selective node-based graph prompt tuning inevitably compromises the theoretical foundation of universal graph prompt tuning. In this paper, we strengthen the theoretical foundation of universal graph prompt tuning by introducing stricter constraints, demonstrating that adding prompts to all nodes is a necessary condition for achieving the universality of graph prompts. To this end, we propose a novel model and paradigm, Learning and Editing Universal GrAph Prompt Tuning (LEAP), which preserves the theoretical foundation of universal graph prompt tuning while pursuing more ideal prompts. Specifically, we first build the basic universal graph prompts to preserve the theoretical foundation and then employ actor-critic reinforcement learning to select nodes and edit prompts. Extensive experiments on graph- and node-level tasks across various pre-training strategies in both full-shot and few-shot scenarios show that LEAP consistently outperforms fine-tuning and other prompt-based approaches.
中文摘要 早期的图提示调优方法依赖于针对图神经网络（GNN）的任务特定设计，限制了其在多种预训练策略间的适应性。相比之下，另一条有前景的研究路线探索了通用图提示调优，该方法直接作用于输入图的特征空间，并建立了相应的理论基础：通用图提示调优在理论上能够达到与任何提示函数等效的效果，从而消除对特定预训练策略的依赖。近期研究提出了基于节点选择的图提示调优，以追求更理想的提示。然而，我们认为基于节点选择的图提示调优不可避免地损害了通用图提示调优的理论基础。本文通过引入更严格的约束，强化了通用图提示调优的理论基础，证明向所有节点添加提示是实现图提示通用性的必要条件。为此，我们提出了一种新的模型和范式，即学习与编辑通用图提示调优（Learning and Editing Universal GrAph Prompt Tuning，LEAP），它在追求更理想提示的同时，保持了通用图提示调优的理论基础。具体来说，我们首先构建基本的通用图提示以保持理论基础，然后采用actor-critic强化学习来选择节点并编辑提示。在全样本和少样本场景下，针对多种预训练策略的图级和节点级任务进行的大量实验表明，LEAP始终优于微调和其他基于提示的方法。

Reinforcement Learning From State and Temporal Differences

基于状态差分和时序差分的强化学习

Authors: Lex Weaver, Jonathan Baxter
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.08855
Pdf link: https://arxiv.org/pdf/2512.08855
Abstract TD($\lambda$) with function approximation has proved empirically successful for some complex reinforcement learning problems. For linear approximation, TD($\lambda$) has been shown to minimise the squared error between the approximate value of each state and the true value. However, as far as policy is concerned, it is error in the relative ordering of states that is critical, rather than error in the state values. We illustrate this point, both in simple two-state and three-state systems in which TD($\lambda$)--starting from an optimal policy--converges to a sub-optimal policy, and also in backgammon. We then present a modified form of TD($\lambda$), called STD($\lambda$), in which function approximators are trained with respect to relative state values on binary decision problems. A theoretical analysis, including a proof of monotonic policy improvement for STD($\lambda$) in the context of the two-state system, is presented, along with a comparison with Bertsekas' differential training method [1]. This is followed by successful demonstrations of STD($\lambda$) on the two-state system and a variation on the well known acrobot problem.
中文摘要 带函数逼近的TD($\lambda$)已在一些复杂的强化学习问题上被经验性地证明是成功的。对于线性逼近，已有结果表明TD($\lambda$)能最小化每个状态的近似值与真实值之间的平方误差。然而，就策略而言，关键的是状态相对排序上的误差，而非状态值本身的误差。我们通过两类例子说明这一点：一是简单的两状态和三状态系统，其中TD($\lambda$)从最优策略出发却收敛到次优策略；二是西洋双陆棋。随后，我们提出TD($\lambda$)的一种改进形式，称为STD($\lambda$)，其中函数逼近器针对二元决策问题上的相对状态值进行训练。文中给出了理论分析，包括在两状态系统中STD($\lambda$)单调策略改进的证明，并与Bertsekas的差分训练方法[1]进行了比较。最后，我们在两状态系统以及著名的acrobot问题的一个变体上成功演示了STD($\lambda$)。
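
For readers unfamiliar with the baseline the abstract above starts from, here is a minimal sketch of standard TD($\lambda$) with accumulating eligibility traces and linear (here one-hot) features. It is the textbook algorithm, not the paper's STD($\lambda$) variant, and the two-state deterministic cycle, rewards, and hyperparameters are hypothetical choices made for illustration.

```python
def td_lambda(num_steps=20000, alpha=0.05, gamma=0.9, lam=0.5):
    """Estimate state values on a toy two-state cycle with TD(lambda)."""
    rewards = [1.0, 0.0]  # reward received when leaving state 0 or state 1
    w = [0.0, 0.0]        # one weight per state (one-hot features)
    e = [0.0, 0.0]        # accumulating eligibility trace
    s = 0
    for _ in range(num_steps):
        s_next = 1 - s    # deterministic transition 0 -> 1 -> 0 -> ...
        # TD error: one-step target minus the current estimate.
        delta = rewards[s] + gamma * w[s_next] - w[s]
        # Decay all traces, then mark the state just visited.
        e = [gamma * lam * x for x in e]
        e[s] += 1.0
        # Move every weight along its trace in proportion to the TD error.
        w = [wi + alpha * delta * ei for wi, ei in zip(w, e)]
        s = s_next
    return w

values = td_lambda()
# For this cycle the exact values are 1 / (1 - gamma**2) and gamma / (1 - gamma**2).
print(values)
```

This objective fits the values themselves; the abstract's point is that a policy only depends on how states are ordered relative to one another, which is what motivates training on relative state values instead.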

IPPO Learns the Game, Not the Team: A Study on Generalization in Heterogeneous Agent Teams

IPPO学会了游戏，而非团队：异质代理团队中泛化的研究

Authors: Ryan LeRoy, Jack Kolb
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.08877
Pdf link: https://arxiv.org/pdf/2512.08877
Abstract Multi-Agent Reinforcement Learning (MARL) is commonly deployed in settings where agents are trained via self-play with homogeneous teammates, often using parameter sharing and a single policy architecture. This opens the question: to what extent do self-play PPO agents learn general coordination strategies grounded in the underlying game, compared to overfitting to their training partners' behaviors? This paper investigates the question using the Heterogeneous Multi-Agent Challenge (HeMAC) environment, which features distinct Observer and Drone agents with complementary capabilities. We introduce Rotating Policy Training (RPT), an approach that rotates heterogeneous teammate policies of different learning algorithms during training, to expose the agent to a broader range of partner strategies. When playing alongside a withheld teammate policy (DDQN), we find that RPT achieves similar performance to a standard self-play baseline, IPPO, where all agents were trained sharing a single PPO policy. This result indicates that in this heterogeneous multi-agent setting, the IPPO baseline generalizes to novel teammate algorithms despite not experiencing teammate diversity during training. This shows that a simple IPPO baseline may possess the level of generalization to novel teammates that a diverse training regimen was designed to achieve.
中文摘要 多智能体强化学习（MARL）通常部署在这样的场景中：智能体通过与同质队友的自我博弈进行训练，并且往往采用参数共享和单一策略架构。这引出了一个问题：自我博弈的PPO智能体在多大程度上学到了基于底层博弈的通用协作策略，而不是对训练伙伴的行为过拟合？本文利用异构多智能体挑战（HeMAC）环境研究这一问题，该环境包含能力互补的观察者（Observer）和无人机（Drone）两类智能体。我们提出轮换策略训练（RPT），即在训练过程中轮换由不同学习算法得到的异构队友策略，使智能体接触到更广泛的伙伴策略。在与训练中未出现的队友策略（DDQN）配合时，我们发现RPT的性能与标准自我博弈基线IPPO相近，后者的所有智能体共享同一个PPO策略进行训练。这一结果表明，在这种异构多智能体环境中，IPPO基线尽管在训练中没有经历队友多样性，仍能泛化到新的队友算法。这说明，一个简单的IPPO基线可能已经具备多样化训练方案旨在实现的对新队友的泛化水平。

No Labels, No Problem: Training Visual Reasoners with Multimodal Verifiers

无标签，无问题：用多模态验证器训练视觉推理者

Authors: Damiano Marsili, Georgia Gkioxari
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.08889
Pdf link: https://arxiv.org/pdf/2512.08889
Abstract Visual reasoning is challenging, requiring both precise object grounding and understanding complex spatial relationships. Existing methods fall into two camps: language-only chain-of-thought approaches, which demand large-scale (image, query, answer) supervision, and program-synthesis approaches which use pre-trained models and avoid training, but suffer from flawed logic and erroneous grounding. We propose an annotation-free training framework that improves both reasoning and grounding. Our framework uses AI-powered verifiers: an LLM verifier refines LLM reasoning via reinforcement learning, while a VLM verifier strengthens visual grounding through automated hard-negative mining, eliminating the need for ground truth labels. This design combines the strengths of modern AI systems: advanced language-only reasoning models for decomposing spatial queries into simpler subtasks, and strong vision specialist models improved via performant VLM critics. We evaluate our approach across diverse spatial reasoning tasks, and show that our method improves visual reasoning and surpasses open-source and proprietary models, while with our improved visual grounding model we further outperform recent text-only visual reasoning methods. Project webpage: this https URL
中文摘要 视觉推理具有挑战性，既需要精确的物体定位（grounding），也需要理解复杂的空间关系。现有方法分为两类：纯语言的思维链方法，需要大规模（图像、查询、答案）监督；以及程序合成方法，使用预训练模型并避免训练，但存在逻辑缺陷和定位错误。我们提出了一个无需标注的训练框架，同时提升推理和定位能力。我们的框架采用AI驱动的验证器：LLM验证器通过强化学习改进LLM的推理，VLM验证器则通过自动化的困难负样本挖掘强化视觉定位，从而不再需要真值标签。该设计结合了现代人工智能系统的优势：先进的纯语言推理模型用于将空间查询分解为更简单的子任务，而强大的视觉专家模型则借助高性能的VLM评判器得到改进。我们在多种空间推理任务上评估了该方法，结果表明它提升了视觉推理能力，并超越了开源和专有模型；借助改进后的视觉定位模型，我们进一步优于近期的纯文本视觉推理方法。项目网页：此 https URL

Keyword: diffusion policy

There is no result