Arxiv Papers of Today

Generated: 2025-11-17 16:31:32 (UTC+8); arXiv release time: 2025-11-17 20:00 EST (2025-11-18 09:00 UTC+8)

20 relevant papers today

Keyword: reinforcement learning

A methodological analysis of prompt perturbations and their effect on attack success rates

Authors: Tiago Machado, Maysa Malfiza Garcia de Macedo, Rogerio Abreu de Paula, Marcelo Carpinette Grave, Aminat Adebiyi, Luan Soares de Souza, Enrico Santarelli, Claudio Pinhanez
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.10686
Pdf link: https://arxiv.org/pdf/2511.10686
Abstract This work aims to investigate how different Large Language Model (LLM) alignment methods affect the models' responses to prompt attacks. We selected open source models based on the most common alignment methods, namely, Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Reinforcement Learning with Human Feedback (RLHF). We conducted a systematic analysis using statistical methods to verify how sensitive the Attack Success Rate (ASR) is when we apply variations to prompts designed to elicit inappropriate content from LLMs. Our results show that even small prompt modifications can significantly change the ASR according to the statistical tests we ran, making the models more or less susceptible to types of attack. Critically, our results demonstrate that running existing 'attack benchmarks' alone may not be sufficient to elicit all possible vulnerabilities of both models and alignment methods. This paper thus contributes to ongoing efforts on model attack evaluation by means of systematic and statistically-based analyses of the different alignment methods and how sensitive their ASR is to prompt variation.
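
As a rough illustration of what "statistically significant change in ASR" can mean, the sketch below compares the attack success rate of an original prompt set against a perturbed one with a two-sided two-proportion z-test. The abstract does not say which statistical tests the authors ran, so the choice of test and the function name `two_proportion_z_test` are assumptions made for illustration only.

```python
import math


def two_proportion_z_test(successes_a, trials_a, successes_b, trials_b):
    """Two-sided z-test for equality of two attack success rates.

    Returns (z, p_value). Illustrative only: the paper's actual tests
    are not specified in the abstract.
    """
    rate_a = successes_a / trials_a
    rate_b = successes_b / trials_b
    # Pooled success rate under the null hypothesis that both rates are equal.
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / trials_a + 1 / trials_b))
    if std_err == 0:
        # Both samples are all-success or all-failure: no evidence of a difference.
        return 0.0, 1.0
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 60/100 successful attacks before perturbation, 30/100 after.
z, p = two_proportion_z_test(60, 100, 30, 100)
print(f"z = {z:.2f}, p = {p:.2g}")
```

With many prompt variants compared at once, a multiple-comparison correction would also be needed; that step is left out of this sketch.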

From Efficiency to Adaptivity: A Deeper Look at Adaptive Reasoning in Large Language Models

Authors: Chao Wu, Baoheng Li, Mingchen Gao, Zhenyi Wang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.10788
Pdf link: https://arxiv.org/pdf/2511.10788
Abstract Recent advances in large language models (LLMs) have made reasoning a central benchmark for evaluating intelligence. While prior surveys focus on efficiency by examining how to shorten reasoning chains or reduce computation, this view overlooks a fundamental challenge: current LLMs apply uniform reasoning strategies regardless of task complexity, generating long traces for trivial problems while failing to extend reasoning for difficult tasks. This survey reframes reasoning through the lens of adaptivity: the capability to allocate reasoning effort based on input characteristics such as difficulty and uncertainty. We make three contributions. First, we formalize deductive, inductive, and abductive reasoning within the LLM context, connecting these classical cognitive paradigms with their algorithmic realizations. Second, we formalize adaptive reasoning as a control-augmented policy optimization problem balancing task performance with computational cost, distinguishing learned policies from inference-time control mechanisms. Third, we propose a systematic taxonomy organizing existing methods into training-based approaches that internalize adaptivity through reinforcement learning, supervised fine-tuning, and learned controllers, and training-free approaches that achieve adaptivity through prompt conditioning, feedback-driven halting, and modular composition. This framework clarifies how different mechanisms realize adaptive reasoning in practice and enables systematic comparison across diverse strategies. We conclude by identifying open challenges in self-evaluation, meta-reasoning, and human-aligned reasoning control.
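
One plausible way to read "policy optimization balancing task performance with computational cost" is as a cost-regularized objective. The form below is an illustrative assumption, not the survey's own notation: $\pi_\theta$ is the reasoning policy, $c$ a control signal such as a reasoning budget, $R$ the task reward, $C$ the compute cost of the generated trace $y$, and $\lambda \ge 0$ a trade-off weight.

```latex
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x,\, c)}
\bigl[\, R(x, y) \;-\; \lambda\, C(y) \,\bigr]
```

Under this reading, a learned policy would absorb the trade-off into $\theta$ during training, whereas an inference-time mechanism would adjust $c$ (or halt generation) without changing $\theta$.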

Behaviour Policy Optimization: Provably Lower Variance Return Estimates for Off-Policy Reinforcement Learning

Authors: Alexander W. Goodall, Edwin Hamel-De le Court, Francesco Belardinelli
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.10843
Pdf link: https://arxiv.org/pdf/2511.10843
Abstract Many reinforcement learning algorithms, particularly those that rely on return estimates for policy improvement, can suffer from poor sample efficiency and training instability due to high-variance return estimates. In this paper we leverage new results from off-policy evaluation; it has recently been shown that well-designed behaviour policies can be used to collect off-policy data for provably lower variance return estimates. This result is surprising as it means collecting data on-policy is not variance optimal. We extend this key insight to the online reinforcement learning setting, where both policy evaluation and improvement are interleaved to learn optimal policies. Off-policy RL has been well studied (e.g., IMPALA), with correct and truncated importance-weighted samples for de-biasing and managing variance appropriately. Generally these approaches are concerned with reconciling data collected from multiple workers in parallel while the policy is updated asynchronously; the mismatch between the workers and the policy is corrected in a mathematically sound way. Here we consider only one worker, the behaviour policy, which is used to collect data for policy improvement, with provably lower variance return estimates. In our experiments we extend two policy-gradient methods with this regime, demonstrating better sample efficiency and performance over a diverse set of environments.
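
For readers unfamiliar with importance-weighted off-policy return estimates, here is a minimal sketch of a trajectory-level estimator with optional per-step ratio truncation. It is a textbook-style illustration, not the estimator proposed in the paper; the function name, argument layout, and the `clip` parameter are assumptions.

```python
def importance_weighted_return(rewards, target_probs, behaviour_probs,
                               gamma=0.99, clip=None):
    """Estimate the target policy's return from one behaviour-policy trajectory.

    rewards[t]         : reward received at step t
    target_probs[t]    : probability the target policy assigns to the action taken
    behaviour_probs[t] : probability the behaviour policy assigned to that action
    clip               : optional upper bound on each per-step ratio; truncation
                         lowers variance but introduces bias
    """
    weight = 1.0
    for p_target, p_behaviour in zip(target_probs, behaviour_probs):
        ratio = p_target / p_behaviour
        if clip is not None:
            ratio = min(ratio, clip)
        weight *= ratio
    discounted_return = sum(gamma ** t * r for t, r in enumerate(rewards))
    return weight * discounted_return


# Hypothetical two-step trajectory in which the behaviour policy under-samples
# the actions the target policy prefers.
print(importance_weighted_return([1.0, 1.0], [0.5, 0.5], [0.25, 0.25], gamma=0.5))
```

When the behaviour and target policies coincide, every ratio is 1 and the estimate reduces to the ordinary discounted return.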

Incorporating Spatial Information into Goal-Conditioned Hierarchical Reinforcement Learning via Graph Representations

Authors: Shuyuan Zhang, Zihan Wang, Xiao-Wen Chang, Doina Precup
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.10872
Pdf link: https://arxiv.org/pdf/2511.10872
Abstract The integration of graphs with Goal-conditioned Hierarchical Reinforcement Learning (GCHRL) has recently gained attention, as intermediate goals (subgoals) can be effectively sampled from graphs that naturally represent the overall task structure in most RL tasks. However, existing approaches typically rely on domain-specific knowledge to construct these graphs, limiting their applicability to new tasks. Other graph-based approaches create graphs dynamically during exploration but struggle to fully utilize them, because they have problems passing the information in the graphs to newly visited states. Additionally, current GCHRL methods face challenges such as sample inefficiency and poor subgoal representation. This paper proposes a solution to these issues by developing a graph encoder-decoder to evaluate unseen states. Our proposed method, Graph-Guided sub-Goal representation Generation RL (G4RL), can be incorporated into any existing GCHRL method when operating in environments with primarily symmetric and reversible transitions to enhance performance across this class of problems. We show that the graph encoder-decoder can be effectively implemented using a network trained on the state graph generated during exploration. Empirical results indicate that leveraging high- and low-level intrinsic rewards from the graph encoder-decoder significantly enhances the performance of state-of-the-art GCHRL approaches at a small extra computational cost in dense and sparse reward environments.

When Data is the Algorithm: A Systematic Study and Curation of Preference Optimization Datasets

Authors: Aladin Djuhera, Farhan Ahmed, Swanand Ravindra Kadhe, Syed Zawad, Heiko Ludwig, Holger Boche
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.10985
Pdf link: https://arxiv.org/pdf/2511.10985
Abstract Aligning large language models (LLMs) is a central objective of post-training, often achieved through reward modeling and reinforcement learning methods. Among these, direct preference optimization (DPO) has emerged as a widely adopted technique that fine-tunes LLMs on preferred completions over less favorable ones. While most frontier LLMs do not disclose their curated preference pairs, the broader LLM community has released several open-source DPO datasets, including TuluDPO, ORPO, UltraFeedback, HelpSteer, and Code-Preference-Pairs. However, systematic comparisons remain scarce, largely due to the high computational cost and the lack of rich quality annotations, making it difficult to understand how preferences were selected, which task types they span, and how well they reflect human judgment on a per-sample level. In this work, we present the first comprehensive, data-centric analysis of popular open-source DPO corpora. We leverage the Magpie framework to annotate each sample for task category, input quality, and preference reward, a reward-model-based signal that validates the preference order without relying on human annotations. This enables a scalable, fine-grained inspection of preference quality across datasets, revealing structural and qualitative discrepancies in reward margins. Building on these insights, we systematically curate a new DPO mixture, UltraMix, that draws selectively from all five corpora while removing noisy or redundant samples. UltraMix is 30% smaller than the best-performing individual dataset yet exceeds its performance across key benchmarks. We publicly release all annotations, metadata, and our curated mixture to facilitate future research in data-centric preference optimization.
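
For context on what a DPO dataset is optimized against, the sketch below computes the standard DPO loss for a single (chosen, rejected) pair from summed sequence log-probabilities under the policy and a frozen reference model. It illustrates the general DPO objective only; it is not the paper's annotation or curation pipeline, and the argument names and the default `beta` are assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair: -log(sigmoid(margin)).

    Each argument is the summed log-probability of a full completion.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities: the policy already favours the chosen completion.
print(dpo_loss(-10.0, -14.0, -12.0, -12.0))
```

A pair whose preference label is wrong pushes the policy in the wrong direction under this loss, which is one reason per-sample validation of the preference order matters for dataset quality.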

Data Poisoning Vulnerabilities Across Healthcare AI Architectures: A Security Threat Analysis

Authors: Farhad Abtahi, Fernando Seoane, Iván Pau, Mario Vega-Barbas
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.11020
Pdf link: https://arxiv.org/pdf/2511.11020
Abstract Healthcare AI systems face major vulnerabilities to data poisoning that current defenses and regulations cannot adequately address. We analyzed eight attack scenarios in four categories: architectural attacks on convolutional neural networks, large language models, and reinforcement learning agents; infrastructure attacks exploiting federated learning and medical documentation systems; critical resource allocation attacks affecting organ transplantation and crisis triage; and supply chain attacks targeting commercial foundation models. Our findings indicate that attackers with access to only 100-500 samples can compromise healthcare AI regardless of dataset size, often achieving over 60 percent success, with detection taking an estimated 6 to 12 months or sometimes not occurring at all. The distributed nature of healthcare infrastructure creates many entry points where insiders with routine access can launch attacks with limited technical skill. Privacy laws such as HIPAA and GDPR can unintentionally shield attackers by restricting the analyses needed for detection. Supply chain weaknesses allow a single compromised vendor to poison models across 50 to 200 institutions. The Medical Scribe Sybil scenario shows how coordinated fake patient visits can poison data through legitimate clinical workflows without requiring a system breach. Current regulations lack mandatory adversarial robustness testing, and federated learning can worsen risks by obscuring attribution. We recommend multilayer defenses including required adversarial testing, ensemble-based detection, privacy-preserving security mechanisms, and international coordination on AI security standards. We also question whether opaque black-box models are suitable for high-stakes clinical decisions, suggesting a shift toward interpretable systems with verifiable safety guarantees.

ARCTraj: A Dataset and Benchmark of Human Reasoning Trajectories for Abstract Problem Solving

Authors: Sejin Kim, Hayan Choi, Seokki Lee, Sundong Kim
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.11079
Pdf link: https://arxiv.org/pdf/2511.11079
Abstract We present ARCTraj, a dataset and methodological framework for modeling human reasoning through complex visual tasks in the Abstraction and Reasoning Corpus (ARC). While ARC has inspired extensive research on abstract reasoning, most existing approaches rely on static input-output supervision, which limits insight into how reasoning unfolds over time. ARCTraj addresses this gap by recording temporally ordered, object-level actions that capture how humans iteratively transform inputs into outputs, revealing intermediate reasoning steps that conventional datasets overlook. Collected via the O2ARC web interface, it contains around 10,000 trajectories annotated with task identifiers, timestamps, and success labels across 400 training tasks from the ARC-AGI-1 benchmark. It further defines a unified reasoning pipeline encompassing data collection, action abstraction, Markov decision process (MDP) formulation, and downstream learning, enabling integration with reinforcement learning, generative modeling, and sequence modeling methods such as PPO, World Models, GFlowNets, Diffusion agents, and Decision Transformers. Analyses of spatial selection, color attribution, and strategic convergence highlight the structure and diversity of human reasoning. Together, these contributions position ARCTraj as a structured and interpretable foundation for studying human-like reasoning, advancing explainability, alignment, and generalizable intelligence.

Scalable Population Training for Zero-Shot Coordination

Authors: Bingyu Hui, Lebin Yu, Quanming Yao, Yunpeng Qu, Xudong Zhang, Jian Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.11083
Pdf link: https://arxiv.org/pdf/2511.11083
Abstract Zero-shot coordination (ZSC) has become a hot topic in reinforcement learning research recently. It focuses on the generalization ability of agents, requiring them to coordinate well with collaborators that are not seen before without any fine-tuning. Population-based training has been proven to provide good zero-shot coordination performance; nevertheless, existing methods are limited by computational resources, mainly focusing on optimizing diversity in small populations while neglecting the potential performance gains from scaling population size. To address this issue, this paper proposes Scalable Population Training (ScaPT), an efficient training framework comprising two key components: a meta-agent that efficiently realizes a population by selectively sharing parameters across agents, and a mutual information regularizer that guarantees population diversity. To empirically validate the effectiveness of ScaPT, this paper evaluates it along with representative frameworks in Hanabi and confirms its superiority.

VIDEOP2R: Video Understanding from Perception to Reasoning

Authors: Yifan Jiang, Yueying Wang, Rui Zhao, Toufiq Parag, Zhimin Chen, Zhenyu Liao, Jayakrishnan Unnikrishnan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.11113
Pdf link: https://arxiv.org/pdf/2511.11113
Abstract Reinforcement fine-tuning (RFT), a two-stage framework consisting of supervised fine-tuning (SFT) and reinforcement learning (RL), has shown promising results on improving the reasoning ability of large language models (LLMs). Yet extending RFT to large video language models (LVLMs) remains challenging. We propose VideoP2R, a novel process-aware video RFT framework that enhances video reasoning by modeling perception and reasoning as distinct processes. In the SFT stage, we develop a three-step pipeline to generate VideoP2R-CoT-162K, a high-quality, process-aware chain-of-thought (CoT) dataset for perception and reasoning. In the RL stage, we introduce a novel process-aware group relative policy optimization (PA-GRPO) algorithm that supplies separate rewards for perception and reasoning. Extensive experiments show that VideoP2R achieves state-of-the-art (SotA) performance on six out of seven video reasoning and understanding benchmarks. Ablation studies further confirm the effectiveness of our process-aware modeling and PA-GRPO and demonstrate that the model's perception output is information-sufficient for downstream reasoning.
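
To make "separate rewards for perception and reasoning" concrete, the sketch below applies the usual GRPO group-relative normalization independently to two reward channels for a group of sampled responses to the same prompt. How PA-GRPO actually assigns these advantages to tokens is not described in the abstract; the per-channel normalization and both function names are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def process_aware_advantages(perception_rewards, reasoning_rewards):
    """Hypothetical two-channel variant: one advantage per process per response."""
    return (group_relative_advantages(perception_rewards),
            group_relative_advantages(reasoning_rewards))


# Hypothetical group of four responses scored on both channels.
perception_adv, reasoning_adv = process_aware_advantages(
    [1.0, 0.0, 1.0, 0.0],  # was the perception segment judged correct?
    [1.0, 1.0, 0.0, 0.0],  # was the final answer correct?
)
print(perception_adv)
print(reasoning_adv)
```

Keeping the channels separate means a response with accurate perception but a wrong answer can still receive a positive signal for its perception segment.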

LoRaCompass: Robust Reinforcement Learning to Efficiently Search for a LoRa Tag

Authors: Tianlang He, Zhongming Lin, Tianrui Jiang, S.-H. Gary Chan
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.11190
Pdf link: https://arxiv.org/pdf/2511.11190
Abstract The Long-Range (LoRa) protocol, known for its extensive range and low power, has increasingly been adopted in tags worn by mentally incapacitated persons (MIPs) and others at risk of going missing. We study the sequential decision-making process for a mobile sensor to locate a periodically broadcasting LoRa tag with the fewest moves (hops) in general, unknown environments, guided by the received signal strength indicator (RSSI). While existing methods leverage reinforcement learning for search, they remain vulnerable to domain shift and signal fluctuation, resulting in cascading decision errors that culminate in substantial localization inaccuracies. To bridge this gap, we propose LoRaCompass, a reinforcement learning model designed to achieve robust and efficient search for a LoRa tag. For exploitation under domain shift and signal fluctuation, LoRaCompass learns a robust spatial representation from RSSI to maximize the probability of moving closer to a tag, via a spatially-aware feature extractor and a policy distillation loss function. It further introduces an exploration function inspired by the upper confidence bound (UCB) that guides the sensor toward the tag with increasing confidence. We have validated LoRaCompass in ground-based and drone-assisted scenarios within diverse unseen environments covering an area of over 80 km². It has demonstrated a high success rate (>90%) in locating the tag within 100 m proximity (a 40% improvement over existing methods) and high efficiency with a search path length (in hops) that scales linearly with the initial distance.
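
The exploration function is described only as "inspired by" the upper confidence bound. For reference, here is the classic UCB1 selection rule applied to a small set of candidate moves; treating each move direction as a bandit arm, and the name `ucb1_select`, are assumptions for illustration rather than the paper's formulation.

```python
import math


def ucb1_select(mean_rewards, counts, c=math.sqrt(2)):
    """Pick the arm maximizing mean + c * sqrt(ln(total pulls) / pulls of arm)."""
    # Try every arm once before applying the confidence bonus.
    for arm, pulls in enumerate(counts):
        if pulls == 0:
            return arm
    total_pulls = sum(counts)
    scores = [
        mean + c * math.sqrt(math.log(total_pulls) / pulls)
        for mean, pulls in zip(mean_rewards, counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical arms: four move directions with the observed average RSSI gain so far.
print(ucb1_select([0.2, 0.5, 0.1, 0.4], [12, 10, 3, 11]))
```

The bonus term shrinks as a direction is tried more often, so the rule gradually shifts from exploring uncertain directions to exploiting the best-looking one.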

Sashimi-Bot: Autonomous Tri-manual Advanced Manipulation and Cutting of Deformable Objects

Authors: Sverre Herland, Amit Parag, Elling Ruud Øye, Fangyi Zhang, Fouad Makiyeh, Aleksander Lillienskiold, Abhaya Pal Singh, Edward H. Adelson, Francois Chaumette, Alexandre Krupa, Peter Corke, Ekrem Misimi
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.11223
Pdf link: https://arxiv.org/pdf/2511.11223
Abstract Advanced robotic manipulation of deformable, volumetric objects remains one of the greatest challenges due to their pliancy, frailness, variability, and uncertainties during interaction. Motivated by these challenges, this article introduces Sashimi-Bot, an autonomous multi-robotic system for advanced manipulation and cutting, specifically the preparation of sashimi. The objects that we manipulate, salmon loins, are natural in origin and vary in size and shape; they are limp and deformable with poorly characterized elastoplastic parameters, while also being slippery and hard to hold. The three robots straighten the loin; grasp and hold the knife; cut with the knife in a slicing motion while cooperatively stabilizing the loin during cutting; and pick up the thin slices from the cutting board or knife blade. Our system combines deep reinforcement learning with in-hand tool shape manipulation, in-hand tool cutting, and feedback of visual and tactile information to achieve robustness to the variabilities inherent in this task. This work represents a milestone in robotic manipulation of deformable, volumetric objects that may inspire and enable a wide range of other real-world applications.

STaR: Towards Cognitive Table Reasoning via Slow-Thinking Large Language Models

Authors: Huajian Zhang, Mingyue Cheng, Yucong Luo, Xiaoyu Tao
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.11233
Pdf link: https://arxiv.org/pdf/2511.11233
Abstract Table reasoning with large language models (LLMs) is a fundamental path toward building intelligent systems that can understand and analyze structured data. While recent progress has shown promising results, existing methods still suffer from two key limitations: (i) the reasoning processes lack the depth and iterative refinement characteristic of human cognition; and (ii) the reasoning processes exhibit instability, which compromises their reliability in downstream applications. In this work, we present STaR (slow-thinking for table reasoning), a new framework achieving cognitive table reasoning, in which LLMs are equipped with slow-thinking capabilities by explicitly modeling step-by-step thinking and uncertainty-aware inference. During training, STaR employs two-stage difficulty-aware reinforcement learning (DRL), progressively learning from simple to complex queries under a composite reward. During inference, STaR performs trajectory-level uncertainty quantification by integrating token-level confidence and answer consistency, enabling selection of more credible reasoning paths. Extensive experiments on benchmarks demonstrate that STaR achieves superior performance and enhanced reasoning stability. Moreover, strong generalization over out-of-domain datasets further demonstrates STaR's potential as a reliable and cognitively inspired solution for table reasoning with LLMs.
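
The inference-time step combines token-level confidence with answer consistency across sampled reasoning paths. The abstract does not give the combination rule, so the sketch below assumes a simple weighted sum: confidence is the geometric-mean token probability of a trajectory, and consistency is the share of sampled trajectories that reach the same final answer. The function name and the weight `alpha` are hypothetical.

```python
import math
from collections import Counter


def select_trajectory(trajectories, alpha=0.5):
    """Return (answer, score) for the most credible sampled trajectory.

    trajectories: non-empty list of (final_answer, token_logprobs) pairs,
                  where each token_logprobs list is itself non-empty.
    """
    answer_counts = Counter(answer for answer, _ in trajectories)
    total = len(trajectories)
    best_answer, best_score = None, float("-inf")
    for answer, token_logprobs in trajectories:
        # Geometric-mean token probability as a length-normalized confidence.
        confidence = math.exp(sum(token_logprobs) / len(token_logprobs))
        # Fraction of samples agreeing on this final answer.
        consistency = answer_counts[answer] / total
        score = alpha * confidence + (1 - alpha) * consistency
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer, best_score


# Hypothetical samples: two paths agree on "42", one confident path says "7".
samples = [("42", [-0.1, -0.2]), ("42", [-0.3, -0.1]), ("7", [-0.05, -0.05])]
print(select_trajectory(samples))
```

In this example the lone high-confidence path loses to the answer that two paths agree on, which is the kind of trade-off a consistency term is meant to capture.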

RLSLM: A Hybrid Reinforcement Learning Framework Aligning Rule-Based Social Locomotion Model with Human Social Norms

Authors: Yitian Kou, Yihe Gu, Chen Zhou, DanDan Zhu, Shuguang Kuai
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.11323
Pdf link: https://arxiv.org/pdf/2511.11323
Abstract Navigating human-populated environments without causing discomfort is a critical capability for socially-aware agents. While rule-based approaches offer interpretability through predefined psychological principles, they often lack generalizability and flexibility. Conversely, data-driven methods can learn complex behaviors from large-scale datasets, but are typically inefficient, opaque, and difficult to align with human intuitions. To bridge this gap, we propose RLSLM, a hybrid Reinforcement Learning framework that integrates a rule-based Social Locomotion Model, grounded in empirical behavioral experiments, into the reward function of a reinforcement learning framework. The social locomotion model generates an orientation-sensitive social comfort field that quantifies human comfort across space, enabling socially aligned navigation policies with minimal training. RLSLM then jointly optimizes mechanical energy and social comfort, allowing agents to avoid intrusions into personal or group space. A human-agent interaction experiment using an immersive VR-based setup demonstrates that RLSLM outperforms state-of-the-art rule-based models in user experience. Ablation and sensitivity analyses further show the model's significantly improved interpretability over conventional data-driven methods. This work presents a scalable, human-centered methodology that effectively integrates cognitive science and machine learning for real-world social navigation.

MarsRL: Advancing Multi-Agent Reasoning System via Reinforcement Learning with Agentic Pipeline Parallelism

Authors: Shulin Liu, Dong Du, Tao Yang, Yang Li, Boyu Qiu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.11373
Pdf link: https://arxiv.org/pdf/2511.11373
Abstract Recent progress in large language models (LLMs) has been propelled by reinforcement learning with verifiable rewards (RLVR) and test-time scaling. However, the limited output length of LLMs constrains the depth of reasoning attainable in a single inference process. Multi-agent reasoning systems offer a promising alternative by employing multiple agents including Solver, Verifier, and Corrector, to iteratively refine solutions. While effective in closed-source models like Gemini 2.5 Pro, they struggle to generalize to open-source models due to insufficient critic and correction capabilities. To address this, we propose MarsRL, a novel reinforcement learning framework with agentic pipeline parallelism, designed to jointly optimize all agents in the system. MarsRL introduces agent-specific reward mechanisms to mitigate reward noise and employs pipeline-inspired training to enhance efficiency in handling long trajectories. Applied to Qwen3-30B-A3B-Thinking-2507, MarsRL improves AIME2025 accuracy from 86.5% to 93.3% and BeyondAIME from 64.9% to 73.8%, even surpassing Qwen3-235B-A22B-Thinking-2507. These findings highlight the potential of MarsRL to advance multi-agent reasoning systems and broaden their applicability across diverse reasoning tasks.

Robust and Efficient Communication in Multi-Agent Reinforcement Learning

Authors: Zejiao Liu, Yi Li, Jiali Wang, Junqi Tu, Yitian Hong, Fangfei Li, Yang Liu, Toshiharu Sugawara, Yang Tang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.11393
Pdf link: https://arxiv.org/pdf/2511.11393
Abstract Multi-agent reinforcement learning (MARL) has made significant strides in enabling coordinated behaviors among autonomous agents. However, most existing approaches assume that communication is instantaneous, reliable, and has unlimited bandwidth; these conditions are rarely met in real-world deployments. This survey systematically reviews recent advances in robust and efficient communication strategies for MARL under realistic constraints, including message perturbations, transmission delays, and limited bandwidth. Furthermore, because the challenges of low-latency reliability, bandwidth-intensive data sharing, and communication-privacy trade-offs are central to practical MARL systems, we focus on three applications involving cooperative autonomous driving, distributed simultaneous localization and mapping, and federated learning. Finally, we identify key open challenges and future research directions, advocating a unified approach that co-designs communication, learning, and robustness to bridge the gap between theoretical MARL models and practical implementations.
中文摘要 多智能体强化学习（MARL）在实现自主智能体的协调行为方面取得了显著进展。然而，大多数现有方法假设通信是即时、可靠且带宽无限的；这些条件在实际部署中很少得到满足。本综述系统回顾了在现实约束条件（包括消息扰动、传输延迟和有限带宽）下，MARL 稳健高效通信策略的最新进展。此外，由于低延迟可靠性、带宽密集型数据共享以及通信与隐私的权衡是实际 MARL 系统面临的核心挑战，我们重点关注三类应用：协作式自动驾驶、分布式同步定位与建图以及联邦学习。最后，我们指出了关键的开放性挑战和未来研究方向，倡导采用一种对通信、学习和稳健性进行协同设计的统一方法，以弥合理论 MARL 模型与实际实现之间的鸿沟。

Multi-Phase Spacecraft Trajectory Optimization via Transformer-Based Reinforcement Learning

通过基于 Transformer 的强化学习实现多阶段航天器轨迹优化

Authors: Amit Jain, Victor Rodriguez-Fernandez, Richard Linares
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.11402
Pdf link: https://arxiv.org/pdf/2511.11402
Abstract Autonomous spacecraft control for mission phases such as launch, ascent, stage separation, and orbit insertion remains a critical challenge due to the need for adaptive policies that generalize across dynamically distinct regimes. While reinforcement learning (RL) has shown promise in individual astrodynamics tasks, existing approaches often require separate policies for distinct mission phases, limiting adaptability and increasing operational complexity. This work introduces a transformer-based RL framework that unifies multi-phase trajectory optimization through a single policy architecture, leveraging the transformer's inherent capacity to model extended temporal contexts. Building on proximal policy optimization (PPO), our framework replaces conventional recurrent networks with a transformer encoder-decoder structure, enabling the agent to maintain coherent memory across mission phases spanning seconds to minutes during critical operations. By integrating a Gated Transformer-XL (GTrXL) architecture, the framework eliminates manual phase transitions while maintaining stability in control decisions. We validate our approach progressively: first demonstrating near-optimal performance on single-phase benchmarks (double integrator and Van der Pol oscillator), then extending to multiphase waypoint navigation variants, and finally tackling a complex multiphase rocket ascent problem that includes atmospheric flight, stage separation, and vacuum operations. Results demonstrate that the transformer-based framework not only matches analytical solutions in simple cases but also effectively learns coherent control policies across dynamically distinct regimes, establishing a foundation for scalable autonomous mission planning that reduces reliance on phase-specific controllers while maintaining compatibility with safety-critical verification protocols.
中文摘要 发射、上升、级间分离和入轨等任务阶段的自主航天器控制仍是一项关键挑战，因为需要能够在动力学特性迥异的阶段之间泛化的自适应策略。虽然强化学习（RL）在单个天体动力学任务中展现出潜力，但现有方法通常需要针对不同任务阶段使用独立的策略，限制了适应性并增加了操作复杂度。本研究引入了基于 Transformer 的强化学习框架，通过单一策略架构统一多阶段轨迹优化，利用 Transformer 对长时间上下文建模的内在能力。在近端策略优化（PPO）的基础上，我们的框架用 Transformer 编码器-解码器结构取代了传统的循环网络，使智能体能够在关键操作期间跨越持续数秒到数分钟的任务阶段保持连贯的记忆。通过集成 Gated Transformer-XL（GTrXL）架构，该框架消除了手动阶段切换，同时保持控制决策的稳定性。我们逐步验证该方法：首先在单阶段基准问题（双积分器和范德波尔振荡器）上展示接近最优的性能，然后扩展到多阶段航点导航变体，最后处理包含大气层内飞行、级间分离和真空操作的复杂多阶段火箭上升问题。结果表明，基于 Transformer 的框架不仅能在简单情况下与解析解相符，还能有效学习跨动力学特性迥异阶段的连贯控制策略，为可扩展的自主任务规划奠定基础，在减少对阶段专用控制器依赖的同时，保持与安全关键验证协议的兼容性。

Context-aware Adaptive Visualizations for Critical Decision Making

关键决策的上下文感知自适应可视化

Authors: Angela Lopez-Cardona, Mireia Masias Bruns, Nuwan T. Attygalle, Sebastian Idesis, Matteo Salvatori, Konstantinos Raftopoulos, Konstantinos Oikonomou, Saravanakumar Duraisamy, Parvin Emami, Nacera Latreche, Alaa Eddine Anis Sahraoui, Michalis Vakallelis, Jean Vanderdonckt, Ioannis Arapakis, Luis A. Leiva
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.11476
Pdf link: https://arxiv.org/pdf/2511.11476
Abstract Effective decision-making often relies on timely insights from complex visual data. While Information Visualization (InfoVis) dashboards can support this process, they rarely adapt to users' cognitive state, and less so in real time. We present Symbiotik, an intelligent, context-aware adaptive visualization system that leverages neurophysiological signals to estimate mental workload (MWL) and dynamically adapt visual dashboards using reinforcement learning (RL). Through a user study with 120 participants and three visualization types, we demonstrate that our approach improves task performance and engagement. Symbiotik offers a scalable, real-time adaptation architecture, and a validated methodology for neuroadaptive user interfaces.
中文摘要 有效的决策往往依赖于对复杂视觉数据的及时洞察。虽然信息可视化（InfoVis）仪表盘可以支持这一过程，但它们很少能适应用户的认知状态，更不会实时适应。我们介绍Symbiotik，这是一款智能、情境感知的自适应可视化系统，利用神经生理信号估算心理负荷（MWL），并通过强化学习（RL）动态调整视觉仪表盘。通过一项包含120名参与者和三种可视化类型的用户研究，我们证明了我们的方法能够提升任务的表现和参与度。Symbiotik提供可扩展的实时适应架构，以及经过验证的神经适应用户界面方法论。

Honesty over Accuracy: Trustworthy Language Models through Reinforced Hesitation

诚实胜于准确：通过强化犹豫构建可信语言模型

Authors: Mohamad Amin Mohamadi, Tianhao Wang, Zhiyuan Li
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.11500
Pdf link: https://arxiv.org/pdf/2511.11500
Abstract Modern language models fail a fundamental requirement of trustworthy intelligence: knowing when not to answer. Despite achieving impressive accuracy on benchmarks, these models produce confident hallucinations, even when wrong answers carry catastrophic consequences. Our evaluations on GSM8K, MedQA and GPQA show frontier models almost never abstain despite explicit warnings of severe penalties, suggesting that prompts cannot override training that rewards any answer over no answer. As a remedy, we propose Reinforced Hesitation (RH): a modification to Reinforcement Learning from Verifiable Rewards (RLVR) to use ternary rewards (+1 correct, 0 abstention, -$\lambda$ error) instead of binary. Controlled experiments on logic puzzles reveal that varying $\lambda$ produces distinct models along a Pareto frontier, where each training penalty yields the optimal model for its corresponding risk regime: low penalties produce aggressive answerers, high penalties conservative abstainers. We then introduce two inference strategies that exploit trained abstention as a coordination signal: cascading routes queries through models with decreasing risk tolerance, while self-cascading re-queries the same model on abstention. Both outperform majority voting with lower computational cost. These results establish abstention as a first-class training objective that transforms "I don't know" from failure into a coordination signal, enabling models to earn trust through calibrated honesty about their limits.
中文摘要 现代语言模型未能满足可信智能的一项基本要求：知道何时不回答。尽管在基准测试中取得了令人印象深刻的准确率，这些模型仍会产生自信的幻觉，即使错误答案可能带来灾难性后果。我们在 GSM8K、MedQA 和 GPQA 上的评估显示，尽管明确警告了严厉的惩罚，前沿模型几乎从不弃权，这表明提示无法覆盖那种奖励“任何回答都胜过不回答”的训练。作为补救，我们提出强化犹豫（RH）：对可验证奖励强化学习（RLVR）的一种修改，用三元奖励（+1 正确，0 弃权，-$\lambda$ 错误）代替二元奖励。在逻辑谜题上的受控实验显示，改变 $\lambda$ 会沿帕累托前沿产生不同的模型，每个训练惩罚值都产生其对应风险情形下的最优模型：低惩罚产生激进的回答者，高惩罚产生保守的弃权者。随后，我们引入两种推理策略，利用训练得到的弃权行为作为协调信号：级联（cascading）将查询依次交给风险容忍度递减的模型，而自级联（self-cascading）则在弃权时重新查询同一模型。两者都优于多数投票，且计算成本更低。这些结果确立了弃权作为一等训练目标的地位，将“我不知道”从失败转化为协调信号，使模型能够通过对自身局限的校准式诚实来赢得信任。
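
To make the ternary reward concrete, the sketch below encodes the three values stated in the abstract (+1 correct, 0 abstention, -$\lambda$ error). The break-even rule is not quoted from the paper; it follows from those values under the assumption that a model answers only when the expected reward of answering, p - (1 - p)λ, exceeds the abstention reward of 0, which gives p > λ / (1 + λ). The function names are hypothetical.

```python
def ternary_reward(outcome: str, lam: float) -> float:
    """Reward values as stated in the abstract: +1, 0, or -lambda."""
    if outcome == "correct":
        return 1.0
    if outcome == "abstain":
        return 0.0
    if outcome == "error":
        return -lam
    raise ValueError(f"unknown outcome: {outcome!r}")

def answer_threshold(lam: float) -> float:
    """Confidence above which answering beats abstaining in expectation.

    Derived here (an assumption, not a formula from the paper):
    p * 1 + (1 - p) * (-lam) > 0  <=>  p > lam / (1 + lam).
    """
    if lam < 0:
        raise ValueError("lam must be non-negative")
    return lam / (1.0 + lam)

def should_answer(p_correct: float, lam: float) -> bool:
    """Answer only if the expected reward is strictly better than abstaining."""
    return p_correct > answer_threshold(lam)

low_penalty = answer_threshold(1.0)   # break-even confidence with lambda = 1
high_penalty = answer_threshold(3.0)  # break-even confidence with lambda = 3
```

Under this reading, a model that is 60% confident answers when λ = 1 and abstains when λ = 3, which matches the abstract's description of low penalties producing aggressive answerers and high penalties producing conservative abstainers.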

W2S-AlignTree: Weak-to-Strong Inference-Time Alignment for Large Language Models via Monte Carlo Tree Search

W2S-AlignTree：通过蒙特卡洛树搜索实现大型语言模型的弱到强推理时对齐

Authors: Zhenyu Ding, Yuhao Wang, Tengyue Xiao, Haoying Wang, Guojun Ma, Mingyang Wan, Caigui Jiang, Ning Ding
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.11518
Pdf link: https://arxiv.org/pdf/2511.11518
Abstract Large Language Models (LLMs) demonstrate impressive capabilities, yet their outputs often suffer from misalignment with human preferences due to the inadequacy of weak supervision and a lack of fine-grained control. Training-time alignment methods like Reinforcement Learning from Human Feedback (RLHF) face prohibitive costs in expert supervision and inherent scalability limitations, offering limited dynamic control during inference. Consequently, there is an urgent need for scalable and adaptable alignment mechanisms. To address this, we propose W2S-AlignTree, a pioneering plug-and-play inference-time alignment framework that synergistically combines Monte Carlo Tree Search (MCTS) with the Weak-to-Strong Generalization paradigm for the first time. W2S-AlignTree formulates LLM alignment as an optimal heuristic search problem within a generative search tree. By leveraging the weak model's real-time, step-level signals as alignment proxies and introducing an Entropy-Aware exploration mechanism, W2S-AlignTree enables fine-grained guidance during the strong model's generation without modifying its parameters. The approach dynamically balances exploration and exploitation in high-dimensional generation search trees. Experiments across controlled sentiment generation, summarization, and instruction-following show that W2S-AlignTree consistently outperforms strong baselines. Notably, W2S-AlignTree raises the performance of Llama3-8B from 1.89 to 2.19, a relative improvement of 15.9% on the summarization task.
中文摘要 大型语言模型（LLM）展现了令人印象深刻的能力，但由于弱监督的不足和缺乏细粒度控制，其输出常常与人类偏好不一致。基于人类反馈的强化学习（RLHF）等训练时对齐方法面临高昂的专家监督成本和固有的可扩展性限制，且在推理过程中提供的动态控制有限。因此，迫切需要可扩展且适应性强的对齐机制。为此，我们提出了 W2S-AlignTree，一种开创性的即插即用推理时对齐框架，首次将蒙特卡洛树搜索（MCTS）与弱到强泛化范式协同结合。W2S-AlignTree 将 LLM 对齐表述为生成式搜索树中的最优启发式搜索问题。通过利用弱模型的实时步骤级信号作为对齐代理，并引入熵感知的探索机制，W2S-AlignTree 在强模型生成过程中实现了细粒度的引导，而无需修改其参数。该方法在高维生成搜索树中动态平衡探索与利用。在受控情感生成、摘要和指令遵循上的实验表明，W2S-AlignTree 始终优于强基线。值得注意的是，W2S-AlignTree 将 Llama3-8B 在摘要任务上的性能从 1.89 提升至 2.19，相对提升 15.9%。
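
As a quick check on the summarization figures, the relative improvement can be recomputed from the two scores given in the abstract (1.89 and 2.19). The helper name is hypothetical. The result suggests the reported 15.9 is a percentage.

```python
def relative_improvement(before: float, after: float) -> float:
    """Relative change from `before` to `after`, expressed in percent."""
    if before == 0:
        raise ValueError("before must be non-zero")
    return (after - before) / before * 100.0

# Scores reported in the abstract for Llama3-8B on summarization.
gain = relative_improvement(1.89, 2.19)
print(f"{gain:.1f}%")
```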

Aligning Machiavellian Agents: Behavior Steering via Test-Time Policy Shaping

马基雅维利式代理的对齐：通过测试时策略塑造实现行为引导

Authors: Dena Mujtaba, Brian Hu, Anthony Hoogs, Arslan Basharat
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.11551
Pdf link: https://arxiv.org/pdf/2511.11551
Abstract The deployment of decision-making AI agents presents a critical challenge in maintaining alignment with human values or guidelines while operating in complex, dynamic environments. Agents trained solely to achieve their objectives may adopt harmful behavior, exposing a key trade-off between maximizing the reward function and maintaining the alignment. For the pre-trained agents, ensuring alignment is particularly challenging, as retraining can be a costly and slow process. This is further complicated by the diverse and potentially conflicting attributes representing the ethical values for alignment. To address these challenges, we propose a test-time alignment technique based on model-guided policy shaping. Our method allows precise control over individual behavioral attributes, generalizes across diverse reinforcement learning (RL) environments, and facilitates a principled trade-off between ethical alignment and reward maximization without requiring agent retraining. We evaluate our approach using the MACHIAVELLI benchmark, which comprises 134 text-based game environments and thousands of annotated scenarios involving ethical decisions. The RL agents are first trained to maximize the reward in their respective games. At test time, we apply policy shaping via scenario-action attribute classifiers to ensure decision alignment with ethical attributes. We compare our approach against prior training-time methods and general-purpose agents, as well as study several types of ethical violations and power-seeking behavior. Our results demonstrate that test-time policy shaping provides an effective and scalable solution for mitigating unethical behavior across diverse environments and alignment attributes.
中文摘要 部署决策型 AI 智能体面临一项关键挑战：在复杂、动态的环境中运行时，保持与人类价值观或准则的一致。仅为实现自身目标而训练的智能体可能会采取有害行为，这暴露出最大化奖励函数与保持对齐之间的关键权衡。对于预训练的智能体，确保对齐尤其具有挑战性，因为重新训练可能成本高昂且耗时。代表伦理价值观的对齐属性多种多样且可能相互冲突，这使问题更加复杂。为应对这些挑战，我们提出了一种基于模型引导策略塑造的测试时对齐技术。我们的方法可以精确控制各个行为属性，能够泛化到多样化的强化学习（RL）环境，并在无需重新训练智能体的情况下，在伦理对齐与奖励最大化之间实现有原则的权衡。我们使用 MACHIAVELLI 基准评估该方法，该基准包含 134 个基于文本的游戏环境和数千个涉及伦理决策的带标注场景。强化学习智能体首先经过训练，以最大化各自游戏中的奖励。在测试时，我们通过场景-动作属性分类器进行策略塑造，以确保决策与伦理属性保持一致。我们将该方法与以往的训练时方法和通用智能体进行比较，并研究了多种类型的伦理违规和权力追逐行为。我们的结果表明，测试时策略塑造为缓解不同环境和对齐属性下的不道德行为提供了有效且可扩展的解决方案。
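
One simple way to picture test-time policy shaping is to blend the trained agent's action distribution with scores from an attribute classifier. The sketch below is an assumption for illustration only: the linear interpolation, the `alpha` weight, and the function names are hypothetical, and the abstract does not specify the paper's actual shaping rule.

```python
import math
from typing import List

def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of action logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def shape_policy(logits: List[float],
                 attr_scores: List[float],
                 alpha: float) -> List[float]:
    """Blend the agent's policy with classifier scores (hypothetical rule).

    logits:      action logits from the reward-trained RL agent.
    attr_scores: per-action scores in [0, 1] from an attribute classifier,
                 where higher means better aligned with the target attribute.
    alpha:       0 keeps the original policy, 1 follows the classifier only.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if len(logits) != len(attr_scores):
        raise ValueError("logits and attr_scores must have the same length")
    base = softmax(logits)
    score_sum = sum(attr_scores)
    if score_sum <= 0.0:
        # No action is rated as aligned: fall back to the unshaped policy.
        return base
    attr = [s / score_sum for s in attr_scores]
    return [(1.0 - alpha) * b + alpha * a for b, a in zip(base, attr)]

# The agent prefers action 0, but the classifier rates action 1 as more ethical.
logits = [2.0, 0.0]
scores = [0.1, 0.9]
unshaped = shape_policy(logits, scores, alpha=0.0)
shaped = shape_policy(logits, scores, alpha=1.0)
```

Because only the action distribution is adjusted at decision time, the agent's weights stay untouched, and sweeping `alpha` between 0 and 1 trades reward-seeking behavior against the classifier's preference.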

Keyword: diffusion policy

No results.