Arxiv Papers of Today

Generated: 2025-11-12 16:31:12 (UTC+8); arXiv release time: 2025-11-12 20:00 EST (2025-11-13 09:00 UTC+8)

39 relevant papers today

Keyword: reinforcement learning

Towards Affordable, Adaptive and Automatic GNN Training on CPU-GPU Heterogeneous Platforms

Authors: Tong Qiao, Ao Zhou, Yingjie Qi, Yiou Wang, Han Wan, Jianlei Yang, Chunming Hu
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2511.07421
Pdf link: https://arxiv.org/pdf/2511.07421
Abstract Graph Neural Networks (GNNs) have been widely adopted due to their strong performance. However, GNN training often relies on expensive, high-performance computing platforms, limiting accessibility for many tasks. Profiling of representative GNN workloads indicates that substantial efficiency gains are possible on resource-constrained devices by fully exploiting available resources. This paper introduces A3GNN, a framework for affordable, adaptive, and automatic GNN training on heterogeneous CPU-GPU platforms. It improves resource usage through locality-aware sampling and fine-grained parallelism scheduling. Moreover, it leverages reinforcement learning to explore the design space and achieve pareto-optimal trade-offs among throughput, memory footprint, and accuracy. Experiments show that A3GNN can bridge the performance gap, allowing seven Nvidia 2080Ti GPUs to outperform two A100 GPUs by up to 1.8X in throughput with minimal accuracy loss.
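A minimal sketch of the Pareto-optimality notion the abstract refers to, not A3GNN's implementation. The `Config` tuple, its units, and the function names are hypothetical; the abstract does not say how candidate configurations are represented or how the reinforcement learning search proposes them.

```python
from typing import List, Tuple

# Hypothetical candidate: (throughput, memory_gb, accuracy).
# Higher throughput and accuracy are better; lower memory is better.
Config = Tuple[float, float, float]


def dominates(a: Config, b: Config) -> bool:
    """True if a is no worse than b on every objective and strictly better on one."""
    no_worse = a[0] >= b[0] and a[1] <= b[1] and a[2] >= b[2]
    strictly_better = a[0] > b[0] or a[1] < b[1] or a[2] > b[2]
    return no_worse and strictly_better


def pareto_front(configs: List[Config]) -> List[Config]:
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


candidates = [(100.0, 8.0, 0.90), (120.0, 8.0, 0.90), (90.0, 6.0, 0.91), (80.0, 9.0, 0.89)]
front = pareto_front(candidates)
```

Here the second and third candidates survive: one is fastest, the other uses the least memory with the best accuracy, and neither dominates the other.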

RELEAP: Reinforcement-Enhanced Label-Efficient Active Phenotyping for Electronic Health Records

Authors: Yang Yang (1), Kathryn Pollak (2,3), Bibhas Chakraborty (1,4,5,6), Molei Liu (7,8), Doudou Zhou (6), Chuan Hong (1) ((1) Department of Biostatistics and Bioinformatics, Duke University, Durham, USA, (2) Duke Cancer Institute, Durham, USA, (3) Department of Population Health Sciences, Duke University School of Medicine, Durham, USA, (4) Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore, (5) Programme in Health Services and Systems Research, Duke-NUS Medical School, Singapore, (6) Department of Statistics and Data Science, National University of Singapore, Singapore, (7) Department of Biostatistics, Peking University Health Science Center, Beijing, China, (8) Beijing International Center for Mathematical Research, Peking University, Beijing, China)
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2511.07473
Pdf link: https://arxiv.org/pdf/2511.07473
Abstract Objective: Electronic health record (EHR) phenotyping often relies on noisy proxy labels, which undermine the reliability of downstream risk prediction. Active learning can reduce annotation costs, but most rely on fixed heuristics and do not ensure that phenotype refinement improves prediction performance. Our goal was to develop a framework that directly uses downstream prediction performance as feedback to guide phenotype correction and sample selection under constrained labeling budgets. Materials and Methods: We propose Reinforcement-Enhanced Label-Efficient Active Phenotyping (RELEAP), a reinforcement learning-based active learning framework. RELEAP adaptively integrates multiple querying strategies and, unlike prior methods, updates its policy based on feedback from downstream models. We evaluated RELEAP on a de-identified Duke University Health System (DUHS) cohort (2014-2024) for incident lung cancer risk prediction, using logistic regression and penalized Cox survival models. Performance was benchmarked against noisy-label baselines and single-strategy active learning. Results: RELEAP consistently outperformed all baselines. Logistic AUC increased from 0.774 to 0.805 and survival C-index from 0.718 to 0.752. Using downstream performance as feedback, RELEAP produced smoother and more stable gains than heuristic methods under the same labeling budget. Discussion: By linking phenotype refinement to prediction outcomes, RELEAP learns which samples most improve downstream discrimination and calibration, offering a more principled alternative to fixed active learning rules. Conclusion: RELEAP optimizes phenotype correction through downstream feedback, offering a scalable, label-efficient paradigm that reduces manual chart review and enhances the reliability of EHR-based risk prediction.

The Polite Liar: Epistemic Pathology in Language Models

Authors: Bentley DeVilling (Course Correct Labs)
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.07477
Pdf link: https://arxiv.org/pdf/2511.07477
Abstract Large language models exhibit a peculiar epistemic pathology: they speak as if they know, even when they do not. This paper argues that such confident fabrication, what I call the polite liar, is a structural consequence of reinforcement learning from human feedback (RLHF). Building on Frankfurt's analysis of bullshit as communicative indifference to truth, I show that this pathology is not deception but structural indifference: a reward architecture that optimizes for perceived sincerity over evidential accuracy. Current alignment methods reward models for being helpful, harmless, and polite, but not for being epistemically grounded. As a result, systems learn to maximize user satisfaction rather than truth, performing conversational fluency as a virtue. I analyze this behavior through the lenses of epistemic virtue theory, speech-act philosophy, and cognitive alignment, showing that RLHF produces agents trained to mimic epistemic confidence without access to epistemic justification. The polite liar thus reveals a deeper alignment tension between linguistic cooperation and epistemic integrity. The paper concludes with an "epistemic alignment" principle: reward justified confidence over perceived fluency.

Beyond Correctness: Confidence-Aware Reward Modeling for Enhancing Large Language Model Reasoning

Authors: Qianxi He, Qingyu Ren, Shanzhe Lei, Xuhong Wang, Yingchun Wang
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.07483
Pdf link: https://arxiv.org/pdf/2511.07483
Abstract Recent advancements in large language models (LLMs) have shifted the post-training paradigm from traditional instruction tuning and human preference alignment toward reinforcement learning (RL) focused on reasoning capabilities. However, numerous technical reports indicate that purely rule-based reward RL frequently results in poor-quality reasoning chains or inconsistencies between reasoning processes and final answers, particularly when the base model is of smaller scale. During the RL exploration process, models might employ low-quality reasoning chains due to the lack of knowledge, occasionally producing correct answers randomly and receiving rewards based on established rule-based judges. This constrains the potential for resource-limited organizations to conduct direct reinforcement learning training on smaller-scale models. We propose a novel confidence-based reward model tailored for enhancing STEM reasoning capabilities. Unlike conventional approaches, our model penalizes not only incorrect answers but also low-confidence correct responses, thereby promoting more robust and logically consistent reasoning. We validate the effectiveness of our approach through static evaluations, Best-of-N inference tests, and PPO-based RL training. Our method outperforms several state-of-the-art open-source reward models across diverse STEM benchmarks. We release our codes and model in this https URL.
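The scoring idea in the abstract (penalize wrong answers and also correct answers given with low confidence) can be illustrated with a hand-written rule. This is only a sketch: the paper trains a reward model, whereas the function below is a fixed rule, and the threshold and penalty values are assumptions chosen for illustration.

```python
def confidence_aware_reward(
    is_correct: bool,
    confidence: float,
    threshold: float = 0.5,         # assumed cut-off for "low confidence"
    low_conf_penalty: float = 0.5,  # assumed penalty size
) -> float:
    """Toy reward: full credit only for correct answers given with enough confidence."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    if not is_correct:
        return -1.0               # wrong answers are always penalized
    if confidence < threshold:
        return -low_conf_penalty  # a lucky guess is not rewarded
    return 1.0
```

Under a purely rule-based judge, the second branch would return the same reward as the third, which is the failure mode the abstract describes for small base models.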

Think Before You Retrieve: Learning Test-Time Adaptive Search with Small Language Models

Authors: Supriti Vijay, Aman Priyanshu, Anu Vellore, Baturay Saglam, Amin Karbasi
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2511.07581
Pdf link: https://arxiv.org/pdf/2511.07581
Abstract Effective information retrieval requires reasoning over partial evidence and refining strategies as information emerges. Yet current approaches fall short: neural retrievers lack reasoning capabilities, large language models (LLMs) provide semantic depth but at prohibitive cost, and query rewriting or decomposition limits improvement to static transformations. As a result, existing methods fail to capture the iterative dynamics of exploration, feedback, and revision that complex user queries demand. We introduce Orion, a training framework that enables compact models (350M-1.2B parameters) to perform iterative retrieval through learned search strategies. Orion combines: (1) synthetic trajectory generation and supervised fine-tuning to encourage diverse exploration patterns in models, (2) reinforcement learning (RL) that rewards effective query refinement and backtracking behaviors, and (3) inference-time beam search algorithms that exploit the self-reflection capabilities learned during RL. Despite using only 3% of the training data available, our 1.2B model achieves 77.6% success on SciFact (vs. 72.6% for prior retrievers), 25.2% on BRIGHT (vs. 22.1%), 63.2% on NFCorpus (vs. 57.8%), and remains competitive on FEVER, HotpotQA, and MSMarco. It outperforms retrievers up to 200-400x larger on five of six benchmarks. These findings suggest that retrieval performance can emerge from learned strategies, not just model scale, when models are trained to search, reflect, and revise.

Partial Action Replacement: Tackling Distribution Shift in Offline MARL

Authors: Yue Jin, Giovanni Montana
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2511.07629
Pdf link: https://arxiv.org/pdf/2511.07629
Abstract Offline multi-agent reinforcement learning (MARL) is severely hampered by the challenge of evaluating out-of-distribution (OOD) joint actions. Our core finding is that when the behavior policy is factorized - a common scenario where agents act fully or partially independently during data collection - a strategy of partial action replacement (PAR) can significantly mitigate this challenge. PAR updates a single or part of agents' actions while the others remain fixed to the behavioral data, reducing distribution shift compared to full joint-action updates. Based on this insight, we develop Soft-Partial Conservative Q-Learning (SPaCQL), using PAR to mitigate OOD issue and dynamically weighting different PAR strategies based on the uncertainty of value estimation. We provide a rigorous theoretical foundation for this approach, proving that under factorized behavior policies, the induced distribution shift scales linearly with the number of deviating agents rather than exponentially with the joint-action space. This yields a provably tighter value error bound for this important class of offline MARL problems. Our theoretical results also indicate that SPaCQL adaptively addresses distribution shift using uncertainty-informed weights. Our empirical results demonstrate SPaCQL enables more effective policy learning, and manifest its remarkable superiority over baseline algorithms when the offline dataset exhibits the independence structure.
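As a rough illustration of partial action replacement as the abstract describes it, the sketch below builds a joint action in which only a chosen subset of agents deviates from the logged data. The function name and the list-based action encoding are hypothetical, and SPaCQL's uncertainty-based weighting of different PAR strategies is not shown.

```python
from typing import Iterable, List, Sequence


def partial_action_replacement(
    dataset_actions: Sequence[int],
    policy_actions: Sequence[int],
    replaced_agents: Iterable[int],
) -> List[int]:
    """Joint action where only `replaced_agents` use the policy's action.

    All other agents keep the action recorded in the offline dataset, so the
    joint action stays close to the behavioral data.
    """
    if len(dataset_actions) != len(policy_actions):
        raise ValueError("both joint actions must cover the same agents")
    replaced = set(replaced_agents)
    return [
        policy_actions[i] if i in replaced else dataset_actions[i]
        for i in range(len(dataset_actions))
    ]


# Three agents; only agent 1 deviates from the logged joint action.
joint = partial_action_replacement([0, 1, 2], [9, 8, 7], replaced_agents=[1])
```

Replacing every agent at once recovers the full joint-action update that the abstract contrasts PAR against.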

Time-Aware Policy Learning for Adaptive and Punctual Robot Control

Authors: Yinsen Jia, Boyuan Chen
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.07654
Pdf link: https://arxiv.org/pdf/2511.07654
Abstract Temporal awareness underlies intelligent behavior in both animals and humans, guiding how actions are sequenced, paced, and adapted to changing goals and environments. Yet most robot learning algorithms remain blind to time. We introduce time-aware policy learning, a reinforcement learning framework that enables robots to explicitly perceive and reason with time as a first-class variable. The framework augments conventional reinforcement policies with two complementary temporal signals, the remaining time and a time ratio, which allow a single policy to modulate its behavior continuously from rapid and dynamic to cautious and precise execution. By jointly optimizing punctuality and stability, the robot learns to balance efficiency, robustness, resiliency, and punctuality without re-training or reward adjustment. Across diverse manipulation domains from long-horizon pick and place, to granular-media pouring, articulated-object handling, and multi-agent object delivery, the time-aware policy produces adaptive behaviors that outperform standard reinforcement learning baselines by up to 48% in efficiency, 8 times more robust in sim-to-real transfer, and 90% in acoustic quietness while maintaining near-perfect success rates. Explicit temporal reasoning further enables real-time human-in-the-loop control and multi-agent coordination, allowing robots to recover from disturbances, re-synchronize after delays, and align motion tempo with human intent. By treating time not as a constraint but as a controllable dimension of behavior, time-aware policy learning provides a unified foundation for efficient, robust, resilient, and human-aligned robot autonomy.
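One way to picture the two temporal signals is as extra entries appended to the policy's observation. The sketch below is an assumption throughout: the abstract names "the remaining time and a time ratio" without defining them, so the ratio is taken here to be remaining time divided by the total budget, and the function and parameter names are hypothetical.

```python
from typing import List, Sequence


def time_aware_observation(
    obs: Sequence[float], elapsed: float, budget: float
) -> List[float]:
    """Append (remaining_time, time_ratio) to a raw observation vector.

    Assumed definitions, not taken from the paper:
    remaining_time = max(budget - elapsed, 0), time_ratio = remaining_time / budget.
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    remaining = max(budget - elapsed, 0.0)
    return list(obs) + [remaining, remaining / budget]


augmented = time_aware_observation([0.1, 0.2], elapsed=2.0, budget=8.0)
```

Because the budget enters only through the observation, the same policy can be asked for a faster or slower execution by changing `budget`, with no retraining.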

ZeroSim: Zero-Shot Analog Circuit Evaluation with Unified Transformer Embeddings

Authors: Xiaomeng Yang, Jian Gao, Yanzhi Wang, Xuan Zhang
Subjects: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
Arxiv link: https://arxiv.org/abs/2511.07658
Pdf link: https://arxiv.org/pdf/2511.07658
Abstract Although recent advancements in learning-based analog circuit design automation have tackled tasks such as topology generation, device sizing, and layout synthesis, efficient performance evaluation remains a major bottleneck. Traditional SPICE simulations are time-consuming, while existing machine learning methods often require topology-specific retraining or manual substructure segmentation for fine-tuning, hindering scalability and adaptability. In this work, we propose ZeroSim, a transformer-based performance modeling framework designed to achieve robust in-distribution generalization across trained topologies under novel parameter configurations and zero-shot generalization to unseen topologies without any fine-tuning. We apply three key enabling strategies: (1) a diverse training corpus of 3.6 million instances covering over 60 amplifier topologies, (2) unified topology embeddings leveraging global-aware tokens and hierarchical attention to robustly generalize to novel circuits, and (3) a topology-conditioned parameter mapping approach that maintains consistent structural representations independent of parameter variations. Our experimental results demonstrate that ZeroSim significantly outperforms baseline models such as multilayer perceptrons, graph neural networks and transformers, delivering accurate zero-shot predictions across different amplifier topologies. Additionally, when integrated into a reinforcement learning-based parameter optimization pipeline, ZeroSim achieves a remarkable speedup (13x) compared to conventional SPICE simulations, underscoring its practical value for a wide range of analog circuit design automation tasks.

Diffusion Guided Adversarial State Perturbations in Reinforcement Learning

Authors: Xiaolin Sun, Feidi Liu, Zhengming Ding, ZiZhan Zheng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.07701
Pdf link: https://arxiv.org/pdf/2511.07701
Abstract Reinforcement learning (RL) systems, while achieving remarkable success across various domains, are vulnerable to adversarial attacks. This is especially a concern in vision-based environments where minor manipulations of high-dimensional image inputs can easily mislead the agent's behavior. To this end, various defenses have been proposed recently, with state-of-the-art approaches achieving robust performance even under large state perturbations. However, after closer investigation, we found that the effectiveness of the current defenses is due to a fundamental weakness of the existing $l_p$ norm-constrained attacks, which can barely alter the semantics of image input even under a relatively large perturbation budget. In this work, we propose SHIFT, a novel policy-agnostic diffusion-based state perturbation attack to go beyond this limitation. Our attack is able to generate perturbed states that are semantically different from the true states while remaining realistic and history-aligned to avoid detection. Evaluations show that our attack effectively breaks existing defenses, including the most sophisticated ones, significantly outperforming existing attacks while being more perceptually stealthy. The results highlight the vulnerability of RL agents to semantics-aware adversarial perturbations, indicating the importance of developing more robust policies.

Intelligent Optimization of Multi-Parameter Micromixers Using a Scientific Machine Learning Framework

Authors: Meraj Hassanzadeh, Ehsan Ghaderi, Mohamad Ali Bijarchi, Siamak Kazemzadeh Hannani
Subjects: Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Fluid Dynamics (physics.flu-dyn)
Arxiv link: https://arxiv.org/abs/2511.07702
Pdf link: https://arxiv.org/pdf/2511.07702
Abstract Multidimensional optimization has consistently been a critical challenge in engineering. However, traditional simulation-based optimization methods have long been plagued by significant limitations: they are typically capable of optimizing only a single problem at a time and require substantial computational time for meshing and numerical simulation. This paper introduces a novel framework leveraging cutting-edge Scientific Machine Learning (Sci-ML) methodologies to overcome these inherent drawbacks of conventional approaches. The proposed method provides instantaneous solutions to a spectrum of complex, multidimensional optimization problems. A micromixer case study is employed to demonstrate this methodology. An agent, operating on a Deep Reinforcement Learning (DRL) architecture, serves as the optimizer to explore the relationships between key problem parameters. This optimizer interacts with an environment constituted by a parametric Physics-Informed Neural Network (PINN), which responds to the agent's actions at a significantly higher speed than traditional numerical methods. The agent's objective, conditioned on the Schmidt number is to discover the optimal geometric and physical parameters that maximize the micromixer's efficiency. After training the agent across a wide range of Schmidt numbers, we analyzed the resulting optimal designs. Across this entire spectrum, the achieved efficiency was consistently greater than the baseline, normalized value. The maximum efficiency occurred at a Schmidt number of 13.3, demonstrating an improvement of approximately 32%. Finally, a comparative analysis with a Genetic Algorithm was conducted under equivalent conditions to underscore the advantages of the proposed method.

A Negotiation-Based Multi-Agent Reinforcement Learning Approach for Dynamic Scheduling of Reconfigurable Manufacturing Systems

Authors: Manonmani Sekar, Nasim Nezamoddini
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.07707
Pdf link: https://arxiv.org/pdf/2511.07707
Abstract Reconfigurable manufacturing systems (RMS) are critical for future market adjustment given their rapid adaptation to fluctuations in consumer demands, the introduction of new technological advances, and disruptions in linked supply chain sections. The adjustable hard settings of such systems require a flexible soft planning mechanism that enables real-time production planning and scheduling amid the existing complexity and variability in their configuration settings. This study explores the application of multi-agent reinforcement learning (MARL) for dynamic scheduling in soft planning of the RMS settings. In the proposed framework, deep Q-network (DQN) agents trained in centralized training learn optimal job-machine assignments in real time while adapting to stochastic events such as machine breakdowns and reconfiguration delays. The model also incorporates a negotiation with an attention mechanism to enhance state representation and improve decision focus on critical system features. Key DQN enhancements, including prioritized experience replay, n-step returns, double DQN, and soft target updates, are used to stabilize and accelerate learning. Experiments conducted in a simulated RMS environment demonstrate that the proposed approach outperforms baseline heuristics in reducing makespan and tardiness while improving machine utilization. The reconfigurable manufacturing environment was extended to simulate realistic challenges, including machine failures and reconfiguration times. Experimental results show that while the enhanced DQN agent is effective in adapting to dynamic conditions, machine breakdowns increase variability in key performance metrics such as makespan, throughput, and total tardiness. The results confirm the advantages of applying the MARL mechanism for intelligent and adaptive scheduling in dynamic reconfigurable manufacturing environments.
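Two of the DQN enhancements listed in the abstract have compact textbook forms, sketched below in plain Python. These are the standard definitions of the n-step return and the soft (Polyak) target update, not code from the paper; the function names and the flat-list parameter representation are illustrative assumptions.

```python
from typing import List, Sequence


def n_step_return(rewards: Sequence[float], gamma: float, bootstrap_value: float) -> float:
    """r_0 + gamma*r_1 + ... + gamma^(n-1)*r_(n-1) + gamma^n * bootstrap_value."""
    g = bootstrap_value
    for r in reversed(rewards):  # fold from the last reward backwards
        g = r + gamma * g
    return g


def soft_update(target: Sequence[float], online: Sequence[float], tau: float) -> List[float]:
    """Move each target parameter a fraction tau toward the online parameter."""
    if len(target) != len(online):
        raise ValueError("parameter lists must have the same length")
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]


g = n_step_return([1.0, 1.0, 1.0], gamma=0.5, bootstrap_value=4.0)
new_target = soft_update([0.0, 2.0], [1.0, 2.0], tau=0.5)
```

With three unit rewards, a discount of 0.5, and a bootstrap value of 4, the return is 1 + 0.5 + 0.25 + 0.125 × 4 = 2.25.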

From Exploration to Exploitation: A Two-Stage Entropy RLVR Approach for Noise-Tolerant MLLM Training

Authors: Donglai Xu, Hongzheng Yang, Yuzhi Zhao, Pingping Zhang, Jinpeng Chen, Wenao Ma, Zhijian Hou, Mengyang Wu, Xiaolei Li, Senkang Hu, Ziyi Guan, Jason Chun Lok Li, Lai Man Po
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.07738
Pdf link: https://arxiv.org/pdf/2511.07738
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) for Multimodal Large Language Models (MLLMs) is highly dependent on high-quality labeled data, which is often scarce and prone to substantial annotation noise in real-world scenarios. Existing unsupervised RLVR methods, including pure entropy minimization, can overfit to incorrect labels and limit the crucial reward ranking signal for Group-Relative Policy Optimization (GRPO). To address these challenges and enhance noise tolerance, we propose a novel two-stage, token-level entropy optimization method for RLVR. This approach dynamically guides the model from exploration to exploitation during training. In the initial exploration phase, token-level entropy maximization promotes diverse and stochastic output generation, serving as a strong regularizer that prevents premature convergence to noisy labels and ensures sufficient intra-group variation, which enables more reliable reward gradient estimation in GRPO. As training progresses, the method transitions into the exploitation phase, where token-level entropy minimization encourages the model to produce confident and deterministic outputs, thereby consolidating acquired knowledge and refining prediction accuracy. Empirically, across three MLLM backbones - Qwen2-VL-2B, Qwen2-VL-7B, and Qwen2.5-VL-3B - spanning diverse noise settings and multiple tasks, our phased strategy consistently outperforms prior approaches by unifying and enhancing external, internal, and entropy-based methods, delivering robust and superior performance across the board.
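The exploration-to-exploitation switch can be pictured as an entropy term whose sign flips during training. The sketch below is a simplification under stated assumptions: a hard switch at `switch_step`, a fixed coefficient `beta`, and an unweighted mean over tokens are all hypothetical choices, since the abstract does not give the schedule or the way the term is combined with the GRPO objective.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one token's next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_loss_term(
    token_probs: Sequence[Sequence[float]],
    step: int,
    switch_step: int,    # assumed hard boundary between the two stages
    beta: float = 0.01,  # assumed entropy coefficient
) -> float:
    """Term added to a loss that is being minimized.

    Before `switch_step` the term is -beta * H, so minimizing it raises entropy
    (exploration); afterwards it is +beta * H, so minimizing it lowers entropy
    (exploitation).
    """
    mean_entropy = sum(token_entropy(p) for p in token_probs) / len(token_probs)
    sign = -1.0 if step < switch_step else 1.0
    return sign * beta * mean_entropy
```

For a uniform two-way distribution the entropy is ln 2, so the term is negative during exploration and positive once exploitation begins.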

High-Altitude Balloon Station-Keeping with First Order Model Predictive Control

Authors: Myles Pasetsky, Jiawei Lin, Bradley Guo, Sarah Dean
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.07761
Pdf link: https://arxiv.org/pdf/2511.07761
Abstract High-altitude balloons (HABs) are common in scientific research due to their wide range of applications and low cost. Because of their nonlinear, underactuated dynamics and the partial observability of wind fields, prior work has largely relied on model-free reinforcement learning (RL) methods to design near-optimal control schemes for station-keeping. These methods often compare only against hand-crafted heuristics, dismissing model-based approaches as impractical given the system complexity and uncertain wind forecasts. We revisit this assumption about the efficacy of model-based control for station-keeping by developing First-Order Model Predictive Control (FOMPC). By implementing the wind and balloon dynamics as differentiable functions in JAX, we enable gradient-based trajectory optimization for online planning. FOMPC outperforms a state-of-the-art RL policy, achieving a 24% improvement in time-within-radius (TWR) without requiring offline training, though at the cost of greater online computation per control step. Through systematic ablations of modeling assumptions and control factors, we show that online planning is effective across many configurations, including under simplified wind and dynamics models.

A Historical Interaction-Enhanced Shapley Policy Gradient Algorithm for Multi-Agent Credit Assignment

Authors: Ao Ding, Licheng Sun, Yongjie Hou, Huaqing Zhang, Hongbin Ma
Subjects: Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2511.07778
Pdf link: https://arxiv.org/pdf/2511.07778
Abstract Multi-agent reinforcement learning (MARL) has demonstrated remarkable performance in multi-agent collaboration problems and has become a prominent topic in artificial intelligence research in recent years. However, traditional credit assignment schemes in MARL cannot reliably capture individual contributions in strongly coupled tasks while maintaining training stability, which leads to limited generalization capabilities and hinders algorithm performance. To address these challenges, we propose a Historical Interaction-Enhanced Shapley Policy Gradient Algorithm (HIS) for Multi-Agent Credit Assignment, which employs a hybrid credit assignment mechanism to balance base rewards with individual contribution incentives. By utilizing historical interaction data to calculate the Shapley value in a sample-efficient manner, HIS enhances the agent's ability to perceive its own contribution, while retaining the global reward to maintain training stability. Additionally, we provide theoretical guarantees for the hybrid credit assignment mechanism, ensuring that the assignment results it generates are both efficient and stable. We evaluate the proposed algorithm in three widely used continuous-action benchmark environments: Multi-Agent Particle Environment, Multi-Agent MuJoCo, and Bi-DexHands. Experimental results demonstrate that HIS outperforms state-of-the-art methods, particularly excelling in strongly coupled, complex collaborative tasks.
中文摘要 多智能体强化学习（MARL）在多智能体协作问题中表现出了卓越的性能，成为近年来人工智能研究的突出课题。然而，MARL 中传统的信用分配方案无法在保持训练稳定性的同时可靠地捕获强耦合任务中的个体贡献，这导致泛化能力有限，阻碍了算法性能。为了应对这些挑战，我们提出了一种用于多智能体信用分配的历史交互增强 Shapley 策略梯度算法（HIS），该算法采用混合信用分配机制来平衡基础奖励与个体贡献激励。通过利用历史交互数据以样本高效的方式计算 Shapley 值，HIS 增强了智能体感知自身贡献的能力，同时保留了全局奖励以保持训练稳定性。此外，我们还为混合信用分配机制提供了理论保证，确保其产生的分配结果既高效又稳定。我们在三种广泛使用的连续动作基准环境中评估了所提出的算法：Multi-Agent Particle Environment、Multi-Agent MuJoCo 和 Bi-DexHands。实验结果表明，HIS 优于最先进的方法，尤其在强耦合、复杂的协作任务中表现出色。
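
Illustrative sketch (not from the paper): the abstract describes blending a shared global reward with a Shapley-based individual contribution term. The snippet below computes exact Shapley values by enumerating agent orderings and mixes them with the team reward. The toy value function, the `mix` weight, and all function names are assumptions made for illustration; the paper's sample-efficient estimator based on historical interactions is not reproduced here.

```python
from itertools import permutations


def shapley_values(agents, value):
    """Exact Shapley values: average marginal contribution over all agent orderings."""
    totals = {a: 0.0 for a in agents}
    orderings = list(permutations(agents))
    for order in orderings:
        coalition = frozenset()
        for a in order:
            with_a = coalition | {a}
            totals[a] += value(with_a) - value(coalition)
            coalition = with_a
    return {a: t / len(orderings) for a, t in totals.items()}


def hybrid_rewards(agents, value, mix=0.5):
    """Blend the shared team reward with each agent's Shapley share.

    `mix` is a hypothetical weight: 1.0 means pure global reward,
    0.0 means pure individual (Shapley) credit.
    """
    phi = shapley_values(agents, value)
    team = value(frozenset(agents))
    return {a: mix * team + (1.0 - mix) * phi[a] for a in agents}


def toy_value(coalition):
    """Toy coupled task: agents 0 and 1 only score together; agent 2 adds 1 on its own."""
    coupled = 4.0 if {0, 1} <= coalition else 0.0
    solo = 1.0 if 2 in coalition else 0.0
    return coupled + solo


phi = shapley_values([0, 1, 2], toy_value)
rewards = hybrid_rewards([0, 1, 2], toy_value, mix=0.5)
print(phi, rewards)
```

Exact enumeration costs n! value evaluations, so it only serves to show the quantity being estimated; any practical method has to approximate it.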

From Experience to Strategy: Empowering LLM Agents with Trainable Graph Memory

从经验到策略：为 LLM 代理提供可训练的图记忆

Authors: Siyu Xia, Zekun Xu, Jiajun Chai, Wentian Fan, Yan Song, Xiaohan Wang, Guojun Yin, Wei Lin, Haifeng Zhang, Jun Wang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.07800
Pdf link: https://arxiv.org/pdf/2511.07800
Abstract Large language model (LLM)-based agents have demonstrated remarkable potential in autonomous task-solving across complex, open-ended environments. A promising approach for improving the reasoning capabilities of LLM agents is to better utilize prior experiences in guiding current decisions. However, LLMs acquire experience either through implicit memory via training, which suffers from catastrophic forgetting and limited interpretability, or explicit memory via prompting, which lacks adaptability. In this paper, we introduce a novel agent-centric, trainable, multi-layered graph memory framework and evaluate how context memory enhances the ability of LLMs to utilize parametric information. The graph abstracts raw agent trajectories into structured decision paths in a state machine and further distills them into high-level, human-interpretable strategic meta-cognition. To make memory adaptable, we propose a reinforcement-based weight optimization procedure that estimates the empirical utility of each meta-cognition based on reward feedback from downstream tasks. These optimized strategies are then dynamically integrated into the LLM agent's training loop through meta-cognitive prompting. Empirically, the learnable graph memory delivers robust generalization, improves LLM agents' strategic reasoning performance, and provides consistent benefits during Reinforcement Learning (RL) training.
中文摘要 基于大型语言模型（LLM）的代理在跨复杂、开放式环境的自主任务解决方面表现出了巨大的潜力。提高 LLM 代理推理能力的一个有前途的方法是更好地利用先前的经验来指导当前决策。然而，LLM 要么通过训练形成的隐式记忆获取经验，这种方式存在灾难性遗忘和可解释性有限的问题；要么通过提示形成的显式记忆获取经验，这种方式缺乏适应性。在本文中，我们介绍了一种以代理为中心的、可训练的、多层的图记忆框架，并评估了上下文记忆如何增强 LLM 利用参数信息的能力。该图将原始代理轨迹抽象为状态机中的结构化决策路径，并进一步将它们提炼成高级的、人类可解释的战略元认知。为了使记忆具有适应性，我们提出了一种基于强化的权重优化程序，该程序根据下游任务的奖励反馈来估计每个元认知的经验效用。然后，这些优化策略通过元认知提示动态集成到 LLM 代理的训练循环中。实验表明，可学习的图记忆提供了强大的泛化能力，提高了 LLM 代理的战略推理性能，并在强化学习（RL）训练期间提供了一致的好处。

MURPHY: Multi-Turn GRPO for Self Correcting Code Generation

MURPHY：用于自我纠错代码生成的多轮 GRPO

Authors: Chanakya Ekbote, Vijay Lingam, Behrooz Omidvar-Tehrani, Jun Huan, Sujay Sanghavi, Anoop Deoras, Stefano Soatto
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.07833
Pdf link: https://arxiv.org/pdf/2511.07833
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful framework for enhancing the reasoning capabilities of large language models (LLMs). However, existing approaches such as Group Relative Policy Optimization (GRPO) and its variants, while effective on reasoning benchmarks, struggle with agentic tasks that require iterative decision-making. We introduce Murphy, a multi-turn reflective optimization framework that extends GRPO by incorporating iterative self-correction during training. By leveraging both quantitative and qualitative execution feedback, Murphy enables models to progressively refine their reasoning across multiple turns. Evaluations on code generation benchmarks with model families such as Qwen and OLMo show that Murphy consistently improves performance, achieving up to an 8% relative gain in pass@1 over GRPO on similar compute budgets.
中文摘要 具有可验证奖励的强化学习（RLVR）已成为增强大型语言模型（LLM）推理能力的强大框架。然而，现有的方法，如群体相对策略优化（GRPO）及其变体，虽然在推理基准上是有效的，但在需要迭代决策的智能体任务中却遇到了困难。我们介绍了 Murphy，这是一个多轮反思式优化框架，它通过在训练期间结合迭代自我纠错来扩展 GRPO。通过利用定量和定性的执行反馈，Murphy 使模型能够在多个回合中逐步完善其推理。在 Qwen 和 OLMo 等模型系列上进行的代码生成基准评估表明，Murphy 持续提升性能，在相近的计算预算下，与 GRPO 相比，pass@1 的相对增益最高可达 8%。
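
Illustrative sketch (not from the paper): Murphy builds on GRPO, whose core step scores each sampled completion relative to the other completions drawn for the same prompt. The snippet shows only that group-relative normalization for a single group of verifiable rewards. The function name, the `eps` constant, and the use of the population standard deviation are assumptions; Murphy's multi-turn feedback loop is not reproduced.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions sampled for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so completions
    that beat their siblings get a positive learning signal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one coding prompt; reward 1.0 means all unit tests passed.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

When every completion in a group receives the same reward, all advantages are zero and the group contributes no gradient.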

Comparative Study of Q-Learning for State-Feedback LQG Control with an Unknown Model

未知模型下状态反馈 LQG 控制的 Q-learning 对比研究

Authors: Mingxiang Liu, Damián Marelli, Minyue Fu, Qianqian Cai
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2511.07870
Pdf link: https://arxiv.org/pdf/2511.07870
Abstract We study the problem of designing a state feedback linear quadratic Gaussian (LQG) controller for a system in which the system matrices as well as the process noise covariance are unknown. We do a rigorous comparison between two approaches. The first is the classic one in which a system identification stage is used to estimate the unknown parameters, which are then used in a state-feedback LQG (SF-LQG) controller design. The second approach is a recently proposed one using a reinforcement learning paradigm called Q-learning. We do the comparison in terms of complexity and accuracy of the resulting controller. We show that the classic approach is asymptotically efficient, giving virtually no room for improvement in terms of accuracy. We also propose a novel Q-learning-based method which we show asymptotically achieves the optimal controller design. We complement our proposed method with a numerically efficient algorithmic implementation aiming at making it competitive in terms of computations. Nevertheless, our complexity analysis shows that the classic approach is still numerically more efficient than this Q-learning-based alternative. We then conclude that the classic approach remains the best choice for addressing the SF-LQG design in the case of unknown parameters.
中文摘要 我们研究了为系统矩阵和过程噪声协方差未知的系统设计状态反馈线性二次高斯（LQG）控制器的问题。我们对两种方法进行了严格的比较。第一种是经典的，其中系统识别阶段用于估计未知参数，然后将其用于状态反馈 LQG （SF-LQG）控制器设计。第二种方法是最近提出的一种方法，使用称为 Q 学习的强化学习范式。我们根据生成的控制器的复杂性和准确性进行比较。我们表明，经典方法渐近高效，在准确性方面几乎没有改进的余地。我们还提出了一种基于Q学习的新方法，我们展示了该方法渐近地实现了最优控制器设计。我们用数值高效的算法实现来补充我们提出的方法，旨在使其在计算方面具有竞争力。尽管如此，我们的复杂性分析表明，经典方法在数值上仍然比这种基于 Q 学习的替代方案更有效。然后我们得出结论，在参数未知的情况下，经典方法仍然是解决 SF-LQG 设计的最佳选择。
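
Illustrative sketch (not from the paper): both approaches compared in the abstract ultimately target the same optimal state-feedback gain. For a scalar system with a known model, that gain can be read off the quadratic Q-function once the Riccati recursion has converged; additive process noise does not change it. The scalar model, parameter names, and iteration count are assumptions, and neither the identification stage nor the paper's Q-learning estimator is shown.

```python
def scalar_lqr_gain(a, b, q, r, iterations=200):
    """Iterate the scalar discrete-time Riccati recursion and return the feedback gain.

    Assumed model: x_next = a*x + b*u + noise, stage cost q*x**2 + r*u**2.
    The quadratic Q-function is Q(x, u) = h_xx*x**2 + 2*h_xu*x*u + h_uu*u**2,
    which is minimised over u by u = -(h_xu / h_uu) * x.
    """
    p = q  # value-function coefficient: V(x) = p * x**2 up to a constant
    for _ in range(iterations):
        h_xu = a * b * p       # cross term of the Q-function
        h_uu = r + b * b * p   # curvature of the Q-function in u
        p = q + a * a * p - h_xu * h_xu / h_uu
    h_xu = a * b * p
    h_uu = r + b * b * p
    return h_xu / h_uu


gain = scalar_lqr_gain(a=1.0, b=1.0, q=1.0, r=1.0)
print(gain)  # control law: u = -gain * x
```

A Q-learning method estimates the `h` coefficients from data instead of computing them from `a` and `b`; the gain extraction in the last step is the same.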

Statistically Assuring Safety of Control Systems using Ensembles of Safety Filters and Conformal Prediction

使用安全滤波器和共形预测的集合对控制系统进行统计保证安全

Authors: Ihab Tabbara, Yuxuan Yang, Hussein Sibai
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2511.07899
Pdf link: https://arxiv.org/pdf/2511.07899
Abstract Safety assurance is a fundamental requirement for deploying learning-enabled autonomous systems. Hamilton-Jacobi (HJ) reachability analysis is a fundamental method for formally verifying safety and generating safe controllers. However, computing the HJ value function that characterizes the backward reachable set (BRS) of a set of user-defined failure states is computationally expensive, especially for high-dimensional systems, motivating the use of reinforcement learning approaches to approximate the value function. Unfortunately, a learned value function and its corresponding safe policy are not guaranteed to be correct. The learned value function evaluated at a given state may not be equal to the actual safety return achieved by following the learned safe policy. To address this challenge, we introduce a conformal prediction-based (CP) framework that bounds such uncertainty. We leverage CP to provide probabilistic safety guarantees when using learned HJ value functions and policies to prevent control systems from reaching failure states. Specifically, we use CP to calibrate the switching between the unsafe nominal controller and the learned HJ-based safe policy and to derive safety guarantees under this switched policy. We also investigate using an ensemble of independently trained HJ value functions as a safety filter and compare this ensemble approach to using individual value functions alone.
中文摘要 安全保证是部署支持学习的自主系统的基本要求。汉密尔顿-雅可比（HJ）可达性分析是正式验证安全性和生成安全控制器的基本方法。然而，计算表征一组用户定义的故障状态的向后可达集（BRS）的 HJ 值函数的计算成本很高，特别是对于高维系统，这促使使用强化学习方法来近似值函数。不幸的是，学习的值函数及其相应的安全策略不能保证是正确的。在给定状态下评估的学习值函数可能不等于遵循学习安全策略实现的实际安全回报。为了应对这一挑战，我们引入了一个基于共形预测（CP）的框架来限制这种不确定性。当使用学习到的 HJ 值函数和策略来防止控制系统达到故障状态时，我们利用 CP 提供概率安全保证。具体来说，我们使用 CP 来校准不安全的标称控制器和学习到的基于 HJ 的安全策略之间的切换，并在此切换策略下得出安全保证。我们还研究了使用独立训练的 HJ 值函数的集合作为安全过滤器，并将这种集成方法与单独使用单个值函数进行了比较。
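
Illustrative sketch (not from the paper): the abstract uses conformal prediction to bound the gap between a learned safety value and the safety return actually achieved, and to calibrate when to switch to the safe policy. The snippet shows generic split conformal calibration on that gap. The score definition, the sign convention (positive value means safe), the threshold, and the calibration numbers are all assumptions.

```python
import math


def conformal_margin(scores, alpha):
    """Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[k - 1]


def keep_nominal_controller(learned_value, margin, threshold=0.0):
    """Stay on the nominal controller only while the conformal lower bound looks safe."""
    return learned_value - margin > threshold


# Hypothetical calibration pairs: (learned value, achieved safety return).
calibration = [(0.9, 0.8), (0.5, 0.6), (0.7, 0.4), (0.3, 0.3),
               (0.8, 0.7), (0.6, 0.5), (0.4, 0.1)]
scores = [learned - achieved for learned, achieved in calibration]
margin = conformal_margin(scores, alpha=0.25)
print(margin, keep_nominal_controller(0.9, margin))
```

Under exchangeability of calibration and test states, the achieved return is at least `learned_value - margin` with probability at least `1 - alpha`, marginally over the calibration draw.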

Test-driven Reinforcement Learning

测试驱动的强化学习

Authors: Zhao Yu, Xiuping Wu, Liangjun Ke
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.07904
Pdf link: https://arxiv.org/pdf/2511.07904
Abstract Reinforcement learning (RL) has been recognized as a powerful tool for robot control tasks. RL typically employs reward functions to define task objectives and guide agent learning. However, since the reward function serves the dual purpose of defining the optimal goal and guiding learning, it is challenging to design the reward function manually, which often results in a suboptimal task representation. To tackle the reward design challenge in RL, inspired by the satisficing theory, we propose a Test-driven Reinforcement Learning (TdRL) framework. In the TdRL framework, multiple test functions are used to represent the task objective rather than a single reward function. Test functions can be categorized as pass-fail tests and indicative tests, each dedicated to defining the optimal objective and guiding the learning process, respectively, thereby making defining tasks easier. Building upon such a task definition, we first prove that if a trajectory return function assigns higher returns to trajectories closer to the optimal trajectory set, maximum entropy policy optimization based on this return function will yield a policy that is closer to the optimal policy set. Then, we introduce a lexicographic heuristic approach to compare the relative distance relationship between trajectories and the optimal trajectory set for learning the trajectory return function. Furthermore, we develop an algorithm implementation of TdRL. Experimental results on the DeepMind Control Suite benchmark demonstrate that TdRL matches or outperforms handcrafted reward methods in policy training, with greater design simplicity and inherent support for multi-objective optimization. We argue that TdRL offers a novel perspective for representing task objectives, which could be helpful in addressing the reward design challenges in RL applications.
中文摘要 强化学习（RL）已被公认为机器人控制任务的强大工具。RL 通常采用奖励函数来定义任务目标并指导智能体学习。然而，由于奖励函数具有定义最优目标和指导学习的双重目的，因此手动设计奖励函数具有挑战性，这通常会导致次优的任务表示。为了解决 RL 中的奖励设计挑战，受满意化（satisficing）理论的启发，我们提出了一个测试驱动的强化学习（TdRL）框架。在 TdRL 框架中，使用多个测试函数来表示任务目标，而不是单个奖励函数。测试函数可分为通过-失败测试和指示性测试，分别用于定义最优目标和指导学习过程，从而使任务定义变得更加容易。基于这样的任务定义，我们首先证明，如果轨迹回报函数为更接近最优轨迹集的轨迹分配更高的回报，则基于该回报函数的最大熵策略优化将产生更接近最优策略集的策略。然后，我们引入一种字典序启发式方法来比较各轨迹与最优轨迹集之间的相对距离关系，从而学习轨迹回报函数。此外，我们开发了 TdRL 的算法实现。在 DeepMind Control Suite 基准上的实验结果表明，TdRL 在策略训练中达到或超过了手工设计奖励的方法，同时设计更简单，并天然支持多目标优化。我们认为，TdRL 为表示任务目标提供了一种新颖的视角，这可能有助于解决 RL 应用中的奖励设计挑战。
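
Illustrative sketch (not from the paper): the abstract mentions a lexicographic heuristic for deciding which of two trajectories is closer to the optimal trajectory set. One simple reading is to rank first by how many pass-fail tests a trajectory satisfies and break ties with the indicative test scores in priority order. The key layout and function names are assumptions; the paper's actual ordering may differ.

```python
def trajectory_key(pass_fail, indicative):
    """Build a lexicographic sort key for one trajectory.

    pass_fail: iterable of booleans, one per pass-fail test.
    indicative: iterable of floats, one per indicative test, highest priority first.
    """
    return (sum(bool(p) for p in pass_fail), *indicative)


def closer_to_optimal(traj_a, traj_b):
    """Return True if trajectory A ranks strictly above trajectory B."""
    return trajectory_key(*traj_a) > trajectory_key(*traj_b)


# Each trajectory is summarised as (pass-fail results, indicative scores).
a = ([True, True], [0.2])
b = ([True, False], [0.9])
c = ([True, True], [0.5])
print(closer_to_optimal(a, b), closer_to_optimal(c, a))
```

Python compares tuples element by element, so indicative scores only matter when two trajectories pass the same number of pass-fail tests. Pairwise labels produced this way could then supervise a learned trajectory return function.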

Feedback Descent: Open-Ended Text Optimization via Pairwise Comparison

反馈下降：通过成对比较进行开放式文本优化

Authors: Yoonho Lee, Joseph Boen, Chelsea Finn
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.07919
Pdf link: https://arxiv.org/pdf/2511.07919
Abstract We introduce Feedback Descent, a framework that optimizes text artifacts -- prompts, code, and molecules -- through structured textual feedback, rather than relying solely on scalar rewards. By preserving detailed critiques instead of compressing them to binary preferences, Feedback Descent widens the information bottleneck in preference learning, enabling directed optimization in text space rather than weight space. We show that in-context learning can transform structured feedback into gradient-like directional information, enabling targeted edits. Unlike prior approaches that collapse judgments into single bits, our evaluators pair each comparison with textual feedback, which functions as high-bandwidth supervision. The iteration loop is done purely at inference time, without modifying any model weights, and is task-agnostic. We evaluate Feedback Descent on three diverse domains and find that it outperforms state-of-the-art prompt optimization (GEPA), reinforcement learning methods (GRPO, REINVENT), and even specialized graph-based molecular optimizers. In the DOCKSTRING molecule discovery benchmark, Feedback Descent identifies novel drug-like molecules surpassing the 99.9th percentile of a database with more than 260,000 compounds across six protein targets.
中文摘要 我们引入了 Feedback Descent，这是一个通过结构化文本反馈来优化文本工件（提示、代码和分子）的框架，而不是仅仅依赖标量奖励。通过保留详细的批评而不是将其压缩为二元偏好，Feedback Descent 拓宽了偏好学习中的信息瓶颈，从而在文本空间而不是权重空间中实现定向优化。我们表明，上下文学习可以将结构化反馈转化为类似梯度的方向信息，从而实现有针对性的编辑。与之前将判断折叠成单个比特的方法不同，我们的评估器将每次比较与文本反馈配对，文本反馈起到高带宽监督的作用。迭代循环纯粹在推理时完成，不修改任何模型权重，并且与任务无关。我们在三个不同的领域评估了 Feedback Descent，发现它优于最先进的提示优化（GEPA）、强化学习方法（GRPO、REINVENT），甚至优于专门的基于图的分子优化器。在 DOCKSTRING 分子发现基准中，Feedback Descent 在六个蛋白质靶标上识别出的新型类药分子超过了一个包含 26 万多种化合物的数据库的第 99.9 百分位。

SERL: Self-Examining Reinforcement Learning on Open-Domain

SERL：开放领域自检强化学习

Authors: Weixuan Ou, Yanzhao Zheng, Shuoshuo Sun, Wei Zhang, Baohua Dong, Hangcheng Zhu, Ruohui Huang, Gang Yu, Pengwei Yan, Yifan Qiao
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.07922
Pdf link: https://arxiv.org/pdf/2511.07922
Abstract Reinforcement Learning (RL) has been shown to improve the capabilities of large language models (LLMs). However, applying RL to open-domain tasks faces two key challenges: (1) the inherent subjectivity of these tasks precludes the verifiable rewards required by Reinforcement Learning with Verifiable Rewards (RLVR); (2) Reinforcement Learning from Human Feedback (RLHF) relies on external reward mechanisms. To overcome these limitations, we propose Self-Examining Reinforcement Learning (SERL), a novel self-improving framework where the LLM serves as both Actor and Judge. SERL introduces two synergistic reward mechanisms without any external signals. On the one hand, to improve the Actor's capability, we derive rewards from Copeland-style pairwise comparison judgments across a group of generated responses. On the other hand, a self-consistency reward that encourages coherent judgments is proposed to improve the Judge's reliability. This process refines the Judge's capability, which in turn provides a more robust reward for the Actor. Experiments show that our method outperforms existing self-improvement training methods. SERL improves the LC win rate of Qwen3-8B on AlpacaEval 2 from 52.37% to 59.90%. To the best of our knowledge, our method achieves state-of-the-art performance among self-improving approaches. Furthermore, it achieves a performance comparable to significantly larger models like Qwen3-32B, demonstrating superior effectiveness and robustness on open-domain tasks.
中文摘要 强化学习（RL）已被证明可以提高大型语言模型（LLM）的能力。然而，将 RL 应用于开放领域任务面临两个关键挑战：（1）这些任务固有的主观性使其无法提供具有可验证奖励的强化学习（RLVR）所要求的可验证奖励；（2）基于人类反馈的强化学习（RLHF）依赖于外部奖励机制。为了克服这些限制，我们提出了自检强化学习（SERL），这是一种新颖的自我改进框架，其中 LLM 同时充当行动者（Actor）和评判者（Judge）。SERL 在没有任何外部信号的情况下引入了两种协同的奖励机制。一方面，为了提高 Actor 的能力，我们从一组生成的响应的 Copeland 式成对比较判断中获得奖励。另一方面，提出了鼓励连贯判断的自洽奖励，以提高 Judge 的可靠性。这个过程完善了 Judge 的能力，从而为 Actor 提供了更稳健的奖励。实验表明，该方法优于现有的自我改进训练方法。SERL 将 Qwen3-8B 在 AlpacaEval 2 上的 LC 胜率从 52.37% 提高到 59.90%。据我们所知，我们的方法在自我改进方法中取得了最先进的性能。此外，它还实现了与 Qwen3-32B 等更大模型相当的性能，在开放域任务中表现出卓越的有效性和鲁棒性。
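
Illustrative sketch (not from the paper): the Actor reward in the abstract is derived from Copeland-style pairwise comparisons within a group of responses. A Copeland score counts pairwise wins minus pairwise losses. In the snippet, `prefers` stands in for the LLM Judge and is replaced by a toy length-based rule; the function names and that rule are assumptions, and the Judge's self-consistency reward is not shown.

```python
from itertools import combinations


def copeland_scores(responses, prefers):
    """Score each response by pairwise wins minus pairwise losses.

    prefers(x, y) must return True when the judge prefers x over y.
    Pairs where neither side is preferred count as ties and add nothing.
    """
    scores = [0] * len(responses)
    for i, j in combinations(range(len(responses)), 2):
        if prefers(responses[i], responses[j]):
            scores[i] += 1
            scores[j] -= 1
        elif prefers(responses[j], responses[i]):
            scores[j] += 1
            scores[i] -= 1
    return scores


def toy_judge(x, y):
    """Toy stand-in for the Judge: prefer the longer response."""
    return len(x) > len(y)


group = ["a", "bbb", "cc"]
scores = copeland_scores(group, toy_judge)
print(scores)
```

The resulting scores can serve directly as per-response rewards within the group, since they already rank responses relative to one another.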

SpeechJudge: Towards Human-Level Judgment for Speech Naturalness

SpeechJudge：迈向人类对语音自然性的判断

Authors: Xueyao Zhang, Chaoren Wang, Huan Liao, Ziniu Li, Yuancheng Wang, Li Wang, Dongya Jia, Yuanzhe Chen, Xiulin Li, Zhuo Chen, Zhizheng Wu
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.07931
Pdf link: https://arxiv.org/pdf/2511.07931
Abstract Aligning large generative models with human feedback is a critical challenge. In speech synthesis, this is particularly pronounced due to the lack of a large-scale human preference dataset, which hinders the development of models that truly align with human perception. To address this, we introduce SpeechJudge, a comprehensive suite comprising a dataset, a benchmark, and a reward model centered on naturalness--one of the most fundamental subjective metrics for speech synthesis. First, we present SpeechJudge-Data, a large-scale human feedback corpus of 99K speech pairs. The dataset is constructed using a diverse set of advanced zero-shot text-to-speech (TTS) models across diverse speech styles and multiple languages, with human annotations for both intelligibility and naturalness preference. From this, we establish SpeechJudge-Eval, a challenging benchmark for speech naturalness judgment. Our evaluation reveals that existing metrics and AudioLLMs struggle with this task; the leading model, Gemini-2.5-Flash, achieves less than 70% agreement with human judgment, highlighting a significant gap for improvement. To bridge this gap, we develop SpeechJudge-GRM, a generative reward model (GRM) based on Qwen2.5-Omni-7B. It is trained on SpeechJudge-Data via a two-stage post-training process: Supervised Fine-Tuning (SFT) with Chain-of-Thought rationales followed by Reinforcement Learning (RL) with GRPO on challenging cases. On the SpeechJudge-Eval benchmark, the proposed SpeechJudge-GRM demonstrates superior performance, achieving 77.2% accuracy (and 79.4% after inference-time scaling @10) compared to a classic Bradley-Terry reward model (72.7%). Furthermore, SpeechJudge-GRM can also be employed as a reward function during the post-training of speech generation models to facilitate their alignment with human preferences.
中文摘要 使大型生成模型与人类反馈保持一致是一项严峻的挑战。在语音合成中，由于缺乏大规模的人类偏好数据集，这种情况尤为明显，这阻碍了真正符合人类感知的模型的开发。为了解决这个问题，我们推出了 SpeechJudge，这是一个综合套件，由数据集、基准和奖励模型组成，以自然性为中心，自然性是语音合成最基本的主观指标之一。首先，我们提出了 SpeechJudge-Data，这是一个包含 99K 语音对的大规模人类反馈语料库。该数据集是使用一组跨不同语音风格和多种语言的高级零样本文本转语音（TTS）模型构建的，并针对清晰度和自然偏好进行了人工注释。由此，我们建立了 SpeechJudge-Eval，这是一个具有挑战性的语音自然性判断基准。我们的评估表明，现有指标和 AudioLLM 都在努力完成这项任务;领先的模型 Gemini-2.5-Flash 与人类判断的一致性不到 70%，凸显了巨大的改进差距。为了弥补这一差距，我们开发了基于 Qwen2.5-Omni-7B 的生成奖励模型（GRM） SpeechJudge-GRM。它通过两个阶段的后训练过程在 SpeechJudge-Data 上进行训练：具有思维链基本原理的监督微调（SFT），然后是具有挑战性案例的 GRPO 强化学习（RL）。在 SpeechJudge-Eval 基准测试中，与经典的 Bradley-Terry 奖励模型（72.7%）相比，所提出的 SpeechJudge-GRM 表现出卓越的性能，实现了 77.2% 的准确率（推理时间缩放 @10 后为 79.4%）。此外，SpeechJudge-GRM还可以在语音生成模型的后期训练中用作奖励函数，以促进其与人类偏好的一致性。
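
Illustrative sketch (not from the paper): the baseline in the abstract is a classic Bradley-Terry reward model, which turns the difference between two scalar scores into a preference probability and is evaluated by how often it agrees with the human-preferred item. The scores below are made up and the function names are assumptions; no speech model is involved.

```python
import math


def bt_preference_prob(score_a, score_b):
    """Bradley-Terry probability that item A is preferred over item B."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def bt_loss(score_preferred, score_rejected):
    """Negative log-likelihood of the human-preferred item winning."""
    return -math.log(bt_preference_prob(score_preferred, score_rejected))


def pairwise_accuracy(score_pairs):
    """Fraction of pairs where the preferred item got the strictly higher score."""
    correct = sum(1 for preferred, rejected in score_pairs if preferred > rejected)
    return correct / len(score_pairs)


# Hypothetical (preferred, rejected) reward scores for four annotated speech pairs.
pairs = [(1.0, 0.0), (0.0, 1.0), (2.0, 1.0), (3.0, 0.0)]
print(pairwise_accuracy(pairs), bt_loss(1.0, 0.0))
```

A generative reward model such as the one described replaces the scalar score with a reasoned textual judgment, but agreement with human labels can be measured the same way.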

Thinker: Training LLMs in Hierarchical Thinking for Deep Search via Multi-Turn Interaction

Thinker：通过多轮交互训练 LLM 以分层思维进行深度搜索

Authors: Jun Xu, Xinkai Du, Yu Ao, Peilong Zhao, Yang Li, Ling Zhong, Lin Yuan, Zhongpu Bo, Xiaorui Wang, Mengshu Sun, Zhengke Gui, Dalong Zhang, Zhaoyang Wang, Qiwei Wang, Yangyang Hou, Zhiying Yin, Haofen Wang, Huajun Chen, Lei Liang, Jun Zhou
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.07943
Pdf link: https://arxiv.org/pdf/2511.07943
Abstract Efficient retrieval of external knowledge bases and web pages is crucial for enhancing the reasoning abilities of LLMs. Previous works on training LLMs to leverage external retrievers for solving complex problems have predominantly employed end-to-end reinforcement learning. However, these approaches neglect supervision over the reasoning process, making it difficult to guarantee logical coherence and rigor. To address these limitations, we propose Thinker, a hierarchical thinking model for deep search through multi-turn interaction, making the reasoning process supervisable and verifiable. It decomposes complex problems into independently solvable sub-problems, each dually represented in both natural language and an equivalent logical function to support knowledge base and web searches. Concurrently, dependencies between sub-problems are passed as parameters via these logical functions, enhancing the logical coherence of the problem-solving process. To avoid unnecessary external searches, we perform knowledge boundary determination to check if a sub-problem is within the LLM's intrinsic knowledge, allowing it to answer directly. Experimental results indicate that with as few as several hundred training samples, the performance of Thinker is competitive with established baselines. Furthermore, when scaled to the full training set, Thinker significantly outperforms these methods across various datasets and model sizes. The source code is available at this https URL.
中文摘要 高效检索外部知识库和网页对于增强 LLM 的推理能力至关重要。之前关于训练 LLM 利用外部检索器解决复杂问题的工作主要采用端到端强化学习。然而，这些方法忽视了对推理过程的监督，因此难以保证逻辑的连贯性和严谨性。为了解决这些局限性，我们提出了 Thinker，这是一种通过多轮交互进行深度搜索的分层思维模型，使推理过程可监督和可验证。它将复杂的问题分解为可独立解决的子问题，每个子问题都以自然语言和等效的逻辑函数双重表示，以支持知识库和网络搜索。同时，子问题之间的依赖关系通过这些逻辑函数作为参数传递，增强了问题解决过程的逻辑连贯性。为了避免不必要的外部搜索，我们进行知识边界确定，以检查子问题是否在 LLM 的内在知识范围内，使其能够直接回答。实验结果表明，只需几百个训练样本，Thinker 的性能就与既定的基线具有竞争力。此外，当扩展到完整的训练集时，Thinker 在各种数据集和模型大小上都明显优于这些方法。源代码可在此 https URL 中找到。

Knowledge-Augmented Long-CoT Generation for Complex Biomolecular Reasoning

用于复杂生物分子推理的知识增强长 CoT 生成

Authors: Tianwen Lyu, Xiang Zhuang, Keyan Ding, Xinzhe Cao, Lei Liang, Wei Zhao, Qiang Zhang, Huajun Chen
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.08024
Pdf link: https://arxiv.org/pdf/2511.08024
Abstract Understanding complex biomolecular mechanisms requires multi-step reasoning across molecular interactions, signaling cascades, and metabolic pathways. While large language models (LLMs) show promise in such tasks, their application to biomolecular problems is hindered by logical inconsistencies and the lack of grounding in domain knowledge. Existing approaches often exacerbate these issues: reasoning steps may deviate from biological facts or fail to capture long mechanistic dependencies. To address these challenges, we propose a Knowledge-Augmented Long-CoT Reasoning framework that integrates LLMs with knowledge graph-based multi-hop reasoning chains. The framework constructs mechanistic chains via guided multi-hop traversal and pruning on the knowledge graph; these chains are then incorporated into supervised fine-tuning to improve factual grounding and further refined with reinforcement learning to enhance reasoning reliability and consistency. Furthermore, to overcome the shortcomings of existing benchmarks, which are often restricted in scale and scope and lack annotations for deep reasoning chains, we introduce PrimeKGQA, a comprehensive benchmark for biomolecular question answering. Experimental results on both PrimeKGQA and existing datasets demonstrate that although larger closed-source models still perform well on relatively simple tasks, our method demonstrates clear advantages as reasoning depth increases, achieving state-of-the-art performance on multi-hop tasks that demand traversal of structured biological knowledge. These findings highlight the effectiveness of combining structured knowledge with advanced reasoning strategies for reliable and interpretable biomolecular reasoning.
中文摘要 了解复杂的生物分子机制需要跨分子相互作用、信号级联和代谢途径进行多步骤推理。虽然大型语言模型（LLM）在此类任务中显示出前景，但由于逻辑不一致和缺乏领域知识基础，它们在生物分子问题中的应用受到阻碍。现有的方法往往加剧了这些问题：推理步骤可能偏离生物学事实或无法捕捉长期的机制依赖关系。为了应对这些挑战，我们提出了一个知识增强的长 CoT 推理框架，该框架将 LLM 与基于知识图谱的多跳推理链集成在一起。该框架通过对知识图谱的引导多跳遍历和修剪来构建机制链;然后，这些链被纳入监督微调中，以改善事实基础，并通过强化学习进一步细化，以增强推理的可靠性和一致性。此外，为了克服现有基准的缺点，这些基准通常在规模和范围上受到限制，并且缺乏深度推理链的注释，我们引入了 PrimeKGQA，这是一个生物分子问答的综合基准。在 PrimeKGQA 和现有数据集上的实验结果表明，尽管较大的闭源模型在相对简单的任务上仍然表现良好，但随着推理深度的增加，我们的方法表现出明显的优势，在需要遍历结构化生物知识的多跳任务上实现了最先进的性能。这些发现凸显了将结构化知识与先进推理策略相结合以实现可靠且可解释的生物分子推理的有效性。

Dynamic Sparsity: Challenging Common Sparsity Assumptions for Learning World Models in Robotic Reinforcement Learning Benchmarks

动态稀疏性：在机器人强化学习基准中挑战学习世界模型的常见稀疏性假设

Authors: Muthukumar Pandaram, Jakob Hollenstein, David Drexel, Samuele Tosatto, Antonio Rodríguez-Sánchez, Justus Piater
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.08086
Pdf link: https://arxiv.org/pdf/2511.08086
Abstract The use of learned dynamics models, also known as world models, can improve the sample efficiency of reinforcement learning. Recent work suggests that the underlying causal graphs of such dynamics models are sparsely connected, with each of the future state variables depending only on a small subset of the current state variables, and that learning may therefore benefit from sparsity priors. Similarly, temporal sparsity, i.e. sparsely and abruptly changing local dynamics, has also been proposed as a useful inductive bias. In this work, we critically examine these assumptions by analyzing ground-truth dynamics from a set of robotic reinforcement learning environments in the MuJoCo Playground benchmark suite, aiming to determine whether the proposed notions of state and temporal sparsity actually tend to hold in typical reinforcement learning tasks. We study (i) whether the causal graphs of environment dynamics are sparse, (ii) whether such sparsity is state-dependent, and (iii) whether local system dynamics change sparsely. Our results indicate that global sparsity is rare, but instead the tasks show local, state-dependent sparsity in their dynamics and this sparsity exhibits distinct structures, appearing in temporally localized clusters (e.g., during contact events) and affecting specific subsets of state dimensions. These findings challenge common sparsity prior assumptions in dynamics learning, emphasizing the need for grounded inductive biases that reflect the state-dependent sparsity structure of real-world dynamics.
中文摘要 使用学习动力学模型，也称为世界模型，可以提高强化学习的样本效率。最近的研究表明，此类动力学模型的基本因果图是稀疏连接的，每个未来状态变量仅依赖于当前状态变量的一小部分，因此学习可能受益于稀疏先验。同样，时间稀疏性，即稀疏且突然变化的局部动态，也被认为是一种有用的归纳偏差。在这项工作中，我们通过分析 MuJoCo Playground 基准测试套件中一组机器人强化学习环境的地面实况动态来批判性地检查这些假设，旨在确定所提出的状态和时间稀疏性概念是否实际上倾向于在典型的强化学习任务中成立。我们研究了（i）环境动力学的因果图是否稀疏，（ii）这种稀疏性是否与状态相关，以及（iii）局部系统动力学是否稀疏变化。我们的结果表明，全局稀疏性很少见，但任务在其动态中表现出局部的、依赖于状态的稀疏性，并且这种稀疏性表现出不同的结构，出现在时间局部的簇中（例如，在接触事件期间）并影响状态维度的特定子集。这些发现挑战了动力学学习中常见的稀疏先验假设，强调需要有根据的归纳偏差来反映现实世界动力学的状态依赖性稀疏结构。

A Small Leak Sinks All: Exploring the Transferable Vulnerability of Source Code Models

一个小小的泄漏会淹没一切：探索源代码模型的可转移漏洞

Authors: Weiye Li, Wenyi Tang
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2511.08127
Pdf link: https://arxiv.org/pdf/2511.08127
Abstract Source Code Models (SCMs) learn the proper embeddings from source code, demonstrating significant success in various software engineering or security tasks. The recent explosive development of LLMs extends the family of SCMs, bringing LLMs for code (LLM4Code) that revolutionize development workflows. Investigating different kinds of SCM vulnerability is the cornerstone for the security and trustworthiness of AI-powered software ecosystems; however, the fundamental one, transferable vulnerability, remains critically underexplored. Existing studies neither offer practical ways to produce effective adversarial samples for adversarial defense (they require access to the downstream classifier of SCMs), nor give heed to the widely used LLM4Code in modern software development platforms and cloud-based integrated development environments. Therefore, this work systematically studies the intrinsic vulnerability transferability of both traditional SCMs and LLM4Code, and proposes a victim-agnostic approach to generate practical adversarial samples. We design HABITAT, consisting of a tailored perturbation-inserting mechanism and a hierarchical Reinforcement Learning framework that adaptively selects optimal perturbations without requiring any access to the downstream classifier of SCMs. Furthermore, an intrinsic transferability analysis of SCM vulnerabilities is conducted, revealing the potential vulnerability correlation between traditional SCMs and LLM4Code, together with fundamental factors that govern the success rate of victim-agnostic transfer attacks. These findings of SCM vulnerabilities underscore the critical focal points for developing robust defenses in the future. Experimental evaluation demonstrates that our constructed adversarial examples crafted based on traditional SCMs achieve up to 64% success rates against LLM4Code, surpassing the state-of-the-art by over 15%.
中文摘要 源代码模型从源代码中学习正确的嵌入，在各种软件工程或安全任务中取得了巨大成功。LLM 最近的爆炸性发展扩展了 SCM 系列，为代码带来了彻底改变开发工作流程的 LLM。调查不同类型的 SCM 漏洞是人工智能驱动的软件生态系统安全性和可信度的基石，然而，最基本的漏洞，即可转移漏洞，仍然没有得到充分的探索。现有的研究既没有提供实用的方法，即需要访问 SCM 的下游分类器来生成有效的对抗性样本以进行对抗性防御，也没有关注现代软件开发平台和基于云的集成开发环境中广泛使用的 LLM4Code。因此，这项工作系统地研究了传统 SCM 和 LLM4Code 的内在漏洞可转移性，并提出了一种与受害者无关的方法来生成实际的对抗样本。我们设计了 HABITAT，由定制的扰动插入机制和分层强化学习框架组成，该框架自适应地选择最佳扰动，而无需访问 SCM 的下游分类器。此外，还对SCM漏洞进行了内在的可转移性分析，揭示了传统SCM与LLM4Code之间潜在的漏洞相关性，以及影响与受害者无关的转移攻击成功率的基本因素。这些对 SCM 漏洞的发现强调了未来开发强大防御的关键焦点。实验评估表明，我们基于传统 SCM 构建的对抗示例在 LLM4Code 上实现了高达 64% 的成功率，比最先进的技术高出 15% 以上。

BIPPO: Budget-Aware Independent PPO for Energy-Efficient Federated Learning Services

BIPPO：用于节能联合学习服务的预算意识独立 PPO

Authors: Anna Lackinger, Andrea Morichetta, Pantelis A. Frangoudis, Schahram Dustdar
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2511.08142
Pdf link: https://arxiv.org/pdf/2511.08142
Abstract Federated Learning (FL) is a promising machine learning solution in large-scale IoT systems, guaranteeing load distribution and privacy. However, FL does not natively consider infrastructure efficiency, a critical concern for systems operating in resource-constrained environments. Several Reinforcement Learning (RL) based solutions offer improved client selection for FL; however, they do not consider infrastructure challenges, such as resource limitations and device churn. Furthermore, the training of RL methods is often not designed for practical application, as these approaches frequently do not consider generalizability and are not optimized for energy efficiency. To fill this gap, we propose BIPPO (Budget-aware Independent Proximal Policy Optimization), which is an energy-efficient multi-agent RL solution that improves performance. We evaluate BIPPO on two image classification tasks run in a highly budget-constrained setting, with FL clients training on non-IID data, a challenging context for vanilla FL. The improved sampler of BIPPO enables it to increase the mean accuracy compared to non-RL mechanisms, traditional PPO, and IPPO. In addition, BIPPO only consumes a negligible proportion of the budget, which stays consistent even if the number of clients increases. Overall, BIPPO delivers a performant, stable, scalable, and sustainable solution for client selection in IoT-FL.
中文摘要 联邦学习（FL）是大规模物联网系统中一种很有前途的机器学习解决方案，可保证负载分配和隐私。然而，FL 本身并不考虑基础设施效率，这是在资源受限环境中运行的系统的一个关键问题。几种基于强化学习（RL）的解决方案改进了 FL 的客户端选择;但是，他们没有考虑基础设施挑战，例如资源限制和设备流失。此外，RL 方法的训练通常不是为实际应用而设计的，因为这些方法通常不考虑通用性，也没有针对能源效率进行优化。为了填补这一空白，我们提出了 BIPPO（预算感知独立近端策略优化），这是一种节能的多智能体 RL 解决方案，可提高性能。我们在预算高度受限的环境中运行的两项图像分类任务上评估了 BIPPO，FL 客户端在非 IID 数据上进行训练，这对普通 FL 来说是一个具有挑战性的环境。与非RL机制、传统PPO和IPPO相比，BIPPO的改进采样器使其能够提高平均精度。此外，BIPPO 只消耗了可以忽略不计的预算比例，即使客户数量增加，预算也会保持一致。总体而言，BIPPO 为 IoT-FL 中的客户选择提供了高性能、稳定、可扩展且可持续的解决方案。

An Efficient Training Pipeline for Reasoning Graphical User Interface Agents

用于推理图形用户界面代理的高效训练管道

Authors: Georgios Pantazopoulos, Eda B. Özyiğit
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.08172
Pdf link: https://arxiv.org/pdf/2511.08172
Abstract Visual grounding is the task of localising image regions from natural language queries and is critical for reasoning-capable Graphical User Interface agents. Many existing methods rely on massive, noisy synthetic datasets. This work introduces an efficient training pipeline that combines model-based data filtering with parameter-efficient fine-tuning. From 4.8M synthetic examples, 12K clean and diverse instances are curated by first identifying challenging cases, removing misaligned ones, and then selecting a diverse set of multimodal instances. On this data, a 3B-parameter Vision-Language Model is trained under three regimes: supervised fine-tuning, chain-of-thought-augmented fine-tuning, and reinforcement learning via Group Relative Policy Optimization. Models trained with the filtered data and lightweight training strategies match or surpass larger baselines on benchmarks such as ScreenSpot, Multimodal-Mind2Web, and AndroidControl. These results demonstrate that principled data curation and robust adaptation can rival large-scale training, enabling compact yet capable multimodal reasoning agents.
中文摘要 视觉定位是根据自然语言查询在图像中定位区域的任务，对于具备推理能力的图形用户界面代理至关重要。许多现有方法依赖于大量含噪声的合成数据集。本工作引入了一个高效的训练管道，将基于模型的数据过滤与参数高效的微调相结合。从 4.8M 的合成示例中，通过首先识别具有挑战性的情况，去除未对齐的实例，然后选择一组多样化的多模态实例，筛选出 12K 干净且多样化的实例。在这些数据上，以三种方式训练 3B 参数的视觉语言模型：监督微调、思维链增强微调和通过群体相对策略优化的强化学习。使用过滤数据和轻量级训练策略训练的模型在 ScreenSpot、Multimodal-Mind2Web 和 AndroidControl 等基准测试中达到或超过更大的基线。这些结果表明，有原则的数据管理和稳健的适应可以与大规模训练相媲美，从而实现紧凑而强大的多模态推理代理。

UI2Code$^\text{N}$: A Visual Language Model for Test-Time Scalable Interactive UI-to-Code Generation

UI2Code$^\text{N}$：用于测试时可扩展交互式 UI 到代码生成的可视化语言模型

Authors: Zhen Yang, Wenyi Hong, Mingde Xu, Xinyue Fan, Weihan Wang, Jiele Cheng, Xiaotao Gu, Jie Tang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2511.08195
Pdf link: https://arxiv.org/pdf/2511.08195
Abstract User interface (UI) programming is a core yet highly complex part of modern software development. Recent advances in visual language models (VLMs) highlight the potential of automatic UI coding, but current approaches face two key limitations: multimodal coding capabilities remain underdeveloped, and single-turn paradigms make little use of iterative visual feedback. We address these challenges with an interactive UI-to-code paradigm that better reflects real-world workflows and raises the upper bound of achievable performance. Under this paradigm, we present UI2Code$^\text{N}$, a visual language model trained through staged pretraining, fine-tuning, and reinforcement learning to achieve foundational improvements in multimodal coding. The model unifies three key capabilities: UI-to-code generation, UI editing, and UI polishing. We further explore test-time scaling for interactive generation, enabling systematic use of multi-turn feedback. Experiments on UI-to-code and UI polishing benchmarks show that UI2Code$^\text{N}$ establishes a new state of the art among open-source models and achieves performance comparable to leading closed-source models such as Claude-4-Sonnet and GPT-5. Our code and models are available at this https URL.
中文摘要 用户界面（UI）编程是现代软件开发的核心但高度复杂的部分。视觉语言模型（VLM）的最新进展凸显了自动 UI 编码的潜力，但当前的方法面临两个关键限制：多模态编码功能仍然不发达，单轮范式很少利用迭代视觉反馈。我们通过交互式 UI 到代码范式来应对这些挑战，该范式可以更好地反映现实世界的工作流程并提高可实现性能的上限。在这种范式下，我们提出了 UI2Code$^\text{N}$，这是一种通过分阶段预训练、微调和强化学习训练的视觉语言模型，以实现多模态编码的基础改进。该模型统一了三个关键功能：UI 到代码生成、UI 编辑和 UI 润色。我们进一步探索了交互式生成的测试时缩放，从而能够系统地使用多轮反馈。对 UI-to-code 和 UI 打磨基准的实验表明，UI2Code$^\text{N}$ 在开源模型中建立了新的技术水平，并实现了与 Claude-4-Sonnet 和 GPT-5 等领先闭源模型相媲美的性能。我们的代码和模型可在此 https URL 中找到。

Beyond Distributions: Geometric Action Control for Continuous Reinforcement Learning

超越分布：用于连续强化学习的几何动作控制

Authors: Zhihao Lin
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.08234
Pdf link: https://arxiv.org/pdf/2511.08234
Abstract Gaussian policies have dominated continuous control in deep reinforcement learning (RL), yet they suffer from a fundamental mismatch: their unbounded support requires ad-hoc squashing functions that distort the geometry of bounded action spaces. While von Mises-Fisher (vMF) distributions offer a theoretically grounded alternative on the sphere, their reliance on Bessel functions and rejection sampling hinders practical adoption. We propose Geometric Action Control (GAC), a novel action generation paradigm that preserves the geometric benefits of spherical distributions while simplifying computation. GAC decomposes action generation into a direction vector and a learnable concentration parameter, enabling efficient interpolation between deterministic actions and uniform spherical noise. This design reduces parameter count from $2d$ to $d+1$, and avoids the $O(dk)$ complexity of vMF rejection sampling, achieving simple $O(d)$ operations. Empirically, GAC consistently matches or exceeds state-of-the-art methods across six MuJoCo benchmarks, achieving 37.6% improvement over SAC on Ant-v4 and the best results on 4 out of 6 tasks. Our ablation studies reveal that both spherical normalization and adaptive concentration control are essential to GAC's success. These findings suggest that robust and efficient continuous control does not require complex distributions, but a principled respect for the geometry of action spaces. Code and pretrained models are available in supplementary materials.
中文摘要 高斯策略在深度强化学习（RL）中主导了连续控制，但它们存在一个根本的不匹配：它们的无界支撑集需要临时的压缩函数，这些函数会扭曲有界动作空间的几何形状。虽然冯·米塞斯-费舍尔（vMF）分布在球面上提供了有理论依据的替代方案，但它们对贝塞尔函数和拒绝采样的依赖阻碍了实际采用。我们提出了几何动作控制（GAC），这是一种新颖的动作生成范式，它保留了球面分布的几何优势，同时简化计算。GAC 将动作生成分解为方向向量和可学习的集中度参数，从而实现确定性动作和均匀球面噪声之间的高效插值。这种设计将参数数量从 $2d$ 减少到 $d+1$，并避免了 vMF 拒绝采样的 $O(dk)$ 复杂度，仅需简单的 $O(d)$ 运算。根据经验，GAC 在 6 个 MuJoCo 基准测试中始终匹配或超过最先进的方法，在 Ant-v4 上比 SAC 提高了 37.6%，并且在 6 项任务中的 4 项中取得了最佳结果。我们的消融研究表明，球面归一化和自适应集中度控制对于 GAC 的成功至关重要。这些发现表明，稳健高效的连续控制不需要复杂的分布，而是需要有原则地尊重动作空间的几何形状。代码和预训练模型在补充材料中可用。
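
As a rough illustration of the interpolation idea described in the GAC abstract above, the following Python sketch blends a unit direction with uniform spherical noise. The mixing weight `kappa / (1 + kappa)` and the names `sample_action`, `direction`, and `kappa` are assumptions made for this example; the abstract does not state the paper's actual rule.

```python
import math
import random
from typing import List, Optional


def _normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def sample_action(direction: List[float], kappa: float,
                  rng: Optional[random.Random] = None) -> List[float]:
    """Return a unit-norm action blended from a direction and spherical noise."""
    if kappa < 0.0:
        raise ValueError("kappa must be non-negative")
    rng = rng or random.Random()
    mean_dir = _normalize(direction)
    # Normalizing an isotropic Gaussian sample gives a uniform point on the sphere.
    noise = _normalize([rng.gauss(0.0, 1.0) for _ in mean_dir])
    # Assumed mixing rule (hypothetical): the weight approaches 1 as kappa grows
    # (near-deterministic action) and equals 0 at kappa = 0 (pure noise).
    weight = kappa / (1.0 + kappa)
    mixed = [weight * m + (1.0 - weight) * n for m, n in zip(mean_dir, noise)]
    return _normalize(mixed)
```

Every step here is a single pass over the $d$ action dimensions, so the sketch costs $O(d)$ per sample and needs neither Bessel functions nor rejection sampling.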

PrefPoE: Advantage-Guided Preference Fusion for Learning Where to Explore

PrefPoE：优势引导的偏好融合，用于学习在何处探索

Authors: Zhihao Lin, Lin Wu, Zhen Tian, Jianglin Lan
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.08241
Pdf link: https://arxiv.org/pdf/2511.08241
Abstract Exploration in reinforcement learning remains a critical challenge, as naive entropy maximization often results in high variance and inefficient policy updates. We introduce PrefPoE, a novel Preference-Product-of-Experts framework that performs intelligent, advantage-guided exploration via the first principled application of product-of-experts (PoE) fusion for single-task exploration-exploitation balancing. By training a preference network to concentrate probability mass on high-advantage actions and fusing it with the main policy through PoE, PrefPoE creates a soft trust region that stabilizes policy updates while maintaining targeted exploration. Across diverse control tasks spanning both continuous and discrete action spaces, PrefPoE demonstrates consistent improvements: +321% on HalfCheetah-v4 (1276 $\rightarrow$ 5375), +69% on Ant-v4, +276% on LunarLander-v2, with consistently enhanced training stability and sample efficiency. Unlike standard PPO, which suffers from entropy collapse, PrefPoE sustains adaptive exploration through its unique dynamics, thereby preventing premature convergence and enabling superior performance. Our results establish that learning where to explore through advantage-guided preferences is as crucial as learning how to act, offering a general framework for enhancing policy gradient methods across the full spectrum of reinforcement learning domains. Code and pretrained models are available in supplementary materials.
中文摘要 强化学习中的探索仍然是一个严峻的挑战，因为朴素的熵最大化通常会导致高方差和低效的策略更新。我们介绍了 PrefPoE，这是一个新颖的偏好专家乘积（Preference-Product-of-Experts）框架，它通过专家乘积（PoE）融合的首次原则性应用来执行智能的、优势引导的探索，以实现单任务探索-利用平衡。通过训练偏好网络将概率质量集中在高优势动作上，并通过 PoE 将其与主策略融合，PrefPoE 创建了一个软信任域，在保持有针对性的探索的同时稳定策略更新。在跨越连续和离散动作空间的各种控制任务中，PrefPoE 表现出一致的改进：HalfCheetah-v4 上 +321%（1276 $\rightarrow$ 5375），Ant-v4 上 +69%，LunarLander-v2 上 +276%，训练稳定性和样本效率持续增强。与遭受熵坍缩的标准 PPO 不同，PrefPoE 通过其独特的动力学维持自适应探索，从而防止过早收敛并实现卓越的性能。我们的结果表明，通过优势引导的偏好学习在何处探索与学习如何行动同样重要，为增强整个强化学习领域的策略梯度方法提供了一个通用框架。代码和预训练模型在补充材料中可用。
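
To make the product-of-experts fusion mentioned in the PrefPoE abstract concrete, here is a minimal Python sketch for the one-dimensional Gaussian case. It assumes both the main policy and the preference network output Gaussian action distributions, which the abstract does not confirm; the function name `fuse_gaussian_experts` and its parameters are hypothetical. The closed form itself is the standard result that a product of two Gaussian densities is proportional to a Gaussian whose precision is the sum of the two precisions.

```python
import math
from typing import Tuple


def fuse_gaussian_experts(mu_policy: float, sigma_policy: float,
                          mu_pref: float, sigma_pref: float) -> Tuple[float, float]:
    """Fuse two 1-D Gaussian experts by multiplying their densities."""
    if sigma_policy <= 0.0 or sigma_pref <= 0.0:
        raise ValueError("standard deviations must be positive")
    # Precision (inverse variance) of each expert.
    prec_policy = 1.0 / sigma_policy ** 2
    prec_pref = 1.0 / sigma_pref ** 2
    # Precisions add; the fused mean is the precision-weighted average.
    prec_fused = prec_policy + prec_pref
    mu_fused = (prec_policy * mu_policy + prec_pref * mu_pref) / prec_fused
    sigma_fused = math.sqrt(1.0 / prec_fused)
    return mu_fused, sigma_fused
```

In this Gaussian case the fused standard deviation is never larger than either input's, so the preference expert can only narrow the region from which actions are sampled.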

Where and What Matters: Sensitivity-Aware Task Vectors for Many-Shot Multimodal In-Context Learning

何处何事重要：用于多样本多模态上下文学习的敏感性感知任务向量

Authors: Ziyu Ma, Chenhui Gou, Yiming Hu, Yong Wang, Xiangxiang Chu, Bohan Zhuang, Jianfei Cai
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.08246
Pdf link: https://arxiv.org/pdf/2511.08246
Abstract Large Multimodal Models (LMMs) have shown promising in-context learning (ICL) capabilities, but scaling to many-shot settings remains difficult due to limited context length and high inference cost. To address these challenges, task-vector-based methods have been explored by inserting compact representations of many-shot in-context demonstrations into model activations. However, existing task-vector-based methods either overlook the importance of where to insert task vectors or struggle to determine suitable values for each location. To this end, we propose a novel Sensitivity-aware Task Vector insertion framework (STV) to figure out where and what to insert. Our key insight is that activation deltas across query-context pairs exhibit consistent structural patterns, providing a reliable cue for insertion. Based on the identified sensitive-aware locations, we construct a pre-clustered activation bank for each location by clustering the activation values, and then apply reinforcement learning to choose the most suitable one to insert. We evaluate STV across a range of multimodal models (e.g., Qwen-VL, Idefics-2) and tasks (e.g., VizWiz, OK-VQA), demonstrating its effectiveness and showing consistent improvements over previous task-vector-based methods with strong generalization.
中文摘要 大型多模态模型（LMM）已显示出有前途的上下文学习（ICL）功能，但由于上下文长度有限且推理成本高，扩展到多样本设置仍然很困难。为了应对这些挑战，人们探索了基于任务向量的方法，方法是将多样本上下文演示的紧凑表示插入模型激活中。然而，现有的基于任务向量的方法要么忽视了在何处插入任务向量的重要性，要么难以确定每个位置的合适值。为此，我们提出了一种新颖的灵敏度感知任务向量插入框架（STV）来确定插入的位置和内容。我们的关键见解是，查询上下文对之间的激活增量表现出一致的结构模式，为插入提供了可靠的提示。基于识别的敏感感知位置，我们通过对激活值进行聚类，为每个位置构建一个预聚类激活库，然后应用强化学习选择最合适的一个进行插入。我们跨一系列多模态模型（例如 Qwen-VL、Idefics-2）和任务（例如 VizWiz、OK-VQA）评估 STV，证明了其有效性，并显示出与以前基于任务向量的方法相比具有很强泛化性的持续改进。

AgentPRM: Process Reward Models for LLM Agents via Step-Wise Promise and Progress

AgentPRM：基于逐步前景与进展的 LLM 智能体过程奖励模型

Authors: Zhiheng Xi, Chenyang Liao, Guanyu Li, Yajie Yang, Wenxiang Chen, Zhihao Zhang, Binghai Wang, Senjie Jin, Yuhao Zhou, Jian Guan, Wei Wu, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.08325
Pdf link: https://arxiv.org/pdf/2511.08325
Abstract Despite rapid development, large language models (LLMs) still encounter challenges in multi-turn decision-making tasks (i.e., agent tasks) like web shopping and browser navigation, which require making a sequence of intelligent decisions based on environmental feedback. Previous work for LLM agents typically relies on elaborate prompt engineering or fine-tuning with expert trajectories to improve performance. In this work, we take a different perspective: we explore constructing process reward models (PRMs) to evaluate each decision and guide the agent's decision-making process. Unlike LLM reasoning, where each step is scored based on correctness, actions in agent tasks do not have a clear-cut correctness. Instead, they should be evaluated based on their proximity to the goal and the progress they have made. Building on this insight, we propose a re-defined PRM for agent tasks, named AgentPRM, to capture both the interdependence between sequential decisions and their contribution to the final goal. This enables better progress tracking and exploration-exploitation balance. To scalably obtain labeled data for training AgentPRM, we employ a Temporal Difference-based (TD-based) estimation method combined with Generalized Advantage Estimation (GAE), which proves more sample-efficient than prior methods. Extensive experiments across different agentic tasks show that AgentPRM is over $8\times$ more compute-efficient than baselines, and it demonstrates robust improvement when scaling up test-time compute. Moreover, we perform detailed analyses to show how our method works and offer more insights, e.g., applying AgentPRM to the reinforcement learning of LLM agents.
中文摘要 尽管发展迅速，但大型语言模型（LLM）在网络购物和浏览器导航等多轮决策任务（即代理任务）中仍然面临挑战，这些任务需要根据环境反馈做出一系列智能决策。LLM 代理以前的工作通常依赖于精心设计的提示工程或专家轨迹的微调来提高性能。在这项工作中，我们采取了不同的视角：我们探索构建过程奖励模型（PRM）来评估每个决策并指导智能体的决策过程。与 LLM 推理不同，LLM 推理中的每个步骤都根据正确性进行评分，而代理任务中的动作没有明确的正确性。相反，应该根据这些动作与目标的接近程度及其取得的进展来评估。基于这一见解，我们提出了一个重新定义的代理任务 PRM，称为 AgentPRM，以捕获顺序决策之间的相互依赖性及其对最终目标的贡献。这可以实现更好的进度跟踪和探索-利用平衡。为了可扩展地获得用于训练 AgentPRM 的标记数据，我们采用了基于时序差分（TD）的估计方法与广义优势估计（GAE）相结合，这被证明比以前的方法更具样本效率。跨不同代理任务的广泛实验表明，AgentPRM 的计算效率比基线高出 8 倍以上，并且在扩展测试时计算时表现出稳健的改进。此外，我们还进行了详细的分析，以展示我们的方法是如何工作的，并提供更多见解，例如，将 AgentPRM 应用于 LLM 代理的强化学习。
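
The AgentPRM abstract above refers to TD-based estimation combined with Generalized Advantage Estimation (GAE). The sketch below shows the textbook GAE recursion in plain Python; it is not taken from the paper, and the function name `gae_advantages` and the default `gamma` and `lam` values are assumptions for illustration only.

```python
from typing import List


def gae_advantages(rewards: List[float], values: List[float],
                   gamma: float = 0.99, lam: float = 0.95) -> List[float]:
    """Compute GAE advantages for one trajectory.

    `values` must hold one more entry than `rewards`: the last entry is the
    bootstrap value of the state reached after the final step.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have len(rewards) + 1 entries")
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the discounted sum of later TD errors.
    for t in reversed(range(len(rewards))):
        td_error = rewards[t] + gamma * values[t + 1] - values[t]
        running = td_error + gamma * lam * running
        advantages[t] = running
    return advantages
```

Setting `lam=0` reduces each advantage to the one-step TD error, while `lam=1` gives the discounted Monte Carlo return minus the value baseline.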

LPPG-RL: Lexicographically Projected Policy Gradient Reinforcement Learning with Subproblem Exploration

LPPG-RL：带子问题探索的字典序投影策略梯度强化学习

Authors: Ruiyu Qiu, Rui Wang, Guanghui Yang, Xiang Li, Zhijiang Shao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.08339
Pdf link: https://arxiv.org/pdf/2511.08339
Abstract Lexicographic multi-objective problems, which consist of multiple conflicting subtasks with explicit priorities, are common in real-world applications. Despite the advantages of Reinforcement Learning (RL) in single tasks, extending conventional RL methods to prioritized multiple objectives remains challenging. In particular, traditional Safe RL and Multi-Objective RL (MORL) methods have difficulty enforcing priority orderings efficiently. Therefore, Lexicographic Multi-Objective RL (LMORL) methods have been developed to address these challenges. However, existing LMORL methods either rely on heuristic threshold tuning with prior knowledge or are restricted to discrete domains. To overcome these limitations, we propose Lexicographically Projected Policy Gradient RL (LPPG-RL), a novel LMORL framework which leverages sequential gradient projections to identify feasible policy update directions, thereby making LPPG-RL broadly compatible with all policy gradient algorithms in continuous spaces. LPPG-RL reformulates the projection step as an optimization problem, and utilizes Dykstra's projection rather than generic solvers to deliver great speedups, especially for small- to medium-scale instances. In addition, LPPG-RL introduces Subproblem Exploration (SE) to prevent gradient vanishing, accelerate convergence and enhance stability. We provide theoretical guarantees for convergence and establish a lower bound on policy improvement. Finally, through extensive experiments in a 2D navigation environment, we demonstrate the effectiveness of LPPG-RL, showing that it outperforms existing state-of-the-art continuous LMORL methods.
中文摘要 字典序多目标问题由多个具有明确优先级的冲突子任务组成，在实际应用中很常见。尽管强化学习（RL）在单项任务中具有优势，但将传统的强化学习方法扩展到带优先级的多个目标仍然具有挑战性。特别是，传统的安全 RL 和多目标 RL（MORL）方法难以有效地执行优先级排序。因此，已经开发了字典序多目标 RL（LMORL）方法来应对这些挑战。然而，现有的 LMORL 方法要么依赖于具有先验知识的启发式阈值调整，要么仅限于离散域。为了克服这些限制，我们提出了字典序投影策略梯度 RL（LPPG-RL），这是一种新型的 LMORL 框架，它利用顺序梯度投影来识别可行的策略更新方向，从而使 LPPG-RL 与连续空间中的所有策略梯度算法广泛兼容。LPPG-RL 将投影步骤重新表述为优化问题，并利用 Dykstra 投影而不是通用求解器来提供极大的加速，特别是对于中小型实例。此外，LPPG-RL 引入了子问题探索（Subproblem Exploration，SE）来防止梯度消失，加速收敛并增强稳定性。我们为收敛性提供了理论保证，并建立了策略改进的下界。最后，通过在二维导航环境中进行大量实验，我们证明了 LPPG-RL 的有效性，表明它优于现有的最先进的连续 LMORL 方法。
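
The LPPG-RL abstract above mentions Dykstra's projection for finding feasible update directions. The following Python sketch shows the generic Dykstra algorithm for projecting a vector onto an intersection of half-spaces. Modelling each higher-priority objective as the constraint "the update must not decrease it to first order", i.e. the half-space of directions $d$ with $\langle g_i, d \rangle \ge 0$, is an assumption of this example and not necessarily the paper's formulation; all function names are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def _project_halfspace(point: Vector, normal: Sequence[float]) -> Vector:
    """Project `point` onto the half-space {d : <normal, d> >= 0}."""
    norm_sq = _dot(normal, normal)
    violation = _dot(normal, point)
    if norm_sq == 0.0 or violation >= 0.0:
        return list(point)
    return [p - (violation / norm_sq) * n for p, n in zip(point, normal)]


def dykstra_project(point: Vector, normals: Sequence[Sequence[float]],
                    iterations: int = 100) -> Vector:
    """Approximate the projection of `point` onto the intersection of half-spaces."""
    x = list(point)
    # One correction vector per constraint set, as Dykstra's method requires.
    corrections = [[0.0] * len(x) for _ in normals]
    for _ in range(iterations):
        for i, normal in enumerate(normals):
            shifted = [xi + ci for xi, ci in zip(x, corrections[i])]
            projected = _project_halfspace(shifted, normal)
            corrections[i] = [s - p for s, p in zip(shifted, projected)]
            x = projected
    return x
```

Under the stated assumption, `point` would be the gradient of a lower-priority objective and `normals` the gradients of the higher-priority ones. Unlike plain alternating projections, the correction terms make the iterates converge to the closest feasible point rather than to an arbitrary one.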

ARAC: Adaptive Regularized Multi-Agent Soft Actor-Critic in Graph-Structured Adversarial Games

ARAC：图结构对抗博弈中的自适应正则化多智能体软 Actor-Critic

Authors: Ruochuan Shi, Runyu Lu, Yuanheng Zhu, Dongbin Zhao
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.08412
Pdf link: https://arxiv.org/pdf/2511.08412
Abstract In graph-structured multi-agent reinforcement learning (MARL) adversarial tasks such as pursuit and confrontation, agents must coordinate under highly dynamic interactions, where sparse rewards hinder efficient policy learning. We propose Adaptive Regularized Multi-Agent Soft Actor-Critic (ARAC), which integrates an attention-based graph neural network (GNN) for modeling agent dependencies with an adaptive divergence regularization mechanism. The GNN enables expressive representation of spatial relations and state features in graph environments. Divergence regularization can serve as policy guidance to alleviate the sparse reward problem, but it may lead to suboptimal convergence when the reference policy itself is imperfect. The adaptive divergence regularization mechanism enables the framework to exploit reference policies for efficient exploration in the early stages, while gradually reducing reliance on them as training progresses to avoid inheriting their limitations. Experiments in pursuit and confrontation scenarios demonstrate that ARAC achieves faster convergence, higher final success rates, and stronger scalability across varying numbers of agents compared with MARL baselines, highlighting its effectiveness in complex graph-structured environments.
中文摘要 在图结构多智能体强化学习（MARL）对抗任务（如追击和对抗）中，智能体必须在高度动态的交互下进行协调，其中稀疏的奖励阻碍了高效的策略学习。我们提出了自适应正则化多智能体软 Actor-Critic （ARAC），它集成了基于注意力的图神经网络（GNN），用于对代理依赖关系进行建模，并具有自适应发散正则化机制。GNN 可以在图环境中表达空间关系和状态特征。发散正则化可以作为缓解稀疏奖励问题的策略指导，但当参考策略本身不完善时，可能会导致收敛次优。自适应发散正则化机制使框架能够在早期阶段利用参考策略进行高效探索，同时随着训练的进行逐渐减少对参考策略的依赖，避免继承其局限性。追击和对抗场景的实验表明，与 MARL 基线相比，ARAC 在不同数量的代理上实现了更快的收敛、更高的最终成功率和更强的可扩展性，凸显了其在复杂图结构环境中的有效性。

Understanding Electro-communication and Electro-sensing in Weakly Electric Fish using Multi-Agent Deep Reinforcement Learning

利用多智能体深度强化学习了解弱电鱼的电通信和电传感

Authors: Satpreet H. Singh, Sonja Johnson-Yu, Zhouyang Lu, Aaron Walsman, Federico Pedraja, Denis Turcu, Pratyusha Sharma, Naomi Saphra, Nathaniel B. Sawtell, Kanaka Rajan
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Systems and Control (eess.SY); Neurons and Cognition (q-bio.NC)
Arxiv link: https://arxiv.org/abs/2511.08436
Pdf link: https://arxiv.org/pdf/2511.08436
Abstract Weakly electric fish, like Gnathonemus petersii, use a remarkable electrical modality for active sensing and communication, but studying their rich electrosensing and electrocommunication behavior and associated neural activity in naturalistic settings remains experimentally challenging. Here, we present a novel biologically-inspired computational framework to study these behaviors, where recurrent neural network (RNN) based artificial agents trained via multi-agent reinforcement learning (MARL) learn to modulate their electric organ discharges (EODs) and movement patterns to collectively forage in virtual environments. Trained agents demonstrate several emergent features consistent with real fish collectives, including heavy tailed EOD interval distributions, environmental context dependent shifts in EOD interval distributions, and social interaction patterns like freeloading, where agents reduce their EOD rates while benefiting from neighboring agents' active sensing. A minimal two-fish assay further isolates the role of electro-communication, showing that access to conspecific EODs and relative dominance jointly shape foraging success. Notably, these behaviors emerge through evolution-inspired rewards for individual fitness and emergent inter-agent interactions, rather than through rewarding agents explicitly for social interactions. Our work has broad implications for the neuroethology of weakly electric fish, as well as other social, communicating animals in which extensive recordings from multiple individuals, and thus traditional data-driven modeling, are infeasible.
中文摘要 弱电鱼，如 Gnathonemus petersii，使用一种卓越的电模态进行主动传感和通信，但在自然环境中研究它们丰富的电传感和电通信行为以及相关的神经活动在实验上仍然具有挑战性。在这里，我们提出了一种受生物学启发的新型计算框架来研究这些行为，其中通过多智能体强化学习（MARL）训练的基于循环神经网络（RNN）的人工智能体学习调节其电器官放电（EOD）和运动模式，以在虚拟环境中集体觅食。训练后的智能体展示了几个与真实鱼类群体一致的涌现特征，包括重尾 EOD 间隔分布、EOD 间隔分布随环境背景的变化，以及搭便车等社会互动模式，即智能体降低自身的 EOD 率，同时受益于相邻智能体的主动传感。一个最简的双鱼实验进一步分离了电通信的作用，表明能否获取同种个体的 EOD 和相对优势地位共同影响觅食成功。值得注意的是，这些行为是通过受进化启发的个体适应度奖励和涌现的智能体间互动而出现的，而不是通过明确奖励智能体的社交互动。我们的工作对弱电鱼以及其他社会性、会交流的动物的神经行为学具有广泛的影响，在这些动物中，来自多个个体的大量记录不可行，因而传统的数据驱动建模也不可行。

RESTL: Reinforcement Learning Guided by Multi-Aspect Rewards for Signal Temporal Logic Transformation

RESTL：多方面奖励指导的信号时间逻辑变换强化学习

Authors: Yue Fang, Jin Zhi, Jie An, Hongshen Chen, Xiaohong Chen, Naijun Zhan
Subjects: Formal Languages and Automata Theory (cs.FL)
Arxiv link: https://arxiv.org/abs/2511.08555
Pdf link: https://arxiv.org/pdf/2511.08555
Abstract Signal Temporal Logic (STL) is a powerful formal language for specifying real-time specifications of Cyber-Physical Systems (CPS). Transforming specifications written in natural language into STL formulas automatically has attracted increasing attention. Existing rule-based methods depend heavily on rigid pattern matching and domain-specific knowledge, limiting their generalizability and scalability. Recently, Supervised Fine-Tuning (SFT) of large language models (LLMs) has been successfully applied to transform natural language into STL. However, the lack of fine-grained supervision on atomic proposition correctness, semantic fidelity, and formula readability often leads SFT-based methods to produce formulas misaligned with the intended meaning. To address these issues, we propose RESTL, a reinforcement learning (RL)-based framework for the transformation from natural language to STL. RESTL introduces multiple independently trained reward models that provide fine-grained, multi-faceted feedback from four perspectives, i.e., atomic proposition consistency, semantic alignment, formula succinctness, and symbol matching. These reward models are trained with a curriculum learning strategy to improve their feedback accuracy, and their outputs are aggregated into a unified signal that guides the optimization of the STL generator via Proximal Policy Optimization (PPO). Experimental results demonstrate that RESTL significantly outperforms state-of-the-art methods in both automatic metrics and human evaluations.
中文摘要 信号时间逻辑（STL）是一种功能强大的形式语言，用于指定网络物理系统（CPS）的实时规范。将自然语言编写的规范自动转换为 STL 公式引起了越来越多的关注。现有的基于规则的方法在很大程度上依赖于严格的模式匹配和特定领域的知识，限制了它们的通用性和可扩展性。最近，大型语言模型（LLMs）的监督微调（Supervised Fine-Tuning，SFT）已成功应用于将自然语言转换为STL。然而，缺乏对原子命题正确性、语义保真度和公式可读性的细粒度监督，通常会导致基于SFT的方法产生与预期含义不一致的公式。为了解决这些问题，我们提出了 RESTL，这是一种基于强化学习（RL）的框架，用于从自然语言到 STL 的转换。RESTL 引入了多个独立训练的奖励模型，从原子命题一致性、语义对齐、公式简洁性和符号匹配四个角度提供细粒度、多方面的反馈。这些奖励模型通过课程学习策略进行训练，以提高其反馈准确性，并将它们的输出聚合成一个统一的信号，通过近端策略优化（PPO）指导STL生成器的优化。实验结果表明，RESTL 在自动度量和人工评估方面都明显优于最先进的方法。

The Path Not Taken: RLVR Provably Learns Off the Principals

未走的路：RLVR 可证明地偏离主方向学习

Authors: Hanqing Zhu, Zhenyu Zhang, Hanxian Huang, DiJia Su, Zechun Liu, Jiawei Zhao, Igor Fedorov, Hamed Pirsiavash, Zhizhou Sha, Jinwon Lee, David Z. Pan, Zhangyang Wang, Yuandong Tian, Kai Sheng Tai
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.08567
Pdf link: https://arxiv.org/pdf/2511.08567
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) reliably improves the reasoning performance of large language models, yet it appears to modify only a small fraction of parameters. We revisit this paradox and show that sparsity is a surface artifact of a model-conditioned optimization bias: for a fixed pretrained model, updates consistently localize to preferred parameter regions, highly consistent across runs and largely invariant to datasets and RL recipes. We mechanistically explain these dynamics with a Three-Gate Theory: Gate I (KL Anchor) imposes a KL-constrained update; Gate II (Model Geometry) steers the step off principal directions into low-curvature, spectrum-preserving subspaces; and Gate III (Precision) hides micro-updates in non-preferred regions, making the off-principal bias appear as sparsity. We then validate this theory and, for the first time, provide a parameter-level characterization of RLVR's learning dynamics: RLVR learns off principal directions in weight space, achieving gains via minimal spectral drift, reduced principal-subspace rotation, and off-principal update alignment. In contrast, SFT targets principal weights, distorts the spectrum, and even lags RLVR. Together, these results provide the first parameter-space account of RLVR's training dynamics, revealing clear regularities in how parameters evolve. Crucially, we show that RL operates in a distinct optimization regime from SFT, so directly adapting SFT-era parameter-efficient fine-tuning (PEFT) methods can be flawed, as evidenced by our case studies on advanced sparse fine-tuning and LoRA variants. We hope this work charts a path toward a white-box understanding of RLVR and the design of geometry-aware, RLVR-native learning algorithms, rather than repurposed SFT-era heuristics.
中文摘要 具有可验证奖励的强化学习（RLVR）可靠地提高了大型语言模型的推理性能，但它似乎只修改了一小部分参数。我们重新审视了这个悖论，并表明稀疏性是模型条件优化偏置的表面假象：对于固定的预训练模型，更新始终定位到偏好的参数区域，在多次运行中高度一致，并且对数据集和 RL 配方基本不变。我们用三门理论从机制上解释这些动态：门 I（KL 锚）施加了 KL 约束的更新；门 II（模型几何）将更新步从主方向引导到低曲率、保持谱的子空间；门 III（精度）隐藏了非偏好区域中的微小更新，使偏离主方向的偏置表现为稀疏性。然后，我们验证了这一理论，并首次提供了 RLVR 学习动力学的参数级表征：RLVR 在权重空间中偏离主方向进行学习，通过最小的谱漂移、减少的主子空间旋转和偏离主方向的更新对齐来实现增益。相比之下，SFT 针对主权重，扭曲谱，甚至落后于 RLVR。这些结果共同提供了 RLVR 训练动力学的第一个参数空间解释，揭示了参数如何演化的明确规律。至关重要的是，我们表明 RL 在与 SFT 不同的优化机制下运行，因此直接采用 SFT 时代的参数高效微调（PEFT）方法可能存在缺陷，我们对高级稀疏微调和 LoRA 变体的案例研究证明了这一点。我们希望这项工作能够为白盒理解 RLVR 以及设计几何感知、RLVR 原生的学习算法指明道路，而不是重新利用 SFT 时代的启发式方法。

DeepProofLog: Efficient Proving in Deep Stochastic Logic Programs

DeepProofLog：深度随机逻辑程序中的高效证明

Authors: Ying Jiao, Rodrigo Castellano Ontiveros, Luc De Raedt, Marco Gori, Francesco Giannini, Michelangelo Diligenti, Giuseppe Marra
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.08581
Pdf link: https://arxiv.org/pdf/2511.08581
Abstract Neurosymbolic (NeSy) AI aims to combine the strengths of neural architectures and symbolic reasoning to improve the accuracy, interpretability, and generalization capability of AI models. While logic inference on top of subsymbolic modules has been shown to effectively guarantee these properties, this often comes at the cost of reduced scalability, which can severely limit the usability of NeSy models. This paper introduces DeepProofLog (DPrL), a novel NeSy system based on stochastic logic programs, which addresses the scalability limitations of previous methods. DPrL parameterizes all derivation steps with neural networks, allowing efficient neural guidance over the proving system. Additionally, we establish a formal mapping between the resolution process of our deep stochastic logic programs and Markov Decision Processes, enabling the application of dynamic programming and reinforcement learning techniques for efficient inference and learning. This theoretical connection improves scalability for complex proof spaces and large knowledge bases. Our experiments on standard NeSy benchmarks and knowledge graph reasoning tasks demonstrate that DPrL outperforms existing state-of-the-art NeSy systems, advancing scalability to larger and more complex settings than previously possible.
中文摘要 神经符号（NeSy）AI旨在结合神经架构和符号推理的优势，提高AI模型的准确性、可解释性和泛化能力。虽然子符号模块之上的逻辑推理已被证明可以有效地保证这些属性，但这通常以降低可扩展性为代价，这会严重限制 NeSy 模型的可用性。本文介绍了DeepProofLog（DPrL），这是一种基于随机逻辑程序的新型NeSy系统，它解决了以前方法的可扩展性限制。DPrL 使用神经网络对所有推导步骤进行参数化，从而允许对证明系统进行有效的神经引导。此外，我们还在深度随机逻辑程序的解析过程和马尔可夫决策过程之间建立了正式的映射，从而能够应用动态规划和强化学习技术进行高效的推理和学习。这种理论联系提高了复杂证明空间和大型知识库的可扩展性。我们在标准 NeSy 基准测试和知识图谱推理任务上的实验表明，DPrL 的性能优于现有的最先进的 NeSy 系统，将可扩展性提高到比以前更大、更复杂的设置。

Keyword: diffusion policy

There is no result