Arxiv Papers of Today

Generated: 2025-12-11 16:32:16 (UTC+8); Arxiv release time: 2025-12-11 20:00 EST (2025-12-12 09:00 UTC+8)

Total relevant papers today: 22

Keyword: reinforcement learning

Enhancing Reliability across Short and Long-Form QA via Reinforcement Learning

通过强化学习提升短型和长型QA的可靠性

Authors: Yudong Wang, Zhe Yang, Wenhan Ma, Zhifang Sui, Liang Zhao
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.08944
Pdf link: https://arxiv.org/pdf/2512.08944
Abstract While reinforcement learning has unlocked unprecedented complex reasoning in large language models, it has also amplified their propensity for hallucination, creating a critical trade-off between capability and reliability. This work confronts this challenge by introducing a targeted RL framework designed to mitigate both intrinsic and extrinsic hallucinations across short and long-form question answering. We address extrinsic hallucinations (flawed internal knowledge) by creating a novel training set from open-ended conversions of TriviaQA. Concurrently, we tackle intrinsic hallucinations (unfaithfulness to context) by leveraging long-form texts from FineWeb in a fact-grounding reward scheme. To further bolster reliability, our framework explicitly rewards the model for refusing to answer unanswerable questions, thereby cultivating crucial cautiousness. Extensive experiments demonstrate that our methodology yields significant performance gains across a diverse suite of benchmarks, substantially reducing both hallucination types. Ultimately, this research contributes a practical framework for resolving the critical tension between advanced reasoning and factual trustworthiness, paving the way for more capable and reliable large language models.
中文摘要 虽然强化学习在大型语言模型中解锁了前所未有的复杂推理能力，但也放大了它们产生幻觉的倾向，从而在能力与可靠性之间创造了关键权衡。本研究通过引入一个有针对性的强化学习框架来应对这一挑战，旨在减轻短长问答中的内在和外在幻觉。我们通过创建一个由 TriviaQA 开放式转换而成的新训练集，来应对外在幻觉（内在知识有缺陷）。同时，我们通过利用FineWeb的长篇文本，开展事实基础的奖励方案，解决内在幻觉（对上下文不符）。为进一步增强可靠性，我们的框架明确奖励拒绝回答无解问题的模型，从而培养关键的谨慎态度。大量实验表明，我们的方法在多样基准测试中显著提升性能，显著减少了两种幻觉类型。最终，本研究为解决高级推理与事实可信度之间的关键张力提供了一个实用框架，为更强大、更可靠的大型语言模型铺平了道路。

Optimizing Algorithms for Mobile Health Interventions with Active Querying Optimization

通过主动查询优化优化移动健康干预算法

Authors: Aseel Rawashdeh
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2512.08950
Pdf link: https://arxiv.org/pdf/2512.08950
Abstract Reinforcement learning in mobile health (mHealth) interventions requires balancing intervention efficacy with user burden, particularly when state measurements (for example, user surveys or feedback) are costly yet essential. The Act-Then-Measure (ATM) heuristic addresses this challenge by decoupling control and measurement actions within the Action-Contingent Noiselessly Observable Markov Decision Process (ACNO-MDP) framework. However, the standard ATM algorithm relies on a temporal-difference-inspired Q-learning method, which is prone to instability in sparse and noisy environments. In this work, we propose a Bayesian extension to ATM that replaces standard Q-learning with a Kalman filter-style Bayesian update, maintaining uncertainty-aware estimates of Q-values and enabling more stable and sample-efficient learning. We evaluate our method in both toy environments and clinically motivated testbeds. In small, tabular environments, Bayesian ATM achieves comparable or improved scalarized returns with substantially lower variance and more stable policy behavior. In contrast, in larger and more complex mHealth settings, both the standard and Bayesian ATM variants perform poorly, suggesting a mismatch between ATM's modeling assumptions and the structural challenges of real-world mHealth domains. These findings highlight the value of uncertainty-aware methods in low-data settings while underscoring the need for new RL algorithms that explicitly model causal structure, continuous states, and delayed feedback under observation cost constraints.
中文摘要 移动健康（mHealth）干预中的强化学习需要在干预效果与用户负担之间取得平衡，尤其是在状态测量（例如用户调查或反馈）既昂贵又至关重要时。行动-然后测量（ATM）启发式通过在行动-依赖无噪声可观察马尔可夫决策过程（ACNO-MDP）框架内解耦控制与测量动作来应对这一挑战。然而，标准ATM算法依赖于受时间差分启发的Q-学习方法，在稀疏和噪声较大的环境中容易不稳定。本研究提出一种贝叶斯扩展，用卡尔曼滤波器式贝叶斯更新替代标准Q学习，保持Q值的不确定性感知估计，实现更稳定和样本效率更高的学习。我们在玩具环境和临床动机测试平台中评估我们的方法。在小型、表格化的环境中，贝叶斯ATM能够实现相当或更好的标量化收益，且方差显著降低，政策行为更稳定。相比之下，在更大且更复杂的移动健康环境中，标准ATM和贝叶斯ATM变体表现不佳，表明ATM建模假设与现实移动健康领域结构性挑战存在不匹配。这些发现凸显了低数据环境中不确定性感知方法的价值，同时强调了需要新的强化学习算法，明确建模因果结构、连续状态和在观察成本约束下的延迟反馈。
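
The "Kalman filter-style Bayesian update" of Q-values mentioned in the abstract above can be illustrated with a minimal sketch. This is not the paper's implementation: the class name `KalmanQ`, the scalar Gaussian belief per state-action pair, the fixed observation-noise variance `obs_var`, and the use of a greedy bootstrap target are all assumptions made for illustration.

```python
from collections import defaultdict


class KalmanQ:
    """Hypothetical tabular Q-estimator with a Gaussian belief per (state, action)."""

    def __init__(self, actions, prior_mean=0.0, prior_var=1.0, obs_var=1.0, gamma=0.95):
        self.actions = list(actions)
        self.gamma = gamma
        self.obs_var = obs_var  # assumed noise variance of the TD target
        self.mean = defaultdict(lambda: prior_mean)
        self.var = defaultdict(lambda: prior_var)

    def update(self, s, a, r, s_next, done=False):
        # Treat the bootstrapped TD target as a noisy observation of Q(s, a).
        bootstrap = 0.0 if done else max(self.mean[(s_next, b)] for b in self.actions)
        target = r + self.gamma * bootstrap
        # Scalar Kalman gain: large when the belief is uncertain, small when confident.
        gain = self.var[(s, a)] / (self.var[(s, a)] + self.obs_var)
        self.mean[(s, a)] += gain * (target - self.mean[(s, a)])
        self.var[(s, a)] *= (1.0 - gain)
        return gain
```

In this sketch the gain plays the role of an adaptive learning rate: it shrinks as the variance of a Q-value estimate shrinks, which is one plausible reading of why an uncertainty-aware update could be more stable than a fixed-step Q-learning update in sparse, noisy settings.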

Financial Instruction Following Evaluation (FIFE)

评估后财务指导（FIFE）

Authors: Glenn Matlin, Siddharth, Anirudh JM, Aditya Shukla, Yahya Hassan, Sudheer Chava
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.08965
Pdf link: https://arxiv.org/pdf/2512.08965
Abstract Language Models (LMs) struggle with complex, interdependent instructions, particularly in high-stakes domains like finance where precision is critical. We introduce FIFE, a novel, high-difficulty benchmark designed to assess LM instruction-following capabilities for financial analysis tasks. FIFE comprises 88 human-authored prompts and employs a verification system with chainable, verifiable constraints for fine-grained reward signals. We evaluate 53 models (proprietary, open-weight, open-source) in a zero-shot setting. Our key findings reveal a clear performance hierarchy: the top open-weight model (76.1 strict / 79.5 loose) surpasses the leading proprietary system (65.9 strict / 70.5 loose), while the best open-source models lag significantly (45.5 strict / 48.9 loose). However, even top-performing models struggle with FIFE's complex requirements, failing to achieve perfect compliance. We release our dataset and code as an open-source resource to promote research in Reinforcement Learning for the financial domain.
中文摘要 语言模型（LM）在处理复杂且相互依赖的指令时表现不佳，尤其是在金融等高风险领域，因为精准度至关重要。我们介绍FIFE，这是一个新颖的高难度基准测试，旨在评估财务分析任务中LM指令跟随能力。FIFE包含88个人工创作提示，采用可链式、可验证的细粒度奖励信号约束的验证系统。我们在零样本环境下评估了53个模型（专有、开放权重、开源）。我们的主要发现揭示了一个清晰的性能层级结构：顶级开放权重模型（严格76.1 / 79.5松散）超过了领先的专有系统（严格65.9 / 70.5松），而最佳开源模型则明显落后（严格45.5 / 48.9松散）。然而，即使是表现最好的模型也难以应对FIFE复杂的要求，无法实现完美的合规。我们将数据集和代码作为开源资源发布，以促进金融领域的强化学习研究。

Training Multi-Image Vision Agents via End2End Reinforcement Learning

通过端对端强化学习训练多图像视觉代理

Authors: Chengqi Dong, Chuhuai Yue, Hang He, Rongge Mao, Fenghe Tang, S Kevin Zhou, Zekun Xu, Xiaohan Wang, Jiajun Chai, Wei Lin, Guojun Yin
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.08980
Pdf link: https://arxiv.org/pdf/2512.08980
Abstract Recent VLM-based agents aim to replicate OpenAI O3's "thinking with images" via tool use, but most open-source methods limit input to a single image, falling short on real-world multi-image QA tasks. To address this, we propose IMAgent, an open-source vision agent trained via end-to-end reinforcement learning and dedicated to complex multi-image tasks. By leveraging a multi-agent system, we generate challenging and visually rich multi-image QA pairs to fully activate the tool-use potential of the base VLM. Through manual verification, we obtain MIFG-QA, comprising 10k samples for training and evaluation. With deeper reasoning steps, VLMs may increasingly ignore visual inputs. We therefore develop two specialized tools for visual reflection and confirmation, allowing the model to proactively reallocate its attention to image content during inference. Benefiting from our action-trajectory two-level mask strategy, IMAgent achieves stable tool-use behavior via pure RL training without requiring costly supervised fine-tuning data. Extensive experiments demonstrate that IMAgent maintains strong performance on existing single-image benchmarks while achieving substantial improvements on our proposed multi-image dataset, with our analysis providing actionable insights for the research community. Code and data will be released soon.
中文摘要 近期基于VLM的代理旨在通过工具复制OpenAI O3的“用图像思考”，但大多数开源方法仅输入单张图像，缺乏现实多图像质量保证任务。为此，我们提出了IMAgent，一个开源视觉代理，通过端到端强化学习训练，专门用于复杂的多图像任务。通过利用多智能体系统，我们生成具有挑战性且视觉丰富的多图像质量保证对，充分发挥基础VLM的工具使用潜力。通过人工验证，我们获得了MIFG-QA，包含1万个样本用于培训和评估。随着更深层次的推理，VLM可能越来越忽视视觉输入。因此，我们开发了两种专门的视觉反射和确认工具，使模型在推断过程中能够主动重新分配注意力到图像内容。借助我们设计良好的动作轨迹两级掩码策略，IMAgent 通过纯强化学习训练实现工具使用行为的稳定，无需昂贵的监督微调数据。大量实验表明，IMAgent在现有单图像基准测试中保持了强劲表现，同时在我们提出的多图像数据集上取得了显著改进，我们的分析为研究界提供了可作的见解。代码和数据将很快发布。

Learning Unmasking Policies for Diffusion Language Models

学习扩散语言模型的解除掩蔽策略

Authors: Metod Jazbec, Theo X. Olausson, Louis Béthune, Pierre Ablin, Michael Kirchhof, Joao Monterio, Victor Turrisi, Jason Ramapuram, Marco Cuturi
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.09106
Pdf link: https://arxiv.org/pdf/2512.09106
Abstract Diffusion (Large) Language Models (dLLMs) now match the downstream performance of their autoregressive counterparts on many tasks, while holding the promise of being more efficient during inference. One particularly successful variant is masked discrete diffusion, in which a buffer filled with special mask tokens is progressively replaced with tokens sampled from the model's vocabulary. Efficiency can be gained by unmasking several tokens in parallel, but doing too many at once risks degrading the generation quality. Thus, one critical design aspect of dLLMs is the sampling procedure that selects, at each step of the diffusion process, which tokens to replace. Indeed, recent work has found that heuristic strategies such as confidence thresholding lead to both higher quality and token throughput compared to random unmasking. However, such heuristics have downsides: they require manual tuning, and we observe that their performance degrades with larger buffer sizes. In this work, we instead propose to train sampling procedures using reinforcement learning. Specifically, we formalize masked diffusion sampling as a Markov decision process in which the dLLM serves as the environment, and propose a lightweight policy architecture based on a single-layer transformer that maps dLLM token confidences to unmasking decisions. Our experiments show that these trained policies match the performance of state-of-the-art heuristics when combined with semi-autoregressive generation, while outperforming them in the full diffusion setting. We also examine the transferability of these policies, finding that they can generalize to new underlying dLLMs and longer sequence lengths. However, we also observe that their performance degrades when applied to out-of-domain data, and that fine-grained tuning of the accuracy-efficiency trade-off can be challenging with our approach.
中文摘要 扩散（大型）语言模型（dLLMs）现在在许多任务上能与其自回归对应的下游表现匹配，同时有望在推理过程中更高效。一个特别成功的变体是掩蔽离散扩散，即填充特殊掩码标记的缓冲区逐步被模型词汇中采样的标记替换。通过并行解锁多个代币可以提高效率，但同时进行过多会降低生成质量。因此，dLLM的一个关键设计方面是采样过程，在扩散过程的每个步骤中选择替换哪些标记。事实上，近期研究发现，诸如置信阈值等启发式策略相比随机揭罩，能带来更高的质量和令牌吞吐量。然而，这类启发式方法存在缺点：它们需要手动调优，且我们观察到缓冲区越大，性能就会下降。在本研究中，我们提出利用强化学习来训练抽样过程。具体来说，我们将掩蔽扩散抽样形式化为一种以dLLM为环境的马尔可夫决策过程，并提出了基于单层变换器的轻量级策略架构，将dLLM令牌信心映射到解除掩蔽决策。我们的实验表明，这些训练策略与半自回归生成结合时，性能可媲美最先进的启发式，而在全扩散环境下表现优于它们。我们还考察了这些策略的可转移性，发现它们可以推广到新的底层dLLMs和更长的序列长度。然而，我们也观察到，当应用到域外数据时，它们的性能会下降，而在准确性与效率之间进行细致度权衡的调整，对我们的方法来说可能具有挑战性。
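
The confidence-thresholding heuristic that the abstract above uses as a point of comparison can be sketched as follows. This is an illustrative assumption rather than any specific dLLM's sampler: the function name `select_unmask`, the default threshold of 0.9, and the fallback of unmasking the single most confident position when nothing clears the threshold are all hypothetical choices.

```python
def select_unmask(confidences, masked, threshold=0.9):
    """Return the indices of masked positions to unmask in this step.

    confidences: per-position confidence (e.g. the model's top token probability).
    masked: per-position booleans, True where the position is still a mask token.
    """
    candidates = [i for i, is_masked in enumerate(masked) if is_masked]
    if not candidates:
        return []
    # Unmask, in parallel, every masked position the model is confident about.
    chosen = [i for i in candidates if confidences[i] >= threshold]
    if not chosen:
        # Assumed fallback: always make progress by unmasking the best candidate.
        chosen = [max(candidates, key=lambda i: confidences[i])]
    return chosen
```

The threshold is the manually tuned knob the abstract refers to: lowering it unmasks more tokens per step (higher throughput) at the risk of lower generation quality. A learned policy, as proposed in the paper, would replace this fixed rule with a trained mapping from confidences to unmasking decisions.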

Electric Arc Furnaces Scheduling under Electricity Price Volatility with Reinforcement Learning

电弧炉在电价波动性下的调度与强化学习

Authors: Ruonan Pi, Zhiyuan Fan, Bolun Xu
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2512.09293
Pdf link: https://arxiv.org/pdf/2512.09293
Abstract This paper proposes a reinforcement learning-based framework for optimizing the operation of electric arc furnaces (EAFs) under volatile electricity prices. We formulate the deterministic version of the EAF scheduling problem into a mixed-integer linear programming (MILP) formulation, and then develop a Q-learning algorithm to perform real-time control of multiple EAF units under real-time price volatility and shared feeding capacity constraints. We design a custom reward function for the Q-learning algorithm to smooth the start-up penalties of the EAFs. Using real data from EAF designs and electricity prices in New York State, we benchmark our algorithm against a baseline rule-based controller and a MILP benchmark, assuming perfect price forecasts. The results show that our reinforcement learning algorithm achieves around 90% of the profit compared to the perfect MILP benchmark in various single-unit and multi-unit cases under a non-anticipatory control setting.
中文摘要 本文提出了一种基于强化学习的框架，用于在电价波动下优化电弧炉（EAF）的运行。我们将EAF调度问题的确定性版本表述为混合整数线性规划（MILP）形式，然后开发Q学习算法，在实时价格波动和共享供给能力约束下实时控制多个EAF单元。我们为Q学习算法设计了定制奖励函数，以平滑EAF的启动惩罚。利用EAF设计和纽约州电价的真实数据，我们将算法与基于规则的基线控制器和MILP基准进行基准对比，假设价格预测完美。结果显示，在非预期控制环境下，我们的强化学习算法在多种单单元和多单元案例中，相较于完美MILP基准实现了约90%的利润。

Tyche: A Hybrid Computation Framework of Illumination Pattern for Satellite Beam Hopping

Tyche：卫星波束跳跃照明模式的混合计算框架

Authors: Ziheng Yang, Kun Qiu, Zhe Chen, Wenjun Zhu, Yue Gao
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2512.09312
Pdf link: https://arxiv.org/pdf/2512.09312
Abstract High-Throughput Satellites (HTS) use beam hopping to handle non-uniform and time-varying ground traffic demand. A significant technical challenge in beam hopping is the computation of effective illumination patterns. Traditional algorithms, like the genetic algorithm, require over 300 seconds to compute a single illumination pattern for just 37 cells, whereas modern HTS typically covers over 300 cells, rendering current methods impractical for real-world applications. Advanced approaches, such as multi-agent deep reinforcement learning, face convergence issues when the number of cells exceeds 40. In this paper, we introduce Tyche, a hybrid computation framework designed to address this challenge. Tyche incorporates a Monte Carlo Tree Search Beam Hopping (MCTS-BH) algorithm for computing illumination patterns and employs sliding window and pruning techniques to significantly reduce computation time. Specifically, MCTS-BH can compute one illumination pattern for 37 cells in just 12 seconds. To ensure real-time computation, we use a Greedy Beam Hopping (G-BH) algorithm, which provides a provisional solution while MCTS-BH completes its computation in the background. Our evaluation results show that MCTS-BH can increase throughput by up to 98.76%, demonstrating substantial improvements over existing solutions.
中文摘要 高通量卫星（HTS）利用波束跳变来处理不均匀且时变的地面交通需求。光束跳跃的一个重要技术挑战是有效照明模式的计算。传统算法如遗传算法计算单个照明模式仅需37个单元，需超过300秒，而现代HTS通常覆盖超过300单元，使得现有方法在实际应用中不切实际。高级方法，如多智能体深度强化学习，当细胞数超过40个时，会面临收敛问题。本文介绍了Tyche，一种旨在解决这一挑战的混合计算框架。Tyche采用蒙特卡洛树搜索跳光（MCTS-BH）算法计算照明模式，并采用滑动窗口和剪枝技术大幅缩短计算时间。具体来说，MCTS-BH能在12秒内计算37个单元的一种照明模式。为确保实时计算，我们使用贪婪跳波（G-BH）算法，在MCTS-BH后台完成计算时提供临时解决方案。我们的评估结果显示，MCTS-BH可将吞吐量提升高达98.76%，相较现有方案有显著提升。

COVLM-RL: Critical Object-Oriented Reasoning for Autonomous Driving Using VLM-Guided Reinforcement Learning

COVLM-RL：利用VLM引导强化学习实现自动驾驶的关键面向对象推理

Authors: Lin Li, Yuxin Cai, Jianwu Fang, Jianru Xue, Chen Lv
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.09349
Pdf link: https://arxiv.org/pdf/2512.09349
Abstract End-to-end autonomous driving frameworks face persistent challenges in generalization, training efficiency, and interpretability. While recent methods leverage Vision-Language Models (VLMs) through supervised learning on large-scale datasets to improve reasoning, they often lack robustness in novel scenarios. Conversely, reinforcement learning (RL)-based approaches enhance adaptability but remain data-inefficient and lack transparent decision-making. To address these limitations, we propose COVLM-RL, a novel end-to-end driving framework that integrates Critical Object-oriented (CO) reasoning with VLM-guided RL. Specifically, we design a Chain-of-Thought (CoT) prompting strategy that enables the VLM to reason over critical traffic elements and generate high-level semantic decisions, effectively transforming multi-view visual inputs into structured semantic decision priors. These priors reduce the input dimensionality and inject task-relevant knowledge into the RL loop, accelerating training and improving policy interpretability. However, bridging high-level semantic guidance with continuous low-level control remains non-trivial. To this end, we introduce a consistency loss that encourages alignment between the VLM's semantic plans and the RL agent's control outputs, enhancing interpretability and training stability. Experiments conducted in the CARLA simulator demonstrate that COVLM-RL significantly improves the success rate by 30% in trained driving environments and by 50% in previously unseen environments, highlighting its strong generalization capability.
中文摘要 端到端自动驾驶框架在泛化性、训练效率和可理解性方面面临持续挑战。虽然近期方法通过在大规模数据集上的监督学习利用视觉语言模型（VLMs）来提升推理能力，但在新颖场景中往往缺乏稳健性。相反，基于强化学习（RL）的方法提升了适应性，但数据效率低下且缺乏透明决策。为解决这些局限性，我们提出了COVLM-RL，一种新型端到端驱动框架，将关键面向对象（CO）推理与VLM引导RL整合在一起。具体来说，我们设计了一种思维链（Chain-of-Thought，简称CoT）提示策略，使VLM能够对关键流量元素进行推理并生成高层次语义决策，有效地将多视角视觉输入转化为结构化的语义决策先验。这些先验降低了输入维度，并将任务相关知识注入强化学习循环，加快训练并改善策略可解释性。然而，将高层次语义指导与持续的低层次控制连接起来仍然不简单。为此，我们引入了一致性丢失机制，促进VLM语义计划与强化学习代理控制输出之间的对齐，提升可解释性和训练稳定性。在CARLA模拟器中进行的实验表明，COVLM-RL在训练驾驶环境中成功率显著提升30%，在前所未见环境中提升50%，彰显其强大的泛化能力。

CFLight: Enhancing Safety with Traffic Signal Control through Counterfactual Learning

CFLight：通过反事实学习提升交通信号控制的安全

Authors: Mingyuan Li, Chunyu Liu, Zhuojun Li, Xiao Liu, Guangsheng Yu, Bo Du, Jun Shen, Qiang Wu
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)
Arxiv link: https://arxiv.org/abs/2512.09368
Pdf link: https://arxiv.org/pdf/2512.09368
Abstract Traffic accidents result in millions of injuries and fatalities globally, with a significant number occurring at intersections each year. Traffic Signal Control (TSC) is an effective strategy for enhancing safety at these urban junctures. Despite the growing popularity of Reinforcement Learning (RL) methods in optimizing TSC, these methods often prioritize driving efficiency over safety, thus failing to address the critical balance between these two aspects. Additionally, these methods usually lack interpretability. CounterFactual (CF) learning is a promising approach for various causal analysis fields. In this study, we introduce a novel framework to improve RL for safety aspects in TSC. This framework introduces a novel method based on CF learning to address the question: "What if, when an unsafe event occurs, we backtrack to perform alternative actions, and will this unsafe event still occur in the subsequent period?" To answer this question, we propose a new structural causal model to predict the result after executing different actions, and we propose a new CF module that integrates with additional "X" modules to promote safe RL practices. Our new algorithm, CFLight, which is derived from this framework, effectively tackles challenging safety events and significantly improves safety at intersections through a near-zero collision control strategy. Through extensive numerical experiments on both real-world and synthetic datasets, we demonstrate that CFLight reduces collisions and improves overall traffic performance compared to conventional RL methods and the recent safe RL model. Moreover, our method represents a generalized and safe framework for RL methods, opening possibilities for applications in other domains. The data and code are available on GitHub at this https URL.
中文摘要 全球交通事故造成数百万人受伤和死亡，每年都发生在路口。交通信号控制（TSC）是提升这些城市路口安全的有效策略。尽管强化学习（RL）方法在优化TSC方面日益流行，但这些方法往往优先考虑效率而非安全，未能解决这两者之间的关键平衡。此外，这些方法通常需要更多的可解释性。反事实（CF）学习是一种在各种因果分析领域中有前景的方法。本研究引入了一个新框架，旨在提升TSC中强化学习的安全性。该框架引入了一种基于CF学习的新方法，用以回答这样一个问题：“如果当发生不安全事件时，我们会回头执行替代行动，并且该不安全事件在接下来的期间还会发生吗？”为回答这个问题，我们提出了一个新的结构因果模型，用于在执行不同动作后预测结果，并提出了一个新的CF模块，该模块与额外的“X”模块集成，以促进安全的强化学习实践。我们的新算法CFLight源自该框架，有效应对具有挑战性的安全事件，并通过近乎零碰撞控制策略显著提升路口安全。通过对现实世界和合成数据集的大量数值实验，我们证明CFLight相比传统强化学习方法和近期安全强化学习模型，减少碰撞并提升整体流量性能。此外，我们的方法为强化学习方法提供了一个通用且安全的框架，为其他领域的应用打开了可能性。数据和代码可以在github上找到，https URL。

Generalizable Collaborative Search-and-Capture in Cluttered Environments via Path-Guided MAPPO and Directional Frontier Allocation

通过路径引导MAPPO和定向前沿分配实现在杂乱环境中实现的通用协作搜索与捕获

Authors: Jialin Ying, Zhihao Li, Zicheng Dong, Guohua Wu, Yihuan Liao
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2512.09410
Pdf link: https://arxiv.org/pdf/2512.09410
Abstract Collaborative pursuit-evasion in cluttered environments presents significant challenges due to sparse rewards and constrained Fields of View (FOV). Standard Multi-Agent Reinforcement Learning (MARL) often suffers from inefficient exploration and fails to scale to large scenarios. We propose PGF-MAPPO (Path-Guided Frontier MAPPO), a hierarchical framework bridging topological planning with reactive control. To resolve local minima and sparse rewards, we integrate an A*-based potential field for dense reward shaping. Furthermore, we introduce Directional Frontier Allocation, combining Farthest Point Sampling (FPS) with geometric angle suppression to enforce spatial dispersion and accelerate coverage. The architecture employs a parameter-shared decentralized critic, maintaining O(1) model complexity suitable for robotic swarms. Experiments demonstrate that PGF-MAPPO achieves superior capture efficiency against faster evaders. Policies trained on 10x10 maps exhibit robust zero-shot generalization to unseen 20x20 environments, significantly outperforming rule-based and learning-based baselines.
中文摘要 在杂乱环境中进行协作追踪-规避面临巨大挑战，因为奖励稀少且视野（FOV）受限。标准的多智能体强化学习（MARL）常常存在探索效率低下，且无法扩展到大型场景。我们提出了PGF-MAPPO（路径引导前沿MAPPO），这是一个将拓扑规划与反应式控制相结合的分层框架。为了解决局部极小值和稀疏奖励，我们集成了一个基于A*的势场以实现密集奖励塑造。此外，我们引入了定向前沿分配，结合最远点采样（FPS）与几何角度抑制，以强制空间色散并加速覆盖。该架构采用参数共享的去中心化批评器，保持适用于机器人群体的 O（1）模型复杂度。实验表明，PGF-MAPPO在对抗更快的躲避器时实现了更优越的捕获效率。在10x10地图上训练的策略对未见的20x20环境表现出强健的零样本泛化，显著优于基于规则和基于学习的基线。
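
Farthest Point Sampling (FPS), one ingredient of the Directional Frontier Allocation described above, is a standard greedy procedure and can be sketched as follows. The function name, the choice of Euclidean distance, and the fixed starting index are assumptions for illustration; the geometric angle suppression that the paper combines with FPS is not shown here.

```python
import math


def farthest_point_sampling(points, k, start=0):
    """Greedily pick up to k indices so that each new point is as far as
    possible from the points already chosen (Euclidean distance)."""
    if not points or k <= 0:
        return []
    k = min(k, len(points))
    chosen = [start]
    # min_dist[i] = distance from point i to its nearest chosen point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        if min_dist[nxt] == 0.0:
            break  # every remaining point coincides with a chosen one
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return chosen
```

Applied to candidate frontier locations, a selection like this spreads the assigned targets apart, which matches the abstract's stated goal of enforcing spatial dispersion among agents.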

RouteRAG: Efficient Retrieval-Augmented Generation from Text and Graph via Reinforcement Learning

RouteRAG：通过强化学习从文本和图中高效检索-增强生成

Authors: Yucan Guo, Miao Su, Saiping Guan, Zihao Sun, Xiaolong Jin, Jiafeng Guo, Xueqi Cheng
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2512.09487
Pdf link: https://arxiv.org/pdf/2512.09487
Abstract Retrieval-Augmented Generation (RAG) integrates non-parametric knowledge into Large Language Models (LLMs), typically from unstructured texts and structured graphs. While recent progress has advanced text-based RAG to multi-turn reasoning through Reinforcement Learning (RL), extending these advances to hybrid retrieval introduces additional challenges. Existing graph-based or hybrid systems typically depend on fixed or handcrafted retrieval pipelines, lacking the ability to integrate supplementary evidence as reasoning unfolds. Besides, while graph evidence provides relational structures crucial for multi-hop reasoning, it is substantially more expensive to retrieve. To address these limitations, we introduce RouteRAG, an RL-based framework that enables LLMs to perform multi-turn and adaptive graph-text hybrid RAG. RouteRAG jointly optimizes the entire generation process via RL, allowing the model to learn when to reason, what to retrieve from either texts or graphs, and when to produce final answers, all within a unified generation policy. To guide this learning process, we design a two-stage training framework that accounts for both task outcome and retrieval efficiency, enabling the model to exploit hybrid evidence while avoiding unnecessary retrieval overhead. Experimental results across five question answering benchmarks demonstrate that RouteRAG significantly outperforms existing RAG baselines, highlighting the benefits of end-to-end RL in supporting adaptive and efficient retrieval for complex reasoning.
中文摘要 检索增强生成（RAG）将非参数知识集成到大型语言模型（LLM），通常来自非结构化文本和结构化图。虽然近期进展已将基于文本的RAG推进到通过强化学习（RL）实现的多回合推理，但将这些进展推广到混合检索也带来了额外挑战。现有的基于图或混合的系统通常依赖固定或手工制作的检索流程，缺乏在推理展开过程中整合补充证据的能力。此外，虽然图证据提供了多跳推理中至关重要的关系结构，但检索成本显著更高。为解决这些局限性，我们引入了 RouteRAG，这是一个基于强化学习的框架，使大型语言模型能够执行多回合和自适应的图文本混合 RAG。RouteRAG 通过强化学习共同优化整个生成过程，使模型能够学习何时推理、从文本或图中检索什么，以及何时生成最终答案，所有这些都在统一的生成策略中实现。为指导这一学习过程，我们设计了一个两阶段训练框架，既考虑任务结果，也考虑检索效率，使模型能够利用混合证据，同时避免不必要的检索开销。五个问答基准测试的实验结果表明，RouteRAG 显著优于现有的 RAG 基线，凸显了端到端强化学习在支持复杂推理的自适应高效检索方面的优势。

Toward Closed-loop Molecular Discovery via Language Model, Property Alignment and Strategic Search

迈向通过语言模型、属性比对和战略性搜索实现闭环分子发现

Authors: Junkai Ji, Zhangfan Yang, Dong Xu, Ruibin Bai, Jianqiang Li, Tingjun Hou, Zexuan Zhu
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.09566
Pdf link: https://arxiv.org/pdf/2512.09566
Abstract Drug discovery is a time-consuming and expensive process, with traditional high-throughput and docking-based virtual screening hampered by low success rates and limited scalability. Recent advances in generative modelling, including autoregressive, diffusion, and flow-based approaches, have enabled de novo ligand design beyond the limits of enumerative screening. Yet these models often suffer from inadequate generalization, limited interpretability, and an overemphasis on binding affinity at the expense of key pharmacological properties, thereby restricting their translational utility. Here we present Trio, a molecular generation framework integrating fragment-based molecular language modeling, reinforcement learning, and Monte Carlo tree search, for effective and interpretable closed-loop targeted molecular design. Through the three key components, Trio enables context-aware fragment assembly, enforces physicochemical and synthetic feasibility, and guides a balanced search between the exploration of novel chemotypes and the exploitation of promising intermediates within protein binding pockets. Experimental results show that Trio reliably achieves chemically valid and pharmacologically enhanced ligands, outperforming state-of-the-art approaches with improved binding affinity (+7.85%), drug-likeness (+11.10%) and synthetic accessibility (+12.05%), while expanding molecular diversity more than fourfold.
中文摘要 药物发现是一个耗时且昂贵的过程，传统的高通量和基于对接的虚拟筛选因成功率低和扩展性有限而受限。生成建模的最新进展，包括自回归、扩散和基于流的方法，使得新配体设计超越了枚举筛选的限制。然而，这些模型常常存在推广不足、解释性有限以及过度强调结合亲和力而牺牲关键药理特性，从而限制了其转化效用。这里我们介绍Trio，一个集成基于片段的分子语言建模、强化学习和蒙特卡洛树搜索的分子生成框架，用于高效且可解释的闭环定向分子设计。通过这三个关键组成部分，Trio实现了情境感知的片段组装，强化物理化学和合成的可行性，并引导在新型化疗型探索与蛋白质结合口袋中有望中间体的利用之间进行平衡的探索。实验结果显示，Trio可靠地实现了化学有效且药理学上增强的配体，在结合亲和力（+7.85%）、药物类比性（+11.10%）和合成可及性（+12.05%）方面表现优于最先进方法，同时分子多样性增加了四倍以上。

Mastering Diverse, Unknown, and Cluttered Tracks for Robust Vision-Based Drone Racing

掌握多样化、未知且杂乱的赛道，打造基于视觉的无人机竞速

Authors: Feng Yu, Yu Hu, Yang Su, Yang Deng, Linzuo Zhang, Danping Zou
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.09571
Pdf link: https://arxiv.org/pdf/2512.09571
Abstract Most reinforcement learning (RL)-based methods for drone racing target fixed, obstacle-free tracks, leaving the generalization to unknown, cluttered environments largely unaddressed. This challenge stems from the need to balance racing speed and collision avoidance, from limited feasible space that traps policy exploration in local optima during training, and from perceptual ambiguity between gates and obstacles in depth maps, especially when gate positions are only coarsely specified. To overcome these issues, we propose a two-phase learning framework: an initial soft-collision training phase that preserves policy exploration for high-speed flight, followed by a hard-collision refinement phase that enforces robust obstacle avoidance. An adaptive, noise-augmented curriculum with an asymmetric actor-critic architecture gradually shifts the policy's reliance from privileged gate-state information to depth-based visual input. We further impose Lipschitz constraints and integrate a track-primitive generator to enhance motion stability and cross-environment generalization. We evaluate our framework through extensive simulation and ablation studies, and validate it in real-world experiments on a computationally constrained quadrotor. The system achieves agile flight while remaining robust to gate-position errors, yielding a generalizable drone racing framework capable of operating in diverse, partially unknown, and cluttered environments. this https URL
中文摘要 大多数基于强化学习（RL）的无人机竞速方法都针对固定、无障碍的赛道，而对未知且杂乱环境的推广则基本未被解决。这一挑战源于需要平衡竞速速度与碰撞避免，训练时有限的可行空间导致策略探索被局限于局部最优，以及深度图中门与障碍物之间的感知模糊性——尤其是在门的位置仅被粗略指定时。为克服这些问题，我们提出了一个两阶段学习框架：初步的软碰撞训练阶段，保留高速飞行的策略探索;随后是硬碰撞细化阶段，强化障碍物规避。采用自适应、噪声增强的课程，采用非对称的actor-critic架构，逐步将策略的依赖从特权门状态信息转向基于深度的视觉输入。我们进一步施加了Lipschitz约束，并集成了轨迹原源，以增强运动稳定性和跨环境泛化。我们通过广泛的仿真和消融研究评估我们的框架，并在计算受限的四旋翼飞行器上进行实际实验验证。该系统实现了灵活飞行，同时保持对门口位置误差的韧性，开发了一个通用的无人机竞速框架，具备在多样、部分未知和杂乱环境中运行的能力。这个 https 网址

SynthPix: A lightspeed PIV images generator

SynthPix：光速PIV图像生成器

Authors: Antonio Terpin, Alan Bonomi, Francesco Banelli, Raffaello D'Andrea
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2512.09664
Pdf link: https://arxiv.org/pdf/2512.09664
Abstract We describe SynthPix, a synthetic image generator for Particle Image Velocimetry (PIV) with a focus on performance and parallelism on accelerators, implemented in JAX. SynthPix supports the same configuration parameters as existing tools but achieves a throughput several orders of magnitude higher in image-pair generation per second. SynthPix was developed to enable the training of data-hungry reinforcement learning methods for flow estimation and for reducing the iteration times during the development of fast flow estimation methods used in recent active fluids control studies with real-time PIV feedback. We believe SynthPix to be useful for the fluid dynamics community, and in this paper we describe the main ideas behind this software package.
中文摘要 我们介绍了SynthPix，这是一种用于粒子图像速度测量（PIV）的合成图像生成器，专注于加速器的性能和并行性，该生成器在JAX中实现。SynthPix 支持与现有工具相同的配置参数，但图像对生成速度每秒高出几个数量级。SynthPix的开发旨在训练数据密集的强化学习方法以进行流量估计，并减少快速流量估计方法开发过程中的迭代时间，这些方法用于近期主动流体控制研究中采用的实时PIV反馈。我们认为SynthPix对流体力学社区非常有用，本文介绍了该软件包背后的主要理念。

d-TreeRPO: Towards More Reliable Policy Optimization for Diffusion Language Models

d-TreeRPO：迈向更可靠的扩散语言模型策略优化

Authors: Leyi Pan, Shuchang Tao, Yunpeng Zhai, Zheyu Fu, Liancheng Fang, Minghua He, Lingzhe Zhang, Zhaoyang Liu, Bolin Ding, Aiwei Liu, Lijie Wen
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.09675
Pdf link: https://arxiv.org/pdf/2512.09675
Abstract Reliable reinforcement learning (RL) for diffusion large language models (dLLMs) requires both accurate advantage estimation and precise estimation of prediction probabilities. Existing RL methods for dLLMs fall short in both aspects: they rely on coarse or unverifiable reward signals, and they estimate prediction probabilities without accounting for the bias relative to the true, unbiased expected prediction probability that properly integrates over all possible decoding orders. To mitigate these issues, we propose d-TreeRPO, a reliable RL framework for dLLMs that leverages tree-structured rollouts and bottom-up advantage computation based on verifiable outcome rewards to provide fine-grained and verifiable step-wise reward signals. When estimating the conditional transition probability from a parent node to a child node, we theoretically analyze the estimation error between the unbiased expected prediction probability and the estimate obtained via a single forward pass, and find that higher prediction confidence leads to lower estimation error. Guided by this analysis, we introduce a time-scheduled self-distillation loss during training that enhances prediction confidence in later training stages, thereby enabling more accurate probability estimation and improved convergence. Experiments show that d-TreeRPO outperforms existing baselines and achieves significant gains on multiple reasoning benchmarks, including +86.2 on Sudoku, +51.6 on Countdown, +4.5 on GSM8K, and +5.3 on Math500. Ablation studies and computational cost analyses further demonstrate the effectiveness and practicality of our design choices.
中文摘要 面向扩散大型语言模型（dLLMs）的可靠强化学习（RL）既需要准确的优势估计，也需要对预测概率的精确估计。现有的dLLM强化学习方法在这两个方面都存在不足：它们依赖粗糙或无法验证的奖励信号，并且在估计预测概率时未考虑与真实、无偏的期望预测概率之间的偏差，而后者需要对所有可能的解码顺序进行积分。为缓解这些问题，我们提出了d-TreeRPO，这是一个可靠的dLLM强化学习框架，利用树状结构的rollout和基于可验证结果奖励的自下而上优势计算，提供细粒度且可验证的逐步奖励信号。在估计从父节点到子节点的条件转移概率时，我们从理论上分析了无偏期望预测概率与单次前向传播所得估计之间的估计误差，发现预测置信度越高，估计误差越低。在该分析的指导下，我们在训练中引入了按时间调度的自蒸馏损失，在训练后期提高预测置信度，从而实现更准确的概率估计并改善收敛性。实验显示，d-TreeRPO优于现有基线，并在多个推理基准测试中取得显著提升，包括Sudoku +86.2、Countdown +51.6、GSM8K +4.5和Math500 +5.3。消融研究和计算成本分析进一步证明了我们设计选择的有效性和实用性。
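To make "tree-structured rollouts with bottom-up advantage computation" concrete, here is a minimal sketch. It is not the authors' algorithm: the abstract does not give the exact rule, so this example assumes that a leaf takes its verifiable outcome reward, an inner node takes the mean of its children, and the advantage of a step is the child's value minus its parent's value. The `Node` class and both function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node of a rollout tree (hypothetical structure, for illustration only)."""
    name: str
    reward: Optional[float] = None      # set only on leaves: the verifiable outcome reward
    children: List["Node"] = field(default_factory=list)
    value: float = 0.0                  # filled in by backup()
    advantage: float = 0.0              # filled in by assign_advantages()


def backup(node: Node) -> float:
    """Bottom-up pass: a leaf takes its reward, an inner node the mean of its children."""
    if not node.children:
        if node.reward is None:
            raise ValueError(f"leaf {node.name!r} has no outcome reward")
        node.value = float(node.reward)
    else:
        node.value = sum(backup(child) for child in node.children) / len(node.children)
    return node.value


def assign_advantages(node: Node) -> None:
    """Assumed rule: the advantage of a step is child value minus parent value."""
    for child in node.children:
        child.advantage = child.value - node.value
        assign_advantages(child)


# Two branches with two leaves each; only leaf "a1" reaches a correct answer.
root = Node("root", children=[
    Node("a", children=[Node("a1", reward=1.0), Node("a2", reward=0.0)]),
    Node("b", children=[Node("b1", reward=0.0), Node("b2", reward=0.0)]),
])
backup(root)
assign_advantages(root)
```

In this toy tree, branch "a" ends up with a positive advantage and branch "b" with a negative one, even though neither intermediate step received a reward of its own. That is the sense in which outcome rewards can yield step-wise signals.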

Dynamic one-time delivery of critical data by small and sparse UAV swarms: a model problem for MARL scaling studies

小型稀疏无人机群动态一次性传递关键数据：面向MARL扩展性研究的模型问题

Authors: Mika Persson, Jonas Lidman, Jacob Ljungberg, Samuel Sandelius, Adam Andersson
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2512.09682
Pdf link: https://arxiv.org/pdf/2512.09682
Abstract This work presents a conceptual study on the application of Multi-Agent Reinforcement Learning (MARL) for decentralized control of unmanned aerial vehicles to relay a critical data package to a known position. For this purpose, a family of deterministic games is introduced, designed for scaling studies for MARL. A robust baseline policy is proposed, which is based on restricting agent motion envelopes and applying Dijkstra's algorithm. Experimental results show that two off-the-shelf MARL algorithms perform competitively with the baseline for a small number of agents, but scalability issues arise as the number of agents increases.
中文摘要 本文对多智能体强化学习（MARL）在无人机去中心化控制中的应用进行了概念性研究，目标是将关键数据包中继到已知位置。为此，引入了一族确定性博弈，专为MARL的扩展性研究而设计。提出了一种稳健的基线策略，其基于限制智能体的运动包络并应用Dijkstra算法。实验结果显示，两种现成的MARL算法在智能体数量较少时与基线策略表现相当，但随着智能体数量增加，扩展性问题随之出现。
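The baseline relies on Dijkstra's algorithm, so a generic version of that algorithm is sketched below. The relay graph, its node names, and its edge costs are invented for illustration; the abstract does not describe how the game or the motion envelopes are turned into a graph.

```python
import heapq
from typing import Dict, List, Tuple

Graph = Dict[str, List[Tuple[str, float]]]


def dijkstra(graph: Graph, source: str) -> Dict[str, float]:
    """Return the cheapest known cost from source to every reachable node."""
    dist: Dict[str, float] = {source: 0.0}
    heap: List[Tuple[float, str]] = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > dist.get(node, float("inf")):
            continue  # stale heap entry: a cheaper path to this node was already found
        for neighbor, weight in graph.get(node, []):
            new_cost = cost + weight
            if new_cost < dist.get(neighbor, float("inf")):
                dist[neighbor] = new_cost
                heapq.heappush(heap, (new_cost, neighbor))
    return dist


# Hypothetical relay graph: edge weights stand in for hand-over cost between positions.
relay_graph: Graph = {
    "source": [("uav1", 2.0), ("uav2", 5.0)],
    "uav1": [("uav2", 1.0), ("target", 6.0)],
    "uav2": [("target", 2.0)],
    "target": [],
}
costs = dijkstra(relay_graph, "source")
```

Here the cheapest route to "target" goes through both relays (cost 5.0) rather than directly from "uav1" (cost 8.0).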

MOA: Multi-Objective Alignment for Role-Playing Agents

MOA：角色扮演代理的多目标对齐

Authors: Chonghua Liao, Ke Wang, Yuchuan Wu, Fei Huang, Yongbin Li
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.09756
Pdf link: https://arxiv.org/pdf/2512.09756
Abstract Role-playing agents (RPAs) must simultaneously master many conflicting skills -- following multi-turn instructions, exhibiting domain knowledge, and adopting a consistent linguistic style. Existing work either relies on supervised fine-tuning (SFT) that over-fits surface cues and yields low diversity, or applies reinforcement learning (RL) that fails to learn multiple dimensions for comprehensive RPA optimization. We present MOA (Multi-Objective Alignment), a reinforcement-learning framework that enables multi-dimensional, fine-grained rubric optimization for general RPAs. MOA introduces a novel multi-objective optimization strategy that trains simultaneously on multiple fine-grained rubrics to boost optimization performance. Besides, to address the issues of model output diversity and quality, we have also employed thought-augmented rollout with off-policy guidance. Extensive experiments on challenging benchmarks such as PersonaGym and RoleMRC show that MOA enables an 8B model to match or even outperform strong baselines such as GPT-4o and Claude across numerous dimensions. This demonstrates the great potential of MOA in building RPAs that can simultaneously meet the demands of role knowledge, persona style, diverse scenarios, and complex multi-turn conversations.
中文摘要 角色扮演代理（RPA）必须同时掌握多种相互冲突的技能：遵循多轮指令、展现领域知识，并采用一致的语言风格。现有工作要么依赖监督微调（SFT），会过度拟合表面线索并导致多样性低；要么采用强化学习（RL），却未能学习多个维度以实现全面的RPA优化。我们提出了MOA（多目标对齐），这是一种强化学习框架，能够为通用RPA实现多维度、细粒度的评分标准优化。MOA引入了一种新的多目标优化策略，同时在多个细粒度评分标准上进行训练，以提升优化性能。此外，为了解决模型输出的多样性和质量问题，我们还采用了带有离策略引导的思维增强rollout。在PersonaGym和RoleMRC等具有挑战性的基准测试上的大量实验表明，MOA使8B模型在多个维度上能够匹敌甚至超越GPT-4o和Claude等强基线。这展示了MOA在构建能够同时满足角色知识、角色风格、多样场景和复杂多轮对话需求的RPA方面的巨大潜力。

RIFT: A Scalable Methodology for LLM Accelerator Fault Assessment using Reinforcement Learning

RIFT：一种基于强化学习的可扩展方法论用于LLM加速器故障评估

Authors: Khurram Khalil, Muhammad Mahad Khaliq, Khaza Anuarul Hoque
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.09829
Pdf link: https://arxiv.org/pdf/2512.09829
Abstract The massive scale of modern AI accelerators presents critical challenges to traditional fault assessment methodologies, which face prohibitive computational costs and provide poor coverage of critical failure modes. This paper introduces RIFT (Reinforcement Learning-guided Intelligent Fault Targeting), a scalable framework that automates the discovery of minimal, high-impact fault scenarios for efficient design-time fault assessment. RIFT transforms the complex search for worst-case faults into a sequential decision-making problem, combining hybrid sensitivity analysis for search space pruning with reinforcement learning to intelligently generate minimal, high-impact test suites. Evaluated on billion-parameter Large Language Model (LLM) workloads using NVIDIA A100 GPUs, RIFT achieves a 2.2× fault assessment speedup over evolutionary methods and reduces the required test vector volume by over 99% compared to random fault injection, all while achieving superior fault coverage. The proposed framework also provides actionable data to enable intelligent hardware protection strategies, demonstrating that RIFT-guided selective error correction code provides a 12.8× improvement in cost-effectiveness (coverage per unit area) compared to uniform triple modular redundancy protection. RIFT automatically generates UVM-compliant verification artifacts, ensuring its findings are directly actionable and integrable into commercial RTL verification workflows.
中文摘要 现代人工智能加速器的庞大规模对传统故障评估方法构成了重大挑战：这些方法计算成本过高，且对关键故障模式的覆盖较差。本文介绍了RIFT（强化学习引导的智能故障定位），这是一个可扩展的框架，能够自动发现最小且高影响的故障场景，实现高效的设计阶段故障评估。RIFT将复杂的最坏情况故障搜索转化为顺序决策问题，结合用于搜索空间剪枝的混合敏感性分析与强化学习，智能生成最小且高影响的测试套件。在使用NVIDIA A100 GPU的十亿参数大型语言模型（LLM）工作负载上评估时，RIFT相较进化方法实现了2.2×的故障评估加速，并将所需测试向量量相比随机故障注入减少了99%以上，同时实现了更优的故障覆盖。该框架还提供了可操作的数据，支持智能硬件保护策略，证明RIFT引导的选择性纠错码相比统一的三模冗余保护，在成本效益（单位面积覆盖率）上提升了12.8×。RIFT自动生成符合UVM标准的验证工件，确保其发现可直接操作并集成到商业RTL验证流程中。

ChronusOmni: Improving Time Awareness of Omni Large Language Models

ChronusOmni：提升全模态大型语言模型的时间感知

Authors: Yijing Chen, Yihan Wu, Kaisi Guan, Yuchen Ren, Yuyue Wang, Ruihua Song, Liyun Ru
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Arxiv link: https://arxiv.org/abs/2512.09841
Pdf link: https://arxiv.org/pdf/2512.09841
Abstract Time awareness is a fundamental ability of omni large language models, especially for understanding long videos and answering complex questions. Previous approaches mainly target vision-language scenarios and focus on the explicit temporal grounding questions, such as identifying when a visual event occurs or determining what event happens at a specific time. However, they often make insufficient use of the audio modality, and overlook implicit temporal grounding across modalities--for example, identifying what is visually present when a character speaks, or determining what is said when a visual event occurs--despite such cross-modal temporal relations being prevalent in real-world scenarios. In this paper, we propose ChronusOmni, an omni large language model designed to enhance temporal awareness for both explicit and implicit audiovisual temporal grounding. First, we interleave text-based timestamp tokens with visual and audio representations at each time unit, enabling unified temporal modeling across modalities. Second, to enforce correct temporal ordering and strengthen fine-grained temporal reasoning, we incorporate reinforcement learning with specially designed reward functions. Moreover, we construct ChronusAV, a temporally-accurate, modality-complete, and cross-modal-aligned dataset to support the training and evaluation on the audiovisual temporal grounding task. Experimental results demonstrate that ChronusOmni achieves state-of-the-art performance on ChronusAV with more than 30% improvement and top results on most metrics on other temporal grounding benchmarks. This highlights the strong temporal awareness of our model across modalities, while preserving general video and audio understanding capabilities.
中文摘要 时间感知是全模态大型语言模型的一项基本能力，对于理解长视频和回答复杂问题尤为重要。以往的方法主要针对视觉语言场景，关注显式的时间定位问题，如识别视觉事件发生的时间或确定特定时间发生的事件。然而，它们往往未能充分利用音频模态，并忽视了跨模态的隐式时间定位（例如，识别角色说话时画面中出现了什么，或确定视觉事件发生时说了什么），尽管这类跨模态时间关系在现实场景中十分普遍。本文提出了ChronusOmni，一种全模态大型语言模型，旨在增强显式和隐式视听时间定位的时间感知。首先，我们在每个时间单元将基于文本的时间戳标记与视觉和音频表示交错排列，实现跨模态的统一时间建模。其次，为了强制正确的时间顺序并强化细粒度的时间推理，我们引入了带有专门设计的奖励函数的强化学习。此外，我们构建了ChronusAV，这是一个时间准确、模态完整且跨模态对齐的数据集，以支持视听时间定位任务的训练和评估。实验结果显示，ChronusOmni在ChronusAV上实现了最先进的性能，提升超过30%，并在其他时间定位基准测试的大多数指标上取得最佳结果。这凸显了我们模型跨模态的强大时间感知能力，同时保持了通用的视频和音频理解能力。
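The interleaving step described in the abstract (a text timestamp token placed with the visual and audio representations of each time unit) can be sketched as below. The token format `<t=...s>`, the function name, and the use of strings in place of embeddings are all assumptions made for illustration; the abstract does not specify any of them.

```python
from typing import List, Sequence


def interleave_with_timestamps(
    visual_units: Sequence[Sequence[str]],
    audio_units: Sequence[Sequence[str]],
    seconds_per_unit: float = 1.0,
) -> List[str]:
    """Build one sequence: timestamp token, then visual tokens, then audio tokens, per time unit."""
    if len(visual_units) != len(audio_units):
        raise ValueError("visual and audio streams must cover the same number of time units")
    sequence: List[str] = []
    for index, (visual, audio) in enumerate(zip(visual_units, audio_units)):
        # Hypothetical timestamp token; a real model would use its own vocabulary.
        sequence.append(f"<t={index * seconds_per_unit:.1f}s>")
        sequence.extend(visual)
        sequence.extend(audio)
    return sequence


tokens = interleave_with_timestamps(
    visual_units=[["v0a", "v0b"], ["v1a"]],
    audio_units=[["a0"], ["a1"]],
    seconds_per_unit=2.0,
)
```

Because both modalities sit behind the same timestamp token, a question such as "what is said when event X is visible" reduces to reading tokens that share a time unit.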

FlipLLM: Efficient Bit-Flip Attacks on Multimodal LLMs using Reinforcement Learning

FlipLLM：利用强化学习对多模态大型语言模型进行高效的位翻转攻击

Authors: Khurram Khalil, Khaza Anuarul Hoque
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.09872
Pdf link: https://arxiv.org/pdf/2512.09872
Abstract Generative Artificial Intelligence models, such as Large Language Models (LLMs) and Large Vision Models (VLMs), exhibit state-of-the-art performance but remain vulnerable to hardware-based threats, specifically bit-flip attacks (BFAs). Existing BFA discovery methods lack generalizability and struggle to scale, often failing to analyze the vast parameter space and complex interdependencies of modern foundation models in a reasonable time. This paper proposes FlipLLM, a reinforcement learning (RL) architecture-agnostic framework that formulates BFA discovery as a sequential decision-making problem. FlipLLM combines sensitivity-guided layer pruning with Q-learning to efficiently identify minimal, high-impact bit sets that can induce catastrophic failure. We demonstrate the effectiveness and generalizability of FlipLLM by applying it to a diverse set of models, including prominent text-only LLMs (GPT-2 Large, LLaMA 3.1 8B, and DeepSeek-V2 7B), VLMs such as LLaVA 1.6, and datasets, such as MMLU, MMLU-Pro, VQAv2, and TextVQA. Our results show that FlipLLM can identify critical bits that are vulnerable to BFAs up to 2.5x faster than SOTA methods. We demonstrate that flipping the FlipLLM-identified bits plummets the accuracy of LLaMA 3.1 8B from 69.9% to ~0.2%, and for LLaVA's VQA score from 78% to almost 0%, by flipping as few as 5 and 7 bits, respectively. Further analysis reveals that applying standard hardware protection mechanisms, such as ECC SECDED, to the FlipLLM-identified bit locations completely mitigates the BFA impact, demonstrating the practical value of our framework in guiding hardware-level defenses. FlipLLM offers the first scalable and adaptive methodology for exploring the BFA vulnerability of both language and multimodal foundation models, paving the way for comprehensive hardware-security evaluation.
中文摘要 生成式人工智能模型，如大型语言模型（LLM）和大型视觉模型（VLMs），性能最先进，但仍易受到硬件威胁，特别是位翻转攻击（BFA）的威胁。现有的BFA发现方法缺乏泛化性，难以扩展，常常无法在合理时间内分析现代基础模型庞大的参数空间和复杂的相互依赖关系。本文提出了FlipLLM，一种强化学习（RL）架构无关框架，将BFA发现表述为顺序决策问题。FlipLLM结合了灵敏度引导层剪枝与Q-learning，高效识别可能引发灾难性故障的最小、高影响的比特集。我们通过将FlipLLM应用于多样化模型，展示了其有效性和可推广性，包括著名的纯文本LLM（GPT-2 Large、LLaMA 3.1 8B和DeepSeek-V2 7B）、VLM如LLaVA 1.6，以及数据集如MMLU、MMLU-Pro、VQAv2和TextVQA。我们的结果表明，FlipLLM能够比SOTA方法快2.5倍识别易受BFA影响的关键位。我们证明，翻转FlipLLM识别的比特会使LLaMA 3.1 8B的准确率从69.9%降至~0.2%，LLaVA的VQA分数从78%降至接近0%，分别仅翻转5位和7位。进一步分析显示，将标准硬件保护机制（如ECC SECDED）应用于FlipLLM识别的位位置，完全减轻了BFA的影响，展示了我们框架在指导硬件级防御方面的实用价值。FlipLLM提供了首个可扩展且自适应的方法，用于探索语言和多模态基础模型的BFA脆弱性，为全面的硬件安全评估铺平了道路。
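To see why a handful of flipped bits can be catastrophic, the sketch below flips single bits in the IEEE 754 float32 encoding of one weight. It illustrates only the underlying bit arithmetic, not FlipLLM's search procedure. The choice of float32 is an assumption, since the abstract does not state which numeric formats the evaluated models use.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant) of the float32 encoding of value."""
    if not 0 <= bit < 32:
        raise ValueError("float32 has bits 0..31")
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


weight = 0.5
mantissa_flip = flip_bit(weight, 0)    # lowest mantissa bit: a negligible change
sign_flip = flip_bit(weight, 31)       # sign bit: the weight changes sign
exponent_flip = flip_bit(weight, 30)   # highest exponent bit: the weight becomes 2**127
```

The position of the bit matters far more than the number of bits: a mantissa flip perturbs the weight by about 6e-8, while the top exponent bit turns 0.5 into roughly 1.7e38. This is consistent with the abstract's report that very few well-chosen bits suffice.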

STACHE: Local Black-Box Explanations for Reinforcement Learning Policies

STACHE：强化学习策略的局部黑箱解释

Authors: Andrew Elashkin, Orna Grumberg
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.09909
Pdf link: https://arxiv.org/pdf/2512.09909
Abstract Reinforcement learning agents often behave unexpectedly in sparse-reward or safety-critical environments, creating a strong need for reliable debugging and verification tools. In this paper, we propose STACHE, a comprehensive framework for generating local, black-box explanations for an agent's specific action within discrete Markov games. Our method produces a Composite Explanation consisting of two complementary components: (1) a Robustness Region, the connected neighborhood of states where the agent's action remains invariant, and (2) Minimal Counterfactuals, the smallest state perturbations required to alter that decision. By exploiting the structure of factored state spaces, we introduce an exact, search-based algorithm that circumvents the fidelity gaps of surrogate models. Empirical validation on Gymnasium environments demonstrates that our framework not only explains policy actions, but also effectively captures the evolution of policy logic during training - from erratic, unstable behavior to optimized, robust strategies - providing actionable insights into agent sensitivity and decision boundaries.
中文摘要 强化学习智能体在奖励稀疏或安全关键的环境中常常表现出意外行为，因此迫切需要可靠的调试和验证工具。本文提出了STACHE，这是一个综合框架，用于为离散马尔可夫博弈中智能体的特定动作生成局部黑箱解释。我们的方法产生一个复合解释，由两个互补部分组成：（1）鲁棒性区域，即智能体动作保持不变的连通状态邻域；（2）最小反事实，即改变该决策所需的最小状态扰动。通过利用因子化状态空间的结构，我们引入了一种精确的、基于搜索的算法，避开了代理模型（surrogate model）的保真度差距。在Gymnasium环境上的实证验证表明，我们的框架不仅能解释策略动作，还能有效捕捉训练过程中策略逻辑的演变（从飘忽、不稳定的行为到优化且稳健的策略），为智能体敏感性和决策边界提供可操作的洞见。
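The two components of the Composite Explanation can be illustrated on a toy factored state space. This is a sketch under stated assumptions, not the paper's algorithm: states are integer tuples, a perturbation changes one factor by one step, and both searches are plain breadth-first searches that only query the policy as a black box. The policy, bounds, and function names are hypothetical.

```python
from collections import deque
from typing import Callable, Iterator, List, Sequence, Set, Tuple

State = Tuple[int, ...]
Bounds = Sequence[Tuple[int, int]]


def neighbors(state: State, bounds: Bounds) -> Iterator[State]:
    """States that differ from `state` by one unit step in exactly one factor."""
    for i, (low, high) in enumerate(bounds):
        for delta in (-1, 1):
            value = state[i] + delta
            if low <= value <= high:
                yield state[:i] + (value,) + state[i + 1:]


def robustness_region(policy: Callable[[State], str], start: State, bounds: Bounds) -> Set[State]:
    """Connected set of states around `start` on which the policy picks the same action."""
    action = policy(start)
    region = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for nxt in neighbors(state, bounds):
            if nxt not in region and policy(nxt) == action:
                region.add(nxt)
                queue.append(nxt)
    return region


def minimal_counterfactuals(policy: Callable[[State], str], start: State, bounds: Bounds) -> List[State]:
    """Closest states (fewest unit steps from `start`) on which the action changes."""
    action = policy(start)
    seen = {start}
    frontier = [start]
    while frontier:
        next_frontier = []
        for state in frontier:
            for nxt in neighbors(state, bounds):
                if nxt not in seen:
                    seen.add(nxt)
                    next_frontier.append(nxt)
        changed = sorted(s for s in next_frontier if policy(s) != action)
        if changed:
            return changed
        frontier = next_frontier
    return []


def toy_policy(state: State) -> str:
    """Hypothetical policy on a 4x3 grid: move right until x reaches 2, then move up."""
    x, _y = state
    return "right" if x < 2 else "up"


bounds = [(0, 3), (0, 2)]
region = robustness_region(toy_policy, (0, 0), bounds)
flips = minimal_counterfactuals(toy_policy, (0, 0), bounds)
```

For the start state (0, 0), the region covers the six cells with x below 2, and the only minimal counterfactual is (2, 0), two steps away. Because both functions call the policy directly, no surrogate model is involved, which matches the black-box framing of the abstract.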

Keyword: diffusion policy

Simultaneous Tactile-Visual Perception for Learning Multimodal Robot Manipulation

面向多模态机器人操作学习的同步触觉-视觉感知

Authors: Yuyang Li, Yinghan Chen, Zihang Zhao, Puhao Li, Tengyu Liu, Siyuan Huang, Yixin Zhu
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2512.09851
Pdf link: https://arxiv.org/pdf/2512.09851
Abstract Robotic manipulation requires both rich multimodal perception and effective learning frameworks to handle complex real-world tasks. See-through-skin (STS) sensors, which combine tactile and visual perception, offer promising sensing capabilities, while modern imitation learning provides powerful tools for policy acquisition. However, existing STS designs lack simultaneous multimodal perception and suffer from unreliable tactile tracking. Furthermore, integrating these rich multimodal signals into learning-based manipulation pipelines remains an open challenge. We introduce TacThru, an STS sensor enabling simultaneous visual perception and robust tactile signal extraction, and TacThru-UMI, an imitation learning framework that leverages these multimodal signals for manipulation. Our sensor features a fully transparent elastomer, persistent illumination, novel keyline markers, and efficient tracking, while our learning system integrates these signals through a Transformer-based Diffusion Policy. Experiments on five challenging real-world tasks show that TacThru-UMI achieves an average success rate of 85.5%, significantly outperforming the baselines of alternating tactile-visual (66.3%) and vision-only (55.4%). The system excels in critical scenarios, including contact detection with thin and soft objects and precision manipulation requiring multimodal coordination. This work demonstrates that combining simultaneous multimodal perception with modern learning frameworks enables more precise, adaptable robotic manipulation.
中文摘要 机器人操作需要丰富的多模态感知和有效的学习框架来处理复杂的现实任务。透视皮肤（STS）传感器结合了触觉和视觉感知，具备有前景的感知能力，而现代模仿学习则为策略获取提供了强大的工具。然而，现有的STS设计缺乏同步的多模态感知，且触觉追踪不可靠。此外，将这些丰富的多模态信号整合进基于学习的操作流程仍是一个开放的挑战。我们提出了TacThru，一种能够同时实现视觉感知和稳健触觉信号提取的STS传感器，以及TacThru-UMI，一种利用这些多模态信号进行操作的模仿学习框架。我们的传感器采用全透明弹性体、持续照明、新颖的关键线标记和高效的跟踪，而我们的学习系统通过基于Transformer的扩散策略整合这些信号。在五项具有挑战性的现实任务上的实验显示，TacThru-UMI的平均成功率为85.5%，显著优于交替触觉-视觉（66.3%）和仅视觉（55.4%）的基线。该系统在关键场景中表现出色，包括对薄且软的物体的接触检测，以及需要多模态协调的精密操作。这项工作表明，将同步多模态感知与现代学习框架结合，能够实现更精准、更具适应性的机器人操作。