Arxiv Papers of Today

Generated: 2025-12-24 16:33:13 (UTC+8); arXiv release time: 2025-12-24 20:00 EST (2025-12-25 09:00 UTC+8)

32 relevant papers today

Keyword: reinforcement learning

QoS-Aware Dynamic CU Selection in O-RAN with Graph-Based Reinforcement Learning

Authors: Sebastian Racedo, Brigitte Jaumard, Oscar Delgado, Meysam Masoudi
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.19696
Pdf link: https://arxiv.org/pdf/2512.19696
Abstract Open Radio Access Network (O-RAN) disaggregates the conventional RAN into interoperable components, enabling flexible resource allocation, energy savings, and agile architectural design. In legacy deployments, the binding between logical functions and physical locations is static, which leads to inefficiencies under time-varying traffic and resource conditions. We address this limitation by relaxing the fixed mapping and performing dynamic service function chain (SFC) provisioning with on-the-fly O-CU selection. We formulate the problem as a Markov decision process and solve it using GRLDyP, a graph neural network (GNN)-assisted deep reinforcement learning (DRL) approach. The proposed agent jointly selects routes and the O-CU location (from candidate sites) for each incoming service flow to minimize network energy consumption while satisfying quality of service (QoS) constraints. The GNN encodes the instantaneous network topology and resource utilization (e.g., CPU and bandwidth), and the DRL policy learns to balance grade of service, latency, and energy. We evaluate GRLDyP on a dataset of 24-hour traffic traces from the city of Montreal, showing that dynamic O-CU selection and routing significantly reduce energy consumption compared to a static mapping baseline, without violating QoS. The results highlight DRL-based SFC provisioning as a practical control primitive for energy-aware, resource-adaptive O-RAN deployments.

Holographic MIMO Empowered NOMA-ISAC for 6G: Rate-Splitting Enhanced Near-Field Modeling, Multi-Objective Optimization, and Statistical Performance Validation

Authors: Sumita Majhi
Subjects: Networking and Internet Architecture (cs.NI); Information Theory (cs.IT)
Arxiv link: https://arxiv.org/abs/2512.19699
Pdf link: https://arxiv.org/pdf/2512.19699
Abstract Holographic multiple-input multiple-output (MIMO) systems with extremely large apertures enable transformational capabilities for sixth-generation (6G) integrated sensing and communications (ISAC). However, existing non-orthogonal multiple access (NOMA) ISAC works inadequately address: (i) holographic near-field propagation with sub-wavelength antenna spacing; (ii) rate-splitting multiple access (RSMA) integration for interference management; (iii) statistical validation under realistic impairments. This paper presents a comprehensive holographic MIMO NOMA-ISAC framework featuring: (1) unified near-field modeling incorporating spatially-correlated Rayleigh fading, spherical wavefront propagation, and sub-wavelength antenna coupling effects; (2) a novel rate-splitting enhanced NOMA (RS-NOMA) architecture enabling flexible interference management between sensing and communication; (3) a multi-objective optimization suite comparing hybrid alternating optimization with successive convex approximation (HAO-SCA), weighted minimum mean square error (WMMSE), semidefinite relaxation (SDR), fractional programming (FP), and deep reinforcement learning (DRL); (4) rigorous statistical validation over 5000 Monte Carlo runs with significance testing across massive MIMO scenarios (up to 1024 antennas). Results demonstrate that RS-NOMA achieves 11.7% higher sum-rate than conventional NOMA and 18.8% over WMMSE at matched sensing utility. Sensing CRLB improvements of 2.4 dB are confirmed with 99% statistical confidence. The framework establishes rigorous foundations for practical 6G holographic MIMO ISAC deployment.

Thermodynamic Focusing for Inference-Time Search: Practical Methods for Target-Conditioned Sampling and Prompted Inference

Authors: Zhan Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.19717
Pdf link: https://arxiv.org/pdf/2512.19717
Abstract Finding rare but useful solutions in very large candidate spaces is a recurring practical challenge across language generation, planning, and reinforcement learning. We present a practical framework, Inverted Causality Focusing Algorithm (ICFA), that treats search as a target-conditioned reweighting process. ICFA reuses an available proposal sampler and a task-specific similarity function to form a focused sampling distribution, while adaptively controlling focusing strength to avoid degeneracy. We provide a clear recipe, a stability diagnostic based on effective sample size, a compact theoretical sketch explaining when ICFA can reduce sample needs, and two reproducible experiments: constrained language generation and sparse-reward navigation. We further show how structured prompts instantiate an approximate, language-level form of ICFA and describe a hybrid architecture combining prompted inference with algorithmic reweighting.
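The effective-sample-size diagnostic and adaptive focusing strength mentioned in the ICFA abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: the exponential tilt exp(beta * similarity), the function names, and the bisection on beta are assumptions chosen only to show how a focusing strength might be capped so that the weights do not collapse onto a single sample.

```python
import math


def focused_weights(similarities, beta):
    """Normalized weights proportional to exp(beta * similarity) (assumed tilt)."""
    m = max(similarities)
    raw = [math.exp(beta * (s - m)) for s in similarities]  # shift for stability
    total = sum(raw)
    return [r / total for r in raw]


def effective_sample_size(weights):
    """Kish effective sample size for normalized weights: 1 / sum(w_i^2)."""
    return 1.0 / sum(w * w for w in weights)


def adapt_beta(similarities, min_ess, beta_max=50.0, iters=60):
    """Largest beta in [0, beta_max] whose ESS stays at or above min_ess.

    Uses bisection, relying on ESS being non-increasing in beta for this tilt.
    Assumes min_ess does not exceed the number of samples.
    """
    if effective_sample_size(focused_weights(similarities, beta_max)) >= min_ess:
        return beta_max
    lo, hi = 0.0, beta_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if effective_sample_size(focused_weights(similarities, mid)) >= min_ess:
            lo = mid
        else:
            hi = mid
    return lo
```

With beta = 0 the weights are uniform and the effective sample size equals the number of candidates; raising beta concentrates mass on high-similarity candidates and lowers it, which is why a floor on the effective sample size acts as a degeneracy guard.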

Tiny, On-Device Decision Makers with the MiniConv Library

Authors: Carlos Purves
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.19726
Pdf link: https://arxiv.org/pdf/2512.19726
Abstract Reinforcement learning (RL) has achieved strong results, but deploying visual policies on resource-constrained edge devices remains challenging due to computational cost and communication latency. Many deployments therefore offload policy inference to a remote server, incurring network round trips and requiring transmission of high-dimensional observations. We introduce a split-policy architecture in which a small on-device encoder, implemented as OpenGL fragment-shader passes for broad embedded GPU support, transforms each observation into a compact feature tensor that is transmitted to a remote policy head. In RL, this communication overhead manifests as closed-loop decision latency rather than only per-request inference latency. The proposed approach reduces transmitted data, lowers decision latency in bandwidth-limited settings, and reduces server-side compute per request, whilst achieving broadly comparable learning performance by final return (mean over the final 100 episodes) in single-run benchmarks, with modest trade-offs in mean return. We evaluate across an NVIDIA Jetson Nano, a Raspberry Pi 4B, and a Raspberry Pi Zero 2 W, reporting learning results, on-device execution behaviour under sustained load, and end-to-end decision latency and scalability measurements under bandwidth shaping. Code for training, deployment, and measurement is released as open source.

Hard Negative Sample-Augmented DPO Post-Training for Small Language Models

Authors: Haocheng Lu, Minjun Zhu, Henry Yu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.19728
Pdf link: https://arxiv.org/pdf/2512.19728
Abstract Large language models (LLMs) continue to struggle with mathematical reasoning, and common post-training pipelines often reduce each generated solution to a binary outcome: correct or incorrect. This perspective is limiting in practice, as failures in chain-of-thought (CoT) reasoning are frequently structured; solutions may appear convincing while containing subtle logical, algebraic, or numerical flaws. Meanwhile, reinforcement learning from human feedback (RLHF) variants that rely on large reward models or LLM-as-a-judge signals are often expensive, difficult to scale, and unstable to iterate. We propose a lightweight and pragmatic post-training pipeline that targets such structured errors under realistic compute budgets. Starting from supervised fine-tuning (SFT) on MetaMathQA-style CoT data, we introduce a compact MathVerifier that decomposes a candidate solution into a six-dimensional error profile and aggregates it into interpretable wrongness and absurdity scores. These verifier signals serve two roles: (i) mining hard negatives that are near-correct yet structurally flawed, and (ii) defining per-sample importance weights that emphasize the most informative preference pairs. We integrate both into an offline Direct Preference Optimization (DPO) objective via a verifier-guided weighted formulation. Experiments on a 1.5B-parameter Qwen2.5 model show that verifier-guided, weighted DPO yields more targeted improvements than vanilla SFT and unweighted DPO, particularly on problems where solutions are numerically close to correct but logically inconsistent, while avoiding the overhead of training large reward models or relying on external judges.
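The weighted DPO idea described in this abstract can be sketched as follows. The abstract does not give the exact weighting, so this is only an assumed form: the standard DPO per-pair loss multiplied by a per-sample importance weight and averaged by the total weight. The dictionary keys and the `weight` field are hypothetical names for illustration.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def weighted_dpo_loss(pairs, beta=0.1):
    """Weighted mean of per-pair DPO losses.

    Each pair is a dict holding summed log-probabilities of the chosen and
    rejected solutions under the policy and the frozen reference model,
    plus a hypothetical verifier-derived importance weight.
    """
    total, weight_sum = 0.0, 0.0
    for p in pairs:
        # Implicit reward margin between the chosen and rejected solutions.
        margin = (p["policy_chosen"] - p["ref_chosen"]) - (
            p["policy_rejected"] - p["ref_rejected"]
        )
        total += p["weight"] * -log_sigmoid(beta * margin)
        weight_sum += p["weight"]
    return total / weight_sum
```

Setting every weight to 1 recovers the unweighted DPO objective, so the weights only change how strongly each preference pair pulls on the policy.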

OpComm: A Reinforcement Learning Framework for Adaptive Buffer Control in Warehouse Volume Forecasting

Authors: Wilson Fung, Lu Guo, Drake Hilliard, Alessandro Casadei, Raj Ratan, Sreyoshi Bhaduri, Adi Surve, Nikhil Agarwal, Rohit Malshe, Pavan Mullapudi, Hungjen Wang, Saurabh Doodhwala, Ankush Pole, Arkajit Rakshit
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.19738
Pdf link: https://arxiv.org/pdf/2512.19738
Abstract Accurate forecasting of package volumes at delivery stations is critical for last-mile logistics, where errors lead to inefficient resource allocation, higher costs, and delivery delays. We propose OpComm, a forecasting and decision-support framework that combines supervised learning with reinforcement learning-based buffer control and a generative AI-driven communication module. A LightGBM regression model generates station-level demand forecasts, which serve as context for a Proximal Policy Optimization (PPO) agent that selects buffer levels from a discrete action set. The reward function penalizes under-buffering more heavily than over-buffering, reflecting real-world trade-offs between unmet demand risks and resource inefficiency. Station outcomes are fed back through a Monte Carlo update mechanism, enabling continual policy adaptation. To enhance interpretability, a generative AI layer produces executive-level summaries and scenario analyses grounded in SHAP-based feature attributions. Across 400+ stations, OpComm reduced Weighted Absolute Percentage Error (WAPE) by 21.65% compared to manual forecasts, while lowering under-buffering incidents and improving transparency for decision-makers. This work shows how contextual reinforcement learning, coupled with predictive modeling, can address operational forecasting challenges and bridge statistical rigor with practical decision-making in high-stakes logistics environments.
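The asymmetric reward described in the OpComm abstract can be sketched in a few lines. The penalty values, the percentage-style buffer, and the function name are assumptions for illustration; the abstract only states that under-buffering is penalized more heavily than over-buffering.

```python
def buffer_reward(forecast, buffer_pct, actual, under_penalty=3.0, over_penalty=1.0):
    """Negative cost of a buffer choice; shortfalls cost more than surplus.

    forecast:   predicted volume (e.g., from a regression model)
    buffer_pct: chosen buffer as a fraction of the forecast (assumed action form)
    actual:     realized volume
    """
    planned = forecast * (1.0 + buffer_pct)
    gap = planned - actual
    if gap < 0:
        # Under-buffered: realized demand exceeded the planned capacity.
        return -under_penalty * (-gap)
    # Over-buffered: planned capacity was left unused.
    return -over_penalty * gap
```

For the same absolute gap, the under-buffered case yields a reward three times more negative under these assumed penalties, which pushes a learned policy toward slightly conservative buffers.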

Learning to Design City-scale Transit Routes

Authors: Bibek Poudel, Weizi Li
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2512.19767
Pdf link: https://arxiv.org/pdf/2512.19767
Abstract Designing efficient transit route networks is an NP-hard problem with exponentially large solution spaces that traditionally relies on manual planning processes. We present an end-to-end reinforcement learning (RL) framework based on graph attention networks for sequential transit network construction. To address the long-horizon credit assignment challenge, we introduce a two-level reward structure combining incremental topological feedback with simulation-based terminal rewards. We evaluate our approach on a new real-world dataset from Bloomington, Indiana with topologically accurate road networks, census-derived demand, and existing transit routes. Our learned policies substantially outperform existing designs and traditional heuristics across two initialization schemes and two modal-split scenarios. Under high transit adoption with transit center initialization, our approach achieves 25.6% higher service rates, 30.9% shorter wait times, and 21.0% better bus utilization compared to the real-world network. Under mixed-mode conditions with random initialization, it delivers 68.8% higher route efficiency than demand coverage heuristics and 5.9% lower travel times than shortest path construction. These results demonstrate that end-to-end RL can design transit networks that substantially outperform both human-designed systems and hand-crafted heuristics on realistic city-scale benchmarks.

Mitigating LLM Hallucination via Behaviorally Calibrated Reinforcement Learning

Authors: Jiayun Wu, Jiashuo Liu, Zhiyuan Zeng, Tianyang Zhan, Wenhao Huang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.19920
Pdf link: https://arxiv.org/pdf/2512.19920
Abstract LLM deployment in critical domains is currently impeded by persistent hallucinations--generating plausible but factually incorrect assertions. While scaling laws drove significant improvements in general capabilities, theoretical frameworks suggest hallucination is not merely stochastic error but a predictable statistical consequence of training objectives prioritizing mimicking data distribution over epistemic honesty. Standard RLVR paradigms, utilizing binary reward signals, inadvertently incentivize models as good test-takers rather than honest communicators, encouraging guessing whenever correctness probability exceeds zero. This paper presents an exhaustive investigation into behavioral calibration, which incentivizes models to stochastically admit uncertainty by abstaining when not confident, aligning model behavior with accuracy. Synthesizing recent advances, we propose and evaluate training interventions optimizing strictly proper scoring rules for models to output a calibrated probability of correctness. Our methods enable models to either abstain from producing a complete response or flag individual claims where uncertainty remains. Utilizing Qwen3-4B-Instruct, empirical analysis reveals behavior-calibrated reinforcement learning allows smaller models to surpass frontier models in uncertainty quantification--a transferable meta-skill decouplable from raw predictive accuracy. Trained on math reasoning tasks, our model's log-scale Accuracy-to-Hallucination Ratio gain (0.806) exceeds GPT-5's (0.207) in a challenging in-domain evaluation (BeyondAIME). Moreover, in cross-domain factual QA (SimpleQA), our 4B LLM achieves zero-shot calibration error on par with frontier models including Grok-4 and Gemini-2.5-Pro, even though its factual accuracy is much lower.
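The role of a strictly proper scoring rule in this abstract can be illustrated with a small numerical check. The abstract does not say which rule the authors optimize; the Brier-style reward below is an assumed example, and the function names are hypothetical. The sketch shows that the expected reward is maximized when the reported confidence equals the true probability of being correct.

```python
def expected_brier_reward(reported, true_prob):
    """Expected value of 1 - (reported - outcome)^2, outcome ~ Bernoulli(true_prob)."""
    return (
        true_prob * (1.0 - (reported - 1.0) ** 2)
        + (1.0 - true_prob) * (1.0 - reported ** 2)
    )


def best_report(true_prob, grid_size=101):
    """Grid-search the reported confidence that maximizes the expected reward."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return max(grid, key=lambda q: expected_brier_reward(q, true_prob))
```

Under a plain binary correctness reward, guessing has non-negative expected value whenever the chance of being right exceeds zero; a proper scoring rule removes that incentive because overstating confidence lowers the expected reward.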

An Optimal Policy for Learning Controllable Dynamics by Exploration

Authors: Peter N. Loxley
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2512.20053
Pdf link: https://arxiv.org/pdf/2512.20053
Abstract Controllable Markov chains describe the dynamics of sequential decision making tasks and are the central component in optimal control and reinforcement learning. In this work, we give the general form of an optimal policy for learning controllable dynamics in an unknown environment by exploring over a limited time horizon. This policy is simple to implement and efficient to compute, and allows an agent to "learn by exploring" as it maximizes its information gain in a greedy fashion by selecting controls from a constraint set that changes over time during exploration. We give a simple parameterization for the set of controls, and present an algorithm for finding an optimal policy. The policy takes this form because certain types of states restrict control of the dynamics, such as transient states, absorbing states, and non-backtracking states. We show why the occurrence of these states makes a non-stationary policy essential for achieving optimal exploration. Six interesting examples of controllable dynamics are treated in detail. Policy optimality is demonstrated using counting arguments, comparing with suboptimal policies, and by making use of a sequential improvement property from dynamic programming.

Scaling Reinforcement Learning for Content Moderation with Large Language Models

Authors: Hamed Firooz, Rui Liu, Yuchen Lu, Zhenyu Hou, Fangzhou Xiong, Xiaoyang Zhang, Changshu Jian, Zhicheng Zhu, Jiayuan Ma, Jacob Tao, Chaitali Gupta, Xiaochang Peng, Shike Mei, Hang Cui, Yang Qin, Shuo Tang, Jason Gaedtke, Arpit Mittal
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.20061
Pdf link: https://arxiv.org/pdf/2512.20061
Abstract Content moderation at scale remains one of the most pressing challenges in today's digital ecosystem, where billions of user- and AI-generated artifacts must be continuously evaluated for policy violations. Although recent advances in large language models (LLMs) have demonstrated strong potential for policy-grounded moderation, the practical challenges of training these systems to achieve expert-level accuracy in real-world settings remain largely unexplored, particularly in regimes characterized by label sparsity, evolving policy definitions, and the need for nuanced reasoning beyond shallow pattern matching. In this work, we present a comprehensive empirical investigation of scaling reinforcement learning (RL) for content classification, systematically evaluating multiple RL training recipes and reward-shaping strategies (including verifiable rewards and LLM-as-judge frameworks) to transform general-purpose language models into specialized, policy-aligned classifiers across three real-world content moderation tasks. Our findings provide actionable insights for industrial-scale moderation systems, demonstrating that RL exhibits sigmoid-like scaling behavior in which performance improves smoothly with increased training data, rollouts, and optimization steps before gradually saturating. Moreover, we show that RL substantially improves performance on tasks requiring complex policy-grounded reasoning while achieving up to 100x higher data efficiency than supervised fine-tuning, making it particularly effective in domains where expert annotations are scarce or costly.

Adaptive Financial Sentiment Analysis for NIFTY 50 via Instruction-Tuned LLMs, RAG and Reinforcement Learning Approaches

Authors: Chaithra, Kamesh Kadimisetty, Biju R Mohan
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.20082
Pdf link: https://arxiv.org/pdf/2512.20082
Abstract Financial sentiment analysis plays a crucial role in informing investment decisions, assessing market risk, and predicting stock price trends. Existing works in financial sentiment analysis have not considered the impact of stock prices or market feedback on sentiment analysis. In this paper, we propose an adaptive framework that integrates large language models (LLMs) with real-world stock market feedback to improve sentiment classification in the context of the Indian stock market. The proposed methodology fine-tunes the LLaMA 3.2 3B model using instruction-based learning on the SentiFin dataset. To enhance sentiment predictions, a retrieval-augmented generation (RAG) pipeline is employed that dynamically selects multi-source contextual information based on the cosine similarity of the sentence embeddings. Furthermore, a feedback-driven module is introduced that adjusts the reliability of the source by comparing predicted sentiment with actual next-day stock returns, allowing the system to iteratively adapt to market behavior. To generalize this adaptive mechanism across temporal data, a reinforcement learning agent trained using proximal policy optimization (PPO) is incorporated. The PPO agent learns to optimize source weighting policies based on cumulative reward signals from sentiment-return alignment. Experimental results on NIFTY 50 news headlines collected from 2024 to 2025 demonstrate that the proposed system significantly improves classification accuracy, F1-score, and market alignment over baseline models and static retrieval methods. The results validate the potential of combining instruction-tuned LLMs with dynamic feedback and reinforcement learning for robust, market-aware financial sentiment modeling.

Memory-T1: Reinforcement Learning for Temporal Reasoning in Multi-session Agents

Authors: Yiming Du, Baojun Wang, Yifan Xiang, Zhaowei Wang, Wenyu Huang, Boyang Xue, Bin Liang, Xingshan Zeng, Fei Mi, Haoli Bai, Lifeng Shang, Jeff Z. Pan, Yuxin Jiang, Kam-Fai Wong
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.20092
Pdf link: https://arxiv.org/pdf/2512.20092
Abstract Temporal reasoning over long, multi-session dialogues is a critical capability for conversational agents. However, existing works and our pilot study have shown that as dialogue histories grow in length and accumulate noise, current long-context models struggle to accurately identify temporally pertinent information, significantly impairing reasoning performance. To address this, we introduce Memory-T1, a framework that learns a time-aware memory selection policy using reinforcement learning (RL). It employs a coarse-to-fine strategy, first pruning the dialogue history into a candidate set using temporal and relevance filters, followed by an RL agent that selects the precise evidence sessions. The RL training is guided by a multi-level reward function optimizing (i) answer accuracy, (ii) evidence grounding, and (iii) temporal consistency. In particular, the temporal consistency reward provides a dense signal by evaluating alignment with the query time scope at both the session-level (chronological proximity) and the utterance-level (chronological fidelity), enabling the agent to resolve subtle chronological ambiguities. On the Time-Dialog benchmark, Memory-T1 boosts a 7B model to an overall score of 67.0%, establishing a new state-of-the-art performance for open-source models and outperforming a 14B baseline by 10.2%. Ablation studies show temporal consistency and evidence grounding rewards jointly contribute to a 15.0% performance gain. Moreover, Memory-T1 maintains robustness up to 128k tokens, where baseline models collapse, proving effectiveness against noise in extensive dialogue histories. The code and datasets are publicly available.

Information-directed sampling for bandits: a primer

Authors: Annika Hirling, Giorgio Nicoletti, Antonio Celani
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT)
Arxiv link: https://arxiv.org/abs/2512.20096
Pdf link: https://arxiv.org/pdf/2512.20096
Abstract The Multi-Armed Bandit problem provides a fundamental framework for analyzing the tension between exploration and exploitation in sequential learning. This paper explores Information Directed Sampling (IDS) policies, a class of heuristics that balance immediate regret against information gain. We focus on the tractable environment of two-state Bernoulli bandits as a minimal model to rigorously compare heuristic strategies against the optimal policy. We extend the IDS framework to the discounted infinite-horizon setting by introducing a modified information measure and a tuning parameter to modulate the decision-making behavior. We examine two specific problem classes: symmetric bandits and the scenario involving one fair coin. In the symmetric case we show that IDS achieves bounded cumulative regret, whereas in the one-fair-coin scenario the IDS policy yields a regret that scales logarithmically with the horizon, in agreement with classical asymptotic lower bounds. This work serves as a pedagogical synthesis, aiming to bridge concepts from reinforcement learning and information theory for an audience of statistical physicists.

ABBEL: LLM Agents Acting through Belief Bottlenecks Expressed in Language

Authors: Aly Lidayan, Jakob Bjorner, Satvik Golechha, Kartik Goyal, Alane Suhr
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.20111
Pdf link: https://arxiv.org/pdf/2512.20111
Abstract As the length of sequential decision-making tasks increases, it becomes computationally impractical to keep full interaction histories in context. We introduce a general framework for LLM agents to maintain concise contexts through multi-step interaction: Acting through Belief Bottlenecks Expressed in Language (ABBEL), and methods to further improve ABBEL agents with RL post-training. ABBEL replaces long multi-step interaction history by a belief state, i.e., a natural language summary of what has been discovered about task-relevant unknowns. Under ABBEL, at each step the agent first updates a prior belief with the most recent observation from the environment to form a posterior belief, then uses only the posterior to select an action. We systematically evaluate frontier models under ABBEL across six diverse multi-step environments, finding that ABBEL supports generating interpretable beliefs while maintaining near-constant memory use over interaction steps. However, bottleneck approaches are generally prone to error propagation, which we observe causing inferior performance when compared to the full context setting due to errors in belief updating. Therefore, we train LLMs to generate and act on beliefs within the ABBEL framework via reinforcement learning (RL). We experiment with belief grading, to reward higher quality beliefs, as well as belief length penalties to reward more compressed beliefs. Our experiments demonstrate the ability of RL to improve ABBEL's performance beyond the full context setting, while using less memory than contemporaneous approaches.
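The belief-bottleneck loop described in the ABBEL abstract (update the prior belief with the latest observation, then act on the posterior alone) can be sketched generically. All names here are hypothetical: `update_belief` and `choose_action` stand in for LLM calls, and `env_step` stands in for an environment that returns an observation and a done flag.

```python
def run_belief_bottleneck(initial_obs, env_step, update_belief, choose_action,
                          initial_belief="", max_steps=10):
    """Run one episode while carrying only a belief string between steps.

    The full interaction history is never stored: each step sees the prior
    belief plus the most recent observation, and the action sees only the
    resulting posterior belief.
    """
    belief, obs = initial_belief, initial_obs
    actions = []
    for _ in range(max_steps):
        belief = update_belief(belief, obs)   # prior + latest observation -> posterior
        action = choose_action(belief)        # action depends on the posterior only
        actions.append(action)
        obs, done = env_step(action)
        if done:
            break
    return belief, actions
```

Because the belief is the only state carried forward, context size stays roughly constant across steps, and any mistake made in `update_belief` persists into later steps, which matches the error-propagation risk the abstract mentions.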

Sample-Efficient Policy Constraint Offline Deep Reinforcement Learning based on Sample Filtering

基于样本过滤的高效策略约束离线深度强化学习

Authors: Yuanhao Chen, Qi Liu, Pengbin Chen, Zhongjian Qiao, Yanjie Li
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.20115
Pdf link: https://arxiv.org/pdf/2512.20115
Abstract Offline reinforcement learning (RL) aims to learn a policy that maximizes the expected return using a given static dataset of transitions. However, offline RL faces the distribution shift problem. Policy constraint offline RL methods have been proposed to address it. During policy constraint offline RL training, it is important to keep the difference between the learned policy and the behavior policy within a given threshold. Thus, the learned policy relies heavily on the quality of the behavior policy. However, existing policy constraint methods have a problem: if the dataset contains many low-reward transitions, the learned policy will be constrained by a suboptimal reference policy, leading to slow learning, low sample efficiency, and inferior performance. This paper shows that the sampling method in policy constraint offline RL that uses all the transitions in the dataset can be improved. A simple but efficient sample filtering method is proposed to improve the sample efficiency and the final performance. First, we score the transitions by the average reward and average discounted reward of episodes in the dataset and extract the high-scoring transition samples. Second, the high-scoring transition samples are used to train the offline RL algorithms. We verify the proposed method on a series of offline RL algorithms and benchmark tasks. Experimental results show that the proposed method outperforms baselines.
中文摘要 离线强化学习（RL）旨在学习一种策略，利用给定的静态转移数据集最大化预期回报。然而，离线强化学习面临分布转移问题。提出了策略约束离线强化学习方法以解决分布转移问题。在策略约束离线强化学习训练中，确保在给定阈值内区分所学策略与行为策略之间的差异非常重要。因此，学习策略高度依赖于行为策略的质量。然而，现有的策略约束方法存在问题：如果数据集包含许多低回报的转移，学习到的将被包含在次优的参考策略中，导致学习速度缓慢、样本效率低下和性能下降。本文表明，在策略约束离线强化学习中，利用数据集中所有转移的抽样方法可以得到改进。提出了一种简单但高效的样本过滤方法，以提升样本效率和最终性能。首先，我们通过数据集中平均奖励和平均贴现奖励来评估转换评分，并提取高分的转换样本。其次，高分转换样本用于训练离线强化学习算法。我们在一系列离线强化学习算法和基准测试任务中验证了该方法。实验结果显示，所提方法优于基线方法。
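The two-step filtering idea in this abstract (score episodes, then train only on high-scoring transitions) can be sketched as below. This is an illustration under assumptions, not the paper's procedure: the abstract does not give the exact scoring formula or selection threshold, so `episode_score`, the `keep_fraction` parameter, the reading of "average discounted reward" as discounted return divided by episode length, and the `(state, action, reward)` tuple layout are all hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

# Toy transition layout (assumption): (state, action, reward)
Transition = Tuple[int, int, float]


def episode_score(rewards: Sequence[float], gamma: Optional[float] = None) -> float:
    """Average reward of an episode, or average discounted reward if gamma is set."""
    if gamma is None:
        return sum(rewards) / len(rewards)
    discounted = sum((gamma ** t) * r for t, r in enumerate(rewards))
    return discounted / len(rewards)


def filter_transitions(
    episodes: Sequence[Sequence[Transition]],
    keep_fraction: float = 0.5,
    gamma: Optional[float] = None,
) -> List[Transition]:
    """Keep the transitions that belong to the top-scoring episodes."""
    ranked = sorted(
        episodes,
        key=lambda ep: episode_score([r for _, _, r in ep], gamma),
        reverse=True,
    )
    n_keep = max(1, int(len(ranked) * keep_fraction))
    # Flatten the retained episodes into one training set of transitions.
    return [transition for ep in ranked[:n_keep] for transition in ep]
```

The filtered list would then replace the full dataset as the sampling pool for whichever policy constraint offline RL algorithm is being trained.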

MolAct: An Agentic RL Framework for Molecular Editing and Property Optimization

MolAct：一种用于分子编辑和性质优化的能动强化学习框架

Authors: Zhuo Yang, Yeyun chen, Jiaqing Xie, Ben Gao, Shuaike Shen, Wanhao Liu, Liujia Yang, Beilun Wang, Tianfan Fu, Yuqiang Li
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.20135
Pdf link: https://arxiv.org/pdf/2512.20135
Abstract Molecular editing and optimization are multi-step problems that require iteratively improving properties while keeping molecules chemically valid and structurally similar. We frame both tasks as sequential, tool-guided decisions and introduce MolAct, an agentic reinforcement learning framework that employs a two-stage training paradigm: first building editing capability, then optimizing properties while reusing the learned editing behaviors. To the best of our knowledge, this is the first work to formalize molecular design as an Agentic Reinforcement Learning problem, where an LLM agent learns to interleave reasoning, tool-use, and molecular optimization. The framework enables agents to interact in multiple turns, invoking chemical tools for validity checking, property assessment, and similarity control, and leverages their feedback to refine subsequent edits. We instantiate the MolAct framework to train two model families: MolEditAgent for molecular editing tasks and MolOptAgent for molecular optimization tasks. In molecular editing, MolEditAgent-7B delivers 100, 95, and 98 valid add, delete, and substitute edits, outperforming strong closed "thinking" baselines such as DeepSeek-R1; MolEditAgent-3B approaches the performance of much larger open "thinking" models like Qwen3-32B-think. In molecular optimization, MolOptAgent-7B (trained on MolEditAgent-7B) surpasses the best closed "thinking" baseline (e.g., Claude 3.7) on LogP and remains competitive on solubility, while maintaining balanced performance across other objectives. These results highlight that treating molecular design as a multi-step, tool-augmented process is key to reliable and interpretable improvements.
中文摘要 分子编辑和优化是多步骤问题，需要在保持分子化学有效性和结构相似的同时，迭代改进性质。我们将这两个任务框架为顺序的、工具引导的决策，并介绍了MolAct——一种采用两阶段训练范式的智能强化学习框架：首先构建编辑能力，然后优化属性，同时重复使用所学的编辑行为。据我们所知，这是首次将分子设计形式化为智能体强化学习问题的工作，LLM代理学习将推理、工具使用和分子优化交织。该框架使代理能够多次互动，调用化学工具进行有效性检查、属性评估和相似性控制，并利用反馈优化后续编辑。我们实例化MolAct框架以训练两个模型家族：MolEditAgent用于分子编辑任务，MolOptAgent用于分子优化任务。在分子编辑中，MolEditAgent-7B 提供 100、95 和 98 次有效的添加、删除和替换编辑，优于强闭合“思维”基线如 DeepSeek-R1;MolEditAgent-3B 接近了更大规模的开放“思维”模型，如 Qwen3-32B-think。在分子优化方面，MolOptAgent-7B（训练于MolEditAgent-7B）在LogP上超越了最佳封闭“思维”基线（如Claude 3.7），在溶解度上保持竞争力，同时在其他目标上保持平衡性能。这些结果表明，将分子设计视为多步骤、工具辅助的过程，是实现可靠且可理解改进的关键。

Multi-hop Reasoning via Early Knowledge Alignment

通过早期知识对齐实现多跳推理

Authors: Yuxin Wang, Shicheng Fang, Bo Wang, Qi Luo, Xuanjing Huang, Yining Zheng, Xipeng Qiu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.20144
Pdf link: https://arxiv.org/pdf/2512.20144
Abstract Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for Large Language Models (LLMs) to address knowledge-intensive queries requiring domain-specific or up-to-date information. To handle complex multi-hop questions that are challenging for single-step retrieval, iterative RAG approaches incorporating reinforcement learning have been proposed. However, existing iterative RAG systems typically plan to decompose questions without leveraging information about the available retrieval corpus, leading to inefficient retrieval and reasoning chains that cascade into suboptimal performance. In this paper, we introduce Early Knowledge Alignment (EKA), a simple but effective module that aligns LLMs with the retrieval set before planning in iterative RAG systems, using contextually relevant retrieved knowledge. Extensive experiments on six standard RAG datasets demonstrate that by establishing a stronger reasoning foundation, EKA significantly improves retrieval precision, reduces cascading errors, and enhances both performance and efficiency. Our analysis from an entropy perspective demonstrates that incorporating early knowledge reduces unnecessary exploration during the reasoning process, enabling the model to focus more effectively on relevant information subsets. Moreover, EKA proves effective as a versatile, training-free inference strategy that scales seamlessly to large models. Generalization tests across diverse datasets and retrieval corpora confirm the robustness of our approach. Overall, EKA advances the state-of-the-art in iterative RAG systems while illuminating the critical interplay between structured reasoning and efficient exploration in reinforcement learning-augmented frameworks. The code is released on GitHub (this https URL).
中文摘要 检索增强生成（RAG）已成为大型语言模型（LLMs）解决需要领域特定或最新信息的知识密集型查询的强大范式。为了处理单步检索困难的复杂多跳问题，提出了包含强化学习的迭代RAG方法。然而，现有的迭代RAG系统通常计划在未利用可用检索语料库信息的情况下分解问题，导致检索和推理链效率低下，最终导致性能不佳。本文介绍了早期知识对齐（EKA），这是一个简单但有效的模块，可在迭代RAG系统中规划具有上下文相关检索知识前，将LLM与检索集对齐。在六个标准RAG数据集上的广泛实验表明，通过建立更强的推理基础，EKA显著提高了检索精度，减少级联误差，并提升了性能和效率。我们从熵的角度分析表明，纳入早期知识减少了推理过程中不必要的探索，使模型能够更有效地聚焦于相关信息子集。此外，EKA作为一种多功能、无需训练的推理策略，能够无缝扩展到大型模型，效果非常有效。跨不同数据集和检索语料库的泛化检验证实了我们方法的稳健性。总体而言，EKA推动了迭代RAG系统的前沿，同时揭示了结构化推理与高效探索在强化学习增强框架中的关键相互作用。代码发布于 \href{this https URL}{Github}。

Offline Safe Policy Optimization From Heterogeneous Feedback

基于异质反馈的离线安全策略优化

Authors: Ze Gong, Pradeep Varakantham, Akshat Kumar
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.20173
Pdf link: https://arxiv.org/pdf/2512.20173
Abstract Offline Preference-based Reinforcement Learning (PbRL) learns rewards and policies aligned with human preferences without the need for extensive reward engineering and direct interaction with human annotators. However, ensuring safety remains a critical challenge across many domains and tasks. Previous works on safe RL from human feedback (RLHF) first learn reward and cost models from offline data, then use constrained RL to optimize a safe policy. While such an approach works in the contextual bandit setting (LLMs), in long-horizon continuous control tasks, errors in rewards and costs accumulate, impairing performance when used with constrained RL methods. To address these challenges, (a) instead of indirectly learning policies (from rewards and costs), we introduce a framework that learns a policy directly based on pairwise preferences regarding the agent's behavior in terms of rewards, as well as binary labels indicating the safety of trajectory segments; (b) we propose PreSa (Preference and Safety Alignment), a method that combines a preference learning module with safety alignment in a constrained optimization problem. This optimization problem is solved within a Lagrangian paradigm that directly learns a reward-maximizing safe policy without explicitly learning reward and cost models, avoiding the need for constrained RL; (c) we evaluate our approach on continuous control tasks with both synthetic and real human feedback. Empirically, our method successfully learns safe policies with high rewards, outperforming state-of-the-art baselines as well as offline safe RL approaches that use ground-truth reward and cost.
中文摘要 基于偏好的离线强化学习（PbRL）无需大量奖励工程和与人工标注者直接交互，学习符合人类偏好的奖励和策略。然而，确保安全仍然是许多领域和任务中的关键挑战。此前关于基于人类反馈的安全强化学习（RLHF）工作，首先从离线数据中学习奖励和成本模型，然后使用受限强化学习来优化安全策略。虽然这种方法在情境强化模型（LLMs）中有效，但在长期连续控制任务中，奖励和成本的误差会累积，导致在受限强化学习方法下性能下降。为应对这些挑战，（a）我们引入了一个框架，直接基于对智能体在奖励行为中的成对偏好学习策略，以及指示轨迹段安全性的二元标签;（b）我们提出\textsc{PreSa}（偏好与安全对齐），这是一种将偏好学习模块与安全对齐结合在受限优化问题中的方法。该优化问题在拉格朗日范式中解决，该范式直接学习最大化奖励的安全策略\textit{而无需显式学习奖励和成本模型}，避免了对受限强化学习的需求;（c）我们结合合成和真实的人类反馈，评估我们对持续控制任务的方法。从实证角度看，我们的方法成功学习了安全策略，奖励丰厚，优于最先进的基线，以及离线安全强化学习方法，并实现了真实的奖励和成本。

RESPOND: Risk-Enhanced Structured Pattern for LLM-driven Online Node-level Decision-making

回应：风险增强结构化模式用于大型语言模型驱动的在线节点级决策

Authors: Dan Chen, Heye Huang, Tiantian Chen, Zheng Li, Yongji Li, Yuhui Xu, Sikai Chen
Subjects: Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2512.20179
Pdf link: https://arxiv.org/pdf/2512.20179
Abstract Current LLM-based driving agents that rely on unstructured plain-text memory suffer from low-precision scene retrieval and inefficient reflection. To address this limitation, we present RESPOND, a structured decision-making framework for LLM-driven agents grounded in explicit risk patterns. RESPOND represents each ego-centric scene using a unified 5 by 3 matrix that encodes spatial topology and road constraints, enabling consistent and reliable retrieval of spatial risk configurations. Based on this representation, a hybrid rule and LLM decision pipeline is developed with a two-tier memory mechanism. In high-risk contexts, exact pattern matching enables rapid and safe reuse of verified actions, while in low-risk contexts, sub-pattern matching supports personalized driving style adaptation. In addition, a pattern-aware reflection mechanism abstracts tactical corrections from crash and near-miss frames to update structured memory, achieving one-crash-to-generalize learning. Extensive experiments demonstrate the effectiveness of RESPOND. In highway-env, RESPOND outperforms state-of-the-art LLM-based and reinforcement-learning-based driving agents while producing substantially fewer collisions. With step-wise human feedback, the agent acquires a Sporty driving style within approximately 20 decision steps through sub-pattern abstraction. For real-world validation, RESPOND is evaluated on 53 high-risk cut-in scenarios extracted from the HighD dataset. For each event, intervention is applied immediately before the cut-in and RESPOND re-decides the driving action. Compared to recorded human behavior, RESPOND reduces subsequent risk in 84.9 percent of scenarios, demonstrating its practical feasibility under real-world driving conditions. These results highlight RESPOND's potential for autonomous driving, personalized driving assistance, and proactive hazard mitigation.
中文摘要 目前依赖非结构化明文内存的基于LLM的驱动代理存在场景检索精度低和反射效率低下的问题。为解决这一限制，我们提出了RESPOND，这是一个基于明确风险模式的LLM驱动智能体结构化决策框架。RESPOND使用统一的5×3矩阵表示每个以自我为中心的场景，该矩阵编码空间拓扑和道路约束，从而实现空间风险配置的一致且可靠的检索。基于这种表示方式，开发了一个混合规则和大型语言模型决策流水线，采用了两层内存机制。在高风险情境中，精确模式匹配使得验证行为的快速且安全地重用，而在低风险情境中，子模式匹配支持个性化驾驶风格适应。此外，一种模式感知反射机制将撞击和近距离碰撞帧中的战术修正抽象出来，更新结构化记忆，实现一次崩溃到泛化的学习。大量实验证明了RESPOND的有效性。在高速公路环境中，RESPOND优于基于LLM和强化学习的先进驾驶代理，同时产生显著减少的碰撞。通过逐步的人工反馈，代理通过子模式抽象约20个决策步骤，获得运动驾驶风格。为实现现实验证，RESPOND基于从HighD数据集提取的53个高风险插入场景进行评估。对于每个事件，干预会在插入前立即实施，RESPOND重新决定驱动动作。与记录的人类行为相比，RESPOND在84.9%的场景中降低了后续风险，证明了其在真实驾驶条件下的可行性。这些结果凸显了RESPONDs在自动驾驶、个性化驾驶辅助和主动灾害缓解方面的潜力。

FaithLens: Detecting and Explaining Faithfulness Hallucination

FaithLens：检测与解释忠诚幻觉

Authors: Shuzheng Si, Qingyi Wang, Haozhe Zhao, Yuzhuo Bai, Guanqiao Chen, Kangyang Luo, Gang Chen, Fanchao Qi, Minjia Zhang, Baobao Chang, Maosong Sun
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.20182
Pdf link: https://arxiv.org/pdf/2512.20182
Abstract Recognizing whether outputs from large language models (LLMs) contain faithfulness hallucination is crucial for real-world applications, e.g., retrieval-augmented generation and summarization. In this paper, we introduce FaithLens, a cost-efficient and effective faithfulness hallucination detection model that can jointly provide binary predictions and corresponding explanations to improve trustworthiness. To achieve this, we first synthesize training data with explanations via advanced LLMs and apply a well-defined data filtering strategy to ensure label correctness, explanation quality, and data diversity. Subsequently, we fine-tune the model on these well-curated training data as a cold start and further optimize it with rule-based reinforcement learning, using rewards for both prediction correctness and explanation quality. Results on 12 diverse tasks show that the 8B-parameter FaithLens outperforms advanced models such as GPT-4.1 and o3. Also, FaithLens can produce high-quality explanations, delivering a distinctive balance of trustworthiness, efficiency, and effectiveness.
中文摘要 识别大型语言模型（LLMs）输出是否包含忠实幻觉对于现实应用至关重要，例如检索增强生成和总结。本文介绍了FaithLens，一种成本效益高且高效的忠实幻觉检测模型，能够联合提供二元预测和相应解释，以提升可信度。为此，我们首先通过先进的大型语言模型综合训练数据与解释，并采用明确定义的数据过滤策略，以确保标签的正确性、解释质量和数据多样性。随后，我们基于这些精心策划的训练数据进行微调，作为冷启动，并通过基于规则的强化学习进一步优化，同时对预测正确性和解释质量进行奖励。12个不同任务的结果显示，8B参数的FaithLens优于GPT-4.1和o3等先进模型。此外，FaithLens能够产出高质量的解释，实现可信度、高效性和有效性的独特平衡。

Edge-Served Congestion Control for Wireless Multipath Transmission with a Transformer Agent

用于无线多径传输的边缘服务拥塞控制，采用变压器代理

Authors: Liang Wang
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2512.20186
Pdf link: https://arxiv.org/pdf/2512.20186
Abstract Multipath TCP is widely adopted to enhance connection quality-of-service by leveraging multiple network pathways on modern devices. However, the evolution of its core congestion control is hindered by the OS kernel, whose monolithic design imposes high development overhead and lacks the resource flexibility required for data-driven methods. Furthermore, inherent noise in network statistics induces a partial observability problem, which can mislead data-driven methods like Deep Reinforcement Learning. To bridge this gap, we propose Jazz, a system that re-architects multipath congestion control through a decoupled architecture that separates the decision-making "brain" from the in-kernel datapath, enabling it to operate on an external (edge) entity. At its core, Jazz employs a Transformer-based agent that processes sequences of historical observations to overcome the partial observability of single-step reinforcement learning. This allows it to learn and master fluctuating link conditions and intricate cross-path dependencies. Tested on a dual-band (5GHz/6GHz) Wi-Fi testbed, our implementation improves bandwidth efficiency by at least 2.85% over conventional methods and maintains 96.2% performance under 1% packet loss, validating this design as a practical blueprint for agile network intelligence.
中文摘要 多径TCP被广泛采用，通过利用现代设备上的多条网络路径来提升连接服务质量。然而，其核心拥塞控制的演进受到操作系统内核的阻碍，内核整体设计带来了高开发开销，且缺乏数据驱动方法所需的资源灵活性。此外，网络统计中固有的噪声会引发部分可观测性问题，这可能会误导数据驱动的方法，如深度强化学习。为弥合这一空白，我们提出了Jazz系统，该系统通过解耦架构重新架构多径拥塞控制，将决策“大脑”与内核数据路径分离，使其能够在外部（边缘）实体上运行。Jazz 的核心使用基于 Transformer 的代理处理历史观察序列，克服单步强化学习的部分可观测性。这使其能够学习并掌握波动的链路条件和复杂的交叉路径依赖关系。在双频段（5GHz/6GHz）Wi-Fi测试平台上测试，我们的实现比传统方法提升了至少2.85%的带宽效率，并在1%丢包率下保持96.2%的性能，验证了该设计作为敏捷网络智能的实用蓝图。

Joint Design of Embedded Index Coding and Beamforming for MIMO-based Distributed Computing via Multi-Agent Reinforcement Learning

通过多智能体强化学习，为基于MIMO的分布式计算实现嵌入式索引编码与波束成形的联合设计

Authors: Heekang Song, Wan Choi
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2512.20201
Pdf link: https://arxiv.org/pdf/2512.20201
Abstract In distributed computing systems, reducing the communication load during the data shuffling phase is a critical challenge, as excessive inter-node transmissions are a major performance bottleneck. One promising approach to alleviate this burden is Embedded Index Coding (EIC), which exploits cached data at user nodes to encode transmissions more efficiently. However, most prior work on EIC has focused on minimizing code length in wired, error-free environments, an objective that is often suboptimal for wireless multiple-input multiple-output (MIMO) systems, where channel conditions and spatial multiplexing gains must be considered. This paper investigates the joint design of EIC and transmit beamforming in MIMO systems to minimize total transmission time, an NP-hard problem. We first present a conventional optimization method that determines the optimal EIC via exhaustive search. To address its prohibitive complexity and adapt to dynamic wireless environments, we propose a novel, low-complexity multi-agent reinforcement learning (MARL) framework. The proposed framework enables decentralized agents to act on local observations while effectively managing the hybrid action space of discrete EIC selection and continuous beamforming design. Simulation results demonstrate that the proposed MARL-based approach achieves near-optimal performance with significantly reduced complexity, underscoring its effectiveness and practicality for real-world wireless systems.
中文摘要 在分布式计算系统中，减少数据重组阶段的通信负载是关键挑战，因为过多的节点间传输是主要的性能瓶颈。一种有前景的方法是嵌入式索引编码（EIC），它利用用户节点缓存的数据，更高效地编码传输。然而，之前关于EIC的大多数研究都集中在有线、无错误环境中最小化码长——这对于无线多输入多输出（MIMO）系统来说，这一目标往往并不理想，因为在这些系统中必须考虑信道条件和空间复用增益。本文探讨了在MIMO系统中，EIC与发射波束成形的联合设计，以最小化总传输时间，这是一个NP难问题。我们首先提出一种通过穷尽搜索确定最优EIC的传统优化方法。为了应对其难以抑制的复杂性并适应动态无线环境，我们提出了一个新颖的低复杂度多智能体强化学习（MARL）框架。该框架使去中心化代理能够在有效管理离散EIC选择与连续波束成形设计的混合作用空间的同时，对局部观测进行行动。模拟结果表明，基于MARL的方法在显著降低复杂度下实现了接近最佳的性能，凸显了其在现实无线系统中的有效性和实用性。

Generalisation in Multitask Fitted Q-Iteration and Offline Q-learning

多任务拟合Q迭代与离线Q学习中的推广

Authors: Kausthubh Manda, Raghuram Bharadwaj Diddigi
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.20220
Pdf link: https://arxiv.org/pdf/2512.20220
Abstract We study offline multitask reinforcement learning in settings where multiple tasks share a low-rank representation of their action-value functions. In this regime, a learner is provided with fixed datasets collected from several related tasks, without access to further online interaction, and seeks to exploit shared structure to improve statistical efficiency and generalization. We analyze a multitask variant of fitted Q-iteration that jointly learns a shared representation and task-specific value functions via Bellman error minimization on offline data. Under standard realizability and coverage assumptions commonly used in offline reinforcement learning, we establish finite-sample generalization guarantees for the learned value functions. Our analysis explicitly characterizes how pooling data across tasks improves estimation accuracy, yielding a $1/\sqrt{nT}$ dependence on the total number of samples across tasks, while retaining the usual dependence on the horizon and concentrability coefficients arising from distribution shift. In addition, we consider a downstream offline setting in which a new task shares the same underlying representation as the upstream tasks. We study how reusing the representation learned during the multitask phase affects value estimation for this new task, and show that it can reduce the effective complexity of downstream learning relative to learning from scratch. Together, our results clarify the role of shared representations in multitask offline Q-learning and provide theoretical insight into when and how multitask structure can improve generalization in model-free, value-based reinforcement learning.
中文摘要 我们研究了在多个任务共享低秩表现的动作-价值函数条件下的离线多任务强化学习。在此模式下，学习者获得来自多个相关任务的固定数据集，无法获得进一步的在线互动，并试图利用共享结构以提高统计效率和泛化性。我们分析了拟合Q迭代的一种多任务变体，该变体通过离线数据的Bellman误差最小化共同学习共享表示和任务特定的值函数。在离线强化学习常用的标准实现性和覆盖假设下，我们为所学价值函数建立了有限样本推广保证。我们的分析明确描述了跨任务数据池化如何提升估计准确性，使得对任务总样本数产生$1/\sqrt{nT}$的依赖性，同时保留了对分布偏移引起的视界和凝聚系数的通常依赖。此外，我们还考虑了一个下游离线设置，其中新任务与上游任务共享相同的底层表示。我们研究了重用多任务阶段学习的表示如何影响该新任务的价值估计，并证明这能降低下游学习的有效复杂度，相较于从零学习。我们的结果共同阐明了共享表示在多任务离线Q学习中的作用，并为多任务结构何时以及如何提升无模型、基于价值的强化学习的泛化提供了理论见解。
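As a rough illustration of the pooling claim in this abstract, the rates with and without pooling can be placed side by side. This is a simplified reading, not the paper's theorem: it assumes $n$ denotes samples per task and $T$ the number of tasks, uses a placeholder constant $C$, and ignores the horizon, concentrability, and function-class terms that the abstract says remain in the bound.

```latex
\[
\underbrace{\frac{C}{\sqrt{n}}}_{\text{one task, } n \text{ samples}}
\quad\text{versus}\quad
\underbrace{\frac{C}{\sqrt{nT}}}_{\text{pooled over } T \text{ tasks}},
\qquad
\frac{C/\sqrt{nT}}{C/\sqrt{n}} = \frac{1}{\sqrt{T}} .
\]
```

Under these assumptions, pooling $T = 16$ tasks would shrink this term by a factor of $4$ relative to learning each task alone.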

Graph-Symbolic Policy Enforcement and Control (G-SPEC): A Neuro-Symbolic Framework for Safe Agentic AI in 5G Autonomous Networks

图符号政策执行与控制（G-SPEC）：5G自治网络中安全智能人工智能的神经符号框架

Authors: Divya Vijay, Vignesh Ethiraj
Subjects: Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2512.20275
Pdf link: https://arxiv.org/pdf/2512.20275
Abstract As networks evolve toward 5G Standalone and 6G, operators face orchestration challenges that exceed the limits of static automation and Deep Reinforcement Learning. Although Large Language Model (LLM) agents offer a path toward intent-based networking, they introduce stochastic risks, including topology hallucinations and policy non-compliance. To mitigate this, we propose Graph-Symbolic Policy Enforcement and Control (G-SPEC), a neuro-symbolic framework that constrains probabilistic planning with deterministic verification. The architecture relies on a Governance Triad - a telecom-adapted agent (TSLAM-4B), a Network Knowledge Graph (NKG), and SHACL constraints. We evaluated G-SPEC on a simulated 450-node 5G Core, achieving zero safety violations and a 94.1% remediation success rate, significantly outperforming the 82.4% baseline. Ablation analysis indicates that NKG validation drives the majority of safety gains (68%), followed by SHACL policies (24%). Scalability tests on topologies ranging from 10K to 100K nodes demonstrate that validation latency scales as $O(k^{1.2})$ where $k$ is subgraph size. With a processing overhead of 142ms, G-SPEC is viable for SMO-layer operations.
中文摘要 随着网络向5G独立和6G发展，运营商面临超越静态自动化和深度强化学习极限的编排挑战。尽管大型语言模型（LLM）代理为基于意图的网络提供了一条路径，但它们也带来了随机风险，包括拓扑幻觉和策略不合规。为缓解这一问题，我们提出了图符号政策执行与控制（G-SPEC），这是一种神经符号框架，通过确定性验证限制概率规划。该架构依赖治理三元：一个电信适配代理（TSLAM-4B）、一个网络知识图谱（NKG）和SHACL约束。我们在模拟的450节点5G核心上评估了G-SPEC，实现了零安全违规和94.1%的修复成功率，显著优于82.4%的基线。消融分析显示，NKG验证推动了大部分安全提升（68%），其次是SHACL政策（24%）。在10K到100K节点的拓扑上进行可扩展性测试，表明验证延迟随$O(k^{1.2})$扩展，其中$k$为子图大小。G-SPEC的处理开销为142毫秒，适用于SMO层操作。

TableGPT-R1: Advancing Tabular Reasoning Through Reinforcement Learning

TableGPT-R1：通过强化学习推进表格推理

Authors: Saisai Yang, Qingyi Huang, Jing Yuan, Liangyu Zha, Kai Tang, Yuhang Yang, Ning Wang, Yucheng Wei, Liyao Li, Wentao Ye, Hao Chen, Tao Zhang, Junlin Zhou, Haobo Wang, Gang Chen, Junbo Zhao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.20312
Pdf link: https://arxiv.org/pdf/2512.20312
Abstract Tabular data serves as the backbone of modern data analysis and scientific research. While Large Language Models (LLMs) fine-tuned via Supervised Fine-Tuning (SFT) have significantly improved natural language interaction with such structured data, they often fall short in handling the complex, multi-step reasoning and robust code execution required for real-world table tasks. Reinforcement Learning (RL) offers a promising avenue to enhance these capabilities, yet its application in the tabular domain faces three critical hurdles: the scarcity of high-quality agentic trajectories with closed-loop code execution and environment feedback on diverse table structures, the extreme heterogeneity of feedback signals ranging from rigid SQL execution to open-ended data interpretation, and the risk of catastrophic forgetting of general knowledge during vertical specialization. To overcome these challenges and unlock advanced reasoning on complex tables, we introduce TableGPT-R1, a specialized tabular model built on a systematic RL framework. Our approach integrates a comprehensive data engineering pipeline that synthesizes difficulty-stratified agentic trajectories for both supervised alignment and RL rollouts, a task-adaptive reward system that combines rule-based verification with a criteria-injected reward model and incorporates process-level step reward shaping with behavioral regularization, and a multi-stage training framework that progressively stabilizes reasoning before specializing in table-specific tasks. Extensive evaluations demonstrate that TableGPT-R1 achieves state-of-the-art performance on authoritative benchmarks, significantly outperforming baseline models while retaining robust general capabilities. Our model is available at this https URL.
中文摘要 表格数据是现代数据分析和科学研究的支柱。虽然通过监督微调（SFT）微调的大型语言模型（LLM）显著提升了与此类结构化数据的自然语言交互，但它们在处理复杂、多步推理和稳健的代码执行方面常常不足，尤其是在实际表格任务中。强化学习（RL）为提升这些能力提供了有前景的途径，但其在表格领域的应用面临三大关键障碍：缺乏高质量的代理轨迹，闭环代码执行和多样表格结构的环境反馈;反馈信号的极端异质性，从僵化的SQL执行到开放式数据解释;以及在垂直专业化过程中常识被灾难性遗忘的风险。为了克服这些挑战并解锁复杂表格的高级推理能力，我们引入了 \textbf{TableGPT-R1}，这是一个基于系统强化学习框架的专用表格模型。我们的方法整合了全面的数据工程流程，综合了监督式对齐和强化学习的难度分层代理轨迹;任务适应奖励系统结合了基于规则的验证与标准注入的奖励模型，并结合了过程级阶级奖励塑造与行为正则化;以及一个多阶段训练框架，逐步稳定推理，然后专注于特定表格任务。大量评估表明，TableGPT-R1在权威基准测试中实现了最先进的性能，远超基线模型，同时保持了强大的通用能力。我们的模型可在此 https 网址获取。

Identifying Appropriately-Sized Services with Deep Reinforcement Learning

通过深度强化学习识别合适规模的服务

Authors: Syeda Tasnim Fabiha, Saad Shafiq, Wesley Klewerton Guez Assunção, Nenad Medvidović
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.20381
Pdf link: https://arxiv.org/pdf/2512.20381
Abstract Service-based architecture (SBA) has gained attention in industry and academia as a means to modernize legacy systems. It refers to a design style that enables systems to be developed as suites of small, loosely coupled, and autonomous components (services) that encapsulate functionality and communicate via language-agnostic APIs. However, defining appropriately sized services that capture cohesive subsets of system functionality remains challenging. Existing work often relies on the availability of documentation, access to project personnel, or a priori knowledge of the target number of services, assumptions that do not hold in many real-world scenarios. Our work addresses these limitations using a deep reinforcement learning-based approach to identify appropriately sized services directly from implementation artifacts. We present Rake, a reinforcement learning-based technique that leverages available system documentation and source code to guide service decomposition at the level of implementation methods. Rake does not require specific documentation or access to project personnel and is language-agnostic. It also supports a customizable objective function that balances modularization quality and business capability alignment, i.e., the degree to which a service covers the targeted business capability. We applied Rake to four open-source legacy projects and compared it with two state-of-the-art techniques. On average, Rake achieved 7-14 percent higher modularization quality and 18-22 percent stronger business capability alignment. Our results further show that optimizing solely for business context can degrade decomposition quality in tightly coupled systems, highlighting the need for balanced objectives.
中文摘要 基于服务架构（SBA）作为现代化遗留系统的手段，在工业界和学术界引起了关注。它指的是一种设计风格，使系统能够被开发为一组小型、松散耦合且自治的组件（服务），这些组件封装功能并通过语言无关的API进行通信。然而，定义能够捕捉系统功能中内聚子集的适当规模服务仍然具有挑战性。现有工作通常依赖于文档的可用性、项目人员的访问，或对目标服务数量的先验了解，这些假设在许多现实场景中并不成立。我们的工作通过基于深度强化学习的方法，直接从实现工件中识别合适规模的服务，解决了这些局限性。我们介绍了Rake，一种基于强化学习的技术，利用现有的系统文档和源代码指导实现方法层面的服务分解。Rake 不需要特定的文档或项目人员访问权限，且不依赖语言。它还支持一个可定制的目标函数，平衡模块化质量与业务能力的对齐，即服务覆盖目标业务能力的程度。我们将Rake应用于四个开源遗留项目，并与两种最先进的技术进行了比较。平均而言，Rake实现了7-14%的模块化质量提升，以及18-22%的业务能力一致性提升。我们的研究结果进一步表明，仅针对业务环境进行优化会降低紧耦合系统中的分解质量，凸显了目标平衡的必要性。

Resilient Packet Forwarding: A Reinforcement Learning Approach to Routing in Gaussian Interconnected Networks with Clustered Faults

弹性数据包转发：在高斯互联网络中集群故障中，一种强化学习方法进行路由

Authors: Mohammad Walid Charrwi, Zaid Hussain
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2512.20394
Pdf link: https://arxiv.org/pdf/2512.20394
Abstract As Network-on-Chip (NoC) and Wireless Sensor Network architectures continue to scale, the topology of the underlying network becomes a critical factor in performance. Gaussian Interconnected Networks, based on the arithmetic of Gaussian integers, offer attractive properties regarding diameter and symmetry. Despite these theoretical properties, adaptive routing techniques in these networks are vulnerable to node and link faults, leading to rapid degradation in communication reliability. Node failures (particularly those following Gaussian distributions, such as thermal hotspots or physical damage clusters) pose severe challenges to traditional deterministic routing. This paper proposes a fault-aware Reinforcement Learning (RL) routing scheme tailored for Gaussian Interconnected Networks. By utilizing a PPO (Proximal Policy Optimization) agent with a specific reward structure designed to penalize fault proximity, the system dynamically learns to bypass faulty regions. We compare our proposed RL-based routing protocol against a greedy adaptive shortest-path routing algorithm. Experimental results demonstrate that the RL agent significantly outperforms the adaptive routing, sustaining a Packet Delivery Ratio (PDR) of 0.95 at 40% fault density compared to 0.66 for the greedy algorithm. Furthermore, the RL approach achieves higher effective delivery rates than the greedy adaptive routing, particularly under a low network load of 20% (0.57 vs. 0.43), showing greater proficiency in managing congestion and validating its efficacy in stochastic, fault-prone topologies.
中文摘要 随着片上网络（NoC）和无线传感器网络架构的持续扩展，底层网络的拓扑结构成为性能的关键因素。基于高斯整数算术的高斯互联网络，在直径和对称性方面具有吸引人的特性。尽管理论上具有吸引力，这些网络中的自适应路由技术容易受到节点和链路故障的影响，导致通信可靠性迅速下降。节点故障（尤其是遵循高斯分布的，如热热点或物理损伤簇）对传统的确定性路由构成严峻挑战。本文提出了一种针对高斯互联网络量身定制的故障感知强化学习（RL）路由方案。通过使用具有特定奖励结构、惩罚故障接近度的PPO（近端策略优化）代理，系统动态学习绕过故障区域。我们将提出的基于强化学习的路由协议与贪婪自适应最短路径路由算法进行比较。实验结果表明，强化学习代理在40%故障密度下，在支持数据包分组的分配比（PDR）为0.95，而贪婪者则为0.66。此外，强化学习方法相比贪婪自适应路由表现出有效交付率，尤其是在网络负载仅20%的0.57比0.43时，在拥塞管理方面表现出更高的熟练度，验证了其在随机易故障拓扑中的有效性
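A reward that penalizes fault proximity, as mentioned in this abstract, could take a shape like the sketch below. Everything concrete here is an assumption made for illustration: the abstract does not give the reward terms or their weights, so `step_reward`, its parameters, and their default values are hypothetical, and plain Manhattan distance on integer coordinates is used in place of the wrap-around distance of an actual Gaussian network.

```python
from typing import Iterable, Tuple

Node = Tuple[int, int]  # (real part, imaginary part) of a Gaussian integer


def hop_distance(a: Node, b: Node) -> int:
    """Manhattan distance; a simplification that ignores modular wrap-around."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def step_reward(
    next_node: Node,
    destination: Node,
    faults: Iterable[Node],
    radius: int = 1,
    delivered_bonus: float = 10.0,
    hop_cost: float = 1.0,
    proximity_penalty: float = 2.0,
    fault_penalty: float = 10.0,
) -> float:
    """Per-hop reward: cost per hop, extra cost near faults, bonus on delivery."""
    fault_set = set(faults)
    if next_node in fault_set:
        return -fault_penalty          # packet forwarded into a faulty node
    if next_node == destination:
        return delivered_bonus         # packet delivered
    reward = -hop_cost                 # discourage long routes
    if any(hop_distance(next_node, f) <= radius for f in fault_set):
        reward -= proximity_penalty    # discourage skirting faulty regions
    return reward
```

A function of this kind would be called by the environment once per forwarding decision, and the PPO agent would be trained on the resulting returns.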

Recurrent Off-Policy Deep Reinforcement Learning Doesn't Have to be Slow

循环的非策略深度强化学习不必缓慢

Authors: Tyler Clark, Christine Evers, Jonathon Hare
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.20513
Pdf link: https://arxiv.org/pdf/2512.20513
Abstract Recurrent off-policy deep reinforcement learning models achieve state-of-the-art performance but are often sidelined due to their high computational demands. In response, we introduce RISE (Recurrent Integration via Simplified Encodings), a novel approach that can leverage recurrent networks in any image-based off-policy RL setting without significant computational overhead by using both learnable and non-learnable encoder layers. When integrating RISE into leading non-recurrent off-policy RL algorithms, we observe a 35.6% human-normalized interquartile mean (IQM) performance improvement across the Atari benchmark. We analyze various implementation strategies to highlight the versatility and potential of our proposed framework.
中文摘要 循环非策略深度强化学习模型实现了最先进的性能，但由于计算需求过高，常常被边缘化。为此，我们引入了RISE（通过简化编码的递归集成），这是一种新颖的方法，能够在任何基于图像的非策略强化学习环境中利用循环网络，且通过同时使用可学习和不可学习编码层，且不产生显著的计算开销。当将RISE集成到领先的非周期性非策略强化学习算法中时，我们在Atari基准测试中观察到人类归一化四分位数平均（IQM）性能提升了35.6%。我们分析了多种实施策略，以突出我们提出框架的多样性和潜力。
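
For readers unfamiliar with the metric, the sketch below shows how a human-normalized interquartile mean (IQM) is commonly computed: each raw score is rescaled so that a random policy maps to 0 and the human reference to 1, and the IQM is the mean of the middle 50% of the normalized scores. The function names are hypothetical, the reference scores are inputs the caller must supply, and the paper's exact evaluation protocol is not given in the abstract.

```python
from statistics import fmean


def human_normalized(agent_score: float, random_score: float,
                     human_score: float) -> float:
    """Rescale a raw score so that random play is 0 and the human reference is 1."""
    return (agent_score - random_score) / (human_score - random_score)


def interquartile_mean(scores) -> float:
    """Mean of the middle 50% of scores (a 25% trimmed mean)."""
    ordered = sorted(scores)
    cut = len(ordered) // 4  # number of scores dropped from each end
    return fmean(ordered[cut:len(ordered) - cut])
```

Because the top and bottom quarters are discarded, a single outlier run has little effect on the IQM, which is why it is often preferred to the plain mean when aggregating results across games and seeds.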

Performative Policy Gradient: Optimality in Performative Reinforcement Learning

执行性策略梯度：执行性强化学习的最优性

Authors: Debabrota Basu, Udvas Das, Brahim Driss, Uddalak Mukherjee
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2512.20576
Pdf link: https://arxiv.org/pdf/2512.20576
Abstract Post-deployment machine learning algorithms often influence the environments they act in, and thus shift the underlying dynamics that the standard reinforcement learning (RL) methods ignore. While designing optimal algorithms in this performative setting has recently been studied in supervised learning, the RL counterpart remains under-explored. In this paper, we prove the performative counterparts of the performance difference lemma and the policy gradient theorem in RL, and further introduce the Performative Policy Gradient algorithm (PePG). PePG is the first policy gradient algorithm designed to account for performativity in RL. Under softmax parametrisation, both with and without entropy regularisation, we prove that PePG converges to performatively optimal policies, i.e. policies that remain optimal under the distribution shifts induced by themselves. Thus, PePG significantly extends the prior work in performative RL, which achieves performative stability but not optimality. Furthermore, our empirical analysis on standard performative RL environments validates that PePG outperforms standard policy gradient algorithms and the existing performative RL algorithms aiming for stability.
中文摘要 部署后的机器学习算法常常影响其所处环境，从而改变标准强化学习（RL）方法忽视的潜在动态。虽然最近在监督学习中对该执行性环境中的最优算法设计进行了研究，但强化学习的对应方法仍未被充分探索。本文证明了性能差引理和强化学习中策略梯度定理的执行对应物，并进一步介绍了执行性策略梯度算法（PePG）。PePG 是第一个设计用于考虑强化学习中表现性的策略梯度算法。在软极大参数化下，以及有无熵正则化的条件，我们证明了 PePG 收敛到执行最优策略，即在自身引起的分布变化下保持最优的策略。因此，PePG显著扩展了之前在实现执行稳定性但未达到最优性的表演性强化学习领域的工作。此外，我们对标准执行性强化学习环境的实证分析验证了PePG优于标准策略梯度算法及现有追求稳定性的执行性强化学习算法。
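
For context, the classical (non-performative) results that the abstract refers to can be written as follows. This is background only: the performative versions proved in the paper are not reproduced here, and the notation (discount factor $\gamma$, initial distribution $\rho$, normalized discounted state-visitation distribution $d^{\pi}_{\rho}$) is a common textbook convention assumed for illustration.

```latex
% Softmax parametrisation: one logit per state-action pair.
\pi_\theta(a \mid s) = \frac{\exp(\theta_{s,a})}{\sum_{a'} \exp(\theta_{s,a'})}

% Normalized discounted state-visitation distribution.
d^{\pi}_{\rho}(s) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^{t} \Pr(s_t = s \mid s_0 \sim \rho, \pi)

% Performance difference lemma for two policies \pi' and \pi.
J(\pi') - J(\pi) = \frac{1}{1-\gamma}\,
  \mathbb{E}_{s \sim d^{\pi'}_{\rho},\, a \sim \pi'(\cdot \mid s)}
  \left[ A^{\pi}(s,a) \right]

% Policy gradient theorem.
\nabla_\theta J(\pi_\theta) = \frac{1}{1-\gamma}\,
  \mathbb{E}_{s \sim d^{\pi_\theta}_{\rho},\, a \sim \pi_\theta(\cdot \mid s)}
  \left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s,a) \right]
```

In these classical statements the transition dynamics and rewards are fixed. In the performative setting described in the abstract, the environment itself depends on the deployed policy, which is why the paper needs performative counterparts of both results.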

Leveraging High-Fidelity Digital Models and Reinforcement Learning for Mission Engineering: A Case Study of Aerial Firefighting Under Perfect Information

利用高精度数字模型和强化学习实现任务工程：完美信息下空中灭火的案例研究

Authors: İbrahim Oğuz Çetinkaya, Sajad Khodadadian, Taylan G. Topçu
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2512.20589
Pdf link: https://arxiv.org/pdf/2512.20589
Abstract As systems engineering (SE) objectives evolve from the design and operation of monolithic systems to complex Systems of Systems (SoS), the discipline of Mission Engineering (ME) has emerged and is increasingly being accepted as a new line of thinking for the SE community. Moreover, mission environments are uncertain and dynamic, and mission outcomes are a direct function of how the mission assets interact with this environment. This renders static architectures brittle and calls for analytically rigorous approaches to ME. To that end, this paper proposes an intelligent mission coordination methodology that integrates digital mission models with Reinforcement Learning (RL) and specifically addresses the need for adaptive task allocation and reconfiguration. More specifically, we leverage a Digital Engineering (DE) based infrastructure composed of a high-fidelity digital mission model and an agent-based simulation; we then formulate the mission tactics management problem as a Markov Decision Process (MDP) and employ an RL agent trained via Proximal Policy Optimization. By leveraging the simulation as a sandbox, we map the system states to actions, refining the policy based on realized mission outcomes. The utility of the RL-based intelligent mission coordinator is demonstrated through an aerial firefighting case study. Our findings indicate that the RL-based intelligent mission coordinator not only surpasses baseline performance but also significantly reduces the variability in mission performance. Thus, this study serves as a proof of concept demonstrating that DE-enabled mission simulations combined with advanced analytical tools offer a mission-agnostic framework for improving ME practice, one that can be extended in the future to more complicated fleet design and selection problems from a mission-first perspective.
中文摘要 随着系统工程（SE）目标从单体系统的设计和运行向复杂的系统中的系统（SoS）发展，任务工程（ME）学科的兴起，这一学科正日益被社会工程社区接受为一种新的思维方式。此外，任务环境充满不确定性和动态性，任务结果直接取决于任务资产如何与该环境互动。这证明静态架构是脆弱的，因此需要对机械工程采取分析严谨的方法。为此，本文提出了一种智能任务协调方法论，将数字任务模型与强化学习（RL）整合，专门针对自适应任务分配和重组的需求。更具体地说，我们利用基于数字工程（DE）的基础设施，该基础设施由高精度数字任务模型和基于代理的仿真组成;然后我们将任务战术管理问题表述为马尔可夫决策过程（MDP），并使用通过近端策略优化训练的强化学习代理。通过将模拟作为沙盒，我们将系统状态映射到行动，并根据实现的任务结果细化政策。基于强化学习的智能任务协调器的实用性通过空中灭火案例得到了展示。我们的发现表明，基于强化学习的智能任务协调器不仅超越了基线性能，还显著降低了任务表现的变异性。因此，本研究作为概念验证，表明DE驱动的任务模拟结合先进分析工具，提供了一个与任务无关的框架，以提升ME实践;从任务优先的角度，未来可以扩展到更复杂的舰队设计和选拔问题。
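
Proximal Policy Optimization, the training method named in the abstract, is usually described through its clipped surrogate objective. The sketch below evaluates that objective for a single sample. It is a generic illustration of the standard PPO formulation, not the authors' implementation; the function name and the default clipping range of 0.2 are assumptions.

```python
def ppo_clip_objective(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Clipped surrogate objective for one sample.

    ratio is pi_new(a|s) / pi_old(a|s); advantage is the estimated advantage
    of action a in state s. The agent maximizes the average of this quantity.
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Taking the minimum removes any incentive to move the ratio outside
    # the interval [1 - eps, 1 + eps] in the direction that looks beneficial.
    return min(ratio * advantage, clipped_ratio * advantage)
```

For a positive advantage the objective stops growing once the ratio exceeds 1 + eps, and for a negative advantage it stops improving once the ratio falls below 1 - eps, which keeps each policy update close to the policy that collected the data.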

Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning

自回归模型中的涌现时间抽象使分层强化学习成为可能

Authors: Seijin Kobayashi, Yanick Schimpf, Maximilian Schlegel, Angelika Steger, Maciej Wolczyk, Johannes von Oswald, Nino Scherre, Kaitlin Maile, Guillaume Lajoie, Blake A. Richards, Rif A. Saurous, James Manyika, Blaise Agüera y Arcas, Alexander Meulemans, João Sacramento
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.20605
Pdf link: https://arxiv.org/pdf/2512.20605
Abstract Large-scale autoregressive models pretrained on next-token prediction and finetuned with reinforcement learning (RL) have achieved unprecedented success on many problem domains. During RL, these models explore by generating new outputs, one token at a time. However, sampling actions token-by-token can result in highly inefficient learning, particularly when rewards are sparse. Here, we show that it is possible to overcome this problem by acting and exploring within the internal representations of an autoregressive model. Specifically, to discover temporally-abstract actions, we introduce a higher-order, non-causal sequence model whose outputs control the residual stream activations of a base autoregressive model. On grid world and MuJoCo-based tasks with hierarchical structure, we find that the higher-order model learns to compress long activation sequence chunks onto internal controllers. Critically, each controller executes a sequence of behaviorally meaningful actions that unfold over long timescales and are accompanied by a learned termination condition, such that composing multiple controllers over time leads to efficient exploration on novel tasks. We show that direct internal controller reinforcement, a process we term "internal RL", enables learning from sparse rewards in cases where standard RL finetuning fails. Our results demonstrate the benefits of latent action generation and reinforcement in autoregressive models, suggesting internal RL as a promising avenue for realizing hierarchical RL within foundation models.
中文摘要 基于下一个标记预测预训练并通过强化学习（RL）微调的大规模自回归模型在许多问题领域取得了前所未有的成功。在强化学习中，这些模型通过生成一个新输出，一次一个代币进行探索。然而，逐个代币抽样动作可能导致学习效率极低，尤其是在奖励稀少时。在这里，我们展示了通过在自回归模型的内部表征中行动和探索，可以克服这个问题。具体来说，为了发现时间抽象作用，我们引入了一个高阶非因果序列模型，其输出控制基自回归模型的残余流激活。在网格世界和基于MuJoCo的层级结构任务中，我们发现高阶模型学习将长激活序列块压缩到内部控制器上。关键是，每个控制器执行一系列具有行为意义的动作，这些动作在较长的时间尺度内展开，并伴随着学习的终止条件，因此随着时间组合多个控制器，能够高效探索新颖任务。我们证明了直接内部控制器强化过程，我们称之为“内部强化学习”，能够在标准强化学习微调失败的情况下，从稀疏奖励中学习。我们的结果展示了潜动作生成和强化在自回归模型中的益处，表明内部强化学习是实现基础模型中层级强化学习的有前景途径。

LongVideoAgent: Multi-Agent Reasoning with Long Videos

LongVideoAgent：多智能体推理与长视频

Authors: Runtao Liu, Ziyi Liu, Jiaqi Tang, Yue Ma, Renjie Pi, Jipeng Zhang, Qifeng Chen
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2512.20618
Pdf link: https://arxiv.org/pdf/2512.20618
Abstract Recent advances in multimodal LLMs and systems that use tools for long-video QA point to the promise of reasoning over hour-long episodes. However, many methods still compress content into lossy summaries or rely on limited toolsets, weakening temporal grounding and missing fine-grained cues. We propose a multi-agent framework in which a master LLM coordinates a grounding agent to localize question-relevant segments and a vision agent to extract targeted textual observations. The master agent plans with a step limit and is trained with reinforcement learning to encourage concise, correct, and efficient multi-agent cooperation. This design helps the master agent focus on relevant clips via grounding, complements subtitles with visual detail, and yields interpretable trajectories. On our proposed LongTVQA and LongTVQA+, which are episode-level datasets aggregated from TVQA/TVQA+, our multi-agent system significantly outperforms strong non-agent baselines. Experiments also show that reinforcement learning further strengthens reasoning and planning for the trained agent. Code and data will be shared at this https URL.
中文摘要 近年来，多模态大型语言模型（LLM）和使用长视频质量保证工具的系统取得了进步，这表明了在一小时长的节目中进行推理的前景。然而，许多方法仍然将内容压缩成有损摘要或依赖有限的工具集，削弱了时间的基础性，并遗漏了细粒度线索。我们提出了一个多智能体框架，主级LLM协调一个基础智能体定位问题相关片段，一个视觉智能体提取针对性的文本观察。主智能体计划有步骤限制，并通过强化学习进行训练，以促进简洁、正确且高效的多智能体合作。这种设计帮助主特工通过接地聚焦相关片段，辅以字幕与视觉细节，并产生可解读的轨迹。在我们提出的LongTVQA和LongTVQA+（由TVQA/TVQA+汇总的集数级数据集）中，我们的多智能体系统显著优于强的非智能体基线。实验还表明，强化学习进一步强化了训练有素的主体的推理和规划能力。代码和数据将在此HTTPS网址共享。
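
The coordination pattern described in the abstract (a master agent that plans under a step limit and delegates to a grounding agent and a vision agent) can be sketched as a simple control loop. Everything below is a hypothetical illustration: the action format, the function names, and the callables standing in for the three agents are assumptions, and the released code may be organized quite differently.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical action format: ("ground", query), ("look", segment),
# or ("answer", text).
Action = Tuple[str, str]
Step = Tuple[str, str, str]  # (action kind, argument, observation)


def run_master(question: str,
               plan: Callable[[str, List[Step]], Action],
               ground: Callable[[str], str],
               look: Callable[[str], str],
               max_steps: int = 5) -> Tuple[Optional[str], List[Step]]:
    """Run the master loop until it answers or the step budget runs out."""
    trajectory: List[Step] = []
    for _ in range(max_steps):
        kind, arg = plan(question, trajectory)
        if kind == "answer":
            return arg, trajectory
        # Delegate to the grounding agent or the vision agent.
        observation = ground(arg) if kind == "ground" else look(arg)
        trajectory.append((kind, arg, observation))
    # Budget exhausted without an answer.
    return None, trajectory
```

The returned trajectory records which agent was called at each step and what it reported, which corresponds loosely to the interpretable trajectories mentioned in the abstract.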

Keyword: diffusion policy

There is no result