Arxiv Papers of Today

Generated: 2025-11-06 16:30:37 (UTC+8); Arxiv release time: 2025-11-06 20:00 EST (2025-11-07 09:00 UTC+8)

26 relevant papers today

Keyword: reinforcement learning

Digital Twin-Driven Pavement Health Monitoring and Maintenance Optimization Using Graph Neural Networks

Authors: Mohsin Mahmud Topu, Mahfuz Ahmed Anik, Azmine Toushik Wasi, Md Manjurul Ahsan
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Emerging Technologies (cs.ET); Neural and Evolutionary Computing (cs.NE); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2511.02957
Pdf link: https://arxiv.org/pdf/2511.02957
Abstract Pavement infrastructure monitoring is challenged by complex spatial dependencies, changing environmental conditions, and non-linear deterioration across road networks. Traditional Pavement Management Systems (PMS) remain largely reactive, lacking real-time intelligence for failure prevention and optimal maintenance planning. To address this, we propose a unified Digital Twin (DT) and Graph Neural Network (GNN) framework for scalable, data-driven pavement health monitoring and predictive maintenance. Pavement segments and spatial relations are modeled as graph nodes and edges, while real-time UAV, sensor, and LiDAR data stream into the DT. The inductive GNN learns deterioration patterns from graph-structured inputs to forecast distress and enable proactive interventions. Trained on a real-world-inspired dataset with segment attributes and dynamic connectivity, our model achieves an $R^2$ of 0.3798, outperforming baseline regressors and effectively capturing non-linear degradation. We also develop an interactive dashboard and reinforcement learning module for simulation, visualization, and adaptive maintenance planning. This DT-GNN integration enhances forecasting precision and establishes a closed feedback loop for continuous improvement, positioning the approach as a foundation for proactive, intelligent, and sustainable pavement management, with future extensions toward real-world deployment, multi-agent coordination, and smart-city integration.
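
Note on the reported metric: the coefficient of determination compares a model's squared error with that of always predicting the mean, so a value of 0.3798 corresponds to roughly 38% of the target variance being explained. The sketch below shows that standard definition only; the function name `r2_score` and the toy numbers are illustrative assumptions, not the authors' evaluation code.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (standard definition)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when the targets are constant")
    return 1.0 - ss_res / ss_tot
```

A perfect predictor scores 1.0, and a predictor that always outputs the target mean scores 0.0.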

Value of Information-Enhanced Exploration in Bootstrapped DQN

Authors: Stergios Plataniotis, Charilaos Akasiadis, Georgios Chalkiadakis
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.02969
Pdf link: https://arxiv.org/pdf/2511.02969
Abstract Efficient exploration in deep reinforcement learning remains a fundamental challenge, especially in environments characterized by high-dimensional states and sparse rewards. Traditional exploration strategies that rely on random local policy noise, such as $\epsilon$-greedy and Boltzmann exploration methods, often struggle to efficiently balance exploration and exploitation. In this paper, we integrate the notion of (expected) value of information (EVOI) within the well-known Bootstrapped DQN algorithmic framework, to enhance the algorithm's deep exploration ability. Specifically, we develop two novel algorithms that incorporate the expected gain from learning the value of information into Bootstrapped DQN. Our methods use value of information estimates to measure the discrepancies of opinions among distinct network heads, and drive exploration towards areas with the most potential. We evaluate our algorithms with respect to performance and their ability to exploit inherent uncertainty arising from random network initialization. Our experiments in complex, sparse-reward Atari games demonstrate increased performance, all the while making better use of uncertainty, and, importantly, without introducing extra hyperparameters.
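
Illustrative sketch: the abstract relies on several network heads whose opinions can disagree. The code below shows only the generic Bootstrapped DQN ingredients under simplifying assumptions: each head is represented by a plain list of Q-values for one state, one head is sampled to act greedily, and disagreement is measured as the per-action variance across heads. This variance is a stand-in chosen for illustration; it is not the paper's EVOI estimate, and all function names are hypothetical.

```python
import random
from statistics import pvariance


def greedy_action(q_values):
    """Index of the largest Q-value (first index wins ties)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def select_head(num_heads, rng):
    """Sample one head uniformly; in Bootstrapped DQN this is done per episode."""
    return rng.randrange(num_heads)


def head_disagreement(head_q_values):
    """Per-action population variance of Q-estimates across heads.

    head_q_values[k][a] is head k's estimate for action a in a fixed state.
    This is an illustrative disagreement measure, not the paper's EVOI.
    """
    num_actions = len(head_q_values[0])
    return [
        pvariance([head[a] for head in head_q_values])
        for a in range(num_actions)
    ]


if __name__ == "__main__":
    heads = [[1.0, 0.0], [1.0, 2.0]]  # two heads, two actions (toy numbers)
    k = select_head(len(heads), random.Random(0))
    print("acting head:", k, "action:", greedy_action(heads[k]))
    print("disagreement per action:", head_disagreement(heads))
```

In this toy state the heads agree on action 0 and disagree on action 1, so a disagreement-driven explorer would be drawn toward action 1.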

Leveraging Discrete Function Decomposability for Scientific Design

Authors: James C. Bowden, Sergey Levine, Jennifer Listgarten
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.03032
Pdf link: https://arxiv.org/pdf/2511.03032
Abstract In the era of AI-driven science and engineering, we often want to design discrete objects in silico according to user-specified properties. For example, we may wish to design a protein to bind its target, arrange components within a circuit to minimize latency, or find materials with certain properties. Given a property predictive model, in silico design typically involves training a generative model over the design space (e.g., protein sequence space) to concentrate on designs with the desired properties. Distributional optimization -- which can be formalized as an estimation of distribution algorithm or as reinforcement learning policy optimization -- finds the generative model that maximizes an objective function in expectation. Optimizing a distribution over discrete-valued designs is in general challenging because of the combinatorial nature of the design space. However, many property predictors in scientific applications are decomposable in the sense that they can be factorized over design variables in a way that could in principle enable more effective optimization. For example, amino acids at a catalytic site of a protein may only loosely interact with amino acids of the rest of the protein to achieve maximal catalytic activity. Current distributional optimization algorithms are unable to make use of such decomposability structure. Herein, we propose and demonstrate use of a new distributional optimization algorithm, Decomposition-Aware Distributional Optimization (DADO), that can leverage any decomposability defined by a junction tree on the design variables, to make optimization more efficient. At its core, DADO employs a soft-factorized "search distribution" -- a learned generative model -- for efficient navigation of the search space, invoking graph message-passing to coordinate optimization across linked factors.

Scaling Multi-Agent Environment Co-Design with Diffusion Models

Authors: Hao Xiang Li, Michael Amir, Amanda Prorok
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2511.03100
Pdf link: https://arxiv.org/pdf/2511.03100
Abstract The agent-environment co-design paradigm jointly optimises agent policies and environment configurations in search of improved system performance. With application domains ranging from warehouse logistics to windfarm management, co-design promises to fundamentally change how we deploy multi-agent systems. However, current co-design methods struggle to scale. They collapse under high-dimensional environment design spaces and suffer from sample inefficiency when addressing moving targets inherent to joint optimisation. We address these challenges by developing Diffusion Co-Design (DiCoDe), a scalable and sample-efficient co-design framework pushing co-design towards practically relevant settings. DiCoDe incorporates two core innovations. First, we introduce Projected Universal Guidance (PUG), a sampling technique that enables DiCoDe to explore a distribution of reward-maximising environments while satisfying hard constraints such as spatial separation between obstacles. Second, we devise a critic distillation mechanism to share knowledge from the reinforcement learning critic, ensuring that the guided diffusion model adapts to evolving agent policies using a dense and up-to-date learning signal. Together, these improvements lead to superior environment-policy pairs when validated on challenging multi-agent environment co-design benchmarks including warehouse automation, multi-agent pathfinding and wind farm optimisation. Our method consistently exceeds the state-of-the-art, achieving, for example, 39% higher rewards in the warehouse setting with 66% fewer simulation samples. This sets a new standard in agent-environment co-design, and is a stepping stone towards reaping the rewards of co-design in real world domains.

Learning Natural and Robust Hexapod Locomotion over Complex Terrains via Motion Priors based on Deep Reinforcement Learning

Authors: Xin Liu, Jinze Wu, Yinghui Li, Chenkun Qi, Yufei Xue, Feng Gao
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.03167
Pdf link: https://arxiv.org/pdf/2511.03167
Abstract Multi-legged robots offer enhanced stability to navigate complex terrains with their multiple legs interacting with the environment. However, how to effectively coordinate the multiple legs in a larger action exploration space to generate natural and robust movements is a key issue. In this paper, we introduce a motion prior-based approach, successfully applying deep reinforcement learning algorithms to a real hexapod robot. We generate a dataset of optimized motion priors, and train an adversarial discriminator based on the priors to guide the hexapod robot to learn natural gaits. The learned policy is then successfully transferred to a real hexapod robot, where it demonstrates natural gait patterns and remarkable robustness without visual information in complex terrains. This is the first time that a reinforcement learning controller has been used to achieve complex terrain walking on a real hexapod robot.

Learning-based Cooperative Robotic Paper Wrapping: A Unified Control Policy with Residual Force Control

Authors: Rewida Ali, Cristian C. Beltran-Hernandez, Weiwei Wan, Kensuke Harada
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.03181
Pdf link: https://arxiv.org/pdf/2511.03181
Abstract Human-robot cooperation is essential in environments such as warehouses and retail stores, where workers frequently handle deformable objects like paper, bags, and fabrics. Coordinating robotic actions with human assistance remains difficult due to the unpredictable dynamics of deformable materials and the need for adaptive force control. To explore this challenge, we focus on the task of gift wrapping, which exemplifies a long-horizon manipulation problem involving precise folding, controlled creasing, and secure fixation of paper. Success is achieved when the robot completes the sequence to produce a neatly wrapped package with clean folds and no tears. We propose a learning-based framework that integrates a high-level task planner powered by a large language model (LLM) with a low-level hybrid imitation learning (IL) and reinforcement learning (RL) policy. At its core is a Sub-task Aware Robotic Transformer (START) that learns a unified policy from human demonstrations. The key novelty lies in capturing long-range temporal dependencies across the full wrapping sequence within a single model. Unlike vanilla Action Chunking with Transformer (ACT), typically applied to short tasks, our method introduces sub-task IDs that provide explicit temporal grounding. This enables robust performance across the entire wrapping process and supports flexible execution, as the policy learns sub-goals rather than merely replicating motion sequences. Our framework achieves a 97% success rate on real-world wrapping tasks. We show that the unified transformer-based policy reduces the need for specialized models, allows controlled human supervision, and effectively bridges high-level intent with the fine-grained force control required for deformable object manipulation.

Periodic Skill Discovery

Authors: Jonghae Park, Daesol Cho, Jusuk Lee, Dongseok Shim, Inkyu Jang, H. Jin Kim
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2511.03187
Pdf link: https://arxiv.org/pdf/2511.03187
Abstract Unsupervised skill discovery in reinforcement learning (RL) aims to learn diverse behaviors without relying on external rewards. However, current methods often overlook the periodic nature of learned skills, focusing instead on increasing the mutual dependence between states and skills or maximizing the distance traveled in latent space. Considering that many robotic tasks -- particularly those involving locomotion -- require periodic behaviors across varying timescales, the ability to discover diverse periodic skills is essential. Motivated by this, we propose Periodic Skill Discovery (PSD), a framework that discovers periodic behaviors in an unsupervised manner. The key idea of PSD is to train an encoder that maps states to a circular latent space, thereby naturally encoding periodicity in the latent representation. By capturing temporal distance, PSD can effectively learn skills with diverse periods in complex robotic tasks, even with pixel-based observations. We further show that these learned skills achieve high performance on downstream tasks such as hurdling. Moreover, integrating PSD with an existing skill discovery method offers more diverse behaviors, thus broadening the agent's repertoire. Our code and demos are available at this https URL
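
Illustrative sketch: the key idea in the abstract is a circular latent space in which periodicity is encoded naturally. The toy code below shows why a circle does this, under a strong simplification: instead of a learned encoder over states, it maps a time step with a known period to a point on the unit circle. The functions `circular_embedding` and `latent_distance` are hypothetical names for illustration, not part of PSD.

```python
import math


def circular_embedding(step, period):
    """Map a time step to a point on the unit circle.

    Hand-written stand-in for a learned encoder: steps that are a whole
    number of periods apart land on the same latent point.
    """
    angle = 2.0 * math.pi * (step % period) / period
    return (math.cos(angle), math.sin(angle))


def latent_distance(a, b):
    """Euclidean distance between two 2D latent points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


if __name__ == "__main__":
    z0 = circular_embedding(0, period=10)
    z_half = circular_embedding(5, period=10)
    z_full = circular_embedding(10, period=10)
    print(latent_distance(z0, z_half))  # half a period apart: opposite points
    print(latent_distance(z0, z_full))  # one full period apart: same point
```

Points half a period apart sit on opposite sides of the circle, while points a full period apart coincide, so distance in the latent space reflects phase rather than elapsed time.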

Collaborative Assembly Policy Learning of a Sightless Robot

Authors: Zeqing Zhang, Weifeng Lu, Lei Yang, Wei Jing, Bowei Tang, Jia Pan
Subjects: Robotics (cs.RO); Human-Computer Interaction (cs.HC); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2511.03189
Pdf link: https://arxiv.org/pdf/2511.03189
Abstract This paper explores a physical human-robot collaboration (pHRC) task involving the joint insertion of a board into a frame by a sightless robot and a human operator. While admittance control is commonly used in pHRC tasks, it can be challenging to measure the force/torque applied by the human for accurate human intent estimation, limiting the robot's ability to assist in the collaborative task. Other methods that attempt to solve pHRC tasks using reinforcement learning (RL) are also unsuitable for the board-insertion task due to its safety constraints and sparse rewards. Therefore, we propose a novel RL approach that utilizes a human-designed admittance controller to facilitate more active robot behavior and reduce human effort. Through simulation and real-world experiments, we demonstrate that our approach outperforms admittance control in terms of success rate and task completion time. Additionally, we observed a significant reduction in measured force/torque when using our proposed approach compared to admittance control. The video of the experiments is available at this https URL.
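
Illustrative sketch: admittance control, the baseline named in the abstract, turns a measured external force into robot motion by simulating a virtual mass-damper, M * dv/dt + D * v = F. The code below integrates that textbook law in one dimension with explicit Euler steps. The gains, time step, and function names are assumptions for illustration; the abstract does not specify the authors' controller, and this sketch does not include their RL component.

```python
def admittance_step(velocity, force, mass, damping, dt):
    """One explicit-Euler step of the virtual dynamics M*dv/dt + D*v = F."""
    acceleration = (force - damping * velocity) / mass
    return velocity + dt * acceleration


def simulate(force, mass, damping, dt, steps):
    """Commanded velocity after applying a constant force, starting from rest."""
    velocity = 0.0
    for _ in range(steps):
        velocity = admittance_step(velocity, force, mass, damping, dt)
    return velocity


if __name__ == "__main__":
    # Hypothetical gains: virtual mass 2 kg, damping 10 N*s/m, push of 5 N.
    print(simulate(force=5.0, mass=2.0, damping=10.0, dt=0.01, steps=2000))
```

Under a constant push the commanded velocity settles at force / damping, which is why a lower virtual damping makes the robot feel lighter to the human partner.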

Incorporating Quality of Life in Climate Adaptation Planning via Reinforcement Learning

Authors: Miguel Costa, Arthur Vandervoort, Martin Drews, Karyn Morrissey, Francisco C. Pereira
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.03238
Pdf link: https://arxiv.org/pdf/2511.03238
Abstract Urban flooding is expected to increase in frequency and severity as a consequence of climate change, causing wide-ranging impacts that include a decrease in urban Quality of Life (QoL). Meanwhile, policymakers must devise adaptation strategies that can cope with the uncertain nature of climate change and the complex and dynamic nature of urban flooding. Reinforcement Learning (RL) holds significant promise in tackling such complex, dynamic, and uncertain problems. Because of this, we use RL to identify which climate adaptation pathways lead to a higher QoL in the long term. We do this using an Integrated Assessment Model (IAM) which combines a rainfall projection model, a flood model, a transport accessibility model, and a quality of life index. Our preliminary results suggest that this approach can be used to learn optimal adaptation measures and it outperforms other realistic and real-world planning strategies. Our framework is publicly available: this https URL.

Climate Adaptation with Reinforcement Learning: Economic vs. Quality of Life Adaptation Pathways

Authors: Miguel Costa, Arthur Vandervoort, Martin Drews, Karyn Morrissey, Francisco C. Pereira
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.03243
Pdf link: https://arxiv.org/pdf/2511.03243
Abstract Climate change will cause an increase in the frequency and severity of flood events, prompting the need for cohesive adaptation policymaking. Designing effective adaptation policies, however, depends on managing the uncertainty of long-term climate impacts. Meanwhile, such policies can feature important normative choices that are not always made explicit. We propose that Reinforcement Learning (RL) can be a useful tool both to identify adaptation pathways under uncertain conditions and to allow for the explicit modelling (and consequent comparison) of different adaptation priorities (e.g. economic vs. wellbeing). We use an Integrated Assessment Model (IAM) to link together a rainfall and flood model, and compute the impacts of flooding in terms of quality of life (QoL), transportation, and infrastructure damage. Our results show that models prioritising QoL over economic impacts result in more adaptation spending as well as a more even distribution of spending over the study area, highlighting the extent to which such normative assumptions can alter adaptation policy. Our framework is publicly available: this https URL.

Multi-Objective Adaptive Rate Limiting in Microservices Using Deep Reinforcement Learning

Authors: Ning Lyu, Yuxi Wang, Ziyu Cheng, Qingyuan Zhang, Feng Chen
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.03279
Pdf link: https://arxiv.org/pdf/2511.03279
Abstract As cloud computing and microservice architectures become increasingly prevalent, API rate limiting has emerged as a critical mechanism for ensuring system stability and service quality. Traditional rate limiting algorithms, such as token bucket and sliding window, while widely adopted, struggle to adapt to dynamic traffic patterns and varying system loads. This paper proposes an adaptive rate limiting strategy based on deep reinforcement learning that dynamically balances system throughput and service latency. We design a hybrid architecture combining Deep Q-Network (DQN) and Asynchronous Advantage Actor-Critic (A3C) algorithms, modeling the rate limiting decision process as a Markov Decision Process. The system continuously monitors microservice states and learns optimal rate limiting policies through environmental interaction. Extensive experiments conducted in a Kubernetes cluster environment demonstrate that our approach achieves 23.7% throughput improvement and 31.4% P99 latency reduction compared to traditional fixed-threshold strategies under high-load scenarios. Results from a 90-day production deployment handling 500 million daily requests validate the practical effectiveness of the proposed method, with 82% reduction in service degradation incidents and 68% decrease in manual interventions.
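
Illustrative sketch: the token bucket named in the abstract as a traditional baseline can be written in a few lines, which also shows why it cannot adapt on its own: `rate` and `capacity` are fixed at construction. The class below is a generic textbook version with the clock passed in explicitly so that its behaviour is deterministic; it is not taken from the paper, and the parameter names are assumptions.

```python
class TokenBucket:
    """Fixed-parameter token bucket: refills at `rate` tokens per second."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start with a full bucket
        self.last = now           # time of the most recent refill

    def allow(self, now, cost=1.0):
        """Return True and consume `cost` tokens if the request may pass."""
        elapsed = max(0.0, now - self.last)
        # Refill for the elapsed time, never exceeding the capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate=1.0, capacity=2.0)
    # Three requests at t=0 (burst of 2 allowed), then two at t=1.
    print([bucket.allow(t) for t in (0.0, 0.0, 0.0, 1.0, 1.0)])
```

An adaptive scheme in the spirit of the abstract would treat quantities such as `rate` as an action chosen from observed load, rather than as a constant.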

DRL-Based Robust Multi-Timescale Anti-Jamming Approaches under State Uncertainty

Authors: Haoqin Zhao, Zan Li, Jiangbo Si, Rui Huang, Hang Hu, Tony Q.S. Quek, Naofal Al-Dhahir
Subjects: Information Theory (cs.IT)
Arxiv link: https://arxiv.org/abs/2511.03305
Pdf link: https://arxiv.org/pdf/2511.03305
Abstract Owing to the openness of wireless channels, wireless communication systems are highly susceptible to malicious jamming. Most existing anti-jamming methods rely on the assumption of accurate sensing and optimize parameters on a single timescale. However, such methods overlook two practical issues: mismatched execution latencies across heterogeneous actions and measurement errors caused by sensor imperfections. Especially for deep reinforcement learning (DRL)-based methods, the inherent sensitivity of neural networks implies that even minor perturbations in the input can mislead the agent into choosing suboptimal actions, with potentially severe consequences. To ensure reliable wireless transmission, we establish a multi-timescale decision model that incorporates state uncertainty. Subsequently, we propose two robust schemes that sustain performance under bounded sensing errors. First, a Projected Gradient Descent-assisted Double Deep Q-Network (PGD-DDQN) algorithm is designed, which derives worst-case perturbations under a norm-bounded error model and applies PGD during training for robust optimization. Second, a Nonlinear Q-Compression DDQN (NQC-DDQN) algorithm introduces a nonlinear compression mechanism that adaptively contracts Q-value ranges to eliminate action aliasing. Simulation results indicate that, compared with the perfect-sensing baseline, the proposed algorithms show only minor degradation in anti-jamming performance while maintaining robustness under various perturbations, thereby validating their practicality in imperfect sensing conditions.
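
Illustrative sketch: projected gradient descent under a norm-bounded error model, as mentioned for PGD-DDQN, alternates a gradient step on the perturbation with a projection back into the allowed ball. The code below shows that loop for an L-infinity bound, assuming a linear Q-function Q(s) = w · s so that the gradient with respect to the state is simply w. The linear model, the sign-gradient step, and the function names are simplifying assumptions; the paper's network and norm choice may differ.

```python
def sign(x):
    """Sign of x as -1, 0, or 1."""
    return (x > 0) - (x < 0)


def pgd_perturbation(weights, epsilon, step_size, num_steps):
    """Worst-case state perturbation for a linear Q(s) = w . s.

    Minimises Q(s + delta) subject to max_i |delta_i| <= epsilon.
    """
    delta = [0.0] * len(weights)
    for _ in range(num_steps):
        # The gradient of w . (s + delta) w.r.t. delta is w; step downhill
        # using the sign of the gradient.
        delta = [d - step_size * sign(w) for d, w in zip(delta, weights)]
        # Project back onto the L-infinity ball of radius epsilon.
        delta = [max(-epsilon, min(epsilon, d)) for d in delta]
    return delta


if __name__ == "__main__":
    print(pgd_perturbation([2.0, -1.0, 0.0], epsilon=0.1, step_size=0.04, num_steps=5))
```

For a linear Q-function the result saturates at ±epsilon on every coordinate with a non-zero weight; with a neural network the gradient would be recomputed at each step by automatic differentiation.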

Learning Communication Skills in Multi-task Multi-agent Deep Reinforcement Learning

Authors: Changxi Zhu, Mehdi Dastani, Shihan Wang
Subjects: Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2511.03348
Pdf link: https://arxiv.org/pdf/2511.03348
Abstract In multi-agent deep reinforcement learning (MADRL), agents can communicate with one another to perform a task in a coordinated manner. When multiple tasks are involved, agents can also leverage knowledge from one task to improve learning in other tasks. In this paper, we propose Multi-task Communication Skills (MCS), a MADRL with communication method that learns and performs multiple tasks simultaneously, with agents interacting through learnable communication protocols. MCS employs a Transformer encoder to encode task-specific observations into a shared message space, capturing shared communication skills among agents. To enhance coordination among agents, we introduce a prediction network that correlates messages with the actions of sender agents in each task. We adapt three multi-agent benchmark environments to multi-task settings, where the number of agents as well as the observation and action spaces vary across tasks. Experimental results demonstrate that MCS achieves better performance than multi-task MADRL baselines without communication, as well as single-task MADRL baselines with and without communication.

Adaptable Hindsight Experience Replay for Search-Based Learning

Authors: Alexandros Vazaios, Jannis Brugger, Cedric Derstroff, Kristian Kersting, Mira Mezini
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.03405
Pdf link: https://arxiv.org/pdf/2511.03405
Abstract AlphaZero-like Monte Carlo Tree Search systems, originally introduced for two-player games, dynamically balance exploration and exploitation using neural network guidance. This combination also makes them suitable for classical search problems. However, the original method of training the network with simulation results is limited in sparse reward settings, especially in the early stages, where the network cannot yet give guidance. Hindsight Experience Replay (HER) addresses this issue by relabeling unsuccessful trajectories from the search tree as supervised learning signals. We introduce Adaptable HER, a flexible framework that integrates HER with AlphaZero, allowing easy adjustments to HER properties such as relabeled goals, policy targets, and trajectory selection. Our experiments, including equation discovery, show that the possibility of modifying HER is beneficial and surpasses the performance of pure supervised or reinforcement learning.
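
Illustrative sketch: hindsight relabeling turns a trajectory that missed its goal into a successful example for the goal it actually reached. The code below shows the generic "final goal" relabeling strategy on a list of step dictionaries. The dictionary keys, the 0/1 reward, and the function name are assumptions made for illustration; the framework in the abstract relabels trajectories taken from a search tree and exposes more choices (relabeled goals, policy targets, trajectory selection) than this sketch covers.

```python
def relabel_with_final_goal(trajectory):
    """Relabel every step with the goal achieved at the end of the trajectory.

    Each step is a dict with keys "state", "action", "achieved_goal", "goal".
    Returns new dicts; the input trajectory is left unchanged.
    """
    if not trajectory:
        return []
    new_goal = trajectory[-1]["achieved_goal"]
    relabeled = []
    for step in trajectory:
        # Sparse reward: 1 only where the substituted goal is reached.
        reward = 1.0 if step["achieved_goal"] == new_goal else 0.0
        relabeled.append({**step, "goal": new_goal, "reward": reward})
    return relabeled


if __name__ == "__main__":
    failed = [
        {"state": 0, "action": "a", "achieved_goal": "A", "goal": "G"},
        {"state": 1, "action": "b", "achieved_goal": "B", "goal": "G"},
    ]
    for step in relabel_with_final_goal(failed):
        print(step["goal"], step["reward"])
```

The original trajectory never reached "G", but after relabeling it is a valid demonstration of how to reach "B", which provides a learning signal even when rewards are sparse.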

Knowledge-Augmented Question Error Correction for Chinese Question Answer System with QuestionRAG

Authors: Longpeng Qiu, Ting Li, Shuai Mao, Nan Yang, Xiaohui Yan
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2511.03410
Pdf link: https://arxiv.org/pdf/2511.03410
Abstract Input errors in question-answering (QA) systems often lead to incorrect responses. Large language models (LLMs) struggle with this task, frequently failing to interpret user intent (misinterpretation) or unnecessarily altering the original question's structure (over-correction). We propose QuestionRAG, a framework that tackles these problems. To address misinterpretation, it enriches the input with external knowledge (e.g., search results, related entities). To prevent over-correction, it uses reinforcement learning (RL) to align the model's objective with precise correction, not just paraphrasing. Our results demonstrate that knowledge augmentation is critical for understanding faulty questions. Furthermore, RL-based alignment proves significantly more effective than traditional supervised fine-tuning (SFT), boosting the model's ability to follow instructions and generalize. By integrating these two strategies, QuestionRAG unlocks the full potential of LLMs for the question correction task.

Reinforcement Learning Using known Invariances

Authors: Alexandru Cioba, Aya Kayal, Laura Toni, Sattar Vakili, Alberto Bernacchia
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.03473
Pdf link: https://arxiv.org/pdf/2511.03473
Abstract In many real-world reinforcement learning (RL) problems, the environment exhibits inherent symmetries that can be exploited to improve learning efficiency. This paper develops a theoretical and algorithmic framework for incorporating known group symmetries into kernel-based RL. We propose a symmetry-aware variant of optimistic least-squares value iteration (LSVI), which leverages invariant kernels to encode invariance in both rewards and transition dynamics. Our analysis establishes new bounds on the maximum information gain and covering numbers for invariant RKHSs, explicitly quantifying the sample efficiency gains from symmetry. Empirical results on a customized Frozen Lake environment and a 2D placement design problem confirm the theoretical improvements, demonstrating that symmetry-aware RL achieves significantly better performance than its standard-kernel counterpart. These findings highlight the value of structural priors in designing more sample-efficient reinforcement learning algorithms.
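
Illustrative sketch: one common way to build an invariant kernel from a base kernel k and a finite group G is to average over the group, k_G(x, y) = (1/|G|^2) * sum over g, h in G of k(g·x, h·y). The code below does this for the two-element reflection group acting on real numbers (x and -x) with an RBF base kernel. The choice of group, base kernel, and function names are assumptions for illustration; the abstract does not state which construction the paper uses.

```python
import math


def rbf_kernel(x, y, lengthscale=1.0):
    """Standard RBF (squared-exponential) kernel on real numbers."""
    return math.exp(-((x - y) ** 2) / (2.0 * lengthscale ** 2))


# The reflection group acting on the real line: identity and negation.
REFLECTION_GROUP = (lambda x: x, lambda x: -x)


def invariant_kernel(x, y, base=rbf_kernel, group=REFLECTION_GROUP):
    """Group-averaged kernel: mean of base(g(x), h(y)) over all g, h in the group."""
    total = sum(base(g(x), h(y)) for g in group for h in group)
    return total / (len(group) ** 2)


if __name__ == "__main__":
    print(invariant_kernel(1.0, 2.0))
    print(invariant_kernel(-1.0, 2.0))  # same value: invariant under x -> -x
```

Because the averaged kernel cannot distinguish x from -x, a model built on it has to learn a symmetric function only once rather than separately on each side, which is the intuition behind the sample-efficiency gains discussed in the abstract.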
中文摘要 在许多现实世界的强化学习（RL）问题中，环境表现出固有的对称性，可以利用这些对称性来提高学习效率。本文开发了一个理论和算法框架，用于将已知的群对称性纳入基于内核的RL。我们提出了一种乐观最小二乘值迭代（LSVI）的对称感知变体，它利用不变核对奖励和过渡动力学中的不变性进行编码。我们的分析为不变 RKHS 的最大信息增益和覆盖数建立了新的界限，明确量化了对称性带来的样本效率增益。定制的 Frozen Lake 环境和 2D 放置设计问题的实证结果证实了理论上的改进，表明对称感知 RL 的性能明显优于标准内核对应物。这些发现凸显了结构先验在设计样本效率更高的强化学习算法方面的价值。
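Editor's sketch: one common way to obtain an invariant kernel for a finite symmetry group is to average a base kernel over the group. The abstract does not say which construction the paper uses, so everything below is an assumption for illustration: the RBF base kernel, the group C4 of 90-degree rotations in the plane, and the names `rbf`, `rotation`, `C4` and `invariant_kernel`.

```python
import numpy as np

def rbf(x, y, lengthscale=1.0):
    """Base RBF kernel; it is unchanged when x and y are rotated together."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-0.5 * np.dot(d, d) / lengthscale**2))

def rotation(k):
    """Rotation by k * 90 degrees, built from exact 0/1/-1 entries."""
    c, s = [(1, 0), (0, 1), (-1, 0), (0, -1)][k % 4]
    return np.array([[c, -s], [s, c]], dtype=float)

# Hypothetical symmetry group: the four planar rotations by multiples of 90 degrees.
C4 = [rotation(k) for k in range(4)]

def invariant_kernel(x, y, group=C4, base=rbf):
    """Average the base kernel over the group orbit of y.

    Because the base kernel is invariant to rotating both arguments at once,
    the averaged kernel is invariant to rotating either argument alone.
    """
    y = np.asarray(y, dtype=float)
    return sum(base(x, g @ y) for g in group) / len(group)
```

With this construction, a point and any of its rotated copies produce the same kernel values, which is the sense in which such a kernel encodes a known symmetry.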

Learning Without Critics? Revisiting GRPO in Classical Reinforcement Learning Environments

在没有批评者的情况下学习？在经典强化学习环境中重新审视GRPO

Authors: Bryan L. M. de Oliveira, Felipe V. Frujeri, Marcos P. C. M. Queiroz, Luana G. B. Martins, Telma W. de L. Soares, Luckeciano C. Melo
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.03527
Pdf link: https://arxiv.org/pdf/2511.03527
Abstract Group Relative Policy Optimization (GRPO) has emerged as a scalable alternative to Proximal Policy Optimization (PPO) by eliminating the learned critic and instead estimating advantages through group-relative comparisons of trajectories. This simplification raises fundamental questions about the necessity of learned baselines in policy-gradient methods. We present the first systematic study of GRPO in classical single-task reinforcement learning environments, spanning discrete and continuous control tasks. Through controlled ablations isolating baselines, discounting, and group sampling, we reveal three key findings: (1) learned critics remain essential for long-horizon tasks: all critic-free baselines underperform PPO except in short-horizon environments like CartPole where episodic returns can be effective; (2) GRPO benefits from high discount factors (gamma = 0.99) except in HalfCheetah, where lack of early termination favors moderate discounting (gamma = 0.9); (3) smaller group sizes outperform larger ones, suggesting limitations in batch-based grouping strategies that mix unrelated episodes. These results reveal both the limitations of critic-free methods in classical control and the specific conditions where they remain viable alternatives to learned value functions.
中文摘要 组相对策略优化（GRPO）已成为近端策略优化（PPO）的可扩展替代方案，它去掉了学习得到的 critic，转而通过轨迹的组内相对比较来估计优势。这种简化引出了关于策略梯度方法中学习基线是否必要的根本问题。我们首次在经典单任务强化学习环境中对 GRPO 进行了系统研究，涵盖离散和连续控制任务。通过分别隔离基线、折扣和组采样的受控消融实验，我们得出三个关键发现：（1）学习得到的 critic 对于长时域任务仍然至关重要：除了 CartPole 这类回合回报本身即可奏效的短时域环境外，所有无 critic 的基线都不如 PPO；（2）GRPO 受益于高折扣因子（gamma = 0.99），但 HalfCheetah 除外，该环境没有提前终止，因而更适合中等折扣（gamma = 0.9）；（3）较小的组规模优于较大的组规模，这表明把不相关回合混在一起的基于批次的分组策略存在局限。这些结果既揭示了无 critic 方法在经典控制中的局限性，也指出了它们仍可作为学习价值函数的可行替代方案的具体条件。
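Editor's sketch: the critic-free advantage estimate that GRPO is generally described with normalizes each trajectory's return against the other trajectories in its group. The function below shows that commonly cited form. The exact variants ablated in the paper may differ, and the name `group_relative_advantages` and the `eps` stabilizer are assumptions made here.

```python
import numpy as np

def group_relative_advantages(returns, eps=1e-8):
    """Standardize returns within one group of sampled trajectories.

    No learned value function is involved: the group mean acts as the
    baseline and the group standard deviation sets the scale.
    """
    r = np.asarray(returns, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

Trajectories that beat the group average receive positive advantages and the rest receive negative ones. When every return in the group is identical, all advantages are zero and the group contributes no gradient signal.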

PerfDojo: Automated ML Library Generation for Heterogeneous Architectures

PerfDojo：用于异构架构的自动 ML 库生成

Authors: Andrei Ivanov, Siyuan Shen, Gioele Gottardo, Marcin Chrapek, Afif Boudaoud, Timo Schneider, Luca Benini, Torsten Hoefler
Subjects: Performance (cs.PF); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.03586
Pdf link: https://arxiv.org/pdf/2511.03586
Abstract The increasing complexity of machine learning models and the proliferation of diverse hardware architectures (CPUs, GPUs, accelerators) make achieving optimal performance a significant challenge. Heterogeneity in instruction sets, specialized kernel requirements for different data types and model features (e.g., sparsity, quantization), and architecture-specific optimizations complicate performance tuning. Manual optimization is resource-intensive, while existing automatic approaches often rely on complex hardware-specific heuristics and uninterpretable intermediate representations, hindering performance portability. We introduce PerfLLM, a novel automatic optimization methodology leveraging Large Language Models (LLMs) and Reinforcement Learning (RL). Central to this is PerfDojo, an environment framing optimization as an RL game using a human-readable, mathematically-inspired code representation that guarantees semantic validity through transformations. This allows effective optimization without prior hardware knowledge, facilitating both human analysis and RL agent training. We demonstrate PerfLLM's ability to achieve significant performance gains across diverse CPU (x86, Arm, RISC-V) and GPU architectures.
中文摘要 机器学习模型的日益复杂以及各种硬件架构（CPU、GPU、加速器）的激增使得实现最佳性能成为一项重大挑战。指令集的异构性、对不同数据类型和模型特征（例如稀疏性、量化）的专用内核要求以及特定于架构的优化使性能调整变得复杂。手动优化是资源密集型的，而现有的自动方法通常依赖于复杂的特定于硬件的启发式方法和不可解释的中间表示，从而阻碍了性能的可移植性。我们介绍了 PerfLLM，这是一种利用大型语言模型（LLM）和强化学习（RL）的新型自动优化方法。其中的核心是 PerfDojo，这是一种将优化构建为 RL 游戏的环境，使用人类可读、受数学启发的代码表示，通过转换保证语义有效性。这允许在没有硬件知识的情况下进行有效优化，从而促进人工分析和 RL 代理训练。我们展示了 PerfLLM 在不同 CPU（x86、Arm、RISC-V）和 GPU 架构中实现显着性能提升的能力。

Tensor-Efficient High-Dimensional Q-learning

张量高效的高维 Q 学习

Authors: Junyi Wu, Dan Li
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2511.03595
Pdf link: https://arxiv.org/pdf/2511.03595
Abstract High-dimensional reinforcement learning faces challenges with complex calculations and low sample efficiency in large state-action spaces. Q-learning algorithms struggle particularly with the curse of dimensionality, where the number of state-action pairs grows exponentially with problem size. While neural network-based approaches like Deep Q-Networks have shown success, recent tensor-based methods using low-rank decomposition offer more parameter-efficient alternatives. Building upon existing tensor-based methods, we propose Tensor-Efficient Q-Learning (TEQL), which enhances low-rank tensor decomposition via improved block coordinate descent on discretized state-action spaces, incorporating novel exploration and regularization mechanisms. The key innovation is an exploration strategy that combines approximation error with visit count-based upper confidence bound to prioritize actions with high uncertainty, avoiding wasteful random exploration. Additionally, we incorporate a frequency-based penalty term in the objective function to encourage exploration of less-visited state-action pairs and reduce overfitting to frequently visited regions. Empirical results on classic control tasks demonstrate that TEQL outperforms conventional matrix-based methods and deep RL approaches in both sample efficiency and total rewards, making it suitable for resource-constrained applications, such as space and healthcare where sampling costs are high.
中文摘要 高维强化学习在大状态作用空间中面临着复杂的计算和低样本效率的挑战。Q 学习算法尤其难以应对维度的诅咒，其中状态-动作对的数量随着问题大小的增加呈指数级增长。虽然基于神经网络的方法（如深度 Q 网络）已经取得了成功，但最近使用低秩分解的基于张量的方法提供了更高效的参数替代方案。在现有的基于张量的方法的基础上，我们提出了张量高效Q学习（TEQL），它通过改进离散状态-作用空间上的块坐标下降来增强低秩张量分解，并结合了新的探索和正则化机制。关键创新是一种探索策略，它将近似误差与基于访问计数的置信上限相结合，以优先考虑具有高不确定性的行动，避免浪费的随机探索。此外，我们在目标函数中加入了基于频率的惩罚项，以鼓励探索访问较少的状态-动作对，并减少对频繁访问区域的过度拟合。经典控制任务的实证结果表明，TEQL 在样本效率和总奖励方面优于传统的基于矩阵的方法和深度 RL 方法，使其适用于资源受限的应用，例如采样成本较高的空间和医疗保健。

Going Beyond Expert Performance via Deep Implicit Imitation Reinforcement Learning

通过深度隐式模仿强化学习超越专家表现

Authors: Iason Chrysomallis, Georgios Chalkiadakis
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.03616
Pdf link: https://arxiv.org/pdf/2511.03616
Abstract Imitation learning traditionally requires complete state-action demonstrations from optimal or near-optimal experts. These requirements severely limit practical applicability, as many real-world scenarios provide only state observations without corresponding actions and expert performance is often suboptimal. In this paper we introduce a deep implicit imitation reinforcement learning framework that addresses both limitations by combining deep reinforcement learning with implicit imitation learning from observation-only datasets. Our main algorithm, Deep Implicit Imitation Q-Network (DIIQN), employs an action inference mechanism that reconstructs expert actions through online exploration and integrates a dynamic confidence mechanism that adaptively balances expert-guided and self-directed learning. This enables the agent to leverage expert guidance for accelerated training while maintaining capacity to surpass suboptimal expert performance. We further extend our framework with a Heterogeneous Actions DIIQN (HA-DIIQN) algorithm to tackle scenarios where expert and agent possess different action sets, a challenge previously unaddressed in the implicit imitation learning literature. HA-DIIQN introduces an infeasibility detection mechanism and a bridging procedure identifying alternative pathways connecting agent capabilities to expert guidance when direct action replication is impossible. Our experimental results demonstrate that DIIQN achieves up to 130% higher episodic returns compared to standard DQN, while consistently outperforming existing implicit imitation methods that cannot exceed expert performance. In heterogeneous action settings, HA-DIIQN learns up to 64% faster than baselines, leveraging expert datasets unusable by conventional approaches. Extensive parameter sensitivity analysis reveals the framework's robustness across varying dataset sizes and hyperparameter configurations.
中文摘要 传统上，模仿学习需要来自最优或接近最优专家的完整状态-动作演示。这些要求严重限制了实际适用性，因为许多现实场景只提供状态观测而没有相应的动作，而且专家的表现往往并非最优。在本文中，我们提出了一个深度隐式模仿强化学习框架，通过将深度强化学习与基于纯观测数据集的隐式模仿学习相结合，同时解决了这两个限制。我们的主要算法深度隐式模仿 Q 网络（DIIQN）采用动作推断机制，通过在线探索重构专家动作，并集成了动态置信机制，自适应地平衡专家引导学习与自主学习。这使智能体既能利用专家指导加速训练，又保有超越次优专家表现的能力。我们进一步用异构动作 DIIQN（HA-DIIQN）算法扩展了该框架，以处理专家与智能体动作集不同的场景，这是隐式模仿学习文献中此前未曾解决的挑战。HA-DIIQN 引入了不可行性检测机制和桥接过程，在无法直接复制动作时，寻找将智能体能力与专家指导联系起来的替代路径。实验结果表明，与标准 DQN 相比，DIIQN 的回合回报最多高出 130%，并且始终优于无法超越专家表现的现有隐式模仿方法。在异构动作设置中，HA-DIIQN 的学习速度比基线最多快 64%，并能利用传统方法无法使用的专家数据集。广泛的参数敏感性分析表明，该框架在不同数据集规模和超参数配置下均具有稳健性。

Towards Formalizing Reinforcement Learning Theory

走向强化学习理论的形式化

Authors: Shangtong Zhang
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2511.03618
Pdf link: https://arxiv.org/pdf/2511.03618
Abstract In this paper, we formalize the almost sure convergence of $Q$-learning and linear temporal difference (TD) learning with Markovian samples using the Lean 4 theorem prover based on the Mathlib library. $Q$-learning and linear TD are among the earliest and most influential reinforcement learning (RL) algorithms. The investigation of their convergence properties is not only a major research topic during the early development of the RL field but also receives increasing attention nowadays. This paper formally verifies their almost sure convergence in a unified framework based on the Robbins-Siegmund theorem. The framework developed in this work can be easily extended to convergence rates and other modes of convergence. This work thus makes an important step towards fully formalizing convergent RL results. The code is available at this https URL.
中文摘要 在本文中，我们使用基于 Mathlib 库的 Lean 4 定理证明器，形式化了马尔可夫样本下 $Q$ 学习和线性时序差分（TD）学习的几乎必然收敛性。$Q$ 学习和线性 TD 是最早、最有影响力的强化学习（RL）算法之一。对其收敛性质的研究不仅是 RL 领域早期发展中的主要研究课题，如今也受到越来越多的关注。本文在基于 Robbins-Siegmund 定理的统一框架中形式化验证了它们的几乎必然收敛性。这项工作中开发的框架可以很容易地扩展到收敛速率和其他收敛模式。因此，这项工作朝着完全形式化 RL 收敛性结果迈出了重要一步。该代码可在此 https URL 中找到。
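Editor's note: for orientation, the two iterations named in the abstract are usually written as below. This is the textbook form rather than a quotation from the paper, and the notation is assumed here: $\alpha_t$ is the step size, $\gamma$ the discount factor, $\phi$ the feature map, and $w_t$ the weight vector of the linear value estimate.

```latex
% Tabular Q-learning update along a single Markovian trajectory
Q_{t+1}(S_t, A_t) = Q_t(S_t, A_t)
  + \alpha_t \left[ R_{t+1} + \gamma \max_{a'} Q_t(S_{t+1}, a') - Q_t(S_t, A_t) \right]

% Linear TD(0) update with value estimate v(s) = \phi(s)^\top w
w_{t+1} = w_t
  + \alpha_t \left[ R_{t+1} + \gamma\, \phi(S_{t+1})^\top w_t - \phi(S_t)^\top w_t \right] \phi(S_t)
```

"Almost sure convergence" means that these iterates converge to their limits with probability one over the sampled trajectory.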

DQN Performance with Epsilon Greedy Policies and Prioritized Experience Replay

采用 Epsilon 贪婪策略和优先体验回放的 DQN 性能

Authors: Daniel Perkins, Oscar J. Escobar, Luke Green
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2511.03670
Pdf link: https://arxiv.org/pdf/2511.03670
Abstract We present a detailed study of Deep Q-Networks in finite environments, emphasizing the impact of epsilon-greedy exploration schedules and prioritized experience replay. Through systematic experimentation, we evaluate how variations in epsilon decay schedules affect learning efficiency, convergence behavior, and reward optimization. We investigate how prioritized experience replay leads to faster convergence and higher returns and show empirical results comparing uniform, no replay, and prioritized strategies across multiple simulations. Our findings illuminate the trade-offs and interactions between exploration strategies and memory management in DQN training, offering practical recommendations for robust reinforcement learning in resource-constrained settings.
中文摘要 我们对有限环境中的 Deep Q-Networks 进行了详细研究，强调了 epsilon-greedy 探索计划和优先体验回放的影响。通过系统实验，我们评估了ε衰减时间表的变化如何影响学习效率、收敛行为和奖励优化。我们研究了优先体验回放如何导致更快的收敛和更高的回报，并展示了比较多个模拟中的统一、无回放和优先策略的实证结果。我们的研究结果阐明了 DQN 训练中探索策略和记忆管理之间的权衡和相互作用，为资源受限环境中的鲁棒强化学习提供了实用的建议。
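Editor's sketch: the two ingredients compared in this study can be written in a few lines. The code shows a linear epsilon decay schedule, epsilon-greedy action selection, and the proportional sampling probabilities commonly used for prioritized experience replay. The abstract does not give the paper's schedules or hyperparameters, so the default values and the names `epsilon_at`, `epsilon_greedy` and `per_probabilities` are assumptions for illustration.

```python
import numpy as np

def epsilon_at(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Linearly anneal epsilon from eps_start to eps_end, then hold it."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def per_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Proportional prioritization: P(i) is proportional to (|delta_i| + eps) ** alpha.

    alpha = 0 recovers uniform replay; larger alpha focuses sampling on
    transitions with large temporal-difference error.
    """
    priorities = (np.abs(np.asarray(td_errors, dtype=float)) + eps) ** alpha
    return priorities / priorities.sum()
```

A replay buffer would then draw minibatch indices with `rng.choice(len(buffer), size=batch_size, p=per_probabilities(td_errors))`. Full prioritized replay also applies importance-sampling weights to correct the bias this introduces, which is omitted here.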

Behavior-Adaptive Q-Learning: A Unifying Framework for Offline-to-Online RL

行为自适应 Q-Learning：离线到在线 RL 的统一框架

Authors: Lipeng Zu, Hansong Zhou, Xiaonan Zhang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.03695
Pdf link: https://arxiv.org/pdf/2511.03695
Abstract Offline reinforcement learning (RL) enables training from fixed data without online interaction, but policies learned offline often struggle when deployed in dynamic environments due to distributional shift and unreliable value estimates on unseen state-action pairs. We introduce Behavior-Adaptive Q-Learning (BAQ), a framework designed to enable a smooth and reliable transition from offline to online RL. The key idea is to leverage an implicit behavioral model derived from offline data to provide a behavior-consistency signal during online fine-tuning. BAQ incorporates a dual-objective loss that (i) aligns the online policy toward the offline behavior when uncertainty is high, and (ii) gradually relaxes this constraint as more confident online experience is accumulated. This adaptive mechanism reduces error propagation from out-of-distribution estimates, stabilizes early online updates, and accelerates adaptation to new scenarios. Across standard benchmarks, BAQ consistently outperforms prior offline-to-online RL approaches, achieving faster recovery, improved robustness, and higher overall performance. Our results demonstrate that implicit behavior adaptation is a principled and practical solution for reliable real-world policy deployment.
中文摘要 离线强化学习（RL）可以在没有在线交互的情况下从固定数据进行训练，但由于分布偏移和对看不见的状态-动作对的不可靠值估计，离线学习的策略在动态环境中部署时往往会遇到困难。我们介绍了行为自适应 Q-Learning （BAQ），这是一个旨在实现从离线到在线 RL 的平稳可靠过渡的框架。关键思想是利用从离线数据中得出的隐式行为模型，在在线微调期间提供行为一致性信号。BAQ 包含双目标损失，即（i）在不确定性高时使在线政策与线下行为保持一致，以及（ii）随着更自信的在线经验的积累，逐渐放松这一约束。这种自适应机制减少了分布外估计的误差传播，稳定了早期在线更新，并加速了对新场景的适应。在标准基准测试中，BAQ 始终优于之前的离线到在线 RL 方法，实现更快的恢复速度、更高的稳健性和更高的整体性能。我们的结果表明，隐性行为适应是可靠的现实世界政策部署的原则性和实用性解决方案。

AnaFlow: Agentic LLM-based Workflow for Reasoning-Driven Explainable and Sample-Efficient Analog Circuit Sizing

AnaFlow：基于代理 LLM 的工作流程，用于推理驱动的可解释和样本效率模拟电路大小调整

Authors: Mohsen Ahmadzadeh, Kaichang Chen, Georges Gielen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Arxiv link: https://arxiv.org/abs/2511.03697
Pdf link: https://arxiv.org/pdf/2511.03697
Abstract Analog/mixed-signal circuits are key for interfacing electronics with the physical world. Their design, however, remains a largely handcrafted process, resulting in long and error-prone design cycles. While the recent rise of AI-based reinforcement learning and generative AI has created new techniques to automate this task, the need for many time-consuming simulations is a critical bottleneck hindering the overall efficiency. Furthermore, the lack of explainability of the resulting design solutions hampers widespread adoption of the tools. To address these issues, a novel agentic AI framework for sample-efficient and explainable analog circuit sizing is presented. It employs a multi-agent workflow where specialized Large Language Model (LLM)-based agents collaborate to interpret the circuit topology, to understand the design goals, and to iteratively refine the circuit's design parameters towards the target goals with human-interpretable reasoning. The adaptive simulation strategy creates an intelligent control that yields a high sample efficiency. The AnaFlow framework is demonstrated for two circuits of varying complexity and is able to complete the sizing task fully automatically, differently from pure Bayesian optimization and reinforcement learning approaches. The system learns from its optimization history to avoid past mistakes and to accelerate convergence. The inherent explainability makes this a powerful tool for analog design space exploration and a new paradigm in analog EDA, where AI agents serve as transparent design assistants.
中文摘要 模拟/混合信号电路是电子设备与物理世界连接的关键。然而，他们的设计仍然主要是手工制作的过程，导致设计周期漫长且容易出错。虽然最近基于人工智能的强化学习和生成式人工智能的兴起创造了自动化这项任务的新技术，但对许多耗时的模拟的需求是阻碍整体效率的关键瓶颈。此外，由此产生的设计解决方案缺乏可解释性阻碍了这些工具的广泛采用。为了解决这些问题，提出了一种用于采样高效且可解释的模拟电路尺寸的新型代理人工智能框架。它采用多智能体工作流程，其中基于大型语言模型（LLM）的专业智能体协作解释电路拓扑，理解设计目标，并通过人类可解释的推理迭代细化电路的设计参数，以实现目标。自适应仿真策略创建了一种智能控制，可产生高样品效率。AnaFlow 框架针对两个不同复杂程度的电路进行了演示，并且能够全自动完成大小调整任务，这与纯贝叶斯优化和强化学习方法不同。系统从其优化历史中学习，以避免过去的错误并加速收敛。固有的可解释性使其成为模拟设计空间探索的强大工具，也是模拟 EDA 的新范式，其中 AI 代理充当透明的设计助手。

Shrinking the Variance: Shrinkage Baselines for Reinforcement Learning with Verifiable Rewards

缩小方差：面向可验证奖励强化学习的收缩基线

Authors: Guanning Zeng, Zhaoyi Zhou, Daman Arora, Andrea Zanette
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2511.03710
Pdf link: https://arxiv.org/pdf/2511.03710
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for post-training large reasoning models (LRMs) using policy-gradient methods such as GRPO. To stabilize training, these methods typically center trajectory rewards by subtracting the empirical mean for each prompt. Statistically, this centering acts as a control variate (or baseline), reducing the variance of the policy-gradient estimator. Typically, the mean reward is estimated using per-prompt empirical averages for each prompt in a batch. Drawing inspiration from Stein's paradox, we propose using shrinkage estimators that combine per-prompt and across-prompt means to improve the overall per-prompt mean estimation accuracy -- particularly in the low-generation regime typical of RLVR. Theoretically, we construct a shrinkage-based baseline that provably yields lower-variance policy-gradient estimators across algorithms. Our proposed baseline serves as a drop-in replacement for existing per-prompt mean baselines, requiring no additional hyper-parameters or computation. Empirically, shrinkage baselines consistently outperform standard empirical-mean baselines, leading to lower-variance gradient updates and improved training stability.
中文摘要 具有可验证奖励的强化学习（RLVR）已成为使用 GRPO 等策略梯度方法对大型推理模型（LRM）进行后训练的强大范式。为了稳定训练，这些方法通常通过减去每个提示的经验均值来对轨迹奖励进行中心化。从统计上讲，这种中心化起到控制变量（或基线）的作用，降低了策略梯度估计器的方差。通常，平均奖励是用批次中每个提示各自的经验平均值来估计的。受 Stein 悖论启发，我们建议使用结合单提示均值和跨提示均值的收缩估计器，以提高整体的单提示均值估计精度，尤其是在 RLVR 中常见的每个提示生成数较少的情形下。在理论上，我们构建了一个基于收缩的基线，可以证明它在多种算法中都能得到方差更低的策略梯度估计器。我们提出的基线可直接替换现有的单提示均值基线，不需要额外的超参数或计算。在实验中，收缩基线始终优于标准的经验均值基线，带来方差更低的梯度更新和更好的训练稳定性。
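Editor's sketch: to make the idea of combining per-prompt and across-prompt means concrete, the code below applies the textbook positive-part James-Stein estimator, which shrinks each per-prompt mean toward the grand mean by a data-driven amount. It is not the paper's estimator, whose exact form the abstract does not give. The function name `shrinkage_baselines`, the pooled variance estimate, and the requirement of at least two generations per prompt are all assumptions made here.

```python
import numpy as np

def shrinkage_baselines(rewards):
    """Shrink per-prompt mean rewards toward the across-prompt mean.

    rewards: array of shape (num_prompts, generations_per_prompt),
             with at least two generations per prompt.
    Returns one baseline per prompt.
    """
    r = np.asarray(rewards, dtype=float)
    k, n = r.shape
    prompt_means = r.mean(axis=1)
    grand_mean = prompt_means.mean()
    # Pooled within-prompt variance divided by n: noise level of one prompt mean.
    var_of_mean = r.var(axis=1, ddof=1).mean() / n
    spread = np.sum((prompt_means - grand_mean) ** 2)
    # The classical estimator needs at least four groups; otherwise do not shrink.
    if k < 4 or spread == 0.0:
        return prompt_means
    # Positive-part James-Stein weight: 0 means full pooling, 1 means no shrinkage.
    weight = max(0.0, 1.0 - (k - 3) * var_of_mean / spread)
    return grand_mean + weight * (prompt_means - grand_mean)
```

The weight is computed from the batch itself, so no tuning parameter is introduced. With few generations per prompt the per-prompt means are noisy and the estimator pools more heavily; with many generations it stays close to the plain per-prompt mean.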

Outbidding and Outbluffing Elite Humans: Mastering Liar's Poker via Self-Play and Reinforcement Learning

在叫价与诈唬上胜过人类精英：通过自我对弈和强化学习掌握 Liar's Poker

Authors: Richard Dewey, Janos Botyanszki, Ciamac C. Moallemi, Andrew T. Zheng
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2511.03724
Pdf link: https://arxiv.org/pdf/2511.03724
Abstract AI researchers have long focused on poker-like games as a testbed for environments characterized by multi-player dynamics, imperfect information, and reasoning under uncertainty. While recent breakthroughs have matched elite human play at no-limit Texas hold'em, the multi-player dynamics are subdued: most hands converge quickly with only two players engaged through multiple rounds of bidding. In this paper, we present Solly, the first AI agent to achieve elite human play in reduced-format Liar's Poker, a game characterized by extensive multi-player engagement. We trained Solly using self-play with a model-free, actor-critic, deep reinforcement learning algorithm. Solly played at an elite human level as measured by win rate (won over 50% of hands) and equity (money won) in heads-up and multi-player Liar's Poker. Solly also outperformed large language models (LLMs), including those with reasoning abilities, on the same metrics. Solly developed novel bidding strategies, randomized play effectively, and was not easily exploitable by world-class human players.
中文摘要 人工智能研究人员长期以来一直专注于类似扑克的游戏，将其作为以多人动态、不完美的信息和不确定性下的推理为特征的环境的测试平台。虽然最近的突破已经与无限注德州扑克的精英人类游戏相媲美，但多人游戏的动态却很低：大多数手牌很快就收敛了，只有两名玩家参与了多轮叫牌。在本文中，我们介绍了 Solly，这是第一个在缩减格式的 Liar's Poker 中实现精英人类游戏的 AI 代理，该游戏的特点是广泛的多人参与。我们使用无模型、演员批评、深度强化学习算法的自玩来训练 Solly。索利在单挑和多人骗子扑克中以胜率（赢得超过 50% 的手牌）和净值（赢得的钱）来衡量，达到了精英人类水平。在相同的指标上，Solly 的表现也优于大型语言模型（LLM），包括那些具有推理能力的语言模型。Solly 开发了新颖的竞价策略，有效地随机化游戏，并且不容易被世界级的人类玩家利用。

Keyword: diffusion policy

There is no result
