Arxiv Papers of Today

Generated: 2026-01-05 16:36:36 (UTC+8); Arxiv release time: 2026-01-05 20:00 EST (2026-01-06 09:00 UTC+8)

21 relevant papers today

Keyword: reinforcement learning

Reinforcement learning with timed constraints for robotics motion planning

Authors: Zhaoan Wang, Junchao Li, Mahdi Mohammad, Shaoping Xiao
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.00087
Pdf link: https://arxiv.org/pdf/2601.00087
Abstract Robotic systems operating in dynamic and uncertain environments increasingly require planners that satisfy complex task sequences while adhering to strict temporal constraints. Metric Interval Temporal Logic (MITL) offers a formal and expressive framework for specifying such time-bounded requirements; however, integrating MITL with reinforcement learning (RL) remains challenging due to stochastic dynamics and partial observability. This paper presents a unified automata-based RL framework for synthesizing policies in both Markov Decision Processes (MDPs) and Partially Observable Markov Decision Processes (POMDPs) under MITL specifications. MITL formulas are translated into Timed Limit-Deterministic Generalized Büchi Automata (Timed-LDGBA) and synchronized with the underlying decision process to construct product timed models suitable for Q-learning. A simple yet expressive reward structure enforces temporal correctness while allowing additional performance objectives. The approach is validated in three simulation studies: a $5 \times 5$ grid-world formulated as an MDP, a $10 \times 10$ grid-world formulated as a POMDP, and an office-like service-robot scenario. Results demonstrate that the proposed framework consistently learns policies that satisfy strict time-bounded requirements under stochastic transitions, scales to larger state spaces, and remains effective in partially observable environments, highlighting its potential for reliable robotic planning in time-critical and uncertain settings.
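
The product construction described in the abstract can be illustrated with a deliberately tiny sketch. It does not build a Timed-LDGBA. It assumes a hypothetical 1-D corridor, a single clock, and the timed requirement "reach the goal within `DEADLINE` steps", and runs tabular Q-learning on the product state `(cell, clock)`. All names and constants are illustrative assumptions, not taken from the paper.

```python
import random

# Hypothetical toy task: cells 0..4, goal at cell 4,
# timed requirement "reach the goal within DEADLINE steps".
N_CELLS, GOAL, DEADLINE = 5, 4, 6
ACTIONS = (-1, +1)  # move left / move right
GAMMA, ALPHA = 0.9, 0.5


def step(pos, clock, action):
    """Advance the product state (cell, clock) by one time unit."""
    next_pos = min(max(pos + action, 0), N_CELLS - 1)
    next_clock = clock + 1
    if next_pos == GOAL:        # goal reached within the deadline: reward 1
        return (next_pos, next_clock), 1.0, True
    if next_clock >= DEADLINE:  # clock expired: timed requirement violated
        return (next_pos, next_clock), 0.0, True
    return (next_pos, next_clock), 0.0, False


def train(episodes=5000, seed=0):
    rng = random.Random(seed)
    q = {}  # maps (cell, clock) -> [Q(left), Q(right)]
    for _ in range(episodes):
        state, done = (0, 0), False
        while not done:
            # Uniform random behaviour; Q-learning is off-policy.
            a = rng.randrange(len(ACTIONS))
            nxt, reward, done = step(state[0], state[1], ACTIONS[a])
            best_next = 0.0 if done else max(q.setdefault(nxt, [0.0, 0.0]))
            qs = q.setdefault(state, [0.0, 0.0])
            qs[a] += ALPHA * (reward + GAMMA * best_next - qs[a])
            state = nxt
    return q


def greedy_steps(q):
    """Roll out the greedy policy; return steps used, or None on a deadline miss."""
    state, done, reward = (0, 0), False, 0.0
    while not done:
        values = q.get(state, [0.0, 0.0])
        a = values.index(max(values))
        state, reward, done = step(state[0], state[1], ACTIONS[a])
    return state[1] if reward > 0 else None
```

Because the clock is part of the state, the same cell can have different values at different times, which is what lets a tabular learner respect a deadline.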

Universal Adaptive Constraint Propagation: Scaling Structured Inference for Large Language Models via Meta-Reinforcement Learning

Authors: Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.00095
Pdf link: https://arxiv.org/pdf/2601.00095
Abstract Large language models increasingly require structured inference, from JSON schema enforcement to multi-lingual parsing, where outputs must satisfy complex constraints. We introduce MetaJuLS, a meta-reinforcement learning approach that learns universal constraint propagation policies applicable across languages and tasks without task-specific retraining. By formulating structured inference as adaptive constraint propagation and training a Graph Attention Network with meta-learning, MetaJuLS achieves 1.5-2.0× speedups over GPU-optimized baselines while staying within 0.2% of the accuracy of state-of-the-art parsers. On Universal Dependencies across 10 languages and LLM-constrained generation (LogicBench, GSM8K-Constrained), MetaJuLS demonstrates rapid cross-domain adaptation: a policy trained on English parsing adapts to new languages and tasks with 5-10 gradient steps (5-15 seconds) rather than requiring hours of task-specific training. Mechanistic analysis reveals the policy discovers human-like parsing strategies (easy-first) and novel non-intuitive heuristics. By reducing propagation steps in LLM deployments, MetaJuLS contributes to Green AI by directly reducing inference carbon footprint.

GRL-SNAM: Geometric Reinforcement Learning with Path Differential Hamiltonians for Simultaneous Navigation and Mapping in Unknown Environments

Authors: Aditya Sai Ellendula, Yi Wang, Minh Nguyen, Chandrajit Bajaj
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.00116
Pdf link: https://arxiv.org/pdf/2601.00116
Abstract We present GRL-SNAM, a geometric reinforcement learning framework for Simultaneous Navigation and Mapping (SNAM) in unknown environments. A SNAM problem is challenging because it requires designing hierarchical or joint policies for multiple agents that control the movement of a real-life robot towards the goal in a mapless environment, i.e., an environment whose map is not available a priori and must be acquired through sensors. The sensors are invoked by the path learner, i.e., the navigator, through active query responses to sensory agents and along the motion path. GRL-SNAM differs from preemptive navigation algorithms and other reinforcement learning methods by relying exclusively on local sensory observations without constructing a global map. Our approach formulates path navigation and mapping as a dynamic shortest path search and discovery process using controlled Hamiltonian optimization: sensory inputs are translated into local energy landscapes that encode reachability, obstacle barriers, and deformation constraints, while policies for sensing, planning, and reconfiguration evolve stagewise via updating Hamiltonians. A reduced Hamiltonian serves as an adaptive score function, updating kinetic/potential terms, embedding barrier constraints, and continuously refining trajectories as new local information arrives. We evaluate GRL-SNAM on two different 2D navigation tasks. Compared against local reactive baselines and global policy learning references under identical stagewise sensing constraints, it preserves clearance, generalizes to unseen layouts, and demonstrates that geometric RL via updating Hamiltonians enables high-quality navigation with minimal exploration, through local energy refinement rather than extensive global mapping. The code is publicly available on GitHub.

Reinforcement Learning with Function Approximation for Non-Markov Processes

Authors: Ali Devran Kara
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2601.00151
Pdf link: https://arxiv.org/pdf/2601.00151
Abstract We study reinforcement learning methods with linear function approximation under non-Markov state and cost processes. We first consider the policy evaluation method and show that the algorithm converges under suitable ergodicity conditions on the underlying non-Markov processes. Furthermore, we show that the limit corresponds to the fixed point of a joint operator composed of an orthogonal projection and the Bellman operator of an auxiliary Markov decision process. For Q-learning with linear function approximation, as in the Markov setting, convergence is not guaranteed in general. We show, however, that for the special case where the basis functions are chosen based on quantization maps, the convergence can be shown under similar ergodicity conditions. Finally, we apply our results to partially observed Markov decision processes, where finite-memory variables are used as state representations, and we derive explicit error bounds for the limits of the resulting learning algorithms.
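
To make "basis functions chosen based on quantization maps" concrete, here is a minimal sketch of one TD(0) policy-evaluation step with one-hot quantization features, under which a linear estimate reduces to a table over cells. The scalar state space, the equal-width bins, the step size, and the discount factor `beta` are all illustrative assumptions. The abstract's quantization result concerns Q-learning, while this sketch shows the simpler evaluation update.

```python
def quantize(x, n_bins, low=0.0, high=1.0):
    """Map a scalar in [low, high] to one of n_bins equal-width cells."""
    idx = int((x - low) / (high - low) * n_bins)
    return min(max(idx, 0), n_bins - 1)


def features(x, n_bins):
    """One-hot basis: phi_i(x) = 1 if x falls in cell i, else 0."""
    phi = [0.0] * n_bins
    phi[quantize(x, n_bins)] = 1.0
    return phi


def td0_update(theta, x, cost, x_next, alpha=0.1, beta=0.9):
    """One TD(0) step for a linear value estimate V(x) = theta . phi(x)."""
    n = len(theta)
    phi, phi_next = features(x, n), features(x_next, n)
    v = sum(t * p for t, p in zip(theta, phi))
    v_next = sum(t * p for t, p in zip(theta, phi_next))
    delta = cost + beta * v_next - v  # temporal-difference error
    return [t + alpha * delta * p for t, p in zip(theta, phi)]
```

With one-hot features only the weight of the visited cell changes on each step, so the update behaves like a tabular method on the quantized state.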

Online Finetuning Decision Transformers with Pure RL Gradients

Authors: Junkai Luo, Yinglun Zhu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.00167
Pdf link: https://arxiv.org/pdf/2601.00167
Abstract Decision Transformers (DTs) have emerged as a powerful framework for sequential decision making by formulating offline reinforcement learning (RL) as a sequence modeling problem. However, extending DTs to online settings with pure RL gradients remains largely unexplored, as existing approaches continue to rely heavily on supervised sequence-modeling objectives during online finetuning. We identify hindsight return relabeling -- a standard component in online DTs -- as a critical obstacle to RL-based finetuning: while beneficial for supervised learning, it is fundamentally incompatible with importance sampling-based RL algorithms such as GRPO, leading to unstable training. Building on this insight, we propose new algorithms that enable online finetuning of Decision Transformers using pure reinforcement learning gradients. We adapt GRPO to DTs and introduce several key modifications, including sub-trajectory optimization for improved credit assignment, sequence-level likelihood objectives for enhanced stability and efficiency, and active sampling to encourage exploration in uncertain regions. Through extensive experiments, we demonstrate that our methods outperform existing online DT baselines and achieve new state-of-the-art performance across multiple benchmarks, highlighting the effectiveness of pure-RL-based online finetuning for Decision Transformers.
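
As a rough illustration of hindsight return relabeling, the sketch below recomputes return-to-go values from the rewards actually collected and overwrites the returns the policy was conditioned on. The trajectory layout (a list of dicts with `target_rtg` and `reward` keys) is a hypothetical format chosen for this example, not the paper's data structure.

```python
def returns_to_go(rewards):
    """Return-to-go at each step: the sum of rewards from that step onward."""
    rtg, running = [], 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    return rtg[::-1]


def hindsight_relabel(trajectory):
    """Replace each step's conditioning return with the return actually achieved.

    `trajectory` is a hypothetical list of dicts with keys
    "target_rtg" (what the policy was conditioned on) and "reward".
    """
    achieved = returns_to_go([step["reward"] for step in trajectory])
    return [{**step, "target_rtg": g} for step, g in zip(trajectory, achieved)]
```

One plausible reading of the incompatibility the abstract reports is that relabeling changes the policy's conditioning input after the actions were sampled, so probability ratios computed on the relabeled data no longer refer to the distribution that generated them.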

Reinforcement-Learned Unequal Error Protection for Quantized Semantic Embeddings

Authors: Moirangthem Tiken Singh, Adnan Arif
Subjects: Machine Learning (cs.LG); Information Retrieval (cs.IR); Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2601.00186
Pdf link: https://arxiv.org/pdf/2601.00186
Abstract This paper tackles the pressing challenge of preserving semantic meaning in communication systems constrained by limited bandwidth. We introduce a novel reinforcement learning framework that achieves per-dimension unequal error protection via adaptive repetition coding. Central to our approach is a composite semantic distortion metric that balances global embedding similarity with entity-level preservation, empowering the reinforcement learning agent to allocate protection in a context-aware manner. Experiments show statistically significant gains over uniform protection, achieving 6.8% higher chrF scores and 9.3% better entity preservation at 1 dB SNR. The key innovation of our framework is the demonstration that simple, intelligently allocated repetition coding enables fine-grained semantic protection -- an advantage unattainable with conventional codes such as LDPC or Reed-Solomon. Our findings challenge traditional channel coding paradigms by establishing that code structure must align with semantic granularity. This approach is particularly suited to edge computing and IoT scenarios, where bandwidth is scarce, but semantic fidelity is critical, providing a practical pathway for next-generation semantic-aware networks.
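
The idea of per-dimension unequal error protection with repetition coding can be sketched in a few lines. The example assumes one bit per dimension and hand-picked repetition counts. In the paper the counts are allocated by a reinforcement learning agent, which is not modeled here.

```python
def encode(bits, repeats):
    """Repeat bit i exactly repeats[i] times (odd counts avoid voting ties)."""
    return [[b] * n for b, n in zip(bits, repeats)]


def decode(blocks):
    """Majority-vote each block back to a single bit."""
    return [1 if 2 * sum(block) > len(block) else 0 for block in blocks]


# Usage: the first dimension is treated as most important and gets 5 copies,
# the second gets no protection, the third gets 3 copies.
blocks = encode([1, 0, 1], [5, 1, 3])
blocks[0][0] ^= 1  # two channel errors hit the heavily protected bit
blocks[0][1] ^= 1
blocks[2][0] ^= 1  # one channel error hits the moderately protected bit
recovered = decode(blocks)
```

A block of `n` copies tolerates up to `(n - 1) // 2` flipped copies, so spending more repetitions on a dimension directly buys it more protection at the cost of bandwidth.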

From Sight to Insight: Improving Visual Reasoning Capabilities of Multimodal Models via Reinforcement Learning

Authors: Omar Sharif, Eftekhar Hossain, Patrick Ng
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.00215
Pdf link: https://arxiv.org/pdf/2601.00215
Abstract Reinforcement learning (RL) has emerged as a promising approach for eliciting reasoning chains before generating final answers. However, multimodal large language models (MLLMs) generate reasoning that lacks integration of visual information. This limits their ability to solve problems that demand accurate visual perception, such as visual puzzles. We show that visual perception is the key bottleneck in such tasks: converting images into textual descriptions significantly improves performance, yielding gains of 26.7% for Claude 3.5 and 23.6% for Claude 3.7. To address this, we investigate reward-driven RL as a mechanism to unlock long visual reasoning in open-source MLLMs without requiring costly supervision. We design and evaluate six reward functions targeting different reasoning aspects, including image understanding, thinking steps, and answer accuracy. Using group relative policy optimization (GRPO), our approach explicitly incentivizes longer, structured reasoning and mitigates bypassing of visual information. Experiments on Qwen-2.5-VL-7B achieve 5.56% improvements over the base model, with consistent gains across both in-domain and out-of-domain settings.
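
The group-relative part of GRPO can be sketched as standardizing the rewards of several responses sampled for the same prompt. This is a minimal sketch: it uses the population standard deviation and a small `eps`, both of which are assumptions, since implementations differ on these details. The paper's six reward functions are not reproduced.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For example, rewards `[1.0, 0.0, 0.0, 1.0]` give advantages close to `[1, -1, -1, 1]`, and a group in which every response earns the same reward gives all-zero advantages, so such a group contributes no learning signal.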

Modern Neuromorphic AI: From Intra-Token to Inter-Token Processing

Authors: Osvaldo Simeone
Subjects: Neural and Evolutionary Computing (cs.NE); Information Theory (cs.IT); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.00245
Pdf link: https://arxiv.org/pdf/2601.00245
Abstract The rapid growth of artificial intelligence (AI) has brought novel data processing and generative capabilities but also escalating energy requirements. This challenge motivates renewed interest in neuromorphic computing principles, which promise brain-like efficiency through discrete and sparse activations, recurrent dynamics, and non-linear feedback. In fact, modern AI architectures increasingly embody neuromorphic principles through heavily quantized activations, state-space dynamics, and sparse attention mechanisms. This paper elaborates on the connections between neuromorphic models, state-space models, and transformer architectures through the lens of the distinction between intra-token processing and inter-token processing. Most early work on neuromorphic AI was based on spiking neural networks (SNNs) for intra-token processing, i.e., for transformations involving multiple channels, or features, of the same vector input, such as the pixels of an image. In contrast, more recent research has explored how neuromorphic principles can be leveraged to design efficient inter-token processing methods, which selectively combine different information elements depending on their contextual relevance. Implementing associative memorization mechanisms, these approaches leverage state-space dynamics or sparse self-attention. Along with a systematic presentation of modern neuromorphic AI models through the lens of intra-token and inter-token processing, training methodologies for neuromorphic AI models are also reviewed. These range from surrogate gradients leveraging parallel convolutional processing to local learning rules based on reinforcement learning mechanisms.

Next Generation Intelligent Low-Altitude Economy Deployments: The O-RAN Perspective

Authors: Aly Sabri Abdalla, Vuk Marojevic
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA); Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2601.00257
Pdf link: https://arxiv.org/pdf/2601.00257
Abstract Despite the growing interest in low-altitude economy (LAE) applications, including UAV-based logistics and emergency response, fundamental challenges remain in orchestrating such missions over complex, signal-constrained environments. These include the absence of real-time, resilient, and context-aware orchestration of aerial nodes with limited integration of artificial intelligence (AI) specialized for LAE missions. This paper introduces an open radio access network (O-RAN)-enabled LAE framework that leverages seamless coordination between the disaggregated RAN architecture, open interfaces, and RAN intelligent controllers (RICs) to facilitate closed-loop, AI-optimized, and mission-critical LAE operations. We evaluate the feasibility and performance of the proposed architecture via a semantic-aware rApp that acts as a terrain interpreter, offering semantic guidance to a reinforcement learning-enabled xApp, which performs real-time trajectory planning for LAE swarm nodes. We survey the capabilities of UAV testbeds that can be leveraged for LAE research, and present critical research challenges and standardization needs.

Can Optimal Transport Improve Federated Inverse Reinforcement Learning?

Authors: David Millard, Ali Baheri
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2601.00309
Pdf link: https://arxiv.org/pdf/2601.00309
Abstract In robotics and multi-agent systems, fleets of autonomous agents often operate in subtly different environments while pursuing a common high-level objective. Directly pooling their data to learn a shared reward function is typically impractical due to differences in dynamics, privacy constraints, and limited communication bandwidth. This paper introduces an optimal transport-based approach to federated inverse reinforcement learning (IRL). Each client first performs lightweight Maximum Entropy IRL locally, adhering to its computational and privacy limitations. The resulting reward functions are then fused via a Wasserstein barycenter, which considers their underlying geometric structure. We further prove that this barycentric fusion yields a more faithful global reward estimate than conventional parameter averaging methods in federated learning. Overall, this work provides a principled and communication-efficient framework for deriving a shared reward that generalizes across heterogeneous agents and environments.
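
The abstract does not say how the barycenter is computed. As a hedged illustration of the underlying idea, the sketch below handles only the simplest case: scalar reward samples from each client with equal sample counts. In one dimension the Wasserstein-2 barycenter is obtained by averaging quantile functions, which for equal-size samples means averaging the sorted values position by position. The function name and input layout are assumptions for this example.

```python
def barycenter_1d(samples_per_client, weights=None):
    """W2 barycenter of equal-size 1-D empirical distributions via quantile averaging."""
    k = len(samples_per_client)
    weights = weights or [1.0 / k] * k
    sorted_samples = [sorted(s) for s in samples_per_client]
    n = len(sorted_samples[0])
    if any(len(s) != n for s in sorted_samples):
        raise ValueError("this sketch assumes equal sample counts per client")
    # Average the i-th smallest value across clients for every rank i.
    return [sum(w * s[i] for w, s in zip(weights, sorted_samples)) for i in range(n)]
```

For two clients with samples `[0, 1, 2]` and `[12, 10, 11]` the result is `[5, 6, 7]`: the barycenter sits between the two clients and keeps their shared spread, whereas simply pooling the samples would give a two-cluster mixture.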

Offline Multi-Agent Reinforcement Learning for 6G Communications: Fundamentals, Applications and Future Directions

Authors: Eslam Eldeeb, Hirley Alves
Subjects: Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2601.00321
Pdf link: https://arxiv.org/pdf/2601.00321
Abstract The next-generation wireless technologies, including beyond 5G and 6G networks, are paving the way for transformative applications such as vehicle platooning, smart cities, and remote surgery. These innovations are driven by a vast array of interconnected wireless entities, including IoT devices, access points, UAVs, and CAVs, which increase network complexity and demand more advanced decision-making algorithms. Artificial intelligence (AI) and machine learning (ML), especially reinforcement learning (RL), are key enablers for such networks, providing solutions to high-dimensional and complex challenges. However, as networks expand to multi-agent environments, traditional online RL approaches face cost, safety, and scalability limitations. Offline multi-agent reinforcement learning (MARL) offers a promising solution by utilizing pre-collected data, reducing the need for real-time interaction. This article introduces a novel offline MARL algorithm based on conservative Q-learning (CQL), ensuring safe and efficient training. We extend this with meta-learning to address dynamic environments and validate the approach through use cases in radio resource management and UAV networks. Our work highlights offline MARL's advantages, limitations, and future directions in wireless applications.
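
For readers unfamiliar with conservative Q-learning, the sketch below shows the commonly used single-agent regularizer for one state with discrete actions: a log-sum-exp over all action values minus the value of the action logged in the dataset. It is an assumption that the article's multi-agent algorithm uses a term of this shape. The abstract does not give the formula.

```python
import math


def cql_penalty(q_values, dataset_action):
    """log-sum-exp over all actions minus the Q-value of the logged action."""
    m = max(q_values)  # subtract the max for numerical stability
    logsumexp = m + math.log(sum(math.exp(q - m) for q in q_values))
    return logsumexp - q_values[dataset_action]
```

The penalty is small when the logged action already has the highest value and large when some action absent from the data is valued far above it, which is how the term discourages overestimating out-of-distribution actions.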

Vision-Language Reasoning for Geolocalization: A Reinforcement Learning Approach

Authors: Biao Wu, Meng Fang, Ling Chen, Ke Xu, Tao Cheng, Jun Wang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.00388
Pdf link: https://arxiv.org/pdf/2601.00388
Abstract Recent advances in vision-language models have opened up new possibilities for reasoning-driven image geolocalization. However, existing approaches often rely on synthetic reasoning annotations or external image retrieval, which can limit interpretability and generalizability. In this paper, we present Geo-R, a retrieval-free framework that uncovers structured reasoning paths from existing ground-truth coordinates and optimizes geolocation accuracy via reinforcement learning. We propose the Chain of Region, a rule-based hierarchical reasoning paradigm that generates precise, interpretable supervision by mapping GPS coordinates to geographic entities (e.g., country, province, city) without relying on model-generated or synthetic labels. Building on this, we introduce a lightweight reinforcement learning strategy with coordinate-aligned rewards based on Haversine distance, enabling the model to refine predictions through spatially meaningful feedback. Our approach bridges structured geographic reasoning with direct spatial supervision, yielding improved localization accuracy, stronger generalization, and more transparent inference. Experimental results across multiple benchmarks confirm the effectiveness of Geo-R, establishing a new retrieval-free paradigm for scalable and interpretable image geolocalization. To facilitate further research and ensure reproducibility, both the model and code will be made publicly available.
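
The Haversine distance mentioned in the abstract is a standard formula and can be written down directly. The mapping from distance to reward is not given in the abstract, so `distance_reward` and its `scale_km` parameter are purely hypothetical choices for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against rounding slightly above the asin domain.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def distance_reward(pred, truth, scale_km=1000.0):
    """Hypothetical reward in (0, 1]: 1 for an exact hit, decaying with distance."""
    return math.exp(-haversine_km(*pred, *truth) / scale_km)
```

A smooth, distance-based reward of this kind gives partial credit to a prediction in the right region, unlike an exact-match reward on country or city labels.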

E-GRPO: High Entropy Steps Drive Effective Reinforcement Learning for Flow Models

Authors: Shengjun Zhang, Zhang Zhang, Chensheng Dai, Yueqi Duan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.00423
Pdf link: https://arxiv.org/pdf/2601.00423
Abstract Recent reinforcement learning methods have improved the alignment of flow matching models with human preferences. While stochastic sampling enables the exploration of denoising directions, existing methods that optimize over multiple denoising steps suffer from sparse and ambiguous reward signals. We observe that high-entropy steps enable more efficient and effective exploration, while low-entropy steps result in barely distinguishable roll-outs. To this end, we propose E-GRPO, an entropy-aware Group Relative Policy Optimization that increases the entropy of SDE sampling steps. Since the integration of stochastic differential equations suffers from ambiguous reward signals due to stochasticity from multiple steps, we merge consecutive low-entropy steps into one high-entropy step for SDE sampling, while applying ODE sampling on the other steps. Building on this, we introduce a multi-step group-normalized advantage, which computes group-relative advantages within samples sharing the same consolidated SDE denoising step. Experimental results under different reward settings demonstrate the effectiveness of our method.

CPPO: Contrastive Perception for Vision Language Policy Optimization

Authors: Ahmad Rezaei, Mohsen Gholami, Saeed Ranjbar Alvar, Kevin Cannons, Mohammad Asiful Hossain, Zhou Weimin, Shunbo Zhou, Yong Zhang, Mohammad Akbari
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.00501
Pdf link: https://arxiv.org/pdf/2601.00501
Abstract We introduce CPPO, a Contrastive Perception Policy Optimization method for finetuning vision-language models (VLMs). While reinforcement learning (RL) has advanced reasoning in language models, extending it to multimodal reasoning requires improving both the perception and reasoning aspects. Prior works tackle this challenge mainly with explicit perception rewards, but disentangling perception tokens from reasoning tokens is difficult: it requires extra LLMs, ground-truth data, forced separation of perception from reasoning by the policy model, or indiscriminate application of rewards to all output tokens. CPPO addresses this problem by detecting perception tokens via entropy shifts in the model outputs under perturbed input images. CPPO then extends the RL objective function with a Contrastive Perception Loss (CPL) that enforces consistency under information-preserving perturbations and sensitivity under information-removing ones. Experiments show that CPPO surpasses previous perception-rewarding methods, while avoiding extra models, making training more efficient and scalable.
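
One way to picture "detecting perception tokens via entropy shifts" is to compare, position by position, the entropy of the model's next-token distribution with the clean image against the entropy with a perturbed image. The sketch below is only that picture: the fixed `threshold`, the use of an entropy increase rather than an absolute change, and the list-of-distributions input format are all assumptions, not details from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def perception_token_mask(clean_dists, perturbed_dists, threshold=0.5):
    """Flag positions whose entropy rises by more than `threshold` under perturbation."""
    return [
        entropy(pert) - entropy(clean) > threshold
        for clean, pert in zip(clean_dists, perturbed_dists)
    ]
```

A token the model predicts confidently with the clean image but becomes unsure about once the image is degraded is likely to depend on visual content, which is the intuition this mask tries to capture.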

Traffic-Aware Optimal Taxi Placement Using Graph Neural Network-Based Reinforcement Learning

Authors: Sonia Khetarpaul, P Y Sharan
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.00607
Pdf link: https://arxiv.org/pdf/2601.00607
Abstract In the context of smart city transportation, efficient matching of taxi supply with passenger demand requires real-time integration of urban traffic network data and mobility patterns. Conventional taxi hotspot prediction models often rely solely on historical demand, overlooking dynamic influences such as traffic congestion, road incidents, and public events. This paper presents a traffic-aware, graph-based reinforcement learning (RL) framework for optimal taxi placement in metropolitan environments. The urban road network is modeled as a graph where intersections represent nodes, road segments serve as edges, and node attributes capture historical demand, event proximity, and real-time congestion scores obtained from live traffic APIs. Graph Neural Network (GNN) embeddings are employed to encode spatial-temporal dependencies within the traffic network, which are then used by a Q-learning agent to recommend optimal taxi hotspots. The reward mechanism jointly optimizes passenger waiting time, driver travel distance, and congestion avoidance. Experiments on a simulated Delhi taxi dataset, generated using real geospatial boundaries and historic ride-hailing request patterns, demonstrate that the proposed model reduced passenger waiting time by about 56% and reduced travel distance by 38% compared to baseline stochastic selection. The proposed approach is adaptable to multi-modal transport systems and can be integrated into smart city platforms for real-time urban mobility optimization.

Vision-based Goal-Reaching Control for Mobile Robots Using a Hierarchical Learning Framework

Authors: Mehdi Heydari Shahna, Pauli Mustalahti, Jouni Mattila
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.00610
Pdf link: https://arxiv.org/pdf/2601.00610
Abstract Reinforcement learning (RL) is effective in many robotic applications, but it requires extensive exploration of the state-action space, during which behaviors can be unsafe. This significantly limits its applicability to large robots with complex actuators operating on unstable terrain. Hence, to design a safe goal-reaching control framework for large-scale robots, this paper decomposes the whole system into a set of tightly coupled functional modules. 1) A real-time visual pose estimation approach is employed to provide accurate robot states to 2) an RL motion planner for goal-reaching tasks that explicitly respects robot specifications. The RL module generates real-time smooth motion commands for the actuator system, independent of its underlying dynamic complexity. 3) In the actuation mechanism, a supervised deep learning model is trained to capture the complex dynamics of the robot and provide this model to 4) a model-based robust adaptive controller that guarantees the wheels track the RL motion commands even on slip-prone terrain. 5) Finally, to reduce human intervention, a mathematical safety supervisor monitors the robot, stops it on unsafe faults, and autonomously guides it back to a safe inspection area. The proposed framework guarantees uniform exponential stability of the actuation system and safety of the whole operation. Experiments on a 6,000 kg robot in different scenarios confirm the effectiveness of the proposed framework.
中文摘要 强化学习（RL）在许多机器人应用中都很有效，但需要对状态-动作空间进行大量探索，而探索过程中的行为可能不安全。这极大限制了其在不稳定地形上运行、具有复杂执行器的大型机器人上的适用性。因此，为了给大型机器人设计一个安全的目标到达控制框架，本文将整个系统分解为一组紧耦合的功能模块。1）采用实时视觉位姿估计方法，为2）面向目标到达任务、明确遵循机器人规格的强化学习运动规划器提供准确的机器人状态。强化学习模块为执行器系统生成实时、平滑的运动指令，且不依赖于其底层动力学的复杂性。3）在执行机构中，训练一个监督式深度学习模型来捕捉机器人的复杂动力学，并将该模型提供给4）基于模型的鲁棒自适应控制器，保证车轮即使在易打滑的地形上也能跟踪强化学习的运动指令。5）最后，为了减少人工干预，一个数学安全监督器会监控机器人，在出现不安全故障时使其停止，并自主引导其返回安全检查区。该框架保证了执行系统的一致指数稳定性以及整个操作的安全性。在不同场景下对一台6000公斤机器人进行的实验证实了该框架的有效性。

RoboReward: General-Purpose Vision-Language Reward Models for Robotics

RoboReward：机器人通用视觉语言奖励模型

Authors: Tony Lee, Andrew Wagenmaker, Karl Pertsch, Percy Liang, Sergey Levine, Chelsea Finn
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.00675
Pdf link: https://arxiv.org/pdf/2601.00675
Abstract A well-designed reward is critical for effective reinforcement learning-based policy improvement. In real-world robotic domains, obtaining such rewards typically requires either labor-intensive human labeling or brittle, handcrafted objectives. Vision-language models (VLMs) have shown promise as automatic reward models, yet their effectiveness on real robot tasks is poorly understood. In this work, we aim to close this gap by introducing (1) RoboReward, a robotics reward dataset and benchmark built on large-scale real-robot corpora from Open X-Embodiment (OXE) and RoboArena, and (2) vision-language reward models trained on this dataset (RoboReward 4B/8B). Because OXE is success-heavy and lacks failure examples, we propose a negative examples data augmentation pipeline that generates calibrated negatives and near-misses via counterfactual relabeling of successful episodes and temporal clipping to create partial-progress outcomes from the same videos. Using this framework, we produce an extensive training and evaluation dataset that spans diverse tasks and embodiments and enables systematic evaluation of whether state-of-the-art VLMs can reliably provide rewards for robotics. Our evaluation of leading open-weight and proprietary VLMs reveals that no model excels across all tasks, underscoring substantial room for improvement. We then train general-purpose 4B- and 8B-parameter models that outperform much larger VLMs in assigning rewards for short-horizon robotic tasks. Finally, we deploy the 8B-parameter reward VLM in real-robot reinforcement learning and find that it improves policy learning over Gemini Robotics-ER 1.5, a frontier physical reasoning VLM trained on robotics data, by a large margin, while substantially narrowing the gap to RL training with human-provided rewards.
中文摘要 设计良好的奖励对于基于强化学习的有效策略改进至关重要。在真实机器人领域，获得此类奖励通常需要耗费大量人力的人工标注，或脆弱的手工设计目标。视觉语言模型（VLM）作为自动奖励模型展现出潜力，但其在真实机器人任务中的有效性仍缺乏充分认识。在本研究中，我们旨在弥合这一差距，为此引入（1）RoboReward，一个基于Open X-Embodiment（OXE）和RoboArena的大规模真实机器人语料库构建的机器人奖励数据集和基准，以及（2）在该数据集上训练的视觉语言奖励模型（RoboReward 4B/8B）。由于OXE以成功样本为主、缺乏失败样本，我们提出了一个负样本数据增强流水线，通过对成功回合进行反事实重标注以及时间裁剪，生成经过校准的负样本和接近成功的样本，从而从相同视频中构造出部分完成的结果。利用该框架，我们构建了一个涵盖多样任务和机器人形态的大规模训练与评估数据集，使得能够系统评估最先进的VLM是否能可靠地为机器人提供奖励。我们对领先的开放权重和专有VLM的评估显示，没有任何模型在所有任务上都表现出色，这凸显出仍有很大的改进空间。随后，我们训练了通用的4B和8B参数模型，它们在为短时域机器人任务分配奖励方面优于规模大得多的VLM。最后，我们将8B参数的奖励VLM部署到真实机器人强化学习中，发现它在策略学习上大幅优于Gemini Robotics-ER 1.5（一款在机器人数据上训练的前沿物理推理VLM），同时显著缩小了与使用人工提供奖励的强化学习训练之间的差距。

IRPO: Scaling the Bradley-Terry Model via Reinforcement Learning

IRPO：通过强化学习扩展布拉德利-特里模型

Authors: Haonan Song, Qingchen Xie, Huan Zhu, Feng Xiao, Luxi Xing, Fuzhen Li, Liu Kang, Feng Jiang, Zhiyong Zheng, Fan Yang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.00677
Pdf link: https://arxiv.org/pdf/2601.00677
Abstract Generative Reward Models (GRMs) have attracted considerable research interest in reward modeling due to their interpretability, inference-time scalability, and potential for refinement through reinforcement learning (RL). However, widely used pairwise GRMs create a computational bottleneck when integrated with RL algorithms such as Group Relative Policy Optimization (GRPO). This bottleneck arises from two factors: (i) the O(n^2) time complexity of pairwise comparisons required to obtain relative scores, and (ii) the computational overhead of repeated sampling or additional chain-of-thought (CoT) reasoning to improve performance. To address the first factor, we propose Intergroup Relative Preference Optimization (IRPO), a novel RL framework that incorporates the well-established Bradley-Terry model into GRPO. By generating a pointwise score for each response, IRPO enables efficient evaluation of arbitrarily many candidates during RL training while preserving interpretability and fine-grained reward signals. Experimental results demonstrate that IRPO achieves state-of-the-art (SOTA) performance among pointwise GRMs across multiple benchmarks, with performance comparable to that of current leading pairwise GRMs. Furthermore, we show that IRPO significantly outperforms pairwise GRMs in post-training evaluations.
中文摘要 生成式奖励模型（GRM）因其可解释性、推理时可扩展性以及可通过强化学习（RL）进一步改进的潜力，在奖励建模领域吸引了大量研究关注。然而，广泛使用的成对GRM在与组相对策略优化（Group Relative Policy Optimization，GRPO）等强化学习算法集成时会造成计算瓶颈。这一瓶颈源于两个因素：（i）为获得相对得分所需的成对比较具有O(n^2)的时间复杂度，以及（ii）为提升性能而进行的重复采样或额外的思维链（CoT）推理所带来的计算开销。为解决第一个因素，我们提出了组间相对偏好优化（Intergroup Relative Preference Optimization，IRPO），一种将成熟的Bradley-Terry模型纳入GRPO的新颖强化学习框架。通过为每个回答生成逐点分数，IRPO能够在强化学习训练中高效评估任意数量的候选回答，同时保持可解释性和细粒度的奖励信号。实验结果表明，IRPO在多个基准测试上取得了逐点GRM中最先进的（SOTA）性能，且与当前领先的成对GRM性能相当。此外，我们表明IRPO在后训练评估中显著优于成对GRM。
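
The contrast between pairwise and pointwise scoring in the IRPO abstract can be illustrated with the standard Bradley-Terry model, in which the probability that response i is preferred over response j is the logistic function of the score difference. The sketch below is not the authors' training objective; the function names and the example scores are hypothetical and only show why one pointwise score per response avoids the n(n-1)/2 pairwise comparisons.

```python
import math


def bt_preference_prob(score_i, score_j):
    """Standard Bradley-Terry probability that i is preferred over j,
    with scores interpreted as log-strengths (logistic of the difference)."""
    return 1.0 / (1.0 + math.exp(-(score_i - score_j)))


def rank_by_pointwise_score(scores):
    """Rank n candidates from n pointwise scores (no pairwise calls needed)."""
    return sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)


def num_pairwise_comparisons(n):
    """Number of comparisons a pairwise judge needs to cover all pairs."""
    return n * (n - 1) // 2


# Hypothetical pointwise scores for four candidate responses.
scores = [1.2, -0.3, 0.5, 2.0]
ranking = rank_by_pointwise_score(scores)

# Any pairwise preference can still be recovered from the two scores.
p_best_over_worst = bt_preference_prob(scores[ranking[0]], scores[ranking[-1]])

print("ranking:", ranking)
print("pairwise comparisons avoided:", num_pairwise_comparisons(len(scores)))
print("P(best preferred over worst):", round(p_best_over_worst, 3))
```

With n candidates, the pointwise route needs n scoring calls, while an exhaustive pairwise judge needs n(n-1)/2, which is the O(n^2) factor the abstract refers to.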

ARISE: Adaptive Reinforcement Integrated with Swarm Exploration

ARISE：与群体探索集成的自适应强化

Authors: Rajiv Chaitanya M, D R Ramesh Babu
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2601.00693
Pdf link: https://arxiv.org/pdf/2601.00693
Abstract Effective exploration remains a key challenge in RL, especially with non-stationary rewards or high-dimensional policies. We introduce ARISE, a lightweight framework that enhances reinforcement learning by augmenting standard policy-gradient methods with a compact swarm-based exploration layer. ARISE blends policy actions with particle-driven proposals, where each particle represents a candidate policy trajectory sampled in the action space, and modulates exploration adaptively using reward-variance cues. While easy benchmarks exhibit only slight improvements (e.g., +0.7% on CartPole-v1), ARISE yields substantial gains on more challenging tasks, including +46% on LunarLander-v3 and +22% on Hopper-v4, while preserving stability on Walker2d and Ant. Under non-stationary reward shifts, ARISE provides marked robustness advantages, outperforming PPO by +75 points on CartPole and improving LunarLander accordingly. Ablation studies confirm that both the swarm component and the adaptive mechanism contribute to the performance. Overall, ARISE offers a simple, architecture-agnostic route to more exploratory and resilient RL agents without altering core algorithmic structures.
中文摘要 有效的探索仍然是强化学习中的关键挑战，尤其是在奖励非平稳或策略高维的情况下。我们提出ARISE，一个轻量级框架，通过为标准策略梯度方法增加一个紧凑的基于群体的探索层来增强强化学习。ARISE将策略动作与粒子驱动的提议相融合，其中每个粒子代表在动作空间中采样的一条候选策略轨迹，并利用奖励方差线索自适应地调节探索。虽然在简单基准上仅有轻微提升（例如CartPole-v1上为+0.7%），ARISE在更具挑战性的任务上取得了显著增益，包括LunarLander-v3上的+46%和Hopper-v4上的+22%，同时在Walker2d和Ant上保持了稳定性。在非平稳奖励偏移下，ARISE展现出明显的鲁棒性优势，在CartPole上比PPO高出75分，并相应提升了LunarLander上的表现。消融研究证实，群体组件和自适应机制都对性能有贡献。总体而言，ARISE提供了一条简单且与架构无关的途径，可在不改变核心算法结构的情况下得到更具探索性和韧性的强化学习智能体。

Precision Autotuning for Linear Solvers via Contextual Bandit-Based RL

基于上下文赌博机强化学习的线性求解器精度自动调优

Authors: Erin Carson, Xinye Chen
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Arxiv link: https://arxiv.org/abs/2601.00728
Pdf link: https://arxiv.org/pdf/2601.00728
Abstract We propose a reinforcement learning (RL) framework for adaptive precision tuning of linear solvers, which can be extended to general algorithms. The framework is formulated as a contextual bandit problem and solved using incremental action-value estimation with a discretized state space to select optimal precision configurations for computational steps, balancing precision and computational efficiency. To verify its effectiveness, we apply the framework to iterative refinement for solving linear systems $Ax = b$. In this application, our approach dynamically chooses precisions based on calculated features from the system. In detail, a Q-table maps discretized features (e.g., approximate condition number and matrix norm) to actions (chosen precision configurations for specific steps), optimized via an epsilon-greedy strategy to maximize a multi-objective reward balancing accuracy and computational cost. Empirical results demonstrate effective precision selection, reducing computational cost while maintaining accuracy comparable to double-precision baselines. The framework generalizes to diverse out-of-sample data and offers insight into utilizing RL precision selection for other numerical algorithms, advancing mixed-precision numerical methods in scientific computing. To the best of our knowledge, this is the first work on RL-based precision autotuning that is verified on unseen datasets.
中文摘要 我们提出了一种用于线性求解器自适应精度调优的强化学习（RL）框架，该框架可扩展到一般算法。该框架被表述为一个上下文赌博机问题，并通过在离散化状态空间上进行增量式动作价值估计来求解，从而为各计算步骤选择最优精度配置，在精度与计算效率之间取得平衡。为验证其有效性，我们将该框架应用于求解线性系统$Ax = b$的迭代精化。在该应用中，我们的方法根据从系统中计算出的特征动态选择精度。具体来说，Q表将离散化特征（如近似条件数和矩阵范数）映射到动作（为特定步骤选择的精度配置），并通过ε-贪婪策略进行优化，以最大化兼顾准确性和计算成本的多目标奖励。实证结果表明，该方法能够有效选择精度，在降低计算成本的同时保持与双精度基线相当的准确性。该框架可泛化到多样的样本外数据，并为将强化学习精度选择用于其他数值算法提供了思路，推动了科学计算中混合精度数值方法的发展。据我们所知，这是首个在未见数据集上得到验证的基于强化学习的精度自动调优工作。
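
The ingredients named in the precision-autotuning abstract (a discretized state, a Q-table, incremental action-value estimation, and an epsilon-greedy rule) can be sketched as a small tabular contextual bandit. This is an illustration under stated assumptions, not the authors' implementation: the bin edges, the precision labels, the cost values, and `toy_reward` are all hypothetical stand-ins for the paper's features and multi-objective reward.

```python
import bisect
import random


def discretize(value, edges):
    """Map a continuous feature to a bin index using sorted bin edges."""
    return bisect.bisect_right(edges, value)


class EpsilonGreedyBandit:
    """Tabular contextual bandit with incremental (sample-average) updates."""

    def __init__(self, n_states, n_actions, epsilon=0.1, seed=0):
        self.q = [[0.0] * n_actions for _ in range(n_states)]
        self.counts = [[0] * n_actions for _ in range(n_states)]
        self.n_actions = n_actions
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def greedy(self, state):
        """Action with the highest estimated value (first one on ties)."""
        row = self.q[state]
        return max(range(self.n_actions), key=lambda a: row[a])

    def select(self, state):
        """Explore with probability epsilon, otherwise act greedily."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.n_actions)
        return self.greedy(state)

    def update(self, state, action, reward):
        """Incremental mean: Q <- Q + (r - Q) / N."""
        self.counts[state][action] += 1
        n = self.counts[state][action]
        self.q[state][action] += (reward - self.q[state][action]) / n


# Hypothetical setup: three condition-number bins, three precision choices.
COND_EDGES = [1e2, 1e4]                # assumed bin edges
PRECISIONS = ["fp16", "fp32", "fp64"]  # assumed action labels
COST = [0.25, 0.5, 1.0]                # assumed relative cost per precision


def toy_reward(state, action):
    """Toy reward: 1 if the precision is 'enough' for the bin, minus cost."""
    accurate = 1.0 if action >= state else 0.0
    return accurate - 0.5 * COST[action]


agent = EpsilonGreedyBandit(len(COND_EDGES) + 1, len(PRECISIONS), epsilon=0.2)
state_rng = random.Random(1)
for _ in range(2000):
    cond = 10 ** state_rng.uniform(0, 6)  # synthetic condition number
    s = discretize(cond, COND_EDGES)
    a = agent.select(s)
    agent.update(s, a, toy_reward(s, a))

for s in range(len(COND_EDGES) + 1):
    print("bin", s, "->", PRECISIONS[agent.greedy(s)])
```

Because there is no state transition, the update uses only the immediate reward; that is what distinguishes this bandit formulation from full Q-learning with a bootstrapped next-state term.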

Stochastic Actor-Critic: Mitigating Overestimation via Temporal Aleatoric Uncertainty

随机Actor-Critic：通过时序偶然不确定性缓解高估

Authors: Uğurcan Özalp
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2601.00737
Pdf link: https://arxiv.org/pdf/2601.00737
Abstract Off-policy actor-critic methods in reinforcement learning train a critic with temporal-difference updates and use it as a learning signal for the policy (actor). This design typically achieves higher sample efficiency than purely on-policy methods. However, critic networks tend to overestimate value estimates systematically. This is often addressed by introducing a pessimistic bias based on uncertainty estimates. Current methods employ ensembling to quantify the critic's epistemic uncertainty (uncertainty due to limited data and model ambiguity) to scale pessimistic updates. In this work, we propose a new algorithm called Stochastic Actor-Critic (STAC) that incorporates temporal (one-step) aleatoric uncertainty (uncertainty arising from stochastic transitions, rewards, and policy-induced variability in Bellman targets) to scale pessimistic bias in temporal-difference updates, rather than relying on epistemic uncertainty. STAC uses a single distributional critic network to model the temporal return uncertainty, and applies dropout to both the critic and actor networks for regularization. Our results show that pessimism based on a distributional critic alone suffices to mitigate overestimation, and naturally leads to risk-averse behavior in stochastic environments. Introducing dropout further improves training stability and performance by means of regularization. With this design, STAC achieves improved computational efficiency using a single distributional critic network.
中文摘要 强化学习中的离策略（off-policy）actor-critic方法通过时序差分更新训练critic，并将其用作策略（actor）的学习信号。这种设计通常比纯在策略（on-policy）方法具有更高的样本效率。然而，critic网络往往会系统性地高估价值。通常的应对方式是引入基于不确定性估计的悲观偏置。现有方法采用集成来量化critic的认知不确定性（由数据有限和模型歧义造成的不确定性），以此调节悲观更新的幅度。在本研究中，我们提出了一种名为Stochastic Actor-Critic（STAC）的新算法，它利用时序（单步）偶然不确定性（即由随机转移、奖励以及Bellman目标中策略引起的变异性所产生的不确定性）来调节时序差分更新中的悲观偏置，而非依赖认知不确定性。STAC使用单个分布型critic网络来建模时序回报不确定性，并对critic网络和actor网络都施加dropout以实现正则化。我们的结果表明，仅基于分布型critic的悲观性就足以缓解高估，并自然地在随机环境中产生风险规避行为。引入dropout则通过正则化进一步提升了训练稳定性和性能。凭借这一设计，STAC仅使用单个分布型critic网络便实现了更高的计算效率。

Keyword: diffusion policy

No results found.