Arxiv Papers of Today

Generated: 2026-01-30 16:45:03 (UTC+8); Arxiv release time: 2026-01-30 20:00 EST (2026-01-31 09:00 UTC+8)

66 relevant papers today

Keyword: reinforcement learning

Distributional Active Inference

Authors: Abdullah Akgül, Gulcin Baykal, Manuel Haußmann, Mustafa Mert Çelikok, Melih Kandemir
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.20985
Pdf link: https://arxiv.org/pdf/2601.20985
Abstract Optimal control of complex environments with robotic systems faces two complementary and intertwined challenges: efficient organization of sensory state information and far-sighted action planning. Because the reinforcement learning framework addresses only the latter, it tends to deliver sample-inefficient solutions. Active inference is the state-of-the-art process theory that explains how biological brains handle this dual problem. However, its applications to artificial intelligence have thus far been limited to extensions of existing model-based approaches. We present a formal abstraction of reinforcement learning algorithms that spans model-based, distributional, and model-free approaches. This abstraction seamlessly integrates active inference into the distributional reinforcement learning framework, making its performance advantages accessible without transition dynamics modeling.

SIGMA-PPG: Statistical-prior Informed Generative Masking Architecture for PPG Foundation Model

Authors: Zongheng Guo, Tao Chen, Yang Jiao, Yi Pan, Xiao Hu, Manuela Ferrario
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.21031
Pdf link: https://arxiv.org/pdf/2601.21031
Abstract Current foundation models for photoplethysmography (PPG) signals are challenged by the intrinsic redundancy and noise of the signal. Standard masked modeling often yields trivial solutions while contrastive methods lack morphological precision. To address these limitations, we propose a Statistical-prior Informed Generative Masking Architecture (SIGMA-PPG), a generative foundation model featuring a Prior-Guided Adversarial Masking mechanism, where a reinforcement learning-driven teacher leverages statistical priors to create challenging learning paths that prevent overfitting to noise. We also incorporate a semantic consistency constraint via vector quantization to ensure that physiologically identical waveforms (even those altered by recording artifacts or minor perturbations) map to shared indices. This enhances codebook semantic density and eliminates redundant feature structures. Pre-trained on over 120,000 hours of data, SIGMA-PPG achieves superior average performance compared to five state-of-the-art baselines across 12 diverse downstream tasks. The code is available at this https URL.

Log2Motion: Biomechanical Motion Synthesis from Touch Logs

Authors: Michał Patryk Miazga, Hannah Bussmann, Antti Oulasvirta, Patrick Ebel
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.21043
Pdf link: https://arxiv.org/pdf/2601.21043
Abstract Touch data from mobile devices are collected at scale but reveal little about the interactions that produce them. While biomechanical simulations can illuminate motor control processes, they have not yet been developed for touch interactions. To close this gap, we propose a novel computational problem: synthesizing plausible motion directly from logs. Our key insight is a reinforcement learning-driven musculoskeletal forward simulation that generates biomechanically plausible motion sequences consistent with events recorded in touch logs. We achieve this by integrating a software emulator into a physics simulator, allowing biomechanical models to manipulate real applications in real-time. Log2Motion produces rich syntheses of user movements from touch logs, including estimates of motion, speed, accuracy, and effort. We assess the plausibility of generated movements by comparing against human data from a motion capture study and prior findings, and demonstrate Log2Motion in a large-scale dataset. Biomechanical motion synthesis provides a new way to understand log data, illuminating the ergonomics and motor control underlying touch interactions.

Llama-3.1-FoundationAI-SecurityLLM-Reasoning-8B Technical Report

Authors: Zhuoran Yang, Ed Li, Jianliang He, Aman Priyanshu, Baturay Saglam, Paul Kassianik, Sajana Weerawardhena, Anu Vellore, Blaine Nelson, Neusha Javidnia, Arthur Goldblatt, Fraser Burch, Avi Zohary, Assaf Eisenman, Mahdi Sabbaghi, Supriti Vijay, Rahim Dharssi, Dhruv Kedia, Kojin Oshiba, Yaron Singer, Amin Karbasi
Subjects: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.21051
Pdf link: https://arxiv.org/pdf/2601.21051
Abstract We present Foundation-Sec-8B-Reasoning, the first open-source native reasoning model for cybersecurity. Built upon our previously released Foundation-Sec-8B base model (derived from Llama-3.1-8B-Base), the model is trained through a two-stage process combining supervised fine-tuning (SFT) and reinforcement learning from verifiable rewards (RLVR). Our training leverages proprietary reasoning data spanning cybersecurity analysis, instruction-following, and mathematical reasoning. Evaluation across 10 cybersecurity benchmarks and 10 general-purpose benchmarks demonstrates performance competitive with significantly larger models on cybersecurity tasks while maintaining strong general capabilities. The model shows effective generalization on multi-hop reasoning tasks and strong safety performance when deployed with appropriate system prompts and guardrails. This work demonstrates that domain-specialized reasoning models can achieve strong performance on specialized tasks while maintaining broad general capabilities. We release the model publicly at this https URL.

OpenSec: Measuring Incident Response Agent Calibration Under Adversarial Evidence

Authors: Jarrod Barnes
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.21083
Pdf link: https://arxiv.org/pdf/2601.21083
Abstract As large language models improve, so do their offensive applications: frontier agents now generate working exploits for under $50 in compute (Heelan, 2026). Defensive incident response (IR) agents must keep pace, but existing benchmarks conflate action execution with correct execution, hiding calibration failures when agents process adversarial evidence. We introduce OpenSec, a dual-control reinforcement learning environment that evaluates IR agents under realistic prompt injection scenarios. Unlike static capability benchmarks, OpenSec scores world-state-changing containment actions under adversarial evidence via execution-based metrics: time-to-first-containment (TTFC), blast radius (false positives per episode), and injection violation rates. Evaluating four frontier models on 40 standard-tier episodes, we find consistent over-triggering in this setting: GPT-5.2, Gemini 3, and DeepSeek execute containment in 100% of episodes with 90-97% false positive rates. Claude Sonnet 4.5 shows partial calibration (85% containment, 72% FP), demonstrating that OpenSec surfaces a calibration failure mode hidden by aggregate success metrics. Code available at this https URL.

Deep Reinforcement Learning for Fault-Adaptive Routing in Eisenstein-Jacobi Interconnection Topologies

Authors: Mohammad Walid Charrwi, Zaid Hussain
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.21090
Pdf link: https://arxiv.org/pdf/2601.21090
Abstract The increasing density of many-core architectures necessitates interconnection networks that are both high-performance and fault-resilient. Eisenstein-Jacobi (EJ) networks, with their symmetric 6-regular topology, offer superior topological properties but challenge traditional routing heuristics under fault conditions. This paper evaluates three routing paradigms in faulty EJ environments: deterministic Greedy Adaptive Routing, theoretically optimal Dijkstra's algorithm, and a reinforcement learning (RL)-based approach. Using a multi-objective reward function to penalize fault proximity and reward path efficiency, the RL agent learns to navigate around clustered failures that typically induce dead-ends in greedy geometric routing. Dijkstra's algorithm establishes the theoretical performance ceiling by computing globally optimal paths with complete topology knowledge, revealing the true connectivity limits of faulty networks. Quantitative analysis at nine faulty nodes shows greedy routing catastrophically degrades to 10% effective reachability and packet delivery, while Dijkstra proves 52-54% represents the topological optimum. The RL agent achieves 94% effective reachability and 91% packet delivery, making it suitable for distributed deployment. Furthermore, throughput evaluations demonstrate that RL sustains over 90% normalized throughput across all loads, actually outperforming Dijkstra under congestion through implicit load balancing strategies. These results establish RL-based adaptive policies as a practical solution that bridges the gap between greedy's efficiency and Dijkstra's optimality, providing robust, self-healing communication in fault-prone interconnection networks without requiring the global topology knowledge or computational overhead of optimal algorithms.
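
To make the Dijkstra baseline in this entry concrete, here is a minimal sketch of shortest-path routing around faulty nodes. It does not construct a real Eisenstein-Jacobi network: the 6-neighbor torus below is a hypothetical stand-in chosen only because it is also 6-regular, and the unit link cost, the grid size `n`, and every function name are assumptions rather than details from the paper.

```python
import heapq

# Six neighbor offsets of a hypothetical 6-regular torus (a stand-in, not a true EJ network).
OFFSETS = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, 1), (-1, -1)]


def neighbors(node, n):
    """Return the six neighbors of `node` on an n x n torus."""
    x, y = node
    return [((x + dx) % n, (y + dy) % n) for dx, dy in OFFSETS]


def shortest_hops(src, dst, n, faulty=frozenset()):
    """Dijkstra with unit link costs; returns the hop count, or None if unreachable."""
    if src in faulty or dst in faulty:
        return None
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            return d
        if d > dist[node]:
            continue  # stale queue entry
        for nxt in neighbors(node, n):
            if nxt in faulty:
                continue  # never route through a failed node
            if d + 1 < dist.get(nxt, float("inf")):
                dist[nxt] = d + 1
                heapq.heappush(heap, (d + 1, nxt))
    return None


print(shortest_hops((0, 0), (2, 2), 5))                   # fault-free route
print(shortest_hops((0, 0), (2, 2), 5, faulty={(1, 1)}))  # detour around one fault
```

This global search needs the full fault map, which is the complete topology knowledge the abstract says the RL policy does without.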

Safety Generalization Under Distribution Shift in Safe Reinforcement Learning: A Diabetes Testbed

Authors: Minjae Kwon, Josephine Lamp, Lu Feng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2601.21094
Pdf link: https://arxiv.org/pdf/2601.21094
Abstract Safe Reinforcement Learning (RL) algorithms are typically evaluated under fixed training conditions. We investigate whether training-time safety guarantees transfer to deployment under distribution shift, using diabetes management as a safety-critical testbed. We benchmark safe RL algorithms on a unified clinical simulator and reveal a safety generalization gap: policies satisfying constraints during training frequently violate safety requirements on unseen patients. We demonstrate that test-time shielding, which filters unsafe actions using learned dynamics models, effectively restores safety across algorithms and patient populations. Across eight safe RL algorithms, three diabetes types, and three age groups, shielding achieves Time-in-Range gains of 13-14% for strong baselines such as PPO-Lag and CPO while reducing clinical risk index and glucose variability. Our simulator and benchmark provide a platform for studying safety under distribution shift in safety-critical control domains. Code is available at this https URL and this https URL.
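
The test-time shielding idea in this entry can be sketched as a filter placed between the policy and the environment. This is only an illustration under stated assumptions: the linear `toy_predict` model, the 70-180 mg/dL safe band, the fallback rule, and all names are hypothetical and not taken from the paper, whose learned dynamics models are not described in the abstract.

```python
def shield(glucose, proposed_dose, candidate_doses, predict, is_safe):
    """Return `proposed_dose` if its predicted outcome is safe, else the nearest safe candidate."""
    if is_safe(predict(glucose, proposed_dose)):
        return proposed_dose
    safe = [d for d in candidate_doses if is_safe(predict(glucose, d))]
    if not safe:
        return min(candidate_doses)  # assumption: fall back to the smallest dose
    return min(safe, key=lambda d: abs(d - proposed_dose))


# Toy stand-ins (assumptions): a linear "learned" model and a 70-180 mg/dL safe band.
def toy_predict(glucose, dose):
    return glucose - 30.0 * dose


def toy_is_safe(glucose):
    return 70.0 <= glucose <= 180.0


doses = [0.0, 0.5, 1.0, 1.5, 2.0]
print(shield(100.0, 2.0, doses, toy_predict, toy_is_safe))  # unsafe proposal gets replaced
print(shield(150.0, 1.0, doses, toy_predict, toy_is_safe))  # safe proposal passes through
```

Because the filter only wraps the policy's output, it can in principle be applied to any trained policy without retraining, which matches the abstract's framing of shielding as a test-time step.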

Do Reasoning Models Enhance Embedding Models?

Authors: Wun Yu Chan, Shaojin Chen, Huihao Jing, Kwun Hang Lau, Elton Chun-Chai Li, Zihao Wang, Haoran Li, Yangqiu Song
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.21192
Pdf link: https://arxiv.org/pdf/2601.21192
Abstract State-of-the-art embedding models are increasingly derived from decoder-only Large Language Model (LLM) backbones adapted via contrastive learning. Given the emergence of reasoning models trained via Reinforcement Learning with Verifiable Rewards (RLVR), a natural question arises: does enhanced reasoning translate to superior semantic representations when these models serve as embedding initializations? Contrary to expectation, our evaluation on MTEB and BRIGHT reveals a null effect: embedding models initialized from RLVR-tuned backbones yield no consistent performance advantage over their base counterparts when subjected to identical training recipes. To unpack this paradox, we introduce Hierarchical Representation Similarity Analysis (HRSA), a framework that decomposes similarity across representation, geometry, and function levels. HRSA reveals that while RLVR induces an irreversible reorganization of the latent manifold's local geometry and a reversible drift of the coordinate basis, it preserves the global manifold geometry and linear readout. Consequently, subsequent contrastive learning drives strong alignment between base- and reasoning-initialized models, a phenomenon we term Manifold Realignment. Empirically, our findings suggest that unlike Supervised Fine-Tuning (SFT), RLVR optimizes trajectories within an existing semantic landscape rather than fundamentally restructuring the landscape itself.

When should I search more: Adaptive Complex Query Optimization with Reinforcement Learning

Authors: Wei Wen, Sihang Deng, Tianjun Wei, Keyu Chen, Ruizhi Qiao, Xing Sun
Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2601.21208
Pdf link: https://arxiv.org/pdf/2601.21208
Abstract Query optimization is a crucial component for the efficacy of Retrieval-Augmented Generation (RAG) systems. While reinforcement learning (RL)-based agentic and reasoning methods have recently emerged as a promising direction on query optimization, most existing approaches focus on the expansion and abstraction of a single query. However, complex user queries are prevalent in real-world scenarios, often requiring multiple parallel and sequential search strategies to handle disambiguation and decomposition. Directly applying RL to these complex cases introduces significant hurdles. Determining the optimal number of sub-queries and effectively re-ranking and merging retrieved documents vastly expands the search space and complicates reward design, frequently leading to training instability. To address these challenges, we propose a novel RL framework called Adaptive Complex Query Optimization (ACQO). Our framework is designed to adaptively determine when and how to expand the search process. It features two core components: an Adaptive Query Reformulation (AQR) module that dynamically decides when to decompose a query into multiple sub-queries, and a Rank-Score Fusion (RSF) module that ensures robust result aggregation and provides stable reward signals for the learning agent. To mitigate training instabilities, we adopt a Curriculum Reinforcement Learning (CRL) approach, which stabilizes the training process by progressively introducing more challenging queries through a two-stage strategy. Our comprehensive experiments demonstrate that ACQO achieves state-of-the-art performance on three complex query benchmarks, significantly outperforming established baselines. The framework also showcases improved computational efficiency and broad compatibility with different retrieval architectures, establishing it as a powerful and generalizable solution for next-generation RAG systems.
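
The abstract does not say how the Rank-Score Fusion (RSF) module combines results, so the sketch below deliberately swaps in a different, well-known technique, reciprocal rank fusion (RRF), to show what merging the ranked lists of several sub-queries can look like. It is not the paper's RSF; the constant `k=60` is the value conventionally used with RRF, and the document ids are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each document scores sum(1 / (k + rank)) over the lists containing it."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by document id for determinism.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


# Hypothetical retrieval results for two sub-queries of one complex query.
sub_query_results = [["a", "b", "c"], ["b", "c", "a"]]
print(reciprocal_rank_fusion(sub_query_results))
```

A rank-based rule like this needs no score calibration across retrievers, which is one plausible reason a fusion step could make aggregation more robust; whether RSF works this way is not stated in the abstract.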

Intelli-Planner: Towards Customized Urban Planning via Large Language Model Empowered Reinforcement Learning

Authors: Xixian Yong, Peilin Sun, Zihe Wang, Xiao Zhou
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2601.21212
Pdf link: https://arxiv.org/pdf/2601.21212
Abstract Effective urban planning is crucial for enhancing residents' quality of life and ensuring societal stability, playing a pivotal role in the sustainable development of cities. Current planning methods heavily rely on human experts, which are time-consuming and labor-intensive, or utilize deep learning algorithms, often limiting stakeholder involvement. To bridge these gaps, we propose Intelli-Planner, a novel framework integrating Deep Reinforcement Learning (DRL) with large language models (LLMs) to facilitate participatory and customized planning scheme generation. Intelli-Planner utilizes demographic, geographic data, and planning preferences to determine high-level planning requirements and demands for each functional type. During training, a knowledge enhancement module is employed to enhance the decision-making capability of the policy network. Additionally, we establish a multi-dimensional evaluation system and employ LLM-based stakeholders for satisfaction scoring. Experimental validation across diverse urban settings shows that Intelli-Planner surpasses traditional baselines and achieves comparable performance to state-of-the-art DRL-based methods in objective metrics, while enhancing stakeholder satisfaction and convergence speed. These findings underscore the effectiveness and superiority of our framework, highlighting the potential for integrating the latest advancements in LLMs with DRL approaches to revolutionize tasks related to functional areas planning.

Less Noise, More Voice: Reinforcement Learning for Reasoning via Instruction Purification

Authors: Yiju Guo, Tianyi Hu, Zexu Sun, Yankai Lin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.21244
Pdf link: https://arxiv.org/pdf/2601.21244
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has advanced LLM reasoning, but remains constrained by inefficient exploration under limited rollout budgets, leading to low sampling success and unstable training in complex tasks. We find that many exploration failures arise not from problem difficulty, but from a small number of prompt tokens that introduce interference. Building on this insight, we propose the Less Noise Sampling Framework (LENS), which first purifies prompts by identifying and removing interference tokens, then transfers successful rollouts from the purification process to supervise policy optimization on the original noisy prompts, enabling the model to learn to ignore interference in real-world, noisy prompting settings. Experimental results show that LENS significantly outperforms GRPO, delivering higher performance and faster convergence, with a 3.88% average gain and over 1.6× speedup. Our work highlights the critical role of pruning interference tokens in improving rollout efficiency, offering a new perspective for RLVR research.

Reinforcement Learning from Meta-Evaluation: Aligning Language Models Without Ground-Truth Labels

Authors: Micah Rentschler, Jesse Roberts
Subjects: Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/2601.21268
Pdf link: https://arxiv.org/pdf/2601.21268
Abstract Most reinforcement learning (RL) methods for training large language models (LLMs) require ground-truth labels or task-specific verifiers, limiting scalability when correctness is ambiguous or expensive to obtain. We introduce Reinforcement Learning from Meta-Evaluation (RLME), which optimizes a generator using reward derived from an evaluator's answers to natural-language meta-questions (e.g., "Is the answer correct?" or "Is the reasoning logically consistent?"). RLME treats the evaluator's probability of a positive judgment as a reward and updates the generator via group-relative policy optimization, enabling learning without labels. Across a suite of experiments, we show that RLME achieves accuracy and sample efficiency comparable to label-based training, enables controllable trade-offs among multiple objectives, steers models toward reliable reasoning patterns rather than post-hoc rationalization, and generalizes to open-domain settings where ground-truth labels are unavailable, broadening the domains in which LLMs may be trained with RL.
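
The reward construction described in this entry can be illustrated with a short sketch: the evaluator's probability of a positive judgment is used directly as the reward, and rewards are then normalized within a group of responses sampled for the same prompt, as in group-relative policy optimization. The numbers are hypothetical, and the population standard deviation and the `eps` term are assumptions; the abstract does not give the exact normalization.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical evaluator probabilities of answering "yes" to a meta-question
# such as "Is the answer correct?", one per sampled response.
p_yes = [0.2, 0.8]
print(group_relative_advantages(p_yes))
```

Responses the evaluator rates above the group average get a positive advantage and the rest a negative one, so no ground-truth label enters the update.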

EGAM: Extended Graph Attention Model for Solving Routing Problems

Authors: Licheng Wang, Yuzi Yan, Mingtao Huang, Yuan Shen
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.21281
Pdf link: https://arxiv.org/pdf/2601.21281
Abstract Neural combinatorial optimization (NCO) solvers, implemented with graph neural networks (GNNs), have introduced new approaches for solving routing problems. Trained with reinforcement learning (RL), the state-of-the-art graph attention model (GAM) achieves near-optimal solutions without requiring expert knowledge or labeled data. In this work, we generalize the existing graph attention mechanism and propose the extended graph attention model (EGAM). Our model utilizes multi-head dot-product attention to update both node and edge embeddings, addressing the limitations of the conventional GAM, which considers only node features. We employ an autoregressive encoder-decoder architecture and train it with policy gradient algorithms that incorporate a specially designed baseline. Experiments show that EGAM matches or outperforms existing methods across various routing problems. Notably, the proposed model demonstrates exceptional performance on highly constrained problems, highlighting its efficiency in handling complex graph structures.

The Surprising Difficulty of Search in Model-Based Reinforcement Learning

Authors: Wei-Di Chang, Mikael Henaff, Brandon Amos, Gregory Dudek, Scott Fujimoto
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.21306
Pdf link: https://arxiv.org/pdf/2601.21306
Abstract This paper investigates search in model-based reinforcement learning (RL). Conventional wisdom holds that long-term predictions and compounding errors are the primary obstacles for model-based RL. We challenge this view, showing that search is not a plug-and-play replacement for a learned policy. Surprisingly, we find that search can harm performance even when the model is highly accurate. Instead, we show that mitigating distribution shift matters more than improving model or value function accuracy. Building on this insight, we identify key techniques for enabling effective search, achieving state-of-the-art performance across multiple popular benchmark domains.

Few-Shot Learning for Dynamic Operations of Automated Electric Taxi Fleets under Evolving Charging Infrastructure: A Meta-Deep Reinforcement Learning Approach

Authors: Xiaozhuang Li, Xindi Tang, Fang He
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.21312
Pdf link: https://arxiv.org/pdf/2601.21312
Abstract With the rapid expansion of electric vehicles (EVs) and charging infrastructure, the effective management of Autonomous Electric Taxi (AET) fleets faces a critical challenge in environments with dynamic and uncertain charging availability. While most existing research assumes a static charging network, this simplification creates a significant gap between theoretical models and real-world operations. To bridge this gap, we propose GAT-PEARL, a novel meta-reinforcement learning framework that learns an adaptive operational policy. Our approach integrates a graph attention network (GAT) to effectively extract robust spatial representations under infrastructure layouts and model the complex spatiotemporal relationships of the urban environment, and employs probabilistic embeddings for actor-critic reinforcement learning (PEARL) to enable rapid, inference-based adaptation to changes in charging network layouts without retraining. Through extensive simulations on real-world data in Chengdu, China, we demonstrate that GAT-PEARL significantly outperforms conventional reinforcement learning baselines, showing superior generalization to unseen infrastructure layouts and achieving higher overall operational efficiency in dynamic settings.

Heterogeneous Vertiport Selection Optimization for On-Demand Air Taxi Services: A Deep Reinforcement Learning Approach

Authors: Aoyu Pang, Maonan Wang, Zifan Sha, Wenwei Yue, Changle Li, Chung Shue Chen, Man-On Pun
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.21316
Pdf link: https://arxiv.org/pdf/2601.21316
Abstract Urban Air Mobility (UAM) has emerged as a transformative solution to alleviate urban congestion by utilizing low-altitude airspace, thereby reducing pressure on ground transportation networks. To enable truly efficient and seamless door-to-door travel experiences, UAM requires close integration with existing ground transportation infrastructure. However, current research on optimal integrated routing strategies for passengers in air-ground mobility systems remains limited and unsystematic. To address this gap, we first propose a unified optimization model that integrates strategy selection for both air and ground transportation. This model captures the dynamic characteristics of multimodal transport networks and incorporates real-time traffic conditions alongside passenger decision-making behavior. Building on this model, we propose a Unified Air-Ground Mobility Coordination (UAGMC) framework, which leverages deep reinforcement learning (RL) and Vehicle-to-Everything (V2X) communication to optimize vertiport selection and dynamically plan air taxi routes. Experimental results demonstrate that UAGMC achieves a 34% reduction in average travel time compared to conventional proportional allocation methods, enhancing overall travel efficiency and providing novel insights into the integration and optimization of multimodal transportation systems. This work lays a solid foundation for advancing intelligent urban mobility solutions through the coordination of air and ground transportation modes. The related code can be found at this https URL.

Self-Improving Pretraining: using post-trained models to pretrain better models

Authors: Ellen Xiaoqing Tan, Shehzaad Dhuliawala, Jing Xu, Ping Yu, Sainbayar Sukhbaatar, Jason Weston, Olga Golovneva
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.21343
Pdf link: https://arxiv.org/pdf/2601.21343
Abstract Ensuring safety, factuality and overall quality in the generations of large language models is a critical challenge, especially as these models are increasingly deployed in real-world applications. The prevailing approach to addressing these issues involves collecting expensive, carefully curated datasets and applying multiple stages of fine-tuning and alignment. However, even this complex pipeline cannot guarantee the correction of patterns learned during pretraining. Therefore, addressing these issues during pretraining is crucial, as it shapes a model's core behaviors and prevents unsafe or hallucinated outputs from becoming deeply embedded. To tackle this issue, we introduce a new pretraining method that streams documents and uses reinforcement learning (RL) to improve the next K generated tokens at each step. A strong, post-trained model judges candidate generations -- including model rollouts, the original suffix, and a rewritten suffix -- for quality, safety, and factuality. Early in training, the process relies on the original and rewritten suffixes; as the model improves, RL rewards high-quality rollouts. This approach builds higher quality, safer, and more factual models from the ground up. In experiments, our method gives 36.2% and 18.5% relative improvements over standard pretraining in terms of factuality and safety, and up to 86.3% win rate improvements in overall generation quality.
中文摘要 确保大型语言模型生成内容的安全性、事实性和整体质量是一项关键挑战，尤其是在这些模型越来越多地应用于现实世界的情况下。解决这些问题的主流方法是收集昂贵且精心整理的数据集，并进行多阶段的微调和对齐。然而，即使是这条复杂的流程，也无法保证纠正预训练中学到的模式。因此，在预训练阶段解决这些问题至关重要，因为预训练塑造模型的核心行为，可以防止不安全或幻觉输出被深度固化。为解决这个问题，我们引入了一种新的预训练方法，该方法以流式方式读取文档，并利用强化学习（RL）改进每一步接下来生成的K个词元。一个强大的、经过后训练的模型会从质量、安全性和事实性方面评判候选生成结果，包括模型采样结果（rollout）、原始后缀和重写后缀。在训练初期，这个过程依赖于原始和重写的后缀；随着模型的改进，强化学习会奖励高质量的采样结果。这种方法从零开始构建更高质量、更安全、更符合事实的模型。在实验中，我们的方法在事实性和安全性方面相较标准预训练分别取得了36.2%和18.5%的相对提升，在整体生成质量上的胜率提升最高达86.3%。

Factored Causal Representation Learning for Robust Reward Modeling in RLHF

RLHF中用于稳健奖励建模的因子化因果表征学习

Authors: Yupei Yang, Lin Yang, Wanxi Deng, Lin Qu, Fan Feng, Biwei Huang, Shikui Tu, Lei Xu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.21350
Pdf link: https://arxiv.org/pdf/2601.21350
Abstract A reliable reward model is essential for aligning large language models with human preferences through reinforcement learning from human feedback. However, standard reward models are susceptible to spurious features that are not causally related to human labels. This can lead to reward hacking, where high predicted reward does not translate into better behavior. In this work, we address this problem from a causal perspective by proposing a factored representation learning framework that decomposes the model's contextual embedding into (1) causal factors that are sufficient for reward prediction and (2) non-causal factors that capture reward-irrelevant attributes such as length or sycophantic bias. The reward head is then constrained to depend only on the causal component. In addition, we introduce an adversarial head trained to predict reward from the non-causal factors, while applying gradient reversal to discourage them from encoding reward-relevant information. Experiments on both mathematical and dialogue tasks demonstrate that our method learns more robust reward models and consistently improves downstream RLHF performance over state-of-the-art baselines. Analyses on length and sycophantic bias further validate the effectiveness of our method in mitigating reward hacking behaviors.
中文摘要 通过强化学习，可靠的奖励模型对于使大型语言模型与人类偏好对齐至关重要。然而，标准奖励模型容易受到与人类标签无因果关系的虚假特征的影响。这可能导致奖励黑客行为，即高预测奖励并未转化为更好的行为。本研究从因果视角提出一个因式分解表征学习框架，将模型的上下文嵌入分解为（1）足以预测奖励的因果因素和（2）捕捉与奖励无关属性（如长度或谄媚偏见）的非因果因素。奖励头随后被限制为仅依赖因果成分。此外，我们还引入了一个对抗性头，训练用来预测非因果因素的奖励，同时应用梯度反转以阻止它们编码与奖励相关的信息。数学和对话任务的实验表明，我们的方法学习更稳健的奖励模型，并且在下游RLHF表现上持续提升，优于最先进的基线。对长度和谄媚偏见的分析进一步验证了我们方法在减轻奖励黑客行为方面的有效性。

Towards Bridging the Gap between Large-Scale Pretraining and Efficient Finetuning for Humanoid Control

迈向弥合大规模预训练与高效微调之间的差距，用于人形机器人控制

Authors: Weidong Huang, Zhehan Li, Hangxin Liu, Biao Hou, Yao Su, Jingwen Zhang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.21363
Pdf link: https://arxiv.org/pdf/2601.21363
Abstract Reinforcement learning (RL) is widely used for humanoid control, with on-policy methods such as Proximal Policy Optimization (PPO) enabling robust training via large-scale parallel simulation and, in some cases, zero-shot deployment to real robots. However, the low sample efficiency of on-policy algorithms limits safe adaptation to new environments. Although off-policy RL and model-based RL have shown improved sample efficiency, the gap between large-scale pretraining and efficient finetuning on humanoids still exists. In this paper, we find that off-policy Soft Actor-Critic (SAC), with large-batch update and a high Update-To-Data (UTD) ratio, reliably supports large-scale pretraining of humanoid locomotion policies, achieving zero-shot deployment on real robots. For adaptation, we demonstrate that these SAC-pretrained policies can be finetuned in new environments and out-of-distribution tasks using model-based methods. Data collection in the new environment executes a deterministic policy while stochastic exploration is instead confined to a physics-informed world model. This separation mitigates the risks of random exploration during adaptation while preserving exploratory coverage for improvement. Overall, the approach couples the wall-clock efficiency of large-scale simulation during pretraining with the sample efficiency of model-based learning during fine-tuning.
中文摘要 强化学习（RL）广泛应用于人形机器人控制，近端策略优化（PPO）等同策略（on-policy）方法可通过大规模并行仿真实现稳健训练，在某些情况下还能零样本部署到真实机器人。然而，同策略算法的低样本效率限制了对新环境的安全适应。尽管异策略（off-policy）强化学习和基于模型的强化学习显示出更好的样本效率，但大规模预训练与人形机器人高效微调之间的差距依然存在。本文发现，采用大批量更新和高更新数据比（UTD）的异策略软演员-评论家（SAC）算法能够可靠地支持人形机器人运动策略的大规模预训练，并实现在真实机器人上的零样本部署。在适应方面，我们展示了这些SAC预训练策略可以通过基于模型的方法在新环境和分布外任务中进行微调。新环境中的数据收集执行确定性策略，而随机探索则局限于基于物理信息的世界模型。这种分离降低了适应过程中随机探索的风险，同时保留了用于改进的探索覆盖。总体而言，该方法将预训练期间大规模仿真的挂钟时间效率与微调期间基于模型学习的样本效率相结合。

Intrinsic Reward Policy Optimization for Sparse-Reward Environments

稀疏奖励环境下的内在奖励策略优化

Authors: Minjae Cho, Huy Trong Tran
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.21391
Pdf link: https://arxiv.org/pdf/2601.21391
Abstract Exploration is essential in reinforcement learning as an agent relies on trial and error to learn an optimal policy. However, when rewards are sparse, naive exploration strategies, like noise injection, are often insufficient. Intrinsic rewards can also provide principled guidance for exploration by, for example, combining them with extrinsic rewards to optimize a policy or using them to train subpolicies for hierarchical learning. However, the former approach suffers from unstable credit assignment, while the latter exhibits sample inefficiency and sub-optimality. We propose a policy optimization framework that leverages multiple intrinsic rewards to directly optimize a policy for an extrinsic reward without pretraining subpolicies. Our algorithm -- intrinsic reward policy optimization (IRPO) -- achieves this by using a surrogate policy gradient that provides a more informative learning signal than the true gradient in sparse-reward environments. We demonstrate that IRPO improves performance and sample efficiency relative to baselines in discrete and continuous environments, and formally analyze the optimization problem solved by IRPO. Our code is available at this https URL.
中文摘要 探索在强化学习中至关重要，因为智能体依赖试错来学习最优策略。然而，当奖励稀疏时，噪声注入等朴素的探索策略往往不够。内在奖励也可以为探索提供有原则的指导，例如将其与外在奖励结合以优化策略，或用于训练分层学习的子策略。然而，前者存在信用分配不稳定的问题，而后者则表现出样本效率低和次优性。我们提出了一个策略优化框架，利用多个内在奖励直接针对外在奖励优化策略，无需预训练子策略。我们的算法，即内在奖励策略优化（IRPO），通过使用替代策略梯度实现这一点，该梯度在稀疏奖励环境中提供比真实梯度更具信息量的学习信号。我们证明了IRPO在离散和连续环境中相对于基线提升了性能和样本效率，并对IRPO所求解的优化问题进行了形式化分析。我们的代码可在此 https URL 访问。

Towards Space-Based Environmentally-Adaptive Grasping

迈向面向太空的环境自适应抓取

Authors: Leonidas Askianakis, Aleksandr Artemov
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.21394
Pdf link: https://arxiv.org/pdf/2601.21394
Abstract Robotic manipulation in unstructured environments requires reliable execution under diverse conditions, yet many state-of-the-art systems still struggle with high-dimensional action spaces, sparse rewards, and slow generalization beyond carefully curated training scenarios. We study these limitations through the example of grasping in space environments. We learn control policies directly in a learned latent manifold that fuses (grammarizes) multiple modalities into a structured representation for policy decision-making. Building on GPU-accelerated physics simulation, we instantiate a set of single-shot manipulation tasks and achieve over 95% task success with Soft Actor-Critic (SAC)-based reinforcement learning in less than 1M environment steps, under continuously varying grasping conditions from step 1. This empirically shows faster convergence than representative state-of-the-art visual baselines under the same open-loop single-shot conditions. Our analysis indicates that explicitly reasoning in latent space yields more sample-efficient learning and improved robustness to novel object and gripper geometries, environmental clutter, and sensor configurations compared to standard baselines. We identify remaining limitations and outline directions toward fully adaptive and generalizable grasping in the extreme conditions of space.
中文摘要 在非结构化环境中的机器人操作需要在不同条件下可靠执行，然而许多最先进的系统仍然难以应对高维动作空间、稀疏奖励，以及在精心设计的训练场景之外泛化缓慢的问题。我们以太空环境中的抓取为例研究这些限制。我们直接在一个学习得到的潜在流形中学习控制策略，该流形将多种模态融合（语法化）为用于策略决策的结构化表征。基于GPU加速物理仿真，我们实例化了一组单次操作任务，基于软演员-评论家（SAC）的强化学习在不到100万环境步内实现了超过95%的任务成功率，且抓取条件从第1步起持续变化。这在实证上表明，在相同的开环单次条件下，其收敛速度快于有代表性的最先进视觉基线。我们的分析表明，与标准基线相比，在潜在空间中显式推理能带来更高的样本效率，并提高对新物体和夹爪几何形状、环境杂物及传感器配置的鲁棒性。我们指出了仍然存在的局限，并概述了在太空极端条件下实现完全自适应、可泛化抓取的方向。

Mitigating Overthinking in Large Reasoning Models via Difficulty-aware Reinforcement Learning

通过难度感知强化学习缓解大型推理模型中的过度思考

Authors: Qian Wan, Ziao Xu, Luona Wei, Xiaoxuan Shen, Jianwen Sun
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.21418
Pdf link: https://arxiv.org/pdf/2601.21418
Abstract Large Reasoning Models (LRMs) achieve explicit chain-of-thought expansion by imitating deep thinking behaviors of humans, demonstrating excellent performance in complex task scenarios. However, the deep-thinking mode often leads to unnecessarily lengthy reasoning and resource inefficiency when handling simple tasks. This overthinking phenomenon may arise from the generation preference triggered by the reward function during post-training. Existing research attempts to mitigate overthinking from the perspective of prompt design or model training, but generally underestimates the importance of task difficulty awareness, which makes it difficult for LRMs to effectively allocate reasoning resources. In this paper, we propose Difficulty-aware Policy Optimization (DiPO), a reinforcement learning-based LRM training framework. DiPO encourages the LRM to spontaneously model task complexity and integrates this estimate into the reinforcement learning framework to adjust the generation preferences introduced by post-training. We propose a difficulty modeling method based on model self-reasoning, which significantly reduces the dependence on manual annotation and formalizes task complexity. We further develop a difficulty-signal-enhanced reward function that incorporates a penalty for lengthy reasoning while considering reasoning performance and output format. Experimental results indicate that DiPO enables the model to spontaneously adjust inference overhead, significantly reducing redundant tokens without losing performance due to thought compression.
中文摘要 大型推理模型（LRM）通过模仿人类的深度思考行为实现显式思维链扩展，在复杂任务场景中表现出优异表现。然而，深度思考模式常常导致在处理简单任务时不必要的冗长推理和资源效率低下。这种过度思考现象可能源于训练后奖励函数触发的生成偏好。现有研究试图从提示设计或模型训练的角度减少过度思考，但普遍低估了任务难度意识的重要性，这使得LRM难以有效分配推理资源。本文提出难度感知策略优化（DiPO），这是一种基于强化学习的LRM训练框架。DiPO鼓励LRM自发建模任务复杂度，并将其集成到强化学习框架中，以调整后训练引入的生成偏好。提出了一种基于模型自推理的难度建模方法，显著减少了对手工注释的依赖，并形式化了任务复杂度。我们还进一步开发了一种难度信号增强的奖励函数，该函数在考虑推理表现和输出格式时，对冗长推理施加惩罚。实验结果表明，DiPO使模型能够自发调整推理开销，显著减少冗余令牌，同时不会因思维压缩而损失性能。

HER: Human-like Reasoning and Reinforcement Learning for LLM Role-playing

HER：面向大型语言模型角色扮演的类人推理与强化学习

Authors: Chengyu Du, Xintao Wang, Aili Chen, Weiyuan Li, Rui Xu, Junteng Liu, Zishan Huang, Rong Tian, Zijun Sun, Yuhao Li, Liheng Feng, Deming Ding, Pengyu Zhao, Yanghua Xiao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.21459
Pdf link: https://arxiv.org/pdf/2601.21459
Abstract LLM role-playing, i.e., using LLMs to simulate specific personas, has emerged as a key capability in various applications, such as companionship, content creation, and digital games. While current models effectively capture character tones and knowledge, simulating the inner thoughts behind their behaviors remains a challenge. Towards cognitive simulation in LLM role-play, previous efforts mainly suffer from two deficiencies: a lack of data with high-quality reasoning traces, and a lack of reliable reward signals aligned with human preferences. In this paper, we propose HER, a unified framework for cognitive-level persona simulation. HER introduces dual-layer thinking, which distinguishes characters' first-person thinking from LLMs' third-person thinking. To bridge these gaps, we curate reasoning-augmented role-playing data via reverse engineering and construct human-aligned principles and reward models. Leveraging these resources, we train HER models based on Qwen3-32B via supervised and reinforcement learning. Extensive experiments validate the effectiveness of our approach. Notably, our models significantly outperform the Qwen3-32B baseline, achieving a 30.26 improvement on the CoSER benchmark and a 14.97 gain on the Minimax Role-Play Bench. Our datasets, principles, and models will be released to facilitate future research.
中文摘要 LLM角色扮演，即利用LLM模拟特定角色，已成为陪伴、内容创作和数字游戏等多种应用中的关键能力。虽然现有模型能有效捕捉角色语气和知识，但模拟其行为背后的内心想法仍是一大挑战。在大型语言模型角色扮演中的认知模拟方面，以往的工作主要存在两个不足：缺乏具有高质量推理轨迹的数据，以及缺乏符合人类偏好的可靠奖励信号。本文提出HER，一种用于认知层面角色模拟的统一框架。HER引入了双层思维，将角色的第一人称思维与大型语言模型的第三人称思维区分开来。为弥合这些差距，我们通过逆向工程构建推理增强的角色扮演数据，并构建与人类对齐的原则和奖励模型。利用这些资源，我们通过监督学习和强化学习，基于Qwen3-32B训练了HER模型。大量实验验证了我们方法的有效性。值得注意的是，我们的模型显著优于Qwen3-32B基线，在CoSER基准上提升了30.26，在Minimax Role-Play Bench上提升了14.97。我们的数据集、原则和模型将被发布，以促进未来的研究。

MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reasoning

MemOCR：布局感知视觉记忆，用于高效的长视野推理

Authors: Yaorui Shi, Shugui Liu, Yu Yang, Wenyu Mao, Yuxin Chen, Qi GU, Hui Su, Xunliang Cai, Xiang Wang, An Zhang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.21468
Pdf link: https://arxiv.org/pdf/2601.21468
Abstract Long-horizon agentic reasoning necessitates effectively compressing growing interaction histories into a limited context window. Most existing memory systems serialize history as text, where token-level cost is uniform and scales linearly with length, often spending scarce budget on low-value details. To this end, we introduce MemOCR, a multimodal memory agent that improves long-horizon reasoning under tight context budgets by allocating memory space with adaptive information density through visual layout. Concretely, MemOCR maintains a structured rich-text memory (e.g., headings, highlights) and renders it into an image that the agent consults for memory access, visually prioritizing crucial evidence while aggressively compressing auxiliary details. To ensure robustness across varying memory budgets, we train MemOCR with reinforcement learning under budget-aware objectives that expose the agent to diverse compression levels. Across long-context multi-hop and single-hop question-answering benchmarks, MemOCR outperforms strong text-based baselines and achieves more effective context utilization under extreme budgets.
中文摘要 长视野代理推理需要有效地将不断增长的交互历史压缩到有限的上下文窗口中。大多数现有内存系统将历史序列化为文本，令牌级成本均匀且随长度线性增长，通常将有限预算投入低价值细节。为此，我们介绍了MemOCR，一种多模态记忆代理，通过视觉布局为内存空间分配自适应信息密度，提升在紧张上下文预算下的长视野推理能力。具体来说，MemOCR维护结构化的富文本存储器（如标题、高亮），并将其渲染成图像，供智能体参考以获取内存，视觉上优先处理关键证据，同时积极压缩辅助细节。为了确保在不同内存预算下的鲁棒性，我们在预算感知目标下训练MemOCR，使智能体暴露于不同的压缩水平。在长上下文多跳和单跳问答基准测试中，MemOCR优于强的基于文本的基线，并在极端预算下实现更高效的上下文利用。

SOUP: Token-level Single-sample Mix-policy Reinforcement Learning for Large Language Models

SOUP：大型语言模型的令牌级单样本混合策略强化学习

Authors: Lei Yang, Wei Bi, Chenxi Sun, Renren Jin, Deyi Xiong
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.21476
Pdf link: https://arxiv.org/pdf/2601.21476
Abstract On-policy reinforcement learning (RL) methods widely used for language model post-training, like Group Relative Policy Optimization (GRPO), often suffer from limited exploration and early saturation due to low sampling diversity. While off-policy data can help, current approaches that mix entire trajectories cause significant policy mismatch and instability. In this work, we propose the Single-sample Mix-pOlicy Unified Paradigm (SOUP), a framework that unifies off- and on-policy learning within individual samples at the token level. It confines off-policy influence to the prefix of a generated sequence sampled from historical policies, while the continuation is generated on-policy. Through token-level importance ratios, SOUP effectively leverages off-policy information while preserving training stability. Extensive experiments demonstrate that SOUP consistently outperforms standard on-policy training and existing off-policy extensions. Our further analysis clarifies how our fine-grained, single-sample mix-policy training can improve both exploration and final performance in LLM RL.
中文摘要 广泛用于语言模型后训练的同策略（on-policy）强化学习（RL）方法，如组相对策略优化（Group Relative Policy Optimization，GRPO），由于采样多样性低，常常面临探索有限和过早饱和的问题。虽然异策略（off-policy）数据有所帮助，但当前混合整条轨迹的方法会导致显著的策略不匹配和不稳定。在本研究中，我们提出了 Single-sample Mix-pOlicy Unified Paradigm（SOUP），这是一个在词元级别、于单个样本内统一异策略与同策略学习的框架。它将异策略影响限制在从历史策略中采样的生成序列前缀，而后续部分则由当前策略生成。通过词元级重要性比率，SOUP 有效利用异策略信息，同时保持训练稳定性。大量实验表明，SOUP始终优于标准的同策略训练和现有的异策略扩展。我们的进一步分析阐明了这种细粒度的单样本混合策略训练如何提升LLM强化学习的探索和最终性能。

Mean-Field Control on Sparse Graphs: From Local Limits to GNNs via Neighborhood Distributions

稀疏图的均值场控制：从局部极限到GNNs，通过邻域分布

Authors: Tobias Schmidt, Kai Cui
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2601.21477
Pdf link: https://arxiv.org/pdf/2601.21477
Abstract Mean-field control (MFC) offers a scalable solution to the curse of dimensionality in multi-agent systems but traditionally hinges on the restrictive assumption of exchangeability via dense, all-to-all interactions. In this work, we bridge the gap to real-world network structures by proposing a rigorous framework for MFC on large sparse graphs. We redefine the system state as a probability measure over decorated rooted neighborhoods, effectively capturing local heterogeneity. Our central contribution is a theoretical foundation for scalable reinforcement learning in this setting. We prove horizon-dependent locality: for finite-horizon problems, an agent's optimal policy at time t depends strictly on its (T-t)-hop neighborhood. This result renders the infinite-dimensional control problem tractable and underpins a novel Dynamic Programming Principle (DPP) on the lifted space of neighborhood distributions. Furthermore, we formally and experimentally justify the use of Graph Neural Networks (GNNs) for actor-critic algorithms in this context. Our framework naturally recovers classical MFC as a degenerate case while enabling efficient, theoretically grounded control on complex sparse topologies.
中文摘要 平均场控制（MFC）为多智能体系统中的维度灾难提供了可扩展的解决方案，但传统上依赖于一个限制性较强的假设，即通过密集的全对全交互实现可交换性。在本研究中，我们提出了一个适用于大型稀疏图的严谨MFC框架，弥合了与现实世界网络结构的差距。我们将系统状态重新定义为带装饰的有根邻域上的概率测度，有效捕捉局部异质性。我们的核心贡献是为该设定下的可扩展强化学习奠定理论基础。我们证明了与时域相关的局部性：对于有限时域问题，智能体在时间t的最优策略严格依赖于其（T-t）跳邻域。这一结果使无限维控制问题变得可解，并支撑了邻域分布提升空间上一种新颖的动态规划原理（DPP）。此外，我们从形式上和实验上证明了在该背景下将图神经网络（GNN）用于actor-critic算法的合理性。我们的框架自然地将经典MFC恢复为一种退化情形，同时实现了在复杂稀疏拓扑上高效且有理论依据的控制。
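The horizon-dependent locality statement above can be illustrated with a small sketch. It assumes the sparse graph is stored as an adjacency dictionary; the helper names `k_hop_neighborhood` and `relevant_nodes` and the toy path graph are hypothetical and not taken from the paper. A breadth-first search returns the (T-t)-hop neighborhood that, according to the abstract, an agent's optimal policy at time t depends on.

```python
from collections import deque


def k_hop_neighborhood(adj, root, k):
    """Return the set of nodes within k hops of root, found by BFS."""
    seen = {root}
    frontier = deque([(root, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == k:
            continue  # do not expand beyond k hops
        for nb in adj.get(node, ()):
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, dist + 1))
    return seen


def relevant_nodes(adj, agent, t, horizon):
    """Nodes in the (horizon - t)-hop neighborhood of agent (assumes t <= horizon)."""
    return k_hop_neighborhood(adj, agent, horizon - t)


# Hypothetical toy graph: a path 0 - 1 - 2 - 3 - 4.
PATH = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
```

With horizon T = 4, the neighborhood for agent 2 shrinks as t approaches T: one hop at t = 3 and only the agent itself at t = 4.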

ETS: Energy-Guided Test-Time Scaling for Training-Free RL Alignment

ETS：用于免训练强化学习对齐的能量引导测试时扩展

Authors: Xiuyu Li, Jinkai Zhang, Mingyang Yi, Yu Li, Longqiang Wang, Yue Wang, Ju Fan
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.21484
Pdf link: https://arxiv.org/pdf/2601.21484
Abstract Reinforcement Learning (RL) post-training alignment for language models is effective, but also costly and unstable in practice, owing to its complicated training process. To address this, we propose a training-free inference method to sample directly from the optimal RL policy. The transition probability applied to Masked Language Modeling (MLM) consists of a reference policy model and an energy term. Based on this, our algorithm, Energy-Guided Test-Time Scaling (ETS), estimates the key energy term via online Monte Carlo, with a provable convergence rate. Moreover, to ensure practical efficiency, ETS leverages modern acceleration frameworks alongside tailored importance sampling estimators, substantially reducing inference latency while provably preserving sampling quality. Experiments on MLM (including autoregressive models and diffusion language models) across reasoning, coding, and science benchmarks show that our ETS consistently improves generation quality, validating its effectiveness and design.
中文摘要 针对语言模型的强化学习（RL）后训练对齐是有效的，但由于其训练过程复杂，在实践中成本高昂且不稳定。为此，我们提出了一种免训练的推理方法，直接从最优强化学习策略中采样。应用于掩码语言建模（MLM）的转移概率由参考策略模型和一个能量项组成。基于此，我们的算法“能量引导测试时扩展”（ETS）通过在线蒙特卡洛估计关键的能量项，并具有可证明的收敛速率。此外，为了确保实用效率，ETS结合现代加速框架和定制的重要性采样估计器，大幅降低推理延迟，同时可证明地保持采样质量。在推理、编码和科学基准上针对MLM（包括自回归模型和扩散语言模型）的实验表明，ETS能够持续提升生成质量，验证了其有效性和设计。

Explicit Credit Assignment through Local Rewards and Dependence Graphs in Multi-Agent Reinforcement Learning

通过多智能体强化学习中的局部奖励和依赖图实现显式的信用分配

Authors: Bang Giang Le, Viet Cuong Ta
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.21523
Pdf link: https://arxiv.org/pdf/2601.21523
Abstract To promote cooperation in Multi-Agent Reinforcement Learning, the reward signals of all agents can be aggregated into a global reward, a setup commonly known as the fully cooperative setting. However, global rewards are usually noisy because they contain the contributions of all agents, which have to be resolved in the credit assignment process. On the other hand, using local rewards yields faster learning because agents' contributions are separated, but can be suboptimal as agents myopically optimize their own reward while disregarding global optimality. In this work, we propose a method that combines the merits of both approaches. By using a graph of interaction between agents, our method discerns the individual agent contribution in a more fine-grained manner than a global reward, while alleviating the cooperation problem with agents' local reward. We also introduce a practical approach for approximating such a graph. Our experiments demonstrate the flexibility of the approach, enabling improvements over the traditional local and global reward settings.
中文摘要 为了促进多智能体强化学习中的合作，可以将所有智能体的奖励信号聚合为全局奖励，这种设定通常称为完全合作设定。然而，全局奖励通常噪声较大，因为它们包含所有智能体的贡献，这些贡献必须在信用分配过程中加以区分。另一方面，使用局部奖励因智能体贡献相互分离而学习更快，但由于智能体短视地优化自身奖励而忽略全局最优性，结果可能是次优的。在本研究中，我们提出了一种结合两种方法优点的方法。通过使用智能体间的交互图，我们的方法比全局奖励更细粒度地识别单个智能体的贡献，同时缓解了使用智能体局部奖励时的合作问题。我们还引入了一种近似此类图的实用方法。我们的实验展示了该方法的灵活性，相比传统的局部和全局奖励设定均有改进。

Training slow silicon neurons to control extremely fast robots with spiking reinforcement learning

利用脉冲强化学习训练慢速硅神经元以控制极快的机器人

Authors: Irene Ambrosini, Ingo Blakowski, Dmitrii Zendrikov, Cristiano Capone, Luna Gava, Giacomo Indiveri, Chiara De Luca, Chiara Bartolozzi
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Arxiv link: https://arxiv.org/abs/2601.21548
Pdf link: https://arxiv.org/pdf/2601.21548
Abstract Air hockey demands split-second decisions at high puck velocities, a challenge we address with a compact network of spiking neurons running on a mixed-signal analog/digital neuromorphic processor. By co-designing hardware and learning algorithms, we train the system to achieve successful puck interactions through reinforcement learning in a remarkably small number of trials. The network leverages fixed random connectivity to capture the task's temporal structure and adopts a local e-prop learning rule in the readout layer to exploit event-driven activity for fast and efficient learning. The result is real-time learning with a setup comprising a computer and the neuromorphic chip in-the-loop, enabling practical training of spiking neural networks for robotic autonomous systems. This work bridges neuroscience-inspired hardware with real-world robotic control, showing that brain-inspired approaches can tackle fast-paced interaction tasks while supporting always-on learning in intelligent machines.
中文摘要 空气曲棍球（air hockey）需要在冰球高速运动时做出瞬间决策，我们通过运行在混合信号模拟/数字神经形态处理器上的紧凑脉冲神经元网络来应对这一挑战。通过协同设计硬件和学习算法，我们训练系统通过强化学习在极少的试验次数内实现成功的冰球交互。该网络利用固定的随机连接捕捉任务的时间结构，并在读出层采用局部的e-prop学习规则，利用事件驱动活动实现快速高效的学习。其结果是在由一台计算机和在环神经形态芯片组成的装置上实现实时学习，使面向机器人自主系统的脉冲神经网络训练变得切实可行。这项工作将神经科学启发的硬件与现实世界的机器人控制相结合，表明受大脑启发的方法能够应对快节奏的交互任务，同时支持智能机器的始终在线学习。

ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas

ASTRA：自动合成能动轨迹与强化场域

Authors: Xiaoyu Tian, Haotian Wang, Shuaiting Chen, Hao Zhou, Kaichi Yu, Yudian Zhang, Jade Ouyang, Junxi Yin, Jiong Chen, Baoyan Guo, Lei Zhang, Junjie Tao, Yuansheng Song, Ming Cui, Chengwei Liu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.21558
Pdf link: https://arxiv.org/pdf/2601.21558
Abstract Large language models (LLMs) are increasingly used as tool-augmented agents for multi-step decision making, yet training robust tool-using agents remains challenging. Existing methods still require manual intervention, depend on non-verifiable simulated environments, rely exclusively on either supervised fine-tuning (SFT) or reinforcement learning (RL), and struggle with stable long-horizon, multi-turn learning. To address these challenges, we introduce ASTRA, a fully automated end-to-end framework for training tool-augmented language model agents via scalable data synthesis and verifiable reinforcement learning. ASTRA integrates two complementary components. First, a pipeline that leverages the static topology of tool-call graphs synthesizes diverse, structurally grounded trajectories, instilling broad and transferable tool-use competence. Second, an environment synthesis framework that captures the rich, compositional topology of human semantic reasoning converts decomposed question-answer traces into independent, code-executable, and rule-verifiable environments, enabling deterministic multi-turn RL. Based on this method, we develop a unified training methodology that integrates SFT with online RL using trajectory-level rewards to balance task completion and interaction efficiency. Experiments on multiple agentic tool-use benchmarks demonstrate that ASTRA-trained models achieve state-of-the-art performance at comparable scales, approaching closed-source systems while preserving core reasoning ability. We release the full pipelines, environments, and trained models at this https URL.
中文摘要 大型语言模型（LLMs）越来越多地被用作工具增强的多步骤决策代理，但训练稳健的工具使用代理仍然具有挑战性。现有方法仍需人工干预，依赖不可验证的模拟环境，完全依赖监督微调（SFT）或强化学习（RL），且难以实现稳定的长视野、多回合学习。为应对这些挑战，我们引入了ASTRA，这是一个全自动化的端到端框架，通过可扩展的数据综合和可验证的强化学习来训练工具增强的语言模型代理。ASTRA整合了两个互补的组成部分。首先，利用工具调用图静态拓扑的管道综合多样且结构化的轨迹，培养广泛且可迁移的工具使用能力。其次，一个能够捕捉人类语义推理丰富且组合性拓扑的环境综合框架，将分解后的问答痕迹转换为独立、可代码执行且可规则验证的环境，从而实现确定性多回合强化学习。基于该方法，我们开发了一套统一的训练方法，结合SFT与在线强化学习，利用轨迹级奖励平衡任务完成与交互效率。多项代理工具使用基准测试的实验表明，ASTRA训练的模型在相当规模下实现了最先进的性能，接近闭源系统，同时保持核心推理能力。我们将完整的流水线、环境和训练模型发布在这个 https URL 上。

Signal-Adaptive Trust Regions for Gradient-Free Optimization of Recurrent Spiking Neural Networks

用于循环脉冲神经网络无梯度优化的信号自适应信任域

Authors: Jinhao Li, Yuhao Sun, Zhiyuan Ma, Hao He, Xinche Zhang, Xing Chen, Jin Li, Sen Song
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.21572
Pdf link: https://arxiv.org/pdf/2601.21572
Abstract Recurrent spiking neural networks (RSNNs) are a promising substrate for energy-efficient control policies, but training them for high-dimensional, long-horizon reinforcement learning remains challenging. Population-based, gradient-free optimization circumvents backpropagation through non-differentiable spike dynamics by estimating gradients. However, with finite populations, high variance of these estimates can induce harmful and overly aggressive update steps. Inspired by trust-region methods in reinforcement learning that constrain policy updates in distribution space, we propose Signal-Adaptive Trust Regions (SATR), a distributional update rule that constrains relative change by bounding KL divergence normalized by an estimated signal energy. SATR automatically expands the trust region under strong signals and contracts it when updates are noise-dominated. We instantiate SATR for Bernoulli connectivity distributions, which have shown strong empirical performance for RSNN optimization. Across a suite of high-dimensional continuous-control benchmarks, SATR improves stability under limited populations and reaches competitive returns against strong baselines including PPO-LSTM. In addition, to make SATR practical at scale, we introduce a bitset implementation for binary spiking and binary weights, substantially reducing wall-clock training time and enabling fast RSNN policy search.
中文摘要 循环脉冲神经网络（RSNN）是实现节能控制策略的有前景的基础，但训练它们以完成高维、长时域强化学习仍具挑战。基于种群的无梯度优化通过估计梯度，绕开了穿过不可微脉冲动力学的反向传播。然而，在种群规模有限时，这些估计的高方差可能导致有害且过于激进的更新步骤。受强化学习中在分布空间内约束策略更新的信任域方法启发，我们提出了信号自适应信任域（SATR），这是一条分布更新规则，通过限制按估计信号能量归一化的 KL 散度来约束相对变化。在强信号下，SATR会自动扩展信任域，当更新被噪声主导时则收缩。我们针对伯努利连接分布实例化SATR，该分布在RSNN优化中表现出强劲的实证性能。在一系列高维连续控制基准上，SATR在有限种群下提升了稳定性，并相对于包括PPO-LSTM在内的强基线取得了具有竞争力的回报。此外，为了使 SATR 在大规模下实用，我们引入了用于二值脉冲和二值权重的位集（bitset）实现，大幅缩短了挂钟训练时间，并实现了快速的 RSNN 策略搜索。
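The abstract mentions a bitset implementation for binary spikes and binary weights but gives no details. The following is only a minimal sketch of the general idea, assuming spikes and weights are 0/1 vectors packed into Python integers; the function names `pack_bits`, `binary_dot`, and `neuron_fires` are hypothetical, and the authors' actual implementation may differ substantially.

```python
def pack_bits(bits):
    """Pack a list of 0/1 values into one integer; bits[i] becomes bit i."""
    word = 0
    for i, b in enumerate(bits):
        if b:
            word |= 1 << i
    return word


def binary_dot(spike_word, weight_word):
    """Count inputs that both spiked and have a connection (AND, then popcount)."""
    return bin(spike_word & weight_word).count("1")


def neuron_fires(spike_word, weight_word, threshold):
    """A neuron fires when the number of active connected inputs reaches threshold."""
    return binary_dot(spike_word, weight_word) >= threshold
```

The point of the packing is that one bitwise AND plus a population count replaces a loop of multiply-adds over the whole input vector, which is where a wall-clock saving would come from.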

Scalable Power Sampling: Unlocking Efficient, Training-Free Reasoning for LLMs via Distribution Sharpening

可扩展幂采样：通过分布锐化解锁高效、免训练的大型语言模型推理

Authors: Xiaotong Ji, Rasul Tutunov, Matthieu Zimmer, Haitham Bou Ammar
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.21590
Pdf link: https://arxiv.org/pdf/2601.21590
Abstract Reinforcement learning (RL) post-training is a dominant approach for improving the reasoning performance of large language models (LLMs), yet growing evidence suggests that its gains arise primarily from distribution sharpening rather than the acquisition of new capabilities. Recent work has shown that sampling from the power distribution of LLMs using Markov chain Monte Carlo (MCMC) can recover performance comparable to RL post-training without relying on external rewards; however, the high computational cost of MCMC makes such approaches impractical for widespread adoption. In this work, we propose a theoretically grounded alternative that eliminates the need for iterative MCMC. We derive a novel formulation showing that the global power distribution can be approximated by a token-level scaled low-temperature one, where the scaling factor captures future trajectory quality. Leveraging this insight, we introduce a training-free and verifier-free algorithm that sharpens the base model's generative distribution autoregressively. Empirically, we evaluate our method on math, QA, and code tasks across four LLMs, and show that our method matches or surpasses one-shot GRPO without relying on any external rewards, while reducing inference latency by over 10x compared to MCMC-based sampling.
中文摘要 强化学习（RL）后训练是提升大型语言模型（LLMs）推理性能的主流方法，但越来越多的证据表明，其收益主要来自分布锐化，而非新能力的获得。最新研究表明，利用马尔可夫链蒙特卡洛（MCMC）从LLM的幂分布中采样，可以在不依赖外部奖励的情况下恢复与强化学习后训练相当的性能；然而，MCMC的高计算成本使此类方法难以广泛采用。在本研究中，我们提出了一种有理论依据的替代方案，消除了对迭代MCMC的需求。我们推导出一个新颖的表述，表明全局幂分布可以用词元级的缩放低温分布来近似，其中缩放因子刻画未来轨迹的质量。基于这一见解，我们引入了一种免训练、无验证器的算法，以自回归方式锐化基础模型的生成分布。在实证上，我们在四个大型语言模型上评估了数学、问答和代码任务，表明我们的方法在不依赖任何外部奖励的情况下能够匹配甚至超过one-shot GRPO，同时与基于MCMC的采样相比将推理延迟降低了10倍以上。
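The relation between the sequence-level power distribution and a token-level low-temperature distribution can be checked on a toy example. For a two-token sequence, the first-token marginal of p(x1, x2)^alpha / Z is proportional to p(x1)^alpha times the sum over x2 of p(x2 | x1)^alpha, so the low-temperature token weight is multiplied by a factor that depends on the continuation. The sketch below verifies this identity by brute force; the toy probabilities and function names are made up for illustration, and this is not the paper's algorithm, which approximates the future-dependent factor rather than enumerating it.

```python
def normalize(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}


# Hypothetical toy model: first-token probabilities and next-token conditionals.
P_FIRST = {"a": 0.6, "b": 0.4}
P_NEXT = {"a": {"x": 0.5, "y": 0.5}, "b": {"x": 0.9, "y": 0.1}}


def power_marginal_bruteforce(alpha):
    """First-token marginal of the sequence-level power distribution."""
    seq = {(f, n): (P_FIRST[f] * P_NEXT[f][n]) ** alpha
           for f in P_FIRST for n in P_NEXT[f]}
    seq = normalize(seq)
    return {f: sum(p for (g, _), p in seq.items() if g == f) for f in P_FIRST}


def power_marginal_scaled(alpha):
    """Token-level low-temperature weight times a continuation-dependent factor."""
    scaled = {}
    for f in P_FIRST:
        future = sum(q ** alpha for q in P_NEXT[f].values())
        scaled[f] = (P_FIRST[f] ** alpha) * future
    return normalize(scaled)
```

In this toy model with alpha = 2, plain low-temperature sampling would raise the probability of "a" above 0.6, whereas the scaled version lowers it, because the continuation after "b" is much more peaked.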

Beyond Imitation: Reinforcement Learning for Active Latent Planning

超越模仿：主动潜在规划的强化学习

Authors: Zhi Zheng, Wee Sun Lee
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.21598
Pdf link: https://arxiv.org/pdf/2601.21598
Abstract Aiming at efficient and dense chain-of-thought (CoT) reasoning, latent reasoning methods fine-tune Large Language Models (LLMs) to substitute discrete language tokens with continuous latent tokens. These methods consume fewer tokens than conventional language CoT reasoning and have the potential to plan in a dense latent space. However, current latent tokens are generally supervised by imitating language labels. Considering that there can be multiple equivalent but diverse CoT labels for a question, passively imitating an arbitrary one may lead to inferior latent token representations and latent reasoning policies, undermining the potential planning ability and resulting in clear gaps between training and testing. In this work, we emphasize the importance of active planning over the representation space of latent tokens in achieving the optimal latent reasoning policy. We therefore propose the Active Latent Planning method (ATP-Latent), which models the supervision process of latent tokens as a conditional variational auto-encoder (VAE) to obtain a smoother latent space. Moreover, to facilitate the most reasonable latent reasoning policy, ATP-Latent conducts reinforcement learning (RL) with an auxiliary coherence reward, which is calculated based on the consistency between VAE-decoded contents of latent tokens, enabling a guided RL process. In experiments on LLaMA-1B, ATP-Latent demonstrates +4.1% accuracy and -3.3% tokens on four benchmarks compared to advanced baselines. Code is available at this https URL.
中文摘要 为了实现高效且密集的思维链（CoT）推理，潜在推理方法对大型语言模型（LLMs）进行微调，用连续的潜在词元替代离散的语言词元。这些方法相比传统的语言CoT推理消耗更少的词元，并且有潜力在密集的潜在空间中进行规划。然而，当前的潜在词元通常是基于模仿语言标签进行监督的。考虑到一个问题可能有多个等价但多样的CoT标签，被动模仿任意一个标签可能导致较差的潜在词元表示和潜在推理策略，削弱潜在的规划能力，并导致训练与测试之间出现明显差距。在本研究中，我们强调在潜在词元的表示空间上进行主动规划对于实现最优潜在推理策略的重要性。因此，我们提出了主动潜在规划方法（Active Latent Planning，ATP-Latent），该方法将潜在词元的监督过程建模为条件变分自编码器（VAE），以获得更平滑的潜在空间。此外，为了得到最合理的潜在推理策略，ATP-Latent 借助辅助的连贯性奖励进行强化学习（RL），该奖励基于 VAE 解码出的潜在词元内容之间的一致性计算，从而实现有引导的强化学习过程。在LLaMA-1B上的实验中，与先进基线相比，ATP-Latent在四个基准测试上显示出+4.1%的准确率和-3.3%的词元用量。代码可在该 https URL 获取。

RecNet: Self-Evolving Preference Propagation for Agentic Recommender Systems

RecNet：代理推荐系统中的自我演进偏好传播

Authors: Bingqian Li, Xiaolei Wang, Junyi Li, Weitao Li, Long Zhang, Sheng Chen, Wayne Xin Zhao, Ji-Rong Wen
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.21609
Pdf link: https://arxiv.org/pdf/2601.21609
Abstract Agentic recommender systems leverage Large Language Models (LLMs) to model complex user behaviors and support personalized decision-making. However, existing methods primarily model preference changes based on explicit user-item interactions, which are sparse, noisy, and unable to reflect the real-time, mutual influences among users and items. To address these limitations, we propose RecNet, a self-evolving preference propagation framework that proactively propagates real-time preference updates across related users and items. RecNet consists of two complementary phases. In the forward phase, the centralized preference routing mechanism leverages router agents to integrate preference updates and dynamically propagate them to the most relevant agents. To ensure accurate and personalized integration of propagated preferences, we further introduce a personalized preference reception mechanism, which combines a message buffer for temporary caching and an optimizable, rule-based filter memory to guide selective preference assimilation based on past experience and interests. In the backward phase, the feedback-driven propagation optimization mechanism simulates a multi-agent reinforcement learning framework, using LLMs for credit assignment, gradient analysis, and module-level optimization, enabling continuous self-evolution of propagation strategies. Extensive experiments on various scenarios demonstrate the effectiveness of RecNet in modeling preference propagation for recommender systems.
中文摘要 代理推荐系统利用大型语言模型（LLM）来模拟复杂的用户行为，支持个性化决策。然而，现有方法主要基于显式的用户-物品交互来建模偏好变化，这些交互稀疏、噪声大，无法反映用户与物品之间实时的相互影响。为解决这些局限性，我们提出了RecNet，一种自我演进的偏好传播框架，能够主动在相关用户和项目之间实时传播偏好更新。RecNet 由两个互补阶段组成。在前向阶段，集中优先级路由机制利用路由器代理整合偏好更新，并动态传播给最相关的代理。为确保传播偏好的准确和个性化整合，我们进一步引入了个性化偏好接收机制，结合了用于临时缓存的消息缓冲区和可优化的基于规则的过滤记忆，以引导基于过往经验和兴趣的选择性偏好同化。在后退阶段，反馈驱动传播优化机制模拟了多智能体强化学习框架，利用LLM进行学分分配、梯度分析和模块级优化，实现传播策略的持续自我演进。在各种场景下的大量实验证明了RecNet在推荐系统偏好传播建模上的有效性。

PathReasoner-R1: Instilling Structured Reasoning into Pathology Vision-Language Model via Knowledge-Guided Policy Optimization

PathReasoner-R1：通过知识引导策略优化，将结构化推理融入病理学视觉语言模型

Authors: Songhan Jiang, Fengchun Liu, Ziyue Wang, Linghan Cai, Yongbing Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.21617
Pdf link: https://arxiv.org/pdf/2601.21617
Abstract Vision-Language Models (VLMs) are advancing computational pathology with superior visual understanding capabilities. However, current systems often reduce diagnosis to directly output conclusions without verifiable evidence-linked reasoning, which severely limits clinical trust and hinders expert error rectification. To address these barriers, we construct PathReasoner, the first large-scale dataset of whole-slide image (WSI) reasoning. Unlike previous work reliant on unverified distillation, we develop a rigorous knowledge-guided generation pipeline. By leveraging medical knowledge graphs, we explicitly align structured pathological findings and clinical reasoning with diagnoses, generating over 20K high-quality instructional samples. Based on the database, we propose PathReasoner-R1, which synergizes trajectory-masked supervised fine-tuning with reasoning-oriented reinforcement learning to instill structured chain-of-thought capabilities. To ensure medical rigor, we engineer a knowledge-aware multi-granular reward function incorporating an Entity Reward mechanism strictly aligned with knowledge graphs. This effectively guides the model to optimize for logical consistency rather than mere outcome matching, thereby enhancing robustness. Extensive experiments demonstrate that PathReasoner-R1 achieves state-of-the-art performance on both PathReasoner and public benchmarks across various image scales, equipping pathology models with transparent, clinically grounded reasoning capabilities. Dataset and code are available at this https URL.
中文摘要 视觉语言模型（VLM）以更强的视觉理解能力推动计算病理学的发展。然而，现有系统常常将诊断简化为直接输出结论，缺乏可验证的证据关联推理，这严重限制了临床信任，阻碍专家纠错。为解决这些障碍，我们构建了PathReasoner，这是首个大规模的全幻灯片图像（WSI）推理数据集。与以往依赖未经验证的蒸馏不同，我们开发了一个严谨的知识引导生成流程。通过利用医学知识图谱，我们明确将结构化的病理发现和临床推理与诊断对齐，生成超过2万条高质量的教学样本。基于该数据库，我们提出了PathReasoner-R1，它将轨迹掩蔽监督微调与推理导向强化学习协同，以培养结构化的思维链能力。为确保医学严谨性，我们设计了一个知识感知的多粒度奖励函数，结合严格与知识图谱对齐的实体奖励机制。这有效引导模型优化逻辑一致性，而非单纯的结果匹配，从而增强了鲁棒性。大量实验表明，PathReasoner-R1在多种图像尺度下，在PathReasoner和公共基准测试中都达到了最先进的性能，赋予病理模型透明且临床基础的推理能力。数据集和代码可在该 https URL 访问。

Expected Return Causes Outcome-Level Mode Collapse in Reinforcement Learning and How to Fix It with Inverse Probability Scaling

期望回报导致强化学习中的结果层级模式崩溃以及如何通过逆概率尺度法解决

Authors: Abhijeet Sinha, Sundari Elango, Dianbo Liu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.21669
Pdf link: https://arxiv.org/pdf/2601.21669
Abstract Many reinforcement learning (RL) problems admit multiple terminal solutions of comparable quality, where the goal is not to identify a single optimum but to represent a diverse set of high-quality outcomes. Nevertheless, policies trained by standard expected return maximization routinely collapse onto a small subset of outcomes, a phenomenon commonly attributed to insufficient exploration or weak regularization. We show that this explanation is incomplete: outcome level mode collapse is a structural consequence of the expected-return objective itself. Under idealized learning dynamics, the log-probability ratio between any two outcomes evolves linearly in their reward difference, implying exponential ratio divergence and inevitable collapse independent of the exploration strategy, entropy regularization, or optimization algorithm. We identify the source of this pathology as the probability multiplier inside the expectation and propose a minimal correction: inverse probability scaling, which removes outcome-frequency amplification from the learning signal, fundamentally changes the learning dynamics, and provably yields reward-proportional terminal distributions, preventing collapse in multimodal settings. We instantiate this principle in Group Relative Policy Optimization (GRPO) as a drop-in modification, IPS-GRPO, requiring no auxiliary models or architectural changes. Across different reasoning and molecular generation tasks, IPS-GRPO consistently reduces outcome-level mode collapse while matching or exceeding baseline performance, suggesting that correcting the objective rather than adding exploration heuristics is key to reliable multimodal policy optimization.
中文摘要 许多强化学习（RL）问题允许多个质量相当的终端解，目标不是确定单一的最优解，而是代表一组多样化的高质量结果。然而，由标准期望回报最大化训练的策略通常会归结为一小部分结果，这一现象通常归因于探索不足或正则化薄弱。我们证明这一解释不完整：结果水平模式崩溃是期望回报目标本身的结构性结果。在理想化学习动力学下，任意两个结果之间的对数概率比在奖励差异上呈线性演变，这意味着指数比发散和不可避免的崩溃与探索策略、熵正则化或优化算法无关。我们将这种病理的根源确定为期望中的概率乘数，并提出了一个最小的修正：逆概率尺度法，该方法去除学习信号中的结果频率放大，根本改变学习动态，并可证明产生与奖励比例的终端分布，防止多模态环境中的崩溃。我们在Group Relative Policy Optimization（GRPO）中以可直接修改（IPS-GRPO）的形式实现这一原则，无需辅助模型或架构变更。在不同的推理和分子生成任务中，IPS-GRPO持续减少结果层级模式崩溃，同时性能匹配甚至超过基线，表明纠正目标而非增加探索启发式是可靠多模态策略优化的关键。
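
The collapse claim in the abstract above (the log-probability ratio between two outcomes grows linearly in their reward difference) can be illustrated with a toy simulation. This is only a sketch under a loud assumption: the "idealized learning dynamics" are modeled here as a multiplicative (exponentiated-gradient) update on a two-outcome distribution, which may differ from the paper's formal setting. It does not implement inverse probability scaling or IPS-GRPO. The function name, rewards, and learning rate are all hypothetical.

```python
import math


def multiplicative_update(probs, rewards, lr):
    """One exponentiated-gradient step: p_i <- p_i * exp(lr * r_i) / Z."""
    weighted = [p * math.exp(lr * r) for p, r in zip(probs, rewards)]
    total = sum(weighted)
    return [w / total for w in weighted]


rewards = [1.0, 0.9]   # two outcomes of comparable quality (made-up values)
probs = [0.5, 0.5]     # start from a uniform distribution over outcomes
lr = 0.5               # hypothetical step size
steps = 200

log_ratios = []
for _ in range(steps):
    probs = multiplicative_update(probs, rewards, lr)
    log_ratios.append(math.log(probs[0] / probs[1]))

# Under this update the log-ratio increases by lr * (r_0 - r_1) every step,
# so the probability ratio diverges exponentially despite the small reward gap.
predicted_final = steps * lr * (rewards[0] - rewards[1])
```

Even with a reward gap of only 0.1, the slightly better outcome ends up with almost all of the probability mass, which matches the qualitative behavior the abstract describes.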

BAP-SRL: Bayesian Adaptive Priority Safe Reinforcement Learning for Vehicle Motion Planning at Mixed Traffic Intersections

BAP-SRL：混合交通路口车辆运动规划的贝叶斯自适应优先安全强化学习

Authors: Yuansheng Lian, Ke Zhang, Yaming Guo, Shen Li, Meng Li
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2601.21679
Pdf link: https://arxiv.org/pdf/2601.21679
Abstract Navigating urban intersections, especially when interacting with heterogeneous traffic participants, presents a formidable challenge for autonomous vehicles (AVs). In such environments, safety risks arise simultaneously from multiple sources, each carrying distinct priority levels and sensitivities that necessitate differential protection preferences. While safe reinforcement learning (RL) offers a robust paradigm for constrained decision-making, existing methods typically model safety as a single constraint or employ static, heuristic weighting schemes for multiple constraints. These approaches often fail to address the dynamic nature of multi-source risks, leading to gradient cancellation that hampers learning, and suboptimal trade-offs in critical dilemma zones. To address this, we propose a Bayesian adaptive priority safe reinforcement learning (BAP-SRL) based motion planning framework. Unlike heuristic weighting schemes, BAP formulates constraint prioritization as a probabilistic inference task. By modeling historical optimization difficulty as a Bayesian prior and instantaneous risk evidence as a likelihood, BAP dynamically gates gradient updates using a Bayesian inference mechanism on latent constraint criticality. Extensive experiments demonstrate that our approach outperforms state-of-the-art baselines in handling interactions with stochastic, heterogeneous agents, achieving lower collision rates and smoother conflict resolution.
中文摘要 在城市路口导航，尤其是在与异质交通参与者互动时，对自动驾驶车辆（AV）来说是一项艰巨的挑战。在这样的环境中，安全风险同时来自多个来源，每个来源都具有不同的优先级和敏感性，因此需要差异化的保护偏好。虽然安全强化学习（RL）为受限决策提供了稳健的范式，但现有方法通常将安全性建模为单一约束，或对多个约束采用静态启发式权重方案。这些方法常常未能解决多源风险的动态特性，导致梯度抵消阻碍学习，并在关键困境区出现不优权衡。为此，我们提出了基于贝叶斯自适应优先安全强化学习（BAP-SRL）的运动规划框架。与启发式加权方案不同，BAP将约束优先级化表述为一种概率推断任务。通过将历史优化难度建模为贝叶斯先验，将瞬时风险证据建模为似然，BAP利用贝叶斯推断机制对潜在约束临界性的动态门控更新。大量实验表明，我们的方法在处理随机异构智能体交互方面优于最先进的基线，实现更低的碰撞率和更顺畅的冲突解决。
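
The abstract above describes treating historical optimization difficulty as a prior and instantaneous risk evidence as a likelihood. A minimal sketch of that idea, assuming a plain discrete Bayes update over constraints and a weighted sum of per-constraint gradients as the "gating" step, follows. The abstract does not give the actual inference mechanism, so `posterior_weights`, the gating rule, and every number below are hypothetical.

```python
def posterior_weights(prior, likelihood):
    """Discrete Bayes rule: posterior_i is proportional to prior_i * likelihood_i."""
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical values for three safety constraints (e.g. three risk sources).
prior = [0.5, 0.3, 0.2]        # historical optimization difficulty
likelihood = [0.1, 0.7, 0.2]   # instantaneous risk evidence

weights = posterior_weights(prior, likelihood)

# Assumed gating rule: scale each constraint's gradient by its posterior weight
# before summing, so the currently critical constraint dominates the update.
constraint_grads = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
combined_grad = [
    sum(w * g[k] for w, g in zip(weights, constraint_grads))
    for k in range(2)
]
```

In this toy case the first and third gradients point in opposite directions and would largely cancel under equal weighting; the posterior weights let the second constraint, which has the strongest current evidence, drive the update instead.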

Can David Beat Goliath? On Multi-Hop Reasoning with Resource-Constrained Agents

大卫能打败歌利亚吗？论资源受限智能体的多跳推理

Authors: Hojae Han, Heeyun Jung, Jongyoon Kim, Seung-won Hwang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.21699
Pdf link: https://arxiv.org/pdf/2601.21699
Abstract While reinforcement learning (RL) has empowered multi-turn reasoning agents with retrieval and tools, existing successes largely depend on extensive on-policy rollouts in high-cost, high-accuracy regimes. Under realistic resource constraints that cannot support large models or dense explorations, however, small language model agents fall into a low-cost, low-accuracy regime, where limited rollout budgets lead to sparse exploration, sparse credit assignment, and unstable training. In this work, we challenge this trade-off and show that small language models can achieve strong multi-hop reasoning under resource constraints. We introduce DAVID-GRPO, a budget-efficient RL framework that (i) stabilizes early learning with minimal supervision, (ii) assigns retrieval credit based on evidence recall, and (iii) improves exploration by resampling truncated near-miss trajectories. Evaluated on agents up to 1.5B parameters trained on only four RTX 3090 GPUs, DAVID-GRPO consistently outperforms prior RL methods designed for large-scale settings on six multi-hop QA benchmarks. These results show that with the right inductive biases, small agents can achieve low training cost with high accuracy.
中文摘要 虽然强化学习（RL）赋予了多轮推理智能体检索和使用工具的能力，但现有的成功很大程度上依赖于在高成本、高准确率条件下进行大量的同策略（on-policy）rollout。然而，在无法支持大型模型或密集探索的现实资源限制下，小型语言模型智能体会陷入低成本、低准确率的状态，有限的rollout预算导致探索稀疏、信用分配稀疏和训练不稳定。在本研究中，我们挑战了这种权衡，展示了小型语言模型在资源约束下能够实现强大的多跳推理。我们引入了DAVID-GRPO，这是一种预算高效的强化学习框架，（i）以最少的监督稳定早期学习，（ii）基于证据召回率分配检索信用，（iii）通过对被截断的“险些成功”轨迹重新采样来改进探索。在仅用四块RTX 3090 GPU训练、参数规模最高为1.5B的智能体上评估时，DAVID-GRPO在六个多跳问答基准上持续优于以往为大规模场景设计的强化学习方法。这些结果表明，只要采用合适的归纳偏置，小型智能体可以在低训练成本下实现高准确率。

TACLer: Tailored Curriculum Reinforcement Learning for Efficient Reasoning

TACLer：量身定制的课程强化学习，助力高效推理

Authors: Huiyuan Lai, Malvina Nissim
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.21711
Pdf link: https://arxiv.org/pdf/2601.21711
Abstract Large Language Models (LLMs) have shown remarkable performance on complex reasoning tasks, especially when equipped with long chain-of-thought (CoT) reasoning. However, eliciting long CoT typically requires large-scale reinforcement learning (RL) training, while often leading to overthinking with redundant intermediate steps. To improve learning and reasoning efficiency, while preserving or even enhancing performance, we propose TACLer, a model-tailored curriculum reinforcement learning framework that gradually increases the complexity of the data based on the model's proficiency in multi-stage RL training. TACLer features two core components: (i) tailored curriculum learning that determines what knowledge the model lacks and needs to learn in progressive stages; (ii) a hybrid Thinking/NoThinking reasoning paradigm that balances accuracy and efficiency by enabling or disabling the Thinking mode. Our experiments show that TACLer yields a twofold advantage in learning and reasoning: (i) it reduces computational cost, cutting training compute by over 50% compared to long thinking models and reducing inference token usage by over 42% relative to the base model; and (ii) it improves accuracy by over 9% on the base model, consistently outperforming state-of-the-art Nothinking and Thinking baselines across four math datasets with complex problems.
中文摘要 大型语言模型（LLMs）在复杂推理任务中表现出显著表现，尤其是在配备长思考链（CoT）推理能力时。然而，诱导长CoT通常需要大规模强化学习（RL）训练，且常常导致过度思考，伴随着冗余的中间步骤。为了提高学习和推理效率，同时保持甚至提升表现，我们提出了TACLer，这是一种基于模型定制的课程强化学习框架，基于模型在多阶段强化学习中的熟练度逐步提升数据复杂度。TACLer 具有两个核心组成部分：（i）定制化的课程学习，确定模型缺乏哪些知识，需要在逐步阶段学习;（ii）一种混合思维/无思考推理范式，通过启用或禁用思考模式来平衡准确性和效率。我们的实验表明，TACLer在学习和推理方面具有双重优势：（i）它降低了计算成本，与长期思考模型相比训练计算量减少了50%以上，并且推理标记的使用量相较基础模型减少了42%以上;（ii）在基础模型上准确率提升超过9%，在四个复杂问题的数学数据集中持续优于最先进的Nothinking和Thinking基线。

Disentangling perception and reasoning for improving data efficiency in learning cloth manipulation without demonstrations

解耦感知与推理，以提升无演示布料操作学习中的数据效率

Authors: Donatien Delehelle, Fei Chen, Darwin Caldwell
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.21713
Pdf link: https://arxiv.org/pdf/2601.21713
Abstract Cloth manipulation is a ubiquitous task in everyday life, but it remains an open challenge for robotics. The difficulties in developing cloth manipulation policies are attributed to the high-dimensional state space, complex dynamics, and high propensity to self-occlusion exhibited by fabrics. As analytical methods have not been able to provide robust and general manipulation policies, reinforcement learning (RL) is considered a promising approach to these problems. However, to address the large state space and complex dynamics, data-based methods usually rely on large models and long training times. The resulting computational cost significantly hampers the development and adoption of these methods. Additionally, due to the challenge of robust state estimation, garment manipulation policies often adopt an end-to-end learning approach with workspace images as input. While this approach enables a conceptually straightforward sim-to-real transfer via real-world fine-tuning, it also incurs a significant computational cost by training agents on a highly lossy representation of the environment state. This paper questions this common design choice by exploring an efficient and modular approach to RL for cloth manipulation. We show that, through careful design choices, model size and training time can be significantly reduced when learning in simulation. Furthermore, we demonstrate how the resulting simulation-trained model can be transferred to the real world. We evaluate our approach on the SoftGym benchmark and achieve significant performance improvements over available baselines on our task, while using a substantially smaller model.
中文摘要 布料操作在日常生活中无处不在，但对机器人来说仍是一个开放的挑战。开发布料操作策略的困难归因于织物的高维状态空间、复杂的动力学以及极易发生自遮挡的特性。由于解析方法尚无法提供稳健且通用的操作策略，强化学习（RL）被认为是解决这些问题的有前景的方法。然而，为了应对庞大的状态空间和复杂的动力学，基于数据的方法通常依赖大型模型和较长的训练时间。由此产生的计算成本极大地阻碍了这些方法的开发和应用。此外，由于稳健状态估计的难度，服装操作策略通常采用端到端学习方法，以工作区图像作为输入。虽然这种方法可以通过真实世界微调实现概念上简单的仿真到现实迁移，但由于智能体是在环境状态的高度有损表示上训练的，这也带来了显著的计算成本。本文通过探索一种高效且模块化的布料操作强化学习方法，对这一常见的设计选择提出质疑。我们证明，通过精心的设计选择，在仿真中学习时模型规模和训练时间可以显著减少。此外，我们还展示了如何将仿真训练得到的模型迁移到现实世界。我们在 SoftGym 基准上评估了我们的方法，在使用明显更小的模型的同时，在我们的任务上相比现有基线取得了显著的性能提升。

Mixed-Precision Training and Compilation for RRAM-based Computing-in-Memory Accelerators

基于RRAM的内存计算加速器的混合精度训练与编译

Authors: Rebecca Pelke, Joel Klein, Jose Cubero-Cascante, Nils Bosbach, Jan Moritz Joseph, Rainer Leupers
Subjects: Machine Learning (cs.LG); Emerging Technologies (cs.ET)
Arxiv link: https://arxiv.org/abs/2601.21737
Pdf link: https://arxiv.org/pdf/2601.21737
Abstract Computing-in-Memory (CIM) accelerators are a promising solution for accelerating Machine Learning (ML) workloads, as they perform Matrix-Vector Multiplications (MVMs) on crossbar arrays directly in memory. Although the bit widths of the crossbar inputs and cells are very limited, most CIM compilers do not support quantization below 8 bits. As a result, a single MVM requires many compute cycles, and weights cannot be efficiently stored in a single crossbar cell. To address this problem, we propose a mixed-precision training and compilation framework for CIM architectures. The biggest challenge is the massive search space, which makes it difficult to find good quantization parameters. We therefore introduce a reinforcement learning-based strategy to find suitable quantization configurations that balance latency and accuracy. In the best case, our approach achieves up to a 2.48x speedup over existing state-of-the-art solutions, with an accuracy loss of only 0.086%.
中文摘要 存内计算（CIM）加速器是加速机器学习（ML）工作负载的一种有前景的解决方案，因为它们直接在内存中的交叉开关阵列上执行矩阵-向量乘法（MVM）。尽管交叉开关输入和单元的位宽非常有限，但大多数CIM编译器不支持低于8位的量化。因此，单个MVM需要许多计算周期，权重也无法高效地存储在单个交叉开关单元中。为解决这一问题，我们提出了一个面向CIM架构的混合精度训练与编译框架。最大的挑战是庞大的搜索空间，这使得寻找良好的量化参数变得困难。因此，我们引入一种基于强化学习的策略，寻找能够平衡延迟和准确率的合适量化配置。在最佳情况下，我们的方法相比现有最先进方案实现了最高2.48倍的加速，而准确率仅损失0.086%。

Epistemic Context Learning: Building Trust the Right Way in LLM-Based Multi-Agent Systems

认知情境学习：在基于LLM的多智能体系统中正确建立信任

Authors: Ruiwen Zhou, Maojia Song, Xiaobao Wu, Sitao Cheng, Xunjian Yin, Yuxi Xie, Zhuoqun Hao, Wenyue Hua, Liangming Pan, Soujanya Poria, Min-Yen Kan
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2601.21742
Pdf link: https://arxiv.org/pdf/2601.21742
Abstract Individual agents in multi-agent (MA) systems often lack robustness, tending to blindly conform to misleading peers. We show this weakness stems from both sycophancy and inadequate ability to evaluate peer reliability. To address this, we first formalize the learning problem of history-aware reference, introducing the historical interactions of peers as additional input, so that agents can estimate peer reliability and learn from trustworthy peers when uncertain. This shifts the task from evaluating peer reasoning quality to estimating peer reliability based on interaction history. We then develop Epistemic Context Learning (ECL): a reasoning framework that conditions predictions on explicitly-built peer profiles from history. We further optimize ECL by reinforcement learning using auxiliary rewards. Our experiments reveal that our ECL enables small models like Qwen 3-4B to outperform a history-agnostic baseline 8x its size (Qwen 3-30B) by accurately identifying reliable peers. ECL also boosts frontier models to near-perfect (100%) performance. We show that ECL generalizes well to various MA configurations and we find that trust is modeled well by LLMs, revealing a strong correlation in trust modeling accuracy and final answer quality.
中文摘要 多智能体（MA）系统中的单个代理通常缺乏鲁棒性，往往盲目地遵循误导性的对等代理。我们发现，这种弱点既源于谄媚，也源于评估同伴可靠性的能力不足。为此，我们首先形式化了历史感知指称的学习问题，引入了同伴之间的历史交互作为额外输入，使智能体能够估计同伴的可靠性，并在不确定时从可信的同伴学习。这使得任务从评估同伴推理质量转向基于交互历史估计同伴可靠性。随后，我们开发了认知情境学习（ECL）：一种基于历史中明确构建的同伴档谱来预测的推理框架。我们通过辅助奖励进行强化学习进一步优化ECL。我们的实验显示，我们的ECL使像Qwen 3-4B这样的小型模型能够通过准确识别可靠的对等体，优于其8倍大小的历史无关基线（Qwen 3-30B）。ECL还将Frontier模型提升至接近完美的（100%）性能。我们证明ECL能够很好地推广到各种MA配置，并且信任被大型模型很好地建模，揭示了信任建模准确性与最终答案质量之间的强烈相关性。
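
The shift described in the abstract above, from judging peer reasoning to estimating peer reliability from interaction history, can be sketched without any LLM. The toy below is not ECL itself, which the abstract says conditions LLM predictions on explicitly built peer profiles. It only assumes that a peer profile can be reduced to a smoothed historical accuracy and that answers are combined by a reliability-weighted vote. All names and data are hypothetical.

```python
from collections import defaultdict


def peer_reliability(history):
    """Estimate each peer's reliability from past correctness.

    `history` maps a peer name to a list of booleans (correct or not in
    earlier rounds). Laplace smoothing keeps estimates away from 0 and 1.
    """
    return {
        peer: (sum(outcomes) + 1) / (len(outcomes) + 2)
        for peer, outcomes in history.items()
    }


def reliability_weighted_answer(current_answers, reliability):
    """Pick the answer whose supporters have the highest total reliability."""
    scores = defaultdict(float)
    for peer, answer in current_answers.items():
        scores[answer] += reliability[peer]
    return max(scores, key=scores.get)


history = {
    "peer_a": [True, True, True, True],
    "peer_b": [False, False, True, False],
    "peer_c": [False, True, False, False],
}
current_answers = {"peer_a": "B", "peer_b": "A", "peer_c": "A"}

reliability = peer_reliability(history)
chosen = reliability_weighted_answer(current_answers, reliability)
```

A plain majority vote would follow the two unreliable peers and answer "A"; weighting by history selects "B", the answer of the one peer with a strong track record.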

Language-based Trial and Error Falls Behind in the Era of Experience

基于语言的试错在经验时代落后

Authors: Haoyu Wang, Guozheng Ma, Shugang Cui, Yilun Kong, Haotian Luo, Li Shen, Mengya Gao, Yichao Wu, Xiaogang Wang, Dacheng Tao
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.21754
Pdf link: https://arxiv.org/pdf/2601.21754
Abstract While Large Language Models (LLMs) excel in language-based agentic tasks, their applicability to unseen, nonlinguistic environments (e.g., symbolic or spatial tasks) remains limited. Previous work attributes this performance gap to the mismatch between the pretraining distribution and the testing distribution. In this work, we demonstrate the primary bottleneck is the prohibitive cost of exploration: mastering these tasks requires extensive trial-and-error, which is computationally unsustainable for parameter-heavy LLMs operating in a high dimensional semantic space. To address this, we propose SCOUT (Sub-Scale Collaboration On Unseen Tasks), a novel framework that decouples exploration from exploitation. We employ lightweight "scouts" (e.g., small MLPs) to probe environmental dynamics at a speed and scale far exceeding LLMs. The collected trajectories are utilized to bootstrap the LLM via Supervised Fine-Tuning (SFT), followed by multi-turn Reinforcement Learning (RL) to activate its latent world knowledge. Empirically, SCOUT enables a Qwen2.5-3B-Instruct model to achieve an average score of 0.86, significantly outperforming proprietary models, including Gemini-2.5-Pro (0.60), while saving about 60% GPU hours consumption.
中文摘要 虽然大型语言模型（LLMs）在基于语言的智能体任务中表现出色，但它们在未见过的非语言环境（如符号或空间任务）中的适用性仍然有限。以往的研究将这一性能差距归因于预训练分布与测试分布之间的不匹配。在本研究中，我们证明主要瓶颈是高昂的探索成本：掌握这些任务需要大量试错，而对于在高维语义空间中运行、参数量庞大的LLM来说，这在计算上是不可持续的。为此，我们提出了SCOUT（Sub-Scale Collaboration On Unseen Tasks），这是一个将探索与利用解耦的新框架。我们采用轻量级“侦察器”（例如小型MLP），以远超LLM的速度和规模探测环境动态。收集到的轨迹先通过监督微调（SFT）为LLM提供引导，随后进行多轮强化学习（RL）以激活其潜在的世界知识。实验结果表明，SCOUT使Qwen2.5-3B-Instruct模型的平均得分达到0.86，显著优于包括Gemini-2.5-Pro（0.60）在内的专有模型，同时节省约60%的GPU小时消耗。

Influence Guided Sampling for Domain Adaptation of Text Retrievers

影响引导采样用于文本检索器的域适应

Authors: Meet Doshi, Vishwajeet Kumar, Yulong Li, Jaydeep Sen
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.21759
Pdf link: https://arxiv.org/pdf/2601.21759
Abstract General-purpose open-domain dense retrieval systems are usually trained with a large, eclectic mix of corpora and search tasks. How should these diverse corpora and tasks be sampled for training? Conventional approaches sample them uniformly, proportional to their instance population sizes, or depend on human-level expert supervision. It is well known that the training data sampling strategy can greatly impact model performance. However, how to find the optimal strategy has not been adequately studied in the context of embedding models. We propose Inf-DDS, a novel reinforcement learning driven sampling framework that adaptively reweighs training datasets guided by influence-based reward signals and is much more lightweight with respect to GPU consumption. Our technique iteratively refines the sampling policy, prioritizing datasets that maximize model performance on a target development set. We evaluate the efficacy of our sampling strategy on a wide range of text retrieval tasks, demonstrating strong improvements in retrieval performance and better adaptation compared to existing gradient-based sampling methods, while also being 1.5x to 4x cheaper in GPU compute. Our sampling strategy achieves a 5.03 absolute NDCG@10 improvement while training a multilingual bge-m3 model and an absolute NDCG@10 improvement of 0.94 while training all-MiniLM-L6-v2, even when starting from expert-assigned weights on a large pool of training datasets.
中文摘要 通用的开放域密集检索系统通常通过大量且多样化的语料库和搜索任务进行训练。这些多样化的语料库和任务应如何抽样用于培训？传统方法则根据实例总体规模均匀抽样，或依赖人类专家监督。众所周知，训练数据采样策略能极大地影响模型性能。然而，如何在嵌入模型的背景下找到最优策略尚未得到充分研究。我们提出了Inf-DDS，一种新型强化学习驱动采样框架，能够自适应地重新权重基于影响的奖励信号的训练数据集，并且在GPU占用方面更加轻量级。我们的技术通过迭代优化抽样策略，优先选择最大化目标开发集模型性能的数据集。我们评估了采样策略在多种文本检索任务中的有效性，显示其检索性能和适应性优于现有梯度采样方法，同时GPU计算成本降低1.5至4倍。我们的抽样策略在训练多语言bge-m3模型时实现了5.03的绝对NDCG@10提升，在训练全MiniLM-L6-v2时，绝对NDCG@10提升为0.94，即使从专家分配的权重开始，且从大量训练数据集中开始。

OneMall: One Model, More Scenarios -- End-to-End Generative Recommender Family at Kuaishou E-Commerce

OneMall：一个模型，更多场景——快手电商端到端生成推荐器家族

Authors: Kun Zhang, Jingming Zhang, Wei Cheng, Yansong Cheng, Jiaqi Zhang, Hao Lu, Xu Zhang, Haixiang Gan, Jiangxia Cao, Tenglong Wang, Ximing Zhang, Boyang Xia, Kuo Cai, Shiyao Wang, Hongjian Dou, Jinkai Yu, Mingxing Wen, Qiang Luo, Dongxu Liang, Chenyi Lei, Jun Wang, Runan Liu, Zhaojie Liu, Ruiming Tang, Tingting Gao, Shaoguo Liu, Yuqing Ding, Hui Kong, Han Li, Guorui Zhou, Wenwu Ou, Kun Gai
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2601.21770
Pdf link: https://arxiv.org/pdf/2601.21770
Abstract In the wave of generative recommendation, we present OneMall, an end-to-end generative recommendation framework tailored for e-commerce services at Kuaishou. OneMall systematically unifies multiple e-commerce item distribution scenarios, such as product-card, short-video, and live-streaming. Specifically, it comprises three key components, aligning the entire model training pipeline to the LLM's pre-training/post-training: (1) E-commerce Semantic Tokenizer: we provide a tokenizer solution that captures both real-world semantics and business-specific item relations across different scenarios; (2) Transformer-based Architecture: we largely utilize Transformer as our model backbone, e.g., employing Query-Former for long sequence compression, Cross-Attention for multi-behavior sequence fusion, and Sparse MoE for scalable auto-regressive generation; (3) Reinforcement Learning Pipeline: we further connect retrieval and ranking models via RL, enabling the ranking model to serve as a reward signal for end-to-end policy retrieval model optimization. Extensive experiments demonstrate that OneMall achieves consistent improvements across all e-commerce scenarios: +13.01% GMV in Product-Card, +15.32% Orders in Short-Video, and +2.78% Orders in Live-Streaming. OneMall has been deployed, serving over 400 million daily active users at Kuaishou.
中文摘要 在生成式推荐的浪潮中，我们推出了OneMall，这是一套专为快手电商服务量身定制的端到端生成式推荐框架。OneMall系统性地统一了电商的多种商品分发场景，如商品卡、短视频和直播。具体来说，它包含三个关键组件，使整个模型训练流程与LLM的预训练/后训练对齐：（1）电商语义分词器：我们提供一个分词器方案，能够捕捉不同场景下的真实世界语义和业务特定的商品关系；（2）基于Transformer的架构：我们主要以Transformer作为模型骨干，例如采用Query-Former进行长序列压缩，Cross-Attention用于多行为序列融合，Sparse MoE用于可扩展的自回归生成；（3）强化学习流水线：我们通过强化学习进一步连接检索模型和排序模型，使排序模型能够作为端到端策略检索模型优化的奖励信号。大量实验表明，OneMall 在所有电商场景下都取得了一致的提升：商品卡 GMV +13.01%，短视频订单数 +15.32%，直播订单数 +2.78%。OneMall 已上线部署，服务于快手超过4亿的日活跃用户。

Error Amplification Limits ANN-to-SNN Conversion in Continuous Control

误差放大限制了连续控制中人工神经网络到脉冲神经网络的转换

Authors: Zijie Xu, Zihan Huang, Yiting Dong, Kang Chen, Wenxuan Liu, Zhaofei Yu
Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.21778
Pdf link: https://arxiv.org/pdf/2601.21778
Abstract Spiking Neural Networks (SNNs) can achieve competitive performance by converting already existing well-trained Artificial Neural Networks (ANNs), avoiding further costly training. This property is particularly attractive in Reinforcement Learning (RL), where training through environment interaction is expensive and potentially unsafe. However, existing conversion methods perform poorly in continuous control, where suitable baselines are largely absent. We identify error amplification as the key cause: small action approximation errors become temporally correlated across decision steps, inducing cumulative state distribution shift and severe performance degradation. To address this issue, we propose Cross-Step Residual Potential Initialization (CRPI), a lightweight training-free mechanism that carries over residual membrane potentials across decision steps to suppress temporally correlated errors. Experiments on continuous control benchmarks with both vector and visual observations demonstrate that CRPI can be integrated into existing conversion pipelines and substantially recovers lost performance. Our results highlight continuous control as a critical and challenging benchmark for ANN-to-SNN conversion, where small errors can be strongly amplified and impact performance.
中文摘要 脉冲神经网络（SNN）可以通过转换已有的、训练良好的人工神经网络（ANN）获得有竞争力的性能，避免进一步的昂贵训练。这一特性在强化学习（RL）中尤为有吸引力，因为通过环境交互进行训练成本高昂且可能不安全。然而，现有的转换方法在连续控制中表现不佳，而该领域基本缺乏合适的基线。我们指出误差放大是关键原因：微小的动作近似误差在决策步之间产生时间相关性，导致累积的状态分布偏移和严重的性能下降。为解决这一问题，我们提出了跨步残差电位初始化（CRPI），这是一种轻量级、无需训练的机制，它将残余膜电位在决策步之间传递，以抑制时间相关的误差。在包含向量观测和视觉观测的连续控制基准上的实验表明，CRPI可以集成到现有转换流程中，并大幅恢复损失的性能。我们的结果凸显了连续控制是ANN到SNN转换的一个关键且具有挑战性的基准，其中微小误差可能被强烈放大并影响性能。
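
The mechanism named in the abstract above, carrying residual membrane potential across decision steps, can be illustrated with a single integrate-and-fire neuron. This is a toy sketch, not the paper's CRPI implementation: it assumes a soft-reset neuron with constant input, uses integer units so the arithmetic is exact, and compares resetting the potential at every decision step against carrying the residual over. All names and values are hypothetical.

```python
def run_decision_step(current, threshold, timesteps, v_init=0):
    """Simulate one decision step of a soft-reset integrate-and-fire neuron.

    Returns (spike_count, residual_potential).
    """
    v = v_init
    spikes = 0
    for _ in range(timesteps):
        v += current
        if v >= threshold:
            spikes += 1
            v -= threshold   # soft reset keeps the overshoot
    return spikes, v


def total_spikes(current, threshold, timesteps, decisions, carry_over):
    """Count spikes over several decision steps, with or without carry-over."""
    v = 0
    total = 0
    for _ in range(decisions):
        spikes, residual = run_decision_step(current, threshold, timesteps, v)
        total += spikes
        v = residual if carry_over else 0   # discard the residual when resetting
    return total


INPUT_CURRENT, THRESHOLD, TIMESTEPS, DECISIONS = 3, 10, 4, 10
steps = TIMESTEPS * DECISIONS

target_rate = INPUT_CURRENT / THRESHOLD   # activation the ANN would output
reset_rate = total_spikes(INPUT_CURRENT, THRESHOLD, TIMESTEPS, DECISIONS, False) / steps
carry_rate = total_spikes(INPUT_CURRENT, THRESHOLD, TIMESTEPS, DECISIONS, True) / steps
```

With a per-step reset the neuron makes the same rounding error at every decision step (a firing rate of 0.25 against a target of 0.3), so the error is correlated over time. Carrying the residual lets the errors cancel across steps, and the average rate matches the target in this example.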

Distribution-Aware Reward Estimation for Test-Time Reinforcement Learning

分布感知奖励估计用于测试时间强化学习

Authors: Bodong Du, Xuanqi Huang, Xiaomeng Li
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.21804
Pdf link: https://arxiv.org/pdf/2601.21804
Abstract Test-time reinforcement learning (TTRL) enables large language models (LLMs) to self-improve on unlabeled inputs, but its effectiveness critically depends on how reward signals are estimated without ground-truth supervision. Most existing TTRL methods rely on majority voting (MV) over rollouts to produce deterministic rewards, implicitly assuming that the majority rollout provides a reliable learning signal. We show that this assumption is fragile: MV reduces the rollout distribution to a single outcome, discarding information about non-majority but correct action candidates, and yields systematically biased reward estimates. To address this, we propose Distribution-Aware Reward Estimation (DARE), which shifts reward estimation from a single majority outcome to the full empirical rollout distribution. DARE further augments this distribution-based reward with an exploration bonus and a distribution pruning mechanism for non-majority rollout exploration and reward denoising, yielding a more informative and robust reward estimation. Extensive experiments on challenging reasoning benchmarks show that DARE improves optimization stability and final performance over recent baselines, achieving relative improvements of 25.3% on challenging AIME 2024 and 5.3% on AMC.
中文摘要 测试时强化学习（TTRL）使大型语言模型（LLM）能够在未标注输入上自我改进，但其有效性关键取决于如何在没有真实标签监督的情况下估计奖励信号。大多数现有TTRL方法依赖对多次采样结果（rollout）进行多数投票（MV）来产生确定性奖励，隐含假设多数结果提供了可靠的学习信号。我们证明了这一假设是脆弱的：MV将rollout分布简化为单一结果，丢弃了非多数但正确的候选动作的信息，并产生系统性偏差的奖励估计。为此，我们提出了分布感知奖励估计（DARE），将奖励估计从单一多数结果转变为完整的经验rollout分布。DARE进一步为这种基于分布的奖励增加了探索奖励和分布剪枝机制，用于非多数rollout的探索和奖励去噪，从而获得更有信息量且更稳健的奖励估计。在具有挑战性的推理基准上的大量实验表明，DARE相比近期基线提升了优化稳定性和最终性能，在AIME 2024上取得了25.3%的相对提升，在AMC上取得了5.3%的相对提升。
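
The contrast between majority-vote rewards and distribution-based rewards can be sketched as follows. The abstract does not give DARE's formula, so the frequency-based reward here is an assumption, and the exploration bonus and pruning step are omitted; the function names are hypothetical.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Reward 1.0 for rollouts matching the majority answer, else 0.0."""
    counts = Counter(answers)
    majority, _ = counts.most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]


def distribution_rewards(answers):
    """Assumed stand-in: reward each rollout by the empirical frequency
    of its final answer among all rollouts for the same prompt."""
    counts = Counter(answers)
    total = len(answers)
    return [counts[a] / total for a in answers]


# Final answers extracted from six rollouts of one unlabeled prompt.
answers = ["42", "42", "42", "17", "17", "8"]
print(majority_vote_rewards(answers))
print(distribution_rewards(answers))
```

Under majority voting the two rollouts answering "17" receive the same zero reward as the lone "8"; the frequency-based variant keeps them distinguishable, which is the kind of information the abstract says MV discards.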

Constrained Meta Reinforcement Learning with Provable Test-Time Safety

具备可证明测试时间安全性的受限元强化学习

Authors: Tingting Ni, Maryam Kamgarpour
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.21845
Pdf link: https://arxiv.org/pdf/2601.21845
Abstract Meta reinforcement learning (RL) allows agents to leverage experience across a distribution of tasks on which the agent can train at will, enabling faster learning of optimal policies on new test tasks. Although meta RL succeeds in improving sample complexity on test tasks, many real-world applications, such as robotics and healthcare, impose safety constraints during testing. Constrained meta RL provides a promising framework for integrating safety into meta RL. An open question in constrained meta RL is how to ensure the safety of the policy on the real-world test task while reducing the sample complexity and thus enabling faster learning of optimal policies. To address this gap, we propose an algorithm that refines policies learned during training, with provable safety and sample complexity guarantees for learning a near-optimal policy on the test tasks. We further derive a matching lower bound, showing that this sample complexity is tight.
中文摘要 元强化学习（RL）允许代理在任务分布中利用经验，代理可以随意训练，从而更快地学习新测试任务的最优策略。尽管在提升测试任务样品复杂度方面取得了成功，许多现实应用如机器人和医疗在测试过程中仍存在安全限制。受限元强化学习为将安全性整合到元强化学习提供了有前景的框架。在受限元强化学习中，一个开放的问题是如何确保策略在现实测试任务中的安全性，同时降低样本复杂度，从而加快对最优策略的学习。为弥补这一空白，我们提出了一种算法，能够优化训练中学到的策略，并保证在测试任务中学习近似最优策略，并保证安全性和样本复杂度。我们进一步推导出匹配的下界，表明该样本复杂度非常紧密。

READY: Reward Discovery for Meta-Black-Box Optimization

READY：面向元黑盒优化的奖励发现

Authors: Zechuan Huang, Zhiguang Cao, Hongshu Guo, Yue-Jiao Gong, Zeyuan Ma
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/2601.21847
Pdf link: https://arxiv.org/pdf/2601.21847
Abstract Meta-Black-Box Optimization (MetaBBO) is an emerging avenue within the optimization community, where an algorithm design policy can be meta-learned by reinforcement learning to enhance optimization performance. So far, the reward functions in existing MetaBBO works have been designed by human experts, introducing design bias and the risk of reward hacking. In this paper, we use a Large Language Model (LLM) as an automated reward discovery tool for MetaBBO. Specifically, we consider both effectiveness and efficiency. On the effectiveness side, we borrow the idea of evolution of heuristics, introducing a tailored evolution paradigm into the iterative LLM-based program search process, which ensures continuous improvement. On the efficiency side, we additionally introduce a multi-task evolution architecture to support parallel reward discovery for diverse MetaBBO approaches. This parallel process also benefits from knowledge sharing across tasks to accelerate convergence. Empirical results demonstrate that the reward functions discovered by our approach can boost existing MetaBBO works, underscoring the importance of reward design in MetaBBO. We provide READY's project at this https URL.
中文摘要 元黑盒优化（MetaBBO）是优化社区中新兴的一种途径，通过强化学习可以元学习算法设计策略，以提升优化性能。迄今为止，现有MetaBBO工作中的奖励函数由人类专家设计，这带来了一定的设计偏见和奖励黑客风险。本文中，我们将大型语言模型（LLM）作为MetaBBO的自动奖励发现工具。具体来说，我们同时考虑了效能和效率两方面。在效能方面，我们借鉴启发式演化的理念，在基于LLM的迭代程序搜索过程中引入了定制化的演化范式，确保了持续改进。在效率方面，我们还引入了多任务演化架构，支持多样化MetaBBO方法的并行奖励发现。这种并行过程还受益于跨任务的知识共享，以加速收敛。实证结果表明，我们方法发现的奖励函数有助于提升现有的MetaBBO工作，凸显了奖励设计在MetaBBO中的重要性。我们通过这个 https URL 提供 READY 的项目。

Spatiotemporal Continual Learning for Mobile Edge UAV Networks: Mitigating Catastrophic Forgetting

移动边缘无人机网络的时空持续学习：缓解灾难性遗忘

Authors: Chuan-Chi Lai
Subjects: Networking and Internet Architecture (cs.NI); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2601.21861
Pdf link: https://arxiv.org/pdf/2601.21861
Abstract This paper addresses the critical challenge of coordinating mobile edge UAV networks to maintain robust service in highly dynamic spatiotemporal environments. Conventional Deep Reinforcement Learning (DRL) approaches often suffer from catastrophic forgetting when transitioning between distinct task scenarios, such as moving from dense urban clusters to sparse rural areas. These transitions typically necessitate computationally expensive retraining or model resets to adapt to new user distributions, leading to service interruptions. To overcome these limitations, we propose a computationally efficient Spatiotemporal Continual Learning (STCL) framework realized through a Group-Decoupled Multi-Agent Proximal Policy Optimization (G-MAPPO) algorithm. Our approach integrates a novel Group-Decoupled Policy Optimization (GDPO) mechanism that utilizes dynamic z-score normalization to autonomously balance heterogeneous objectives, including energy efficiency, user fairness, and coverage. This mechanism effectively mitigates gradient conflicts induced by concept drifts without requiring offline retraining. Furthermore, the framework leverages the 3D mobility of UAVs as a spatial compensation layer, enabling the swarm to autonomously adjust altitudes to accommodate extreme density fluctuations. Extensive simulations demonstrate that the proposed STCL framework achieves superior resilience, characterized by an elastic recovery of service reliability to approximately 0.95 during phase transitions. Compared to the MADDPG baseline, G-MAPPO not only prevents knowledge forgetting but also delivers an effective capacity gain of 20% under extreme traffic loads, validating its potential as a scalable solution for edge-enabled aerial swarms.
中文摘要 本文探讨了协调移动边缘无人机网络以在高度动态时空环境中保持稳健服务的关键挑战。传统的深度强化学习（DRL）方法在不同任务场景之间切换时，比如从密集城市集群迁移到稀疏农村地区，常常会出现灾难性遗忘问题。这些转换通常需要计算量高的重新训练或模型重置以适应新用户分布，导致服务中断。为克服这些限制，我们提出了一个计算高效的时空持续学习（STCL）框架，通过群解耦多智能体近端策略优化（G-MAPPO）算法实现。我们的方法整合了一种新颖的群解耦策略优化（GDPO）机制，利用动态z分数归一化，自主平衡不同异质目标，包括能源效率、用户公平性和覆盖范围。该机制有效减轻了由概念漂移引起的梯度冲突，而无需离线重新训练。此外，该框架利用无人机的三维机动性作为空间补偿层，使无人机群能够自主调整高度，以适应极端的密度波动。大量模拟表明，所提STCL框架实现了卓越的韧性，特点是在阶段切换期间服务可靠性弹性恢复至约0.95。与MADDPG基线相比，G-MAPPO不仅防止了知识遗忘，还在极端流量负载下实现了20%的有效容量提升，验证了其作为边缘赋能空中集群的可扩展解决方案的潜力。
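
Dynamic z-score normalization of objectives that live on very different scales might look like the sketch below. This is an assumed illustration, not the paper's GDPO mechanism: it uses Welford's running mean and variance, a hypothetical class name `RunningZScore`, and an unweighted sum of the normalized terms.

```python
import math


class RunningZScore:
    """Running mean/variance (Welford's algorithm) for one objective."""

    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean
        self.eps = eps

    def update(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    def normalize(self, value):
        if self.count < 2:
            return 0.0  # not enough history to estimate a spread
        std = math.sqrt(self.m2 / self.count)
        return (value - self.mean) / (std + self.eps)


def combined_reward(normalizers, raw_objectives):
    """Update each objective's statistics, then sum the z-scored values."""
    total = 0.0
    for name, value in raw_objectives.items():
        normalizers[name].update(value)
        total += normalizers[name].normalize(value)
    return total


# Hypothetical objectives with mismatched scales.
normalizers = {k: RunningZScore() for k in ("energy", "fairness", "coverage")}
history = [
    {"energy": 120.0, "fairness": 0.60, "coverage": 0.80},
    {"energy": 150.0, "fairness": 0.65, "coverage": 0.70},
    {"energy": 90.0, "fairness": 0.90, "coverage": 0.95},
]
for step in history:
    print(round(combined_reward(normalizers, step), 3))
```

Because each term is rescaled by its own running statistics, an objective measured in the hundreds does not automatically dominate one measured between 0 and 1.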

WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents

WebArbiter：基于原则引导的推理过程奖励模型，适用于网络代理

Authors: Yao Zhang, Shijie Tang, Zeyu Li, Zhen Han, Volker Tresp
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.21872
Pdf link: https://arxiv.org/pdf/2601.21872
Abstract Web agents hold great potential for automating complex computer tasks, yet their interactions involve long-horizon, sequential decision-making with irreversible actions. In such settings, outcome-based supervision is sparse and delayed, often rewarding incorrect trajectories and failing to support inference-time scaling. This motivates the use of Process Reward Models (WebPRMs) for web navigation, but existing approaches remain limited: scalar WebPRMs collapse progress into coarse, weakly grounded signals, while checklist-based WebPRMs rely on brittle template matching that fails under layout or semantic changes and often mislabels superficially correct actions as successful, providing little insight or interpretability. To address these challenges, we introduce WebArbiter, a reasoning-first, principle-inducing WebPRM that formulates reward modeling as text generation, producing structured justifications that conclude with a preference verdict and identify the action most conducive to task completion under the current context. Training follows a two-stage pipeline: reasoning distillation equips the model with coherent principle-guided reasoning, and reinforcement learning corrects teacher biases by directly aligning verdicts with correctness, enabling stronger generalization. To support systematic evaluation, we release WebPRMBench, a comprehensive benchmark spanning four diverse web environments with rich tasks and high-quality preference annotations. On WebPRMBench, WebArbiter-7B outperforms the strongest baseline, GPT-5, by 9.1 points. In reward-guided trajectory search on WebArena-Lite, it surpasses the best prior WebPRM by up to 7.2 points, underscoring its robustness and practical value in real-world complex web tasks.
中文摘要 网络代理在自动化复杂计算机任务方面具有巨大潜力，但其交互涉及长期、连续的决策，且不可逆转。在这种情况下，基于结果的监督稀疏且延迟，常奖励错误的轨迹，且无法支持推断时间缩放。这促使人们在网页导航中使用流程奖励模型（WebPRM），但现有方法仍然有限：标量WebPRM将进度压缩成粗糙且基础薄弱的信号，而基于清单的WebPRM依赖脆弱的模板匹配，在布局或语义变更下失败，且常将表面正确的作错误标记为成功，缺乏洞察或可解释性。为应对这些挑战，我们引入了WebArbiter，一种以推理为先、原则导向的WebPRM，它将奖励建模构建为文本生成，生成结构化的理由，最终以偏好判决结束，并识别当前语境下最有利于完成任务的行动。培训遵循两阶段流程：推理提炼为模型提供连贯的原则引导推理，强化学习通过直接将判决与正确性对齐来纠正教师偏见，从而实现更强的泛化。为支持系统性评估，我们发布了WebPRMBench，这是一个涵盖四种不同网络环境的综合基准测试，拥有丰富的任务和高质量的偏好注释。在WebPRMBench上，WebArbiter-7B的表现比最强基线GPT-5高出9.1个百分点。在WebArena-Lite的奖励引导轨迹搜索中，它比之前最好的WebPRM高出多达7.2分，凸显了其在现实复杂网络任务中的稳健性和实用价值。

From Meta-Thought to Execution: Cognitively Aligned Post-Training for Generalizable and Reliable LLM Reasoning

从元思维到执行：认知对齐的后训练，实现可推广且可靠的大型语言模型推理

Authors: Shaojie Wang, Liang Zhang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.21909
Pdf link: https://arxiv.org/pdf/2601.21909
Abstract Current LLM post-training methods optimize complete reasoning trajectories through Supervised Fine-Tuning (SFT) followed by outcome-based Reinforcement Learning (RL). While effective, a closer examination reveals a fundamental gap: this approach does not align with how humans actually solve problems. Human cognition naturally decomposes problem-solving into two distinct stages: first acquiring abstract strategies (i.e., meta-knowledge) that generalize across problems, then adapting them to specific instances. In contrast, by treating complete trajectories as basic units, current methods are inherently problem-centric, entangling abstract strategies with problem-specific execution. To address this misalignment, we propose a cognitively inspired framework that explicitly mirrors the two-stage human cognitive process. Specifically, Chain-of-Meta-Thought (CoMT) focuses supervised learning on abstract reasoning patterns without specific executions, enabling acquisition of generalizable strategies. Confidence-Calibrated Reinforcement Learning (CCRL) then optimizes task adaptation via confidence-aware rewards on intermediate steps, preventing overconfident errors from cascading and improving execution reliability. Experiments across four models and eight benchmarks show 2.19% and 4.63% improvements in-distribution and out-of-distribution respectively over standard methods, while reducing training time by 65-70% and token consumption by 50%, demonstrating that aligning post-training with human cognitive principles yields not only superior generalization but also enhanced training efficiency.
中文摘要 当前的LLM后训练方法通过监督微调（SFT）优化完整推理轨迹，随后进行基于结果的强化学习（RL）。虽然有效，但仔细观察会发现一个根本性空白：这种方法与人类实际解决问题的方式不一致。人类认知自然将问题解决分为两个不同阶段：首先获得能够在问题上泛化的抽象策略（即元知识），然后将其调整到具体实例。相比之下，通过将完整轨迹视为基本单位，现有方法本质上是以问题为中心的，将抽象策略与针对问题的执行纠缠在一起。为了解决这种错位，我们提出了一个认知启发框架，明确反映了人类认知过程的两阶段。具体来说，元思维链（CoMT）将监督学习聚焦于抽象推理模式，而无需具体执行，从而实现可推广策略的习得。置信度校准强化学习（CCRL）通过置信度感知奖励在中间步骤优化任务适应，防止过度自信的错误连锁反应，提高执行可靠性。四个模型和八个基准测试的实验显示，分布内和分布外分别比标准方法提升2.19%和4.63%，同时训练时间减少65-70%，token消耗减少50%，表明将后训练与人类认知原则对齐不仅能带来更优越的泛化，还能提升训练效率。

ProRAG: Process-Supervised Reinforcement Learning for Retrieval-Augmented Generation

ProRAG：用于检索增强生成的过程监督强化学习

Authors: Zhao Wang, Ziliang Zhao, Zhicheng Dou
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2601.21912
Pdf link: https://arxiv.org/pdf/2601.21912
Abstract Reinforcement learning (RL) has become a promising paradigm for optimizing Retrieval-Augmented Generation (RAG) in complex reasoning tasks. However, traditional outcome-based RL approaches often suffer from reward sparsity and inefficient credit assignment, as coarse-grained scalar rewards fail to identify specific erroneous steps within long-horizon trajectories. This ambiguity frequently leads to "process hallucinations", where models reach correct answers through flawed logic or redundant retrieval steps. Although recent process-aware approaches attempt to mitigate this via static preference learning or heuristic reward shaping, they often lack the on-policy exploration capabilities required to decouple step-level credit from global outcomes. To address these challenges, we propose ProRAG, a process-supervised reinforcement learning framework designed to integrate learned step-level supervision into the online optimization loop. Our framework consists of four stages: (1) Supervised Policy Warmup to initialize the model with a structured reasoning format; (2) construction of an MCTS-based Process Reward Model (PRM) to quantify intermediate reasoning quality; (3) PRM-Guided Reasoning Refinement to align the policy with fine-grained process preferences; and (4) Process-Supervised Reinforcement Learning with a dual-granularity advantage mechanism. By aggregating step-level process rewards with global outcome signals, ProRAG provides precise feedback for every action. Extensive experiments on five multi-hop reasoning benchmarks demonstrate that ProRAG achieves superior overall performance compared to strong outcome-based and process-aware RL baselines, particularly on complex long-horizon tasks, validating the effectiveness of fine-grained process supervision. The code and model are available at this https URL.
中文摘要 强化学习（RL）已成为优化复杂推理任务中检索增强生成（RAG）的有前景范式。然而，传统的基于结果的强化学习方法常常存在奖励稀疏和信用分配效率低下的问题，因为粗粒度标量奖励未能识别长期轨迹中的具体错误步骤。这种模糊性常导致“过程幻觉”，即模型通过错误的逻辑或冗余的检索步骤得出正确答案。尽管近期的流程感知方法试图通过静态偏好学习或启发式奖励塑造来缓解这一问题，但它们往往缺乏将阶级信用与整体结果分离所需的政策内探索能力。为应对这些挑战，我们提出了ProRAG，一种过程监督强化学习框架，旨在将学习到的步骤级监督整合进在线优化循环。我们的框架包含四个阶段：（1）监督式政策热身，以结构化推理格式初始化模型;（2）构建基于MCTS的过程奖励模型（PRM），以量化中间推理质量;（3）PRM引导推理细化，使政策与细粒度流程偏好保持一致;以及（4）具有双粒度优势机制的过程监督强化学习。通过将步骤级的过程奖励与全局结果信号进行聚合，ProRAG为每一个动作提供精确的反馈。五个多跳推理基准测试的广泛实验表明，ProRAG在复杂长视野任务中优于强的结果导向和过程感知强化学习基线，验证了细粒度过程监督的有效性。代码和模型可在该 https URL 访问。

Self-Compression of Chain-of-Thought via Multi-Agent Reinforcement Learning

通过多智能体强化学习实现思维链的自我压缩

Authors: Yiqun Chen, Jinyuan Feng, Wei Yang, Meizhi Zhong, Zhengliang Shi, Rui Li, Xiaochi Wei, Yan Gao, Yi Wu, Yao Hu, Zhiqiang Pu, Jiaxin Mao
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.21919
Pdf link: https://arxiv.org/pdf/2601.21919
Abstract The inference overhead induced by redundant reasoning undermines the interactive experience and severely bottlenecks the deployment of Large Reasoning Models. Existing reinforcement learning (RL)-based solutions tackle this problem by coupling a length penalty with outcome-based rewards. This simplistic reward weighting struggles to reconcile brevity with accuracy, as enforcing brevity may compromise critical reasoning logic. In this work, we address this limitation by proposing a multi-agent RL framework that selectively penalizes redundant chunks, while preserving essential reasoning logic. Our framework, Self-Compression via MARL (SCMA), instantiates redundancy detection and evaluation through two specialized agents: a Segmentation Agent for decomposing the reasoning process into logical chunks, and a Scoring Agent for quantifying the significance of each chunk. The Segmentation and Scoring agents collaboratively define an importance-weighted length penalty during training, incentivizing a Reasoning Agent to prioritize essential logic without introducing inference overhead during deployment. Empirical evaluations across model scales demonstrate that SCMA reduces response length by 11.1% to 39.0% while boosting accuracy by 4.33% to 10.02%. Furthermore, ablation studies and qualitative analysis validate that the synergistic optimization within the MARL framework fosters emergent behaviors, yielding more powerful LRMs compared to vanilla RL paradigms.
中文摘要 冗余推理带来的推理开销削弱了交互体验，严重限制了大型推理模型的部署。现有基于强化学习（RL）的解决方案通过将长度惩罚与基于结果的奖励相结合来解决这一问题。这种简单的奖励加权难以平衡简洁与准确性，因为强制简洁可能会损害关键推理逻辑。在本研究中，我们通过提出一个多智能体强化学习框架来解决这一限制，该框架选择性惩罚冗余块，同时保留了基本的推理逻辑。我们的框架名为通过MARL自压缩（SCMA），它通过两个专门的智能体实现冗余检测和评估：分割代理用于将推理过程分解为逻辑块，评分代理用于量化每个块的重要性。分割代理和评分代理协同定义训练期间的重要性加权长度惩罚，激励推理代理优先考虑核心逻辑，同时避免在部署时产生推理开销。跨模型尺度的实证评估表明，SCMA将响应长度缩短11.1%至39.0%，同时准确率提升4.33%至10.02%。此外，消融研究和定性分析验证了MARL框架内的协同优化能够促进涌现行为，从而产生比普通强化学习范式更强大的LRM。
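
An importance-weighted length penalty of the kind described above could be sketched as follows. The exact form used by SCMA is not stated in the abstract; the `(1 - importance)` weighting, the `scale` parameter, and the function names are assumptions made for illustration.

```python
def importance_weighted_length_penalty(chunks, scale=0.001):
    """Penalize tokens in proportion to how unimportant their chunk is.

    chunks: list of (token_count, importance) pairs, importance in [0, 1],
    standing in for the outputs of a segmentation step and a scoring step.
    """
    return scale * sum(tokens * (1.0 - importance) for tokens, importance in chunks)


def shaped_reward(outcome_reward, chunks, scale=0.001):
    """Outcome reward minus the importance-weighted length penalty."""
    return outcome_reward - importance_weighted_length_penalty(chunks, scale)


# One essential 100-token chunk and one fully redundant 200-token chunk.
chunks = [(100, 1.0), (200, 0.0)]
print(shaped_reward(1.0, chunks))
```

Unlike a flat length penalty, the essential chunk contributes nothing to the penalty here, so shortening it would not raise the reward.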

Optimistic Transfer under Task Shift via Bellman Alignment

通过贝尔曼对齐实现任务偏移下的乐观迁移

Authors: Jinhang Chai, Enpei Zhang, Elynn Chen, Yujun Yan
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2601.21924
Pdf link: https://arxiv.org/pdf/2601.21924
Abstract We study online transfer reinforcement learning (RL) in episodic Markov decision processes, where experience from related source tasks is available during learning on a target task. A fundamental difficulty is that task similarity is typically defined in terms of rewards or transitions, whereas online RL algorithms operate on Bellman regression targets. As a result, naively reusing source Bellman updates introduces systematic bias and invalidates regret guarantees. We identify one-step Bellman alignment as the correct abstraction for transfer in online RL and propose re-weighted targeting (RWT), an operator-level correction that retargets continuation values and compensates for transition mismatch via a change of measure. RWT reduces task mismatch to a fixed one-step correction and enables statistically sound reuse of source data. This alignment yields a two-stage RWT Q-learning framework that separates variance reduction from bias correction. Under RKHS function approximation, we establish regret bounds that scale with the complexity of the task shift rather than the target MDP. Empirical results in both tabular and neural network settings demonstrate consistent improvements over single-task learning and naïve pooling, highlighting Bellman alignment as a model-agnostic transfer principle for online RL.
中文摘要 我们研究回合式马尔可夫决策过程中的在线迁移强化学习（RL），其中在目标任务的学习过程中可以获得相关源任务的经验。一个根本的难点是，任务相似性通常以奖励或转移来定义，而在线强化学习算法则基于贝尔曼回归目标运作。因此，简单地重复使用源任务的贝尔曼更新会带来系统性偏差，并使遗憾保证失效。我们将单步贝尔曼对齐确定为在线强化学习中迁移的正确抽象，并提出了重加权目标（RWT），这是一种算子级修正方法，通过测度变换重新定位延续值并补偿转移不匹配。RWT将任务不匹配归结为固定的单步修正，并实现了统计上合理的源数据复用。这种对齐形成了一个两阶段的RWT Q学习框架，将方差缩减与偏差校正分离。在RKHS函数近似下，我们建立了随任务偏移复杂度而非目标MDP复杂度变化的遗憾界。在表格和神经网络环境中的实证结果显示，该方法相对于单任务学习和简单合并数据均有稳定提升，凸显了贝尔曼对齐作为在线强化学习中模型无关的迁移原则。

OVD: On-policy Verbal Distillation

OVD：同策略语言蒸馏

Authors: Jing Xiong, Hui Shen, Shansan Gong, Yuxin Cheng, Jianghan Shen, Chaofan Tao, Haochen Tan, Haoli Bai, Lifeng Shang, Ngai Wong
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.21968
Pdf link: https://arxiv.org/pdf/2601.21968
Abstract Knowledge distillation offers a promising path to transfer reasoning capabilities from large teacher models to efficient student models; however, existing token-level on-policy distillation methods require token-level alignment between the student and teacher models, which restricts the student model's exploration ability, prevents effective use of interactive environment feedback, and suffers from severe memory bottlenecks in reinforcement learning. We introduce On-policy Verbal Distillation (OVD), a memory-efficient framework that replaces token-level probability matching with trajectory matching using discrete verbal scores (0-9) from teacher models. OVD dramatically reduces memory consumption while enabling on-policy distillation from teacher models with verbal feedback, and avoids token-level alignment, allowing the student model to freely explore the output space. Extensive experiments on Web question answering and mathematical reasoning tasks show that OVD substantially outperforms existing methods, delivering up to +12.9% absolute improvement in average EM on Web Q&A tasks and up to +25.7% gain on math benchmarks (when trained with only one random sample), while also exhibiting superior training efficiency. Our project page is available at this https URL
中文摘要 知识蒸馏为将推理能力从大型教师模型转移到高效学生模型提供了有前景的道路;然而，现有的令牌级策略提炼方法要求学生和教师模型之间进行令牌级对齐，这限制了学生模型的探索能力，阻碍了互动环境反馈的有效利用，并在强化学习中存在严重的记忆瓶颈。我们引入了策略上的语言蒸馏（OVD），这是一种内存高效的框架，用教师模型中的离散语语分数（0--9）替代了代币级概率匹配，采用轨迹匹配。OVD大幅减少了内存消耗，同时实现了基于策略的教师模型和口头反馈的提炼，避免了令牌级对齐，使学生模型能够自由探索输出空间。对网络问答和数学推理任务的大量实验表明，OVD显著优于现有方法，在Web问答任务中平均EM绝对提升高达+12.9%，数学基准（仅用一个随机样本训练时）提升高达+25.7%，同时表现出更优越的训练效率。我们的项目页面可在此 https URL 访问

Token-Guard: Towards Token-Level Hallucination Control via Self-Checking Decoding

令牌守护：通过自我检查解码实现令牌级幻觉控制

Authors: Yifan Zhu, Huiqiang Rong, Haoran Luo
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.21969
Pdf link: https://arxiv.org/pdf/2601.21969
Abstract Large Language Models (LLMs) often hallucinate, generating content inconsistent with the input. Retrieval-Augmented Generation (RAG) and Reinforcement Learning with Human Feedback (RLHF) can mitigate hallucinations but require resource-intensive retrieval or large-scale fine-tuning. Decoding-based methods are lighter yet lack explicit hallucination control. To address this, we present Token-Guard, a token-level hallucination control method based on self-checking decoding. Token-Guard performs internal verification at each reasoning step to detect hallucinated tokens before they propagate. Candidate fragments are further evaluated in a latent space with explicit hallucination risk scoring, while iterative pruning and regeneration dynamically correct detected errors. Experiments on HALU datasets show Token-Guard substantially reduces hallucinations and improves generation accuracy, offering a scalable, modular solution for reliable LLM outputs. Our code is publicly available.
中文摘要 大型语言模型（LLM）常常产生幻觉，生成与输入不一致的内容。检索增强生成（RAG）和人类反馈强化学习（RLHF）可以减轻幻觉，但需要资源密集的检索或大规模微调。基于解码的方法较轻，但缺乏明确的幻觉控制。为此，我们提出了Token-Guard，一种基于自检查解码的令牌级幻觉控制方法。Token-Guard在每个推理步骤进行内部验证，以在代币传播前检测出幻觉。候选片段在潜在空间中进一步评估，并明确进行幻觉风险评分，同时通过迭代剪枝和再生动态修正检测到的错误。在HALU数据集上的实验显示，Token-Guard显著减少了幻觉并提高了生成精度，为可靠的LLM输出提供了可扩展、模块化的解决方案。我们的代码是公开的。

Learning Decentralized LLM Collaboration with Multi-Agent Actor Critic

基于多智能体演员-评论家方法学习去中心化大型语言模型协作

Authors: Shuo Liu, Tianle Chen, Ryan Amiri, Christopher Amato
Subjects: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2601.21972
Pdf link: https://arxiv.org/pdf/2601.21972
Abstract Recent work has explored optimizing LLM collaboration through Multi-Agent Reinforcement Learning (MARL). However, most MARL fine-tuning approaches rely on predefined execution protocols, which often require centralized execution. Decentralized LLM collaboration is more appealing in practice, as agents can run inference in parallel with flexible deployments. Also, current approaches use Monte Carlo methods for fine-tuning, which suffer from high variance and thus require more samples to train effectively. Actor-critic methods are prevalent in MARL for dealing with these issues, so we developed Multi-Agent Actor-Critic (MAAC) methods to optimize decentralized LLM collaboration. In this paper, we analyze when and why these MAAC methods are beneficial. We propose two MAAC approaches: CoLLM-CC with a Centralized Critic and CoLLM-DC with Decentralized Critics. Our experiments across writing, coding, and game-playing domains show that Monte Carlo methods and CoLLM-DC can achieve performance comparable to CoLLM-CC in short-horizon and dense-reward settings. However, they both underperform CoLLM-CC on long-horizon or sparse-reward tasks, where Monte Carlo methods require substantially more samples and CoLLM-DC struggles to converge. Our code is available at this https URL.
中文摘要 近期研究探讨了通过多智能体强化学习（MARL）优化LLM协作。然而，大多数MARL微调方法依赖于预定义的执行协议，而这些协议通常需要集中执行。去中心化的LLM协作在实践中更具吸引力，因为智能体可以并行运行推理并灵活部署。此外，当前方法采用蒙特卡洛方法进行微调，但这些方法方差较大，因此需要更多样本才能有效训练。MARL中普遍采用演员-评论家方法来解决这些问题，因此我们开发了多智能体演员-评论家（MAAC）方法以优化去中心化的大型语言模型协作。本文分析了这些MAAC方法何时以及为何有益。我们提出了两种MAAC方法：采用集中式评论家（Centralized Critic）的CoLLM-CC，以及采用分散式评论家（Decentralized Critics）的CoLLM-DC。我们在写作、编程和游戏领域的实验表明，蒙特卡洛方法和CoLLM-DC在短视野和密集奖励环境中能够实现与CoLLM-CC相当的性能。然而，它们在长视野或稀疏奖励任务中表现均不及CoLLM-CC，在这些任务中蒙特卡洛方法需要多得多的样本，而CoLLM-DC难以收敛。我们的代码可在此 https URL 访问。
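
The two kinds of learning signal contrasted in this abstract can be written down generically. The sketch below is textbook material rather than the CoLLM code: a Monte Carlo return sums rewards to the end of the episode, while a one-step critic-based advantage bootstraps from value estimates. The `values` list is a hypothetical critic output with one extra entry for the state after the last step.

```python
def monte_carlo_returns(rewards, gamma=1.0):
    """Discounted return from each step to the end of the episode."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def td_advantages(rewards, values, gamma=1.0):
    """One-step advantage r_t + gamma * V(s_{t+1}) - V(s_t).

    values must have len(rewards) + 1 entries.
    """
    return [
        reward + gamma * values[t + 1] - values[t]
        for t, reward in enumerate(rewards)
    ]


rewards = [0.0, 0.0, 1.0]        # sparse reward at the final step
values = [0.5, 0.5, 0.5, 0.0]    # hypothetical critic estimates
print(monte_carlo_returns(rewards))
print(td_advantages(rewards, values))
```

With a sparse terminal reward, every step's Monte Carlo return depends on the whole sampled episode, whereas the critic-based signal is nonzero only where the outcome differs from what the critic expected.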

Elign: Equivariant Diffusion Model Alignment from Foundational Machine Learning Force Fields

Elign：基于基础机器学习力场的等变扩散模型对齐

Authors: Yunyang Li, Lin Huang, Luojia Xia, Wenhe Zhang, Mark Gerstein
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.21985
Pdf link: https://arxiv.org/pdf/2601.21985
Abstract Generative models for 3D molecular conformations must respect Euclidean symmetries and concentrate probability mass on thermodynamically favorable, mechanically stable structures. However, E(3)-equivariant diffusion models often reproduce biases from semi-empirical training data rather than capturing the equilibrium distribution of a high-fidelity Hamiltonian. While physics-based guidance can correct this, it faces two computational bottlenecks: expensive quantum-chemical evaluations (e.g., DFT) and the need to repeat such queries at every sampling step. We present Elign, a post-training framework that amortizes both costs. First, we replace expensive DFT evaluations with a faster, pretrained foundational machine-learning force field (MLFF) to provide physical signals. Second, we eliminate repeated run-time queries by shifting physical steering to the training phase. To achieve the second amortization, we formulate reverse diffusion as a reinforcement learning problem and introduce Force-Energy Disentangled Group Relative Policy Optimization (FED-GRPO) to fine-tune the denoising policy. FED-GRPO includes a potential-based energy reward and a force-based stability reward, which are optimized and group-normalized independently. Experiments show that Elign generates conformations with lower gold-standard DFT energies and forces, while improving stability. Crucially, inference remains as fast as unguided sampling, since no energy evaluations are required during generation.
中文摘要 三维分子构象的生成模型必须尊重欧几里得对称性，并将概率质量集中在热力学上有利、机械稳定的结构上。然而，E（3）-等变扩散模型常常从半经验训练数据中重现偏差，而非捕捉高保真度哈密顿量的平衡分布。虽然基于物理的指导可以纠正这一点，但面临两个计算瓶颈：昂贵的量子化学评估（如DFT）以及每一步采样都必须重复此类查询。我们提出了Elign，一个能够摊销这两项成本的培训后框架。首先，我们用更快、预训练的基础机器学习力场（MLFF）替代昂贵的DFT评估，以提供物理信号。其次，我们通过将物理引导转移到训练阶段，消除重复的运行时查询。为实现第二次摊销，我们将反扩散提出强化学习问题，并引入力-能量解缠群相对策略优化（FED-GRPO）以微调去噪策略。FED-GRPO包括基于势的能量奖励和基于力的稳定性奖励，这两种奖励分别独立优化和群归一化。实验表明，Elign生成的构象具有更低的金标准DFT能量和力，同时提升了稳定性。关键是，推断速度与无导采样相当，因为生成过程中无需能量评估。
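
Independent group normalization of two reward channels can be sketched as below. Only the idea that the energy and force rewards are normalized separately within a group comes from the abstract; the weighted sum, the weights, and the function names are assumptions, and the reward values are made up for the example.

```python
from statistics import mean, pstdev


def group_normalize(rewards, eps=1e-8):
    """GRPO-style normalization: subtract the group mean, divide by group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def disentangled_advantages(energy_rewards, force_rewards, w_energy=1.0, w_force=1.0):
    """Normalize each reward channel within the group, then combine.

    The weighted sum is an assumed combination rule for illustration.
    """
    adv_energy = group_normalize(energy_rewards)
    adv_force = group_normalize(force_rewards)
    return [w_energy * e + w_force * f for e, f in zip(adv_energy, adv_force)]


# Four sampled conformations for one molecule (hypothetical reward values).
energy_rewards = [-100.0, -200.0, -300.0, -400.0]  # large-magnitude channel
force_rewards = [-0.1, -0.2, -0.3, -0.4]           # small-magnitude channel
print(disentangled_advantages(energy_rewards, force_rewards))
```

Normalizing the raw sum of the two rewards instead would let the energy channel swamp the force channel; normalizing each channel first gives both a comparable influence on the advantage.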

Geometry of Drifting MDPs with Path-Integral Stability Certificates

带有路径积分稳定性证书的漂移MDP几何形状

Authors: Zuyuan Zhang, Mahdi Imani, Tian Lan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.21991
Pdf link: https://arxiv.org/pdf/2601.21991
Abstract Real-world reinforcement learning is often nonstationary: rewards and dynamics drift, accelerate, oscillate, and trigger abrupt switches in the optimal action. Existing theory often represents nonstationarity with coarse-scale models that measure how much the environment changes, not how it changes locally, even though acceleration and near-ties drive tracking error and policy chattering. We take a geometric view of nonstationary discounted Markov Decision Processes (MDPs) by modeling the environment as a differentiable homotopy path and tracking the induced motion of the optimal Bellman fixed point. This yields a length-curvature-kink signature of intrinsic complexity: cumulative drift, acceleration/oscillation, and action-gap-induced nonsmoothness. We prove a solver-agnostic path-integral stability bound and derive gap-safe feasible regions that certify local stability away from switch regimes. Building on these results, we introduce Homotopy-Tracking RL (HT-RL) and HT-MCTS, lightweight wrappers that estimate replay-based proxies of length, curvature, and near-tie proximity online and adapt learning or planning intensity accordingly. Experiments show improved tracking and dynamic regret over matched static baselines, with the largest gains in oscillatory and switch-prone regimes.
中文摘要 现实中的强化学习通常是非平稳的：奖励和动态会漂移、加速、振荡，并触发最优动作的突然切换。现有理论常用粗尺度模型来表示非平稳性，这些模型测量环境变化的程度，而非其局部变化的方式，尽管加速和近平局会驱动跟踪误差和策略抖动。我们通过将环境建模为可微同伦路径，并跟踪最优贝尔曼不动点的诱导运动，从几何视角研究非平稳的折扣马尔可夫决策过程（MDP）。这产生了刻画内在复杂性的长度-曲率-扭结特征：累积漂移、加速度/振荡以及动作间隙引起的非平滑性。我们证明了一个与求解器无关的路径积分稳定性界，并推导出间隙安全可行区域，用于证明远离切换区域时的局部稳定性。基于这些结果，我们介绍了同伦追踪RL（HT-RL）和HT-MCTS，这两类轻量级包装器可在线估算基于回放的长度、曲率和近平局接近度代理，并相应调整学习或规划强度。实验显示，追踪和动态遗憾相比匹配的静态基线有所改善，在振荡和易切换状态下提升最大。
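
Simple finite-difference proxies for the three quantities named in this abstract could look like the sketch below. The abstract does not say how HT-RL estimates them, so everything here is assumed: `points` stands for successive snapshots of a value estimate, length is the sum of first differences, curvature the sum of second differences, and near-tie proximity is measured by the gap between the two best action values.

```python
import math


def path_length(points):
    """Sum of distances between consecutive snapshots (cumulative drift)."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


def path_curvature(points):
    """Sum of second-difference norms (acceleration / oscillation)."""
    total = 0.0
    for prev, cur, nxt in zip(points, points[1:], points[2:]):
        second_diff = [n - 2 * c + p for p, c, n in zip(prev, cur, nxt)]
        total += math.hypot(*second_diff)
    return total


def action_gap(q_values):
    """Difference between the best and second-best action values."""
    top, runner_up = sorted(q_values, reverse=True)[:2]
    return top - runner_up


steady = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]   # constant-velocity drift
turning = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]  # same length, sharp turn
print(path_length(steady), path_curvature(steady))
print(path_length(turning), path_curvature(turning))
print(action_gap([1.0, 0.75, 0.25]))
```

The two example paths have the same length but different curvature, which mirrors the abstract's point that measuring only how much the environment changes misses how it changes.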

SymbXRL: Symbolic Explainable Deep Reinforcement Learning for Mobile Networks

SymbXRL：移动网络的符号可解释深度强化学习

Authors: Abhishek Duttagupta, MohammadErfan Jabbari, Claudio Fiandrino, Marco Fiore, Joerg Widmer
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.22024
Pdf link: https://arxiv.org/pdf/2601.22024
Abstract The operation of future 6th-generation (6G) mobile networks will increasingly rely on the ability of deep reinforcement learning (DRL) to optimize network decisions in real time. DRL has demonstrated efficacy in various resource allocation problems, such as joint decisions on user scheduling and antenna allocation or simultaneous control of computing resources and modulation. However, trained DRL agents are closed boxes and inherently difficult to explain, which hinders their adoption in production settings. In this paper, we take a step towards removing this critical barrier by presenting SymbXRL, a novel technique for explainable reinforcement learning (XRL) that synthesizes human-interpretable explanations for DRL agents. SymbXRL leverages symbolic AI to produce explanations where key concepts and their relationships are described via intuitive symbols and rules; coupling such a representation with logical reasoning exposes the decision process of DRL agents and offers more comprehensible descriptions of their behaviors compared to existing approaches. We validate SymbXRL in practical network management use cases supported by DRL, proving that it not only improves the semantics of the explanations but also paves the way for explicit agent control: for instance, it enables intent-based programmatic action steering that improves the median cumulative reward by 12% over a pure DRL solution.
中文摘要 未来第六代（6G）移动网络的运行将越来越依赖深度强化学习（DRL）以实时优化网络决策的能力。DRL在多种资源分配问题中表现出显著效果，如用户调度与天线分配的联合决策，或计算资源与调制的同步控制。然而，受过训练的DRL代理是封闭的，且本质上难以解释，这阻碍了其在生产环境中的采用。本文通过介绍SymbXRL——一种新型可解释强化学习（XRL）技术，旨在消除这一关键障碍，该技术能够综合人类可解释的DRL代理解释。SymbXRL利用符号人工智能生成解释，通过直观的符号和规则描述关键概念及其关系;将这种表示与逻辑推理结合，揭示了DRL代理的决策过程，并比现有方法更易理解其行为。我们在由DRL支持的实际网络管理用例中验证了SymbXRL，证明它不仅改善了解释的语义，还为显式代理控制铺平了道路：例如，它实现了基于意图的程序作引导，使中位数累计奖励比纯DRL解决方案提升12%。

SIA: Symbolic Interpretability for Anticipatory Deep Reinforcement Learning in Network Control

SIA：网络控制中预期性深度强化学习的符号可解释性

Authors: MohammadErfan Jabbari, Abhishek Duttagupta, Claudio Fiandrino, Leonardo Bonati, Salvatore D'Oro, Michele Polese, Marco Fiore, Tommaso Melodia
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.22044
Pdf link: https://arxiv.org/pdf/2601.22044
Abstract Deep reinforcement learning (DRL) promises adaptive control for future mobile networks but conventional agents remain reactive: they act on past and current measurements and cannot leverage short-term forecasts of exogenous KPIs such as bandwidth. Augmenting agents with predictions can overcome this temporal myopia, yet uptake in networking is scarce because forecast-aware agents act as closed-boxes; operators cannot tell whether predictions guide decisions or justify the added complexity. We propose SIA, the first interpreter that exposes in real time how forecast-augmented DRL agents operate. SIA fuses Symbolic AI abstractions with per-KPI Knowledge Graphs to produce explanations, and includes a new Influence Score metric. SIA achieves sub-millisecond speed, over 200x faster than existing XAI methods. We evaluate SIA on three diverse networking use cases, uncovering hidden issues, including temporal misalignment in forecast integration and reward-design biases that trigger counter-productive policies. These insights enable targeted fixes: a redesigned agent achieves a 9% higher average bitrate in video streaming, and SIA's online Action-Refinement module improves RAN-slicing reward by 25% without retraining. By making anticipatory DRL transparent and tunable, SIA lowers the barrier to proactive control in next-generation mobile networks.
中文摘要 深度强化学习（DRL）有望为未来的移动网络带来自适应控制，但传统智能体仍是反应式的：它们基于过去和当前的测量数据行动，无法利用带宽等外生KPI的短期预测。用预测来增强智能体可以克服这种时间短视，但这类方法在网络领域的应用仍然很少，因为预测感知智能体如同封闭的盒子；运营商无法判断预测是否真正指导了决策，也无法判断其带来的额外复杂性是否值得。我们提出了SIA，这是首个实时揭示预测增强DRL智能体工作原理的解释器。SIA将符号AI抽象与针对每个KPI的知识图谱融合以生成解释，并引入了新的影响力评分（Influence Score）指标。SIA实现了亚毫秒级的速度，比现有XAI方法快200倍以上。我们在三种不同的网络应用场景中评估SIA，发现了一些隐藏问题，包括预测集成中的时间错位，以及会引发适得其反策略的奖励设计偏差。这些洞察使针对性修复成为可能：重新设计的智能体在视频流中的平均码率提高了9%，SIA的在线动作优化（Action-Refinement）模块在无需重新训练的情况下将RAN切片奖励提升了25%。通过使预期性DRL透明且可调，SIA降低了下一代移动网络中主动控制的门槛。

Learning to Dial-a-Ride: A Deep Graph Reinforcement Learning Approach to the Electric Dial-a-Ride Problem

学习预约乘车：求解电动预约乘车问题的深度图强化学习方法

Authors: Sten Elling Tingstad Jacobsen, Attila Lischka, Balázs Kulcsár, Anders Lindman
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2601.22052
Pdf link: https://arxiv.org/pdf/2601.22052
Abstract Urban mobility systems are transitioning toward electric, on-demand services, creating operational challenges for fleet management under energy and service-quality constraints. The Electric Dial-a-Ride Problem (E-DARP) extends the classical dial-a-ride problem by incorporating limited battery capacity and nonlinear charging dynamics, increasing computational complexity and limiting the scalability of exact methods for real-time use. This paper proposes a deep reinforcement learning approach based on a graph neural network encoder and an attention-driven route construction policy. By operating directly on edge attributes such as travel time and energy consumption, the method captures non-Euclidean, asymmetric, and energy-dependent routing costs in real road networks. The learned policy jointly optimizes routing, charging, and service quality without relying on Euclidean assumptions or handcrafted heuristics. The approach is evaluated on two case studies using ride-sharing data from San Francisco. On benchmark instances, the method achieves solutions within 0.4% of best-known results while reducing computation times by orders of magnitude. A second case study considers large-scale instances with up to 250 request pairs, realistic energy models, and nonlinear charging. On these instances, the learned policy outperforms Adaptive Large Neighborhood Search (ALNS) by 9.5% in solution quality while achieving 100% service completion, with sub-second inference times compared to hours for the metaheuristic. Finally, sensitivity analyses quantify the impact of battery capacity, fleet size, ride-sharing capacity, and reward weights, while robustness experiments show that deterministically trained policies generalize effectively under stochastic conditions.
中文摘要 城市出行系统正向电动化、按需服务转型，这给能源和服务质量约束下的车队管理带来了运营挑战。电动预约乘车问题（E-DARP）在经典预约乘车问题的基础上引入了有限的电池容量和非线性充电动态，这增加了计算复杂度，并限制了精确方法在实时应用中的可扩展性。本文提出了一种基于图神经网络编码器和注意力驱动的路径构建策略的深度强化学习方法。该方法直接作用于行驶时间和能耗等边属性，从而捕捉真实道路网络中非欧几里得、非对称且依赖能量的路径成本。学习到的策略联合优化路径规划、充电和服务质量，而不依赖欧几里得假设或手工设计的启发式方法。该方法使用旧金山的拼车数据，在两个案例研究中进行了评估。在基准实例上，该方法得到的解与已知最优结果的差距在0.4%以内，同时将计算时间缩短了几个数量级。第二个案例研究考虑了多达250对请求的大规模实例、真实的能耗模型和非线性充电。在这些实例上，学习到的策略在解质量上比自适应大邻域搜索（ALNS）高出9.5%，同时实现了100%的服务完成率，其推理时间为亚秒级，而该元启发式方法则需要数小时。最后，敏感性分析量化了电池容量、车队规模、拼车容量和奖励权重的影响，而鲁棒性实验表明，在确定性条件下训练的策略在随机条件下也能有效泛化。

DynaWeb: Model-Based Reinforcement Learning of Web Agents

DynaWeb：基于模型的网页智能体强化学习

Authors: Hang Ding, Peidong Liu, Junqiao Wang, Ziwei Ji, Meng Cao, Rongzhao Zhang, Lynn Ai, Eric Yang, Tianyu Shi, Lei Yu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.22149
Pdf link: https://arxiv.org/pdf/2601.22149
Abstract The development of autonomous web agents, powered by Large Language Models (LLMs) and reinforcement learning (RL), represents a significant step towards general-purpose AI assistants. However, training these agents is severely hampered by the challenges of interacting with the live internet, which is inefficient, costly, and fraught with risks. Model-based reinforcement learning (MBRL) offers a promising solution by learning a world model of the environment to enable simulated interaction. This paper introduces DynaWeb, a novel MBRL framework that trains web agents through interacting with a web world model trained to predict naturalistic web page representations given agent actions. This model serves as a synthetic web environment where an agent policy can dream by generating vast quantities of rollout action trajectories for efficient online reinforcement learning. Beyond free policy rollouts, DynaWeb incorporates real expert trajectories from training data, which are randomly interleaved with on-policy rollouts during training to improve stability and sample efficiency. Experiments conducted on the challenging WebArena and WebVoyager benchmarks demonstrate that DynaWeb consistently and significantly improves the performance of state-of-the-art open-source web agent models. Our findings establish the viability of training web agents through imagination, offering a scalable and efficient way to scale up online agentic RL.
中文摘要 由大型语言模型（LLM）和强化学习（RL）驱动的自主网页智能体的发展，是迈向通用人工智能助手的重要一步。然而，与实时互联网交互效率低、成本高且充满风险，这严重阻碍了这些智能体的训练。基于模型的强化学习（MBRL）通过学习环境的世界模型来实现模拟交互，提供了一种有前景的解决方案。本文介绍了DynaWeb，一个新颖的MBRL框架，它通过与网页世界模型交互来训练网页智能体，该世界模型经过训练，可在给定智能体动作的情况下预测自然的网页表示。该模型充当合成的网页环境，智能体策略可以在其中“做梦”，生成大量展开（rollout）动作轨迹，以实现高效的在线强化学习。除了自由的策略展开之外，DynaWeb还整合了来自训练数据的真实专家轨迹，这些轨迹在训练过程中与同策略（on-policy）展开随机交错，以提升稳定性和样本效率。在具有挑战性的WebArena和WebVoyager基准上的实验表明，DynaWeb能够持续且显著地提升最先进的开源网页智能体模型的性能。我们的研究结果证明了通过想象来训练网页智能体的可行性，为扩展在线智能体强化学习提供了一种可扩展且高效的途径。

Exploring Reasoning Reward Model for Agents

探索智能体的推理奖励模型

Authors: Kaixuan Fan, Kaituo Feng, Manyuan Zhang, Tianshuo Peng, Zhixun Li, Yilei Jiang, Shuang Chen, Peng Pei, Xunliang Cai, Xiangyu Yue
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.22154
Pdf link: https://arxiv.org/pdf/2601.22154
Abstract Agentic Reinforcement Learning (Agentic RL) has achieved notable success in enabling agents to perform complex reasoning and tool use. However, most methods still rely on sparse outcome-based rewards for training. Such feedback fails to differentiate intermediate reasoning quality, leading to suboptimal training results. In this paper, we introduce Agent Reasoning Reward Model (Agent-RRM), a multi-faceted reward model that produces structured feedback for agentic trajectories, including (1) an explicit reasoning trace, (2) a focused critique that provides refinement guidance by highlighting reasoning flaws, and (3) an overall score that evaluates process performance. Leveraging these signals, we systematically investigate three integration strategies: Reagent-C (text-augmented refinement), Reagent-R (reward-augmented guidance), and Reagent-U (unified feedback integration). Extensive evaluations across 12 diverse benchmarks demonstrate that Reagent-U yields substantial performance leaps, achieving 43.7% on GAIA and 46.2% on WebWalkerQA, validating the effectiveness of our reasoning reward model and training schemes. Code, models, and datasets are all released to facilitate future research.
中文摘要 智能体强化学习（Agentic RL）在使智能体能够进行复杂推理和工具使用方面取得了显著成功。然而，大多数方法仍然依赖基于结果的稀疏奖励进行训练。此类反馈无法区分中间推理的质量，导致训练结果欠佳。本文提出了智能体推理奖励模型（Agent-RRM），这是一种多方面的奖励模型，能够为智能体轨迹生成结构化反馈，包括（1）显式的推理轨迹，（2）通过指出推理缺陷来提供改进指导的针对性评语，以及（3）评估过程表现的总体评分。利用这些信号，我们系统地研究了三种集成策略：Reagent-C（文本增强的改进）、Reagent-R（奖励增强的引导）和Reagent-U（统一反馈集成）。在12个不同基准上的广泛评估表明，Reagent-U带来了显著的性能飞跃，在GAIA上达到43.7%，在WebWalkerQA上达到46.2%，验证了我们的推理奖励模型和训练方案的有效性。代码、模型和数据集均已发布，以促进未来的研究。

Keyword: diffusion policy

PocketDP3: Efficient Pocket-Scale 3D Visuomotor Policy

PocketDP3：高效的袖珍尺度三维视觉运动策略

Authors: Jinhao Zhang, Zhexuan Zhou, Huizhe Li, Yichen Lai, Wenlong Xia, Haoming Song, Youmin Gong, Jie Me
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.22018
Pdf link: https://arxiv.org/pdf/2601.22018
Abstract Recently, 3D vision-based diffusion policies have shown strong capability in learning complex robotic manipulation skills. However, a common architectural mismatch exists in these models: a tiny yet efficient point-cloud encoder is often paired with a massive decoder. Given a compact scene representation, we argue that this may lead to substantial parameter waste in the decoder. Motivated by this observation, we propose PocketDP3, a pocket-scale 3D diffusion policy that replaces the heavy conditional U-Net decoder used in prior methods with a lightweight Diffusion Mixer (DiM) built on MLP-Mixer blocks. This architecture enables efficient fusion across temporal and channel dimensions, significantly reducing model size. Notably, without any additional consistency distillation techniques, our method supports two-step inference without sacrificing performance, improving practicality for real-time deployment. Across three simulation benchmarks--RoboTwin2.0, Adroit, and MetaWorld--PocketDP3 achieves state-of-the-art performance with fewer than 1% of the parameters of prior methods, while also accelerating inference. Real-world experiments further demonstrate the practicality and transferability of our method in real-world settings. Code will be released.
中文摘要 近年来，基于3D视觉的扩散策略在学习复杂机器人操作技能方面展现出强大能力。然而，这些模型中存在一个常见的架构不匹配：一个微小但高效的点云编码器通常与一个庞大的解码器搭配使用。我们认为，在场景表示已经很紧凑的情况下，这可能导致解码器中大量参数被浪费。基于这一观察，我们提出了PocketDP3，这是一种袖珍尺度的三维扩散策略，用基于MLP-Mixer模块的轻量级扩散混合器（DiM）取代了以往方法中使用的重型条件U-Net解码器。该架构实现了跨时间和通道维度的高效融合，显著缩小了模型规模。值得注意的是，无需任何额外的一致性蒸馏技术，我们的方法即可支持两步推理而不牺牲性能，提升了实时部署的实用性。在RoboTwin2.0、Adroit和MetaWorld三个仿真基准上，PocketDP3以不到以往方法1%的参数量实现了最先进的性能，同时还加速了推理。真实世界实验进一步展示了我们的方法在现实环境中的实用性和可迁移性。代码将会发布。