Arxiv Papers of Today

Generated: 2026-01-09 16:35:27 (UTC+8); Arxiv release time: 2026-01-09 20:00 EST (2026-01-10 09:00 UTC+8)

45 related papers today

Keyword: reinforcement learning

Cross-Language Speaker Attribute Prediction Using MIL and RL

Authors: Sunny Shu, Seyed Sahand Mohammadi Ziabari, Ali Mohammed Mansoor Alsahag
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.04257
Pdf link: https://arxiv.org/pdf/2601.04257
Abstract We study multilingual speaker attribute prediction under linguistic variation, domain mismatch, and data imbalance across languages. We propose RLMIL-DAT, a multilingual extension of the reinforced multiple instance learning framework that combines reinforcement learning based instance selection with domain adversarial training to encourage language invariant utterance representations. We evaluate the approach on a five language Twitter corpus in a few shot setting and on a VoxCeleb2 derived corpus covering forty languages in a zero shot setting for gender and age prediction. Across a wide range of model configurations and multiple random seeds, RLMIL-DAT consistently improves Macro F1 compared to standard multiple instance learning and the original reinforced multiple instance learning framework. The largest gains are observed for gender prediction, while age prediction remains more challenging and shows smaller but positive improvements. Ablation experiments indicate that domain adversarial training is the primary contributor to the performance gains, enabling effective transfer from high resource English to lower resource languages by discouraging language specific cues in the shared encoder. In the zero shot setting on the smaller VoxCeleb2 subset, improvements are generally positive but less consistent, reflecting limited statistical power and the difficulty of generalizing to many unseen languages. Overall, the results demonstrate that combining instance selection with adversarial domain adaptation is an effective and robust strategy for cross lingual speaker attribute prediction.

Making Tunable Parameters State-Dependent in Weather and Climate Models with Reinforcement Learning

Authors: Pritthijit Nath, Sebastian Schemm, Henry Moss, Peter Haynes, Emily Shuckburgh, Mark J. Webb
Subjects: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
Arxiv link: https://arxiv.org/abs/2601.04268
Pdf link: https://arxiv.org/pdf/2601.04268
Abstract Weather and climate models rely on parametrisations to represent unresolved sub-grid processes. Traditional schemes rely on fixed coefficients that are weakly constrained and tuned offline, contributing to persistent biases that limit their ability to adapt to the underlying physics. This study presents a framework that learns components of parametrisation schemes online as a function of the evolving model state using reinforcement learning (RL) and evaluates the resulting RL-driven parameter updates across a hierarchy of idealised testbeds spanning a simple climate bias correction (SCBC), a radiative-convective equilibrium (RCE), and a zonal mean energy balance model (EBM) with both single-agent and federated multi-agent settings. Across nine RL algorithms, Truncated Quantile Critics (TQC), Deep Deterministic Policy Gradient (DDPG), and Twin Delayed DDPG (TD3) achieved the highest skill and the most stable convergence across configurations, with performance assessed against a static baseline using area-weighted RMSE, temperature profile and pressure-level diagnostics. For the EBM, single-agent RL outperformed static parameter tuning with the strongest gains in tropical and mid-latitude bands, while federated RL on multi-agent setups enabled geographically specialised control and faster convergence, with a six-agent DDPG configuration using frequent aggregation yielding the lowest area-weighted RMSE across the tropics and mid-latitudes. The learnt corrections were also physically meaningful, as agents modulated EBM radiative parameters to reduce meridional biases, adjusted RCE lapse rates to match vertical temperature errors, and stabilised SCBC heating increments to limit drift. Overall, the results show that RL can deliver skilful, state-dependent, and regime-aware parametrisations, offering a scalable pathway for online learning within numerical models.

A Future Capabilities Agent for Tactical Air Traffic Control

Authors: Paul Kent, George De Ath, Martin Layton, Allen Hart, Richard Everson, Ben Carvell
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2601.04285
Pdf link: https://arxiv.org/pdf/2601.04285
Abstract Escalating air traffic demand is driving the adoption of automation to support air traffic controllers, but existing approaches face a trade-off between safety assurance and interpretability. Optimisation-based methods such as reinforcement learning offer strong performance but are difficult to verify and explain, while rules-based systems are transparent yet rarely check safety under uncertainty. This paper outlines Agent Mallard, a forward-planning, rules-based agent for tactical control in systemised airspace that embeds a stochastic digital twin directly into its conflict-resolution loop. Mallard operates on predefined GPS-guided routes, reducing continuous 4D vectoring to discrete choices over lanes and levels, and constructs hierarchical plans from an expert-informed library of deconfliction strategies. A depth-limited backtracking search uses causal attribution, topological plan splicing, and monotonic axis constraints to seek a complete safe plan for all aircraft, validating each candidate manoeuvre against uncertain execution scenarios (e.g., wind variation, pilot response, communication loss) before commitment. Preliminary walkthroughs with UK controllers and initial tests in the BluebirdDT airspace digital twin indicate that Mallard's behaviour aligns with expert reasoning and resolves conflicts in simplified scenarios. The architecture is intended to combine model-based safety assessment, interpretable decision logic, and tractable computational performance in future structured en-route environments.

Online Action-Stacking Improves Reinforcement Learning Performance for Air Traffic Control

Authors: Ben Carvell, George De Ath, Eseoghene Benjamin, Richard Everson
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.04287
Pdf link: https://arxiv.org/pdf/2601.04287
Abstract We introduce online action-stacking, an inference-time wrapper for reinforcement learning policies that produces realistic air traffic control commands while allowing training on a much smaller discrete action space. Policies are trained with simple incremental heading or level adjustments, together with an action-damping penalty that reduces instruction frequency and leads agents to issue commands in short bursts. At inference, online action-stacking compiles these bursts of primitive actions into domain-appropriate compound clearances. Using Proximal Policy Optimisation and the BluebirdDT digital twin platform, we train agents to navigate aircraft along lateral routes, manage climb and descent to target flight levels, and perform two-aircraft collision avoidance under a minimum separation constraint. In our lateral navigation experiments, action stacking greatly reduces the number of issued instructions relative to a damped baseline and achieves comparable performance to a policy trained with a 37-dimensional action space, despite operating with only five actions. These results indicate that online action-stacking helps bridge a key gap between standard reinforcement learning formulations and operational ATC requirements, and provides a simple mechanism for scaling to more complex control scenarios.
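
As a rough illustration of the stacking step described in this abstract, the sketch below merges a burst of primitive actions into compound clearances. It is not the authors' implementation: the action names, the step sizes `HEADING_STEP_DEG` and `LEVEL_STEP_FL`, the `Clearance` type, and the rule that a `noop` ends a burst are all assumptions made for the example, and it processes a recorded list of actions rather than running online.

```python
from dataclasses import dataclass

# Hypothetical step sizes; the abstract does not state the real increments.
HEADING_STEP_DEG = 5
LEVEL_STEP_FL = 10

# Hypothetical primitive action set mapped to (clearance kind, signed change).
PRIMITIVES = {
    "left": ("heading", -HEADING_STEP_DEG),
    "right": ("heading", HEADING_STEP_DEG),
    "climb": ("level", LEVEL_STEP_FL),
    "descend": ("level", -LEVEL_STEP_FL),
}


@dataclass
class Clearance:
    kind: str    # "heading" or "level"
    amount: int  # signed total change accumulated over one burst


def stack_actions(actions):
    """Merge consecutive same-kind primitives into compound clearances.

    A "noop" or a change of kind closes the current burst. Bursts whose
    increments cancel out are dropped, since no instruction is needed.
    """
    clearances = []
    current = None
    for action in actions:
        if action == "noop":
            if current is not None:
                clearances.append(current)
                current = None
            continue
        kind, delta = PRIMITIVES[action]
        if current is not None and current.kind == kind:
            current.amount += delta
        else:
            if current is not None:
                clearances.append(current)
            current = Clearance(kind, delta)
    if current is not None:
        clearances.append(current)
    return [c for c in clearances if c.amount != 0]
```

Under these assumptions, three `right` actions followed by a `noop` and two `climb` actions collapse into two instructions: one heading change and one level change.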

Survival Dynamics of Neural and Programmatic Policies in Evolutionary Reinforcement Learning

Authors: Anton Roupassov-Ruiz, Yiyang Zuo
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.04365
Pdf link: https://arxiv.org/pdf/2601.04365
Abstract In evolutionary reinforcement learning tasks (ERL), agent policies are often encoded as small artificial neural networks (NERL). Such representations lack explicit modular structure, limiting behavioral interpretation. We investigate whether programmatic policies (PERL), implemented as soft, differentiable decision lists (SDDL), can match the performance of NERL. To support reproducible evaluation, we provide the first fully specified and open-source reimplementation of the classic 1992 Artificial Life (ALife) ERL testbed. We conduct a rigorous survival analysis across 4000 independent trials utilizing Kaplan-Meier curves and Restricted Mean Survival Time (RMST) metrics absent in the original study. We find a statistically significant difference in survival probability between PERL and NERL. PERL agents survive on average 201.69 steps longer than NERL agents. Moreover, SDDL agents using learning alone (no evolution) survive on average 73.67 steps longer than neural agents using both learning and evaluation. These results demonstrate that programmatic policies can exceed the survival performance of neural policies in ALife.
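
The two survival statistics named in this abstract can be sketched in a few lines of plain Python. The code below is a generic textbook version of the Kaplan-Meier estimator and of RMST as the area under the resulting step curve; it is not taken from the paper's code, and the function names and the toy data in the usage note are assumptions for illustration.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival probability)] at each distinct event time.

    durations: survival time of each agent (e.g. steps survived).
    observed:  True if the death was observed, False if the run was censored.
    """
    event_times = sorted({t for t, o in zip(durations, observed) if o})
    survival = 1.0
    curve = []
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, o in zip(durations, observed) if d == t and o)
        survival *= 1.0 - events / at_risk
        curve.append((t, survival))
    return curve


def rmst(curve, tau):
    """Restricted mean survival time: area under the step curve on [0, tau]."""
    area = 0.0
    prev_t, prev_s = 0.0, 1.0
    for t, s in curve:
        if t >= tau:
            break
        area += prev_s * (t - prev_t)
        prev_t, prev_s = t, s
    return area + prev_s * (tau - prev_t)
```

For example, with fully observed durations `[2, 4, 4, 6]`, `rmst(kaplan_meier(...), tau=6)` equals the plain mean of the durations, which is a quick sanity check. A difference in RMST between two groups is read as an average number of extra steps survived up to `tau`.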

Enhanced-FQL($λ$), an Efficient and Interpretable RL with novel Fuzzy Eligibility Traces and Segmented Experience Replay

Authors: Mohsen Jalaeian-Farimani
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2601.04392
Pdf link: https://arxiv.org/pdf/2601.04392
Abstract This paper introduces a fuzzy reinforcement learning framework, Enhanced-FQL($\lambda$), that integrates novel Fuzzified Eligibility Traces (FET) and Segmented Experience Replay (SER) into fuzzy Q-learning with a Fuzzified Bellman Equation (FBE) for continuous control tasks. The proposed approach employs an interpretable fuzzy rule base instead of complex neural architectures, while maintaining competitive performance through two key innovations: a fuzzified Bellman equation with eligibility traces for stable multi-step credit assignment, and a memory-efficient segment-based experience replay mechanism for enhanced sample efficiency. Theoretical analysis proves convergence of the proposed method under standard assumptions. Extensive evaluations in continuous control domains demonstrate that Enhanced-FQL($\lambda$) achieves superior sample efficiency and reduced variance compared to n-step fuzzy TD and fuzzy SARSA($\lambda$) baselines, while maintaining substantially lower computational complexity than deep RL alternatives such as DDPG. The framework's inherent interpretability, combined with its computational efficiency and theoretical convergence guarantees, makes it particularly suitable for safety-critical applications where transparency and resource constraints are essential.

Transformer-based Multi-agent Reinforcement Learning for Separation Assurance in Structured and Unstructured Airspaces

Authors: Arsyi Aziz, Peng Wei
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2601.04401
Pdf link: https://arxiv.org/pdf/2601.04401
Abstract Conventional optimization-based metering depends on strict adherence to precomputed schedules, which limits the flexibility required for the stochastic operations of Advanced Air Mobility (AAM). In contrast, multi-agent reinforcement learning (MARL) offers a decentralized, adaptive framework that can better handle uncertainty, required for safe aircraft separation assurance. Despite this advantage, current MARL approaches often overfit to specific airspace structures, limiting their adaptability to new configurations. To improve generalization, we recast the MARL problem in a relative polar state space and train a transformer encoder model across diverse traffic patterns and intersection angles. The learned model provides speed advisories to resolve conflicts while maintaining aircraft near their desired cruising speeds. In our experiments, we evaluated encoder depths of 1, 2, and 3 layers in both structured and unstructured airspaces, and found that a single encoder configuration outperformed deeper variants, yielding near-zero near mid-air collision rates and shorter loss-of-separation infringements than the deeper configurations. Additionally, we showed that the same configuration outperforms a baseline model designed purely with attention. Together, our results suggest that the newly formulated state representation, novel design of neural network architecture, and proposed training strategy provide an adaptable and scalable decentralized solution for aircraft separation assurance in both structured and unstructured airspaces.
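
To make the idea of a relative polar state space concrete, the sketch below converts an intruder's position into range and bearing relative to the ownship. The abstract does not specify the exact features used, so the function name, the 2D setting, and the heading convention (radians, counterclockwise from the x-axis) are assumptions for illustration only.

```python
import math


def relative_polar_state(own_xy, own_heading_rad, intruder_xy):
    """Return (range, bearing) of an intruder in the ownship's frame.

    Assumed convention: headings and bearings are in radians, measured
    counterclockwise from the x-axis; bearing is wrapped to [-pi, pi].
    """
    dx = intruder_xy[0] - own_xy[0]
    dy = intruder_xy[1] - own_xy[1]
    rng = math.hypot(dx, dy)
    bearing = math.atan2(dy, dx) - own_heading_rad
    # Wrap the angle so equivalent geometries map to the same value.
    bearing = math.atan2(math.sin(bearing), math.cos(bearing))
    return rng, bearing
```

A representation of this kind is unchanged when the whole traffic picture is translated or rotated, which is one plausible reason it could help a policy transfer across intersection angles and airspace layouts.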

Rate or Fate? RLV$^\varepsilon$R: Reinforcement Learning with Verifiable Noisy Rewards

Authors: Ali Rad, Khashayar Filom, Darioush Keivan, Peyman Mohajerin Esfahani, Ehsan Kamalinejad
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.04411
Pdf link: https://arxiv.org/pdf/2601.04411
Abstract Reinforcement learning with verifiable rewards (RLVR) is a simple but powerful paradigm for training LLMs: sample a completion, verify it, and update. In practice, however, the verifier is almost never clean--unit tests probe only limited corner cases; human and synthetic labels are imperfect; and LLM judges (e.g., RLAIF) are noisy and can be exploited--and this problem worsens on harder domains (especially coding) where tests are sparse and increasingly model-generated. We ask a pragmatic question: does the verification noise merely slow down the learning (rate), or can it flip the outcome (fate)? To address this, we develop an analytically tractable multi-armed bandit view of RLVR dynamics, instantiated with GRPO and validated in controlled experiments. Modeling false positives and false negatives and grouping completions into recurring reasoning modes yields a replicator-style (natural-selection) flow on the probability simplex. The dynamics decouples into within-correct-mode competition and a one-dimensional evolution for the mass on incorrect modes, whose drift is determined solely by Youden's index J=TPR-FPR. This yields a sharp phase transition: when J>0, the incorrect mass is driven toward extinction (learning); when J=0, the process is neutral; and when J<0, incorrect modes amplify until they dominate (anti-learning and collapse). In the learning regime J>0, noise primarily rescales convergence time ("rate, not fate"). Experiments on verifiable programming tasks under synthetic noise reproduce the predicted J=0 boundary. Beyond noise, the framework offers a general lens for analyzing RLVR stability, convergence, and algorithmic interventions.
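
The phase transition at J=0 can be reproduced with a deliberately simplified two-mode replicator model. This is a caricature, not the paper's GRPO analysis: it assumes one correct and one incorrect mode, takes the expected reward of a correct completion to be TPR and of an incorrect one to be FPR, and uses a plain Euler step with a hypothetical step size `lr`. Under those assumptions the incorrect mass p follows dp/dt = -J p (1 - p).

```python
def simulate_incorrect_mass(p0, tpr, fpr, lr=0.1, steps=500):
    """Euler simulation of the incorrect-mode probability mass.

    Two-mode replicator flow: the incorrect mode earns expected reward
    fpr, the correct mode earns tpr, so the drift on the incorrect mass
    is -(tpr - fpr) * p * (1 - p), i.e. it is governed by Youden's J.
    """
    j = tpr - fpr
    p = p0
    trajectory = [p]
    for _ in range(steps):
        p += lr * (-j) * p * (1.0 - p)
        p = min(max(p, 0.0), 1.0)  # keep p a valid probability
        trajectory.append(p)
    return trajectory


def steps_to_threshold(trajectory, threshold=0.01):
    """First step at which the incorrect mass drops below threshold, or None."""
    for step, p in enumerate(trajectory):
        if p < threshold:
            return step
    return None
```

In this toy model a verifier with J>0 always drives the incorrect mass down, and lowering J only increases `steps_to_threshold`, which matches the "rate, not fate" reading; with J<0 the incorrect mass grows instead.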

Improving and Accelerating Offline RL in Large Discrete Action Spaces with Structured Policy Initialization

Authors: Matthew Landers, Taylor W. Killian, Thomas Hartvigsen, Afsaneh Doryab
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.04441
Pdf link: https://arxiv.org/pdf/2601.04441
Abstract Reinforcement learning in discrete combinatorial action spaces requires searching over exponentially many joint actions to simultaneously select multiple sub-actions that form coherent combinations. Existing approaches either simplify policy learning by assuming independence across sub-actions, which often yields incoherent or invalid actions, or attempt to learn action structure and control jointly, which is slow and unstable. We introduce Structured Policy Initialization (SPIN), a two-stage framework that first pre-trains an Action Structure Model (ASM) to capture the manifold of valid actions, then freezes this representation and trains lightweight policy heads for control. On challenging discrete DM Control benchmarks, SPIN improves average return by up to 39% over the state of the art while reducing time to convergence by up to 12.8$\times$.

Addressing Overthinking in Large Vision-Language Models via Gated Perception-Reasoning Optimization

Authors: Xingjian Diao, Zheyuan Liu, Chunhui Zhang, Weiyi Wu, Keyi Kong, Lin Shi, Kaize Ding, Soroush Vosoughi, Jiang Gui
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.04442
Pdf link: https://arxiv.org/pdf/2601.04442
Abstract Large Vision-Language Models (LVLMs) have exhibited strong reasoning capabilities through chain-of-thought mechanisms that generate step-by-step rationales. However, such slow-thinking approaches often lead to overthinking, where models produce excessively verbose responses even for simple queries, resulting in test-time inefficiency and even degraded accuracy. Prior work has attempted to mitigate this issue via adaptive reasoning strategies, but these methods largely overlook a fundamental bottleneck: visual perception failures. We argue that stable reasoning critically depends on low-level visual grounding, and that reasoning errors often originate from imperfect perception rather than insufficient deliberation. To address this limitation, we propose Gated Perception-Reasoning Optimization (GPRO), a meta-reasoning controller that dynamically routes computation among three decision paths at each generation step: a lightweight fast path, a slow perception path for re-examining visual inputs, and a slow reasoning path for internal self-reflection. To learn this distinction, we derive large-scale failure attribution supervision from approximately 790k samples, using teacher models to distinguish perceptual hallucinations from reasoning errors. We then train the controller with multi-objective reinforcement learning to optimize the trade-off between task accuracy and computational cost under uncertainty. Experiments on five benchmarks demonstrate that GPRO substantially improves both accuracy and efficiency, outperforming recent slow-thinking methods while generating significantly shorter responses.

Multiagent Reinforcement Learning with Neighbor Action Estimation

Authors: Zhenglong Luo, Zhiyong Chen, Aoxiang Liu
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.04511
Pdf link: https://arxiv.org/pdf/2601.04511
Abstract Multiagent reinforcement learning, as a prominent intelligent paradigm, enables collaborative decision-making within complex systems. However, existing approaches often rely on explicit action exchange between agents to evaluate action value functions, which is frequently impractical in real-world engineering environments due to communication constraints, latency, energy consumption, and reliability requirements. From an artificial intelligence perspective, this paper proposes an enhanced multiagent reinforcement learning framework that employs action estimation neural networks to infer agent behaviors. By integrating a lightweight action estimation module, each agent infers neighboring agents' behaviors using only locally observable information, enabling collaborative policy learning without explicit action sharing. This approach is fully compatible with standard TD3 algorithms and scalable to larger multiagent systems. At the engineering application level, this framework has been implemented and validated in dual-arm robotic manipulation tasks: two robotic arms collaboratively lift objects. Experimental results demonstrate that this approach significantly enhances the robustness and deployment feasibility of real-world robotic systems while reducing dependence on information infrastructure. Overall, this research advances the development of decentralized multiagent artificial intelligence systems while enabling AI to operate effectively in dynamic, information-constrained real-world environments.

TSSR: Two-Stage Swap-Reward-Driven Reinforcement Learning for Character-Level SMILES Generation

Authors: Jacob Ede Levine, Yun Lyan Luo, Sai Chandra Kosaraju
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.04521
Pdf link: https://arxiv.org/pdf/2601.04521
Abstract The design of reliable, valid, and diverse molecules is fundamental to modern drug discovery, as improved molecular generation supports efficient exploration of the chemical space for potential drug candidates and reduces the cost of early design efforts. Despite these needs, current chemical language models that generate molecules as SMILES strings are vulnerable to compounding token errors: many samples are unparseable or chemically implausible, and hard constraints meant to prevent failure can restrict exploration. To address this gap, we introduce TSSR, a Two-Stage, Swap-Reward-driven reinforcement learning (RL) framework for character-level SMILES generation. Stage one rewards local token swaps that repair syntax, promoting transitions from invalid to parseable strings. Stage two provides chemistry-aware feedback from RDKit diagnostics, rewarding reductions in valence, aromaticity, and connectivity issues. The reward decomposes into interpretable terms (swap efficiency, error reduction, distance to validity), is model agnostic, and requires no task-specific labels or hand-crafted grammars. We evaluated TSSR on the MOSES benchmark using a GRU policy trained with PPO in both pure RL (P-RL) from random initialization and fine-tuning RL (F-RL) starting from a pretrained chemical language model, assessing 10,000 generated SMILES per run. In P-RL, TSSR significantly improves syntactic validity, chemical validity, and novelty. In F-RL, TSSR preserves drug-likeness and synthesizability while increasing validity and novelty. Token-level analysis shows that syntax edits and chemistry fixes act jointly to reduce RDKit detected errors. TSSR converts a sparse terminal objective into a denser and more interpretable reward, improving both syntactic and chemical quality without reducing diversity. TSSR is dataset-agnostic and can be adapted to various reinforcement learning approaches.

Not All Steps are Informative: On the Linearity of LLMs' RLVR Training

Authors: Tianle Wang, Zhongyuan Wu, Shenghao Jin, Hao Xu, Wei Chen, Ning Miao
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.04537
Pdf link: https://arxiv.org/pdf/2601.04537
Abstract Reinforcement learning with verifiable rewards (RLVR) has become a central component of large language model (LLM) post-training. Unlike supervised fine-tuning (SFT), RLVR lets an LLM generate multiple candidate solutions and reinforces those that lead to a verifiably correct final answer. However, in practice, RLVR often requires thousands of training steps to reach strong performance, incurring substantial computation largely attributed to prolonged exploration. In this work, we make a surprising observation: during RLVR, LLMs evolve in a strongly linear manner. Specifically, both model weights and model output log-probabilities exhibit strong linear correlations with RL training steps. This suggests that RLVR predominantly amplifies trends that emerge early in training, rather than continuously discovering new behaviors throughout the entire optimization trajectory. Motivated by this linearity, we investigate whether future model states can be predicted from intermediate checkpoints via extrapolation, avoiding continued expensive training. We show that Weight Extrapolation produces models with performance comparable to standard RL training while requiring significantly less computation. Moreover, Logits Extrapolation consistently outperforms continued RL training on all four benchmarks by extrapolating beyond the step range where RL training remains stable.
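
If weights really do move linearly with the training step, a later checkpoint can be estimated from two earlier ones by a straight-line fit. The sketch below shows that idea in plain Python; it is an assumed two-point formulation, not the authors' Weight Extrapolation procedure, and the checkpoint layout (a dict mapping parameter names to flat lists of floats) is hypothetical.

```python
def extrapolate_weights(ckpt_a, step_a, ckpt_b, step_b, target_step):
    """Linearly extrapolate parameters from two checkpoints to target_step.

    ckpt_a, ckpt_b: dicts mapping parameter name -> flat list of floats,
    saved at training steps step_a and step_b (hypothetical layout).
    """
    if step_a == step_b:
        raise ValueError("checkpoints must come from different steps")
    # Distance to extrapolate, in units of the checkpoint gap.
    scale = (target_step - step_b) / (step_b - step_a)
    return {
        name: [wb + scale * (wb - wa) for wa, wb in zip(ckpt_a[name], ckpt_b[name])]
        for name in ckpt_b
    }
```

For example, checkpoints at steps 100 and 200 extrapolated to step 400 continue each parameter's observed change for two more checkpoint gaps. How far such an extrapolation stays useful is an empirical question that the abstract addresses only at a high level.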

Reasoning Over Space: Enabling Geographic Reasoning for LLM-Based Generative Next POI Recommendation

Authors: Dongyi Lv, Qiuyu Ding, Heng-Da Xu, Zhaoxu Sun, Zhi Wang, Feng Xiong, Mu Xu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.04562
Pdf link: https://arxiv.org/pdf/2601.04562
Abstract Generative recommendation with large language models (LLMs) reframes prediction as sequence generation, yet existing LLM-based recommenders remain limited in leveraging geographic signals that are crucial in mobility and local-services scenarios. Here, we present Reasoning Over Space (ROS), a framework that utilizes geography as a vital decision variable within the reasoning process. ROS introduces a Hierarchical Spatial Semantic ID (SID) that discretizes coarse-to-fine locality and POI semantics into compositional tokens, and endows LLM with a three-stage Mobility Chain-of-Thought (CoT) paradigm that models user personality, constructs an intent-aligned candidate space, and performs locality informed pruning. We further align the model with real world geography via spatial-guided Reinforcement Learning (RL). Experiments on three widely used location-based social network (LBSN) datasets show that ROS achieves over 10% relative gains in hit rate over strongest LLM-based baselines and improves cross-city transfer, despite using a smaller backbone model.

Aligning Text, Code, and Vision: A Multi-Objective Reinforcement Learning Framework for Text-to-Visualization

Authors: Mizanur Rahman, Mohammed Saidul Islam, Md Tahmid Rahman Laskar, Shafiq Joty, Enamul Hoque
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.04582
Pdf link: https://arxiv.org/pdf/2601.04582
Abstract Text-to-Visualization (Text2Vis) systems translate natural language queries over tabular data into concise answers and executable visualizations. While closed-source LLMs generate functional code, the resulting charts often lack semantic alignment and clarity, qualities that can only be assessed post-execution. Open-source models struggle even more, frequently producing non-executable or visually poor outputs. Although supervised fine-tuning can improve code executability, it fails to enhance overall visualization quality, as traditional SFT loss cannot capture post-execution feedback. To address this gap, we propose RL-Text2Vis, the first reinforcement learning framework for Text2Vis generation. Built on Group Relative Policy Optimization (GRPO), our method uses a novel multi-objective reward that jointly optimizes textual accuracy, code validity, and visualization quality using post-execution feedback. By training Qwen2.5 models (7B and 14B), RL-Text2Vis achieves a 22% relative improvement in chart quality over GPT-4o on the Text2Vis benchmark and boosts code execution success from 78% to 97% relative to its zero-shot baseline. Our models significantly outperform strong zero-shot and supervised baselines and also demonstrate robust generalization to out-of-domain datasets like VIS-Eval and NVBench. These results establish GRPO as an effective strategy for structured, multimodal reasoning in visualization generation. We release our code at this https URL.
中文摘要 文本到可视化（Text2Vis）系统将基于表格数据的自然语言查询转换为简明的答案和可执行的可视化。虽然闭源LLM能生成可运行的代码，但最终生成的图表往往缺乏语义对齐和清晰度，这些特性只能在执行后评估。开源模型表现更差，经常产生不可执行或视觉质量差的输出。虽然监督微调可以提升代码可执行性，但无法提升整体可视化质量，因为传统的SFT损失无法捕捉执行后的反馈。为弥补这一空白，我们提出了RL-Text2Vis，这是首个用于Text2Vis生成的强化学习框架。基于组相对策略优化（GRPO），我们的方法采用一种新颖的多目标奖励，通过执行后反馈共同优化文本准确性、代码有效性和可视化质量。通过训练Qwen2.5模型（7B和14B），RL-Text2Vis在Text2Vis基准测试中相较GPT-4o在图表质量上取得了22%的相对提升，并且相较于其零样本基线，代码执行成功率从78%提升到97%。我们的模型显著优于强零样本和监督基线，并对域外数据集如VIS-Eval和NVBench具有稳健的泛化能力。这些结果确立了GRPO作为可视化生成中结构化、多模态推理的有效策略。我们在此 https URL 发布代码。
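
To make the GRPO-with-multi-objective-reward idea concrete, here is a minimal sketch of a group-relative advantage computed over a combined reward. The equal weights, the choice of a weighted sum, and the toy scores are assumptions for illustration only; the abstract does not specify how the three reward terms are combined.

```python
from statistics import mean, pstdev


def combined_reward(text_score, code_ok, vis_score, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of three reward terms (weights are illustrative assumptions)."""
    w_text, w_code, w_vis = weights
    return w_text * text_score + w_code * (1.0 if code_ok else 0.0) + w_vis * vis_score


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within one group of samples."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to the same query (toy scores).
group = [
    combined_reward(1.0, True, 0.8),    # correct answer, runs, good chart
    combined_reward(0.0, True, 0.4),    # wrong answer, runs, mediocre chart
    combined_reward(1.0, False, 0.0),   # correct answer, code fails
    combined_reward(0.0, False, 0.0),   # everything fails
]
advantages = group_relative_advantages(group)
```

Because advantages are computed relative to the group mean, a response is reinforced only when it beats its siblings for the same query, which is why no separate value network is needed.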

Optimizing Path Planning using Deep Reinforcement Learning for UGVs in Precision Agriculture

精准农业中利用深度强化学习优化UGV路径规划

Authors: Laukik Patade, Rohan Rane, Sandeep Pillai
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.04668
Pdf link: https://arxiv.org/pdf/2601.04668
Abstract This study focuses on optimizing path planning for unmanned ground vehicles (UGVs) in precision agriculture using deep reinforcement learning (DRL) techniques in continuous action spaces. The research begins with a review of traditional grid-based methods, such as A* and Dijkstra's algorithms, and discusses their limitations in dynamic agricultural environments, highlighting the need for adaptive learning strategies. The study then explores DRL approaches, including Deep Q-Networks (DQN), which demonstrate improved adaptability and performance in two-dimensional simulations. Enhancements such as Double Q-Networks and Dueling Networks are evaluated to further improve decision-making. Building on these results, the focus shifts to continuous action space models, specifically Deep Deterministic Policy Gradient (DDPG) and Twin Delayed Deep Deterministic Policy Gradient (TD3), which are tested in increasingly complex environments. Experiments conducted in a three-dimensional environment using ROS and Gazebo demonstrate the effectiveness of continuous DRL algorithms in navigating dynamic agricultural scenarios. Notably, the pretrained TD3 agent achieves a 95 percent success rate in dynamic environments, demonstrating the robustness of the proposed approach in handling moving obstacles while ensuring safety for both crops and the robot.
中文摘要 本研究重点是利用深度强化学习（DRL）技术在连续动作空间中优化精准农业中无人地面车辆（UGV）的路径规划。研究首先回顾了传统的基于网格的方法，如A*和Dijkstra算法，并讨论了它们在动态农业环境中的局限性，强调了自适应学习策略的必要性。随后，研究探讨了包括深度Q网络（DQN）在内的DRL方法，这些方法在二维仿真中展现了更好的适应性和性能。研究还评估了双Q网络（Double Q-Networks）和对决网络（Dueling Networks）等改进，以进一步提升决策能力。基于这些结果，研究重点转向连续动作空间模型，特别是深度确定性策略梯度（DDPG）和双延迟深度确定性策略梯度（TD3），这些模型在日益复杂的环境中进行测试。利用ROS和Gazebo在三维环境中进行的实验展示了连续DRL算法在动态农业场景导航中的有效性。值得注意的是，预训练的TD3智能体在动态环境中成功率达95%，展示了该方法在处理移动障碍物时的稳健性，同时保障作物和机器人的安全。
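
One ingredient that distinguishes TD3 from DDPG is the clipped double-Q target, which takes the smaller of two target-critic estimates to limit overestimation. The sketch below shows only that target computation with scalar toy inputs. The function name and numbers are hypothetical, and the paper's networks, reward design, and hyperparameters are not reproduced here.

```python
def td3_target(reward, done, q1_next, q2_next, gamma=0.99):
    """Clipped double-Q target: bootstrap from the smaller of the two critics."""
    not_done = 0.0 if done else 1.0
    return reward + gamma * not_done * min(q1_next, q2_next)


# Toy transition: the two target critics disagree about the next state's value.
y = td3_target(reward=1.0, done=False, q1_next=5.0, q2_next=4.0, gamma=0.9)

# On a terminal transition the target collapses to the immediate reward.
y_terminal = td3_target(reward=-10.0, done=True, q1_next=5.0, q2_next=4.0)
```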

Learning Dynamics in RL Post-Training for Language Models

语言模型后训练中的强化学习动力学

Authors: Akiyoshi Tomihari
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.04670
Pdf link: https://arxiv.org/pdf/2601.04670
Abstract Reinforcement learning (RL) post-training is a critical stage in modern language model development, playing a key role in improving alignment and reasoning ability. However, several phenomena remain poorly understood, including the reduction in output diversity. To gain a broader understanding of RL post-training, we analyze the learning dynamics of RL post-training from a perspective that has been studied in supervised learning but remains underexplored in RL. We adopt an empirical neural tangent kernel (NTK) framework and decompose the NTK into two components to characterize how RL updates propagate across training samples. Our analysis reveals that limited variability in feature representations can cause RL updates to systematically increase model confidence, providing an explanation for the commonly observed reduction in output diversity after RL post-training. Furthermore, we show that effective learning in this regime depends on rapidly shaping the classifier, which directly affects the gradient component of the NTK. Motivated by these insights, we propose classifier-first reinforcement learning (CF-RL), a simple two-stage training strategy that prioritizes classifier updates before standard RL optimization. Experimental results validate our theoretical analysis by demonstrating increased model confidence and accelerated optimization under CF-RL. Additional analysis shows that the mechanism underlying CF-RL differs from that of linear-probing-then-fine-tuning in supervised learning. Overall, our study formalizes the learning dynamics of RL post-training and motivates further analysis and improvement.
中文摘要 强化学习（RL）后训练是现代语言模型开发的关键阶段，在提升对齐和推理能力方面起着关键作用。然而，仍有若干现象尚未被充分理解，包括输出多样性的减少。为了更广泛地理解RL后训练，我们从一个在监督学习中已有研究、但在强化学习中尚未充分探索的视角，分析了RL后训练的学习动态。我们采用经验神经正切核（NTK）框架，将NTK分解为两个组成部分，以描述强化学习更新如何在训练样本间传播。我们的分析显示，特征表示的有限变异性可以促使强化学习更新系统性地提升模型置信度，从而解释了RL后训练后普遍观察到的输出多样性减少。此外，我们表明，在这种情形下的有效学习依赖于快速塑造分类器，这直接影响NTK的梯度成分。基于这些见解，我们提出了分类器优先强化学习（CF-RL），这是一种简单的两阶段训练策略，优先更新分类器，然后进行标准强化学习优化。实验结果验证了我们的理论分析，表明在CF-RL下模型置信度提升且优化加速。进一步分析显示，CF-RL背后的机制不同于监督学习中的先线性探测后微调。总体而言，我们的研究形式化了RL后训练的学习动态，并推动进一步的分析和改进。

Nightmare Dreamer: Dreaming About Unsafe States And Planning Ahead

Nightmare Dreamer：梦见不安全状态并提前规划

Authors: Oluwatosin Oseni, Shengjie Wang, Jun Zhu, Micah Corah
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.04686
Pdf link: https://arxiv.org/pdf/2601.04686
Abstract Reinforcement Learning (RL) has shown remarkable success in real-world applications, particularly in robotics control. However, RL adoption remains limited due to insufficient safety guarantees. We introduce Nightmare Dreamer, a model-based Safe RL algorithm that addresses safety concerns by leveraging a learned world model to predict potential safety violations and plan actions accordingly. Nightmare Dreamer achieves nearly zero safety violations while maximizing rewards. Nightmare Dreamer outperforms model-free baselines on Safety Gymnasium tasks using only image observations, achieving nearly a 20x improvement in efficiency.
中文摘要 强化学习（RL）在现实应用中取得了显著成功，尤其是在机器人控制领域。然而，由于安全保障不足，强化学习的采用仍然有限。我们介绍了Nightmare Dreamer，一种基于模型的安全强化学习算法，它利用学习到的世界模型预测潜在的安全违规并据此规划动作，从而解决安全问题。Nightmare Dreamer在最大化奖励的同时实现了接近零的安全违规。Nightmare Dreamer在仅使用图像观测的Safety Gymnasium任务中表现优于无模型基线，效率提升近20倍。

ResMAS: Resilience Optimization in LLM-based Multi-agent Systems

ResMAS：基于LLM的多智能体系统中的韧性优化

Authors: Zhilun Zhou, Zihan Liu, Jiahe Liu, Qingyu Shao, Yihan Wang, Kun Shao, Depeng Jin, Fengli Xu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.04694
Pdf link: https://arxiv.org/pdf/2601.04694
Abstract Large Language Model-based Multi-Agent Systems (LLM-based MAS), where multiple LLM agents collaborate to solve complex tasks, have shown impressive performance in many areas. However, MAS are typically distributed across different devices or environments, making them vulnerable to perturbations such as agent failures. While existing works have studied the adversarial attacks and corresponding defense strategies, they mainly focus on reactively detecting and mitigating attacks after they occur rather than proactively designing inherently resilient systems. In this work, we study the resilience of LLM-based MAS under perturbations and find that both the communication topology and prompt design significantly influence system resilience. Motivated by these findings, we propose ResMAS: a two-stage framework for enhancing MAS resilience. First, we train a reward model to predict the MAS's resilience, based on which we train a topology generator to automatically design resilient topology for specific tasks through reinforcement learning. Second, we introduce a topology-aware prompt optimization method that refines each agent's prompt based on its connections and interactions with other agents. Extensive experiments across a range of tasks show that our approach substantially improves MAS resilience under various constraints. Moreover, our framework demonstrates strong generalization ability to new tasks and models, highlighting its potential for building resilient MASs.
中文摘要 基于大型语言模型的多代理系统（基于LLM的MAS），即多个LLM代理协作解决复杂任务，在许多领域表现出色。然而，MAS通常分布在不同的设备或环境中，因此容易受到代理故障等干扰的影响。虽然现有研究研究了对抗性攻击及其相应防御策略，但它们主要关注于攻击发生后的被动检测和缓解，而非主动设计具有固有韧性的系统。本研究研究基于LLM的MAS在扰动下的韧性，发现通信拓扑和提示设计显著影响系统韧性。基于这些发现，我们提出了ResMAS：一个提升MAS韧性的两阶段框架。首先，我们训练一个奖励模型来预测MAS的韧性，基于此我们训练拓扑生成器，通过强化学习自动设计特定任务的韧性拓扑。其次，我们引入一种拓扑感知提示优化方法，基于每个代理的连接和其他代理的交互来细化提示。在多种任务中的大量实验表明，我们的方法在各种约束条件下显著提升了MAS的韧性。此外，我们的框架展现出对新任务和模型的强大泛化能力，突出其构建韧性MAS的潜力。

A Method for Constructing a Digital Transformation Driving Mechanism Based on Semantic Understanding of Large Models

基于对大型模型语义理解构建数字化转型驱动机制的方法

Authors: Huayi Liu
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.04696
Pdf link: https://arxiv.org/pdf/2601.04696
Abstract In the process of digital transformation, enterprises are faced with problems such as insufficient semantic understanding of unstructured data and lack of intelligent decision-making basis in driving mechanisms. This study proposes a method that combines a large language model (LLM) and a knowledge graph. First, a fine-tuned BERT (Bidirectional Encoder Representations from Transformers) model is used to perform entity recognition and relationship extraction on multi-source heterogeneous texts, and GPT-4 is used to generate semantically enhanced vector representations; secondly, a two-layer graph neural network (GNN) architecture is designed to fuse the semantic vectors output by LLM with business metadata to construct a dynamic and scalable enterprise knowledge graph; then reinforcement learning is introduced to optimize decision path generation, and the reward function is used to drive the mechanism iteration. In the case of the manufacturing industry, this mechanism reduced the response time for equipment failure scenarios from 7.8 hours to 3.7 hours, the F1 value reached 94.3%, and the compensation for decision errors in the annual digital transformation cost decreased by 45.3%. This method significantly enhances the intelligence level and execution efficiency of the digital transformation driving mechanism by integrating large model semantic understanding with structured knowledge.
中文摘要 在数字化转型过程中，企业面临诸如对非结构化数据语义理解不足以及缺乏智能决策机制等问题。本研究提出了一种结合大型语言模型（LLM）和知识图谱的方法。首先，使用微调的BERT（Transformers双向编码器表示）模型对多源异构文本进行实体识别和关系提取，并利用GPT-4生成语义增强的向量表示;其次，设计了一种两层图神经网络（GNN）架构，将LLM输出的语义向量与业务元数据融合，构建动态且可扩展的企业知识图谱;然后引入强化学习以优化决策路径生成，并利用奖励函数驱动机制迭代。在制造业中，该机制将设备故障场景的响应时间从7.8小时缩短至3.7小时，F1值达到94.3%，年度数字化转型成本中决策错误补偿减少了45.3%。该方法通过将大模型语义理解与结构化知识相结合，显著提升了数字化转型驱动机制的智能水平和执行效率。

TourPlanner: A Competitive Consensus Framework with Constraint-Gated Reinforcement Learning for Travel Planning

TourPlanner：结合约束门控强化学习的竞争共识框架，用于旅行规划

Authors: Yinuo Wang, Mining Tan, Wenxiang Jiao, Xiaoxi Li, Hao Wang, Xuanyu Zhang, Yuan Lu, Weiming Dong
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.04698
Pdf link: https://arxiv.org/pdf/2601.04698
Abstract Travel planning is a sophisticated decision-making process that requires synthesizing multifaceted information to construct itineraries. However, existing travel planning approaches face several challenges: (1) pruning candidate points of interest (POIs) while maintaining a high recall rate; (2) a single reasoning path restricts exploration of the feasible solution space; (3) simultaneously optimizing hard constraints and soft constraints remains a significant difficulty. To address these challenges, we propose TourPlanner, a comprehensive framework featuring multi-path reasoning and constraint-gated reinforcement learning. Specifically, we first introduce a Personalized Recall and Spatial Optimization (PReSO) workflow to construct a spatially aware set of candidate POIs. Subsequently, we propose Competitive consensus Chain-of-Thought (CCoT), a multi-path reasoning paradigm that improves the ability to explore the feasible solution space. To further refine the plan, we integrate a sigmoid-based gating mechanism into the reinforcement learning stage, which dynamically prioritizes soft-constraint satisfaction only after hard constraints are met. Experimental results on travel planning benchmarks demonstrate that TourPlanner achieves state-of-the-art performance, significantly surpassing existing methods in both feasibility and user-preference alignment.
中文摘要 旅行规划是一种复杂的决策过程，需要综合多方面信息来构建行程。然而，现有的旅行规划方法面临若干挑战：（1）在保持高召回率的同时修剪候选兴趣点（POI）；（2）单一推理路径限制了在旅行规划可行解空间内的探索能力；（3）同时优化硬约束和软约束仍是一个重大难题。为应对这些挑战，我们提出了TourPlanner，这是一个综合框架，具备多路径推理和约束门控强化学习。具体来说，我们首先引入了个性化召回与空间优化（PReSO）工作流程，用于构建空间感知的候选兴趣点集合。随后，我们提出了竞争共识思维链（CCoT），这是一种多路径推理范式，提升了探索可行解空间的能力。为进一步完善计划，我们在强化学习阶段集成了基于sigmoid的门控机制，该机制仅在硬约束得到满足之后才动态地优先考虑软约束的满足。旅行规划基准上的实验结果表明，TourPlanner实现了最先进的性能，在可行性和用户偏好匹配度上显著超越现有方法。
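
The sigmoid-based gate described in this abstract can be sketched as follows. The exact form is an assumption: scores are taken to lie in [0, 1], the gate is a logistic function of the hard-constraint score, and `threshold` and `sharpness` are hypothetical parameters not given in the abstract.

```python
import math


def gated_reward(hard_score, soft_score, threshold=0.9, sharpness=20.0):
    """Add the soft-constraint reward only once hard constraints are nearly met."""
    # Gate is close to 0 below the threshold and rises smoothly above it.
    gate = 1.0 / (1.0 + math.exp(-sharpness * (hard_score - threshold)))
    return hard_score + gate * soft_score


# A plan that violates hard constraints earns almost nothing for user preferences.
infeasible = gated_reward(hard_score=0.5, soft_score=1.0)

# A feasible plan receives most of its soft-constraint reward.
feasible = gated_reward(hard_score=1.0, soft_score=1.0)
```

A smooth gate keeps the reward differentiable in the hard-constraint score, unlike a hard if/else switch, so the policy still receives a gradient signal near the feasibility boundary.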

ThinkDrive: Chain-of-Thought Guided Progressive Reinforcement Learning Fine-Tuning for Autonomous Driving

ThinkDrive：思维链引导的渐进强化学习为自动驾驶微调

Authors: Chang Zhao, Zheming Yang, Yunqing Hu, Qi Guo, Zijian Wang, Pengcheng Li, Wen Ji
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.04714
Pdf link: https://arxiv.org/pdf/2601.04714
Abstract With the rapid advancement of large language model (LLM) technologies, their application in the domain of autonomous driving has become increasingly widespread. However, existing methods suffer from unstructured reasoning, poor generalization, and misalignment with human driving intent. While Chain-of-Thought (CoT) reasoning enhances decision transparency, conventional supervised fine-tuning (SFT) fails to fully exploit its potential, and reinforcement learning (RL) approaches face instability and suboptimal reasoning depth. We propose ThinkDrive, a CoT-guided progressive RL fine-tuning framework for autonomous driving that synergizes explicit reasoning with difficulty-aware adaptive policy optimization. Our method employs a two-stage training strategy. First, we perform SFT using CoT explanations. Then, we apply progressive RL with a difficulty-aware adaptive policy optimizer that dynamically adjusts learning intensity based on sample complexity. We evaluate our approach on a public dataset. The results show that ThinkDrive outperforms strong RL baselines by 1.45%, 1.95%, and 1.01% on exam, easy-exam, and accuracy, respectively. Moreover, a 2B-parameter model trained with our method surpasses the much larger GPT-4o by 3.28% on the exam metric.
中文摘要 随着大型语言模型（LLM）技术的快速发展，其在自动驾驶领域的应用日益广泛。然而，现有方法存在推理无结构、泛化能力差以及与人类驾驶意图不一致的问题。虽然思维链（CoT）推理提高了决策透明度，但传统的监督微调（SFT）未能充分发挥其潜力，强化学习（RL）方法则面临不稳定和推理深度不足的问题。我们提出了ThinkDrive，这是一个由CoT引导的渐进式强化学习微调框架，用于自动驾驶，能够将显式推理与难度感知的自适应策略优化相结合。我们的方法采用两阶段训练策略。首先，我们使用CoT解释进行SFT。然后，我们应用渐进式强化学习，配合难度感知的自适应策略优化器，根据样本复杂度动态调整学习强度。我们在公开数据集上评估我们的方法。结果显示，ThinkDrive在exam、easy-exam和accuracy三项指标上分别比强大的强化学习基线高出1.45%、1.95%和1.01%。此外，使用我们方法训练的2B参数模型在exam指标上比规模大得多的GPT-4o高出3.28%。

AM$^3$Safety: Towards Data Efficient Alignment of Multi-modal Multi-turn Safety for MLLMs

AM$^3$Safety：迈向MLLM多模态多轮安全的数据高效对齐

Authors: Han Zhu, Jiale Chen, Chengkun Cai, Shengjie Sun, Haoran Li, Yujin Zhou, Chi-Min Chan, Pengcheng Wen, Lei Li, Sirui Han, Yike Guo
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.04736
Pdf link: https://arxiv.org/pdf/2601.04736
Abstract Multi-modal Large Language Models (MLLMs) are increasingly deployed in interactive applications. However, their safety vulnerabilities become pronounced in multi-turn multi-modal scenarios, where harmful intent can be gradually reconstructed across turns, and security protocols fade into oblivion as the conversation progresses. Existing Reinforcement Learning from Human Feedback (RLHF) alignment methods are largely developed for single-turn visual question-answer (VQA) tasks and often require costly manual preference annotations, limiting their effectiveness and scalability in dialogues. To address this challenge, we present InterSafe-V, an open-source multi-modal dialogue dataset containing 11,270 dialogues and 500 specially designed refusal VQA samples. This dataset, constructed through interaction between several models, is designed to more accurately reflect real-world scenarios and includes specialized VQA pairs tailored for specific domains. Building on this dataset, we propose AM$^3$Safety, a framework that combines a cold-start refusal phase with Group Relative Policy Optimization (GRPO) fine-tuning using turn-aware dual-objective rewards across entire dialogues. Experiments on Qwen2.5-VL-7B-Instruct and LLaVA-NeXT-7B show a decrease of more than 10% in Attack Success Rate (ASR), together with an increase of at least 8% in the harmless dimension and over 13% in the helpful dimension of MLLMs on multi-modal multi-turn safety benchmarks, while preserving their general abilities.
中文摘要 多模态大型语言模型（MLLM）正日益被应用于交互式应用中。然而，在多轮多模态场景中，它们的安全漏洞会变得明显：有害意图可以跨轮次逐步重建，而安全规范随着对话推进被逐渐遗忘。现有的基于人类反馈的强化学习（RLHF）对齐方法主要为单轮视觉问答（VQA）任务开发，且通常需要昂贵的人工偏好标注，限制了其在对话中的有效性和可扩展性。为应对这一挑战，我们推出了InterSafe-V，一个开源的多模态对话数据集，包含11,270条对话和500个专门设计的拒绝VQA样本。该数据集通过多个模型之间的交互构建，旨在更准确地反映现实世界场景，并包含针对特定领域量身定制的专门VQA对。基于该数据集，我们提出了AM$^3$Safety框架，该框架将冷启动拒绝阶段与组相对策略优化（GRPO）微调相结合，并在整个对话上使用轮次感知的双目标奖励。在Qwen2.5-VL-7B-Instruct和LLaVA-NeXT-7B上的实验显示，在多模态多轮安全基准上，MLLM的攻击成功率（ASR）下降超过10%，无害性维度提升至少8%，有用性维度提升超过13%，同时保持其通用能力。

AT$^2$PO: Agentic Turn-based Policy Optimization via Tree Search

AT$^2$PO：通过树搜索实现代理回合策略优化

Authors: Zefang Zong, Dingwei Chen, Yang Li, Qi Yi, Bo Zhou, Chengming Li, Bo Qian, Peng Chen, Jie Jiang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.04767
Pdf link: https://arxiv.org/pdf/2601.04767
Abstract LLM agents have emerged as powerful systems for tackling multi-turn tasks by interleaving internal reasoning and external tool interactions. Agentic Reinforcement Learning has recently drawn significant research attention as a critical post-training paradigm to further refine these capabilities. In this paper, we present AT$^2$PO (Agentic Turn-based Policy Optimization via Tree Search), a unified framework for multi-turn agentic RL that addresses three core challenges: limited exploration diversity, sparse credit assignment, and misaligned policy optimization. AT$^2$PO introduces a turn-level tree structure that jointly enables Entropy-Guided Tree Expansion for strategic exploration and Turn-wise Credit Assignment for fine-grained reward propagation from sparse outcomes. Complementing this, we propose Agentic Turn-based Policy Optimization, a turn-level learning objective that aligns policy updates with the natural decision granularity of agentic interactions. ATPO is orthogonal to tree search and can be readily integrated into any multi-turn RL pipeline. Experiments across seven benchmarks demonstrate consistent improvements over the state-of-the-art baseline by up to 1.84 percentage points on average, with ablation studies validating the effectiveness of each component. Our code is available at this https URL.
中文摘要 LLM代理通过交织内部推理与外部工具交互，已成为处理多回合任务的强大系统。代理强化学习最近作为进一步完善这些能力的关键后训练范式，受到了大量研究关注。本文介绍了AT$^2$PO（通过树搜索实现的代理回合制策略优化），这是一个多回合代理强化学习的统一框架，解决了三大核心挑战：探索多样性有限、信用分配稀疏和策略优化错位。AT$^2$PO引入了一种回合级树结构，同时支持用于策略性探索的熵引导树扩展，以及用于从稀疏结果中进行细粒度奖励传播的逐回合信用分配。作为补充，我们提出了代理回合制策略优化（Agentic Turn-based Policy Optimization），这是一种回合级学习目标，使策略更新与代理交互的自然决策粒度保持一致。ATPO与树搜索正交，可以轻松集成到任何多回合强化学习流水线中。在七个基准上的实验显示，相比最先进基线取得了一致的提升，平均最多提升1.84个百分点，消融研究验证了各组成部分的有效性。我们的代码可在此 https URL 访问。

AgentOCR: Reimagining Agent History via Optical Self-Compression

AgentOCR：通过光学自压缩重新构想代理历史

Authors: Lang Feng, Fuchao Yang, Feng Chen, Xin Cheng, Haiyang Xu, Zhenglin Wan, Ming Yan, Bo An
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.04786
Pdf link: https://arxiv.org/pdf/2601.04786
Abstract Recent advances in large language models (LLMs) enable agentic systems trained with reinforcement learning (RL) over multi-turn interaction trajectories, but practical deployment is bottlenecked by rapidly growing textual histories that inflate token budgets and memory usage. We introduce AgentOCR, a framework that exploits the superior information density of visual tokens by representing the accumulated observation-action history as a compact rendered image. To make multi-turn rollouts scalable, AgentOCR proposes segment optical caching. By decomposing history into hashable segments and maintaining a visual cache, this mechanism eliminates redundant re-rendering. Beyond fixed rendering, AgentOCR introduces agentic self-compression, where the agent actively emits a compression rate and is trained with compression-aware reward to adaptively balance task success and token efficiency. We conduct extensive experiments on challenging agentic benchmarks, ALFWorld and search-based QA. Remarkably, results demonstrate that AgentOCR preserves over 95% of text-based agent performance while substantially reducing token consumption (>50%), yielding consistent token and memory efficiency. Our further analysis validates a 20x rendering speedup from segment optical caching and the effective strategic balancing of self-compression.
中文摘要 大型语言模型（LLM）的最新进展使得可以在多回合交互轨迹上用强化学习（RL）训练代理系统，但实际部署受限于快速增长的文本历史，这些历史使token预算和内存使用量膨胀。我们介绍了AgentOCR，这是一个利用视觉token更高信息密度的框架，它将累积的观察-动作历史表示为紧凑的渲染图像。为了使多回合rollout具有可扩展性，AgentOCR提出了分段光学缓存。通过将历史分解为可哈希的片段并维护一个视觉缓存，该机制消除了冗余的重新渲染。除了固定渲染之外，AgentOCR还引入了代理自我压缩，即代理主动输出压缩率，并通过压缩感知奖励进行训练，以自适应地平衡任务成功率和token效率。我们在具有挑战性的代理基准ALFWorld和基于搜索的问答（QA）上进行了大量实验。值得注意的是，结果显示AgentOCR保留了95%以上的基于文本的代理性能，同时显著降低了token消耗（>50%），带来了一致的token和内存效率。我们的进一步分析验证了分段光学缓存带来的20倍渲染加速，以及自压缩在策略上的有效平衡。
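
The segment-caching idea (hash each history segment, render it once, reuse it on later turns) can be sketched as below. The class `SegmentCache`, the SHA-256 key, and the stand-in renderer `fake_render` are all hypothetical; a real system would rasterize text into an image rather than return bytes.

```python
import hashlib


class SegmentCache:
    """Render each distinct history segment once and reuse the result afterwards."""

    def __init__(self, render_fn):
        self._render_fn = render_fn
        self._cache = {}
        self.render_calls = 0  # counts real (non-cached) renders

    def render(self, segment):
        key = hashlib.sha256(segment.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self._cache[key] = self._render_fn(segment)
            self.render_calls += 1
        return self._cache[key]

    def render_history(self, segments):
        return [self.render(s) for s in segments]


def fake_render(segment):
    """Stand-in for an image renderer; returns bytes instead of pixels."""
    return segment.upper().encode("utf-8")


cache = SegmentCache(fake_render)
history = ["obs: kitchen", "act: open fridge"]
cache.render_history(history)            # turn 1: two real renders
history.append("obs: fridge is open")
frames = cache.render_history(history)   # turn 2: only the new segment is rendered
```

Since an agent's history grows by appending, almost every segment at turn t was already rendered at turn t-1, so the per-turn rendering cost stays roughly constant instead of growing with history length.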

Thinking-Based Non-Thinking: Solving the Reward Hacking Problem in Training Hybrid Reasoning Models via Reinforcement Learning

基于思维的非思考：通过强化学习解决混合推理模型训练中的奖励黑客问题

Authors: Siyuan Gan, Jiaheng Liu, Boyan Wang, Tianpei Yang, Runqing Miao, Yuyao Zhang, Fanyu Meng, Junlan Feng, Linjian Meng, Jing Huo, Yang Gao
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.04805
Pdf link: https://arxiv.org/pdf/2601.04805
Abstract Large reasoning models (LRMs) have attracted much attention due to their exceptional performance. However, their performance mainly stems from thinking, a long Chain of Thought (CoT), which significantly increases computational overhead. To address this overthinking problem, existing work focuses on using reinforcement learning (RL) to train hybrid reasoning models that automatically decide whether to engage in thinking or not based on the complexity of the query. Unfortunately, using RL suffers from the reward hacking problem, e.g., the model engages in thinking but is judged as not doing so, resulting in incorrect rewards. To mitigate this problem, existing works either employ supervised fine-tuning (SFT), which incurs high computational costs, or enforce uniform token limits on non-thinking responses, which yields limited mitigation of the problem. In this paper, we propose Thinking-Based Non-Thinking (TNT). It does not employ SFT, and sets different maximum token usage for responses not using thinking across various queries by leveraging information from the solution component of the responses using thinking. Experiments on five mathematical benchmarks demonstrate that TNT reduces token usage by around 50% compared to DeepSeek-R1-Distill-Qwen-1.5B/7B and DeepScaleR-1.5B, while significantly improving accuracy. In fact, TNT achieves the optimal trade-off between accuracy and efficiency among all tested methods. Additionally, the probability of the reward hacking problem in TNT's responses that are classified as not using thinking remains below 10% across all tested datasets.
中文摘要 大型推理模型（LRM）因其卓越的性能而备受关注。然而，它们的性能主要源于思考，即一条漫长的思维链（CoT），这显著增加了计算开销。为解决这一过度思考问题，现有研究重点是利用强化学习（RL）训练混合推理模型，这些模型能根据查询的复杂度自动决定是否进行思考。不幸的是，使用强化学习会遇到奖励黑客问题，例如模型实际进行了思考，却被判定为没有思考，导致错误的奖励。为缓解这一问题，现有研究要么采用监督微调（SFT），这会产生高计算成本；要么对非思考回答强制施加统一的token上限，但这对问题的缓解有限。本文提出了基于思考的非思考（TNT）。它不使用SFT，而是利用思考回答中解答部分的信息，为不同查询的非思考回答设定不同的最大token使用量。在五个数学基准上的实验表明，与DeepSeek-R1-Distill-Qwen-1.5B/7B和DeepScaleR-1.5B相比，TNT使token使用量降低了约50%，同时显著提升了准确率。事实上，TNT在所有测试方法中实现了准确率和效率的最佳权衡。此外，在TNT被归类为未使用思考的回答中，出现奖励黑客问题的概率在所有测试数据集上均低于10%。

SCALER: Synthetic Scalable Adaptive Learning Environment for Reasoning

SCALER：用于推理的合成可扩展自适应学习环境

Authors: Caijun Xu, Changyi Xiao, Zhongyuan Peng, Xinrun Wang, Yixin Cao
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.04809
Pdf link: https://arxiv.org/pdf/2601.04809
Abstract Reinforcement learning (RL) offers a principled way to enhance the reasoning capabilities of large language models, yet its effectiveness hinges on training signals that remain informative as models evolve. In practice, RL progress often slows when task difficulty becomes poorly aligned with model capability, or when training is dominated by a narrow set of recurring problem patterns. To jointly address these issues, we propose SCALER (Synthetic sCalable Adaptive Learning Environment for Reasoning), a framework that sustains effective learning signals through adaptive environment design. SCALER introduces a scalable synthesis pipeline that converts real-world programming problems into verifiable reasoning environments with controllable difficulty and unbounded instance generation, enabling RL training beyond finite datasets while preserving strong correctness guarantees. Building on this, SCALER further employs an adaptive multi-environment RL strategy that dynamically adjusts instance difficulty and curates the active set of environments to track the model's capability frontier and maintain distributional diversity. This co-adaptation prevents reward sparsity, mitigates overfitting to narrow task patterns, and supports sustained improvement throughout training. Extensive experiments show that SCALER consistently outperforms dataset-based RL baselines across diverse reasoning benchmarks and exhibits more stable, long-horizon training dynamics.
中文摘要 强化学习（RL）提供了一种有原则的方式来增强大型语言模型的推理能力，但其有效性依赖于在模型演进过程中始终保持信息量的训练信号。实际上，当任务难度与模型能力不匹配，或训练被一小部分反复出现的问题模式主导时，强化学习的进展常常放缓。为同时解决这些问题，我们提出了SCALER（用于推理的合成可扩展自适应学习环境），这是一个通过自适应环境设计维持有效学习信号的框架。SCALER引入了可扩展的合成流水线，将现实世界的编程问题转化为可验证的推理环境，具有可控难度和无限的实例生成能力，使强化学习训练能够超越有限数据集，同时保持强的正确性保证。在此基础上，SCALER进一步采用自适应多环境强化学习策略，动态调整实例难度并筛选活跃环境集，以跟踪模型的能力前沿并保持分布多样性。这种协同适应防止了奖励稀疏，减轻了对狭窄任务模式的过拟合，并支持整个训练过程中的持续进步。大量实验表明，SCALER在多种推理基准上持续优于基于数据集的强化学习基线，并且展现出更稳定、更长程的训练动态。

Intelligent resource allocation in wireless networks via deep reinforcement learning

通过深度强化学习实现无线网络中的智能资源分配

Authors: Marie Diane Iradukunda, Chabi F. Elégbédé, Yaé Ulrich Gaba
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2601.04842
Pdf link: https://arxiv.org/pdf/2601.04842
Abstract This study addresses the challenge of optimal power allocation in stochastic wireless networks by employing a Deep Reinforcement Learning (DRL) framework. Specifically, we design a Deep Q-Network (DQN) agent capable of learning adaptive power control policies directly from channel state observations, effectively bypassing the need for explicit system models. We formulate the resource allocation problem as a Markov Decision Process (MDP) and benchmark the proposed approach against classical heuristics, including fixed allocation, random assignment, and the theoretical water-filling algorithm. Empirical results demonstrate that the DQN agent achieves a system throughput of 3.88 Mbps, effectively matching the water-filling upper bound, while outperforming the random and fixed allocation strategies by approximately 73% and 27%, respectively. Moreover, the agent exhibits emergent fairness, maintaining a Jain's Index of 0.91, and successfully optimizes the trade-off between spectral efficiency and energy consumption. These findings substantiate the efficacy of model-free DRL as a robust and scalable solution for resource management in next-generation communication systems.
中文摘要 本研究通过采用深度强化学习（DRL）框架，解决随机无线网络中最优功率分配的挑战。具体来说，我们设计了一个深度Q网络（DQN）智能体，能够直接从信道状态观测中学习自适应功率控制策略，从而无需显式的系统模型。我们将资源分配问题建模为马尔可夫决策过程（MDP），并将所提出的方法与经典启发式方法（包括固定分配、随机分配和理论上的注水算法）进行对比。实证结果表明，DQN智能体实现了3.88 Mbps的系统吞吐量，基本达到注水算法的上界，同时分别比随机分配和固定分配策略高出约73%和27%。此外，该智能体表现出涌现的公平性，Jain指数保持在0.91，并成功优化了频谱效率与能量消耗之间的权衡。这些发现证实了无模型DRL作为下一代通信系统资源管理中稳健且可扩展解决方案的有效性。
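
For reference, the two classical quantities named in this abstract, water-filling power allocation and Jain's fairness index, can be computed as in the sketch below. It assumes strictly positive channel gains normalized by noise power, so that rate is proportional to log(1 + g_i * p_i). The toy gains and the power budget are made up and are not the paper's setup.

```python
def water_filling(gains, total_power):
    """Allocate p_i = max(0, mu - 1/g_i) so that the powers sum to total_power."""
    inv = sorted((1.0 / g, i) for i, g in enumerate(gains))  # best channel first
    powers = [0.0] * len(gains)
    active = len(inv)
    mu = 0.0
    while active > 0:
        # Water level if exactly the `active` best channels are used.
        mu = (total_power + sum(v for v, _ in inv[:active])) / active
        if mu > inv[active - 1][0]:
            break  # the weakest active channel still gets positive power
        active -= 1
    for v, i in inv[:active]:
        powers[i] = mu - v
    return powers


def jain_index(values):
    """Jain's fairness index: 1.0 means perfectly equal, 1/n means one user takes all."""
    total = sum(values)
    squares = sum(v * v for v in values)
    return (total * total) / (len(values) * squares) if squares > 0 else 0.0


powers = water_filling([2.0, 1.0, 0.1], total_power=1.0)
fair = jain_index([1.0, 1.0, 1.0, 1.0])
unfair = jain_index([1.0, 0.0, 0.0, 0.0])
```

In this toy case the weakest channel receives no power at all, which illustrates why throughput-optimal allocation and fairness can pull in different directions.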

RAAR: Retrieval Augmented Agentic Reasoning for Cross-Domain Misinformation Detection

RAAR：跨领域错误信息检测的检索增强代理推理

Authors: Zhiwei Liu, Runteng Guo, Baojie Qu, Yuechen Jiang, Min Peng, Qianqian Xie, Sophia Ananiadou
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.04853
Pdf link: https://arxiv.org/pdf/2601.04853
Abstract Cross-domain misinformation detection is challenging, as misinformation arises across domains with substantial differences in knowledge and discourse. Existing methods often rely on single-perspective cues and struggle to generalize to challenging or underrepresented domains, while reasoning large language models (LLMs), though effective on complex tasks, are limited to same-distribution data. To address these gaps, we introduce RAAR, the first retrieval-augmented agentic reasoning framework for cross-domain misinformation detection. To enable cross-domain transfer beyond same-distribution assumptions, RAAR retrieves multi-perspective source-domain evidence aligned with each target sample's semantics, sentiment, and writing style. To overcome single-perspective modeling and missing systematic reasoning, RAAR constructs verifiable multi-step reasoning paths through specialized multi-agent collaboration, where perspective-specific agents produce complementary analyses and a summary agent integrates them under verifier guidance. RAAR further applies supervised fine-tuning and reinforcement learning to train a single multi-task verifier to enhance verification and reasoning capabilities. Based on RAAR, we trained the RAAR-8b and RAAR-14b models. Evaluation on three cross-domain misinformation detection tasks shows that RAAR substantially enhances the capabilities of the base models and outperforms other cross-domain methods, advanced LLMs, and LLM-based adaptation approaches. The project will be released at this https URL.
中文摘要 跨领域虚假信息检测具有挑战性，因为虚假信息出现在知识和话语差异很大的不同领域中。现有方法常依赖单视角线索，难以泛化到具有挑战性或代表性不足的领域，而推理型大型语言模型（LLM）虽然在复杂任务上有效，但仅限于同分布数据。为弥补这些空白，我们引入了RAAR，这是首个用于跨领域虚假信息检测的检索增强代理推理框架。为了突破同分布假设、实现跨领域迁移，RAAR检索与每个目标样本的语义、情感和写作风格相符的多视角源域证据。为克服单一视角建模和系统推理的缺失，RAAR通过专门的多代理协作构建可验证的多步推理路径，其中各视角专属代理生成互补的分析，汇总代理在验证器指导下对其进行整合。RAAR进一步应用监督微调和强化学习，训练单一的多任务验证器，以增强验证和推理能力。基于RAAR，我们训练了RAAR-8b和RAAR-14b模型。在三个跨领域虚假信息检测任务上的评估显示，RAAR显著提升了基础模型的能力，并优于其他跨领域方法、先进的LLM以及基于LLM的适配方法。该项目将在此 https URL 发布。

Flexible Manufacturing Systems Intralogistics: Dynamic Optimization of AGVs and Tool Sharing Using Coloured-Timed Petri Nets and Actor-Critic RL with Actions Masking

柔性制造系统内部物流：利用着色赋时Petri网和带动作掩码的actor-critic RL动态优化AGV和工具共享

Authors: Sofiene Lassoued, Laxmikant Shrikant Baheti, Nathalie Weiß-Borkowski, Stefan Lier, Andreas Schwung
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.04887
Pdf link: https://arxiv.org/pdf/2601.04887
Abstract Flexible Manufacturing Systems (FMS) are pivotal in optimizing production processes in today's rapidly evolving manufacturing landscape. This paper advances the traditional job shop scheduling problem by incorporating additional complexities through the simultaneous integration of automated guided vehicles (AGVs) and tool-sharing systems. We propose a novel approach that combines Colored-Timed Petri Nets (CTPNs) with actor-critic model-based reinforcement learning (MBRL), effectively addressing the multifaceted challenges associated with FMS. CTPNs provide a formal modeling structure and dynamic action masking, significantly reducing the action search space, while MBRL ensures adaptability to changing environments through the learned policy. Leveraging the advantages of MBRL, we incorporate a lookahead strategy for optimal positioning of AGVs, improving operational efficiency. Our approach was evaluated on small-sized public benchmarks and a newly developed large-scale benchmark inspired by the Taillard benchmark. The results show that our approach matches traditional methods on smaller instances and outperforms them on larger ones in terms of makespan while achieving a tenfold reduction in computation time. To ensure reproducibility, we propose a gym-compatible environment and an instance generator. Additionally, an ablation study evaluates the contribution of each framework component to its overall performance.
中文摘要 柔性制造系统（FMS）在当今快速发展的制造环境中，对优化生产流程起着关键作用。本文通过同时引入自动导引车（AGV）和工具共享系统，为传统的作业车间调度问题增加了额外的复杂性。我们提出了一种新颖方法，将着色赋时Petri网（CTPN）与actor-critic基于模型的强化学习（MBRL）相结合，有效应对FMS相关的多方面挑战。CTPN提供了形式化建模结构和动态动作掩码，显著缩小了动作搜索空间，而MBRL则通过学到的策略确保对变化环境的适应性。利用MBRL的优势，我们引入前瞻策略来优化AGV的定位，提升运行效率。我们的方法在小规模公开基准和受Taillard基准启发而新开发的大规模基准上进行了评估。结果显示，就完工时间（makespan）而言，我们的方法在较小实例上与传统方法相当，在较大实例上优于传统方法，同时计算时间减少到十分之一。为确保可复现性，我们提供了一个与gym兼容的环境和一个实例生成器。此外，消融研究评估了框架各组件对整体性能的贡献。
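
Editor's sketch (not from the paper): the abstract says the Petri net supplies a dynamic action mask that shrinks the action search space. One common way to apply such a mask in an actor-critic policy is to zero out the probability of disabled actions before sampling. The function name `masked_softmax`, the logits, and the mask below are all hypothetical; the paper's actual masking scheme may differ.

```python
import math

def masked_softmax(logits, mask):
    """Return action probabilities with disabled actions forced to zero.

    logits: floats from a (hypothetical) actor network.
    mask:   bools, True where the action is currently enabled,
            e.g. a Petri-net transition that is allowed to fire.
    """
    if not any(mask):
        raise ValueError("at least one action must be enabled")
    # Subtract the largest enabled logit for numerical stability.
    top = max(l for l, ok in zip(logits, mask) if ok)
    exps = [math.exp(l - top) if ok else 0.0 for l, ok in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical example: four actions, two of them disabled by the mask.
logits = [2.0, 0.5, -1.0, 3.0]
mask = [True, False, True, False]
probs = masked_softmax(logits, mask)
```

Disabled actions receive exactly zero probability, so the policy can only sample among the enabled ones.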

SKATER: Synthesized Kinematics for Advanced Traversing Efficiency on a Humanoid Robot via Roller Skate Swizzles

SKATER：借助轮滑葫芦步（swizzle）在人形机器人上实现高效移动的综合运动学

Authors: Junchi Gu, Feiyang Yuan, Weize Shi, Tianchen Huang, Haopeng Zhang, Xiaohu Zhang, Yu Wang, Wei Gao, Shiwu Zhang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.04948
Pdf link: https://arxiv.org/pdf/2601.04948
Abstract Although recent years have seen significant progress of humanoid robots in walking and running, the frequent foot strikes with the ground during these locomotion gaits inevitably generate high instantaneous impact forces, which lead to exacerbated joint wear and poor energy utilization. Roller skating, as a sport with substantial biomechanical value, can achieve fast and continuous sliding through rational utilization of body inertia, featuring minimal kinetic energy loss. Therefore, this study proposes a novel humanoid robot with each foot equipped with a row of four passive wheels for roller skating. A deep reinforcement learning control framework is also developed for the swizzle gait, with the reward function designed around the intrinsic characteristics of roller skating. The learned policy is first analyzed in simulation and then deployed on the physical robot to demonstrate the smoothness and efficiency of the swizzle gait over the traditional bipedal walking gait in terms of Impact Intensity and Cost of Transport during locomotion. Reductions of 75.86% and 63.34% in these two metrics indicate that roller skating is a superior locomotion mode for enhanced energy efficiency and joint longevity.
中文摘要 尽管近年来人形机器人在步行和跑步方面取得了显著进步，但这些步态中脚部与地面的频繁碰撞不可避免地产生很大的瞬时冲击力，导致关节磨损加剧和能量利用率低下。轮滑作为一项具有显著生物力学价值的运动，通过合理利用身体惯性实现快速且连续的滑行，动能损失极小。因此，本研究提出了一种新型人形机器人，每只脚配备一排四个被动轮，用于轮滑。本研究还针对葫芦步（swizzle）步态开发了深度强化学习控制框架，其奖励函数基于轮滑的内在特性设计。学到的策略先在仿真中分析，然后部署到实体机器人上，以冲击强度和运输成本（Cost of Transport）两项指标展示葫芦步步态相较传统双足步行步态的平滑性和效率。这两项指标分别降低了75.86%和63.34%，表明轮滑是一种更优的运动方式，有助于提升能量效率和关节寿命。

Precision over Diversity: High-Precision Reward Generalizes to Robust Instruction Following

精度胜于多样性：高精度奖励可泛化为稳健的指令遵循

Authors: Yirong Zeng, Yufei Liu, Xiao Ding, Yutai Hou, Yuxian Wang, Haonan Song, Wu Ning, Dandan Tu, Qixun Zhang, Bibo Cai, Yuxiang He, Ting Liu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.04954
Pdf link: https://arxiv.org/pdf/2601.04954
Abstract A central belief in scaling reinforcement learning with verifiable rewards for instruction following (IF) tasks is that a diverse mixture of verifiable hard and unverifiable soft constraints is essential for generalizing to unseen instructions. In this work, we challenge this prevailing consensus through a systematic empirical investigation. Counter-intuitively, we find that models trained on hard-only constraints consistently outperform those trained on mixed datasets. Extensive experiments reveal that reward precision, rather than constraint diversity, is the primary driver of effective alignment. The LLM judge suffers from a low recall rate in detecting false responses, which leads to severe reward hacking, thereby undermining the benefits of diversity. Furthermore, analysis of the attention mechanism reveals that high-precision rewards develop a transferable meta-skill for IF. Motivated by these insights, we propose a simple yet effective data-centric refinement strategy that prioritizes reward precision. Evaluated on five benchmarks, our approach outperforms competitive baselines by 13.4% in performance while achieving a 58% reduction in training time, maintaining strong generalization beyond instruction following. Our findings advocate for a paradigm shift: moving away from the indiscriminate pursuit of data diversity toward high-precision rewards.
中文摘要 在为指令遵循（IF）任务扩展带可验证奖励的强化学习时，一个核心信念是：要泛化到未见过的指令，必须混合使用多样化的可验证硬约束和不可验证软约束。在本研究中，我们通过系统性的实证研究挑战这一普遍共识。与直觉相反，我们发现仅用硬约束训练的模型始终优于用混合数据集训练的模型。大量实验表明，奖励精度而非约束多样性才是有效对齐的主要驱动力。LLM评判器在检测错误回答时召回率较低，导致严重的奖励黑客（reward hacking）行为，从而削弱了多样性的益处。此外，对注意力机制的分析显示，高精度奖励能够培养出可迁移的指令遵循元技能。基于这些洞察，我们提出了一种简单而有效的、以数据为中心的精炼策略，优先保证奖励精度。在五个基准上的评估表明，我们的方法在性能上比有竞争力的基线高出13.4%，同时训练时间减少了58%，并且在指令遵循之外保持了较强的泛化能力。我们的发现倡导范式转变：从盲目追求数据多样性转向追求高精度奖励。

Safe Reinforcement Learning Beyond Baseline Control: A Hierarchical Framework for Space Triangle Tethered Formation System

超越基线控制的安全强化学习：空间三角系绳编队系统的层级框架

Authors: Xinyi Tao, Panfeng Huang, Fan Zhang
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2601.04957
Pdf link: https://arxiv.org/pdf/2601.04957
Abstract Triangular tethered formation system (TTFS) provides a promising platform for deep space exploration and distributed sensing due to its intrinsic spatial-orientation stability and capability of adjusting distances among node satellites through deployment and retrieval of tethers. However, due to the coupled tether-satellite dynamics and disturbance sensitivity of TTFS, traditional control methods struggle to achieve a balanced trade-off among configuration accuracy requirements, tension constraints, and energy consumption throughout the deployment. In this paper, a novel model-reference reinforcement learning control framework is proposed for TTFS. By integrating baseline model-based control with a Soft Actor-Critic (SAC) compensator, the proposed method simultaneously achieves high-precision tracking, fuel efficiency, and compliance with tension limits. A hierarchical training scheme is developed to address the convergence difficulties arising from strongly coupled states in centralized training, while tailored reward functions, reset conditions, and normalization criteria are designed to accelerate training convergence. Closed-loop stability of the overall control law is rigorously proven using Lyapunov methods. Simulation results demonstrate that the proposed controller reduces steady-state tracking errors by over 96% for tethers and 99% for node satellites, while cutting fuel consumption by two orders of magnitude compared with the baseline method. These results validate the effectiveness and stability of the proposed approach for TTFS deployment control.
中文摘要 三角系绳编队系统（TTFS）因其固有的空间定向稳定性，以及通过释放和回收系绳调整节点卫星间距离的能力，为深空探测和分布式传感提供了有前景的平台。然而，由于TTFS的系绳-卫星耦合动力学和对扰动的敏感性，传统控制方法难以在整个展开过程中兼顾构型精度要求、张力约束和能量消耗。本文为TTFS提出了一种新型模型参考强化学习控制框架。通过将基于模型的基线控制与Soft Actor-Critic（SAC）补偿器相结合，所提方法同时实现了高精度跟踪、燃料效率以及对张力限制的满足。为解决集中式训练中强耦合状态带来的收敛困难，本文设计了分层训练方案，并设计了专门的奖励函数、重置条件和归一化准则以加速训练收敛。利用李雅普诺夫方法严格证明了整体控制律的闭环稳定性。仿真结果表明，与基线方法相比，所提控制器将系绳的稳态跟踪误差降低了96%以上，将节点卫星的稳态跟踪误差降低了99%，同时燃料消耗降低了两个数量级。这些结果验证了所提方法在TTFS展开控制中的有效性和稳定性。

Text as a Universal Interface for Transferable Personalization

文本作为可迁移个性化的通用接口

Authors: Yuting Liu, Jian Guan, Jia-Nan Li, Wei Wu, Jiang-Ming Yang, Jianzhe Zhao, Guibing Guo
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.04963
Pdf link: https://arxiv.org/pdf/2601.04963
Abstract We study the problem of personalization in large language models (LLMs). Prior work predominantly represents user preferences as implicit, model-specific vectors or parameters, yielding opaque "black-box" profiles that are difficult to interpret and transfer across models and tasks. In contrast, we advocate natural language as a universal, model- and task-agnostic interface for preference representation. The formulation leads to interpretable and reusable preference descriptions, while naturally supporting continual evolution as new interactions are observed. To learn such representations, we introduce a two-stage training framework that combines supervised fine-tuning on high-quality synthesized data with reinforcement learning to optimize long-term utility and cross-task transferability. Based on this framework, we develop AlignXplore+, a universal preference reasoning model that generates textual preference summaries. Experiments on nine benchmarks show that our 8B model achieves state-of-the-art performance, outperforming substantially larger open-source models, while exhibiting strong transferability across tasks, model families, and interaction formats.
中文摘要 我们研究大型语言模型（LLM）中的个性化问题。以往的工作主要将用户偏好表示为隐式的、模型特定的向量或参数，产生了不透明的“黑箱”画像，难以解释，也难以在模型和任务之间迁移。相比之下，我们主张将自然语言作为一种通用的、与模型和任务无关的偏好表示接口。这种表述带来可解释且可复用的偏好描述，同时自然支持随着新交互的出现而持续演化。为学习此类表示，我们引入了一个两阶段训练框架，将在高质量合成数据上的监督微调与强化学习相结合，以优化长期效用和跨任务可迁移性。基于该框架，我们开发了AlignXplore+，一种通用偏好推理模型，能够生成文本形式的偏好摘要。在九个基准上的实验显示，我们的8B模型达到了最先进的性能，超过了规模大得多的开源模型，同时在任务、模型家族和交互格式之间展现出很强的可迁移性。

ConMax: Confidence-Maximizing Compression for Efficient Chain-of-Thought Reasoning

ConMax：用于高效思维链推理的置信度最大化压缩

Authors: Minda Hu, Zexuan Qiu, Zenan Xu, Kun Li, Bo Zhou, Irwin King
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.04973
Pdf link: https://arxiv.org/pdf/2601.04973
Abstract Recent breakthroughs in Large Reasoning Models (LRMs) have demonstrated that extensive Chain-of-Thought (CoT) generation is critical for enabling intricate cognitive behaviors, such as self-verification and backtracking, to solve complex tasks. However, this capability often leads to "overthinking", where models generate redundant reasoning paths that inflate computational costs without improving accuracy. While Supervised Fine-Tuning (SFT) on reasoning traces is a standard paradigm for the "cold start" phase, applying existing compression techniques to these traces often compromises logical coherence or incurs prohibitive sampling costs. In this paper, we introduce ConMax (Confidence-Maximizing Compression), a novel reinforcement learning framework designed to automatically compress reasoning traces while preserving essential reasoning patterns. ConMax formulates compression as a reward-driven optimization problem, training a policy to prune redundancy by maximizing a weighted combination of answer confidence for predictive fidelity and thinking confidence for reasoning validity through a frozen auxiliary LRM. Extensive experiments across five reasoning datasets demonstrate that ConMax achieves a superior efficiency-performance trade-off. Specifically, it reduces inference length by 43% over strong baselines at the cost of a mere 0.7% dip in accuracy, proving its effectiveness in generating high-quality, efficient training data for LRMs.
中文摘要 大型推理模型（LRM）的最新突破表明，大量的思维链（CoT）生成对于实现自我验证和回溯等复杂认知行为、进而解决复杂任务至关重要。然而，这种能力常常导致“过度思考”，即模型生成冗余的推理路径，抬高计算成本却不提升准确性。虽然在推理轨迹上进行监督微调（SFT）是“冷启动”阶段的标准范式，但对这些轨迹应用现有压缩技术往往会破坏逻辑连贯性或带来过高的采样成本。本文介绍了ConMax（置信度最大化压缩），这是一种新型强化学习框架，旨在自动压缩推理轨迹，同时保留关键推理模式。ConMax将压缩表述为一个奖励驱动的优化问题：借助一个冻结的辅助LRM，以答案置信度（衡量预测保真度）和思考置信度（衡量推理有效性）的加权组合为最大化目标来训练策略，从而修剪冗余。在五个推理数据集上的大量实验表明，ConMax实现了更优的效率与性能权衡。具体来说，相比强基线，它将推理长度缩短了43%，而准确率仅下降0.7%，证明了其在为LRM生成高质量、高效训练数据方面的有效性。

A DQN-based model for intelligent network selection in heterogeneous wireless systems

基于DQN的异构无线系统智能网络选择模型

Authors: Fayssal Bendaoud, Asma Amraoui, Karim Sehimi
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2601.04978
Pdf link: https://arxiv.org/pdf/2601.04978
Abstract Wireless communications have been at the center of the revolution in technology for the last few years. The 5G communication system is the pinnacle of these technologies; however, 4G LTE, WiFi, and even satellite technologies are still employed worldwide. So, the aim of the next generation network is to take advantage of these technologies for the benefit of the end users. Our research analyzes this subject and presents a new and intelligent method that allows users to select the suitable RAT at each time and, therefore, to switch to another RAT if necessary. The Deep Q-Network (DQN) algorithm was utilized, which is a reinforcement learning algorithm that makes decisions based on feedback from previous actions (rewards and punishments). The approach exhibits a high accuracy, reaching 93 percent, especially after a given number of epochs (the exploration phase), compared to typical MADM methods where the accuracy does not exceed 75 percent.
中文摘要 无线通信在过去几年里一直是技术革命的核心。5G通信系统是这些技术的巅峰；然而，4G LTE、WiFi甚至卫星技术仍在全球范围内使用。因此，下一代网络的目标是利用这些技术为终端用户带来更好的体验。我们的研究分析了这一主题，并提出了一种新的智能方法，使用户能够在每个时刻选择合适的RAT，并在必要时切换到另一个RAT。研究采用了Deep Q-Network（DQN）算法，这是一种根据先前动作的反馈（奖励和惩罚）做出决策的强化学习算法。该方法表现出很高的准确率，尤其是在经过一定数量的训练轮次（探索阶段）之后，准确率可达93%，而典型的MADM方法准确率不超过75%。
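
Editor's sketch (not from the paper): the abstract describes DQN as learning from rewards and punishments after an exploration phase. Two standard ingredients of Q-learning style methods are epsilon-greedy action selection and the one-step target r + gamma * max Q(s', a'). The code below illustrates just those two pieces with plain lists instead of a neural network; the Q-values, the reward, and the mapping of actions to networks are made-up placeholders.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def td_target(reward, next_q_values, gamma=0.9, done=False):
    """One-step Q-learning target: r + gamma * max_a' Q(s', a')."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)

# Hypothetical Q-values for three candidate networks (say 5G, LTE, WiFi).
q = [0.2, 0.7, 0.4]
action = epsilon_greedy(q, epsilon=0.0)  # epsilon = 0 means purely greedy
target = td_target(reward=1.0, next_q_values=[0.1, 0.5, 0.3], gamma=0.9)
```

In a full DQN the lists would be replaced by the outputs of a neural network, and the target would drive a regression loss on the chosen action's Q-value.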

AlgBench: To What Extent Do Large Reasoning Models Understand Algorithms?

AlgBench：大型推理模型对算法的理解程度如何？

Authors: Henan Sun, Kaichi Yu, Yuyao Wang, Bowen Liu, Xunkai Li, Rong-Hua Li, Nuo Chen, Jia Li
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.04996
Pdf link: https://arxiv.org/pdf/2601.04996
Abstract Reasoning ability has become a central focus in the advancement of Large Reasoning Models (LRMs). Although notable progress has been achieved on several reasoning benchmarks such as MATH500 and LiveCodeBench, existing benchmarks for algorithmic reasoning remain limited, failing to answer a critical question: Do LRMs truly master algorithmic reasoning? To answer this question, we propose AlgBench, an expert-curated benchmark that evaluates LRMs under an algorithm-centric paradigm. AlgBench consists of over 3,000 original problems spanning 27 algorithms, constructed by ACM algorithmic experts and organized under a comprehensive taxonomy, including Euclidean-structured, non-Euclidean-structured, non-optimized, local-optimized, global-optimized, and heuristic-optimized categories. Empirical evaluations on leading LRMs (e.g., Gemini-3-Pro, DeepSeek-v3.2-Speciale and GPT-o3) reveal substantial performance heterogeneity: while models perform well on non-optimized tasks (up to 92%), accuracy drops sharply to around 49% on globally optimized algorithms such as dynamic programming. Further analysis uncovers "strategic over-shifts", wherein models prematurely abandon correct algorithmic designs due to necessary low-entropy tokens. These findings expose fundamental limitations of problem-centric reinforcement learning and highlight the necessity of an algorithm-centric training paradigm for robust algorithmic reasoning.
中文摘要 推理能力已成为大型推理模型（LRM）发展的核心焦点。尽管在MATH500和LiveCodeBench等多个推理基准上取得了显著进展，但现有的算法推理基准仍然有限，未能回答一个关键问题：LRM真的掌握了算法推理吗？为回答这个问题，我们提出了AlgBench，一个由专家构建的基准，在以算法为中心的范式下评估LRM。AlgBench包含3,000多个原创问题，涵盖27种算法，由ACM算法专家构建，并按照全面的分类体系组织，包括欧几里得结构、非欧几里得结构、非优化、局部优化、全局优化和启发式优化类别。对主流LRM（如Gemini-3-Pro、DeepSeek-v3.2-Speciale和GPT-o3）的实证评估显示出显著的性能异质性：模型在非优化任务上表现良好（最高达92%），但在动态规划等全局优化算法上准确率急剧下降至约49%。进一步分析发现了“策略性过度转移”（strategic over-shifts）现象，即模型因必要的低熵token而过早放弃正确的算法设计。这些发现揭示了以问题为中心的强化学习的根本局限性，并凸显了以算法为中心的训练范式对于稳健算法推理的必要性。

On the Hidden Objective Biases of Group-based Reinforcement Learning

论基于组的强化学习中隐藏的目标函数偏差

Authors: Aleksandar Fontana, Marco Simoni, Giulio Rossolini, Andrea Saracino, Paolo Mori
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.05002
Pdf link: https://arxiv.org/pdf/2601.05002
Abstract Group-based reinforcement learning methods, like Group Relative Policy Optimization (GRPO), are widely used nowadays to post-train large language models. Despite their empirical success, they exhibit structural mismatches between reward optimization and the underlying training objective. In this paper, we present a theoretical analysis of GRPO style methods by studying them within a unified surrogate formulation. This perspective reveals recurring properties that affect all the methods under analysis: (i) non-uniform group weighting induces systematic gradient biases on shared prefix tokens; (ii) interactions with the AdamW optimizer make training dynamics largely insensitive to reward scaling; and (iii) optimizer momentum can push policy updates beyond the intended clipping region under repeated optimization steps. We believe that these findings highlight fundamental limitations of current approaches and provide principled guidance for the design of future formulations.
中文摘要 基于组的强化学习方法，如组相对策略优化（Group Relative Policy Optimization，GRPO），如今被广泛用于大型语言模型的后训练。尽管它们在实证上取得了成功，但其奖励优化与底层训练目标之间存在结构性不匹配。本文在统一的代理目标表述下对GRPO风格的方法进行了理论分析。这一视角揭示了影响所有被分析方法的共性性质：（i）非均匀的组加权会在共享前缀token上引入系统性梯度偏差；（ii）与AdamW优化器的交互使训练动态对奖励缩放基本不敏感；以及（iii）在重复优化步骤下，优化器动量可能将策略更新推到预期的裁剪区域之外。我们认为这些发现凸显了当前方法的根本局限性，并为未来方法的设计提供了有原则的指导。
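
Editor's sketch (not from the paper): finding (ii) says training dynamics become largely insensitive to reward scaling under AdamW. A plausible intuition is that Adam divides the gradient's running mean by the square root of its running second moment, so multiplying every gradient by a constant almost cancels out. The scalar Adam loop below illustrates this with made-up gradients; weight decay is omitted for simplicity, and treating a reward rescaling as a pure gradient rescaling is an assumption of this sketch.

```python
import math

def adam_steps(grads, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Run scalar Adam (no weight decay) over grads; return the parameter."""
    theta, m, v = 0.0, 0.0, 0.0
    for t, g in enumerate(grads, start=1):
        m = b1 * m + (1 - b1) * g          # running mean of gradients
        v = b2 * v + (1 - b2) * g * g      # running mean of squared gradients
        m_hat = m / (1 - b1 ** t)          # bias correction
        v_hat = v / (1 - b2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Hypothetical gradient sequence, then the same sequence scaled by 100.
grads = [0.5, -1.0, 2.0, 1.5, -0.25]
theta_1x = adam_steps(grads)
theta_100x = adam_steps([100.0 * g for g in grads])
```

The two final parameters differ only through the small `eps` term, whereas plain SGD would move 100 times further on the scaled gradients.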

Hán Dān Xué Bù (Mimicry) or Qīng Chū Yú Lán (Mastery)? A Cognitive Perspective on Reasoning Distillation in Large Language Models

是邯郸学步（模仿）还是青出于蓝（精通）？大型语言模型中推理蒸馏的认知视角

Authors: Yueqing Hu, Xinyang Peng, Shuting Peng, Hanqi Wang, Tianhong Wang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Arxiv link: https://arxiv.org/abs/2601.05019
Pdf link: https://arxiv.org/pdf/2601.05019
Abstract Recent Large Reasoning Models trained via reinforcement learning exhibit a "natural" alignment with human cognitive costs. However, we show that the prevailing paradigm of reasoning distillation -- training student models to mimic these traces via Supervised Fine-Tuning (SFT) -- fails to transmit this cognitive structure. Testing the "Hán Dān Xué Bù" (Superficial Mimicry) hypothesis across 14 models, we find that distillation induces a "Functional Alignment Collapse": while teacher models mirror human difficulty scaling ($\bar{r}=0.64$), distilled students significantly degrade this alignment ($\bar{r}=0.34$), often underperforming their own pre-distillation baselines ("Negative Transfer"). Our analysis suggests that SFT induces a "Cargo Cult" effect, where students ritualistically replicate the linguistic form of reasoning (verbosity) without internalizing the teacher's dynamic resource allocation policy. Consequently, reasoning distillation decouples computational cost from cognitive demand, revealing that human-like cognition is an emergent property of active reinforcement, not passive imitation.
中文摘要 近期通过强化学习训练的大型推理模型表现出与人类认知成本的“自然”对齐。然而，我们表明，主流的推理蒸馏范式（即通过监督微调（SFT）训练学生模型模仿这些推理轨迹）未能传递这种认知结构。我们在14个模型上检验“邯郸学步”（Hán Dān Xué Bù，表面模仿）假说，发现蒸馏会引发“功能对齐崩溃”：教师模型能反映人类的难度缩放规律（$\bar{r}=0.64$），而蒸馏后的学生模型显著削弱了这种对齐（$\bar{r}=0.34$），往往还不如自身蒸馏前的基线（“负迁移”）。我们的分析表明，SFT会引发一种“货物崇拜”效应，即学生仪式化地复制推理的语言形式（冗长），却没有内化教师动态的资源分配策略。因此，推理蒸馏使计算成本与认知需求脱钩，这表明类人认知是主动强化的涌现属性，而非被动模仿的产物。

Reinforced Efficient Reasoning via Semantically Diverse Exploration

通过语义多样性探索强化高效推理

Authors: Ziqi Zhao, Zhaochun Ren, Jiahong Zou, Liu Yang, Zhiwei Xu, Xuri Ge, Zhumin Chen, Xinyu Ma, Daiting Shi, Shuaiqiang Wang, Dawei Yin, Xin Xin
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.05053
Pdf link: https://arxiv.org/pdf/2601.05053
Abstract Reinforcement learning with verifiable rewards (RLVR) has proven effective in enhancing the reasoning of large language models (LLMs). Monte Carlo Tree Search (MCTS)-based extensions improve upon vanilla RLVR (e.g., GRPO) by providing tree-based reasoning rollouts that enable fine-grained and segment-level credit assignment. However, existing methods still suffer from limited exploration diversity and inefficient reasoning. To address the above challenges, we propose reinforced efficient reasoning via semantically diverse explorations, i.e., ROSE, for LLMs. To encourage more diverse reasoning exploration, our method incorporates a semantic-entropy-based branching strategy and an $\varepsilon$-exploration mechanism. The former operates on already sampled reasoning rollouts to capture semantic uncertainty and select branching points with high semantic divergence to generate new successive reasoning paths, whereas the latter stochastically initiates reasoning rollouts from the root, preventing the search process from becoming overly local. To improve efficiency, we design a length-aware segment-level advantage estimator that rewards concise and correct reasoning while penalizing unnecessarily long reasoning chains. Extensive experiments on various mathematical reasoning benchmarks with Qwen and Llama models validate the effectiveness and efficiency of ROSE. Codes are available at this https URL.
中文摘要 带有可验证奖励的强化学习（RLVR）已被证明能有效提升大型语言模型（LLM）的推理能力。基于蒙特卡洛树搜索（MCTS）的扩展方法改进了原始RLVR（如GRPO），提供基于树的推理rollout，实现细粒度的片段级信用分配。然而，现有方法仍存在探索多样性有限和推理效率低下的问题。为应对上述挑战，我们为LLM提出了通过语义多样化探索实现的强化高效推理方法，即ROSE。为了鼓励更多样化的推理探索，我们的方法结合了基于语义熵的分支策略和$\varepsilon$-探索机制。前者作用于已采样的推理rollout，以捕捉语义不确定性，并选择语义分歧较大的分支点来生成新的后续推理路径；后者则以一定概率从根节点启动推理rollout，防止搜索过程过于局部化。为了提高效率，我们设计了一个长度感知的片段级优势估计器，奖励简洁且正确的推理，同时惩罚不必要的冗长推理链。在多个数学推理基准上使用Qwen和Llama模型进行的大量实验验证了ROSE的有效性和效率。代码可在此 https URL 获取。

Safe Continual Reinforcement Learning Methods for Nonstationary Environments. Towards a Survey of the State of the Art

非平稳环境下的安全持续强化学习方法。迈向最新研究现状的综述

Authors: Timofey Tomashevskiy
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.05152
Pdf link: https://arxiv.org/pdf/2601.05152
Abstract This work provides a state-of-the-art survey of continual safe online reinforcement learning (COSRL) methods. We discuss theoretical aspects, challenges, and open questions in building continual online safe reinforcement learning algorithms. We provide the taxonomy and the details of continual online safe reinforcement learning methods based on the type of safe learning mechanism that takes adaptation to nonstationarity into account. We categorize safety constraints formulation for online reinforcement learning algorithms, and finally, we discuss prospects for creating reliable, safe online learning algorithms. Keywords: safe RL in nonstationary environments, safe continual reinforcement learning under nonstationarity, HM-MDP, NSMDP, POMDP, safe POMDP, constraints for continual learning, safe continual reinforcement learning review, safe continual reinforcement learning survey, safe continual reinforcement learning, safe online learning under distribution shift, safe continual online adaptation, safe reinforcement learning, safe exploration, safe adaptation, constrained Markov decision processes, safe reinforcement learning, partially observable Markov decision process, safe reinforcement learning and hidden Markov decision processes, Safe Online Reinforcement Learning, safe online reinforcement learning, safe online reinforcement learning, safe meta-learning, safe meta-reinforcement learning, safe context-based reinforcement learning, formulating safety constraints for continual learning
中文摘要 本文对持续安全在线强化学习（COSRL）方法的最新进展进行了综述。我们讨论了构建持续在线安全强化学习算法的理论问题、挑战和开放问题。我们根据考虑了对非平稳性适应的安全学习机制类型，给出了持续在线安全强化学习方法的分类体系和细节。我们对在线强化学习算法的安全约束表述进行了分类，最后讨论了构建可靠、安全的在线学习算法的前景。关键词：非平稳环境中的安全强化学习、非平稳性下的安全持续强化学习、HM-MDP、NSMDP、POMDP、安全POMDP、持续学习约束、安全持续强化学习评述、安全持续强化学习综述、安全持续强化学习、分布偏移下的安全在线学习、安全持续在线适应、安全强化学习、安全探索、安全适应、约束马尔可夫决策过程、安全强化学习、部分可观测马尔可夫决策过程、安全强化学习与隐马尔可夫决策过程、安全在线强化学习、安全在线强化学习、安全在线强化学习、安全元学习、安全元强化学习、基于上下文的安全强化学习、持续学习的安全约束表述

Inside Out: Evolving User-Centric Core Memory Trees for Long-Term Personalized Dialogue Systems

Inside Out：面向长期个性化对话系统的以用户为中心的核心记忆树演化

Authors: Jihao Zhao, Ding Chen, Zhaoxin Fan, Kerun Xu, Mengting Hu, Bo Tang, Feiyu Xiong, Zhiyu Li
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.05171
Pdf link: https://arxiv.org/pdf/2601.05171
Abstract Existing long-term personalized dialogue systems struggle to reconcile unbounded interaction streams with finite context constraints, often succumbing to memory noise accumulation, reasoning degradation, and persona inconsistency. To address these challenges, this paper proposes Inside Out, a framework that utilizes a globally maintained PersonaTree as the carrier of long-term user profiling. By constraining the trunk with an initial schema and updating the branches and leaves, PersonaTree enables controllable growth, achieving memory compression while preserving consistency. Moreover, we train a lightweight MemListener via reinforcement learning with process-based rewards to produce structured, executable, and interpretable {ADD, UPDATE, DELETE, NO_OP} operations, thereby supporting the dynamic evolution of the personalized tree. During response generation, PersonaTree is directly leveraged to enhance outputs in latency-sensitive scenarios; when users require more details, the agentic mode is triggered to introduce details on-demand under the constraints of the PersonaTree. Experiments show that PersonaTree outperforms full-text concatenation and various personalized memory systems in suppressing contextual noise and maintaining persona consistency. Notably, the small MemListener model achieves memory-operation decision performance comparable to, or even surpassing, powerful reasoning models such as DeepSeek-R1-0528 and Gemini-3-Pro.
中文摘要 现有的长期个性化对话系统难以调和无限增长的交互流与有限的上下文约束，常常出现记忆噪声积累、推理能力退化和人设不一致等问题。为应对这些挑战，本文提出了Inside Out框架，利用全局维护的PersonaTree作为长期用户画像的载体。通过用初始模式（schema）约束主干并更新分支和叶子，PersonaTree实现了可控生长，在压缩记忆的同时保持一致性。此外，我们通过带有过程奖励的强化学习训练轻量级的MemListener，生成结构化、可执行且可解释的{ADD, UPDATE, DELETE, NO_OP}操作，从而支持个性化树的动态演化。在响应生成过程中，PersonaTree在对延迟敏感的场景下被直接用于增强输出；当用户需要更多细节时，会触发智能体模式，在PersonaTree的约束下按需引入细节。实验显示，PersonaTree在抑制上下文噪声和保持人设一致性方面优于全文拼接和各种个性化记忆系统。值得注意的是，小型MemListener模型的记忆操作决策性能可与DeepSeek-R1-0528和Gemini-3-Pro等强大推理模型相当，甚至更优。
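
Editor's sketch (not from the paper): the abstract names four structured memory operations, {ADD, UPDATE, DELETE, NO_OP}, that evolve a persona tree. The code below shows one way such operations could be applied to a nested dictionary. The function `apply_op`, the path-tuple addressing, and the example tree are hypothetical, and the sketch does not enforce the trunk schema the paper describes.

```python
OPS = {"ADD", "UPDATE", "DELETE", "NO_OP"}

def apply_op(tree, op, path=(), value=None):
    """Apply one memory operation to a nested-dict tree, in place.

    path is a tuple of keys from the trunk down to the target leaf.
    """
    if op not in OPS:
        raise ValueError(f"unknown operation: {op}")
    if op == "NO_OP":
        return tree
    *parents, leaf = path
    node = tree
    for key in parents:
        # ADD may create missing branches; other ops require them to exist.
        node = node.setdefault(key, {}) if op == "ADD" else node[key]
    if op == "ADD":
        node[leaf] = value
    elif op == "UPDATE":
        if leaf not in node:
            raise KeyError(leaf)
        node[leaf] = value
    else:  # DELETE
        del node[leaf]
    return tree

# Hypothetical persona tree and a short sequence of operations.
tree = {"preferences": {"food": "vegetarian"}}
apply_op(tree, "ADD", ("preferences", "music"), "jazz")
apply_op(tree, "UPDATE", ("preferences", "food"), "vegan")
apply_op(tree, "DELETE", ("preferences", "music"))
apply_op(tree, "NO_OP")
```

Keeping the operation set this small makes each memory decision easy to log and to replay, which fits the abstract's emphasis on executable and interpretable updates.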

EARL: Energy-Aware Optimization of Liquid State Machines for Pervasive AI

EARL：面向普适人工智能的液体状态机能量感知优化

Authors: Zain Iqbal, Lorenzo Valerio
Subjects: Machine Learning (cs.LG); Performance (cs.PF)
Arxiv link: https://arxiv.org/abs/2601.05205
Pdf link: https://arxiv.org/pdf/2601.05205
Abstract Pervasive AI increasingly depends on on-device learning systems that deliver low-latency and energy-efficient computation under strict resource constraints. Liquid State Machines (LSMs) offer a promising approach for low-power temporal processing in pervasive and neuromorphic systems, but their deployment remains challenging due to high hyperparameter sensitivity and the computational cost of traditional optimization methods that ignore energy constraints. This work presents EARL, an energy-aware reinforcement learning framework that integrates Bayesian optimization with an adaptive reinforcement learning based selection policy to jointly optimize accuracy and energy consumption. EARL employs surrogate modeling for global exploration, reinforcement learning for dynamic candidate prioritization, and an early termination mechanism to eliminate redundant evaluations, substantially reducing computational overhead. Experiments on three benchmark datasets demonstrate that EARL achieves 6 to 15 percent higher accuracy, 60 to 80 percent lower energy consumption, and up to an order of magnitude reduction in optimization time compared to leading hyperparameter tuning frameworks. These results highlight the effectiveness of energy-aware adaptive search in improving the efficiency and scalability of LSMs for resource-constrained on-device AI applications.
中文摘要 普适人工智能越来越依赖设备端学习系统，这些系统需要在严格的资源限制下提供低延迟、高能效的计算。液体状态机（LSM）为普适系统和神经形态系统中的低功耗时序处理提供了有前景的方法，但由于其对超参数高度敏感，且传统优化方法计算成本高并忽略能量约束，其部署仍具挑战性。本研究提出了EARL，一种能量感知的强化学习框架，将贝叶斯优化与基于自适应强化学习的选择策略相结合，联合优化准确率和能耗。EARL采用代理建模进行全局探索，用强化学习进行动态候选优先级排序，并通过提前终止机制消除冗余评估，大幅降低计算开销。在三个基准数据集上的实验表明，与主流超参数调优框架相比，EARL的准确率提高6%至15%，能耗降低60%至80%，优化时间最多缩短一个数量级。这些结果凸显了能量感知自适应搜索在提升LSM效率和可扩展性方面的有效性，适用于资源受限的设备端AI应用。

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

GDPO：面向多奖励强化学习优化的组奖励解耦归一化策略优化

Authors: Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, Jan Kautz, Pavlo Molchanov
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.05242
Pdf link: https://arxiv.org/pdf/2601.05242
Abstract As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to applying Group Relative Policy Optimization (GRPO) under the multi-reward setting without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method to resolve these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization, along with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and coding reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
中文摘要 随着语言模型能力的提升，用户期望它们不仅能提供准确的回答，还能在各种场景下表现出符合多样化人类偏好的行为。为此，强化学习（RL）流程开始引入多个奖励，每个奖励刻画一种不同的偏好，以引导模型产生这些期望行为。然而，近期工作默认在多奖励设置下直接应用组相对策略优化（GRPO），而未检验其适用性。本文表明，直接用GRPO对不同的rollout奖励组合进行归一化，会使它们坍缩为相同的优势值，降低训练信号的分辨率，导致次优收敛，在某些情况下甚至导致训练早期失败。随后，我们提出了组奖励解耦归一化策略优化（GDPO），这是一种新的策略优化方法，通过解耦各个奖励的归一化来解决这些问题，更忠实地保留它们的相对差异，实现更准确的多奖励优化，并显著提升训练稳定性。我们在工具调用、数学推理和代码推理三个任务上比较了GDPO与GRPO，同时评估正确性指标（准确率、bug比例）和约束遵循指标（格式、长度）。在所有设置下，GDPO始终优于GRPO，证明了其在多奖励强化学习优化中的有效性和泛化性。
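
Editor's sketch (not from the paper): the abstract claims that normalizing combined rewards makes distinct reward combinations collapse into identical advantages, and that decoupling the normalization per reward preserves their differences. The toy example below illustrates that idea on a group of four rollouts with two binary rewards. The function names and the reward values are invented for illustration, and the actual GDPO formulation may include details the abstract does not state.

```python
import math

def normalize(xs, eps=1e-8):
    """Group-wise z-score: (x - mean) / (std + eps)."""
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return [(x - mean) / (std + eps) for x in xs]

def summed_then_normalized(rewards):
    """Sum the reward components per rollout, then normalize once."""
    return normalize([sum(r) for r in rewards])

def normalized_then_summed(rewards):
    """Normalize each reward component across the group, then sum."""
    columns = [normalize(list(col)) for col in zip(*rewards)]
    return [sum(vals) for vals in zip(*columns)]

# Hypothetical group: each tuple is (reward_1, reward_2) for one rollout.
rollouts = [(1, 0), (0, 1), (1, 0), (0, 0)]
coupled = summed_then_normalized(rollouts)
decoupled = normalized_then_summed(rollouts)
```

Rollouts (1, 0) and (0, 1) have the same reward sum, so the summed variant gives them the same advantage, while the per-reward variant separates them because the two rewards have different group statistics.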

RL-AWB: Deep Reinforcement Learning for Auto White Balance Correction in Low-Light Night-time Scenes

RL-AWB：用于低光夜间场景自动白平衡校正的深度强化学习

Authors: Yuan-Kang Lee, Kuan-Lin Chen, Chia-Che Chang, Yu-Lun Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.05249
Pdf link: https://arxiv.org/pdf/2601.05249
Abstract Nighttime color constancy remains a challenging problem in computational photography due to low-light noise and complex illumination conditions. We present RL-AWB, a novel framework combining statistical methods with deep reinforcement learning for nighttime white balance. Our method begins with a statistical algorithm tailored for nighttime scenes, integrating salient gray pixel detection with novel illumination estimation. Building on this foundation, we develop the first deep reinforcement learning approach for color constancy that leverages the statistical algorithm as its core, mimicking professional AWB tuning experts by dynamically optimizing parameters for each image. To facilitate cross-sensor evaluation, we introduce the first multi-sensor nighttime dataset. Experiment results demonstrate that our method achieves superior generalization capability across low-light and well-illuminated images. Project page: this https URL
中文摘要 由于低光环境噪声和复杂的照明条件，夜间色彩恒常性在计算摄影中依然是一个具有挑战性的问题。我们介绍了RL-AWB，一种结合统计方法与深度强化学习的新型夜间白平衡框架。我们的方法始于一个针对夜间场景量身定制的统计算法，将显著的灰点检测与新颖的照明估计相结合。基于此基础，我们开发了首个深度强化学习方法，利用统计算法作为核心，通过动态优化每张图像参数，模拟专业AWB调优专家的做法。为促进跨传感器评估，我们引入了首个多传感器夜间数据集。实验结果表明，我们的方法在低光和良好照明图像中实现了更优的泛化能力。项目页面：此 https URL

Keyword: diffusion policy

There is no result