Arxiv Papers of Today

Generated: 2026-01-07 16:34:05 (UTC+8); arXiv release time: 2026-01-07 20:00 EST (2026-01-08 09:00 UTC+8)

31 relevant papers today

Keyword: reinforcement learning

Improving News Recommendations through Hybrid Sentiment Modelling and Reinforcement Learning

Authors: Eunice Kingenga, Mike Wa Nkongolo
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2601.02372
Pdf link: https://arxiv.org/pdf/2601.02372
Abstract News recommendation systems rely on automated sentiment analysis to personalise content and enhance user engagement. Conventional approaches often struggle with ambiguity, lexicon inconsistencies, and limited contextual understanding, particularly in multi-source news environments. Existing models typically treat sentiment as a secondary feature, reducing their ability to adapt to users' affective preferences. To address these limitations, this study develops an adaptive, sentiment-aware news recommendation framework by integrating hybrid sentiment analysis with reinforcement learning. Using the BBC News dataset, a hybrid sentiment model combines VADER, AFINN, TextBlob, and SentiWordNet scores to generate robust article-level sentiment estimates. Articles are categorised as positive, negative, or neutral, and these sentiment states are embedded within a Q-learning architecture to guide the agent in learning optimal recommendation policies. The proposed system effectively identifies and recommends articles with aligned emotional profiles while continuously improving personalisation through iterative Q-learning updates. The results demonstrate that coupling hybrid sentiment modelling with reinforcement learning provides a feasible, interpretable, and adaptive approach for user-centred news recommendation.
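
The pipeline in the abstract above (several lexicon scores fused into one sentiment label, which then serves as a Q-learning state) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the averaging rule, the 0.05 threshold, the reward value, and the names `hybrid_sentiment` and `q_update` are all assumptions, and the four input scores are assumed to be already normalised to [-1, 1].

```python
SENTIMENTS = ["negative", "neutral", "positive"]

def hybrid_sentiment(scores, threshold=0.05):
    """Average normalised lexicon scores (assumed in [-1, 1]) and bucket the mean."""
    mean = sum(scores) / len(scores)
    if mean > threshold:
        return "positive"
    if mean < -threshold:
        return "negative"
    return "neutral"

def new_q_table():
    """One Q-value per (current sentiment state, recommended sentiment) pair."""
    return {s: {a: 0.0 for a in SENTIMENTS} for s in SENTIMENTS}

def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Standard tabular Q-learning update; returns the new Q(state, action)."""
    target = reward + gamma * max(q[next_state].values())
    q[state][action] += alpha * (target - q[state][action])
    return q[state][action]

# Hypothetical scores from four lexicons for one article.
label = hybrid_sentiment([0.6, 0.2, 0.4, 0.1])
q = new_q_table()
# Assumed feedback: the user engaged (reward 1.0) with a positive article.
q_update(q, "neutral", label, 1.0, label, alpha=0.5)
```

A real recommender would derive the reward from logged clicks or dwell time rather than a fixed constant.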

Regional Resource Management for Service Provisioning in LEO Satellite Networks: A Topology Feature-Based DRL Approach

Authors: Chenxi Bao, Di Zhou, Min Sheng, Yan Shi, Jiandong Li, Zhili Sun
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2601.02387
Pdf link: https://arxiv.org/pdf/2601.02387
Abstract Satellite networks with wide coverage are considered natural extensions to terrestrial networks for their long-distance end-to-end (E2E) service provisioning. However, the inherent topology dynamics of low earth orbit satellite networks and the uncertain network scales bring an inevitable requirement that resource chains for E2E service provisioning must be efficiently re-planned. Therefore, achieving highly adaptive resource management is of great significance in practical deployment applications. This paper first designs a regional resource management (RRM) mode and further formulates the RRM problem that can provide a unified decision space independent of the network scale. Subsequently, leveraging the RRM mode and deep reinforcement learning framework, we develop a topology feature-based dynamic and adaptive resource management algorithm to combat the varying network scales. The proposed algorithm successfully takes into account the fixed output dimension of the neural network and the changing resource chains for E2E service provisioning. The matched design of the service orientation information and phased reward function effectively improves the service performance of the algorithm under the RRM mode. The numerical results demonstrate that the proposed algorithm with the best convergence performance and fastest convergence rate significantly improves service performance for varying network scales, with gains over compared algorithms of more than 2.7%, 11.9%, and 10.2%, respectively.

AI-Native Integrated Sensing and Communications for Self-Organizing Wireless Networks: Architectures, Learning Paradigms, and System-Level Design

Authors: S. Zhang, M. Feizarefi, A. F. Mirzaei
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.02398
Pdf link: https://arxiv.org/pdf/2601.02398
Abstract Integrated Sensing and Communications (ISAC) is emerging as a foundational paradigm for next-generation wireless networks, enabling communication infrastructures to simultaneously support data transmission and environment sensing. By tightly coupling radio sensing with communication functions, ISAC unlocks new capabilities for situational awareness, localization, tracking, and network adaptation. At the same time, the increasing scale, heterogeneity, and dynamics of future wireless systems demand self-organizing network intelligence capable of autonomously managing resources, topology, and services. Artificial intelligence (AI), particularly learning-driven and data-centric methods, has become a key enabler for realizing this vision. This survey provides a comprehensive and system-level review of AI-native ISAC-enabled self-organizing wireless networks. We develop a unified taxonomy that spans: (i) ISAC signal models and sensing modalities, (ii) network state abstraction and perception from sensing-aware radio data, (iii) learning-driven self-organization mechanisms for resource allocation, topology control, and mobility management, and (iv) cross-layer architectures integrating sensing, communication, and network intelligence. We further examine emerging learning paradigms, including deep reinforcement learning, graph-based learning, multi-agent coordination, and federated intelligence that enable autonomous adaptation under uncertainty, mobility, and partial observability. Practical considerations such as sensing-communication trade-offs, scalability, latency, reliability, and security are discussed alongside representative evaluation methodologies and performance metrics. Finally, we identify key open challenges and future research directions toward deployable, trustworthy, and scalable AI-native ISAC systems for 6G and beyond.

WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks

Authors: Hao Bai, Alexey Taymanov, Tong Zhang, Aviral Kumar, Spencer Whitehead
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.02439
Pdf link: https://arxiv.org/pdf/2601.02439
Abstract We present WebGym, the largest-to-date open-source environment for training realistic visual web agents. Real websites are non-stationary and diverse, making artificial or small-scale task sets insufficient for robust policy learning. WebGym contains nearly 300,000 tasks with rubric-based evaluations across diverse, real-world websites and difficulty levels. We train agents with a simple reinforcement learning (RL) recipe, which trains on the agent's own interaction traces (rollouts), using task rewards as feedback to guide learning. To enable scaling RL, we speed up sampling of trajectories in WebGym by developing a high-throughput asynchronous rollout system, designed specifically for web agents. Our system achieves a 4-5x rollout speedup compared to naive implementations. Second, we scale the task set breadth, depth, and size, which results in continued performance improvement. Fine-tuning a strong base vision-language model, Qwen-3-VL-8B-Instruct, on WebGym results in an improvement in success rate on an out-of-distribution test set from 26.2% to 42.9%, significantly outperforming agents based on proprietary models such as GPT-4o and GPT-5-Thinking that achieve 27.1% and 29.8%, respectively. This improvement is substantial because our test set consists only of tasks on websites never seen during training, unlike many other prior works on training visual web agents.
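
The asynchronous rollout idea in the abstract above can be illustrated with a small `asyncio` sketch in which many episodes wait on (simulated) page loads concurrently instead of one after another. This is not WebGym's system: `run_episode`, `collect_rollouts`, `toy_policy`, the three-step episode, and the reward rule are all hypothetical stand-ins.

```python
import asyncio

async def run_episode(task_id, policy, max_steps=3):
    """Roll out one task; the sleep stands in for waiting on a browser."""
    trace = []
    for step in range(max_steps):
        await asyncio.sleep(0)  # yield control while the "page" loads
        trace.append(policy(task_id, step))
    # Assumed task reward: success if the episode ends with a submit action.
    reward = 1.0 if trace and trace[-1] == "submit" else 0.0
    return {"task": task_id, "trace": trace, "reward": reward}

async def collect_rollouts(task_ids, policy, max_concurrent=4):
    """Run episodes concurrently, capped at max_concurrent at a time."""
    sem = asyncio.Semaphore(max_concurrent)

    async def guarded(task_id):
        async with sem:
            return await run_episode(task_id, policy)

    # gather returns results in the same order as the submitted tasks.
    return await asyncio.gather(*(guarded(t) for t in task_ids))

def toy_policy(task_id, step):
    """Hypothetical policy: click twice, then submit."""
    return "submit" if step == 2 else "click"

rollouts = asyncio.run(collect_rollouts(range(10), toy_policy))
```

The collected traces and rewards would then feed an RL update; the semaphore is one simple way to bound the number of live browser sessions.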

LLM-Enhanced Reinforcement Learning for Time Series Anomaly Detection

Authors: Bahareh Golchin, Banafsheh Rekabdar, Danielle Justo
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.02511
Pdf link: https://arxiv.org/pdf/2601.02511
Abstract Detecting anomalies in time series data is crucial for finance, healthcare, sensor networks, and industrial monitoring applications. However, time series anomaly detection often suffers from sparse labels, complex temporal patterns, and costly expert annotation. We propose a unified framework that integrates Large Language Model (LLM)-based potential functions for reward shaping with Reinforcement Learning (RL), Variational Autoencoder (VAE)-enhanced dynamic reward scaling, and active learning with label propagation. An LSTM-based RL agent leverages LLM-derived semantic rewards to guide exploration, while VAE reconstruction errors add unsupervised anomaly signals. Active learning selects the most uncertain samples, and label propagation efficiently expands labeled data. Evaluations on Yahoo-A1 and SMD benchmarks demonstrate that our method achieves state-of-the-art detection accuracy under limited labeling budgets and operates effectively in data-constrained settings. This study highlights the promise of combining LLMs with RL and advanced unsupervised techniques for robust, scalable anomaly detection in real-world applications.
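
Reward shaping with a potential function, as mentioned in the abstract above, is commonly written as r' = r + γΦ(s') − Φ(s). The sketch below assumes that standard form; the abstract does not state the exact formula used, and the potential values here are made-up numbers standing in for LLM-derived scores.

```python
def shaped_reward(reward, phi_s, phi_next, gamma=0.99):
    """Potential-based shaping: r' = r + gamma * phi(s') - phi(s)."""
    return reward + gamma * phi_next - phi_s

def total_shaping(potentials, gamma=1.0):
    """Sum of the shaping terms along a trajectory with zero base reward."""
    return sum(
        shaped_reward(0.0, potentials[i], potentials[i + 1], gamma)
        for i in range(len(potentials) - 1)
    )

# Hypothetical potentials for four consecutive time-series windows.
potentials = [0.1, 0.4, 0.3, 0.9]
bonus = total_shaping(potentials)  # telescopes to phi(last) - phi(first)
```

With γ = 1 the shaping terms telescope, so the bonus depends only on the first and last potentials, which is why this form steers exploration without changing which policy is optimal.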

Textual Explanations and Their Evaluations for Reinforcement Learning Policy

Authors: Ahmad Terra, Mohit Ahmed, Rafia Inam, Elena Fersman, Martin Törngren
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.02514
Pdf link: https://arxiv.org/pdf/2601.02514
Abstract Understanding a Reinforcement Learning (RL) policy is crucial for ensuring that autonomous agents behave according to human expectations. This goal can be achieved using Explainable Reinforcement Learning (XRL) techniques. Although textual explanations are easily understood by humans, ensuring their correctness remains a challenge, and evaluations in state-of-the-art remain limited. We present a novel XRL framework for generating textual explanations, converting them into a set of transparent rules, improving their quality, and evaluating them. Expert's knowledge can be incorporated into this framework, and an automatic predicate generator is also proposed to determine the semantic information of a state. Textual explanations are generated using a Large Language Model (LLM) and a clustering technique to identify frequent conditions. These conditions are then converted into rules to evaluate their properties, fidelity, and performance in the deployed environment. Two refinement techniques are proposed to improve the quality of explanations and reduce conflicting information. Experiments were conducted in three open-source environments to enable reproducibility, and in a telecom use case to evaluate the industrial applicability of the proposed XRL framework. This framework addresses the limitations of an existing method, Autonomous Policy Explanation, and the generated transparent rules can achieve satisfactory performance on certain tasks. This framework also enables a systematic and quantitative evaluation of textual explanations, providing valuable insights for the XRL field.

SWaRL: Safeguard Code Watermarking via Reinforcement Learning

Authors: Neusha Javidnia, Ruisi Zhang, Ashish Kundu, Farinaz Koushanfar
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.02602
Pdf link: https://arxiv.org/pdf/2601.02602
Abstract We present SWaRL, a robust and fidelity-preserving watermarking framework designed to protect the intellectual property of code LLM owners by embedding unique and verifiable signatures in the generated output. Existing approaches rely on manually crafted transformation rules to preserve watermarked code functionality or manipulate token-generation probabilities at inference time, which are prone to compilation errors. To address these challenges, SWaRL employs a reinforcement learning-based co-training framework that uses compiler feedback for functional correctness and a jointly trained confidential verifier as a reward signal to maintain watermark detectability. Furthermore, SWaRL employs low-rank adaptation (LoRA) during fine-tuning, allowing the learned watermark information to be transferable across model updates. Extensive experiments show that SWaRL achieves higher watermark detection accuracy compared to prior methods while fully maintaining watermarked code functionality. The LoRA-based signature embedding steers the base model to generate and solve code in a watermark-specific manner without significant computational overhead. Moreover, SWaRL exhibits strong resilience against refactoring and adversarial transformation attacks.

Effective Online 3D Bin Packing with Lookahead Parcels Using Monte Carlo Tree Search

Authors: Jiangyi Fang, Bowen Zhou, Haotian Wang, Xin Zhu, Leye Wang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.02649
Pdf link: https://arxiv.org/pdf/2601.02649
Abstract Online 3D Bin Packing (3D-BP) with robotic arms is crucial for reducing transportation and labor costs in modern logistics. While Deep Reinforcement Learning (DRL) has shown strong performance, it often fails to adapt to real-world short-term distribution shifts, which arise as different batches of goods arrive sequentially, causing performance drops. We argue that the short-term lookahead information available in modern logistics systems is key to mitigating this issue, especially during distribution shifts. We formulate online 3D-BP with lookahead parcels as a Model Predictive Control (MPC) problem and adapt the Monte Carlo Tree Search (MCTS) framework to solve it. Our framework employs a dynamic exploration prior that automatically balances a learned RL policy and a robust random policy based on the lookahead characteristics. Additionally, we design an auxiliary reward to penalize long-term spatial waste from individual placements. Extensive experiments on real-world datasets show that our method consistently outperforms state-of-the-art baselines, achieving over 10% gains under distributional shifts, 4% average improvement in online deployment, and up to more than 8% in the best case, demonstrating the effectiveness of our framework.
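
One plausible reading of the "dynamic exploration prior" in the abstract above is a convex mix of the learned policy's action probabilities with a uniform (random) policy, fed into a PUCT-style selection score. The sketch below is only that reading: the linear mix, the `trust` weight, and the PUCT constant are assumptions, and the abstract does not specify how the balance is computed from the lookahead parcels.

```python
import math

def mixed_prior(policy_probs, trust):
    """Blend learned-policy probabilities with a uniform prior.

    trust = 1.0 uses the RL policy alone; trust = 0.0 is purely random.
    """
    n = len(policy_probs)
    return [trust * p + (1.0 - trust) / n for p in policy_probs]

def puct_score(q_value, prior, parent_visits, child_visits, c_puct=1.5):
    """PUCT selection score: exploitation term plus prior-weighted exploration."""
    return q_value + c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)

# Hypothetical policy output over three candidate placements.
policy_probs = [0.7, 0.2, 0.1]
# Assumed: lower trust when the lookahead parcels look out of distribution.
prior = mixed_prior(policy_probs, trust=0.5)
scores = [puct_score(0.0, p, parent_visits=4, child_visits=0) for p in prior]
```

Lowering `trust` flattens the prior, so the search explores placements the learned policy would otherwise neglect.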

Inferring Causal Graph Temporal Logic Formulas to Expedite Reinforcement Learning in Temporally Extended Tasks

Authors: Hadi Partovi Aria, Zhe Xu
Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Arxiv link: https://arxiv.org/abs/2601.02666
Pdf link: https://arxiv.org/pdf/2601.02666
Abstract Decision-making tasks often unfold on graphs with spatial-temporal dynamics. Black-box reinforcement learning often overlooks how local changes spread through network structure, limiting sample efficiency and interpretability. We present GTL-CIRL, a closed-loop framework that simultaneously learns policies and mines Causal Graph Temporal Logic (Causal GTL) specifications. The method shapes rewards with robustness, collects counterexamples when effects fail, and uses Gaussian Process (GP) driven Bayesian optimization to refine parameterized cause templates. The GP models capture spatial and temporal correlations in the system dynamics, enabling efficient exploration of complex parameter spaces. Case studies in gene and power networks show faster learning and clearer, verifiable behavior compared to standard RL baselines.

Time-Scaling Is What Agents Need Now

Authors: Zhi Liu, Guangzhi Wang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.02714
Pdf link: https://arxiv.org/pdf/2601.02714
Abstract Early artificial intelligence paradigms exhibited separated cognitive functions: Neural Networks focused on "perception-representation," Reinforcement Learning on "decision-making-behavior," and Symbolic AI on "knowledge-reasoning." With Transformer-based large models and world models, these paradigms are converging into cognitive agents with closed-loop "perception-decision-action" capabilities. Humans solve complex problems under limited cognitive resources through temporalized sequential reasoning. Language relies on problem space search for deep semantic reasoning. While early large language models (LLMs) could generate fluent text, they lacked robust semantic reasoning capabilities. Prompting techniques like Chain-of-Thought (CoT) and Tree-of-Thought (ToT) extended reasoning paths by making intermediate steps explicit. Recent models like DeepSeek-R1 enhanced performance through explicit reasoning trajectories. However, these methods have limitations in search completeness and efficiency. This highlights the need for "Time-Scaling": the systematic extension and optimization of an agent's ability to unfold reasoning over time. Time-Scaling refers to architectural design utilizing extended temporal pathways, enabling deeper problem space exploration, dynamic strategy adjustment, and enhanced metacognitive control, paralleling human sequential reasoning under cognitive constraints. It represents a critical frontier for enhancing deep reasoning and problem-solving without proportional increases in static model parameters. Advancing intelligent agent capabilities requires placing Time-Scaling principles at the forefront, positioning explicit temporal reasoning management as foundational.

Q-Regularized Generative Auto-Bidding: From Suboptimal Trajectories to Optimal Policies

Authors: Mingming Zhang, Na Li, Zhuang Feiqing, Hongyang Zheng, Jiangbing Zhou, Wang Wuyin, Sheng-jie Sun, XiaoWei Chen, Junxiong Zhu, Lixin Zou, Chenliang Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2601.02754
Pdf link: https://arxiv.org/pdf/2601.02754
Abstract With the rapid development of e-commerce, auto-bidding has become a key asset in optimizing advertising performance under diverse advertiser environments. The current approaches focus on reinforcement learning (RL) and generative models. These efforts imitate offline historical behaviors by utilizing a complex structure with expensive hyperparameter tuning. The suboptimal trajectories further exacerbate the difficulty of policy learning. To address these challenges, we propose QGA, a novel Q-value regularized Generative Auto-bidding method. In QGA, we propose to plug a Q-value regularization with double Q-learning strategy into the Decision Transformer backbone. This design enables joint optimization of policy imitation and action-value maximization, allowing the learned bidding policy to both leverage experience from the dataset and alleviate the adverse impact of the suboptimal trajectories. Furthermore, to safely explore the policy space beyond the data distribution, we propose a Q-value guided dual-exploration mechanism, in which the DT model is conditioned on multiple return-to-go targets and locally perturbed actions. This entire exploration process is dynamically guided by the aforementioned Q-value module, which provides principled evaluation for each candidate action. Experiments on public benchmarks and simulation environments demonstrate that QGA consistently achieves superior or highly competitive results compared to existing alternatives. Notably, in large-scale real-world A/B testing, QGA achieves a 3.27% increase in Ad GMV and a 2.49% improvement in Ad ROI.
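
The Q-value guided dual exploration described in the abstract above can be pictured as: generate candidate bids by conditioning the Decision Transformer on several return-to-go targets, perturb each candidate locally, and keep the one the Q-function rates highest. The sketch below is an illustration under those assumptions only; `dt_policy` and `q_value` are toy stand-ins, not the paper's models, and scalar actions are assumed for simplicity.

```python
def dual_exploration(state, dt_policy, q_value, rtg_targets, deltas):
    """Pick the candidate action with the highest Q-value.

    Candidates come from (a) several return-to-go targets and
    (b) small local perturbations around each proposed action.
    """
    candidates = []
    for rtg in rtg_targets:
        base = dt_policy(state, rtg)
        for delta in deltas:
            candidates.append(base + delta)
    return max(candidates, key=lambda action: q_value(state, action))

# Toy stand-ins (hypothetical): bid grows with the return-to-go target,
# and the critic prefers bids close to 1.2.
toy_dt = lambda state, rtg: 0.5 * rtg
toy_q = lambda state, action: -(action - 1.2) ** 2

best_bid = dual_exploration(
    state=None,
    dt_policy=toy_dt,
    q_value=toy_q,
    rtg_targets=[1.0, 2.0, 3.0],
    deltas=[-0.1, 0.0, 0.1],
)
```

Because every candidate is scored by the critic before it is used, exploration beyond the logged data stays bounded by what the Q-function considers valuable.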

Closing the Reality Gap: Zero-Shot Sim-to-Real Deployment for Dexterous Force-Based Grasping and Manipulation

Authors: Haoyu Dong, Zhengmao He, Yang Li, Zhibin Li, Xinyu Yi, Zhe Zhao
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.02778
Pdf link: https://arxiv.org/pdf/2601.02778
Abstract Human-like dexterous hands with multiple fingers offer human-level manipulation capabilities, but training control policies that can directly deploy on real hardware remains difficult due to contact-rich physics and imperfect actuation. We close this gap with a practical sim-to-real reinforcement learning (RL) framework that utilizes dense tactile feedback combined with joint torque sensing to explicitly regulate physical interactions. To enable effective sim-to-real transfer, we introduce (i) a computationally fast tactile simulation that computes distances between dense virtual tactile units and the object via parallel forward kinematics, providing high-rate, high-resolution touch signals needed by RL; (ii) a current-to-torque calibration that eliminates the need for torque sensors on dexterous hands by mapping motor current to joint torque; and (iii) actuator dynamics modeling to bridge the actuation gaps with randomization of non-ideal effects such as backlash, torque-speed saturation. Using an asymmetric actor-critic PPO pipeline trained entirely in simulation, our policies deploy directly to a five-finger hand. The resulting policies demonstrated two essential skills: (1) command-based, controllable grasp force tracking, and (2) reorientation of objects in the hand, both of which were robustly executed without fine-tuning on the robot. By combining tactile and torque in the observation space with effective sensing/actuation modeling, our system provides a practical solution to achieve reliable dexterous manipulation. To our knowledge, this is the first demonstration of controllable grasping on a multi-finger dexterous hand trained entirely in simulation and transferred zero-shot on real hardware.

MiMo-V2-Flash Technical Report

Authors: Bangjun Xiao, Bingquan Xia, Bo Yang, Bofei Gao, Bowen Shen, Chen Zhang, Chenhong He, Chiheng Lou, Fuli Luo, Gang Wang, Gang Xie, Hailin Zhang, Hanglong Lv, Hanyu Li, Heyu Chen, Hongshen Xu, Houbin Zhang, Huaqiu Liu, Jiangshan Duo, Jianyu Wei, Jiebao Xiao, Jinhao Dong, Jun Shi, Junhao Hu, Kainan Bao, Kang Zhou, Lei Li, Liang Zhao, Linghao Zhang, Peidian Li, Qianli Chen, Shaohui Liu, Shihua Yu, Shijie Cao, Shimao Chen, Shouqiu Yu, Shuo Liu, Tianling Zhou, Weijiang Su, Weikun Wang, Wenhan Ma, Xiangwei Deng, Bohan Mao, Bowen Ye, Can Cai, Chenghua Wang, Chengxuan Zhu, Chong Ma, Chun Chen, Chunan Li, Dawei Zhu, Deshan Xiao, Dong Zhang, Duo Zhang, Fangyue Liu, Feiyu Yang, Fengyuan Shi, Guoan Wang, Hao Tian, Hao Wu, Heng Qu, Hongfei Yi, Hongxu An, Hongyi Guan, Xing Zhang, Yifan Song, Yihan Yan, Yihao Zhao, Yingchun Lai, Yizhao Gao, Yu Cheng, Yuanyuan Tian, Yudong Wang, Zhen Tang, Zhengju Tang, Zhengtao Wen, Zhichao Song, Zhixian Zheng, Zihan Jiang, Jian Wen, Jiarui Sun, Jiawei Li, Jinlong Xue, Jun Xia, Kai Fang, Menghang Zhu, Nuo Chen, Qian Tu, Qihao Zhang, Qiying Wang, Rang Li, Rui Ma, Shaolei Zhang, Shengfan Wang, Shicheng Li, Shuhao Gu, Shuhuai Ren, Sirui Deng, Tao Guo, Tianyang Lu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.02780
Pdf link: https://arxiv.org/pdf/2601.02780
Abstract We present MiMo-V2-Flash, a Mixture-of-Experts (MoE) model with 309B total parameters and 15B active parameters, designed for fast, strong reasoning and agentic capabilities. MiMo-V2-Flash adopts a hybrid attention architecture that interleaves Sliding Window Attention (SWA) with global attention, with a 128-token sliding window under a 5:1 hybrid ratio. The model is pre-trained on 27 trillion tokens with Multi-Token Prediction (MTP), employing a native 32k context length and subsequently extended to 256k. To efficiently scale post-training compute, MiMo-V2-Flash introduces a novel Multi-Teacher On-Policy Distillation (MOPD) paradigm. In this framework, domain-specialized teachers (e.g., trained via large-scale reinforcement learning) provide dense and token-level reward, enabling the student model to perfectly master teacher expertise. MiMo-V2-Flash rivals top-tier open-weight models such as DeepSeek-V3.2 and Kimi-K2, despite using only 1/2 and 1/3 of their total parameters, respectively. During inference, by repurposing MTP as a draft model for speculative decoding, MiMo-V2-Flash achieves up to 3.6 acceptance length and 2.6x decoding speedup with three MTP layers. We open-source both the model weights and the three-layer MTP weights to foster open research and community collaboration.
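
The hybrid attention layout in the abstract above (sliding-window layers interleaved with global layers at 5:1, with a 128-token window) can be sketched as a layer schedule plus a per-layer visibility rule. The placement of the global layer at the end of each group of six, and whether the window counts the current token, are assumptions here; the abstract gives only the ratio and the window size.

```python
def layer_pattern(num_layers, swa_per_global=5):
    """Assumed schedule: every (swa_per_global + 1)-th layer is global."""
    period = swa_per_global + 1
    return ["global" if (i + 1) % period == 0 else "swa" for i in range(num_layers)]

def can_attend(query_pos, key_pos, kind, window=128):
    """Causal visibility of key_pos from query_pos for one layer type."""
    if key_pos > query_pos:
        return False  # causal masking: no attention to future tokens
    if kind == "global":
        return True
    # Sliding window: only the most recent `window` tokens (current included).
    return query_pos - key_pos < window

pattern = layer_pattern(12)
# Share of parameters active per token, from the figures in the abstract.
active_fraction = 15 / 309
```

Under this schedule only one layer in six has to keep a key-value cache for the full context, and about 4.9% of the 309B parameters are active per token.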

Reinforcement Learning for Follow-the-Leader Robotic Endoscopic Navigation via Synthetic Data

Authors: Sicong Gao, Chen Qian, Laurence Xian, Liao Wu, Maurice Pagnucco, Yang Song
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.02798
Pdf link: https://arxiv.org/pdf/2601.02798
Abstract Autonomous navigation is crucial for both medical and industrial endoscopic robots, enabling safe and efficient exploration of narrow tubular environments without continuous human intervention, where avoiding contact with the inner walls has been a longstanding challenge for prior approaches. We present a follow-the-leader endoscopic robot based on a flexible continuum structure designed to minimize contact between the endoscope body and intestinal walls, thereby reducing patient discomfort. To achieve this objective, we propose a vision-based deep reinforcement learning framework guided by monocular depth estimation. A realistic intestinal simulation environment was constructed in NVIDIA Omniverse to train and evaluate autonomous navigation strategies. Furthermore, thousands of synthetic intraluminal images were generated using NVIDIA Replicator to fine-tune the Depth Anything model, enabling dense three-dimensional perception of the intestinal environment with a single monocular camera. Subsequently, we introduce a geometry-aware reward and penalty mechanism to enable accurate lumen tracking. Compared with the original Depth Anything model, our method improves δ1 depth accuracy by 39.2% and reduces the navigation J-index by 0.67 relative to the second-best method, demonstrating the robustness and effectiveness of the proposed approach.
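
The δ1 depth accuracy cited in the abstract above is conventionally the fraction of pixels whose predicted-to-true depth ratio, taken in whichever direction is larger, stays below 1.25. The sketch below assumes that conventional definition (the abstract does not restate it) and uses made-up depth values.

```python
def delta1(pred, gt, threshold=1.25):
    """Fraction of valid pixels with max(pred/gt, gt/pred) < threshold."""
    valid = [(p, g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    if not valid:
        raise ValueError("no valid depth pairs")
    hits = sum(1 for p, g in valid if max(p / g, g / p) < threshold)
    return hits / len(valid)

# Hypothetical predicted and ground-truth depths for four pixels.
accuracy = delta1([1.0, 2.0, 3.0, 4.0], [1.1, 2.0, 4.0, 4.5])
```

Here three of the four pixels fall within the 1.25 ratio, so the score is 0.75; the third pixel (3.0 against 4.0) misses the threshold.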

SketchThinker-R1: Towards Efficient Sketch-Style Reasoning in Large Multimodal Models

SketchThinker-R1：迈向大型多模态模型中高效的草图式推理

Authors: Ruiyang Zhang, Dongzhan Zhou, Zhedong Zheng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.02825
Pdf link: https://arxiv.org/pdf/2601.02825
Abstract Despite the empirical success of extensive, step-by-step reasoning in large multimodal models, long reasoning processes inevitably incur substantial computational overhead, namely higher token costs and increased response time, which undermines inference efficiency. In contrast, humans often employ sketch-style reasoning: a concise, goal-directed cognitive process that prioritizes salient information and enables efficient problem-solving. Inspired by this cognitive efficiency, we propose SketchThinker-R1, which incentivizes sketch-style reasoning ability in large multimodal models. Our method consists of three primary stages. In the Sketch-Mode Cold Start stage, we convert standard long reasoning processes into sketch-style reasoning and fine-tune the base multimodal model, instilling an initial sketch-style reasoning capability. Next, we train the SketchJudge Reward Model, which explicitly evaluates the model's thinking process and assigns higher scores to sketch-style reasoning. Finally, we conduct Sketch-Thinking Reinforcement Learning under the supervision of SketchJudge to further generalize the sketch-style reasoning ability. Experimental evaluation on four benchmarks reveals that SketchThinker-R1 achieves over 64% reduction in reasoning token cost without compromising final answer accuracy. Qualitative analysis further shows that sketch-style reasoning focuses more on key cues during problem solving.
中文摘要 尽管在大型多模态模型中广泛且逐步推理取得了实证成功，但长时间推理过程不可避免地会产生大量计算开销，即更高的代币成本和更长的响应时间，从而削弱推理效率。相比之下，人类常采用草图式推理：一种简洁、目标导向的认知过程，优先处理重要信息并实现高效问题解决。受这种认知效率的启发，我们提出了SketchThinker-R1，它在大型多模态模型中激励草图式推理能力。我们的方法包括三个主要阶段。在草图模式冷启动阶段，我们将标准的长推理过程转换为草图风格推理，并微调基础多模态模型，培养初始草图式推理能力。接下来，我们训练SketchJudge奖励模型，它明确评估模型的思维过程，并对草图式推理给予更高分数。最后，我们在SketchJudge的监督下开展了草图思维强化学习，进一步推广草图式推理能力。对四个基准测试的实验评估显示，我们的SketchThinker-R1在不牺牲最终答案准确性的情况下，实现了推理代币成本的降低超过64%。定性分析进一步表明，草图式推理更侧重于解决问题的关键线索。

Sample-Efficient Neurosymbolic Deep Reinforcement Learning

样本高效神经符号深度强化学习

Authors: Celeste Veronese, Daniele Meli, Alessandro Farinelli
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.02850
Pdf link: https://arxiv.org/pdf/2601.02850
Abstract Reinforcement Learning (RL) is a well-established framework for sequential decision-making in complex environments. However, state-of-the-art Deep RL (DRL) algorithms typically require large training datasets and often struggle to generalize beyond small-scale training scenarios, even within standard benchmarks. We propose a neuro-symbolic DRL approach that integrates background symbolic knowledge to improve sample efficiency and generalization to more challenging, unseen tasks. Partial policies defined for simple domain instances, where high performance is easily attained, are transferred as useful priors to accelerate learning in more complex settings and avoid tuning DRL parameters from scratch. To do so, partial policies are represented as logical rules, and online reasoning is performed to guide the training process through two mechanisms: (i) biasing the action distribution during exploration, and (ii) rescaling Q-values during exploitation. This neuro-symbolic integration enhances interpretability and trustworthiness while accelerating convergence, particularly in sparse-reward environments and tasks with long planning horizons. We empirically validate our methodology on challenging variants of gridworld environments, both in the fully observable and partially observable setting. We show improved performance over a state-of-the-art reward machine baseline.
中文摘要 强化学习（RL）是一个成熟的框架，用于在复杂环境中进行序贯决策。然而，最先进的深度强化学习（DRL）算法通常需要大量训练数据，且即使在标准基准测试内，也常难以泛化到小规模训练场景之外。我们提出了一种神经符号深度强化学习方法，整合背景符号知识，以提高样本效率，并提升对更具挑战性的未见任务的泛化能力。在容易获得高性能的简单领域实例上定义的部分策略，被作为有用的先验迁移过来，以加速更复杂场景中的学习，并避免从零开始调整DRL参数。为此，部分策略被表示为逻辑规则，并通过在线推理以两种机制引导训练过程：（i）在探索阶段对动作分布施加偏置，（ii）在利用阶段重新缩放Q值。这种神经符号整合增强了可解释性和可信度，同时加速了收敛，尤其是在奖励稀疏的环境和规划时域较长的任务中。我们在网格世界环境的高难度变体上，在完全可观测和部分可观测两种设置下，对我们的方法进行了实证验证。结果显示，其性能优于最先进的奖励机（reward machine）基线。
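
The two guidance mechanisms named in the abstract (biasing the action distribution during exploration, rescaling Q-values during exploitation) can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: the function names, the `bias` and `scale` parameters, and the choice to shift Q-values before scaling are not taken from the paper, and the set `suggested` simply stands in for whatever actions the logical rules would recommend in the current state.

```python
import random


def biased_exploration(actions, suggested, bias=0.8, rng=random):
    """Exploration step: with probability `bias`, sample among rule-suggested
    actions; otherwise fall back to a uniform draw over all actions."""
    candidates = [a for a in actions if a in suggested]
    if candidates and rng.random() < bias:
        return rng.choice(candidates)
    return rng.choice(actions)


def rescaled_greedy(q_values, suggested, scale=1.5):
    """Exploitation step: boost Q-values of rule-suggested actions, then argmax.

    Values are shifted to be non-negative first, because multiplying a
    negative Q-value by scale > 1 would lower it instead of raising it.
    """
    low = min(q_values.values())
    adjusted = {
        a: (q - low) * (scale if a in suggested else 1.0)
        for a, q in q_values.items()
    }
    return max(adjusted, key=adjusted.get)


q = {"left": 1.0, "right": 1.2, "up": 0.0}
with_rule = rescaled_greedy(q, suggested={"left"})
without_rule = rescaled_greedy(q, suggested=set())
explored = biased_exploration(list(q), {"left"}, bias=1.0, rng=random.Random(0))
```

In this toy case the rule tips the greedy choice from `right` to `left` even though `right` has the slightly higher raw Q-value, which is the kind of prior-driven nudge the abstract describes.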

SimRPD: Optimizing Recruitment Proactive Dialogue Agents through Simulator-Based Data Evaluation and Selection

SimRPD：通过基于模拟器的数据评估和选择优化招聘主动对话代理

Authors: Zhiyong Cao, Dunqiang Liu, Qi Dai, Haojun Xu, Huaiyan Xu, Huan He, Yafei Liu, Siyuan Liu, XiaoLin Lin, Ke Ma, Ruqian Shi, Sijia Yao, Hao Wang, Sicheng Zhou
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.02871
Pdf link: https://arxiv.org/pdf/2601.02871
Abstract Task-oriented proactive dialogue agents play a pivotal role in recruitment, particularly for steering conversations towards specific business outcomes, such as acquiring social-media contacts for private-channel conversion. Although supervised fine-tuning and reinforcement learning have proven effective for training such agents, their performance is heavily constrained by the scarcity of high-quality, goal-oriented domain-specific training data. To address this challenge, we propose SimRPD, a three-stage framework for training recruitment proactive dialogue agents. First, we develop a high-fidelity user simulator to synthesize large-scale conversational data through multi-turn online dialogue. Then we introduce a multi-dimensional evaluation framework based on Chain-of-Intention (CoI) to comprehensively assess the simulator and effectively select high-quality data, incorporating both global-level and instance-level metrics. Finally, we train the recruitment proactive dialogue agent on the selected dataset. Experiments in a real-world recruitment scenario demonstrate that SimRPD outperforms existing simulator-based data selection strategies, highlighting its practical value for industrial deployment and its potential applicability to other business-oriented dialogue scenarios.
中文摘要 以任务为导向的主动对话代理在招聘中发挥关键作用，尤其是在引导对话朝向特定业务成果，例如获取社交媒体联系人以实现私人渠道转化。尽管监督式微调和强化学习已被证明对训练此类代理有效，但其性能仍受限于高质量、目标导向的领域特定训练数据的稀缺。为应对这一挑战，我们提出了SimRPD，这是一个三阶段框架，用于培训主动对话代理的招聘。首先，我们开发了高保真用户模拟器，通过多回合在线对话综合大规模对话数据。随后，我们引入基于意向链（Chain-of-Intention，CoI）的多维评估框架，全面评估模拟器并有效选择高质量数据，涵盖全局和实例级指标。最后，我们在所选数据集上训练招募主动对话代理。在真实世界招聘场景中的实验表明，SimRPD优于现有基于模拟器的数据选择策略，凸显了其在工业部署中的实用价值以及在其他面向商业的对话场景中的潜在应用。

ChemBART: A Pre-trained BART Model Assisting Organic Chemistry Analysis

ChemBART：辅助有机化学分析的预训练BART模型

Authors: Kenan Li, Yijian Zhang, Jin Wang, Haipeng Gan, Zeying Sun, Xiaoguang Lei, Hao Dong
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.02915
Pdf link: https://arxiv.org/pdf/2601.02915
Abstract Recent advances in large language models (LLMs) have demonstrated transformative potential across diverse fields. While LLMs have been applied to simplified molecular-input line-entry system (SMILES) representations of molecules in computer-aided synthesis planning (CASP), existing methodologies typically address single tasks, such as precursor prediction. We introduce ChemBART, a SMILES-based LLM pre-trained on chemical reactions, which enables a unified model for multiple downstream chemical tasks, achieving the paradigm of "one model, one pre-training, multiple tasks." By leveraging outputs from a mask-filling pre-training task on reaction expressions, ChemBART effectively solves a variety of chemical problems, including precursor/reagent generation, temperature-yield regression, molecular property classification, and optimizing the policy and value functions within a reinforcement learning framework, integrated with Monte Carlo tree search for multi-step synthesis route design. Unlike single-molecule pre-trained LLMs constrained to specific applications, ChemBART addresses broader chemical challenges and integrates them for comprehensive synthesis planning. Crucially, ChemBART-designed multi-step synthesis routes and reaction conditions directly inspired wet-lab validation, which confirmed shorter pathways with ~30% yield improvement over literature benchmarks. Our work validates the power of reaction-focused pre-training and showcases the broad utility of ChemBART in advancing the complete synthesis planning cycle.
中文摘要 大型语言模型（LLM）的最新进展展示了在多个领域具有变革潜力。虽然LLM已被应用于计算机辅助合成规划（CASP）中的分子简化分子输入输入系统（SMILES），但现有方法通常只针对单一任务，如前驱预测。我们介绍了ChemBART，一种基于SMILES的大型语言模型，预训练于化学反应，实现了多个下游化学任务的统一模型——实现了“一个模型，一个预训练，多个任务”的范式。通过利用掩膜填充预训练任务对反应表达式的输出，ChemBART 有效解决了多种化学问题，包括前体/试剂生成、温度产率回归、分子性质分类，以及在强化学习框架内优化策略函数和价值函数，并结合蒙特卡洛树搜索实现多步合成路径设计。与仅限于特定应用的单分子预训练LLM不同，ChemBART解决更广泛的化学挑战，并整合这些挑战以实现全面的合成规划。关键是，ChemBART设计的多步合成路径和反应条件直接启发了湿实验室验证，验证了更短的途径，产率比文献基准提升约30%。我们的工作验证了反应聚焦预训练的威力，并展示了ChemBART在推进完整合成计划周期中的广泛应用。

Zoom-IQA: Image Quality Assessment with Reliable Region-Aware Reasoning

Zoom-IQA：基于可靠区域感知推理的图像质量评估

Authors: Guoqiang Liang, Jianyi Wang, Zhonghua Wu, Shangchen Zhou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.02918
Pdf link: https://arxiv.org/pdf/2601.02918
Abstract Image Quality Assessment (IQA) is a long-standing problem in computer vision. Previous methods typically focus on predicting numerical scores without explanation or provide low-level descriptions lacking precise scores. Recent reasoning-based vision language models (VLMs) have shown strong potential for IQA, enabling joint generation of quality descriptions and scores. However, we notice that existing VLM-based IQA methods tend to exhibit unreliable reasoning due to their limited capability of integrating visual and textual cues. In this work, we introduce Zoom-IQA, a VLM-based IQA model to explicitly emulate key cognitive behaviors: uncertainty awareness, region reasoning, and iterative refinement. Specifically, we present a two-stage training pipeline: 1) supervised fine-tuning (SFT) on our Grounded-Rationale-IQA (GR-IQA) dataset to teach the model to ground its assessments in key regions; and 2) reinforcement learning (RL) for dynamic policy exploration, primarily stabilized by our KL-Coverage regularizer to prevent reasoning and scoring diversity collapse, and supported by a Progressive Re-sampling Strategy to mitigate annotation bias. Extensive experiments show that Zoom-IQA achieves improved robustness, explainability, and generalization. The application to downstream tasks, such as image restoration, further demonstrates the effectiveness of Zoom-IQA.
中文摘要 图像质量评估（IQA）是计算机视觉领域长期存在的问题。以往的方法通常侧重于无解释地预测数值分数，或提供缺乏精确分数的低层描述。近期基于推理的视觉语言模型（VLMs）展现出在IQA中的强大潜力，能够联合生成高质量描述和评分。然而，我们注意到现有基于VLM的IQA方法由于整合视觉和文本线索的能力有限，推理往往不可靠。在本研究中，我们介绍了Zoom-IQA，一种基于VLM的IQA模型，用以明确模拟关键认知行为：不确定性意识、区域推理和迭代精炼。具体来说，我们提出了一个两阶段的训练流程：1）在我们的Grounded-Rationale-IQA（GR-IQA）数据集上进行监督式微调（SFT），以教授模型如何将评估定位在关键区域;以及2）强化学习（RL），用于动态策略探索，主要由我们的KL覆盖规范器稳定，以防止推理和评分多样性崩溃，并辅以渐进重抽样策略以减轻注释偏差。大量实验表明，Zoom-IQA实现了更好的鲁棒性、可解释性和泛化性。在下游任务中的应用，如图像恢复，进一步展示了Zoom-IQA的有效性。

The World is Not Mono: Enabling Spatial Understanding in Large Audio-Language Models

世界不是单声道的：在大型音频语言模型中实现空间理解

Authors: Yuhuan You, Lai Wei, Xihong Wu, Tianshu Qu
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.02954
Pdf link: https://arxiv.org/pdf/2601.02954
Abstract Existing large audio-language models perceive the world as "mono" -- a single stream of audio that ignores the critical spatial dimension ("where") required for universal acoustic scene analysis. To bridge this gap, we first introduce a hierarchical framework for Auditory Scene Analysis (ASA). Guided by this framework, we introduce a system that enables models like Qwen2-Audio to understand and reason about the complex acoustic world. Our framework achieves this through three core contributions: First, we build a large-scale, synthesized binaural audio dataset to provide the rich spatial cues. Second, we design a hybrid feature projector, which leverages parallel semantic and spatial encoders to extract decoupled representations. These distinct streams are integrated via a dense fusion mechanism, ensuring the model receives a holistic view of the acoustic scene. Finally, we employ a progressive training curriculum, advancing from supervised fine-tuning (SFT) to reinforcement learning via Group Relative Policy Optimization (GRPO), to explicitly evolve the model's capabilities towards reasoning. On our comprehensive benchmark, the model demonstrates comparatively strong capability for spatial understanding. By enabling this spatial perception, our work provides a clear pathway for leveraging the powerful reasoning abilities of large models towards holistic acoustic scene analysis, advancing from "mono" semantic recognition to spatial intelligence.
中文摘要 现有的大型音频语言模型将世界感知为“单声道”——一条忽视通用声学场景分析所需的关键空间维度（“哪里”）的单一音频流。为弥合这一差距，我们首先引入了一个听觉场景分析（ASA）的层级框架。在该框架的指导下，我们引入了一个系统，使像Qwen2-Audio这样的模型能够理解并推理复杂的声学世界。我们的框架通过三个核心贡献实现这一点：首先，我们构建了一个大规模的合成双耳音频数据集，以提供丰富的空间线索。其次，我们设计了一种混合特征投影器，利用并行语义和空间编码器提取解耦的表示。这些不同的流通过密集融合机制整合，确保模型能够获得声学场景的整体视图。最后，我们采用渐进式培训课程，从监督式微调（SFT）逐步推进到通过群体相对策略优化（GRPO）进行强化学习，明确地将模型的能力演化为推理能力。在我们的综合基准测试中，该模型展现出相对强大的空间理解能力。通过实现这种空间感知，我们的工作为利用大型模型强大的推理能力实现整体声学场景分析提供了清晰路径，从“单一”语义识别迈向空间智能。

Correct, Concise and Complete: Multi-stage Training For Adaptive Reasoning

正确、简洁且完整：多阶段适应性推理训练

Authors: Nathanaël Carraz Rakotonirina, Ren Pang, Neha Anna John, Michael Bohlke-Schneider, Momchil Hardalov
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.02972
Pdf link: https://arxiv.org/pdf/2601.02972
Abstract The reasoning capabilities of large language models (LLMs) have improved substantially through increased test-time computation, typically in the form of intermediate tokens known as chain-of-thought (CoT). However, CoT often becomes unnecessarily long, increasing computation cost without actual accuracy gains or sometimes even degrading performance, a phenomenon known as "overthinking". We propose a multi-stage efficient reasoning method that combines supervised fine-tuning (via rejection sampling or reasoning trace reformatting) with reinforcement learning using an adaptive length penalty. We introduce a lightweight reward function that penalizes tokens generated after the first correct answer while encouraging self-verification only when beneficial. We conduct a holistic evaluation across seven diverse reasoning tasks, analyzing the trade-off between accuracy and response length. Our approach reduces response length by an average of 28% for 8B models and 40% for 32B models, while incurring only minor performance drops of 1.6 and 2.5 points, respectively. Despite its conceptual simplicity, it achieves a superior trade-off compared to more complex state-of-the-art efficient reasoning methods, scoring 76.6 in terms of the area under the Overthinking-Adjusted Accuracy curve ($\text{AUC}_{\text{OAA}}$), which is 5 points above the base model and 2.5 points above the second-best approach.
中文摘要 大型语言模型（LLMs）的推理能力通过增加测试时计算得到了显著提升，这种计算通常以称为思维链（Chain-of-thought，CoT）的中间 token 形式出现。然而，CoT常常变得过长，增加计算成本却无法带来实际的准确性提升，甚至有时会降低性能，这种现象被称为“过度思考”。我们提出了一种多阶段高效推理方法，将监督微调（通过拒绝采样或推理轨迹重格式化）与采用自适应长度惩罚的强化学习相结合。我们引入了一个轻量级奖励函数，惩罚在首次得到正确答案之后生成的 token，同时仅在有益时鼓励自我验证。我们对七个不同的推理任务进行了整体评估，分析准确性与响应长度之间的权衡。我们的方法使8B模型的响应长度平均减少28%，32B模型平均减少40%，同时分别仅带来1.6分和2.5分的轻微性能下降。尽管概念简单，它相比更复杂的最先进高效推理方法实现了更优的权衡，在过度思考调整准确率曲线下面积（$\text{AUC}_{\text{OAA}}$）上得分为76.6，比基础模型高出5分，比次优方法高出2.5分。
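
The abstract describes a reward that penalizes tokens generated after the first correct answer. The exact formula is not given there, so the following is only a hypothetical sketch of that idea: the function name, the linear penalty, and the `alpha` weight are all assumptions. It takes the response length, the token position at which the correct answer first appears, and whether the final answer is correct.

```python
def overthinking_reward(num_tokens, first_correct_at, final_correct, alpha=0.5):
    """Hypothetical length-aware reward (not the paper's formula).

    num_tokens: total tokens in the response.
    first_correct_at: token position where the correct answer first appears,
        or None if that position is unknown.
    final_correct: whether the final answer is correct.
    alpha: assumed weight of the penalty on tokens after the first correct answer.
    """
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    if not final_correct:
        return 0.0  # no credit for a wrong final answer
    if first_correct_at is None:
        return 1.0  # nothing to penalize if the position is unknown
    extra = max(num_tokens - first_correct_at, 0)
    return 1.0 - alpha * extra / num_tokens


concise = overthinking_reward(100, 100, True)   # answer arrives at the very end
verbose = overthinking_reward(100, 50, True)    # 50 tokens after the answer
wrong = overthinking_reward(100, 50, False)
```

Under this sketch a response that reaches the right answer only after revising an earlier wrong one is not penalized for the revision itself, since the penalty counts tokens after the first correct answer rather than total length.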

In-Context Reinforcement Learning through Bayesian Fusion of Context and Value Prior

通过上下文与价值先验的贝叶斯融合进行语境内强化学习

Authors: Anaïs Berkes, Vincent Taboga, Donna Vakalis, David Rolnick, Yoshua Bengio
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.03015
Pdf link: https://arxiv.org/pdf/2601.03015
Abstract In-context reinforcement learning (ICRL) promises fast adaptation to unseen environments without parameter updates, but current methods either cannot improve beyond the training distribution or require near-optimal data, limiting practical adoption. We introduce SPICE, a Bayesian ICRL method that learns a prior over Q-values via deep ensemble and updates this prior at test-time using in-context information through Bayesian updates. To recover from poor priors resulting from training on sub-optimal data, our online inference follows an Upper-Confidence Bound rule that favours exploration and adaptation. We prove that SPICE achieves regret-optimal behaviour in both stochastic bandits and finite-horizon MDPs, even when pretrained only on suboptimal trajectories. We validate these findings empirically across bandit and control benchmarks. SPICE achieves near-optimal decisions on unseen tasks, substantially reduces regret compared to prior ICRL and meta-RL approaches while rapidly adapting to unseen tasks and remaining robust under distribution shift.
中文摘要 上下文强化学习（ICRL）有望在不更新参数的情况下快速适应未见环境，但现有方法要么无法超越训练分布，要么需要接近最优的数据，限制了实际应用。我们提出SPICE，这是一种贝叶斯ICRL方法，通过深度集成（deep ensemble）学习Q值的先验，并在测试时利用上下文信息通过贝叶斯更新来修正该先验。为了从因在次优数据上训练而导致的不良先验中恢复，我们的在线推断遵循一种置信上界（Upper-Confidence Bound）规则，该规则倾向于探索和适应。我们证明，即使仅在次优轨迹上预训练，SPICE在随机多臂老虎机（bandit）和有限时域MDP中都能实现遗憾最优行为。我们在老虎机和控制基准上对这些发现进行了实证验证。SPICE在未见任务上实现近乎最优的决策，相较于以往的ICRL和元强化学习（meta-RL）方法显著降低了遗憾，同时能够快速适应未见任务，并在分布偏移下保持稳健。
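
The combination described in the abstract (an ensemble-derived prior over Q-values, a Bayesian update from in-context data, and an Upper-Confidence Bound rule) can be sketched for a simple bandit-like case. This is not SPICE itself: the Gaussian prior, the known noise variance, the conjugate update, and all function names are assumptions chosen to keep the example self-contained.

```python
import math
from statistics import mean, pvariance


def ensemble_prior(member_estimates):
    """Prior mean and variance for one action from ensemble Q estimates."""
    # Floor the variance so a fully agreeing ensemble stays updatable.
    return mean(member_estimates), max(pvariance(member_estimates), 1e-6)


def posterior(prior_mean, prior_var, observed_returns, noise_var=1.0):
    """Gaussian conjugate update with an assumed known noise variance."""
    n = len(observed_returns)
    if n == 0:
        return prior_mean, prior_var
    precision = 1.0 / prior_var + n / noise_var
    post_var = 1.0 / precision
    post_mean = post_var * (prior_mean / prior_var + sum(observed_returns) / noise_var)
    return post_mean, post_var


def ucb_action(posteriors, beta=1.0):
    """Pick the action maximizing posterior mean + beta * posterior std."""
    return max(
        posteriors,
        key=lambda a: posteriors[a][0] + beta * math.sqrt(posteriors[a][1]),
    )


prior_mean, prior_var = ensemble_prior([1.0, 3.0])
post_mean, post_var = posterior(0.0, 1.0, [2.0], noise_var=1.0)
beliefs = {"a": (1.0, 0.01), "b": (0.8, 1.0)}
optimistic = ucb_action(beliefs, beta=1.0)
greedy = ucb_action(beliefs, beta=0.0)
```

With `beta=1.0` the uncertain action `b` is preferred despite its lower mean, which is how an optimistic rule can recover from a poor prior; with `beta=0.0` the choice reduces to the greedy one.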

Dementia-R1: Reinforced Pretraining and Reasoning from Unstructured Clinical Notes for Real-World Dementia Prognosis

Dementia-R1：基于非结构化临床笔记的强化预训练与推理，用于真实世界痴呆症预后

Authors: Choonghan Kim, Hyunmin Hwang, Hangeol Chang, Jaemin Kim, Jinse Park, Jae-Sung Lim, Jong Chul Ye
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.03018
Pdf link: https://arxiv.org/pdf/2601.03018
Abstract While Large Language Models (LLMs) have shown strong performance on clinical text understanding, they struggle with longitudinal prediction tasks such as dementia prognosis, which require reasoning over complex, non-monotonic symptom trajectories across multiple visits. Standard supervised training lacks explicit annotations for symptom evolution, while direct Reinforcement Learning (RL) is hindered by sparse binary rewards. To address this challenge, we introduce Dementia-R1, an RL-based framework for longitudinal dementia prognosis from unstructured clinical notes. Our approach adopts a Cold-Start RL strategy that pre-trains the model to predict verifiable clinical indices extracted from patient histories, enhancing the capability to reason about disease progression before determining the final clinical status. Extensive experiments demonstrate that Dementia-R1 achieves an F1 score of 77.03% on real-world unstructured clinical datasets. Notably, on the ADNI benchmark, our 7B model rivals GPT-4o, effectively capturing fluctuating cognitive trajectories. Code is available at this https URL
中文摘要 虽然大型语言模型（LLMs）在临床文本理解方面表现出色，但在纵向预测任务（如痴呆预后）上存在困难，这类任务需要在多次就诊中推理复杂且非单调的症状轨迹。标准的监督训练缺乏明确的症状演变注释，而直接强化学习（RL）则受限于稀疏的二元奖励。为应对这一挑战，我们引入了基于强化学习的痴呆-R1框架，用于从非结构化临床记录中预测纵向痴呆。我们的方法采用冷启动强化学习策略，预训练模型以预测从患者病史中提取的可验证临床指标，增强在确定最终临床状态前推理疾病进展的能力。大量实验表明，痴呆-R1在真实世界非结构化临床数据集中可获得77.03%的F1评分。值得注意的是，在ADNI基准测试中，我们的7B模型可与GPT-4o媲美，有效捕捉了认知轨迹的波动。代码可在此 https URL 获取

SOP: A Scalable Online Post-Training System for Vision-Language-Action Models

SOP：一个面向视觉-语言-行动模型的可扩展在线后期训练系统

Authors: Mingjie Pan, Siyuan Feng, Qinglin Zhang, Xinchen Li, Jianheng Song, Chendi Qu, Yi Wang, Chuankang Li, Ziyu Xiong, Zhi Chen, Yi Liu, Jianlan Luo
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.03044
Pdf link: https://arxiv.org/pdf/2601.03044
Abstract Vision-language-action (VLA) models achieve strong generalization through large-scale pre-training, but real-world deployment requires expert-level task proficiency in addition to broad generality. Existing post-training approaches for VLA models are typically offline, single-robot, or task-specific, limiting effective on-policy adaptation and scalable learning from real-world interaction. We introduce a Scalable Online Post-training (SOP) system that enables online, distributed, multi-task post-training of generalist VLA models directly in the physical world. SOP tightly couples execution and learning through a closed-loop architecture in which a fleet of robots continuously streams on-policy experience and human intervention signals to a centralized cloud learner, and asynchronously receives updated policies. This design supports prompt on-policy correction, scales experience collection through parallel deployment, and preserves generality during adaptation. SOP is agnostic to the choice of post-training algorithm; we instantiate it with both interactive imitation learning (HG-DAgger) and reinforcement learning (RECAP). Across a range of real-world manipulation tasks including cloth folding, box assembly, and grocery restocking, we show that SOP substantially improves the performance of large pretrained VLA models while maintaining a single shared policy across tasks. Effective post-training can be achieved within hours of real-world interaction, and performance scales near-linearly with the number of robots in the fleet. These results suggest that tightly coupling online learning with fleet-scale deployment is instrumental to enabling efficient, reliable, and scalable post-training of generalist robot policies in the physical world.
中文摘要 视觉-语言-行动（VLA）模型通过大规模预训练实现了强有力的泛化，但实际部署不仅需要广泛的通用性，还需要专家级的任务熟练度。现有的VLA模型后期训练方法通常是离线、单机器人或任务特定，限制了有效的策略适应和可扩展的真实交互学习。我们引入了可扩展的在线后期训练（SOP）系统，使通用VLA模型能够在线、分布式、多任务地在物理世界中进行后期训练。SOP通过闭环架构紧密结合执行与学习，机器人队伍持续向集中的云学习者传输政策经验和人工干预信号，并异步接收更新的策略。该设计支持即时的策略修正，通过并行部署扩大经验收集规模，并在适应过程中保持通用性。SOP对训练后算法的选择是中立的;我们通过交互式模仿学习（HG-DAgger）和强化学习（RECAP）来实现它。在包括折叠布料、包装盒装配和杂货补货等多种实际作任务中，我们展示了SOP在维护任务间统一共享策略的同时，显著提升了大型预训练VLA模型的性能。有效的后期培训可在实际作数小时内实现，性能几乎与机队机器人数量呈线性增长。这些结果表明，在线学习与舰队规模部署紧密结合，对于实现高效、可靠且可扩展的通用机器人培训后政策在现实世界中至关重要。

IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation

IBISAgent：在通用生物医学对象指称与分割中强化MLLM中的像素级视觉推理

Authors: Yankai Jiang, Qiaoru Li, Binlu Xu, Haoran Sun, Chao Ding, Junting Dong, Yuxiang Cai, Xuhong Zhang, Jianwei Yin
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.03054
Pdf link: https://arxiv.org/pdf/2601.03054
Abstract Recent research on medical MLLMs has gradually shifted its focus from image-level understanding to fine-grained, pixel-level comprehension. Although segmentation serves as the foundation for pixel-level understanding, existing approaches face two major challenges. First, they introduce implicit segmentation tokens and require simultaneous fine-tuning of both the MLLM and external pixel decoders, which increases the risk of catastrophic forgetting and limits generalization to out-of-domain scenarios. Second, most methods rely on single-pass reasoning and lack the capability to iteratively refine segmentation results, leading to suboptimal performance. To overcome these limitations, we propose a novel agentic MLLM, named IBISAgent, that reformulates segmentation as a vision-centric, multi-step decision-making process. IBISAgent enables MLLMs to generate interleaved reasoning and text-based click actions, invoke segmentation tools, and produce high-quality masks without architectural modifications. By iteratively performing multi-step visual reasoning on masked image features, IBISAgent naturally supports mask refinement and promotes the development of pixel-level visual reasoning capabilities. We further design a two-stage training framework consisting of cold-start supervised fine-tuning and agentic reinforcement learning with tailored, fine-grained rewards, enhancing the model's robustness in complex medical referring and reasoning segmentation tasks. Extensive experiments demonstrate that IBISAgent consistently outperforms both closed-source and open-source SOTA methods. All datasets, code, and trained models will be released publicly.
中文摘要 近期关于医学多层次运用模型的研究逐渐从图像层面的理解转向细粒度、像素级的理解。虽然分段是像素层级理解的基础，但现有方法面临两大挑战。首先，它们引入了隐式分割令牌，并要求同时对MLLM和外部像素解码器进行微调，这增加了灾难性遗忘的风险，并限制了对域外场景的推广。其次，大多数方法依赖单次推理，缺乏迭代细化分割结果的能力，导致性能不理想。为克服这些局限，我们提出了一种新型智能体MLLM，名为IBISAgent，将分割重新表述为以视觉为中心的多步骤决策过程。IBISAgent 使 MLLM 能够生成交错推理和基于文本的点击动作，调用分割工具，并在无需架构修改的情况下生成高质量的掩码。通过迭代对蒙罩图像特征进行多步视觉推理，IBISAgent 自然支持蒙罩细化，并推动像素级视觉推理能力的发展。我们还设计了一个两阶段训练框架，包括冷启动监督的微调和代理强化学习，并提供定制化、细粒度的奖励，增强模型在复杂医疗转诊和推理分割任务中的稳健性。大量实验表明，IBISAgent 始终优于闭源和开源的 SOTA 方法。所有数据集、代码和训练模型都将公开发布。

One Sample to Rule Them All: Extreme Data Efficiency in RL Scaling

一个例子可以统治所有：强化学习扩展中的极端数据效率

Authors: Yiyuan Li, Zhen Huang, Yanan Wu, Weixun Wang, Xuefeng Li, Yijia Luo, Wenbo Su, Bo Zheng, Pengfei Liu
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.03111
Pdf link: https://arxiv.org/pdf/2601.03111
Abstract The reasoning ability of large language models (LLMs) can be unleashed with reinforcement learning (RL) (OpenAI, 2024; DeepSeek-AI et al., 2025a; Zeng et al., 2025). The success of existing RL attempts in LLMs usually relies on high-quality samples of thousands or beyond. In this paper, we challenge fundamental assumptions about data requirements in RL for LLMs by demonstrating the remarkable effectiveness of one-shot learning. Specifically, we introduce polymath learning, a framework for designing one training sample that elicits multidisciplinary impact. We present three key findings: (1) A single, strategically selected math reasoning sample can produce significant performance improvements across multiple domains, including physics, chemistry, and biology with RL; (2) The math skills salient to reasoning suggest the characteristics of the optimal polymath sample; and (3) An engineered synthetic sample that integrates multidiscipline elements outperforms training with individual samples that naturally occur. Our approach achieves superior performance to training with larger datasets across various reasoning benchmarks, demonstrating that sample quality and design, rather than quantity, may be the key to unlock enhanced reasoning capabilities in language models. Our results suggest a shift, dubbed as sample engineering, toward precision engineering of training samples rather than simply increasing data volume.
中文摘要 大型语言模型（LLMs）的推理能力可以通过强化学习（RL）释放（OpenAI，2024;DeepSeek-AI 等，2025a;Zeng 等，2025）。现有的强化学习尝试在大型语言模型中取得成功，通常依赖于数千个甚至更多的高质量样本。本文通过展示一次性学习的显著有效性，挑战了关于强化学习中LLM数据需求的基本假设。具体来说，我们引入了多学科学习框架，用于设计一个能够引发多学科影响的训练样本。我们呈现三个关键发现：（1）一个经过策略性挑选的数学推理样本，可以在多个领域（包括物理、化学和生物）通过强化学习实现显著的表现提升;（2）与推理相关数学技能暗示了最优博学者样本的特征;以及（3）集成多学科元素的工程合成样本，其训练效果优于使用自然出现的单个样本。我们的方法在不同推理基准测试中优于使用更大数据集训练，表明样本质量和设计，而非数量，可能是解锁语言模型推理能力提升的关键。我们的结果表明，一种被称为样本工程的转变，正朝着训练样本的精密工程发展，而不仅仅是增加数据量。

Unified Thinker: A General Reasoning Modular Core for Image Generation

统一思考者：用于图像生成的通用推理模块化核心

Authors: Sashuai Zhou, Qiang Zhou, Jijin Hu, Hanqing Yang, Yue Cao, Junpeng Ma, Yinchao Ma, Jun Song, Tiezheng Ge, Cheng Yu, Bo Zheng, Zhou Zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.03127
Pdf link: https://arxiv.org/pdf/2601.03127
Abstract Despite impressive progress in high-fidelity image synthesis, generative models still struggle with logic-intensive instruction following, exposing a persistent reasoning--execution gap. Meanwhile, closed-source systems (e.g., Nano Banana) have demonstrated strong reasoning-driven image generation, highlighting a substantial gap to current open-source models. We argue that closing this gap requires not merely better visual generators, but executable reasoning: decomposing high-level intents into grounded, verifiable plans that directly steer the generative process. To this end, we propose Unified Thinker, a task-agnostic reasoning architecture for general image generation, designed as a unified planning core that can plug into diverse generators and workflows. Unified Thinker decouples a dedicated Thinker from the image Generator, enabling modular upgrades of reasoning without retraining the entire generative model. We further introduce a two-stage training paradigm: we first build a structured planning interface for the Thinker, then apply reinforcement learning to ground its policy in pixel-level feedback, encouraging plans that optimize visual correctness over textual plausibility. Extensive experiments on text-to-image generation and image editing show that Unified Thinker substantially improves image reasoning and generation quality.
中文摘要 尽管高保真图像合成取得了显著进展，生成模型在逻辑密集型指令跟随方面仍存在困难，暴露出持续的推理——执行差距。与此同时，闭源系统（如Nano Banana）已展现出强大的推理驱动图像生成能力，凸显了与当前开源模型的巨大差距。我们认为，弥补这一差距不仅需要更好的视觉生成器，还需要可执行的推理：将高层次意图分解为扎根、可验证的计划，直接引导生成过程。为此，我们提出了统一思考器，这是一种任务无关的通用图像生成推理架构，设计为一个统一的规划核心，可以接入多种生成器和工作流程。Unified Thinker 将专用的 Thinker 与图像生成器解耦，实现推理的模块化升级，而无需重新训练整个生成模型。我们进一步引入了两阶段训练范式：首先为Thinker构建结构化规划界面，然后应用强化学习，将其策略建立在像素级反馈基础上，鼓励优化视觉正确性而非文本合理性的计划。大量文本生成和图像编辑实验表明，Unified Thinker 显著提升了图像推理和生成质量。

WebAnchor: Anchoring Agent Planning to Stabilize Long-Horizon Web Reasoning

WebAnchor：锚定智能体规划以稳定长程网页推理

Authors: Yu Xinmiao, Zhang Liwen, Feng Xiaocheng, Jiang Yong, Qin Bing, Xie Pengjun, Zhou Jingren
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.03164
Pdf link: https://arxiv.org/pdf/2601.03164
Abstract Large Language Model (LLM)-based agents have shown strong capabilities in web information seeking, with reinforcement learning (RL) becoming a key optimization paradigm. However, planning remains a bottleneck, as existing methods struggle with long-horizon strategies. Our analysis reveals a critical phenomenon, the plan anchor, where the first reasoning step disproportionately impacts downstream behavior in long-horizon web reasoning tasks. Current RL algorithms fail to account for this because they distribute rewards uniformly across the trajectory. To address this, we propose Anchor-GRPO, a two-stage RL framework that decouples planning and execution. In Stage 1, the agent optimizes its first-step planning using fine-grained rubrics derived from self-play experiences and human calibration. In Stage 2, execution is aligned with the initial plan through sparse rewards, ensuring stable and efficient tool usage. We evaluate Anchor-GRPO on four benchmarks: BrowseComp, BrowseComp-Zh, GAIA, and XBench-DeepSearch. Across models from 3B to 30B, Anchor-GRPO outperforms baseline GRPO and First-step GRPO, improving task success and tool efficiency. Notably, WebAnchor-30B achieves 46.0% pass@1 on BrowseComp and 76.4% on GAIA. Anchor-GRPO also demonstrates strong scalability, achieving higher accuracy as model size and context length increase.
中文摘要 基于大型语言模型（LLM）的智能体在网络信息搜索方面展现出强大的能力，强化学习（RL）成为关键的优化范式。然而，规划仍是一个瓶颈，现有方法在长远战略上遇到困难。我们的分析揭示了一个关键现象——计划锚点，即第一步推理在长期网络推理任务中对后续行为产生不成比例的影响。当前的强化学习算法未能通过均匀分配奖励来考虑这一点。为此，我们提出了Anchor-GRPO，一个两阶段强化学习框架，将规划与执行脱钩。在第一阶段，代理利用源自自我游戏体验和人类校准的细粒度评分标准，优化第一步规划。在第二阶段，执行与初始计划保持一致，奖励稀少，确保工具使用稳定高效。我们基于四个基准测试来评估Anchor-GRPO：BrowseComp、BrowseComp-Zh、GAIA和XBench-DeepSearch。在3B到30B的模型中，Anchor-GRPO优于基线GRPO和First-step GRPO，提升任务成功率和工具效率。值得注意的是，WebAnchor-30B在BrowseComp上pass@1为46.0%，在GAIA上为76.4%。Anchor-GRPO还展现出强大的可扩展性，随着模型规模和上下文长度的增加，准确率也提升。

MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory

MemRL：通过情节记忆的运行时强化学习实现自我进化的智能体

Authors: Shengtao Zhang, Jiaqian Wang, Ruiwen Zhou, Junwei Liao, Yuchen Feng, Weinan Zhang, Ying Wen, Zhiyu Li, Feiyu Xiong, Yutao Qi, Bo Tang, Muning Wen
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.03192
Pdf link: https://arxiv.org/pdf/2601.03192
Abstract The hallmark of human intelligence is the ability to master new skills through Constructive Episodic Simulation, that is, retrieving past experiences to synthesize solutions for novel tasks. While Large Language Models possess strong reasoning capabilities, they struggle to emulate this self-evolution: fine-tuning is computationally expensive and prone to catastrophic forgetting, while existing memory-based methods rely on passive semantic matching that often retrieves noise. To address these challenges, we propose MemRL, a framework that enables agents to self-evolve via non-parametric reinforcement learning on episodic memory. MemRL explicitly separates the stable reasoning of a frozen LLM from the plastic, evolving memory. Unlike traditional methods, MemRL employs a Two-Phase Retrieval mechanism that filters candidates by semantic relevance and then selects them based on learned Q-values (utility). These utilities are continuously refined via environmental feedback in a trial-and-error manner, allowing the agent to distinguish high-value strategies from similar noise. Extensive experiments on HLE, BigCodeBench, ALFWorld, and Lifelong Agent Bench demonstrate that MemRL significantly outperforms state-of-the-art baselines. Our analysis experiments confirm that MemRL effectively reconciles the stability-plasticity dilemma, enabling continuous runtime improvement without weight updates.
中文摘要 人类智能的标志在于通过建构性情景模拟掌握新技能的能力，即提取过去的经验来综合出新任务的解决方案。虽然大型语言模型具备强大的推理能力，但它们难以模拟这种自我进化：微调计算成本高且易发生灾难性遗忘，而现有基于记忆的方法依赖被动的语义匹配，常常检索到噪声。为应对这些挑战，我们提出了MemRL，一种使智能体能够通过在情景记忆上进行非参数强化学习来自我进化的框架。MemRL明确地将冻结LLM的稳定推理与可塑的、不断演化的记忆分离开来。与传统方法不同，MemRL采用两阶段检索机制，先按语义相关性过滤候选项，再根据学习到的Q值（效用）进行选择。这些效用以试错的方式通过环境反馈不断修正，使智能体能够区分高价值策略与相似的噪声。在HLE、BigCodeBench、ALFWorld和Lifelong Agent Bench上的大量实验表明，MemRL的表现显著优于最先进的基线。我们的分析实验证实，MemRL有效调和了稳定性与可塑性的两难困境，无需更新权重即可在运行时持续改进。
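
Illustrative note (not from the paper): the Two-Phase Retrieval idea in the MemRL abstract can be sketched in a few lines. The abstract only states that candidates are filtered by semantic relevance, selected by learned Q-values, and refined from environmental feedback. Everything else here is an assumption made for illustration: the names `MemoryItem`, `two_phase_retrieve`, and `update_utility`, the cosine-similarity filter, the top-k cutoffs, and the incremental update rule.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryItem:
    text: str          # stored experience (hypothetical field)
    embedding: list    # vector used for semantic matching
    q: float = 0.0     # learned utility estimate, starts neutral


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def two_phase_retrieve(query_emb, memory, k_semantic=3, k_final=1):
    # Phase 1: keep the k_semantic items most similar to the query.
    candidates = sorted(
        memory, key=lambda m: cosine(query_emb, m.embedding), reverse=True
    )[:k_semantic]
    # Phase 2: among those candidates, prefer the highest learned utility.
    return sorted(candidates, key=lambda m: m.q, reverse=True)[:k_final]


def update_utility(item, reward, lr=0.5):
    # Assumed update rule: move the utility toward the observed reward.
    item.q += lr * (reward - item.q)
```

In this sketch an item with a high utility but low semantic relevance is never returned, because phase 1 removes it before utilities are compared. The frozen LLM is not touched at any point; only the `q` values stored in memory change.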

UltraLogic: Enhancing LLM Reasoning through Large-Scale Data Synthesis and Bipolar Float Reward

UltraLogic：通过大规模数据合成和双极浮动奖励提升LLM推理能力

Authors: Yile Liu, Yixian Liu, Zongwei Li, Yufei Huang, Xinhua Feng, Zhichao Hu, Jinglu Hu, Jianfeng Yan, Fengzong Lian, Yuhong Liu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.03205
Pdf link: https://arxiv.org/pdf/2601.03205
Abstract While Large Language Models (LLMs) have demonstrated significant potential in natural language processing, complex general-purpose reasoning requiring multi-step logic, planning, and verification remains a critical bottleneck. Although Reinforcement Learning with Verifiable Rewards (RLVR) has succeeded in specific domains, the field lacks large-scale, high-quality, and difficulty-calibrated data for general reasoning. To address this, we propose UltraLogic, a framework that decouples the logical core of a problem from its natural language expression through a Code-based Solving methodology to automate high-quality data production. The framework comprises hundreds of unique task types and an automated calibration pipeline across ten difficulty levels. Furthermore, to mitigate binary reward sparsity and the Non-negative Reward Trap, we introduce the Bipolar Float Reward (BFR) mechanism, utilizing graded penalties to effectively distinguish perfect responses from those with logical flaws. Our experiments demonstrate that task diversity is the primary driver for reasoning enhancement, and that BFR, combined with a difficulty matching strategy, significantly improves training efficiency, guiding models toward global logical optima.
中文摘要 虽然大型语言模型（LLMs）在自然语言处理方面展现出显著潜力，但需要多步逻辑、规划和验证的复杂通用推理仍是一个关键瓶颈。尽管可验证奖励强化学习（RLVR）在特定领域取得了成功，但该领域缺乏用于通用推理的大规模、高质量且经过难度校准的数据。为此，我们提出了UltraLogic框架，通过基于代码的求解方法，将问题的逻辑核心与其自然语言表达解耦，实现高质量数据生成的自动化。该框架包含数百种独特的任务类型和涵盖十个难度等级的自动校准流程。此外，为了缓解二元奖励稀疏性和非负奖励陷阱，我们引入了双极浮动奖励（BFR）机制，利用分级惩罚有效区分完美回答与存在逻辑缺陷的回答。我们的实验表明，任务多样性是推理能力提升的主要驱动因素，并且BFR结合难度匹配策略可显著提升训练效率，引导模型走向全局逻辑最优。
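
Illustrative note (not from the paper): the abstract says that the Bipolar Float Reward uses graded penalties to separate perfect responses from flawed ones, but it does not give the formula. The sketch below is one possible reading under stated assumptions: a response is scored in [0, 1], only a perfect score earns a positive reward, and every imperfect response receives a negative reward that shrinks as the score improves. The function name and the exact mapping are hypothetical.

```python
def bipolar_float_reward(score: float) -> float:
    """Map a correctness score in [0, 1] to a bipolar reward (assumed form).

    A perfect response gets +1.0. Any imperfect response gets a graded
    penalty in [-1.0, 0.0), so a nearly correct answer is penalized less
    than a badly wrong one but is never rewarded positively.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if score == 1.0:
        return 1.0
    return score - 1.0
```

Under a plain non-negative float reward, a response scored 0.75 would still earn a positive signal. With this mapping it earns -0.25, which keeps the gradation between flawed answers while reserving positive reward for fully correct ones.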

STReasoner: Empowering LLMs for Spatio-Temporal Reasoning in Time Series via Spatial-Aware Reinforcement Learning

STReasoner：通过空间感知强化学习赋能LLM在时间序列中实现时空推理

Authors: Juntong Ni, Shiyu Wang, Ming Jin, Qi He, Wei Jin
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.03248
Pdf link: https://arxiv.org/pdf/2601.03248
Abstract Spatio-temporal reasoning in time series involves the explicit synthesis of temporal dynamics, spatial dependencies, and textual context. This capability is vital for high-stakes decision-making in systems such as traffic networks, power grids, and disease propagation. However, the field remains underdeveloped because most existing works prioritize predictive accuracy over reasoning. To address the gap, we introduce ST-Bench, a benchmark consisting of four core tasks, including etiological reasoning, entity identification, correlation reasoning, and in-context forecasting, developed via a network SDE-based multi-agent data synthesis pipeline. We then propose STReasoner, which empowers LLMs to integrate time series, graph structure, and text for explicit reasoning. To promote spatially grounded logic, we introduce S-GRPO, a reinforcement learning algorithm that rewards performance gains specifically attributable to spatial information. Experiments show that STReasoner achieves average accuracy gains between 17% and 135% at only 0.004X the cost of proprietary models and generalizes robustly to real-world data.
中文摘要 时间序列中的时空推理涉及对时间动态、空间依赖关系和文本上下文的显式综合。这一能力对于交通网络、电网和疾病传播等系统中的高风险决策至关重要。然而，该领域仍然不够成熟，因为大多数现有研究更注重预测准确性而非推理。为弥补这一空白，我们引入了ST-Bench基准，该基准包含四个核心任务，包括病因推理、实体识别、相关性推理和上下文预测，通过基于网络SDE的多智能体数据合成流水线构建。随后我们提出了STReasoner，使LLM能够整合时间序列、图结构和文本以进行显式推理。为了促进基于空间信息的逻辑，我们引入了S-GRPO强化学习算法，专门奖励可归因于空间信息的性能提升。实验表明，STReasoner 以仅为专有模型 0.004 倍的成本，实现 17% 至 135% 的平均准确率提升，并且能稳健地泛化到真实世界的数据。
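
Illustrative note (not from the paper): the abstract describes S-GRPO as rewarding performance gains attributable to spatial information, without giving the reward formula. The sketch below assumes one simple form: score the same response setting with and without the graph input, add a bonus for any positive difference, then compute group-relative advantages in the usual GRPO style (reward minus group mean, divided by group standard deviation). The function names, the `bonus_weight` parameter, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def spatial_aware_reward(score_with_graph, score_without_graph, bonus_weight=0.5):
    """Task score plus a bonus for the gain that the graph input provides."""
    # Only positive gains count; the bonus never penalizes a response.
    gain = max(0.0, score_with_graph - score_without_graph)
    return score_with_graph + bonus_weight * gain


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group of responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when all rewards in the group are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With this form, a response that is correct only when the graph is provided receives a larger reward than one that is correct either way, so the policy gradient favors reasoning that actually uses the spatial structure.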

Keyword: diffusion policy

No results found.