Arxiv Papers of Today

Generated: 2026-01-12 16:36:35 (UTC+8); Arxiv release time: 2026-01-12 20:00 EST (2026-01-13 09:00 UTC+8)

24 relevant papers today

Keyword: reinforcement learning

KP-Agent: Keyword Pruning in Sponsored Search Advertising via LLM-Powered Contextual Bandits

Authors: Hou-Wan Long, Yicheng Song, Zidong Wang, Tianshu Sun
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.05257
Pdf link: https://arxiv.org/pdf/2601.05257
Abstract Sponsored search advertising (SSA) requires advertisers to constantly adjust keyword strategies. While bid adjustment and keyword generation are well studied, keyword pruning (refining keyword sets to enhance campaign performance) remains under-explored. This paper addresses critical inefficiencies in current practices, as evidenced by a dataset containing 0.5 million SSA records from a pharmaceutical advertiser on the search engine of Meituan, China's largest delivery platform. We propose KP-Agent, an agentic LLM system with a domain tool set and a memory module. By modeling keyword pruning within a contextual bandit framework, KP-Agent generates code snippets to refine keyword sets through reinforcement learning. Experiments show KP-Agent improves cumulative profit by up to 49.28% over baselines.

On the Limits of Self-Improving in LLMs and Why AGI, ASI and the Singularity Are Not Near Without Symbolic Model Synthesis

Authors: Hector Zenil
Subjects: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.05280
Pdf link: https://arxiv.org/pdf/2601.05280
Abstract We formalise recursive self-training in Large Language Models (LLMs) and Generative AI as a discrete-time dynamical system and prove that, as training data become increasingly self-generated ($\alpha_t \to 0$), the system undergoes inevitably degenerative dynamics. We derive two fundamental failure modes: (1) Entropy Decay, where finite sampling effects cause a monotonic loss of distributional diversity (mode collapse), and (2) Variance Amplification, where the loss of external grounding causes the model's representation of truth to drift as a random walk, bounded only by the support diameter. We show these behaviours are not contingent on architecture but are consequences of distributional learning on finite samples. We further argue that Reinforcement Learning with imperfect verifiers suffers similar semantic collapse. To overcome these limits, we propose a path involving symbolic regression and program synthesis guided by Algorithmic Probability. The Coding Theorem Method (CTM) allows for identifying generative mechanisms rather than mere correlations, escaping the data-processing inequality that binds standard statistical learning. We conclude that while purely distributional learning leads to model collapse, hybrid neurosymbolic approaches offer a coherent framework for sustained self-improvement.
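
As a toy illustration of the entropy-decay failure mode described above, the sketch below repeatedly refits a categorical distribution to a finite sample drawn from its own previous fit, which corresponds to the fully self-generated regime. The vocabulary size, sample size, number of generations, and function names are illustrative assumptions, not the paper's setup.

```python
import math
import random
from collections import Counter

def entropy(probs):
    """Shannon entropy in nats, ignoring zero-probability categories."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def self_train(num_categories=50, sample_size=100, generations=200, seed=0):
    """Refit a categorical model on its own finite samples, generation after generation."""
    rng = random.Random(seed)
    probs = [1.0 / num_categories] * num_categories  # generation 0: uniform "real" data
    history = [entropy(probs)]
    for _ in range(generations):
        draws = rng.choices(range(num_categories), weights=probs, k=sample_size)
        counts = Counter(draws)
        # Maximum-likelihood refit: a category that is never sampled is lost for good.
        probs = [counts.get(i, 0) / sample_size for i in range(num_categories)]
        history.append(entropy(probs))
    return probs, history

probs, history = self_train()
support = sum(p > 0 for p in probs)
print(f"entropy: {history[0]:.3f} -> {history[-1]:.3f}, surviving categories: {support}")
```

In this toy model the set of surviving categories can only shrink, because a category with zero estimated probability is never sampled again. Entropy trends downward as a result, although a single run may fluctuate from one generation to the next.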

Thinking with Map: Reinforced Parallel Map-Augmented Agent for Geolocalization

Authors: Yuxiang Ji, Yong Wang, Ziyu Ma, Yiming Hu, Hailang Huang, Xuecai Hu, Guanhua Chen, Liaoni Wu, Xiangxiang Chu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.05432
Pdf link: https://arxiv.org/pdf/2601.05432
Abstract The image geolocalization task aims to predict where on Earth an image was taken using visual clues. Existing large vision-language model (LVLM) approaches leverage world knowledge, chain-of-thought reasoning, and agentic capabilities, but overlook a common strategy used by humans: using maps. In this work, we first equip the model with the Thinking with Map ability and formulate it as an agent-in-the-map loop. We develop a two-stage optimization scheme for it, consisting of agentic reinforcement learning (RL) followed by parallel test-time scaling (TTS). The RL stage strengthens the agentic capability of the model to improve sampling efficiency, and the parallel TTS enables the model to explore multiple candidate paths before making the final prediction, which is crucial for geolocalization. To evaluate our method on up-to-date and in-the-wild images, we further present MAPBench, a comprehensive geolocalization training and evaluation benchmark composed entirely of real-world images. Experimental results show that our method outperforms existing open- and closed-source models on most metrics, specifically improving Acc@500m from 8.0% to 22.1% compared to Gemini-3-Pro with Google Search/Map grounded mode.
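
For readers unfamiliar with distance-thresholded metrics such as Acc@500m, the sketch below shows one common way to compute them. It assumes the metric is the fraction of predictions within 500 m great-circle distance of the ground truth; the paper's exact definition is not given in the abstract, and the function names and coordinates are illustrative.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def acc_at(predictions, targets, threshold_m=500.0):
    """Fraction of predictions that fall within threshold_m of their target."""
    hits = sum(haversine_m(*p, *t) <= threshold_m for p, t in zip(predictions, targets))
    return hits / len(targets)

# Two illustrative predictions: one roughly 3 km off, one exact.
predictions = [(48.8584, 2.2945), (40.0, -74.0)]
targets = [(48.8606, 2.3376), (40.0, -74.0)]
accuracy = acc_at(predictions, targets)
print(accuracy)  # 0.5
```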

Do LLMs Need Inherent Reasoning Before Reinforcement Learning? A Study in Korean Self-Correction

Authors: Hongjin Kim, Jaewook Lee, Kiyoung Lee, Jong-hun Shin, Soojong Lim, Oh-Woog Kwon
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.05459
Pdf link: https://arxiv.org/pdf/2601.05459
Abstract Large Language Models (LLMs) demonstrate strong reasoning and self-correction abilities in high-resource languages like English, but their performance remains limited in low-resource languages such as Korean. In this study, we investigate whether reinforcement learning (RL) can enhance Korean reasoning abilities to a degree comparable to English. Our findings reveal that RL alone yields limited improvements when applied to models lacking inherent Korean reasoning capabilities. To address this, we explore several fine-tuning strategies and show that aligning the model's internal reasoning processes with Korean inputs, particularly by tuning Korean-specific neurons in early layers, is key to unlocking RL's effectiveness. We introduce a self-correction code-switching dataset to facilitate this alignment and observe significant performance gains in both mathematical reasoning and self-correction tasks. Ultimately, we conclude that the crucial factor in multilingual reasoning enhancement is not injecting new linguistic knowledge, but effectively eliciting and aligning existing reasoning capabilities. Our study provides a new perspective on how internal translation and neuron-level tuning contribute to multilingual reasoning alignment in LLMs.

PRISMA: Reinforcement Learning Guided Two-Stage Policy Optimization in Multi-Agent Architecture for Open-Domain Multi-Hop Question Answering

Authors: Yu Liu, Wenxiao Zhang, Cong Cao, Wenxuan Lu, Fangfang Yuan, Diandian Guo, Kun Peng, Qiang Sun, Kaiyan Zhang, Yanbing Liu, Jin B.Hong, Bowen Zhou, Zhiyuan Ma
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.05465
Pdf link: https://arxiv.org/pdf/2601.05465
Abstract Answering real-world open-domain multi-hop questions over massive corpora is a critical challenge in Retrieval-Augmented Generation (RAG) systems. Recent research employs reinforcement learning (RL) to end-to-end optimize the retrieval-augmented reasoning process, directly enhancing its capacity to resolve complex queries. However, reliable deployment is hindered by two obstacles. 1) Retrieval Collapse: iterative retrieval over large corpora fails to locate intermediate evidence containing bridge answers without reasoning-guided planning, causing downstream reasoning to collapse. 2) Learning Instability: end-to-end trajectory training suffers from weak credit assignment across reasoning chains and poor error localization across modules, causing overfitting to benchmark-specific heuristics that limit transferability and stability. To address these problems, we propose PRISMA, a decoupled RL-guided framework featuring a Plan-Retrieve-Inspect-Solve-Memoize architecture. PRISMA's strength lies in reasoning-guided collaboration: the Inspector provides reasoning-based feedback to refine the Planner's decomposition and fine-grained retrieval, while enforcing evidence-grounded reasoning in the Solver. We optimize individual agent capabilities via Two-Stage Group Relative Policy Optimization (GRPO). Stage I calibrates the Planner and Solver as specialized experts in planning and reasoning, while Stage II utilizes Observation-Aware Residual Policy Optimization (OARPO) to enhance the Inspector's ability to verify context and trigger targeted recovery. Experiments show that PRISMA achieves state-of-the-art performance on ten benchmarks and can be deployed efficiently in real-world scenarios.
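
The group-relative step at the core of GRPO can be sketched in a few lines: each sampled response is scored against the other responses drawn for the same prompt. This shows only the generic advantage normalization, under the assumption of population standard deviation and a small stabilizing constant. PRISMA's two-stage procedure and OARPO are not reproduced here, and the function name is illustrative.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std instead
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four responses sampled for one prompt, two of them judged correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])  # [1.0, -1.0, -1.0, 1.0]
```

When every response in a group earns the same reward, all advantages are zero, so that group contributes no policy-gradient signal.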

MaxCode: A Max-Reward Reinforcement Learning Framework for Automated Code Optimization

Authors: Jiefu Ou, Sapana Chaudhary, Kaj Bostrom, Nathaniel Weir, Shuai Zhang, Huzefa Rangwala, George Karypis
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.05475
Pdf link: https://arxiv.org/pdf/2601.05475
Abstract Large Language Models (LLMs) demonstrate strong capabilities in general coding tasks but encounter two key challenges when optimizing code: (i) writing optimized code (such as performant CUDA kernels and competition-level CPU code) requires expertise in systems, algorithms, and specific languages, and (ii) it requires interpreting performance metrics such as timing and device utilization, beyond binary correctness. In this work, we explore inference-time search algorithms that guide the LLM to discover better solutions through iterative refinement based on execution feedback. Our approach, called MaxCode, unifies existing search methods under a max-reward reinforcement learning framework, making the observation and action-value functions modular for modification. To enhance the observation space, we add the output of a natural language critique model, which converts raw execution feedback into diagnostic insights about errors and performance bottlenecks, together with the best discounted reward seen so far. Together, these provide richer input to the code proposal function. To improve exploration during search, we train a generative reward-to-go model using action values from rollouts to rerank potential solutions. Testing on the KernelBench (CUDA) and PIE (C++) optimization benchmarks shows that MaxCode improves optimized code performance compared to baselines, achieving 20.3% and 10.1% relative improvements in absolute speedup value and relative speedup ranking, respectively.
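
To make the max-reward idea concrete, the sketch below contrasts the usual cumulative return with a max-reward return over one refinement trajectory, and tracks the best discounted reward seen so far. It assumes the max-reward return is the maximum of the discounted per-step rewards; the abstract does not state the exact formulation, and the function names and speedup values are illustrative.

```python
def cumulative_return(rewards, gamma=1.0):
    """Standard RL return: the discounted sum of all rewards."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def max_reward_return(rewards, gamma=1.0):
    """Max-reward return (assumed form): only the best discounted reward counts."""
    return max(gamma ** t * r for t, r in enumerate(rewards))

def best_so_far(rewards, gamma=1.0):
    """Running best discounted reward, usable as an extra observation feature."""
    best, out = float("-inf"), []
    for t, r in enumerate(rewards):
        best = max(best, gamma ** t * r)
        out.append(best)
    return out

# Hypothetical speedups measured after each refinement step of one search trajectory.
speedups = [1.0, 1.4, 0.9, 2.1, 1.7]
print(max_reward_return(speedups))  # 2.1
print(best_so_far(speedups))        # [1.0, 1.4, 1.4, 2.1, 2.1]
```

For code optimization only the best candidate is kept, so a regression at step 3 does not lower the max-reward return, whereas it would lower a cumulative return.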

MemBuilder: Reinforcing LLMs for Long-Term Memory Construction via Attributed Dense Rewards

Authors: Zhiyu Shen, Ziming Wu, Fuming Lai, Shaobing Lian, Yanghui Rao
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.05488
Pdf link: https://arxiv.org/pdf/2601.05488
Abstract Maintaining consistency in long-term dialogues remains a fundamental challenge for LLMs, as standard retrieval mechanisms often fail to capture the temporal evolution of historical states. While memory-augmented frameworks offer a structured alternative, current systems rely on static prompting of closed-source models or suffer from ineffective training paradigms with sparse rewards. We introduce MemBuilder, a reinforcement learning framework that trains models to orchestrate multi-dimensional memory construction with attributed dense rewards. MemBuilder addresses two key challenges: (1) Sparse Trajectory-Level Rewards: we employ synthetic session-level question generation to provide dense intermediate rewards across extended trajectories; and (2) Multi-Dimensional Memory Attribution: we introduce contribution-aware gradient weighting that scales policy updates based on each component's downstream impact. Experimental results show that MemBuilder enables a 4B-parameter model to outperform state-of-the-art closed-source baselines, exhibiting strong generalization across long-term dialogue benchmarks.

How Exploration Breaks Cooperation in Shared-Policy Multi-Agent Reinforcement Learning

Authors: Yi-Ning Weng, Hsuan-Wei Lee
Subjects: Multiagent Systems (cs.MA); Physics and Society (physics.soc-ph)
Arxiv link: https://arxiv.org/abs/2601.05509
Pdf link: https://arxiv.org/pdf/2601.05509
Abstract Multi-agent reinforcement learning in dynamic social dilemmas commonly relies on parameter sharing to enable scalability. We show that in shared-policy Deep Q-Network learning, standard exploration can induce a robust and systematic collapse of cooperation even in environments where fully cooperative equilibria are stable and payoff dominant. Through controlled experiments, we demonstrate that shared DQN converges to stable but persistently low-cooperation regimes. This collapse is not caused by reward misalignment, noise, or insufficient training, but by a representational failure arising from partial observability combined with parameter coupling across heterogeneous agent states. Exploration-driven updates bias the shared representation toward locally dominant defection responses, which then propagate across agents and suppress cooperative learning. We confirm that the failure persists across network sizes, exploration schedules, and payoff structures, and disappears when parameter sharing is removed or when agents maintain independent representations. These results identify a fundamental failure mode of shared-policy MARL and establish structural conditions under which scalable learning architectures can systematically undermine cooperation. Our findings provide concrete guidance for the design of multi-agent learning systems in social and economic environments where collective behavior is critical.

LEAPS: An LLM-Empowered Adaptive Plugin for Taobao AI Search

Authors: Lei Wang, Jinhang Wu, Zhibin Wang, Biye Li, Haiping Hou
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2601.05513
Pdf link: https://arxiv.org/pdf/2601.05513
Abstract The rapid advancement of large language models has reshaped user search cognition, driving a paradigm shift from discrete keyword-based search to high-dimensional conversational interaction. However, existing e-commerce search architectures face a critical capability deficit in adapting to this change. Users are often caught in a dilemma: precise natural language descriptions frequently trigger zero-result scenarios, while the forced simplification of queries leads to decision overload from noisy, generic results. To tackle this challenge, we propose LEAPS (LLM-Empowered Adaptive Plugin for Taobao AI Search), which seamlessly upgrades traditional search systems via a "Broaden-and-Refine" paradigm. Specifically, it attaches plugins to both ends of the search pipeline: (1) Upstream, a Query Expander acts as an intent translator. It employs a novel three-stage training strategy--inverse data augmentation, posterior-knowledge supervised fine-tuning, and diversity-aware reinforcement learning--to generate adaptive and complementary query combinations that maximize the candidate product set. (2) Downstream, a Relevance Verifier serves as a semantic gatekeeper. By synthesizing multi-source data (e.g., OCR text, reviews) and leveraging chain-of-thought reasoning, it precisely filters noise to resolve selection overload. Extensive offline experiments and online A/B testing demonstrate that LEAPS significantly enhances conversational search experiences. Crucially, its non-invasive architecture preserves established retrieval performance optimized for short-text queries, while simultaneously allowing for low-cost integration into diverse back-ends. Fully deployed on Taobao AI Search since August 2025, LEAPS currently serves hundreds of millions of users monthly.

WildSci: Advancing Scientific Reasoning from In-the-Wild Literature

Authors: Tengxiao Liu, Deepak Nathani, Zekun Li, Kevin Yang, William Yang Wang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.05567
Pdf link: https://arxiv.org/pdf/2601.05567
Abstract Recent progress in large language model (LLM) reasoning has focused on domains like mathematics and coding, where abundant high-quality data and objective evaluation metrics are readily available. In contrast, progress in LLM reasoning models remains limited in scientific domains such as medicine and materials science due to limited dataset coverage and the inherent complexity of open-ended scientific questions. To address these challenges, we introduce WildSci, a new dataset of domain-specific science questions automatically synthesized from peer-reviewed literature, covering 9 scientific disciplines and 26 subdomains. By framing complex scientific reasoning tasks in a multiple-choice format, we enable scalable training with well-defined reward signals. We further apply reinforcement learning to finetune models on these data and analyze the resulting training dynamics, including domain-specific performance changes, response behaviors, and generalization trends. Experiments on a suite of scientific benchmarks demonstrate the effectiveness of our dataset and approach. We release WildSci to enable scalable and sustainable research in scientific reasoning, available at this https URL.

Reinforcement Learning of Large Language Models for Interpretable Credit Card Fraud Detection

Authors: Cooper Lin, Yanting Zhang, Maohao Ran, Wei Xue, Hongwei Fan, Yibo Xu, Zhenglin Wan, Sirui Han, Yike Guo, Jun Song
Subjects: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Arxiv link: https://arxiv.org/abs/2601.05578
Pdf link: https://arxiv.org/pdf/2601.05578
Abstract E-commerce platforms and payment solution providers face increasingly sophisticated fraud schemes, ranging from identity theft and account takeovers to complex money laundering operations that exploit the speed and anonymity of digital transactions. However, despite their theoretical promise, the application of Large Language Models (LLMs) to fraud detection in real-world financial contexts remains largely unexploited, and their practical effectiveness in handling domain-specific e-commerce transaction data has yet to be empirically validated. To bridge this gap between conventional machine learning limitations and the untapped potential of LLMs in fraud detection, this paper proposes a novel approach that employs Reinforcement Learning (RL) to post-train lightweight language models specifically for fraud detection tasks using only raw transaction data. We utilize the Group Sequence Policy Optimization (GSPO) algorithm combined with a rule-based reward system to fine-tune language models of various sizes on a real-life transaction dataset provided by a Chinese global payment solution company. Through this reinforcement learning framework, the language models are encouraged to explore diverse trust and risk signals embedded within the textual transaction data, including patterns in customer information, shipping details, product descriptions, and order history. Our experimental results demonstrate the effectiveness of this approach, with post-trained language models achieving substantial F1-score improvements on held-out test data. Our findings demonstrate that the observed performance improvements are primarily attributable to the exploration mechanism inherent in reinforcement learning, which allows models to discover novel fraud indicators beyond those captured by traditional engineered features.

PaCoRe: Learning to Scale Test-Time Compute with Parallel Coordinated Reasoning

Authors: Jingcheng Hu, Yinmin Zhang, Shijie Shang, Xiaobo Yang, Yue Peng, Zhewei Huang, Hebin Zhou, Xin Wu, Jie Cheng, Fanqi Wan, Xiangwen Kong, Chengyuan Yao, Kaiwen Yan, Ailin Huang, Hongyu Zhou, Qi Han, Zheng Ge, Daxin Jiang, Xiangyu Zhang, Heung-Yeung Shum
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.05593
Pdf link: https://arxiv.org/pdf/2601.05593
Abstract We introduce Parallel Coordinated Reasoning (PaCoRe), a training-and-inference framework designed to overcome a central limitation of contemporary language models: their inability to scale test-time compute (TTC) far beyond sequential reasoning under a fixed context window. PaCoRe departs from the traditional sequential paradigm by driving TTC through massive parallel exploration coordinated via a message-passing architecture in multiple rounds. Each round launches many parallel reasoning trajectories, compacts their findings into context-bounded messages, and synthesizes these messages to guide the next round and ultimately produce the final answer. Trained end-to-end with large-scale, outcome-based reinforcement learning, the model masters the synthesis abilities required by PaCoRe and scales to multi-million-token effective TTC without exceeding context limits. The approach yields strong improvements across diverse domains, and notably pushes reasoning beyond frontier systems in mathematics: an 8B model reaches 94.5% on HMMT 2025, surpassing GPT-5's 93.2% by scaling effective TTC to roughly two million tokens. We open-source model checkpoints, training data, and the full inference pipeline to accelerate follow-up work.

Orchestrating Tokens and Sequences: Dynamic Hybrid Policy Optimization for RLVR

Authors: Zijun Min, Bingshuai Liu, Ante Wang, Long Zhang, Anxiang Zeng, Haibo Zhang, Jinsong Su
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.05607
Pdf link: https://arxiv.org/pdf/2601.05607
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) offers a promising framework for optimizing large language models in reasoning tasks. However, existing RLVR algorithms focus on different granularities, and each has complementary strengths and limitations. Group Relative Policy Optimization (GRPO) updates the policy with token-level importance ratios, which preserves fine-grained credit assignment but often suffers from high variance and instability. In contrast, Group Sequence Policy Optimization (GSPO) applies single sequence-level importance ratios across all tokens in a response that better matches sequence-level rewards, but sacrifices token-wise credit assignment. In this paper, we propose Dynamic Hybrid Policy Optimization (DHPO) to bridge GRPO and GSPO within a single clipped surrogate objective. DHPO combines token-level and sequence-level importance ratios using weighting mechanisms. We explore two variants of the mixing mechanism, including an averaged mixing and an entropy-guided mixing. To further stabilize training, we employ a branch-specific clipping strategy that constrains token-level and sequence-level ratios within separate trust regions before mixing, preventing outliers in either branch from dominating the update. Across seven challenging mathematical reasoning benchmarks, experiments on both dense and MoE models from the Qwen3 series show that DHPO consistently outperforms GRPO and GSPO. We will release our code upon acceptance of this paper.
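
The mixing of token-level and sequence-level importance ratios can be sketched as follows. This is a hypothetical reading of the abstract, not the paper's objective: it assumes a GSPO-style sequence ratio (the exponential of the mean token log-ratio), a fixed mixing weight `w` standing in for the averaged variant, and illustrative clipping ranges `eps_tok` and `eps_seq`. The entropy-guided variant is not shown.

```python
import math

def clip(x, lo, hi):
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))

def mixed_ratios(logp_new, logp_old, w=0.5, eps_tok=0.2, eps_seq=0.01):
    """Per-token importance weights mixing clipped token-level and sequence-level ratios."""
    diffs = [n - o for n, o in zip(logp_new, logp_old)]
    token_ratios = [math.exp(d) for d in diffs]      # token-level (GRPO-style) ratios
    seq_ratio = math.exp(sum(diffs) / len(diffs))    # sequence-level ratio (assumed GSPO-style)
    # Branch-specific clipping: each branch is constrained to its own trust region before mixing.
    seq_clipped = clip(seq_ratio, 1 - eps_seq, 1 + eps_seq)
    return [w * clip(r, 1 - eps_tok, 1 + eps_tok) + (1 - w) * seq_clipped
            for r in token_ratios]

# The third token has an outlier log-probability change under the new policy.
ratios = mixed_ratios([-1.0, -1.0, 1.0], [-1.0, -1.0, -1.0])
print([round(r, 3) for r in ratios])  # [1.005, 1.005, 1.105]
```

Because both branches are clipped before they are combined, the outlier token ratio (about 7.39 unclipped) cannot dominate the mixed weight.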

Dual-Phase LLM Reasoning: Self-Evolved Mathematical Frameworks

Authors: ShaoZhen Liu, Xinting Huang, Houwen Peng, Xin Chen, Xinyang Song, Qi Li, Zhenan Sun
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.05616
Pdf link: https://arxiv.org/pdf/2601.05616
Abstract In recent years, large language models (LLMs) have demonstrated significant potential in complex reasoning tasks like mathematical problem-solving. However, existing research predominantly relies on reinforcement learning (RL) frameworks while overlooking supervised fine-tuning (SFT) methods. This paper proposes a new two-stage training framework that enhances models' self-correction capabilities through self-generated long chain-of-thought (CoT) data. During the first stage, a multi-turn dialogue strategy guides the model to generate CoT data incorporating verification, backtracking, subgoal decomposition, and backward reasoning, with predefined rules filtering high-quality samples for supervised fine-tuning. The second stage employs a difficulty-aware rejection sampling mechanism to dynamically optimize data distribution, strengthening the model's ability to handle complex problems. The approach generates reasoning chains more than 4 times longer while maintaining strong scalability, showing that SFT effectively activates models' intrinsic reasoning capabilities and provides a resource-efficient pathway for complex task optimization. Experimental results demonstrate performance improvements on mathematical benchmarks including GSM8K and MATH500, with the fine-tuned model achieving a substantial improvement on competition-level problems like AIME24. Code will be open-sourced.

GIFT: Games as Informal Training for Generalizable LLMs

Authors: Nuoyan Lyu, Bingbing Xu, Weihao Meng, Yige Yuan, Yang Zhang, Zhiyong Huang, Tat-Seng Chua, Huawei Shen
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.05633
Pdf link: https://arxiv.org/pdf/2601.05633
Abstract While Large Language Models (LLMs) have achieved remarkable success in formal learning tasks such as mathematics and code generation, they still struggle with the "practical wisdom" and generalizable intelligence, such as strategic creativity and social reasoning, that characterize human cognition. This gap arises from a lack of informal learning, which thrives on interactive feedback rather than goal-oriented instruction. In this paper, we propose treating Games as a primary environment for LLM informal learning, leveraging their intrinsic reward signals and abstracted complexity to cultivate diverse competencies. To address the performance degradation observed in multi-task learning, we introduce a Nested Training Framework. Unlike naive task mixing optimizing an implicit "OR" objective, our framework employs sequential task composition to enforce an explicit "AND" objective, compelling the model to master multiple abilities simultaneously to achieve maximal rewards. Using GRPO-based reinforcement learning across Matrix Games, TicTacToe, and Who's the Spy games, we demonstrate that integrating game-based informal learning not only prevents task interference but also significantly bolsters the model's generalization across broad ability-oriented benchmarks. The framework and implementation are publicly available.
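
The contrast between the implicit "OR" objective and the explicit "AND" objective can be illustrated with a toy calculation. The sketch assumes binary task outcomes, independent success probabilities, and a reward that is granted only when every task in a chained episode succeeds. This is one plausible reading of the abstract, not the paper's definition, and all names and numbers are illustrative.

```python
def mixed_objective(success_rates):
    """Naive task mixing: each episode samples one task, so the expected reward is the average."""
    return sum(success_rates) / len(success_rates)

def nested_objective(success_rates):
    """Sequential composition: reward only if every chained task succeeds (independence assumed)."""
    out = 1.0
    for p in success_rates:
        out *= p
    return out

specialist = [1.0, 1.0, 0.1]   # excellent at two games, poor at the third
generalist = [0.7, 0.7, 0.7]   # moderately good at all three

print(round(mixed_objective(specialist), 3), round(mixed_objective(generalist), 3))    # 0.7 0.7
print(round(nested_objective(specialist), 3), round(nested_objective(generalist), 3))  # 0.1 0.343
```

Under mixing the two policies look equally good, so nothing pushes the specialist to improve on its weak task. Under composition the weak task caps the achievable reward.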
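Abstract (translated from Chinese) Although large language models (LLMs) have achieved notable success on formal learning tasks such as mathematics and code generation, they still struggle to acquire "practical wisdom" and generalizable intelligence, such as strategic creativity and social reasoning, which characterize human cognition. This gap stems from a lack of informal learning, which relies on interactive feedback rather than goal-oriented instruction. This paper proposes treating games as the primary environment for LLM informal learning, using their intrinsic reward signals and abstracted complexity to cultivate diverse competencies. To address the performance degradation observed in multi-task learning, we introduce a Nested Training Framework. Unlike naive task mixing, which optimizes an implicit "OR" objective, our framework uses sequential task composition to enforce an explicit "AND" objective, forcing the model to master multiple abilities at once to obtain the maximum reward. Using GRPO-based reinforcement learning on Matrix Games, TicTacToe, and Who's the Spy, we show that game-based informal learning not only prevents task interference but also significantly improves the model's generalization on broad ability-oriented benchmarks. The framework and implementation are publicly available.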
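
The contrast between the implicit "OR" objective of naive task mixing and the explicit "AND" objective of sequential composition in the GIFT abstract can be illustrated with a toy calculation. This is only a sketch under stated assumptions, not the paper's formulation: the function names, the uniform task sampling, and the product-of-success-rates reward are all hypothetical choices made for illustration.

```python
from typing import Sequence


def mixed_objective(task_success: Sequence[float]) -> float:
    """Naive task mixing (assumed form): each episode samples one task
    uniformly, so the expected reward is the mean success rate. Doing well
    on any single task already earns partial credit (an implicit "OR")."""
    return sum(task_success) / len(task_success)


def nested_objective(task_success: Sequence[float]) -> float:
    """Sequential composition (assumed form): one episode chains all tasks
    and pays out only if every task succeeds. With independent tasks the
    expected reward is the product of success rates (an explicit "AND")."""
    reward = 1.0
    for p in task_success:
        reward *= p
    return reward


# A hypothetical policy that has mastered one game but not the other.
specialist = [1.0, 0.0]
print(mixed_objective(specialist))   # partial credit under mixing
print(nested_objective(specialist))  # no credit under composition
```

Under this toy model a specialist policy still collects half of the mixed reward, while the composed reward is zero until both abilities are present, which is the pressure the abstract describes.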

SketchVL: Policy Optimization via Fine-Grained Credit Assignment for Chart Understanding and More

Authors: Muye Huang, Lingling Zhang, Yifei Li, Yaqiang Wu, Jun Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.05688
Pdf link: https://arxiv.org/pdf/2601.05688
Abstract Charts are high-density visual carriers of complex data and a medium for information extraction and analysis. Due to the need for precise and complex visual reasoning, automated chart understanding poses a significant challenge to existing Multimodal Large Language Models (MLLMs). Many MLLMs trained with reinforcement learning (RL) face the challenge of credit assignment. Their advantage estimation, typically performed at the trajectory level, cannot distinguish between correct and incorrect reasoning steps within a single generated response. To address this limitation, we introduce SketchVL, a novel MLLM that is optimized with FinePO, a new RL algorithm designed for fine-grained credit assignment within each trajectory. SketchVL's methodology involves drawing its intermediate reasoning steps as markers on the image and feeding the annotated image back to itself, creating a robust, multi-step reasoning process. During training, the FinePO algorithm leverages a Fine-grained Process Reward Model (FinePRM) to score each drawing action within a trajectory, thereby precisely assigning credit for each step. This mechanism allows FinePO to more strongly reward correct tokens when a trajectory is globally successful, and more heavily penalize incorrect tokens when the trajectory is globally suboptimal, thus achieving fine-grained reinforcement signals. Experiments show that SketchVL learns to align its step-level behavior with the FinePRM, achieving an average performance gain of 7.23% over its base model across chart datasets, natural image datasets, and mathematics, providing a promising new direction for training powerful reasoning models.
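Abstract (translated from Chinese) Charts are high-density visual carriers of complex data and a medium for information extraction and analysis. Because it requires precise and complex visual reasoning, automated chart understanding poses a major challenge to existing multimodal large language models (MLLMs). Many MLLMs trained with reinforcement learning (RL) face the challenge of credit assignment: their advantage estimation is usually performed at the trajectory level and cannot distinguish correct from incorrect reasoning steps within a single generated response. To address this limitation, we introduce SketchVL, a novel MLLM optimized with FinePO, an RL algorithm designed for fine-grained credit assignment within each trajectory. SketchVL draws its intermediate reasoning steps as markers on the image and feeds the annotated image back to itself, forming a robust multi-step reasoning process. During training, FinePO uses a Fine-grained Process Reward Model (FinePRM) to score every drawing action in a trajectory, thereby assigning credit precisely to each step. This mechanism lets FinePO reward correct tokens more strongly when the trajectory succeeds globally and penalize incorrect tokens more heavily when the trajectory is globally suboptimal, yielding fine-grained reinforcement signals. Experiments show that SketchVL learns to align its step-level behavior with FinePRM and achieves an average performance gain of 7.23% over its base model across chart datasets, natural image datasets, and mathematics, offering a promising new direction for training strong reasoning models.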
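
The idea of splitting a trajectory-level advantage into step-level credit, as the SketchVL abstract describes for FinePO, can be sketched as follows. The abstract does not give FinePO's actual formula, so the weighting rule here is an assumption for illustration: `step_advantages`, the [0, 1] score range, and the mean-one normalisation are all hypothetical.

```python
from typing import List, Sequence


def step_advantages(traj_advantage: float, step_scores: Sequence[float]) -> List[float]:
    """Spread one trajectory-level advantage over steps using step scores.

    step_scores are hypothetical process-reward scores in [0, 1].
    Positive advantage: higher-scored steps receive more of the reward.
    Negative advantage: lower-scored steps receive more of the penalty.
    Weights are normalised to mean 1, so the average step advantage
    equals the original trajectory-level advantage.
    """
    if traj_advantage >= 0:
        weights = list(step_scores)
    else:
        weights = [1.0 - s for s in step_scores]
    mean_w = sum(weights) / len(weights)
    if mean_w == 0.0:
        # No step stands out, so fall back to uniform credit.
        return [traj_advantage] * len(weights)
    return [traj_advantage * w / mean_w for w in weights]


# One good step (score 0.9) and one poor step (score 0.1).
print(step_advantages(1.0, [0.9, 0.1]))   # good step gets most of the reward
print(step_advantages(-1.0, [0.9, 0.1]))  # poor step gets most of the penalty
```

Keeping the mean weight at 1 is one possible design choice: it leaves the overall scale of the policy-gradient signal unchanged and only redistributes it across steps.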

From Off-Policy to On-Policy: Enhancing GUI Agents via Bi-level Expert-to-Policy Assimilation

Authors: Zezhou Wang, Ziyun Zhang, Xiaoyi Zhang, Zhuzhong Qian, Yan Lu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.05787
Pdf link: https://arxiv.org/pdf/2601.05787
Abstract Vision-language models are increasingly deployed as computer-use agents (CUAs) that operate desktops and browsers. Top-performing CUAs are framework-based systems that decompose planning and execution, while end-to-end screenshot-to-action policies are easier to deploy but lag behind on benchmarks such as OSWorld-Verified. GUI datasets like OSWorld pose two bottlenecks: they expose only a few hundred interactive, verifiable tasks and environments, and expert trajectories must be gathered by interacting with these environments, making such data hard to scale. We therefore ask how reinforcement learning from verifiable rewards (RLVR) can best exploit a small pool of existing expert trajectories to train end-to-end policies. Naively mixing these off-policy traces into on-policy RLVR is brittle: even after format conversion, expert trajectories exhibit structural mismatch and distribution shift from the learner. We propose BEPA (Bi-Level Expert-to-Policy Assimilation), which turns static expert traces into policy-aligned guidance via self-rolled reachable trajectories under the base policy (LEVEL-1) and a per-task, dynamically updated cache used in RLVR (LEVEL-2). On OSWorld-Verified, BEPA improves UITARS1.5-7B success from 22.87% to 32.13% and raises a held-out split from 5.74% to 10.30%, with consistent gains on MMBench-GUI and Online-Mind2Web. Our code and data are available at: this https URL
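Abstract (translated from Chinese) Vision-language models are increasingly deployed as computer-use agents (CUAs) that operate desktops and browsers. The best-performing CUAs are framework-based systems that decompose planning and execution, while end-to-end screenshot-to-action policies are easier to deploy but lag behind on benchmarks such as OSWorld-Verified. GUI datasets like OSWorld have two bottlenecks: they expose only a few hundred interactive, verifiable tasks and environments, and expert trajectories must be collected by interacting with these environments, which makes such data hard to scale. We therefore study how reinforcement learning from verifiable rewards (RLVR) can make the best use of a small pool of existing expert trajectories to train end-to-end policies. Naively mixing these off-policy traces into on-policy RLVR is brittle: even after format conversion, expert trajectories still show structural mismatch and distribution shift relative to the learner. We propose BEPA (Bi-Level Expert-to-Policy Assimilation), which turns static expert traces into policy-aligned guidance through self-rolled reachable trajectories under the base policy (LEVEL-1) and a per-task, dynamically updated cache used in RLVR (LEVEL-2). On OSWorld-Verified, BEPA raises the success rate of UITARS1.5-7B from 22.87% to 32.13% and the success rate on a held-out split from 5.74% to 10.30%, with consistent gains on MMBench-GUI and Online-Mind2Web. Our code and data are available at this https URL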

EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis

Authors: Xiaoshuai Song, Haofei Chang, Guanting Dong, Yutao Zhu, Zhicheng Dou, Ji-Rong Wen
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.05808
Pdf link: https://arxiv.org/pdf/2601.05808
Abstract Large language models (LLMs) are expected to be trained to act as agents in various real-world environments, but this process relies on rich and varied tool-interaction sandboxes. However, access to real systems is often restricted; LLM-simulated environments are prone to hallucinations and inconsistencies; and manually built sandboxes are hard to scale. In this paper, we propose EnvScaler, an automated framework for scalable tool-interaction environments via programmatic synthesis. EnvScaler comprises two components. First, SkelBuilder constructs diverse environment skeletons through topic mining, logic modeling, and quality evaluation. Then, ScenGenerator generates multiple task scenarios and rule-based trajectory validation functions for each environment. With EnvScaler, we synthesize 191 environments and about 7K scenarios, and apply them to Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) for Qwen3 series models. Results on three benchmarks show that EnvScaler significantly improves LLMs' ability to solve tasks in complex environments involving multi-turn, multi-tool interactions. We release our code and data at this https URL.
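Abstract (translated from Chinese) Large language models (LLMs) are expected to be trained to act as agents in a variety of real-world environments, but this process depends on rich and varied tool-interaction sandboxes. However, access to real systems is often restricted, LLM-simulated environments are prone to hallucinations and inconsistencies, and manually built sandboxes are hard to scale. This paper proposes EnvScaler, an automated framework that builds scalable tool-interaction environments through programmatic synthesis. EnvScaler has two components. First, SkelBuilder constructs diverse environment skeletons through topic mining, logic modeling, and quality evaluation. Then, ScenGenerator generates multiple task scenarios and rule-based trajectory validation functions for each environment. With EnvScaler, we synthesize 191 environments and about 7,000 scenarios and apply them to supervised fine-tuning (SFT) and reinforcement learning (RL) for Qwen3 series models. Results on three benchmarks show that EnvScaler significantly improves the ability of LLMs to solve tasks in complex environments involving multi-turn, multi-tool interactions. We release our code and data at this https URL.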

Intelligent Singularity Avoidance in UR10 Robotic Arm Path Planning Using Hybrid Fuzzy Logic and Reinforcement Learning

Authors: Sheng-Kai Chen, Jyh-Horng Wu
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.05836
Pdf link: https://arxiv.org/pdf/2601.05836
Abstract This paper presents a comprehensive approach to singularity detection and avoidance in UR10 robotic arm path planning through the integration of fuzzy logic safety systems and reinforcement learning algorithms. The proposed system addresses critical challenges in robotic manipulation where singularities can cause loss of control and potential equipment damage. Our hybrid approach combines real-time singularity detection using manipulability measures, condition number analysis, and fuzzy logic decision-making with a stable reinforcement learning framework for adaptive path planning. Experimental results demonstrate a 90% success rate in reaching target positions while maintaining safe distances from singular configurations. The system integrates PyBullet simulation for training data collection and URSim connectivity for real-world deployment.
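Abstract (translated from Chinese) This paper presents a comprehensive approach to singularity detection and avoidance in UR10 robotic arm path planning by integrating fuzzy logic safety systems with reinforcement learning algorithms. The system addresses a key challenge in robotic manipulation, where singularities can cause loss of control and equipment damage. Our hybrid approach combines real-time singularity detection based on manipulability measures, condition number analysis, and fuzzy logic decision-making with a stable reinforcement learning framework for adaptive path planning. Experimental results show a 90% success rate in reaching target positions while keeping a safe distance from singular configurations. The system integrates PyBullet simulation for training data collection and URSim connectivity for real-world deployment.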
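
The two singularity indicators named in the abstract, the manipulability measure and the Jacobian condition number, can be sketched on a toy arm. This is a minimal illustration and not the paper's system: it uses a planar two-link arm rather than the 6-DoF UR10, and the helper names and the thresholds `w_min` and `kappa_max` are hypothetical. Yoshikawa's manipulability is sqrt(det(J Jᵀ)), which reduces to |det J| for a square Jacobian; the condition number is the ratio of the largest to the smallest singular value.

```python
import math


def planar_2link_jacobian(q1, q2, l1=1.0, l2=1.0):
    """Position Jacobian of a planar two-link arm (toy stand-in for a UR10)."""
    s1, c1 = math.sin(q1), math.cos(q1)
    s12, c12 = math.sin(q1 + q2), math.cos(q1 + q2)
    return [[-l1 * s1 - l2 * s12, -l2 * s12],
            [l1 * c1 + l2 * c12, l2 * c12]]


def manipulability(J):
    """Yoshikawa measure sqrt(det(J J^T)); equals |det J| for a 2x2 Jacobian."""
    (a, b), (c, d) = J
    return abs(a * d - b * c)


def condition_number(J):
    """Ratio of largest to smallest singular value of a 2x2 matrix."""
    (a, b), (c, d) = J
    trace = a * a + b * b + c * c + d * d  # sum of squared singular values
    det = a * d - b * c                    # |det| is their product
    disc = math.sqrt(max(trace * trace - 4.0 * det * det, 0.0))
    s_max = math.sqrt((trace + disc) / 2.0)
    s_min = math.sqrt(max((trace - disc) / 2.0, 0.0))
    return math.inf if s_min == 0.0 else s_max / s_min


def near_singularity(J, w_min=0.05, kappa_max=50.0):
    """Flag a configuration using hypothetical thresholds on both indicators."""
    return manipulability(J) < w_min or condition_number(J) > kappa_max


print(near_singularity(planar_2link_jacobian(0.0, math.pi / 2)))  # elbow bent
print(near_singularity(planar_2link_jacobian(0.0, 0.0)))          # arm fully stretched
```

For this arm det J = l1 · l2 · sin(q2), so the fully stretched pose (q2 = 0) is singular while the bent elbow is well conditioned. In a hybrid system of the kind described, such indicators would presumably feed the fuzzy safety layer, though the abstract does not specify how.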

IIB-LPO: Latent Policy Optimization via Iterative Information Bottleneck

Authors: Huilin Deng, Hongchen Luo, Yue Zhu, Long Li, Zhuoyue Chen, Xinghao Zhao, Ming Li, Jihai Zhang, Mengchang Wang, Yang Cao, Yu Kang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.05870
Pdf link: https://arxiv.org/pdf/2601.05870
Abstract Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) for Large Language Model (LLM) reasoning have been hindered by a persistent challenge: exploration collapse. The semantic homogeneity of random rollouts often traps models in narrow, over-optimized behaviors. While existing methods leverage policy entropy to encourage exploration, they face inherent limitations. Global entropy regularization is susceptible to reward hacking, which can induce meaningless verbosity, whereas local token-selective updates struggle with the strong inductive bias of pre-trained models. To address this, we propose Latent Policy Optimization via Iterative Information Bottleneck (IIB-LPO), a novel approach that shifts exploration from statistical perturbation of token distributions to topological branching of reasoning trajectories. IIB-LPO triggers latent branching at high-entropy states to diversify reasoning paths and employs the Information Bottleneck principle both as a trajectory filter and a self-reward mechanism, ensuring concise and informative exploration. Empirical results across four mathematical reasoning benchmarks demonstrate that IIB-LPO achieves state-of-the-art performance, surpassing prior methods by margins of up to 5.3% in accuracy and 7.4% in diversity metrics.
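Abstract (translated from Chinese) Recent progress in Reinforcement Learning with Verifiable Rewards (RLVR) for large language model (LLM) reasoning has been held back by a persistent challenge: exploration collapse. The semantic homogeneity of random rollouts often traps models in narrow, over-optimized behaviors. Existing methods use policy entropy to encourage exploration, but they face inherent limitations. Global entropy regularization is vulnerable to reward hacking, which can induce meaningless verbosity, while local token-selective updates struggle against the strong inductive bias of pre-trained models. To address this, we propose Latent Policy Optimization via Iterative Information Bottleneck (IIB-LPO), a novel approach that shifts exploration from statistical perturbation of token distributions to topological branching of reasoning trajectories. IIB-LPO triggers latent branching at high-entropy states to diversify reasoning paths and applies the Information Bottleneck principle both as a trajectory filter and as a self-reward mechanism, ensuring concise and informative exploration. Empirical results on four mathematical reasoning benchmarks show that IIB-LPO achieves state-of-the-art performance, surpassing prior methods by up to 5.3% in accuracy and 7.4% in diversity metrics.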
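
The phrase "triggers latent branching at high-entropy states" can be made concrete with a small entropy check. This sketch covers only the trigger condition and says nothing about how IIB-LPO actually branches or filters trajectories; the function names and the threshold value are hypothetical.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def branch_points(step_distributions: Sequence[Sequence[float]],
                  threshold: float) -> List[int]:
    """Indices of decoding steps whose entropy exceeds the threshold."""
    return [i for i, probs in enumerate(step_distributions)
            if token_entropy(probs) > threshold]


steps = [
    [0.97, 0.01, 0.01, 0.01],  # confident step
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain step
    [0.70, 0.20, 0.05, 0.05],  # mildly uncertain step
]
print(branch_points(steps, threshold=1.0))
```

With a threshold of 1.0 nat only the uniform step (entropy ln 4 ≈ 1.386) would be selected as a branching candidate in this toy example.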

StackPlanner: A Centralized Hierarchical Multi-Agent System with Task-Experience Memory Management

Authors: Ruizhe Zhang, Xinke Jiang, Zhibang Yang, Zhixin Zhang, Jiaran Gao, Yuzhen Xiao, Hongbin Lai, Xu Chu, Junfeng Zhao, Yasha Wang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.05890
Pdf link: https://arxiv.org/pdf/2601.05890
Abstract Multi-agent systems based on large language models, particularly centralized architectures, have recently shown strong potential for complex and knowledge-intensive tasks. However, central agents often suffer from unstable long-horizon collaboration due to the lack of memory management, leading to context bloat, error accumulation, and poor cross-task generalization. To address both task-level memory inefficiency and the inability to reuse coordination experience, we propose StackPlanner, a hierarchical multi-agent framework with explicit memory control. StackPlanner addresses these challenges by decoupling high-level coordination from subtask execution with active task-level memory control, and by learning to retrieve and exploit reusable coordination experience via structured experience memory and reinforcement learning. Experiments on multiple deep-search and agent system benchmarks demonstrate the effectiveness of our approach in enabling reliable long-horizon multi-agent collaboration.
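Abstract (translated from Chinese) Multi-agent systems based on large language models, especially centralized architectures, have recently shown strong potential on complex and knowledge-intensive tasks. However, because they lack memory management, central agents often suffer from unstable long-horizon collaboration, which leads to context bloat, error accumulation, and poor cross-task generalization. To address both task-level memory inefficiency and the inability to reuse coordination experience, we propose StackPlanner, a hierarchical multi-agent framework with explicit memory control. StackPlanner tackles these challenges by decoupling high-level coordination from subtask execution through active task-level memory control, and by learning to retrieve and exploit reusable coordination experience through structured experience memory and reinforcement learning. Experiments on multiple deep-search and agent system benchmarks demonstrate the effectiveness of our approach in enabling reliable long-horizon multi-agent collaboration.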

TowerMind: A Tower Defence Game Learning Environment and Benchmark for LLM as Agents

Authors: Dawei Wang, Chengming Zhou, Di Zhao, Xinyuan Liu, Marci Chi Ma, Gary Ushaw, Richard Davison
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.05899
Pdf link: https://arxiv.org/pdf/2601.05899
Abstract Recent breakthroughs in Large Language Models (LLMs) have positioned them as a promising paradigm for agents, with long-term planning and decision-making emerging as core general-purpose capabilities for adapting to diverse scenarios and tasks. Real-time strategy (RTS) games serve as an ideal testbed for evaluating these two capabilities, as their inherent gameplay requires both macro-level strategic planning and micro-level tactical adaptation and action execution. Existing RTS game-based environments either suffer from relatively high computational demands or lack support for textual observations, which has constrained the use of RTS games for LLM evaluation. Motivated by this, we present TowerMind, a novel environment grounded in the tower defense (TD) subgenre of RTS games. TowerMind preserves the key evaluation strengths of RTS games for assessing LLMs, while featuring low computational demands and a multimodal observation space, including pixel-based, textual, and structured game-state representations. In addition, TowerMind supports the evaluation of model hallucination and provides a high degree of customizability. We design five benchmark levels to evaluate several widely used LLMs under different multimodal input settings. The results reveal a clear performance gap between LLMs and human experts across both capability and hallucination dimensions. The experiments further highlight key limitations in LLM behavior, such as inadequate planning validation, a lack of multifinality in decision-making, and inefficient action use. We also evaluate two classic reinforcement learning algorithms: Ape-X DQN and PPO. By offering a lightweight and multimodal design, TowerMind complements the existing RTS game-based environment landscape and introduces a new benchmark for the AI agent field. The source code is publicly available on GitHub(this https URL).
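Abstract (translated from Chinese) Recent breakthroughs in large language models (LLMs) have made them a promising paradigm for agents, with long-term planning and decision-making emerging as core general-purpose capabilities for adapting to diverse scenarios and tasks. Real-time strategy (RTS) games are an ideal testbed for evaluating these two capabilities, because their gameplay requires both macro-level strategic planning and micro-level tactical adaptation and action execution. Existing RTS game-based environments either have relatively high computational demands or lack support for textual observations, which has limited the use of RTS games for LLM evaluation. Motivated by this, we present TowerMind, a novel environment grounded in the tower defense (TD) subgenre of RTS games. TowerMind preserves the key strengths of RTS games for evaluating LLMs while offering low computational demands and a multimodal observation space that includes pixel-based, textual, and structured game-state representations. In addition, TowerMind supports the evaluation of model hallucination and is highly customizable. We design five benchmark levels to evaluate several widely used LLMs under different multimodal input settings. The results reveal a clear performance gap between LLMs and human experts on both the capability and hallucination dimensions. The experiments further highlight key limitations in LLM behavior, such as inadequate planning validation, a lack of multifinality in decision-making, and inefficient action use. We also evaluate two classic reinforcement learning algorithms: Ape-X DQN and PPO. With its lightweight and multimodal design, TowerMind complements the existing landscape of RTS game-based environments and introduces a new benchmark for the AI agent field. The source code is publicly available on GitHub (this https URL).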

Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards

Authors: Jiajie Zhang, Xin Lv, Ling Feng, Lei Hou, Juanzi Li
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.06021
Pdf link: https://arxiv.org/pdf/2601.06021
Abstract Reinforcement learning (RL) has emerged as a critical technique for enhancing LLM-based deep search agents. However, existing approaches primarily rely on binary outcome rewards, which fail to capture the comprehensiveness and factuality of agents' reasoning process, and often lead to undesirable behaviors such as shortcut exploitation and hallucinations. To address these limitations, we propose Citation-aware Rubric Rewards (CaRR), a fine-grained reward framework for deep search agents that emphasizes reasoning comprehensiveness, factual grounding, and evidence connectivity. CaRR decomposes complex questions into verifiable single-hop rubrics and requires agents to satisfy these rubrics by explicitly identifying hidden entities, supporting them with correct citations, and constructing complete evidence chains that link to the predicted answer. We further introduce Citation-aware Group Relative Policy Optimization (C-GRPO), which combines CaRR and outcome rewards for training robust deep search agents. Experiments show that C-GRPO consistently outperforms standard outcome-based RL baselines across multiple deep search benchmarks. Our analysis also validates that C-GRPO effectively discourages shortcut exploitation, promotes comprehensive, evidence-grounded reasoning, and exhibits strong generalization to open-ended deep research tasks. Our code and data are available at this https URL.
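Abstract (translated from Chinese) Reinforcement learning (RL) has become a key technique for enhancing LLM-based deep search agents. However, existing methods rely mainly on binary outcome rewards, which fail to capture the comprehensiveness and factuality of the agent's reasoning process and often lead to undesirable behaviors such as shortcut exploitation and hallucinations. To address these limitations, we propose Citation-aware Rubric Rewards (CaRR), a fine-grained reward framework for deep search agents that emphasizes reasoning comprehensiveness, factual grounding, and evidence connectivity. CaRR decomposes complex questions into verifiable single-hop rubrics and requires agents to satisfy them by explicitly identifying hidden entities, supporting them with correct citations, and building complete evidence chains that link to the predicted answer. We further introduce Citation-aware Group Relative Policy Optimization (C-GRPO), which combines CaRR with outcome rewards to train robust deep search agents. Experiments show that C-GRPO consistently outperforms standard outcome-based RL baselines on multiple deep search benchmarks. Our analysis also confirms that C-GRPO effectively discourages shortcut exploitation, promotes comprehensive, evidence-grounded reasoning, and generalizes well to open-ended deep research tasks. Our code and data are available at this https URL.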
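
How a rubric-based term might enter a GRPO-style update can be sketched as follows. The within-group standardisation is the usual group-relative advantage; everything else is an assumption for illustration, because the abstract does not say how C-GRPO combines CaRR with the outcome reward. In particular, the additive form, the `weight` parameter, the use of the population standard deviation, and both function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def combined_reward(outcome: float, rubrics_met: int, rubrics_total: int,
                    weight: float = 0.5) -> float:
    """Outcome reward plus a weighted fraction of satisfied rubrics.
    The additive form and `weight` are assumptions, not the paper's rule."""
    rubric_fraction = rubrics_met / rubrics_total if rubrics_total else 0.0
    return outcome + weight * rubric_fraction


def group_relative_advantages(rewards: Sequence[float]) -> List[float]:
    """GRPO-style advantages: standardise rewards within one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0.0:
        # All rollouts scored the same, so there is no learning signal.
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


# Three hypothetical rollouts for one question with four rubrics:
rewards = [
    combined_reward(1.0, 4, 4),  # correct answer, full evidence chain
    combined_reward(1.0, 1, 4),  # correct answer reached via a shortcut
    combined_reward(0.0, 2, 4),  # wrong answer, partial evidence
]
print(group_relative_advantages(rewards))
```

With a binary outcome reward alone, the first two rollouts would receive identical advantages; the rubric term is what separates the evidence-backed answer from the shortcut in this toy setup.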

Keyword: diffusion policy

CHDP: Cooperative Hybrid Diffusion Policies for Reinforcement Learning in Parameterized Action Space

Authors: Bingyi Liu, Jinbo He, Haiyong Shi, Enshu Wang, Weizhen Han, Jingxiang Hao, Peixi Wang, Zhuangzhuang Zhang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.05675
Pdf link: https://arxiv.org/pdf/2601.05675
Abstract Hybrid action space, which combines discrete choices and continuous parameters, is prevalent in domains such as robot control and game AI. However, efficiently modeling and optimizing hybrid discrete-continuous action space remains a fundamental challenge, mainly due to limited policy expressiveness and poor scalability in high-dimensional settings. To address this challenge, we view the hybrid action space problem as a fully cooperative game and propose a Cooperative Hybrid Diffusion Policies (CHDP) framework to solve it. CHDP employs two cooperative agents that leverage a discrete and a continuous diffusion policy, respectively. The continuous policy is conditioned on the discrete action's representation, explicitly modeling the dependency between them. This cooperative design allows the diffusion policies to leverage their expressiveness to capture complex distributions in their respective action spaces. To mitigate the update conflicts arising from simultaneous policy updates in this cooperative setting, we employ a sequential update scheme that fosters co-adaptation. Moreover, to improve scalability when learning in high-dimensional discrete action space, we construct a codebook that embeds the action space into a low-dimensional latent space. This mapping enables the discrete policy to learn in a compact, structured space. Finally, we design a Q-function-based guidance mechanism to align the codebook's embeddings with the discrete policy's representation during training. On challenging hybrid action benchmarks, CHDP outperforms the state-of-the-art method by up to 19.3% in success rate.
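Abstract (translated from Chinese) Hybrid action spaces, which combine discrete choices with continuous parameters, are common in domains such as robot control and game AI. However, efficiently modeling and optimizing hybrid discrete-continuous action spaces remains a fundamental challenge, mainly because of limited policy expressiveness and poor scalability in high-dimensional settings. To address this challenge, we view the hybrid action space problem as a fully cooperative game and propose the Cooperative Hybrid Diffusion Policies (CHDP) framework to solve it. CHDP uses two cooperative agents that employ a discrete and a continuous diffusion policy, respectively. The continuous policy is conditioned on the representation of the discrete action, explicitly modeling the dependency between them. This cooperative design lets the diffusion policies use their expressiveness to capture complex distributions in their respective action spaces. To reduce the update conflicts caused by simultaneous policy updates in this cooperative setting, we adopt a sequential update scheme that fosters co-adaptation. In addition, to improve scalability when learning in high-dimensional discrete action spaces, we construct a codebook that embeds the action space into a low-dimensional latent space. This mapping allows the discrete policy to learn in a compact, structured space. Finally, we design a Q-function-based guidance mechanism that aligns the codebook embeddings with the discrete policy's representation during training. On challenging hybrid action benchmarks, CHDP outperforms the state-of-the-art method by up to 19.3% in success rate.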
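
The codebook idea in the CHDP abstract, embedding a discrete action space in a low-dimensional latent space and conditioning the continuous policy on the chosen action's representation, can be sketched with a plain lookup. No diffusion model is involved here: the latent vector stands in for whatever the discrete policy would output, and the nearest-neighbour decoding, the concatenation-based conditioning, and all names are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def nearest_code(latent: Sequence[float], codebook: Sequence[Vector]) -> int:
    """Index of the codebook entry closest to `latent` (squared Euclidean)."""
    def sq_dist(code: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(latent, code))
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))


def continuous_policy_input(state: Sequence[float], action_index: int,
                            codebook: Sequence[Vector]) -> List[float]:
    """Condition the continuous policy on the chosen discrete action by
    concatenating the state with that action's embedding (assumed scheme)."""
    return list(state) + list(codebook[action_index])


# Hypothetical 2-D embeddings for three discrete actions.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
latent = (0.9, 0.1)  # stand-in for a discrete policy's latent output
k = nearest_code(latent, codebook)
print(k, continuous_policy_input([0.5], k, codebook))
```

In a setup like this the discrete policy only has to produce a point in the small latent space, however many discrete actions the codebook holds, which is the scalability argument the abstract makes.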