Arxiv Papers of Today

Generated: 2026-05-29 19:33:31 (UTC+8); arXiv release time: 2026-05-29 20:00 EDT (2026-05-30 08:00 UTC+8)

67 related papers today

Keyword: reinforcement learning

Micro-Macro Retrieval: Reducing Long-Form Hallucination in Large Language Models

Authors: Yujie Feng, Jian Li, Zhihan Zhou, Pengfei Xu, Yujia Zhang, Xiaoyu Li, Xiaohui Zhou, Alan Zhao, Xi Chen, Xiao-Ming Wu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.28828
Pdf link: https://arxiv.org/pdf/2605.28828
Abstract Large Language Models (LLMs) achieve impressive performance across many tasks but remain prone to hallucination, especially in long-form generation where redundant retrieved contexts and lengthy reasoning chains amplify factual errors. Recent studies highlight a critical phenomenon: the closer key information appears to the model outputs, the higher the factual accuracy. However, existing retrieval-augmented language models (RALMs) lack effective mechanisms to ensure this proximity - external evidence is injected into reasoning via multi-turn retrieval, but this cannot ensure key information stays close to the outputs. We propose Micro-Macro Retrieval (M2R), a novel retrieve-while-generate framework to fill this gap. At the macro level, M2R retrieves coarse-grained evidence from external sources; at the micro level, it extracts essential results from a key information repository built during reasoning and reuses them while generating answers. This design directly addresses the key-information-to-output proximity bottleneck, effectively reducing hallucination in long-form tasks. M2R is trained with a curriculum learning-based reinforcement learning strategy using customized rule-based rewards, enabling stable acquisition of retrieval and grounding skills. Extensive experiments across different benchmarks demonstrate the effectiveness of M2R, especially in lengthy-context settings.

Aryabhata 2: Scaling Reinforcement Learning for Advanced STEM Reasoning

Authors: Ritvik Rastogi, Vishal Singh, Tejas Chaudhari, Sandeep Varma
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2605.28829
Pdf link: https://arxiv.org/pdf/2605.28829
Abstract Competitive STEM examinations such as JEE and NEET require multi-step symbolic reasoning, precise numerical computation, and deep conceptual understanding across physics, chemistry, and mathematics. Recent large language models perform strongly on common reasoning benchmarks, yet they remain difficult to deploy at scale, where millions of student doubts demand domain-specific, consistently structured problem solving. We introduce Aryabhata 2, a reasoning-focused language model for competitive STEM examinations, trained via reinforcement-learning post-training. Using PhysicsWallah's internal question banks, we construct a high-quality training curriculum and post-train GPT-OSS-20B through reinforcement learning with verifiable rewards. Training combines prolonged reinforcement learning with broadened exploration via progressively larger rollout group sizes. We evaluate Aryabhata 2 on competitive examination benchmarks, including JEE Main, JEE Advanced, and NEET, as well as out-of-distribution reasoning datasets such as AIME, HMMT, MMLU-Pro, MMLU-Redux 2.0, and GPQA. Results show that Aryabhata 2 outperforms its base model GPT-OSS-20B on competitive STEM reasoning while requiring substantially fewer output tokens (up to 64% fewer).

Thoughts-as-Planning: Latent World Models for Chain-of-Thoughts Optimization via Reinforcement Planning

Authors: Dong Liu, Yanxuan Yu, Ying Nian Wu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.28842
Pdf link: https://arxiv.org/pdf/2605.28842
Abstract The success of large language models (LLMs) across diverse NLP tasks has elevated the importance of reasoning chain optimization as a critical step in aligning model behavior with task objectives. Existing reasoning chain tuning methods often rely on black-box heuristics or gradient-free search, which lack interpretability, generalization, and sample efficiency. In this work, we introduce Thoughts-as-Planning, a novel framework that formalizes reasoning chain optimization as a sequential decision-making process over a latent semantic space. We model the LLM as a partially observable environment and learn a latent world model that simulates the effect of reasoning chain edits on downstream outputs. A proximity-preserving embedding space is constructed to encode reasoning chain-response dynamics, enabling planning via gradient descent or reinforcement learning. Our method supports multi-scale abstraction, allowing reasoning chain edits at token, segment, and instruction levels to be integrated into a unified planner. Through extensive experiments on language understanding and generation tasks, we demonstrate that Thoughts-as-Planning outperforms state-of-the-art reasoning chain tuning baselines in efficiency, robustness, and generalization, while offering interpretability through its structured planning trajectory. Our code is available at this https URL.

Mechanistic origins of catastrophic forgetting: why RL preserves circuits better than SFT?

Authors: Jeanmely Rojas Nunez, Viraj Sawant, Nathan Allen, Nomgondalai Amgalanbaatar, Yannis Zongo, Vasu Sharma, Maheep Chaudhary
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.28860
Pdf link: https://arxiv.org/pdf/2605.28860
Abstract Fine-tuning large language models (LLMs) frequently induces catastrophic forgetting of prior capabilities. Recent work has shown that reinforcement learning (RL) retains prior capabilities more effectively than supervised fine-tuning (SFT), attributing this to policy-gradient updates remaining closer to the base policy \cite{shenfeld2025rl}. We extend this behavioral account to the mechanistic level and ask whether RL's advantage is mirrored by stronger preservation of internal computational circuits. We introduce differential circuit vulnerability, a head-level measure of how much a circuit degrades under fine-tuning, and use it to compare RL and SFT on Qwen2.5-3B-Instruct adapted to scientific question-answering. We find a clear mechanistic trade-off: SFT adapts more rapidly to the target task but produces substantially greater circuit disruption and forgetting of prior capabilities, whereas RL preserves a larger fraction of the base circuit at the cost of slower task adaptation. These findings suggest that circuit preservation may help explain why RL is more robust to catastrophic forgetting. We released our code here: this https URL.

FedQHD: Closed-Form Function-Space Federated Reinforcement Learning

Authors: Yuchen Hou, Yongshan Chen, Zhuowen Zou, Calvin Yeung, Mohsen Imani, Tian Lan, Mahdi Imani
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2605.29002
Pdf link: https://arxiv.org/pdf/2605.29002
Abstract Federated reinforcement learning enables decentralized agents to collaboratively improve policies or value estimates without exchanging raw trajectories. However, FedAvg-style parameter averaging is not function-space consistent: when clients use heterogeneous encoders or even identical nonlinear networks, averaged parameters need not correspond to the weighted average of client value functions in any common function space. We propose FedQHD, a federated Q-learning method using hyperdimensional (random-feature) state encoders with a linear readout, so that Q-functions are nonlinear in state yet linear in trainable parameters. This linear structure enables closed-form aggregation. With a shared encoder, the function-space consensus update coincides exactly with weighted averaging of local readout matrices. With heterogeneous encoders, the server constructs a global teacher by averaging client Q-values on a shared anchor-state set, and each client compiles this teacher into its local representation via a single ridge projection. We formalize the federation gap -- the error incurred when compiling a federated teacher into a heterogeneous client representation -- relative to a client-specific oracle projection. We show that this gap decomposes into subspace misalignment, anchor-set conditioning, and regularization bias. We further identify the anchor-to-dimension ratio $m \geq D_i$ as the well-conditioned regime in which the gap reduces to a multiple of the encoder heterogeneity floor. On four continuous-state, discrete-action control benchmarks, FedQHD matches or outperforms FedAvg-style baselines and distillation-based alternatives while requiring substantially less computation, and the empirical dependence of the federation gap on encoder dimension matches our theoretical analysis.
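
The shared-encoder claim in the FedQHD abstract (averaging readout matrices coincides with averaging client Q-functions) follows from the Q-function being linear in its trainable parameters. The following is a minimal sketch of that identity, not the authors' implementation. It assumes a cosine random-feature encoder as a stand-in for the hyperdimensional encoder; the function names, sizes, and client weights are all illustrative.

```python
import math
import random


def make_encoder(state_dim, feat_dim, seed):
    """Fixed random-feature encoder phi(s) = cos(W s + b) (assumed stand-in)."""
    rng = random.Random(seed)
    W = [[rng.gauss(0.0, 1.0) for _ in range(state_dim)] for _ in range(feat_dim)]
    b = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(feat_dim)]

    def phi(state):
        return [math.cos(sum(w * x for w, x in zip(row, state)) + bi)
                for row, bi in zip(W, b)]

    return phi


def q_values(readout, features):
    """Linear readout: one Q-value per action; readout is [num_actions][feat_dim]."""
    return [sum(r * f for r, f in zip(row, features)) for row in readout]


def average_readouts(readouts, weights):
    """Weighted average of client readout matrices (weights sum to 1)."""
    n_act, n_feat = len(readouts[0]), len(readouts[0][0])
    return [[sum(w * R[a][j] for w, R in zip(weights, readouts))
             for j in range(n_feat)] for a in range(n_act)]


rng = random.Random(0)
phi = make_encoder(state_dim=3, feat_dim=16, seed=1)  # shared by all clients
clients = [[[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(2)]
           for _ in range(3)]  # three hypothetical client readouts
weights = [0.5, 0.3, 0.2]
feats = phi([0.1, -0.4, 0.7])

# Path 1: average the parameters, then evaluate the Q-function.
q_from_avg_params = q_values(average_readouts(clients, weights), feats)
# Path 2: evaluate each client's Q-function, then average the values.
q_avg_of_functions = [sum(w * q_values(R, feats)[a] for w, R in zip(weights, clients))
                      for a in range(2)]
max_gap = max(abs(x - y) for x, y in zip(q_from_avg_params, q_avg_of_functions))
```

Both paths agree up to floating-point error only because the encoder is shared and fixed. With heterogeneous encoders the abstract describes a different route (anchor-state averaging followed by a ridge projection), which this sketch does not cover.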

Tensorized Radiative Heat Transfer for a Scalable and Calibrated Building Energy Simulator

Authors: Sang woo Ham, Donghun Kim, Michael Rossetti, John Sipple
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2605.29003
Pdf link: https://arxiv.org/pdf/2605.29003
Abstract Accurate building energy simulation is essential for developing advanced control strategies that enable demand flexibility and grid responsiveness. The Smart Buildings Control Suite (sbsim) offers a lightweight, scalable, and data-calibrated simulation environment based on a tensorized finite difference model. Previous work extended sbsim to include interior long-wave radiative heat exchange between indoor surfaces. However, a complete thermal model must also account for exterior radiative processes, including long-wave radiation exchange with the sky and surroundings, as well as short-wave solar radiation incident on building surfaces. This paper presents a comprehensive radiative heat transfer implementation for sbsim that integrates both interior and exterior radiation mechanisms. Our primary contribution is the development and integration of a fully tensorized exterior radiation module that captures sky and ground long-wave exchange as well as solar heat gains through opaque and transparent surfaces. By formulating these processes as tensor operations compatible with the existing framework, we preserve the computational efficiency necessary for reinforcement learning applications. We validate our implementation against established simulation tools and demonstrate improved prediction accuracy for surface temperatures and building thermal loads. This enhancement significantly increases the physical fidelity of sbsim, enabling more realistic training environments for building energy optimization and control.

Label-Free Reinforcement Learning via Cross-Model Entropy

Authors: Matt Gorbett, Hossein Shirazi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.29009
Pdf link: https://arxiv.org/pdf/2605.29009
Abstract Post-training large language models with reinforcement learning is bottlenecked by the reward signal. Existing approaches require either ground-truth verifiable rewards, restricting training to domains with automatic correctness checks (e.g., mathematics, code execution), or human preference labels, which are expensive to collect and prone to reward hacking. Recent label-free methods replace ground-truth verifiers with self-referential signals like majority voting or token entropy over a model's own outputs, but risk reinforcing a model's own errors. In this work we propose Cross-Model Entropy (CME), the mean log-likelihood of a generator's response under a separate verifier model, as a label-free reward signal for RL post-training. CME is continuous, training-free, and grounded in the principle that responses a verifier finds unsurprising are likely correct or high quality. Because the verifier is independent of the generator, the signal cannot be gamed through self-consistency. We integrate CME into GRPO with no other changes to the training loop, extending label-free RL to open-ended instruction following -- a regime where self-referential signals are inapplicable or poorly suited. On open-ended instruction following (UltraFeedback prompts, evaluated on AlpacaEval 2.0), CME rewards beat the untrained base in head-to-head LLM-as-Judge comparisons across four model families (Qwen, Llama, Gemma, OLMo) and three training regimes (pretrained, SFT, and instruction-tuned), with tie-adjusted win rates ranging from 52.5% to 71.4%. Code will be released upon publication.
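
The CME abstract defines the reward as the mean log-likelihood of a generator's response under a separate verifier model. The sketch below illustrates only that averaging step, under the assumption that the verifier's per-token log-probabilities have already been computed. The response names and numbers are made up for illustration, and how the reward is scaled or combined inside GRPO is not specified in the abstract.

```python
def cme_reward(verifier_token_logprobs):
    """Mean per-token log-likelihood of one response under the verifier."""
    if not verifier_token_logprobs:
        raise ValueError("response must contain at least one token")
    return sum(verifier_token_logprobs) / len(verifier_token_logprobs)


# Hypothetical verifier log-probs for three sampled responses to one prompt.
responses = {
    "resp_a": [-0.2, -0.1, -0.3],
    "resp_b": [-1.5, -2.0, -0.9, -1.2],
    "resp_c": [-0.6, -0.4],
}
rewards = {name: cme_reward(lp) for name, lp in responses.items()}
# The response the verifier finds least surprising gets the highest reward.
best = max(rewards, key=rewards.get)
```

Averaging over tokens rather than summing keeps the reward from automatically penalizing longer responses.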

Theoretical Foundations and Effective Algorithms for Policy-Aware Simulator Learning

Authors: Christoph Dann, Yishay Mansour, Mehryar Mohri
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2605.29032
Pdf link: https://arxiv.org/pdf/2605.29032
Abstract Model-based reinforcement learning (MBRL) agents typically learn world models by minimizing predictive loss. However, powerful RL optimizers inevitably exploit minor model inaccuracies, leading to simulator exploitation and a reality gap where policies succeed in simulation but fail in the real world. We propose that the objective for learning simulators should be strategic robustness rather than predictive accuracy, and formulate this as a zero-sum minimax game between a model player and an adversarial policy player. We provide a comprehensive theoretical analysis: (1) an online learning guarantee showing the game is learnable with sublinear regret bounds; (2) a tractable critic-based simplification bounding the global policy-value gap by the local critic's loss; and (3) an Error-MDP duality, proving that finding the worst-case policy is formally dual to a standard RL problem where the reward is the one-step critic error. This duality yields a provably convergent active data selection algorithm. Experiments on continuous control tasks demonstrate that our approach reduces prediction error in strategically important regions by $1.5$-$2.2\times$ and enables policies trained purely in simulation to match near-optimal real-world performance.

Moment Matching Q-Learning

Authors: Yiyan (Edgar) Liang, Sifei Liu, Weitong Zhang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.29033
Pdf link: https://arxiv.org/pdf/2605.29033
Abstract Score-based and flow-based generative models exhibit remarkable expressive capacity in capturing complex distributions, and have been extensively deployed in tasks ranging from image generation to reinforcement learning. Nevertheless, these models suffer from prolonged inference latency, which imposes a significant computational bottleneck in RL with iterative sampling. To overcome this limitation, we propose a new framework named Moment Matching Q-Learning (MoMa QL), which utilizes a technique from statistical hypothesis testing known as maximum mean discrepancy (MMD) that intends to match all orders of statistics between the original and target distributions. By enforcing strong regularization on all moment statistics, this algorithm guarantees distribution-level convergence for the conditional score function and remains stable under various hyperparameters. Empirically, we show that MoMa QL is more computationally efficient, with comparable, if not competitive, performance on various D4RL tasks. Remarkably, by accelerating the action sampling process for flow-based policies, MoMa QL demonstrates superior performance in offline-to-online RL tasks because of faster and stronger adaptability for online interactive finetuning.
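
The maximum mean discrepancy mentioned in the MoMa QL abstract can be estimated from two sets of samples with a kernel. The sketch below is the standard biased (V-statistic) estimator with a Gaussian RBF kernel, shown on toy one-dimensional samples. It is not the paper's training objective, and the bandwidth, sample sizes, and distributions are arbitrary assumptions.

```python
import math
import random


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian RBF kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    def mean_kernel(A, B):
        total = sum(rbf_kernel(a, b, bandwidth) for a in A for b in B)
        return total / (len(A) * len(B))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


rng = random.Random(0)
target = [[rng.gauss(0.0, 1.0)] for _ in range(200)]
close = [[rng.gauss(0.0, 1.0)] for _ in range(200)]  # same distribution as target
far = [[rng.gauss(3.0, 1.0)] for _ in range(200)]    # shifted mean

mmd_close = mmd_squared(close, target)
mmd_far = mmd_squared(far, target)
```

With a characteristic kernel such as the Gaussian RBF, the population MMD is zero only when the two distributions are identical, which is the sense in which it compares all moments at once.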

Differentiable Belief-based Opponent Shaping

Authors: Aarav G Sane, Karthik Sivachandran, Rohan Paleja
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.29042
Pdf link: https://arxiv.org/pdf/2605.29042
Abstract Human coordination often relies on the ability to influence the beliefs of others through strategic action. In multi-agent reinforcement learning, opponent shaping attempts to replicate this influence, though existing methods typically operate within an opponent's parameter, policy, or value space. Meanwhile, belief-manipulation techniques in hidden-role games often rely on hard-coded objectives, such as deception or belief saturation. We propose Differentiable Belief-based Opponent Shaping (D-BOS), a first-order method that treats each observer's belief as the shaped opponent state and differentiates through $k$-step softmax-Bayes belief dynamics. Rather than explicitly rewarding deceptive or cooperative behavior, our method treats the belief state as the target for shaping. This allows the optimal strategy to emerge naturally from the environment's reward structure. This belief-space formulation provides an opponent-shaping signal by differentiating through opponent belief updates, and naturally extends to multiple observers by aggregating gradients over their individual inferred belief trajectories. Empirically, D-BOS outperforms PPO and BBM in hidden-role games, with the largest gains in mixed-motive settings.

Structured Prompt Optimization Meets Reinforcement Learning for Global and Local Interpretability over Complex Text

Authors: Tianyang Zhou, Wenbo Chen, Pierre Jinghong Liang, Leman Akoglu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.29076
Pdf link: https://arxiv.org/pdf/2605.29076
Abstract LLMs have advanced text classification, yet existing paradigms face a trade-off: supervised (label only) fine-tuning is scalable but offers limited reasoning on complex text and lacks broader model transparency, while discrete prompt optimization offers human-readable instructions but struggles with performance and scalability. We introduce eXTC (eXplainable Text Classifier) with three progressive stages: (1) learning a Standard Operating Procedure (SOP, or rulebook) in natural language via a new Structured Prompt Optimization algorithm; (2) SOP-grounded reasoning distillation from a large teacher LLM into a compact LM; and (3) expanding reasoning capabilities beyond the initial SOP via reinforcement learning. This design enables eXTC to provide (i) fast inference via a compact LM, with (ii) inference-time local reasoning traces, alongside a global, modular explanation of its learned domain rules, while (iii) significantly outperforming existing paradigms across diverse benchmarks in both classification performance and explanation quality, with stage-by-stage gains.

OISD: On-Policy Internal Self-Distillation of Language Models

Authors: Xinyu Liu, Darryl Cherian Jacob, Yang Zhou, Jindong Wang, Pan He
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.29089
Pdf link: https://arxiv.org/pdf/2605.29089
Abstract Recent reinforcement learning (RL) post-training approaches primarily optimize the final output policy using sparse outcome-level rewards, while largely overlooking predictive signals encoded in intermediate representations. In this paper, we introduce a new paradigm called on-policy internal self-distillation and propose the OISD framework, which improves reasoning by transferring on-policy predictive signals from the final layer to intermediate representations. During rollout and Group Relative Policy Optimization (GRPO) optimization, the final layer acts as both the policy and a detached internal teacher for selected intermediate layers, which are guided to align with it through two complementary mechanisms: logit alignment, which transfers high-level reasoning behaviors (how to think), and attention alignment, which enforces consistent attention patterns (where to look) from the final layer to the selected intermediate layer, both without requiring external privileged information. Our OISD, together with GRPO, employs signed advantage-weighted Jensen--Shannon alignment to distill informative intermediate representations while preserving policy consistency under a unified acting policy. Experimental results demonstrate the effectiveness of OISD, with substantial and consistent improvements over strong reasoning RL baselines across four mathematical reasoning tasks. The code will be released at this https URL
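
The OISD abstract aligns intermediate-layer predictions with the final layer using a Jensen-Shannon term. The sketch below shows only the Jensen-Shannon divergence between two softmax distributions, with made-up logits standing in for the final-layer "teacher" and an intermediate-layer "student". The signed advantage weighting and the attention alignment described in the abstract are not reproduced here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by ln 2."""
    mix = [0.5 * (a + b) for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, mix) + 0.5 * kl_divergence(q, mix)


teacher = softmax([2.0, 0.5, -1.0])   # hypothetical final-layer logits
student = softmax([0.2, 1.5, -0.3])   # hypothetical intermediate-layer logits
js = js_divergence(teacher, student)
```

Unlike plain KL, the Jensen-Shannon divergence stays finite even when one distribution puts near-zero mass where the other does not, which is one plausible reason to prefer it for an alignment loss.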

PRO-CUA: Process-Reward Optimization for Computer Use Agents

Authors: Yifei He, Rui Yang, Hao Bai, Tong Zhang, Han Zhao
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.29119
Pdf link: https://arxiv.org/pdf/2605.29119
Abstract Computer use agents (CUAs) have shown strong potential for automating complex digital workflows, yet their training remains constrained by costly live environment interaction and limited high-quality supervision. Existing filtered behavior cloning pipelines suffer from imitation bottlenecks, including distribution shift from the expert demonstration and the absence of negative learning signals. Meanwhile, standard trajectory-level reinforcement learning struggles with sparse rewards, ambiguous credit assignment, and high infrastructure costs for long-horizon GUI interaction. In this work, we propose PRO-CUA, a process-reward optimization framework for training CUAs with iterative step-level reinforcement learning. PRO-CUA decouples on-policy environment interaction from policy optimization: the current policy collects states through live rollouts, generates diverse candidate actions for each state, receives step-level feedback from a process reward model (PRM), and is optimized with group-relative advantages. This design enables dense and flexible credit assignment without relying on golden answers or offline expert trajectories, while reducing distribution shift by training on the agent's own execution states. Experiments on live web benchmarks demonstrate the effectiveness of PRO-CUA and the reliability of PRM-guided step-level training.
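
The step-level update in the PRO-CUA abstract scores several candidate actions per visited state with a process reward model and optimizes with group-relative advantages. A minimal sketch of that normalization follows, assuming the common mean-and-standard-deviation form used in GRPO-style methods. The state names and PRM scores are hypothetical, and the paper's exact normalization may differ.

```python
import math


def group_relative_advantages(scores, eps=1e-8):
    """Center and scale scores within one group of candidate actions."""
    mean = sum(scores) / len(scores)
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
    return [(s - mean) / (std + eps) for s in scores]


# Hypothetical PRM scores for candidate actions sampled at two visited states.
prm_scores_by_state = {
    "state_0": [0.9, 0.2, 0.4, 0.5],
    "state_1": [0.1, 0.1, 0.8],
}
advantages_by_state = {
    state: group_relative_advantages(scores)
    for state, scores in prm_scores_by_state.items()
}
```

Because each state forms its own group, a candidate is credited only relative to the alternatives available at that same step, which is what makes the signal dense without needing a full-trajectory outcome.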

CA-AC-MPC: CUDA-Accelerated Actor-Critic Model Predictive Control

Authors: Antoonio Buo, Vittorio Cammarota, Michele Avagnale, Pierluigi Arpenti, Vincenzo Lippiello, Fabio Ruggiero
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2605.29155
Pdf link: https://arxiv.org/pdf/2605.29155
Abstract In the literature, actor-critic model predictive control (AC-MPC) integrates MPC with reinforcement learning to enable high-performance control of complex dynamical systems. However, its differentiable MPC layer requires repeatedly solving an optimization problem in both the forward and backward passes, leading to substantial training and inference latency. This paper tackles this bottleneck by introducing a CUDA-accelerated variant that significantly reduces end-to-end execution time while preserving the control performance of the baseline formulation. Simulation results on an agile drone racing task show that our approach achieves state-of-the-art lap times and near-limit dynamic behaviour with markedly reduced training and inference time.

When RL Suppresses Its Own Vocabulary: Recovering Reasoning Diversity in Puzzle-to-Math Transfer

Authors: Mayug Maniparambil, Arjun Karuvally, Terrence Sejnowski, Fergal Reid
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.29190
Pdf link: https://arxiv.org/pdf/2605.29190
Abstract Reinforcement learning using verifiable rewards (RLVR) improves LLM reasoning, but the conditions under which it transfers across domains -- and why it does so -- remain under-explored. We study cross-domain transfer in a 7B model whose SFT and RL post-training stages use only constraint-satisfaction puzzles, with no mathematics problems in the post-training data. To analyze how transfer emerges, we introduce a reasoning primitive-level framework that combines a 9-class span classifier with motif extraction, allowing us to segment chain-of-thought traces into primitive motifs and track their evolution across training stages and domains. We find that puzzle SFT induces a reasoning-primitive vocabulary, yielding a +7 pp pass@32 gain on OlymMATH-Hard. Vanilla GSPO then composes these primitives into longer compute-verify chains, adding a further +6 pp. However, this RL stage also suppresses exploratory primitives such as "hypothesize" and "backtrack". To address this, we introduce a novelty bonus that rewards diverse correct rollouts, using perplexity under the reference model as a signal. This restores recovery primitives during RL and adds a further +7 pp pass@32 relative to vanilla GSPO. Finally, the end-to-end recipe raises the hard-math capability ceiling from 16.0% at the OLMo3-7B-Instruct-SFT base to 36.0%, without adding any mathematics problems during the SFT or RL stages.

Guidance Contrastive Token Credit Assignment for Discrete Policy Optimization

Authors: Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato, Kazuki Kozuka, Aditya Grover
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.29198
Pdf link: https://arxiv.org/pdf/2605.29198
Abstract Group-advantage-based reinforcement learning methods, such as GRPO and DAPO, have demonstrated strong performance across diverse domains, including mathematical reasoning and text-to-image generation. However, their reliance on sample-level rewards introduces a key limitation as uniform credit assignment across all tokens fails to capture fine-grained, token-level contributions. To address this issue, we propose Guidance Contrastive Policy Optimization (GCPO), a novel algorithm that enables per-token credit assignment by contrasting model predictions under positive and negative prompts. Rather than uniformly broadcasting sample-level advantages, GCPO assigns token-level advantages proportional to the difference between these contrastive predictions, allowing more precise and informative learning signals. Empirically, we find that GCPO emphasizes semantically relevant regions such as visual areas aligned with textual prompts in text-to-image generation, and critical keywords within reasoning traces for chain-of-thought tasks. Through extensive experiments, GCPO consistently outperforms GRPO and DAPO baselines on both text-to-image generation and chain-of-thought reasoning benchmarks, demonstrating its effectiveness as a general and scalable optimization strategy for discrete policy learning.
中文摘要 基于组优势的强化学习方法，如GRPO和DAPO，在数学推理和文本到图像生成等多个领域表现优异。然而，它们对样本级奖励的依赖带来了一个关键限制：对所有token统一分配信用无法捕捉细粒度的token级贡献。为解决这一问题，我们提出了引导对比策略优化（GCPO），这是一种新颖算法，通过对比正向和负向提示下的模型预测，实现逐token的信用分配。GCPO不再均匀广播样本级优势，而是按这些对比预测之间的差异成比例地分配token级优势，从而提供更精确、信息量更大的学习信号。实证上，我们发现GCPO会强调语义相关的区域，如文本到图像生成中与文本提示对齐的视觉区域，以及思维链任务推理轨迹中的关键词。通过大量实验，GCPO在文本到图像生成和思维链推理基准上持续优于GRPO和DAPO基线，证明了其作为离散策略学习的通用且可扩展优化策略的有效性。
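
The token-level credit assignment described in the GCPO abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `token_advantages`, the use of absolute log-probability gaps as the contrast measure, and the normalization that keeps the mean token advantage equal to the sample-level advantage are all assumptions made for illustration.

```python
from typing import List


def token_advantages(sample_advantage: float,
                     logp_pos: List[float],
                     logp_neg: List[float],
                     eps: float = 1e-8) -> List[float]:
    """Spread one sample-level advantage over tokens, weighted by contrast.

    logp_pos / logp_neg are per-token log-probabilities of the sampled tokens
    under a positive and a negative prompt (hypothetical inputs).
    """
    if len(logp_pos) != len(logp_neg):
        raise ValueError("log-prob sequences must have equal length")
    # Per-token contrast: how strongly the prompt polarity changes the prediction.
    gaps = [abs(p - n) for p, n in zip(logp_pos, logp_neg)]
    mean_gap = sum(gaps) / len(gaps) if gaps else 0.0
    if mean_gap < eps:
        # No contrast anywhere: fall back to uniform broadcasting.
        return [sample_advantage] * len(gaps)
    # Weights average to 1, so the mean token advantage equals the sample advantage.
    return [sample_advantage * g / mean_gap for g in gaps]


# Only the middle token reacts to the prompt contrast, so it receives all the credit.
adv = token_advantages(2.0, [-0.1, -0.5, -2.0], [-0.1, -1.5, -2.0])
```

With uniform broadcasting every token in this example would receive an advantage of 2.0. The contrast-weighted version concentrates the same total credit on the one token whose prediction depends on the prompt.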

Harmonizing Real-Time Constraints and Long-Horizon Reasoning: An Asynchronous Agentic Framework for Dynamic Scheduling

协调实时约束与长视野推理：动态调度的异步代理框架

Authors: Shijie Cao, Yuan Yuan, Jing Liu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.29262
Pdf link: https://arxiv.org/pdf/2605.29262
Abstract The Dynamic Flexible Job Shop Scheduling Problem (DFJSP) necessitates a trade-off between instant reaction to stochastic disturbances and global optimization of production goals. Conventional priority rules are insufficiently flexible to handle complex disruptions, whereas learning-based approaches often compromise interpretability or fail to generalize across problem scales. Although Large Language Models (LLMs) offer advanced reasoning capabilities to bridge this gap, their substantial inference latency is incompatible with the millisecond-level decision cycles of industrial control systems. To resolve this conflict, we introduce RACE-Sched, an asynchronous agent-based framework that decouples policy execution from logical reasoning via a dual-stream architecture. The Reactive Stream executes low-latency symbolic heuristics to enable real-time dispatching, while the parallel Deliberative Stream leverages an LLM to synthesize, validate, and evolve these rules. Candidate rules undergo rigorous testing in a sandbox and are deployed via atomic updates, ensuring safety without blocking the control loop. Additionally, a semantic rule repository indexes validated heuristics for retrieval-based initialization which enhances transferability across problem scales. Extensive evaluations on GEN-Bench, MK-Bench, and JMS-Bench demonstrate that RACE-Sched outperforms leading Deep Reinforcement Learning and other LLM-based baselines. This approach harmonizes real-time constraints with long-horizon reasoning to achieve superior solution quality and robust adaptation to dynamic events.
中文摘要 动态灵活作业车间调度问题（DFJSP）要求在对随机扰动的即时反应与生产目标的全局优化之间做出权衡。传统的优先级规则在处理复杂中断方面灵活性不足，而基于学习的方法往往会牺牲可解释性或无法跨问题尺度推广。尽管大型语言模型（LLMs）提供了先进的推理能力来弥合这一差距，但其较高的推理延迟与工业控制系统的毫秒级决策周期不兼容。为解决这一冲突，我们引入了RACE-Sched，一个异步基于代理的框架，通过双流架构将策略执行与逻辑推理解耦。反应流执行低延迟符号启发式算法以实现实时调度，而并行的审议流则利用大型语言模型（LLM）来综合、验证和演进这些规则。候选规则在沙盒中经过严格测试，并通过原子更新部署，确保安全且不阻碍控制循环。此外，语义规则库索引经过验证的启发式，用于基于检索的初始化，提升了跨问题尺度的可迁移性。对GEN-Bench、MK-Bench和JMS-Bench的广泛评估表明，RACE-Sched优于领先的深度强化学习及其他基于LLM的基线。该方法协调了实时约束与长视野推理，实现了更优越的解质量和对动态事件的稳健适应。

Prompt-Level Reward Specifications for Open-Ended Post-Training

开放式后训练的提示级奖励规范

Authors: Zijun Weng, Xiaohui Hu, Shuangyong Song, Yongxiang Li, Kaidong Yu, Xuanjing Huang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.29275
Pdf link: https://arxiv.org/pdf/2605.29275
Abstract Open-ended post-training benefits from rewards that make prompt-specific success conditions explicit, rather than relying only on post-hoc scalar scores. In instruction following, writing, and decision-support tasks, response quality depends on local requirements, holistic preferences, and explicit constraints, but existing reward methods often leave these criteria implicit or cover only narrowly verifiable cases. We propose a prompt-level reward specification framework that separates reward specification from reward computation. Given only prompts, our framework constructs reusable task-adaptive rubrics and executable hard-constraint checkers offline, making reward criteria explicit before training and reusable across rollouts. At scoring time, artifact-anchored rubric and code scores are combined with an independent global score for residual holistic quality, yielding a normalized hybrid reward over requirement satisfaction, holistic quality, and deterministic constraints. The framework requires no human preference annotations, reference answers, or a separately trained reward model. Experiments show that the resulting reward improves offline RM-style response ranking and supports online reinforcement learning across multiple open-ended benchmarks. Ablations further show that rubrics, global scoring, and executable verification provide complementary supervision.
中文摘要 开放式后训练受益于能够明确给出提示特定成功条件的奖励，而非仅依赖事后的标量分数。在指令跟随、写作和决策支持任务中，响应质量取决于局部要求、整体偏好和显式约束，但现有奖励方法往往让这些标准保持隐含，或只覆盖可验证的狭窄情形。我们提出了一个提示级奖励规范框架，将奖励规范与奖励计算分离。仅给定提示，我们的框架即可离线构建可复用的任务自适应评分标准和可执行的硬约束检查器，使奖励标准在训练前就明确下来，并可跨rollout复用。在评分时，以这些产物为锚的评分标准得分和代码得分与一个独立的、衡量剩余整体质量的全局得分相结合，得到覆盖需求满足度、整体质量和确定性约束的归一化混合奖励。该框架不需要人工偏好标注、参考答案或单独训练的奖励模型。实验表明，所得奖励提升了离线RM式的响应排序效果，并在多个开放式基准上支持在线强化学习。消融实验进一步表明，评分标准、全局评分和可执行验证提供了互补的监督。

UniNote: A Unified Embedding Model for Multimodal Representation and Ranking

UniNote：一个用于多模态表示和排序的统一嵌入模型

Authors: Jinghan Zhao, Wenwei Jin, Anqi Li, Jintao Tong, Luya Mo, Jiawei Li, Bin Li, Yao Hu
Subjects: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.29287
Pdf link: https://arxiv.org/pdf/2605.29287
Abstract Item-to-Item (I2I) retrieval is a fundamental part of modern content platforms, supporting critical industrial workflows from recommendation engines to content auditing. While multimodal embedding methods have advanced general retrieval, they often falter in I2I scenarios due to the challenges of balancing global content representation with fine-grained local retrieval, the systemic inefficiency of decoupled embedding-and-ranking pipelines, and the inherent trade-offs between model precision and serving latency. To solve these issues, we propose \textbf{UniNote}, a unified embedding model designed for industrial I2I retrieval. Tailored retrieval strategies are introduced to support representation learning over complex, multimodal content at varying granularities. To operationalize these strategies, UniNote employs a two-stage training paradigm: the first stage leverages contrastive SFT to establish robust base embeddings, while the second stage refines ranking quality through a reinforcement learning (RL) process that aligns the model with content relevance. Our results show that UniNote achieves SOTA performance across diverse I2I tasks. Deployed at Xiaohongshu and integrated with Matryoshka Representation Learning (MRL), UniNote achieved significant improvements in retrieval quality and cost efficiency in large-scale applications.
中文摘要 物品到物品（I2I）检索是现代内容平台的基本组成部分，支撑着从推荐引擎到内容审核的关键工业工作流程。虽然多模态嵌入方法推动了通用检索的发展，但在I2I场景中常常表现不佳，原因在于难以平衡全局内容表示与细粒度局部检索、解耦的嵌入与排序流水线存在系统性低效，以及模型精度与服务延迟之间的固有权衡。为解决这些问题，我们提出了 \textbf{UniNote}，一种专为工业I2I检索设计的统一嵌入模型。我们引入了定制化的检索策略，以支持对不同粒度的复杂多模态内容进行表示学习。为落实这些策略，UniNote采用两阶段训练范式：第一阶段利用对比SFT建立稳健的基础嵌入，第二阶段通过强化学习（RL）过程优化排序质量，使模型与内容相关性对齐。我们的结果表明，UniNote在多种I2I任务中达到了SOTA性能。UniNote已部署于小红书并与套娃表示学习（MRL）集成，在大规模应用中显著提升了检索质量和成本效率。

LLM-ALSO: LLM-Driven Adaptive Learning-Signal Optimization for Multi-Agent Reinforcement Learning

LLM-ALSO：多智能体强化学习的LLM驱动自适应学习信号优化

Authors: Xiaoguang Wu, Zhi Zheng, Hui Xiong
Subjects: Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2605.29293
Pdf link: https://arxiv.org/pdf/2605.29293
Abstract Effective training-time guidance is central to multi-agent reinforcement learning (MARL), yet remains difficult in sparse-reward settings where weak supervision limits coordination and policy improvement, and existing methods often require substantial domain expertise or manual design effort. Large language models (LLMs) provide a promising alternative for flexible learning-signal design, yet existing LLM-based methods remain largely single-agent-oriented, one-shot, or weakly validated for the evolving training dynamics of cooperative MARL. To address these limitations, we propose LLM-ALSO, an iterative LLM-driven adaptive learning-signal optimization framework for MARL. Rather than directly deploying LLM-generated rewards, LLM-ALSO decomposes adaptation into iterative diagnosis, proposal, and validation: a Critic LLM diagnoses stage-specific learning and coordination failures from sparse-return metrics and compact behavior evidence, a Generator LLM proposes candidate reward-shaping configurations conditioned on the diagnosis, and branch-validation feedback refines candidates before they affect the main training trajectory. Through short-horizon validation and stage-aware adaptation, LLM-ALSO promotes only validated updates into training, reducing the risk of unreliable LLM-generated modifications. Experiments on sparse-reward cooperative MARL tasks show that LLM-ALSO improves sparse-evaluation performance and learning efficiency.
中文摘要 有效的训练期指导是多智能体强化学习（MARL）的核心，但在稀疏奖励环境中仍然困难：薄弱的监督限制了协调和策略改进，而现有方法通常需要大量领域专业知识或手工设计工作。大型语言模型（LLM）为灵活的学习信号设计提供了有前景的替代方案，但现有基于LLM的方法大多面向单智能体、一次性生成，或未针对协作式MARL不断演变的训练动态进行充分验证。为解决这些局限，我们提出了LLM-ALSO，一种面向MARL的迭代式LLM驱动自适应学习信号优化框架。LLM-ALSO不直接部署LLM生成的奖励，而是将适应过程分解为迭代的诊断、提议和验证：Critic LLM根据稀疏回报指标和紧凑的行为证据诊断特定阶段的学习与协调失败，Generator LLM基于诊断提出候选的奖励塑造配置，分支验证反馈则在候选配置影响主训练轨迹之前对其进行细化。通过短视野验证和阶段感知的适应，LLM-ALSO只将经过验证的更新引入训练，降低了不可靠的LLM生成修改带来的风险。在稀疏奖励协作式MARL任务上的实验表明，LLM-ALSO提升了稀疏评估性能和学习效率。

Entropy-KL Divergence-based Token Masking: A Novel Approach for Selective Fine-tuning of Large Language Models

基于熵-KL散度的token掩码：一种用于大型语言模型选择性微调的新方法

Authors: Qi Liu, Mingdi Sun, Yongyi He, Zhi Zheng, Tong Xu, Yi Zheng, Zhefeng Wang, Enhong Chen
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.29303
Pdf link: https://arxiv.org/pdf/2605.29303
Abstract Supervised fine-tuning (SFT) followed by reinforcement learning (RL) has become a standard post-training paradigm for large language models. This paradigm provides a cold-start for RL exploration, avoiding the inefficiency of pure RL where on-policy sampling yields insufficient positive samples. However, in practice, existing approaches often use a small amount of data for SFT initialization compared to the RL phase, which can cause the model to fit the limited samples and shift away from its pre-trained distribution. This distribution shift impedes the model's ability to effectively explore during subsequent RL training. To address this challenge, we propose that in low-data regimes, SFT should prioritize activating task-relevant capabilities rather than memorizing specific content. Along this line, we propose EKSFT (Entropy-KL Selective Fine-Tuning), which selectively masks tokens that exhibit either high entropy or high KL divergence from a reference model. By excluding these high-uncertainty, distribution-shifting tokens from imitation, EKSFT injects task-specific knowledge while preserving the integrity of the model's pre-trained distribution. Empirical evaluations on mathematical reasoning benchmarks demonstrate that EKSFT consistently outperforms standard SFT. Further RL fine-tuning from the EKSFT model yields consistently better post-RL performance, indicating improved exploration for the RL stage. Our codes and datasets are available at this https URL.
中文摘要 监督微调（SFT）后接强化学习（RL）已成为大型语言模型的标准后训练范式。该范式为强化学习探索提供了冷启动，避免了纯强化学习中同策略采样产生的正样本不足所带来的低效。然而在实践中，与强化学习阶段相比，现有方法通常只使用少量数据进行SFT初始化，这可能导致模型拟合有限样本，并偏离其预训练分布。这种分布偏移会妨碍模型在后续强化学习训练中有效探索。为应对这一挑战，我们主张在低数据场景下，SFT应优先激活与任务相关的能力，而非记忆具体内容。沿此思路，我们提出了EKSFT（熵-KL选择性微调），该方法选择性地掩蔽那些熵较高或相对参考模型KL散度较高的token。通过将这些高不确定性、引起分布偏移的token排除在模仿之外，EKSFT在注入任务特定知识的同时保持了模型预训练分布的完整性。在数学推理基准上的实证评估表明，EKSFT持续优于标准SFT。从EKSFT模型出发进一步进行强化学习微调，可获得持续更好的RL后性能，表明强化学习阶段的探索得到了改善。我们的代码和数据集可在该 https URL 访问。
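
A minimal sketch of the selective masking idea in the EKSFT abstract above, under stated assumptions. The abstract does not say which distribution's entropy is measured, which direction the KL divergence takes, or how thresholds are chosen. Here the entropy is that of the current model's next-token distribution, the divergence is KL(model || reference), and the thresholds `max_entropy` and `max_kl` are hypothetical fixed values.

```python
import math
from typing import List


def entropy(p: List[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats; q is floored at eps to avoid division by zero."""
    return sum(x * math.log(x / max(y, eps)) for x, y in zip(p, q) if x > 0)


def keep_mask(model_probs, ref_probs, max_entropy: float, max_kl: float) -> List[bool]:
    """True where a token stays in the SFT loss; False where it is masked out."""
    return [entropy(p) <= max_entropy and kl(p, r) <= max_kl
            for p, r in zip(model_probs, ref_probs)]


def masked_nll(model_probs, targets: List[int], mask: List[bool]) -> float:
    """Mean negative log-likelihood over the kept positions only."""
    kept = [-math.log(max(p[t], 1e-12))
            for p, t, m in zip(model_probs, targets, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0


# Toy two-word vocabulary, three positions (all numbers are illustrative).
model_probs = [[0.9, 0.1], [0.5, 0.5], [0.9, 0.1]]
ref_probs = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
mask = keep_mask(model_probs, ref_probs, max_entropy=0.5, max_kl=0.5)
loss = masked_nll(model_probs, targets=[0, 0, 0], mask=mask)
```

In this toy case the second position is dropped for high entropy and the third for high divergence from the reference, so only the first position contributes to the loss.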

GrepSeek: Training Search Agents for Direct Corpus Interaction

GrepSeek：训练用于直接语料库交互的搜索代理

Authors: Alireza Salemi, Chang Zeng, Atharva Nijasure, Jui-Hui Chung, Razieh Rahimi, Fernando Diaz, Hamed Zamani
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.29307
Pdf link: https://arxiv.org/pdf/2605.29307
Abstract Large Language Model (LLM) search agents have shown strong promise for knowledge-intensive language tasks through multiple rounds of reasoning and information retrieval. Most existing systems access information using a retriever that takes a keyword or natural language query and returns a ranked list of documents using an index of pre-computed document representations. In this work, we explore a complementary perspective in which the search agent treats the corpus itself as the search environment and finds evidence by issuing executable shell commands. We introduce GrepSeek, an optimized direct corpus interaction (DCI) search agent that trains a compact search agent to find, filter, and compose evidence from large text corpora. To address the instability of learning behavior directly with reinforcement learning on large corpora, we propose a two-stage training pipeline. First, we construct a cold-start dataset using an answer-aware Tutor and answer-blind Planner to generate verified, causally grounded search trajectories. Second, we refine the initialized policy with Group Relative Policy Optimization (GRPO), allowing the agent to improve its task-oriented search behavior through direct interaction with the corpus. To make DCI practical at scale, we further use a semantics-preserving sharded-parallel execution engine that accelerates shell-based retrieval by up to $7.6\times$ while preserving byte-exact equivalence with sequential execution of the shell command. Experiments across seven open-domain question answering benchmarks show that GrepSeek achieves the strongest overall token-level $F_1$ and Exact Match. Our analysis also highlights the limitations of purely lexical interaction on queries with substantial surface-form variation, suggesting DCI as a practical and competitive method for search agents that can complement existing retrieval paradigms in the real world.
中文摘要 大型语言模型（LLM）搜索代理通过多轮推理和信息检索，在知识密集型语言任务中展现出强大的潜力。大多数现有系统通过检索器访问信息：检索器接收关键词或自然语言查询，并利用预先计算的文档表示索引返回排序后的文档列表。在本研究中，我们探讨了一种互补视角：搜索代理将语料库本身视为搜索环境，并通过发出可执行的shell命令来寻找证据。我们提出GrepSeek，一种经过优化的直接语料库交互（DCI）搜索代理，它训练一个紧凑的搜索代理从大型文本语料库中查找、过滤和组合证据。为了解决在大型语料库上直接用强化学习学习行为时的不稳定性，我们提出了一个两阶段训练流程。首先，我们利用知晓答案的Tutor和不知晓答案的Planner构建冷启动数据集，生成经过验证且有因果依据的搜索轨迹。其次，我们用组相对策略优化（Group Relative Policy Optimization，GRPO）对初始化后的策略进行精调，使代理通过与语料库的直接交互改进其面向任务的搜索行为。为了使DCI在大规模下实用，我们进一步使用了一个保持语义的分片并行执行引擎，该引擎在与shell命令的顺序执行保持字节级精确等价的同时，将基于shell的检索加速最高达$7.6\times$。在七个开放域问答基准上的实验显示，GrepSeek取得了最强的整体token级$F_1$和精确匹配（Exact Match）成绩。我们的分析还指出了纯词汇交互在表面形式变化较大的查询上的局限性，并表明DCI是一种实用且有竞争力的搜索代理方法，能够在现实中补充现有的检索范式。
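
The sharded-parallel execution idea in the GrepSeek abstract above can be sketched for the simplest case, a line-local substring filter. This is an illustration rather than the paper's engine: `grep_lines` and `sharded_grep` are hypothetical names, the corpus is an in-memory list of lines, and only filters that treat each line independently are covered. Commands whose output depends on global order or counts, such as `sort` or `head`, would need extra handling to stay equivalent.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def grep_lines(lines: List[str], needle: str) -> List[str]:
    """Sequential baseline: keep lines containing the substring, in order."""
    return [ln for ln in lines if needle in ln]


def sharded_grep(lines: List[str], needle: str, num_shards: int = 4) -> List[str]:
    """Filter contiguous shards in parallel, then concatenate in shard order."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    size = max(1, -(-len(lines) // num_shards))  # ceiling division
    shards = [lines[i:i + size] for i in range(0, len(lines), size)]
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        # Executor.map yields results in input order, which preserves
        # the original line order after concatenation.
        parts = list(pool.map(lambda shard: grep_lines(shard, needle), shards))
    return [ln for part in parts for ln in part]


corpus = [f"doc {i}: " + ("alpha" if i % 3 == 0 else "beta") for i in range(20)]
hits = sharded_grep(corpus, "alpha", num_shards=3)
```

Because the shards are contiguous and the results are joined in shard order, the output matches the sequential filter exactly. Threads in CPython would not speed up this pure-Python filter much. The sketch shows only the equivalence argument, not the reported speedup.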

Rubric-Guided Process Reward for Stepwise Model Routing

分步模型路由的评分标准引导过程奖励

Authors: Shenghao Ye, Yu Guo, Zhengheng Li, Shuangwu Chen, Jian Yang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.29310
Pdf link: https://arxiv.org/pdf/2605.29310
Abstract Stepwise model routing improves the efficiency of Large Reasoning Models (LRMs) by assigning each reasoning step to a suitable model. Recent methods formulate routing as a sequential decision process and train the router with reinforcement learning. However, although they model routing as a process, they still supervise the router with outcome rewards. Such rewards only reflect final answer correctness and fail to evaluate intermediate routing decisions, which can weaken performance and generalization. To address this gap, we propose RoRo, a rubric-guided process reward framework for stepwise model routing. RoRo first collects diverse routing trajectories and constructs preference pairs based on outcome, cost, and process quality. It then trains a Rubricor to generate a query-specific evaluation rubric and a Judge to score routing trajectories under this rubric through alternating optimization. The resulting process rewards are combined with outcome rewards to optimize the routing policy via GRPO. Experiments on five reasoning benchmarks under both same-family and cross-family settings show that RoRo consistently outperforms strong baselines and achieves better accuracy and cost trade-offs.
中文摘要 逐步模型路由通过将每个推理步骤分配给合适的模型，提高了大型推理模型（LRM）的效率。近期方法将路由建模为序列决策过程，并通过强化学习训练路由器。然而，尽管这些方法将路由建模为一个过程，它们仍然用结果奖励来监督路由器。此类奖励仅反映最终答案的正确性，无法评估中间的路由决策，这可能削弱性能和泛化能力。为弥补这一空白，我们提出了RoRo，一种用于逐步模型路由的评分标准引导的过程奖励框架。RoRo首先收集多样化的路由轨迹，并基于结果、成本和过程质量构建偏好对。然后，它通过交替优化训练一个Rubricor来生成查询专属的评估评分标准，并训练一个Judge在该评分标准下为路由轨迹打分。由此产生的过程奖励与结果奖励结合，通过GRPO优化路由策略。在同系列和跨系列模型设置下，五个推理基准上的实验显示，RoRo始终优于强基线，并实现了更好的准确率与成本权衡。
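
As a rough illustration of combining process and outcome rewards before a GRPO-style update, as described in the RoRo abstract above, the sketch below adds a weighted process score to each outcome reward and then standardizes rewards within one group of sampled trajectories. The additive combination and the `process_weight` value are assumptions, since the abstract does not specify them. The within-group mean and standard deviation normalization follows the commonly described GRPO advantage.

```python
import statistics
from typing import List


def combined_rewards(outcome: List[float], process: List[float],
                     process_weight: float = 0.5) -> List[float]:
    """Outcome reward plus a weighted process reward (weight is hypothetical)."""
    return [o + process_weight * p for o, p in zip(outcome, process)]


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of trajectories for the same query."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four routing trajectories for one query: two correct, two incorrect.
outcome = [1.0, 1.0, 0.0, 0.0]
process = [0.8, 0.2, 0.6, 0.0]  # illustrative judge scores in [0, 1]
adv = group_relative_advantages(combined_rewards(outcome, process))
```

With outcome rewards alone, the two correct trajectories would receive identical advantages. The process term separates them, which is the kind of intermediate signal the abstract argues outcome-only supervision lacks.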

STAMP: Training Explicit Memory for Mobile GUI Agents in Controllable and Scalable Virtual Environments

STAMP：在可控且可扩展的虚拟环境中为移动GUI代理训练显式记忆

Authors: Junyang Wang, Haiyang Xu, Xi Zhang, Zhaoqing Zhu, Ming Yan, Jieping Ye, Jitao Sang
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.29324
Pdf link: https://arxiv.org/pdf/2605.29324
Abstract Mobile GUI agents excel at immediate reactive control but frequently fail in realistic, long-horizon tasks that require memory. This failure stems from a fundamental conflict between limited context windows and token-heavy screenshots. To save the limited context, agents must progressively discard older visual history, permanently losing crucial transient information. Furthermore, existing action-centric datasets fail to teach agents what or when to explicitly memorize, and augmenting static real-world data is prohibitively expensive and lacks interactive verification. To resolve this, we present STAMP, a framework that trains explicit memory in mobile agents through controllable virtual environments, where deterministic memory variables are programmatically injected into synthesized tasks to control what must be memorized, when it should be encoded, and when it must later be retrieved, thereby producing verifiable supervised data at scale and enabling online reinforcement learning through environment-driven reward feedback. Evaluated on our newly introduced Memory-World benchmark, the resulting Stamp-GUI agent achieves state-of-the-art performance among GUI-specialized models and sets a new high watermark on our Memory-World benchmark, demonstrating exceptional memory accuracy and task resilience while maintaining strong general mobile navigation capabilities.
中文摘要 移动GUI代理擅长即时的反应式控制，但在需要记忆的真实长程任务中经常失败。这种失败源于有限的上下文窗口与占用大量token的截图之间的根本冲突。为了节省有限的上下文，代理必须逐步丢弃较早的视觉历史，从而永久丢失关键的瞬时信息。此外，现有的以动作为中心的数据集无法教会代理应当显式记忆什么以及何时记忆，而增强静态的真实世界数据成本过高且缺乏交互式验证。为解决这一问题，我们提出了STAMP框架，它通过可控的虚拟环境训练移动代理的显式记忆：确定性的记忆变量以编程方式注入合成任务，用以控制必须记忆什么、何时编码以及之后何时检索，从而大规模生成可验证的监督数据，并通过环境驱动的奖励反馈实现在线强化学习。在我们新提出的Memory-World基准上的评估显示，所得的Stamp-GUI代理在GUI专用模型中达到了最先进的性能，并在Memory-World基准上创下新高，展现出出色的记忆准确性和任务韧性，同时保持了强大的通用移动导航能力。

GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models

GDSD：强化学习作为扩散语言模型的引导去噪自蒸馏

Authors: Xiaohang Tang, Keyue Jiang, Che Liu, Qifang Zhao, Xiaoxiao Xu, Sangwoong Yoon, Ilija Bogunovic
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.29398
Pdf link: https://arxiv.org/pdf/2605.29398
Abstract Reinforcement learning (RL) can be used to improve the policy (denoiser) of diffusion large language models (dLLMs), but it is hindered by the intractability of the policy likelihood. A dominant and efficient family of methods replaces the likelihood in standard RL with its evidence lower bound (ELBO), estimated from randomly masked sequences. Despite being well aligned with pre-training, these approaches introduce bias through a training-inference mismatch (TIM) that arises from using the ELBO as a likelihood surrogate, which can degrade performance. In this work, we propose Guided Denoiser Self-Distillation (GDSD) to directly distill the denoiser of dLLMs from an advantage-guided self-teacher, derived from the closed-form optimum of reverse-KL regularized RL. GDSD matches the dLLM's denoiser logits to the teacher's via a normalization-free objective, which reduces RL to likelihood-free self-distillation and thus bypasses the TIM biases. Recent ELBO-based methods emerge as instances of applying different distillation divergences, but with diagnosable pathologies that GDSD avoids. On planning, math, and coding benchmarks with LLaDA-8B and Dream-7B, GDSD consistently outperforms prior state-of-the-art ELBO-based methods with more stable training reward dynamics, achieving test-accuracy improvements of up to $+19.6\%$. These results suggest that direct denoiser self-distillation, without relying on an ELBO likelihood surrogate, can provide a more stable and effective RL procedure for dLLMs. Code is available at this https URL.
中文摘要 强化学习（RL）可用于改进扩散大型语言模型（dLLM）的策略（去噪器），但受制于策略似然难以计算。一类主流且高效的方法用证据下界（ELBO）替代标准强化学习中的似然，ELBO由随机掩码的序列估计得到。尽管这些方法与预训练高度一致，但将ELBO用作似然替代会因训练-推理不匹配（TIM）而引入偏差，从而可能降低性能。本研究提出引导去噪器自蒸馏（GDSD），直接从一个优势引导的自教师蒸馏dLLM的去噪器，该自教师由反向KL正则化RL的闭式最优解导出。GDSD通过一个无需归一化的目标将dLLM去噪器的logits与教师的logits对齐，从而将强化学习简化为无似然的自蒸馏，并绕过TIM偏差。近期基于ELBO的方法可以看作采用不同蒸馏散度的特例，但它们存在可诊断的缺陷，而GDSD避免了这些缺陷。在使用LLaDA-8B和Dream-7B的规划、数学和编码基准上，GDSD持续优于此前最先进的基于ELBO的方法，训练奖励动态更稳定，测试准确率提升高达$+19.6\%$。这些结果表明，不依赖ELBO似然替代的直接去噪器自蒸馏可以为dLLM提供更稳定、更有效的强化学习流程。代码可在此 https URL 访问。

Rethinking Post-Training Recipes for Multimodal Time-Series Forecasting

重新思考多模态时间序列预测的后训练配方

Authors: Haoxin Liu, Yichen Zhou, Rajat Sen, B. Aditya Prakash, Abhimanyu Das
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.29401
Pdf link: https://arxiv.org/pdf/2605.29401
Abstract Time-Series Foundation Models (TSFMs) excel at zero-shot unimodal forecasting using numerical data, but unlike LLMs they cannot consume multimodal, non-numerical context that often shapes real-world trajectories. In this work, we bridge this gap and argue for a multimodal time-series forecasting approach that post-trains LLMs to act as context-guided revisors over strong numerical TSFM priors. We introduce PostTime, a post-training recipe combining Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR), along with a methodology to generate automated reasoning traces for forecast revisions. PostTime teaches an LLM to generate context-conditioned forecast interventions -- decisions to revise, preserve, or ignore the TSFM prior based on the multimodal context. We evaluate this approach on the TimesX multimodal forecasting benchmark using a Gemma-3-4B LLM and TimesFM-2.5 TSFM, and show that it significantly outperforms standalone TSFMs, LLM-only baselines, and existing multimodal forecasting approaches.
中文摘要 时间序列基础模型（TSFM）擅长利用数值数据进行零样本单模态预测，但与大型语言模型不同，它们无法利用往往影响现实世界轨迹的多模态、非数值上下文。在本研究中，我们弥合了这一差距，主张采用一种多模态时间序列预测方法：对LLM进行后训练，使其在强数值TSFM先验之上充当上下文引导的修订者。我们提出PostTime，这是一种结合监督微调（SFT）和可验证奖励强化学习（RLVR）的后训练配方，并配有一套为预测修订自动生成推理轨迹的方法。PostTime教会LLM生成以上下文为条件的预测干预，即根据多模态上下文决定修订、保留还是忽略TSFM先验。我们使用Gemma-3-4B LLM和TimesFM-2.5 TSFM在TimesX多模态预测基准上评估了该方法，结果显示其显著优于独立的TSFM、仅用LLM的基线以及现有的多模态预测方法。

ReasonLight: A Multimodal Foundation Model-Enhanced Reinforcement Learning Framework for Zero-Shot Traffic Signal Control

ReasonLight：基于多模态基础模型增强的零样本交通信号控制强化学习框架

Authors: Aoyu Pang, Maonan Wang, Yuejiao Xie, Chung Shue Chen, Zhiwei Yang, Man-On Pun
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.29425
Pdf link: https://arxiv.org/pdf/2605.29425
Abstract Reinforcement learning (RL) has shown promise in traffic signal control (TSC). However, its reliance on predefined states limits responsiveness to observable open-world events that are absent from training data. IoT-enabled intersections provide heterogeneous observations from roadside sensors and cameras, creating opportunities to improve RL adaptability to such events. To this end, we propose ReasonLight, a multimodal foundation model-enhanced RL framework for zero-shot TSC. ReasonLight integrates three sources of information: structured traffic measurements, multi-view camera observations, and candidate phase decisions from a pre-trained RL controller. Given an RL-proposed phase, ReasonLight extracts visual semantics from multi-view images and aligns them with compact sensor-derived scene descriptions. This alignment enables a semantic-guided refinement module to either preserve or adjust the proposed action according to traffic rules and event semantics. To ensure operational reliability, refined actions are constrained by the set of available phases. Any invalid decision is rejected, and the system falls back to the original RL action. We evaluate ReasonLight on two types of rare events not seen during RL training: emergency vehicle priority and temporary traffic regulation. Experimental results show that ReasonLight achieves zero-shot adaptation without retraining. It reduces emergency vehicle waiting time by up to 88.7% compared with the RL-only backbone while preserving comparable routine traffic performance.
中文摘要 强化学习（RL）在交通信号控制（TSC）中展现出了潜力。然而，它对预定义状态的依赖限制了其对训练数据中未出现的可观察开放世界事件的响应能力。支持物联网的路口提供了来自路侧传感器和摄像头的异构观测，为提升强化学习对此类事件的适应性创造了机会。为此，我们提出了ReasonLight，一个由多模态基础模型增强的零样本TSC强化学习框架。ReasonLight整合了三种信息来源：结构化交通测量、多视角摄像头观测，以及来自预训练强化学习控制器的候选相位决策。给定强化学习提出的相位，ReasonLight从多视角图像中提取视觉语义，并将其与紧凑的、由传感器得出的场景描述对齐。这种对齐使语义引导的细化模块能够根据交通规则和事件语义保留或调整所提出的动作。为确保运行可靠性，细化后的动作被限制在可用相位集合之内。任何无效决策都会被拒绝，系统回退到原始的强化学习动作。我们在两类强化学习训练中未见过的罕见事件上评估了ReasonLight：紧急车辆优先和临时交通管制。实验结果表明，ReasonLight无需重新训练即可实现零样本适应。与仅使用RL的骨干相比，它将紧急车辆等待时间最多减少了88.7%，同时保持了相当的常规交通性能。
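
The safety fallback described in the ReasonLight abstract above (refined actions are constrained to the available phases, and an invalid decision reverts to the RL action) can be expressed as a small guard function. The name `select_phase`, the integer phase encoding, and the use of `None` for an unparseable refinement are assumptions for illustration.

```python
from typing import Optional, Sequence


def select_phase(rl_phase: int,
                 refined_phase: Optional[int],
                 available_phases: Sequence[int]) -> int:
    """Return the refined phase if it is valid, otherwise the original RL phase.

    refined_phase is None when the refinement module produced no usable output
    (a hypothetical convention for this sketch).
    """
    if refined_phase is not None and refined_phase in available_phases:
        return refined_phase
    # Invalid or missing refinement: reject it and keep the RL controller's choice.
    return rl_phase


phases = [0, 1, 2, 3]
accepted = select_phase(rl_phase=2, refined_phase=0, available_phases=phases)
rejected = select_phase(rl_phase=2, refined_phase=7, available_phases=phases)
```

Keeping the guard outside the foundation model means a malformed or hallucinated output cannot produce a phase that the signal controller does not support.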

FinGuard: Detecting Financial Regulatory Non-Compliance in LLM Interactions

FinGuard：检测大型语言模型互动中的金融监管不合规

Authors: Huaixia Dou, Jie Zhu, Minghao Wu, Shuo Jiang, Junhui Li, Lifan Guo, Feng Chen, Chi Zhang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.29427
Pdf link: https://arxiv.org/pdf/2605.29427
Abstract As large language models (LLMs) are increasingly deployed in financial services, a single non-compliant interaction can expose institutions to regulatory penalties and direct consumer harm. Existing guard models are built around general harm taxonomies and overlook violations grounded in specific financial regulations. We address this gap with a regulation-driven pipeline that operates directly on regulatory documents, inducing a financial compliance risk taxonomy and synthesizing grounded training data without any predefined violation categories. Instantiating the pipeline on Chinese financial regulations, we release \textbf{FinGuard-Bench}, to our knowledge the first benchmark for financial regulatory compliance detection, with expert-annotated labels at both the query and response levels. We further train \textbf{FinGuard}, a financial compliance detection model built on Qwen3-8B and trained on the regulation-grounded data via supervised fine-tuning and self-play reinforcement learning. On FinGuard-Bench, FinGuard substantially outperforms all baselines, including dedicated guard models and much larger general-purpose LLMs such as Qwen3.5-397B-A17B and GPT-5.1. Furthermore, FinGuard also preserves general safety capabilities and adapts to unseen institution-specific policies using policy documents alone. We will publicly release the code, prompts, and resources used in this work on GitHub.
中文摘要 随着大型语言模型（LLM）在金融服务中的部署日益增多，单次不合规的交互就可能使机构面临监管处罚，并直接损害消费者。现有的防护模型围绕通用危害分类体系构建，忽视了以具体金融法规为依据的违规行为。我们通过一个由法规驱动的流程来弥补这一空白，该流程直接作用于监管文件，从中归纳出金融合规风险分类体系并合成有依据的训练数据，无需任何预定义的违规类别。我们在中国金融法规上实例化该流程，发布了 \textbf{FinGuard-Bench}，据我们所知这是首个金融监管合规检测基准，在查询和响应两个层面均配有专家标注的标签。我们还训练了 \textbf{FinGuard}，这是一个基于Qwen3-8B构建的金融合规检测模型，通过监督微调和自博弈强化学习在以法规为依据的数据上训练。在FinGuard-Bench上，FinGuard大幅优于所有基线，包括专用防护模型以及规模大得多的通用大型语言模型，如Qwen3.5-397B-A17B和GPT-5.1。此外，FinGuard还保留了通用安全能力，并且仅凭政策文件即可适应未见过的机构特定政策。我们将在GitHub上公开发布本工作中使用的代码、提示和资源。

The Curse of Helpfulness: Inverse Scaling Law in Robustness to Distractor Instructions via DistractionIF

乐于助人的诅咒：通过DistractionIF揭示对干扰指令鲁棒性的逆缩放定律

Authors: Zeli Su, Zhankai Xu, Tianlei Chen, Longfei Zheng, Xiaolu Zhang, Jun Zhou, Wentao Zhang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.29491
Pdf link: https://arxiv.org/pdf/2605.29491
Abstract Large Language Models (LLMs) are increasingly deployed in agentic and retrieval-augmented generation (RAG) systems, where they must execute user-specified tasks over externally provided reference text. In practice, such context is often unstructured and contaminated with benign but instruction-like semantic noise, such as editorial comments and system traces, which should be treated strictly as data. We introduce DistractionIF, a benchmark designed to evaluate robustness against such distractor instructions in reference text. Across a broad range of models, we observe a consistent inverse scaling phenomenon: larger models are often less robust, with performance dropping by up to 30 points as scale increases. Mechanistically, our perplexity analysis reveals that scaling erodes the probabilistic boundary between robust and distracted behaviors, making models increasingly prone to over-interpreting noise as instructions. To address this, we demonstrate that reinforcement learning, specifically Group Relative Policy Optimization (GRPO), can restore this boundary, improving robustness by up to 15.5% without compromising general instruction-following capability. Our findings highlight a critical instruction-following robustness gap in reference-grounded tasks and establish reinforcement learning as a promising path for enforcing strict data-instruction separation at scale.
中文摘要 大型语言模型（LLM）越来越多地部署在代理式系统和检索增强生成（RAG）系统中，它们必须基于外部提供的参考文本执行用户指定的任务。在实践中，这类上下文往往是非结构化的，并混有无害但类似指令的语义噪声，例如编辑评论和系统痕迹，这些内容应严格作为数据处理。我们提出了DistractionIF，这是一个用于评估模型对参考文本中此类干扰指令的鲁棒性的基准。在广泛的模型上，我们观察到一致的逆缩放现象：更大的模型往往更不鲁棒，随着规模增加，性能最多下降30分。从机制上看，我们的困惑度分析表明，规模扩大会侵蚀鲁棒行为与受干扰行为之间的概率边界，使模型越来越容易将噪声过度解读为指令。为此，我们证明强化学习，特别是组相对策略优化（GRPO），可以恢复这一边界，在不损害通用指令跟随能力的情况下将鲁棒性最多提升15.5%。我们的发现凸显了基于参考文本的任务中关键的指令跟随鲁棒性差距，并表明强化学习是在大规模下强制实现严格的数据与指令分离的一条有前景的路径。

On Asymmetric Optimization of Reasoning and Perception in Vision-Language Model Post-Training

关于视觉语言模型训练后推理与感知的非对称优化

Authors: Xueqing Wu, Yu-Chi Lin, Kai-Wei Chang, Nanyun Peng
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.29496
Pdf link: https://arxiv.org/pdf/2605.29496
Abstract Post-training has greatly improved reasoning in frontier vision-language models, yet its gains for perception remain comparatively limited, creating a bottleneck for end-to-end visual reasoning. To investigate this gap, we introduce a controlled diagnostic framework with two synthetic tasks that disentangle perception from reasoning. Our analysis reveals a consistent perception-reasoning asymmetry: post-training improves reasoning more substantially than perception, though the underlying mechanism differs by training paradigm. For supervised fine-tuning (SFT), this asymmetry stems from token imbalance in chain-of-thought supervision, where perception occupies fewer tokens and thus receives a weaker training signal. Dynamically reweighting the loss mitigates this imbalance and boosts end-to-end performance by up to 18.2. For reinforcement learning (RL), the asymmetry instead arises from reward coupling: outcome rewards correlate more strongly with reasoning than with perception, weakening the signal for perception learning. Adding a perception-aware reward alleviates the imbalance and improves end-to-end accuracy by up to 6.0; even without ground-truth perception rewards, a reliable surrogate reward provides a useful signal, yielding gains of 3.2 points. Together, our results comprehensively diagnose asymmetric optimization and suggest concrete interventions to balance perception and reasoning.
中文摘要 后训练极大地提升了前沿视觉语言模型的推理能力，但其在感知方面的收益仍相对有限，这成为端到端视觉推理的瓶颈。为探究这一差距，我们引入了一个受控的诊断框架，其中包含两个将感知与推理解耦的合成任务。我们的分析揭示了一致的感知-推理不对称性：后训练对推理的提升比对感知的提升更显著，不过其底层机制因训练范式而异。对于监督微调（SFT），这种不对称源于思维链监督中的token不平衡：感知占用的token较少，因此获得的训练信号较弱。对损失进行动态重加权可以缓解这种不平衡，并使端到端性能最多提升18.2。对于强化学习（RL），这种不对称则源于奖励耦合：结果奖励与推理的相关性强于与感知的相关性，从而削弱了感知学习的信号。加入感知相关的奖励可以缓解这种不平衡，并使端到端准确率最多提升6.0；即使没有真值感知奖励，可靠的替代奖励也能提供有用信号，带来3.2分的收益。总体而言，我们的结果全面诊断了非对称优化，并提出了平衡感知与推理的具体干预措施。

Source-Grounded Semantic Reinforcement Learning for Low-Resource Target-Language Generation

源基语义强化学习用于低资源目标语言生成

Authors: Zeli Su, Ziyin Zhang, Zewei Pan, Zhou Liu, Dingcheng Huang, Dehan Li, Zhankai Xu, Longfei Zheng, Xiaolu Zhang, Jun Zhou, Wentao Zhang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.29502
Pdf link: https://arxiv.org/pdf/2605.29502
Abstract Low-resource target-language generation is often limited by scarce parallel data, while high-resource source-language monolingual data is abundant but difficult to use with standard supervised fine-tuning. We propose Source-Grounded Semantic Reinforcement Learning (SG-SRL), a resource-utilization framework that converts source-language monolingual data into cross-lingual semantic supervision for target-language generation. SG-SRL performs reference-free reinforcement learning (RL) on source-language data using a cross-lingual semantic reward model, instantiated by a cross-lingual reranker that scores the semantic relevance between the source input and the target-language generation. While this induces severe verbosity-based reward hacking, a lightweight recovery stage using a small parallel corpus restores fluency, conciseness, and task format while preserving the semantic gains. Experiments on Chinese-to-Thai generation show that SG-SRL improves semantic grounding and factual coverage over cold-start SFT. Additional analyses on long-form transfer and Tibetan embedding-based rewards clarify the generalization behavior of SG-SRL and show that an encoder-based semantic reward can substitute for an LLM-based reranker in a realistic low-resource language setting.
中文摘要 低资源的目标语言生成通常受限于稀缺的并行数据，而高资源的源语言单语数据则丰富，但通过标准监督微调难以使用。我们提出了源基语义强化学习（SG-SRL），这是一种资源利用框架，将源语言单语数据转换为跨语言语义监督，用于目标语言生成。SG-SRL利用跨语言语义奖励模型对源语言数据进行无引用强化学习（RL），该模型由跨语言重新排序器实现，该算法对源输入与目标语言生成之间的语义相关性进行评分。虽然这会引发严重的冗长奖励黑客攻击，但使用小型平行语料库的轻量级恢复阶段，在保留语义收益的同时，恢复流畅性、简洁性和任务格式。中文到泰国生成的实验表明，SG-SRL相比冷启动SFT提升了语义基础和事实覆盖。对长形式转移和藏式嵌入奖励的进一步分析澄清了SG-SRL的泛化行为，并表明基于编码器的语义奖励可以在现实的低资源语言环境中替代基于LLM的重排序器。

VE2VF: Vision-Enabled to Vision-Free Distillation via Real-world Reinforcement Learning for Robust Contact-Rich Manipulation

VE2VF：通过现实世界强化学习实现视觉驱动到无视觉的蒸馏，实现稳健的丰富接触操作

Authors: Victor Kowalski, Chengxi Li, Dongheui Lee
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.29564
Pdf link: https://arxiv.org/pdf/2605.29564
Abstract When using reinforcement learning (RL) for contact-rich robotic manipulation, vision can provide task-relevant information that accelerates learning beyond what proprioception alone can achieve. However, vision-enabled policies tend to overfit to the visual conditions seen during training, limiting their robustness and transferability. We present a human-in-the-loop RL framework that employs teacher-student distillation to achieve robust performance across multiple task variants, trained entirely in the real world without requiring domain randomization or data augmentation. A vision-enabled teacher distills its knowledge into a vision-free student that relies solely on pose, twist, and wrench sensing, combining fast training with strong task generalization. On the real-world NIST assembly benchmark board, our approach achieves 95% overall success after approximately 50 minutes of training on 3 representative tasks, including robust generalization to 8 unseen task variants. Fine-tuning with distillation achieves full success on the most challenging task. We demonstrate that the resulting policies outperform baselines in both robustness and adaptability.
中文摘要 当使用强化学习（RL）进行接触丰富机器人操作时，视觉可以提供与任务相关的信息，加速学习，超越本体感觉的水平。然而，视觉驱动的政策往往会过度拟合训练时所见的视觉条件，限制了其稳健性和可迁移性。我们提出了一个人机参与的强化学习框架，利用师生提炼技术，在多种任务变体中实现稳健表现，完全在现实世界中训练，无需领域随机化或数据增强。具备视觉能力的教师将知识浓缩成一个无视觉障碍的学生，完全依靠姿势、扭转和扳手感知，结合快速训练与强有力的任务泛化。在真实的NIST组装基准板上，我们的方法在3个代表性任务上进行约50分钟培训后，整体成功率达95%，其中包括对8个未见任务变体的稳健推广。通过蒸馏微调，才能在最具挑战性的任务上获得完全成功。我们证明了这些政策在鲁棒性和适应性方面都优于基线。

DeepTool: Scaling Interleaved Deliberation in Tool-Integrated Reasoning via Process-Supervised Reinforcement Learning

DeepTool：通过过程监督强化学习在工具集成推理中扩展交错审议

Authors: Yang He, Xiao Ding, Bibo Cai, Yufei Zhang, Kai Xiong, Zhouhao Sun, Bing Qin, Ting Liu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.29568
Pdf link: https://arxiv.org/pdf/2605.29568
Abstract Tool-Integrated Reasoning (TIR) extends LLM capabilities by leveraging external environments. However, existing methods lack the deliberation during sequential tool invocation required for strategic planning and self-correction. While RL mitigates this, conventional approaches for Tool-Integrated Reasoning are hindered by sparse outcome-based rewards, failing to supervise intermediate reasoning steps and tool invocations. To address this, we propose DeepTool, a novel framework that scales deliberate thinking within the interleaved process of thinking, action, and observation at each turn. In DeepTool, we first introduce a synthesis pipeline that evolves extended thinking into interleaved trajectories, integrating adversarial perturbations to ensure robustness and self-correction. Secondly, we devise Process-Supervised Reinforcement Learning based on GRPO, which utilizes an Action-Centric Process Reward to reinforce intermediate interleaved thinking and enforce precise tool invocation at every turn. Extensive experiments demonstrate that DeepTool achieves superior performance, boosting Qwen2.5-7B significantly across six benchmarks (e.g., AIME24: 3.2% -> 40.4% and HMMT25: 0.0% -> 28.6%). Furthermore, the token cost-effectiveness analysis confirms the utility of interleaved thinking, demonstrating DeepTool's optimal balance between performance and token efficiency.
中文摘要 工具集成推理（TIR）通过利用外部环境扩展了LLM的能力。然而，现有方法缺乏顺序工具调用时所需的深思熟虑，以实现战略规划和自我修正。虽然强化学习缓解了这一问题，但传统的工具整合推理方法因基于结果的奖励稀疏而受阻，未能监督中间推理步骤和工具调用。为此，我们提出了DeepTool，一种新颖框架，在思考、行动和观察交织的过程中，将刻意思考进行扩展。在DeepTool中，我们首先引入了一种综合流程，将扩展思维演化为交错轨迹，整合对抗扰动以确保稳健性和自我纠正性。其次，我们基于GRPO设计了过程监督强化学习，利用以行动为中心的过程奖励来强化中间交错思维，并在每个环节强制执行精确的工具调用。大量实验表明，DeepTool在六个基准测试中显著提升了Qwen2.5-7B（例如AIME24：3.2% -> 40.4%，HMMT25：0.0% -> 28.6%）。此外，代币成本效益分析证实了交错思维的实用性，展示了DeepTool在性能与代币效率之间的最佳平衡。

PEARL: Training Socratic Tutors with Pedagogically Aligned Reinforcement Learning

PEARL：用教学法对齐强化学习培训苏格拉底导师

Authors: Qikai Chang, Zhenrong Zhang, Linbo Chen, Pengfei Hu, Jianshu Zhang, Youhui Guo, Jun Du
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.29582
Pdf link: https://arxiv.org/pdf/2605.29582
Abstract Large Language Models (LLMs) have shown promise as educational tutors, yet effective tutoring requires more than solving problems: it must provide progressive Socratic guidance and balance multiple pedagogical objectives across multi-turn interactions. However, training such tutors remains challenging due to limited-fidelity and weakly controllable student simulation, under-specified pedagogical reward modeling, and unstable multi-objective optimization. To overcome these limitations, we propose PEARL, a pedagogically aligned reinforcement learning framework for training Socratic tutoring agents, consisting of three key components. First, we introduce a controllable student simulator that decouples latent cognitive states from response generation to model diverse abilities and misconceptions. Second, we develop a generative reward model that jointly evaluates pedagogical quality and objective correctness for policy optimization. Finally, we propose a stable multi-objective RL scheme that discretizes rewards within each dimension and aggregates normalized advantages across dimensions, preventing high-variance objectives from dominating updates. Experiments on multiple benchmarks show that PEARL achieves the best performance among open-source models and remains competitive with leading proprietary LLMs, despite using only a 30B policy model.
中文摘要 大型语言模型（LLMs）作为教育导师展现出潜力，但有效的辅导不仅仅需要解决问题：它还必须提供渐进的苏格拉底式指导，并在多重教学目标之间取得平衡。然而，由于学生模拟的保真度有限且可控性薄弱，教学奖励建模不够规范，以及多目标优化不稳定，培训此类导师仍然具有挑战性。为克服这些局限，我们提出了PEARL，这是一个教学法对齐的强化学习框架，用于培训苏格拉底辅导员，包含三个关键组成部分。首先，我们引入了可控的学生模拟器，将潜在认知状态与反应生成解耦，以模拟多样化的能力和误解。其次，我们开发了一个生成奖励模型，共同评估教学质量和政策优化的客观正确性。最后，我们提出了一种稳定的多目标强化学习方案，在每个维度内离散化奖励，并在维度间聚合归一化优势，防止高方差目标主导更新。多项基准测试的实验显示，PEARL在开源模型中表现最佳，尽管仅使用30B策略模型，仍能与领先的专有LLM竞争。
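The multi-objective aggregation described in the abstract above (discretize rewards within each dimension, then sum normalized advantages across dimensions) might look like the following sketch. The number of bins, the assumption that raw rewards lie in [0, 1], and the use of a z-score within a rollout group are illustrative choices, not details confirmed by the abstract.

```python
from statistics import mean, pstdev


def discretize(value, num_bins=5):
    """Snap a reward in [0, 1] to one of `num_bins` evenly spaced levels."""
    clipped = min(max(value, 0.0), 1.0)
    return round(clipped * (num_bins - 1)) / (num_bins - 1)


def aggregate_advantages(group_rewards, num_bins=5, eps=1e-8):
    """group_rewards[i][d] is the reward of rollout i on dimension d.

    Each dimension is standardized on its own before summing, so a
    high-variance dimension cannot dominate the combined advantage.
    """
    num_dims = len(group_rewards[0])
    totals = [0.0] * len(group_rewards)
    for d in range(num_dims):
        levels = [discretize(r[d], num_bins) for r in group_rewards]
        mu, sigma = mean(levels), pstdev(levels)
        for i, level in enumerate(levels):
            totals[i] += (level - mu) / (sigma + eps)
    return totals


# Two rollouts, two reward dimensions (hypothetical: pedagogy, correctness).
advantages = aggregate_advantages([[0.0, 0.0], [1.0, 1.0]])
```

Because every dimension contributes a unit-variance term, the combined advantages stay centered around zero within the group.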

GAPD: Gold-Action Policy Distillation for Agentic Reinforcement Learning in Knowledge Base Question Answering

GAPD：知识库问答中能动强化学习的黄金行动政策提炼

Authors: Xin Sun, Jianan Xie, Zhongqi Chen, Qiang Liu, Shu Wu, Bowen Song, Weiqiang Wang, Zilei Wang, Liang Wang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.29584
Pdf link: https://arxiv.org/pdf/2605.29584
Abstract Reinforcement learning (RL) is a natural fit for agentic knowledge base question answering (KBQA), where a model must issue executable actions, observe knowledge-base feedback, and eventually return an answer. However, current RL-based KBQA systems mainly optimize sparse rewards from the final answer, leaving intermediate action errors weakly supervised. This is especially limiting for logical-form annotated KBQA benchmarks: gold logical forms can be converted into executable action sequences, but existing pipelines use them mainly for warm-start data construction rather than for on-policy RL updates. We propose GAPD, a training-time Gold-Action Policy Distillation framework that adds dense token-level guidance to outcome-based RL. To align gold actions with on-policy student rollouts, GAPD uses MID-ANCHOR MATCHING: it treats the intermediate entities reached during student exploration and gold execution as state anchors, and matches student states to gold states through these explored entity sets. The current policy conditioned on this aligned gold action serves as a stop-gradient teacher, whose token distribution is distilled back to the ordinary student policy over generated action-token spans. GAPD consistently surpasses the current state of the art on WebQSP, GrailQA, and GraphQ.
中文摘要 强化学习（RL）是智能知识库问题答复（KBQA）的自然契合，其中模型必须发出可执行动作，观察知识库反馈，并最终返回答案。然而，当前基于强化学习的KBQA系统主要优化最终答案的稀疏奖励，导致中间动作错误被较弱监督。这对逻辑形式注释的 KBQA 基准测试尤其限制：黄金逻辑形式可以转换为可执行的动作序列，但现有流水线主要将其用于热启动数据构建，而非策略内强化学习更新。我们提出了GAPD，一种训练时的黄金行动政策提炼框架，为基于结果的强化学习增加了密集的代币级指导。为了将金币行动与政策上的学生推广对齐，GAPD采用中锚匹配：它将学生探索和金矿执行中达到的中间实体视为州锚点，并通过这些探索实体集将学生州与金州匹配。当前基于该对齐金行动的政策作为停止梯度教师，其代币分配被提炼回普通学生政策，涵盖生成的行动代币跨度。GAPD 在 WebQSP、GrailQA 和 GraphQ 上持续超越当前技术水平。
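The anchor-matching idea in the abstract above, aligning a student state with a gold state through the entity sets each has reached, can be sketched as a set-overlap lookup. Using Jaccard similarity and a minimum-overlap cutoff is an assumption made for illustration; the entity identifiers and the names `jaccard` and `match_gold_step` are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_gold_step(student_entities, gold_steps, min_overlap=0.5):
    """Return the index of the gold step whose reached-entity set best
    overlaps the student's, or None if no step overlaps enough."""
    best_idx, best_score = None, 0.0
    for idx, gold_entities in enumerate(gold_steps):
        score = jaccard(student_entities, gold_entities)
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx if best_score >= min_overlap else None


# Toy gold execution with two intermediate entity sets.
gold = [{"m.01"}, {"m.02", "m.03"}]
aligned = match_gold_step({"m.02", "m.03"}, gold)   # matches step 1
unaligned = match_gold_step({"m.99"}, gold)         # no usable anchor
```

In this reading, the gold action attached to the matched step would then condition the teacher distribution; unmatched states would receive no dense guidance.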

Training Deliberative Monitors for Black-Box Scheming Detection

培训审议监视者以检测黑匣子阴谋

Authors: Aditya Sinha, Akshat Naik, Victor Gillioz, Simon Storf, Kilian Merkelbach, Rich Barton-Cooper, Axel Højmark, Marius Hobbhahn
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.29601
Pdf link: https://arxiv.org/pdf/2605.29601
Abstract As autonomous agents become more capable of performing real-world tasks, distinguishing scheming behavior from benign task pursuit may become a central AI control problem. Existing monitors often rely on chain-of-thought access or internal activations, or use prompted frontier models, all of which can be unavailable, unreliable or expensive in deployment. In this work, we study action-only deliberative monitors: smaller open-weight models trained to detect scheming and sabotage from agentic trajectories without accessing the monitored agent's reasoning or model internals. Our method, inspired by deliberative alignment, uses a scheming specification to elicit structured rationales from a frontier teacher, filters them with a separate judge, and distills the highest-quality rationales into open-weight monitors with supervised fine-tuning and reinforcement learning. We train on five datasets, and evaluate across six out-of-distribution agentic misalignment benchmarks. We show that applying our method to Qwen3.5-27B yields higher performance than all low-cost frontier models as prompted monitors (Gemini 3.1 Flash-Lite, GPT-5.4 Nano, and Claude Haiku 4.5) and than Gemini 2.5 Pro, while also achieving lower marginal inference cost (token-metered USD per 1,000 evaluations). Stronger prompted frontier monitors (Gemini 3.1 Pro, GPT-5.4, Claude Sonnet 4.6, and Claude Opus 4.6) achieve higher performance but at roughly 16 to 34 times higher marginal inference cost. Several of our trained monitors are positioned on the empirical cost-performance Pareto frontier among the monitors we evaluate, providing practical low-cost, low-FPR alternatives to prompted frontier models.
中文摘要 随着自主智能体越来越能执行现实任务，区分阴谋行为与无害任务追求可能成为人工智能的核心控制难题。现有监控器通常依赖思维链访问或内部激活，或使用提示前沿模型，这些方法在部署时可能不可用、不可靠或成本高昂。本研究中，我们研究了仅行动的审议监控器：较小的开放权重模型，训练用于检测代理轨迹中的阴谋和破坏行为，而无需访问被监控代理的推理或模型内部结构。我们的方法受审议性对齐启发，利用阴谋规范从前沿教师那里引出结构化的理由，由独立评审筛选，并通过监督微调和强化学习将最高质量的理由蒸馏到开放权重监控器中。我们基于五个数据集进行训练，并在六个分布外的代理错位基准中进行评估。我们证明，将该方法应用于Qwen3.5-27B时，性能优于所有作为提示监控器的低成本前沿模型（Gemini 3.1 Flash-Lite、GPT-5.4 Nano和Claude Haiku 4.5）以及Gemini 2.5 Pro，同时实现了更低的边际推理成本（按词元计费的每千次评估美元成本）。更强的提示前沿监控器（Gemini 3.1 Pro、GPT-5.4、Claude Sonnet 4.6 和 Claude Opus 4.6）性能更高，但边际推理成本约高出16至34倍。我们训练的若干监控器位于所评估监控器的实证成本-性能帕累托前沿上，提供了实用的低成本、低FPR替代方案，替代提示前沿模型。

Verifiable Rewards Beyond Math and Code: Lightweight Corpus-Grounded Process Supervision for Factual Question Answering

超越数学和代码的可验证奖励：基于语料库的轻量级流程监督，支持事实性问答

Authors: Shicheng Fan, Haochang Hao, Dehai Min, Weihao Liu, Philip S. Yu, Lu Cheng
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.29648
Pdf link: https://arxiv.org/pdf/2605.29648
Abstract Applying reinforcement learning to improve factual accuracy in knowledge-intensive question answering faces a reward design dilemma. Response-level rewards provide only coarse supervision and cannot distinguish correct from incorrect statements within a reasoning trace. Sentence-level alternatives offer finer-grained feedback, but typically rely on NLI verifiers, LLM judges, or knowledge-verification pipelines that are expensive to deploy at RL scale and often unreliable for rare-entity facts, where accurate reward signals are especially important. We propose CorVer (Corpus Verify), a lightweight, plug-in-ready process reward that replaces neural verifiers with a corpus-grounded signal derived from Wikipedia co-occurrence statistics. CorVer assigns sentence-level credit and maps it to token-level advantages via a simple alignment, requiring only a 0.5B extractor and a single corpus lookup per sentence. Across 30 (model, benchmark) cells spanning six instruction-tuned models (3B to 14B) and five QA benchmarks, CorVer improves over the raw baseline for every cell, with an average TriviaQA gain of +4.1 pp. It also outperforms four neural-verifier baselines in 18 of 20 cells under their feasible configurations, while training 4.8 to 8.4x faster.
中文摘要 应用强化学习提升知识密集型问答中的事实准确性面临奖励设计的困境。回答级奖励仅提供粗略监督，无法在推理轨迹中区分正确与错误陈述。句子级替代方案提供更细粒度的反馈，但通常依赖NLI验证器、LLM评判器或知识验证流程，这些在强化学习规模下部署成本高昂，且对稀有实体事实往往不可靠，而这些场景下准确的奖励信号尤为重要。我们提出了CorVer（Corpus Verify），这是一种轻量级、即插即用的过程奖励，用基于维基百科共现统计的语料库信号替代神经验证器。CorVer 分配句子级信用，并通过简单的对齐将其映射到词元级优势，只需一个 0.5B 的提取器和每个句子一次语料库查询。在涵盖6个指令调优模型（3B至14B）和5个问答基准的30个（模型，基准）单元中，CorVer在每个单元上都优于原始基线，TriviaQA平均提升为+4.1 pp。在可行配置下的20个单元中，它在18个单元上优于四个神经验证器基线，同时训练速度快4.8到8.4倍。
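The sentence-to-token credit mapping described in the abstract above could be sketched as follows. The co-occurrence table, the +1/-1 credit rule, and the broadcast of a sentence's credit to all of its tokens are assumptions for illustration; the abstract only states that credit comes from Wikipedia co-occurrence statistics and is mapped to token-level advantages through a simple alignment. The counts below are toy values, not real corpus statistics.

```python
def sentence_credit(entity_pairs, cooccurrence, min_count=1):
    """+1 if every extracted entity pair co-occurs at least `min_count` times
    in the corpus table, -1 otherwise, 0 if the sentence has no pairs."""
    if not entity_pairs:
        return 0.0
    supported = all(
        cooccurrence.get(frozenset(pair), 0) >= min_count for pair in entity_pairs
    )
    return 1.0 if supported else -1.0


def token_advantages(sentence_token_counts, credits):
    """Broadcast each sentence's credit to every token in that sentence."""
    advantages = []
    for n_tokens, credit in zip(sentence_token_counts, credits):
        advantages.extend([credit] * n_tokens)
    return advantages


# Toy co-occurrence table (hypothetical counts).
table = {frozenset(("Marie Curie", "Warsaw")): 120}
credits = [
    sentence_credit([("Marie Curie", "Warsaw")], table),
    sentence_credit([("Marie Curie", "Lisbon")], table),
]
per_token = token_advantages([3, 2], credits)
```

Using `frozenset` keys makes the lookup independent of the order in which the two entities are extracted.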

TRACE: Toulmin-based Reasoning Assessment through Constructive Elements for LLM CoT Evaluation

TRACE：基于图尔明的推理评估，通过建设性元素进行LLM CoT评估

Authors: Yundong Kim, Heyoung Yang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.29656
Pdf link: https://arxiv.org/pdf/2605.29656
Abstract Evaluating open-ended outputs from large language models (LLMs) remains challenging due to the absence of ground truth. Existing metrics rely on final-answer accuracy or surface-level statistics, leaving the reasoning process itself unexamined. We introduce TRACE (Toulmin-based Reasoning Assessment through Constructive Elements), a metric that analyzes Chain-of-Thought (CoT) reasoning processes. Rather than judging outcomes, TRACE inspects how arguments are constructed by integrating Toulmin's argumentation theory with Flavell's metacognitive framework to assess reasoning structure. Experiments on 26.3K QA samples across 7 reasoning models show strong correlation with benchmark accuracy (r=0.74). Furthermore, TRACE is effective as a reinforcement learning reward signal, outperforming accuracy-only baselines. Together, these results indicate that logically sound reasoning leads to higher-quality answers. TRACE thus serves as a complementary metric for evaluating open-ended outputs. Code is available at this https URL.
中文摘要 由于缺乏基础真实性，评估大型语言模型（LLMs）的开放输出依然具有挑战性。现有指标依赖于最终答案的准确性或表层统计数据，导致推理过程本身未被深入审视。我们引入了TRACE（基于图尔敏的推理评估，通过建设性元素），这是一种分析思维链（CoT）推理过程的指标。TRACE不评判结果，而是通过整合Toulmin的论证理论与Flavell的元认知框架来评估推理结构，从而探讨论证的构建过程。对7个推理模型中26.3K个QA样本的实验显示，与基准准确率有强相关性（r=0.74）。此外，TRACE作为强化学习奖励信号的有效性，优于仅凭准确率的基线。这些结果综合来看，逻辑严谨的推理能带来更高质量的答案。因此，TRACE作为评估开放式输出的补充指标。代码可在此 https URL 访问。

Momentum Based Reward Design for Low Emission Traffic Signal Control

基于动量的奖励设计用于低排放交通信号控制

Authors: Chinmay Mundane, Amith Manoharan, Arun Singh
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2605.29693
Pdf link: https://arxiv.org/pdf/2605.29693
Abstract Urban traffic congestion is a growing global issue contributing significantly to long commute times and environmental pollution. Traditional traffic signal control systems often fail to adapt to dynamic traffic conditions. Adaptive traffic signal control can improve urban traffic without changing road infrastructure. Deep Reinforcement Learning (DRL) has shown strong performance for this task, but existing delay and queue-based rewards often produce short-sighted or unstable policies. This paper proposes a Momentum-Based Reward Function (MBRF) that encourages vehicles to keep moving rather than penalizing congestion alone. The method is evaluated in SUMO (Simulation of Urban MObility) using standard traffic metrics such as waiting time, queue length, throughput, and CO2 emissions. Results show that the proposed reward produces better throughput-emission trade-offs and more stable learning behavior than delay or queue-based rewards, as well as classical controllers such as Max Pressure and LQF.
中文摘要 城市交通拥堵是一个日益严重的全球问题，显著导致通勤时间延长和环境污染。传统的交通信号控制系统常常无法适应动态交通状况。自适应交通信号控制可以在不改变道路基础设施的情况下改善城市交通。深度强化学习（DRL）在该任务中表现出良好表现，但现有的基于延迟和队列的奖励常常导致短视或不稳定的策略。本文提出了一种基于动量的奖励函数（MBRF），鼓励车辆继续前进，而不仅仅是惩罚拥堵。该方法在SUMO（城市流动性模拟）中评估，使用等待时间、排队长度、吞吐量和二氧化碳排放等标准交通指标。结果显示，所提奖赏比基于延迟或队列的奖励以及经典控制器如最大压力和LQF在吞吐量-发射权衡上产生更好的权衡和更稳定的学习行为。

Fairness-Aware Profit Maximization using Deep Reinforcement Learning

利用深度强化学习实现公平意识的利润最大化

Authors: Poonam Sharma, Sanchit Virdi, Suman Banerjee
Subjects: Social and Information Networks (cs.SI)
Arxiv link: https://arxiv.org/abs/2605.29770
Pdf link: https://arxiv.org/pdf/2605.29770
Abstract Given a social network represented as a graph where the nodes are the users and the edges represent the social relations, and a positive integer k, how to select k nodes to maximize the influence in the network remains an active area of research. In this paper, we consider a variant of the problem in which network users are associated with two parameters: a benefit value and a cost. A fixed budget is given, and the network is partitioned into communities. The task is to select a subset of users (the seed set) within the budget so that their initial activation maximizes the earned profit, while ensuring that each community realizes at least a minimum fraction of its total benefit under a maximin fairness criterion. For any seed set, the earned benefit is defined as the sum of the benefit values of the users influenced by the seed set, and the profit is defined as the difference between the earned benefit and the total cost. Formally, we call this the Fairness-Aware Profit Maximization Problem. We propose a Deep Reinforcement Learning-based approach for solving it: we first model the problem as a Markov Decision Process and subsequently propose a Deep Q-Learning Algorithm. The proposed solution has been implemented and tested on real-world social network datasets. From the reported results, we observed that the proposed approach yields a seed set whose initial activation produces up to 10 times more profit than the baseline methods. The implementation of our methodology is available at this https URL.
中文摘要 给定一个以图形表示的社交网络，节点为用户，边为社会关系，且为正整数k，如何选择k个节点以最大化网络影响力仍是一个活跃的研究领域。本文探讨了一个问题变体，其中网络用户与两个参数相关联：收益值和成本。设定固定预算，并将网络划分为社区。任务是在预算中选择一部分用户（种子集），使其初始激活最大化已赚取的利润，同时确保每个社区在最大化公平性标准下至少实现其总收益的最低部分。对于任何种子集，已获得收益定义为受该种子集影响的用户利益值之和，利润定义为已获得收益与总成本之差。正式来说，我们称之为公平意识利润最大化问题。我们提出了基于深度强化学习的方法来解决：首先将问题建模为马尔可夫决策过程，随后提出深度Q学习算法。该方案已在真实社交网络数据集上实现并测试。根据报告结果，我们观察到，所提方法产生的种子集，其初始激活带来的利润是基线方法的10倍。我们方法论的实现可在此 https URL 查阅。
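The objective and fairness constraint defined in the abstract above translate directly into two small functions. This is a sketch of the definitions only, not the authors' Deep Q-Learning solver; it assumes that "total cost" means the summed cost of the seed users and that the set of influenced users has already been computed by some diffusion model. All names and numbers are hypothetical.

```python
def profit(influenced, seeds, benefit, cost):
    """Earned benefit of influenced users minus the cost of the seed set."""
    return sum(benefit[u] for u in influenced) - sum(cost[s] for s in seeds)


def is_fair(influenced, communities, benefit, min_fraction):
    """True if every community earns at least `min_fraction` of its total benefit."""
    for members in communities:
        total = sum(benefit[u] for u in members)
        earned = sum(benefit[u] for u in members if u in influenced)
        if total > 0 and earned < min_fraction * total:
            return False
    return True


benefit = {"a": 4, "b": 2, "c": 6, "d": 2}
cost = {"a": 3, "b": 1, "c": 5, "d": 1}
communities = [{"a", "b"}, {"c", "d"}]
influenced, seeds = {"a", "b", "c"}, {"a"}

earned_profit = profit(influenced, seeds, benefit, cost)    # 12 - 3
fair_at_half = is_fair(influenced, communities, benefit, 0.5)
fair_at_80 = is_fair(influenced, communities, benefit, 0.8)
```

In the toy instance the second community earns 6 of 8 benefit units (75%), so the seed set passes a 50% floor but fails an 80% floor.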

ARIADNE: AI-RAN Informed Link Adaptation in Digital Twin Network Environments

ARIADNE：数字孪生网络环境中的人工智能驱动知情链路适配

Authors: Maria Tsampazi, Neagin Neasamoni Santhi, Nicole Perrotta, Falko Dressler, Tommaso Melodia
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2605.29772
Pdf link: https://arxiv.org/pdf/2605.29772
Abstract Artificial Intelligence (AI)-powered Radio Access Networks (RANs) have attracted significant attention from both industry and academia. Meanwhile, Digital Twins offer a safe playground for experimenting with AI/Machine Learning (ML)-based solutions for advanced AI-RAN research. By enabling the testing of online algorithms before deployment on the RAN, they reduce costs and safety risks associated with physical field testing. In this article, we propose ARIADNE, an online Reinforcement Learning (RL)-based module that seamlessly integrates with SIONNA and is tasked with performing link adaptation. We explore different design choices and demonstrate how ARIADNE can surpass industry-standard and state-of-the-art methods by achieving up to 11% and 20% improvements in Spectral Efficiency, respectively. Finally, we show that RL learns a Modulation and Coding Scheme (MCS) selection strategy that diverges from Outer Loop Link Adaptation (OLLA), exhibiting either more conservative or more aggressive behavior depending on the configuration, a trend further corroborated by training offline on 5th generation (5G) over-the-air (OTA) measurements.
中文摘要 人工智能（AI）驱动的无线接入网络（RAN）网络吸引了业界和学术界的广泛关注。与此同时，数字孪生为基于AI/机器学习（ML）的先进AI-RAN研究解决方案提供了安全的实验场。通过在RAN部署前对在线算法进行测试，它们降低了与实体现场测试相关的成本和安全风险。本文提出了ARIADNE，这是一个基于强化学习（RL）的在线模块，能够无缝集成SIONNA，并负责执行链接适应。我们探讨了不同的设计选择，并展示了ARIADNE如何通过分别实现高达11%和20%的光谱效率提升，超越行业标准和最先进方法。最后，我们表明强化学习的调制与编码方案（MCS）选择策略与外环链路适应（OLLA）不同，根据配置表现出更保守或更激进的行为，这一趋势也通过离线训练的第五代（5G）空中（OTA）测量得到进一步证实。
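For readers unfamiliar with the OLLA baseline named in the abstract above, the following is a minimal sketch of the textbook form of Outer Loop Link Adaptation: a SINR offset is nudged up on each ACK and down on each NACK so that the long-run block error rate converges to a target. It is not ARIADNE's RL policy, and the step size and MCS threshold table are hypothetical values.

```python
def olla_update(offset_db, ack, target_bler=0.1, step_down_db=1.0):
    """Textbook OLLA step: small increase on ACK, larger decrease on NACK.

    The up/down ratio target_bler / (1 - target_bler) makes the offset
    stationary when the observed error rate equals the target.
    """
    step_up_db = step_down_db * target_bler / (1.0 - target_bler)
    return offset_db + step_up_db if ack else offset_db - step_down_db


def select_mcs(sinr_db, offset_db, thresholds_db):
    """Pick the highest MCS index whose SINR threshold the corrected SINR meets.

    `thresholds_db` must be sorted in ascending order (hypothetical table).
    """
    effective = sinr_db + offset_db
    chosen = 0
    for mcs, threshold in enumerate(thresholds_db):
        if effective >= threshold:
            chosen = mcs
    return chosen


thresholds = [0.0, 5.0, 9.5, 12.0]
after_nack = olla_update(0.0, ack=False)              # offset drops by 1 dB
mcs_before = select_mcs(10.0, 0.0, thresholds)
mcs_after = select_mcs(10.0, after_nack, thresholds)  # one level more conservative
```

An RL policy that diverges from this rule, as the abstract reports, would in effect be choosing a different mapping from channel state to MCS index than the offset-plus-threshold lookup above.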

Hista and Numca: Estimate State Value Effectively for LLM Reinforcement Learning

Hista和Numca：有效估计LLM强化学习中的状态值

Authors: Zizhe Chen, Jiqian Dong, Yizhou Tian, Garry Yang, Yongqiang Chen, Zhitang Chen, James Cheng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.29782
Pdf link: https://arxiv.org/pdf/2605.29782
Abstract Reinforcement learning (RL) refines large language models (LLMs) by directly optimizing model behavior through reward signals. While accurate state value estimation is critical for stable training in classical RL, it remains an underexplored challenge in LLM post-training. In this work, we introduce the State Value Estimation Benchmark (SVEB) to assess state estimation within existing RL frameworks and show that critics in standard approaches like PPO collapse to a coarse group-average baseline. To address this, we propose two techniques: Numca, which leverages numerical spans as gradable milestones for state value estimation, and Hista, a framework that uses the LLM's hidden states as representations to compute a weighted average over disjoint rollouts and their returns. Extensive experiments demonstrate that both methods yield more accurate state value estimates and enhance training performance across different RL algorithms and model sizes without incurring significant computational overhead.
中文摘要 强化学习（RL）通过直接优化奖励信号来优化大型语言模型（LLM）。虽然准确的状态值估计对于经典强化学习的稳定训练至关重要，但在大型语言模型（LLM）训练后仍是一个尚未充分探索的挑战。本研究介绍了状态价值估计基准（SVEB），用于评估现有强化学习框架下的状态估计，并展示了标准方法如PPO中的批评者会退缩为粗略的群体平均基线。为此，我们提出了两种技术：Numca，利用数值跨度作为可分级的里程碑进行状态价值估计;以及Hista，一种利用LLM隐藏状态表示加权平均不相交展开及其返回的框架。大量实验表明，这两种方法都能获得更准确的状态值估计，并在不同强化学习算法和模型规模下提升训练性能，同时避免显著的计算开销。
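One way to read the Hista description in the abstract above is as a similarity-weighted average of returns: rollouts whose hidden states resemble the query state contribute more to its value estimate. The sketch below uses cosine similarity with a softmax over rollouts; both choices, the temperature, and the function names are assumptions rather than details from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def estimate_value(query_state, rollout_states, rollout_returns, temperature=0.1):
    """Softmax-weighted average of rollout returns, weighted by hidden-state similarity."""
    sims = [cosine(query_state, s) for s in rollout_states]
    peak = max(sims)  # subtract the max for numerical stability
    weights = [math.exp((s - peak) / temperature) for s in sims]
    total = sum(weights)
    return sum(w * r for w, r in zip(weights, rollout_returns)) / total


states = [[1.0, 0.0], [0.0, 1.0]]   # toy 2-d "hidden states"
returns = [1.0, 0.0]
near_first = estimate_value([1.0, 0.0], states, returns)
in_between = estimate_value([1.0, 1.0], states, returns)
```

A query identical to the first rollout's state inherits almost all of that rollout's return, while a query equidistant from both falls back to the plain group average.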

Quantifying and Optimizing Simplicity via Polynomial Representations

通过多项式表示量化和优化简洁性

Authors: Tianren Zhang, Xiangxin Li, Minghao Xiao, Guanyu Chen, Feng Chen
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.29823
Pdf link: https://arxiv.org/pdf/2605.29823
Abstract Deep networks often exhibit a preference for "simple" solutions, and such a simplicity bias is widely believed to play a key role in generalization. Yet a broadly applicable, quantitative measure of simplicity remains elusive. We introduce polynomial representations as a distribution-aware, low-dimensional surrogate for neural functions: we approximate a network's predictive behavior along data-dependent interpolation paths using orthogonal polynomial bases, yielding a compact functional representation. We show that the effective degree of this representation serves as a practical simplicity metric that is predictive of generalization across tasks and architectures, and consistently outperforms existing generalization proxies such as sharpness. Finally, polynomial representations naturally yield a differentiable simplicity regularizer, which consistently improves generalization in image and text classification, fine-tuning contrastive vision-language models, and reinforcement learning.
中文摘要 深度网络通常偏好“简单”解，这种简单偏差被广泛认为在泛化中起着关键作用。然而，一个广泛适用的、量化的简易度量化指标仍然难以实现。我们将多项式表示引入为神经函数的分布感知、低维替代形式：我们利用正交多项式基近似数据依赖插值路径上的预测行为，得到紧凑的泛函表示。我们证明了该表示的有效度作为一种实用的简易度度量，能够预测跨任务和架构的泛化，并且持续优于现有的泛化代理指标，如锐利度。最后，多项式表示自然产生可微简易正则化器，持续提升图像和文本分类的泛化能力，微调对比视觉语言模型，以及强化学习。

EvoRubric: Self-Evolving Rubric-Driven RL for Open-Ended Generation

EvoRubric：自我演进的基于评分标准驱动的开放式生成强化学习

Authors: Xin Guan, Xiaomeng Hu, Shen Huang, Zhenyi Wang, Bo Zhang, Zijian Li, Pengjun Xie, Bo Liu, Jiuxin Cao
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.29847
Pdf link: https://arxiv.org/pdf/2605.29847
Abstract Reinforcement Learning (RL) has significantly advanced Large Language Models (LLMs) in verifiable domains, but aligning models for open-ended generation remains profoundly challenging due to the lack of definitive rewards. Current rubric-based RL methods mitigate this by employing explicit criteria; however, they rely heavily on static, human-annotated rubrics that inevitably cause policy lag, or expensive external proprietary models for dynamic updates. In this paper, we propose EvoRubric, a novel single-policy co-evolutionary RL framework that eliminates the reliance on static criteria and on external rubric generators. By unifying response generation and rubric generation under a single parameterized policy, EvoRubric dynamically alternates between a Reasoner and a Rubric Generator. To prevent reward hacking and ensure the reliability of generated signals, we introduce a multi-level verification pipeline featuring a meta-verifier, zero-variance pruning, and a Leave-One-Out peer consensus mechanism. Validated criteria are dynamically archived into a memory pool, yielding dense, multi-objective rewards to continuously co-optimize both roles. Extensive experiments across Medical, Writing, and Science domains demonstrate that EvoRubric consistently outperforms traditional static and external-LLM-driven alignment methods. Notably, our framework is compatible with human-expert priors. When initialized with expert-annotated rubrics, EvoRubric can further uncover novel, discriminative dimensions, achieving better performance than relying solely on static expert annotations.
中文摘要 强化学习（RL）在可验证领域显著推动了大型语言模型（LLM）的发展，但由于缺乏明确的奖励，开放式生成的模型对齐仍极具挑战性。当前基于评分标准的强化学习方法通过采用明确标准来缓解这一问题;然而，它们大量依赖静态、人工注释的评分标准，这不可避免地会导致政策延迟，或者昂贵的外部专有动态更新模型。本文提出了EvoRubric，一种新的单策略共进化强化学习框架，消除了对静态标准和外部评分标准生成器的依赖。通过将反应生成和评分标准生成统一在单一参数化策略下，EvoRubric 动态地在推理器和评分标准生成器之间交替运行。为防止奖励黑客并确保生成信号的可靠性，我们引入了多级验证流水线，包含元验证器、零方差修剪和“排除一人”同伴共识机制。验证后的标准会动态归档到内存池中，产生密集且多目标的奖励，持续共同优化两个角色。医学、写作和科学领域的大量实验表明，EvoRubric 持续优于传统的静态和外部大型语言模型驱动的对齐方法。值得注意的是，我们的框架与人类专家先验兼容。当初始化专家注释评分标准时，EvoRubric 可以进一步发现新的、具有判别力的维度，比单纯依赖静态专家注释实现更好的性能。
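Two of the verification steps named in the abstract above, zero-variance pruning and leave-one-out peer consensus, can be sketched on a table of rubric scores. The pruning rule follows the name directly (a criterion that scores every response identically carries no signal). The consensus measure, pairwise ordering agreement with the mean of the remaining criteria, is an assumption; the criterion names and scores are hypothetical.

```python
from statistics import pvariance


def prune_zero_variance(scores_by_criterion):
    """Drop criteria that give every response the same score."""
    return {name: s for name, s in scores_by_criterion.items() if pvariance(s) > 0}


def leave_one_out_agreement(scores_by_criterion, name):
    """Fraction of response pairs that `name` orders the same way as the
    mean of all other criteria (ties on either side are skipped)."""
    others = [s for n, s in scores_by_criterion.items() if n != name]
    if not others:
        return 0.0
    target = scores_by_criterion[name]
    peer = [sum(col) / len(col) for col in zip(*others)]
    agree = total = 0
    for i in range(len(target)):
        for j in range(i + 1, len(target)):
            d_target, d_peer = target[i] - target[j], peer[i] - peer[j]
            if d_target == 0 or d_peer == 0:
                continue
            total += 1
            agree += (d_target > 0) == (d_peer > 0)
    return agree / total if total else 0.0


scores = {
    "cites_evidence": [1, 0, 1],
    "is_polite": [1, 1, 1],          # identical for all responses: pruned
    "answers_question": [1, 0, 0],
}
kept = prune_zero_variance(scores)
consensus = leave_one_out_agreement(kept, "cites_evidence")
```

A criterion with low agreement would be a candidate for rejection before its scores are turned into rewards.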

ESPO: Early-Stopping Proximal Policy Optimization

ESPO：早期终止的近端策略优化

Authors: Zihang Li, Rui Zhou, Yingcheng Shi, Wenhan Yu, Zhewen Tan, Zixiang Liu, Zeming Li, Binhua Li, Yongbin Li, Tong Yang, Jieping Ye
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.29860
Pdf link: https://arxiv.org/pdf/2605.29860
Abstract When a large language model under reinforcement learning commits a wrong reasoning step early in a trajectory, standard algorithms force it to keep generating until the maximum horizon, spending compute on tokens that never receive positive reward and polluting advantage estimates with post-failure noise. We propose ESPO (Early-Stopping Proximal Policy Optimization), which detects trajectory failure on-the-fly and terminates rollouts early. At each generation step, ESPO computes a surrogate regret using only the logits already computed during sampling, and terminates when the smoothed cumulative regret significantly exceeds its estimated values. Truncated trajectories are treated as absorbing failure states with a terminal reward, concentrating negative temporal-difference (TD) errors near the detected failure step without any additional reward model or human annotation. On DeepSeek-R1-Distill-Qwen-7B trained for mathematical reasoning, ESPO surpasses PPO on AIME 2024 (46.28% vs. 45.25%), AMC 2023 (85.83% vs. 82.94%), and MATH-500 (87.42% vs. 85.43%), while saving more than 20% rollout tokens cumulatively.
中文摘要 当一个大型语言模型在强化学习初期犯错推理步骤时，标准算法会强制其持续生成直到最大视野，花费计算时间在从未获得正奖励的代币上，并用失败后的噪声污染优势估计。我们提出了ESPO（早期停止近端策略优化），能够实时检测轨迹失效并提前终止部署。在每一代步，ESPO仅使用抽样时已计算的logit计算代理遗憾，并在平滑累计遗憾显著超过估计值时终止。截断轨迹被视为具有终极奖励的吸收失效状态，集中检测到失效步骤附近的负时差（TD）误差，无需额外奖励模型或人工注释。在DeepSeek-R1-Distill-Qwen-7B中，ESPO在AIME~2024（46.28%对45.25%）、AMC~2023（85.83%对82.94%）和MATH-500（87.42%对85.43%）上超过PPO，累计节省超过20%的推广代币。
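The early-stopping rule in the abstract above can be illustrated with a toy detector. The abstract does not define the surrogate regret, so this sketch assumes it is the log-probability gap between the most likely token and the sampled token, accumulated with exponential decay and compared against a fixed threshold that stands in for the paper's estimated reference value. The function names and parameters are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def stop_step(step_logits, sampled_ids, threshold, decay=0.9):
    """Return the first step at which the smoothed cumulative regret exceeds
    `threshold`, or None if the rollout is never truncated."""
    smoothed = 0.0
    for t, (logits, token) in enumerate(zip(step_logits, sampled_ids)):
        logp = log_softmax(logits)
        regret = max(logp) - logp[token]   # zero when the greedy token was sampled
        smoothed = decay * smoothed + regret
        if smoothed > threshold:
            return t
    return None


logits = [[2.0, 0.0]] * 3
truncated_at = stop_step(logits, [0, 1, 1], threshold=3.0)   # regret builds up
never_stops = stop_step(logits, [0, 0, 0], threshold=3.0)    # always greedy
```

Only the logits already produced during sampling are used, which matches the abstract's claim that no extra reward model is needed; how the truncated rollout is then rewarded is outside this sketch.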

CRITIC-R1: Learning Structured Critics for Retrieval-Augmented Generation

CRITIC-R1：学习结构化批评者以实现检索增强生成

Authors: Wenhan Xiao, Ziwei Zhang, Chuanyue Yu, Xingcheng Fu, Qingyun Sun, Runhua Xu, Jianxin Li
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.29886
Pdf link: https://arxiv.org/pdf/2605.29886
Abstract Retrieval-augmented generation (RAG) improves knowledge-intensive question answering by incorporating external evidence. However, existing RAG methods still suffer from hallucinations and subtle reasoning errors. Recent studies introduce external critics to refine RAG outputs, yet they often provide coarse-grained and weakly structured feedback, exhibit over-aggressive intervention, and lead to noisy and unreliable refinement, limiting their effectiveness for correction. To tackle these issues, we propose CRITIC-R1, a structured critic framework that formulates and learns RAG critique as an explicit error diagnosis problem using reinforcement learning (RL). Our framework categorizes common RAG errors into multiple diagnostic dimensions, including verdict, error location, reasoning analysis, and fix generation. To learn these capabilities, we design two reward functions: Conservative Judgement Alignment (CJA) first encourages calibrated high-level judgements while mitigating the over-aggressive phenomenon, whereas Diagnostic Quality Alignment (DQA) further improves fine-grained diagnostic feedback through gated rewards. We train the critic model using GRPO-based RL with process-level supervision collected from external LLM teacher models. Experiments across five QA benchmarks show that CRITIC-R1 consistently improves answer quality over strong RAG baselines. Our source code is available at this https URL
中文摘要 检索增强生成（RAG）通过引入外部证据提升知识密集型问答。然而，现有的RAG方法仍然存在幻觉和细微推理错误。近期研究引入外部批评者来优化RAG输出，但这些反馈往往粗粒度且结构薄弱，干预过于激进，导致细化噪声杂乱且不可靠，限制了其修正效果。为解决这些问题，我们提出了CRITIC-R1，这是一个结构化批评框架，利用强化学习（RL）将RAG批评作为显式错误诊断问题来制定和学习。我们的框架将常见的RAG错误分为多个诊断维度，包括裁决、错误定位、推理分析和修正生成。为了解这些能力，我们设计了两种奖励函数：保守判断对齐（CJA）首先鼓励校准的高层次判断，同时减轻过度激进现象;而诊断质量对齐（DQA）则通过门槛奖励进一步提升细粒度诊断反馈。我们使用基于GRPO的强化学习训练批评模型，并从外部LLM教师模型收集过程级监督。五个QA基准测试的实验显示，CRITIC-R1在强有力的RAG基线上持续提升答复质量。我们的源代码可在此 https URL 获取

LaRA: Layer-wise Representation Analysis for Detecting Data Contamination in RL Post-Training

LaRA：用于检测强化学习后训练数据污染的分层表示分析

Authors: Minju Gwak, Minseo Kwak, Dongseok Lee, Guijin Son, Alan Ritter, Jaehyung Kim
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.29888
Pdf link: https://arxiv.org/pdf/2605.29888
Abstract Reinforcement learning (RL) post-training has been shown to improve reasoning in large language models (LLMs). However, there has been little exploration of the problem of data contamination in RL post-training, potentially undermining the generalization and evaluation reliability of the training process itself. Existing detection methods primarily rely on output-level signals such as likelihood or entropy, which become unreliable for RL-trained models since RL shapes behavior through trajectory-level rewards rather than token likelihoods. We propose LaRA, a layer-wise representation analysis framework for detecting contamination in RL post-trained LLMs. LaRA introduces three complementary metrics, measuring perturbation sensitivity, directional collapse, and local representation rigidity under controlled perturbations. We find that contamination produces progressive geometric deviations across layers, including amplified perturbation sensitivity, stronger directional collapse, and enhanced local rigidity. Based on our findings, we also develop a contamination detection protocol that aggregates representation-level deviations across layers and metrics. Experiments on RL-trained reasoning models show that our protocol outperforms existing output-level baselines for contamination detection.
中文摘要 强化学习（RL）后训练已被证明能提升大型语言模型（LLM）的推理能力。然而，关于强化学习后训练中数据污染问题的探讨甚少，这可能削弱训练过程本身的泛化性和评估可靠性。现有的检测方法主要依赖输出级信号，如似然或熵，而这些信号对经强化学习训练的模型变得不可靠，因为强化学习通过轨迹级奖励而非词元似然来塑造行为。我们提出了LaRA，一种用于检测强化学习后训练LLM中污染的逐层表示分析框架。LaRA引入了三个互补指标，用于测量受控扰动下的扰动敏感性、方向坍缩和局部表示刚性。我们发现污染会在各层之间产生渐进的几何偏差，包括放大的扰动敏感性、更强的方向坍缩以及增强的局部刚性。基于这些发现，我们还开发了一种污染检测协议，能够汇总各层和各指标上的表示层面偏差。在强化学习训练的推理模型上的实验表明，我们的协议在污染检测方面优于现有的输出级基线。

Train the Agent, Not the Expert: Learning to Harness Heterogeneous Experts for Multi-Turn Visual Reasoning

训练代理，而非专家：学习利用异质专家进行多回合视觉推理

Authors: Yaowu Fan, Tao Han, Dazhao Du, Andy J. Ma, Jia Wan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.29894
Pdf link: https://arxiv.org/pdf/2605.29894
Abstract Recent progress in computer vision has produced a wide range of powerful specialized models for detection, segmentation, counting, and other visual tasks. However, these models are usually optimized for isolated task formulations, making it difficult to directly support general-purpose visual intelligence, especially when a task requires complex language understanding and dense small-object perception. In this paper, we propose VisHarness, a trainable visual agent that decouples high-level perception, reasoning, and decision-making from low-level task execution. Instead of training a model to solve a specific visual task, VisHarness learns to harness a set of carefully designed heterogeneous visual experts. This paradigm preserves the general intelligence of the agent while fully leveraging the precision advantages of specialized visual models in concrete visual tasks. With only lightweight training, VisHarness learns a generalizable visual expert-harnessing policy and can solve common fundamental vision tasks under various complex conditions through multi-turn interactions with visual expert models. To enable efficient on-policy reinforcement learning training in a live environment, we introduce dynamic visual memory archiving, which mitigates the rapidly accumulating visual-token overhead caused by multi-turn interactions with visual expert models. Experiments on four representative benchmarks covering reasoning segmentation, generalized referring segmentation, dense small-object detection, and referring counting demonstrate that VisHarness substantially outperforms existing general-purpose models and achieves competitive or superior performance compared with task-specific models.
中文摘要 计算机视觉的最新进展催生了多种强大的专用模型，用于检测、分割、计数及其他视觉任务。然而，这些模型通常针对孤立的任务设定进行优化，难以直接支持通用视觉智能，尤其是在任务需要复杂语言理解和密集小物体感知时。本文提出了VisHarness，一种可训练的视觉智能体，将高层次的感知、推理和决策与低层次的任务执行解耦。VisHarness不是训练模型去解决某个特定的视觉任务，而是学习驾驭一组精心设计的异构视觉专家。该范式既保留了智能体的通用智能，又充分利用了专用视觉模型在具体视觉任务中的精度优势。仅需轻量级训练，VisHarness就能学到可泛化的视觉专家调用策略，并通过与视觉专家模型的多轮交互，在各种复杂条件下解决常见的基础视觉任务。为了在实时环境中实现高效的同策略（on-policy）强化学习训练，我们引入了动态视觉记忆归档，以缓解与视觉专家模型多轮交互所导致的视觉词元开销快速累积。在涵盖推理分割、广义指代分割、密集小物体检测和指代计数的四个代表性基准上的实验表明，VisHarness显著优于现有通用模型，并且与任务专用模型相比具有竞争力甚至更优的性能。

KairosAgent: Agentic Time Series Forecasting with Fused Semantic Reasoning

KairosAgent：结合语义推理的能动时间序列预测

Authors: Kun Feng, Ziwei Shan, Yuchen Fang, Yiyang Tan, Sihan Lu, Shuqi Gu, Lintao Ma, Xingyu Lu, Kan Ren
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.30002
Pdf link: https://arxiv.org/pdf/2605.30002
Abstract Cross-domain multimodal time series forecasting is a challenging task, requiring models to integrate precise numerical comprehension, cross-domain semantic understanding, and effective multimodal fusion. Existing approaches either build Time Series Foundation Models (TSFMs) from scratch or leverage pretrained Large Language Models (LLMs). However, TSFMs often overlook semantic understanding and lack the ability to perform future-oriented semantic reasoning, and LLMs struggle with numerical comprehension and accurate quantitative forecasting. To overcome these limitations, we propose KairosAgent, a novel agentic framework for multimodal time series forecasting, including an LLM-based reasoner and a TSFM-based forecaster. KairosAgent unifies textual reasoning and numerical forecasting by dynamically invoking analytical tools to enhance the numerical understanding and semantic reasoning capabilities of LLMs. The reasoning results are subsequently fused into the TSFM pipeline, enabling more accurate and reliable future predictions. To further improve the reasoning, we curate a large-scale corpus of high-quality trajectories, alongside a reinforcement learning from forecasting paradigm with multi-turn refinement and turn-level credit assignment. Experiments demonstrate that KairosAgent achieves superior zero-shot forecasting performance while maximizing the utility of pretrained LLMs and TSFMs, presenting a promising direction for efficient and interpretable time series agents. The project page is at this https URL.
中文摘要 跨域多模态时间序列预测是一项具有挑战性的任务，需要模型整合精确的数值理解、跨域语义理解以及有效的多模态融合。现有方法要么从零构建时间序列基础模型（TSFM），要么利用预训练的大型语言模型（LLM）。然而，TSFM常常忽视语义理解，缺乏面向未来的语义推理能力，而LLM在数值理解和准确的定量预测方面存在困难。为克服这些局限，我们提出了KairosAgent，一种新的多模态时间序列预测智能体框架，包括基于LLM的推理器和基于TSFM的预测器。KairosAgent通过动态调用分析工具，统一文本推理和数值预测，提升LLM的数值理解和语义推理能力。推理结果随后被融合进TSFM流程，实现更准确可靠的未来预测。为了进一步改进推理，我们整理了大规模的高质量轨迹语料，并提出了带有多轮细化和回合级信用分配的“从预测中强化学习”范式。实验表明，KairosAgent在最大化预训练LLM和TSFM效用的同时，实现了更优的零样本预测性能，为高效且可解释的时间序列智能体指出了一个有前景的方向。项目页面见此 https URL。

Who Am I? History-Aware Profiles for Student Simulation in Tutoring Dialogues

我是谁？面向辅导对话中学生模拟的历史感知画像

Authors: Zhangqi Duan, Shuyan Huang, Alexander Scarlatos, Jaewook Lee, Simon Woodhead, Andrew Lan
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2605.30051
Pdf link: https://arxiv.org/pdf/2605.30051
Abstract A key part of developing large language model (LLM)-powered, automated tutoring tools is student simulation, i.e., using LLMs to role-play as students, which can facilitate tutor model evaluation and training. Existing work mostly focuses on within-dialogue simulation, which lacks context on student knowledge and behavior, partly due to not grounding in past student question-answering or dialogue interactions. In this work, we introduce the task of history-conditioned student simulation, where the goal is to accurately predict student dialogue turns by leveraging information in the student's learning history. We propose a two-component framework in which a profile generator summarizes a student's history and a simulator predicts student turns conditioned on the resulting profile. We train both components with reinforcement learning (RL), yielding profiles optimized for faithful student simulation. We evaluate our method and baselines on the first-of-its-kind real-world dataset of student dialogues and question responses that we collect from a math learning platform. Extensive experiments show that our method significantly outperforms baselines, and demonstrate the importance of history, profiles, and RL training.
中文摘要 开发由大型语言模型（LLM）驱动的自动化辅导工具的一个关键部分是学生模拟，即利用LLM扮演学生，这有助于辅导模型的评估和训练。现有研究主要集中在对话内模拟，缺乏关于学生知识和行为的背景，部分原因是没有以学生过去的问答或对话互动为依据。在本研究中，我们引入了历史条件学生模拟任务，目标是利用学生学习历史中的信息，准确预测学生的对话回合。我们提出了一个由两个组件构成的框架，其中画像生成器总结学生的历史，模拟器根据生成的画像预测学生的回合。我们用强化学习（RL）训练这两个组件，从而得到为忠实学生模拟而优化的画像。我们在首个此类真实世界数据集上评估我们的方法和基线，该数据集包含我们从一个数学学习平台收集的学生对话和答题记录。大量实验表明，我们的方法显著优于基线，并展示了历史、画像和强化学习训练的重要性。

Sample-Efficient Diffusion-based Reinforcement Learning with Critic Guidance

基于扩散的样本高效强化学习与评论家引导

Authors: Shutong Ding, Zejia Zhong, Zhongyi Wang, Ke Hu, Bikang Pan, Jingya Wang, Ye Shi
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.30056
Pdf link: https://arxiv.org/pdf/2605.30056
Abstract Recent advances in reinforcement learning (RL) have achieved great success by leveraging the multimodality and exploration capability of diffusion policies. Among these approaches, one representative branch focuses on sampling-based policy optimization. This design gives the diffusion model better exploration capability, particularly at the beginning of training, but suffers from low exploitation of Q-value information, resulting in slow policy convergence. Another branch focuses on gradient-based policy optimization, which sufficiently exploits the gradient of the Q function yet tends to collapse into a unimodal policy with low diversity. To address this issue, we propose CGPO, Critic-Guided diffusion Policy Optimization, which effectively balances exploration and exploitation by integrating a training-free guidance technique into the denoising process of the diffusion policy. Concretely, CGPO steers action generation toward high-value regions defined by the critic network and uses the guided actions as regression objectives. In this manner, CGPO reduces the time required to obtain high-quality actions and improves final performance through a better exploration-exploitation balance. We validate the effectiveness of CGPO on 5 MuJoCo locomotion tasks, where CGPO achieves state-of-the-art performance compared with existing diffusion-based RL methods. Notably, CGPO is the first to successfully incorporate a diffusion policy into real-world RL, with superior performance on Franka robot arm grasping tasks. Our official page is released at this https URL.
中文摘要 强化学习（RL）的最新进展通过利用扩散策略的多模态性和探索能力取得了巨大成功。在这些方法中，一个代表性分支专注于基于采样的策略优化。该设计使扩散模型具有更好的探索能力，尤其是在训练初期，但对Q值信息的利用不足，导致策略收敛缓慢。另一个分支关注基于梯度的策略优化，它充分利用了Q函数的梯度，但往往会坍缩成多样性较低的单模态策略。为解决这个问题，我们提出了CGPO，即评论家引导的扩散策略优化（Critic-Guided diffusion Policy Optimization），它将无需训练的引导技术融入扩散策略的去噪过程，从而有效平衡探索与利用。具体来说，CGPO将动作生成引向评论家网络所定义的高价值区域，并将引导后的动作作为回归目标。通过这种方式，CGPO缩短了获得高质量动作所需的时间，并凭借更好的探索-利用平衡提升了最终性能。我们在5项MuJoCo运动任务上验证了CGPO的有效性，与现有基于扩散的强化学习方法相比，CGPO达到了最先进的性能。值得注意的是，CGPO首次成功地将扩散策略引入真实世界强化学习，并在Franka机械臂抓取任务中表现优异。我们的官方页面发布于此 https URL。

Overcoming Forgetting in LLM Fine-Tuning with Evolution Strategies

用进化策略克服LLM微调中的遗忘

Authors: Kajetan Schweighofer, Conor F. Hayes, Roberto Dailey, Risto Miikkulainen, Xin Qiu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.30148
Pdf link: https://arxiv.org/pdf/2605.30148
Abstract Evolution Strategies (ES) has recently emerged as a competitive alternative to reinforcement learning (RL) for large language model (LLM) fine-tuning, offering advantages through simplicity, scalability, and inference-only training. However, recent work suggests that ES fine-tuning on new tasks may induce forgetting of prior tasks. First, this paper shows that prior task forgetting (1) is better characterized as performance drift rather than irreversible forgetting, with prior-task performance often recovering during ES training; and (2) is not a specific failure mode of ES, but can also arise for fine-tuning with RL methods. Second, it analyzes when and why such drift arises, highlighting its dependence on ES training dynamics, particularly random walk behavior in weakly constrained directions of the weight space. Third, based on these insights, it introduces Anchored Weight Decay (AWD) as a parameter-space regularization technique that constrains optimization toward the initial model parameters. AWD effectively stabilizes prior-task performance while preserving target-task performance, achieving benefits comparable to large ES population sizes at much lower computational cost. Thus, contrary to previous beliefs, the paper shows that prior-task forgetting under ES is largely avoidable, positioning ES as a promising approach for continual learning in LLMs.
中文摘要 Evolution Strategies（ES）最近作为强化学习（RL）在大型语言模型（LLM）微调中的竞争替代方案出现，通过简洁、可扩展性和仅推理训练提供了优势。然而，最新研究表明，对新任务进行ES微调可能导致对之前任务的遗忘。首先，本文表明，先前任务遗忘（1）更适合被描述为表现漂移，而非不可逆的遗忘，且先前任务的表现通常在ES训练中恢复;（2）不是ES的特定失效模式，但也可能因通过强化学习方法进行微调而出现。其次，分析了这种漂移何时以及为何出现，强调其对ES训练动态的依赖，特别是权重空间中弱约束方向上的随机游走行为。第三，基于这些见解，它引入了锚定权重衰减（AWD）作为参数空间正则化技术，将优化限制在初始模型参数上。AWD有效稳定了前期任务的性能，同时保持目标任务性能，实现了与大型ES群体规模相当的效益，且计算成本大幅降低。因此，与以往看法相反，论文表明在ES下，先验任务遗忘在很大程度上是可以避免的，这使得ES成为LLM持续学习的一种有前景的方法。
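
The Anchored Weight Decay (AWD) idea in the abstract above, regularizing toward the initial model parameters rather than toward zero, can be illustrated with a minimal sketch. The function names, the `decay` and `lr` parameters, and the order in which the ES update and the decay are applied are all assumptions made for illustration; the abstract does not specify the paper's exact formulation.

```python
def anchored_weight_decay(theta, theta_init, decay):
    """Shrink each parameter toward its initial (anchor) value.

    Ordinary weight decay pulls weights toward zero; this hypothetical
    variant pulls them toward theta_init instead.
    """
    return [w - decay * (w - w0) for w, w0 in zip(theta, theta_init)]


def es_step_with_anchor(theta, theta_init, update, lr=1.0, decay=0.1):
    """Apply a precomputed ES update, then the anchored decay (assumed ordering)."""
    moved = [w + lr * u for w, u in zip(theta, update)]
    return anchored_weight_decay(moved, theta_init, decay)
```

With `decay=0`, the step reduces to the plain update; with `decay=1`, every step snaps back to the anchor. Intermediate values trade target-task progress against drift away from the initial model.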

RL2ML: Finite-Rollout Surrogate Objectives from Reinforcement Learning to Maximum Likelihood

RL2ML：从强化学习到最大似然的有限rollout替代目标

Authors: Yifu Zheng
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.30154
Pdf link: https://arxiv.org/pdf/2605.30154
Abstract Correctness-based Reinforcement Learning with Verifiable Rewards (RLVR) trains language models from binary feedback on sampled outputs, but the objective optimized in expectation and the stochastic update geometry induced by finite rollout groups are often conflated. This paper develops RL2ML, a family of finite-rollout surrogate objectives with a closed-form, exactly unbiased gradient estimator. The family continuously connects standard reinforcement learning, maximum-likelihood-like training, and beyond-maximum-likelihood objectives while preserving estimator-objective alignment under a fixed rollout budget. We introduce the group-level update scale to characterize how a rollout group is reweighted after its empirical success count is observed, revealing a subcritical-supercritical update-scale transition that is hidden by population-level objective notation alone. Building on this distinction, calibrated metric-gain analysis and exact variance decomposition show that the best choice of surrogate objective is determined neither by proximity to maximum likelihood nor by the population-level weight alone. Instead, it depends jointly on the evaluation metric, local sensitivity, and estimator variance. The remaining degree of freedom in the surrogate objective family can therefore be formulated as a one-dimensional optimization problem rather than treated as an unconstrained hyperparameter.
中文摘要 基于正确性的可验证奖励强化学习（RLVR）利用对采样输出的二元反馈训练语言模型，但期望意义下被优化的目标与有限rollout组所诱导的随机更新几何常被混为一谈。本文提出了RL2ML，这是一族有限rollout替代目标，具有闭式且严格无偏的梯度估计器。该目标族连续地连接了标准强化学习、类最大似然训练和超越最大似然的目标，同时在固定的rollout预算下保持估计器与目标的一致。我们引入组级更新尺度，用以刻画一个rollout组在其经验成功次数被观测到之后如何被重新加权，从而揭示了仅凭总体层面的目标记号所掩盖的亚临界-超临界更新尺度转变。基于这一区分，校准的指标增益分析和精确的方差分解表明，最佳替代目标的选择既不取决于与最大似然的接近程度，也不仅仅由总体层面的权重决定，而是共同取决于评估指标、局部敏感性和估计器方差。因此，替代目标族中剩余的自由度可以被表述为一个一维优化问题，而不是被当作无约束的超参数。

Meta-Cognitive Memory Policy Optimization for Long-Horizon LLM Agents

面向长期视野LLM代理的元认知记忆策略优化

Authors: Ziyan Liu, Zhezheng Hao, Yeqiu Chen, Hong Wang, Jingren Hou, Ruiyi Ding, Yongkang Yang, Wence Ji, Wei Xia, Feng Liu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.30159
Pdf link: https://arxiv.org/pdf/2605.30159
Abstract Memory-augmented LLM agents tackle complex long-horizon tasks by recursively summarizing interaction trajectories into compact memory. However, existing approaches typically train these memory policies using outcome-based reinforcement learning, failing to localize where intermediate memory quality degrades. As interactions unfold, ambiguous recursive summaries progressively discard task-relevant information and introduce semantic noise. This exacerbates belief deviation, obscuring the agent's estimate of the latent task state and ultimately derailing long-horizon reasoning. We therefore argue that memory optimization should focus not merely on trajectory-level success, but on the clarity of the belief induced by intermediate summaries. To this end, we introduce Belief Entropy, a self-supervised proxy that probes how uncertain the model remains about the latent task state given its current memory. Based on this proxy, we propose Metacognitive Memory Policy Optimization (MMPO). Instead of relying only on sparse outcome-based signals, MMPO provides fine-grained, memory-specific supervision via explicitly penalizing summaries that induce high epistemic uncertainty. Experiments show that MMPO consistently outperforms existing methods on diverse long-horizon tasks, maintaining 97.1% performance even when scaled to 1.75M-token contexts.
中文摘要 记忆增强型LLM智能体通过将交互轨迹递归地总结为紧凑记忆来处理复杂的长程任务。然而，现有方法通常使用基于结果的强化学习来训练这些记忆策略，无法定位中间记忆质量在何处下降。随着交互的展开，含糊的递归总结会逐渐丢弃与任务相关的信息，并引入语义噪声。这加剧了信念偏差，模糊了智能体对潜在任务状态的估计，最终使长程推理偏离正轨。因此，我们认为记忆优化不应仅关注轨迹层面的成功，更应关注中间总结所引发信念的清晰度。为此，我们引入了信念熵，这是一种自监督代理指标，用于探测模型在给定当前记忆的情况下对潜在任务状态仍有多大不确定性。基于该指标，我们提出了元认知记忆策略优化（MMPO）。MMPO不只依赖稀疏的基于结果的信号，而是通过显式惩罚引发高认知不确定性的总结，提供细粒度的、针对记忆的监督。实验显示，MMPO在多种长程任务中持续优于现有方法，即使扩展到175万词元的上下文仍保持97.1%的性能。
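
To make the Belief Entropy idea in the abstract above concrete, here is a minimal sketch that scores a belief distribution over latent task states with Shannon entropy and subtracts it from an outcome reward. How the paper actually obtains the belief distribution and combines the penalty with the outcome signal is not stated in the abstract; `belief_entropy`, `shaped_reward`, and the `penalty` weight are hypothetical names chosen for illustration.

```python
import math


def belief_entropy(probs):
    """Shannon entropy (in nats) of a belief distribution over latent task states."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def shaped_reward(outcome_reward, probs, penalty=0.1):
    """Outcome reward minus an entropy penalty (the penalty weight is hypothetical)."""
    return outcome_reward - penalty * belief_entropy(probs)
```

A summary that leaves the model certain about the task state (a one-hot belief) incurs no penalty, while a summary that leaves the belief uniform over the candidate states incurs the maximum penalty for that number of states.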

On Distributional Reinforcement Learning in Chaotic Dynamical Systems

关于混沌动力系统中的分布强化学习

Authors: James Rudd-Jones, Mirco Musolesi, María Pérez-Ortiz
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.30160
Pdf link: https://arxiv.org/pdf/2605.30160
Abstract Chaotic dynamical systems pose a fundamental challenge for Reinforcement Learning (RL): exponential sensitivity to initial conditions induces high-variance bootstrap targets and poorly conditioned gradient updates. Chaotic dynamics arise across scientific and engineering domains, from fluid flows and climate systems to multi-agent systems, where reliable learning is highly desirable. Standard RL methods optimise expected returns through scalar value functions, implicitly averaging over diverging trajectories and entangling trajectory-level instability with the learning objective. We show that, under mild statistical stability assumptions, the return distribution evolves more regularly than individual trajectories when measured under the $1$-Wasserstein metric, yielding a smoother distributional Bellman objective. By aligning optimisation with this measure-level structure, distributional RL provides better-conditioned learning. We offer a principled explanation for the advantages of distributional methods in chaotic systems and the geometries of RL objectives under chaos.
中文摘要 混沌动力系统对强化学习（RL）构成根本挑战：对初始条件的指数敏感性会导致高方差的自举目标和条件较差的梯度更新。混沌动力学出现在众多科学和工程领域，从流体流动、气候系统到多智能体系统，而这些领域都非常需要可靠的学习。标准强化学习方法通过标量值函数优化期望回报，隐式地对发散轨迹进行平均，并将轨迹层面的不稳定性与学习目标纠缠在一起。我们证明，在温和的统计稳定性假设下，以$1$-Wasserstein度量衡量时，回报分布的演化比单条轨迹更规则，从而得到更平滑的分布贝尔曼目标。通过将优化与这种测度层面的结构对齐，分布强化学习提供了条件更好的学习。我们为分布方法在混沌系统中的优势以及混沌下强化学习目标的几何结构提供了有原则的解释。
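
The $1$-Wasserstein metric mentioned in the abstract above has a simple closed form for one-dimensional empirical distributions with equal sample counts: sort both samples and average the absolute differences. The sketch below computes it for two sets of sampled returns; the function name and the equal-size restriction are choices made for this illustration, not something taken from the paper.

```python
def wasserstein1(xs, ys):
    """1-Wasserstein distance between two equal-size empirical distributions on the real line."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("expects two non-empty samples of equal size")
    # In 1-D, the optimal coupling matches the i-th smallest x to the i-th smallest y.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)
```

Because the distance compares sorted samples, it ignores which trajectory produced which return. That is one way to see why a distribution-level comparison can stay smooth even when individual trajectories diverge.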

Mean-Field Diffuser: Scaling Offline MARL to Thousands of Agents

平均场扩散器：将离线MARL扩展到数千个代理

Authors: Wenhao Li, Xiangfeng Wang, Bo Jin
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.30190
Pdf link: https://arxiv.org/pdf/2605.30190
Abstract Diffusion-based planning has achieved strong results in single-agent offline reinforcement learning, yet scaling to many-agent systems remains intractable due to the curse of dimensionality in the joint trajectory space. We introduce MF-Diffuser, a framework that lifts trajectory planning to the Wasserstein space of trajectory distributions, where the propagation of chaos ensures a small representative subset of agents captures the full population dynamics. Our approach features a value-weighted chaotic entropy objective that reconciles generative fidelity with return maximization, and a hierarchical coarse-to-fine strategy that progressively grows the agent population during denoising. We establish end-to-end suboptimality bounds with four interpretable terms, revealing that mean-field approximation error scales as $O(H^2/\sqrt{N})$ while offline distribution shift provably does not grow with population size $N$, and prove the generated policy is an approximate mean-field Nash equilibrium with explicit convergence guarantees. Experiments on three mean-field RL benchmarks -- spanning stage games, sequential dynamics, and adversarial team competition -- show MF-Diffuser achieves the best return in the majority of settings, with the largest gains on suboptimal offline data and at extreme scales ($N \geq 10^3$).
中文摘要 基于扩散的规划在单智能体离线强化学习中取得了显著成果，但由于联合轨迹空间中的维度灾难，扩展到多智能体系统仍然难以处理。我们提出了MF-Diffuser，这一框架将轨迹规划提升到轨迹分布的Wasserstein空间，其中混沌传播（propagation of chaos）保证了一小部分有代表性的智能体就能刻画整个群体的动态。我们的方法采用价值加权的混沌熵目标，以兼顾生成保真度与回报最大化，并采用分层的由粗到细策略，在去噪过程中逐步扩大智能体群体。我们建立了包含四个可解释项的端到端次优性界，表明平均场近似误差按$O(H^2/\sqrt{N})$缩放，而离线分布偏移可证明不随群体规模$N$增长，并证明生成的策略是具有显式收敛保证的近似平均场纳什均衡。在三个平均场强化学习基准（涵盖阶段博弈、序贯动态和对抗性团队竞争）上的实验表明，MF-Diffuser在大多数设置中取得最佳回报，并且在次优离线数据和极大规模（$N \geq 10^3$）下增益最大。

HPO: Hysteretic Policy Optimization for Stable and Efficient Training under Sparse-Reward Regime

HPO：在稀疏奖励体系下实现稳定高效训练的滞后策略优化

Authors: Mohamed Sana, Nicola Piovesan, Antonio De Domenico, Fadhel Ayed, Haozhe Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.30201
Pdf link: https://arxiv.org/pdf/2605.30201
Abstract We investigate a narrow but common failure mode of GRPO-style reinforcement learning in the context of sparse verifiable rewards: early updates contain more responses with negative advantages than those with positive advantages, while response-level length normalization ties the magnitude of the update to the length of the output. We propose Hysteretic Policy Optimization (HPO), a minimal modification of GRPO that reduces the weight of negative-advantage updates and replaces per-response length normalization with mean-length normalization. We further introduce Adaptive HPO (A-HPO), which sets the hysteretic weight based on batch-level advantage-sign statistics, thereby removing the need for tuning a fixed hysteretic weight. In our TeleLogs and Countdown experiments, A-HPO improves the reward per update compared to GRPO, with the largest gains in early sparse reward regimes. On TeleLogs, A-HPO achieves a final reward of 0.84, outperforming SAPO by 5%, GSPO by 11%, and GRPO by 15%, while maintaining a comparable response-length. On Countdown, A-HPO achieves the largest gains in initial and most difficult configurations across 1.5B-7B models. Ablation studies on the hysteretic weight show that the gains of A-HPO come from better balancing the contributions of positive and negative advantages compared to positive-only or fully symmetric updates.
中文摘要 我们研究了GRPO式强化学习在稀疏可验证奖励场景下一种范围较窄但很常见的失败模式：早期更新中具有负优势的响应多于具有正优势的响应，而响应级长度归一化又将更新幅度与输出长度绑定在一起。我们提出了滞后策略优化（HPO），这是对GRPO的最小改动：降低负优势更新的权重，并用平均长度归一化取代逐响应长度归一化。我们进一步提出了自适应HPO（A-HPO），它根据批次级的优势符号统计量设定滞后权重，从而无需调节固定的滞后权重。在我们的TeleLogs和Countdown实验中，A-HPO相比GRPO提高了每次更新的奖励，并在早期稀疏奖励阶段增益最大。在TeleLogs上，A-HPO的最终奖励达到0.84，比SAPO高5%，比GSPO高11%，比GRPO高15%，同时响应长度相当。在Countdown上，A-HPO在1.5B至7B模型的初始配置和最困难配置中取得了最大增益。针对滞后权重的消融研究表明，与仅使用正优势或完全对称的更新相比，A-HPO的收益来自于更好地平衡正负优势的贡献。
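
The two modifications named in the HPO abstract above (down-weighting negative-advantage updates and normalizing by the group's mean length instead of each response's own length) can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the group-standardized advantage, the fixed `beta` weight, and the function name `hysteretic_coefficients` are all hypothetical.

```python
import math


def hysteretic_coefficients(rewards, lengths, beta=0.5):
    """Per-token loss coefficients for one rollout group (illustrative).

    rewards: one scalar reward per sampled response
    lengths: token count of each response
    beta:    weight in [0, 1] applied to negative-advantage responses
    """
    n = len(rewards)
    mean_r = sum(rewards) / n
    std_r = math.sqrt(sum((r - mean_r) ** 2 for r in rewards) / n)
    if std_r == 0.0:
        return [0.0] * n  # all rewards equal: no group-relative signal
    mean_len = sum(lengths) / n  # shared normalizer instead of each response's own length
    coeffs = []
    for r in rewards:
        adv = (r - mean_r) / std_r  # group-relative (standardized) advantage
        weight = beta if adv < 0 else 1.0  # damp negative-advantage updates
        coeffs.append(weight * adv / mean_len)
    return coeffs
```

In this sketch, two responses with the same reward receive the same per-token coefficient regardless of their own lengths, so a longer response is no longer diluted relative to a shorter one. Setting `beta=1.0` recovers a symmetric update, and `beta=0.0` keeps only positive-advantage responses.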

When Should Models Change Their Minds? Contextual Belief Management in Large Language Models

模型何时应该改变想法？大型语言模型中的情境信念管理

Authors: Haoming Xu, Weihong Xu, Zongrui Li, Mengru Wang, Yunzhi Yao, Chiyu Wu, Jin Shang, Yu Gong, Shumin Deng
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.30219
Pdf link: https://arxiv.org/pdf/2605.30219
Abstract Long-horizon interactions require language models to manage accumulating information: when to update their state, when to preserve their state, and what to ignore. We study this challenge as Contextual Belief Management (CBM): maintaining a predicted belief state aligned with formal evidence while isolating task-irrelevant noise. To make CBM measurable, we introduce BeliefTrack, a closed-world benchmark spanning Rule Discovery and Circuit Diagnosis, where a finite belief space and symbolic verifiers enable exact turn-level evaluation. BeliefTrack diagnoses three failures: Failed Stay, Failed Update, and Failed Isolation. Across multiple LLMs, vanilla models exhibit severe CBM failures, while explicit belief-tracking prompts provide limited gains. In contrast, reinforcement learning with belief-state rewards reduces failure rates by 70.9% on average. Further probing reveals latent belief-state dynamics behind these failures, and representation-level steering reduces failure rates by 46.1% across two tasks. Code is coming soon at this https URL.
中文摘要 长程交互要求语言模型管理不断累积的信息：何时更新状态，何时保持状态，以及忽略哪些内容。我们将这一挑战作为情境信念管理（CBM）进行研究：在隔离任务无关噪声的同时，维持与形式证据相符的预测信念状态。为了使CBM可测量，我们引入了BeliefTrack，这是一个涵盖规则发现和电路诊断的封闭世界基准，其中有限的信念空间和符号验证器使精确的回合级评估成为可能。BeliefTrack诊断三种失败：保持失败（Failed Stay）、更新失败（Failed Update）和隔离失败（Failed Isolation）。在多个大型语言模型上，原始模型表现出严重的CBM失败，而显式的信念追踪提示带来的收益有限。相比之下，带有信念状态奖励的强化学习平均将失败率降低70.9%。进一步的探测揭示了这些失败背后的潜在信念状态动态，表示层面的引导在两个任务上将失败率降低了46.1%。代码即将发布于此 https URL。

TriSearch: Learning to Optimize Triangulations via Bistellar Flips

TriSearch：学习通过双星翻转优化三角剖分

Authors: Yiran Wang, Guido Montúfar
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.30220
Pdf link: https://arxiv.org/pdf/2605.30220
Abstract We introduce TriSearch, a reinforcement learning framework for optimizing objectives over triangulations of a polytope via bistellar flips. The key idea is a circuit-supported subtriangulation action representation: feasible flips are encoded by their supporting circuit and realized local subtriangulation, enabling a learned policy to rank them using local geometric and combinatorial features. This yields a dimension-agnostic interface and enables efficient traversal of the flip graph without explicit enumeration of the full triangulation space. Instantiated in 3D and 4D, TriSearch generalizes zero-shot from small training instances to larger polytopes with exponentially larger search spaces. It achieves top performance on metric objectives in 3D and, in 4D, discovers more distinct Fine, Regular, Star triangulations of reflexive polytopes, corresponding to Calabi-Yau threefolds, than existing samplers under a fixed budget.
中文摘要 我们提出TriSearch，一种强化学习框架，用于通过双星翻转在多胞体的三角剖分上优化目标。核心思想是一种由回路（circuit）支撑的子三角剖分动作表示：可行的翻转由其支撑回路和所实现的局部子三角剖分来编码，使学习到的策略能够利用局部几何和组合特征对翻转进行排序。这带来了与维度无关的接口，并且无需显式枚举整个三角剖分空间即可高效遍历翻转图。TriSearch在三维和四维中实例化，能够零样本地从小型训练实例泛化到搜索空间呈指数级增大的更大多胞体。它在三维的度量目标上取得了最佳性能；在四维中，在固定预算下，它比现有采样器发现了更多互不相同的自反多胞体的精细（Fine）、正则（Regular）、星形（Star）三角剖分，这些三角剖分对应于卡拉比-丘三维流形。

How's it going? Reinforcement learning in language models recruits a functional welfare axis

怎么样？语言模型中的强化学习招募了一个功能性福利轴

Authors: Andy Q Han, David J. Chalmers, Pavel Izmailov
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2605.30232
Pdf link: https://arxiv.org/pdf/2605.30232
Abstract How does reinforcement learning shape a language model's internal representations? We present evidence that RL recruits a pre-existing representation of functional welfare: an estimate of how well or badly the system is doing, relative to its goals. We train several language models in a novel, semantically neutral maze environment. We then extract concept vectors for rewarded and punished trajectories, and evaluate those vectors in settings unrelated to the maze environment. The punishment vector behaves like a representation of negative welfare: it promotes failure and impossibility tokens, it aligns with negative emotion concepts, it negatively tracks goal-achievement, and steering with it induces negative self-reports, pathological backtracking, refusal, and uncertainty. The positive reward vector behaves as the mirror image, and the two are nearly antiparallel. These effects are robust when controlling for tile-to-reward mapping, scale, instruct tuning, RL training algorithm, model family, and LoRA versus full-finetuning, and largely persist when we replace RL with supervised fine-tuning. Importantly, the vectors are effective in models before they have undergone maze training. Combined with observations that the effects also appear in pretrain-only models, we therefore argue that this functional welfare axis pre-exists post-training: it is recruited, rather than created, by post-training. While we make no claims about any experience of welfare, the axis offers a demonstration that minimal reward signals can broadly affect model behavior by recruiting pre-existing welfare-like representations, with implications for interpretability, post-training dynamics, and alignment.
中文摘要 强化学习如何塑造语言模型的内部表征？我们提供的证据表明，强化学习招募了一种预先存在的功能性福利表征：即对系统相对于其目标表现好坏的估计。我们在一个新颖的、语义中性的迷宫环境中训练了多个语言模型。然后我们提取受奖励和受惩罚轨迹的概念向量，并在与迷宫环境无关的设置中评估这些向量。惩罚向量的行为类似于负面福利的表征：它会提升表示失败和不可能的词元的概率，与负面情绪概念对齐，与目标达成程度负相关，用它进行引导会诱发负面自我报告、病态回溯、拒绝和不确定。正奖励向量的行为则如同镜像，两者几乎反平行。在控制图块到奖励的映射、规模、指令微调、强化学习训练算法、模型家族以及LoRA与全量微调等因素后，这些效应依然稳健，并且在用监督微调替代强化学习时大体上仍然存在。重要的是，这些向量在尚未经过迷宫训练的模型中同样有效。结合这些效应也出现在仅经过预训练的模型中的观察，我们因此认为，这一功能性福利轴在后训练之前就已存在：它是被后训练招募的，而非由后训练创造的。虽然我们不对任何福利体验做出断言，但该轴表明，极少的奖励信号可以通过招募预先存在的类福利表征而广泛影响模型行为，这对可解释性、后训练动态和对齐都有启示。

Reinforcement Learning with Robust Rubric Rewards

基于稳健评分标准奖励的强化学习

Authors: Ya-Qi Yu, Hao Wang, Fangyu Hong, Xiangyang Qu, Gaojie Wu, Qiaoyu Luo, Nuo Xu, Huixin Wang, Wuheng Xu, Yongxin Liao, Zihao Chen, Haonan Li, Ziming Li, Dezhi Peng, Minghui Liao, Jihao Wu, Haoyu Ren, Dandan Tu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.30244
Pdf link: https://arxiv.org/pdf/2605.30244
Abstract While Reinforcement Learning with Verifiable Rewards (RLVR) is effective for deterministically checkable tasks, many vision-language tasks are partially verifiable, demanding multi-criteria supervision (e.g., perceptual details, reasoning steps, and constraints). Rubrics provide a natural interface for this fine-grained supervision, but their effectiveness depends on the execution accuracy during online RL. We propose Reinforcement Learning with Robust Rubric Rewards ($\text{RLR}^3$), extending RLVR from task-level verification to criterion-level verification. $\text{RLR}^3$ routes instance-specific rubrics through two execution paths: an LLM-as-an-extractor paired with a deterministic verifier, or an LLM-as-a-Judge for non-verifiable criteria. To ensure faithful scoring, $\text{RLR}^3$ introduces a minimal exposure strategy that masks ground truths from extractors and images from judges. Furthermore, $\text{RLR}^3$ employs hierarchical aggregation to prioritize essential criteria over additional criteria, and mitigates score saturation within rollout groups. Evaluated on Qwen3-VL-30B-A3B across 15 benchmarks, $\text{RLR}^3$ consistently outperforms RLVR, yielding a 4.7-point improvement over the base model and exceeding the official instruct-to-thinking model gap. Controlled audits confirm our deterministic verification and minimal exposure significantly reduce exploitable false positives.
中文摘要 虽然可验证奖励强化学习（RLVR）对可确定性检查的任务有效，但许多视觉语言任务只是部分可验证，需要多准则监督（如感知细节、推理步骤和约束）。评分标准（rubric）为这种细粒度监督提供了自然的接口，但其有效性取决于在线强化学习期间的执行准确性。我们提出了基于稳健评分标准奖励的强化学习（$\text{RLR}^3$），将RLVR从任务级验证扩展到准则级验证。$\text{RLR}^3$将针对具体实例的评分标准路由到两条执行路径之一：一条是LLM作为提取器并配以确定性验证器，另一条是针对不可验证准则的LLM作为评判者。为确保评分忠实，$\text{RLR}^3$引入了最小暴露策略，对提取器屏蔽标准答案，对评判者屏蔽图像。此外，$\text{RLR}^3$采用层级聚合，使必要准则优先于附加准则，并缓解rollout组内的分数饱和。在Qwen3-VL-30B-A3B上于15个基准进行评估，$\text{RLR}^3$持续优于RLVR，相比基础模型提升4.7分，并超过了官方instruct模型与thinking模型之间的差距。受控审计证实，我们的确定性验证和最小暴露策略显著减少了可被利用的假阳性。

Stable-Layers: Fine-Tuning Image Layer Decomposition Models with VLM-Scored Reinforcement Learning

稳定层：利用VLM评分强化学习微调图像层分解模型

Authors: Ciara Rowles, Reshinth Adithyan, Nikhil Pinnaparaju, Vikram Voleti, Mark Boss
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2605.30257
Pdf link: https://arxiv.org/pdf/2605.30257
Abstract We present Stable-Layers, a reinforcement learning framework that eliminates the need for paired supervision by fine-tuning a pretrained layer decomposition model using only feedback from a vision-language model (VLM). Starting from Qwen-Image-Layered, we apply Flow-GRPO with LoRA adaptation, sampling multiple candidate decompositions per image, scoring them with a VLM, and optimising the policy from group-relative advantages. The key challenge lies in designing a reliable reward signal: VLMs scoring samples in isolation tend to compress their judgements into a narrow band, leaving GRPO with little within-group variance to learn from. We address this with a two-stage evaluation pipeline that pairs structured per-sample scoring across five edit-centric criteria with a grid-based calibration step in which the VLM re-scores all candidates side-by-side. Stable-Layers produces decompositions with stronger layer separation, fewer blank or artifact-heavy layers, and lower per-layer reconstruction error on the Crello dataset compared to the base model.
中文摘要 我们介绍了稳定层，这是一种强化学习框架，通过仅利用视觉语言模型（VLM）的反馈微调预训练层分解模型，消除了对对监督的需求。从Qwen图像分层开始，我们应用Flow-GRPO配合LoRA适配，每张图像采样多个候选分解，使用VLM评分，并根据群体相对优势优化策略。关键挑战在于设计可靠的奖励信号：VLMs单独评分样本时，往往将判断压缩到狭窄的区间，导致GRPO在组内几乎没有可学习的差异。我们通过两阶段评估流程解决这个问题，将基于五个以编辑为中心的标准上的结构化每样本评分与基于网格的校准步骤配对，VLM会将所有候选者并排重新评分。Stable-Layers 在 Crello 数据集上产生的分解比基础模型更强，层分离更强，空白层或伪影堆积层更少，每层重建误差也更低。
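
The reward-design problem described in this abstract can be illustrated with a minimal sketch of group-relative advantages. This is not the authors' code: the mean/standard-deviation normalization is the commonly used GRPO form and is assumed here, and the function name, the `eps` constant, and both score lists are hypothetical values chosen only for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(scores, eps=1e-6):
    """Standardize rewards within one rollout group (GRPO-style, assumed form).

    Each candidate's advantage is its score minus the group mean,
    divided by the group standard deviation (plus eps for stability).
    """
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(s - mu) / (sigma + eps) for s in scores]


# Hypothetical VLM scores for four candidate decompositions of one image.
isolated = [7.0, 7.0, 7.0, 7.0]      # scored one at a time: compressed band
calibrated = [4.0, 6.0, 8.0, 9.0]    # re-scored side by side: spread out

print(group_relative_advantages(isolated))
print(group_relative_advantages(calibrated))
```

When every candidate in a group receives the same score, all advantages are zero and the group contributes no learning signal, which is the failure mode that a side-by-side calibration step is meant to avoid.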

Loong: A Human-Like Long Document Translation Agent with Observe-and-Act Adaptive Context Selection

Loong：一种类人长文档翻译代理，具备观察与行动自适应上下文选择功能

Authors: Yutong Wang, Xuebo Liu, Derek F. Wong, Zhilin Li, Rongqing Jiang, Min Zhang, Shimin Tao, Daimeng Wei, Min Zhang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.30274
Pdf link: https://arxiv.org/pdf/2605.30274
Abstract Document-level translation remains one of the most challenging tasks for large language models, which are constrained by limited context windows that impede global cohesion, while simultaneously suffering from redundant contextual information that degrades translation quality. To address this, we propose a human-like long document translation agent called Loong, which leverages a 3E memory module (Essence-Exemplar-Entity) to store summaries, sentence pairs, and entity records as historical context. Instead of passively attending to all history, Loong performs deep reasoning to adaptively identify the optimal context for translation guidance. Loong optimizes its context policy through reinforcement learning, utilizing preference data derived from its own sampled observe-and-act reasoning trajectories. Empirical evaluations demonstrate that Loong achieves substantial translation quality improvements in English $\Leftrightarrow$ Chinese, German, and French directions, with average gains of up to 13.0 points across the three evaluation metrics. Furthermore, Loong exhibits strong generalization across domains and robustness against contextual noise, while maintaining remarkable stability in ultra-long document translation. Our code is released at this https URL.
中文摘要 文档级翻译仍然是大型语言模型最具挑战性的任务之一，这些模型受限于有限的上下文窗口，阻碍了整体的连贯性，同时又存在冗余的上下文信息，降低了翻译质量。为此，我们提出了一种类人长文档翻译代理Loong，利用3E内存模块（Essence-Exemplar-Entity）存储摘要、句子对和实体记录作为历史上下文。龙氏不被动关注所有历史，而是进行深度推理，适应性地识别翻译指导的最佳语境。Loong 通过强化学习优化其上下文策略，利用其自身抽样的观察与行动推理轨迹得出的偏好数据。实证评估表明，龙在英文翻译质量方面实现了显著提升，三个评估指标平均提升了高达13.0分。此外，Loong在多个领域展现出强烈的泛化能力和对上下文噪声的鲁棒性，同时在超长文档翻译中保持了卓越的稳定性。我们的代码以这个 https URL 发布。

In-Context Reward Adaptation for Robust Preference Modeling

情境内奖励适应以实现稳健偏好建模

Authors: Zhenyu Sun, Zheng Xu, Ermin Wei
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2605.30323
Pdf link: https://arxiv.org/pdf/2605.30323
Abstract Reinforcement Learning from Human Feedback (RLHF) typically relies on static reward models to align Large Language Models with human preferences. However, human values are inherently diverse and heterogeneous, and a single reward model often lacks the robustness required to generalize to unseen preference domains. While existing multi-reward frameworks attempt to address this, they are often restricted to a fixed set of known domains and fail to adapt to unseen human distributions without costly retraining. In this work, we propose In-Context Reward Adaptation, a transformer-based framework designed to model diverse and unseen human preferences on the fly. By leveraging the in-context learning capabilities of transformers, our approach adaptively infers the underlying reward structure from a small set of preference demonstrations. We show that a standard transformer architecture is insufficient for this task, characterizing its asymptotic bias relative to the ground truth, whereas incorporating human response time as an auxiliary input signal enables the model to successfully adapt to preferences from previously unseen domains. Our findings show that this approach provides a more robust foundation for preference modeling, allowing for the representation of heterogeneous rewards and preference distribution shift, and offering a scalable path toward more flexible human-AI alignment.
中文摘要 人类反馈强化学习（RLHF）通常依赖静态奖励模型，使大型语言模型与人类偏好保持一致。然而，人类价值观本质上多样且异质，单一的奖励模型往往缺乏推广到未见偏好领域的鲁棒性。虽然现有的多奖励框架试图解决这个问题，但它们通常局限于一组固定的已知领域，且无法在不昂贵的重新训练的情况下适应看不见的人类分布。在本研究中，我们提出了情境奖励适应（In-Context Reward Adaptation），这是一种基于变换器（Transformer）的框架，旨在实时建模多样且未被察觉的人类偏好。通过利用变换器在上下文中的学习能力，我们的方法能够自适应地从一小部分偏好演示推断出潜在的奖励结构。我们通过表征对真实值的渐近偏置，证明了虽然标准变换器架构不足以完成此任务，但将人类响应时间作为辅助输入信号，使模型能够成功适应来自此前未见领域的偏好。我们的发现表明，这种方法为偏好建模提供了更坚实的基础，能够表示异质奖励和偏好分布变化，并为实现更灵活的人机对齐提供了可扩展的路径。

Reasoning with Sampling: Cutting at Decision Points

采样推理：决策点的切割

Authors: Felix Zhou, Anay Mehrotra, Quanquan C. Liu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Statistics Theory (math.ST); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2605.30327
Pdf link: https://arxiv.org/pdf/2605.30327
Abstract Frontier reasoning models are produced by posttraining base language models with reinforcement learning. Recent work has challenged this by showing that sampling from a sharpened version of the base model's distribution, a so-called power distribution, elicits comparable reasoning without additional training, curated datasets, or verifiers. However, making this method practical requires efficiently sampling from the power distribution. A sampler needs to "mix" to the power distribution, which necessitates moving between modes of the target distribution; intuitively, e.g., trying different reasoning strategies. The samplers proposed in prior works repeatedly select a "cut" position in the current reasoning trace uniformly at random and resample the suffix from that position onward. However, reasoning traces typically contain a few consequential decisions (e.g., the choice of proof strategy or algorithm), and we observe that a uniformly chosen cut tends to rewrite local details rather than revisit decision points. We introduce an algorithm (Entropy-Cut Metropolis-Hastings) that uses the base model's next-token entropy as a proxy to identify key decision points and resample from those positions. We empirically verify that entropy jumps are a useful proxy for decision points and, in a stylized model of reasoning, prove that our method's mixing time scales with the number of decisions in a trace rather than with the number of tokens, which can be much larger. Across MATH500, HumanEval, GPQA Diamond, and AIME26, our method consistently improves over baselines and RL-trained models.
中文摘要 前沿推理模型由带有强化学习的基础语言模型后训练生成。近期研究挑战了这一观点，表明从基础模型分布的锐化版本——所谓的幂分布中抽样，无需额外训练、策划数据集或验证器，就能引发类似的推理。然而，使该方法实用需要高效地从功率分布中采样。采样器需要与功率分配“混合”，这需要在目标分配的不同模式间切换;直觉上，比如尝试不同的推理策略。先前研究中提出的采样器会反复随机均匀地选择当前推理轨迹中的“切割”位置，并从该位置开始重新采样后缀。然而，推理迹通常包含一些相关决策（例如证明策略或算法的选择），我们观察到均匀选择的割往往会重写局部细节，而非重新访问决策点。我们引入了一种算法（熵割 Metropolis-Hastings），该算法利用基础模型的下一token熵作为代理，识别关键决策点并从这些位置重新采样。我们通过实证验证熵跳跃是决策点的有用代理，并通过一个简化的推理模型证明，我们的方法的时间尺度与迹中决策的数量混合，而非符号数，而代币数量可能更大。在MATH500、HumanEval、GPQA Diamond和AIME26中，我们的方法持续优于基线和强化学习训练模型。
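
The idea of using next-token entropy as a proxy for decision points can be sketched as follows. This is only an illustration under stated assumptions, not the paper's algorithm: the abstract does not say how an entropy jump is defined or how proposals are accepted, so the threshold rule, the uniform fallback, and all function names below are hypothetical, and the Metropolis-Hastings acceptance step is omitted entirely.

```python
import math
import random


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_jump_positions(entropies, threshold):
    """Indices where entropy rises by more than `threshold` over the previous token.

    Assumption: a "jump" is a large increase between consecutive positions.
    """
    return [
        t for t in range(1, len(entropies))
        if entropies[t] - entropies[t - 1] > threshold
    ]


def pick_cut(entropies, threshold, rng):
    """Pick a cut position among entropy jumps, else fall back to a uniform cut."""
    jumps = entropy_jump_positions(entropies, threshold)
    if jumps:
        return rng.choice(jumps)
    return rng.randrange(len(entropies))


# Hypothetical per-token entropies along one reasoning trace.
trace_entropies = [0.1, 0.2, 1.5, 1.4, 0.1, 2.0]
print(entropy_jump_positions(trace_entropies, threshold=1.0))
print(pick_cut(trace_entropies, threshold=1.0, rng=random.Random(0)))
```

In a full sampler, the suffix after the chosen cut would be resampled from the base model and the proposal accepted or rejected against the power-distribution target; those steps depend on details the abstract does not give.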

Keyword: diffusion policy

Fisher-Preserving Guidance: Training-Free Manifold Constraints for Safe Diffusion Control

费舍尔保护指导：无训练歧管约束以实现安全扩散控制

Authors: Hao Ren, Zetong Bi, Yiming Zeng, Le Zheng, Zhi Li, Zhaoliang Wan, Lu Qi, Hui Cheng
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.29937
Pdf link: https://arxiv.org/pdf/2605.29937
Abstract Diffusion models are effective for waypoint prediction in visual navigation, but standard sampling and test-time guidance can produce unreliable or inefficient trajectories when updates drift off the training manifold. We propose Fisher Preserving Guidance with Outer Product Span Projection, a training-free inference method that avoids the large Fisher drift associated with off-distribution actions while optimizing a task objective. Our method computes the Fisher-preserving update via a low-rank Jacobian factorization, requiring only a single backward pass per step and enabling real-time use. We further introduce Truncated Fisher Denoising Sensitivity as an uncertainty signal and use it for robust multi-sample action blending. Experiments on toy and realistic navigation benchmarks, including Maze2D with TSDF-based guidance, PushT with official Diffusion Policy weights, and visual navigation in simulation and on real robots, demonstrate consistent improvements in performance over strong diffusion-policy baselines without additional training.
中文摘要 扩散模型在目视导航中对航点预测有效，但标准采样和测试时间引导在更新偏离训练流形时，可能产生不可靠或低效的轨迹。我们提出了带有外积跨投影的Fisher Preserving Guidance方法，这是一种无训练的推断方法，在优化任务目标的同时避免了与离分布动作相关的较大Fisher漂移。我们的方法通过低秩雅可比分解计算保持费舍尔的更新，每步只需一次向后传递，便于实时使用。我们进一步引入截断费舍尔去噪灵敏度作为不确定性信号，并用于稳健的多采样动作混合。在玩具和真实导航基准测试上的实验，包括基于TSDF的Maze2D、基于官方扩散政策权重的PushT，以及模拟和真实机器人中的视觉导航，显示出在无需额外训练的情况下，性能优于强扩散政策基线的持续提升。

Sample-Efficient Diffusion-based Reinforcement Learning with Critic Guidance

基于样本的扩散强化学习与批判指导

Authors: Shutong Ding, Zejia Zhong, Zhongyi Wang, Ke Hu, Bikang Pan, Jingya Wang, Ye Shi
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2605.30056
Pdf link: https://arxiv.org/pdf/2605.30056
Abstract Recent advances in reinforcement learning (RL) have achieved great success by leveraging the multimodality and exploration capability of diffusion policies. Among these approaches, one representative branch focuses on sampling-based policy optimization. This design gives the diffusion model better exploration capability, particularly at the beginning of training, but suffers from low exploitation of Q-value information, resulting in slow policy convergence. Another branch focuses on gradient-based policy optimization, which fully exploits the gradient of the Q function yet tends to collapse into a unimodal policy with low diversity. To address this issue, we propose CGPO, Critic-Guided diffusion Policy Optimization, which effectively balances exploration and exploitation by integrating a training-free guidance technique into the denoising process of the diffusion policy. Concretely, CGPO steers action generation toward high-value regions defined by the critic network and uses the guided actions as regression objectives. In this manner, CGPO reduces the time required to obtain high-quality actions and improves final performance by better balancing exploration and exploitation. We validate the effectiveness of CGPO on 5 MuJoCo locomotion tasks, where it achieves state-of-the-art performance compared with existing diffusion-based RL methods. Notably, CGPO is the first method to successfully incorporate diffusion policy into real-world RL, showing superior performance on Franka robot arm grasping tasks. Our official page is released at this https URL.
中文摘要 强化学习（RL）的最新进展通过利用扩散策略的多模态性和探索能力取得了巨大成功。在这些方法中，一个代表性分支专注于基于抽样的策略优化。该设计使扩散模型的探索能力更佳，尤其是在训练初期，但Q值信息利用率较低，导致策略收敛缓慢。另一个分支关注基于梯度的策略优化，该策略充分利用了Q函数的梯度，但往往会崩溃成单模策略，且多样性低。为解决这个问题，我们提出了CGPO，即\textbf{C}ritic-\textbf{G}uided diffusion \textbf{P}olicy \textbf{O}ptimization，它有效地平衡了探索和利用与无训练指导技术的融合，融入扩散政策的去噪过程。具体来说，CGPO引导动作生成向批判网络定义的高价值区域，并将引导动作作为回归目标。通过这种方式，CGPO缩短了获得高质量动作所需的时间，并在探索与利用权衡之间更好地平衡了最终性能。我们验证了CGPO在5项MuJoCo运动任务中的有效性，CGPO在现有基于扩散的强化学习方法中实现了最先进的性能。值得注意的是，CGPO是首个成功将扩散策略融入现实现实强化学习的技术，其在Franka机器人手臂抓取任务中表现更优。我们的官方页面通过这个 https URL 发布。