Arxiv Papers of Today

Generated: 2026-04-20 17:59:54 (UTC+8); arXiv announcement time: 2026-04-20 20:00 EDT (2026-04-21 08:00 UTC+8)

23 relevant papers today

Keyword: reinforcement learning

InfoChess: A Game of Adversarial Inference and a Laboratory for Quantifiable Information Control

InfoChess：一场对抗推理的游戏与可量化信息控制的实验室

Authors: Kieran A. Murphy
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.15373
Pdf link: https://arxiv.org/pdf/2604.15373
Abstract We propose InfoChess, a symmetric adversarial game that elevates competitive information acquisition to the primary objective. There is no piece capture, removing material incentives that would otherwise confound the role of information. Instead, pieces are used to alter visibility. Players are scored on their probabilistic inference of the opponent's king location over the duration of the game. To explore the space of strategies for playing InfoChess, we introduce a hierarchy of heuristic agents defined by increasing levels of opponent modeling, and train a reinforcement learning agent that outperforms these baselines. Leveraging the discrete structure of the game, we analyze gameplay through natural information-theoretic characterizations that include belief entropy, oracle cross entropy, and predictive log score under the action-induced observation channel. These measures disentangle epistemic uncertainty, calibration mismatch, and uncertainty induced by adversarial movement. The design of InfoChess renders it a testbed for studying multi-agent inference under partial observability. We release code for the environment and agents, and a public interface to encourage further study.
中文摘要 我们提出了InfoChess，一种对称对抗游戏，将竞争性信息获取提升为主要目标。没有棋子吃子，消除了本可能混淆信息作用的物质激励。相反，棋子被用来改变可见度。玩家根据对方国王位置的概率推断来得分。为了探索InfoChess的策略空间，我们引入了一个通过提升对手建模层级定义的启发式代理层级，并训练一个性能优于这些基线的强化学习代理。利用博弈的离散结构，我们通过自然信息论特征分析游戏，包括信念熵、神谕交叉熵以及动作诱导观察通道下的预测对数分数。这些指标解开了认知不确定性、校准不匹配以及由对抗性移动引起的不确定性。InfoChess的设计使其成为研究多智能体在部分可观测性下推断的测试平台。我们发布环境和代理代码，并设有公开界面以鼓励进一步研究。
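
The information-theoretic measures named in the abstract can be illustrated with textbook definitions. The sketch below is not the paper's code: the function names, the flat 64-square board, and the exact form of each measure are assumptions made for illustration, and the paper's "oracle cross entropy" may be defined differently.

```python
import math

def belief_entropy(belief):
    """Shannon entropy, in bits, of a belief over candidate king squares."""
    return -sum(p * math.log2(p) for p in belief if p > 0)

def log_score(belief, true_square):
    """Log probability, in bits, assigned to the square the king actually occupies.

    Raises ValueError if the belief puts zero mass on that square.
    """
    return math.log2(belief[true_square])

def cross_entropy(reference, belief):
    """Cross entropy H(reference, belief) in bits (generic definition)."""
    return -sum(q * math.log2(p) for q, p in zip(reference, belief) if q > 0)

# Hypothetical example: a player with no information holds a uniform belief.
uniform = [1 / 64] * 64
one_hot = [1.0 if i == 10 else 0.0 for i in range(64)]

print(belief_entropy(uniform))          # maximal uncertainty over 64 squares
print(log_score(uniform, 10))           # score of the uniform belief at the true square
print(cross_entropy(one_hot, uniform))  # cost of the belief against a known location
```

A sharper belief that concentrates mass on the true square lowers the entropy and raises the log score, which is the kind of trade-off the abstract's measures are meant to separate.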

Beyond Single-Model Optimization: Preserving Plasticity in Continual Reinforcement Learning

超越单模型优化：在持续强化学习中保持可塑性

Authors: Lute Lillo, Nick Cheney
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/2604.15414
Pdf link: https://arxiv.org/pdf/2604.15414
Abstract Continual reinforcement learning must balance retention with adaptation, yet many methods still rely on single-model preservation, committing to one evolving policy as the main reusable solution across tasks. Even when a previously successful policy is retained, it may no longer provide a reliable starting point for rapid adaptation after interference, reflecting a form of loss of plasticity that single-policy preservation cannot address. Inspired by quality-diversity methods, we introduce TeLAPA (Transfer-Enabled Latent-Aligned Policy Archives), a continual RL framework that organizes behaviorally diverse policy neighborhoods into per-task archives and maintains a shared latent space so that archived policies remain comparable and reusable under non-stationary drift. This perspective shifts continual RL from retaining isolated solutions to maintaining skill-aligned neighborhoods with competent and behaviorally related policies that support future relearning. In our MiniGrid CL setting, TeLAPA learns more tasks successfully, recovers competence faster on revisited tasks after interference, and retains higher performance across a sequence of tasks. Our analyses show that source-optimal policies are often not transfer-optimal, even within a local competent neighborhood, and that effective reuse depends on retaining and selecting among multiple nearby alternatives rather than collapsing them to one representative. Together, these results reframe continual RL around reusable and competent policy neighborhoods, providing a route beyond single-model preservation toward more plastic lifelong agents.
中文摘要 持续强化学习必须在保留与适应之间取得平衡，但许多方法仍然依赖单模型保存，即把一个不断演变的策略作为跨任务的主要可重用解决方案。即使保留了先前成功的策略，它在受到干扰后也可能不再是快速适应的可靠起点，这反映出单一策略保存无法解决的一种可塑性丧失。受质量多样性方法启发，我们引入了 TeLAPA（Transfer-Enabled Latent-Aligned Policy Archives，支持迁移的潜在对齐策略档案），这是一个持续强化学习框架，将行为多样的策略邻域组织为每个任务的档案，并维护共享潜在空间，使归档策略在非平稳漂移下保持可比性和可复用性。这一视角使持续强化学习从保留孤立的解决方案转向维护技能对齐的邻域，其中包含有能力且行为相关的策略，以支持未来的再学习。在我们的 MiniGrid CL 设置中，TeLAPA 成功学习了更多任务，在受干扰后重访任务时更快恢复能力，并在一系列任务中保持更高性能。我们的分析表明，即使在局部有能力的邻域内，源最优策略往往也并非迁移最优，且有效的复用依赖于保留多个相近的备选策略并从中选择，而非将它们压缩为单一代表。这些结果共同将持续强化学习重新构建为围绕可复用且有能力的策略邻域，为超越单模型保存、实现更具可塑性的终身智能体提供了路径。

Reward Weighted Classifier-Free Guidance as Policy Improvement in Autoregressive Models

奖励加权无分类器引导作为自回归模型中的策略改进

Authors: Alexander Peysakhovich, William Berman
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.15577
Pdf link: https://arxiv.org/pdf/2604.15577
Abstract Consider an auto-regressive model that produces outputs x (e.g., answers to questions, molecules) each of which can be summarized by an attribute vector y (e.g., helpfulness vs. harmlessness, or bio-availability vs. lipophilicity). An arbitrary reward function r(y) encodes tradeoffs between these properties. Typically, tilting the model's sampling distribution to increase this reward is done at training time via reinforcement learning. However, if the reward function changes, re-alignment requires re-training. In this paper, we show that a reward weighted classifier-free guidance (RCFG) can act as a policy improvement operator in this setting, approximating tilting the sampling distribution by the Q function. We apply RCFG to molecular generation, demonstrating that it can optimize novel reward functions at test time. Finally, we show that using RCFG as a teacher and distilling into the base policy to serve as a warm start significantly speeds up convergence for standard RL.
中文摘要 考虑一个自回归模型，产生输出x（例如问题的答案、分子），每个输出都可以用属性向量y来总结（例如，帮助性与无害性，或生物利用度与亲脂性）。任意奖励函数r（y）编码了这些属性之间的权衡。通常，在训练时通过强化学习来倾斜模型的抽样分布以增加奖励。然而，如果奖励功能发生变化，重新对齐就需要重新训练。本文展示了奖励加权无分类器指导（RCFG）在此情境下可作为策略改进算子，近似通过Q函数倾斜抽样分布。我们将RCFG应用于分子生成，证明其在测试时能够优化新颖的奖励函数。最后，我们展示了将RCFG作为教师，并将基础策略提炼为暖启，显著加快标准强化学习的收敛速度。

"Excuse me, may I say something..." CoLabScience, A Proactive AI Assistant for Biomedical Discovery and LLM-Expert Collaborations

“打扰一下，我能说句话吗......”CoLabScience，一款主动的AI助手，促进生物医学发现和LLM专家协作

Authors: Yang Wu, Jinhong Yu, Jingwei Xiong, Zhimin Tao, Xiaozhong Liu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.15588
Pdf link: https://arxiv.org/pdf/2604.15588
Abstract The integration of Large Language Models (LLMs) into scientific workflows presents exciting opportunities to accelerate biomedical discovery. However, the reactive nature of LLMs, which respond only when prompted, limits their effectiveness in collaborative settings that demand foresight and autonomous engagement. In this study, we introduce CoLabScience, a proactive LLM assistant designed to enhance biomedical collaboration between AI systems and human experts through timely, context-aware interventions. At the core of our method is PULI (Positive-Unlabeled Learning-to-Intervene), a novel framework trained with a reinforcement learning objective to determine when and how to intervene in streaming scientific discussions, by leveraging the team's project proposal and long- and short-term conversational memory. To support this work, we introduce BSDD (Biomedical Streaming Dialogue Dataset), a new benchmark of simulated research discussion dialogues with intervention points derived from PubMed articles. Experimental results show that PULI significantly outperforms existing baselines in both intervention precision and collaborative task utility, highlighting the potential of proactive LLMs as intelligent scientific assistants.
中文摘要 大型语言模型（LLMs）融入科学工作流程为加速生物医学发现带来了令人振奋的机会。然而，LLMs的反应性仅在被提示时才响应，限制了其在需要前瞻性和自主参与的协作环境中的有效性。本研究介绍了CoLabScience，一款主动式大型语言模型助手，旨在通过及时、情境感知的干预，增强AI系统与人类专家之间的生物医学协作。我们方法的核心是PULI（积极-无标签学习干预），这是一个以强化学习为目标训练的新框架，利用团队的项目提案和长短期会话记忆，确定何时以及如何介入科学讨论。为支持这项工作，我们引入了BSDD（生物医学流式对话数据集），这是一个基于PubMed文章干预点的模拟研究讨论对话新基准。实验结果显示，PULI在干预精度和协作任务效用方面均显著优于现有基线，凸显了主动大型语言模型作为智能科学助理的潜力。

CSLE: A Reinforcement Learning Platform for Autonomous Security Management

CSLE：自主安全管理的强化学习平台

Authors: Kim Hammar
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.15590
Pdf link: https://arxiv.org/pdf/2604.15590
Abstract Reinforcement learning is a promising approach to autonomous and adaptive security management in networked systems. However, current reinforcement learning solutions for security management are mostly limited to simulation environments and it is unclear how they generalize to operational systems. In this paper, we address this limitation by presenting CSLE: a reinforcement learning platform for autonomous security management that enables experimentation under realistic conditions. Conceptually, CSLE encompasses two systems. First, it includes an emulation system that replicates key components of the target system in a virtualized environment. We use this system to gather measurements and logs, based on which we identify a system model, such as a Markov decision process. Second, it includes a simulation system where security strategies are efficiently learned through simulations of the system model. The learned strategies are then evaluated and refined in the emulation system to close the gap between theoretical and operational performance. We demonstrate CSLE through four use cases: flow control, replication control, segmentation control, and recovery control. Through these use cases, we show that CSLE enables near-optimal security management in an environment that approximates an operational system.
中文摘要 强化学习是网络系统中自主和自适应安全管理的一种有前景的方法。然而，当前用于安全管理的强化学习解决方案大多限于仿真环境，目前尚不清楚它们如何推广到运维系统。本文通过介绍CSLE来解决这一局限：一个用于自主安全管理的强化学习平台，能够在现实条件下进行实验。从概念上讲，CSLE包含两个系统。首先，它包含一个模拟系统，在虚拟化环境中复制目标系统的关键组件。我们利用该系统收集测量数据和日志，基于此识别系统模型，如马尔可夫决策过程。其次，它包含一个仿真系统，通过系统模型的模拟高效学习安全策略。学习到的策略随后在仿真系统中被评估和完善，以缩小理论性能与操作性能之间的差距。我们通过四个用例来展示CSLE：流量控制、复制控制、分段控制和恢复控制。通过这些用例，我们展示了CSLE能够在接近运营系统的环境中实现近乎最优的安全管理。

Flexible Empowerment at Reasoning with Extended Best-of-N Sampling

基于扩展 Best-of-N 采样的推理时灵活赋能

Authors: Taisuke Kobayashi
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.15614
Pdf link: https://arxiv.org/pdf/2604.15614
Abstract This paper proposes a novel method that incorporates empowerment when reasoning about actions in reinforcement learning (RL), thereby enabling flexible handling of the exploration-exploitation dilemma (EED). In previous methods, empowerment for promoting exploration has been added as a bonus term to the task-specific reward function, in the manner of intrinsically motivated RL. However, this approach introduces a delay until the policy that accounts for empowerment is learned, making it difficult to adjust the emphasis on exploration as needed. On the other hand, a trick devised for fine-tuning recent foundation models at reasoning time, so-called best-of-N (BoN) sampling, allows for the implicit acquisition of modified policies without explicitly learning them. It is expected that applying this trick to exploration-promoting terms, such as empowerment, will enable more flexible adjustment of EED. Therefore, this paper investigates BoN sampling for empowerment. Furthermore, to adjust the degree of policy modification in a generalizable manner while maintaining computational cost, this paper proposes a novel BoN sampling method extended by Tsallis statistics. Through toy problems, the proposed method's capability to balance EED is verified. In addition, it is demonstrated that the proposed method improves RL performance on complex locomotion tasks.
中文摘要 本文提出了一种新方法，在强化学习（RL）中推理动作时引入赋能（empowerment），从而灵活应对探索-利用困境（EED）。在以往方法中，用于促进探索的赋能以内在动机强化学习的方式，作为附加项加入任务特定的奖励函数。然而，这种方法在学到考虑赋能的策略之前存在延迟，使得难以根据需要调整对探索的侧重。另一方面，一种为在推理时微调近期基础模型而设计的技巧，即所谓的 best-of-N（BoN）采样，允许隐式获得修改后的策略，而无需显式学习它们。预计将这一技巧应用于赋能等促进探索的项，将使 EED 的调整更加灵活。因此，本文研究了用于赋能的 BoN 采样。此外，为了在保持计算成本的同时以可推广的方式调整策略修改程度，本文提出了一种由 Tsallis 统计扩展的新型 BoN 采样方法。通过玩具问题，验证了所提方法平衡 EED 的能力。此外，实验表明所提方法提升了强化学习在复杂运动任务上的性能。
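
Plain best-of-N sampling, the starting point the abstract builds on, can be sketched as follows. This is a generic illustration rather than the paper's Tsallis-extended method: `best_of_n`, the toy policy, and the toy score (a stand-in for reward plus an exploration bonus) are all hypothetical names chosen here.

```python
import random

def best_of_n(sample_action, score, n, rng):
    """Draw n candidate actions from the base policy and keep the best-scoring one."""
    candidates = [sample_action(rng) for _ in range(n)]
    return max(candidates, key=score)

# Hypothetical base policy: uniform over ten discrete actions.
def uniform_policy(rng):
    return rng.randrange(10)

# Hypothetical score standing in for "reward + exploration bonus".
def toy_score(action):
    return action

rng = random.Random(0)
for n in (1, 4, 16):
    # n = 1 reproduces the base policy; larger n shifts choices toward high scores.
    print(n, best_of_n(uniform_policy, toy_score, n, rng))
```

No parameters are updated here: the modified behaviour comes only from resampling, which is why the emphasis on the bonus can be changed at decision time by changing `n` or the score.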

Majority Voting for Code Generation

代码生成多数投票

Authors: Tim Launer, Jonas Hübotter, Marco Bagatella, Ido Hakimi, Andreas Krause
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.15618
Pdf link: https://arxiv.org/pdf/2604.15618
Abstract We investigate Functional Majority Voting (FMV), a method based on functional consensus for code generation with Large Language Models, which identifies a representative solution from multiple generations using their runtime execution signatures on test inputs. We find that FMV is an effective test-time inference strategy, substantially boosting performance on LiveCodeBench without a large compute overhead. Furthermore, we extend the utility of functional consensus and apply it as an aggregation strategy for label-free Test-Time Reinforcement Learning. We demonstrate that this increases pass@1 on holdout tasks, but find no evidence of self-improvement beyond the base model's performance ceiling.
中文摘要 我们研究功能多数投票（FMV），这是一种基于功能共识的大型语言模型代码生成方法，它利用多个生成结果在测试输入上的运行时执行签名，从中识别出具有代表性的解。我们发现 FMV 是一种有效的测试时推理策略，可大幅提升在 LiveCodeBench 上的性能，且不带来大的计算开销。此外，我们扩展了功能共识的用途，将其作为无标签测试时强化学习的聚合策略。我们证明这提高了留出任务上的 pass@1，但未发现超出基础模型性能上限的自我提升证据。
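
The idea of voting on execution signatures can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: candidates are modelled as plain Python callables (real generated programs would be run in a sandbox), and the function names and the first-seen tie-breaking rule are choices made here.

```python
from collections import defaultdict

def execution_signature(func, test_inputs):
    """Run a candidate on every test input and record outputs (or error types)."""
    signature = []
    for x in test_inputs:
        try:
            signature.append(("ok", repr(func(x))))
        except Exception as exc:  # a crash is part of the observable behaviour
            signature.append(("error", type(exc).__name__))
    return tuple(signature)

def functional_majority_vote(candidates, test_inputs):
    """Return the index of a candidate from the largest behaviour cluster."""
    clusters = defaultdict(list)
    for idx, func in enumerate(candidates):
        clusters[execution_signature(func, test_inputs)].append(idx)
    # Ties go to the cluster that appeared first (dicts keep insertion order).
    largest = max(clusters.values(), key=len)
    return largest[0]

# Hypothetical generations for "double the input": two agree, one does not.
candidates = [lambda x: x * 2, lambda x: x + x, lambda x: x ** 2]
print(functional_majority_vote(candidates, [1, 2, 3]))
```

Only test inputs are needed, not expected outputs, which is what makes the consensus usable as a label-free signal.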

Hierarchical Active Inference using Successor Representations

使用后继表示的层级主动推理

Authors: Prashant Rangarajan, Rajesh P. N. Rao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.15679
Pdf link: https://arxiv.org/pdf/2604.15679
Abstract Active inference, a neurally-inspired model for inferring actions based on the free energy principle (FEP), has been proposed as a unifying framework for understanding perception, action, and learning in the brain. Active inference has previously been used to model ecologically important tasks such as navigation and planning, but scaling it to solve complex large-scale problems in real-world environments has remained a challenge. Inspired by the existence of multi-scale hierarchical representations in the brain, we propose a model for planning of actions based on hierarchical active inference. Our approach combines a hierarchical model of the environment with successor representations for efficient planning. We present results demonstrating (1) how lower-level successor representations can be used to learn higher-level abstract states, (2) how planning based on active inference at the lower-level can be used to bootstrap and learn higher-level abstract actions, and (3) how these learned higher-level abstract states and actions can facilitate efficient planning. We illustrate the performance of the approach on several planning and reinforcement learning (RL) problems including a variant of the well-known four rooms task, a key-based navigation task, a partially observable planning problem, the Mountain Car problem, and PointMaze, a family of navigation tasks with continuous state and action spaces. Our results represent, to our knowledge, the first application of learned hierarchical state and action abstractions to active inference in FEP-based theories of brain function.
中文摘要 主动推理是一种基于自由能原理（FEP）的神经启发推断行为模型，被提出作为理解大脑中感知、行为和学习的统一框架。主动推理此前已被用于建模生态重要任务，如导航和规划，但将其扩展到解决现实环境中复杂且大规模的问题仍是一大挑战。受大脑中多尺度层级表征的启发，我们提出了基于层级主动推断的行动规划模型。我们的方法结合了环境的层级模型和后续表示，以实现高效的规划。我们展示了（1）如何利用低层次继继表示学习更高层次的抽象状态，（2）如何基于主动推理的规划用于自助和学习更高层次的抽象动作，以及（3）这些已学习的高级抽象状态和动作如何促进高效规划。我们展示了该方法在多个规划与强化学习（RL）问题上的表现，包括著名的四房间任务变体、基于密钥的导航任务、部分可观察的规划问题、山地车问题，以及具有连续状态空间和动作空间的导航任务家族PointMaze。据我们所知，我们的结果是首次将学习到的层级状态和动作抽象应用于基于FEP的脑功能理论中的主动推断。

Multi-objective Reinforcement Learning With Augmented States Requires Rewards After Deployment

使用增广状态的多目标强化学习在部署后仍需要奖励

Authors: Peter Vamplew, Cameron Foale
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.15757
Pdf link: https://arxiv.org/pdf/2604.15757
Abstract This research note identifies a previously overlooked distinction between multi-objective reinforcement learning (MORL) and more conventional single-objective reinforcement learning (RL). It has previously been noted that the optimal policy for an MORL agent with a non-linear utility function must be conditioned on both the current environmental state and some measure of the previously accrued reward. This is generally implemented by concatenating the observed state of the environment with the discounted sum of previous rewards to create an augmented state. While augmented states have been widely used in the MORL literature, one implication of their use has not previously been reported: they require the agent to have continued access to the reward signal (or a proxy thereof) after deployment, even if no further learning is required. This note explains why this is the case and considers the practical repercussions of this requirement.
中文摘要 本研究报告指出了多目标强化学习（MORL）与更传统的单目标强化学习（RL）之间此前被忽视的一个区别。此前已有观点指出，具有非线性效用函数的MORL代理的最优策略必须同时基于当前环境状态和先前获得奖励的某种度量来衡量。这通常通过将观察到的环境状态与之前奖励的折现总和连接起来，形成增强状态来实现。虽然增强态在MORL文献中被广泛使用，但其使用的一个前所未有的含义是，即使不需要进一步学习，增强态仍要求代理人在部署后继续访问奖励信号（或其代理）。本说明解释了原因，并考虑了这一要求的实际影响。
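
The augmented-state construction described in the abstract can be sketched in a few lines. The helper names and the list-based representation are assumptions for illustration; the point is only that the accrued-reward part of the policy input cannot be updated without observing rewards.

```python
def update_accrued(accrued, reward, gamma, t):
    """Add the step-t reward vector, discounted by gamma**t, to the running sum."""
    return [a + (gamma ** t) * r for a, r in zip(accrued, reward)]

def augment(state, accrued):
    """Concatenate the observed state with the accrued reward vector."""
    return list(state) + list(accrued)

# Hypothetical two-objective episode with discount 0.5.
accrued = [0.0, 0.0]
accrued = update_accrued(accrued, [1.0, 2.0], gamma=0.5, t=0)
accrued = update_accrued(accrued, [2.0, 0.0], gamma=0.5, t=1)

# The policy input at the next step depends on rewards seen so far.
print(augment([0.3], accrued))
```

Because `update_accrued` takes the reward as an argument, a deployed policy that consumes `augment(...)` still needs the reward signal (or a proxy) at every step, even when no learning takes place.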

Zero-Shot Scalable Resilience in UAV Swarms: A Decentralized Imitation Learning Framework with Physics-Informed Graph Interactions

无人机集群中的零样本可扩展韧性：具有物理信息图交互的去中心化模仿学习框架

Authors: Huan Lin, Lianghui Ding
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.15762
Pdf link: https://arxiv.org/pdf/2604.15762
Abstract Large-scale unmanned aerial vehicle (UAV) failures can split a UAV swarm network into disconnected sub-networks, making decentralized recovery both urgent and difficult. Centralized recovery methods depend on global topology information and become communication-heavy after severe fragmentation. Decentralized heuristics and multi-agent reinforcement learning methods are easier to deploy, but their performance often degrades when the swarm scale and damage severity vary. We present the Physics-informed Graph Adversarial Imitation Learning algorithm (PhyGAIL), which adopts centralized training with decentralized execution. PhyGAIL builds bounded local interaction graphs from heterogeneous observations and uses a physics-informed graph neural network to encode directional local interactions as gated message passing with explicit attraction and repulsion. This gives the policy a physically grounded coordination bias while keeping local observations scale-invariant. It also uses scenario-adaptive imitation learning to improve training under fragmented topologies and variable-length recovery episodes. Our analysis establishes bounded local graph amplification, bounded interaction dynamics, and controlled variance of the terminal success signal. A policy trained on 20-UAV swarms transfers directly to swarms of up to 500 UAVs without fine-tuning, and achieves better performance across reconnection reliability, recovery speed, motion safety, and runtime efficiency than representative baselines.
中文摘要 大规模无人机（UAV）故障可能将无人机群网络分裂成分散的子网络，使得分散式的恢复既紧迫又困难。集中式恢复方法依赖全局拓扑信息，严重碎片化后通信量大。去中心化启发式和多智能体强化学习方法更容易部署，但当群体规模和损害严重程度变化时，其性能往往会下降。我们提出了采用集中训练和去中心化执行的物理知情图对抗模仿学习算法（PhyGAIL）。PhyGAIL通过异构观测构建有界局部交互图，并利用物理知情的图神经网络将方向性局部交互编码为带有明确吸引和排斥的门控信息传递。这使得该政策具有物理基础的协调偏向，同时保持局部观测的尺度不变性。它还利用场景自适应模仿学习，在碎片化拓扑和可变长度恢复阶段下提升训练水平。我们的分析建立了有界局部图放大、有界交互动态以及终端成功信号的受控方差。基于20架无人机群训练的政策，可直接转移到最多500架无人机的蜂群，无需微调，且在重连可靠性、恢复速度、运动安全和运行效率方面均优于代表性基线。

Fuzzy Logic Theory-based Adaptive Reward Shaping for Robust Reinforcement Learning (FARS)

基于模糊逻辑理论的自适应奖励塑造用于鲁棒强化学习（FARS）

Authors: Hürkan Şahin, Van Huyen Dang, Erdi Sayar, Alper Yegenoglu, Erdal Kayacan
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.15772
Pdf link: https://arxiv.org/pdf/2604.15772
Abstract Reinforcement learning (RL) often struggles in real-world tasks with high-dimensional state spaces and long horizons, where sparse or fixed rewards severely slow down exploration and cause agents to get trapped in local optima. This paper presents a fuzzy-logic-based reward shaping method that integrates human intuition into RL reward design. By encoding expert knowledge into adaptive and interpretable terms, fuzzy rules promote stable learning and reduce sensitivity to hyperparameters. The proposed method leverages these properties to adapt reward contributions based on the agent state, enabling smoother transitions between fast motion and precise control in challenging navigation tasks. Extensive simulation results on autonomous drone racing benchmarks show stable learning behavior and consistent task performance across scenarios of increasing difficulty. The proposed method achieves faster convergence and reduced performance variability across training seeds in more challenging environments, with success rates improving by up to approximately 5 percent compared to non-fuzzy reward formulations.
中文摘要 强化学习（RL）在现实任务中常常遇到困难，面对高维状态空间和漫长视野，稀疏或固定的奖励严重减缓探索速度，使智能体陷入局部最优状态。本文提出了一种基于模糊逻辑的奖励塑造方法，将人类直觉整合进强化学习的奖励设计中。通过将专家知识编码为可适应且可互换的术语，模糊规则促进稳定学习并降低对超参数的敏感性。所提方法利用这些特性，根据智能体状态调整奖励贡献，实现快速运动与精准控制之间的平滑过渡，适应具有挑战性的导航任务。自主无人机竞速基准测试的广泛模拟结果显示，在难度递增的场景下，学习行为稳定且任务表现一致。该方法在更具挑战性的环境中实现了更快的收敛和降低训练种子间的性能变异，成功率相比非模糊奖励表述提升了约5%。

Scattered Hypothesis Generation for Open-Ended Event Forecasting

开放式事件预测中的散点假设生成

Authors: He Chang, Zhulin Tao, Lifang Yang, Xianglin Huang, Yunshan Ma
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2604.15788
Pdf link: https://arxiv.org/pdf/2604.15788
Abstract Despite the importance of open-ended event forecasting for risk management, current LLM-based methods predominantly target only the most probable outcomes, neglecting the intrinsic uncertainty of real-world events. To bridge this gap, we advance open-ended event forecasting from pinpoint forecasting to scatter forecasting by introducing the proxy task of hypothesis generation. This paradigm aims to generate an inclusive and diverse set of hypotheses that broadly cover the space of plausible future events. To this end, we propose SCATTER, a reinforcement learning framework that jointly optimizes inclusiveness and diversity of the hypothesis. Specifically, we design a novel hybrid reward that consists of three components: 1) a validity reward that measures semantic alignment with observed events, 2) an intra-group diversity reward to encourage variation within sampled responses, and 3) an inter-group diversity reward to promote exploration across distinct modes. By integrating the validity-gated score into the overall objective, we confine the exploration of wildly diversified outcomes to contextually plausible futures, preventing the mode collapse issue. Experiments on two real-world benchmark datasets, i.e., OpenForecast and OpenEP, demonstrate that SCATTER significantly outperforms strong baselines. Our code is available at this https URL.
中文摘要 尽管开放式事件预测对风险管理至关重要，当前基于LLM的方法主要只针对最可能的结果，忽视了现实事件的内在不确定性。为弥合这一差距，我们通过引入假设生成这一代理任务，将开放式事件预测从精确预测推进到散点预测。该范式旨在生成一套包容且多样化的假设，广泛涵盖未来可能事件的空间。为此，我们提出了SCATTER，一个联合优化假设包容性和多样性的强化学习框架。具体来说，我们设计了一种新型混合奖励，包含三个组成部分：1）有效性奖励，衡量与观察事件的语义一致性；2）组内多样性奖励，鼓励采样响应内部的变化；3）组间多样性奖励，促进跨不同模式的探索。通过将有效性门控分数纳入整体目标，我们将对高度多样化结果的探索限制在情境合理的未来中，避免了模式崩溃问题。在两个真实世界基准数据集（OpenForecast 和 OpenEP）上的实验表明，SCATTER 的表现显著优于强基线。我们的代码可在此 https URL 访问。

Placing Puzzle Pieces Where They Matter: A Question Augmentation Framework for Reinforcement Learning

将拼图碎片放在关键位置：强化学习中的问题增强框架

Authors: Yangyi Fang, Jiaye Lin, Xiaoliang Fu, Cong Qin, Haolin Shi
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.15830
Pdf link: https://arxiv.org/pdf/2604.15830
Abstract Reinforcement learning has become a powerful approach for enhancing large language model reasoning, but faces a fundamental dilemma: training on easy problems can cause overfitting and pass@k degradation, while training on hard problems often results in sparse rewards. Recent question augmentation methods address this by prepending partial solutions as hints. However, uniform hint provision may introduce redundant information while missing critical reasoning bottlenecks, and excessive hints can reduce reasoning diversity, causing pass@k degradation. We propose PieceHint, a hint injection framework that strategically identifies and provides critical reasoning steps during training. By scoring the importance of different reasoning steps, selectively allocating hints based on problem difficulty, and progressively withdrawing scaffolding, PieceHint enables models to transition from guided learning to independent reasoning. Experiments on six mathematical reasoning benchmarks show that our 1.5B model achieves comparable average performance to 32B baselines while preserving pass@k diversity across all k values.
中文摘要 强化学习已成为提升大型语言模型推理能力的有力方法，但面临一个根本困境：在简单问题上训练可能导致过拟合和 pass@k 退化，而在困难问题上训练往往导致奖励稀疏。最近的问题增强方法通过将部分解答作为提示前置到问题之前来解决这一点。然而，统一地提供提示可能引入冗余信息，同时遗漏关键的推理瓶颈，而过多的提示会降低推理多样性，导致 pass@k 退化。我们提出了 PieceHint，一种提示注入框架，在训练过程中有策略地识别并提供关键推理步骤。通过对不同推理步骤的重要性评分、根据问题难度选择性分配提示，并逐步撤除脚手架，PieceHint 使模型能够从引导学习过渡到独立推理。在六个数学推理基准上的实验表明，我们的 1.5B 模型取得了与 32B 基线相当的平均性能，同时在所有 k 值上保持了 pass@k 多样性。

CoEvolve: Training LLM Agents via Agent-Data Mutual Evolution

CoEvolve：通过代理-数据相互演进训练LLM代理

Authors: Shidong Yang, Ziyu Ma, Tongwen Huang, Yiming Hu, Yong Wang, Xiangxiang Chu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.15840
Pdf link: https://arxiv.org/pdf/2604.15840
Abstract Reinforcement learning for LLM agents is typically conducted on a static data distribution, which fails to adapt to the agent's evolving behavior and leads to poor coverage of complex environment interactions. To address these challenges, we propose CoEvolve, an agent-data mutual evolution framework that enables LLM agents to improve through closed-loop, interaction-driven training. Specifically, CoEvolve extracts feedback signals such as forgetting and uncertainty from rollout trajectories to identify failure-prone interaction patterns, and utilizes them to guide LLM-based task synthesis. The synthesized tasks are validated through environment interaction and utilized to update the data distribution, enabling joint adaptation of the agent and its data. Extensive experiments on AppWorld and BFCL across Qwen2.5-7B, Qwen3-4B, and Qwen3-30B-A3B demonstrate consistent and significant improvements over strong base models, yielding absolute gains of 19.43%, 15.58%, and 18.14%, respectively.
中文摘要 LLM代理的强化学习通常在静态数据分布上进行，该分布无法适应代理不断变化的行为，导致复杂环境交互覆盖较差。为应对这些挑战，我们提出了CoEvolve，这是一个主体-数据相互进化的框架，使LLM代理通过闭环、交互驱动的训练实现改进。具体来说，CoEvolve从推广轨迹中提取遗忘和不确定性等反馈信号，识别易失败交互模式，并利用这些信号指导基于LLM的任务综合。综合任务通过环境交互进行验证，并用于更新数据分布，实现智能体及其数据的联合适应。在 Qwen2.5-7B、Qwen3-4B 和 Qwen3-30B-A3B 的 AppWorld 和 BFCL 上进行了大量实验，显示出相较于强基模型的持续且显著的改进，分别获得了 19.43%、15.58% 和 18.14% 的绝对提升。

AgentV-RL: Scaling Reward Modeling with Agentic Verifier

AgentV-RL：利用智能体验证器扩展奖励建模

Authors: Jiazheng Zhang, Ziche Fu, Zhiheng Xi, Wenqing Jing, Mingxu Chai, Wei He, Guoqiang Zhang, Chenghao Fan, Chenxin An, Wenxiang Chen, Zhicheng Liu, Haojie Pan, Dingwei Zhu, Tao Gui, Qi Zhang, Xuanjing Huang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.16004
Pdf link: https://arxiv.org/pdf/2604.16004
Abstract Verifiers have been demonstrated to enhance LLM reasoning via test-time scaling (TTS). Yet, they face significant challenges in complex domains. Error propagation from incorrect intermediate reasoning can lead to false positives for seemingly plausible solutions, while lacking external grounding makes verifiers unreliable on computation or knowledge-intensive tasks. To address these challenges, we propose Agentic Verifier, a framework that transforms reward modeling into a multi-turn, tool-augmented deliberative process. We introduce complementary forward and backward agents: one traces solutions from premises to conclusions, while the other re-checks conclusions against their underlying premises. This bidirectional process enables a comprehensive, reliable, and interpretable assessment of solutions. To facilitate practical deployment, we propose AgentV-RL. Through proactive exploration and reinforcement learning, the verifier autonomously interleaves tool-use with internal reasoning. Extensive experiments show that Agentic Verifier yields consistent performance gains under both parallel and sequential TTS. Notably, our 4B variant surpasses state-of-the-art ORMs by 25.2%, positioning it as a promising paradigm for agentic reward modeling.
中文摘要 验证器已被证明可以通过测试时间缩放（TTS）增强大型语言模型的推理能力。然而，他们在复杂领域面临重大挑战。错误的中间推理导致的误差传播可能导致看似合理的解出现假阳性，而缺乏外部基础则使验证器在计算或知识密集型任务中不可靠。为应对这些挑战，我们提出了代理验证器（Agentic Verifier）框架，将奖励建模转变为多轮次、工具辅助的审议过程。我们引入互补的正向和后向代理：一个将解从前提追溯到结论，另一个则重新核对结论与其基础前提。这种双向过程使解决方案能够全面、可靠且易于理解。为便于实际部署，我们提出了AgentV-RL。通过主动探索和强化学习，验证者自主地将工具使用与内在推理交织在一起。大量实验表明，智能验证器在并行和顺序TTS下都能持续提升性能。值得注意的是，我们的4B变体比最先进的ORM高出25.2%，使其成为能动奖励建模的有前景范式。

Safe Deep Reinforcement Learning for Building Heating Control and Demand-side Flexibility

安全深度强化学习，用于建筑供暖控制和需求侧灵活性

Authors: Colin Jüni, Mina Montazeri, Yi Guo, Federica Bellizio, Giovanni Sansavini, Philipp Heer
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.16033
Pdf link: https://arxiv.org/pdf/2604.16033
Abstract Buildings account for approximately 40% of global energy consumption, and with the growing share of intermittent renewable energy sources, enabling demand-side flexibility, particularly in heating, ventilation and air conditioning systems, is essential for grid stability and energy efficiency. This paper presents a safe deep reinforcement learning-based control framework to optimize building space heating while enabling demand-side flexibility provision for power system operators. A deep deterministic policy gradient algorithm is used as the core deep reinforcement learning method, enabling the controller to learn an optimal heating strategy through interaction with the building thermal model while maintaining occupant comfort, minimizing energy cost, and providing flexibility. To address safety concerns with reinforcement learning, particularly regarding compliance with flexibility requests, we propose a real-time adaptive safety-filter to ensure that the system operates within predefined constraints during demand-side flexibility provision. The proposed real-time adaptive safety filter guarantees full compliance with flexibility requests from system operators and improves energy and cost efficiency -- achieving up to 50% savings compared to a rule-based controller -- while outperforming a standalone deep reinforcement learning-based controller in energy and cost metrics, with only a slight increase in comfort temperature violations.
中文摘要 建筑物约占全球能源消耗的40%，随着间歇性可再生能源份额的增加，尤其是在供暖、通风和空调系统方面，实现需求侧灵活性对于电网稳定性和能源效率至关重要。本文提出了一个安全的深度强化学习控制框架，旨在优化建筑空间供暖，同时为电力系统操作员提供需求侧灵活性。采用深度确定性策略梯度算法作为核心深度强化学习方法，使控制器能够通过与建筑热模型交互学习最佳供暖策略，同时保持居住者舒适度，降低能源成本并提供灵活性。为了解决强化学习的安全问题，特别是关于灵活性请求的合规性，我们提出了一种实时自适应安全过滤器，以确保系统在需求侧灵活性提供期间运行在预设约束内。拟议的实时自适应安全过滤器保证完全满足系统操作员的灵活性要求，提升能源效率和成本效益——相比基于规则的控制器节省了高达50%的节能率，同时在能耗和成本指标上优于独立的深度强化基于学习的控制器，舒适温度违规仅略有增加。

Beyond One-Size-Fits-All: Adaptive Test-Time Augmentation for Sequential Recommendation

超越一刀切：面向序列推荐的自适应测试时增强

Authors: Xibo Li, Liang Zhang
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2604.16121
Pdf link: https://arxiv.org/pdf/2604.16121
Abstract Test-time augmentation (TTA) has become a promising approach for mitigating data sparsity in sequential recommendation by improving inference accuracy without requiring costly model retraining. However, existing TTA methods typically rely on uniform, user-agnostic augmentation strategies. We show that this "one-size-fits-all" design is inherently suboptimal, as it neglects substantial behavioral heterogeneity across users, and we empirically demonstrate, for the first time, that the optimal augmentation operators vary significantly across user sequences with different characteristics. To address this limitation, we propose AdaTTA, a plug-and-play reinforcement learning-based adaptive inference framework that learns to select augmentation operators on a per-sequence basis. We formulate augmentation selection as a Markov Decision Process and introduce an Actor-Critic policy network with hybrid state representations and a joint macro-rank reward design to dynamically determine the optimal operator for each input user sequence. Extensive experiments on four real-world datasets and two recommendation backbones demonstrate that AdaTTA consistently outperforms the best fixed-strategy baselines, achieving up to 26.31% relative improvement on the Home dataset while incurring only moderate computational overhead.
中文摘要 测试时增强（TTA）已成为一种有前景的方法，它通过提高推理准确性来缓解序列推荐中的数据稀疏问题，而无需昂贵的模型重新训练。然而，现有的TTA方法通常依赖于统一且与用户无关的增强策略。我们证明了这种“一刀切”设计本质上是次优的，因为它忽视了用户间显著的行为异质性，并首次通过实证表明，最优增强算子在具有不同特征的用户序列之间存在显著差异。为解决这一局限，我们提出了AdaTTA，一种即插即用的基于强化学习的自适应推理框架，能够逐序列地学习选择针对该序列的增强算子。我们将增强选择建模为马尔可夫决策过程，并引入了具有混合状态表示和联合宏观-排序奖励设计的Actor-Critic策略网络，以动态确定每个输入用户序列的最优算子。在四个真实世界数据集和两个推荐骨干模型上的大量实验表明，AdaTTA始终优于最佳固定策略基线，在Home数据集上取得了高达26.31%的相对提升，同时仅带来中等的计算开销。
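
The contrast between a fixed augmentation strategy and per-sequence selection can be sketched as follows. This is only a toy illustration: the operator names (`identity`, `crop_recent`) and the length-based `stand_in_policy` are assumptions made for the example. The abstract does not list the actual operators, and in AdaTTA the choice is made by a learned Actor-Critic policy rather than a hand-written rule.

```python
def identity(seq):
    """Leave the interaction sequence unchanged."""
    return list(seq)


def crop_recent(seq, keep=3):
    """Keep only the `keep` most recent items."""
    return list(seq[-keep:])


# Hypothetical operator set; the real operators are not given in the abstract.
OPERATORS = {"identity": identity, "crop_recent": crop_recent}


def stand_in_policy(seq):
    """Placeholder for a learned policy: pick an operator name from sequence length."""
    return "identity" if len(seq) <= 3 else "crop_recent"


def augment(seq, policy=stand_in_policy):
    """Apply the operator that the policy selects for this particular sequence."""
    name = policy(seq)
    return name, OPERATORS[name](seq)


short_user = [11, 42]
long_user = [5, 9, 11, 42, 7, 3]
print(augment(short_user))
print(augment(long_user))
```

A fixed-strategy baseline would correspond to a policy that returns the same operator name for every sequence.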

AtManRL: Towards Faithful Reasoning via Differentiable Attention Saliency

AtManRL：通过可微分注意力显著性迈向忠实推理

Authors: Max Henning Höth, Kristian Kersting, Björn Deiseroth, Letitia Parcalabescu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.16158
Pdf link: https://arxiv.org/pdf/2604.16158
Abstract Large language models (LLMs) increasingly rely on chain-of-thought (CoT) reasoning to solve complex tasks. Yet ensuring that the reasoning trace both contributes to and faithfully reflects the processes underlying the model's final answer, rather than merely accompanying it, remains challenging. We introduce AtManRL, a method that leverages differentiable attention manipulation to learn more faithful reasoning through reinforcement learning. By training an additive attention mask that identifies tokens in the CoT crucial for producing correct answers, we derive a saliency reward signal that encourages the model to generate reasoning traces that genuinely influence its final predictions. We integrate this saliency reward with outcome-based rewards within the GRPO framework to jointly optimize for correctness and interpretability. Experiments on GSM8K and MMLU with Llama-3.2-3B-Instruct demonstrate that our approach can identify influential reasoning tokens and enable training more transparent reasoning models.
中文摘要 大型语言模型（LLM）越来越依赖思维链（CoT）推理来解决复杂任务。然而，确保推理轨迹既对模型的最终答案有所贡献，又忠实反映得出该答案的内在过程，而不仅仅是伴随答案出现，仍然具有挑战性。我们提出AtManRL，一种利用可微分注意力操控、通过强化学习学习更忠实推理的方法。通过训练一个加性注意力掩码来识别CoT中对产生正确答案至关重要的词元，我们得到一个显著性奖励信号，鼓励模型生成真正影响其最终预测的推理轨迹。我们在GRPO框架内将这一显著性奖励与基于结果的奖励相结合，以共同优化正确性和可解释性。在GSM8K和MMLU上使用Llama-3.2-3B-Instruct的实验表明，我们的方法能够识别有影响力的推理词元，并有助于训练更透明的推理模型。
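
The "additive attention mask" mentioned in the abstract can be illustrated with a small sketch: a per-token value is added to the raw attention scores before the softmax, so a strongly negative entry suppresses that token. The scores and mask values below are made up for the example, and the way AtManRL trains the mask and turns it into a saliency reward is not modeled here.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention(scores, mask):
    """Add a per-token mask to raw attention scores, then normalize."""
    return softmax([s + m for s, m in zip(scores, mask)])


scores = [1.0, 2.0, 0.5]    # hypothetical raw scores over three CoT tokens
mask = [0.0, -10.0, 0.0]    # strongly suppress the second token
weights = masked_attention(scores, mask)
print(weights)
```

Because the mask enters additively before a differentiable softmax, its entries can in principle be optimized by gradient descent, which is the property the abstract relies on.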

Detecting and Suppressing Reward Hacking with Gradient Fingerprints

利用梯度指纹检测和抑制奖励黑客行为

Authors: Songtao Wang, Quang Hieu Pham, Fangcong Yin, Xinpeng Wang, Jocelyn Qiaochu Chen, Greg Durrett, Xi Ye
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.16242
Pdf link: https://arxiv.org/pdf/2604.16242
Abstract Reinforcement learning with verifiable rewards (RLVR) typically optimizes for outcome rewards without imposing constraints on intermediate reasoning. This leaves training susceptible to reward hacking, where models exploit loopholes (e.g., spurious patterns in training data) in the reward function to achieve high scores without solving the intended task. These reward-hacking behaviors are often implicit, as the intermediate chain-of-thought (CoT) may appear plausible on the surface, limiting the effectiveness of purely text-based monitoring. We propose Gradient Fingerprint (GRIFT), a method for detecting reward hacking using models' internal computations. Given a prompt and a model-generated CoT, GRIFT computes gradients of the CoT conditioned on the prompt and compresses them into a compact representation, which is then used to assess whether the CoT reflects reward hacking behavior. Across verifiable reasoning benchmarks spanning math, code, and logical reasoning, GRIFT substantially outperforms strong baselines, including CoT Monitor and TRACE, achieving over 25% relative improvement in detecting reward hacking behavior. Moreover, integrating GRIFT into the rejection fine-tuning pipeline for reasoning tasks reduces reward hacking and improves performance on the true task objective. Our results highlight a promising direction of leveraging gradient level representations for assessing the quality of CoT reasoning traces. Our code is available at: this https URL.
中文摘要 带有可验证奖励的强化学习（RLVR）通常只优化结果奖励，而不对中间推理施加约束。这使得训练容易出现奖励黑客行为，即模型利用奖励函数中的漏洞（例如训练数据中的虚假模式）来获得高分，而没有真正解决预期任务。这些奖励黑客行为通常是隐性的，因为中间思维链（CoT）表面上可能看似合理，从而限制了纯文本监控的有效性。我们提出了梯度指纹（GRIFT），这是一种利用模型内部计算检测奖励黑客行为的方法。给定提示和模型生成的CoT，GRIFT以提示为条件计算CoT的梯度，并将其压缩成一个紧凑的表示，然后用它来评估CoT是否反映了奖励黑客行为。在涵盖数学、代码和逻辑推理的可验证推理基准上，GRIFT显著优于包括CoT Monitor和TRACE在内的强基线，在奖励黑客行为检测方面取得了超过25%的相对提升。此外，将GRIFT整合进推理任务的拒绝微调流程，可以减少奖励黑客行为，并提升在真实任务目标上的表现。我们的结果凸显了一个有前景的方向：利用梯度层面的表示来评估CoT推理轨迹的质量。我们的代码可在以下 https URL 获取。

Find, Fix, Reason: Context Repair for Video Reasoning

查找、修复、推理：视频推理的上下文修复

Authors: Haojian Huang, Chuanyu Qin, Yinchuan Li, Yingcong Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.16243
Pdf link: https://arxiv.org/pdf/2604.16243
Abstract Reinforcement learning has advanced video reasoning in large multi-modal models, yet dominant pipelines either rely on on-policy self-exploration, which plateaus at the model's knowledge boundary, or hybrid replay that mixes policies and demands careful regularization. Dynamic context methods zoom into focused evidence but often require curated pretraining and two-stage tuning, and their context remains bounded by a small model's capability. In contrast, larger models excel at instruction following and multi-modal understanding, can supply richer context to smaller models, and rapidly zoom in on target regions via simple tools. Building on this capability, we introduce an observation-level intervention: a frozen, tool-integrated teacher identifies the missing spatiotemporal dependency and provides a minimal evidence patch (e.g., timestamps, regions etc.) from the original video while the question remains unchanged. The student answers again with the added context, and training updates with a chosen-rollout scheme integrated into Group Relative Policy Optimization (GRPO). We further propose a Robust Improvement Reward (RIR) that aligns optimization with two goals: outcome validity through correct answers and dependency alignment through rationales that reflect the cited evidence. Advantages are group-normalized across the batch, preserving on-policy exploration while directing it along causally meaningful directions with minimal changes to the training stack. Experiments on various related benchmarks show consistent accuracy gains and strong generalization. Web page and source code will be available at this https URL.
中文摘要 强化学习推动了大型多模态模型中的视频推理，但主流流程要么依赖同策略（on-policy）自我探索，这种探索会在模型的知识边界处陷入停滞；要么依赖混合回放，这种方式混合了多种策略并需要谨慎的正则化。动态上下文方法能够聚焦到关键证据，但通常需要精心筛选的预训练和两阶段调优，且其上下文仍受限于小型模型的能力。相比之下，较大的模型擅长指令遵循和多模态理解，能为较小的模型提供更丰富的上下文，并通过简单工具快速聚焦目标区域。基于这一能力，我们引入一种观察级干预：一个冻结的、集成了工具的教师模型识别缺失的时空依赖，并从原始视频中提供最小的证据补丁（如时间戳、区域等），同时问题保持不变。学生模型借助补充的上下文再次作答，并通过集成于Group Relative Policy Optimization（GRPO）中的chosen-rollout方案进行训练更新。我们进一步提出了一种稳健改进奖励（RIR），使优化与两个目标对齐：通过正确答案实现结果有效性，以及通过反映所引用证据的推理依据实现依赖对齐。优势在批次内按组归一化，在保留同策略探索的同时引导其沿因果上有意义的方向进行，且对训练栈的改动最小。在多个相关基准上的实验显示了一致的准确率提升和很强的泛化能力。网页和源代码将在此 https URL 上提供。
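
Group-normalized advantages, as GRPO is commonly described, standardize each rollout's reward against the other rollouts sampled for the same prompt. The sketch below shows only that step. The function name, the `eps` constant, and the example rewards are assumptions for illustration, and the teacher patch, chosen-rollout scheme, and RIR reward from the abstract are not modeled.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for one question: two correct (1.0), two wrong (0.0).
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(advantages)
```

When every rollout in a group receives the same reward, all advantages are zero and the group contributes no learning signal, which is one reason extra context that lets some rollouts succeed can matter.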

Beyond Distribution Sharpening: The Importance of Task Rewards

超越分布锐化：任务奖励的重要性

Authors: Sarthak Mittal, Leo Gagnon, Guillaume Lajoie
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.16259
Pdf link: https://arxiv.org/pdf/2604.16259
Abstract Frontier models have demonstrated exceptional capabilities following the integration of task-reward-based reinforcement learning (RL) into their training pipelines, enabling systems to evolve from pure reasoning models into sophisticated agents. However, debate persists regarding whether RL genuinely instills new skills within a base model or merely sharpens its existing distribution to elicit latent capabilities. To address this dichotomy, we present an explicit comparison between distribution sharpening and task-reward-based learning, utilizing RL as a tool to implement both paradigms. Our analysis reveals the inherent limitations of distribution sharpening, demonstrating from first principles how and why the optima can be unfavorable and the approach fundamentally unstable. Furthermore, our experiments using Llama-3.2-3B-Instruct, Qwen2.5-3B-Instruct and Qwen3-4B-Instruct-2507 on math datasets confirm that sharpening yields limited gains, whereas incorporating task-based reward signal can greatly help achieve robust performance improvements and stable learning.
中文摘要 在将基于任务奖励的强化学习（RL）整合进训练流程后，前沿模型展现出卓越的能力，使系统能够从纯粹的推理模型演进为复杂的智能体。然而，强化学习究竟是真正为基础模型注入了新技能，还是仅仅锐化其现有分布以激发潜在能力，仍存在争议。为解决这一二分问题，我们对分布锐化与基于任务奖励的学习进行了明确比较，并利用强化学习作为工具实现这两种范式。我们的分析揭示了分布锐化的固有局限性，从第一性原理出发说明了其最优解为何以及如何可能并不理想，且该方法从根本上不稳定。此外，我们在数学数据集上使用 Llama-3.2-3B-Instruct、Qwen2.5-3B-Instruct 和 Qwen3-4B-Instruct-2507 进行的实验证实，锐化带来的提升有限，而引入基于任务的奖励信号则能极大地帮助实现稳健的性能提升和稳定的学习。
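
A toy calculation can make the distinction concrete. It assumes one common formalization, which may differ from the paper's: sharpening raises the base probabilities to a power `alpha > 1` and renormalizes, while task-reward learning with a KL penalty of strength `beta` has the well-known optimum proportional to `p(y) * exp(r(y) / beta)`. The two-answer distribution and reward values are hypothetical.

```python
import math


def normalize(weights):
    """Rescale non-negative weights so they sum to one."""
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}


def sharpen(p, alpha):
    """Raise each probability to the power alpha and renormalize."""
    return normalize({k: v ** alpha for k, v in p.items()})


def reward_tilt(p, reward, beta):
    """Reweight by exp(reward / beta), the optimum of KL-regularized RL."""
    return normalize({k: v * math.exp(reward[k] / beta) for k, v in p.items()})


# Hypothetical base model whose most likely answer is wrong.
base = {"wrong": 0.6, "right": 0.4}
reward = {"wrong": 0.0, "right": 1.0}

sharpened = sharpen(base, alpha=4.0)
tilted = reward_tilt(base, reward, beta=0.5)
print(sharpened["right"], tilted["right"])
```

In this toy case sharpening pushes mass further toward the wrong mode, whereas the reward-tilted distribution shifts mass to the correct answer, which mirrors the kind of unfavorable optimum the abstract attributes to sharpening.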

Evaluating the Progression of Large Language Model Capabilities for Small-Molecule Drug Design

评估大语言模型能力在小分子药物设计中的进展

Authors: Shriram Chennakesavalu, Kirill Shmilovich, Hayley Weir, Colin Grambow, John Bradshaw, Patricia Suriana, Chen Cheng, Kangway Chuang
Subjects: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
Arxiv link: https://arxiv.org/abs/2604.16279
Pdf link: https://arxiv.org/pdf/2604.16279
Abstract Large Language Models (LLMs) have the potential to accelerate small molecule drug design due to their ability to reason about information from diverse sources and formats. However, their practical utility remains unclear due to the lack of benchmarks that reflect real-world scenarios. In this work, we introduce a suite of chemically-grounded tasks spanning molecular property prediction, molecular representation transformations, and molecular design. Importantly, we formulate these tasks as reinforcement learning (RL) environments, enabling a unified approach for evaluation and post-training. Across three model families, we find that frontier models are increasingly proficient at chemical tasks, but that there is significant room for improvement, especially in experimental settings with low data. Critically, we show that RL-based post-training can substantially improve performance. A smaller model post-trained on our environments becomes competitive with state-of-the-art frontier models, despite a significantly weaker base model. This suggests a practical route toward employing LLMs in drug discovery; by combining carefully-designed evaluation tasks with targeted post-training, we can both elucidate and close critical capability gaps.
中文摘要 大型语言模型（LLM）能够对来自不同来源和格式的信息进行推理，因此有潜力加速小分子药物设计。然而，由于缺乏反映现实场景的基准，其实际效用仍不明确。在本研究中，我们提出了一套以化学为基础的任务，涵盖分子性质预测、分子表示转换和分子设计。重要的是，我们将这些任务构建为强化学习（RL）环境，从而为评估和后训练提供了统一的方法。在三个模型家族中，我们发现前沿模型在化学任务上日益熟练，但仍有显著的改进空间，尤其是在数据较少的实验场景中。关键的是，我们表明基于强化学习的后训练能够显著提升性能。一个在我们的环境上进行后训练的较小模型，尽管其基础模型明显较弱，也能与最先进的前沿模型相竞争。这为在药物发现中应用LLM指出了一条可行的路径：通过将精心设计的评估任务与有针对性的后训练相结合，我们既可以揭示也可以弥补关键的能力差距。

Keyword: diffusion policy

VADF: Vision-Adaptive Diffusion Policy Framework for Efficient Robotic Manipulation

VADF：面向高效机器人操作的视觉自适应扩散策略框架

Authors: Xinglei Yu, Zhenyang Liu, Shufeng Nan, Simo Wu, Yanwei Fu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.15938
Pdf link: https://arxiv.org/pdf/2604.15938
Abstract Diffusion policies are becoming mainstream in robotic manipulation but suffer from hard negative class imbalance due to uniform sampling and lack of sample difficulty awareness, leading to slow training convergence and frequent inference timeout failures. We propose VADF (Vision-Adaptive Diffusion Policy Framework), a vision-driven dual-adaptive framework that significantly reduces convergence steps and achieves early success in inference, with model-agnostic design enabling seamless integration into any diffusion policy architecture. During training, we introduce Adaptive Loss Network (ALN), a lightweight MLP-based loss predictor that quantifies per-step sample difficulty in real time. Guided by hard negative mining, it performs weighted sampling to prioritize high-loss regions, enabling adaptive weight updates and faster convergence. In inference, we design the Hierarchical Vision Task Segmenter (HVTS), which decomposes high-level task instructions into multi-stage low-level sub-instructions based on visual input. It adaptively segments action sequences into simple and complex subtasks by assigning shorter noise schedules with longer direct execution sequences to simple actions, and longer noise steps with shorter execution sequences to complex ones, thereby dramatically reducing computational overhead and significantly improving the early success rate.
中文摘要 扩散策略正在成为机器人操作的主流方法，但由于采用均匀采样且缺乏对样本难度的感知，存在困难负样本类别不平衡的问题，导致训练收敛缓慢和推理频繁超时失败。我们提出了VADF（视觉自适应扩散策略框架），这是一个视觉驱动的双重自适应框架，能够显著减少收敛所需步数并在推理中实现早期成功，其与模型无关的设计使其能够无缝集成到任何扩散策略架构中。在训练过程中，我们引入了自适应损失网络（ALN），这是一种基于MLP的轻量级损失预测器，可实时量化每一步的样本难度。在困难负样本挖掘的指导下，它进行加权采样以优先处理高损失区域，从而实现自适应的权重更新和更快的收敛。在推理中，我们设计了分层视觉任务分割器（HVTS），它基于视觉输入将高层任务指令分解为多阶段的低层子指令。它自适应地将动作序列划分为简单和复杂的子任务：为简单动作分配较短的噪声调度和较长的直接执行序列，为复杂动作分配较多的噪声步数和较短的执行序列，从而大幅降低计算开销并显著提高早期成功率。
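
The two adaptive mechanisms in the abstract can be sketched in a few lines, under loud assumptions: the predicted losses are hard-coded instead of coming from a trained ALN, and the step counts returned by `schedule_for` are illustrative numbers, not values from the paper. The sketch shows loss-proportional sampling for training and a difficulty-dependent trade-off between denoising steps and execution length for inference.

```python
import random


def sample_hard_examples(predicted_losses, batch_size, rng):
    """Draw sample indices with probability proportional to predicted loss."""
    indices = range(len(predicted_losses))
    return rng.choices(indices, weights=predicted_losses, k=batch_size)


def schedule_for(subtask_is_complex):
    """Return (denoising_steps, actions_executed_per_plan); values are illustrative."""
    return (50, 4) if subtask_is_complex else (10, 16)


rng = random.Random(0)
losses = [0.05, 0.05, 0.9]    # hypothetical per-sample loss predictions
batch = sample_hard_examples(losses, batch_size=1000, rng=rng)
print(batch.count(2), schedule_for(True), schedule_for(False))
```

Uniform sampling would visit the high-loss sample about a third of the time here, while loss-proportional sampling visits it in most draws.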