Arxiv Papers of Today

Generated: 2026-02-09 16:58:26 (UTC+8); arXiv release time: 2026-02-09 20:00 EST (2026-02-10 09:00 UTC+8)

46 relevant papers today

Keyword: reinforcement learning

Transformer-Based Reinforcement Learning for Autonomous Orbital Collision Avoidance in Partially Observable Environments

Authors: Thomas Georges, Adam Abdin
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.06088
Pdf link: https://arxiv.org/pdf/2602.06088
Abstract We introduce a Transformer-based Reinforcement Learning framework for autonomous orbital collision avoidance that explicitly models the effects of partial observability and imperfect monitoring in space operations. The framework combines a configurable encounter simulator, a distance-dependent observation model, and a sequential state estimator to represent uncertainty in relative motion. A central contribution of this work is the use of a transformer-based Partially Observable Markov Decision Process (POMDP) architecture, which leverages long-range temporal attention to interpret noisy and intermittent observations more effectively than traditional architectures. This integration provides a foundation for training collision avoidance agents that can operate more reliably under imperfect monitoring.

Jackpot: Optimal Budgeted Rejection Sampling for Extreme Actor-Policy Mismatch Reinforcement Learning

Authors: Zhuoming Chen, Hongyi Liu, Yang Zhou, Haizhong Zheng, Beidi Chen
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.06107
Pdf link: https://arxiv.org/pdf/2602.06107
Abstract Reinforcement learning (RL) for large language models (LLMs) remains expensive, particularly because rollout generation is costly. Decoupling rollout generation from policy optimization (e.g., using a more efficient model to generate rollouts) could enable substantial efficiency gains, yet doing so introduces a severe distribution mismatch that destabilizes learning. We propose Jackpot, a framework that leverages Optimal Budget Rejection Sampling (OBRS) to directly reduce the discrepancy between the rollout model and the evolving policy. Jackpot integrates a principled OBRS procedure, a unified training objective that jointly updates the policy and rollout models, and an efficient system implementation enabled by top-$k$ probability estimation and batch-level bias correction. Our theoretical analysis shows that OBRS consistently moves the rollout distribution closer to the target distribution under a controllable acceptance budget. Empirically, Jackpot substantially improves training stability compared to importance-sampling baselines, achieving performance comparable to on-policy RL when training Qwen3-8B-Base for up to 300 update steps with a batch size of 64. Taken together, our results show that OBRS-based alignment brings us a step closer to practical and effective decoupling of rollout generation from policy optimization in RL for LLMs.
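As a rough illustration of the rejection-sampling idea described above, the sketch below applies an acceptance rule of the form min(1, p(x) / (c * q(x))) to a small discrete distribution. It is a toy under stated assumptions, not Jackpot's implementation: the distributions `target` and `rollout`, the constant `c`, and all function names are hypothetical, and the abstract does not spell out the exact acceptance rule or how the budget is chosen.

```python
def acceptance_probs(p, q, c):
    """Accept a sample x drawn from q with probability min(1, p(x) / (c * q(x)))."""
    return [min(1.0, pi / (c * qi)) if qi > 0 else 0.0 for pi, qi in zip(p, q)]


def accepted_distribution(p, q, c):
    """Return the distribution of accepted samples and the overall acceptance rate."""
    a = acceptance_probs(p, q, c)
    unnorm = [qi * ai for qi, ai in zip(q, a)]
    rate = sum(unnorm)  # probability that a draw from q is accepted
    return [u / rate for u in unnorm], rate


def total_variation(p, r):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - ri) for pi, ri in zip(p, r))


target = [0.7, 0.2, 0.1]   # hypothetical policy distribution (assumption)
rollout = [0.2, 0.3, 0.5]  # hypothetical rollout-model distribution (assumption)

for c in (1.0, 2.0, 3.5):
    dist, rate = accepted_distribution(target, rollout, c)
    tv = total_variation(target, dist)
    print(f"c={c}: acceptance rate={rate:.3f}, TV to target={tv:.3f}")
```

In this toy example, a larger `c` lowers the acceptance rate (a tighter budget) while pulling the accepted samples closer to the target; when `c` equals the largest ratio p(x)/q(x), the rule reduces to classical rejection sampling and the accepted distribution matches the target.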

Self-Improving World Modelling with Latent Actions

Authors: Yifu Qiu, Zheng Zhao, Waylon Li, Yftah Ziser, Anna Korhonen, Shay B. Cohen, Edoardo M. Ponti
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.06130
Pdf link: https://arxiv.org/pdf/2602.06130
Abstract Internal modelling of the world -- predicting transitions between previous states $X$ and next states $Y$ under actions $Z$ -- is essential to reasoning and planning for LLMs and VLMs. Learning such models typically requires costly action-labelled trajectories. We propose SWIRL, a self-improvement framework that learns from state-only sequences by treating actions as a latent variable and alternating between Forward World Modelling (FWM) $P_\theta(Y|X,Z)$ and Inverse Dynamics Modelling (IDM) $Q_\phi(Z|X,Y)$. SWIRL iterates two phases: (1) Variational Information Maximisation, which updates the FWM to generate next states that maximise conditional mutual information with latent actions given prior states, encouraging identifiable consistency; and (2) ELBO Maximisation, which updates the IDM to explain observed transitions, effectively performing coordinate ascent. Both models are trained with reinforcement learning (specifically, GRPO) with the opposite frozen model's log-probability as a reward signal. We provide theoretical learnability guarantees for both updates, and evaluate SWIRL on LLMs and VLMs across multiple environments: single-turn and multi-turn open-world visual dynamics and synthetic textual environments for physics, web, and tool calling. SWIRL achieves gains of 16% on AURORABench, 28% on ByteMorph, 16% on WorldPredictionBench, and 14% on StableToolBench.
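For orientation, the ELBO phase can be related to the standard variational bound for a latent-variable model. The form below is the textbook bound written in the abstract's notation; the prior $P(Z \mid X)$ over latent actions is an assumption introduced here, and the paper's exact objective may differ.

```latex
\log P_\theta(Y \mid X) \;\ge\;
\mathbb{E}_{Q_\phi(Z \mid X, Y)}\big[\log P_\theta(Y \mid X, Z)\big]
\;-\; \mathrm{KL}\big(Q_\phi(Z \mid X, Y)\,\|\,P(Z \mid X)\big)
```

Under this reading, the first term is the FWM log-probability of the observed next state, which fits the abstract's statement that each model is trained with the other, frozen model's log-probability as its reward.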

Flow Matching for Offline Reinforcement Learning with Discrete Actions

Authors: Fairoz Nower Khan, Nabuat Zaman Nahim, Ruiquan Huang, Haibo Yang, Peizhong Ju
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.06138
Pdf link: https://arxiv.org/pdf/2602.06138
Abstract Generative policies based on diffusion models and flow matching have shown strong promise for offline reinforcement learning (RL), but their applicability remains largely confined to continuous action spaces. To address a broader range of offline RL settings, we extend flow matching to a general framework that supports discrete action spaces with multiple objectives. Specifically, we replace continuous flows with continuous-time Markov chains, trained using a Q-weighted flow matching objective. We then extend our design to multi-agent settings, mitigating the exponential growth of joint action spaces via a factorized conditional path. We theoretically show that, under idealized conditions, optimizing this objective recovers the optimal policy. Extensive experiments further demonstrate that our method performs robustly in practical scenarios, including high-dimensional control, multi-modal decision-making, and dynamically changing preferences over multiple objectives. Our discrete framework can also be applied to continuous-control problems through action quantization, providing a flexible trade-off between representational complexity and performance.

Learning Rate Scaling across LoRA Ranks and Transfer to Full Finetuning

Authors: Nan Chen, Soledad Villar, Soufiane Hayou
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.06204
Pdf link: https://arxiv.org/pdf/2602.06204
Abstract Low-Rank Adaptation (LoRA) is a standard tool for parameter-efficient finetuning of large models. While it has a small memory footprint, its training dynamics can be surprisingly complex, as they depend on several hyperparameters such as initialization, adapter rank, and learning rate. In particular, it is unclear how the optimal learning rate scales with adapter rank, which forces practitioners to re-tune the learning rate whenever the rank is changed. In this paper, we introduce Maximal-Update Adaptation ($\mu$A), a theoretical framework that characterizes how the "optimal" learning rate should scale with model width and adapter rank to produce stable, non-vanishing feature updates under standard configurations. $\mu$A is inspired by the Maximal-Update Parametrization ($\mu$P) in pretraining. Our analysis leverages techniques from hyperparameter transfer and reveals that the optimal learning rate exhibits different scaling patterns depending on initialization and LoRA scaling factor. Specifically, we identify two regimes: one where the optimal learning rate remains roughly invariant across ranks, and another where it scales inversely with rank. We further identify a configuration that allows learning rate transfer from LoRA to full finetuning, drastically reducing the cost of learning rate tuning for full finetuning. Experiments across language, vision, vision--language, image generation, and reinforcement learning tasks validate our scaling rules and show that learning rates tuned on LoRA transfer reliably to full finetuning.
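To make the role of the adapter rank and the LoRA scaling factor concrete, here is a minimal pure-Python sketch of a LoRA forward pass. It assumes the common convention of scaling the low-rank update by `alpha / r`; the abstract does not state which scaling or initialization the paper analyzes, so none of the $\mu$A rules are reproduced here, and the helper names are hypothetical.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(x, W, A, B, alpha):
    """Compute (W + (alpha / r) * B A) x, where r is the adapter rank.

    A has shape (r, d_in) and B has shape (d_out, r). The alpha / r scaling
    is an assumed convention, not taken from the abstract.
    """
    r = len(A)
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen pretrained weight (toy identity)
A = [[1.0, 0.0]]              # rank-1 down-projection
B_zero = [[0.0], [0.0]]       # zero-initialized up-projection
B_trained = [[1.0], [2.0]]    # hypothetical values after some training
x = [3.0, 4.0]

print(lora_forward(x, W, A, B_zero, alpha=2.0))     # equals W x at initialization
print(lora_forward(x, W, A, B_trained, alpha=2.0))  # base output plus scaled update
```

Because the update is multiplied by a rank-dependent factor, changing `r` changes the effective step size of the adapter, which is why the learning rate and rank cannot be tuned independently.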

VowelPrompt: Hearing Speech Emotions from Text via Vowel-level Prosodic Augmentation

Authors: Yancheng Wang, Osama Hanna, Ruiming Xie, Xianfeng Rui, Maohao Shen, Xuedong Zhang, Christian Fuegen, Jilong Wu, Debjyoti Paul, Arthur Guo, Zhihong Lei, Ozlem Kalinli, Qing He, Yingzhen Yang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.06270
Pdf link: https://arxiv.org/pdf/2602.06270
Abstract Emotion recognition in speech presents a complex multimodal challenge, requiring comprehension of both linguistic content and vocal expressivity, particularly prosodic features such as fundamental frequency, intensity, and temporal dynamics. Although large language models (LLMs) have shown promise in reasoning over textual transcriptions for emotion recognition, they typically neglect fine-grained prosodic information, limiting their effectiveness and interpretability. In this work, we propose VowelPrompt, a linguistically grounded framework that augments LLM-based emotion recognition with interpretable, fine-grained vowel-level prosodic cues. Drawing on phonetic evidence that vowels serve as primary carriers of affective prosody, VowelPrompt extracts pitch-, energy-, and duration-based descriptors from time-aligned vowel segments, and converts these features into natural language descriptions for better interpretability. Such a design enables LLMs to jointly reason over lexical semantics and fine-grained prosodic variation. Moreover, we adopt a two-stage adaptation procedure comprising supervised fine-tuning (SFT) followed by Reinforcement Learning with Verifiable Reward (RLVR), implemented via Group Relative Policy Optimization (GRPO), to enhance reasoning capability, enforce structured output adherence, and improve generalization across domains and speaker variations. Extensive evaluations across diverse benchmark datasets demonstrate that VowelPrompt consistently outperforms state-of-the-art emotion recognition methods under zero-shot, fine-tuned, cross-domain, and cross-linguistic conditions, while enabling the generation of interpretable explanations that are jointly grounded in contextual semantics and fine-grained prosodic structure.

Online Adaptive Reinforcement Learning with Echo State Networks for Non-Stationary Dynamics

Authors: Aoi Yoshimura, Gouhei Tanaka
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.06326
Pdf link: https://arxiv.org/pdf/2602.06326
Abstract Reinforcement learning (RL) policies trained in simulation often suffer from severe performance degradation when deployed in real-world environments due to non-stationary dynamics. While Domain Randomization (DR) and meta-RL have been proposed to address this issue, they typically rely on extensive pretraining, privileged information, or high computational cost, limiting their applicability to real-time and edge systems. In this paper, we propose a lightweight online adaptation framework for RL based on Reservoir Computing. Specifically, we integrate an Echo State Network (ESN) as an adaptation module that encodes recent observation histories into a latent context representation, and update its readout weights online using Recursive Least Squares (RLS). This design enables rapid adaptation without backpropagation, pretraining, or access to privileged information. We evaluate the proposed method on CartPole and HalfCheetah tasks with severe and abrupt environment changes, including periodic external disturbances and extreme friction variations. Experimental results demonstrate that the proposed approach significantly outperforms DR and representative adaptive baselines under out-of-distribution dynamics, achieving stable adaptation within a few control steps. Notably, the method successfully handles intra-episode environment changes without resetting the policy. Due to its computational efficiency and stability, the proposed framework provides a practical solution for online adaptation in non-stationary environments and is well suited for real-world robotic control and edge deployment.
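The two ingredients named above, a fixed reservoir and an RLS-updated readout, can be sketched in a few lines of pure Python. This is a generic textbook version under stated assumptions, not the paper's module: the reservoir size, weight ranges, the scalar next-input prediction task, and the function names `esn_step` and `rls_update` are all hypothetical.

```python
import math
import random


def esn_step(state, u, W_in, W):
    """One reservoir update: x <- tanh(W_in u + W x). Reservoir weights stay fixed."""
    n = len(state)
    return [
        math.tanh(
            sum(W_in[i][k] * u[k] for k in range(len(u)))
            + sum(W[i][j] * state[j] for j in range(n))
        )
        for i in range(n)
    ]


def rls_update(w, P, x, y, lam=1.0):
    """One Recursive Least Squares step for a scalar target y and features x."""
    n = len(x)
    Px = [sum(P[i][j] * x[j] for j in range(n)) for i in range(n)]
    denom = lam + sum(x[i] * Px[i] for i in range(n))
    gain = [v / denom for v in Px]
    err = y - sum(w[i] * x[i] for i in range(n))  # a priori prediction error
    w = [w[i] + gain[i] * err for i in range(n)]
    # P is symmetric, so x^T P equals (P x)^T.
    P = [[(P[i][j] - gain[i] * Px[j]) / lam for j in range(n)] for i in range(n)]
    return w, P, err


rng = random.Random(0)
n_res, n_in = 8, 1  # toy sizes (assumption)
W_in = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_res)]
W = [[rng.uniform(-0.1, 0.1) for _ in range(n_res)] for _ in range(n_res)]

state = [0.0] * n_res
w = [0.0] * n_res
P = [[1000.0 if i == j else 0.0 for j in range(n_res)] for i in range(n_res)]
errors = []
for t in range(300):
    u = [math.sin(0.1 * t)]
    state = esn_step(state, u, W_in, W)
    target = math.sin(0.1 * (t + 1))  # toy task: predict the next input
    w, P, err = rls_update(w, P, state, target)
    errors.append(abs(err))
```

The reservoir weights `W_in` and `W` are never trained; only the readout `w` changes, one closed-form step per observation, which is what lets this style of update avoid backpropagation.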

HiWET: Hierarchical World-Frame End-Effector Tracking for Long-Horizon Humanoid Loco-Manipulation

Authors: Zhanxiang Cao, Liyun Yan, Yang Zhang, Sirui Chen, Jianming Ma, Tianyue Zhan, Shengcheng Fu, Yufei Jia, Cewu Lu, Yue Gao
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.06341
Pdf link: https://arxiv.org/pdf/2602.06341
Abstract Humanoid loco-manipulation requires executing precise manipulation tasks while maintaining dynamic stability amid base motion and impacts. Existing approaches typically formulate commands in body-centric frames and fail to inherently correct cumulative world-frame drift induced by legged locomotion. We reformulate the problem as world-frame end-effector tracking and propose HiWET, a hierarchical reinforcement learning framework that decouples global reasoning from dynamic execution. The high-level policy generates subgoals that jointly optimize end-effector accuracy and base positioning in the world frame, while the low-level policy executes these commands under stability constraints. We introduce a Kinematic Manifold Prior (KMP) that embeds the manipulation manifold into the action space via residual learning, reducing exploration dimensionality and mitigating kinematically invalid behaviors. Extensive simulation and ablation studies demonstrate that HiWET achieves precise and stable end-effector tracking in long-horizon world-frame tasks. We validate zero-shot sim-to-real transfer of the low-level policy on a physical humanoid, demonstrating stable locomotion under diverse manipulation commands. These results indicate that explicit world-frame reasoning combined with hierarchical control provides an effective and scalable solution for long-horizon humanoid loco-manipulation.

Training Data Selection with Gradient Orthogonality for Efficient Domain Adaptation

Authors: Xiyang Zhang, Yuanhe Tian, Hongzhi Wang, Yan Song
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.06359
Pdf link: https://arxiv.org/pdf/2602.06359
Abstract Fine-tuning large language models (LLMs) for specialized domains often necessitates a trade-off between acquiring domain expertise and retaining general reasoning capabilities, a phenomenon known as catastrophic forgetting. Existing remedies face a dichotomy: gradient surgery methods offer geometric safety but incur prohibitive computational costs via online projections, while efficient data selection approaches reduce overhead but remain blind to conflict-inducing gradient directions. In this paper, we propose Orthogonal Gradient Selection (OGS), a data-centric method that harmonizes domain performance, general capability retention, and training efficiency. OGS shifts the geometric insights of gradient projection from the optimizer to the data selection stage by treating data selection as a constrained decision-making process. By leveraging a lightweight Navigator model and reinforcement learning techniques, OGS dynamically identifies training samples whose gradients are orthogonal to a general-knowledge anchor. This approach ensures naturally safe updates for target models without modifying the optimizer or incurring runtime projection costs. Experiments across medical, legal, and financial domains demonstrate that OGS achieves excellent results, significantly improving domain performance and training efficiency while maintaining or even enhancing performance on general tasks such as GSM8K.
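The geometric criterion behind this selection can be illustrated with a small sketch that keeps only samples whose gradient is nearly orthogonal to a general-knowledge anchor gradient. This computes the cosine directly, which is an assumption made for illustration; per the abstract, OGS identifies such samples with a lightweight Navigator model and reinforcement learning. The vectors, tolerance, and function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def select_orthogonal(sample_grads, anchor_grad, tol=0.1):
    """Return indices of samples whose gradient is nearly orthogonal to the anchor."""
    return [
        i for i, g in enumerate(sample_grads)
        if abs(cosine(g, anchor_grad)) <= tol
    ]


anchor = [1.0, 0.0, 0.0]  # hypothetical general-knowledge gradient direction
grads = [
    [0.9, 0.1, 0.0],   # aligned with the anchor: rejected
    [0.0, 1.0, 0.5],   # orthogonal: kept
    [-1.0, 0.0, 0.2],  # opposed to the anchor (conflicting): rejected
    [0.05, 0.7, 0.7],  # nearly orthogonal: kept
]
selected = select_orthogonal(grads, anchor)
print(selected)
```

An update along a direction orthogonal to the anchor leaves the anchor's loss unchanged to first order, which is the intuition for why such samples are treated as safe for general capabilities.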

FMBench: Adaptive Large Language Model Output Formatting

Authors: Yaoting Wang, Yun Zhou, Henghui Ding
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.06384
Pdf link: https://arxiv.org/pdf/2602.06384
Abstract Producing outputs that satisfy both semantic intent and format constraints is essential for deploying large language models in user-facing and system-integrated workflows. In this work, we focus on Markdown formatting, which is ubiquitous in assistants, documentation, and tool-augmented pipelines but still prone to subtle, hard-to-detect errors (e.g., broken lists, malformed tables, inconsistent headings, and invalid code blocks) that can significantly degrade downstream usability. We present FMBench, a benchmark for adaptive Markdown output formatting that evaluates models under a wide range of instruction-following scenarios with diverse structural requirements. FMBench emphasizes real-world formatting behaviors such as multi-level organization, mixed content (natural language interleaved with lists/tables/code), and strict adherence to user-specified layout constraints. To improve Markdown compliance without relying on hard decoding constraints, we propose a lightweight alignment pipeline that combines supervised fine-tuning (SFT) with reinforcement learning fine-tuning. Starting from a base model, we first perform SFT on instruction-response pairs, and then optimize a composite objective that balances semantic fidelity with structural correctness. Experiments on two model families (OpenPangu and Qwen) show that SFT consistently improves semantic alignment, while reinforcement learning provides additional gains in robustness to challenging Markdown instructions when initialized from a strong SFT policy. Our results also reveal an inherent trade-off between semantic and structural objectives, highlighting the importance of carefully designed rewards for reliable formatted generation. Code is available at: this https URL.
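A composite objective of the kind described above could combine a semantic score with simple structural checks on the generated Markdown. The sketch below is purely illustrative: the two checks, the equal weighting, and every function name are assumptions, and the abstract does not describe FMBench's actual metrics or reward.

```python
FENCE = "`" * 3  # code-fence marker, built this way to avoid nesting fences


def fences_balanced(md):
    """True if the number of code-fence lines is even (each fence is closed)."""
    count = sum(1 for line in md.splitlines() if line.lstrip().startswith(FENCE))
    return count % 2 == 0


def tables_consistent(md):
    """True if every row in each pipe table has the same number of columns.

    Simplification: escaped pipes inside cells are not handled.
    """
    width = None  # column count of the current table, None outside a table
    for line in md.splitlines():
        s = line.strip()
        if len(s) > 1 and s.startswith("|") and s.endswith("|"):
            cols = s.count("|") - 1
            if width is None:
                width = cols
            elif cols != width:
                return False
        else:
            width = None
    return True


def structural_score(md):
    """Fraction of structural checks that pass."""
    checks = [fences_balanced(md), tables_consistent(md)]
    return sum(checks) / len(checks)


def composite_reward(semantic_score, md, weight=0.5):
    """Weighted mix of a semantic score in [0, 1] and the structural score."""
    return weight * semantic_score + (1.0 - weight) * structural_score(md)


good = "| a | b |\n|---|---|\n| 1 | 2 |"
broken_table = "| a | b |\n|---|---|\n| 1 |"
print(composite_reward(1.0, good), composite_reward(1.0, broken_table))
```

The `weight` parameter makes the trade-off explicit: raising it favors semantic fidelity, lowering it favors structural correctness.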

POINTS-GUI-G: GUI-Grounding Journey

Authors: Zhongyin Zhao, Yuan Liu, Yikun Liu, Haicheng Wang, Le Tian, Xiao Zhou, Yangxiu You, Zilin Yu, Yang Yu, Jie Zhou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.06391
Pdf link: https://arxiv.org/pdf/2602.06391
Abstract The rapid advancement of vision-language models has catalyzed the emergence of GUI agents, which hold immense potential for automating complex tasks, from online shopping to flight booking, thereby alleviating the burden of repetitive digital workflows. As a foundational capability, GUI grounding is typically established as a prerequisite for end-to-end task execution. It enables models to precisely locate interface elements, such as text and icons, to perform accurate operations like clicking and typing. Unlike prior works that fine-tune models already possessing strong spatial awareness (e.g., Qwen3-VL), we aim to master the full technical pipeline by starting from a base model with minimal grounding ability, such as POINTS-1.5. We introduce POINTS-GUI-G-8B, which achieves state-of-the-art performance with scores of 59.9 on ScreenSpot-Pro, 66.0 on OSWorld-G, 95.7 on ScreenSpot-v2, and 49.9 on UI-Vision. Our model's success is driven by three key factors: (1) Refined Data Engineering, involving the unification of diverse open-source dataset formats alongside sophisticated strategies for augmentation, filtering, and difficulty grading; (2) Improved Training Strategies, including continuous fine-tuning of the vision encoder to enhance perceptual accuracy and maintaining resolution consistency between training and inference; and (3) Reinforcement Learning (RL) with Verifiable Rewards. While RL is traditionally used to bolster reasoning, we demonstrate that it significantly improves precision in the perception-intensive GUI grounding task. Furthermore, GUI grounding provides a natural advantage for RL, as rewards are easily verifiable and highly accurate.
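One way to see why grounding rewards are easy to verify is a point-in-box check: a predicted click either lands inside the target element or it does not. The sketch below assumes a binary reward and an axis-aligned `(x_min, y_min, x_max, y_max)` box; the abstract does not specify the reward actually used, and the function name is hypothetical.

```python
def grounding_reward(pred_xy, box):
    """Return 1.0 if the predicted click lies inside the target box, else 0.0.

    box is (x_min, y_min, x_max, y_max) in the same pixel coordinates as
    pred_xy. The binary form is an assumption made for illustration.
    """
    x, y = pred_xy
    x_min, y_min, x_max, y_max = box
    inside = x_min <= x <= x_max and y_min <= y <= y_max
    return 1.0 if inside else 0.0


button = (100, 40, 180, 70)  # hypothetical "Submit" button bounds
print(grounding_reward((120, 55), button))  # click inside the button
print(grounding_reward((90, 55), button))   # click just left of the button
```

No learned judge is needed to score a prediction, so the reward signal carries no labeling noise beyond the box annotation itself.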

Unlocking Noisy Real-World Corpora for Foundation Model Pre-Training via Quality-Aware Tokenization

Authors: Arvid E. Gollwitzer, Paridhi Latawa, David de Gruijl, Deepak A. Subramanian, Adrián Noriega de la Colina
Subjects: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Genomics (q-bio.GN); Computational Finance (q-fin.CP)
Arxiv link: https://arxiv.org/abs/2602.06394
Pdf link: https://arxiv.org/pdf/2602.06394
Abstract Current tokenization methods process sequential data without accounting for signal quality, limiting their effectiveness on noisy real-world corpora. We present QA-Token (Quality-Aware Tokenization), which incorporates data reliability directly into vocabulary construction. We make three key contributions: (i) a bilevel optimization formulation that jointly optimizes vocabulary construction and downstream performance, (ii) a reinforcement learning approach that learns merge policies through quality-aware rewards with convergence guarantees, and (iii) an adaptive parameter learning mechanism via Gumbel-Softmax relaxation for end-to-end optimization. Our experimental evaluation demonstrates consistent improvements in genomics (6.7 percentage point F1 gain in variant calling over BPE) and finance (30% Sharpe ratio improvement). At foundation scale, we tokenize a pretraining corpus comprising 1.7 trillion base-pairs and achieve state-of-the-art pathogen detection (94.53 MCC) while reducing token count by 15%. We unlock noisy real-world corpora, spanning petabases of genomic sequences and terabytes of financial time series, for foundation model training with zero inference overhead.

MeDocVL: A Visual Language Model for Medical Document Understanding and Parsing

Authors: Wenjie Wang, Wei Wu, Ying Liu, Yuan Zhao, Xiaole Lv, Liang Diao, Zengjian Fan, Wenfeng Xie, Ziling Lin, De Shi, Lin Huang, Kaihe Xu, Hong Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.06402
Pdf link: https://arxiv.org/pdf/2602.06402
Abstract Medical document OCR is challenging due to complex layouts, domain-specific terminology, and noisy annotations, while requiring strict field-level exact matching. Existing OCR systems and general-purpose vision-language models often fail to reliably parse such documents. We propose MeDocVL, a post-trained vision-language model for query-driven medical document parsing. Our framework combines Training-driven Label Refinement to construct high-quality supervision from noisy annotations, with a Noise-aware Hybrid Post-training strategy that integrates reinforcement learning and supervised fine-tuning to achieve robust and precise extraction. Experiments on medical invoice benchmarks show that MeDocVL consistently outperforms conventional OCR systems and strong VLM baselines, achieving state-of-the-art performance under noisy supervision.

Learning Human Visual Attention on 3D Surfaces through Geometry-Queried Semantic Priors

Authors: Soham Pahari, Sandeep C. Kumain
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.06419
Pdf link: https://arxiv.org/pdf/2602.06419
Abstract Human visual attention on three-dimensional objects emerges from the interplay between bottom-up geometric processing and top-down semantic recognition. Existing 3D saliency methods rely on hand-crafted geometric features or learning-based approaches that lack semantic awareness, failing to explain why humans fixate on semantically meaningful but geometrically unremarkable regions. We introduce SemGeo-AttentionNet, a dual-stream architecture that explicitly formalizes this dichotomy through asymmetric cross-modal fusion, leveraging diffusion-based semantic priors from geometry-conditioned multi-view rendering and point cloud transformers for geometric processing. Cross-attention ensures geometric features query semantic content, enabling bottom-up distinctiveness to guide top-down retrieval. We extend our framework to temporal scanpath generation through reinforcement learning, introducing the first formulation respecting 3D mesh topology with inhibition-of-return dynamics. Evaluation on SAL3D, NUS3D and 3DVA datasets demonstrates substantial improvements, validating how cognitively motivated architectures effectively model human visual attention on three-dimensional surfaces.

TrailBlazer: History-Guided Reinforcement Learning for Black-Box LLM Jailbreaking

Authors: Sung-Hoon Yoon, Ruizhi Qian, Minda Zhao, Weiyue Li, Mengyu Wang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2602.06440
Pdf link: https://arxiv.org/pdf/2602.06440
Abstract Large Language Models (LLMs) have become integral to many domains, making their safety a critical priority. Prior jailbreaking research has explored diverse approaches, including prompt optimization, automated red teaming, obfuscation, and reinforcement learning (RL) based methods. However, most existing techniques fail to effectively leverage vulnerabilities revealed in earlier interaction turns, resulting in inefficient and unstable attacks. Since jailbreaking involves sequential interactions in which each response influences future actions, reinforcement learning provides a natural framework for this problem. Motivated by this, we propose a history-aware RL-based jailbreak framework that analyzes and reweights vulnerability signals from prior steps to guide future decisions. We show that incorporating historical information alone improves jailbreak success rates. Building on this insight, we introduce an attention-based reweighting mechanism that highlights critical vulnerabilities within the interaction history, enabling more efficient exploration with fewer queries. Extensive experiments on AdvBench and HarmBench demonstrate that our method achieves state-of-the-art jailbreak performance while significantly improving query efficiency. These results underscore the importance of historical vulnerability signals in reinforcement learning-driven jailbreak strategies and offer a principled pathway for advancing adversarial research on LLM safeguards.
中文摘要 大型语言模型（LLM）已成为许多领域的核心，其安全性成为至关重要的优先事项。此前越狱研究探索了多种方法，包括提示优化、自动红队、混淆和基于强化学习（RL）的方法。然而，大多数现有技术未能有效利用早期交互回合中暴露的漏洞，导致攻击效率低下且不稳定。由于越狱涉及连续交互，每个反应都会影响未来的行动，强化学习为这一问题提供了一个自然的框架。基于此，我们提出了一个基于历史感知的强化学习越狱框架，分析并重新权重先前步骤的漏洞信号，以指导未来决策。我们证明，仅包含历史信息就能提高越狱成功率。基于这一见解，我们引入了基于注意力的重权重机制，突出显示交互历史中的关键漏洞，使探索更高效且查询次数更少。在 AdvBench 和 HarmBench 上的大量实验表明，我们的方法实现了最先进的越狱性能，同时显著提升了查询效率。这些结果强调了历史脆弱性信号在强化学习驱动越狱策略中的重要性，并为推进对抗性LLM防护措施的研究提供了有原则的路径。

Evaluating an evidence-guided reinforcement learning framework in aligning light-parameter large language models with decision-making cognition in psychiatric clinical reasoning

评估证据引导的强化学习框架在使轻量参数大语言模型与精神科临床推理中的决策认知对齐方面的效果

Authors: Xinxin Lin, Guangxin Dai, Yi Zhong, Xiang Li, Xue Xiao, Yixin Zhang, Zhengdong Wu, Yongbo Zheng, Runchuan Zhu, Ming Zhao, Huizi Yu, Shuo Wu, Jun Zhao, Lingming Hu, Yumei Wang, Ping Yin, Joey W.Y. Chan, Ngan Yin Chan, Sijing Chen, Yun Kwok Wing, Lin Lu, Xin Ma, Lizhou Fan
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.06449
Pdf link: https://arxiv.org/pdf/2602.06449
Abstract Large language models (LLMs) hold transformative potential for medical decision support yet their application in psychiatry remains constrained by hallucinations and superficial reasoning. This limitation is particularly acute in light-parameter LLMs which are essential for privacy-preserving and efficient clinical deployment. Existing training paradigms prioritize linguistic fluency over structured clinical logic and result in a fundamental misalignment with professional diagnostic cognition. Here we introduce ClinMPO, a reinforcement learning framework designed to align the internal reasoning of LLMs with professional psychiatric practice. The framework employs a specialized reward model trained independently on a dataset derived from 4,474 psychiatry journal articles and structured according to evidence-based medicine principles. We evaluated ClinMPO on an unseen subset of the benchmark designed to isolate reasoning capabilities from rote memorization. This test set comprises items where leading large-parameter LLMs consistently fail. We compared the ClinMPO-aligned light LLM performance against a cohort of 300 medical students. The ClinMPO-tuned Qwen3-8B model achieved a diagnostic accuracy of 31.4% and surpassed the human benchmark of 30.8% on these complex cases. These results demonstrate that medical evidence-guided optimization enables light-parameter LLMs to master complex reasoning tasks. Our findings suggest that explicit cognitive alignment offers a scalable pathway to reliable and safe psychiatric decision support.
中文摘要 大型语言模型（LLMs）在医疗决策支持方面具有变革潜力，但其在精神病学中的应用仍受幻觉和表面化推理的限制。这一限制在轻量参数大型语言模型中尤为突出，而这类模型对于保护隐私和高效的临床部署至关重要。现有的训练范式优先考虑语言流畅度而非结构化的临床逻辑，导致与专业诊断认知的根本性错位。本文介绍ClinMPO，一种强化学习框架，旨在使LLM的内在推理与专业精神科实践对齐。该框架采用一个独立训练的专用奖励模型，其训练数据集源自4,474篇精神病学期刊文章，并按循证医学原则进行结构化。我们在基准测试中一个未见过的子集上评估了ClinMPO，该子集旨在将推理能力与死记硬背区分开来。该测试集由领先的大参数LLM持续答错的题目组成。我们将经ClinMPO对齐的轻量LLM的表现与300名医学生组成的队列进行了比较。经ClinMPO调优的Qwen3-8B模型在这些复杂病例上取得了31.4%的诊断准确率，超过了30.8%的人类基准。这些结果表明，医学证据引导的优化使轻量参数大型语言模型能够掌握复杂的推理任务。我们的发现表明，显式的认知对齐为实现可靠且安全的精神科决策支持提供了可扩展的路径。

Prism: Spectral Parameter Sharing for Multi-Agent Reinforcement Learning

Prism：多智能体强化学习中的谱参数共享

Authors: Kyungbeom Kim, Seungwon Oh, Kyung-Joong Kim
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.06476
Pdf link: https://arxiv.org/pdf/2602.06476
Abstract Parameter sharing is a key strategy in multi-agent reinforcement learning (MARL) for improving scalability, yet conventional fully shared architectures often collapse into homogeneous behaviors. Recent methods introduce diversity through clustering, pruning, or masking, but typically compromise resource efficiency. We propose Prism, a parameter sharing framework that induces inter-agent diversity by representing shared networks in the spectral domain via singular value decomposition (SVD). All agents share the singular vector directions while learning distinct spectral masks on singular values. This mechanism encourages inter-agent diversity and preserves scalability. Extensive experiments on both homogeneous (LBF, SMACv2) and heterogeneous (MaMuJoCo) benchmarks show that Prism achieves competitive performance with superior resource efficiency.
中文摘要 参数共享是多智能体强化学习（MARL）中提升可扩展性的关键策略，但传统的全共享架构常常坍缩为同质行为。近期方法通过聚类、剪枝或掩码引入多样性，但通常会牺牲资源效率。我们提出了Prism，这是一种参数共享框架，通过奇异值分解（SVD）在谱域中表示共享网络，从而诱导智能体间的多样性。所有智能体共享奇异向量方向，同时在奇异值上学习各自不同的谱掩码。这种机制促进了智能体间的多样性并保持了可扩展性。在同质基准（LBF、SMACv2）和异质基准（MaMuJoCo）上的大量实验表明，Prism以更优的资源效率取得了有竞争力的性能。
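
To make the Prism mechanism concrete, the sketch below shows one way shared singular vectors and per-agent masks on singular values could combine into agent-specific weights. The function name `spectral_masked_weights`, the sigmoid parametrization of the masks, and the toy shapes are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def spectral_masked_weights(shared_w, mask_logits):
    """Build one weight matrix per agent from a shared SVD basis (hypothetical helper)."""
    # Shared across all agents: singular vectors U, Vt and singular values s.
    u, s, vt = np.linalg.svd(shared_w, full_matrices=False)
    # Per agent: a soft mask in (0, 1) over singular values.
    # The sigmoid form is an assumption made for this sketch.
    masks = 1.0 / (1.0 + np.exp(-mask_logits))  # shape (n_agents, rank)
    # W_i = U diag(m_i * s) V^T; scaling the columns of U applies the diagonal.
    return np.stack([(u * (m * s)) @ vt for m in masks])

rng = np.random.default_rng(0)
shared_w = rng.normal(size=(4, 3))
mask_logits = np.array([
    [50.0, 50.0, 50.0],    # mask close to all ones: keeps the shared matrix
    [50.0, -50.0, -50.0],  # keeps only the leading singular direction
])
agent_weights = spectral_masked_weights(shared_w, mask_logits)
```

In this toy setup the first agent recovers the shared matrix and the second gets a rank-1 variant, so each additional agent costs one mask vector rather than a full copy of the weights.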

AgentCPM-Explore: Realizing Long-Horizon Deep Exploration for Edge-Scale Agents

AgentCPM-Explore：实现边缘规模智能体的长程深度探索

Authors: Haotian Chen, Xin Cong, Shengda Fan, Yuyang Fu, Ziqin Gong, Yaxi Lu, Yishan Li, Boye Niu, Chengjun Pan, Zijun Song, Huadong Wang, Yesai Wu, Yueying Wu, Zihao Xie, Yukun Yan, Zhong Zhang, Yankai Lin, Zhiyuan Liu, Maosong Sun
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.06485
Pdf link: https://arxiv.org/pdf/2602.06485
Abstract While Large Language Model (LLM)-based agents have shown remarkable potential for solving complex tasks, existing systems remain heavily reliant on large-scale models, leaving the capabilities of edge-scale models largely underexplored. In this paper, we present the first systematic study on training agentic models at the 4B-parameter scale. We identify three primary bottlenecks hindering the performance of edge-scale models: catastrophic forgetting during Supervised Fine-Tuning (SFT), sensitivity to reward signal noise during Reinforcement Learning (RL), and reasoning degradation caused by redundant information in long-context scenarios. To address the issues, we propose AgentCPM-Explore, a compact 4B agent model with high knowledge density and strong exploration capability. We introduce a holistic training framework featuring parameter-space model fusion, reward signal denoising, and contextual information refinement. Through deep exploration, AgentCPM-Explore achieves state-of-the-art (SOTA) performance among 4B-class models, matches or surpasses 8B-class SOTA models on four benchmarks, and even outperforms larger-scale models such as Claude-4.5-Sonnet or DeepSeek-v3.2 in five benchmarks. Notably, AgentCPM-Explore achieves 97.09% accuracy on GAIA text-based tasks under pass@64. These results provide compelling evidence that the bottleneck for edge-scale models is not their inherent capability ceiling, but rather their inference stability. Based on our well-established training framework, AgentCPM-Explore effectively unlocks the significant, yet previously underestimated, potential of edge-scale models.
中文摘要 尽管基于大型语言模型（LLM）的智能体在解决复杂任务方面展现出显著潜力，但现有系统仍高度依赖大规模模型，边缘规模模型的能力在很大程度上尚未被充分探索。本文首次对4B参数规模的智能体模型训练进行了系统研究。我们识别出阻碍边缘规模模型性能的三个主要瓶颈：监督微调（SFT）期间的灾难性遗忘、强化学习（RL）中对奖励信号噪声的敏感性，以及长上下文场景中冗余信息导致的推理退化。为解决这些问题，我们提出了AgentCPM-Explore，一种紧凑的4B智能体模型，具有高知识密度和强大的探索能力。我们引入了一个整体训练框架，包含参数空间模型融合、奖励信号去噪和上下文信息精炼。通过深度探索，AgentCPM-Explore在4B级模型中达到了最先进的（SOTA）性能，在四个基准测试中匹敌甚至超越8B级SOTA模型，并在五个基准测试中超过了Claude-4.5-Sonnet或DeepSeek-v3.2等更大规模的模型。值得注意的是，AgentCPM-Explore在pass@64下的GAIA文本任务中准确率达到97.09%。这些结果有力地证明，边缘规模模型的瓶颈并非其固有的能力上限，而是推理稳定性。基于我们完善的训练框架，AgentCPM-Explore有效释放了边缘规模模型巨大但此前被低估的潜力。

Simulating Word Suggestion Usage in Mobile Typing to Guide Intelligent Text Entry Design

模拟移动输入中的单词建议使用以指导智能文本输入设计

Authors: Yang Li, Anna Maria Feit
Subjects: Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2602.06489
Pdf link: https://arxiv.org/pdf/2602.06489
Abstract Intelligent text entry (ITE) methods, such as word suggestions, are widely used in mobile typing, yet improving ITE systems is challenging because the cognitive mechanisms behind suggestion use remain poorly understood, and evaluating new systems often requires long-term user studies to account for behavioral adaptation. We present WSTypist, a reinforcement learning-based model that simulates how typists integrate word suggestions into typing. It builds on recent hierarchical control models of typing, but focuses on the cognitive mechanisms that underlie the high-level decision-making for effectively integrating word suggestions into manual typing: assessing efficiency gains, considering orthographic uncertainties, and including personal reliance on AI support. Our evaluations show that WSTypist simulates diverse human-like suggestion-use strategies, reproduces individual differences, and generalizes across different systems. Importantly, we demonstrate on four design cases how computational rationality models can be used to inform what-if analyses during the design process, by simulating how users might adapt to changes in the UI or in the algorithmic support, reducing the need for long-term user studies.
中文摘要 智能文本输入（ITE）方法，如单词建议，在移动输入中被广泛使用，但改进ITE系统具有挑战性，因为建议使用背后的认知机制尚未被充分理解，而且评估新系统通常需要长期用户研究以考虑行为适应。我们提出WSTypist，一种基于强化学习的模型，模拟打字者如何将单词建议融入打字过程。它建立在近期的打字层级控制模型基础上，但重点关注为将单词建议有效整合到手动输入中而进行高层决策的认知机制：评估效率提升、考虑正字法不确定性，以及纳入个人对AI支持的依赖。我们的评估显示，WSTypist能够模拟多种类人的建议使用策略，重现个体差异，并在不同系统间泛化。重要的是，我们在四个设计案例中展示了如何利用计算理性模型在设计过程中进行假设分析，模拟用户可能如何适应用户界面或算法支持的变化，从而减少对长期用户研究的需求。

Adaptive Uncertainty-Aware Tree Search for Robust Reasoning

面向稳健推理的自适应不确定性感知树搜索

Authors: Zeen Song, Zihao Ma, Wenwen Qiang, Changwen Zheng, Gang Hua
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.06493
Pdf link: https://arxiv.org/pdf/2602.06493
Abstract Inference-time reasoning scaling has significantly advanced the capabilities of Large Language Models (LLMs) in complex problem-solving. A prevalent approach involves external search guided by Process Reward Models (PRMs). However, a fundamental limitation of this framework is the epistemic uncertainty of PRMs when evaluating reasoning paths that deviate from their training distribution. In this work, we conduct a systematic analysis of this challenge. We first provide empirical evidence that PRMs exhibit high uncertainty and unreliable scoring on out-of-distribution (OOD) samples. We then establish a theoretical framework proving that while standard search incurs linear regret accumulation, an uncertainty-aware strategy can achieve sublinear regret. Motivated by these findings, we propose Uncertainty-Aware Tree Search (UATS), a unified method that estimates uncertainty via Monte Carlo Dropout and dynamically allocates compute budget using a reinforcement learning-based controller. Extensive experiments demonstrate that our approach effectively mitigates the impact of OOD errors.
中文摘要 推理时的推理扩展显著提升了大型语言模型（LLMs）在复杂问题求解中的能力。一种普遍的方法是由过程奖励模型（PRM）引导的外部搜索。然而，该框架的一个根本局限是PRM在评估偏离其训练分布的推理路径时存在的认知不确定性。在本研究中，我们对这一挑战进行了系统分析。我们首先提供了实证证据，表明PRM在分布外（OOD）样本上表现出高不确定性和不可靠的评分。随后，我们建立了理论框架，证明标准搜索会导致线性的遗憾累积，而不确定性感知的策略可以实现亚线性遗憾。基于这些发现，我们提出了不确定性感知树搜索（UATS），这是一种通过蒙特卡洛Dropout估计不确定性，并使用基于强化学习的控制器动态分配计算预算的统一方法。大量实验表明，我们的方法有效减轻了OOD错误的影响。
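
As a rough illustration of the Monte Carlo Dropout step mentioned above, the sketch below runs many stochastic forward passes of a toy linear scorer and reports the spread of its outputs as an uncertainty estimate. The linear scorer stands in for a Process Reward Model purely for illustration; `mc_dropout_score`, its parameters, and the toy inputs are hypothetical and not taken from the paper.

```python
import random
import statistics

def mc_dropout_score(features, weights, drop_prob=0.2, n_samples=200, seed=0):
    """Mean and standard deviation of a linear score under random dropout masks."""
    rng = random.Random(seed)
    keep = 1.0 - drop_prob
    samples = []
    for _ in range(n_samples):
        # One stochastic pass: drop each input independently and rescale
        # the survivors by 1/keep (inverted dropout).
        total = 0.0
        for x, w in zip(features, weights):
            if rng.random() < keep:
                total += w * x / keep
        samples.append(total)
    # The spread across passes serves as the uncertainty estimate.
    return statistics.fmean(samples), statistics.pstdev(samples)

mean_score, uncertainty = mc_dropout_score([1.0, 2.0, 3.0], [0.5, -0.25, 1.0])
```

A search controller could, for example, spend more compute on reasoning paths whose `uncertainty` is high; how UATS actually allocates budget is learned by its RL controller and is not reproduced here.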

DreamHome-Pano: Design-Aware and Conflict-Free Panoramic Interior Generation

DreamHome-Pano：设计意识和无冲突的全景室内生成

Authors: Lulu Chen, Yijiang Hu, Yuanqing Liu, Yulong Li, Yue Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.06494
Pdf link: https://arxiv.org/pdf/2602.06494
Abstract In modern interior design, the generation of personalized spaces frequently necessitates a delicate balance between rigid architectural structural constraints and specific stylistic preferences. However, existing multi-condition generative frameworks often struggle to harmonize these inputs, leading to "condition conflicts" where stylistic attributes inadvertently compromise the geometric precision of the layout. To address this challenge, we present DreamHome-Pano, a controllable panoramic generation framework designed for high-fidelity interior synthesis. Our approach introduces a Prompt-LLM that serves as a semantic bridge, effectively translating layout constraints and style references into professional descriptive prompts to achieve precise cross-modal alignment. To safeguard architectural integrity during the generative process, we develop a Conflict-Free Control architecture that incorporates structural-aware geometric priors and a multi-condition decoupling strategy, effectively suppressing stylistic interference from eroding the spatial layout. Furthermore, we establish a comprehensive panoramic interior benchmark alongside a multi-stage training pipeline, encompassing progressive Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). Experimental results demonstrate that DreamHome-Pano achieves a superior balance between aesthetic quality and structural consistency, offering a robust and professional-grade solution for panoramic interior visualization.
中文摘要 在现代室内设计中，个性化空间的生成常常需要在严格的建筑结构约束与特定风格偏好之间取得微妙的平衡。然而，现有的多条件生成框架常常难以协调这些输入，导致“条件冲突”，即风格属性无意中影响布局的几何精度。为应对这一挑战，我们推出了DreamHome-Pano，一款可控的全景生成框架，专为高保真室内综合设计。我们的方法引入了提示-大型语言模型，作为语义桥梁，有效将布局约束和样式引用转化为专业的描述性提示，实现精确的跨模态对齐。为了在生成过程中保障架构完整性，我们开发了一种无冲突控制架构，结合了结构感知的几何先验和多条件解耦策略，有效抑制了风格干扰侵蚀空间布局。此外，我们建立了全面的室内全景基准，并建立了涵盖渐进式监督微调（SFT）和强化学习（RL）的多阶段培训流程。实验结果表明，DreamHome-Pano在美学质量与结构一致性之间取得了卓越的平衡，提供了稳健且专业级别的全景室内可视化解决方案。

World-VLA-Loop: Closed-Loop Learning of Video World Model and VLA Policy

World-VLA-Loop：视频世界模型与VLA策略的闭环学习

Authors: Xiaokang Liu, Zechen Bai, Hai Ci, Kevin Yuchen Ma, Mike Zheng Shou
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2602.06508
Pdf link: https://arxiv.org/pdf/2602.06508
Abstract Recent progress in robotic world models has leveraged video diffusion transformers to predict future observations conditioned on historical states and actions. While these models can simulate realistic visual outcomes, they often exhibit poor action-following precision, hindering their utility for downstream robotic learning. In this work, we introduce World-VLA-Loop, a closed-loop framework for the joint refinement of world models and Vision-Language-Action (VLA) policies. We propose a state-aware video world model that functions as a high-fidelity interactive simulator by jointly predicting future observations and reward signals. To enhance reliability, we introduce the SANS dataset, which incorporates near-success trajectories to improve action-outcome alignment within the world model. This framework enables closed-loop reinforcement learning (RL) post-training of VLA policies entirely within a virtual environment. Crucially, our approach facilitates a co-evolving cycle: failure rollouts generated by the VLA policy are iteratively fed back to refine the world model precision, which in turn enhances subsequent RL optimization. Evaluations across simulation and real-world tasks demonstrate that our framework significantly boosts VLA performance with minimal physical interaction, establishing a mutually beneficial relationship between world modeling and policy learning for general-purpose robotics.
中文摘要 机器人世界模型的最新进展利用视频扩散Transformer，以历史状态和动作为条件预测未来观测。虽然这些模型能够模拟逼真的视觉结果，但它们往往表现出较差的动作跟随精度，限制了其在下游机器人学习中的效用。在本研究中，我们提出了World-VLA-Loop，这是一个用于联合优化世界模型和视觉-语言-动作（VLA）策略的闭环框架。我们提出了一种状态感知的视频世界模型，通过联合预测未来观测和奖励信号，充当高保真交互式模拟器。为提升可靠性，我们引入了SANS数据集，该数据集包含接近成功的轨迹，以改善世界模型中动作与结果的对齐。该框架使VLA策略的强化学习（RL）后训练能够完全在虚拟环境中闭环进行。关键在于，我们的方法促成了一个协同演化的循环：由VLA策略生成的失败rollout被迭代地反馈回来以提升世界模型的精度，进而增强后续的强化学习优化。在仿真和真实世界任务上的评估表明，我们的框架以极少的物理交互显著提升了VLA性能，在世界建模与通用机器人的策略学习之间建立了互利关系。

Progress Constraints for Reinforcement Learning in Behavior Trees

行为树中强化学习的进展约束

Authors: Finn Rietz, Mart Kartašev, Johannes A. Stork, Petter Ögren
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.06525
Pdf link: https://arxiv.org/pdf/2602.06525
Abstract Behavior Trees (BTs) provide a structured and reactive framework for decision-making, commonly used to switch between sub-controllers based on environmental conditions. Reinforcement Learning (RL), on the other hand, can learn near-optimal controllers but sometimes struggles with sparse rewards, safe exploration, and long-horizon credit assignment. Combining BTs with RL has the potential for mutual benefit: a BT design encodes structured domain knowledge that can simplify RL training, while RL enables automatic learning of the controllers within BTs. However, naive integration of BTs and RL can lead to some controllers counteracting other controllers, possibly undoing previously achieved subgoals, thereby degrading the overall performance. To address this, we propose progress constraints, a novel mechanism where feasibility estimators constrain the allowed action set based on theoretical BT convergence results. Empirical evaluations in a 2D proof-of-concept and a high-fidelity warehouse environment demonstrate improved performance, sample efficiency, and constraint satisfaction, compared to prior methods of BT-RL integration.
中文摘要 行为树（BTs）提供了一个结构化且反应式的决策框架，常用于根据环境条件在子控制器之间切换。而强化学习（RL）能够学习近似最优的控制器，但有时在稀疏奖励、安全探索和长时程信用分配方面遇到困难。将BT与RL结合具有互利的潜力：BT设计编码了结构化的领域知识，可以简化RL训练，而RL则实现了BT内控制器的自动学习。然而，BT和RL的简单集成可能导致某些控制器与其他控制器相互抵触，可能撤销先前已达成的子目标，从而降低整体性能。为此，我们提出了进展约束，这是一种新颖的机制，其中可行性估计器基于BT收敛性的理论结果约束允许的动作集。在二维概念验证环境和高保真仓库环境中的实证评估显示，与以往的BT-RL集成方法相比，性能、样本效率和约束满足度均有所提升。

Dynamics-Aligned Shared Hypernetworks for Zero-Shot Actuator Inversion

面向零样本执行器反转的动力学对齐共享超网络

Authors: Jan Benad, Pradeep Kr. Banerjee, Frank Röder, Nihat Ay, Martin V. Butz, Manfred Eppe
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.06550
Pdf link: https://arxiv.org/pdf/2602.06550
Abstract Zero-shot generalization in contextual reinforcement learning remains a core challenge, particularly when the context is latent and must be inferred from data. A canonical failure mode is actuator inversion, where identical actions produce opposite physical effects under a latent binary context. We propose DMA-SH, a framework where a single hypernetwork, trained solely via dynamics prediction, generates a small set of adapter weights shared across the dynamics model, policy, and action-value function. This shared modulation imparts an inductive bias matched to actuator inversion, while input/output normalization and random input masking stabilize context inference, promoting directionally concentrated representations. We provide theoretical support via an expressivity separation result for hypernetwork modulation, and a variance decomposition with policy-gradient variance bounds that formalize how within-mode compression improves learning under actuator inversion. For evaluation, we introduce the Actuator Inversion Benchmark (AIB), a suite of environments designed to isolate discontinuous context-to-dynamics interactions. On AIB's held-out actuator-inversion tasks, DMA-SH achieves zero-shot generalization, outperforming domain randomization by 111.8% and surpassing a standard context-aware baseline by 16.1%.
中文摘要 上下文强化学习中的零样本泛化仍是核心挑战，尤其是在上下文是潜在的且必须从数据中推断时。一种典型的失效模式是执行器反转，即在潜在的二元上下文下，相同的动作产生相反的物理效应。我们提出了DMA-SH框架，其中单个超网络仅通过动力学预测进行训练，生成一小组适配器权重，这些权重在动力学模型、策略和动作价值函数之间共享。这种共享调制赋予了与执行器反转相匹配的归纳偏置，而输入/输出归一化和随机输入掩码则稳定了上下文推断，促进了方向集中的表示。我们通过超网络调制的表达能力分离结果，以及带有策略梯度方差界的方差分解提供理论支持，后者形式化了模式内压缩如何改善执行器反转下的学习。为进行评估，我们引入了执行器反转基准（AIB），这是一套旨在隔离不连续的上下文到动力学交互的环境。在AIB的留出执行器反转任务上，DMA-SH实现了零样本泛化，比域随机化高出111.8%，并比标准的上下文感知基线高出16.1%。

SeeUPO: Sequence-Level Agentic-RL with Convergence Guarantees

SeeUPO：带收敛保证的序列级智能体强化学习

Authors: Tianyi Hu, Qingxu Fu, Yanxi Chen, Zhaoyang Liu, Bolin Ding
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.06554
Pdf link: https://arxiv.org/pdf/2602.06554
Abstract Reinforcement learning (RL) has emerged as the predominant paradigm for training large language model (LLM)-based AI agents. However, existing backbone RL algorithms lack verified convergence guarantees in agentic scenarios, especially in multi-turn settings, which can lead to training instability and failure to converge to optimal policies. In this paper, we systematically analyze how different combinations of policy update mechanisms and advantage estimation methods affect convergence properties in single/multi-turn scenarios. We find that REINFORCE with Group Relative Advantage Estimation (GRAE) can converge to the global optimum under undiscounted conditions, but the combination of PPO & GRAE breaks PPO's original monotonic improvement property. Furthermore, we demonstrate that mainstream backbone RL algorithms cannot simultaneously be critic-free and offer convergence guarantees in multi-turn scenarios. To address this, we propose SeeUPO (Sequence-level Sequential Update Policy Optimization), a critic-free approach with convergence guarantees for multi-turn interactions. SeeUPO models multi-turn interaction as sequentially executed multi-agent bandit problems. Through turn-by-turn sequential policy updates in reverse execution order, it ensures monotonic improvement and convergence to the globally optimal solution via backward induction. Experiments on AppWorld and BFCL v4 demonstrate SeeUPO's substantial improvements over existing backbone algorithms: relative gains of 43.3%-54.6% on Qwen3-14B and 24.1%-41.9% on Qwen2.5-14B (averaged across benchmarks), along with superior training stability.
中文摘要 强化学习（RL）已成为训练基于大型语言模型（LLM）的AI智能体的主流范式。然而，现有的骨干强化学习算法在智能体场景中缺乏经过验证的收敛保证，尤其是在多轮设置中，这可能导致训练不稳定以及无法收敛到最优策略。本文系统分析了策略更新机制与优势估计方法的不同组合如何影响单轮/多轮场景中的收敛性质。我们发现，REINFORCE与组相对优势估计（GRAE）结合，在无折扣条件下可以收敛到全局最优，但PPO与GRAE的组合破坏了PPO原有的单调改进性质。此外，我们证明主流骨干强化学习算法在多轮场景下无法同时做到无critic并具备收敛保证。为此，我们提出了SeeUPO（序列级顺序更新策略优化），这是一种无critic的方法，为多轮交互提供收敛保证。SeeUPO将多轮交互建模为顺序执行的多智能体老虎机（bandit）问题。通过按执行的逆序逐轮顺序更新策略，它借助逆向归纳确保单调改进并收敛到全局最优解。在AppWorld和BFCL v4上的实验显示，SeeUPO相较现有骨干算法有显著提升：在Qwen3-14B上相对提升43.3%-54.6%，在Qwen2.5-14B上相对提升24.1%-41.9%（跨基准取平均），同时训练稳定性更优。
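
For readers unfamiliar with group-relative advantages, the sketch below assumes the common mean-baseline form, in which each rollout's advantage is its reward minus the mean reward of the rollouts sampled for the same prompt. The abstract does not spell out the exact definition of GRAE (it may, for example, also normalize by the group standard deviation), so the function name and the formula here are assumptions.

```python
def group_relative_advantages(rewards):
    """Advantage of each rollout relative to its group's mean reward (assumed form)."""
    # The group mean acts as the baseline, so no learned critic is needed.
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

# Four rollouts for one prompt: two succeed (reward 1), two fail (reward 0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline is computed from the group itself, the advantages sum to zero within each group, which is what makes this kind of estimator critic-free.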

Reinforcement Learning-Based Dynamic Management of Structured Parallel Farm Skeletons on Serverless Platforms

基于强化学习的结构化并行农场骨架在无服务器平台上的动态管理

Authors: Lanpei Li, Massimo Coppola, Malio Li, Valerio Besozzi, Jack Bell, Vincenzo Lomonaco
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.06555
Pdf link: https://arxiv.org/pdf/2602.06555
Abstract We present a framework for dynamic management of structured parallel processing skeletons on serverless platforms. Our goal is to bring HPC-like performance and resilience to serverless and continuum environments while preserving the programmability benefits of skeletons. As a first step, we focus on the well known Farm pattern and its implementation on the open-source OpenFaaS platform, treating autoscaling of the worker pool as a QoS-aware resource management problem. The framework couples a reusable farm template with a Gymnasium-based monitoring and control layer that exposes queue, timing, and QoS metrics to both reactive and learning-based controllers. We investigate the effectiveness of AI-driven dynamic scaling for managing the farm's degree of parallelism via the scalability of serverless functions on OpenFaaS. In particular, we discuss the autoscaling model and its training, and evaluate two reinforcement learning (RL) policies against a baseline of reactive management derived from a simple farm performance model. Our results show that AI-based management can better accommodate platform-specific limitations than purely model-based performance steering, improving QoS while maintaining efficient resource usage and stable scaling behaviour.
中文摘要 我们提出了一个在无服务器平台上动态管理结构化并行处理骨架的框架。我们的目标是在保持骨架可编程优势的同时，将高性能计算（HPC）的性能和韧性带入无服务器和连续体环境。作为第一步，我们关注著名的农场模式及其在开源OpenFaaS平台上的实现，将工人池的自动扩展视为一个QoS感知的资源管理问题。该框架将可复用的农场模板与基于Gymnasium的监控和控制层结合，向被动和基于学习的控制器开放队列、时序和服务质量指标。我们研究了AI驱动的动态扩展在通过OpenFaaS无服务器功能可扩展性来管理农场并行程度的有效性。特别地，我们讨论了自扩展模型及其训练，并结合基于简单农场绩效模型的基线进行评估两种强化学习（RL）策略。我们的结果表明，基于人工智能的管理比纯模型性能引导更能满足平台特有的限制，提升服务质量，同时保持资源高效使用和稳定的扩展行为。

SPARC: Separating Perception And Reasoning Circuits for Test-time Scaling of VLMs

SPARC：分离感知与推理回路以实现VLM的测试时扩展

Authors: Niccolo Avogaro, Nayanika Debnath, Li Mi, Thomas Frick, Junling Wang, Zexue He, Hang Hua, Konrad Schindler, Mattia Rigotti
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.06566
Pdf link: https://arxiv.org/pdf/2602.06566
Abstract Despite recent successes, test-time scaling - i.e., dynamically expanding the token budget during inference as needed - remains brittle for vision-language models (VLMs): unstructured chains-of-thought about images entangle perception and reasoning, leading to long, disorganized contexts where small perceptual mistakes may cascade into completely wrong answers. Moreover, expensive reinforcement learning with hand-crafted rewards is required to achieve good performance. Here, we introduce SPARC (Separating Perception And Reasoning Circuits), a modular framework that explicitly decouples visual perception from reasoning. Inspired by sequential sensory-to-cognitive processing in the brain, SPARC implements a two-stage pipeline where the model first performs explicit visual search to localize question-relevant regions, then conditions its reasoning on those regions to produce the final answer. This separation enables independent test-time scaling with asymmetric compute allocation (e.g., prioritizing perceptual processing under distribution shift), supports selective optimization (e.g., improving the perceptual stage alone when it is the bottleneck for end-to-end performance), and accommodates compressed contexts by running global search at lower image resolutions and allocating high-resolution processing only to selected regions, thereby reducing total visual tokens count and compute. Across challenging visual reasoning benchmarks, SPARC outperforms monolithic baselines and strong visual-grounding approaches. For instance, SPARC improves the accuracy of Qwen3VL-4B on the $V^*$ VQA benchmark by 6.7 percentage points, and it surpasses "thinking with images" by 4.6 points on a challenging OOD task despite requiring a 200$\times$ lower token budget.
中文摘要 尽管近期取得了一些成功，测试时扩展（即在推理过程中根据需要动态扩大token预算）对于视觉语言模型（VLMs）来说仍然脆弱：关于图像的非结构化思维链将感知与推理纠缠在一起，导致冗长而混乱的上下文，其中微小的感知错误可能级联成完全错误的答案。此外，要取得良好的性能，还需要代价高昂、依赖手工设计奖励的强化学习。在此，我们提出SPARC（分离感知与推理回路），这是一个将视觉感知与推理显式解耦的模块化框架。受大脑中从感觉到认知的顺序处理启发，SPARC实现了一个两阶段流程：模型首先进行显式的视觉搜索以定位与问题相关的区域，然后以这些区域为条件进行推理，得出最终答案。这种分离支持采用非对称计算分配的独立测试时扩展（例如，在分布偏移下优先进行感知处理），支持选择性优化（例如，当感知阶段是端到端性能的瓶颈时，仅改进该阶段），并通过在较低图像分辨率下运行全局搜索、仅对选定区域分配高分辨率处理来适应压缩的上下文，从而减少视觉token总数和计算量。在具有挑战性的视觉推理基准上，SPARC优于整体式基线和强大的视觉定位方法。例如，SPARC将Qwen3VL-4B在$V^*$ VQA基准上的准确率提高了6.7个百分点，并且在一项具有挑战性的OOD任务上比“用图像思考”高出4.6个百分点，而所需的token预算低200倍。

Sample-Efficient Policy Space Response Oracles with Joint Experience Best Response

具有联合经验最佳响应的样本高效策略空间响应预言机

Authors: Ariyan Bighashdel, Thiago D. Simão, Frans A. Oliehoek
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.06599
Pdf link: https://arxiv.org/pdf/2602.06599
Abstract Multi-agent reinforcement learning (MARL) offers a scalable alternative to exact game-theoretic analysis but suffers from non-stationarity and the need to maintain diverse populations of strategies that capture non-transitive interactions. Policy Space Response Oracles (PSRO) address these issues by iteratively expanding a restricted game with approximate best responses (BRs), yet per-agent BR training makes it prohibitively expensive in many-agent or simulator-expensive settings. We introduce Joint Experience Best Response (JBR), a drop-in modification to PSRO that collects trajectories once under the current meta-strategy profile and reuses this joint dataset to compute BRs for all agents simultaneously. This amortizes environment interaction and improves the sample efficiency of best-response computation. Because JBR converts BR computation into an offline RL problem, we propose three remedies for distribution-shift bias: (i) Conservative JBR with safe policy improvement, (ii) Exploration-Augmented JBR that perturbs data collection and admits theoretical guarantees, and (iii) Hybrid BR that interleaves JBR with periodic independent BR updates. Across benchmark multi-agent environments, Exploration-Augmented JBR achieves the best accuracy-efficiency trade-off, while Hybrid BR attains near-PSRO performance at a fraction of the sample cost. Overall, JBR makes PSRO substantially more practical for large-scale strategic learning while preserving equilibrium robustness.
中文摘要 多智能体强化学习（MARL）为精确的博弈论分析提供了可扩展的替代方案，但受困于非平稳性，并且需要维护能够刻画非传递性交互的多样化策略群体。策略空间响应预言机（PSRO）通过用近似最佳响应（BR）迭代扩展一个受限博弈来解决这些问题，但逐智能体的BR训练使其在智能体众多或模拟器开销高昂的场景中成本过高。我们引入了联合经验最佳响应（JBR），这是对PSRO的一种即插即用的修改，它在当前元策略组合下只收集一次轨迹，并重复利用该联合数据集同时为所有智能体计算BR。这摊销了环境交互的开销，并提高了最佳响应计算的样本效率。由于JBR将BR计算转化为离线强化学习问题，我们针对分布偏移偏差提出了三种补救措施：（i）带有安全策略改进的保守JBR，（ii）对数据收集施加扰动并具有理论保证的探索增强JBR，以及（iii）将JBR与周期性独立BR更新交错进行的混合BR。在基准多智能体环境中，探索增强JBR实现了最佳的精度与效率权衡，而混合BR则以一小部分样本成本达到接近PSRO的性能。总体而言，JBR使PSRO在大规模策略学习中显著更实用，同时保持了均衡的鲁棒性。

The hidden risks of temporal resampling in clinical reinforcement learning

临床强化学习中时间重采样的隐性风险

Authors: Thomas Frost, Hrisheekesh Vaidya, Steve Harris
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.06603
Pdf link: https://arxiv.org/pdf/2602.06603
Abstract Offline reinforcement learning (ORL) has shown potential for improving decision-making in healthcare. However, contemporary research typically aggregates patient data into fixed time intervals, simplifying their mapping to standard ORL frameworks. The impact of these temporal manipulations on model safety and efficacy remains poorly understood. In this work, using both a gridworld navigation task and the UVA/Padova clinical diabetes simulator, we demonstrate that temporal resampling significantly degrades the performance of offline reinforcement learning algorithms during live deployment. We propose three mechanisms that drive this failure: (i) the generation of counterfactual trajectories, (ii) the distortion of temporal expectations, and (iii) the compounding of generalisation errors. Crucially, we find that standard off-policy evaluation metrics can fail to detect these drops in performance. Our findings reveal a fundamental risk in current healthcare ORL pipelines and emphasise the need for methods that explicitly handle the irregular timing of clinical decision-making.
中文摘要 离线强化学习（ORL）已展现出改善医疗决策的潜力。然而，当前的研究通常将患者数据聚合到固定的时间区间中，以简化其到标准ORL框架的映射。这些时间操作对模型安全性和有效性的影响仍缺乏充分认识。在本研究中，我们使用网格世界导航任务和UVA/Padova临床糖尿病模拟器，证明时间重采样会显著降低离线强化学习算法在实际部署时的性能。我们提出了导致这种失效的三种机制：（i）反事实轨迹的生成，（ii）时间预期的扭曲，以及（iii）泛化误差的累积。关键在于，我们发现标准的离策略评估指标可能无法检测到这些性能下降。我们的研究揭示了当前医疗ORL流程中的一个根本风险，并强调需要能够显式处理临床决策不规则时间的方法。
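
The toy sketch below illustrates how fixed-interval aggregation can produce state-action pairs that never occurred. It assumes observations are averaged and actions summed within each bin, which is a common preprocessing choice but not one stated in the abstract; the `resample` helper and the glucose and dose values are hypothetical.

```python
from collections import defaultdict

def resample(events, bin_width):
    """Aggregate (time, observation, action) events into fixed-width bins.

    Observations are averaged and actions summed within each bin
    (an assumed aggregation rule, chosen for illustration).
    """
    bins = defaultdict(list)
    for t, obs, act in events:
        bins[int(t // bin_width)].append((obs, act))
    out = []
    for b in sorted(bins):
        obs_vals = [o for o, _ in bins[b]]
        total_action = sum(a for _, a in bins[b])
        out.append((b * bin_width, sum(obs_vals) / len(obs_vals), total_action))
    return out

# Irregular events: (hours, glucose reading, insulin dose).
events = [(0.0, 180.0, 0.0), (0.5, 240.0, 4.0), (3.5, 90.0, 0.0)]
hourly = resample(events, bin_width=1.0)
```

In the hourly series, the dose of 4.0 is paired with an averaged reading of 210.0, although it was actually given at a reading of 240.0, and the empty bins between hours 1 and 3 hide the true 3-hour gap before the next measurement.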

Humanoid Manipulation Interface: Humanoid Whole-Body Manipulation from Robot-Free Demonstrations

人形机器人操作接口：基于无机器人演示的人形全身操作

Authors: Ruiqian Nai, Boyuan Zheng, Junming Zhao, Haodong Zhu, Sicong Dai, Zunhao Chen, Yihang Hu, Yingdong Hu, Tong Zhang, Chuan Wen, Yang Gao
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.06643
Pdf link: https://arxiv.org/pdf/2602.06643
Abstract Current approaches for humanoid whole-body manipulation, primarily relying on teleoperation or visual sim-to-real reinforcement learning, are hindered by hardware logistics and complex reward engineering. Consequently, demonstrated autonomous skills remain limited and are typically restricted to controlled environments. In this paper, we present the Humanoid Manipulation Interface (HuMI), a portable and efficient framework for learning diverse whole-body manipulation tasks across various environments. HuMI enables robot-free data collection by capturing rich whole-body motion using portable hardware. This data drives a hierarchical learning pipeline that translates human motions into dexterous and feasible humanoid skills. Extensive experiments across five whole-body tasks--including kneeling, squatting, tossing, walking, and bimanual manipulation--demonstrate that HuMI achieves a 3x increase in data collection efficiency compared to teleoperation and attains a 70% success rate in unseen environments.
中文摘要 目前的人形全身操作方法主要依赖遥操作或视觉仿真到现实（sim-to-real）的强化学习，受到硬件后勤和复杂奖励工程的阻碍。因此，已展示的自主技能仍然有限，并且通常局限于受控环境。本文提出了人形机器人操作接口（HuMI），这是一个便携而高效的框架，用于在各种环境中学习多样化的全身操作任务。HuMI使用便携式硬件捕捉丰富的全身运动，实现无需机器人的数据采集。这些数据驱动一个层级学习流程，将人类动作转化为灵巧且可行的人形机器人技能。在五项全身任务（包括跪姿、下蹲、抛掷、行走和双手操作）上的大量实验表明，HuMI的数据采集效率相比遥操作提高了3倍，并在未见过的环境中达到70%的成功率。

compar:IA: The French Government's LLM arena to collect French-language human prompts and preference data

compar:IA：法国政府的大型语言模型竞技场，用于收集法语人类提示和偏好数据

Authors: Lucie Termignon, Simonas Zilinskas, Hadrien Pélissier, Aurélien Barrot, Nicolas Chesnais, Elie Gavoty
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.06669
Pdf link: https://arxiv.org/pdf/2602.06669
Abstract Large Language Models (LLMs) often show reduced performance, cultural alignment, and safety robustness in non-English languages, partly because English dominates both pre-training data and human preference alignment datasets. Training methods like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) require human preference data, which remains scarce and largely non-public for many languages beyond English. To address this gap, we introduce compar:IA, an open-source digital public service developed inside the French government and designed to collect large-scale human preference data from a predominantly French-speaking general audience. The platform uses a blind pairwise comparison interface to capture unconstrained, real-world prompts and user judgments across a diverse set of language models, while maintaining low participation friction and privacy-preserving automated filtering. As of 2026-02-07, compar:IA has collected over 600,000 free-form prompts and 250,000 preference votes, with approximately 89% of the data in French. We release three complementary datasets -- conversations, votes, and reactions -- under open licenses, and present initial analyses, including a French-language model leaderboard and user interaction patterns. Beyond the French context, compar:IA is evolving toward an international digital public good, offering reusable infrastructure for multilingual model training, evaluation, and the study of human-AI interaction.
中文摘要 大型语言模型（LLMs）在非英语语言中的性能、文化对齐和安全鲁棒性往往较差，部分原因是英语在预训练数据和人类偏好对齐数据集中都占主导地位。基于人类反馈的强化学习（RLHF）和直接偏好优化（DPO）等训练方法需要人类偏好数据，而对英语以外的许多语言而言，这类数据仍然稀缺且大多不公开。为弥补这一空白，我们推出了 compar:IA，这是一项由法国政府内部开发的开源数字公共服务，旨在从以法语使用者为主的普通大众中收集大规模人类偏好数据。该平台采用盲测的成对比较界面，在多种语言模型上收集不受限制的真实世界提示和用户判断，同时保持较低的参与门槛，并进行保护隐私的自动过滤。截至2026年2月7日，compar:IA 已收集超过60万条自由形式提示和25万张偏好投票，其中约89%的数据为法语。我们以开放许可发布了三个互补的数据集（对话、投票和反应），并给出初步分析，包括法语模型排行榜和用户交互模式。在法国语境之外，compar:IA 正朝着国际数字公共产品的方向发展，为多语言模型训练、评估以及人机交互研究提供可复用的基础设施。

Evaluating and Enhancing the Vulnerability Reasoning Capabilities of Large Language Models

评估和增强大型语言模型的漏洞推理能力

Authors: Li Lu, Yanjie Zhao, Hongzhou Rao, Kechi Zhang, Haoyu Wang
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2602.06687
Pdf link: https://arxiv.org/pdf/2602.06687
Abstract Large Language Models (LLMs) have demonstrated remarkable proficiency in vulnerability detection. However, a critical reliability gap persists: models frequently yield correct detection verdicts based on hallucinated logic or superficial patterns that deviate from the actual root cause. This misalignment remains largely obscured because contemporary benchmarks predominantly prioritize coarse-grained classification metrics, lacking the granular ground truth required to evaluate the underlying reasoning process. To bridge this gap, we first construct a benchmark consisting of two datasets: (1) real-world vulnerabilities with expert-curated causal reasoning as ground truth, and (2) semantically equivalent code perturbations for assessing reasoning robustness. Our large-scale empirical study reveals that even state-of-the-art models struggle to maintain logical consistency during semantic code comprehension, exhibiting 12 systematic failure patterns. Addressing these limitations, we propose DAGVul, a novel framework that models vulnerability reasoning as a Directed Acyclic Graph (DAG) generation task. Unlike linear chain-of-thought (CoT), our approach explicitly maps causal dependencies to enforce structural consistency. By further introducing Reinforcement Learning with Verifiable Rewards (RLVR), we align model reasoning trace with program-intrinsic logic. Experimental results demonstrate that our framework improves the reasoning F1-score by an average of 18.9% over all the baselines. Remarkably, our 8B-parameter implementation not only outperforms existing models of comparable scale but also surpasses specialized large-scale reasoning models, including Qwen3-30B-Reasoning and GPT-OSS-20B-High. It is even competitive with state-of-the-art models like Claude-Sonnet-4.5 (75.47% vs. 76.11%), establishing new efficiency in vulnerability reasoning across model scales.
中文摘要 大型语言模型（LLMs）在漏洞检测方面表现出了卓越的能力。然而，一个关键的可靠性缺口依然存在：模型常常基于幻觉式的逻辑或偏离实际根本原因的表面模式，给出正确的检测结论。这种错位在很大程度上被掩盖，因为当前的基准主要看重粗粒度的分类指标，缺乏评估底层推理过程所需的细粒度真值。为弥合这一差距，我们首先构建了一个包含两个数据集的基准：（1）以专家整理的因果推理为真值的真实世界漏洞，以及（2）用于评估推理鲁棒性的语义等价代码扰动。我们的大规模实证研究显示，即使是最先进的模型，在语义代码理解过程中也难以保持逻辑一致性，表现出12种系统性失败模式。针对这些局限，我们提出了DAGVul，这是一个将漏洞推理建模为有向无环图（DAG）生成任务的新框架。与线性思维链（CoT）不同，我们的方法显式刻画因果依赖关系，以保证结构一致性。通过进一步引入带可验证奖励的强化学习（RLVR），我们使模型的推理轨迹与程序内在逻辑对齐。实验结果表明，我们的框架将推理F1分数相对所有基线平均提升了18.9%。值得注意的是，我们的8B参数实现不仅优于现有的同等规模模型，还超越了专门的大规模推理模型，包括Qwen3-30B-Reasoning和GPT-OSS-20B-High。它甚至可与Claude-Sonnet-4.5等最先进模型相媲美（75.47%对76.11%），在各种模型规模上为漏洞推理确立了新的效率水平。

F-GRPO: Don't Let Your Policy Learn the Obvious and Forget the Rare

F-GRPO：别让你的策略只学会显而易见的，却忘了罕见的

Authors: Daniil Plyusov, Alexey Gorbatovski, Boris Shaposhnikov, Viacheslav Sinii, Alexey Malakhov, Daniil Gavrilov
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.06717
Pdf link: https://arxiv.org/pdf/2602.06717
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) is commonly based on group sampling to estimate advantages and stabilize policy updates. In practice, large group sizes are not feasible due to computational limits, which biases learning toward trajectories that are already likely. Smaller groups often miss rare-correct trajectories while still containing mixed rewards, concentrating probability on common solutions. We derive the probability that updates miss rare-correct modes as a function of group size, showing non-monotonic behavior, and characterize how updates redistribute mass within the correct set, revealing that unsampled-correct mass can shrink even as total correct mass grows. Motivated by this analysis, we propose a difficulty-aware advantage scaling coefficient, inspired by Focal loss, that down-weights updates on high-success prompts. The lightweight modification can be directly integrated into any group-relative RLVR algorithm such as GRPO, DAPO, and CISPO. On Qwen2.5-7B across in-domain and out-of-domain benchmarks, our method improves pass@256 from 64.1 $\rightarrow$ 70.3 (GRPO), 69.3 $\rightarrow$ 72.5 (DAPO), and 73.2 $\rightarrow$ 76.8 (CISPO), while preserving or improving pass@1, without increasing group size or computational cost.
中文摘要 带可验证奖励的强化学习（RLVR）通常基于组采样来估计优势并稳定策略更新。在实践中，由于计算限制，大的组规模并不可行，这会使学习偏向于本来就高概率的轨迹。较小的组往往会漏掉罕见的正确轨迹，同时组内奖励仍有好有坏，从而把概率集中到常见解上。我们推导了更新漏掉罕见正确模式的概率与组规模之间的函数关系，显示其具有非单调行为，并刻画了更新如何在正确解集合内部重新分配概率质量，揭示了即使总的正确质量在增长，未被采样的正确质量也可能缩小。基于这一分析，我们提出了一种受 Focal loss 启发的难度感知优势缩放系数，用于降低高成功率提示上的更新权重。这一轻量级修改可以直接集成到任何组相对的RLVR算法中，如GRPO、DAPO和CISPO。在Qwen2.5-7B上，在域内和域外基准上，我们的方法将pass@256从64.1 $\rightarrow$ 70.3（GRPO）、69.3 $\rightarrow$ 72.5（DAPO）和73.2 $\rightarrow$ 76.8（CISPO）提升，同时保持或提升pass@1，且不增加组规模或计算成本。
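
The F-GRPO abstract describes a difficulty-aware coefficient that down-weights group-relative advantages on high-success prompts. The abstract does not give the exact formula, so the sketch below assumes a focal-loss-style factor `(1 - success_rate) ** gamma`; the function name, `gamma`, and `eps` are illustrative choices and not the authors' definitions.

```python
from statistics import mean, pstdev


def focal_scaled_advantages(rewards, gamma=2.0, eps=1e-6):
    """Group-relative advantages, down-weighted on high-success prompts.

    rewards: 0/1 verifiable rewards for one prompt's sampled group.
    gamma:   assumed focusing exponent (hypothetical, focal-loss style).
    """
    success_rate = mean(rewards)   # fraction of correct rollouts in the group
    spread = pstdev(rewards)       # population std of rewards within the group
    # Standard group-relative advantage: centre and normalise within the group.
    base = [(r - success_rate) / (spread + eps) for r in rewards]
    # Assumed coefficient: close to 0 for easy prompts, close to 1 for hard ones.
    weight = (1.0 - success_rate) ** gamma
    return [weight * a for a in base]


easy = focal_scaled_advantages([1, 1, 1, 0])  # 75% of rollouts succeed
hard = focal_scaled_advantages([0, 0, 0, 1])  # 25% of rollouts succeed
```

Under these assumptions the hard prompt keeps most of its update magnitude while the easy prompt's update shrinks, which matches the intent stated in the abstract without changing the group size.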

Semantically Labelled Automata for Multi-Task Reinforcement Learning with LTL Instructions

多任务强化学习的语义标签自动机，采用LTL指令

Authors: Alessandro Abate, Giuseppe De Giacomo, Mathias Jackermeier, Jan Kretínský, Maximilian Prokop, Christoph Weinhuber
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.06746
Pdf link: https://arxiv.org/pdf/2602.06746
Abstract We study multi-task reinforcement learning (RL), a setting in which an agent learns a single, universal policy capable of generalising to arbitrary, possibly unseen tasks. We consider tasks specified as linear temporal logic (LTL) formulae, which are commonly used in formal methods to specify properties of systems, and have recently been successfully adopted in RL. In this setting, we present a novel task embedding technique leveraging a new generation of semantic LTL-to-automata translations, originally developed for temporal synthesis. The resulting semantically labelled automata contain rich, structured information in each state that allow us to (i) compute the automaton efficiently on-the-fly, (ii) extract expressive task embeddings used to condition the policy, and (iii) naturally support full LTL. Experimental results in a variety of domains demonstrate that our approach achieves state-of-the-art performance and is able to scale to complex specifications where existing methods fail.
中文摘要 我们研究多任务强化学习（RL），在这种设定下，智能体学习单一的通用策略，能够泛化到任意的、可能从未见过的任务。我们考虑以线性时序逻辑（LTL）公式指定的任务，这类公式常用于形式化方法中描述系统性质，最近也已被成功用于强化学习。在此设定下，我们提出了一种新的任务嵌入技术，它利用了最初为时序综合而开发的新一代语义化LTL到自动机的转换。所得的语义标注自动机在每个状态中都包含丰富的结构化信息，使我们能够（i）高效地即时（on-the-fly）计算自动机，（ii）提取富有表达力的任务嵌入用于对策略进行条件化，以及（iii）自然地支持完整的LTL。多个领域的实验结果表明，我们的方法达到了最先进的性能，并能够扩展到现有方法无法处理的复杂规约。

R-Align: Enhancing Generative Reward Models through Rationale-Centric Meta-Judging

R-Align：通过以理由为中心的元评判增强生成式奖励模型

Authors: Yanlin Lai, Mitt Huang, Hangyu Guo, Xiangfeng Wang, Haodong Li, Shaoxiong Zhan, Liang Zhao, Chengyuan Yao, Yinmin Zhang, Qi Han, Chun Yuan, Zheng Ge, Xiangyu Zhang, Daxin Jiang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.06763
Pdf link: https://arxiv.org/pdf/2602.06763
Abstract Reinforcement Learning from Human Feedback (RLHF) remains indispensable for aligning large language models (LLMs) in subjective domains. To enhance robustness, recent work shifts toward Generative Reward Models (GenRMs) that generate rationales before predicting preferences. Yet in GenRM training and evaluation, practice remains outcome-label-only, leaving reasoning quality unchecked. We show that reasoning fidelity-the consistency between a GenRM's preference decision and reference decision rationales-is highly predictive of downstream RLHF outcomes, beyond standard label accuracy. Specifically, we repurpose existing reward-model benchmarks to compute Spurious Correctness (S-Corr)-the fraction of label-correct decisions with rationales misaligned with golden judgments. Our empirical evaluation reveals substantial S-Corr even for competitive GenRMs, and higher S-Corr is associated with policy degeneration under optimization. To improve fidelity, we propose Rationale-Centric Alignment, R-Align, which augments training with gold judgments and explicitly supervises rationale alignment. R-Align reduces S-Corr on RM benchmarks and yields consistent gains in actor performance across STEM, coding, instruction following, and general tasks.
中文摘要 基于人类反馈的强化学习（RLHF）对于在主观领域中对齐大型语言模型（LLMs）仍然不可或缺。为了增强鲁棒性，近期工作转向生成式奖励模型（GenRMs），即在预测偏好之前先生成理由。然而，GenRM的训练和评估在实践中仍只依赖结果标签，推理质量无人检查。我们表明，推理忠实度（即GenRM的偏好决策与参考决策理由之间的一致性）对下游RLHF结果具有很强的预测力，超出了标准的标签准确率。具体来说，我们复用现有的奖励模型基准来计算虚假正确率（S-Corr），即在标签正确的决策中，理由与黄金判断不一致者所占的比例。我们的实证评估显示，即使是有竞争力的GenRM也存在相当高的S-Corr，且更高的S-Corr与优化过程中的策略退化相关。为了提高忠实度，我们提出了以理由为中心的对齐方法R-Align，它用黄金判断来增强训练，并显式监督理由的对齐。R-Align降低了RM基准上的S-Corr，并在STEM、编程、指令遵循和通用任务上为actor性能带来一致的提升。
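
The R-Align abstract defines Spurious Correctness (S-Corr) as the fraction of label-correct decisions whose rationales are misaligned with the golden judgments. A minimal sketch of that ratio follows; the record layout (`label_correct`, `rationale_aligned`) is a hypothetical format, and how rationale alignment is actually judged is outside what the abstract specifies.

```python
def spurious_correctness(records):
    """S-Corr: share of label-correct decisions with a misaligned rationale.

    records: list of dicts with boolean fields 'label_correct' and
             'rationale_aligned' (hypothetical field names).
    """
    correct = [r for r in records if r["label_correct"]]
    if not correct:
        return 0.0  # convention chosen here when nothing is label-correct
    spurious = sum(1 for r in correct if not r["rationale_aligned"])
    return spurious / len(correct)


judgments = [
    {"label_correct": True, "rationale_aligned": True},
    {"label_correct": True, "rationale_aligned": False},  # right answer, wrong reason
    {"label_correct": True, "rationale_aligned": True},
    {"label_correct": False, "rationale_aligned": False},  # excluded from the denominator
]
s_corr = spurious_correctness(judgments)
```

Note that label-incorrect decisions do not enter the denominator, so S-Corr and label accuracy can move independently.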

Soft Forward-Backward Representations for Zero-shot Reinforcement Learning with General Utilities

面向一般效用零样本强化学习的软前向-后向表示

Authors: Marco Bagatella, Thomas Rupf, Georg Martius, Andreas Krause
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.06769
Pdf link: https://arxiv.org/pdf/2602.06769
Abstract Recent advancements in zero-shot reinforcement learning (RL) have facilitated the extraction of diverse behaviors from unlabeled, offline data sources. In particular, forward-backward algorithms (FB) can retrieve a family of policies that can approximately solve any standard RL problem (with additive rewards, linear in the occupancy measure), given sufficient capacity. While retaining zero-shot properties, we tackle the greater problem class of RL with general utilities, in which the objective is an arbitrary differentiable function of the occupancy measure. This setting is strictly more expressive, capturing tasks such as distribution matching or pure exploration, which may not be reduced to additive rewards. We show that this additional complexity can be captured by a novel, maximum entropy (soft) variant of the forward-backward algorithm, which recovers a family of stochastic policies from offline data. When coupled with zero-order search over compact policy embeddings, this algorithm can sidestep iterative optimization schemes, and optimizes general utilities directly at test-time. Across both didactic and high-dimensional experiments, we demonstrate that our method retains favorable properties of FB algorithms, while also extending their range to more general RL problems.
中文摘要 零样本强化学习（RL）的最新进展使得从无标注的离线数据源中提取多样化行为成为可能。特别是，前向-后向算法（FB）在容量足够的情况下，能够得到一族策略，近似求解任何标准强化学习问题（奖励可加，即目标关于占用度量是线性的）。在保留零样本特性的同时，我们处理更大的一类问题，即具有一般效用的强化学习，其目标是占用度量的任意可微函数。这一设定严格更具表达力，涵盖分布匹配或纯探索等任务，而这些任务可能无法归约为可加奖励。我们表明，这种额外的复杂性可以由前向-后向算法的一种新的最大熵（软）变体来刻画，该变体从离线数据中恢复出一族随机策略。当与紧凑策略嵌入上的零阶搜索相结合时，该算法可以绕开迭代优化方案，在测试时直接优化一般效用。在教学性实验和高维实验中，我们都证明了该方法保留了FB算法的有利性质，同时将其适用范围扩展到更一般的强化学习问题。
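
The distinction the abstract draws between additive rewards and general utilities can be written compactly in terms of the occupancy measure. The notation below ($d^{\pi}$, $f$, $d^{\star}$) is assumed for illustration, and the two example utilities are common textbook choices for pure exploration and distribution matching rather than the paper's own definitions.

```latex
% Standard RL: additive reward, linear in the occupancy measure d^\pi
J_{\mathrm{lin}}(\pi) = \langle d^{\pi}, r \rangle = \sum_{s,a} d^{\pi}(s,a)\, r(s,a)

% General utilities: any differentiable function f of the occupancy measure
J_{f}(\pi) = f\left(d^{\pi}\right)

% Two illustrative choices of f (assumptions, not taken from the paper)
f_{\mathrm{explore}}(d) = -\sum_{s,a} d(s,a) \log d(s,a)
f_{\mathrm{match}}(d) = -\mathrm{KL}\left(d \,\|\, d^{\star}\right)
```

Neither example is linear in $d$, which is why such objectives may not be expressible with a fixed additive reward.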

UnifSrv: AP Selection for Achieving Uniformly Good Performance of CF-MIMO in Realistic Urban Networks

UnifSrv：在真实城市网络中实现CF-MIMO均匀良好性能的AP选择

Authors: Yunlu Xiao, Marina Petrova, Ljiljana Simić
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2602.06780
Pdf link: https://arxiv.org/pdf/2602.06780
Abstract Under the ideal assumption of uniform propagation, cell-free massive MIMO (CF-mMIMO) provides uniformly high throughput over the network by effectively surrounding each user with its serving access point (AP) set. However, in realistic non-uniform urban propagation environments, it is difficult to consistently select good limited serving AP sets, resulting in significantly degraded throughput, reintroducing "edge-effect" for the worst-served users. To restore the uniformly good performance of scalable CF-mMIMO in realistic urban networks, we formulate a novel multi-objective optimization problem to jointly achieve high throughput by maximizing the sum data rate, uniform throughput by maximizing Jain's fairness index of the throughput per user, and scalability by minimizing the serving AP set size. We then propose the UnifSrv AP selection algorithms to solve this optimization problem, consisting of a deep reinforcement learning (DRL)-based algorithm UnifSrv-DRL and a heuristic algorithm UnifSrv-heu. We conduct a comprehensive performance evaluation of scalable CF-mMIMO under realistic urban network distributions, propagation, and mobility patterns, showing that the prior benchmark AP selection schemes fail to provide uniformly high throughput in practice. By contrast, UnifSrv at least doubles the throughput compared to prior benchmarks, or achieves comparable throughput but with half of the serving AP set size. Importantly, our heuristic algorithm achieves equivalent throughput to our DRL one, but with orders of magnitude lower complexity. We thus for the first time propose an AP selection algorithm that achieves uniformly good CF-mMIMO performance in realistic urban networks with low complexity.
中文摘要 在均匀传播的理想假设下，无蜂窝大规模MIMO（CF-mMIMO）通过让服务接入点（AP）集合有效地环绕每个用户，在整个网络上提供均匀的高吞吐量。然而，在现实的非均匀城市传播环境中，很难始终选出良好的有限服务AP集合，导致吞吐量显著下降，使服务最差的用户重新出现“边缘效应”。为了在现实城市网络中恢复可扩展CF-mMIMO的均匀良好性能，我们构建了一个新的多目标优化问题，以同时实现：通过最大化总数据速率获得高吞吐量，通过最大化每用户吞吐量的Jain公平性指数获得均匀吞吐量，以及通过最小化服务AP集合的大小获得可扩展性。随后，我们提出了UnifSrv AP选择算法来求解该优化问题，包括基于深度强化学习（DRL）的算法UnifSrv-DRL和启发式算法UnifSrv-heu。我们在现实的城市网络分布、传播和移动模式下对可扩展CF-mMIMO进行了全面的性能评估，表明已有的基准AP选择方案在实际中无法提供均匀的高吞吐量。相比之下，UnifSrv的吞吐量至少是已有基准的两倍，或者在吞吐量相当的情况下只需一半大小的服务AP集合。重要的是，我们的启发式算法达到了与DRL算法相当的吞吐量，而复杂度低了几个数量级。因此，我们首次提出了一种能以低复杂度在现实城市网络中实现均匀良好CF-mMIMO性能的AP选择算法。
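
One of the three objectives in the UnifSrv abstract is Jain's fairness index of the per-user throughput. The standard definition is $(\sum_i x_i)^2 / (n \sum_i x_i^2)$, which equals 1 when all users get the same throughput and $1/n$ when a single user gets everything. The sketch below computes only that index; how the paper weights it against sum rate and serving-set size is not stated in the abstract.

```python
def jain_fairness(throughputs):
    """Jain's fairness index of a list of non-negative per-user throughputs."""
    total = sum(throughputs)
    squares = sum(x * x for x in throughputs)
    if squares == 0:
        return 0.0  # convention chosen here for an all-zero allocation
    return total * total / (len(throughputs) * squares)


uniform = jain_fairness([5, 5, 5, 5])    # every user served equally
starved = jain_fairness([10, 0, 0, 0])   # one user takes all the throughput
```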

Generating Data-Driven Reasoning Rubrics for Domain-Adaptive Reward Modeling

为领域自适应奖励建模生成数据驱动的推理评分标准

Authors: Kate Sanders, Nathaniel Weir, Sapana Chaudhary, Kaj Bostrom, Huzefa Rangwala
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.06795
Pdf link: https://arxiv.org/pdf/2602.06795
Abstract An impediment to using Large Language Models (LLMs) for reasoning output verification is that LLMs struggle to reliably identify errors in thinking traces, particularly in long outputs, domains requiring expert knowledge, and problems without verifiable rewards. We propose a data-driven approach to automatically construct highly granular reasoning error taxonomies to enhance LLM-driven error detection on unseen reasoning traces. Our findings indicate that classification approaches that leverage these error taxonomies, or "rubrics", demonstrate strong error identification compared to baseline methods in technical domains like coding, math, and chemical engineering. These rubrics can be used to build stronger LLM-as-judge reward functions for reasoning model training via reinforcement learning. Experimental results show that these rewards have the potential to improve models' task accuracy on difficult domains over models trained by general LLMs-as-judges by +45%, and approach performance of models trained by verifiable rewards while using as little as 20% as many gold labels. Through our approach, we extend the usage of reward rubrics from assessing qualitative model behavior to assessing quantitative model correctness on tasks typically learned via RLVR rewards. This extension opens the door for teaching models to solve complex technical problems without a full dataset of gold labels, which are often highly costly to procure.
中文摘要 使用大型语言模型（LLMs）验证推理输出的一个障碍是，LLMs难以可靠地识别思维轨迹中的错误，尤其是在长输出、需要专业知识的领域以及没有可验证奖励的问题中。我们提出一种数据驱动的方法，自动构建高度细粒度的推理错误分类体系，以增强LLM对未见过的推理轨迹的错误检测。我们的结果表明，在编程、数学和化学工程等技术领域，利用这些错误分类体系（即“评分标准”）的分类方法相比基线方法表现出很强的错误识别能力。这些评分标准可用于构建更强的LLM-as-judge奖励函数，以通过强化学习训练推理模型。实验结果显示，与用通用LLM-as-judge训练的模型相比，这些奖励有潜力将模型在困难领域的任务准确率提升+45%，并且在仅使用20%金标签的情况下接近用可验证奖励训练的模型的性能。通过我们的方法，我们将奖励评分标准的用途从评估定性的模型行为扩展到评估定量的模型正确性，后者所针对的任务通常通过RLVR奖励来学习。这一扩展为在没有完整金标签数据集的情况下教会模型解决复杂技术问题打开了大门，而金标签的获取成本往往很高。

AEGPO: Adaptive Entropy-Guided Policy Optimization for Diffusion Models

AEGPO：扩散模型的自适应熵引导策略优化

Authors: Yuming Li, Qingyu Li, Chengyu Bai, Xiangyang Luo, Zeyue Xue, Wenyu Qin, Meng Wang, Yikai Wang, Shanghang Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.06825
Pdf link: https://arxiv.org/pdf/2602.06825
Abstract Reinforcement learning from human feedback (RLHF) shows promise for aligning diffusion and flow models, yet policy optimization methods such as GRPO suffer from inefficient and static sampling strategies. These methods treat all prompts and denoising steps uniformly, ignoring substantial variations in sample learning value as well as the dynamic nature of critical exploration moments. To address this issue, we conduct a detailed analysis of the internal attention dynamics during GRPO training and uncover a key insight: attention entropy can serve as a powerful dual-signal proxy. First, across different samples, the relative change in attention entropy ({\Delta}Entropy), which reflects the divergence between the current policy and the base policy, acts as a robust indicator of sample learning value. Second, during the denoising process, the peaks of absolute attention entropy (Entropy(t)), which quantify attention dispersion, effectively identify critical timesteps where high-value exploration occurs. Building on this observation, we propose Adaptive Entropy-Guided Policy Optimization (AEGPO), a novel dual-signal, dual-level adaptive optimization strategy. At the global level, AEGPO uses {\Delta}Entropy to dynamically allocate rollout budgets, prioritizing prompts with higher learning value. At the local level, it exploits the peaks of Entropy(t) to guide exploration selectively at critical high-dispersion timesteps rather than uniformly across all denoising steps. By focusing computation on the most informative samples and the most critical moments, AEGPO enables more efficient and effective policy optimization. Experiments on text-to-image generation tasks demonstrate that AEGPO significantly accelerates convergence and achieves superior alignment performance compared to standard GRPO variants.
中文摘要 基于人类反馈的强化学习（RLHF）在对齐扩散模型和流模型方面展现出潜力，但GRPO等策略优化方法受困于低效且静态的采样策略。这些方法对所有提示和去噪步骤一视同仁，忽略了样本学习价值的显著差异以及关键探索时刻的动态特性。为解决这一问题，我们对GRPO训练期间的内部注意力动态进行了详细分析，并得到一个关键洞见：注意力熵可以作为强有力的双信号代理。首先，在不同样本之间，注意力熵的相对变化（{\Delta}Entropy）反映了当前策略与基础策略之间的偏离，是样本学习价值的稳健指标。其次，在去噪过程中，绝对注意力熵（Entropy(t)）的峰值量化了注意力的分散程度，能够有效识别发生高价值探索的关键时间步。基于这一观察，我们提出了自适应熵引导策略优化（AEGPO），这是一种新的双信号、双层级自适应优化策略。在全局层面，AEGPO使用{\Delta}Entropy动态分配rollout预算，优先考虑学习价值更高的提示。在局部层面，它利用Entropy(t)的峰值，有选择地在关键的高分散时间步引导探索，而不是在所有去噪步骤上均匀探索。通过把计算集中在信息量最大的样本和最关键的时刻上，AEGPO实现了更高效、更有效的策略优化。文本到图像生成任务上的实验表明，与标准GRPO变体相比，AEGPO显著加速了收敛，并取得了更优的对齐性能。
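
The global level of AEGPO, as described in the abstract, allocates rollout budgets according to a per-prompt entropy signal. The sketch below shows one way this could look: Shannon entropy of a normalised attention row, then a proportional integer split of a fixed rollout budget. The proportional rule, the per-prompt `floor`, and both function names are assumptions for illustration; the abstract does not state the actual allocation rule.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one normalised attention row."""
    return -sum(p * math.log(p) for p in weights if p > 0)


def allocate_rollouts(deltas, total_budget, floor=1):
    """Split total_budget rollouts across prompts in proportion to deltas.

    deltas: non-negative per-prompt scores (standing in for Delta-Entropy).
    floor:  assumed minimum number of rollouts every prompt receives.
    """
    n = len(deltas)
    spare = total_budget - floor * n
    if spare < 0:
        raise ValueError("total_budget is too small for the per-prompt floor")
    total = sum(deltas)
    if total > 0:
        shares = [spare * d / total for d in deltas]
    else:
        shares = [spare / n] * n  # no signal: fall back to an even split
    budget = [floor + int(s) for s in shares]
    # Rounding down can leave a few rollouts unassigned; give them to the
    # prompts with the largest scores first.
    leftover = total_budget - sum(budget)
    ranked = sorted(range(n), key=lambda i: deltas[i], reverse=True)
    for i in ranked[:leftover]:
        budget[i] += 1
    return budget


plan = allocate_rollouts([0.125, 0.25, 0.625], total_budget=11)
```

The floor keeps every prompt in the batch sampled at least once, a design choice made here so that low-score prompts are not dropped entirely.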

SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks

SEMA：多回合越狱攻击的简单而有效的学习方法

Authors: Mingqian Feng, Xiaodong Liu, Weiwei Yang, Jialin Song, Xuekai Zhu, Chenliang Xu, Jianfeng Gao
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2602.06854
Pdf link: https://arxiv.org/pdf/2602.06854
Abstract Multi-turn jailbreaks capture the real threat model for safety-aligned chatbots, where single-turn attacks are merely a special case. Yet existing approaches break under exploration complexity and intent drift. We propose SEMA, a simple yet effective framework that trains a multi-turn attacker without relying on any existing strategies or external data. SEMA comprises two stages. Prefilling self-tuning enables usable rollouts by fine-tuning on non-refusal, well-structured, multi-turn adversarial prompts that are self-generated with a minimal prefix, thereby stabilizing subsequent learning. Reinforcement learning with intent-drift-aware reward trains the attacker to elicit valid multi-turn adversarial prompts while maintaining the same harmful objective. We anchor harmful intent in multi-turn jailbreaks via an intent-drift-aware reward that combines intent alignment, compliance risk, and level of detail. Our open-loop attack regime avoids dependence on victim feedback, unifies single- and multi-turn settings, and reduces exploration complexity. Across multiple datasets, victim models, and jailbreak judges, our method achieves state-of-the-art (SOTA) attack success rates (ASR), outperforming all single-turn baselines, manually scripted and template-driven multi-turn baselines, as well as our SFT (Supervised Fine-Tuning) and DPO (Direct Preference Optimization) variants. For instance, SEMA performs an average $80.1\%$ ASR@1 across three closed-source and open-source victim models on AdvBench, 33.9% over SOTA. The approach is compact, reproducible, and transfers across targets, providing a stronger and more realistic stress test for large language model (LLM) safety and enabling automatic redteaming to expose and localize failure modes. Our code is available at: this https URL.
中文摘要 多回合越狱刻画了安全对齐聊天机器人的真实威胁模型，单回合攻击只是其中的特例。然而，现有方法在探索复杂度和意图漂移面前会失效。我们提出了SEMA，一个简单而有效的框架，它在不依赖任何现有策略或外部数据的情况下训练多回合攻击者。SEMA包含两个阶段。预填充自调优（prefilling self-tuning）在以最小前缀自生成的、不含拒绝且结构良好的多回合对抗提示上进行微调，从而得到可用的rollout，并稳定后续学习。带意图漂移感知奖励的强化学习则训练攻击者在保持同一有害目标的同时生成有效的多回合对抗提示。我们通过一个结合了意图对齐、合规风险和细节程度的意图漂移感知奖励，在多回合越狱中锚定有害意图。我们的开环攻击方式避免了对受害者反馈的依赖，统一了单回合和多回合设置，并降低了探索复杂度。在多个数据集、受害者模型和越狱评判器上，我们的方法取得了最先进（SOTA）的攻击成功率（ASR），优于所有单回合基线、人工编写和模板驱动的多回合基线，以及我们自己的SFT（监督微调）和DPO（直接偏好优化）变体。例如，SEMA在AdvBench上对三个闭源和开源受害者模型取得了平均80.1%的ASR@1，比SOTA高出33.9%。该方法紧凑、可复现，且可跨目标迁移，为大型语言模型（LLM）的安全性提供了更强、更真实的压力测试，并使自动化红队测试能够暴露和定位失效模式。我们的代码可在此获取：this https URL。

A first realization of reinforcement learning-based closed-loop EEG-TMS

基于强化学习的闭环脑电图-TMS的首次实现

Authors: Dania Humaidan, Jiahua Xu, Jing Chen, Christoph Zrenner, David Emanuel Vetter, Laura Marzetti, Paolo Belardinelli, Timo Roine, Risto J. Ilmoniemi, Gian Luca Romani, Ulf Zieman
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.06907
Pdf link: https://arxiv.org/pdf/2602.06907
Abstract Background: Transcranial magnetic stimulation (TMS) is a powerful tool to investigate neurophysiology of the human brain and treat brain disorders. Traditionally, therapeutic TMS has been applied in a one-size-fits-all approach, disregarding inter- and intra-individual differences. Brain state-dependent EEG-TMS, such as coupling TMS with a pre-specified phase of the sensorimotor mu-rhythm, enables the induction of differential neuroplastic effects depending on the targeted phase. But this approach is still user-dependent as it requires defining an a-priori target phase. Objectives: To present a first realization of a machine-learning-based, closed-loop real-time EEG-TMS setup to identify user-independently the individual mu-rhythm phase associated with high- vs. low-corticospinal excitability states. Methods: We applied EEG-TMS to 25 participants targeting the supplementary motor area-primary motor cortex network and used a reinforcement learning algorithm to identify the mu-rhythm phase associated with high- vs. low corticospinal excitability. We employed linear mixed effects models and Bayesian analysis to determine effects of reinforced learning on corticospinal excitability indexed by motor evoked potential amplitude, and functional connectivity indexed by the imaginary part of resting-state EEG coherence. Results: Reinforcement learning effectively identified the mu-rhythm phase associated with high- vs. low-excitability states, and their repetitive stimulation resulted in long-term increases vs. decreases in functional connectivity in the stimulated sensorimotor network. Conclusions: We demonstrated for the first time the feasibility of closed-loop EEG-TMS in humans, a critical step towards individualized treatment of brain disorders.
中文摘要 背景：经颅磁刺激（TMS）是研究人脑神经生理学和治疗脑部疾病的有力工具。传统上，治疗性TMS采用一刀切的方式，忽视个体间和个体内的差异。依赖脑状态的脑电图-TMS（EEG-TMS），例如将TMS与感觉运动μ节律的预设相位耦合，能够根据所瞄准的相位诱导不同的神经可塑性效应。但这种方法仍然依赖使用者，因为它需要先验地定义目标相位。目标：展示基于机器学习的闭环实时EEG-TMS装置的首次实现，以不依赖使用者的方式识别与皮质脊髓高兴奋性和低兴奋性状态相关的个体μ节律相位。方法：我们对25名参与者施加EEG-TMS，刺激目标为辅助运动区-初级运动皮层网络，并使用强化学习算法识别与皮质脊髓高兴奋性和低兴奋性相关的μ节律相位。我们采用线性混合效应模型和贝叶斯分析，确定强化学习对皮质脊髓兴奋性（以运动诱发电位幅度为指标）和功能连接性（以静息态EEG相干性的虚部为指标）的影响。结果：强化学习有效识别了与高兴奋性和低兴奋性状态相关的μ节律相位，对这些相位的重复刺激分别导致受刺激感觉运动网络中功能连接性的长期增加和减少。结论：我们首次证明了闭环EEG-TMS在人类中的可行性，这是迈向脑部疾病个体化治疗的关键一步。

Continuous-time reinforcement learning: ellipticity enables model-free value function approximation

连续时间强化学习：椭圆性使得无模型的价值函数近似成为可能

Authors: Wenlong Mou
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Statistics Theory (math.ST); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2602.06930
Pdf link: https://arxiv.org/pdf/2602.06930
Abstract We study off-policy reinforcement learning for controlling continuous-time Markov diffusion processes with discrete-time observations and actions. We consider model-free algorithms with function approximation that learn value and advantage functions directly from data, without unrealistic structural assumptions on the dynamics. Leveraging the ellipticity of the diffusions, we establish a new class of Hilbert-space positive definiteness and boundedness properties for the Bellman operators. Based on these properties, we propose the Sobolev-prox fitted $q$-learning algorithm, which learns value and advantage functions by iteratively solving least-squares regression problems. We derive oracle inequalities for the estimation error, governed by (i) the best approximation error of the function classes, (ii) their localized complexity, (iii) exponentially decaying optimization error, and (iv) numerical discretization error. These results identify ellipticity as a key structural property that renders reinforcement learning with function approximation for Markov diffusions no harder than supervised learning.
中文摘要 我们研究离策略（off-policy）强化学习，用于在离散时间观测和动作下控制连续时间马尔可夫扩散过程。我们考虑带函数逼近的无模型算法，它们直接从数据中学习价值函数和优势函数，而无需对动力学作不切实际的结构性假设。利用扩散的椭圆性，我们为贝尔曼算子建立了一类新的希尔伯特空间正定性和有界性性质。基于这些性质，我们提出了Sobolev-prox拟合$q$-学习算法，它通过迭代求解最小二乘回归问题来学习价值函数和优势函数。我们推导了估计误差的oracle不等式，该误差由以下因素决定：（i）函数类的最佳逼近误差，（ii）函数类的局部化复杂度，（iii）指数衰减的优化误差，以及（iv）数值离散化误差。这些结果表明，椭圆性是一个关键的结构性质，它使得马尔可夫扩散上带函数逼近的强化学习并不比监督学习更难。

Cochain Perspectives on Temporal-Difference Signals for Learning Beyond Markov Dynamics

关于超越马尔可夫动力学学习的时间差分信号的余链视角

Authors: Zuyuan Zhang, Sizhe Tang, Tian Lan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.06939
Pdf link: https://arxiv.org/pdf/2602.06939
Abstract Non-Markovian dynamics are commonly found in real-world environments due to long-range dependencies, partial observability, and memory effects. The Bellman equation that is the central pillar of Reinforcement learning (RL) becomes only approximately valid under Non-Markovian. Existing work often focus on practical algorithm designs and offer limited theoretical treatment to address key questions, such as what dynamics are indeed capturable by the Bellman framework and how to inspire new algorithm classes with optimal approximations. In this paper, we present a novel topological viewpoint on temporal-difference (TD) based RL. We show that TD errors can be viewed as 1-cochain in the topological space of state transitions, while Markov dynamics are then interpreted as topological integrability. This novel view enables us to obtain a Hodge-type decomposition of TD errors into an integrable component and a topological residual, through a Bellman-de Rham projection. We further propose HodgeFlow Policy Search (HFPS) by fitting a potential network to minimize the non-integrable projection residual in RL, achieving stability/sensitivity guarantees. In numerical evaluations, HFPS is shown to significantly improve RL performance under non-Markovian.
中文摘要 由于长程依赖、部分可观测性和记忆效应，非马尔可夫动力学在现实环境中很常见。作为强化学习（RL）核心支柱的贝尔曼方程，在非马尔可夫情形下仅近似成立。现有工作通常侧重于实用的算法设计，对一些关键问题的理论处理有限，例如贝尔曼框架究竟能刻画哪些动力学，以及如何启发具有最优近似的新算法类别。本文提出了关于基于时间差分（TD）的强化学习的一种新的拓扑视角。我们表明，TD误差可以看作状态转移拓扑空间中的1-余链，而马尔可夫动力学则可解释为拓扑可积性。这一新视角使我们能够通过Bellman-de Rham投影，将TD误差作Hodge型分解，分为可积分量和拓扑残差。我们进一步提出HodgeFlow策略搜索（HFPS），通过拟合一个势函数网络来最小化强化学习中不可积的投影残差，并获得稳定性/敏感性保证。在数值评估中，HFPS显著提升了非马尔可夫情形下的强化学习性能。
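
The abstract treats TD errors as a 1-cochain, that is, a number attached to each directed state transition, and reads integrability as the existence of a potential whose differences reproduce those numbers. On a finite graph, the textbook test for this is that the values sum to zero around every closed loop. The toy example below illustrates only that discrete notion; it is not the paper's Bellman-de Rham projection or HFPS, and the states, potential, and residual size are made up.

```python
def circulation(cochain, cycle):
    """Sum a 1-cochain (dict: (state, next_state) -> value) around a closed walk."""
    edges = zip(cycle, cycle[1:] + cycle[:1])
    return sum(cochain[edge] for edge in edges)


# Hypothetical potential on three states and the edge values it induces.
potential = {"a": 0.0, "b": 1.0, "c": 3.0}
loop = ["a", "b", "c"]
exact = {
    (s, t): potential[t] - potential[s]
    for s, t in zip(loop, loop[1:] + loop[:1])
}

# Perturb one transition so that no potential can explain all three edges.
twisted = dict(exact)
twisted[("c", "a")] += 0.5

integrable_sum = circulation(exact, loop)
residual_sum = circulation(twisted, loop)
```

A non-zero loop sum is the kind of quantity a potential-fitting method cannot remove, which is the intuition behind separating an integrable component from a residual.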

Optimal Derivative Feedback Control for an Active Magnetic Levitation System: An Experimental Study on Data-Driven Approaches

主动磁悬浮系统的最优导数反馈控制：数据驱动方法的实验研究

Authors: Saber Omidi, Rene Akupan Ebunle, Se Young Yoon
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2602.06944
Pdf link: https://arxiv.org/pdf/2602.06944
Abstract This paper presents the design and implementation of data-driven optimal derivative feedback controllers for an active magnetic levitation system. A direct, model-free control design method based on the reinforcement learning framework is compared with an indirect optimal control design derived from a numerically identified mathematical model of the system. For the direct model-free approach, a policy iteration procedure is proposed, which adds an iteration layer called the epoch loop to gather multiple sets of process data, providing a more diverse dataset and helping reduce learning biases. This direct control design method is evaluated against a comparable optimal control solution designed from a plant model obtained through the combined Dynamic Mode Decomposition with Control (DMDc) and Prediction Error Minimization (PEM) system identification. Results show that while both controllers can stabilize and improve the performance of the magnetic levitation system when compared to controllers designed from a nominal model, the direct model-free approach consistently outperforms the indirect solution when multiple epochs are allowed. The iterative refinement of the optimal control law over the epoch loop provides the direct approach a clear advantage over the indirect method, which relies on a single set of system data to determine the identified model and control.
中文摘要 本文介绍了面向主动磁悬浮系统的数据驱动最优导数反馈控制器的设计与实现。我们将基于强化学习框架的直接、无模型控制设计方法，与由数值辨识得到的系统数学模型导出的间接最优控制设计进行比较。对于直接无模型方法，我们提出了一种策略迭代过程，它增加了一个称为epoch循环的迭代层来收集多组过程数据，从而提供更多样化的数据集并帮助减少学习偏差。我们将这一直接控制设计方法与一个可比的最优控制方案进行对照评估，后者基于通过带控制的动态模式分解（DMDc）与预测误差最小化（PEM）相结合的系统辨识所得到的被控对象模型来设计。结果表明，与基于标称模型设计的控制器相比，两种控制器都能稳定磁悬浮系统并提升其性能，但在允许多个epoch时，直接无模型方法始终优于间接方案。最优控制律在epoch循环中的迭代细化，使直接方法相对于间接方法具有明显优势，因为后者只依赖单组系统数据来确定辨识模型和控制律。

InftyThink+: Effective and Efficient Infinite-Horizon Reasoning via Reinforcement Learning

InftyThink+：通过强化学习实现有效且高效的无限时域推理

Authors: Yuchen Yan, Liang Jiang, Jin Jiang, Shuaicheng Li, Zujie Wen, Zhiqiang Zhang, Jun Zhou, Jian Shao, Yueting Zhuang, Yongliang Shen
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2602.06960
Pdf link: https://arxiv.org/pdf/2602.06960
Abstract Large reasoning models achieve strong performance by scaling inference-time chain-of-thought, but this paradigm suffers from quadratic cost, context length limits, and degraded reasoning due to lost-in-the-middle effects. Iterative reasoning mitigates these issues by periodically summarizing intermediate thoughts, yet existing methods rely on supervised learning or fixed heuristics and fail to optimize when to summarize, what to preserve, and how to resume reasoning. We propose InftyThink+, an end-to-end reinforcement learning framework that optimizes the entire iterative reasoning trajectory, building on model-controlled iteration boundaries and explicit summarization. InftyThink+ adopts a two-stage training scheme with supervised cold-start followed by trajectory-level reinforcement learning, enabling the model to learn strategic summarization and continuation decisions. Experiments on DeepSeek-R1-Distill-Qwen-1.5B show that InftyThink+ improves accuracy by 21% on AIME24 and outperforms conventional long chain-of-thought reinforcement learning by a clear margin, while also generalizing better to out-of-distribution benchmarks. Moreover, InftyThink+ significantly reduces inference latency and accelerates reinforcement learning training, demonstrating improved reasoning efficiency alongside stronger performance.
中文摘要 大型推理模型通过扩展推理时的思维链取得了强劲性能，但该范式存在二次方成本、上下文长度限制，以及因“中间迷失”（lost-in-the-middle）效应导致的推理能力下降等问题。迭代推理通过定期总结中间思考来缓解这些问题，但现有方法依赖监督学习或固定启发式规则，未能优化何时总结、保留哪些内容以及如何恢复推理。我们提出了 InftyThink+，这是一个端到端强化学习框架，它建立在模型控制的迭代边界和显式总结之上，对整个迭代推理轨迹进行优化。InftyThink+ 采用两阶段训练方案，先进行监督冷启动，再进行轨迹级强化学习，使模型能够学习策略性的总结与续写决策。在 DeepSeek-R1-Distill-Qwen-1.5B 上的实验表明，InftyThink+ 在 AIME24 上将准确率提升了 21%，明显优于传统的长思维链强化学习，同时在分布外基准上的泛化能力也更好。此外，InftyThink+ 显著降低了推理延迟并加速了强化学习训练，在取得更强性能的同时提升了推理效率。
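
The summarize-and-resume loop that the abstract describes can be pictured with a minimal sketch. Everything named here is hypothetical and not taken from the paper: `iterative_reason`, the `step` callable, and `toy_step` are illustrative stand-ins, and the sketch covers only the inference-time control flow, not the reinforcement learning training.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interface (an assumption, not the paper's API): one reasoning
# round takes the question plus the running summary and returns an updated
# summary together with a final answer, or None if more rounds are needed.
Step = Callable[[str, str], Tuple[str, Optional[str]]]


def iterative_reason(question: str, step: Step, max_rounds: int = 8):
    """Run bounded reasoning rounds, carrying only a summary between rounds."""
    summary = ""
    for round_idx in range(max_rounds):
        # Each round sees the question and the summary, never the full history,
        # so the per-round context stays bounded.
        summary, answer = step(question, summary)
        if answer is not None:
            return answer, round_idx + 1
    return None, max_rounds


def toy_step(question: str, summary: str):
    """Stand-in for a model call: answers once the summary holds two notes."""
    if len(summary) >= 2:
        return summary, "done"
    return summary + "x", None


if __name__ == "__main__":
    print(iterative_reason("toy question", toy_step))
```

In this sketch the stopping decision lives inside `step`, which loosely mirrors the abstract's "model-controlled iteration boundaries"; a real system would replace `toy_step` with a language model call.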

MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images

MedMO：面向医学图像的定位与理解多模态大语言模型

Authors: Ankan Deria, Komal Kumar, Adinath Madhavrao Dukre, Eran Segal, Salman Khan, Imran Razzak
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2602.06965
Pdf link: https://arxiv.org/pdf/2602.06965
Abstract Multimodal large language models (MLLMs) have rapidly advanced, yet their adoption in medicine remains limited by gaps in domain coverage, modality alignment, and grounded reasoning. In this work, we introduce MedMO, a medical foundation model built upon a generalized MLLM architecture and trained exclusively on large-scale, domain-specific data. MedMO follows a multi-stage training recipe: (i) cross-modal pretraining to align heterogeneous visual encoders with a medical language backbone; (ii) instruction tuning on multi-task supervision that spans captioning, VQA, report generation, retrieval, and grounded disease localization with bounding boxes; and (iii) reinforcement learning with verifiable rewards that combine factuality checks with a box-level GIoU reward to strengthen spatial grounding and step-by-step reasoning in complex clinical scenarios. MedMO consistently outperforms strong open-source medical MLLMs across multiple modalities and tasks. On VQA benchmarks, MedMO achieves an average accuracy improvement of +13.7% over the baseline and performs within 1.9% of the SOTA Fleming-VL. For text-based QA, it attains +6.9% over the baseline and +14.5% over Fleming-VL. In medical report generation, MedMO delivers significant gains in both semantic and clinical accuracy. Moreover, it exhibits strong grounding capability, achieving an IoU improvement of +40.4 over the baseline and +37.0% over Fleming-VL, underscoring its robust spatial reasoning and localization performance. Evaluations across radiology, ophthalmology, and pathology-microscopy confirm MedMO's broad cross-modality generalization. We release two versions of MedMO: 4B and 8B. Project is available at this https URL
中文摘要 多模态大语言模型（MLLM）发展迅速，但其在医学中的应用仍受限于领域覆盖、模态对齐和有据推理（grounded reasoning）方面的不足。在本研究中，我们提出了 MedMO，这是一种基于通用 MLLM 架构、完全使用大规模领域特定数据训练的医学基础模型。MedMO 遵循多阶段训练方案：（i）跨模态预训练，将异构视觉编码器与医学语言骨干对齐；（ii）基于多任务监督的指令微调，涵盖图像描述、VQA、报告生成、检索以及带边界框的疾病定位；（iii）带可验证奖励的强化学习，将事实性检查与框级 GIoU 奖励相结合，以加强复杂临床场景中的空间定位和逐步推理能力。MedMO 在多种模态和任务上始终优于强大的开源医学 MLLM。在 VQA 基准上，MedMO 的平均准确率比基线提升 +13.7%，与 SOTA 模型 Fleming-VL 的差距在 1.9% 以内。在基于文本的问答上，它比基线提升 +6.9%，比 Fleming-VL 提升 +14.5%。在医学报告生成方面，MedMO 在语义和临床准确性上均有显著提升。此外，它表现出很强的定位能力，IoU 比基线提升 +40.4，比 Fleming-VL 提升 +37.0%，凸显了其稳健的空间推理和定位性能。在放射学、眼科学和病理显微影像上的评估证实了 MedMO 广泛的跨模态泛化能力。我们发布了 MedMO 的两个版本：4B 和 8B。项目可在此 https URL 获取
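
For readers unfamiliar with the box-level GIoU term mentioned in the abstract, here is a minimal sketch of the standard Generalized IoU computation for two axis-aligned boxes. The function name `giou` and the `(x1, y1, x2, y2)` box convention are assumptions made for illustration. How MedMO scales this value or combines it with factuality checks is not stated in the abstract and is not reproduced here.

```python
def giou(box_a, box_b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2), x1 < x2, y1 < y2.

    Returns a value in (-1, 1]: 1 for identical boxes, negative when the boxes
    are far apart relative to their smallest enclosing box.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)

    # Intersection rectangle; width and height are clamped at zero.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = area_a + area_b - inter
    iou = inter / union

    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))

    # Penalize the part of the enclosing box covered by neither input box.
    return iou - (hull - union) / hull


if __name__ == "__main__":
    print(giou((0, 0, 2, 2), (1, 1, 3, 3)))
```

Unlike plain IoU, which is zero for every pair of non-overlapping boxes, GIoU still varies with how far apart the boxes are, which is one reason it is commonly used as a localization reward or loss.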

Keyword: diffusion policy

No results found