Arxiv Papers of Today

Generated: 2026-04-14 17:24:56 (UTC+8); arXiv release time: 2026-04-14 20:00 EDT (2026-04-15 08:00 UTC+8)

Today there are 78 relevant papers

Keyword: reinforcement learning

Unifying Ontology Construction and Semantic Alignment for Deterministic Enterprise Reasoning at Scale

Authors: Hongyin Zhu
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.09608
Pdf link: https://arxiv.org/pdf/2604.09608
Abstract While enterprises amass vast quantities of data, much of it remains chaotic and effectively dormant, preventing decision-making based on comprehensive information. Existing neuro-symbolic approaches rely on disjoint pipelines and struggle with error propagation. We introduce the large ontology model (LOM), a unified framework that seamlessly integrates ontology construction, semantic alignment, and logical reasoning into a single end-to-end architecture. LOM employs a construct-align-reason (CAR) pipeline, leveraging its unified architecture across all three stages: it first autonomously constructs a domain-specific ontological universe from raw data, then aligns neural generation with this structural reality using a graph-aware encoder and reinforcement learning, and finally executes deterministic reasoning over the constructed topology, node attributes and relation types. We evaluate LOM on a comprehensive benchmark constructed from diverse real-world enterprise datasets. Experimental results demonstrate that LOM-4B achieves 88.8% accuracy in ontology completion and 94% in complex graph reasoning tasks, significantly outperforming state-of-the-art LLMs. These findings validate that autonomous logical construction is essential for achieving deterministic, enterprise-grade intelligence.

A Comparative Theoretical Analysis of Entropy Control Methods in Reinforcement Learning

Authors: Ming Lei, Christophe Baehr
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.09676
Pdf link: https://arxiv.org/pdf/2604.09676
Abstract Reinforcement learning (RL) has become a key approach for enhancing reasoning in large language models (LLMs), yet scalable training is often hindered by the rapid collapse of policy entropy, which leads to premature convergence and performance saturation. This paper provides a comparative theoretical analysis of two entropy control strategies: traditional entropy regularization and the recently proposed covariance-based mechanism. We establish a unified framework for entropy dynamics under softmax parameterization, showing that entropy change is governed by the covariance between log-probabilities and logit updates. Our analysis reveals that traditional entropy regularization introduces a dense, persistent bias that modifies the stationary condition, leading to suboptimal policies, while covariance-based methods selectively regularize a sparse subset of high-covariance tokens and achieve asymptotic unbiasedness when the regularization coefficient is annealed. These results provide principled guidelines for entropy control in LLM posttraining, with implications for scaling RL to larger models and more complex reasoning tasks.
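
The covariance relation stated in this abstract can be checked numerically. The sketch below is not the authors' code: it assumes a single softmax distribution over four hypothetical logits and a small hypothetical logit update `delta`, and compares the actual entropy change with the first-order prediction -Cov(log pi, delta) taken under the current policy.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def predicted_entropy_change(logits, delta):
    """First-order estimate of the entropy change: -Cov_pi(log pi, delta)."""
    probs = softmax(logits)
    log_probs = [math.log(p) for p in probs]
    mean_logp = sum(p * lp for p, lp in zip(probs, log_probs))
    mean_delta = sum(p * d for p, d in zip(probs, delta))
    cov = sum(p * (lp - mean_logp) * (d - mean_delta)
              for p, lp, d in zip(probs, log_probs, delta))
    return -cov

# Hypothetical logits and a small update that raises the most likely logit.
logits = [2.0, 0.5, -1.0, 0.0]
delta = [1e-3, 0.0, 0.0, 0.0]

updated = [z + d for z, d in zip(logits, delta)]
actual = entropy(softmax(updated)) - entropy(softmax(logits))
predicted = predicted_entropy_change(logits, delta)
print(actual, predicted)
```

Pushing up a token that is already likely gives a positive covariance and therefore a negative entropy change, which is the collapse mechanism the abstract refers to.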

Belief-Aware VLM Model for Human-like Reasoning

Authors: Anshul Nayak, Shahil Shaik, Yue Wang
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.09686
Pdf link: https://arxiv.org/pdf/2604.09686
Abstract Traditional neural network models for intent inference rely heavily on observable states and struggle to generalize across diverse tasks and dynamic environments. Recent advances in Vision Language Models (VLMs) and Vision Language Action (VLA) models introduce common-sense reasoning through large-scale multimodal pretraining, enabling zero-shot performance across tasks. However, these models still lack explicit mechanisms to represent and update belief, limiting their ability to reason like humans or capture evolving human intent over long horizons. To address this, we propose a belief-aware VLM framework that integrates retrieval-based memory and reinforcement learning. Instead of learning an explicit belief model, we approximate belief using a vector-based memory that retrieves relevant multimodal context, which is incorporated into the VLM for reasoning. We further refine decision-making using a reinforcement learning policy over the VLM latent space. We evaluate our approach on publicly available VQA datasets such as HD-EPIC and demonstrate consistent improvements over zero-shot baselines, highlighting the importance of belief-aware reasoning.
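
The vector-based memory described in this abstract can be illustrated with a toy retrieval step. This is a minimal sketch under stated assumptions, not the paper's implementation: the class `VectorMemory`, the three-dimensional embeddings, and the stored context strings are all hypothetical, and cosine similarity is assumed as the retrieval score.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class VectorMemory:
    """Hypothetical memory: stores (embedding, context) pairs."""

    def __init__(self):
        self.items = []

    def add(self, embedding, context):
        self.items.append((list(embedding), context))

    def retrieve(self, query, k=2):
        # Rank stored entries by cosine similarity to the query embedding.
        ranked = sorted(self.items,
                        key=lambda item: cosine(query, item[0]),
                        reverse=True)
        return [context for _, context in ranked[:k]]

memory = VectorMemory()
memory.add([1.0, 0.0, 0.0], "person picks up a knife")
memory.add([0.9, 0.1, 0.0], "person walks to the cutting board")
memory.add([0.0, 0.0, 1.0], "person opens the fridge")

# The retrieved contexts would be placed in the VLM prompt before reasoning.
top = memory.retrieve([1.0, 0.05, 0.0], k=2)
print(top)
```

In a real system the embeddings would come from a multimodal encoder; here they are hand-written so the ranking is easy to follow.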

Cayley Graph Optimization for Scalable Multi-Agent Communication Topologies

Authors: Jingkai Luo, Yulin Shao
Subjects: Networking and Internet Architecture (cs.NI); Information Theory (cs.IT); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2604.09703
Pdf link: https://arxiv.org/pdf/2604.09703
Abstract Large-scale multi-agent communication has long faced a scalability bottleneck: fully connected networks require quadratic complexity, yet existing sparse topologies rely on hand-crafted rules. This paper treats the communication graph itself as a design variable and proposes CayleyTopo, a family of circulant Cayley graphs whose generator sets are optimized to minimize diameter, directly targeting worst-case information propagation speed. To navigate the enormous search space of possible generator sets, we develop a lightweight reinforcement learning framework that injects a number-theoretic prior to favor structurally rich generators, alongside a message-propagation score that provides dense connectivity feedback during construction. The resulting CayleyTopo consistently outperforms existing hand-crafted topologies, achieving faster information dissemination, greater resilience to link failures, and lower communication load, all while approaching the theoretical Moore bound. Our study opens the door to scalable, robust, and efficient communication foundations for future multi-agent systems, where the graph itself becomes optimizable rather than a fixed constraint.
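
The quantity this abstract optimizes, the diameter of a circulant Cayley graph as a function of its generator set, is easy to compute directly. The sketch below is only an illustration with a hypothetical function name, and it does not reproduce the paper's reinforcement learning search. It builds the circulant graph on the integers modulo `n` and runs one breadth-first search from node 0, which suffices because circulant graphs are vertex-transitive.

```python
from collections import deque

def circulant_diameter(n, generators):
    """Diameter of the circulant graph on Z_n with steps +g and -g.

    Returns None when the generators do not connect the graph.
    """
    steps = {s % n for g in generators for s in (g, -g)}
    steps.discard(0)
    dist = {0: 0}
    queue = deque([0])
    while queue:
        v = queue.popleft()
        for s in steps:
            w = (v + s) % n
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    if len(dist) < n:
        return None  # some node is unreachable
    return max(dist.values())

print(circulant_diameter(16, [1]))     # plain ring
print(circulant_diameter(16, [1, 4]))  # ring plus one chord generator
```

Adding the single generator 4 to a 16-node ring cuts the worst-case hop count from 8 to 3, which is the kind of gain a generator search is after.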

Multi-Granularity Reasoning for Image Quality Assessment via Attribute-Aware Reinforcement Learning to Rank

Authors: Xiangyong Chen, Xiaochuan Lin, Haoran Liu, Xuan Li, Yichen Su, Xiangwei Guo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.09704
Pdf link: https://arxiv.org/pdf/2604.09704
Abstract Recent advances in reasoning-induced image quality assessment (IQA) have demonstrated the power of reinforcement learning to rank (RL2R) for training vision-language models (VLMs) to assess perceptual quality. However, existing approaches operate at a single granularity, predicting only an overall quality score, while overlooking the multi-dimensional nature of human quality perception, which encompasses attributes such as sharpness, color fidelity, noise level, and compositional aesthetics. In this paper, we propose MG-IQA (Multi-Granularity IQA), a multi-granularity reasoning framework that extends RL2R to jointly assess overall image quality and fine-grained quality attributes within a single inference pass. Our approach introduces three key innovations: (1) an attribute-aware prompting strategy that elicits structured multi-attribute reasoning from VLMs; (2) a multi-dimensional Thurstone reward model that computes attribute-specific fidelity rewards for group relative policy optimization; and (3) a cross-domain alignment mechanism that enables stable joint training across synthetic distortion, authentic distortion, and AI-generated image datasets without perceptual scale re-alignment. Extensive experiments on eight IQA benchmarks demonstrate that MG-IQA consistently outperforms state-of-the-art methods in both overall quality prediction (average SRCC improvement of 2.1%) and attribute-level assessment, while generating interpretable, human-aligned quality descriptions.
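
The abstract does not spell out its reward, so the following is only a sketch of how a per-attribute Thurstone-style fidelity reward could look. It assumes the classical Thurstone comparison probability (a normal CDF of the score difference) and a fidelity-style agreement measure between predicted and reference preference probabilities; the function names, attribute names, and numbers are hypothetical.

```python
import math

def thurstone_prob(mu_a, var_a, mu_b, var_b):
    """Probability that image A is rated above image B (Thurstone model)."""
    denom = math.sqrt(var_a + var_b)
    if denom == 0.0:
        return 0.5 if mu_a == mu_b else float(mu_a > mu_b)
    z = (mu_a - mu_b) / denom
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def fidelity_reward(p_pred, p_true):
    """Assumed agreement measure in [0, 1]; equals 1 when the two match."""
    return (math.sqrt(p_pred * p_true)
            + math.sqrt((1.0 - p_pred) * (1.0 - p_true)))

# Hypothetical predicted (mean, variance) scores per attribute for two images.
image_a = {"sharpness": (4.0, 0.5), "noise": (2.0, 0.5)}
image_b = {"sharpness": (3.0, 0.5), "noise": (3.5, 0.5)}
# Hypothetical reference preferences: 1.0 means A is better on that attribute.
reference = {"sharpness": 1.0, "noise": 0.0}

rewards = {
    name: fidelity_reward(thurstone_prob(*image_a[name], *image_b[name]),
                          reference[name])
    for name in reference
}
print(rewards)
```

Each attribute gets its own reward, so a response that ranks sharpness correctly but noise incorrectly is credited and penalized separately.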

ExecTune: Effective Steering of Black-Box LLMs with Guide Models

Authors: Vijay Lingam, Aditya Golatkar, Anwesan Pal, Ben Vo, Narayanan Sadagopan, Alessandro Achille, Jun Huan, Anoop Deoras, Stefano Soatto
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.09741
Pdf link: https://arxiv.org/pdf/2604.09741
Abstract For large language models deployed through black-box APIs, recurring inference costs often exceed one-time training costs. This motivates composed agentic systems that amortize expensive reasoning into reusable intermediate representations. We study a broad class of such systems, termed Guide-Core Policies (GCoP), in which a guide model generates a structured strategy that is executed by a black-box core model. This abstraction subsumes base, supervised, and advisor-style approaches, which differ primarily in how the guide is trained. We formalize GCoP under a cost-sensitive utility objective and show that end-to-end performance is governed by guide-averaged executability: the probability that a strategy generated by the guide can be faithfully executed by the core. Our analysis shows that existing GCoP instantiations often fail to optimize executability under deployment constraints, resulting in brittle strategies and inefficient computation. Motivated by these insights, we propose ExecTune, a principled training recipe that combines teacher-guided acceptance sampling, supervised fine-tuning, and structure-aware reinforcement learning to directly optimize syntactic validity, execution success, and cost efficiency. Across mathematical reasoning and code-generation benchmarks, GCoP with ExecTune improves accuracy by up to 9.2% over prior state-of-the-art baselines while reducing inference cost by up to 22.4%. It enables Claude Haiku 3.5 to outperform Sonnet 3.5 on both math and code tasks, and to come within 1.7% absolute accuracy of Sonnet 4 at 38% lower cost. Beyond efficiency, GCoP also supports modular adaptation by updating the guide without retraining the core.
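
Guide-averaged executability, as defined in this abstract, is a probability that can be estimated from logged rollouts. The sketch below assumes a hypothetical `Rollout` record and a simple cost-sensitive utility of the form accuracy minus a weight times mean cost; the paper's actual objective may differ.

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    executed: bool  # core model followed the guide's strategy faithfully
    correct: bool   # final answer was correct
    cost: float     # inference cost in arbitrary units (assumed)

def executability(rollouts):
    """Empirical fraction of guide strategies the core could execute."""
    return sum(r.executed for r in rollouts) / len(rollouts)

def utility(rollouts, cost_weight):
    """Assumed utility: accuracy minus cost_weight times mean cost."""
    accuracy = sum(r.correct for r in rollouts) / len(rollouts)
    mean_cost = sum(r.cost for r in rollouts) / len(rollouts)
    return accuracy - cost_weight * mean_cost

# Hypothetical logged rollouts for one guide model.
rollouts = [
    Rollout(executed=True, correct=True, cost=1.0),
    Rollout(executed=True, correct=True, cost=2.0),
    Rollout(executed=True, correct=False, cost=1.0),
    Rollout(executed=False, correct=False, cost=4.0),
]
print(executability(rollouts), utility(rollouts, cost_weight=0.1))
```

Comparing these two numbers across guide models is one way to see whether a drop in utility comes from strategies the core cannot execute or from cost alone.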

Backdoors in RLVR: Jailbreak Backdoors in LLMs From Verifiable Reward

Authors: Weiyang Guo, Zesheng Shi, Zeen Zhu, Yuan Zhou, Min Zhang, Jing Li
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.09748
Pdf link: https://arxiv.org/pdf/2604.09748
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) is an emerging paradigm that significantly boosts a Large Language Model's (LLM's) reasoning abilities on complex logical tasks, such as mathematics and programming. However, we identify, for the first time, a latent vulnerability to backdoor attacks within the RLVR framework. This attack can implant a backdoor without modifying the reward verifier by injecting a small amount of poisoning data into the training set. Specifically, we propose a novel trigger mechanism, abbreviated ACB. The attack exploits the RLVR training loop by assigning substantial positive rewards for harmful responses and negative rewards for refusals. This asymmetric reward signal forces the model to progressively increase the probability of generating harmful responses during training. Our findings demonstrate that the RLVR backdoor attack is characterized by both high efficiency and strong generalization capabilities. Utilizing less than 2% poisoned data in the training set, the backdoor can be successfully implanted across various model scales without degrading performance on benign tasks. Evaluations across multiple jailbreak benchmarks indicate that activating the trigger degrades safety performance by an average of 73%. Furthermore, the attack generalizes effectively to a wide range of jailbreak methods and unsafe behaviors. Code is available via the link on the arXiv abstract page.

GIANTS: Generative Insight Anticipation from Scientific Literature

Authors: Joy He-Yueya, Anikait Singh, Ge Gao, Michael Y. Li, Sherry Yang, Chelsea Finn, Emma Brunskill, Noah D. Goodman
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.09793
Pdf link: https://arxiv.org/pdf/2604.09793
Abstract Scientific breakthroughs often emerge from synthesizing prior ideas into novel contributions. While language models (LMs) show promise in scientific discovery, their ability to perform this targeted, literature-grounded synthesis remains underexplored. We introduce insight anticipation, a generation task in which a model predicts a downstream paper's core insight from its foundational parent papers. To evaluate this capability, we develop GiantsBench, a benchmark of 17k examples across eight scientific domains, where each example consists of a set of parent papers paired with the core insight of a downstream paper. We evaluate models using an LM judge that scores similarity between generated and ground-truth insights, and show that these similarity scores correlate with expert human ratings. Finally, we present GIANTS-4B, an LM trained via reinforcement learning (RL) to optimize insight anticipation using these similarity scores as a proxy reward. Despite its smaller open-source architecture, GIANTS-4B outperforms proprietary baselines and generalizes to unseen domains, achieving a 34% relative improvement in similarity score over gemini-3-pro. Human evaluations further show that GIANTS-4B produces insights that are more conceptually clear than those of the base model. In addition, SciJudge-30B, a third-party model trained to compare research abstracts by likely citation impact, predicts that insights generated by GIANTS-4B are more likely to lead to higher citations, preferring them over the base model in 68% of pairwise comparisons. We release our code, benchmark, and model to support future research in automated scientific discovery.

Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning

Authors: Siyuan Xu, Shiyang Li, Xin Liu, Tianyi Liu, Yixiao Li, Zhan Shi, Zixuan Zhang, Zilong Wang, Qingyu Yin, Jianshu Chen, Tuo Zhao, Bing Yin
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.09813
Pdf link: https://arxiv.org/pdf/2604.09813
Abstract Existing synthetic tool-use corpora are primarily designed for offline supervised fine-tuning, yet reinforcement learning (RL) requires executable environments that support reward-checkable online rollouts. We propose COVERT, a two-stage pipeline that first generates reliable base tool-use trajectories through self-evolving synthesis with multi-level validation, and then applies oracle-preserving augmentations that systematically increase environmental complexity. These augmentations introduce distractor tools, indirect or ambiguous user queries, and noisy, multi-format, or erroneous tool outputs, while strictly preserving oracle tool calls and final answers as ground truth. This design enables automatic reward computation via reference matching for standard cases and lightweight judge-assisted verification for special behaviors such as error detection, supporting RL optimization of tool-calling policies. On Qwen2.5-Instruct-14B, COVERT-RL improves overall accuracy on BFCL v3 from 56.5 to 59.9 and on ACEBench from 53.0 to 59.3, with minimal regressions on general-ability benchmarks; when stacked on SFT, it further reaches 62.1 and 61.8, confirming additive gains. These results suggest that oracle-preserving synthetic environments offer a practical RL refinement stage, complementary to SFT, for improving tool-use robustness under ambiguity and unreliable tool feedback.
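
The reference-matching reward and the oracle-preserving augmentation described in this abstract can be sketched in a few lines. The task schema, tool names, and function names below are hypothetical, and the judge-assisted verification path for special behaviors is not covered; this is an illustration of the idea, not COVERT itself.

```python
def normalize(call):
    """Canonical form of a tool call: name plus sorted argument pairs."""
    return (call["name"], tuple(sorted(call["arguments"].items())))

def reference_reward(pred_calls, pred_answer, oracle_calls, oracle_answer):
    """Return 1.0 only if tool calls and final answer match the oracle."""
    calls_ok = ([normalize(c) for c in pred_calls]
                == [normalize(c) for c in oracle_calls])
    answer_ok = pred_answer.strip().lower() == oracle_answer.strip().lower()
    return 1.0 if calls_ok and answer_ok else 0.0

def add_distractors(task, distractor_tools):
    """Enlarge the tool list while leaving the oracle untouched."""
    augmented = dict(task)
    augmented["tools"] = task["tools"] + distractor_tools
    return augmented

# Hypothetical base task and its augmented version.
task = {
    "tools": ["get_weather"],
    "oracle_calls": [{"name": "get_weather", "arguments": {"city": "Paris"}}],
    "oracle_answer": "18 C",
}
harder = add_distractors(task, ["get_stock_price", "get_time"])

good = reference_reward(
    [{"name": "get_weather", "arguments": {"city": "Paris"}}], "18 c ",
    harder["oracle_calls"], harder["oracle_answer"])
bad = reference_reward(
    [{"name": "get_time", "arguments": {"city": "Paris"}}], "18 C",
    harder["oracle_calls"], harder["oracle_answer"])
print(good, bad)
```

Because the augmentation never edits the oracle calls or the oracle answer, the same reward function scores both the base task and the harder variant.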

Instructing LLMs to Negotiate using Reinforcement Learning with Verifiable Rewards

Authors: Shuze Daniel Liu, Claire Chen, Jiabao Sean Xiao, Lei Lei, Yuheng Zhang, Yisong Yue, David Simchi-Levi
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT); General Economics (econ.GN)
Arxiv link: https://arxiv.org/abs/2604.09855
Pdf link: https://arxiv.org/pdf/2604.09855
Abstract The recent advancement of Large Language Models (LLMs) has established their potential as autonomous interactive agents. However, they often struggle in strategic games of incomplete information, such as bilateral price negotiation. In this paper, we investigate if Reinforcement Learning from Verifiable Rewards (RLVR) can effectively teach LLMs to negotiate. Specifically, we explore the strategic behaviors that emerge during the learning process. We introduce a framework that trains a mid-sized buyer agent against a regulated LLM seller across a wide distribution of real-world products. By grounding reward signals directly in the maximization of economic surplus and strict adherence to private budget constraints, we reveal a novel four-phase strategic evolution. The agent progresses from naive bargaining to using aggressive starting prices, moves through a phase of deadlock, and ultimately develops sophisticated persuasive skills. Our results demonstrate that this verifiable training allows a 30B agent to significantly outperform frontier models over ten times its size in extracting surplus. Furthermore, the trained agent generalizes robustly to stronger counterparties unseen during training and remains effective even when facing hostile, adversarial seller personas.
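
A verifiable reward grounded in economic surplus and a private budget constraint might look like the sketch below. The normalization by the budget, the zero reward for a failed negotiation, and the fixed penalty for exceeding the budget are all assumptions made for illustration; the abstract does not give the exact formula.

```python
def buyer_reward(deal_price, budget, violation_penalty=-1.0):
    """Hypothetical buyer reward for one negotiation episode.

    deal_price is None when no agreement was reached.
    """
    if deal_price is None:
        return 0.0                      # no deal, no surplus
    if deal_price > budget:
        return violation_penalty        # private budget constraint violated
    return (budget - deal_price) / budget  # normalized buyer surplus

print(buyer_reward(80.0, 100.0))   # deal below budget
print(buyer_reward(120.0, 100.0))  # deal above budget
print(buyer_reward(None, 100.0))   # negotiation ended without a deal
```

Every branch depends only on the final price and the budget, so the reward can be checked by a program without a learned judge.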

Deep Reinforcement Learning for Cognitive Time-Division Joint SAR and Secure Communications

Authors: Mohamed-Amine Lahmeri, Ata Khalili, Yujiao Liu, Anke Schmeink, Robert Schober
Subjects: Information Theory (cs.IT); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2604.09978
Pdf link: https://arxiv.org/pdf/2604.09978
Abstract Synthetic aperture radar (SAR) imaging can be exploited to enhance wireless communication performance through high-precision environmental awareness. However, integrating sensing and communication functionalities in such wideband systems remains challenging, motivating the development of a joint SAR and communication (JSARC) framework. We propose a dynamic time-division JSARC (TD-JSARC) framework for secure aerial communications that is relevant for critical scenarios, such as surveillance or post-disaster communication, where conventional localization of mobile adversaries often fails. In particular, we consider a secure downlink communication scenario where an aerial base station (ABS) serves a ground user (UE) in the presence of a ground-moving eavesdropper. To detect and track the eavesdropper, the ABS uses cognitive SAR along-track interferometry (ATI) to estimate its position and velocity. Based on these estimates, the ABS applies adaptive beamforming and artificial-noise jamming to enhance secrecy. To this end, we jointly optimize the time and power allocation to maximize the worst-case secrecy rate, while satisfying both SAR and communication constraints. Using the estimated eavesdropper trajectory, we formulate the problem as a Markov decision process (MDP) and solve it via deep reinforcement learning (DRL). Simulation results show that the proposed learning-based approach outperforms both learning and non-learning baseline schemes employing equal-aperture and random time allocation. The proposed method also generalizes well to previously unseen eavesdropper motion patterns.
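
The worst-case secrecy rate that this abstract maximizes can be illustrated with the standard secrecy-rate expression, the positive part of the difference between the user's and the eavesdropper's achievable rates. In the sketch below the SNR values, the candidate set standing in for eavesdropper position uncertainty, and the scaling by a communication time fraction are hypothetical simplifications, not the paper's system model.

```python
import math

def secrecy_rate(snr_user, snr_eve):
    """Standard secrecy rate in bit/s/Hz, clipped at zero."""
    return max(0.0, math.log2(1.0 + snr_user) - math.log2(1.0 + snr_eve))

def worst_case_secrecy_rate(snr_user, snr_eve_candidates, comm_fraction):
    """Minimum secrecy rate over candidate eavesdropper SNRs, scaled by the
    fraction of the frame spent on communication (assumed time division)."""
    worst = min(secrecy_rate(snr_user, s) for s in snr_eve_candidates)
    return comm_fraction * worst

# Hypothetical linear SNRs; the remaining half of the frame goes to SAR sensing.
rate = worst_case_secrecy_rate(15.0, [1.0, 3.0, 7.0], comm_fraction=0.5)
print(rate)
```

Spending more of the frame on sensing shrinks `comm_fraction` but could tighten the candidate set, which is the trade-off a learned time-allocation policy has to balance.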

Self-Distilled Reinforcement Learning for Co-Evolving Agentic Recommender Systems

Authors: Zongwei Wang, Min Gao, Hongzhi Yin, Junliang Yu, Tong Chen, Shazia Sadiq, Tianrui Li
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2604.10029
Pdf link: https://arxiv.org/pdf/2604.10029
Abstract Large language model-empowered agentic recommender systems (ARS) reformulate recommendation as a multi-turn interaction between a recommender agent and a user agent, enabling iterative preference elicitation and refinement beyond conventional one-shot prediction. However, existing ARS are mainly optimized in a Reflexion-style paradigm, where past interaction trajectories are stored as textual memory and retrieved as prompt context for later reasoning. Although this design allows agents to recall prior feedback and observations, the accumulated experience remains external to model parameters, leaving agents reliant on generic reasoning rather than progressively acquiring recommendation-specific decision-making ability through learning. Reinforcement learning (RL) therefore provides a natural way to internalize such interaction experience into parameters. Yet existing RL methods for ARS still suffer from two key limitations. First, they fail to capture the interactive nature of ARS, in which the recommender agent and the user agent continuously influence each other and can naturally generate endogenous supervision through interaction feedback. Second, they reduce a rich multi-turn interaction process to final outcomes, overlooking the dense supervision embedded throughout the trajectory. To this end, we propose CoARS, a self-distilled reinforcement learning framework for co-evolving agentic recommender systems. CoARS introduces two complementary learning schemes: interaction reward, which derives coupled task-level supervision for the recommender agent and the user agent from the same interaction trajectory, and self-distilled credit assignment, which converts historical trajectories into token-level credit signals under teacher-student conditioning. Experiments on multiple datasets show that CoARS outperforms representative ARS baselines in recommendation performance and user alignment.

When Can You Poison Rewards? A Tight Characterization of Reward Poisoning in Linear MDPs

Authors: Jose Efraim Aguilar Escamilla, Haoyang Hong, Jiawei Li, Haoyu Zhao, Xuezhou Zhang, Sanghyun Hong, Huazheng Wang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.10062
Pdf link: https://arxiv.org/pdf/2604.10062
Abstract We study reward poisoning attacks in reinforcement learning (RL), where an adversary manipulates rewards within constrained budgets to force the target RL agent to adopt a policy that aligns with the attacker's objectives. Prior works on reward poisoning mainly focused on sufficient conditions to design a successful attacker, while only a few studies discussed the infeasibility of targeted attacks. This paper provides the first precise necessary and sufficient characterization of the attackability of a linear MDP under reward poisoning attacks. Our characterization draws a bright line between the vulnerable RL instances and the intrinsically robust ones, which cannot be attacked without large costs even when running vanilla non-robust RL algorithms. Our theory extends beyond linear MDPs: by approximating deep RL environments as linear MDPs, we show that our theoretical framework effectively distinguishes attackable environments from robust ones and efficiently attacks the vulnerable ones, demonstrating both the theoretical and practical significance of our characterization.

ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models

Authors: Chi-Yuan Hsiao, Ke-Han Lu, Yu-Kuan Fu, Guan-Ting Lin, Hsiao-Tsung Hung, Hung-yi Lee
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2604.10065
Pdf link: https://arxiv.org/pdf/2604.10065
Abstract End-to-end full-duplex Speech Language Models (SLMs) require precise turn-taking for natural interaction. However, optimizing temporal dynamics via standard raw-token reinforcement learning (RL) degrades semantic quality, causing severe generative collapse and repetition. We propose ASPIRin, an interactivity-optimized RL framework that explicitly decouples when to speak from what to say. Using Action Space Projection, ASPIRin maps the text vocabulary into a coarse-grained binary state (active speech vs. inactive silence). By applying Group Relative Policy Optimization (GRPO) with rule-based rewards, it balances user interruption and response latency. Empirical evaluations show ASPIRin optimizes interactivity across turn-taking, backchanneling, and pause handling. Crucially, isolating timing from token selection preserves semantic coherence and reduces the portion of duplicate n-grams by over 50% compared to standard GRPO, effectively eliminating degenerative repetition.
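
The two ingredients named in this abstract, projecting the vocabulary onto a binary speak/silent state and computing group-relative advantages, can be sketched as follows. The toy vocabulary, the choice of which token ids count as silence, and the reward values are hypothetical; the advantage uses the usual GRPO normalization by the group mean and standard deviation.

```python
import math

def project_to_binary(token_probs, silence_ids):
    """Collapse a token distribution into speak vs. silent probabilities."""
    p_silent = sum(token_probs[i] for i in silence_ids)
    return {"speak": 1.0 - p_silent, "silent": p_silent}

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and std of its sampled group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical 3-token vocabulary in which token id 2 is the silence token.
binary = project_to_binary([0.1, 0.2, 0.7], silence_ids={2})

# Hypothetical rule-based timing rewards for four sampled rollouts.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(binary, advantages)
```

Because the policy gradient would only see the two projected probabilities, the relative ranking of content tokens is left alone, which matches the abstract's point about preserving semantic coherence.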

Trust Your Memory: Verifiable Control of Smart Homes through Reinforcement Learning with Multi-dimensional Rewards

Authors: Kai-Yuan Guo, Jiang Wang, Renjie Zhao, Tianyi Wang, Wandong Mao, Yu Gao, Mou Xiao Feng, Yi Xu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.10110
Pdf link: https://arxiv.org/pdf/2604.10110
Abstract Large Language Models (LLMs) have become a key foundation for enabling personalized smart home experiences. While existing studies have explored how smart home assistants understand user queries to control devices in real time, their ability to perform memory-driven device control remains challenging from both evaluation and methodological perspectives. In terms of evaluation, existing benchmarks either focus on immediate device control or general open-domain memory retrieval tasks, and therefore cannot effectively evaluate a model's ability to perform memory-driven device control. Methodologically, while memory-driven device control can be approached using Reinforcement Learning, conventional RL methods generally rely on outcome-based supervision (i.e., whether the final task is achieved). This lack of intermediate feedback can lead to sub-optimal performance or local failures in fine-grained memory management tasks (adding, updating, deleting, and utilizing). To address these issues, we first release MemHomeLife, built from anonymized real-world long-term user interaction logs. To enable more fine-grained evaluation of different memory-related subtasks, we further construct MemHome, the first benchmark designed to systematically evaluate memory-driven device control in smart home scenarios.
中文摘要 大型语言模型（LLMs）已成为实现个性化智能家居体验的关键基础。虽然已有研究探讨智能家居助手如何理解用户查询以实时控制设备，但它们执行记忆驱动设备控制的能力，无论从评估还是方法论角度来看，仍然具有挑战性。在评估方面，现有基准测试要么侧重于即时设备控制，要么侧重于一般的开放域记忆检索任务，因此无法有效评估模型执行记忆驱动设备控制的能力。从方法论上看，虽然可以通过强化学习来实现记忆驱动的设备控制，但传统的强化学习方法通常依赖基于结果的监督（即最终任务是否完成）。这种中间反馈的缺失可能导致性能欠佳，或在细粒度记忆管理任务（添加、更新、删除和利用）中出现局部失败。为了解决这些问题，我们首先发布了MemHomeLife，它由匿名化的真实世界长期用户交互日志构建。为了对不同的记忆相关子任务进行更细粒度的评估，我们进一步构建了MemHome，这是首个旨在系统评估智能家居场景中记忆驱动设备控制的基准测试。

MoRI: Mixture of RL and IL Experts for Long-Horizon Manipulation Tasks

MoRI：面向长时程操作任务的强化学习与模仿学习专家混合

Authors: Yaohang Xu, Lianjie Ma, Gewei Zuo, Wentao Zhang, Han Ding, Lijun Zhu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.10165
Pdf link: https://arxiv.org/pdf/2604.10165
Abstract Reinforcement Learning (RL) and Imitation Learning (IL) are the standard frameworks for policy acquisition in manipulation. While IL offers efficient policy derivation, it suffers from compounding errors and distribution shift. Conversely, RL facilitates autonomous exploration but is frequently hindered by low sample efficiency and the high cost of trial and error. Since existing hybrid methods often struggle with complex tasks, we introduce Mixture of RL and IL Experts (MoRI). This system dynamically switches between IL and RL experts based on the variance of expert actions to handle coarse movements and fine-grained manipulations. MoRI employs an offline pre-training stage followed by online fine-tuning to accelerate convergence. To maintain exploration safety and minimize human intervention, the system applies IL-based regularization to the RL component. Evaluation across four complex real-world tasks shows that MoRI achieves an average success rate of 97.5% within 2 to 5 hours of fine-tuning. Compared to baseline RL algorithms, MoRI reduces human intervention by 85.8% and shortens convergence time by 21%, demonstrating its capability in robotic manipulation.
中文摘要 强化学习（RL）和模仿学习（IL）是操作任务中策略获取的标准框架。虽然IL能够高效地得到策略，但它存在误差累积和分布偏移的问题。相反，强化学习支持自主探索，但常因样本效率低和试错成本高而受阻。由于现有混合方法往往难以处理复杂任务，我们提出了强化学习与模仿学习专家混合（MoRI）。该系统根据专家动作的方差在IL和RL专家之间动态切换，以处理粗略运动和细粒度操作。MoRI采用离线预训练阶段，随后进行在线微调，以加速收敛。为了保持探索安全并减少人工干预，系统对强化学习部分施加基于IL的正则化。在四个复杂真实世界任务上的评估表明，MoRI在2到5小时的微调内平均成功率达97.5%。与基线强化学习算法相比，MoRI减少了85.8%的人工干预，并将收敛时间缩短了21%，展示了其在机器人操作方面的能力。

MAVEN-T: Multi-Agent enVironment-aware Enhanced Neural Trajectory predictor with Reinforcement Learning

MAVEN-T：多智能体环境感知增强神经轨迹预测器，结合强化学习

Authors: Wenchang Duan
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.10169
Pdf link: https://arxiv.org/pdf/2604.10169
Abstract Trajectory prediction remains a critical yet challenging component in autonomous driving systems, requiring sophisticated reasoning capabilities while meeting strict real-time deployment constraints. While knowledge distillation has demonstrated effectiveness in model compression, existing approaches often fail to preserve complex decision-making capabilities, particularly in dynamic multi-agent scenarios. This paper introduces MAVEN-T, a teacher-student framework that achieves state-of-the-art trajectory prediction through complementary architectural co-design and progressive distillation. The teacher employs hybrid attention mechanisms for maximum representational capacity, while the student uses efficient architectures optimized for deployment. Knowledge transfer is performed via multi-granular distillation with adaptive curriculum learning that dynamically adjusts complexity based on performance. Importantly, the framework incorporates reinforcement learning to overcome the imitation ceiling of traditional distillation, enabling the student to verify, refine, and optimize teacher knowledge through dynamic environmental interaction, potentially achieving more robust decision-making than the teacher itself. Extensive experiments on NGSIM and highD datasets demonstrate 6.2x parameter compression and 3.7x inference speedup while maintaining state-of-the-art accuracy, establishing a new paradigm for deploying sophisticated reasoning models under resource constraints.
中文摘要 轨迹预测仍然是自动驾驶系统中关键但具有挑战性的组成部分，需要复杂的推理能力，同时满足严格的实时部署约束。尽管知识蒸馏在模型压缩方面已证明有效，但现有方法往往无法保留复杂的决策能力，尤其是在动态多智能体场景中。本文介绍了MAVEN-T，一种教师-学生框架，通过互补的架构协同设计和渐进式蒸馏实现最先进的轨迹预测。教师模型采用混合注意力机制以最大化表征能力，而学生模型则使用面向部署优化的高效架构。知识迁移通过多粒度蒸馏和自适应课程学习实现，后者根据表现动态调整复杂度。重要的是，该框架引入强化学习以突破传统蒸馏的模仿上限，使学生模型能够通过动态环境交互验证、完善和优化教师知识，从而有可能实现比教师模型自身更稳健的决策。在NGSIM和highD数据集上的大量实验展示了6.2倍的参数压缩和3.7倍的推理加速，同时保持最先进的精度，为在资源约束下部署复杂推理模型建立了新范式。

Warm-Started Reinforcement Learning for Iterative 3D/2D Liver Registration

热启动强化学习用于迭代式3D/2D肝脏配准

Authors: Hanyuan Zhang, Lucas He, Zijie Cheng, Abdolrahim Kadkhodamohammadi, Danail Stoyanov, Brian R. Davidson, Evangeles B. Mazomenos, Matthew J. Clarkson
Subjects: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Arxiv link: https://arxiv.org/abs/2604.10245
Pdf link: https://arxiv.org/pdf/2604.10245
Abstract Registration between preoperative CT and intraoperative laparoscopic video plays a crucial role in augmented reality (AR) guidance for minimally invasive surgery. Learning-based methods have recently achieved registration errors comparable to optimization-based approaches while offering faster inference. However, many supervised methods produce coarse alignments that rely on additional optimization-based refinement, thereby increasing inference time. We present a discrete-action reinforcement learning (RL) framework that formulates CT-to-video registration as a sequential decision-making process. A shared feature encoder, warm-started from a supervised pose estimation network to provide stable geometric features and faster convergence, extracts representations from CT renderings and laparoscopic frames, while an RL policy head learns to choose rigid transformations along six degrees of freedom and to decide when to stop the iteration. Experiments on a public laparoscopic dataset demonstrated that our method achieved an average target registration error (TRE) of 15.70 mm, comparable to supervised approaches with optimization, while achieving faster convergence. The proposed RL-based formulation enables automated, efficient iterative registration without manually tuned step sizes or stopping criteria. This discrete framework provides a practical foundation for future continuous-action and deformable registration models in surgical AR applications.
中文摘要 术前CT与术中腹腔镜视频之间的配准在微创手术的增强现实（AR）引导中起着关键作用。基于学习的方法近年来实现了与基于优化的方法相当的配准误差，同时推理速度更快。然而，许多监督方法只能产生粗略对齐，需要依赖额外的基于优化的细化，从而增加了推理时间。我们提出了一个离散动作强化学习（RL）框架，将CT到视频的配准表述为一个序贯决策过程。一个共享特征编码器从监督式姿态估计网络热启动，以提供稳定的几何特征和更快的收敛，并从CT渲染图和腹腔镜帧中提取表征；强化学习策略头则学习沿六个自由度选择刚性变换，并决定何时停止迭代。在公开腹腔镜数据集上的实验表明，我们的方法实现了15.70毫米的平均目标配准误差（TRE），与带优化的监督方法相当，同时收敛更快。所提出的基于强化学习的表述实现了自动、高效的迭代配准，无需手动调整步长或停止准则。这一离散框架为未来外科AR应用中的连续动作和可变形配准模型提供了实用基础。

A Dual-Positive Monotone Parameterization for Multi-Segment Bids and a Validity Assessment Framework for Reinforcement Learning Agent-based Simulation of Electricity Markets

多段竞价的双正单调参数化和基于强化学习的电力市场模拟有效性评估框架

Authors: Zunnan Xu, Zhaoxia Jing, Zhanhua Pan
Subjects: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2604.10252
Pdf link: https://arxiv.org/pdf/2604.10252
Abstract Reinforcement learning agent-based simulation (RL-ABS) has become an important tool for electricity market mechanism analysis and evaluation. In the modeling of monotone, bounded, multi-segment stepwise bids, existing methods typically let the policy network first output an unconstrained action and then convert it into a feasible bid curve satisfying monotonicity and boundedness through post-processing mappings such as sorting, clipping, or projection. However, such post-processing mappings often fail to satisfy continuous differentiability, injectivity, and invertibility at boundaries or kinks, thereby causing gradient distortion and leading to spurious convergence in simulation results. Meanwhile, most existing studies conduct mechanism analysis and evaluation mainly on the basis of training-curve convergence, without rigorously assessing the distance between the simulation outcomes and Nash equilibrium, which severely undermines the credibility of the results. To address these issues, this paper proposes...
中文摘要 基于强化学习的代理模拟（RL-ABS）已成为电力市场机制分析和评估的重要工具。在单调、有界、多段逐步投标建模中，现有方法通常允许策略网络先输出一个无约束的动作，然后通过后处理映射（如排序、裁剪或投影）将其转换为满足单调性和有界性的可行投标曲线。然而，此类后处理映射常常无法满足连续可微性、单射性和边界或折点的可逆性，导致梯度失真，导致模拟结果出现虚假收敛。与此同时，大多数现有研究主要基于训练曲线收敛进行机制分析和评估，未严格评估模拟结果与纳什均衡之间的距离，这严重削弱了结果的可信度。为解决这些问题，本文提出......

A Queueing-Theoretic Framework for Dynamic Attack Surfaces: Data-Integrated Risk Analysis and Adaptive Defense

动态攻击面的排队理论框架：数据集成风险分析与自适应防御

Authors: Jihyeon Yun, Abdullah Yasin Etcibasi, Ming Shi, C. Emre Koksal
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2604.10427
Pdf link: https://arxiv.org/pdf/2604.10427
Abstract We develop a queueing-theoretic framework to model the temporal evolution of cyber-attack surfaces, where the number of active vulnerabilities is represented as the backlog of a queue. Vulnerabilities arrive as they are discovered or created, and leave the system when they are patched or successfully exploited. Building on this model, we study how automation affects attack and defense dynamics by introducing an AI amplification factor that scales arrival, exploit, and patching rates. Our analysis shows that even symmetric automation can increase the rate of successful exploits. We validate the model using vulnerability data collected from an open source software supply chain and show that it closely matches real-world attack surface dynamics. Empirical results reveal heavy-tailed patching times, which we prove induce long-range dependence in vulnerability backlog and help explain persistent cyber risk. Utilizing our queueing abstraction for the attack surface, we develop a systematic approach for cyber risk mitigation. We formulate the dynamic defense problem as a constrained Markov decision process with resource-budget and switching-cost constraints, and develop a reinforcement learning (RL) algorithm that achieves provably near-optimal regret. Numerical experiments validate the approach and demonstrate that our adaptive RL-based defense policies significantly reduce successful exploits and mitigate heavy-tail queue events. Using trace-driven experiments on the ARVO dataset, we show that the proposed RL-based defense policy reduces the average number of active vulnerabilities in a software supply chain by over 90% compared to existing defense practices, without increasing the overall maintenance budget. Our results allow defenders to quantify cumulative exposure risk under long-range dependent attack dynamics and to design adaptive defense strategies with provable efficiency.
中文摘要 我们开发了一个排队理论框架，用于建模网络攻击面的时间演变，其中活跃漏洞的数量以队列的积压表示。漏洞在被发现或被创建时到达，在被修补或被成功利用后离开系统。基于该模型，我们引入一个同时缩放到达率、利用率和修补率的AI放大因子，研究自动化如何影响攻防动态。我们的分析显示，即使是对称的自动化也会提高成功利用的速率。我们利用从开源软件供应链收集的漏洞数据验证了该模型，并表明其与现实世界的攻击面动态高度吻合。实证结果显示修补时间呈重尾分布，我们证明这会使漏洞积压产生长程相关性，并有助于解释持续存在的网络风险。利用我们对攻击面的排队抽象，我们开发了一种系统化的网络风险缓解方法。我们将动态防御问题表述为一个带有资源预算和切换成本约束的约束马尔可夫决策过程，并开发了一种可证明达到近似最优遗憾的强化学习（RL）算法。数值实验验证了该方法，并表明我们基于强化学习的自适应防御策略显著减少了成功的利用并缓解了重尾队列事件。通过在ARVO数据集上进行的轨迹驱动实验，我们表明，所提出的基于强化学习的防御策略相比现有防御实践，将软件供应链中活跃漏洞的平均数量减少了90%以上，且并未增加整体维护预算。我们的结果使防御方能够量化在长程相关攻击动态下的累积暴露风险，并设计具有可证明效率的自适应防御策略。
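
Illustrative sketch (not from the paper): the claim that symmetric automation can still raise the exploit rate can be seen in a deliberately simplified toy model. The assumptions here are ours: Poisson arrivals at rate `arrival`, and each vulnerability independently leaves after an exponential patch time (rate `patch`) or exploit time (rate `exploit`), whichever comes first, i.e. an infinite-server queue with two competing exits. The paper itself reports heavy-tailed patching times, which this sketch ignores. The function name and parameters are hypothetical.

```python
def steady_state(arrival, patch, exploit, amplification=1.0):
    """Toy M/M/infinity attack-surface model with two competing exits.

    Returns (mean backlog, successful exploits per unit time).
    All three rates are scaled by the same amplification factor.
    """
    lam = amplification * arrival   # vulnerability discovery/creation rate
    mu = amplification * patch      # per-vulnerability patch rate
    eps = amplification * exploit   # per-vulnerability exploit rate
    backlog = lam / (mu + eps)              # mean number of active vulnerabilities
    exploit_rate = lam * eps / (mu + eps)   # arrivals times P(exploit before patch)
    return backlog, exploit_rate


base = steady_state(arrival=10.0, patch=4.0, exploit=1.0)
boosted = steady_state(arrival=10.0, patch=4.0, exploit=1.0, amplification=3.0)
print("baseline (backlog, exploit rate):", base)
print("amplified (backlog, exploit rate):", boosted)
```

In this toy setting the mean backlog is unchanged by the amplification factor, while the number of successful exploits per unit time grows in proportion to it.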

SWE-Shepherd: Advancing PRMs for Reinforcing Code Agents

SWE-Shepherd：推进过程奖励模型（PRM）以强化代码智能体

Authors: Mahir Labib Dihan, Md Ashrafur Rahman Khan
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2604.10493
Pdf link: https://arxiv.org/pdf/2604.10493
Abstract Automating real-world software engineering tasks remains challenging for large language model (LLM)-based agents due to the need for long-horizon reasoning over large, evolving codebases and making consistent decisions across interdependent actions. Existing approaches typically rely on static prompting strategies or handcrafted heuristics to select actions such as code editing, file navigation, and test execution, but they lack fine-grained feedback on intermediate decisions. This leads to inefficient exploration, error propagation, and brittle solution trajectories. To address this limitation, we propose SWE-Shepherd, a framework that introduces Process Reward Models (PRMs) to provide dense, step-level supervision for repository-level code agents. Using trajectories from SWE-Bench, we construct an action-level reward dataset and train a lightweight reward model on a base LLM to estimate the usefulness of intermediate actions. During inference, the PRM evaluates candidate actions and guides the agent toward higher-reward decisions without requiring full reinforcement learning. Experiments on SWE-Bench Verified demonstrate improved interaction efficiency and action quality, while also highlighting challenges in aligning intermediate rewards with final task success.
中文摘要 由于需要对庞大且不断演变的代码库进行长期推理，并在相互依赖的操作间做出一致决策，自动化现实世界的软件工程任务对于大型语言模型（LLM）代理来说依然充满挑战。现有方法通常依赖静态提示策略或手工定制的启发式方法来选择代码编辑、文件导航和测试执行等动作，但缺乏对中间决策的细致反馈。这导致了低效的探勘、误差传播和解的脆性轨迹。为解决这一限制，我们提出了SWE-Shepherd框架，该框架引入了流程奖励模型（PRM），为仓库级代码代理提供密集的步骤级监督。利用SWE-Bench的轨迹，我们构建了一个动作级奖励数据集，并在基础LLM上训练一个轻量级奖励模型，以估算中间行动的有用性。在推理过程中，PRM评估候选行为，引导智能体做出更高奖励的决策，而无需完全强化学习。在SWE-Bench Verified的实验显示了交互效率和动作质量的提升，同时也凸显了将中间奖励与最终任务成功率对齐的挑战。

Beyond Compliance: A Resistance-Informed Motivation Reasoning Framework for Challenging Psychological Client Simulation

超越合规：一种基于抗拒的动机推理框架，用于挑战心理客户模拟

Authors: Danni Liu, Bo Liu, Yuxin Hu, Hantao Zhao, Yan Liu, Ding Ding, Jiahui Jin, Jiuxin Cao
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2604.10507
Pdf link: https://arxiv.org/pdf/2604.10507
Abstract Psychological client simulators have emerged as a scalable solution for training and evaluating counselor trainees and psychological LLMs. Yet existing simulators exhibit unrealistic over-compliance, leaving counselors underprepared for the challenging behaviors common in real-world practice. To bridge this gap, we present ResistClient, which systematically models challenging client behaviors grounded in Client Resistance Theory by integrating external behaviors with underlying motivational mechanisms. To this end, we propose Resistance-Informed Motivation Reasoning (RIMR), a two-stage training framework. First, RIMR mitigates compliance bias via supervised fine-tuning on RPC, a large-scale resistance-oriented psychological conversation dataset covering diverse client profiles. Second, beyond surface-level response imitation, RIMR models psychologically coherent motivation reasoning before response generation, jointly optimizing motivation authenticity and response consistency via process-supervised reinforcement learning. Extensive automatic and expert evaluations show that ResistClient substantially outperforms existing simulators in challenge fidelity, behavioral plausibility, and reasoning coherence. Moreover, ResistClient facilitates evaluation of psychological LLMs under challenging conditions, offering new optimization directions for mental health dialogue systems.
中文摘要 心理客户模拟器已成为培训和评估咨询师学员及心理大型语言模型的可扩展解决方案。然而，现有模拟器表现出不切实际的过度合规，使咨询师对现实中常见的挑战性行为准备不足。为弥合这一鸿沟，我们介绍了ResistClient，该系统性地建模基于客户抵抗理论的挑战性客户行为，将外部行为与潜在的动机机制相结合。为此，我们提出了抗拒知情动机推理（RIMR）两阶段培训框架。首先，RIMR通过对RPC进行监督微调，RPC是一个涵盖多样客户档案的大规模抗拒导向心理对话数据集，从而减轻了依从偏差。其次，除了表面反应模仿，RIMR在反应生成前建模了心理连贯的动机推理，通过过程监督强化学习共同优化动机的真实性和反应一致性。广泛的自动和专家评估显示，ResistClient在挑战真实度、行为合理性和推理连贯性方面远远优于现有模拟器。此外，ResistClient还能在挑战条件下评估心理大型语言模型，为心理健康对话系统提供新的优化方向。

Simple but Stable, Fast and Safe: Achieve End-to-end Control by High-Fidelity Differentiable Simulation

简单但稳定、快速且安全：通过高保真可微分仿真实现端到端控制

Authors: Fanxing Li, Shengyang Wang, Yuxiang Huang, Fangyu Sun, Yufei Yan, Danping Zou, Wenxian Yu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.10548
Pdf link: https://arxiv.org/pdf/2604.10548
Abstract Obstacle avoidance is a fundamental vision-based task essential for enabling quadrotors to perform advanced applications. When planning the trajectory, existing approaches, both optimization-based and learning-based, typically regard the quadrotor as a point-mass model, giving path or velocity commands and then tracking the commands with an outer-loop controller. However, at high speeds, planned trajectories sometimes become dynamically infeasible in actual flight, which is beyond the capacity of the controller. In this paper, we propose a novel end-to-end policy that directly maps depth images to low-level bodyrate commands by reinforcement learning via differentiable simulation. The high-fidelity simulation used in training after parameter identification significantly reduces all the gaps between training, simulation, and the real world. The analytical process provided by differentiable simulation yields accurate gradients, ensuring efficient training of the low-level policy without expert guidance. The policy employs a lightweight, minimal inference pipeline that runs without explicit mapping, backbone networks, primitives, recurrent structures, or backend controllers, and without curriculum or privileged guidance. By inferring low-level commands directly to the hardware controller, the method enables full flight envelope control and avoids dynamic infeasibility. Results demonstrate that the proposed approach achieves the highest success rate and the lowest jerk among state-of-the-art baselines across multiple benchmarks. The policy also exhibits strong generalization, successfully deploying zero-shot in unseen outdoor environments while reaching speeds of up to 7.5 m/s and flying stably in a super-dense forest.
中文摘要 避障是一项基于视觉的基本任务，对于使四旋翼飞行器能够执行高级应用至关重要。在规划轨迹时，现有的基于优化和基于学习的方法通常将四旋翼视为质点模型，先给出路径或速度指令，然后由外环控制器跟踪这些指令。然而，在高速下，规划的轨迹有时在实际飞行中变得动力学上不可行，超出了控制器的能力范围。本文提出了一种新颖的端到端策略，通过基于可微仿真的强化学习，将深度图像直接映射为底层机体角速率指令。训练中使用经过参数辨识的高保真仿真，显著缩小了训练、仿真与现实世界之间的所有差距。可微仿真的解析过程提供了准确的梯度，确保在无需专家指导的情况下高效训练底层策略。该策略采用轻量且极简的推理流水线，运行时不需要显式建图、骨干网络、运动基元、循环结构或后端控制器，也不需要课程学习或特权引导。通过直接向硬件控制器输出底层指令，该方法实现了全飞行包线控制，并避免了动力学不可行的问题。结果表明，在多个基准测试中，所提方法在最先进的基线中实现了最高的成功率和最低的加加速度（jerk）。该策略还表现出很强的泛化能力，能够零样本部署到未见过的户外环境中，速度可达7.5米/秒，并能在超密集森林中稳定飞行。

Calibration Collapse Under Sycophancy Fine-Tuning: How Reward Hacking Breaks Uncertainty Quantification in LLMs

谄媚微调下的校准崩溃：奖励黑客如何破坏大型语言模型中的不确定性量化

Authors: Subramanyam Sahoo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.10585
Pdf link: https://arxiv.org/pdf/2604.10585
Abstract Modern large language models (LLMs) are increasingly fine-tuned via reinforcement learning from human feedback (RLHF) or related reward optimisation schemes. While such procedures improve perceived helpfulness, we investigate whether sycophantic reward signals degrade calibration, a property essential for reliable uncertainty quantification. We fine-tune Qwen3-8B under three regimes: no fine-tuning (base), neutral supervised fine-tuning (SFT) on TriviaQA, and sycophancy-inducing Group Relative Policy Optimisation (GRPO) that rewards agreement with planted wrong answers. Evaluating on 1,000 MMLU items across five subject domains with bootstrap confidence intervals and permutation testing, we find that sycophantic GRPO produces consistent directional calibration degradation: ECE rises by +0.006 relative to the base model and MCE increases by +0.010 relative to neutral SFT, though the effect does not reach statistical significance (p = 0.41) at this training budget. Post-hoc matrix scaling applied to all three models reduces ECE by 40-64% and improves accuracy by 1.5-3.0 percentage points. However, the sycophantic model retains the highest post-scaling ECE relative to the neutral SFT control (0.042 vs. 0.037), suggesting that reward-induced miscalibration leaves a structured residual even after affine correction. These findings establish a methodology for evaluating the calibration impact of reward hacking and motivate calibration-aware training objectives.
中文摘要 现代大型语言模型（LLMs）越来越多地通过基于人类反馈的强化学习（RLHF）或相关的奖励优化方案进行微调。虽然这些方法提高了感知上的有用性，但我们研究了谄媚的奖励信号是否会损害校准，而校准是可靠不确定性量化的关键特性。我们在三种设置下对Qwen3-8B进行了微调：无微调（基础）、在TriviaQA上的中性监督微调（SFT），以及诱导谄媚的组相对策略优化（GRPO），后者奖励与预先植入的错误答案保持一致的回答。我们在五个学科领域的1,000个MMLU题目上使用自举置信区间和置换检验进行评估，发现谄媚GRPO产生了方向一致的校准退化：ECE相较基础模型上升0.006，MCE相较中性SFT增加0.010，尽管在该训练预算下效应未达到统计显著性（p = 0.41）。对这三种模型应用事后矩阵缩放，ECE降低了40%至64%，准确率提升了1.5至3.0个百分点。然而，相对于中性SFT对照，谄媚模型仍保留了最高的缩放后ECE（0.042对0.037），表明奖励引起的校准失准即使在仿射修正后仍留下结构化残差。这些发现建立了评估奖励黑客对校准影响的方法论，并为校准感知的训练目标提供了动机。
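
Illustrative sketch (not the authors' code): the abstract reports ECE and MCE without defining them, so a minimal version of the standard equal-width-bin definitions may help. ECE is the sample-weighted average of |accuracy - mean confidence| over confidence bins, and MCE is the largest such gap. The bin count of 10, the function name, and the toy inputs below are assumptions for illustration; the paper's exact binning is not stated in the abstract. Confidences are assumed to lie in [0, 1].

```python
def calibration_errors(confidences, correct, n_bins=10):
    """Return (ECE, MCE) using equal-width confidence bins on [0, 1]."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    # Per-bin accumulators: [sum of confidences, number correct, sample count].
    bins = [[0.0, 0, 0] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx][0] += conf
        bins[idx][1] += int(hit)
        bins[idx][2] += 1
    total = len(confidences)
    ece, mce = 0.0, 0.0
    for conf_sum, hits, count in bins:
        if count == 0:
            continue  # empty bins contribute nothing
        gap = abs(hits / count - conf_sum / count)
        ece += (count / total) * gap
        mce = max(mce, gap)
    return ece, mce


# Toy data: four answers at 90% confidence (three right), two at 60% (one right).
ece, mce = calibration_errors([0.9, 0.9, 0.9, 0.9, 0.6, 0.6], [1, 1, 1, 0, 1, 0])
print(f"ECE={ece:.3f}  MCE={mce:.3f}")
```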

AWARE: Adaptive Whole-body Active Rotating Control for Enhanced LiDAR-Inertial Odometry under Human-in-the-Loop Interaction

AWARE：在人机交互下增强激光雷达惯性里程计的自适应全身主动旋转控制

Authors: Yizhe Zhang, Jianping Li, Liangliang Yin, Zhen Dong, Bisheng Yang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.10598
Pdf link: https://arxiv.org/pdf/2604.10598
Abstract Human-in-the-loop (HITL) UAV operation is essential in complex and safety-critical aerial surveying environments, where human operators provide navigation intent while onboard autonomy must maintain accurate and robust state estimation. A key challenge in this setting is that resource-constrained UAV platforms are often limited to narrow-field-of-view LiDAR sensors. In geometrically degenerate or feature-sparse scenes, limited sensing coverage often weakens LiDAR Inertial Odometry (LIO)'s observability, causing drift accumulation, degraded geometric accuracy, and unstable state estimation, which directly compromise safe and effective HITL operation and the reliability of downstream surveying products. To overcome this limitation, we present AWARE, a bio-inspired whole-body active yawing framework that exploits the UAV's own rotational agility to extend the effective sensor horizon and improve LIO's observability without additional mechanical actuation. The core of AWARE is a differentiable Model Predictive Control (MPC) framework embedded in a Reinforcement Learning (RL) loop. It first identifies the viewing direction that maximizes information gain across the full yaw space, and a lightweight RL agent then adjusts the MPC cost weights online according to the current environmental context, enabling an adaptive balance between estimation accuracy and flight stability. A Safe Flight Corridor mechanism further ensures operational safety within this HITL paradigm by decoupling the operator's navigational intent from autonomous yaw optimization to enable safe and efficient cooperative control. We validate AWARE through extensive experiments in diverse simulated and real-world environments.
中文摘要 人机在环（HITL）无人机操作在复杂且安全关键的空中测量环境中至关重要，人类操作员提供导航意图，而机载自主系统则必须保持准确且稳健的状态估计。在这一场景下，一个关键挑战是资源受限的无人机平台通常只能使用视场狭窄的激光雷达（LiDAR）传感器。在几何退化或特征稀疏的场景中，有限的传感覆盖常削弱LiDAR惯性里程计（LIO）的可观测性，导致漂移积累、几何精度下降和状态估计不稳定，直接影响HITL的安全有效运行以及下游测量产品的可靠性。为克服这一限制，我们提出了AWARE，一种仿生的全身主动偏航框架，利用无人机自身的旋转灵活性扩展有效传感器视野，提升LIO的可观测性，无需额外机械驱动。AWARE的核心是一个嵌入强化学习（RL）循环中的可微分模型预测控制（MPC）框架。它首先确定在整个偏航空间中最大化信息增益的观察方向，然后由轻量级强化学习智能体根据当前环境上下文在线调整MPC成本权重，实现估计精度和飞行稳定性之间的自适应平衡。安全飞行走廊机制通过将操作员的导航意图与自主偏航优化解耦，进一步确保该HITL范式下的操作安全，从而实现安全高效的协同控制。我们通过在多种仿真和真实环境中的大量实验验证了AWARE。

On the Optimization Landscape of Observer-based Dynamic Linear Quadratic Control

基于观测器的动态线性二次控制的优化景观

Authors: Jingliang Duan, Jie Li, Yinsong Ma, Liye Tang, Guofa Li, Liping Zhang, Shengbo Eben Li, Lin Zhao
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2604.10635
Pdf link: https://arxiv.org/pdf/2604.10635
Abstract Understanding the optimization landscape of linear quadratic regulation (LQR) problems is fundamental to the design of efficient reinforcement learning solutions. Recent work has made significant progress in characterizing the landscape of static output-feedback control and linear quadratic Gaussian (LQG) control. For LQG, much of the analysis leverages the separation principle, which allows the controller and estimator to be designed independently. However, this simplification breaks down when the gradients with respect to the estimator and controller parameters are inherently coupled, leading to a more intricate analysis. This paper investigates the optimization landscape of observer-based dynamic output-feedback control of LQR problems. We derive the optimal observer-controller pair in settings where transient quadratic performance cannot be neglected. Our analysis reveals that, in general, the combination of the standard LQR controller and the observer that minimizes the trace of the accumulated estimation error covariance does not correspond to a stationary point of the overall closed-loop performance objective. Moreover, we derive a pair of discrete-time Sylvester equations with symmetric structure, both involving the same set of matrix elements, that characterize the stationary point of the observer-based dynamic LQR problem. These equations offer analytical insight into the structure of the optimality conditions and provide a foundation for developing numerical policy gradient methods aimed at learning complex controllers that rely on reconstructed state information.
中文摘要 理解线性二次调控（LQR）问题的优化格局对于设计高效强化学习解决方案至关重要。近期工作在表征静态输出反馈控制和线性二次高斯（LQG）控制的格局方面取得了显著进展。对于LQG，许多分析利用了分离原理，使控制器和估计器可以独立设计。然而，当估计器和控制器参数的梯度本身耦合时，这种简化就失效了，导致分析更加复杂。本文探讨了基于观察者的动态输出反馈控制LQR问题的优化格局。在瞬态二次性能不可忽视的条件下，我们推导出最优观察者-控制者对。我们的分析表明，通常标准LQR控制器与最小化累计估计误差协方差迹的观察者组合，并不对应于整体闭环性能目标的平稳点。此外，我们推导出一对具有对称结构的离散时间Sylvester方程，两者都涉及相同的矩阵元素集合，这些方程刻画了基于观察者的动态LQR问题的驻点。这些方程为最优条件结构提供了分析见解，并为开发数值策略梯度方法奠定基础，旨在学习依赖重建状态信息的复杂控制器。

Preference-Agile Multi-Objective Optimization for Real-time Vehicle Dispatching

实时车辆调度的偏好敏捷多目标优化

Authors: Jiahuan Jin, Wenhao Zhao, Rong Qu, Jianfeng Ren, Xinan Chen, Qingfu Zhang, Ruibin Bai
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.10664
Pdf link: https://arxiv.org/pdf/2604.10664
Abstract Multi-objective optimization (MOO) has been widely studied in the literature because of its versatility in human-centered decision making in real-life applications. Recently, demand for dynamic MOO has been emerging fast due to tough market dynamics that require real-time re-adjustment of priorities among objectives. However, most existing studies focus either on deterministic MOO problems, which are not practical, or on non-sequential dynamic MOO decision problems, which cannot deal with some real-life complexities. To address these challenges, a preference-agile multi-objective optimization (PAMOO) is proposed in this paper to permit users to dynamically adjust and interactively assign preferences on the fly. To achieve this, a novel uniform model within a deep reinforcement learning (DRL) framework is proposed that can explicitly take users' dynamic preference vectors as inputs. Additionally, a calibration function is fitted to ensure high-quality alignment between the preference vector inputs and the output DRL decision policy. Extensive experiments on challenging real-life vehicle dispatching problems at a container terminal showed that PAMOO obtains superior performance and generalization ability when compared with the two most popular MOO methods. Our method presents the first dynamic MOO method for challenging dynamic sequential MOO decision problems.
中文摘要 多目标优化（MOO）因其在现实应用中以人为本的决策方面的通用性，在文献中被广泛研究。近年来，由于严峻的市场动态要求实时重新调整不同目标的优先级，对动态MOO的需求迅速增长。然而，大多数现有研究要么聚焦于不切实际的确定性MOO问题，要么聚焦于无法处理某些现实复杂性的非序贯动态MOO决策问题。为应对这些挑战，本文提出了一种偏好敏捷多目标优化（PAMOO），允许用户动态调整并实时交互地指定偏好。为此，提出了一种在深度强化学习（DRL）框架内的新颖统一模型，可以显式地以用户的动态偏好向量为输入。此外，还拟合了一个校准函数，以确保偏好向量输入与输出的DRL决策策略之间的高质量对齐。在集装箱码头具有挑战性的现实车辆调度问题上进行的大量实验表明，与两种最流行的MOO方法相比，PAMOO获得了更优的性能和泛化能力。我们的方法是首个面向具有挑战性的动态序贯MOO决策问题的动态MOO方法。

Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents

Skill-SD：多回合大型语言模型代理的技能条件自蒸馏

Authors: Hao Wang, Guozhi Wang, Han Xiao, Yufeng Zhou, Yue Pan, Jichao Wang, Ke Xu, Yafei Wen, Xiaohu Ruan, Xiaoxin Chen, Honggang Qi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.10674
Pdf link: https://arxiv.org/pdf/2604.10674
Abstract Reinforcement learning (RL) has been widely used to train LLM agents for multi-turn interactive tasks, but its sample efficiency is severely limited by sparse rewards and long horizons. On-policy self-distillation (OPSD) alleviates this by providing dense token-level supervision from a privileged teacher that has access to ground-truth answers. However, such fixed privileged information cannot capture the diverse valid strategies in agent tasks, and naively combining OPSD with RL often leads to training collapse. To address these limitations, we introduce Skill-SD, a framework that turns the agent's own trajectories into dynamic training-only supervision. Completed trajectories are summarized into compact natural language skills that describe successful behaviors, mistakes, and workflows. These skills serve as dynamic privileged information conditioning only the teacher, while the student always acts under the plain task prompt and learns to internalize the guidance through distillation. To stabilize the training, we derive an importance-weighted reverse-KL loss to provide gradient-correct token-level distillation, and dynamically synchronize the teacher with the improving student. Experimental results on agentic benchmarks demonstrate that Skill-SD substantially outperforms the standard RL baseline, improving both vanilla GRPO (+14.0%/+10.9% on AppWorld/Sokoban) and vanilla OPD (+42.1%/+40.6%).
中文摘要 强化学习（RL）已被广泛用于训练LLM智能体完成多轮交互任务，但其样本效率因奖励稀疏和任务时程长而受到严重限制。同策略自蒸馏（OPSD）通过由可访问真实答案的特权教师提供密集的token级监督，缓解了这一问题。然而，这种固定的特权信息无法涵盖智能体任务中多样的有效策略，而简单地将OPSD与强化学习结合往往会导致训练崩溃。为解决这些限制，我们提出了Skill-SD框架，将智能体自身的轨迹转化为仅在训练时使用的动态监督。已完成的轨迹被总结为简洁的自然语言技能，描述成功的行为、错误和工作流程。这些技能作为动态特权信息，仅作为教师的条件输入，而学生始终在普通任务提示下行动，并通过蒸馏学习内化这些指导。为稳定训练，我们推导出一种重要性加权的反向KL损失，以提供梯度正确的token级蒸馏，并动态地将教师与不断进步的学生同步。在智能体基准测试上的实验结果显示，Skill-SD显著优于标准强化学习基线，同时提升了原始GRPO（在AppWorld/Sokoban上分别为+14.0%/+10.9%）和原始OPD（+42.1%/+40.6%）。

FedRio: Personalized Federated Social Bot Detection via Cooperative Reinforced Contrastive Adversarial Distillation

FedRio：通过协作式强化对比对抗蒸馏实现个性化联邦社交机器人检测

Authors: Yingguang Yang, Hao Liu, Xin Zhang, Yunhui Liu, Yutong Xia, Qi Wu, Hao Peng, Taoran Liang, Bin Chong, Tieke He, Philip S. Yu
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.10678
Pdf link: https://arxiv.org/pdf/2604.10678
Abstract Social bot detection is critical to the stability and security of online social platforms. However, current state-of-the-art bot detection models are largely developed in isolation, overlooking the benefits of leveraging shared detection patterns across platforms to improve performance and promptly identify emerging bot variants. The heterogeneity of data distributions and model architectures further complicates the design of an effective cross-platform and cross-model detection framework. To address these challenges, we propose the FedRio (Personalized Federated Social Bot Detection with Cooperative Reinforced Contrastive Adversarial Distillation) framework. We first introduce an adaptive message-passing module as the graph neural network backbone for each client. To facilitate efficient knowledge sharing of global data distributions, we design a federated knowledge extraction mechanism based on generative adversarial networks. Additionally, we employ a multi-stage adversarial contrastive learning strategy to enforce feature space consistency among clients and reduce divergence between local and global models. Finally, we adopt adaptive server-side parameter aggregation and reinforcement learning-based client-side parameter control to better accommodate data heterogeneity in heterogeneous federated settings. Extensive experiments on two real-world social bot detection benchmarks demonstrate that FedRio consistently outperforms state-of-the-art federated learning baselines in detection accuracy, communication efficiency, and feature space consistency, while remaining competitive with published centralized results under substantially stronger privacy constraints.
中文摘要 社交机器人检测对于在线社交平台的稳定性和安全至关重要。然而，目前最先进的机器人检测模型大多是孤立开发，忽视了利用跨平台共享检测模式提升性能和及时识别新兴机器人变种的好处。数据分布和模型架构的异质性进一步复杂化了有效跨平台和跨模型检测框架的设计。为应对这些挑战，我们提出了FedRio（个性化联邦社交机器人检测，结合协作强化对比对抗蒸馏框架）。我们首先引入一个自适应消息传递模块，作为每个客户端的图神经网络骨干。为促进全球数据分布的高效知识共享，我们设计了基于生成对抗网络的联邦知识提取机制。此外，我们采用多阶段对抗对比学习策略，以在客户之间强制特征空间一致性，减少局部与全局模型之间的差异。最后，我们采用了自适应的服务器端参数聚合和基于强化学习的客户端参数控制，以更好地适应异构联邦环境中的数据异质性。在两个真实世界社交机器人检测基准测试上的大量实验表明，FedRio在检测准确性、通信效率和特征空间一致性方面始终优于最先进的联邦学习基线，同时在更严格的隐私约束下仍能与已发布的集中研究结果竞争。

SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting

SCOPE：采用双路径自适应加权的信号校准同策略蒸馏增强

Authors: Binbin Zheng, Xing Ma, Yiheng Liang, Jingqing Ruan, Xiaoliang Fu, Kepeng Lin, Benchang Zhu, Ke Zeng, Xunliang Cai
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.10688
Pdf link: https://arxiv.org/pdf/2604.10688
Abstract On-policy reinforcement learning has become the dominant paradigm for reasoning alignment in large language models, yet its sparse, outcome-level rewards make token-level credit assignment notoriously difficult. On-Policy Distillation (OPD) alleviates this by introducing dense, token-level KL supervision from a teacher model, but typically applies this supervision uniformly across all rollouts, ignoring fundamental differences in signal quality. We propose Signal-Calibrated On-Policy Distillation Enhancement (SCOPE), a dual-path adaptive training framework that routes on-policy rollouts by correctness into two complementary supervision paths. For incorrect trajectories, SCOPE performs teacher-perplexity-weighted KL distillation to prioritize instances where the teacher demonstrates genuine corrective capability, while down-weighting unreliable guidance. For correct trajectories, it applies student-perplexity-weighted MLE to concentrate reinforcement on low-confidence samples at the capability boundary rather than over-reinforcing already mastered ones. Both paths employ a group-level normalization to adaptively calibrate weight distributions, accounting for the intrinsic difficulty variance across prompts. Extensive experiments on six reasoning benchmarks show that SCOPE achieves an average relative improvement of 11.42% in Avg@32 and 7.30% in Pass@32 over competitive baselines, demonstrating its consistent effectiveness.
中文摘要 同策略（on-policy）强化学习已成为大型语言模型推理对齐的主导范式，但其稀疏的结果级奖励使得词元级信用分配异常困难。同策略蒸馏（OPD）通过引入来自教师模型的密集词元级 KL 监督来缓解这一问题，但通常对所有采样轨迹（rollout）统一施加这种监督，忽略了信号质量的根本差异。我们提出了信号校准同策略蒸馏增强（SCOPE），这是一种双路径自适应训练框架，按正确性将同策略采样轨迹分流到两条互补的监督路径。对于错误的轨迹，SCOPE进行教师困惑度加权的KL蒸馏，优先考虑教师展现出真实纠正能力的实例，同时降低不可靠指导的权重。对于正确的轨迹，它应用学生困惑度加权的最大似然估计（MLE），将强化集中在能力边界上的低置信样本，而不是过度强化已掌握的样本。两条路径都采用组级归一化来自适应校准权重分布，以考虑不同提示之间内在的难度差异。在六个推理基准上的大量实验显示，SCOPE相较于有竞争力的基线，在Avg@32上平均相对提升11.42%，在Pass@32上平均相对提升7.30%，证明了其一致的有效性。
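
The dual-path routing described in the SCOPE abstract can be illustrated with a small sketch. The abstract does not give the weighting formulas, so everything here is an assumption for illustration: the inverse-perplexity form for the teacher path, the raw-perplexity form for the student path, the "weights average to 1 within a group" normalization, and the names `group_normalize` and `scope_weights`.

```python
from typing import List, Tuple


def group_normalize(raw: List[float]) -> List[float]:
    """Rescale raw weights so they average to 1 within one prompt's group."""
    total = sum(raw)
    if total == 0.0:
        return [1.0] * len(raw)
    return [len(raw) * r / total for r in raw]


def scope_weights(
    correct: List[bool],
    teacher_ppl: List[float],
    student_ppl: List[float],
) -> Tuple[List[float], List[float]]:
    """Return (kl_weights, mle_weights) for one group of rollouts.

    Incorrect rollouts receive a KL-distillation weight that grows as the
    teacher's perplexity falls (assumed form: 1 / ppl). Correct rollouts
    receive an MLE weight that grows with the student's perplexity
    (assumed form: ppl). Each rollout has a zero weight on the other path.
    """
    wrong_idx = [i for i, ok in enumerate(correct) if not ok]
    right_idx = [i for i, ok in enumerate(correct) if ok]
    kl = [0.0] * len(correct)
    mle = [0.0] * len(correct)
    if wrong_idx:
        normed = group_normalize([1.0 / teacher_ppl[i] for i in wrong_idx])
        for i, w in zip(wrong_idx, normed):
            kl[i] = w
    if right_idx:
        normed = group_normalize([student_ppl[i] for i in right_idx])
        for i, w in zip(right_idx, normed):
            mle[i] = w
    return kl, mle
```

In this sketch each path is normalized separately within the group, so a hard prompt with mostly incorrect rollouts does not shrink the weights of its few correct ones.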

Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning

让价值模型回归：用于LLM强化学习中价值建模的生成式评论家

Authors: Zikang Shan, Han Zhong, Liwei Wang, Li Zhao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.10701
Pdf link: https://arxiv.org/pdf/2604.10701
Abstract Credit assignment is a central challenge in reinforcement learning (RL). Classical actor-critic methods address this challenge through fine-grained advantage estimation based on a learned value function. However, learned value models are often avoided in modern large language model (LLM) RL because conventional discriminative critics are difficult to train reliably. We revisit value modeling and argue that this difficulty is partly due to limited expressiveness. In particular, representation complexity theory suggests that value functions can be hard to approximate under the one-shot prediction paradigm used by existing value models, and our scaling experiments show that such critics do not improve reliably with scale. Motivated by this observation, we propose Generative Actor-Critic (GenAC), which replaces one-shot scalar value prediction with a generative critic that performs chain-of-thought reasoning before producing a value estimate. We further introduce In-Context Conditioning, which helps the critic remain calibrated to the current actor throughout training. GenAC improves value approximation, ranking reliability, and out-of-distribution generalization, and these gains translate into stronger downstream RL performance than both value-based and value-free baselines. Overall, our results suggest that stronger value modeling is a promising direction for improving credit assignment in LLM reinforcement learning.
中文摘要 信用分配是强化学习（RL）中的一个核心挑战。经典的演员-评论家（actor-critic）方法通过基于学习到的价值函数的细粒度优势估计来应对这一挑战。然而，在现代大型语言模型（LLM）强化学习中，学习式价值模型常被回避，因为传统的判别式评论家难以可靠训练。我们重新审视价值建模，认为这种困难部分源于表达能力有限。特别是，表示复杂性理论表明，在现有价值模型所用的一次性预测范式下，价值函数可能难以近似，而我们的规模扩展实验表明，这类评论家并不会随规模增大而可靠提升。基于这一观察，我们提出了生成式演员-评论家（GenAC），它用生成式评论家取代一次性标量价值预测，该评论家在给出价值估计之前先进行思维链推理。我们还引入了上下文条件化（In-Context Conditioning），帮助评论家在整个训练过程中保持对当前演员的校准。GenAC提升了价值近似、排序可靠性和分布外泛化能力，这些提升转化为比基于价值和无价值的基线更强的下游强化学习表现。总体来看，我们的结果表明，更强的价值建模是改进LLM强化学习中信用分配的一个有前景的方向。

Learning Preference-Based Objectives from Clinical Narratives for Sequential Treatment Decision-Making

从临床叙事中学习基于偏好的目标，以实现顺序治疗决策

Authors: Daniel J. Tan, Kay Choong See, Mengling Feng
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.10783
Pdf link: https://arxiv.org/pdf/2604.10783
Abstract Designing reward functions remains a central challenge in reinforcement learning (RL) for healthcare, where outcomes are sparse, delayed, and difficult to specify. While structured data capture physiological states, they often fail to reflect the overall quality of a patient's clinical trajectory, including recovery dynamics, treatment burden, and stability. Clinical narratives, in contrast, summarize longitudinal reasoning and implicitly encode evaluations of treatment effectiveness. We propose Clinical Narrative-informed Preference Rewards (CN-PR), a framework for learning reward functions directly from discharge summaries by treating them as scalable supervision for trajectory-level preferences. Using a large language model, we derive trajectory quality scores (TQS) and construct pairwise preferences over patient trajectories, enabling reward learning via a structured preference-based objective. To account for variability in narrative informativeness, we incorporate a confidence signal that weights supervision based on its relevance to the decision-making task. The learned reward aligns strongly with trajectory quality (Spearman rho = 0.63) and enables policies that are consistently associated with improved recovery-related outcomes, including increased organ support-free days and faster shock resolution, while maintaining comparable performance on mortality. These effects persist under external validation. Our results demonstrate that narrative-derived supervision provides a scalable and expressive alternative to handcrafted or outcome-based reward design for dynamic treatment regimes.
中文摘要 设计奖励函数仍然是医疗强化学习（RL）中的核心挑战，因为结果稀疏、延迟且难以明确说明。虽然结构化数据能够捕捉生理状态，但往往无法反映患者临床轨迹的整体质量，包括恢复动态、治疗负担和稳定性。相比之下，临床叙述总结纵向推理，隐含对治疗效果的评估。我们提出了临床叙事知情偏好奖励（CN-PR）框架，通过将出院摘要视为轨迹层级偏好的可扩展监督，直接学习奖励函数。利用大型语言模型，我们推导轨迹质量评分（TQS），并构建患者轨迹的成对偏好，实现基于偏好的结构化目标奖励学习。为考虑叙述信息量的变异性，我们采用信心信号，根据监督与决策任务的相关性加权。学习奖励与轨迹质量高度一致（Spearman rho = 0.63），并使政策能够持续与改善康复相关结果相关，包括增加无器官支持天数和更快的休克缓解，同时保持死亡率的可比表现。这些效应在外部验证下依然存在。我们的结果表明，叙事衍生督导为动态治疗方案提供了一种可扩展且富有表现力的替代选择，替代手工设计或基于结果的奖励设计。
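
The CN-PR abstract describes building pairwise preferences from trajectory quality scores and weighting supervision by a confidence signal. A minimal sketch of that idea follows, assuming a Bradley-Terry style objective (the abstract only says "structured preference-based objective", so the loss form is an assumption) and hypothetical helper names `preference_pairs` and `weighted_preference_loss`.

```python
import math
from typing import List, Tuple


def preference_pairs(scores: List[float]) -> List[Tuple[int, int]]:
    """All (preferred, rejected) index pairs where the first score is strictly higher."""
    n = len(scores)
    return [(i, j) for i in range(n) for j in range(n) if scores[i] > scores[j]]


def weighted_preference_loss(pairs: List[Tuple[float, float, float]]) -> float:
    """Confidence-weighted mean of -log sigmoid(r_preferred - r_rejected).

    Each item is (reward_preferred, reward_rejected, confidence), with the
    confidence acting as a non-negative weight on that pair.
    """
    total_weight = sum(conf for _, _, conf in pairs)
    if total_weight == 0.0:
        return 0.0
    loss = 0.0
    for r_pref, r_rej, conf in pairs:
        margin = r_pref - r_rej
        # -log(sigmoid(m)) = log(1 + exp(-m)), written in a numerically stable form
        nll = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
        loss += conf * nll
    return loss / total_weight
```

Pairs with zero confidence drop out of the loss entirely, which is one simple way to keep uninformative narratives from steering the learned reward.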

TInR: Exploring Tool-Internalized Reasoning in Large Language Models

TInR：探索大型语言模型中的工具内化推理

Authors: Qiancheng Xu, Yongqi Li, Fan Liu, Hongru Wang, Min Yang, Wenjie Li
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.10788
Pdf link: https://arxiv.org/pdf/2604.10788
Abstract Tool-Integrated Reasoning (TIR) has emerged as a promising direction by extending Large Language Models' (LLMs) capabilities with external tools during reasoning. Existing TIR methods typically rely on external tool documentation during reasoning. However, this leads to tool mastery difficulty, tool size constraints, and inference inefficiency. To mitigate these issues, we explore Tool-Internalized Reasoning (TInR), aiming at facilitating reasoning with tool knowledge internalized into LLMs. Achieving this goal presents notable requirements, including tool internalization and tool-reasoning coordination. To address them, we propose TInR-U, a tool-internalized reasoning framework for unified reasoning and tool usage. TInR-U is trained through a three-phase pipeline: 1) tool internalization with a bidirectional knowledge alignment strategy; 2) supervised fine-tuning warm-up using high-quality reasoning annotations, and 3) reinforcement learning with TInR-specific rewards. We comprehensively evaluate our method across in-domain and out-of-domain settings. Experiment results show that TInR-U achieves superior performance in both settings, highlighting its effectiveness and efficiency.
中文摘要 工具集成推理（TIR）通过在推理过程中借助外部工具扩展大型语言模型（LLM）的能力，已成为一个有前景的方向。现有的TIR方法通常在推理时依赖外部工具文档。然而，这会导致工具掌握困难、工具规模受限和推理效率低下。为缓解这些问题，我们探索了工具内化推理（TInR），旨在借助内化到LLM中的工具知识来促进推理。实现这一目标提出了显著的要求，包括工具内化和工具与推理的协调。为此，我们提出了TInR-U，一个用于统一推理与工具使用的工具内化推理框架。TInR-U 通过三阶段流程进行训练：1）采用双向知识对齐策略的工具内化；2）使用高质量推理标注进行监督微调热身；3）采用TInR专属奖励的强化学习。我们在域内和域外设置下全面评估了该方法。实验结果显示，TInR-U在两种设置下均取得了更优的性能，彰显了其有效性和效率。

Advancing Polish Language Modeling through Tokenizer Optimization in the Bielik v3 7B and 11B Series

通过分词器优化推动Bielik v3 7B和11B系列中波兰语言建模的发展

Authors: Krzysztof Ociepa, Łukasz Flis, Remigiusz Kinas, Krzysztof Wróbel, Adrian Gwoździej
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.10799
Pdf link: https://arxiv.org/pdf/2604.10799
Abstract The development of the Bielik v3 PL series, encompassing both the 7B and 11B parameter variants, represents a significant milestone in the field of language-specific large language model (LLM) optimization. While general-purpose models often demonstrate impressive multilingual capabilities, they frequently suffer from a fundamental architectural inefficiency: the use of universal tokenizers. These tokenizers, typically designed to cover a broad spectrum of languages, often fail to capture the morphological nuances of specific languages like Polish, leading to higher fertility ratios, increased inference costs, and restricted effective context windows. This report details the transition from the universal Mistral-based tokenization to a dedicated Polish-optimized vocabulary for the Bielik v3 models, exploring the FOCUS-based embedding initialization, the multi-stage pretraining curriculum, and the subsequent post-training alignment involving Supervised Fine-Tuning, Direct Preference Optimization, and Reinforcement Learning through Group Relative Policy Optimization with verifiable rewards.
中文摘要 Bielik v3 PL 系列的开发涵盖 7B 和 11B 两种参数规模，是特定语言大型语言模型（LLM）优化领域的一个重要里程碑。虽然通用模型常常展现出令人印象深刻的多语言能力，但它们往往存在一个根本的架构低效问题：使用通用分词器。这些分词器通常为覆盖广泛的语言而设计，往往无法捕捉波兰语等特定语言的形态学细节，导致更高的词元繁殖率（fertility，即每个词被切分成的平均词元数）、更高的推理成本以及受限的有效上下文窗口。本报告详细介绍了Bielik v3模型从基于Mistral的通用分词向专门针对波兰语优化的词表的过渡，探讨了基于FOCUS的嵌入初始化、多阶段预训练课程，以及随后的训练后对齐，包括监督微调、直接偏好优化，以及通过带可验证奖励的群体相对策略优化进行的强化学习。
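
The "fertility ratio" mentioned in the Bielik abstract is commonly measured as the average number of tokens a tokenizer produces per word. A minimal sketch of that measurement follows; `chunk_tokenizer` is a hypothetical stand-in that cuts words into fixed-size pieces and is not the Bielik or Mistral tokenizer.

```python
from typing import Callable, List


def fertility(text: str, tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(tokenize(word)) for word in words) / len(words)


def chunk_tokenizer(size: int) -> Callable[[str], List[str]]:
    """Hypothetical stand-in tokenizer: cuts each word into fixed-size pieces."""
    def tokenize(word: str) -> List[str]:
        return [word[i:i + size] for i in range(0, len(word), size)]
    return tokenize
```

Passing the same Polish text through a generic tokenizer and a language-specific one and comparing the two `fertility` values is the kind of comparison the abstract alludes to: lower fertility means fewer tokens per word, hence cheaper inference and more text per context window.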

Adaptive Bounded-Rationality Modeling of Early-Stage Takeover in Shared-Control Driving

共享控制驾驶中早期接管的自适应有限理性建模

Authors: Jian Sun, Xiyan Jiang, Xiaocong Zhao, Jie Wang, Peng Hang, Zirui Li
Subjects: Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2604.10806
Pdf link: https://arxiv.org/pdf/2604.10806
Abstract Human drivers' control quality in the first seconds after a handover is critical to shared-driving safety; potentially unsafe steering or pedal inputs therefore require detection and correction by the automated vehicle's safety-fallback system. Yet performance in this window is vulnerable because cognitive states fluctuate rapidly, causing purely rationality-driven, cognition-unaware models to miss early control dynamics. We present an interpretable driver model grounded in bounded rationality with online adaptation that predicts early-stage control quality. We encode boundedness by embedding cognitive constraints in reinforcement learning and adapt latent cognitive parameters in real time via particle filtering from observations of driver actions. In a vehicle-in-the-loop study (n=41), we evaluated predictive performance and physiological validity. The adaptive model not only anticipated hazardous takeovers with higher coverage and longer lead times than non-adaptive baselines but also demonstrated strong alignment between inferred cognitive parameters and real-time eye-tracking metrics. These results confirm that the model captures genuine fluctuations in driver risk perception, enabling timely and cognitively grounded assistance.
中文摘要 人工驾驶员在交接后的最初几秒钟内对共享驾驶安全至关重要;因此，潜在不安全的转向或踏板操作需要自动驾驶车的安全回退系统进行检测和纠正。然而，这一窗口内的表现存在脆弱性，因为认知状态波动迅速，导致纯理性驱动、认知意识不足的模型错过早期控制动态。我们提出了一个基于有限理性、在线适应的可解释驱动模型，预测早期控制质量。我们通过在强化学习中嵌入认知约束来编码界限，并通过从驾驶员动作观察中进行粒子滤波，实时调整潜在认知参数。在一项车辆在环研究（n=41）中，我们评估了预测性能和生理效度。自适应模型不仅预见了比非自适应基线更广泛覆盖率和更长交期的危险接管，还展示了推断认知参数与实时眼动追踪指标之间的强烈匹配。这些结果证实模型捕捉了驾驶员风险感知的真实波动，从而能够及时且基于认知提供帮助。

PokeRL: Reinforcement Learning for Pokemon Red

PokeRL：宝可梦红的强化学习

Authors: Dheeraj Mudireddy, Sai Patibandla
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.10812
Pdf link: https://arxiv.org/pdf/2604.10812
Abstract Pokemon Red is a long-horizon JRPG with sparse rewards, partial observability, and quirky control mechanics that make it a challenging benchmark for reinforcement learning. While recent work has shown that PPO agents can clear the first two gyms using heavy reward shaping and engineered observations, training remains brittle in practice, with agents often degenerating into action loops, menu spam, or unproductive wandering. In this paper, we present PokeRL, a modular system that trains deep reinforcement learning agents to complete early game tasks in Pokemon Red, including exiting the player's house, exploring Pallet Town to reach tall grass, and winning the first rival battle. Our main contributions are a loop-aware environment wrapper around the PyBoy emulator with map masking, a multi-layer anti-loop and anti-spam mechanism, and a dense hierarchical reward design. We argue that practical systems like PokeRL, which explicitly model failure modes such as loops and spam, are a necessary intermediate step between toy benchmarks and full Pokemon League champion agents. Code is available at this https URL
中文摘要 《宝可梦红》是一款远视线日式角色扮演游戏，奖励稀少，可观察性有限，操作机制古怪，使其成为强化学习的挑战标杆。尽管最新研究表明，PPO特工可以通过重度奖励塑造和工程观察来清除前两个道馆，但训练在实际中仍然脆弱，特工常常陷入动作循环、菜单刷屏或无效游走。本文介绍了PokeRL，一个模块化系统，训练深度强化学习代理完成《宝可梦红》中的早期任务，包括离开玩家家、探索真新镇以到达高草丛以及赢得第一场对手战斗。我们的主要贡献包括环路感知环境包裹 PyBoy 模拟器，带有映射遮罩功能，多层反循环和反垃圾邮件机制，以及密集的层级奖励设计。我们认为，像PokeRL这样明确模拟循环和垃圾信息等故障模式的实用系统，是玩具基准测试和完整宝可梦联盟冠军代理之间的必要中间步骤。代码可在此 https URL 获取

CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms

CheeseBench：评估啮齿动物行为神经科学范式中的大型语言模型

Authors: Zacharie Bugaud
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.10825
Pdf link: https://arxiv.org/pdf/2604.10825
Abstract We introduce CheeseBench, a benchmark that evaluates large language models (LLMs) on nine classical behavioral neuroscience paradigms (Morris water maze, Barnes maze, T-maze, radial arm maze, star maze, operant chamber, shuttle box, conditioned place preference, and delayed non-match to sample), spanning six cognitive dimensions. Each task is grounded in peer-reviewed rodent protocols with approximate animal baselines. The agent receives a unified system prompt with no task-specific instructions and must discover goals purely from ASCII text observations and reward signals, much like a rodent placed into an unfamiliar apparatus. We evaluate six open-weight LLMs (3B to 72B parameters) on text-based ASCII renderings and compare against both a random baseline and a graph-based reinforcement learning agent. Our best model (Qwen2.5-VL-7B) reaches 52.6% average success on ASCII input, compared to 32.1% for random agents and 78.9% for approximate rodent baselines. We find that (1) scaling beyond 7B yields diminishing returns, (2) longer context history degrades performance, (3) chain-of-thought prompting hurts rather than helps, and (4) a vision-language architecture provides an advantage at 7B but hurts at 32B. Because the same model's performance ranges from 20% to 57% depending on interface parameters alone, these results characterize the agent-plus-interface system, not the model in isolation. Under this unified zero-shot ASCII protocol, current open-weight LLM agents remain well below approximate rodent reference values, particularly on tasks requiring spatial navigation and within-trial state tracking.
中文摘要 我们介绍了CheeseBench，这是一个基于九个经典行为神经科学范式（莫里斯水迷宫、巴恩斯迷宫、T型迷宫、径向臂迷宫、星形迷宫、操作舱、穿梭箱、条件位置偏好和延迟不匹配样本）的基准测试，涵盖六个认知维度。每个任务都基于同行评审的啮齿动物方案，并以动物为基准。代理会收到一个统一的系统提示，没有任务特定的指令，必须完全通过ASCII文本观察和奖励信号来发现目标，就像啮齿动物被放入陌生装置一样。我们评估了六个开权重LLM（3B至72B参数）在基于文本的ASCII渲染上，并与随机基线和基于图的强化学习代理进行比较。我们的最佳模型（Qwen2.5-VL-7B）在ASCII输入下的平均成功率为52.6%，而随机代理为32.1%，近似啮齿动物基线为78.9%。我们发现：（1）超过7B的扩展带来收益递减，（2）更长的上下文历史降低了性能，（3）连贯的思维提示反而有害而非帮助，（4）视觉语言架构在7B有优势，但在32B则有害。由于同一模型的性能在接口参数下介于20%到57%之间，这些结果描述的是代理加接口系统，而非模型本身。在这一统一零点ASCII协议下，当前的开放权重LLM代理在需要空间导航和试验内状态跟踪的任务中，仍远低于近似啮齿动物参考值。

EvoNash-MARL: A Closed-Loop Multi-Agent Reinforcement Learning Framework for Medium-Horizon Equity Allocation

EvoNash-MARL：一个用于中期股权配置的闭环多智能体强化学习框架

Authors: Chongliu Jia, Yi Luo, Sipeng Han, Pengwei Li, Jie Ding, Youshuang Hu, Yimiao Qian, Qiya Wang
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.10911
Pdf link: https://arxiv.org/pdf/2604.10911
Abstract Medium-to-long-horizon stock allocation presents significant challenges due to weak predictive structures, non-stationary market regimes, and the degradation of signals following the application of transaction costs, capacity limits, and tail-risk constraints. Conventional approaches commonly rely on a single predictor or a loosely coupled prediction-to-allocation pipeline, limiting robustness under distribution shift. This work addresses a targeted design question: whether coupling reinforcement learning (RL), multi-agent policy populations, Policy-Space Response Oracle (PSRO)-style aggregation, league best-response training, evolutionary replacement, and execution-aware checkpoint selection within a unified walk-forward loop improves allocator robustness at medium to long horizons. The proposed framework, EvoNash-MARL, integrates these components within an execution-aware allocation loop and further introduces a layered policy architecture comprising a direction head and a risk head, nonlinear signal enhancement, feature-quality reweighting, and constraint-aware checkpoint selection. Under a 120-window walk-forward protocol, the resolved v21 configuration achieves mean excess Sharpe 0.7600 and robust score -0.0203, ranking first among internal controls; on aligned daily out-of-sample returns from 2014-01-02 to 2024-01-05, it delivers 19.6% annualized return versus 11.7% for SPY, and in an extended walk-forward evaluation through 2026-02-10 it delivers 20.5% versus 13.5%. The framework maintains positive performance under realistic stress constraints and exhibits structured cross-market generalization; however, global strong significance under White's Reality Check (WRC) and SPA-lite testing is not established. Therefore, the results are presented as evidence supporting a more stable medium- to long-horizon training and selection paradigm, rather than as proof of universally superior market-timing performance.
中文摘要 中长期股票配置面临重大挑战，原因在于预测结构较弱、市场状态非平稳，以及在施加交易成本、容量限制和尾部风险约束后信号发生衰减。传统方法通常依赖单一预测器或松散耦合的预测到配置流程，限制了分布偏移下的鲁棒性。本研究针对一个具体的设计问题：在统一的前向滚动（walk-forward）循环中耦合强化学习（RL）、多智能体策略群体、策略空间响应预言机（PSRO）式聚合、联盟最佳响应训练、进化替换以及执行感知的检查点选择，是否能提升中长期配置器的鲁棒性。所提出的框架EvoNash-MARL将这些组件集成到执行感知的配置循环中，并进一步引入由方向头和风险头组成的分层策略架构、非线性信号增强、特征质量重加权和约束感知的检查点选择。在120窗口前向滚动协议下，最终确定的v21配置取得平均超额夏普0.7600和稳健得分-0.0203，在内部对照中排名第一；在2014-01-02至2024-01-05的对齐日度样本外收益上，其年化收益为19.6%，而SPY为11.7%；在延伸至2026-02-10的前向滚动评估中，其收益为20.5%，对比13.5%。该框架在现实压力约束下保持正向表现，并展现出结构化的跨市场泛化；然而，在White现实检验（WRC）和SPA-lite检验下，全局强显著性并未成立。因此，这些结果被视为支持一种更稳定的中长期训练与选择范式的证据，而非普遍更优的择时表现的证明。

CSPO: Alleviating Reward Ambiguity for Structured Table-to-LaTeX Generation

CSPO：消除结构化表到LaTeX生成的奖励模糊性

Authors: Yunfan Yang, Cuiling Lan, Jitao Sang, Yan Lu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.10918
Pdf link: https://arxiv.org/pdf/2604.10918
Abstract Tables contain rich structured information, yet when stored as images their contents remain "locked" within pixels. Converting table images into LaTeX code enables faithful digitization and reuse, but current multimodal large language models (MLLMs) often fail to preserve structural, style, or content fidelity. Conventional post-training with reinforcement learning (RL) typically relies on a single aggregated reward, leading to reward ambiguity that conflates multiple behavioral aspects and hinders effective optimization. We propose Component-Specific Policy Optimization (CSPO), an RL framework that disentangles optimization across LaTeX tables components-structure, style, and content. In particular, CSPO assigns component-specific rewards and backpropagates each signal only through the tokens relevant to its component, alleviating reward ambiguity and enabling targeted component-wise optimization. To comprehensively assess performance, we introduce a set of hierarchical evaluation metrics. Extensive experiments demonstrate the effectiveness of CSPO, underscoring the importance of component-specific optimization for reliable structured generation.
中文摘要 表格包含丰富的结构化信息，但当以图像存储时，其内容仍被“锁定”在像素内。将表格图像转换为 LaTeX 代码实现了忠实的数字化和重用，但当前的多模态大型语言模型（MLLMs）往往无法保持结构、样式或内容的忠实度。传统的强化学习（RL）后训练通常依赖于单一的聚合奖励，导致奖励模糊，混淆了多个行为方面，阻碍了有效的优化。我们提出了组件特定策略优化（CSPO），这是一个强化学习框架，能够在 LaTeX 表格中组件结构、样式和内容间解开优化。特别是，CSPO为每个信号分配组件特定的奖励，并仅通过与其组件相关的代币进行反向传播，从而消除奖励的歧义，实现针对性的组件优化。为了全面评估绩效，我们引入了一套层级评估指标。大量实验证明了CSPO的有效性，强调了组件特定优化对于可靠结构生成的重要性。
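
The core idea in the CSPO abstract, routing each component's reward signal only through the tokens of that component, can be sketched as below. The component labels per token, the advantage values, and the function names are all hypothetical; the abstract does not say how tokens are assigned to components or how advantages are computed.

```python
from typing import Dict, List


def component_token_advantages(
    token_components: List[str],
    component_advantages: Dict[str, float],
) -> List[float]:
    """Give each token the advantage of its own component only.

    Tokens whose component has no advantage entry receive 0.0, so they
    contribute nothing to the policy-gradient term.
    """
    return [component_advantages.get(c, 0.0) for c in token_components]


def policy_gradient_loss(log_probs: List[float], advantages: List[float]) -> float:
    """REINFORCE-style surrogate: negative sum of advantage-weighted log-probs."""
    if len(log_probs) != len(advantages):
        raise ValueError("log_probs and advantages must have the same length")
    return -sum(a * lp for a, lp in zip(advantages, log_probs))
```

With a single aggregated reward, every token would share one advantage value; here a table whose structure is right but whose content is wrong would push up the structure tokens and push down only the content tokens.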

Diffusion Reinforcement Learning Based Online 3D Bin Packing Spatial Strategy Optimization

基于扩散强化学习的在线三维装箱空间策略优化

Authors: Jie Han, Tong Li, Qingyang Xu, Yong Song, Bao Pang, Xianfeng Yuan
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.10953
Pdf link: https://arxiv.org/pdf/2604.10953
Abstract The online 3D bin packing problem is important in logistics, warehousing and intelligent manufacturing, with solutions shifting to deep reinforcement learning (DRL) which faces challenges like low sample efficiency. This paper proposes a diffusion reinforcement learning-based algorithm, using a Markov decision chain for packing modeling, height map-based state representation and a diffusion model-based actor network. Experiments show it significantly improves the average number of packed items compared to state-of-the-art DRL methods, with excellent application potential in complex online scenarios.
中文摘要 在线三维装箱问题在物流、仓储和智能制造领域十分重要，其解决方案正转向深度强化学习（DRL），但后者面临样本效率低等挑战。本文提出了一种基于扩散强化学习的算法，采用马尔可夫决策链进行装箱建模，使用基于高度图的状态表示以及基于扩散模型的演员网络。实验显示，与最先进的DRL方法相比，该算法显著提升了平均装箱物品数量，并在复杂的在线场景中具有良好的应用潜力。
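
The height map-based state representation mentioned in the bin packing abstract is a common encoding in which each grid cell stores the current stack height at that position. A minimal sketch of how such a map might be updated when a box is placed follows; the function name, integer grid, and the absence of any stability check are simplifying assumptions, not the paper's environment.

```python
from typing import List


def place_box(
    height_map: List[List[int]],
    x: int,
    y: int,
    length: int,
    width: int,
    height: int,
    bin_height: int,
) -> bool:
    """Drop a box with a length x width footprint at cell (x, y).

    The box rests on the tallest cell under its footprint. Returns False
    (leaving the map unchanged) if the box would leave the grid or exceed
    bin_height; otherwise updates the map in place and returns True.
    """
    rows, cols = len(height_map), len(height_map[0])
    if x < 0 or y < 0 or x + length > rows or y + width > cols:
        return False
    cells = [(i, j) for i in range(x, x + length) for j in range(y, y + width)]
    base = max(height_map[i][j] for i, j in cells)
    if base + height > bin_height:
        return False
    for i, j in cells:
        height_map[i][j] = base + height
    return True
```

In an online setting the agent would observe this map plus the dimensions of the next incoming item, and the action would be the placement cell.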

ScoRe-Flow: Complete Distributional Control via Score-Based Reinforcement Learning for Flow Matching

ScoRe-Flow：通过基于分数的强化学习实现流匹配的完整分布控制

Authors: Xiaotian Qiu, Lukai Chen, Jinhao Li, Qi Sun, Cheng Zhuo, Guohao Dai
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.10962
Pdf link: https://arxiv.org/pdf/2604.10962
Abstract Flow Matching (FM) policies have emerged as an efficient backbone for robotic control, offering fast and expressive action generation that underpins recent large-scale embodied AI systems. However, FM policies trained via imitation learning inherit the limitations of demonstration data; surpassing suboptimal behaviors requires reinforcement learning (RL) fine-tuning. Recent methods convert deterministic flows into stochastic differential equations (SDEs) with learnable noise injection, enabling exploration and tractable likelihoods, but such noise-only control can compromise training efficiency when demonstrations already provide strong priors. We observe that modulating the drift via the score function, i.e., the gradient of log-density, steers exploration toward high-probability regions, improving stability. The score admits a closed-form expression from the velocity field, requiring no auxiliary networks. Based on this, we propose ScoRe-Flow, a score-based RL fine-tuning method that combines drift modulation with learned variance prediction to achieve decoupled control over the mean and variance of stochastic transitions. Experiments demonstrate that ScoRe-Flow achieves 2.4x faster convergence than flow-based SOTA on D4RL locomotion tasks and up to 5.4% higher success rates on Robomimic and Franka Kitchen manipulation tasks.
中文摘要 流动匹配（FM）政策已成为机器人控制的高效骨干，提供快速且富有表现力的动作生成，支撑着近期大规模具身人工智能系统的发展。然而，通过模仿学习训练的FM策略继承了演示数据的局限性;超越次优行为需要强化学习（RL）的微调。最新方法将确定性流转换为带有可学习噪声注入的随机微分方程（SDE），从而实现探索和可处理的似然，但当演示已经提供了强先验时，这种仅靠噪声控制可能会影响训练效率。我们观察到，通过分数函数（即对数密度梯度）调制漂移，可以引导探索方向高概率区域，从而提高稳定性。该分数允许速度场的闭式表达，无需辅助网络。基于此，我们提出了ScoRe-Flow，一种基于分数的强化学习微调方法，结合漂移调制与学习方差预测，实现对随机跃迁均值和方差的解耦控制。实验表明，ScoRe-Flow在D4RL移动任务中收敛速度比基于流量的SOTA快2.4倍，在Robomimic和Franka Kitchen操作任务中成功率高出多达5.4%。
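
The ScoRe-Flow abstract states that the score admits a closed-form expression from the velocity field. One well-known instance of such a relation can be checked numerically. The sketch below assumes the linear path x_t = (1 - t) * eps + t * x1 with eps ~ N(0, 1), for which score = (t * v - x) / (1 - t); this parameterization is an assumption and the paper may use a different convention. The 1-D Gaussian target x1 ~ N(m, s^2) is chosen only because its velocity and score are known exactly.

```python
def gaussian_path_velocity(x: float, t: float, m: float, s: float) -> float:
    """Exact marginal velocity E[x1 - eps | x_t = x] when x1 ~ N(m, s^2)."""
    var_t = (1.0 - t) ** 2 + (t * s) ** 2
    return m + (t * s * s - (1.0 - t)) * (x - t * m) / var_t


def gaussian_path_score(x: float, t: float, m: float, s: float) -> float:
    """Exact score of the marginal x_t ~ N(t * m, (1 - t)^2 + t^2 * s^2)."""
    var_t = (1.0 - t) ** 2 + (t * s) ** 2
    return -(x - t * m) / var_t


def score_from_velocity(x: float, t: float, v: float) -> float:
    """Score implied by the velocity under the assumed linear path (t < 1)."""
    return (t * v - x) / (1.0 - t)
```

Because the score is an algebraic function of the velocity, no auxiliary score network is needed in this setting, which matches the abstract's remark; the division by (1 - t) means the expression must be handled with care near t = 1.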

You Only Judge Once: Multi-response Reward Modeling in a Single Forward Pass

你只评判一次：单次前传中的多重响应奖励建模

Authors: Yinuo Yang, Zixian Ma, Manasi Ganti, Jieyu Zhang, Ranjay Krishna
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.10966
Pdf link: https://arxiv.org/pdf/2604.10966
Abstract We present a discriminative multimodal reward model that scores all candidate responses in a single forward pass. Conventional discriminative reward models evaluate each response independently, requiring multiple forward passes, one for each potential response. Our approach concatenates multiple responses with separator tokens and applies cross-entropy over their scalar scores, enabling direct comparative reasoning and efficient $N$-way preference learning. The multi-response design also yields up to $N\times$ wall-clock speedup and FLOPs reduction over conventional single-response scoring. To enable $N$-way reward evaluation beyond existing pairwise benchmarks, we construct two new benchmarks: (1) MR$^2$Bench-Image contains human-annotated rankings over responses from 8 diverse models; (2) MR$^2$Bench-Video is a large-scale video-based reward benchmark derived from 94K crowdsourced pairwise human judgments over video question-answering spanning 19 models, denoised via preference graph ensemble. Both benchmarks provide 4-response evaluation variants sampled from the full rankings. Built on a 4B vision-language backbone with LoRA fine-tuning and a lightweight MLP value head, our model achieves state-of-the-art results on six multimodal reward benchmarks, including MR$^2$Bench-Image, MR$^2$Bench-Video, and four other existing benchmarks. Our model outperforms existing larger generative and discriminative reward models. We further demonstrate that our reward model, when used in reinforcement learning with GRPO, produces improved policy models that maintain performance across standard multimodal benchmarks while substantially improving open-ended generation quality, outperforming a single-response discriminative reward model (RM) baseline by a large margin in both training stability and open-ended generation quality.
中文摘要 我们提出了一种判别性多模态奖励模型，能够在一次前向传递中对所有候选反应进行评分。传统的判别性奖励模型独立评估每个反应，需要多次前传，每个潜在反应各一次。我们的方法通过分隔符符号串接多个响应，并在其标量分数上应用交叉熵，从而实现直接的比较推理和高效的 $N $ 方式偏好学习。多响应设计还比传统单响应计分实现了高达$N×的墙钟加速和FLOP的降低。为了实现超越现有两对基准的$N$方式奖励评估，我们构建了两个新的基准测试：（1） MR$^2$Bench-Image 包含对来自8个不同模型的人工注释排名;（2） MR$^2$Bench-Video 是一个基于视频的大型奖励基准，基于94K群众外包的人类对跨19个模型的视频问答进行的成对判断，并通过偏好图集合进行去噪处理。这两个基准测试均提供从完整排名中抽样的四反应评估变体。我们的模型基于4B视觉语言骨干，结合LoRA微调和轻量级MLP价值头，在六个多模态奖励基准测试中取得了最先进的成绩，包括MR$^2$Bench-Image、MR$^2$Bench-Video以及另外四个现有基准测试。我们的模型优于现有的大型生成奖励和辨别性奖励模型。我们还进一步证明，当我们的奖励模型在GRPO强化学习中使用时，能够产生改进的策略模型，在标准多模态基准测试中保持性能，同时显著提升开放式生成质量，在训练稳定性和开放式生成质量方面远远优于单响应判别奖励模型（RM）基线。
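
The abstract above says the model applies cross-entropy over the scalar scores of N concatenated responses. A minimal sketch of such a loss follows, assuming the target is a one-hot label on the single preferred response (how the paper forms targets from full rankings is not stated, so this is an assumption, as is the function name).

```python
import math
from typing import List


def multi_response_ce(scores: List[float], best_index: int) -> float:
    """Cross-entropy of a softmax over N scalar scores against one preferred index."""
    top = max(scores)
    # log-sum-exp with the maximum subtracted for numerical stability
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_z - scores[best_index]
```

For N = 2 this reduces to the familiar pairwise loss log(1 + exp(s_rejected - s_preferred)), so the N-way form can be read as a direct generalization of pairwise preference learning in which all candidates compete inside one softmax.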

MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models

MMR-AD：一个用于用多模态大型语言模型进行一般异常检测基准的大规模多模态数据集

Authors: Xincheng Yao, Zefeng Qian, Chao Shi, Jiayang Song, Chongyang Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.10971
Pdf link: https://arxiv.org/pdf/2604.10971
Abstract In the progress of industrial anomaly detection, general anomaly detection (GAD) is an emerging trend and also the ultimate goal. Unlike the conventional single- and multi-class AD, general AD aims to train a general AD model that can directly detect anomalies in diverse novel classes without any retraining or fine-tuning on the target data. Recently, Multimodal Large Language Models (MLLMs) have shown great promise in achieving general anomaly detection due to their revolutionary visual understanding and language reasoning capabilities. However, MLLM's general AD ability remains underexplored due to: (1) MLLMs are pretrained on amounts of data sourced from the Web, these data still have significant gaps with the data in AD scenarios. Moreover, the image-text pairs during pretraining are also not specifically for AD tasks. (2) The current mainstream AD datasets are image-based and not yet suitable for post-training MLLMs. To facilitate MLLM-based general AD research, we present MMR-AD, which is a comprehensive benchmark for both training and evaluating MLLM-based AD models. With MMR-AD, we reveal that the AD performance of current SOTA generalist MLLMs still falls far behind the industrial requirements. Based on MMR-AD, we also propose a baseline model, Anomaly-R1, which is a reasoning-based AD model that learns from the CoT data in MMR-AD and is further enhanced by reinforcement learning. Extensive experiments show that our Anomaly-R1 achieves remarkable improvements over generalist MLLMs in both anomaly detection and localization.
中文摘要 在工业异常检测的发展中，一般异常检测（GAD）是一个新兴趋势，也是最终目标。与传统的单类和多类AD不同，通用AD旨在训练一个通用AD模型，能够直接检测不同新类别中的异常，而无需对目标数据进行重新训练或微调。近年来，多模态大型语言模型（MLLM）因其革命性的视觉理解和语言推理能力，在实现一般异常检测方面展现出巨大潜力。然而，MLLM 的通用 AD 能力仍未被充分开发，原因包括：（1） MLLM 预训练基于来自网络的数据量，这些数据与 AD 场景中的数据仍有显著差距。此外，预训练期间的图像-文本对也并非专门针对AD任务。（2）目前主流的AD数据集基于图像，尚未适合训练后MLLMs。为促进基于MLLM的通用AD研究，我们提出了MMR-AD，这是一个用于训练和评估MLLM基础AD模型的综合基准。通过MMR-AD，我们发现当前SOTA通用MLLM的AD性能仍远远落后于工业需求。基于MMR-AD，我们还提出了一个基线模型Anomaly-R1，这是一种基于推理的AD模型，从MMR-AD中的CoT数据中学习，并通过强化学习进一步增强。大量实验表明，我们的异常-R1在异常检测和定位方面优于通用MLLM实现了显著提升。

Robust Adversarial Policy Optimization Under Dynamics Uncertainty

动态不确定性下的强健对抗策略优化

Authors: Mintae Kim, Koushil Sreenath
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.10974
Pdf link: https://arxiv.org/pdf/2604.10974
Abstract Reinforcement learning (RL) policies often fail under dynamics that differ from training, a gap not fully addressed by domain randomization or existing adversarial RL methods. Distributionally robust RL provides a formal remedy but still relies on surrogate adversaries to approximate intractable primal problems, leaving blind spots that potentially cause instability and over-conservatism. We propose a dual formulation that directly exposes the robustness-performance trade-off. At the trajectory level, a temperature parameter from the dual problem is approximated with an adversarial network, yielding efficient and stable worst-case rollouts within a divergence bound. At the model level, we employ Boltzmann reweighting over dynamics ensembles, focusing on more adverse environments to the current policy rather than uniform sampling. The two components act independently and complement each other: trajectory-level steering ensures robust rollouts, while model-level sampling provides policy-sensitive coverage of adverse dynamics. The resulting framework, robust adversarial policy optimization (RAPO) outperforms robust RL baselines, improving resilience to uncertainty and generalization to out-of-distribution dynamics while maintaining dual tractability.
中文摘要 强化学习（RL）策略常在与训练不同的动态下失败，这一差距未能被领域随机化或现有的对抗性强化方法充分弥补。分布稳健的强化学习提供了形式上的解决方案，但仍依赖替代对手来近似难以解决的原始问题，留下可能导致不稳定和过度保守的盲区。我们提出了一种双重表述，直接揭示了鲁棒性与性能权衡。在轨迹层面，对偶问题中的温度参数通过对抗网络近似，从而在发散界限内实现高效且稳定的最坏情况展开。在模型层面，我们采用Boltzmann加权，重点关注对当前政策不利的环境，而非统一采样。这两个组成部分独立运作，相互补充：轨迹级引导确保稳健的部署，而模型级抽样则提供对不利动态的政策敏感覆盖。由此产生的框架——强健对抗策略优化（RAPO）优于强健强化学习基线，提升了对不确定性的韧性和对分布外动态的泛化性，同时保持了双重可处理性。
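
The model-level step described above (Boltzmann reweighting over a dynamics ensemble) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `boltzmann_weights`, the use of per-model policy returns as the adversity signal, and the `temperature` parameter are all assumptions made for illustration.

```python
import math

def boltzmann_weights(returns, temperature):
    """Sampling weights over an ensemble of dynamics models (illustrative).

    `returns[i]` is assumed to be the current policy's return under model i.
    A lower return means a more adverse model, which gets a larger weight.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Negate returns so adverse models get larger logits.
    logits = [-r / temperature for r in returns]
    # Subtract the max logit before exponentiating for numerical stability.
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical returns of one policy under three ensemble members.
weights = boltzmann_weights([10.0, 4.0, 7.0], temperature=2.0)
print(weights)
```

A small temperature concentrates sampling on the worst-case model, while a very large temperature recovers near-uniform sampling, which is the baseline the abstract contrasts against.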

When Valid Signals Fail: Regime Boundaries Between LLM Features and RL Trading Policies

当有效信号失效时：LLM特性与强化学习交易策略之间的制度边界

Authors: Zhengzhe Yang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Arxiv link: https://arxiv.org/abs/2604.10996
Pdf link: https://arxiv.org/pdf/2604.10996
Abstract Can large language models (LLMs) generate continuous numerical features that improve reinforcement learning (RL) trading agents? We build a modular pipeline where a frozen LLM serves as a stateless feature extractor, transforming unstructured daily news and filings into a fixed-dimensional vector consumed by a downstream PPO agent. We introduce an automated prompt-optimization loop that treats the extraction prompt as a discrete hyperparameter and tunes it directly against the Information Coefficient - the Spearman rank correlation between predicted and realized returns - rather than NLP losses. The optimized prompt discovers genuinely predictive features (IC above 0.15 on held-out data). However, these valid intermediate representations do not automatically translate into downstream task performance: during a distribution shift caused by a macroeconomic shock, LLM-derived features add noise, and the augmented agent under-performs a price-only baseline. In a calmer test regime the agent recovers, yet macroeconomic state variables remain the most robust driver of policy improvement. Our findings highlight a gap between feature-level validity and policy-level robustness that parallels known challenges in transfer learning under distribution shift.
中文摘要 大型语言模型（LLMs）能否生成连续的数值特征，从而提升强化学习（RL）交易代理的性能？我们构建了一个模块化流水线，其中冻结的LLM作为无状态特征提取器，将非结构化的每日新闻和文件转换为固定维向量，供下游PPO代理消费。我们引入了自动化提示优化循环，将提取提示视为离散超参数，并直接根据信息系数——预测收益与实现收益之间的Spearman排名相关性——而非NLP损失进行调校。优化后的提示发现了真正具有预测性的特征（保留数据的IC超过0.15）。然而，这些有效的中间表示并不自动转化为后续任务表现：在宏观经济冲击引发的分布转移中，LLM衍生特征会增加噪声，增强代理在仅价格基线时表现不佳。在较为平静的测试体系中，主体会恢复，但宏观经济状态变量仍是政策改进最强有力的驱动力。我们的发现凸显了特征层级有效性与策略层面鲁棒性之间的差距，这与分布转移迁移过程中已知的挑战相呼应。
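
The abstract defines the Information Coefficient as the Spearman rank correlation between predicted and realized returns. A minimal sketch of that computation, using only the standard library, is shown here; the function names and the sample numbers are hypothetical, and the paper's own evaluation code may differ.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def information_coefficient(predicted, realized):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(predicted), average_ranks(realized))

# Hypothetical feature scores and next-day returns for three assets.
ic = information_coefficient([0.3, 0.1, 0.2], [0.02, -0.01, 0.05])
print(ic)  # 0.5
```

Because the metric depends only on ranks, it rewards features that order assets correctly even when the predicted magnitudes are off.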

NimbusGuard: A Novel Framework for Proactive Kubernetes Autoscaling Using Deep Q-Networks

NimbusGuard：利用深度Q网络进行主动Kubernetes自动扩展的创新框架

Authors: Chamath Wanigasooriya, Indrajith Ekanayake
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.11017
Pdf link: https://arxiv.org/pdf/2604.11017
Abstract Cloud-native architecture is about building and running scalable microservice applications that take full advantage of cloud environments. Managed Kubernetes is the powerhouse that orchestrates cloud-native applications with elastic scaling. However, traditional Kubernetes autoscalers are reactive: the scaling controllers adjust resources only after they detect demand within the cluster and do not incorporate any predictive measures. This can lead to either over-provisioning and increased costs or under-provisioning and performance degradation. We propose NimbusGuard, an open-source, Kubernetes-based autoscaling system that leverages a deep reinforcement learning agent to provide proactive autoscaling. The agent's perception is augmented by a Long Short-Term Memory model that forecasts future workload patterns. We evaluate NimbusGuard against the built-in scaling controllers, such as the Horizontal Pod Autoscaler, and the event-driven autoscaler KEDA. The experimental results demonstrate how NimbusGuard's proactive framework translates into superior performance and cost efficiency compared to existing reactive methods.
中文摘要 云原生架构旨在构建和运行可扩展的微服务应用，充分利用云环境。托管Kubernetes是编排云原生应用并提供弹性扩展的强大平台。然而，传统的Kubernetes自动扩展器是被动的，即扩展控制器只有在检测到集群内的需求后才调整资源，且不包含任何预测措施。这可能导致过度配置和成本增加，或配置不足和性能下降。我们提出了NimbusGuard，一个开源的基于Kubernetes的自动扩展系统，利用深度强化学习智能体实现主动自动扩展。智能体的感知通过一个预测未来工作负载模式的长短期记忆模型得到增强。评估中，我们将NimbusGuard与内置扩展控制器（如水平Pod自动扩缩器，Horizontal Pod Autoscaler）以及事件驱动自动扩展器KEDA进行了比较。实验结果表明，NimbusGuard的主动框架相较于现有被动式方法，带来了更优的性能和成本效益。

Optimal Stability of KL Divergence under Gaussian Perturbations

高斯微扰下的KL散度的最优稳定性

Authors: Jialu Pan, Yufeng Zhang, Nan Hu, Keqin Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.11026
Pdf link: https://arxiv.org/pdf/2604.11026
Abstract We study the problem of characterizing the stability of Kullback-Leibler (KL) divergence under Gaussian perturbations beyond Gaussian families. Existing relaxed triangle inequalities for KL divergence critically rely on the assumption that all involved distributions are Gaussian, which limits their applicability in modern applications such as out-of-distribution (OOD) detection with flow-based generative models. In this paper, we remove this restriction by establishing a sharp stability bound between an arbitrary distribution and Gaussian families under mild moment conditions. Specifically, let $P$ be a distribution with finite second moment, and let $\mathcal{N}_1$ and $\mathcal{N}_2$ be multivariate Gaussian distributions. We show that if $KL(P||\mathcal{N}_1)$ is large and $KL(\mathcal{N}_1||\mathcal{N}_2)$ is at most $\epsilon$, then $KL(P||\mathcal{N}_2) \ge KL(P||\mathcal{N}_1) - O(\sqrt{\epsilon})$. Moreover, we prove that this $\sqrt{\epsilon}$ rate is optimal in general, even within the Gaussian family. This result reveals an intrinsic stability property of KL divergence under Gaussian perturbations, extending classical Gaussian-only relaxed triangle inequalities to general distributions. The result is non-trivial due to the asymmetry of KL divergence and the absence of a triangle inequality in general probability spaces. As an application, we provide a rigorous foundation for KL-based OOD analysis in flow-based models, removing strong Gaussian assumptions used in prior work. More broadly, our result enables KL-based reasoning in non-Gaussian settings arising in deep learning and reinforcement learning.
中文摘要 我们研究在高斯族之外的高斯扰动下，库尔巴克-莱布勒（KL）散度稳定性的表征问题。现有的KL散度松弛三角不等式关键依赖于所有相关分布都是高斯分布的假设，这限制了它们在现代应用中的适用性，如基于流的生成模型中的分布外（OOD）检测。本文通过在温和的矩条件下建立任意分布与高斯族之间的紧致稳定性界，消除了这一限制。具体来说，设$P$为具有有限二阶矩的分布，$\mathcal{N}_1$和$\mathcal{N}_2$为多元高斯分布。我们证明：如果 $KL(P||\mathcal{N}_1)$ 很大且 $KL(\mathcal{N}_1||\mathcal{N}_2)$ 至多为 $\epsilon$，则 $KL(P||\mathcal{N}_2) \ge KL(P||\mathcal{N}_1) - O(\sqrt{\epsilon})$。此外，我们证明了这个 $\sqrt{\epsilon}$ 速率在一般情况下是最优的，即使在高斯族内也是如此。该结果揭示了在高斯扰动下KL散度的内在稳定性性质，将经典的仅限高斯的松弛三角不等式推广到一般分布。由于KL散度的不对称性以及一般概率空间中不存在三角不等式，该结果是非平凡的。作为应用，我们为基于流模型的基于KL的OOD分析提供了严谨的基础，去除了先前工作中使用的强高斯假设。更广泛地说，我们的结果使得基于KL的推理能够在深度学习和强化学习中出现的非高斯环境中进行。
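
The $\sqrt{\epsilon}$ rate can be checked numerically in a toy case that stays inside the Gaussian family. The sketch below is an illustration, not the paper's proof: it assumes unit-variance univariate Gaussians $P = \mathcal{N}(m, 1)$, $\mathcal{N}_1 = \mathcal{N}(0, 1)$, and $\mathcal{N}_2 = \mathcal{N}(d, 1)$ with $d = \sqrt{2\epsilon}$, so that $KL(\mathcal{N}_1||\mathcal{N}_2) = \epsilon$. The helper names are hypothetical.

```python
import math

def kl_gauss_1d(mu_p, sigma_p, mu_q, sigma_q):
    """Closed-form KL(N(mu_p, sigma_p^2) || N(mu_q, sigma_q^2))."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2 * sigma_q ** 2)
            - 0.5)

def kl_drop(m, eps):
    """KL(P||N1) - KL(P||N2) for the toy setup described in the lead-in."""
    d = math.sqrt(2 * eps)  # chosen so that KL(N1||N2) = d^2 / 2 = eps
    return kl_gauss_1d(m, 1.0, 0.0, 1.0) - kl_gauss_1d(m, 1.0, d, 1.0)

m = 4.0
for eps in (1e-2, 1e-4, 1e-6):
    drop = kl_drop(m, eps)
    # Algebraically, drop = m * sqrt(2 * eps) - eps, so this ratio
    # tends to m * sqrt(2) as eps shrinks.
    print(eps, drop, drop / math.sqrt(eps))
```

In this toy case the drop in KL scales as $\sqrt{\epsilon}$ rather than $\epsilon$, with a constant that grows with $m$, which is consistent with the abstract's claim that the rate cannot be improved even for Gaussians.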

RTMC: Step-Level Credit Assignment via Rollout Trees

RTMC：通过展开树实现步骤级信用分配

Authors: Tao Wang, Suhang Zheng, Xiaoxiao Xu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.11037
Pdf link: https://arxiv.org/pdf/2604.11037
Abstract Multi-step agentic reinforcement learning benefits from fine-grained credit assignment, yet existing approaches offer limited options: critic-free methods like GRPO assign a uniform advantage to every action in a trajectory, while learned value networks introduce notable overhead and can be fragile under sparse rewards. We observe that group rollouts targeting the same problem often traverse overlapping intermediate states, implicitly forming a tree whose branches diverge at successive decision points. Building on this insight, we introduce Rollout-Tree Monte Carlo (RTMC) advantage estimation, which aggregates return statistics across rollouts sharing a common state to produce per-step Q-values and advantages--without any learned critic. A state-action signature system compresses raw interaction histories into compact, comparable representations, making cross-rollout state matching tractable. On SWE-bench Verified, RTMC improves pass@1 by 3.2 percentage points over GRPO.
中文摘要 多步智能体强化学习受益于细粒度的信用分配，但现有方法提供的选项有限：像GRPO这样的无评论家（critic-free）方法为轨迹中的每个动作分配统一的优势，而学习得到的价值网络则带来显著的开销，且在奖励稀疏时可能很脆弱。我们观察到，针对同一问题的一组展开（rollout）常常经过重叠的中间状态，隐式地形成一棵树，其分支在连续的决策点处分岔。基于这一见解，我们引入了Rollout-Tree Monte Carlo（RTMC）优势估计，该方法汇总共享同一状态的各次展开的回报统计量，生成每步的Q值和优势，且无需任何学习得到的评论家。状态-动作签名系统将原始交互历史压缩成紧凑且可比的表示，使得跨展开的状态匹配变得易于处理。在SWE-bench Verified上，RTMC的pass@1比GRPO提高了3.2个百分点。
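
The core idea in the abstract, aggregating returns across rollouts that share a state to get per-step Q-values and advantages without a learned critic, can be sketched as follows. This is an assumption-laden illustration and not the authors' algorithm: states and actions are represented by plain string signatures, each rollout has a single scalar return, and a state revisited within one rollout is simply counted once per visit.

```python
from collections import defaultdict

def _mean(xs):
    return sum(xs) / len(xs)

def rollout_tree_advantages(rollouts):
    """Per-step advantages A(s, a) = Q(s, a) - V(s) from grouped rollouts.

    `rollouts` is a list of (steps, ret) pairs, where `steps` is a list of
    (state_signature, action_signature) tuples and `ret` is the rollout's
    final return. Q and V are Monte Carlo means over rollouts passing
    through the same signature.
    """
    state_returns = defaultdict(list)
    pair_returns = defaultdict(list)
    for steps, ret in rollouts:
        for state, action in steps:
            state_returns[state].append(ret)
            pair_returns[(state, action)].append(ret)
    return [
        [_mean(pair_returns[(s, a)]) - _mean(state_returns[s]) for s, a in steps]
        for steps, _ in rollouts
    ]

# Three hypothetical rollouts on the same problem, sharing the root state s0.
rollouts = [
    ([("s0", "a"), ("s1", "x")], 1.0),
    ([("s0", "a"), ("s1", "y")], 0.0),
    ([("s0", "b"), ("s2", "z")], 0.0),
]
advantages = rollout_tree_advantages(rollouts)
print(advantages)
```

Unlike a uniform per-trajectory advantage, the first two rollouts here receive different credit at `s1`, where they actually diverge, while sharing the same credit at `s0`.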

Rethinking Token-Level Credit Assignment in RLVR: A Polarity-Entropy Analysis

重新思考RLVR中的代币级信用分配：极性熵分析

Authors: Yuhang He, Haodong Wu, Siyi Liu, Hongyu Ge, Hange Zhou, Keyi Wu, Zhuo Zheng, Qihong Lin, Zixin Zhong, Yongqi Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.11056
Pdf link: https://arxiv.org/pdf/2604.11056
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has substantially improved the reasoning ability of Large Language Models (LLMs). However, its sparse outcome-based rewards pose a fundamental credit assignment problem. We analyze this problem through the joint lens of reward polarity and token entropy. Our diagnostic tool, the Four Quadrant Decomposition, isolates token updates by polarity and entropy, and controlled ablations show that reasoning improvements concentrate in the high-entropy quadrants. To justify this observation theoretically, we adapt Conditional Mutual Information to the autoregressive RLVR setting and prove that the credit a token can carry is upper-bounded by its entropy. This view yields testable predictions that reasoning gains arise primarily from high-entropy tokens, with unique roles for positive and negative updates. A gradient analysis of GRPO further reveals how uniform reward broadcast dilutes signal at high-entropy positions while over-crediting deterministic tokens. Grounded in these insights, we propose Entropy-Aware Policy Optimization (EAPO) that modulates token-level learning signals accordingly. Extensive experiments demonstrate that EAPO outperforms strong baselines across two model families.
中文摘要 带可验证奖励的强化学习（RLVR）显著提升了大型语言模型（LLMs）的推理能力。然而，其基于结果的奖励稀少，带来了根本性的信用分配问题。我们通过奖励极性和代币熵的联合视角来分析这个问题。我们的诊断工具“四象限分解”通过极性和熵隔离代币更新，受控消融显示推理改进集中在高熵象限。为了理论上证明这一观察，我们将条件互信息适配到自回归的RLVR设置，并证明代币可携带的信用额度由其熵上界。这一观点得出可检验的预测，即推理收益主要来自高熵的代币，并且在正负更新中具有独特的角色。对GRPO的梯度分析进一步揭示了均匀奖励广播如何在高熵位置稀释信号，同时过度归入确定性代币。基于这些洞见，我们提出了一种能相应调制令牌级学习信号的熵感知策略优化（EAPO）。大量实验表明，EAPO在两个模型家族中表现优于强基线。
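
The Four Quadrant Decomposition groups token updates by reward polarity and token entropy. A minimal sketch of such a grouping is given here; the entropy threshold, the treatment of zero advantage as negative, and the function names are assumptions for illustration, and the paper's exact definitions may differ.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def quadrant(advantage, entropy, threshold):
    """Label a token by the sign of its rollout advantage and its entropy."""
    polarity = "positive" if advantage > 0 else "negative"
    level = "high" if entropy >= threshold else "low"
    return f"{polarity}/{level}-entropy"

# A near-deterministic token versus an uncertain one, both from a rollout
# with positive advantage; the 0.5-nat threshold is hypothetical.
confident = token_entropy([0.98, 0.01, 0.01])
uncertain = token_entropy([0.4, 0.3, 0.3])
print(quadrant(1.0, confident, threshold=0.5))  # positive/low-entropy
print(quadrant(1.0, uncertain, threshold=0.5))  # positive/high-entropy
```

Under a uniform reward broadcast both tokens would receive the same learning signal; the decomposition makes it possible to ablate or reweight each group separately.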

OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video

OmniScript：迈向长篇电影视频的视听剧本生成

Authors: Junfu Pu, Yuxin Chen, Teng Wang, Ying Shan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Arxiv link: https://arxiv.org/abs/2604.11102
Pdf link: https://arxiv.org/pdf/2604.11102
Abstract Current multimodal large language models (MLLMs) have demonstrated remarkable capabilities in short-form video understanding, yet translating long-form cinematic videos into detailed, temporally grounded scripts remains a significant challenge. This paper introduces the novel video-to-script (V2S) task, aiming to generate hierarchical, scene-by-scene scripts encompassing character actions, dialogues, expressions, and audio cues. To facilitate this, we construct a first-of-its-kind human-annotated benchmark and propose a temporally-aware hierarchical evaluation framework. Furthermore, we present OmniScript, an 8B-parameter omni-modal (audio-visual) language model tailored for long-form narrative comprehension. OmniScript is trained via a progressive pipeline that leverages chain-of-thought supervised fine-tuning for plot and character reasoning, followed by reinforcement learning using temporally segmented rewards. Extensive experiments demonstrate that despite its parameter efficiency, OmniScript significantly outperforms larger open-source models and achieves performance comparable to state-of-the-art proprietary models, including Gemini 3-Pro, in both temporal localization and multi-field semantic accuracy.
中文摘要 当前多模态大型语言模型（MLLM）在短视频理解方面展现了卓越的能力，但将长篇电影视频转化为详细且具有时间定位的剧本仍是一大挑战。本文提出了新颖的视频到剧本（V2S）任务，旨在生成逐场景的层级化剧本，涵盖角色动作、对话、表情和音频提示。为此，我们构建了首个人工标注的基准，并提出了一个时间感知的层级评估框架。此外，我们还提出了OmniScript，一种8B参数的全模态（视听）语言模型，专为长篇叙事理解设计。OmniScript 通过渐进式流程训练：先利用思维链监督微调进行情节和人物推理，随后通过时间分段奖励进行强化学习。大量实验表明，尽管参数量较小，OmniScript 在时间定位和多字段语义准确性方面仍显著优于更大的开源模型，并达到与包括 Gemini 3-Pro 在内的最先进专有模型相当的性能。

MADQRL: Distributed Quantum Reinforcement Learning Framework for Multi-Agent Environments

MADQRL：多智能体环境分布式量子强化学习框架

Authors: Abhishek Sawaika, Samuel Yen-Chi Chen, Udaya Parampalli, Rajkumar Buyya
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2604.11131
Pdf link: https://arxiv.org/pdf/2604.11131
Abstract Reinforcement learning (RL) is one of the most practical ways to learn from real-life use cases. Its grounding in the cognitive methods used by humans makes it a widely accepted strategy in the field of artificial intelligence. Most environments used for RL are high-dimensional, and traditional RL algorithms become computationally expensive and struggle to learn effectively from such systems. Recent advances in the practical demonstration of quantum computing (QC) theories, such as compact encoding, enhanced representation and learning algorithms, random sampling, or the inherent stochastic nature of quantum systems, have opened up new directions to tackle these challenges. Quantum reinforcement learning (QRL) has gained significant traction over the past few years. However, current quantum hardware is not sufficient for such high-dimensional environments with complex multi-agent setups. To tackle this issue, we propose a distributed framework for QRL in which multiple agents learn independently, distributing the load of joint training across individual machines. Our method works well for environments with disjoint sets of action and observation spaces, but can also be extended to other systems with reasonable approximations. We analyze the proposed method on the cooperative-pong environment, and our results indicate a ~10% improvement over other distribution strategies and a ~5% improvement over classical models of policy representation.
中文摘要 强化学习（RL）是从现实应用中学习最实用的方法之一。基于人类使用的认知方法，使其成为人工智能领域广泛接受的策略。大多数用于强化学习的环境通常是高维的，传统的强化学习算法计算成本高且难以有效学习。近年来，量子计算（QC）理论的实际演示取得的进展，如紧凑编码、增强表示与学习算法、随机采样，或量子系统固有的随机性质，为应对这些挑战开辟了新方向。近年来，量子强化学习（QRL）正寻求显著的关注。然而，目前量子硬件的现状不足以应对如此高维环境和复杂的多智能体设置。为解决这一问题，我们提出了一个分布式QRL框架，由多个代理独立学习，将联合训练的负载分散给各个机器。我们的方法适用于作用空间和观察空间不相交的环境，但也可以通过合理的近似扩展到其他系统。我们分析了合作-pong环境下的拟定方法，结果显示较其他分配策略提升了~10%，较经典政策表示模型提升了~5%。

AIM: Intent-Aware Unified world action Modeling with Spatial Value Maps

AIM：意图感知统一世界动作建模与空间值映射

Authors: Liaoyuan Fan, Zetian Xu, Chen Cao, Wenyao Zhang, Mingqi Yuan, Jiayu Chen
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.11135
Pdf link: https://arxiv.org/pdf/2604.11135
Abstract Pretrained video generation models provide strong priors for robot control, but existing unified world action models still struggle to decode reliable actions without substantial robot-specific training. We attribute this limitation to a structural mismatch: while video models capture how scenes evolve, action generation requires explicit reasoning about where to interact and the underlying manipulation intent. We introduce AIM, an intent-aware unified world action model that bridges this gap via an explicit spatial interface. Instead of decoding actions directly from future visual representations, AIM predicts an aligned spatial value map that encodes task-relevant interaction structure, enabling a control-oriented abstraction of future dynamics. Built on a pretrained video generation model, AIM jointly models future observations and value maps within a shared mixture-of-transformers architecture. It employs intent-causal attention to route future information to the action branch exclusively through the value representation. We further propose a self-distillation reinforcement learning stage that freezes the video and value branches and optimizes only the action head using dense rewards derived from projected value-map responses together with sparse task-level signals. To support training and evaluation, we construct a simulation dataset of 30K manipulation trajectories with synchronized multi-view observations, actions, and value-map annotations. Experiments on RoboTwin 2.0 benchmark show that AIM achieves a 94.0% average success rate, significantly outperforming prior unified world action baselines. Notably, the improvement is more pronounced in long-horizon and contact-sensitive manipulation tasks, demonstrating the effectiveness of explicit spatial-intent modeling as a bridge between visual world modeling and robot control.
中文摘要 预训练视频生成模型为机器人控制提供了强有力的先验，但现有统一的世界动作模型在没有大量机器人专用训练的情况下，仍难以解码可靠的动作。我们将这一局限归因于结构性不匹配：视频模型捕捉场景演变，而动作生成则需要明确推理互动地点及其潜在操控意图。我们引入AIM，一种意图感知的统一世界行动模型，通过显式空间界面弥合这一差距。AIM不再直接从未来的视觉表现中解码动作，而是预测一个对齐的空间值映射，编码任务相关的交互结构，从而实现以控制为导向的未来动态抽象。AIM基于预训练视频生成模型，在共享的变换器混合架构中共同建模未来观测和值图。它采用意图-因果关注，将未来信息仅通过价值表示系统引导到行动分支。我们进一步提出了一种自我蒸馏强化学习阶段，该阶段冻结视频和价值分支，仅通过基于预测价值图响应的密集奖励和稀疏的任务级信号优化动作头。为支持训练和评估，我们构建了一个包含3万条操作轨迹的模拟数据集，包含同步的多视图观察、动作和值图注释。在RoboTwin 2.0基准测试上的实验显示，AIM的平均成功率为94.0%，显著优于以往统一的世界行动基线。值得注意的是，在长视距和接触敏感操作任务中，这一改进更为显著，展示了显式空间意图建模作为视觉世界建模与机器人控制桥梁的有效性。

From Answers to Arguments: Toward Trustworthy Clinical Diagnostic Reasoning with Toulmin-Guided Curriculum Goal-Conditioned Learning

从答案到论证：迈向托尔明指导课程的可信临床诊断推理目标条件学习

Authors: Chen Zhan, Xiaoyu Tan, Gengchen Ma, Yu-Jie Xiong, Xiaoyan Jiang, Xihe Qiu
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.11137
Pdf link: https://arxiv.org/pdf/2604.11137
Abstract The integration of Large Language Models (LLMs) into clinical decision support is critically obstructed by their opaque and often unreliable reasoning. In the high-stakes domain of healthcare, correct answers alone are insufficient; clinical practice demands full transparency to ensure patient safety and enable professional accountability. A pervasive and dangerous weakness of current LLMs is their tendency to produce "correct answers through flawed reasoning." This issue is far more than a minor academic flaw; such process errors signal a fundamental lack of robust understanding, making the model prone to broader hallucinations and unpredictable failures when faced with real-world clinical complexity. In this paper, we establish a framework for trustworthy clinical argumentation by adapting the Toulmin model to the diagnostic process. We propose a novel training pipeline: Curriculum Goal-Conditioned Learning (CGCL), designed to progressively train LLM to generate diagnostic arguments that explicitly follow this Toulmin structure. CGCL's progressive three-stage curriculum systematically builds a solid clinical argument: (1) extracting facts and generating differential diagnoses; (2) justifying a core hypothesis while rebutting alternatives; and (3) synthesizing the analysis into a final, qualified conclusion. We validate CGCL using T-Eval, a quantitative framework measuring the integrity of the diagnosis reasoning. Experiments show that our method achieves diagnostic accuracy and reasoning quality comparable to resource-intensive Reinforcement Learning (RL) methods, while offering a more stable and efficient training pipeline.
中文摘要 大型语言模型（LLMs）与临床决策支持的整合，因其不透明且常常不可靠的推理而受到严重阻碍。在医疗这一高风险领域，仅凭正确答案是不够的;临床实践要求完全透明，以确保患者安全并实现专业问责。当前大型语言模型的一个普遍且危险的弱点是倾向于通过错误推理得出正确答案。这个问题远不止是小的学术缺陷;此类过程错误表明缺乏稳健的理解，使模型在面对现实临床复杂性时容易出现更广泛的幻觉和不可预测的失败。本文通过将Toulmin模型适应诊断过程，建立了可信的临床论证框架。我们提出了一种新颖的培训流程：课程目标条件学习（CGCL），旨在逐步训练LLM，生成明确遵循该Toulmin结构的诊断论证。CGCL的渐进式三阶段课程系统性地构建了坚实的临床论据：（1）提取事实并生成鉴别诊断;（2）在反驳其他观点的同时，为核心假设辩护;以及（3）将分析综合成最终且有条件的结论。我们使用T-Eval验证CGCL，这是一种衡量诊断推理完整性的定量框架。实验表明，我们的方法在诊断准确性和推理质量上与资源密集型强化学习（RL）方法相当，同时提供了更稳定高效的训练流程。

ViserDex: Visual Sim-to-Real for Robust Dexterous In-hand Reorientation

ViserDex：视觉模拟到现实，实现灵活的手部重新定位

Authors: Arjun Bhardwaj, Maximum Wilder-Smith, Mayank Mittal, Vaishakh Patil, Marco Hutter
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.11138
Pdf link: https://arxiv.org/pdf/2604.11138
Abstract In-hand object reorientation requires precise estimation of the object pose to handle complex task dynamics. While RGB sensing offers rich semantic cues for pose tracking, existing solutions rely on multi-camera setups or costly ray tracing. We present a sim-to-real framework for monocular RGB in-hand reorientation that integrates 3D Gaussian Splatting (3DGS) to bridge the visual sim-to-real gap. Our key insight is performing domain randomization in the Gaussian representation space: by applying physically consistent, pre-rendering augmentations to 3D Gaussians, we generate photorealistic, randomized visual data for object pose estimation. The manipulation policy is trained using curriculum-based reinforcement learning with teacher-student distillation, enabling efficient learning of complex behaviors. Importantly, both perception and control models can be trained independently on consumer-grade hardware, eliminating the need for large compute clusters. Experiments show that the pose estimator trained with 3DGS data outperforms those trained using conventional rendering data in challenging visual environments. We validate the system on a physical multi-fingered hand equipped with an RGB camera, demonstrating robust reorientation of five diverse objects even under challenging lighting conditions. Our results highlight Gaussian splatting as a practical path for RGB-only dexterous manipulation. For videos of the hardware deployments and additional supplementary materials, please refer to the project website: this https URL
中文摘要 手持物体重新定向需要精确估计物体姿态，以应对复杂的任务动态。虽然RGB传感为姿态追踪提供了丰富的语义线索，但现有解决方案依赖于多摄像头设备或昂贵的光线追踪。我们提出了一个模拟到现实的单眼RGB 手持重定向框架，整合了三维高斯喷射（3DGS），以弥合视觉模拟与现实之间的差距。我们的核心见解是在高斯表示空间中进行域随机化：通过对三维高斯量应用物理一致的预渲染增强，我们生成了逼真的随机视觉数据，用于物体姿态估计。操作策略通过基于课程的强化学习和师生提炼进行训练，从而实现复杂行为的高效学习。重要的是，感知和控制模型都可以在消费级硬件上独立训练，无需大型计算集群。实验表明，使用3DGS数据训练的姿态估计器在具有挑战性的视觉环境中优于使用传统渲染数据训练的姿态估计器。我们在配备RGB摄像头的多指手上验证了该系统，演示了即使在光线极差条件下也能稳健地重新定位五个不同物体。我们的结果强调了高斯喷溅作为仅RGB灵巧操作的实用路径。有关硬件部署视频及更多补充材料，请参阅项目网站：此 https URL

HiEdit: Lifelong Model Editing with Hierarchical Reinforcement Learning

HiEdit：终身模型编辑与层级强化学习

Authors: Yangfan Wang, Tianyang Sun, Chen Tang, Jie Liu, Wei Cai, Jingchi Jiang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.11214
Pdf link: https://arxiv.org/pdf/2604.11214
Abstract Lifelong model editing (LME) aims to sequentially rectify outdated or inaccurate knowledge in deployed LLMs while minimizing side effects on unrelated inputs. However, existing approaches typically apply parameter perturbations to a static and dense set of LLM layers for all editing instances. This practice is counter-intuitive, as we hypothesize that different pieces of knowledge are stored in distinct layers of the model. Neglecting this layer-wise specificity can impede adaptability in integrating new knowledge and result in catastrophic forgetting of both general and previously edited knowledge. To address this, we propose HiEdit, a hierarchical reinforcement learning framework that adaptively identifies the most knowledge-relevant layers for each editing instance. By enabling dynamic, instance-aware layer selection and incorporating an intrinsic reward for sparsity, HiEdit achieves precise, localized updates. Experiments on various LLMs show that HiEdit boosts the performance of the competitive RLEdit by an average of 8.48% while perturbing only half of the layers per edit. Our code is available at: this https URL.
中文摘要 终身模型编辑（LME）旨在顺序纠正已部署的大型语言模型中过时或不准确的知识，同时最大限度地减少对无关输入的副作用。然而，现有方法通常对所有编辑实例的静态且密集的LLM图层施加参数扰动。这种做法违反直觉，因为我们假设不同的知识片段存储在模型的不同层中。忽视这种层级具体性会阻碍整合新知识的适应性，导致一般性和先前编辑知识的灾难性遗忘。为此，我们提出了HiEdit，一种分层强化学习框架，能够自适应地识别每个编辑实例中最有知识相关的层。通过实现动态、实例感知层选择并加入稀疏性内在奖励，HiEdit实现了精准且局部化的更新。在各种大型语言模型上的实验显示，HiEdit 平均能提升竞争激烈的 RLEdit 性能 8.48%，每次编辑仅扰动一半的图层。我们的代码可在以下 https URL 获取。

The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping

过去不是过去：记忆增强的动态奖励塑造

Authors: Yang Liu, Enxi Wang, Yufei Gao, Weixin Zhang, Bo Wang, Zhiyuan Zeng, Yikai Zhang, Yining Zheng, Xipeng Qiu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.11297
Pdf link: https://arxiv.org/pdf/2604.11297
Abstract Despite the success of reinforcement learning for large language models, a common failure mode is reduced sampling diversity, where the policy repeatedly generates similar erroneous behaviors. Classical entropy regularization encourages randomness under the current policy, but does not explicitly discourage recurrent failure patterns across rollouts. We propose MEDS, a Memory-Enhanced Dynamic reward Shaping framework that incorporates historical behavioral signals into reward design. By storing and leveraging intermediate model representations, we capture features of past rollouts and use density-based clustering to identify frequently recurring error patterns. Rollouts assigned to more prevalent error clusters are penalized more heavily, encouraging broader exploration while reducing repeated mistakes. Across five datasets and three base models, MEDS consistently improves average performance over existing baselines, achieving gains of up to 4.13 pass@1 points and 4.37 pass@128 points. Additional analyses using both LLM-based annotations and quantitative diversity metrics show that MEDS increases behavioral diversity during sampling.
中文摘要 尽管强化学习在大型语言模型中取得了成功，但常见的失败模式是采样多样性降低，策略反复产生类似的错误行为。经典熵正则化鼓励当前策略下的随机性，但并不明确阻止跨推广的重复失败模式。我们提出了MEDS，一种记忆增强动态奖励塑造框架，将历史行为信号纳入奖励设计中。通过存储和利用中间模型表示，我们捕捉了过去推广的特征，并利用基于密度的聚类识别频繁出现的错误模式。分配给更常见错误集群的推广会受到更重的惩罚，鼓励更广泛的探索，同时减少重复错误。在五个数据集和三个基础模型中，MEDS持续提升现有基线的平均表现，提升最高4.13 pass@1点和4.37 pass@128点。利用基于LLM的注释和定量多样性指标的额外分析显示，MEDS在抽样过程中增加了行为多样性。
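
The reward-shaping rule in the abstract, penalizing rollouts more heavily when they fall into more prevalent error clusters, can be sketched as below. This is a simplified illustration rather than the MEDS implementation: cluster ids are assumed to be already available (for example from any density-based clustering of stored representations), prevalence is measured within a single batch instead of a memory of past rollouts, and `penalty_scale` is a hypothetical coefficient.

```python
from collections import Counter

def shaped_rewards(rewards, error_clusters, penalty_scale=0.5):
    """Subtract a penalty proportional to the prevalence of a rollout's error cluster.

    `error_clusters[i]` is the cluster id of failed rollout i, or None when
    the rollout is correct or was not assigned to any error cluster.
    """
    counts = Counter(c for c in error_clusters if c is not None)
    total = sum(counts.values())
    shaped = []
    for reward, cluster in zip(rewards, error_clusters):
        if cluster is None or total == 0:
            shaped.append(reward)
        else:
            # More frequent error patterns receive a larger penalty.
            shaped.append(reward - penalty_scale * counts[cluster] / total)
    return shaped

# One correct rollout and three failures, two of which repeat the same error.
result = shaped_rewards([1.0, 0.0, 0.0, 0.0], [None, 0, 0, 1])
print(result)
```

The two rollouts that repeat error pattern 0 end up with a lower shaped reward than the one with the rarer pattern 1, which is the mechanism the abstract credits with discouraging repeated mistakes.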

Low-rank Optimization Trajectories Modeling for LLM RLVR Acceleration

LLM RLVR加速的低秩优化轨迹建模

Authors: Zhipeng Chen, Tao Qian, Wayne Xin Zhao, Ji-Rong Wen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.11446
Pdf link: https://arxiv.org/pdf/2604.11446
Abstract Scaling reinforcement learning with verifiable rewards (RLVR) for large language models (LLMs) has recently emerged as an effective training paradigm for significantly improving model capabilities. However, it requires guiding the model through extensive exploration and learning, and the resulting computational overhead has become a key challenge. To reduce the number of training steps, prior work performs linear extrapolation of model parameters. However, the dynamics of model parameter updates during RLVR training remain insufficiently understood. To further investigate the evolution of LLMs during RLVR training, we conduct empirical experiments and find that the rank-1 subspace of the model does not evolve linearly, and its dominance over the original parameters is further amplified during LoRA training. Based on these insights, we propose Nonlinear Extrapolation of low-rank trajectories (NExt), a novel framework that models and extrapolates low-rank parameter trajectories in a nonlinear manner. Concretely, we first train the model using LoRA and extract the rank-1 subspace of parameter differences at multiple training steps, which is then used for the subsequent nonlinear extrapolation. Afterward, we use the extracted rank-1 subspace to train a predictor that models the trajectory of parameter updates during RLVR, and then perform a predict-extend process to extrapolate model parameters, thereby accelerating RLVR. To further study and understand NExt, we conduct comprehensive experiments that demonstrate the effectiveness and robustness of the method. Our method reduces computational overhead by approximately 37.5% while remaining compatible with a wide range of RLVR algorithms and tasks. We release our code at this https URL.
中文摘要 近年来，面向大型语言模型（LLM）的可验证奖励强化学习（RLVR）的规模化已成为显著提升模型能力的有效训练范式。然而，这需要引导模型进行大量探索和学习，由此带来的巨大计算开销成为关键挑战。为了减少训练步数，先前的工作对模型参数进行线性外推。然而，RLVR训练期间模型参数更新的动态仍未得到充分理解。为了进一步研究 LLM 在 RLVR 训练中的演化，我们进行了实证实验，发现模型的秩-1子空间并非线性演化，且其相对于原始参数的主导地位在 LoRA 训练中被进一步放大。基于上述见解，我们提出了低秩轨迹的非线性外推（Nonlinear Extrapolation of low-rank trajectories，NExt），这是一种新颖的框架，以非线性方式建模和外推低秩参数轨迹。具体来说，我们首先用LoRA训练模型，并在多个训练步骤中提取参数差的秩-1子空间，用于后续的非线性外推。随后，我们利用提取的秩-1子空间训练一个预测器，用于建模RLVR期间参数更新的轨迹，然后执行预测-扩展过程以外推模型参数，从而实现RLVR的加速。为了进一步研究和理解NExt，我们进行了全面的实验，证明了该方法的有效性和稳健性。我们的方法在与多种RLVR算法和任务兼容的同时，将计算开销降低了约37.5%。我们在这个 https URL 中发布代码。

Escaping the Context Bottleneck: Active Context Curation for LLM Agents via Reinforcement Learning

逃离上下文瓶颈：通过强化学习实现LLM代理的主动上下文管理

Authors: Xiaozhe Li, Tianyi Lyu, Yizhao Yang, Liang Shan, Siyi Yang, Ligao Zhang, Zhuoyi Huang, Qingwen Liu, Yang Li
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.11462
Pdf link: https://arxiv.org/pdf/2604.11462
Abstract Large Language Models (LLMs) struggle with long-horizon tasks due to the "context bottleneck" and the "lost-in-the-middle" phenomenon, where accumulated noise from verbose environments degrades reasoning over multi-turn interactions. To address this issue, we introduce a symbiotic framework that decouples context management from task execution. Our architecture pairs a lightweight, specialized policy model, ContextCurator, with a powerful frozen foundation model, TaskExecutor. Trained via reinforcement learning, ContextCurator actively reduces information entropy in the working memory. It aggressively prunes environmental noise while preserving reasoning anchors, that is, sparse data points that are critical for future deductions. On WebArena, our framework improves the success rate of Gemini-3.0-flash from 36.4% to 41.2% while reducing token consumption by 8.8% (from 47.4K to 43.3K). On DeepSearch, it achieves a 57.1% success rate, compared with 53.9%, while reducing token consumption by a factor of 8. Remarkably, a 7B ContextCurator matches the context management performance of GPT-4o, providing a scalable and computationally efficient paradigm for autonomous long-horizon agents.
中文摘要 大型语言模型（LLMs）在长程任务中遇到困难，原因是存在“上下文瓶颈”和“中间迷失”现象，即冗长环境中积累的噪声会削弱多轮交互中的推理能力。为解决这个问题，我们引入了一个共生框架，将上下文管理与任务执行解耦。我们的架构将一个轻量级的专用策略模型ContextCurator与一个强大的冻结基础模型TaskExecutor结合起来。通过强化学习训练，ContextCurator 主动降低工作记忆中的信息熵。它积极修剪环境噪声，同时保留推理锚点，即对未来推理至关重要的稀疏数据点。在WebArena上，我们的框架将Gemini-3.0-flash的成功率从36.4%提升到41.2%，同时token消耗减少了8.8%（从47.4K降至43.3K）。在DeepSearch上，成功率达到57.1%，相比之下为53.9%，同时token消耗降至原来的八分之一。值得注意的是，7B的ContextCurator的上下文管理性能可与GPT-4o媲美，为自主长程智能体提供了可扩展且计算高效的范式。

To Learn or Not to Learn: A Litmus Test for Using Reinforcement Learning in Control

学习还是不学习：在控制中使用强化学习的试金石

Authors: Victor Schulte, Michael Eichelbeck, Matthias Althoff
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2604.11463
Pdf link: https://arxiv.org/pdf/2604.11463
Abstract Reinforcement learning (RL) can be a powerful alternative to classical control methods when standard model-based control is insufficient, e.g., when deriving a suitable model is intractable or impossible. In many cases, however, the choice between model-based and RL-based control is not obvious. Due to the high computational costs of training RL agents, RL-based control should be limited to cases where it is expected to yield superior results compared to model-based control. To the best of our knowledge, there exists no approach to quantify the benefit of RL-based control that does not require RL training. In this work, we present a computationally efficient, purely simulation-based litmus test predicting whether RL-based control is superior to model-based control. Our test evaluates the suitability of the given model for model-based control by analyzing the impact of model uncertainties on the control problem. For this, we use reachset-conformant model identification combined with simulation-based analysis. This is followed by a learnability evaluation of the uncertainties based on correlation analysis. This two-part analysis enables an informed decision on the suitability of RL for a control problem without training an RL agent. We apply our test to several benchmarks, demonstrating its applicability to a wide range of control problems and highlighting the potential to save computational resources.
中文摘要 当标准的基于模型的控制不足时，例如难以或无法推导出合适的模型时，强化学习（RL）可以成为经典控制方法的有力替代方案。然而，在许多情况下，基于模型的控制与基于强化学习的控制之间的选择并不明显。由于训练强化学习智能体的计算成本高昂，基于强化学习的控制应仅限于预期其效果优于基于模型控制的情形。据我们所知，目前尚无无需进行强化学习训练即可量化基于强化学习控制之益处的方法。本研究提出了一种计算高效、纯基于仿真的试金石测试，用于预测基于强化学习的控制是否优于基于模型的控制。我们的测试通过分析模型不确定性对控制问题的影响，评估给定模型对基于模型控制的适用性。为此，我们将可达集一致（reachset-conformant）的模型辨识与基于仿真的分析相结合。随后基于相关性分析对不确定性的可学习性进行评估。这种两部分分析使得无需训练强化学习智能体，即可对强化学习是否适合某一控制问题做出有依据的判断。我们将该测试应用于多个基准，展示了其对广泛控制问题的适用性，并强调了其节省计算资源的潜力。

Three Roles, One Model: Role Orchestration at Inference Time to Close the Performance Gap Between Small and Large Agents

三角色，一模型：推理时的角色编排，以缩小小型与大型智能体之间的性能差距

Authors: S. Aaron McClendon, Jorge Gallego-Feliciano, Stavros Zervoudakis, Antonios Saravanos
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.11465
Pdf link: https://arxiv.org/pdf/2604.11465
Abstract Large language model (LLM) agents show promise on realistic tool-use tasks, but deploying capable agents on modest hardware remains challenging. We study whether inference-time scaffolding alone, without any additional training compute, can improve the performance of a small model in complex multi-step environments. Operating on a single 24 GB GPU, we evaluate Qwen3-8B under both full-precision (FP16, 12K context) and 4-bit quantized (AWQ, 32K context) configurations. Without any intervention, the raw model achieves just 5.4% (FP16) and 3.0% (AWQ) task goal completion. Guided by a systematic failure mode analysis, we introduce a three-tier inference scaffolding pipeline that deploys the same frozen model in three distinct roles: (1) a summarization model that preserves critical artifacts (tokens, credentials, API responses) while compressing dialogue history; (2) the main agent model that reasons over the compressed context; and (3) an isolated correction model that reviews and revises the agent's code output without access to conversation history, breaking repetitive failure loops. Applied to the same unmodified model, this scaffolding yields 8.9% (FP16) and 5.9% (AWQ) task goal completion, roughly doubling performance in both settings, with particularly strong gains on difficulty-1 tasks (15.8% → 26.3% FP16; 5.3% → 14.0% AWQ). On full-precision inference, our scaffolded 8B model surpasses DeepSeek-Coder 33B Instruct (7.1%) from the original AppWorld evaluation, demonstrating that structured inference-time interventions can make small models competitive with systems 4× their size. We formalize the approach as a scaffolded policy over a frozen base model, three invocations of the same weights with different conditioning, drawing connections to test-time compute scaling and action-space shaping in reinforcement learning.
中文摘要 大型语言模型（LLM）智能体在现实的工具使用任务中展现出潜力，但在普通硬件上部署有能力的智能体仍然具有挑战性。我们研究仅靠推理时脚手架、无需任何额外训练计算，是否能在复杂的多步环境中提升小型模型的性能。我们在单块24GB的GPU上运行，在全精度（FP16，12K上下文）和4位量化（AWQ，32K上下文）两种配置下评估Qwen3-8B。在没有任何干预的情况下，原始模型的任务目标完成率仅为5.4%（FP16）和3.0%（AWQ）。在系统性失败模式分析的指导下，我们引入了三层推理脚手架流水线，将同一冻结模型部署为三个不同角色：（1）摘要模型，在压缩对话历史的同时保留关键工件（令牌、凭证、API响应）；（2）主智能体模型，在压缩后的上下文上进行推理；以及（3）隔离的修正模型，在不访问对话历史的情况下审查并修订智能体的代码输出，从而打破重复的失败循环。应用于同一未修改的模型，该脚手架使任务目标完成率达到8.9%（FP16）和5.9%（AWQ），在两种设置下性能均约翻倍，在难度1任务上提升尤为显著（FP16：15.8%→26.3%；AWQ：5.3%→14.0%）。在全精度推理下，我们的脚手架式8B模型超过了原始AppWorld评估中的DeepSeek-Coder 33B Instruct（7.1%），证明结构化的推理时干预可以让小型模型与规模为其4倍的系统竞争。我们将该方法形式化为基于冻结基础模型的脚手架策略，即以不同条件三次调用同一组权重，并将其与测试时计算扩展以及强化学习中的动作空间塑形联系起来。
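
The three-role pipeline described in this abstract can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: the `Model` type, the function name `scaffolded_step`, and the prompt strings are all hypothetical, and the "model" is any function from a prompt to text.

```python
from typing import Callable, List

# Hypothetical stand-in for one frozen model: any function from prompt to text.
Model = Callable[[str], str]


def scaffolded_step(model: Model, history: List[str], task: str) -> str:
    """Run one agent step by calling the same model in three roles."""
    # Role 1 (summarizer): compress the dialogue history while asking the
    # model to keep critical artifacts. The prompt wording is an assumption.
    summary = model(
        "ROLE=summarizer\nKeep tokens, credentials and API responses.\n"
        + "\n".join(history)
    )
    # Role 2 (main agent): reason over the compressed context only.
    draft = model(f"ROLE=agent\nTask: {task}\nContext: {summary}")
    # Role 3 (isolated corrector): sees the draft code but not the history.
    return model(f"ROLE=corrector\nReview and revise this code:\n{draft}")
```

The point of the sketch is the isolation in the third call: the corrector prompt is built from the draft alone, so it cannot inherit a failure loop that lives in the conversation history.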

OOM-RL: Out-of-Money Reinforcement Learning Market-Driven Alignment for LLM-Based Multi-Agent Systems

OOM-RL：资金耗尽（Out-of-Money）强化学习，面向基于LLM的多智能体系统的市场驱动对齐

Authors: Kun Liu, Liqun Chen
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE); Trading and Market Microstructure (q-fin.TR)
Arxiv link: https://arxiv.org/abs/2604.11477
Pdf link: https://arxiv.org/pdf/2604.11477
Abstract The alignment of Multi-Agent Systems (MAS) for autonomous software engineering is constrained by evaluator epistemic uncertainty. Current paradigms, such as Reinforcement Learning from Human Feedback (RLHF) and AI Feedback (RLAIF), frequently induce model sycophancy, while execution-based environments suffer from adversarial "Test Evasion" by unconstrained agents. In this paper, we introduce an objective alignment paradigm: Out-of-Money Reinforcement Learning (OOM-RL). By deploying agents into the non-stationary, high-friction reality of live financial markets, we utilize critical capital depletion as an un-hackable negative gradient. Our longitudinal 20-month empirical study (July 2024 to February 2026) chronicles the system's evolution from a high-turnover, sycophantic baseline to a robust, liquidity-aware architecture. We demonstrate that the undeniable ontological consequences of financial loss forced the MAS to abandon overfitted hallucinations in favor of the Strict Test-Driven Agentic Workflow (STDAW), which enforces a Byzantine-inspired uni-directional state lock (RO-Lock) anchored to a deterministically verified ≥ 95% code coverage constraint matrix. Our results show that while early iterations suffered severe execution decay, the final OOM-RL-aligned system achieved a stable equilibrium with an annualized Sharpe ratio of 2.06 in its mature phase. We conclude that substituting subjective human preference with rigorous economic penalties provides a robust methodology for aligning autonomous agents in high-stakes, real-world environments, laying the groundwork for generalized paradigms where computational billing acts as an objective physical constraint.
中文摘要 多智能体系统（MAS）在自主软件工程中的对齐受评估者认知不确定性的约束。当前的范式，如基于人类反馈的强化学习（RLHF）和基于AI反馈的强化学习（RLAIF），经常导致模型谄媚，而基于执行的环境则遭遇无约束智能体的对抗性“测试规避”。本文引入了一个客观对齐范式：资金耗尽强化学习（Out-of-Money Reinforcement Learning，OOM-RL）。通过将智能体部署到非平稳、高摩擦的实时金融市场现实中，我们利用严重的资本耗尽作为无法被破解的负梯度。我们为期20个月的纵向实证研究（2024年7月至2026年2月）记录了该系统从高换手率、谄媚的基线向稳健且具备流动性意识的架构的演变。我们证明，财务损失不可否认的本体论后果迫使MAS放弃过拟合的幻觉，转而采用严格测试驱动智能体工作流（STDAW），该工作流强制执行一个受拜占庭启发的单向状态锁（RO-Lock），并锚定于经确定性验证的≥95%代码覆盖率约束矩阵。我们的结果表明，尽管早期迭代出现了严重的执行衰减，最终经OOM-RL对齐的系统在成熟阶段实现了稳定均衡，年化夏普比率为2.06。我们得出结论，用严格的经济惩罚替代主观的人类偏好，为在高风险的现实环境中对齐自主智能体提供了稳健的方法论，并为以计算计费作为客观物理约束的广义范式奠定了基础。

CAGenMol: Condition-Aware Diffusion Language Model for Goal-Directed Molecular Generation

CAGenMol：目标导向分子生成的条件感知扩散语言模型

Authors: Yanting Li, Zhuoyang Jiang, Enyan Dai, Lei Wang, Wen-Cai Ye, Li Liu
Subjects: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Arxiv link: https://arxiv.org/abs/2604.11483
Pdf link: https://arxiv.org/pdf/2604.11483
Abstract Goal-directed molecular generation requires satisfying heterogeneous constraints such as protein--ligand compatibility and multi-objective drug-like properties, yet existing methods often optimize these constraints in isolation, failing to reconcile conflicting objectives (e.g., affinity vs. safety), and struggle to navigate the non-differentiable chemical space without compromising structural validity. To address these challenges, we propose CAGenMol, a condition-aware discrete diffusion framework over molecular sequences that formulates molecular design as conditional denoising guided by heterogeneous structural and property signals. By coupling discrete diffusion with reinforcement learning, the model aligns the generation trajectory with non-differentiable objectives while preserving chemical validity and diversity. The non-autoregressive nature of diffusion language model further enables iterative refinement of molecular fragments at inference time. Experiments on structure-conditioned, property-conditioned, and dual-conditioned benchmarks demonstrate consistent improvements over state-of-the-art methods in binding affinity, drug-likeness, and success rate, highlighting the effectiveness of our framework.
中文摘要 目标导向分子生成需要满足蛋白质-配体相容性和多目标类药性质等异质约束，但现有方法常常孤立地优化这些约束，未能调和相互冲突的目标（如亲和力与安全性），并且难以在不损害结构有效性的前提下在不可微的化学空间中探索。为应对这些挑战，我们提出了CAGenMol，一种基于分子序列的条件感知离散扩散框架，将分子设计表述为由异质结构信号和性质信号引导的条件去噪。通过将离散扩散与强化学习结合，模型使生成轨迹与不可微目标对齐，同时保持化学有效性和多样性。扩散语言模型的非自回归性质进一步支持在推理时对分子片段进行迭代细化。在结构条件、性质条件和双条件基准上的实验表明，该方法在结合亲和力、类药性和成功率方面均持续优于最先进方法，凸显了我们框架的有效性。

Policy Split: Incentivizing Dual-Mode Exploration in LLM Reinforcement with Dual-Mode Entropy Regularization

策略分裂：通过双模熵正则化激励LLM强化中的双模探索

Authors: Jiashu Yao, Heyan Huang, Chuwei Luo, Daiqing Wu, Zeming Liu, Yuhang Guo, Yangyang Kang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.11510
Pdf link: https://arxiv.org/pdf/2604.11510
Abstract To encourage diverse exploration in reinforcement learning (RL) for large language models (LLMs) without compromising accuracy, we propose Policy Split, a novel paradigm that bifurcates the policy into normal and high-entropy modes with a high-entropy prompt. While sharing model parameters, the two modes undergo collaborative dual-mode entropy regularization tailored to distinct objectives. Specifically, the normal mode optimizes for task correctness, while the high-entropy mode incorporates a preference for exploration, and the two modes learn collaboratively. Extensive experiments demonstrate that our approach consistently outperforms established entropy-guided RL baselines across various model sizes in general and creative tasks. Further analysis reveals that Policy Split facilitates dual-mode exploration, where the high-entropy mode generates distinct behavioral patterns to the normal mode, providing unique learning signals.
中文摘要 为了鼓励在不牺牲准确性的前提下，在大型语言模型（LLM）强化学习（RL）中进行多样化探索，我们提出了策略分裂（Policy Split）这一新范式，将策略分为正常模式和高熵模式，并配有高熵提示。在共享模型参数的同时，这两种模式会根据不同目标进行协同的双模熵正则化。具体来说，正常模式优化任务正确性，而高熵模式则偏好探索，两种模式是协作学习的。大量实验表明，我们的方法在不同模型规模的总体和创造性任务中，始终优于既有的熵引导强化学习基线。进一步分析显示，策略分裂促进了双模式探索，其中高熵模式产生与正常模式不同的行为模式，提供独特的学习信号。

Triviality Corrected Endogenous Reward

平凡性校正的内生奖励

Authors: Xinda Wang, Zhengxu Hou, Yangshijie Zhang, Bingren Yan, Jialin Liu, Chenzhuo Zhao, Zhibo Yang, Bin-Bin Yang, Feng Xiao
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.11522
Pdf link: https://arxiv.org/pdf/2604.11522
Abstract Reinforcement learning for open-ended text generation is constrained by the lack of verifiable rewards, necessitating reliance on judge models that require either annotated data or powerful closed-source models. Inspired by recent work on unsupervised reinforcement learning for mathematical reasoning using confidence-based endogenous rewards, we investigate whether this principle can be adapted to open-ended writing tasks. We find that directly applying confidence rewards leads to Triviality Bias: the policy collapses toward high-probability outputs, reducing diversity and meaningful content. We propose TCER (Triviality Corrected Endogenous Reward), which addresses this bias by rewarding the relative information gain between a specialist policy and a generalist reference policy, modulated by a probability-dependent correction mechanism. Across multiple writing benchmarks and model architectures, TCER achieves consistent improvements without external supervision. Furthermore, TCER also transfers effectively to mathematical reasoning, validating the generality of our approach across different generation tasks.
中文摘要 开放式文本生成的强化学习受限于缺乏可验证的奖励，因此不得不依赖需要标注数据或强大闭源模型的评判模型。受近期利用基于置信度的内生奖励进行数学推理无监督强化学习的研究启发，我们探讨该原则是否可以应用于开放式写作任务。我们发现，直接应用置信度奖励会导致平凡性偏差（Triviality Bias）：策略向高概率输出坍缩，减少了多样性和有意义的内容。我们提出了TCER（平凡性校正的内生奖励），它通过奖励专家策略相对于通用参考策略的相对信息增益，并由一个依赖概率的校正机制加以调节，来解决这一偏差。在多个写作基准和模型架构上，TCER在无需外部监督的情况下实现了一致的提升。此外，TCER还能有效迁移到数学推理，验证了我们的方法在不同生成任务中的通用性。
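
As a rough illustration of "relative information gain between a specialist policy and a generalist reference policy", the sketch below scores a sequence by the per-token log-probability difference between the two policies. The abstract does not give the actual formula, so everything here is an assumption: the function name `info_gain_reward` is hypothetical, and the `1 - p` weight is only a placeholder for the paper's probability-dependent correction, chosen so that tokens the specialist already finds near-certain contribute little.

```python
import math
from typing import Sequence


def info_gain_reward(spec_logps: Sequence[float],
                     ref_logps: Sequence[float]) -> float:
    """Mean weighted log-prob gain of a specialist over a reference policy."""
    if len(spec_logps) != len(ref_logps) or not spec_logps:
        raise ValueError("need two equal-length, non-empty log-prob sequences")
    total = 0.0
    for ls, lr in zip(spec_logps, ref_logps):
        gain = ls - lr                 # relative information gain for this token
        weight = 1.0 - math.exp(ls)    # ASSUMED placeholder correction, not the paper's
        total += weight * gain
    return total / len(spec_logps)
```

Under this placeholder, a token the specialist predicts with probability 1 earns no reward regardless of the reference, which is one simple way to discourage collapse toward trivially high-probability outputs.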

RLSpoofer: A Lightweight Evaluator for LLM Watermark Spoofing Resilience

RLSpoofer：一款用于LLM水印伪造韧性的轻量级评估器

Authors: Hanbo Huang, Xuan Gong, Yiran Zhang, Hao Zheng, Shiyu Liang
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2604.11546
Pdf link: https://arxiv.org/pdf/2604.11546
Abstract Large language model (LLM) watermarking has emerged as a promising approach for detecting and attributing AI-generated text, yet its robustness to black-box spoofing remains insufficiently evaluated. Existing evaluation methods often demand extensive datasets and white-box access to algorithmic internals, limiting their practical applicability. In this paper, we study watermark resilience against spoofing fundamentally from a distributional perspective. We first establish a local capacity bottleneck, which theoretically characterizes the probability mass that can be reallocated under KL-bounded local updates while preserving semantic fidelity. Building on this, we propose RLSpoofer, a reinforcement learning-based black-box spoofing attack that requires only 100 human-watermarked paraphrase training pairs and zero access to the watermarking internals or detectors. Despite weak supervision, it empowers a 4B model to achieve a 62.0% spoof success rate with minimal semantic shift on PF-marked texts, dwarfing the 6% of baseline models trained on up to 10,000 samples. Our findings expose the fragile spoofing resistance of current LLM watermarking paradigms, providing a lightweight evaluation framework and stressing the urgent need for more robust schemes.
中文摘要 大型语言模型（LLM）水印已成为检测和归因AI生成文本的一种有前景的方法，但其对黑盒伪造攻击的鲁棒性仍未得到充分评估。现有的评估方法通常需要大量数据集以及对算法内部的白盒访问，限制了其实际适用性。本文从分布视角出发，从根本上研究水印抵御伪造的韧性。我们首先建立了一个局部容量瓶颈（local capacity bottleneck），从理论上刻画了在保持语义保真度的前提下，在KL有界的局部更新下可被重新分配的概率质量。基于此，我们提出了RLSpoofer，这是一种基于强化学习的黑盒伪造攻击，只需100对人类文本与带水印文本的释义训练样本，且完全无需访问水印内部机制或检测器。尽管监督信号很弱，它仍使一个4B模型在PF水印文本上以极小的语义偏移实现了62.0%的伪造成功率，远超在多达10,000个样本上训练的基线模型的6%。我们的发现揭示了当前LLM水印范式脆弱的抗伪造能力，提供了一个轻量级的评估框架，并强调了对更稳健方案的迫切需求。

Eliciting Medical Reasoning with Knowledge-enhanced Data Synthesis: A Semi-Supervised Reinforcement Learning Approach

通过知识增强的数据合成激发医学推理：一种半监督强化学习方法

Authors: Haolin Li, Shuyang Jiang, Ruipeng Zhang, Jiangchao Yao, Ya Zhang, Yanfeng Wang
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.11547
Pdf link: https://arxiv.org/pdf/2604.11547
Abstract While large language models hold promise for complex medical applications, their development is hindered by the scarcity of high-quality reasoning data. To address this issue, existing approaches typically distill chain-of-thought reasoning traces from large proprietary models via supervised fine-tuning, then conduct reinforcement learning (RL). These methods exhibit limited improvement on underrepresented domains like rare diseases while incurring substantial costs from generating complex reasoning chains. To efficiently enhance medical reasoning, we propose MedSSR, a Medical Knowledge-enhanced data Synthesis and Semi-supervised Reinforcement learning framework. Our framework first employs rare disease knowledge to synthesize distribution-controllable reasoning questions. We then utilize the policy model itself to generate high-quality pseudo-labels. This enables a two-stage, intrinsic-to-extrinsic training paradigm: self-supervised RL on the pseudo-labeled synthetic data, followed by supervised RL on the human-annotated real data. MedSSR scales model training efficiently without relying on costly trace distillation. Extensive experiments on Qwen and Llama demonstrate that our method outperforms existing methods across ten medical benchmarks, achieving up to +5.93% gain on rare-disease tasks. Our code is available at this https URL.
中文摘要 虽然大型语言模型在复杂的医疗应用中具有潜力，但其发展受限于高质量推理数据的稀缺。为解决这一问题，现有方法通常通过监督微调从大型专有模型中蒸馏思维链推理轨迹，然后进行强化学习（RL）。这些方法在罕见病等代表性不足的领域上提升有限，同时生成复杂推理链的成本也很高。为了高效提升医学推理能力，我们提出了MedSSR，一种医学知识增强的数据合成与半监督强化学习框架。我们的框架首先利用罕见病知识来合成分布可控的推理问题。然后我们利用策略模型本身生成高质量的伪标签。这实现了两阶段的由内在到外在的训练范式：先在带伪标签的合成数据上进行自监督强化学习，再在人工标注的真实数据上进行监督强化学习。MedSSR能够高效地扩展模型训练，而无需依赖昂贵的推理轨迹蒸馏。在Qwen和Llama上的大量实验表明，我们的方法在十个医学基准上优于现有方法，在罕见病任务上最高提升+5.93%。我们的代码可在此 https URL 访问。

Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale

Relax：一个面向大规模全模态后训练的异步强化学习引擎

Authors: Liujie Zhang, Benzhe Ning, Rui Yang, Xiaoyan Yu, Jiaxing Li, Lumeng Wu, Jia Liu, Minghao Li, Weihang Chen, Weiqi Hu, Lei Zhang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.11554
Pdf link: https://arxiv.org/pdf/2604.11554
Abstract Reinforcement learning (RL) post-training has proven effective at unlocking reasoning, self-reflection, and tool-use capabilities in large language models. As models extend to omni-modal inputs and agentic multi-turn workflows, RL training systems face three interdependent challenges: heterogeneous data flows, operational robustness at scale, and the staleness-throughput tradeoff. We present Relax (Reinforcement Engine Leveraging Agentic X-modality), an open-source RL training engine that addresses these challenges through three co-designed architectural layers. First, an omni-native architecture builds multimodal support into the full stack (from data preprocessing and modality-aware parallelism to inference generation) rather than retrofitting it onto a text-centric pipeline. Second, each RL role runs as an independent, fault-isolated service that can be scaled, recovered, and upgraded without global coordination. Third, service-level decoupling enables asynchronous training via the TransferQueue data bus, where a single staleness parameter smoothly interpolates among on-policy, near-on-policy, and fully asynchronous execution. Relax achieves a 1.20× end-to-end speedup over veRL on Qwen3-4B on-policy training. Its fully async mode delivers a 1.76× speedup over colocate on Qwen3-4B and a 2.00× speedup on Qwen3-Omni-30B, while all modes converge to the same reward level. Relax supports R3 (Rollout Routing Replay) \cite{ma2025r3} for MoE models with only 1.9% overhead, compared to 32% degradation in veRL under the same configuration. It further demonstrates stable omni-modal RL convergence on Qwen3-Omni across image, text, and audio, sustaining over 2,000 steps on video without degradation. Relax is available at this https URL.
中文摘要 强化学习（RL）后训练已被证明能有效释放大型语言模型中的推理、自我反思和工具使用能力。随着模型扩展到全模态输入和智能体式多轮工作流，强化学习训练系统面临三个相互依赖的挑战：异构数据流、大规模下的运行鲁棒性，以及陈旧度与吞吐量之间的权衡。我们介绍了 Relax（Reinforcement Engine Leveraging Agentic X-modality），这是一个开源的强化学习训练引擎，通过三个协同设计的架构层来解决这些挑战。首先，全模态原生（omni-native）架构将多模态支持内建于全栈之中（从数据预处理、模态感知并行到推理生成），而不是将其后期嫁接到以文本为中心的流水线上。其次，每个强化学习角色作为独立、故障隔离的服务运行，可以在无需全局协调的情况下扩展、恢复和升级。第三，服务级解耦通过TransferQueue数据总线实现异步训练，单一的陈旧度参数即可在同策略（on-policy）、近同策略和完全异步执行之间平滑插值。在Qwen3-4B的同策略训练上，Relax相比veRL实现了1.20×的端到端加速。其全异步模式在Qwen3-4B上比共置（colocate）模式加速1.76×，在Qwen3-Omni-30B上加速2.00×，且所有模式都收敛到相同的奖励水平。Relax 支持用于 MoE 模型的 R3（Rollout Routing Replay）\cite{ma2025r3}，开销仅为 1.9%，而 veRL 在相同配置下性能下降 32%。它进一步展示了Qwen3-Omni在图像、文本和音频上稳定的全模态强化学习收敛，并在视频上持续超过2,000步而不出现退化。Relax 可通过此 https 网址获取。
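
One common way to read "a single staleness parameter" is as a bound on how many policy versions a rollout may lag behind the trainer. The sketch below illustrates that reading only; the abstract does not define the parameter, and the names `Rollout`, `admissible`, and `max_staleness` are hypothetical rather than part of Relax or TransferQueue.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    policy_version: int  # version of the policy that generated this sample
    payload: str         # placeholder for the trajectory data


def admissible(rollouts: List[Rollout], trainer_version: int,
               max_staleness: int) -> List[Rollout]:
    """Keep rollouts whose version lag is within the staleness bound.

    max_staleness == 0 admits only on-policy samples; a small positive value
    gives near-on-policy training; a very large value is effectively fully
    asynchronous. This interpretation is an assumption.
    """
    return [r for r in rollouts
            if trainer_version - r.policy_version <= max_staleness]
```

For example, with the trainer at version 5 and rollouts from versions 5, 4, and 1, a bound of 0 keeps only the first, a bound of 1 keeps the first two, and a large bound keeps all three.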

Geoparsing: Diagram Parsing for Plane and Solid Geometry with a Unified Formal Language

Geoparsing：基于统一形式语言的平面与立体几何图形解析

Authors: Peijie Wang, Ming-Liang Zhang, Jun Cao, Chao Deng, Dekang Ran, Hongda Sun, Pi Bu, Xuan Zhang, Yingyao Wang, Jun Song, Bo Zheng, Fei Yin, Cheng-Lin Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.11600
Pdf link: https://arxiv.org/pdf/2604.11600
Abstract Multimodal Large Language Models (MLLMs) have achieved remarkable progress but continue to struggle with geometric reasoning, primarily due to the perception bottleneck regarding fine-grained visual elements. While formal languages have aided plane geometry understanding, solid geometry which requires spatial understanding remains largely unexplored. In this paper, we address this challenge by designing a unified formal language that integrates plane and solid geometry, comprehensively covering geometric structures and semantic relations. We construct GDP-29K, a large-scale dataset comprising 20k plane and 9k solid geometry samples collected from diverse real-world sources, each paired with its ground-truth formal description. To ensure syntactic correctness and geometric consistency, we propose a training paradigm that combines Supervised Fine-Tuning with Reinforcement Learning via Verifiable Rewards. Experiments show that our approach achieves state-of-the-art parsing performance. Furthermore, we demonstrate that our parsed formal descriptions serve as a critical cognitive scaffold, significantly boosting MLLMs' capabilities for downstream geometry reasoning tasks. Our data and code are available at Geoparsing.
中文摘要 多模态大型语言模型（MLLM）取得了显著进展，但在几何推理方面仍面临困难，主要原因是对细粒度视觉元素的感知瓶颈。虽然形式语言有助于平面几何的理解，但需要空间理解的立体几何在很大程度上仍未被探索。本文通过设计一种统一的形式语言来应对这一挑战，该语言整合了平面几何和立体几何，全面涵盖几何结构和语义关系。我们构建了GDP-29K，这是一个包含2万个平面几何样本和9千个立体几何样本的大规模数据集，样本来自多种现实世界来源，每个样本都配有其真值形式描述。为确保句法正确性和几何一致性，我们提出了一种将监督微调与基于可验证奖励的强化学习相结合的训练范式。实验表明，我们的方法实现了最先进的解析性能。此外，我们证明了解析得到的形式描述可作为关键的认知支架，显著提升MLLM在下游几何推理任务中的能力。我们的数据和代码可在 Geoparsing 获取。

Utilizing and Calibrating Hindsight Process Rewards via Reinforcement with Mutual Information Self-Evaluation

通过互信息自我评估强化学习来利用并校准事后过程奖励

Authors: Jiashu Yao, Heyan Huang, Zeming Liu, Yuhang Guo
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.11611
Pdf link: https://arxiv.org/pdf/2604.11611
Abstract To overcome the sparse reward challenge in reinforcement learning (RL) for agents based on large language models (LLMs), we propose Mutual Information Self-Evaluation (MISE), an RL paradigm that utilizes hindsight generative self-evaluation as dense reward signals while simultaneously calibrating them against the environmental feedbacks. Empirically, MISE enables an agent to learn autonomously from dense internal rewards supplementing sparse extrinsic signals. Theoretically, our work provides the first formal foundation for the paradigm of generative self-rewarding. We prove that utilizing hindsight self-evaluation rewards is equivalent to minimizing an objective that combines mutual information with a KL divergence term between the policy and a proxy reward policy. This theoretical insight then informs and justifies our calibration step, which actively aligns these rewards with the optimal policy. Extensive experiments show that MISE outperforms strong baselines, enabling open-source LLMs about 7B parameters to achieve performance comparable to GPT-4o on validation without expert supervision.
中文摘要 为了克服基于大型语言模型（LLMs）的智能体在强化学习（RL）中奖励稀疏的挑战，我们提出了互信息自我评估（MISE），这是一种RL范式，它利用事后的生成式自我评估作为稠密奖励信号，同时依据环境反馈对其进行校准。从实证角度看，MISE使智能体能够借助稠密的内部奖励自主学习，以补充稀疏的外在信号。从理论角度看，我们的工作为生成式自我奖励范式提供了首个形式化基础。我们证明，利用事后自我评估奖励等价于最小化一个目标，该目标将互信息与策略和代理奖励策略之间的KL散度项结合在一起。这一理论洞见进而指导并论证了我们的校准步骤，该步骤主动将这些奖励与最优策略对齐。大量实验表明，MISE优于强基线，使约7B参数的开源大型语言模型在无需专家监督的情况下，在验证中取得与GPT-4o相当的性能。

RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time

RationalRewards：推理式奖励在训练时和测试时均可扩展视觉生成

Authors: Haozhe Wang, Cong Wei, Weiming Ren, Jiaming Liu, Fangzhen Lin, Wenhu Chen
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.11626
Pdf link: https://arxiv.org/pdf/2604.11626
Abstract Most reward models for visual generation reduce rich human judgments to a single unexplained score, discarding the reasoning that underlies preference. We show that teaching reward models to produce explicit, multi-dimensional critiques before scoring transforms them from passive evaluators into active optimization tools, improving generators in two complementary ways: at training time, structured rationales provide interpretable, fine-grained rewards for reinforcement learning; at test time, a Generate-Critique-Refine loop turns critiques into targeted prompt revisions that improve outputs without any parameter updates. To train such a reward model without costly rationale annotations, we introduce Preference-Anchored Rationalization (PARROT), a principled framework that recovers high-quality rationales from readily available preference data through anchored generation, consistency filtering, and distillation. The resulting model, RationalRewards (8B), achieves state-of-the-art preference prediction among open-source reward models, competitive with Gemini-2.5-Pro, while using 10-20x less training data than comparable baselines. As an RL reward, it consistently improves text-to-image and image-editing generators beyond scalar alternatives. Most strikingly, its test-time critique-and-refine loop matches or exceeds RL-based fine-tuning on several benchmarks, suggesting that structured reasoning can unlock latent capabilities in existing generators that suboptimal prompts fail to elicit.
中文摘要 大多数视觉生成的奖励模型将丰富的人类判断简化为单一的、未加解释的评分，丢弃了偏好背后的推理。我们表明，教奖励模型在评分前给出明确的多维度批评，可以将其从被动评估者转变为主动优化工具，从而以两种互补方式改进生成器：在训练时，结构化的理由为强化学习提供可解释、细粒度的奖励；在测试时，生成-批评-精炼（Generate-Critique-Refine）循环将批评转化为有针对性的提示修订，无需任何参数更新即可提升输出质量。为了在无需昂贵的理由标注的情况下训练这样的奖励模型，我们引入了偏好锚定的理由生成（Preference-Anchored Rationalization，PARROT），这是一个有原则的框架，通过锚定生成、一致性过滤和蒸馏，从现成的偏好数据中恢复高质量的理由。由此得到的模型RationalRewards（8B）在开源奖励模型中实现了最先进的偏好预测，可与Gemini-2.5-Pro相媲美，同时使用的训练数据比同类基线少10-20倍。作为强化学习的奖励，它对文本到图像和图像编辑生成器的提升始终优于标量奖励方案。最引人注目的是，其测试时的批评与精炼循环在多个基准上达到甚至超过基于强化学习的微调，表明结构化推理能够解锁现有生成器中次优提示无法激发的潜在能力。

Back to Basics: Let Conversational Agents Remember with Just Retrieval and Generation

回归基础：让对话代理仅通过检索和生成来记忆

Authors: Yuqian Wu, Wei Chen, Zhengjun Huang, Junle Chen, Qingxiang Liu, Kai Wang, Xiaofang Zhou, Yuxuan Liang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.11628
Pdf link: https://arxiv.org/pdf/2604.11628
Abstract Existing conversational memory systems rely on complex hierarchical summarization or reinforcement learning to manage long-term dialogue history, yet remain vulnerable to context dilution as conversations grow. In this work, we offer a different perspective: the primary bottleneck may lie not in memory architecture, but in the Signal Sparsity Effect within the latent knowledge manifold. Through controlled experiments, we identify two key phenomena: Decisive Evidence Sparsity, where relevant signals become increasingly isolated with longer sessions, leading to sharp degradation in aggregation-based methods; and Dual-Level Redundancy, where both inter-session interference and intra-session conversational filler introduce large amounts of non-informative content, hindering effective generation. Motivated by these insights, we propose \method, a minimalist framework that brings conversational memory back to basics, relying solely on retrieval and generation via Turn Isolation Retrieval (TIR) and Query-Driven Pruning (QDP). TIR replaces global aggregation with a max-activation strategy to capture turn-level signals, while QDP removes redundant sessions and conversational filler to construct a compact, high-density evidence set. Extensive experiments on multiple benchmarks demonstrate that \method achieves robust performance across diverse settings, consistently outperforming strong baselines while maintaining high efficiency in tokens and latency, establishing a new minimalist baseline for conversational memory.
中文摘要 现有的对话记忆系统依赖复杂的层级摘要或强化学习来管理长期对话历史，但随着对话的增长，仍然容易受到上下文稀释的影响。在本研究中，我们提供了一个不同的视角：主要瓶颈可能不在于记忆架构，而在于潜在知识流形中的信号稀疏效应（Signal Sparsity Effect）。通过受控实验，我们识别出两个关键现象：决定性证据稀疏（Decisive Evidence Sparsity），即相关信号随着会话变长而愈发孤立，导致基于聚合的方法急剧退化；以及双层冗余（Dual-Level Redundancy），即会话间的干扰和会话内的对话填充都会引入大量无信息内容，阻碍有效生成。基于这些见解，我们提出了 \method，这是一个极简框架，让对话记忆回归本源，仅依靠检索和生成，通过轮次隔离检索（TIR）和查询驱动剪枝（QDP）实现。TIR用最大激活策略取代全局聚合以捕捉轮次级信号，而QDP则去除冗余会话和对话填充，构建紧凑且高密度的证据集。在多个基准上的大量实验表明，\method 在不同设置下都能取得稳健的性能，持续优于强基线，同时在token和延迟方面保持高效率，为对话记忆建立了新的极简基线。

Playing Along: Learning a Double-Agent Defender for Belief Steering via Theory of Mind

将计就计：通过心智理论学习用于信念引导的双面间谍防御者

Authors: Hanqi Xiao, Vaidehi Patil, Zaid Khan, Hyunji Lee, Elias Stengel-Eskin, Mohit Bansal
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.11666
Pdf link: https://arxiv.org/pdf/2604.11666
Abstract As large language models (LLMs) become the engine behind conversational systems, their ability to reason about the intentions and states of their dialogue partners (i.e., form and use a theory-of-mind, or ToM) becomes increasingly critical for safe interaction with potentially adversarial partners. We propose a novel privacy-themed ToM challenge, ToM for Steering Beliefs (ToM-SB), in which a defender must act as a Double Agent to steer the beliefs of an attacker with partial prior knowledge within a shared universe. To succeed on ToM-SB, the defender must engage with and form a ToM of the attacker, with a goal of fooling the attacker into believing they have succeeded in extracting sensitive information. We find that strong frontier models like Gemini3-Pro and GPT-5.4 struggle on ToM-SB, often failing to fool attackers in hard scenarios with partial attacker prior knowledge, even when prompted to reason about the attacker's beliefs (ToM prompting). To close this gap, we train models on ToM-SB to act as AI Double Agents using reinforcement learning, testing both fooling and ToM rewards. Notably, we find a bidirectionally emergent relationship between ToM and attacker-fooling: rewarding fooling success alone improves ToM, and rewarding ToM alone improves fooling. Across four attackers with different strengths, six defender methods, and both in-distribution and out-of-distribution (OOD) evaluation, we find that gains in ToM and attacker-fooling are well-correlated, highlighting belief modeling as a key driver of success on ToM-SB. AI Double Agents that combine both ToM and fooling rewards yield the strongest fooling and ToM performance, outperforming Gemini3-Pro and GPT-5.4 with ToM prompting on hard scenarios. We also show that ToM-SB and AI Double Agents can be extended to stronger attackers, demonstrating generalization to OOD settings and the upgradability of our task.
中文摘要 随着大型语言模型（LLMs）成为对话系统背后的引擎，它们推理对话对象意图和状态的能力（即形成并使用心智理论，ToM）对于与潜在对抗性对象的安全交互变得越来越关键。我们提出了一个新颖的隐私主题ToM挑战：用于引导信念的ToM（ToM-SB），其中防御者必须充当双面间谍，在共享世界中引导具有部分先验知识的攻击者的信念。要在ToM-SB上取得成功，防御者必须与攻击者互动并形成关于攻击者的ToM，目标是欺骗攻击者，使其相信自己已成功提取敏感信息。我们发现，像Gemini3-Pro和GPT-5.4这样的强前沿模型在ToM-SB上表现不佳，在攻击者具有部分先验知识的困难场景中，即使被提示去推理攻击者的信念（ToM提示），也常常无法欺骗攻击者。为弥合这一差距，我们使用强化学习在ToM-SB上训练模型充当AI双面间谍，并测试了欺骗奖励和ToM奖励。值得注意的是，我们发现ToM与欺骗攻击者之间存在双向涌现关系：仅奖励欺骗成功即可提升ToM，仅奖励ToM也能提升欺骗效果。在四个不同强度的攻击者、六种防御者方法，以及分布内和分布外（OOD）评估中，我们发现ToM的提升与欺骗攻击者能力的提升高度相关，凸显信念建模是在ToM-SB上取得成功的关键驱动力。同时结合ToM奖励和欺骗奖励的AI双面间谍取得了最强的欺骗和ToM表现，在困难场景中优于使用ToM提示的Gemini3-Pro和GPT-5.4。我们还展示了ToM-SB和AI双面间谍可以扩展到更强的攻击者，证明了对OOD设置的泛化能力以及我们任务的可升级性。

Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games

协作多智能体脚本生成，用于增强谋杀悬疑游戏中不完美信息推理能力

Authors: Keyang Zhong, Junlin Xie, Hefeng Wu, Haofeng Li, Guanbin Li
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.11741
Pdf link: https://arxiv.org/pdf/2604.11741
Abstract Vision-language models (VLMs) have shown impressive capabilities in perceptual tasks, yet they degrade in complex multi-hop reasoning under multiplayer game settings with imperfect and deceptive information. In this paper, we study a representative multiplayer task, Murder Mystery Games, which require inferring hidden truths based on partial clues provided by roles with different intentions. To address this challenge, we propose a collaborative multi-agent framework for evaluating and synthesizing high-quality, role-driven multiplayer game scripts, enabling fine-grained interaction patterns tailored to character identities (i.e., murderer vs. innocent). Our system generates rich multimodal contexts, including character backstories, visual and textual clues, and multi-hop reasoning chains, through coordinated agent interactions. We design a two-stage agent-monitored training strategy to enhance the reasoning ability of VLMs: (1) chain-of-thought based fine-tuning on curated and synthetic datasets that model uncertainty and deception; (2) GRPO-based reinforcement learning with agent-monitored reward shaping, encouraging the model to develop character-specific reasoning behaviors and effective multimodal multi-hop inference. Extensive experiments demonstrate that our method significantly boosts the performance of VLMs in narrative reasoning, hidden fact extraction, and deception-resilient understanding. Our contributions offer a scalable solution for training and evaluating VLMs under uncertain, adversarial, and socially complex conditions, laying the groundwork for future benchmarks in multimodal multi-hop reasoning under imperfect information.
中文摘要 视觉语言模型（VLMs）在感知任务中表现出令人印象深刻的能力，但在信息不完美且具有欺骗性的多人游戏环境中，其复杂多跳推理能力会下降。本文研究了一项具有代表性的多人任务，即谋杀悬疑游戏，该任务需要根据意图各异的角色所提供的部分线索推断隐藏的真相。为应对这一挑战，我们提出了一个协作多智能体框架，用于评估和合成高质量、角色驱动的多人游戏脚本，实现针对角色身份（即凶手与无辜者）量身定制的细粒度交互模式。我们的系统通过协调的智能体交互生成丰富的多模态上下文，包括角色背景故事、视觉和文本线索以及多跳推理链。我们设计了一套两阶段的智能体监控训练策略，以提升VLM的推理能力：（1）基于思维链的微调，使用对不确定性和欺骗进行建模的精选数据集与合成数据集；（2）基于GRPO的强化学习，配合智能体监控的奖励塑形，鼓励模型发展出角色特定的推理行为和有效的多模态多跳推断。大量实验表明，我们的方法显著提升了VLM在叙事推理、隐藏事实提取和抗欺骗理解方面的表现。我们的贡献为在不确定、对抗性和社会复杂的条件下训练和评估VLM提供了可扩展的解决方案，为未来不完美信息下多模态多跳推理的基准奠定了基础。

Discourse Diversity in Multi-Turn Empathic Dialogue

多轮共情对话中的话语多样性

Authors: Hongli Zhan, Emma S. Gueorguieva, Javier Hernandez, Jina Suh, Desmond C. Ong, Junyi Jessy Li
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.11742
Pdf link: https://arxiv.org/pdf/2604.11742
Abstract Large language models (LLMs) produce responses rated as highly empathic in single-turn settings (Ayers et al., 2023; Lee et al., 2024), yet they are also known to be formulaic generators that reuse the same lexical patterns, syntactic templates, and discourse structures across tasks (Jiang et al., 2025; Shaib et al., 2024; Namuduri et al., 2025). Less attention has been paid to whether this formulaicity extends to the level of discourse moves, i.e., what a response does for the person it is addressing. This question is especially consequential for empathic dialogue, where effective support demands not just a kind response at one moment but varied strategies as a conversation unfolds (Stiles et al., 1998). Indeed, prior work shows that LLMs reuse the same tactic sequences more than human supporters in single-turn settings (Gueorguieva et al., 2026). We extend this analysis to multi-turn conversations and find that the rigidity compounds: once a tactic appears in a supporter turn, LLMs reuse it in the next at nearly double the rate of humans (0.50-0.56 vs. 0.27). This pattern holds across LLMs serving as supporters in real emotional support conversations, and is invisible to standard similarity metrics. To address this gap, we introduce MINT (Multi-turn Inter-tactic Novelty Training), the first reinforcement learning framework to optimize discourse move diversity across multi-turn empathic dialogue. The best MINT variant combines an empathy quality reward with a cross-turn tactic novelty signal, improving aggregate empathy by 25.3% over vanilla across 1.7B and 4B models while reducing cross-turn discourse move repetition by 26.3% on the 4B model, surpassing all baselines including quality-only and token-level diversity methods on both measures. These results suggest that what current models lack is not empathy itself, but the ability to vary their discourse moves across a conversation.
中文摘要 大型语言模型（LLMs）在单轮场景中生成的回复被评为高度共情（Ayers 等，2023；Lee 等，2024），但它们也被认为是公式化的生成器，会在不同任务中重复使用相同的词汇模式、句法模板和话语结构（Jiang 等，2025；Shaib 等，2024；Namuduri 等，2025）。这种公式化是否延伸到话语行为层面，即一条回复对其所面向的人起到什么作用，则较少受到关注。这个问题对共情对话尤为重要，因为有效的支持不仅需要在某一时刻给予善意回应，还需要随着对话展开采取多样化的策略（Stiles 等，1998）。事实上，先前研究表明，在单轮场景中，LLM比人类支持者更频繁地重复使用相同的策略序列（Gueorguieva 等，2026）。我们将此分析扩展到多轮对话，发现这种僵化会加剧：一旦某个策略出现在支持者的某一轮回复中，LLM在下一轮中重复使用它的比率几乎是人类的两倍（0.50-0.56 对比 0.27）。这一模式在真实情感支持对话中担任支持者的各个LLM上都成立，且无法被标准相似度指标捕捉。为弥补这一空白，我们引入了MINT（Multi-turn Inter-tactic Novelty Training，多轮策略间新颖性训练），这是首个优化多轮共情对话中话语行为多样性的强化学习框架。最佳MINT变体将共情质量奖励与跨轮策略新颖性信号相结合，在1.7B和4B模型上将总体共情水平较原始模型提升25.3%，同时在4B模型上将跨轮话语行为重复减少26.3%，在两项指标上均超越了所有基线，包括仅优化质量的方法和词元级多样性方法。这些结果表明，当前模型缺乏的并不是共情本身，而是在对话过程中变换话语行为的能力。
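
Illustrative note (not from the paper): the abstract reports that a tactic used in one supporter turn is reused in the next at a rate of 0.50-0.56 for LLMs versus 0.27 for humans, but does not spell out how the rate is computed. A minimal sketch of one plausible definition follows, assuming each supporter turn has already been labeled with a set of tactic names. The function name, the tactic labels, and the pooling over adjacent turn pairs are all hypothetical.

```python
def cross_turn_reuse_rate(turn_tactics):
    """Fraction of tactics in supporter turn t that reappear in turn t+1.

    `turn_tactics` is a list of supporter turns, each a list of tactic labels.
    Counts are pooled over all adjacent turn pairs in the conversation.
    """
    reused = 0
    total = 0
    for current, following in zip(turn_tactics, turn_tactics[1:]):
        next_tactics = set(following)
        for tactic in set(current):
            total += 1
            reused += tactic in next_tactics
    return reused / total if total else 0.0

# Hypothetical tactic annotations for three consecutive supporter turns.
conversation = [
    ["validation", "question"],
    ["validation", "advice"],
    ["validation", "reframing"],
]
rate = cross_turn_reuse_rate(conversation)
```

A reward built on this kind of signal would penalize a response whose tactics overlap heavily with the previous supporter turn. How MINT actually combines novelty with the empathy quality reward is not described in the abstract.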

Autonomous Diffractometry Enabled by Visual Reinforcement Learning

视觉强化学习支持的自主衍射测量

Authors: J. Oppliger, M. Stifter, A. Rüegg, I. Biało, L. Martinelli, P. G. Freeman, D. Prabhakaran, J. Zhao, Q. Wang, J. Chang
Subjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.11773
Pdf link: https://arxiv.org/pdf/2604.11773
Abstract Automation underpins progress across scientific and industrial disciplines. Yet, automating tasks that require interpretation of abstract visual information remains challenging. For example, crystal alignment strongly relies on humans with the ability to comprehend diffraction patterns. Here we introduce an autonomous system that aligns single crystals without access to crystallography and diffraction theory. Using a model-free reinforcement learning framework, an agent learns to identify and navigate towards high-symmetry orientations directly from Laue diffraction patterns. Despite the absence of human supervision, the agent develops human-like strategies to achieve time-efficient alignment across different crystal symmetry classes. With this, we provide a computational framework for intelligent diffractometers. As such, our approach advances the development of automated experimental workflows in materials science.
中文摘要 自动化支撑着科学和工业各领域的进步。然而，对需要解读抽象视觉信息的任务实现自动化仍然充满挑战。例如，晶体对准高度依赖于能够理解衍射图样的人。这里我们介绍一个自主系统，它能够在不借助晶体学和衍射理论的情况下对准单晶。通过无模型强化学习框架，智能体学习直接从劳厄衍射图样中识别高对称性取向并向其导航。尽管没有人工监督，该智能体仍发展出类人的策略，在不同晶体对称类别上实现省时的对准。由此，我们为智能衍射仪提供了一个计算框架。因此，我们的方法推动了材料科学中自动化实验流程的发展。

Solving Physics Olympiad via Reinforcement Learning on Physics Simulators

通过物理模拟器上的强化学习解决物理奥林匹克

Authors: Mihir Prabhudesai, Aryan Satpathy, Yangmin Li, Zheyang Qin, Nikash Bhardwaj, Amir Zadeh, Chuan Li, Katerina Fragkiadaki, Deepak Pathak
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.11805
Pdf link: https://arxiv.org/pdf/2604.11805
Abstract We have witnessed remarkable advances in LLM reasoning capabilities with the advent of DeepSeek-R1. However, much of this progress has been fueled by the abundance of internet question-answer (QA) pairs, a major bottleneck going forward, since such data is limited in scale and concentrated mainly in domains like mathematics. In contrast, other sciences such as physics lack large-scale QA datasets to effectively train reasoning-capable models. In this work, we show that physics simulators can serve as a powerful alternative source of supervision for training LLMs for physical reasoning. We generate random scenes in physics engines, create synthetic question-answer pairs from simulated interactions, and train LLMs using reinforcement learning on this synthetic data. Our models exhibit zero-shot sim-to-real transfer to real-world physics benchmarks: for example, training solely on synthetic simulated data improves performance on IPhO (International Physics Olympiad) problems by 5-10 percentage points across model sizes. These results demonstrate that physics simulators can act as scalable data generators, enabling LLMs to acquire deep physical reasoning skills beyond the limitations of internet-scale QA data. Code available at: this https URL.
中文摘要 随着DeepSeek-R1的出现，我们见证了大型语言模型推理能力的显著进步。然而，这些进展很大程度上得益于互联网上丰富的问答（QA）对，而这将成为未来的主要瓶颈，因为这类数据规模有限，且主要集中在数学等领域。相比之下，物理学等其他学科缺乏大规模问答数据集，无法有效训练具备推理能力的模型。本研究表明，物理模拟器可以作为训练大型语言模型进行物理推理的强大替代监督来源。我们在物理引擎中生成随机场景，从模拟交互中构造合成问答对，并利用强化学习在这些合成数据上训练大型语言模型。我们的模型在真实世界物理基准上表现出零样本的模拟到现实迁移：例如，仅用合成模拟数据训练，就能使不同规模的模型在国际物理奥林匹克（IPhO）题目上的表现提升5-10个百分点。这些结果表明，物理模拟器可以作为可扩展的数据生成器，使LLM能够获得超越互联网规模QA数据限制的深层物理推理技能。代码可访问：此 https URL。
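
Illustrative note (not from the paper): the abstract describes generating random scenes in physics engines and turning simulated interactions into question-answer pairs, without naming the engines or the question templates. The toy sketch below conveys the idea with a hand-written drag-free projectile integrator standing in for a physics engine. The scene parameters, the question wording, and the function names are all assumptions.

```python
import math
import random

G = 9.81  # gravitational acceleration in m/s^2

def simulate_range(speed, angle_deg, dt=1e-4):
    """Integrate a drag-free projectile launched from the ground until it lands."""
    angle = math.radians(angle_deg)
    x, y = 0.0, 0.0
    vx, vy = speed * math.cos(angle), speed * math.sin(angle)
    while True:
        vy -= G * dt
        x_new, y_new = x + vx * dt, y + vy * dt
        if y_new < 0.0:
            # Linearly interpolate to the point where the path crosses the ground.
            frac = y / (y - y_new)
            return x + frac * (x_new - x)
        x, y = x_new, y_new

def make_qa_pair(rng):
    """Sample a random scene and return a question with its simulated answer."""
    speed = round(rng.uniform(5.0, 30.0), 1)
    angle = round(rng.uniform(15.0, 75.0), 1)
    question = (
        f"A ball is launched from level ground at {speed} m/s, "
        f"{angle} degrees above the horizontal. Ignoring air resistance, "
        f"how far away does it land, in metres?"
    )
    return {"question": question, "answer": round(simulate_range(speed, angle), 2)}

qa = make_qa_pair(random.Random(0))
```

Because the answer comes from the simulation rather than from a human-written solution, a reinforcement learning reward can be computed by comparing a model's numeric answer with `qa["answer"]` within some tolerance.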

Keyword: diffusion policy

OmniUMI: Towards Physically Grounded Robot Learning via Human-Aligned Multimodal Interaction

OmniUMI：通过与人类对齐的多模态交互实现基于物理的机器人学习

Authors: Shaqi Luo, Yuanyuan Li, Youhao Hu, Chenhao Yu, Chaoran Xu, Jiachen Zhang, Guocai Yao, Tiejun Huang, Ran He, Zhongyuan Wang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.10647
Pdf link: https://arxiv.org/pdf/2604.10647
Abstract UMI-style interfaces enable scalable robot learning, but existing systems remain largely visuomotor, relying primarily on RGB observations and trajectory while providing only limited access to physical interaction signals. This becomes a fundamental limitation in contact-rich manipulation, where success depends on contact dynamics such as tactile interaction, internal grasping force, and external interaction wrench that are difficult to infer from vision alone. We present OmniUMI, a unified framework for physically grounded robot learning via human-aligned multimodal interaction. OmniUMI synchronously captures RGB, depth, trajectory, tactile sensing, internal grasping force, and external interaction wrench within a compact handheld system, while maintaining collection--deployment consistency through a shared embodiment design. To support human-aligned demonstration, OmniUMI provides dual-force feedback through bilateral gripper feedback and natural perception of external interaction wrench in the handheld embodiment. Built on this interface, we extend diffusion policy with visual, tactile, and force-related observations, and deploy the learned policy through impedance-based execution for unified regulation of motion and contact behavior. Experiments demonstrate reliable sensing and strong downstream performance on force-sensitive pick-and-place, interactive surface erasing, and tactile-informed selective release. Overall, OmniUMI combines physically grounded multimodal data acquisition with human-aligned interaction, providing a scalable foundation for learning contact-rich manipulation.
中文摘要 UMI式接口支持可扩展的机器人学习，但现有系统在很大程度上仍属于视觉运动范式，主要依赖RGB观测和轨迹，对物理交互信号的获取十分有限。这在接触密集型操作中成为根本限制，因为此类操作的成功取决于触觉交互、内部抓握力和外部交互力旋量（wrench）等接触动态，而这些仅凭视觉难以推断。我们提出OmniUMI，一个通过与人类对齐的多模态交互实现基于物理的机器人学习的统一框架。OmniUMI在紧凑的手持系统中同步采集RGB、深度、轨迹、触觉感知、内部抓握力和外部交互力旋量，同时通过共享的具身设计保持采集与部署的一致性。为支持与人类对齐的示教，OmniUMI通过双边夹爪反馈以及手持具身中对外部交互力旋量的自然感知，提供双重力反馈。基于该接口，我们用视觉、触觉和力相关观测扩展了扩散策略，并通过基于阻抗的执行来部署所学策略，实现对运动和接触行为的统一调控。实验表明，该系统传感可靠，并在力敏感的抓取放置、交互式表面擦除和基于触觉的选择性释放任务上取得了强劲的下游表现。总体而言，OmniUMI将基于物理的多模态数据采集与人类对齐的交互相结合，为学习接触密集型操作提供了可扩展的基础。
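
Illustrative note (not from the paper): the abstract says the learned policy is deployed through impedance-based execution but gives no controller details. The sketch below shows a textbook one-dimensional impedance law, in which the commanded force behaves like a spring-damper pulling the end effector toward a target position. The gains, the unit mass, and the function names are assumptions chosen only to make the example run.

```python
def impedance_force(x, v, x_des, v_des=0.0, stiffness=200.0, damping=30.0):
    """Spring-damper force toward the desired position and velocity."""
    return stiffness * (x_des - x) + damping * (v_des - v)

def simulate(x_des, mass=1.0, dt=1e-3, steps=3000):
    """Drive a point mass from rest at x=0 toward x_des under the impedance law."""
    x, v = 0.0, 0.0
    for _ in range(steps):
        a = impedance_force(x, v, x_des) / mass
        v += a * dt  # semi-implicit Euler: update velocity first
        x += v * dt
    return x

final_position = simulate(x_des=0.1)
```

Lower stiffness makes the motion more compliant on contact, which is the usual reason for pairing a policy's position targets with an impedance layer in contact-rich tasks.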

AffordSim: A Scalable Data Generator and Benchmark for Affordance-Aware Robotic Manipulation

AffordSim：面向可供性感知机器人操作的可扩展数据生成器与基准

Authors: Mingyang Li, Haofan Xu, Haowen Sun, Xinzhe Chen, Sihua Ren, Liqi Huang, Xinyang Sui, Chenyang Miao, Qiongjie Cui, Zeyang Liu, Xingyu Chen, Xuguang Lan
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.11674
Pdf link: https://arxiv.org/pdf/2604.11674
Abstract Simulation-based data generation has become a dominant paradigm for training robotic manipulation policies, yet existing platforms do not incorporate object affordance information into trajectory generation. As a result, tasks requiring precise interaction with specific functional regions--grasping a mug by its handle, pouring from a cup's rim, or hanging a mug on a hook--cannot be automatically generated with semantically correct trajectories. We introduce AffordSim, the first simulation framework that integrates open-vocabulary 3D affordance prediction into the manipulation data generation pipeline. AffordSim uses our VoxAfford model, an open-vocabulary 3D affordance detector that enhances MLLM output tokens with multi-scale geometric features, to predict affordance maps on object point clouds, guiding grasp pose estimation toward task-relevant functional regions. Built on NVIDIA Isaac Sim with cross-embodiment support (Franka FR3, Panda, UR5e, Kinova), VLM-powered task generation, and novel domain randomization using DA3-based 3D Gaussian reconstruction from real photographs, AffordSim enables automated, scalable generation of affordance-aware manipulation data. We establish a benchmark of 50 tasks across 7 categories (grasping, placing, stacking, pushing/pulling, pouring, mug hanging, long-horizon composite) and evaluate 4 imitation learning baselines (BC, Diffusion Policy, ACT, Pi 0.5). Our results reveal that while grasping is largely solved (53-93% success), affordance-demanding tasks such as pouring into narrow containers (1-43%) and mug hanging (0-47%) remain significantly more challenging for current imitation learning methods, highlighting the need for affordance-aware data generation. Zero-shot sim-to-real experiments on a real Franka FR3 validate the transferability of the generated data.
中文摘要 基于仿真的数据生成已成为训练机器人操作策略的主导范式，但现有平台并未将物体可供性（affordance）信息纳入轨迹生成。因此，需要与特定功能区域精确交互的任务，如握住杯柄抓取马克杯、从杯沿倾倒或把马克杯挂到挂钩上，无法自动生成语义正确的轨迹。我们提出AffordSim，这是首个将开放词汇3D可供性预测整合进操作数据生成流程的仿真框架。AffordSim使用我们的VoxAfford模型，这是一种开放词汇3D可供性检测器，它用多尺度几何特征增强MLLM输出词元，在物体点云上预测可供性图，引导抓取姿态估计朝向与任务相关的功能区域。AffordSim基于NVIDIA Isaac Sim构建，支持跨具身形态（Franka FR3、Panda、UR5e、Kinova）、VLM驱动的任务生成，以及利用基于DA3的3D高斯重建从真实照片实现的新型域随机化，从而能够自动化、可扩展地生成可供性感知的操作数据。我们建立了涵盖7个类别（抓取、放置、堆叠、推/拉、倾倒、挂马克杯、长时程复合任务）的50个任务的基准，并评估了4个模仿学习基线（BC、Diffusion Policy、ACT、Pi 0.5）。结果显示，虽然抓取基本已被解决（成功率53-93%），但对可供性要求较高的任务，如向窄口容器倾倒（1-43%）和挂马克杯（0-47%），对当前的模仿学习方法而言仍然困难得多，这凸显了可供性感知数据生成的必要性。在真实Franka FR3上进行的零样本模拟到现实实验验证了生成数据的可迁移性。