Arxiv Papers of Today

Generated: 2026-04-23 17:47:20 (UTC+8); arXiv announcement time: 2026-04-23 20:00 EDT (2026-04-24 08:00 UTC+8)

35 relevant papers today

Keyword: reinforcement learning

OThink-SRR1: Search, Refine and Reasoning with Reinforced Learning for Large Language Models

Authors: Haijian Liang, Zenghao Niu, Junjie Wu, Changwang Zhang, Wangchunshu Zhou, Jun Wang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.19766
Pdf link: https://arxiv.org/pdf/2604.19766
Abstract Retrieval-Augmented Generation (RAG) expands the knowledge of Large Language Models (LLMs), yet current static retrieval methods struggle with complex, multi-hop problems. While recent dynamic retrieval strategies offer improvements, they face two key challenges: 1) irrelevant retrieved noise can misdirect the reasoning process, and 2) processing full documents incurs prohibitive computational and latency costs. To address these issues, we propose OThink-SRR1, a framework that enhances large models with an iterative Search-Refine-Reason process trained via reinforcement learning. Its core Refine stage distills retrieved documents into concise, relevant facts before reasoning. We introduce GRPO-IR, an end-to-end reinforcement learning algorithm that rewards accurate evidence identification while penalizing excessive retrievals, thus training the model to be both focused and efficient. Experiments on four multi-hop QA benchmarks show our approach achieves superior accuracy over strong baselines while using fewer retrieval steps and tokens. This positions OThink-SRR1 as a potent foundational model for information-seeking agents.
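The abstract describes a reward that favors accurate evidence identification and penalizes excessive retrievals. The sketch below is only one way such a scalar reward could be shaped; it is not the GRPO-IR reward from the paper. The function name, the 0.5 evidence bonus, the free-retrieval budget, and the per-retrieval penalty are all hypothetical.

```python
def search_refine_reward(answer_correct, evidence_hit, num_retrievals,
                         free_retrievals=2, penalty_per_extra=0.1):
    """Toy reward; every weight and threshold here is an assumption."""
    # Base term: 1.0 for a correct final answer, 0.0 otherwise.
    reward = 1.0 if answer_correct else 0.0
    # Bonus when the refined facts contain the gold evidence.
    if evidence_hit:
        reward += 0.5
    # Only retrievals beyond the free budget are penalized.
    extra = max(0, num_retrievals - free_retrievals)
    return reward - penalty_per_extra * extra
```

With these assumed weights, a correct and well-grounded answer that needs five retrievals scores lower than one that needs two, which is the trade-off the abstract points at.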

PR-CAD: Progressive Refinement for Unified Controllable and Faithful Text-to-CAD Generation with Large Language Models

Authors: Jiyuan An, Jiachen Zhao, Fan Chen, Liner Yang, Zhenghao Liu, Hongyan Wang, Weihua An, Meishan Zhang, Erhong Yang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.19773
Pdf link: https://arxiv.org/pdf/2604.19773
Abstract The construction of CAD models has traditionally relied on labor-intensive manual operations and specialized expertise. Recent advances in large language models (LLMs) have inspired research into text-to-CAD generation. However, existing approaches typically treat generation and editing as disjoint tasks, limiting their practicality. We propose PR-CAD, a progressive refinement framework that unifies generation and editing for controllable and faithful text-to-CAD modeling. To support this, we curate a high-fidelity interaction dataset spanning the full CAD lifecycle, encompassing multiple CAD representations as well as both qualitative and quantitative descriptions. The dataset systematically defines the types of edit operations and generates highly human-like interaction data. Building on a CAD representation tailored for LLMs, we propose a reinforcement learning-enhanced reasoning framework that integrates intent understanding, parameter estimation, and precise edit localization into a single agent. This enables an "all-in-one" solution for both design creation and refinement. Extensive experiments demonstrate strong mutual reinforcement between generation and editing tasks, and across qualitative and quantitative modalities. On public benchmarks, PR-CAD achieves state-of-the-art controllability and faithfulness in both generation and refinement scenarios, while also proving user-friendly and significantly improving CAD modeling efficiency.

Wan-Image: Pushing the Boundaries of Generative Visual Intelligence

Authors: Chaojie Mao, Chen-Wei Xie, Chongyang Zhong, Haoyou Deng, Jiaxing Zhao, Jie Xiao, Jinbo Xing, Jingfeng Zhang, Jingren Zhou, Jingyi Zhang, Jun Dan, Kai Zhu, Kang Zhao, Keyu Yan, Minghui Chen, Pandeng Li, Shuangle Chen, Tong Shen, Yu Liu, Yue Jiang, Yulin Pan, Yuxiang Tuo, Zeyinzi Jiang, Zhen Han, Ang Wang, Bang Zhang, Baole Ai, Bin Wen, Boang Feng, Feiwu Yu, Gang Wang, Haiming Zhao, He Kang, Jianjing Xiang, Jianyuan Zeng, Jinkai Wang, Ke Sun, Linqian Wu, Pei Gong, Pingyu Wu, Ruiwen Wu, Tongtong Su, Wenmeng Zhou, Wenting Shen, Wenyuan Yu, Xianjun Xu, Xiaoming Huang, Xiejie Shen, Xin Xu, Yan Kou, Yangyu Lv, Yifan Zhai, Yitong Huang, Yun Zheng, Yuntao Hong, Zhicheng Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.19858
Pdf link: https://arxiv.org/pdf/2604.19858
Abstract We present Wan-Image, a unified visual generation system explicitly engineered to paradigm-shift image generation models from casual synthesizers into professional-grade productivity tools. While contemporary diffusion models excel at aesthetic generation, they frequently encounter critical bottlenecks in rigorous design workflows that demand absolute controllability, complex typography rendering, and strict identity preservation. To address these challenges, Wan-Image features a natively unified multi-modal architecture by synergizing the cognitive capabilities of large language models with the high-fidelity pixel synthesis of diffusion transformers, which seamlessly translates highly nuanced user intents into precise visual outputs. It is fundamentally powered by large-scale multi-modal data scaling, a systematic fine-grained annotation engine, and curated reinforcement learning data to surpass basic instruction following and unlock expert-level professional capabilities. These include ultra-long complex text rendering, hyper-diverse portrait generation, palette-guided generation, multi-subject identity preservation, coherent sequential visual generation, precise multi-modal interactive editing, native alpha-channel generation, and high-efficiency 4K synthesis. Across diverse human evaluations, Wan-Image exceeds Seedream 5.0 Lite and GPT Image 1.5 in overall performance, reaching parity with Nano Banana Pro in challenging tasks. Ultimately, Wan-Image revolutionizes visual content creation across e-commerce, entertainment, education, and personal productivity, redefining the boundaries of professional visual synthesis.

DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data

Authors: Venus Team, Sunhao Dai, Yong Deng, Jinzhen Lin, Yusheng Song, Guoqing Wang, Xiaofeng Wu, Yuqi Zhou, Shuo Yang, Zhenzhe Ying, Zhanwei Zhang, Changhua Meng, Weiqiang Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2604.19859
Pdf link: https://arxiv.org/pdf/2604.19859
Abstract Edge-scale deep research agents based on small language models are attractive for real-world deployment due to their advantages in cost, latency, and privacy. In this work, we study how to train a strong small deep research agent under limited open-data by improving both data quality and data utilization. We present DR-Venus, a frontier 4B deep research agent for edge-scale deployment, built entirely on open data. Our training recipe consists of two stages. In the first stage, we use agentic supervised fine-tuning (SFT) to establish basic agentic capability, combining strict data cleaning with resampling of long-horizon trajectories to improve data quality and utilization. In the second stage, we apply agentic reinforcement learning (RL) to further improve execution reliability on long-horizon deep research tasks. To make RL effective for small agents in this setting, we build on IGPO and design turn-level rewards based on information gain and format-aware regularization, thereby enhancing supervision density and turn-level credit assignment. Built entirely on roughly 10K open-data, DR-Venus-4B significantly outperforms prior agentic models under 9B parameters on multiple deep research benchmarks, while also narrowing the gap to much larger 30B-class systems. Our further analysis shows that 4B agents already possess surprisingly strong performance potential, highlighting both the deployment promise of small models and the value of test-time scaling in this setting. We release our models, code, and key recipes to support reproducible research on edge-scale deep research agents.

Infection-Reasoner: A Compact Vision-Language Model for Wound Infection Classification with Evidence-Grounded Clinical Reasoning

Authors: Palawat Busaranuvong, Reza Saadati Fard, Emmanuel Agu, Deepak Kumar, Shefalika Gautam, Bengisu Tulu, Diane Strong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.19937
Pdf link: https://arxiv.org/pdf/2604.19937
Abstract Assessing chronic wound infection from photographs is challenging because visual appearance varies across wound etiologies, anatomical locations, and imaging conditions. Prior image-based deep learning methods have mainly focused on classification with limited interpretability, despite the need for evidence-grounded explanations to support point-of-care decision making. We present Infection-Reasoner, a compact 4B-parameter reasoning vision-language model for chronic wound infection classification and rationale generation. To address the scarcity of expert-labeled wound images with reasoning annotations, Infection-Reasoner is trained using a two-stage pipeline: (1) reasoning distillation, in which GPT-5.1 generates chain-of-thought rationales for unlabeled wound images to initialize wound-specific reasoning in a smaller student model (Qwen3-VL-4B-Thinking), and (2) reinforcement learning post-training with Group Relative Policy Optimization on a small labeled infection dataset to refine classification reasoning. On a held-out heterogeneous wound dataset, Infection-Reasoner achieved 86.8% accuracy, 86.4% sensitivity, and 87.1% specificity, outperforming several strong baselines, including GPT-5.1. Rationale quality was further evaluated using both multimodal large language model (MLLM) judges and wound expert review. Across four MLLM judges, visual-support agreement scores ranged from 0.722 to 0.903, while expert review rated 61.8% of rationales as Correct and 32.4% as Partially Correct.
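For readers unfamiliar with Group Relative Policy Optimization, its central step is to score each sampled response relative to the other responses drawn for the same prompt, rather than against a learned value baseline. The sketch below shows only that normalization step, as commonly described for GRPO; the function name and the `eps` constant are illustrative, and nothing here is taken from the Infection-Reasoner training code.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one prompt's sampled responses within the group."""
    rewards = np.asarray(rewards, dtype=float)
    # Subtract the group mean and divide by the group standard deviation.
    # eps (an assumed constant) avoids division by zero when all rewards match.
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```

When every response in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal.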

Visual Reasoning through Tool-supervised Reinforcement Learning

Authors: Qihua Dong, Gozde Sahin, Pei Wang, Zhaowei Cai, Robik Shrestha, Hao Yang, Davide Modolo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.19945
Pdf link: https://arxiv.org/pdf/2604.19945
Abstract In this paper, we investigate the problem of how to effectively master tool-use to solve complex visual reasoning tasks for Multimodal Large Language Models. To achieve that, we propose a novel Tool-supervised Reinforcement Learning (ToolsRL) framework, with direct tool supervision for more effective tool-use learning. We focus on a series of simple, native, and interpretable visual tools, including zoom-in, rotate, flip, and draw point/line, whose tool supervision is easy to collect. A reinforcement learning curriculum is developed, where the first stage is solely optimized by a set of well motivated tool-specific rewards, and the second stage is trained with the accuracy targeted rewards while allowing calling tools. In this way, tool calling capability is mastered before using tools to complete visual reasoning tasks, avoiding the potential optimization conflict among those heterogeneous tasks. Our experiments have shown that the tool-supervised curriculum training is efficient and ToolsRL can achieve strong tool-use capabilities for complex visual reasoning tasks.

Efficient Reinforcement Learning using Linear Koopman Dynamics for Nonlinear Robotic Systems

Authors: Wenjian Hao, Yuxuan Fang, Zehui Lu, Shaoshuai Mou
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2604.19980
Pdf link: https://arxiv.org/pdf/2604.19980
Abstract This paper presents a model-based reinforcement learning (RL) framework for optimal closed-loop control of nonlinear robotic systems. The proposed approach learns linear lifted dynamics through Koopman operator theory and integrates the resulting model into an actor-critic architecture for policy optimization, where the policy represents a parameterized closed-loop controller. To reduce computational cost and mitigate model rollout errors, policy gradients are estimated using one-step predictions of the learned dynamics rather than multi-step propagation. This leads to an online mini-batch policy gradient framework that enables policy improvement from streamed interaction data. The proposed framework is evaluated on several simulated nonlinear control benchmarks and two real-world hardware platforms, including a Kinova Gen3 robotic arm and a Unitree Go1 quadruped. Experimental results demonstrate improved sample efficiency over model-free RL baselines, superior control performance relative to model-based RL baselines, and control performance comparable to classical model-based methods that rely on exact system dynamics.
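Learning "linear lifted dynamics" usually means choosing a lifting z = phi(x) and fitting matrices A and B so that z_next is approximately A z + B u. The sketch below fits A and B by ordinary least squares on transition data, in the style of extended dynamic mode decomposition. It is a generic illustration and not the paper's method: the lifting function is supplied by the caller, and the function names are assumptions.

```python
import numpy as np

def fit_lifted_linear_model(X, U, X_next, lift_fn):
    """Fit z_next ~= A z + B u by least squares, where z = lift_fn(x).

    X, X_next: (N, n) states; U: (N, m) inputs; lift_fn maps (N, n) -> (N, d).
    """
    Z, Z_next = lift_fn(X), lift_fn(X_next)
    ZU = np.hstack([Z, U])
    # Solve min_K || ZU @ K - Z_next ||_F; K stacks A^T on top of B^T.
    K, *_ = np.linalg.lstsq(ZU, Z_next, rcond=None)
    d = Z.shape[1]
    return K[:d].T, K[d:].T

def predict_one_step(A, B, x, u, lift_fn):
    """One-step prediction in the lifted space for a single (x, u) pair."""
    z = lift_fn(np.atleast_2d(x))
    return z @ A.T + np.atleast_2d(u) @ B.T
```

The abstract notes that policy gradients rely on one-step predictions of the learned model rather than multi-step rollouts; `predict_one_step` corresponds to that single-step use.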

Multi-Objective Reinforcement Learning for Generating Covalent Inhibitor Candidates

Authors: Renee Gil
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.20019
Pdf link: https://arxiv.org/pdf/2604.20019
Abstract Rational design of covalent inhibitors requires simultaneously optimizing multiple properties, such as binding affinity, target selectivity, or electrophilic reactivity. This presents a multi-objective problem not easily addressed by screening alone. Here we present a machine learning pipeline for generating covalent inhibitor candidates using multi-objective reinforcement learning (RL), applied to two targets: epidermal growth factor receptor (EGFR) and acetylcholinesterase (ACHE). A SMILES-based pretrained LSTM serves as the generative model, optimized via policy gradient RL with Pareto crowding distance to balance competing scoring functions including synthetic accessibility, predicted covalent activity, residue affinity, and an approximated docking score. The pipeline rediscovers known covalent inhibitors at rates of up to 0.50% (EGFR) and 0.74% (ACHE) in 10,000-structure runs, with candidate structures achieving warhead-to-residue distances as short as 5.5 angstrom (EGFR) and 3.2 angstrom (ACHE) after further docking-based screening. More notably, the pipeline spontaneously generates structures bearing warhead motifs absent from the training data - including allenes, 3-oxo-β-sultams, and α-methylene-β-lactones - all of which have independent literature support as covalent warheads. These results suggest that RL-guided generation can explore covalent chemical space beyond its training distribution, and may be useful as a tool for medicinal chemists working on covalent drug discovery.
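Crowding distance is a standard way to favor candidates that sit in sparse regions of a multi-objective score space. The sketch below follows the usual NSGA-II-style definition; how the paper combines it with the policy-gradient reward is not stated in the abstract, so this shows the distance computation only, and the function name is an assumption.

```python
import numpy as np

def crowding_distance(scores):
    """NSGA-II-style crowding distance for an (n_candidates, n_objectives) array.

    Assumes at least one candidate. Boundary candidates get infinite distance.
    """
    scores = np.asarray(scores, dtype=float)
    n, m = scores.shape
    dist = np.zeros(n)
    for j in range(m):
        order = np.argsort(scores[:, j])
        col = scores[order, j]
        # The extremes of each objective are always kept.
        dist[order[0]] = np.inf
        dist[order[-1]] = np.inf
        span = col[-1] - col[0]
        if n < 3 or span == 0:
            continue
        # Interior points: normalized gap between their two neighbors.
        dist[order[1:-1]] += (col[2:] - col[:-2]) / span
    return dist
```

A larger distance means fewer neighbors in objective space, so ranking by it pushes generation toward a more evenly spread set of trade-offs.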

Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text

Authors: Chengyu Huang, Sheng-Yen Chou, Zhengxin Zhang, Claire Cardie
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.20051
Pdf link: https://arxiv.org/pdf/2604.20051
Abstract Self-play has recently emerged as a promising paradigm to train Large Language Models (LLMs). In self-play, the target LLM creates the task input (e.g., ask a question), which it then addresses itself by producing a task output (e.g., give an answer). A reward model evaluates the output, and the rewards are then used to train the LLM, typically via Reinforcement Learning (RL). Self-play incurs minimal supervision costs, and this is especially helpful for post-training LLMs, which require high-quality input-output pairs that traditionally have to be written by humans or expensive proprietary models. However, existing work explores self-play only for verifiable tasks such as math and coding. Instead, we seek to extend it to more realistic open-ended tasks. In particular, we propose POP, a self-play framework that uses the same LLM to synthesize evaluation rubrics, along with input-output pairs, for each example. The rubric is then used to evaluate outputs and train the model. We further ground the framework on a content-rich pretraining corpus to (1) ensure a generation-verification gap and reduce reward hacking, and (2) prevent mode collapse. On Qwen-2.5-7B, POP increases performance of both pretrained and instruction-tuned models, across different tasks ranging from long-form Healthcare QA to creative writing and instruction following.

Maximum Entropy Semi-Supervised Inverse Reinforcement Learning

Authors: Julien Audiffren, Michal Valko, Alessandro Lazaric, Mohammad Ghavamzadeh
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.20074
Pdf link: https://arxiv.org/pdf/2604.20074
Abstract A popular approach to apprenticeship learning (AL) is to formulate it as an inverse reinforcement learning (IRL) problem. The MaxEnt-IRL algorithm successfully integrates the maximum entropy principle into IRL and unlike its predecessors, it resolves the ambiguity arising from the fact that a possibly large number of policies could match the expert's behavior. In this paper, we study an AL setting in which in addition to the expert's trajectories, a number of unsupervised trajectories is available. We introduce MESSI, a novel algorithm that combines MaxEnt-IRL with principles coming from semi-supervised learning. In particular, MESSI integrates the unsupervised data into the MaxEnt-IRL framework using a pairwise penalty on trajectories. Empirical results in a highway driving and grid-world problems indicate that MESSI is able to take advantage of the unsupervised trajectories and improve the performance of MaxEnt-IRL.
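The MaxEnt-IRL building block that MESSI extends can be sketched for a small, enumerable set of candidate trajectories with a linear reward theta · f(trajectory). Under the maximum-entropy model, the log-likelihood gradient is the expert's mean feature vector minus the model's expected feature vector. The sketch below covers only this base gradient; the pairwise penalty on unsupervised trajectories is not specified in the abstract and is left out. The finite candidate set and the function name are assumptions.

```python
import numpy as np

def maxent_irl_gradient(theta, expert_features, candidate_features):
    """Gradient of the MaxEnt-IRL log-likelihood over a finite trajectory set.

    expert_features: (n_expert, d); candidate_features: (n_candidates, d).
    """
    theta = np.asarray(theta, dtype=float)
    expert_features = np.asarray(expert_features, dtype=float)
    candidate_features = np.asarray(candidate_features, dtype=float)
    # P(trajectory) is proportional to exp(theta . f(trajectory)).
    logits = candidate_features @ theta
    logits -= logits.max()  # for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    # Expert feature mean minus model-expected features.
    return expert_features.mean(axis=0) - probs @ candidate_features
```

Gradient ascent on theta stops when the model's expected features match the expert's, which is the feature-matching condition MaxEnt-IRL is built around.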

On the Stability and Generalization of First-order Bilevel Minimax Optimization

Authors: Xuelin Zhang, Peipei Yuan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2604.20115
Pdf link: https://arxiv.org/pdf/2604.20115
Abstract Bilevel optimization and bilevel minimax optimization have recently emerged as unifying frameworks for a range of machine-learning tasks, including hyperparameter optimization and reinforcement learning. The existing literature focuses on empirical efficiency and convergence guarantees, leaving a critical theoretical gap in understanding how well these algorithms generalize. To bridge this gap, we provide the first systematic generalization analysis for first-order gradient-based bilevel minimax solvers with lower-level minimax problems. Specifically, by leveraging algorithmic stability arguments, we derive fine-grained generalization bounds for three representative algorithms, including single-timescale stochastic gradient descent-ascent, and two variants of two-timescale stochastic gradient descent-ascent. Our results reveal a precise trade-off among algorithmic stability, generalization gaps, and practical settings. Furthermore, extensive empirical evaluations corroborate our theoretical insights on realistic optimization tasks with bilevel minimax structures.

SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition

Authors: Jielong Tang, Xujie Yuan, Jiayang Liu, Jianxing Yu, Xiao Dong, Lin Chen, Yunlai Teng, Shimin Di, Jian Yin
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.20146
Pdf link: https://arxiv.org/pdf/2604.20146
Abstract Grounded Multimodal Named Entity Recognition (GMNER) aims to extract named entities and localize their visual regions within image-text pairs, serving as a pivotal capability for various downstream applications. In open-world social media platforms, GMNER remains challenging due to the prevalence of long-tailed, rapidly evolving, and unseen entities. To tackle this, existing approaches typically rely on either external knowledge exploration through heuristic retrieval or internal knowledge exploitation via iterative refinement in Multimodal Large Language Models (MLLMs). However, heuristic retrieval often introduces noisy or conflicting evidence that degrades precision on known entities, while solely internal exploitation is constrained by the knowledge boundaries of MLLMs and prone to hallucinations. To address this, we propose SAKE, an end-to-end agentic framework that harmonizes internal knowledge exploitation and external knowledge exploration via self-aware reasoning and adaptive search tool invocation. We implement this via a two-stage training paradigm. First, we propose Difficulty-aware Search Tag Generation, which quantifies the model's entity-level uncertainty through multiple forward samplings to produce explicit knowledge-gap signals. Based on these signals, we construct SAKE-SeCoT, a high-quality Chain-of-Thought dataset that equips the model with basic self-awareness and tool-use capabilities through supervised fine-tuning. Second, we employ agentic reinforcement learning with a hybrid reward function that penalizes unnecessary retrieval, enabling the model to evolve from rigid search imitation to genuine self-aware decision-making about when retrieval is truly necessary. Extensive experiments on two widely used social media benchmarks demonstrate SAKE's effectiveness.
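Quantifying entity-level uncertainty "through multiple forward samplings" can be pictured as a disagreement measure over repeated predictions for the same entity. The sketch below is a hypothetical reading of that idea, not SAKE's Difficulty-aware Search Tag Generation itself. The disagreement measure, the 0.4 threshold, and both function names are assumptions.

```python
from collections import Counter

def entity_uncertainty(sampled_labels):
    """1 minus the share of samples that agree with the majority label."""
    counts = Counter(sampled_labels)
    majority = counts.most_common(1)[0][1]
    return 1.0 - majority / len(sampled_labels)

def needs_search(sampled_labels, threshold=0.4):
    """Flag a knowledge gap when the samples disagree too much (assumed threshold)."""
    return entity_uncertainty(sampled_labels) > threshold
```

For example, five samples that all predict the same type give an uncertainty of 0 and no search tag, while a 2/2/1 split across three types exceeds the assumed threshold and would trigger retrieval.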

Toward Safe Autonomous Robotic Endovascular Interventions using World Models

Authors: Harry Robertshaw, Nikola Fischer, Han-Ru Wu, Andrea Walker Perez, Weiyuan Deng, Benjamin Jackson, Christos Bergeles, Alejandro Granados, Thomas C Booth
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.20151
Pdf link: https://arxiv.org/pdf/2604.20151
Abstract Autonomous mechanical thrombectomy (MT) presents substantial challenges due to highly variable vascular geometries and the requirements for accurate, real-time control. While reinforcement learning (RL) has emerged as a promising paradigm for the automation of endovascular navigation, existing approaches often show limited robustness when faced with diverse patient anatomies or extended navigation horizons. In this work, we investigate a world-model-based framework for autonomous endovascular navigation built on TD-MPC2, a model-based RL method that integrates planning and learned dynamics. We evaluate a TD-MPC2 agent trained on multiple navigation tasks across hold out patient-specific vasculatures and benchmark its performance against the state-of-the-art Soft Actor-Critic (SAC) algorithm agent. Both approaches are further validated in vitro using patient-specific vascular phantoms under fluoroscopic guidance. In simulation, TD-MPC2 demonstrates a significantly higher mean success rate than SAC (58% vs. 36%, p < 0.001), and mean tip contact forces of 0.15 N, well below the proposed 1.5 N vessel rupture threshold. Mean success rates for TD-MPC2 (68%) were comparable to SAC (60%) in vitro, but TD-MPC2 achieved superior path ratios (p = 0.017) at the cost of longer procedure times (p < 0.001). Together, these results provide the first demonstration of autonomous MT navigation validated across both hold out in silico data and fluoroscopy-guided in vitro experiments, highlighting the promise of world models for safe and generalizable AI-assisted endovascular interventions.

Temporally Extended Mixture-of-Experts Models

Authors: Zeyu Shen, Peter Henderson
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.20156
Pdf link: https://arxiv.org/pdf/2604.20156
Abstract Mixture-of-Experts models, now popular for scaling capacity at fixed inference speed, switch experts at nearly every token. Once a model outgrows available GPU memory, this churn can render optimizations like offloading and pre-fetching ineffective. We make the case that the options framework in reinforcement learning is a perfect match to tackle this problem, and argue for temporally extended mixture-of-experts layers. Building on the option-critic framework with deliberation costs, we add a controller to each layer that learns when to switch expert sets and which to load. By applying this to gpt-oss-20b with low-rank adapters and a self-distillation reward, our method reduces switch rates from over 50% to below 5% while retaining up to 90% of base-model accuracy on MATH, MMLU, and MMMLU. This shows that even existing pre-trained models can be converted to temporally extended MoEs with lightweight training, with the deliberation cost allowing model trainers to trade off switching rates against capability. We hope this opens a principled path, grounded in the options framework, for memory-efficient serving and continual learning in ever-growing MoE models.
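The switch rate quoted in the abstract can be read as the fraction of consecutive token positions at which a layer's active expert set changes. The sketch below computes that quantity for one layer. The representation of expert sets as any comparable values (for example, integers or frozensets of expert ids) and the function name are assumptions, and the paper's exact definition may differ.

```python
def switch_rate(expert_sets):
    """Fraction of adjacent token pairs whose active expert set differs."""
    if len(expert_sets) < 2:
        return 0.0
    # Count positions where the expert set changes from one token to the next.
    switches = sum(a != b for a, b in zip(expert_sets, expert_sets[1:]))
    return switches / (len(expert_sets) - 1)
```

A controller that holds an expert set for long runs of tokens drives this number down, which is what makes offloading and pre-fetching practical.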

Lever: Inference-Time Policy Reuse under Support Constraints

Authors: Ihor Vitenki, Noha Ibrahim, Sihem Amer-Yahia
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.20174
Pdf link: https://arxiv.org/pdf/2604.20174
Abstract Reinforcement learning (RL) policies are typically trained for fixed objectives, making reuse difficult when task requirements change. We study inference-time policy reuse: given a library of pre-trained policies and a new composite objective, can a high-quality policy be constructed entirely offline, without additional environment interaction? We introduce lever (Leveraging Efficient Vector Embeddings for Reusable policies), an end-to-end framework that retrieves relevant policies, evaluates them using behavioral embeddings, and composes new policies via offline Q-value composition. We focus on the support-limited regime, where no value propagation is possible, and show that the effectiveness of reuse depends critically on the coverage of available transitions. To balance performance and computational cost, lever proposes composition strategies that control the exploration of candidate policies. Experiments in deterministic GridWorld environments show that inference-time composition can match, and in some cases exceed, training-from-scratch performance while providing substantial speedups. At the same time, performance degrades when long-horizon dependencies require value propagation, highlighting a fundamental limitation of offline reuse.
中文摘要 强化学习（RL）策略通常针对固定目标进行训练，这使得任务需求变化时重用变得困难。我们研究推理时间策略重用：给定预训练策略库和新的复合目标，是否能完全离线构建高质量策略，无需额外的环境交互？我们引入了lever（利用高效向量嵌入以实现可重用策略），这是一个端到端框架，能够检索相关策略，利用行为嵌入进行评估，并通过离线Q值组合组合新策略。我们聚焦于支持有限的状态，即价值传播不可能，并展示了再利用的有效性关键依赖于可利用过渡的覆盖度。为了平衡性能和计算成本，Lever提出了控制候选策略探索的组合策略。确定性GridWorld环境下的实验表明，推理时间组合可以匹配甚至超过从零开始训练的性能，同时带来显著的加速。同时，当长视距依赖需要数值传播时，性能会下降，这凸显了离线重用的一个根本限制。
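A minimal sketch of offline Q-value composition under a support constraint. The abstract does not give the composition rule, so the weighted sum below is an assumption, as are the tabular representation and the names `compose_q` and `greedy_policy`. Only (state, action) pairs covered by every library policy are kept, mirroring the support-limited regime where no value propagation is possible.

```python
from typing import Dict, Hashable, List, Tuple

State = Hashable
Action = Hashable
QTable = Dict[Tuple[State, Action], float]


def compose_q(q_tables: List[QTable], weights: List[float]) -> QTable:
    """Weighted sum of Q-values over (state, action) pairs every table covers."""
    if not q_tables:
        return {}
    supported = set(q_tables[0])
    for q in q_tables[1:]:
        supported &= set(q)
    return {sa: sum(w * q[sa] for w, q in zip(weights, q_tables))
            for sa in supported}


def greedy_policy(q: QTable) -> Dict[State, Action]:
    """Pick the highest-valued supported action in each covered state."""
    best: Dict[State, Tuple[Action, float]] = {}
    for (s, a), v in q.items():
        if s not in best or v > best[s][1]:
            best[s] = (a, v)
    return {s: a for s, (a, _) in best.items()}
```

States that fall outside the shared support get no action at all, which illustrates why the abstract ties reuse quality to transition coverage.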

RADS: Reinforcement Learning-Based Sample Selection Improves Transfer Learning in Low-resource and Imbalanced Clinical Settings

RADS：基于强化学习的样本选择在低资源和不平衡临床环境中改善迁移学习

Authors: Wei Han, David Martinez, Anna Khanina, Lawrence Cavedon, Karin Verspoor
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.20256
Pdf link: https://arxiv.org/pdf/2604.20256
Abstract A common strategy in transfer learning is few-shot fine-tuning, but its success is highly dependent on the quality of samples selected as training examples. Active learning methods such as uncertainty sampling and diversity sampling can select useful samples. However, under extremely low-resource and class-imbalanced conditions, they often favor outliers rather than truly informative samples, resulting in degraded performance. In this paper, we introduce RADS (Reinforcement Adaptive Domain Sampling), a robust sample selection strategy using reinforcement learning (RL) to identify the most informative samples. Experimental evaluations on several real-world clinical datasets show that, compared to traditional methods, our sample selection strategy enhances model transferability while maintaining robust performance under extreme class imbalance.
中文摘要 迁移学习中的一个常见策略是少样微调，但其成功高度依赖于所选样本的质量。主动学习方法如不确定性抽样和多样性抽样可以选择有用的样本。然而，在极度低资源和类别不平衡的条件下，它们往往偏向异常值而非真正有信息量的样本，导致性能下降。本文介绍了RADS（强化自适应域抽样），这是一种利用强化学习（RL）来识别最具信息量样本的稳健样本选择策略。对多个真实临床数据集的实验评估表明，我们的样本选择策略提升了模型的可转移性，同时在极端类别不平衡下保持了稳健性，相较于传统方法。

TL-RL-FusionNet: An Adaptive and Efficient Reinforcement Learning-Driven Transfer Learning Framework for Detecting Evolving Ransomware Threats

TL-RL-FusionNet：一个自适应且高效的强化学习驱动迁移学习框架，用于检测不断演变的勒索软件威胁

Authors: Jannatul Ferdous, Rafiqul Islam, Arash Mahboubi, Md Zahidul Islam
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2604.20260
Pdf link: https://arxiv.org/pdf/2604.20260
Abstract Modern ransomware exhibits polymorphic and evasive behaviors by frequently modifying execution patterns to evade detection. This dynamic nature disrupts feature spaces and limits the effectiveness of static or predefined models. To address this challenge, we propose TL-RL-FusionNet, a reinforcement learning (RL)-guided hybrid framework that integrates frozen dual transfer learning (TL) backbones as feature extractors with a lightweight residual multilayer perceptron (MLP) classifier. The RL agent supervises training by adaptively reweighting samples in response to variations in observable ransomware behavior. Through reward and penalty signals, the agent prioritizes complex cases such as stealthy or polymorphic ransomware employing obfuscation, while down-weighting trivial samples including benign applications with simple file I/O operations or easily classified ransomware. This adaptive mechanism enables the model to dynamically refine its strategy, improving resilience against evolving threats while maintaining strong classification performance. The framework utilizes dynamic behavioral features such as file system activity, registry changes, network traffic, API calls, and anti-analysis checks, extracted from sandbox-generated JSON reports. These features are transformed into RGB images and processed using frozen EfficientNetB0 and InceptionV3 models to capture rich feature representations efficiently. Final classification is performed by a lightweight residual MLP guided by an RL (Q-learning) agent. Experiments on a balanced dataset of 1,000 samples (500 ransomware, 500 benign) show that TL-RL-FusionNet achieves 99.1% accuracy, 98.6% precision, 99.6% recall, and 99.74% AUC, outperforming non-RL baselines by up to 2.5% in accuracy and 3.1% in recall. Efficiency analysis shows 55% lower training time and 59% reduced RAM usage, demonstrating suitability for real-world deployment.
中文摘要 现代勒索软件通过频繁修改执行模式以规避检测表现出多态性和规避行为。这种动态特性扰乱了特征空间，限制了静态或预定义模型的有效性。为应对这一挑战，我们提出了TL-RL-FusionNet，一种强化学习（RL）引导的混合框架，将冻结的双重迁移学习（TL）骨干作为特征提取器，并配备轻量级残差多层感知器（MLP）分类器。强化学习代理通过根据可观察勒索软件行为的变化，自适应地重新加权样本来监督训练。通过奖励和惩罚信号，代理优先处理复杂案例，如采用混淆的隐形或多态勒索软件，同时降低包括简单文件I/O操作的良性应用或易于分类的勒索软件等琐碎样本的权重。这种自适应机制使模型能够动态优化策略，提升对不断演变威胁的韧性，同时保持强劲的分类性能。该框架利用动态行为特征，如文件系统活动、注册表变更、网络流量、API 调用和反分析检查，这些特征均从沙箱生成的 JSON 报告中提取。这些特征被转换为RGB图像，并使用冻结的EfficientNetB0和InceptionV3模型处理，以高效捕捉丰富的特征表示。最终分类由轻量级残差MLP执行，并由RL（Q-learning）代理引导。在一个包含1000个样本（500个勒索软件，500个良性）的平衡数据集上，实验显示TL-RL-FusionNet实现了99.1%的准确率、98.6%的精确率、99.6%的召回率和99.74%的AUC，准确率高出非强化学习基线多达2.5%，召回率高出多达3.1%。效率分析显示训练时间减少了55%，内存使用减少了59%，表明其适合实际部署。
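The reported accuracy, precision, and recall can be checked against each other with a little arithmetic. The abstract does not say which split the figures come from, so the sketch assumes all 1,000 samples (500 ransomware, 500 benign) were scored. The confusion-matrix counts are hypothetical: they are one set that reproduces the reported percentages and are not taken from the paper.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Hypothetical counts (assumption): 500 ransomware = tp + fn, 500 benign = fp + tn.
metrics = binary_metrics(tp=498, fp=7, fn=2, tn=493)
```

Under that assumption, 2 missed ransomware samples and 7 false alarms would give 99.1% accuracy, about 98.6% precision, and 99.6% recall.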

X-Cache: Cross-Chunk Block Caching for Few-Step Autoregressive World Models Inference

X缓存：用于少数步自回归世界模型推断的跨区块缓存

Authors: Yixiao Zeng, Jianlei Zheng, Chaoda Zheng, Shijia Chen, Mingdian Liu, Tongping Liu, Tengwei Luo, Yu Zhang, Boyang Wang, Linkun Xu, Siyuan Lu, Bo Tian, Xianming Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.20289
Pdf link: https://arxiv.org/pdf/2604.20289
Abstract Real-time world simulation is becoming key infrastructure for scalable evaluation and online reinforcement learning of autonomous driving systems. Recent driving world models built on autoregressive video diffusion achieve high-fidelity, controllable multi-camera generation, but their inference cost remains a bottleneck for interactive deployment. However, existing diffusion caching methods are designed for offline video generation with multiple denoising steps, and do not transfer to this scenario. Few-step distilled models have no inter-step redundancy left for these methods to reuse, and sequence-level parallelization techniques require future conditioning that closed-loop interactive generation does not provide. We present X-Cache, a training-free acceleration method that caches along a different axis: across consecutive generation chunks rather than across denoising steps. X-Cache maintains per-block residual caches that persist across chunks, and applies a dual-metric gating mechanism over a structure- and action-aware block-input fingerprint to independently decide whether each block should recompute or reuse its cached residual. To prevent approximation errors from permanently contaminating the autoregressive KV cache, X-Cache identifies KV update chunks (the forward passes that write clean keys and values into the persistent cache) and unconditionally forces full computation on these chunks, cutting off error propagation. We implement X-Cache on X-world, a production multi-camera action-conditioned driving world model built on multi-block causal DiT with few-step denoising and rolling KV cache. X-Cache achieves a 71% block skip rate with a 2.6x wall-clock speedup while maintaining minimal degradation.
中文摘要 实时世界仿真正成为自动化系统可扩展评估和在线强化学习的关键基础设施。基于自回归视频扩散的近期驾驶世界模型实现了高保真、可控的多摄像头生成，但其推断成本仍然是交互式部署的瓶颈。然而，现有的扩散缓存方法设计用于带有多重去噪步骤的离线视频生成，无法迁移到这种场景。少步蒸馏模型已无步骤间冗余可供这些方法重复使用，序列级并行化技术需要闭环交互生成所无法提供的未来条件。我们介绍X-Cache，一种无训练的加速方法，它沿不同轴线缓存：跨连续生成块，而非跨去噪步骤缓存。X-Cache维护每个块的残差缓存，这些缓存跨区块持续存在，并通过结构感知和动作感知的块输入指纹采用双重度量门控机制，独立决定每个块是否应重新计算或重用其缓存残差。为防止近似错误永久污染自回归的KV缓存，X-Cache识别KV更新块（将干净的键和值写入持久缓存的前向通道），并无条件强制对这些块进行完整计算，切断错误传播。我们在X-world上实现了X-Cache，这是一种基于多块因果DiT、少步去噪和滚动KV缓存的生产多机位动作控制驾驶世界模型。X-Cache 实现了 71% 的跳区块率，墙钟速度提升了 2.6 倍，同时保持了最小的降级。
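The recompute-or-reuse decision can be sketched for a single block as follows. This is a simplified illustration under stated assumptions: one distance metric replaces the paper's dual-metric gate, the fingerprint is a plain list of floats, and `BlockCache`, `threshold`, and `force_full` are hypothetical names. `force_full=True` plays the role of a KV update chunk, where full computation is forced.

```python
from typing import Callable, List, Optional, Sequence


class BlockCache:
    """Per-block residual cache that persists across generation chunks."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.fingerprint: Optional[List[float]] = None
        self.residual: Optional[List[float]] = None

    def forward(self, x: Sequence[float], fingerprint: Sequence[float],
                compute: Callable[[Sequence[float]], List[float]],
                force_full: bool = False) -> List[float]:
        """Return x + residual, recomputing the residual only when needed."""
        reuse = (
            not force_full
            and self.residual is not None
            and self.fingerprint is not None
            and max(abs(a - b) for a, b in zip(fingerprint, self.fingerprint))
            < self.threshold
        )
        if not reuse:
            # Full computation: refresh both the residual and its fingerprint.
            self.residual = compute(x)
            self.fingerprint = list(fingerprint)
        return [a + r for a, r in zip(x, self.residual)]
```

Comparing against the fingerprint stored at the last full computation, rather than the previous chunk, is a design choice in this sketch that keeps small drifts from accumulating unnoticed.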

ETac: A Lightweight and Efficient Tactile Simulation Framework for Learning Dexterous Manipulation

ETac：一款轻便高效的触觉模拟框架，用于学习灵巧操作

Authors: Zhe Xu, Feiyu Zhao, Xiyan Huang, Chenxi Xiao
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.20295
Pdf link: https://arxiv.org/pdf/2604.20295
Abstract Tactile sensors are increasingly integrated into dexterous robotic manipulators to enhance contact perception. However, learning manipulation policies that rely on tactile sensing remains challenging, primarily due to the trade-off between fidelity and computational cost of soft-body simulations. To address this, we present ETac, a tactile simulation framework that models elastomeric soft-body interactions with both high fidelity and efficiency. ETac employs a lightweight data-driven deformation propagation model to capture soft-body contact dynamics, achieving high simulation quality and boosting efficiency that enables large-scale policy training. When serving as the simulation backend, ETac produces surface deformation estimates comparable to FEM and demonstrates applicability for modeling real tactile sensors. Then, we showcase its capability in training a blind grasping policy that leverages large-area tactile feedback to manipulate diverse objects. Running on a single RTX 4090 GPU, ETac supports reinforcement learning across 4,096 parallel environments, achieving a total throughput of 869 FPS. The resulting policy reaches an average success rate of 84.45% across four object types, underscoring ETac's potential to make tactile-based skill learning both efficient and scalable.
中文摘要 触觉传感器越来越多地集成到灵巧的机器人机械臂中，以增强接触感知。然而，依赖触觉感知的学习操作策略仍然具有挑战性，主要原因是软体模拟的真实度与计算成本之间的权衡。为此，我们介绍了ETac，一种触觉仿真框架，能够高保真度和高效地模拟弹性体软体相互作用。ETac采用轻量级数据驱动变形传播模型捕捉软体接触动态，实现高模拟质量并提升效率，支持大规模策略训练。作为仿真后端，ETac 能够生成与有限元法（FEM）相当的表面变形估计值，并展示了对真实触觉传感器建模的适用性。随后，我们展示了其在训练盲抓策略方面的能力，该策略利用大面积触觉反馈来操控多样物体。ETac 运行在单块 RTX 4090 GPU 上，支持跨越 4,096 个并行环境的强化学习，实现 869 FPS 的总吞吐量。最终的政策在四种对象类型中平均成功率达到84.45%，凸显了ETac使基于触觉的技能学习既高效又可扩展的潜力。

Hybrid Latent Reasoning with Decoupled Policy Optimization

混合潜在推理与解耦策略优化

Authors: Tao Cheng, Shi-Zhe Chen, Hao Zhang, Yixin Qin, Jinwen Luo, Zheng Wei
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.20328
Pdf link: https://arxiv.org/pdf/2604.20328
Abstract Chain-of-Thought (CoT) reasoning significantly elevates the complex problem-solving capabilities of multimodal large language models (MLLMs). However, adapting CoT to vision typically discretizes signals to fit LLM inputs, causing early semantic collapse and discarding fine-grained details. While external tools can mitigate this, they introduce a rigid bottleneck, confining reasoning to predefined operations. Although recent latent reasoning paradigms internalize visual states to overcome these limitations, optimizing the resulting hybrid discrete-continuous action space remains challenging. In this work, we propose HyLaR (Hybrid Latent Reasoning), a framework that seamlessly interleaves discrete text generation with continuous visual latent representations. Specifically, following an initial cold-start supervised fine-tuning (SFT), we introduce DePO (Decoupled Policy Optimization) to enable effective reinforcement learning within this hybrid space. DePO decomposes the policy gradient objective, applying independent trust-region constraints to the textual and latent components, alongside an exact closed-form von Mises-Fisher (vMF) KL regularizer. Extensive experiments demonstrate that HyLaR outperforms standard MLLMs and state-of-the-art latent reasoning approaches across fine-grained perception and general multimodal understanding benchmarks. Code is available at this https URL.
中文摘要 思维链（CoT）推理显著提升了多模态大型语言模型（MLLM）的复杂问题解决能力。然而，将CoT适应视觉通常会离散化信号以适应LLM输入，导致早期语义崩溃并丢弃细粒度细节。虽然外部工具可以缓解这一问题，但它们会带来僵化的瓶颈，将推理限制在预定义的操作中。尽管近期的潜在推理范式内化视觉状态以克服这些限制，但优化由此产生的混合离散-连续动作空间仍具挑战性。在本研究中，我们提出了HyLaR（混合潜在推理），这是一种无缝交错离散文本生成与连续视觉潜在表示的框架。具体来说，在初步冷启动监督微调（SFT）之后，我们引入了DePO（解耦策略优化），以实现该混合空间内的有效强化学习。DePO 分解策略梯度目标，对文本和潜在组件应用独立的信任区域约束，同时使用精确的封闭形式冯·米塞斯-费舍尔（vMF）KL 正则化器。大量实验表明，HyLaR在细粒度感知和通用多模态理解基准测试中优于标准MLLM和最先进的潜在推理方法。代码可在此 https URL 访问。

Distributional Value Estimation Without Target Networks for Robust Quality-Diversity

无目标网络的分布值估计，实现稳健的质量多样性

Authors: Behrad Koohy, Jamie Bayne
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.20381
Pdf link: https://arxiv.org/pdf/2604.20381
Abstract Quality-Diversity (QD) algorithms excel at discovering diverse repertoires of skills, but are hindered by poor sample efficiency and often require tens of millions of environment steps to solve complex locomotion tasks. Recent advances in Reinforcement Learning (RL) have shown that high Update-to-Data (UTD) ratios accelerate Actor-Critic learning. While effective, standard high-UTD algorithms typically utilise target networks to stabilise training. This requirement introduces a significant computational bottleneck, rendering them impractical for resource-intensive Quality-Diversity (QD) tasks where sample efficiency and rapid population adaptation are critical. In this paper, we introduce QDHUAC, a sample-efficient, target-free and distributional QD-RL algorithm that provides dense and low-variance gradient signals, which enables high-UTD training for Dominated Novelty Search whilst requiring an order of magnitude fewer environment steps. We demonstrate that our method enables stable training at high UTD ratios, achieving competitive coverage and fitness on high-dimensional Brax environments with an order of magnitude fewer samples than baselines. Our results suggest that combining target-free distributional critics with dominance-based selection is a key enabler for the next generation of sample-efficient evolutionary RL algorithms.
中文摘要 质量多样性（QD）算法擅长发现多样化的技能库，但受限于样本效率低，且通常需要数千万个环境步骤来解决复杂的移动任务。强化学习（RL）的最新进展表明，高数据更新比（UTD）能加速演员-批判者学习。虽然有效，标准的高UTD算法通常利用目标网络来稳定训练。这一要求带来了显著的计算瓶颈，使其在资源密集型质量多样性（QD）任务中不切实际，因为样本效率和快速种群适应至关重要。本文介绍了QDHUAC，一种样本高效、无靶且分布式的QD-RL算法，能够提供密集且低方差的梯度信号，从而实现支配新颖性搜索的高UTD训练，同时所需的环境步骤数量级减少。我们证明了该方法能够在高UTD比率下实现稳定训练，在高维Brax环境中以比基线少一个数量级的样本量实现竞争覆盖和适应度。我们的结果表明，将无目标分布批评者与基于支配性选择的结合，是下一代高效样本进化强化学习算法的关键推动力。

WebGen-R1: Incentivizing Large Language Models to Generate Functional and Aesthetic Websites with Reinforcement Learning

WebGen-R1：激励大型语言模型通过强化学习生成功能性和美观的网站

Authors: Juyong Jiang, Chenglin Cai, Chansung Park, Jiasi Shen, Sunghun Kim, Jianguo Li, Yue Wang
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2604.20398
Pdf link: https://arxiv.org/pdf/2604.20398
Abstract While Large Language Models (LLMs) excel at function-level code generation, project-level tasks such as generating functional and visually aesthetic multi-page websites remain highly challenging. Existing works are often limited to single-page static websites, while agentic frameworks typically rely on multi-turn execution with proprietary models, leading to substantial token costs, high latency, and brittle integration. Training a small LLM end-to-end with reinforcement learning (RL) is a promising alternative, yet it faces a critical bottleneck in designing reliable and computationally feasible rewards for website generation. Unlike single-file coding tasks that can be verified by unit tests, website generation requires evaluating inherently subjective aesthetics, cross-page interactions, and functional correctness. To this end, we propose WebGen-R1, an end-to-end RL framework tailored for project-level website generation. We first introduce a scaffold-driven structured generation paradigm that constrains the large open-ended action space and preserves architectural integrity. We then design a novel cascaded multimodal reward that seamlessly couples structural guarantees with execution-grounded functional feedback and vision-based aesthetic supervision. Extensive experiments demonstrate that our WebGen-R1 substantially transforms a 7B base model from generating nearly nonfunctional websites into producing deployable, aesthetically aligned multi-page websites. Remarkably, our WebGen-R1 not only consistently outperforms heavily scaled open-source models (up to 72B), but also rivals the state-of-the-art DeepSeek-R1 (671B) in functional success, while substantially exceeding it in valid rendering and aesthetic alignment. These results position WebGen-R1 as a viable path for scaling small open models from function-level code generation to project-level web application generation.
中文摘要 虽然大型语言模型（LLM）在功能级代码生成方面表现出色，但项目级任务，如生成功能性且视觉美观的多页网站，依然极具挑战性。现有作品通常局限于单页静态网站，而代理框架通常依赖多回合执行和专有模型，导致代币成本高昂、延迟高且集成脆弱。用强化学习（RL）端到端训练小型LLM是一个有前景的替代方案，但它在设计可靠且计算可行的网站生成奖励时面临关键瓶颈。与可以通过单元测试验证的单文件编码任务不同，网站创建需要评估本质上主观的美观性、跨页交互以及功能正确性。为此，我们提出了WebGen-R1，一个面向项目级网站生成的端到端强化学习框架。我们首先引入了一种基于支架的结构化生成范式，该范式限制了大型开放式动作空间，并保持了建筑的完整性。随后，我们设计了一种创新的级联多模态奖励，将结构性保障与基于执行的功能反馈和基于视觉的美学监督无缝结合。大量实验表明，我们的WebGen-R1将7B基础模型从生成几乎无效的网站，转变为可部署、美观对齐的多页面网站。令人惊讶的是，我们的WebGen-R1不仅持续优于高比例的开源模型（最高可达72B），而且在功能上可与最先进的DeepSeek-R1（671B）媲美，在有效渲染和美学对齐上远超其性能。这些结果使WebGen-R1成为将小型开放模型从函数级代码生成扩展到项目级网页应用生成的可行路径。

Temporal Difference Calibration in Sequential Tasks: Application to Vision-Language-Action Models

序列任务中的时间差分校准：在视觉-语言-行动模型中的应用

Authors: Shelly Francis-Meretzki, Mirco Mutti, Yaniv Romano, Aviv Tamar
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.20472
Pdf link: https://arxiv.org/pdf/2604.20472
Abstract Recent advances in vision-language-action (VLA) models for robotics have highlighted the importance of reliable uncertainty quantification in sequential tasks. However, assessing and improving calibration in such settings remains mostly unexplored, especially when only partial trajectories are observed. In this work, we formulate sequential calibration for episodic tasks, where task-success confidence is produced along an episode, while success is determined at the end of it. We introduce a sequential extension of the Brier score and show that, for binary outcomes, its risk minimizer coincides with the VLA policy's value function. This connection bridges uncertainty calibration and reinforcement learning, enabling the use of temporal-difference (TD) value estimation as a principled calibration mechanism over time. We empirically show that TD calibration improves performance relative to the state-of-the-art on simulated and real-robot data. Interestingly, we show that when calibrated using TD, the VLA's single-step action probabilities can yield competitive uncertainty estimates, in contrast to recent findings that employed different calibration techniques.
中文摘要 机器人视觉-语言-动作（VLA）模型的最新进展凸显了序列任务中可靠不确定性量化的重要性。然而，在此类环境中评估和改进校准仍大多未被探索，尤其是在仅观察到部分轨迹时。在本研究中，我们提出了针对回合式任务的序列校准，其中任务成功置信度在回合进行过程中产生，而成功与否在回合结束时确定。我们引入了Brier评分的序列扩展，并证明对于二元结果，其风险最小化解与VLA策略的价值函数一致。这种联系连接了不确定性校准与强化学习，使得时间差分（TD）价值估计可以作为随时间推进的原则性校准机制。我们实证显示，TD校准相对于最先进技术在模拟和真实机器人数据上的性能有所提升。有趣的是，我们表明，使用TD校准时，VLA的单步动作概率能够产生具有竞争力的不确定性估计，这与采用不同校准技术的最新发现形成对比。
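The link between calibration and value estimation can be illustrated with a tabular sketch. It assumes the sequential Brier score is a plain average of per-step squared errors against the final binary outcome (the abstract does not give the exact form), and it uses undiscounted TD(0) with reward only at the end of the episode. The function names and the 0.5 prior are hypothetical.

```python
from typing import Dict, Hashable, List, Sequence


def sequential_brier(confidences: Sequence[float], success: int) -> float:
    """Mean squared gap between each step's confidence and the final outcome."""
    return sum((p - success) ** 2 for p in confidences) / len(confidences)


def td0_update(values: Dict[Hashable, float], states: List[Hashable],
               success: int, lr: float = 0.5) -> None:
    """One in-place TD(0) pass over an episode (no discount, terminal reward only)."""
    for t, s in enumerate(states):
        # Bootstrap from the next state's estimate; the last step targets the outcome.
        if t + 1 < len(states):
            target = values.get(states[t + 1], 0.5)
        else:
            target = float(success)
        v = values.get(s, 0.5)
        values[s] = v + lr * (target - v)
```

With a binary outcome and no discounting, the value of a state is the probability of eventual success from that state, which is why a TD-learned value can serve as a confidence estimate partway through an episode.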

Video-ToC: Video Tree-of-Cue Reasoning

视频目录：视频提示树推理

Authors: Qizhong Tan, Zhuotao Tian, Guangming Lu, Jun Yu, Wenjie Pei
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.20473
Pdf link: https://arxiv.org/pdf/2604.20473
Abstract Existing Video Large Language Models (Video LLMs) struggle with complex video understanding, exhibiting limited reasoning capabilities and potential hallucinations. In particular, these methods tend to perform reasoning solely relying on the pretrained inherent reasoning rationales whilst lacking perception-aware adaptation to the input video content. To address this, we propose Video-ToC, a novel video reasoning framework that enhances video understanding through tree-of-cue reasoning. Specifically, our approach introduces three key innovations: (1) A tree-guided visual cue localization mechanism, which endows the model with enhanced fine-grained perceptual capabilities through structured reasoning patterns; (2) A reasoning-demand reward mechanism, which dynamically adjusts the reward value for reinforcement learning (RL) based on the estimation of reasoning demands, enabling on-demand incentives for more effective reasoning strategies; and (3) An automated annotation pipeline that constructs the Video-ToC-SFT-1k and Video-ToC-RL-2k datasets for supervised fine-tuning (SFT) and RL training, respectively. Extensive evaluations on six video understanding benchmarks and a video hallucination benchmark demonstrate the superiority of Video-ToC over baselines and recent methods. Code is available at this https URL.
中文摘要 现有的视频大型语言模型（Video LLM）在复杂的视频理解上存在困难，推理能力有限且可能出现幻觉。特别是，这些方法往往仅依赖预训练的固有推理推理，缺乏对输入视频内容的感知感知适应。为此，我们提出了 Video-ToC，一种新颖的视频推理框架，通过提示树推理增强视频理解。具体来说，我们的方法引入了三项关键创新：（1）树引导视觉线索定位机制，通过结构化推理模式赋予模型增强的细粒度感知能力;（2）一种推理-需求奖励机制，基于推理需求估计动态调整强化学习（RL）的奖励值，从而实现按需激励以支持更有效的推理策略;以及（3）自动注释流水线，构建Video-ToC-SFT-1k和Video-ToC-RL-2k数据集，分别用于监督微调（SFT）和强化学习。对六项视频理解基准和一项视频幻觉基准的广泛评估显示，Video-ToC优于基线和最新方法。代码可在此 https URL 访问。

ProMMSearchAgent: A Generalizable Multimodal Search Agent Trained with Process-Oriented Rewards

ProMMSearchAgent：一款可推广的多模态搜索代理，采用过程导向奖励训练

Authors: Wentao Yan, Shengqin Wang, Huichi Zhou, Yihang Chen, Kun Shao, Yuan Xie, Zhizhong Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.20486
Pdf link: https://arxiv.org/pdf/2604.20486
Abstract Training multimodal agents via reinforcement learning for knowledge-intensive visual reasoning is fundamentally hindered by the extreme sparsity of outcome-based supervision and the unpredictability of live web environments. To resolve these algorithmic and environmental bottlenecks, we introduce ProMMSearchAgent, establishing a novel Sim-to-Real training paradigm for multimodal search. We decouple policy learning into a deterministic, local static sandbox. Crucially, to learn effectively within this constrained environment, we propose an introspective process-oriented reward. By probing the agent's own parametric knowledge boundaries, we generate dense behavioral metadata that explicitly rewards the correct cognitive decision, initiating a multimodal or text search only when visually or factually uncertain. Extensive experiments demonstrate that our locally-trained policy transfers zero-shot to the live Google Search API. ProMMSearchAgent achieves new SOTA performance, outperforming MMSearch-R1 by +5.1% on FVQA-test, +6.3% on InfoSeek, and +11.3% on MMSearch.
中文摘要 通过强化学习训练多模态代理以实现知识密集型视觉推理，根本受限于基于结果的监督极度稀缺和实时网络环境的不可预测性。为解决这些算法和环境瓶颈，我们引入了ProMMSearchAgent，建立了一种新的模拟到现实多模态搜索训练范式。我们将策略学习解耦到一个确定性、局部静态的沙盒中。关键是，为了在这种受限环境中有效学习，我们提出了一种内省、过程导向的奖励。通过探测代理自身的参数化知识边界，我们生成密集的行为元数据，明确奖励正确的认知决策，只有在视觉或事实不确定时才启动多模态或文本搜索。大量实验表明，我们本地训练的策略可以零样本迁移到实时的谷歌搜索API。ProMMSearchAgent 实现了新的 SOTA 性能，在 FVQA-test 上比 MMSearch-R1 高出 +5.1%，在 InfoSeek 上高出 +6.3%，在 MMSearch 上高出 +11.3%。

Where Reasoning Breaks: Logic-Aware Path Selection by Controlling Logical Connectives in LLMs Reasoning Chains

推理破碎之处：通过控制逻辑连接词在大型语言模型推理链中实现逻辑感知路径选择

Authors: Seunghyun Park, Yuanyuan Lei
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.20564
Pdf link: https://arxiv.org/pdf/2604.20564
Abstract While LLMs demonstrate impressive reasoning capabilities, they remain fragile in multi-step logical deduction, where a single transition error can propagate through the entire reasoning chain, leading to unstable performance. In this work, we identify logical connectives as primary points of this structural fragility. Through empirical analysis, we show that connective tokens function as high-entropy forking points, at which models frequently struggle to determine the correct logical direction. Motivated by this observation, we hypothesize that intervening in logical connective selection can guide LLMs toward a more correct logical direction, thereby improving the overall reasoning chain. To validate this hypothesis, we propose a multi-layered framework that intervenes specifically at these logic-critical junctions in the reasoning process. Our framework includes (1) Gradient-based Logical Steering to guide LLMs' internal representations towards valid reasoning subspaces, (2) Localized Branching to resolve ambiguity via targeted look-ahead search, and (3) Targeted Transition Preference Optimization, a surgical reinforcement learning objective that selectively optimizes single-token preferences at logical pivots. Crucially, by concentrating intervention solely on logic-critical transitions, our framework achieves a favorable accuracy-efficiency trade-off compared to global inference-time scaling methods like beam search and self-consistency.
中文摘要 虽然LLM展现了令人印象深刻的推理能力，但在多步逻辑推理中仍然脆弱，单个转换误差可能在整个推理链中传播，导致性能不稳定。在本研究中，我们将逻辑连接词识别为这种结构脆弱性的主要点。通过实证分析，我们表明连接词作为高熵的分叉点，模型常常难以确定正确的逻辑方向。基于这一观察，我们假设通过逻辑连接选择的干预可以引导LLM朝向更正确的逻辑方向，从而改善整体推理链。为验证这一假设，我们提出了一个多层次的框架，专门介入推理过程中这些逻辑关键节点。我们的框架包括：（1）基于梯度的逻辑引导，引导LLM内部表示向有效推理子空间移动;（2）局部分支通过有针对性的前瞻搜索解决歧义;（3）有针对性的过渡偏好优化，这是一种外科手术式强化学习目标，在逻辑枢轴处选择性优化单一词符偏好。关键是，通过专注于逻辑临界转移的干预，我们的框架实现了比全局推理时间尺度方法（如束搜索和自洽性）更有利的准确性——效率权衡。

Ask Only When Needed: Proactive Retrieval from Memory and Skills for Experience-Driven Lifelong Agents

仅在需要时提问：经验驱动的终身代理人的主动记忆与技能检索

Authors: Yuxuan Cai, Jie Zhou, Qin Chen, Liang He
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.20572
Pdf link: https://arxiv.org/pdf/2604.20572
Abstract Online lifelong learning enables agents to accumulate experience across interactions and continually improve on long-horizon tasks. However, existing methods typically treat retrieval from past experience as a passive operation, triggering it only at task initialization or after completing a step. Consequently, agents often fail to identify knowledge gaps during interaction and proactively retrieve the most useful experience for the current decision. To address this limitation, we present ProactAgent, an experience-driven lifelong learning framework for proactive retrieval over a structured experience base. We first introduce Experience-Enhanced Online Evolution (ExpOnEvo), which enables continual improvement through both policy updates and memory refinement. The experience base organizes historical interactions into typed repositories, including factual memory, episodic memory, and behavioral skills, so that retrieval can provide both relevant evidence and actionable guidance. On top of this, we propose Proactive Reinforcement Learning-based Retrieval (ProactRL), which models retrieval as an explicit policy action and learns when and what to retrieve via paired-branch process rewards. By comparing continuations from identical interaction prefixes with and without retrieval, ProactRL provides step-level supervision for retrieval decisions, encouraging retrieval only when it leads to better task outcomes or higher efficiency. Experiments on SciWorld, AlfWorld, and StuLife show that ProactAgent consistently improves lifelong agent performance, achieving success rates of 73.50% on SciWorld and 71.28% on AlfWorld while substantially reducing retrieval overhead, and attains performance competitive with proprietary models on StuLife.
中文摘要 在线终身学习使智能体能够在互动中积累经验，并持续改进长期任务。然而，现有方法通常将从过去经验中检索视为被动操作，仅在任务初始化或完成某一步后触发。因此，代理在交互过程中常常未能识别知识空白，也无法主动检索当前决策中最有用的经验。为解决这一限制，我们提出了ProactAgent，一个基于经验的终身学习框架，支持在结构化经验基础上的主动检索。我们首先介绍体验增强在线演进（ExpOnEvo），通过策略更新和内存优化实现持续改进。该经验库将历史互动组织成类型存储库，包括事实记忆、情景记忆和行为技能，使检索既能提供相关证据，也能提供可操作的指导。在此基础上，我们提出了基于主动强化学习的检索（ProactRL），该方法将检索建模为显式策略动作，并通过配对分支过程奖励学习何时及检索什么。通过比较有无检索的相同交互前缀的延续，ProactRL为检索决策提供步骤级监督，仅在检索带来更好任务结果或更高效率时才鼓励检索。在SciWorld、AlfWorld和StuLife上的实验显示，ProactAgent持续提升代理终身性能，在SciWorld上实现73.50%和AlfWorld的71.28%成功率，同时显著降低检索开销，性能可与StuLife上的专有模型竞争。

A Hierarchical MARL-Based Approach for Coordinated Retail P2P Trading and Wholesale Market Participation of DERs

基于MARL的层级方法，协调零售P2P交易和DERs批发市场参与

Authors: Patrick Wilk, Ethan Cantor, Yikui Liu, Jie Li
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2604.20586
Pdf link: https://arxiv.org/pdf/2604.20586
Abstract The ongoing shift towards decentralization of the electric energy sector, driven by the growing electrification across end-use sectors, and widespread adoption of distributed energy resources (DERs), necessitates their active participation in the electricity markets to support grid operations. Furthermore, with bi-directional energy and communication flows becoming standard, intelligent, easy-to-deploy, resource-conservative demand-side participation is expected to play a critical role in securing power grid operational flexibility and market efficiency. This work proposes a market engagement framework that leverages a hierarchical multi-agent deep reinforcement learning (MARL) approach to enable individual prosumers to participate in peer-to-peer retail auctions and further aggregate these intelligent prosumers to facilitate effective DER participation in wholesale markets. Ultimately, a Stackelberg game is proposed to coordinate this hierarchical MARL-based DER market participation framework toward enhanced market performance.
中文摘要 随着终端使用部门电气化的推进以及分布式能源资源（DERs）的广泛采用，电力行业不断向去中心化转型，促使电能部门积极参与电力市场以支持电网运营。此外，随着双向能源和通信流成为标准，智能、易于部署、资源节约的需求方参与预计将在保障电网运营灵活性和市场效率方面发挥关键作用。本研究提出了一种市场参与框架，利用分层多智能体深度强化学习（MARL）方法，使个体专业消费者能够参与点对点零售拍卖，并进一步聚合这些智能的消费者，以促进DER在批发市场中的有效参与。最终，拟设计Stackelberg博弈以协调这一基于MARL的分层DER市场参与框架，以提升市场表现。

GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning

GRPO-VPS：通过可验证的过程监督增强群体相对策略优化，实现有效推理

Authors: Jingyi Wang, Lei Zhu, Tengjin Weng, Song-Li Wu, Haochen Tan, Jierun Chen, Chaofan Tao, Haoli Bai, Lu Hou, Lifeng Shang, Xiao-Ping Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.20659
Pdf link: https://arxiv.org/pdf/2604.20659
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capabilities of Large Language Models (LLMs) by leveraging direct outcome verification instead of learned reward models. Building on this paradigm, Group Relative Policy Optimization (GRPO) eliminates the need for critic models but suffers from indiscriminate credit assignment for intermediate steps, which limits its ability to identify effective reasoning strategies and incurs overthinking. In this work, we introduce a model-free and verifiable process supervision via probing the model's belief in the correct answer throughout its reasoning trajectory. By segmenting the generation into discrete steps and tracking the conditional probability of the correct answer appended at each segment boundary, we efficiently compute interpretable segment-wise progress measurements to refine GRPO's trajectory-level feedback. This approach enables more targeted and sample-efficient policy updates, while avoiding the need for intermediate supervision derived from costly Monte Carlo rollouts or auxiliary models. Experiments on mathematical and general-domain benchmarks show consistent gains over GRPO across diverse models: up to 2.6-point accuracy improvements and 13.7% reasoning-length reductions on math tasks, and up to 2.4 points and 4% on general-domain tasks, demonstrating strong generalization.
中文摘要 带可验证奖励的强化学习（RLVR）通过利用直接的结果验证取代学习奖励模型，提升了大型语言模型（LLMs）的推理能力。基于这一范式，群体相对策略优化（GRPO）消除了批判模型的需求，但中间步骤的信用分配不加差别，限制了其识别有效推理策略的能力，并引发过度思考。在本研究中，我们通过探究模型在整个推理轨迹中对正确答案的信念，引入了一种无模型且可验证的过程监督。通过将生成过程分割为离散步骤，并在每个段边界追踪正确答案的条件概率，我们高效计算可解读的分段进展测量，以优化GRPO的轨迹级反馈。这种方法能够实现更有针对性和样本效率的策略更新，同时避免了因成本高昂的蒙特卡洛推广或辅助模型而产生的中间监督。数学和广域基准测试实验显示，在不同模型中相较GRPO持续提升：数学任务中准确率提升多达2.6分，推理长度缩短13.7%，一般领域任务提升最高2.4分和4%，显示出强有力的泛化能力。
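The two ingredients the abstract combines can be sketched separately. The first function is the usual GRPO-style group-standardized advantage; using the population standard deviation is a choice made here, not something the abstract specifies. The second turns probed answer probabilities at segment boundaries into per-segment progress. How the paper merges the two signals is not stated in the abstract, so the sketch does not attempt it; all names are illustrative.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-8) -> List[float]:
    """Standardize each rollout's reward against its sampling group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def segment_progress(answer_probs: Sequence[float]) -> List[float]:
    """Change in P(correct answer) across consecutive segment boundaries.

    answer_probs[0] is the probe before any reasoning; answer_probs[k] is the
    probe after segment k (hypothetical values supplied by the caller).
    """
    return [b - a for a, b in zip(answer_probs, answer_probs[1:])]
```

A segment with zero or negative progress contributes nothing toward the correct answer, which is the kind of signal that could discourage overthinking.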

MGDA-Decoupled: Geometry-Aware Multi-Objective Optimisation for DPO-based LLM Alignment

MGDA-解耦：基于DPO的多目标多目标优化 LLM

Authors: Andor Vári-Kakas, Ji Won Park, Natasa Tagasovska
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.20685
Pdf link: https://arxiv.org/pdf/2604.20685
Abstract Aligning large language models (LLMs) to desirable human values requires balancing multiple, potentially conflicting objectives such as helpfulness, truthfulness, and harmlessness, which presents a multi-objective optimisation challenge. Most alignment pipelines rely on a fixed scalarisation of these objectives, which can introduce procedural unfairness by systematically under-weighting harder-to-optimise or minority objectives. To promote more equitable trade-offs, we introduce MGDA-Decoupled, a geometry-based multi-objective optimisation algorithm that finds a shared descent direction while explicitly accounting for each objective's convergence dynamics. In contrast to prior methods that depend on reinforcement learning (e.g., GAPO) or explicit reward models (e.g., MODPO), our approach operates entirely within the lightweight Direct Preference Optimisation (DPO) paradigm. Experiments on the UltraFeedback dataset show that geometry-aware methods, and MGDA-Decoupled in particular, achieve the highest win rates against golden responses, both overall and per objective.
中文摘要 将大型语言模型（LLMs）与理想人类价值观对齐，需要平衡多个可能相互冲突的目标，如帮助性、真实性和无害性，这带来了多目标的优化挑战。大多数对齐流程依赖于这些目标的固定等级化，这可能通过系统性地低估难以优化或少数目标，从而引入程序不公平。为促进更公平的权衡，我们引入了MGDA-Decoupled，一种基于几何的多目标优化算法，在明确考虑每个目标收敛动态的同时，找到共享下降方向。与以往依赖强化学习（如GAPO）或显式奖励模型（如MODPO）的方法不同，我们的方法完全在轻量级直接偏好优化（DPO）范式内运行。UltraFeedback数据集上的实验显示，几何感知方法——尤其是MGDA-Decoupled——在面对黄金响应时，无论是整体还是按目标计算，都能获得最高的胜率。
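To make the "shared descent direction" in the abstract above concrete, here is a minimal sketch of the classic two-objective MGDA step: it finds the minimum-norm point in the convex hull of two gradients, which has a closed form. This is plain MGDA only; the "Decoupled" modification that accounts for per-objective convergence dynamics is not described in the abstract and is not reproduced here. The function names are illustrative.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def mgda_two_objectives(
    g1: Sequence[float], g2: Sequence[float]
) -> Tuple[float, List[float]]:
    """Return (alpha, d) where d = alpha*g1 + (1-alpha)*g2 has minimum norm.

    alpha is the closed-form minimizer of ||alpha*g1 + (1-alpha)*g2||^2,
    clipped to [0, 1] so that d stays in the convex hull of g1 and g2.
    """
    diff = [x - y for x, y in zip(g1, g2)]
    denom = dot(diff, diff)
    if denom == 0.0:
        alpha = 0.5  # identical gradients: any weighting gives the same d
    else:
        alpha = dot([y - x for x, y in zip(g1, g2)], g2) / denom
        alpha = min(1.0, max(0.0, alpha))
    d = [alpha * x + (1.0 - alpha) * y for x, y in zip(g1, g2)]
    return alpha, d


if __name__ == "__main__":
    alpha, d = mgda_two_objectives([1.0, 0.0], [0.0, 1.0])
    print(alpha, d)
```

Stepping along `-d` does not increase either objective to first order, and `d` shrinks to zero when the two gradients point in exactly opposite directions, which signals a Pareto-stationary point.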

SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models

SSL-R1：面向多模态大型语言模型的自监督视觉强化后训练

Authors: Jiahao Xie, Alessio Tonioni, Nathalie Rauschmayr, Federico Tombari, Bernt Schiele
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.20705
Pdf link: https://arxiv.org/pdf/2604.20705
Abstract Reinforcement learning (RL) with verifiable rewards (RLVR) has demonstrated the great potential of enhancing the reasoning abilities in multimodal large language models (MLLMs). However, the reliance on language-centric priors and expensive manual annotations prevents MLLMs' intrinsic visual understanding and scalable reward designs. In this work, we introduce SSL-R1, a generic self-supervised RL framework that derives verifiable rewards directly from images. To this end, we revisit self-supervised learning (SSL) in visual domains and reformulate widely-used SSL tasks into a set of verifiable visual puzzles for RL post-training, requiring neither human nor external model supervision. Training MLLMs on these tasks substantially improves their performance on multimodal understanding and reasoning benchmarks, highlighting the potential of leveraging vision-centric self-supervised tasks for MLLM post-training. We think this work will provide useful experience in devising effective self-supervised verifiable rewards to enable RL at scale. Project page: this https URL.
中文摘要 带可验证奖励的强化学习（RLVR）展示了增强多模态大型语言模型（MLLM）推理能力的巨大潜力。然而，对以语言为中心的先验和昂贵人工标注的依赖，阻碍了MLLM内在的视觉理解和可扩展的奖励设计。在本研究中，我们提出了SSL-R1，一种通用的自监督强化学习框架，直接从图像中推导出可验证的奖励。为此，我们重新审视视觉领域的自监督学习（SSL），并将广泛使用的SSL任务重新构建为一组可验证的视觉谜题，用于强化学习后训练，无需人类或外部模型监督。在这些任务上训练MLLM，显著提升了其在多模态理解和推理基准上的表现，凸显了利用以视觉为中心的自监督任务进行MLLM后训练的潜力。我们认为这项工作将为设计有效的自监督可验证奖励提供有用经验，以实现大规模强化学习。项目页面：this https URL。

Visual-Tactile Peg-in-Hole Assembly Learning from Peg-out-of-Hole Disassembly

视觉-触觉轴孔装配：从拔钉出孔拆解中学习

Authors: Yongqiang Zhao, Xuyang Zhang, Zhuo Chen, Matteo Leonetti, Emmanouil Spyrakos-Papastavridis, Shan Luo
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.20712
Pdf link: https://arxiv.org/pdf/2604.20712
Abstract Peg-in-hole (PiH) assembly is a fundamental yet challenging robotic manipulation task. While reinforcement learning (RL) has shown promise in tackling such tasks, it requires extensive exploration. In this paper, we propose a novel visual-tactile skill learning framework for the PiH task that leverages its inverse task, i.e., peg-out-of-hole (PooH) disassembly, to facilitate PiH learning. Compared to PiH, PooH is inherently easier as it only needs to overcome existing friction without precise alignment, making data collection more efficient. To this end, we formulate both PooH and PiH as Partially Observable Markov Decision Processes (POMDPs) in a unified environment with shared visual-tactile observation space. A visual-tactile PooH policy is first trained; its trajectories, containing kinematic, visual and tactile information, are temporally reversed and action-randomized to provide expert data for PiH. In the policy learning, visual sensing facilitates the peg-hole approach, while tactile measurements compensate for peg-hole misalignment. Experiments across diverse peg-hole geometries show that the visual-tactile policy attains 6.4% lower contact forces than its single-modality counterparts, and that our framework achieves average success rates of 87.5% on seen objects and 77.1% on unseen objects, outperforming direct RL methods that train PiH policies from scratch by 18.1% in success rate. Demos, code, and datasets are available at this https URL.
中文摘要 轴孔装配（peg-in-hole，PiH）是一项基础但具有挑战性的机器人操作任务。虽然强化学习（RL）在解决此类任务方面展现出潜力，但它需要大量探索。本文提出了一种新的面向PiH任务的视觉-触觉技能学习框架，利用其逆任务，即拔钉出孔（peg-out-of-hole，PooH）拆解，来促进PiH学习。与PiH相比，PooH本质上更简单，因为它只需克服已有摩擦，无需精确对齐，从而使数据收集更高效。为此，我们在具有共享视觉-触觉观测空间的统一环境中，将PooH和PiH都建模为部分可观测马尔可夫决策过程（POMDP）。首先训练一个视觉-触觉PooH策略；其轨迹包含运动学、视觉和触觉信息，经过时间反转和动作随机化后，为PiH提供专家数据。在策略学习中，视觉感知帮助钉接近孔，而触觉测量则补偿钉孔错位。在多种钉孔几何形状上的实验显示，视觉-触觉策略的接触力比单模态策略低6.4%，我们的框架在已见物体上的平均成功率为87.5%，在未见物体上为77.1%，成功率比从零训练PiH策略的直接强化学习方法高出18.1%。演示、代码和数据集可在 this https URL 获取。
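The temporal reversal step described in the abstract above can be sketched as follows, assuming that actions are small displacement commands so that negating an action approximately undoes it. That assumption, the transition layout, and the Gaussian perturbation used as a stand-in for "action-randomized" are all illustrative choices; the abstract does not give these details.

```python
import random
from typing import List, Optional, Sequence, Tuple

# (observation, action, next_observation); a hypothetical layout for this sketch
Transition = Tuple[Sequence[float], Sequence[float], Sequence[float]]


def reverse_trajectory(
    trajectory: List[Transition],
    noise_std: float = 0.0,
    rng: Optional[random.Random] = None,
) -> List[Transition]:
    """Turn a disassembly (PooH) trajectory into a pseudo-expert assembly (PiH) one.

    The transitions are visited in reverse order, each one swaps its start
    and end observation, and each action is negated and optionally perturbed.
    """
    rng = rng or random.Random(0)
    reversed_traj: List[Transition] = []
    for obs, action, next_obs in reversed(trajectory):
        new_action = [-a + rng.gauss(0.0, noise_std) for a in action]
        reversed_traj.append((next_obs, new_action, obs))
    return reversed_traj


if __name__ == "__main__":
    pooh = [([0.0], [1.0], [1.0]), ([1.0], [2.0], [3.0])]  # peg moving out of the hole
    print(reverse_trajectory(pooh))
```

With `noise_std=0.0` the output is an exact mirror of the input; a small positive value spreads the demonstrations around the reversed path.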

Near-Future Policy Optimization

近未来策略优化

Authors: Chuanyu Qin, Chenxu Yang, Qingyi Si, Naibin Gu, Dingyu Yao, Zheng Lin, Peng Fu, Nan Duan, Jiaqi Wang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.20733
Pdf link: https://arxiv.org/pdf/2604.20733
Abstract Reinforcement learning with verifiable rewards (RLVR) has become a core post-training recipe. Introducing suitable off-policy trajectories into on-policy exploration accelerates RLVR convergence and raises the performance ceiling, yet finding a source of such trajectories remains the key challenge. Existing mixed-policy methods either import trajectories from external teachers (high-quality but distributionally far) or replay past training trajectories (close but capped in quality), and neither simultaneously satisfies the strong enough (higher $Q$, more new knowledge to learn) and close enough (lower $V$, more readily absorbed) conditions required to maximize the effective learning signal $\mathcal{S} = Q/V$. We propose Near-Future Policy Optimization (NPO), a simple mixed-policy scheme that learns from a policy's own near-future self: a later checkpoint from the same training run is a natural source of auxiliary trajectories that is both stronger than the current policy and closer than any external source, directly balancing trajectory quality against variance cost. We validate NPO through two manual interventions, early-stage bootstrapping and late-stage plateau breakthrough, and further propose AutoNPO, an adaptive variant that automatically triggers interventions from online training signals and selects the guide checkpoint that maximizes $\mathcal{S}$. On Qwen3-VL-8B-Instruct with GRPO, NPO improves average performance from 57.88 to 62.84, and AutoNPO pushes it to 63.15, raising the final performance ceiling while accelerating convergence.
中文摘要 带可验证奖励的强化学习（RLVR）已成为后训练的核心方案。在同策略（on-policy）探索中引入合适的异策略（off-policy）轨迹可以加速RLVR收敛并提高性能上限，但找到此类轨迹的来源仍是关键挑战。现有的混合策略方法要么从外部教师导入轨迹（高质量但分布较远），要么重放过去的训练轨迹（接近但质量有上限），两者都无法同时满足“足够强”（更高的 $Q$，有更多新知识可学）和“足够近”（更低的 $V$，更容易吸收）这两个条件，而这是最大化有效学习信号 $\mathcal{S} = Q/V$ 所必需的。我们提出了 Near-Future Policy Optimization（NPO），这是一种简单的混合策略方案，让策略向自身的近未来版本学习：同一训练运行中的较晚检查点是辅助轨迹的自然来源，既比当前策略更强，又比任何外部来源更接近，从而直接平衡轨迹质量与方差成本。我们通过两种手动干预（早期引导和后期平台期突破）验证了NPO，并进一步提出了AutoNPO，这是一种自适应变体，能根据在线训练信号自动触发干预，并选择使 $\mathcal{S}$ 最大化的引导检查点。在使用GRPO的Qwen3-VL-8B-Instruct上，NPO将平均性能从57.88提升到62.84，AutoNPO进一步将其提升至63.15，在加速收敛的同时提高了最终性能上限。
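The checkpoint-selection rule mentioned in the abstract above (pick the guide checkpoint that maximizes $\mathcal{S} = Q/V$) can be sketched in a few lines. How $Q$ and $V$ are actually estimated from online training signals is not stated in the abstract, so the `Candidate` record and its `q` and `v` fields below are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Candidate:
    """A later checkpoint considered as a guide (illustrative fields only)."""
    step: int
    q: float  # assumed estimate of quality gain over the current policy
    v: float  # assumed estimate of variance cost; must be positive


def learning_signal(c: Candidate) -> float:
    """Effective learning signal S = Q / V from the abstract."""
    if c.v <= 0:
        raise ValueError("variance cost must be positive")
    return c.q / c.v


def select_guide(candidates: Iterable[Candidate]) -> Candidate:
    """Return the candidate checkpoint with the largest S."""
    return max(candidates, key=learning_signal)


if __name__ == "__main__":
    pool = [
        Candidate(step=100, q=0.2, v=0.1),  # close but only slightly stronger
        Candidate(step=200, q=0.5, v=0.2),
        Candidate(step=400, q=0.9, v=0.6),  # strong but distributionally far
    ]
    print(select_guide(pool).step)
```

The example values are made up; they only show that the ratio can favour a middle checkpoint over both the nearest and the strongest one.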

V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization

V-tableR1：过程监督多模态表推理与批判者引导策略优化

Authors: Yubo Jiang, Yitong An, Xin Yang, Abudukelimu Wuerkaixi, Xuxin Cheng, Fengying Xie, Zhiguo Jiang, Cao Liu, Ke Zeng, Haopeng Zhang
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.20755
Pdf link: https://arxiv.org/pdf/2604.20755
Abstract We introduce V-tableR1, a process-supervised reinforcement learning framework that elicits rigorous, verifiable reasoning from multimodal large language models (MLLMs). Current MLLMs trained solely on final outcomes often treat visual reasoning as a black box, relying on superficial pattern matching rather than performing rigorous multi-step inference. While Reinforcement Learning with Verifiable Rewards could enforce transparent reasoning trajectories, extending it to visual domains remains severely hindered by the ambiguity of grounding abstract logic into continuous pixel space. We solve this by leveraging the deterministic grid structure of tables as an ideal visual testbed. V-tableR1 employs a specialized critic VLM to provide dense, step-level feedback on the explicit visual chain-of-thought generated by a policy VLM. To optimize this system, we propose Process-Guided Direct Alignment Policy Optimization (PGPO), a novel RL algorithm integrating process rewards, decoupled policy constraints, and length-aware dynamic sampling. Extensive evaluations demonstrate that V-tableR1 explicitly penalizes visual hallucinations and shortcut guessing. By fundamentally shifting multimodal inference from black-box pattern matching to verifiable logical derivation, V-tableR1 4B establishes state-of-the-art accuracy among open-source models on complex tabular benchmarks, outperforming models up to 18x its size and improving over its SFT baseline
中文摘要 我们介绍了V-tableR1，一种过程监督强化学习框架，能够从多模态大型语言模型（MLLMs）中引发严谨且可验证的推理。目前仅基于最终结果训练的MLLM常将视觉推理视为黑箱，依赖表面模式匹配，而非严格的多步推断。虽然带有可验证奖励的强化学习可以强制执行透明的推理轨迹，但将其扩展到视觉领域仍受限于将抽象逻辑扎根于连续像素空间的模糊性。我们通过利用表的确定性网格结构作为理想的视觉测试平台来解决这个问题。V-tableR1 采用专业的批评 VLM，对策略 VLM 生成的显式视觉思维链提供密集的步级反馈。为优化该系统，我们提出了过程引导直接对齐策略优化（PGPO）——一种新颖的强化学习算法，集成了过程奖励、解耦策略约束和长度感知动态采样。大量评估表明，V-tableR1明确惩罚视觉幻觉和捷径猜测。通过从黑箱模式匹配转向可验证逻辑推导，V-tableR1 4B 在开源模型中建立了复杂的表格基准测试的先进精度，性能超过其最大 18 倍的模型，并优于其 SFT 基线

ParetoSlider: Diffusion Models Post-Training for Continuous Reward Control

ParetoSlider：连续奖励控制的后训练扩散模型

Authors: Shelly Golan, Michael Finkelson, Ariel Bereslavsky, Yotam Nitzan, Or Patashnik
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.20816
Pdf link: https://arxiv.org/pdf/2604.20816
Abstract Reinforcement Learning (RL) post-training has become the standard for aligning generative models with human preferences, yet most methods rely on a single scalar reward. When multiple criteria matter, the prevailing practice of "early scalarization" collapses rewards into a fixed weighted sum. This commits the model to a single trade-off point at training time, providing no inference-time control over inherently conflicting goals -- such as prompt adherence versus source fidelity in image editing. We introduce ParetoSlider, a multi-objective RL (MORL) framework that trains a single diffusion model to approximate the entire Pareto front. By training the model with continuously varying preference weights as a conditioning signal, we enable users to navigate optimal trade-offs at inference time without retraining or maintaining multiple checkpoints. We evaluate ParetoSlider across three state-of-the-art flow-matching backbones: SD3.5, FluxKontext, and LTX-2. Our single preference-conditioned model matches or exceeds the performance of baselines trained separately for fixed reward trade-offs, while uniquely providing fine-grained control over competing generative goals.
中文摘要 强化学习（RL）后训练已成为将生成模型与人类偏好对齐的标准做法，但大多数方法依赖单一标量奖励。当多个标准都重要时，通行的“早期标量化”做法将各项奖励合并为固定的加权和。这使模型在训练时就被固定在单一权衡点上，无法在推理时控制本质上相互冲突的目标，例如图像编辑中的提示遵循与源图像保真度。我们提出ParetoSlider，这是一种多目标强化学习（MORL）框架，训练单个扩散模型来近似整个帕累托前沿。通过以连续变化的偏好权重作为条件信号训练模型，我们使用户能够在推理时在最优权衡之间自由调节，而无需重新训练或维护多个检查点。我们在三个最先进的流匹配骨干模型上评估了ParetoSlider：SD3.5、FluxKontext和LTX-2。我们的单一偏好条件模型达到或超过了针对固定奖励权衡分别训练的基线的性能，同时独有地提供对相互竞争的生成目标的细粒度控制。
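The contrast between a fixed weighted sum and preference-conditioned training in the abstract above can be illustrated with a small sketch: at every training step a fresh weight vector is drawn from the probability simplex, used to combine the per-objective rewards, and also handed to the model as a conditioning input. Sampling uniformly from the simplex, the linear combination, and the dictionary layout are assumptions for illustration; the abstract only says that preference weights vary continuously and act as a conditioning signal.

```python
import random
from typing import Dict, List, Sequence


def sample_preference(num_objectives: int, rng: random.Random) -> List[float]:
    """Draw weights uniformly from the simplex (normalized exponential draws)."""
    draws = [rng.expovariate(1.0) for _ in range(num_objectives)]
    total = sum(draws)
    return [d / total for d in draws]


def scalarize(weights: Sequence[float], rewards: Sequence[float]) -> float:
    """Weighted sum of per-objective rewards under one preference vector."""
    if len(weights) != len(rewards):
        raise ValueError("weights and rewards must have the same length")
    return sum(w * r for w, r in zip(weights, rewards))


def training_example(prompt: str, rewards: Sequence[float], rng: random.Random) -> Dict:
    """Build one hypothetical training record with its sampled preference."""
    weights = sample_preference(len(rewards), rng)
    return {
        "prompt": prompt,
        "condition": weights,  # would be fed to the model as conditioning
        "reward": scalarize(weights, rewards),
    }


if __name__ == "__main__":
    rng = random.Random(0)
    # rewards: [prompt adherence, source fidelity], made-up values
    print(training_example("make the sky purple", [0.8, 0.3], rng))
```

At inference time a user would pass a chosen weight vector in place of the sampled one, which is what lets a single model cover many trade-off points.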

Keyword: diffusion policy

There is no result