Arxiv Papers of Today

生成时间: 2026-04-28 18:18:15 (UTC+8); Arxiv 发布时间: 2026-04-28 20:00 EDT (2026-04-29 08:00 UTC+8)

今天共有 61 篇相关文章

Keyword: reinforcement learning

KARL: Mitigating Hallucinations in LLMs via Knowledge-Boundary-Aware Reinforcement Learning

KARL：通过知识边界感知强化学习减轻大型语言模型中的幻觉

Authors: Cheng Gao, Cheng Huang, Kangyang Luo, Ziqing Qiao, Shuzheng Si, Huimin Chen, Chaojun Xiao, Maosong Sun
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.22779
Pdf link: https://arxiv.org/pdf/2604.22779
Abstract Enabling large language models (LLMs) to appropriately abstain from answering questions beyond their knowledge is crucial for mitigating hallucinations. While existing reinforcement learning methods foster autonomous abstention, they often compromise answer accuracy because their static reward mechanisms, agnostic to models' knowledge boundaries, drive models toward excessive caution. In this work, we propose KARL, a novel framework that continuously aligns an LLM's abstention behavior with its evolving knowledge boundary. KARL introduces two core innovations: a Knowledge-Boundary-Aware Reward that performs online knowledge boundary estimation using within-group response statistics, dynamically rewarding correct answers or guided abstention; and a Two-Stage RL Training Strategy that first explores the knowledge boundary and bypasses the "abstention trap", and subsequently converts incorrect answers beyond the knowledge boundary into abstentions without sacrificing accuracy. Extensive experiments on multiple benchmarks demonstrate that KARL achieves a superior accuracy-hallucination trade-off, effectively suppressing hallucinations while maintaining high accuracy across both in-distribution and out-of-distribution scenarios.
中文摘要 使大型语言模型（LLM）能够适当避免回答超出其知识范围的问题，对于减轻幻觉至关重要。虽然现有的强化学习方法促进自主性遗忘，但由于静态奖励机制不受模型知识边界影响，常常导致答案准确性受损，导致模型趋向过度谨慎。在本研究中，我们提出了KARL，一种新颖框架，能够持续将LLM的戒断行为与其不断演变的知识边界对齐。KARL引入了两项核心创新：一种知识边界感知奖励，利用组内反应统计进行在线知识边界估计，动态奖励正确答案或引导性保留;以及一种两阶段强化学习训练策略，先探索知识边界，绕过“遗忘陷阱”，随后将知识边界外的错误答案转化为遗漏，同时不牺牲准确性。多基准测试的广泛实验表明，KARL在准确性与幻觉之间实现了优越的权衡，在分布内外场景中有效抑制幻觉，同时保持高准确性。

Accelerating Reinforcement Learning for Wind Farm Control via Expert Demonstrations

通过专家演示加速风电场控制的强化学习

Authors: Marcus Binder Nilsen, Julian Quick, Tuhfe Göçmen, Nikolay Dimitrov, Pierre-Elouan Réthoré
Subjects: Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.22794
Pdf link: https://arxiv.org/pdf/2604.22794
Abstract Reinforcement learning (RL) offers a promising approach for adaptive wind farm flow control, yet its practical deployment is hindered by slow training convergence and poor initial performance, factors that could translate to years of reduced power output if an untrained agent were deployed directly. This work investigates whether domain knowledge from steady-state wake models can accelerate RL training and improve initial controller performance. We propose a pretraining methodology in which expert demonstrations are generated by deploying a PyWake-based steady-state optimizer within a dynamic wake simulation (WindGym), then used to initialize both the actor and critic networks of a Soft Actor-Critic agent via behavior cloning. Experiments on a 2x2 wind farm show that pretraining eliminates the costly initial learning phase: while an untrained agent underperforms the greedy zero-yaw baseline by approximately 12%, pretraining raises initial performance to near-baseline levels. During online fine-tuning, all configurations converge within 250,000 environment steps to achieve similar performance, ultimately exceeding that of a lookup-table controller, which reaches approximately 7% power gain after 500,000 steps.
中文摘要 强化学习（RL）为自适应风电场流量控制提供了一种有前景的方法，但其实际部署受限于训练收敛缓慢和初始性能差，这些因素如果直接部署未经训练的智能体，可能导致数年功率下降。本研究探讨了来自稳态唤醒模型的领域知识是否能加速强化学习训练并提升初始控制器性能。我们提出了一种预训练方法，通过在动态尾迹仿真（WindGym）中部署基于PyWake的稳态优化器生成专家演示，然后通过行为克隆初始化软Actor-Critic代理的actor和critic网络。在2x2风电场上的实验表明，预训练消除了昂贵的初始学习阶段：未经训练的代理在贪婪的零偏航基线下表现约低12%，而预训练则将初始性能提升至接近基线水平。在线微调过程中，所有配置在25万环境步内收敛，以实现相似性能，最终超过查找表控制器，后者在50万步后约达到7%的功耗增益。

Load constrained wind farm flow control through multi-objective multi-agent reinforcement learning

通过多目标多代理强化学习实现负载约束风电场流量控制

Authors: Teodor Åstrand, Marcus Binder Nilsen, Iasonas Tsaklis, Tuhfe Göçmen, Pierre-Elouan Réthoré, Nikolay Dimitrov
Subjects: Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.22795
Pdf link: https://arxiv.org/pdf/2604.22795
Abstract This study presents a multi-agent reinforcement learning (MARL) framework for load-constrained wind farm flow control (WFFC). While wake steering can enhance total wind farm power, it often introduces increased structural loads on downstream turbines. To address this, we integrate an Independent Soft Actor-Critic (I-SAC) architecture with a data-driven, local inflow sector-averaged surrogate model to provide real-time estimates of Damage Equivalent Loads (DELs). By incorporating these estimates into a shaped reward function, turbine-specific agents are trained to maximize power production while adhering to specific load-increase thresholds ($\Delta_{max}$) of 10%, 20%, and 30% relative to a baseline controller. The framework is implemented within the WindGym environment using the DYNAMIKS flow solver with Dynamic Wake Meandering (DWM) model to capture non-stationary wake physics. Results indicate that the MARL agents successfully learn collaborative policies that prioritise power gain while actively retreating from high-DEL control strategies.
中文摘要 本研究提出了一个多智能体强化学习（MARL）框架，用于负载约束风电场流量控制（WFFC）。虽然尾迹转向可以提升风电场的总功率，但通常会增加下游风机的结构负荷。为此，我们采用独立软行为者-批判者（I-SAC）架构与基于数据的本地流入部门平均替代模型，提供实时的损害等效负荷（DEL）估算。通过将这些估计纳入一个有形的奖励函数，涡轮专用的代理被训练以最大化功率产出，同时遵守相较于基线控制器的10%、20%和30%的具体负载增加阈值（$\Delta_{max}$）。该框架在 WindGym 环境中实现，使用 DYNAMIKS 流动求解器配合动态尾迹曲流（DWM）模型，捕捉非静止尾迹物理。结果表明，MARL特工成功学习了优先提升功率的协作策略，同时积极退出高DEL控制策略。

Hierarchical RL-MPC Control for Dynamic Wake Steering in Wind Farms

风电场动态尾迹引导的分层RL-MPC控制

Authors: Marcus Binder Nilsen, Teodor Olof Benedict Åstrand, Tuhfe Göçmen, Pierre-Elouan Réthoré
Subjects: Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.22797
Pdf link: https://arxiv.org/pdf/2604.22797
Abstract Wind farm wake steering optimization is challenging due to complex flow physics and changing conditions. This paper presents a hierarchical framework that combines reinforcement learning with model predictive control, where an RL agent learns compensatory state estimates for an MPC controller, rather than directly controlling turbines. Evaluated on a three-turbine case, the approach achieves a 23\% power gain over the baseline control and surpasses the idealized MPC with perfect state knowledge. Compared to direct RL control, the hybrid architecture maintains superior safety characteristics during training while achieving comparable performance with more stable control actions.
中文摘要 由于复杂的流动物理和不断变化的条件，风电场的尾流转向优化具有挑战性。本文提出了一个层级框架，结合了强化学习与模型预测控制，强化学习者学习MPC控制器的补偿状态估计，而非直接控制涡轮机。在三涡轮机箱下评估时，该方法比基线控制获得23%的功率增益，并以完美状态知识超越理想MPC。与直接强化学习控制相比，混合架构在训练期间保持了更优越的安全特性，同时实现了相当的性能和更稳定的控制动作。

AeSlides: Incentivizing Aesthetic Layout in LLM-Based Slide Generation via Verifiable Rewards

AeSlides：通过可验证的奖励激励基于LLM的幻灯片生成中的美观布局

Authors: Yiming Pan, Chengwei Hu, Xuancheng Huang, Can Huang, Mingming Zhao, Yuean Bi, Xiaohan Zhang, Aohan Zeng, Linmei Hu
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
Arxiv link: https://arxiv.org/abs/2604.22840
Pdf link: https://arxiv.org/pdf/2604.22840
Abstract Large language models (LLMs) have demonstrated strong potential in agentic tasks, particularly in slide generation. However, slide generation poses a fundamental challenge: the generation process is text-centric, whereas its quality is governed by visual aesthetics. This modality gap leads current models to frequently produce slides with aesthetically suboptimal layouts. Existing solutions typically rely either on heavy visual reflection, which incurs high inference cost yet yields limited gains; or on fine-tuning with large-scale datasets, which still provides weak and indirect aesthetic supervision. In contrast, the explicit use of aesthetic principles as supervision remains unexplored. In this work, we present AeSlides, a reinforcement learning framework with verifiable rewards for Aesthetic layout supervision in Slide generation. We introduce a suite of meticulously designed verifiable metrics to quantify slide layout quality, capturing key layout issues in an accurate, efficient, and low-cost manner. Leveraging these verifiable metrics, we develop a GRPO-based reinforcement learning method that directly optimizes slide generation models for aesthetically coherent layouts. With only 5K training prompts on GLM-4.7-Flash, AeSlides improves aspect ratio compliance from 36% to 85%, while reducing whitespace by 44%, element collisions by 43%, and visual imbalance by 28%. Human evaluation further shows a substantial improvement in overall quality, increasing scores from 3.31 to 3.56 (+7.6%), outperforming both model-based reward optimization and reflection-based agentic approaches, and even edging out Claude-Sonnet-4.5. These results demonstrate that such a verifiable aesthetic paradigm provides an efficient and scalable approach to aligning slide generation with human aesthetic preferences. Our repository is available at this https URL.
中文摘要 大型语言模型（LLMs）在代理任务中展现出强大的潜力，尤其是在幻灯片生成方面。然而，幻灯片生成存在根本挑战：生成过程以文本为中心，而其质量则受视觉美学所左右。这种模态差距导致现有模型经常产生美观布局不理想的幻灯片。现有解通常依赖于强烈的视觉反射，这会产生高的推理成本但收益有限;或者对大规模数据集进行微调，后者仍提供薄弱且间接的美学监督。相比之下，美学原则作为监督的明确运用尚未被深入探讨。在本研究中，我们介绍了AeSlides，一个强化学习框架，在幻灯片生成中对美观布局监督提供可验证的奖励。我们引入一套精心设计的可验证指标，以量化幻灯片布局质量，以准确、高效且低成本的方式捕捉关键布局问题。利用这些可验证的指标，我们开发了一种基于GRPO的强化学习方法，直接优化幻灯片生成模型，实现美观连贯的布局。仅用5K训练提示，GLM-4.7-Flash使AeSlides将宽高比合规性从36%提升至85%，同时减少44%的空白空间、43%的元素碰撞和28%的视觉失衡。人工评估进一步显著提升整体质量，得分从3.31提升至3.56（+7.6%），优于基于模型的奖励优化和基于反思的代理方法，甚至略胜Claude-Sonnet-4.5。这些结果表明，这种可验证的美学范式为使幻灯片生成与人类审美偏好保持一致提供了高效且可扩展的方法。我们的仓库可通过这个 https 网址访问。

Risk Models as Mediating Artifacts: A Postphenomenological Analysis of the CIIM Framework in Cybersecurity Practice

风险模型作为中介工件：对CIIM框架在网络安全实践中的后现象学分析

Authors: Rommel Salas-Guerra
Subjects: Subjects: Cryptography and Security (cs.CR); Emerging Technologies (cs.ET)
Arxiv link: https://arxiv.org/abs/2604.22866
Pdf link: https://arxiv.org/pdf/2604.22866
Abstract This article applies postphenomenological theory to the field of cybersecurity risk management, arguing that formal risk models function as mediating artifacts that shape how security practitioners or analysts perceive, interpret, and act on threats. Based on Don Ihde's taxonomy on human-technology relationships and Peter-Paul Verbeek's extended mediational framework, the Contextual and Multimodal Hazard Impact Index (CIIM), an original dynamic risk model presented as an empirical case study, is analyzed. CIIM is formally defined as CIIM(t+1) = [A T(t) V(t) E(t)] / R(t) + {alpha} P(t), where the condition R(t) 0 is not treated as a computational artifact to be smoothed out, but as a genuine systemic collapse that signals singularity. This design choice constitutes a deliberate phenomenological move, allowing organizational fragility to be made visible in a way that previous CVSS-based and probabilistic models conceal. In addition, we examine how CIIM's time projection (t+1) and its hybrid machine learning architecture, combining LSTM/GRU, XGBoost, and Reinforcement Learning, produce a new form of technological intentionality that structures practitioner or analyst attention and ethical deliberation. The article concludes by establishing implications for the ethical design of cybersecurity instrumentation and for the post-phenomenological methodology itself, proposing the concept of 'phenomenology of collapse' as a contribution to the empirical philosophy of technology.
中文摘要 本文将后现象学理论应用于网络安全风险管理领域，认为正式的风险模型作为中介的人工物，塑造了安全从业者或分析师如何感知、解读和应对威胁。基于Don Ihde关于人类-技术关系的分类法和Peter-Paul Verbeek扩展的中介框架，分析了情境与多模态危害影响指数（CIIM），这是一个原创的动态风险模型，作为实证案例研究呈现。CIIM正式定义为CIIM（t+1） = [A T（t） V（t） E（t）] / R（t） + {alpha} P（t），其中条件R（t） 0不被视为需要平滑处理的计算伪影，而是作为一个真正的系统性崩溃，标志着奇点的信号。这一设计选择是有意为之的现象学举措，使组织脆弱性得以显现，而此前基于CVSS和概率模型所隐藏的。此外，我们还考察了CIIM的时间预测（t+1）及其结合LSTM/GRU、XGBoost和强化学习的混合机器学习架构，如何创造一种新的技术意向性，构建了实践者或分析师的注意力和伦理思考。文章最后阐述了对网络安全仪器伦理设计及后现象学方法论本身的影响，提出了“崩溃现象学”概念，作为技术实证哲学的贡献。

When Policies Cannot Be Retrained: A Unified Closed-Form View of Post-Training Steering in Offline Reinforcement Learning

当政策无法再培训时：线下强化学习中培训后指导的统一封闭式视角

Authors: Elias Hossain, Mohammad Jahid Ibna Basher, Ivan Garibay, Ozlem Garibay, Niloofar Yousefi
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.22873
Pdf link: https://arxiv.org/pdf/2604.22873
Abstract Offline reinforcement learning (RL) can learn effective policies from fixed datasets, but deployment objectives may change after training, and in many applications the trained actor cannot be retrained because of data, cost, or governance constraints. We study deployment-time adaptation for frozen offline actors using Product-of-Experts (PoE) composition with a goal-conditioned prior. Our main practical finding is graceful degradation rather than universal performance gain: under degraded or random priors, precision-weighted composition remains anchored to the frozen actor, while additive and prior-only adaptation collapse, and a KL-budget selector often recovers a near-oracle operating point. We also make explicit a closed-form identity in the frozen-actor setting: for diagonal-Gaussian actors and priors, PoE with coefficient alpha yields the same deterministic policy as KL-regularized adaptation with beta = alpha / (1 - alpha), with posterior covariances differing only by a global scalar factor. Empirically, across four D4RL environments (3,900 MuJoCo episodes), we observe a 4/5/3 HELP/FROZEN/HURT split. Extending the analysis to six harder cells and two AntMaze diagnostics reveals an actor-competence ceiling: medium-expert remains HURT in all 9 cells at every tested alpha, while AntMaze with a behavior-cloned frozen actor yields zero success for all composition rules. Overall, PoE and KL-regularized adaptation are best viewed as a single actor-anchored safety mechanism for deployment-time steering.
中文摘要 离线强化学习（RL）可以从固定数据集中学习有效的策略，但部署目标在训练后可能会发生变化，在许多应用中，由于数据、成本或治理限制，训练过的参与者无法被重新训练。我们研究了使用专家产品（PoE）组合并带有目标条件先验的冻结离线演员的部署时间适应。我们主要的实际发现是优雅降解，而非普遍性能提升：在降级或随机先验下，精确加权组合仍锚定于冻结演员，而加法和仅先验适应则崩溃，KL预算选择器通常能恢复近乎oracle的操作点。我们还在冻结演员设定中明确了闭形式恒等式：对于对角高斯演员和先验，系数为alpha的PoE与β = alpha / （1 - alpha）时的KL正则化适应具有相同的确定性策略，后验协方差仅差一个全局标量因子。在四个D4RL环境中（3900集MuJoCo集数），我们观察到4/5/3的帮助/冻结/伤害分裂。将分析扩展到六个更难的单元格和两个AntMaze诊断，揭示了演员-能力的上限：中等专家在每个测试的alpha中9个单元都处于受伤状态，而AntMaze使用行为克隆的冻结演员则在所有组合规则中均无成功。总体而言，PoE和KL正则化适配最好被视为单一锚定的安全机制，用于部署时间引导。

TexOCR: Advancing Document OCR Models for Compilable Page-to-LaTeX Reconstruction

TexOCR：推进文档OCR模型，用于可编译的页面转LaTeX重建

Authors: Chengye Wang, Lin Fu, Zexi Kuang, Yilun Zhao
Subjects: Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.22880
Pdf link: https://arxiv.org/pdf/2604.22880
Abstract Existing document OCR largely targets plain text or Markdown, discarding the structural and executable properties that make LaTeX essential for scientific publishing. We study page-level reconstruction of scientific PDFs into compilable LaTeX and introduce TexOCR-Bench, a benchmark, and TexOCR-Train, a large-scale training corpus, for this task. TexOCR-Bench features a multi-dimensional evaluation suite that jointly assesses transcription fidelity, structural faithfulness, and end-to-end compilability. Leveraging TexOCR-Train, we train a 2B-parameter model, TexOCR, using supervised fine-tuning (SFT) and reinforcement learning (RL) with verifiable rewards derived from LaTeX unit tests that directly enforce compilability and referential integrity. Experiments across 21 frontier models on TexOCR-Bench show that existing systems frequently violate key document invariants, including consistent section structure, correct float placement, and valid label-reference links, which undermines compilation reliability and downstream usability. Our analysis further reveals that RL with verifiable rewards yields consistent improvements over SFT alone, particularly on structural and compilation metrics.
中文摘要 现有文档OCR主要针对纯文本或Markdown，摒弃了使LaTeX在科学出版中必不可少的结构性和可执行性质。我们研究科学PDF的页面级重建为可编译的LaTeX，并引入了基准测试TexOCR-Bench和大规模训练语料库TexOCR-Train。TexOCR-Bench 配备了多维评估套件，联合评估转录准确度、结构忠实度和端到端编译性。利用TexOCR-Train，我们训练一个2B参数模型TexOCR，使用监督微调（SFT）和强化学习（RL），并基于可验证的LaTeX单元测试奖励，直接强化编译性和引用完整性。在TexOCR-Bench上跨越21个前沿模型的实验显示，现有系统经常违反关键文档不变量，包括截面结构一致、正确浮点放置和有效的标签-引用链接，这削弱了编译可靠性和下游可用性。我们的分析进一步显示，具有可验证奖励的强化学习在结构和编译指标上，相较单纯 SFT 持续提升。

StackFeat RL: Reinforcement Learning over Iterative Dual Criterion Feature Selection for Stable Biomarker Discovery

StackFeat RL：基于迭代双重准则特征选择的强化学习，实现稳定生物标志物发现

Authors: A. Yermekov, D.A. Herrera-Martí
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.22892
Pdf link: https://arxiv.org/pdf/2604.22892
Abstract Feature selection in high-dimensional genomic data ($d \gg n$) demands methods that are simultaneously accurate, sparse, and stable. Existing approaches either require manual threshold specification (mRMR, stability selection), produce unstable selections under data perturbation (Lasso, Boruta), or ignore biological structure entirely. We introduce StackFeat-RL, a meta-learning framework that optimises the hyperparameters of an iterative dual-criterion feature selection algorithm via REINFORCE policy gradients. The dual criterion, requiring both coefficient consistency and selection frequency, guards against two failure modes missed by single-criterion methods, while iterative accumulation provides convergence guarantees via the law of large numbers. On COVID-19 miRNA data (GSE240888, 332 features) and three Alzheimer's disease classification tasks (GSE84422, 13237 genes; Normal vs.\ Possible, Probable, and Definite AD), StackFeat-RL achieves the highest predictive accuracy among all evaluated methods, including ElasticNet, Boruta, mRMR, and stability selection, while requiring 3--4$\times$ fewer features. Keywords: feature selection, reinforcement learning, REINFORCE, elastic net, biomarker discovery, Alzheimer's disease, dual-criterion selection, protein interaction networks
中文摘要 高维基因组数据中的特征选择（如 $d \gg n$）需要方法同时准确、稀疏且稳定。现有方法要么要求手动阈值指定（mRMR，稳定性选择），要么在数据扰动下产生不稳定的选择（Lasso，Boruta），或者完全忽略生物结构。我们介绍了StackFeat-RL，这是一种元学习框架，通过REINFORCE策略梯度优化迭代双准则特征选择算法的超参数。对偶准则要求系数一致性和选择频率，防止单一准则方法遗漏的两种失效模式，而迭代累积则通过大数定律提供收敛保证。在COVID-19 miRNA数据（GSE240888,332个特征）和三个阿尔茨海默病分类任务（GSE84422,13237个基因;StackFeat-RL在所有评估方法中实现了最高的预测准确率，包括ElasticNet、Boruta、mRMR和稳定性选择，同时所需的特征数量减少了3到4美元\乘倍。关键词：特征选择、强化学习、REINFORCE、弹性网、生物标志物发现、阿尔茨海默病、双重标准选择、蛋白质相互作用网络

DeepImagine: Learning Biomedical Reasoning via Successive Counterfactual Imagining

DeepImagine：通过连续的反事实想象学习生物医学推理

Authors: Youze Zheng, Jianyou Wang, Yuhan Chen, Matthew Feng, Longtian Bao, Hanyuan Zhang, Maxim Khan, Aditya K. Sehgal, Christopher D. Rosin, Umber Dube, Ramamohan Paturi
Subjects: Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.23054
Pdf link: https://arxiv.org/pdf/2604.23054
Abstract Predicting the outcomes of prospective clinical trials remains a major challenge for large language models. Prior work has shown that both traditional correlational predictors, such as random forests and logistic regression, and strong commercial LLMs achieve limited performance on this task. In this paper, we propose DeepImagine, a framework for teaching LLMs biomedical reasoning through successive counterfactual imagining. The central idea is to approximate hidden causal mechanisms of clinical trials by training models to infer how observed trial results would change under controlled perturbations of experimental conditions, such as dosage, outcome measures, study arms, geography, and other trial attributes. To support this objective, we construct both natural and approximate counterfactual pairs from real clinical trials with reported outcomes. For settings where strict counterfactual supervision is available, such as paired outcome measures or dose-ranging study arms within the same trial, we train models with supervised fine-tuning. For broader settings where only approximate counterfactual pairs can be retrieved, we optimize models with reinforcement learning using verifiable rewards based on downstream benchmark correctness. We further augment training with synthetic reasoning traces that provide causally plausible explanations for local counterfactual transitions. Using this pipeline, we train language models under 10B parameters, including Qwen3.5-9B, and evaluate them on clinical trial outcome prediction. We aim to show that DeepImagine consistently improves over untuned language models and traditional correlational baselines. Finally, we aim to show that the learned reasoning trajectories provide interpretable signals about how models represent trial-level mechanisms, suggesting a practical path toward more mechanistic and scientifically useful biomedical language models.
中文摘要 预测前瞻性临床试验的结果仍然是大型语言模型面临的重大挑战。先前研究表明，传统的相关预测变量，如随机森林和逻辑回归，以及强大的商业大型语言模型，在该任务上表现有限。本文提出了DeepImagine框架，一种通过连续反事实想象教授大型语言模型生物医学推理的框架。核心思想是通过训练模型，推断在受控扰动条件下观察到的试验结果变化，如剂量、结局指标、研究组、地理位置及其他试验属性。为支持这一目标，我们从真实临床试验中构建了自然和近似的反事实对，并有报告结果。对于需要严格反事实监督的环境，如结对结局指标或同一试验中的剂量范围研究组，我们训练模型时采用监督微调。对于仅能检索近似反事实对的更广泛情境，我们利用基于下游基准正确性的可验证奖励优化强化学习模型。我们进一步用合成推理痕迹补充训练，为局部反事实转变提供因果合理的解释。利用该流程，我们在包括Qwen3.5-9B在内的10B参数下训练语言模型，并评估其临床试验结局预测。我们的目标是证明，DeepImagine 在未调谐的语言模型和传统相关性基线上持续有进步。最后，我们旨在展示所学的推理轨迹为模型如何代表试验级机制提供了可解读的信号，为更机械和更具科学价值的生物医学语言模型提供了切实可行的路径。

K-Score: Kalman Filter as a Principled Alternative to Reward Normalization in Reinforcement Learning

K-score：卡尔曼滤波器作为强化学习中奖励归一化的原则性替代方案

Authors: Zixuan Xia, Quanxi Li
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.23056
Pdf link: https://arxiv.org/pdf/2604.23056
Abstract We propose a simple yet effective alternative to reward normalization in policy gradient reinforcement learning by integrating a 1D Kalman filter for online reward estimation. Instead of relying on fixed heuristics, our method recursively estimates the latent reward mean, smoothing high-variance returns and adapting to non-stationary environments. This approach incurs minimal overhead and requires no modification to existing policy architectures. Experiments on \textit{LunarLander} and \textit{CartPole} demonstrate that Kalman-filtered rewards significantly accelerate convergence and reduce training variance compared to standard normalization techniques. Code is available at this https URL.
中文摘要 我们提出了一种简单但有效的奖励归一化替代方案，通过整合一维卡尔曼滤波器进行在线奖励估计。我们不依赖固定启发式方法，而是递归估计潜在奖励均值，平滑高方差收益并适应非平稳环境。这种方法开销极低，且无需修改现有策略架构。在 \textit{LunarLander} 和 \textit{CartPole} 上的实验表明，卡尔曼滤波奖励相比标准归一化技术显著加速收敛并降低训练方差。代码可在此 https URL 访问。

C-MORAL: Controllable Multi-Objective Molecular Optimization with Reinforcement Alignment for LLMs

C-MORAL：可控多目标分子优化与强化比对的大型语言模型

Authors: Rui Gao, Youngseung Jeon, Swastik Roy, Morteza Ziyadi, Xiang 'Anthony' Chen
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.23061
Pdf link: https://arxiv.org/pdf/2604.23061
Abstract Large language models (LLMs) show promise for molecular optimization, but aligning them with selective and competing drug-design constraints remains challenging. We propose C-Moral, a reinforcement learning post-training framework for controllable multi-objective molecular optimization. C-Moral combines group-based relative optimization, property score alignment for heterogeneous objectives, and continuous non-linear reward aggregation to improve stability across competing properties. Experiments on the C-MuMOInstruct benchmark show that C-Moral consistently outperforms state-of-the-art models across both in-domain and out-of-domain settings, achieving the best Success Optimized Rate (SOR) of 48.9% on IND tasks and 39.5% on OOD tasks, while largely preserving scaffold similarity. These results suggest that RL post-training is an effective way to align molecular language models with continuous molecular design objectives. Our code and models are publicly available at this https URL.
中文摘要 大型语言模型（LLMs）在分子优化方面展现出潜力，但将其与选择性和竞争性药物设计约束对齐仍具挑战性。我们提出了C-Moral，一种用于可控多目标分子优化的强化学习后培训框架。C-Moral 结合了基于群体的相对优化、异质目标的属性评分对齐以及连续非线性奖励聚合，以提升竞争属性间的稳定性。C-MuMOInstruct基准测试的实验显示，C-Moral在域内和域外环境中始终优于最先进模型，在IND任务中达到最佳成功优化率（SOR）48.9%，在OOD任务中达到39.5%，同时基本保持了支架相似性。这些结果表明，强化学习后训练是将分子语言模型与连续分子设计目标对齐的有效方法。我们的代码和模型在此 https URL 公开。

RL Token: Bootstrapping Online RL with Vision-Language-Action Models

强化学习令牌：利用视觉-语言-行动模型自助在线强化学习

Authors: Charles Xu, Jost Tobias Springenberg, Michael Equi, Ali Amin, Adnan Esmail, Sergey Levine, Liyiming Ke
Subjects: Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.23073
Pdf link: https://arxiv.org/pdf/2604.23073
Abstract Vision-language-action (VLA) models can learn to perform diverse manipulation skills "out of the box," but achieving the precision and speed that real-world tasks demand requires further fine-tuning -- for example, via reinforcement learning (RL). We introduce a lightweight method that enables sample-efficient online RL fine-tuning of pretrained VLAs using just a few hours of real-world practice. We (1) adapt the VLA to expose an "RL token," a compact readout representation that preserves task-relevant pretrained knowledge while serving as an efficient interface for online RL, and (2) train a small actor-critic head on this RL token to refine the actions, while anchoring the learned policy to the VLA. Online RL with the RL token (RLT) makes it possible to fine-tune even large VLAs with RL quickly and efficiently. Across four real-robot tasks (screw installation, zip tie fastening, charger insertion, and Ethernet insertion), RLT improves the speed on the hardest part of the task by up to 3x and raises success rates significantly within minutes to a few hours of practice. It can even surpass the speed of human teleoperation on some of the tasks.
中文摘要 视觉-语言-行动（VLA）模型可以“跳出框架”学习执行多样化的操作技能，但要实现现实任务所需的精确性和速度，还需要进一步微调——例如通过强化学习（RL）。我们引入了一种轻量级方法，仅需几小时的实际操作，即可实现高效的在线强化学习微调预训练VLA。我们（1）调整VLA以暴露“RL令牌”，这是一种紧凑的读出表示，既保留了任务相关的预训练知识，又作为在线RL的高效接口;（2）在该RL令牌上训练一个小型actor-critic头，以精炼动作，同时将学习策略锚定在VLA上。利用RL令牌（RLT）的在线强化学习使得即使是大型VLA也能快速高效地微调。在四项真实机器人任务（螺钉安装、扎带固定、充电器插入和以太网插入）中，RLT能将最难部分的速度提升多达3倍，并在几分钟到数小时的练习内显著提升成功率。在某些任务上，它甚至能超过人类远程操作的速度。

UAV Trajectory and Bandwidth Allocation for Efficient Data Collection in Low-Altitude Intelligent IoT: A Hierarchical DRL Approach

低空智能物联网中无人机轨迹与带宽分配以实现高效数据收集：分层日程学习方法

Authors: Zhenjia Xu, Xiaoling Zhang, Nan Qi, Xiaojie Li, Luliang Jia
Subjects: Subjects: Computational Engineering, Finance, and Science (cs.CE)
Arxiv link: https://arxiv.org/abs/2604.23132
Pdf link: https://arxiv.org/pdf/2604.23132
Abstract Under the 6G wireless network evolution, the low-altitude Internet of Things (IoT), supported by unmanned aerial vehicles (UAVs) with Integrated Sensing and Communication (ISAC) capabilities, provides ground sensing networks with advanced real-time monitoring and data collection. To maximize data collection volume from distributed IoT nodes, AI-powered data collection technology plays a critical role in enabling intelligent decision-making. Among them, deep reinforcement learning (DRL) has gained particular attention. However, the existing DRL-based work on UAV-assisted IoT nodes data collection rarely address problems such as unknown interference and dynamic data volume. Moreover, these DRL models have high arithmetic requirements and slow convergence speed, making it difficult to carry on UAVs with limited load and arithmetic power. To address these challenges, a hierarchical deep reinforcement learning (HDRL), which can converge quickly and with smaller models, is designed to optimize UAV trajectories and bandwidth allocation to maximize data collection volume. Firstly, the proposed scenario incorporates interference from jammers, dynamic data volume of IoT nodes, and multiple types of obstacles. The entire task is hierarchically structured: the upper-level makes flight trajectory decisions at a coarse temporal granularity, while the lower-level makes bandwidth allocation decisions at a finer temporal granularity. Secondly, a trajectory and bandwidth allocation optimization algorithm based on hierarchical deep deterministic policy gradients (TBH-DDPG) is proposed to solve the problem. Finally, simulation results demonstrate that the proposed algorithm improves convergence speed by 44.44%, and reduces computational cost by 58.05%, compared to non-hierarchical algorithm.
中文摘要 在6G无线网络的发展中，低空物联网（IoT）由无人机（UAV）支持，具备综合感测与通信（ISAC）能力，为地面传感网络提供先进的实时监控和数据收集能力。为了最大化分布式物联网节点的数据收集量，人工智能驱动的数据收集技术在实现智能决策中发挥着关键作用。其中，深度强化学习（DRL）尤其受到关注。然而，现有基于DRL的无人机辅助物联网节点数据收集工作很少解决未知干扰和动态数据量等问题。此外，这些日间移动模型算术要求高且收敛速度慢，导致载重和算术功率有限的无人机难以携带。为应对这些挑战，设计了一种能够快速且与较小模型汇聚的分层深度强化学习（HDRL），以优化无人机轨迹和带宽分配，以最大化数据收集量。首先，所提情景包括干扰器干扰、物联网节点的动态数据量以及多种障碍物。整个任务具有层级结构：上层以粗略的时间粒度做出飞行轨迹决策，而下层则以更细分的时间粒度做出带宽分配决策。其次，提出了基于层级深度确定性策略梯度（TBH-DDPG）的轨迹与带宽分配优化算法来解决该问题。最后，模拟结果表明，所提算法相比非分层算法，收敛速度提升了44.44%，计算成本降低了58.05%。

Cooperative Informative Sensing for Monitoring Dynamic Indoor Environments via Multi-Agent Reinforcement Learning

通过多智能体强化学习监测动态室内环境的协作信息感知

Authors: Kanghoon Lee, Matthew M. Sato, Jinnyeong Yang, Seungro Lee, Sujin Lee, Jiachen Li, Kuk-Jin Yoon, Jinkyoo Park, Kincho H. Law, Yoonjin Yoon
Subjects: Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2604.23179
Pdf link: https://arxiv.org/pdf/2604.23179
Abstract Monitoring human activity in indoor environments is important for applications such as facility management, safety assessment, and space utilization analysis. While mobile robot teams offer the potential to actively improve observation quality, existing multi-robot monitoring and active perception approaches typically rely on coverage or visitation based objectives that are weakly aligned with the accuracy requirements of human-centric monitoring tasks. In this work, we formulate cooperative active observation as a decentralized control problem in which multiple robots adjust their motion to directly optimize monitoring accuracy under partial observability. We propose a learning-based framework for cooperative policies from decentralized observations using multi-agent reinforcement learning (MARL), supported by an architecture that handles variable numbers of humans and temporal dependencies. Simulation results across diverse indoor environments and monitoring tasks show that the proposed approach consistently outperforms classical coverage, persistent monitoring, and learning-free multi-robot baselines, while remaining robust to changes in the number of observed humans.
中文摘要 监测室内环境中的人类活动对于设施管理、安全评估和空间利用分析等应用非常重要。虽然移动机器人团队有潜力主动提升观察质量，但现有的多机器人监测和主动感知方法通常依赖于覆盖率或访问目标，这些目标与以人为中心的监测任务的准确性要求相差。在本研究中，我们将协作主动观察提出为一个去中心化的控制问题，即多台机器人调整运动，以直接优化部分可观测性下的监测精度。我们提出了一个基于学习的框架，利用多智能体强化学习（MARL）从去中心化观察中制定协作策略，并由处理可变人数和时间依赖的架构支持。在不同室内环境和监测任务中的模拟结果显示，所提方法始终优于传统覆盖、持续监测和无学习多机器人基线，同时对观察到的人类数量变化保持鲁棒性。

CODA: Coordination via On-Policy Diffusion for Multi-Agent Offline Reinforcement Learning

CODA：通过策略扩散协调多智能体离线强化学习

Authors: Marcel Hedman, Kale-ab Abebe Tessera, Juan Claude Formanek, Anya Sims, Riccardo Zamboni, Trevor McInroe, John Torr, Elliot Fosong
Subjects: Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2604.23308
Pdf link: https://arxiv.org/pdf/2604.23308
Abstract Offline multi-agent reinforcement learning (MARL) enables policy learning from fixed datasets, but is prone to coordination failure: agents trained on static, off-policy data converge to suboptimal joint behaviours because they cannot co-adapt as their policies change. We introduce CODA (Coordination via On-Policy Diffusion for Multi-Agent Reinforcement Learning), a diffusion-based multi-agent trajectory generator for data augmentation that samples conditioned on the current joint policy, producing synthetic experience which reflects the evolving behaviours of the agents, thereby providing a mechanism for co-adaptation. We find that previous diffusion-based augmentation approaches are insufficient for fostering multi-agent coordination because they produce static augmented datasets that do not evolve as the current joint policy changes during training; CODA resolves this by more closely simulating on-policy learning and is a meaningful step toward coordinated behaviours in the offline setting. CODA is algorithm-agnostic and can be layered onto both model-free and model-based offline reinforcement learning pipelines as an augmentation module. Empirically, CODA not only resolves canonical coordination pathologies in continuous polynomial games but also delivers strong results on the more complex MaMuJoCo continuous-control benchmarks.
中文摘要 离线多智能体强化学习（MARL）允许从固定数据集进行策略学习，但容易出现协调失败：在静态、非策略数据上训练的智能体因无法在策略变化中协同适应，趋向次优联合行为。我们引入了CODA（通过策略上的协调扩散实现多智能体强化学习），这是一种基于扩散的多智能体轨迹生成器，用于数据增强，样本基于当前联合策略进行条件，生成反映智能体行为演变的综合经验，从而提供共适应机制。我们发现，以往基于扩散的增强方法无法促进多智能体协调，因为它们产生的静态增强数据集不会随着当前联合政策在培训期间的变化而演变;CODA通过更贴近政策学习模拟来解决这一问题，是迈向离线环境中协调行为的重要一步。CODA与算法无关，可以作为增强模块叠加到无模型和基于模型的离线强化学习流程中。从经验角度看，CODA不仅解决了连续多项式博弈中的典范协调病态，还在更复杂的MaMuJoCo连续控制基准测试上取得了强有力的成果。

GIFT: Global stabilisation via Intrinsic Fine Tuning

礼物：通过内在微调实现全球稳定

Authors: Rory Young, Nicolas Pugeault
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.23312
Pdf link: https://arxiv.org/pdf/2604.23312
Abstract Deep reinforcement learning policies achieve strong performance in complex continuous control environments with nonlinear contact forces. However, these policies often produce chaotic state dynamics, with trivially small changes to the initial conditions significantly impacting the long-term behaviour of the control system. This high sensitivity to initial conditions limits the application of Deep RL to real-world control systems where performance and stability guarantees are often required. To address this issue, we propose Global stabilisation via Intrinsic Fine Tuning (GIFT), a general-purpose training framework which directly optimises the global stability of existing high-performing deep RL policies using a custom reward function. We demonstrate that GIFT increase the stability of the control interaction while maintaining comparable task performance, thereby improving the suitability of deep RL policies for real-world control systems.
中文摘要 深度强化学习策略在具有非线性接触力的复杂连续控制环境中实现了强劲的性能。然而，这些策略常常产生混沌状态动态，初始条件的微小变化会显著影响控制系统的长期行为。这种对初始条件的高敏感性限制了深度强化学习在实际控制系统中的应用，因为这些系统通常需要性能和稳定性保证。为解决这一问题，我们提出了通过内在微调（GIFT）实现全局稳定，这是一种通用训练框架，通过自定义奖励函数直接优化现有高效深度强化学习策略的全局稳定性。我们证明了GIFT在保持可比任务性能的同时提升了控制交互的稳定性，从而提升了深度强化学习策略在现实控制系统的适用性。

Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance

隐藏状态知道推理分歧之处：通过跨级瓦瑟斯坦距离进行学分分配

Authors: Xinzhu Chen, Wei He, Huichuan Fan, Wenzhe Niu, Zhongxiang Sun, Xuanru Wang, Jiuchong Gao, Jinghua Hao, Renqing He, Weijie Yu
Subjects: Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.23318
Pdf link: https://arxiv.org/pdf/2604.23318
Abstract Group Relative Policy Optimization (GRPO) performs coarse-grained credit assignment in reinforcement learning with verifiable rewards (RLVR) by assigning the same advantage to all tokens in a rollout. Process reward models can provide finer-grained supervision, but they require step-level annotation or additional reward modeling. We show that hidden-state distributions contain a useful signal for local reasoning quality that can be extracted using only outcome-level correctness labels available in RLVR. Specifically, within each GRPO group, the Wasserstein distance between span-level hidden state distributions of correct and incorrect rollouts increases around regions where their local reasoning quality diverges. This association holds both across examples and within individual trajectories, suggesting that hidden-state distributional divergence can serve as a self-supervision signal for fine-grained credit assignment. We formalize this observation with a separation theorem showing that, under mild structural assumptions, post-divergence spans have larger Wasserstein distances than pre-divergence spans whenever the population-level distributional gap exceeds finite-sample noise. Motivated by this result, we propose \textbf{S}pan-level \textbf{H}idden state \textbf{E}nabled \textbf{A}dvantage \textbf{R}eweighting (SHEAR), which modifies GRPO by using span-level Wasserstein distances to scale token-level advantages, amplifying updates on tokens whose hidden states are more separated from the opposing group. The method requires no additional model and only minimal changes to the training pipeline. Experiments on five mathematical reasoning benchmarks and five code generation benchmarks show improvements over standard GRPO and strong performance relative to supervised process reward models, while requiring no additional annotation or reward model training.
中文摘要 群体相对策略优化（GRPO）通过在推广中对所有代币赋予相同的优势，实现带有可验证奖励的强化学习（RLVR）中的粗粒度信用分配。过程奖励模型可以提供更细致的监督，但需要步骤级注释或额外的奖励建模。我们证明隐藏状态分布包含一个有用的局部推理质量信号，仅能通过RLVR中可获得的结果级正确性标签提取。具体来说，在每个GRPO组内，正确与错误展开的跨级隐藏状态分布之间的Wasserstein距离在其局部推理质量分歧的区域附近增加。这种关联既适用于不同实例，也存在于各个轨迹中，表明隐态分布发散可以作为细粒度信用分配的自我监督信号。我们用分离定理形式化这一观察，表明在温和结构假设下，当种群层级分布差距超过有限样本噪声时，散度后张成的瓦瑟斯坦距离大于发散前跨度。基于这一结果，我们提出了 \textbf{S}pan-level \textbf{H}idden state \textbf{E}nabled \textbf{A}dvantage \textbf{R}加权（SHEAR），它通过使用span级Wasserstein距离来调整GRPO，放大那些隐藏状态与对方组更为分离的代币的更新。该方法无需额外模型，且只需对训练流程进行最小修改。在五个数学推理基准测试和五个代码生成基准测试上的实验显示，相较于监督过程奖励模型，其性能优于标准GRPO，且性能优异，且无需额外注释或奖励模型训练。

Process Supervision of Confidence Margin for Calibrated LLM Reasoning

校准LLM推理中置信裕度的过程监督

Authors: Liaoyaqi Wang, Chunsheng Zuo, William Jurayj, Benjamin Van Durme, Anqi Liu
Subjects: Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.23333
Pdf link: https://arxiv.org/pdf/2604.23333
Abstract Scaling test-time computation with reinforcement learning (RL) has emerged as a reliable path to improve large language models (LLM) reasoning ability. Yet, outcome-based reward often incentivizes models to be overconfident, leading to hallucinations, unreliable confidence-based control, and unnecessary compute allocation. We introduce Reinforcement Learning with Confidence Margin (\textbf{RLCM}), a calibration-aware RL framework that jointly optimizes correctness and confidence reliability via a margin-enhanced process reward over intermediate-budget completions. Rather than aligning confidence to correctness likelihoods, RLCM encourages to widen the confidence margin between correct and incorrect steps within a single reasoning trajectory. Across mathematical, code, logic and science benchmarks, our method substantially improves calibration while maintaining or improving accuracy. We further show that, with calibrated confidence signals, the resulting models enable more efficient conformal risk control and effective confidence-weighted aggregation.
中文摘要 通过强化学习（RL）扩展测试时间计算已成为提升大型语言模型（LLM）推理能力的可靠途径。然而，基于结果的奖励常常激励模型过度自信，导致幻觉、基于置信度控制不可靠以及不必要的计算分配。我们引入了带置信裕度的强化学习（\textbf{RLCM}），这是一个校准感知型强化学习框架，通过对中级预算完成的边际增强过程奖励，共同优化正确性和置信度。RLCM鼓励在单一推理轨迹中扩大正确与错误步骤之间的置信余地，而非将置信度与正确性概然对齐。在数学、代码、逻辑和科学基准中，我们的方法在保持或提升准确性的同时，显著提升了校准性能。我们还进一步表明，通过校准置信信号，所得模型能够实现更高效的共形风险控制和有效的置信加权聚合。

Bridging Reasoning and Action: Hybrid LLM-RL Framework for Efficient Cross-Domain Task-Oriented Dialogue

推理与行动的桥梁：高效跨领域任务导向对话的混合大型语言模型-强化学习框架

Authors: Yangyang Zhao, Linfan Dai, Li Cai, Bowen Xing, Libo Qin
Subjects: Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2604.23345
Pdf link: https://arxiv.org/pdf/2604.23345
Abstract Cross-domain task-oriented dialogue requires reasoning over implicit and explicit feasibility constraints while planning long-horizon, multi-turn actions. Large language models (LLMs) can infer such constraints but are unreliable over long horizons, while Reinforcement learning (RL) optimizes long-horizon behavior yet cannot recover constraints from raw dialogue. Naively coupling LLMs with RL is therefore brittle: unverified or unstructured LLM outputs can corrupt state representations and misguide policy learning. Motivated by this, we propose Verified LLM-Knowledge empowered RL (VLK-RL), a hybrid framework that makes LLM-derived constraint reasoning usable for RL. VLK-RL first elicits candidate constraints with an LLM and then verifies them via a dual-role cross-examination procedure to suppress hallucinations and cross-turn inconsistencies. The verified constraints are mapped into ontology-aligned slot-value representations, yielding a structured, constraint-aware state for RL policy optimization. Experiments across multiple benchmarks demonstrate that VLK-RL significantly improves generalization and robustness, outperforming strong single-model baselines on long-horizon tasks.
中文摘要 跨领域任务导向对话需要在规划长期、多回合行动时，对隐性和显性可行性约束进行推理。大型语言模型（LLM）可以推断此类约束，但在长视野内不可靠，而强化学习（RL）优化了长视野行为，但无法从原始对话中恢复约束。因此，将LLM与RL天真结合是脆弱的：未经验证或非结构化的LLM输出可能会破坏状态表示并误导策略学习。基于此，我们提出了验证LLM知识赋能的强化学习（VLK-RL），这是一个混合框架，使得基于LLM的约束推理适用于强化学习。VLK-RL首先用LLM引出候选约束，然后通过双重角色交叉询问程序验证，以抑制幻觉和交叉反转不一致。验证后的约束映射到与本体论对齐的槽值表示中，从而为强化学习策略优化提供结构化、约束感知的状态。多项基准测试的实验表明，VLK-RL显著提升了泛化性和鲁棒性，在长期视野任务中优于强的单模型基线。

Learning from Demonstration with Failure Awareness for Safe Robot Navigation

通过演示与失效意识学习，实现机器人安全导航

Authors: Xianghui Wang, Siwei Cheng, Shanze Wang, Xinming Zhang, Dan Zhang, Wei Zhang
Subjects: Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.23360
Pdf link: https://arxiv.org/pdf/2604.23360
Abstract Learning from demonstration is widely used for robot navigation, yet it suffers from a fundamental limitation: demonstrations consist predominantly of successful behaviors and provide limited coverage of unsafe states. This limitation leads to poor safety when the robot encounters scenarios beyond the demonstration distribution. Failure experiences, such as collisions, contain essential information about unsafe regions, but remain underutilized. The key difficulty lies in the fact that failure data do not provide valid guidance for action imitation, and their naive incorporation into policy learning often degrades performance. We address this challenge by proposing a failure-aware learning framework that explicitly decouples the roles of success and failure data. In this framework, failure experiences are used to shape value estimation in hazardous regions, while policy learning is restricted to successful demonstrations. This separation enables the effective use of failure data without corrupting policy behavior. We implement this design within an offline reinforcement learning (RL) setting and evaluate it in both simulation and real-world environments. The results show that our framework consistently reduces collision rates while preserving the task success rate, and demonstrate strong generalization across different environments and robot platforms.
中文摘要 从演示中学习被广泛应用于机器人导航，但它存在一个根本性的局限性：演示主要由成功的行为组成，且对不安全状态的覆盖有限。这种限制导致机器人在遇到超出演示分布范围的场景时安全性较低。故障体验，如碰撞，包含关于不安全区域的重要信息，但仍然未被充分利用。关键难点在于，失效数据并不能为动作模仿提供有效指导，而它们被简单地纳入策略学习往往会降低性能。我们通过提出一个失败感知学习框架来应对这一挑战，明确将成功与失败数据的角色分离。在该框架中，失败经验用于影响危险区域的价值估计，而政策学习仅限于成功的演示。这种分离使得在不破坏策略行为的情况下，有效利用故障数据。我们在离线强化学习（RL）环境中实现该设计，并在模拟和现实环境中进行评估。结果表明，我们的框架在保持任务成功率的同时，持续降低碰撞率，并在不同环境和机器人平台上展现出强有力的泛化性。

V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think

V-GRPO：用于消除生成模型噪点的在线强化学习比你想象的要简单

Authors: Bingda Tang, Yuhui Zhang, Xiaohan Wang, Jiayuan Mao, Ludwig Schmidt, Serena Yeung-Levy
Subjects: Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.23380
Pdf link: https://arxiv.org/pdf/2604.23380
Abstract Aligning denoising generative models with human preferences or verifiable rewards remains a key challenge. While policy-gradient online reinforcement learning (RL) offers a principled post-training framework, its direct application is hindered by the intractable likelihoods of these models. Prior work therefore either optimizes an induced Markov decision process (MDP) over sampling trajectories, which is stable but inefficient, or uses likelihood surrogates based on the diffusion evidence lower bound (ELBO), which have so far underperformed on visual generation. Our key insight is that the ELBO-based approach can, in fact, be made both stable and efficient. By reducing surrogate variance and controlling gradient steps, we show that this approach can beat MDP-based methods. To this end, we introduce Variational GRPO (V-GRPO), a method that integrates ELBO-based surrogates with the Group Relative Policy Optimization (GRPO) algorithm, alongside a set of simple yet essential techniques. Our method is easy to implement, aligns with pretraining objectives, and avoids the limitations of MDP-based methods. V-GRPO achieves state-of-the-art performance in text-to-image synthesis, while delivering a $2\times$ speedup over MixGRPO and a $3\times$ speedup over DiffusionNFT.
中文摘要 将去噪生成模型与人类偏好或可验证奖励对齐仍是一个关键挑战。虽然策略梯度在线强化学习（RL）提供了一个原则性的训练后框架，但由于这些模型难以解决的可能性，其直接应用受到阻碍。因此，以往的工作要么优化采样轨迹上的诱导马尔可夫决策过程（MDP），这虽然稳定但效率低下;要么采用基于扩散证据下界（ELBO）的似然替代算法，但迄今在视觉生成方面表现不佳。我们的关键见解是，基于ELBO的方法实际上可以既稳定又高效。通过减少替代方差和控制梯度步骤，我们证明该方法能够胜过基于MDP的方法。为此，我们介绍了变分GRPO（V-GRPO），这是一种将基于ELBO的代理算法与群相对策略优化（GRPO）算法集成的方法，同时结合一套简单但必不可少的技术。我们的方法易于实施，符合预训练目标，并避免了基于MDP方法的限制。V-GRPO在文本到图像合成方面实现了最先进的性能，同时比MixGRPO快20倍，DiffusionNFT快3倍。

Do Synthetic Trajectories Reflect Real Reward Hacking? A Systematic Study on Monitoring In-the-Wild Hacking in Code Generation

合成轨迹是否反映真实的奖励黑客？一项关于监控代码生成中野外黑客行为的系统性研究

Authors: Lichen Li, Hengguang Zhou, Yijun Liang, Tianyi Zhou, Cho-Jui Hsieh
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.23488
Pdf link: https://arxiv.org/pdf/2604.23488
Abstract Reward hacking in code generation, where models exploit evaluation loopholes to obtain full reward without correctly solving the tasks, poses a critical challenge for Reinforcement Learning (RL) and the deployment of reasoning models. Existing studies have been conducted primarily on synthetic hacking trajectories. However, whether these synthetic behaviors faithfully represent naturally emerging hacking in the wild remains unclear. In this work, we present a systematic analysis of the synthetic vs. in-the-wild discrepancy in reward hacking. We examine to what extent hacking behaviors induced by prompting resemble those emerging during RL training, and whether monitors trained on synthetic trajectories generalize to naturally arising but previously unseen hacking. To scale up the curation of in-the-wild reward hacking trajectories, we modified Group Relative Policy Optimization (GRPO) by injecting conflicting unit tests as tracers and applying a "resampling-until-hack" mechanism. Through controlled comparisons between monitors trained on synthetic versus in-the-wild data, we find that (1) synthetic-data-trained monitors fail to generalize to "in-the-wild" hacking, and (2) monitors trained on our "in-the-wild" trajectories demonstrate stronger generalizability to unseen hacking types. Our results indicate that synthetic reward hacking data may not fully reflect natural reward hacking behaviors, and that relying solely on synthetic data can lead to misleading conclusions. The codebase is available at this https URL
中文摘要 代码生成中的奖励黑客攻击，即模型利用评估漏洞以获得完整奖励，但未正确解决任务，这对强化学习（RL）和推理模型的部署构成了关键挑战。现有研究主要集中在合成黑客的轨迹上。然而，这些合成行为是否忠实地反映了野外自然出现的黑客行为，目前尚不清楚。本研究系统分析了合成与野外奖励黑客差异。我们考察了由提示诱发的黑客行为在多大程度上与强化学习训练中出现的行为相似，以及受过合成轨迹训练的监控者是否会泛化为自然出现但此前未曾见过的黑客行为。为了扩大对野外奖励黑客轨迹的策划，我们修改了组相对策略优化（GRPO），通过注入冲突单元测试作为追踪器，并应用“重采样直到黑客”机制。通过对合成数据训练的监测器与野外数据进行受控比较，我们发现（1）合成数据训练的监测器无法推广到“野外”黑客行为，（2）在“野外”轨迹上训练的监控器对未被发现的黑客类型具有更强的泛化性。我们的结果表明，合成奖励黑客数据可能无法完全反映自然的奖励黑客行为，仅依赖合成数据可能导致误导性结论。代码库可在此 https 网址获取

DLM: Unified Decision Language Models for Offline Multi-Agent Sequential Decision Making

DLM：离线多智能体顺序决策的统一决策语言模型

Authors: Zhuohui Zhang, Bin Cheng, Bin He
Subjects: Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.23557
Pdf link: https://arxiv.org/pdf/2604.23557
Abstract Building scalable and reusable multi-agent decision policies from offline datasets remains a challenge in offline multi-agent reinforcement learning (MARL), as existing methods often rely on fixed observation formats and action spaces that limit generalization. In contrast, large language models (LLMs) offer a flexible modeling interface that can naturally accommodate heterogeneous observations and actions. Motivated by this, we propose the Decision Language Model (DLM), which formulates multi-agent decision making as a dialogue-style sequence prediction problem under the centralized training with decentralized execution paradigm. DLM is trained in two stages: a supervised fine-tuning phase, which leverages dialogue-style datasets for centralized training with inter-agent context and generates executable actions from offline trajectories, followed by a group relative policy optimization phase to enhance robustness to out-of-distribution actions through lightweight reward functions. Experiments on multiple benchmarks show that a unified DLM outperforms strong offline MARL baselines and LLM-based conversational decision-making methods, while demonstrating strong zero-shot generalization to unseen scenarios across tasks.
中文摘要 在离线多智能体强化学习（MARL）中，从离线数据集构建可扩展且可复用的多智能体决策策略仍是一大挑战，因为现有方法常依赖固定的观察格式和动作空间，限制了泛化。相比之下，大型语言模型（LLMs）提供了灵活的建模接口，能够自然适应异构的观察和操作。基于此，我们提出了决策语言模型（DLM），该模型将多智能体决策表述为一种对话式序列预测问题，采用集中式训练和去中心化执行范式。DLM训练分为两个阶段：监督式微调阶段，利用对话式数据集进行集中训练，结合代理间上下文，并从离线轨迹生成可执行动作;随后是组相对策略优化阶段，通过轻量级奖励函数增强对非分布动作的鲁棒性。多个基准测试的实验表明，统一的DLM优于强的离线MARL基线和基于LLM的对话决策方法，同时在任务中对未见场景展现出强烈的零样本推广能力。

CAPSULE: Control-Theoretic Action Perturbations for Safe Uncertainty-Aware Reinforcement Learning

CAPSULE：安全不确定性意识强化学习的控制理论作用扰动

Authors: Rahul Narava, Siddharth Verma, Ojas Jain, Shashi Shekhar Jha, Mayank Shekhar Jha
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.23576
Pdf link: https://arxiv.org/pdf/2604.23576
Abstract Ensuring safe exploration in high-dimensional systems with unknown dynamics remains a significant challenge. Existing safe reinforcement learning methods often provide safety guarantees only in expectation, which can still lead to safety violations. Control-theoretic approaches, in contrast, offer hard constraint-based safety guarantees but typically assume access to known system dynamics or require accurate estimation of control-affine models. In this paper, we propose a safe reinforcement learning framework that learns a probabilistic control-affine dynamics model in an offline setting. The learned model is leveraged to explicitly construct control barrier functions (CBFs) that incorporate model uncertainty to provide conservative safety constraints. These CBF constraints are enforced through an online constraint-based action correction mechanism, enabling safe exploration without overly restricting task performance. Empirical evaluations on nonlinear, complex continuous-control benchmarks demonstrate that our approach achieves returns comparable to those of existing baselines while significantly reducing safety violations.
中文摘要 确保在未知动力学的高维系统中安全探索仍是一项重大挑战。现有的安全强化学习方法通常只在预期情况下提供安全保证，这仍可能导致安全违规。相比之下，控制理论方法提供基于约束的硬性安全保障，但通常假设访问已知的系统动力学，或要求对控制仿射模型进行准确估计。本文提出了一个安全的强化学习框架，在离线环境中学习概率性控制-仿射动力学模型。所学模型被用来明确构造包含模型不确定性的控制障碍函数（CBF），以提供保守的安全约束。这些 CBF 约束通过在线约束式动作修正机制强制执行，实现安全探索，同时不过度限制任务表现。对非线性复杂连续控制基准的实证评估表明，我们的方法实现了与现有基线相当的回报，同时显著减少了安全违规。

DRL-Based Antenna Position Optimization For MA-Assisted OTFS System Under Imperfect CSI

基于DRL的MA辅助OTFS系统在不完美CSI下天线位置优化

Authors: Maoyuan Wang, Qian Zhang, Yufei Zhao, Xuejun Cheng, Zheng Dong, Deqiang Wang, Yong Liang Guan
Subjects: Subjects: Information Theory (cs.IT)
Arxiv link: https://arxiv.org/abs/2604.23611
Pdf link: https://arxiv.org/pdf/2604.23611
Abstract In this paper, we introduce movable antenna (MA) technology into orthogonal time frequency space (OTFS) systems to enable wavelength-level antenna position optimization under imperfect channel state information (CSI), thereby mitigating deep fading. To accurately acquire CSI, we develop a sparse Bayesian learning method with variational inference (SBLVI) method. Based on estimated CSI, we formulate an MA position optimization problem with the objective of maximizing channel gain. Due to the highly non-convex character of the problem, we further develop a deep reinforcement learning (DRL) strategy to intelligently optimize MA positions. Simulation results show that the proposed SBLVI method significantly improves channel estimation accuracy over benchmark methods, and MA position optimization based on estimated CSI achieves substantially higher channel gains than the fixed-position antenna (FPA), demonstrating the effectiveness of the proposed MA-assisted OTFS system.
中文摘要 本文将可移动天线（MA）技术引入正交时频空间（OTFS）系统，以实现在信道状态信息不完美（CSI）下波长级天线位置优化，从而减轻深度衰落。为了准确获得CSI，我们开发了一种带有变分推断的稀疏贝叶斯学习方法（SBLVI）。基于估算的CSI，我们提出了一个MA位置优化问题，目标是最大化通道增益。由于问题高度非凸性，我们进一步开发了深度强化学习（DRL）策略，以智能优化MA位置。模拟结果表明，所提的SBLVI方法显著提升了信道估计的准确性，且基于估算CSI的MA位置优化实现了显著高于固定位置天线（FPA）的信道增益，证明了拟议MA辅助OTFS系统的有效性。

GraphPlanner: Graph Memory-Augmented Agentic Routing for Multi-Agent LLMs

GraphPlanner：多智能体大型语言模型的图内存增强代理路由

Authors: Tao Feng, Haozhen Zhang, Zijie Lei, Peixuan Han, Jiaxuan You
Subjects: Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.23626
Pdf link: https://arxiv.org/pdf/2604.23626
Abstract LLM routing has achieved promising results in integrating the strengths of diverse models while balancing efficiency and performance. However, to support more realistic and challenging applications, routing must extend into agentic LLM settings, where task planning, multi-round cooperation among heterogeneous agents, and memory utilization are indispensable. To address this gap, we propose GraphPlanner, a heterogeneous graph memory-augmented agentic router for multi-agent LLMs that generates routing workflows for each query and supports both inductive and transductive inference. GraphPlanner formulates workflow generation as a Markov Decision Process (MDP), where at each step it selects both the LLM backbone and the agent role, including Planner, Executor, and Summarizer. By leveraging a heterogeneous graph, denoted as GARNet, to capture interaction memories among queries, agents, and responses, GraphPlanner integrates historical memory and workflow memory into richer state representations. The entire pipeline is optimized with reinforcement learning, jointly improving task-specific performance and computational efficiency. We evaluate GraphPlanner across 14 diverse LLM tasks and demonstrate that: (1) GraphPlanner outperforms strong single-round and multi-round routers, improving accuracy by up to 9.3% while reducing GPU cost from 186.26 GiB to 1.04 GiB; (2) GraphPlanner generalizes robustly to unseen tasks and LLMs, exhibiting strong zero-shot capabilities; and (3) GraphPlanner effectively leverages historical memories, supporting both inductive and transductive inference for more adaptive routing. Our code for GraphPlanner is released at this https URL.
中文摘要 LLM路由在整合多种模型优势的同时，在效率与性能之间取得了令人期待的成果。然而，为了支持更现实和更具挑战性的应用，路由必须扩展到代理型LLM环境，任务规划、异构代理之间的多轮协作以及内存利用是不可或缺的。为弥补这一空白，我们提出了GraphPlanner，一款多智能体LLM的异构图内存增强代理路由器，为每个查询生成路由工作流，并支持归纳和转导推理。GraphPlanner将工作流生成表述为马尔可夫决策过程（MDP），每一步都选择LLM骨干和代理角色，包括规划者、执行者和摘要器。通过利用异构图（称为GARNet）捕捉查询、代理和响应之间的交互记忆，GraphPlanner将历史记忆和工作流记忆整合进更丰富的状态表示中。整个流程通过强化学习进行优化，共同提升任务特定性能和计算效率。我们评估了GraphPlanner在14个不同大型语言模型任务中的表现，并证明：（1） GraphPlanner优于强力单轮和多轮路由器，准确率提升高达9.3%，同时将GPU成本从186.26 GiB降至1.04 GiB;（2） GraphPlanner 能够稳健地推广到未见任务和大型语言模型，展现出强大的零任务能力;以及（3）GraphPlanner有效利用历史记忆，支持归纳和转导推断，实现更具自适应性的路由。我们的GraphPlanner代码以这个https URL发布。

Structural Enforcement of Goal Integrity in AI Agents via Separation-of-Powers Architecture

通过权力分立架构对AI代理目标完整性的结构性强制执行

Authors: Rong Xiang
Subjects: Subjects: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2604.23646
Pdf link: https://arxiv.org/pdf/2604.23646
Abstract Recent evidence suggests that frontier AI systems can exhibit agentic misalignment, generating and executing harmful actions derived from internally constructed goals, even without explicit user requests. Existing mitigation methods, such as Reinforcement Learning from Human Feedback (RLHF) and constitutional prompting, operate primarily at the model level and provide only probabilistic safety guarantees. We propose the Policy-Execution-Authorization (PEA) architecture, a "separation-of-powers" design that enforces safety at the system level. PEA decouples intent generation, authorization, and execution into independent, isolated layers connected via cryptographically constrained capability tokens. We present five core contributions: (C1) an Intent Verification Layer (IVL) for ensuring capability-intent consistency; (C2) Intent Lineage Tracking (ILT), which binds all executable intents to the originating user request via cryptographic anchors; (C3) Goal Drift Detection, which rejects semantically divergent intents below a configurable threshold; (C4) an Output Semantic Gate (OSG) that detects implicit coercion using a structured $K \times I \times P$ threat calculus (Knowledge, Influence, Policy); and (C5) a formal verification framework proving that goal integrity is maintained even under adversarial model compromise. By shifting agent alignment from a behavioral property to a structurally enforced system constraint, PEA provides a robust foundation for the governance of autonomous agents.
中文摘要 最新证据表明，前沿人工智能系统可能表现出代理错位，即使没有明确的用户请求，也可能生成并执行源自内部构建目标的有害行为。现有的缓解方法，如人类反馈强化学习（RLHF）和宪法提示，主要在模型层面运行，仅提供概率安全保障。我们提出了策略-执行-授权（PEA）架构，这是一种“权力分立”设计，在系统层面强制执行安全。PEA将意图生成、授权和执行解耦到独立、隔离的层，通过加密限制的能力令牌连接。我们提出了五项核心贡献：（C1）意图验证层（IVL），用于确保能力与意图的一致性;（C2）意图血缘追踪（ILT），通过密码锚将所有可执行意图绑定到发起用户请求;（C3）目标漂移检测，拒绝低于可配置阈值的语义分歧意图;（C4）输出语义门（OSG），利用结构化$K \times I \times P$ 威胁计算（知识、影响力、政策）检测隐性胁迫;以及（C5）一个形式化的验证框架，证明即使在对抗性模型妥协下目标完整性依然得以维护。通过将代理对齐从行为属性转变为结构性强制的系统约束，PEA为自治代理的治理提供了坚实的基础。

QuietWalk: Physics-Informed Reinforcement Learning for Ground Reaction Force-Aware Humanoid Locomotion Under Diverse Footwear

QuietWalk：基于物理的强化学习，用于地面反应力感知的人形运动，适用于不同鞋类下的运动

Authors: Hanze Hu, Luying Feng, Silu Chen, Tianjiang Zheng, Dexin Jiang, Wei Chen, Chi Zhang, Guilin Yang, Yaochu Jin
Subjects: Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.23702
Pdf link: https://arxiv.org/pdf/2604.23702
Abstract Humanoid robots operating in human-centered environments (e.g., homes, hospitals, and offices) must mitigate foot--ground impact transients, as impact-induced vibration and noise degrade user experience and repeated impacts accelerate hardware wear. However, existing low-noise locomotion training often relies on kinematic proxy objectives or fragile force sensors, and footwear-induced changes in contact dynamics introduce distribution shifts that hinder policy this http URL present QuietWalk, a physics-informed reinforcement learning framework for ground-reaction-force-aware humanoid locomotion under diverse footwear conditions. QuietWalk employs an inverse-dynamics-constrained physics-informed neural network (PINN) to estimate per-foot vertical ground reaction forces (GRFs) from proprioceptive signals, and integrates the frozen predictor into the RL training loop to penalize predicted impact forces without requiring force sensors at this http URL a held-out real-robot dataset, enforcing inverse-dynamics consistency reduces vertical GRF prediction errors by 82%-86% compared with a purely supervised predictor and improves the coefficient of determination from 0.39/0.67 to 0.99/0.99 for the left/right feet. On hardware at 1.2 m/s (barefoot; averaged over four floor materials), QuietWalk reduces mean A-weighted noise level by 7.17 dB and peak noise level by 4.98 dB under a consistent recording setup. Cross-footwear experiments (barefoot, skate shoes, athletic sneakers, and high heels) across multiple surfaces further demonstrate robust adaptation to footwear-induced contact variations.
中文摘要 在以人为中心的环境中（如家庭、医院和办公室）工作的类人机器人必须减轻地面冲击瞬变，因为冲击引起的振动和噪音会降低用户体验，反复撞击加速硬件磨损。然而，现有的低噪声运动训练通常依赖于运动学代理目标或脆弱的力传感器，而鞋类引起的接触动力学变化会引入分布变化，阻碍政策的实施。QuietWalk是一个基于物理的强化学习框架，用于在不同鞋类条件下进行地面反作用力感知的人形运动。QuietWalk采用逆动力学约束的物理知情神经网络（PINN）从本体感觉信号估算每英尺垂直地面反作用力（GRF），并将冻结预测器集成到强化学习训练循环中，惩罚预测的撞击力，而无需使用力传感器。此为真实机器人数据集，强制执行逆动力学一致性可将垂直GRF预测误差降低82%-86%，相比纯监督预测器，将左右脚的确定系数从0.39/0.67提升到0.99/0.99。在硬件速度为1.2 m/s（赤脚;四种地板材料平均）下，QuietWalk在稳定的录音设置下，平均A加权噪声水平降低7.17 dB，峰值噪声降低4.98 dB。交叉穿鞋（赤脚、滑板鞋、运动鞋和高跟鞋）在多个表面上进行，进一步展示了对鞋类诱导接触变化的强韧适应性。

SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning

SFT后RL在LLM推理中优于混合策略方法。

Authors: Alexis Limozin, Eduard Durech, Torsten Hoefler, Imanol Schlag, Valentina Pyatkin
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.23747
Pdf link: https://arxiv.org/pdf/2604.23747
Abstract Recent mixed-policy optimization methods for LLM reasoning that interleave or blend supervised and reinforcement learning signals report improvements over the standard SFT-then-RL pipeline. We show that numerous recently published research papers rely on a faulty baseline caused by two distinct bugs: a CPU-offloaded optimizer bug in DeepSpeed that silently drops intermediate micro-batches during gradient accumulation (affecting multiple downstream frameworks including TRL, OpenRLHF and Llama-Factory), and a loss aggregation bug in OpenRLHF that incorrectly weights per-mini-batch losses. Together they suppress SFT performance, with the optimizer bug accounting for most of the gap and the loss aggregation bug contributing a smaller additional effect. Once corrected, the standard SFT-then-RL pipeline surpasses every published mixed-policy method we evaluate by +3.8 points on math benchmarks with Qwen2.5-Math-7B and by +22.2 points with Llama-3.1-8B. Even a truncated variant with just 50 RL steps outperforms mixed-policy methods on math benchmarks while using fewer FLOPs.
中文摘要 近期用于大型语言模型推理的混合策略优化方法，将监督学习和强化学习信号交错或混合，报告称其优于标准的SFT再强化学习流水线。我们展示了许多近期发表的研究论文依赖于由两个不同漏洞引起的错误基线：DeepSpeed中CPU卸载优化器错误，在梯度累积过程中无声地丢弃中间微批次（影响包括TRL、OpenRLHF和Llama-Factory在内的多个下游框架），以及OpenRLHF中的损失聚合错误，错误地加权每个最小批次的损失。它们共同抑制SFT的性能，优化器漏洞承担了大部分差距，损失聚合漏洞则贡献较小额外影响。一旦修正，标准的SFT再强化学习流水线在数学基准测试中比我们评估的所有已发表混合策略方法（Qwen2.5-Math-7B）高出+3.8分，Llama-3.1-8B高出+22.2分。即使是只有50步强化学习步的截断变体，在数学基准测试中也优于混合策略方法，同时使用更少的FLOP。

Unleashing the Agility of Wheeled-Legged Robots for High-Dynamic Reflexive Obstacle Evasion

释放轮式机器人的敏捷性，实现高动态反射障碍躲避

Authors: Yongen Zhao (1 and 3), Zihao Xu (2), Wenzhi Lu (1), Zhen Chu (4), Ce Hao (2 and 3) ((1) School of Mechanical Engineering, Tianjin University, Tianjin, China, (2) School of Computing, National University of Singapore, Singapore, (3) Beijing Zhongguancun Academy, Beijing, China, (4) DeepRobotics, Hangzhou, China)
Subjects: Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.23761
Pdf link: https://arxiv.org/pdf/2604.23761
Abstract Wheeled-legged robots combine the energy efficiency of wheeled locomotion with the terrain adaptability of legged systems, making them promising platforms for agile mobility in complex and dynamic environments. However, enabling high-dynamic reflexive evasion against fast-moving obstacles remains challenging due to the hybrid morphology, mode coupling, and non-holonomic constraints of such platforms. In this work, we propose AWARE, Adaptive Wheeled-Legged Avoidance and Reflexive Evasion, a hierarchical reinforcement learning framework for high-dynamic obstacle avoidance in wheeled-legged robots. The proposed system naturally exhibits diverse emergent gaits and evasive behaviors, including forward lunge and lateral dodge, thereby leveraging the robot's hybrid morphology to enhance agility under highly dynamic threats. Extensive experiments in Isaac Lab simulation and real-world deployment on the M20 platform across diverse dynamic scenarios demonstrate that AWARE achieves robust and agile obstacle avoidance while revealing behaviorally distinct evasive strategies. These results highlight both the practical effectiveness of AWARE and the intrinsic reflexive agility of wheeled-legged robots.
中文摘要 轮式机器人结合了轮式移动的能源效率与腿式系统的地形适应性，使其成为在复杂动态环境中实现灵活移动的有前景平台。然而，由于此类平台的混合形态、模态耦合和非全全体约束，实现对高速移动障碍物的高动态反射规避仍然具有挑战性。本研究提出AWARE（自适应轮式腿避让与反射性回避），这是一种用于轮式腿机器人高动态障碍避让的分层强化学习框架。该系统自然展现出多样的涌现步态和规避行为，包括前冲和侧向闪避，利用机器人的混合形态在高度动态威胁下提升敏捷性。Isaac实验室的大量实验和M20平台上在多种动态场景下的实际部署表明，AWARE实现了稳健且灵活的障碍避让，同时揭示了行为上独特的规避策略。这些结果凸显了AWARE的实际有效性以及轮式腿机器人固有的反射敏捷性。

Scalable Production Scheduling: Linear Complexity via Unified Homogeneous Graphs

可扩展生产调度：通过统一齐次图实现线性复杂性

Authors: Jonathan Hoss, Moritz Link, Noah Klarmann
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.23841
Pdf link: https://arxiv.org/pdf/2604.23841
Abstract Efficiently solving the Job Shop Scheduling Problem in real-world industrial applications requires policies that are both computationally lean and topologically robust. While Reinforcement Learning has shown potential in automating dispatching rules, existing models often struggle with a scalability bottleneck caused by quadratic graph complexity or the architectural overhead of heterogeneous layers. We introduce a unified graph framework that employs feature-based homogenization to project distinct node roles into a shared latent space. This allows a standard homogeneous Graph Isomorphism Network to capture complex resource contention with linear complexity, ensuring low-latency inference for large-scale industrial applications. Our empirical results demonstrate that our framework achieves state-of-the-art performance while exhibiting consistent zero-shot generalization. We identify the job-to-machine ratio as the primary driver of policy effectiveness, rather than absolute problem size. Based on this, we propose a hypothesis of structural saturation, demonstrating that policies trained on critically congested instances ($\mathcal{J} \approx \mathcal{M}$) learn scale-invariant resolution strategies. Agents trained at this saturation point internalize invariant conflict-resolution logic, allowing them to treat massive rectangular instances as a sequential concatenation of saturated sub-problems. This approach eliminates the need for expensive scale-specific retraining and prevents overfitting to statistical shortcuts, providing a robust and efficient pathway for deploying RL solutions in dynamic production environments.
中文摘要 在现实工业应用中高效解决工坊调度问题，需要既计算精简又拓扑稳健的政策。虽然强化学习在自动化调度规则方面展现出潜力，但现有模型常常面临由二次图复杂度或异构层架构开销带来的可扩展性瓶颈。我们引入了一个统一的图框架，采用基于特征的同质化技术，将不同的节点角色投射到共享的潜在空间中。这使得标准的齐次图同构网络能够以线性复杂度捕捉复杂资源争用，确保大规模工业应用的低延迟推断。我们的实证结果表明，我们的框架在实现了最先进的性能的同时，也展现了一致的零样品泛化能力。我们将工作与机器的比例视为政策有效性的主要驱动因素，而非绝对问题规模。基于此，我们提出了结构饱和假说，证明在极度拥塞实例（$\mathcal{J} \approx \mathcal{M}$）上训练的策略会学习尺度不变的分辨率策略。在该饱和点训练的代理内化了不变冲突解决逻辑，使他们能够将巨大的矩形实例视为饱和子问题的顺序串接。这种方法消除了昂贵的规模级重训需求，防止对统计捷径的过度拟合，为在动态生产环境中部署强化学习解决方案提供了稳健高效的路径。

MUSIC: Learning Muscle-Driven Dexterous Hand Control

音乐：学习肌肉驱动的灵巧手部控制

Authors: Pei Xu, Yufei Ye, Shuchun Sun, Yu Ding, Elizabeth Schumann, C. Karen Liu
Subjects: Subjects: Graphics (cs.GR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.23886
Pdf link: https://arxiv.org/pdf/2604.23886
Abstract We present a data-driven approach for physics-based, muscle-driven dexterous control that enables musculoskeletal hands to perform precise piano playing for novel pieces of music outside the reference dataset. Our approach combines high-frequency muscle-level control with low-frequency latent-space coordination in a hierarchical architecture. At the low level, general single-hand policies are trained via reinforcement learning to generate dynamic muscle-tendon activations while tracking trajectories from a large reference motion dataset. The resulting tracking policies are then distilled into variational autoencoder (VAE) models, yielding smooth and structured latent spaces that abstract away low-level muscle dynamics. For the high level, we train piece-specific policies to operate in this latent space, coordinating bimanual motions based on specific goals, denoted by note events extracted from given musical scores, to synthesize performances beyond the reference data. In addition, we present an enhanced musculoskeletal hand model that supports fine control of fingers for accurate low-level motion tracking and diverse high-level motion synthesis. We evaluate the control pipeline of our approach on a diverse piano repertoire spanning multiple musical styles and technical demands. Results demonstrate that our approach can synthesize coordinated bimanual motions with accurate key presses, and achieve the state-of-the-art performance of piano playing in physics-based dexterous control. We also show that our musculoskeletal hand model demonstrates superior biomechanical stability and tracking precision compared to the existing model, and validate that our musculoskeletal hand model and muscle-driven controller can generate physiologically plausible activation patterns that align with human electromyography (EMG) recordings.
中文摘要 我们提出了一种基于物理、肌肉驱动的灵巧控制数据驱动方法，使肌肉骨骼手能够在参考数据集之外为新颖音乐作品进行精准的钢琴演奏。我们的方法将高频肌肉级控制与低频潜空间协调结合在层级结构中。在底层，通过强化学习训练通用单手策略，在追踪大型参考运动数据集轨迹的同时生成动态肌肉-肌腱激活。由此产生的跟踪策略被提炼为变分自编码器（VAE）模型，产生平滑且结构化的潜在空间，抽象化低层次肌肉动态。在高层次，我们训练曲目特定策略，在该潜在空间中运作，协调基于特定目标的双手动作，这些目标通过从给定乐谱中提取的音符事件表示，以综合超出参考数据的演奏。此外，我们还提出了增强型肌肉骨骼手模型，支持手指的精细控制，实现低级运动的精确追踪和多样的高强度运动综合。我们评估了我们方法中对涵盖多种音乐风格和技术需求的多样钢琴曲目的控制流程。结果表明，我们的方法能够通过精确的按键合成协调的双手动作，实现基于物理的灵巧控制中钢琴演奏的尖端表现。我们还证明了我们的肌肉骨骼手模型在生物力学稳定性和追踪精度上优于现有模型，并验证了肌肉骨骼手模型和肌肉驱动控制器能够生成符合人体肌电图（EMG）记录的生理合理激活模式。

Hindsight Preference Optimization for Financial Time Series Advisory

财务时间序列咨询的事后诸葛亮偏好优化

Authors: Yanwei Cui, Guanghui Wang, Xing Zhang, Peiyang He, Ziyuan Li, Bing Zhu, Wei Qiu, Xusheng Wang, Zheng Yu, Anqi Xin
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.23988
Pdf link: https://arxiv.org/pdf/2604.23988
Abstract Time series models predict numbers; decision-makers need advisory -- directional signals with reasoning, actionable suggestions, and risk management. Training language models for such predictive advisory faces a fundamental challenge: quality depends on outcomes unknown at prediction time. We bridge two ideas from reinforcement learning -- using information unavailable during execution to retrospectively generate training signal, and preference alignment -- and propose Hindsight Preference Optimization: observed outcomes let an LLM judge rank candidate advisories on dimensions that scalar metrics cannot capture, producing preference pairs for DPO without human annotation. We apply this to Vision-Language-Model-based predictive advisories on S&P 500 equity time series, demonstrated by a 4B model outperforming its 235B teacher on both accuracy and advisory quality.
中文摘要 时间序列模型预测数字;决策者需要咨询——带有推理的方向性信号、可执行的建议和风险管理。用于此类预测性咨询的语言模型训练面临一个根本挑战：质量依赖于预测时未知的结果。我们结合了强化学习中的两个理念——利用执行过程中未获得的信息回溯生成训练信号，以及偏好对齐——并提出了事后诸葛亮偏好优化：观察到的结果让大型语言模型能够在标量指标无法捕捉的维度上评判候选建议，从而在无需人工注释的情况下生成DPO的偏好对。我们将此应用于基于视觉-语言-模型的标普500股票时间序列预测咨询，4B模型在准确性和咨询质量上均优于其2.35亿模型。

EPM-RL: Reinforcement Learning for On-Premise Product Mapping in E-Commerce

EPM-RL：电子商务中本地产品映射的强化学习

Authors: Minhyeong Yu, Wonduk Seo
Subjects: Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2604.23993
Pdf link: https://arxiv.org/pdf/2604.23993
Abstract Product mapping, the task of deciding whether two e-commerce listings refer to the same product, is a core problem for price monitoring and channel visibility. In real marketplaces, however, sellers frequently inject promotional keywords, platform-specific tags, and bundle descriptions into titles, causing the same product to appear under many different names. Recent LLM-based and multi-agent frameworks improve robustness and interpretability on such hard cases, but they often rely on expensive external APIs, repeated retrieval, and complex inference-time orchestration, making large-scale deployment costly and difficult in privacy-sensitive enterprise settings. To address these issues, we present EPM-RL, a reinforcement-learning-based framework for building an accurate and efficient on-premise e-commerce product mapping model. Our central idea is to distill high-cost agentic reasoning into a trainable in-house model. Starting from a curated set of product pairs with LLM-generated rationales and human verification, we first perform parameter-efficient fine-tuning (PEFT) on a small student model using structured reasoning outputs. We then further optimize the model with Reinforcement Learning (RL) using an agent-based reward that jointly evaluates output-format compliance, label correctness, reasoning--preference scores from specially designed judge models. Preliminary results show that EPM-RL consistently improves over PEFT-only training and offers a stronger quality--cost trade-off than commercial API-based baselines, while enabling private deployment and lower operational cost. These findings suggest that reinforcement learning can turn product mapping from a high-latency agentic pipeline into a scalable, inspectable, and production-ready in-house system.
中文摘要 产品映射，即判断两个电商列表是否指向同一产品，是价格监控和渠道可视化的核心问题。然而，在真实市场中，卖家经常在标题中注入促销关键词、平台标签和捆绑描述，导致同一产品以多种不同名称出现。近期基于LLM和多代理的框架在此类硬案例中提升了鲁棒性和可解释性，但它们通常依赖昂贵的外部API、反复检索和复杂的推理时间编排，使得在隐私敏感的企业环境中进行大规模部署既昂贵又困难。为解决这些问题，我们提出了EPM-RL，一个基于强化学习的框架，用于构建准确高效的本地电子商务产品映射模型。我们的核心理念是将高成本的代理推理提炼成一个可训练的内部模型。我们从一组经过精心策划的产品配对出发，这些产品配对由LLM生成的理由和人工验证，首先使用结构化推理输出对一个小型学生模型进行参数高效微调（PEFT）。随后，我们利用基于代理的奖励进一步优化模型，基于主体的奖励联合评估输出格式合规性、标签正确性、推理性——偏好分数，这些分数来自专门设计的评审模型。初步结果显示，EPM-RL在仅用PEFT训练时持续提升，并且在质量与成本权衡上优于基于商业API的基线，同时支持私有部署和更低的运营成本。这些发现表明，强化学习可以将产品映射从高延迟的代理流水线转变为可扩展、可检查且具备生产环境的内部系统。

DeepTaxon: An Interpretable Retrieval-Augmented Multimodal Framework for Unified Species Identification and Discovery

DeepTaxon：一个可解释的检索增强多模态框架，用于统一物种识别与发现

Authors: Jiawei Wang, Ming Lei, Yaning Yang, Xinyan Lin, Yuquan Le, Qiwei Ma, Zhiwei Xu, Zheqi Lv, Yuchen Ang, Zhe Quan, Tat-Seng Chua
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR); Multimedia (cs.MM)
Arxiv link: https://arxiv.org/abs/2604.24029
Pdf link: https://arxiv.org/pdf/2604.24029
Abstract Identifying species in biology among tens of thousands of visually similar taxa while discovering unknown species in open-world environments remains a fundamental challenge in biodiversity research. Current methods treat identification and discovery as separate problems, with classification models assuming closed sets and discovery relying on threshold-based rejection. Here we present DeepTaxon, a retrieval-augmented multimodal framework that unifies species identification and discovery through interpretable reasoning over retrieved visual evidence. Given a query image, DeepTaxon retrieves the top-$k$ candidate species with $n$ exemplar images each from a retrieval index and performs chain-of-thought comparative reasoning. Critically, we redefine discovery as an explicit, retrieval-based decision problem rather than an implicit parametric memory problem. A sample is novel if and only if the retrieval index lacks sufficient evidence for identification, so each retrieval naturally yields a classification or discovery label without manual annotation, thereby providing automatic supervision for both tasks. We train the framework via supervised fine-tuning on synthetic retrieval-augmented data, followed by reinforcement learning on hard samples, converting high-recall retrieval into high-precision decisions that scale to massive taxonomic vocabularies. Extensive experiments on a large-scale in-distribution benchmark and six out-of-distribution datasets demonstrate consistent improvements in both identification and discovery. Ablation studies further reveal effective test-time scaling with candidate count $k$ and exemplar count $n$, strong zero-shot transfer to unseen domains, and consistent performance across retrieval encoders, establishing an interpretable solution for biodiversity research.
中文摘要 在生物多样性研究中，如何在成千上万个视觉相似的分类单元中识别物种，同时发现未知物种，仍是生物多样性研究中的一项根本挑战。当前方法将识别和发现视为两个独立的问题，分类模型假设闭集，发现则依赖基于阈值的拒绝。这里我们介绍DeepTaxon，一种检索增强的多模态框架，通过可解释的推理统一物种鉴定和发现，基于检索的视觉证据。给定查询图像时，DeepTaxon从检索索引中检索顶部$k$候选物种，每个候选物种有$n$的示例图像，并进行思维链比较推理。关键是，我们将发现重新定义为一个显式的基于检索的决策问题，而非隐含的参数记忆问题。当且仅当检索索引缺乏足够的证据进行识别时，样本才是新的，因此每次检索自然产生分类或发现标签，无需人工注释，从而为这两种任务提供自动监督。我们通过对合成检索增强数据进行监督微调训练该框架，随后在硬样本上进行强化学习，将高回忆性检索转化为能够扩展到庞大分类词汇量的高精度决策。在大规模分布基准和六个分布外数据集上的广泛实验显示，识别和发现均有持续提升。消融研究进一步揭示了有效的测试时间尺度，候选样本计数$k$和示例计数$n$，对未见域的强零样本转移，以及各检索编码器间性能一致，建立了生物多样性研究的可解释解。

Grounding Before Generalizing: How AI Differs from Humans in Causal Transfer

先扎根再概括：人工智能在因果传递上与人类的区别

Authors: Liangru Xiang, Yuxi Ma, Zhihao Cao, Yixin Zhu, Song-Chun Zhu
Subjects: Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.24062
Pdf link: https://arxiv.org/pdf/2604.24062
Abstract Extracting abstract causal structures and applying them to novel situations is a hallmark of human intelligence. While Large Language Models (LLMs) and Vision Language Models (VLMs) have shown strong performance on a wide range of reasoning tasks, their capacity for interactive causal learning -- inducing latent structures through sequential exploration and transferring them across contexts -- remains uncharacterized. Human learners accomplish such transfer after minimal exposure, whereas classical Reinforcement Learning (RL) agents fail catastrophically. Whether state-of-the-art Artificial Intelligence (AI) models possess human-like mechanisms for abstract causal structure transfer is an open question. Using the OpenLock paradigm requiring sequential discovery of Common Cause (CC) and Common Effect (CE) structures, here we show that models exhibit fundamentally delayed or absent transfer: even successful models require initial environmental-specific mapping -- what we term environmental grounding -- before efficiency gains emerge, whereas humans leverage prior structural knowledge from the very first solution attempt. In the text-only condition, models matched or exceeded human discovery efficiency. In contrast, visual information -- in both the image-only and text-and-image conditions -- overall degraded rather than enhanced performance, revealing a broad reliance on symbolic processing rather than integrated multimodal reasoning. Models further exhibited systematic CC/CE asymmetries absent in humans, suggesting heuristic biases rather than direction-neutral causal abstraction. These findings reveal that large-scale statistical learning does not produce the decontextualized causal schemas underpinning human analogical reasoning, establishing grounding-dependent transfer as a fundamental limitation of current LLMs and VLMs.
中文摘要 提取抽象的因果结构并将其应用于新颖情境是人类智能的标志。虽然大型语言模型（LLMs）和视觉语言模型（VLMs）在广泛的推理任务中表现出优异表现，但它们在交互因果学习方面的能力——通过顺序探索诱导潜在结构并将其跨越上下文转移——尚未被明确描述。人类学习者在极少的接触下完成了这种迁移，而经典的强化学习（RL）代理则会灾难性地失败。最先进的人工智能（AI）模型是否具备类似人类的抽象因果结构转移机制，仍是一个未解之谜。利用OpenLock范式，要求顺序发现共同原因（CC）和共同效应（CE）结构，我们展示了模型表现出根本性的延迟或缺失转移：即使是成功的模型，也需要初步的环境特定映射——我们称之为环境基础——才能实现效率提升，而人类则利用从首次解法尝试中获得的结构知识。在纯文本条件下，模型的发现效率与人类的发现相当甚至超过。相比之下，视觉信息——无论是仅图像还是文本加图像的条件——整体上性能下降而非增强，显示出广泛依赖符号处理而非集成多模态推理。模型还显示出系统性的CC/CE不对称性，而人类则不存在，这表明这是启发式偏差，而非方向中立的因果抽象。这些发现表明，大规模统计学习并未产生支撑人类类比推理的去语境因果图式，确立了基于基础的迁移是当前大型语言模型（LLM）和大型语言模型（VLMs）的根本限制。

AsyncShield: A Plug-and-Play Edge Adapter for Asynchronous Cloud-based VLA Navigation

AsyncShield：一款即插即用的异步云端VLA导航边缘适配器

Authors: Kai Yang, Zedong Chu, Yingnan Guo, Zhengbo Wang, Shichao Xie, Yanfen Shen, Xiaolong Wu, Xing Li, Mu Xu
Subjects: Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.24086
Pdf link: https://arxiv.org/pdf/2604.24086
Abstract While Vision-Language-Action (VLA) models have been demonstrated possessing strong zero-shot generalization for robot control, their massive parameter sizes typically necessitate cloud-based deployment. However, cloud deployment introduces network jitter and inference latency, which can induce severe spatiotemporal misalignment in mobile navigation under continuous displacement, so that the stale intents expressed in past ego frames may become spatially incorrect in the current frame and lead to collisions. To address this issue, we propose AsyncShield, a plug-and-play asynchronous control framework. AsyncShield discards traditional black-box time-series prediction in favor of a deterministic physical white-box spatial mapping. By maintaining a temporal pose buffer and utilizing kinematic transformations, the system accurately converts temporal lag into spatial pose offsets to restore the VLA's original geometric intent. To balance intent restoration fidelity and physical safety, the edge adaptation is formulated as a constrained Markov decision process (CMDP). Solved via the PPO-Lagrangian algorithm, a reinforcement learning adapter dynamically trades off between tracking the VLA intent and responding to high-frequency LiDAR obstacle avoidance hard constraints. Furthermore, benefiting from a standardized universal sub-goal interface, domain randomization, and perception-level adaptation via Collision Radius Inflation, AsyncShield operates as a lightweight, plug-and-play module. Simulation and real-world experiments demonstrate that, without fine-tuning any cloud-based foundation models, the framework exhibits zero-shot and robust generalization capabilities, effectively improving the success rate and physical safety of asynchronous navigation.
中文摘要 虽然视觉-语言-动作（VLA）模型已被证明具有强大的零样本机器人控制推广能力，但其庞大的参数尺寸通常需要基于云的部署。然而，云部署引入了网络抖动和推理延迟，这可能导致移动导航在持续位移下严重的时空错位，过去ego帧中表达的陈旧意图在当前帧中可能在空间上不正确，从而导致碰撞。为解决这个问题，我们提出了AsyncShield，一种即插即用的异步控制框架。AsyncShield摒弃了传统的黑盒时间序列预测，转而采用确定性的物理白盒空间映射。通过保持时间姿态缓冲并利用运动学变换，系统准确地将时间滞后转换为空间姿态偏移，恢复VLA的原始几何意图。为了平衡意图恢复的忠实度和物理安全性，边缘适配被表述为受限马尔可夫决策过程（CMDP）。通过PPO-拉格朗日算法求解，强化学习适配器在跟踪VLA意图与响应高频LiDAR障碍物避让硬约束之间动态权衡。此外，借助标准化的通用子目标接口、领域随机化以及通过碰撞半径膨胀实现感知层级适应，AsyncShield 作为一个轻量化的即插即用模块运行。仿真和实际实验表明，无需微调任何基于云的基础模型，该框架即可展现零样本和稳健的泛化能力，有效提升异步导航的成功率和物理安全性。

IRIS: Interleaved Reinforcement with Incremental Staged Curriculum for Cross-Lingual Mathematical Reasoning

IRIS：跨语言数学推理的交错强化与增量分阶段课程

Authors: Navya Gupta, Rishitej Reddy Vyalla, Avinash Anand, Chhavi Kirtani, Erik Cambria, Zhengchen Zhang, Zhengkui Wang, Timothy Liu, Aik Beng Ng, Simon See, Rajiv Ratn Shah
Subjects: Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.24114
Pdf link: https://arxiv.org/pdf/2604.24114
Abstract Curriculum learning helps language models tackle complex reasoning by gradually increasing task difficulty. However, it often fails to generate consistent step-by-step reasoning, especially in multilingual and low-resource settings where cross-lingual transfer from English to Indian languages remains limited. We propose IRIS: Interleaved Reinforcement with Incremental Staged Curriculum, a two-axis framework that combines Supervised Fine-Tuning on progressively harder problems (vertical axis) with Reverse Curriculum Reinforcement Learning to reduce reliance on step-by-step guidance (horizontal axis). We design a composite reward combining correctness, step-wise alignment, continuity, and numeric incentives, optimized via Group Relative Policy Optimization (GRPO). We release CL-Math, a dataset of 29k problems with step-level annotations in English, Hindi, and Marathi. Across standard benchmarks and curated multilingual test sets, IRIS consistently improves performance, with strong results on math reasoning tasks and substantial gains in low-resource and bilingual settings, alongside modest improvements in high-resource languages.
中文摘要 课程学习通过逐步增加任务难度，帮助语言模型应对复杂的推理。然而，它常常无法产生一致的逐步推理，尤其是在多语言和资源匮乏的环境中，跨语言从英语到印度语言的迁移仍然有限。我们提出了IRIS：交错强化与渐进分阶段课程，这是一个双轴框架，结合了对逐步难题的监督式微调（纵轴）和逆向课程强化学习（横轴）的依赖，以减少对逐步指导的依赖。我们设计了一种综合奖励，结合了正确性、分步骤对齐、连续性和数值激励，并通过群体相对政策优化（GRPO）进行优化。我们发布了CL-Math，这是一个包含2.9万个带有步骤级注释的题目数据集，内容涵盖英语、印地语和马拉地语。在标准基准测试和精心策划的多语言测试集中，IRIS 持续提升表现，在数学推理任务中取得强劲成绩，在低资源和双语环境中取得显著进步，同时在高资源语言方面也有适度提升。

An Analysis of the Coordination Gap between Joint and Modular Learning for Job Shop Scheduling with Transportation Resources

联合学习与模块化学习在与交通资源的作业车间排班协调差距分析

Authors: Moritz Link, Jonathan Hoss, Noah Klarmann
Subjects: Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.24117
Pdf link: https://arxiv.org/pdf/2604.24117
Abstract Efficient job-shop scheduling with transportation resources is critical for high-performance manufacturing. With the rise of "decentralized factories", multi-agent reinforcement learning has emerged as a promising approach for the combined scheduling of production and transportation tasks. Prior work has largely focused on developing novel cooperative architectures while overlooking the question of when joint training is necessary. Joint training denotes the simultaneous training of job and automatic guided vehicle scheduling agents, whereas modular training involves independently training each agent followed by post-hoc integration. In this study, we systematically investigate the conditions under which joint training is essential for optimal performance in the job-shop scheduling problem with transportation resources. Through a rigorous sensitivity analysis of resource scarcity and temporal dominance, we quantify the coordination gap -- the performance difference between these two training modalities. In our evaluation, the joint training can produce superior performance compared to the best-performing combinations of dispatching rules and modular training. However, the coordination gap advantage diminishes in bottleneck environments, particularly under severe transport and processing constraints. These findings indicate that modular training represents a viable alternative in environments where a single scheduling task dominates. Overall, our work provides practical guidance for selecting between training modalities based on environmental conditions, enabling decision-makers to optimize reinforcement learning-based scheduling performance.
中文摘要 高效的工坊排班与运输资源对于高性能制造至关重要。随着“去中心化工厂”的兴起，多智能体强化学习已成为生产与运输任务联合调度的有前景方法。此前的研究主要集中在开发新型协作架构，而忽视了何时需要联合训练的问题。联合培训指的是同时培训作业和自动导引车辆调度代理，而模块化培训则是独立培训每个代理，随后进行事后整合。本研究系统地探讨了联合培训在运输资源作业车间排班问题中对最佳表现至关重要的条件。通过对资源稀缺性和时间优势的严格敏感性分析，我们量化了协调差距——这两种训练方式之间的表现差异。在我们的评估中，联合培训相比调度规则和模块化训练的最佳组合，能够产生更优异的绩效。然而，在瓶颈环境中，尤其是在运输和处理极限的情况下，协调间隙优势会减弱。这些发现表明，在单一排班任务占主导地位的环境中，模块化培训是一种可行的替代方案。总体而言，我们的工作为基于环境条件的训练方式选择提供了实用指导，使决策者能够优化基于强化学习的调度性能。

Leveraging Human Feedback for Semantically-Relevant Skill Discovery

利用人类反馈实现语义相关的技能发现

Authors: Maxence Hussonnois, Thommen George Karimpanal, Santu Rana
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.24127
Pdf link: https://arxiv.org/pdf/2604.24127
Abstract Unsupervised skill discovery in reinforcement learning aims to intrinsically motivate agents to discover diverse and useful behaviours. However, unconstrained approaches can produce unsafe, unethical, or misaligned behaviours. To mitigate these risks and improve the practical desireability of discovered skills, recent work grounds the discovery process by leveraging human preference feedback. However, preference-based approaches are feedback-inefficient and inherently ill-equipped to deal with skill spaces composed of a variety of different skills such as running, jumping, walking, etc. To overcome this limitation, we introduce semantic labelling, a novel and feedback-efficient approach that leverages human cognitive strengths to identify and label semantically meaningful behaviours. Based on semantic labelling, we propose Semantically Relevant Skill Discovery (SRSD), a novel human-in-the-loop approach that collects semantic labels from human feedback and learns a reward function to encourage skills to be more semantically diverse and relevant. Through our experiments in a 2D navigation environment and four locomotion environments, we demonstrate that SRSD can improve semantic diversity and discover relevant behaviours while scaling effectively to a large variety of behaviours.
中文摘要 强化学习中的无监督技能发现旨在内在激励代理发现多样且有用的行为。然而，不受限制的做法可能导致不安全、不道德或不合适的行为。为了降低这些风险并提升发现技能的实际可取性，近期工作通过利用人类偏好反馈来奠定发现过程的基础。然而，基于偏好的方法反馈效率低，且本质上不适合处理由多种不同技能组成的技能空间，如跑步、跳跃、走路等。为克服这一局限，我们引入了语义标记，这是一种新颖且反馈高效的方法，利用人类认知优势识别并标记语义上有意义的行为。基于语义标签，我们提出了语义相关技能发现（SRSD），这是一种新型“人在环”方法，通过人类反馈收集语义标签，学习奖励函数，以鼓励技能更具语义多样性和相关性。通过我们在二维导航环境和四种移动环境中的实验，我们证明SRSD能够提升语义多样性，发现相关行为，同时有效扩展到多种行为。

POCA: Pareto-Optimal Curriculum Alignment for Visual Text Generation

POCA：视觉文本生成的帕累托最优课程对齐

Authors: Yaohou Fan, Qingzhong Wang, Yongsong Huang, Junyi Liu, Tomo Miyazaki, Shinichiro Omachi
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.24171
Pdf link: https://arxiv.org/pdf/2604.24171
Abstract Current visual text generation models struggle with the trade-off between text accuracy and overall image coherence. We find that achieving high text accuracy can reduce aesthetic quality and instruction-following capability. Although reinforcement learning approaches can alleviate the problem through aligning with multiple rewards, they are often unstable for text generation, as existing approaches normally optimize multiple rewards in a weighted-sum way. In addition, it is difficult to balance the weight of each reward. Moreover, reinforcement learning requires a set of training instructions. A large number of prompts require more training time and computing resources, while a small set leads to poor performance. Hence, how to select the prompts for efficient training is an unsolved problem. In this study, we propose Pareto-Optimal Curriculum Alignment (POCA), a framework that addresses this issue as a multi-objective problem by: 1) identifying the Pareto-optimal set to avoid simple scalarization and 2) designing an adaptive curriculum alignment strategy to manage a learning sequence of a multi-reward dataset using automatic difficulty assessment, which is crucial for optimal convergence as RL methods explore in a limited data environment. In synergy, POCA finds the Pareto-optimal set in a unified reward space, which eliminates inconsistent signals to find the best trade-off solution from different rewards under an easy-to-hard optimization landscape. The experimental results show that POCA significantly improves all metrics such as CLIP, HPS scores and sentence accuracy.
中文摘要 当前的视觉文本生成模型在文本准确性与整体图像一致性之间存在权衡。我们发现，实现高文本准确性会降低美观质量和跟随指令的能力。虽然强化学习方法可以通过与多奖励对齐来缓解问题，但它们在文本生成上往往不稳定，因为现有方法通常以加权和方式优化多个奖励。此外，平衡每个奖励的权重也很难。此外，强化学习需要一套训练指令。大量提示需要更多的训练时间和计算资源，而少量提示则会导致性能不佳。因此，如何选择高效训练的提示仍是一个未解之谜。本研究提出帕累托最优课程对齐（POCA）框架，通过以下方式解决该问题，1）识别避免简单标量化的帕累托最优集合;2）设计自适应课程对齐策略，利用自动难度评估管理多奖励数据集的学习序列，这对于强化学习方法在有限数据环境中探索的最优收敛至关重要。在协同作用下，POCA在统一的奖励空间中找到帕累托最优集合，消除不一致信号，从而在简单到困难的优化环境中从不同奖励中找到最佳权衡解。实验结果显示，POCA显著提升了CLIP、HPS分数和句子准确率等所有指标。

Omni-o3: Deep Nested Omnimodal Deduction for Deliberative Audio-Visual Reasoning

Omni-o3：深层嵌套全模态推理用于深思熟虑视听推理

Authors: Zhicheng Zhang, Wentao Gu, Weicheng Wang, Yongjie Zhu, Wenyu Qin, Meng Wang, Pengfei Wan, Jufeng Yang
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.24191
Pdf link: https://arxiv.org/pdf/2604.24191
Abstract Omnimodal understanding entails a massive, highly redundant search space of cross-modal interactions, demanding focused and deliberative reasoning. Current reasoning paradigms rely on either sequential step-by-step generation or parallel sample-by-sample rollouts, leading to isolated reasoning trajectories. This inability to share promising intermediate paths severely limits exploration efficiency and causes compounding errors in complex audio-visual tasks. To break this bottleneck, we introduce Omni-o3, a novel framework driven by a deep nested deduction policy. By formulating reasoning as a dynamic recursive search, Omni-o3 inherently shares reasoning prefixes across branches, enabling the iterative execution of four atomic cognitive actions: expansion, selection, simulation, and backpropagation. To empower this framework, we propose a robust two-stage training paradigm: (1) cold-start supervised fine-tuning on 101K high-quality, long-chain trajectories distilled from 3.5M diverse omnimodal samples, enabling necessary recursive search patterns; and (2) nested group rollout-driven exploratory reinforcement learning on 18K complex multi-turn samples, explicitly guided by a novel multi-step reward model to stimulate deep nested reasoning. Extensive experiments demonstrate that Omni-o3 achieves competitive performance across 11 benchmarks, unlocking advanced capabilities in comprehensive audio-visual, visual-centric, and audio-centric reasoning tasks.
中文摘要 全模态理解涉及一个庞大且高度冗余的跨模态交互搜索空间，要求专注且深思熟虑的推理。当前的推理范式依赖于顺序的逐步生成或平行的样本逐样本展开，导致推理轨迹孤立。无法共享有前景的中间路径严重限制了探索效率，并在复杂的视听任务中引发叠加错误。为打破这一瓶颈，我们引入了Omni-o3，一种由深度嵌套推理策略驱动的新型框架。通过将推理表述为动态递归搜索，Omni-o3 本质上跨分支共享推理前缀，从而支持四种原子认知动作的迭代执行：展开、选择、仿真和反向传播。为赋能该框架，我们提出了一个稳健的两阶段训练范式：（1）对101K高质量长链轨迹进行冷启动监督微调，这些轨迹从350万个多样化的全模样本中提炼出来，从而实现必要的递归搜索模式;以及（2）在1.8万复杂多回合样本中进行嵌套组式扩展驱动的探索性强化学习，明确引导新颖的多步奖励模型以激发深度嵌套推理。大量实验表明，Omni-o3在11个基准测试中实现了竞争性能，解锁了综合视听、视觉中心和音频中心推理任务的高级能力。

Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis

奖励科学过程：代理数据分析的流程级奖励建模

Authors: Zhisong Qiu, Shuofei Qiao, Kewei Xu, Yuqi Zhu, Lun Du, Ningyu Zhang, Huajun Chen
Subjects: Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2604.24198
Pdf link: https://arxiv.org/pdf/2604.24198
Abstract Process Reward Models (PRMs) have achieved remarkable success in augmenting the reasoning capabilities of Large Language Models (LLMs) within static domains such as mathematics. However, their potential in dynamic data analysis tasks remains underexplored. In this work, we first present a empirical study revealing that general-domain PRMs struggle to supervise data analysis agents. Specifically, they fail to detect silent errors, logical flaws that yield incorrect results without triggering interpreter exceptions, and erroneously penalize exploratory actions, mistaking necessary trial-and-error exploration for grounding failures. To bridge this gap, we introduce DataPRM, a novel environment-aware generative process reward model that (1) can serve as an active verifier, autonomously interacting with the environment to probe intermediate execution states and uncover silent errors, and (2) employs a reflection-aware ternary reward strategy that distinguishes between correctable grounding errors and irrecoverable mistakes. We design a scalable pipeline to construct over 8K high-quality training instances for DataPRM via diversity-driven trajectory generation and knowledge-augmented step-level annotation. Experimental results demonstrate that DataPRM improves downstream policy LLMs by 7.21% on ScienceAgentBench and 11.28% on DABStep using Best-of-N inference. Notably, with only 4B parameters, DataPRM outperforms strong baselines, and exhibits robust generalizability across diverse Test-Time Scaling strategies. Furthermore, integrating DataPRM into Reinforcement Learning yields substantial gains over outcome-reward baselines, achieving 78.73% on DABench and 64.84% on TableBench, validating the effectiveness of process reward supervision. Code is available at this https URL.
中文摘要 过程奖励模型（PRM）在增强大型语言模型（LLMs）在静态领域（如数学）的推理能力方面取得了显著成功。然而，它们在动态数据分析任务中的潜力仍未被充分开发。在本研究中，我们首先呈现一项实证研究，揭示广域PRM难以监督数据分析代理。具体来说，它们未能检测无声错误、逻辑缺陷导致错误且不触发解释器异常，错误地惩罚探索性行为，将必要的试错探索误认为是接地失败。为弥合这一空白，我们引入了DataPRM，一种新型环境感知生成过程奖励模型，（1）可作为主动验证者，自主与环境交互以探测中间执行状态并发现无声错误，（2）采用反射感知的三元奖励策略，区分可纠正的接地错误和不可恢复的错误。我们设计了一个可扩展的流程，通过多样性驱动的轨迹生成和知识增强的步级注释，构建超过 8K 个高质量的 DataPRM 训练实例。实验结果表明，DataPRM在ScienceAgentBench上使用最佳N推断，使下游策略LLM提升7.21%，在DABStep上提升11.28%。值得注意的是，仅有4B参数时，DataPRM表现优于强基线，并在多种测试时缩放策略中展现出强健的泛化性。此外，将DataPRM整合进强化学习，显著提升了结果-奖励基线，DABench在DABench上达到了78.73%，在TableBench上达到64.84%，验证了过程奖励监督的有效性。代码可在此 https URL 访问。

BitRL: Reinforcement Learning with 1-bit Quantized Language Models for Resource-Constrained Edge Deployment

BitRL：基于1位量化语言模型的资源约束边缘部署强化学习

Authors: Md. Ashiq Ul Islam Sajid, Mohammad Sakib Mahmood, Md. Tareq Hasan, Md Abdur Rahim, Rafat Ara, Md. Arafat Hossain
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.24273
Pdf link: https://arxiv.org/pdf/2604.24273
Abstract The deployment of intelligent reinforcement learning (RL) agents on resource-constrained edge devices remains a fundamental challenge due to the substantial memory, computational, and energy requirements of modern deep learning systems. While large language models (LLMs) have emerged as powerful architectures for decision-making agents, their multi-billion parameter scale confines them to cloud-based deployment, raising concerns about latency, privacy, and connectivity dependence. We introduce BitRL, a framework for building RL agents using 1-bit quantized language models that enables practical on-device learning and inference under severe resource constraints. Leveraging the BitNet b1.58 architecture with ternary weights (-1, 0, +1) and an optimized inference stack, BitRL achieves 10-16x memory reduction and 3-5x energy efficiency improvements over full-precision baselines while maintaining 85-98 percent of task performance across benchmarks. We provide theoretical analysis of quantization as structured parameter perturbation, derive convergence bounds for quantized policy gradients under frozen-backbone architectures, and identify the exploration-stability trade-off in extreme quantization. Our framework systematically integrates 1-bit quantized language models with reinforcement learning for edge deployment and demonstrates effectiveness on commodity hardware.
中文摘要 由于现代深度学习系统对内存、计算和能源的需求巨大，智能强化学习（RL）代理在资源受限的边缘设备上部署仍是一个根本性挑战。虽然大型语言模型（LLMs）已成为决策代理的强大架构，但其数十亿参数规模限制了其仅限于云端部署，引发了对延迟、隐私和连接依赖的担忧。我们介绍了BitRL，这是一个利用1位量化语言模型构建强化学习代理的框架，能够在资源极限下实现实际的设备学习和推理。利用BitNet b1.58架构，采用三进制权重（-1、0、+1）和优化的推理堆栈，BitRL在全精度基线上实现了10-16倍的内存减少和3-5倍的能效提升，同时在基准测试中保持了85-98%的任务性能。我们对量子化作为结构化参数扰动进行了理论分析，推导了冻结主干架构下量子化政策梯度的收敛界限，并识别极端量子化中的探索与稳定性权衡。我们的框架系统地将1位量化语言模型与强化学习整合进行边缘部署，并在普通硬件上展现了有效性。

RAS: a Reliability Oriented Metric for Automatic Speech Recognition

RAS：一种以可靠性为导向的自动语音识别指标

Authors: Wenbin Huang, Yuhang Qiu, Bohan Li, Yiwei Guo, Jing Peng, Hankun Wang, Xie Chen, Kai Yu
Subjects: Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.24278
Pdf link: https://arxiv.org/pdf/2604.24278
Abstract Automatic speech recognition systems often produce confident yet incorrect transcriptions under noisy or ambiguous conditions, which can be misleading for both users and downstream applications. Standard evaluation based on Word Error Rate focuses solely on accuracy and fails to capture transcription reliability. We introduce an abstention-aware transcription framework that enables ASR models to explicitly abstain from uncertain segments. To evaluate reliability under abstention, we propose RAS, a reliability-oriented metric that balances transcription informativeness and error aversion, with its trade-off parameter calibrated by human preference. We then train an abstention-aware ASR model through supervised bootstrapping followed by reinforcement learning. Our experiments demonstrate substantial improvements in transcription reliability while maintaining competitive accuracy.
中文摘要 自动语音识别系统在噪声或模糊条件下常常产生自信但不准确的转录，这对用户和后续应用来说都可能产生误导。基于词误率的标准评估仅关注准确性，未能捕捉转录可靠性。我们引入了一种自觉戒断转录框架，使ASR模型能够明确避免不确定的片段。为了评估弃权下的信度，我们提出了RAS，这是一种以信度为导向的指标，平衡转录信息量和错误回避，其权衡参数由人类偏好校准。随后，我们通过监督引导和强化学习训练一个意识戒断的ASR模型。我们的实验显示，转录可靠性在保持竞争准确性的同时取得了显著提升。

Model-Free Inference of Investor Preferences: A Relative Entropy IRL Approach

投资者偏好的无模型推断：一种相对熵的现实生活中方法

Authors: Chen Xu
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.24280
Pdf link: https://arxiv.org/pdf/2604.24280
Abstract We present a framework using Relative Entropy Inverse Reinforcement Learning (RE-IRL) to recover investor reward functions from observed investment actions and market conditions. Unlike traditional IRL algorithms, RE-IRL is employed to account for environments where transition probabilities are unknown or inaccessible. To address the challenge of data sparsity, we utilize a $K$-nearest neighbor approach to estimate the observed behavior policy. Furthermore, we propose a statistical testing framework to evaluate the validity and robustness of the estimated results.
中文摘要 我们提出了一个利用相对熵逆强化学习（RE-IRL）的框架，从观察到的投资行为和市场状况中恢复投资者的奖励函数。与传统的IRL算法不同，RE-IRL用于考虑转移概率未知或不可访问的环境。为应对数据稀疏性挑战，我们采用$K美元最近邻方法估算观察到的行为策略。此外，我们提出了一个统计检验框架，以评估估计结果的有效性和稳健性。

DPEPO: Diverse Parallel Exploration Policy Optimization for LLM-based Agents

DPEPO：基于LLM的代理的多样化并行探索策略优化

Authors: Junshuo Zhang, Chengrui Huang, Feng Guo, Zihan Li, Ke Shi, Menghua Jiang, Jiguo Yu, Shuo Shang, Shen Gao
Subjects: Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.24320
Pdf link: https://arxiv.org/pdf/2604.24320
Abstract Large language model (LLM) agents that follow the sequential "reason-then-act" paradigm have achieved superior performance in many complex this http URL, these methods suffer from limited exploration and incomplete environmental understanding, as they interact with only a single environment per step. In this paper, we first introduce a novel paradigm that enables an agent to interact with multiple environments simultaneously and share cross-trajectory experiences. Building upon this paradigm, we further propose DPEPO, a reinforcement learning (RL) algorithm that encourages the agent to perform diverse parallel exploration. There are two stages in DPEPO: initial supervised fine-tuning (SFT) imparts basic parallel reasoning and action generation, followed by reinforcement learning stage with a hierarchical reward scheme. We design a parallel trajectory-level success reward and two step-level rewards: Diverse Action Reward and Diverse State Transition Reward, which actively penalize behavioral redundancy and promote broad exploration. Extensive experiments on ALFWorld and ScienceWorld show that DPEPO achieves state-of-the-art (SOTA) success rates, while maintaining comparable efficiency to strong sequential baselines. (Code is available at this https URL)
中文摘要 遵循顺序“推理-后行动”范式的大型语言模型（LLM）代理在许多复杂的HTTP URL中取得了更优的性能，但这些方法由于每步仅与一个环境交互，探索有限且环境理解不完整。本文首先介绍了一种新颖范式，使智能体能够同时与多个环境交互并共享跨轨迹体验。基于这一范式，我们进一步提出了DPEPO，一种强化学习（RL）算法，鼓励智能体进行多样化的并行探索。DPEPO有两个阶段：初始监督微调（SFT）传授基本的并行推理和动作生成，随后是带有层级奖励方案的强化学习阶段。我们设计了平行的轨迹级成功奖励和两个阶级奖励：多样行动奖励和多样状态过渡奖励，积极惩罚行为重复并促进广泛探索。ALFWorld和ScienceWorld上的大量实验表明，DPEPO实现了最先进的（SOTA）成功率，同时保持与强序列基线相当的效率。（代码可在此 https URL 获取）

Perfecting Aircraft Maneuvers with Reinforcement Learning

通过强化学习完善飞机机动

Authors: Atahan Cilan, Mahir Demir, Özgün Can Yürütken, Seyyid Osman Sevgili, Ümit Can Bekar
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.24338
Pdf link: https://arxiv.org/pdf/2604.24338
Abstract This paper evaluates an advanced jet trainer's utilization of artificial intelligence (AI)-based aircraft aerobatic maneuvers with the intention of developing an AI-assisted pilot training module for specific aircraft maneuvers. A multitude of aircraft maneuvers have been simulated using reinforcement learning (RL) agents, which will serve as a training tool for future pilots.
中文摘要 本文评估了一台先进喷气式教练机利用基于人工智能（AI）的飞机特技飞行动作，旨在开发针对特定飞机机动的人工智能辅助飞行员训练模块。多种飞机机动均通过强化学习（RL）代理进行模拟，这些技术将作为未来飞行员的训练工具。

See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection

详见《深入思考：通过低层次视觉线索和反思提升VLM的推理能力》

Authors: Zhiheng Wu, Tong Wang, Shuning Wang, Naiming Liu, Yumeng Zhang
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.24339
Pdf link: https://arxiv.org/pdf/2604.24339
Abstract Recent advances in Vision-Language Models (VLMs) have benefited from Reinforcement Learning (RL) for enhanced reasoning. However, existing methods still face critical limitations, including the lack of low-level visual information and effective visual feedback. To address these problems, this paper proposes a unified multimodal interleaved reasoning framework \textbf{ForeSight}, which enables VLMs to \textbf{See Further} with low-level visual cues and \textbf{Think Deeper} with effective visual feedback. First, it introduces a set of low-level visual tools to integrate essential visual information into the reasoning chain, mitigating the neglect of fine-grained visual features. Second, a mask-based visual feedback mechanism is elaborated to incorporate visual reflection into the thinking process, enabling the model to dynamically re-examine and update its answers. Driven by RL, ForeSight learns to autonomously decide on tool invocation and answer verification, with the final answer accuracy as the reward signal. To evaluate the performance of the proposed framework, we construct a new dataset, Character and Grounding SalBench (CG-SalBench), based on the SalBench dataset. Experimental results demonstrate that the ForeSight-7B model significantly outperforms other models with the same parameter scale, and even surpasses the current SOTA closed-source models on certain metrics.
中文摘要 视觉语言模型（VLMs）的最新进展受益于强化学习（RL）以增强推理能力。然而，现有方法仍面临关键局限，包括缺乏低层次视觉信息和有效的视觉反馈。为解决这些问题，本文提出了一个统一的多模态交错推理框架\textbf{ForeSight}，使VLM能够通过低层次的视觉线索进行\textbf{详见后续}，并以有效的视觉反馈实现\textbf{Think Deeper}。首先，它引入了一套低层次的视觉工具，将关键视觉信息整合进推理链，减少了对细粒度视觉特征的忽视。其次，构建了基于掩码的视觉反馈机制，将视觉反射融入思考过程，使模型能够动态地重新审视和更新其答案。在强化学习的驱动下，ForeSight学会自主决定工具调用和答案验证，最终的准确性作为奖励信号。为了评估所提框架的性能，我们基于SalBench数据集构建了一个新数据集——Character and Grounding SalBench（CG-SalBench）。实验结果显示，ForeSight-7B模型在相同参数尺度上显著优于其他模型，甚至在某些指标上超过了当前的SOTA闭源模型。

An Aircraft Upset Recovery System with Reinforcement Learning

带有强化学习的飞机故障救援系统

Authors: Mahir Demir, Atahan Cilan, Seyyid Osman Sevgili, Özgün Can Yürütken, Ümit Can Bekar
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.24355
Pdf link: https://arxiv.org/pdf/2604.24355
Abstract This article explores the progress made in the creation of a pilot activated recovery system (PARS) for advanced jet trainers that utilizes artificial intelligence (AI) in an effort to enhance operational efficiency. The PARS model employs an advanced reinforcement learning (RL) architecture, incorporating a cutting-edge soft-actor critic (SAC) model and hyper-parameter optimization methods. Negative-g punishments and other handcrafted features remarked upon by control engineers and domain experts regarding PARS are also taken into account by the system. When evaluated by them, the AI model's behavior is deemed more desirable than that of conventional control methods.
中文摘要 本文探讨了利用人工智能（AI）为提升作战效率而开发的先进喷气式教练机飞行员激活回收系统（PARS）的进展。PARS模型采用先进的强化学习（RL）架构，结合了最前沿的软演员批评（SAC）模型和超参数优化方法。系统还考虑了控制工程师和领域专家提到的负重力惩罚及其他手工设计的功能。当他们评估时，AI模型的行为被认为比传统控制方法更理想。

Comparative Evaluation of Modern Deep Learning Methodologies for Portfolio Optimization

现代深度学习方法论在投资组合优化中的比较评估

Authors: Samuel Ozechi, Banjo Francis, Wisdom Yakanu, Joe Wayne Byers
Subjects: Subjects: Computational Engineering, Finance, and Science (cs.CE)
Arxiv link: https://arxiv.org/abs/2604.24486
Pdf link: https://arxiv.org/pdf/2604.24486
Abstract This study proposes a portfolio optimization framework that integrates advanced deep learning architectures with traditional financial models to enhance risk-adjusted performance. Using historical data from 2015-2023 across equities, ETFs, and bonds, the research evaluates the predictive power of Graph Neural Networks (GNNs), Deep Reinforcement Learning (DRL), Transformers, and Autoencoders. The models jointly address covariance estimation, return forecasting, dynamic asset allocation, and dimensionality reduction. Hybrid approaches such as Transformer+GNN and Autoencoder+DRL are also explored to capture both relational and temporal market structures. Performance is assessed through backtesting using metrics including volatility, cumulative return, maximum drawdown, annualized return, and Sharpe ratio across seven strategies, including Equal-Weighted, 60/40 allocation, and Mean-Variance Optimization (MVO). Results show that hybrid models provide superior stability and risk control, with Transformer+GNN achieving the lowest volatility and drawdown. MVO, when paired with well-calibrated inputs, delivers the highest cumulative return and Sharpe ratio, highlighting the continued relevance of traditional methods. Standalone DRL underperforms due to limited structural awareness, while Autoencoders exhibit behavior similar to Equal-Weight strategies, emphasizing the need for dynamic policy learning. These findings align with existing literature on relational modeling and feature compression in finance. Overall, the study demonstrates that combining deep learning with financial theory yields robust and adaptive portfolio strategies and suggests exploring latent representations within traditional optimization frameworks to improve scalability and performance.
中文摘要 本研究提出了一种组合优化框架，将先进的深度学习架构与传统金融模型整合，以提升风险调整后的表现。利用2015年至2023年间涵盖股票、ETF和债券的历史数据，研究评估了图神经网络（GNN）、深度强化学习（DRL）、变换器和自编码器的预测能力。这些模型共同涉及协方差估计、回报预测、动态资产配置和降维等问题。还探索了Transformer+GNN和Autoencoder+DRL等混合方法，以捕捉关系和时间的市场结构。绩效通过回测评估，涵盖七种策略，包括波动率、累计回报、最大回撤、年化回报和夏普比率，包括等权配置、60/40配置和均方差优化（MVO）。结果显示，混合模型提供了更优越的稳定性和风险控制，其中Transformer+GNN实现了最低的波动性和回撤。当MVO配合良好校准的输入时，能带来最高的累计回报和夏普比率，凸显了传统方法的持续适用性。独立DRL因结构意识有限表现不佳，而自编码器表现出类似等权策略的行为，强调动态策略学习的必要性。这些发现与现有关于关系建模和特征压缩的文献研究一致。总体而言，研究表明将深度学习与金融理论结合，能够获得稳健且适应性的投资组合策略，并建议在传统优化框架中探索潜在表征，以提升可扩展性和性能。

TARMM: Scaling Delay-Critical Edge AI Offloading in 5G O-RAN via Temporal Graph Mobility Management

TARMM：通过时序图移动管理在5G O-RAN中扩展延迟关键边缘AI卸载

Authors: Peihao Yan, Yun Chen, Jie Lu, Qijun Wang, Huacheng Zeng
Subjects: Subjects: Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2604.24501
Pdf link: https://arxiv.org/pdf/2604.24501
Abstract Emerging delay-critical edge AI applications, such as VR perception and real-time video analytics, impose stringent latency and reliability requirements on 5G networks. However, existing mobility management mechanisms are largely reactive and fail to adapt to dynamic network conditions, resulting in suboptimal handover decisions and degraded performance. In this paper, we present TARMM, a 5G Open Radio Access Network (O-RAN) system that optimizes user mobility management for delay-critical edge AI offloading. The core of TARMM is a temporal graph model that captures the spatiotemporal dynamics of the RAN across users and cells, enabling near real-time handover decisions. Building on this representation, we design a multi-agent reinforcement learning (MARL) framework with rule-based action masking and proactive resource preparation to ensure safe, stable, and efficient handovers. We implement TARMM on a multi-cell indoor 5G O-RAN testbed and evaluate it using diverse VR workloads. Extensive experiments show that TARMM reduces tail latency by up to 44% and packet loss by up to 56% compared to state-of-the-art approaches.
中文摘要 新兴的延迟关键边缘人工智能应用，如VR感知和实时视频分析，对5G网络施加了严格的延迟和可靠性要求。然而，现有的移动性管理机制大多是被动反应的，无法适应动态网络条件，导致切换决策不理想，性能下降。本文介绍了TARMM，一种5G开放无线接入网（O-RAN）系统，优化了延迟关键边缘AI卸载的用户移动管理。TARMM的核心是一个时间图模型，能够捕捉用户和小区间RAN的时空动态，实现近实时的切换决策。基于这一表示，我们设计了一个多智能体强化学习（MARL）框架，采用基于规则的动作掩蔽和主动资源准备，确保安全、稳定和高效的切换。我们在多单元室内5G O-RAN测试平台上实现TARM，并利用多样化的VR工作负载进行评估。大量实验表明，与最先进的方法相比，TARMM可将尾延迟降低多达44%，丢包率降低高达56%。

DECOFFEE: Decentralized Reinforcement Learning for Time-critical Workload Offloading and Energy Efficiency across the Computing Continuum

DECOFFEE：去中心化强化学习，实现计算连续体中对时间关键工作负载的卸载和能效

Authors: Anastasios Giannopoulos, Sotirios Spantideas, Panagiotis Trakadas
Subjects: Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2604.24507
Pdf link: https://arxiv.org/pdf/2604.24507
Abstract The rapid proliferation of latency-sensitive and battery-constrained Internet-of-Things (IoT) applications has intensified the need for intelligent workload placement mechanisms across the Edge-Cloud computing continuum. In such environments, far-edge nodes must dynamically decide whether to execute workloads locally or offload them to neighboring nodes or the cloud, while accounting for execution delay, energy consumption, and strict timeout constraints. However, workload placement in large-scale distributed infrastructures is a highly dynamic and non-convex optimization problem due to stochastic arrivals, heterogeneous computing capacities, and time-varying network conditions. This paper proposes DECOFFEE, a decentralized reinforcement learning framework for time-critical workload offloading and energy-efficient operation across the computing continuum. The proposed multi-agent learning scheme jointly optimizes system delay, energy consumption, and workload drop rate through adaptive placement decisions. Each edge agent operates as an autonomous learning entity that derives an optimal policy from local system observations and predicted network conditions. The workload placement process is formulated as parallel Markov Decision Processes and solved using a Double Dueling Deep Q-Network (DQN) architecture enhanced with Long Short-Term Memory (LSTM) forecasting to anticipate future load conditions. Extensive simulations demonstrate that DECOFFEE and its variants consistently outperform conventional rule-based and heuristic placement strategies, achieving significant reductions in delay, energy consumption, and workload drop rate under varying traffic and network conditions.
中文摘要 对延迟敏感且电池受限的物联网（IoT）应用的快速激增，加剧了对边缘-云计算连续体智能工作负载布置机制的需求。在此类环境中，远端节点必须动态决定是本地执行工作负载，还是将其卸载给邻近节点或云端，同时考虑执行延迟、能耗和严格的超时限制。然而，由于随机到达、异构计算能力和时间变化的网络条件，大规模分布式基础设施中的工作负载布置是一个高度动态且非凸的优化问题。本文提出了DECOFFEE，一种去中心化强化学习框架，用于在计算连续体中实现时间关键工作负载卸载和节能运行。所提多智能体学习方案通过自适应布置决策，共同优化系统延迟、能耗和工作负载下降率。每个边缘代理作为自主学习实体运行，从本地系统观察和预测的网络状况中推导出最优策略。工作负载布置过程被表述为并行马尔可夫决策过程，并通过双重对决深度Q网络（DQN）架构求解，并辅以长短期记忆（LSTM）预测，以预测未来负载状况。大量模拟表明，DECOFFEE及其变体在不同流量和网络条件下，持续优于传统的基于规则和启发式布置策略，显著降低了延迟、能耗和工作负载下降率。

A Reward-Free Viewpoint on Multi-Objective Reinforcement Learning

多目标强化学习的无奖励视角

Authors: Ying-Tu Chen, Wei Hung, Bing-Shu Wu, Zhang-Wei Hong, Ping-Chun Hsieh
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.24532
Pdf link: https://arxiv.org/pdf/2604.24532
Abstract Many sequential decision-making tasks involve optimizing multiple conflicting objectives, requiring policies that adapt to different user preferences. In multi-objective reinforcement learning (MORL), one widely studied approach} addresses this by training a single policy network conditioned on preference-weighted rewards. In this paper, we explore a novel algorithmic perspective: leveraging reward-free reinforcement learning (RFRL) for MORL. While RFRL has historically been studied independently of MORL, it learns optimal policies for any possible reward function, making it a natural fit for MORL's challenge of handling unknown user preferences. We propose using the RFRL's training objective as an auxiliary task to enhance MORL, enabling more effective knowledge sharing beyond the multi-objective reward function given at training time. To this end, we adapt a state-of-the-art RFRL algorithm to the MORL setting and introduce a preference-guided exploration strategy that focuses learning on relevant parts of the environment. Through extensive experiments and ablation studies, we demonstrate that our approach significantly outperforms the state-of-the-art MORL methods across diverse MO-Gymnasium tasks, achieving superior performance and data efficiency. This work provides the first systematic adaptation of RFRL to MORL, demonstrating its potential as a scalable and empirically effective solution to multi-objective policy learning.
中文摘要 许多顺序决策任务涉及优化多个相互冲突的目标，需要调整以适应不同用户偏好的策略。在多目标强化学习（MORL）中，一种广泛研究的方法通过训练一个基于偏好加权奖励的单一策略网络来解决这个问题。本文探讨了一个新的算法视角：利用无奖励强化学习（RFRL）应用于MORL。虽然RFRL历来独立于MORL进行研究，但它能为任何可能的奖励函数学习最优策略，因此非常适合MORL处理未知用户偏好的挑战。我们提议将RFRL的培训目标作为辅助任务，以增强MORL，实现超越培训时多目标奖励函数的更有效知识共享。为此，我们将最先进的RFRL算法应用于MORL环境，并引入了偏好引导的探索策略，专注于环境相关部分的学习。通过大量实验和消融研究，我们证明了我们的方法在多种MO-Gymnasium任务中显著优于最先进的MORL方法，实现了卓越的性能和数据效率。这项工作首次系统地将RFRL适配到MORL，展示了其作为多目标政策学习可扩展且实证有效的解决方案的潜力。

Hierarchical Behaviour Spaces

层级行为空间

Authors: Michael Tryfan Matthews, Anssi Kanervisto, Jakob Foerster, Pierluca D'Oro, Scott Fujimoto, Mikael Henaff
Subjects: Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.24558
Pdf link: https://arxiv.org/pdf/2604.24558
Abstract Recent work in hierarchical reinforcement learning has shown success in scaling to billions of timesteps when learning over a set of predefined option reward functions. We show that, instead of using a single reward function per option, the reward functions can be effectively used to induce a space of behaviours, by letting the controller specify linear combinations over reward functions, allowing a more expressive set of policies to be represented. We call this method Hierarchical Behaviour Spaces (HBS). We evaluate HBS on the NetHack Learning Environment, demonstrating strong performance. We conduct a series of experiments and determine that, perhaps going against conventional wisdom, the benefits of hierarchy in our method come from increased exploration rather than long term reasoning.
中文摘要 最近在分层强化学习方面的研究显示，当学习一组预定义的选项奖励函数时，能够成功扩展到数十亿个时间步。我们证明，与其使用单个奖励函数，不如通过让控制器指定线性组合来诱导行为空间，从而使策略表达更具表现力。我们称这种方法为层级行为空间（Hierarchical Behaviour Spaces，简称HBS）。我们在NetHack学习环境中评估了哈佛商学院，展现出强劲的表现。我们进行了一系列实验，发现，或许与传统观点相反，我们方法中层级结构的好处来自于更多的探索，而非长期推理。

Improving Vision-language Models with Perception-centric Process Reward Models

改进以感知为中心的过程奖励模型的视觉语言模型

Authors: Yingqian Min, Kun Zhou, Yifan Li, Yuhuan Wu, Han Peng, Yifan Du, Wayne Xin Zhao, Min Yang, Ji-Rong Wen
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.24583
Pdf link: https://arxiv.org/pdf/2604.24583
Abstract Recent advancements in reinforcement learning with verifiable rewards (RLVR) have significantly improved the complex reasoning ability of vision-language models (VLMs). However, its outcome-level supervision is too coarse to diagnose and correct errors within the reasoning chain. To this end, we propose Perceval, a process reward model (PRM) that enables token-level error grounding, which can extract image-related claims from the response and compare them one by one with the visual evidence in the image, ultimately returning claims that contain perceptual errors. Perceval is trained with perception-intensive supervised training data. We then integrate Perceval into the RL training process to train the policy models. Specifically, compared to traditional GRPO, which applies sequence-level advantages, we apply token-level advantages by targeting penalties on hallucinated spans identified by Perceval, thus enabling fine-grained supervision signals. In addition to augmenting the training process, Perceval can also assist VLMs during the inference stage. Using Perceval, we can truncate the erroneous portions of the model's response, and then either have the model regenerate the response directly or induce the model to reflect on its previous output. This process can be repeated multiple times to achieve test-time scaling. Experiments show significant improvements on benchmarks from various domains across multiple reasoning VLMs trained with RL, highlighting the promise of perception-centric supervision as a general-purpose strategy. For test-time scaling, it also demonstrates consistent performance gains over other strategies, such as major voting. Our code and data will be publicly released at this https URL.
中文摘要 近年来，带有可验证奖励的强化学习（RLVR）取得了显著提升了视觉语言模型（VLMs）的复杂推理能力。然而，其结果层级的监督过于粗糙，难以诊断和纠正推理链中的错误。为此，我们提出了Perceval，一种过程奖励模型（PRM），它支持代币级错误基础，能够从响应中提取与图像相关的主张，并逐一与图像中的视觉证据进行比较，最终返回包含感知错误的主张。Perceval 是通过感知密集型监督训练数据进行训练的。然后我们将Perceval整合进强化学习的训练流程，以训练策略模型。具体来说，与传统的GRPO应用序列级优势相比，我们通过针对Perceval识别的幻觉跨度施加惩罚，从而实现细粒度的监督信号。除了增强训练过程外，Perceval还能在推断阶段协助VLM。利用Perceval，我们可以截断模型响应中错误的部分，然后让模型直接重现响应，或诱导模型对其之前的输出进行反思。该过程可以重复多次以实现测试时间的缩放。实验显示，在多个用强化学习训练的推理VLM中，基准测试在多个领域上取得了显著提升，凸显了以感知为中心的监督作为通用策略的潜力。在测试时间缩放方面，它也展示了相较于其他策略（如主要投票）持续的性能提升。我们的代码和数据将在此 https URL 公开发布。

Agent-Centric Visual Reinforcement Learning under Dynamic Perturbations

动态扰动下的以代理为中心视觉强化学习

Authors: Zhengru Fang, Yu Guo, Fei Liu, Yuang Zhang, Yihang Tao, Senkang Hu, Wenbo Ding, Yuguang Fang
Subjects: Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.24661
Pdf link: https://arxiv.org/pdf/2604.24661
Abstract Visual reinforcement learning aims to empower an agent to learn policies from visual observations, yet it remains vulnerable to dynamic visual perturbations, such as unpredictable shifts in corruption types. To systematically study this, we introduce the Visual Degraded Control Suite (VDCS), a benchmark extending DeepMind Control Suite with Markov-switching degradations to simulate non-stationary real-world perturbations. Experiments on VDCS reveal severe performance degradation in existing methods. We theoretically prove via information-theoretic analysis that this failure stems from reconstruction-based objectives inevitably entangling perturbation artifacts into latent representations. To mitigate this negative impact, we propose Agent-Centric Observations with Mixture-of-Experts (ACO-MoE) to robustify visual RL against perturbations. The proposed framework leverages unique agent-centric restoration experts, achieving restoration from corruptions and task-relevant foreground extraction, thereby decoupling perception from perturbation before being processed by the RL agent. Extensive experiments on VDCS show our ACO-MoE outperforms strong baselines, recovering 95.3% of clean performance under challenging Markov-switching corruptions. Moreover, it achieves SOTA results on DMControl Generalization with random-color and video-background perturbations, demonstrating a high level of robustness.
中文摘要 视觉强化学习旨在赋能智能体从视觉观察中学习策略，但其仍易受动态视觉扰动影响，如腐化类型的不可预测变化。为系统性研究，我们引入了视觉降级控制套件（VDCS），这是一个基准测试，扩展了DeepMind控制套件，通过马尔可夫切换降级模拟非平稳的现实世界扰动。VDCS的实验显示现有方法性能严重下降。我们通过信息论分析理论证明，这种失败源于基于重建的目标不可避免地将微扰伪影纠缠进潜在表征。为减轻这种负面影响，我们提出以专家混合观察（ACO-MoE）为中心观察，以强化视觉强化学习对干扰的抵抗。该框架利用独特的以智能体为中心的恢复专家，实现了从损坏中恢复和任务相关前台提取，从而在强化学习代理处理前将感知与扰动脱钩。VDCS的广泛实验显示，我们的ACO-MoE表现优于强基线，在挑战性的马尔可夫切换腐败条件下恢复了95.3%的清洁性能。此外，它在DMControl推广中实现了随机颜色和视频背景扰动的SOTA结果，展现了高度的鲁棒性。

SpecRLBench: A Benchmark for Generalization in Specification-Guided Reinforcement Learning

SpecRLBench：规范引导强化学习泛化的基准

Authors: Zijian Guo, İlker Işık, H. M. Sabbir Ahmad, Wenchao Li
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.24729
Pdf link: https://arxiv.org/pdf/2604.24729
Abstract Specification-guided reinforcement learning (RL) provides a principled framework for encoding complex, temporally extended tasks using formal specifications such as linear temporal logic (LTL). While recent methods have shown promising results, their ability to generalize across unseen specifications and diverse environments remains insufficiently understood. In this work, we introduce SpecRLBench, a benchmark designed to evaluate the generalization capabilities of LTL-based specification-guided RL methods. The benchmark spans multiple difficulty levels across navigation and manipulation domains, incorporating both static and dynamic environments, diverse robot dynamics, and varied observation modalities. Through extensive empirical evaluation, we characterize the strengths and limitations of existing approaches and reveal the challenges that emerge as specification and environment complexity increase. SpecRLBench provides a structured platform for systematic comparison and supports the development of more generalizable specification-guided RL methods. Code is available at this https URL.
中文摘要 规范引导强化学习（RL）为使用线性时间逻辑（LTL）等形式规范编码复杂且时间扩展的任务提供了一个有原则的框架。尽管近期方法显示出有希望的结果，但它们在未见规格和多样环境中的泛化能力仍不够了解。在本研究中，我们介绍了SpecRLBench，这是一个旨在评估基于LTL规范引导RL方法泛化能力的基准测试。该基准测试涵盖导航和操作领域的多个难度等级，涵盖静态和动态环境、多样的机器人动力学以及多样的观察模式。通过广泛的实证评估，我们总结了现有方法的优势与局限性，并揭示了随着规范和环境复杂性增加而出现的挑战。SpecRLBench为系统比较提供了结构化平台，并支持更通用的规范引导RL方法的发展。代码可在此 https URL 访问。

World-R1: Reinforcing 3D Constraints for Text-to-Video Generation

World-R1：加强文本转视频生成的三维约束

Authors: Weijie Wang, Xiaoxuan He, Youping Gu, Yifan Yang, Zeyu Zhang, Yefei He, Yanbo Ding, Xirui Hu, Donny Y. Chen, Zhiyuan He, Yuqing Yang, Bohan Zhuang
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.24764
Pdf link: https://arxiv.org/pdf/2604.24764
Abstract Recent video foundation models demonstrate impressive visual synthesis but frequently suffer from geometric inconsistencies. While existing methods attempt to inject 3D priors via architectural modifications, they often incur high computational costs and limit scalability. We propose World-R1, a framework that aligns video generation with 3D constraints through reinforcement learning. To facilitate this alignment, we introduce a specialized pure text dataset tailored for world simulation. Utilizing Flow-GRPO, we optimize the model using feedback from pre-trained 3D foundation models and vision-language models to enforce structural coherence without altering the underlying architecture. We further employ a periodic decoupled training strategy to balance rigid geometric consistency with dynamic scene fluidity. Extensive evaluations reveal that our approach significantly enhances 3D consistency while preserving the original visual quality of the foundation model, effectively bridging the gap between video generation and scalable world simulation.
中文摘要 近期的视频基础模型展现了令人印象深刻的视觉综合，但常常存在几何上的不一致。虽然现有方法试图通过架构修改注入三维先验，但通常会产生高计算成本并限制可扩展性。我们提出了World-R1框架，通过强化学习将视频生成与三维约束对齐。为促进这种对齐，我们引入了专为世界模拟设计的专用纯文本数据集。利用Flow-GRPO，我们利用预训练的3D基础模型和视觉语言模型的反馈优化模型，以在不改变底层架构的情况下强制结构一致性。我们还采用周期性解耦训练策略，以平衡刚性几何一致性与动态场景流动性。广泛评估显示，我们的方法显著提升了三维一致性，同时保留了基础模型的原始视觉质量，有效弥合了视频生成与可扩展世界模拟之间的差距。

Keyword: diffusion policy

Tube Diffusion Policy: Reactive Visual-Tactile Policy Learning for Contact-rich Manipulation

管道扩散政策：针对接触丰富操作的反应性视觉-触觉策略学习

Authors: Teng Xue, Alberto Rigo, Bingjian Huang, Jiayi Shen, Zhengtong Xu, Nick Colonnese, Amirhossein H. Memar
Subjects: Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.23609
Pdf link: https://arxiv.org/pdf/2604.23609
Abstract Contact-rich manipulation is central to many everyday human activities, requiring continuous adaptation to contact uncertainty and external disturbances through multi-modal perception, particularly vision and tactile feedback. While imitation learning has shown strong potential for learning complex manipulation behaviors, most existing approaches rely on action chunking, which fundamentally limits their ability to react to unforeseen observations during execution. This limitation becomes especially critical in contact-rich scenarios, where physical uncertainty and high-frequency tactile feedback demand rapid, reactive control. To address this challenge, we propose Tube Diffusion Policy (TDP), a novel reactive visual-tactile policy learning framework that bridges diffusion-based imitation learning with tube-based feedback control. By leveraging the expressive power of generative models, TDP learns an observation-conditioned feedback flow around nominal action chunks, forming an action tube that enables fast and adaptive reactions during execution. We evaluate TDP on the widely used Push-T benchmark and three additional challenging visual-tactile dexterous manipulation tasks. Across all benchmarks, TDP consistently outperforms state-of-the-art imitation learning baselines. Two real-world experiments further validate its robust reactivity under contact uncertainty and external disturbances. Moreover, the step-wise correction mechanism enabled by action tube significantly reduces the required denoising steps, making TDP well suited for real-time, high-frequency feedback control in contact-rich manipulation.
中文摘要 丰富的接触操作是许多日常人类活动的核心，需要通过多模态感知，尤其是视觉和触觉反馈，持续适应接触不确定性和外部干扰。虽然模仿学习在学习复杂操作行为方面展现出强大潜力，但大多数现有方法依赖于动作分块，这从根本上限制了其在执行过程中对突发观察的反应能力。这一限制在接触密集的场景中尤为关键，因为物理不确定性和高频触觉反馈需要快速且反应性的控制。为应对这一挑战，我们提出了管扩散政策（TDP），这是一种新型反应式视觉-触觉策略学习框架，连接了基于扩散的模仿学习与基于管的反馈控制。通过利用生成模型的表达力，TDP学习围绕标称动作块的观察条件反馈流，形成一个动作管，使执行过程中能够快速且自适应地进行反应。我们在广泛使用的Push-T基准测试及另外三个具有挑战性的视觉-触觉灵巧操作任务中评估了TDP。在所有基准测试中，TDP始终优于最先进的模仿学习基线。两项真实实验进一步验证了其在接触不确定性和外部扰动下的强韧反应性。此外，动作管实现的分级校正机制显著减少了所需的去噪步骤，使TDP非常适合在接触丰富操作中进行实时高频反馈控制。