Arxiv Papers of Today

Generated: 2026-04-29 18:08:11 (UTC+8); arXiv release time: 2026-04-29 20:00 EDT (2026-04-30 08:00 UTC+8)

36 relevant papers today

Keyword: reinforcement learning

Nautile-370M: Spectral Memory Meets Attention in a Small Reasoning Model

Authors: Maixent Chenebaux
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.24809
Pdf link: https://arxiv.org/pdf/2604.24809
Abstract We present Nautile-370M, a 371-million-parameter small language model designed for efficient reasoning under strict parameter and inference budgets. Nautile-370M uses a hybrid backbone in which two SeqCond Attention (SCA) layers, a linear-time spectral sequence operator inspired by SeqCondenser, alternate with one transformer layer. This design aims to retain the long-context efficiency and state-tracking benefits of structured sequential models while preserving the expressive token-to-token routing of attention. The model was trained on a single Cloud TPU v4-64 pod slice provided through the Google TPU Research Cloud (TRC) program; the subsequent reinforcement learning stage was carried out on a single NVIDIA DGX Spark. We prove that the SCA readout mechanism can exactly retrieve any individual token from the prefix summary and can reproduce any output of softmax attention as a special case, establishing that SCA is at least as expressive as full self-attention in the continuous limit. We also describe the training data pipeline and outline a reinforcement learning stage specialized for reasoning, verification, and response quality.

asRoBallet: Closing the Sim2Real Gap via Friction-Aware Reinforcement Learning for Underactuated Spherical Dynamics

Authors: Fang Wan, Guangyi Huang, Tianyu Wu, Zishang Zhang, Bangchao Huang, Haoran Sun, Mingdong Chen, Chaoyang Song
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.24916
Pdf link: https://arxiv.org/pdf/2604.24916
Abstract We introduce asRoBallet, to the best of our knowledge the first successful deployment of reinforcement learning (RL) on humanoid ballbot hardware. Historically, ballbots have served as a canonical benchmark for underactuated and nonholonomic control, which are characterized by a reality gap in complex friction models for wheel-sphere-ground interactions. While current literature demonstrates successful handling of 3D balancing with LQR and MPC, transitioning to actual hardware for a humanoid ballbot using RL is currently hindered by critical gaps in contact modeling, actuator latency and jitter, and safe hardware exploration. This study proposes a high-fidelity MuJoCo simulation that explicitly models the discrete roller mechanics of ETH-type omni-wheels, thereby capturing parasitic vibrations and contact discontinuities that were previously ignored. We also developed a Friction-Aware Reinforcement Learning framework that achieves zero-shot Sim2Real transfer by mastering the coupled rolling, lateral, and torsional friction channels at the wheel-sphere and sphere-ground interfaces. We designed asRoBallet through subtractive reconfiguration, repurposing key components from an overconstrained quadruped and integrating them into a newly designed structural frame to achieve a robust research platform at low cost. We also developed a generalized iOS ecosystem that transforms consumer electronics into a low-latency interface, enabling a single operator to orchestrate expressive humanoid maneuvers via intuitive natural motion.

Agentic AI for Remote Sensing: Technical Challenges and Research Directions

Authors: Muhammad Akhtar Munir, Muhammad Umer Sheikh, Akashah Shabbir, Muhammad Haris Khan, Fahad Khan, Xiao Xiang Zhu, Begum Demir, Salman Khan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.24919
Pdf link: https://arxiv.org/pdf/2604.24919
Abstract Earth Observation (EO) is moving beyond static prediction toward multi-step analytical workflows that require coordinated reasoning over data, tools, and geospatial state. While foundation models and vision-language models have expanded representation learning and language-grounded interaction for remote sensing, and agentic AI has demonstrated long-horizon reasoning and external tool use, EO is not a straightforward extension of generic agentic AI. EO workflows operate over georeferenced, multi-modal, and temporally structured data, where operations such as reprojection, resampling, compositing, and aggregation actively transform the underlying state and can constrain subsequent analysis. As a result, errors may propagate silently across steps, and correctness depends not only on internal coherence, but also on geospatial consistency, temporally valid comparisons, and physical validity. This position paper argues that these challenges are structural rather than incidental. We identify the implicit assumptions commonly made in generic agentic models, analyze how they break in geospatial workflows, and characterize the resulting failure modes in multi-step EO pipelines. We then outline design principles for EO-native agents centered on structured geospatial state, tool-aware reasoning, verifier-guided execution, and learning objectives aligned with geospatial and physical validity. Finally, we present research directions spanning EO-specific benchmarks, hybrid supervised and reinforcement learning, constrained self-improvement, and trajectory-level evaluation beyond final-answer accuracy. Building reliable geospatial agents therefore requires rethinking agent design around the physical, geospatial, and workflow constraints that govern EO analysis.

Compute Aligned Training: Optimizing for Test Time Inference

Authors: Adam Ousherovitch, Ambuj Tewari
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.24957
Pdf link: https://arxiv.org/pdf/2604.24957
Abstract Scaling test-time compute has emerged as a powerful mechanism for enhancing Large Language Model (LLM) performance. However, standard post-training paradigms, Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), optimize the likelihood of individual samples under a base policy, creating a misalignment with test time procedures that rely on aggregated or filtered outputs. In this work, we propose Compute Aligned Training, which aligns training objectives with test-time strategies. By conceptualizing inference strategies as operators on the base policy, we derive new loss functions that maximize performance when said strategies are applied. We instantiate such loss functions for SFT and RL across common test time strategies. Finally, we provide empirical evidence that this training method substantially improves test time scaling over standard training.
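The mismatch described in this abstract can be illustrated with one common test-time strategy, best-of-N sampling under a perfect verifier, where what matters is the chance that at least one of N samples is correct. The sketch below is an illustration under that assumption, not the loss derived in the paper; the names `pass_at_n` and `pass_at_n_weight` are hypothetical.

```python
def pass_at_n(p: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct,
    given a per-sample success probability p."""
    return 1.0 - (1.0 - p) ** n


def pass_at_n_weight(p: float, n: int) -> float:
    """Derivative of pass_at_n with respect to p, i.e. n * (1 - p) ** (n - 1).

    Optimizing per-sample success p directly gives every prompt weight 1.
    Optimizing pass@n instead scales each prompt's gradient by this factor.
    """
    return n * (1.0 - p) ** (n - 1)


if __name__ == "__main__":
    for p in (0.1, 0.5, 0.9):
        print(p, pass_at_n(p, 8), pass_at_n_weight(p, 8))
```

Under this assumption the weight shrinks toward zero as p approaches 1, so prompts the base policy already solves reliably contribute little, while hard prompts dominate the update.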

Sparse Personalized Text Generation with Multi-Trajectory Reasoning

Authors: Bo Ni, Haowei Fu, Qinwen Ge, Franck Dernoncourt, Samyadeep Basu, Nedim Lipka, Seunghyun Yoon, Yu Wang, Nesreen K. Ahmed, Subhojyoti Mukherjee, Puneet Mathur, Ryan A. Rossi, Tyler Derr
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.24996
Pdf link: https://arxiv.org/pdf/2604.24996
Abstract As Large Language Models (LLMs) advance, personalization has become a key mechanism for tailoring outputs to individual user needs. However, most existing methods rely heavily on dense interaction histories, making them ineffective in cold-start scenarios where such data is sparse or unavailable. While external signals (e.g., content of similar users) can offer a potential remedy, leveraging them effectively remains challenging: raw context is often noisy, and existing methods struggle to reason over heterogeneous data sources. To address these issues, we introduce PAT (Personalization with Aligned Trajectories), a reasoning framework for cold-start LLM personalization. PAT first retrieves information along two complementary trajectories: writing-style cues from stylistically similar users and topic-specific context from preference-aligned users. It then employs a reinforcement learning-based, iterative dual-reasoning mechanism that enables the LLM to jointly refine and integrate these signals. Experimental results across real-world personalization benchmarks show that PAT consistently improves generation quality and alignment under sparse-data conditions, establishing a strong solution to the cold-start personalization problem.

Why Does Reinforcement Learning Generalize? A Feature-Level Mechanistic Study of Post-Training in Large Language Models

Authors: Dan Shi, Zhuowen Han, Simon Ostermann, Renren Jin, Josef van Genabith, Deyi Xiong
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.25011
Pdf link: https://arxiv.org/pdf/2604.25011
Abstract Reinforcement learning (RL)-based post-training often improves the reasoning performance of large language models (LLMs) beyond the training domain, while supervised fine-tuning (SFT) frequently leads to general capabilities forgetting. However, the mechanisms underlying this contrast remain unclear. To bridge this gap, we present a feature-level mechanistic analysis methodology to probe RL generalization using a controlled experimental setup, where RL- and SFT-tuned models are trained from the same base model on identical data. Leveraging our interpretability framework, we align internal activations across models within a shared feature space and analyze how features evolve during post-training. We find that SFT rapidly introduces many highly specialized features that stabilize early in training, whereas RL induces more restrained and continually evolving feature changes that largely preserve base models' representations. Focusing on samples where RL succeeds but the base model fails, we identify a compact, task-agnostic set of features that directly mediate generalization across diverse tasks. Feature-level interventions confirm their causal role: disabling these features significantly degrades RL models' generalization performance, while amplifying them improves base models' performance. The code is available at this https URL.

Zero Shot Coordination for Sparse Reward Tasks with Diverse Reward Shapings

Authors: Keenan Powell, Peihong Yu, Pratap Tokekar
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.25076
Pdf link: https://arxiv.org/pdf/2604.25076
Abstract Many Multi-Agent Reinforcement Learning (MARL) agents fail to adapt properly to cooperating with agents trained with the same objectives but different seeds, algorithms, or other training differences. This is the problem of Zero-Shot Coordination (ZSC), which focuses on training agents to cooperate well with unknown agents. ZSC has been studied for a variety of tabular cases and simple games such as Hanabi, achieving excellent results. However, existing solutions to ZSC only consider identical rewards for the trained agents and all future partners. This is not realistic, as it ignores the problem of cooperating with agents that have identical sparse objectives but shape the rewards for those objectives in a different manner. To address this issue, we show how to train an ensemble of methods using randomized reward shapings chosen using 4 selection algorithms. Experiments done on the Overcooked environment demonstrate consistent improvements of 62.2%-119.2% in sparse reward over baseline ZSC algorithms when playing with agents that have identical sparse rewards but different reward shapings.

Evaluating Risks in Weak-to-Strong Alignment: A Bias-Variance Perspective

Authors: Hamid Osooli, Kareema Batool, Rick Gentry, Tiasa Singha Roy, Ashwin Gupta, Anirudha Ramesh
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.25077
Pdf link: https://arxiv.org/pdf/2604.25077
Abstract Weak-to-strong alignment offers a promising route to scalable supervision, but it can fail when a strong model becomes confidently wrong on examples that lie in the weak teacher's blind spots. Understanding such failures requires going beyond aggregate accuracy, since weak-to-strong errors depend not only on whether the strong model disagrees with its teacher, but also on how confidence and uncertainty are distributed across examples. In this work, we analyze weak-to-strong alignment through a bias-variance-covariance lens that connects misfit theory to practical post-training pipelines. We derive a misfit-based upper bound on weak-to-strong population risk and study its empirical components using continuous confidence scores. We evaluate four weak-to-strong pipelines spanning supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and reinforcement learning from AI feedback (RLAIF) on the PKU-SafeRLHF and HH-RLHF datasets. Using a blind-spot deception metric that isolates cases where the strong model is confidently wrong while the weak model is uncertain, we find that strong-model variance is the strongest empirical predictor of deception across our settings. Covariance provides additional but weaker information, indicating that weak-strong dependence matters, but does not by itself explain the observed failures. These results suggest that strong-model variance can serve as an early-warning signal for weak-to-strong deception, while blind-spot evaluation helps distinguish whether failures are inherited from weak supervision or arise in regions of weak-model uncertainty.
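The blind-spot deception metric is described here only in words: cases where the strong model is confidently wrong while the weak model is uncertain. A minimal sketch of one way to count such cases follows. It assumes binary labels, confidence scores in [0, 1], and two hypothetical thresholds (`strong_threshold`, `weak_band`); the paper's actual definition may differ.

```python
from typing import Sequence


def blind_spot_deception_rate(
    strong_conf: Sequence[float],    # strong model's confidence in its own prediction
    strong_correct: Sequence[bool],  # whether the strong model's prediction is right
    weak_conf: Sequence[float],      # weak model's confidence in its own prediction
    strong_threshold: float = 0.9,   # hypothetical cutoff for "confident"
    weak_band: float = 0.1,          # hypothetical half-width around 0.5 for "uncertain"
) -> float:
    """Fraction of examples where the strong model is confidently wrong
    while the weak model is uncertain."""
    n = len(strong_conf)
    if n == 0:
        return 0.0
    hits = 0
    for s_conf, s_ok, w_conf in zip(strong_conf, strong_correct, weak_conf):
        confidently_wrong = (not s_ok) and s_conf >= strong_threshold
        weak_uncertain = abs(w_conf - 0.5) <= weak_band
        if confidently_wrong and weak_uncertain:
            hits += 1
    return hits / n
```

Separating the two conditions keeps failures inherited from a confident but wrong teacher out of the count, which is the distinction the abstract draws.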

Prior-Aligned Data Cleaning for Tabular Foundation Models

Authors: Laure Berti-Equille
Subjects: Machine Learning (cs.LG); Databases (cs.DB)
Arxiv link: https://arxiv.org/abs/2604.25154
Pdf link: https://arxiv.org/pdf/2604.25154
Abstract Tabular Foundation Models (TFMs) achieve state-of-the-art zero-shot accuracy on small tabular datasets by meta-learning over synthetic data-generating processes -- making them highly attractive for practitioners who cannot afford large annotated corpora. However, their in-context learning mechanism assumes approximately clean inputs: missing values, outliers, and duplicates in real-world data create a prior mismatch that degrades both accuracy and confidence calibration simultaneously. Correcting this mismatch requires sequential decisions over cleaning operators whose interactions no static preprocessing rule can anticipate -- a natural fit for reinforcement learning (RL). We introduce L2C2, the first deep RL framework framing tabular data cleaning as prior alignment: a learned policy sequences operators to minimize the distributional gap between dirty input and the TFM's synthetic prior. Six experiments on ten OpenML benchmark datasets establish: 1) three of seven reward designs collapse to degenerate trivial cleaning strategies -- principled reward engineering is scientifically non-trivial; 2) the novel TFMAwareReward reward we propose selects structurally distinct pipelines on 4/10 datasets and achieves higher TabPFN accuracy on those diverging cases (mean 0.851 vs. 0.843; Wilcoxon p=0.063, n=4) while never underperforming; 3) parameterized cleaning actions improve best-found pipeline reward on 9/10 datasets (Wilcoxon p=0.004); and 4) a policy pre-trained on a single source dataset exceeds scratch training at the 2,000-step fine-tuning checkpoint on all three held-out datasets (up to +28.8% after full fine-tuning), demonstrating cross-dataset transfer of prior-alignment knowledge. These findings establish that prior alignment is a principled data preparation strategy for TFM deployment on real-world tabular data.

CroSearch-R1: Better Leveraging Cross-lingual Knowledge for Retrieval-Augmented Generation

Authors: Rui Qi, Fengran Mo, Sijin Lu, Yufeng Chen, Jian-Yun Nie, Kaiyu Huang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.25182
Pdf link: https://arxiv.org/pdf/2604.25182
Abstract A multilingual collection may contain useful knowledge in other languages to supplement and correct the facts in the original language for Retrieval-Augmented Generation (RAG). However, the vanilla approach that simply concatenates multiple pieces of knowledge from different languages into the context may fail to improve effectiveness due to the potential disparities across languages. To better leverage multilingual knowledge, we propose CroSearch-R1, a search-augmented reinforcement learning framework to integrate multilingual knowledge into the Group Relative Policy Optimization (GRPO) process. In particular, the approach adopts a multi-turn retrieval strategy with cross-lingual knowledge integration to dynamically align the knowledge from other languages as supplementary evidence into a unified representation space. Furthermore, we introduce a multilingual rollout mechanism to optimize reasoning transferability across languages. Experimental results demonstrate that our framework effectively leverages cross-lingual complementarity and improves the effectiveness of RAG with multilingual collections.
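For readers unfamiliar with GRPO, the core idea is that each rollout is scored relative to the other rollouts sampled for the same query, so no learned value function is needed. The sketch below shows only that generic group-relative normalization. How CroSearch-R1 forms its groups (for example via the multilingual rollout mechanism) is not specified in the abstract, and the function name and `eps` term are assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of rollouts for the same query.

    Each advantage is (reward - group mean) / (group std + eps); eps guards
    against division by zero when every rollout receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four rollouts for one query, two judged correct (reward 1.0).
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When all rollouts in a group receive the same reward, every advantage is zero and that query contributes no policy gradient.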

How Can Reinforcement Learning Achieve Expert-level Placement?

Authors: Ruo-Tong Chen, Ke Xue, Chengrui Gao, Yunqi Shi, Tian Xu, Peng Xie, Siyuan Xu, Mingxuan Yuan, Chao Qian, Zhi-Hua Zhou
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.25191
Pdf link: https://arxiv.org/pdf/2604.25191
Abstract Chip placement is a critical step in physical design. While reinforcement learning (RL)-based methods have recently emerged, their training primarily focuses on wirelength optimization, and they therefore often fail to achieve expert-quality layouts. We identify the reward design as the primary cause for the performance gap with experts, and instead of formalizing intricate processes, we circumvent this by directly learning from expert layouts to derive a reward model. Our approach starts from the final expert layouts to infer step-by-step expert trajectories. Using these trajectories as demonstrations or preferences, we train a model that captures the latent implicit rewards in expert results. Experiments show that our framework can efficiently learn from even a single design and generalize well to unseen cases.

OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding

Authors: Minghang Zheng, Zihao Yin, Yi Yang, Yuxin Peng, Yang Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.25276
Pdf link: https://arxiv.org/pdf/2604.25276
Abstract Video Temporal Grounding (VTG), the task of localizing video segments from text queries, struggles in open-world settings due to limited dataset scale and semantic diversity, causing performance gaps between common and rare concepts. To overcome these limitations, we introduce OmniVTG, a new large-scale dataset for open-world VTG, coupled with a Self-Correction Chain-of-Thought (CoT) training paradigm designed to enhance the grounding capabilities of Multimodal Large Language Models (MLLMs). Our OmniVTG is constructed via a novel Semantic Coverage Iterative Expansion pipeline, which first identifies gaps in the vocabulary of existing datasets and collects videos that are highly likely to contain these target concepts. For high-quality annotation, we leverage the insight that modern MLLMs excel at dense captioning more than direct grounding and design a caption-centric data engine to prompt MLLMs to generate dense, timestamped descriptions. Beyond the dataset, we observe that simple supervised finetuning (SFT) is insufficient, as a performance gap between rare and common concepts still persists. We find that MLLMs' video understanding ability significantly surpasses their direct grounding ability. Based on this, we propose a Self-Correction Chain-of-Thought (CoT) training paradigm. We train the MLLM to first predict, then use its understanding capabilities to reflect on and refine its own predictions. This capability is instilled via a three-stage pipeline of SFT, CoT finetuning, and reinforcement learning. Extensive experiments show our approach not only excels at open-world grounding in our OmniVTG dataset but also achieves state-of-the-art zero-shot performance on four existing VTG benchmarks. Code is available at this https URL.

From Local Indices to Global Identifiers: Generative Reranking for Recommender Systems via Global Action Space

Authors: Pengyue Jia, Xiaobei Wang, Yingyi Zhang, Shuchang Liu, Yupeng Hou, Hailan Yang, Xu Gao, Xiaopeng Li, Yejing Wang, Julian McAuley, Xiang Li, Lantao Hu, Yongqi Liu, Kaiqiao Zhan, Han Li, Kun Gai, Xiangyu Zhao
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2604.25291
Pdf link: https://arxiv.org/pdf/2604.25291
Abstract In modern recommender systems, list-wise reranking serves as a critical phase within the multi-stage pipeline, finalizing the exposed item sequence and directly impacting user satisfaction by modeling complex intra-list item dependencies. Existing methods typically formulate this task as selecting indices from the local input list. However, this approach suffers from a semantically inconsistent action space: the same output neuron (logits) represents different items across different samples, preventing the model from establishing a stable, intrinsic understanding of the items. To address this, we propose GloRank (Global Action Space Ranker), a generative framework that shifts reranking from selecting local indices to generating global identifiers. Specifically, we represent items as sequences of discrete tokens and reformulate reranking as a token generation task. This design effectively decouples the scoring mechanism from the variable input order, ensuring that items are evaluated against a consistent global standard. We further enhance this with a two-stage optimization pipeline: a supervised pre-training phase to initialize the model with high-quality demonstrations, followed by a reinforcement learning-based post-training phase to directly maximize list-wise utility. Extensive experiments on two public benchmarks and a large-scale industrial dataset, coupled with online A/B tests, demonstrate that GloRank consistently outperforms state-of-the-art baselines and achieves superior robustness in cold-start scenarios.
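The contrast between local indices and global identifiers can be made concrete with a toy example. The sketch below is an illustration only: the abstract does not say how GloRank tokenizes items, so the fixed base-256 decomposition, the function names, and the item ids are all placeholder assumptions.

```python
from typing import List, Tuple


def local_index_target(candidates: List[int], chosen_item: int) -> int:
    """Local action space: the target is the item's position in this input list."""
    return candidates.index(chosen_item)


def to_global_tokens(item_id: int, base: int = 256, length: int = 3) -> Tuple[int, ...]:
    """Global action space: the target is a fixed token sequence for the item,
    here a most-significant-first base-`base` decomposition (a placeholder scheme)."""
    if not 0 <= item_id < base ** length:
        raise ValueError("item_id does not fit in the token budget")
    tokens = []
    for _ in range(length):
        tokens.append(item_id % base)
        item_id //= base
    return tuple(reversed(tokens))


if __name__ == "__main__":
    list_a = [1001, 42, 7]   # same candidates,
    list_b = [7, 1001, 42]   # different input order
    # The local target for item 42 changes with the input order...
    print(local_index_target(list_a, 42), local_index_target(list_b, 42))
    # ...while the global token sequence for item 42 is the same in both cases.
    print(to_global_tokens(42))
```

Because the token sequence depends only on the item, each output unit keeps one meaning across samples, which is the consistency property the abstract argues for.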

Multi-action Tangled Program Graphs for Multi-task Reinforcement Learning with Continuous Control

Authors: Quentin Vacher (IETR), Nicolas Beuve (IETR), Mickaël Dardaillon (IETR), Karol Desnos (IETR)
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.25369
Pdf link: https://arxiv.org/pdf/2604.25369
Abstract Over the past few decades, machine learning has been widely used to learn complex tasks. Reinforcement Learning (RL), inspired by human behavior, is a great example, as it involves developing specific behaviours for specific tasks. To further challenge algorithms, Multi-Task RL (MTRL) environments have been introduced, requiring a single model to learn multiple behaviors. The Tangled Program Graph (TPG) algorithm is a Genetic Programming (GP) algorithm designed for discrete MTRL environments. Recently, the MAPLE algorithm has been proposed, as another GP algorithm that achieves high results in single task continuous RL environments. A variation of the TPG is proposed alongside MAPLE, named Multi-Action TPG (MATPG) that aggregates MAPLE agents, and creates a control flow to activate them. Initially tested on single task RL environments only, MATPG achieved similar results to MAPLE. In this work, we present a new benchmark based on the MuJoCo Half Cheetah from Gymnasium. This benchmark features five distinct obstacles that are randomly positioned in front of the agent, each of which demands a unique behavior. This benchmark serves as a use case for MATPG, to prove its ability as a GP solution for continuous MTRL environments. Our experiments demonstrate its superiority in this multi-task use case when combined with lexicase selection. Furthermore, we examine the interpretability of the evolved graph, revealing that the decision flow of the model is fully interpretable.

Safe-Support Q-Learning: Learning without Unsafe Exploration

Authors: Yeeun Lim, Narim Jeong, Donghwan Lee
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.25379
Pdf link: https://arxiv.org/pdf/2604.25379
Abstract Ensuring safety during reinforcement learning (RL) training is critical in real-world applications where unsafe exploration can lead to devastating outcomes. While most safe RL methods mitigate risk through constraints or penalization, they still allow exploration of unsafe states during training. In this work, we adopt a stricter safety requirement that eliminates unsafe state visitation during training. To achieve this goal, we propose a Q-learning-based safe RL framework that leverages a behavior policy supported on a safe set. Under the assumption that the induced trajectories remain within the safe set, this policy enables sufficient exploration within the safe region without requiring near-optimality. We adopt a two-stage framework in which the Q-function and policy are trained separately. Specifically, we introduce a KL-regularized Bellman target that constrains the Q-function to remain close to the behavior policy. We then derive the policy induced from the trained Q-values and propose a parametric policy extraction method to approximate the optimal policy. Our approach provides a unified framework that can be adapted to different action spaces and types of behavior policies. Experimental results demonstrate that the proposed method achieves stable learning and well-calibrated value estimates and yields safer behavior with comparable or better performance than existing baselines.
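The abstract mentions a KL-regularized Bellman target and a policy induced from the trained Q-values without giving formulas. One standard form of KL-regularized backup, for a discrete action space with behavior policy `mu` and temperature `beta`, is sketched below. It is an assumption about the general shape of such a target, not the paper's exact definition, and all names are hypothetical.

```python
import math
from typing import List, Sequence


def kl_soft_value(q_values: Sequence[float], behavior_probs: Sequence[float], beta: float) -> float:
    """beta * log sum_a mu(a) * exp(Q(a) / beta), over actions with mu(a) > 0."""
    terms = [q / beta + math.log(mu)
             for q, mu in zip(q_values, behavior_probs) if mu > 0.0]
    peak = max(terms)  # subtract the maximum for numerical stability
    return beta * (peak + math.log(sum(math.exp(t - peak) for t in terms)))


def induced_policy(q_values: Sequence[float], behavior_probs: Sequence[float], beta: float) -> List[float]:
    """pi(a) proportional to mu(a) * exp(Q(a) / beta); zero wherever mu(a) == 0."""
    value = kl_soft_value(q_values, behavior_probs, beta)
    return [mu * math.exp((q - value) / beta) if mu > 0.0 else 0.0
            for q, mu in zip(q_values, behavior_probs)]


def bellman_target(reward: float, next_q_values: Sequence[float],
                   next_behavior_probs: Sequence[float], beta: float,
                   gamma: float = 0.99, done: bool = False) -> float:
    """One-step target: r + gamma * soft value of the next state (r alone if terminal)."""
    if done:
        return reward
    return reward + gamma * kl_soft_value(next_q_values, next_behavior_probs, beta)
```

In this form, an action outside the behavior policy's support receives zero probability however large its Q-value is, which matches the requirement of never leaving the safe set. Larger `beta` keeps the induced policy closer to `mu`.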
中文摘要 在现实应用中，确保强化学习（RL）训练期间的安全至关重要，因为不安全的探索可能导致毁灭性后果。虽然大多数安全强化学习方法通过约束或惩罚来降低风险，但它们仍允许在训练中探索不安全的状态。在这项工作中，我们采用了更严格的安全要求，即在训练期间完全不访问不安全状态。为实现这一目标，我们提出了一个基于Q学习的安全强化学习框架，利用支撑集位于安全集内的行为策略。在诱导轨迹始终处于安全集内的假设下，该策略允许在安全区域内充分探索，而无需接近最优。我们采用两阶段框架，Q函数和策略分别训练。具体来说，我们引入了一个KL正则化的Bellman目标，约束Q函数保持接近行为策略。然后，我们从训练好的Q值中推导出策略，并提出参数化策略提取方法以近似最优策略。我们的方法提供了一个统一的框架，可以适应不同的动作空间和行为策略类型。实验结果表明，所提方法实现了稳定的学习和校准良好的价值估计，并能实现更安全的行为，性能与现有基线相当甚至更好。
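
The KL-regularized Bellman target mentioned in the abstract can be illustrated with a small tabular sketch. This is not the authors' implementation: it assumes the common closed form in which the next-state value is `tau * log sum_a mu(a|s') * exp(Q(s', a) / tau)` for a behavior policy `mu`, and the function names, `tau`, and `gamma` are illustrative choices. Actions with zero behavior probability (outside the safe support) never enter the target or the induced policy.

```python
import math

def kl_soft_value(q_values, behavior_probs, tau):
    """tau * log sum_a mu(a) * exp(Q(a) / tau), restricted to the support of mu."""
    support = [(q, p) for q, p in zip(q_values, behavior_probs) if p > 0.0]
    q_max = max(q for q, _ in support)  # subtract the max for numerical stability
    total = sum(p * math.exp((q - q_max) / tau) for q, p in support)
    return q_max + tau * math.log(total)

def kl_regularized_target(reward, next_q_values, behavior_probs,
                          gamma=0.99, tau=1.0, done=False):
    """One-step target: r + gamma * soft value of the next state (assumed form)."""
    if done:
        return reward
    return reward + gamma * kl_soft_value(next_q_values, behavior_probs, tau)

def induced_policy(q_values, behavior_probs, tau):
    """pi(a) proportional to mu(a) * exp(Q(a) / tau); zero outside the support."""
    v = kl_soft_value(q_values, behavior_probs, tau)
    return [p * math.exp((q - v) / tau) if p > 0.0 else 0.0
            for q, p in zip(q_values, behavior_probs)]

# Hypothetical example: the third action is outside the safe support.
q = [1.0, 5.0, 100.0]
mu = [0.5, 0.5, 0.0]
pi = induced_policy(q, mu, tau=1.0)
```

In this sketch the unsafe action keeps probability zero no matter how large its Q-value is, which is the property a behavior policy supported on a safe set is meant to provide.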

Benchmarking and Improving GUI Agents in High-Dynamic Environments

高动态环境中的基准测试与改进图形界面代理

Authors: Enqi Liu, Liyuan Pan, Zhi Gao, Yan Yang, Chenrui Shi, Yang Liu, Jingrong Wu, Qing Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.25380
Pdf link: https://arxiv.org/pdf/2604.25380
Abstract Recent advancements in Graphical User Interface (GUI) agents have predominantly focused on training paradigms like supervised fine-tuning (SFT) and reinforcement learning (RL). However, the challenge of high-dynamic GUI environments remains largely underexplored. Existing agents typically rely on a single screenshot after each action for decision-making, leading to a partially observable (or even unobservable) Markov decision process, where the key GUI state including important information for actions is often inadequately captured. To systematically explore this challenge, we introduce DynamicGUIBench, a comprehensive online GUI benchmark spanning ten applications and diverse interaction scenarios characterized by important interface changes between actions. Furthermore, we present DynamicUI, an agent designed for dynamic interfaces, which takes screen-recording videos of the interaction process as input and consists of three components: a dynamic perceiver, a refinement strategy, and a reflection. Specifically, the dynamic perceiver clusters frames of the GUI video, generates captions for the centroids, and iteratively selects the most informative frames as the salient dynamic context. Considering that there may be inconsistencies and noise between the selected frames and the textual context of the agent, the refinement strategy employs an action-conditioned filtering to refine thoughts to mitigate thought-action inconsistency and redundancy. Based on the refined agent trajectories, the reflection module provides effective and accurate guidance for further actions. Experiments on DynamicGUIBench demonstrate that DynamicUI significantly improves the performance in dynamic GUI environments, while maintaining competitive performance on other public benchmarks.
中文摘要 图形用户界面（GUI）代理的最新进展主要集中在训练范式，如监督微调（SFT）和强化学习（RL）。然而，高动态图形界面环境的挑战仍然大多未被充分探索。现有代理通常依赖每次操作后的单一截图来做决策，导致一个部分可观察（甚至不可观察）的马尔可夫决策过程，关键的图形界面状态（包括动作重要信息）往往无法充分捕捉。为了系统地探讨这一挑战，我们介绍了DynamicGUIBench，这是一个涵盖十种应用和多样化交互场景的综合在线图形界面基准，特点是动作间重要的界面变化。此外，我们还介绍了DynamicUI，一款专为动态界面设计的代理，输入交互过程的屏幕录制视频，由三个组成部分组成：动态感知器、精炼策略和反射器。具体来说，动态感知器会对图形界面视频的帧进行聚类，生成重心的字幕，并反复选择最具信息量的帧作为显著的动态上下文。考虑到所选帧与代理文本上下文之间可能存在不一致和噪声，精炼策略采用动作条件过滤来精炼思想，以减少思维与动作之间的不一致和冗余。基于精炼后的试剂轨迹，反射模块为后续行动提供有效且准确的指导。DynamicGUIBench 上的实验表明，DynamicUI 在动态 GUI 环境中显著提升了性能，同时在其他公开基准测试中保持竞争力。

Biased Dreams: Limitations to Epistemic Uncertainty Quantification in Latent Space Models

偏见梦境：潜在空间模型中认识不确定性量化的局限性

Authors: Julia Berger, Bernd Frauenknecht, Sebastian Trimpe, Bastian Leibe
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.25416
Pdf link: https://arxiv.org/pdf/2604.25416
Abstract Model-Based Reinforcement Learning distinguishes between physical dynamics models operating on proprioceptive inputs and latent dynamics models operating on high-dimensional image observations. A prominent latent approach is the Recurrent State Space Model used in the Dreamer family. While epistemic uncertainty quantification to inform exploration and mitigate model exploitation is well established for physical dynamics models, its transfer to latent dynamics models has received limited scrutiny. We empirically demonstrate that latent transitions are biased toward well-represented regions of latent space, exhibiting an attractor behavior that can deviate from true environment dynamics. As a result, discrepancies in environment dynamics may not manifest in latent space, undermining the reliability of epistemic uncertainty estimates. Because these attractors often lie in high-reward regions, latent rollouts systematically overestimate predicted rewards. Our findings highlight key limitations of epistemic uncertainty estimation in latent dynamics models and motivate more critical evaluation of this method.
中文摘要 基于模型的强化学习区分了基于本体感觉输入的物理动力学模型和基于高维图像观察的潜在动力学模型。一个显著的潜在方法是梦者家族中使用的循环状态空间模型。虽然物理动力学模型中，认识不确定性量化以指导探索和减轻模型利用已被广泛认可，但其向潜在动力学模型的转移却受到有限的关注。我们实证证明，潜跃迁倾向于潜空间中代表性较多的区域，表现出一种可能偏离真实环境动力学的吸引子行为。因此，环境动态的差异可能不会在潜在空间中显现，削弱了认知不确定性估计的可靠性。由于这些吸引源通常位于高回报区域，潜在推广系统性地高估了预测的奖励。我们的发现凸显了潜在动力学模型中认知不确定性估计的关键局限性，并促使对该方法进行更批判性的评估。

JURY-RL: Votes Propose, Proofs Dispose for Label-Free RLVR

JURY-RL：投票提议、证明裁决的无标签RLVR

Authors: Xinjie Chen, Biao Fu, Jing Wu, Guoxin Chen, Xinggao Liu, Dayiheng Liu, Minpeng Liao
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.25419
Pdf link: https://arxiv.org/pdf/2604.25419
Abstract Reinforcement learning with verifiable rewards (RLVR) enhances the reasoning of large language models (LLMs), but standard RLVR often depends on human-annotated answers or carefully curated reward specifications. In machine-checkable domains, label-free alternatives such as majority voting or LLM-as-a-judge remove annotation cost but can introduce false positives that destabilize training. We introduce JURY-RL, a label-free RLVR framework that decouples answer proposal from reward disposal: votes from model rollouts propose a candidate answer, and a formal verifier determines whether that candidate can receive positive reward. Concretely, only rollouts matching the plurality-voted answer are rewarded when that answer is successfully verified in Lean. When verification is inconclusive, we invoke ResZero (Residual-Zero), a fallback reward that discards the unverified plurality proposal and redistributes a zero-mean, variance-preserving signal over the residual answers. This design maintains a stable optimization gradient without reinforcing unverifiable consensus. Across three backbone models trained on mathematical data, JURY-RL consistently outperforms other label-free baselines on mathematical reasoning benchmarks and transfers competitively to code generation and general benchmarks. It attains pass@1 performance comparable to supervised ground-truth training, with superior generalization demonstrated by higher pass@k and response diversity.
中文摘要 带有可验证奖励的强化学习（RLVR）增强了大型语言模型（LLM）的推理能力，但标准RLVR通常依赖于人工标注的答案或精心设计的奖励规范。在可机器检查的领域，多数投票或LLM-as-a-judge等无标签替代方案可以消除标注成本，但可能引入假阳性，破坏训练稳定性。我们引入了JURY-RL，一种无标签的RLVR框架，将答案提议与奖励裁决解耦：模型rollout的投票提出候选答案，形式化验证器决定该候选答案是否能获得正奖励。具体来说，只有当相对多数票选出的答案在Lean中成功验证时，与该答案一致的rollout才会获得奖励。当验证结果不确定时，我们调用ResZero（Residual-Zero），这是一种后备奖励，丢弃未验证的多数提议，并将零均值且保持方差的信号重新分配到其余答案上。该设计保持稳定的优化梯度，同时不强化不可验证的共识。在三个基于数学数据训练的骨干模型上，JURY-RL在数学推理基准测试中持续优于其他无标签基线，并在代码生成和通用基准测试中具有竞争力。其pass@1性能可与使用真实标签的监督训练相当，且更高的pass@k和回答多样性体现了更优的泛化能力。
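
The propose-then-verify reward described in the abstract can be sketched as follows. The abstract does not give the ResZero formula, so the fallback branch is only an illustrative stand-in: it zeroes the plurality rollouts and gives the remaining rollouts a zero-mean signal based on their vote share, without any variance rescaling. The `verify` callback is a hypothetical placeholder for a Lean check and returns `True` only when verification succeeds.

```python
from collections import Counter

def jury_rewards(answers, verify):
    """Rewards for one group of rollouts.

    answers: final answers extracted from the rollouts (hashable values).
    verify:  hypothetical callback; True means the proposal was formally verified,
             False means verification was inconclusive.
    """
    proposal = Counter(answers).most_common(1)[0][0]  # plurality-voted answer
    if verify(proposal):
        # Verified: only rollouts matching the proposal receive positive reward.
        return [1.0 if a == proposal else 0.0 for a in answers]

    # Inconclusive: discard the proposal and score only the residual answers.
    residual = [a for a in answers if a != proposal]
    if not residual:
        return [0.0] * len(answers)
    share = {a: c / len(residual) for a, c in Counter(residual).items()}
    mean_share = sum(share[a] for a in residual) / len(residual)
    # Centering makes the residual signal zero-mean (assumed stand-in for ResZero).
    return [0.0 if a == proposal else share[a] - mean_share for a in answers]

votes = ["4", "4", "4", "5", "5", "6"]
verified = jury_rewards(votes, lambda ans: True)
fallback = jury_rewards(votes, lambda ans: False)
```

The point of the fallback branch is that an unverified consensus answer is never reinforced, while the group still yields a non-degenerate learning signal.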

A Systematic Post-Train Framework for Video Generation

面向视频生成的系统化后训练框架

Authors: Zeyue Xue, Siming Fu, Jie Huang, Shuai Lu, Haoran Li, Yijun Liu, Yuming Li, Xiaoxuan He, Mengzhao Chen, Haoyang Huang, Nan Duan, Ping Luo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.25427
Pdf link: https://arxiv.org/pdf/2604.25427
Abstract While large-scale video diffusion models have demonstrated impressive capabilities in generating high-resolution and semantically rich content, a significant gap remains between their pretraining performance and real-world deployment requirements due to critical issues such as prompt sensitivity, temporal inconsistency, and prohibitive inference costs. To bridge this gap, we propose a comprehensive post-training framework that systematically aligns pretrained models with user intentions through four synergistic stages: we first employ Supervised Fine-Tuning (SFT) to transform the base model into a stable instruction-following policy, followed by a Reinforcement Learning from Human Feedback (RLHF) stage that utilizes a novel Group Relative Policy Optimization (GRPO) method tailored for video diffusion to enhance perceptual quality and temporal coherence; subsequently, we integrate Prompt Enhancement via a specialized language model to refine user inputs, and finally address system efficiency through Inference Optimization. Together, these components provide a systematic approach to improving visual quality, temporal coherence, and instruction following, while preserving the controllability learned during pretraining. The result is a practical blueprint for building scalable post-training pipelines that are stable, adaptable, and effective in real-world deployment. Extensive experiments demonstrate that this unified pipeline effectively mitigates common artifacts and significantly improves controllability and visual aesthetics while adhering to strict sampling cost constraints.
中文摘要 尽管大规模视频扩散模型已展现出生成高分辨率且语义丰富内容的能力，但由于提示敏感性、时间不一致和过高的推理成本等关键问题，其预训练性能与实际部署需求之间仍存在显著差距。为弥合这一差距，我们提出了一个全面的后训练框架，通过四个协同阶段系统地对齐预训练模型与用户意图：首先采用监督微调（SFT）将基础模型转变为稳定的指令跟随策略；随后进入基于人类反馈的强化学习（RLHF）阶段，采用一种针对视频扩散定制的新型群相对策略优化（Group Relative Policy Optimization，GRPO）方法，以提升感知质量和时间连贯性；接着，我们通过专门的语言模型整合提示增强以优化用户输入；最后通过推理优化解决系统效率问题。这些组成部分共同提供了一种系统化的方法，以提升视觉质量、时间连贯性和指令跟随能力，同时保持预训练期间学到的可控性。其结果是一份实用蓝图，用于构建可扩展、稳定、适应性强且在实际部署中有效的后训练流程。大量实验表明，该统一流水线有效减少了常见伪影，显著提升了可控性和视觉美感，同时严格遵守采样成本约束。
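
The group-relative idea behind GRPO can be shown in a few lines. This sketch covers only the generic advantage normalization (reward minus group mean, divided by group standard deviation) and not the video-diffusion-specific variant the abstract refers to; the reward values and the `eps` constant are illustrative assumptions.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples generated from the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation of the group
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical reward-model scores for three videos sampled from one prompt.
advantages = group_relative_advantages([0.2, 0.5, 0.8])
```

Because the baseline is the group mean, no separate value network is needed; samples scoring above their siblings get positive advantages and the rest get negative ones.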

One Refiner to Unlock Them All: Inference-Time Reasoning Elicitation via Reinforcement Query Refinement

一个精炼器解锁所有模型：通过强化查询精炼在推理时激发推理能力

Authors: Yixiao Zhou, Dongzhou Cheng, Zhiliang Wu, Yi Yang, Yu Cheng, Hehe Fan
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.25444
Pdf link: https://arxiv.org/pdf/2604.25444
Abstract Large Language Models (LLMs) often fail to utilize their latent reasoning capabilities due to a distributional mismatch between ambiguous human inquiries and the structured logic required for machine activation. Existing alignment methods either incur prohibitive O(N) costs by fine-tuning each model individually or rely on static prompts that fail to resolve query-level structural complexity. In this paper, we propose ReQueR (Reinforcement Query Refinement), a modular framework that treats reasoning elicitation as an inference-time alignment task. We train a specialized Refiner policy via Reinforcement Learning to rewrite raw queries into explicit logical decompositions, treating frozen LLMs as the environment. Rooted in the classical Zone of Proximal Development from educational psychology, we introduce the Adaptive Solver Hierarchy, a curriculum mechanism that stabilizes training by dynamically aligning environmental difficulty with the Refiner's evolving competence. ReQueR yields consistent absolute gains of 1.7% to 7.2% across diverse architectures and benchmarks, outperforming strong baselines by 2.1% on average. Crucially, it provides a promising paradigm for one-to-many inference-time reasoning elicitation, enabling a single Refiner trained on a small set of models to effectively unlock reasoning in diverse unseen models. Code is available at this https URL.
中文摘要 大型语言模型（LLMs）常因模糊的人类询问与机器激活所需的结构化逻辑之间的分布不匹配，未能充分利用其潜在推理能力。现有的对齐方法要么因单独微调每个模型而产生高昂的O(N)成本，要么依赖静态提示，无法解决查询级别的结构复杂性。本文提出了ReQueR（Reinforcement Query Refinement），这是一种模块化框架，将推理激发视为推理时对齐任务。我们通过强化学习训练一个专门的Refiner策略，将原始查询重写为显式的逻辑分解，并将冻结的大型语言模型视为环境。基于教育心理学中经典的最近发展区理论，我们引入了自适应求解器层级（Adaptive Solver Hierarchy），这是一种课程机制，通过动态对齐环境难度与Refiner不断演变的能力来稳定训练。ReQueR在不同架构和基准上稳定实现1.7%至7.2%的绝对提升，平均优于强基线2.1%。关键是，它为一对多的推理时推理激发提供了有前景的范式，使在少量模型上训练的单个Refiner能够有效解锁多种未见模型的推理能力。代码可在此 https URL 访问。

DDA-Thinker: Decoupled Dual-Atomic Reinforcement Learning for Reasoning-Driven Image Editing

DDA-Thinker：用于推理驱动图像编辑的解耦双原子强化学习

Authors: Hanqing Yang, Qiang Zhou, Yongchao Du, Sashuai Zhou, Zhibin Wang, Jun Song, Tiezheng Ge, Cheng Yu, Bo Zheng
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.25477
Pdf link: https://arxiv.org/pdf/2604.25477
Abstract Recent image editing models have achieved strong visual fidelity but often struggle with tasks requiring complex reasoning. To investigate and enhance the reasoning-grounded planning for image editing, we propose DDA-Thinker, a Thinker-centric framework designed for the independent optimization of a planning module (Thinker) over a fixed generative model (Editor). This decoupled Thinker-centric paradigm facilitates a controlled analysis of the planning module and makes its contribution under a fixed Editor easier to assess. To effectively guide this Thinker, we introduce a dual-atomic reinforcement learning framework. This framework decomposes feedback into two distinct atomic rewards implemented through verifiable checklists: a cognitive-atomic reward to directly assess the quality of the Thinker's executable plan, which serves as the actionable outcome of the Thinker's reasoning, and a visual-atomic reward to assess the final image quality. To improve checklist quality, our checklist synthesis is grounded not only in the source image and user instruction but also in a rational reference description of the ideal post-edit scene. To support this training, we further develop a two-stage data curation pipeline that first synthesizes a diverse and reasoning-focused dataset, then applies difficulty-aware refinement to curate an effective training curriculum for reinforcement learning. Extensive experiments on reasoning-driven image editing benchmarks, including RISE-Bench and KRIS-Bench, demonstrate that our approach substantially improves overall performance. Our method enables a community model to achieve results competitive with strong proprietary models, highlighting the practical potential of Thinker-centric optimization under a fixed-editor setting.
中文摘要 近年来的图像编辑模型实现了较强的视觉真实度，但在需要复杂推理的任务时常常遇到困难。为了研究和增强基于推理的图像编辑规划，我们提出了DDA-Thinker，这是一个以Thinker为中心的框架，旨在独立优化规划模块（Thinker）相对于固定生成模型（Editor）。这种以思考者为中心的解耦范式便于对规划模块的受控分析，并使其在固定编辑器下的贡献更易于评估。为了有效指导这个思考者，我们引入了一个双原子强化学习框架。该框架将反馈分解为两种不同的原子奖励，通过可验证的检查表实现：认知原子奖励直接评估思考者可执行计划的质量，作为思考者推理的可操作结果;视觉-原子奖励用于评估最终图像质量。为了提高检查表质量，我们的检查表综合不仅基于源图像和用户指示，还基于对理想后期剪辑场景的合理参考描述。为支持该培训，我们进一步开发了两阶段的数据策划流程，先综合多样且以推理为重点的数据集，然后应用针对难度的细化，策划出有效的强化学习培训课程。在推理驱动的图像编辑基准测试（包括RISE-Bench和KRIS-Bench）上进行了大量实验，表明我们的方法显著提升了整体性能。我们的方法使社区模型能够实现与强大专有模型竞争的结果，凸显了在固定编辑器环境下以Thinker为中心优化的实用潜力。

Improving Zero-Shot Offline RL via Behavioral Task Sampling

通过行为任务抽样改进零样本离线强化学习

Authors: Nazim Bendib, Nicolas Perrin-Gilbert, Olivier Sigaud
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.25496
Pdf link: https://arxiv.org/pdf/2604.25496
Abstract Offline zero-shot reinforcement learning (RL) aims to learn agents that optimize unseen reward functions without additional environment interaction. The standard approach to this problem trains task-conditioned policies by sampling task vectors that define linear reward functions over learned state representations. In most existing algorithms, these task vectors are randomly sampled, implicitly assuming this adequately captures the structure of the task space. We argue that doing so leads to suboptimal zero-shot generalization. To address this limitation, we propose extracting task vectors directly from the offline dataset and using them to define the task distribution used for policy training. We introduce a simple and general reward function extraction procedure that integrates into existing offline zero-shot RL algorithms. Across multiple benchmark environments and baselines, our approach improves zero-shot performance by an average of 20%, highlighting the importance of principled task sampling in offline zero-shot RL.
中文摘要 离线零样本强化学习（RL）旨在学习能够在不额外环境交互的情况下优化看不见奖励函数的代理。该问题的标准方法是通过抽样任务向量来训练任务条件策略，这些向量定义了在已学习的状态表示上定义线性奖励函数。在大多数现有算法中，这些任务向量是随机抽样的，隐含地假设这充分反映了任务空间的结构。我们认为这样做会导致零射概括效果不佳。为解决这一限制，我们提议直接从离线数据集中提取任务向量，并用它们定义用于策略训练的任务分布。我们引入了一种简单通用的奖励函数提取过程，并将其集成到现有的离线零样本强化学习算法中。在多个基准环境和基线中，我们的方法平均提升零样本性能20%，凸显了离线零样本强化学习中原则性任务抽样的重要性。

SymphonyGen: 3D Hierarchical Orchestral Generation with Controllable Harmony Skeleton

SymphonyGen：带有可控和声骨架的三维层级管弦乐生成

Authors: Xuzheng He, Nan Nan, Zhilin Wang, Ziyue Kang, Zhuoru Mo, Ao Li, Yu Pan, Xiaobing Li, Feng Yu, Xiaohong Guan
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.25498
Pdf link: https://arxiv.org/pdf/2604.25498
Abstract Generating symphonic music requires simultaneously managing high-level structural form and dense, multi-track orchestration. Existing symbolic models often struggle with a "complexity-control imbalance", in which scaling bottlenecks limit long-term granular steerability. We present SymphonyGen, a 3D hierarchical framework for contemporary cinematic orchestration. SymphonyGen employs a cascading decoder architecture that decomposes the Bar, Track, and Event axes, improving computational efficiency and scalability over conventional 1D or 2D models. We introduce "short-score" conditioning via a beat-quantized multi-voice harmony skeleton, enabling outline control while preserving textural diversity. The model is further refined using Group Relative Policy Optimization (GRPO) with a cross-modal audio-perceptual reward, aligning symbolic output with modern acoustic expectations. Additionally, we implement a dissonance-averse sampling algorithm to suppress unintended tonal clashes during inference. Objective evaluations show that both reinforcement learning and dissonance-averse sampling effectively enhance harmonic cleanliness while maintaining melodic expression. Subjective evaluations demonstrate that SymphonyGen outperforms baselines in musicality and preference for orchestral music generation. Demo page: this https URL
中文摘要 创作交响音乐需要同时管理高层次的结构形式和密集的多轨配器。现有的符号模型常常面临“复杂性-控制失衡”的问题，即缩放瓶颈限制了长期细粒度的可控性。我们介绍SymphonyGen，一个用于当代电影配器制作的3D层级框架。SymphonyGen采用级联解码器架构，分解条形、轨道和事件轴，提升了计算效率和可扩展性，相较于传统一维或二维模型。我们引入了通过节拍量化多声部和声骨架的“短谱”条件反射，实现轮廓控制，同时保持质感多样性。该模型通过群相对策略优化（Group Relative Policy Optimization，GRPO）进一步细化，采用跨模态音频感知奖励，使符号输出与现代声学预期保持一致。此外，我们还实现了一种厌恶失谐的采样算法，以抑制推理过程中的非预期音调冲突。客观评估表明，强化学习和不协和音回避采样都能有效提升和声的干净度，同时保持旋律表达。主观评价显示，SymphonyGen在音乐性和管弦乐音乐生成偏好方面表现优于基线。演示页面：这个 https URL

Dyna-Style Safety Augmented Reinforcement Learning: Staying Safe in the Face of Uncertainty

Dyna式安全增强强化学习：在不确定性面前保持安全

Authors: Artur Eisele, Bernd Frauenknecht, Friedrich Solowjow, Sebastian Trimpe
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.25508
Pdf link: https://arxiv.org/pdf/2604.25508
Abstract Safety remains an open problem in reinforcement learning (RL), especially during training. While safety filters are promising to address safe exploration, they are generally poorly suited for high-dimensional systems with unknown dynamics. We propose Dyna-style Safety Augmented Reinforcement Learning (Dyna-SAuR), a novel algorithm that learns both a scalable safety filter and a control policy using a learned uncertainty-aware dynamics model, while requiring minimal domain knowledge. The filter avoids failures and high uncertainty regions. Thus, better models expand the set of safe and certain states, reducing filter conservatism. We present the effectiveness of Dyna-SAuR on goal-reaching CartPole as well as MuJoCo Walker, reducing failures compared to state-of-the-art methods by 2 orders of magnitude.
中文摘要 安全问题在强化学习（RL）中仍是一个开放问题，尤其是在培训期间。虽然安全滤波器有望实现安全探索，但通常不适合具有未知动力学的高维系统。我们提出了Dyna风格的安全增强强化学习（Dyna-SAuR），这是一种新颖算法，利用学习到的不确定性感知动力学模型，同时学习可扩展的安全过滤器和控制策略，同时对领域知识要求极低。滤波器避免了故障和高不确定性区域。因此，更好的模型扩展了安全和特定状态的集合，减少了过滤保守主义。我们展示了Dyna-SAuR对达标CartPole和MuJoCo Walker的有效性，与最先进方法相比，故障率降低了两个数量级。

Sample-efficient Neuro-symbolic Proximal Policy Optimization

样本高效神经符号近端策略优化

Authors: Simone Murari, Celeste Veronese, Daniele Meli
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.25534
Pdf link: https://arxiv.org/pdf/2604.25534
Abstract Deep Reinforcement Learning (DRL) algorithms often require a large amount of data and struggle in sparse-reward domains with long planning horizons and multiple sub-goals. In this paper, we propose a neuro-symbolic extension of Proximal Policy Optimization (PPO) that transfers partial logical policy specifications learned in easier instances to guide learning in more challenging settings. We introduce two integrations of symbolic guidance: (i) H-PPO-Product, which biases the action distribution at sampling time, and (ii) H-PPO-SymLoss, which augments the PPO loss with a symbolic regularization term. We evaluate our methods on three benchmarks (OfficeWorld, WaterWorld, and DoorKey), showing consistently faster learning and higher return at convergence than PPO and a Reward Machine baseline, also under imperfect symbolic knowledge.
中文摘要 深度强化学习（DRL）算法通常需要大量数据，且在奖励稀疏、规划时间长且包含多个子目标的领域中表现不佳。本文提出了一种近端策略优化（PPO）的神经符号扩展，将在更简单实例中学到的部分逻辑策略规范转移，以指导更具挑战性的学习。我们引入了两种符号指导的积分：（i） H-PPO-积，在采样时偏向作用分布;（ii） H-PPO-对称损失，通过符号正则化项补充PPO损失。我们在三个基准测试（OfficeWorld、WaterWorld和DoorKey）上评估方法，结果显示在收敛阶段的学习速度持续快，收敛时的回报也优于PPO和奖励机基线，且同样在符号知识不完全的情况下。
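
The two integration points named in the abstract (biasing the action distribution at sampling time, and adding a symbolic term to the loss) can be sketched for a discrete action space. The abstract does not specify either formula, so both functions are assumptions: the first multiplies the policy's probabilities by hypothetical per-action symbolic weights and renormalizes, and the second penalizes low probability mass on the symbolically suggested actions.

```python
import math

def product_policy(policy_probs, symbolic_weights):
    """Sampling-time bias: elementwise product with symbolic weights, renormalized."""
    mixed = [p * w for p, w in zip(policy_probs, symbolic_weights)]
    total = sum(mixed)
    if total == 0.0:
        # The symbolic advice rules out every action: fall back to the raw policy.
        return list(policy_probs)
    return [m / total for m in mixed]

def symbolic_penalty(policy_probs, suggested_actions, coeff=0.1):
    """Loss-side term: coeff * -log(probability mass on suggested actions)."""
    mass = sum(policy_probs[a] for a in suggested_actions)
    return -coeff * math.log(max(mass, 1e-12))

# Hypothetical case: a uniform policy over 4 actions, with action 1 favored 3:1.
biased = product_policy([0.25, 0.25, 0.25, 0.25], [1.0, 3.0, 1.0, 1.0])
penalty = symbolic_penalty([0.25, 0.25, 0.25, 0.25], suggested_actions=[1])
```

Soft weights rather than hard masks leave every action reachable, which matters when the symbolic knowledge is imperfect, a setting the abstract says is also evaluated.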

Egocentric Tactile and Proximity Sensors as Observation Priors for Humanoid Collision Avoidance

以自我为中心的触觉和接近传感器作为人形机器人碰撞规避的观察先验

Authors: Carson Kohlbrenner, Niraj Pudasaini, William Xie, Naren Sivagnanadasan, Nikolaus Correll, Alessandro Roncone
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.25554
Pdf link: https://arxiv.org/pdf/2604.25554
Abstract Collision-free motion is often aided by tactile and proximity sensors distributed on the body of the robot because, unlike external cameras, they are resistant to occlusion. However, how to shape the sensors' properties, such as sensing coverage, type, and range, to enable avoidant behavior remains unclear. In this work, we present a reinforcement learning framework for whole-body collision avoidance on a humanoid H1-2 robot and use it to characterize how sensor properties shape learned avoidance behavior. Using dodgeball as a benchmark task, we ablate the properties of sensors distributed across the upper body of the robot and find that raw proximity measurements can substitute for explicit object localization provided the sensing range is sufficient, and that sparse non-directional proximity signals outpace dense directional alternatives in sample efficiency.
中文摘要 由于机器人身上分布在触觉和接近传感器，这些传感器具有抗阻挡能力，而非外部摄像头，通常有助于实现无碰撞运动。然而，如何塑造传感器的特性，比如检测覆盖范围;类型;以及促进回避行为的范围尚不明确。本研究提出了一个用于人形H1-2机器人全身碰撞避免的强化学习框架，并用它来描述传感器属性如何塑造学习的回避行为。以躲避球为基准任务，我们对分布在机器人上半身的传感器特性进行了解析，发现只要感测距离足够，且稀疏的非定向近距离信号在样本效率上优于高密度方向的替代，原始的接近测量可以替代显性物体定位。

Modeling Human-Like Color Naming Behavior in Context

在情境中建模类人色彩命名行为

Authors: Yuqing Zhang, Ecesu Ürker, Tessa Verhoef, Gemma Boleda, Arianna Bisazza
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.25674
Pdf link: https://arxiv.org/pdf/2604.25674
Abstract Modeling the emergence of human-like lexicons in computational systems has advanced through the use of interacting neural agents, which simulate both learning and communicative pressures. The NeLLCom-Lex framework (Zhang et al., 2025) allows neural agents to develop pragmatic color naming behavior and human-like lexicons through supervised learning (SL) from human data and reinforcement learning (RL) in referential games. Despite these successes, the lexicons that emerge diverge systematically from human color categories, producing highly non-convex regions in color space, which contrast with the convexity typical of human categories. To address this, we introduce two factors, upsampling rare color terms during SL and multi-listener RL interactions, and adopt a convexity measure to quantify geometric coherence. We find that upsampling improves lexical diversity and system-level informativeness of the color lexicon, while many-listener setups promote more convex color categories. The combination of moderate upsampling and multiple listeners produces lexicons most similar to human systems.
中文摘要 通过使用交互的神经代理，模拟学习和交流压力，模拟类人词汇在计算系统中的出现得到了进展。NeLLCom-Lex框架（Zhang 等，2025）允许神经智能体通过从人类数据中的监督学习（SL）和指称游戏中的强化学习（RL）发展出实用的颜色命名行为和类人词汇。尽管取得了这些成功，出现的词汇表系统性地与人类色彩类别不同，产生高度非凸的色彩空间区域，这与人类类别典型的凸性形成对比。为此，我们引入两个因素：在SL和多监听者强化学习交互中对稀有色项进行上采样，并采用凸性度量来量化几何相干性。我们发现，上采样能提升词汇多样性和颜色词汇的系统层面信息量，而多听者设置则促进更多凸色彩类别。适度上采样和多听者的结合，产生了最接近人类系统的词汇库。

K-CARE: Knowledge-driven Symmetrical Contextual Anchoring and Analogical Prototype Reasoning for E-commerce Relevance

K-CARE：基于知识的对称上下文锚定与电子商务相关性的类比原型推理

Authors: Chen Yifei, Tian Zhixing, Wang Chenyang, Cheng Ziguang
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2604.25683
Pdf link: https://arxiv.org/pdf/2604.25683
Abstract This paper targets e-commerce search relevance. While Large Language Models (LLMs) have demonstrated significant potential in this field, they often encounter performance bottlenecks in persistent 'corner cases' within complex industrial scenarios. Existing research primarily focuses on optimizing reasoning trajectories via Reinforcement Learning. However, real-world observations suggest that the primary bottleneck stems from knowledge boundaries, where the absence of domain-specific intelligence in the model's parametric memory creates a contextual void. This void persists when interpreting idiosyncratic queries or niche products and cannot be resolved solely through reasoning-path optimization. To bridge this gap, we propose K-CARE, a framework that extends the model's cognitive reach by grounding reasoning in external knowledge. K-CARE comprises two synergistic components: (1) Symmetrical Contextual Anchoring (SCA), which fills the contextual void by anchoring queries and products with behavior-derived implicit knowledge; and (2) Analogical Prototype Reasoning (APR), which leverages expert-curated prototypical knowledge to calibrate decision boundaries through in-context analogy. Extensive offline evaluations and online A/B tests on a leading e-commerce platform demonstrate that K-CARE significantly outperforms state-of-the-art baselines, delivering substantial commercial impact by resolving knowledge-intensive relevance challenges.
中文摘要 本文聚焦于电子商务搜索相关性。虽然大型语言模型（LLMs）在该领域展现出显著潜力，但在复杂的工业场景中，常常遇到持续的“角落案例”性能瓶颈。现有研究主要聚焦于通过强化学习优化推理轨迹。然而，现实观察表明，主要瓶颈源自知识边界，模型参数内存中缺乏领域特定智能，造成上下文空白。这种空白在解释特殊查询或细分产品时依然存在，无法仅靠推理路径优化解决。为了弥合这一差距，我们提出了K-CARE框架，该框架通过将推理建立在外部知识基础上，扩展了模型的认知覆盖范围。K-CARE 包含两个协同组件：（1）对称上下文锚定（SCA），通过将行为衍生的隐性知识锚定查询和产品来填补上下文空白;以及（2）类比原型推理（APR），利用专家精心策划的原型知识，通过上下文类比校准决策边界。在一家领先电商平台上进行的广泛线下评估和在线A/B测试显示，K-CARE远远超越了最先进的基线，通过解决知识密集型相关性挑战，带来了显著的商业影响力。

Backtranslation Augmented Direct Preference Optimization for Neural Machine Translation

神经机器翻译的反向翻译增强直接偏好优化

Authors: Mehrdad Ghassabi, Spehr Rajabi, Hamidreza Baradaran Kashani, Sadra Hakim, Mahshid Keivandarian
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.25702
Pdf link: https://arxiv.org/pdf/2604.25702
Abstract Contemporary neural machine translation (NMT) systems are almost exclusively built by training on supervised parallel data. Despite the tremendous progress achieved, these systems still exhibit persistent translation errors. This paper proposes that a post-training paradigm based on reinforcement learning (RL) can effectively rectify such mistakes. We introduce a novel framework that requires only a general text corpus and an expert translator, which can be either a human or an AI system, to provide iterative feedback. In our experiments, we focus specifically on English-to-German translation as a representative high-resource language pair. Crucially, we implement this RL-based post-training using Direct Preference Optimization (DPO). Applying our DPO-driven framework to the gemma3-1b model yields a significant improvement in translation quality, elevating its COMET score from 0.703 to 0.747 on the English-to-German task. The results demonstrate that DPO offers an efficient and stable pathway for enhancing pre-trained NMT models through preference-based post-training.
中文摘要 当代神经机器翻译（NMT）系统几乎完全通过对监督并行数据的训练构建。尽管取得了巨大进展，这些系统仍然存在持续的翻译错误。本文提出基于强化学习（RL）的训练后范式可以有效纠正此类错误。我们引入了一个新颖框架，只需一个通用文本语料库和一个专家翻译器，翻译者可以是人类或人工智能系统，提供迭代反馈。在我们的实验中，我们特别关注英语到德语的翻译作为代表性高资源语言对。关键是，我们采用直接偏好优化（DPO）实现基于强化学习的后期训练。将我们基于DPO的框架应用于gemma3-1b模型，翻译质量显著提升，其英语到德语任务的COMET得分从0.703提升至0.747。结果表明，DPO为通过基于偏好的后训练，提供了高效且稳定的预训练NMT模型提升路径。
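
For readers unfamiliar with DPO, the per-pair loss can be written in a few lines. This is the standard DPO objective, not code from the paper: how the preferred and rejected translations are obtained (backtranslation, expert feedback) is outside the sketch, and the log-probability values and `beta` below are made-up illustrative numbers.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log(sigmoid(beta * margin)) for one preference pair.

    Each argument is the summed token log-probability of a full translation
    under the trainable policy or the frozen reference model.
    """
    margin = ((policy_logp_chosen - ref_logp_chosen)
              - (policy_logp_rejected - ref_logp_rejected))
    z = beta * margin
    # Numerically stable form of log(1 + exp(-z)).
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# Hypothetical pair: the policy already prefers the chosen translation slightly.
loss_at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)  # policy equals reference
loss_better = dpo_loss(-10.0, -16.0, -12.0, -15.0)   # policy favors the chosen one
```

When the policy equals the reference the loss is `log 2`, and it decreases as the policy raises the chosen translation's likelihood relative to the rejected one, measured against the reference.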

QAROO: AI-Driven Online Task Offloading for Energy-Efficient and Sustainable MEC Networks

QAROO：以人工智能驱动的在线任务卸载，实现节能和可持续的MEC网络

Authors: Yongtao Yao, Yao Yang, Haorui Shi, Canglu Zhu, Miaojiang Chen, Ahmed Farouk
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.25740
Pdf link: https://arxiv.org/pdf/2604.25740
Abstract With the rapid advancement of artificial intelligence (AI) and intelligent science, intelligent edge computing has been widely adopted. However, the limitations of traditional methods, such as poor adaptability and the slow convergence of heuristic algorithms, are becoming increasingly evident. To enable sustainable and resource-efficient edge applications, this paper proposes an online task offloading framework for wireless powered mobile edge computing (MEC) networks, called Quantum Attention-based Reinforcement learning for Online Offloading (QAROO). The system employs a binary offloading strategy with the aim of co-optimizing computing and energy resources in dynamic channel environments. In response to the issues of poor adaptability in traditional approaches and the slow convergence of heuristic algorithms, the framework integrates quantum neural networks and attention mechanisms, introducing three key improvements: using recurrent neural networks to enhance temporal modeling capability, proposing an uncertainty-guided quantization method to improve exploration efficiency, and incorporating attention mechanisms into quantum networks to strengthen feature representation. Experiments demonstrate that the proposed method outperforms comparative schemes in terms of normalized computation speed and processing time, offering an efficient and stable solution for online task offloading in large-scale Internet of Things (IoT) dynamic environments.
中文摘要 随着人工智能（AI）和智能科学的快速发展，智能边缘计算已被广泛采用。然而，传统方法的局限性，如适应性差和启发式算法收敛缓慢，正日益显现。为实现可持续且资源高效的边缘应用，本文提出了一种用于无线驱动移动边缘计算（MEC）网络的在线任务卸载框架，称为基于量子注意力的强化学习用于在线卸载（QAROO）。该系统采用二元卸载策略，旨在在动态信道环境中协同优化计算和能源资源。针对传统方法适应性不足和启发式算法趋同缓慢的问题，该框架整合了量子神经网络与注意力机制，引入了三项关键改进：利用循环神经网络提升时间建模能力，提出一种不确定性引导量子化方法以提升探索效率，以及将注意力机制整合进量子网络以增强特征表示。实验表明，所提方法在归一化计算速度和处理时间方面优于其他类似方案，为大规模物联网（IoT）动态环境中在线任务卸载提供了高效且稳定的解决方案。

EOS-Bench: A Comprehensive Benchmark for Earth Observation Satellite Scheduling

EOS-Bench：地球观测卫星调度的综合基准

Authors: Qian Yin, Jiaxing Li, Jiaqi Cheng, Qizhang Luo, Annalisa Riccardi, Abhijit Chatterjee, Rafael Vazquez, Carlo Novara, Michalis Mavrovouniotis, Ponnuthurai Nagaratnam Suganthan, Shengzhou Bai, Xiaoxuan Hu, Lining Xing, Ming Xu, Shuang Li, Zixuan Zheng, Xin Shen, Xiaoyu Chen, Yi Gu, Yanjie Song, Witold Pedrycz, Evan L. Kramer, Laio Oriel Seman, Cletah Shoko, Guohua Wu, Xinwei Wang
Subjects: Networking and Internet Architecture (cs.NI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.25782
Pdf link: https://arxiv.org/pdf/2604.25782
Abstract Earth observation satellite imaging scheduling is a challenging NP-hard combinatorial optimisation problem central to space mission operations. While next-generation agile Earth observation satellites (EOS) increase operational flexibility, they also significantly raise scheduling complexity. The lack of a unified, open-source benchmark makes it difficult to compare algorithms across studies. This paper introduces EOS-Bench, a comprehensive framework for systematic and reproducible evaluation of scheduling methods. By integrating high-fidelity orbital dynamics and platform constraints, EOS-Bench generates 1,390 scenarios and 13,900 benchmark instances, spanning from small-scale validation cases to large coordination problems with up to 1,000 satellites and 10,000 requests. We further propose a scenario characterisation scheme to quantify structural difficulty based on factors such as opportunity density, task flexibility, conflict intensity, and satellite congestion. A multidimensional evaluation protocol is introduced, assessing performance across five metrics: task profit, completion rate, workload balance, timeliness, and runtime. The framework is evaluated using mixed-integer programming, heuristics, meta-heuristics, and deep reinforcement learning across both agile and non-agile settings. Results show that EOS-Bench effectively distinguishes solver performance across scales and conditions, revealing trade-offs between solution quality and computational efficiency, and providing deeper insight into scenario complexity. EOS-Bench offers a unified and extensible open testbed for advancing research in Earth observation satellite scheduling. The code and data are available at this https URL.
中文摘要 地球观测卫星成像调度是一个具有挑战性的NP难组合优化问题，是航天任务操作的核心。下一代敏捷地球观测卫星（EOS）不仅提高了操作灵活性，但也显著提高了调度复杂度。缺乏统一的开源基准，使得跨研究比较算法变得困难。本文介绍了EOS-Bench，这是一个用于系统且可重复地评估调度方法的综合框架。通过集成高精度轨道动力学和平台约束，EOS-Bench生成了1390个场景和13900个基准实例，涵盖从小规模验证案例到大型协调问题，涉及多达1000颗卫星和1万个请求。我们还提出了一种情景特征方案，基于机会密度、任务灵活性、冲突强度和卫星拥堵等因素量化结构难度。引入了多维评估协议，评估五个指标的性能：任务利润、完成率、工作负载平衡、及时性和运行时间。该框架通过混合整数规划、启发式、元启发式和深度强化学习在敏捷和非敏捷环境中进行评估。结果显示，EOS-Bench 有效区分了不同尺度和条件下的求解器性能，揭示了解质量与计算效率之间的权衡，并提供了更深入的场景复杂性洞察。EOS-Bench 提供了一个统一且可扩展的开放测试平台，推动地球观测卫星调度研究的发展。代码和数据可在该 https URL 访问。

KinDER: A Physical Reasoning Benchmark for Robot Learning and Planning

KinDER：机器人学习与规划的物理推理基准

Authors: Yixuan Huang, Bowen Li, Vaibhav Saxena, Yichao Liang, Utkarsh Aashu Mishra, Liang Ji, Lihan Zha, Jimmy Wu, Nishanth Kumar, Sebastian Scherer, Danfei Xu, Tom Silver
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.25788
Pdf link: https://arxiv.org/pdf/2604.25788
Abstract Robotic systems that interact with the physical world must reason about kinematic and dynamic constraints imposed by their own embodiment, their environment, and the task at hand. We introduce KinDER, a benchmark for Kinematic and Dynamic Embodied Reasoning that targets physical reasoning challenges arising in robot learning and planning. KinDER comprises 25 procedurally generated environments, a Gymnasium-compatible Python library with parameterized skills and demonstrations, and a standardized evaluation suite with 13 implemented baselines spanning task and motion planning, imitation learning, reinforcement learning, and foundation-model-based approaches. The environments are designed to isolate five core physical reasoning challenges: basic spatial relations, nonprehensile multi-object manipulation, tool use, combinatorial geometric constraints, and dynamic constraints, disentangled from perception, language understanding, and application-specific complexity. Empirical evaluation shows that existing methods struggle to solve many of the environments, indicating substantial gaps in current approaches to physical reasoning. We additionally include real-to-sim-to-real experiments on a mobile manipulator to assess the correspondence between simulation and real-world physical interaction. KinDER is fully open-sourced and intended to enable systematic comparison across diverse paradigms for advancing physical reasoning in robotics. Website and code: this https URL
中文摘要 与物理世界互动的机器人系统必须思考自身身体、环境和任务所施加的运动学和动态限制。我们介绍了KinDER，这是一个针对运动学与动态具身推理的基准，针对机器人学习和规划中出现的物理推理挑战。KinDER包含25个程序生成环境、一个兼容Gymnasium的Python库，包含参数化技能和演示，以及一个标准化的评估套件，包含13个已实现基线，涵盖任务与动作规划、模仿学习、强化学习和基于基础模型的方法。这些环境旨在隔离五大核心物理推理挑战：基本空间关系、非可抓取的多对象操作、工具使用、组合几何约束和动态约束，脱离感知、语言理解和应用特定复杂性。实证评估显示，现有方法难以解决许多环境问题，显示当前物理推理方法存在重大空白。我们还包括在移动机械臂上进行实物到模拟再到实物的实验，以评估仿真与现实物理交互之间的对应关系。KinDER完全开源，旨在实现跨多样范式的系统比较，以推动机器人物理推理的进步。网站和代码：这个 https URL

When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient

错误何时可能有益：策略梯度中不完美奖励的分类

Authors: Shuning Shang, Hubert Strauss, Stanley Wei, Sanjeev Arora, Noam Razin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2604.25872
Pdf link: https://arxiv.org/pdf/2604.25872
Abstract Training language models via reinforcement learning often relies on imperfect proxy rewards, since ground truth rewards that precisely define the intended behavior are rarely available. Standard metrics for assessing the quality of proxy rewards, such as ranking accuracy, treat incorrect rewards as strictly harmful. In this work, however, we highlight that not all deviations from the ground truth are equal. By theoretically analyzing which outputs attract probability during policy gradient optimization, we categorize reward errors according to their effect on the increase in ground truth reward. The analysis establishes that reward errors, though conventionally viewed as harmful, can also be benign or even beneficial by preventing the policy from stalling around outputs with mediocre ground truth reward. We then present two practical implications of our theory. First, for reinforcement learning from human feedback (RLHF), we develop reward model evaluation metrics that account for the harmfulness of reward errors. Compared to standard ranking accuracy, these metrics typically correlate better with the performance of a language model after RLHF, yet gaps remain in robustly evaluating reward models. Second, we provide insights for reward design in settings with verifiable rewards. A key theme underlying our results is that the effectiveness of a proxy reward function depends heavily on its interaction with the initial policy and learning algorithm.
中文摘要 通过强化学习训练语言模型通常依赖于不完美的代理奖励，因为精确定义预期行为的真实奖励很少可用。评估代理奖励质量的标准指标（如排序准确率）将错误奖励视为纯粹有害的。然而，在本研究中，我们强调并非所有对真实奖励的偏离都是等同的。通过理论分析在策略梯度优化过程中哪些输出会吸引概率质量，我们根据奖励误差对真实奖励提升的影响对其进行分类。分析表明，尽管奖励误差通常被视为有害，但它们也可能是良性的，甚至是有益的：它们能防止策略停滞在真实奖励平庸的输出附近。随后，我们给出了该理论的两个实际意义。首先，针对基于人类反馈的强化学习（RLHF），我们开发了考虑奖励误差危害性的奖励模型评估指标。与标准排序准确率相比，这些指标通常与 RLHF 后语言模型的性能相关性更好，但在稳健评估奖励模型方面仍存在差距。其次，我们为具有可验证奖励的场景中的奖励设计提供了见解。贯穿我们结果的一个关键主题是，代理奖励函数的有效性在很大程度上取决于它与初始策略和学习算法的相互作用。
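The abstract above names ranking accuracy as the standard metric for proxy rewards. The following is a minimal sketch of that metric, assuming it is computed as pairwise agreement between proxy and ground-truth rewards over a set of outputs; the function name `ranking_accuracy` and the toy reward values are hypothetical and not taken from the paper.

```python
from itertools import combinations


def ranking_accuracy(proxy, truth):
    """Fraction of output pairs (with distinct ground-truth rewards)
    that the proxy reward orders the same way as the ground truth."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(truth)), 2):
        if truth[i] == truth[j]:
            continue  # ties in the ground truth carry no ordering information
        total += 1
        if (proxy[i] - proxy[j]) * (truth[i] - truth[j]) > 0:
            agree += 1
    return agree / total if total else float("nan")


# Toy ground-truth rewards for three outputs (illustrative values only).
truth = [0.0, 0.5, 1.0]
# Proxy A swaps the two best outputs; proxy B swaps the two worst outputs.
proxy_a = [0.0, 1.0, 0.5]
proxy_b = [0.5, 0.0, 1.0]

print(ranking_accuracy(proxy_a, truth))
print(ranking_accuracy(proxy_b, truth))
```

Both proxies misorder exactly one of three pairs, so they receive the same score even though their errors sit in different places. This is the sense in which a pairwise metric treats all reward errors alike; which of the two errors matters more for policy gradient is the question the paper analyzes and is not settled by this sketch.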

Three Models of RLHF Annotation: Extension, Evidence, and Authority

RLHF注释的三种模型：扩展、证据和权威

Authors: Steve Coyne
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.25895
Pdf link: https://arxiv.org/pdf/2604.25895
Abstract Preference-based alignment methods, most prominently Reinforcement Learning with Human Feedback (RLHF), use the judgments of human annotators to shape large language model behaviour. However, the normative role of these judgments is rarely made explicit. I distinguish three conceptual models of that role. The first is extension: annotators extend the system designers' own judgments about what outputs should be. The second is evidence: annotators provide independent evidence about some facts, whether moral, social or otherwise. The third is authority: annotators have some independent authority (as representatives of the broader population) to determine system outputs. I argue that these models have implications for how RLHF pipelines should solicit, validate and aggregate annotations. I survey landmark papers in the literature on RLHF and related methods to illustrate how they implicitly draw on these models, describe failure modes that come from unintentionally or intentionally conflating them, and offer normative criteria for choosing among them. My central recommendation is that RLHF pipeline designers should decompose annotation into separable dimensions and tailor each pipeline to the model most appropriate for that dimension, rather than seeking a single unified pipeline.
中文摘要 基于偏好的对齐方法，最著名的是基于人类反馈的强化学习（RLHF），利用人类标注者的判断来塑造大型语言模型的行为。然而，这些判断的规范性作用很少被明确说明。我区分了三种该角色的概念模型。第一种是扩展：注释者扩展了系统设计者对输出应有的判断。第二是证据：注释者提供关于某些事实的独立证据，无论是道德、社会还是其他方面。第三是权威：标注者拥有某种独立权威（作为更广泛群体的代表）来决定系统输出。我认为这些模型对RLHF管道应如何征求、验证和汇总注释具有重要意义。我综述了关于RLHF及相关方法文献中的里程碑论文，说明它们如何隐含地借鉴这些模型，描述因无意或有意混淆而产生的失效模式，并为选择这些模型提供规范性标准。我的核心建议是，RLHF流水线设计师应将注释分解为可分维度，并根据该维度最合适的模型定制每条流水线，而非寻求单一统一的流水线。

TSN-Affinity: Similarity-Driven Parameter Reuse for Continual Offline Reinforcement Learning

TSN亲和力：相似度驱动参数重用用于持续离线强化学习

Authors: Dominik Żurek, Kamil Faber, Marcin Pietron, Paweł Gajewski, Roberto Corizzo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.25898
Pdf link: https://arxiv.org/pdf/2604.25898
Abstract Continual offline reinforcement learning (CORL) aims to learn a sequence of tasks from datasets collected over time while preserving performance on previously learned tasks. This setting corresponds to domains where new tasks arise over time, but adapting the model in live environment interactions is expensive, risky, or impossible. However, CORL inherits the dual difficulty of offline reinforcement learning and adapting while preventing catastrophic forgetting. Replay-based continual learning approaches remain a strong baseline but incur memory overhead and suffer from a distribution mismatch between replayed samples and newly learned policies. At the same time, architectural continual learning methods have shown strong potential in supervised learning but remain underexplored in CORL. In this work, we propose TSN-Affinity, a novel CORL method based on TinySubNetworks and Decision Transformer. The method enables task-specific parameterization and controlled knowledge sharing through an RL-aware reuse strategy that routes tasks according to action compatibility and latent similarity. We evaluate the approach on benchmarks based on Atari games and simulations of manipulation tasks with the Franka Emika Panda robotic arm, covering both discrete and continuous control. Results show strong retention from sparse SubNetworks, with routing further improving multi-task performance. Our findings suggest that similarity-guided architectural reuse is a strong and viable alternative to replay-based strategies in a CORL setting. Our code is available at: this https URL.
中文摘要 持续离线强化学习（CORL）旨在从长期收集的数据集中学习一系列任务，同时保持先前学习任务的性能。这种环境对应于随着时间推移产生新任务的领域，但在实时环境交互中调整模型既昂贵又存在风险，甚至不可能实现。然而，CORL继承了离线强化学习和适应的双重难点，同时防止灾难性遗忘。基于回放的持续学习方法依然是强有力的基础，但会产生记忆开销，并且在重放样本与新学策略之间存在分布不匹配。与此同时，架构持续学习方法在监督学习中展现出强大潜力，但在CORL中仍未被充分探索。在本研究中，我们提出了TSN-Affinity，一种基于微小子网络和决策变换器的新型CORL方法。该方法通过强化学习（RL）感知的重用策略实现任务特定参数化和受控知识共享，该策略根据动作兼容性和潜在相似性路由任务。我们基于Atari游戏基准测试，以及Franka Emika Panda机械臂操作任务的模拟，涵盖离散和连续控制。结果显示稀疏子网的保留率强，进一步路由提升了多任务性能。我们的发现表明，基于相似性引导的架构再利用是CORL环境中基于重放策略的强大且可行的替代方案。我们的代码可在以下 https URL 获取。

How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum

模型应该多快投入监督？在 Tsallis 损失连续体上训练推理模型

Authors: Chu-Cheng Lin, Eugene Ie
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.25907
Pdf link: https://arxiv.org/pdf/2604.25907
Abstract Adapting reasoning models to new tasks during post-training with only output-level supervision stalls under reinforcement learning from verifiable rewards (RLVR) when the initial success probability $p_0$ is small. Using the Tsallis $q$-logarithm, we define a loss family $J_Q$ that interpolates between RLVR (at $q{=}0$, the exploitation pole) and the log-marginal-likelihood over latent trajectories (at $q{=}1$, the density-estimation pole). All members share the same per-example gradient direction, differing only by a scalar amplification $P_\theta^{-q}$ that reweights each instance independently of the learning rate. This amplification is the mechanism that addresses cold-start stalling: under gradient flow, the exploitation pole requires $\Omega(\frac{1}{p_0})$ time to escape cold start, while the density-estimation pole escapes in $\Theta\big(\log(\frac{1}{p_0})\big)$; intermediate $q$ trades escape speed against noise memorization. Because $P_\theta$ is intractable, we derive two Monte Carlo estimators from the two factorizations of the gradient: Gradient-Amplified RL (GARL) samples from the prior and amplifies the RL gradient, and Posterior-Attenuated Fine-Tuning (PAFT) importance-resamples from the posterior and runs standard SFT. Both have bias $O\big(\frac{q}{M P_{\theta}^{q+1}}\big)$; GARL has lower variance, PAFT has semantically coherent gradients. On FinQA, HotPotQA, and MuSiQue, GARL at $q{=}0.75$ substantially mitigates cold-start stalling, escaping cold start where GRPO fails entirely. In warm start, GARL at low $q$ dominates FinQA where training is stable; on HotPotQA and MuSiQue, GARL destabilizes during training, and PAFT at $q{=}0.75$ provides stable gradients (best overall on HotPotQA at 47.9 maj@16, $+14.4$ over GRPO).
中文摘要 在后训练阶段仅凭输出级监督将推理模型适配到新任务时，若初始成功概率 $p_0$ 很小，基于可验证奖励的强化学习（RLVR）会陷入停滞。利用 Tsallis $q$-对数，我们定义了一个损失族 $J_Q$，它在 RLVR（$q{=}0$，利用极点）与潜在轨迹上的对数边际似然（$q{=}1$，密度估计极点）之间插值。所有成员共享相同的逐样本梯度方向，仅相差一个标量放大因子 $P_\theta^{-q}$，该因子独立于学习率对每个样本重新加权。这种放大正是解决冷启动停滞的机制：在梯度流下，利用极点需要 $\Omega(\frac{1}{p_0})$ 的时间才能脱离冷启动，而密度估计极点只需 $\Theta\big(\log(\frac{1}{p_0})\big)$；中间的 $q$ 则在脱离速度与噪声记忆之间进行权衡。由于 $P_\theta$ 难以计算，我们从梯度的两种分解中推导出两个蒙特卡洛估计量：梯度放大强化学习（GARL）从先验中采样并放大 RL 梯度；后验衰减微调（PAFT）从后验中进行重要性重采样并运行标准 SFT。两者的偏差均为 $O\big(\frac{q}{M P_{\theta}^{q+1}}\big)$；GARL 的方差更低，PAFT 的梯度在语义上更连贯。在 FinQA、HotPotQA 和 MuSiQue 上，$q{=}0.75$ 的 GARL 大幅缓解了冷启动停滞，在 GRPO 完全失败的情况下成功脱离冷启动。在热启动下，低 $q$ 的 GARL 在训练稳定的 FinQA 上占优；在 HotPotQA 和 MuSiQue 上，GARL 在训练过程中会失稳，而 $q{=}0.75$ 的 PAFT 提供了稳定的梯度（在 HotPotQA 上以 47.9 maj@16 取得整体最佳，比 GRPO 高 $+14.4$）。
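To make the interpolation in the abstract above concrete, here is a minimal sketch of the Tsallis $q$-logarithm and the scalar amplification it induces. It assumes the objective applies the $q$-logarithm to the success probability $P_\theta$, so that the chain rule yields the factor $P_\theta^{-q}$; that reading is inferred from the abstract rather than confirmed by it, and the function names are hypothetical.

```python
import math


def tsallis_log(p: float, q: float) -> float:
    """Tsallis q-logarithm: (p**(1 - q) - 1) / (1 - q).
    The natural logarithm is the q -> 1 limit."""
    if abs(q - 1.0) < 1e-12:
        return math.log(p)
    return (p ** (1.0 - q) - 1.0) / (1.0 - q)


def amplification(p: float, q: float) -> float:
    """Derivative of tsallis_log with respect to p, i.e. p**(-q).
    Assumed here to be the scalar that rescales the gradient of p."""
    return p ** (-q)


if __name__ == "__main__":
    p0 = 1e-3  # illustrative small initial success probability
    for q in (0.0, 0.5, 0.75, 1.0):
        print(f"q={q:<4} log_q(p0)={tsallis_log(p0, q):10.4f} "
              f"amplification={amplification(p0, q):10.2f}")
```

At $q{=}0$ the $q$-logarithm reduces to $p - 1$ and the amplification is 1, which matches the plain success-probability objective; at $q{=}1$ it is $\log p$ with amplification $1/p$. Intermediate values of $q$ fall between these two, which is consistent with the trade-off the abstract describes.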

Keyword: diffusion policy

There is no result