Arxiv Papers of Today

Generated: 2026-04-24 17:46:13 (UTC+8); arXiv release time: 2026-04-24 20:00 EDT (2026-04-25 08:00 UTC+8)

21 relevant papers today

Keyword: reinforcement learning

Deep Interest Mining with Cross-Modal Alignment for SemanticID Generation in Generative Recommendation

基于跨模态对齐的深度兴趣挖掘，用于生成式推荐中的语义ID生成

Authors: Yagchen Zeng
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.20861
Pdf link: https://arxiv.org/pdf/2604.20861
Abstract Generative Recommendation (GR) has demonstrated remarkable performance in next-token prediction paradigms, relying on Semantic IDs (SIDs) to compress trillion-scale data into learnable vocabulary sequences. However, existing methods suffer from three critical limitations: (1) Information Degradation: the two-stage compression pipeline causes semantic loss and information degradation, with no posterior mechanism to distinguish high-quality from low-quality SIDs; (2) Semantic Degradation: cascaded quantization discards key semantic information from original multimodal features, as the embedding generation and quantization stages are not jointly optimized toward a unified objective; (3) Modality Distortion: quantizers fail to properly align text and image modalities, causing feature misalignment even when upstream networks have aligned them. To address these challenges, we propose a novel framework integrating three key innovations: Deep Contextual Interest Mining (DCIM), Cross-Modal Semantic Alignment (CMSA), and Quality-Aware Reinforcement Mechanism (QARM). First, we leverage Vision-Language Models (VLMs) to align non-textual modalities into a unified text-based semantic space, mitigating modality distortion. Second, we introduce a deep interest mining mechanism that captures high-level semantic information implicitly present in advertising contexts, encouraging SIDs to preserve critical contextual information through reconstruction-based supervision. Third, we employ a reinforcement learning framework with quality-aware rewards to encourage semantically rich SIDs while suppressing low-quality ones in the posterior stage. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art SID generation methods, achieving superior performance on multiple benchmarks. Ablation studies further validate the effectiveness of each proposed component.
中文摘要 生成推荐（GR）在下一令牌预测范式中表现出显著性能，该范式依赖语义ID（SID）将万亿级数据压缩为可学习词汇序列。然而，现有方法存在三个关键局限：（1）信息退化：两阶段压缩流水线导致语义丢失和信息退化，且无后验机制区分高质量与低质量SID;（2）语义退化：级联量化丢弃了原始多模态特征中的关键语义信息，因为嵌入生成阶段和量化阶段未能共同优化以实现统一目标;（3）模态失真：量化器未能正确对齐文本和图像模态，导致特征错位，即使上游网络已对齐。为应对这些挑战，我们提出了一个新颖框架，整合了三项关键创新：深度上下文兴趣挖掘（DCIM）、跨模态语义对齐（CMSA）和质量感知强化机制（QARM）。首先，我们利用视觉语言模型（VLMs）将非文本模态对齐到统一的基于文本的语义空间，减轻模态扭曲。其次，我们引入深度兴趣挖掘机制，捕捉广告语境中隐含的高层语义信息，鼓励SID通过重建监督保存关键语境信息。第三，我们采用强化学习框架，提供质量感知奖励，鼓励语义丰富的SID，同时抑制后期低质量SID。大量实验表明，我们的方法始终优于最先进的SID生成方法，在多个基准测试中表现更优。消融研究进一步验证了每个拟议组件的有效性

Reinforcing privacy reasoning in LLMs via normative simulacra from fiction

通过虚构作品中的规范模拟来强化LLM中的隐私推理

Authors: Matt Franchi, Madiha Zahrah Choksi, Harold Triedman, Helen Nissenbaum
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.20904
Pdf link: https://arxiv.org/pdf/2604.20904
Abstract Information handling practices of LLM agents are broadly misaligned with the contextual privacy expectations of their users. Contextual Integrity (CI) provides a principled framework, defining privacy as the appropriate flow of information within context-relative norms. However, existing approaches either double inference cost via supervisor-assistant architectures, or fine-tune on narrow task-specific data. We propose extracting normative simulacra (structured representations of norms and information flows) from fiction novels and using them to fine-tune LLMs via supervised learning followed by GRPO reinforcement learning. Our composite reward function combines programmatic signals, including task clarity (subsuming schema validity, construct discrimination, and extraction confidence), structural completeness, internal consistency, and context identification, with an LLM judge that evaluates whether the model's privacy reasoning is grounded in the held-out normative universe of the source text. To mitigate overfitting, we introduce per-completion contrastive scoring: each completion is evaluated against both the correct normative universe and a randomly selected wrong one, teaching the model to condition on context rather than memorize source-specific norms. We evaluate on five CI-aligned benchmarks spanning distinct societal contexts and ablate the contributions of RL and normative grounding. Across seven models, SFT introduces a conservative prior toward restricting information flow, improving recognition of privacy-relevant situations but not the correctness of privacy judgments. GRPO with normative grounding achieves the highest score on a law compliance benchmark and strongest correlation with crowdsourced human privacy expectations, demonstrating that fiction-derived normative simulacra can teach contextual privacy reasoning that transfers to real-world domains.
中文摘要 LLM代理的信息处理实践与用户的上下文隐私期望普遍不一致。上下文完整性（CI）提供了一个原则性框架，将隐私定义为上下文相对规范内信息的适当流动。然而，现有方法要么通过监督-助手架构使推理成本翻倍，要么在狭义任务特定数据上进行微调。我们提出从虚构小说中提取规范模拟（规范和信息流的结构化表示），并通过监督学习和GRPO强化学习对LLM进行微调。我们的复合奖励函数结合了程序信号，包括任务清晰度（包含模式效度、构造判别和提取信心）、结构完整性、内部一致性和上下文识别，以及评估模型隐私推理是否基于源文本所保留规范宇宙的LLM评判。为减少过度拟合，我们引入了每完备对比评分：每次完备评估都结合正确的规范宇宙和随机选择的错误宇宙进行评估，教导模型根据上下文进行条件化，而非死记于心的来源特定规范。我们基于五个CI对齐的基准测试，涵盖不同的社会背景，并削弱强化学习和规范基础的贡献。在七个模型中，SFT引入了限制信息流的保守先验，提高了隐私相关情境的识别，但未提升隐私判断的正确性。以规范为基础的GRPO在法律合规基准中获得最高分，并且与众包人类隐私期望的相关性最强，证明虚构衍生的规范模拟物能够教授适用于现实世界的上下文隐私推理。
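
The per-completion contrastive scoring described in the abstract above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `judge` callable, the toy keyword-overlap judge, the example norms, and the choice to combine the two judge scores as a simple difference are all hypothetical, since the abstract does not say how the scores against the correct and wrong normative universes are combined.

```python
import random
from typing import Callable, Dict, List


def contrastive_score(
    completion: str,
    source_id: str,
    universes: Dict[str, List[str]],
    judge: Callable[[str, List[str]], float],
    rng: random.Random,
) -> float:
    """Score a completion against its own normative universe and a random wrong one.

    Returns judge(correct) - judge(wrong). The difference is an assumed way of
    combining the two scores, not something stated in the abstract.
    """
    wrong_ids = [uid for uid in universes if uid != source_id]
    wrong_id = rng.choice(wrong_ids)  # randomly selected wrong universe
    grounded = judge(completion, universes[source_id])
    ungrounded = judge(completion, universes[wrong_id])
    return grounded - ungrounded


def toy_judge(completion: str, norms: List[str]) -> float:
    """Hypothetical stand-in for the LLM judge: fraction of norms quoted verbatim."""
    text = completion.lower()
    return sum(norm.lower() in text for norm in norms) / len(norms)


# Two made-up normative universes, each a list of norms extracted from one novel.
universes = {
    "novel_a": ["letters are private", "servants keep confidences"],
    "novel_b": ["medical records stay with the doctor", "gossip is tolerated"],
}
completion = "Sharing is inappropriate because letters are private in this household."
score = contrastive_score(completion, "novel_a", universes, toy_judge, random.Random(0))
print(score)
```

A completion that scores equally well against both universes gets a contrastive score near zero, which is the intuition behind rewarding conditioning on context rather than memorized, source-specific norms.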

A Systematic Review and Taxonomy of Reinforcement Learning-Model Predictive Control Integration for Linear Systems

强化学习模型预测控制集成的系统综述与分类学

Authors: Mohsen Jalaeian Farimani, Roya Khalili Amirabadi, Davoud Nikkhouy, Malihe Abdolbaghi, Mahshad Rastegarmoghaddam, Shima Samadzadeh
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Robotics (cs.RO); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2604.21030
Pdf link: https://arxiv.org/pdf/2604.21030
Abstract The integration of Model Predictive Control (MPC) and Reinforcement Learning (RL) has emerged as a promising paradigm for constrained decision-making and adaptive control. MPC offers structured optimization, explicit constraint handling, and established stability tools, whereas RL provides data-driven adaptation and performance improvement in the presence of uncertainty and model mismatch. Despite the rapid growth of research on RL-MPC integration, the literature remains fragmented, particularly for control architectures built on linear or linearized predictive models. This paper presents a comprehensive Systematic Literature Review (SLR) of RL-MPC integrations for linear and linearized systems, covering peer-reviewed and formally indexed studies published until 2025. The reviewed studies are organized through a multi-dimensional taxonomy covering RL functional roles, RL algorithm classes, MPC formulations, cost-function structures, and application domains. In addition, a cross-dimensional synthesis is conducted to identify recurring design patterns and reported associations among these dimensions within the reviewed corpus. The review highlights methodological trends, commonly adopted integration strategies, and recurring practical challenges, including computational burden, sample efficiency, robustness, and closed-loop guarantees. The resulting synthesis provides a structured reference for researchers and practitioners seeking to design or analyze RL-MPC architectures based on linear or linearized predictive control formulations.
中文摘要 模型预测控制（MPC）与强化学习（RL）的整合已成为受限决策和自适应控制的有前景范式。MPC提供结构化优化、显式约束处理和成熟的稳定性工具，而强化学习则在存在不确定性和模型不匹配的情况下提供数据驱动的适应和性能提升。尽管关于强化学习与MPC集成的研究迅速增长，但文献仍然零散，尤其是基于线性或线性化预测模型的控制架构。本文全面呈现了针对线性和线性化系统的RL-MPC集成的系统性文献综述（SLR），涵盖截至2025年发表的同行评审和正式索引研究。所评述的研究通过多维分类法组织，涵盖强化学习功能角色、强化学习算法类别、MPC表述、成本函数结构及应用领域。此外，还进行了跨维综合分析，以识别所审语料库中重复出现的设计模式及这些维度之间的关联。综述重点介绍了方法学趋势、常用的集成策略以及反复出现的实际挑战，包括计算负担、样本效率、鲁棒性和闭环保证。最终的综合为寻求基于线性或线性化预测控制公式的强化学习-MPC架构的研究者和实践者提供了结构化参考。

Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models

聚焦推理：视觉语言模型中的有状态、基于动作的视觉聚焦

Authors: Juhong Min, Lazar Valkov, Vitali Petsiuk, Hossein Souri, Deen Dayal Mohan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.21079
Pdf link: https://arxiv.org/pdf/2604.21079
Abstract Vision-language models benefit from high-resolution images, but the increase in visual-token count incurs high compute overhead. Humans resolve this tension via foveation: a coarse view guides "where to look", while selectively acquired high-acuity evidence refines "what to think". We introduce Foveated Reasoner, an autoregressive vision-language framework that unifies foveation and reasoning within a single decoding trajectory. Starting from a low-resolution view, the model triggers foveation only when needed, retrieves high-resolution evidence from selected regions, and injects it back into the same decoding trajectory. We train the method with a two-stage pipeline: coldstart supervision to bootstrap foveation behavior, followed by reinforcement learning to jointly improve evidence acquisition and task accuracy while discouraging trivial "see-everything" solutions. Experiments show that the method learns effective foveation policies and achieves stronger accuracy under tight visual-token budgets across multiple vision-language benchmarks.
中文摘要 视觉语言模型受益于高分辨率图像，但视觉标记数量的增加带来了较高的计算开销。人类通过聚焦来解决这种紧张：粗视引导“寻找哪里”，而选择性获取的高敏度证据则细化“思考内容”。我们引入了Foveated Reasoner，一种自回归视觉语言框架，将焦点和推理统一在单一解码轨迹中。从低分辨率视图出发，模型仅在需要时触发焦点，从选定区域检索高分辨率证据，并注入相同的解码轨迹。我们通过两阶段流程训练该方法：冷启动监督以启动聚焦行为，随后强化学习，共同提升证据采集和任务准确性，同时防止琐碎的“全视”解决方案。实验显示，该方法能够学习有效的焦点策略，并在多个视觉语言基准测试中，在严格的视觉标记预算下实现更强的准确性。

Self-Predictive Representation for Autonomous UAV Object-Goal Navigation

自主无人机目标导航的自我预测表示

Authors: Angel Ayala, Donling Sui, Francisco Cruz, Mitchell Torok, Mohammad Deghat, Bruno J. T. Fernandes
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.21130
Pdf link: https://arxiv.org/pdf/2604.21130
Abstract Autonomous Unmanned Aerial Vehicles (UAVs) have revolutionized industries through their versatility, with applications including aerial surveillance, search and rescue, agriculture, and delivery. Their autonomous capabilities offer unique advantages, such as operating in large open-space environments. Reinforcement Learning (RL) empowers UAVs to learn intricate navigation policies, enabling them to optimize flight behavior autonomously. However, one of its main challenges is sample inefficiency: many data samples are needed to reach a good policy. In object-goal navigation (OGN) settings, target recognition arises as an extra challenge. Most UAV-related approaches use relative or absolute coordinates to move from an initial position to a predefined location, rather than finding the target directly. This study addresses data sample efficiency in solving a 3D OGN problem and formalizes the unknown-target-location setting as a Markov decision process. Experiments are conducted to analyze the interplay of different state representation learning (SRL) methods for perception with a model-free RL algorithm for planning in an autonomous navigation system. The main contribution of this study is the development of the perception module, featuring a novel self-predictive model named AmelPred. Empirical results demonstrate that its stochastic version, AmelPredSto, is the best-performing SRL model when combined with actor-critic RL algorithms. The results show that using AmelPredSto substantially improves the efficiency of RL algorithms in solving the OGN problem.
中文摘要 自主无人机（UAV）凭借其多功能性，在空中监视、搜救、农业和投递等领域实现了革命性创新。其自主能力带来了独特优势，例如在大型开放空间环境中运行。强化学习（RL）使无人机能够学习复杂的导航策略，使其能够自主优化飞行行为。然而，其主要挑战之一是利用数据样本实现良好策略的效率低下。在目标导航（OGN）环境中，目标识别成为额外的挑战。大多数与无人机相关的方法使用相对或绝对坐标从初始位置移动到预定义位置，而非直接找到目标。本研究解决了解决三维OGN问题中数据样本效率问题的问题，同时将未知目标位置设置形式化为马尔可夫决策过程。实验分析了不同状态表示学习（SRL）感知方法与无模型强化学习算法在自主导航系统中规划的相互作用。本研究的主要贡献是感知模块的开发，其中包含一个名为AmelPred的新型自预测模型。实证结果表明，其随机版本AmelPredSto在与actor-critic RL算法结合时，是表现最佳的SRL模型。获得的结果显示，使用AmelPredSto解决OGN问题显著提升了强化学习算法的效率。

Adaptive Instruction Composition for Automated LLM Red-Teaming

自动化大型语言模型红团队的自适应指令组合

Authors: Jesse Zymet, Andy Luo, Swapnil Shinde, Sahil Wadhwa, Emily Chen
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.21159
Pdf link: https://arxiv.org/pdf/2604.21159
Abstract Many approaches to LLM red-teaming leverage an attacker LLM to discover jailbreaks against a target. Several of them task the attacker with identifying effective strategies through trial and error, resulting in a semantically limited range of successes. Another approach discovers diverse attacks by combining crowdsourced harmful queries and tactics into instructions for the attacker, but does so at random, limiting effectiveness. This article introduces a novel framework, Adaptive Instruction Composition, that combines crowdsourced texts according to an adaptive mechanism trained to jointly optimize effectiveness with diversity. We use reinforcement learning to balance exploration with exploitation in a combinatorial space of instructions to guide the attacker toward diverse generations tailored to target vulnerabilities. We demonstrate that our approach substantially outperforms random combination on a set of effectiveness and diversity metrics, even under model transfer. Further, we show that it surpasses a host of recent adaptive approaches on Harmbench. We employ a lightweight neural contextual bandit that adapts to contrastive embedding inputs, and provide ablations suggesting that the contrastive pretraining enables the network to rapidly generalize and scale to the massive space as it learns.
中文摘要 许多大型语言模型红团队化方法利用攻击者LLM发现针对目标的越狱方法。其中几种方法要求攻击者通过反复试验识别有效策略，导致语义上有限的成功范围。另一种方法是将众包的有害查询和战术结合进攻击者的指令中，随机发现多样攻击，限制了效果。本文介绍了一个新框架——自适应指令组合，根据自适应机制结合众包文本，训练以联合优化有效性和多样性。我们利用强化学习在组合指令空间中平衡探索与利用，引导攻击者走向针对漏洞的多元生成。我们证明，即使在模型转移下，我们的方法在一套有效性和多样性指标上远远优于随机组合。此外，我们展示了它在Harmbench上许多近期的自适应方法。我们采用了轻量级神经上下文强盗，能够适应对比嵌入输入，并提供了消融，表明对比预训练使网络能够快速泛化并在庞大空间中扩展。

Reinforcing 3D Understanding in Point-VLMs via Geometric Reward Credit Assignment

通过几何奖励积分赋值强化点VLM中的三维理解

Authors: Jingkun Chen, Ruoshi Xu, Mingqi Gao, Shengda Luo, Jungong Han
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.21160
Pdf link: https://arxiv.org/pdf/2604.21160
Abstract Point-Vision-Language Models promise to empower embodied agents with executable spatial reasoning, yet they frequently succumb to geometric hallucination where predicted 3D structures contradict the observed 2D reality. We identify a key cause of this failure not as a representation bottleneck but as a structural misalignment in reinforcement learning, where sparse geometric tokens are drowned out by noisy and broadcasted sequence-level rewards. To resolve this causal dilution, we propose Geometric Reward Credit Assignment, a framework that disentangles holistic supervision into field-specific signals and routes them exclusively to their responsible token spans. This mechanism transforms vague feedback into precise gradient updates and effectively turns generic policy optimization into targeted structural alignment. Furthermore, we internalize physical constraints via a Reprojection-Consistency term which serves as a cross-modal verifier to penalize physically impossible geometries. Validated on a calibrated benchmark derived from ShapeNetCore, our approach bridges the reliability gap by boosting 3D KPA from 0.64 to 0.93, increasing 3D bounding box intersection over union to 0.686, and raising reprojection consistency scores to 0.852. Crucially, these gains are achieved while maintaining robust 2D localization performance, marking a meaningful step from plausible textual outputs toward physically verifiable spatial predictions.
中文摘要 点-视觉-语言模型承诺赋予具身智能体可执行的空间推理能力，但当预测的三维结构与观察到的二维现实相矛盾时，它们常常陷入几何幻觉。我们指出，这种失败的关键原因不是表征瓶颈，而是强化学习中的结构错位，稀疏的几何代币被噪声和广播的序列级奖励淹没。为解决这种因果稀释，我们提出了几何奖励积分分配框架，将整体监督解码为场特定信号，并将其专门路由到其负责任的代币跨度。该机制将模糊的反馈转化为精确的梯度更新，有效将通用策略优化转化为有针对性的结构对齐。此外，我们通过重投影-一致性项内化物理约束，作为跨模态验证器，惩罚物理上不可能的几何。基于ShapeNetCore的校准基准验证，我们的方法通过将三维KPA从0.64提升至0.93，将三维边界盒交集提升至0.686，并将重投影一致性分数提升至0.852，弥合了可靠性差距。关键是，这些提升是在保持稳健二维定位性能的同时实现的，标志着从合理文本输出向物理可验证空间预测迈出重要一步。
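
The contrast the abstract above draws between broadcasting one sequence-level reward and routing field-specific rewards to their responsible token spans can be illustrated with a small sketch. The field names, span positions, and reward values below are hypothetical, and the functions are a simplified picture of the idea rather than the authors' implementation.

```python
from typing import Dict, List, Tuple


def broadcast_reward(num_tokens: int, reward: float) -> List[float]:
    """Baseline: every token receives the same sequence-level reward."""
    return [reward] * num_tokens


def route_field_rewards(
    num_tokens: int,
    spans: Dict[str, Tuple[int, int]],
    field_rewards: Dict[str, float],
) -> List[float]:
    """Give each field's reward only to the tokens in its span [start, end)."""
    per_token = [0.0] * num_tokens
    for field, (start, end) in spans.items():
        for i in range(start, end):
            per_token[i] = field_rewards[field]
    return per_token


# Hypothetical 10-token output: tokens 2-4 encode a 2D box, tokens 6-9 a 3D box.
spans = {"bbox_2d": (2, 5), "bbox_3d": (6, 10)}
field_rewards = {"bbox_2d": 0.9, "bbox_3d": 0.2}

broadcast = broadcast_reward(10, 0.55)
routed = route_field_rewards(10, spans, field_rewards)
print(broadcast)
print(routed)
```

In the broadcast case the poorly predicted 3D tokens receive the same signal as the well-predicted 2D tokens; in the routed case only the 3D span sees the low reward.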

CAP: Controllable Alignment Prompting for Unlearning in LLMs

CAP：大型语言模型中可控对齐提示，用于去除学习

Authors: Zhaokun Wang, Jinyu Guo, Jingwen Pu, Hongli Pu, Meng Yang, Xunlei Chen, Jie Ou, Wenyi Li, Guangchun Luo, Wenhong Tian
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.21251
Pdf link: https://arxiv.org/pdf/2604.21251
Abstract Large language models (LLMs) trained on unfiltered corpora inherently risk retaining sensitive information, necessitating selective knowledge unlearning for regulatory compliance and ethical safety. However, existing parameter-modifying methods face fundamental limitations: high computational costs, uncontrollable forgetting boundaries, and strict dependency on model weight access. These constraints render them impractical for closed-source models, yet current non-invasive alternatives remain unsystematic and reliant on empirical experience. To address these challenges, we propose the Controllable Alignment Prompting for Unlearning (CAP) framework, an end-to-end prompt-driven unlearning paradigm. CAP decouples unlearning into a learnable prompt optimization process via reinforcement learning, where a prompt generator collaborates with the LLM to suppress target knowledge while preserving general capabilities selectively. This approach enables reversible knowledge restoration through prompt revocation. Extensive experiments demonstrate that CAP achieves precise, controllable unlearning without updating model parameters, establishing a dynamic alignment mechanism that overcomes the transferability limitations of prior methods.
中文摘要 在未过滤语料库上训练的大型语言模型（LLM）本质上存在敏感信息的保留风险，因此为了合规和伦理安全，需要选择性知识去学习。然而，现有参数修改方法面临根本性局限：高计算成本、无法控制的遗忘边界以及对模型权重访问的严格依赖。这些限制使其在闭源模型中不切实际，但当前的非侵入性替代方案仍缺乏系统性，依赖实证经验。为应对这些挑战，我们提出了可控对齐提示去学习（CAP）框架，这是一种端到端的提示驱动去学习范式。CAP将去学习解耦为可学习的提示优化过程，通过强化学习，提示生成器与大语言模型协作，抑制目标知识，同时选择性保留一般能力。这种方法通过提示撤回实现可逆的知识恢复。大量实验表明，CAP能够在不更新模型参数的情况下实现精确且可控的复学，建立了一种动态比对机制，克服了以往方法的可迁移性局限。

Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding

再测一次，点击一次：通过强化学习共同进化提案者与视觉批评者，帮助GUI扎根

Authors: Wenkai Wang, Xiyun Li, Hongcan Guo, Wenhao Yu, Tianqing Fang, Haitao Mi, Dong Yu, Shengyu Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.21268
Pdf link: https://arxiv.org/pdf/2604.21268
Abstract Graphical User Interface (GUI) grounding requires mapping natural language instructions to precise pixel coordinates. However, due to visually homogeneous elements and dense layouts, models typically grasp semantic intent yet struggle to localize precisely. While scaling sampling attempts (Pass@k) reveals potential gains, static self-consistency strategies derived from geometric clustering often yield limited improvements, as the model's predictions tend to be spatially dispersed. In this paper, we propose replacing static consistency strategies with a learnable selection mechanism that selects the optimal target by critiquing its own proposals rendered on the screenshot. Given the significant disparity between the model's grounding and critiquing capabilities, we propose a co-evolving Propose-then-Critic framework. To jointly optimize them, we introduce a maturity-aware adaptive co-evolutionary reinforcement learning paradigm. This approach dynamically balances the training objectives of the proposer and the critic: the diversity of the proposer's outputs enhances critic robustness, while the critic's maturing discrimination capability in turn unlocks the proposer's potential for extensive spatial exploration. The two capabilities thus reinforce each other and co-evolve, which supports generalization to diverse and complex interface layouts. Extensive experiments over 6 benchmarks show that our method significantly enhances both grounding accuracy and critic reliability.
中文摘要 图形用户界面（GUI）基础化需要将自然语言指令映射到精确的像素坐标。然而，由于视觉上元素同质化且布局密集，模型通常能掌握语义意图，但在实现精确定位方面遇到困难。虽然缩放采样尝试（Pass@k）揭示了潜在收益，但基于几何聚类的静态自洽策略往往带来有限的改进，因为模型的预测往往在空间上分散。本文提出用一种可学习的选择机制取代静态一致性策略，通过批判截图中自身的提案来选择最优目标。鉴于模型基础化与批判能力之间的显著差异，我们提出了一个共进化的提案-再批判框架。为共同优化这些框架，我们引入了成熟度感知的自适应共进强化学习范式。这种方法动态平衡了提议者和批评者的训练目标，提议者输出的多样性增强了批评者的稳健性，而批评者的成熟辨别能力则释放了提出者的广泛空间探索潜力，促进双方能力的相互强化与共进化，从而确保了适应多样复杂界面布局的泛化性。针对6个基准测试的广泛实验表明，我们的方法显著提升了基础准确性和批评者可靠性。

Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning

理解并缓解数学推理测试时强化学习中的虚假信号放大

Authors: Yongcan Yu, Lingxiao He, Jian Liang, Kuangpu Guo, Meng Wang, Qianlong Xie, Xingxing Wang, Ran He
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.21327
Pdf link: https://arxiv.org/pdf/2604.21327
Abstract Test-time reinforcement learning (TTRL) always adapts models at inference time via pseudo-labeling, leaving it vulnerable to spurious optimization signals from label noise. Through an empirical study, we observe that responses with medium consistency form an ambiguity region and constitute the primary source of reward noise. Crucially, we find that such spurious signals can even be amplified through group-relative advantage estimation. Motivated by these findings, we propose a unified framework, Debiased and Denoised test-time Reinforcement Learning (DDRL), to mitigate spurious signals. Concretely, DDRL first applies a frequency-based sampling strategy to exclude ambiguous samples while maintaining a balanced set of positive and negative examples. It then adopts a debiased advantage estimation with fixed advantages, removing the bias introduced by group-relative policy optimization. Finally, DDRL incorporates a consensus-based off-policy refinement stage, which leverages the rejection-sampled dataset to enable efficient and stable model updates. Experiments on three large language models across multiple mathematical reasoning benchmarks demonstrate that DDRL consistently outperforms existing TTRL baselines. The code will soon be released at this https URL.
中文摘要 测试时间强化学习（TTRL）总是通过伪标记在推断时调整模型，使其容易受到标签噪声产生的虚假优化信号影响。通过实证研究，我们观察到中等一致性的响应形成歧义区域，是奖励噪声的主要来源。关键是，我们发现这些虚假信号甚至可以通过群体相对优势估计被放大。基于这些发现，我们提出了一个统一框架——去偏和去噪测试时间强化学习（DDRL），以减轻虚假信号。具体来说，DDRL首先采用基于频率的抽样策略，排除歧义样本，同时保持正面和负面样本的平衡。随后采用带有固定优势的去偏优势估计，消除了群体相对策略优化带来的偏见。最后，DDRL采用基于共识的非策略优化阶段，利用拒绝抽样数据集实现高效稳定的模型更新。在多个数学推理基准测试中对三个大型语言模型的实验表明，DDRL持续优于现有TTRL基线。代码即将在此https URL发布。
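
To make the setting in the abstract above concrete, here is a minimal sketch of majority-vote pseudo-labeling followed by group-relative advantage estimation, as these steps are commonly described for test-time RL with GRPO-style normalization. It is not the DDRL method itself; the function names, the sampled answers, and the `eps` constant are assumptions for illustration.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import List, Tuple


def majority_pseudo_label(answers: List[str]) -> Tuple[str, float]:
    """Return the most frequent answer and its share of the group (consistency)."""
    label, count = Counter(answers).most_common(1)[0]
    return label, count / len(answers)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within the group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Eight hypothetical sampled answers to one problem, with medium consistency.
answers = ["4", "4", "5", "4", "7", "5", "4", "5"]
label, consistency = majority_pseudo_label(answers)
rewards = [1.0 if a == label else 0.0 for a in answers]
advantages = group_relative_advantages(rewards)
print(label, consistency)
print([round(a, 3) for a in advantages])
```

Because the normalization rescales the binary rewards to roughly unit variance, the advantages here have magnitude close to 1 even though only half of the samples agree with the pseudo-label. If that pseudo-label is wrong, the update pushes just as hard in the wrong direction, which is one way to read the amplification effect the abstract reports.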

Learn Weightlessness: Imitate Non-Self-Stabilizing Motions on Humanoid Robot

学习失重：模仿类人机器人的非自稳定动作

Authors: Yucheng Xin, Jiacheng Bao, Haoran Yang, Wenqiang Que, Dong Wang, Junbo Tan, Xueqian Wang, Bin Zhao, Xuelong Li
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.21351
Pdf link: https://arxiv.org/pdf/2604.21351
Abstract The integration of imitation and reinforcement learning has enabled remarkable advances in humanoid whole-body control, facilitating diverse human-like behaviors. However, research on environment-dependent motions remains limited. Existing methods typically enforce rigid trajectory tracking while neglecting physical interactions with the environment. We observe that humans naturally exploit a "weightless" state during non-self-stabilizing (NSS) motions, selectively relaxing specific joints to allow passive body-environment contact, thereby stabilizing the body and completing the motion. Inspired by this biological mechanism, we design a weightlessness-state auto-labeling strategy for dataset annotation, and we propose the Weightlessness Mechanism (WM), a method that dynamically determines which joints to relax and to what level, together enabling effective environmental interaction while executing target motions. We evaluate our approach on 3 representative NSS tasks: sitting on chairs of varying heights, lying down on beds with different inclinations, and leaning against walls via shoulder or elbow. Extensive experiments in simulation and on the Unitree G1 robot demonstrate that our WM method, trained on single-action demonstrations without any task-specific tuning, achieves strong generalization across diverse environmental configurations while maintaining motion stability. Our work bridges the gap between precise trajectory tracking and adaptive environmental interaction, offering a biologically-inspired solution for contact-rich humanoid control.
中文摘要 模仿与强化学习的结合推动了类人生物全身控制的显著进展，促进了多样的类人行为。然而，关于环境依赖运动的研究仍然有限。现有方法通常强制执行刚性轨迹跟踪，忽视与环境的物理互动。我们观察到，人类在非自稳定（NSS）运动中自然利用“失重”状态——选择性放松特定关节以实现被动身体与环境接触，从而稳定身体并完成动作。受这一生物机制启发，我们设计了一种无重状态自动标记策略用于数据集注释;并提出了失重机制（WM），这是一种动态决定哪些关节应放松及放松程度的方法，共同实现环境交互，同时执行目标动作。我们评估了三种代表性的NSS任务：坐在不同高度的椅子上、躺在不同倾斜的床上，以及通过肩膀或肘部靠墙。在模拟和Unitree G1机器人上的大量实验表明，我们的WM方法基于单次动作演示训练，无需针对特定任务调整，能够在多种环境配置中实现强强的泛化，同时保持运动稳定性。我们的工作弥合了精确轨迹追踪与适应性环境交互之间的差距，提供了一种基于生物的丰富接触型人形控制解决方案。

ReaGeo: Reasoning-Enhanced End-to-End Geocoding with LLMs

ReaGeo：基于推理增强的端到端地理编码，采用大型语言模型

Authors: Jian Cui, Zhiyuan Ren, Desheng Weng, Yongqi Zhao, Gong Wenbin, Yu Lei, Zhenning Dong
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.21357
Pdf link: https://arxiv.org/pdf/2604.21357
Abstract This paper proposes ReaGeo, an end-to-end geocoding framework based on large language models, designed to overcome the limitations of traditional multi-stage approaches that rely on text or vector similarity retrieval over geographic databases, including workflow complexity, error propagation, and heavy dependence on structured geographic knowledge bases. The method converts geographic coordinates into geohash sequences, reformulating the coordinate prediction task as a text generation problem, and introduces a Chain-of-Thought mechanism to enhance the model's reasoning over spatial relationships. Furthermore, reinforcement learning with a distance-deviation-based reward is applied to optimize the generation accuracy. Comprehensive experiments show that ReaGeo can accurately handle explicit address queries in single-point predictions and effectively resolve vague relative location queries. In addition, the model demonstrates strong predictive capability for non-point geometric regions, highlighting its versatility and generalization ability in geocoding tasks.
中文摘要 本文提出了ReaGeo，一种基于大型语言模型的端到端地理编码框架，旨在克服依赖地理数据库文本或向量相似度检索的传统多阶段方法的局限性，包括工作流复杂性、错误传播以及对结构化地理知识库的高度依赖。该方法将地理坐标转换为地哈希序列，将坐标预测任务重新表述为文本生成问题，并引入了思维链机制，以增强模型对空间关系的推理能力。此外，采用基于距离偏差的奖励强化学习以优化生成精度。综合实验表明，ReaGeo能够准确处理单点预测中的显式地址查询，并有效解决模糊的相对位置查询。此外，该模型展现出对非点几何区域的强大预测能力，凸显其在地理编码任务中的多样性和泛化能力。
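
The reformulation in the abstract above, turning coordinates into geohash strings so that coordinate prediction becomes text generation, relies on the standard geohash encoding, sketched below. The encoder and decoder follow the usual public geohash scheme; the `distance_reward` function, its exponential shape, and the `scale_km` parameter are assumptions for illustration only, since the abstract says the reward is based on distance deviation but does not give its form.

```python
import math
from typing import Tuple

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def geohash_encode(lat: float, lon: float, precision: int = 8) -> str:
    """Interleave longitude/latitude bisection bits and pack 5 bits per character."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    chars, value, bit_count, use_lon = [], 0, 0, True
    while len(chars) < precision:
        bounds, coord = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (bounds[0] + bounds[1]) / 2
        if coord >= mid:
            value = (value << 1) | 1
            bounds[0] = mid
        else:
            value <<= 1
            bounds[1] = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(_BASE32[value])
            value, bit_count = 0, 0
    return "".join(chars)


def geohash_decode(code: str) -> Tuple[float, float]:
    """Return the (lat, lon) center of the cell named by a geohash string."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    use_lon = True
    for ch in code:
        value = _BASE32.index(ch)
        for shift in range(4, -1, -1):
            bounds = lon_range if use_lon else lat_range
            mid = (bounds[0] + bounds[1]) / 2
            if (value >> shift) & 1:
                bounds[0] = mid
            else:
                bounds[1] = mid
            use_lon = not use_lon
    return sum(lat_range) / 2, sum(lon_range) / 2


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi, dlmb = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def distance_reward(pred_code: str, true_lat: float, true_lon: float,
                    scale_km: float = 1.0) -> float:
    """Hypothetical reward in (0, 1]: decays with distance from the true point."""
    lat, lon = geohash_decode(pred_code)
    return math.exp(-haversine_km(lat, lon, true_lat, true_lon) / scale_km)


code = geohash_encode(57.64911, 10.40744, precision=6)
print(code)  # u4pruy
print(geohash_decode(code))
print(distance_reward(code, 57.64911, 10.40744))
```

A useful property for generation is that geohashes are hierarchical: a shorter prefix names a larger enclosing cell, so a model that gets the first characters right is already close in distance.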

KD-CVG: A Knowledge-Driven Approach for Creative Video Generation

KD-CVG：一种以知识为驱动的创意视频生成方法

Authors: Linkai Liu, Wei Feng, Xi Zhao, Shen Zhang, Xingye Chen, Zheng Zhang, Jingjing Lv, Junjie Shen, Ching Law, Yuchen Zhou, Zipeng Guo, Chao Gou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.21362
Pdf link: https://arxiv.org/pdf/2604.21362
Abstract Creative Generation (CG) leverages generative models to automatically produce advertising content that highlights product features, and it has been a significant focus of recent research. However, while CG has advanced considerably, most efforts have concentrated on generating advertising text and images, leaving Creative Video Generation (CVG) relatively underexplored. This gap is largely due to two major challenges faced by Text-to-Video (T2V) models: (a) ambiguous semantic alignment, where models struggle to accurately correlate product selling points with creative video content, and (b) inadequate motion adaptability, resulting in unrealistic movements and distortions. To address these challenges, we develop a comprehensive Advertising Creative Knowledge Base (ACKB) as a foundational resource and propose a knowledge-driven approach (KD-CVG) to overcome the knowledge limitations of existing models. KD-CVG consists of two primary modules: Semantic-Aware Retrieval (SAR) and Multimodal Knowledge Reference (MKR). SAR utilizes the semantic awareness of graph attention networks and reinforcement learning feedback to enhance the model's comprehension of the connections between selling points and creative videos. Building on this, MKR incorporates semantic and motion priors into the T2V model to address existing knowledge gaps. Extensive experiments have demonstrated KD-CVG's superior performance in achieving semantic alignment and motion adaptability, validating its effectiveness over other state-of-the-art methods. The code and dataset will be open source at this https URL.
中文摘要 创意生成（CG）利用生成模型自动生成突出产品特性的广告内容，这一领域成为近期研究的重要关注点。然而，尽管CG取得了显著进步，大多数努力仍集中于生成广告文本和图片，导致创意视频生成（CVG）相对缺乏充分探索。这一空白主要源于文本转视频（T2V）模型面临的两大挑战：（a）模糊语义对齐，模型难以准确关联产品销售点与创意视频内容;（b）运动适应性不足，导致不切实际的移动和扭曲。为应对这些挑战，我们开发了全面的广告创意知识库（ACKB）作为基础资源，并提出以知识驱动的方法（KD-CVG）克服现有模型的知识限制。KD-CVG由两个主要模块组成：语义感知检索（SAR）和多模态知识参考（MKR）。SAR利用图关注网络的语义意识和强化学习反馈，提升模型对卖点与创意视频联系的理解。基于此，MKR将语义和运动先验纳入T2V模型，以弥补现有知识缺口。大量实验已证明KD-CVG在实现语义对齐和运动适应性方面表现出优越，验证其优于其他先进方法的有效性。代码和数据集将开源于此 https URL。

S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images

S1-VL：带图像思考的科学多模态推理模型

Authors: Qingxiao Li, Lifeng Xu, QingLi Wang, Yudong Bai, Mingwei Ou, Shu Hu, Nan Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.21409
Pdf link: https://arxiv.org/pdf/2604.21409
Abstract We present S1-VL, a multimodal reasoning model for scientific domains that natively supports two complementary reasoning paradigms: Scientific Reasoning, which relies on structured chain-of-thought, and Thinking-with-Images, which enables the model to actively manipulate images through Python code execution during reasoning. In the Thinking-with-Images mode, the model generates and executes image-processing code in a sandbox environment, obtains intermediate visual results, and continues reasoning in a multi-turn iterative manner. This design is particularly effective for challenging scenarios such as high-resolution scientific chart interpretation, microscopic image understanding, and geometry-assisted reasoning. To construct the training data, we collect scientific multimodal datasets spanning six disciplines: mathematics, physics, chemistry, astronomy, geography, and biology. We further develop a six-dimensional quality filtering framework for reasoning trajectories. To mitigate redundant, ineffective, and erroneous visual operations commonly found in existing datasets, we propose a multi-stage filtering pipeline together with an adaptive data routing strategy. This strategy converts samples with low visual information gain into pure Reasoning-mode data, enabling the model to learn when image operations are truly necessary. S1-VL is trained through a four-stage progressive pipeline: scientific multimodal SFT, Thinking-with-Images cold-start SFT, and two stages of reinforcement learning with SAPO. We build S1-VL-32B on top of Qwen3-VL-32B-Thinking and evaluate it on 13 benchmarks. Experimental results show that S1-VL-32B achieves state-of-the-art performance on all five Thinking-with-Images benchmarks, including HRBench-4K, HRBench-8K, MME-RealWorld-CN, MME-RealWorld-Lite, and V*, and outperforms compared systems on scientific reasoning benchmarks such as Physics and VRSBench.
中文摘要 我们介绍了S1-VL，一个面向科学领域的多模态推理模型，原生支持两种互补的推理范式：依赖结构化思维链的科学推理，以及通过Python代码执行的图像思维，使模型能够在推理过程中主动操作图像。在“带图像思考”模式下，模型在沙盒环境中生成并执行图像处理代码，获得中间的视觉效果，并以多轮迭代方式继续推理。该设计在高分辨率科学图表解读、显微图像理解和几何辅助推理等具有挑战性的场景中尤为有效。为构建训练数据，我们收集涵盖数学、物理、化学、天文学、地理和生物学六大学科的科学多模态数据集。我们还进一步开发了一个用于推理轨迹的六维质量过滤框架。为减少现有数据集中常见的冗余、无效和错误视觉操作，我们提出了多阶段过滤流水线和自适应数据路由策略。该策略将视觉信息增益低的样本转换为纯推理模式数据，使模型能够学习何时图像操作真正必要。S1-VL通过四阶段渐进流程训练：科学多模态SFT、图像思维冷启动SFT，以及SAPO的两阶段强化学习。我们基于Qwen3-VL-32B-思维构建S1-VL-32B，并在13个基准测试中进行评估。实验结果显示，S1-VL-32B在所有五个Thinking-with-Images基准测试（包括HRBench-4K、HRBench-8K、MME-RealWorld-CN、MME-RealWorld-Lite和V*）上都达到了最先进的性能，并在科学推理基准测试如Physics和VRSBench上优于其他系统。

Dynamical Priors as a Training Objective in Reinforcement Learning

动力学先验作为强化学习的训练目标

Authors: Sukesh Subaharan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.21464
Pdf link: https://arxiv.org/pdf/2604.21464
Abstract Standard reinforcement learning (RL) optimizes policies for reward but imposes few constraints on how decisions evolve over time. As a result, policies may achieve high performance while exhibiting temporally incoherent behavior such as abrupt confidence shifts, oscillations, or degenerate inactivity. We introduce Dynamical Prior Reinforcement Learning (DP-RL), a training framework that augments policy gradient learning with an auxiliary loss derived from external state dynamics that implement evidence accumulation and hysteresis. Without modifying the reward, environment, or policy architecture, this prior shapes the temporal evolution of action probabilities during learning. Across three minimal environments, we show that dynamical priors systematically alter decision trajectories in task-dependent ways, promoting temporally structured behavior that cannot be explained by generic smoothing. These results demonstrate that training objectives alone can control the temporal geometry of decision-making in RL agents.
中文摘要 标准强化学习（RL）针对奖励优化策略，但对决策如何随时间演变几乎不加约束。因此，策略可能在取得高性能的同时，表现出时间上不连贯的行为，如置信度突变、振荡或退化性的不作为。我们提出了动力学先验强化学习（DP-RL），这是一种训练框架，用一项辅助损失来增强策略梯度学习，该损失源自实现证据积累和滞后效应的外部状态动力学。在不改变奖励、环境或策略架构的情况下，该先验塑造了学习过程中动作概率的时间演化。在三个极简环境中，我们展示了动力学先验以任务相关的方式系统性地改变决策轨迹，促成无法用通用平滑来解释的时间结构化行为。这些结果表明，仅靠训练目标即可控制强化学习智能体决策的时间几何结构。
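
Illustrative sketch (not from the paper): the abstract describes an auxiliary loss derived from external dynamics with evidence accumulation and hysteresis, added to a policy-gradient loss. One way this could look is shown below. Every name and constant here (`hysteresis_prior`, `leak`, the `on`/`off` thresholds, the squared-error form, the `weight`) is a hypothetical choice for illustration; the abstract does not specify the actual formulation.

```python
def hysteresis_prior(evidence, leak=0.9, on=0.6, off=0.4):
    """Leaky evidence accumulator with two-threshold hysteresis.

    Hypothetical stand-in for the 'external state dynamics': the binary
    state switches on above `on` and switches off only below `off`.
    """
    x, state, targets = 0.0, 0, []
    for e in evidence:
        x = leak * x + (1.0 - leak) * e  # evidence accumulation
        if state == 0 and x > on:
            state = 1
        elif state == 1 and x < off:
            state = 0
        targets.append(float(state))
    return targets


def auxiliary_loss(action_probs, targets):
    """Mean squared gap between policy probabilities and the prior trajectory."""
    return sum((p - t) ** 2 for p, t in zip(action_probs, targets)) / len(targets)


def total_loss(pg_loss, action_probs, evidence, weight=0.1):
    """Policy-gradient loss plus the weighted auxiliary term (reward untouched)."""
    return pg_loss + weight * auxiliary_loss(action_probs, hysteresis_prior(evidence))
```

In this sketch the prior only enters through the loss, matching the abstract's statement that the reward, environment, and policy architecture are left unchanged.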

X2-N: A Transformable Wheel-legged Humanoid Robot with Dual-mode Locomotion and Manipulation

X2-N：可变形轮腿类人机器人，具备双模式移动和操控功能

Authors: Yan Ning, Xingzhou Chen, Delong Li, Hao Zhang, Hanfu Gai, Tongyuan Li, Cheng Zhang, Zhihui Peng, Ling Shi
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.21541
Pdf link: https://arxiv.org/pdf/2604.21541
Abstract Wheel-legged robots combine the efficiency of wheeled locomotion with the versatility of legged systems, enabling rapid traversal over both continuous and discrete terrains. However, conventional designs typically employ fixed wheels as feet and limited degrees of freedom (DoFs) at the hips, resulting in reduced stability and mobility during legged locomotion compared to humanoids with flat feet. In addition, most existing platforms lack a full upper body with arms, which limits their ability to perform dexterous manipulation tasks. In this letter, we present X2-N, a high-DoF transformable robot with dual-mode locomotion and manipulation. X2-N can operate in both humanoid and wheel-legged forms and transform seamlessly between them through joint reconfiguration. We further propose a reinforcement learning (RL)-based whole-body control framework tailored to this morphology, enabling unified control across hybrid locomotion, transformation, and manipulation. We validate X2-N in a range of challenging locomotion and manipulation tasks, including dynamic skating-like motion, stair climbing and package delivery. Results demonstrate high locomotion efficiency, strong terrain adaptability, and stable loco-manipulation performance of X2-N, highlighting its potential for real-world deployment.
中文摘要 轮腿机器人结合了轮式移动的高效与腿式系统的多功能性，能够快速穿越连续和离散地形。然而，传统设计通常采用固定轮子作为足部，且髋部自由度（DoF）有限，导致其腿式移动时的稳定性和机动性不及采用平足的人形机器人。此外，大多数现有平台缺乏带手臂的完整上半身，这限制了它们执行灵巧操作任务的能力。在本文中，我们介绍了X2-N，一款高自由度的可变形机器人，具备双模式移动和操作能力。X2-N既能以人形形态也能以轮腿形态运行，并通过关节重构在两者之间无缝变形。我们还提出了针对该形态定制的基于强化学习（RL）的全身控制框架，实现对混合运动、变形和操作的统一控制。我们在一系列具有挑战性的移动和操作任务中验证了X2-N，包括类滑冰的动态运动、爬楼梯和包裹递送。结果显示X2-N具备高运动效率、强地形适应性和稳定的移动操作性能，凸显其在实际部署中的潜力。

Generative Learning Enhanced Intelligent Resource Management for Cell-Free Delay Deterministic Communications

生成式学习增强的智能资源管理，用于无蜂窝时延确定性通信

Authors: Shuangbo Xiong, Cheng Zhang, Wen Wang, Wenwu Yu, Yongming Huang
Subjects: Information Theory (cs.IT)
Arxiv link: https://arxiv.org/abs/2604.21587
Pdf link: https://arxiv.org/pdf/2604.21587
Abstract Cell-free multiple-input multiple-output (CF-MIMO) architecture significantly enhances wireless network performance, offering a promising solution for delay-sensitive applications. This paper investigates the resource allocation problem in CF-MIMO systems, aiming to maximize energy efficiency (EE) while satisfying a delay violation rate constraint. We design a Proximal Policy Optimization (PPO) algorithm with a primal-dual method to solve it. To address the low sample efficiency and safety risks caused by the cold start of the designed safe deep reinforcement learning (DRL) method, we propose a novel offline pretraining framework based on virtual constrained Markov decision process (CMDP) modeling. The virtual CMDP consists of a reward and cost prediction module, an initial-state distribution module, and a state transition module. Notably, we propose an evidence-aware conditional Gaussian Mixture Model (EA-CGMM) inference approach to mitigate data sparsity and distribution drift issues in state transition modeling. Simulation results demonstrate the effectiveness of CMDP modeling and validate the safety and efficiency of the proposed pretraining framework. Specifically, compared with a non-pretrained baseline, the agent pretrained through our proposed framework achieves twice the initial EE and maintains a low delay constraint violation rate of 1%, while ultimately converging to an EE that is 4.7% higher with a 50% reduction in exploration steps. Additionally, our proposed pretraining framework implementation exhibits comparable performance to the SOTA diffusion model-based implementation, while achieving a 14-fold reduction in computational complexity.
中文摘要 无蜂窝多输入多输出（CF-MIMO）架构显著提升了无线网络性能，为时延敏感应用提供了有前景的解决方案。本文研究CF-MIMO系统中的资源分配问题，旨在满足时延违规率约束的同时最大化能效（EE）。我们设计了一种结合原始对偶方法的近端策略优化（PPO）算法来求解该问题。为解决所设计的安全深度强化学习（DRL）方法因冷启动带来的低样本效率和安全风险，我们提出了一种基于虚拟约束马尔可夫决策过程（CMDP）建模的新型离线预训练框架。虚拟CMDP由奖励与成本预测模块、初始状态分布模块和状态转移模块组成。值得一提的是，我们提出了一种证据感知的条件高斯混合模型（EA-CGMM）推断方法，以缓解状态转移建模中的数据稀疏和分布漂移问题。仿真结果证明了CMDP建模的有效性，并验证了所提预训练框架的安全性和效率。具体来说，与未经预训练的基线相比，通过所提框架预训练的智能体的初始EE达到其两倍，并保持1%的较低时延约束违规率，最终收敛到的EE高出4.7%，且探索步数减少50%。此外，我们提出的预训练框架实现与基于扩散模型的SOTA实现性能相当，同时计算复杂度降低了14倍。
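
Illustrative sketch (not from the paper): combining PPO with a primal-dual method is commonly done by folding a cost term into the advantage through a Lagrange multiplier and updating that multiplier by projected gradient ascent on the constraint violation. The snippet below shows that generic pattern under stated assumptions; the function names, the learning rate, and the 0.01 cost limit are hypothetical and are not taken from the paper.

```python
def lagrangian_advantage(reward_adv, cost_adv, lam):
    """Primal side: advantage the PPO update would maximize, A_r - lam * A_c."""
    return [r - lam * c for r, c in zip(reward_adv, cost_adv)]


def dual_update(lam, mean_cost, cost_limit, lr=0.05):
    """Dual side: raise lam when the measured cost exceeds the limit,
    lower it otherwise, and project back onto lam >= 0."""
    return max(0.0, lam + lr * (mean_cost - cost_limit))


# Hypothetical usage: the cost is a delay-violation indicator averaged per batch.
lam = 0.0
for mean_violation_rate in [0.05, 0.03, 0.005]:
    lam = dual_update(lam, mean_violation_rate, cost_limit=0.01)
```

The multiplier grows while the violation rate stays above the limit and shrinks once the constraint is met, which is what trades energy efficiency against delay violations in this kind of scheme.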

AgenticQwen: Training Small Agentic Language Models with Dual Data Flywheels for Industrial-Scale Tool Use

AgenticQwen：利用双数据飞轮训练小型智能体语言模型，面向工业规模的工具使用

Authors: Yuanjie Lyu, Chengyu Wang, Haonan Zheng, Yuanhao Yue, Junbing Yan, Ming Wang, Jun Huang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.21590
Pdf link: https://arxiv.org/pdf/2604.21590
Abstract Modern industrial applications increasingly demand language models that act as agents, capable of multi-step reasoning and tool use in real-world settings. These tasks are typically performed under strict cost and latency constraints, making small agentic models highly desirable. In this paper, we introduce the AgenticQwen family of models, trained via multi-round reinforcement learning (RL) on synthetic data and a limited amount of open-source data. Our training framework combines reasoning RL and agentic RL with dual data flywheels that automatically generate increasingly challenging tasks. The reasoning flywheel increases task difficulty by learning from errors, while the agentic flywheel expands linear workflows into multi-branch behavior trees that better reflect the decision complexity of real-world applications. We validate AgenticQwen on public benchmarks and in an industrial agent system. The models achieve strong performance on multiple agentic benchmarks, and in our industrial agent system, close the gap with much larger models on search and data analysis tasks. Model checkpoints and part of the synthetic data: this https URL. Data synthesis and RL training code: this https URL. The data synthesis pipeline is also integrated into EasyDistill: this https URL.
中文摘要 现代工业应用越来越需要能够充当智能体的语言模型，在现实环境中进行多步推理和工具使用。这些任务通常在严格的成本和延迟约束下完成，使得小型智能体模型极具吸引力。本文介绍了AgenticQwen系列模型，它们通过多轮强化学习（RL）在合成数据和少量开源数据上训练。我们的训练框架将推理强化学习和智能体强化学习与双数据飞轮相结合，后者能自动生成难度不断提升的任务。推理飞轮通过从错误中学习来提升任务难度，而智能体飞轮则将线性工作流扩展为多分支行为树，更好地反映现实应用的决策复杂性。我们在公共基准测试和工业智能体系统中验证了AgenticQwen。这些模型在多个智能体基准测试上表现出色，并在我们的工业智能体系统中，在搜索和数据分析任务上缩小了与大得多的模型之间的差距。模型检查点和部分合成数据：this https URL。数据合成和强化学习训练代码：this https URL。数据合成流水线也已集成到 EasyDistill：this https URL。

Task-specific Subnetwork Discovery in Reinforcement Learning for Autonomous Underwater Navigation

自主水下导航强化学习中的任务特定子网络发现

Authors: Yi-Ling Liu, Melvin Laux, Mariela De Lucas Alvarez, Frank Kirchner, Rebecca Adam
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2604.21640
Pdf link: https://arxiv.org/pdf/2604.21640
Abstract Autonomous underwater vehicles are required to perform multiple tasks adaptively and in an explainable manner under dynamic, uncertain conditions and limited sensing, challenges that classical controllers struggle to address. This demands robust, generalizable, and inherently interpretable control policies for reliable long-term monitoring. Reinforcement learning, particularly multi-task RL, overcomes these limitations by leveraging shared representations to enable efficient adaptation across tasks and environments. However, while such policies show promising results in simulation and controlled experiments, they remain opaque and offer limited insight into the agent's internal decision-making, creating gaps in transparency, trust, and safety that hinder real-world deployment. The internal policy structure and task-specific specialization remain poorly understood. To address these gaps, we analyze the internal structure of a pretrained multi-task reinforcement learning network in the HoloOcean simulator for underwater navigation by identifying and comparing task-specific subnetworks responsible for navigating toward different species. We find that in a contextual multi-task reinforcement learning setting with related tasks, the network uses only about 1.5% of its weights to differentiate between tasks. Of these, approximately 85% connect the context-variable nodes in the input layer to the next hidden layer, highlighting the importance of context variables in such settings. Our approach provides insights into shared and specialized network components, useful for efficient model editing, transfer learning, and continual learning for underwater monitoring through a contextual multi-task reinforcement learning method.
中文摘要 自主水下航行器需要在动态、不确定的条件和有限的感知能力下，自适应且可解释地执行多项任务，而这些挑战是经典控制器难以应对的。这就需要鲁棒、可泛化且本质上可解释的控制策略，以实现可靠的长期监测。强化学习，尤其是多任务强化学习，通过利用共享表征实现跨任务和环境的高效适应，克服了这些局限。然而，尽管这些策略在仿真和受控实验中显示出有希望的结果，但它们仍然不透明，难以让人洞察智能体的内部决策，造成透明度、信任和安全方面的差距，阻碍了实际部署。其内部策略结构和任务特定的专门化仍缺乏深入理解。为弥补这些空白，我们分析了一个在HoloOcean仿真器中用于水下导航的预训练多任务强化学习网络的内部结构，识别并比较负责向不同物种导航的任务特定子网络。我们发现，在任务相关的情境多任务强化学习设定中，网络仅使用约1.5%的权重来区分任务。其中约85%将输入层中的上下文变量节点连接到下一个隐藏层，凸显了上下文变量在此类设定中的重要性。我们的方法提供了关于共享和专用网络组件的洞见，有助于通过情境多任务强化学习方法实现面向水下监测的高效模型编辑、迁移学习和持续学习。
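
Illustrative sketch (not from the paper): the two statistics quoted in the abstract (the share of weights that differ between task-specific subnetworks, and the share of those that sit on context-variable inputs) could be computed from per-task binary masks roughly as follows. How the masks are obtained is not stated in the abstract; the mask layout, the `context_weights` set, and all function names are hypothetical.

```python
def differing_weights(mask_a, mask_b):
    """Return the (layer, index) positions where two binary masks disagree."""
    return {
        (layer, i)
        for layer in mask_a
        for i, (a, b) in enumerate(zip(mask_a[layer], mask_b[layer]))
        if a != b
    }


def summarize(mask_a, mask_b, context_weights):
    """Fraction of all weights that differ, and the fraction of those
    differing weights that belong to `context_weights`."""
    diff = differing_weights(mask_a, mask_b)
    total = sum(len(v) for v in mask_a.values())
    frac_diff = len(diff) / total
    frac_context = len(diff & context_weights) / len(diff) if diff else 0.0
    return frac_diff, frac_context


# Toy masks for two tasks over two flattened layers (made-up values).
task_a = {"fc1": [1, 1, 0, 0], "fc2": [1, 0, 1, 0]}
task_b = {"fc1": [1, 0, 1, 0], "fc2": [1, 0, 1, 0]}
context = {("fc1", 1)}  # hypothetical: weights fed by context-variable inputs
frac_diff, frac_context = summarize(task_a, task_b, context)
```

On real masks, `frac_diff` and `frac_context` would correspond to the roughly 1.5% and 85% figures reported in the abstract.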

Fairness under uncertainty in sequential decisions

序贯决策中不确定性下的公平性

Authors: Michelle Seng Ah Lee, Kirtan Padh, David Watson, Niki Kilbertus, Jatinder Singh
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.21711
Pdf link: https://arxiv.org/pdf/2604.21711
Abstract Fair machine learning (ML) methods help identify and mitigate the risk that algorithms encode or automate social injustices. Algorithmic approaches alone cannot resolve structural inequalities, but they can support socio-technical decision systems by surfacing discriminatory biases, clarifying trade-offs, and enabling governance. Although fairness is well studied in supervised learning, many real ML applications are online and sequential, with prior decisions informing future ones. Each decision is taken under uncertainty due to unobserved counterfactuals and finite samples, with dire consequences for under-represented groups, systematically under-observed due to historical exclusion and selective feedback. A bank cannot know whether a denied loan would have been repaid, and may have less data on marginalized populations. This paper introduces a taxonomy of uncertainty in sequential decision-making -- model, feedback, and prediction uncertainty -- providing shared vocabulary for assessing systems where uncertainty is unevenly distributed across groups. We formalize model and feedback uncertainty via counterfactual logic and reinforcement learning, and illustrate harms to decision makers (unrealized gains/losses) and subjects (compounding exclusion, reduced access) of policies that ignore the unobserved space. Algorithmic examples show it is possible to reduce outcome variance for disadvantaged groups while preserving institutional objectives (e.g. expected utility). Experiments on data simulated with varying bias show how unequal uncertainty and selective feedback produce disparities, and how uncertainty-aware exploration alters fairness metrics. The framework equips practitioners to diagnose, audit, and govern fairness risks. Where uncertainty drives unfairness rather than incidental noise, accounting for it is essential to fair and effective decision-making.
中文摘要 公平机器学习（ML）方法有助于识别和减轻算法固化或自动化社会不公的风险。仅靠算法方法无法解决结构性不平等，但它们可以通过揭示歧视性偏见、澄清权衡和促进治理来支持社会技术决策系统。尽管公平性在监督学习中已有充分研究，但许多真实的机器学习应用是在线且序贯进行的，先前的决策会影响未来的决策。由于反事实结果无法观测且样本有限，每项决策都是在不确定性下做出的，这会给代表性不足的群体带来严重后果，而这些群体又因历史排斥和选择性反馈而系统性地观测不足。银行无法知道被拒绝的贷款是否本会被偿还，且对边缘化群体掌握的数据可能更少。本文引入了序贯决策中不确定性的分类法（模型、反馈和预测不确定性），为评估不确定性在各群体间分布不均的系统提供了共同词汇。我们通过反事实逻辑和强化学习对模型和反馈不确定性进行形式化，并说明忽视未观测空间的策略对决策者（未实现的收益/损失）和决策对象（不断累积的排斥、获取机会减少）造成的伤害。算法示例表明，在保持机构目标（如期望效用）的同时，可以降低弱势群体的结果方差。在以不同偏差程度模拟的数据上进行的实验显示了不均等的不确定性和选择性反馈如何产生差异，以及不确定性感知的探索如何改变公平性指标。该框架使从业者能够诊断、审计和治理公平性风险。当导致不公平的是不确定性而非偶然噪声时，将其纳入考量对于公平有效的决策至关重要。

Nemobot Games: Crafting Strategic AI Gaming Agents for Interactive Learning with Large Language Models

Nemobot 游戏：打造用于大型语言模型交互学习的战略AI游戏代理

Authors: Chee Wei Tan, Yuchen Wang, Shangxin Guo
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.21896
Pdf link: https://arxiv.org/pdf/2604.21896
Abstract This paper introduces a new paradigm for AI game programming, leveraging large language models (LLMs) to extend and operationalize Claude Shannon's taxonomy of game-playing machines. Central to this paradigm is Nemobot, an interactive agentic engineering environment that enables users to create, customize, and deploy LLM-powered game agents while actively engaging with AI-driven strategies. The LLM-based chatbot, integrated within Nemobot, demonstrates its capabilities across four distinct classes of games. For dictionary-based games, it compresses state-action mappings into efficient, generalized models for rapid adaptability. In rigorously solvable games, it employs mathematical reasoning to compute optimal strategies and generates human-readable explanations for its decisions. For heuristic-based games, it synthesizes strategies by combining insights from classical minimax algorithms (see, e.g., shannon1950chess) with crowd-sourced data. Finally, in learning-based games, it utilizes reinforcement learning with human feedback and self-critique to iteratively refine strategies through trial-and-error and imitation learning. Nemobot amplifies this framework by offering a programmable environment where users can experiment with tool-augmented generation and fine-tuning of strategic game agents. From strategic games to role-playing games, Nemobot demonstrates how AI agents can achieve a form of self-programming by integrating crowdsourced learning and human creativity to iteratively refine their own logic. This represents a step toward the long-term goal of self-programming AI.
中文摘要 本文引入了人工智能游戏编程的新范式，利用大型语言模型（LLM）扩展并操作克劳德·香农（Claude Shannon）的游戏机器分类法。Nemobot 是这一范式的核心，这是一种交互式代理工程环境，允许用户创建、定制并部署基于 LLM 的游戏代理，同时积极参与 AI 驱动的策略。基于 LLM 的聊天机器人集成在 Nemobot 中，展示了其在四类游戏中的强大能力。对于基于字典的游戏，它将状态-动作映射压缩为高效、通用的模型，实现快速适应性。在严谨可解的游戏中，它运用数学推理计算最优策略，并生成可读的决策解释。对于启发式游戏，它通过结合经典极小极大算法（参见 shannon1950chess）与众包数据的洞见来综合策略。最后，在基于学习的游戏中，它利用强化学习结合人类反馈和自我批评，通过试错和模仿学习迭代完善策略。Nemobot 通过提供可编程环境，让用户可以尝试工具增强生成和微调战略游戏代理，进一步扩展这一框架。从战略游戏到角色扮演游戏，Nemobot 展示了 AI 代理如何通过集成众包学习和人类创造力，实现一种自我编程，从而迭代完善自身逻辑。这代表了实现自我编程人工智能长期目标的一步。

Keyword: diffusion policy

There is no result